Info: (elisp) Non-ASCII in Strings


 
 2.3.8.2 Non-ASCII Characters in Strings
 .......................................
 
 There are two text representations for non-ASCII characters in Emacs
 strings: multibyte and unibyte (see Text Representations).  Roughly
 speaking, unibyte strings store raw bytes, while multibyte strings store
 human-readable text.  Each character in a unibyte string is a byte,
 i.e., its value is between 0 and 255.  By contrast, each character in a
 multibyte string may have a value between 0 and 4194303 (see Character
 Type).  In both cases, characters above 127 are non-ASCII.
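 
    As a minimal sketch of the distinction, the following builds one
 string of each representation from the same code, 224, and compares
 them.  The names 'my-raw-byte-string' and 'my-text-string' are
 hypothetical and chosen only for this illustration.

```elisp
;; Hypothetical names, for illustration only.
(defconst my-raw-byte-string (unibyte-string #xe0)
  "Unibyte string holding the single raw byte 224.")
(defconst my-text-string (string #xe0)
  "Multibyte string holding the single character with code 224.")

(multibyte-string-p my-raw-byte-string)   ; => nil
(multibyte-string-p my-text-string)       ; => t

;; Both strings have one element with the same value...
(aref my-raw-byte-string 0)               ; => 224
(aref my-text-string 0)                   ; => 224

;; ...but they occupy a different number of bytes internally.
(string-bytes my-raw-byte-string)         ; => 1
(string-bytes my-text-string)             ; => 2

;; The largest character code a multibyte string can hold.
(max-char)                                ; => 4194303
```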
 
    You can include a non-ASCII character in a string constant by writing
 it literally.  If the string constant is read from a multibyte source,
 such as a multibyte buffer or string, or a file that would be visited as
 multibyte, then Emacs reads each non-ASCII character as a multibyte
 character and automatically makes the string a multibyte string.  If the
 string constant is read from a unibyte source, then Emacs reads the
 non-ASCII character as unibyte, and makes the string unibyte.
 
    Instead of writing a character literally into a multibyte string, you
 can write it as its character code using an escape sequence.  See
 General Escape Syntax for details about escape sequences.
 
    If you use any Unicode-style escape sequence ‘\uNNNN’ or ‘\U00NNNNNN’
 in a string constant (even for an ASCII character), Emacs automatically
 assumes that it is multibyte.
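 
    For illustration, the sketch below checks two string constants
 written with Unicode-style escapes for a non-ASCII character.  The
 constant names are hypothetical; only the non-ASCII case is shown
 with results.

```elisp
;; Hypothetical names, for illustration only.
(defconst my-short-unicode-escape "\u00e0"
  "String written with the four-digit Unicode escape for code 224.")
(defconst my-long-unicode-escape "\U000000e0"
  "String written with the eight-digit Unicode escape for code 224.")

(multibyte-string-p my-short-unicode-escape)   ; => t
(multibyte-string-p my-long-unicode-escape)    ; => t

;; Each constant holds exactly one character, stored in two bytes.
(length my-short-unicode-escape)               ; => 1
(string-bytes my-short-unicode-escape)         ; => 2
(aref my-long-unicode-escape 0)                ; => 224
```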
 
    You can also use hexadecimal escape sequences (‘\xN’) and octal
 escape sequences (‘\N’) in string constants.  *But beware:* If a string
 constant contains hexadecimal or octal escape sequences, and these
 escape sequences all specify unibyte characters (i.e., less than 256),
 and there are no other literal non-ASCII characters or Unicode-style
 escape sequences in the string, then Emacs automatically assumes that it
 is a unibyte string.  That is to say, it assumes that all non-ASCII
 characters occurring in the string are 8-bit raw bytes.
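 
    The rule can be observed with a short sketch such as the one below,
 which writes the same code (224) as a hexadecimal and as an octal
 escape, and then combines a hexadecimal escape with a Unicode-style
 escape.  The constant names are hypothetical.

```elisp
;; Hypothetical names, for illustration only.
(defconst my-hex-escape-string "\xe0"
  "String written with a hexadecimal escape below 256.")
(defconst my-octal-escape-string "\340"
  "String written with the equivalent octal escape.")
(defconst my-mixed-escape-string "\xe0\u00e1"
  "String combining a hexadecimal escape with a Unicode-style escape.")

;; Only hex/octal escapes below 256: the string is unibyte.
(multibyte-string-p my-hex-escape-string)            ; => nil
(multibyte-string-p my-octal-escape-string)          ; => nil
(aref my-hex-escape-string 0)                        ; => 224
(equal my-hex-escape-string my-octal-escape-string)  ; => t

;; A Unicode-style escape elsewhere in the string makes it multibyte.
(multibyte-string-p my-mixed-escape-string)          ; => t
```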
 
    In hexadecimal and octal escape sequences, the escaped character code
 may contain a variable number of digits, so the first subsequent
 character which is not a valid hexadecimal or octal digit terminates the
 escape sequence.  If the next character in a string could be interpreted
 as a hexadecimal or octal digit, write ‘\ ’ (backslash and space) to
 terminate the escape sequence.  For example, ‘\xe0\ ’ represents one
 character, ‘a’ with grave accent.  ‘\ ’ in a string constant is just
 like backslash-newline; it does not contribute any character to the
 string, but it does terminate any preceding hex escape.
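 
    A small sketch of why the terminator matters: without '\ ', a
 following 'b' is absorbed into the hexadecimal escape as one more
 digit.  The constant names are hypothetical.

```elisp
;; Hypothetical names, for illustration only.
(defconst my-terminated-escape "\xe0\ b"
  "Hex escape e0, explicitly terminated, followed by the letter b.")
(defconst my-run-on-escape "\xe0b"
  "Hex escape that absorbs the b as a third hexadecimal digit.")

;; With the terminator there are two elements: 224 and ?b.
(length my-terminated-escape)     ; => 2
(aref my-terminated-escape 0)     ; => 224
(aref my-terminated-escape 1)     ; => 98

;; Without it there is a single character with code #xe0b.
(length my-run-on-escape)         ; => 1
(aref my-run-on-escape 0)         ; => 3595

;; The terminator itself contributes no character.
(length "\xe0\ ")                 ; => 1
```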