Info: (elisp) Coding System Basics

 
 32.10.1 Basic Concepts of Coding Systems
 ----------------------------------------
 
 “Character code conversion” involves conversion between the internal
 representation of characters used inside Emacs and some other encoding.
 Emacs supports many different encodings, in that it can convert to and
 from them.  For example, it can convert text to or from encodings such
 as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and several variants of
 ISO 2022.  In some cases, Emacs supports several alternative encodings
 for the same characters; for example, there are three coding systems for
 the Cyrillic (Russian) alphabet: ISO, Alternativnyj, and KOI8.
 
    Every coding system specifies a particular set of character code
 conversions, but the coding system ‘undecided’ is special: it leaves the
 choice unspecified, to be chosen heuristically for each file, based on
 the file’s data.  The coding system ‘prefer-utf-8’ is like ‘undecided’,
 but it prefers to choose ‘utf-8’ when possible.
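
    As a rough illustration, the following sketch decodes the same bytes
 with both coding systems, assuming the input is valid UTF-8.  The
 variable name ‘my-utf8-bytes’ is hypothetical.  What ‘undecided’ picks
 depends on the detection priorities in effect, so no result is shown for
 it.

```elisp
;; Hypothetical variable: the two UTF-8 bytes of U+00E9
;; (LATIN SMALL LETTER E WITH ACUTE), written as a unibyte string.
(defvar my-utf8-bytes "\303\251")

;; `prefer-utf-8' detects the encoding from the data but tries UTF-8
;; first, so valid UTF-8 input is decoded as UTF-8.
(decode-coding-string my-utf8-bytes 'prefer-utf-8)

;; `undecided' leaves the choice to the detection heuristics alone; the
;; outcome may vary with the current language environment.
(decode-coding-string my-utf8-bytes 'undecided)
```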
 
    In general, a coding system doesn’t guarantee roundtrip identity:
 decoding a byte sequence using a coding system, then encoding the
 resulting text in the same coding system, can produce a different byte
 sequence.  But some coding systems do guarantee that the byte sequence
 will be the same as what you originally decoded.  Here are a few
 examples:
 
      iso-8859-1, utf-8, big5, shift_jis, euc-jp
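
    One way to check the roundtrip property for a particular byte
 sequence is sketched below.  The helper ‘my-roundtrip-bytes-p’ is a
 hypothetical name, not part of Emacs; it only tests the bytes it is
 given and proves nothing about a coding system in general.

```elisp
(defun my-roundtrip-bytes-p (bytes coding-system)
  "Return non-nil if BYTES survive a decode/encode cycle in CODING-SYSTEM.
BYTES should be a unibyte string."
  (let ((text (decode-coding-string bytes coding-system)))
    ;; Re-encode the decoded text and compare with the original bytes.
    (equal (encode-coding-string text coding-system) bytes)))

;; The UTF-8 encoding of U+00E9, checked against `utf-8'.
(my-roundtrip-bytes-p "\303\251" 'utf-8)

;; The Latin-1 encoding of the same character, checked against
;; `iso-8859-1'.
(my-roundtrip-bytes-p "\351" 'iso-8859-1)
```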
 
    Encoding buffer text and then decoding the result can also fail to
 reproduce the original text.  For instance, if you encode a character
 with a coding system which does not support that character, the result
 is unpredictable, and thus decoding it using the same coding system may
 produce a different text.  Currently, Emacs can’t report errors that
 result from encoding unsupported characters.
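
    Because encoding itself does not report the problem, a program can
 look for unsupported characters before encoding.  The sketch below
 uses ‘unencodable-char-position’ for that; the wrapper
 ‘my-safe-encode’ is a hypothetical helper, and signaling an error is
 just one possible policy.

```elisp
(defun my-safe-encode (string coding-system)
  "Encode STRING with CODING-SYSTEM, refusing unencodable characters.
Signal an error if STRING has a character CODING-SYSTEM cannot encode."
  ;; With a string as the fifth argument, the start and end arguments
  ;; are indices into that string.
  (let ((pos (unencodable-char-position 0 (length string)
                                        coding-system nil string)))
    (if pos
        (error "Character at index %d cannot be encoded in %s"
               pos coding-system)
      (encode-coding-string string coding-system))))

;; U+00E9 is part of Latin 1, so this call encodes normally.
(my-safe-encode (string #xE9) 'iso-8859-1)

;; U+03BB (GREEK SMALL LETTER LAMDA) is not part of Latin 1, so a call
;; like the following would signal an error:
;; (my-safe-encode (string #x3BB) 'iso-8859-1)
```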
 
    “End of line conversion” handles three different conventions used on
 various systems for representing end of line in files.  The Unix
 convention, used on GNU and Unix systems, is to use the linefeed
 character (also called newline).  The DOS convention, used on MS-Windows
 and MS-DOS systems, is to use a carriage-return and a linefeed at the
 end of a line.  The Mac convention is to use just carriage-return.
 (This was the convention used in Classic Mac OS.)
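
    The three conventions can be observed by encoding the same text with
 coding systems that fix the end-of-line conversion.  This is a small
 sketch using the ‘utf-8’ family; the sample text is arbitrary.

```elisp
;; Unix convention: newline stays a single linefeed.
(encode-coding-string "one\ntwo\n" 'utf-8-unix)    ; ⇒ "one\ntwo\n"

;; DOS convention: each newline becomes carriage-return plus linefeed.
(encode-coding-string "one\ntwo\n" 'utf-8-dos)     ; ⇒ "one\r\ntwo\r\n"

;; Mac convention: each newline becomes a lone carriage-return.
(encode-coding-string "one\ntwo\n" 'utf-8-mac)     ; ⇒ "one\rtwo\r"

;; Decoding reverses the conversion.
(decode-coding-string "one\r\ntwo\r\n" 'utf-8-dos) ; ⇒ "one\ntwo\n"
```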
 
    “Base coding systems” such as ‘latin-1’ leave the end-of-line
 conversion unspecified, to be chosen based on the data.  “Variant coding
 systems” such as ‘latin-1-unix’, ‘latin-1-dos’ and ‘latin-1-mac’ specify
 the end-of-line conversion explicitly as well.  Most base coding systems
 have three corresponding variants whose names are formed by adding
 ‘-unix’, ‘-dos’ and ‘-mac’.
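
    The relationship between a base coding system and its variants can
 be inspected from Lisp.  The sketch below relies on
 ‘coding-system-eol-type’ and ‘coding-system-change-eol-conversion’,
 which are not described in this node; the name returned for a derived
 variant may be an alias of the one you expect, so no result is shown for
 the last form.

```elisp
;; Variants report their end-of-line conversion as an integer:
;; 0 for Unix, 1 for DOS, 2 for Mac.
(coding-system-eol-type 'latin-1-unix)      ; ⇒ 0
(coding-system-eol-type 'latin-1-dos)       ; ⇒ 1
(coding-system-eol-type 'latin-1-mac)       ; ⇒ 2

;; A base coding system leaves the conversion open; the value is then
;; a vector holding its three variants.
(vectorp (coding-system-eol-type 'latin-1)) ; ⇒ t

;; Derive the DOS variant of a base coding system.
(coding-system-change-eol-conversion 'latin-1 'dos)
```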
 
    The coding system ‘raw-text’ is special in that it prevents character
 code conversion, and causes the buffer visited with this coding system
 to be a unibyte buffer.  For historical reasons, you can save both
 unibyte and multibyte text with this coding system.  When you use
 ‘raw-text’ to encode multibyte text, it does perform one character code
 conversion: it converts eight-bit characters to their single-byte
 external representation.  ‘raw-text’ does not specify the end-of-line
 conversion, allowing that to be determined as usual by the data, and has
 the usual three variants which specify the end-of-line conversion.
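
    The one conversion that ‘raw-text’ performs when encoding can be
 seen in the following sketch.  It assumes the multibyte text holds a
 single eight-bit character, built here with ‘string-to-multibyte’; the
 function name ‘my-raw-text-demo’ is hypothetical.

```elisp
(defun my-raw-text-demo ()
  "Encode one eight-bit character with `raw-text-unix'.
Return a list of three booleans describing the input and output."
  (let* (;; Multibyte string holding the eight-bit character for
         ;; byte #xE9 (octal 351).
         (multibyte (string-to-multibyte "\351"))
         (encoded (encode-coding-string multibyte 'raw-text-unix)))
    (list (multibyte-string-p multibyte)   ; the input is multibyte
          (multibyte-string-p encoded)     ; the output is unibyte
          (equal encoded "\351"))))        ; the output is the single byte

(my-raw-text-demo)                         ; ⇒ (t nil t)
```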
 
    ‘no-conversion’ (and its alias ‘binary’) is equivalent to
 ‘raw-text-unix’: it specifies no conversion of either character codes or
 end-of-line.
 
    The coding system ‘utf-8-emacs’ specifies that the data is
 represented in the internal Emacs encoding (see Text
 Representations).  This is like ‘raw-text’ in that no code conversion
 happens, but different in that the result is multibyte data.  The name
 ‘emacs-internal’ is an alias for ‘utf-8-emacs’.
 
  -- Function: coding-system-get coding-system property
      This function returns the specified property of the coding system
      CODING-SYSTEM.  Most coding system properties exist for internal
      purposes, but one that you might find useful is ‘:mime-charset’.
      That property’s value is the name used in MIME for the character
      coding which this coding system can read and write.  Examples:
 
           (coding-system-get 'iso-latin-1 :mime-charset)
                ⇒ iso-8859-1
           (coding-system-get 'iso-2022-cn :mime-charset)
                ⇒ iso-2022-cn
           (coding-system-get 'cyrillic-koi8 :mime-charset)
                ⇒ koi8-r
 
      The value of the ‘:mime-charset’ property is also defined as an
      alias for the coding system.
 
  -- Function: coding-system-aliases coding-system
      This function returns the list of aliases of CODING-SYSTEM.
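
      As a sketch of how this can be used, the hypothetical helper
      ‘my-coding-system-alias-p’ below checks whether one name appears
      among the aliases of another.  The calls test the aliases
      mentioned earlier in this node.

```elisp
(defun my-coding-system-alias-p (alias coding-system)
  "Return t if ALIAS is listed among the aliases of CODING-SYSTEM."
  (and (memq alias (coding-system-aliases coding-system)) t))

;; `binary' is an alias for `no-conversion'.
(my-coding-system-alias-p 'binary 'no-conversion)

;; `emacs-internal' is an alias for `utf-8-emacs'.
(my-coding-system-alias-p 'emacs-internal 'utf-8-emacs)

;; The `:mime-charset' value of `cyrillic-koi8' is also defined as an
;; alias of that coding system.
(my-coding-system-alias-p 'koi8-r 'cyrillic-koi8)
```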