emacs: Recognize Coding

 
 22.6 Recognizing Coding Systems
 ===============================
 
 Whenever Emacs reads a given piece of text, it tries to recognize which
 coding system to use.  This applies to files being read, output from
 subprocesses, text from X selections, etc.  Emacs can select the right
 coding system automatically most of the time—once you have specified
 your preferences.
 
    Some coding systems can be recognized or distinguished by which byte
 sequences appear in the data.  However, there are coding systems that
 cannot be distinguished, not even potentially.  For example, there is no
 way to distinguish between Latin-1 and Latin-2; they use the same byte
 values with different meanings.
 
    Emacs handles this situation by means of a priority list of coding
 systems.  Whenever Emacs reads a file, if you do not specify the coding
 system to use, Emacs checks the data against each coding system,
 starting with the first in priority and working down the list, until it
 finds a coding system that fits the data.  Then it converts the file
 contents assuming that they are represented in this coding system.
 
    The priority list of coding systems depends on the selected language
 environment (SeeLanguage Environments).  For example, if you use
 French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use
 Czech, you probably want Latin-2 to be preferred.  This is one of the
 reasons to specify a language environment.
 
    However, you can alter the coding system priority list in detail with
 the command ‘M-x prefer-coding-system’.  This command reads the name of
 a coding system from the minibuffer, and adds it to the front of the
 priority list, so that it is preferred to all others.  If you use this
 command several times, each use adds one element to the front of the
 priority list.
 
    If you use a coding system that specifies the end-of-line conversion
 type, such as ‘iso-8859-1-dos’, what this means is that Emacs should
 attempt to recognize ‘iso-8859-1’ with priority, and should use DOS
 end-of-line conversion when it does recognize ‘iso-8859-1’.
 
    Sometimes a file name indicates which coding system to use for the
 file.  The variable ‘file-coding-system-alist’ specifies this
 correspondence.  There is a special function
 ‘modify-coding-system-alist’ for adding elements to this list.  For
 example, to read and write all ‘.txt’ files using the coding system
 ‘chinese-iso-8bit’, you can execute this Lisp expression:
 
      (modify-coding-system-alist 'file "\\.txt\\'" 'chinese-iso-8bit)
 
 The first argument should be ‘file’, the second argument should be a
 regular expression that determines which files this applies to, and the
 third argument says which coding system to use for these files.
 
    Emacs recognizes which kind of end-of-line conversion to use based on
 the contents of the file: if it sees only carriage-returns, or only
 carriage-return linefeed sequences, then it chooses the end-of-line
 conversion accordingly.  You can inhibit the automatic use of
 end-of-line conversion by setting the variable ‘inhibit-eol-conversion’
 to non-‘nil’.  If you do that, DOS-style files will be displayed with
 the ‘^M’ characters visible in the buffer; some people prefer this to
 the more subtle ‘(DOS)’ end-of-line type indication near the left edge
 of the mode line (Seeeol-mnemonic Mode Line.).
 
    By default, the automatic detection of coding system is sensitive to
 escape sequences.  If Emacs sees a sequence of characters that begin
 with an escape character, and the sequence is valid as an ISO-2022 code,
 that tells Emacs to use one of the ISO-2022 encodings to decode the
 file.
 
    However, there may be cases that you want to read escape sequences in
 a file as is.  In such a case, you can set the variable
 ‘inhibit-iso-escape-detection’ to non-‘nil’.  Then the code detection
 ignores any escape sequences, and never uses an ISO-2022 encoding.  The
 result is that all escape sequences become visible in the buffer.
 
    The default value of ‘inhibit-iso-escape-detection’ is ‘nil’.  We
 recommend that you not change it permanently, only for one specific
 operation.  That’s because some Emacs Lisp source files in the Emacs
 distribution contain non-ASCII characters encoded in the coding system
 ‘iso-2022-7bit’, and they won’t be decoded correctly when you visit
 those files if you suppress the escape sequence detection.
 
    The variables ‘auto-coding-alist’ and ‘auto-coding-regexp-alist’ are
 the strongest way to specify the coding system for certain patterns of
 file names, or for files containing certain patterns, respectively.
 These variables even override ‘-*-coding:-*-’ tags in the file itself
 (SeeSpecify Coding).  For example, Emacs uses ‘auto-coding-alist’
 for tar and archive files, to prevent it from being confused by a
 ‘-*-coding:-*-’ tag in a member of the archive and thinking it applies
 to the archive file as a whole.
 
    Another way to specify a coding system is with the variable
 ‘auto-coding-functions’.  For example, one of the builtin
 ‘auto-coding-functions’ detects the encoding for XML files.  Unlike the
 previous two, this variable does not override any ‘-*-coding:-*-’ tag.