gawk: Gory Details

 
 9.1.3.1 More about '\' and '&' with 'sub()', 'gsub()', and 'gensub()'
 .....................................................................
 
      CAUTION: This subsubsection has been reported to cause headaches.
      You might want to skip it upon first reading.
 
    When using 'sub()', 'gsub()', or 'gensub()', and trying to get
 literal backslashes and ampersands into the replacement text, you need
 to remember that there are several levels of "escape processing" going
 on.
 
    First, there is the "lexical" level, which is when 'awk' reads your
 program and builds an internal copy of it to execute.  Then there is the
 runtime level, which is when 'awk' actually scans the replacement string
 to determine what to generate.
 
    At both levels, 'awk' looks for a defined set of characters that can
 come after a backslash.  At the lexical level, it looks for the escape
 sequences listed in Escape Sequences.  Thus, for every '\' that
 'awk' processes at the runtime level, you must type two backslashes at
 the lexical level.  When a character that is not valid for an escape
 sequence follows the '\', BWK 'awk' and 'gawk' both simply remove the
 initial '\' and put the next character into the string.  Thus, for
 example, '"a\qb"' is treated as '"aqb"'.
 
    At the runtime level, the various functions handle sequences of '\'
 and '&' differently.  The situation is (sadly) somewhat complex.
 Historically, the 'sub()' and 'gsub()' functions treated the
 two-character sequence '\&' specially; this sequence was replaced in the
 generated text with a single '&'.  Any other '\' within the REPLACEMENT
 string that did not precede an '&' was passed through unchanged.  This
 is illustrated in Table 9.1.
 
       You type         'sub()' sees          'sub()' generates
       --------         ------------          -----------------
           '\&'              '&'            The matched text
          '\\&'             '\&'            A literal '&'
         '\\\&'             '\&'            A literal '&'
        '\\\\&'            '\\&'            A literal '\&'
       '\\\\\&'            '\\&'            A literal '\&'
      '\\\\\\&'           '\\\&'            A literal '\\&'
          '\\q'             '\q'            A literal '\q'
 
 Table 9.1: Historical escape sequence processing for 'sub()' and
 'gsub()'
 
 This table shows the lexical-level processing, where an odd number of
 backslashes yields the same runtime string as the next-lower even
 number (the unpaired '\' is simply dropped), as well as the runtime
 processing done by 'sub()'.  (For the sake of simplicity, the remaining
 tables show only the case of even numbers of backslashes entered at the
 lexical level.)
 
    The problem with the historical approach is that there is no way to
 get a literal '\' followed by the matched text.
 
    Several editions of the POSIX standard attempted to fix this problem
 but weren't successful.  The details are irrelevant at this point in
 time.
 
    At one point, the 'gawk' maintainer submitted proposed text for a
 revised standard that reverts to rules that correspond more closely to
 the original existing practice.  The proposed rules have special cases
 that make it possible to produce a '\' preceding the matched text.  This
 is shown in Table 9.2.
 
       You type         'sub()' sees         'sub()' generates
       --------         ------------         -----------------
      '\\\\\\&'           '\\\&'            A literal '\&'
        '\\\\&'            '\\&'            A literal '\', followed by the matched text
          '\\&'             '\&'            A literal '&'
          '\\q'             '\q'            A literal '\q'
          '\\\\'             '\\'            A literal '\\'
 
 Table 9.2: 'gawk' rules for 'sub()' and backslash
 
    In a nutshell, at the runtime level, there are now three special
 sequences of characters ('\\\&', '\\&', and '\&') whereas historically
 there was only one.  However, as in the historical case, any '\' that is
 not part of one of these three sequences is not special and appears in
 the output literally.
 
    'gawk' 3.0 and 3.1 follow these rules for 'sub()' and 'gsub()'.  The
 POSIX standard took much longer to be revised than was expected.  In
 addition, the 'gawk' maintainer's proposal was lost during the
 standardization process.  The final rules are somewhat simpler.  The
 results are similar except for one case.
 
    The POSIX rules state that '\&' in the replacement string produces a
 literal '&', '\\' produces a literal '\', and '\' followed by anything
 else is not special; the '\' is placed straight into the output.  These
 rules are presented in Table 9.3.
 
       You type         'sub()' sees         'sub()' generates
       --------         ------------         -----------------
      '\\\\\\&'           '\\\&'            A literal '\&'
        '\\\\&'            '\\&'            A literal '\', followed by the matched text
          '\\&'             '\&'            A literal '&'
          '\\q'             '\q'            A literal '\q'
          '\\\\'             '\\'            A literal '\'
 
 Table 9.3: POSIX rules for 'sub()' and 'gsub()'
 
    The only case where the difference is noticeable is the last one:
 '\\\\' is seen as '\\' and produces '\' instead of '\\'.
 
    Starting with version 3.1.4, 'gawk' followed the POSIX rules when
 '--posix' was specified (see Options).  Otherwise, it continued to
 follow the proposed rules, as that had been its behavior for many years.
 
    When version 4.0.0 was released, the 'gawk' maintainer made the POSIX
 rules the default, breaking well over a decade's worth of backward
 compatibility.(1)  Needless to say, this was a bad idea, and as of
 version 4.0.1, 'gawk' resumed its historical behavior, and only follows
 the POSIX rules when '--posix' is given.
 
    The rules for 'gensub()' are considerably simpler.  At the runtime
 level, whenever 'gawk' sees a '\', if the following character is a
 digit, then the text that matched the corresponding parenthesized
 subexpression is placed in the generated output.  Otherwise, no matter
 what character follows the '\', it appears in the generated text and the
 '\' does not, as shown in Table 9.4.
 
        You type          'gensub()' sees         'gensub()' generates
        --------          ---------------         --------------------
            '&'                    '&'            The matched text
          '\\&'                   '\&'            A literal '&'
         '\\\\'                   '\\'            A literal '\'
        '\\\\&'                  '\\&'            A literal '\', then the matched text
      '\\\\\\&'                 '\\\&'            A literal '\&'
          '\\q'                   '\q'            A literal 'q'
 
 Table 9.4: Escape sequence processing for 'gensub()'
 
    Because of the complexity of the lexical- and runtime-level
 processing and the special cases for 'sub()' and 'gsub()', we recommend
 the use of 'gawk' and 'gensub()' when you have to do substitutions.
 
    ---------- Footnotes ----------
 
    (1) This was rather naive of him, despite there being a note in this
 section indicating that the next major version would move to the POSIX
 rules.