gawk: GNU Regexp Operators

 
 3.7 'gawk'-Specific Regexp Operators
 ====================================
 
 GNU software that deals with regular expressions provides a number of
 additional regexp operators.  These operators are described in this
 minor node and are specific to 'gawk'; they are not available in other
 'awk' implementations.  Most of the additional operators deal with word
 matching.  For our purposes, a "word" is a sequence of one or more
 letters, digits, or underscores ('_'):
 
 '\s'
      Matches any whitespace character.  Think of it as shorthand for
      '[[:space:]]'.
 
 '\S'
      Matches any character that is not whitespace.  Think of it as
      shorthand for '[^[:space:]]'.
 
 '\w'
      Matches any word-constituent character--that is, it matches any
      letter, digit, or underscore.  Think of it as shorthand for
      '[[:alnum:]_]'.
 
 '\W'
      Matches any character that is not word-constituent.  Think of it as
      shorthand for '[^[:alnum:]_]'.
 
 '\<'
      Matches the empty string at the beginning of a word.  For example,
      '/\<away/' matches 'away' but not 'stowaway'.
 
 '\>'
      Matches the empty string at the end of a word.  For example,
      '/stow\>/' matches 'stow' but not 'stowaway'.
 
 '\y'
      Matches the empty string at either the beginning or the end of a
      word (i.e., the word boundar*y*).  For example, '\yballs?\y'
      matches either 'ball' or 'balls', as a separate word.
 
 '\B'
      Matches the empty string that occurs between two word-constituent
      characters.  For example, '/\Brat\B/' matches 'crate', but it does
      not match 'dirty rat'.  '\B' is essentially the opposite of '\y'.
 
    There are two other operators that work on buffers.  In Emacs, a
 "buffer" is, naturally, an Emacs buffer.  Other GNU programs, including
 'gawk', treat the entire string being matched as the buffer.  The
 operators are:
 
 '\`'
      Matches the empty string at the beginning of a buffer (string).
 
 '\''
      Matches the empty string at the end of a buffer (string).
 
    Because '^' and '$' always work in terms of the beginning and end of
 strings, these operators don't add any new capabilities for 'awk'.  They
 are provided for compatibility with other GNU software.
 
    In other GNU software, the word-boundary operator is '\b'.  However,
 that conflicts with the 'awk' language's definition of '\b' as
 backspace, so 'gawk' uses a different letter.  An alternative method
 would have been to require two backslashes in the GNU operators, but
 this was deemed too confusing.  The current method of using '\y' for the
 GNU '\b' appears to be the lesser of two evils.
 
    The various command-line options (see Options) control how 'gawk'
 interprets characters in regexps:
 
 No options
      In the default case, 'gawk' provides all the facilities of POSIX
      regexps and the GNU regexp operators described in Regexp
      Operators.
 
 '--posix'
      Match only POSIX regexps; the GNU operators are not special (e.g.,
      '\w' matches a literal 'w').  Interval expressions are allowed.
 
 '--traditional'
      Match traditional Unix 'awk' regexps.  The GNU operators are not
      special, and interval expressions are not available.  Because BWK
      'awk' supports them, the POSIX character classes ('[[:alnum:]]',
      etc.) are available.  Characters described by octal and
      hexadecimal escape sequences are treated literally, even if they
      represent regexp metacharacters.
 
 '--re-interval'
      Allow interval expressions in regexps, if '--traditional' has been
      provided.  Otherwise, interval expressions are available by
      default.