gawk: Regexp Operators

 
 3.3 Regular Expression Operators
 ================================
 
 You can combine regular expressions with special characters, called
 "regular expression operators" or "metacharacters", to increase the
 power and versatility of regular expressions.
 
    The escape sequences described in Escape Sequences are valid
 inside a regexp.  They are introduced by a '\' and are recognized and
 converted into corresponding real characters as the very first step in
 processing regexps.
 
    Here is a list of metacharacters.  All characters that are not escape
 sequences and that are not listed here stand for themselves:
 
 '\'
      This suppresses the special meaning of a character when matching.
      For example, '\$' matches the character '$'.
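
      As a minimal sketch of this escape, assuming a POSIX-compatible
      'awk' is on the search path (the sample string is made up for
      illustration), the following shell fragment contrasts '\$' with
      an unescaped '$':

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    s = "cost: $5"
    a = (s ~ /\$/)    # 1: the escaped $ matches a literal dollar sign
    b = (s ~ /t$/)    # 0: a bare $ is an anchor, and s does not end in t
    print a, b
}')
echo "$out"
```

      This should print '1 0'.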
 
 '^'
      This matches the beginning of a string.  '^@chapter' matches
      '@chapter' at the beginning of a string, for example, and can be
      used to identify chapter beginnings in Texinfo source files.  The
      '^' is known as an "anchor", because it anchors the pattern to
      match only at the beginning of the string.
 
      It is important to realize that '^' does not match the beginning of
      a line (the point right after a '\n' newline character) embedded in
      a string.  The condition is not true in the following example:
 
           if ("line1\nLINE 2" ~ /^L/) ...
 
 '$'
      This is similar to '^', but it matches only at the end of a string.
      For example, 'p$' matches a record that ends with a 'p'.  The '$'
      is an anchor and does not match the end of a line (the point right
      before a '\n' newline character) embedded in a string.  The
      condition in the following example is not true:
 
           if ("line1\nLINE 2" ~ /1$/) ...
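
      The behavior of both anchors on a string with an embedded
      newline can be checked with a short sketch.  It assumes a
      POSIX-compatible 'awk' on the search path; the two extra tests on
      the real start and end of the string are added for contrast and
      are not part of the examples above:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    a = (s ~ /^L/)    # 0: the L follows a newline, not the string start
    b = (s ~ /1$/)    # 0: the 1 precedes a newline, not the string end
    c = (s ~ /^l/)    # 1: the string really starts with l
    d = (s ~ /2$/)    # 1: the string really ends with 2
    print a, b, c, d
}')
echo "$out"
```

      This should print '0 0 1 1'.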
 
 '.' (period)
      This matches any single character, _including_ the newline
      character.  For example, '.P' matches any single character followed
      by a 'P' in a string.  Using concatenation, we can make a regular
      expression such as 'U.A', which matches any three-character
      sequence that begins with 'U' and ends with 'A'.
 
      In strict POSIX mode (see Options), '.' does not match the NUL
      character, which is a character with all bits equal to zero.
      Otherwise, NUL is just another character.  Other versions of 'awk'
      may not be able to match the NUL character.
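
      A small sketch of '.' in action, assuming a POSIX-compatible
      'awk' on the search path.  The '^' and '$' anchors are added here
      only so that the whole (made-up) test string has to match:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    a = ("a\nb" ~ /^a.b$/)   # 1: the dot matches the newline
    b = ("UXA" ~ /^U.A$/)    # 1: any single character between U and A
    c = ("UA" ~ /^U.A$/)     # 0: the dot must consume one character
    print a, b, c
}')
echo "$out"
```

      This should print '1 1 0'.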
 
 '['...']'
      This is called a "bracket expression".(1)  It matches any _one_ of
      the characters that are enclosed in the square brackets.  For
      example, '[MVX]' matches any one of the characters 'M', 'V', or 'X'
      in a string.  A full discussion of what can be inside the square
      brackets of a bracket expression is given in Bracket
      Expressions.
 
 '[^'...']'
      This is a "complemented bracket expression".  The first character
      after the '[' _must_ be a '^'.  It matches any single character
      _except_ those in the square brackets.  For example, '[^awk]'
      matches any character that is not an 'a', 'w', or 'k'.
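
      The two kinds of bracket expression can be compared with a
      sketch like the following.  It assumes a POSIX-compatible 'awk'
      on the search path, and the test words are chosen only for
      illustration:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    a = ("Vim" ~ /^[MVX]/)    # 1: V is one of M, V, X
    b = ("vim" ~ /^[MVX]/)    # 0: matching is case-sensitive by default
    c = ("awk" ~ /[^awk]/)    # 0: every character is a, w, or k
    d = ("gawk" ~ /[^awk]/)   # 1: the g is outside the listed set
    print a, b, c, d
}')
echo "$out"
```

      This should print '1 0 0 1'.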
 
 '|'
      This is the "alternation operator" and it is used to specify
      alternatives.  The '|' has the lowest precedence of all the regular
      expression operators.  For example, '^P|[aeiouy]' matches any
      string that matches either '^P' or '[aeiouy]'.  This means it
      matches any string that starts with 'P' or contains (anywhere
      within it) a lowercase English vowel.
 
      The alternation applies to the largest possible regexps on either
      side.
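
      As a hedged illustration of '^P|[aeiouy]', assuming a
      POSIX-compatible 'awk' on the search path, this sketch tests four
      made-up words and collects one digit per word:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    n = split("Perl PHP grep HTTP", w, " ")
    for (i = 1; i <= n; i++)
        # 1 if the word starts with P or contains a lowercase vowel
        res = res "" (w[i] ~ /^P|[aeiouy]/)
    print res
}')
echo "$out"
```

      This should print '1110': 'HTTP' contains a 'P', but not at the
      beginning, and it has no lowercase vowel.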
 
 '('...')'
      Parentheses are used for grouping in regular expressions, as in
      arithmetic.  They can be used to concatenate regular expressions
      containing the alternation operator, '|'.  For example,
      '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'.
      (These are Texinfo formatting control sequences.  The '+' is
      explained further on in this list.)
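
      The Texinfo regexp above can be tried out with a sketch such as
      this one, which assumes a POSIX-compatible 'awk' on the search
      path.  The '@emph{baz}' and '@code{}' inputs are extra test cases
      invented for contrast:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    a = ("@code{foo}" ~ /@(samp|code)\{[^}]+\}/)   # 1
    b = ("@samp{bar}" ~ /@(samp|code)\{[^}]+\}/)   # 1
    c = ("@emph{baz}" ~ /@(samp|code)\{[^}]+\}/)   # 0: not samp or code
    d = ("@code{}" ~ /@(samp|code)\{[^}]+\}/)      # 0: nothing in braces
    print a, b, c, d
}')
echo "$out"
```

      This should print '1 1 0 0'.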
 
 '*'
      This symbol means that the preceding regular expression should be
      repeated as many times as necessary to find a match.  For example,
      'ph*' applies the '*' symbol to the preceding 'h' and looks for
      matches of one 'p' followed by any number of 'h's.  This also
      matches just 'p' if no 'h's are present.
 
      There are two subtle points to understand about how '*' works.
      First, the '*' applies only to the single preceding regular
      expression component (e.g., in 'ph*', it applies just to the 'h').
      To cause '*' to apply to a larger subexpression, use parentheses:
      '(ph)*' matches 'ph', 'phph', 'phphph', and so on.
 
      Second, '*' finds as many repetitions as possible.  If the text to
      be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's.
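
      Both points can be observed with the standard 'match()' function,
      which sets 'RSTART' and 'RLENGTH'.  The following is a sketch
      only, assuming a POSIX-compatible 'awk' on the search path and
      using shortened, made-up test strings:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    s = "phhhooey"
    match(s, /ph*/)
    m1 = substr(s, RSTART, RLENGTH)   # phhh: * takes every available h
    a = ("pie" ~ /ph*/)               # 1: zero repetitions also match
    t = "phphphx"
    match(t, /(ph)*/)
    m2 = substr(t, RSTART, RLENGTH)   # phphph: * applies to the group
    print m1, a, m2
}')
echo "$out"
```

      This should print 'phhh 1 phphph'.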
 
 '+'
      This symbol is similar to '*', except that the preceding expression
      must be matched at least once.  This means that 'wh+y' would match
      'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all
      three.
 
 '?'
      This symbol is similar to '*', except that the preceding expression
      can be matched either once or not at all.  For example, 'fe?d'
      matches 'fed' and 'fd', but nothing else.
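
      The differences among '*', '+', and '?' can be sketched as
      follows, assuming a POSIX-compatible 'awk' on the search path.
      The '^' and '$' anchors are an addition here, so that each word
      must match as a whole; 'feed' is an extra, made-up test word:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    n = split("wy why whhy", w, " ")
    for (i = 1; i <= n; i++) {
        plus = plus "" (w[i] ~ /^wh+y$/)   # + needs at least one h
        star = star "" (w[i] ~ /^wh*y$/)   # * also accepts zero h
    }
    m = split("fd fed feed", f, " ")
    for (i = 1; i <= m; i++)
        opt = opt "" (f[i] ~ /^fe?d$/)     # ? allows zero or one e
    print plus, star, opt
}')
echo "$out"
```

      This should print '011 111 110', one digit per word tested.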
 
 '{'N'}'
 '{'N',}'
 '{'N','M'}'
      One or two numbers inside braces denote an "interval expression".
      If there is one number in the braces, the preceding regexp is
      repeated N times.  If there are two numbers separated by a comma,
      the preceding regexp is repeated N to M times.  If there is one
      number followed by a comma, then the preceding regexp is repeated
      at least N times:
 
      'wh{3}y'
           Matches 'whhhy', but not 'why' or 'whhhhy'.
 
      'wh{3,5}y'
           Matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
 
      'wh{2,}y'
           Matches 'whhy', 'whhhy', and so on.
 
      Interval expressions were not traditionally available in 'awk'.
      They were added as part of the POSIX standard to make 'awk' and
      'egrep' consistent with each other.
 
      Initially, because old programs may use '{' and '}' in regexp
      constants, 'gawk' did _not_ match interval expressions in regexps.
 
      However, beginning with version 4.0, 'gawk' does match interval
      expressions by default.  This is because compatibility with POSIX
      has become more important to most 'gawk' users than compatibility
      with old programs.
 
      For programs that use '{' and '}' in regexp constants, it is good
      practice to always escape them with a backslash.  Then the regexp
      constants are valid and work the way you want them to, using any
      version of 'awk'.(2)
 
      Finally, when '{' and '}' appear in regexp constants in a way that
      cannot be interpreted as an interval expression (such as '/q{a}/'),
      then they stand for themselves.
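
      The advice about escaping braces can be illustrated with a short
      sketch.  It assumes a POSIX-compatible 'awk' on the search path,
      and the variable name 're' is hypothetical.  The second test
      uses a string constant as the regexp, so the backslashes are
      doubled:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    a = ("q{a}" ~ /q\{a\}/)   # 1: escaped braces in a regexp constant
    re = "q\\{a\\}"           # string constant: two backslashes each
    b = ("q{a}" ~ re)         # 1: same regexp, built from a string
    c = ("qa" ~ /q\{a\}/)     # 0: the braces are literal characters
    print a, b, c
}')
echo "$out"
```

      This should print '1 1 0'.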
 
    In regular expressions, the '*', '+', and '?' operators, as well as
 the braces '{' and '}', have the highest precedence, followed by
 concatenation, and finally by '|'.  As in arithmetic, parentheses can
 change how operators are grouped.
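
    These precedence rules can be checked with a sketch such as the
 following, which assumes a POSIX-compatible 'awk' on the search path
 and uses made-up test strings:

```shell
# Sketch only: assumes some POSIX-style awk is installed as "awk".
out=$(awk 'BEGIN {
    a = ("abX" ~ /^ab|cd$/)     # 1: grouped as (^ab)|(cd$)
    b = ("abX" ~ /^(ab|cd)$/)   # 0: parentheses regroup the alternation
    c = ("abbb" ~ /^ab*$/)      # 1: * binds to the b alone
    d = ("abab" ~ /^ab*$/)      # 0: so a repeated ab does not match
    e = ("abab" ~ /^(ab)*$/)    # 1: unless ab is grouped first
    print a, b, c, d, e
}')
echo "$out"
```

    This should print '1 0 1 0 1'.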
 
    In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for
 themselves when there is nothing in the regexp that precedes them.  For
 example, '/+/' matches a literal plus sign.  However, many other
 versions of 'awk' treat such a usage as a syntax error.
 
    If 'gawk' is in compatibility mode (see Options), interval
 expressions are not available in regular expressions.
 
    ---------- Footnotes ----------
 
    (1) In other literature, you may see a bracket expression referred to
 as either a "character set", a "character class", or a "character list".
 
    (2) Use two backslashes if you're using a string constant with a
 regexp operator or function.