Info: (gawk) Bracket Expressions

Info Catalog
gawk: Regexp Operators
gawk: Regexp
gawk: Leftmost Longest
gawk: Bracket Expressions

 
 3.4 Using Bracket Expressions
 =============================
 
 As mentioned earlier, a bracket expression matches any character among
 those listed between the opening and closing square brackets.
 
    Within a bracket expression, a "range expression" consists of two
 characters separated by a hyphen.  It matches any single character that
 sorts between the two characters, based upon the system's native
 character set.  For example, '[0-9]' is equivalent to '[0123456789]'.
 (See Ranges and Locales for an explanation of how the POSIX
 standard and 'gawk' have changed over time.  This is mainly of
 historical interest.)
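
    As a minimal sketch of a range expression, the following pipeline
 keeps only the input lines that begin with a decimal digit.  The sample
 input and the shell variable name 'digits' are invented for
 illustration.

```shell
# Hypothetical sample input; '[0-9]' behaves like '[0123456789]'.
digits=$(printf '%s\n' '7 apples' 'pears' '42 plums' | awk '/^[0-9]/')
printf '%s\n' "$digits"
```

 Only the lines '7 apples' and '42 plums' should be selected, because
 'pears' does not start with a character in the range.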
 
    With the increasing popularity of the Unicode character standard
 (http://www.unicode.org), there is an additional wrinkle to consider.
 Octal and hexadecimal escape sequences inside bracket expressions are
 taken to represent only single-byte characters (characters whose values
 fit within the range 0-255).  To match a range of characters where the
 endpoints of the range are larger than 255, enter the multibyte
 encodings of the characters directly.
 
    To include one of the characters '\', ']', '-', or '^' in a bracket
 expression, put a '\' in front of it.  For example:
 
      [d\]]
 
 matches either 'd' or ']'.  Additionally, if you place ']' right after
 the opening '[', that ']' is treated as one of the characters to be
 matched rather than as the closing bracket.
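
    The two ways of including a literal ']' can be compared side by
 side.  This is only a sketch: the sample input and the variable names
 'input', 'escaped', and 'leading' are assumptions made for the example.

```shell
# Hypothetical sample input: only 'red' and 'a]b' contain 'd' or ']'.
input=$(printf '%s\n' 'red' 'a]b' 'xyz')
# Backslash form: '\]' is a literal ']' inside the brackets.
escaped=$(printf '%s\n' "$input" | awk '/[d\]]/')
# Positional form: a ']' placed right after '[' is also literal.
leading=$(printf '%s\n' "$input" | awk '/[]d]/')
printf '%s\n' "$escaped" "$leading"
```

 Both patterns should select the same two lines, 'red' and 'a]b'.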
 
    The treatment of '\' in bracket expressions is compatible with other
 'awk' implementations and is also mandated by POSIX. The regular
 expressions in 'awk' are a superset of the POSIX specification for
 Extended Regular Expressions (EREs).  POSIX EREs are based on the
 regular expressions accepted by the traditional 'egrep' utility.
 
    "Character classes" are a feature introduced in the POSIX standard.
 A character class is a special notation for describing lists of
 characters that have a specific attribute, but the actual characters can
 vary from country to country and/or from character set to character set.
 For example, the notion of what is an alphabetic character differs
 between the United States and France.
 
    A character class is only valid in a regexp _inside_ the brackets of
 a bracket expression.  Character classes consist of '[:', a keyword
 denoting the class, and ':]'.  Table 3.1 lists the character classes
 defined by the POSIX standard.
 
 Class        Meaning
 --------------------------------------------------------------------------
 '[:alnum:]'  Alphanumeric characters
 '[:alpha:]'  Alphabetic characters
 '[:blank:]'  Space and TAB characters
 '[:cntrl:]'  Control characters
 '[:digit:]'  Numeric characters
 '[:graph:]'  Characters that are both printable and visible (a space is
              printable but not visible, whereas an 'a' is both)
 '[:lower:]'  Lowercase alphabetic characters
 '[:print:]'  Printable characters (characters that are not control
              characters)
 '[:punct:]'  Punctuation characters (characters that are not letters,
              digits, control characters, or space characters)
 '[:space:]'  Space characters (such as space, TAB, and formfeed, to name
              a few)
 '[:upper:]'  Uppercase alphabetic characters
 '[:xdigit:]' Characters that are hexadecimal digits
 
 Table 3.1: POSIX character classes
 
    For example, before the POSIX standard, you had to write
 '/[A-Za-z0-9]/' to match alphanumeric characters.  If your character set
 had other alphabetic characters in it, this would not match them.  With
 the POSIX character classes, you can write '/[[:alnum:]]/' to match the
 alphabetic and numeric characters in your character set.
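
    As an illustrative sketch, 'gsub()' can count the characters that
 belong to a class, because it returns the number of substitutions it
 made.  The sample text and the variable name 'count' are invented, and
 the example assumes an 'awk' that supports POSIX character classes.

```shell
# Replace each alphanumeric character with itself ("&") and print how
# many replacements were made.  The sample text is hypothetical.
count=$(printf '%s\n' 'ab-12 _c!' | awk '{ print gsub(/[[:alnum:]]/, "&") }')
printf '%s\n' "$count"
```

 Note that the class name sits inside a second pair of brackets:
 '[[:alnum:]]', not '[:alnum:]'.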
 
    Some utilities that match regular expressions provide a nonstandard
 '[:ascii:]' character class; 'awk' does not.  However, you can simulate
 such a construct using '[\x00-\x7F]'.  This matches all values
 numerically between zero and 127, which is the defined range of the
 ASCII character set.  Use a complemented character list ('[^\x00-\x7F]')
 to match any single-byte characters that are not in the ASCII range.
 
    Two additional special sequences can appear in bracket expressions.
 These apply to non-ASCII character sets, which can have single symbols
 (called "collating elements") that are represented with more than one
 character.  They can also have several characters that are equivalent
 for "collating", or sorting, purposes.  (For example, in French, a plain
 "e" and a grave-accented "è" are equivalent.)  These sequences are:
 
 Collating symbols
      Multicharacter collating elements enclosed between '[.' and '.]'.
      For example, if 'ch' is a collating element, then '[[.ch.]]' is a
      regexp that matches this collating element, whereas '[ch]' is a
      regexp that matches either 'c' or 'h'.
 
 Equivalence classes
      Locale-specific names for a list of characters that are equal.  The
      name is enclosed between '[=' and '=]'.  For example, the name 'e'
      might be used to represent all of "e", "ê", "è", and "é".  In
      this case, '[[=e=]]' is a regexp that matches any of 'e', 'ê',
      'é', or 'è'.
 
    These features are very valuable in non-English-speaking locales.
 
      CAUTION: The library functions that 'gawk' uses for regular
      expression matching currently recognize only POSIX character
      classes; they do not recognize collating symbols or equivalence
      classes.
 
    Inside a bracket expression, an opening bracket ('[') that does not
 start a character class, collating element or equivalence class is taken
 literally.  This is also true of '.' and '*'.
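
    A short sketch of this rule follows; the sample input and the
 variable name 'literal' are assumptions made for the example.  The '['
 is placed last in the list so that it cannot be read as the start of a
 collating symbol ('[.').

```shell
# Count the lines that contain a literal '.', '*', or '['.
# Hypothetical sample input: the first three lines qualify, 'ab' does not.
literal=$(printf '%s\n' 'a.b' 'a*b' 'a[b' 'ab' |
          awk '/[.*[]/ { n++ } END { print n }')
printf '%s\n' "$literal"
```

 Outside a bracket expression, '.' and '*' would act as regexp
 operators; inside one, they stand only for themselves.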