Info: (gawk) Regexp Field Splitting

 
 4.5.2 Using Regular Expressions to Separate Fields
 --------------------------------------------------
 
 The previous node discussed the use of single characters or simple
 strings as the value of 'FS'.  More generally, the value of 'FS' may be
 a string containing any regular expression.  In this case, each match in
 the record for the regular expression separates fields.  For example,
 the assignment:
 
      FS = ", \t"
 
 makes every area of an input line that consists of a comma followed by a
 space and a TAB into a field separator.  ('\t' is an "escape sequence"
 that stands for a TAB; see Escape Sequences for the complete list
 of similar escape sequences.)
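 
    As a minimal sketch of this assignment in action (assuming a POSIX
 shell and any 'awk' on the search path; the sample record and the
 variable name 'out' are made up for illustration), the following
 splits a record whose fields are separated by comma, space, TAB:

```shell
# Build a record with three fields separated by comma-space-TAB,
# then report the field count and the second field.
out=$(printf 'a, \tb, \tc\n' |
      awk 'BEGIN { FS = ", \t" } { print NF; print $2 }')
printf '%s\n' "$out"
# prints 3, then b
```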
 
    For a less trivial example of a regular expression, try using single
 spaces to separate fields the way single commas are used.  'FS' can be
 set to '"[ ]"' (left bracket, space, right bracket).  This regular
 expression matches a single space and nothing else (see Regexp).
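 
    A short sketch of the '"[ ]"' setting (the input line and the
 variable name 'out' are illustrative assumptions): because every
 single space is its own separator, two adjacent spaces enclose an
 empty field.

```shell
# 'a', two spaces, 'b', one space, 'c': each space separates fields,
# so the second field (between the two adjacent spaces) is empty.
out=$(echo 'a  b c' |
      awk 'BEGIN { FS = "[ ]" } { print NF; print "[" $2 "]" }')
printf '%s\n' "$out"
# prints 4, then []
```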
 
    There is an important difference between the two cases of 'FS = " "'
 (a single space) and 'FS = "[ \t\n]+"' (a regular expression matching
 one or more spaces, TABs, or newlines).  For both values of 'FS', fields
 are separated by "runs" (multiple adjacent occurrences) of spaces, TABs,
 and/or newlines.  However, when the value of 'FS' is '" "', 'awk' first
 strips leading and trailing whitespace from the record and then decides
 where the fields are.  For example, the following pipeline prints 'b':
 
      $ echo ' a b c d ' | awk '{ print $2 }'
      -| b
 
 However, this pipeline prints 'a' (note the extra spaces around each
 letter):
 
      $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
      >                                  { print $2 }'
      -| a
 
 In this case, the first field is null, or empty.
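 
    The difference can also be seen by counting fields.  The sketch
 below (variable names are illustrative, not part of 'awk') feeds the
 same record to both settings; with the regular expression, the
 leading and trailing runs of spaces each delimit an empty field:

```shell
# One record, split two ways.
line=' a  b  c  d '
default_nf=$(echo "$line" | awk '{ print NF }')
regexp_nf=$(echo "$line" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')
echo "default: $default_nf, regexp: $regexp_nf"
# prints: default: 4, regexp: 6
```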
 
    The stripping of leading and trailing whitespace also comes into play
 whenever '$0' is recomputed.  For instance, study this pipeline:
 
      $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
      -|    a b c d
      -| a b c d
 
 The first 'print' statement prints the record as it was read, with
 leading whitespace intact.  The assignment to '$2' rebuilds '$0' by
 concatenating '$1' through '$NF' together, separated by the value of
 'OFS' (which is a space by default).  Because the leading whitespace was
 ignored when finding '$1', it is not part of the new '$0'.  Finally, the
 last 'print' statement prints the new '$0'.
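 
    The rebuilding is easier to see with a non-default 'OFS'.  In this
 sketch (the choice of '-' and the variable name 'out' are arbitrary
 assumptions for illustration), assigning '$1' to itself forces '$0'
 to be recomputed:

```shell
# Leading whitespace is dropped and the fields are rejoined with OFS.
out=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
printf '%s\n' "$out"
# prints a-b-c-d
```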
 
    There is an additional subtlety to be aware of when using regular
 expressions for field splitting.  It is not well specified in the POSIX
 standard, or anywhere else, what '^' means when splitting fields.  Does
 the '^' match only at the beginning of the entire record?  Or is each
 field separator a new string?  It turns out that different 'awk'
 versions answer this question differently, and you should not rely on
 any specific behavior in your programs.  (d.c.)
 
    As a point of information, BWK 'awk' allows '^' to match only at the
 beginning of the record.  'gawk' also works this way.  For example:
 
      $ echo 'xxAA  xxBxx  C' |
      > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
      >                             printf "-->%s<--\n", $i }'
      -| --><--
      -| -->AA<--
      -| -->xxBxx<--
      -| -->C<--
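 
    Because the behavior varies, it can be worth probing the 'awk' you
 actually have.  The following is only a sketch (the test record and
 the variable name 'nf' are assumptions): an 'awk' that anchors '^' at
 the beginning of the record only, as described above for 'gawk',
 should report 3 fields, while one that treats the text after each
 separator as a new string would be expected to report more.

```shell
# Probe how the local awk treats '^' in a field separator.
# Record: x, A, space, x, B.  Separators: a leading 'x' or a space.
nf=$(echo 'xA xB' | awk -F '(^x)|( )' '{ print NF }')
echo "fields seen by this awk: $nf"
```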