gawk: gawk split records

 
 4.1.2 Record Splitting with 'gawk'
 ----------------------------------
 
 When using 'gawk', the value of 'RS' is not limited to a one-character
 string.  It can be any regular expression (see Regexp).  (c.e.)  In
 general, each record ends at the next string that matches the regular
 expression; the next record starts at the end of the matching string.
 This general rule is actually at work in the usual case, where 'RS'
 contains just a newline: a record ends at the beginning of the next
 matching string (the next newline in the input), and the following
 record starts just after the end of this string (at the first character
 of the following line).  The newline, because it matches 'RS', is not
 part of either record.
 
    When 'RS' is a single character, 'RT' contains the same single
 character.  However, when 'RS' is a regular expression, 'RT' contains
 the actual input text that matched the regular expression.
 
    If the input file ends without any text matching 'RS', 'gawk' sets
 'RT' to the null string.
 
    The following example illustrates both of these features.  It sets
 'RS' equal to a regular expression that matches either a newline or a
 series of one or more uppercase letters with optional leading and/or
 trailing whitespace:
 
      $ echo record 1 AAAA record 2 BBBB record 3 |
      > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
      >             { print "Record =", $0,"and RT = [" RT "]" }'
      -| Record = record 1 and RT = [ AAAA ]
      -| Record = record 2 and RT = [ BBBB ]
      -| Record = record 3 and RT = [
      -| ]
 
 The square brackets delineate the contents of 'RT', letting you see the
 leading and trailing whitespace.  The final value of 'RT' is a newline.
 See Simple Sed for a more useful example of 'RS' as a regexp and
 'RT'.
 
    If you set 'RS' to a regular expression that allows optional trailing
 text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation
 constraints, that 'gawk' may match the leading part of the regular
 expression, but not the trailing part, particularly if the input text
 that could match the trailing part is fairly long.  'gawk' attempts to
 avoid this problem, but currently, there's no guarantee that this will
 never happen.
 
      NOTE: Remember that in 'awk', the '^' and '$' anchor metacharacters
      match the beginning and end of a _string_, and not the beginning
      and end of a _line_.  As a result, something like 'RS =
      "^[[:upper:]]"' can only match at the beginning of a file.  This is
      because 'gawk' views the input file as one long string that happens
      to contain newline characters.  It is thus best to avoid anchor
      metacharacters in the value of 'RS'.
 
    The use of 'RS' as a regular expression and the 'RT' variable are
 'gawk' extensions; they are not available in compatibility mode (see
 Options).  In compatibility mode, only the first character of the
 value of 'RS' determines the end of the record.
 
                       'RS = "\0"' Is Not Portable
 
    There are times when you might want to treat an entire data file as a
 single record.  The only way to make this happen is to give 'RS' a value
 that you know doesn't occur in the input file.  This is hard to do in a
 general way, such that a program always works for arbitrary input files.
 
    You might think that for text files, the NUL character, which
 consists of a character with all bits equal to zero, is a good value to
 use for 'RS' in this case:
 
      BEGIN { RS = "\0" }  # whole file becomes one record?
 
    'gawk' in fact accepts this, and uses the NUL character for the
 record separator.  This works for certain special files, such as
 '/proc/self/environ' on GNU/Linux systems, where the NUL character is
 in fact the record separator.  However, this usage is _not_ portable to
 most other 'awk' implementations.
 
    Almost all other 'awk' implementations(1) store strings internally as
 C-style strings.  C strings use the NUL character as the string
 terminator.  In effect, this means that 'RS = "\0"' is the same as 'RS =
 ""'.  (d.c.)
 
    It happens that recent versions of 'mawk' can use the NUL character
 as a record separator.  However, this is a special case: 'mawk' does not
 allow embedded NUL characters in strings.  (This may change in a future
 version of 'mawk'.)
 
    See Readfile Function for an interesting way to read whole files.
 If you are using 'gawk', see Extension Sample Readfile for another
 option.
 
    ---------- Footnotes ----------
 
    (1) At least that we know about.