Info: (gawk) awk split records

 
 4.1.1 Record Splitting with Standard 'awk'
 ------------------------------------------
 
 Records are separated by a character called the "record separator".  By
 default, the record separator is the newline character.  This is why
 records are, by default, single lines.  To use a different character for
 the record separator, simply assign that character to the predefined
 variable 'RS'.
 
    Like any other variable, the value of 'RS' can be changed in the
 'awk' program with the assignment operator, '=' (see Assignment
 Ops).  The new record-separator character should be enclosed in
 quotation marks, which indicate a string constant.  Often, the right
 time to do this is at the beginning of execution, before any input is
 processed, so that the very first record is read with the proper
 separator.  To do this, use the special 'BEGIN' pattern (see
 BEGIN/END).  For example:
 
      awk 'BEGIN { RS = "u" }
           { print $0 }' mail-list
 
 changes the value of 'RS' to 'u', before reading any input.  The new
 value is a string whose first character is the letter "u"; as a result,
 records are separated by the letter "u".  Then the input file is read,
 and the second rule in the 'awk' program (the action with no pattern)
 prints each record.  Because each 'print' statement adds a newline at
 the end of its output, this 'awk' program copies the input with each 'u'
 changed to a newline.  Here are the results of running the program on
 'mail-list':
 
      $ awk 'BEGIN { RS = "u" }
      >      { print $0 }' mail-list
      -| Amelia       555-5553     amelia.zodiac
      -| sq
      -| e@gmail.com    F
      -| Anthony      555-3412     anthony.assert
      -| ro@hotmail.com   A
      -| Becky        555-7685     becky.algebrar
      -| m@gmail.com      A
      -| Bill         555-1675     bill.drowning@hotmail.com       A
      -| Broderick    555-0542     broderick.aliq
      -| otiens@yahoo.com R
      -| Camilla      555-2912     camilla.inf
      -| sar
      -| m@skynet.be     R
      -| Fabi
      -| s       555-1234     fabi
      -| s.
      -| ndevicesim
      -| s@
      -| cb.ed
      -|     F
      -| J
      -| lie        555-6699     j
      -| lie.perscr
      -| tabor@skeeve.com   F
      -| Martin       555-6480     martin.codicib
      -| s@hotmail.com    A
      -| Sam
      -| el       555-3430     sam
      -| el.lanceolis@sh
      -| .ed
      -|         A
      -| Jean-Pa
      -| l    555-2127     jeanpa
      -| l.campanor
      -| m@ny
      -| .ed
      -|      R
      -|
 
 Note that the entry for the name 'Bill' is not split.  In the original
 data file (see Sample Data Files), the line looks like this:
 
      Bill         555-1675     bill.drowning@hotmail.com       A
 
 It contains no 'u', so there is no reason to split the record, unlike
 the others, which each have one or more occurrences of the 'u'.  In
 fact, this record is treated as part of the previous record; the newline
 separating them in the output is the original newline in the data file,
 not the one added by 'awk' when it printed the record!
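
    The effect is easier to see when each record is numbered and
 bracketed.  The following is a minimal sketch, not part of the original
 example: the three-line input is hypothetical, and the variable 'out'
 exists only to hold the result for display.

```shell
# Hypothetical three-line input; only the first and last lines contain "u".
out=$(printf 'blue sky\nBill here\nJulie' |
      awk 'BEGIN { RS = "u" }
           { print NR ": [" $0 "]" }')
printf '%s\n' "$out"
```

 The second record should span three output lines, because the line
 without a 'u' stays embedded in it along with both original newlines.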
 
    Another way to change the record separator is on the command line,
 using the variable-assignment feature (see Other Arguments):
 
      awk '{ print $0 }' RS="u" mail-list
 
 This sets 'RS' to 'u' before processing 'mail-list'.
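
    The same mechanism can be tried without the sample file.  This
 sketch assumes a hypothetical input piped on standard input (named by
 the '-' operand) and uses '/' as the separator; 'out' is only a holding
 variable for the demonstration.

```shell
# Hypothetical input; "-" names standard input, and the assignment
# operand placed before it takes effect before that input is read.
out=$(printf 'alpha/beta/gamma' | awk '{ print NR ": " $0 }' RS="/" -)
printf '%s\n' "$out"
```

 Because the assignment is an ordinary command-line operand, it is
 processed after any 'BEGIN' rule has run, so a 'BEGIN' rule would still
 see the default value of 'RS'.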
 
    Using an alphabetic character such as 'u' for the record separator is
 highly likely to produce strange results.  Using an unusual character
 such as '/' is more likely to produce correct behavior in the majority
 of cases, but there are no guarantees.  The moral is: Know Your Data.
 
    When using regular characters as the record separator, there is one
 unusual case that occurs when 'gawk' is being fully POSIX-compliant
 (see Options).  Then, the following (extreme) pipeline prints a
 surprising '1':
 
      $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
      -| 1
 
    There is one field, consisting of a newline.  The value of the
 built-in variable 'NF' is the number of fields in the current record.
 (In the normal case, 'gawk' treats the newline as whitespace, printing
 '0' as the result.  Most other versions of 'awk' also act this way.)
 
    Reaching the end of an input file terminates the current input
 record, even if the last character in the file is not the character in
 'RS'.  (d.c.)
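
    As an illustration, consider a hypothetical input that does not end
 with the separator.  This is a sketch under that assumption; the input
 text and the variable 'out' are invented for the example.

```shell
# The input has no ";" after "three", yet "three" is still a record.
out=$(printf 'one;two;three' |
      awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

 Three records should be reported, the last one ended by the end of the
 input rather than by a ';'.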
 
    The empty string '""' (a string without any characters) has a special
 meaning as the value of 'RS'.  It means that records are separated by
 one or more blank lines and nothing else.  See Multiple Line for
 more details.
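
    A brief sketch of this mode follows.  The two address-style entries
 are hypothetical data, separated by several blank lines, and 'out' is
 only a holding variable.

```shell
# Two hypothetical entries separated by a run of blank lines.
out=$(printf 'Jane Doe\n555-0100\n\n\n\nJohn Roe\n555-0199\n' |
      awk 'BEGIN { RS = "" } { print NR ": " $1 " (" NF " fields)" }')
printf '%s\n' "$out"
```

 Each entry should come back as one record no matter how many blank
 lines separate them.  With the default 'FS', each record here has three
 fields, because in this mode a newline also separates fields.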
 
    If you change the value of 'RS' in the middle of an 'awk' run, the
 new value is used to delimit subsequent records, but the record
 currently being processed, as well as records already processed, are not
 affected.
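
    The following sketch illustrates this with a hypothetical two-line
 input.  The first record is read with the default separator, and the
 rule that runs for it switches 'RS' to a comma for everything that
 follows.  The input and the variable 'out' are assumptions made for the
 example.

```shell
# Record 1 is split at the newline; later records are split at ",".
out=$(printf 'a b\nc,d,e' |
      awk 'NR == 1 { RS = "," }
           { print NR ": " $0 }')
printf '%s\n' "$out"
```

 The first record, 'a b', is unaffected by the change, and the remainder
 of the input should be delivered as the three records 'c', 'd', and 'e'.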
 
    After the end of the record has been determined, 'gawk' sets the
 variable 'RT' to the text in the input that matched 'RS'.
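
    Because 'RT' is specific to 'gawk', the sketch below runs only when a
 'gawk' executable is found.  The comma-separated input and the variable
 'out' are hypothetical.

```shell
# RT is a gawk extension, so this sketch does nothing when gawk is absent.
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'a,b,c' |
          gawk 'BEGIN { RS = "," } { printf "%s [%s]\n", $0, RT }')
    printf '%s\n' "$out"
fi
```

 For the first two records 'RT' should hold the comma that ended them.
 For the last record, which is ended by the end of the input, 'RT' is
 the empty string.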