 11.3.7 Extracting Programs from Texinfo Source Files
 ----------------------------------------------------
 
 The nodes See Library Functions, and See Sample Programs, are
 the top level nodes for a large number of 'awk' programs.  If you want
 to experiment with these programs, it is tedious to type them in by
 hand.  Here we present a program that can extract parts of a Texinfo
 input file into separate files.
 
    This Info file is written in Texinfo
 (https://www.gnu.org/software/texinfo/), the GNU Project's document
 formatting language.  A single Texinfo source file can be used to
 produce both printed documentation, with TeX, and online documentation.
 (The Texinfo language is described fully, starting with the 'Top' node
 of 'Texinfo---The GNU Documentation Format'.)
 
    For our purposes, it is enough to know three things about Texinfo
 input files:
 
    * The "at" symbol ('@') is special in Texinfo, much as the backslash
      ('\') is in C or 'awk'.  Literal '@' symbols are represented in
      Texinfo source files as '@@'.
 
    * Comments start with either '@c' or '@comment'.  The file-extraction
      program works by using special comments that start at the beginning
      of a line.
 
    * Lines containing '@group' and '@end group' commands bracket example
      text that should not be split across a page boundary.
      (Unfortunately, TeX isn't always smart enough to do things exactly
      right, so we have to give it some help.)
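
    For instance, a made-up fragment that uses all three of these
 conventions (a comment, a literal '@' written as '@@', and a '@group'
 bracket) might look like this:

      @c This comment is ignored by the Texinfo formatters.
      @example
      @group
      Send bug reports to bug-gawk@@gnu.org.
      @end group
      @end example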
 
    The following program, 'extract.awk', reads through a Texinfo source
 file and does two things, based on the special comments.  Upon seeing
 '@c system ...', it runs a command, by extracting the command text from
 the control line and passing it on to the 'system()' function (See I/O
 Functions).  Upon seeing '@c file FILENAME', each subsequent line is
 sent to the file FILENAME, until '@c endfile' is encountered.  The rules
 in 'extract.awk' match either '@c' or '@comment' by letting the 'omment'
 part be optional.  Lines containing '@group' and '@end group' are simply
 removed.  'extract.awk' uses the 'join()' library function (See Join
 Function).
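
    A quick way to see that the optional-'omment' idiom accepts both
 spellings is a throwaway test like the following (the input lines are
 invented, not taken from 'gawktexi.in'):

      $ printf '@c file x.awk\n@comment file x.awk\n@chapter Files\n' |
      > gawk '/^@c(omment)?[ \t]+file/ { print "matched:", $0 }'
      -| matched: @c file x.awk
      -| matched: @comment file x.awk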
 
    The example programs in the online Texinfo source for 'GAWK:
 Effective AWK Programming' ('gawktexi.in') have all been bracketed
 inside 'file' and 'endfile' lines.  The 'gawk' distribution uses a copy
 of 'extract.awk' to extract the sample programs and install many of them
 in a standard directory where 'gawk' can find them.  The Texinfo file
 looks something like this:
 
      ...
      This program has a @code{BEGIN} rule
      that prints a nice message:
 
      @example
      @c file examples/messages.awk
      BEGIN @{ print "Don't panic!" @}
      @c endfile
      @end example
 
      It also prints some final advice:
 
      @example
      @c file examples/messages.awk
      END @{ print "Always avoid bored archaeologists!" @}
      @c endfile
      @end example
      ...
 
    'extract.awk' begins by setting 'IGNORECASE' to one, so that mixed
 upper- and lowercase letters in the directives won't matter.
 
    The first rule handles calling 'system()', checking that a command is
 given ('NF' is at least three) and also checking that the command exits
 with a zero exit status, signifying OK:
 
      # extract.awk --- extract files and run programs from Texinfo files
 
      BEGIN    { IGNORECASE = 1 }
 
      /^@c(omment)?[ \t]+system/ {
          if (NF < 3) {
              e = ("extract: " FILENAME ":" FNR)
              e = (e  ": badly formed `system' line")
              print e > "/dev/stderr"
              next
          }
          $1 = ""
          $2 = ""
          stat = system($0)
          if (stat != 0) {
              e = ("extract: " FILENAME ":" FNR)
              e = (e ": warning: system returned " stat)
              print e > "/dev/stderr"
          }
      }
 
 The variable 'e' is used so that the rule fits nicely on the screen.
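
    For instance, a control line such as '@c system echo hi' (an
 invented example, not from 'gawktexi.in') matches this rule.  After
 '$1' and '$2' are emptied, '$0' is rebuilt as '  echo hi', with two
 leading spaces that the shell ignores, and that string is what
 'system()' runs:

      $ printf '@c system echo hi from extract.awk\n' |
      > gawk '/^@c(omment)?[ \t]+system/ { $1 = ""; $2 = ""; system($0) }'
      -| hi from extract.awk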
 
    The second rule handles moving data into files.  It verifies that a
 file name is given in the directive.  If the file named is not the
 current file, then the current file is closed.  Keeping the current file
 open until a new file is encountered allows the use of the '>'
 redirection for printing the contents, keeping open-file management
 simple.
 
    The 'for' loop does the work.  It reads lines using 'getline' (See
 Getline).  If it encounters an unexpected end-of-file, it calls the
 'unexpected_eof()' function.  If the line is an "endfile" line, then it
 breaks out of the loop.  If the line is an '@group' or '@end group'
 line, then it ignores it and goes on to the next line.  Similarly,
 comments within examples are also ignored.
 
    Most of the work is in the following few lines.  If the line has no
 '@' symbols, the program can print it directly.  Otherwise, each leading
 '@' must be stripped off.  To remove the '@' symbols, the line is split
 into separate elements of the array 'a', using the 'split()' function
 (See String Functions).  The '@' symbol is used as the separator
 character.  An empty element of 'a' (other than a leading one)
 indicates two successive '@' symbols ('@@') in the original line; for
 each such empty element, a single '@' symbol must be added back in.
 
    When the processing of the array is finished, 'join()' is called with
 the value of 'SUBSEP' (See Multidimensional), to rejoin the pieces
 back into a single line.  That line is then printed to the output file:
 
      /^@c(omment)?[ \t]+file/ {
          if (NF != 3) {
              e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
              print e > "/dev/stderr"
              next
          }
          if ($3 != curfile) {
              if (curfile != "")
                  close(curfile)
              curfile = $3
          }
 
          for (;;) {
              if ((getline line) <= 0)
                  unexpected_eof()
              if (line ~ /^@c(omment)?[ \t]+endfile/)
                  break
              else if (line ~ /^@(end[ \t]+)?group/)
                  continue
              else if (line ~ /^@c(omment)?[ \t]+/)
                  continue
              if (index(line, "@") == 0) {
                  print line > curfile
                  continue
              }
              n = split(line, a, "@")
              # if a[1] == "", means leading @,
              # don't add one back in.
              for (i = 2; i <= n; i++) {
                  if (a[i] == "") { # was an @@
                      a[i] = "@"
                      if (a[i+1] == "")
                          i++
                  }
              }
              print join(a, 1, n, SUBSEP) > curfile
          }
      }
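
    To watch the '@' handling in isolation, here is a small,
 self-contained sketch.  The input line is made up, and 'join()' is
 emulated by concatenating the pieces with no separator, which is what
 'join()' does when it is passed 'SUBSEP':

      $ gawk 'BEGIN {
      >     line = "BEGIN @{ print \"user@@host\" @}"
      >     n = split(line, a, "@")
      >     for (i = 2; i <= n; i++)
      >         if (a[i] == "") {       # empty piece: this was an @@
      >             a[i] = "@"
      >             if (a[i+1] == "")
      >                 i++
      >         }
      >     result = a[1]               # rejoin with no separator
      >     for (i = 2; i <= n; i++)
      >         result = result a[i]
      >     print result
      > }'
      -| BEGIN { print "user@host" }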
 
    An important thing to note is the use of the '>' redirection.  Output
 done with '>' only opens the file once; it stays open and subsequent
 output is appended to the file (See Redirection).  This makes it
 easy to mix program text and explanatory prose for the same sample
 source file (as has been done here!)  without any hassle.  The file is
 only closed when a new data file name is encountered or at the end of
 the input file.
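
    A minimal illustration of this behavior (standalone, not part of
 'extract.awk'): both 'print' statements below write to the hypothetical
 file 'demo.out', and the second one does not overwrite the first,
 because '>' opens the file only once per run:

      $ gawk 'BEGIN {
      >     print "first"  > "demo.out"
      >     print "second" > "demo.out"
      >     close("demo.out")
      > }'
      $ cat demo.out
      -| first
      -| second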
 
    Finally, the function 'unexpected_eof()' prints an appropriate error
 message and then exits.  The 'END' rule handles the final cleanup,
 closing the open file:
 
      function unexpected_eof()
      {
          printf("extract: %s:%d: unexpected EOF or error\n",
                           FILENAME, FNR) > "/dev/stderr"
          exit 1
      }
 
      END {
          if (curfile)
              close(curfile)
      }
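
    To try 'extract.awk' yourself, an invocation along the following
 lines should work, assuming that the 'join()' function has been saved
 in a file named 'join.awk' (the name used for it in the
 library-functions examples) and that the Texinfo source is available as
 'gawktexi.in':

      $ gawk -f join.awk -f extract.awk gawktexi.in

 Each sample program then lands in the file named on its '@c file' line
 (for example, 'examples/messages.awk' above).  Any directories in those
 file names must already exist; '@c system' lines earlier in the Texinfo
 source can be used to create them.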