gawk: Igawk Program

 
 11.3.9 An Easy Way to Use Library Functions
 -------------------------------------------
 
 As we saw in Include Files, 'gawk' provides a built-in
 file-inclusion capability.  However, this is a 'gawk' extension.  This
 minor node provides the motivation for making file inclusion available
 for standard 'awk', and shows how to do it using a combination of shell
 and 'awk' programming.
 
    Using library functions in 'awk' can be very beneficial.  It
 encourages code reuse and the writing of general functions.  Programs
 are smaller and therefore clearer.  However, using library functions is
 only easy when writing 'awk' programs; it is painful when running them,
 requiring multiple '-f' options.  If 'gawk' is unavailable, then so too
 is the 'AWKPATH' environment variable and the ability to put 'awk'
 functions into a library directory (see Options).  It would be nice
 to be able to write programs in the following manner:
 
      # library functions
      @include getopt.awk
      @include join.awk
      ...
 
      # main program
      BEGIN {
          while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
              ...
          ...
      }
 
    The following program, 'igawk.sh', provides this service.  It
 simulates 'gawk''s searching of the 'AWKPATH' variable and also allows
 "nested" includes (i.e., a file that is included with '@include' can
 contain further '@include' statements).  'igawk' makes an effort to only
 include files once, so that nested includes don't accidentally include a
 library function twice.
 
    'igawk' should behave just like 'gawk' externally.  This means it
 should accept all of 'gawk''s command-line arguments, including the
 ability to have multiple source files specified via '-f' and the ability
 to mix command-line and library source files.
 
    The program is written using the POSIX Shell ('sh') command
 language.(1)  It works as follows:
 
   1. Loop through the arguments, saving anything that doesn't represent
      'awk' source code for later, when the expanded program is run.
 
   2. For any arguments that do represent 'awk' text, put the arguments
      into a shell variable that will be expanded.  There are two cases:
 
        a. Literal text, provided with '-e' or '--source'.  This text is
           just appended directly.
 
        b. Source file names, provided with '-f'.  We use a neat trick
           and append '@include FILENAME' to the shell variable's
           contents.  Because the file-inclusion program works the way
           'gawk' does, this gets the text of the file included in the
           program at the correct point.
 
   3. Run an 'awk' program (naturally) over the shell variable's contents
      to expand '@include' statements.  The expanded program is placed in
      a second shell variable.
 
   4. Run the expanded program with 'gawk' and any other original
      command-line arguments that the user supplied (such as the data
      file names).
 
    This program uses shell variables extensively: for storing
 command-line arguments and the text of the 'awk' program that will
 expand the user's program, for the user's original program, and for the
 expanded program.  Doing so removes some potential problems that might
 arise were we to use temporary files instead, at the cost of making the
 script somewhat more complicated.
 
    The initial part of the program turns on shell tracing if the first
 argument is 'debug'.
 
    The next part loops through all the command-line arguments.  There
 are several cases of interest:
 
 '--'
      This ends the arguments to 'igawk'.  Anything else should be passed
      on to the user's 'awk' program without being evaluated.
 
 '-W'
      This indicates that the next option is specific to 'gawk'.  To make
      argument processing easier, the '-W' is prepended to the remaining
      arguments and the loop continues.  (This is an 'sh'
      programming trick.  Don't worry about it if you are not familiar
      with 'sh'.)
 
 '-v', '-F'
      These are saved and passed on to 'gawk'.
 
 '-f', '--file', '--file=', '-Wfile='
      The file name is appended to the shell variable 'program' with an
      '@include' statement.  The 'expr' utility is used to remove the
      leading option part of the argument (e.g., '--file=').  (Typical
      'sh' usage would be to use the 'echo' and 'sed' utilities to do
      this work.  Unfortunately, some versions of 'echo' evaluate escape
      sequences in their arguments, possibly mangling the program text.
      Using 'expr' avoids this problem.)
 
 '--source', '--source=', '-Wsource='
      The source text is appended to 'program'.
 
 '--version', '-Wversion'
      'igawk' prints its version number, runs 'gawk --version' to get the
      'gawk' version information, and then exits.
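
    The '-W' case is easier to follow in isolation.  The following is a
 minimal sketch, not part of 'igawk'; the function name 'join_W' and the
 sample arguments are made up for illustration.  It shows how 'set --'
 rewrites the positional parameters so that '-W' and its operand become
 a single word:

```shell
# join_W is a hypothetical helper; igawk does the same thing inline.
join_W() {
    if [ "$1" = -W ]
    then
        shift
        # "$@" expands to one word per remaining argument; the -W
        # prefix is glued onto the first of those words only.
        set -- -W"$@"
    fi
    # Print the resulting arguments, one per line.
    for arg
    do
        printf '%s\n' "$arg"
    done
}

joined=$(join_W -W file=lib.awk data.txt)
printf '%s\n' "$joined"
```

    The first line printed is '-Wfile=lib.awk', which the '-[W-]file=*'
 case can then handle on the next trip through the loop.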
 
    If none of the '-f', '--file', '-Wfile', '--source', or '-Wsource'
 arguments are supplied, then the first nonoption argument should be the
 'awk' program.  If there are no command-line arguments left, 'igawk'
 prints an error message and exits.  Otherwise, the first argument is
 appended to 'program'.  In any case, after the arguments have been
 processed, the shell variable 'program' contains the complete text of
 the original 'awk' program.
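
    The error exit relies on the shell's '${NAME?MESSAGE}' expansion.
 As a rough illustration (the variable 'x' and the result variables are
 invented for this sketch), the following compares it with the
 '${NAME:?MESSAGE}' form.  Each expansion runs in a subshell, because a
 failed expansion makes a noninteractive shell exit:

```shell
# x is unset: the ? form fails and the subshell exits with an error.
if ( unset x; : "${x?missing operand}" ) 2>/dev/null
then unset_result=ok
else unset_result=error
fi

# x is set but empty: the ? form accepts it.
if ( x=; : "${x?missing operand}" ) 2>/dev/null
then null_result=ok
else null_result=error
fi

# x is set but empty: the :? form rejects it as well.
if ( x=; : "${x:?missing operand}" ) 2>/dev/null
then colon_result=ok
else colon_result=error
fi

printf '%s\n' "$unset_result" "$null_result" "$colon_result"
```

    In other words, the plain '?' form used by 'igawk' only catches a
 missing argument, not an empty one.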
 
    The program is as follows:
 
      #! /bin/sh
      # igawk --- like gawk but do @include processing
 
      if [ "$1" = debug ]
      then
          set -x
          shift
      fi
 
      # A literal newline, so that program text is formatted correctly
      n='
      '
 
      # Initialize variables to empty
      program=
      opts=
 
      while [ $# -ne 0 ] # loop over arguments
      do
          case $1 in
          --)     shift
                  break ;;
 
          -W)     shift
                  # The ${x?'message here'} construct prints a
                  # diagnostic and exits if $x is unset
                  set -- -W"${@?'missing operand'}"
                  continue ;;
 
          -[vF])  opts="$opts $1 '${2?'missing operand'}'"
                  shift ;;
 
          -[vF]*) opts="$opts '$1'" ;;
 
          -f)     program="$program$n@include ${2?'missing operand'}"
                  shift ;;
 
          -f*)    f=$(expr "$1" : '-f\(.*\)')
                  program="$program$n@include $f" ;;
 
          -[W-]file=*)
                  f=$(expr "$1" : '-.file=\(.*\)')
                  program="$program$n@include $f" ;;
 
          -[W-]file)
                  program="$program$n@include ${2?'missing operand'}"
                  shift ;;
 
          -[W-]source=*)
                  t=$(expr "$1" : '-.source=\(.*\)')
                  program="$program$n$t" ;;
 
          -[W-]source)
                  program="$program$n${2?'missing operand'}"
                  shift ;;
 
          -[W-]version)
                  echo igawk: version 3.0 1>&2
                  gawk --version
                  exit 0 ;;
 
          -[W-]*) opts="$opts '$1'" ;;
 
          *)      break ;;
          esac
          shift
      done
 
      if [ -z "$program" ]
      then
           program=${1?'missing program'}
           shift
      fi
 
      # At this point, `program' has the program.
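
    The 'expr' idiom used in the '-f*' and '-[W-]file=*' cases can be
 tried on its own.  In this sketch the two sample arguments are
 hypothetical; the second contains a backslash to show that 'expr' passes
 it through untouched:

```shell
# Sample arguments, made up for illustration.
arg1='-flib.awk'
arg2='--file=my\tprog.awk'     # a literal backslash followed by "t"

# expr prints the part of the string matched by \( ... \).
f1=$(expr "$arg1" : '-f\(.*\)')
f2=$(expr "$arg2" : '-.file=\(.*\)')

printf '%s\n' "$f1"    # lib.awk
printf '%s\n' "$f2"    # my\tprog.awk, backslash preserved
```

    'printf' with a '%s' format is used for the output here for the same
 reason the script avoids 'echo': it does not interpret escape sequences
 in its arguments.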
 
    The 'awk' program to process '@include' directives is stored in the
 shell variable 'expand_prog'.  Doing this keeps the shell script
 readable.  The 'awk' program reads through the user's program, one line
 at a time, using 'getline' (see Getline).  The input file names and
 '@include' statements are managed using a stack.  As each '@include' is
 encountered, the current file name is "pushed" onto the stack and the
 file named in the '@include' directive becomes the current file name.
 As each file is finished, the stack is "popped," and the previous input
 file becomes the current input file again.  The process is started by
 making the original file the first one on the stack.
 
    The 'pathto()' function does the work of finding the full path to a
 file.  It simulates 'gawk''s behavior when searching the 'AWKPATH'
 environment variable (see AWKPATH Variable).  If a file name has a
 '/' in it, no path search is done.  Similarly, if the file name is
 '"-"', then that string is used as-is.  Otherwise, the file name is
 concatenated with the name of each directory in the path, and an attempt
 is made to open the generated file name.  The only way to test if a file
 can be read in 'awk' is to go ahead and try to read it with 'getline';
 this is what 'pathto()' does.(2)  If the file can be read, it is closed
 and the file name is returned:
 
      expand_prog='
 
      function pathto(file,    i, t, junk)
      {
          if (index(file, "/") != 0)
              return file
 
          if (file == "-")
              return file
 
          for (i = 1; i <= ndirs; i++) {
              t = (pathlist[i] "/" file)
              if ((getline junk < t) > 0) {
                  # found it
                  close(t)
                  return t
              }
          }
          return ""
      }
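
    To watch the search work, the sketch below runs the same 'pathto()'
 against a scratch directory.  It is not part of 'igawk': the directory,
 the file name 'lib.awk', and the use of 'mktemp -d' are assumptions made
 for illustration, and plain 'awk' is used so that the example does not
 depend on 'gawk':

```shell
# Hypothetical scratch directory holding one library file.
dir=$(mktemp -d)
printf '%s\n' 'function hello() { print "hi" }' > "$dir/lib.awk"

found=$(AWKPATH="$dir" awk '
function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file
    if (file == "-")
        return file
    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            close(t)
            return t
        }
    }
    return ""
}
BEGIN {
    ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
    print pathto("lib.awk")                    # found via the path search
    print pathto("/etc/passwd")                # has a "/", returned as-is
    print "[" pathto("no-such-file.awk") "]"   # not found: empty string
}')
printf '%s\n' "$found"
rm -rf "$dir"
```

    Note that a name containing '/' is returned without any attempt to
 read it, so a nonexistent file is only detected later, when the main
 loop tries to read from it.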
 
    The main program is contained inside one 'BEGIN' rule.  The first
 thing it does is set up the 'pathlist' array that 'pathto()' uses.
 After splitting the path on ':', null elements are replaced with '"."',
 which represents the current directory:
 
      BEGIN {
          path = ENVIRON["AWKPATH"]
          ndirs = split(path, pathlist, ":")
          for (i = 1; i <= ndirs; i++) {
              if (pathlist[i] == "")
                  pathlist[i] = "."
          }
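
    As a small worked example of the null-element rule, the following
 sketch splits a made-up 'AWKPATH' value that has an empty element at
 each end.  The path '/usr/share/awk' is only a sample value:

```shell
# A sample AWKPATH with a leading and a trailing empty element.
dirs=$(AWKPATH=':/usr/share/awk:' awk 'BEGIN {
    ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
        printf "%d=%s\n", i, pathlist[i]
    }
}')
printf '%s\n' "$dirs"
```

    This prints three elements: '.', '/usr/share/awk', and '.' again.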
 
    The stack is initialized with 'ARGV[1]', which will be
 '"/dev/stdin"'.  The main loop comes next.  Input lines are read in
 succession.  Lines that do not start with '@include' are printed
 verbatim.  If the line does start with '@include', the file name is in
 '$2'.  'pathto()' is called to generate the full path.  If it cannot
 find the file, the program prints an error message and continues with
 the next line.
 
    The next thing to check is if the file is included already.  The
 'processed' array is indexed by the full file name of each included file
 and it tracks this information for us.  If the file is seen again, a
 warning message is printed.  Otherwise, the new file name is pushed onto
 the stack and processing continues.
 
    Finally, when 'getline' encounters the end of the input file, the
 file is closed and the stack is popped.  When 'stackptr' is less than
 zero, the program is done:
 
          stackptr = 0
          input[stackptr] = ARGV[1] # ARGV[1] is first file
 
          for (; stackptr >= 0; stackptr--) {
              while ((getline < input[stackptr]) > 0) {
                  lineno[stackptr]++   # getline < file does not set FNR
                  if (tolower($1) != "@include") {
                      print
                      continue
                  }
                  fpath = pathto($2)
                  if (fpath == "") {
                      printf("igawk: %s:%d: cannot find %s\n",
                          input[stackptr], lineno[stackptr], $2) > "/dev/stderr"
                      continue
                  }
                  if (! (fpath in processed)) {
                      processed[fpath] = input[stackptr]
                      input[++stackptr] = fpath  # push onto stack
                      lineno[stackptr] = 0
                  } else
                      print $2, "included in", input[stackptr],
                          "already included in",
                          processed[fpath] > "/dev/stderr"
              }
              close(input[stackptr])
          }
      }'  # close quote ends `expand_prog' variable
 
      processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
      $program
      EOF
      )
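
    The stack discipline is easier to trace on a small input.  The
 sketch below is a reduced form of the loop, not the 'igawk' code itself:
 it omits the 'AWKPATH' search and the warning message, resolves names
 relative to a scratch directory, and uses invented file names
 ('main.awk', 'a.awk').  It assumes 'mktemp -d' is available:

```shell
# Hypothetical scratch files: main.awk includes a.awk twice.
dir=$(mktemp -d)
printf '%s\n' 'function a() { return "a" }' > "$dir/a.awk"
printf '%s\n' '@include a.awk' 'BEGIN { print a() }' \
    '@include a.awk' > "$dir/main.awk"

expanded=$(awk -v dir="$dir" 'BEGIN {
    stackptr = 0
    input[stackptr] = dir "/main.awk"
    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = dir "/" $2
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath   # push: read this file next
            }
            # a repeated include is silently skipped in this sketch
        }
        close(input[stackptr])
    }
}')
printf '%s\n' "$expanded"
rm -rf "$dir"
```

    The output contains the function definition once, followed by the
 'BEGIN' rule.  Because 'main.awk' stays open while 'a.awk' is read,
 popping the stack resumes 'main.awk' at the line after the '@include'.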
 
    The shell construct 'COMMAND << MARKER' is called a "here document".
 Everything in the shell script up to the MARKER is fed to COMMAND as
 input.  The shell processes the contents of the here document for
 variable and command substitution (and possibly other things as well,
 depending upon the shell).
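
    A short sketch of this behavior follows, using 'cat' as a stand-in
 for the command and a made-up one-line program.  The second here
 document quotes its marker, which tells the shell to leave the contents
 alone:

```shell
program='BEGIN { print "hello" }'

# Unquoted marker: the shell expands $program before cat reads it.
fed=$(cat << EOF
$program
EOF
)

# Quoted marker: the text is passed through literally.
literal=$(cat << 'EOF'
$program
EOF
)

printf '%s\n' "$fed" "$literal"
```

    'igawk' needs the first form, because the text it wants to feed to
 'gawk' is the value of the variable, not its name.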
 
    The shell construct '$(...)' is called "command substitution".  The
 output of the command inside the parentheses is substituted into the
 command line.  Because the result is used in a variable assignment, it
 is saved as a single string, even if the results contain whitespace.
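
    The following sketch illustrates the point with a made-up two-line
 value.  It counts how many words the shell sees when the saved value is
 used with and without double quotes; the variable names are invented
 for the example:

```shell
# Command substitution keeps interior newlines but drops trailing ones.
text=$(printf 'line one\nline  two\n\n\n')

# Quoted, the value is a single word with its whitespace intact.
set -- "$text"
quoted_count=$#

# Unquoted, the shell splits the value into separate words.
set -- $text
unquoted_count=$#

printf '%s\n' "$quoted_count" "$unquoted_count"
```

    This is why the script always writes '"$processed_program"' inside
 double quotes when it later uses the value.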
 
    The expanded program is saved in the variable 'processed_program'.
 It's done in these steps:
 
   1. Run 'gawk' with the '@include'-processing program (the value of the
      'expand_prog' shell variable) reading standard input.
 
   2. Standard input is the contents of the user's program, from the
      shell variable 'program'.  Feed its contents to 'gawk' via a here
      document.
 
   3. Save the results of this processing in the shell variable
      'processed_program' by using command substitution.
 
    The last step is to call 'gawk' with the expanded program, along with
 the original options and command-line arguments that the user supplied:
 
      eval gawk $opts -- '"$processed_program"' '"$@"'
 
    The 'eval' command is a shell construct that reruns the shell's
 parsing process.  This keeps things properly quoted.
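
    To see why the second parse matters, consider this sketch.  The
 helper 'count_args' and the sample '-v' assignment are hypothetical;
 'opts' is built the same way the argument loop builds it, with single
 quotes stored inside the string:

```shell
# Build an option string with embedded single quotes, as igawk does.
opts=
opts="$opts -v 'greeting=hello world'"
program='BEGIN { print greeting }'

# count_args is a made-up helper that reports how many arguments it got.
count_args() { echo $#; }

# Without eval, the stored quotes are ordinary characters and the
# value is split at every space.
plain=$(count_args $opts -- "$program")

# With eval, the shell parses the text again, so the stored quotes
# group "greeting=hello world" into one argument.
evaled=$(eval count_args $opts -- '"$program"')

printf '%s\n' "$plain" "$evaled"
```

    The first count is 5 and the second is 4.  The extra single quotes
 around '"$program"' delay its expansion until 'eval' reparses the line,
 so the program text itself is never split or reinterpreted.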
 
    This version of 'igawk' represents the fifth version of this program.
 There are four key simplifications that make the program work better:
 
    * Using '@include' even for the files named with '-f' makes building
      the initial collected 'awk' program much simpler; all the
      '@include' processing can be done once.
 
    * Not trying to save the line read with 'getline' in the 'pathto()'
      function when testing for the file's accessibility for use with the
      main program simplifies things considerably.
 
    * Using a 'getline' loop in the 'BEGIN' rule does it all in one
      place.  It is not necessary to call out to a separate loop for
      processing nested '@include' statements.
 
    * Instead of saving the expanded program in a temporary file, putting
      it in a shell variable avoids some potential security problems.
      This has the disadvantage that the script relies upon more features
      of the 'sh' language, making it harder to follow for those who
      aren't familiar with 'sh'.
 
    Also, this program illustrates that it is often worthwhile to combine
 'sh' and 'awk' programming.  You can usually accomplish quite a
 lot, without having to resort to low-level programming in C or C++, and
 it is frequently easier to do certain kinds of string and argument
 manipulation using the shell than it is in 'awk'.
 
    Finally, 'igawk' shows that it is not always necessary to add new
 features to a program; they can often be layered on top.(3)
 
    ---------- Footnotes ----------
 
    (1) Fully explaining the 'sh' language is beyond the scope of this
 book.  We provide some minimal explanations, but see a good shell
 programming book if you wish to understand things in more depth.
 
    (2) On some very old versions of 'awk', the test 'getline junk < t'
 can loop forever if the file exists but is empty.
 
    (3) 'gawk' does '@include' processing itself in order to support the
 use of 'awk' programs as Web CGI scripts.