gawk: Filetrans Function

 
 10.3.1 Noting Data File Boundaries
 ----------------------------------
 
 The 'BEGIN' and 'END' rules are each executed exactly once, at the
 beginning and end of your 'awk' program, respectively (see
 BEGIN/END).  We (the 'gawk' authors) once had a user who mistakenly
 thought that the 'BEGIN' rules were executed at the beginning of each
 data file and the 'END' rules were executed at the end of each data
 file.
 
    When informed that this was not the case, the user requested that we
 add new special patterns to 'gawk', named 'BEGIN_FILE' and 'END_FILE',
 that would have the desired behavior.  He even supplied us the code to
 do so.
 
    Adding these special patterns to 'gawk' wasn't necessary; the job can
 be done cleanly in 'awk' itself, as illustrated by the following library
 program.  It arranges to call two user-supplied functions, 'beginfile()'
 and 'endfile()', at the beginning and end of each data file.  Besides
 solving the problem in only nine(!)  lines of code, it does so
 _portably_; this works with any implementation of 'awk':
 
      # transfile.awk
      #
      # Give the user a hook for filename transitions
      #
      # The user must supply functions beginfile() and endfile()
      # that each take the name of the file being started or
      # finished, respectively.
 
      FILENAME != _oldfilename {
          if (_oldfilename != "")
              endfile(_oldfilename)
          _oldfilename = FILENAME
          beginfile(FILENAME)
      }
 
      END { endfile(FILENAME) }
 
    This file must be loaded before the user's "main" program, so that
 the rule it supplies is executed first.
 
    This rule relies on 'awk''s 'FILENAME' variable, which automatically
 changes for each new data file.  The current file name is saved in a
 private variable, '_oldfilename'.  If 'FILENAME' does not equal
 '_oldfilename', then a new data file is being processed and it is
 necessary to call 'endfile()' for the old file.  Because 'endfile()'
 should only be called if a file has been processed, the program first
 checks to make sure that '_oldfilename' is not the null string.  The
 program then assigns the current file name to '_oldfilename' and calls
 'beginfile()' for the file.  Because, like all 'awk' variables,
 '_oldfilename' is initialized to the null string, this rule executes
 correctly even for the first data file.
 
    The program also supplies an 'END' rule to do the final processing
 for the last file.  Because this 'END' rule comes before any 'END' rules
 supplied in the "main" program, 'endfile()' is called first.  Once
 again, the value of multiple 'BEGIN' and 'END' rules should be clear.
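
    As an illustration only, the following shell session sketches how
 the library file might be combined with a main program.  The file
 names 'main.awk', 'one.txt', and 'two.txt', and the bodies of
 'beginfile()' and 'endfile()', are hypothetical; they are not part of
 the library:

```shell
# Work in a scratch directory so that nothing else is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# The library file, with the same rules as transfile.awk.
cat > transfile.awk <<'EOF'
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END { endfile(FILENAME) }
EOF

# Hypothetical main program: defines the two hooks and counts records.
cat > main.awk <<'EOF'
function beginfile(name) { print "begin " name; n = 0 }
function endfile(name)   { print "end " name ", records: " n }
{ n++ }
EOF

# Two small sample data files.
printf 'a\nb\n' > one.txt
printf 'c\n'    > two.txt

# The library is named first, so its rule runs before the main rules.
out=$(awk -f transfile.awk -f main.awk one.txt two.txt)
printf '%s\n' "$out"
```

    With these sample files, the session prints 'begin one.txt',
 'end one.txt, records: 2', 'begin two.txt', and
 'end two.txt, records: 1', in that order.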
 
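    If the same data file occurs twice in a row on the command line, then
 'FILENAME' does not change between the two passes, so the rule never
 fires: 'endfile()' and 'beginfile()' are not executed at the end of the
 first pass and at the beginning of the second pass.  The following
 version solves the problem by testing 'FNR', which is reset to one at
 the start of each data file: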
 
      # ftrans.awk --- handle datafile transitions
      #
      # user supplies beginfile() and endfile() functions
 
      FNR == 1 {
          if (_filename_ != "")
              endfile(_filename_)
          _filename_ = FILENAME
          beginfile(FILENAME)
      }
 
      END { endfile(_filename_) }
 
    See Wc Program for an example of how this library function can be
 used and how it simplifies writing the main program.
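
    The difference between the two versions can be seen by naming the
 same data file twice.  The following shell session is a sketch under
 assumed names: 'main.awk' and 'one.txt' are hypothetical, and the hook
 functions simply report what they are given:

```shell
# Work in a scratch directory so that nothing else is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# First version: detects a transition by comparing file names.
cat > transfile.awk <<'EOF'
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}
END { endfile(FILENAME) }
EOF

# Second version: detects a transition by watching for FNR == 1.
cat > ftrans.awk <<'EOF'
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}
END { endfile(_filename_) }
EOF

# Hypothetical main program: defines the two hooks and counts records.
cat > main.awk <<'EOF'
function beginfile(name) { print "begin " name; n = 0 }
function endfile(name)   { print "end " name ", records: " n }
{ n++ }
EOF

printf 'a\nb\n' > one.txt

# Name the same data file twice in a row with each version.
old=$(awk -f transfile.awk -f main.awk one.txt one.txt)
new=$(awk -f ftrans.awk -f main.awk one.txt one.txt)
printf 'transfile.awk:\n%s\n\nftrans.awk:\n%s\n' "$old" "$new"
```

    With 'transfile.awk', the two passes are reported as a single file
 of four records.  With 'ftrans.awk', each pass gets its own
 'beginfile()' and 'endfile()' call and reports two records.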
 
           So Why Does 'gawk' Have 'BEGINFILE' and 'ENDFILE'?
 
    You are probably wondering, if 'beginfile()' and 'endfile()'
 functions can do the job, why does 'gawk' have 'BEGINFILE' and 'ENDFILE'
 patterns?
 
    Good question.  Normally, if 'awk' cannot open a file, this causes an
 immediate fatal error.  In this case, there is no way for a user-defined
 function to deal with the problem, as the mechanism for calling it
 relies on the file being open and at the first record.  Thus, the main
 reason for 'BEGINFILE' is to give you a "hook" to catch files that
 cannot be processed.  'ENDFILE' exists for symmetry, and because it
 provides an easy way to do per-file cleanup processing.  For more
 information, refer to BEGINFILE/ENDFILE.
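
    As a minimal sketch of such a "hook", assuming 'gawk' is installed,
 the following shell session skips a file that cannot be opened instead
 of stopping with a fatal error.  The file names 'missing.txt' and
 'one.txt' are hypothetical, and the session does nothing useful if
 'gawk' is not available:

```shell
# Work in a scratch directory; missing.txt is deliberately not created.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'a\nb\n' > one.txt

if command -v gawk >/dev/null 2>&1; then
    out=$(gawk '
        BEGINFILE {
            # gawk sets ERRNO when the file could not be opened.
            if (ERRNO != "") {
                print "skipping " FILENAME
                nextfile        # move on instead of dying
            }
            n = 0               # per-file setup
        }
        { n++ }
        ENDFILE { print FILENAME ", records: " n }   # per-file cleanup
    ' missing.txt one.txt)
    printf '%s\n' "$out"
else
    echo "gawk is not installed; nothing to demonstrate"
fi
```

    Here the unreadable file is reported and skipped, and processing
 continues with 'one.txt'.  A user-defined 'beginfile()' function could
 not do this, because it is only called once a record has been read.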