gawk: Egrep Program

 
 11.2.2 Searching for Regular Expressions in Files
 -------------------------------------------------
 
 The 'egrep' utility searches files for patterns.  It uses regular
 expressions that are almost identical to those available in 'awk' (see
 Regexp).  You invoke it as follows:
 
      egrep [OPTIONS] 'PATTERN' FILES ...
 
    The PATTERN is a regular expression.  In typical usage, the regular
 expression is quoted to prevent the shell from expanding any of the
 special characters as file name wildcards.  Normally, 'egrep' prints the
 lines that matched.  If multiple file names are provided on the command
 line, each output line is preceded by the name of the file and a colon.
 
    The options to 'egrep' are as follows:
 
 '-c'
      Print out a count of the lines that matched the pattern, instead of
      the lines themselves.
 
 '-s'
      Be silent.  No output is produced and the exit value indicates
      whether the pattern was matched.
 
 '-v'
      Invert the sense of the test.  'egrep' prints the lines that do
      _not_ match the pattern and exits successfully if the pattern is
      not matched.
 
 '-i'
      Ignore case distinctions in both the pattern and the input data.
 
 '-l'
      Only print (list) the names of the files that matched, not the
      lines that matched.
 
 '-e PATTERN'
      Use PATTERN as the regexp to match.  The purpose of the '-e' option
      is to allow patterns that start with a '-'.
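
    The option semantics above can be tried against any POSIX 'grep',
 whose '-E' mode accepts the same '-c', '-v', '-i', '-l', and '-e'
 options.  The following is a minimal sketch under that assumption; the
 sample file names and contents are made up, and '-s' is left out because
 POSIX 'grep' gives it a different meaning:

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'apple\nBanana\n-cherry\n' > "$tmp/fruit.txt"
printf 'plum\n' > "$tmp/other.txt"

count=$(grep -E -c 'an' "$tmp/fruit.txt")        # count matching lines
inverted=$(grep -E -v 'an' "$tmp/fruit.txt")     # lines that do NOT match
nocase=$(grep -E -i 'BANANA' "$tmp/fruit.txt")   # ignore case
dashed=$(grep -E -e '-ch' "$tmp/fruit.txt")      # pattern starting with '-'
# -l lists only the names of files that contain a match.
listed=$(cd "$tmp" && grep -E -l 'an' fruit.txt other.txt)

printf '%s\n' "$count" "$inverted" "$nocase" "$dashed" "$listed"
rm -rf "$tmp"
```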
 
    This version uses the 'getopt()' library function (see Getopt
 Function) and the file transition library program (see Filetrans
 Function).
 
    The program begins with a descriptive comment and then a 'BEGIN' rule
 that processes the command-line arguments with 'getopt()'.  The '-i'
 (ignore case) option is particularly easy with 'gawk'; we just use the
 'IGNORECASE' predefined variable (see Built-in Variables):
 
      # egrep.awk --- simulate egrep in awk
      #
      # Options:
      #    -c    count of lines
      #    -s    silent - use exit value
      #    -v    invert test, success if no match
      #    -i    ignore case
      #    -l    print filenames only
      #    -e    argument is pattern
      #
      # Requires getopt and file transition library functions
 
      BEGIN {
          while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
              if (c == "c")
                  count_only++
              else if (c == "s")
                  no_print++
              else if (c == "v")
                  invert++
              else if (c == "i")
                  IGNORECASE = 1
              else if (c == "l")
                  filenames_only++
              else if (c == "e")
                  pattern = Optarg
              else
                  usage()
          }
 
    Next comes the code that handles the 'egrep'-specific behavior.  If
 no pattern is supplied with '-e', the first nonoption on the command
 line is used.  The 'awk' command-line arguments up to 'ARGV[Optind]' are
 cleared, so that 'awk' won't try to process them as files.  If no files
 are specified, the standard input is used, and if multiple files are
 specified, we make sure to note this so that the file names can precede
 the matched lines in the output:
 
          if (pattern == "")
              pattern = ARGV[Optind++]
 
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
          if (Optind >= ARGC) {
              ARGV[1] = "-"
              ARGC = 2
          } else if (ARGC - Optind > 1)
              do_filenames++
 
      #    if (IGNORECASE)
      #        pattern = tolower(pattern)
      }
 
    The last two lines are commented out, as they are not needed in
 'gawk'.  They should be uncommented if you have to use another version
 of 'awk'.
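
    The effect of clearing 'ARGV' entries can be seen in isolation.  In
 this small sketch (the argument 'not-a-file' and the sample data are
 hypothetical), a 'BEGIN' rule blanks the first argument so that 'awk'
 skips it, and a second command shows '-' standing in for the standard
 input:

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/data.txt"

# ARGV[1] is "not-a-file"; setting it to "" removes it from the file list.
skipped=$(awk 'BEGIN { ARGV[1] = "" } { print }' not-a-file "$tmp/data.txt")

# With no file arguments given, "-" makes awk read the standard input.
stdin_out=$(printf 'three\n' | awk 'BEGIN { ARGV[1] = "-"; ARGC = 2 } { print }')

printf '%s\n' "$skipped" "$stdin_out"
rm -rf "$tmp"
```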
 
    The next set of lines should be uncommented if you are not using
 'gawk'.  This rule translates all the characters in the input line into
 lowercase if the '-i' option is specified.(1)  The rule is commented out
 as it is not necessary with 'gawk':
 
      #{
      #    if (IGNORECASE)
      #        $0 = tolower($0)
      #}
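
    The footnoted bug, and one possible way around it, can be
 demonstrated directly.  This sketch is not part of the program; it
 compares the commented-out approach with matching against a lowercased
 copy, which leaves '$0' intact:

```shell
# Overwriting $0 means the lowercased line is what gets printed.
buggy=$(printf 'Hello World\n' | awk '{ $0 = tolower($0) } /hello/ { print }')

# Matching against tolower($0) leaves the original record untouched.
fixed=$(printf 'Hello World\n' | awk 'tolower($0) ~ /hello/ { print }')

printf '%s\n' "$buggy" "$fixed"
```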
 
    The 'beginfile()' function is called by the rule in 'ftrans.awk' when
 each new file is processed.  In this case, it is very simple; all it
 does is initialize a variable 'fcount' to zero.  'fcount' tracks how
 many lines in the current file matched the pattern.  Naming the
 parameter 'junk' shows we know that 'beginfile()' is called with a
 parameter, but that we're not interested in its value:
 
      function beginfile(junk)
      {
          fcount = 0
      }
 
    The 'endfile()' function is called after each file has been
 processed.  It affects the output only when the user wants a count of
 the number of lines that matched.  'no_print' is true only if the exit
 status is desired.  'count_only' is true if line counts are desired.
 'egrep' therefore only prints line counts if printing and counting are
 enabled.  The output format must be adjusted depending upon the number
 of files to process.  Finally, 'fcount' is added to 'total', so that we
 know the total number of lines that matched the pattern:
 
      function endfile(file)
      {
          if (! no_print && count_only) {
              if (do_filenames)
                  print file ":" fcount
              else
                  print fcount
          }
 
          total += fcount
      }
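
    The 'ftrans.awk' rule that drives these hooks is not shown in this
 section.  The following is a minimal sketch of how such a rule might
 call 'beginfile()' and 'endfile()'; the variable name '_oldfilename' and
 the sample files are assumptions for illustration, and the '/cat/' rule
 stands in for the real matching code:

```shell
tmp=$(mktemp -d)
printf 'cat\ndog\ncat\n' > "$tmp/a.txt"
printf 'dog\n' > "$tmp/b.txt"

counts=$(cd "$tmp" && awk '
function beginfile(junk) { fcount = 0 }
function endfile(file)   { print file ":" fcount; total += fcount }

# Assumed file-transition rule: fires on the first record of each file.
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

/cat/ { fcount++ }      # stand-in for the matching rule

END {
    endfile(FILENAME)   # close out the last file
    print "total:" total
}' a.txt b.txt)

printf '%s\n' "$counts"
rm -rf "$tmp"
```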
 
    The 'BEGINFILE' and 'ENDFILE' special patterns (see
 BEGINFILE/ENDFILE) could be used, but then the program would be
 'gawk'-specific.  Additionally, this example was written before 'gawk'
 acquired 'BEGINFILE' and 'ENDFILE'.
 
    The following rule does most of the work of matching lines.  The
 variable 'matches' is true if the line matched the pattern.  If the user
 wants lines that did not match, the sense of 'matches' is inverted using
 the '!' operator.  'fcount' is incremented with the value of 'matches',
 which is either one or zero, depending upon a successful or unsuccessful
 match.  If the line does not match, the 'next' statement just moves on
 to the next record.
 
    A number of additional tests are made, but they are only done if we
 are not counting lines.  First, if the user only wants the exit status
 ('no_print' is true), then it is enough to know that _one_ line in this
 file matched, and we can skip on to the next file with 'nextfile'.
 Similarly, if we are only printing file names, we can print the file
 name, and then skip to the next file with 'nextfile'.  Finally, each
 line is printed, with a leading file name and colon if necessary:
 
      {
          matches = ($0 ~ pattern)
          if (invert)
              matches = ! matches
 
          fcount += matches    # 1 or 0
 
          if (! matches)
              next
 
          if (! count_only) {
              if (no_print)
                  nextfile
 
              if (filenames_only) {
                  print FILENAME
                  nextfile
              }
 
              if (do_filenames)
                  print FILENAME ":" $0
              else
                  print
          }
      }
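
    The core of this rule can be exercised without the option handling.
 The sketch below wraps the matching and inversion logic in a
 hypothetical shell function, 'match_lines', passing 'invert' and
 'pattern' with '-v' instead of going through 'getopt()':

```shell
# match_lines INVERT PATTERN: filter the standard input (hypothetical helper).
match_lines() {
    awk -v invert="$1" -v pattern="$2" '
    {
        matches = ($0 ~ pattern)   # dynamic regexp match: 1 or 0
        if (invert)
            matches = ! matches
        fcount += matches
        if (matches)
            print
    }
    END { print "count=" (fcount + 0) }'
}

plain=$(printf 'foo\nbar\nfoobar\n' | match_lines 0 '^foo')
flipped=$(printf 'foo\nbar\nfoobar\n' | match_lines 1 '^foo')
printf '%s\n' "$plain" "$flipped"
```

    Note that 'awk' processes escape sequences in '-v' assignments, so a
 pattern containing backslashes would need them doubled in this sketch.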
 
    The 'END' rule takes care of producing the correct exit status.  If
 there are no matches, the exit status is one; otherwise, it is zero:
 
      END {
          exit (total == 0)
      }
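
    Because a comparison in 'awk' yields one or zero, 'exit (total == 0)'
 maps "no matches" to a failing status.  The following sketch shows the
 same idiom in a hypothetical shell function, 'has_match', which is not
 part of the program:

```shell
# has_match PATTERN: exit 0 if any input line matches, 1 otherwise
# (hypothetical helper; 'total' plays the same role as in the program).
has_match() {
    awk -v pattern="$1" '$0 ~ pattern { total++ } END { exit (total == 0) }'
}

if printf 'a\nb\n' | has_match 'b'; then found=yes; else found=no; fi
if printf 'a\nb\n' | has_match 'z'; then missing=no; else missing=yes; fi
echo "found=$found missing=$missing"
```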
 
    The 'usage()' function prints a usage message in case of invalid
 options, and then exits:
 
      function usage()
      {
          print("Usage: egrep [-csvil] [-e pat] [files ...]") > "/dev/stderr"
          print("\n\tegrep [-csvil] pat [files ...]") > "/dev/stderr"
          exit 1
      }
 
    ---------- Footnotes ----------
 
    (1) It also introduces a subtle bug; if a match happens, we output
 the translated line, not the original.