gawk: Wc Program

 
 11.2.7 Counting Things
 ----------------------
 
 The 'wc' (word count) utility counts lines, words, and characters in one
 or more input files.  Its usage is as follows:
 
      'wc' ['-lwc'] [FILES ...]
 
    If no files are specified on the command line, 'wc' reads its
 standard input.  If there are multiple files, it also prints total
 counts for all the files.  The options and their meanings are as
 follows:
 
 '-l'
      Count only lines.
 
 '-w'
      Count only words.  A "word" is a contiguous sequence of
      nonwhitespace characters, separated by spaces and/or TABs.
      Luckily, this is the normal way 'awk' separates fields in its input
      data.
 
 '-c'
      Count only characters.
 
    Implementing 'wc' in 'awk' is particularly elegant, because 'awk'
 does a lot of the work for us; it splits lines into words (i.e., fields)
 and counts them, it counts lines (i.e., records), and it can easily tell
 us how long a line is.
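
    As a small illustration of the built-in bookkeeping described above,
 the following sketch prints the record number ('NR'), the field count
 ('NF'), and the record length for each line of some sample input.  The
 sample text and the shell variable name 'field_report' are made up for
 this example:

```shell
# Sample input is hypothetical: two spaces and a TAB separate the
# words on the first line, to show default field splitting at work.
field_report=$(printf 'one  two\tthree\nfour\n' |
    awk '{ print NR, NF, length($0) }')
printf '%s\n' "$field_report"
```

    Runs of spaces and TABs count as a single separator, so the first
 line has three fields even though its words are unevenly spaced.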
 
    This program uses the 'getopt()' library function (see Getopt
 Function) and the file-transition functions (see Filetrans Function).
 
    This version has one notable difference from traditional versions of
 'wc': it always prints the counts in the order lines, words, and
 characters.  Traditional versions note the order of the '-l', '-w', and
 '-c' options on the command line, and print the counts in that order.
 
    The 'BEGIN' rule does the argument processing.  The variable
 'print_total' is true if more than one file is named on the command
 line:
 
      # wc.awk --- count lines, words, characters
 
      # Options:
      #    -l    only count lines
      #    -w    only count words
      #    -c    only count characters
      #
      # Default is to count lines, words, characters
      #
      # Requires getopt() and file transition library functions
 
      BEGIN {
          # Let getopt() print a message about any
          # invalid options; we otherwise ignore them.
          while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
              if (c == "l")
                  do_lines = 1
              else if (c == "w")
                  do_words = 1
              else if (c == "c")
                  do_chars = 1
          }
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
 
          # if no options, do all
          if (! do_lines && ! do_words && ! do_chars)
              do_lines = do_words = do_chars = 1
 
          print_total = (ARGC - i > 1)
      }
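
    The effect of blanking 'ARGV' entries and computing 'print_total'
 can be tried in isolation.  The sketch below is not the program's own
 code: it assumes a hand-rolled scan for leading option arguments in
 place of 'getopt()', and the shell function name 'count_args' and the
 file names are hypothetical.  The index 'i' plays the role that
 'Optind' plays in the real program:

```shell
# count_args is a hypothetical helper; it replaces getopt() with a
# simple scan of leading arguments made up of the letters l, w, and c.
count_args() {
    awk 'BEGIN {
        i = 1
        while (i < ARGC && ARGV[i] ~ /^-[lwc]+$/) {
            ARGV[i] = ""    # blank the option so awk skips it as a file
            i++
        }
        # i now indexes the first file name, as Optind would
        print_total = (ARGC - i > 1)
        print ARGC - i, print_total
    }' "$@"
}

count_args -l a.txt b.txt    # two files remain, so a total is wanted
count_args -w a.txt          # one file remains, so no total
```

    Because the program has only a 'BEGIN' rule, the named files are
 never opened; only the argument arithmetic is exercised.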
 
    The 'beginfile()' function is simple; it just resets the counts of
 lines, words, and characters to zero, and saves the current file name in
 'fname':
 
      function beginfile(file)
      {
          lines = words = chars = 0
          fname = FILENAME
      }
 
    The 'endfile()' function adds the current file's numbers to the
 running totals of lines, words, and characters.  It then prints out
 those numbers for the file that was just read.  It relies on
 'beginfile()' to reset the numbers for the following data file:
 
      function endfile(file)
      {
          tlines += lines
          twords += words
          tchars += chars
          if (do_lines)
              printf "\t%d", lines
          if (do_words)
              printf "\t%d", words
          if (do_chars)
              printf "\t%d", chars
          printf "\t%s\n", fname
      }
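
    To see how the two hooks cooperate, here is a minimal sketch that
 counts only lines.  It assumes a simple driver rule that calls
 'endfile()' and 'beginfile()' whenever 'FILENAME' changes; the library
 the program actually loads may differ in detail.  The temporary
 directory, the file names, and the output format are invented for the
 example:

```shell
# Create two small, hypothetical data files in a scratch directory.
demo_dir=$(mktemp -d)
printf 'a b\nc\n' > "$demo_dir/one.txt"
printf 'd e f\n'  > "$demo_dir/two.txt"

per_file=$(cd "$demo_dir" && awk '
function beginfile(file) { lines = 0 }
function endfile(file)   { tlines += lines; printf "%s=%d\n", file, lines }

# Assumed driver: fire the hooks whenever FILENAME changes.
FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

{ lines++ }

END {
    endfile(FILENAME)          # the last file has no successor
    printf "total=%d\n", tlines
}' one.txt two.txt)

rm -rf "$demo_dir"
printf '%s\n' "$per_file"
```

    The driver rule must appear before the counting rule, so that the
 reset in 'beginfile()' happens before the first line of each file is
 counted.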
 
    There is one rule that is executed for each line.  It adds the length
 of the record, plus one, to 'chars'.(1)  The extra one is needed
 because the newline character separating records (the value of 'RS')
 is not part of the record itself, and thus is not included in its
 length.  Next, 'lines' is incremented for each line read, and 'words'
 is incremented by the value of 'NF', which is the number of "words" on
 this line:
 
      # do per line
      {
          chars += length($0) + 1    # get newline
          lines++
          words += NF
      }
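
    The per-line rule can be checked by hand on a small input.  This
 sketch runs the same three statements over two made-up lines and
 prints the totals in an 'END' rule; the shell variable 'counts' is
 just a name chosen for the example:

```shell
# "hello world" is 11 characters plus a newline; "foo" is 3 plus a newline.
counts=$(printf 'hello world\nfoo\n' |
    awk '{ chars += length($0) + 1; lines++; words += NF }
         END { print lines, words, chars }')
printf '%s\n' "$counts"
```

    For this ASCII input the result is 2 lines, 3 words, and 16
 characters (12 for the first line and 4 for the second).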
 
    Finally, the 'END' rule simply prints the totals for all the files:
 
      END {
          if (print_total) {
              if (do_lines)
                  printf "\t%d", tlines
              if (do_words)
                  printf "\t%d", twords
              if (do_chars)
                  printf "\t%d", tchars
              print "\ttotal"
          }
      }
 
    ---------- Footnotes ----------
 
    (1) Because 'gawk' understands multibyte locales, this code counts
 characters, not bytes.