gawk: Uniq Program

 
 11.2.6 Printing Nonduplicated Lines of Text
 -------------------------------------------
 
 The 'uniq' utility reads sorted lines of data on its standard input, and
 by default removes duplicate lines.  In other words, it only prints
 unique lines--hence the name.  'uniq' has a number of options.  The
 usage is as follows:
 
      'uniq' ['-udc' ['-N']] ['+N'] [INPUTFILE [OUTPUTFILE]]
 
    The options for 'uniq' are:
 
 '-d'
      Print only repeated (duplicated) lines.
 
 '-u'
      Print only nonrepeated (unique) lines.
 
 '-c'
      Count lines.  This option overrides '-d' and '-u'.  Both repeated
      and nonrepeated lines are counted.
 
 '-N'
      Skip N fields before comparing lines.  The definition of fields is
      similar to 'awk''s default: nonwhitespace characters separated by
      runs of spaces and/or TABs.
 
 '+N'
      Skip N characters before comparing lines.  Any fields specified
      with '-N' are skipped first.
 
 'INPUTFILE'
      Data is read from the input file named on the command line, instead
      of from the standard input.
 
 'OUTPUTFILE'
      The generated output is sent to the named output file, instead of
      to the standard output.
 
    Normally 'uniq' behaves as if both the '-d' and '-u' options are
 provided.
 
    'uniq' uses the 'getopt()' library function (see Getopt Function)
 and the 'join()' library function (see Join Function).
 
    The program begins with a 'usage()' function and then a brief outline
 of the options and their meanings in comments.  The 'BEGIN' rule deals
 with the command-line arguments and options.  It uses a trick to get
 'getopt()' to handle options of the form '-25', treating such an option
 as the option letter '2' with an argument of '5'.  If indeed two or more
 digits are supplied ('Optarg' looks like a number), 'Optarg' is
 concatenated with the option digit and then the result is added to zero
 to make it into a number.  If there is only one digit in the option,
 then 'Optarg' is not needed.  In this case, 'Optind' must be decremented
 so that 'getopt()' processes it next time.  This code is admittedly a
 bit tricky.
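
    The digit handling can be tried in isolation.  The following is a
 minimal sketch, not part of 'uniq.awk': the shell function
 'field_count' is a hypothetical helper that receives the option digit
 and the value 'getopt()' would have placed in 'Optarg', and applies the
 same test and concatenation as the 'BEGIN' rule.  It does not call
 'getopt()' itself, so the 'Optind' adjustment is not shown.

```shell
# Hypothetical helper: turn an option digit and an Optarg value into a
# field count, using the same test as the BEGIN rule of uniq.awk.
field_count() {
    awk -v c="$1" -v Optarg="$2" 'BEGIN {
        if (Optarg ~ /^[0-9]+$/)
            fcount = (c Optarg) + 0    # "-25": "2" and "5" become 25
        else
            fcount = c + 0             # "-2 file": Optarg is the next argument
        print fcount
    }'
}

field_count 2 5          # option given as -25
field_count 2 data.txt   # option given as -2, followed by a file name
```

    The first call prints 25 and the second prints 2.  In the real
 program the second case is also where 'Optind' is decremented, so that
 the file name is not consumed as an option argument.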
 
    If no options are supplied, then the default is taken, to print both
 repeated and nonrepeated lines.  The output file, if provided, is
 assigned to 'outputfile'.  Early on, 'outputfile' is initialized to the
 standard output, '/dev/stdout':
 
      # uniq.awk --- do uniq in awk
      #
      # Requires getopt() and join() library functions
 
      function usage()
      {
          print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr"
          exit 1
      }
 
      # -c    count lines. overrides -d and -u
      # -d    only repeated lines
      # -u    only nonrepeated lines
      # -n    skip n fields
      # +n    skip n characters, skip fields first
 
      BEGIN {
          count = 1
          outputfile = "/dev/stdout"
          opts = "udc0:1:2:3:4:5:6:7:8:9:"
          while ((c = getopt(ARGC, ARGV, opts)) != -1) {
              if (c == "u")
                  non_repeated_only++
              else if (c == "d")
                  repeated_only++
              else if (c == "c")
                  do_count++
              else if (index("0123456789", c) != 0) {
                  # getopt() requires args to options
                  # this messes us up for things like -5
                  if (Optarg ~ /^[[:digit:]]+$/)
                      fcount = (c Optarg) + 0
                  else {
                      fcount = c + 0
                      Optind--
                  }
              } else
                  usage()
          }
 
          if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
              charcount = substr(ARGV[Optind], 2) + 0
              Optind++
          }
 
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
 
          if (repeated_only == 0 && non_repeated_only == 0)
              repeated_only = non_repeated_only = 1
 
          if (ARGC - Optind == 2) {
              outputfile = ARGV[ARGC - 1]
              ARGV[ARGC - 1] = ""
          }
      }
 
    The following function, 'are_equal()', compares the current line,
 '$0', to the previous line, 'last'.  It handles skipping fields and
 characters.  If no field count and no character count are specified,
 'are_equal()' returns one or zero depending upon the result of a simple
 string comparison of 'last' and '$0'.
 
    Otherwise, things get more complicated.  If fields have to be
 skipped, each line is broken into an array using 'split()' (see String
 Functions); the desired fields are then joined back into a line using
 'join()'.  The joined lines are stored in 'clast' and 'cline'.  If no
 fields are skipped, 'clast' and 'cline' are set to 'last' and '$0',
 respectively.  Finally, if characters are skipped, 'substr()' is used to
 strip off the leading 'charcount' characters in 'clast' and 'cline'.
 The two strings are then compared and 'are_equal()' returns the result:
 
      function are_equal(    n, m, clast, cline, alast, aline)
      {
          if (fcount == 0 && charcount == 0)
              return (last == $0)
 
          if (fcount > 0) {
              n = split(last, alast)
              m = split($0, aline)
              clast = join(alast, fcount+1, n)
              cline = join(aline, fcount+1, m)
          } else {
              clast = last
              cline = $0
          }
          if (charcount) {
              clast = substr(clast, charcount + 1)
              cline = substr(cline, charcount + 1)
          }
 
          return (clast == cline)
      }
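
    To see what 'are_equal()' actually compares, it can help to print
 the comparison key for a single line.  The sketch below is an
 illustration under stated assumptions: 'compare_key' and 'join_fields'
 are hypothetical names, and 'join_fields' is only a simplified stand-in
 for the library 'join()' function that always uses a single space as
 the separator.

```shell
# compare_key LINE FCOUNT CHARCOUNT
# Print the part of LINE that would take part in the comparison.
compare_key() {
    printf '%s\n' "$1" | awk -v fcount="$2" -v charcount="$3" '
    # Hypothetical stand-in for join(): glue a[start..end] with spaces
    function join_fields(a, start, end,    i, result) {
        if (start > end)
            return ""
        result = a[start]
        for (i = start + 1; i <= end; i++)
            result = result " " a[i]
        return result
    }
    {
        key = $0
        if (fcount > 0) {                    # skip leading fields first
            n = split($0, parts)
            key = join_fields(parts, fcount + 1, n)
        }
        if (charcount > 0)                   # then skip leading characters
            key = substr(key, charcount + 1)
        print key
    }'
}

compare_key "10:01 alice login" 1 0
compare_key "10:02 alice login" 1 0
compare_key "xxalice" 0 2
```

    The first two calls both print 'alice login', so those two lines
 would count as duplicates when one field is skipped.  The third call
 prints 'alice'.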
 
    The following two rules are the body of the program.  The first one
 is executed only for the very first line of data.  It sets 'last' equal
 to '$0', so that subsequent lines of text have something to be compared
 to.
 
    The second rule does the work.  The variable 'equal' is one or zero,
 depending upon the results of 'are_equal()''s comparison.  If 'uniq' is
 counting lines ('-c') and the lines are equal, then it increments the
 'count' variable.  Otherwise, it prints the count and the previous
 line, saves the current line in 'last', and resets 'count', because the
 two lines are not equal.
 
    If 'uniq' is not counting, and if the lines are equal, 'count' is
 incremented.  Nothing is printed, as the point is to remove duplicates.
 Otherwise, if 'uniq' is printing repeated lines and more than one line
 was seen, or if 'uniq' is printing nonrepeated lines and only one line
 was seen, then the previous line is printed, and 'count' is reset.
 
    Finally, similar logic is used in the 'END' rule to print the final
 line of input data:
 
      NR == 1 {
          last = $0
          next
      }
 
      {
          equal = are_equal()
 
          if (do_count) {    # overrides -d and -u
              if (equal)
                  count++
              else {
                  printf("%4d %s\n", count, last) > outputfile
                  last = $0
                  count = 1    # reset
              }
              next
          }
 
          if (equal)
              count++
          else {
              if ((repeated_only && count > 1) ||
                  (non_repeated_only && count == 1))
                      print last > outputfile
              last = $0
              count = 1
          }
      }
 
      END {
          if (do_count)
              printf("%4d %s\n", count, last) > outputfile
          else if ((repeated_only && count > 1) ||
                  (non_repeated_only && count == 1))
              print last > outputfile
          close(outputfile)
      }
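
    The four output modes can be compared side by side with a condensed
 sketch of the rules above.  This is not the program itself: 'mini_uniq'
 and its mode names ('default', 'd', 'u', 'c') are hypothetical, option
 parsing and field or character skipping are left out, and the 'NR > 0'
 guard in the 'END' rule is an addition of this sketch so that empty
 input produces no output.

```shell
# mini_uniq MODE    where MODE is one of: default, d, u, c
# Reads sorted lines on standard input, writes to standard output.
mini_uniq() {
    awk -v mode="$1" '
    # Print the finished group of identical lines according to the mode
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if ((mode != "u" && count > 1) ||
                 (mode != "d" && count == 1))
            print last
    }
    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               { emit(); last = $0; count = 1 }
    END        { if (NR > 0) emit() }
    '
}

printf 'a\na\nb\nc\nc\nc\n' | mini_uniq default
printf 'a\na\nb\nc\nc\nc\n' | mini_uniq d
printf 'a\na\nb\nc\nc\nc\n' | mini_uniq u
printf 'a\na\nb\nc\nc\nc\n' | mini_uniq c
```

    With this input, the default mode prints 'a', 'b', and 'c'; mode 'd'
 prints 'a' and 'c'; mode 'u' prints only 'b'; and mode 'c' prints each
 of the three lines preceded by its count (2, 1, and 3).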