gawk: Cut Program

 
 11.2.1 Cutting Out Fields and Columns
 -------------------------------------
 
 The 'cut' utility selects, or "cuts," characters or fields from its
 standard input and sends them to its standard output.  Fields are
 separated by TABs by default, but you may supply a command-line option
 to change the field "delimiter" (i.e., the field-separator character).
 'cut''s definition of fields is less general than 'awk''s.
 
    A common use of 'cut' might be to pull out just the login names of
 logged-on users from the output of 'who'.  For example, the following
 pipeline generates a sorted, unique list of the logged-on users:
 
      who | cut -c1-8 | sort | uniq
 
    The options for 'cut' are:
 
 '-c LIST'
      Use LIST as the list of characters to cut out.  Items within the
      list may be separated by commas, and ranges of characters can be
      separated with dashes.  The list '1-8,15,22-35' specifies
      characters 1 through 8, 15, and 22 through 35.
 
 '-f LIST'
      Use LIST as the list of fields to cut out.
 
 '-d DELIM'
      Use DELIM as the field-separator character instead of the TAB
      character.
 
 '-s'
      Suppress printing of lines that do not contain the field delimiter.
 
    The 'awk' implementation of 'cut' uses the 'getopt()' library
 function (SeeGetopt Function) and the 'join()' library function
 (SeeJoin Function).
 
    The program begins with a comment describing the options, the library
 functions needed, and a 'usage()' function that prints out a usage
 message and exits.  'usage()' is called if invalid arguments are
 supplied:
 
      # cut.awk --- implement cut in awk
 
      # Options:
      #    -f list     Cut fields
      #    -d c        Field delimiter character
      #    -c list     Cut characters
      #
      #    -s          Suppress lines without the delimiter
      #
      # Requires getopt() and join() library functions
 
      function usage()
      {
          print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr"
          print("usage: cut [-c list] [files...]") > "/dev/stderr"
          exit 1
      }
 
    Next comes a 'BEGIN' rule that parses the command-line options.  It
 sets 'FS' to a single TAB character, because that is 'cut''s default
 field separator.  The rule then sets the output field separator to be
 the same as the input field separator.  A loop using 'getopt()' steps
 through the command-line options.  Exactly one of the variables
 'by_fields' or 'by_chars' is set to true, to indicate that processing
 should be done by fields or by characters, respectively.  When cutting
 by characters, the output field separator is set to the null string:
 
      BEGIN {
          FS = "\t"    # default
          OFS = FS
          while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
              if (c == "f") {
                  by_fields = 1
                  fieldlist = Optarg
              } else if (c == "c") {
                  by_chars = 1
                  fieldlist = Optarg
                  OFS = ""
              } else if (c == "d") {
                  if (length(Optarg) > 1) {
                      printf("cut: using first character of %s" \
                             " for delimiter\n", Optarg) > "/dev/stderr"
                      Optarg = substr(Optarg, 1, 1)
                  }
                  fs = FS = Optarg
                  OFS = FS
                  if (FS == " ")    # defeat awk semantics
                      FS = "[ ]"
              } else if (c == "s")
                  suppress = 1
              else
                  usage()
          }
 
          # Clear out options
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
 
    The code must take special care when the field delimiter is a space.
 Using a single space ('" "') for the value of 'FS' is incorrect--'awk'
 would separate fields with runs of spaces, TABs, and/or newlines, and we
 want them to be separated with individual spaces.  To this end, we save
 the original space character in the variable 'fs' for later use; after
 setting 'FS' to '"[ ]"' we can't use it directly to see if the field
 delimiter character is in the string.
 
    Also remember that after 'getopt()' is through (as described in See
 Getopt Function), we have to clear out all the elements of 'ARGV' from
 1 to 'Optind', so that 'awk' does not try to process the command-line
 options as file names.
 
    After dealing with the command-line options, the program verifies
 that the options make sense.  Only one or the other of '-c' and '-f'
 should be used, and both require a field list.  Then the program calls
 either 'set_fieldlist()' or 'set_charlist()' to pull apart the list of
 fields or characters:
 
          if (by_fields && by_chars)
              usage()
 
          if (by_fields == 0 && by_chars == 0)
              by_fields = 1    # default
 
          if (fieldlist == "") {
              print "cut: needs list for -c or -f" > "/dev/stderr"
              exit 1
          }
 
          if (by_fields)
              set_fieldlist()
          else
              set_charlist()
      }
 
    'set_fieldlist()' splits the field list apart at the commas into an
 array.  Then, for each element of the array, it looks to see if the
 element is actually a range, and if so, splits it apart.  The function
 checks the range to make sure that the first number is smaller than the
 second.  Each number in the list is added to the 'flist' array, which
 simply lists the fields that will be printed.  Normal field splitting is
 used.  The program lets 'awk' handle the job of doing the field
 splitting:
 
      function set_fieldlist(        n, m, i, j, k, f, g)
      {
          n = split(fieldlist, f, ",")
          j = 1    # index in flist
          for (i = 1; i <= n; i++) {
              if (index(f[i], "-") != 0) { # a range
                  m = split(f[i], g, "-")
                  if (m != 2 || g[1] >= g[2]) {
                      printf("cut: bad field list: %s\n",
                                        f[i]) > "/dev/stderr"
                      exit 1
                  }
                  for (k = g[1]; k <= g[2]; k++)
                      flist[j++] = k
              } else
                  flist[j++] = f[i]
          }
          nfields = j - 1
      }
 
    The 'set_charlist()' function is more complicated than
 'set_fieldlist()'.  The idea here is to use 'gawk''s 'FIELDWIDTHS'
 variable (SeeConstant Size), which describes constant-width input.
 When using a character list, that is exactly what we have.
 
    Setting up 'FIELDWIDTHS' is more complicated than simply listing the
 fields that need to be printed.  We have to keep track of the fields to
 print and also the intervening characters that have to be skipped.  For
 example, suppose you wanted characters 1 through 8, 15, and 22 through
 35.  You would use '-c 1-8,15,22-35'.  The necessary value for
 'FIELDWIDTHS' is '"8 6 1 6 14"'.  This yields five fields, and the
 fields to print are '$1', '$3', and '$5'.  The intermediate fields are
 "filler", which is stuff in between the desired data.  'flist' lists the
 fields to print, and 't' tracks the complete field list, including
 filler fields:
 
      function set_charlist(    field, i, j, f, g, n, m, t,
                                filler, last, len)
      {
          field = 1   # count total fields
          n = split(fieldlist, f, ",")
          j = 1       # index in flist
          for (i = 1; i <= n; i++) {
              if (index(f[i], "-") != 0) { # range
                  m = split(f[i], g, "-")
                  if (m != 2 || g[1] >= g[2]) {
                      printf("cut: bad character list: %s\n",
                                     f[i]) > "/dev/stderr"
                      exit 1
                  }
                  len = g[2] - g[1] + 1
                  if (g[1] > 1)  # compute length of filler
                      filler = g[1] - last - 1
                  else
                      filler = 0
                  if (filler)
                      t[field++] = filler
                  t[field++] = len  # length of field
                  last = g[2]
                  flist[j++] = field - 1
              } else {
                  if (f[i] > 1)
                      filler = f[i] - last - 1
                  else
                      filler = 0
                  if (filler)
                      t[field++] = filler
                  t[field++] = 1
                  last = f[i]
                  flist[j++] = field - 1
              }
          }
          FIELDWIDTHS = join(t, 1, field - 1)
          nfields = j - 1
      }
 
    Next is the rule that processes the data.  If the '-s' option is
 given, then 'suppress' is true.  The first 'if' statement makes sure
 that the input record does have the field separator.  If 'cut' is
 processing fields, 'suppress' is true, and the field separator character
 is not in the record, then the record is skipped.
 
    If the record is valid, then 'gawk' has split the data into fields,
 either using the character in 'FS' or using fixed-length fields and
 'FIELDWIDTHS'.  The loop goes through the list of fields that should be
 printed.  The corresponding field is printed if it contains data.  If
 the next field also has data, then the separator character is written
 out between the fields:
 
      {
          if (by_fields && suppress && index($0, fs) == 0)
              next
 
          for (i = 1; i <= nfields; i++) {
              if ($flist[i] != "") {
                  printf "%s", $flist[i]
                  if (i < nfields && $flist[i+1] != "")
                      printf "%s", OFS
              }
          }
          print ""
      }
 
    This version of 'cut' relies on 'gawk''s 'FIELDWIDTHS' variable to do
 the character-based cutting.  It is possible in other 'awk'
 implementations to use 'substr()' (SeeString Functions), but it is
 also extremely painful.  The 'FIELDWIDTHS' variable supplies an elegant
 solution to the problem of picking the input line apart by characters.