Info: (gawk) Split Program

 
 11.2.4 Splitting a Large File into Pieces
 -----------------------------------------
 
 The 'split' program splits large text files into smaller pieces.  Usage
 is as follows:(1)
 
      'split' ['-COUNT'] [FILE] [PREFIX]
 
    By default, the output files are named 'xaa', 'xab', and so on.  Each
 file has 1,000 lines in it, with the likely exception of the last file.
 To change the number of lines in each file, supply a number on the
 command line preceded by a minus sign (e.g., '-500' for files with 500
 lines in them instead of 1,000).  To change the names of the output
 files to something like 'myfileaa', 'myfileab', and so on, supply an
 additional argument that specifies the file name prefix.
 
    Here is a version of 'split' in 'awk'.  It uses the 'ord()' and
 'chr()' functions presented in Ordinal Functions.
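
    For readers who do not have those library functions at hand, the
 following is a minimal sketch of what they might look like.  It assumes
 a single-byte character set with codes 0 through 255; the table-based
 approach and the private names '_ord_' and '_ord_init()' are
 assumptions made for this sketch and are not guaranteed to match the
 library version:

```awk
# Sketch of ord() and chr(); assumes a single-byte character set
# with codes 0 through 255.

BEGIN { _ord_init() }

# _ord_init --- build a table mapping each character to its code
function _ord_init(    i, t)
{
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}

# ord --- return the numeric code of the first character of str
function ord(str,    c)
{
    c = substr(str, 1, 1)
    return _ord_[c]
}

# chr --- return the character whose numeric code is c
function chr(c)
{
    # adding 0 forces c to be numeric, so %c treats it as a code
    return sprintf("%c", c + 0)
}

# Demonstration only: the letter after "a"
BEGIN { print chr(ord("a") + 1) }
```

    On an ASCII-based system, the demonstration rule prints 'b', which
 is exactly the step 'split' uses to advance a suffix letter.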
 
    The program first sets its defaults, and then tests to make sure
 there are not too many arguments.  It then looks at each argument in
 turn.  The first argument could be a minus sign followed by a number.
 If it is, this happens to look like a negative number, so it is made
 positive, and that is the count of lines.  The data file name is skipped
 over and the final argument is used as the prefix for the output file
 names.  The count and prefix arguments are set to the empty string in
 'ARGV' so that 'awk' does not try to read them as input files:
 
      # split.awk --- do split in awk
      #
      # Requires ord() and chr() library functions
      # usage: split [-count] [file] [outname]
 
      BEGIN {
          outfile = "x"    # default
          count = 1000
          if (ARGC > 4)
              usage()
 
          i = 1
          if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) {
              count = -ARGV[i]
              ARGV[i] = ""
              i++
          }
          # test argv in case reading from stdin instead of file
          if (i in ARGV)
              i++    # skip datafile name
          if (i in ARGV) {
              outfile = ARGV[i]
              ARGV[i] = ""
          }
          s1 = s2 = "a"
          out = (outfile s1 s2)
      }
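
    The count handling relies on 'awk' converting a string that looks
 like a number into that number.  The following sketch isolates that
 one step; the variable 'arg' is a hypothetical stand-in for 'ARGV[1]'
 and is not part of the program:

```awk
BEGIN {
    arg = "-500"    # hypothetical stand-in for ARGV[1]
    count = 1000    # default, as in split.awk

    if (arg ~ /^-[[:digit:]]+$/)
        count = -arg    # "-500" converts to -500; negation gives 500

    print count
}
```

    This prints '500'.  A value such as '-5x' fails the regexp test, so
 'count' would keep its default of 1,000.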
 
    The next rule does most of the work.  'tcount' (temporary count)
 tracks how many lines have been printed to the output file so far.  If
 it is greater than 'count', it is time to close the current file and
 start a new one.  's1' and 's2' track the current suffixes for the file
 name.  Normally, 's2' simply moves to the next letter in the alphabet.
 If 's2' is already 'z', 's1' moves to the next letter and 's2' starts
 over again at 'a'.  If they are both 'z', the file is just too big:
 
      {
          if (++tcount > count) {
              close(out)
              if (s2 == "z") {
                  if (s1 == "z") {
                      printf("split: %s is too large to split\n",
                             FILENAME) > "/dev/stderr"
                      exit 1
                  }
                  s1 = chr(ord(s1) + 1)
                  s2 = "a"
              }
              else
                  s2 = chr(ord(s2) + 1)
              out = (outfile s1 s2)
              tcount = 1
          }
          print > out
      }
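
    To see the order in which the names are produced, and how many are
 available, the following sketch enumerates the suffixes with two nested
 loops instead of the rule's incremental logic.  The 'letters' string,
 the loop variables, and the choice of which names to print are
 assumptions made for illustration:

```awk
BEGIN {
    letters = "abcdefghijklmnopqrstuvwxyz"
    prefix = "x"    # default prefix, as in split.awk
    n = 0

    for (i = 1; i <= 26; i++)          # first suffix letter (s1)
        for (j = 1; j <= 26; j++) {    # second suffix letter (s2)
            name = prefix substr(letters, i, 1) substr(letters, j, 1)
            n++
            # show only the start, the first rollover, and the end
            if (n <= 2 || n == 26 || n == 27 || n == 676)
                print n, name
        }
    print "total:", n
}
```

    The output is:

      1 xaa
      2 xab
      26 xaz
      27 xba
      676 xzz
      total: 676

 With the default count of 1,000 lines per file, the program can
 therefore split an input of at most 676,000 lines before it reports
 that the file is too large.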
 
 The 'usage()' function simply prints an error message and exits:
 
      function usage()
      {
          print("usage: split [-count] [file] [outname]") > "/dev/stderr"
          exit 1
      }
 
    This program is a bit sloppy; it relies on 'awk' to automatically
 close the last file instead of doing it in an 'END' rule.  It also
 assumes that letters are contiguous in the character set, which isn't
 true for EBCDIC systems.
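
    Both points could be addressed along the following lines.  This is
 only a sketch: 'next_letter()' is a hypothetical helper, not a library
 function, and the fragment is meant to be merged into 'split.awk'
 rather than run on its own:

```awk
# next_letter --- hypothetical helper: return the letter following c
# without assuming that letter codes are contiguous

function next_letter(c,    letters)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    return substr(letters, index(letters, c) + 1, 1)
}

# Close the last output file explicitly

END {
    close(out)
}
```

    In the main rule, 's1 = chr(ord(s1) + 1)' would become
 's1 = next_letter(s1)', and likewise for 's2'.  'next_letter("z")'
 returns the empty string, but the rule tests for 'z' before advancing
 either suffix, so that case is not reached.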
 
    ---------- Footnotes ----------
 
    (1) This is the traditional usage.  The POSIX usage is different, but
 not relevant for what the program aims to demonstrate.