 11.3.5 Generating Word-Usage Counts
 -----------------------------------
 
 When working with large amounts of text, it can be interesting to know
 how often different words appear.  For example, an author may overuse
 certain words, in which case he or she might wish to find synonyms to
 substitute for words that appear too often.  This node develops a
 program for counting words and presenting the frequency information in a
 useful format.
 
    At first glance, a program like this would seem to do the job:
 
      # wordfreq-first-try.awk --- print list of word frequencies
 
      {
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
 
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }
 
    The program relies on 'awk''s default field-splitting mechanism to
 break each line up into "words" and uses an associative array named
 'freq', indexed by each word, to count the number of times the word
 occurs.  In the 'END' rule, it prints the counts.
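
    For instance, here is one (hypothetical) way to exercise the same
 counting technique directly from the command line, feeding it a single
 line of text:

       $ echo "to be or not to be" |
       > awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
       >      END { for (word in freq) printf "%s\t%d\n", word, freq[word] }'

 The order in which 'for (word in freq)' visits the array elements is
 not specified, which is one of the problems discussed next.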
 
    This program has several problems that prevent it from being useful
 on real text files:
 
    * The 'awk' language considers upper- and lowercase characters to be
      distinct.  Therefore, "bartender" and "Bartender" are not treated
      as the same word.  This is undesirable, because words are
      capitalized if they begin sentences in normal text, and a frequency
      analyzer should not be sensitive to capitalization.
 
    * Words are detected using the 'awk' convention that fields are
      separated just by whitespace.  Other characters in the input
      (except newlines) don't have any special meaning to 'awk'.  This
      means that punctuation characters count as part of words.
 
    * The output does not come out in any useful order.  You're more
      likely to be interested in which words occur most frequently or in
      having an alphabetized table of how frequently each word occurs.
 
    The first problem can be solved by using 'tolower()' to remove case
 distinctions.  The second problem can be solved by using 'gsub()' to
 remove punctuation characters.  Finally, we solve the third problem by
 using the system 'sort' utility to process the output of the 'awk'
 script.  Here is the new version of the program:
 
      # wordfreq.awk --- print list of word frequencies
 
      {
          $0 = tolower($0)    # remove case distinctions
          # remove punctuation
          gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
 
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }
 
    The regexp '/[^[:alnum:]_[:blank:]]/' might have been written
 '/[[:punct:]]/', but then underscores would also be removed, and we want
 to keep them.
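
    A small illustration of the difference, assuming a POSIX-aware
 'awk' that understands bracketed character classes:

       $ echo "foo_bar, baz." | awk '{ gsub(/[[:punct:]]/, ""); print }'
       -| foobar baz
       $ echo "foo_bar, baz." | awk '{ gsub(/[^[:alnum:]_[:blank:]]/, ""); print }'
       -| foo_bar baz

 The first command strips the underscore along with the punctuation; the
 second keeps it.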
 
    Assuming we have saved this program in a file named 'wordfreq.awk',
 and that the data is in 'file1', the following pipeline:
 
      awk -f wordfreq.awk file1 | sort -k 2nr
 
 produces a table of the words appearing in 'file1' in order of
 decreasing frequency.
 
    The 'awk' program suitably massages the data and produces a word
 frequency table, which is not ordered.  The 'awk' script's output is
 then sorted by the 'sort' utility and printed on the screen.
 
    The options given to 'sort' specify a sort that uses the second field
 of each input line (skipping one field), that the sort keys should be
 treated as numeric quantities (otherwise '15' would come before '5'),
 and that the sorting should be done in descending (reverse) order.
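
    If you would rather have the alphabetized table mentioned earlier,
 one variation is to sort on the words themselves instead:

       awk -f wordfreq.awk file1 | sort -k 1,1

 Here 'sort' compares only the first field of each line, so the words
 appear in alphabetical order, each followed by its count.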
 
    The 'sort' could even be done from within the program, by changing
 the 'END' action to:
 
      END {
          sort = "sort -k 2nr"
          for (word in freq)
              printf "%s\t%d\n", word, freq[word] | sort
          close(sort)
      }
 
    The call to 'close()' ensures that 'sort' receives end-of-file on
 its input and finishes writing its output before the 'awk' program
 exits.  This way of sorting must be used on systems that do not have
 true pipes at the command-line (or batch-file) level.  See the general
 operating system documentation for more information on how to use the
 'sort' program.
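
    If you are using 'gawk', a different approach (sketched here using
 the 'PROCINFO["sorted_in"]' facility for controlling array traversal)
 avoids the external 'sort' program entirely, by telling 'gawk' in what
 order to scan the array:

       END {
           # gawk extension: scan the array by element value,
           # comparing the values as numbers, in descending order
           PROCINFO["sorted_in"] = "@val_num_desc"
           for (word in freq)
               printf "%s\t%d\n", word, freq[word]
       }

 With this 'END' rule, the most frequent words are printed first, and no
 external utility is needed.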