Info: (gawk) Anagram Program

Info Catalog
gawk: Igawk Program
gawk: Miscellaneous Programs
gawk: Signature Program
gawk: Anagram Program

 
 11.3.10 Finding Anagrams from a Dictionary
 ------------------------------------------
 
 An interesting programming challenge is to search for "anagrams" in a
 word list (such as '/usr/share/dict/words' on many GNU/Linux systems).
 One word is an anagram of another if both words contain the same letters
 (e.g., "babbling" and "blabbing").
 
    Column 2, Problem C, of Jon Bentley's 'Programming Pearls', Second
 Edition, presents an elegant algorithm.  The idea is to give words that
 are anagrams a common signature, sort all the words together by their
 signatures, and then print them.  Dr. Bentley observes that taking the
 letters in each word and sorting them produces those common signatures.
 
    The following program uses arrays of arrays to bring together words
 with the same signature and array sorting to print the words in sorted
 order:
 
      # anagram.awk --- An implementation of the anagram-finding algorithm
      #                 from Jon Bentley's "Programming Pearls," 2nd edition.
      #                 Addison Wesley, 2000, ISBN 0-201-65788-0.
      #                 Column 2, Problem C, section 2.8, pp 18-20.
 
      /'s$/   { next }        # Skip possessives
 
    The program starts with a header, and then a rule to skip possessives
 in the dictionary file.  The next rule builds up the data structure.
 The first dimension of the array is indexed by the signature; the second
 dimension is the word itself:
 
      {
          key = word2key($1)  # Build signature
          data[key][$1] = $1  # Store word with signature
      }
 
    The 'word2key()' function creates the signature.  It splits the word
 apart into individual letters, sorts the letters, and then joins them
 back together:
 
      # word2key --- split word apart into letters, sort, and join back together
 
      function word2key(word,     a, i, n, result)
      {
          n = split(word, a, "")
          asort(a)
 
          for (i = 1; i <= n; i++)
              result = result a[i]
 
          return result
      }
 
    Finally, the 'END' rule traverses the array and prints out the
 anagram lists.  It sends the output to the system 'sort' command because
 otherwise the anagrams would appear in arbitrary order:
 
      END {
          sort = "sort"
          for (key in data) {
              # Sort words with same key
              nwords = asorti(data[key], words)
              if (nwords == 1)
                  continue
 
              # And print. Minor glitch: trailing space at end of each line
              for (j = 1; j <= nwords; j++)
                  printf("%s ", words[j]) | sort
              print "" | sort
          }
          close(sort)
      }
 
    Here is some partial output when the program is run:
 
      $ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
      ...
      babbled blabbed
      babbler blabber brabble
      babblers blabbers brabbles
      babbling blabbing
      babbly blabby
      babel bable
      babels beslab
      babery yabber
      ...
Info Catalog
gawk: Igawk Program
gawk: Miscellaneous Programs
gawk: Signature Program