Info: (gawk) Dupword Program


 
 11.3.1 Finding Duplicated Words in a Document
 ---------------------------------------------
 
 A common error when writing large amounts of prose is to accidentally
 duplicate words.  Typically you will see this in text as something like
 "the the program does the following..." When the text is online, often
 the duplicated words occur at the end of one line and the beginning of
 another, making them very difficult to spot.
 
    This program, 'dupword.awk', scans through a file one line at a time
 and looks for adjacent occurrences of the same word.  It also saves the
 last word on a line (in the variable 'prev') for comparison with the
 first word on the next line.
 
    The first statement makes sure that the line is all lowercase, so
 that, for example, "The" and "the" compare equal to each other.  The
 next statement replaces every character that is neither alphanumeric
 nor whitespace with a space, so that punctuation does not affect the
 comparison either.
 The characters are replaced with spaces so that formatting controls
 don't create nonsense words (e.g., the Texinfo '@code{NF}' becomes
 'codeNF' if punctuation is simply deleted).  The record is then resplit
 into fields, yielding just the actual words on the line, and ensuring
 that there are no empty fields.
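
    As a rough illustration of these normalization steps, the sketch
 below applies them to one sample line and prints the resulting fields.
 The helper name 'normalize' and the sample text are made up for this
 example, and it assumes an 'awk' that supports POSIX character classes
 (as 'gawk' does).

```shell
# Hypothetical helper: normalize each input line the way the text describes,
# then print every resulting word with its field number.
normalize() {
    awk '{
        $0 = tolower($0)                      # fold case
        gsub(/[^[:alnum:][:blank:]]/, " ")    # punctuation becomes spaces
        $0 = $0                               # re-split into fields
        for (i = 1; i <= NF; i++)
            printf("%d: %s\n", i, $i)
    }'
}

# Made-up sample line containing a Texinfo formatting control.
printf 'The Texinfo @code{NF} variable.\n' | normalize
```

    With this input, the five fields printed are 'the', 'texinfo',
 'code', 'nf', and 'variable': the formatting control yields two
 separate words rather than the nonsense word 'codenf'.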
 
    If there are no fields left after removing all the punctuation, the
 current record is skipped.  Otherwise, the program loops through each
 word, comparing it to the previous one:
 
      # dupword.awk --- find duplicate words in text
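      {
          $0 = tolower($0)    # fold case so "The" matches "the"
          # turn punctuation into spaces
          gsub(/[^[:alnum:][:blank:]]/, " ")
          $0 = $0         # re-split
          if (NF == 0)
              next
          # first word vs. last word of the previous nonempty line
          if ($1 == prev)
              printf("%s:%d: duplicate %s\n",
                  FILENAME, FNR, $1)
          # adjacent words within this line
          for (i = 2; i <= NF; i++)
              if ($i == $(i-1))
                  printf("%s:%d: duplicate %s\n",
                      FILENAME, FNR, $i)
          prev = $NF      # remember the last word for the next line
      }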
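
    One way to try the program is sketched below.  The scratch
 directory, the file name 'sample.txt', and the sample text are
 assumptions made for this demonstration; the script saves a copy of
 the program as 'dupword.awk' and runs it on three lines that contain
 one duplicate within a line and one split across a line break.

```shell
# Work in a scratch directory (all file names here are made up for the demo).
dir=$(mktemp -d)

# Save a copy of the program from this section as dupword.awk.
cat > "$dir/dupword.awk" <<'EOF'
# dupword.awk --- find duplicate words in text
{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ")
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
}
EOF

# Sample input: "The the" on line 1, and "the" ending line 2 and starting line 3.
printf '%s\n' \
    'The the program does the following.' \
    'It is easy to miss a word at the' \
    'the start of the next line.' > "$dir/sample.txt"

# Run from inside the directory so FILENAME is just "sample.txt".
(cd "$dir" && awk -f dupword.awk sample.txt)
```

    For this input the program reports 'sample.txt:1: duplicate the'
 for the pair within the first line and 'sample.txt:3: duplicate the'
 for the pair that straddles the second and third lines.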