gawk: Uniq Program
11.2.6 Printing Nonduplicated Lines of Text
-------------------------------------------
The 'uniq' utility reads sorted lines of data on its standard input, and
by default removes duplicate lines. In other words, it only prints
unique lines--hence the name. 'uniq' has a number of options. The
usage is as follows:
     'uniq' ['-udc' ['-N']] ['+N'] [INPUTFILE [OUTPUTFILE]]
The options for 'uniq' are:
'-d'
     Print only repeated (duplicated) lines.
'-u'
     Print only nonrepeated (unique) lines.
'-c'
     Count lines.  This option overrides '-d' and '-u'.  Both repeated
     and nonrepeated lines are counted.
'-N'
     Skip N fields before comparing lines.  The definition of fields is
     similar to 'awk''s default: nonwhitespace characters separated by
     runs of spaces and/or TABs.
'+N'
     Skip N characters before comparing lines.  Any fields specified
     with '-N' are skipped first.
'INPUTFILE'
     Data is read from the input file named on the command line, instead
     of from the standard input.
'OUTPUTFILE'
     The generated output is sent to the named output file, instead of
     to the standard output.
Normally 'uniq' behaves as if both the '-d' and '-u' options are
provided.
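As a quick illustration of these options, the following shell sketch
runs the system's 'uniq' on a small sorted sample. It assumes that a
POSIX 'uniq' is installed; the 'sample' function and its data are made
up for this example and are not part of the program described here.

```shell
# Hypothetical sample input: already sorted, so duplicates are adjacent.
sample() {
    printf '%s\n' apple apple banana cherry cherry cherry
}

sample | uniq       # default: one copy of every line
sample | uniq -d    # only the lines that were repeated
sample | uniq -u    # only the lines that were not repeated
```

The default output is the union of what '-d' and '-u' print, which is
why the default can be described as supplying both options at once.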
'uniq' uses the 'getopt()' library function (see Getopt Function)
and the 'join()' library function (see Join Function).
The program begins with a 'usage()' function and then a brief outline
of the options and their meanings in comments. The 'BEGIN' rule deals
with the command-line arguments and options. It uses a trick to get
'getopt()' to handle options of the form '-25', treating such an option
as the option letter '2' with an argument of '5'. If indeed two or more
digits are supplied ('Optarg' looks like a number), 'Optarg' is
concatenated with the option digit and then the result is added to zero
to make it into a number. If there is only one digit in the option,
then 'Optarg' is not needed. In this case, 'Optind' must be decremented
so that 'getopt()' processes it next time. This code is admittedly a
bit tricky.
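The digit handling can be tried in isolation. The following is only a
sketch: 'parse_digit_opt' is a hypothetical helper, and the 'pushback'
flag stands in for the 'Optind--' adjustment in the real program. It
takes the option digit and the argument that 'getopt()' would have
attached to it, and prints the resulting field count followed by the
flag.

```shell
# parse_digit_opt is a hypothetical helper, not part of uniq.awk.
# $1 is the option digit returned by getopt(); $2 is what getopt()
# took as that option's argument (Optarg in the real program).
parse_digit_opt() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/) {
            # "-25": option "2" with argument "5" gives 25
            fcount = (c optarg) + 0
            pushback = 0
        } else {
            # "-5 file": the argument is really the next word,
            # so it has to be handed back (Optind-- in uniq.awk)
            fcount = c + 0
            pushback = 1
        }
        print fcount, pushback
    }'
}

parse_digit_opt 2 5          # prints "25 0"
parse_digit_opt 5 data.txt   # prints "5 1"
```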
If no options are supplied, then the default is taken, to print both
repeated and nonrepeated lines. The output file, if provided, is
assigned to 'outputfile'. Early on, 'outputfile' is initialized to the
standard output, '/dev/stdout':
     # uniq.awk --- do uniq in awk
     #
     # Requires getopt() and join() library functions

     function usage()
     {
         print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only nonrepeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first

     BEGIN {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt() requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[[:digit:]]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }

         if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }
The following function, 'are_equal()', compares the current line,
'$0', to the previous line, 'last'. It handles skipping fields and
characters. If no field count and no character count are specified,
'are_equal()' returns one or zero depending upon the result of a simple
string comparison of 'last' and '$0'.
Otherwise, things get more complicated. If fields have to be
skipped, each line is broken into an array using 'split()' (see String
Functions); the desired fields are then joined back into a line using
'join()'. The joined lines are stored in 'clast' and 'cline'. If no
fields are skipped, 'clast' and 'cline' are set to 'last' and '$0',
respectively. Finally, if characters are skipped, 'substr()' is used to
strip off the leading 'charcount' characters in 'clast' and 'cline'.
The two strings are then compared and 'are_equal()' returns the result:
     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }
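To see what actually takes part in the comparison, the skipping steps
can be sketched on their own. In the following, 'cmp_key' and
'join_fields' are hypothetical names: 'join_fields' is a simplified
stand-in for the 'join()' library function and is assumed to glue the
fields together with single spaces. The sketch prints the portion of
each input line that remains after skipping the given number of fields
and then the given number of characters.

```shell
# cmp_key is a hypothetical helper, not part of uniq.awk.
# $1 is the number of fields to skip; $2 is the number of characters.
cmp_key() {
    awk -v fcount="$1" -v charcount="$2" '
    # Simplified stand-in for join(): concatenate arr[from..to],
    # separated by single spaces.
    function join_fields(arr, from, to,    i, s) {
        for (i = from; i <= to; i++)
            s = (i == from) ? arr[i] : s " " arr[i]
        return s
    }
    {
        key = $0
        if (fcount > 0) {                  # skip leading fields first
            n = split($0, parts)
            key = join_fields(parts, fcount + 1, n)
        }
        if (charcount > 0)                 # then skip characters
            key = substr(key, charcount + 1)
        print key
    }'
}

printf '%s\n' '10:01 GET /index.html' | cmp_key 1 0   # GET /index.html
printf '%s\n' 'xxhello' | cmp_key 0 2                 # hello
```

Because the fields are split and rejoined, runs of whitespace between
the remaining fields collapse to single spaces before any characters
are skipped.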
The following two rules are the body of the program. The first one
is executed only for the very first line of data. It sets 'last' equal
to '$0', so that subsequent lines of text have something to be compared
to.
The second rule does the work. The variable 'equal' is one or zero,
depending upon the results of 'are_equal()''s comparison. If 'uniq' is
counting lines ('-c') and the lines are equal, then it increments the
'count' variable. Otherwise, the two lines are not equal, so it prints
the count together with the previous line, saves the current line in
'last', and resets 'count'.
If 'uniq' is not counting, and if the lines are equal, 'count' is
incremented. Nothing is printed, as the point is to remove duplicates.
Otherwise, if 'uniq' is printing repeated lines and more than one line
was seen, or if 'uniq' is printing nonrepeated lines and only one line
was seen, then the previous line is printed. In either case, 'last' is
set to the current line and 'count' is reset.
Finally, similar logic is used in the 'END' rule to print the final
line of input data:
     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
         close(outputfile)
     }
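The run-counting pattern used by the rules above can be reduced to a
few lines. The following is a stripped-down sketch, not a replacement
for the program: 'count_runs' is a hypothetical name, it compares whole
lines only, and it always writes to the standard output. It also forces
a string comparison, an assumption made here so that numeric-looking
lines such as '1' and '1.0' are not treated as equal.

```shell
# count_runs is a hypothetical, reduced version of the -c logic.
count_runs() {
    awk '
    NR == 1 { last = $0; count = 1; next }

    # Concatenating "" forces a string (not numeric) comparison.
    ($0 "") == (last "") { count++; next }

    {
        printf("%4d %s\n", count, last)   # a run ended: report it
        last = $0
        count = 1
    }

    END {
        if (NR > 0)                       # flush the final run
            printf("%4d %s\n", count, last)
    }'
}

printf '%s\n' a a b c c c | count_runs
```

As in the full program, the last run is only known to be complete once
the input ends, which is why the 'END' rule repeats the print step.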