gawk: Egrep Program
11.2.2 Searching for Regular Expressions in Files
-------------------------------------------------
The 'egrep' utility searches files for patterns. It uses regular
expressions that are almost identical to those available in 'awk' (see
Regexp). You invoke it as follows:
     'egrep' [OPTIONS] ''PATTERN'' FILES ...
The PATTERN is a regular expression. In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards. Normally, 'egrep' prints the
lines that matched. If multiple file names are provided on the command
line, each output line is preceded by the name of the file and a colon.
The options to 'egrep' are as follows:
'-c'
     Print out a count of the lines that matched the pattern, instead of
     the lines themselves.
'-s'
     Be silent. No output is produced and the exit value indicates
     whether the pattern was matched.
'-v'
     Invert the sense of the test. 'egrep' prints the lines that do
     _not_ match the pattern and exits successfully if the pattern is
     not matched.
'-i'
     Ignore case distinctions in both the pattern and the input data.
'-l'
     Only print (list) the names of the files that matched, not the
     lines that matched.
'-e PATTERN'
     Use PATTERN as the regexp to match. The purpose of the '-e' option
     is to allow patterns that start with a '-'.
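As a brief illustration of the invocation form and of the '-c', '-i',
and '-v' options, here is a small shell sketch. It assumes a POSIX
'grep' that accepts '-E' (extended regular expressions) as a stand-in
for 'egrep'; the file name 'sample.txt' and its contents are made up
for the example:

```shell
# Create a scratch directory and a made-up sample file.
dir=$(mktemp -d)
printf 'apple\nBanana\ncherry\n' > "$dir/sample.txt"

# Quote the pattern so the shell does not interpret '|' or '+'.
matched=$(grep -E 'an+a|^ch' "$dir/sample.txt")

# -c prints a count of matching lines; -i ignores case.
count=$(grep -E -c -i 'A' "$dir/sample.txt")

# -v inverts the test: print the lines that do NOT match.
others=$(grep -E -v 'an' "$dir/sample.txt")

echo "$matched"
echo "$count"
echo "$others"
rm -r "$dir"
```

With a single file operand, no file name prefix is printed before the
matching lines.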
This version uses the 'getopt()' library function (see Getopt
Function) and the file transition library program (see Filetrans
Function).
The program begins with a descriptive comment and then a 'BEGIN' rule
that processes the command-line arguments with 'getopt()'. The '-i'
(ignore case) option is particularly easy with 'gawk'; we just use the
'IGNORECASE' predefined variable (see Built-in Variables):
     # egrep.awk --- simulate egrep in awk
     #
     # Options:
     #    -c    count of lines
     #    -s    silent - use exit value
     #    -v    invert test, success if no match
     #    -i    ignore case
     #    -l    print filenames only
     #    -e    argument is pattern
     #
     # Requires getopt and file transition library functions
     BEGIN {
         while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) {
             if (c == "c")
                 count_only++
             else if (c == "s")
                 no_print++
             else if (c == "v")
                 invert++
             else if (c == "i")
                 IGNORECASE = 1
             else if (c == "l")
                 filenames_only++
             else if (c == "e")
                 pattern = Optarg
             else
                 usage()
         }
Next comes the code that handles the 'egrep'-specific behavior. If
no pattern is supplied with '-e', the first nonoption on the command
line is used. The 'awk' command-line arguments up to 'ARGV[Optind]' are
cleared, so that 'awk' won't try to process them as files. If no files
are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the file names can precede
the matched lines in the output:
         if (pattern == "")
             pattern = ARGV[Optind++]
         for (i = 1; i < Optind; i++)
             ARGV[i] = ""
         if (Optind >= ARGC) {
             ARGV[1] = "-"
             ARGC = 2
         } else if (ARGC - Optind > 1)
             do_filenames++
     #    if (IGNORECASE)
     #        pattern = tolower(pattern)
     }
The last two lines are commented out, as they are not needed in
'gawk'. They should be uncommented if you have to use another version
of 'awk'.
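The effect of clearing 'ARGV' elements can be seen in isolation with
the following sketch. It is not part of 'egrep.awk'; the file name
'data.txt', its contents, and the pattern 'b+' are assumptions made for
the example. Because 'ARGV[1]' is set to the empty string in the
'BEGIN' rule, 'awk' skips it instead of trying to open a file named
'b+':

```shell
dir=$(mktemp -d)
printf 'abc\nxyz\nbbq\n' > "$dir/data.txt"

# ARGV[1] holds the pattern; ARGV[2] is the only real file name.
out=$(awk '
BEGIN {
    pattern = ARGV[1]
    ARGV[1] = ""        # empty elements are skipped as file names
}
$0 ~ pattern' 'b+' "$dir/data.txt")

echo "$out"
rm -r "$dir"
```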
The next set of lines should be uncommented if you are not using
'gawk'. This rule translates all the characters in the input line into
lowercase if the '-i' option is specified.(1) The rule is commented out
as it is not necessary with 'gawk':
     #{
     #    if (IGNORECASE)
     #        $0 = tolower($0)
     #}
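One possible way to sidestep the problem described in the footnote,
when 'IGNORECASE' is not available, is to compare a lowercased copy of
the record and leave '$0' alone. The following is only a sketch of that
idea, not the approach taken by 'egrep.awk'; the input text and the
pattern are made up. Note that lowercasing a pattern is itself only
safe for simple patterns, since it would also alter things such as
uppercase character class names:

```shell
out=$(printf 'Hello World\nplain text\n' | awk -v pattern='WORLD' '
BEGIN { pattern = tolower(pattern) }
# Match against a lowercased copy; $0 keeps its original case.
tolower($0) ~ pattern { print }')

echo "$out"
```

The matching line is printed with its original capitalization.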
The 'beginfile()' function is called by the rule in 'ftrans.awk' when
each new file is processed. In this case, it is very simple; all it
does is initialize a variable 'fcount' to zero. 'fcount' tracks how
many lines in the current file matched the pattern. Naming the
parameter 'junk' shows we know that 'beginfile()' is called with a
parameter, but that we're not interested in its value:
     function beginfile(junk)
     {
         fcount = 0
     }
The 'endfile()' function is called after each file has been
processed. It affects the output only when the user wants a count of
the number of lines that matched. 'no_print' is true only if the exit
status is desired. 'count_only' is true if line counts are desired.
'egrep' therefore only prints line counts if printing and counting are
enabled. The output format must be adjusted depending upon the number
of files to process. Finally, 'fcount' is added to 'total', so that we
know the total number of lines that matched the pattern:
     function endfile(file)
     {
         if (! no_print && count_only) {
             if (do_filenames)
                 print file ":" fcount
             else
                 print fcount
         }
         total += fcount
     }
The 'BEGINFILE' and 'ENDFILE' special patterns (see
BEGINFILE/ENDFILE) could be used, but then the program would be
'gawk'-specific. Additionally, this example was written before 'gawk'
acquired 'BEGINFILE' and 'ENDFILE'.
The following rule does most of the work of matching lines. The
variable 'matches' is true if the line matched the pattern. If the user
wants lines that did not match, the sense of 'matches' is inverted using
the '!' operator. 'fcount' is incremented with the value of 'matches',
which is either one or zero, depending upon a successful or unsuccessful
match. If the line does not match, the 'next' statement just moves on
to the next record.
A number of additional tests are made, but they are only done if we
are not counting lines. First, if the user only wants the exit status
('no_print' is true), then it is enough to know that _one_ line in this
file matched, and we can skip on to the next file with 'nextfile'.
Similarly, if we are only printing file names, we can print the file
name, and then skip to the next file with 'nextfile'. Finally, each
line is printed, with a leading file name and colon if necessary:
     {
         matches = ($0 ~ pattern)
         if (invert)
             matches = ! matches
         fcount += matches    # 1 or 0
         if (! matches)
             next
         if (! count_only) {
             if (no_print)
                 nextfile
             if (filenames_only) {
                 print FILENAME
                 nextfile
             }
             if (do_filenames)
                 print FILENAME ":" $0
             else
                 print
         }
     }
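The counting idiom used in this rule relies on a match expression
evaluating to one or zero. The following standalone sketch shows just
that part, with the inversion turned on; the input lines and the fixed
regexp '/b/' are assumptions made for the example:

```shell
out=$(printf 'abc\nxyz\nbbq\n' | awk -v invert=1 '
{
    matches = ($0 ~ /b/)    # 1 if the line matches, 0 otherwise
    if (invert)
        matches = ! matches
    fcount += matches
}
END { print fcount + 0 }')

echo "$out"
```

Only 'xyz' lacks a 'b', so with inversion enabled the count is one.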
The 'END' rule takes care of producing the correct exit status. If
there are no matches, the exit status is one; otherwise, it is zero:
     END {
         exit (total == 0)
     }
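The exit status convention can be checked from the shell. This sketch
uses a simplified rule that counts matches in 'total' directly, rather
than the full program; the input and the regexps are made up:

```shell
# A pattern that matches: total is nonzero, so the exit status is 0.
found=0
printf 'abc\n' | awk '/b/ { total++ } END { exit (total == 0) }' || found=$?

# A pattern that does not match: total is unset, so the exit status is 1.
missing=0
printf 'abc\n' | awk '/q/ { total++ } END { exit (total == 0) }' || missing=$?

echo "$found $missing"
```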
The 'usage()' function prints a usage message in case of invalid
options, and then exits:
     function usage()
     {
         print("Usage: egrep [-csvil] [-e pat] [files ...]") > "/dev/stderr"
         print("\n\tegrep [-csvil] pat [files ...]") > "/dev/stderr"
         exit 1
     }
---------- Footnotes ----------
(1) It also introduces a subtle bug; if a match happens, we output
the translated line, not the original.