gawk: Glossary
Glossary
********
Action
A series of 'awk' statements attached to a rule. If the rule's
pattern matches an input record, 'awk' executes the rule's action.
Actions are always enclosed in braces. (Action Overview.)
Ada
A programming language originally defined by the U.S. Department of
Defense for embedded programming. It was designed to enforce good
software engineering practices.
Amazing 'awk' Assembler
Henry Spencer at the University of Toronto wrote a retargetable
assembler completely as 'sed' and 'awk' scripts. It is thousands
of lines long, including machine descriptions for several eight-bit
microcomputers. It is a good example of a program that would have
been better written in another language.
Amazingly Workable Formatter ('awf')
Henry Spencer at the University of Toronto wrote a formatter that
accepts a large subset of the 'nroff -ms' and 'nroff -man'
formatting commands, using 'awk' and 'sh'.
Anchor
The regexp metacharacters '^' and '$', which force the match to the
beginning or end of the string, respectively.
ANSI
The American National Standards Institute. This organization
produces many standards, among them the standards for the C and C++
programming languages. These standards often become international
standards as well. See also "ISO."
Argument
An argument can be two different things. It can be an option or a
file name passed to a command while invoking it from the command
line, or it can be something passed to a "function" inside a
program, e.g., inside 'awk'.
In the latter case, an argument can be passed to a function in two
ways. Either it is given to the called function by value, i.e., a
copy of the value of the variable is made available to the called
function, but the original variable cannot be modified by the
function itself; or it is given by reference, i.e., a pointer to
the variable in question is passed to the function, which can then
directly modify it. In 'awk', scalars are passed by value, and
arrays are passed by reference. See "Pass By Value/Reference."
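As a minimal sketch of the two passing conventions described under "Argument", the following shell fragment runs a small 'awk' program. The function names 'bump' and 'fill' are hypothetical and exist only for this illustration.

```shell
# bump() and fill() are hypothetical helper names used only for illustration.
out=$(awk '
function bump(n) { n = n + 1 }            # scalar parameter: only a local copy changes
function fill(a) { a["k"] = "changed" }   # array parameter: the original array changes
BEGIN {
    x = 1
    bump(x)
    arr["k"] = "original"
    fill(arr)
    print x, arr["k"]
}')
echo "$out"
```

The scalar 'x' keeps its value after the call, while the array element is modified in place.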
Array
A grouping of multiple values under the same name. Most languages
just provide sequential arrays. 'awk' provides associative arrays.
Assertion
A statement in a program that a condition is true at this point in
the program. Useful for reasoning about how a program is supposed
to behave.
Assignment
An 'awk' expression that changes the value of some 'awk' variable
or data object. An object that you can assign to is called an
"lvalue". The assigned values are called "rvalues".
(Assignment Ops.)
Associative Array
Arrays in which the indices may be numbers or strings, not just
sequential integers in a fixed range.
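A short sketch of an associative array indexed by strings, assuming three made-up input lines; the array name 'count' is illustrative.

```shell
# Count how often each word appears; the input lines are made up for the example.
out=$(printf 'apple\npear\napple\n' | awk '
{ count[$1]++ }                              # the index is the string in field 1
END { print count["apple"], count["pear"] }')
echo "$out"
```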
'awk' Language
The language in which 'awk' programs are written.
'awk' Program
An 'awk' program consists of a series of "patterns" and "actions",
collectively known as "rules". For each input record given to the
program, the program's rules are all processed in turn. 'awk'
programs may also contain function definitions.
'awk' Script
Another name for an 'awk' program.
Bash
The GNU version of the standard shell (the Bourne-Again SHell).
See also "Bourne Shell."
Binary
Base-two notation, where the digits are '0'-'1'. Since electronic
circuitry works "naturally" in base 2 (just think of Off/On),
everything inside a computer is calculated using base 2. Each
digit represents the presence (or absence) of a power of 2 and is
called a "bit". So, for example, the base-two number '10101' is
the same as decimal 21, ((1 x 16) + (1 x 4) + (1 x 1)).
Since base-two numbers quickly become very long to read and write,
they are usually grouped by 3 (i.e., they are read as octal
numbers), or by 4 (i.e., they are read as hexadecimal numbers).
There is no direct way to insert base 2 numbers in a C program. If
need arises, such numbers are usually inserted as octal or
hexadecimal numbers. The number of base-two digits that fit into
registers used for representing integer numbers in computers is a
rough indication of the computing power of the computer itself.
Most computers nowadays use 64 bits for representing integer
numbers in their registers, but 32-bit, 16-bit and 8-bit registers
have been widely used in the past. (Nondecimal-numbers.)
Bit
Short for "Binary Digit." All values in computer memory ultimately
reduce to binary digits: values that are either zero or one.
Groups of bits may be interpreted differently--as integers,
floating-point numbers, character data, addresses of other memory
objects, or other data. 'awk' lets you work with floating-point
numbers and strings. 'gawk' lets you manipulate bit values with
the built-in functions described in Bitwise Functions.
Computers are often defined by how many bits they use to represent
integer values. Typical systems today are 64-bit systems; 32-bit
systems remain in use, and 16-bit systems have essentially
disappeared.
Boolean Expression
Named after the English mathematician Boole. See also "Logical
Expression."
Bourne Shell
The standard shell ('/bin/sh') on Unix and Unix-like systems,
originally written by Stephen R. Bourne at Bell Laboratories. Many
shells (Bash, 'ksh', 'pdksh', 'zsh') are generally upwardly
compatible with the Bourne shell.
Braces
The characters '{' and '}'. Braces are used in 'awk' for
delimiting actions, compound statements, and function bodies.
Bracket Expression
Inside a "regular expression", an expression included in square
brackets, meant to designate a single character as belonging to a
specified character class. A bracket expression can contain a list
of one or more characters, like '[abc]', a range of characters,
like '[A-Z]', or a name, delimited by ':', that designates a known
set of characters, like '[:digit:]'. The form of bracket
expression enclosed between ':' is independent of the underlying
representation of the characters themselves, which could utilize the
ASCII, EBCDIC, or Unicode codesets, depending on the architecture
of the computer system, and on localization. See also "Regular
Expression."
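The forms of bracket expression described under "Bracket Expression" can be sketched as follows. The input lines are made up, and the example assumes an 'awk' that supports POSIX character classes such as '[:digit:]'.

```shell
# Two rules: one uses a named class, the other a plain character list.
out=$(printf 'a1\nbc\n42\n' | awk '
/^[[:digit:]]+$/ { print "digits: " $0 }   # records made only of digits
/^[abc]+$/       { print "abc: " $0 }      # records made only of a, b, or c
')
echo "$out"
```

The record 'a1' matches neither rule, because it mixes a letter with a digit.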
Built-in Function
The 'awk' language provides built-in functions that perform various
numerical, I/O-related, and string computations. Examples are
'sqrt()' (for the square root of a number) and 'substr()' (for a
substring of a string). 'gawk' provides functions for timestamp
management, bit manipulation, array sorting, type checking, and
runtime string translation. (Built-in.)
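A minimal sketch of calling the two built-in functions named under "Built-in Function"; the argument values are arbitrary.

```shell
# sqrt() returns a number; substr() returns the 5 characters starting at position 1.
out=$(awk 'BEGIN { print sqrt(16), substr("glossary", 1, 5) }')
echo "$out"
```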
Built-in Variable
'ARGC', 'ARGV', 'CONVFMT', 'ENVIRON', 'FILENAME', 'FNR', 'FS',
'NF', 'NR', 'OFMT', 'OFS', 'ORS', 'RLENGTH', 'RSTART', 'RS', and
'SUBSEP' are the variables that have special meaning to 'awk'. In
addition, 'ARGIND', 'BINMODE', 'ERRNO', 'FIELDWIDTHS', 'FPAT',
'IGNORECASE', 'LINT', 'PROCINFO', 'RT', and 'TEXTDOMAIN' are the
variables that have special meaning to 'gawk'. Changing some of
them affects 'awk''s running environment. (Built-in
Variables.)
C
The system programming language that most GNU software is written
in. The 'awk' programming language has C-like syntax, and this
Info file points out similarities between 'awk' and C when
appropriate.
In general, 'gawk' attempts to be as similar to the 1990 version of
ISO C as makes sense.
C Shell
The C Shell ('csh' or its improved version, 'tcsh') is a Unix shell
that was created by Bill Joy in the late 1970s. The C shell was
differentiated from other shells by its interactive features and
overall style, which looks more like C. The C Shell is not backward
compatible with the Bourne Shell, so special attention is required
when converting scripts written for other Unix shells to the C
shell, especially with regard to the management of shell variables.
See also "Bourne Shell."
C++
A popular object-oriented programming language derived from C.
Character Class
See "Bracket Expression."
Character List
See "Bracket Expression."
Character Set
The set of numeric codes used by a computer system to represent the
characters (letters, numbers, punctuation, etc.) of a particular
country or place. The most common character set in use today is
ASCII (American Standard Code for Information Interchange). Many
European countries use an extension of ASCII known as ISO-8859-1
(ISO Latin-1). The Unicode character set (http://www.unicode.org)
is increasingly popular and standard, and is particularly widely
used on GNU/Linux systems.
CHEM
A preprocessor for 'pic' that reads descriptions of molecules and
produces 'pic' input for drawing them. It was written in 'awk' by
Brian Kernighan and Jon Bentley, and is available from
<http://netlib.org/typesetting/chem>.
Comparison Expression
A relation that is either true or false, such as 'a < b'.
Comparison expressions are used in 'if', 'while', 'do', and 'for'
statements, and in patterns to select which input records to
process. (Typing and Comparison.)
Compiler
A program that translates human-readable source code into
machine-executable object code. The object code is then executed
directly by the computer. See also "Interpreter."
Complemented Bracket Expression
The negation of a "bracket expression". All that is _not_
described by a given bracket expression. The symbol '^' precedes
the negated bracket expression. E.g.: '[^[:digit:]]' designates
whatever character is not a digit. '[^bad]' designates whatever
character is not one of the letters 'b', 'a', or 'd'. See "Bracket
Expression."
Compound Statement
A series of 'awk' statements, enclosed in curly braces. Compound
statements may be nested. (Statements.)
Computed Regexps
See "Dynamic Regular Expressions."
Concatenation
Concatenating two strings means sticking them together, one after
another, producing a new string. For example, the string 'foo'
concatenated with the string 'bar' gives the string 'foobar'.
(Concatenation.)
Conditional Expression
An expression using the '?:' ternary operator, such as 'EXPR1 ?
EXPR2 : EXPR3'. The expression EXPR1 is evaluated; if the result
is true, the value of the whole expression is the value of EXPR2;
otherwise the value is EXPR3. In either case, only one of EXPR2
and EXPR3 is evaluated. (Conditional Exp.)
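The claim under "Conditional Expression" that only one branch is evaluated can be checked with a small sketch. The variable names 'n' and 'r' are illustrative; the third operand contains an assignment that would be visible if it ran.

```shell
out=$(awk 'BEGIN {
    n = 0
    r = (1 > 0) ? "yes" : (n = 99)   # condition is true, so (n = 99) never runs
    print r, n
}')
echo "$out"
```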
Control Statement
A control statement is an instruction to perform a given operation
or a set of operations inside an 'awk' program, if a given
condition is true. Control statements are: 'if', 'for', 'while',
and 'do' (Statements).
Cookie
A peculiar goodie, token, saying or remembrance produced by or
presented to a program. (With thanks to Professor Doug McIlroy.)
Coprocess
A subordinate program with which two-way communication is
possible.
Curly Braces
See "Braces."
Dark Corner
An area in the language where specifications often were (or still
are) not clear, leading to unexpected or undesirable behavior.
Such areas are marked in this Info file with "(d.c.)" in the text
and are indexed under the heading "dark corner."
Data Driven
A description of 'awk' programs, where you specify the data you are
interested in processing, and what to do when that data is seen.
Data Objects
These are numbers and strings of characters. Numbers are converted
into strings and vice versa, as needed. (Conversion.)
Deadlock
The situation in which two communicating processes are each waiting
for the other to perform an action.
Debugger
A program used to help developers remove "bugs" from (de-bug) their
programs.
Double Precision
An internal representation of numbers that can have fractional
parts. Double precision numbers keep track of more digits than do
single precision numbers, but operations on them are sometimes more
expensive. This is the way 'awk' stores numeric values. It is the
C type 'double'.
Dynamic Regular Expression
A dynamic regular expression is a regular expression written as an
ordinary expression. It could be a string constant, such as
'"foo"', but it may also be an expression whose value can vary.
(Computed Regexps.)
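As a sketch of a dynamic regular expression, the pattern below is held in a variable (here named 'pat', an arbitrary choice) and supplied from the command line rather than written between slashes.

```shell
# The regexp is the value of pat, so it can change from one run to the next.
out=$(printf 'foo\nbar\n' | awk -v pat='^b' '$0 ~ pat { print }')
echo "$out"
```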
Empty String
See "Null String."
Environment
A collection of strings, of the form 'NAME=VAL', that each program
has available to it. Users generally place values into the
environment in order to provide information to various programs.
Typical examples are the environment variables 'HOME' and 'PATH'.
Epoch
The date used as the "beginning of time" for timestamps. Time
values in most systems are represented as seconds since the epoch,
with library functions available for converting these values into
standard date and time formats.
The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC. See
also "GMT" and "UTC."
Escape Sequences
A special sequence of characters used for describing nonprinting
characters, such as '\n' for newline or '\033' for the ASCII ESC
(Escape) character. (Escape Sequences.)
Extension
An additional feature or change to a programming language or
utility not defined by that language's or utility's standard.
'gawk' has (too) many extensions over POSIX 'awk'.
FDL
See "Free Documentation License."
Field
When 'awk' reads an input record, it splits the record into pieces
separated by whitespace (or by a separator regexp that you can
change by setting the predefined variable 'FS'). Such pieces are
called fields. If the pieces are of fixed length, you can use the
built-in variable 'FIELDWIDTHS' to describe their lengths. If you
wish to specify the contents of fields instead of the field
separator, you can use the predefined variable 'FPAT' to do so.
(Field Separators, Constant Size, and
Splitting By Content.)
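A minimal sketch of field splitting with a non-default separator, as described under "Field". The input line is made up for the example.

```shell
# Setting FS in a BEGIN rule changes how every later record is split.
out=$(printf 'root:x:0\n' | awk 'BEGIN { FS = ":" } { print NF, $1, $3 }')
echo "$out"
```

'NF' holds the number of fields found in the current record.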
Flag
A variable whose truth value indicates the existence or
nonexistence of some condition.
Floating-Point Number
Often referred to in mathematical terms as a "rational" or real
number, this is just a number that can have a fractional part. See
also "Double Precision" and "Single Precision."
Format
Format strings control the appearance of output in the 'strftime()'
and 'sprintf()' functions, and in the 'printf' statement as well.
Also, data conversions from numbers to strings are controlled by
the format strings contained in the predefined variables 'CONVFMT'
and 'OFMT'. (Control Letters.)
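The roles of 'CONVFMT' and 'OFMT' described under "Format" can be sketched as follows. The format values are arbitrary, and the example assumes the C locale so that the decimal point is a period.

```shell
out=$(LC_ALL=C awk 'BEGIN {
    OFMT = "%.2f"; CONVFMT = "%.3f"
    x = 3.14159
    s = x ""        # number-to-string conversion uses CONVFMT
    print x, s      # printing a number uses OFMT; s is already a string
}')
echo "$out"
```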
Fortran
Shorthand for FORmula TRANslator, one of the first programming
languages available for scientific calculations. It was created by
John Backus, and has been available since 1957. It is still in use
today.
Free Documentation License
This document describes the terms under which this Info file is
published and may be copied. (GNU Free Documentation
License.)
Free Software Foundation
A nonprofit organization dedicated to the production and
distribution of freely distributable software. It was founded by
Richard M. Stallman, the author of the original Emacs editor. GNU
Emacs is the most widely used version of Emacs today.
FSF
See "Free Software Foundation."
Function
A part of an 'awk' program that can be invoked from any point of
the program, to perform a task. 'awk' has several built-in
functions. Users can define their own functions anywhere in
the program. Functions can be recursive, i.e., they may invoke
themselves. (Functions.) In 'gawk' it is also possible to
have functions shared among different programs, and included where
required using the '@include' directive (Include Files).
In 'gawk' the name of the function that should be invoked can be
generated at run time, i.e., dynamically. The 'gawk' extension API
provides constructor functions (Constructor Functions).
'gawk'
The GNU implementation of 'awk'.
General Public License
This document describes the terms under which 'gawk' and its source
code may be distributed. (Copying.)
GMT
"Greenwich Mean Time." This is the old term for UTC. It is the
time of day used internally for Unix and POSIX systems. See also
"Epoch" and "UTC."
GNU
"GNU's not Unix". An on-going project of the Free Software
Foundation to create a complete, freely distributable,
POSIX-compliant computing environment.
GNU/Linux
A variant of the GNU system using the Linux kernel, instead of the
Free Software Foundation's Hurd kernel. The Linux kernel is a
stable, efficient, full-featured clone of Unix that has been ported
to a variety of architectures. It is most popular on PC-class
systems, but runs well on a variety of other systems too. The
Linux kernel source code is available under the terms of the GNU
General Public License, which is perhaps its most important aspect.
GPL
See "General Public License."
Hexadecimal
Base 16 notation, where the digits are '0'-'9' and 'A'-'F', with
'A' representing 10, 'B' representing 11, and so on, up to 'F' for
15. Hexadecimal numbers are written in C using a leading '0x', to
indicate their base. Thus, '0x12' is 18 ((1 x 16) + 2).
(Nondecimal-numbers.)
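The hexadecimal arithmetic above, and the similar arithmetic for octal numbers, can be checked from 'awk' with the standard 'printf' conversions '%x' and '%o'. This is only a sketch of the conversion, not of how nondecimal constants are read.

```shell
# Decimal 18 printed in base 16, and decimal 11 printed in base 8.
out=$(awk 'BEGIN { printf "%x %o\n", 18, 11 }')
echo "$out"
```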
I/O
Abbreviation for "Input/Output," the act of moving data into and/or
out of a running program.
Input Record
A single chunk of data that is read in by 'awk'. Usually, an 'awk'
input record consists of one line of text. (Records.)
Integer
A whole number, i.e., a number that does not have a fractional
part.
Internationalization
The process of writing or modifying a program so that it can use
multiple languages without requiring further source code changes.
Interpreter
A program that reads human-readable source code directly, and uses
the instructions in it to process data and produce results. 'awk'
is typically (but not always) implemented as an interpreter. See
also "Compiler."
Interval Expression
A component of a regular expression that lets you specify repeated
matches of some part of the regexp. Interval expressions were not
originally available in 'awk' programs.
ISO
The International Organization for Standardization. This
organization produces international standards for many things,
including programming languages, such as C and C++. In the
computer arena, important standards like those for C, C++, and
POSIX become both American national and ISO international standards
simultaneously. This Info file refers to Standard C as "ISO C"
throughout. See the ISO website
(https://www.iso.org/iso/home/about.htm) for more information about
the name of the organization and its language-independent
three-letter acronym.
Java
A modern programming language originally developed by Sun
Microsystems (now Oracle) supporting Object-Oriented programming.
Although usually implemented by compiling to the instructions for a
standard virtual machine (the JVM), the language can be compiled to
native code.
Keyword
In the 'awk' language, a keyword is a word that has special
meaning. Keywords are reserved and may not be used as variable
names.
'gawk''s keywords are: 'BEGIN', 'BEGINFILE', 'END', 'ENDFILE',
'break', 'case', 'continue', 'default', 'delete', 'do...while',
'else', 'exit', 'for...in', 'for', 'function', 'func', 'if',
'next', 'nextfile', 'switch', and 'while'.
Korn Shell
The Korn Shell ('ksh') is a Unix shell which was developed by David
Korn at Bell Laboratories in the early 1980s. The Korn Shell is
backward-compatible with the Bourne shell and includes many
features of the C shell. See also "Bourne Shell."
Lesser General Public License
This document describes the terms under which binary library
archives or shared objects, and their source code may be
distributed.
LGPL
See "Lesser General Public License."
Linux
See "GNU/Linux."
Localization
The process of providing the data necessary for an
internationalized program to work in a particular language.
Logical Expression
An expression using the operators for logic, AND, OR, and NOT,
written '&&', '||', and '!' in 'awk'. Often called Boolean
expressions, after the mathematician who pioneered this kind of
mathematical logic.
Lvalue
An expression that can appear on the left side of an assignment
operator. In most languages, lvalues can be variables or array
elements. In 'awk', a field designator can also be used as an
lvalue.
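A short sketch of a field designator used as an lvalue; the input line is made up.

```shell
# Assigning to $2 changes the field and causes the record to be rebuilt.
out=$(echo 'a b c' | awk '{ $2 = "X"; print }')
echo "$out"
```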
Matching
The act of testing a string against a regular expression. If the
regexp describes the contents of the string, it is said to "match"
it.
Metacharacters
Characters used within a regexp that do not stand for themselves.
Instead, they denote regular expression operations, such as
repetition, grouping, or alternation.
Nesting
Nesting is where information is organized in layers, or where
objects contain other similar objects. In 'gawk' the '@include'
directive can be nested. The "natural" nesting of arithmetic and
logical operations can be changed using parentheses
(Precedence).
No-op
An operation that does nothing.
Null String
A string with no characters in it. It is represented explicitly in
'awk' programs by placing two double quote characters next to each
other ('""'). It can appear in input data by having two successive
occurrences of the field separator appear next to each other.
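As a sketch of the second case under "Null String", the input below has two adjacent ':' separators, so the second field is the null string. The variable name 'kind' is illustrative.

```shell
out=$(printf 'a::c\n' | awk -F: '{
    kind = ($2 == "") ? "empty" : "nonempty"   # field 2 lies between the two separators
    print kind, length($2)
}')
echo "$out"
```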
Number
A numeric-valued data object. Modern 'awk' implementations use
double precision floating-point to represent numbers. Ancient
'awk' implementations used single precision floating-point.
Octal
Base-eight notation, where the digits are '0'-'7'. Octal numbers
are written in C using a leading '0', to indicate their base.
Thus, '013' is 11 ((1 x 8) + 3). (Nondecimal-numbers.)
Output Record
A single chunk of data that is written out by 'awk'. Usually, an
'awk' output record consists of one or more lines of text.
(Records.)
Pattern
Patterns tell 'awk' which input records are interesting to which
rules.
A pattern is an arbitrary conditional expression against which
input is tested. If the condition is satisfied, the pattern is
said to "match" the input record. A typical pattern might compare
the input record against a regular expression. (Pattern
Overview.)
PEBKAC
An acronym describing what is possibly the most frequent source of
computer usage problems. (Problem Exists Between Keyboard And
Chair.)
Plug-in
See "Extension."
POSIX
The name for a series of standards that specify a Portable
Operating System interface. The "IX" denotes the Unix heritage of
these standards. The main standard of interest for 'awk' users is
'IEEE Standard for Information Technology, Standard 1003.1-2008'.
The 2008 POSIX standard can be found online at
<http://pubs.opengroup.org/onlinepubs/9699919799/>.
Precedence
The order in which operations are performed when operators are used
without explicit parentheses.
Private
Variables and/or functions that are meant for use exclusively by
library functions and not for the main 'awk' program. Special care
must be taken when naming such variables and functions.
(Library Names.)
Range (of input lines)
A sequence of consecutive lines from the input file(s). A pattern
can specify ranges of input lines for 'awk' to process or it can
specify single lines. (Pattern Overview.)
Record
See "Input record" and "Output record."
Recursion
When a function calls itself, either directly or indirectly. If
this is clear, stop, and proceed to the next entry. Otherwise,
refer to the entry for "recursion."
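For readers who prefer a concrete case of direct recursion, here is a minimal sketch. The function name 'fact' is hypothetical and chosen only for this example.

```shell
# fact() calls itself until the argument drops to 1.
out=$(awk '
function fact(n) { return (n <= 1) ? 1 : n * fact(n - 1) }
BEGIN { print fact(5) }')
echo "$out"
```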
Redirection
Redirection means performing input from something other than the
standard input stream, or performing output to something other than
the standard output stream.
You can redirect input to the 'getline' statement using the '<',
'|', and '|&' operators. You can redirect the output of the
'print' and 'printf' statements to a file or a system command,
using the '>', '>>', '|', and '|&' operators. (Getline,
and Redirection.)
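A minimal sketch of both directions of redirection described under "Redirection", assuming the 'mktemp' utility is available to create a scratch file. The variable names 'f' and 'line' are illustrative.

```shell
tmp=$(mktemp)    # scratch file; its path is chosen by mktemp
out=$(awk -v f="$tmp" '
BEGIN {
    print "hello" > f        # output redirected to a file
    close(f)                 # close the file before reading it back
    getline line < f         # input redirected from the same file
    print line
}')
rm -f "$tmp"
echo "$out"
```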
Reference Counts
An internal mechanism in 'gawk' to minimize the amount of memory
needed to store the value of string variables. If the value
assumed by a variable is used in more than one place, only one copy
of the value itself is kept, and the associated reference count is
increased when the same value is used by an additional variable,
and decreased when the related variable is no longer in use. When
the reference count goes to zero, the memory space used to store
the value of the variable is freed.
Regexp
See "Regular Expression."
Regular Expression
A regular expression ("regexp" for short) is a pattern that denotes
a set of strings, possibly an infinite set. For example, the
regular expression 'R.*xp' matches any string starting with the
letter 'R' and ending with the letters 'xp'. In 'awk', regular
expressions are used in patterns and in conditional expressions.
Regular expressions may contain escape sequences.
(Regexp.)
Regular Expression Constant
A regular expression constant is a regular expression written
within slashes, such as '/foo/'. This regular expression is chosen
when you write the 'awk' program and cannot be changed during its
execution. (Regexp Usage.)
Regular Expression Operators
See "Metacharacters."
Rounding
Rounding the result of an arithmetic operation can be tricky. More
than one way of rounding exists, and in 'gawk' it is possible to
choose which method should be used in a program. (Setting the
rounding mode.)
Rule
A segment of an 'awk' program that specifies how to process single
input records. A rule consists of a "pattern" and an "action".
'awk' reads an input record; then, for each rule, if the input
record satisfies the rule's pattern, 'awk' executes the rule's
action. Otherwise, the rule does nothing for that input record.
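The pattern-and-action structure described under "Rule" can be sketched with a single rule. The input numbers are made up for the example.

```shell
# Pattern: $1 > 10.  Action: print the record with a label.
out=$(printf '5\n12\n7\n20\n' | awk '$1 > 10 { print "big:", $1 }')
echo "$out"
```

Records that do not satisfy the pattern produce no output.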
Rvalue
A value that can appear on the right side of an assignment
operator. In 'awk', essentially every expression has a value.
These values are rvalues.
Scalar
A single value, be it a number or a string. Regular variables are
scalars; arrays and functions are not.
Search Path
In 'gawk', a list of directories to search for 'awk' program source
files. In the shell, a list of directories to search for
executable programs.
'sed'
See "Stream Editor."
Seed
The initial value, or starting point, for a sequence of random
numbers.
Shell
The command interpreter for Unix and POSIX-compliant systems. The
shell works both interactively, and as a programming language for
batch files, or shell scripts.
Short-Circuit
The nature of the 'awk' logical operators '&&' and '||'. If the
value of the entire expression is determinable from evaluating just
the lefthand side of these operators, the righthand side is not
evaluated. (Boolean Ops.)
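A small sketch of short-circuit evaluation. The helper 'noisy' is hypothetical; it counts how often it is called, so that a skipped righthand side shows up as a count of zero.

```shell
out=$(awk '
function noisy() { calls++; return 1 }   # hypothetical helper that counts its calls
BEGIN {
    if (0 && noisy()) x = 1    # left side is false, so noisy() is skipped
    if (1 || noisy()) y = 1    # left side is true, so noisy() is skipped
    print calls + 0            # adding 0 forces a numeric result
}')
echo "$out"
```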
Side Effect
A side effect occurs when an expression has an effect aside from
merely producing a value. Assignment expressions, increment and
decrement expressions, and function calls have side effects.
(Assignment Ops.)
Single Precision
An internal representation of numbers that can have fractional
parts. Single precision numbers keep track of fewer digits than do
double precision numbers, but operations on them are sometimes less
expensive in terms of CPU time. This is the type used by some
ancient versions of 'awk' to store numeric values. It is the C
type 'float'.
Space
The character generated by hitting the space bar on the keyboard.
Special File
A file name interpreted internally by 'gawk', instead of being
handed directly to the underlying operating system--for example,
'/dev/stderr'. (Special Files.)
Statement
An expression inside an 'awk' program in the action part of a
pattern-action rule, or inside an 'awk' function. A statement can
be a variable assignment, an array operation, a loop, etc.
Stream Editor
A program that reads records from an input stream and processes
them one or more at a time. This is in contrast with batch
programs, which may expect to read their input files in entirety
before starting to do anything, as well as with interactive
programs which require input from the user.
String
A datum consisting of a sequence of characters, such as 'I am a
string'. Constant strings are written with double quotes in the
'awk' language and may contain escape sequences. (Escape
Sequences.)
Tab
The character generated by hitting the 'TAB' key on the keyboard.
It usually expands to up to eight spaces upon output.
Text Domain
A unique name that identifies an application. Used for grouping
messages that are translated at runtime into the local language.
Timestamp
A value in the "seconds since the epoch" format used by Unix and
POSIX systems. Used for the 'gawk' functions 'mktime()',
'strftime()', and 'systime()'. See also "Epoch," "GMT," and "UTC."
Unix
A computer operating system originally developed in the early
1970s at AT&T Bell Laboratories. It initially became popular in
universities around the world and later moved into commercial
environments as a software development system and network server
system. There are many commercial versions of Unix, as well as
several work-alike systems whose source code is freely available
(such as GNU/Linux, NetBSD (http://www.netbsd.org), FreeBSD
(https://www.freebsd.org), and OpenBSD (http://www.openbsd.org)).
UTC
The accepted abbreviation for "Coordinated Universal Time." This
is standard time in Greenwich, England, which is used as a
reference time for day and date calculations. See also "Epoch" and
"GMT."
Variable
A name for a value. In 'awk', variables may be either scalars or
arrays.
Whitespace
A sequence of space, TAB, or newline characters occurring inside an
input record or a string.