gawk: Input Parsers
16.4.5.4 Customized Input Parsers
.................................
By default, 'gawk' reads text files as its input. It uses the value of
'RS' to find the end of the record, and then uses 'FS' (or 'FIELDWIDTHS'
or 'FPAT') to split it into fields (see Reading Files).
Additionally, it sets the value of 'RT' (see Built-in Variables).
If you want, you can provide your own custom input parser. An input
parser's job is to return a record to the 'gawk' record-processing code,
along with indicators for the value and length of the data to be used
for 'RT', if any.
To provide an input parser, you must first provide two functions
(where XXX is a prefix name for your extension):
'awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);'
This function examines the information available in 'iobuf' (which
we discuss shortly). Based on the information there, it decides if
the input parser should be used for this file. If so, it should
return true. Otherwise, it should return false. It should not
change any state (variable values, etc.) within 'gawk'.
'awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);'
When 'gawk' decides to hand control of the file over to the input
parser, it calls this function. This function in turn must fill in
certain fields in the 'awk_input_buf_t' structure and ensure that
certain conditions are true. It should then return true. If an
error of some kind occurs, it should not fill in any fields and
should return false; then 'gawk' will not use the input parser.
The details are presented shortly.
Your extension should package these functions inside an
'awk_input_parser_t', which looks like this:
typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;
The fields are:
'const char *name;'
The name of the input parser. This is a regular C string.
'awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);'
A pointer to your 'XXX_can_take_file()' function.
'awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);'
A pointer to your 'XXX_take_control_of()' function.
'awk_const struct awk_input_parser *awk_const next;'
This is for use by 'gawk'; therefore it is marked 'awk_const' so
that the extension cannot modify it.
The steps for installing an input parser are as follows:
1. Create a 'static awk_input_parser_t' variable and initialize it
appropriately.
2. When your extension is loaded, register your input parser with
'gawk' using the 'register_input_parser()' API function (described
next).
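The two steps above can be sketched as follows. This is only an illustration, not code from 'gawk': so that it compiles on its own, it declares trimmed stand-ins for the types that '<gawkapi.h>' normally supplies and a stand-in 'register_input_parser()' that merely records its argument. The 'demo' prefix, the '.demo' file suffix, and 'demo_init()' are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for declarations normally supplied by <gawkapi.h>; they are
 * trimmed so that this sketch compiles without the gawk sources. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define awk_const const
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;   /* filename */
    int fd;             /* file descriptor */
} awk_input_buf_t;

typedef struct awk_input_parser {
    const char *name;
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;
} awk_input_parser_t;

/* Stand-in for the API call: it only remembers the parser it was given. */
static awk_input_parser_t *registered_parser = NULL;

void register_input_parser(awk_input_parser_t *input_parser)
{
    registered_parser = input_parser;
}

/* Hypothetical policy: claim opened files whose names end in ".demo". */
awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    size_t len;

    if (iobuf == NULL || iobuf->name == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;
    len = strlen(iobuf->name);
    if (len >= 5 && strcmp(iobuf->name + len - 5, ".demo") == 0)
        return awk_true;
    return awk_false;
}

/* Stub: a real parser would fill in get_record or read_func here.
 * Returning false tells gawk not to use the parser. */
awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static, initialized awk_input_parser_t.
 * The 'next' member belongs to gawk, so it starts out as NULL. */
static awk_input_parser_t demo_parser = {
    "demo", demo_can_take_file, demo_take_control_of, NULL
};

/* Step 2: register the parser when the extension is loaded. */
awk_bool_t demo_init(void)
{
    register_input_parser(&demo_parser);
    return awk_true;
}
```

In an actual extension, the stand-ins would be replaced by '#include "gawkapi.h"', and 'demo_init()' would be run from the extension's initialization code.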
An 'awk_input_buf_t' looks like this:
typedef struct awk_input {
    const char *name;   /* filename */
    int fd;             /* file descriptor */
#define INVALID_HANDLE (-1)
    void *opaque;       /* private data for input parsers */
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;   /* stat buf */
} awk_input_buf_t;
The fields can be divided into two categories: those for use
(initially, at least) by 'XXX_can_take_file()', and those for use by
'XXX_take_control_of()'. The first group of fields and their uses are
as follows:
'const char *name;'
The name of the file.
'int fd;'
A file descriptor for the file. If 'gawk' was able to open the
file, then 'fd' will _not_ be equal to 'INVALID_HANDLE'.
Otherwise, it will.
'struct stat sbuf;'
If the file descriptor is valid, then 'gawk' will have filled in
this structure via a call to the 'fstat()' system call.
The 'XXX_can_take_file()' function should examine these fields and
decide if the input parser should be used for the file. The decision
can be made based upon 'gawk' state (the value of a variable defined
previously by the extension and set by 'awk' code), the name of the
file, whether or not the file descriptor is valid, the information in
the 'struct stat', or any combination of these factors.
Once 'XXX_can_take_file()' has returned true, and 'gawk' has decided
to use your input parser, it calls 'XXX_take_control_of()'. That
function then fills either the 'get_record' field or the 'read_func'
field in the 'awk_input_buf_t'. It must also ensure that 'fd' is _not_
set to 'INVALID_HANDLE'. The following list describes the fields that
may be filled by 'XXX_take_control_of()':
'void *opaque;'
This is used to hold any state information needed by the input
parser for this file. It is "opaque" to 'gawk'. The input parser
is not required to use this pointer.
'int (*get_record)(char **out,'
' struct awk_input *iobuf,'
' int *errcode,'
' char **rt_start,'
' size_t *rt_len,'
' const awk_fieldwidth_info_t **field_width);'
This function pointer should point to a function that creates the
input records. Said function is the core of the input parser. Its
behavior is described in the text following this list.
'ssize_t (*read_func)();'
This function pointer should point to a function that has the same
behavior as the standard POSIX 'read()' system call. It is an
alternative to the 'get_record' pointer. Its behavior is also
described in the text following this list.
'void (*close_func)(struct awk_input *iobuf);'
This function pointer should point to a function that does the
"teardown." It should release any resources allocated by
'XXX_take_control_of()'. It may also close the file. If it does
so, it should set the 'fd' field to 'INVALID_HANDLE'.
If 'fd' is still not 'INVALID_HANDLE' after the call to this
function, 'gawk' calls the regular 'close()' system call.
Having a "teardown" function is optional. If your input parser
does not need it, do not set this field. In that case, 'gawk' calls
the regular 'close()' system call on the file descriptor, so the
descriptor must still be valid.
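The following sketch shows one way 'XXX_take_control_of()' and a matching teardown function might fit together. It assumes a hypothetical private state structure ('demo_state_t') and uses a placeholder 'get_record' function that always reports end-of-file. The type definitions are trimmed stand-ins for those in '<gawkapi.h>' (the 'read_func' member is omitted for brevity).

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Trimmed stand-ins for declarations from <gawkapi.h>. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t; /* left opaque here */

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical per-file state kept behind the opaque pointer. */
typedef struct {
    size_t records_returned;
} demo_state_t;

/* Placeholder record function: always reports end-of-file. */
int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                    char **rt_start, size_t *rt_len,
                    const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) rt_len; (void) field_width;
    return EOF;
}

/* Teardown: release what take_control_of allocated, then close the file.
 * Setting fd to INVALID_HANDLE tells gawk not to call close() again. */
void demo_close(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL)
        return;
    free(iobuf->opaque);
    iobuf->opaque = NULL;
    if (iobuf->fd != INVALID_HANDLE) {
        (void) close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;
    }
}

/* On any failure, return false without touching the fields of iobuf. */
awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *state;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    state = calloc(1, sizeof(*state));
    if (state == NULL)
        return awk_false;

    iobuf->opaque = state;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

Allocating the state before assigning any field keeps the failure path simple: nothing in 'iobuf' has been modified when the function returns false.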
The 'XXX_get_record()' function does the work of creating input
records. The parameters are as follows:
'char **out'
This is a pointer to a 'char *' variable that is set to point to
the record. 'gawk' makes its own copy of the data and does not take
ownership of the buffer, so the extension must manage this storage.
'struct awk_input *iobuf'
This is the 'awk_input_buf_t' for the file. The fields should be
used for reading data ('fd') and for managing private state
('opaque'), if any.
'int *errcode'
If an error occurs, '*errcode' should be set to an appropriate code
from '<errno.h>'.
'char **rt_start'
'size_t *rt_len'
If the concept of a "record terminator" makes sense, then
'*rt_start' should be set to point to the data to be used for 'RT',
and '*rt_len' should be set to the length of the data. Otherwise,
'*rt_len' should be set to zero. 'gawk' makes its own copy of this
data and does not take ownership of it, so the extension must
manage this storage.
'const awk_fieldwidth_info_t **field_width'
If 'field_width' is not 'NULL', then '*field_width' will be
initialized to 'NULL', and the function may set it to point to a
structure supplying field width information to override the default
field parsing mechanism. Note that this structure will not be
copied by 'gawk'; it must persist at least until the next call to
'get_record' or 'close_func'. Note also that 'field_width' is
'NULL' when 'getline' is assigning the results to a variable, thus
field parsing is not needed. If the parser does set
'*field_width', then 'gawk' uses this layout to parse the input
record, and the 'PROCINFO["FS"]' value will be '"API"' while this
record is active in '$0'. The 'awk_fieldwidth_info_t' data
structure is described below.
The return value is the length of the buffer pointed to by '*out', or
'EOF' if end-of-file was reached or an error occurred.
It is guaranteed that 'errcode' is a valid pointer, so there is no
need to test for a 'NULL' value. 'gawk' sets '*errcode' to zero, so
there is no need to set it unless an error occurs.
If an error does occur, the function should return 'EOF' and set
'*errcode' to a value greater than zero. When 'gawk' sees 'EOF'
returned with a nonzero '*errcode', it automatically updates the
'ERRNO' variable based on the value of '*errcode'. (In general,
setting '*errcode = errno' should do the right thing.)
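To make the calling convention concrete, here is a minimal sketch of a 'get_record' function. It assumes, purely for illustration, that 'XXX_take_control_of()' has already loaded the file contents into memory and stored a hypothetical 'semi_state_t' in 'opaque'; records are terminated by a semicolon, which is reported as 'RT'. The type definitions are trimmed stand-ins for those in '<gawkapi.h>'.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Trimmed stand-ins for declarations from <gawkapi.h>. */
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t; /* left opaque here */

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
} awk_input_buf_t;

/* Hypothetical private state: the whole input held in memory. */
typedef struct {
    char *data;    /* owned by the extension, not by gawk */
    size_t len;    /* number of valid bytes in data */
    size_t pos;    /* offset of the next unread byte */
} semi_state_t;

/* Return the next ';'-terminated record, or EOF when none remain. */
int semi_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                    char **rt_start, size_t *rt_len,
                    const awk_fieldwidth_info_t **field_width)
{
    semi_state_t *st;
    char *start, *end;
    size_t avail;

    (void) field_width;   /* keep gawk's default field splitting */

    if (out == NULL || iobuf == NULL || iobuf->opaque == NULL) {
        *errcode = EINVAL;   /* error: EOF plus a nonzero *errcode */
        return EOF;
    }

    st = iobuf->opaque;
    if (st->pos >= st->len) {
        *rt_len = 0;         /* plain end-of-file: *errcode is left at zero */
        return EOF;
    }

    start = st->data + st->pos;
    avail = st->len - st->pos;
    end = memchr(start, ';', avail);

    *out = start;            /* gawk copies the record from here */
    if (end != NULL) {
        *rt_start = end;     /* RT is the one-byte terminator */
        *rt_len = 1;
        st->pos += (size_t) (end - start) + 1;
        return (int) (end - start);
    }

    *rt_len = 0;             /* final record has no terminator */
    st->pos = st->len;
    return (int) avail;
}
```

Because the record and terminator pointers refer into the extension's own buffer, that buffer has to remain valid until 'gawk' has copied the data; in this sketch it lives for as long as the state does.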
As an alternative to supplying a function that returns an input
record, you may instead supply a function that simply reads bytes, and
let 'gawk' parse the data into records. If you do so, the data should
be returned in the multibyte encoding of the current locale. Such a
function should follow the same behavior as the 'read()' system call,
and you fill in the 'read_func' pointer with its address in the
'awk_input_buf_t' structure.
By default, 'gawk' sets the 'read_func' pointer to point to the
'read()' system call, so your extension need not set this field
unless it supplies its own reading function.
NOTE: You must choose one method or the other: either a function
that returns a record, or one that returns raw data. In
particular, if you supply a function to get a record, 'gawk' will
call it, and will never call the raw read function.
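A raw read function might look like the sketch below, which behaves like 'read()' except that it retries when the call is interrupted by a signal. The names 'retry_read()' and 'raw_take_control_of()' are hypothetical, and the structure is a trimmed stand-in for the one in '<gawkapi.h>'. Note one deliberate difference: the stand-in declares 'read_func' with a full prototype, whereas the declaration shown earlier leaves the parameter list empty.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Trimmed stand-ins for declarations from <gawkapi.h>.
 * Assumption: read_func is given a full prototype here. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    ssize_t (*read_func)(int fd, void *buf, size_t count);
} awk_input_buf_t;

/* Same contract as read(): returns the number of bytes read, 0 at
 * end-of-file, or -1 with errno set. Interrupted calls are retried. */
ssize_t retry_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}

/* Install the raw reader and leave get_record alone, so that gawk
 * itself splits the bytes into records. */
awk_bool_t raw_take_control_of(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    iobuf->read_func = retry_read;
    return awk_true;
}
```

With this arrangement 'RS', 'FS', and 'RT' keep their usual meaning, because the record-splitting logic is still that of 'gawk'.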
'gawk' ships with a sample extension that reads directories,
returning records for each entry in a directory (see Extension Sample
Readdir). You may wish to use that code as a guide for writing your
own input parser.
When writing an input parser, you should think about (and document)
how it is expected to interact with 'awk' code. You may want it to
always be called, and to take effect as appropriate (as the 'readdir'
extension does). Or you may want it to take effect based upon the value
of an 'awk' variable, as the XML extension from the 'gawkextlib' project
does (see gawkextlib). In the latter case, code in a 'BEGINFILE'
rule can look at 'FILENAME' and 'ERRNO' to decide whether or not to
activate an input parser (see BEGINFILE/ENDFILE).
You register your input parser with the following function:
'void register_input_parser(awk_input_parser_t *input_parser);'
Register the input parser pointed to by 'input_parser' with 'gawk'.
If you would like to override the default field parsing mechanism for
a given record, then you must populate an 'awk_fieldwidth_info_t'
structure, which looks like this:
typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;
The fields are:
'awk_bool_t use_chars;'
Set this to 'awk_true' if the field lengths are specified in terms
of potentially multi-byte characters, and set it to 'awk_false' if
the lengths are in terms of bytes. Performance will be better if
the values are supplied in terms of bytes.
'size_t nf;'
Set this to the number of fields in the input record, i.e., 'NF'.
'struct awk_field_info fields[nf];'
This is a variable-length array whose actual dimension should be
'nf'. For each field, the 'skip' element should be set to the
number of characters or bytes, as controlled by the 'use_chars'
flag, to skip before the start of this field. The 'len' element
provides the length of the field. The values in 'fields[0]'
provide the information for '$1', and so on through the
'fields[nf-1]' element containing the information for '$NF'.
A convenience macro 'awk_fieldwidth_info_size(numfields)' is provided
to calculate the appropriate size of a variable-length
'awk_fieldwidth_info_t' structure containing 'numfields' fields. This
can be used as an argument to 'malloc()' or in a union to allocate space
statically. Please refer to the 'readdir_test' sample extension for an
example.
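As a rough illustration of allocating and filling such a structure, the sketch below builds a byte-based layout from an array of field widths separated by a fixed gap. The helper 'make_layout()' is hypothetical. The type and the size macro are written out as stand-ins so that the sketch compiles on its own, and the macro body is an assumption about how '<gawkapi.h>' computes the size; real code should use the definitions from that header.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for declarations from <gawkapi.h>. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;

/* Assumed definition: the base structure plus nf - 1 extra elements. */
#define awk_fieldwidth_info_size(NF) \
    (sizeof(awk_fieldwidth_info_t) + \
     (((NF) - 1) * sizeof(struct awk_field_info)))

/* Build a layout of nf fields with the given widths in bytes, with
 * 'gap' bytes skipped before every field except the first.
 * Returns NULL on failure; the caller frees the result with free(). */
awk_fieldwidth_info_t *make_layout(const size_t *widths, size_t nf, size_t gap)
{
    awk_fieldwidth_info_t *fw;
    size_t i;

    if (widths == NULL || nf == 0)
        return NULL;

    fw = malloc(awk_fieldwidth_info_size(nf));
    if (fw == NULL)
        return NULL;

    fw->use_chars = awk_false;   /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = (i == 0) ? 0 : gap;   /* fields[0] is $1 */
        fw->fields[i].len = widths[i];
    }
    return fw;
}
```

A 'get_record' function could store the returned pointer in '*field_width' when 'field_width' is not 'NULL'. Since 'gawk' does not copy the structure, it would have to be kept alive until the next call to 'get_record' or 'close_func'.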