gawk: Input Parsers

 
 16.4.5.4 Customized Input Parsers
 .................................
 
 By default, 'gawk' reads text files as its input.  It uses the value of
 'RS' to find the end of the record, and then uses 'FS' (or 'FIELDWIDTHS'
 or 'FPAT') to split it into fields (see Reading Files).
 Additionally, it sets the value of 'RT' (see Built-in Variables).
 
    If you want, you can provide your own custom input parser.  An input
 parser's job is to return a record to the 'gawk' record-processing code,
 along with indicators for the value and length of the data to be used
 for 'RT', if any.
 
    To provide an input parser, you must first provide two functions
 (where XXX is a prefix name for your extension):
 
 'awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);'
      This function examines the information available in 'iobuf' (which
      we discuss shortly).  Based on the information there, it decides if
      the input parser should be used for this file.  If so, it should
      return true.  Otherwise, it should return false.  It should not
      change any state (variable values, etc.)  within 'gawk'.
 
 'awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);'
      When 'gawk' decides to hand control of the file over to the input
      parser, it calls this function.  This function in turn must fill in
      certain fields in the 'awk_input_buf_t' structure and ensure that
      certain conditions are true.  It should then return true.  If an
      error of some kind occurs, it should not fill in any fields and
      should return false; then 'gawk' will not use the input parser.
      The details are presented shortly.
 
    Your extension should package these functions inside an
 'awk_input_parser_t', which looks like this:
 
      typedef struct awk_input_parser {
          const char *name;   /* name of parser */
          awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
          awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
          awk_const struct awk_input_parser *awk_const next;   /* for gawk */
      } awk_input_parser_t;
 
    The fields are:
 
 'const char *name;'
      The name of the input parser.  This is a regular C string.
 
 'awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);'
      A pointer to your 'XXX_can_take_file()' function.
 
 'awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);'
      A pointer to your 'XXX_take_control_of()' function.
 
 'awk_const struct awk_input_parser *awk_const next;'
      This is for use by 'gawk'; therefore it is marked 'awk_const' so
      that the extension cannot modify it.
 
    The steps are as follows:
 
   1. Create a 'static awk_input_parser_t' variable and initialize it
      appropriately.
 
   2. When your extension is loaded, register your input parser with
      'gawk' using the 'register_input_parser()' API function (described
      next).
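
    The two steps above might be sketched as follows.  This is a minimal
 sketch, not a complete extension: so that it compiles on its own, it
 declares simplified stand-ins for the API types and a stub
 'register_input_parser()' that merely remembers the pointer it is given.
 A real extension would include 'gawkapi.h' instead.  The 'demo_' prefix,
 the 'registered' variable, and 'init_demo()' are hypothetical names.

```c
#include <stddef.h>

/* Simplified stand-ins for declarations from gawkapi.h. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_input { const char *name; int fd; } awk_input_buf_t;

typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    const struct awk_input_parser *next;    /* for gawk */
} awk_input_parser_t;

/* Hypothetical stub: remembers the parser instead of telling gawk. */
static awk_input_parser_t *registered = NULL;

static void register_input_parser(awk_input_parser_t *input_parser)
{
    registered = input_parser;
}

/* Placeholder: declines every file. */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Placeholder: never takes control. */
static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static, initialized description of the parser. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL
};

/* Step 2: run once, when the extension is loaded. */
static awk_bool_t init_demo(void)
{
    register_input_parser(&demo_parser);
    return awk_true;
}
```

    The parser variable is static because 'gawk' keeps the pointer for
 as long as the extension is loaded.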
 
    An 'awk_input_buf_t' looks like this:
 
      typedef struct awk_input {
          const char *name;       /* filename */
          int fd;                 /* file descriptor */
      #define INVALID_HANDLE (-1)
          void *opaque;           /* private data for input parsers */
          int (*get_record)(char **out, struct awk_input *iobuf,
                            int *errcode, char **rt_start, size_t *rt_len,
                            const awk_fieldwidth_info_t **field_width);
          ssize_t (*read_func)();
          void (*close_func)(struct awk_input *iobuf);
          struct stat sbuf;       /* stat buf */
      } awk_input_buf_t;
 
    The fields can be divided into two categories: those for use
 (initially, at least) by 'XXX_can_take_file()', and those for use by
 'XXX_take_control_of()'.  The first group of fields and their uses are
 as follows:
 
 'const char *name;'
      The name of the file.
 
 'int fd;'
      A file descriptor for the file.  If 'gawk' was able to open the
      file, then 'fd' will _not_ be equal to 'INVALID_HANDLE'.
      Otherwise, it will.
 
 'struct stat sbuf;'
      If the file descriptor is valid, then 'gawk' will have filled in
      this structure via a call to the 'fstat()' system call.
 
    The 'XXX_can_take_file()' function should examine these fields and
 decide if the input parser should be used for the file.  The decision
 can be made based upon 'gawk' state (the value of a variable defined
 previously by the extension and set by 'awk' code), the name of the
 file, whether or not the file descriptor is valid, the information in
 the 'struct stat', or any combination of these factors.
 
    Once 'XXX_can_take_file()' has returned true, and 'gawk' has decided
 to use your input parser, it calls 'XXX_take_control_of()'.  That
 function then fills either the 'get_record' field or the 'read_func'
 field in the 'awk_input_buf_t'.  It must also ensure that 'fd' is _not_
 set to 'INVALID_HANDLE'.  The following list describes the fields that
 may be filled by 'XXX_take_control_of()':
 
 'void *opaque;'
      This is used to hold any state information needed by the input
      parser for this file.  It is "opaque" to 'gawk'.  The input parser
      is not required to use this pointer.
 
 'int (*get_record)(char **out,'
 '                  struct awk_input *iobuf,'
 '                  int *errcode,'
 '                  char **rt_start,'
 '                  size_t *rt_len,'
 '                  const awk_fieldwidth_info_t **field_width);'
      This function pointer should point to a function that creates the
      input records.  Said function is the core of the input parser.  Its
      behavior is described in the text following this list.
 
 'ssize_t (*read_func)();'
      This function pointer should point to a function that has the same
      behavior as the standard POSIX 'read()' system call.  It is an
      alternative to the 'get_record' pointer.  Its behavior is also
      described in the text following this list.
 
 'void (*close_func)(struct awk_input *iobuf);'
      This function pointer should point to a function that does the
      "teardown."  It should release any resources allocated by
      'XXX_take_control_of()'.  It may also close the file.  If it does
      so, it should set the 'fd' field to 'INVALID_HANDLE'.
 
      If 'fd' is still not 'INVALID_HANDLE' after the call to this
      function, 'gawk' calls the regular 'close()' system call.
 
      Having a "teardown" function is optional.  If your input parser
      does not need it, do not set this field.  Then, 'gawk' calls the
      regular 'close()' system call on the file descriptor, so the
      descriptor must still be valid at that point.
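
    A possible shape for 'XXX_take_control_of()' and its matching
 teardown function is sketched next.  It allocates private state first
 and fills in the 'awk_input_buf_t' fields only after every allocation
 has succeeded, so that a failure leaves the structure untouched.  The
 'demo_state_t' type, the buffer size, and the placeholder
 'demo_get_record()' are assumptions for illustration; the declarations
 at the top stand in for 'gawkapi.h'.

```c
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <unistd.h>

/* Simplified stand-ins for declarations from gawkapi.h. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;
#define INVALID_HANDLE (-1)
typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    void (*close_func)(struct awk_input *iobuf);
} awk_input_buf_t;

/* Hypothetical per-file state, kept in iobuf->opaque. */
typedef struct demo_state {
    char *buf;
    size_t size;
} demo_state_t;

/* Placeholder record reader: reports end-of-file immediately. */
static int demo_get_record(char **out, struct awk_input *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) field_width;
    *rt_len = 0;
    return EOF;
}

/* Teardown: release the state and close the file ourselves. */
static void demo_close(struct awk_input *iobuf)
{
    demo_state_t *st = iobuf->opaque;

    if (st != NULL) {
        free(st->buf);
        free(st);
    }
    iobuf->opaque = NULL;
    if (iobuf->fd != INVALID_HANDLE) {
        close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;     /* tell gawk not to close it */
    }
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;
    if ((st = malloc(sizeof *st)) == NULL)
        return awk_false;
    st->size = 4096;
    if ((st->buf = malloc(st->size)) == NULL) {
        free(st);
        return awk_false;
    }
    /* Everything succeeded: only now touch the structure. */
    iobuf->opaque = st;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```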
 
    The 'XXX_get_record()' function does the work of creating input
 records.  The parameters are as follows:
 
 'char **out'
      This is a pointer to a 'char *' variable that is set to point to
      the record.  'gawk' makes its own copy of the data, so the
      extension must manage this storage.
 
 'struct awk_input *iobuf'
      This is the 'awk_input_buf_t' for the file.  The fields should be
      used for reading data ('fd') and for managing private state
      ('opaque'), if any.
 
 'int *errcode'
      If an error occurs, '*errcode' should be set to an appropriate code
      from '<errno.h>'.
 
 'char **rt_start'
 'size_t *rt_len'
      If the concept of a "record terminator" makes sense, then
      '*rt_start' should be set to point to the data to be used for 'RT',
      and '*rt_len' should be set to the length of the data.  Otherwise,
      '*rt_len' should be set to zero.  'gawk' makes its own copy of this
      data, so the extension must manage this storage.
 
 'const awk_fieldwidth_info_t **field_width'
      If 'field_width' is not 'NULL', then '*field_width' will be
      initialized to 'NULL', and the function may set it to point to a
      structure supplying field width information to override the default
      field parsing mechanism.  Note that this structure will not be
      copied by 'gawk'; it must persist at least until the next call to
      'get_record' or 'close_func'.  Note also that 'field_width' is
      'NULL' when 'getline' is assigning the results to a variable, thus
      field parsing is not needed.  If the parser does set
      '*field_width', then 'gawk' uses this layout to parse the input
      record, and the 'PROCINFO["FS"]' value will be '"API"' while this
      record is active in '$0'.  The 'awk_fieldwidth_info_t' data
      structure is described below.
 
    The return value is the length of the buffer pointed to by '*out', or
 'EOF' if end-of-file was reached or an error occurred.
 
    It is guaranteed that 'errcode' is a valid pointer, so there is no
 need to test for a 'NULL' value.  'gawk' sets '*errcode' to zero, so
 there is no need to set it unless an error occurs.
 
    If an error does occur, the function should return 'EOF' and set
 '*errcode' to a value greater than zero.  Whenever '*errcode' is nonzero
 after such a return, 'gawk' automatically updates the 'ERRNO' variable
 based on its value.  (In general, setting '*errcode = errno' should do
 the right thing.)
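
    To make the calling convention concrete, here is a sketch of a
 record function that returns newline-terminated lines read from 'fd'.
 It is only one way to meet the contract described above.  The
 'line_state_t' type and the 'line_get_record()' name are hypothetical,
 the state is assumed to have been allocated with a nonzero capacity by
 'XXX_take_control_of()', and the declarations at the top stand in for
 'gawkapi.h'.

```c
#include <errno.h>
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-ins for gawkapi.h; only what this sketch touches. */
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;
typedef struct awk_input {
    int fd;                 /* file descriptor */
    void *opaque;           /* private data for input parsers */
} awk_input_buf_t;

/* Hypothetical buffered-input state; cap must be nonzero. */
typedef struct line_state {
    char *buf;      /* buffered input */
    size_t cap;     /* allocated size of buf */
    size_t start;   /* offset of first unconsumed byte */
    size_t end;     /* offset one past the last valid byte */
} line_state_t;

static int line_get_record(char **out, struct awk_input *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    line_state_t *st = iobuf->opaque;
    ssize_t n;

    (void) field_width;     /* keep gawk's default field splitting */
    *rt_len = 0;

    for (;;) {
        char *rec = st->buf + st->start;
        char *nl = memchr(rec, '\n', st->end - st->start);

        if (nl != NULL) {
            size_t len = (size_t) (nl - rec);

            *out = rec;         /* stays valid until the next call */
            *rt_start = nl;     /* RT is the newline itself */
            *rt_len = 1;
            st->start += len + 1;
            return (int) len;
        }

        /* No complete line yet: compact, grow if full, read more. */
        memmove(st->buf, rec, st->end - st->start);
        st->end -= st->start;
        st->start = 0;
        if (st->end == st->cap) {
            char *p = realloc(st->buf, st->cap * 2);

            if (p == NULL) {
                *errcode = ENOMEM;
                return EOF;
            }
            st->buf = p;
            st->cap *= 2;
        }

        n = read(iobuf->fd, st->buf + st->end, st->cap - st->end);
        if (n < 0) {
            *errcode = errno;   /* gawk updates ERRNO from this */
            return EOF;
        }
        if (n == 0) {
            size_t len = st->end;

            if (len == 0)
                return EOF;     /* clean end-of-file, *errcode is 0 */
            *out = st->buf;     /* final record without a terminator */
            st->start = st->end;
            return (int) len;
        }
        st->end += (size_t) n;
    }
}
```

    Because 'gawk' copies the record and the terminator before calling
 the function again, the sketch can hand out pointers into its own
 buffer and reuse that buffer on the next call.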
 
    As an alternative to supplying a function that returns an input
 record, you may instead supply a function that simply reads bytes, and
 let 'gawk' parse the data into records.  If you do so, the data should
 be returned in the multibyte encoding of the current locale.  Such a
 function should follow the same behavior as the 'read()' system call,
 and you fill in the 'read_func' pointer with its address in the
 'awk_input_buf_t' structure.
 
    By default, 'gawk' sets the 'read_func' pointer to point to the
 'read()' system call.  So your extension need not set this field
 explicitly.
 
      NOTE: You must choose one method or the other: either a function
      that returns a record, or one that returns raw data.  In
      particular, if you supply a function to get a record, 'gawk' will
      call it, and will never call the raw read function.
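
    The raw-data alternative can be sketched as a thin wrapper around
 'read()'.  The hypothetical 'nul_read()' below turns NUL bytes into
 newlines so that NUL-separated input can be split with the default
 'RS'; it changes bytes in place and therefore keeps the return-value
 contract of 'read()' intact.  The structure shown earlier declares
 'read_func' without a prototype; the stand-in declaration here assumes
 a prototype matching 'read()'.

```c
#include <sys/types.h>
#include <unistd.h>

/* Stand-ins for gawkapi.h; read_func is given an assumed prototype. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)
typedef struct awk_input {
    int fd;
    ssize_t (*read_func)(int fd, void *buf, size_t count);
} awk_input_buf_t;

/*
 * Behaves like read(): returns the byte count, 0 at end-of-file,
 * or -1 on error with errno set by the underlying call.
 */
static ssize_t nul_read(int fd, void *buf, size_t count)
{
    ssize_t n = read(fd, buf, count);
    char *p = buf;
    ssize_t i;

    for (i = 0; i < n; i++)     /* skipped when n is 0 or -1 */
        if (p[i] == '\0')
            p[i] = '\n';
    return n;
}

/* Install the raw reader; get_record is deliberately left alone. */
static awk_bool_t nul_take_control_of(awk_input_buf_t *iobuf)
{
    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;
    iobuf->read_func = nul_read;
    return awk_true;
}
```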
 
    'gawk' ships with a sample extension that reads directories,
 returning records for each entry in a directory (see Extension Sample
 Readdir).  You may wish to use that code as a guide for writing your
 own input parser.
 
    When writing an input parser, you should think about (and document)
 how it is expected to interact with 'awk' code.  You may want it to
 always be called, and to take effect as appropriate (as the 'readdir'
 extension does).  Or you may want it to take effect based upon the value
 of an 'awk' variable, as the XML extension from the 'gawkextlib' project
 does (see gawkextlib).  In the latter case, code in a 'BEGINFILE'
 rule can look at 'FILENAME' and 'ERRNO' to decide whether or not to
 activate an input parser (see BEGINFILE/ENDFILE).
 
    You register your input parser with the following function:
 
 'void register_input_parser(awk_input_parser_t *input_parser);'
      Register the input parser pointed to by 'input_parser' with 'gawk'.
 
    If you would like to override the default field parsing mechanism for
 a given record, then you must populate an 'awk_fieldwidth_info_t'
 structure, which looks like this:
 
      typedef struct {
              awk_bool_t     use_chars; /* false ==> use bytes */
              size_t         nf;        /* number of fields in record (NF) */
              struct awk_field_info {
                      size_t skip;      /* amount to skip before field starts */
                      size_t len;       /* length of field */
              } fields[1];              /* actual dimension should be nf */
      } awk_fieldwidth_info_t;
 
    The fields are:
 
 'awk_bool_t use_chars;'
      Set this to 'awk_true' if the field lengths are specified in terms
      of potentially multi-byte characters, and set it to 'awk_false' if
      the lengths are in terms of bytes.  Performance will be better if
      the values are supplied in terms of bytes.
 
 'size_t nf;'
      Set this to the number of fields in the input record, i.e., 'NF'.
 
 'struct awk_field_info fields[nf];'
      This is a variable-length array whose actual dimension should be
      'nf'.  For each field, the 'skip' element should be set to the
      number of characters or bytes, as controlled by the 'use_chars'
      flag, to skip before the start of this field.  The 'len' element
      provides the length of the field.  The values in 'fields[0]'
      provide the information for '$1', and so on through the
      'fields[nf-1]' element containing the information for '$NF'.
 
    A convenience macro 'awk_fieldwidth_info_size(numfields)' is provided
 to calculate the appropriate size of a variable-length
 'awk_fieldwidth_info_t' structure containing 'numfields' fields.  This
 can be used as an argument to 'malloc()' or in a union to allocate space
 statically.  Please refer to the 'readdir_test' sample extension for an
 example.
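
    Independently of that sample, the allocation pattern might look like
 the following sketch, which builds a byte-based layout for fixed-width
 fields separated by a constant gap.  The 'make_layout()' helper and its
 parameters are hypothetical.  The declarations at the top stand in for
 'gawkapi.h', and the macro definition is an assumed one that matches
 the description above.

```c
#include <stdlib.h>

/* Stand-ins for gawkapi.h; the macro body is an assumption. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;

#define awk_fieldwidth_info_size(NF) (sizeof(awk_fieldwidth_info_t) + \
        (((NF) - 1) * sizeof(struct awk_field_info)))

/*
 * Allocate a layout for NF fields whose byte lengths are given in
 * WIDTHS, with GAP bytes skipped before every field but the first.
 * Returns NULL if NF is zero or if memory runs out.
 */
static awk_fieldwidth_info_t *
make_layout(const size_t *widths, size_t nf, size_t gap)
{
    awk_fieldwidth_info_t *fw;
    size_t i;

    if (nf == 0)
        return NULL;    /* the size macro expects at least one field */
    fw = malloc(awk_fieldwidth_info_size(nf));
    if (fw == NULL)
        return NULL;
    fw->use_chars = awk_false;      /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = (i == 0) ? 0 : gap;
        fw->fields[i].len = widths[i];
    }
    return fw;
}
```

    A 'get_record' function would store the returned pointer in
 '*field_width' when 'field_width' is not 'NULL', and would free the
 structure no earlier than its next call or the call to 'close_func'.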