gawk: Array Sorting Functions

 
 12.2.2 Sorting Array Values and Indices with 'gawk'
 ---------------------------------------------------
 
 In most 'awk' implementations, sorting an array requires writing a
 'sort()' function.  This can be educational for exploring different
 sorting algorithms, but usually that's not the point of the program.
 'gawk' provides the built-in 'asort()' and 'asorti()' functions (See
 String Functions) for sorting arrays.  For example:
 
      POPULATE THE ARRAY data
      n = asort(data)
      for (i = 1; i <= n; i++)
          DO SOMETHING WITH data[i]
 
    After the call to 'asort()', the array 'data' is indexed from 1 to
 some number N, the total number of elements in 'data'.  (This count is
 'asort()''s return value.)  'data[1]' <= 'data[2]' <= 'data[3]', and so
 on.  The default comparison is based on the type of the elements (See
 Typing and Comparison).  All numeric values come before all string
 values, which in turn come before all subarrays.
 
    An important side effect of calling 'asort()' is that _the array's
 original indices are irrevocably lost_.  As this isn't always desirable,
 'asort()' accepts a second argument:
 
      POPULATE THE ARRAY source
      n = asort(source, dest)
      for (i = 1; i <= n; i++)
          DO SOMETHING WITH dest[i]
 
    In this case, 'gawk' copies the 'source' array into the 'dest' array
 and then sorts 'dest', destroying its indices.  However, the 'source'
 array is not affected.
 
    Often, what's needed is to sort on the values of the _indices_
 instead of the values of the elements.  To do that, use the 'asorti()'
 function.  The interface and behavior are identical to that of
 'asort()', except that the index values are used for sorting and become
 the values of the result array:
 
      { source[$0] = some_func($0) }
 
      END {
          n = asorti(source, dest)
          for (i = 1; i <= n; i++) {
              Work with sorted indices directly:
              DO SOMETHING WITH dest[i]
              ...
              Access original array via sorted indices:
              DO SOMETHING WITH source[dest[i]]
          }
      }
 
    So far, so good.  Now it starts to get interesting.  Both 'asort()'
 and 'asorti()' accept a third string argument to control comparison of
 array elements.  When we introduced 'asort()' and 'asorti()' in See
 String Functions, we ignored this third argument; however, now is the
 time to describe how this argument affects these two functions.
 
    Basically, the third argument specifies how the array is to be
 sorted.  There are two possibilities.  As with 'PROCINFO["sorted_in"]',
 this argument may be one of the predefined names that 'gawk' provides
 (SeeControlling Scanning), or it may be the name of a user-defined
 function (SeeControlling Array Traversal).
 
    In the latter case, _the function can compare elements in any way it
 chooses_, taking into account just the indices, just the values, or
 both.  This is extremely powerful.
 
    Once the array is sorted, 'asort()' takes the _values_ in their final
 order and uses them to fill in the result array, whereas 'asorti()'
 takes the _indices_ in their final order and uses them to fill in the
 result array.
 
      NOTE: Copying array indices and elements isn't expensive in terms
      of memory.  Internally, 'gawk' maintains "reference counts" to
      data.  For example, when 'asort()' copies the first array to the
      second one, there is only one copy of the original array elements'
      data, even though both arrays use the values.
 
    Because 'IGNORECASE' affects string comparisons, the value of
 'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'.
 Note also that the locale's sorting order does _not_ come into play;
 comparisons are based on character values only.(1)
 
    The following example demonstrates the use of a comparison function
 with 'asort()'.  The comparison function, 'case_fold_compare()', maps
 both values to lowercase in order to compare them ignoring case.
 
      # case_fold_compare --- compare as strings, ignoring case
 
      function case_fold_compare(i1, v1, i2, v2,    l, r)
      {
          l = tolower(v1)
          r = tolower(v2)
 
          if (l < r)
              return -1
          else if (l == r)
              return 0
          else
              return 1
      }
 
    And here is the test program for it:
 
      # Test program
 
      BEGIN {
          Letters = "abcdefghijklmnopqrstuvwxyz" \
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          split(Letters, data, "")
 
          asort(data, result, "case_fold_compare")
 
          j = length(result)
          for (i = 1; i <= j; i++) {
              printf("%s", result[i])
              if (i % (j/2) == 0)
                  printf("\n")
              else
                  printf(" ")
          }
      }
 
    When run, we get the following:
 
      $ gawk -f case_fold_compare.awk
      -| A a B b c C D d e E F f g G H h i I J j k K l L M m
      -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
 
    ---------- Footnotes ----------
 
    (1) This is true because locale-based comparison occurs only when in
 POSIX-compatibility mode, and because 'asort()' and 'asorti()' are
 'gawk' extensions, they are not available in that case.