gawk: Array Sorting Functions
12.2.2 Sorting Array Values and Indices with 'gawk'
---------------------------------------------------
In most 'awk' implementations, sorting an array requires writing a
'sort()' function. This can be educational for exploring different
sorting algorithms, but usually that's not the point of the program.
'gawk' provides the built-in 'asort()' and 'asorti()' functions (
String Functions) for sorting arrays. For example:
POPULATE THE ARRAY data
n = asort(data)
for (i = 1; i <= n; i++)
DO SOMETHING WITH data[i]
After the call to 'asort()', the array 'data' is indexed from 1 to
some number N, the total number of elements in 'data'. (This count is
'asort()''s return value.) 'data[1]' <= 'data[2]' <= 'data[3]', and so
on. The default comparison is based on the type of the elements (
Typing and Comparison). All numeric values come before all string
values, which in turn come before all subarrays.
An important side effect of calling 'asort()' is that _the array's
original indices are irrevocably lost_. As this isn't always desirable,
'asort()' accepts a second argument:
POPULATE THE ARRAY source
n = asort(source, dest)
for (i = 1; i <= n; i++)
DO SOMETHING WITH dest[i]
In this case, 'gawk' copies the 'source' array into the 'dest' array
and then sorts 'dest', destroying its indices. However, the 'source'
array is not affected.
Often, what's needed is to sort on the values of the _indices_
instead of the values of the elements. To do that, use the 'asorti()'
function. The interface and behavior are identical to that of
'asort()', except that the index values are used for sorting and become
the values of the result array:
{ source[$0] = some_func($0) }
END {
n = asorti(source, dest)
for (i = 1; i <= n; i++) {
Work with sorted indices directly:
DO SOMETHING WITH dest[i]
...
Access original array via sorted indices:
DO SOMETHING WITH source[dest[i]]
}
}
So far, so good. Now it starts to get interesting. Both 'asort()'
and 'asorti()' accept a third string argument to control comparison of
array elements. When we introduced 'asort()' and 'asorti()' in
String Functions, we ignored this third argument; however, now is the
time to describe how this argument affects these two functions.
Basically, the third argument specifies how the array is to be
sorted. There are two possibilities. As with 'PROCINFO["sorted_in"]',
this argument may be one of the predefined names that 'gawk' provides
(Controlling Scanning), or it may be the name of a user-defined
function (Controlling Array Traversal).
In the latter case, _the function can compare elements in any way it
chooses_, taking into account just the indices, just the values, or
both. This is extremely powerful.
Once the array is sorted, 'asort()' takes the _values_ in their final
order and uses them to fill in the result array, whereas 'asorti()'
takes the _indices_ in their final order and uses them to fill in the
result array.
NOTE: Copying array indices and elements isn't expensive in terms
of memory. Internally, 'gawk' maintains "reference counts" to
data. For example, when 'asort()' copies the first array to the
second one, there is only one copy of the original array elements'
data, even though both arrays use the values.
Because 'IGNORECASE' affects string comparisons, the value of
'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'.
Note also that the locale's sorting order does _not_ come into play;
comparisons are based on character values only.(1)
The following example demonstrates the use of a comparison function
with 'asort()'. The comparison function, 'case_fold_compare()', maps
both values to lowercase in order to compare them ignoring case.
# case_fold_compare --- compare as strings, ignoring case
function case_fold_compare(i1, v1, i2, v2, l, r)
{
l = tolower(v1)
r = tolower(v2)
if (l < r)
return -1
else if (l == r)
return 0
else
return 1
}
And here is the test program for it:
# Test program
BEGIN {
Letters = "abcdefghijklmnopqrstuvwxyz" \
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
split(Letters, data, "")
asort(data, result, "case_fold_compare")
j = length(result)
for (i = 1; i <= j; i++) {
printf("%s", result[i])
if (i % (j/2) == 0)
printf("\n")
else
printf(" ")
}
}
When run, we get the following:
$ gawk -f case_fold_compare.awk
-| A a B b c C D d e E F f g G H h i I J j k K l L M m
-| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
---------- Footnotes ----------
(1) This is true because locale-based comparison occurs only when in
POSIX-compatibility mode, and because 'asort()' and 'asorti()' are
'gawk' extensions, they are not available in that case.