https://invisible-island.net/mawk/


mawk-arrays(7)


NAME

       mawk-arrays - design notes for mawk's array implementation

SYNOPSIS

       This  is  the  documentation for the mawk implementation of awk arrays.
       Arrays in awk are associations of strings to awk  scalar  values.   The
       mawk  implementation  stores the associations in hash tables.  The hash
       table scheme was influenced by and is similar to the  design  presented
       in  Griswold  and  Townsend,  The  Design and Implementation of Dynamic
       Hashing Sets and Tables in Icon}, Software Practice and Experience, 23,
       351-367, 1993.

Data Structures

Array Structure

       The  type  ARRAY is a pointer to a struct array.  The size field is the
       number of elements in the table.   The  meaning  of  the  other  fields
       depends on the type field.

       There are three types of arrays and these are distinguished by the type
       field in the structure.  The types are:

       AY_NULL
            The array is empty and the size field is always zero.   The  other
            fields have no meaning.

       AY_SPLIT
            The array was created by the AWK built-in split.  The return value
            from split is stored in the size field.  The ptr field points at a
            vector  of  CELLs.  The number of CELLs is the limit field.  It is
            always  true  that  size <= limit.   The  address   of   A[i]   is
            (CELL*)A->ptr+i-1  for  1<= i <=  size.   The  hmask  field has no
            meaning.

       Hash Table
            The array is a hash table.  If the AY_STR bit in the type field is
            set, then the table is keyed on strings.  If the AY_INT bit in the
            type field is set, then the table is keyed on integers.  Both bits
            can be set, and then the two keys are consistent, i.e., look up of
            A[-14] and A["-14"] will return identical CELL  pointers  although
            the  look  up  methods  will be different.  In this case, the size
            field is the number of hash nodes in the table.  When insertion of
            a new element would cause size to exceed limit, the table grows by
            doubling   the   number   of   hash   chains.    The    invariant,
            (hmask+1)max_ave_list_length=limit       is      always      true.
            Max_ave_list_length is a tunable constant.

Hash Tables

       The hash tables are linked lists of nodes, called ANODEs.   The  number
       of  lists  is  hmask+1  which  is always a power of two.  The ptr field
       points at a vector of list heads.   Since  there  are  potentially  two
       types  of  lists,  integer lists and strings lists, each list head is a
       structure, DUAL_LINK.

       The string lists are chains connected by slinks and the  integer  lists
       are  chains  connected by ilinks.  We sometimes refer to these lists as
       slists and ilists, respectively.  The elements on the lists are ANODEs.
       The fields of an ANODE are:

       slink  The  link  field  for  slists.  ilink The link field for ilists.
       sval If non-null, then sval is a pointer to a string key.  For a  given
       table,  if  the  AY_STR bit is set then every ANODE has a non-null sval
       field and conversely, if AY_STR is not set, then every  sval  field  is
       null.

       hval  The  hash  value  of  sval.  This field has no meaning if sval is
       null.

       ival The integer key.  The field has no meaning if set to the constant,
       NOT_AN_IVALUE.   If the AY_STR bit is off, then every ANODE will have a
       valid ival field.  If the AY_STR bit is on, then the ival field may  or
       may not be valid.

       cell The data field in the hash table.  \ndhitems

       So  the  value of A[expr is stored in the cell field, and if expr is an
       integer, then expr is stored in ival, else it is stored in sval.

Array Operations

       The functions that operate on arrays are,

       CELL* array_find(ARRAY A, CELL *cp, int create_flag)
            returns a pointer to A[expr] where cp is a  pointer  to  the  CELL
            holding expr.  If the create_flag is on and expr is not an element
            of A, then the element is created with value null.

       void array_delete(ARRAY A, CELL *cp)
            removes an element A[expr from the array A.  cp points at the CELL
            holding expr.

       void array_load(ARRAY A, size_t cnt)
            builds  a split array.  The values A[1..cnt] are moved into A from
            an anonymous buffer with transfer_to_array() which is declared  in
            split.h.

       void array_clear(ARRAY A) removes all elements of A.
            The type of A is then AY_NULL.

       STRING** array_loop_vector(ARRAY A, size_t *sizep)
            returns  a  pointer  to a linear vector that holds all the strings
            that are indices of A.  The size of the  the  vector  is  returned
            indirectly in *sizep.  If A->size==0, a null pointer is returned.

       CELL* array_cat(CELL *sp, int cnt)
            concatenates  the  elements  of  sp[1-cnt..0],  with  each element
            separated by SUBSEP, to compute an array index.  For example, on a
            reference  to  A[i,j],  array_cat  computes i O SUBSEP O j where O
            denotes concatenation.

Array Find

       Any reference to A[expr]  creates  a  call  to  array_find(A,cp,CREATE)
       where cp points at the cell holding expr.  The test, expr in A, creates
       a call to array_find(A,cp,NO_CREATE).

       Array_find is a hash-table lookup function that handles two cases:

       1.   If *cp is numeric and integer valued, then lookup by integer value
            using  find_by_ival.   If  *cp is numeric, but not integer valued,
            then convert to string with sprintf(CONVFMT,...) and go to case~2.

       2.   If *cp is  string  valued,  then  lookup  by  string  value  using
            find_by_sval.  \ndlist

       To  test whether cp->dval is integer, we convert to the nearest integer
       by rounding towards zero (done  by  do_to_I)  and  then  cast  back  to
       double.   If  we  get the same number we started with, then cp->dval is
       integer valued.

       When we get to the function find_by_ival, the search has  been  reduced
       to lookup in a hash table by integer value.

       When  a search by integer value fails, we have to check by string value
       to correctly handle the case insertion by A["123"] and later search  as
       A[123].   This string search is necessary if and only if the AY_STR bit
       is set.  An important point is that all ANODEs get created with a valid
       sval if AY_STR is set, because then creation of new nodes always occurs
       in a call to find_by_sval.

       Searching by string value is  easier  because  AWK  arrays  are  really
       string  associations.   If  the array does not have the AY_STR bit set,
       then we have to convert the array to a dual  hash  table  with  strings
       which is done by the function add_string_associations.

       One Int value is reserved to show that the ival field is invalid.  This
       works because d_to_I returns a value in [-Max_Int, Max_Int].

       On entry to add_string_associations, we know that the AY_STR bit is not
       set.   We convert to a dual hash table, then walk all the integer lists
       and put each ANODE on a string list.

Array Delete

       The execution of the statement,  delete  A[expr],  creates  a  call  to
       array_delete(ARRAY  A,  CELL  *cp).   Depending on the type of *cp, the
       call  is  routed  to  find_by_sval  or  find_by_ival.   Each  of  these
       functions  leaves  its  return value on the front of an slist or ilist,
       respectively, and then it is deleted from the front of the  list.   The
       case  where  A[expr is on two lists, e.g., A[12] and A["12"] is checked
       by examining the sval and ival fields of the returned ANODE*.

       Even though we found a node by searching an ilist it might also  be  on
       an slist and vice-versa.

       When  the size of a hash table drops below a certain value, it might be
       profitable to shrink the hash  table.   Currently  we  don't  do  this,
       because  our  guess  is  that  it would be a waste of time for most AWK
       applications.  However, we do convert an array to AY_NULL when the size
       goes  to  zero  which  would  resize  a  large hash table that had been
       completely cleared by successive deletions.

Building an Array with Split

       A simple operation is to create an array with the AWK primitive  split.
       The  code  that  performs split puts the pieces in an anonymous buffer.
       array_load(A, cnt) moves the cnt elements  from  the  anonymous  buffer
       into A.  This is the only way an array of type AY_SPLIT is created.

       If  the  array  A  is  a  split  array and big enough then we reuse it,
       otherwise we need to allocate a new split array.  When  we  allocate  a
       block of CELLs for a split array, we round up to a multiple of 4.

Array Clear

       The  function array_clear(ARRAY A) converts A to type AY_NULL and frees
       all storage used by  A  except  for  the  struct  array  itself.   This
       function gets called in three contexts:

       (1)  when an array local to a user function goes out of scope,

       (2)  execution of the AWK statement, delete A and

       (3)  when an existing changes type or size from split().

Constructor and Conversions

       Arrays  are  always  created  as  empty arrays of type AY_NULL.  Global
       arrays are never destroyed although they can go  empty  or  have  their
       type change by conversion.  The only constructor function is a macro.

       Hash  tables  only  get constructed by conversion.  This happens in two
       ways.  The function make_empty_table converts an empty  array  of  type
       AY_NULL  to an empty hash table.  The number of lists in the table is a
       power of 2 determined by the constant STARTING_HMASK.  The  limit  size
       of the table is determined by the constant MAX_AVE_LIST_LENGTH which is
       the largest average size of the hash  lists  that  we  are  willing  to
       tolerate  before  enlarging  the table.  When A->size exceeds A->limit,
       the hash table grows in size by doubling the number of lists.  A->limit
       is then reset to MAX_AVE_LIST_LENGTH times A->hmask+1.

       The  other  way  a hash table gets constructed is when a split array is
       converted to a hash table of type AY_INT.

       To determine the size  of  the  table,  we  set  the  initial  size  to
       STARTING_HMASK+1 and then double the size until A->size <= A->limit.

Doubling the Size of a Hash Table

       The  whole  point  of  making  the  table  size  a  power  of two is to
       facilitate resizing the table.  If the table size is 2**(n)  and  h  is
       the  hash  key,  then h mod 2**(n) is the hash chain index which can be
       calculated with bit-wise and, h &  (2**(n-1)).   When  the  table  size
       doubles,  the  new bit-mask has one more bit turned on.  Elements of an
       old hash chain whose hash value have this bit turned on get moved to  a
       new  chain.   Elements with this bit turned off stay on the same chain.
       On average only half the old chain moves to the new chain.  If the  old
       chain is at table[i], 0 <= i < 2**(n), then the elements that move, all
       move to the new chain at table[i + 2**(n)].

       As we walk an old string list with pointer p, the expression p->hval  &
       new_hmask  takes  one  of  two  values.   If  it  is equal to p->hval &
       old_hmask (which equals i), then the node stays otherwise it gets moved
       to a new string list at j.  The new string list preserves order so that
       the positions of the move-to-the-front heuristic are preserved.   Nodes
       moving  to  the  new  list  are  appended at pointer tail.  The ANODEs,
       dummy0~and dummy1,  are  sentinels  that  remove  special  handling  of
       boundary conditions.

       The doubling of the integer lists is exactly the same except that slink
       is replaced by ilink and hval is replaced by ival.

Array Loops

       Our mechanism for dealing with execution of the statement,

              for (i in A) { statements }

       is simple.  We allocate a vector of STRING*  of  size,  A->size.   Each
       element  of  the vector is a string key for~A.  Note that if the AY_STR
       bit of A is not set, then A has to be converted to a string hash table,
       because the index i walks string indices.

       To  execute  the  loop,  the  only  state that needs to be saved is the
       address of i and an index  into  the  vector  of  string  keys.   Since
       nothing  about A is saved as state, the user program can do anything to
       A inside the body of the loop, even delete A, and the loop still works.
       Essentially,  we have traded data space (the string vector) in exchange
       for implementation simplicity.  On a 32-bit system, each  ANODE  is  36
       bytes,  so  the  extra memory needed for the array loop is 11 more than
       the memory consumed by the ANODEs of the array.  Note  that  the  large
       size  of  the  ANODEs is indicative of our whole design which pays data
       space for integer lookup speed and algorithm simplicity.

       The only aspect of array loops that occurs in array.c  is  construction
       of  the  string  vector.  The rest of the implementation is in the file
       execute.c.

       As we walk over the hash table ANODEs, putting each  sval  in  ret,  we
       need  to  increment each reference count.  The user of the return value
       is responsible for these new reference counts.

Concatenating Array Indices

       In AWK, an array expression A[i,j] is equivalent to the expression  A[i
       SUBSEP  j],  i.e., the index is the concatenation of the three elements
       i, SUBSEP and j.  This is performed  by  the  function  array_cat.   On
       entry,  sp points at the top of a stack of CELLs.  Cnt cells are popped
       off the stack and concatenated together separated  by  SUBSEP  and  the
       result is pushed back on the stack.  On entry, the first multi-index is
       in sp[1-cnt] and the last is in sp[0].  The return  value  is  the  new
       stack  top.   (The  stack  is  the  run-time  evaluation  stack.   This
       operation really has nothing to do with array structure,  so  logically
       this  code  belongs  in  execute.c,  but  remains  here  for historical
       reasons.)

       We make a copy of SUBSEP which we can cast to string  in  the  unlikely
       event the user has assigned a number to SUBSEP.

       Set  sp  and top so the cells to concatenate are inclusively between sp
       and top.

       The total_len is the sum of the lengths of  the  cnt  strings  and  the
       cnt-1 copies of subsep.

       The  return  value is sp and it is already set correctly.  We just need
       to free the strings and set the contents of sp.

Version 1.3.4                     2020-08-18                    MAWK-ARRAYS(7)