https://invisible-island.net/mawk/

mawk-arrays - design notes for mawk's array implementation

This is the documentation for themawkimplementation of awk arrays. Arrays in awk are associations of strings to awk scalar values. Themawkimplementation stores the associations in hash tables. The hash table scheme was influenced by and is similar to the design presented in Griswold and Townsend,TheDesign and Implementation of DynamicHashing Sets and Tables in Icon},SoftwarePracticeandExperience, 23, 351-367, 1993.

The typeARRAYis a pointer to astructarray. Thesizefield is the number of elements in the table. The meaning of the other fields depends on thetypefield. There are three types of arrays and these are distinguished by thetypefield in the structure. The types are:AY_NULLThe array is empty and thesizefield is always zero. The other fields have no meaning.AY_SPLITThe array was created by theAWKbuilt-insplit. The return value fromsplitis stored in thesizefield. Theptrfield points at a vector ofCELLs. The number ofCELLs is thelimitfield. It is always true thatsize<=limit. The address ofA[i] is(CELL*)A->ptr+i-1for 1<=i<=size. Thehmaskfield has no meaning.Hash TableThe array is a hash table. If theAY_STRbit in thetypefield is set, then the table is keyed on strings. If theAY_INTbit in thetypefield is set, then the table is keyed on integers. Both bits can be set, and then the two keys are consistent, i.e., look up ofA[-14] andA["-14"] will return identicalCELLpointers although the look up methods will be different. In this case, thesizefield is the number of hash nodes in the table. When insertion of a new element would causesizeto exceedlimit, the table grows by doubling the number of hash chains. The invariant, (hmask+1)max_ave_list_length=limitis always true.Max_ave_list_lengthis a tunable constant.

The hash tables are linked lists of nodes, calledANODEs. The number of lists ishmask+1which is always a power of two. Theptrfield points at a vector of list heads. Since there are potentially two types of lists, integer lists and strings lists, each list head is a structure,DUAL_LINK. The string lists are chains connected byslinksand the integer lists are chains connected byilinks. We sometimes refer to these lists as slists and ilists, respectively. The elements on the lists areANODEs. The fields of anANODEare:slinkThe link field for slists.ilinkThe link field for ilists.svalIf non-null, thensvalis a pointer to a string key. For a given table, if theAY_STRbit is set then everyANODEhas a non-nullsvalfield and conversely, ifAY_STRis not set, then everysvalfield is null.hvalThe hash value ofsval. This field has no meaning ifsvalis null.ivalThe integer key. The field has no meaning if set to the constant,NOT_AN_IVALUE. If theAY_STRbit is off, then everyANODEwill have a validivalfield. If theAY_STRbit is on, then theivalfield may or may not be valid.cellThe data field in the hash table. \ndhitems So the value ofA[expris stored in thecellfield, and ifexpris an integer, thenexpris stored inival, else it is stored insval.

The functions that operate on arrays are,CELL* array_find(ARRAY A, CELL *cp, int create_flag)returns a pointer toA[expr] wherecpis a pointer to theCELLholdingexpr. If thecreate_flagis on andexpris not an element ofA, then the element is created with valuenull.void array_delete(ARRAY A, CELL *cp)removes an elementA[exprfrom the arrayA.cppoints at theCELLholdingexpr.void array_load(ARRAY A, size_t cnt)builds a split array. The valuesA[1..cnt] are moved intoAfrom an anonymous buffer withtransfer_to_array()which is declared insplit.h.void array_clear(ARRAY A)removes all elements ofA. The type ofAis thenAY_NULL.STRING** array_loop_vector(ARRAY A, size_t *sizep)returns a pointer to a linear vector that holds all the strings that are indices ofA. The size of the the vector is returned indirectly in*sizep. IfA->size==0, anullpointer is returned.CELL* array_cat(CELL *sp, int cnt)concatenates the elements ofsp[1-cnt..0], with each element separated bySUBSEP, to compute an array index. For example, on a reference toA[i,j],array_catcomputesiOSUBSEP O j where Odenotes concatenation.

Any reference to A[expr] creates a call toarray_find(A,cp,CREATE)wherecppoints at the cell holdingexpr. The test,exprinA, creates a call toarray_find(A,cp,NO_CREATE).Array_findis a hash-table lookup function that handles two cases: 1. If*cpis numeric and integer valued, then lookup by integer value usingfind_by_ival. If*cpis numeric, but not integer valued, then convert to string withsprintf(CONVFMT,...)and go to case~2. 2. If*cpis string valued, then lookup by string value usingfind_by_sval. \ndlist To test whethercp->dvalis integer, we convert to the nearest integer by rounding towards zero (done bydo_to_I) and then cast back to double. If we get the same number we started with, thencp->dvalis integer valued. When we get to the functionfind_by_ival, the search has been reduced to lookup in a hash table by integer value. When a search by integer value fails, we have to check by string value to correctly handle the case insertion byA["123"] and later search asA[123]. This string search is necessary if and only if theAY_STRbit is set. An important point is that allANODEs get created with a validsvalifAY_STRis set, because then creation of new nodes always occurs in a call tofind_by_sval. Searching by string value is easier becauseAWKarrays are really string associations. If the array does not have theAY_STRbit set, then we have to convert the array to a dual hash table with strings which is done by the functionadd_string_associations. OneIntvalue is reserved to show that theivalfield is invalid. This works becaused_to_Ireturns a value in[-Max_Int, Max_Int]. On entry toadd_string_associations, we know that theAY_STRbit is not set. We convert to a dual hash table, then walk all the integer lists and put eachANODEon a string list.

The execution of the statement,deleteA[expr], creates a call toarray_delete(ARRAYA,CELL*cp). Depending on the type of*cp, the call is routed tofind_by_svalorfind_by_ival. Each of these functions leaves its return value on the front of anslistorilist, respectively, and then it is deleted from the front of the list. The case whereA[expris on two lists, e.g.,A[12] andA["12"] is checked by examining thesvalandivalfields of the returnedANODE*. Even though we found a node by searching anilistit might also be on anslistand vice-versa. When the size of a hash table drops below a certain value, it might be profitable to shrink the hash table. Currently we don't do this, because our guess is that it would be a waste of time for mostAWKapplications. However, we do convert an array toAY_NULLwhen the size goes to zero which would resize a large hash table that had been completely cleared by successive deletions.

A simple operation is to create an array with theAWKprimitivesplit. The code that performssplitputs the pieces in an anonymous buffer.array_load(A, cnt)moves thecntelements from the anonymous buffer intoA. This is the only way an array of typeAY_SPLITis created. If the arrayAis a split array and big enough then we reuse it, otherwise we need to allocate a new split array. When we allocate a block ofCELLs for a split array, we round up to a multiple of 4.

The functionarray_clear(ARRAY A)convertsAto typeAY_NULLand frees all storage used byAexcept for thestructarrayitself. This function gets called in three contexts: (1) when an array local to a user function goes out of scope, (2) execution of theAWKstatement,delete Aand (3) when an existing changes type or size fromsplit().

Arrays are always created as empty arrays of typeAY_NULL. Global arrays are never destroyed although they can go empty or have their type change by conversion. The only constructor function is a macro. Hash tables only get constructed by conversion. This happens in two ways. The functionmake_empty_tableconverts an empty array of typeAY_NULLto an empty hash table. The number of lists in the table is a power of 2 determined by the constantSTARTING_HMASK. The limit size of the table is determined by the constantMAX_AVE_LIST_LENGTHwhich is the largest average size of the hash lists that we are willing to tolerate before enlarging the table. WhenA->sizeexceedsA->limit, the hash table grows in size by doubling the number of lists.A->limitis then reset toMAX_AVE_LIST_LENGTHtimesA->hmask+1. The other way a hash table gets constructed is when a split array is converted to a hash table of typeAY_INT. To determine the size of the table, we set the initial size toSTARTING_HMASK+1and then double the size untilA->size<=A->limit.

The whole point of making the table size a power of two is to facilitate resizing the table. If the table size is 2**(n) andhis the hash key, thenhmod2**(n) is the hash chain index which can be calculated with bit-wise and,h& (2**(n-1)). When the table size doubles, the new bit-mask has one more bit turned on. Elements of an old hash chain whose hash value have this bit turned on get moved to a new chain. Elements with this bit turned off stay on the same chain. On average only half the old chain moves to the new chain. If the old chain is attable[i], 0 <=i< 2**(n), then the elements that move, all move to the new chain attable[i + 2**(n)]. As we walk an old string list with pointerp, the expressionp->hval&new_hmasktakes one of two values. If it is equal top->hval &old_hmask(which equalsi), then the node stays otherwise it gets moved to a new string list atj. The new string list preserves order so that the positions of the move-to-the-front heuristic are preserved. Nodes moving to the new list are appended at pointertail. TheANODEs,dummy0~anddummy1, are sentinels that remove special handling of boundary conditions. The doubling of the integer lists is exactly the same except thatslinkis replaced byilinkandhvalis replaced byival.

Our mechanism for dealing with execution of the statement,for(iinA) {statements} is simple. We allocate a vector ofSTRING*of size,A->size. Each element of the vector is a string key for~A. Note that if theAY_STRbit ofAis not set, thenAhas to be converted to a string hash table, because the indexiwalks string indices. To execute the loop, the only state that needs to be saved is the address ofiand an index into the vector of string keys. Since nothing aboutAis saved as state, the user program can do anything toAinside the body of the loop, evendelete A, and the loop still works. Essentially, we have traded data space (the string vector) in exchange for implementation simplicity. On a 32-bit system, eachANODEis 36 bytes, so the extra memory needed for the array loop is 11 more than the memory consumed by theANODEs of the array. Note that the large size of theANODEs is indicative of our whole design which pays data space for integer lookup speed and algorithm simplicity. The only aspect of array loops that occurs inarray.cis construction of the string vector. The rest of the implementation is in the fileexecute.c. As we walk over the hash tableANODEs, putting eachsvalinret, we need to increment each reference count. The user of the return value is responsible for these new reference counts.

InAWK, an array expressionA[i,j] is equivalent to the expressionA[iSUBSEPj], i.e., the index is the concatenation of the three elementsi,SUBSEPandj. This is performed by the functionarray_cat. On entry,sppoints at the top of a stack ofCELLs.Cntcells are popped off the stack and concatenated together separated bySUBSEPand the result is pushed back on the stack. On entry, the first multi-index is insp[1-cnt] and the last is insp[0]. The return value is the new stack top. (The stack is the run-time evaluation stack. This operation really has nothing to do with array structure, so logically this code belongs inexecute.c, but remains here for historical reasons.) We make a copy ofSUBSEPwhich we can cast to string in the unlikely event the user has assigned a number toSUBSEP. Setspandtopso the cells to concatenate are inclusively betweenspandtop. Thetotal_lenis the sum of the lengths of thecntstrings and thecnt-1copies ofsubsep. The return value isspand it is already set correctly. We just need to free the strings and set the contents ofsp. Version 1.3.4 2020-08-18MAWK-ARRAYS(7)