(Redirected from
Hashing)
In computer science, a hash table is a data structure for storing information that allows specific information to be quickly located. Requests are made using a key, an identifier for the information to be found, such as a person's name or an ID number. The key is transformed using a hash function directly into a hash value, a number describing, roughly speaking, the location of the information.
Hash tables are often used to implement associative arrays, sets and caches. Like arrays, hash tables can provide constant-time O(1) lookup on average, regardless of the number of items in the table. However, their (very improbable) worst-case time can be as bad as O(n). Hash tables are most useful when a large number of records of data are to be stored.
Overview
Hash tables use an array to hold, or reference, the stored records. However, because we want to allow keys which are much larger than the range of valid indexes, or might even be a different type of data like strings, we need a way to convert each key into a valid index. We achieve this using a hash function, which is a simple function that takes a key and produces an index into an array which is called the hash table. The hash table, in turn, ideally contains a pointer to the record or the record itself.
The problem is that, since we have more potential keys than array indexes, multiple keys will map to the same array index (by the pigeonhole principle). If we try to insert two keys that map to the same index, this is called a collision. Although we attempt to avoid collisions, they are often all but inevitable: even if the array were to be allocated with space for a million entries, the chance of at least one collision occurring exceeds 95% before it contains 2500 items (see birthday paradox, birthday attack).
To resolve this, a collision resolution strategy is used. Most collision resolution strategies are some form of linear search through the set of inserted keys that map to the hash index of the one being searched for. It can be shown that it is very improbable any of these sets will become large enough that such a search is expensive, provided that the table doesn't become too full.
If the table does become too full, performance becomes poor, and the hash table's array must be enlarged. Enlarging the table means that effectively all the items in the hash table have to be added all over again. This is an expensive, but in most implementations infrequent, operation. Some hash table implementations, notably realtime systems, cannot pay the price of enlarging the table and are initially allocated with enough space to remain efficient.
Hash tables allow reasonably simple implementations of the essential operations of associative arrays: lookup of a record given a key; setting a record at a given key (adding or replacing as necessary); and removing a record with a given key. Their main competitor in practice is a self-balancing binary search tree. See a comparison of hash tables and self-balancing binary search trees. Hash tables don't allow simple implementations of "previous/next" operations to find logically "adjacent" keys, so presenting keys as a sorted list is difficult to implement.
Common uses of hash tables
Hash tables are found in a wide variety of programs. Most programming languages provide them in standard libraries. Most interpreted or scripting languages have special syntactic support (examples being Python, Perl, PHP, Ruby, Io, Smalltalk, Lua and ICI). In these languages, hash tables tend to be used extensively as data structures, sometimes replacing records and arrays.
Hash tables are commonly used for symbol tables, caches, and sets.
In computer chess, a hash table is generally used to implement the transposition table.
Choosing a good hash function
A good hash function is essential for good hash table performance. Hash collisions are generally resolved by some form of linear search, so if a hash function tends to produce similar values for some set of keys, slow linear searches will result.
In an ideal hash function, changing any single bit in the key (including extending or shortening the key) would produce a seemingly random change in all of the bits of the hash, and this change would be independent of the changes caused by any other bits of the key. Because a good hash function can be hard to design, or computationally expensive to execute, much research has been devoted to collision resolution strategies that mitigate poor hashing performance. However, none of them is as effective as using a good hash function in the first place.
One issue with the use of hash functions in hash tables is that the range of indexes of the hash table's array is limited, but it should be possible to use a single hash function for arrays of all conceivable sizes. To get around this, the index into the hash table's array is calculated in two steps:
- A generic hash value is calculated which fills a natural machine integer
- This value is reduced to a valid array index by finding its modulus with the array's size.
Hash table array sizes are sometimes set, by construction, to be prime numbers. This is done to avoid any tendency for the large integer hash to have common divisors with the hash table size, which would otherwise induce collisions after the modulus operation.
A common alternative to prime sizes is to use a power of 2 size, with simple bit masking to achieve the modulus operation. Such bit masking may be significantly computationally cheaper than the division operation.
In either case it is often a good idea to arrange the generic hash value to be constructed using numbers that share no common divisors (is coprime) with the table length.
One common problem that can occur with hash functions is clustering. Clustering occurs when the structure of the hash function causes commonly used keys to tend to fall closely spaced or even consecutively within the hash table. This can cause significant performance degradation as the table fills when using certain collision resolution strategies, such as linear probing.
When debugging the collision handling in a hash table, it is useful to use a hash function that always returns a constant value, such as 1, which causes collisions on almost every insert.
Collision resolution
If two keys hash to the same value, they cannot be stored in the same location. We must find a place to store a new value if its ideal location is already occupied. There are a number of ways to do this, but the most popular are chaining and open addressing.
Chaining
In the simplest chained hash table technique, each slot in the array references a linked list of records that collide to the same slot.
Insertion requires finding the correct slot, and appending to either end of the list in that slot; deletion requires searching the list and removal.
Chaining hash tables have advantages over open addressed hash tables in that the removal operation is simple and resizing the table can be postponed for a much longer time because performance degrades more gracefully even when every slot is used. Indeed, many chaining hash tables may not require resizing at all since performance degradation is linear as the table fills. For example, a chaining hash table containing twice its recommended capacity of data would only be about twice as slow on average as the same table at its recommended capacity.
Disadvantages of chained hash tables are due to the common use of linked lists. Particularly when storing small elements, the percentage overhead of the linked list can be large. If this is a problem Open Addressed hash tables may be a better choice. An additional disadvantage is that traversing a linked list, which chiefly occurs when the hash table is large and full, has poor cache performance. In most cases hash tables can be sized appropriately so that this traversal does not happen very much, and the memory overhead is often not significant.
Alternative data structures can be used for chains instead of linked lists. By using a red-black tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, since each list is intended to be short, this approach is usually inefficient unless the hash table is designed to run at full capacity or there are unusually high collision rates, as might occur in input designed to cause collisions. Dynamic arrays can also be used to decrease space overhead and improve cache performance when elements are small.
From the point of view of writing suitable hash functions, chained hash tables are insensitive to clustering, only requiring minimization of collisions. On the other hand, open hash tables, as we'll see in the next section, depend upon better hash functions to avoid clustering. This can destroy performance even in nearly empty tables, and is easy to trigger accidentally.
Overall, chained hash tables are simple to implement, robust and frequently give high performance. Due to their lower complexity, designers should probably first consider use of this design in preference to open addressed hash tables.
Open addressing
Open addressing hash tables store all the records within the array. A hash collision is resolved by searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. Well known probe sequences include:
- linear probing
- in which the interval between probes is fixed--often at 1,
- quadratic probing
- in which the interval between probes increases linearly (hence, the indices are described by a quadratic function), and
- double hashing
- in which the interval between probes is fixed for each record but is computed by another hash function.
The main tradeoffs between these methods is that linear probing has the best cache performance but is most sensitive to clustering, while double hashing has poor cache performance but exhibits virtually no clustering; quadratic hashing falls in-between in both areas.
A critical influence on performance of an open addressing hash table is the load factor, that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, most algorithms will fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors by generating significant clustering. In some cases, when the load factor exceeds a certain value, the hash table will be regenerated; this is done by allocating a larger hash table and rehashing every key. While this is obviously creates a one-time slowdown, it can be justified in speeding up subsequent operations.
Perfect hashing
If all of the keys that will be used are known ahead of time, perfect hashing can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.
Probabilistic hashing
Perhaps the simplest solution to a collision is to having it replace the item that is already there, or slightly less commonly, drop the item that is to be inserted. In later searches, this may result in a search not finding an item which has been inserted. This technique is particularly useful and often used for implementing cacheing.
An even more space-efficient solution which is similar to this is to keep only one bit for each bucket, each indicating whether or not a value has been inserted in its bucket. False negatives cannot occur, but false positives can, since if the search finds a 1 bit, it will say the value was found, even if it was just another value that hashed into the same bucket by coincidence. In reality, such a hash table is merely a specific type of Bloom filter.
Example pseudocode
The following pseudocode is an implementation of an open addressing hash table with linear probing and single-slot stepping, a common approach that is effective if the hash function is good. Each of the lookup, set and remove functions use a common internal function findSlot to locate the array slot that either does or should contain a given key.
record pair { key, value }
var pair array slot[0..numSlots-1]
function findSlot(key) {
i := hash(key) modulus numSlots
loop {
if slot[i] is not occupied or slot[i].key = key
return i
i := (i + 1) modulus numSlots
}
}
function lookup(key)
i := findSlot(key)
if slot[i] is occupied // key is in table
return slot[i].value
else // key is not in table
return not found
function set(key, value) {
i := findSlot(key)
if slot[i] is occupied
slot[i].value := value
else {
if the table is almost full
rebuild the table larger (note 1)
i := findSlot(key)
slot[i].key := key
slot[i].value := value
}
}
- note 1
- Rebuilding the table requires allocating a larger array and recursively using the set operation to insert all the elements of the old array into the new larger array. It is common to increase the array size exponentially, for example by doubling the old array size.
function remove(key)
i := find_slot(key)
if slot[i] is unoccupied
return // key is not in the table
j := i
loop
j := (j+1) modulus numSlots
if slot[j] is unoccupied
exit loop
k := hash(slot[j].key) modulus numSlots
if (j > i and (k <= i or k > j)) or
(j < i and (k <= i and k > j)) (note 2)
slot[i] := slot[j]
i := j
mark slot[i] as unoccupied
- note 2
- For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record). At this point in the pseudocode, i is a vacant slot that might be invalidating this property for subsequent records in the cluster. j is such as subsequent record. k is the raw hash where the record at j would naturally land in the hash table if there were no collisions. This test is asking if the record at j is invalidly positioned with respect to the required properties of a cluster now that i is vacant.
Another technique for removal is simply to mark the slot as deleted. However this eventually requires rebuilding the table simply to remove deleted records. The methods above provide O(1) updating and removal of existing records, with occasional rebuilding if the high water mark of the table size grows.
The O(1) remove method above is only possible in linearly probed hash tables with single-slot stepping. In the case where many entries are to be deleted in one operation, marking the slots for deletion and later rebuilding may be more efficient.
Problems with hash tables
Although hash table lookups require constant time on average, the time spent can be significant. Evaluating a good hash function can be a slow operation, as well as comparing large keys to check that the item accessed is the correct one. In particular, if simple array indexing can be used instead, this is usually faster.
Hash tables in general exhibit poor locality of reference — that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays (see CPU cache). Compact data structures such as arrays, searched with linear search, may be faster if the table is relatively small and keys are cheap to compare, such as with simple integer keys. Due to Moore's Law, cache sizes are growing exponentially and so what is considered "short" may be increasing. The optimum performance point varies from system to system; for example, a trial on Parrot shows that its hash tables outperform linear search in all but the most trivial cases (one to three entries).
More significantly, hash tables are more difficult and error-prone to write and use. Hash tables require the design of an effective hash function for each key type, which in many situations is more difficult and time-consuming to design and debug than the mere comparison function required for a self-balancing binary search tree.
Additionally, in some applications, a black hat with knowledge of the hash function may be able to supply information to a hash which creates worst-case behavior by causing excessive collisions, resulting in very poor performance (i.e. a denial of service attack). In critical applications, a data structure with better worst-case guarantees may be preferable. For details, see Crosby and Wallach's Denial of Service via Algorithmic Complexity Attacks.
See also
References