Chapter 19: Hashing
This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.
Learning objectives
- Explain the main ideas and vocabulary in Hashing.
- Work through the source examples for Hashing without depending on raw chunk order.
- Use Hashing as selective reference when learner modules point back to Algorithms Sedgewick.
Prerequisites
- Earlier prerequisite concepts leading into Chapter 19: Hashing.
Module targets
module-02-sorting-searching-structures
AI companion modes
- Explain simply
- Socratic tutor
- Quiz me
- Challenge my understanding
- Diagnose my confusion
- Generate extra practice
- Revision mode
- Connect forward / backward
Source-of-truth note
This unit is anchored to Algorithms Sedgewick and the source chapter "Chapter 19: Hashing". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.
External enrichment
No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.
Source provenance
- Primary source:
Algorithms Sedgewick - Source chapter 19: Chapter 19: Hashing
- Raw source file:
058-hashing.md - Raw source file:
059-hashing.md - Raw source file:
060-exercises.md
Merged source
Hashing
Hashing
16. Hashing
A completely different approach to searching from the comparisonbased tree structures of the last section is provided by hashing: directly referencing records in a table by doing arithmetic transformations on keys into t,able addresses. If we were to know that the keys are distinct integers from 1 to N, then we could store the record with key i in table position i, ready for immediate access with the key value. Hashing is a generalization of this trivial method for typical searching applications when we don't have such specialized knowledge about the key values. The first step in a search using hashing is to compute a hush function which transforms the search key into a table address. No hash function is perfect, and two or more different keys might hash to the same table address: the second part of a hashing search is a collision resolution process which deals with such keys. One of the collision resolution methods that we'll study uses linked lists, and is appropriate in a highly dynamic situation where the number of search keys can not be predicted in advance. The other two collision resolution methods that we'll examine achieve fast search times on records stored within a fixed array. Hashing is a good example of a "time-space tradeoff." If there were no memory limitation, then we could do any search with only one memory access by simply using the key as a memory address. If there were no time limitation, then we could get by with only a minimum amount of memory by using a sequential search method. Hashing provides a way to use a reasonable amount of memory and time to strike a balance between these two extremes. Efficient use of available memory and fast access to the memory are prime concerns of any hashing method. Hashing is a "classical" computer science problem in the sense that the various algorithms have been studied in some depth and are very widely used. There is a great deal of empirical and analytic evidence to support the utility
of hashing for a broad variety of applications.
The first problem we must address is the computation of the hash function which transforms keys into table addresses. This is an arithmetic computation with properties similar to the random number generators that we have studied. What is needed is a function which transforms keys (usually integers or short character strings) into integers in the range [O..M-11, where A4 is the amount of memory available. An ideal hash function is one which is easy to compute and approximates a "random" function: for each input, every output should be "equally likely."
Since the methods that we will use are arithmetic, the first step is to transform keys into numbers which can be operated on (and are as large as possible). For example, this could involve removing bits from character strings and packing them together in a machine word. From now on, we'll assume that such an operation has been performed and that our keys are integers which fit into a machine word.
One commonly used method is to take A4 to be prime and, for any key
pute h(k) = k mod M. This is a straightforward method which is easy
pute in many environments and spreads the key values out well.
A second commonly used method is similar to the linear congruential random number generator: take M = 2m and h(k) to be the leading m bits of (bkmod w), where w is the word size of the computer and b is chosen as for the random number generator. This can be more efficiently computed than the method above on some computers, and it has the advantage that it can spread out key values which are close to one another (e. g., templ, temp$ temp3). As we've noted before, languages like Pascal are not well-suited to such operaiions.
The hash functions above will convert keys into table addresses: we still need to decide how to handle the case when two keys hash to the same address. The most straightforward method is to simply build a linked list, for each table address, of the records whose keys hash to that address. Since the keys which hash to the same table position are kept in a linked list, they might as well be kept in order. This leads directly to a generalization of the elementary list searching method that we discussed in Chapter 14. Rather than maintaining a single list with a single list header node head as discussed there, we maintain M lists with M list header nodes, initialized as follows:
type link=fnode;
node=record key, info: integer; next: link end;
var heads: array [O..M] of link;
t, z: link;
procedure initialize;
var i: integer;
begin
new(z); zt.next:=z; end;
for i:=O to M-l do
begin new(heads[i]); heads[i]f.next:=z
end;
Now the procedures from Chapter 14 can be used as is, with a hash function
used to choose among the lists. For example, listinsert(v, heads[v mod M] )
can be used to add something to the table, t:=listsearch(v, heads[v mod M])
to find the first record with key v, and successively set t:=listsearch(v, t) until
t=z to find subsequent records with key v.
For example if the ith letter in the alphabet is represented with the
i and we use the hash function h(k) = kmod M, then we get the
ing hash values for our sample set of keys with M = 11:
Key: A S E A R C H I N G E X A M P L E Hash: 1 8 5 1 7 3 8 9 3 7 5 2 1 2 5 1 5
if these keys are successively inserted into an initially empty table, the following set of lists would result:
0 1 2 3 4 5 6 7 8 9 10
A E
L P
Obviously, the amount of time required for a search depends on the length
lists (and the relative positions of the keys in them). The lists could be
nordered: maintaining sorted lists is not as important for this application
was for the elementary sequential search because the lists are quite short.
For an "unsuccessful search" (for a record with a key not in the table) we
sume that the hash function scrambles things enough so that each of
the M lists is equally likely to be searched and, as in sequential list searching, that the list searched is only traversed halfway (on t,he average). The average length of the list examined (not counting z) in this example for unsuccessful search is (0+4+2+2+0+4+0+2+2+1+0)/11 z 1.545. This would be the average time for an unsuccessful search were the lists unordered; by keeping them ordered we cut the time in half. For a "successful search" (for one of the records in the table), we assume that each record is equally likely to be sought: seven of the keys would be found as the first list item examined, six would be found as the second item examined, etc., so the average is (7.1+ 6.2 + 2.3 + 2.4)/17) z 1.941. (This count assumes that equal keys are distinguished with a unique identifier or some other mechanism, and the search routine modified appropriately to be able to search for each individual key.)
If N, the number of keys in the table, is much larger than M then a good approximation to the average length of the lists is N/M. As in Chapter 14, unsuccessful and successful searches would be expected to go about halfway down some list. Thus, hashing provides an easy way to cut down the time required for sequential search by a factor of M, on the average.
In a separate chaining implementation, M is typically chosen to be relatively small so as not to use up a large area of contiguous memory. But it's probably best to choose M sufficiently large that the lists are short enough to make sequential search the most efficient method for them: "hybrid" methods (such as using binary trees instead of linked lists) are probably not worth the trouble.
The implementation given above uses a hash table of links to headers of the lists containing the actual keys. Maintaining M list header nodes is somewhat wasteful of space; it is probably worthwhile to eliminate them and make heads be a table of links to the first keys in the lists. This leads to some complication in the algorithm. For example, adding a new record to the beginning of a list becomes a different operation than adding a new record anywhere else in a list, because it involves modifying an entry in the table of links, not a field of a record. An alternate implementation is to put the first key within the table. If space is at a premium, it is necessary to carefully analyze the tradeoff between wasting space for a table of links and wasting space for a key and a link for each empty list. If N is much bigger than M then the alternate method is probably better, though M is usually small enough that the extra convenience of using list header nodes is probably justified.
Hashing
Hashing
If the number of elements to be put in the hash table can be estimated in advance, then it is probably not worthwhile to use any links at all in the hash table. Several methods have been devised which store N records in a table
of size A4 > N, relying on empty places in the table to help with collision resolution. Such methods are called open-addressing hashing methods.
The simplest open-addressing method is called linear probing: when there is a collision (when we hash to a place in the table which is already occupied and whose key is not the same as the search key), then just probe the next position in the table: that is, compare the key in the record there against the search key. There are three possible outcomes of the probe: if the keys match, then the search terminates successfully; if there's no record there, then the search terminates unsuccessfully; otherwise probe the next position, continuing until either the search key or an empty table position is found. If a record containing the search key is to be inserted following an unsuccessful search, then it can simply be put into the empty table space which terminated the search. This method is easily implemented as follows:
type node=record key, info: integer end;
var a: array [O..M] of node;
function h(k: integer): integer;
begin h:=k mod Mend;
procedure hashinitialize;
var i: integer;
begin
for i:=O to M do a[i].key:=maxint;
end ;
function hashinsert (v: integer) : integer;
var x: integer;
begin
x:=h(v);
while a[x].key<>maxint do x:=(x+1) mod M;
a [x] .key:=v;
hashinsert :=x;
end ;
Linear probing requires a special key value to signal an empty spot in the table: this program uses maxint for that purpose. The computation x:=(x+1) mod M corresponds to examining the next position (wrapping back to the beginning when the end of the table is reached). Note that this program does not check for the table being filled to capacity. (What would happen in this case?)
The implementation of hashsearch is similar to hashinsert: simply add the condition " a [x].key< >v" to the while loop, and delete the following instruction which stores v. This leaves the calling routine with the task
of checking if the search was unsuccessful, by testing whether the table position returned actually contains v (successful) or maxi& (unsuccessful). Other conventions could be used, for example hashsearch could return M for unsuccessful search. For reasons that will become obvious below, open addressing is not appropriate if large numbers of records with duplicate keys are to be processed, but hashsearch can easily be adapted to handle equal keys in the case where each record has a unique identifier.
For our example set of keys with A4 = 19, we might get the hash values:
Key: A S E A R C H I N G E X A M P L E Hash: 1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5
The following table shows the steps of successively inserting these into an initially empty hash table:
0 -1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
SAAm E R
H R
HIII R
HI
SAAC E 0G H I N R
S A A C R~~GHI R
S A A C : ~~~~~~ R
SmbimwEEGHIX N R
SAACAEEGHIX 0 M N R
SAACAEEGHIX M N 0P R
SAACAEEGHIX LL-l M N P R
SAACA~~~~~~lE]LMN P R
The table size is greater than before, since we must have M > N, but the total amount of memory space used is less, since no links are used. The average number of items that must be examined for a successful search for this example is: 33/17 z 1.941.
Linear probing (indeed, any hashing method) works because it guarantees that, when searching for a particular key, we look at every key that hashes to the same table address (in particular, the key itself if it's in the table). Unfortunately, in linear probing, other keys are also examined, especially when the table begins to fill up: in the example above, the search for X involved looking at G, H, and I which did not have the same hash value. What's worse, insertion of a key with one hash value can drastically increase the search times for keys with other hash values: in the example, an insertion at position 17 would cause greatly increased search times for position 16. This phenomenon, called clustering, can make linear probing run very slowly for nearly full tables.
Fortunately, there is an easy way to virtually eliminate the clustering problem: double hashing. The basic strategy is the same; the only difference is that, instead of examining each successive entry following a collided position, we use a second hash function to get a fixed increment to use for the "probe" sequence. This is easily implemented by inserting u:=h2(v) at the beginning of the procedure and changing x:=(x+1) mod M to x:=(x+u) mod M within the while loop. The second hash function h2 must be chosen with some care, otherwise the program might not work at all.
First, we obviously don't want to have u=O, since that would lead to an infinite loop on collision. Second, it is important that M and u be relatively prime here, since otherwise some of the probe sequences could be very short (for example, consider the case M=2u). This is easily enforced by making M prime and u<M. Third, the second hash function should be "different" from the first, otherwise a slightly more complicated clustering could occur. A function such as hz(k) = M - 2 - k mod (M - 2) will produce a good range of "second" hash values.
For our sample keys, we get the following hash values:
Hash 1: 1 0 5 1 1 8 3 8 9 1 4 7 5 5 1 1 3 1 6 1 2 5
Hash 2': 16 15 12 16 16 14 9 8 3 10 12 10 16 4 1 5 12
The following table is produced by successively inserting our sample keys into an initially empty table using double hashing with these values.
456 789 10 11 12 13 14 15 16 17 18
IHI
0A 0 A
Al-4
AR
AR
lEl
aI E
SA C E
SA
SA C E - HI 0N AR
SA
SA C E IGI H I N AR
0 CE TH Im N 0 A R
C 0E GHIE NM A R
C E GHIEA rM- l N X A R
SA C E G H I E A -MNX~AR
sB 0 C ~~G~I~AL~N~PMR
This technique uses the same amount of space as linear probing but the average number of items examined for successful search is slightly smaller: 32/17 z 1.882. Still, such a full table is to be avoided as shown by the 9 probes used to insert the last E in this example.
Open addressing methods can be inconvenient in a dynamic situation, when an unpredictable number of insertions and deletions might have to be processed. First, how big should the table be? Some estimate must be made of how many insertions are expected but performance degrades drastically as the table starts to get full. A common solution to this problem is to rehash everything into a larger table on a (very) infrequent basis. Second, a word of caution is necessary about deletion: a record can't simply be removed from a table built with linear probing or double hashing. The reason is that later insertions into the table might have skipped over that record, and searches for those records will terminate at the hole left by the deleted record. A way to solve this problem is to have another special key which can serve as a placeholder for searches but can be identified and remembered as an
empty position for insertions. Note that neither table size nor deletion are a particular problem with separate chaining.
Exercises
Exercises
The methods discussed above have been analyzed completely and it is possible to compare their performance in some detail. The following formulas, summarized from detailed analyses described by D. E. Knuth in his book on sorting and searching, give the average number of items examined (probes) for unsuccessful and successful searching using the methods we've studied. The formulas are most conveniently expressed in terms of the "load factor" of the hash table, cr = N/M. Note that for separate chaining we can have CY > 1, but for the other methods we must have Q < 1.
Separate Chaining: Unsuccessful Successful
Linear Probing: 1+ 42
Double Hashing: (a + 1)/Z
l/(1 - a) - ln( 1 - o)/o
For small o, it turns out that all the formulas reduce to the basic result that unsuccessful search takes about 1 + N/M probes and successful search takes about 1 + N/2M probes. (Except, as we've noted, the cost of an unsuccessful search for separate chaining is reduced by about half by ordering the lists.) The formulas indicate how badly performance degrades for open addressing as CY gets close to 1. For large M and N, with a table about 90% full, linear probing will take about 50 probes for an unsuccessful search, compared to 10 for double hashing. Comparing linear probing and double hashing against separate chaining is more complicated, because there is more memory available in the open addressing methods (since there are no links). The value of CY used should be modified to take this into account, based on the relative size of keys and links. This means that it is not normally justified to choose separate chaining over double hashing on the basis of performance.
The choice of the very best hashing method for a particular application can be very difficult. However, the very best method is rarely needed for a given situation, and the various methods do have similar performance characteristics as long as the memory resource is not being severely strained. Generally, the best course of action is to use the simple separate chaining method to reduce search times drastically when the number of records to be processed is not known in advance (and a good storage allocator is available) and to use double hashing to search among a set of keys whose size can be roughly predicted ahead of time.
Many other hashing methods have been developed which have application in some special situations. Although we can't go into details, we'll briefly
consider two examples to illustrate the nature of specially adapted hashing methods. These and many other methods are fully described in Knuth's book.
The first, called ordered hashing, is a method for making use of ordering within an open addressing table: in standard linear probing, we stop the search when we find an empty table position or a record with a key equal to the search key; in ordered hashing, we stop the search when we find a record with a key greater than or equal to the search key (the table must be cleverly constructed to make this work). This method turns out to reduce the time for unsuccessful search to approximately that for successful search. (This is the same kind of improvement that comes in separate chaining.) This method is useful for applications where unsuccessful searching is frequently used. For example, a text processing system might have an algorithm for hyphenating words that works well for most words, but not for bizarre cases (such as "bizarre"). The situation could be handled by looking up all words in a relatively small exception dictionary of words which must be handled in a special way, with most searches likely to be unsuccessful.
Similarly, there are methods for moving some records around during unsuccessful search to make successful searching more efficient. In fact, R. P. Brent developed a method for which the average time for a successful search can be bounded by a constant, giving a very useful method for applications with frequent successful searching in very large tables such as dictionaries.
These are only two examples of a large number of algorithmic improvements which have been suggested for hashing. Many of these improvements are interesting and have important applications. However, our usual cautions must be raised against premature use of advanced methods except by experts with serious searching applications, because separate chaining and double hashing are simple, efficient, and quite acceptable for most applications.
Hashing is preferred to the binary tree structures of the previous two chapters for many applications because it is somewhat simpler and it can provide very fast (constant) searching times, if space is available for a large enough table. Binary tree structures have the advantages that they are dynamic (no advance information on the number of insertions is needed), they can provide guaranteed worst-case performance (everything could hash to the same place even in the best hashing method), and they support a wider range of operations (most important, the sort function). When these factors are not important, hashing is certainly the searching method of choice.
Exercises
1. Describe how you might implement a hash function by making use of a
good random number generator. Would it make sense to implement a
random number generator by making use of a hash function?
2. How long could it take in the worst case to insert N keys into an initially
empty table, using separate chaining with unordered lists? Answer the
same question for sorted lists.
3. Give the contents of the hash table that results when the keys E A S Y Q
U E S T I 0 N are inserted in that order into an initially empty table of
size 13 using linear probing. (Use hi(k) = kmod 13 for the hash function
for the lath letter of the alphabet.)
4. Give the contents of the hash table that results when the keys E A S Y
Q U E S T I 0 N are inserted in that order into an initially empty table
of size 13 using double hashing. (Use hi(k) from the previous question,
hz(lc) = 1 + (Ic mod 11) for the second hash function.)
-
About how many probes are involved when double hashing is used to build a table consisting of N equal keys?
-
Which hashing method would you use for an application in which many equal keys are likely to be present?
-
Suppose that the number of items to be put into a hash table is known in advance. Under what condition will separate chaining be preferable to double hashing?
-
Suppose a programmer has a bug in his double hashing code so that one of the hash functions always returns the same value (not 0). Describe what happens in each situation (when the first one is wrong and when the second one is wrong).
-
What hash function should be used if it is known in advance that the key values fall into a relatively small range?
iticize the following algorithm for deletion from a hash table built with
linear probing. Scan right from the element to be deleted (wrapping as
necessary) to find an empty position, then scan left to find an element
with the same hash value. Then replace the element to be deleted with
that element, leaving its table position empty.