From charlesreid1

(Fix typos, clarify expected chain length analysis (distinguish probability vs expected value), correct multiplication method notation, add missing load factor definition, and improve MAD method description (via update-page on MediaWiki MCP Server))
 
(One intermediate revision by the same user not shown)
Line 7: Line 7:
Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.
Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.


So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same two keys in the array (say, the integer 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.
So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same slot in the array (say, index 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.


Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is <math>\Theta(n)</math>.
Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is <math>\Theta(n)</math>.
Line 19: Line 19:
Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?
Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?


Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m.
Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m items on average.


The probability of one key going into one slot is 1/m. The probability of each key is independent. Each slot gets a 1/m chance of receiving the ith key, so they see a probability of getting a key of (1/m) + (1/m) + (1/m) + (1/m) + ...
The probability of a particular key going into a particular slot is 1/m. Since each key is hashed independently, the expected number of keys in a given slot is the sum of the probabilities that each of the n keys lands in that slot:


This is something we want to control. We call this the ''load factor''.
<math>
E[\text{length}] = \sum_{i=1}^{n} \frac{1}{m} = \frac{n}{m}
</math>


The runtime of retrieving all the keys from a slot therefore becomes (n/m), so if <math>m = \Theta(n)</math> we can say that retrieval is <math>\Theta(1)</math>
This ratio <math>\alpha = n/m</math> is called the ''load factor''.


More correctly, retrieval is <math>\Theta(1 + \lambda)</math> where lambda is the load factor
The expected runtime of searching for a key (unsuccessful search) is therefore <math>\Theta(1 + \alpha)</math>, since we must compute the hash (constant time) and then traverse the chain of expected length α. For a successful search, the expected time is also <math>\Theta(1 + \alpha)</math>.
 
If we keep <math>m = \Theta(n)</math> (i.e., the number of slots is proportional to the number of keys), then the load factor <math>\alpha = O(1)</math> and we can say that search is <math>\Theta(1)</math> on average.


===Simple uniform hashing===
===Simple uniform hashing===


Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.
Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.
Under simple uniform hashing, each key is equally likely to be hashed to any of the m slots, independently of where any other key has been hashed.


This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.
This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.
Line 40: Line 46:
* Division method
* Division method
* Multiplication method
* Multiplication method
* MAD (multiply-and-divide) method/Universal hashing
* MAD (multiply-and-divide) method / Universal hashing
 
'''Division method:'''


Division method:
<math>
h(k) = k \bmod m
</math>
 
We require m to be prime to minimize collisions, and be far from powers of 2 or 10 (more commonly occurring patterns in input).
 
'''Multiplication method:'''


<math>
<math>
h(k) = k \mod m
h(k) = \lfloor m \cdot (k \cdot A \bmod 1) \rfloor
</math>
</math>


We require m to be prime to minimize collisions, and be far from powers of 2 or 10 (more commonly occurring)
where A is a constant in the range (0, 1). A common choice is <math>A \approx (\sqrt{5} - 1)/2 \approx 0.6180339887</math> (the golden ratio conjugate).


Multiplication method:
On a w-bit machine with table size <math>m = 2^r</math>, this can be implemented efficiently as:


<math>
<math>
h(k) = a k \mod 2^w > > (w-r)
h(k) = (a \cdot k \bmod 2^w) \gg (w - r)
</math>
</math>


w bit machine. right arrows mean shift it right (w-r) bits.
where a is an odd integer constant approximating <math>A \cdot 2^w</math>, and <math>\gg</math> denotes a logical right shift by (w r) bits.


MAD Method/Universal hashing:
'''MAD Method / Universal hashing:'''


Map a key to:
Map a key to:


<math>
<math>
h(k) = \left( (ak + b) \mod p \right) \mod m
h(k) = \left( (ak + b) \bmod p \right) \bmod m
</math>
</math>


where a and b are integers chosen at random from interval 0 to p-1, p is a prime number larger than N.
where a and b are integers chosen at random from the interval [0, p−1] with a ≠ 0, and p is a prime number larger than the size of the key universe.


For the worst case keys, k1 != k2,
For any two distinct keys k₁ ≠ k₂, the probability of collision is:


<math>
<math>
Pr \left( h(k_1) = h(k_2) \right) = \frac{1}{m}
\Pr \left( h(k_1) = h(k_2) \right) \leq \frac{1}{m}
</math>
</math>
This family of hash functions is ''universal''.


==Implementation==
==Implementation==
Line 78: Line 94:
We implement a chained collision handling hash table here:
We implement a chained collision handling hash table here:


git.charlesreid1.com link: https://charlesreid1.com:3000/cs/java/src/master/hash/ChainedHashMap.java
git.charlesreid1.com link: https://git.charlesreid1.com/cs/java/src/master/hash/ChainedHashMap.java


In this particular implementation, we did not, strictly speaking, follow the "traditional" chained hash table by using a linked list to store items in different buckets - instead, we used the UnsortedArrayMap class, which is basically (internally) an ArrayList of MapItem objects (which are the key-value pairs that the map stores).  
In this particular implementation, we did not, strictly speaking, follow the "traditional" chained hash table by using a linked list to store items in different buckets - instead, we used the UnsortedArrayMap class, which is basically (internally) an ArrayList of MapItem objects (which are the key-value pairs that the map stores).  

Latest revision as of 04:41, 8 June 2026

Hash Maps: Collision Handling with Chaining

Chaining is a way of using linked lists to deal with the problem of turning a huge keyspace with a tiny number of keys into actual usable slots in an array.

Normally, if you turn a keyspace into usable slots in an array, you have a tradeoff between a large range (meaning a larger number of potential slots, and therefore higher memory usage) and a small range (meaning a smaller number of potential slots, and therefore a higher probability of colliding hashes).

Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.

So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same slot in the array (say, index 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.

Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is $ \Theta(n) $.

In the worst case, hashing sucks. But normally, it works really well, because of randomization.

Note: see Theta versus Big O for a discussion of when to use $ \Theta( f(N) ) $ versus $ O( f(N) ) $.

Expected length of chain

Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?

Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m items on average.

The probability of a particular key going into a particular slot is 1/m. Since each key is hashed independently, the expected number of keys in a given slot is the sum of the probabilities that each of the n keys lands in that slot:

$ E[\text{length}] = \sum_{i=1}^{n} \frac{1}{m} = \frac{n}{m} $

This ratio $ \alpha = n/m $ is called the load factor.

The expected runtime of searching for a key (unsuccessful search) is therefore $ \Theta(1 + \alpha) $, since we must compute the hash (constant time) and then traverse the chain of expected length α. For a successful search, the expected time is also $ \Theta(1 + \alpha) $.

If we keep $ m = \Theta(n) $ (i.e., the number of slots is proportional to the number of keys), then the load factor $ \alpha = O(1) $ and we can say that search is $ \Theta(1) $ on average.

Simple uniform hashing

Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.

Under simple uniform hashing, each key is equally likely to be hashed to any of the m slots, independently of where any other key has been hashed.

This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.

Good hash functions (compression functions)

Most common:

  • Division method
  • Multiplication method
  • MAD (multiply-and-divide) method / Universal hashing

Division method:

$ h(k) = k \bmod m $

We require m to be prime to minimize collisions, and be far from powers of 2 or 10 (more commonly occurring patterns in input).

Multiplication method:

$ h(k) = \lfloor m \cdot (k \cdot A \bmod 1) \rfloor $

where A is a constant in the range (0, 1). A common choice is $ A \approx (\sqrt{5} - 1)/2 \approx 0.6180339887 $ (the golden ratio conjugate).

On a w-bit machine with table size $ m = 2^r $, this can be implemented efficiently as:

$ h(k) = (a \cdot k \bmod 2^w) \gg (w - r) $

where a is an odd integer constant approximating $ A \cdot 2^w $, and $ \gg $ denotes a logical right shift by (w − r) bits.

MAD Method / Universal hashing:

Map a key to:

$ h(k) = \left( (ak + b) \bmod p \right) \bmod m $

where a and b are integers chosen at random from the interval [0, p−1] with a ≠ 0, and p is a prime number larger than the size of the key universe.

For any two distinct keys k₁ ≠ k₂, the probability of collision is:

$ \Pr \left( h(k_1) = h(k_2) \right) \leq \frac{1}{m} $

This family of hash functions is universal.

Implementation

We implement a chained collision handling hash table here:

git.charlesreid1.com link: https://git.charlesreid1.com/cs/java/src/master/hash/ChainedHashMap.java

In this particular implementation, we did not, strictly speaking, follow the "traditional" chained hash table by using a linked list to store items in different buckets - instead, we used the UnsortedArrayMap class, which is basically (internally) an ArrayList of MapItem objects (which are the key-value pairs that the map stores).

Flags