Hash Tables

←Back to Tech Tree

inventorycoverage

Hash Tables #

Data StructuresDifficulty: ★★☆☆☆Depth: 1Unlocks: 1

Key-value storage with O(1) average lookup. Collision handling.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

h(key)A[i] (array bucket at index i)

Essential Relationships #

Prerequisites (2) #

Arrays5 atomsFunctions6 atoms

Unlocks (1) #

Trieslvl 3

Advanced Learning Details

Graph Position #

17

Depth Cost

1

Fan-Out (ROI)

1

Bottleneck Score

1

Chain Length

Cognitive Load #

6

Atomic Elements

47

Total Elements

L3

Percentile Level

L4

Atomic Level

All Concepts (23) #

Teaching Strategy #

Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.

Arrays give O(1) access—if you already know the index. Hash tables are the classic trick for turning “find by key” into “go to index” on average, without scanning.

TL;DR:

A hash table stores (key → value) pairs in an array A. A hash function h(key) deterministically maps each key to an array index i. Because different keys can map to the same i (a collision), the table needs a collision strategy (usually chaining or open addressing). With a good hash function and controlled load factor α, lookup/insert/delete are O(1) on average.

What Is a Hash Table? #

Why it exists (motivation) #

You already know a core tradeoff:

A hash table is a data structure that tries to keep the array’s speed while supporting associative mapping: store and retrieve values by a unique key.

Instead of asking “where is this key in the array?”, we compute where it should go.

Definition #

A hash table stores key-value pairs (key → value) using:

  1. 1)An underlying array A of size m (often called the number of buckets/slots)
  2. 2)A hash function h(key) that maps keys to indices:

Then we store the pair in A[i] (or in some structure associated with A[i]).

The idealized picture #

If every key mapped to a unique index, operations would be trivial:

That would be O(1) worst-case. But reality is messier.

Collisions: the central problem #

A collision happens when two different keys map to the same index:

Collisions are unavoidable because:

By the pigeonhole principle, if you insert more than m distinct keys, at least one collision must occur.

So a hash table is really:

Average-case O(1) (what it really means) #

When people say “hash tables have O(1) lookup,” the precise claim is:

The common performance lever is the load factor:

where n = number of stored elements, m = number of buckets.

Rough intuition:

Hash tables keep α under control by resizing (rehashing) when the table gets too full.

Core Mechanic 1: Hashing (Turning Keys into Array Indices) #

Why hashing works as a strategy #

Arrays are fast because indexing is direct. Hashing tries to manufacture an index from a key.

So the goal of h(key) is not “cryptographic security”—it’s:

The basic mapping #

We typically think of hashing in two stages:

  1. Map key to an integer (a hash code):
  1. Compress that integer into the array range:

So indices always land in 0…m−1.

Determinism and equality #

Hash tables assume a consistent relationship between equality and hashing:

The reverse is not required (and generally false):

That’s exactly collisions.

A concrete example: hashing strings #

A common approach is a polynomial rolling hash. For a string s with characters s[0], s[1], …, s[k−1], interpret each character as a number cᵢ (e.g., ASCII code).

One form:

Then compress:

You don’t need to memorize this. The idea is:

What makes a “good” hash function (practically) #

A hash function is “good” for hash tables if it:

If h is poor (e.g., always returns 0), performance collapses to O(n).

The role of m (table size) #

The modulus m matters. Many implementations choose m carefully (often prime, or a power of 2 with bit-mixing).

What you should remember at difficulty 2:

A small derivation: why collisions become likely #

Suppose keys distribute uniformly across m buckets.

That’s not a proof of O(1), but it builds the intuition: average work per bucket is proportional to α.

So keeping α bounded (e.g., α ≤ 0.75) is a strategy to keep operations near constant time.

Core Mechanic 2: Handling Collisions #

Collisions are not an edge case—they are the normal case once the table has enough items. Two main families of strategies dominate.

Strategy A: Separate chaining #

Idea: A[i] stores a collection (often a linked list, dynamic array, or small balanced tree) of all entries whose keys hash to i.

Operations (high-level):

  1. i = h(key)

  2. search bucket A[i] for key

  3. if found, update value; else append new pair

  1. i = h(key)

  2. search bucket A[i] for matching key

  1. i = h(key)

  2. remove entry from bucket if present

Cost intuition:

So average time is often described as O(1 + α). If α is kept constant by resizing, that becomes O(1) average.

Strategy B: Open addressing #

Idea: Store entries directly in the array. If A[i] is occupied, probe other indices according to a rule until you find an empty slot.

Common probing methods:

Cost intuition:

Also, deletion is trickier: you often need a special “tombstone” marker so searches don’t stop early.

Comparison table #

FeatureSeparate chainingOpen addressing
Where entries liveIn buckets (lists/arrays) at A[i]Directly in array slots
Handles α > 1?Yes (buckets can grow)No (must have n ≤ m)
Typical implementation complexitySimplerTrickier (probing + tombstones)
Memory overheadExtra pointers/structuresLow overhead
Performance when nearly fullMore gracefulCan degrade sharply

At this level, separate chaining is easiest to reason about and is common in teaching.

Resizing (rehashing): keeping α under control #

To keep operations fast, hash tables resize when α crosses a threshold.

Typical policy (example):

This is called rehashing because h depends on m.

Amortized cost intuition #

A resize costs O(n) because you reinsert all elements.

But it happens infrequently (e.g., when the table doubles). Over many inserts, the average extra cost per insert stays small.

You don’t need full amortized analysis yet—just remember:

Application/Connection: Where Hash Tables Show Up (and Why Tries Are Next) #

Common uses #

Hash tables appear any time you need fast “lookup by key”:

Hash tables hit a sweet spot:

Limitations (important to know) #

Hash tables are not always the best choice:

  1. No ordering by default
  1. Worst-case can be O(n)
  1. Prefix/range queries are awkward

Connection to Tries #

A trie is a tree specialized for strings (or sequences). It supports operations in O(k) where k = length of the string, and it naturally supports prefix queries.

So the mental transition is:

If your next goal is autocomplete, dictionary prefix search, or storing many strings with shared prefixes, tries are a natural unlock.

Worked Examples (3) #

Chaining lookup with a collision #

We have m = 5 buckets: A[0]…A[4]. The hash function is h(key) = key mod 5 for integer keys. We insert keys with values: (12 → "a"), (7 → "b"), (2 → "c"), (17 → "d"). Use separate chaining (each A[i] is a list). Then look up key 17.

  1. Compute indices:

    • •h(12) = 12 mod 5 = 2
    • •h(7) = 7 mod 5 = 2
    • •h(2) = 2 mod 5 = 2
    • •h(17) = 17 mod 5 = 2
  2. All keys collide into bucket A[2]. After insertion (order doesn’t matter conceptually), bucket A[2] contains:

    • •(12, "a"), (7, "b"), (2, "c"), (17, "d")

    All other buckets are empty.

  3. Lookup key 17:

    1. Compute i = h(17) = 2

    2. Go to bucket A[2]

    3. Scan entries until key = 17 is found

    4. Return the associated value "d"

Insight: Even with a terrible distribution (everything collides), the hash table is still correct—it just loses its speed. Performance depends on keeping average bucket sizes small (α controlled + decent hashing).

Load factor and expected bucket size #

A hash table with separate chaining has m = 100 buckets and currently holds n = 60 keys. Assume keys distribute uniformly. Estimate the average number of elements you examine during a successful lookup (very rough model: scan half the bucket on average).

  1. Compute load factor:

    α = n / m = 60 / 100 = 0.6

  2. Expected bucket size ≈ α ≈ 0.6 elements per bucket.

    Interpretation: many buckets are empty; some have 1; a few have 2+.

  3. For a successful lookup, you must search within the bucket containing the key.

    A rough rule of thumb: average comparisons in a list of length L is about (L+1)/2.

  4. Plug in L ≈ 0.6:

    Average comparisons ≈ (0.6 + 1) / 2 = 0.8

  5. Total work is:

    • •O(1) to compute i = h(key)
    • •plus about 0.8 key comparisons in the bucket

Insight: This is why people say O(1) average: if α stays bounded (like 0.6), the bucket work stays bounded too.

Resizing (rehashing) changes indices #

We store integer keys in a table with m = 4 using h(key) = key mod m and separate chaining. Keys inserted: 1, 5, 9. Then we resize to m′ = 8. Show the old and new bucket indices.

  1. With m = 4:

    • •h(1) = 1 mod 4 = 1
    • •h(5) = 5 mod 4 = 1
    • •h(9) = 9 mod 4 = 1

    So all three land in A[1] (a collision chain).

  2. Resize to m′ = 8. Now the hash-to-index step changes because modulus changed:

    • •h′(1) = 1 mod 8 = 1
    • •h′(5) = 5 mod 8 = 5
    • •h′(9) = 9 mod 8 = 1
  3. After rehashing into the new array A′:

    • •A′[1] contains keys 1 and 9
    • •A′[5] contains key 5

    Bucket sizes are smaller than before.

Insight: Resizing isn’t just “making more room”—it changes where keys belong. That’s why you must reinsert (rehash) existing entries.

Key Takeaways #

Common Mistakes #

Practice #

easy

A hash table uses m = 10 buckets and separate chaining. You insert n = 25 keys. (1) Compute the load factor α. (2) If keys distribute uniformly, what is the expected bucket size?

Hint: Use α = n/m. Under uniform distribution, expected bucket size ≈ α.

Show solution

(1) α = n/m = 25/10 = 2.5

(2) Expected bucket size ≈ 2.5 entries per bucket.

medium

Given m = 7 and h(key) = key mod 7 for integer keys, insert keys 10, 3, 17, 24 into a chaining hash table. List which bucket A[i] each key goes to, and identify collisions.

Hint: Compute each key mod 7. A collision means two keys share the same i.

Show solution

Compute indices:

All collide in bucket A[3].

hard

Open addressing (linear probing): m = 8, h(key) = key mod 8. Insert keys in order: 3, 11, 19. Show the final array indices used (assume empty array initially).

Hint: All three keys have the same initial index. With linear probing, you try i, i+1, i+2,… wrapping mod m.

Show solution

Compute initial indices:

Final placements: 3@A[3], 11@A[4], 19@A[5].

Connections #

Next: Tries

Related reinforcement nodes you likely already know:

Quality: A (4.4/5)

← back to treebrowse all →