Week 6: Data Structures (Block 2)

Hash Tables

Hash tables are arguably the most important data structure in everyday programming. They let you store and look up data in expected O(1) time — essentially instant, regardless of how much data you have. The idea is simple: use a 'hash function' to convert each key into an array index, then store the value at that index. The challenge arises when two different keys hash to the same index (called a 'collision'). This week you will learn how hash functions work, two main strategies for handling collisions (chaining — storing a linked list at each slot, and open addressing — probing for another empty slot), and how the 'load factor' (the ratio of stored items to table size) affects performance. Hash tables power dictionaries in Python, objects in JavaScript, and HashMaps in Java.
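A minimal sketch of that idea in Python (the table size 11 and the integer keys are arbitrary choices for this illustration):

```python
m = 11                      # table size (an arbitrary choice for this sketch)
table = [None] * m

def h(k):
    # Toy hash function: map an integer key to a slot index.
    return k % m

table[h(25)] = "data for key 25"   # 25 mod 11 == 3, so it lands in slot 3
print(table[3])                     # -> data for key 25
print(h(36))                        # -> 3 as well: keys 25 and 36 collide
```

Both keys map to slot 3 — exactly the collision case that the rest of this week addresses.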

Learning Objectives

- Understand what hash functions do (convert keys to array indices) and what makes a good hash function (uniform distribution, deterministic, fast to compute)
- Compare the two main collision resolution strategies: chaining (each slot holds a linked list of colliding elements) vs open addressing (find another empty slot using a probing sequence)
- Analyze expected performance: with a good hash function and reasonable load factor, operations take O(1 + α) expected time, where α = n/m is the load factor
- Implement the three open addressing strategies: linear probing (check the next slot), quadratic probing (check slots at increasing squared distances), and double hashing (use a second hash function to determine the step size)

Key Concepts

Hash Functions

A hash function is like an address calculator — it takes a key (like a name or number) and converts it into an array index where that key's data will be stored. For example, if your table has 11 slots (indices 0 through 10), the hash function for key k might be h(k) = k mod 11 — so key 25 would go to slot 25 mod 11 = 3. A good hash function should spread keys evenly across all slots (to minimize collisions), always give the same result for the same key (deterministic), and be fast to compute. The simplest approach is the division method (h(k) = k mod m, where m is the table size and ideally a prime number). There are also the multiplication method and universal hashing, which randomly picks a hash function from a family to defend against adversarial inputs.

- Division method: h(k) = k mod m — simple and widely used. Choosing m as a prime number (not a power of 2) helps distribute keys more evenly
- Multiplication method: h(k) = ⌊m · (k · A mod 1)⌋ where 0 < A < 1 — more robust against bad choices of m; Knuth suggests A = (√5 − 1)/2 ≈ 0.6180 (the fractional part of the golden ratio)
- Universal hashing: randomly choose a hash function from a predefined family at runtime — this prevents any single input pattern from always causing worst-case performance
- Simple uniform hashing assumption (SUHA): we often assume each key is equally likely to hash to any slot — this simplifies analysis and gives expected O(1) operations
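The division and multiplication methods can be sketched directly from their formulas (table size 13 is this example's choice, a prime per the division-method advice):

```python
import math

m = 13  # table size; a prime, per the division-method advice above

def h_division(k):
    # Division method: h(k) = k mod m
    return k % m

A = (math.sqrt(5) - 1) / 2  # Knuth's suggested constant, ~0.6180

def h_multiplication(k):
    # Multiplication method: h(k) = floor(m * ((k * A) mod 1))
    return int(m * ((k * A) % 1.0))

print(h_division(25))        # 12 (25 mod 13)
print(h_multiplication(25))  # always a slot in [0, m); here 5
```

Note that both functions are deterministic: the same key always yields the same slot, which is what makes lookup possible later.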

Chaining

Chaining is the simplest way to handle collisions. Each slot in the hash table holds a linked list (a chain) of all elements that hash to that index. When you insert a new key, compute its hash to find the slot, and prepend it to that slot's list — this takes O(1). When you search for a key, go to its slot and scan through the linked list there. On average, each list has α = n/m elements (where n is the number of stored keys and m is the table size), so search takes O(1 + α) expected time. If the load factor α is kept small (say, below 1), operations are nearly O(1). Chaining is simple and works well even when the table gets full — you can have more elements than table slots.

- Load factor α = n/m (number of stored elements divided by table size) — this is the key measure of how 'full' the table is
- Expected search time is O(1 + α): the '1' accounts for computing the hash, and 'α' accounts for scanning the average chain length
- Chaining works well even when α > 1 (more elements than slots) — the chains just get longer, and performance degrades gracefully
- Requires extra memory for the linked-list pointers at each slot, which can add up for very large tables
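The chaining scheme above can be sketched in a few lines of Python (the class name and fixed table size are this example's choices; a Python list stands in for each linked-list chain, and there is no resizing logic):

```python
class ChainedHashTable:
    """Minimal chaining hash table (a sketch; fixed size, no resizing)."""

    def __init__(self, m=11):
        self.m = m                            # table size
        self.slots = [[] for _ in range(m)]   # one chain per slot
        self.n = 0                            # number of stored keys

    def _h(self, key):
        return hash(key) % self.m

    def put(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: update it
                chain[i] = (key, value)
                return
        chain.append((key, value))            # O(1) append to the chain
        self.n += 1

    def get(self, key):
        for k, v in self.slots[self._h(key)]: # scan one chain: O(1 + α) expected
            if k == key:
                return v
        raise KeyError(key)

    def load_factor(self):
        return self.n / self.m                # α = n / m

t = ChainedHashTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"))        # 1
print(t.load_factor())   # 2/11 ≈ 0.18
```

Even if "a" and "b" happen to collide, both land in the same chain and are found by the scan — that is the graceful degradation the bullets describe.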

Open Addressing

Open addressing is an alternative to chaining where all elements are stored directly in the table array — no linked lists needed. When a collision occurs (the desired slot is already taken), you 'probe' for the next available slot using a specific pattern. Linear probing checks the next slot, then the one after that, and so on (h(k,i) = (h(k)+i) mod m) — it is simple but creates 'clusters' of consecutive filled slots that slow things down. Quadratic probing checks slots at increasing squared distances (1, 4, 9, 16...), which reduces clustering. Double hashing uses a second hash function to determine the step size, giving the best distribution. The big constraint with open addressing is that the load factor must stay below 1 (the table cannot be more than 100% full), and in practice you should resize (rehash) when it reaches about 70%.

- Linear probing: simplest method (check the next slot, then the one after), but creates 'primary clustering' — long runs of filled slots that get longer over time
- Quadratic probing: reduces primary clustering by jumping in squared increments (1, 4, 9, 16, ...), but can create 'secondary clustering' where keys with the same initial hash follow the same probe sequence
- Double hashing: provides the best distribution by using a second hash function h₂(k) to determine the step size — requires h₂(k) ≠ 0 and h₂(k) coprime with m
- The load factor α must stay below 1 (you cannot store more items than there are slots); in practice, resize and rehash everything when α reaches about 0.7 to maintain good performance
- Expected number of probes for an unsuccessful search (key not found) is approximately 1/(1 − α) — at α = 0.9, that is about 10 probes on average
- ⚠ Common mistake: in open addressing, do NOT stop searching when you hit a deleted marker (*). You must continue probing past deleted slots. Only stop at an empty (never-used) slot or when the key is found.
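A linear-probing sketch that also illustrates the tombstone rule from the last bullet (the names `_EMPTY` and `_DELETED` and the default size are this example's conventions, not a standard API; resizing is omitted):

```python
_EMPTY = object()    # marker: slot never used
_DELETED = object()  # tombstone: slot held a key that was later deleted

class LinearProbingTable:
    """Minimal open-addressing table with linear probing (a sketch)."""

    def __init__(self, m=7):
        self.m = m
        self.keys = [_EMPTY] * m
        self.vals = [None] * m

    def _probe(self, key):
        # Linear probing: h(k, i) = (h(k) + i) mod m for i = 0, 1, 2, ...
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def put(self, key, value):
        first_tombstone = None
        for j in self._probe(key):
            if self.keys[j] is _EMPTY:
                # Key is not in the table; reuse an earlier tombstone if seen.
                if first_tombstone is not None:
                    j = first_tombstone
                self.keys[j], self.vals[j] = key, value
                return
            if self.keys[j] is _DELETED:
                if first_tombstone is None:
                    first_tombstone = j        # remember it, but keep probing
            elif self.keys[j] == key:
                self.vals[j] = value           # key already present: update
                return
        if first_tombstone is not None:        # no empty slot, but a tombstone
            self.keys[first_tombstone] = key
            self.vals[first_tombstone] = value
            return
        raise RuntimeError("table full: resize and rehash (in practice at α ≈ 0.7)")

    def get(self, key):
        for j in self._probe(key):
            if self.keys[j] is _EMPTY:
                break                          # never-used slot: key cannot be further on
            # Do NOT stop at _DELETED: keep probing past tombstones.
            if self.keys[j] is not _DELETED and self.keys[j] == key:
                return self.vals[j]
        raise KeyError(key)

    def delete(self, key):
        for j in self._probe(key):
            if self.keys[j] is _EMPTY:
                break
            if self.keys[j] is not _DELETED and self.keys[j] == key:
                self.keys[j] = _DELETED        # leave a tombstone, not _EMPTY
                self.vals[j] = None
                return
        raise KeyError(key)

t = LinearProbingTable()
t.put("a", 1)
t.put("b", 2)
t.delete("a")
print(t.get("b"))  # 2: the search probes past "a"'s tombstone if needed
```

Marking deleted slots with `_EMPTY` instead of a tombstone would break `get`: a later key whose probe sequence passed through the deleted slot would become unreachable.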