Hash Table | O(1) Search Data Structure Complete Guide

Hash Table | O(1) Search Data Structure Complete Guide

이 글의 핵심

Practical guide to hash tables. Complete coverage of O(1) search data structure with detailed examples and problem-solving patterns.

Introduction

What is a Hash Table?

A hash table is a data structure that stores key-value pairs, enabling search/insertion/deletion in average O(1) time complexity. O(1) means nearly constant time regardless of input size.

Other names:

  • Hash Map
  • Dictionary (Python)
  • Associative Array
  • Map (C++, Java)

Why is it fast?

Arrays allow O(1) access by index. Hash tables use a hash function (converts key to fixed-length number/slot) to transform keys into indices, enabling array-like fast access.

# Array: access by index
arr = [10, 20, 30, 40, 50]
print(arr[2])  # 30 (O(1))

# Hash table: access by key
hash_table = {"apple": 100, "banana": 200}
print(hash_table["apple"])  # 100 (O(1))
# Internally: hash("apple") % size → index → value

Time Complexity Comparison:

Data StructureSearchInsertDelete
Array (unsorted)O(n)O(1)O(n)
Array (sorted)O(log n)O(n)O(n)
Linked ListO(n)O(1)O(n)
Hash TableO(1)O(1)O(1)

1. Hash Function

What is a Hash Function?

A hash function converts arbitrary-size data into a fixed-size value (hash value).

Good Hash Function Properties:

  1. Deterministic: Same input → same output
  2. Fast Computation: O(1)
  3. Uniform Distribution: Hash values evenly distributed
  4. Minimize Collisions: Different inputs → different outputs (ideally)

Simple Hash Function Example

# Method 1: Python built-in hash() + modulo
def simple_hash(key, size):
    return hash(key) % size

print(simple_hash("apple", 10))   # 8 (store at index 8)
print(simple_hash("banana", 10))  # 3 (store at index 3)
print(simple_hash("cherry", 10))  # 6 (store at index 6)

# Method 2: String hash (manual implementation)
def string_hash(s, size):
    hash_value = 0
    for char in s:
        hash_value = (hash_value * 31 + ord(char)) % size
    return hash_value

print(string_hash("apple", 10))

# Python built-in hash()
print(hash("apple"))   # -6076896708304529682 (example)
print(hash(42))        # 42 (integers hash to themselves)
print(hash((1, 2)))    # 3713081631934410656 (tuple hash)

# ❌ Mutable objects cannot be hashed
# print(hash([1, 2]))  # TypeError: unhashable type: 'list'

Hash Collision

Collision occurs when different keys produce the same hash value.

# Collision example
def bad_hash(key, size):
    return len(key) % size  # Bad hash function

print(bad_hash("apple", 10))   # 5
print(bad_hash("grape", 10))   # 5 (collision!)
print(bad_hash("peach", 10))   # 5 (collision!)

2. Python dict Usage

Basic Operations

# Creation
hash_table = {}
hash_table = dict()

# Insert O(1)
hash_table["apple"] = 100
hash_table["banana"] = 200
hash_table["cherry"] = 300

# Search O(1)
print(hash_table["apple"])  # 100
print(hash_table.get("grape", 0))  # 0 (default)

# Update O(1)
hash_table["apple"] = 150
print(hash_table["apple"])  # 150

# Delete O(1)
del hash_table["apple"]
print("apple" in hash_table)  # False

# Existence check O(1)
if "banana" in hash_table:
    print("exists")

# Iteration O(n)
for key in hash_table:
    print(key, hash_table[key])

for key, value in hash_table.items():
    print(f"{key}: {value}")

# Keys/values lists
keys = list(hash_table.keys())
values = list(hash_table.values())

3. Collision Resolution

Method 1: Chaining

Store values with same index in a linked list.

class HashTableChaining:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]
    
    def _hash(self, key):
        return hash(key) % self.size
    
    def insert(self, key, value):
        idx = self._hash(key)
        
        # Update if key exists
        for i, (k, v) in enumerate(self.table[idx]):
            if k == key:
                self.table[idx][i] = (key, value)
                return
        
        # Add if not exists
        self.table[idx].append((key, value))
    
    def get(self, key):
        idx = self._hash(key)
        
        for k, v in self.table[idx]:
            if k == key:
                return v
        
        return None

# Test
ht = HashTableChaining(size=5)
ht.insert("apple", 100)
ht.insert("banana", 200)
print(ht.get("apple"))  # 100

Method 2: Open Addressing

Find another empty slot on collision.

class HashTableLinearProbing:
    def __init__(self, size=10):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
    
    def _hash(self, key):
        return hash(key) % self.size
    
    def insert(self, key, value):
        idx = self._hash(key)
        
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.values[idx] = value
                return
            idx = (idx + 1) % self.size
        
        self.keys[idx] = key
        self.values[idx] = value
    
    def get(self, key):
        idx = self._hash(key)
        
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                return self.values[idx]
            idx = (idx + 1) % self.size
        
        return None

4. Problem Solving

Problem 1: Two Sum

def two_sum(arr, target):
    """
    Find indices of two numbers that sum to target
    [2, 7, 11, 15], target=9 → [0, 1]
    
    Time: O(n), Space: O(n)
    """
    seen = {}  # value: index
    
    for i, num in enumerate(arr):
        complement = target - num
        
        if complement in seen:
            return [seen[complement], i]
        
        seen[num] = i
    
    return []

# Test
arr = [2, 7, 11, 15]
print(two_sum(arr, 9))   # [0, 1]
print(two_sum(arr, 13))  # [0, 2]

Problem 2: Group Anagrams

def group_anagrams(strs):
    """
    Group anagrams together
    ["eat", "tea", "tan", "ate", "nat", "bat"]
    → [["eat", "tea", "ate"], ["tan", "nat"], ["bat"]]
    
    Time: O(n * k log k), Space: O(n * k)
    """
    anagrams = {}
    
    for word in strs:
        key = ''.join(sorted(word))
        
        if key not in anagrams:
            anagrams[key] = []
        anagrams[key].append(word)
    
    return list(anagrams.values())

# Test
strs = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(strs))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]

Problem 3: Longest Substring Without Repeating Characters

def length_of_longest_substring(s):
    """
    "abcabcbb" → 3 ("abc")
    "bbbbb" → 1 ("b")
    
    Time: O(n), Space: O(min(n, m))
    """
    char_index = {}
    max_length = 0
    start = 0
    
    for i, char in enumerate(s):
        if char in char_index and char_index[char] >= start:
            start = char_index[char] + 1
        
        char_index[char] = i
        max_length = max(max_length, i - start + 1)
    
    return max_length

# Test
print(length_of_longest_substring("abcabcbb"))  # 3
print(length_of_longest_substring("bbbbb"))     # 1
print(length_of_longest_substring("pwwkew"))    # 3

5. Python Collections Module

Counter

from collections import Counter

# Create
counter = Counter([1, 2, 2, 3, 3, 3])
print(counter)  # Counter({3: 3, 2: 2, 1: 1})

# String count
text = "hello world"
counter = Counter(text)
print(counter)  # Counter({'l': 3, 'o': 2, ...})

# most_common(): most frequent elements
print(counter.most_common(3))  # [('l', 3), ('o', 2), ('h', 1)]

# Operations
c1 = Counter(['a', 'b', 'c', 'a'])
c2 = Counter(['a', 'b', 'b'])

print(c1 + c2)  # Counter({'a': 3, 'b': 3, 'c': 1})
print(c1 - c2)  # Counter({'a': 1, 'c': 1})
print(c1 & c2)  # Counter({'a': 1, 'b': 1}) (intersection)
print(c1 | c2)  # Counter({'a': 2, 'b': 2, 'c': 1}) (union)

defaultdict

from collections import defaultdict

# int default (0)
word_count = defaultdict(int)
for word in ["apple", "banana", "apple"]:
    word_count[word] += 1
print(dict(word_count))  # {'apple': 2, 'banana': 1}

# list default ([])
groups = defaultdict(list)
for name, grade in [("Alice", "A"), ("Bob", "B"), ("Charlie", "A")]:
    groups[grade].append(name)
print(dict(groups))  # {'A': ['Alice', 'Charlie'], 'B': ['Bob']}

LRU Cache Implementation

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity
    
    def get(self, key):
        if key not in self.cache:
            return -1
        
        self.cache.move_to_end(key)
        return self.cache[key]
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        
        self.cache[key] = value
        
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

# Test
cache = LRUCache(3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)

cache.get("a")
cache.put("d", 4)  # Removes 'b'

6. Usage Patterns

Pattern 1: Frequency Counting

def char_frequency(s):
    freq = {}
    for char in s:
        freq[char] = freq.get(char, 0) + 1
    return freq

print(char_frequency("hello"))  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}

# Using Counter
from collections import Counter
print(dict(Counter("hello")))

Pattern 2: Grouping

from collections import defaultdict

words = ["apple", "pie", "banana", "cat", "dog", "cherry"]

groups = defaultdict(list)
for word in words:
    groups[len(word)].append(word)

print(dict(groups))
# {5: ['apple'], 3: ['pie', 'cat', 'dog'], 6: ['banana', 'cherry']}

Pattern 3: Caching (Memoization)

# Fibonacci with caching
def fib_fast(n, cache={}):
    if n in cache:
        return cache[n]
    
    if n <= 1:
        return n
    
    cache[n] = fib_fast(n-1, cache) + fib_fast(n-2, cache)
    return cache[n]

print(fib_fast(30))  # 832040 (fast)

# Using functools.lru_cache (recommended)
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)

print(fib(30))  # 832040 (very fast)

Pattern 4: Index Mapping

def find_indices(arr, target):
    """Find all indices of target value"""
    index_map = {}
    
    for i, num in enumerate(arr):
        if num not in index_map:
            index_map[num] = []
        index_map[num].append(i)
    
    return index_map.get(target, [])

arr = [1, 2, 3, 2, 4, 2, 5]
print(find_indices(arr, 2))  # [1, 3, 5]

7. Common Mistakes

Mistake 1: Using Mutable Objects as Keys

# ❌ Wrong
# hash_table = {[1, 2]: "value"}  # TypeError: unhashable type: 'list'

# ✅ Correct
hash_table = {(1, 2): "value"}  # Use tuple
print(hash_table[(1, 2)])  # value

Mistake 2: Missing KeyError Handling

# ❌ Wrong
hash_table = {"a": 1}
# print(hash_table["b"])  # KeyError: 'b'

# ✅ Correct 1: Use get()
value = hash_table.get("b", 0)
print(value)  # 0

# ✅ Correct 2: Check with in
if "b" in hash_table:
    print(hash_table["b"])
else:
    print("Key not found")

Mistake 3: Modifying During Iteration

# ❌ Wrong
hash_table = {"a": 1, "b": 2, "c": 3}
# for key in hash_table:
#     if hash_table[key] == 2:
#         del hash_table[key]  # RuntimeError

# ✅ Correct 1: Convert to list
hash_table = {"a": 1, "b": 2, "c": 3}
for key in list(hash_table.keys()):
    if hash_table[key] == 2:
        del hash_table[key]

# ✅ Correct 2: Dictionary comprehension
hash_table = {"a": 1, "b": 2, "c": 3}
hash_table = {k: v for k, v in hash_table.items() if v != 2}

8. Performance Optimization

Load Factor Management

class DynamicHashTable:
    def __init__(self, initial_size=8):
        self.size = initial_size
        self.count = 0
        self.table = [[] for _ in range(self.size)]
    
    def _load_factor(self):
        return self.count / self.size
    
    def _resize(self):
        """Double size when load factor > 0.7"""
        old_table = self.table
        self.size *= 2
        self.table = [[] for _ in range(self.size)]
        self.count = 0
        
        for bucket in old_table:
            for key, value in bucket:
                self.insert(key, value)
    
    def insert(self, key, value):
        if self._load_factor() > 0.7:
            self._resize()
        
        idx = self._hash(key)
        
        for i, (k, v) in enumerate(self.table[idx]):
            if k == key:
                self.table[idx][i] = (key, value)
                return
        
        self.table[idx].append((key, value))
        self.count += 1

Summary

Key Points

  1. Hash Table: Key-value storage, average O(1) search/insert/delete
  2. Hash Function: Convert key to index, uniform distribution important
  3. Collision Resolution:
    • Chaining: Store in linked list
    • Open Addressing: Find another empty slot
  4. Python Tools:
    • dict: Basic hash table
    • Counter: Frequency counting
    • defaultdict: Auto-create default values
    • OrderedDict: Maintain order + extra features

When to Use

SituationData StructureTime Complexity
Fast searchHash TableO(1)
Frequency countCounterO(n)
Duplicate checksetO(n)
GroupingdefaultdictO(n)
Cachingdict / LRU CacheO(1)

Time Complexity Improvement Pattern

# Before: O(n²) - nested loops
def has_duplicate_slow(arr):
    for i in range(len(arr)):
        for j in range(i+1, len(arr)):
            if arr[i] == arr[j]:
                return True
    return False

# After: O(n) - hash table
def has_duplicate_fast(arr):
    seen = set()
    for num in arr:
        if num in seen:
            return True
        seen.add(num)
    return False

# Or simpler
def has_duplicate(arr):
    return len(arr) != len(set(arr))

Beginner

  • LeetCode 1: Two Sum
  • LeetCode 217: Contains Duplicate
  • LeetCode 242: Valid Anagram

Intermediate

  • LeetCode 49: Group Anagrams
  • LeetCode 3: Longest Substring Without Repeating Characters
  • LeetCode 146: LRU Cache

Advanced

  • LeetCode 76: Minimum Window Substring
  • LeetCode 149: Max Points on a Line
  • LeetCode 380: Insert Delete GetRandom O(1)

  • Arrays and Lists | Essential Data Structures
  • Stack and Queue | LIFO and FIFO Data Structures
  • Python Data Types | Complete Guide

Keywords

Hash Table, HashMap, Dictionary, Data Structure, O(1) Search, Hash Function, Collision Resolution, Chaining, Open Addressing, Counter, defaultdict, Coding Interview, Algorithm