Hashing Fundamentals and Collision Strategies
Hashing is a fundamental concept in data engineering, providing a way to efficiently organize, store, and retrieve data. A hash function is an algorithm that takes input data (such as a string or number) and produces a fixed-size integer, known as a hash value or hash code. This process is crucial in many data engineering tasks, including partitioning large datasets across servers, creating indexes for fast lookups, and ensuring rapid access to key-value stores.
Hash functions are valued for their speed and their ability to distribute data evenly across a range. In data engineering, you often use them to assign records to partitions in distributed systems, or to locate the position of a record in an index. The quality of a hash function is measured by how uniformly it distributes inputs across possible outputs and how well it minimizes collisions, where different inputs produce the same hash value.
import matplotlib.pyplot as plt

def simple_hash(key, size):
    return sum(ord(char) for char in str(key)) % size

# Visualize hash value distribution for sample keys
table_size = 20
keys = [f"user{i}" for i in range(100)]
hashes = [simple_hash(k, table_size) for k in keys]

plt.hist(hashes, bins=range(table_size + 1), edgecolor="black", align="left")
plt.title("Distribution of Hash Values (Simple Hash Function)")
plt.xlabel("Hash Value")
plt.ylabel("Frequency")
plt.show()
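The character-sum hash above is easy to follow, but because many different keys share the same character sum, it can distribute keys unevenly. One common alternative, shown here only as a sketch (the use of hashlib is an assumption, not part of the original lesson), derives the index from a general-purpose hash such as MD5:

import hashlib
from collections import Counter

def md5_hash(key, size):
    # Reduce the key's MD5 digest to a table index
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % size

table_size = 20
keys = [f"user{i}" for i in range(100)]
hashes = [md5_hash(k, table_size) for k in keys]

# Count how many keys land in each slot; a well-distributed hash
# keeps these counts roughly equal
print(sorted(Counter(hashes).items()))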
When using hash functions, collisions are inevitable: two different keys may produce the same hash value. Efficient collision resolution is essential for hash-based structures like hash tables. Two common strategies are chaining and open addressing.
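Before looking at those strategies, it helps to see a collision happen. Because simple_hash from the earlier example only sums character codes, any two keys that are anagrams of each other always collide, whatever the table size:

def simple_hash(key, size):
    # Same character-sum hash as in the first example
    return sum(ord(char) for char in str(key)) % size

# "listen" and "silent" contain the same characters, so their
# character-code sums are identical and they map to the same slot
print(simple_hash("listen", 20))  # 15
print(simple_hash("silent", 20))  # 15 -> collision with "listen"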
Chaining stores all items that hash to the same value in a linked list (or another collection) at that slot. This approach is simple and flexible, allowing the table to handle an arbitrary number of collisions at each slot. However, if many keys collide, lookup times can degrade as the lists grow longer.
Open addressing, on the other hand, stores all entries directly within the hash table array. When a collision occurs, the algorithm searches for another open slot using a probing sequence (such as linear or quadratic probing). This keeps all data in the array but can lead to clustering, where groups of entries fill up contiguous slots, increasing lookup times as the table becomes fuller.
Below is a simplified diagram illustrating the difference, assuming keys A and D both hash to index 0, B hashes to index 1, and C hashes to index 2:
Chaining:
Index 0: [A, D]
Index 1: [B]
Index 2: [C]
Index 3: []
Open Addressing (linear probing):
Index 0: A
Index 1: B
Index 2: C
Index 3: D

Here D also hashes to index 0, but because slots 0 through 2 are already occupied, linear probing places it in the next free slot, index 3.
# Hash table with chaining for collision handling
class HashTableChaining:
    def __init__(self, size):
        self.size = size
        self.table = [[] for _ in range(size)]

    def _hash(self, key):
        return sum(ord(char) for char in str(key)) % self.size

    def insert(self, key, value):
        idx = self._hash(key)
        for i, (k, v) in enumerate(self.table[idx]):
            if k == key:
                self.table[idx][i] = (key, value)
                return
        self.table[idx].append((key, value))

    def get(self, key):
        idx = self._hash(key)
        for k, v in self.table[idx]:
            if k == key:
                return v
        return None

# Example usage
ht = HashTableChaining(5)
ht.insert("apple", 10)
ht.insert("banana", 20)
ht.insert("grape", 30)
ht.insert("apricot", 40)  # Collides with "banana" (both hash to index 4)
print(ht.table)
print(ht.get("banana"))  # Output: 20
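The chaining table above has a natural open-addressing counterpart. Below is a minimal linear-probing sketch for comparison; the class name and structure are illustrative rather than part of the original lesson, and it omits deletion and automatic resizing to keep the example short.

# Hash table with open addressing (linear probing) for collision handling
class HashTableLinearProbing:
    def __init__(self, size):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _hash(self, key):
        return sum(ord(char) for char in str(key)) % self.size

    def insert(self, key, value):
        idx = self._hash(key)
        for step in range(self.size):
            probe = (idx + step) % self.size
            if self.keys[probe] is None or self.keys[probe] == key:
                self.keys[probe] = key
                self.values[probe] = value
                return
        raise RuntimeError("Hash table is full")

    def get(self, key):
        idx = self._hash(key)
        for step in range(self.size):
            probe = (idx + step) % self.size
            if self.keys[probe] is None:
                return None  # Empty slot reached: key is absent
            if self.keys[probe] == key:
                return self.values[probe]
        return None

# Example usage
ht = HashTableLinearProbing(5)
ht.insert("apple", 10)    # Hashes to index 0
ht.insert("banana", 20)   # Hashes to index 4
ht.insert("apricot", 40)  # Also hashes to 4; probes past "banana" and "apple" to index 1
print(ht.get("apricot"))  # Output: 40

Note that a lookup must keep probing until it finds either the key or an empty slot, which is one reason open-addressing tables are sensitive to high load factors.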
The load factor of a hash table is the ratio of the number of stored elements to the total number of slots in the table. A higher load factor increases the chance of collisions and can slow down operations. Keeping the load factor below a certain threshold (commonly 0.7) helps maintain efficient performance by reducing the likelihood of long chains or probing sequences.
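To make the definition concrete, here is a small, hypothetical helper that computes the load factor and checks it against the common 0.7 threshold mentioned above (the names are illustrative, not from the lesson):

def load_factor(num_elements, num_slots):
    # Ratio of stored elements to available slots
    return num_elements / num_slots

MAX_LOAD_FACTOR = 0.7  # Common threshold before resizing

table_size = 10
for num_elements in (3, 7, 8):
    lf = load_factor(num_elements, table_size)
    needs_resize = lf > MAX_LOAD_FACTOR
    print(f"{num_elements} elements / {table_size} slots -> "
          f"load factor {lf:.2f}, resize needed: {needs_resize}")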