Learn HyperLogLog and Cardinality Estimation | Probabilistic & Streaming Data Structures
Data Structures and Algorithms for Scalable Systems

HyperLogLog and Cardinality Estimation

When working with massive datasets, such as web logs, telemetry streams, or user activity data, one important question often arises: how many unique items are present? This question, known as cardinality estimation, is deceptively challenging at scale. Traditional methods, like storing each unique element in a set or hash table, quickly become impractical because they require memory proportional to the number of distinct elements. As datasets grow to billions of records, this can lead to enormous resource consumption.
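For contrast, the exact approach is simple but memory-hungry. A minimal illustration (the stream of user IDs here is synthetic, just to show the growth):

# Exact counting: memory grows linearly with the number of distinct items
unique_items = set()
for i in range(10_000_000):
    unique_items.add(f"user_{i % 5_000_000}")  # synthetic stream with repeats
print("Exact unique count:", len(unique_items))  # 5,000,000 stored strings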

HyperLogLog is a probabilistic algorithm designed specifically to address this challenge. Instead of tracking every unique element, it uses clever mathematical techniques to estimate the cardinality with high accuracy and dramatically reduced memory usage. HyperLogLog relies on the properties of hash functions and the distribution of leading zeros in hashed values to produce reliable estimates, even when processing data streams that cannot fit in memory.
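The core intuition: a good hash function maps each distinct element to what is effectively a uniform random bit string, and the probability that such a string begins with k zero bits is 2^-k. So if the longest run of leading zeros observed across a stream is roughly k, the stream plausibly contained on the order of 2^k distinct elements. HyperLogLog sharpens this crude single-observation estimate by routing elements into many buckets based on a prefix of the hash and combining the per-bucket observations with a harmonic mean. The simplified implementation below follows this recipe; it uses SHA-1 for clarity, though production implementations favor faster non-cryptographic hashes.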

import hashlib

class SimpleHyperLogLog:
    HASH_BITS = 160  # SHA-1 digests are 160 bits wide

    def __init__(self, b=10):
        self.b = b            # Number of bits used for bucket selection
        self.m = 1 << b       # Number of buckets (2^b)
        self.registers = [0] * self.m

    def _hash(self, value):
        # Use SHA-1 and interpret the hex digest as a 160-bit integer
        h = hashlib.sha1(str(value).encode('utf-8')).hexdigest()
        return int(h, 16)

    def add(self, value):
        x = self._hash(value)
        j = x & (self.m - 1)  # Bucket index: the last b bits
        w = x >> self.b       # Remaining bits
        self.registers[j] = max(self.registers[j], self._rho(w))

    def _rho(self, w):
        # Position of the leftmost 1-bit in w, treated as a fixed-width
        # (HASH_BITS - b)-bit string: number of leading zeros plus one
        width = self.HASH_BITS - self.b
        if w == 0:
            return width + 1
        return width - w.bit_length() + 1

    def count(self):
        # Raw HyperLogLog estimate: E = alpha_m * m^2 / sum(2^-M[j]);
        # omits the small- and large-range corrections of the full algorithm
        alpha = 0.7213 / (1 + 1.079 / self.m)
        Z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return int(alpha * self.m * self.m * Z)

# Demonstration
hll = SimpleHyperLogLog(b=8)  # 256 buckets
for i in range(10000):
    hll.add(f"user_{i}")
print("Estimated unique elements:", hll.count())

HyperLogLog achieves remarkable accuracy while using only kilobytes of memory to estimate millions of unique elements. The trade-off is that the result is an estimate, not an exact count, but for many analytics and monitoring scenarios, this is more than sufficient. HyperLogLog is especially popular in systems that process large-scale logs, real-time event streams, and analytics platforms, where tracking unique visitors, IPs, or transactions is essential but memory is at a premium. Its memory efficiency and ability to merge estimates from multiple sources make it a favorite for distributed systems and big data pipelines.
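Merging works because each register stores a maximum: the sketch of the union of two streams is just the element-wise maximum of the two register arrays. Here is a minimal sketch of this idea, written as a hypothetical merge method for the SimpleHyperLogLog class above (not part of the original code):

    def merge(self, other):
        # Combine two sketches built with the same b; counting the merged
        # sketch estimates the cardinality of the union of both streams
        assert self.b == other.b, "sketches must use the same bucket count"
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

# Example: two nodes each sketch their own traffic, then merge
node_a, node_b = SimpleHyperLogLog(b=8), SimpleHyperLogLog(b=8)
for i in range(5000):
    node_a.add(f"user_{i}")
for i in range(2500, 7500):
    node_b.add(f"user_{i}")
node_a.merge(node_b)
print("Estimated union size:", node_a.count())  # roughly 7500

This is what makes HyperLogLog so useful in distributed settings: each worker sketches its own shard locally, and only the tiny register arrays need to be shipped and combined.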

Note

HyperLogLog is widely used in real-world systems such as Redis, where it powers the PFCOUNT and PFADD commands for cardinality estimation. It is also a core component of streaming analytics platforms that need to track unique users or events in real time, such as Apache Flink and Google Dataflow.
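For a sense of how this looks in practice, here is a brief sketch using the redis-py client; it assumes a Redis server running on localhost, and the key names are only examples:

import redis

r = redis.Redis()

# PFADD inserts elements into a HyperLogLog stored under a key
r.pfadd("visitors:day1", "alice", "bob", "carol")
r.pfadd("visitors:day2", "bob", "dave")

# PFCOUNT over one key estimates its cardinality; over several keys it
# estimates the cardinality of their union
print(r.pfcount("visitors:day1"))                   # ~3
print(r.pfcount("visitors:day1", "visitors:day2"))  # ~4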


