Learn HyperLogLog and Cardinality Estimation | Probabilistic & Streaming Data Structures
Data Structures and Algorithms for Scalable Systems

HyperLogLog and Cardinality Estimation

When working with massive datasets, such as web logs, telemetry streams, or user activity data, one important question often arises: how many unique items are present? This question, known as cardinality estimation, is deceptively challenging at scale. Traditional methods, like storing each unique element in a set or hash table, quickly become impractical because they require memory proportional to the number of distinct elements. As datasets grow to billions of records, this can lead to enormous resource consumption.
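For contrast, the exact approach is simple but memory-hungry. A minimal illustration (the stream of user IDs here is synthetic, just to show the growth):

# Exact counting: memory grows linearly with the number of distinct items
unique_items = set()
for i in range(10_000_000):
    unique_items.add(f"user_{i % 5_000_000}")  # synthetic stream with repeats
print("Exact unique count:", len(unique_items))  # 5,000,000 stored strings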

HyperLogLog is a probabilistic algorithm designed specifically to address this challenge. Instead of tracking every unique element, it uses clever mathematical techniques to estimate the cardinality with high accuracy and dramatically reduced memory usage. HyperLogLog relies on the properties of hash functions and the distribution of leading zeros in hashed values to produce reliable estimates, even when processing data streams that cannot fit in memory.
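The core intuition: a good hash function maps each distinct element to what is effectively a uniform random bit string, and the probability that such a string begins with k zero bits is 2^-k. So if the longest run of leading zeros observed across a stream is roughly k, the stream plausibly contained on the order of 2^k distinct elements. HyperLogLog sharpens this crude single-observation estimate by routing elements into many buckets based on a prefix of the hash and combining the per-bucket observations with a harmonic mean. The simplified implementation below follows this recipe; it uses SHA-1 for clarity, though production implementations favor faster non-cryptographic hashes.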

import hashlib

class SimpleHyperLogLog:
    HASH_BITS = 160  # SHA-1 digests are 160 bits wide

    def __init__(self, b=10):
        self.b = b            # Number of bits used for bucket selection
        self.m = 1 << b       # Number of buckets (2^b)
        self.registers = [0] * self.m

    def _hash(self, value):
        # Use SHA-1 and interpret the hex digest as a 160-bit integer
        h = hashlib.sha1(str(value).encode('utf-8')).hexdigest()
        return int(h, 16)

    def add(self, value):
        x = self._hash(value)
        j = x & (self.m - 1)  # Bucket index: the last b bits
        w = x >> self.b       # Remaining bits
        self.registers[j] = max(self.registers[j], self._rho(w))

    def _rho(self, w):
        # Position of the leftmost 1-bit in w, treated as a fixed-width
        # (HASH_BITS - b)-bit string: number of leading zeros plus one
        width = self.HASH_BITS - self.b
        if w == 0:
            return width + 1
        return width - w.bit_length() + 1

    def count(self):
        # Raw HyperLogLog estimate: E = alpha_m * m^2 / sum(2^-M[j]);
        # omits the small- and large-range corrections of the full algorithm
        alpha = 0.7213 / (1 + 1.079 / self.m)
        Z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return int(alpha * self.m * self.m * Z)

# Demonstration
hll = SimpleHyperLogLog(b=8)  # 256 buckets
for i in range(10000):
    hll.add(f"user_{i}")
print("Estimated unique elements:", hll.count())

HyperLogLog achieves remarkable accuracy while using only kilobytes of memory to estimate millions of unique elements. The trade-off is that the result is an estimate, not an exact count, but for many analytics and monitoring scenarios, this is more than sufficient. HyperLogLog is especially popular in systems that process large-scale logs, real-time event streams, and analytics platforms, where tracking unique visitors, IPs, or transactions is essential but memory is at a premium. Its memory efficiency and ability to merge estimates from multiple sources make it a favorite for distributed systems and big data pipelines.
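Merging works because each register stores a maximum: the sketch of the union of two streams is just the element-wise maximum of the two register arrays. Here is a minimal sketch of this idea, written as a hypothetical merge method for the SimpleHyperLogLog class above (not part of the original code):

    def merge(self, other):
        # Combine two sketches built with the same b; counting the merged
        # sketch estimates the cardinality of the union of both streams
        assert self.b == other.b, "sketches must use the same bucket count"
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

# Example: two nodes each sketch their own traffic, then merge
node_a, node_b = SimpleHyperLogLog(b=8), SimpleHyperLogLog(b=8)
for i in range(5000):
    node_a.add(f"user_{i}")
for i in range(2500, 7500):
    node_b.add(f"user_{i}")
node_a.merge(node_b)
print("Estimated union size:", node_a.count())  # roughly 7500

This is what makes HyperLogLog so useful in distributed settings: each worker sketches its own shard locally, and only the tiny register arrays need to be shipped and combined.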

Note

HyperLogLog is widely used in real-world systems such as Redis, where it powers the PFCOUNT and PFADD commands for cardinality estimation. It is also a core component of streaming analytics platforms that need to track unique users or events in real time, such as Apache Flink and Google Dataflow.
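For a sense of how this looks in practice, here is a brief sketch using the redis-py client; it assumes a Redis server running on localhost, and the key names are only examples:

import redis

r = redis.Redis()

# PFADD inserts elements into a HyperLogLog stored under a key
r.pfadd("visitors:day1", "alice", "bob", "carol")
r.pfadd("visitors:day2", "bob", "dave")

# PFCOUNT over one key estimates its cardinality; over several keys it
# estimates the cardinality of their union
print(r.pfcount("visitors:day1"))                   # ~3
print(r.pfcount("visitors:day1", "visitors:day2"))  # ~4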


