Apprendre Count-Min Sketch for Frequency Estimation | Probabilistic & Streaming Data Structures

Glissez pour afficher le menu

In many real-world data engineering scenarios, you need to estimate the frequency of items in massive data streams — think of tracking trending hashtags on social media, counting queries in a search engine, or monitoring IP traffic in a data center. Trying to keep exact counts for every possible item quickly becomes impractical: the number of unique items may be enormous, memory is limited, and the data arrives too fast to store or process precisely. This is where the Count-Min Sketch comes in. Instead of storing every item, it uses a compact, probabilistic data structure to provide fast, approximate answers to frequency queries. While you sacrifice some accuracy, you gain dramatic improvements in speed and memory efficiency — making the Count-Min Sketch a powerful tool for streaming analytics.


              1234567891011121314151617181920212223242526272829303132333435
            
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.seeds = [i * 11 + 17 for i in range(depth)]

    def _hash(self, item, seed):
        # Use hashlib for reproducible hash values
        h = hashlib.md5((str(item) + str(seed)).encode())
        return int(h.hexdigest(), 16) % self.width

    def update(self, item, count=1):
        for i in range(self.depth):
            idx = self._hash(item, self.seeds[i])
            self.table[i][idx] += count

    def query(self, item):
        min_count = float('inf')
        for i in range(self.depth):
            idx = self._hash(item, self.seeds[i])
            min_count = min(min_count, self.table[i][idx])
        return min_count

# Example usage:
cms = CountMinSketch(width=100, depth=5)
cms.update("apple")
cms.update("banana", 3)
cms.update("apple", 2)
print("apple:", cms.query("apple"))   # Output: 3
print("banana:", cms.query("banana")) # Output: 3
print("grape:", cms.query("grape"))   # Output: 0

The Count-Min Sketch cleverly balances accuracy, memory usage, and speed. Its error is always one-sided: it may overestimate, but never underestimate, the true frequency of any item. The error margin and probability of exceeding it are determined by the width and depth of the sketch — wider tables reduce the maximum error, while deeper tables reduce the probability of a large error. This makes it ideal for high-speed environments like network monitoring, where you need to spot heavy hitters or detect anomalies without storing all data. In practice, you might use Count-Min Sketch to rank trending topics, throttle abusive users, or summarize massive logs for fast analytics. The trade-off is that you cannot retrieve the exact list of items or their true counts, but you can answer "how frequent is X?" for any X, with a guaranteed error bound and a tiny memory footprint.

Definition

The overestimation property of Count-Min Sketch means that frequency queries will never return a value less than the true count, but may return a value that is higher. This is due to hash collisions: multiple items may map to the same cell, causing their counts to be combined. The implication is that you can safely use the sketch to filter out items below a threshold (since you'll never miss a true heavy hitter), but you may need further checks to confirm high counts.

Tout était clair ?

Merci pour vos commentaires !

Section 3. Chapitre 2

Demandez à l'IA

Posez n'importe quelle question ou essayez l'une des questions suggérées pour commencer notre discussion

Section 3. Chapitre 2