Count-Min Sketch for Frequency Estimation
In many real-world data engineering scenarios, you need to estimate the frequency of items in massive data streams — think of tracking trending hashtags on social media, counting queries in a search engine, or monitoring IP traffic in a data center. Trying to keep exact counts for every possible item quickly becomes impractical: the number of unique items may be enormous, memory is limited, and the data arrives too fast to store or process precisely. This is where the Count-Min Sketch comes in. Instead of storing every item, it uses a compact, probabilistic data structure to provide fast, approximate answers to frequency queries. While you sacrifice some accuracy, you gain dramatic improvements in speed and memory efficiency — making the Count-Min Sketch a powerful tool for streaming analytics.
1234567891011121314151617181920212223242526272829303132333435import hashlib class CountMinSketch: def __init__(self, width, depth): self.width = width self.depth = depth self.table = [[0] * width for _ in range(depth)] self.seeds = [i * 11 + 17 for i in range(depth)] def _hash(self, item, seed): # Use hashlib for reproducible hash values h = hashlib.md5((str(item) + str(seed)).encode()) return int(h.hexdigest(), 16) % self.width def update(self, item, count=1): for i in range(self.depth): idx = self._hash(item, self.seeds[i]) self.table[i][idx] += count def query(self, item): min_count = float('inf') for i in range(self.depth): idx = self._hash(item, self.seeds[i]) min_count = min(min_count, self.table[i][idx]) return min_count # Example usage: cms = CountMinSketch(width=100, depth=5) cms.update("apple") cms.update("banana", 3) cms.update("apple", 2) print("apple:", cms.query("apple")) # Output: 3 print("banana:", cms.query("banana")) # Output: 3 print("grape:", cms.query("grape")) # Output: 0
The Count-Min Sketch cleverly balances accuracy, memory usage, and speed. Its error is always one-sided: it may overestimate, but never underestimate, the true frequency of any item. The error margin and probability of exceeding it are determined by the width and depth of the sketch — wider tables reduce the maximum error, while deeper tables reduce the probability of a large error. This makes it ideal for high-speed environments like network monitoring, where you need to spot heavy hitters or detect anomalies without storing all data. In practice, you might use Count-Min Sketch to rank trending topics, throttle abusive users, or summarize massive logs for fast analytics. The trade-off is that you cannot retrieve the exact list of items or their true counts, but you can answer "how frequent is X?" for any X, with a guaranteed error bound and a tiny memory footprint.
The overestimation property of Count-Min Sketch means that frequency queries will never return a value less than the true count, but may return a value that is higher. This is due to hash collisions: multiple items may map to the same cell, causing their counts to be combined. The implication is that you can safely use the sketch to filter out items below a threshold (since you'll never miss a true heavy hitter), but you may need further checks to confirm high counts.
Дякуємо за ваш відгук!
Запитати АІ
Запитати АІ
Запитайте про що завгодно або спробуйте одне із запропонованих запитань, щоб почати наш чат
Чудово!
Completion показник покращився до 7.69
Count-Min Sketch for Frequency Estimation
Свайпніть щоб показати меню
In many real-world data engineering scenarios, you need to estimate the frequency of items in massive data streams — think of tracking trending hashtags on social media, counting queries in a search engine, or monitoring IP traffic in a data center. Trying to keep exact counts for every possible item quickly becomes impractical: the number of unique items may be enormous, memory is limited, and the data arrives too fast to store or process precisely. This is where the Count-Min Sketch comes in. Instead of storing every item, it uses a compact, probabilistic data structure to provide fast, approximate answers to frequency queries. While you sacrifice some accuracy, you gain dramatic improvements in speed and memory efficiency — making the Count-Min Sketch a powerful tool for streaming analytics.
1234567891011121314151617181920212223242526272829303132333435import hashlib class CountMinSketch: def __init__(self, width, depth): self.width = width self.depth = depth self.table = [[0] * width for _ in range(depth)] self.seeds = [i * 11 + 17 for i in range(depth)] def _hash(self, item, seed): # Use hashlib for reproducible hash values h = hashlib.md5((str(item) + str(seed)).encode()) return int(h.hexdigest(), 16) % self.width def update(self, item, count=1): for i in range(self.depth): idx = self._hash(item, self.seeds[i]) self.table[i][idx] += count def query(self, item): min_count = float('inf') for i in range(self.depth): idx = self._hash(item, self.seeds[i]) min_count = min(min_count, self.table[i][idx]) return min_count # Example usage: cms = CountMinSketch(width=100, depth=5) cms.update("apple") cms.update("banana", 3) cms.update("apple", 2) print("apple:", cms.query("apple")) # Output: 3 print("banana:", cms.query("banana")) # Output: 3 print("grape:", cms.query("grape")) # Output: 0
The Count-Min Sketch cleverly balances accuracy, memory usage, and speed. Its error is always one-sided: it may overestimate, but never underestimate, the true frequency of any item. The error margin and probability of exceeding it are determined by the width and depth of the sketch — wider tables reduce the maximum error, while deeper tables reduce the probability of a large error. This makes it ideal for high-speed environments like network monitoring, where you need to spot heavy hitters or detect anomalies without storing all data. In practice, you might use Count-Min Sketch to rank trending topics, throttle abusive users, or summarize massive logs for fast analytics. The trade-off is that you cannot retrieve the exact list of items or their true counts, but you can answer "how frequent is X?" for any X, with a guaranteed error bound and a tiny memory footprint.
The overestimation property of Count-Min Sketch means that frequency queries will never return a value less than the true count, but may return a value that is higher. This is due to hash collisions: multiple items may map to the same cell, causing their counts to be combined. The implication is that you can safely use the sketch to filter out items below a threshold (since you'll never miss a true heavy hitter), but you may need further checks to confirm high counts.
Дякуємо за ваш відгук!