This post is part of our ongoing series on streaming SQL engines and advanced join algorithms. In previous posts, we discussed techniques like The Sliding Window Hash Join Algorithm: Efficiently Joining Infinite Streams with Order Preservation and General-purpose Stream Joins via Pruning Symmetric Hash Joins and Running Windowing Queries in Stream Processing that serve as building blocks. The concepts discussed in this series are implemented in Apache Arrow/DataFusion, which has a bunch of other cool features we haven’t talked about yet.

The Count-Min Sketch (CMS) is a probabilistic data structure that can be used to estimate the frequency of items in a stream of data. It is particularly useful when the size of the data stream is too large to store in memory, or when we want to perform frequent updates to the data stream in real-time.

To understand how the CMS works, it's helpful to first understand some basic concepts:

**Hash functions**are functions that take an input (such as a string or a number) and produce a fixed-size output (called a "hash value") in a way that is difficult to predict or reverse. Hash functions are often used to store and retrieve data efficiently, as they allow us to map data items to indices in an array without having to store the data itself.**Probabilistic data structures**are data structures that use randomness to achieve their goals with a certain probability. This means that the results of probabilistic data structures are not always accurate, but the probability of error can be controlled by the user.

The basic idea behind the CMS is to use hash functions to map items in the stream to a fixed-size array, and then use this array to store an estimate of the frequency of each item. The CMS maintains a two-dimensional array of counters, where each row corresponds to a hash function and each column corresponds to a hash value. When an item is encountered in the stream, it is hashed using each of the hash functions and the corresponding counters are incremented.

Let’s figure it out with a simple example. First, we'll initialize the table with all zeros:

To update the table with this data, we'll need to increment the count at the positions in the table that are mapped to by each of our hash functions for each item in the stream.

- hash1("apple"): 3
- hash2("apple"): 1
- hash3("apple"): 4

So, we'll increment the count at these positions in the table:

- hash1("banana"): 1
- hash2("banana"): 7
- hash3("banana"): 3

So, we'll increment the count at these positions in the table:

- hash1("cherry"): 0
- hash2("cherry"): 1
- hash3("cherry"): 6

The final table will look like this:

The positions in the table that are mapped to by each of our hash functions for "apple" are:

- hash1("apple"): 3
- hash2("apple"): 1
- hash3("apple"): 4

And we get [1, 2, 1] from each row. To estimate the frequency of an item, we simply take the minimum value among the counters corresponding to that item across all the rows. This is because the hash functions are designed to evenly distribute items across the array, so the minimum count is likely to be the most accurate estimate of the true frequency.

*Note that there is some ambiguity on how to calculate the these values. Check* this post for an in-depth discussion.

Thus, we can add the utility methods

Here, we use the **xxHash** algorithm in the *hash_once* function to produce hash values for the items being added to the CMS. The algorithm provides:

**High speed**: The*xxHash*algorithm is known for being very fast and efficient, making it a good choice for use in streaming applications where performance is important.**Reasonable collision resistance**: The*xxHash*algorithm doesn’t just produce hash values with a low probability of collision; it is also computationally somewhat difficult to find two inputs generating the same output. This property doesn’t matter for our use case (it is important in security use cases for cryptographic hash functions, which**xxHash**is not), but it increases our confidence that two distinct items are unlikely to get the same hash value. This is important for the CMS, as collisions can lead to incorrect frequency estimates.**Quality of estimates**: The accuracy of the estimates produced by the CMS depends, in part, on the quality of the hash function being used. The*xxHash*algorithm is known to produce high-quality hash values (in a distributional sense), which should benefit the estimation accuracy of the CMS.

Specifically, we used the **xxh3** flavor of **xxHash** here. The **xxh3** algorithm improves on the vanilla *xxHash* algorithm by leveraging various low-level optimizations while producing the final hash value. It is designed to be faster and produces higher-quality hash values than the original algorithm, particularly for larger input data.

Here are a few other hash functions that you might consider using in your CMS implementation, along with some pros and cons of each:

**MurmurHash**:*MurmurHash*is a reasonably fast, non-cryptographic hash function that is widely used in data structures such as hash tables and bloom filters. It is known for being fast and producing reasonable-quality hash values, making it a good choice for use in the CMS. However, its mixing quality is not as high as some other hash functions, so you may see a higher rate of collisions with*MurmurHash*compared to other options.**FNV (Fowler-Noll-Vo) hash**: The*FNV*hash is another non-cryptographic hash function that is sometimes used in data structures. It is known for its simplicity and reasonable mixing properties, but it may not be as fast as some other hash functions.

The benchmark section of the xxHash source repository is a good source for many other candidates.

Ultimately, the choice of which hash function to use in the implementation will depend on the specific needs and constraints. Factors to consider might include the speed of the hash function, the collision resistance of the hash values it produces, and the quality of the hash values.

In this article, we have introduced the count-min sketch, a data structure that allows you to estimate the frequency of items in a stream. The count-min sketch is a probabilistic data structure that uses multiple hash functions and a small array of counters to estimate the frequencies of items with a certain error rate.

The count-min sketch is a useful tool for processing large streams of data and estimating the frequencies of items. It has many applications, including counting the number of unique visitors to a website, tracking the popularity of items in a recommendation system, and detecting anomalies in network traffic.

I hope you have found this article helpful and that you have a better understanding of the count-min sketch and how it works. Thank you for reading!

*You can see source code from **here**.*

Get early access and pioneer

AI-native data infrastructure.

AI-native data infrastructure.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

rocket_launch

Get Early Access!