The Inexact Design of Probabilistic Data Structures

The Bloom filter.

In the realm of computer science, there exists a plethora of data structures that suit a variety of needs. Some data structures, like the Red–Black Tree, AVL Tree or B-Tree are optimized for searching elements due to their unique organization and balancing techniques. Other data structures, such as Queues, Stacks, and Heaps, excel in storing elements in a particular order, facilitating efficient access and modification in various scenarios.

In truth, the landscape of data structures and their variations is extraordinarily vast. However, there is one particular class of data structures that is intrinsically different from the rest. These data structures are known as Probabilistic Data Structures, and they serve important roles in the domain of Database Architectures.

Exactness

The data structures I listed above (trees, queues and so on) have one thing in common: they are exact. What does it mean for a data structure to be exact? Consider the following array:

An array with random numbers.

If we were to ask for the length of the array, the array would reply with 5. It would have no reason to fail; the exact number of elements is something known to the array.

That is the case for most data structures—they are built on exactness. Whether you query the length of an array or request the maximum value from a heap, these data structures are bound to deliver accurate results.

“Isn’t that obvious?”, one may ask—after all, what is the point of a data structure that cannot provide exact results? The answer to that question is simple: performance.

The cost of exactness

Ever thought about counting the number of distinct plays for a certain song? For example, consider the following stream of UUIDs (Universally Unique Identifiers) representing the account IDs of people who played a certain song:

A stream of UUIDs. The red color indicates that the UUID is a duplicate.

If we wanted to know how many distinct song plays there were, we would have to store all the UUIDs and then count the number of distinct elements. However, large music streaming services like Spotify and Apple Music have hundreds of millions of users, each playing tens of songs each day—let us say an average of 20. The storage and computing requirements for this approach would be impractical.

Let us estimate the storage requirements for this approach (a UUID takes up 16 bytes):

The storage requirements for counting all the different users playing songs for a day.
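For concreteness, here is the same back-of-the-envelope arithmetic as a short Python sketch; the 100 million users and 20 plays per user per day are illustrative assumptions, not real streaming-service figures:

users = 100_000_000        # assumed number of daily active users
plays_per_user = 20        # assumed average plays per user per day
uuid_size_bytes = 16       # a UUID is 128 bits = 16 bytes

daily_bytes = users * plays_per_user * uuid_size_bytes
print(f"{daily_bytes / 10**9:.0f} GB per day")  # prints "32 GB per day"

# The exact count itself would mean keeping every UUID around, e.g.
# distinct_plays = len(set(uuids_seen_today))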

32 GB of storage per day used up just for counting the number of distinct users playing songs is not ideal. Do we have an alternative?

An estimate is all you need

HyperLogLog is a probabilistic data structure that estimates the number of distinct elements in a stream of data. While it does not provide exact results, it is extremely efficient in terms of storage and latency.

If we were to use HyperLogLog for the song play count example, we would only need a few kilobytes of storage per song to get accurate results. Note that this is not storage per day, but storage for the entire history of the song.

Implementations in modern databases achieve even more impressive results. For sets of up to 18,446,744,073,709,551,616 (2^64) members, Redis has a “HyperLogLog implementation [that] uses up to 12 KB and provides a standard error of 0.81%”.
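If you want to try this yourself, the relevant Redis commands are PFADD and PFCOUNT; a minimal sketch using the redis-py client (the key name and listener ids are made up for the example) could look like this:

import redis

r = redis.Redis()  # assumes a Redis server running on localhost:6379

# Register some plays; duplicate ids are handled automatically.
r.pfadd("song:42:listeners", "uuid-of-listener-a", "uuid-of-listener-b")
r.pfadd("song:42:listeners", "uuid-of-listener-a")

# Estimated number of distinct listeners for this song.
print(r.pfcount("song:42:listeners"))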

That is the power of probabilistic data structures—they are able to provide estimates with a very small margin of error, while using a fraction of the storage required by exact data structures.

Understanding HyperLogLog

Let us consider a random 8-bit binary number:

An 8-bit binary number.

What is the probability that the first two bits are both 0?

The probability of the first bit being 0 is 50%. The same applies to the second bit—50% chance of being 0. Calculating the probability that both bits are 0 is straightforward: 50% × 50% = 0.5 × 0.5 = 0.25 = 25%. On average, 25% of 8-bit binary numbers will have the first two bits set to 0.

Actually, the probability of the first two bits being 0 is 25% for any binary number with at least two bits.

And what about the first three bits being 0? Just multiply the probability of the first two bits being 0 by the probability of the third bit being 0: 25% × 50% = 0.25 × 0.5 = 0.125 = 12.5%. On average, 12.5% of 8-bit binary numbers will have the first three bits set to 0. This easily extends to the following rule:

On average, one in every 2^n binary numbers will have the first n bits set to 0. For example, one in every sixteen 8-bit binary numbers will have the first 4 bits set to 0.
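If you want a quick empirical check of this rule, the following sketch samples a large number of random 8-bit values and measures how often the first 4 bits are all zero:

import random

bits = 8
n = 4                  # how many leading zeros we look for
trials = 100_000

# A bits-wide number has at least n leading zeros exactly when it is
# smaller than 2^(bits - n).
hits = sum(1 for _ in range(trials) if random.randrange(2 ** bits) < 2 ** (bits - n))
print(f"observed roughly 1 in {trials / hits:.1f}, expected 1 in {2 ** n}")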

We can use this to make a simple assumption: if we encounter an 8-bit binary number with the first n bits set to 0 in a stream of random numbers, we can assume that there are 2^n numbers in the stream. For example, if we encounter the following 8-bit binary number in a stream of random numbers:

An 8-bit binary number with the first two bits set to 0.

We can assume that there are 2^2 = 4 numbers in the stream. This is simply because, on average, one in every four 8-bit binary numbers will have the first 2 bits set to 0.

You can check this out yourself by running this Python code several times:

import random

bits = 8
max_number = 2 ** bits  # number of distinct 8-bit values (0..255)
zeros = 2

print(f"On average, 1 in {2 ** zeros} of these numbers \
should have {zeros} leading zeros.")

for _ in range(2 ** zeros):
    # randrange(max_number) returns a value in 0..max_number - 1,
    # i.e. a uniformly random 8-bit number
    number = random.randrange(max_number)
    print(f"{number:0{bits}b}")

Sample output:

On average, 1 in 4 of these numbers should have 2 leading zeros.
10010100
00010101
10001000
01011110

So, to reiterate the rule we arrived at: if we encounter a binary number with the first n bits set to 0 in a stream of random binary numbers, we can assume that there are 2^n binary numbers in the stream.

A simple estimate

Let us apply what we discovered earlier to our song play count example with UUIDs and make a simple estimate of the number of distinct UUIDs. Unfortunately, we cannot simply take our UUIDs from above and look at the first n bits of each UUID, because UUIDs are not random. However, we can use a technique called hashing to transform each UUID into a random binary number.

Hashing algorithms such as SHA-3 are designed to produce wildly different outputs even when the inputs differ by a single bit. In practice, this means we can treat the output of a hashing algorithm as a random number.
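As a small sketch of this idea, here is one way to hash a string (such as a UUID) with SHA3-256 from Python's standard library and count the leading zero bits of the resulting 256-bit number:

import hashlib
import uuid

def leading_zero_bits(value: str, hash_bits: int = 256) -> int:
    """Hash a string with SHA3-256 and count the leading zero bits."""
    digest = hashlib.sha3_256(value.encode()).digest()
    as_int = int.from_bytes(digest, "big")
    if as_int == 0:  # astronomically unlikely, but avoids an edge case
        return hash_bits
    return hash_bits - as_int.bit_length()

print(leading_zero_bits(str(uuid.uuid4())))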

Let us hash the initial stream of UUIDs from above and store the maximum number of leading zeros we encounter:

Our example stream of UUIDs. Red indicates that the UUID is a duplicate.

Hashing each UUID with SHA3-256, we get 256-bit binary numbers:

That is a lot of hexadecimal digits, but I want you to see actual hashes. The red color indicates that the UUID is a duplicate.

Now, let us check the number of leading zeros we encounter for each hashed UUID:

Teal bits are leading zeros.

As you can see, the maximum number of leading zeros among our hashed UUIDs is 2. We can assume that there are 2^2 = 4 distinct UUIDs in the stream. This is simply because, on average, one in every four 256-bit binary numbers will have the first 2 bits set to 0. In reality, there are 7 distinct UUIDs.

We got an approximation that is not perfect, but it is still better than nothing. Note that the main issue is that we are basing the estimate on just a few numbers. If we were to use a larger stream, we would get a better estimate.
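Putting the pieces together, a minimal sketch of this simple estimator (run here on freshly generated, made-up UUIDs rather than the exact ones from the figures) could look like this:

import hashlib
import uuid

def estimate_distinct(stream):
    """Estimate the number of distinct elements as 2^(max leading zeros)."""
    max_zeros = 0
    for item in stream:
        digest = hashlib.sha3_256(item.encode()).digest()
        value = int.from_bytes(digest, "big")
        zeros = 256 - value.bit_length() if value else 256
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros

# Seven distinct listeners, some of whom play the song more than once.
listeners = [str(uuid.uuid4()) for _ in range(7)]
plays = listeners + [listeners[0], listeners[3], listeners[3]]
print(estimate_distinct(plays))  # a rough power-of-two guess, not exactly 7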

The real version of HyperLogLog

While the core idea of HyperLogLog is to count the maximum number of leading zeros in a stream of hashed elements, there are a few things HyperLogLog does to improve the accuracy of the estimate:

First, observing one single stream of data is bound to give us a bad estimate, mostly because we rely too much on a single result. If we get unlucky and somehow encounter a number with lots of leading zeros, our estimate will be way off. To address this, we can divide the stream into multiple substreams and calculate the maximum number of leading zeros for each. We can then combine the per-substream results with a harmonic mean to get a better estimate, amortizing the effect of outliers (a rough sketch of this follows after the next paragraph).

Second, some math-heavy optimizations yield even better estimates. I will not go into them, but you can read more about them here.
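As a rough sketch of the first improvement, here is a toy version that uses the first few bits of each hash to pick a bucket and combines the per-bucket estimates with a harmonic mean; the bias-correction constant and the small- and large-range corrections of the real algorithm are deliberately left out:

import hashlib

def toy_hyperloglog(stream, b=4):
    """Toy HyperLogLog-style estimate using 2^b buckets."""
    m = 2 ** b
    registers = [0] * m
    for item in stream:
        value = int.from_bytes(hashlib.sha3_256(item.encode()).digest(), "big")
        bucket = value >> (256 - b)              # first b bits pick a bucket
        rest = value & ((1 << (256 - b)) - 1)    # remaining 256 - b bits
        zeros = (256 - b) - rest.bit_length() if rest else (256 - b)
        registers[bucket] = max(registers[bucket], zeros)
    # Harmonic mean of the per-bucket estimates 2^zeros, scaled by the
    # number of buckets.
    harmonic_mean = m / sum(2 ** -z for z in registers)
    return m * harmonic_mean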

Conclusion

Although this is a simplified example of a probabilistic data structure, it is enough to show the power of these data structures. In this example we only talked about HyperLogLog, but there are several other probabilistic data structures that are used in databases, such as Bloom Filters and Count-Min Sketches. I recommend you read more about them if you are interested.

Probabilistic data structures are a powerful tool in the arsenal of a database architect. They allow us to estimate the number of distinct elements in a stream of data with a very small margin of error, while using a fraction of the storage required by exact data structures.