A bloom filter is a space-efficient data structure that lets you quickly check whether or not an item is in a set.
The tradeoff for that space efficiency is that it's probabilistic: sometimes instead of giving you concrete answers it just says "probably."
When you look up an item in a bloom filter, the possible answers are:
- It's definitely not in the set. This is a true negative.
- It might be in the set. This could be a false positive, or it could be a true positive.
- Space-efficient. Bloom filters take up space, regardless of the number of items inserted. (But, their accuracy goes down as more elements are added.)
- Fast. Insert and lookup operations are both time.
- Probabilistic. Bloom filters can only definitively identify true negatives. They cannot identify true positives. If a bloom filter says an item is present, that item might actually be present (a true positive) or it might not (a false positive).
- Limited Interface. Bloom filters only support the insert and lookup operations. You can't iterate through the items in the set or delete items.
Under the hood, a bloom filter is just a bitmap. Initially, all the bits are set to 0.
Fixed size is crucial here. Remember, bloom filters require space.
Let's add some cake flavors to our bloom filter, starting with "chocolate." To do this:
- Hash the string ("chocolate")
- Mod the result by the length of our bitmap
- Set that bit to 1
Same thing for adding "vanilla" and "red velvet":
To check if a word is in our bloom filter we:
- Hash it
- Mod the result
- Check if the corresponding bit is 0 or 1
If the bit is 0, then that word definitely isn't in our set. If it were, we would have set that bit to 1.
But if the bit is 1, then that word might be in our set. But it could be a false positive.
In our example, checking for "vanilla" gives us a true positive—we did put that in our bloom filter earlier. But checking for "carrot" gives us a false positive—we set that bit to 1 for "red velvet".
This false positive isn't a bug; it's a tradeoff we've made to keep checks fast. Bloom filters are only appropriate when it's feasible to handle false positives.
When to use bloom filters
Bloom filters are a great first layer of filtering, since they don't require much space and can fit in fast storage, like RAM. Consider using them anywhere where knowing if something is definitely not present or possibly present would be helpful.
One common use is to eliminate unnecessary accesses to slower storage / expensive lookups.
For instance, say we wanted to query a large database stored on a rotating hard drive (slow to read from). And suppose the thing we're querying for has a good chance of not being present at all. Before querying the disk, we could check for the record in a bloom filter; if the bloom filter says the record definitely isn't present, then we don't need to touch the slow disk at all.
As the number of items inserted into a bloom filter increases, so does the likelihood of a false positive. That's because over time, we'll end up setting most of the bits to 1, instead of 0.
The larger the bitmap, the less likely false positives are. When using a bloom filter, it's helpful to know ahead of time roughly how many elements will be in your set. That way you can make the bitmap large enough to avoid a high false positive rate.
Multiple Hash Functions
Another way to avoid a high false positive rate is to use multiple hash functions.
In the examples above we only used one hash function, checking one bit in the bitmap for each lookup.
When using multiple hash functions:
- To add a new item: generate a bitmap index with each hash function, and set all of those bits to 1.
- To check if an item is present: just check if any of the bits checked are still 0. If so, the item is definitely not present.