Probabilistic Data Structures — Deep Dive

Bloom Filters

A data structure that sometimes lies — and why that's perfectly fine. A complete guide for working engineers.


The Core Idea

Suppose you're building a spell-checker holding 250,000 words in memory, a web browser checking every URL against a million known malicious links, or a database engine deciding whether a disk block might contain a key before seeking it. All three share a shape: a massive membership test where occasional false alarms are tolerable but missing a hit is not.

A bloom filter was designed for exactly this. It answers: "Is this item definitely not in the set?" or "Is this item possibly in the set?" It never lies about the first. It occasionally lies about the second.

A bloom filter has exactly two possible answers: "Definitely not present" (guaranteed correct — no false negatives) and "Maybe present" (probably correct, but occasionally wrong — a false positive). It never says "definitely present".

Before diving in, you should be comfortable with one concept: hash functions. Think of them as a deterministic black box — put any value in, get a fixed-size integer out. Same input always produces the same output. Different inputs scatter unpredictably across the output space.

The Bit Array

A bloom filter is, at its most basic, just an array of bits. When created, every bit is 0. In memory this is a single contiguous Uint8Array of ⌈m/8⌉ bytes. A 10-million-bit filter fits in 1.25 MB. That density is the whole point.

✦ 32-bit bloom filter — all zeros, waiting

We use 32 bits throughout for visual clarity. Real filters might use 10⁷–10⁹ bits — the mechanics are identical.

Adding an Item

To add an item, we run it through k hash functions. Each function returns a number; we take it modulo m (the filter size) to get a bit position, then set that bit to 1. With k=3, one item sets three bits (or fewer if two hash functions collide on the same position).
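The add path is a few lines of code. A minimal sketch, using seeded FNV-1a as a stand-in for a production hash (hash choice is discussed later — everything here besides the bit-twiddling is illustrative):

```javascript
// Seeded FNV-1a — a simple stand-in hash for illustration only.
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const m = 32;                                   // filter size in bits
const k = 3;                                    // hash functions per item
const bits = new Uint8Array(Math.ceil(m / 8));  // ⌈m/8⌉ bytes, all zero

function add(item) {
  for (let i = 0; i < k; i++) {
    const pos = fnv1a(item, i) % m;             // i-th bit position
    bits[pos >> 3] |= 1 << (pos & 7);           // set it
  }
}

add("jasmine");   // sets at most k bits (fewer if two hashes collide)
```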

Watch what happens with the word "jasmine", using the Kirsch-Mitzenmacher double-hashing technique explained later:

✦ Hashing "jasmine" through k = 3 derived positions

Notice we never stored the string "jasmine". It's gone. What remains is only a pattern of lit bits — a fingerprint. The bloom filter remembers the shape of things, not the things themselves.

Checking for Membership

To check if an item is present, we run it through the same k hash functions to get the same k positions, then ask: are all k bits currently 1?

  • Any bit is 0 → "Definitely not present." This is unconditional — a genuine member always has all its bits set, so a zero bit proves absence.
  • All k bits are 1 → "Maybe present." Those bits could have been set by completely different items. The filter cannot tell whether this is a true positive or a false positive — both cases produce the exact same output.
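In code, the check mirrors the add path and short-circuits on the first zero bit. A sketch, again using seeded FNV-1a as an illustrative hash:

```javascript
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const m = 32, k = 3;
const bits = new Uint8Array(Math.ceil(m / 8));

function add(item) {
  for (let i = 0; i < k; i++) {
    const pos = fnv1a(item, i) % m;
    bits[pos >> 3] |= 1 << (pos & 7);
  }
}

// false is a guarantee; true only ever means "maybe present".
function has(item) {
  for (let i = 0; i < k; i++) {
    const pos = fnv1a(item, i) % m;
    if ((bits[pos >> 3] & (1 << (pos & 7))) === 0) return false;
  }
  return true;
}

add("fox");
has("fox");       // always true — no false negatives
```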
✦ Interactive Bloom Filter — add words, discover false positives

The filter only ever answers "definitely not present" or "maybe present". The demo reveals whether each "maybe present" was a true or false positive by tracking inserted items separately — a real bloom filter never stores the items it was given.

Add some words, then Check to test membership.

Add 8–10 words, then check for something you never added. With only 32 bits, false positives appear readily. A properly sized production filter keeps FPR well below 1%.

Why False Positives Happen

Imagine you've added "fox", "owl", and "elk". Their combined hashes set bits at positions 3, 8, 11, 14, 19, 22, 27. Now you check "gnu" — never added. Its three hash positions happen to be 8, 19, 27. All three are set. False positive.

This is not a failure mode. It's a known, quantifiable, controllable property. The bits set by other animals accidentally form the pattern "gnu" expects.

False negatives are impossible. If you added something, its bits are permanently set. A check for a genuine member always succeeds. The lies only go one way.

This asymmetry is the design's genius. Chrome's safe-browsing false positive just means one extra server-side verification call. A false negative — missing a genuine malicious URL — would be a real security failure. Bloom filters fail in exactly the safe direction.

✦ Revealing true vs. false positives via ground-truth comparison

Important: The filter itself cannot distinguish a true positive from a false positive — it only ever says "maybe present" or "definitely not present". This demo reveals the difference by separately tracking what was added (something a real bloom filter never does). The labels below show you what the filter says, then what the ground truth actually is.

Try: add "cat", "dog", "bird" → then check "fish".

The Mathematics, Derived

The FPR formula isn't magic — it follows from basic probability. Let's build it from the ground up so you understand every term.

Step 1 — Probability a bit remains 0

After inserting n items using k hash functions into a filter of m bits, what is the probability that a specific bit is still 0?

Each of the k·n hash operations independently selects one of m positions. The probability of missing a given bit in one operation is (1 − 1/m). All k·n operations are independent:

Exact probability — bit stays zero P(bit = 0) = (1 − 1/m) ^ (k·n) Every hash independently misses the bit with probability (1 − 1/m). Raise to k·n for all operations.

Step 2 — Approximation for large m

Using the standard limit (1 − 1/m)^m → e^(−1) as m → ∞, rewrite the exact expression as [(1 − 1/m)^m]^(k·n/m):

Approximation (used in all practical code) P(bit = 0) ≈ e ^ (−k·n / m) Excellent for m ≥ 1000. For our 32-bit demo the approximation has ~2% error — tolerable for illustration.

Step 3 — False positive probability

A false positive occurs when all k positions checked for a never-inserted item are all 1. Each bit is 1 with probability (1 − P(bit=0)). Assuming k positions are approximately independent:
The fundamental bloom filter equation FPR ≈ (1 − e ^ (−k·n / m)) ^ k Three levers: n (items inserted), m (bit array size), k (hash function count). This is the equation your library uses internally.

Step 4 — Optimal k for given m and n

For fixed m and n, minimize FPR over k. Let p = e^(−kn/m), so FPR = (1−p)^k. Differentiating with respect to k and setting to zero gives:

Optimal hash function count k* = (m / n) · ln(2) ≈ 0.6931 · (m / n) At k*, exactly half the bits are set on average — P(bit=1) = 0.5. This is an elegant result: optimal performance happens at 50% fill.

Step 5 — Optimal m for given n and target FPR p

Substituting k* back and solving for m:

Bits needed for n items at false positive rate p m = ⌈ −n · ln(p) / (ln 2)² ⌉ The (ln 2)² ≈ 0.4805 denominator is why bloom filters are so compact. ln(p) is negative for p < 1, making m positive. Every order-of-magnitude improvement in FPR costs ~4.8 bits per item.
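The two closed forms translate directly into code. A sketch in plain JS, rounding k* to the nearest integer as most libraries do:

```javascript
// m = ⌈−n·ln(p) / (ln 2)²⌉ — bits needed for n items at FPR p
function optimalM(n, p) {
  return Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
}

// k* = (m/n)·ln 2, rounded to an integer number of hash functions
function optimalK(m, n) {
  return Math.max(1, Math.round((m / n) * Math.LN2));
}

// FPR ≈ (1 − e^(−k·n/m))^k — the fundamental equation from Step 3
function theoreticalFPR(m, n, k) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

const n = 1_000_000, p = 0.01;
const m = optimalM(n, p);        // 9,585,059 bits ≈ 1.14 MB
const k = optimalK(m, n);        // 7
theoreticalFPR(m, n, k);         // ≈ 0.01 — the target, recovered
```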
✦ FPR as fill rate increases (k = 3) — drag the slider
At 50% fill with k = 3, FPR = 0.5³ = 12.5%.
✦ FPR vs. fill rate for different values of k

Higher k gives lower FPR at low fill rates, but each insert sets more bits. The optimal k* balances both — and at k*, exactly half the bits are set.

Choosing Hash Functions

The quality of a bloom filter depends heavily on its hash functions. Clustering or poor distribution means your actual FPR is worse than the formula predicts. Two properties matter above all: speed and uniformity.

The Kirsch-Mitzenmacher double-hashing trick

Running k separate hash algorithms per item is expensive. In 2006 Kirsch and Mitzenmacher proved you only need two independent base hashes — every other position can be a linear combination of them with negligible additional error. Hashing cost drops from O(k) to O(1) real hash computations.

✦ Deriving k = 5 positions from just two hashes

g_i(x) = (h1(x) + i·h2(x)) mod m. Each position is just an arithmetic step from the previous one. Think of it as walking around the bit array in steps of size h2, starting at h1.

The two seeds need to be independent — typically two different MurmurHash3 invocations with different seed values. Forcing h2 to be odd (e.g. h2 | 1) keeps it coprime to most array sizes m, which guarantees the derived positions don't degenerate into a small cycle.
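The derivation step fits in one loop. A sketch — given two 32-bit base hashes, all k positions fall out of simple arithmetic (the inputs here are illustrative small numbers):

```javascript
// g_i(x) = (h1 + i·h2) mod m, with h2 forced odd.
function derivePositions(h1, h2, k, m) {
  h2 |= 1;                       // odd step → coprime with power-of-two m
  const out = [];
  for (let i = 0; i < k; i++) {
    out.push((h1 + i * h2) % m); // walk the array in strides of h2
  }
  return out;
}

derivePositions(7, 10, 5, 32);   // h2 becomes 11 → [7, 18, 29, 8, 19]
```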

Hash function comparison

| Function | Speed | Distribution | Notes |
|---|---|---|---|
| MurmurHash3 | Very fast | Excellent | Standard choice. Cassandra, Guava BloomFilter, Redis. |
| xxHash64 | Fastest known | Excellent | Modern choice. Best on 64-bit. Recommended for new code. |
| FNV-1a | Moderate | Good | Simple to implement. Weaker avalanche properties. |
| SHA-256 | 10–25× slower | Excellent | Cryptographically secure. Wasteful here — the security adds nothing. |
| CRC32 | Fast (HW) | Poor | Designed for error detection, not uniform distribution. |
Never use SHA or MD5 for a bloom filter. They'd work but are an order of magnitude slower with zero benefit — the security properties they provide are useless here. MurmurHash3 or xxHash are the right tools.

What a good hash function actually does

The job of a hash function here is one thing: take any input and scatter it uniformly across [0, 2³²). Two equally-good hashes can produce wildly different output bits for inputs that differ by a single character. This property is called avalanche.

✦ Avalanche — flip one bit of input, watch output transform

Type two similar strings. A good hash flips ~50% of output bits when one input bit changes — that's what makes positions in the filter look independent even for similar items.

Implementations of MurmurHash3, xxHash, and the full BloomFilter class are in the companion implementation page — copy-pasteable, dependency-free.

Tuning for Production

Picking m and k is straightforward. The harder engineering questions are: how wrong can I afford to be?, how many items will I actually insert?, and what happens if my n estimate is wrong?

Parameter calculator

✦ Optimal Bloom Filter Parameters

Compare: storing 1M strings (~20 chars each) in a plain Set ≈ 20 MB. The bloom filter above is orders of magnitude smaller for the same membership test.

FPR cost reference

| FPR | Bits per item | k* | Use case |
|---|---|---|---|
| 1% (0.01) | ~9.6 | 7 | General-purpose. Browser URL checking, cache warming. |
| 0.1% (0.001) | ~14.4 | 10 | Database index pre-filter (Cassandra, RocksDB). Eliminates 99.9% of unnecessary disk reads. |
| 0.01% (0.0001) | ~19.2 | 13 | Security checks. CDN single-hit prevention. |
| 0.001% (0.00001) | ~24 | 17 | Deduplication pipelines where extra lookups are very expensive. |

Every order-of-magnitude improvement in FPR costs ln(10) / (ln 2)² ≈ 4.79 bits per item. This is a fixed price you can budget for.

Bit manipulation — the only data structure you need

The entire data structure is a single byte array. Setting bit i means: figure out which byte holds it (i >> 3), figure out which bit within that byte (i & 7), and OR in the right mask. That's two integer ops and one memory write — guaranteed O(1) and cache-line friendly.
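The byte/offset arithmetic in code — two helpers cover the whole storage layer:

```javascript
const bytes = new Uint8Array(4);   // a 32-bit filter: ⌈32/8⌉ = 4 bytes

function setBit(i) {
  bytes[i >> 3] |= 1 << (i & 7);   // byte index = i/8, offset = i mod 8
}

function getBit(i) {
  return (bytes[i >> 3] >> (i & 7)) & 1;
}

setBit(13);     // byte 1 (13 >> 3), bit 5 (13 & 7) → bytes[1] === 0b00100000
getBit(13);     // 1
getBit(12);     // 0
```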

✦ Hover any bit position to see its byte index and bit offset

For very large filters (hundreds of megabits), align each partition to a 64-byte cache line (512 bits) so a single hash position fits inside one cache fetch. This minimizes cache thrash when evaluating multiple hash functions on the same item.

✦ Memory layout — 32 bits packed into 4 bytes (after adding "jasmine", "cedar", "river")

Each row = 1 byte. Green = set bits. Binary representation on the right shows how bits pack into the underlying byte array — exactly what you'd see in a hex editor.

The API Surface

Stripped to its essentials, a production bloom filter has only six operations. Every library you'll meet — Guava's BloomFilter, the bloom-filters npm package, Cassandra's internal filters — exposes some shape of these six.

new BloomFilter(n, p)
Compute m and k* from the formulas, allocate a zeroed Uint8Array of ⌈m/8⌉ bytes. That's the entire constructor.
.add(item)
Hash → k positions (Kirsch-Mitzenmacher) → set k bits to 1. Idempotent: re-adding has no effect.
.has(item) → boolean
Hash → k positions → return true iff every bit is 1. False = "definitely not". True = "maybe".
.fillRate() → number
Count set bits / m. Crucial diagnostic — at fill rate ≈ 0.5 you're at the optimal operating point. Above 0.7 your FPR is rising fast.
.merge(other)
Bitwise OR every byte. Result represents the union of both sets. Both filters must share the same m and k.
.toBase64()  /  .fromBase64(b64, n, p)
Wire format. Serialize to a string, store in Redis, embed in a JWT, ship over HTTP. Just bytes — utterly portable.

Empirical verification

Theory says FPR should be (1 − e^(−k·n/m))^k. Reality should agree. The practical test: insert n items, then query 100,000 items you definitely didn't insert and count how many "maybe" responses you get. Press the button below to run a live experiment in your browser.

✦ Theoretical FPR vs. measured FPR — runs in your browser

Click "Run experiment" — about 1 second of pure JS. Result typically lands within ±5% of theoretical for a well-sized filter.

Merging and distributing filters

Because a bloom filter is just bytes, it's trivially portable. Bitwise OR merges two compatible filters — the result is the union of both sets. Composability across nodes is one of the things bloom filters do better than nearly any other data structure: each shard maintains its own filter, and a query coordinator merges them all to skip shards with no chance of a match.
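The merge itself is one loop over raw bytes. A sketch (both filters must share m and k; the tiny arrays here are illustrative):

```javascript
// Union of two compatible filters: bitwise OR, byte by byte.
function mergeFilters(a, b) {
  if (a.length !== b.length) throw new Error("filters must share m and k");
  const merged = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) merged[i] = a[i] | b[i];
  return merged;
}

const shardA = new Uint8Array([0b00000011, 0]);
const shardB = new Uint8Array([0b00000101, 0]);
mergeFilters(shardA, shardB)[0];   // 0b00000111 — every bit from either shard
```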

✦ Two filters merged with bitwise OR

The merged filter contains every item from both — but you cannot recover which filter contributed each bit. That's perfect for distributed unions; useless for "who knows what" forensics.

Full implementations — BloomFilter, CountingFilter, ScalableFilter, CuckooFilter, RotatingFilter — live in the implementation companion page, with serialization, tests, and benchmarks.

Variants

The standard bloom filter has two limitations: no deletion, and must be sized upfront. Several well-studied variants address these.

Counting
Scalable
Partitioned
Cuckoo Filter

Counting Bloom Filter

Replace each bit with a small integer counter — typically 4 bits (values 0–15). Adding increments k counters. Deleting decrements them. A check returns "maybe present" if all k counters are > 0, or "definitely not present" if any counter is 0.

Deletion has a critical hazard: you must only decrement counters for items you know were actually added. Using check() to decide whether to delete is wrong — a false positive would make check() return "maybe present" for an item never inserted, and decrementing those counters would corrupt the filter and introduce false negatives. Always guard deletions with a separate confirmed-membership record.
✦ Counting filter — shade intensity shows counter value (0–4+)
Add a word, then delete it. Check will say "maybe present" or "definitely not present" — never "definitely present".

Memory cost: 4× over a standard filter. Overflow: counters max at 15 (sticky — no further increment) to prevent false negatives on decrement. In practice overflow is vanishingly rare: a counter would need the same bit set by 16 different items, probability <10⁻⁶ for typical workloads.

✦ How 4-bit counters pack — two per byte

Each byte stores two counters in its low and high nibbles. A 32-counter filter fits in 16 bytes — 4× a standard 32-bit bloom filter, exactly as the math says.
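The nibble packing in code — a sketch with sticky saturation at 15, as described above:

```javascript
// 32 four-bit counters packed two per byte: even index → low nibble,
// odd index → high nibble.
const counters = new Uint8Array(16);

function get(i) {
  const byte = counters[i >> 1];
  return i & 1 ? byte >> 4 : byte & 0x0f;
}

function increment(i) {
  if (get(i) === 15) return;                   // sticky — never overflow
  counters[i >> 1] += i & 1 ? 0x10 : 0x01;
}

function decrement(i) {
  const v = get(i);
  if (v === 0 || v === 15) return;             // saturated counters stay stuck
  counters[i >> 1] -= i & 1 ? 0x10 : 0x01;
}

increment(5); increment(5);   // counter 5 → 2 (high nibble of byte 2)
decrement(5);                 // counter 5 → 1
```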

Scalable Bloom Filter (SBF)

Maintains a list of standard filters, each larger than the last. When the current filter exceeds a fill threshold (typically 50%), a new larger filter is appended. Lookups check all sub-filters. Writes only go to the last filter.

Each successive filter targets a tighter FPR (multiplied by tightening ratio r < 1). The total FPR is bounded by a geometric series:

SBF total FPR bound FPR_total ≤ p₀ / (1 − r) With r = 0.9 and p₀ = 0.001: total FPR ≤ 0.01. The bound converges because r < 1.
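The bound is just the geometric series summed. A quick numeric check of the numbers quoted above:

```javascript
// Sub-filter i targets FPR p0·r^i; the union bound on the total is the
// geometric series p0·(1 + r + r² + …) = p0 / (1 − r), converging for r < 1.
function sbfTotalBound(p0, r) {
  return p0 / (1 - r);
}

sbfTotalBound(0.001, 0.9);   // ≈ 0.01 — the article's example bound
```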
✦ Scalable filter — watch sub-filters spawn as capacity fills

Geometric growth means you rarely have more than 5–6 sub-filters even after 50× the initial capacity. Lookup cost is O(num_filters × k) but remains fast in practice. The implementation page has a complete ScalableBloomFilter with growth and tightening configurable.

Partitioned Bloom Filter

Divides the m bits into k equal partitions of size m/k. Hash function i selects exactly one bit from partition i only — guaranteed no intra-item collisions between hash functions.

✦ Partitioned filter — each color = one partition (k = 3 × 16 bits)
Each hash function can only set bits within its own partition.

Benefits: exactly k bits are always set per item (never fewer due to intra-item collision); each partition can be analyzed independently. In practice FPR difference vs. a standard filter is negligible for large m, but cache-line alignment per partition can improve lookup throughput on modern hardware.

Cuckoo Filter

A different data structure solving the same problem but supporting deletion without the memory overhead of counting filters. Stores small fingerprints (8–16 bits) in a hash table using cuckoo hashing. Each item has two candidate bucket locations. On insertion, if both are full, an existing fingerprint is displaced ("kicked out") to its alternate location.

The critical invariant enabling deletion: alt(b, fp) = b XOR hash(fp). This lets you compute a fingerprint's alternate bucket without knowing the original item — so deletion works even without storing the original key.

| | Bloom Filter | Counting BF | Cuckoo Filter |
|---|---|---|---|
| Deletion | No | Yes | Yes |
| Memory @ 1% FPR | ~9.6 bits/item | ~38 bits/item | ~12 bits/item |
| Lookup speed | O(k) | O(k) | O(1) avg |
| Worst-case insert | O(k) | O(k) | O(n) — rare |
| Duplicate inserts | Idempotent | Counter increments | Limited (at most 2b copies per fingerprint) |
✦ XOR is its own inverse — the round-trip property

Step 1 (insert): bucket b XOR hash(fp) lands at bucket b'. Step 2 (round-trip): b' XOR the same hash(fp) returns to b. This involution is what makes deletion possible without ever storing the original key.
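The involution is easy to verify in code. A sketch with a power-of-two bucket count so the modulo is a bit mask (the fingerprint hash value here is illustrative):

```javascript
const NUM_BUCKETS = 256;               // power of two → modulo is a mask
const MASK = NUM_BUCKETS - 1;

// alt(b, fp) = (b XOR hash(fp)), masked into range. Needs only the
// fingerprint's hash — never the original key.
function altBucket(bucket, fpHash) {
  return (bucket ^ fpHash) & MASK;
}

const home = 42;
const fpHash = 0x5bd1e995;             // any fingerprint hash works
const partner = altBucket(home, fpHash);
altBucket(partner, fpHash) === home;   // true — XOR is its own inverse
```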

For systems requiring deletion, the Cuckoo filter is generally preferable to the Counting filter due to ~3× better space efficiency. A max-displacement limit (e.g., 500 kicks) triggers a full rehash if exceeded — rare with proper sizing but must be handled. Full code lives on the implementation page.

In Production Systems

The same pattern appears everywhere: bloom filters act as a cheap first-pass in front of an expensive operation. The expensive operation is only triggered on positives.

✦ The pipeline: Request → bloom check (~10–100 ns) → "definitely absent" skips the expensive op (99%+ of misses); "maybe present" proceeds to disk / network / DB.
Cassandra & RocksDB

Each SSTable has a bloom filter. A read checks the filter first. "Definitely not" skips the entire disk seek. At 0.1% FPR, 99.9% of unnecessary seeks are eliminated — the primary reason LSM-tree reads stay fast despite not rewriting data in-place.

Google Chrome Safe Browsing

Shipped ~650K malicious URLs as a ~1.5 MB binary blob downloaded periodically. Local filter ran on every URL visit in microseconds. Positives triggered a server-side hash lookup. Used until 2012 when the URL list outgrew the approach.

Akamai CDN

Prevents one-hit-wonder URLs from polluting edge caches. Two rotating filters: a URL must appear in two consecutive request windows before promotion to cache. Massively reduces cache churn for long-tail content.

Bitcoin SPV / BIP 37 cautionary

BIP 37 (2012) had lightweight clients send a bloom filter to full nodes, expecting the false positive rate to provide privacy. It didn't. Research showed clients with even modest address counts leaked their entire wallet. Bitcoin Core disabled BIP 37 by default in 0.19.0 (2019) — a useful lesson in why probabilistic privacy is hard.

PostgreSQL — bloom & BRIN

PostgreSQL ships a bloom index extension and BRIN bloom operator classes. They speed up equality lookups on tables with many low-selectivity columns by giving each block range a bloom filter — the planner can skip block ranges whose filter rejects the predicate.

Recommendation Engines

Track per-user seen content. A filter of ~100K items fits in ~200KB — viable in Redis, as a cookie payload, or in a JWT. No need to query a "seen items" table on every recommendation generation.

Rotating bloom filters for sliding windows

A standard production pattern for time-windowed deduplication. Two filters alternate active/passive on a schedule. Lookups check both. When the window expires, the passive filter is cleared and becomes the new active.
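A sketch of the rotation logic. The filter objects here are Set-backed stand-ins so the example runs on its own; in production each would be a real bloom filter whose clear() rezeroes its bit array:

```javascript
class RotatingFilter {
  constructor(makeFilter, windowMs) {
    this.active = makeFilter();
    this.passive = makeFilter();
    this.windowMs = windowMs;
    this.lastRotation = Date.now();
  }
  maybeRotate() {
    if (Date.now() - this.lastRotation >= this.windowMs) {
      // Old active becomes passive (still queried); old passive is
      // wiped and becomes the fresh active.
      [this.active, this.passive] = [this.passive, this.active];
      this.active.clear();
      this.lastRotation = Date.now();
    }
  }
  add(item) { this.maybeRotate(); this.active.add(item); }
  has(item) {
    this.maybeRotate();
    return this.active.has(item) || this.passive.has(item);
  }
}

// Set-backed stand-in filter, for the sketch only.
const makeFilter = () => {
  const s = new Set();
  return { add: x => s.add(x), has: x => s.has(x), clear: () => s.clear() };
};

const dedup = new RotatingFilter(makeFilter, 30_000);   // 30-second window
dedup.add("/long-tail/url");
dedup.has("/long-tail/url");   // true for at least one window, at most two
```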

✦ Watch the rotation — drag the time slider

Memory is always exactly 2× a single filter. Any item is "remembered" for at least 1 window and at most 2. Implementation in the companion page.

Distributed merge patterns

Because bloom filters are just bytes, they enable distributed patterns that are difficult with standard data structures:

  • Anti-entropy sync. Two nodes exchange their filters (bitwise OR = union). Each node identifies items the other might be missing without transmitting the full dataset.
  • Per-shard pre-routing. Each shard maintains a filter. A query coordinator checks all shard filters and routes to only the shards with potential matches — avoiding unnecessary cross-shard requests.
  • Gossip acceleration. Before gossiping state to a peer, check a filter of recently gossiped items to avoid redundant transmissions.

When Not to Use One

Bloom filters are a specialized tool. The right question is: does this specific problem have the shape they solve?

| Structure | FP possible | Delete | Memory | Choose when |
|---|---|---|---|---|
| Set / hash table | No | Yes | High | Set is small, or FP is unacceptable. |
| Sorted array + bisect | No | Costly | Low | Static set, read-heavy, memory constrained. |
| Bloom filter | Controlled | No | Very low | Large set, misses dominate, FP tolerable, no deletes. |
| Counting BF | Controlled | Yes | ~4× bloom | As above, but items must be removable. |
| Cuckoo filter | Controlled | Yes | ~1.2× bloom | As above with deletion, higher throughput needed. |
| HyperLogLog | N/A | N/A | Tiny | Cardinality estimation — a different problem. |

Situations where a bloom filter adds no value

  • The dataset is small (<10K items). A plain Set uses kilobytes with zero false positives.
  • False positives surface as user errors. If a FP means showing an error message rather than a silent extra lookup, you've traded correctness for memory.
  • Most lookups are hits. The bloom filter's value is entirely in the miss case. If your access patterns are mostly hits, the filter adds overhead without benefit. Profile first.
  • You need to enumerate members. A bloom filter has no iteration. You cannot reconstruct what was added.
  • Items are frequently deleted. Use a Counting or Cuckoo filter, or periodically rebuild the standard filter from a persistent store.
  • The protected operation is already fast. If the "expensive" operation is already a microsecond memory lookup, k hash function calls may cost more than they save.
The bloom filter's value proposition is entirely about the miss case. If most of your lookups are hits, or the operation being protected is already fast, the bloom filter adds complexity without meaningful benefit. Measure before adding probabilistic machinery.

Summary

1
It's a bit array + k hash functions
Add: set k bits. Check: test k bits. Everything else — space efficiency, false positive rates, variants — follows from this simplicity.
2
No false negatives — unconditionally
A "definitely not" is mathematically guaranteed by the structure. This is the bloom filter's core contract. Build your system around it.
3
FPR = (1 − e^(−kn/m))^k
The fundamental equation. Given n and target p, compute optimal m and k*. Two closed-form formulas. That's the entire design space.
4
Use MurmurHash3 with Kirsch-Mitzenmacher double hashing
Two seeded base hashes derive all k positions. Never SHA or MD5 — they're 10–25× slower with no benefit here.
5
The data structure is just a Uint8Array — exploit that
Serialize it, merge it with bitwise OR, ship it over a network, store it in Redis, embed it in a JWT. A filter's portability is as useful as its compactness.
6
Design for saturation from the start
Overestimate n, use Scalable filters, or plan for periodic recreation. A saturated filter silently returns true for everything — a hard-to-debug failure mode.
7
Know the variants
Counting for deletion. Scalable for unknown n. Cuckoo for deletion with lower memory. Rotating for time windows. Each solves a specific limitation.
8
The value lives in the miss case — profile first
If most lookups are hits, or the protected operation is fast, the bloom filter adds complexity without benefit. Measure access patterns before adding probabilistic machinery.