The Core Idea
Suppose you're building a spell-checker holding 250,000 words in memory, a web browser checking every URL against a million known malicious links, or a database engine deciding whether a disk block might contain a key before seeking it. All three share a shape: a massive membership test where occasional false alarms are tolerable but missing a hit is not.
A bloom filter was designed for exactly this. It answers: "Is this item definitely not in the set?" or "Is this item possibly in the set?" It never lies about the first. It occasionally lies about the second.
Before diving in, you should be comfortable with one concept: hash functions. Think of them as a deterministic black box — put any value in, get a fixed-size integer out. Same input always produces the same output. Different inputs scatter unpredictably across the output space.
The Bit Array
A bloom filter is, at its most basic, just an array of bits. When created, every bit is 0. In memory this is a single contiguous Uint8Array of ⌈m/8⌉ bytes. A 10-million-bit filter fits in 1.25 MB. That density is the whole point.
We use 32 bits throughout for visual clarity. Real filters might use 10⁷–10⁹ bits — the mechanics are identical.
Adding an Item
To add an item, we run it through k hash functions. Each function returns a number; we take it modulo m (the filter size) to get a bit position, then set that bit to 1. With k=3, one item sets three bits (or fewer if two hash functions collide on the same position).
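As a concrete sketch of the insert path (the fnv1a helper here is purely illustrative; a real filter would use MurmurHash3 or xxHash, covered below):

```ts
// Illustrative seeded 32-bit FNV-1a; stands in for a production-grade hash.
function fnv1a(str: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// k bit positions for an item, each reduced modulo the filter size m.
function positions(item: string, k: number, m: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < k; i++) out.push(fnv1a(item, i) % m);
  return out;
}

// Insert: set each position's bit to 1. The item itself is never stored.
function add(bits: Uint8Array, item: string, k: number, m: number): void {
  for (const pos of positions(item, k, m)) {
    bits[pos >> 3] |= 1 << (pos & 7);
  }
}
```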
Watch what happens with the word "jasmine", using the Kirsch-Mitzenmacher double-hashing technique explained later:
Notice we never stored the string "jasmine". It's gone. What remains is only a pattern of lit bits — a fingerprint. The bloom filter remembers the shape of things, not the things themselves.
Checking for Membership
To check if an item is present, we run it through the same k hash functions to get the same k positions, then ask: are all k bits currently 1?
- Any bit is 0 → "Definitely not present." This is unconditional — a genuine member always has all its bits set, so a zero bit proves absence.
- All k bits are 1 → "Maybe present." Those bits could have been set by completely different items. The filter cannot tell whether this is a true positive or a false positive — both cases produce the exact same output.
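A matching sketch of the check, reusing the positions() helper from the add snippet above:

```ts
// Query: "maybe present" only if every probed bit is 1; any 0 proves absence.
function has(bits: Uint8Array, item: string, k: number, m: number): boolean {
  return positions(item, k, m).every(
    (pos) => (bits[pos >> 3] & (1 << (pos & 7))) !== 0
  );
}
```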
The filter only ever answers "definitely not present" or "maybe present". The demo reveals whether each "maybe present" was a true or false positive by tracking inserted items separately — a real bloom filter never stores the items it was given.
Add 8–10 words, then check for something you never added. With only 32 bits, false positives appear readily. A properly sized production filter keeps FPR well below 1%.
Why False Positives Happen
Imagine you've added "fox", "owl", and "elk". Their combined hashes set bits at positions 3, 8, 11, 14, 19, 22, 27. Now you check "gnu" — never added. Its three hash positions happen to be 8, 19, 27. All three are set. False positive.
This is not a failure mode. It's a known, quantifiable, controllable property. The bits set by other animals accidentally form the pattern "gnu" expects.
This asymmetry is the design's genius. Chrome's safe-browsing false positive just means one extra server-side verification call. A false negative — missing a genuine malicious URL — would be a real security failure. Bloom filters fail in exactly the safe direction.
Important: The filter itself cannot distinguish a true positive from a false positive — it only ever says "maybe present" or "definitely not present". This demo reveals the difference by separately tracking what was added (something a real bloom filter never does). The labels below show you what the filter says, then what the ground truth actually is.
The Mathematics, Derived
The FPR formula isn't magic — it follows from basic probability. Let's build it from the ground up so you understand every term.
Step 1 — Probability a bit remains 0
Each of the k·n hash operations independently selects one of m positions. The probability of missing a given bit in one operation is (1 − 1/m). All k·n operations are independent:
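P(bit still 0 after n inserts) = (1 − 1/m)^(k·n)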
Step 2 — Approximation for large m
Using the standard calculus limit lim(m→∞) (1 − 1/m)^m = e^(−1), raising to the k·n/m power:
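P(bit still 0) ≈ e^(−k·n/m)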
Step 3 — False positive probability
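A false positive means all k probed bits happen to be 1. Each bit is 1 with probability 1 − e^(−kn/m), so, treating the bits as independent:

FPR ≈ (1 − e^(−kn/m))^k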
Step 4 — Optimal k for given m and n
For fixed m and n, minimize FPR over k. Let p = e^(−kn/m), so FPR = (1−p)^k. Differentiating with respect to k and setting to zero gives:
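k* = (m/n) · ln 2 ≈ 0.693 · m/n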
Step 5 — Optimal m for given n and target FPR p
Substituting k* back and solving for m:
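m = −n · ln p / (ln 2)² ≈ 1.44 · n · log₂(1/p)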
Higher k gives lower FPR at low fill rates, but each insert sets more bits. The optimal k* balances both — and at k*, the expected fill ratio is exactly one half: each bit ends up set with probability 1/2.
Choosing Hash Functions
The quality of a bloom filter depends heavily on its hash functions. Clustering or poor distribution means your actual FPR is worse than the formula predicts. Two properties matter above all: speed and uniformity.
The Kirsch-Mitzenmacher double-hashing trick
Running k separate hash algorithms per item is expensive. In 2006 Kirsch and Mitzenmacher proved you only need two independent base hashes — every other position can be a linear combination of them with negligible additional error. Hashing cost drops from O(k) to O(1) real hash computations.
gᵢ(x) = (h1(x) + i · h2(x)) mod m. Each position is just an arithmetic step from the previous one. Think of it as walking around the bit array in steps of size h2, starting at h1.
The two seeds need to be independent — typically two different MurmurHash3 invocations with different seed values. Forcing h2 to be odd (e.g. h2 | 1) keeps it coprime to most array sizes m, which guarantees the derived positions don't degenerate into a small cycle.
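A minimal sketch of that position generator; hash32 stands in for whichever 32-bit hash you use, and the two seed constants are arbitrary:

```ts
// Kirsch-Mitzenmacher: derive k positions from two independent base hashes.
function kmPositions(
  hash32: (item: string, seed: number) => number,
  item: string,
  k: number,
  m: number
): number[] {
  const h1 = hash32(item, 0x9747b28c) >>> 0;
  const h2 = (hash32(item, 0x85ebca6b) | 1) >>> 0; // force odd so the walk doesn't degenerate
  const out: number[] = [];
  for (let i = 0; i < k; i++) {
    out.push((h1 + i * h2) % m); // gi(x) = (h1(x) + i·h2(x)) mod m
  }
  return out;
}
```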
Hash function comparison
| Function | Speed | Distribution | Notes |
|---|---|---|---|
| MurmurHash3 | Very fast | Excellent | Standard choice. Cassandra, Guava BloomFilter, Redis. |
| xxHash64 | Fastest known | Excellent | Modern choice. Best on 64-bit. Recommended for new code. |
| FNV-1a | Moderate | Good | Simple to implement. Weaker avalanche properties. |
| SHA-256 | 10–25× slower | Excellent | Cryptographically secure. Wasteful here — the security adds nothing. |
| CRC32 | Fast (HW) | Poor | Designed for error detection, not uniform distribution. |
What a good hash function actually does
The job of a hash function here is one thing: take any input and scatter it uniformly across [0, 2³²). Two equally-good hashes can produce wildly different output bits for inputs that differ by a single character. This property is called avalanche.
Type two similar strings. A good hash flips ~50% of output bits when one input bit changes — that's what makes positions in the filter look independent even for similar items.
Implementations of MurmurHash3, xxHash, and the full BloomFilter class are in the companion implementation page — copy-pasteable, dependency-free.
Tuning for Production
Picking m and k is straightforward. The harder engineering questions are: how wrong can I afford to be? How many items will I actually insert? And what happens if my n estimate is wrong?
Parameter calculator
Compare: storing 1M strings (~20 chars each) in a plain Set costs roughly 20 MB of string data alone, more once per-entry overhead is counted. The bloom filter above is well over an order of magnitude smaller for the same membership test.
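The calculator boils down to the two closed-form results derived earlier; a sketch (function names are mine, not any library's):

```ts
// Bits needed for n items at target false-positive rate p: m = -n·ln(p) / (ln 2)^2
function optimalBits(n: number, p: number): number {
  return Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
}

// Hash-function count that minimizes FPR for that m and n: k = (m/n)·ln 2
function optimalHashes(m: number, n: number): number {
  return Math.max(1, Math.round((m / n) * Math.LN2));
}

const m = optimalBits(1_000_000, 0.01);  // ≈ 9.59 million bits ≈ 1.2 MB
const k = optimalHashes(m, 1_000_000);   // 7
```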
FPR cost reference
| FPR | Bits per item | k* | Use case |
|---|---|---|---|
| 1% (0.01) | ~9.6 | 7 | General-purpose. Browser URL checking, cache warming. |
| 0.1% (0.001) | ~14.4 | 10 | Database index pre-filter (Cassandra, RocksDB). Eliminates 99.9% of unnecessary disk reads. |
| 0.01% (0.0001) | ~19.2 | 13 | Security checks. CDN single-hit prevention. |
| 0.001% | ~24 | 17 | Deduplication pipelines where extra lookups are very expensive. |
Every order-of-magnitude improvement in FPR costs exactly ln(10) / (ln 2)² ≈ 4.79 bits per item. This is a fixed price you can budget for.
Bit manipulation — the only data structure you need
The entire data structure is a single byte array. Setting bit i means: figure out which byte holds it (i >> 3), figure out which bit within that byte (i & 7), and OR in the right mask. That's two integer ops and one memory write — guaranteed O(1) and cache-line friendly.
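Spelled out as two tiny helpers (the same operations the earlier snippets use inline):

```ts
// Set bit i: byte index is i >> 3, bit-within-byte is i & 7.
function setBit(bits: Uint8Array, i: number): void {
  bits[i >> 3] |= 1 << (i & 7);
}

// Test bit i: same addressing, mask with AND instead of OR.
function testBit(bits: Uint8Array, i: number): boolean {
  return (bits[i >> 3] & (1 << (i & 7))) !== 0;
}
```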
For very large filters (hundreds of megabits), align each partition to a 64-byte cache line (512 bits) so that a lookup's probes stay within as few cache fetches as possible. This minimizes cache thrash when evaluating multiple hash functions on the same item.
Each row = 1 byte. Green = set bits. Binary representation on the right shows how bits pack into the underlying byte array — exactly what you'd see in a hex editor.
The API Surface
Stripped to its essentials, a production bloom filter has only six operations. Every library you'll meet — Guava's BloomFilter, the bloom-filters npm package, Cassandra's internal filters — exposes some shape of these six.
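One plausible shape of that surface, sketched as a TypeScript interface. The method names are assumptions for illustration, not any specific library's API:

```ts
interface BloomFilterLike {
  add(item: string): void;              // set the item's k bits
  has(item: string): boolean;           // false = definitely absent, true = maybe present
  union(other: BloomFilterLike): void;  // bitwise OR with a filter built with the same m, k, seeds
  fillRatio(): number;                  // fraction of bits set, useful for monitoring
  serialize(): Uint8Array;              // the raw byte array (parameters travel alongside)
  clear(): void;                        // reset every bit to 0
}
```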
Uint8Array of ⌈m/8⌉ bytes. That's the entire constructor.
Empirical verification
Theory says FPR should be (1 − e^(−kn/m))^k. Reality should agree. The practical test: insert n items, then query 100,000 items you definitely didn't insert and count how many "maybe" responses you get. Press the button below to run a live experiment in your browser.
Click "Run experiment" — about 1 second of pure JS. Result typically lands within ±5% of theoretical for a well-sized filter.
Merging and distributing filters
Because a bloom filter is just bytes, it's trivially portable. Bitwise OR merges two compatible filters — the result is the union of both sets. Composability across nodes is one of the things bloom filters do better than nearly any other data structure: each shard maintains its own filter, and a query coordinator merges them all to skip shards with no chance of a match.
The merged filter contains every item from both — but you cannot recover which filter contributed each bit. That's perfect for distributed unions; useless for "who knows what" forensics.
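A sketch of the merge, valid only when both filters were built with the same m, k, and hash seeds:

```ts
// Union of two compatible filters: bitwise OR of the underlying bytes.
function merge(a: Uint8Array, b: Uint8Array): Uint8Array {
  if (a.length !== b.length) throw new Error("filters are not the same size");
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] | b[i];
  return out;
}
```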
Full implementations — BloomFilter, CountingFilter, ScalableFilter, CuckooFilter, RotatingFilter — live in the implementation companion page, with serialization, tests, and benchmarks.
Variants
The standard bloom filter has two limitations: no deletion, and must be sized upfront. Several well-studied variants address these.
Counting Bloom Filter
Replace each bit with a small integer counter — typically 4 bits (values 0–15). Adding increments k counters. Deleting decrements them. A check returns "maybe present" if all k counters are > 0, or "definitely not present" if any counter is 0.
Caution: using check() to decide whether to delete is wrong — a false positive would make check() return "maybe present" for an item that was never inserted, and decrementing those counters would corrupt the filter and introduce false negatives. Always guard deletions with a separate confirmed-membership record.
Memory cost: 4× over a standard filter. Overflow: counters saturate at 15 (sticky — no further increment or decrement) to prevent false negatives. In practice overflow is vanishingly rare: a counter would need 16 different items hashing onto it, probability <10⁻⁶ for typical workloads.
Each byte stores two counters in its low and high nibbles. A 32-counter filter fits in 16 bytes — 4× a standard 32-bit bloom filter, exactly as the math says.
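A sketch of the nibble-packed counters (two 4-bit counters per byte); the helper names are illustrative:

```ts
// Counter i lives in byte i >> 1: even i in the low nibble, odd i in the high nibble.
function getCounter(c: Uint8Array, i: number): number {
  return i & 1 ? c[i >> 1] >> 4 : c[i >> 1] & 0x0f;
}

function setCounter(c: Uint8Array, i: number, v: number): void {
  const idx = i >> 1;
  c[idx] = i & 1 ? (c[idx] & 0x0f) | (v << 4) : (c[idx] & 0xf0) | v;
}

// Saturate at 15: a saturated counter is never changed again,
// which is what prevents false negatives after deletions.
function increment(c: Uint8Array, i: number): void {
  const v = getCounter(c, i);
  if (v < 15) setCounter(c, i, v + 1);
}

function decrement(c: Uint8Array, i: number): void {
  const v = getCounter(c, i);
  if (v > 0 && v < 15) setCounter(c, i, v - 1);
}
```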
Scalable Bloom Filter (SBF)
Maintains a list of standard filters, each larger than the last. When the current filter exceeds a fill threshold (typically 50%), a new larger filter is appended. Lookups check all sub-filters. Writes only go to the last filter.
Each successive filter targets a tighter FPR (multiplied by tightening ratio r < 1). The total FPR is bounded by a geometric series:
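FPR_total ≤ P₀ + P₀·r + P₀·r² + … = P₀ / (1 − r), where P₀ is the first filter's target FPR and r the tightening ratio.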
Geometric growth means you rarely have more than 5–6 sub-filters even after 50× the initial capacity. Lookup cost is O(num_filters × k) but remains fast in practice. The implementation page has a complete ScalableBloomFilter with growth and tightening configurable.
Partitioned Bloom Filter
Divides the m bits into k equal partitions of size m/k. Hash function i selects exactly one bit from partition i only — guaranteed no intra-item collisions between hash functions.
Benefits: exactly k bits are always set per item (never fewer due to intra-item collision); each partition can be analyzed independently. In practice FPR difference vs. a standard filter is negligible for large m, but cache-line alignment per partition can improve lookup throughput on modern hardware.
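A sketch of partitioned addressing, with hash32 again standing in for your base hash:

```ts
// Hash i may only select a bit inside partition i (each of size floor(m / k)).
function partitionedPositions(
  hash32: (item: string, seed: number) => number,
  item: string,
  k: number,
  m: number
): number[] {
  const slice = Math.floor(m / k);
  const out: number[] = [];
  for (let i = 0; i < k; i++) {
    out.push(i * slice + (hash32(item, i) % slice));
  }
  return out;
}
```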
Cuckoo Filter
A different data structure solving the same problem but supporting deletion without the memory overhead of counting filters. Stores small fingerprints (8–16 bits) in a hash table using cuckoo hashing. Each item has two candidate bucket locations. On insertion, if both are full, an existing fingerprint is displaced ("kicked out") to its alternate location.
The critical invariant enabling deletion: alt(b, fp) = b XOR hash(fp). This lets you compute a fingerprint's alternate bucket without knowing the original item — so deletion works even without storing the original key.
| | Bloom Filter | Counting BF | Cuckoo Filter |
|---|---|---|---|
| Deletion | No | Yes | Yes |
| Memory @ 1% FPR | ~9.6 bits/item | ~38 bits/item | ~12 bits/item |
| Lookup speed | O(k) | O(k) | O(1) avg |
| Worst-case insert | O(k) | O(k) | O(n) — rare |
| Duplicate inserts | Idempotent | Counter increments | Limited (at most 2b copies per fingerprint, b = bucket size) |
Step 1 (insert): bucket b XOR hash(fp) lands at bucket b'. Step 2 (round-trip): b' XOR the same hash(fp) returns to b. This involution is what makes deletion possible without ever storing the original key.
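A sketch of the bucket arithmetic behind that involution, assuming a power-of-two bucket count so the XOR stays in range; hash32 is any 32-bit hash:

```ts
// Partial-key cuckoo hashing: a fingerprint and its two candidate buckets.
function candidateBuckets(
  hash32: (s: string, seed: number) => number,
  item: string,
  numBuckets: number // must be a power of two for the XOR trick
): { fp: number; b1: number; b2: number } {
  const fp = (hash32(item, 1) & 0xff) || 1;                    // 8-bit fingerprint, 0 means "empty slot"
  const b1 = hash32(item, 2) & (numBuckets - 1);
  const b2 = (b1 ^ hash32(String(fp), 3)) & (numBuckets - 1);  // alt(b, fp) = b XOR hash(fp)
  return { fp, b1, b2 };
}
// Round-trip check: (b2 ^ hash32(String(fp), 3)) & (numBuckets - 1) === b1
```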
For systems requiring deletion, the Cuckoo filter is generally preferable to the Counting filter due to ~3× better space efficiency. A max-displacement limit (e.g., 500 kicks) triggers a full rehash if exceeded — rare with proper sizing but must be handled. Full code lives on the implementation page.
In Production Systems
The same pattern appears everywhere: bloom filters act as a cheap first-pass in front of an expensive operation. The expensive operation is only triggered on positives.
A filter check costs ~10–100 ns and lets the system skip the expensive operation outright on a "definitely not" answer (99%+ of non-member lookups); only "maybe" answers fall through to disk, network, or the database.
Each SSTable has a bloom filter. A read checks the filter first. "Definitely not" skips the entire disk seek. At 0.1% FPR, 99.9% of unnecessary seeks are eliminated — the primary reason LSM-tree reads stay fast despite not rewriting data in-place.
Chrome's Safe Browsing shipped ~650K malicious URLs as a ~1.5 MB binary blob downloaded periodically. The local filter ran on every URL visit in microseconds. Positives triggered a server-side hash lookup. Used until 2012, when the URL list outgrew the approach.
Prevents one-hit-wonder URLs from polluting edge caches. Two rotating filters: a URL must appear in two consecutive request windows before promotion to cache. Massively reduces cache churn for long-tail content.
BIP 37 (2012) had lightweight clients send a bloom filter to full nodes, expecting the false positive rate to provide privacy. It didn't. Research showed clients with even modest address counts leaked their entire wallet. Bitcoin Core disabled BIP 37 by default in 0.19.0 (2019) — a useful lesson in why probabilistic privacy is hard.
bloom & BRIN: PostgreSQL ships a bloom index extension and BRIN bloom operator classes. They speed up equality lookups on tables with many low-selectivity columns by giving each block range a bloom filter — the planner can skip block ranges whose filter rejects the predicate.
Track per-user seen content. A filter of ~100K items fits in ~200 KB, easily held per user in Redis; much smaller filters (a few thousand items) are compact enough to ride in a cookie or JWT. No need to query a "seen items" table on every recommendation generation.
Rotating bloom filters for sliding windows
A standard production pattern for time-windowed deduplication. Two filters alternate active/passive on a schedule. Lookups check both. When the window expires, the passive filter is cleared and becomes the new active.
Memory is always exactly 2× a single filter. Any item is "remembered" for at least 1 window and at most 2. Implementation in the companion page.
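A sketch of the rotation logic, reusing the add/has helpers from earlier; scheduling (when to call rotate) is left to the caller:

```ts
// Two alternating filters for time-windowed deduplication.
class RotatingFilter {
  private active: Uint8Array;
  private previous: Uint8Array;

  constructor(private m: number, private k: number) {
    const bytes = Math.ceil(m / 8);
    this.active = new Uint8Array(bytes);
    this.previous = new Uint8Array(bytes);
  }

  add(item: string): void {
    add(this.active, item, this.k, this.m); // writes only go to the active filter
  }

  has(item: string): boolean {
    return has(this.active, item, this.k, this.m)
        || has(this.previous, item, this.k, this.m); // lookups check both windows
  }

  rotate(): void {
    const recycled = this.previous;   // the expired window's buffer
    this.previous = this.active;      // current window becomes "previous"
    this.active = recycled.fill(0);   // cleared buffer becomes the new active
  }
}
```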
Distributed merge patterns
Because bloom filters are just bytes, they enable distributed patterns that are difficult with standard data structures:
- Anti-entropy sync. Two nodes exchange their filters (bitwise OR = union). Each node identifies items the other might be missing without transmitting the full dataset.
- Per-shard pre-routing. Each shard maintains a filter. A query coordinator checks all shard filters and routes to only the shards with potential matches — avoiding unnecessary cross-shard requests.
- Gossip acceleration. Before gossiping state to a peer, check a filter of recently gossiped items to avoid redundant transmissions.
When Not to Use One
Bloom filters are a specialized tool. The right question is: does this specific problem have the shape they solve?
| Structure | FP possible | Delete | Memory | Choose when |
|---|---|---|---|---|
| Set / hash table | No | Yes | High | Set is small, or FP is unacceptable. |
| Sorted array + bisect | No | Costly | Low | Static set, read-heavy, memory constrained. |
| Bloom filter | Controlled | No | Very low | Large set, misses dominate, FP tolerable, no deletes. |
| Counting BF | Controlled | Yes | ~4× bloom | As above, but items must be removable. |
| Cuckoo filter | Controlled | Yes | ~1.2× bloom | As above with deletion, higher throughput needed. |
| HyperLogLog | N/A | N/A | Tiny | Cardinality estimation — a different problem. |
Situations where a bloom filter adds no value
- The dataset is small (<10K items). A plain Set uses kilobytes with zero false positives.
- False positives surface as user errors. If a FP means showing an error message rather than a silent extra lookup, you've traded correctness for memory.
- Most lookups are hits. The bloom filter's value is entirely in the miss case. If your access patterns are mostly hits, the filter adds overhead without benefit. Profile first.
- You need to enumerate members. A bloom filter has no iteration. You cannot reconstruct what was added.
- Items are frequently deleted. Use a Counting or Cuckoo filter, or periodically rebuild the standard filter from a persistent store.
- The protected operation is already fast. If the "expensive" operation is already a microsecond memory lookup, k hash function calls may cost more than they save.