this post was submitted on 18 May 2024
260 points (98.5% liked)

xkcd

8888 readers
95 users here now

A community for a webcomic of romance, sarcasm, math, and language.

founded 1 year ago
MODERATORS
 

https://xkcd.com/2934

Alt text:

Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.

you are viewing a single comment's thread
view the rest of the comments
[–] jaybone 2 points 6 months ago (1 children)

So you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?

[–] [email protected] 2 points 6 months ago* (last edited 6 months ago)

Well, yes and no. With a straight-up hash set, you're keeping set_size * bits_per_element bits plus whatever the overhead of the hash table is in memory, which might not be tenable for very large sets, but with a Bloom filter that has eg. ~1% false positive rate and an ideal k parameter (number of hash functions, see eg. the Bloom filter wiki article) you're only keeping ~10 bits per element completely regardless of element size because they don't store the elements themselves or even their full hashes – they only tell you whether some element is probably in the set or not, but you can't eg. enumerate the elements in the set. As an example of memory usage, a Bloom filter that has a false positive rate of ~1% for 500 million elements would need 571 MiB (noting that although the size of the filter doesn't grow when you insert elements, the false positive rate goes up once you go past that 500 million element count.)

Lookup and insertion time complexity for a Bloom filter is O(k) where k is the parameter I mentioned and a constant – ie. effectively O(1).

Probabilistic set membership queries are mainly useful when you're dealing with ginormous sets of elements that you can't shove into a regular in-memory hash set. A good example in the wiki article is CDN cache filtering:

Nearly three-quarters of the URLs accessed from a typical web cache are "one-hit-wonders" that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request.