The Bloom Filter

The bloom filter is a probabilistic data structure. This is a data set in which we have two premises;

  • i) The element is not in the set.
  • ii) may be that the element is in the set, with an approximate degree of certainty.

 
Basically there are two working operations.

  • i) TEST tells us whether or not an element can be.
  • ii) ADD Adds an element to the structure.

 

The set may be on the order of millions or hundreds of millions of entries, and is highly efficient to reduce disk IO. This algorithm is commonly used in operations with parallel join when tables and partitioned tables both Oracle and SQL Server, in Sybase is also used in normal joins.

The algorithm is based on an array of m bits (b1, b2, .., bm) that are set to zero, then on a data d1, k hash functions are applyed, ess if we have k = 3 (more usual value) the three hash functions will be applyed on d1, these functions return a position in the range from 0 upto m (in the array). The positions returned by the hash functions will be setted to one in the array.

 

  • fk1 (d1)
  • fk2 (d1)
  • fk3 (d1)

 
When this check if an element is in the set, the three functions will return positions that must be 1, if so the element maybe is in the set, if one bit is not (equal to zero) the element is not in the set.

Its main advantage is that we do not store in memory values as a pk fields (number, varchar, etc.) Instead we substitute these values by a single bit. If a varchar occupies 10 bytes (8 bits) would be a single value against eighty possible values.

 

HTH – Antonio NAVARRO

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s