Spark, Hadoop, Hive and Programming: Bloom Filter

Sunday, February 12, 2017

Bloom Filter

A Bloom filter is a probabilistic data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The base data structure of a Bloom filter is a Bit Array. It tells us that the element either definitely is not in the set or may be in the set.

Hive: Bloom filter are relatively new feature in Hive (1.2.0) and should be leveraged for any high-performance applications. Bloom filter are suitable for queries using where together with the = operator:

SELECT AVG(revenue) WHERE gender=0

A bloom filter can apply to numeric, but also non-numeric (categorical) data, which is an advantage over the storage index. Internally, a bloom filter is a hash value for the data in a column in a given block. This means you can "ask" a bloom filter if it contains a certain value, such as gender=male, without you needing to read the block at all. This obviously increases performance significantly, because most of the time a database is busy with reading non-relevant blocks for a query.

HBase: An HBase Bloom Filter is an efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. Without Bloom Filter, the only way to decide if a row key is contained in a StoreFile is to check the StoreFile's block index, which stores the start row key of each block in the StoreFile. BloomFilters provide an in-memory structure to reduce disk reads to only the files likely to contain that Row. In short it can be considered as an in-memory index to determine probability of finding a row in a particular StoreFile.

Algorithm description

An example of a Bloom filter, representing the set {

x

y

z

}. The colored arrows show the positions in the bit array that each set element is mapped to. The element

w

is not in the set {

x

y

z

}, because it hashes to one bit-array position containing 0. For this figure,

m

= 18 and

k

= 3.

An empty Bloom filter is a bit array of

m

bits, all set to 0. There must also be

k

different hash functions defined, each of which maps or hashes some set element to one of the

m

array positions with a uniform random distribution. Typically,

k

is a constant, much smaller than

m

, which is proportional to the number of elements to be added; the precise choice of

k

and the constant of proportionality of

m

are determined by the intended false positive rate of the filter.

To add an element, feed it to each of the

k

hash functions to get

k

array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the

k

hash functions to get

k

array positions. If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.

References :

https://en.wikipedia.org/wiki/Bloom_filter

Spark, Hadoop, Hive and Programming

Sunday, February 12, 2017

Bloom Filter

Algorithm description

No comments:

Post a Comment