Sunday, February 12, 2017

Bloom Filter

A Bloom filter is probabilistic data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The base data structure of a Bloom filter is a Bit Array. It tells us that the element either definitely is not in the set or may be in the set.






Hive: Bloom filter are relatively new feature in Hive (1.2.0) and should be leveraged for any high-performance applications. Bloom filter are suitable for queries using where together with the = operator:
SELECT AVG(revenue) WHERE gender=0
A bloom filter can apply to numeric, but also non-numeric (categorical) data, which is an advantage over the storage index. Internally, a bloom filter is a hash value for the data in a column in a given block. This means you can "ask" a bloom filter if it contains a certain value, such as gender=male, without you needing to read the block at all. This obviously increases performance significantly, because most of the time a database is busy with reading non-relevant blocks for a query.
HBase: An HBase Bloom Filter is an efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. Without Bloom Filter, the only way to decide if a row key is contained in a StoreFile is to check the StoreFile's block index, which stores the start row key of each block in the StoreFile. BloomFilters provide an in-memory structure to reduce disk reads to only the files likely to contain that Row. In short it can be considered as an in-memory index to determine probability of finding a row in a particular StoreFile.

Algorithm description

An example of a Bloom filter, representing the set {xyz}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {xyz}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.


References :
https://en.wikipedia.org/wiki/Bloom_filter

No comments:

Post a Comment