Wednesday, September 27, 2017

Secondary sort in Hadoop

What is secondary sort in Hadoop?


In MapReduce, the reduce function is called one time for each unique map function output key.  Each call to it includes a collection of all values that accompanied that key in map outputs.  The framework sorts the data in between the map and reduce phase, meaning that a comparator is used to determine the order that keys and their corresponding lists of values are fed into the reduce function.

The order of the values within a reduce function call, however, is typically unspecified and can vary between runs.

Secondary sort is a technique that allows the MapReduce programmer to control the order that the values show up within a reduce function call.

We can achieve this using a composite key that contains both the information needed to sort by key and the information needed by value, and then decoupling the grouping of the intermediate data from the sorting of the intermediate data.  By sorting, we mean deciding the order that map output key/value pairs are presented to the reduce functions.  We want to sort both by the keys and the values.  By grouping, we mean deciding which sets of key/value are lumped together into a single call of the reduce function.  We want to group only on the keys so that we don't get a separate call to the reduce function for each unique value.


No comments:

Post a Comment