Spark, Hadoop, Hive and Programming: Total Order Sorting in MapReduce

Friday, February 17, 2017

Total Order Sorting in MapReduce

When using multiple reducers, each reducer receives (key,value) pairs assigned to them by the Partitioner. When a reducer receives those pairs they are sorted by key, so generally the output of a reducer is also sorted by key. However, the outputs of different reducers are not ordered between each other, so they cannot be concatenated or read sequentially in the correct order.

For example with 2 reducers, sorting on simple Text keys, you can have :
– Reducer 1 output : (a,5), (d,6), (w,5)
– Reducer 2 output : (b,2), (c,5), (e,7)
The keys are only sorted if you look at each output individually, but if you read one after the other, the ordering is broken.

The objective of Total Order Sorting is to have all outputs sorted across all reducers :
– Reducer 1 output : (a,5), (b,2), (c,5)
– Reducer 2 output : (d,6), (e,7), (w,5)
This way the outputs can be read/searched/concatenated sequentially as a single ordered output.

Ans:

We can create a manual total order sorting with a custom partitioner.

We can also use Hadoop’s TotalOrderPartitioner to automatically create partitions on simple type keys.

Finally we can use more advanced technique to use our Secondary Sort’s Composite Key with this partitioner to achieve “Total Secondary Sorting“.

Total Order Sort example using TotalOrderPartitioner

Total Secondary Sort by Composite Key

The TotalOrderPartitioner uses the map output keys to calculate partitioning.
Our map output keys are of type CompositeKey, because we are doing Secondary Sorting.

Spark, Hadoop, Hive and Programming

Friday, February 17, 2017

Total Order Sorting in MapReduce

No comments:

Post a Comment