Replicated Join in MapReduce:

The Replicated Join is a Map-Side Join: the join is done in the mappers, and no reducer is needed at all. Because the shuffle and reduce phases are skipped, it should be a faster join than a reduce-side join, but only if certain requirements are met.

The main idea of the Replicated Join is to cache all the data from the smaller dataset in every mapper that processes a split of the bigger dataset.

In our example, the Donations sequence file is 4 blocks long, but the Projects sequence file is only 1 block long. The aim is to cache the Projects data in each of the 4 Donations mappers. This way, each mapper has all the Project records in memory and can join them with the Donation records from its own input split.
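The idea can be sketched in plain Java, without any Hadoop dependency. The record classes, their field names (`projectId`, `title`, `amount`) and the class name `ReplicatedJoinSketch` are assumptions made for illustration, not the real schema of the Donations and Projects files. In an actual job, the lookup table would typically be built once per mapper (for instance in the mapper's `setup` method), and the loop body would run once per `map` call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicatedJoinSketch {

    // Hypothetical record of the smaller dataset.
    static final class Project {
        final String projectId;
        final String title;
        Project(String projectId, String title) {
            this.projectId = projectId;
            this.title = title;
        }
    }

    // Hypothetical record of the bigger dataset.
    static final class Donation {
        final String donationId;
        final String projectId;
        final int amount;
        Donation(String donationId, String projectId, int amount) {
            this.donationId = donationId;
            this.projectId = projectId;
            this.amount = amount;
        }
    }

    // Load the whole smaller dataset into an in-memory lookup table.
    static Map<String, Project> buildCache(List<Project> projects) {
        Map<String, Project> cache = new HashMap<>();
        for (Project p : projects) {
            cache.put(p.projectId, p);
        }
        return cache;
    }

    // Join the donations of one input split against the cached projects.
    // This is an inner join: donations without a matching project are skipped.
    static List<String> joinSplit(List<Donation> split, Map<String, Project> cache) {
        List<String> out = new ArrayList<>();
        for (Donation d : split) {
            Project p = cache.get(d.projectId);
            if (p != null) {
                out.add(d.donationId + "\t" + d.amount + "\t" + p.title);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Project> cache = buildCache(Arrays.asList(
                new Project("p1", "Books"),
                new Project("p2", "Microscopes")));
        List<Donation> split = Arrays.asList(
                new Donation("d1", "p1", 25),
                new Donation("d2", "p2", 10));
        for (String line : joinSplit(split, cache)) {
            System.out.println(line);
        }
    }
}
```

Each of the 4 mappers would hold its own copy of the lookup table, which is why the table must stay small.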

When using a Replicated Join, the results are grouped by the records of the bigger dataset: each mapper's output follows the records of its own input split, each one extended with the matching record from the smaller dataset.

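Limitations: There are 2 important limitations when using the Replicated Join:

The smaller dataset has to fit into the memory of each mapper.
Only an Inner Join or a Left Join (with the bigger dataset on the left side) can be performed. No single mapper sees the whole bigger dataset, so none of them can tell which records of the smaller dataset have no match at all.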
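The difference between the two supported join types can be shown with a minimal plain-Java sketch. The class name `JoinTypeSketch`, the `{donationId, projectId}` array layout and the comma-separated output are all assumptions chosen for illustration. The sketch also hints at why a Right Join is out of reach: the mapper below never learns whether project `p2` is referenced by a donation in some other split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinTypeSketch {

    // donations: each entry is {donationId, projectId} (hypothetical layout).
    // projectTitles: the cached smaller dataset, projectId -> title.
    static List<String> join(List<String[]> donations,
                             Map<String, String> projectTitles,
                             boolean leftJoin) {
        List<String> out = new ArrayList<>();
        for (String[] d : donations) {
            String title = projectTitles.get(d[1]);
            if (title != null) {
                // Match found: emit the donation together with the project title.
                out.add(d[0] + "," + d[1] + "," + title);
            } else if (leftJoin) {
                // Left join: keep the donation and leave the project side empty.
                out.add(d[0] + "," + d[1] + ",");
            }
            // Inner join: an unmatched donation is simply dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> projectTitles = new HashMap<>();
        projectTitles.put("p1", "Books");
        projectTitles.put("p2", "Microscopes"); // not referenced by this split

        List<String[]> split = new ArrayList<>();
        split.add(new String[] {"d1", "p1"});
        split.add(new String[] {"d2", "p9"}); // no such project in the cache

        System.out.println("inner: " + join(split, projectTitles, false));
        System.out.println("left:  " + join(split, projectTitles, true));
    }
}
```

The inner join drops donation `d2`, while the left join keeps it with an empty project side. In neither case does the mapper emit anything for `p2`, since it cannot know whether another mapper has matched it.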

Spark, Hadoop, Hive and Programming

Friday, February 17, 2017
