Case 1: Map-Side Join
Assume you have one table of 1 TB and one table of 1 MB.
Assume as well that we have 50 nodes. (As far as I know, the distributed cache copies a dataset once per node rather than once per map task, but I might be wrong about that.)
Map-Side Join:
The small table has to be copied to every node: 50 x 1 MB = 50 MB of data is copied across the network.
Shuffle Join:
Both tables need to be shuffled once, i.e. 1 TB + 1 MB ≈ 1.000001 TB is copied over the network.
So the map-side join is much better than the shuffle join in this case.
Hive supports MAPJOINs, which are well suited for this scenario – at least for dimensions small enough to fit in memory. Before release 0.11, a MAPJOIN could be invoked either through an optimizer hint:
select /*+ MAPJOIN(time_dim) */ count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
or via auto join conversion:
set hive.auto.convert.join=true; select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
The default value for hive.auto.convert.join was false in Hive 0.10.0. Hive 0.11.0 changed the default to true (HIVE-3297). Note that hive-default.xml.template incorrectly gives the default as false in Hive 0.11.0 through 0.13.1.
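When relying on auto conversion, the size threshold that decides which table counts as "small" also matters. Below is a minimal sketch, assuming Hive 0.11 or later; the 25 MB value is what I understand the shipped default of hive.mapjoin.smalltable.filesize to be, and the query reuses the example tables from above:
set hive.auto.convert.join=true;
-- tables below this size in bytes (~25 MB) are considered small enough to be loaded into memory as a hash table
set hive.mapjoin.smalltable.filesize=25000000;
select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk);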
Case 2: Shuffle Join
Assume you have one table of 1 TB and one table of 500 GB, again on 50 nodes:
Map-Side Join:
The 500 GB table would need to be copied to all 50 nodes, i.e. 50 x 500 GB = 25 TB of data copied over the network.
Shuffle Join:
Only 1 TB + 500 GB = 1.5 TB of data needs to be copied over the network.
So in this case the map-side join is much worse than the shuffle join; see the sketch below.
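For a large-against-large join like this you simply keep the default reduce-side (common) join. A minimal sketch with hypothetical table and column names, showing how you could make sure Hive does not attempt a map join for the session:
set hive.auto.convert.join=false;
-- both inputs are shuffled by the join key and joined on the reduce side
select count(*) from big_table_a a join big_table_b b on (a.join_key = b.join_key);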