Case 1: Map-Side Join
Assume you have one table of 1 TB and one table of 1 MB.
Assume as well that we have 50 nodes. (As far as I know, the distributed cache copies a dataset once per node rather than once per map task, but I might be wrong about that.)
Map-Side Join:
The small table has to be copied to every node: 50 x 1 MB = 50 MB of data is copied across the network.
Shuffle Join:
Both tables need to be shuffled once, i.e. 1 TB + 1 MB ≈ 1.000001 TB is copied over the network.
So the map-side join is much better than the shuffle join in this case.
Hive supports MAPJOINs, which are well suited for this scenario – at least for dimensions small enough to fit in memory. Before release 0.11, a MAPJOIN could be invoked either through an optimizer hint:
select /*+ MAPJOIN(time_dim) */ count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
or via auto join conversion:
set hive.auto.convert.join=true; select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
The default value for hive.auto.convert.join was false in Hive 0.10.0. Hive 0.11.0 changed the default to true (HIVE-3297). Note that hive-default.xml.template incorrectly gives the default as false in Hive 0.11.0 through 0.13.1.
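When relying on auto conversion, the size threshold that decides which table counts as "small" also matters. Below is a minimal sketch, assuming Hive 0.11 or later; the 25 MB value is what I understand the shipped default of hive.mapjoin.smalltable.filesize to be, and the query reuses the example tables from above:
set hive.auto.convert.join=true;
-- tables below this size in bytes (~25 MB) are considered small enough to be loaded into memory as a hash table
set hive.mapjoin.smalltable.filesize=25000000;
select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk);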
Case 2: Shuffle Join
Assume you have one table of 1 TB and one table of 500 GB, again on 50 nodes:
Map-Side Join:
The 500 GB table would need to be copied to all 50 nodes, i.e. 50 x 500 GB = 25 TB of data copied over the network.
Shuffle Join:
Only 1 TB + 500 GB = 1.5 TB of data needs to be copied over the network.
So in this case the map-side join is much worse than the shuffle join; see the sketch below.
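For a large-against-large join like this you simply keep the default reduce-side (common) join. A minimal sketch with hypothetical table and column names, showing how you could make sure Hive does not attempt a map join for the session:
set hive.auto.convert.join=false;
-- both inputs are shuffled by the join key and joined on the reduce side
select count(*) from big_table_a a join big_table_b b on (a.join_key = b.join_key);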