Thursday, July 4, 2019

Difference between Coalesce and Repartition

coalesce() reduces the number of partitions in a DataFrame. If you ask for more partitions than the DataFrame already has, the partition count stays unchanged.

repartition() can either increase or decrease the number of partitions in a DataFrame.

repartition() does a full shuffle of the data and creates roughly equal-sized partitions. coalesce() combines existing partitions to avoid a full shuffle, so the resulting partitions can be uneven in size.
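To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea and not Spark's actual implementation: partitions are modelled as lists of rows, and the names toy_coalesce, toy_repartition and skewed are hypothetical helpers invented for this illustration. Round-robin distribution is used as a simple stand-in for a full shuffle.

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge neighbouring partitions into n groups without splitting any of them."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n >= len(partitions):
        # Like coalesce(), this toy version never increases the partition count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each existing partition goes to one target as a whole.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every row across n new partitions (stand-in for a full shuffle)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rows = [row for part in partitions for row in part]
    out: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        # Round-robin assignment: every row may move to a different partition.
        out[i % n].append(row)
    return out


# Four partitions with very uneven sizes (6, 1, 2 and 1 rows).
skewed = [[1, 2, 3, 4, 5, 6], [7], [8, 9], [10]]

print([len(p) for p in toy_coalesce(skewed, 2)])     # [7, 3]
print([len(p) for p in toy_repartition(skewed, 2)])  # [5, 5]
```

In this model the coalesced result keeps whole partitions together and inherits their skew, while the repartitioned result is balanced at the cost of moving every row.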

Summary Of Difference
coalesce() | repartition()
Reduces the number of partitions. | Increases or decreases the number of partitions.
Tries to minimize data movement by avoiding a network shuffle. | Triggers a network shuffle, which can increase data movement.
Creates unequal-sized partitions. | Creates roughly equal-sized partitions.
