Wednesday, May 20, 2020

Performance of Delta Vs Parquet file formats



Enable auto compaction and optimized writes for the session before writing to the Delta table:

spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
spark.sql("set spark.databricks.delta.optimizeWrite.enabled = true")
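
The same two settings can also be applied through the session's runtime configuration and read back to confirm they took effect. This is a minimal sketch, assuming a Databricks notebook where `spark` is the active SparkSession; the helper name `enable_auto_optimize` and the settings dictionary are illustrative and not part of any Databricks API.

```python
# Session-level Delta settings taken from the two SET statements above.
DELTA_AUTO_OPTIMIZE_SETTINGS = {
    "spark.databricks.delta.autoCompact.enabled": "true",
    "spark.databricks.delta.optimizeWrite.enabled": "true",
}


def enable_auto_optimize(spark):
    """Apply the settings to the session and return the values read back.

    `spark` is assumed to be a SparkSession (or any object exposing
    conf.set / conf.get in the same way).
    """
    applied = {}
    for key, value in DELTA_AUTO_OPTIMIZE_SETTINGS.items():
        spark.conf.set(key, value)
        # Read the value back so a typo in a key is easier to spot.
        applied[key] = spark.conf.get(key)
    return applied
```

In a notebook, `enable_auto_optimize(spark)` would return a dictionary of the two keys and their current values.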



Run OPTIMIZE on the Databricks Delta table, Z-ordering by DayofWeek:

display(spark.sql("OPTIMIZE flights ZORDER BY (DayofWeek)"))

The query over the Databricks Delta table runs much faster after OPTIMIZE is run.
How much faster depends on the configuration of the cluster you are running on,
but it should be roughly 5-10x faster than the same query on the standard table.
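
One way to check the speedup on your own cluster is to time the same query before and after OPTIMIZE. The sketch below is a rough measurement under stated assumptions: `spark` is the active SparkSession, the `flights` table and `DayofWeek` column are the ones used above, and the helper names `time_query` and `measure_optimize_speedup` as well as the query you pass in are hypothetical. Repeated runs may benefit from caching, so treat the ratio as indicative only.

```python
import time


def time_query(run_query, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


def measure_optimize_speedup(
    spark,
    query,
    optimize_sql="OPTIMIZE flights ZORDER BY (DayofWeek)",
    repeats=3,
):
    """Time `query` before and after running OPTIMIZE.

    Returns (seconds_before, seconds_after, speedup_ratio).
    """
    # collect() forces the query to execute so the timing is meaningful.
    run = lambda: spark.sql(query).collect()

    before = time_query(run, repeats)
    # Run the OPTIMIZE command from the post.
    spark.sql(optimize_sql).collect()
    after = time_query(run, repeats)

    return before, after, before / after
```

For example, `measure_optimize_speedup(spark, "SELECT DayofWeek, count(*) FROM flights GROUP BY DayofWeek")` would time an aggregation over the Z-ordered column; the query text here is only an illustration.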

References:

https://docs.databricks.com/_static/notebooks/delta/optimize-scala.html

https://docs.databricks.com/delta/optimizations/index.html#compaction-bin-packing

https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html