Various possible ways to improve the performance of Hadoop
1. Compress input and output data: compressing files saves storage space on HDFS and also speeds up transfer over the network and disk.
Common codecs are bzip2 (splittable), LZO (splittable once an index is built), and gzip, deflate, Snappy, and LZ4 (not splittable as raw files, so a single mapper must read the whole file). Fast codecs such as Snappy and LZ4 remain a good fit for intermediate map output and inside splittable container formats such as SequenceFile. (A configuration sketch follows the list.)
2. Adjust spill records and the sorting buffer to minimize disk spills: map output is sorted in an in-memory buffer and spilled to disk when the buffer fills, so sizing the buffer to hold a map's full output means each map spills only once. (Sketch below.)
3. Implement a combiner if the aggregate operation is associative and commutative: partial aggregation on the map side shrinks the amount of data shuffled to the reducers. (Sketch below.)
4. Consider reducing the replication factor for intermediate or easily regenerated output: fewer replicas mean less data written per job, at the cost of lower fault tolerance. (Sketch below.)
5. Adjust the number of map tasks, reduce tasks, and the memory given to each: the map count follows from the input split size, while the reducer count and container memory are set per job. (Sketch below.)
6. Use skewed joins
Sometimes the data being processed is skewed, meaning, say, 80% of the records share a few hot keys and therefore land on a single reducer. If there is a huge amount of data for a single key, that one reducer ends up processing the majority of the data while the others sit idle; this is when a skewed join comes to the rescue. A skewed join (available in Apache Pig via JOIN ... USING 'skewed') first computes a histogram to find the dominant keys, then splits the records for those keys across several reducers to balance the load. (A sketch of the underlying idea follows the list.)
7. Speculative execution
The running time of a MapReduce job is dominated by its slowest tasks. Speculative execution addresses such stragglers by launching backup copies of slow tasks on other nodes and keeping whichever attempt finishes first. It is controlled by the properties mapreduce.map.speculative and mapreduce.reduce.speculative (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution); with these set to true, job execution time is reduced when individual tasks fall behind because of failing hardware or resource contention. (Sketch below.)
8. Use the distributed cache: small read-only files (lookup tables, dictionaries, extra jars) needed by every task can be shipped once per node instead of being read repeatedly from HDFS. (Sketch below.)
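
A minimal sketch for point 1, using the Hadoop 2.x Java MapReduce API (mapper, reducer, and paths omitted): Snappy compresses the intermediate map output, which is consumed locally and never needs to be split, while splittable bzip2 is used for the final output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Fast, non-splittable codec for intermediate map output,
    // which is read locally by the shuffle and never split.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression-demo");
    // Splittable codec for the final output, so a downstream job
    // can process one large output file with many mappers.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
  }
}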
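For point 2, the relevant Hadoop 2.x properties are sketched below; the values are illustrative assumptions, and the right numbers depend on the map output size and container memory.

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // In-memory buffer for sorting map output (MB). If a map's
    // entire output fits, it spills to disk only once.
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    // Fraction of the buffer filled before a background spill starts.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
    // Number of spill files merged in a single pass.
    conf.setInt("mapreduce.task.io.sort.factor", 50);
  }
}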
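For point 3, integer addition is associative and commutative, so a word-count-style reducer can double as the combiner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSetup {
  // Sums counts per key; safe as a combiner because integer
  // addition is associative and commutative.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setReducerClass(SumReducer.class);
    // Partial aggregation on the map side shrinks the shuffle.
    job.setCombinerClass(SumReducer.class);
  }
}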
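For point 4, the replication factor can be lowered for a single job's output instead of cluster-wide; the value 2 below is an illustrative assumption:

import org.apache.hadoop.conf.Configuration;

public class ReplicationTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Write this job's output with 2 replicas instead of the
    // default 3. Acceptable for intermediate or easily regenerated
    // data; keep the default for data you cannot afford to lose.
    conf.setInt("dfs.replication", 2);
  }
}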
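For point 5, a sketch of the per-job resource knobs; all numbers are illustrative assumptions (the JVM heap is conventionally set to roughly 80% of the container size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Container memory per task (MB) and a matching JVM heap.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");

    Job job = Job.getInstance(conf, "resource-demo");
    // Reducer count is set directly; the number of maps follows
    // from the input split size, tunable via
    // mapreduce.input.fileinputformat.split.minsize / maxsize.
    job.setNumReduceTasks(20);
  }
}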
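The histogram-based splitting described in point 6 is what Pig implements for JOIN ... USING 'skewed'. In hand-written MapReduce the same load-balancing idea is often approximated by key salting; the mapper below is a sketch of that workaround (not Pig's implementation), with an arbitrary salt count of 10, for an associative aggregation such as a count:

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltingMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final int SALTS = 10; // reducers a hot key is spread over
  private static final IntWritable ONE = new IntWritable(1);
  private final Random rnd = new Random();
  private final Text outKey = new Text();

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws java.io.IOException, InterruptedException {
    // Appending a random salt spreads a dominant key over SALTS
    // reducers; a second pass (or the consumer of the output)
    // strips the salt and merges the partial aggregates.
    outKey.set(value.toString() + "#" + rnd.nextInt(SALTS));
    ctx.write(outKey, ONE);
  }
}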
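For point 7, the settings can be made explicit per job (note they default to true on stock Hadoop 2.x clusters):

import org.apache.hadoop.conf.Configuration;

public class SpeculationSetup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Launch backup attempts for straggling tasks; the first
    // attempt to finish wins and the others are killed.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);
  }
}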
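For point 8, a sketch using the Hadoop 2.x API; the HDFS path and the "lookup" link name are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
  public static class LookupMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void setup(Context ctx) throws java.io.IOException {
      // Cached files are localized once per node and symlinked into
      // the task's working directory under the URI fragment name.
      try (BufferedReader r = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = r.readLine()) != null) {
          // ... load the lookup table into memory ...
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Hypothetical HDFS path; shipped once per node, not per task.
    job.addCacheFile(new URI("/data/lookup.txt#lookup"));
    job.setMapperClass(LookupMapper.class);
  }
}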