Hadoop works better with a small number of large files than with a large number of small files. A large number of small files consumes a lot of memory on the NameNode, because the NameNode holds metadata for every file and block. Each small file also generates its own map task, so a job over many small files launches far too many map tasks, each with very little input. Storing and processing small files in HDFS therefore adds overhead to MapReduce jobs and degrades NameNode performance. The default HDFS block size is 128 MB, and files smaller than the default block size are termed small files.
Some of the possible solutions:
1. Merge the small files in the same directory into one large file and build an index for each small file. This improves the storage efficiency of small files and reduces the metadata burden on the NameNode (see the SequenceFile sketch after this list).
2. Modify the InputFormat class so that multiple files are combined into a single split. Each map task then gets more input to process than in the existing system, which reduces the time required to process a large number of small files. In addition, multiple reducers can be used to take advantage of parallelism (see the driver sketch after this list).
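For the first approach, a minimal sketch is shown below. It assumes the small files are packed into a single Hadoop SequenceFile, with the original file name as the key (standing in for the per-file index) and the raw file bytes as the value; the class name SmallFileMerger and the command-line arguments are illustrative, not part of any standard tool.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs every small file in a directory into one SequenceFile,
 *  using the original file name as the key and its bytes as the value. */
public class SmallFileMerger {

  public static void main(String[] args) throws IOException {
    Path inputDir = new Path(args[0]);   // directory containing the small files
    Path mergedFile = new Path(args[1]); // output SequenceFile

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(mergedFile),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class))) {

      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDirectory()) {
          continue;                      // pack plain files only
        }
        // Small-file assumption: the whole file fits comfortably in memory.
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(0, contents);
        }
        // key = original file name (acts as the per-file index), value = raw bytes
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    }
  }
}

Because the merged SequenceFile is one large, splittable HDFS file, the NameNode tracks one set of blocks instead of one metadata entry per small file.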
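For the second approach, rather than writing a custom InputFormat from scratch, Hadoop's built-in CombineTextInputFormat (part of the CombineFileInputFormat family) already packs many small files into a few splits. The driver sketch below assumes a word-count style job; the split size of 128 MB and the reducer count of 4 are illustrative choices, not requirements.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Word-count job whose input format packs many small text files into few splits. */
public class CombineSmallFilesDriver {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFilesDriver.class);

    // Combine many small files into one split per (at most) 128 MB of input,
    // so each map task gets a reasonable amount of work.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setNumReduceTasks(4);            // multiple reducers for parallelism

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Capping the maximum split size bounds how much data goes into each combined split, so the number of map tasks tracks the total input size rather than the number of files.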