Hadoop Interview Questions


What are the four modules that make up the Apache Hadoop framework?

Hadoop Common, which contains the common utilities and libraries required by Hadoop’s other modules.
Hadoop YARN, the framework’s platform for resource management.
Hadoop Distributed File System (HDFS), which stores data on commodity machines.
Hadoop MapReduce, a programming model used to process large-scale data sets.



What is a Combiner?


A combiner helps to minimize the data transferred between the map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function. The combiner function is an optimization.

Hadoop produces the same output from the reducer whether the combiner function is called zero, one, or many times for a particular map output record.

Summary: a combiner enhances the efficiency of MapReduce by reducing the amount of data that has to be sent from the mappers to the reducers.


We cannot use a combiner in all cases. For example, if we were calculating mean temperatures, we couldn’t use the mean as our combiner function, because:
mean(0, 20, 10, 25, 15) = 14
but:
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15

The combiner function doesn’t replace the reduce function. (How could it? The reduce function is still needed to process records with the same key from different maps.) But it can help cut down the amount of data shuffled between the mappers and the reducers, and for this reason alone it is always worth considering whether you can use a combiner function in your MapReduce job.
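
To make this concrete, here is a minimal sketch of how a combiner is typically wired into a job, assuming the standard word-count mapper and reducer (TokenizerMapper and IntSumReducer, sketched later in this post). Because summing is associative and commutative, the reducer class can safely double as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);

        job.setMapperClass(TokenizerMapper.class);   // emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);   // sums locally on the map side
        job.setReducerClass(IntSumReducer.class);    // produces the final count per word

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}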

How many map tasks are created?

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or
just splits. Hadoop creates one map task for each split.

For most jobs, a good split size tends to be the size of an HDFS block, which is 64 MB by default in Apache Hadoop, although this can be changed for the cluster (for all newly created files) or specified when each file is created. This is the reason the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node.
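
The split size can also be bounded per job rather than changing the cluster-wide block size. A minimal sketch, assuming the new MapReduce API; the input path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfigDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size demo");
        job.setJarByClass(SplitSizeConfigDemo.class);

        // Lower and upper bounds on the split size, in bytes.
        // Pinning both to 64 MB keeps one split per default-sized block.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative path
    }
}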

What does a split do?

Before data is transferred from its location on disk to the map method, there is a phase called the split. The split pulls a block of data from HDFS into the framework: it does not write anything, but reads data from the block and passes it to the mapper. By default, splitting is taken care of by the framework, the split size is equal to the block size, and the input is divided into a set of such splits.

What is the relation between an HDFS Block and an InputSplit?

The HDFS Block is the physical division of the data and the InputSplit is the logical division of the data. During MapReduce execution Hadoop scans through the blocks and creates InputSplits, and each InputSplit is assigned to an individual mapper for processing.

For most MapReduce jobs, a good input split size tends to be the size of an HDFS block, which is 64 MB by default in Apache Hadoop, although this can be changed for the cluster.
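
The rule that decides how blocks map onto splits is simple: the split size is the block size clamped between a configurable minimum and maximum. The sketch below mirrors the computeSplitSize() logic in Hadoop’s FileInputFormat, with the default bounds filled in.

public class SplitSizeDemo {
    // Mirrors the logic of FileInputFormat.computeSplitSize().
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB default block size
        long minSize = 1L;                  // default minimum split size
        long maxSize = Long.MAX_VALUE;      // default maximum split size
        // Prints 67108864 (64 MB): one InputSplit per HDFS block by default.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}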

Where is the output of the map tasks stored?

The map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it is processed by the reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So storing it in HDFS with replication would be overkill.

What are 'maps' and 'reduces'?

'Maps' and 'Reduces' are two phases of solving a query in HDFS. 'Map' is responsible for reading data from the input location and, based on the input type, generating a key-value pair, that is, an intermediate output on the local machine. 'Reducer' is responsible for processing the intermediate output received from the mapper and generating the final output.
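
A minimal sketch of the two phases, following the standard word-count example: the map phase turns each input line into (word, 1) pairs as intermediate output, and the reduce phase sums the counts received for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: read a line and emit (word, 1) as intermediate output.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receive all intermediate counts for a word and emit the total.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}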

What is a Namenode?

Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the blocks that are present on the datanodes. It should be a highly reliable machine because it is a single point of failure in HDFS.


What is a Datanode?

Datanodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from the clients.
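
The division of labour is visible from a client program: the Namenode answers the metadata question of which datanodes hold each block, while the blocks themselves are read from the datanodes. A small sketch using the FileSystem API; the file path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // illustrative path

        // Metadata lookup answered by the Namenode: which datanodes hold each block?
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
    }
}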

What is a job tracker?

Job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different task trackers. In a Hadoop cluster there is only one job tracker but many task trackers. It is the single point of failure for the Hadoop MapReduce service: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task has completed.

What is a task tracker?

Task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work, and assigns it to different task trackers to perform the MapReduce tasks. While performing these tasks, each task tracker communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
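
A toy illustration of the heartbeat-timeout idea described above (this is not Hadoop code; the class and method names are invented for the example): the job tracker records the last heartbeat time of each task tracker and treats a tracker as lost once no heartbeat arrives within the configured interval.

import java.util.HashMap;
import java.util.Map;

// Toy model only: illustrates the heartbeat-timeout logic, not Hadoop's implementation.
public class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called each time a task tracker reports in.
    public void recordHeartbeat(String taskTrackerId) {
        lastHeartbeat.put(taskTrackerId, System.currentTimeMillis());
    }

    // A tracker whose last heartbeat is older than the timeout is presumed dead,
    // and its tasks would be rescheduled on another tracker.
    public boolean isAlive(String taskTrackerId) {
        Long last = lastHeartbeat.get(taskTrackerId);
        return last != null && System.currentTimeMillis() - last < timeoutMillis;
    }
}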

Is the Namenode machine the same as the datanode machine in terms of hardware?

It depends on the cluster you are trying to create. The Hadoop VM can be on the same machine or on another machine. For instance, in a single-node cluster there is only one machine, whereas in a development or testing environment the Namenode and the datanodes are on different machines.

Are Namenode and job tracker on the same host?

No, in a practical environment the Namenode runs on a separate host and the job tracker runs on a separate host.


What is the difference between an HDFS Block and InputSplit?

HDFS Block is the physical division of the data and InputSplit is the logical division of the data. 

What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, primarily to minimize the cost of seeks.
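
The block size can be changed cluster-wide (dfs.blocksize) or specified per file at creation time. A minimal sketch using the FileSystem API; the path, buffer size, and replication factor below are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/large-blocks.txt"); // illustrative path

        // Create a file with an explicit 128 MB block size instead of the default.
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(file,
                true,        // overwrite if the file already exists
                4096,        // I/O buffer size in bytes
                (short) 3,   // replication factor
                blockSize);
        out.writeUTF("stored in 128 MB blocks");
        out.close();

        // Read the block size back from the file's metadata.
        System.out.println(fs.getFileStatus(file).getBlockSize());
    }
}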

What are the major Hadoop 1.x components?

The major components of Hadoop 1.x are:

Hadoop Distributed File System (HDFS): HDFS is designed to run on commodity machines built from low-cost hardware. The distributed data is stored in the HDFS file system. HDFS is highly fault tolerant and provides high-throughput access to applications that require big data.

Namenode: The namenode is the heart of the Hadoop system. It manages the file system namespace and stores the metadata of the data blocks. This metadata is stored permanently on the local disk in the form of the namespace image and the edit log. The namenode also knows the locations of the data blocks on the datanodes, but it does not store this information persistently; it recreates the block-to-datanode mapping when it is restarted. If the namenode crashes, the entire Hadoop system goes down.

Secondary Namenode: The responsibility of the secondary namenode is to periodically copy and merge the namespace image and the edit log. If the namenode crashes, the namespace image stored on the secondary namenode can be used to restart the namenode.

DataNode: It stores blocks of data and retrieves them. The datanodes also report block information to the namenode periodically.

JobTracker: The JobTracker’s responsibility is to schedule the clients’ jobs. It creates map and reduce tasks and schedules them to run on the datanodes (tasktrackers). It also checks for failed tasks and reschedules them on another datanode. The JobTracker can run on the namenode or on a separate node.

TaskTracker: The TaskTracker runs on the datanodes. Its responsibility is to run the map or reduce tasks assigned by the JobTracker and to report the status of those tasks to the JobTracker.



What is best file format to store the data using Hive/Pig in Hadoop?


ORC File
The ORC (Optimized Row Columnar) file format provides a more efficient way to store relational data than the RC File, reducing the size of the stored data by up to 75% of the original. The ORC file format performs better than other Hive file formats when Hive is reading, writing, and processing data. Compared specifically to the RC File, ORC takes less time to access data and less space to store it. However, ORC adds CPU overhead by increasing the time it takes to decompress the relational data. Also, the ORC file format was introduced in Hive 0.11 and cannot be used with previous versions.
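
In Hive, choosing ORC is a matter of the table’s storage clause. A minimal sketch over JDBC to HiveServer2; the connection URL, credentials, and table definition are illustrative, and the driver class shown ships with Hive.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrcTableDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Illustrative HiveServer2 URL; adjust host, port, and database as needed.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // STORED AS ORC tells Hive to write the table in the ORC file format.
        stmt.execute("CREATE TABLE IF NOT EXISTS page_views_orc ("
                + "user_id BIGINT, url STRING, view_time TIMESTAMP) "
                + "STORED AS ORC");

        stmt.close();
        conn.close();
    }
}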

