Q) What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?
Simplicity, flexibility, and performance are the major advantages of using Spark over Hadoop MapReduce.
- Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps data in memory, stored in Resilient Distributed Datasets (RDDs).
- Spark is easier to program, as it ships with an interactive shell.
- It recovers lost data automatically using the lineage graph whenever something goes wrong.
| Hadoop MapReduce | Apache Spark |
| --- | --- |
| Does not leverage the memory of the Hadoop cluster to the maximum. | Keeps data in memory through the use of RDDs. |
| MapReduce is disk-oriented. | Spark caches data in memory, ensuring low latency. |
| Supports only batch processing. | Supports real-time processing through Spark Streaming. |
Note that RDDs are immutable – an RDD cannot be altered once created; every transformation produces a new RDD instead, as the sketch below shows.
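As a minimal sketch of in-memory storage and immutability, here is what you might type into the interactive spark-shell (the data is made up; `sc` is the shell's pre-created SparkContext):

```scala
// In the spark-shell, `sc` (a SparkContext) is already created for you.
val numbers = sc.parallelize(1 to 1000000)  // build an RDD from a local collection
numbers.cache()                             // hint Spark to keep this RDD in memory

// Immutability: map() does not modify `numbers`; it returns a new RDD.
val doubled = numbers.map(_ * 2)

println(doubled.sum())                      // action: materializes the result
```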
Q) Explain the major libraries that constitute the Spark Ecosystem.
- Spark Streaming – This library is used to process real-time streaming data.
- Spark SQL – Executes SQL-like queries on Spark data and integrates with standard visualization and BI tools (a short sketch follows this list).
- Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, and classification.
- Spark GraphX – Spark API for graph-parallel computations, with basic operators like joinVertices, subgraph, and aggregateMessages.
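To make the Spark SQL entry concrete, here is a minimal sketch using the Spark 2.x SparkSession API; the file path, view name, and column names are all made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-demo")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()

val people = spark.read.json("people.json")  // hypothetical input file
people.createOrReplaceTempView("people")

// Run a SQL-like query over the registered view.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```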
Q) Explain transformations and actions in the context of RDDs.
Transformations are lazy functions that produce a new RDD from an existing one; Spark records them but executes nothing until an action is invoked. Some examples of transformations are map, filter, and reduceByKey.
Actions trigger the actual computation over the recorded transformations and return the result to the driver program. Some examples of actions are reduce, collect, first, and take.
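A small sketch of the distinction, again assuming the spark-shell's pre-created `sc` (the words are made up):

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

// Transformations: each returns a new RDD; nothing has run yet.
val nonHadoop = words.filter(_ != "hadoop")
val counts    = nonHadoop.map(w => (w, 1)).reduceByKey(_ + _)

// Actions: these trigger execution and return results to the driver.
counts.collect().foreach(println)     // prints (spark,2) and (hive,1)
println(counts.first())               // one (word, count) pair
println(words.take(2).mkString(","))  // spark,hadoop
```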
Q) Explain the popular use cases of Apache Spark.
Apache Spark is mainly used for:
- Iterative machine learning
- Interactive data analytics and processing
- Stream processing
- Sensor data processing
Q) Explain the core components of a distributed Spark application.
- Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
- Executors – The worker processes that run the individual tasks of a Spark job.
- Cluster Manager – A pluggable component in Spark that launches executors (and, in some modes, the driver). The pluggability allows Spark to run on top of external managers like Apache Mesos or YARN, as the sketch below shows.
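That pluggability shows up in the master URL the driver passes to Spark. A sketch of the common forms, with made-up host names and default ports:

```scala
import org.apache.spark.SparkConf

// The cluster manager is selected via the master URL:
new SparkConf().setMaster("local[*]")           // no cluster manager: one local JVM
new SparkConf().setMaster("spark://host:7077")  // Spark's standalone manager
new SparkConf().setMaster("mesos://host:5050")  // Apache Mesos
new SparkConf().setMaster("yarn")               // Hadoop YARN (Spark 2.x syntax)
```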
Q) What is RDD Lineage?
Spark does not replicate data in memory, so if a partition is lost it is rebuilt using RDD lineage.
RDD lineage is the graph of transformations that is replayed to reconstruct lost data partitions: every RDD remembers how it was built from other datasets, so Spark can recompute any missing piece on demand.
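You can inspect a lineage directly with toDebugString; a sketch assuming the spark-shell's `sc` and a made-up input file:

```scala
// toDebugString prints the chain of parent RDDs Spark would replay
// to rebuild lost partitions.
val base     = sc.textFile("input.txt")  // hypothetical file
val filtered = base.filter(_.nonEmpty)
val lengths  = filtered.map(_.length)

println(lengths.toDebugString)
// Typically shows MapPartitionsRDDs chained back to the HadoopRDD for input.txt.
```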
Q) What is Spark Driver?
The Spark Driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs. In simple terms, the driver creates the SparkContext, which connects to a given Spark master.
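A minimal sketch of a standalone driver program; the app name and master URL are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext, connecting to a given master.
    val conf = new SparkConf()
      .setAppName("driver-demo")
      .setMaster("spark://master-host:7077")  // hypothetical standalone master

    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 100).sum()  // transformations/actions go here
    println(total)
    sc.stop()
  }
}
```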
Q) Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use the Spark Cassandra Connector.
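A sketch using the DataStax Spark Cassandra Connector (the connector jar must be on the classpath; the connection host, keyspace, and table name are made up):

```scala
import com.datastax.spark.connector._  // adds cassandraTable() to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-demo")
  .setMaster("local[*]")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // made-up host

val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows and count them.
val rows = sc.cassandraTable("test_keyspace", "kv")
println(rows.count())
```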