What are the ways to create an RDD in Spark?
Ans: There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
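As a minimal sketch of the two approaches, the example below builds one RDD with parallelize and another with textFile. The object name CreateRddSketch, the local[2] master, and the temporary local file (standing in for HDFS or another Hadoop-compatible source) are assumptions made for illustration only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  // Returns (elements in the parallelized RDD, lines in the file-backed RDD).
  def run(): (Long, Long) = {
    // Assumption: a local master is used so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("create-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Way 1: parallelize a collection that already lives in the driver program.
      val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // Way 2: reference an external dataset. A temporary local file stands in
      // for HDFS, HBase, or any other source with a Hadoop InputFormat.
      val path = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(path, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8))
      val lines = sc.textFile(path.toUri.toString)

      (numbers.count(), lines.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (fromCollection, fromFile) = run()
    println(s"parallelize: $fromCollection elements, textFile: $fromFile lines")
  }
}
```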
How many partitions are created in Spark?
By default, when Spark reads a file (for example with textFile), it creates one partition for each block of the file (blocks being 128MB by default in HDFS). You can ask for a higher number of partitions by passing a larger value as the optional second argument, but you cannot have fewer partitions than blocks.
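The sketch below shows one way to request and inspect a partition count. It assumes a local master and a small temporary file; the object name PartitionCountSketch is hypothetical. The second argument of textFile is treated as a minimum, so the actual count for the file may come out higher. The parallelize call with an explicit slice count is included only for comparison and is not part of the answer above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Returns (partitions of the file-backed RDD, partitions of the parallelized RDD).
  def run(): (Int, Int) = {
    val conf = new SparkConf().setAppName("partition-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Write a small sample file; a real job would point at HDFS instead.
      val path = Files.createTempFile("partitions-sketch", ".txt")
      val content = (1 to 1000).map(i => s"line $i").mkString("\n")
      Files.write(path, content.getBytes(StandardCharsets.UTF_8))

      // Ask for at least 4 partitions when reading the file.
      val fromFile = sc.textFile(path.toUri.toString, 4)

      // For comparison: an in-memory collection split into exactly 8 slices.
      val fromCollection = sc.parallelize(1 to 100, 8)

      (fromFile.getNumPartitions, fromCollection.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (filePartitions, collectionPartitions) = run()
    println(s"textFile partitions: $filePartitions")
    println(s"parallelize partitions: $collectionPartitions")
  }
}
```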
What are RDD Operations?
RDDs support two types of operations:
- transformations, which create a new dataset from an existing one
- actions, which return a value to the driver program after running a computation on the dataset.
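A minimal sketch of the two kinds of operations listed above, assuming a local master and a tiny in-memory dataset (the object name OperationsSketch is hypothetical). The map call only records what should be done; the work happens when the reduce action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperationsSketch {
  // Returns the total length of all words, computed with map followed by reduce.
  def run(): Int = {
    val conf = new SparkConf().setAppName("operations-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "rdd", "lazy"))

      // Transformation: describes a new RDD, but nothing is computed yet.
      val lengths = words.map(_.length)

      // Action: triggers the computation and returns a value to the driver.
      lengths.reduce(_ + _)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"total length: ${run()}") // 5 + 3 + 4 = 12
  }
}
```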
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
What is an Accumulator?
Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster.
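As a rough illustration, the sketch below counts blank lines with a long accumulator created through SparkContext.longAccumulator (available in Spark 2.0 and later). The sample data, the accumulator name, and the object name AccumulatorSketch are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Returns the number of blank lines seen by the tasks.
  def run(): Long = {
    val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("alpha", "", "beta", "   ", "gamma"))

      // Tasks on the workers can only add to the accumulator;
      // the driver reads the merged value afterwards.
      val blankLines = sc.longAccumulator("blank lines")

      // foreach is an action, so the updates are applied when it runs.
      lines.foreach { line =>
        if (line.trim.isEmpty) blankLines.add(1L)
      }

      blankLines.sum
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"blank lines: ${run()}") // two of the five sample lines are blank
  }
}
```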
How do you print a few elements of an RDD?
rdd.take(100).foreach(println)
Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
What is the difference between coalesce and repartition?
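The post gives no answer to this question, so the following is only a sketch based on the RDD API: repartition(n) always shuffles the data and can increase or decrease the partition count, while coalesce(n) avoids a shuffle by default and is typically used to reduce the count. The object name CoalesceVsRepartitionSketch and the partition numbers are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartitionSketch {
  // Returns (partitions after coalesce(2), partitions after repartition(16)).
  def run(): (Int, Int) = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Start with 8 partitions.
      val rdd = sc.parallelize(1 to 100, 8)

      // coalesce merges existing partitions without a shuffle by default.
      val fewer = rdd.coalesce(2)

      // repartition redistributes the data with a full shuffle.
      val more = rdd.repartition(16)

      (fewer.getNumPartitions, more.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (afterCoalesce, afterRepartition) = run()
    println(s"coalesce(2): $afterCoalesce partitions")       // 2
    println(s"repartition(16): $afterRepartition partitions") // 16
  }
}
```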