
Thursday, February 18, 2021

Partition to two subsets

The partition problem is to determine whether a given set can be partitioned into two subsets such that the sum of the elements in both subsets is the same.

Examples:

Input: arr[] = {1, 5, 11, 5}
Output: true
The array can be partitioned as {1, 5, 5} and {11}.

Input: arr[] = {1, 5, 3}
Output: false
The array cannot be partitioned into two equal-sum subsets.

 

Following are the two main steps to solve this problem:

1) Calculate the sum of the array. If the sum is odd, there cannot be two subsets with equal sums, so return false.

2) If the sum is even, find a subset of the array whose elements add up to sum/2.

The first step is simple. The second step is the crux; it can be solved using either recursion or dynamic programming. The recursive solution is shown below, followed by a dynamic-programming sketch.


public static boolean subSetSum(int[] a, int n, int sum) {
    // Found a subset with the required sum
    if (sum == 0) {
        return true;
    }
    // No elements left to consider and the sum is still not reached
    if (n < 0) {
        return false;
    }
    // Current element is larger than the remaining sum, so it cannot be included
    if (a[n] > sum) {
        return subSetSum(a, n - 1, sum);
    }
    // Either include a[n] in the subset or exclude it
    return subSetSum(a, n - 1, sum - a[n]) || subSetSum(a, n - 1, sum);
}

public static boolean partitionTwoSubsets(int[] a) {
    int totalSum = 0;
    for (int i = 0; i < a.length; i++) {
        totalSum += a[i];
    }
    // An odd total can never be split into two equal halves
    if (totalSum % 2 != 0) {
        return false;
    }
    return subSetSum(a, a.length - 1, totalSum / 2);
}
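
For the dynamic-programming route mentioned above, here is a minimal bottom-up sketch; the method name partitionTwoSubsetsDP and the variable names are mine, not part of the recursive solution above.

public static boolean partitionTwoSubsetsDP(int[] a) {
    int totalSum = 0;
    for (int x : a) {
        totalSum += x;
    }
    if (totalSum % 2 != 0) {
        return false;
    }
    int target = totalSum / 2;
    // dp[j] is true if some subset of the elements seen so far sums to j
    boolean[] dp = new boolean[target + 1];
    dp[0] = true; // the empty subset sums to 0
    for (int x : a) {
        // Iterate downwards so each element is used at most once
        for (int j = target; j >= x; j--) {
            dp[j] = dp[j] || dp[j - x];
        }
    }
    return dp[target];
}

This runs in O(n * sum/2) time and space for the table, instead of the exponential worst case of the plain recursion.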


Tuesday, March 7, 2017

Basic Spark questions

What are the ways to create an RDD in Spark?

Ans: There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
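
A minimal Java sketch of both approaches (the app name, master, and file path are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// 1) Parallelize an existing collection in the driver program
JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// 2) Reference a dataset in external storage (here, a file on HDFS)
JavaRDD<String> fromFile = sc.textFile("hdfs:///data/input.txt");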

How many partitions are created in Spark?

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
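
For example, textFile accepts an optional minimum number of partitions, and getNumPartitions shows what was actually created (sc is the context from the sketch above; the path and count are illustrative):

// Ask for at least 10 partitions instead of the default of one per HDFS block
JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt", 10);
System.out.println("partitions: " + lines.getNumPartitions());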

What are RDD Operations?

RDDs support two types of operations: 

  • transformations, which create a new dataset from an existing one
  • actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
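
In Java, that map/reduce pair looks roughly like this (data.txt is illustrative, and sc is the context from the earlier sketch):

JavaRDD<String> lines = sc.textFile("data.txt");

// Transformation: lazily describes an RDD of line lengths, nothing runs yet
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());

// Action: triggers the computation and returns the result to the driver
int totalLength = lineLengths.reduce((x, y) -> x + y);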

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
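
For example, persisting lineLengths from the sketch above keeps it in memory across actions; MEMORY_ONLY is just one of the available storage levels, and unpersist drops it again:

import org.apache.spark.storage.StorageLevel;

lineLengths.persist(StorageLevel.MEMORY_ONLY()); // or simply lineLengths.cache()

// ... run several actions; after the first one the data is served from memory

lineLengths.unpersist(); // manually remove it when it is no longer needed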

What is an Accumulator?

Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster.
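
A short sketch using the built-in long accumulator (Spark 2.x API; the accumulator name and the blank-line condition are illustrative, and lines is the RDD from the earlier sketch):

import org.apache.spark.util.LongAccumulator;

// Safely count blank lines across all worker nodes
LongAccumulator blankLines = sc.sc().longAccumulator("blankLines");

lines.foreach(line -> {
    if (line.isEmpty()) {
        blankLines.add(1);
    }
});

System.out.println("blank lines: " + blankLines.value());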

How do you print a few elements of an RDD?

rdd.take(100).foreach(println)

Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

What is difference between Coalesce and Repartition?

coalesce(numPartitions): decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. You can ask coalesce for more partitions, but without a shuffle it will not increase them.

repartition(numPartitions): reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them. This always shuffles all data over the network and produces roughly equal-sized partitions.
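
A brief sketch of both calls (the partition counts are illustrative, and lines is the RDD from the earlier sketches):

JavaRDD<String> filtered = lines.filter(s -> !s.isEmpty());

// Shrink to 4 partitions after the filter; avoids a full shuffle where possible
JavaRDD<String> fewer = filtered.coalesce(4);

// Full shuffle that can either increase or decrease the partition count
JavaRDD<String> rebalanced = filtered.repartition(16);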

Wednesday, December 21, 2016

Hive basic commands

Q) How to create an external table in Hive with partitions, buckets, and the RCFile format

create external table defaultsearch_rc (
  bcookie string
, intl string
, src_spaceid string
, calday string
, prmpt int
, setyh int
, device_type string
, src string
, pn string
, hspart string
, hsimp string
, tsrc string
, vtestid string
, mtestid string
)
partitioned by (date_id int)
clustered by (src) into 4 buckets
row format delimited fields terminated by '\001'
stored as rcfile
location '/projects/CommerceFeed/default_search';

 alter table defaultsearch_rc add partition (date_id=20160801) location '/projects/CommerceFeed/default_search/20160801';

alter table defaultsearch_rc add partition (date_id=20160802) location '/projects/CommerceFeed/default_search/20160802';


Q) How to generate an ad-hoc report 

Let's write the Hive query in a file, 'defaultSearchReport.txt'.

defaultSearchReport.txt:
----------------
  use venkat_db;
  select intl, src_spaceid, calday, count(*) as total_events,  sum(prmpt) as prmpt, sum(setyh) as setyh  
  from defaultsearch_rc 
  where date_id=20160801 
  group by intl, src_spaceid, calday ;
----------------

hive -f defaultSearchReport.txt > output

cat output | awk -F'\t' '{ print $1 "," $2 "," $3 "," $4 "," $5 "," $6 }' > default_search_report_20160801.csv