Thursday, December 29, 2016

Single sign-on authentication for Cloud using Kerberos

Kerberos is a network authentication protocol that also provides single sign-on (SSO) to clients. It was one of the first single sign-on solutions proposed in the literature and implemented as a network service.





For some more details, refer to this presentation: http://www.slideshare.net/erdeepakbagga?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview

Wednesday, December 21, 2016

Sqoop performance tuning

1. Always import/export only the required data; use a WHERE clause (--where, or a filtered --query) wherever possible.

2. Use compression (--compress) to reduce the data size.

3. Use incremental imports 
     --incremental append --check-column <column name> --last-value <value>
     OR
    --incremental lastmodified --check-column <column name> --last-value <value> 

4. Use --split-by on an evenly distributed column so that map tasks process roughly equal numbers of records.

5. Tune the number of concurrent map tasks with -m / --num-mappers <n>.

6. Use direct mode (--direct) to speed up data transfer where the database connector supports it.

7. Use batch mode to export the data
With sqoop export, the --batch argument enables batch execution of the underlying JDBC statements, which improves performance (see the sketch at the end of this list).

8. Custom Boundary Queries

sqoop import \
--connect <JDBC URL> \
--username <USER_NAME> \
--password <PASSWORD> \
--query <QUERY> \
--split-by <ID> \
--target-dir <TARGET_DIR_URI> \
--boundary-query "select min(<ID>), max(<ID>) from <TABLE>"
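
Putting several of these tips together, here is a hedged sketch of a tuned import and a batched export; the host, database, table names and directories are placeholders:

# Tuned import: filtered rows, explicit split column, 8 mappers, direct mode, compression
sqoop import \
--connect jdbc:mysql://<HOST>/<DB> \
--username <USER_NAME> --password <PASSWORD> \
--table <TABLE> \
--where "<FILTER_CONDITION>" \
--split-by <ID> \
--num-mappers 8 \
--direct \
--compress \
--target-dir <TARGET_DIR_URI>

# Batched export (tip 7): underlying JDBC statements are executed in batches
sqoop export \
--connect jdbc:mysql://<HOST>/<DB> \
--username <USER_NAME> --password <PASSWORD> \
--table <TABLE> \
--export-dir <EXPORT_DIR_URI> \
--batch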

References: 
https://community.hortonworks.com/articles/70258/sqoop-performance-tuning.html
https://dzone.com/articles/apache-sqoop-performance-tuning


Sqoop incremental mode to get updated data

Delta data imports


In real-world scenarios, we often need to synchronize delta data (modified or newly added rows) from an RDBMS to HDFS. Sqoop provides an incremental import mode to facilitate this.

Table 4. Incremental import arguments:

Argument                  Description
--check-column (col)      Specifies the column to be examined when determining which rows to import.
--incremental (mode)      Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)      Specifies the maximum value of the check column from the previous import.

Importing incremental data with append mode option
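
A minimal sketch of an append-mode import, assuming placeholder connection details and a monotonically increasing key column named id (hypothetical):

sqoop import \
--connect jdbc:mysql://<HOST>/<DB> \
--username <USER_NAME> --password <PASSWORD> \
--table <TABLE> \
--target-dir <TARGET_DIR_URI> \
--incremental append \
--check-column id \
--last-value <LAST_IMPORTED_ID>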


OR

Importing incremental data with Last-modified mode option
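A minimal sketch of a lastmodified-mode import, assuming placeholder connection details and a timestamp column named last_update_ts (hypothetical) that is updated whenever a row changes:

sqoop import \
--connect jdbc:mysql://<HOST>/<DB> \
--username <USER_NAME> --password <PASSWORD> \
--table <TABLE> \
--target-dir <TARGET_DIR_URI> \
--incremental lastmodified \
--check-column last_update_ts \
--last-value "<LAST_IMPORTED_TIMESTAMP>"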


Workaround for delta data import
Sqoop writes imported data to a target directory in HDFS named after the RDBMS table. A lastmodified-mode import pulls the delta data and tries to write to that same path, which already exists in HDFS, so the job fails because HDFS does not allow overwriting an existing directory.
Here is a workaround to get the complete, updated data in HDFS:
1. Move the existing HDFS data to a temporary folder.
2. Run a fresh lastmodified-mode import.
3. Merge the fresh import with the old data saved in the temporary folder (sketched below).
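
A hedged sketch of these three steps, assuming placeholder paths, the hypothetical check column last_update_ts, and a record-class jar generated earlier with sqoop codegen:

# 1. Move the existing HDFS data to a temporary folder
hadoop fs -mv /user/hive/warehouse/<TABLE> /tmp/<TABLE>_old

# 2. Run a fresh lastmodified-mode import into the original location
sqoop import \
--connect jdbc:mysql://<HOST>/<DB> \
--username <USER_NAME> --password <PASSWORD> \
--table <TABLE> \
--target-dir /user/hive/warehouse/<TABLE> \
--incremental lastmodified \
--check-column last_update_ts \
--last-value "<PREVIOUS_LAST_VALUE>"

# 3. Merge the fresh import onto the old data (keeps the newest row per key)
sqoop merge \
--new-data /user/hive/warehouse/<TABLE> \
--onto /tmp/<TABLE>_old \
--target-dir /user/hive/warehouse/<TABLE>_merged \
--jar-file <CODEGEN_JAR> \
--class-name <TABLE_RECORD_CLASS> \
--merge-key <PRIMARY_KEY_COLUMN>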
Reference: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports


Hive basic commands

Q) How to create an external table in Hive with partitions, buckets and the RCFile format

create external table defaultsearch_rc (
    bcookie      string,
    intl         string,
    src_spaceid  string,
    calday       string,
    prmpt        int,
    setyh        int,
    device_type  string,
    src          string,
    pn           string,
    hspart       string,
    hsimp        string,
    tsrc         string,
    vtestid      string,
    mtestid      string
)
partitioned by (date_id int)
clustered by (src) into 4 buckets
row format delimited fields terminated by '\001'
stored as rcfile
location '/projects/CommerceFeed/default_search';

 alter table defaultsearch_rc add partition (date_id=20160801) location '/projects/CommerceFeed/default_search/20160801';

alter table defaultsearch_rc add partition (date_id=20160802) location '/projects/CommerceFeed/default_search/20160802';
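
To verify that the partitions were registered and the data is readable, a quick sanity check from the Hive CLI:

hive> show partitions defaultsearch_rc;

hive> select count(*) from defaultsearch_rc where date_id=20160801;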


Q) How to generate an ad-hoc report 

Let's write the Hive query in a file named 'defaultSearchReport.txt'.

defaultSearchReport.txt:
----------------
  use venkat_db;
  select intl, src_spaceid, calday, count(*) as total_events,  sum(prmpt) as prmpt, sum(setyh) as setyh  
  from defaultsearch_rc 
  where date_id=20160801 
  group by intl, src_spaceid, calday ;
----------------

hive -f defaultSearchReport.txt > output

cat output | awk -F'\t' '{ print $1 "," $2 "," $3 "," $4 "," $5 "," $6 }' > default_search_report_20160801.csv


Tuesday, December 20, 2016

Hadoop Performance Tuning

Various possible ways to improve the performance of Hadoop MapReduce jobs:

1. Compress input and output data: compressing files saves storage space on HDFS and also improves transfer speed.

Common codecs include bzip2 (splittable) and gzip, deflate, Snappy, LZ4 and LZO (not splittable as standalone files, though Snappy and LZ4 are widely used inside splittable container formats such as SequenceFile, Avro and ORC, and LZO becomes splittable once indexed).

2. Adjust the sort buffer (mapreduce.task.io.sort.mb) and spill threshold (mapreduce.map.sort.spill.percent) to minimize disk spills.

3. Implement a combiner when the aggregate operation is associative and commutative.

4.  Consider reducing replication factor

5. Adjust the number of map tasks, reduce tasks, and their memory settings.

6. Use skewed joins

Sometimes the data being processed is skewed: for example, 80% of the records share a single key, so one reducer is held up processing the majority of the data. A skewed join handles this by first computing a histogram to find the dominant keys and then splitting the data for those keys across several reducers to achieve better performance.

7. Speculative Execution

The performance of MapReduce jobs is seriously impacted when a few tasks take a long time to finish. Speculative execution addresses this by launching backup copies of slow tasks on other machines. Setting the configuration parameters mapreduce.map.speculative and mapreduce.reduce.speculative to true enables speculative execution, so that overall job execution time is reduced when a task progresses slowly (for example, on an overloaded or failing node).

8. Use the distributed cache to ship small, read-only files (for example, lookup data for map-side joins) to every node.
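
As a hedged sketch, several of the settings above (output compression, sort buffer and spill threshold, speculative execution, replication factor) can be supplied per job with generic -D options, assuming the job's main class uses ToolRunner so the options are honored; the jar name, class and paths are placeholders:

hadoop jar my-mr-job.jar com.example.MyJob \
-D mapreduce.map.output.compress=true \
-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.task.io.sort.mb=256 \
-D mapreduce.map.sort.spill.percent=0.90 \
-D mapreduce.map.speculative=true \
-D mapreduce.reduce.speculative=true \
-D dfs.replication=2 \
/input/path /output/path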


Thursday, December 15, 2016

Spark Interview Questions

Q) What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?
Simplicity, flexibility and performance are the major advantages of using Spark over Hadoop MapReduce.
  • Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps data in memory in Resilient Distributed Datasets (RDDs).
  • Spark is easier to program as it comes with an interactive mode.
  • It provides recovery using the lineage graph whenever something goes wrong.
A quick comparison:
  • Memory: Hadoop MapReduce does not leverage the memory of the Hadoop cluster to the maximum; Spark keeps data in memory using RDDs.
  • I/O: MapReduce is disk oriented; Spark caches data in-memory and ensures low latency.
  • Processing model: MapReduce supports only batch processing; Spark also supports near real-time processing through Spark Streaming.
Q) What is RDD?

An RDD (Resilient Distributed Dataset) is Spark's core abstraction: a fault-tolerant collection of records, partitioned across the nodes of the cluster, that can be operated on in parallel.
  • Immutable – RDDs cannot be altered; every transformation produces a new RDD.
  • Lazy – transformations are not evaluated until an action is called.
  • Resilient – lost partitions can be recomputed from the lineage graph.
Q) Explain about the major libraries that constitute the Spark Ecosystem
  • Spark Streaming – This library is used to process real time streaming data.
  • Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
  • Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
  • Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
Q) Explain about transformations and actions in the context of RDDs.
Transformations are lazily evaluated functions that produce a new RDD from an existing one; they are only executed when an action is called. Some examples of transformations include map, filter and reduceByKey.
Actions trigger the computation described by an RDD's transformations and return a result to the driver program (or write it out to storage). Some examples of actions include reduce, collect, first, and take.

Q) Explain about the popular use cases of Apache Spark
Apache Spark is mainly used for
  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing
  • Sensor data processing
Q) Explain about the core components of a distributed Spark application.
  • Driver- The process that runs the main () method of the program to create RDDs and perform transformations and actions on them.
  • Executor –The worker processes that run the individual tasks of a Spark job.
  • Cluster Manager-A pluggable component in Spark, to launch Executors and Drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.
Q) What is RDD Lineage?

Spark does not replicate data in memory, so if any data is lost it is rebuilt using RDD lineage.

RDD lineage is the record of how an RDD was derived from other datasets; Spark uses it to recompute lost partitions on demand.


Q) What is Spark Driver?

The Spark driver is the program that runs the application's main method, declares transformations and actions on RDDs, and creates the SparkContext, which connects to a given Spark master.

Q) Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.

Friday, December 2, 2016

Apache Zeppelin

What is Apache Zeppelin?

A web-based notebook that enables interactive data analytics.

The Notebook is the place for all your needs

 Data Ingestion
 Data Discovery
 Data Analytics
 Data Visualization & Collaboration


Wednesday, November 23, 2016

Apache Pig Interview Questions

1. Can you name complex data types in Pig Latin?
Ans:

Map: Set of key-value pairs; keys must be character arrays, but values may be any type. Ex: ['a'#'pomegranate']
Tuple: Sequence of fields of any type. Ex: (1,'pomegranate')
Bag: Unordered collection of tuples, possibly with duplicates. Ex: {(1,'pomegranate'),(2)}



Wednesday, November 9, 2016

Sqoop best practices

Best practices:

1.  With imports, use the output line formatting options wherever possible for accuracy of data transfer: --enclosed-by, --fields-terminated-by, --escaped-by.

2.  Use your judgement when choosing the number of mappers, to ensure you parallelize the import appropriately without increasing overall completion time (by default, 4 tasks run in parallel).

3.  Use direct connectors where available for better performance.

4.  With imports, use a boundary query for better performance.

5.  When importing into Hive and using dynamic partitions, think through the partition criteria and the number of files generated; you don't want too many small files on your cluster. Also, there is a limit on the number of partitions on each node.

6.  Be cognizant of the number of concurrent connections the database allows; use the fetch size to control the number of records read from the database per round trip, and factor in the number of parallel tasks.

7.  Do not use the same table for import and export.

8.  Use an options file for reusability (see the sketch after this list).

9. Be aware of the case-sensitivity nuances of Sqoop; it can save you time you would otherwise spend troubleshooting issues.

10.  For exports, use a staging table (--staging-table) where possible; during the development phase it also helps with troubleshooting.

11.  Use the verbose argument (--verbose) for more information during trouble-shooting.
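
A minimal sketch of an options file (best practice 8); the file name and connection details are placeholders. Each option and each value goes on its own line, and lines starting with # are comments:

# import-options.txt
import
--connect
jdbc:mysql://<HOST>/<DB>
--username
<USER_NAME>
--password
<PASSWORD>

The file can then be reused across jobs:

sqoop --options-file import-options.txt --table <TABLE> --target-dir <TARGET_DIR_URI>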


Troubleshooting tips:

https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1774381


Import Data from RDBMS to Hive using Sqoop

Import data into Hive


1. Basic import


Here is some data in the employees database that we will use:

mysql> select * from departments;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d009    | Customer Service   |
| d005    | Development        |
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d004    | Production         |
| d006    | Quality Management |
| d008    | Research           |
| d007    | Sales              |
+---------+--------------------+
9 rows in set (0.00 sec)

Sqoop import command:
sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer/employees \
--username myUID \
--password myPWD \
--table departments \
--direct \
-m 1 \
--hive-import \
--create-hive-table \
--hive-table departments_mysql \
--target-dir /user/hive/warehouse/ \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\


File created in HDFS:
$ hadoop fs -ls -R /user/hive/warehouse | grep /part* | awk '{print $8}'

/user/hive/warehouse/departments_mysql/part-m-00000 

Validate the number of records:

$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | wc -l

9

Check the imported data in HDFS:
$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | more

"d009","Customer Service"
"d005","Development"
"d002","Finance"
"d003","Human Resources"
"d001","Marketing"
"d004","Production"
"d006","Quality Management"
"d008","Research"
"d007","Sales"

Validate results in Hive:

$ hive

hive> show tables;

departments_mysql


hive> select * from departments_mysql;


"d009"           "Customer Service"
"d005"           "Development"
"d002"           "Finance"
"d003"           "Human Resources"
"d001"           "Marketing"
"d004"           "Production"
"d006"           "Quality Management"
"d008"           "Research"
"d007"           "Sales"

2. Importing into Hive with partitions


To try this functionality out, I decided on gender as my partition criterion.

mysql> select gender, count(*) from employees group by gender;  
+--------+----------+
| gender | count(*) |
+--------+----------+
| M      |   179973 |
| F      |   120051 |
+--------+----------+

Import gender="M"

$ sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPwd \
--query 'select EMP_NO,birth_date,first_name,last_name,hire_date from employees where gender="M" AND $CONDITIONS'  \
--direct \
-m 6 \
--split-by EMP_NO \
--hive-import \
--create-hive-table \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts \
--hive-partition-key gender \
--hive-partition-value 'M' \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\


Note 1: The gender column should not be included in the query.
The two --hive-partition-* arguments are required.
Also, note that I have added a where clause to filter on just gender="M".

Note 2: If the column emp_no is listed in lower case in the query, only NULL is retrieved. If we switch the case of just this one field to EMP_NO, it works fine.

See if files were created:

$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'

/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005


Do a line count:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/* | wc -l

179973

Open a file to see if it is formatted right:

$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/part-m-00005 | more

"418333","1954-11-10","Jackson","Simmen","1993-11-14"
"418334","1954-04-12","Jingling","Orlowski","1985-06-19"
"418335","1963-09-09","Kwok","Dalton","1986-07-28"
"418337","1961-08-31","Eberhardt","Ramras","1988-02-25"

Note: gender is not in the data file but in a directory name/partition name.

Check if table got created
hive> show tables;

employees_import_parts

Display column headers:
hive> set hive.cli.print.header=true;

Validate record count:
hive> select gender, count(*) from employees_import_parts group by gender;   

gender _c1
M 179973

The count is accurate.

Review one record for accuracy:
hive> select * from employees_import_parts limit 1;


emp_no birth_date first_name last_name hire_date gender
"1234" "1953-09-02" "Georgi" "Facello" "1986-06-26" M

Validate if table is partitioned

hive> show partitions employees_import_parts;

partition
gender=F
gender=M

Import gender="F"

$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="F" AND $CONDITIONS'  \
--direct \
-m 6 \
--split-by emp_no \
--hive-import \
--hive-overwrite \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts_F \
--hive-partition-key gender \
--hive-partition-value 'F' \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\

Files generated:

$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'

/user/hive/warehouse/employees_import_parts/gender=F/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00005
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005


Record count for gender=F:

$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part* | wc -l

120051

The count is accurate.


Record count for employees in total:

Expected: 300024

$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/*/part* | wc -l

300024


Validate a bunch of records for accuracy of format:


$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part-m-00005 | more

"418330","1953-06-13","Pranas","McFarlan","1989-12-23"
"418331","1954-04-07","Chaosheng","Greenaway","1996-05-21"
"418332","1961-04-19","Koichi","Cannard","1986-01-21"
"418336","1954-02-14","Georgy","Thimonier","1994-03-21"


Validate count in Hive:

hive> select gender, count(*) from employees_import_parts group by gender;   

gender _c1
F 120051
M 179973

The counts are accurate.

Validate records in Hive:

hive> select * from employees_import_parts where gender='F' limit 2;

emp_no birth_date first_name last_name hire_date gender
NULL "1964-06-02" "Bezalel" "Simmel" "1985-11-21" F
NULL "1953-04-20" "Anneke" "Preusig" "1989-06-02" F

Note: emp_no shows up as NULL here because the gender="F" import used lower-case emp_no in the query (see Note 2 above).