Import data into Hive
1. Basic import
Here is some data in the employees database that we will use:
mysql> select * from departments;
+---------+--------------------+
| dept_no | dept_name |
+---------+--------------------+
| d009 | Customer Service |
| d005 | Development |
| d002 | Finance |
| d003 | Human Resources |
| d001 | Marketing |
| d004 | Production |
| d006 | Quality Management |
| d008 | Research |
| d007 | Sales |
+---------+--------------------+
9 rows in set (0.00 sec)
Sqoop import command:
sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer/employees \
--username myUID \
--password myPWD \
--table departments \
--direct \
-m 1 \
--hive-import \
--create-hive-table \
--hive-table departments_mysql \
--target-dir /user/hive/warehouse/ \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\
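As an aside, the connection string and credentials can be kept out of the command line (and shell history) with a Sqoop options file, passed via --options-file. A hypothetical sketch, with one option or value per line (Sqoop options files allow blank lines and # comments); the file name is assumed:

```
# import-opts.txt -- pass to sqoop with: sqoop --options-file import-opts.txt --table departments ...
import
--connect
jdbc:mysql://vchennar-mySqlServer/employees
--username
myUID
--password
myPWD
```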
File created in HDFS:
$ hadoop fs -ls -R /user/hive/warehouse | grep /part* | awk '{print $8}'
/user/hive/warehouse/departments_mysql/part-m-00000
Validate the number of records:
$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | wc -l
9
Check the imported data in HDFS:
$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | more
"d009","Customer Service"
"d005","Development"
"d002","Finance"
"d003","Human Resources"
"d001","Marketing"
"d004","Production"
"d006","Quality Management"
"d008","Research"
"d007","Sales"
Validate results in Hive:
$ hive
hive> show tables;
departments_mysql
hive> select * from departments_mysql;
"d009" "Customer Service"
"d005" "Development"
"d002" "Finance"
"d003" "Human Resources"
"d001" "Marketing"
"d004" "Production"
"d006" "Quality Management"
"d008" "Research"
"d007" "Sales"
2. Importing into Hive with partitions
To try this functionality out, I decided on gender as the partition criterion.
mysql> select gender, count(*) from employees group by gender;
+--------+----------+
| gender | count(*) |
+--------+----------+
| M | 179973 |
| F | 120051 |
+--------+----------+
Import gender="M"
$ sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPwd \
--query 'select EMP_NO,birth_date,first_name,last_name,hire_date from employees where gender="M" AND $CONDITIONS' \
--direct \
-m 6 \
--split-by EMP_NO \
--hive-import \
--create-hive-table \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts \
--hive-partition-key gender \
--hive-partition-value 'M' \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\
Note 1: The gender column should not be included in the query; the two --hive-partition arguments (--hive-partition-key and --hive-partition-value) are required and supply it instead. Also note the where clause that filters on just gender="M".
Note 2: If the column emp_no is listed in lower case in the query, only NULL is retrieved for that column. Switching the case of just this one field to EMP_NO makes it work.
See if files were created:
$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005
Do a line count:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/* | wc -l
179973
Open a file to see if it is formatted right:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/part-m-00005 | more
"418333","1954-11-10","Jackson","Simmen","1993-11-14"
"418334","1954-04-12","Jingling","Orlowski","1985-06-19"
"418335","1963-09-09","Kwok","Dalton","1986-07-28"
"418337","1961-08-31","Eberhardt","Ramras","1988-02-25"
Note: gender is not in the data file but in a directory name/partition name.
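This works because Hive encodes the partition column in the directory name (gender=M), so the value can always be recovered from the path itself. A small local sketch of that convention, with a hypothetical path:

```shell
# Extract the partition value from a Hive-style gender=<value> path component.
path='/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005'
echo "$path" | sed -n 's#.*/gender=\([^/]*\)/.*#\1#p'
# -> M
```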
Check if the table got created:
hive> show tables;
employees_import_parts
Display column headers:
hive> set hive.cli.print.header=true;
Validate record count:
hive> select gender, count(*) from employees_import_parts group by gender;
The count is accurate.
Review one record for accuracy:
hive> select * from employees_import_parts limit 1;
emp_no birth_date first_name last_name hire_date gender
"1234" "1953-09-02" "Georgi" "Facello" "1986-06-26" M
Validate that the table is partitioned:
hive> show partitions employees_import_parts;
partition
gender=F
gender=M
Import gender="F"
$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="F" AND $CONDITIONS' \
--direct \
-m 6 \
--split-by emp_no \
--hive-import \
--hive-overwrite \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts_F \
--hive-partition-key gender \
--hive-partition-value 'F' \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\
Files generated:
$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00005
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005
Record count for gender=F:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part* | wc -l
120051
The count is accurate.
Record count for employees in total:
Expected: 300024
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/*/part* | wc -l
300024
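The expected total is just the sum of the two per-gender counts from the earlier group-by, which a trivial arithmetic check confirms:

```shell
# 179973 male rows + 120051 female rows should cover the whole employees table.
echo $((179973 + 120051))
# -> 300024
```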
Validate a bunch of records for accuracy of format:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part-m-00005 | more
"418330","1953-06-13","Pranas","McFarlan","1989-12-23"
"418331","1954-04-07","Chaosheng","Greenaway","1996-05-21"
"418332","1961-04-19","Koichi","Cannard","1986-01-21"
"418336","1954-02-14","Georgy","Thimonier","1994-03-21"
Validate count in Hive:
hive> select gender, count(*) from employees_import_parts group by gender;
gender _c1
F 120051
M 179973
The counts are accurate.
Validate records in Hive:
hive> select * from employees_import_parts where gender='F' limit 2;
emp_no birth_date first_name last_name hire_date gender
NULL "1964-06-02" "Bezalel" "Simmel" "1985-11-21" F
NULL "1953-04-20" "Anneke" "Preusig" "1989-06-02" F