Import data into Hive
1. Basic import
Here is some data in the employees database that we will use:
mysql> select * from departments;
+---------+--------------------+
| dept_no | dept_name |
+---------+--------------------+
| d009 | Customer Service |
| d005 | Development |
| d002 | Finance |
| d003 | Human Resources |
| d001 | Marketing |
| d004 | Production |
| d006 | Quality Management |
| d008 | Research |
| d007 | Sales |
+---------+--------------------+
9 rows in set (0.00 sec)
The sqoop import command:
sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer/employees \
--username myUID \
--password myPWD \
--table departments \
--direct \
-m 1 \
--hive-import \
--create-hive-table \
--hive-table departments_mysql \
--target-dir /user/hive/warehouse/ \
--enclosed-by '\"' \
--fields-terminated-by ',' \
--escaped-by \\
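A side note on credentials: passing --password on the command line leaves the password visible in the shell history and process list. Sqoop also accepts a --password-file option. A minimal sketch, with hypothetical local and HDFS paths:

```shell
# Write the password (no trailing newline) to a local file and lock it down.
# The file paths here are hypothetical placeholders.
printf '%s' 'myPWD' > /tmp/sqoop.pwd
chmod 400 /tmp/sqoop.pwd

# On a cluster, stage the file in HDFS and reference it instead of --password:
#   hadoop fs -put /tmp/sqoop.pwd /user/myUID/.sqoop.pwd
#   sqoop import --connect jdbc:mysql://vchennar-mySqlServer/employees \
#     --username myUID --password-file /user/myUID/.sqoop.pwd ...
ls -l /tmp/sqoop.pwd
```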
File created in HDFS:
$ hadoop fs -ls -R /user/hive/warehouse | grep /part* | awk '{print $8}'
/user/hive/warehouse/departments_mysql/part-m-00000
Validate the number of records:
$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | wc -l
9
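The same count can also be cross-checked against MySQL itself with sqoop eval, which runs an ad-hoc query over the same JDBC connection. A sketch, guarded so it is a no-op where sqoop is not on the PATH; connect string and credentials are the placeholders used above:

```shell
# Cross-check the HDFS line count against the source table.
if command -v sqoop >/dev/null 2>&1; then
  sqoop eval \
    --connect jdbc:mysql://vchennar-mySqlServer/employees \
    --username myUID --password myPWD \
    --query 'select count(*) from departments'
  status=checked
else
  status='skipped (sqoop not on PATH)'
fi
echo "cross-check: $status"
```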
Check the imported data in HDFS:
$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | more
"d009","Customer Service"
"d005","Development"
"d002","Finance"
"d003","Human Resources"
"d001","Marketing"
"d004","Production"
"d006","Quality Management"
"d008","Research"
"d007","Sales"
Validate results in Hive:
$ hive
hive> show tables;
departments_mysql
hive> select * from departments_mysql;
"d009" "Customer Service"
"d005" "Development"
"d002" "Finance"
"d003" "Human Resources"
"d001" "Marketing"
"d004" "Production"
"d006" "Quality Management"
"d008" "Research"
"d007" "Sales"
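Beyond spot-checking row contents, it is worth confirming that Sqoop created the table with the expected layout; describe formatted shows the columns, the warehouse location, and the storage properties in one place:

```sql
-- Inspect the table Sqoop created: column types, HDFS location,
-- and SerDe properties such as the field delimiter.
describe formatted departments_mysql;
```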
2. Importing into Hive with partitions
mysql> select gender, count(*) from employees group by gender;
+--------+----------+
| gender | count(*) |
+--------+----------+
| M | 179973 |
| F | 120051 |
+--------+----------+
Import gender="M"
$ sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPwd \
--query 'select EMP_NO,birth_date,first_name,last_name,hire_date from employees where gender="M" AND $CONDITIONS' \
--direct \
-m 6 \
--split-by EMP_NO \
--hive-import \
--create-hive-table \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts \
--hive-partition-key gender \
--hive-partition-value 'M' \
--enclosed-by '\"' \
--fields-terminated-by ',' \
--escaped-by \\
Note 1: The gender column should not be included in the query; the partition column is supplied instead through the two required arguments --hive-partition-key and --hive-partition-value.
Also note that I have added a where clause to filter on just gender="M".
Note 2: If the column emp_no is listed in lower case in the query, only NULL is retrieved. Switching the case of just this one field to EMP_NO makes it work fine.
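A quick way to catch the case-sensitivity problem from Note 2 after an import is to count rows where the key column came through as NULL; zero rows is the expected result for a clean import:

```sql
-- If emp_no was listed in lower case in the sqoop query, every row in the
-- affected partition will have a NULL emp_no.
select gender, count(*)
from employees_import_parts
where emp_no is null
group by gender;
```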
See if files were created:
$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005
Do a line count:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/* | wc -l
179973
Open a file to see if it is formatted right:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/part-m-00005 | more
"418333","1954-11-10","Jackson","Simmen","1993-11-14"
"418334","1954-04-12","Jingling","Orlowski","1985-06-19"
"418335","1963-09-09","Kwok","Dalton","1986-07-28"
"418337","1961-08-31","Eberhardt","Ramras","1988-02-25"
Note: gender is not in the data file but in a directory name/partition name.
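This is standard Hive behavior: the partition column lives in the directory name (gender=M) and is reconstituted at query time. If part files were ever placed under such a directory by hand, rather than through sqoop, the partition would also have to be registered in the metastore, for example:

```sql
-- Hypothetical example: register a manually created partition directory so
-- Hive can see it. Not needed for the sqoop import above, which registers
-- the partition automatically.
alter table employees_import_parts add partition (gender='M')
location '/user/hive/warehouse/employees_import_parts/gender=M';
```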
Check that the table got created:
hive> show tables;
employees_import_parts
Display column headers:
hive> set hive.cli.print.header=true;
Validate record count:
hive> select gender, count(*) from employees_import_parts group by gender;
gender _c1
M 179973
The count is accurate.
Review one record for accuracy:
hive> select * from employees_import_parts limit 1;
emp_no birth_date first_name last_name hire_date gender
"1234" "1953-09-02" "Georgi" "Facello" "1986-06-26" M
Validate that the table is partitioned:
hive> show partitions employees_import_parts;
partition
gender=F
gender=M
Import gender="F"
$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="F" AND $CONDITIONS' \
--direct \
-m 6 \
--split-by emp_no \
--hive-import \
--hive-overwrite \
--hive-table employees_import_parts \
--target-dir /user/hive/warehouse/employee-parts_F \
--hive-partition-key gender \
--hive-partition-value 'F' \
--enclosed-by '\"' \
--fields-terminated-by ',' \
--escaped-by \\
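The two per-gender imports differ only in the filter value, the partition value, and the target directory, so they can be driven from a single template. A dry-run sketch that just prints the command for each gender; swap echo for a real invocation on a cluster (host and credentials are the placeholders used above):

```shell
# Emit one sqoop import command per gender partition (dry run).
emit_import() {
  g=$1
  # \$CONDITIONS stays literal so sqoop can substitute its split predicate.
  q="select EMP_NO,birth_date,first_name,last_name,hire_date from employees where gender=\"$g\" AND \$CONDITIONS"
  echo "sqoop import --connect jdbc:mysql://vchennar-mySqlServer-node/employees" \
       "--username myUID --password myPwd --query '$q' -m 6 --split-by EMP_NO" \
       "--hive-import --hive-table employees_import_parts" \
       "--target-dir /user/hive/warehouse/employee-parts_$g" \
       "--hive-partition-key gender --hive-partition-value $g"
}

for g in M F; do
  emit_import "$g"
done
```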
Files generated:
$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00005
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005
Record count for gender=F:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part* | wc -l
120051
The count is accurate.
Record count for employees in total:
Expected: 300024
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/*/part* | wc -l
300024
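As a final arithmetic sanity check, the two per-partition counts should sum to the full employees row count reported by MySQL:

```shell
# 179973 (gender=M) + 120051 (gender=F) should equal the 300024 rows
# in the source employees table.
m=179973
f=120051
total=$((m + f))
echo "total: $total"   # prints "total: 300024"
```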
Validate a bunch of records for accuracy of format:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part-m-00005 | more
"418330","1953-06-13","Pranas","McFarlan","1989-12-23"
"418331","1954-04-07","Chaosheng","Greenaway","1996-05-21"
"418332","1961-04-19","Koichi","Cannard","1986-01-21"
"418336","1954-02-14","Georgy","Thimonier","1994-03-21"
Validate count in Hive:
hive> select gender, count(*) from employees_import_parts group by gender;
gender _c1
F 120051
M 179973
The counts are accurate.
Validate records in Hive:
hive> select * from employees_import_parts where gender='F' limit 2;
emp_no birth_date first_name last_name hire_date gender
NULL "1964-06-02" "Bezalel" "Simmel" "1985-11-21" F
NULL "1953-04-20" "Anneke" "Preusig" "1989-06-02" F
Note that emp_no is NULL for these rows: the gender="F" import listed emp_no in lower case in the query, which is exactly the issue described in Note 2. Re-running the import with EMP_NO in the select list fixes it.