Tuesday, October 18, 2016

Import Data from RDBMS to HDFS using Sqoop

How to Import Data From Oracle database to HDFS using Sqoop.

Install Oracle JDBC Driver

You can download the JDBC Driver from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html. You must accept the license agreement before you can download the driver. Download theojdbc6.jar file and copy it to sqoop lib directory:
$ sudo cp ojdbc6.jar sqoop/lib
$ sudo cp ojdbc6.jar sqoop/lib
$ chmod -R 755 sqoop/lib

Import Data from Oracle to HDFS

Use below command to import the data

$ sqoop import -connect jdbc:oracle:thin:apps/apps@10.20.66.6:1521:vis1211 -username apps -P -table ap_invoices_all -columns "INVOICE_ID,VENDOR_ID" -target-dir /apinvoices -m 1

Required items for above command:
IPv4 Address – 10.20.66.6
Database Name – apps
Table Name – ap_invoices_allinvoices_all
Username – apps
Password – apps
Output Directory – Could be any. I have used apinvoices
Below commands may help you to identify if there is any issue

$ ping 10.20.66.6
$ nc -z 10.20.66.6 1521
Refer Sqoop user guide https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html for more details

Verify the Data on HDFS


Open the Browser and go to the URL: http://localhost:50070/dfshealth.jsp
Click on Browse the filesystem
Click on apinvoices directory
Click on part-m-00000 file to see the data.


Sqoop list commands


Run the commands on the Unix prompt, on the node where you have sqoop installed.


1. List databases

Lists databases in your mysql database.

$ sqoop list-databases --connect jdbc:mysql://<<mysql-server>>/employees --username vchennar --password myPassword
.
.
.
INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
employees
test

2. List tables

Lists tables in your mysql database.

$ sqoop list-tables --connect jdbc:mysql://<<mysql-server>>/employees --username vchennar --password myPassword
.
.
.
INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
departments
dept_emp
dept_manager
employees
employees_exp_stg
employees_export
salaries
titles


Importing data in MySql into HDFS 


Replace "vchennar-mySqlServer-node" with the host name of the node running mySQL server, replace login credentials and target directory.

1. Importing a table into HDFS - basic import


$ sqoop import \
--connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
-m 1 \
--target-dir /user/vchennar/sqoop-mysql/employees
.
.
.
.9139 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 300024 records

2. Executing imports with an options file for static information


Rather than repeat the import command along with connection related input required, each time, you can pass an options file as an argument to sqoop.  Create a text file, as follows, and save it someplace, locally on the node you are running the sqoop client on.  

Note: This blog does not cover it, but you can create and use a password file as well, that you can pass as argument --password-file <<filename>>.

2.1. Sample Options file:
___________________________________________________________________________
$ vi SqoopImportOptions.txt 
#
#Options file for sqoop import
#

import
--connect
jdbc:mysql://vchennar-mySqlServer-node/employees
--username
myUID
--password
myPwd

#
#All other commands should be specified in the command line
___________________________________________________________________________

2.2. Command to execute import, using an options file:


Note: Provide the proper path for the options file

2.2.1. The command

$ sqoop --options-file SqoopImportOptions.txt \
--table departments \
-m 1 \
--target-dir /user/vchennar/sqoop-mysql/departments
.
.
.
INFO mapreduce.ImportJobBase: Transferred 153 bytes in 26.2453 seconds (5.8296 bytes/sec)
INFO mapreduce.ImportJobBase: Retrieved 9 records.


-m argument is to specify number of mappers.  The department table has a handful of records, so I am setting it to 1.

2.2.2 Data file contents:

$ hadoop fs -cat sqoop-mysql/departments/part-m-00000 | more

d009,Customer Service
d005,Development
d002,Finance
d003,Human Resources
d001,Marketing
d004,Production
d006,Quality Management
d008,Research
d007,Sales


Import all columns, filter rows using where clause


 $ sqoop --options-file SqoopImportOptions.txt \
--table employees  \
--where "emp_no > 499948\
--as-textfile \
-m 1 \
--target-dir /user/vchennar/sqoop-mysql/employeeGtTest

 Import with a free form query without where clause


 $ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where  $CONDITIONS' \
-m 1 \
--target-dir /user/vchennar/sqoop-mysql/employeeFrfrmQrySmpl2

(Case of the column needs to match that used to create table, or else the import fails)

 Import with a free form query with where clause


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 20000 AND $CONDITIONS' \
-m 1 \
--target-dir /user/vchennar/sqoop-mysql/employeeFrfrmQry1

$ sqoop import --connect <connect> --username <username> --password <password> --query "select * from table WHERE rownum<10 OR \$CONDITIONS" --target-dir=/location -m 1 --verbose 

Split by


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/vchennar/sqoop-mysql/SplitByExampleImport


Import all tables


Command:
$ sqoop --options-file SqoopImportAllTablesOptions.txt \
--direct \
--warehouse-dir sqoop-mysql/EmployeeDatabase

Options file content:
$ more SqoopImportAllTablesOptions.txt
______________________________________________

#
#Options file for sqoop import
#

import-all-tables
--connect
jdbc:mysql://vchennar-mySqlServer-node/employees
--username
myUID
--password
myPWD

#
#All other commands should be specified in the command line
______________________________________________

Direct and quick queries or inserts and updates with Sqoop eval



The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console. This allows users to preview their import queries to ensure they import the data they expect.


1. Query:

$ sqoop eval --connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--query "select * from employees limit 2"

---------------------------------------------------------------------------------
| emp_no      | birth_date | first_name     | last_name        | gender | hire_date  | 
---------------------------------------------------------------------------------
| 10001       | 1953-09-02 | Georgi         | Facello          | M | 1986-06-26 | 
| 10002       | 1964-06-02 | Bezalel        | Simmel           | F | 1985-11-21 | 

2. Insert:

sqoop eval --connect jdbc:mysql://vchennar-mySqlServer-node/employees \
--username myUID \
--password myPWD \
-e "insert into employees_export values(550000,'1977-08-08','Mouse','Mickey','M','1999-04-12')"

1 comment: