Wednesday, December 21, 2016

Sqoop incremental mode to get updated data

Delta data imports


In practice, we often need to synchronize delta data (new or updated rows) from an RDBMS to HDFS. Sqoop provides an incremental import mode for loading only this delta data.

Table 4. Incremental import arguments:

Argument                Description
--check-column (col)    Specifies the column to be examined when determining which rows to import.
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

[Image: sqoop-incremental-append]
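
As a minimal sketch of an append-mode import, the script below combines the three arguments from the table. The JDBC URL, user name, table, target directory, and the order_id check column are hypothetical placeholders, not values from this post. The run helper is only an illustration aid: by default it prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own connection details.
JDBC_URL="jdbc:mysql://dbhost/sales"
DB_USER="etl_user"
TABLE="orders"
TARGET_DIR="/user/etl/orders"

# Dry-run wrapper: prints the command unless DRY_RUN=0 is set,
# in which case it executes the command with its arguments intact.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import only rows whose order_id (assumed to be an increasing key)
# is greater than the last value passed as $1.
import_append() {
  run sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --incremental append \
    --check-column order_id \
    --last-value "$1"
}

import_append 1000
```

Setting DRY_RUN=0 runs the real import. Append mode suits tables where new rows are only added and the check column keeps increasing; the value passed to --last-value should be the maximum seen in the previous import.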


OR

Importing incremental data with the lastmodified mode option
[Image: sqoop-incremental-last-modified]
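
A comparable sketch for lastmodified mode follows. It assumes the source table has a timestamp column, here called updated_at, that the database updates whenever a row changes. That column, the connection details, and the paths are hypothetical placeholders. As before, the run helper prints the command by default rather than executing it.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own connection details.
JDBC_URL="jdbc:mysql://dbhost/sales"
DB_USER="etl_user"
TABLE="orders"
TARGET_DIR="/user/etl/orders"

# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
# The printed form is for preview only and is not re-quoted.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import rows whose updated_at timestamp (assumed column) is more
# recent than the timestamp passed as $1.
import_lastmodified() {
  run sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "$1"
}

import_lastmodified "2016-12-20 00:00:00"
```

Unlike append mode, this picks up rows that were modified after the previous import, not just newly inserted ones.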


Workaround for delta data import
By default, Sqoop saves an imported table in an HDFS directory named after the RDBMS table. A lastmodified import of the delta data tries to write to that same directory, which already exists in HDFS, so the job fails with an error because the existing output directory cannot be overwritten.
Here is a workaround to get the complete, updated data on the HDFS side:
1. Move the existing HDFS data to a temporary folder.
2. Run a fresh import in lastmodified mode.
3. Merge this fresh import with the old data saved in the temporary folder.
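
The three steps above can be sketched as a small script. All paths, the updated_at and order_id columns, and the jar and class names are hypothetical placeholders. The sketch also assumes the record class jar generated by an earlier import (or by sqoop codegen) is available, because sqoop merge needs it to parse the records. The run helper prints each command by default instead of executing it.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own values.
JDBC_URL="jdbc:mysql://dbhost/sales"
DB_USER="etl_user"
TABLE="orders"
TARGET_DIR="/user/etl/orders"
OLD_DIR="/user/etl/orders_old"
MERGED_DIR="/user/etl/orders_merged"
JAR_FILE="orders.jar"   # assumed: jar generated by a previous import
CLASS_NAME="orders"     # assumed: record class inside that jar

# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 is the timestamp of the previous import.
refresh_table() {
  # 1. Move the existing HDFS data to a temporary folder.
  run hdfs dfs -mv "$TARGET_DIR" "$OLD_DIR" &&
  # 2. Run a fresh lastmodified import into the now-free directory.
  run sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "$1" &&
  # 3. Merge the fresh import onto the old data; for rows sharing the
  #    same order_id, the newer record replaces the older one.
  run sqoop merge \
    --new-data "$TARGET_DIR" \
    --onto "$OLD_DIR" \
    --target-dir "$MERGED_DIR" \
    --jar-file "$JAR_FILE" \
    --class-name "$CLASS_NAME" \
    --merge-key order_id
}

refresh_table "2016-12-20 00:00:00"
```

The merged result lands in a separate directory here because sqoop merge writes to a new target directory; renaming it back to the table directory afterwards is left out of this sketch.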
Reference: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

