Best practices:
1. With imports, use the output line formatting options wherever possible for accuracy of data transfer - "--enclosed-by", "--fields-terminated-by", "--escaped-by" (see the import example after this list).
2. Use your judgment when specifying the number of mappers, to ensure you parallelize the import appropriately without increasing overall completion time (by default, 4 tasks run in parallel).
3. Use direct connectors where available for better performance.
4. With imports, use a boundary query for better performance.
5. When importing into Hive and using dynamic partitions, think through the partition criteria and the number of files generated - you don't want too many small files on your cluster; also, there is a limit on the number of partitions on each node (see the Hive import example after this list).
6. Be cognizant of how many concurrent connections the database is configured to allow; use the fetch size to control the number of records read from the database at a time, and factor in the number of parallel tasks.
7. Do not use the same table for import and export.
8. Use an options file for reusability (see the options file example after this list).
9. Be aware of Sqoop's case-sensitivity nuances - it can save you time you would otherwise spend troubleshooting issues.
10. For exports, use a staging table where possible, especially during the development phase; it will help with troubleshooting (see the export example after this list).
11. Use the verbose argument (--verbose) for more information while troubleshooting.
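
A minimal import sketch tying together items 1, 2, 4, and 6 above; the connection string, database, table, column, and directory names (dbhost, salesdb, orders, order_id) are placeholders for illustration:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --boundary-query "SELECT MIN(order_id), MAX(order_id) FROM orders" \
  --fetch-size 1000 \
  --fields-terminated-by ',' \
  --optionally-enclosed-by '"' \
  --escaped-by '\\' \
  --target-dir /data/staging/orders

Where the database has a direct connector (item 3), adding --direct can speed up the transfer; just confirm that the other options you need are supported in direct mode.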
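
For item 5, a sketch of an import into a partitioned Hive table; the table name orders_daily, the partition key load_dt, and the date values are assumptions for illustration. This example loads one static partition per run, but the same concern about partition criteria and file counts applies when you use the HCatalog integration for dynamic partitioning:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table orders \
  --where "order_date = '2013-06-01'" \
  --num-mappers 4 \
  --hive-import \
  --hive-table orders_daily \
  --hive-partition-key load_dt \
  --hive-partition-value '2013-06-01'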
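
For item 8, a sketch of an options file; the file name and contents are placeholders. Each option and its value go on separate lines, and lines beginning with # are comments:

# salesdb-connection.txt - shared connection settings
import
--connect
jdbc:mysql://dbhost/salesdb
--username
dbuser

The file can then be reused across jobs, with job-specific arguments supplied on the command line:

sqoop --options-file salesdb-connection.txt --table orders --target-dir /data/staging/orders -P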
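
For items 10 and 11, a sketch of an export through a staging table; order_summary and order_summary_stage are placeholder tables, and the staging table must already exist with the same structure as the target:

sqoop export \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table order_summary \
  --staging-table order_summary_stage \
  --clear-staging-table \
  --export-dir /data/output/order_summary \
  --fields-terminated-by ',' \
  --verbose

Data lands in the staging table first and is moved to the target table only if the job succeeds, which makes partial failures easier to investigate.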
Troubleshooting tips:
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1774381