Friday, March 3, 2017

Incremental Updates in Apache Hadoop or Hive


1. Keep the Master Data: this is the one-time, initial load of the data into HDFS.

2. Load the Delta Data (newly added and updated records) into HDFS at a different location.

3. Merge the data: Join the master and delta data together on the business key field(s).

4. Compact the data: after the merge you will have one or more records for each business key; keep only the most recently updated record for each key.

5. Write the Data to a Temporary Output: Since most Hadoop jobs (e.g. MapReduce) cannot overwrite an existing directory, you must write the compacted data to a temporary output location.

6. Overwrite the original master data with the temporary output (a minimal HiveQL sketch of steps 3 through 6 follows this list).
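Below is a minimal HiveQL sketch of steps 3 through 6. The table names (master, delta, merged_tmp), the business key column (id), and the last-modified column (modified_date) are placeholders for this example, not names defined above, so adjust them to your own schema. The sketch also implements the merge with UNION ALL rather than an explicit join; either approach works as long as the newest record per key wins.

  -- Steps 3 and 4: merge master and delta, then keep only the newest
  -- record per business key. UNION ALL stacks both data sets and the
  -- ROW_NUMBER() window picks the most recently modified row per id.
  DROP TABLE IF EXISTS merged_tmp;
  CREATE TABLE merged_tmp AS
  SELECT id, col1, col2, modified_date
  FROM (
    SELECT id, col1, col2, modified_date,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY modified_date DESC) AS rn
    FROM (
      SELECT id, col1, col2, modified_date FROM master
      UNION ALL
      SELECT id, col1, col2, modified_date FROM delta
    ) all_rows
  ) ranked
  WHERE rn = 1;

  -- Steps 5 and 6: the compacted result already sits in a separate
  -- (temporary) table, so the master can now be overwritten from it.
  INSERT OVERWRITE TABLE master
  SELECT id, col1, col2, modified_date FROM merged_tmp;
  DROP TABLE merged_tmp;

ROW_NUMBER() requires Hive 0.11 or later; on older versions the same compaction can be done with a GROUP BY/MAX(modified_date) join, as in the sketch further below.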


High Level:

  • Ingest. Move the complete table (base_table) once, then move only the Change records (incremental_table) on each subsequent run.
  • Reconcile. Create a single view of the Base + Change records (reconcile_view) that reflects the most up-to-date record set.
  • Compact. Create a Reporting table (reporting_table) from the reconciled view.
  • Purge. Replace the Base table with the Reporting table contents and delete any previously processed Change records before the next data ingestion cycle (see the HiveQL sketch below).
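The four phases can be expressed in HiveQL roughly as follows, using the base_table, incremental_table, reconcile_view, and reporting_table names from the list above. The id and modified_date columns are assumed placeholders for the business key and the change timestamp, and the Ingest step itself (Sqoop, LOAD DATA, etc.) is omitted.

  -- Reconcile: a view that keeps only the latest version of each
  -- business key across the base data and the newly ingested changes.
  CREATE VIEW IF NOT EXISTS reconcile_view AS
  SELECT t1.*
  FROM (SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table) t1
  JOIN (SELECT id, MAX(modified_date) AS max_modified
        FROM (SELECT * FROM base_table
              UNION ALL
              SELECT * FROM incremental_table) t2
        GROUP BY id) s
    ON t1.id = s.id AND t1.modified_date = s.max_modified;

  -- Compact: materialize the reconciled view into a reporting table.
  DROP TABLE IF EXISTS reporting_table;
  CREATE TABLE reporting_table AS
  SELECT * FROM reconcile_view;

  -- Purge: replace the base table with the compacted contents and
  -- clear the processed change records before the next ingest cycle.
  INSERT OVERWRITE TABLE base_table
  SELECT * FROM reporting_table;
  TRUNCATE TABLE incremental_table;

Note that TRUNCATE TABLE only works for managed tables; if incremental_table is an external table, delete the files in its HDFS location instead before the next ingestion cycle.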
