Friday, March 3, 2017

Incremental Updates in Apache Hadoop or Hive


1. Keep the Master Data: this is the one-time, initial load of the data into HDFS.

2. Load the Delta Data (newly added and updated records) into HDFS at a different location.

3. Merge the data: Join the master and delta data together on the business key field(s).

4. Compact the data: after the merge you will have one or more records for each business key; keep only the most recently updated record for each key.

5. Write the Data to a Temporary Output: Since most Hadoop jobs (e.g. MapReduce) cannot overwrite an existing directory, you must write the compacted data to a temporary output location.

6. Overwrite the original master data with the temporary output (a minimal HiveQL sketch of steps 3 through 6 follows this list).
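Below is a minimal HiveQL sketch of steps 3 through 6. The table names (master, delta, merged_tmp), the business key column (id), and the last-modified column (modified_date) are placeholders for this example, not names defined above, so adjust them to your own schema. The sketch also implements the merge with UNION ALL rather than an explicit join; either approach works as long as the newest record per key wins.

  -- Steps 3 and 4: merge master and delta, then keep only the newest
  -- record per business key. UNION ALL stacks both data sets and the
  -- ROW_NUMBER() window picks the most recently modified row per id.
  DROP TABLE IF EXISTS merged_tmp;
  CREATE TABLE merged_tmp AS
  SELECT id, col1, col2, modified_date
  FROM (
    SELECT id, col1, col2, modified_date,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY modified_date DESC) AS rn
    FROM (
      SELECT id, col1, col2, modified_date FROM master
      UNION ALL
      SELECT id, col1, col2, modified_date FROM delta
    ) all_rows
  ) ranked
  WHERE rn = 1;

  -- Steps 5 and 6: the compacted result already sits in a separate
  -- (temporary) table, so the master can now be overwritten from it.
  INSERT OVERWRITE TABLE master
  SELECT id, col1, col2, modified_date FROM merged_tmp;
  DROP TABLE merged_tmp;

ROW_NUMBER() requires Hive 0.11 or later; on older versions the same compaction can be done with a GROUP BY/MAX(modified_date) join, as in the sketch further below.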


High Level:

  • Ingest. Move the complete table (base_table) once, then move only the Change records (incremental_table) on each subsequent run.
  • Reconcile. Create a single view of the Base + Change records (reconcile_view) that reflects the most up-to-date record set.
  • Compact. Create a Reporting table (reporting_table) from the reconciled view.
  • Purge. Replace the Base table with the Reporting table contents and delete any previously processed Change records before the next data ingestion cycle (see the HiveQL sketch below).
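The four phases can be expressed in HiveQL roughly as follows, using the base_table, incremental_table, reconcile_view, and reporting_table names from the list above. The id and modified_date columns are assumed placeholders for the business key and the change timestamp, and the Ingest step itself (Sqoop, LOAD DATA, etc.) is omitted.

  -- Reconcile: a view that keeps only the latest version of each
  -- business key across the base data and the newly ingested changes.
  CREATE VIEW IF NOT EXISTS reconcile_view AS
  SELECT t1.*
  FROM (SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table) t1
  JOIN (SELECT id, MAX(modified_date) AS max_modified
        FROM (SELECT * FROM base_table
              UNION ALL
              SELECT * FROM incremental_table) t2
        GROUP BY id) s
    ON t1.id = s.id AND t1.modified_date = s.max_modified;

  -- Compact: materialize the reconciled view into a reporting table.
  DROP TABLE IF EXISTS reporting_table;
  CREATE TABLE reporting_table AS
  SELECT * FROM reconcile_view;

  -- Purge: replace the base table with the compacted contents and
  -- clear the processed change records before the next ingest cycle.
  INSERT OVERWRITE TABLE base_table
  SELECT * FROM reporting_table;
  TRUNCATE TABLE incremental_table;

Note that TRUNCATE TABLE only works for managed tables; if incremental_table is an external table, delete the files in its HDFS location instead before the next ingestion cycle.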
