1. Keep the Master Data. This is the one-time, initial load of the data.
2. Load the Delta Data (newly added and updated records) into HDFS at a different location.
3. Merge the data: Join the master and delta data together on the business key field(s).
4. Compact the data: After the merge there may be one or more records for each business key; keep only the most recently updated record for each key.
5. Write the Data to a Temporary Output: Since most Hadoop jobs (e.g. MapReduce) cannot overwrite an existing directory, you must write the compacted data to a temporary output location.
6. Overwrite the original master data with the temporary output (a HiveQL sketch of steps 3-6 follows this list).
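The merge, compact, and overwrite steps can be expressed in a few queries. Below is a minimal HiveQL sketch of steps 3-6, assuming hypothetical tables named master and delta with a business key column id, a placeholder data column payload, and a last-updated timestamp column modified_date (all of these names are assumptions for illustration).

```sql
-- Merge + compact: union master and delta, then keep only the newest record
-- per business key. The result is written to a temporary table first because
-- the final step overwrites the original master data.
DROP TABLE IF EXISTS merged_tmp;
CREATE TABLE merged_tmp AS
SELECT id, payload, modified_date
FROM (
    SELECT id, payload, modified_date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) AS rn
    FROM (
        SELECT id, payload, modified_date FROM master
        UNION ALL
        SELECT id, payload, modified_date FROM delta
    ) merged
) ranked
WHERE rn = 1;

-- Overwrite the original master data with the compacted temporary output.
INSERT OVERWRITE TABLE master
SELECT id, payload, modified_date FROM merged_tmp;
```

The window function keeps exactly one row per business key; grouping by the key and joining back on MAX(modified_date) achieves the same result and is used in the reconcile view sketched further below.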
High Level:
- Ingest. Move the complete Base table (base_table) once, followed by Change records only (incremental_table) on each subsequent cycle.
- Reconcile. Creating a Single View of Base + Change records (reconcile_view) to reflect the most up-to-date record set.
- Compact. Creating a Reporting table (reporting_table) from the reconciled view.
- Purge. Replacing the Base table with the Reporting table contents and deleting any previously processed Change records before the next Data Ingestion cycle (see the sketch after this list).
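Below is a sketch of the Reconcile, Compact, and Purge steps in HiveQL, reusing the object names above (base_table, incremental_table, reconcile_view, reporting_table). The business key column id and the last-updated column modified_date are assumptions; adapt them to the actual schema.

```sql
-- Reconcile: a single view of Base + Change records that keeps only the most
-- recently updated row per business key.
CREATE VIEW IF NOT EXISTS reconcile_view AS
SELECT t1.*
FROM (
    SELECT * FROM base_table
    UNION ALL
    SELECT * FROM incremental_table
) t1
JOIN (
    SELECT id, MAX(modified_date) AS max_modified
    FROM (
        SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table
    ) t2
    GROUP BY id
) s
ON t1.id = s.id AND t1.modified_date = s.max_modified;

-- Compact: materialize the reconciled view into a reporting table.
DROP TABLE IF EXISTS reporting_table;
CREATE TABLE reporting_table AS
SELECT * FROM reconcile_view;

-- Purge: replace the Base table contents with the Reporting table and clear
-- the already-processed Change records before the next ingestion cycle.
INSERT OVERWRITE TABLE base_table
SELECT * FROM reporting_table;

TRUNCATE TABLE incremental_table;  -- only valid for managed (non-external) tables
```

Because the view is recomputed from base_table and incremental_table, the same Compact and Purge statements can be rerun after each new batch of change records is ingested.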