Rigid Exchange Partition (REP) data acquisition approach
by Ludek Bob Jankovsky, 05-Feb-2010 (ETL PATTERNS)
There are several ways to maintain consistency of data warehouses and data marts during business hours. Standard simple approaches usually compress the loading time window into a very short period between the readiness of source data and the required readiness of target data. The following approach is another way to keep all data in the mart consistent at any moment, and I consider it an interesting technique.
In the process shown above, the "simple" merge step is replaced by three partial steps:
Create "new" target partitions by joining the source data with the current target data, for all tables within the mart.
Exchange the new target partitions with the old ones. All exchanges have to be done at the same moment, so the transition from yesterday's state to today's is really quick.
Archive the old partitions. Exchanged-out partitions have to be kept for as long as the ability to roll the mart back is required. This retention parameter also influences the sizing of the solution, so it is a compromise between the two requirements.
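The three steps can be sketched in Oracle SQL. All schema, table, partition, and column names here are illustrative assumptions, not part of the original article; the sketch assumes the target table is partitioned so that the current version of the data lives in a single partition:

```sql
-- Step 1: build the new version of the data by joining the source
-- with the current target (hypothetical tables and columns)
CREATE TABLE mart.customer_dim_new NOLOGGING AS
SELECT s.customer_id,
       s.customer_name,
       NVL(t.first_load_date, SYSDATE) AS first_load_date
  FROM stage.customer_src s
  LEFT JOIN mart.customer_dim t
         ON t.customer_id = s.customer_id;

-- Step 2: swap the new table into the target partition; the exchange is
-- a data-dictionary operation, so the switch of visible data is almost
-- instant regardless of data volume
ALTER TABLE mart.customer_dim
  EXCHANGE PARTITION p_current WITH TABLE mart.customer_dim_new
  INCLUDING INDEXES WITHOUT VALIDATION;

-- Step 3: after the exchange, customer_dim_new holds yesterday's data;
-- rename it so it can serve as the rollback archive
ALTER TABLE mart.customer_dim_new RENAME TO customer_dim_arch_20100204;
```

Note that the exchange swaps segments in both directions: the standalone table ends up holding the old partition contents, which is exactly what the archive step needs.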
The approach relies on the following assumptions:
Step 1 of all transformations of the data mart has to finish before step 2 of any transformation starts. Step 2 of all transformations is then done within a short period of time, switching the data mart from one consistent state to another.
Workflow dependencies of the standard solution will be handed over to the step 1 chain of the workflow.
Only the "new version" partitions will be used for lookups in transformations.
Appending to fact tables, the simplest pattern, will be solved as an exchange of the new data with an empty partition.
The approach assumes that the partitioning of slowly changing dimensions is designed to limit the amount of joined data.
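The fact-append pattern mentioned above reduces to exchanging the new rows with a freshly added empty partition. Again, the object names and the daily range-partitioning scheme are illustrative assumptions:

```sql
-- Step 1: prepare the delta in a standalone table with the same structure
CREATE TABLE mart.sales_fact_delta NOLOGGING AS
SELECT *
  FROM stage.sales_src
 WHERE trade_date = DATE '2010-02-05';

-- add an empty partition for the new day
ALTER TABLE mart.sales_fact
  ADD PARTITION p_20100205 VALUES LESS THAN (DATE '2010-02-06');

-- Step 2: exchange the delta with the empty partition; until this moment,
-- queries against sales_fact see no trace of the new day's data
ALTER TABLE mart.sales_fact
  EXCHANGE PARTITION p_20100205 WITH TABLE mart.sales_fact_delta;
```

Because the exchanged-out side is an empty partition, the archive step is trivial here: rolling back means dropping the new partition.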
Basic advantages of the approach:
Better performance of the ELT solution, especially when a dynamic 'Create table as select' statement is used to create the new partition. I have compared measurements of the standard merge and the REP method, and I was surprised how much faster the latter is.
Keeps data consistent across the whole data mart even during the major part of the ETL process.
Allows a full rollback of the load in the case of fatal failures.
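Rollback is simply the exchange in reverse: the archived table is swapped back into the partition. This sketch assumes the archive table produced by the load above is still available:

```sql
-- roll the mart back by exchanging the archived data into the partition;
-- the failed "new" data ends up in the standalone table and can be
-- dropped once it is no longer needed for analysis of the failure
ALTER TABLE mart.customer_dim
  EXCHANGE PARTITION p_current WITH TABLE mart.customer_dim_arch_20100204
  INCLUDING INDEXES WITHOUT VALIDATION;
```

As with the forward switch, the rollback of all tables in the mart has to be done within one short window so the mart moves back to a consistent state as a whole.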
If a sophisticated metadata-driven ELT method is used, the transition to the REP method can be painless, using a set of new algorithms for the patterns in use.
Basic disadvantages:
Additional database sizing requirements.
Slightly more complex ETL infrastructure.
A difficult transition in the case that the ETL solution is not metadata-driven.
Articles about REP implementation for particular ETL patterns will follow.