by Ludek Bob Jankovsky, 11-Jan-2014 (ETL PATTERNS)
Nowadays, real-time and message-based ETL take an increasingly significant place in IT infrastructure. Unlike messaging-based exchange of operational data, the real-time approach refreshes target data based on changes of source data in the source systems. It uses means that were earlier often used for data replication, but unlike replication, real-time ETL is rarely 1:1; a transformation takes place in the process.
Real-time ETL brings many limitations in comparison to standard batch ETL.
Advantages and disadvantages in comparison to messaging-based ETL
Advantages
There is minimal definition effort on the side of the source system
Minimal latency and lower risk of information loss
The same definition of ETL for initial and reconciliation loads
Disadvantages
Reactions to many data events on the source side have to be defined
No independent interface, which leads to tight coupling to changes of source structures
Limited possibilities for transformation complexity
Edge 1 - Target vs. Source perspective:
Standard ETL is defined, on both the logical and the physical level, as data from several source tables transformed into one target table. In most approaches the rule of one mapping - one target table applies.
In the real-time approach we start from a data event, i.e. from a particular change in one source table. That reverses the direction of the logic: a change of one piece of source information causes changes in several target tables. This fact accounts for most of the differences and difficulties characteristic of real-time approaches.
The picture above shows the logic both from the side of one target table and from the side of one source table.
The target-wise perspective is used to define the transformation logic. It comes from requirements that define the target structures, and we map their transformations based on source data. On the logical level we cannot go the opposite way; supporting systems based on the inverted logic (the source-based perspective) would be impossible.
In addition, there are tasks of both initial and incremental loads using standard ETL approaches, and they should also be defined in the target-based perspective, even on the physical level.
The source-wise perspective is used for real-time transformations (ETL) on the physical level.
Both perspectives live at the same time within one solution. The main goal is to reduce redundancy between them, for the following reasons:
Consistency of the solution (e.g. between reconciliation and real-time loads)
Manageability of the solution: changes should be done in one place, and they should start from the logical design (i.e. the target-based perspective)
There are two basic ways to cope with the task:
Adaptive solution .. the solution is built in the most adaptive way possible, using reusable parts of the physical design
Metadata driven solution .. the solution forks one set of metadata into the two different perspectives while generating the physical implementation (see the sketch below). There is still redundancy on the physical level, but there is no risk of inconsistency because all the branches are generated from the same metadata. This approach requires a fully metadata-driven solution, which takes longer to implement, especially without an established metadata-driven ETL tool.
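As an illustration, the following minimal sketch forks one simplified mapping metadata record into both perspectives. All names (DWH_CUSTOMER, STG_PARTY, STG_ADDRESS, CHANGE_QUEUE) are hypothetical, and the generated SQL is only a schematic of the principle, not a production generator:

```python
# One logical mapping definition (target-based perspective) generates
# both the batch implementation and the per-source real-time statements.
MAPPING = {
    "target": "DWH_CUSTOMER",
    "sources": ["STG_PARTY", "STG_ADDRESS"],
    "select": ("SELECT p.party_id, p.name, a.city "
               "FROM STG_PARTY p JOIN STG_ADDRESS a ON a.party_id = p.party_id"),
    "key": "party_id",
}

def generate_batch_load(m):
    """Target-wise perspective: one statement per target table."""
    return f"INSERT INTO {m['target']}\n{m['select']}"

def generate_realtime_loads(m):
    """Source-wise perspective: one incremental statement per source,
    restricted to the keys captured in a change queue for that source."""
    stmts = {}
    for src in m["sources"]:
        stmts[src] = (
            f"MERGE INTO {m['target']} t\n"
            f"USING ({m['select']}\n"
            f"       WHERE p.{m['key']} IN (SELECT {m['key']} FROM CHANGE_QUEUE\n"
            f"                              WHERE source_table = '{src}')) s\n"
            f"ON (t.{m['key']} = s.{m['key']})\n"
            f"WHEN MATCHED THEN UPDATE SET t.name = s.name, t.city = s.city\n"
            f"WHEN NOT MATCHED THEN INSERT (party_id, name, city)\n"
            f"                      VALUES (s.party_id, s.name, s.city)"
        )
    return stmts

print(generate_batch_load(MAPPING))
for src, sql in generate_realtime_loads(MAPPING).items():
    print(f"\n-- reaction to changes in {src}\n{sql}")
```

Both branches stay consistent by construction, because any change of the mapping record regenerates them together.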
Edge 2 - Target vs. Source identification:
To propagate changes of sources efficiently, there are two ways of identifying source data against target data through each transformation.
Target IDs
This approach requires that we are able to transform source data into the target identifier for every transformation in which the source table takes part.
In some cases that requires building huge redundant transformation structures just to reach the target identifier from even the smallest change. This is most obvious with list-of-values look-ups (LOV): a change of an LOV value causes changes of many target records, so using target IDs is not very efficient there, on either the definition or the execution level.
On the other hand, on the execution level we have to consider that data of that kind does not change often.
Source IDs
This approach requires storing the source table identifier in the target table. The requirement is an addition to the target structure, but it allows reacting to changes in every source table more directly. We can avoid all the additional look-ups, and the mentioned LOV value change becomes much easier to handle, as the sketch below shows.
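The following minimal sketch shows the source-ID approach in action, using SQLite for brevity; all table and column names (lov_status, dwh_customer, src_status_code) are hypothetical:

```python
import sqlite3

# The target table stores the source LOV key (src_status_code) in
# addition to its own columns, so a change of the LOV text can be
# applied directly, without deriving target IDs first.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE lov_status (status_code TEXT PRIMARY KEY, status_name TEXT);
    CREATE TABLE dwh_customer (
        customer_id     INTEGER PRIMARY KEY,
        name            TEXT,
        src_status_code TEXT,   -- source ID kept in the target structure
        status_name     TEXT    -- denormalized LOV value
    );
    INSERT INTO lov_status VALUES ('A', 'Active'), ('C', 'Closed');
    INSERT INTO dwh_customer VALUES (1, 'Smith', 'A', 'Active'),
                                    (2, 'Jones', 'A', 'Active'),
                                    (3, 'Brown', 'C', 'Closed');
""")

# Data event: the LOV value 'A' is renamed in the source system.
db.execute("UPDATE lov_status SET status_name = 'Active client' "
           "WHERE status_code = 'A'")

# Source-ID reaction: one direct update keyed by the stored source ID.
# With target IDs we would first have to look up every affected
# customer_id through the transformation logic.
db.execute("""
    UPDATE dwh_customer
    SET status_name = (SELECT status_name FROM lov_status
                       WHERE status_code = dwh_customer.src_status_code)
    WHERE src_status_code = 'A'
""")

print(db.execute("SELECT * FROM dwh_customer").fetchall())
```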
Edge 3 - CRUD Matrix revival:
The CRUD matrix was defined in the early theory of data modeling. It is a matrix defining which processes influence which tables, and how.
In the era of detailed mapping metadata it seems a slightly archaic definition, but for real-time ETL we have to classify how changes of particular source objects (possibly on both the table and the column level) influence the target object of a particular mapping. CRUD stands for Create (Insert), Read, Update, Delete.
Here we especially need to map the active operations, so with a changed convention we will use an IUD matrix instead (Insert, Update, Delete).
The example above shows standard flows of data mapped from source tables to target tables.
To work more efficiently, we have to classify every source involved in a mapping by the IUD categories:
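Such a classification can be recorded as plain data. The following sketch uses hypothetical table names; the driving table causes all three operations, while the look-ups cause updates only:

```python
# A sketch of an IUD matrix for one mapping: for every source table we
# note which operations on the target a change of that source can cause.
IUD_MATRIX = {
    "DWH_CUSTOMER": {                    # target table of the mapping
        "STG_PARTY":   {"I", "U", "D"},  # driving table: full life cycle
        "STG_ADDRESS": {"U"},            # look-up: updates only
        "LOV_STATUS":  {"U"},            # list of values: updates only
    },
}

# A change event on LOV_STATUS can only ever update DWH_CUSTOMER:
print(IUD_MATRIX["DWH_CUSTOMER"]["LOV_STATUS"])  # {'U'}
```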
So we have recalled what the CRUD (for our purpose IUD) matrix is. Now it is time to explain why it is so important for the real-time approach. Above, we spoke about target IDs. To find a target ID we have to extract a part of the final transformation and derive a particular ID transformation for each source object; the more source tables, the more complex the entire process becomes. Our example uses at most three source tables per mapping, while in reality there could be many look-ups, so three source tables would be a really simple case. In general, the whole process would look the following way:
Now we reconsider the whole process using the CRUD matrix. Look-ups usually make up the largest mass of source tables in large transformations, and all of them have got U only in the CRUD matrix.
So, as the first pruning step, we change all target-ID look-ups with U only in the CRUD matrix to the source-ID approach. That requires mappings and target tables to be adapted to work with source IDs too.
The following picture shows the first wave of reduction. Compare it with the equivalent above:
Additionally, we can remove the look-ups for target IDs where the target ID is derived from only one source table, because no look-ups are necessary there. The following picture illustrates how much simpler the transformation implementation becomes using the CRUD matrix; a sketch of both pruning rules follows below.
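The two reduction waves can be expressed as a simple pruning rule over the matrix. The matrix content and the set of ID-driving sources below are hypothetical illustrations of the idea, not a definitive algorithm:

```python
# Rule 1: a look-up with U only switches to the source-ID approach,
#         so no target-ID derivation has to be built for it.
# Rule 2: a source from which the target ID is derived directly needs
#         no look-up at all.
IUD_MATRIX = {
    "STG_PARTY":   {"I", "U", "D"},   # driving table of the mapping
    "STG_ADDRESS": {"U"},             # look-up
    "LOV_STATUS":  {"U"},             # list-of-values look-up
}

def prune(matrix, id_sources):
    """id_sources: the source tables the target ID is derived from."""
    plan = {}
    for source, ops in matrix.items():
        if ops == {"U"}:
            plan[source] = "source-ID update, target-ID look-up pruned"
        elif source in id_sources and len(id_sources) == 1:
            plan[source] = "target ID derived directly, no look-up"
        else:
            plan[source] = "target-ID look-up still required"
    return plan

for source, decision in prune(IUD_MATRIX, {"STG_PARTY"}).items():
    print(f"{source:12s} -> {decision}")
```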
Edge 4 - Metadata Driven:
So far we have mentioned measures leading to an adaptive design of the physical implementation. Most of them are useful for the metadata-driven approach too, but the metadata-driven approach removes many limitations: even if we create the most adaptive implementation possible, some level of redundancy will remain, and that redundancy can be handled by metadata-aided processes. Still, metadata-aided is not metadata-driven. A metadata-driven solution means the whole ETL implementation is generated from logical metadata. It requires a bigger initial effort, which would lengthen the time to a live pilot solution, so it is reasonable to start from an adaptive, metadata-aided solution and evolve it in iterations towards a fully metadata-driven one. The following picture illustrates the reduced complexity, with the redundant implementations generated from single logical metadata:
Edge 5 - Fine vs. Dirty operations:
In some approaches to ETL we resolve the impact on target data at the very start: we create an INSERT set, an UPDATE set and a DELETE set, and only then we operate. We can call that the fine operations approach.
On the other hand, we can create a more robust solution. We can split the transformation logic into the transformation itself and the final operation, where the final operations differ only between MERGE (insert or update) and DELETE. We can name that the dirty operations approach. In the dirty operations approach we discard the initial information about what should happen at the target and derive it again from the current state of the target at the moment of the final operation.
The approach brings the following advantages:
A more robust solution, unbreakable by a duplicity of the same operation
A more robust solution, insensitive to the exact order of concurrent operations
Much less definition complexity, getting rid of all the final-operation logic
Example:
Let's have two operations on the same target record, the record being established by the first of them. If you mix up their order, the fine operations approach fails, because you either try to update a not yet existing record or to insert an existing one.
In the dirty operations approach the final MERGE operation acts based on the current state of the target table, so it never fails.
It also always retakes the current information from the L0 (source-wise) layer, so the last operation always carries the most recent data.
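The following minimal sketch shows such a final step, with an SQLite upsert standing in for MERGE (SQLite 3.24 or newer); the tables l0_customer and dwh_customer and the scenario are hypothetical:

```python
import sqlite3

# Whatever the original event said (insert or update), the final
# operation re-reads the current state: an upsert never fails on a
# duplicate or an out-of-order pair of operations.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE l0_customer  (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dwh_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO l0_customer VALUES (1, 'Smith (corrected)');
""")

def final_merge(db, customer_id):
    # Retake the current record from L0, then upsert it into the target.
    db.execute("""
        INSERT INTO dwh_customer (customer_id, name)
        SELECT customer_id, name FROM l0_customer WHERE customer_id = ?
        ON CONFLICT (customer_id) DO UPDATE SET name = excluded.name
    """, (customer_id,))

# Two queued operations for the same record, applied in the wrong order:
# with fine operations the first call would be an UPDATE of a not yet
# existing row and would fail; here both calls succeed, and the target
# ends up with the most recent L0 state.
final_merge(db, 1)   # was queued as "update"
final_merge(db, 1)   # was queued as "insert"
print(db.execute("SELECT * FROM dwh_customer").fetchall())
```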
Edge 6 - Synchronous vs. Asynchronous:
One very important aspect of real-time ETL is the synchronicity of loads. A synchronous approach means that every change is propagated into the target with no queuing. Some myths say the only way to do really real-time ETL is to do it synchronously, and that asynchronous ETL cannot be called real-time. Nonsense.
The synchronous approach is very sensitive to peaks; under load its latency grows much more than that of the asynchronous approach.
The asynchronous approach requires a queue of changes; the ETL process takes the changes gathered within a certain period of time and processes them in bulk. That way, the more changes there are, the larger the bulks and the more efficient the processing becomes, as the sketch below illustrates.
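The following sketch shows the asynchronous principle: events accumulate in a queue, and a worker processes whatever has gathered within the period as one bulk. The process_bulk function is a hypothetical placeholder for the real load:

```python
import queue
import threading
import time

changes = queue.Queue()   # the queue of source change events

def process_bulk(bulk):
    # Placeholder for the real bulk load of the gathered changes.
    print(f"processing {len(bulk)} changes in one bulk")

def worker(period_seconds=1.0):
    while True:
        time.sleep(period_seconds)     # latency is bounded by the period
        bulk = []
        while not changes.empty():     # drain everything gathered so far
            bulk.append(changes.get_nowait())
        if bulk:
            process_bulk(bulk)

threading.Thread(target=worker, daemon=True).start()

# Simulated peak: many events arrive at once, but they are handled as a
# single bulk instead of one synchronous load per event.
for i in range(1000):
    changes.put(("STG_PARTY", i))
time.sleep(2.5)
```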