Real-time ETL patterns and how to piss to pit
    by Ludek Bob Jankovsky, 16-Mar-2014 (ETL PATTERNS)
Recently I published an article about real-time ETL. While designing it, we hit several challenges and weighed various approaches. To be clear, we speak about real-time ETL realizing data integration of sources, not service integration. We have to react to the smallest changes in the source data and propagate them to the transformed target data structure. I described the advantages and disadvantages in the previous article.
See: Edges of Real-time ETL

As with any solution, there are piles of requirements, and most of them cannot all be satisfied together. I would like to single out the three most important groups:

  • Latency - everyone wants to have the data in the target as soon as possible, ideally a few seconds before it happens in the source. Users just know they want it fast, but try asking them for any level of certainty. As we speak in terms of units of seconds and we transfer data between systems, servers and geographic locations, we have to count on a certain "noise" in latency.
    We recommend using three performance (SLA) indicators (a measurement sketch follows below):
    • C90 - the latency that 90% of cases should not exceed. Excesses such as end-of-day processing or reconciliations do not count.
    • MAX - the maximum latency. Excesses such as end-of-day processing or reconciliations do not count.
    • MAX-Excesses - the maximum latency during excesses such as end-of-day processing or reconciliations.
    We also recommend specifying these indicators at a fine-grained level, per ETL module, because this requirement constrains the following ones.
  • Manageability - you can do very detailed tuning work on each of the ETL processes to get the best performance, but on the other hand you have plenty of ETL mappings to implement and, what is worse, to maintain throughout the whole life-cycle of the system. That way some approaches leading to the best performance can raise the costs of change management, including impact analysis, to enormous levels. Of course, a metadata-driven solution can help mitigate the problem, but the problem is still there.
  • Effort and costs - everyone wants to realize the solution at the lowest possible cost and effort. If you want a solution with almost zero latency, easy manageability and low costs, I suppose the best solution is to take a pencil and strike off one requirement after another. After striking off all the requirements you will have a solution with zero latency, no manageability requirements, and a total cost of one pencil.
    Otherwise you have to specify exactly what you need and set your priorities among the three dimensions mentioned above.
You may be surprised that I did not mention several other performance indicators such as high availability, throughput, etc. They are also important, but in our decisions they did not affect the other parts of the magic triangle as much as latency did.
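
To make the latency indicators measurable, here is a minimal sketch, assuming a hypothetical ETL_LATENCY_LOG table with one row per propagated change (COMMIT_TS on the source side, APPLY_TS on the target side, EXCESS_FLAG marking end-of-day or reconciliation windows):

SELECT PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY latency_sec) AS c90_sec
      ,MAX(latency_sec)                                         AS max_sec
FROM (SELECT (apply_ts - commit_ts) * 86400 AS latency_sec  -- DATE arithmetic: days to seconds
      FROM   etl_latency_log
      WHERE  excess_flag = 'N');  -- excesses (end of day, reconciliations) do not count
-- MAX-Excesses: the same MAX over rows WHERE excess_flag = 'Y'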


Pipe approaches
In the article mentioned above, asynchronous solutions were recommended. For that we have to transfer information about changes in the sources to the ETL processes in some form of logs, queues - simply pipes. In general, the process that detects a change in the source and refreshes the mirror of all sources in the L0 layer fills some kind of pipe informing the ETL process (L0 to target L1 layer) about what exactly has been changed. The medium carrying that information we call a Pipe.
One more assumption: we work with the approach using a complete (L0) mirror of the source, where all look-ups are done against the L0 layer. The approach without the L0 mirror we will describe another time.
Basic terms:
  • CDC - Change Data Capture, a method of getting changes in source data based on redo log information on the source side.
  • L0 layer - complete mirror of the replicated source objects, maintained by CDC-style technology.
  • L1 layer - Target integrated data after transformation.
  • L0 to L1 ETL - ETL processes transforming data from L0 layer form to L1 layer form.
  • Pipe - Information about keys of changed records.
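
As a minimal sketch (all names hypothetical), an L0 mirror table may look like this; the DELETED_FLAG column anticipates the soft-delete requirement discussed at the end of this article:

CREATE TABLE l0_account (
  acc_id       NUMBER NOT NULL,      -- source primary key replicated by CDC
  acc_name     VARCHAR2(100),
  balance      NUMBER,
  deleted_flag CHAR(1) DEFAULT 'N',  -- soft delete: rows are marked, not removed
  cdc_scn      NUMBER,               -- change ordering information from the redo log
  CONSTRAINT pk_l0_account PRIMARY KEY (acc_id)
);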

So, there is a Pipe. Now we just have to decide what the pipe should contain, how to manage the consumers of each pipe, etc.
Pipe classifications:

  • Type of contained key - the simplest way to fill the pipe is to fill it directly with information from CDC - in that case it contains the source primary key of the record. Another possibility is that the pipe contains target record identifiers, which usually brings a requirement for look-ups during the L0 load phase.
    • PIS pipe - source identifiers contained in the pipe. Such a pipe can have several consumers, depending on the number of ETL processes using the same source.
    • PIT pipe - target identifiers contained in the pipe. Such a pipe has just one consumer - the ETL process it is associated with - but it is filled from several sources, depending on the number of sources used in the particular ETL mapping.
  • Type of consuming management - we have to consider that records are posted to the pipe, used by consumers, and after usage by all consumers they should be deleted. A sketch of both variants follows this list.
    • Cyclic sequence method - each pipe record has a sequence number; we identify the range of consumed sequences and then delete those records from the pipe. This method is very efficient but suitable mostly for a single consumer of the pipe, not for several.
    • Flag method - the pipe contains a flag that is updated by the consuming sessions, and the record is deleted once all flags are in the final state. It costs more database operations, but it allows several consumers of one pipe.
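
As promised above, a minimal sketch of both variants (names are hypothetical, a sketch rather than a definitive implementation):

-- Cyclic sequence method: one consumer, purged by the consumed sequence range
CREATE TABLE pis_account (
  pipe_seq  NUMBER NOT NULL,       -- filled from a sequence; identifies the consumed range
  acc_id    NUMBER NOT NULL,       -- source primary key taken from CDC
  change_ts DATE DEFAULT SYSDATE
);

-- Flag method: one flag per consumer, record deleted when all flags reach the final state
CREATE TABLE pis_account_mc (
  acc_id    NUMBER NOT NULL,
  change_ts DATE DEFAULT SYSDATE,
  etl1_flag CHAR(1) DEFAULT 'N',   -- consumed by ETL process 1?
  etl2_flag CHAR(1) DEFAULT 'N'    -- consumed by ETL process 2?
);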


PIS - Source wise Pipes
Source pipes are much easier to fill but much more complicated to use. First, they require a multi-consumer model. That can be solved either by a pipe multiplicator or by a system of several flags, one for each consumer. In the end, the mapping needs several clones or parametric variants, one for each incoming pipe.
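
With the flag variant sketched earlier, the consuming sessions could work along these lines (again just a sketch under the same hypothetical names):

-- Each consumer marks the records it has processed (within its own transaction)
UPDATE pis_account_mc
SET    etl1_flag = 'Y'
WHERE  etl1_flag = 'N';

-- A purge job removes the records consumed by all registered consumers
DELETE FROM pis_account_mc
WHERE  etl1_flag = 'Y'
AND    etl2_flag = 'Y';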
Let's start with the example we used in the previous article.

The following picture shows the distribution of PIS pipes to the tasks of ETL mappings:

The following pattern shows an example of how to prepare the Merge source:

WITH PIS as(
..pipe select
)
,SRC as (
..select source transformation, written the way allowing Push predicate from join with pipe
)
SELECT 
 .. target columns
FROM PIS 
JOIN SRC ON .. pipe determination condition 
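
A concrete, purely illustrative instance of the pattern, reusing the hypothetical PIS_ACCOUNT pipe and L0_ACCOUNT mirror from above:

WITH PIS AS (
  SELECT DISTINCT acc_id FROM pis_account   -- keys of changed source records
),
SRC AS (
  SELECT a.acc_id, a.acc_name, a.balance    -- source transformation, simple enough for
  FROM   l0_account a                       -- the optimizer to push the PIS predicate in
  WHERE  a.deleted_flag = 'N'
)
SELECT s.acc_id, s.acc_name, s.balance      -- target columns
FROM   PIS p
JOIN   SRC s ON s.acc_id = p.acc_id;        -- pipe determination condition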

In the PIS concept the Delete pattern has to be designed separately. It requires adding the source identifiers of each source to the target table. For example, when the transformation joins three source table instances, each with a single-column primary key, you have to add three "service" columns to the target table.

...
From <Source1> as S1
Join <Source2> as S2 on <condition for S2>
Join <Source2> as S3 on <condition for S3> 

In the example above you can see that we use two source tables, the second one in two instances. The target table will therefore contain three source identifiers, and the example also explains why we use the ALIAS name as a prefix in the transformation definition instead of a short name of the source table. We will have to add three new service columns:

 ,S1_<name of primary key column of Source1> <datatype>
 ,S2_<name of primary key column of Source2> <datatype>
 ,S3_<name of primary key column of Source2> <datatype>
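
In concrete terms (hypothetical names again), for a target TGT_CONTRACT built from the three aliased instances above:

ALTER TABLE tgt_contract ADD (
  s1_acc_id   NUMBER,   -- primary key of Source1, alias S1
  s2_party_id NUMBER,   -- primary key of Source2, alias S2
  s3_party_id NUMBER    -- primary key of the second instance of Source2, alias S3
);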

The following pattern shows an example of how to prepare the Delete source (both the Merge and Delete operations can be done in one statement):

WITH PIS as(
..pipe select
)
,SRC as (
..select source transformation, written the way allowing Push predicate from join with pipe
)
SELECT 
 .. target columns
FROM PIS 
JOIN <Target table> ON ..target records for the PIS values, join based on the source identifier in the target
LEFT JOIN SRC ON .. pipe determination condition 
WHERE ...anti-join condition when there is nothing found on the source side
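
A hedged, concrete sketch of the Delete source, reusing the hypothetical names from above; the target is found through the stored service column, and the anti-join detects keys that vanished from the source transformation:

WITH PIS AS (
  SELECT DISTINCT acc_id FROM pis_account
),
SRC AS (
  SELECT a.acc_id
  FROM   l0_account a
  WHERE  a.deleted_flag = 'N'
)
SELECT    t.contract_id                              -- target records to (soft) delete
FROM      PIS p
JOIN      tgt_contract t ON t.s1_acc_id = p.acc_id   -- join via the stored source identifier
LEFT JOIN SRC s ON s.acc_id = p.acc_id
WHERE     s.acc_id IS NULL;                          -- anti-join: nothing left on the source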

Note: the PIS approach is efficiently used with Oracle Materialized Views for fast refresh.
Advantages and disadvantages of the PIS approach

  • Advantages
    • Straight approach with no additional look-ups - the approach keeps all the look-ups concentrated in the target patterns, so they are not done twice.
    • Concentrated logic - all the transformation logic stays in the places it belongs to. The transformation of L1 does not affect the L0 loads and no intermediate layer is required (well, except for the multi-consumer pipe solution).
  • Disadvantages
    • Replication of ETL - many instances of the same ETL, one for each source pipe. The deployment and maintenance problems can be mitigated by a metadata-driven ETL solution, but there may be additional infrastructure costs, e.g. running many sessions in Informatica PowerCenter. That way the approach suits more efficient environments better than the originally batch-based ETL tools.


PIT - Target wise Pipes
Target pipes are much, much more complex to fill but much more comfortable to use. First, they can be realized as a single-consumer model. The strongest points of the solution are the easy and quick realization of the L1 ETL and the easy way to do the Delete operation in one step together with the Merge. The problem is that we have to know all the PITs at the moment of the L0 definition, so the L0 loads are no longer independent. The CDC sessions then become really complex and hard to manage.

Merge and Delete operations can be done in one step:

WITH PIT as(
..pipe select
)
,SRC as (
..select source transformation, written the way allowing Push predicate from join with pipe
)
SELECT 
 .. target columns
FROM PIT 
LEFT JOIN SRC ON .. pipe determination condition 

According to the pattern above we work with two alternatives based on the result:

  • A record from SRC was found - leads to a Merge (Insert/Update) operation because the source record exists.
  • A record from SRC was not found - leads to a Delete operation: the source record does not exist, but its key was in the PIT, so for some reason it disappeared from the source transformation and the target record should be deleted (or marked deleted).
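
In Oracle, for example, both branches can be covered by a single MERGE whose matched branch carries a DELETE WHERE clause. A sketch with hypothetical names, where PIT_CONTRACT holds the target identifiers and L0_CONTRACT is the mirrored source:

MERGE INTO tgt_contract t
USING (SELECT p.contract_id
             ,s.amount
             ,CASE WHEN s.contract_id IS NULL THEN 'Y' ELSE 'N' END AS to_delete
       FROM  (SELECT DISTINCT contract_id FROM pit_contract) p   -- the PIT pipe
       LEFT JOIN
             (SELECT c.contract_id, c.amount
              FROM   l0_contract c
              WHERE  c.deleted_flag = 'N') s                     -- the source transformation
       ON s.contract_id = p.contract_id) q
ON (t.contract_id = q.contract_id)
WHEN MATCHED THEN UPDATE SET t.amount = q.amount
  DELETE WHERE q.to_delete = 'Y'        -- the source record disappeared: delete the target
WHEN NOT MATCHED THEN INSERT (contract_id, amount)
  VALUES (q.contract_id, q.amount)
  WHERE q.to_delete = 'N';              -- never insert a key that only disappeared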


Advantages and disadvantages of the PIT approach

  • Advantages
    • Simplified L1 ETL - the module is implemented in one instance.
    • Easy Delete operation - a single-step delete with no querying of the target. For the same reason it does not need additional columns in the target table.
  • Disadvantages
    • Shattered logic of ETL - the logic is split between the L1 ETL and the processing of PITs.
    • Complex L0 load sessions - the look-ups necessary to fill the PITs slow down the L0 loads and make them less manageable.


PIS to PIT - both source and target wise pipes in one solution
The solution combines the advantages of both approaches mentioned above. In L0, just the PIS pipes are filled, so the L0 loads are independent of further processing and of the L1 ETL mappings. Between the L0 loads and the L1 loads a new element, PIS2PIT, is placed; it handles all the look-ups and fills the PITs based on the PISes. The L1 ETL is solved the same way as in the PIT solution.

The process of PIS2PIT is the following:

  • The L0 process fills the PIS, based on the CDC information, as a single-consumer pipe.
  • The PIS2PIT process fills several PITs, based on the PIS information, as single-consumer pipes. It dispatches the PISes to all the necessary PITs. All look-ups done in the process use deleted (soft-deleted) records in L0 as well, to support keys based on just-deleted rows. A sketch follows this list.
  • The L1 process works based on the PIT and uses only the not-deleted rows from the L0 layer.
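
A minimal sketch of one PIS2PIT dispatch under the hypothetical names used above; note the look-up deliberately does not filter out soft-deleted L0 rows, so that "former" keys can still be resolved:

INSERT INTO pit_contract (contract_id)
SELECT DISTINCT c.contract_id
FROM   pis_account p          -- source-wise pipe filled by the L0 load
JOIN   l0_contract c          -- look-up resolving the target identifier; no
ON     c.acc_id = p.acc_id;   -- DELETED_FLAG filter, soft-deleted rows must resolve too
-- afterwards the consumed PIS range is purged (cyclic sequence or flag method)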

The following schema illustrates the complexity of PIS to PIT relationships. PIS to PIT is designed to delegate the complex peer-to-peer dependencies to a separate node:


... so just another way how to piss to pit


Advantages and disadvantages of the PIS to PIT approach

  • Advantages
    • Simplified L1 ETL - the module is implemented in one instance.
    • Easy Delete operation - a single-step delete with no querying of the target. For the same reason it does not need additional columns in the target table.
    • Fully independent L0 loads - the intermediate element frees the L0 completely from the L1 logic.
  • Disadvantages
    • Shattered logic of ETL - the logic is split between the L1 ETL and the processing of PITs. It is not as bad as in the PIT approach, but the PIS2PIT element still depends on the L1 ETL logic.
    • Additional step PIS2PIT - can slightly increase latency. It is the price for improving the manageability of the whole solution.
    • L0 tables require a "soft delete" approach - records cannot simply be deleted, just marked as deleted. The PIS2PIT process has to access even the deleted records in L0 to construct the "former" key, as sketched below.
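
As sketched in the list above, this constraint means the L0 apply process translates a CDC delete into an update (a sketch, same hypothetical names):

-- instead of: DELETE FROM l0_account WHERE acc_id = :key
UPDATE l0_account
SET    deleted_flag = 'Y'
WHERE  acc_id = :key;   -- the row stays available for PIS2PIT look-ups of "former" keys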

Ludek Bob Jankovsky