Partitioning strategy patterns - a principal element of Data architecture
by Ludek Bob Jankovsky, 19-Aug-2011 (ETL PATTERNS)
One of the basic chapters of data architecture is the decision about a partitioning strategy for particular data layers, table stereotypes, and ETL patterns.
The following article describes the basic drivers influencing decisions about partitioning strategy. Details on particular patterns will be linked along the way.
Drivers of partitioning:
Data ageing - I place it first in the list because data ageing is the most important and the most forgotten driver in large information systems. Its importance is obvious in fact tables, where we want to delete, compress, move, or archive old data, which is difficult without proper partitioning. It is less obvious with slowly changing data (SCD), where the history of changes is not as huge, but within a time frame of 1-2 years it turns out that the amount of historical records slows processes down and can cause sizing difficulties too.
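To illustrate the point, here is a minimal sketch of typical ageing operations, assuming a hypothetical fact table F_TRANSACTION partitioned by day (all table, partition, and tablespace names are illustrative):

    ALTER TABLE f_transaction DROP PARTITION p20100819;            -- delete one old day
    ALTER TABLE f_transaction MOVE PARTITION p20100820 COMPRESS;   -- compress one old day in place
    ALTER TABLE f_transaction MOVE PARTITION p20100821
      TABLESPACE ts_archive;                                       -- move one old day to cheaper storage

Without partitioning, the same clean-up means expensive DELETE statements and index maintenance over the whole table.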
Processing independence - the second issue we solve with partitioning. When we process ETL tasks, we often orchestrate the workflow in such a way as to avoid locks caused by simultaneous access to the same objects. Partitioning driven by this driver (e.g. partitioning by source system) allows us to make particular threads of ETL processes more independent.
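As a sketch of that idea, a stage table list-partitioned by source system lets each loading thread work in its own partition (the table, column, and partition names below are my assumptions):

    CREATE TABLE stg_account (
      source_system  VARCHAR2(10)  NOT NULL,
      account_no     VARCHAR2(30)  NOT NULL,
      balance        NUMBER
    )
    PARTITION BY LIST (source_system) (
      PARTITION p_crm  VALUES ('CRM'),
      PARTITION p_core VALUES ('CORE'),
      PARTITION p_card VALUES ('CARD')
    );

    -- each ETL thread can truncate and reload its own partition
    -- without blocking the others:
    ALTER TABLE stg_account TRUNCATE PARTITION p_crm;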
Performance - the third, no less important driver. Driven by this one, we follow the requirements of either the most usual or the most critical queries and joins to make the system work more efficiently.
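For example, with the hypothetical stage table above, a query restricted by the partitioning key touches a single partition instead of the whole table thanks to partition pruning:

    SELECT SUM(balance)
      FROM stg_account
     WHERE source_system = 'CRM';   -- the optimizer prunes down to partition p_crm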
Stereotypes of partitioned tables:
Fact tables - are usually partitioned based on snapshot time. The most favoured partitioning is one to one, which means one load batch is placed in one partition; for a daily fact, one partition is created per day. There are three possibilities for how the partitioning can be done:
Range partitioning - is used mostly because it is the oldest way to partition and it allows both the one to one and the one to many approach (e.g. monthly partitions on daily facts). Its disadvantage in one to one partitioning is that it does not support equipartitioning unless the sets of partitions are exactly the same: we know there is one day in one partition, but the system cannot rely on it, as it is not constrained technically (see the sketch below). Because of that, I recommend using one of the following approaches for one to one partitions in fact tables.
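A minimal sketch of one to one range partitioning on a hypothetical daily fact; note that only the boundaries, not the partitions themselves, carry the one-day meaning:

    CREATE TABLE f_transaction (
      snapshot_date  DATE          NOT NULL,
      account_no     VARCHAR2(30),
      amount         NUMBER
    )
    PARTITION BY RANGE (snapshot_date) (
      PARTITION p20110818 VALUES LESS THAN (DATE '2011-08-19'),
      PARTITION p20110819 VALUES LESS THAN (DATE '2011-08-20')
    );

    -- nothing technically prevents rows of any older day from landing
    -- in p20110818, which is the weakness described above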
List partitioning - a much better solution for one to one partitions. In current solutions that are effectively one to many (e.g. a daily fact whose date contains a time fraction), I would recommend using a virtual column as the partitioning key and a list partition instead. List partitioning assures the system that one daily snapshot is placed in one certain partition, as sketched below.
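A sketch of the same daily fact with a time fraction in the snapshot column, using an Oracle 11g virtual column as the list partitioning key (names are illustrative):

    CREATE TABLE f_transaction_l (
      snapshot_dttm  DATE          NOT NULL,
      account_no     VARCHAR2(30),
      amount         NUMBER,
      snapshot_day   DATE GENERATED ALWAYS AS (TRUNC(snapshot_dttm)) VIRTUAL
    )
    PARTITION BY LIST (snapshot_day) (
      PARTITION p20110818 VALUES (DATE '2011-08-18'),
      PARTITION p20110819 VALUES (DATE '2011-08-19')
    );

    -- here the partition definition itself guarantees that one daily
    -- snapshot sits in exactly one partition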
Interval partitioning - is a new way to manage partitioning. You specify a date column and the kind of interval to partition by. It allows both one to one partitioning, when we use a daily interval for daily facts, and one to many. Although it is a kind of range partitioning, which data fall into which partition is defined and constrained technically. The only disadvantage is the naming of the automatically created partitions, but a workaround is present on this site.
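A sketch of daily interval partitioning (an Oracle 11g feature); only the first partition is named, the rest are created automatically:

    CREATE TABLE f_transaction_i (
      snapshot_date  DATE          NOT NULL,
      account_no     VARCHAR2(30),
      amount         NUMBER
    )
    PARTITION BY RANGE (snapshot_date)
    INTERVAL (NUMTODSINTERVAL(1, 'DAY')) (
      PARTITION p_first VALUES LESS THAN (DATE '2011-01-01')
    );

    -- the partition for 19-Aug-2011 appears automatically with the first
    -- inserted row; its system-generated name (SYS_Pnnn) can be fixed
    -- afterwards with ALTER TABLE ... RENAME PARTITION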
Slowly Changing Data - are also known as SCD, which literally stands for slowly changing dimensions. The most common approach to storing SCD is to multiply records, one for each variant, where the validity of each row is defined by a start date and an end date.
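For illustration, a common shape of such a table (the names and sample values are mine, not a prescribed standard):

    CREATE TABLE d_customer_hist (
      customer_id      NUMBER        NOT NULL,
      customer_name    VARCHAR2(100),
      segment          VARCHAR2(20),
      valid_from_date  DATE          NOT NULL,
      valid_to_date    DATE          NOT NULL  -- e.g. 31-Dec-9999 for the open variant
    );

    -- one customer is stored as several rows, one per variant:
    --   42, 'John Doe', 'MASS',     01-Jan-2010, 14-Mar-2011
    --   42, 'John Doe', 'AFFLUENT', 15-Mar-2011, 31-Dec-9999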
No ageing partitioning - I was tempted not to mention it at all ... but finally I gave up. It is the most common approach used in current systems ... how naive for systems we expect to keep working for several years.
Range partitioning on VALID_TO_DATE - This approach is the most common and useful one. It supports ageing, as all records no longer valid since some date are placed in a certain partition. It also supports queries: the newer the data we ask for, the fewer partitions a query accesses and the faster it computes. To avoid record migration when closing records, you have to keep the last partition wider and split it ex-post, as sketched below.
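A sketch of this pattern on a partitioned variant of the hypothetical table above; the last partition is kept wide so that closing a record does not move it between partitions, and is split ex-post once its date range is settled:

    CREATE TABLE d_customer_hist_r (
      customer_id      NUMBER        NOT NULL,
      customer_name    VARCHAR2(100),
      valid_from_date  DATE          NOT NULL,
      valid_to_date    DATE          NOT NULL
    )
    PARTITION BY RANGE (valid_to_date) (
      PARTITION p2010   VALUES LESS THAN (DATE '2011-01-01'),
      PARTITION p2011h1 VALUES LESS THAN (DATE '2011-07-01'),
      PARTITION p_open  VALUES LESS THAN (MAXVALUE)  -- wide partition for recent and open records
    );

    -- once no more records will be closed before 2012, split ex-post:
    ALTER TABLE d_customer_hist_r SPLIT PARTITION p_open
      AT (DATE '2012-01-01')
      INTO (PARTITION p2011h2, PARTITION p_open);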
EOM partitioning - is a list partitioning based on the month of the record. Technically, it requires closing the validity of all records at the end of each month. That allows us to easily access all values valid at EOM (which is the usual date of many monthly analyses) and places all data valid at a certain month into a certain partition, but it increases the size of the data. In my experience, one instance of slowly changing data makes about 5 changes annually on average; increasing that to at least 12 changes makes the data roughly three times bigger. Still, the profit of fast access to data regardless of their age can make it a win in many cases. A sketch follows below.
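Finally, a sketch of EOM partitioning, again using a virtual month column as the list partitioning key (names are illustrative); open records fall into the DEFAULT partition:

    CREATE TABLE d_customer_hist_m (
      customer_id      NUMBER        NOT NULL,
      customer_name    VARCHAR2(100),
      valid_from_date  DATE          NOT NULL,
      valid_to_date    DATE          NOT NULL,
      valid_to_month   DATE GENERATED ALWAYS AS (TRUNC(valid_to_date, 'MM')) VIRTUAL
    )
    PARTITION BY LIST (valid_to_month) (
      PARTITION p201106 VALUES (DATE '2011-06-01'),
      PARTITION p201107 VALUES (DATE '2011-07-01'),
      PARTITION p_open  VALUES (DEFAULT)
    );

    -- because every record is closed at EOM, the complete snapshot valid
    -- at 30-Jun-2011 sits in the single partition p201106:
    SELECT *
      FROM d_customer_hist_m
     WHERE valid_to_month = DATE '2011-06-01'
       AND valid_to_date  = DATE '2011-06-30';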