Long ago, when I started to design ETL systems, I realized that robust ETL process management is as important as any other pillar of the ETL approach. Most current workflow systems focus on easy visualization in a structured form.
Structured workflow systems are good for simple process management, but I have yet to see one with more than a little focus on robustness and performance. For that you need a fine-grained, heap-based workflow system that allows the use of operational research methods.
What's different?
When you put a task into a structured workflow, you wire it into a certain place where it will be executed. In a heap workflow you put all tasks into one heap and define constraints on their execution. The constraints are of the following types:
Predecessor dependency (solves dependencies coming from referential integrity, data readiness, etc.)
Resource consumption (solves system resource balancing and exclusivity of filling certain segments; using pending resource consumption it can also handle temporary unavailability of certain objects)
Manageability
The issue of manageability should be solved at the definition level, ideally by way of the 'process interface' method between local workflow models. The runtime process workflow model should contain metadata at the finest grain, and the execution of the workflow should be managed by methods of operational research. There are two levels of metadata definition, with two different sets of requirements on them:
Logical, definition level (manageable and good for documentation)
Physical, engine level (efficient at runtime)
From the first level we generate the metadata for the second one. That way the requirements on the first level do not harm the quality of the second one, which is what happens with most current solutions. Process workflow should be modeled just like data structures: just as we make a data model first and then create tables, indexes, constraints, etc., we have to model the process workflow first as well.
In most processes we work with just one instance of the load process at any moment. In advanced workflows several loads can coexist simultaneously, which is much more complex, as we have to consider not only dependencies between tasks but also dependencies between two instances of the same task. We will start with the first variant, as it covers the majority of common cases.
Common elements of a heap model:
T as Task .. the basic element of the process workflow causing some action, in most cases the run of an ETL module (ETL mapping) with certain parameters.
R as Resource .. any real or virtual resource defined to limit parallel task runs. Resources can either be based on real resources, e.g. CPU time, IO, a lock on a particular segment, or be virtual resources covering a group of resources when it is not easy to specify the real resource consumption of tasks. Best practice is to start with a common performance resource preventing big tasks from running together, with possible refinement to real resources based on statistics gathered at runtime.
D(T,T) as Dependency .. a predecessor-successor dependency between two tasks. Best practice has shown it is the only kind of dependency that needs to be modeled. Mutual exclusion, which is often defined, is usually just a result of a resource consumption conflict and is better solved by Resources.
A task can't be executed until all its predecessors are done in the current load.
R(T) as Resource consumption .. specifies the consumption of a certain resource by a certain task.
A task can't be executed until all required resources are currently free in the proper amount.
Tc as Task run condition .. an optional property of a task which defines whether the task should be executed or skipped. Optional running of particular groups of tasks should be solved by such conditions on each task, or by defining a separate heap when the group is too large and truly separate.
Tp as Task priority .. when several tasks satisfy all dependency and resource requirements, execution is ordered by priority. Priorities are filled offline from statistics gathered at runtime, using operational research heuristic models.
Tt as Task estimated time .. an attribute of the task filled from runtime statistics as the average execution time.
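To make the design-time elements concrete, here is a minimal sketch in Python of how such heap metadata could be represented. The class and field names are my own illustration, not a prescribed schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Minimal sketch of the design-time (logical) heap metadata.
# All names here are illustrative, not a prescribed schema.

@dataclass
class Resource:
    name: str
    available: int = 100                 # total available amount, e.g. a percentage

@dataclass
class Task:
    name: str
    module: str                          # ETL module (mapping) to run
    params: dict = field(default_factory=dict)
    run_condition: str | None = None     # Tc: decides whether to execute or skip
    priority: int = 0                    # Tp: filled offline from runtime statistics
    estimated_time: float = 0.0          # Tt: average execution time from statistics
    consumption: dict[str, int] = field(default_factory=dict)  # R(T): resource -> amount

@dataclass
class Dependency:                        # D(T,T): predecessor -> successor
    predecessor: str
    successor: str
```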
Runtime elements:
L as Load .. one run of the process, e.g. a daily load.
T(L) as Task run .. represents the run of a certain task in a certain load.
T(L)s as Task runtime status .. the runtime status (i.e. None, Wait, Running, Failed, Done). Status 'Wait' requires an additional attribute holding the time of the first possible run. Status 'Failed' requires a description of the failure and the number of occurrences. Status 'Done' should support a flag for a skipped run.
X as Thread .. a runtime background process seeking a proper task to execute and executing it. After finishing (whether successfully or not), the thread seeks another runnable task and executes it. When there are no more executable tasks, the thread falls asleep for a while. The number of threads limits inter-task parallelism, so there are typically around 10 or more threads active in an ETL workflow engine. In former implementations a sleeping thread woke up after a certain period, planning its own start by itself. In an advanced solution a sleeping thread is also woken by another thread that has finished its task and found more than one runnable task.
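The runtime elements can be sketched in the same illustrative way; again, the names below are assumptions for the sake of the example.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

# Sketch of the runtime metadata; names are illustrative only.

class TaskStatus(Enum):
    NONE = "None"
    WAIT = "Wait"            # needs wait_until, the time of the first possible run
    RUNNING = "Running"
    FAILED = "Failed"        # needs a failure description and occurrence count
    DONE = "Done"            # may carry a skipped flag

@dataclass
class Load:                  # L: one run of the process, e.g. the daily load
    load_id: int
    business_date: datetime

@dataclass
class TaskRun:               # T(L): run of a certain task in a certain load
    task: str
    load_id: int
    status: TaskStatus = TaskStatus.NONE
    wait_until: datetime | None = None
    failure_message: str | None = None
    failure_count: int = 0
    skipped: bool = False
```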
Heap vs. Process
A heap is the set of all tasks with dependencies and resource consumption defined. For the most effective solution, all daily ETL should be concentrated into one large heap.
Such a heap, however, is hard to manage directly. The fact that all tasks are put on one large heap doesn't mean they have to be designed that way. A process is a logical group of tasks maintained in one process dependency model.
A process workflow model is an oriented network chart. Several process models on the logical level are used to generate one heap of tasks, with dependencies and resource consumption defined.
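To make the generation step concrete, here is a minimal sketch in Python, assuming a hypothetical ProcessModel structure for the logical level; the shapes of the records are illustrative only.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical logical process model and a generator that merges several of
# them into one physical heap for the runtime engine.

@dataclass
class ProcessModel:
    name: str
    tasks: dict = field(default_factory=dict)          # task name -> definition
    dependencies: list = field(default_factory=list)   # (predecessor, successor) pairs
    consumptions: list = field(default_factory=list)   # (task, resource, amount) triples

def generate_heap(models: list[ProcessModel]):
    """Merge the logical models into one heap of tasks with dependencies
    and resource consumption defined (the physical, engine-level metadata)."""
    tasks, dependencies, consumptions = {}, [], []
    for m in models:
        for name, definition in m.tasks.items():
            if name in tasks:
                raise ValueError(f"task {name} is defined in more than one process")
            tasks[name] = definition
        dependencies.extend(m.dependencies)
        consumptions.extend(m.consumptions)
    return tasks, dependencies, consumptions
```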
Process interface
There are several ways to organize things between processes. The most common is to build a kind of dependency graph at the process level, which means any process can depend on the end of any other process; for example, process B can start when process A is finished. That leads to one of two kinds of poor results:
The workflow is not fine-grained and can't be optimized
Processes are too small, so they are hard to manage effectively, and the structure of processes doesn't follow business requirements and competencies.
To allow fine-grained interaction between processes we use a process interface. How does it work?
We define a virtual node in the "source" process that is also visible to the "target" one. The definition of such a node should contain:
owning process
consuming process (or processes)
description of the meaning of the interface, e.g. "table T1 is loaded, table T2 is loaded from source system S1 and table T3 has all key attributes loaded"
That is a contract between the two processes. Dependencies on the side of the source process are the responsibility of the owner of the source process; dependencies on the target side are the responsibility of the interface consumers.
Although the description may seem redundant, since it is depicted by the dependencies, it is precisely the description that forms the contract of the interface. It allows the owners of the source process to make changes in its processing and still know how to reorganize the dependencies to the process interface to fulfill the contract.
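A minimal sketch of how such an interface node might be recorded; the field names are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical representation of a process interface node.

@dataclass
class ProcessInterface:
    name: str
    owning_process: str                     # the "source" process publishing the node
    consuming_processes: list[str] = field(default_factory=list)  # the "target" processes
    description: str = ""                   # the contract, e.g. "T1 loaded, T2 loaded from S1"
```

In the generated heap such a node can act as a virtual task: its incoming dependencies are maintained by the source process owner, its outgoing dependencies by the interface consumers.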
Redundant dependencies
When you finally "draw" your workflow graph (automatically, by the generator, of course), some dependencies may turn out to be redundant. There are two ways to handle redundant (transitive) dependencies:
Not generating them (transitive reduction) and keeping them only at the logical level (in any case always keep them there; they are important for impact analyses and for managing further changes).
Generating them but flagging them as "Redundant", so the engine won't consider them, yet they remain in place so operations staff can do quick impact research when problems occur.
Round dependencies
Round (circular) dependencies could cause a deadlock in workflow processing. The place to check for them is the generation of heap workflow metadata from process metadata, and after any ad-hoc change. The check should be done before committing any change to the dependency structure; round dependencies should be reported and the commit of the changes should not be allowed.
Another kind of deadlock arises when resource consumption is higher than the total available amount of a resource. This can be used deliberately by operations to stop some phases of processing temporarily, so this state should only be reported and notified when processing is actually stopped for that reason.
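As an illustration, here is a minimal sketch in Python of the cycle check that could run when the heap metadata is generated; representing dependencies as (predecessor, successor) pairs is an assumption for the example.

```python
from collections import defaultdict

def find_cycle(dependencies):
    """Return a list of tasks forming a round dependency, or None if the graph is acyclic."""
    graph = defaultdict(list)
    for pred, succ in dependencies:
        graph[pred].append(succ)

    WHITE, GRAY, BLACK = 0, 1, 2       # unvisited, on the current path, finished
    color = defaultdict(int)
    stack = []                          # current DFS path

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:      # back edge -> round dependency
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        color[node] = BLACK
        stack.pop()
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

A commit of dependency changes would then be refused whenever find_cycle returns a result.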
Resources
Resource limitation has the following attributes:
Available amount of the resource (usually 100 as the default initial setting, treated like a percentage)
Resource consumption - we sometimes thought it should be a percentage of the total available amount, but best practice is to keep the available amount modifiable so we can react to scaling issues.
Type of resource (either standard or pending) - a standard resource is blocked in the proper amount during task execution and released after the task finishes (whether successfully or not). A pending resource is rarer: it is blocked at the start of a task and stays blocked after the task finishes. Another task is expected to release it by having a negative consumption set. Pending resources are used to block some actions over intervals exceeding a single task frame.
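A minimal sketch of resource booking and release under these rules, assuming a simple in-memory accounting; the ResourcePool name and its methods are illustrative, not a concrete engine API.

```python
from __future__ import annotations

class ResourcePool:
    """Illustrative accounting of free resource amounts."""

    def __init__(self, available: dict[str, int]):
        self.free = dict(available)            # resource name -> currently free amount

    def can_book(self, consumption: dict[str, int]) -> bool:
        # A task is blocked until every positively consumed resource is free
        # in the required amount.
        return all(self.free.get(r, 0) >= amount
                   for r, amount in consumption.items() if amount > 0)

    def book(self, consumption: dict[str, int]) -> None:
        # Negative amounts release a pending resource blocked by an earlier task.
        for r, amount in consumption.items():
            self.free[r] = self.free.get(r, 0) - amount

    def release(self, consumption: dict[str, int], pending: set[str]) -> None:
        # Standard resources are returned after the task finishes (even on failure);
        # pending resources stay blocked until released by negative consumption.
        for r, amount in consumption.items():
            if r not in pending and amount > 0:
                self.free[r] += amount
```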
Thread
A thread is a background process executing tasks. Each thread works independently and repeatedly does the following:
put a lock on the task-choosing step
select the set of tasks available for execution
select one task for execution and mark it as running
book resources
unlock the task-choosing step
execute the task, log the result, and change the task state based on the result
and so on, as long as any task is available for execution; otherwise the thread switches into sleeping mode for a while. A minimal sketch of this loop follows.
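The following sketch shows one possible shape of such a worker loop. It assumes a resource pool like the one sketched above; select_runnable_task and run_module are hypothetical callables passed in, not part of any real engine, and task states are simplified to strings.

```python
import threading
import time

choose_lock = threading.Lock()          # only one thread chooses a task at a time

def worker(select_runnable_task, run_module, pool, sleep_seconds=60):
    while True:
        with choose_lock:                              # lock the task-choosing step
            task = select_runnable_task()              # deps done, resources free, highest Tp
            if task is not None:
                task.status = "Running"                # mark as running
                pool.book(task.consumption)            # book resources
        # choosing step is unlocked here
        if task is None:
            time.sleep(sleep_seconds)                  # nothing runnable: sleep for a while
            continue
        try:
            run_module(task)                           # execute the ETL module
            task.status = "Done"
        except Exception as exc:
            task.status = "Failed"
            task.failure_message = str(exc)
        finally:
            pool.release(task.consumption, pending=set())
```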
In more advanced solutions, when a thread finishes one task, it checks how many tasks are available for further execution. Based on the result:
if there is no task and another thread is active, it falls asleep for a longer period (about 5 minutes)
if no task remains and no other thread is active, it falls asleep for a shorter period (about one minute)
if there is just one task to execute, it executes it
if there are more tasks available for execution, it executes the one with the highest priority and wakes another thread
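A compact sketch of this decision; count_runnable, other_threads_active and wake_one_sleeping_thread are assumed helpers, and the sleep durations follow the values above.

```python
LONG_SLEEP, SHORT_SLEEP = 5 * 60, 60      # seconds

def after_task(count_runnable, other_threads_active, wake_one_sleeping_thread):
    runnable = count_runnable()
    if runnable == 0:
        # with another thread active we will be woken when work appears,
        # so we can sleep longer; otherwise poll again sooner
        return ("sleep", LONG_SLEEP if other_threads_active() else SHORT_SLEEP)
    if runnable > 1:
        wake_one_sleeping_thread()        # share the surplus work
    return ("execute", None)              # run the highest-priority runnable task
```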