Data Integrity Via Process Synchronization Across Multiple Computers in A Network Environment
By Slavko Stemberger
In a data warehouse, it is sometimes necessary for an operational system to pass data to the data warehouse via transport tables. If the operational system and the data warehouse reside on different computers and possibly different operating systems, maintaining these transport tables can get very difficult.
I will show that traditional methods of process synchronization do not work in this type of environment and one possible solution that works consistently. I have used a variation of this solution to solve the process synchronization problem in a NT/Unix environment.
Assume that the operational system loads and purges the transport tables while the warehouse load process can only read the tables. Also assume that the data in the transport tables is transactional in nature and, once sent to the warehouse, will never again be re-sent. This constraint allows the amount of data sent to the warehouse to be relatively small.
The questions that must be answered:
If this were a batch system on one computer, it would be possible to schedule the two processes sequentially in which case the problem can be dealt with in a fairly simplistic way. Since this is not the case, we must look for a more general solution. One such solution is to have the warehouse load process be dependent on the successful completion of the operational system.
Unfortunately, this dependency mechanism will not work in this scenario either.
In order to show why this mechanism will not work, let's assume that the dependency is based on the existence of the file DataLoaded.dat.
The Operational System Process:
if DataLoaded.dat exists then delete DataLoaded.dat to prevent the warehouse load from running. else purge the transport tables end if; do the operational processing and load the transport tables if no errors occurred then create file DataLoaded.dat end if;
The Warehouse Load Process:
while DataLoaded.dat does not exists loop wait some pre-determined amount of time end loop; load the data in the transport tables into the data warehouse and delete file DataLoaded.dat.
Since the Warehouse load process is dependent on the successful completion of the operational system, it will never start running before the operational system has completed successfully.
The reverse is not true. The operational system could start while the warehouse process is extracting the data. When this happens, and it will happen, the operational system believes that the warehouse load has not yet picked up the data, so it adds data to the transport tables.
One of the following can happen:
In addition to the above, the following will also occur:
We could use a more sophisticated two-way dependency to improve our chances of getting the data loaded into the warehouse. Let us use the following files (the names indicate their purpose):
The Operational System process:
while WarehouseLoading.dat exists loop wait some pre-determined amount of time (to give the warehouse load time to complete) end loop; create OperationalSysLoading.dat if WarehouseCompleted.dat exists then purge the transport tables delete WarehouseCompleted.dat end if; delete OperationalSysLoaded.dat (in case it it still there) do the operational processing and load the transport tables if no errors occurred then create OperationalSysLoaded.dat delete OperationalSysLoading.dat end if;
The Warehouse Load Process:
while OperationalSysLoading.dat exists loop wait some pre-determined amount of time (to give the opeational system time to complete) end loop; create WarehouseLoading.dat load data from the transport tables if all went well then create WarehouseLoaded.dat delete WarehouseLoading.dat end if;
This more complex dependency may appear to be fool-proof and for the most part it is, but not quite 100%!
Let us examine the case where the warehouse process is late and starts up at the same time as the operational system. This is what will happen (note that the file OperationalSysLoaded.dat will be the only file that exists):
Step |
Operational System |
Warehouse Load |
1 |
Loop checking for WarehouseLoading.dat terminates on first test |
Loop checking for OperationalSysLoading.dat terminates on first test |
2 |
Create OperationalSysLoading.dat and delete OperationalSysLoaded.dat (the purge step is note done because the WarehouseLoaded.dat file has not yet been created) |
Create WarehouseLoading.dat |
3 |
Do Operational process and load transport tables |
Load data from the transport tables |
Both systems terminate step 1 without any waiting, thinking that it is Ok to proceed. Once the two systems are at their 2nd step, we have the same problems as we did in the first, simpler version of the dependency mechanism.
Note that the probability of this happening is very small but not small enough to be ignored.
The whole problem exists because step one of each of the two processes is critical code that cannot run concurrently with the other. We must find a mechanism that will allow us to create a gateway that allows only one process to enter at a time.
It turns out that building such a gateway is rather simple.
Start by creating a one column/one row state table, LOAD_STATE, with the column CURRENT_STATE defined as a varchar.
The values allowed for the CURRENT_STATE column will be:
Operational system Process:
select CURRENT_STATE from LOAD_STATE for update of CURRENT_STATE; if CURRENT_STATE = 'OperationalSysLoading' then terminate with error - the previous run of the operational system failed and the problem has not yet been resolved. elsif CURRENT_STATE = 'WarehouseLoading' then commit (to release the locked table) restart this process after waiting some pre-determined amount of time (to give the warehouse load time to complete) elsif CURRENT_STATE in ('WarehouseLoaded', 'OperatinalSysLoading' ) then save the value of CURRENT_STATE update LOAD_STATE set CURRENT_STATE = 'OperationalSysLoading'; commit; end if; if the saved value of CURRENT_STATE is 'WarehouseLoaded' then purge the transport tables end if; do the operational processing and load the transport tables update LOAD_STATE set CURRENT_STATE = 'OperationalSysLoaded'; commit;
Warehouse Load Process:
select CURRENT_STATE from LOAD_STATE for update of CURRENT_STATE; if CURRENT_STATE = 'WarehouseLoading' then terminate with error - the previous run of the operational system failed and the problem has not yet been resolved. elsif CURRENT_STATE = 'OperationalSysLoading' then commit (to release the locked table) restart this process after waiting some pre-determined amount of time (to give the operational system time to complete) elsif CURRENT_STATE = 'WarehouseLoaded' then (No data to load yet - we must wait) commit (to release the locked table) restart this process after waiting some pre-determined amount of time (to give the operational system time to complete) elsif CURRENT_STATE = 'OperatinalSysLoaded' then update LOAD_STATE set CURRENT_STATE = 'WarehouseLoading' commit; end if; Load the data in the transport tables update LOAD_STATE set CURRENT_STATE = 'WarehouseLoaded' commit;
Since ORACLE does not allow two processes to lock the same row in a table by more than one process, we are guaranteed that only one of the two processes will be allowed to run to completion at a time. The select statement has become a gateway that allows only one process to pass through at a time even though the two process are running on different computers and operating system.
We have now guaranteed ourselves data integrity by synchronizing the two processes.