Data Synchronization Across Multiple Computers in A Network Environment

Data Integrity Via Process Synchronization Across Multiple Computers in A Network Environment

By Slavko Stemberger
Warehouse Architect/Data Architect/Developer
© Copyright 1995 by Starsoft Inc.

In a data warehouse, it is sometimes necessary for an operational system to pass data to the data warehouse via transport tables. If the operational system and the data warehouse reside on different computers and possibly different operating systems, maintaining these transport tables can get very difficult.

I will show that traditional methods of process synchronization do not work in this type of environment and one possible solution that works consistently. I have used a variation of this solution to solve the process synchronization problem in a NT/Unix environment.

Assume that the operational system loads and purges the transport tables while the warehouse load process can only read the tables. Also assume that the data in the transport tables is transactional in nature and, once sent to the warehouse, will never again be re-sent. This constraint allows the amount of data sent to the warehouse to be relatively small.

The questions that must be answered:

How does the operational system know that it is safe to clean out the transport tables and
How does the warehouse load process know that there is new data waiting for it to pickup.

If this were a batch system on one computer, it would be possible to schedule the two processes sequentially in which case the problem can be dealt with in a fairly simplistic way. Since this is not the case, we must look for a more general solution. One such solution is to have the warehouse load process be dependent on the successful completion of the operational system.

Unfortunately, this dependency mechanism will not work in this scenario either.

In order to show why this mechanism will not work, let's assume that the dependency is based on the existence of the file DataLoaded.dat.

The Operational System Process:


   if DataLoaded.dat exists
   then
      delete DataLoaded.dat to prevent 
      the warehouse load from running.
   else
      purge the transport tables
   end if;
 
   do the operational processing and load
   the transport tables
 
   if no errors occurred
   then
      create file DataLoaded.dat
   end if;

The Warehouse Load Process:


   while DataLoaded.dat does not exists
   loop
      wait some pre-determined amount of time
   end loop;
   load the data in the transport tables
   into the data warehouse and
   delete file DataLoaded.dat.

Since the Warehouse load process is dependent on the successful completion of the operational system, it will never start running before the operational system has completed successfully.

The reverse is not true. The operational system could start while the warehouse process is extracting the data. When this happens, and it will happen, the operational system believes that the warehouse load has not yet picked up the data, so it adds data to the transport tables.

One of the following can happen:

The warehouse load does not see any of the data added to the transport tables by the operational system so all the original data is loaded properly.
The warehouse load sees some but not all the data because it has processed some of the transport tables before the operational committed them. In this case the warehouse load process will see inconsistent or incomplete data.

In addition to the above, the following will also occur:

If the warehouse load completes before the operational system, the load process will try to delete the DataLoaded.dat file unsuccessfully because it has already been deleted by the operational system. The operational system, on the other hand, will re-create this file when it has completed. Note that, in this case, no data is deleted even though at least some of it has been loaded into the warehouse. The next time the warehouse load process is run, it will load the data that was loaded on the previous run as well as the new data.
If the operational system completes before the warehouse load deletes the Dataloaded.dat file, this file will have been deleted and re-created by the operational system just in time for the warehouse load to delete it again. When the operational system runs again, it will purge the transport tables of both the data that was loaded by warehouse load as well as the data that was missed. We have just lost some data!

We could use a more sophisticated two-way dependency to improve our chances of getting the data loaded into the warehouse. Let us use the following files (the names indicate their purpose):

WarehouseLoading.dat
WarehouseCompleted.dat
OperationalSysLoading.dat
OperationalSysLoaded.dat

The Operational System process:


   while WarehouseLoading.dat exists
   loop
      wait some pre-determined amount of time
      (to give the warehouse load time to
       complete)
   end loop;
 
   create OperationalSysLoading.dat
   if WarehouseCompleted.dat exists
   then
      purge the transport tables
      delete WarehouseCompleted.dat
   end if;
   delete OperationalSysLoaded.dat (in case it
   it still there)
 
   do the operational processing and
   load the  transport tables
   if no errors occurred
   then
      create OperationalSysLoaded.dat
      delete OperationalSysLoading.dat
   end if;

The Warehouse Load Process:


   while OperationalSysLoading.dat exists
   loop
      wait some pre-determined amount of time
      (to give the opeational system time to
       complete)
   end loop;
 
   create WarehouseLoading.dat
   load data from the transport tables
   if all went well
   then
      create WarehouseLoaded.dat
      delete WarehouseLoading.dat
   end if;

This more complex dependency may appear to be fool-proof and for the most part it is, but not quite 100%!

Let us examine the case where the warehouse process is late and starts up at the same time as the operational system. This is what will happen (note that the file OperationalSysLoaded.dat will be the only file that exists):

Step	Operational System	Warehouse Load
1	Loop checking for WarehouseLoading.dat terminates on first test	Loop checking for OperationalSysLoading.dat terminates on first test
2	Create OperationalSysLoading.dat and delete OperationalSysLoaded.dat (the purge step is note done because the WarehouseLoaded.dat file has not yet been created)	Create WarehouseLoading.dat
3	Do Operational process and load transport tables	Load data from the transport tables

Both systems terminate step 1 without any waiting, thinking that it is Ok to proceed. Once the two systems are at their 2^nd step, we have the same problems as we did in the first, simpler version of the dependency mechanism.

Note that the probability of this happening is very small but not small enough to be ignored.

The whole problem exists because step one of each of the two processes is critical code that cannot run concurrently with the other. We must find a mechanism that will allow us to create a gateway that allows only one process to enter at a time.

It turns out that building such a gateway is rather simple.

Start by creating a one column/one row state table, LOAD_STATE, with the column CURRENT_STATE defined as a varchar.

The values allowed for the CURRENT_STATE column will be:

'OperationalSysLoading'
'OperatinalSysLoaded'
'WarehouseLoading' and
'WarehouseLoaded'

Operational system Process:


   select CURRENT_STATE
     from LOAD_STATE
    for update of CURRENT_STATE;
   if CURRENT_STATE = 'OperationalSysLoading'
   then
      terminate with error - the previous run
      of the operational system failed and
      the problem has not yet been resolved.
   elsif CURRENT_STATE = 'WarehouseLoading'
   then
      commit (to release the locked table)
      restart this process after waiting some
      pre-determined amount of time (to give
      the warehouse load time to complete)
   elsif CURRENT_STATE
           in  ('WarehouseLoaded',
                'OperatinalSysLoading'
               )
   then
      save the value of CURRENT_STATE
      update LOAD_STATE
         set CURRENT_STATE
               = 'OperationalSysLoading';
      commit;
   end if;
   if the saved value of CURRENT_STATE
      is 'WarehouseLoaded'
   then
      purge the transport tables
   end if;
 
   do the operational processing and
   load the transport tables
 
   update LOAD_STATE
      set CURRENT_STATE
          = 'OperationalSysLoaded';
    commit;

Warehouse Load Process:


select CURRENT_STATE
     from LOAD_STATE
    for update of CURRENT_STATE;
   if CURRENT_STATE = 'WarehouseLoading'
   then
      terminate with error - the previous run
      of the operational system failed and
      the problem has not yet been resolved.
   elsif CURRENT_STATE
         = 'OperationalSysLoading'
   then
      commit (to release the locked table)
      restart this process after waiting some
      pre-determined amount of time (to give
      the operational system time to complete)
   elsif CURRENT_STATE = 'WarehouseLoaded'
   then
      (No data to load yet - we must wait)
      commit (to release the locked table)
      restart this process after waiting some
      pre-determined amount of time (to give
      the operational system time to complete)
   elsif CURRENT_STATE = 'OperatinalSysLoaded'
   then
      update LOAD_STATE
         set CURRENT_STATE
             = 'WarehouseLoading'
      commit;
   end if;
 
   Load the data in the transport tables
 
   update LOAD_STATE
         set CURRENT_STATE
             = 'WarehouseLoaded'
   commit;

Since ORACLE does not allow two processes to lock the same row in a table by more than one process, we are guaranteed that only one of the two processes will be allowed to run to completion at a time. The select statement has become a gateway that allows only one process to pass through at a time even though the two process are running on different computers and operating system.

We have now guaranteed ourselves data integrity by synchronizing the two processes.