ETL for reference/lookup data

acobb Fri Jul 09, 2010 11:57 am

I am currently providing input for an ETL architecture and wanted some advice on the best practice for loading reference data. The source system currently runs continuously and logs keys information about actions being taken against the source system. The logs for the system currently record ID's for the data that is being acted upon to indicate that the data has changed. My main question is if the source system should log all of the information on the object that changed, only log the data that changed, or provide a reference key? If it logs all the information then the ETL process could easily update the dimensions appropriately, but would cause the logs to be very verbose. If the system only logs a key which references the core data, then ETL will need to lookup this information from the source system, which will still be running. I have considered using SQL Replication, but I am not sure if that is the best practice in this case. I know I cannot hit the source system repeatedly for each lookup, but I am unsure if SQL Replication is worth the hardware costs.

Options:
- Make the logs verbose by logging all relevant information about all the objects that were effected
- Only log changes to objects, not in their entirety
- Log keys and lookup against the production source system
- Log keys and lookup against a SQL Replication database
- Log keys and bulk export and import a copy of the production source system when ETL runs

Please let me know your thoughts and if you have any recommendations. Thanks for reading.

ngalemmo Fri Jul 09, 2010 12:43 pm

I would go with either option 1 or 2 (verbose log or change log) with the second choice being replication.

I wouldn't use references as I get the sense this is a high volume activity and you don't want an ETL process reaching into the operational data while the operational system is running.

For the logs, most DB systems provide the ability to swap tables and do it very quickly (under a second), so you can have an empty log table at the ready and swap it with a full log when you are ready to pull the data. This makes the log sizes very managable as you would do a swap, commit the swap, then read the full version while the operational system is loading new changes into the empty table, then truncate the full version after the ETL process has completed so it is ready for the next swap. If you need to keep the log rows for recovery purposes, you can always have the ETL process put them into another table or flat file outside the operational system.

acobb Fri Jul 09, 2010 2:10 pm

Thanks for the quick response ngalemmo. That is a very interesting idea to use the database transaction logs instead of the application generated log files. I think we will likely use a hybrid of using the transaction logs and likely option 2. Thanks for the tips.

ngalemmo Fri Jul 09, 2010 3:19 pm

By 'log' I was referring to your application log (although you could use the database logs as well... which is basically what many change data capture (CDC) applications do).

I was assuming your application had a table it would record updates it applied as well as actually updating tables and that the log had no other use to the application. In which case the table swaping technique (which is also how the database logs work... however in this case your application would control the swaping directly) would work.

acobb Fri Jul 09, 2010 3:33 pm

Ah, I see. The application currently creates text log files to track the actions of the system and updates any appropriate information in the database. I think going with a more verbose log would prevent the need for having to use the reference tables, but I fear that it would slow the application down to write such large files. I fear the same would be true of writing all the database changes to a separate table, as its an extra call to the database for every modification. So I think it might be best to log only the fields or relationships that change to the flat files along with the other system events which make up the grain. This would keep the files smaller and reduce the need to call the production database.

Sponsored content

ETL for reference/lookup data

ETL for reference/lookup data

Re: ETL for reference/lookup data

Re: ETL for reference/lookup data

Re: ETL for reference/lookup data

Re: ETL for reference/lookup data

Re: ETL for reference/lookup data