ETL Fragility
My ETL process is built from jobs strung together and run in sequence, once nightly. Our nightly routine kicks off and runs the jobs one after the other. Upon a failure, the entire process stops, logs the failure to a DB, and attempts the same thing the following night.
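In pseudocode it amounts to something like the sketch below (Python, purely illustrative; the job names and the logging stand-in are made up, not my real code):

```python
# Rough sketch of the current nightly runner; the job list and the
# logging stand-in are made up for illustration, not the real jobs.
import logging

logging.basicConfig(level=logging.INFO)

def extract_orders(): ...      # each nightly job is just a callable here
def transform_orders(): ...
def load_snapshots(): ...

JOBS = [extract_orders, transform_orders, load_snapshots]

def run_nightly():
    for job in JOBS:
        try:
            job()  # jobs run strictly one after the other
        except Exception:
            # stand-in for logging the failure to the DB
            logging.exception("job %s failed", job.__name__)
            return  # everything downstream is skipped; the whole chain
                    # is simply attempted again the following night
```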
The *hope* is that during the day following a failure, the failure is investigated, fixed, and the failed jobs are manually re-run ASAP.
This entire process seems very fragile:
- As more and more data is captured, the likelihood of an unforeseen situation breaking the ETL process increases
- An expert on the ETL process and the jobs being run (me) needs to be available every single day to fix any issue
- If there is a failure and no one is available (or able) to fix it immediately, snapshot data is missed and can never be re-created.
I didn't find The Data Warehouse Toolkit to be a very good reference on the ETL portion of data warehousing. Are there any suggestions or suggested reads on how to structure the process, or tips on mitigating some of these risks?
ryno1234- Posts : 33
Join date : 2015-01-07
Re: ETL Fragility
What is the nature of the errors?
I don't do ETL anymore, but when I did, my goal was to avoid getting late night calls.
It is more a matter of philosophy than anything else. To me, there is no such thing as 'bad' data. A DW should dutifully load what it is given. If there are issues with the data, it is a matter of reporting back (from the DW) to those responsible for cleanup in the operational system.
The most common problem is missing dimension rows when loading facts. This can be resolved by inferring new dimension rows during the load. Populate the row with the natural key and any other data you have and use the new surrogate key in the fact. The dimension row itself can be populated later when the real data comes in.
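For illustration, the lookup-or-infer step can be as simple as the sketch below (Python with SQLite as a throwaway backend; the table and column names, dim_customer, customer_nk, customer_sk, is_inferred, are hypothetical):

```python
# A minimal sketch of inferring a missing dimension row during a fact load.
# Table and column names are hypothetical; customer_sk is assumed to be an
# INTEGER PRIMARY KEY so the new rowid doubles as the surrogate key.
import sqlite3

def get_or_infer_customer_key(conn: sqlite3.Connection, natural_key: str) -> int:
    """Return the surrogate key for a customer, creating an inferred
    placeholder row if the natural key is not in the dimension yet."""
    row = conn.execute(
        "SELECT customer_sk FROM dim_customer WHERE customer_nk = ?",
        (natural_key,),
    ).fetchone()
    if row:
        return row[0]
    # Not found: insert a stub carrying only the natural key, flagged as
    # inferred so the real attributes can be filled in when the data arrives.
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_nk, customer_name, is_inferred) "
        "VALUES (?, 'Unknown', 1)",
        (natural_key,),
    )
    return cur.lastrowid
```

A periodic job can then sweep the dimension for inferred rows and overwrite the placeholder attributes once the real data shows up.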
But there are situations where failure is the best option. If you are dealing with text based sources (XML, comma delimited, etc...) there is always a chance of a corrupt file or malformed data, such as non-numerics in what should be a numeric field. There isn't much you can do but kick the whole file out.
However, if you are performing dimension inference, there is no reason the entire process should stop. If you have a corrupt source, that stuff doesn't get loaded, but it should not prevent everything else getting loaded.
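Something along these lines, for example (a rough sketch only; the directory layout, the 'amount' column, and the reject folder are assumptions):

```python
# Sketch: quarantine a corrupt or malformed file and keep loading the rest.
import csv
import logging
import shutil
from pathlib import Path

logging.basicConfig(level=logging.INFO)

def parse_file(path: Path) -> list[dict]:
    rows = []
    with path.open(newline="") as fh:
        for raw in csv.DictReader(fh):
            raw["amount"] = float(raw["amount"])  # non-numeric data raises ValueError
            rows.append(raw)
    return rows

def load_all(incoming: Path, rejected: Path) -> None:
    for path in sorted(incoming.glob("*.csv")):
        try:
            rows = parse_file(path)
        except (ValueError, KeyError, csv.Error) as exc:
            logging.warning("rejecting %s: %s", path.name, exc)
            shutil.move(str(path), str(rejected / path.name))  # kick the whole file out
            continue                                           # ...and carry on
        logging.info("loaded %d rows from %s", len(rows), path.name)  # stand-in for the real load
```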
Re: ETL Fragility
You need the ETL Toolkit book. It's the ETL bible. The sophistication of your ETL process depends on your ETL tool. During the design process you should identify things that can run concurrently and things that must run in sequence. The ETL should rarely stop, only when encountering a fatal error. Otherwise it should handle most problems gracefully. With Informatica, you can get text messages with updates on your workflows (typically only when things go bad). Buy the book, start reading; all the Kimball books are full of useful information. Your item 3 should never happen. That's one of the reasons we stage data: to rerun after a load failure.
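The shape of it is roughly this (a sketch only; the job names are placeholders, and a real ETL tool such as Informatica gives you this orchestration without hand-rolling it):

```python
# Sketch of "concurrent where possible, sequential where required" with a
# rerun-from-staging hook; job functions are placeholders for illustration.
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)

def extract_orders(): ...      # independent extracts land in staging
def extract_customers(): ...
def load_warehouse(): ...      # depends on the extracts above

CONCURRENT = [extract_orders, extract_customers]
SEQUENTIAL = [load_warehouse]

def run():
    failed = []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(job): job for job in CONCURRENT}
        for fut, job in futures.items():
            try:
                fut.result()
            except Exception:
                logging.exception("%s failed", job.__name__)
                failed.append(job.__name__)
    if failed:
        # Data already sitting in staging is not lost; only the failed
        # extracts need to be rerun before kicking off the load.
        logging.error("fix and rerun from staging: %s", failed)
        return
    for job in SEQUENTIAL:
        job()
```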
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA