ETL Optimization Scenario
Hi All,
Currently our fact table holds millions of rows and will continue to grow; pretty soon our ETL processes will start failing, so we need to optimize them now before it gets out of hand.
To do that, we need to create a scenario that will allow us to test our ETL processes against larger data volumes.
Can anyone tell me how I can cook up such a scenario?
Mohsin- Posts : 4
Join date : 2009-03-03
Re: ETL Optimization Scenario
I'm not even sure where to begin on this. My recommendation is to hire a senior ETL architect. The depth of your problem exceeds what can usefully be worked through in an online forum.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Try replicating data with shifted dates
I agree that if you've implemented an ETL process that won't scale up as needed, you have a bigger problem than just testing.
But testing with a data set larger than what the source application currently has is not too hard. There is usually one key date (or at most a few) that can be manipulated to send the same set of fact data through the ETL process multiple times. If your source has, for example, 3 years of data, you can double the data size for testing via a view that UNION ALLs a straight select of the source table with a select that adds 37 months to each important date column and passes all the other columns through unchanged.
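As a minimal sketch (the table and column names here are made up for illustration, and DATEADD is SQL Server syntax, so adjust for your platform):

CREATE VIEW source_fact_2x AS
SELECT transaction_id, transaction_date, customer_id, amount
FROM source_fact
UNION ALL
-- Second copy of the same rows, shifted 37 months forward so it lands
-- just past the original 3-year date range
SELECT transaction_id,
       DATEADD(month, 37, transaction_date),
       customer_id,
       amount
FROM source_fact;

Point the ETL extract at the view instead of the table and the process sees twice the rows, with no extra source storage.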
If the older data in the source system is not typical of what new data will look like (not unusual), you can produce pretty good test data via a view that UNION ALLs multiple selections of the most recent data, adding a different number of months in each query.
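Continuing the same made-up schema, a sketch of a view that stacks four copies of just the most recent year, each shifted by a different offset (the cutoff date and the offsets are illustrative):

CREATE VIEW source_fact_recent_4x AS
SELECT transaction_id, transaction_date, customer_id, amount
FROM source_fact
WHERE transaction_date >= '2008-04-01'  -- most recent year only
UNION ALL
SELECT transaction_id, DATEADD(month, 13, transaction_date), customer_id, amount
FROM source_fact
WHERE transaction_date >= '2008-04-01'
UNION ALL
SELECT transaction_id, DATEADD(month, 26, transaction_date), customer_id, amount
FROM source_fact
WHERE transaction_date >= '2008-04-01'
UNION ALL
SELECT transaction_id, DATEADD(month, 39, transaction_date), customer_id, amount
FROM source_fact
WHERE transaction_date >= '2008-04-01';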
Depending on the ETL tools and techniques available, you may be able to do the same thing within the ETL process rather than using a view.
Potential gotchas:
1) Unique transaction ID -- there may be an identifier in the data that is checked for uniqueness in the ETL and/or has a unique constraint in the DW. For testing, change the uniqueness definition to include the date (or one of the dates) being shifted to manufacture the test data.
2) Special dates -- the main date used for this is a transaction or effective date that always holds a valid recent value. But if you shift other dates too, check for NULLs or special values that need to be preserved rather than shifted. Some applications use NULL when a date is missing, unknown, or does not apply; others use special date values that should not be altered (1900/01/01 and 9999/12/31 are popular). Most databases will throw an error if you add anything to 9999/12/31. (A sketch handling both gotchas follows this list.)
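Here is a rough sketch of one shifted copy that deals with both gotchas (again, the names are hypothetical):

-- Gotcha 1: if transaction_id alone was unique, treat the pair
-- (transaction_id, transaction_date) as the unique key while testing,
-- since every copy reuses the same IDs with different shifted dates.
SELECT transaction_id,
       DATEADD(month, 37, transaction_date) AS transaction_date,
       -- Gotcha 2: preserve NULLs and sentinel dates instead of shifting them
       CASE
         WHEN ship_date IS NULL THEN NULL
         WHEN ship_date IN ('1900-01-01', '9999-12-31') THEN ship_date
         ELSE DATEADD(month, 37, ship_date)
       END AS ship_date,
       customer_id,
       amount
FROM source_fact;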
This technique enables scale-up testing without generating or storing any source data beyond what already exists. You can also test the ETL "pipe" without adding extra storage in the DW by taking the data through the process up to, but not including, the final load. That way, when you do add storage for load testing, you'll know you will actually get to use it.
Regards,
Tim
tim_huck- Posts : 8
Join date : 2009-04-09
Location : Evanston, Illinois
