Extracting - separate source tables vs SQL query
2 posters
Page 1 of 1
Extracting - separate source tables vs SQL query
Hi,
I am in 2 minds on my extract process for a particular source system. The system has customer information stored in 3 tables in 3rd normal form (e.g. name in one, address in another and contact details in another). Each table has a reliable modified date we can use to do delta loading.
My 2 options are:-
1. Separate tables - my normal methodology would involve keeping persistent copies of the required source system tables in my warehouse staging database. Updating/inserting into them on each delta load using the modified dates. The customer dimension build would then do the necessary joins between these tables, also using the modified dates for speed, to insert/update the denormalized dimension which includes all attributes from the 3 tables.
PROS: is my standard approach, have a direct copy of the source system to facilitate easier troubleshooting
CONS: will take longer to develop due to converting SQL into ETL tool transforms
2. Extract SQL - we have been provided with SQL statements that join the 3 tables together to create a denormalized view of the customer. This is already used by our counterparts in the US to build their own warehouse (but using flat files in between). We could run this to extract the data into a persistent staging table, updating/inserting on each delta load based on the primary key. The dimension build would then be very simple with almost no further transformations necessary.
PROS: quicker to develop (reasonably important at this stage in the project), uses a ready made and proven extract process
CONS: does not give us a copy of the source system as we would normally like
[The SQL is actually quite complicated as there are more than the 3 tables I mention above, I used 3 just to demonstrate the theory]
Any advice welcome, I am going backwards and forwards between the two options. I think I am leaning towards option 2, mainly because I can't come up with a good enough reason to have the exact copy of the source tables as in option 1.
Thanks,
b/
I am in 2 minds on my extract process for a particular source system. The system has customer information stored in 3 tables in 3rd normal form (e.g. name in one, address in another and contact details in another). Each table has a reliable modified date we can use to do delta loading.
My 2 options are:-
1. Separate tables - my normal methodology would involve keeping persistent copies of the required source system tables in my warehouse staging database. Updating/inserting into them on each delta load using the modified dates. The customer dimension build would then do the necessary joins between these tables, also using the modified dates for speed, to insert/update the denormalized dimension which includes all attributes from the 3 tables.
PROS: is my standard approach, have a direct copy of the source system to facilitate easier troubleshooting
CONS: will take longer to develop due to converting SQL into ETL tool transforms
2. Extract SQL - we have been provided with SQL statements that join the 3 tables together to create a denormalized view of the customer. This is already used by our counterparts in the US to build their own warehouse (but using flat files in between). We could run this to extract the data into a persistent staging table, updating/inserting on each delta load based on the primary key. The dimension build would then be very simple with almost no further transformations necessary.
PROS: quicker to develop (reasonably important at this stage in the project), uses a ready made and proven extract process
CONS: does not give us a copy of the source system as we would normally like
[The SQL is actually quite complicated as there are more than the 3 tables I mention above, I used 3 just to demonstrate the theory]
Any advice welcome, I am going backwards and forwards between the two options. I think I am leaning towards option 2, mainly because I can't come up with a good enough reason to have the exact copy of the source tables as in option 1.
Thanks,
b/
Bisquite- Posts : 5
Join date : 2010-02-07
Re: Extracting - separate source tables vs SQL query
With option 2, you do have all the data in source system, but in different structure (de-normalised). If you don't need a normalised data store for the sake of data quality, option 2 would be fine. However I am questioning about keeping persistent copies of source system tables in your staging database. Would you be better off having a separate database called ODS or some kind of data store that stores all the history source data, and leaving only delta in your staging area, so that all your DDS ETL will be a lot simpler and faster for surrogate key lookup and SCD 2 process. Treating staging area as a persistent data store is very risky as people normally think it's volatile and temporary, and therefor truncating the tables whenever necessary.Bisquite wrote:I think I am leaning towards option 2, mainly because I can't come up with a good enough reason to have the exact copy of the source tables as in option 1.
Last edited by hang on Sun Dec 12, 2010 9:22 pm; edited 1 time in total
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Extracting - separate source tables vs SQL query
hang wrote:Would you be better off having a separate database called ODS or some kind of data store that stores all the history source data, and leaving only delta in your staging area
Yes, that is kind of what I have. Full source history tables and delta versions of each table for the daily incremental loads.
Your reply does kind of indicate you would have chosen option 2 as well. Is that correct?
Thanks for the reply by the way.
b/
Bisquite- Posts : 5
Join date : 2010-02-07
Re: Extracting - separate source tables vs SQL query
Correct, I meant option 2 would be better choice and I have updated my previous post to reflect that.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Similar topics
» Extracting data when there is no timestamp in the source system
» Extracting Master/Detail data into Staging Tables
» Large Fact Table and Maintaining Periodic Snapshot: Practice
» Combining data from different fact tables in a query
» Where do you put your staging Tables? DWH DB or as a separate DB
» Extracting Master/Detail data into Staging Tables
» Large Fact Table and Maintaining Periodic Snapshot: Practice
» Combining data from different fact tables in a query
» Where do you put your staging Tables? DWH DB or as a separate DB
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum