Extracting - separate source tables vs SQL query

Bisquite Sun Dec 12, 2010 3:38 pm

Hi,

I am in 2 minds on my extract process for a particular source system. The system has customer information stored in 3 tables in 3rd normal form (e.g. name in one, address in another and contact details in another). Each table has a reliable modified date we can use to do delta loading.

My 2 options are:-

1. Separate tables - my normal methodology would involve keeping persistent copies of the required source system tables in my warehouse staging database. Updating/inserting into them on each delta load using the modified dates. The customer dimension build would then do the necessary joins between these tables, also using the modified dates for speed, to insert/update the denormalized dimension which includes all attributes from the 3 tables.
PROS: is my standard approach, have a direct copy of the source system to facilitate easier troubleshooting
CONS: will take longer to develop due to converting SQL into ETL tool transforms

2. Extract SQL - we have been provided with SQL statements that join the 3 tables together to create a denormalized view of the customer. This is already used by our counterparts in the US to build their own warehouse (but using flat files in between). We could run this to extract the data into a persistent staging table, updating/inserting on each delta load based on the primary key. The dimension build would then be very simple with almost no further transformations necessary.
PROS: quicker to develop (reasonably important at this stage in the project), uses a ready made and proven extract process
CONS: does not give us a copy of the source system as we would normally like

[The SQL is actually quite complicated as there are more than the 3 tables I mention above, I used 3 just to demonstrate the theory]

Any advice welcome, I am going backwards and forwards between the two options. I think I am leaning towards option 2, mainly because I can't come up with a good enough reason to have the exact copy of the source tables as in option 1.

Thanks,

b/

hang Sun Dec 12, 2010 6:51 pm

Bisquite wrote:I think I am leaning towards option 2, mainly because I can't come up with a good enough reason to have the exact copy of the source tables as in option 1.

With option 2, you do have all the data in source system, but in different structure (de-normalised). If you don't need a normalised data store for the sake of data quality, option 2 would be fine. However I am questioning about keeping persistent copies of source system tables in your staging database. Would you be better off having a separate database called ODS or some kind of data store that stores all the history source data, and leaving only delta in your staging area, so that all your DDS ETL will be a lot simpler and faster for surrogate key lookup and SCD 2 process. Treating staging area as a persistent data store is very risky as people normally think it's volatile and temporary, and therefor truncating the tables whenever necessary.

Bisquite Sun Dec 12, 2010 7:08 pm

hang wrote:Would you be better off having a separate database called ODS or some kind of data store that stores all the history source data, and leaving only delta in your staging area

Yes, that is kind of what I have. Full source history tables and delta versions of each table for the daily incremental loads.

Your reply does kind of indicate you would have chosen option 2 as well. Is that correct?

Thanks for the reply by the way.

b/

hang Sun Dec 12, 2010 9:23 pm

Correct, I meant option 2 would be better choice and I have updated my previous post to reflect that.

Sponsored content

Extracting - separate source tables vs SQL query

Extracting - separate source tables vs SQL query

Re: Extracting - separate source tables vs SQL query

Re: Extracting - separate source tables vs SQL query

Re: Extracting - separate source tables vs SQL query

Re: Extracting - separate source tables vs SQL query