Design of Staging Tables
Design of Staging Tables
Hi
I am struggling with the design of my Staging Tables. I have System X and System Y, both of which have Account entities in them. The Dimensional Model has a conformed Account dimension, as per the Kimball methodology.
Is it better to:
Option 1 - Extract the source data into two staging tables (StagingSystemXAccount and StagingSystemYAccount) in my staging database and then to Transform & Load the data in these tables into the conformed DimAccount.
Option 2 - Extract the source data into one staging table (StagingAccount) in my staging database and then to Transform & Load the data in this table into the conformed DimAccount.
I can see benefits and liabilities for each option, but I wonder if there is a "killer" argument for one of them?
Thanks in advance...
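For concreteness, the two layouts might look like the following sketch (all column names are illustrative assumptions, not from any real system):

```python
# Option 1: one staging table per source, mirroring each source's own shape.
staging_system_x_account = ["x_account_no", "x_name", "x_status"]
staging_system_y_account = ["y_acct_id", "y_acct_name", "y_active_flag"]

# Option 2: one common staging table; a source_system column tells the rows
# apart, and each extract maps its columns into the shared shape.
staging_account = ["source_system", "account_number", "account_name", "is_active"]
```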
amir2- Posts : 29
Join date : 2010-07-29
Re: Design of Staging Tables
I use option 2... it involves a lot less code.
You do some lightweight transformations moving data from the source into the staging table (one process per source) and then do the heavy work in a single process that reads staging and updates the dimension or fact.
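The option-2 flow described above can be sketched as follows: one lightweight extract per source into a shared staging structure, then a single heavy process that conforms the rows and updates the dimension. All names, column mappings, and the cleanup rule are illustrative assumptions.

```python
staging = []          # stands in for the common StagingAccount table
dim_account = {}      # stands in for DimAccount, keyed by natural key

def extract_system_x(rows):
    # Lightweight transform: rename columns and tag the source.
    for r in rows:
        staging.append({"source": "X", "account_no": r["x_no"], "name": r["x_name"]})

def extract_system_y(rows):
    for r in rows:
        staging.append({"source": "Y", "account_no": r["acct_id"], "name": r["acct_nm"]})

def load_dim_account():
    # One heavy process reads staging and updates the conformed dimension.
    for row in staging:
        nk = (row["source"], row["account_no"])   # natural key includes the source
        dim_account[nk] = {"name": row["name"].strip().title()}

extract_system_x([{"x_no": "100", "x_name": "acme ltd"}])
extract_system_y([{"acct_id": "A-7", "acct_nm": "globex"}])
load_dim_account()
```

The point of the shape is that adding cleanup logic touches only `load_dim_account`, not the per-source extracts.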
Re: Design of Staging Tables
Thanks.
My thinking was the same: as you say option 1 means more code and maintenance. I just wasn't sure if there is a really strong case against option 2.
Below is a reply I got on the SQL Server Data Warehousing forum (http://social.msdn.microsoft.com/Forums/en-US/sqldatawarehousing/thread/d6c21bd7-a1b8-4c04-ae84-40e362a5741f).
Does anyone have any comments on this?
Well, there is no better or worse; it's just a question of approach.
But I would suggest performing all data consolidation tasks in the Transform phase, so extraction should only copy data from the different sources into different target staging tables, without any structural change (apart from narrowing the field list). If you extract into one common table, you should at least add a flag field for the source system, for later debugging.
For example, if you receive an error during the extraction of accounts and you have a common table loaded by a single Data Flow Task, you would have to load from both source systems again, even if only one of the sources (or staging targets) failed.
If your ETL process has a restartability requirement (and why wouldn't it? It is a best practice to build your packages with the ability to restart from the point of failure), your restarted ETL will finish sooner if it does not have to query both sources again. While separating the extraction into different Data Flow Tasks can meet a restartability requirement, you would have to use Checkpoints in all your SSIS packages. Instead, you can split the extraction tasks into different packages, one package per source, and then you only have to deal with Checkpoints in your master package.
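The restartability argument can be sketched outside of SSIS as well: run each source's extract as its own step and record completions, so a rerun after a failure skips sources that already landed. The checkpoint mechanism and step names here are assumptions standing in for SSIS Checkpoints.

```python
checkpoints = set()   # names of extract steps that have completed

def run_extract(name, extract_fn):
    # Skip any step that already landed its data; never hit that source twice.
    if name in checkpoints:
        return "skipped"
    extract_fn()
    checkpoints.add(name)
    return "ran"

def extract_x():
    pass  # would copy System X accounts into its own staging target

def extract_y():
    pass  # would copy System Y accounts into its own staging target

first = [run_extract("extract_x", extract_x), run_extract("extract_y", extract_y)]
# Simulate a restart after a downstream failure: completed extracts are skipped.
restart = [run_extract("extract_x", extract_x), run_extract("extract_y", extract_y)]
```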
amir2- Posts : 29
Join date : 2010-07-29
Re: Design of Staging Tables
I use option 2 as well (common table). The table should include a "source" column (System X or Y, etc.).
You still would have separate Extract routines (1 for each source). If a routine fails for a given source, only that source should need to be re-extracted.
The key advantage to this approach is that you've isolated your heavy lifting from changes in the source landscape. Source systems can be added or removed with minimal impact; consider an acquisition that adds two more sources for Account. Using option 2, the only changes required of the data warehouse would be two additional processes to load data from source to staging: no physical database changes, no changes to the dimensional load process, etc.
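That isolation can be made concrete by driving the per-source extracts from a mapping table, so a new source is just one more registered mapping. The mapping structure and column names below are assumptions for illustration.

```python
# Each source registers how its columns map into the common staging shape.
source_mappings = {
    "X": {"account_no": "x_no", "name": "x_name"},
    "Y": {"account_no": "acct_id", "name": "acct_nm"},
}

def stage(source, rows):
    # Map a source's raw rows into the common staging shape, tagged by source.
    mapping = source_mappings[source]
    return [{"source": source, **{k: r[v] for k, v in mapping.items()}} for r in rows]

# An acquisition adds System Z: one new mapping, no schema change, and no
# change to the dimensional load that reads the common staging shape.
source_mappings["Z"] = {"account_no": "z_id", "name": "z_desc"}
staged = stage("Z", [{"z_id": "9", "z_desc": "Initech"}])
```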
LAndrews- Posts : 132
Join date : 2010-05-13
Location : British Columbia, Canada
Re: Design of Staging Tables
I would go for option 1, simply because I don’t want to do any lifting other than a bulk insert when I stage the source, minimising the query impact on the source systems. With option 2, you may have to join tables across two different database systems, adding extra overhead compared to querying within a single database.
And also once I have landed the source data, I don’t want to query the source system again, if any upstream transformation or consolidation fails, or if you want to reuse the same raw data for other purposes.
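The option-1 extract described here is the thinnest possible step: a straight, untransformed copy of each source table into its own staging table, so the source never needs to be re-queried if a downstream transform fails. Table and column names below are assumed for illustration.

```python
def bulk_stage(source_rows):
    # No renames, no joins, no type changes: land the raw rows as-is.
    # Downstream transforms then read staging, never the source.
    return [dict(r) for r in source_rows]

staging_system_x_account = bulk_stage([{"x_no": "100", "x_name": "acme"}])
staging_system_y_account = bulk_stage([{"acct_id": "A-7", "acct_nm": "globex"}])
```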
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Design of Staging Tables
I agree that if the extract would need to do unnecessary heavy lifting under the one-table approach, then you are better off doing a quick dump from the source and moving the heavy processing to after the data has been pulled.
But most of the time, this is an exception. Usually the extract transformations are fairly simple (standardizing data types, building natural key strings, and so forth). Also, you can adjust staging requirements to suit the environment; for example, joins can be deferred by adjusting what the staged row should look like.
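The "fairly simple" extract transformations mentioned above might look like this sketch: standardize types and build a natural-key string while copying to staging. The key format and column names are illustrative assumptions.

```python
def to_staging_row(source, raw):
    # Lightweight, per-source transform applied on the way into staging.
    return {
        "source": source,
        "natural_key": f"{source}|{str(raw['id']).strip()}",  # source-qualified key string
        "name": str(raw["name"]).strip(),
        "opened": str(raw["opened"])[:10],  # normalize datetime text to yyyy-mm-dd
    }

row = to_staging_row("X", {"id": 42, "name": " Acme ", "opened": "2011-09-01 00:00:00"})
```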
Re: Design of Staging Tables
In general I would want to combine the 2 sets of data at the earliest opportunity, but it depends how similarly the data is structured in each source system. I always try to think about whether the staging I need to do is source system-agnostic, or specific to the needs of a particular system.
e.g. System X might store everything you need in the database, but System Y relies on application logic to derive various fields. In this scenario you might need a separate staging table for the data coming from System Y so that you can fill in the gaps and plug it into your combined staging. Then you do the transformations that apply to both sets of data - such as additional hierarchy levels, maybe - in a single staging pipeline.
The danger with doing all the transformations in two separate staging processes is that the logic you have to duplicate needs to be maintained in 2 different places, potentially more if you add new source systems. When one of the transformations needs to be changed in a year's time and there are new developers maintaining the code, then (depending on your ETL tool and how obvious the data flows are) there's a risk that they might overlook the fact that it needs to be changed in more than one place.
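The hybrid described above can be sketched as a source-specific pre-staging step for System Y feeding one shared transformation, so the shared logic lives in exactly one place. The derivation rule and field names are assumptions for illustration.

```python
def prestage_system_y(raw):
    # Fill the gap: derive a field System Y's database does not store,
    # the way its application logic would (assumed rule).
    return {**raw, "is_active": raw["status_code"] in ("OPEN", "DORMANT")}

def shared_transform(row):
    # Source-agnostic staging logic applied to both systems' rows;
    # maintained once, so a future change cannot be missed in one branch.
    return {**row, "name": row["name"].upper()}

x_row = shared_transform({"name": "acme", "is_active": True})  # X is already complete
y_row = shared_transform(prestage_system_y({"name": "globex", "status_code": "OPEN"}))
```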
PeteGrace- Posts : 7
Join date : 2011-09-01
Re: Design of Staging Tables
I'll hang with hang on this and vote for Option 1 -- two staging tables. In my case I've found the data is different enough coming from the two systems that unique staging is the best way to go, but it does come at the price of needing two sets of load logic.
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Design of Staging Tables
VHF wrote: I'll hang with hang on this
ngalemmo wrote: Cute.
Indeed, I am happy I am not alone this time.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia