Initial stage tables from multiple sources?
I was somewhat confused while reading The Data Warehouse ETL Toolkit, page 116, Chapter 4, "Cleaning and Conforming," under Assumptions, first bullet: "...customer information from various sources initially can be staged in a table [. . .] whose configuration is the same regardless of the data source. Such a structure could be used for incoming data from various data sources deposited and queued for ETL work."
If I understand this correctly, what is being described is a scenario where I am maybe getting data via both XML and CSV, or some other sources, and it just so happens (maybe by design) that the data structures of both are identical, allowing me to import both into the same initial staging table (or tables)? I would expect this scenario to be pretty rare; at least I have yet to encounter it in my limited ETL experience. I just want to make sure I am understanding what is being described correctly: that this is not an all-encompassing guideline, but really more of a side note for those special cases where someone is lucky enough to have data sources that are already conformed to one another.
AaronLS
Re: Initial stage tables from multiple sources?
No, it's good advice to be used in the general case.
The issue isn't whether the sources are the same but that the staging table is the same. Think of the staging table design as the API between the staging process and the loading process. If you get it right, you need only develop one load process regardless of source. "Getting it right" involves abstracting natural keys, conforming data types, and other such techniques so that similar data populates the same columns in the staging table. You wind up writing one staging process for each source and only one load process, rather than unique staging and load processes for each source.
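For illustration, a conformed staging table might look something like this in T-SQL (stage.Customer and all of its columns are made-up names for the sketch, not anything from the book):

-- One staging table shared by every source; each source's staging
-- process is responsible for conforming its data into these columns.
CREATE TABLE stage.Customer (
    SourceSystem  varchar(20)   NOT NULL,  -- which feed the row came from
    CustomerNK    varchar(50)   NOT NULL,  -- natural key, conformed across sources
    CustomerName  nvarchar(100) NOT NULL,
    City          nvarchar(60)  NULL,
    ExtractDate   datetime      NOT NULL   -- when the row was staged
);

Whatever the feed looks like (XML, CSV, another database), by the time rows land here they all have the same shape, so the load process only has to be written once.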
Another way to look at it: the content of the staging table should conform to the ultimate target (fact and/or dimension table), with the only difference being that it contains natural rather than surrogate keys (surrogate key assignment occurs in the load process). After all, if you have different sources going into a single fact table, how much more complicated could it be to have those same sources go into a common staging table?
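And a sketch of that single load process, again with invented names (dim.Customer is assumed to have an IDENTITY surrogate key, CustomerSK, and the staging batch is assumed to hold one row per natural key):

-- Surrogate key assignment happens here, not in staging: new natural
-- keys get a new dimension row (and a new surrogate key via IDENTITY).
INSERT INTO dim.Customer (CustomerNK, CustomerName, City)
SELECT s.CustomerNK, s.CustomerName, s.City
FROM stage.Customer AS s
LEFT JOIN dim.Customer AS d
    ON d.CustomerNK = s.CustomerNK
WHERE d.CustomerSK IS NULL;  -- only natural keys we haven't seen before

Notice the statement never asks where a row came from; that question was settled back in the staging processes.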
Re: Initial stage tables from multiple sources?
Thanks for the response. I've only just started reading this book in the last couple of weeks, so maybe this will become clearer as I read on.
I can understand the importance of getting them into the same structure as soon as possible, simply as a matter of reusing as much of the ETL process as possible. My point of confusion is that I thought a big plus of the ETL process is dealing with the complexities of restructuring/conforming the data step by step, in stages. So it would seem the first few steps would involve data that was not yet conformed, and accompanying these steps would be staging tables.
When you ask how complex it can be: in the couple of cases I've seen, one source would be so poorly structured that no one could get anything useful out of the data because it was difficult to understand and query. These were projects where people were trying to "get by" without a data warehouse (not knowing what one was anyway), but were almost at a standstill when it came to getting useful data out of the existing nightmarish structure (repeating columns, repeating values, poorly named columns). That problem would be the catalyst for an ETL system being built, because we were stuck with the source system as-is but needed to be able to get the data into a more queryable structure.
My first intuition has been to read in the data from whatever source using a process tailored to that source (a C# app for files, SQL for pulling from another database), stage the data into initial tables, and from there work the data from the different sources step by step towards a stage where I can merge them into the same tables, as in the sketch below. From that stage of the process onward, the rest of the ETL process would not be duplicated.
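To illustrate what I mean, I'd land each feed in its own raw table and then project it into the common structure with a short per-source step; something like this (raw.CsvCustomer and all the names here are just invented for the example):

-- One conform step per source: project the raw landing table into the
-- shared staging shape, renaming and cleaning columns as needed.
INSERT INTO stage.Customer (SourceSystem, CustomerNK, CustomerName, City, ExtractDate)
SELECT 'csv_feed',
       LTRIM(RTRIM(r.cust_id)),   -- conform the natural key
       r.cust_name,
       r.city,
       GETDATE()
FROM raw.CsvCustomer AS r;

Only this projection step would be per-source; everything downstream of the common table would be written once.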
I could see that this would be much easier if the data sources were well normalized, as a single query against each data source's table could project them into similar structures. But for the cases I've dealt with, to accomplish the same would require a lot of in-memory processing, and if I wanted to do this with stored procedures instead of some application outside the database (C# app/tooling), it would be difficult without staging to tables (I can only imagine using lots of nested cursors or passing table-valued variables around between stored procedures in MS SQL Server).
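For instance, the repeating-columns problem can sometimes be flattened in a single query with CROSS APPLY on SQL Server 2008+, though with the structures I've dealt with it's rarely this clean (raw.LegacyCustomer and the phone1/phone2/phone3 columns are made up):

-- Flatten repeating columns (phone1..phone3) into one row per value.
SELECT s.cust_id,
       p.phone_type,
       p.phone_number
FROM raw.LegacyCustomer AS s
CROSS APPLY (VALUES ('home',   s.phone1),
                    ('work',   s.phone2),
                    ('mobile', s.phone3)) AS p(phone_type, phone_number)
WHERE p.phone_number IS NOT NULL;  -- drop the empty repeats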
AaronLS
Re: Initial stage tables from multiple sources?
As they say... your mileage may vary. It's an architectural goal that you should try for, but it may not always be achieved given particular circumstances.