ETL Design Problems for Real time
2 posters
Page 1 of 1
ETL Design Problems for Real time
Hi at all
I agonise my head in designing an ETL for a real time DWH. It is not so clear for me, what I have to do in which steps and i hope that you can help me a bit.
I tell you something about my initial situation.
I have two Source Systems, which are independent to each other and are used by different user applications. I have to design a DWH for Reporting issues. Some data in the source systems are the same. I will give you an example.
Table 1:
Table 2:
It is possible that the application which is using Table 1, is inserting a Product 4 with vvv in a few days for example.
So what I want is to build a Productdimension with the Attributes Productnumber, Description, Product, Status and Productyear.
When I want to integrate those tables in one schema, i need a staging area, is that right?
So what have I to do?
Table 1 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 2 --> name this Table Products --> Load Dimension
Table 2 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 1 --> name this Table Products --> Load Dimension
It is not clear for me, what tables do I need for the complete process.
Another approach is the real time approach, where changed or updated data will be pushed to the Datawarehouse.
It would be great, if you can help me to get clearer.
I agonise my head in designing an ETL for a real time DWH. It is not so clear for me, what I have to do in which steps and i hope that you can help me a bit.
I tell you something about my initial situation.
I have two Source Systems, which are independent to each other and are used by different user applications. I have to design a DWH for Reporting issues. Some data in the source systems are the same. I will give you an example.
Table 1:
Productnumber| | Description| | Status| | Product |
P1 | xxx | 1 | Product1 |
P2 | yyy | 1 | Product2 |
P3 | zzz | 1 | Product3 |
ID| | Description| | Product | | Productyear | | Productnumber | |
1 | yyy | Product2 | | 2001 | P2 |
2 | xxx | Product1 | | 2004 | P1 |
3 | vvv | Product4 | | 2006 | P4 |
It is possible that the application which is using Table 1, is inserting a Product 4 with vvv in a few days for example.
So what I want is to build a Productdimension with the Attributes Productnumber, Description, Product, Status and Productyear.
When I want to integrate those tables in one schema, i need a staging area, is that right?
So what have I to do?
Table 1 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 2 --> name this Table Products --> Load Dimension
Table 2 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 1 --> name this Table Products --> Load Dimension
It is not clear for me, what tables do I need for the complete process.
Another approach is the real time approach, where changed or updated data will be pushed to the Datawarehouse.
It would be great, if you can help me to get clearer.
sky87- Posts : 2
Join date : 2014-09-18
Re: ETL Design Problems for Real time
Hi - as a general principle you would use a staging area for loading any data into a DW. You load data "as is" from your source system(s) into the staging area, transform it and then load it into your DW. Your staging area would normally be a schema in the same DB as your DW (or at least on the same physical DB server) for performance reasons as you tend to do a lot of lookups between your staged data and your DW.
In your particular example I'm not sure how this data is structured. There appears to be a 1:1 relationship between the records in each table - is this correct or is this just because of the data examples you've given?
Can you have multiple records (one for each year) for each Product in Table 2?
Are both tables populated at the same time for each Product and if not is one always populated first?
I'm a bit confused as most source systems are OLTP systems and have normalised data while these tables are not normalised.
Anyway - you can probably extract the data you want using a single SQL query and the write this data to your staging area:
SELECT T2.*, T1.STATUS
FROM Table2 T2
INNER JOIN Table1 T1 on T2.productnumber = T1.PRODUCTNUMBER -- Change to left join if appropriate
Hope this helps?
Regards,
Nick
In your particular example I'm not sure how this data is structured. There appears to be a 1:1 relationship between the records in each table - is this correct or is this just because of the data examples you've given?
Can you have multiple records (one for each year) for each Product in Table 2?
Are both tables populated at the same time for each Product and if not is one always populated first?
I'm a bit confused as most source systems are OLTP systems and have normalised data while these tables are not normalised.
Anyway - you can probably extract the data you want using a single SQL query and the write this data to your staging area:
SELECT T2.*, T1.STATUS
FROM Table2 T2
INNER JOIN Table1 T1 on T2.productnumber = T1.PRODUCTNUMBER -- Change to left join if appropriate
Hope this helps?
Regards,
Nick
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: ETL Design Problems for Real time
hi
thank you for your reply.
Okay I design a staging area. I have read that the staging area is normalized but not even in 3NF. If I am right I pull my data as it is in the staging area (1to1 copy), then I consolidate my table schemas without any constraints. Is this right?
In table two the product number and product is unique. I can´t add multiple products in this table.
The tables are not populated at the same time. I might be possible that one application ( whatever which application) changes or adds data at one time and the other later.
If i have my staging area and my datawarehouse is loaded with an initial load from the staging area, what have i to do with the data in the staging area after loading? What are best practises, if only changed data will be poplated in real time to the datawarehouse?
Regards
Isa
thank you for your reply.
Okay I design a staging area. I have read that the staging area is normalized but not even in 3NF. If I am right I pull my data as it is in the staging area (1to1 copy), then I consolidate my table schemas without any constraints. Is this right?
In table two the product number and product is unique. I can´t add multiple products in this table.
The tables are not populated at the same time. I might be possible that one application ( whatever which application) changes or adds data at one time and the other later.
If i have my staging area and my datawarehouse is loaded with an initial load from the staging area, what have i to do with the data in the staging area after loading? What are best practises, if only changed data will be poplated in real time to the datawarehouse?
Regards
Isa
sky87- Posts : 2
Join date : 2014-09-18
Re: ETL Design Problems for Real time
Hi - this is not meant as a criticism but it sounds like your knowledge of ETL and data warehousing is rather limited - and this forum is probably not the best place to get a tutorial on these topics - if only because it would take days and hundreds of posts. Can I suggest that you read a book on the topic - Kimball's is a good starting point: http://www.amazon.co.uk/Data-Warehouse-ETL-Toolkit-Techniques/dp/0764567578/ref=sr_1_5?s=books&ie=UTF8&qid=1411123412&sr=1-5&keywords=kimball
Alternatively book yourself on a course - preferably one that focuses on whichever ETL tool it is that you are using.
Regards,
Nick
Alternatively book yourself on a course - preferably one that focuses on whichever ETL tool it is that you are using.
Regards,
Nick
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Similar topics
» Problems with design to allow Rolling up of Hierarchical Data
» Real Time Data Aggregation
» Real time datawarehousing
» Near Real time ETL and ETL cloud
» Real time DWH - Choosing Right Technologies
» Real Time Data Aggregation
» Real time datawarehousing
» Near Real time ETL and ETL cloud
» Real time DWH - Choosing Right Technologies
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum