Incremental load for a dimension table having multiple tables as sources
4 posters
Page 1 of 1
Incremental load for a dimension table having multiple tables as sources
I have a dimension table which has 2 tables joined together as the source. When it comes to the initial full load things are quite straight forward but now I would like to implement the incremental load based on CDC. How should I go about this?
The first thing I did was to move all the changes in CT tables since last time the incremental load was executed into counterparts tables in staging area. I want to do that as I would like to ease as much as possible the load on the source transactional database.
Now it is obvioius that I can't join the CT tables, moreover some times it is only one of the table that will generate changes. For example let's say attribute 1 and 2 come from source table A and attribute 3 come from source table B and the dimension table is an SCD2. If only the attribute 2 has been modified then only the CT table A would record the change and based only on this I cannot create the new record in the dimension table as the attribute3 is required but not available in its CT table B. And if I think about this even if I had the related record in both CT table I would still not be able to join them, consequently being forced to process them sequentally which gets me to the same spot.
I am really stuck here and would appreciate any suggestions, directions etc.
Thanks in advance
Iulian
The first thing I did was to move all the changes in CT tables since last time the incremental load was executed into counterparts tables in staging area. I want to do that as I would like to ease as much as possible the load on the source transactional database.
Now it is obvioius that I can't join the CT tables, moreover some times it is only one of the table that will generate changes. For example let's say attribute 1 and 2 come from source table A and attribute 3 come from source table B and the dimension table is an SCD2. If only the attribute 2 has been modified then only the CT table A would record the change and based only on this I cannot create the new record in the dimension table as the attribute3 is required but not available in its CT table B. And if I think about this even if I had the related record in both CT table I would still not be able to join them, consequently being forced to process them sequentally which gets me to the same spot.
I am really stuck here and would appreciate any suggestions, directions etc.
Thanks in advance
Iulian
iuliani- Posts : 2
Join date : 2012-12-30
Location : Canada
Re: Incremental load for a dimension table having multiple tables as sources
iuliani wrote:Now it is obvioius that I can't join the CT tables...
Why not?
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Incremental load for a dimension table having multiple tables as sources
Well :-) I guess the only obvious thing here is that I don't have experience with ETL; so if you see misconceptions in the way I think of this process, please point them out.
The way I wanted this ETL process to function was to first separate them into two independent processes; one extracts changes from the source in batches and put them in similar CT tables in staging are; the second one hadles these batches in order, doing all the needed transformations and loading the data into the datawarehouse relational database.
This means that in the CT tables the "same" source record may exists multiple times, although I do bring in only the net changes per batch. Furthermore to the case I presented, what if one of the source table lacks a change? Even if I used an outer join I would still need to load something in the destination table field that was missing. Should I connect to the source database and bring in the missing value?
Thanks for consideration.
The way I wanted this ETL process to function was to first separate them into two independent processes; one extracts changes from the source in batches and put them in similar CT tables in staging are; the second one hadles these batches in order, doing all the needed transformations and loading the data into the datawarehouse relational database.
This means that in the CT tables the "same" source record may exists multiple times, although I do bring in only the net changes per batch. Furthermore to the case I presented, what if one of the source table lacks a change? Even if I used an outer join I would still need to load something in the destination table field that was missing. Should I connect to the source database and bring in the missing value?
Thanks for consideration.
iuliani- Posts : 2
Join date : 2012-12-30
Location : Canada
Re: Incremental load for a dimension table having multiple tables as sources
You seem to be heading in the right direction with your assesment of the problem. You need to look at your model and understand how the tables are related and how changes can occur. Also determine what kind of delta processing to apply, type 1, type 2 or something else. Not every dimension will be treated in the same way.
For example in insurance you may implement the following when considering your deltas:
A policy dimension dimension may be made up of many tables in source, 3 of which could be policy, broker and insured. If policy has changes/has new records then you need to process all this data. In addition, you may need to check if there is a change to broker or insured but no change to the base policy. If changes have occured on broker or insured then you need to figure out which policies this applies to and join back to the policy and process accordingly.
thedude- Posts : 21
Join date : 2009-02-03
Location : London
Re: Incremental load for a dimension table having multiple tables as sources
thedude wrote:
A policy dimension dimension may be made up of many tables in source, 3 of which could be policy, broker and insured. If policy has changes/has new records then you need to process all this data. In addition, you may need to check if there is a change to broker or insured but no change to the base policy. If changes have occured on broker or insured then you need to figure out which policies this applies to and join back to the policy and process accordingly.
Hi,
Unfortunately this implies that you have direct access to the operational database when you want to do your joins which is most of the time forbidden by the dba team.
I also have never been able to implement this solution even when the dba gave me full access to the operationnal datastore. Trying to do a join in a operational database between 100 000 rows (the daily delta) and a table with 120 millions of rows (looking for the other attributes of the dimension) will take more time than fully loading the two table (only the columns you need) into an ods.
exhortae- Posts : 30
Join date : 2010-08-01
Similar topics
» Initial stage tables from multiple sources?
» Creating Crosswalk Tables for Multiple Data Sources
» Dimension with different sources (multiple business keys) ?
» building a dimension from multiple sources
» Dimension Design with Multiple Data Sources (ORACLE and SQL Server)
» Creating Crosswalk Tables for Multiple Data Sources
» Dimension with different sources (multiple business keys) ?
» building a dimension from multiple sources
» Dimension Design with Multiple Data Sources (ORACLE and SQL Server)
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum