Incremental Loading of data
3 posters
Page 1 of 1
Incremental Loading of data
Hi,
In our GL_account Dimension table, the data from source data base comes from 6 tables. We need to load the data incrementally into data mart using SSIS. Means, we want to load only the changed records using the SQL Server CDC feature. We have the server enabled with CDC. There is a account id which links the 4 tables. But the two table needs to be connected using the client id. These tables undergoes frequent changes in the source and needs to be updated into the data mart. Now the challenge here is, we need to load these two table which does not contain the account id. As per the incremental load using cdc, only these two tables needs to be loaded and not the entire set of tables and I don't want to do a full load.. Can any one tell me how to overcome this situation? Is there any strategy available for this?
In our GL_account Dimension table, the data from source data base comes from 6 tables. We need to load the data incrementally into data mart using SSIS. Means, we want to load only the changed records using the SQL Server CDC feature. We have the server enabled with CDC. There is a account id which links the 4 tables. But the two table needs to be connected using the client id. These tables undergoes frequent changes in the source and needs to be updated into the data mart. Now the challenge here is, we need to load these two table which does not contain the account id. As per the incremental load using cdc, only these two tables needs to be loaded and not the entire set of tables and I don't want to do a full load.. Can any one tell me how to overcome this situation? Is there any strategy available for this?
dbadwh- Posts : 31
Join date : 2011-09-30
Re: Incremental Loading of data
A strategy that's worked for me is to load each of the required source tables incrementally, but into a staging DB / layer where I can query the full current picture of each table. We need to check for changes in any of the tables that will affect the row in the dimension, so our first stage in processing the dimension would be something like this:
SELECT {required columns}
FROM Table1
LEFT JOIN Table2 ON Table1.account_id = Table2.account_id
LEFT JOIN Table3 ON Table1.account_id = Table3.account_id
LEFT JOIN Table4 ON Table1.account_id = Table4.account_id
LEFT JOIN Table5 ON Table1.client_id = Table5.client_id
LEFT JOIN Table6 ON Table1.client_id = Table6.client_id
-- Change detection
WHERE Table1.changed_date = @yesterday_date
OR Table2.changed_date = @yesterday_date
OR Table3.changed_date = @yesterday_date
OR Table4.changed_date = @yesterday_date
OR Table5.changed_date = @yesterday_date
OR Table6.changed_date = @yesterday_date
The tables we're querying would be replicas of the source tables that we'd bulk load initially and then update via the CDC, setting a changed_date so that downstream processing can work with the replicas incrementally as above. The query above won't be the most efficient ever as the OR's in the WHERE clause mean it can't use an index on changed_date in any of the tables, but it's a significant improvement on bulk loading, bearing in mind that there might be a few transformations we need to do in between selecting the data and updating the dimension.
SELECT {required columns}
FROM Table1
LEFT JOIN Table2 ON Table1.account_id = Table2.account_id
LEFT JOIN Table3 ON Table1.account_id = Table3.account_id
LEFT JOIN Table4 ON Table1.account_id = Table4.account_id
LEFT JOIN Table5 ON Table1.client_id = Table5.client_id
LEFT JOIN Table6 ON Table1.client_id = Table6.client_id
-- Change detection
WHERE Table1.changed_date = @yesterday_date
OR Table2.changed_date = @yesterday_date
OR Table3.changed_date = @yesterday_date
OR Table4.changed_date = @yesterday_date
OR Table5.changed_date = @yesterday_date
OR Table6.changed_date = @yesterday_date
The tables we're querying would be replicas of the source tables that we'd bulk load initially and then update via the CDC, setting a changed_date so that downstream processing can work with the replicas incrementally as above. The query above won't be the most efficient ever as the OR's in the WHERE clause mean it can't use an index on changed_date in any of the tables, but it's a significant improvement on bulk loading, bearing in mind that there might be a few transformations we need to do in between selecting the data and updating the dimension.
PeteGrace- Posts : 7
Join date : 2011-09-01
Re: Incremental Loading of data
Will it not create heavy impact on performance?
dbadwh- Posts : 31
Join date : 2011-09-30
Re: Incremental Loading of data
Depends how big your tables are, but as I say, compared with the alternative of doing a bulk load it should represent a significant improvement.
If you're worried that your warehouse won't be able to cope with that sort of join then the only other option I can think of would be to update the dimension rows in multiple passes.
e.g.
UPDATE dimension
SET dimension.columnA = Table1.columnA etc.
FROM dimension
INNER JOIN Table1 ON account_id
WHERE Table1.changed_date = @yesterday_date
...
UPDATE dimension
SET dimension.columnB = Table5.columnB etc.
FROM dimension
INNER JOIN Table5 ON client_id
WHERE Table5.changed_date = @yesterday_date
and so on...
If you're worried that your warehouse won't be able to cope with that sort of join then the only other option I can think of would be to update the dimension rows in multiple passes.
e.g.
UPDATE dimension
SET dimension.columnA = Table1.columnA etc.
FROM dimension
INNER JOIN Table1 ON account_id
WHERE Table1.changed_date = @yesterday_date
...
UPDATE dimension
SET dimension.columnB = Table5.columnB etc.
FROM dimension
INNER JOIN Table5 ON client_id
WHERE Table5.changed_date = @yesterday_date
and so on...
PeteGrace- Posts : 7
Join date : 2011-09-01
Re: Incremental Loading of data
Hi dbadwh,
I liked PeteGrace's first suggestion personally. I think you might be able to invoke the date indexes by moving the WHERE clause statements inside each LEFT JOIN, e.g.
LEFT JOIN Table2 ON Table1.account_id = Table2.account_id
AND Table2.changed_date = @yesterday_date
Good luck!
Mike
I liked PeteGrace's first suggestion personally. I think you might be able to invoke the date indexes by moving the WHERE clause statements inside each LEFT JOIN, e.g.
LEFT JOIN Table2 ON Table1.account_id = Table2.account_id
AND Table2.changed_date = @yesterday_date
Good luck!
Mike
Similar topics
» Loading data in Data Warehouse
» Loading data without key
» Loading Duration Data
» Delta Loading
» Loading Data Aggregated to Date into Fact Table
» Loading data without key
» Loading Duration Data
» Delta Loading
» Loading Data Aggregated to Date into Fact Table
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum