Kimball Forum
Would you like to react to this message? Create an account in a few clicks or log in to continue.

Incremental Loading of Staging Table - Re-running Import for Same Window

2 posters

Go down

Incremental Loading of Staging Table - Re-running Import for Same Window Empty Incremental Loading of Staging Table - Re-running Import for Same Window

Post  min.emerg Wed Oct 31, 2012 7:30 am

Hello

I am unsure as to how to best cater for the following situation.

I am building an incremental load strategy for a data warehousing project that I am working on. Data that has been updated (we can reliably use the UpdatedDate field on each table) within a given date range (window) will be pulled from the source system to staging, and ultimately to the warehouse.

What confuses me is the consequence of extending the range of the window back to a time that has already been loaded. For instance, if a window from 2012-01-01 00:00:00 to 2012-01-01 23:59:59 was used yesterday, and for some reason we need to load the warehouse today with a window from 2012-01-01 00:00:00 to 2012-01-02 23:59:59. The start date is the same for yesterday and today, but today's end date has extended.

The problem I see is the ETL process pulling through two versions of each record, because one version would already exist in staging from yesterday's import, and the same record will be re-imported into staging today because today's window includes yesterday's window. Even if I did a check to exclude records that haven't been changed, the record could have been updated, resulting in two versions of the same record for the current window (i.e.: duplicate business keys within the window). As a result, we wouldn't know which one to pick for the load.

One option I see is to delete from staging all records whose UpdatedDate lies within the window. That way we are guaranteed to only have unique business keys for the window. The problem with this approach is that we might lose important history and lineage data in staging.

Another option is to select the most recent record if there are multiple versions with different data for the window.

Are there any best practices for this situation? I have scoured the Interwebs but haven't found a definite answer. Any suggestions of guidance would be greatly appreciated.

Thanks!

min.emerg

Posts : 39
Join date : 2011-02-25

Back to top Go down

Incremental Loading of Staging Table - Re-running Import for Same Window Empty Re: Incremental Loading of Staging Table - Re-running Import for Same Window

Post  ngalemmo Wed Oct 31, 2012 4:29 pm

Usually this is handled by storing the last pull timestamp in a table on the data warehouse side and using it as the start time to pull the next load. This assumes if a row did change in the source, it would have a more recent timestamp, reflecting the time of the change. This would ensure you only get new or recently changed information.

Since in either case you may get updates (this method or your window method), so your ETL process will need to handle it. This is the standard case with dimensional as well as fact data. Fact data is a little more complex depending on the nature of your fact table (transactional, snapshot, or accumulating snapshot). A robust ETL process should be designed to handle such situations regardless of how you pull data.

As far as multiple instances of the same row in a single pull, what you do will depend on requirements. The simplest case is to use the last row (based on some chronology). Where the requirement is to capture interim changes, you will need to process the data chronologically, which, if you are loading a type 2 dimension, will generate more rows. This latter case complicates assigning type 2 keys to facts because you need to take into account the time of the fact when assigning the FK.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Incremental Loading of Staging Table - Re-running Import for Same Window Empty Re: Incremental Loading of Staging Table - Re-running Import for Same Window

Post  min.emerg Thu Nov 01, 2012 3:49 am

Thanks for the response. I have an auto-incrementing surrogate key in the staging table, so I think I'll take the max of this and group by the business key/id to get a single unique record for a given window.

Thanks again.

min.emerg

Posts : 39
Join date : 2011-02-25

Back to top Go down

Incremental Loading of Staging Table - Re-running Import for Same Window Empty Re: Incremental Loading of Staging Table - Re-running Import for Same Window

Post  Sponsored content


Sponsored content


Back to top Go down

Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum