SCD Type 2 more than one "unique" row per load. How to handle it

itcouple Mon Feb 13, 2012 9:21 am

Hi

I wanted to get some opinions how to handle this scenario:

I get a file that contains data from different tables. Some of the fields relate to "customer" so in one extract I can have the same customer with different value for type 2 attribute.

I need to correctly related surrogate key with fact table therefore I need to somehow load one customer multiple times and put effective from/to and end up only with 1 current customer.

I use SSIS and I have used SCD Transformation (build-in) and I have started testing it for this scenaior but it doesn't handle it very well and I end up with two new surrogate keys for the same customer but both of them have affectiveFrom filled in and EffectiveTo NULL which is not correct (2 "current" customers).

I can think of some ways of doing that but they sound messy... any advice from those who have experience in this scenario or have some ideas would be appreciated.

Regards
Emil

LAndrews Mon Feb 13, 2012 1:39 pm

I'll assume you're SCD2 dimension already has the following field
-Effective Start Date
-Effective End Date
-Current_Flag

What I've done in the past when I want to track multiple changes in a day (yet only get 1 record per day)

Add the following
- effective sequence number (1,2,3 ..... etc)
- max_sequence_flag ('Y' = last record loaded for that day, otherwise 'N").

Then your ETL needs to leverage both the dates and the max_sequence_flag when performing a surrogate lookup or assigning the current flag.

This might not be possible with the built-in SCD2 transformation, you may require a custom transformation.

Hope this helps.

ngalemmo Mon Feb 13, 2012 2:32 pm

Or you can split the incoming data into separate staging tables for each target dimension and work with distinct versions of each.

itcouple Mon Feb 13, 2012 2:54 pm

Hi

Thanks for the replies.

Just to clarify there is one only version and one dimension so I won't be able to put into different staging tables.

With sequence I understand the concept but how do you "re-create" sequence in fact table? Would you for instance load dimension + fact with the same sort?

I initially thought about doing date + time for EffectiveFrom/EffectiveTo using business datetime field from fact table, but this has a problem of having to do it individually or like in your sequence example I could load dimension table with correct effective from (datetime) and set correct EffectiveTo Date after the dimension load which I presume would be similar to setting CurrentFlag to Y for max sequence? Then do lookup using business key + effective from/to (datetime).

Thanks for the sequence suggestion I think this approach might work for me (with datetime).

ngalemmo Mon Feb 13, 2012 6:01 pm

Misread your post. Sorry.

I usually don't use the 'built in' processes because they usually have quirks like you describe. Also the basic pattern for updating a type 2 is straight forward and shouldn't matter when or where an update comes from. My guess it the problem has to do with lookup caching within the SSIS procedure.

The question is, what do you want to happen? Obviously, having 2 current rows is not the answer. But, do you need the earlier row? No fact would ever reference it, unless you are basing effective periods on business timestamps.

To get to a solution, you need some way to sequence the data chronologically and either only process the oldest version, or if you want to include the interim versions, use something other than the process date to set effective periods. You then need to modify the SSIS procedure to keep track of prior updates to the same NK during the same batch. I am more familiar with Informatica which has the ability to update a dynamic cache. If there is similar functionality in SSIS, then you would add a new row with NK and timestamp every time you update a row in the dimension. You would do the normal lookup first to see if you need to update, then do an lookup on the dynamic cache to see if it has already been updated during that batch and have your logic function accordingly (either overwrite the last update or add a new SCD row and expire the last one).

itcouple Mon Feb 13, 2012 6:40 pm

Hi

Thanks for sharing you experience with informatica. This is what I initially thought I should do in SSIS (it does support full, partial and no cache which I often used in loading fact table where dim key doesn't exist) and thought it is rather too complicated.

The problem (which is not a problem) is that I would have to possibly sort data, conditinally split where NK > 1 decide which rows meet critieria (SCD logic) and do inserts with update untill I process all NK > 1 which sounds complicated... to me.

The reason why I want to go with build-in which is not ideal component is that it performs the basic SCD logic very well and minimizes effort & human errors which might be fairly big in this case and because I'm not permanent member of team I would prefer to leave it as easy to understand as possible.

and fortunatelly I have business datetime field which I hope can use in effective from/to fields which will also help during initial load.

Mike Honey Mon Feb 13, 2012 8:51 pm

Hi Emil,

I've used the SSIS built-in SCD transformation extensively - for similar reasons to your last post.

I havent noticed it behaving as you described - in fact the opposite. When two rows with identical Business Keys arrive, I expect two rows output - the first row received would end up "retired". As I understand it, SCD checks row-by-row against the destination table, with no cache.

Perhaps you have some slight difference in the Business Key between the two rows? Even trailing spaces can be enough to throw off SSIS. Case differences could also cause this issue, if your target database is case-sensitive.

Good luck!
Mike

ngalemmo Mon Feb 13, 2012 10:37 pm

As Mike indicated, if you turn off caching for the SSIS lookup, it should work correctly. My guess is it probably doesn't update the cache when the job runs, so the lookup doesn't know there already was a new row. With caching off, it would need to read the table each time it to get the 'current' version. If you sort the incoming data chronologically, a direct table read will always get the most current version of the row, even if this is the second time you try to update it. It may not even run much slower as SQLServer has internal database level caching of its own.

itcouple Tue Feb 14, 2012 5:32 am

Hi Mike,

I'm not sure if we talk about the same SCD Transformation. In my case SCD logic is performed on all rows and I get three different outputs. Update (for SCD type 1 attributes), Update (effective to) for SCD Type 2 followed by Insert.

I have created the same NK with type 2 change and I observed that I have 2 current rows in dimension.

I might have different settings or I need to do something extra with SCD wizard that I'm not aware of. See below screenshot.

SCD Type 2 more than one "unique" row per load. How to handle it Scd-mu14

SCD Type 2 more than one "unique" row per load. How to handle it Scd-mu14

I would appreciate your clarification.

itcouple Tue Feb 14, 2012 5:42 am

Just to clarify in the screenshot I gave. If I gave type 2 change twice for the same NK it performs update for both rows and then performs insert for both rows at the same time which is not what I want.

itcouple Tue Feb 14, 2012 6:11 am

Hi ngalemmo

Could you describe the steps that would be performed to achieve your task using informatica.

Table:
SK, NK, Attribute(Type2), EffectiveFrom, EffectiveTo

Dimension table:
1, abc, value1, 2000-01-01, NULL

Input values for next load:
NK, Attribute(Type2), BusinessDateTime
abc, value2, 2012-02-14 01:00:00
abc, value3, 2012-02-14 02:00:00
abc, value4, 2012-02-14 03:00:00
abc, value5, 2012-02-14 04:00:00

What would be the steps (step by step) in informatica to correctly load the data? end results 5 rows with proper effectiveFrom/To values and value5 is current row.

The reason why I ask is that I cannot find a way to do that in SSIS (without using loops or something outside data flow). It is fairly easy with standard Type 2 (one NK per load) but I struggle with understanding the logic for multiple NK with type 2 change and I'm interested to hear how informatica does it.

Mike Honey Tue Feb 14, 2012 8:30 pm

Hi Emil,

Thanks for clarifying your scenario. I think you are correct and the SCD transformation doesnt handle that correctly. While it doesnt cache the output table, the issue arises when the two rows arrive together. Due to SSIS's buffering they often get passed together to the components downstream from the SCD, so the updates for the first row havent completed before the second row is checked by the SCD transformation.

In the past I've used a Lookup transform just before the SCD to dump all but the last duplicate. This might not suit your scenario. If you need to get two rows output, you will probably have to build a For Each loop around that Dataflow, perhaps using a ROW_COUNT() partitioned by your NK.

Good luck!
Mike

itcouple Wed Feb 15, 2012 5:00 am

Hi Mike,

Thanks for clarifying that. The for each loop is what I will go with which has the extra benefit of doing initial load correctly.

BDW I will only use effectiveFrom field only and put view on top of the table to work out the effectiveTo.

Debunking Kimball Effective Dates

Which to many sounds less error prone, less effort and more maintanable.

Many thanks for everyones answers
Emil

Sponsored content

SCD Type 2 more than one "unique" row per load. How to handle it

SCD Type 2 more than one "unique" row per load. How to handle it

Re: SCD Type 2 more than one "unique" row per load. How to handle it

Re: SCD Type 2 more than one "unique" row per load. How to handle it

reply

Re: SCD Type 2 more than one "unique" row per load. How to handle it

reply

Re: SCD Type 2 more than one "unique" row per load. How to handle it

Re: SCD Type 2 more than one "unique" row per load. How to handle it

reply

reply

question to ngalemmo

Re: SCD Type 2 more than one "unique" row per load. How to handle it

reply

Re: SCD Type 2 more than one "unique" row per load. How to handle it