Difference in Incremental Changes between Source and DW

rohanf Fri Apr 11, 2014 10:56 am

Hi,

We have a source table Src_1 which feeds into DWH fact table say fact_2. We are extracting Daily changes from the source. The issue is that a source may flag the record as changed for a field that is not relavent to DWH. Now we can just add a record into fact table regardless of whether the measure has changed or not, but that would make my measure non additive. Else the only way to populate just changes is to apply SCD 2 on fact or atleast a lookup to fetch last record to compare with the current record before loading it.

An E.g. Source has below columns:-
A
B
C
D
E
F
out of these we only insert B as a measure into the fact. Now we get the whole record even if C (or D or F ...) has changed.

Just wanted to know how can I insert only changes into the fact table?

Please Suggest!

Regards
Rohan

BoxesAndLines Fri Apr 11, 2014 1:06 pm

Compare the source to the target. Just like you do for any type 2 SCD.

ngalemmo Fri Apr 11, 2014 1:57 pm

A technique to simply checking for changes is to calculate a 32 bit CRC on the columns you care about and store it in the fact (or dimension). Do the same with incoming data and compare CRCs rather than each individual column. It's less coding and speeds up the join as you do not need to retrieve every column off the fact table. You can improve performance even more on some systems by declaring an index on the fact of the PK and CRC. While you do not use the CRC in the join, the system will use just the index structure rather than needing to access the full table.

rohanf Mon Apr 14, 2014 4:34 am

Thanks, that makes sense, I initially had apprehensions to look at previous record in the fact table, but with CRC we can improve performance. Looks like this is the only way out.

just had a question on maintenance, We have a system where we use a similar approach - calculate MD5 hash value which is calculated at runtime on source and then stored in target - but whenever we want to add a new column to the table the existing MD5 stored on the table is old and the new MD5 value that we calculate from source is now different as the new column is added now.
Due to this we end up adding millions of record over again. We did devise a way to calculate the MD5 at run time for both Target and source columns but with this approach we will end up with same performance issue i.e. reading more than one column from fact.
Also in the existing application this did not cause major issue (apart from Unnecessary records) because it flagged the active record i.e. was SCD2.

In fact table in case there are changes later on I think we might face the same issue which will again cause additive facts to become non additive!!!

Is it preferable to use staging table to do all these checks rather than actual fact table. Note that staging will be exact replica of fact or maybe just the latest record!

Is this advisable, or there is a better way?

Appreciate your help!

Regards
Rohan

nick_white Mon Apr 14, 2014 6:56 am

If you are adding a new column to a table then are you also retrospectively populating this column for all existing records in both your source and target? If so then why not recalculate the MD5 hash value at the same time.
If you only populate the new column going forward then whenever the source record changes you will want to update the target (in order to get the new column value into your target) so your existing logic will still work: the MD5 hash on the old and new record will be different therefore create a new record - the fact the the hash was created using different rules on the old record is irrelevant.
I guess the only scenario where this may not be true is if the new column is null in the source when the source record changes

rohanf Mon Apr 14, 2014 2:05 pm

For the system that I referred to, The new column only applies to new records i.e. need not be updated for historical records in the target. Moreover as you have correctly identified the value of the new column is null for 80% of the records but we end up adding a new record for each just because the old MD5 did not have this new column.

For My initial question I guess, CRC is the only way. Should we do this excersize in a pre-fact staging table or the fact itself?

ngalemmo Thu Apr 17, 2014 11:45 am

You calculate the CRC when you prep the incoming data for loading (i.e. within the ETL process). It is also stored in the target table (fact or dimension) so you can check for change. When you update the row you replace the CRC value with the one from the incoming data.

Sponsored content

Difference in Incremental Changes between Source and DW

Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW

Re: Difference in Incremental Changes between Source and DW