De-duplicating, DW and MDM

jryan Fri Mar 15, 2013 5:45 am

Hi,

I'm trying to come up with a structure to allow dimensions (e.g. customer) to be de-duplicated. I will do that via an MDM product, which is fine.

What I'm unsure of, is how to structure the star schema fact and dim tables to reflect this. To take an example, if I have three customers (J Smith, John Smith and Jonathan Smith) with the business keys A, B and C, the MDM product will have de-duplicated those. Lets say the agreed name is John Smith, which is the the name that the master (golden) record gets.

If each 3 source system records have facts associated with them, e.g. $100 each, then I'm fairly certain that the grain of my fact table should contain 3 rows, each with $100.

My current design is to hold 3 records in the customer dimension table, using surrogate keys. The cols in the dim table would be: CustomerKey (surrogate), SourceSystemKey (e.g. holding the A, B, C) and finally the CustomerName attribute, which would be identical for all 3 records. Only the CustomerName would be exposed to the user. If the MDM product determines that customer C is actually a different customer, then the changes would be fed to the DW. As the facts are at the grain of the source system, only the dim table requires a change. The CustomerName attribute for customer C (surrogate key 3) would change accordingly and the DW now thinks we have 2 customers, rather than just 1.

So this works, it avoids re-loading facts, and also it allows ongoing re-mapping of customers across different systems without causing problems. However....it feels really clunky to me. If you're de-duping across 10 systems that means your dimension table increases in size by a factor of 10. You're effectively holding a load of junk in your dimension table.

Is this the best way of doing it, or does anyone have a better way?

ngalemmo Fri Mar 15, 2013 2:39 pm

The problem with MDM de-duping is it can change its mind. Today's dupe may be a unique tomorrow.

So, it is correct to keep three rows, reflecting the operational natural keys in the dimension table. What I would suggest is adding columns to support MDM. The columns populated from the source systems should not be updated by the MDM feed. Instead, have new columns to hold the MDM sourced attributes and the MDM assigned ID. This gives you the ability to compare interpretations as well as choose what you want to expose to the end-users.

jryan Mon Mar 25, 2013 11:58 am

Thanks for the reply.

It just felt a bit clunky holding duplicates in the dim table that way, but I really can't see a better way of doing it. Good to know that it's the correct way to go.

jryan Tue Mar 26, 2013 2:51 pm

I've just had another thought on this...having talked it over with a collegue.

Has anyone ever put natural keys (e.g. source system keys) on the fact table to cater for the MDM mapping/de-duping? This came up as an option for solving this problem.

This would allow the dimesions to hold only the true, MDM de-duped, dimension member. If the mapping between the golden record and the business keys change, then you're just updating some surrogate keys in the fact table. The natural keys on the fact row would never change.

My concern over holding an increased grain in the dimension table is that it doesn't naturally give you a count of your dimension members and would mean that SCD operations now apply to several rows, whereas before they would apply to just 1. On the other hand adding cols to the fact table could impact performance, but the actual number of bytes would be fairly small if you use data types such as int, smallint etc.

Would welcome anyone's thoughts on this...

ngalemmo Tue Mar 26, 2013 7:19 pm

Why create problems when there isn't one?

The fact that there are "duplicate" rows in the dimension because an MDM system thinks they are the same isn't a problem. Multiple rows for the same NK exists normally in a Type 2 dimension as it is, and that has never been a problem.

Users do not query on keys, or rows for that matter. They query on attributes. If you expose the MDM provided attributes in the dimension, queries will aggregate on those attributes. No one will know there is 1 or 100 dimension rows. It doesn't matter.

jryan Tue Mar 26, 2013 7:53 pm

Thanks again for the reply,

I guess my concern is over the increased #rows in the dimension table. If every "golden record" has 20 children, then the dim table blows out by a factor of 20. Hence I'm weighing up whether natural keys on the fact table would be better for performance. But I'm probably worrying about it too much from the sounds of things.

Sponsored content

De-duplicating, DW and MDM

De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM

Re: De-duplicating, DW and MDM