Dealing with Duplicate Dimension Rows

fred_mumble Mon Mar 08, 2010 12:21 pm

A recent acquisition has added complexity to the Customer dimension and associated fact tables at the company I work for. The root of the problem is that we now have around 20 sources of customer and associated sales facts. We have data quality processes in place such as cleansing, standardization, matching and merging. The problem is that given the number of sources, volume of data and that we can only block data from going into the warehouse for 24 hours, duplicate Customer dimension rows can and do get in.

Given this, we have been trying to come up with a design that is flexible enough to accommodate this situation; i.e allow us to merge the Customers and correct the fact tables. One example is destructively updating fact rows to change Customer surrogate keys which has some benefits and clear drawbacks.

I'm sure we're not the first company to encounter this situation but don't recall it coming up in the "Dimensional Modeling in Depth" course I took last year and haven't seen it covered in the design tips. So, I wanted to reach out to the community to see if I am overlooking a resource or if there are some best practices that can be shared in this subject area.

ngalemmo Tue Mar 09, 2010 12:44 pm

Define "duplicate"...

If you are talking about the same customer coming in from a different source because the natural key is different (due to the source system), I would not consider it a duplicate.

The important thing to realize is that integration and reporting off a dimensional structure has nothing to do with keys and everything to do with attributes. You see it all the time with type 2 dimensions, where the same customer may have many rows in the dimension table. I feel it is critical when integrating multiple dimensional sources that you do not rely on pre-process consolidation, but rather maintain the indentity of the row to its source system. In other words, each source is represented by a mutually exclusive set of rows in the dimension table. The natural key should be qualified by the source system and facts coming in from a source would be matched to dimension rows for that source.

Maintenance of such dimension tables requires the introduction of one or more update keys (set to natural key values for that source), one key for each source that represents the system of record for sets of attributes in the dimension. These would be populated based on a cross-reference of some sort. Updates would be performed based on these update keys.

What this does is allow for changes to the cross-reference or business rules without affecting the data structures or surrogate keys. If a cross-references changes, for example, it is simply a matter of correcting update keys in the dimension. The next time the dimension is updated, the attributes in the corrected rows would be updated to the correct values for the referenced source entity. Same thing if business rules change where the source of some attributes change. Then it is a matter of modifying the update processes with no effect to to relationships with facts.

DilMustafa Tue Mar 09, 2010 1:38 pm

Well customer dimension is a a copy of golden records. If you are getting the same record from multiple sources, your data quality engine should identify the master record

BoxesAndLines Thu Mar 11, 2010 12:04 am

The data quality is only as good as the incoming data. Take for example a Party dimension that rolls customers up to a person or organization. One system could have a partial address while another system could have a complete address. You may or may not link these customers as the same party. At some later time, the full address comes in from the first system. Now the data quality software does link the two parties into one party. All of the original facts will point to the old party. In order to fix this, the OP noted that he has to update the fact table rows for Party changes. There is a way around avoiding the updates on the fact tables but it requires the creation of a intermediary table. The intermediary table will contain the unique identifier(s) from the fact table as well as the FK to the Party dimension. This will more than likely be more painful than updating the fact tables.

ngalemmo Thu Mar 11, 2010 4:31 pm

DilMustafa,

Relying on a gold copy as a means to IDENITFY entities in a dimension table is a big mistake. MDM systems do not ensure a consistent identity of, say a customer, over time across different source systems.

For example, you have two sales systems. One system A there is a customer #12345 who is John Smith at 123 Main St. On sales system B there is customer #S9999 who has the same name and address. Both get fed into an MDM system that determines, based on rules that have been set up, that they are the same person and the MDM system assigns its own identifier, #76543, to that customer. If you build your dimension table with one row using the MDM's identifier as the natural key, you are leaving yourself open for a world of hurt.

In reality, these are really two different people, but the information available or the rules could not discern between the two. Later, the MDM rules are updated and now the MDM system assigns different identifiers. This is not a problem for the operational systems, because those systems retained their own identifiers and can easily update their information based on the MDM feed. But, the data warehouse cannot because it did not retain the true identity (i.e. the source system key) of the customer.

So, natural keys in a dimension table must be the source system key. Each row needs to retain its identity back to where it came from. When facts are loaded, you should not translate source system keys against some MDM cross reference. Instead link to the dimension row with the matching source system key. The update key provides the means to maintain dimensional attributes against the gold copy. In this case there would be two customer rows and each would initially have an update key set to the gold copy identitifer of 76543. Update dimension rows based on the gold identifer against the update key. If later the MDM assigns a new identifier to one of the source customers, it is simply a matter of changing the update key value. No foreign keys are changed because those are based on the source system identifier.

D_Pons Fri Mar 19, 2010 12:08 pm

We have a slight variation on the 'duplicate customer' scenario.
The source system regularly identifies where a customer is represented by more than one record.
This usually happens after the data has been loaded in to the DataWarehouse DataBase.

The source system merges details from the two records in to one or other of the pre-existing records and deletes the other.
All associated details are likewise moved to the retained customer record.

For the DataWarehouse, this means that some Facts are now attributed to a different Customer.
This is fairly easy to handle - we contra the fact records that were linked to the now deleted customer record and re-insert them linked to the retained customer record using the appropriate surrogate key.

However, we also have attributes in the Customer dimension that are summaries of fact data - helpful when just profiling customers based on customer value.
And some of these attributes are also type 2 SCD attributes.
So when these new facts are added, it effectively means our SCD 2 history does not match.
So, allowing that we may have painted ourselves into a corner with the design, are there any tips for dealing with a situation where we effectively have late arriving fact and dimension data that is related?

Our current thinking is that we will just have to go through the affected customers and recalculate the affected attributes and surrogate keys - probably as an exceptional 'data correction' process.

Sponsored content

Dealing with Duplicate Dimension Rows

Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows

Re: Dealing with Duplicate Dimension Rows