A dozen systems with many dimension codes

Al Wood Mon Jan 10, 2011 9:55 am

Hi,

I have more than a dozen systems to integrate, and each has it's own version of dimension codes. The source systems need to use dummy codes, and multiple codes for one entity. This for operational reasons.
So, to conform a dimension I have a mapping table for each source system. These are to be maintained by Data Stewards. Each lookup table contains the same surrogate key values as the master table. The SSIS lookups use these mapping tables, not the Master DW dimension table.

The problem I have is:
When a lookup value is not found, do I load new natural keys into the Source System Lookup table, or the Master DW Dimension table, or both? For the data to be available I need to load at least the Master DW Dimension table, then I must de-duplicate when the same entity comes through from a different source system. But what if a different entity comes through from a different source with a code that has ALREADY been loaded from the first source system? That would cause a problem.

I think the best I can do is to load the lookup (mapping) tables with the natural key as a placeholder, and insert a generated unique value in the master, e.g. "STAFFMEMBER_A.Lien_FROM_SYSTEM_X". Then when/if we find out that its the same person in many systems, we deduplicate the master, and map the new entity for each system where it occurs. Can anyone think of a better way?

Al Wood

ngalemmo Mon Jan 10, 2011 12:01 pm

Make the natural key to the dimension a source identifier and the code, ensuring unique values. Store a row in the dimension for each natural key. Relate facts to the row based on source and code. Resolve cross reference in the dimension using 'standard' attributes.

Al Wood Mon Jan 10, 2011 1:54 pm

Thanks for the response.

Store a row in the dimension for each natural key

Doesn't this mean that a staff member will have multiple surrogate keys in the dimension table?

Is that a good way to design the DW?

Al Wood

ngalemmo Tue Jan 11, 2011 1:03 pm

If staff member is the dimension and it is coming in from different sources with different identifiers, then yes. One row for each unique source/identifier.

Include columns to hold the common identifier and the common attributes. Rows that map to the same staff member would hold the same values in the common attributes. Use the common attributes in queries to consolidate information for the same staff member.

Since you tie facts from a source based on the identifier in that source, you ensure a correct and stable assignment. Changes to the cross reference only affect the attributes in the dimension, not the relationship between facts and the dimension.

Al Wood Thu Jan 13, 2011 8:20 am

Hi,

Thanks again. That's very denormalised; quite a shock for someone coming from a 3NF database world!
I'm working through example scenarios with this design and I'm finding that it's a huge simplification and easy to load if I use SSIS with "readsert" lookups.

Al Wood

Al Wood Fri Jan 14, 2011 2:41 pm

Hi,

I'm wondering what you would do if one of the source systems needed to use two fields to uniquely identify the staff member, but most of them used just one field. Would you add a field to the dimension table?

I'm also wondering if this design can be used for the Location dimension? Customers? Products?

Al Wood

BoxesAndLines Fri Jan 14, 2011 4:20 pm

You need the natural key for lookups during the ETL process. If that's 1,2,3, or 4 columns, you'll need them all. Alternatively, you can manage the deduplication in the ETL process and present the consolidated dimension. Depending on the volume of duplicates, this can offer some performance advantages in reporting.

ngalemmo Sun Jan 16, 2011 2:22 am

Al Wood wrote:Hi,

I'm wondering what you would do if one of the source systems needed to use two fields to uniquely identify the staff member, but most of them used just one field. Would you add a field to the dimension table?

I'm also wondering if this design can be used for the Location dimension? Customers? Products?

Al Wood

As a rule, I abstract my natural keys. When I design a dimension table I always have a varchar column for holding the natural key in addition to columns for attributes. The natural key column contains a delimited concatenation of whatever source values make up the key. The concatenation is done during staging, since the process that extracts to staging is always source specific, so it can contain specific code to build the natural key values. The code that loads from staging can be source agnostic since it only has to deal with the abstract natural keys to lookup surrogate keys.

Sponsored content

A dozen systems with many dimension codes

A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes

But what if..

Re: A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes

Re: A dozen systems with many dimension codes