A dozen systems with many dimension codes
3 posters
Page 1 of 1
A dozen systems with many dimension codes
Hi,
I have more than a dozen systems to integrate, and each has it's own version of dimension codes. The source systems need to use dummy codes, and multiple codes for one entity. This for operational reasons.
So, to conform a dimension I have a mapping table for each source system. These are to be maintained by Data Stewards. Each lookup table contains the same surrogate key values as the master table. The SSIS lookups use these mapping tables, not the Master DW dimension table.
The problem I have is:
When a lookup value is not found, do I load new natural keys into the Source System Lookup table, or the Master DW Dimension table, or both? For the data to be available I need to load at least the Master DW Dimension table, then I must de-duplicate when the same entity comes through from a different source system. But what if a different entity comes through from a different source with a code that has ALREADY been loaded from the first source system? That would cause a problem.
I think the best I can do is to load the lookup (mapping) tables with the natural key as a placeholder, and insert a generated unique value in the master, e.g. "STAFFMEMBER_A.Lien_FROM_SYSTEM_X". Then when/if we find out that its the same person in many systems, we deduplicate the master, and map the new entity for each system where it occurs. Can anyone think of a better way?
Al Wood
I have more than a dozen systems to integrate, and each has it's own version of dimension codes. The source systems need to use dummy codes, and multiple codes for one entity. This for operational reasons.
So, to conform a dimension I have a mapping table for each source system. These are to be maintained by Data Stewards. Each lookup table contains the same surrogate key values as the master table. The SSIS lookups use these mapping tables, not the Master DW dimension table.
The problem I have is:
When a lookup value is not found, do I load new natural keys into the Source System Lookup table, or the Master DW Dimension table, or both? For the data to be available I need to load at least the Master DW Dimension table, then I must de-duplicate when the same entity comes through from a different source system. But what if a different entity comes through from a different source with a code that has ALREADY been loaded from the first source system? That would cause a problem.
I think the best I can do is to load the lookup (mapping) tables with the natural key as a placeholder, and insert a generated unique value in the master, e.g. "STAFFMEMBER_A.Lien_FROM_SYSTEM_X". Then when/if we find out that its the same person in many systems, we deduplicate the master, and map the new entity for each system where it occurs. Can anyone think of a better way?
Al Wood
Al Wood- Posts : 46
Join date : 2010-12-08
Re: A dozen systems with many dimension codes
Make the natural key to the dimension a source identifier and the code, ensuring unique values. Store a row in the dimension for each natural key. Relate facts to the row based on source and code. Resolve cross reference in the dimension using 'standard' attributes.
Re: A dozen systems with many dimension codes
Thanks for the response.
Doesn't this mean that a staff member will have multiple surrogate keys in the dimension table?
Is that a good way to design the DW?
Al Wood
Store a row in the dimension for each natural key
Doesn't this mean that a staff member will have multiple surrogate keys in the dimension table?
Is that a good way to design the DW?
Al Wood
Al Wood- Posts : 46
Join date : 2010-12-08
Re: A dozen systems with many dimension codes
If staff member is the dimension and it is coming in from different sources with different identifiers, then yes. One row for each unique source/identifier.
Include columns to hold the common identifier and the common attributes. Rows that map to the same staff member would hold the same values in the common attributes. Use the common attributes in queries to consolidate information for the same staff member.
Since you tie facts from a source based on the identifier in that source, you ensure a correct and stable assignment. Changes to the cross reference only affect the attributes in the dimension, not the relationship between facts and the dimension.
Include columns to hold the common identifier and the common attributes. Rows that map to the same staff member would hold the same values in the common attributes. Use the common attributes in queries to consolidate information for the same staff member.
Since you tie facts from a source based on the identifier in that source, you ensure a correct and stable assignment. Changes to the cross reference only affect the attributes in the dimension, not the relationship between facts and the dimension.
Re: A dozen systems with many dimension codes
Hi,
Thanks again. That's very denormalised; quite a shock for someone coming from a 3NF database world!
I'm working through example scenarios with this design and I'm finding that it's a huge simplification and easy to load if I use SSIS with "readsert" lookups.
Al Wood
Thanks again. That's very denormalised; quite a shock for someone coming from a 3NF database world!
I'm working through example scenarios with this design and I'm finding that it's a huge simplification and easy to load if I use SSIS with "readsert" lookups.
Al Wood
Al Wood- Posts : 46
Join date : 2010-12-08
But what if..
Hi,
I'm wondering what you would do if one of the source systems needed to use two fields to uniquely identify the staff member, but most of them used just one field. Would you add a field to the dimension table?
I'm also wondering if this design can be used for the Location dimension? Customers? Products?
Al Wood
I'm wondering what you would do if one of the source systems needed to use two fields to uniquely identify the staff member, but most of them used just one field. Would you add a field to the dimension table?
I'm also wondering if this design can be used for the Location dimension? Customers? Products?
Al Wood
Al Wood- Posts : 46
Join date : 2010-12-08
Re: A dozen systems with many dimension codes
You need the natural key for lookups during the ETL process. If that's 1,2,3, or 4 columns, you'll need them all. Alternatively, you can manage the deduplication in the ETL process and present the consolidated dimension. Depending on the volume of duplicates, this can offer some performance advantages in reporting.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: A dozen systems with many dimension codes
Al Wood wrote:Hi,
I'm wondering what you would do if one of the source systems needed to use two fields to uniquely identify the staff member, but most of them used just one field. Would you add a field to the dimension table?
I'm also wondering if this design can be used for the Location dimension? Customers? Products?
Al Wood
As a rule, I abstract my natural keys. When I design a dimension table I always have a varchar column for holding the natural key in addition to columns for attributes. The natural key column contains a delimited concatenation of whatever source values make up the key. The concatenation is done during staging, since the process that extracts to staging is always source specific, so it can contain specific code to build the natural key values. The code that loads from staging can be source agnostic since it only has to deal with the abstract natural keys to lookup surrogate keys.
Similar topics
» Customer Dimension from multiple systems
» I can't figure out how to model this M:M relationship despite reading a dozen articles.
» How to handle bad zip codes
» Codes and indicators
» Language translation for large number of codes
» I can't figure out how to model this M:M relationship despite reading a dozen articles.
» How to handle bad zip codes
» Codes and indicators
» Language translation for large number of codes
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum