Conformed dimension design
3 posters
Page 1 of 1
Conformed dimension design
Hello,
I am working on an initiative to consolidate agent data for an insurance company - there is agent data being captured in several different operational sources - with 2 of them registering agents and then passing feeds into the several other operational sources to establish them.
To design the conformed dimension for this entity, I was suggesting we use a universal attribute like SSN/TIN to help recognize that its the same agent/agency coming in from a different operational source - my thought process was that the individual sources will have different system ids for the same agent - worse yet, the source systems are not constrained to have one system id for the same agent - therefore theoretically allowing duplicates to be created for what is the same agent/agency entity.
I am being told though it is not a good practice to use sensitive information like SSN/TIN to establish conformance - I was not given a very convincing answer around why the group felt so. I feel the BI group here is open to listening why I feel this would be a good approach - but I wanted to make sure I am not missing anything - hence the question - is there a significant demerit to using SSN/TIN to achieving conformance? I can address their data security issue by applying LBAC controls in the database - cant achieve conformance through system ids though!
THANKS FOR YOUR GUIDANCE in advance!
I am working on an initiative to consolidate agent data for an insurance company - there is agent data being captured in several different operational sources - with 2 of them registering agents and then passing feeds into the several other operational sources to establish them.
To design the conformed dimension for this entity, I was suggesting we use a universal attribute like SSN/TIN to help recognize that its the same agent/agency coming in from a different operational source - my thought process was that the individual sources will have different system ids for the same agent - worse yet, the source systems are not constrained to have one system id for the same agent - therefore theoretically allowing duplicates to be created for what is the same agent/agency entity.
I am being told though it is not a good practice to use sensitive information like SSN/TIN to establish conformance - I was not given a very convincing answer around why the group felt so. I feel the BI group here is open to listening why I feel this would be a good approach - but I wanted to make sure I am not missing anything - hence the question - is there a significant demerit to using SSN/TIN to achieving conformance? I can address their data security issue by applying LBAC controls in the database - cant achieve conformance through system ids though!
THANKS FOR YOUR GUIDANCE in advance!
leon_panokarren- Posts : 5
Join date : 2011-05-23
Re: Conformed dimension design
Sorry, I am getting a little antsy - Something I could do better to invite a response??
leon_panokarren- Posts : 5
Join date : 2011-05-23
Re: Conformed dimension design
From the dimension standpoint, you should create a row for every unique instance from the sources. The natural key should be the source system and its business key. Do not try to consolidate all instances to a single row. Facts would be associated to the appropriate row based on the business key used in the fact.
You can then deal with conformance at your leisure... add a standard business key (something the business finally decides is the common identifier, created in some manner) column and additional columns for conformed attributes and use the standard key to drive maintenance of the conformed columns (all rows with the same standard key get the same attribute values).
This approach allows you to load the data warehouse independent of whatever method is decided to conform the dimension. Facts are tied to rows based on their source natural key. This relationship remains stable not matter what is done to the dimension. Any changes are to dimension attributes and how you maintain the dimension's standard ID.
As far a using an SSN goes... you can always encrypt it and use the encrypted value as the standard key.
You can then deal with conformance at your leisure... add a standard business key (something the business finally decides is the common identifier, created in some manner) column and additional columns for conformed attributes and use the standard key to drive maintenance of the conformed columns (all rows with the same standard key get the same attribute values).
This approach allows you to load the data warehouse independent of whatever method is decided to conform the dimension. Facts are tied to rows based on their source natural key. This relationship remains stable not matter what is done to the dimension. Any changes are to dimension attributes and how you maintain the dimension's standard ID.
As far a using an SSN goes... you can always encrypt it and use the encrypted value as the standard key.
Re: Conformed dimension design
SSN's are not unique. You can change your SSN. This makes it a bad PK. I woud try to leverage the identifiers from the source system. Additionally, a data quality product such as Informatica IDQ or Dataflux can provide identity resolution using numerous methods.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Conformed dimension design
Thanks for your insights!!
ngalemmo - that is exactly what we are trying to do - there are 3 different sources feeding agent data - we are preserving each source's feed in relation to their respective primary keys. However, besides constraining on the primary key to be unique, the operational source does not constrain any attribute that identifies the business entity to be unique - consequently, it is possible that each source itself might have multiple primary keys of the same agent - if I use SSN to establish the identity of an agent to the EDW, I feel I will generate/retrieve the same EDW SK for the same agent; even if he has been duplicated in the source. Thoughts?
Boxes and lines - I'd have to respectfully disagree - source system PKs are often system generated identifiers,of which most operational systems, sooner or later end up creating multiples(duplicates) for what is the same business entity - SSN's change??... yeah ... often??... nah!!!!
ngalemmo - that is exactly what we are trying to do - there are 3 different sources feeding agent data - we are preserving each source's feed in relation to their respective primary keys. However, besides constraining on the primary key to be unique, the operational source does not constrain any attribute that identifies the business entity to be unique - consequently, it is possible that each source itself might have multiple primary keys of the same agent - if I use SSN to establish the identity of an agent to the EDW, I feel I will generate/retrieve the same EDW SK for the same agent; even if he has been duplicated in the source. Thoughts?
Boxes and lines - I'd have to respectfully disagree - source system PKs are often system generated identifiers,of which most operational systems, sooner or later end up creating multiples(duplicates) for what is the same business entity - SSN's change??... yeah ... often??... nah!!!!
leon_panokarren- Posts : 5
Join date : 2011-05-23
Re: Conformed dimension design
it is possible that each source itself might have multiple primary keys of the same agent - if I use SSN to establish the identity of an agent to the EDW, I feel I will generate/retrieve the same EDW SK for the same agent
As far as having the same agent in the same source under different keys, its unfortunate, but don't worry about it too much. Bring them in as separate rows. But DO NOT attempt to use only one SK as you describe. Use the source business key to get the corresponding dimension row, even if it is the 5th version of the same agent with a different number in the same system. The critical thing about the fact/dimension relationship is the fact points to the row with the business key that was contained in the fact feed from the source.
Its important to remember that, in a data warehouse, keys are used to join rows, NOT to produce metrics. Reports and metrics are bound by the attributes. Even if there are five rows in the dimension table, if the attributes are the same, the facts related to those rows will be consolidated into a single collection of metrics. Use the alternate key (standard business key) to coordinate the attributes, not to consolidate the keys.
Re: Conformed dimension design
Thanks ngalemmo
Although,
sounds like we are saying we should allow multiple versions of truth to propagate into the final dimensional model even if these multiple versions do not agree with each other - to me, they would qualify as multiple identities that need to be consolidated before they are fed to the DM - which I would like to prevent from being fed to the DM and in fact create an audit item for. Otherwise, the ETL engine would not be providing much value beyond essentially performing table joins across operational sources and cascading operational source issues, such as duplicates, into the data mart. Am I missing something?
Although,
Use the source business key to get the corresponding dimension row, even if it is the 5th version of the same agent with a different number in the same system.
sounds like we are saying we should allow multiple versions of truth to propagate into the final dimensional model even if these multiple versions do not agree with each other - to me, they would qualify as multiple identities that need to be consolidated before they are fed to the DM - which I would like to prevent from being fed to the DM and in fact create an audit item for. Otherwise, the ETL engine would not be providing much value beyond essentially performing table joins across operational sources and cascading operational source issues, such as duplicates, into the data mart. Am I missing something?
leon_panokarren- Posts : 5
Join date : 2011-05-23
Re: Conformed dimension design
It is not multiple versions of the truth... it IS the truth.
In a data warehouse a primary key is not an identity in the business sense. It is a means of associating rows.
Assume you have agent A in a system 3 times (A1, A2, A3). If you receive a transaction that is being loaded into a fact and it is assigned to agent A2, the fact should be associated with the dimension row that represents A2. You associate A1, A2 & A3 through the dimension by assuring they have common attribute values. You do not try to associate facts to a single dimension row.
Why? Because you (or the cross reference) may be wrong. Let's say everything coming in for A1, A2 and A3 are associated to row #235 in the dimension table. Later, it is discovered that A2 really is agent B. What now? How do you find facts related to A2 and how do you reassociate them with B? All the facts have the same dimension surrogate key... 235. There is only one row in the dimension table.
If, however, you maintained separate rows in the dimension table, all you would need to do is update the attributes for the A2 row, changing them to B's attributes.
In a data warehouse a primary key is not an identity in the business sense. It is a means of associating rows.
Assume you have agent A in a system 3 times (A1, A2, A3). If you receive a transaction that is being loaded into a fact and it is assigned to agent A2, the fact should be associated with the dimension row that represents A2. You associate A1, A2 & A3 through the dimension by assuring they have common attribute values. You do not try to associate facts to a single dimension row.
Why? Because you (or the cross reference) may be wrong. Let's say everything coming in for A1, A2 and A3 are associated to row #235 in the dimension table. Later, it is discovered that A2 really is agent B. What now? How do you find facts related to A2 and how do you reassociate them with B? All the facts have the same dimension surrogate key... 235. There is only one row in the dimension table.
If, however, you maintained separate rows in the dimension table, all you would need to do is update the attributes for the A2 row, changing them to B's attributes.
Re: Conformed dimension design
It is not multiple versions of the truth... it IS the truth.
I could not agree more!!! In fact, I also like the argument there - however, here's how we envisioned dealing with the situation you explained -
In our staging area, before the load to the DM, we see we have one SK corresponding to agent versions A1, A2 and A3 and since there are multiple 'live' versions of the agent for the same SK - we will create an Audit entry in our Audit repository against the single SK - describing the multiple versions we are seeing in a source - this should trigger the auditi notification and invite an immediate review
Also, since there is an open Audit item for this SK, the facts for this SK would also be held up in the Audit repository, and not released to the DM.
Since we will be missing facts/measures for this 'potential duplicate' in the DM, business will review and provide the right version (A1/A2 or A3) to use and will let the source know to perform a clean up (leaving one of A1, or A2 or A3 linked to the SK). When this happens, the Audit item would be closed and the suspended data released into the DM
There might be better ways/disadvantages of doing this - your thoughts?
leon_panokarren- Posts : 5
Join date : 2011-05-23
Similar topics
» Question On Conformed Dimension design
» Replacing Aggregate dimension with Conformed base dimension at lowest level.
» Conformed Dimension
» Conformed dimension or Degenerate dimension?
» Conformed Dimensions
» Replacing Aggregate dimension with Conformed base dimension at lowest level.
» Conformed Dimension
» Conformed dimension or Degenerate dimension?
» Conformed Dimensions
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum