Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
2 posters
Page 1 of 1
Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Requirement is to tie a customer service agent to transactional data (purchases/cancelations) and various other fact tables related customer operations (Cases, Calls). New conformed dimension for agents which will be widely used.
Limitations: Agent is not captured explicitly. Needs to be derived for each fact based on login information for each of the various tools. Application which manages agent logins has issues. Agents may not be associated with a login prior to handling transactions or calls. Agents may be incorrectly associated to login. Same login can be assigned to a different agent over time. Single agent has multiple logins.
Always have login information, but can’t always tie this to an agent at the time of fact table load. It is a business requirement to track that the agent was unknown at the time of the transaction and not lose this information when the agent has been determined at a later time.
Additional requirements: agent is type 3 (track current and historical values for key attributes). Login is also type 3 – needed for audit purposes. Track that the agent associated with the login was initially unknown. Because of the issues associated with late-arriving agent information and changes with agents associated to login data, both the agent and the login must be tracked with information relevant to the time of the transaction in addition to being able to report across history with current agent information.
Design alternatives:
Alternative 1:
Given that we always have login information, create a login dimension and associate it with each fact table. Update only the dimension when changes occur rather than update all the fact tables with the correct agent. Login is treated as type 3. Track Current/Historical agent in order to satisfy the requirement for tracking that agent was unknown at the time of the transaction and correctly determined at a later point in time.
Agent dimension is type 3 – not tied directly to fact tables but connected as snowflake to the login table. If using this option, then what is the best way to snowflake: a) Use the natural key (agent ID) to avoid dependency between 2 slowly changing dimensions. Current flag can be used to get the most recent agent record, or effective dates to get agent record as it existed at the time of the transaction. Or b) snowflake on the agent key, which ties the dimension records directly, but requires new rows to be created in the login dimension when an agent record is updated.
Alternative 2
Fold agent data into the login dimension. This would include several type 3 attributes for agent data. Agent can have a large number of logins across various agent applications, so there would be a lot of redundancy of agent data across the logins. This avoids the snowflake complexity but makes for a very large table given that agent data is slowly changing. One agent change could result in 20 more rows - one for each agent login.
Alternative 3
Tie agent data directly to the fact record with one surrogate key that remains unchanged (allows for tracking of unknown/incorrect agent) and a second which is updated with the appropriate reference to an agent. Unfortunately results in having to update rows in quite a number records across many different fact tables (impacts to EDW SLAs and a larger effort than updating dimension data). There are still audit requirements around the login information so a snowflake to a login dimension (either via natural or surrogate keys) is still needed to be able to create the picture of which login was used at the time of a transaction and what the agent data looked like (e.g. agent role).
Would appreciate any feedback/design suggestions to handle the complexity.
Limitations: Agent is not captured explicitly. Needs to be derived for each fact based on login information for each of the various tools. Application which manages agent logins has issues. Agents may not be associated with a login prior to handling transactions or calls. Agents may be incorrectly associated to login. Same login can be assigned to a different agent over time. Single agent has multiple logins.
Always have login information, but can’t always tie this to an agent at the time of fact table load. It is a business requirement to track that the agent was unknown at the time of the transaction and not lose this information when the agent has been determined at a later time.
Additional requirements: agent is type 3 (track current and historical values for key attributes). Login is also type 3 – needed for audit purposes. Track that the agent associated with the login was initially unknown. Because of the issues associated with late-arriving agent information and changes with agents associated to login data, both the agent and the login must be tracked with information relevant to the time of the transaction in addition to being able to report across history with current agent information.
Design alternatives:
Alternative 1:
Given that we always have login information, create a login dimension and associate it with each fact table. Update only the dimension when changes occur rather than update all the fact tables with the correct agent. Login is treated as type 3. Track Current/Historical agent in order to satisfy the requirement for tracking that agent was unknown at the time of the transaction and correctly determined at a later point in time.
Agent dimension is type 3 – not tied directly to fact tables but connected as snowflake to the login table. If using this option, then what is the best way to snowflake: a) Use the natural key (agent ID) to avoid dependency between 2 slowly changing dimensions. Current flag can be used to get the most recent agent record, or effective dates to get agent record as it existed at the time of the transaction. Or b) snowflake on the agent key, which ties the dimension records directly, but requires new rows to be created in the login dimension when an agent record is updated.
Alternative 2
Fold agent data into the login dimension. This would include several type 3 attributes for agent data. Agent can have a large number of logins across various agent applications, so there would be a lot of redundancy of agent data across the logins. This avoids the snowflake complexity but makes for a very large table given that agent data is slowly changing. One agent change could result in 20 more rows - one for each agent login.
Alternative 3
Tie agent data directly to the fact record with one surrogate key that remains unchanged (allows for tracking of unknown/incorrect agent) and a second which is updated with the appropriate reference to an agent. Unfortunately results in having to update rows in quite a number records across many different fact tables (impacts to EDW SLAs and a larger effort than updating dimension data). There are still audit requirements around the login information so a snowflake to a login dimension (either via natural or surrogate keys) is still needed to be able to create the picture of which login was used at the time of a transaction and what the agent data looked like (e.g. agent role).
Would appreciate any feedback/design suggestions to handle the complexity.
mkebbe- Posts : 6
Join date : 2011-08-23
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
I would go with a combination of 1 & 2 if agent is a dimension used elsewhere. Otherwise I would go with 2. Although I don't understand why it would result in a very large table because of changes... you mention implementing it as a type 3, which does not create new rows when changes occur.
I would not snowflake. Having agent attributes in the login dimension is more than enough to provide conformance with an agent dimension if you implement it. Conformance is not about keys, it is about attribute values.
I would not snowflake. Having agent attributes in the login dimension is more than enough to provide conformance with an agent dimension if you implement it. Conformance is not about keys, it is about attribute values.
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Thank you for your feedback regarding my issue. While I'd like to go with option #2, upon further consideration, it will not satisfy the requirement to track that a login was initally associated with an unknown agent. Users want to know why their report would have changed and it has the potential to impact things like agent commissions.
Given the decision not to update several fact tables to correct the agent data, we definitely have no choice but to associate the agent login dimension directly with the each fact. The question is what to do about the type 3 agent data. Type 3 will create the same number of rows as type 2 (new rows to capture change, with additional attributes that maintain the current data across all the rows for a given agent). Considering a snowflake between the agent login and an agent dimension using the Agent Id rather than the agent surrogate key. This would avoid the need to create a large number of additional agent login rows every time an agent record is updated. There are requirements to track historical information for agent and login at the time of the fact occurrence in addition to being able to get all transaction history for a particular agent using their most current information. Could get the current information by linking dimensions via the natural key and the current flag of the agent. Historical - of which there are less requirements and mostly for audit/trouble-shooting purposes - would need to use the row effective dates.
If you have any additional thoughts/suggestions, I'd be happy to hear them. Thank you for your consideration of this issue.
Given the decision not to update several fact tables to correct the agent data, we definitely have no choice but to associate the agent login dimension directly with the each fact. The question is what to do about the type 3 agent data. Type 3 will create the same number of rows as type 2 (new rows to capture change, with additional attributes that maintain the current data across all the rows for a given agent). Considering a snowflake between the agent login and an agent dimension using the Agent Id rather than the agent surrogate key. This would avoid the need to create a large number of additional agent login rows every time an agent record is updated. There are requirements to track historical information for agent and login at the time of the fact occurrence in addition to being able to get all transaction history for a particular agent using their most current information. Could get the current information by linking dimensions via the natural key and the current flag of the agent. Historical - of which there are less requirements and mostly for audit/trouble-shooting purposes - would need to use the row effective dates.
If you have any additional thoughts/suggestions, I'd be happy to hear them. Thank you for your consideration of this issue.
mkebbe- Posts : 6
Join date : 2011-08-23
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Type 3 will create the same number of rows as type 2 (new rows to capture change, with additional attributes that maintain the current data across all the rows for a given agent).
No. A type 3 has the same number of rows as a type 1. A type 3 maintains versions of attribute values using recurring columns (current value, previous value, previous previous value, etc...). History is maintained by rolling values across the recurring columns when the current value is updated. History is limited to the number of generations (recurring columns) defined for a value. If you are creating new rows when changes occur it is a type 2 dimension.
Not sure why you need a type 3 (or 2 for that matter) for the login dimension. It would seem to me there are only two possible conditions: you know who the agent is or you don't. It seems you could simply record when you indentified the agent using a date column in the dimension. What was known before is superfluous, as you didn't know anything about the login.
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Yes. You are correct. I should have said hybrid 2/3 dimension as there are requirements to track historical data associated with the fact along with current values, expecially for key agent attributes.
Interesting idea regarding the tracking of the date the agent was identified. That would work if we only had to deal with logins that had unknown agents for a period of time. Unfortunately, we also have the use case of a login being incorrectly identified with an agent and we need to know that information as well.
Been reading the threads about self-referencing queries on the dimension to get the current value, but it seems more complex for users than the hybrid approach. Will give this some further consideration and discuss with the other members of the team.
Thanks again for your time and feedback. Much appreciated!
Interesting idea regarding the tracking of the date the agent was identified. That would work if we only had to deal with logins that had unknown agents for a period of time. Unfortunately, we also have the use case of a login being incorrectly identified with an agent and we need to know that information as well.
Been reading the threads about self-referencing queries on the dimension to get the current value, but it seems more complex for users than the hybrid approach. Will give this some further consideration and discuss with the other members of the team.
Thanks again for your time and feedback. Much appreciated!
mkebbe- Posts : 6
Join date : 2011-08-23
One last thing Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Sorry, there was one last thing I wanted to mention/ask about. It is quite clear the problems and complexity that a snowflake can cause and typically avoid them. In considering the addition of the agent attributes to the login table, the concern is that there is one agent application which contains over 8,000 agents and the average logins per agent is 30. Given that agent is a type 2/3 hybrid dimension, this would result in a large number of rows for the same agent. Perhaps in the grand scheme of things, that's still not bad for a dimension (it's not like customer, with millions of rows) but that is the concern of collapsing the agent data into the login dimension.
mkebbe- Posts : 6
Join date : 2011-08-23
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
mkebbe wrote:Sorry, there was one last thing I wanted to mention/ask about. It is quite clear the problems and complexity that a snowflake can cause and typically avoid them. In considering the addition of the agent attributes to the login table, the concern is that there is one agent application which contains over 8,000 agents and the average logins per agent is 30. Given that agent is a type 2/3 hybrid dimension, this would result in a large number of rows for the same agent. Perhaps in the grand scheme of things, that's still not bad for a dimension (it's not like customer, with millions of rows) but that is the concern of collapsing the agent data into the login dimension.
An average of 30 different logins for an agent??? I feel sorry for the agents. But, its still not that large of a dimension.
As far as the 2/3 thing goes... you can avoid duplicating attributes and just have a type 2 with two keys: the normal PK and a constant alternate key. The first allows retrieving point in time attributes while the second allows retriving current attributes. The alternate key is set to the PK value when the first dimension row is created for a particular natural key. When new versions are created, its value is propagated to the new rows, with all versions having the same alternate key value. You store both keys on the fact table. If you need current data you join using the alternate key and filter on the current flag.
I wouldn't be concerned with collapsing data. As long as the attribute values are consistent, there should not be an issue.
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
Yes, it is crazy that an agent can have that many logins and that is only one of the many applications. That one happens to be point of sale specific. Fortunately, they are moving towards single sign-on for agents which should relieve this issue.
The idea of folding the agent data into the login dimension makes sense as well as your suggestion for the alternate key - though given that the agent id is an integer, we could probably just include the natural key in the fact. The only problem remaining however, is how to track the agent changes for a login. The agent associated to a login may initially be unknown or incorrect. We have requirements to track this information, which the snowflake with agent allowed us to do using two different keys associated with the separate agent dim (one for the one assigned to the login initially and another to contain an updated key that is the correct one). I don't mean to push my luck here with your time on this matter, but I don't know how to solve for this last requirement. Would not want two sets of agent data in the same record. The two keys allowed users to get to the agent originally associated to a login.
The idea of folding the agent data into the login dimension makes sense as well as your suggestion for the alternate key - though given that the agent id is an integer, we could probably just include the natural key in the fact. The only problem remaining however, is how to track the agent changes for a login. The agent associated to a login may initially be unknown or incorrect. We have requirements to track this information, which the snowflake with agent allowed us to do using two different keys associated with the separate agent dim (one for the one assigned to the login initially and another to contain an updated key that is the correct one). I don't mean to push my luck here with your time on this matter, but I don't know how to solve for this last requirement. Would not want two sets of agent data in the same record. The two keys allowed users to get to the agent originally associated to a login.
mkebbe- Posts : 6
Join date : 2011-08-23
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
If you implement login as a type 2 you would have full history of changes relating to the login/agent relationship.
By the way, NEVER use natural keys as a FK to dimensions. (There are very few absolutes in dimensional design... this is one of them.)
By the way, NEVER use natural keys as a FK to dimensions. (There are very few absolutes in dimensional design... this is one of them.)
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
ngalemmo wrote:
By the way, NEVER use natural keys as a FK to dimensions. (There are very few absolutes in dimensional design... this is one of them.)
Just to clarify, the natural key would not be the primary key of the dimension, but it could be included as a secondary attribute in the fact as mentioned in Kimball's article: Slowly Changing Dimensions Are Not As Easy as 1, 2, 3 (as a new member, I'm not allowed to post the link)
Type 2 with Natural Keys in the Fact Table
If you have a million-row dimension table with many attributes requiring historical and current tracking, the last technique we described becomes overly burdensome. In this situation, consider including the dimension natural key as a fact table foreign key, in addition to the surrogate key for type 2 tracking. This technique gives you essentially two dimension tables associated with the facts, but for good reason. The type 2 dimension has historically accurate attributes for filtering or grouping based on the effective values when the fact table was loaded. The dimension natural key joins to a table with just the current type 1 values.
Thanks again for your time and feedback. Much appreciated.
mkebbe- Posts : 6
Join date : 2011-08-23
Re: Design Alternatives for Late Arriving/Updated Type 2 Dimension Data
The idea of two keys to a type 2 dimension has been discussed before in this forum, but never where one of the keys was a natural key.
I strongly disagree with Ralph on this point. Both keys should be surrogate keys. The second (alternate) surrogate key essentially mimics the natural key but provides isolation from the natural key should such things change. Basically, the alternate key is set to the PK value when the first row is created for a NK. In subsequent versions of the row, the alternate key value remains the same. You populate the fact with both the PK (point in time) and alternate key (get current).
I strongly disagree with Ralph on this point. Both keys should be surrogate keys. The second (alternate) surrogate key essentially mimics the natural key but provides isolation from the natural key should such things change. Basically, the alternate key is set to the PK value when the first row is created for a NK. In subsequent versions of the row, the alternate key value remains the same. You populate the fact with both the PK (point in time) and alternate key (get current).
Similar topics
» Late Arriving Dimension Data
» Optimal SCD type 2 dimension design
» alternate approaches for late arriving dimension attributes
» type 1 dimension - new requirements for attributes that will be updated often
» Late Arriving Facts
» Optimal SCD type 2 dimension design
» alternate approaches for late arriving dimension attributes
» type 1 dimension - new requirements for attributes that will be updated often
» Late Arriving Facts
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum