History tracking in a CRM data warehouse
4 posters
Page 1 of 1
History tracking in a CRM data warehouse
Hi. I've been reading these forums for a while now, but registered specifically to see what kind of responses I could get to a problem we're having.
I work at a company with in-house expertise in data warehousing or dimensional modeling (you folks probably see a lot of posts that start out like that). The company has engaged some consultants with expertise in these areas to design a data warehouse (mart? I'm still coming to grips with the terminology) and cube which will contain combined CRM data from several disparate feeder systems. The system they have designed is actually the successor to one that was built by a previous consultant (I'll spare you the details). I've been working on the periphery of this project for about 9 months now, and whatever I've learned about DW I've picked up by osmosis. I'm the one charged with writing the ETL for the redesigned warehouse, and I'm having a few problems.
This is a fairly simple system. We have one fact table and less than a dozen dimensions. The problem I'm having seems to stem from the fact that the system is not transactional in nature. At issue are the fact table (FactOpportunity), and one of the dimensions (DimOpportunity). As you can probably guess, they represent two aspects of the same thing (the Opportunity, or Proposal). For trending and forecasting purposes, there are 3 fields we need to track changes for: 1) stage (identified, submitted, won, lost, cancelled...), 2) estimated value, and 3) the estimated probability that we will win the work. All 3 of these are subject to change over the lifetime of the opportunity.
The original consultant used a Periodic Snapshot to address the issue of history. We have not as yet implemented either trending or forecasting, so we haven't actually used it for anything, but in many ways it seems to fit our reporting needs fairly well.
When the new consultants redesigned the warehouse (they need it to feed a cube, which we don't currently have, and they said that the current warehouse wasn't suitable for that), they got rid of the Periodic Snapshot and replaced it with...well, that's what I'm trying to figure out (the person who did the redesign left the consultancy right after finishing this, and I'm trying to figure out if the people who replaced him understand what he was trying to do. He left the database schema, but not a lot of details on how we were supposed to track the history.
OK, so much for the background. Hopefully I won't run out of characters. Here are the details, and please bear with me because, as I said, we are pretty new to this.
DimOpportunity, a type 2 SCD, contains stage (one of the tracked fields). That part is not a problem. The problem is with the other two (revenue and probability), which are technically facts, and therefore live on FactOpportunity. To my novice eye, though, their behavior looks dimensional (they change over time, but a replacement value, not a delta, is passed from the feeder system). So, if probability changes from 10% to 50%, we don't have one line with 10 and one line with 40, the value is just updated in place from 10 to 50. The same goes for revenue.
If we're going to track the historical values within the fact table (not using the snapshot), it looks to me as though we need some kind of "SCD-ish-ness" in the fact table (oxymoron? heresy?). In other words, we will need to have multiple fact records, which represent the same entity (the opportunity), but showing its state for a discrete time span (opportunity X revenue $100000, probability 50% for 1/1/2011 - 2/28/2011, opportunity X revenue $100000, probability 75% for 3/1/2011 - 3/15/2011, opportunity X revenue $150000, probability 75% for 3/16/2011 - 12/31/9999).
Another problem I will run into if we do it this way is synching the changes to revenue and probability with the changes to stage (on DimOpportunity), since the 3 are essentially a set, but at this point that's a separate issue.
It seems kind of complicated to me, but that could just be me.
I read Ralph Kimball's book last fall (I don't have it with me at this moment), and the one example it gave of a system like ours used the Periodic Snapshot model (for us it's a weekly copy of the fact table, with a datestamp). If there was an alternative I don't recall seeing it.
We are a Microsoft (SQL Server, SSRS, SSAS) shop, and this whole thing is being driven by the desire of senior management to use PerformancePoint, which (so we are told) requires a cube. The consultants implied that the Periodic Snapshot was problematic in a cube (we know nothing at all about PerformancePoint or cubes, at least at this point). This is a pretty tiny system (currently ~100000 distinct opportunities).
For any of you who are still awake, can you shed any light on how we should be tracking history in this system? Should I tell them to go back to the Periodic Snapshot, or is there an alternative to that? As I said, they have yet to explain to me how the changes to revenue and probably are supposed to be tracked, which makes it kind of hard to do the ETL.
Thanks.
Evan
I work at a company with in-house expertise in data warehousing or dimensional modeling (you folks probably see a lot of posts that start out like that). The company has engaged some consultants with expertise in these areas to design a data warehouse (mart? I'm still coming to grips with the terminology) and cube which will contain combined CRM data from several disparate feeder systems. The system they have designed is actually the successor to one that was built by a previous consultant (I'll spare you the details). I've been working on the periphery of this project for about 9 months now, and whatever I've learned about DW I've picked up by osmosis. I'm the one charged with writing the ETL for the redesigned warehouse, and I'm having a few problems.
This is a fairly simple system. We have one fact table and less than a dozen dimensions. The problem I'm having seems to stem from the fact that the system is not transactional in nature. At issue are the fact table (FactOpportunity), and one of the dimensions (DimOpportunity). As you can probably guess, they represent two aspects of the same thing (the Opportunity, or Proposal). For trending and forecasting purposes, there are 3 fields we need to track changes for: 1) stage (identified, submitted, won, lost, cancelled...), 2) estimated value, and 3) the estimated probability that we will win the work. All 3 of these are subject to change over the lifetime of the opportunity.
The original consultant used a Periodic Snapshot to address the issue of history. We have not as yet implemented either trending or forecasting, so we haven't actually used it for anything, but in many ways it seems to fit our reporting needs fairly well.
When the new consultants redesigned the warehouse (they need it to feed a cube, which we don't currently have, and they said that the current warehouse wasn't suitable for that), they got rid of the Periodic Snapshot and replaced it with...well, that's what I'm trying to figure out (the person who did the redesign left the consultancy right after finishing this, and I'm trying to figure out if the people who replaced him understand what he was trying to do. He left the database schema, but not a lot of details on how we were supposed to track the history.
OK, so much for the background. Hopefully I won't run out of characters. Here are the details, and please bear with me because, as I said, we are pretty new to this.
DimOpportunity, a type 2 SCD, contains stage (one of the tracked fields). That part is not a problem. The problem is with the other two (revenue and probability), which are technically facts, and therefore live on FactOpportunity. To my novice eye, though, their behavior looks dimensional (they change over time, but a replacement value, not a delta, is passed from the feeder system). So, if probability changes from 10% to 50%, we don't have one line with 10 and one line with 40, the value is just updated in place from 10 to 50. The same goes for revenue.
If we're going to track the historical values within the fact table (not using the snapshot), it looks to me as though we need some kind of "SCD-ish-ness" in the fact table (oxymoron? heresy?). In other words, we will need to have multiple fact records, which represent the same entity (the opportunity), but showing its state for a discrete time span (opportunity X revenue $100000, probability 50% for 1/1/2011 - 2/28/2011, opportunity X revenue $100000, probability 75% for 3/1/2011 - 3/15/2011, opportunity X revenue $150000, probability 75% for 3/16/2011 - 12/31/9999).
Another problem I will run into if we do it this way is synching the changes to revenue and probability with the changes to stage (on DimOpportunity), since the 3 are essentially a set, but at this point that's a separate issue.
It seems kind of complicated to me, but that could just be me.
I read Ralph Kimball's book last fall (I don't have it with me at this moment), and the one example it gave of a system like ours used the Periodic Snapshot model (for us it's a weekly copy of the fact table, with a datestamp). If there was an alternative I don't recall seeing it.
We are a Microsoft (SQL Server, SSRS, SSAS) shop, and this whole thing is being driven by the desire of senior management to use PerformancePoint, which (so we are told) requires a cube. The consultants implied that the Periodic Snapshot was problematic in a cube (we know nothing at all about PerformancePoint or cubes, at least at this point). This is a pretty tiny system (currently ~100000 distinct opportunities).
For any of you who are still awake, can you shed any light on how we should be tracking history in this system? Should I tell them to go back to the Periodic Snapshot, or is there an alternative to that? As I said, they have yet to explain to me how the changes to revenue and probably are supposed to be tracked, which makes it kind of hard to do the ETL.
Thanks.
Evan
EvanJones- Posts : 4
Join date : 2011-07-09
Location : Southern NH
Re: History tracking in a CRM data warehouse
From what I can see, periodic snapshot fact is a perfect model to feed the cube, especially for your modest size of data volume. I think the person who initiated the fact table might have cube in his sight but might not have the time, budget or for some technical/political reason to implement it. I guess the attempt to ditch the concept of using periodic fact table for BI system in general, and cube in particular, is due to ignorance.
As far as change tracking is concerned, there are a couple of options. Option 1 is to use accumulating snapshot fact table to consolidate your facts that share the same dimensionality to keep the different value changes in separate columns instead of records assuming the life cycle is deterministic.
Option 2 is to have a transaction dimension to track the changes for all your SCD-ish-ness factual attributes to provide a change tracking mechanism in a format more compact than the straight and simplistic periodic snapshot fact table. However it's really a trade off between space saving and simplicity in ETL, and the decision comes down to the size of the eventual periodic fact table.
However the end fact table for the cube should be periodic snapshot fact table which can provide convenient and faster factual base for trend analysis, although both accumulating snapshot fact and transaction dimension can facilitate powerful analysis possibilities on their own rights.
As far as change tracking is concerned, there are a couple of options. Option 1 is to use accumulating snapshot fact table to consolidate your facts that share the same dimensionality to keep the different value changes in separate columns instead of records assuming the life cycle is deterministic.
Option 2 is to have a transaction dimension to track the changes for all your SCD-ish-ness factual attributes to provide a change tracking mechanism in a format more compact than the straight and simplistic periodic snapshot fact table. However it's really a trade off between space saving and simplicity in ETL, and the decision comes down to the size of the eventual periodic fact table.
However the end fact table for the cube should be periodic snapshot fact table which can provide convenient and faster factual base for trend analysis, although both accumulating snapshot fact and transaction dimension can facilitate powerful analysis possibilities on their own rights.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: History tracking in a CRM data warehouse
Evan, your write-up is exceptionally clear!
SSAS should support a periodic snapshot just fine. There are aggregations that SUM over the other dimensions except time which make sense for a periodic snapshot. The LastChild Aggregate Function might be appropriate for your 3 measure fields. This would cause SSAS to return the most recent value for whatever time period is being analyzed.
However, I don't think SSAS is going to know what to do with a "SCD2-ish" fact table. The primary benefit is space savings over a periodic snapshot, but it comes with the complexity of looking at BeginEffectiveDate and EndEffectiveDate on every query against that fact table. Ralph Kimball does present a form of this he calls "Instantaneous Balances" where you want to be able to query an account balance at any point in time (i.e. throughout the day). Obviously it would be impractical to take a periodic snapshot every second, so this case storing a BeginEffectiveTimestamp and EndEffectiveTimestamp for each record makes sense.
I have used SCD2-ish fact tables in my relational DW, but I have usually ended up using a view (joining on every day in the date dimension) to expose it as if were a periodic snapshot. This has worked fine for my very sparse fact table (price changes, which are very infrequent) but might be a performance issue for a larger data set.
A daily periodic snapshot would be the "standard" way to handle this in a fact table.
Of course you might also conclude that what you have are SCD2 dimension attributes! One of the areas where Dimensional Modeling seems to be more art than science is when something looks both like a fact measure and like a dimension attribute! The best way to resolve this dilemmas is to work backwards--determine what reports do users want out of the system and then determine what design best meets that need.
Good luck!
SSAS should support a periodic snapshot just fine. There are aggregations that SUM over the other dimensions except time which make sense for a periodic snapshot. The LastChild Aggregate Function might be appropriate for your 3 measure fields. This would cause SSAS to return the most recent value for whatever time period is being analyzed.
However, I don't think SSAS is going to know what to do with a "SCD2-ish" fact table. The primary benefit is space savings over a periodic snapshot, but it comes with the complexity of looking at BeginEffectiveDate and EndEffectiveDate on every query against that fact table. Ralph Kimball does present a form of this he calls "Instantaneous Balances" where you want to be able to query an account balance at any point in time (i.e. throughout the day). Obviously it would be impractical to take a periodic snapshot every second, so this case storing a BeginEffectiveTimestamp and EndEffectiveTimestamp for each record makes sense.
I have used SCD2-ish fact tables in my relational DW, but I have usually ended up using a view (joining on every day in the date dimension) to expose it as if were a periodic snapshot. This has worked fine for my very sparse fact table (price changes, which are very infrequent) but might be a performance issue for a larger data set.
A daily periodic snapshot would be the "standard" way to handle this in a fact table.
Of course you might also conclude that what you have are SCD2 dimension attributes! One of the areas where Dimensional Modeling seems to be more art than science is when something looks both like a fact measure and like a dimension attribute! The best way to resolve this dilemmas is to work backwards--determine what reports do users want out of the system and then determine what design best meets that need.
Good luck!
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: History tracking in a CRM data warehouse
Thanks for the responses. What we seem to have ended up with (I am on the receiving end of the design) is a situation where we have a dimension (DimOpportunity) and a fact table (FactOpportunity) which are joined together at the hip (there is an enforced 1:1 relationship between the two, so that any dimension key is referenced only once, by a distinct fact record. The time-sensitiveness of the different fact records is driven off of the dimension. I have no reason to think that this won't work, but it seems goofy to me. If I'm going to have a strict 1:1, why do I need two tables? Just to make the ETL harder to do? As it is now I have to generate what are essentially duplicate new records in either the fact or the dimension if something in the other one changes, just to maintain the 1:1. So, if stage (which is the only SCD-generating field on DimOpportunity) changes, but none of the tracked fact columns (revenue, probability) does, I have to write a new dimension record (to show the change in stage), plus a new fact record which is identical to the previous one, except for the fact that it points at a different dimension record than the previous one did. The reverse is also true (revenue or probability change, but stage does not). Assuming that this is a valid way to do it, I feel like I should have one single physical table overlayed by a fact view and a dimension view which include only the fields which logically belong in one or the other, since the logic involved in synching the two, while not super complicated, seems more complex than it really needs to be. I'm a support guy at heart. All other things being equal, I like things as simple as possible.
I guess what that all boils down to is: has anyone ever heard of this 1:1 fact/dimension model? If so, is there a term for it? Would views be a better way of dealing with it, or is there some reason to avoid them? I'm stuck with the design I was given, but going forward I want to try to understand this better than I currently do.
Thanks.
I guess what that all boils down to is: has anyone ever heard of this 1:1 fact/dimension model? If so, is there a term for it? Would views be a better way of dealing with it, or is there some reason to avoid them? I'm stuck with the design I was given, but going forward I want to try to understand this better than I currently do.
Thanks.
EvanJones- Posts : 4
Join date : 2011-07-09
Location : Southern NH
Re: History tracking in a CRM data warehouse
I guess what that all boils down to is: has anyone ever heard of this 1:1 fact/dimension model? If so, is there a term for it? Would views be a better way of dealing with it, or is there some reason to avoid them? I'm stuck with the design I was given, but going forward I want to try to understand this better than I currently do.
Yeah, I've heard of it. The term commonly used to describe it is 'bad dimensional design'.
The way to resolve it is examine the attributes in the opportunity dimension and reduce them to multiple dimensions... things like 'prospect/customer', 'stage', dates with various roles and so on. This is not unlike modeling sales orders where a proper model usually does not have an Order dimension table.
A 1:1 relationship performs very poorly and misses the objective of dimensional design: to produce relatively small dimensions that conform across multiple facts.
Similar topics
» Data Warehouse for mortgage tracking - need some advice
» Relationship between a history tracking table and a non-history tracking table?
» Tracking history in huge hierarchies
» Tracking fact table history
» Tracking history of multiple SCD type 2 attributes
» Relationship between a history tracking table and a non-history tracking table?
» Tracking history in huge hierarchies
» Tracking fact table history
» Tracking history of multiple SCD type 2 attributes
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum