What is the right granularity of time?
2 posters
Page 1 of 1
What is the right granularity of time?
Hi again,
We have a subscription system over here where our customers are able to retrieve informational data from. Depending on their contract some customers have to order subscriptions in advance. Some of them may correct it to the actual number used retroactive. Others may just use the informational data and report their subscription retrospective afterwards. To make it even more complicated, also based on contract, subscription fees may be calculated on a daily, monthly or quarterly base.
Since the heterogeneity of the many customers data processing systems and the easy duplication and redistribution of informational data there never was an effort for a fine-grained fool-proof data usage tracking system. The subscription fees are solely based on what our customers reports. To avoid misapplication there are on-site audits checking how the processing systems handle our data. We are then calculating usage numbers from our audit and compare it to what has been reported.
My question now is: What is the right granularity for the time dimension and how should I model it? The lowest grain is to store each subscription on a day-by-day base with a date foreign key. But since most of the subscriptions are for several month or even years, it would really explode the fact table with dense and redundant data. E.g. a 36 month period of monthly subscriptions from customer C for product P from 01.01.2009 till 31.12.2011 would result into 1095 day-by-day entries. Having more than 25 million customer-product combinations already a three years period could easily lead to more than 100 billion records. On the other hand I could model it with two date foreign keys e.g. subscription_start and subscription_end resulting in maybe 100 million records.
My worry is, that using subscription periods instead of day by day entries will make comparison of the reported data with the audit results much harder.
Hope you can share your experience with me on this. Kind regards,
Bergtroll
We have a subscription system over here where our customers are able to retrieve informational data from. Depending on their contract some customers have to order subscriptions in advance. Some of them may correct it to the actual number used retroactive. Others may just use the informational data and report their subscription retrospective afterwards. To make it even more complicated, also based on contract, subscription fees may be calculated on a daily, monthly or quarterly base.
Since the heterogeneity of the many customers data processing systems and the easy duplication and redistribution of informational data there never was an effort for a fine-grained fool-proof data usage tracking system. The subscription fees are solely based on what our customers reports. To avoid misapplication there are on-site audits checking how the processing systems handle our data. We are then calculating usage numbers from our audit and compare it to what has been reported.
My question now is: What is the right granularity for the time dimension and how should I model it? The lowest grain is to store each subscription on a day-by-day base with a date foreign key. But since most of the subscriptions are for several month or even years, it would really explode the fact table with dense and redundant data. E.g. a 36 month period of monthly subscriptions from customer C for product P from 01.01.2009 till 31.12.2011 would result into 1095 day-by-day entries. Having more than 25 million customer-product combinations already a three years period could easily lead to more than 100 billion records. On the other hand I could model it with two date foreign keys e.g. subscription_start and subscription_end resulting in maybe 100 million records.
My worry is, that using subscription periods instead of day by day entries will make comparison of the reported data with the audit results much harder.
Hope you can share your experience with me on this. Kind regards,
Bergtroll
Bergtroll- Posts : 15
Join date : 2011-02-02
Re: What is the right granularity of time?
P.S. Maybe its a good idea to store the periods and join a day-by-day table only occasionally on reporting purpose?
Bergtroll- Posts : 15
Join date : 2011-02-02
Re: What is the right granularity of time?
I forgot a very important constraint. The reporting for a certain customer, product and date has to be unambiguous. Therefore no overlap between two subscription periods is allowed. This should be validated in the cleaning step but is by far not as easy as checking a set of row attributes to me unique. I worry that this will be a big performance impact in the data preparation process.
Bergtroll- Posts : 15
Join date : 2011-02-02
Re: What is the right granularity of time?
Use an accumulating snapshot fact table. There is one row per subscription with a start and end date. Adding a row for every subscription for every customer is not a good idea.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: What is the right granularity of time?
Thank you, I have the feeling that this could totally fit our needs. Regarding your tip I read the following very helpful articles.
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2002/KimballDT37ModelingPipeline.pdf
and
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2003/KimballDT42Combining.pdf
Now I only have to find out how to transform our month based reporting into ongoing subscriptions :-). If I find out this should be a nice topic for a kettle tutorial!
Best regards,
Bergtroll
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2002/KimballDT37ModelingPipeline.pdf
and
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2003/KimballDT42Combining.pdf
Now I only have to find out how to transform our month based reporting into ongoing subscriptions :-). If I find out this should be a nice topic for a kettle tutorial!
Best regards,
Bergtroll
Bergtroll- Posts : 15
Join date : 2011-02-02
Re: What is the right granularity of time?
Month based reporting is easy. Create a metric (0 or 1) that indicates whether the subscription is currently active and sum on that column. To see how many were active last month, you would need to use your date dimensions.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Similar topics
» Time dimension design for cross time zone and custom calendar warehouse, 1min granularity
» Time granularity not fixed
» Combined Date Time dimension vs. Separate Day and Time dimensions and timezones
» Time Dimension, NULLs and Time datatype
» Model Design With Several 0 to Many Joins?
» Time granularity not fixed
» Combined Date Time dimension vs. Separate Day and Time dimensions and timezones
» Time Dimension, NULLs and Time datatype
» Model Design With Several 0 to Many Joins?
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum