Time-Stamped transaction tracking not suitable, gets too big
Hi
I am looking for a modelling solution to the following problem:
My base data is a large number of residents, each with a large set of (clearly defined) attributes (name, gender, address, job, name of father, ...). As residents move, marry, die, and so on, their attribute data changes. These changes happen quite frequently, but with each change usually only a small set of attributes changes. I need to answer questions about both the present and any point back in time.
Basically, this is the same thing as the time-stamped transaction tracking dimension approach that Kimball describes for employee transactions. However, due to the large number of people (residents instead of employees) and the size of each resident entry, I know that this approach will not work, even though the kind of questions that can be answered by this model seems suitable.
What are my options here?
Thanks,
Toby
Toby- Posts : 3
Join date : 2011-10-03
Re: Time-Stamped transaction tracking not suitable, gets too big
I'm sure there are frequent updates overall, but I can't imagine any one person is updated very frequently. Most people don't move that often, and you only die once (last time I checked). Why do you think the approach won't work?
Re: Time-Stamped transaction tracking not suitable, gets too big
OK, so I did the maths:
A person has about 3 KB worth of data that cannot easily be left out (attributes like 'name of father', which nobody will ever use in a query, are already removed). A lot of Unicode varchar fields are responsible for this size.
I start with 700'000 people in the largest area.
A typical week has about 10'000 updates in that area.
This leads to:
2GB worth of data to start with (700'000*3/1024/1024)
1.5GB worth of updates for a year (52*10000*3/1024/1024)
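The arithmetic above can be checked with a short script (the 3 KB row size and the volumes are the figures from this post; everything else follows from them):

```python
# Back-of-the-envelope sizing for the time-stamped resident dimension.
# Assumes ~3 KB per resident row, 700,000 residents, 10,000 updates/week.
ROW_KB = 3
residents = 700_000
updates_per_week = 10_000

initial_gb = residents * ROW_KB / 1024 / 1024
yearly_growth_gb = 52 * updates_per_week * ROW_KB / 1024 / 1024

print(f"initial load: {initial_gb:.2f} GB")        # ~2.00 GB
print(f"growth/year:  {yearly_growth_gb:.2f} GB")  # ~1.49 GB
```

So even after a decade the raw data stays well under 20 GB, which supports the point that storage itself is not the bottleneck.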
I agree that it's manageable in terms of data storage. Will it be in terms of query performance? People plan to use PowerPivot to analyze the data, which I guess will push their client machines to the limit (I don't know much about PowerPivot yet).
Thanks,
Toby
Re: Time-Stamped transaction tracking not suitable, gets too big
A few points:
1. PowerPivot has a great compression engine, but if your values are mostly distinct you won't get much gain.
2. PowerPivot has an effective limit of 2 GB. PowerPivot itself does not impose this limit, but SharePoint does, which means you cannot share workbooks larger than that through PowerPivot for SharePoint.
3. You may want to consider using a demographics mini-dimension if these attributes change often, and link this dimension into your fact table.
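For readers unfamiliar with the mini-dimension pattern: the idea is to pull the fast-changing attributes out of the big resident dimension into a small dimension keyed by each distinct combination of values, so the 3 KB resident row is never versioned. A minimal sketch, assuming illustrative attribute names (marital status, employment) that are not from the original posts:

```python
# Demographics mini-dimension sketch (Kimball-style).
# One row per distinct combination of volatile attributes;
# change events become fact rows pointing at the mini-dimension.

demographics_rows = {}  # (marital_status, employment) -> surrogate key

def demographics_key(marital_status: str, employment: str) -> int:
    """Return the surrogate key for a combination, creating it on first use."""
    combo = (marital_status, employment)
    if combo not in demographics_rows:
        demographics_rows[combo] = len(demographics_rows) + 1
    return demographics_rows[combo]

# Two change events for the same resident: only a tiny fact row is added,
# not a new 3 KB copy of the resident.
fact = [
    {"resident_key": 42, "date_key": 20111003,
     "demographics_key": demographics_key("single", "employed")},
    {"resident_key": 42, "date_key": 20120101,
     "demographics_key": demographics_key("married", "employed")},
]
```

Because the mini-dimension holds only combinations that actually occur, it typically stays tiny even when the resident population is large.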
Re: Time-Stamped transaction tracking not suitable, gets too big
Hi Toby,
I'm curious about your statement: "700'000 people in the largest area". How many "areas" do you have to serve? If you have more than one or two, you may find the PowerPivot architecture a bit frustrating. AFAIK, it doesn't offer any built-in support for cloning a model for different data slices, so maintenance and support of your models could get out of hand.
You might be better served building an SSAS model that covers all the "areas" in one pass, with a single model. Dimension security (on an "Areas" dimension) could split it for user consumption. The end-user interface could be Excel PivotTables, optionally published to Excel Web Services in SharePoint, giving the same experience as PowerPivot (actually a little better).
Good luck!
Mike
Re: Time-Stamped transaction tracking not suitable, gets too big
@John:
Thanks for this info. The compression engine will help a lot, since many of the attributes are enum-like. The client currently prefers not to use SharePoint, so for now at least the 2 GB file size limit will not be a problem. I'll think about the demographics mini-dimension; it seems to be a good approach. A lot of questions will be comparisons between different geographic locations, for instance.
@Mike:
Thanks for this warning. In my case, one area = one customer, and the largest of all installations will be 700'000 people. There currently is one customer, with a good possibility of up to 10 others. These customers have quite similar needs. There is, however, no combined data available; each customer runs their own database and their own clients with PowerPivot installed.
What I expect to happen is that I copy and paste the first few PowerPivot reports between customers (adjusting headers and such), and each customer will then manage their reports themselves. I will only be called upon for new requirements that need the warehouse/ETL to be adjusted or that require deeper PowerPivot knowledge (which I don't have yet).