Potential 20+ Billion Row Fact Table
As part of our current Data Warehouse project, one of the fact tables we're planning to create is a daily snapshot of all our customer accounts (10 million+).
Now 10 million records snapshotted daily ≈ 3.65 billion records a year (and based on the width of the fact table that equates to about 130GB of data a year, excluding indexes), and we have 6 years' worth of data we want to load, so that equates to about 21.9 billion records and therefore ~1TB!
The DBAs were a bit shocked by these figures for a single table and were asking why we couldn't use effective from/to dates for the fact table, as obviously not all the accounts are going to change each and every day, so there will be a lot of duplication.
I just wondered what others have done with data on this scale. I've got to admit this is bigger than previous implementations I've worked on.
cheers
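For what it's worth, the arithmetic above checks out; a quick back-of-envelope script (using only the figures quoted in the post, with the implied row width of roughly 38 bytes derived from them) makes the scale concrete:

```python
# Back-of-envelope sizing for the daily snapshot fact table.
rows_per_day = 10_000_000           # 10M+ customer accounts
days_per_year = 365

rows_per_year = rows_per_day * days_per_year           # 3.65 billion
rows_6_years = rows_per_year * 6                       # ~21.9 billion

gb_per_year = 130                                      # stated data size, excluding indexes
bytes_per_row = gb_per_year * 1024**3 / rows_per_year  # implied row width

total_tb = gb_per_year * 6 / 1024                      # ~0.76 TB of data (≈1 TB once indexed)

print(f"{rows_per_year:,} rows/year, {rows_6_years:,} rows over 6 years")
print(f"~{bytes_per_row:.0f} bytes/row, ~{total_tb:.2f} TB data excluding indexes")
```

So the ~1TB claim holds only once index overhead is added on top of the ~0.76 TB of raw data.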
meb97me- Posts : 34
Join date : 2010-07-28
Re: Potential 20+ Billion Row Fact Table
Well, it depends on your DBMS. With Oracle you would use partitioning to help things out. With Netezza you don't worry too much about it.
Talk with the business. A 6 year old balance by day probably isn't of much value. See if you can limit the number of days you need to keep, then create month-end numbers for the old stuff. It should cut down the volume considerably.
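One possible shape for that retention rule, as a minimal sketch (the function name and the 2-year daily window are made up for illustration, not anything stated in the thread):

```python
from datetime import date, timedelta

def keep_row(snapshot_date: date, today: date, daily_window_days: int = 730) -> bool:
    """Hypothetical retention rule: keep every daily snapshot inside the
    recent window, and only month-end snapshots for anything older."""
    if (today - snapshot_date).days <= daily_window_days:
        return True
    # Month-end test: the next day falls in a different month.
    next_day = snapshot_date + timedelta(days=1)
    return next_day.month != snapshot_date.month

today = date(2011, 6, 1)
assert keep_row(date(2011, 5, 15), today)        # recent: kept daily
assert keep_row(date(2005, 1, 31), today)        # old, but month-end: kept
assert not keep_row(date(2005, 1, 15), today)    # old mid-month: dropped
```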
Re: Potential 20+ Billion Row Fact Table
Hi meb97me,
It sounds like your requirement could be met using the Kimball design pattern "Slowly Changing Dimension Type 2" e.g. http://www.kimballgroup.com/2008/09/22/slowly-changing-dimensions-part-2/
This complicates the ETL design (compared to a raw snapshot), but it requires far less space. For example, a Customer whose SCD2 attributes never change is represented by one row for all time.
Queries involving related fact tables actually become simpler and very fast: your ETL process assigns the relevant Customer row to each fact via the Customer surrogate key, and then you can just join on that.
Good luck!
Mike
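For readers unfamiliar with the pattern, here is a minimal sketch of that SCD2 surrogate-key assignment, assuming a hypothetical dimension layout with half-open effective-date ranges (none of these names come from Kimball's article):

```python
from datetime import date
from typing import NamedTuple

class DimRow(NamedTuple):
    surrogate_key: int
    natural_key: str       # e.g. the source-system account number
    effective_from: date   # inclusive
    effective_to: date     # exclusive; the current row uses date.max

def lookup_surrogate_key(dim: list, natural_key: str, as_of: date) -> int:
    """Find the SCD2 dimension row in effect for this natural key on a given date."""
    for row in dim:
        if (row.natural_key == natural_key
                and row.effective_from <= as_of < row.effective_to):
            return row.surrogate_key
    raise KeyError(f"no dimension row for {natural_key} on {as_of}")

# A customer whose attributes changed once gets two rows; one who never
# changed is represented by a single row for all time.
dim = [
    DimRow(1, "ACC-1", date(2005, 1, 1), date(2008, 6, 1)),
    DimRow(2, "ACC-1", date(2008, 6, 1), date.max),
    DimRow(3, "ACC-2", date(2005, 1, 1), date.max),
]
assert lookup_surrogate_key(dim, "ACC-1", date(2007, 1, 1)) == 1
assert lookup_surrogate_key(dim, "ACC-1", date(2010, 1, 1)) == 2
```

In a real warehouse this lookup runs during the fact load, so downstream queries never need the date-range predicate; they just join on the surrogate key.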
Re: Potential 20+ Billion Row Fact Table
What do you mean by "a daily snapshot of all our customer accounts"? Is this a snapshot including all account information?
If this is a snapshot of all account attributes, the DBAs are right. You need to create an account dimension to store dimensional attributes like account type, creation date, owner, etc.
But you cannot hold everything with start/end dates. You also need a snapshot fact to hold the measures (like EOD balance amount), because the business needs to trend, graph, or average them over time. They cannot easily do that if you model EOD balance amount in a dimension. This is the first reason why you should have a fact table. On the other hand, EOD balance is usually a rapidly changing attribute, so it has to be taken out of the SCD. This is the second reason.
I also agree with ngalemmo. I don't think your business really needs 6-year-old daily snapshot data.
umutiscan- Posts : 51
Join date : 2010-11-26
Age : 44
Location : Istanbul, Turkey
Re: Potential 20+ Billion Row Fact Table
Sorry, yeah, I probably didn't give enough detail.
We are already implementing SCD2 dimensions, so the account dimension (and the others) will be slowly changing.
The daily snapshot fact table would capture the relationship of the account to the other dimensions at the end of the day, such as Status, Account Manager and Portfolio, and obviously metrics such as account balance are liable to change on a daily basis.
The business is interested in keeping this historical view of the accounts: the modelling teams use the data to model how accounts have performed/changed over time, and they use this information to help them model/predict potential new portfolios of accounts that they may purchase. Understanding the lifecycle of these accounts is important, and rolling up the snapshot to a week or month would lose the all-important granularity.
We're using MSSQL, so partitioning is certainly an option.
meb97me- Posts : 34
Join date : 2010-07-28
Re: Potential 20+ Billion Row Fact Table
Hi,
When you do the historical reporting, are you going to check/compare the data for the past 6 years for an account, or will you check it yearly/quarterly/monthly/daily?
Could you please tell us how your report would look for an account that changed 20 times in a month versus one that did not change in a year?
thanks
Himanshu
hkandpal- Posts : 113
Join date : 2010-08-16
Re: Potential 20+ Billion Row Fact Table
meb97me, why do you track dimension relationships in a daily snapshot? You could create a factless fact to track the relationships, inserting a new row only when a relationship changes or a new one occurs. This will reduce the data size.
You still have to hold the metrics in a fact, but as I said before, you need to limit the number of days held in your fact table.
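A minimal sketch of that change-detection step, with hypothetical data structures (a dict mapping each account to its tuple of dimension relationships per day); only accounts whose relationships differ from yesterday produce a factless-fact row:

```python
def changed_rows(yesterday: dict, today: dict) -> list:
    """Emit a factless-fact row only for accounts whose dimension
    relationships (status, account manager, portfolio, ...) changed."""
    out = []
    for account, relations in today.items():
        # New accounts (not in yesterday) also count as a change.
        if yesterday.get(account) != relations:
            out.append((account, relations))
    return out

yesterday = {"ACC-1": ("Open", "Mgr-A", "P1"), "ACC-2": ("Open", "Mgr-B", "P2")}
today     = {"ACC-1": ("Open", "Mgr-A", "P1"), "ACC-2": ("Closed", "Mgr-B", "P2")}
assert changed_rows(yesterday, today) == [("ACC-2", ("Closed", "Mgr-B", "P2"))]
```

On a day where few accounts change, this emits a handful of rows instead of 10 million.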
umutiscan- Posts : 51
Join date : 2010-11-26
Age : 44
Location : Istanbul, Turkey
Re: Potential 20+ Billion Row Fact Table
Keeping a daily snapshot for all accounts (even though most of them have no activity on a given day) will cause a huge data explosion. I would recommend keeping a snapshot only for accounts that have activity that day, and handling the rest on the reporting end.
Otherwise, go ahead with the daily snapshot for all accounts and partition it on a monthly basis.
Abhiraizada- Posts : 20
Join date : 2011-05-24
Re: Potential 20+ Billion Row Fact Table
meb97me wrote:As part of our current Data Warehouse project one of the FACT tables we're planning on creating is a daily snapshot of all our customer accounts (10 million+)
[…]
Hi meb97me,
If the data analysts do require daily snapshots of the data to generate predictive models, then we cannot run away from it.
And I don't think storage space is an issue these days.
But you definitely need to think about how the data engineers are going to retrieve it. Partitioning is a must, as the size of the data keeps growing consistently in your case.
As one of the folks suggested, you can check whether your analysts are okay with a monthly/weekly/bi-weekly snapshot of account facts instead of daily snapshots. This way you reduce the magnitude of the data.
Yes, you can very well maintain effective from/to dates, i.e.:
Let's say you have an account that had only 2 transactions in a month; you can manage with 2 records, on the days the transactions happened, instead of 30 records. It means more work in the ETL, but that's okay when you are first setting up your DW.
Regards
Shiv
sgudavalli- Posts : 29
Join date : 2010-06-10
Age : 40
Location : Pune, India
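Shiv's 2-records-instead-of-30 idea amounts to collapsing a daily series into effective from/to ranges. A minimal sketch, assuming a hypothetical day-ordered list of (day, balance) pairs; consecutive identical states collapse into one range:

```python
from itertools import groupby

def compress_to_ranges(daily):
    """Collapse a day-ordered list of (day, state) pairs into
    (effective_from, effective_to, state) ranges."""
    ranges = []
    # groupby batches consecutive pairs that share the same state.
    for state, run in groupby(daily, key=lambda pair: pair[1]):
        run = list(run)
        ranges.append((run[0][0], run[-1][0], state))
    return ranges

# 5 daily rows collapse to 2 ranges because the balance changed only once.
daily = [(1, 100), (2, 100), (3, 100), (4, 250), (5, 250)]
assert compress_to_ranges(daily) == [(1, 3, 100), (4, 5, 250)]
```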