Eliminate Date Dimension Surrogate Key
5 posters
Page 1 of 1
Eliminate Date Dimension Surrogate Key
We have always been using a sequential surrogate key for all our Dimension tables, including the Calendar Dimension.
There is also a different school of thought (only for Calendar Dimension) where instead of using sequential keys, replace it with a smartkey in the format of YYYYMMDD. I can appreciate this design since Calendar Days are always unique and sequential. Also note that this is the BASE CALENDAR, there are no Fiscal Attributes at all.
Recently we came across a recommendation of simply storing a DATE in the Fact tables, don't even mess with an Integer Key. The Date field will be the Key for the Date Dimension.
The advantages with this approach are as follows :
1. Ability to create table partition by date for Fact tables.
2. Eliminate the need for a Surrogate key lookup.
I can't seem to come up with any reasons not to do this. Am I missing something?
Any thoughts?
There is also a different school of thought (only for Calendar Dimension) where instead of using sequential keys, replace it with a smartkey in the format of YYYYMMDD. I can appreciate this design since Calendar Days are always unique and sequential. Also note that this is the BASE CALENDAR, there are no Fiscal Attributes at all.
Recently we came across a recommendation of simply storing a DATE in the Fact tables, don't even mess with an Integer Key. The Date field will be the Key for the Date Dimension.
The advantages with this approach are as follows :
1. Ability to create table partition by date for Fact tables.
2. Eliminate the need for a Surrogate key lookup.
I can't seem to come up with any reasons not to do this. Am I missing something?
Any thoughts?
juz_b- Posts : 17
Join date : 2009-02-07
Date
Hi,
Reading your post suggests that there are three options for storing a technical key in your fact table:
I've always used option number two : YYYYMMDD with the convert functions and this works quite well. The advantage of no lookup neeeded for this surogate key is the most important one. Putting an index on an integer key is the best option there is. Option number three doesn't seem to have any advantages to me..
Regards,
Hennie
Reading your post suggests that there are three options for storing a technical key in your fact table:
- Lookup via lookup tables
- Calculating YYYYMMDD (IN SQL SERVER CONVERT (Varchar(8), Getdate(), 112))
- Storing the date in the fact table like this 09/11/2009 ?
I've always used option number two : YYYYMMDD with the convert functions and this works quite well. The advantage of no lookup neeeded for this surogate key is the most important one. Putting an index on an integer key is the best option there is. Option number three doesn't seem to have any advantages to me..
Regards,
Hennie
hennie7863- Posts : 31
Join date : 2009-10-19
Re: Eliminate Date Dimension Surrogate Key
Using just date as a fact table dimension is not a good idea. For one thing, it gives you no recourse when you get a bad date. Second, you lose a lot of reporting flexibility that a well thought out date dimension can give you. Reporting last month, year-over-year and so forth are very simple with a proper dimension, whereas a date requires fairly complex and cumbersome SQL expressions that cannot make use of indexes.
Besides, the reason why YYYYMMDD is an exception to the surrogate key rule is for item #1 in your advantage list: to enable definition of partitions on date. But, it should be clear... it is NOT a smart key. It is purely for the use of a DBA to define partitions... it is still a surrogate key and should be treated as such.
Besides, the reason why YYYYMMDD is an exception to the surrogate key rule is for item #1 in your advantage list: to enable definition of partitions on date. But, it should be clear... it is NOT a smart key. It is purely for the use of a DBA to define partitions... it is still a surrogate key and should be treated as such.
Re: Eliminate Date Dimension Surrogate Key
Thanks for the inputs.
Actually we are still maintaining a Date Dimension, except we are using the DATE field as the Key. Therefore we will retain all the reporting flexibility through a Date dimension.
Seems like storing the YYYYMMDD format surrogate key is the way to go. If we get a Bad Date, we can just default it to a -1 surrogate key.
Thanks all!
Actually we are still maintaining a Date Dimension, except we are using the DATE field as the Key. Therefore we will retain all the reporting flexibility through a Date dimension.
Seems like storing the YYYYMMDD format surrogate key is the way to go. If we get a Bad Date, we can just default it to a -1 surrogate key.
Thanks all!
juz_b- Posts : 17
Join date : 2009-02-07
Re: Eliminate Date Dimension Surrogate Key
Actually recent Kimball school of thought (as of October of 2007) has changed; they're now recommending the use of the date itself as the surrogate key for your date dimension rather than a surrogate key. Since it's completely predictable it can easily be done without any risk to the design. It's also the only table you do this with.
There are several advantages to using a date as the date dimension surrogate key. For one, partitioning is very easy to do on date, particularly since it's so predictable. Another is just plain ease of use; it's much easier to write validation queries against a fact table that already has the date in it rather than having to look up that surrogate key each time. That's incredibly tedious if a data analyst is spending any time at all in that table.
I'd recommend using the actual date rather than a YYYYMMDD character string. That requires a date conversion that's not really necessary. By using the date itself you can easily use all the database's built-in date logic. Date fields index just fine so there's not reason not to. Just plug missing dates with something like 1/1/1900; that's becomes your old -1 integer surrogate key.
An approach I've used in the past is to store the date dimensions surrogate key along with the actual date in the fact table. The surrogate key is used by the reporting tool to provide full dimensional analysis on date but the actual date is there solely to make manual querying easier. It requires more space but I'm not too concerned about that. The trick is partitioning; if you partition on the date's surrogate key then you'll need another index on the date along with the surrogate key to ensure you'll only scan a partition and not the entire table. Otherwise including the date field isn't really that useful. If you don't use partitioning then it's a non-issue.
From a personal perspective I'm now on board with using the actual date as the surrogate key for the date dimension. I do it for all my models now.
There are several advantages to using a date as the date dimension surrogate key. For one, partitioning is very easy to do on date, particularly since it's so predictable. Another is just plain ease of use; it's much easier to write validation queries against a fact table that already has the date in it rather than having to look up that surrogate key each time. That's incredibly tedious if a data analyst is spending any time at all in that table.
I'd recommend using the actual date rather than a YYYYMMDD character string. That requires a date conversion that's not really necessary. By using the date itself you can easily use all the database's built-in date logic. Date fields index just fine so there's not reason not to. Just plug missing dates with something like 1/1/1900; that's becomes your old -1 integer surrogate key.
An approach I've used in the past is to store the date dimensions surrogate key along with the actual date in the fact table. The surrogate key is used by the reporting tool to provide full dimensional analysis on date but the actual date is there solely to make manual querying easier. It requires more space but I'm not too concerned about that. The trick is partitioning; if you partition on the date's surrogate key then you'll need another index on the date along with the surrogate key to ensure you'll only scan a partition and not the entire table. Otherwise including the date field isn't really that useful. If you don't use partitioning then it's a non-issue.
From a personal perspective I'm now on board with using the actual date as the surrogate key for the date dimension. I do it for all my models now.
BrianJarrett- Posts : 61
Join date : 2009-02-03
Age : 50
Location : St. Louis, MO
I agree with using a real DATE as the key of the DATE dimension
I have wrestled with this one for a while. I was always bothered by using "special" dates (e.g. 1/1/1900) to represent the "missing" and "N/A" kinds of entries, because there was always the risk that users would not be aware of these "special" dates and end up with some weird results, such as computing someone's work experience as 109 years, or something equally unlikely.
However, by using presentation layer views to screen the base table we can add a second line of defence. I now prefer the use of the true date as the primary key of the Date dimension, even with the "special" dates for the old -1, -2, ... entries.
Dates are EVERYWHERE in a data warehouse and it makes no sense to make it harder to use the built-in DBMS date functions. Also, if a user doesn't need the additional attributes of that particular date, then why bother to force a join to the Date dimension, just to get the real date?
Just faking the surrogate key as a YYYYMMDD integer version of the true date adds no value, in my not especially humble opinion. In Oracle I would use a TRUNC'd DATE column (midnight) as the primary key and make the table an index-organized, cached table. In Teradata I would use a DATE column (Teradata dates have no time component) with a unique primary index on the date column.
BTW, I prefer to name the Date dimension Calendar_Day. We have other financial and operational calendars with specific date attributes, and I would require them to use the same kinds of primary keys, so that joins are not too expensive (i.e. Calendar_Day and Financial_Day both use true dates as PK's to avoid datatype conversions).
However, by using presentation layer views to screen the base table we can add a second line of defence. I now prefer the use of the true date as the primary key of the Date dimension, even with the "special" dates for the old -1, -2, ... entries.
Dates are EVERYWHERE in a data warehouse and it makes no sense to make it harder to use the built-in DBMS date functions. Also, if a user doesn't need the additional attributes of that particular date, then why bother to force a join to the Date dimension, just to get the real date?
Just faking the surrogate key as a YYYYMMDD integer version of the true date adds no value, in my not especially humble opinion. In Oracle I would use a TRUNC'd DATE column (midnight) as the primary key and make the table an index-organized, cached table. In Teradata I would use a DATE column (Teradata dates have no time component) with a unique primary index on the date column.
BTW, I prefer to name the Date dimension Calendar_Day. We have other financial and operational calendars with specific date attributes, and I would require them to use the same kinds of primary keys, so that joins are not too expensive (i.e. Calendar_Day and Financial_Day both use true dates as PK's to avoid datatype conversions).
Colin Davies- Posts : 8
Join date : 2009-05-20
Similar topics
» Fact table with non-numeric measure
» Fact Indexing -SQL Server 2008
» Date Dimension: Representing partial dates/Imputing date values
» Always link date fields to Date Dimension?
» Surrogate keys and Dimension-to-Dimension links
» Fact Indexing -SQL Server 2008
» Date Dimension: Representing partial dates/Imputing date values
» Always link date fields to Date Dimension?
» Surrogate keys and Dimension-to-Dimension links
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|