Eliminate Date Dimension Surrogate Key

juz_b Fri Nov 06, 2009 4:57 pm

We have always been using a sequential surrogate key for all our Dimension tables, including the Calendar Dimension.

There is also a different school of thought (only for Calendar Dimension) where instead of using sequential keys, replace it with a smartkey in the format of YYYYMMDD. I can appreciate this design since Calendar Days are always unique and sequential. Also note that this is the BASE CALENDAR, there are no Fiscal Attributes at all.

Recently we came across a recommendation of simply storing a DATE in the Fact tables, don't even mess with an Integer Key. The Date field will be the Key for the Date Dimension.

The advantages with this approach are as follows :

1. Ability to create table partition by date for Fact tables.

2. Eliminate the need for a Surrogate key lookup.

I can't seem to come up with any reasons not to do this. Am I missing something?

Any thoughts?

hennie7863 Sun Nov 08, 2009 4:54 am

Hi,

Reading your post suggests that there are three options for storing a technical key in your fact table:

Lookup via lookup tables
Calculating YYYYMMDD (IN SQL SERVER CONVERT (Varchar(8), Getdate(), 112))
Storing the date in the fact table like this 09/11/2009 ?

I've always used option number two : YYYYMMDD with the convert functions and this works quite well. The advantage of no lookup neeeded for this surogate key is the most important one. Putting an index on an integer key is the best option there is. Option number three doesn't seem to have any advantages to me..

Regards,
Hennie

ngalemmo Mon Nov 09, 2009 12:24 pm

Using just date as a fact table dimension is not a good idea. For one thing, it gives you no recourse when you get a bad date. Second, you lose a lot of reporting flexibility that a well thought out date dimension can give you. Reporting last month, year-over-year and so forth are very simple with a proper dimension, whereas a date requires fairly complex and cumbersome SQL expressions that cannot make use of indexes.

Besides, the reason why YYYYMMDD is an exception to the surrogate key rule is for item #1 in your advantage list: to enable definition of partitions on date. But, it should be clear... it is NOT a smart key. It is purely for the use of a DBA to define partitions... it is still a surrogate key and should be treated as such.

juz_b Mon Nov 09, 2009 1:13 pm

Thanks for the inputs.

Actually we are still maintaining a Date Dimension, except we are using the DATE field as the Key. Therefore we will retain all the reporting flexibility through a Date dimension.

Seems like storing the YYYYMMDD format surrogate key is the way to go. If we get a Bad Date, we can just default it to a -1 surrogate key.

Thanks all!

BrianJarrett Sun Nov 15, 2009 10:41 am

Actually recent Kimball school of thought (as of October of 2007) has changed; they're now recommending the use of the date itself as the surrogate key for your date dimension rather than a surrogate key. Since it's completely predictable it can easily be done without any risk to the design. It's also the only table you do this with.

There are several advantages to using a date as the date dimension surrogate key. For one, partitioning is very easy to do on date, particularly since it's so predictable. Another is just plain ease of use; it's much easier to write validation queries against a fact table that already has the date in it rather than having to look up that surrogate key each time. That's incredibly tedious if a data analyst is spending any time at all in that table.

I'd recommend using the actual date rather than a YYYYMMDD character string. That requires a date conversion that's not really necessary. By using the date itself you can easily use all the database's built-in date logic. Date fields index just fine so there's not reason not to. Just plug missing dates with something like 1/1/1900; that's becomes your old -1 integer surrogate key.

An approach I've used in the past is to store the date dimensions surrogate key along with the actual date in the fact table. The surrogate key is used by the reporting tool to provide full dimensional analysis on date but the actual date is there solely to make manual querying easier. It requires more space but I'm not too concerned about that. The trick is partitioning; if you partition on the date's surrogate key then you'll need another index on the date along with the surrogate key to ensure you'll only scan a partition and not the entire table. Otherwise including the date field isn't really that useful. If you don't use partitioning then it's a non-issue.

From a personal perspective I'm now on board with using the actual date as the surrogate key for the date dimension. I do it for all my models now.

Colin Davies Fri Nov 20, 2009 1:20 am

I have wrestled with this one for a while. I was always bothered by using "special" dates (e.g. 1/1/1900) to represent the "missing" and "N/A" kinds of entries, because there was always the risk that users would not be aware of these "special" dates and end up with some weird results, such as computing someone's work experience as 109 years, or something equally unlikely.

However, by using presentation layer views to screen the base table we can add a second line of defence. I now prefer the use of the true date as the primary key of the Date dimension, even with the "special" dates for the old -1, -2, ... entries.

Dates are EVERYWHERE in a data warehouse and it makes no sense to make it harder to use the built-in DBMS date functions. Also, if a user doesn't need the additional attributes of that particular date, then why bother to force a join to the Date dimension, just to get the real date?

Just faking the surrogate key as a YYYYMMDD integer version of the true date adds no value, in my not especially humble opinion. In Oracle I would use a TRUNC'd DATE column (midnight) as the primary key and make the table an index-organized, cached table. In Teradata I would use a DATE column (Teradata dates have no time component) with a unique primary index on the date column.

BTW, I prefer to name the Date dimension Calendar_Day. We have other financial and operational calendars with specific date attributes, and I would require them to use the same kinds of primary keys, so that joins are not too expensive (i.e. Calendar_Day and Financial_Day both use true dates as PK's to avoid datatype conversions).

Sponsored content

Eliminate Date Dimension Surrogate Key

Eliminate Date Dimension Surrogate Key

Date

Re: Eliminate Date Dimension Surrogate Key

Re: Eliminate Date Dimension Surrogate Key

Re: Eliminate Date Dimension Surrogate Key

I agree with using a real DATE as the key of the DATE dimension

Re: Eliminate Date Dimension Surrogate Key