Primary Key of Dimension and Fact Tables

DavidStein Wed Aug 18, 2010 3:27 pm

I've seen several people give presentations which purport to represent the Kimball Method and I've seen this done two ways.

What is the current position on what should be the primary key of a dimension table?

I was under the impression that we should have surrogate meaningless integer keys (unique clustered indexes) for each dimension table, but should that value be the primary key as well?

Are Fact tables handled the same way or should the surrogate key be a unique clustered index and the business key(s) be the primary key of the fact table?

ngalemmo Wed Aug 18, 2010 4:14 pm

Dimensions always have a single unique surrogate key as its primary key. It is always a number.

Fact tables are a different matter, it depends on what you plan to do with them. At a logical level, it is common to identify a primary key, but I do not subscribe to the notion of actually enforcing it at the physical level (i.e. I do not implement primary key constraints on a fact table, or foreign key contraints for that matter).

However, it is sometimes handy to have a surrogate key on fact table rows to support bridges in certain circumstances and to support some update processes.

DavidStein Wed Aug 18, 2010 4:17 pm

ngalemmo wrote:Dimensions always have a single unique surrogate key as its primary key. It is always a number.

Fact tables are a different matter, it depends on what you plan to do with them. At a logical level, it is common to identify a primary key, but I do not subscribe to the notion of actually enforcing it at the physical level (i.e. I do not implement primary key constraints on a fact table, or foreign key contraints for that matter).

However, it is sometimes handy to have a surrogate key on fact table rows to support bridges in certain circumstances and to support some update processes.

Thank you ngalemmo. Don't fact tables need a unique clustered surrogate key to facilitate indexing of them?

John Simon Thu Aug 19, 2010 2:12 am

David,
If you're referring to SQL Server then the answer is no. Putting a clustered index on a fact table surrogate key would mean the clustered index would be wasted.
You should put a clustered index on the date key and maybe one or at most two other commonly used foreign keys. It does not need to be unique as SQL Server will internally add a 4 byte unique identifier to the clustered index to ensure uniqueness. The narrower your clustered index is, the better (generally), as any other nonclustered indexes will include the clustered index. So by having a wide clustered index you bloat your nonclustered index, getting less rows per page.

DavidStein Thu Aug 19, 2010 9:44 am

John Simon wrote:David,
If you're referring to SQL Server then the answer is no. Putting a clustered index on a fact table surrogate key would mean the clustered index would be wasted.
You should put a clustered index on the date key and maybe one or at most two other commonly used foreign keys. It does not need to be unique as SQL Server will internally add a 4 byte unique identifier to the clustered index to ensure uniqueness. The narrower your clustered index is, the better (generally), as any other nonclustered indexes will include the clustered index. So by having a wide clustered index you bloat your nonclustered index, getting less rows per page.

Yes, I am using SQL Server, but I respectfully disagree. A table cannot have more than one clustered index because that determines the actual order of records in a table.

Also, I thought non-clustered indexes were more efficient when based upon a integer type clustered index rather than a heap. No?

ngalemmo Thu Aug 19, 2010 12:54 pm

I am not a big fan of compound indexes of any form, custered or otherwise, on a fact table. The problem with compound indexes (those made up of multiple columns) is that they are only useful if the query includes the leading columns in the predicate. If, say, the query does not filter on the first column declared in the index, the index is useless.

So, their greatest utility is when the query path is known in advance, so you can build the index to support it. That is not the case in an ad-hoc environment.

That said, I do remember reading in SQL Server docs that they (Microsoft) recommend such an index. I don't recall why, and it has been a few years since I have looked at it. There is probably some internal thing they do that makes such an index useful.

John Simon Fri Aug 20, 2010 12:57 am

DavidStein wrote:
John Simon wrote:David,
If you're referring to SQL Server then the answer is no. Putting a clustered index on a fact table surrogate key would mean the clustered index would be wasted.
You should put a clustered index on the date key and maybe one or at most two other commonly used foreign keys. It does not need to be unique as SQL Server will internally add a 4 byte unique identifier to the clustered index to ensure uniqueness. The narrower your clustered index is, the better (generally), as any other nonclustered indexes will include the clustered index. So by having a wide clustered index you bloat your nonclustered index, getting less rows per page.

Yes, I am using SQL Server, but I respectfully disagree. A table cannot have more than one clustered index because that determines the actual order of records in a table.

Also, I thought non-clustered indexes were more efficient when based upon a integer type clustered index rather than a heap. No?

David,
You've misunderstood me.
I was suggesting your clustered index be based on your datekey or maybe a compound based on your datekey and another dimension e.g. CustomerKey.

Using a meaningless surrogate key as the basis of your clustered index will be a waste of your clustered index. If you had the clustered index on your date key (because most queries will include a date component), then you will get better performance.

ngalemmo Fri Aug 20, 2010 1:43 pm

DavidStein wrote:
Thank you ngalemmo. Don't fact tables need a unique clustered surrogate key to facilitate indexing of them?

I misread the question.

Absolutely... No. And, John, I disagree. I would never define a clustered index on a fact table.

Clustered indexes only have value when there is a clearly defined join path to the table. Dimensions are easy. If you want to cluster them, use the primary key. But facts? What would be the point? Facts are joined on their foreign keys and you don't know from one query to the next which keys are going to be used. Even if you use just date, how is the query going to be resolved? Would the query plan be biased towards using the cluster index even if there may be a higher cardinality (i.e. more selective) choice? If you are doing a report for a year, does it mean the database is going to do a scan of a years worth of data, even though you only want to look at one small category of customers?

hang Fri Aug 20, 2010 6:22 pm

There is a more convincing article about the topic written by another DW expert, Vincent Rainardi who is the author of "Building a Data Warehouse: With Examples in SQL Server". Please refer to the following site:

http://dwbi1.wordpress.com/2010/02/24/primary-key-and-clustered-index-on-the-fact-table/

John Simon Sat Aug 21, 2010 3:55 am

Neil,
SQL Server doesn't work that way. The query optimizer is smart enough to pick the right index, and will not automatically pick a clustered index over a non-clustered index. It will also use index intersections, so both nonclustered indexes and the clustered index in conjunction.

In addition to what Hang said, check out this at this article: http://sqlcat.com/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx

Specifically, section 2:

Build clustered index on the date key of the fact table

•This supports efficient queries to populate cubes or retrieve a historical data slice.
•If you load data in a batch window then use the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF for the clustered index on the fact table. This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates.
•Build nonclustered indexes for each foreign key. This helps ‘pinpoint queries' to extract rows based on a selective dimension predicate.Use filegroups for administration requirements such as backup / restore, partial database availability, etc.

DavidStein Mon Aug 23, 2010 8:49 am

Thanks for the replies guys.
"Build clustered index on the date key of the fact table"

This begs the question, what about fact tables with more than one date? Choose one to be the primary or most important such as Order Date rather than Due Date and use that?

Sponsored content

Primary Key of Dimension and Fact Tables

Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables

Re: Primary Key of Dimension and Fact Tables