Primary Key on a fact table

ebk Tue Jul 03, 2012 2:51 pm

I've looked through this forum, and in the Design Tips about what to use as a PK on a fact table and there's something basic I'm not understanding. There seem to be several points of view:
- don't bother with a PK (because you're never trying to find a single fact row anyway)
- use a surrogate PK (handy in certain specific ETL situations, and in bridge situations)
- use composite PK made up of whatever subset of dimension FK's uniquely identify a row
- use surrogate PK, plus an alternate key made up of the subset of dimension FK's that uniquely identify a row

I don't understand why we use of the 'subset of dimension FK's', either as PK or an Alternate key. Is it that 'We just might want to retreive a single row -- but if we had a surrogate key, it's meaningless & we'd have to join to all the dimension tables in order to get the row?.
ie. Is it that the 'subset of dimension FK's' is our shortcut to the single row, on the outside chance we may need to retreive it? Is this overkill, or playing it safe?

ngalemmo Tue Jul 03, 2012 4:12 pm

By definition, a PK is one or more columns that uniquely identify a row. So, if you need a PK, then whichever set of columns that can be used to uniquely identify a row is it. You can use a surrogate or a compound key, whichever fits your use case.

You can declare a PK as much as you want. The question is, should you enforce a PK? By enforce, I mean to create database constraints that ensure unique primary keys. In data warehousing, the general consensus is you do not use constraints to enforce a PK on a fact table.

There are reasons for this. First, if you are actually updating fact rows, the process that attempts to do an update will ensure PK integrity. Any further enforcement at the database level is redundant. Second, database enforcement requires the database to create and maintain an index made up of the columns that comprise the PK. Such an index has little use for end-users querying the table, adding unnecessary overhead.

As far as surrogate PKs for fact tables go, I understand there are specific cases where such a key can be useful, however, it bypasses the basic form for integrating data across fact tables (aggregate then join across common attributes). I would not perform direct joins between fact tables unless you have tight control over the cardinality of the join itself. It is also raises questions as to why you are using a dimension model, or if the dimensional model was designed correctly. Using a surrogate PK to effect joins between facts means you are doing one-of joins, for specific entities, rather than wholesale aggregations commonly used in analysis. Such one-of queries are typically operational type actions that are better handled in a more normalized ODS environment. If the reason for the join is to obtain dimensions from higher level fact, then it is simply bad design. The more granular fact should house the same dimensions as the higher level fact and eliminate the fact-to-fact join altogether.

BoxesAndLines Thu Jul 05, 2012 2:16 pm

I declare and enforce primary keys. The ETL folks don't mind the extra check. It has saved their butts on several occasions. The reasoning is the same reasoning for OLTP. You need to be able to get to one row. When I put a surrogate key on a fact table I create an alternate key as well since the surrogate is, well, meaningless. I always recommend using surrogate keys on facts because I want a single column PK for all DW tables. I can leverage this many ways. The latest reason is for my data quality mart. I want to store the PK of the offending row in the DQ Fact table. A single column PK easily enables the BI tool to drill down to the exact row that has the DQ issue. If I had a multi-part variable PK, this becomes much more difficult and laborius.

Sponsored content

Primary Key on a fact table

Primary Key on a fact table

Re: Primary Key on a fact table

Re: Primary Key on a fact table

Re: Primary Key on a fact table