Fact Indexing -SQL Server 2008

raikarleena Tue Mar 31, 2009 10:49 am

I read the Tip below (by microsoft) for Fact table indexing.

"Create an integer date key in the format of YYYYMMDD as the surrogate key for both the date dimension and a foreign key in the fact table1. Create a clustered index on the date key column on the fact table. With the date key in this format, you can express date range filters conveniently and explicitly on the fact table"

Source: http://msdn.microsoft.com/en-us/library/cc719182.aspx

-My Fact would insert 20 million records per day. Datawarehouse would have only 6 month data. Which would mean 180 unique date keys in a table of 3600000000 records.
-All my queries would have a filter on datekey.

Any idea if it is it advisable for me to follow:
1. clustered index on datekey and non-clustered unique index on my fact surrogate key as per the guideline?
Or
2. Unique clustered index on my fact table surrogate key and non clustered index on datekey?

I tried running the plan with sample data (union of top 2000 records for each days data) . Plan is much better in Option 1 but query takes the same time to execute in both cases.

Thanks in advance for all the help!

BoxesAndLines Tue Mar 31, 2009 11:05 pm

What access path uses the Fact PK in the select clause? I would think having a simple unique index would suffice for the PK. I would also partition this table by day to reduce the row counts. I think M$'s tip is only that you can figure out the date without actually having to join to the date dimension. That works for date range queries, but it doesn't work if you want to leverage the hierarchy in the date dimension.

raikarleena Wed Apr 01, 2009 5:29 am

1. When i Have clustered index on my Fact SK and non-clustered index on datekey:
All the queries do an Index Scan on clustered Index. Query X,Cost=10

2. When I have clustered index on datekey and non clustered unique index on PK of fact
All the queries do a Index Seek on Clustered Index, Query X, Cost=0.13

Not sure if this would help with real time data. I am relatively new to SQL server. So though Option 2 Looks seems to increase my performance (cost reduced from 10 to 1.3 for almost all my queries), the queries seem to show same performance in terms of time.

Jeff Smith Wed Apr 01, 2009 3:59 pm

Having a Date Dimension Key that is ccyymmdd is useful in a couple of different ways. First, it enables code to calculate the number of days between 2 dates by subtracting one date key value from the other. But the real value is when you want to partition a table based on a date. Having the Date Key be CCYYMMDD facilitates the creation of the partitions and the populating of the partitioned table. If the Date Key was a straight surrogate key that contained no logic, the partitioning of the table would be a bit more difficult.

Zulfiqar Thu Apr 02, 2009 7:10 am

Utilising the date key as yyyymmdd will not allow you to do simple arithmetic calculations such as number of days in between etc. To do this you need to create another attribute in the calendar table that is sequential for each day. We have built a column called DaySequenceID. This makes it very powerful when we want to calculate sales for the last 7 days etc. We also have a MonthSequenceID that allows users to select multiple months crossing year boundaries in the user prompts.

John Simon Fri May 29, 2009 2:04 am

What is your most common query?

I would think it very unlikely that you would be doing a full table scan without reference to any dimensions. It is a very, very bad idea to put a clustered index on a fact primary key as it will be never used in your normal day-to-day queries.

Your clustered index should be based on the keys most commonly used in queries. Generally I use 2-4 dimension keys in my clustered index, and a combination of single non-clustered indexes on other keys and covering indexes for common queries requiring fast responses. Don't bother putting indexes on columns with a small number of distinct values.

With increased indexes you'll need to balance out the load/insert times compared to the query response times.

Zulfiqar Wed Jul 15, 2009 5:36 am

There is nearlyalways a compromise in building your fact tables in terms of the primary key from my experience. It would be great to be able to create a nice clean design with all relevant dimensions as the primary key. However most of the time line_id and line_item_id from the source system are required as part of the key to ensure they are unique and ensure you capture detailed level data that at some point will be important to business users. Hence it doesnt make sense to create a clustered index on the primary key as you would not anticipate end users querying on these non-meaningful columns except very rarely. in SQL Server you can only have one clustered index hence ensure it is used for the most queried dimensions. By the way SQL Server 2008 can help you identify most executed queries and would be useful in helping you to identify candidates columsn for indexes. I believe there are some system views that capture this metadata. However saying this i would definitely recommend to make the calendar key as the first column in your clustered indexes as you can almost guarantee that the date would be part of every query.

Sponsored content

Fact Indexing -SQL Server 2008

Fact Indexing -SQL Server 2008

Re: Fact Indexing -SQL Server 2008

Re: Fact Indexing -SQL Server 2008

Date Surrogate

Re: Fact Indexing -SQL Server 2008

Re: Fact Indexing -SQL Server 2008

Re: Fact Indexing -SQL Server 2008

Re: Fact Indexing -SQL Server 2008