Index Strategy on FACT Table with 300 Million records

dwman Mon Nov 08, 2010 5:08 am

I have a FACT table with 30 Foreign Keys and 4 Alternate Keys. This table will have 300 million records and is going to be partitioned based on Date key.

Would like to some advise on Index strategy. Thinking of,

1.Create single column all FK indexes

2.Not sure about the primary key ? - One options is create to create a composite clustered index on the Alternate keys or create a surrogate Key ?

DWMan

hang Mon Nov 08, 2010 8:38 pm

30 dimension FKs in a single fact should ring alarm bell as it sounds like a centipede fact. Kimball’s rule of thumb is that less than 15 dimension fact covers most cases, 25 dimensions would be excessive for a single fact table. For design and performance purpose, you may need to review the model and try to combine correlated dimensions or break down the fact into reasonable measure groups.

As far as performance goes, you don't need to create indexes for any FKs with low selectivity, as indexes on such fields will not improve performance but rather a waste of resources.

The combination of the dimension FKs, including date key, is the primary key of your fact. You may partition your fact based on the date key if it is part of primary key regardless of clustered or not.

ngalemmo Tue Nov 09, 2010 11:41 pm

What database system?

Why do you have alternate keys?

dwman Sun Nov 28, 2010 10:04 pm

Platform - SQL-Server 2008.

Sorry No alternate keys.

it was a 3 column composite Primary key.

John Simon Mon Nov 29, 2010 12:26 am

Forget a primary key - it doesn't add value, and takes up unnecessary column width.

First of all, as was suggested above, look at the number of foreign keys. You may want to combine some of those dimensions.

As far as indexing, create a clustered index on the most often user date key (you may only have one). Then put nonclustered indexes on the most often used foreign keys - e.g. product and customer. Your indexing strategy should be based on your query usage - so run a trace on the common queries used, and check any indexes you create against those queries to ensure you've created adequate and appropriate indexes.

Jeff Smith Mon Nov 29, 2010 10:56 am

Keep in mind that you can create nonclustered indexes on combinations of fields. If columns 1, 2, and 3 are frequently queried together, then create a nonclustered index for columns 1, 2, and 3. But remember, that if only columns 2 and 3 are in the query, the index won't be used.

The best way is to start minimally. Watch for performance issues and apply indexes targetted at the performance issues. You may find that you need a summary table to improve performance.

Be careful when testing the performance gains from indexing. Depending on the type of disk, you might be fooled into thinking that you're seeing a performance gain from the index when in fact the data used in the test is still in your disk's memory. We're using a EMC clariion SAN. Whenever it reads data from the disk, it holds it in it's own memory (seperate from the server's memory).

dwman Mon Nov 29, 2010 5:28 pm

Are you suggesting no Primary Key on the FACT table ?

I don't have any test data to run a Trace to find out the potential candidates for the Non-clustered indexes ? So probably I need to start with minimum Indexes

hang Mon Nov 29, 2010 7:16 pm

With the size of 300 million and growing, indexing alone is not enough to achieve reasonable performance. You need to seriously consider partitioning the table.

dwman Mon Nov 29, 2010 8:54 pm

yes, I have already designed for a monthly partition which will result in 5- 10 million records in a partition file group.

I am mainly concerned about the 30 Foreign keys making Centipede fact. Probably too late to change the FACT design to bring down the number of dimensions. Any other thoughts put a effective indexes ?

John Simon Mon Nov 29, 2010 9:09 pm

Partitions only help with loading. They don't seem to do a lot for query performance.

The question you need to ask is: How will the data be used? If you have several reports, then as I said, run some traces on the queries that will use the fact table to get an understanding of which columns to index.

You can probably make some assumptions for the reports based on the joins and where clauses. But you need to test.
As I said above, start off with a clustered index on the date, and non-clustered indexes on the most commonly used foreign keys. Keep it minimal and then test against your reports.

BoxesAndLines Mon Nov 29, 2010 11:35 pm

John Simon wrote:Partitions only help with loading. They don't seem to do a lot for query performance.

Sure they do. Partition pruning is highly beneficial on reads. This is especially true on snapshot fact tables.

John Simon Tue Nov 30, 2010 1:37 am

BoxesAndLines wrote:
John Simon wrote:Partitions only help with loading. They don't seem to do a lot for query performance.

Sure they do. Partition pruning is highly beneficial on reads. This is especially true on snapshot fact tables.

Sometimes they do, depending on the query.
Check this blog:
http://blogs.mssqltips.com/blogs/chadboyd/archive/2008/03/19/partitioning-data-for-query-performance-where-s-the-benefit.aspx

dwman Wed Dec 01, 2010 12:30 am

thanks folks for your advise & thoughts.

One more question, Have any of you have tried the following situation.

1. Partition key is not part of primary key.

2. But a clustered index will be created on the partition key

3. Primary key be implemented as non-clustered Index

Any pros, cons & issues would be great.

John Simon Wed Dec 01, 2010 12:55 am

Check out this post by the SQL Server Customer Advisory Team for best practices on Data Warehouses.

http://sqlcat.com/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx

ngalemmo Thu Dec 02, 2010 12:25 am

dwman wrote:One more question, Have any of you have tried the following situation.

1. Partition key is not part of primary key.

2. But a clustered index will be created on the partition key

3. Primary key be implemented as non-clustered Index

Any pros, cons & issues would be great.

Why do you have a primary key declared on the fact table?

Anyway, the partitioning rules are independent of the primary key. Use whatever makes sense for paritioning criteria. The tips John references are well worth reading.

dwman Thu Dec 02, 2010 12:47 am

This is not for a FACT table. I am considering Partition for a Large Log table. By taking Partition approach the old data can be deleted from the table without running the expensive delete command.
One of the Rule when you implement Partition is "Partition Key should be part of Primary Key or Clustered Index"
While I have done the partition using Primary Key approach, I haven' tried the second option.
So, looking for other people's views

Sponsored content

Index Strategy on FACT Table with 300 Million records