Fact Table Indexes

arfoot653@gmail.com Tue Mar 06, 2012 2:51 am

I have a fact table with more than 25 dimensions and will have around 30 million or so records. Whats the proper indexing strategy for the fact table. Fact table consists of a lot of date ids with role playing dims. Should there be a index on each id, or should there be a clustered index wherever possible? Also, with so many indexes how does the ETL load gets optimized. I read in DW ETL Toolkit that one of the options is to drop the indexes before ETL load and recreate indexes after. Am i understanding this right?

ngalemmo Tue Mar 06, 2012 7:04 am

In general there is an index on each FK column in the fact. By cluster do you mean a an index that organizes the table (the index and the row are together) or a index that contains multiple columns (compound index)? If you mean the former, it may or may not help performance for some queries and it would still be an index on a single column. If you mean the latter, it is not recommended.

arfoot653@gmail.com Wed Mar 07, 2012 2:01 am

I meant the compound index. Thanks for your reply. In the DW toolkit book, its recommended that you drop the indexes that are not required for load before the fact table load, add the rows and then add index and run runstats. Is that how its generally done to optimize load time performance?

VHF Wed Mar 07, 2012 1:33 pm

You need to determine which indexes will really help your queries. In many cases the SQL engine will end up doing a table-scan for a star-schema query, in which case the indexes on the fact table aren't being used at all. I would look at which dimensions will be used most frequently to constrain queries and maybe index those, but you need to make the indexes will actually be used. Execution plan is your friend!

I'm not sure a clustered index (physical row ordering in SQL Server) is a good idea for a fact table. A clustered index is great for performance when it it utilized most of the time, but you can actually take a bigger peformance hit when a different index is used which then needs to indirectly access a row in the clustered index. A clustered index build on a unique fact table identifier really wouldn't help DW queries (because nobody is going to query on a fact row id.) A clustered index with a compound keep consisting of foregin keys might help for certain queries, but then you pay a big price any time another index is used (because the clustered index key is so wide.)

I don't index my fact tables at all. I'm using SQL Server 2008. My largest fact table has about 10M rows. So far performance has been good. One of these days I might do some testing to see if I would benefit from any fact table indexing.

ngalemmo Wed Mar 07, 2012 3:16 pm

arfoot653@gmail.com wrote:I meant the compound index. Thanks for your reply. In the DW toolkit book, its recommended that you drop the indexes that are not required for load before the fact table load, add the rows and then add index and run runstats. Is that how its generally done to optimize load time performance?

If you are loading a lot of rows, disabling or dropping indexes helps. If you are using bitmap indexes, then you most definitely want to disable and rebuild indexes. But, it all depends and is worth testing both ways to see if it makes a big enough difference.

As far as stats goes, generating them after each load is overkill. Doing that on weekends is more than enough.

John Simon Wed Mar 07, 2012 11:48 pm

It sounds like you are using SQL Server. I'm actually preparing a presentation for a SQL Server User Group on dimensional modeling and one of the topics I'm presenting on is indexing. I've been looking at various indexing strategies on fact table for comparison purposes including: clustered indexes, heap tables with no indexes, heap tables with non clustered indexes, fact tables with a clustered surrogate primary key, and columnar indexes.

Using a clustered index on a date column (the most commonly used) will nearly double your query performance compared to using a heap table (no indexes or non clustered indexes only). I wouldn't put the clustered index across multiple columns though as this will impact all non-clustered indexes. I would also put a nonclustered index on each foreign key.

If you can move across to SQL Server 2012 (released on April 1st, 2012), you can use columnar indexing, and put the whole table in a column store index. Your performance will then be incredible - 2 seconds on a cold cached query compared to the 20 seconds for same query on a clustered index, and 40 seconds on a fact table without indexing. BTW, I ran these queries ona twelve million row fact table from the ContosoRetail DW database available for download at www.codeplex.com

VHF Thu Mar 08, 2012 10:13 am

John Simon wrote:Using a clustered index on a date column (the most commonly used) will nearly double your query performance compared to using a heap table (no indexes or non clustered indexes only). I wouldn't put the clustered index across multiple columns though as this will impact all non-clustered indexes. I would also put a nonclustered index on each foreign key.

Presumably this is changing the execution plan from a table scan to an index scan. Given that most queries do constrain on date, it seems like there is a signficiant potential for improvement, especially when querying on 1 or 2 years worth of data out of a fact table that contains several years.

John Simon wrote:I would also put a nonclustered index on each foreign key.

Even for the OP's case of having 25 FKs in the fact table? And in what circumstance are these indexes utilized?

For example, say I have a fact table representing product sales with FKs to Date, Product, and Store. If I query sales of home furnishings (Product attribute) for stores in the Southeast (Store attribute) for 2011-Q4, will any of the other fact table indexes be used? Won't it just do an index scan on the clustered index on date and join/merge Product and Store dimensions, discarding those records that don't meet the criteria?

Jeff Smith Thu Mar 08, 2012 1:06 pm

I have a SQL Server database. We have a clustered index on the fact tables based on the date. We're an insurance company and get claims after they've been paid. So our clustered index is on the Paid Date (if it was on the date of service, the entire table would get resorted and resorted and resorted during the load). The claim fact table is partitioned on the Paid Date (in years). The partitioning was done primarily to aid in retiring claim data or moving the old data to slower disk in the future.

If the table wasn't partitioned on the Paid Date, I would have created a complex clustered index on multiple columns starting with the Paid Date and including columns that are regularly included in queries.

I rarely create non-clustered indexes on single columns in the fact table. I don't think there is much benefit to it. I will create complex indexes that reflect how the data is queried. If I know that dimensions X, Y, and Z are usually queried together, then I'll create and index on those 3 fields.

If you are using a SAN, I think it's extremely difficult to measure the improvement from the indexing in the short term. The reason being that if you run a query on Fact Table A and include Dimensions X, Y, and Z, the SAN will keep the data in it's own memory. Say it takes 30 seconds for the query to run the first time, it could take 2 seconds the next time simple because the data is already in the SANS memory. This is different that the RAM of the database server, which can be cleared out.

The point being is that you have to almost go on gut when it comes to creating indexes.

John Simon Thu Mar 08, 2012 6:24 pm

Put non-clustered indexes on frequently used, high-cardinality foreign keys in your fact table for "pinpoint" queries. Good candidates would be something like customer and product. So if you have a query where you are looking for certain products for a customer, then having these columns use a non-clustered index will provide great performance improvements.

I don't have the reference on me, but I remember reading that narrow non-clustered indexes can work together such as the example I gave above. I thnk it was from Kimberly Tripp.

Here's a link to best practices from SQLCAT: http://sqlcat.com/sqlcat/b/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx

John Simon Thu Mar 08, 2012 6:53 pm

VHF wrote:
John Simon wrote:I would also put a nonclustered index on each foreign key.
Even for the OP's case of having 25 FKs in the fact table? And in what circumstance are these indexes utilized?

For example, say I have a fact table representing product sales with FKs to Date, Product, and Store. If I query sales of home furnishings (Product attribute) for stores in the Southeast (Store attribute) for 2011-Q4, will any of the other fact table indexes be used? Won't it just do an index scan on the clustered index on date and join/merge Product and Store dimensions, discarding those records that don't meet the criteria?

To simulate your query, I created an exact duplicate fact table, with the same indexing. However the newly created fact table had no non clustered index on Customer. The cost was 6% for the query with the index compared to 94% without.

Here's the query:

SELECT p.Manufacturer, SUM(SalesAmount) AS SaleAmount
FROM dbo.FactOnlineSales f (NOLOCK)
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
WHERE d.CalendarYear = 2009
AND p.BrandName = 'Contoso'
AND c.CustomerLabel = '26733'
GROUP BY p.Manufacturer

The fact table with the non clustered index ran in 645 ms, the fact table without the non clustered index in 33,523 ms.

Looking at the stats:
First query: Table 'FactOnlineSales'. Scan count 365, logical reads 1955, physical reads 1, read-ahead reads 184
Second query: Table 'FactOnlineSales_Clustered'. Scan count 365, logical reads 101215, physical reads 2, read-ahead reads 98333

We can see there are far more reads required for the fact table without the customer index.

John Simon Fri Mar 09, 2012 6:10 am

Regarding compound indexes i.e. multi-column index vs indexes across multiple columns, using the ContosoRetailDW database again, I compared several queries using a single index on product and another index on customer, vs an index on both product and customer.

Running the following query:

SELECT c.[FirstName] + ' ' + [LastName], SUM(SalesAmount) AS SaleAmount
FROM dbo.FactOnlineSales f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
WHERE d.CalendarYear = 2008
AND c.CustomerLabel = '19739'
AND p.BrandName = 'Contoso'
GROUP BY c.[FirstName] + ' ' + [LastName]

against both indexing strategies, the single column indexes well outperformed the multi-column index - 1164 ms vs 34661 ms

Interestingly, the single column indexed table used a series of Nested Loops with key lookups against the indexes.
The multi-column indexed table used a star-query optimization that was far slower.

Sponsored content

Fact Table Indexes

Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes

Re: Fact Table Indexes