Fact Table Indexes
5 posters
Page 1 of 1
Fact Table Indexes
I have a fact table with more than 25 dimensions and will have around 30 million or so records. Whats the proper indexing strategy for the fact table. Fact table consists of a lot of date ids with role playing dims. Should there be a index on each id, or should there be a clustered index wherever possible? Also, with so many indexes how does the ETL load gets optimized. I read in DW ETL Toolkit that one of the options is to drop the indexes before ETL load and recreate indexes after. Am i understanding this right?
arfoot653@gmail.com- Posts : 9
Join date : 2012-02-01
Re: Fact Table Indexes
In general there is an index on each FK column in the fact. By cluster do you mean a an index that organizes the table (the index and the row are together) or a index that contains multiple columns (compound index)? If you mean the former, it may or may not help performance for some queries and it would still be an index on a single column. If you mean the latter, it is not recommended.
Re: Fact Table Indexes
I meant the compound index. Thanks for your reply. In the DW toolkit book, its recommended that you drop the indexes that are not required for load before the fact table load, add the rows and then add index and run runstats. Is that how its generally done to optimize load time performance?
arfoot653@gmail.com- Posts : 9
Join date : 2012-02-01
Re: Fact Table Indexes
You need to determine which indexes will really help your queries. In many cases the SQL engine will end up doing a table-scan for a star-schema query, in which case the indexes on the fact table aren't being used at all. I would look at which dimensions will be used most frequently to constrain queries and maybe index those, but you need to make the indexes will actually be used. Execution plan is your friend!
I'm not sure a clustered index (physical row ordering in SQL Server) is a good idea for a fact table. A clustered index is great for performance when it it utilized most of the time, but you can actually take a bigger peformance hit when a different index is used which then needs to indirectly access a row in the clustered index. A clustered index build on a unique fact table identifier really wouldn't help DW queries (because nobody is going to query on a fact row id.) A clustered index with a compound keep consisting of foregin keys might help for certain queries, but then you pay a big price any time another index is used (because the clustered index key is so wide.)
I don't index my fact tables at all. I'm using SQL Server 2008. My largest fact table has about 10M rows. So far performance has been good. One of these days I might do some testing to see if I would benefit from any fact table indexing.
I'm not sure a clustered index (physical row ordering in SQL Server) is a good idea for a fact table. A clustered index is great for performance when it it utilized most of the time, but you can actually take a bigger peformance hit when a different index is used which then needs to indirectly access a row in the clustered index. A clustered index build on a unique fact table identifier really wouldn't help DW queries (because nobody is going to query on a fact row id.) A clustered index with a compound keep consisting of foregin keys might help for certain queries, but then you pay a big price any time another index is used (because the clustered index key is so wide.)
I don't index my fact tables at all. I'm using SQL Server 2008. My largest fact table has about 10M rows. So far performance has been good. One of these days I might do some testing to see if I would benefit from any fact table indexing.
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Fact Table Indexes
arfoot653@gmail.com wrote:I meant the compound index. Thanks for your reply. In the DW toolkit book, its recommended that you drop the indexes that are not required for load before the fact table load, add the rows and then add index and run runstats. Is that how its generally done to optimize load time performance?
If you are loading a lot of rows, disabling or dropping indexes helps. If you are using bitmap indexes, then you most definitely want to disable and rebuild indexes. But, it all depends and is worth testing both ways to see if it makes a big enough difference.
As far as stats goes, generating them after each load is overkill. Doing that on weekends is more than enough.
Re: Fact Table Indexes
It sounds like you are using SQL Server. I'm actually preparing a presentation for a SQL Server User Group on dimensional modeling and one of the topics I'm presenting on is indexing. I've been looking at various indexing strategies on fact table for comparison purposes including: clustered indexes, heap tables with no indexes, heap tables with non clustered indexes, fact tables with a clustered surrogate primary key, and columnar indexes.
Using a clustered index on a date column (the most commonly used) will nearly double your query performance compared to using a heap table (no indexes or non clustered indexes only). I wouldn't put the clustered index across multiple columns though as this will impact all non-clustered indexes. I would also put a nonclustered index on each foreign key.
If you can move across to SQL Server 2012 (released on April 1st, 2012), you can use columnar indexing, and put the whole table in a column store index. Your performance will then be incredible - 2 seconds on a cold cached query compared to the 20 seconds for same query on a clustered index, and 40 seconds on a fact table without indexing. BTW, I ran these queries ona twelve million row fact table from the ContosoRetail DW database available for download at www.codeplex.com
Using a clustered index on a date column (the most commonly used) will nearly double your query performance compared to using a heap table (no indexes or non clustered indexes only). I wouldn't put the clustered index across multiple columns though as this will impact all non-clustered indexes. I would also put a nonclustered index on each foreign key.
If you can move across to SQL Server 2012 (released on April 1st, 2012), you can use columnar indexing, and put the whole table in a column store index. Your performance will then be incredible - 2 seconds on a cold cached query compared to the 20 seconds for same query on a clustered index, and 40 seconds on a fact table without indexing. BTW, I ran these queries ona twelve million row fact table from the ContosoRetail DW database available for download at www.codeplex.com
Re: Fact Table Indexes
Presumably this is changing the execution plan from a table scan to an index scan. Given that most queries do constrain on date, it seems like there is a signficiant potential for improvement, especially when querying on 1 or 2 years worth of data out of a fact table that contains several years.John Simon wrote:Using a clustered index on a date column (the most commonly used) will nearly double your query performance compared to using a heap table (no indexes or non clustered indexes only). I wouldn't put the clustered index across multiple columns though as this will impact all non-clustered indexes. I would also put a nonclustered index on each foreign key.
Even for the OP's case of having 25 FKs in the fact table? And in what circumstance are these indexes utilized?John Simon wrote:I would also put a nonclustered index on each foreign key.
For example, say I have a fact table representing product sales with FKs to Date, Product, and Store. If I query sales of home furnishings (Product attribute) for stores in the Southeast (Store attribute) for 2011-Q4, will any of the other fact table indexes be used? Won't it just do an index scan on the clustered index on date and join/merge Product and Store dimensions, discarding those records that don't meet the criteria?
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Fact Table Indexes
I have a SQL Server database. We have a clustered index on the fact tables based on the date. We're an insurance company and get claims after they've been paid. So our clustered index is on the Paid Date (if it was on the date of service, the entire table would get resorted and resorted and resorted during the load). The claim fact table is partitioned on the Paid Date (in years). The partitioning was done primarily to aid in retiring claim data or moving the old data to slower disk in the future.
If the table wasn't partitioned on the Paid Date, I would have created a complex clustered index on multiple columns starting with the Paid Date and including columns that are regularly included in queries.
I rarely create non-clustered indexes on single columns in the fact table. I don't think there is much benefit to it. I will create complex indexes that reflect how the data is queried. If I know that dimensions X, Y, and Z are usually queried together, then I'll create and index on those 3 fields.
If you are using a SAN, I think it's extremely difficult to measure the improvement from the indexing in the short term. The reason being that if you run a query on Fact Table A and include Dimensions X, Y, and Z, the SAN will keep the data in it's own memory. Say it takes 30 seconds for the query to run the first time, it could take 2 seconds the next time simple because the data is already in the SANS memory. This is different that the RAM of the database server, which can be cleared out.
The point being is that you have to almost go on gut when it comes to creating indexes.
If the table wasn't partitioned on the Paid Date, I would have created a complex clustered index on multiple columns starting with the Paid Date and including columns that are regularly included in queries.
I rarely create non-clustered indexes on single columns in the fact table. I don't think there is much benefit to it. I will create complex indexes that reflect how the data is queried. If I know that dimensions X, Y, and Z are usually queried together, then I'll create and index on those 3 fields.
If you are using a SAN, I think it's extremely difficult to measure the improvement from the indexing in the short term. The reason being that if you run a query on Fact Table A and include Dimensions X, Y, and Z, the SAN will keep the data in it's own memory. Say it takes 30 seconds for the query to run the first time, it could take 2 seconds the next time simple because the data is already in the SANS memory. This is different that the RAM of the database server, which can be cleared out.
The point being is that you have to almost go on gut when it comes to creating indexes.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Fact Table Indexes
Put non-clustered indexes on frequently used, high-cardinality foreign keys in your fact table for "pinpoint" queries. Good candidates would be something like customer and product. So if you have a query where you are looking for certain products for a customer, then having these columns use a non-clustered index will provide great performance improvements.
I don't have the reference on me, but I remember reading that narrow non-clustered indexes can work together such as the example I gave above. I thnk it was from Kimberly Tripp.
Here's a link to best practices from SQLCAT: http://sqlcat.com/sqlcat/b/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx
I don't have the reference on me, but I remember reading that narrow non-clustered indexes can work together such as the example I gave above. I thnk it was from Kimberly Tripp.
Here's a link to best practices from SQLCAT: http://sqlcat.com/sqlcat/b/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx
Re: Fact Table Indexes
VHF wrote:Even for the OP's case of having 25 FKs in the fact table? And in what circumstance are these indexes utilized?John Simon wrote:I would also put a nonclustered index on each foreign key.
For example, say I have a fact table representing product sales with FKs to Date, Product, and Store. If I query sales of home furnishings (Product attribute) for stores in the Southeast (Store attribute) for 2011-Q4, will any of the other fact table indexes be used? Won't it just do an index scan on the clustered index on date and join/merge Product and Store dimensions, discarding those records that don't meet the criteria?
To simulate your query, I created an exact duplicate fact table, with the same indexing. However the newly created fact table had no non clustered index on Customer. The cost was 6% for the query with the index compared to 94% without.
Here's the query:
SELECT p.Manufacturer, SUM(SalesAmount) AS SaleAmount
FROM dbo.FactOnlineSales f (NOLOCK)
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
WHERE d.CalendarYear = 2009
AND p.BrandName = 'Contoso'
AND c.CustomerLabel = '26733'
GROUP BY p.Manufacturer
The fact table with the non clustered index ran in 645 ms, the fact table without the non clustered index in 33,523 ms.
Looking at the stats:
First query: Table 'FactOnlineSales'. Scan count 365, logical reads 1955, physical reads 1, read-ahead reads 184
Second query: Table 'FactOnlineSales_Clustered'. Scan count 365, logical reads 101215, physical reads 2, read-ahead reads 98333
We can see there are far more reads required for the fact table without the customer index.
Last edited by John Simon on Thu Mar 08, 2012 6:56 pm; edited 2 times in total (Reason for editing : Formatting)
Re: Fact Table Indexes
Regarding compound indexes i.e. multi-column index vs indexes across multiple columns, using the ContosoRetailDW database again, I compared several queries using a single index on product and another index on customer, vs an index on both product and customer.
Running the following query:
SELECT c.[FirstName] + ' ' + [LastName], SUM(SalesAmount) AS SaleAmount
FROM dbo.FactOnlineSales f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
WHERE d.CalendarYear = 2008
AND c.CustomerLabel = '19739'
AND p.BrandName = 'Contoso'
GROUP BY c.[FirstName] + ' ' + [LastName]
against both indexing strategies, the single column indexes well outperformed the multi-column index - 1164 ms vs 34661 ms
Interestingly, the single column indexed table used a series of Nested Loops with key lookups against the indexes.
The multi-column indexed table used a star-query optimization that was far slower.
Running the following query:
SELECT c.[FirstName] + ' ' + [LastName], SUM(SalesAmount) AS SaleAmount
FROM dbo.FactOnlineSales f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
WHERE d.CalendarYear = 2008
AND c.CustomerLabel = '19739'
AND p.BrandName = 'Contoso'
GROUP BY c.[FirstName] + ' ' + [LastName]
against both indexing strategies, the single column indexes well outperformed the multi-column index - 1164 ms vs 34661 ms
Interestingly, the single column indexed table used a series of Nested Loops with key lookups against the indexes.
The multi-column indexed table used a star-query optimization that was far slower.
Similar topics
» Fact Table Indexes
» 'Routing' the grain of the fact table to multpile members of multiple dimensions causes the fact table to 'explode'
» Integrating new fact table which has one to many relationship with the main fact table in existing star schema
» How to create fact table with measures derived from comparing two fact table rows
» Implementing secondary indexes on fact tables. Is it a good idea?
» 'Routing' the grain of the fact table to multpile members of multiple dimensions causes the fact table to 'explode'
» Integrating new fact table which has one to many relationship with the main fact table in existing star schema
» How to create fact table with measures derived from comparing two fact table rows
» Implementing secondary indexes on fact tables. Is it a good idea?
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum