Partitioning Discussion in SQL Server 2008 R2

FootyRef Mon Nov 05, 2012 2:08 pm

Not sure if this belongs here or in ETL and Data Quality, but I would like to discuss partitioning our very large fact table in SQL Server 2008 R2. We have a fact table that is 180 million rows with nearly 5.5 years worth of data. We only have a requirement to keep five years, so I am looking at getting rid of some of it and know that partitioning would help with that effort. Probably the biggest reason, though we want to start partitioning is the time it takes to go through the ETL on this table. We only have to do it once a month, but the whole ETL process is approaching twelve hours, with 9-10 of it just doing the inserting and updating our accumulating snapshot fact and then loading it from our staging database to our warehouse database and indexing it. One of the things I would like to be able to do is just stage the months we want to process, which is governed by the Begin Month is the source. I expect, with partitioning, that I could switch those partitions into a staging table and only process the source records there. Then, I could load just that table to the warehouse, apply the proper indexes on the surrogates, and switch those to the main table. We are looking for some scripts that might dynamically help us do that but I am finding bits and pieces and not enough for a complete solution and figured there were probably some folks very well versed at doing that in here. I need to be able to read the distinct Begin Month in my source data, switch the partitions for those months into a staging table, process the records, load the records to the warehouse into a staging table and switch out the old records and replace them with the new, as well as getting rid of the oldest partition in the staging fact and the warehouse fact.

Anyone willing to share ideas, thoughts, scripts, etc? I am the first person to try to do partitioning here and I probably know more than the DBAs right now because of the research I have done on the subject.

Jeff Smith Wed Nov 07, 2012 6:36 pm

I think I understand what you are saying.

You get late arriving facts. When you load them, the database is reordering the entire table. To avoid this, you want to use partitions. When late arriving facts come into the database, you only want to deal with the section of the final fact table that is being impacted - you want to limit the reordering to just the affected partitions.

I don't think you need to move the partitions to a staging table. Let's say your table is partitioned by Month (range of dates within the month) and that you have 24 months of data. The clustered index/key is the date used in the partition. If a new row comes in with a date from month 12, the insert will only resort the rows in the partition or Month 12. the database knows that it doesn't have to worry about rows outside of the partition for month 12.

But the process for what you want to do is to create Table_S (staging), move the partition from Table_P to Table_S, load it, and then move the partition back to Table_P.

Alternatively, you could create 2 tables: Current and history. The 2 tables are identical in design. The current table keeps the rows with dates that are influx. If you are always processing the last 3 months of data, then the current table keeps 3 months. As part of the load, it moves the oldest month of data from the current to History. You would create a union view across the 2 tables, but I'm not sure what that buys you.

As far as retiring old data, why? Just move the partition to slower cheaper disk.

FootyRef Thu Nov 08, 2012 12:57 pm

Thanks for replying, Jeff. I guess I didn't do a very good job of asking the question or I am not understanding your answer.

I want to start out by partitioning our 180 million row table in 60 partitions based on a month. We get new source data each month that might be updates to previous records or new records for one or more months in the future. I want to determine the distinct months in the source data and see if, first of all, those months already exist and, secondly, if they do not, then I want to dynmically create new partitions on those months. Then, I want to process the source data.

I have read in many places, including here, that is best to only work with the months of data that are needed rather than trying to inserts and updates against a table with all of the data. I alsio have read the best way to do that is have the records in the months in a separate staging table. Switching the partions from the large table would easily do that. Then I would process the new source data against the staging table and then switch the updated partitions back into the large table and get rid of the oldest partition since we only need to have 60 months of data.

The reason I posted is asking if this strategy is fine and if anyone knows of scripts that might automate the process of creating the new partitions based upon the new months in the source data, switching out the partitions from the large table to a staging table based upon the months in the source data and then switching them back into the large table based on the source months. Liek I was saying, I have found bits and pieces of code to do this, but not enough for a complete solution and I'm sure I amnot the only person who wants to do this.

I think I am going to post this in ETL and Data Quality, as well, because it is actually being done in the ETL process, but please feel free **ANYONE** to share your experiences with this and perhaps some scripts.

Jeff Smith Thu Nov 08, 2012 2:23 pm

What I'm saying is that you don't need to move the partitions from the fact to a staging table and back to the fact table. If you partition the table into 60 partitions, then each partition essentially acts like a seperate table. Anything you do to a partition will be limited to that partition.

Also, 60 partitions with 3 million rows each seems like overkill. Quarterly might be better.

FootyRef Thu Nov 08, 2012 2:35 pm

I would have thought the clustered index on the month would have taken care of that in the data we have now, but it takes so long to process these records. Even a Kimball ETL strategy talks about moving what you need to a staging table, processing it with the new data and then putting it back into the main table. I was just going to do that by switching partitions, which is a metadata function and doesn't involve data movement.

BoxesAndLines Thu Nov 08, 2012 2:36 pm

The OP wants to leverage partition swapping, which is a really good way to do it. The DBA at a previous engagement wrote the SP's to do this, one to add the new partition and one to drop the old partition. It would seem that someone on the internet would have written about this.

FootyRef Thu Nov 08, 2012 2:45 pm

Thanks for replying.

It's easy enough to add one partition and drop one, but I need to dynamically read the source data months, determine what ones I do not have defined as partitions and create them.

Jeff Smith Thu Nov 08, 2012 3:23 pm

When you partition the table, the partitioning occurs by the Clustered index. Once you partition the table, it will limit the reordering of the data to the affected Partition.

Try it after you've partitioned the table. If you don't see an improvement, then do the partition swapping.

As far as dropping data, your database isn't very big. With the partitioning, the old data will have no impact on queries. A query on the partitioned table will only use the partition specified in the query. If a query is only pulling data from 1 month, it will run the same whether you have 1 month of data i the table or 100.

I hate retiring data. if it's just a matter of disk space, you can buy cheaper slower disk and move the partitions with the older data to the slow disk.

I don't dynamically add partitions. My partitions are created based on the year and I have them defined for the next several years. I'll increase the size of the files housing the partitions as needed.

warrent Mon Nov 12, 2012 10:15 pm

I agree with Jeff's practical suggestions, but if you want to actively manage the partitioning code, there are a couple of examples and utilities on codeplex.com. Search for "Partition management" and you'll see several options. I'm familiar with the project titled SQL Server Partition Management. It has had over 12,000 downloads.
Good luck!

Sponsored content

Partitioning Discussion in SQL Server 2008 R2

Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2

Re: Partitioning Discussion in SQL Server 2008 R2