Database Partitioning and Dimensions without the distribution key

Guest Tue Aug 25, 2009 8:52 am

We are upgrading our data warehouse platform, and now have the option to partition the database across CPUs as well as partitioning tables themselves.

For database partitioning, there is a performance impact if the data from tables to be joined is not located in the same database partition, so the recommendation is to try and use the same distribution key (a selected column from the table). All of our largest tables contain the customer key, and most queries involve customer to some extent so it makes sense to partition the database using this key.

The problem arises with some of the larger dimension tables e.g. we have an order dimension that contains the various attributes of an order we want to report by. I'm thinking of 'breaking' the model and appending the customer key to some of these tables to ensure orders associated with that customer and the fact table are collocated in the same database partition.

I'd be interested to hear any comments on this approach or from anyone who has an alternative suggestion. We don't know what the performance impact of not having the partition key on the table will be, so we could always 'leave as is' and amend later if necessary.

Thanks

ngalemmo Tue Aug 25, 2009 11:29 am

Normally there is no direct correlation between facts and dimensions, which makes localization a challenge in MPP systems. But since you have a situation where direct correlation is possible, you might as well take advantage of it. But, depending on your system, you may run into hotspot issues where you have a significant imbalance in data distribution.

This is a significant issue with Netezza, for example, where each SPU has a somewhat limited capacity (around 100 GB). If you happen to be unlucky and wind up with a couple of your biggest customers in one SPU (like WalMart for the CPG industry), your overall system capacity will suffer. Systems, like Teradata work around the localization problem by allowing you to define join indexes which consume large amounts of space, they basically prejoin tables (sort of like a materialized view in Oracle) to aid performance. Another approach is taken by Aster Systems which allow you to replicate dimensions across all partitions, solving the issue once and for all. This latter approach works great for most dimensions, but probably not practical for inordinately large ones.

The other question worth bringing up is, do you really need an Order dimension? In designs I have done in the past I have been able to break Order attributes up into fact foreign keys (for the various dates) and a few junk dimensions for the rest of it.

Guest Tue Sep 01, 2009 5:55 am

The way DB2 works with the database partitioning and the fact that the customer key is a simple integer means we won't get any skew of data across a single database partition, so that's not an issue. I'm still not sure whether we are going to undertake the extra work to append the customer key onto those other dimensions, but at least no one has said don't do it!

With regard to the order dimension I agree it is a bit of an oddity and I don't think we should have it either. However, we do have order level measures that we cannot move to the item level (e.g. carriage charges), so we have both an order and order line fact table. We use business objects as our BI tool, and so need some way of joining a query involving order and order line together, so the order dimension remained as a conformant dimension. The alternative would have been to join the order and order line fact table directly via the order key, but that causes all sorts of problems with contexts as well as breaking the dimensional model.

I think this is one of the areas that provides a challenge when using a BI tool - it's an easy problem to get around if writing SQL, but shows the limitations of using a product that writes the SQL for you!

Any suggestions to get around this would be very welcome!
Thanks.

Ravi Venkatraman Thu Oct 01, 2009 9:48 pm

I would be careful and consider other options before appending the customer key into other dimensions.
Adding the Customer Key to the Other Dimension introduces dependency of Customer Dimension to be created before the other Dimensions could be Processed.

Since,you are using DB2 with Database Partition Facility. I would suggest replicating the Order Table in all partition.
If size is an issue then possibly you may want to look into Deep Compression option available in DB2 9.5 and 9.7 (additional cost). We have seen 60-70% compression rate on text fields.
Alternatively, you could replicate only the important columns from the Order table in all partition.

Sponsored content

Database Partitioning and Dimensions without the distribution key

Database Partitioning and Dimensions without the distribution key

Re: Database Partitioning and Dimensions without the distribution key

Re: Database Partitioning and Dimensions without the distribution key

Re: Database Partitioning and Dimensions without the distribution key

Re: Database Partitioning and Dimensions without the distribution key