dimension attribute denormalisation in fact table

eferragu Sat May 08, 2010 12:37 pm

Hello,

in our data warehouse, we have a fact table containing about 100 million rows per month joined to a customer dimension containing about 10 million customers.

The customer dimension contains 30 attributes. Two of these attributes are very frequently used in queries and the other ones are seldom used.

The queries using these 2 attributes are very long to execute because of the join cost which is very high.

In order to optimize the queries, we consider denormalizing the 2 attributes in the fact table while keeping them in the customer dimension.

What to you think about this solution ?

Comment ! in Ralph Kimball's books and articles, it is not recommanded to denormalize dimension attributes in the fact table, except for frequently changing type 2 attributes in monster dimensions. In my case, the customer dimension can be considered as a monster dimension but the 2 attributes in question never change. (Actually, the customer dimension is rather a "Credit Card" dimension containing information about the credit card - such as the card number, the bank code and the card type - that are not supposed to change. So all attributes are considered as type 1 attributes since all changes are in fact error corrections).

Thank you very much for your advice.

Regards,
Emmanuel FERRAGU

hang Sat May 08, 2010 7:13 pm

Don't denormalize the textual attribute into fact table. Instead, create a separate dimension table for those static or type 1 attributes, if type 2 is the main cause that makes the customer dimension so big, so you can have a much smaller credit card dimension linked to the fact table.

Make sure it is the join that causes the performance issue but not the nature that you are dealing with a massive fact table. Review or work out your partitioning and indexing strategy and see if it helps. If the distinct number of customers is far smaller, in order of magnitude, than 10 million, then consider creating a transaction dimension that keeps all those dynamic attributes with timestamp so you can have even smaller customer dimension linked to your fact table.

J.Moelands Sun May 09, 2010 9:24 am

Hi,

In addition to Hang is comments... (i guess some investigation on his points is always good)
Another option u might use is descriped in the new book of Ralph. He descripes three reasons when snowflaking your dimension can be a good (needed) option. (page 338 of his book).

Maybe you can split up this large customer dimension with a Snowflake Customer Subdimension. In this was you can save a lot of "row" space if a lot of attributes are not always filled (or filled with 'dummy values'). this will improve the query performance. Create a snowflake sub customer key (identity field in sql server) within your customer dimension and relate it to a customer sub dimension snowflake, within this customer sub dimension u can insert the other attributes based on this generated snowflake subcustomer key.

Within your OLAP / Reporting environment you can create a view that creates youre customer dimension with all the attributes.

Another option is 'compression and partitioning' within SQL 2008, but i don't know on which database type / version you are...

Hope this helpes you solving your problems.
Regards Joey Moelands

ngalemmo Mon May 10, 2010 3:42 pm

Depending on the nature of these attributes, there are different avenues you can consider.

If these attributes are identifiers (very high cardinality) you can consider doing a vertical partitioning of the dimension table. Basically break it up into two tables but keep the same primary key (ie. not a snowflake). Put the commonly used attributes in the thinner table. This should help.

If the attributes are characteristics (low to moderate cardinality) and some reasonable correlation between the values, consider putting the two in a junk dimension. There would be as many rows as there are unique combinations of existing values of the two attributes. If there is a good correlation between the two, there would not be very many unique existing combinations (or, a lot less than the number of customers). This would give you a much smaller dimension table to filter against. It would require you add a new foreign key to the fact. If there is no correlation, you could always consider two new dimensions.

eferragu Fri May 21, 2010 3:20 pm

Hello all.

Thanks very much for your answers.

Actually, the attributes in question are low cardinality attributes (bank code and credit card type) that are pretty much independant.
Besides, our model is a snowflake model as the bank and card type labels are not stored in the credit card dimension table but in a reference table.

So I will follow Nick's advice and store the two codes as two distinct attributes in the fact table.

Regards,
Emmanuel

ngalemmo Fri May 21, 2010 4:54 pm

So I will follow Nick's advice and store the two codes as two distinct attributes in the fact table.

I said that???

They should not be attibutes in the fact table (i.e. degenerate dimensions). Also, bank code and credit card type are not "pretty much independent" as most banks usually only offer one or two brands of credit cards, so there is moderate correlation between the two. Even of there was no correlation, the number of banks (specially now, with the consolidations and bankruptcies) is fairly small and I would imagine, the number of types can be counted on two hands, so the number of combinations would not be very large.

A junk dimension would be the way to go in this case.

hang Fri May 21, 2010 9:56 pm

A junk dimension would be the way to go in this case.

As the customer dimension is big, normalising it with a junk dimension does make since, especially when you may have other bank and card type related textual attributes to go with them, so that you can make customer dimension thinner.

Even of there was no correlation

As pointed out by Nick, you can combine any low cardinality attributes regardless of correlation, type1 or type2, and put them in a junk dimension. Don't forget you may also need a foreign key in the customer dimension to have a relationship between dimensions rather than through much bigger fact table. Having junk dimension key in both fact and dimension table may upset relational modeller, however this arrangement will allow us to build hierarchies between relatively smaller dimensions, as well as to aggregate or filter fact table directly by a tiny junk dimension instead of a monster customer dimension (a performance killer) if the aggregation or filtering is only about the bank and credit card type. ETL process should keep this two way relationships in synch.

There is another technique in Kimball's book, demoting frequently changed attributes into fact table, meaning you could reflect you attribute relationship in the fact table by having foreign keys from two separate dimensions that could have come from one dimension originally. Not sure if this approach is applicable in your case, to make you customer dimension significantly smaller.

eferragu Thu Aug 12, 2010 4:55 am

Thanks very much for your advice.

Just to let you know that we finally decided to create a junk dimension containing the bank code and the credit card type related to the fact table with a surrogate key. This junk dimension is populated via a "select distinct bank code, credit card type" query based on the customer dimension table. It works great !

Sponsored content

dimension attribute denormalisation in fact table

dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table

Maybe u can use a snowflake dimension (as kimball descripes on page 338 of his new book the kimball group read)

Re: dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table

Re: dimension attribute denormalisation in fact table