Large Degenerate Dimension Values
+2
BoxesAndLines
Dave Jermy
6 posters
Page 1 of 1
Large Degenerate Dimension Values
I am designing a new dmensional model for an insurance company and users want to analyse the documents (stored in sharepoint) that have been attached to various entities (policies, claims etc.)
I have designed a Document dimension to hold some of the document metadata (document type, file type, document source). This is a relatively small dimension with just a couple of thousand rows. The fact table (grain: 1 row per change to a document's metadata) has FKs to the Date (for created date), Employee (for creator), Policy and Document dimensions and a small number of measures (count, filesize etc.)
This leaves me with a couple of attributes to fit into the model: Document Title and Document URL. The second is unique across all documents and the first is almost so, so my first thought was to put them onto the fact table as degenerate dimensions.
However, they are very long varchar columns (255 and 2000 characters respectively). Looking at the existing documents, the average combined length of the values in these columns is 157 characters (min 99, max 287). Given that the fact table is quite slender (<30 bytes), adding them seems wrong and may impact on query performance. But, I don't want to add them into the Document dimension, either as it would then be almost 1 to 1 with the fact table.
Any suggestions on the best way to model this?
I have designed a Document dimension to hold some of the document metadata (document type, file type, document source). This is a relatively small dimension with just a couple of thousand rows. The fact table (grain: 1 row per change to a document's metadata) has FKs to the Date (for created date), Employee (for creator), Policy and Document dimensions and a small number of measures (count, filesize etc.)
This leaves me with a couple of attributes to fit into the model: Document Title and Document URL. The second is unique across all documents and the first is almost so, so my first thought was to put them onto the fact table as degenerate dimensions.
However, they are very long varchar columns (255 and 2000 characters respectively). Looking at the existing documents, the average combined length of the values in these columns is 157 characters (min 99, max 287). Given that the fact table is quite slender (<30 bytes), adding them seems wrong and may impact on query performance. But, I don't want to add them into the Document dimension, either as it would then be almost 1 to 1 with the fact table.
Any suggestions on the best way to model this?
Dave Jermy- Posts : 33
Join date : 2011-03-24
Location : London, UK
Re: Large Degenerate Dimension Values
You've identfied the two methods. Do some volume testing to see which works best. If the performance is similar go with the dimension solution since it is cleaner.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Large Degenerate Dimension Values
Because your fact table represents an instance of a document, it seems like that is where Document Title and Document URL should go as DD. I suspect in this case the majority of usage of this fact table will involve at least one of these fields, so there isn't a lot of performance savings to be gained from trying to move them elsewhere. If most of the useage of this fact table involved aggregating other measures, then it might be worthwhile trying to relocate the long DD columns to keep the fact table narrow.
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Large Degenerate Dimension Values
As VHF hinted, it becomes a matter of how often these values are appear in queries. Adding them as DD's, even if it is a document fact table, will severly impact performance. Adding 150+ bytes to a < 30 byte row will have a dramatic hit to I/O. If 80% of the queries need the data, DD's may be the correct option, otherwise a separate dimension, even if the cardinality is high, would be the better choice.
Re: Large Degenerate Dimension Values
Thanks for the responses guys - I'll dig into the requirements a little deeper and do some performance testing.
Dave Jermy- Posts : 33
Join date : 2011-03-24
Location : London, UK
Re: Large Degenerate Dimension Values
Hi Dave,
Based on your description of the Fact table grain, I'd definitely go for storing them on the Dimension. I imagine there will be several Fact records for each Dimension record so you would be duplicating data if you put those attributes on the Fact. It would also be much easier to manage changes to those attributes if they are on the Dimension.
Good luck!
Mike
Based on your description of the Fact table grain, I'd definitely go for storing them on the Dimension. I imagine there will be several Fact records for each Dimension record so you would be duplicating data if you put those attributes on the Fact. It would also be much easier to manage changes to those attributes if they are on the Dimension.
Good luck!
Mike
Columnar-Oriented Database Eliminates Inefficiency of Columns Not Used by a Query
This issue demonstrates a benefit of a columnar-oriented database. Adding the columns to the fact table in a columnar-oriented database, such as Infobright, would impose no performance penalty on queries that do not use the columns.
In addition, Infobright would very heavily compress these columns.
In other words: "Adding 150+ bytes to a < 30 byte row will have" no impact on I/O in a columnar-oriented database for queries that do not use the added columns.
In addition, Infobright would very heavily compress these columns.
In other words: "Adding 150+ bytes to a < 30 byte row will have" no impact on I/O in a columnar-oriented database for queries that do not use the added columns.
Similar topics
» Large number of snowflake code tables per dimension
» Source Values and Conformed Values in the Dimension table
» Question on breaking out Degenerate Dimension to separate dimension
» Should I use a degenerate dimension or create a junk dimension?
» Should this be a degenerate dimension or a junk dimension?
» Source Values and Conformed Values in the Dimension table
» Question on breaking out Degenerate Dimension to separate dimension
» Should I use a degenerate dimension or create a junk dimension?
» Should this be a degenerate dimension or a junk dimension?
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum