SCD Type 2 on large and wide dimension tables
Hi,
I am dealing with a situation where we have to implement slowly changing dimension logic (type 2 - versioned rows with effective dates) on two dimensions. For one of them, the "product" dimension, the key challenge is its width - the dimension has 352 columns! (I know it sounds crazy, but the business users have good reasons for every one of them.) It also has over 6 million records. I have pared the list of columns to version on down to about 30.
The second one, "customer", has about 50 columns, but there are over 50 million customers. Type 2 logic needs to be implemented on about 10 of those columns.
We are currently on Oracle, moving to Netezza. We cannot do row-by-row processing to do this as it will probably not finish within a reasonable time, so I need a set-based approach.
Has anyone encountered a similar situation, and if so, how did you approach the problem?
Thanks!
kiriti- Posts : 1
Join date : 2011-06-10
Re: SCD Type 2 on large and wide dimension tables
Wow, truly monster dimensions. I have dealt with 20-million-row dimension tables and thought that was staggering. Nevertheless they are in the same order of magnitude, so I guess what I did may apply to your scenario as well.
In my experience with monster dimensions, and most likely in yours, the fact tables all look far smaller than these dimensions, and you may find yourself ending up with shallow fact tables connected to deep (and wide) dimension tables - a bunch of unbalanced star schemas. You might ask yourself: what am I dealing with, dimensions or facts? In that case, what is the point of having a normalised fact table while the much bigger dimension tables are denormalised?
I guess whenever you are confronted with monster dimension tables, you need to transcend dimensional thinking and treat them almost like fact tables. Briefly, normalise the monster dimension tables by breaking them into manageable pieces, and let the fact tables reflect the SCD 2 attributes so the dimensions stay reasonably static while the fact tables keep growing. A technique worth looking into is the mini/junk dimension.
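To make the mini-dimension idea concrete, here is a minimal sketch in generic SQL (all table and column names are illustrative, not from the original design): the volatile attributes move out of the main product dimension into a small mini-dimension, and each fact row carries a key to both, so change history is captured by the facts rather than by versioning millions of wide dimension rows.

-- Main product dimension: stable attributes only, effectively static
create table dim_product (
    product_key   integer      not null primary key,
    product_bk    varchar(30)  not null,   -- business key
    product_name  varchar(100),
    brand         varchar(50)
    -- ... remaining stable attributes
);

-- Mini-dimension: the volatile attributes, one row per distinct combination (often banded)
create table dim_product_profile (
    product_profile_key integer not null primary key,
    price_band          varchar(20),
    lifecycle_status    varchar(20)
    -- ... remaining volatile attributes
);

-- The fact table carries both keys; each fact row records the profile in effect at that time
create table fact_sales (
    date_key            integer not null,
    product_key         integer not null references dim_product,
    product_profile_key integer not null references dim_product_profile,
    sales_amount        numeric(18,2)
);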
Another phenomenon I found with these monster dimensions is that only a very small subset of the dimension is actually used by the facts. What about creating a number of fact-driven sub-dimension tables? The performance issue would go away and your star schemas would become well balanced - deep facts connected to shallow dimension tables - without losing any information.
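As a rough illustration of the fact-driven sub-dimension idea (again with made-up names; CREATE TABLE AS SELECT is supported by Netezza and most platforms), the sub-dimension is simply the slice of the big dimension that a fact table actually references:

-- Sub-dimension restricted to the customers actually referenced by one fact table
create table dim_customer_sales as
select d.*
from   dim_customer d
where  d.customer_key in (select distinct customer_key from fact_sales);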
I would still keep the holistic monster dimension table that contains the fuller picture of the dimension, but I would try to avoid connecting it to any fact table, as I know the performance would be hopeless. I might use it as a centralised source for the sub-dimension tables, and query the super dimension only when I need to drill down into details, applying constraints to narrow the query scope. I would not use it for slicing and dicing any facts, as the sub-dimension tables already serve the analysis requirements.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: SCD Type 2 on large and wide dimension tables
I think you're fine to change over your Product Dim. I handle type 2 on tables about that size and wider with little trouble. When we need to add attributes, a rehash takes about 10 minutes tops.
I haven't worked with dimensions as large as 50 million rows. Knowing some specifics about your ETL process and how you intend to identify changes would be useful.
We do our Type 2 SCD via stored procedures.
Using our replicated tables as a source, we process the data - doing any lookups, derived columns, null replacements, and such - into a staging table using the data flow task. Once the stage table holds the deltas, we populate the production table using a stored procedure. Here is some simple pseudocode which will hopefully get you started.
First, you will need to have a few setup columns in your stage table to make this work. We have added the following columns:
- IsNew
- IsType1
- IsType2
- Type1Hashbytes
- Type2Hashbytes
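For context, the two hash columns can be populated while the stage table is loaded. A minimal sketch, assuming SQL Server's HASHBYTES and CONCAT are available; the attribute names below are placeholders, not from the original post:

Update Stage Set
Stage.Type1Hashbytes = HASHBYTES('SHA1', CONCAT(ProductName, '|', Brand))        -- hash of the type 1 attributes
,Stage.Type2Hashbytes = HASHBYTES('SHA1', CONCAT(ListPrice, '|', PackageSize))   -- hash of the type 2 attributes
From Stage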
Step 1: Update your flags to identify what types of records are in your stage table. New, type 1 change, type 2 change.
Update Stage Set
Stage.IsNew = case when Prod.[businesskeycolumn] is null then 1 else 0 end
,Stage.IsType1 = case when Prod.[businesskeycolumn] is not null and Stage.Type1Hashbytes <> Prod.Type1Hashbytes then 1 else 0 end
,Stage.IsType2 = case when Prod.[businesskeycolumn] is not null and Stage.Type2Hashbytes <> Prod.Type2Hashbytes then 1 else 0 end
From Stage
Left Join Prod on Stage.[businesskeycolumn] = Prod.[businesskeycolumn] and Prod.RowIsCurrent = 'Y'
Step 2: Process Type 1 updates, overwriting existing column values. Join on the business key so that all rows for the member (current and history) are updated.
Update Prod Set
Prod.columns = Stage.Columns
From Stage
Inner Join Prod
on Stage.[businesskeycolumn] = Prod.[businesskeycolumn]
Where Stage.IsType1 = 1
Step 3: Process Type 2 Updates - Expire current row
Update Prod Set
Prod.RowIsCurrent = 'N'
,Prod.RowEndDate = DATEADD(DAY, -1, GetDate())
From Stage
Inner Join Prod
on Stage.[businesskeycolumn] = Prod.[businesskeycolumn]
Where Stage.IsType2 = 1
Step 4: Insert new records where Stage.IsType2 = 1 or Stage.IsNew = 1 (I won't write this out.)
Be sure to set up your RowStartDate at this point. If Stage.IsNew = 1 then [default value or null] else GetDate().
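Step 4 isn't written out above, but purely as an illustration (abbreviated column list, made-up default dates), the insert might look something like this:

Insert Into Prod ([businesskeycolumn], Type1Hashbytes, Type2Hashbytes, RowIsCurrent, RowStartDate, RowEndDate)
Select Stage.[businesskeycolumn]
,Stage.Type1Hashbytes
,Stage.Type2Hashbytes
,'Y'
,case when Stage.IsNew = 1 then '1900-01-01' else GetDate() end   -- default start date for brand-new members
,'9999-12-31'                                                     -- open-ended until a future type 2 change
From Stage
Where Stage.IsNew = 1 or Stage.IsType2 = 1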
Hope this helps!
ebry74- Posts : 5
Join date : 2011-06-20
Re: SCD Type 2 on large and wide dimension tables
With Netezza, row width can become an issue for query performance... however, the number of rows is not particularly daunting.
I would suggest splitting each into two tables, one containing type 1 attributes and the other the type 2 attributes. It doubles the FKs on the fact tables, but that shouldn't be a big issue.
Converting to set operations shouldn't be too difficult. The basic process is the same; it's just that you are working with the entire table at once. Don't be afraid to use temp tables.
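As a rough sketch of what a set-based pass over the customer dimension could look like (generic SQL with illustrative names, using hash columns as in the earlier post; not tuned for any particular platform), the whole run is three statements: isolate the changed keys into a temp table, expire the current rows, and insert the new versions.

-- 1. Business keys whose type 2 attributes have changed
create temporary table changed_keys as
select s.customer_bk
from   stage_customer s
join   dim_customer d
  on   d.customer_bk = s.customer_bk
 and   d.row_is_current = 'Y'
where  s.type2_hash <> d.type2_hash;

-- 2. Expire the current versions in one statement
update dim_customer
   set row_is_current = 'N',
       row_end_date   = current_date - 1
 where row_is_current = 'Y'
   and customer_bk in (select customer_bk from changed_keys);

-- 3. Insert the new versions in one statement (attribute columns omitted for brevity)
insert into dim_customer (customer_bk, type2_hash, row_is_current, row_start_date, row_end_date)
select s.customer_bk, s.type2_hash, 'Y', current_date, date '9999-12-31'
from   stage_customer s
where  s.customer_bk in (select customer_bk from changed_keys);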