Modified SCD 2 implementation

Abhiraizada Fri Jul 05, 2013 3:22 am

We have currently implemented conventional SCD 2 dimensions in our model. Following represent the sample data resulting in our DWH enviroment -

Patient dimension having 2 versions for one patient -

CFDM Patient Dimension

PT_KEY NAME GENDER DOB PT_ID CURRENT FLAG
1 JOHN NULL 3 JULY'13 PT1 N
2 JOHN M 3 JULY'13 PT1 Y

Med. Administration records for patients -

CFDM Med Admin Fact

Med Admin Key PT_KEY LOCATION_KEY PROVIDER_KEY CONCEPT_KEY ATTR_1 ATTR_2 ATTR_N CURRENT FLAG
1 1 1 1 1 X1 Y
2 1 1 1 1 X2 Y
3 1 1 1 1 X1 X3 Y
4 2 1 1 1 X2 X4 Y

In the above example, if we write a query to retrieve number of drug administration records for all male patients will result in complex query where we will create a view on top of patient dimension which will then join to med admin fact. This approach is not scalable if we have huge volume of records and queries are being fired dynamically.

Alternative approach simplifies the data retrieval queries and also maintains history of changes to the record. The proposed approach will require composite primary key for dimension consist of – Surrogate key and Version Number.

CFDM Patient Dimension

PT_KEY VERSION NUM NAME GENDER DOB PT_ID CURRENT FLAG
1 1 JOHN NULL 3 JULY'13 PT1 N
1 2 JOHN M 3 JULY'13 PT1 Y

And relationship between dimension and fact will be created on the basis of this composite primary key –

CFDM Med Admin Fact

Med Admin Key PT_KEY PT_VERSION_NUM LOCATION_KEY PROVIDER_KEY CONCEPT_KEY ATTR_1 ATTR_2 ATTR_N CURRENT FLAG
1 1 1 1 1 1 X1 Y
2 1 1 1 1 1 X2 Y
3 1 1 1 1 1 X1 X3 Y
4 1 2 1 1 1 X2 X4 Y

Above design will help us create simplified queries avoiding complex recursive views. In this approach surrogate key for patient table does not change with every change in the record instead version number keeps getting incremented with each record change resulting single/unique surrogate key across multiple versions. This helps us in writing queries where all med admin records needs to be analyzed/viewed against latest instance of patient and also retains the capability to do conventional BI analytics of SCD type 2.

Similar approach is suggested in one of the design tips by Kimball - http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/ ( refer last section describing Type 7).

In this approach fact table contains dual foreign keys for a given dimension: a surrogate key linked to the dimension table where type 2 attributes are tracked, plus the dimension’s durable/pseudo supernatural key linked to the current row in the type 2 dimension to present current attribute values. This results in following modification to Patient dimension and Med administration fact.

CFDM Patient Dimension

PT_KEY PSEUDO_KEY NAME GENDER DOB PT_ID CURRENT FLAG
1 1 JOHN NULL 3 JULY'13 PT1 N
2 1 JOHN M 3 JULY'13 PT1 Y

CFDM Med Admin Fact

Med Admin Key PT_KEY PSEUDO_PT_NUM LOCATION_KEY PROVIDER_KEY CONCEPT_KEY ATTR_1 ATTR_2 ATTR_N CURRENT FLAG
1 1 1 1 1 1 X1 Y
2 1 1 1 1 1 X2 Y
3 1 1 1 1 1 X1 X3 Y
4 2 1 1 1 1 X2 X4 Y

We are planning to move from conventional SCD 2 implementation approach to either of the above suggested approach. It will be helpful if i can get any expert advise before I go ahead and do any changes.

Thanks in Advance
Abhishek Raizada

BoxesAndLines Fri Jul 05, 2013 10:40 am

Yes I have done this in the past and it works well.

LAndrews Fri Jul 05, 2013 3:07 pm

I find the dual key approach increases the size of the fact table unneccessarily, and the joins become more complex as you need to include the current_flag='Y' criteria to get the correct results.

My preference is to add a new type-1 attribute to the dimension (current_pt_key). This allows for easy snowflaking when current attributes are required.

Abhiraizada Mon Jul 08, 2013 1:10 am

LAndrews wrote:
I find the dual key approach increases the size of the fact table unneccessarily, and the joins become more complex as you need to include the current_flag='Y' criteria to get the correct results.

My preference is to add a new type-1 attribute to the dimension (current_pt_key). This allows for easy snowflaking when current attributes are required.

I think you are suggesting adding CURRENT_DIMENSION_KEY attribute to the dimension table and use it when ever needed. I understand this will not impact the size of fact table but still will not reduce the query complexity, as every time I want to analyze fact data corresponding to latest instance of dimension I need to make recursive join ( the same is needed in conventional type 2 implementation) also it will considerably impact the ETL for dimension tables as every time new version of dimension record is created we need to update this CURRENT_KEY column to current surrogate value.

Abhiraizada Mon Jul 08, 2013 1:10 am

BoxesAndLines wrote:Yes I have done this in the past and it works well.

Thanks this give me some confidence !!!

ngalemmo Mon Jul 08, 2013 5:22 am

I've used the dual key approach in the past and it works fine. It also gives you the option of having both a type 1 and type 2 version of the dimension should you so desire. It can help performance if most queries only need current values.

The version numbers don't work as well. It is essentially the same thing except your dimension load processes are more complicated than they need to be.

Sponsored content

Modified SCD 2 implementation

Modified SCD 2 implementation

Re: Modified SCD 2 implementation

Re: Modified SCD 2 implementation

Re: Modified SCD 2 implementation

Re: Modified SCD 2 implementation

Re: Modified SCD 2 implementation

Re: Modified SCD 2 implementation