SCD 2 scenario
4 posters
Page 1 of 1
SCD 2 scenario
This could be a typical scenario in DWH, but I would like to know what is the best way to handle that.
I have a dimension with SCD Type 2 - Using Effective and Expiry Dates. This dimension key is referred in a fact table.
When there is any change in the SCD attributes of the dimension in source, a new record will be inserted in the dimension with a new dimension key generated. Now the data that comes henceforth in the fact will have the new dimension key as the reference.
When I try to join like
Select x,y,z from fact, dimension
where fact.dimension_key = dimension.dimension_key
and dimension.expiry_date is null
then I'll get only the data associated with the new dimension key. And if I do not use the condition "dimension.expiry_date is null" then I'll get duplicate data. What is the best approach to overcome this
Thanks in Advance..........
I have a dimension with SCD Type 2 - Using Effective and Expiry Dates. This dimension key is referred in a fact table.
When there is any change in the SCD attributes of the dimension in source, a new record will be inserted in the dimension with a new dimension key generated. Now the data that comes henceforth in the fact will have the new dimension key as the reference.
When I try to join like
Select x,y,z from fact, dimension
where fact.dimension_key = dimension.dimension_key
and dimension.expiry_date is null
then I'll get only the data associated with the new dimension key. And if I do not use the condition "dimension.expiry_date is null" then I'll get duplicate data. What is the best approach to overcome this
Thanks in Advance..........
neo.helios- Posts : 11
Join date : 2010-11-02
Re: SCD 2 scenario
How would you get duplicate data? Are you reinserting fact rows with the new dimension key?
Using NULL as an expiration date is a bad idea. It makes queries more complicated than they need to be. Common practice is to use a distant future date, such as 1/1/3000. This way you can use BETWEEN to locate a particular row based on date.
When you join a fact to a type 2 dimension, you join on the key only. If you need a particular version of the dimension row based on date, you perform a self join on the dimension using the natural key and a predicate on the date.
Joining from a fact to a dimension on the key should never give you duplicate data unless you are doing something wrong in the load.
Using NULL as an expiration date is a bad idea. It makes queries more complicated than they need to be. Common practice is to use a distant future date, such as 1/1/3000. This way you can use BETWEEN to locate a particular row based on date.
When you join a fact to a type 2 dimension, you join on the key only. If you need a particular version of the dimension row based on date, you perform a self join on the dimension using the natural key and a predicate on the date.
Joining from a fact to a dimension on the key should never give you duplicate data unless you are doing something wrong in the load.
Re: SCD 2 scenario
Thanks for the reply. I understand the way to join the fact and dimension to some extent. Would be great if you can detail it out.
I'll try to ask my question with a scenario :
I have a record in the customer dimension for a customer by name "John Werner". The structure of the dimension would be
The new dimension key 20 is inserted when the customer city changed
I have fact table with the structure
If I have to get the sum of amount by customer then If I write a query
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null
then for John Werner I'll get the data like
This is what I was referring to when I said duplicate data. I might be missing something here... Can you please explain
Regards
I'll try to ask my question with a scenario :
I have a record in the customer dimension for a customer by name "John Werner". The structure of the dimension would be
Customer_Key | Customer_Name | Customer_City | Effective_Date | Expiry_Date |
10 | John Werner | NJ | 01-Jan-2011 | 30-Jun-2011 |
20 | John Werner | CA | 30-Jun-2011 | Null |
I have fact table with the structure
Customer Key | Date | Product | Amount |
10 | 15-Jan-2011 | XYZ | 100 |
10 | 12-Mar-2011 | ABC | 500 |
20 | 11-Jul-2011 | XYZ | 300 |
20 | 01-Aug-2011 | ABC | 200 |
If I have to get the sum of amount by customer then If I write a query
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null
then for John Werner I'll get the data like
John Werner | XYZ | 100 |
John Werner | XYZ | 300 |
John Werner | ABC | 500 |
John Werner | ABC | 200 |
This is what I was referring to when I said duplicate data. I might be missing something here... Can you please explain
Regards
neo.helios- Posts : 11
Join date : 2010-11-02
Re: SCD 2 scenario
If your query is
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null
you will only get back
John Werner XYZ 300
John Werner ABC 200
If you query
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
John Werner XYZ 100
John Werner XYZ 300
John Werner ABC 500
John Werner ABC 200
But what is duplicated??? Other than the fact that your query is missing a GROUP BY clause (I don't see how you could actually run the query as is, every DB I know would give you a syntax error), the measures are correct.
With a group by the result should be:
John Werner XYZ 400
John Werner ABC 700
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null
you will only get back
John Werner XYZ 300
John Werner ABC 200
If you query
Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
John Werner XYZ 100
John Werner XYZ 300
John Werner ABC 500
John Werner ABC 200
But what is duplicated??? Other than the fact that your query is missing a GROUP BY clause (I don't see how you could actually run the query as is, every DB I know would give you a syntax error), the measures are correct.
With a group by the result should be:
John Werner XYZ 400
John Werner ABC 700
Re: SCD 2 scenario
I guess I see where your problem is and you may want to see this:
John Werner 1100
The key is, you always should group by the natural key, Customer_ID if you have one, or whatever determines the unique customer in the source system, similar to this:
Select dim.customer_name, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
Group by customer_name -- suppose Customer_name is the NK.
However, if you want to see sales by customer and product, you then need to put back product in the select and group by clauses, so you would end up with only two rows as ngalemmo suggested.
John Werner 1100
The key is, you always should group by the natural key, Customer_ID if you have one, or whatever determines the unique customer in the source system, similar to this:
Select dim.customer_name, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
Group by customer_name -- suppose Customer_name is the NK.
However, if you want to see sales by customer and product, you then need to put back product in the select and group by clauses, so you would end up with only two rows as ngalemmo suggested.
Last edited by hang on Tue Aug 09, 2011 8:52 pm; edited 1 time in total
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: SCD 2 scenario
I'm pretty sure the where clause "and dim.expiry_date is null" will result in
John Werner 500
As ngallemo mentioned, the effective dates on the dimension are primarily used for ETL purposes (choosing the correct SK), not when generating queries.
John Werner 500
As ngallemo mentioned, the effective dates on the dimension are primarily used for ETL purposes (choosing the correct SK), not when generating queries.
LAndrews- Posts : 132
Join date : 2010-05-13
Location : British Columbia, Canada

» Design scenario...
» How would you model this simple scenario ?
» Designing Sales Promotion for Packed Products
» Question - creating a dimensional model for facility management
» Degenerated dimension for my scenario ?
» How would you model this simple scenario ?
» Designing Sales Promotion for Packed Products
» Question - creating a dimensional model for facility management
» Degenerated dimension for my scenario ?
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|