SCD2 and partial information
3 posters
Page 1 of 1
SCD2 and partial information
Hi,
I’m trying to build a dimension from a database and a flat file.
Say I have the dimension consisting of network nodes.
My source data consist of:
- Node_id (business key)
- Node_label
- Node_type
There is only 2 attributes in this dimension and I want to enrich it by adding the following attributes provided in a flat file:
- DSLAM
- City
- And of course the Node_id (so I can join the data from the database with the flat file)
The resulting dimension would be:
- ID_DM (surrogate key)
- Node_Id (business key)
- Node_Label – SCD1
- Node_type – SCD1
- DSLAM – SCD2
- City – SCD2
So far so good.
The problem comes from the fact that there is 20 000 nodes and for each node the DSLAM and the City need to be entered manually in the flat file. So at the beginning the flat file will only contain partial information.
First flat file provided
Node_id | DSLAM | City
N1 |DSLAM1 | Chicago
N2 | DSLAM2 | Not Available
The concern that I have is with the 2nd line (notice the not available value in the city attribute)
After the first load of the dimension I will have something like this
Dm_NODE
ID_DM | Node_label | Node_Type | DSLAM | City effect_date | expiry_date
….
3 | N2 | Interface | DSLAM2 | Not available | 2014-01-01 | 9999-12-31
After the first load, the flat file is updated with the following information (the flat file always contain all the nodes, not just the delta).
New flat file:
Node_id | DSLAM | City
N1 | DSLAM1 | Chicago
N2 | DSLAM2 | Los Angeles
N3 | DSLAM3 | Houston
So basically in the second load and considering I’m using SCD2 for the City column I end up with :
DM_Node
ID_DM | Node_label | Node_Type | DSLAM | City effect_date | expiry_date
….
3 | N2 | Interface | DSLAM2 | Not available | 2014-01-01 | 2014-03-20
….
6 | N2 | Interface | DSLAM2 | Houston | 2014-03-20 | 9999-12-31
….
Which of course is incorrect because I wasn’t supposed to create a new line for the N2 node but only update the first row with the information made available. The problem is that the city attribute is an SCD2 attribute (because the city of a node can change say from Houston to Philadelphia) .
To solve this, I have added in the ETL process the following processing.
- During the scd check, if I have generated a new line for an already existing node, I check if in the old line the attribute city is set to no available, if it is the case, I discard the new line, and just update the old line (I change the scd2 with an scd1 behavior). If it is not the case I close the old line and insert the new line (scd2 behavior).
But In all honesty I don’t like this solution, because I find it to complicated (I always consider that if the solution is complicated for a simple problem, it is because I missed a more elegant solution).
I would be glad if someone more experienced could come with some insight.
Thank you for reading me.
I’m trying to build a dimension from a database and a flat file.
Say I have the dimension consisting of network nodes.
My source data consist of:
- Node_id (business key)
- Node_label
- Node_type
There is only 2 attributes in this dimension and I want to enrich it by adding the following attributes provided in a flat file:
- DSLAM
- City
- And of course the Node_id (so I can join the data from the database with the flat file)
The resulting dimension would be:
- ID_DM (surrogate key)
- Node_Id (business key)
- Node_Label – SCD1
- Node_type – SCD1
- DSLAM – SCD2
- City – SCD2
So far so good.
The problem comes from the fact that there is 20 000 nodes and for each node the DSLAM and the City need to be entered manually in the flat file. So at the beginning the flat file will only contain partial information.
First flat file provided
Node_id | DSLAM | City
N1 |DSLAM1 | Chicago
N2 | DSLAM2 | Not Available
The concern that I have is with the 2nd line (notice the not available value in the city attribute)
After the first load of the dimension I will have something like this
Dm_NODE
ID_DM | Node_label | Node_Type | DSLAM | City effect_date | expiry_date
….
3 | N2 | Interface | DSLAM2 | Not available | 2014-01-01 | 9999-12-31
After the first load, the flat file is updated with the following information (the flat file always contain all the nodes, not just the delta).
New flat file:
Node_id | DSLAM | City
N1 | DSLAM1 | Chicago
N2 | DSLAM2 | Los Angeles
N3 | DSLAM3 | Houston
So basically in the second load and considering I’m using SCD2 for the City column I end up with :
DM_Node
ID_DM | Node_label | Node_Type | DSLAM | City effect_date | expiry_date
….
3 | N2 | Interface | DSLAM2 | Not available | 2014-01-01 | 2014-03-20
….
6 | N2 | Interface | DSLAM2 | Houston | 2014-03-20 | 9999-12-31
….
Which of course is incorrect because I wasn’t supposed to create a new line for the N2 node but only update the first row with the information made available. The problem is that the city attribute is an SCD2 attribute (because the city of a node can change say from Houston to Philadelphia) .
To solve this, I have added in the ETL process the following processing.
- During the scd check, if I have generated a new line for an already existing node, I check if in the old line the attribute city is set to no available, if it is the case, I discard the new line, and just update the old line (I change the scd2 with an scd1 behavior). If it is not the case I close the old line and insert the new line (scd2 behavior).
But In all honesty I don’t like this solution, because I find it to complicated (I always consider that if the solution is complicated for a simple problem, it is because I missed a more elegant solution).
I would be glad if someone more experienced could come with some insight.
Thank you for reading me.
exhortae- Posts : 30
Join date : 2010-08-01
Re: SCD2 and partial information
The question is: what do you want to happen?
There is nothing wrong with special circumstances that require a deviation from the normal pattern, as long as you represent things the way you need to represent them.
You have two dependent attributes: DSLAM and City. You have been maintaining history for DSLAM. As there is a direct relationship between DSLAM and City, what's wrong with updating all rows in the dimension setting the City based on the DSLAM value?
There is nothing wrong with special circumstances that require a deviation from the normal pattern, as long as you represent things the way you need to represent them.
You have two dependent attributes: DSLAM and City. You have been maintaining history for DSLAM. As there is a direct relationship between DSLAM and City, what's wrong with updating all rows in the dimension setting the City based on the DSLAM value?
Re: SCD2 and partial information
ngalemmo wrote:The question is: what do you want to happen?
There is nothing wrong with special circumstances that require a deviation from the normal pattern, as long as you represent things the way you need to represent them.
You have two dependent attributes: DSLAM and City. You have been maintaining history for DSLAM. As there is a direct relationship between DSLAM and City, what's wrong with updating all rows in the dimension setting the City based on the DSLAM value?
Hi,
I'm sorry but I don't understand this part :
As there is a direct relationship between DSLAM and City, what's wrong with updating all rows in the dimension setting the City based on the DSLAM value?
could you please elaborate a little.
thank you
exhortae- Posts : 30
Join date : 2010-08-01
Re: SCD2 and partial information
If I understand your post, you want to treat city as a type 2 attribute except when its value is 'Not Available' in which case you want to treat it as a type 1. Works for me.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: SCD2 and partial information
BoxesAndLines wrote:If I understand your post, you want to treat city as a type 2 attribute except when its value is 'Not Available' in which case you want to treat it as a type 1. Works for me.
Yes that's what I want to do. Do you think the method I describe above to do it is OK ?
thanks
exhortae- Posts : 30
Join date : 2010-08-01
Similar topics
» Combination of SCD2, partial snowflake, Effective and end dates in fact table
» Partial vs Complete Dimensions
» Conformed Dimension Partial Set of Attributes
» Looking for suggestions on ETL tools for DW/BI Project
» information across different models
» Partial vs Complete Dimensions
» Conformed Dimension Partial Set of Attributes
» Looking for suggestions on ETL tools for DW/BI Project
» information across different models
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|