How to handle a change in the child dimension in snow-flake schema?

thomaszhwang Mon Aug 22, 2011 11:47 am

I have a Project dimension table and a Project Manager dimension table. Each Project has one Project Manager.

The Project Manager table has a Project_Manager_Key (the surrogate key), a Project_Manager_ID (the business key), a Project_Manager_Location (to indicate the location of this Project Manager) field, a Effective_Start field (to indicate the effective start date of this record) and a Effective_End field (to indicate the effective end date of this record; NULL if the record is still effective).

The Project_Manager_Location filed is using a Type 2 Slowly Changing Dimension, so whenever the location of a Project Manager is changed, a new record will be created in the Project Manager table with a new Project_Manager_Key, the same Project_Manager_ID, the new location, current date/time for the Effective_Start field and NULL for the Effective_End field. At the same time, the old Project Manager record will be marked ineffective by adding the current date/time for the Effective_End field.

Obviously, the Project_Manager_Key in the Project table will be outdated. In this case, do we need to update the Project table for any effective records in the Project table?

I suppose we should. If so, how should we actually do that? I meant, do we need to do an update on the Project table or create a new record with the new information for the same project in the Project table, supposed the Project Manager field in the Project employs a Type 2 Slowly Changing Dimension as well to track the replacement of Project Manager if it happens?

Thanks.

ngalemmo Mon Aug 22, 2011 12:18 pm

You are highlighting some of the other reasons why snowflaking is not a good idea.

If you now implement a type 2 on project, what does that do to the facts? Do you really want Project to be a type 2 or are you only considering it because of the snowflake?

How about getting rid of the snowflake?

thomaszhwang Mon Aug 22, 2011 12:25 pm

Now I'm a little bit confusing. Are you saying we shouldn't use SCD Type 2 on the PM_Key field in the Project table? I'm doing so is because I want to track the replacement of Project Manager to a Project if this happens. Do you think I should use a separate fact table to track this?

Thanks.

ngalemmo Mon Aug 22, 2011 12:42 pm

No, I am not saying that. But I think you can appreciate the complexities that snowflaked type 2 dimensions can introduce.

Its easy enough to get rid of the snowflake. Placing an FK to PM in fact tables is one way. Replicating PM information on the Project dim is another.

Having another fact table to track project history may or may not be needed. Depends on your requirements.

thomaszhwang Mon Aug 22, 2011 1:04 pm

Can I say SCD2 is only needed when I want to track changes of a specific field and the change of this field needs to be reflected in the aggregation of fact? If I want to track changes of a specific field, but I don't want the change to be reflected in the fact table, I should create a separate fact table to track the changes and should not employ SCD2 on that field?

ngalemmo Mon Aug 22, 2011 1:20 pm

thomaszhwang wrote:but I don't want the change to be reflected in the fact table

I don't understand what you mean by this. There are specific query patters that allow you to obtain current information from a type 2 SCD. So dimensional changes (i.e. new type 2 rows) are not an issue with fact tables. You can either obtain point in time or current dimensional information from a type 2 dimension.

Type 2 FKs do not become 'outdated', they represent what they are supposed to represent... the state of the dimension at the time of the event (fact). The problem with snowflaking is a dimension doesn't represent an event, so the 'point in time' nature of a type 2 dimension has little value in the context of a snowflake. So, in the case of a snowflake you start thinking about updating FK's referencing type 2 dimensions. You would never do this in a fact table.

thomaszhwang Mon Aug 22, 2011 2:43 pm

OK. I thought again and realized my previous statement is not right.

So in this case, how should I handle the type 2 SCD in snowflake?

Thanks.

ngalemmo Mon Aug 22, 2011 2:53 pm

I wouldn't try to update FKs. I would handle it in queries (self join on natural key for current row) just like a normal fact fk reference.

thomaszhwang Mon Aug 22, 2011 3:03 pm

OK I see. Let me try. Thanks.

thomaszhwang Mon Aug 22, 2011 3:15 pm

Are you saying we should use the business key instead of the surrogate key to connect two dimension tables?

ngalemmo Mon Aug 22, 2011 3:38 pm

No.

Assuming your type 2 SCD has a natural key column and a current flag column, the way to get the current row is to do a self join on the dimension, as in...

SELECT ...
FROM FACT F, DIM T2, DIM CUR
WHERE F.DIM_KEY = T2.DIM_KEY
AND T2.DIM_NK = CUR.DIM_NK
AND CUR.CURRENT_FLAG = 'Y'

Alias CUR would reference the current version row of the dimension while alias T2 references the point in time row.

thomaszhwang Mon Aug 22, 2011 3:46 pm

Now I'm totally confused. I'm sorry.

When I said "connect two dimension tables", I meant if I have a Project dimension table and a Client dimension table and there is a one-to-many relationship between the two tables (one Client has many Projects), in the Project dimension table, I should have the Client_ID field (the natural key) instead of a Client_Key field (the surrogate key). Is this correct?

Also I don't understand your SQL. I meant I understand it literally, but I don't understand why you use it this way?

Thanks for the patience.

ngalemmo Mon Aug 22, 2011 3:54 pm

First, you shouldn't be 'connecting two dimension tables'. Its snowflaking. Not a good thing to do.

Second, you should not update FKs all because a new version of a row is inserted into a type 2 dimension.

The SQL shows how to locate the current row in a type 2 dimension. Given a foreign key that could reference any version of a type 2 entry, you locate the current version of the entry by self-joining the dimension table to itself using the natural key.

This technique works for fact -> dimension lookups as well as dimension -> dimension lookups.

So, your original post wondering what to do if the PM's location changes, the answer is, do nothing. Let the query resolve it by locating the current version of the PM's row.

thomaszhwang Mon Aug 22, 2011 4:12 pm

The reason why I want to connect two dimension tables is because:

I have Fact_A connects to Dimension_a and Fact_B connects to Dimension_b. And there is a one-to-many relationship between Dimension_a and Dimension_b. If I denormalize Dimension_b into Dimension_a, Fact_B would have no dimension table to connect to. However If I don't denormalize Dimension_b, I have to connect Dimension_a and Dimension_b if I want to do drill-up to the Dimension_b level. Am I correct?

Thanks.

ngalemmo Mon Aug 22, 2011 5:10 pm

Not sure what dim_a and dim_b are referring to, but, if I had a fact table that had Project as one of the dimensions, I would also make sure Project Manager was a dimension as well or I would include PM data in the Project dimension. I would not snowflake PM off of Project.

thomaszhwang Mon Aug 22, 2011 5:13 pm

OK I see. Thanks.

hayrabedian Wed Nov 14, 2012 8:38 am

ngalemmo wrote:No, I am not saying that. But I think you can appreciate the complexities that snowflaked type 2 dimensions can introduce.

Its easy enough to get rid of the snowflake. Placing an FK to PM in fact tables is one way. Replicating PM information on the Project dim is another.

Having another fact table to track project history may or may not be needed. Depends on your requirements.

@ngalemmo, in my personal opionion, getting rid of the snowflake has its own issues:

1) Placing an FK to PM in fact tables
-- if we "resolve" the snowflake to a star schema, then the fact table could became artificially "wide", which will negatively inpact the query performance.
-- lets say, our business requirements demand that we have to store the exact timestamp of the project manager's change. What if we have a change in some of the PM's attributes, but there is no related fact row for this particular day? Shell we create additional table just to store the history? If so, we are going to introduce an additiional table, and an additional join in the queries.

2) Replicating PM information on the Project dim
-- what if we have several "parent" dimensions (like the project manager's one). All of their attributes shell be moved to the project dimension, which will make it "wide". For a monster dimensions it could be a performance worsening factor.
-- for me, it is easier to have a single project_manager dimension key to be tracked as SCD2, than all of the PM's attributes. Doesn't it?

Sponsored content

How to handle a change in the child dimension in snow-flake schema?

How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?

Re: How to handle a change in the child dimension in snow-flake schema?