How to Handle Updates to Dimension Attributes (SCD 1) that have a Cardinality > 1 (Context is Data Warehouse Star Schema)
How to Handle Updates to Dimension Attributes (SCD 1) that have a Cardinality > 1 (Context is Data Warehouse Star Schema)
Hi,
In our data warehouse we use a star schema, and we have a daily ETL process that loads dimension and fact data. When loading dimension data, a new key is inserted; for an existing key, an update occurs if any of its attributes have changed. This works fine.
However, occasionally the changed attribute is one with a cardinality greater than 1 in the dimension table; that is to say, other keys in the table share the same attribute (as identified by an attribute business key) and will also need updating. If those keys did not appear in the daily load, the ETL process will not have updated them.
It's important that they are updated to maintain attribute hierarchy relationships within the dimension. We currently ensure this by running a stored procedure at the end of the ETL process that ranks the duplicate attribute values by attribute key and updates the older with the newer value.
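For illustration, the clean-up step does something along these lines - a simplified sketch in SQL Server-style syntax, where the table and column names are made up rather than our real objects, and the newest value per business key is assumed to sit on the highest surrogate key:

-- dim_example: surrogate key (dim_key), attribute business key (attr_bk),
-- attribute value (attr_value); names are illustrative only.
WITH latest AS (
    SELECT attr_bk,
           attr_value,
           ROW_NUMBER() OVER (PARTITION BY attr_bk ORDER BY dim_key DESC) AS rn
    FROM   dim_example
)
UPDATE d
SET    d.attr_value = l.attr_value
FROM   dim_example AS d
JOIN   latest      AS l
  ON   l.attr_bk = d.attr_bk
 AND   l.rn = 1
WHERE  d.attr_value <> l.attr_value;  -- only rewrite rows holding the stale value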
However, this feels messy and can sometimes reduce transparency with regard to our auditing framework.
Any views or suggestions greatly appreciated.
Thanks in advance,
Adam
adamcunnington- Posts : 2
Join date : 2015-01-23
Re: How to Handle Updates to Dimension Attributes (SCD 1) that have a Cardinality > 1 (Context is Data Warehouse Star Schema)
Hi - I am not entirely sure what you mean by "attribute business key". Do you mean something like this scenario:
In your source system you might have a Customer table with a Customer Status column; that column contains a code, and a separate reference data table holds the code and its description. In your (denormalised) Customer Dim you have the status code and status description, but when the description changes in the source system's reference table the change is not reflected in the Customer Dim - because no change is made to the customer record itself in the source system.
If this is the situation you are talking about, then the solution is to enhance your Change Data Capture (CDC) and ETL processes to identify when these reference data records change, work out which records in other tables are impacted by those changes, and ensure those records are extracted to your DW.
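For example, the extract might pick up the impacted customers with something along these lines (purely illustrative, SQL Server-style syntax; the table, column and variable names are assumptions, not a real system):

-- Find customers whose status description changed since the last load, even
-- though the customer rows themselves have not changed, so they can be
-- re-extracted into the warehouse.
DECLARE @last_extract_time datetime = '2015-01-22';  -- CDC watermark, illustrative value

SELECT c.customer_id
FROM   src_customer        AS c
JOIN   src_customer_status AS s
  ON   s.status_code = c.status_code
WHERE  s.last_updated > @last_extract_time;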
If I have misunderstood what you are talking about then please could you clarify? Thanks
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: How to Handle Updates to Dimension Attributes (SCD 1) that have a Cardinality > 1 (Context is Data Warehouse Star Schema)
Hi Nick,
Thanks for answering so promptly. I think you may have misunderstood though.
So, I have an Ad table where each row is an ad, and each ad is part of a campaign. As I'm using a star schema, this is denormalised and represented in a single dimension table, i.e. the table has ad description, campaign ID (a business key, not a surrogate key), campaign, other data, etc.
When my ETL process runs, it ingests an ad dimension file. This file contains dimension data only for ads that ran yesterday. New ads are inserted, and existing ads are updated if any of their attributes has changed. So if a campaign name changes, the ads in the file will have their campaign attribute updated. However, other ads already in the dimension table that didn't run yesterday but share the same campaign ID also need their campaign attribute updated. The ETL process doesn't account for this directly because those ads were never ingested.
To reiterate my first post, I currently handle this with a stored procedure but it doesn't feel like best practice and this feels like a reasonably common issue?
I'm posting this in the architecture forum rather than ETL, as I feel my issue may be symptomatic of a need for a slight design change. On the surface, this seems like the perfect use case for partial normalisation through a snowflake schema. However, the cons of a snowflake architecture probably exceed the pros in my context, particularly the cost of joins in a large database and the slight obfuscation of attribute relationships to the user.
Thanks,
Adam
adamcunnington- Posts : 2
Join date : 2015-01-23
Re: How to Handle Updates to Dimension Attributes (SCD 1) that have a Cardinality > 1 (Context is Data Warehouse Star Schema)
Hi - OK, thanks, I understand now.
This is a common situation and is usually caused by people designing their ETL processes based on their sources and not their targets; it's much better to start with the target and work backwards to design your ETL - at least as far as getting the structure right, planning your CDC strategy, etc.
So for your example, you have a Dimension that contains Ad and Campaign attributes. If any of these attributes change in your source system then you need to update the rows in your Dimension - therefore your CDC triggers need to look at all sources of data for this Dimension (Ads and Campaigns), whereas it sounds like you are only looking at your Ads source.
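As a rough illustration, the propagation step might look something like this once the changed Campaign rows have been staged (a sketch only - the staging table and column names are assumptions about your model, not something I know from your system):

-- Apply campaign-level changes, captured by CDC on the Campaign source, to
-- every ad row in the dimension that shares the campaign business key.
UPDATE d
SET    d.campaign = c.campaign_name
FROM   dim_ad               AS d
JOIN   stg_campaign_changes AS c
  ON   c.campaign_id = d.campaign_id
WHERE  d.campaign <> c.campaign_name;  -- only touch rows that are out of date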
Obviously if your Campaign changes then you probably have to update a lot of rows in your Dimension - but that is just a consequence of having modelled Ads and Campaigns as a single Dimension. As with all design decisions there are pros and cons: having a single Dim may make your reporting faster/easier but makes your ETL more complex; having separate Ad and Campaign Dims might make your ETL easier but makes your model more complex and your reporting slower. You just have to decide where the cost/benefit "sweet spot" lies in your particular circumstances.
My personal approach is to design the Dimensional model in the best way to support the report requirements and not worry about the complexity of the ETL, as I feel complexity is much better dealt with in the ETL than in the model - but then I have a background in Informatica, so I like solving complex ETL challenges!
Hope this helps?
nick_white- Posts : 364
Join date : 2014-01-06
Location : London