Fact dates before begin date of Dimension
3 posters
Page 1 of 1
Fact dates before begin date of Dimension
Hi all,
For our Type 2 SCD, we have the begin_date and end_date named as "relevance_begin_date" and "relevance_end_date". (I fought a good fight to make these the names of those fields).
We are currently setting the original relevance_begin_date to a particular date in the source based on our knowledge of the source. For example, dim_employees is set to (original_hire_date - 30). The etl problem is that some of the facts are coming in with dates before the original relevance therefore not landing because the src.ALTERNATE_KEY is found but the fact.DATE BETWEEN RELEVANCE does not find a relevant record. The reporting problem is that this fact happened, it was accepted by a system. Do we hold off the fact and find the fix to the dimension, or do we do runtime surgery on the original dimensional record?
Here's some of my ideas on runtime surgery.
1) Amend the original dimensional record's relevance to be before the fact.
2) Allow the fact in even with this anamoly, but have a quality dimensional key call that anamoly out.
3) Put this fact table into an error table
4) Add another dimensional record that is before the original dimensional record to support this fact.
For our Type 2 SCD, we have the begin_date and end_date named as "relevance_begin_date" and "relevance_end_date". (I fought a good fight to make these the names of those fields).
We are currently setting the original relevance_begin_date to a particular date in the source based on our knowledge of the source. For example, dim_employees is set to (original_hire_date - 30). The etl problem is that some of the facts are coming in with dates before the original relevance therefore not landing because the src.ALTERNATE_KEY is found but the fact.DATE BETWEEN RELEVANCE does not find a relevant record. The reporting problem is that this fact happened, it was accepted by a system. Do we hold off the fact and find the fix to the dimension, or do we do runtime surgery on the original dimensional record?
Here's some of my ideas on runtime surgery.
1) Amend the original dimensional record's relevance to be before the fact.
2) Allow the fact in even with this anamoly, but have a quality dimensional key call that anamoly out.
3) Put this fact table into an error table
4) Add another dimensional record that is before the original dimensional record to support this fact.
Galen Boyer- Posts : 7
Join date : 2011-03-04
Re: Fact dates before begin date of Dimension
If you have bad dates in the system, its a problem that needs addressing, but it shouldn't mess up referencing dimensions. Sure you find the low dates, but what about dates that are wrong but fit into the effectivity period of a row?
When I implement effective dates on dimensions, the first row for a NK usually has an early date, such as 1/1/1900 with the realization that even if dates are bad, doing anything else won't solve the date problem.
When I implement effective dates on dimensions, the first row for a NK usually has an early date, such as 1/1/1900 with the realization that even if dates are bad, doing anything else won't solve the date problem.
Re: Fact dates before begin date of Dimension
Thanks!
I've recommended we do pretty much what you are saying, but I've recommended that we actually create a dimensional record that is for this case, per source key. Whenever the initial dimensional record is created, have a dimensional record created that spans the time period of 1900 to the start date of that record. The reason for this is because it would make an easy "after the fact" query for this problem. SELECT ALL FACTS WHERE THEIR KEYS ARE IN THE PRE-RELEVANCE-BUCKET.
The downside would be the dimension would be doubled in # of records initially. The upside is straightforwardness of this question across all facts. Also, the UI would show these, clearly error, attributes and therefore the EDW team wouldn't find it easy to leave them uncorrected. Those records would always demand attention when facts are tied to them.
The upside of your approach is 2-fold. 1) No doubling, 2) An already designed in way of identifying the "original" dimension. The downside is that this date problem we would need to constantly be vigilant about solving is now a different question per dimension.
I've recommended we do pretty much what you are saying, but I've recommended that we actually create a dimensional record that is for this case, per source key. Whenever the initial dimensional record is created, have a dimensional record created that spans the time period of 1900 to the start date of that record. The reason for this is because it would make an easy "after the fact" query for this problem. SELECT ALL FACTS WHERE THEIR KEYS ARE IN THE PRE-RELEVANCE-BUCKET.
The downside would be the dimension would be doubled in # of records initially. The upside is straightforwardness of this question across all facts. Also, the UI would show these, clearly error, attributes and therefore the EDW team wouldn't find it easy to leave them uncorrected. Those records would always demand attention when facts are tied to them.
The upside of your approach is 2-fold. 1) No doubling, 2) An already designed in way of identifying the "original" dimension. The downside is that this date problem we would need to constantly be vigilant about solving is now a different question per dimension.
Galen Boyer- Posts : 7
Join date : 2011-03-04
Re: Fact dates before begin date of Dimension
Don't mix up SCD date with business date attribute (original_hire_date). The SCD dates are updated by the system to keep track of SCD changes, while business attributes come from source and will be the source of the changes. For the fact SK lookup, use SCD dates, and as ngalemmo suggested, make the SCD start date early enough to cover all the fact references. For business constraints use dimension attributes.
In case of the late arriving dimension in the fact, add a new dimension record that has only SK and NK with other attributes set to null in your dimension, and let ETL to populate the null attributes in due course when the true dimension details arrive.
In case of the late arriving dimension in the fact, add a new dimension record that has only SK and NK with other attributes set to null in your dimension, and let ETL to populate the null attributes in due course when the true dimension details arrive.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Fact dates before begin date of Dimension
Thanks. Our actual SCD dimensions are young enough and the number of them are small enough that we can change to that "philosophy".
What I don't like is you say "In case of the late arriving dimension in the fact, add a new dimension record that has only SK and NK with other attributes set to null in your dimension, and let ETL to populate the null attributes in due course when the true dimension details arrive." Shouldn't the dimensional attributes all be NON NULL? Therefore, all attributes of this bucket record would be something like "NOT HERE YET"?
What I don't like is you say "In case of the late arriving dimension in the fact, add a new dimension record that has only SK and NK with other attributes set to null in your dimension, and let ETL to populate the null attributes in due course when the true dimension details arrive." Shouldn't the dimensional attributes all be NON NULL? Therefore, all attributes of this bucket record would be something like "NOT HERE YET"?
Galen Boyer- Posts : 7
Join date : 2011-03-04
Re: Fact dates before begin date of Dimension
Why? What if you have a date or numeric attribute. Don't get confused between non-existence and late arriving dimension. The former is about having an unknown dimension record to bucket all NULL valued dimension references in the fact to ensure referental integrity, while the latter is about dimensions that appear in the fact, but not loaded into dimension table yet.Galen Boyer wrote:Shouldn't the dimensional attributes all be NON NULL?
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Fact dates before begin date of Dimension
Its stated that way on this website. http://www.ralphkimball.com/html/10dt/DT128SelectingDefaultValuesNulls.pdf.
SECTION -> Handling Null Attribute Values in Dimension Tables
Nulls should also be avoided when we can’t provide a value for a dimension attribute in a valid dimension row.
SECTION -> Handling Null Attribute Values in Dimension Tables
Nulls should also be avoided when we can’t provide a value for a dimension attribute in a valid dimension row.
Galen Boyer- Posts : 7
Join date : 2011-03-04
Re: Fact dates before begin date of Dimension
The file is not accessible and I am really interested to see the justification.
But anyway, what would you put into a date and numeric attribute that is originally null in the source. Without seeing a convincing argument, I still believe, in many cases, null value has its place in dimension attributes. Especially for monster dimensions, any sparsely populated textual attribute would incur tremendous unnecessary waste of storage, compromising all the savings on the fact tables.
But anyway, what would you put into a date and numeric attribute that is originally null in the source. Without seeing a convincing argument, I still believe, in many cases, null value has its place in dimension attributes. Especially for monster dimensions, any sparsely populated textual attribute would incur tremendous unnecessary waste of storage, compromising all the savings on the fact tables.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Fact dates before begin date of Dimension
I put a period after the URL. Try the following.
http://www.ralphkimball.com/html/10dt/DT128SelectingDefaultValuesNulls.pdf
Its also stated in the Data Warehouse Design Solutions (foreward by Kimball) as the "Null avoidance principle", pg 73.
Here is what is said.
http://www.ralphkimball.com/html/10dt/DT128SelectingDefaultValuesNulls.pdf
Its also stated in the Data Warehouse Design Solutions (foreward by Kimball) as the "Null avoidance principle", pg 73.
Here is what is said.
Handling Null Attribute Values in Dimension Tables
Nulls should also be avoided when we can’t provide a value for a dimension attribute in a valid dimension row.
There are a several reasons why the value of a dimension attribute may not be available:
Missing Value – The attribute was missing from the source data.
Not Happened Yet – The attribute is not yet available due to source system timing issues.
Domain Violation – Either we have a data quality issue, or we don’t understand all the business
rules surrounding the attribute. The data provided by the source system is invalid for the column
type or outside the list of valid domain values.
Not Applicable – The attribute is not valid for the dimension row in question.
Text attributes in dimension tables usually can contain the actual values that describe the null
conditions. Try to keep in mind the effect on BI tools downstream that have to display your special null
value description in a fixed format report. Avoid tricks we’ve seen, such as populating the default
attributes with a space or meaningless string of symbols like @@@ as these only confuse the business
users. Consider the default values for each dimension attribute carefully and provide as much meaning
as possible to provide context to the business users.
Numeric attributes in dimension tables will need to have a set of special values. A value of zero often is
the best choice because it is usually obvious to the users that it is artificial. Some numeric attributes will
present you with a difficult choice if the business users combine these values in numeric computations.
Any actual numeric value used to stand in for null (say, zero) will participate in the computation but give
misleading results. An actual null value often will cause an error in a computation, which is annoying but
at least does not produce a falsely confident result. Perhaps you can program your BI tool to display null
numeric dimension attributes with “null” so that you can both report and compute on these attributes
without worrying about distorted data.
Finally, these default value choices should be re-used to describe common null conditions across
business processes and dimension tables in your dimensional data warehouse.
Galen Boyer- Posts : 7
Join date : 2011-03-04
Re: Fact dates before begin date of Dimension
Galen, thanks for the reference. Although I agree with most of the reasons why attributes should not be left as null, I still think it should be a guideline rather than a rule. There would be some exceptions, like the case of monster dimension in my previous post, or even late arriving dimensions where a dogmatic rule could cause a lot of extra unnecessary work by ignoring a simple effective solution.
Date attribute in dimension is another case that has not been directly addressed by the article. A simply null value in place of date attribute would suffice for most business requirements, instead of a confusing extreme early date or future date, or even snowflakeing your dimension by date dimension, which could be annoying when dealing with many dates.
Just for the sake of argument, sometimes a null value cannot always be replaced by a ‘0’ value for a numeric attribute. A typical SQL aggregate function AVG would most likely work properly with null values. Again some zeroed null values may work better for other aggregate functions.
All I am saying is, let's not use rules to dictate the design, even with dimensional modeling at its core fundamentals. It's a general guideline to denormalise dimensions instead of snowflaking them, but...
Here’s another relevant post with different point of view.
http://forum.kimballgroup.com/t707-null-values-best-practice-digest
Date attribute in dimension is another case that has not been directly addressed by the article. A simply null value in place of date attribute would suffice for most business requirements, instead of a confusing extreme early date or future date, or even snowflakeing your dimension by date dimension, which could be annoying when dealing with many dates.
Just for the sake of argument, sometimes a null value cannot always be replaced by a ‘0’ value for a numeric attribute. A typical SQL aggregate function AVG would most likely work properly with null values. Again some zeroed null values may work better for other aggregate functions.
All I am saying is, let's not use rules to dictate the design, even with dimensional modeling at its core fundamentals. It's a general guideline to denormalise dimensions instead of snowflaking them, but...
Here’s another relevant post with different point of view.
http://forum.kimballgroup.com/t707-null-values-best-practice-digest
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Fact dates before begin date of Dimension
I guess, what would be most important to me is something in the dimensional row beyond NULL that identified the record as having been created to support a fact vs a the row being created through the normal ETL processes. That's why the NOT NULL rule makes sense to me. Upon creating the row from the fact, set the descriptive attributes to something like, "NOT FROM SOURCE" or something along those lines. If its just NULL, then there is no information from it. Did the NULL come from the source or from the ETL code? On numbers, I can see how 0 or something like that would work, in that you aren't going to be storing facts on the dimensions, so you shouldn't need to do sql avg or median or such. Dates seems to be were the amount of work to make it happen might be more than what Kimball is trying to gain, especially if you follow his advice that the dates on the dimensions actually have FKs back to the date table. But, even there, all you need is a single date on the date dim that is outside the dates of the date_dim and stands for "NOT FROM SOURCE". Create that one date, and that's what all dates for all these records would be set to. So, I think following the advice is quite doable, and I'd rather follow the Kimball advice until I can't.
Galen Boyer- Posts : 7
Join date : 2011-03-04
Similar topics
» Too many dates on fact - Is there such a thing as Junk Date dimension
» Using the Date Dimension for Summary Fact Tables or new specialized Month Dimension?
» Date Dimension: Representing partial dates/Imputing date values
» Date dimension multiple dates
» Fact with eff / exp dates referencing dimension with eff / exp dates
» Using the Date Dimension for Summary Fact Tables or new specialized Month Dimension?
» Date Dimension: Representing partial dates/Imputing date values
» Date dimension multiple dates
» Fact with eff / exp dates referencing dimension with eff / exp dates
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum