Dimension Record Comes and Gos
2 posters
Page 1 of 1
Dimension Record Comes and Gos
I am an experienced DBA in the OLTP world. I need to create a data warehouse for our population statistics. I've done some reading on the Kimball fact/dimension method and it seems very strong and complete so I'm trying to follow it but my knowledge is very limited.
I have a fact table for births:
Birth year and month
Birth county
Child's gender
Child's race (race_id)
Mother's race (race_id)
Child Hispanic?
Mother Hispanic?
Age of mother
Number of births (for this year, month, county, race, etc.)
I have a dimension table for race:
Race_id
Race description
Start date
End date
How we are tracking race has changed over time:
1970-1979: White, Hispanic, Asian & Pacific Islander, Black, American Indian
1990-1999: Non-Hispanic White, Non-Hispanic Asian/Pacific Islander, Non-Hispanic Black, Non-Hispanic Indian, Hispanic White, Hispanic Asian/Pacific Islander, Hispanic Black, Hispanic Indian
2000 – present: White, Hispanic, Asian, Pacific Islander, Black, American Indian, MultiRace
I was planning on using the Type 2 Dimension to track the changes. The problem is that White, Hispanic, Black, and American Indian are good for two periods of time, 1970 through 1979, and again 2000 to present. It seems that there should be only one record per race. But with a single start date and end date, that is not possible. Should I add multiple start/end dates? Store multiple dimension records for each race? With multiple dimension records, the computer won't know that White in 1970 is the same as White in 2000.
I have a fact table for births:
Birth year and month
Birth county
Child's gender
Child's race (race_id)
Mother's race (race_id)
Child Hispanic?
Mother Hispanic?
Age of mother
Number of births (for this year, month, county, race, etc.)
I have a dimension table for race:
Race_id
Race description
Start date
End date
How we are tracking race has changed over time:
1970-1979: White, Hispanic, Asian & Pacific Islander, Black, American Indian
1990-1999: Non-Hispanic White, Non-Hispanic Asian/Pacific Islander, Non-Hispanic Black, Non-Hispanic Indian, Hispanic White, Hispanic Asian/Pacific Islander, Hispanic Black, Hispanic Indian
2000 – present: White, Hispanic, Asian, Pacific Islander, Black, American Indian, MultiRace
I was planning on using the Type 2 Dimension to track the changes. The problem is that White, Hispanic, Black, and American Indian are good for two periods of time, 1970 through 1979, and again 2000 to present. It seems that there should be only one record per race. But with a single start date and end date, that is not possible. Should I add multiple start/end dates? Store multiple dimension records for each race? With multiple dimension records, the computer won't know that White in 1970 is the same as White in 2000.
ngarris- Posts : 4
Join date : 2009-06-17
Re: Dimension Record Comes and Gos
There are a few things to consider.
As far as the race dimension goes, I would use a type 1 and have one row for each 'as declared' race. As an attribute of the dimension, there would be an 'as reported' race value. This latter attribute would be maintained in some manner (manually is probably fine) to reflect how a particular race should be reported. When the government decides to change their reporting requirements again, it would be a simple matter of updating this 'as reported' value in the dimension table. If you need to report race the old way, just use the 'as declared' value. You don't need a type 2.
I would also remove the hispanic flags from the fact table an replace it with a hispanic flag in the race dimension. Those races that are consider hispanic would have the flag set to true. As the defintion of hispanic changes, you need only change the dimension table... not the facts.
Another thing I would do is put county in its own dimension. You want to avoid storing text fields in a fact table. It only serves to make the table much larger than it needs to be which impacts query performance. Same goes for gender.
As far as the race dimension goes, I would use a type 1 and have one row for each 'as declared' race. As an attribute of the dimension, there would be an 'as reported' race value. This latter attribute would be maintained in some manner (manually is probably fine) to reflect how a particular race should be reported. When the government decides to change their reporting requirements again, it would be a simple matter of updating this 'as reported' value in the dimension table. If you need to report race the old way, just use the 'as declared' value. You don't need a type 2.
I would also remove the hispanic flags from the fact table an replace it with a hispanic flag in the race dimension. Those races that are consider hispanic would have the flag set to true. As the defintion of hispanic changes, you need only change the dimension table... not the facts.
Another thing I would do is put county in its own dimension. You want to avoid storing text fields in a fact table. It only serves to make the table much larger than it needs to be which impacts query performance. Same goes for gender.
Re: Dimension Record Comes and Gos
Yes, "birth county" is actually county_id, a foreign key to the County dimension table.
I see your point about storing the original race designations as stored on the birth certificate. Unfortunately we have many years of historical data. The Legislature specifies the race choices on the birth certificate and also specifies a (different) set of race designations for our population projections. Furthermore, they have changed these specifications over the years. We have many years of birth statistics, each year stored by the population projection race specification set by the Legislature for that year. The users would like to store the data according to these race specifications.
(For more detail see my previous note.) From 1970 to 1979, population projection race designations were the same as today, except that MultiRace was missing. However from 1990-1999, the Legislature set an entirely different set of race designations with each race divided into Hispanic and non-Hispanic. So races are valid from 1970 to 1979, invalid from 1990 to 1999, and then valid again in 2000.
I see your point about storing the original race designations as stored on the birth certificate. Unfortunately we have many years of historical data. The Legislature specifies the race choices on the birth certificate and also specifies a (different) set of race designations for our population projections. Furthermore, they have changed these specifications over the years. We have many years of birth statistics, each year stored by the population projection race specification set by the Legislature for that year. The users would like to store the data according to these race specifications.
(For more detail see my previous note.) From 1970 to 1979, population projection race designations were the same as today, except that MultiRace was missing. However from 1990-1999, the Legislature set an entirely different set of race designations with each race divided into Hispanic and non-Hispanic. So races are valid from 1970 to 1979, invalid from 1990 to 1999, and then valid again in 2000.
ngarris- Posts : 4
Join date : 2009-06-17
Similar topics
» Replicate Fact record because Dimension has changed
» Multiple Facts Mapped to Single Dimension Record
» When is a record "current"?
» Planning vs Operational System - Dimension Source
» Summary Record
» Multiple Facts Mapped to Single Dimension Record
» When is a record "current"?
» Planning vs Operational System - Dimension Source
» Summary Record
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum