Location and population dim/fact

shanemcmurray Mon Jul 18, 2011 12:52 pm

Hi, I'm new to dimensional modeling and I’m trying to wrap my head around all this.

I'm trying to create a model that allows me to represent locations to populations.

I think my dimension tables looks like this:

Table: DimLocation

LocationId
Country
State
County
City
ZipCode

Table: DimYear

YearId
Year

I’m guessing that I should store a dimension record at the lowest level, say zip code?

And Fact

Table: FactPopulation

PopulationId
LocationId
YearId
TotalPopulation
TotalHouseholds

I stuck with representing each location type above zip codes. For example, I can easily see how I can represent a zip code, pretty straight forward… but what about at the State level, or County Level, etc.?

Thanks in advance for any direction or help.

BoxesAndLines Mon Jul 18, 2011 3:34 pm

State and city are just aggregations of zips. You join on location_id where state = 'TX'. That will give you all households and total population for Texas.

shanemcmurray Mon Jul 18, 2011 3:56 pm

Thanks for the input. so, store it at the zip level in the fact table and sum it up to get the state level?

like:

select sum(TotalPopulation)
from factpopulation a join dimlocation b on a.location_id = b.location_id
where b.state = 'tx'

I was thinking the data could be stored in the fact table at any level, like state or county, without rolling it up.

would this be an incorrect approach?

Store it at the lowest grain, then aggregate it up? is that a better approach?

Thanks.

BoxesAndLines Mon Jul 18, 2011 4:04 pm

You should always store at the lowest level needed and aggregate up.

VHF Mon Jul 18, 2011 4:16 pm

Your SQL query is correct. That's the real power of Dimensional Modeling--the ability to "slice and dice" using any combinations of dimension attributes. As already mentioned, you should always store your fact data at the lowest level (grain) available and aggregate up as needed.

For a large DW, you might alsocreate aggreate fact tables to support common queries, for example a fact table the stores population by state. But don't even think about this for a fact tables with <10M rows. As long as queries are acceptably fast (and most modern database engines are pretty fast at handling star-schema queries) there is no need for the addition of an aggregate table.

shanemcmurray Mon Jul 18, 2011 4:22 pm

VHF, if I were to create an aggreate fact table for say, state population totals, do you use the same dimension table? or create a seperate dim table to represent the state level (grain)?

VHF Mon Jul 18, 2011 7:04 pm

Either approach is possible.

You can create a separate State dimension. As long as the State attribute in the State dimension contains the same values as the State attribute in your Location dimension these would be said to be conformed dimensions. (In fact, you could create your State dimension by doing a distinct query on your Location dimension.)

Or you can create special "state" records in your Location dimension. These rows would represent an entire state and have either NULL, empty, or a special value such as 'NA' in the City and Zip fields. It is often useful to have a flag that designates a state row.

Some folks here have a preference for one design pattern or the other, and can perhaps chime in with pros/cons. I have done both, and haven't settled on a single approach as best. I try to take into account how the design will affect what the user will see in the BI tools.

Some BI tools (such as Business Objects) have specific support for aggregates. Once configured, the tools will use either the low-level fact table or the aggregate fact table depending on the user’s query. Maintaining a single dimension table (with "special rows" for aggregate use) might facilitate setting up aggregate awareness with some tools.

P.S. I would recommend Mastering Data Warehouse Aggregates by Christopher Adamson with forward by Ralph Kimball.

shanemcmurray Mon Jul 18, 2011 7:48 pm

okay, got it. Thanks for all the help.

ngalemmo Mon Jul 18, 2011 9:26 pm

I’m guessing that I should store a dimension record at the lowest level, say zip code?

Actually it would be at whatever level you are getting the data. If this is Census data, you would probably be best storing it at the tract level.

Also, ZIP codes cross political boundaries. There are codes that cross states, however almost all of these are in sparsely populated areas. It is common for codes to cross municipal boundaries, so city aggregates, particularly in suburban areas, can be misleading. There are also codes for military (FPO) and large commercial buildings, which may have zero population, causing problems with per capita calculations.

shanemcmurray Tue Jul 19, 2011 11:23 am

ngalemmo, I see what you are saying. in this case would you recommend having fact tables for each geo type with matching dim tables to group/slice at that level? For example:

DimZipcode:

LocationId
Country
State
County
City
ZipCode

FactZipcode:

PopulationId
LocationId
TotalPopulation
TotalHouseholds

DimCity:

LocationId
Country
State
County
City

Factcity:

PopulationId
LocationId
TotalPopulation
TotalHouseholds

DimCounty:

LocationId
Country
State
County

FactCounty:

PopulationId
LocationId
TotalPopulation
TotalHouseholds

OR

Is there a better way to model this?

ngalemmo Tue Jul 19, 2011 12:23 pm

I wouldn't necessarily go that far. The aggregates are only useful if you can distinquish where zips cross bondaries. It all boils down to the level of detail of the mesures being received. The location dimension should reflect that level of detail.

If all you get is population by zip code, there isn't much else you can do about it and having the aggregtes don't change anything since you would not know the correct city or state anyway... just the one where the post office is located. However if this is census tract data, the location dimension would (and should) by at the tract level with city, state, zip and other attributes.

If the data is by city/state/zip (zip repeats with different values when crossing political boundaries), then that should be the NK of the location dimension. Again you would not need the additional aggregate fact tables unless performance is an issue.

Sponsored content

Location and population dim/fact

Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Separate dimension for aggregate fact table

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact

Re: Location and population dim/fact