Textual fact or Junk dimension?

grgs27 Fri Aug 27, 2010 6:07 am

Hello,

I have been following the forum for a while now and I must say that many doubts about dimensional modeling have gone thank to you all.
However I am facing now the modeling of a cube and I am somewhat stuck on what is the best implementation.

The cube is rather simple:

Dimension
- Commodity
- Country
- Time
...

Measure
- Production
- Export
- Import
- Re export
...

with each measure I have a non-numeric "attribute" which defines the quality of the measure (for each cell): estimated, official, unofficial, derived and so forth.

Based on your experience would it be better

- to store them as a non-numeric facts along with the corresponding measure.
or
- to store them as a junk dimension (as suggested order management in page 117)

considering also that it is not expressed as requirement (so far but most likely will be) the need to "aggregate" the indicator with a custom algorithm (example: if the summed production of some commodity in EMEA has some unofficial figure the indicator should show "unofficial" in the total).

Thanks in advance.

BoxesAndLines Fri Aug 27, 2010 9:41 am

It sounds like a regular dimension to me. What makes it a junk dimension?

grgs27 Fri Aug 27, 2010 10:52 am

BoxesAndLines wrote:It sounds like a regular dimension to me. What makes it a junk dimension?

Because it is related strictly to a fact...

The production (measure) of butter (commodity) in Germany (country) 2007 (time) was 10000 tons.
Now this is the fact and its attribute can be: this figure is derived, official, unofficial.
I don't have three possible figures for that but only one with a "quality indicator" of that fact.

In one fact-table I have up to 14 measure, with the dimension approach I would have 14 more dimension pointing at a dimension table of 5 rows.
It doesn't make sense to me...

LAndrews Fri Aug 27, 2010 12:18 pm

The grain of your fact should determine how to solve this.

For example, if the grain of the row states that all measures are of the same quality, then you have a single "quality" dimension with your 5 values in it.

Alternately, if the grain is independant of quality, then each measure needs to be denormalized by quality (e.g. Derived_Production, Official_Production etc).

This is similar to the budgeting models,where you can have multiple scenarios (Actual, Plan, Forecast etc).

Hope this helps.

VHF Fri Aug 27, 2010 7:41 pm

If I understand correctly, the OP wants to store a data quality metric for each of (up to) 14 measures in the fact table--essentialy a piece of metadata about each measure. So are these data quality metrics (a.) non-numeric measures or are they (b.) a dimension attribute?

I think a key question--touched on in the original post--is how these values will be handled when data is aggregated. If the need is to slice/dice/group data by the quality measurements, then they are (b.) a dimension attribute.

However, the original post mentioned a custom aggregation. That leads me to belive that these values are (a). non-numeric measures in themselves. To keep the fact table as narrow as possible AND to allow the aggregation function to work with numeric rather than string values, I would store a numeric value rather than a textual description. Something like 10=estimated, 20=derived, 30=uofficial, 40=official. (You could of course have a lookup table to tie the numeric code to a textual description for reporting purposes.)

Measure1
Measure1DataQuality
Measure2
Measure2DataQuality
...

The main trick is to make sure your data quality measure isn't aggrated inappropriatley--after all, a SUM of discreate data quality values doesn't tell you much. If your data quality was on a scale (such as 1 to 10) or a percentage than AVG might be useful. Assuming the numeric values are in ascending order of data quality, you should be able to use MIN to tell you the lowest quality piece of data included in your aggreated results. Otherwise, you may very well be looking at that custom aggregation.

Hope this helps with thinking through the process--and I hope your environment has good support for custom aggregations!

(I just tried to create a custom aggreation in SQL Server 2008, imagine my surprise that there was no "New" option for Aggregate Functions! I can create new Table-valued Functions, and I can create new Scalar-valued Functions, but not Aggregate functions. Is this an Enterprise Edition-only feature? No, it turns out Aggregate Functions can only be created using a CLR function written in C# or VB .NET!)

hang Fri Aug 27, 2010 11:13 pm

BoxesAndLines wrote:It sounds like a regular dimension to me. What makes it a junk dimension?

BoxesAndLines makes the point. The junk dimension is not the only way to reduce the number of dimensions, or measures for that matter. It is the proper dimensional modeling in general that makes the structure sound.

If you don't have too many measures, then LAndrews' alternative approach is the simplest and yet elegant way to model it, storing all the measures side by side.

If you are really concerned about the excessive number of measures, you make look a couple of pages further (p119) in Kimball's book where the Multiple Currencies scenario is discussed. You may not need a fact table like currency conversion fact, but you do need to store some deriving logic (say rate) in a dimension table. If the derived values are purely arbitrary, you might have to go back to LAddrews' approach. It is possible, but not recommended, to reduce the number of columns in fact table by partitioning your fact table vertically and join them by dimension conformance.

grgs27 Tue Aug 31, 2010 11:03 am

VHF wrote:

[...]

Hope this helps with thinking through the process--and I hope your environment has good support for custom aggregations!

(I just tried to create a custom aggreation in SQL Server 2008, imagine my surprise that there was no "New" option for Aggregate Functions! I can create new Table-valued Functions, and I can create new Scalar-valued Functions, but not Aggregate functions. Is this an Enterprise Edition-only feature? No, it turns out Aggregate Functions can only be created using a CLR function written in C# or VB .NET!)

Thank you VHF

we are going to "develop" our environment ourselves extending mondrian and few other things from the pentaho suite.
So for the support is not a big deal... and Postgres has aggregation support.

g

Jeff Smith Tue Aug 31, 2010 3:14 pm

I think the original post has the answer. The field in question is referred to as an "attribute". That makes it a dimension.

VHF Wed Sep 01, 2010 9:16 am

Ah, but is an "attribute" always an attribute?!? :-)

The answer to the question I posed--are these data quality metrics (a.) non-numeric measures or are they (b.) a dimension attribute?--might be (c.) both, in which case it would be appropriate to model them both ways, as a dimension attribute for filtering/grouping based on data quality AND as a measure in order to aggregate a measure of data quality.

Sponsored content

Textual fact or Junk dimension?

Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?

Re: Textual fact or Junk dimension?