Duplicating a field in more than one fact table.

ian.coetzer Mon May 31, 2010 7:57 am

Question about dimensional modeling.

Scenario:

I have a dimension called:
1. Dim_Application

And two fact tables:
1. Fact_ApplicationStatus
2. Fact_ApplicationConsolidation

Now the Dim_Application consists all the applications in our system.

Fact_ApplicationStatus contains all the statusses that each application found itself in at some point in time.

Fact_ApplicationStatus
ApplicationId StatusId StartDate EndDate Latest
1 1 2010/01/01 2010/01/15 0
1 2 2010/01/16 2010/02/05 0
1 3 2010/02/06 9999/12/31 1
2 1 2010/02/01 2010/02/18 0
2 2 2010/02/19 9999/12/31 1

Fact_ApplicationConsolidation contains the consolidation amounts for each application.

Fact_ApplicationConsolidation
ApplicationId StatusId Amount
1 3 5,000
2 2 8,000

The Question:
Is it incorrect to have StatusId duplicated in these tables?
In the first fact table all statusses are stored including the latest.

In the second table only the latest statusId is stored.

I have found it easier to do analysis in this way, since the business user does not have to also retrieve the latest status from the fact status table when analyzing consolidation information but can now group consolidation using the latest status.

any ideas? comments?

ian.coetzer Mon May 31, 2010 8:00 am

How / When does one make a call on whether to include a field in another fact table - when it is possible to derive it from another fact table - even if the user has to pull more dimensions and perform additional filtering?

ngalemmo Tue Jun 01, 2010 11:34 am

Is there a status dimension? Otherwise, what does status ID refer to?

There is nothing wrong with using a dimension more than once in different fact tables.

BrianJarrett Tue Jun 01, 2010 2:08 pm

First questions that come to mind are fact table grain and type. What is the grain of Fact_ApplicationStatus? One row per application, status, and date?

Also, in your first fact table, is "Latest" exactly the same as thing "StatusID", only just the latest status? If so, is it the latest status of all time? Of a particular date range?

The reason I ask is that you said you were storing the latest StatusID in the your second table (Fact_ApplicationConsolidation). If "Latest" is what I think it is, then you've got mixed grain in the Fact_ApplicationStatus table (which I think is the point of your question). Typically I'd store this as you have in the second fact table, with the latest status per application, and then I'd run a multipass SQL to get both counts. The first pass would get all statuses by application, the second pass would get the latest by application (which will be duplicated across any rows in the first query where you have more than one application and status).

So effectively then your first table is a periodic snapshot (with date) and your second table is an accumulating snapshot (without date). Mixing the grain is typically not advisable, but I've seen it done before, particularly for tables built specifically for canned reporting. It can, as you mentioned, make reporting easier. You just can't sum data of a higher grain (at least not without a known divisor) when it duplicates across data at a more detailed grain. Some reporting tools are smart enough to know this (Excel not being one of them).

We actually do this where I work now in a report-specific table. We store daily activity that happened that day alongside a count of all our active customers, up through and including that day. The activity is a daily snapshot, whereas the customers are an accumulating snapshot. It breaks some grain rules but the users know how the two work together. It also poses some problems with reporting tool semantic layers (like BO) that have to be addressed and overcome.

Speaking of periodic snapshot, it seems that your first fact table has variable date ranges. Typically with a periodic snapshot we'd see some sort of regularity, as in all the records belong to a particular "activity date" or something similar in nature. Maybe I don't understand your data well enough though. What does "StartDate" and "EndDate" represent in this table? I'm just curious how you go about updating existing fact records with the latest StatusID; I would assume that has to be done in a secondary process.

ian.coetzer Tue Jun 08, 2010 12:56 pm

Hi,

The Status Fact contains also the status duration of the previous status.

So it could look like this:

Fact_ApplicationStatus

[Id]	[ApplicationId]	[StatusId]	[Duration]	[Current]
1	111	1	5	0
2	111	2	4	0
3	111	3	5	1
4	222	1	2	1

Where the Duration indicates for how long the application has been in the status.
and current shows you which status is current.

Bye,
Ian

sgudavalli Thu Jun 10, 2010 10:55 am

@ian

i see lot of benefit if you can keep latest status in the dim acct table instead of fact App Consolidation table.
it will help you to perform analysis both on application status and also on App Consolidation facts.

Regards
Shiv

BrianJarrett Fri Jun 11, 2010 1:09 pm

I think I more clearly see your question. I probably complicated things a bit.

Your original question was if there was anything wrong with using the same field in two different fact tables (StatusID). As ngalemmo stated, there's nothing wrong with using a dimension in multiple fact tables. That's the whole point of the bus architecture and conformed dimensions.

StatusID appears to me to be a degenerate dimension (dimension with no attributes). There's no problem storing this degenerate dimension across multiple fact tables. Functionally it's no different than storing the surrogate for a dimension table across multiple fact tables; you just eliminate the join to a dimension table. You'll then change your SQL to pull the right StatusID field depending on which fact table you're querying (or you'll have the semantic layer of your reporting tool do it for you).

hang Fri Jun 11, 2010 11:51 pm

It looks to me your Fact_ApplicationStatus is an accumulating snapshot. Kimball has a relevant discussion in his book "The Complete Guide to Dimensional Modeling", Chapter 12, Education. The fact table may be called Application Pipeline Accumulating Fact with role playing date keys for all the statuses. The grain of the fact is one row per application. Therefore you would have ApplicationKey, Status1DateKey, Status2DateKey ... etc. in your fact table without introducing a dimension for status.

It's important to use surrogate date keys pointing to a role playing date dimension as you would have many undefined date keys when an application enters the pipeline. A date dimension would also give a rich set of calendar information for your analysis. You would periodically update the fact table as the application progresses fulfilling those undefined dates.The duration can be a derived measure and dollar amount should be just a normal measure in the same fact table. Hope this will help

Sponsored content

Duplicating a field in more than one fact table.

Duplicating a field in more than one fact table.

My Summarized Question on the original topic

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.

Re: Duplicating a field in more than one fact table.