Building aggregate fact tables from staging

memphis Mon Nov 08, 2010 11:26 pm

Hi,

Is it acceptable to create an aggregate fact table from the staging tables after you have cleaned out the data in staging? I know usually you're supposed to create aggregate fact tables from the fact/dimension tables, however can it be done straight from the staging tables if your staging table already has data that can be easily aggregated from? The resulting aggregate fact table will still join to existing dimension tables that have already been established.

Thanks.

alex.caminals Tue Nov 09, 2010 6:49 am

Hi memphis,

There is no rule that prevents to aggregate the data directly from your staging tables. However, there are some details you should take into account before doing that.

- When building your detail fact and dimension tables, there may be some data transformations done that you will have to replicate during the aggregation.

- If the transformation logic changes over time in your detail fact and dimension tables, you will have to apply that change again to the aggregate table mappings.

- If the number of joins in the staging tables is high, building the aggregated tables can be a resource consuming process. If you have already gone through that when building the detail fact and dimension tables, leverage that work by building the aggregate fact tables from the detail ones.

There may be other reasons, I just went for those that came quickly to my mind. However, as you can see, there is none that forces you to build the aggregates from the detail tables. In the end, it's up to you.

Could I ask you the reasons you have to build directly from the staging tables?

Best regards,

memphis Tue Nov 09, 2010 8:33 am

Hi Alex,
Thanks very much for replying.

I thought about building directly from staging because I've got this 'sales like' table for my source data that has all the data that I need (no need to join to any other table). As someone else pointed out in my other post, if I were to create a fact/dimension table from this 'sales like' table it would essentially be a factless sales table which would kind of be pointless to do since it really isn't adding anymore value.
(Note I say 'sales like' table in quotes because it's not really a proper sales table, it is simply a table where each row is a sale of one insurance policy together with information on who sold it, when it was sold, branch it was sold at, type of product sold. It doesn't actually have any 'dollar amount' columns as such.) Furthermore, my reporting requirement is simply to know the total number of sales per month per product type.

So I figured, well why not just directly build an aggregate fact table directly from this staging table?

The only drawback I see from computing my aggregates directly from the staging table is what if later down the track, an additional sale was added to the source 'sales like' table. eg: In November 2010 it was discovered that a sale was missing for June 2010, therefore this sale would need to be added to the table. In this case, my aggregates would need to be recomputed. But if I recompute the aggregates, I lose track of the original aggregated values, in this case for June 2010. There is a requirement that the reports that spanned the months for June, July, August, September and October would need to show the original sale amount in June. Only from November onwards should it show the updated total.

So, bearing this in mind, does my situation lend itself well to computing my aggregate fact table directly from staging? (I hope I've explained my situation clearly. Happy to clarify if needed)

alex.caminals Wed Nov 10, 2010 8:14 pm

Hi memphis,

This is getting interesting...

First, your 'sales like' table is what is commonly denominated a denormalised table. That is not the best way of storing your data if you have a high number of rows as performance will be affected. In that case, the dimensional model suits better for huge volumes analysis. Anyway, let's assume the volumes are not so high and performance is good.

Second, if you have the risk of getting late arriving facts and want to keep the value as it was in the past, I strongly suggest you to use a snapshot fact table. If you also want to have the actual values, you can use the Sales fact table (in your case, your denormalised 'sales like' table) to get this value. So you will end up querying the 'sales like' table for actual data and the snapshot fact table for "as was" data.

I hope this clarifies your doubts. Otherwise, just post a reply to this post. I will be watching it.

Best regards,

memphis Wed Nov 10, 2010 10:44 pm

Thanks Alex,

My 'sales like' table is actually out of my control. It was given to me by the business to use. As far as I'm aware, another system is responsible for generating this 'sales like' table.

I'd like to do a bit more reading on snapshot fact tables just to refresh my memory, would this be the accumulating snapshot fact table?

Also where does this leave me with my aggregated fact table? My understanding after reading your suggestion is that I would have my 'sales like' denormalised table (which will have new sales appended to it every week) and another snapshot fact table for 'as-was' values. Would the aggregated fact table now be populated from the snapshot fact table? Or is the snapshot fact table itself an aggregated fact table?

Cheers.

Sponsored content

Building aggregate fact tables from staging