Issue with modelling a hierarchy so the OLAP cube build is efficent

Tim R Fri Mar 20, 2009 4:47 am

I am faced with a nasty which seems to have no neat solution within dimensional modelling.
(This could well be just me being brain dead and dumb).

Background:
We are a major commercial bakery company.
Having re-developed and released the first two datamarts, (order & deliveries), we have a number of mature dimensions.
No enterprise Bus Matrix exists so to some extent we are flying blind.
The focus has now turned to 'Losses during Manufacture', 'Quality Tracking', 'Daily Plant Efficiency' and 'True Cost of Manufacture' which are all operating in a similar area of the business.

The problem, ( this is just one clear example):
We already have a 'Location' dimension which describes all our own factories and distribution points
We already have the hierarchy from location > region > area captured as additional columns in the location dimension.

The new work introduces the following;
A Plant entity, ( unique within manufacturing location), that captures describes the machinery lines that are laid out in the location to produce our various products
A shift entity, ( unique within plant, NOT location), that captures the describes the shift details.

In my mind I see a simple star with dimensions dim_date, dim_location, dim_plant, dim_shift, etc...

However the reporting and analysis research has highlighted a need to have a hierarchy that goes
Area > Region > Location > Plant > Shift

From Location to Plant is 1 to many
From Plants to Shift is 1 to many

Our OLAP / Cube guru is not happy with a simple star schema, as outline above, for two reasons:-

1). To create the hierarchies from elements in different dimensions would require scanning the whole
of the fact table, (projected to be some 400 million rows).

2). Linking via the fact table limits the hierarchy to activated structures, unused ones would be missed.

I can see his point as this is just an example of something we have met before with customers, in addition I see this kind of thing being an issue in various of our future scenarios.
We will be running a few tests to get a firm feel as to the time it would take, but currently we are only up to 125 million rows for 18 months of data, (we are intending a 5 year data retention in the warehouse)

Previously, with CustomerSite, CustomerSubDivision, CustomerDivision and Customer I merged everything into a single dimension with the CustomerSite surrogate key being the link to the fact tables.
However to do something similar with Location, Plant & Shift would result in the link surrogate key being at the lower Shift level.
So what happens if one only needs to link at the location level, as in our existing datamarts?
To me it just doesn't make any sense to merge Location, Plant & Shift, particularly when in two datamarts location is already in use and being linked to via the location surrogate key.

Possible Solutions Explored:

Idea: Add legacy Name or Legacy ID columns to dim_shift to hold the
plant, location, Region, Area identification data
Issue: With SCD2's we get involved with linking by Legacy Item and between from & to dates

Idea: Add Foreign Key columns to dim_shift to hold the plant, location surrogate key data
Issue: This opens the door to cross dimension linking without going through the fact table, something
I'm not comfortable with.
With SCD2's we get involved multiple table maintenance and a change in another dimension forcing
a new record in dim_shift.

Idea: Create an outrigger to dim_shift to hold the Surrogate key values to describe the hierarchy
Issue: With SCD2's the need to rebuild after each dimension / fact ETL, again by scanning the fact table.

Idea: Create a generic Hierarchy table that holds the relevant surrogate keys in a generic structure,
(level1, level2 etc...)
Issue: With SCD2's the need to rebuild after each dimension / fact ETL, again by scanning the fact table.

To add spice to the mix, our parent company have engaged a consultancy house as the BI consultants of choice.
Having had dealings, I am acutely aware that they support at best a snowflake methodology and at worst a full blown CIF solution.

I would appreciate any guidance/thoughts/ideas/solutions that the forum might have.

TIA, Tim

ngalemmo Wed May 20, 2009 2:28 am

I don't get where your OLAP 'guru' is coming from.

First, one would assume you need to have measures when you build a cube, so having to read the fact table should not be considered a problem.
Second, if the pull is for a reasonable period of time, chances are pretty good every shift, plant, location and date would be represented, so inactive combinations would naturally appear as null in the cube. I mean, if you are doing analysis by plant and shift, for example, the cube would display all plants and all shifts with empty spots in the matrix where measures do not exist. The only time there may be a problem is if a plant is shut down for a long period of time.

Colin Davies Wed May 20, 2009 8:50 pm

You mention an outrigger on one of your dimensions, but I don't think that is what you are looking for. It sounds to me like what you need is a bridge table between your Location and Plant, and maybe another one between Plant and Shift. It's hard to say without seeing the actual schema. Ralph's "Data Warehouse Toolkit" has a good explanation of bridge tables in Chapter 6. There is also a good explanation in Chapter 5 of the "DW ETL Toolkit" book as well.

Sponsored content

Issue with modelling a hierarchy so the OLAP cube build is efficent

Issue with modelling a hierarchy so the OLAP cube build is efficent

Re: Issue with modelling a hierarchy so the OLAP cube build is efficent

Not an outrigger...

Re: Issue with modelling a hierarchy so the OLAP cube build is efficent