Problems with design to allow Rolling up of Hierarchical Data

miommi Wed Jul 02, 2014 2:15 am

I am trying to build a dimensional model to support a scenario where a customer can own 1 or more accounts and those account can belong to different categories. We need to be able to report on the number of customers that own accounts at the lowest, specific account type, level (eg Cheque Account 1 or Loan Account 2) as well as at the higher account category level (eg Cheque Account or Loan Account).

Example data for this scenario is:

Customer Cheque Accounts Loan Accounts
1 Cheque Account 1 Loan Account 1
2 Cheque Account 1 Loan Account 1
Loan Account 3
3 Cheque Account 2 Loan Account 1
Cheque Account 3 Loan Account 2
4 Cheque Account 1
Cheque Account 3
5 Loan Account 2
Loan Account 3

I have an account dimension table that includes the account description details and the catergorisation hierarchies. This looks a like this:
AccountKey AccountCatID AccountCatDesc AccountTypeID AccountTypeDesc
1 1 Cheque Account 100 Cheque Account 1
2 1 Cheque Account 101 Cheque Account 2
3 1 Cheque Account 102 Cheque Account 3
4 2 Loan Account 200 Loan Account 1
5 2 Loan Account 201 Loan Account 2
6 2 Loan Account 202 Loan Account 3

I have a Fact table that looks like this:
Key CustomerKey AccountGroupKey AccountCount
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 5 5 1

And, I have a bridge table to resolve the many-to –many relationships between the fact and Account dimension tables which looks like this:

AccountGroupKey AccountKey
1 1
1 4
2 1
2 4
2 6
3 2
3 3
3 4
3 5
4 1
4 3
5 5
5 6

If I run a query across these table at the lowest, Account Type, level all is good and I get the following, correct results:

AccountCatDesc AccountTypeDesc AccountCount
Cheque Account Cheque Account 1 3
Cheque Account Cheque Account 2 1
Cheque Account Cheque Account 3 2
Loan Account Load Account 1 3
Loan Account Loan Account 2 2
Loan Account Loan Account 3 2

But my problems start when I run a query to roll-up the values to the Account Category level. This query returns the following, incorrect results:

AccountCatDesc AccountCount
Cheque Account 6
Loan Account 7

Instead of the expected results of:

AccountCatDesc AccountCount
Cheque Account 4
Loan Account 4

This is because the query is double counting the multiple cheque accounts of customers 3 & 4 and the multiple loan accounts of customers 2, 3 & 5.
Customer Cheque Accounts Loan Accounts
1 1 1
2 1 2
3 2 2
4 2 0
5 0 2
Total 6 7

Does anyone have a simple solution to this problem please?

nick_white Wed Jul 02, 2014 2:49 am

I think your fundamental problem is that the grain of your fact table is incorrect for the type of query you are trying to run. The grain is basically at customer but you are trying to report on individual accounts - this is never going to work as the you have no measure that says how many accounts you have; AccountCount is not counting Accounts it is counting Customers (or, I guess, more accurately, Account Groups).
You need to create a new fact table with the correct grain in order to run the type of query you want. There may be a way to return the correct result with the model that you have (dividing AccountCount by the number of accounts related to it by the bridge table) but the SQL is going to be really nasty, it will probably perform very badly and you wouldn't really be using the dimensional model in the way it is intended.

Hope this helps?

miommi Thu Jul 03, 2014 2:43 am

Thanks Nick

I can resolve the issue quite simply with SQL sub-queries for the high levels but the problem is you have to run very different queries for different levels of the hierarchy. This is OK when using native SQL but there are problems getting reporting tools to automatically use different queries to aggregating data at different levels.

I have come across this issue several times when there is not a one to one match between parent and child values in a hierarchy. Here a couple of other examples I have encountered in the past from the recruitment reporting environment:

A candidates could choose several preferred work locations on there employment profile. The hierarchy within the location dimension was suburb, city, region, country. Candidates could choose 1 or more suburbs within 1 or more cities within 1 or more regions etc. The reports needed to be able to report the number of candidates per suburb then aggregate this up by city, region etc.

Another example, involves candidates choosing preferred industries in their profile. These industries were categorised into classes (eg Accounting or Banking & Finance) and sub-classes (eg Accounts Receivable or Credit Control). Once again, candidates could choose 1 or more sub-classes of industry within 1 or more categories. This too has problems when you try to aggregate the results up to the classification level because a standard group by counts all of the values sub-class entries.

The only way I have found to overcome this issue is to use different aggregate fact tables that use different SQL queries to roll up the data at each of the levels.

nick_white Thu Jul 03, 2014 4:50 pm

Hi - I'm not clear from your response whether you agree with me, disagree or don't think I have understood your issue correctly. You state "...there is not a one to one match between parent and child values in a hierarchy" but from the data you provide there is such a match and you have a standard hierarchy: each Account Type has one, and only one, parent Account Category.

If you change the grain of your fact table to include the Account rather than the Account Group it should all work without any issues.
I don't understand why you have introduced the concept of the Account Group - I guess I am missing something but I'm not sure what. Can you clarify please?
Regards

miommi Sun Jul 06, 2014 9:06 pm

Hi Nick

Sorry for the confusion, I don't think I am explaining my problem very well.

I do not think the problem is solely related to the grain of the fact table. The problem is my dimension is multivalued and therefore ragged in nature. The parent record in the Hierarchy can have many children and more than 1 of those children can be legitimate. I agree that in a normal dimension / fact relationship has a single value that is consistent with the grain but with these multivalued dimensions, several child dimension values can be attached to a single fact record. This is what I meant by "not a one to one match".

For instance in the work location example I mentioned in my previous post, a candidate can select several preferred locations. When they do that they are saying I would prefer to work in Leeds or York or Sheffield or Manchester etc. Some people may choose 5 preferred work locations, others may choose only 3, some may only choose 1 (this is what I mean by ragged). Irrespective of how many work locations they choose, each of the dimensional values needs to be attached to the single candidate record so that we can generate metrics for the number of candidates that have stated that they would prefer to work in Sheffield for example.

It is legitimate for the one candidate to appear in multiple metrics. For example, the above candidate would be counted in the metric for people who would prefer to work in Leeds and also in the metrics for candidates who would prefer to work in York, Sheffield and Manchester etc. We also need to be able to roll these values up to provide regional metrics eg Leeds, York and Sheffield would roll-up to the parent "Yorkshire" whereas the parent of Manchester is "Greater Manchester". When we roll these values up, we would only want the candidate to appear once for Yorkshire and once for Greater Manchester. If we were to roll these values up to the country level then we would only want the candidate to appear once in the metric for the UK.

The grain appears to change as we navigate up and down the hierarchy resulting in having to use a different mechanism to calculate the metric depending on the level of the hierarchy (eg at the bottom level we count all records by city, at the regional we count records by 'distinct' region and then at the top level it is by 'distinct' country. Now, we can handle that for users that are using SQL to access the data as we just need to educate them that they need to effectively use different queries for different levels. But, the problem are users that use tools to access the metrics and not SQL. For instance, if you pull the lowest level of data (Leeds, York, Sheffield etc) into a pivot table and then use the pivot table functionality to roll-up to the regional level. You get a value of 3 for Yorkshire because it is counting each of the values for Leeds, York and Sheffield.

I have tried changing the grain of the fact table by adding the preferred locations as keys (eg a column for preferred location 1, another column for preferred location 2 etc) but that means I have to have limit the number of choices and the queries are not nice as we have to use clauses such as where preferred location 1 = 'Sheffield' or preferred location 2 = 'Sheffield' or preferred location 3 = 'Sheffield' etc to determine the number of candidates that have stated that they would prefer to work in Sheffield. I have also tried this in the dimension with the same results. This is the reason I introduced the bridge table and the concept of the Group because without it I have a many-to-many relationship between the fact and dimension tables. I then have a view that combines the fact and the bridge table so that it effectively looks like the preferred locations are part of the fact table.

Hopefully this more clearly explains my problem.

Thanks

nick_white Mon Jul 07, 2014 3:14 am

Hi - that does make sense, but it is a completely different problem to the one that you originally raised about cheque/loan accounts which I think can be solved by changing the grain. If that problem is still an issue for you then it might be worth raising it as a separate post on the forum.

For this issue with cities/regions I would do the following:
1. Create bridge tables - as you state you can have any number of cities so any sort of design that limits you to City1, City2, ...Cityn is almost certainly going to cause you problems at some point

2. Create your own aggregate tables at region level that have the correct counts in them and get your users to use these tables. I don't think you could ever get these counts to roll-up correctly from City to Region in the way you want because, as you say, these are not really 'natural' hierarchies they are 'business defined' hierarchies.

The only way I can think of that might get this to work is if you had a separate record in your bridge table for each candidate's region and put a counter on the bridge table and rolled that up e.g. if a candidate expressed a preference for Leeds, York, Southampton and Winchester then the candidate would link to a Yorkshire bridge table record and a Hampshire bridge table record - each then linking to the relevant cities. While this may solve your region issue it may just make your city-level reporting more complicated and so doesn't really help overall; I guess all you can do is try it and see.

Sponsored content

Problems with design to allow Rolling up of Hierarchical Data

Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data

Re: Problems with design to allow Rolling up of Hierarchical Data