Medical Data - Building a reporting Data Warehouse
I am building a data warehouse from several different flat file sources. Here is my concern, and I hope to get some sensible methods of handling it:
After breaking the flat files apart into facts and dims, I find that I have an interesting situation. All my fact tables share the same dimension group, keyed on rowid, which makes my OLTP-style data warehouse storage nice and easy:
DimMaster (rowid, dim1 .. dim60)
Fact1..Fact5 (rowID, Measure1 .. Measure5)
From what I've been reading in this forum, I am effectively treating this as one giant junk dimension, which would snowflake out to lookups for the dimension keys.
As you can see, DimMaster grows with every unique rowid in the fact tables, so it will get large rather quickly.
Currently I have about 4 million unique rowids, with about 500K added per quarter, and some of the fact tables have 50M records, adding about 5M per quarter.
This will keep growing. I'm not sure it will grow to the point of failure, but it might.
I think I have a few options:
1. Create a view joining all the dimensions onto each rowid in each fact table (the simplest approach) and present that as a traditional star schema.
2. Break the low-cardinality values out of DimMaster into a junk dimension with its own junk key, and create proper dimensions in the fact tables for the remaining attributes. I'm a bit scared of how long this will take and not entirely sure of the benefit.
I'm leaning towards #1 because of a time crunch, but would love to hear your opinions.
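Roughly, option 1 would be one view per fact table, something like the sketch below; the object names are just placeholders matching the layout above, not a confirmed design:

    -- Sketch of option 1: expose Fact1 plus all DimMaster attributes
    -- as one flat star-style view (names are illustrative).
    CREATE VIEW dbo.vFact1Star AS
    SELECT f.rowID,
           f.Measure1, f.Measure2, f.Measure3, f.Measure4, f.Measure5,
           d.dim1, d.dim2, /* ... */ d.dim60
    FROM dbo.Fact1 AS f
    JOIN dbo.DimMaster AS d
        ON d.rowid = f.rowID;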
KeithL
Re: Medical Data - Building a reporting Data Warehouse
Option 2 would be the natural choice, without a second thought, from a dimensional modelling perspective. Why would it be so time-consuming? Just separate the attributes into a number of reasonably sized dimension buckets.
hang
Time
I need to have the cube published by Monday, and I'd like to avoid going back into the ETL again.
Say I was going to split DimMaster into smaller chunks. Is there a rationale for what size chunks to use?
(Is this a function of grouping similar dimensions for the users' sake, or of trying not to exceed some number of combinations?)
Should I avoid combinations over a certain amount?
Say I made 5 junk dimensions; wouldn't users then have to hunt through each junk table for the attribute they want to slice by?
What exactly is "low" in low cardinality?
0 or 1?
10 possibilities?
20 possibilities?
When should attributes be their own dimension?
Sorry, maybe some of these are silly, but I have really only worked with the traditional star schema before.
KeithL
Re: Medical Data - Building a reporting Data Warehouse
KeithL, got your message. It's quite a valid question. In my experience, when the cardinality exceeds 10, or there is any hierarchical relationship between attributes, I tend to create a standalone dimension for them.
However, there is no absolute rule for it. If you find it hard to group them meaningfully, do a bit of homework using SELECT DISTINCT on each of the attributes, note the counts, then remove the relatively higher-cardinality attributes one by one and see whether the junk size becomes reasonable. You may find the size drops dramatically after removing just a few culprits. A dimension with tens of thousands of rows is quite normal; one with hundreds of thousands is getting large; one with millions of rows would be regarded as a monster dimension and should be avoided if possible. You can do the initial data movement with your existing data and fine-tune your ETL later on, when you have more time.
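For example, something like this against DimMaster would give you the counts in one pass (column names assumed from the layout above):

    -- Profile the cardinality of each candidate junk attribute.
    SELECT COUNT(DISTINCT dim1)  AS dim1_values,
           COUNT(DISTINCT dim2)  AS dim2_values,
           -- ... and so on through the remaining attributes ...
           COUNT(DISTINCT dim60) AS dim60_values
    FROM dbo.DimMaster;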
hang
Group of 18
I've got a group of 18 attributes down to 49,995 distinct combinations out of the 4M records.
I'm just worried that after I let this go, it will creep without monitoring, and I will be long gone; I'm on a consulting engagement.
I know for a fact that it will not increase if I use the star schema where my fact tables carry dim1..dim60 plus the measures directly.
Just trying to weigh my options.
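One cheap option, sketched here with assumed names, would be to log the junk dimension's row count each load cycle, so any creep is at least visible after I'm gone:

    -- Hypothetical growth log; schedule the INSERT (e.g. as a SQL Server
    -- Agent step) to run once per load. DimJunk is an assumed name.
    CREATE TABLE dbo.JunkDimGrowthLog (
        LoggedAt datetime NOT NULL DEFAULT (GETDATE()),
        RowCnt   bigint   NOT NULL
    );

    INSERT INTO dbo.JunkDimGrowthLog (RowCnt)
    SELECT COUNT(*) FROM dbo.DimJunk;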
KeithL
Re: Medical Data - Building a reporting Data Warehouse
Well, you might have heard of the centipede fact table, and you might end up with one instead of a good star schema if you're not careful. I think if your data covers a significant period, the junk dimension will not grow alarmingly. For junk dimensions containing only flags, you can even pre-build them with Cartesian joins (and possibly unions). Ignoring unknowns, the rough estimate is that 10 flag attributes would not exceed 1,024 combinations and 20 would top out around 1 million. Here's a relevant post: http://forum.kimballgroup.com/t575-a-fact-table-surrounded-by-a-lot-of-references-tables
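For example, a flag-only junk dimension with three yes/no flags could be pre-built like this (table and column names are illustrative):

    -- Pre-populate every flag combination: 3 yes/no flags -> 2^3 = 8 rows.
    INSERT INTO dbo.DimJunkFlags (FlagA, FlagB, FlagC)
    SELECT a.v, b.v, c.v
    FROM (VALUES ('Y'), ('N')) AS a(v)
    CROSS JOIN (VALUES ('Y'), ('N')) AS b(v)
    CROSS JOIN (VALUES ('Y'), ('N')) AS c(v);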
hang
Re: Medical Data - Building a reporting Data Warehouse
I was in a similar situation only recently, took a step back, and went back to the business process. I discovered that I could group my junk dimension attributes in two ways:
1. Certain flags and low-cardinality attributes are more related to each other than to the rest, in such a way that they form meaningful groups pointing to a particular aspect of the business process. This allowed building junk dimensions with names the users found intuitive when navigating to a junk attribute.
2. The theoretical combination count of junk attributes is often very high; I found my remaining attributes had a theoretical cardinality of 3M. However, the practical cardinality of certain groups was less than 8K, which looked very good compared to the 11M-row fact (a quick way to check this is sketched at the end of this post). I also checked this with the business SMEs and found that certain combinations were not allowed or could not happen, e.g. a certain data type would always have flags A and B set to Yes but flag C set to No. This meant the junk dimension was very unlikely to grow drastically and reach its theoretical cardinality. You may find that this is also true in your case.
I also discovered I had a number of not-so-low-cardinality (50+) attributes that I basically grouped together and turned into an outrigger instead of a junk dimension. I am not sure whether a subdimension would have been a better idea, but it seemed like a reasonable solution at the time.
I don't tend to take a hard line on how low the cardinality has to be for an attribute to be a good candidate for the junk dimension. I tend to look at the size of the junk dimension, and if it is less than 1% of the size of the fact table, I leave it alone and don't break it up any further.
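A rough way to run that practical-cardinality check, with assumed table and column names:

    -- Count the attribute combinations that actually occur, as opposed
    -- to the theoretical maximum (names here are illustrative).
    SELECT COUNT(*) AS practical_combinations
    FROM (
        SELECT DISTINCT FlagA, FlagB, FlagC, StatusCode
        FROM dbo.DimMaster
    ) AS observed;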
Ramtin
Ha. I think I got it.
I see it much better now that I am working with this.
Include the actual descriptions in the junk dimension. That way you don't need to bring in the lookup dimensions at all; you just connect to the JunkDim.
Right?
I assume this only holds if each lookup dim consists of just DimID, DimCode, and DimDescr.
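So the junk dimension would carry both codes and descriptions, something like this sketch (names hypothetical, showing two collapsed lookups):

    -- Junk dimension with descriptions inline, so no snowflaked lookups
    -- are needed at query time (illustrative columns).
    CREATE TABLE dbo.DimJunk (
        JunkID      int IDENTITY(1,1) PRIMARY KEY,
        StatusCode  varchar(10)  NOT NULL,
        StatusDescr varchar(100) NOT NULL,
        TypeCode    varchar(10)  NOT NULL,
        TypeDescr   varchar(100) NOT NULL
    );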
KeithL
Uh Oh
I am having trouble with something that should be simple!
In my ETL (using SSIS), I am using a Lookup to validate and assign the JunkID, sending no-match rows off to be inserted. My problem occurs in the subsequent lookup.
My error is [Lookup [5369]] Error: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80040E21.
An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80040E21
Description: "Invalid input parameter values. Check the status values for detail.".
I can't figure out why I am getting this error.
Please help. I should already be working on the cube, and instead I am back in the ETL!
Before you ask: I am using partial caching (I already fixed that).
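One commonly cited cause of this error with partial caching is a data-type mismatch between the input columns and the lookup columns, since the Lookup fires a parameterized query per row. Casting the lookup side explicitly in a custom lookup query is one thing to check; the column names and sizes below are placeholders, not a confirmed diagnosis:

    -- Hypothetical custom lookup query: cast the join columns so their
    -- types match the pipeline columns exactly.
    SELECT JunkID,
           CAST(FlagA AS varchar(10)) AS FlagA,
           CAST(FlagB AS varchar(10)) AS FlagB
    FROM dbo.DimJunk;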
KeithL