Model Data

AV Wed Jun 01, 2011 3:39 pm

We have huge flat files which we need to load to a dimensional model (model still in design phase).

The monthly file has about 50 million rows . The file contains patient level data, demographics, conditions patient has and some measures associated with patient .

One of the dimension(besides demogrphic etc ) we had in mind was Condition_DIM . As flat file has columns COND_1 , COND_2......COND_84. A patient can have 1 or many conditions. So in the flat file for a patient say P1 have following values
Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 Score_1 Score_2
1 0 0 1 0 23.232 23.678

*1 denoting patient had the condition and 0 otherwise. Score_ is our measure.
We thought about putting scores in fact and then Conditions as dimension.

CONDITIONS_DIM (Dimension structure looks the following)
Cond_ID
Cond_1
Cond_2
Cond_3
Cond_4
Cond_5
.
.
Cond_84

Cond_id will go to fact.

Some of the queries that will be asked Give the count of partients that have Cond_1 and Cond_3 but not Cond_4. or give me the sum of score_1 for above combination.

As mentioned earlier, CONDITIONS_DIM has all the unique combinations of Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 that exists in the data.When we did the analysis , for the first month - for a file with 50 million patient data had 10 million unique combinations of conditions.About 20% of fact. We are hoping following months the conditions_dim won't grow as much.

Qur concern is
1) Is the design correct with CONDITIONS_DIM having columns Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 . Columns have just 1 or 0 . 1 denoting patient had the condition and 0 otherwise.
2) 10 million rows in dimension - is it normal. Can we design it better (performance of queries being concern) - we heard about the bridge tables, but worried it will not be scalable enough as we have to answer queries like "partients that have Cond_1 and Cond_3 BUT NOT Cond_4 ". Any pointers which can help to design it better will be hugely appreciated.

Thanks

Jeff Smith Wed Jun 01, 2011 4:15 pm

I think having a condition dimension that is essentially a junk dimension of the conditions on the source is a mistake. Let's say someone wanted to select all patients that had a certain condition. They'd have to search every column on the condition dimension. And what if some of the conditions columns were null. What an ugly query.

But if the data were in a more normalized form, where you had 1 condition column and a row for every patient/condition, then such a query would be easy. I think it's a patient/condition fact table that is at the condition level.

VHF Wed Jun 01, 2011 6:29 pm

AV wrote:Some of the queries that will be asked Give the count of partients that have Cond_1 and Cond_3 but not Cond_4. or give me the sum of score_1 for above combination.

I like to handle these types of queries by putting a delimited text field into the fact table containing identifiers for the conditions. So you might have a fact table like this (using a pipe delimiter in the Conditions field):

Patient	Conditions	Score_1	Score_2
P1	\|Cond_1\|Cond_2\|	23.232	23.678
P2	\|Cond_1\|Cond_2\|Cond_4\|	24.732	24.987
P3	\|Cond_1\|Cond_2\|Cond_3\|	23.101	23.204

You can then do a query like this:

SELECT * FROM FactTable
WHERE Conditions LIKE '%|Cond_1|%' AND Conditions LIKE '%|Cond_2|%' AND NOT Conditions LIKE '%|Cond_4|%'

(You could also use the interger surrogate keys from the Conditions table in the delimited string--either way just make sure to always use the leading and trailing delimiters on the string and use the delimiters in your search (so you don't find 34 in the middle of 134!))

This approach does require a table scan of the fact table and might present performance problems on a very large DW, but it provides a very straightforward way to handle the "has this and this but not that" scenario. With 50 million rows I think it would be OK.

You will likely also want a bridge table to implement a multivalued dimenson for Conditions--the delimited string approach is great for selecting qualifying patients, but you still need a way to join your fact table to the Conditions dimension in order to take advantage of additional attributes in the dimension.

(Just to clarify, with this approach the Conditions dimension would have only 84 records--one for each possible condition.)

Sponsored content

Model Data

Model Data

Re: Model Data

Re: Model Data

Re: Model Data