Model Data
3 posters
Page 1 of 1
Model Data
We have huge flat files which we need to load to a dimensional model (model still in design phase).
The monthly file has about 50 million rows . The file contains patient level data, demographics, conditions patient has and some measures associated with patient .
One of the dimension(besides demogrphic etc ) we had in mind was Condition_DIM . As flat file has columns COND_1 , COND_2......COND_84. A patient can have 1 or many conditions. So in the flat file for a patient say P1 have following values
Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 Score_1 Score_2
1 0 0 1 0 23.232 23.678
*1 denoting patient had the condition and 0 otherwise. Score_ is our measure.
We thought about putting scores in fact and then Conditions as dimension.
CONDITIONS_DIM (Dimension structure looks the following)
Cond_ID
Cond_1
Cond_2
Cond_3
Cond_4
Cond_5
.
.
Cond_84
Cond_id will go to fact.
Some of the queries that will be asked Give the count of partients that have Cond_1 and Cond_3 but not Cond_4. or give me the sum of score_1 for above combination.
As mentioned earlier, CONDITIONS_DIM has all the unique combinations of Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 that exists in the data.When we did the analysis , for the first month - for a file with 50 million patient data had 10 million unique combinations of conditions.About 20% of fact. We are hoping following months the conditions_dim won't grow as much.
Qur concern is
1) Is the design correct with CONDITIONS_DIM having columns Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 . Columns have just 1 or 0 . 1 denoting patient had the condition and 0 otherwise.
2) 10 million rows in dimension - is it normal. Can we design it better (performance of queries being concern) - we heard about the bridge tables, but worried it will not be scalable enough as we have to answer queries like "partients that have Cond_1 and Cond_3 BUT NOT Cond_4 ". Any pointers which can help to design it better will be hugely appreciated.
Thanks
The monthly file has about 50 million rows . The file contains patient level data, demographics, conditions patient has and some measures associated with patient .
One of the dimension(besides demogrphic etc ) we had in mind was Condition_DIM . As flat file has columns COND_1 , COND_2......COND_84. A patient can have 1 or many conditions. So in the flat file for a patient say P1 have following values
Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 Score_1 Score_2
1 0 0 1 0 23.232 23.678
*1 denoting patient had the condition and 0 otherwise. Score_ is our measure.
We thought about putting scores in fact and then Conditions as dimension.
CONDITIONS_DIM (Dimension structure looks the following)
Cond_ID
Cond_1
Cond_2
Cond_3
Cond_4
Cond_5
.
.
Cond_84
Cond_id will go to fact.
Some of the queries that will be asked Give the count of partients that have Cond_1 and Cond_3 but not Cond_4. or give me the sum of score_1 for above combination.
As mentioned earlier, CONDITIONS_DIM has all the unique combinations of Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 that exists in the data.When we did the analysis , for the first month - for a file with 50 million patient data had 10 million unique combinations of conditions.About 20% of fact. We are hoping following months the conditions_dim won't grow as much.
Qur concern is
1) Is the design correct with CONDITIONS_DIM having columns Cond_1 Cond_2 Cond_3 ... Cond_4 ..Cond_84 . Columns have just 1 or 0 . 1 denoting patient had the condition and 0 otherwise.
2) 10 million rows in dimension - is it normal. Can we design it better (performance of queries being concern) - we heard about the bridge tables, but worried it will not be scalable enough as we have to answer queries like "partients that have Cond_1 and Cond_3 BUT NOT Cond_4 ". Any pointers which can help to design it better will be hugely appreciated.
Thanks
AV- Posts : 1
Join date : 2011-06-01
Re: Model Data
I think having a condition dimension that is essentially a junk dimension of the conditions on the source is a mistake. Let's say someone wanted to select all patients that had a certain condition. They'd have to search every column on the condition dimension. And what if some of the conditions columns were null. What an ugly query.
But if the data were in a more normalized form, where you had 1 condition column and a row for every patient/condition, then such a query would be easy. I think it's a patient/condition fact table that is at the condition level.
But if the data were in a more normalized form, where you had 1 condition column and a row for every patient/condition, then such a query would be easy. I think it's a patient/condition fact table that is at the condition level.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Model Data
AV wrote:Some of the queries that will be asked Give the count of partients that have Cond_1 and Cond_3 but not Cond_4. or give me the sum of score_1 for above combination.
I like to handle these types of queries by putting a delimited text field into the fact table containing identifiers for the conditions. So you might have a fact table like this (using a pipe delimiter in the Conditions field):
Patient | Conditions | Score_1 | Score_2 |
P1 | |Cond_1|Cond_2| | 23.232 | 23.678 |
P2 | |Cond_1|Cond_2|Cond_4| | 24.732 | 24.987 |
P3 | |Cond_1|Cond_2|Cond_3| | 23.101 | 23.204 |
You can then do a query like this:
SELECT * FROM FactTable
WHERE Conditions LIKE '%|Cond_1|%' AND Conditions LIKE '%|Cond_2|%' AND NOT Conditions LIKE '%|Cond_4|%'
(You could also use the interger surrogate keys from the Conditions table in the delimited string--either way just make sure to always use the leading and trailing delimiters on the string and use the delimiters in your search (so you don't find 34 in the middle of 134!))
This approach does require a table scan of the fact table and might present performance problems on a very large DW, but it provides a very straightforward way to handle the "has this and this but not that" scenario. With 50 million rows I think it would be OK.
You will likely also want a bridge table to implement a multivalued dimenson for Conditions--the delimited string approach is great for selecting qualifying patients, but you still need a way to join your fact table to the Conditions dimension in order to take advantage of additional attributes in the dimension.
(Just to clarify, with this approach the Conditions dimension would have only 84 records--one for each possible condition.)
Last edited by VHF on Wed Jun 01, 2011 6:48 pm; edited 3 times in total (Reason for editing : fix; clarify)
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Similar topics
» Tracking of historical data using SCD2 in a non-dimensional data model
» data model architecture for economic forecast data
» Canonical Data Model for Data warehouse
» Data Model
» Agile BI - Data Model
» data model architecture for economic forecast data
» Canonical Data Model for Data warehouse
» Data Model
» Agile BI - Data Model
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum