Swiss cheese and dimensional design
5 posters
Page 1 of 1
Swiss cheese and dimensional design
I have a potential patient dimension... Wait, this actually sounds like a gross overstatement... Let me explain...
Patient information in the source system resembles Swiss cheese, because any particular piece of the data is entirely voluntary.
So you have Bill without last name, Hernandez without first name and address, or Sparky, who happens to be a dog… All of this is mixed in with other, case related info in one table in the source system with CaseID as PK. It’s a particularly messy pile of s…tuff, that is inherently uncleanable.
Creating a dimension out of it does not make sense to me as that dimension will grow 1:1 with the fact (a natural key of the dimension seems to be the CaseID).
Leaving all the information in the fact table seems extremely ugly and inefficient. Is purging this info the only way out? What if there is a business case for running an occasional report on the quality of patient information gathering?
Patient information in the source system resembles Swiss cheese, because any particular piece of the data is entirely voluntary.
So you have Bill without last name, Hernandez without first name and address, or Sparky, who happens to be a dog… All of this is mixed in with other, case related info in one table in the source system with CaseID as PK. It’s a particularly messy pile of s…tuff, that is inherently uncleanable.
Creating a dimension out of it does not make sense to me as that dimension will grow 1:1 with the fact (a natural key of the dimension seems to be the CaseID).
Leaving all the information in the fact table seems extremely ugly and inefficient. Is purging this info the only way out? What if there is a business case for running an occasional report on the quality of patient information gathering?
cridal- Posts : 9
Join date : 2009-03-27
Re: Swiss cheese and dimensional design
You have to consider what the business case is to collect this data. If there is apparently no business case to collect indentifying information, what would be the business case to store it for analysis?
Agreed, swiss cheese --> can not analyze
If the data is swiss cheese as you say, how can you preface any business "question" using it? You could not even do a simple, Show me totals by patient last name. Sounds like you'll be doing analysis on dimensions other than patient.
lmorgan- Posts : 3
Join date : 2009-02-27
Re: Swiss cheese and dimensional design
cridal wrote:I have a potential patient dimension... Wait, this actually sounds like a gross overstatement... Let me explain...
Patient information in the source system resembles Swiss cheese, because any particular piece of the data is entirely voluntary.
So you have Bill without last name, Hernandez without first name and address, or Sparky, who happens to be a dog… All of this is mixed in with other, case related info in one table in the source system with CaseID as PK. It’s a particularly messy pile of s…tuff, that is inherently uncleanable.
Creating a dimension out of it does not make sense to me as that dimension will grow 1:1 with the fact (a natural key of the dimension seems to be the CaseID).
Leaving all the information in the fact table seems extremely ugly and inefficient. Is purging this info the only way out? What if there is a business case for running an occasional report on the quality of patient information gathering?
I personally don't like excluding data from the source if possible. After all, one man's garbage is another man's gold. If you want to include the crappy data, then 1 option is to put the crappy data into junk dimensions. If all of the unique combinations of the crappy data numbers the same of the rows in the fact table, then you could put the data into multiple junk dimensions, selecting different combinations of the crappy columns until you come up with junk dimensions that are manageable in size.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Swiss cheese and dimensional design
To Imorgan:
I guess the only queries I envision relate to quality of patients data capture. For example, what percentage of cases by operator is missing a patient’s zip code or city? The business reason for this might be to see which operator is lazy. If you can see a historical trend in percentages, then it might have to do with operator's performance on top of patient's unwillingness to share their info. Business case is clear -> fire that operator.
So hypothetically, a reason to come up with the design might be valid. The question is: if it were the case, how would you handle it in the design?
I guess the only queries I envision relate to quality of patients data capture. For example, what percentage of cases by operator is missing a patient’s zip code or city? The business reason for this might be to see which operator is lazy. If you can see a historical trend in percentages, then it might have to do with operator's performance on top of patient's unwillingness to share their info. Business case is clear -> fire that operator.
So hypothetically, a reason to come up with the design might be valid. The question is: if it were the case, how would you handle it in the design?
cridal- Posts : 9
Join date : 2009-03-27
Re: Swiss cheese and dimensional design
Are you a provider or an insurance company? If you are a provider, how do you bill the patient if you don't know who they are? If you are an insurance company, you should have some sort of subscriber identifier. Your problem is a common occurrence when data is repurposed. You want the data to provide insights to operator performance, but the data or the system wasn't designed for that problem. Most insurance companies can't even tell you who a particular claim is for. All they know is that a subscriber has a policy and the claim should be paid. Whether the claim is for the primary policy holder or their spouse or their children is not important to the claims processing application. It is very important to the managed care folks who are data mining the claims data to try to identify successful and unsuccessful trends in health care.
If you want to report on the data quality of the incoming data, Ralph has written many articles on the audit dimension to track the data quality of source and fact data. This would be a first step to quantifying the data quality.
If you want to report on the data quality of the incoming data, Ralph has written many articles on the audit dimension to track the data quality of source and fact data. This would be a first step to quantifying the data quality.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Swiss cheese and dimensional design
It's actually none of those things. I've read Kimball. If you read my question carefully, I'm posing a question about a particular data modeling decision and not advice about data quality profiling. The source system is what it is.
cridal- Posts : 9
Join date : 2009-03-27
Re: Swiss cheese and dimensional design
Stick them on the fact table as nullable.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Event fact
Agree with Boxes on this. Sounds more like an event fact, with a bunch of nullable columns to allow missing data.
lmorgan- Posts : 3
Join date : 2009-02-27
Similar topics
» Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
» Dimensional Modeling design
» Dimensional Modelling Design
» Dimensional design tests
» How to design Benchmark's (Target's) in dimensional model
» Dimensional Modeling design
» Dimensional Modelling Design
» Dimensional design tests
» How to design Benchmark's (Target's) in dimensional model
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum