How to load a dimension when the source data reuses IDs with different descriptions?
2 posters
Page 1 of 1
How to load a dimension when the source data reuses IDs with different descriptions?
I presently work with healthcare data, and one of our primary data sources reuses service provider IDs with completely different descriptions whenever the provider is out-of-network. (Providers that are "in-network" have unique Provider IDs, fortunately.)
Examples:
ProvID ProvIDDesc
ITS0001 ALBERT EINSTEIN MEDICAL CENTER
ITS0001 ALLEGHENY GENERAL HOSPITAL
ITS0001 ARIA HEALTH
ITS0001 ARROYO GRANDE COMMUNITY H
ITS0001 BAPTIST HOSPITAL WEST
ITS0001 BAPTIST MED CTR, BC
... etc
Since the providers listed above are always out-of-network, I populate Dim_Providers with a single row for ITS001 with a generic description of "ITS Out-Of-Network". There are other ITS-prefixed ProvIDs with the same problem; these are also collapsed to a single row in Dim_Providers with the same generic description: e.g., ITS002, ITS003, ITS004.... all get a description of "ITS Out-Of-Network". On the upside, all of these out-of-network providers are geographically way far away from our network, so we essentially ignore all of these types of medical claims.
While my approach eliminates the complexity of the situation, I'm nonetheless quite frustrated by the mere fact that a source ID can mean completely different providers. I feel my approach is in "violation" of Kimball dimensional modeling.
How would the rest of you handle this kind of data to create a Kimball compliant Dim_Providers dimension?
Examples:
ProvID ProvIDDesc
ITS0001 ALBERT EINSTEIN MEDICAL CENTER
ITS0001 ALLEGHENY GENERAL HOSPITAL
ITS0001 ARIA HEALTH
ITS0001 ARROYO GRANDE COMMUNITY H
ITS0001 BAPTIST HOSPITAL WEST
ITS0001 BAPTIST MED CTR, BC
... etc
Since the providers listed above are always out-of-network, I populate Dim_Providers with a single row for ITS001 with a generic description of "ITS Out-Of-Network". There are other ITS-prefixed ProvIDs with the same problem; these are also collapsed to a single row in Dim_Providers with the same generic description: e.g., ITS002, ITS003, ITS004.... all get a description of "ITS Out-Of-Network". On the upside, all of these out-of-network providers are geographically way far away from our network, so we essentially ignore all of these types of medical claims.
While my approach eliminates the complexity of the situation, I'm nonetheless quite frustrated by the mere fact that a source ID can mean completely different providers. I feel my approach is in "violation" of Kimball dimensional modeling.
How would the rest of you handle this kind of data to create a Kimball compliant Dim_Providers dimension?
pzajkowski- Posts : 31
Join date : 2009-08-10
Re: How to load a dimension when the source data reuses IDs with different descriptions?
If you want to retain the name, it basically boils down to modifing the process of assigning surrogate keys. The dimension table itself doesn't need to change.
For the population of 'ITS' id's, use a look-up process that includes name as part of the natural key.
For the population of 'ITS' id's, use a look-up process that includes name as part of the natural key.
Re: How to load a dimension when the source data reuses IDs with different descriptions?
ngalemmo wrote:
For the population of 'ITS' id's, use a look-up process that includes name as part of the natural key.
Now, why didn't I think of that? Simple and effective. (I think I get stuck in the mindset of merely using source IDs for the look-up process, instead of considering additional fields.)
I have a followup concern, however. Since our data sources provide a total mixed bag of spellings for a given service provider, I presently run a process equivalent to maintaining a master person index so that a consistent provider name can be stamped on all relevant rows, especially since not all of our sources supply a National Provider Index (NPI). For example, at least dozen different source IDs and spellings exist for "Rochester General Hospital": "RG hosp", "Roc Gen H", "Hosp, Rochester G".... etc. Without stamping a single matching provider description of "Rochester General Hospital" in Dim_Providers, I'd be writing a lot of messy sql code to capture the various versions of this single provider. It seems to me that this is a necessary evil, unless you can suggest otherwise?
pzajkowski- Posts : 31
Join date : 2009-08-10
Re: How to load a dimension when the source data reuses IDs with different descriptions?
That gets messy no matter what you do...
One solution would be to have two name fields on the dimension, a raw name and a clean name. For new rows populate both columns with the raw name (which serves as part of the natural key) and allow users to update the clean name value with corrected spellings. Always report using the clean name column. Names get fixed over time and it doesn't mess up reporting queries. By keeping the raw name, future references with the same spelling would be associated with the same dimension row and corrected (clean) spelling.
For normal providers you can leave the raw name blank or populate it, it doesn't matter.
One solution would be to have two name fields on the dimension, a raw name and a clean name. For new rows populate both columns with the raw name (which serves as part of the natural key) and allow users to update the clean name value with corrected spellings. Always report using the clean name column. Names get fixed over time and it doesn't mess up reporting queries. By keeping the raw name, future references with the same spelling would be associated with the same dimension row and corrected (clean) spelling.
For normal providers you can leave the raw name blank or populate it, it doesn't matter.
Re: How to load a dimension when the source data reuses IDs with different descriptions?
OK - Your suggestions and explanation is in-line with what I've been considering.
As always, thanks for the feedback and assistance.
-- Pete
As always, thanks for the feedback and assistance.
-- Pete
pzajkowski- Posts : 31
Join date : 2009-08-10
Similar topics
» Getting Dimension Source Data from Existing Views
» Source Fact data coming in at different levels of a conformed dimension
» Accumulating Fact Table Dates as Role Play Dimension and Descriptions
» multi ERP source system, single DW load strategy
» Using the Dimensional Data Warehouse as source data for the OLTP process
» Source Fact data coming in at different levels of a conformed dimension
» Accumulating Fact Table Dates as Role Play Dimension and Descriptions
» multi ERP source system, single DW load strategy
» Using the Dimensional Data Warehouse as source data for the OLTP process
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|