Modeling many heterogeneous observations in a single fact table

peter.gabriel Fri Apr 01, 2011 2:44 pm

I work with clinical health care data, which is very different in nature than the financial and transactional data common to the business world. Clinical data is extremely heterogeneous, with a high degree of dimensionality and less obvious “fact-ness” about it. I am struggling with the design of a clinical data repository and want to pose a general design question about whether it is appropriate to combine a lot of different types of “observations” into a single fact table.

There are several obvious conformed dimensions for clinical data, including patients, providers and dates. Now, consider several broad categories of information that could plausibly be modeled as facts:

Diagnostic test results – Here the “fact” could be a numerical value (with many different types of units possible), or could also be text (e.g. “Positive”). For some tests, the result is a lengthy text report, which would not be useful to store, but a “factless” fact noting that the test was performed would still be useful information. Dimensional data for these items would include (in addition to the conformed dimensions) information about the type of diagnostic test, various attributes that describe it, concept hierarchies to which it belongs, etc. Different types of tests (e.g. lab vs. radiology) would likely require different dimensions to capture different attributes.
Diagnoses – each time a patient is seen for a visit, diagnoses are rendered and documented according to a standard coding system used for billing. Each diagnosis could be a “factless” fact, with dimensions describing the diagnosis, the concept hierarchies to which it belongs, the coding system used, etc.
Clinical observations – this category includes myriad different assessments of symptoms, physical exam findings and information related to disease status and management. All of these observations can be coded as variables with permissible values ranging from numbers, to text, to Boolean. The conceptual range of variables and responses is extremely broad, and dimensions would be needed to describe the meaning of the variables and their coded responses.
Interventions -- this category would include treatments of different types, ranging from administration of medications, to surgical procedures, to radiation therapy. Different types of interventions would require different dimensions to describe them.

How can all of this information be modeled dimensionally? Or perhaps the question is, can it be modeled dimensionally?

To me, there seems to be little use in dividing it up into many different fact tables representing uniform concepts. This would likely require hundreds of fact tables and most would have relatively few records in them. And most analysis would need to be performed across numerous fact tables. This does not seem much different than using a normalized model.

When you think about it, there aren't a lot of “business processes” at work here, either. At a high level, there are basically two business processes going on – we periodically evaluate patients and make observations about them, and we give them treatments. But the heterogeneity present in both of these areas -- especially the observations -- makes combination into two or three fact tables problematic, as well. There would be hundreds of dimensions required, and most would only be relevant to a small fraction of the facts.

So my main question is, is that okay? Can you have a fact table that is filled with tons of conceptually different pieces of data (e.g. one row indicating that a serum hemoglobin value was 12.2 mg/dL and the next row indicating that a patient reported getting up to urinate an average of 3 times per night in the past week)? It could have hundreds of foreign keys, many of which aren't used or point to "not applicable" records in the dimension table, and lots of factless facts. But it seems like the only way to do it. Would it still provide any value in terms of analytical power and efficiency?

I'm really wondering if dimensional modeling perhaps is just not a useful tool for this type of data. I’m very interested in thoughts from the community, though! Are any others invovled in clinical data warehouses, and if so, how have you modeled your systems?

Thanks,
Pete

Jeff Smith Sat Apr 02, 2011 5:55 pm

It sounds like 4 different fact tables.

Health is generally considered the hardest field for DW, primarily because of the many to many to relationships between treatment, diagnosis, tests, etc. You are probably going to have to use a lot of bridge tables to contect everything.

Dimensional modeling should be able to handle it. But I would definitely look for some help from someone who has at least seen a dimensional DW for the health care industry.

peter.gabriel Wed Apr 06, 2011 11:42 am

Thanks - I appreciate your advice.

Pete

Sponsored content

Modeling many heterogeneous observations in a single fact table

Modeling many heterogeneous observations in a single fact table

Re: Modeling many heterogeneous observations in a single fact table

Re: Modeling many heterogeneous observations in a single fact table

Re: Modeling many heterogeneous observations in a single fact table