What to do when the weighting factor of a bridge table no longer seems relevant?

pzajkowski Tue Feb 16, 2010 4:34 pm

I have a fact table that has healthcare claims data. Since a given claim can have multiple diagnoses, I'll use a bridge table that links a bridgekey from the fact table to the bridge table. The bridge table will also have a diagnosiskey that ultimately links to DimDiagnosis. This scenario is often cited in Kimball's books, BI bloggers, etc...

In order to ensure a dollar amount associated with a given claim is aggregated correctly when a bridge table is in play, the bridge table has a weighting factor that is multiplied against the dollar amount to ensure the actual claim amount is correct.

All of this is just fine if all diagnoses of a given claim are being considered. But, what happens when only a couple of the diagnoses of a claim are being considered for a particular query?

For example, suppose a claim for $100 has four diagnoses (A,B,C,D) in the bridge table, and the bridge table has a weight of .25 for each diagnoses of the claim. Since the claim line will blowout to four rows when joined to the bridge table, multiplying .25 * $100 and then summing the result yields a correct amount of $100 for the claim.

But, what if my SQL query only cares about health claims where the diagnoses are A & C? What about A, C, & D? Or even just A? How do I correctly sum the claim amount when the weighting factor is no longer relevant?

I'd love to hear how others solve this problem!
--Pete

ngalemmo Wed Feb 17, 2010 1:39 pm

You need to discuss this with your business people. The primary question is, when doing cost analysis, do secondary diagnosis matter? Are you really distributing costs by diagnosis or are you reporting the cost of claims with certain combinations of diagnoses? If the latter is the case, you need to restructure diagnosis into groups and carry the group key in the claim. If you report claims beloging to certain groups you will not double count the claim no matter how many diagnoses are being selected.

pzajkowski Wed Feb 17, 2010 3:04 pm

Thank you for replying, ngalemmo.

Rather than focusing on the cost analysis of healthclaims, I'd like to take a step back...

In general, I don't understand how facts (i.e., counter fields and dollar fields) will sum correctly (via SQL queries) if only a portion of a multivalued dimension group is being referenced. I can certainly write sophisticated sql queries to ensure the results will end up being correct, but that approach seems to defeat the point of having a dimensional data warehouse.
(As a side note, my company isn't working with SSAS 2005, which I believe will consume bridge tables (M:M) quite nicely.)

ngalemmo Wed Feb 17, 2010 4:52 pm

It depends on the query. Lets say you have a query such as: "Give me the cost of all claims that have diagnosis codes X and Y". This is a basic multivalued dimension problem.

It can be solved with the right data structures. First, the claim has an FK to a diagnosis group table which contains every unique combination of diagnoses found on claims. If multiple claims have the same combination of diagnoses, they would all reference the same group. You then have a bridge table containing diagnosis group key and diagnosis key. The latter references a traditional diagnosis dimension table.

The query would be structured as follows:

SELECT ...stuff you want to see...
FROM claimFact c, diagBridge b1, diagBridge b2, diagDim d1, diagDim d2
WHERE c.diagBridgeKey = b1.diagBridgeKey
and b1.diagKey = d1.diagKey
and d1.diagCode = 'X'
and c.diagBridgeKey = b2.diagBridgeKey
and b2.diagKey = d2.diagKey
and d2.diagCode = 'Y'

You can also do all sorts of logic such as claims with diagnosis X or claims with diagnosis Y and Z...

SELECT ...stuff you want to see...
FROM claimFact c, diagBridge b1, diagBridge b2, diagBridge b3, diagDim d1, diagDim d2, diagDim d3
WHERE c.diagBridgeKey = b1.diagBridgeKey
and b1.diagKey = d1.diagKey
and d1.diagCode = 'X'
or (c.diagBridgeKey = b2.diagBridgeKey
and b2.diagKey = d2.diagKey
and d2.diagCode = 'Y'
and c.diagBridgeKey = b3.diagBridgeKey
and b3.diagKey = d3.diagKey
and d3.diagCode = 'Z')

In all cases, you only get one row per selected fact.

pzajkowski Wed Feb 17, 2010 5:10 pm

Thank you for your detailed response. Your description of how to structure the tables is inline with what I've been considering; so, I feel good that I've been understanding the message and making sense of Kimball's book, too. I guess I've been having a hard time believing it will all work out.

I will certainly work with your queries to gain a better understanding.

Thanks again

pzajkowski Thu Feb 18, 2010 12:28 pm

ngalemmo ,

At the end of the work day yesterday I briefly took a look at your last explanation and the sample sql. It suddenly dawned on me the various sub-conscious assumptions I've been making about the dimensional model, bridge tables, and how best to query the data. For example, it seemed to me (up until now) that Kimball's approach was suggesting that the dimensional model, itself, would be all that one would need: i.e., no need for complex queries or the use of excessive join statements or correlated queries, etc... At the same time, I found myself limiting the power of roles that dimensions can play, such as Dim_Date; it never occurred to me to consider using multiple occurrences of a bridge table as you did in your sample sql. All of this is actually embarrassing to admit as I take pride in my ability to write advanced sql (correlated queries, common table expressions, recursive joins, user-defined functions...) I can't remember the last time I found myself trapped by my own assumptions/interpretations.

So, for you, I imagine my questions about bridge tables and the use of weight factors to seem fairly confusing since, in reality, the answer is quite simple as you've kindly laid out for me. I might as well have been asking someone to explain why 2 + 2 = 4.

Many thanks for your assistance.
--pete

pzajkowski Tue Mar 16, 2010 2:37 pm

ngalemmo wrote:
...You can also do all sorts of logic such as claims with diagnosis X or claims with diagnosis Y and Z...

SELECT ...stuff you want to see...
FROM claimFact c, diagBridge b1, diagBridge b2, diagBridge b3, diagDim d1, diagDim d2, diagDim d3
WHERE c.diagBridgeKey = b1.diagBridgeKey
and b1.diagKey = d1.diagKey
and d1.diagCode = 'X'
or (c.diagBridgeKey = b2.diagBridgeKey
and b2.diagKey = d2.diagKey
and d2.diagCode = 'Y'
and c.diagBridgeKey = b3.diagBridgeKey
and b3.diagKey = d3.diagKey
and d3.diagCode = 'Z')

In all cases, you only get one row per selected fact.

ngalemmo, I am struggling to make the above example actually work without blowing out the number of rows: i.e., I'm ending up with duplicate rows.

Here is a sample data set:

CLAIM_ID	ICD9_DIAG_01	ICD9_DIAG_02	ICD9_DIAG_03	ICD9_DIAG_04	$PLAN_LIAB	DiagGroupKey
1	78900	53500	28521	41400	13.74	2305
2	25000	2724	4019	42789	4.16	1409
3	41400	41092	78609	7812	102.36	1504
4	586	28521	2722	41400	77.19	4291
5	586	28521	2722	41400	13.56	4291

Below is an example of the query in SQL Server. I did not include joins to a Dim_Diagnosis table. Instead, I simply hard coded the diag keys I'm looking for that exist in the diag bridge table:

Code:: SELECT A.* FROM #CLAIMS_DATA A JOIN #DIAG_BRIDGE B ON A.DIAGGROUPKEY = B.DIAGGROUPKEY JOIN #DIAG_BRIDGE C ON A.DIAGGROUPKEY = C.DIAGGROUPKEY JOIN #DIAG_BRIDGE D ON A.DIAGGROUPKEY = D.DIAGGROUPKEY WHERE B.DIAGKEY = 41400 OR ( C.DIAGKEY = 4019 AND D.DIAGKEY = 25000 )

Here's what's in the bridge table:

Code:: DiagGroupKey DiagKey ------------ ------------------------------------- 1409 25000 1409 2724 1409 4019 1409 42789 1504 41400 1504 41092 1504 78609 1504 7812 2305 78900 2305 53500 2305 56211 2305 41400 4291 586 4291 28521 4291 2722 4291 41400

And here are the results (blown out) rather than returning the original five rows from the data set:

Code:: CLAIM_ID ICD9_DIAG_01 ICD9_DIAG_02 ICD9_DIAG_03 ICD9_DIAG_04 PLAN_LIAB DiagGroupKey ----------- ------------ ------------ ------------ ------------ --------- ------------ 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 1 78900 53500 56211 41400 13.74 2305 2 25000 2724 4019 42789 4.16 1409 2 25000 2724 4019 42789 4.16 1409 2 25000 2724 4019 42789 4.16 1409 2 25000 2724 4019 42789 4.16 1409 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 3 41400 41092 78609 7812 102.36 1504 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 4 586 28521 2722 41400 77.19 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291 5 586 28521 2722 41400 13.56 4291

If I change the query to be SELECT DISTINCT, then I'll return the correct rows... but surely I shouldn't need to specify DISTINCT? I could also do a GROUP BY, by then Sum of Plan_Liab will be incorrect.

Perhaps I'm still missing something?

ngalemmo Tue Mar 16, 2010 3:44 pm

What is the '#' character in the SQL? I am not familiar with that syntax. The result set you are getting appears to be due to outer joins...

pzajkowski Tue Mar 16, 2010 7:14 pm

In SQL Server, temporary tables are noted with a # sign. A temporary table is actually a real table, but disappears once my connection to the server is closed. The #claims table is merely a table.

For the code I posted, pretend the # sign doesn't exist in the query. If I were querying against the actual Claims table, my code would still return duplicate records.

Also, the joins in the query are all inner joins.

--Peter

ngalemmo Tue Mar 16, 2010 10:44 pm

I played around with it in Access... It's been a while, but it appears the OR part is causing a cartesian product. Each of the bridge rows for claims 1, 3, 4, and 5 are being combined with the 4 bridge rows from claim 2 and claim 2 is being combined with each of the 41400 diagnosis bridge rows from the other 4 claims.

If you do each piece individually, it works ok:

SELECT Claim.CLAIM_ID, Claim.DiagGroupKey, bridge1.diagnosis
FROM Claim INNER JOIN bridge1 ON Claim.DiagGroupKey = bridge1.[group key]
WHERE (((bridge1.diagnosis)=41400));

CLAIM_ID DiagGroupKey diagnosis
3 1504 41400
1 2305 41400
4 4291 41400
5 4291 41400

SELECT Claim.CLAIM_ID, Claim.DiagGroupKey, bridge2.diagnosis, bridge3.diagnosis
FROM (Claim INNER JOIN bridge2 ON Claim.DiagGroupKey = bridge2.[group key]) INNER JOIN bridge3 ON Claim.DiagGroupKey = bridge3.[group key]
WHERE (((bridge2.diagnosis)=4019) AND ((bridge3.diagnosis)=25000));

CLAIM_ID DiagGroupKey bridge2.diagnosis bridge3.diagnosis
2 1409 4019 25000

So, to do ORs it appears you need to do Unions of each AND set. In retrospect it makes sense, since you want one result set OR another result set. DISTINCT can resolve the problem, but only if query returns just claims attributes. If you include any columns from diagnosis or the bridge the cartesian product would still return multiple rows. A union of two AND queries would be faster. My appologies... I implemented this about 5 years ago and forgot the details.

pzajkowski Wed Mar 17, 2010 7:43 am

Thanks for taking a follow-up look at this. Your assessment of OR creating the cartesian results is what I was also thinking, leading to the conclusion that a UNION is required. It's not an ideal solution, but certainly useable.

Apparently SSAS 2005 (SQL Server Analysis Services 2005) has the ability to leverage bridge tables without yielding a catersian resultset. A BI consultant's website (www.sqlbi.eu) demonstrates in SSAS 2005 that facts will aggregate correctly in situations where you might have many customers who have many accounts, and where an account can have many customers; a bridge table of account and customer keys connects these two dimensions. It's a thing of beauty to see how a single account of $100 with two customers will display each customer having access to $100, but the grand total for the account is still $100.

My company is not yet in a position to implement SSAS, however, so, writing SQL queries will still be the mechanism for pulling data to generate reports. The catch will be to convince my colleagues (SQL analysts) that writing UNION queries will be better than the old method of querying each diagnosis field separately.

By the way, you mention that you implemented your data warehouse 5 years ago. What platform/product do you use to query the data?

Again, thanks for your assistance.
--pete

ngalemmo Wed Mar 17, 2010 10:25 am

I figured it out... this query will work:

SELECT Claim.CLAIM_ID, Claim.PLAN_LIAB, Claim.DiagGroupKey, bridge1.[group key], bridge1.diagnosis, bridge2.[group key], bridge2.diagnosis
FROM ((Claim INNER JOIN bridge1 ON Claim.DiagGroupKey=bridge1.[group key]) INNER JOIN bridge2 ON Claim.DiagGroupKey=bridge2.[group key])
WHERE ((([bridge1.diagnosis]=41400 and [bridge2.diagnosis]=41400)Or ([bridge1.diagnosis]=4019 And [bridge2.diagnosis]=25000)));

All the bridge references need to be qualified. In the original query it was getting bridge 1 = 41400 and anything for bridges 2 & 3 and second half was getting specific values for bridge 2 & 3 and anything for bridge 1.

You only need as many bridges as there are ANDs for any one selection. If a group has less values, repeat one of the values (it doesn't matter which) for the remainer of the bridge aliases.

pzajkowski Wed Mar 17, 2010 11:17 am

Very interesting... looks promising...

Well, after some further testing, your rule for the number of bridges based on the number of "AND"s almost works.

Try adding one additional row in the claims table that contains all three diagnoses (25000, 4019,41400); your latest query will generate duplicate rows, unfortunately. As you mentioned previously, it appears that the OR creates a cartesian. Using a UNION may be the solution afterall, even though I sure do like your discovery of the number of bridges based on the number of ANDs...

ngalemmo Wed Mar 17, 2010 12:26 pm

pzajkowski wrote:By the way, you mention that you implemented your data warehouse 5 years ago. What platform/product do you use to query the data?
--pete

It was a clickstream warehouse for an automotive website. There were two multivalued dimensions relating to a session: the models someone viewed during a session and the keywords relating to the search phase someone used to get to the site. For example, they could count vistors who came in looking for 'Honda' and 'Civic' and who also looked at 'Toyota Prius' and 'Ford Focus'. Or, if someone looked at a '2010 Ford Mustang', what other models did they cross-shop, ranked by frequency.

It was on an Oracle platform. Performance was really good, even with a session fact table with over 500 million rows.

Try adding one additional row in the claims table that contains all three diagnoses (25000, 4019,41400); your latest query will generate duplicate rows, unfortunately.

You are right about that. Logically it makes sense since the instance satisfies both conditions. A risk of using OR's. In the case of the website analysis, it was not an issue... if they were doing an analysis of someone cross-shopping any 2010 Ford, they would summarize it by Ford model, so the model level numbers would be correct.

Vishy Tue Mar 06, 2012 6:28 am

SELECT Claim.CLAIM_ID, Claim.PLAN_LIAB, Claim.DiagGroupKey, bridge1.[group key], bridge1.diagnosis, bridge2.[group key], bridge2.diagnosis
FROM ((Claim INNER JOIN bridge1 ON Claim.DiagGroupKey=bridge1.[group key]) INNER JOIN bridge2 ON Claim.DiagGroupKey=bridge2.[group key])
WHERE ((([bridge1.diagnosis]=41400 and [bridge2.diagnosis]=41400)Or ([bridge1.diagnosis]=4019 And [bridge2.diagnosis]=25000)));

You only need as many bridges as there are ANDs for any one selection

Just wanted to understand why do we need that many bridges, do you mean that same daignosis may be there in multiple daiganosis groups and you want to filter only a particular diagnsis belonging to particular diagnosis group??

pzajkowski Tue Mar 06, 2012 11:56 am

Vishy wrote:
SELECT Claim.CLAIM_ID, Claim.PLAN_LIAB, Claim.DiagGroupKey, bridge1.[group key], bridge1.diagnosis, bridge2.[group key], bridge2.diagnosis
FROM ((Claim INNER JOIN bridge1 ON Claim.DiagGroupKey=bridge1.[group key]) INNER JOIN bridge2 ON Claim.DiagGroupKey=bridge2.[group key])
WHERE ((([bridge1.diagnosis]=41400 and [bridge2.diagnosis]=41400)Or ([bridge1.diagnosis]=4019 And [bridge2.diagnosis]=25000)));

You only need as many bridges as there are ANDs for any one selection

Just wanted to understand why do we need that many bridges, do you mean that same daignosis may be there in multiple daiganosis groups and you want to filter only a particular diagnsis belonging to particular diagnosis group??

(Wow -- been a while since I kicked off this thread...) Vishy, If I understand your question, then the answer is "yes" regarding a single diagnosis existing in multiple diagnosis groups. For example, in our health Claims datamart there are over 22,000 diagnosis groups in our DiagnosisBridge table that contain some form of ICD9 250* (Diabetes) -- i.e., ICD9 250* exists along side other diagnoses and in different placements (primary diag vs secondary vs ....). Overall, every distinct combination of diagnoses that exist in the Claims data are represented in the DiagnosisBridge table. So, when a given claim line (row) contains a combination of diagnoses that match a combination in the bridge table, the claim line is then loaded into the ClaimsFact table with the corresponding bridgekey that represents the combination of diags for that claim. (If a new combination of diagnoses appears that does not exist in the bridge table, then this combination is added to the bridge table with a bridgekey assigned.)

In the sample query below, I reference a SQL view that has already joined Dim_Diagnosis to BridgeDiagnosis by DiagnosisKey. In the case below, I'm searching for claims that contain two specific diagnoses:

Code:: /******************************************************** -- FIND CLAIMS WITH TWO DIAGS IN THE SAME CLAIM * -- AND THE TWO DIAGS ARE IN ANY ORDER * * * * SINCE THE CRITERIA USING AN 'AND', A SECOND * * JOIN TO THE BRIDGE TABLE IS REQUIRED. * * * * CRITERIA THAT USE 'AND' ALONG WITH A SPECIFIC DIAG * * WILL RETURN A DISTINCT LIST AUTOMATICALLY. * ********************************************************/ SELECT TOP 50 C.* FROM dbo.ClaimsFact C JOIN dbo.[v_BridgeDiagnosis_to_DimDiagnosis] V1 ON C.[DiagnosisBridgeKey] = V1.[DiagnosisBridgeKey] JOIN dbo.[v_BridgeDiagnosis_to_DimDiagnosis] V2 ON C.[DiagnosisBridgeKey] = V2.[DiagnosisBridgeKey] WHERE V1.[DiagnosisLookUpCode] = '4019' AND V2.[DiagnosisLookUpCode] = '25000'

In this next example, I'm looking for the same two diagnoses, but I specifically want the first diagnosis to be 25000. (Each diagnosis group in the bridge table records the order of the diagnoses as they existed on the original/source claims data):

Code:: -- FIND CLAIMS WITH TWO DIAGS IN THE SAME CLAIM -- AND SPECIFY IF ONE OF THE DIAGS IS THE PRIMARY DIAG SELECT TOP 50 C.* FROM dbo.ClaimsFact C JOIN dbo.[v_BridgeDiagnosis_to_DimDiagnosis] V1 ON C.[DiagnosisBridgeKey] = V1.[DiagnosisBridgeKey] JOIN dbo.[v_BridgeDiagnosis_to_DimDiagnosis] V2 ON C.[DiagnosisBridgeKey] = V2.[DiagnosisBridgeKey] WHERE V1.[DiagnosisLookUpCode] = '4019' AND V2.[DiagnosisLookUpCode] = '25000' AND V2.[DiagnosisOrder] = 1

Does that help?

Sponsored content

What to do when the weighting factor of a bridge table no longer seems relevant?

What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?

Re: What to do when the weighting factor of a bridge table no longer seems relevant?