Business Logic: DWH vs. Source system
5 posters
Page 1 of 1
Business Logic: DWH vs. Source system
Hello everyone,
We are currently implementing a Data Warehouse (consolidating data from several source systems) and we had an argument on where the business logic should reside.
Simplified example: The source contains the fields “amount”, a “discount” and a “total amount”. The “total amount” is supposed to be the “amount”*(1-“discount”), but for us, the DWH team, it is already available as a readily loadable fixed field.
So, should we (a) just take the “total amount” field or (b) take the raw fields “amount” and “discount” to calculate the “total amount” in the ETL?
In the first case, one would basically not bother about how the field is derived and leave its logic to the source system. However, I reckon there will be almost certainly records where the total amount != amount*(1-discount) leaving to inconsistencies if people start to compare the source system-delivered “total amount” with the “actual amounts” and “discounts”.
On the other hand, doing the calculation in the ETL would at least ensure consistency as it would capture the most detailed data. However, the DWH would be not just a dispositive system anymore. I would say that this is ok – in fact, that the DWH is actually supposed to be the central repository of business rules and that it is easier and more consistent than relying on a potentially unstable source system.
So, what’s the best way?
Btw- I know that there is always the possibility to load both and leave it up to the users which field to choose, but my question is more of a general nature on what the best practice would prescribe.
We are currently implementing a Data Warehouse (consolidating data from several source systems) and we had an argument on where the business logic should reside.
Simplified example: The source contains the fields “amount”, a “discount” and a “total amount”. The “total amount” is supposed to be the “amount”*(1-“discount”), but for us, the DWH team, it is already available as a readily loadable fixed field.
So, should we (a) just take the “total amount” field or (b) take the raw fields “amount” and “discount” to calculate the “total amount” in the ETL?
In the first case, one would basically not bother about how the field is derived and leave its logic to the source system. However, I reckon there will be almost certainly records where the total amount != amount*(1-discount) leaving to inconsistencies if people start to compare the source system-delivered “total amount” with the “actual amounts” and “discounts”.
On the other hand, doing the calculation in the ETL would at least ensure consistency as it would capture the most detailed data. However, the DWH would be not just a dispositive system anymore. I would say that this is ok – in fact, that the DWH is actually supposed to be the central repository of business rules and that it is easier and more consistent than relying on a potentially unstable source system.
So, what’s the best way?
Btw- I know that there is always the possibility to load both and leave it up to the users which field to choose, but my question is more of a general nature on what the best practice would prescribe.
inglev- Posts : 3
Join date : 2009-02-17
Re: Business Logic: DWH vs. Source system
I'm not sure that there's an official "best practice" for something like this (of course there might be and I could be unaware of it). But if it were up to me I think I would use the pre-calculated source system total field, provided I could ensure its accuracy. Using that field most accurately represents your system of record and eliminates the possibility of introducing math errors. It also saves you a step (albeit a simple one) by eliminating the need to calculate.
However, if the field is occasionally inaccurate then I'd first try to push that fix upstream to the source system people. If they won't fix it then you could either calculate the total amount yourself or you could create a variance field that would allows the users to see the inaccuracies while still allowing them to sum the fields together to get the correct total amount, despite the source system not providing it. Of course you could also just dump the source system pre-calculated total field altogether and do it yourself in the ETL, provided your users are okay with that.
True you could store both an ETL-calculated total field and a source system-calculated total field but I think that might be confusing to end users.
In the end I think it depends on the accuracy of your source data and your end user requirements.
Hope this helps.
However, if the field is occasionally inaccurate then I'd first try to push that fix upstream to the source system people. If they won't fix it then you could either calculate the total amount yourself or you could create a variance field that would allows the users to see the inaccuracies while still allowing them to sum the fields together to get the correct total amount, despite the source system not providing it. Of course you could also just dump the source system pre-calculated total field altogether and do it yourself in the ETL, provided your users are okay with that.
True you could store both an ETL-calculated total field and a source system-calculated total field but I think that might be confusing to end users.
In the end I think it depends on the accuracy of your source data and your end user requirements.
Hope this helps.
BrianJarrett- Posts : 61
Join date : 2009-02-03
Age : 50
Location : St. Louis, MO
Re: Business Logic: DWH vs. Source system
Take the raw columns and derived columns. At some point, you'll want both. You don't necessarily have to expose both to the reporting community. I would imagine that some people would be interested in raw columns while others are simply interested in the total. This is also an easy error to catch during the ETL process. Review Ralph's error check writings for a good process to capture and report these defects. You would be surprised how motivated source systems are at cleaning up their dirty laundry once it is published as daily metrics on a dashboard!
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Business Logic: DWH vs. Source system
Many thanks guys for the anwers. I guess, the best way indeed might be the way to take them both and decide later on how to publish the figures.
inglev- Posts : 3
Join date : 2009-02-17
Re: Business Logic: DWH vs. Source system
One thing to consider it that "discount" is a non-additive field. Unless you are looking at detail rows, it has no value. When summarizing, your formula should be Summary Discount = 1 - (sum(total amount)/sum(amount)) to get a weighted average across all the records in the group.
Another thing, assuming we are talking about invoice data. Did the customer paid the Total Amount regardless of whether it is calculated correctly? If so, then you don't want to recalculate it in the DW. If you recalculate it, then you DW will loose credibility because it does not represent reality and reality is the name of the game. Also, recalculating it, even if it is correct in the source system, may pose a problem with rounding errors, if the rounding is done differently.
Personally, I'd put the Amount and Total Amount in the fact table and calculate the discount.
Another thing, assuming we are talking about invoice data. Did the customer paid the Total Amount regardless of whether it is calculated correctly? If so, then you don't want to recalculate it in the DW. If you recalculate it, then you DW will loose credibility because it does not represent reality and reality is the name of the game. Also, recalculating it, even if it is correct in the source system, may pose a problem with rounding errors, if the rounding is done differently.
Personally, I'd put the Amount and Total Amount in the fact table and calculate the discount.
higerdm- Posts : 1
Join date : 2009-02-03
Age : 64
Location : Wichita, KS
Similar topics
» Poorly sturctured data at source system
» Null and Blank Dates from Source System
» Recursive Relationships in Source System
» Deletes in the source system for Type-2...
» Extracting data when there is no timestamp in the source system
» Null and Blank Dates from Source System
» Recursive Relationships in Source System
» Deletes in the source system for Type-2...
» Extracting data when there is no timestamp in the source system
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum