When to implement cleaning?
When to implement cleaning?
I've been reading through Chapter 4 (Cleaning and Conforming) of The Data Warehouse ETL Toolkit. There's an interesting "Data Quality Process Flow" which discusses various types of data quality "screens" that might need to be performed, and a great schema model for auditing problematic records with a central error event fact table. However, I'm missing a key point -- I can see how all this could be used to detect bad data, but what are some recommended approaches for actually cleaning up that bad data? Or do the screens pull double-duty, and clean up dimension and fact data while they're creating the audit records? If not, should the subsequent ETL processes tie in to that audit data somehow (e.g. to determine which records to skip in the case of duplicates)?
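For context, here is roughly how I picture a row in that error event fact table. This is just a sketch in Python to make the question concrete; the field names are my own guesses, not the book's exact schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ErrorEvent:
    """One row in the error event fact table: a single screen
    failure for a single source record (field names are my own)."""
    event_timestamp: datetime   # when the screen fired
    batch_id: int               # ETL batch that detected the problem
    screen_name: str            # which data quality screen fired
    source_table: str           # staging table the record came from
    source_record_key: str      # natural/business key of the offending record
    severity: str               # e.g. 'warning' = log only, 'error' = reject
    error_description: str      # human-readable detail for the audit report

# Example: a duplicate-customer screen recording one offending row
event = ErrorEvent(
    event_timestamp=datetime.now(),
    batch_id=20110523,
    screen_name="duplicate_customer_natural_key",
    source_table="stg_customer",
    source_record_key="CUST-000123",
    severity="error",
    error_description="Natural key appears 2 times in this batch",
)
```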
EyeOpener- Posts : 4
Join date : 2011-05-23
Re: When to implement cleaning?
My opinion is that data cleansing is a business process, not a technical one. It needs to occur at the source, not in the data warehouse.
The idea of the screens to detect bad data is to provide a feedback mechanism from the data warehouse to notify the business where issues exist. This provides actionable information that can be used to correct data problems.
One area of controversy is what to do with the erroneous values once they are detected. There are those who say such data should not be loaded into the DW. I disagree. It needs to be loaded so the DW accurately reflects what is in the operational systems. It also needs to be in there so that the DW can be used to identify and assess the effect of the erroneous data.
Re: When to implement cleaning?
Some things you clean, others you let be. Addresses, for example: I cleanse addresses so I can leverage a conformed address dimension. Same for names; I need cleansed names to uniquely identify a customer across different sources. Other errors I let go. I capture those errors in my data quality fact table and build a dashboard on top of that fact to raise awareness. Additionally, you can't improve what you don't measure. Running around the organization telling everyone data quality is bad doesn't have as much impact as showing them the issues and the associated cost.
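To give a sense of what that dashboard sits on, it's really just a roll-up of the data quality fact. A toy sketch below; the table and column names are made up for illustration, not our real ones:

```python
import sqlite3

# Illustrative only: a toy data quality fact and the kind of roll-up
# a dashboard would be built on. Table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_quality_fact (
        screen_name     TEXT,
        source_table    TEXT,
        record_key      TEXT,
        severity        TEXT,
        estimated_cost  REAL   -- rough business cost of the defect
    )
""")
conn.executemany(
    "INSERT INTO data_quality_fact VALUES (?, ?, ?, ?, ?)",
    [
        ("missing_postal_code", "stg_address",  "A-17", "warning", 1.50),
        ("missing_postal_code", "stg_address",  "A-42", "warning", 1.50),
        ("duplicate_customer",  "stg_customer", "C-01", "error",  25.00),
    ],
)

# Dashboard roll-up: issue counts and estimated cost by screen
for row in conn.execute("""
    SELECT screen_name, COUNT(*) AS issues, SUM(estimated_cost) AS cost
    FROM data_quality_fact
    GROUP BY screen_name
    ORDER BY cost DESC
"""):
    print(row)
```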
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: When to implement cleaning?
Thanks for the replies. I've determined that we do need to implement some types of cleaning, such as removing duplicate dimension records. And I've partially answered my own question: the data-quality screens described in the ETL book are used for both reporting and filtering. A screen is "physically viewed by the ETL team as a status report on data quality, but it's also a kind of gate that doesn't let bad data through." I'm still looking for ideas on how best to actually implement these screens.
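Concretely, the shape I have in mind is something like this: each screen logs every violation (the status report) and only passes clean rows downstream (the gate). A rough Python sketch, with all names invented:

```python
def run_screen(rows, screen_name, predicate, error_events, severity="error"):
    """Apply one data quality screen.

    Acts as both report and gate: every failing row is recorded as an
    error event (the report), and only passing rows are returned (the gate).
    Rows flagged 'warning' are reported but still let through.
    """
    passed = []
    for row in rows:
        if predicate(row):
            passed.append(row)
        else:
            error_events.append({
                "screen_name": screen_name,
                "record_key": row.get("customer_id"),
                "severity": severity,
                "detail": f"failed {screen_name}",
            })
            if severity == "warning":   # warnings are logged but not filtered
                passed.append(row)
    return passed

# Example: reject rows with no natural key, warn on missing postal codes
staging = [
    {"customer_id": "C-01", "postal_code": "90210"},
    {"customer_id": None,   "postal_code": "10001"},
    {"customer_id": "C-02", "postal_code": None},
]
events = []
rows = run_screen(staging, "missing_customer_id",
                  lambda r: r["customer_id"] is not None, events)
rows = run_screen(rows, "missing_postal_code",
                  lambda r: r["postal_code"] is not None, events, severity="warning")
print(len(rows), "rows pass;", len(events), "error events logged")
```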
In our environment, we have an initial staging area that holds raw data extracted from the source systems in its original form, with metadata added. Then we have a secondary staging area where we build dimensions and facts, and finally a presentation area open for reporting. The secondary staging area is the natural (and perhaps only) place to implement this cleaning.
Currently we have constraints on the dimension tables to ensure uniqueness. The problem with this approach is that the ETL errors out when the odd duplicate appears, and recovery is manual. I'm thinking that we could simply remove the constraints, build dimensions as usual, then run our screens to remove duplicates (and perform whatever other cleaning tasks we want). If designed comprehensively, these screens would eliminate the need for constraints.
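For the duplicate screen specifically, I'm picturing something like the sketch below: pick a survivor by rule instead of erroring out, and record every discarded row so the duplicate is still auditable. Sketch only; the column names are invented:

```python
def deduplicate_dimension(rows, natural_key, order_by, error_events):
    """Collapse duplicate dimension rows that share a natural key.

    The survivor is the row with the highest value of `order_by`
    (e.g. a load timestamp); every discarded row is written to the
    error event list so the duplicate remains auditable.
    """
    survivors = {}
    for row in rows:
        key = row[natural_key]
        current = survivors.get(key)
        if current is None:
            survivors[key] = row
        else:
            keep, drop = (row, current) if row[order_by] > current[order_by] else (current, row)
            survivors[key] = keep
            error_events.append({
                "screen_name": "duplicate_" + natural_key,
                "record_key": key,
                "severity": "error",
                "detail": f"duplicate dropped (kept row loaded at {keep[order_by]})",
            })
    return list(survivors.values())

# Example: two staging rows for the same customer; the later load wins
events = []
dim_customer = deduplicate_dimension(
    [
        {"customer_id": "C-01", "name": "Acme",      "load_ts": "2011-05-22"},
        {"customer_id": "C-01", "name": "Acme Corp", "load_ts": "2011-05-23"},
    ],
    natural_key="customer_id",
    order_by="load_ts",
    error_events=events,
)
print(dim_customer)   # one surviving row; the dropped duplicate is in `events`
```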
Is that a typical approach to cleaning (i.e. build it dirty, then clean it up)? Or are there other approaches I should consider?
EyeOpener- Posts : 4
Join date : 2011-05-23
Similar topics
» Suggestions for cleaning data
» Data Cleaning - replacing nulls and addresses
» How to implement filebased filtering screens in perl?
» ETL Auditing standards and how to implement them.
» Where to implement SCD type 2 logic?