Data Cleaning - replacing nulls and addresses
4 posters
Page 1 of 1
Data Cleaning - replacing nulls and addresses
Hi,
I need to do some data cleaning as part of a data warehouse project that I'm about to start.
Some of the tasks will include some basic data cleaning (e.g. handling of nulls & replacing with a default value). Then there will be more complex data cleaning required for some address data.
I'll have a staging db which is doing an extract from the source SQL system, plus the main warehouse SQL database. I'll be using SSIS as the ETL tool. Could someone please tell me if the above data cleaning would be best handled - would it be in the staging database, or in the warehouse database?
I'm tempted to say it wont make too much difference, but I'm keen to follow best practice
Thanks!
I need to do some data cleaning as part of a data warehouse project that I'm about to start.
Some of the tasks will include some basic data cleaning (e.g. handling of nulls & replacing with a default value). Then there will be more complex data cleaning required for some address data.
I'll have a staging db which is doing an extract from the source SQL system, plus the main warehouse SQL database. I'll be using SSIS as the ETL tool. Could someone please tell me if the above data cleaning would be best handled - would it be in the staging database, or in the warehouse database?
I'm tempted to say it wont make too much difference, but I'm keen to follow best practice
Thanks!
jryan- Posts : 33
Join date : 2010-09-27
Re: Data Cleaning - replacing nulls and addresses
I am not a big fan of cleansing data going into a data warehouse. It's a big waste of time and effort because it does nothing to correct the root cause of the problem... bad data in the source.
A data warehouse is a great tool for moving an organization towards better management of the data because it exposes data issues that are otherwise unknown. Proper data cleansing should be the result of a feedback loop from the data warehouse back to the source system. Use the data warehouse to identify and expose data issues, work out a resolution, then apply that resolution to the source system. Once the source system is updated, the corrected data flows back into the data warehouse as part of the normal ETL process.
A data warehouse is a great tool for moving an organization towards better management of the data because it exposes data issues that are otherwise unknown. Proper data cleansing should be the result of a feedback loop from the data warehouse back to the source system. Use the data warehouse to identify and expose data issues, work out a resolution, then apply that resolution to the source system. Once the source system is updated, the corrected data flows back into the data warehouse as part of the normal ETL process.
Re: Data Cleaning - replacing nulls and addresses
Thanks for the quick reply,
I do agree - and I'm planning to implement Kimball style "screens" as per the ETL book, that will audit the data coming in for problems, then write to the error event fact, which can then be presented to the appropriate teams to correct in the source systems.
However, I have a host of DQ issues that I dont want to present to the user. E.g. a date (perhaps customer dob) that is out of range, or a textual attribute that is blank. For these items I want to replace them with default values. Do you think that activity woudl be better done in the Staging db or in the warehouse db?
I do agree - and I'm planning to implement Kimball style "screens" as per the ETL book, that will audit the data coming in for problems, then write to the error event fact, which can then be presented to the appropriate teams to correct in the source systems.
However, I have a host of DQ issues that I dont want to present to the user. E.g. a date (perhaps customer dob) that is out of range, or a textual attribute that is blank. For these items I want to replace them with default values. Do you think that activity woudl be better done in the Staging db or in the warehouse db?
jryan- Posts : 33
Join date : 2010-09-27
Re: Data Cleaning - replacing nulls and addresses
Staging database or ODS? I view a staging database as temporary work area for ETL processes. There is no persistent data stored here. The ODS should not be cleansed as it should reflect the source system data in order to support NRT application reporting.
In an ideal world ngalemmo is correct. Unfortunately, this is not an ideal world. Application teams routinely ignore data quality problems because it is not important to their domain. Show up with a list of invalid addresses and they'll stick the request at the bottom of the stack. I agree with your approach on tracking what was cleansed to provide feedback to the business on DQ problems.
If your invalid data is dimension data, you should associate the fact to a special dimension row to identify these errors. If the data is a null fact, be very cautious to replace with a zero. Zero is a valid value for many facts. Zero is also treated differently than null by database operations.
In an ideal world ngalemmo is correct. Unfortunately, this is not an ideal world. Application teams routinely ignore data quality problems because it is not important to their domain. Show up with a list of invalid addresses and they'll stick the request at the bottom of the stack. I agree with your approach on tracking what was cleansed to provide feedback to the business on DQ problems.
If your invalid data is dimension data, you should associate the fact to a special dimension row to identify these errors. If the data is a null fact, be very cautious to replace with a zero. Zero is a valid value for many facts. Zero is also treated differently than null by database operations.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Data Cleaning - replacing nulls and addresses
As per my knowledge Staging area can be used for data cleaning process. Its better to go with staging db.
If you want to present cleaned data to custormer before loading into warehouse then can go for ODS.
Customer can validate ODS data and if updates required on data , have proper update and then load into warehouse.
If you want to present cleaned data to custormer before loading into warehouse then can go for ODS.
Customer can validate ODS data and if updates required on data , have proper update and then load into warehouse.
sac587118- Posts : 2
Join date : 2010-12-10
Similar topics
» Suggestions for cleaning data
» Multiple customer addresses along with specific account level addresses
» Loading and cleaning "randomly" updated, timestamped data from a 3rd party
» When to implement cleaning?
» Customer and Addresses
» Multiple customer addresses along with specific account level addresses
» Loading and cleaning "randomly" updated, timestamped data from a 3rd party
» When to implement cleaning?
» Customer and Addresses
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum