ETL to handle deleted reference data

robertfowler Thu Jul 15, 2010 2:40 am

Hello all,

I have a situation that I am unsure how to handle and would appreciate some input.

My operational system allows me to add, edit and remove customers. My source operational system also constrains me from adding a cusomter with the same name and the same short name - these need to be unique on the online and the source database has the constraints to enforce this. Also current system uses operational keys as primary for the dimensions.

The process that I am having a problem with is as follows:
1. User creates a customer: NAME = 'Sponge Bob', SHORT_NAME = 'Bob'
2. ETL creates a dimensional entry in the CUSTOMER dimension for the new customer
3. User deletes the customer
4. ETL runs but there is no action
5. User create a new customer with the identical credentials: NAME = 'Sponge Bob', SHORT_NAME = 'Bob'
6. ETL performs a lookup on the primary key (operational key) and no entry is found in the dimension so a new record is introduced. This is a problem because the integrity constraints applied to the source system on NAME and SHORT_NAME have also been introduced to the dimension tables. The insert fails because of this constraint.

To overcome this problem the legacy system has a process that identifies the fields which need to be unique NAME and SHORT_NAME and modifies the existing records to facilitate the introduction of the new record. E.g
Original records becomes: NAME = 'Sponge bob_1', SHORT_NAME = "Bob_1"
New redord entered NAME = 'Sponge Bob', SHORT_NAME = 'Bob'.

I am not sure how to handle this but what is in place doesn't feel right to me.

If the user has deleted a customer and is adding back the SAME customer then should I/must I use the original primary key, i.e. perform a lookup on NAME, and SHORT_NAME and use the Id that is returned?
What if the customer added back is NOT the same customer but just has the same NAME and SHORT_NAME? The online doesn't provide me with this information, so I can't tell which case I am dealing with.

Should I use a type 2 dimension to handle this and in every case simply end date the dimension entry when a customer is deleted, i.e. improve step 4 above to mark the customer as deleted in some way (set the end date) and simply add a new entry with a new start date and mark as the current row when a new customer is introduced?

Should a dimension table have unique constraints?

Thank you in advance

ngalemmo Thu Jul 15, 2010 12:47 pm

Why do you have "integrity" contraints on the dimensions table? What "integrity" are you trying to enforce?

The data warehouse is not the operational system. For whatever reason the application designers decided to handle certain attributes in the system is their business, not the business of the data warehouse.

The role of the data warehouse is to accurately reflect the information in its source systems. If the customer's short name is Bob, so be it. Store the value and get on with life.

If you do anything, it would be to add a delete flag (or date) attribute to the dimension so you can record when a customer was deleted from the source system. However, I would only do that if there was some relatively easy way to detect it... otherwise it really doesn't matter.

A data warehouse should not mimic an operational system. It is a different data store for a different purpose.

robertfowler Fri Jul 16, 2010 1:11 am

Thank you.

It felt all wrong to me when I first looked at it: having constraints on the dimension tables and changing the values of the attriburtes in order to be unique!

I will set about removing the unique constraints from the dimension tables and specifying the changes to the ETL as to not transform the values from the source system.

A question I do anticipate with this approach is if a user were to report sales by the customer dimension specifying the short name as 'Bob'. Let's assume that the customer had been removed and readded (for example purposes - lets say this had happened 5 time, twice it was really the same customer and 3 time another customer using exactly the same credentials) then would the sales not all be reflected against the name 'Bob'. Wouldn't that be incorrect?

I really appreciate the input
thank you

ngalemmo Fri Jul 16, 2010 11:54 am

Yes and no. I would consider it more as a bad query than a bad result. It is a matter of education... users need to understand the difference between a nickname and a true identifier of the customer. They should understand (and the do, really) that the customer ID is the identifier and if they want totals by customer they need to summarize on that ID. They could certainly include the short name on the report, but not as the sole descriminator for a customer.

Of course, it would also help if you had some way to flag old, deleted customers...

Sponsored content

ETL to handle deleted reference data

ETL to handle deleted reference data

Re: ETL to handle deleted reference data

Re: ETL to handle deleted reference data

Re: ETL to handle deleted reference data

Re: ETL to handle deleted reference data