Dealing with deduplication, merging records, etc in the warehouse layer

Juan Tue Dec 06, 2011 1:32 pm

Hi,

From my experience and Kimball ETL Architecture training I know that there are (not cast in concrete) 4 layers:
1) Staging 2) Transform 3) Conform 4) Warehouse (might be named differently by others)
The dedup/merging/cleaning, etc are all done in the first 8 subsystems ie. the first two layers.

What are the pros and cons of doing client merging, client deduplication, and client key management in the warehouse layer and not in the "kitchen" layer?

Your opinions will be appreciated

Thanks

ngalemmo Tue Dec 06, 2011 8:25 pm

I am not a proponent of merging, de-duping in any layer other than at the reporting level. Key management (if you mean the assignment of dimension keys) occurs in the transformation process prior to loading into the warehouse.

It all has to do with business keys. Clients in the source system are identified by a business key (whatever form that takes). The transactional source, which would contain the business key, needs to be related to the dimension using through the dimension's surrogate PK. It is important that the relationship between a fact and dimension is based on the natural key in the transaction rather than base it on some "merged" picture of the client. This correctly reflects the true nature of the relationship under which the transaction took place.

The "merged" client is simply a representation of a client. You handle this by including additional attributes to reflect merged client. From an end-user standpoint, you can present them with a merged view or an as-is view or both based on the attributes you decide to expose. Also, if the merge associations change (as the merge rules change) you can easily reflect the new merged view by simply updating the merged view attributes.

Dealing with deduplication, merging records, etc in the warehouse layer

Dealing with deduplication, merging records, etc in the warehouse layer

Re: Dealing with deduplication, merging records, etc in the warehouse layer