Dealing with deduplication, merging records, etc in the warehouse layer
2 posters
Page 1 of 1
Dealing with deduplication, merging records, etc in the warehouse layer
Hi,
From my experience and Kimball ETL Architecture training I know that there are (not cast in concrete) 4 layers:
1) Staging 2) Transform 3) Conform 4) Warehouse (might be named differently by others)
The dedup/merging/cleaning, etc are all done in the first 8 subsystems ie. the first two layers.
What are the pros and cons of doing client merging, client deduplication, and client key management in the warehouse layer and not in the "kitchen" layer?
Your opinions will be appreciated
Thanks
From my experience and Kimball ETL Architecture training I know that there are (not cast in concrete) 4 layers:
1) Staging 2) Transform 3) Conform 4) Warehouse (might be named differently by others)
The dedup/merging/cleaning, etc are all done in the first 8 subsystems ie. the first two layers.
What are the pros and cons of doing client merging, client deduplication, and client key management in the warehouse layer and not in the "kitchen" layer?
Your opinions will be appreciated
Thanks
Juan- Posts : 1
Join date : 2010-07-15
Location : Cape Town, ZA
Re: Dealing with deduplication, merging records, etc in the warehouse layer
I am not a proponent of merging, de-duping in any layer other than at the reporting level. Key management (if you mean the assignment of dimension keys) occurs in the transformation process prior to loading into the warehouse.
It all has to do with business keys. Clients in the source system are identified by a business key (whatever form that takes). The transactional source, which would contain the business key, needs to be related to the dimension using through the dimension's surrogate PK. It is important that the relationship between a fact and dimension is based on the natural key in the transaction rather than base it on some "merged" picture of the client. This correctly reflects the true nature of the relationship under which the transaction took place.
The "merged" client is simply a representation of a client. You handle this by including additional attributes to reflect merged client. From an end-user standpoint, you can present them with a merged view or an as-is view or both based on the attributes you decide to expose. Also, if the merge associations change (as the merge rules change) you can easily reflect the new merged view by simply updating the merged view attributes.
It all has to do with business keys. Clients in the source system are identified by a business key (whatever form that takes). The transactional source, which would contain the business key, needs to be related to the dimension using through the dimension's surrogate PK. It is important that the relationship between a fact and dimension is based on the natural key in the transaction rather than base it on some "merged" picture of the client. This correctly reflects the true nature of the relationship under which the transaction took place.
The "merged" client is simply a representation of a client. You handle this by including additional attributes to reflect merged client. From an end-user standpoint, you can present them with a merged view or an as-is view or both based on the attributes you decide to expose. Also, if the merge associations change (as the merge rules change) you can easily reflect the new merged view by simply updating the merged view attributes.
Similar topics
» Dealing with Duplicate Dimension Rows
» Data Deduplication in Dimension Design
» Merging of dimensions?
» Match merging
» Dealing with Mergers
» Data Deduplication in Dimension Design
» Merging of dimensions?
» Match merging
» Dealing with Mergers
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum