Linking or Merging Similar dimensions from Different Source Systems
+2
Jeff Smith
meb97me
6 posters
Page 1 of 1
Linking or Merging Similar dimensions from Different Source Systems
We've got a situation with the DW we're building where we have 2 seperate systems that we are wanting to bring the data together for cross analysis
1) The first system is the main operational system ie a contact management application that records various events between employees & customers
2) The second is a requisitioning/PO system
We have an employee dimension built from the first system however we are now facing the problem of whether to try and integrate the second systems Users in this employee dimension.
The problems we have identified are
1) Names etc are not consistant between the systems
2) completely different attributes are recorded in each system
2) There is no accurate method to match up users between the 2 systems, the first system logins use AD accounts but the second system uses manually generated logins which sometimes match up to their AD user names and sometimes don't
We're envisaging nightmares of each system trying to overwrite each others chnanges to the dimensional data because they aren't the same and then not alway being able to match the 2 up together
Would it be better to have a seperate User dimension for the second system and then have a bridging table to link (where possible) between the employees dimension from the first system and users dimension table of the second that way hopefully gleeing some useful information from the second system relating back to the first. Ie comparing susccesfull outcomes for each employee for the month against what they spent on clients that kind of thing.
hopefully that made some sort of sense
Any input very much appreciated
1) The first system is the main operational system ie a contact management application that records various events between employees & customers
2) The second is a requisitioning/PO system
We have an employee dimension built from the first system however we are now facing the problem of whether to try and integrate the second systems Users in this employee dimension.
The problems we have identified are
1) Names etc are not consistant between the systems
2) completely different attributes are recorded in each system
2) There is no accurate method to match up users between the 2 systems, the first system logins use AD accounts but the second system uses manually generated logins which sometimes match up to their AD user names and sometimes don't
We're envisaging nightmares of each system trying to overwrite each others chnanges to the dimensional data because they aren't the same and then not alway being able to match the 2 up together
Would it be better to have a seperate User dimension for the second system and then have a bridging table to link (where possible) between the employees dimension from the first system and users dimension table of the second that way hopefully gleeing some useful information from the second system relating back to the first. Ie comparing susccesfull outcomes for each employee for the month against what they spent on clients that kind of thing.
hopefully that made some sort of sense
Any input very much appreciated
Last edited by meb97me on Thu Feb 10, 2011 12:50 pm; edited 1 time in total
meb97me- Posts : 34
Join date : 2010-07-28
Re: Linking or Merging Similar dimensions from Different Source Systems
You can create an employee dimension and load it with data from both systems. The dimension can contain attributes from both systems or you can create an employee attribute dimension populated from the 2 systems. But what ever is done, you won't be able to link the employee from the 1 system to the other because no such cross walk exists. But, if you put the employees from the 2 systems into the same dimension and a crosswalk system is created, then it's just a matter of adding a fe more columns to the employee dimension.
The lowest level of the employee dimension becomes Source System/employee. Later on, if the cross walk gets created, you can have the Source System/Employee roll up to the Employee.
The lowest level of the employee dimension becomes Source System/employee. Later on, if the cross walk gets created, you can have the Source System/Employee roll up to the Employee.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Linking or Merging Similar dimensions from Different Source Systems
meb97me wrote:Would it be better to have a seperate User dimension for the second system and then have a bridging table to link (where possible) between the employees dimension from the first system and users dimension table of the second that way hopefully gleeing some useful information from the second system relating back to the first. Ie comparing susccesfull outcomes for each employee for the month against what they spent on clients that kind of thing.
It might be better in this case to have a separate User dimension for the second system, but a bridge table is not the right answer. (A bridge table is to connect a fact to a dimension, not to connect two dimensions.) What you need to do is conform the dimensions. Conformed dimensions are those that share at least one common attribute, which thus allows aggregated reporting across fact tables. To report across your fact tables at the employee level you need some sort of employee ID attribute that is common to the two dimensions.
Would it be possible to add a secondary identifier to one of the source systems that could reliably link employees across systems? Otherwise, you need to tackle it during ETL. Would matching on AD name/userID handle the bulk of the cases with the rest cleaned up manually?
Rather than a bridge table in DW, you might need a mapping table in the "kitchen" that would map users between the two systems. The mapping table would be used during loading of the dimensions to populate a common employee id attribute in each dimension.
Last edited by VHF on Thu Feb 03, 2011 4:39 pm; edited 2 times in total (Reason for editing : clarification; fix typo)
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Linking or Merging Similar dimensions from Diffierent Source Systems
Your options vary depending on the tools at your disposal , the willingness of your user community to become involved in the management of the data and the scale of your dimensions.
Assuming manageable data sizes, I would definitely retain the data in a single conformed dimension table, from a maintainability perspective it keeps your environment tidier and consistent - i.e. employee is handled and sourced from a single location. As suggested above the true key to the data in this table becomes Source_System_Id/ Native_Employee_Id (obviously it is accompanied by a surrogate key).
In terms of merging the data, the mapping table is a good solution. I would imagine the challenge here is the ongoing maintenance of that table - in which case you will need to find an owner who will stay on top of the data (and then figure out some way of giving them controlled access). In addition, if you have the time you may be able to provide routines to assist in this maintenance activity (such as making newly arriving members obvious).
An alternative approach to merging the data could be introducing some form of cleansed parent level within the dimension table itself. For example:
SK || Native_EmployeeId || Employee_Name || Source_System_Id || Cleansed_Parent_id
1 || 101 || Bob Smith || 1 || 3 -- Id of DWH generated cleansed level
2 || J87 || SmitB || 2 || 3
3 || 3 || Bob Smith || 0 -- i.e. Data Warehouse generated || NULL
Again this has implciations for maintainability (i.e. someone needs to look after the data) and also determining how best to allow people to manage the data (i.e. providing a GUI).
Assuming manageable data sizes, I would definitely retain the data in a single conformed dimension table, from a maintainability perspective it keeps your environment tidier and consistent - i.e. employee is handled and sourced from a single location. As suggested above the true key to the data in this table becomes Source_System_Id/ Native_Employee_Id (obviously it is accompanied by a surrogate key).
In terms of merging the data, the mapping table is a good solution. I would imagine the challenge here is the ongoing maintenance of that table - in which case you will need to find an owner who will stay on top of the data (and then figure out some way of giving them controlled access). In addition, if you have the time you may be able to provide routines to assist in this maintenance activity (such as making newly arriving members obvious).
An alternative approach to merging the data could be introducing some form of cleansed parent level within the dimension table itself. For example:
SK || Native_EmployeeId || Employee_Name || Source_System_Id || Cleansed_Parent_id
1 || 101 || Bob Smith || 1 || 3 -- Id of DWH generated cleansed level
2 || J87 || SmitB || 2 || 3
3 || 3 || Bob Smith || 0 -- i.e. Data Warehouse generated || NULL
Again this has implciations for maintainability (i.e. someone needs to look after the data) and also determining how best to allow people to manage the data (i.e. providing a GUI).
MJGascoyne- Posts : 2
Join date : 2011-02-03
Re: Linking or Merging Similar dimensions from Different Source Systems
thanks for the replys gents
After some more thought, digesting the above proposed susggestions and further discussion we think the way ahead for us is to use the current Employee dimension we have and add some additional attributes required from the 2nd PO system.
We will then attempt to match up the 2 systems employee records using their logins. However for this to work we are going to ask the business to align the logins for the 2nd PO system to match that of the persons AD account.(this hopefully shouldn't be a problem, and from a technical point of view wont cause any problems with data mapped to that user as a seperateid userid is used in the tables to mark data assigned to a particular user)
That way if the persons already exisits as a user of the first Contact Management system we will be able to add the additional attribute field valus to their record from teh 2nd system.
A slight issue we will have is that we will have a lot of users who are set up in the 2nd system but not the first. If thats the case then we will create a new employee dimension record for them based using what data we have from the 2nd system. However if they subsequently get set up as a user of the Contact System we will use that as the primary source for maintaining any fields which are common between the 2 system ie FirstName, Last Name, Department etc
We're also going to need to store both natural keys in the record So we should have something like this
After some more thought, digesting the above proposed susggestions and further discussion we think the way ahead for us is to use the current Employee dimension we have and add some additional attributes required from the 2nd PO system.
We will then attempt to match up the 2 systems employee records using their logins. However for this to work we are going to ask the business to align the logins for the 2nd PO system to match that of the persons AD account.(this hopefully shouldn't be a problem, and from a technical point of view wont cause any problems with data mapped to that user as a seperateid userid is used in the tables to mark data assigned to a particular user)
That way if the persons already exisits as a user of the first Contact Management system we will be able to add the additional attribute field valus to their record from teh 2nd system.
A slight issue we will have is that we will have a lot of users who are set up in the 2nd system but not the first. If thats the case then we will create a new employee dimension record for them based using what data we have from the 2nd system. However if they subsequently get set up as a user of the Contact System we will use that as the primary source for maintaining any fields which are common between the 2 system ie FirstName, Last Name, Department etc
We're also going to need to store both natural keys in the record So we should have something like this
EmployeeSK | ContactSystemID | POSystemID | Username | |
1 | 12345 | mjackson | -- This person is only present in the Contact System | |
2 | ABCDE | mmouse | -- This person is only present in the PO System | |
3 | 67890 | FGHIJ | dduck | -- This person is present in both systems and identified by a common username |
meb97me- Posts : 34
Join date : 2010-07-28
Re: Linking or Merging Similar dimensions from Different Source Systems
I had a similar problem (in fact, multiple cases of the problem, where multiple systems needed have conformed data). As an example, we had 3 source systems, each of them had lists of Carriers (kind of like vendor lists), where wanted to see a single view of all business done with a Carrier.
In the end, I built a mapping application that allowed designated end users to "map" the Carriers together. It used two or three mapping tables to store the mapping information (2 if the mapping itself needed to be SCD type 1, 3 if the mapping needed to be type 2). To make a long story short (as I could get windy) - the ETL would detect if a mapping was already in place between the source system Carrier and the list of conformed Carriers. If it already existed, there was nothing to do. If it didn't exist, it would try some fuzzy logic to determine if the carrier already existed, and map the source system carrier to the conformed carrier. If it couldn't find a match, it would (a) create a new conformed carrier, (b) create the source system carrier and map it to the new conformed carrier, (c) send an email to an appropriate user group (I ran the data governance committee, so I was given all the power to dole out responsible for data integrity tasks) notifying them that a new carrier was create that couldn't be mapped properly, and they would need to use the mapping application.
The mapping application was used by end users to "fix" any of the mappings the ETL tool created. For instance, if two carriers were supposed to be conformed, the end user would take both source carriers and map them to the same conformed carrier. The next time the ETL ran, the data from both source systems for that carrier would appear to be from the same carrier in the data warehouse.
This was a simplified explanation. In fact, the type 2 SCM (slowly changing mapping) method I used was quite original, fast, and allowed lots of control about how entities mapped to each other, rolled up (grouped in parent child relationships), and changed over time. The mapping application itself took about 5 days (one person) to build quickly from scratch using VB, and 1 day (one person) every time there was a new dimension that needed to be added to the process. The ETL process around the conforming of the dimensions took about 2 extra days per dimension (one person) that needed to be conformed using this methodology. In the grand scheme of things, it was a huge win for minimal effort.
If you happen to be interested in more detail, feel free either post here or send me a private message, and I can send on additional information about the solution.
Good luck!
In the end, I built a mapping application that allowed designated end users to "map" the Carriers together. It used two or three mapping tables to store the mapping information (2 if the mapping itself needed to be SCD type 1, 3 if the mapping needed to be type 2). To make a long story short (as I could get windy) - the ETL would detect if a mapping was already in place between the source system Carrier and the list of conformed Carriers. If it already existed, there was nothing to do. If it didn't exist, it would try some fuzzy logic to determine if the carrier already existed, and map the source system carrier to the conformed carrier. If it couldn't find a match, it would (a) create a new conformed carrier, (b) create the source system carrier and map it to the new conformed carrier, (c) send an email to an appropriate user group (I ran the data governance committee, so I was given all the power to dole out responsible for data integrity tasks) notifying them that a new carrier was create that couldn't be mapped properly, and they would need to use the mapping application.
The mapping application was used by end users to "fix" any of the mappings the ETL tool created. For instance, if two carriers were supposed to be conformed, the end user would take both source carriers and map them to the same conformed carrier. The next time the ETL ran, the data from both source systems for that carrier would appear to be from the same carrier in the data warehouse.
This was a simplified explanation. In fact, the type 2 SCM (slowly changing mapping) method I used was quite original, fast, and allowed lots of control about how entities mapped to each other, rolled up (grouped in parent child relationships), and changed over time. The mapping application itself took about 5 days (one person) to build quickly from scratch using VB, and 1 day (one person) every time there was a new dimension that needed to be added to the process. The ETL process around the conforming of the dimensions took about 2 extra days per dimension (one person) that needed to be conformed using this methodology. In the grand scheme of things, it was a huge win for minimal effort.
If you happen to be interested in more detail, feel free either post here or send me a private message, and I can send on additional information about the solution.
Good luck!
Sideout72- Posts : 4
Join date : 2011-02-09
Re: Linking or Merging Similar dimensions from Different Source Systems
Sideout72 that's sounds like a great solution! we had considered a mapping application however we're having problems at present trying to get the business to take responsibility for data governance. No one seems willing to step up and take responsibility. (anyway thats a side issue!)
I'd certainly be interested in hearing more details as i'm sure will many other so feel free "to get windy" as it were
I'd certainly be interested in hearing more details as i'm sure will many other so feel free "to get windy" as it were
meb97me- Posts : 34
Join date : 2010-07-28
An even better way!
I think there is an even better way!
http://forum.kimballgroup.com/t908-a-dozen-systems-with-many-dimension-codes
This method allows facts to be loaded when there is no dimension row, but without the need to use the "UNKNOWN" dimension record.
Basically, new dimension entities, e.g. new staff are just loaded into the dimension with their natural keys, so you can report off the fact tables and see a name, even if they are not conformed exactly yet. Yet at the same time there is no risk that two different people will get reported as one if they accidentally have the same natural key in different systems.
http://forum.kimballgroup.com/t908-a-dozen-systems-with-many-dimension-codes
This method allows facts to be loaded when there is no dimension row, but without the need to use the "UNKNOWN" dimension record.
Basically, new dimension entities, e.g. new staff are just loaded into the dimension with their natural keys, so you can report off the fact tables and see a name, even if they are not conformed exactly yet. Yet at the same time there is no risk that two different people will get reported as one if they accidentally have the same natural key in different systems.
Al Wood- Posts : 46
Join date : 2010-12-08
Re: Linking or Merging Similar dimensions from Different Source Systems
i dont think that approach would work because as i understand it i would have a dimnsion table looking like this
So if i wanted to produce a staff list there would be duplicates for John Smith even if they were the same person
EmployeeSK | EmployeeKey | EmployeeName | Domain Account |
1 | SOURCEA:EMPA | John Smith | jsmith |
2 | SOURCEA:EMP99 | J Smith | jsmith |
So if i wanted to produce a staff list there would be duplicates for John Smith even if they were the same person
meb97me- Posts : 34
Join date : 2010-07-28
Re: Linking or Merging Similar dimensions from Different Source Systems
EmployeeSK | Source_System | SourceNaturalKey | Domain Account | Reporting Name (updated by Data Steward) |
1 | SOURCE_A | John Smith | jsmith | Mr John Smith |
2 | SOURCE_B | J Smith | jsmith | Mr John Smith |
Then some more people join but the SOURCE_A dimension update can't retreive a full name:
EmployeeSK | Source_System | SourceNaturalKey | Domain Account | Reporting Name (updated by Data Steward) |
1 | SOURCE_A | John Smith | jsmith | Mr John Smith |
2 | SOURCE_B | J Smith | jsmith | Mr John Smith |
3 | SOURCE_A | N Smith | nsmith | N Smith |
4 | SOURCE_A | J Smith | jmsmith | J Smith |
A few days later a data steward gets around to it:
EmployeeSK | Source_System | SourceNaturalKey | Domain Account | Reporting Name (updated by Data Steward) |
1 | SOURCE_A | John Smith | jsmith | Mr John Smith |
2 | SOURCE_B | J Smith | jsmith | Mr John Smith |
3 | SOURCE_A | N Smith | nsmith | Mr Nathan Smith |
4 | SOURCE_A | J Smith | jmsmith | Mrs Jane Mary Smith |
Just make sure all your reports use the [Reporting Name ... ] field. And a staff list uses DISTINCT [Reporting Name ... ]
Al Wood- Posts : 46
Join date : 2010-12-08
Re: Linking or Merging Similar dimensions from Different Source Systems
meb97me wrote:So if i wanted to produce a staff list there would be duplicates for John Smith even if they were the same person
Not if you aggregate on Domain Account (or Reporting Name or some other attribute that uniquely indentifies an employee accross the enterprise). You need to do a GROUP BY in the SQL to ensure no duplicates.
One advantage to having a separate dimension record for each source system is if there was a mistake it can be corrected later in the dimension without having to update any fact records. The previous post alludes to this with the example of data cleanup.
I think the multiple record approach is most advantages when dealing with very large dimensions and especially if MDM is involved. For example, in a system with millions of customer records an MDM system might figure out later on that what it "thought" were two people are really the same person. No need to update any fact records--just update the dimension records as needed.
When dealing with a more modest dimension I think both approaches are valid. I'm facing this issue with my (modest) customer dimension (a few thousand records) and right now I'm leaning towrds one record per customer with a natural key for each source system.
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Linking or Merging Similar dimensions from Different Source Systems
Rather than reporting using SQL directly out of the data warehouse the duplicate person become apparent once you get to the cube (SSAS in my case) as the key of the dimension is based on the surrogate key so if you drag the person attribute into a pivot in excel or view the dimension you will see multiple versions of the same person.
Maybe i'm getting hung up on this point and its not really an issue but I can see that if we are going to be promoting self service for the BI project and users drag a list of employees out for their department say, they are going to complain as soon as they see multiple versions of the same person.
Maybe i'm getting hung up on this point and its not really an issue but I can see that if we are going to be promoting self service for the BI project and users drag a list of employees out for their department say, they are going to complain as soon as they see multiple versions of the same person.
meb97me- Posts : 34
Join date : 2010-07-28
Re: Linking or Merging Similar dimensions from Different Source Systems
You are taking the right approach in considering what users are going to see with self-service BI. When weighing two alternative designs, one of the most important points to consider is how each option will affect the final output. After all, the underlying purpose of DW/BI is to make information available to the users in a meaningful way!
VHF- Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US
Re: Linking or Merging Similar dimensions from Different Source Systems
cheers VHF i'm glad someone can confirm that i'm actually thinking about things in the right way (well sort of :))
However I've just thought that really we're going to be in the same situation with any of the dimensions which have Type 2 change attributes and therefore multiple versions of themselves over time.
i guess this maybe more about the dimension design in the cube software by creating attributes & appropriate keys??
However I've just thought that really we're going to be in the same situation with any of the dimensions which have Type 2 change attributes and therefore multiple versions of themselves over time.
i guess this maybe more about the dimension design in the cube software by creating attributes & appropriate keys??
meb97me- Posts : 34
Join date : 2010-07-28
Re: Linking or Merging Similar dimensions from Different Source Systems
Hi,
I've only built a few cubes, but I feel sure there is a way to build a cube so a chosen attribute e.g. [Full unique name] is used as the key by the cube. I would say that the surrogate key values from the fact/dim structure are not useful information to display in the cube anyway.
Al Wood
I've only built a few cubes, but I feel sure there is a way to build a cube so a chosen attribute e.g. [Full unique name] is used as the key by the cube. I would say that the surrogate key values from the fact/dim structure are not useful information to display in the cube anyway.
Al Wood
Al Wood- Posts : 46
Join date : 2010-12-08
Similar topics
» Is it possible to design a DW in parallel with its source systems?
» Dear all. I can't figured out how how to linking in my structure the promotion/deal dimensions to the fact table avoiding dupplicates line
» Designing Single set of dimensions for disparate source data
» Merging of dimensions?
» Not quite late arriving dimensions but similar case....
» Dear all. I can't figured out how how to linking in my structure the promotion/deal dimensions to the fact table avoiding dupplicates line
» Designing Single set of dimensions for disparate source data
» Merging of dimensions?
» Not quite late arriving dimensions but similar case....
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum