Designing Single set of dimensions for disparate source data
+2
nick_white
SSK
6 posters
Page 1 of 1
Designing Single set of dimensions for disparate source data
I have a scenario where a single dimensional model needs to be designed that will hold data from several customers. The customers though have similar data domains, they do not have the same content. For e.g, considering 'Gender' as a dimension, the values from different customers could be say ( Customer 1) - 'Male, Female' (Customer 2)- 'M,F,T' (Customer 3) - 'Male, Female, Transgender' etc. In short, the dimensions and possibly attributes could be the same, but the attribute values would vary. The expectation is the same values needs to be stored and carried over to reporting layer as is for each customer with no data loss or conversion or standardization. E.g Customer 1 wants to see only 2 options for Gender - 'Male' and 'Female'
What are the things to be kept in mind for modeling this scenario?. Will adding a separate Customer/organization dimension (with all the customer attributes) and linking this Customer_ID as a foreign key in each dimension (to differentiate which data value in dimension is for which customer) help?
What are the things to be kept in mind for modeling this scenario?. Will adding a separate Customer/organization dimension (with all the customer attributes) and linking this Customer_ID as a foreign key in each dimension (to differentiate which data value in dimension is for which customer) help?
SSK- Posts : 2
Join date : 2015-11-16
Re: Designing Single set of dimensions for disparate source data
As long as you can filter facts by customer (using a dimension or degenerate dimension) then you shouldn't have to do anything else.
Dimensions should have descriptions in their fields and not codes (though you can have the codes as well if you need them, obviously).
A dimension record with Customer 1's field values would only ever be linked to a Customer 1 fact so you shouldn't have any issues. It doesn't matter if a field has values of Y, Yes, True, 1 etc for different customers as only the correct values will be available to that customer when linked to their facts
Dimensions should have descriptions in their fields and not codes (though you can have the codes as well if you need them, obviously).
A dimension record with Customer 1's field values would only ever be linked to a Customer 1 fact so you shouldn't have any issues. It doesn't matter if a field has values of Y, Yes, True, 1 etc for different customers as only the correct values will be available to that customer when linked to their facts
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: Designing Single set of dimensions for disparate source data
Thanks for your response and points, Nick
The facts are not separate for each customer. It would be same fact tables (with same measures) but probably partitioned (database partition at table level for each customer) and linked to specific dimension records like you mentioned.
The question is more on the higher level picture on how the dimensions could be designed (15+ in count) so that they could differentiate data for different customers in each of the dimensions.
The facts are not separate for each customer. It would be same fact tables (with same measures) but probably partitioned (database partition at table level for each customer) and linked to specific dimension records like you mentioned.
The question is more on the higher level picture on how the dimensions could be designed (15+ in count) so that they could differentiate data for different customers in each of the dimensions.
SSK- Posts : 2
Join date : 2015-11-16
Re: Designing Single set of dimensions for disparate source data
I probably didn't make myself very clear so to take a very simple example: say you had a "Boolean" dimension consisting of a SK and a text column. The text column could have values of True, False, Y, N, 1, 0 etc. When you create a fact record as long as you link it to the correct Dim record for that customer everything will work as you want. So if Customer A has a true value you might link it to the Dim record with the "True" value but you might link Customer B to the Dim record with a value of "Y"
Does that make it any clearer? Basically where you have different display values for the columns in a Dim record you create a different Dim record
Does that make it any clearer? Basically where you have different display values for the columns in a Dim record you create a different Dim record
Last edited by nick_white on Wed Nov 18, 2015 4:06 am; edited 1 time in total
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: Designing Single set of dimensions for disparate source data
Basically an identification the owner of the data (i.e. your customer) should appear in all tables. You could then use this to secure the data from other customers. As far as partitioning goes, I'm not sure if partitioning by customer will buy you much. Date is a more common partitioning method to improve performance of fact table queries.
Re: Designing Single set of dimensions for disparate source data
This is part of your source to target transformation logic. You need to rationalize the codes to a common code set. If source A has 1 to mean "Male" and source B has "M" to mean male, the ETL process will map to the appropriate dimension row the represents "Male". What you don't want to do is load multiple values for "Male".
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Designing Single set of dimensions for disparate source data
On that partitioning question, if this was an MPP platform, and if Customer was the biggest dimension, and transaction volumes were fairly evenly distributed across the customer base, then using Customer as the distribution key might be a very good idea as you might save a lot of shuffling between nodes.
Re: Designing Single set of dimensions for disparate source data
You do not need separate customer dims. You need one customer dim and a column to store their source...i-e the company name they belong to. If you are standardizing your customer's data and afraid that a customer may want to see the original value, then create 2 columns for that value..... one column to hold standardized value and another column to hold original value from the source. For example if gender value 1 from source represents Male then in your customer dim you not only store "Male" as standardized value , but also 1 as original value. This approach provides the flexibility for the reporting so a customer can either see standardize value or a value from the source.
zoom- Posts : 97
Join date : 2010-08-23
Location : Chicago
Re: Designing Single set of dimensions for disparate source data
ron.dunn wrote:On that partitioning question, if this was an MPP platform, and if Customer was the biggest dimension, and transaction volumes were fairly evenly distributed across the customer base, then using Customer as the distribution key might be a very good idea as you might save a lot of shuffling between nodes.
He said 'partitioning' rather than 'distributing' so I am assuming an SMP platform. In an MPP environment, date would be a very bad distribution method. But, in an SMP environment, date usually is the most effective. In the case of an MPP you almost always want to co-locate your largest dimension with the facts.
Re: Designing Single set of dimensions for disparate source data
I think the discussion about partitioning is a side issue - the original question was about how to display different values to different customers for what is logically the same data.
In a normal DW I would agree that B&L's suggestion is the standard approach but the questioner specifically does not want to rationalise the data into a common set of values but wants to display customer-specific values
"What you don't want to do is load multiple values for "Male" ": this is, I beleive, exactly what the questionner wants to do, and for a valid reason.
My view is that you treat different values of the same logical data item as different - and therefore you end up with different records in your dimension tables e.g. in a logically boolean field hold values of "true", "yes", "1", etc (whatever the different customers expect to see). When you come to load your fact data, the values coming from Customer A will match the Dimension record that has customer A values and so the correct SK will be written to the fact table; data coming from Customer B that might be logically the same to Customer A's data will match the Dim record that has Customer B's values - and so on.
Thoughts?
In a normal DW I would agree that B&L's suggestion is the standard approach but the questioner specifically does not want to rationalise the data into a common set of values but wants to display customer-specific values
BoxesAndLines wrote:This is part of your source to target transformation logic. You need to rationalize the codes to a common code set. If source A has 1 to mean "Male" and source B has "M" to mean male, the ETL process will map to the appropriate dimension row the represents "Male". What you don't want to do is load multiple values for "Male".
"What you don't want to do is load multiple values for "Male" ": this is, I beleive, exactly what the questionner wants to do, and for a valid reason.
My view is that you treat different values of the same logical data item as different - and therefore you end up with different records in your dimension tables e.g. in a logically boolean field hold values of "true", "yes", "1", etc (whatever the different customers expect to see). When you come to load your fact data, the values coming from Customer A will match the Dimension record that has customer A values and so the correct SK will be written to the fact table; data coming from Customer B that might be logically the same to Customer A's data will match the Dim record that has Customer B's values - and so on.
Thoughts?
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: Designing Single set of dimensions for disparate source data
I'm not a big fan of doing it that way. One of the big values a data warehouse can provide is code rationalization, that is the ability to get the business on a common vocabulary. All these disparate codes, indicators, and flags in the disparate source application silos only serve to further obfuscate the underlying business processes. Do I really need N different values for male, or active, or open, or closed? This also introduces additional complexity on joins. I now need to query include N values to identify all male genders. If I add an additional column in my gender dimension to make identifying all male genders easier for reporting, I've already done all the work to cross reference the codes!
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Designing Single set of dimensions for disparate source data
Except that he is talking about a multi-tennant situation, not the integration of data for a large, single, enterprise. Co-mingling of different tenant's data is a big no-no, so there is no reason to have consistent attribute values across tenant data.
In such cases is it absolutely critical the data remains in the realm of the specific tenant that owns it. To that end, all natural keys should include the tenant ID to ensure data is never co-mingled.
In such cases is it absolutely critical the data remains in the realm of the specific tenant that owns it. To that end, all natural keys should include the tenant ID to ensure data is never co-mingled.
Re: Designing Single set of dimensions for disparate source data
ngalemmo wrote:Except that he is talking about a multi-tennant situation, not the integration of data for a large, single, enterprise.
Hmm, somehow I missed that point, in memory of the great Emily Litella, "Never mind!"
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Similar topics
» Designing data marts from an EAV data source
» Is it a best practice that Data warehouse follows the source system data type?
» Customer Dimension built from two ERP tables
» Correlated - Separate Dimensions OR Single Dimensions ?
» multi ERP source system, single DW load strategy
» Is it a best practice that Data warehouse follows the source system data type?
» Customer Dimension built from two ERP tables
» Correlated - Separate Dimensions OR Single Dimensions ?
» multi ERP source system, single DW load strategy
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum