Designing Single set of dimensions for disparate source data

SSK Tue Nov 17, 2015 12:02 am

I have a scenario where a single dimensional model needs to be designed that will hold data from several customers. The customers though have similar data domains, they do not have the same content. For e.g, considering 'Gender' as a dimension, the values from different customers could be say ( Customer 1) - 'Male, Female' (Customer 2)- 'M,F,T' (Customer 3) - 'Male, Female, Transgender' etc. In short, the dimensions and possibly attributes could be the same, but the attribute values would vary. The expectation is the same values needs to be stored and carried over to reporting layer as is for each customer with no data loss or conversion or standardization. E.g Customer 1 wants to see only 2 options for Gender - 'Male' and 'Female'

What are the things to be kept in mind for modeling this scenario?. Will adding a separate Customer/organization dimension (with all the customer attributes) and linking this Customer_ID as a foreign key in each dimension (to differentiate which data value in dimension is for which customer) help?

nick_white Tue Nov 17, 2015 8:27 am

As long as you can filter facts by customer (using a dimension or degenerate dimension) then you shouldn't have to do anything else.
Dimensions should have descriptions in their fields and not codes (though you can have the codes as well if you need them, obviously).
A dimension record with Customer 1's field values would only ever be linked to a Customer 1 fact so you shouldn't have any issues. It doesn't matter if a field has values of Y, Yes, True, 1 etc for different customers as only the correct values will be available to that customer when linked to their facts

SSK Tue Nov 17, 2015 1:38 pm

Thanks for your response and points, Nick

The facts are not separate for each customer. It would be same fact tables (with same measures) but probably partitioned (database partition at table level for each customer) and linked to specific dimension records like you mentioned.

The question is more on the higher level picture on how the dimensions could be designed (15+ in count) so that they could differentiate data for different customers in each of the dimensions.

nick_white Tue Nov 17, 2015 5:17 pm

I probably didn't make myself very clear so to take a very simple example: say you had a "Boolean" dimension consisting of a SK and a text column. The text column could have values of True, False, Y, N, 1, 0 etc. When you create a fact record as long as you link it to the correct Dim record for that customer everything will work as you want. So if Customer A has a true value you might link it to the Dim record with the "True" value but you might link Customer B to the Dim record with a value of "Y"
Does that make it any clearer? Basically where you have different display values for the columns in a Dim record you create a different Dim record

ngalemmo Tue Nov 17, 2015 5:21 pm

Basically an identification the owner of the data (i.e. your customer) should appear in all tables. You could then use this to secure the data from other customers. As far as partitioning goes, I'm not sure if partitioning by customer will buy you much. Date is a more common partitioning method to improve performance of fact table queries.

BoxesAndLines Wed Nov 18, 2015 4:45 pm

This is part of your source to target transformation logic. You need to rationalize the codes to a common code set. If source A has 1 to mean "Male" and source B has "M" to mean male, the ETL process will map to the appropriate dimension row the represents "Male". What you don't want to do is load multiple values for "Male".

ron.dunn Thu Nov 19, 2015 3:40 am

On that partitioning question, if this was an MPP platform, and if Customer was the biggest dimension, and transaction volumes were fairly evenly distributed across the customer base, then using Customer as the distribution key might be a very good idea as you might save a lot of shuffling between nodes.

zoom Thu Nov 19, 2015 8:51 am

You do not need separate customer dims. You need one customer dim and a column to store their source...i-e the company name they belong to. If you are standardizing your customer's data and afraid that a customer may want to see the original value, then create 2 columns for that value..... one column to hold standardized value and another column to hold original value from the source. For example if gender value 1 from source represents Male then in your customer dim you not only store "Male" as standardized value , but also 1 as original value. This approach provides the flexibility for the reporting so a customer can either see standardize value or a value from the source.

ngalemmo Thu Nov 19, 2015 11:49 am

ron.dunn wrote:On that partitioning question, if this was an MPP platform, and if Customer was the biggest dimension, and transaction volumes were fairly evenly distributed across the customer base, then using Customer as the distribution key might be a very good idea as you might save a lot of shuffling between nodes.

He said 'partitioning' rather than 'distributing' so I am assuming an SMP platform. In an MPP environment, date would be a very bad distribution method. But, in an SMP environment, date usually is the most effective. In the case of an MPP you almost always want to co-locate your largest dimension with the facts.

nick_white Fri Nov 20, 2015 3:31 am

I think the discussion about partitioning is a side issue - the original question was about how to display different values to different customers for what is logically the same data.
In a normal DW I would agree that B&L's suggestion is the standard approach but the questioner specifically does not want to rationalise the data into a common set of values but wants to display customer-specific values

BoxesAndLines wrote:This is part of your source to target transformation logic. You need to rationalize the codes to a common code set. If source A has 1 to mean "Male" and source B has "M" to mean male, the ETL process will map to the appropriate dimension row the represents "Male". What you don't want to do is load multiple values for "Male".

"What you don't want to do is load multiple values for "Male" ": this is, I beleive, exactly what the questionner wants to do, and for a valid reason.

My view is that you treat different values of the same logical data item as different - and therefore you end up with different records in your dimension tables e.g. in a logically boolean field hold values of "true", "yes", "1", etc (whatever the different customers expect to see). When you come to load your fact data, the values coming from Customer A will match the Dimension record that has customer A values and so the correct SK will be written to the fact table; data coming from Customer B that might be logically the same to Customer A's data will match the Dim record that has Customer B's values - and so on.

Thoughts?

BoxesAndLines Fri Nov 20, 2015 9:54 am

I'm not a big fan of doing it that way. One of the big values a data warehouse can provide is code rationalization, that is the ability to get the business on a common vocabulary. All these disparate codes, indicators, and flags in the disparate source application silos only serve to further obfuscate the underlying business processes. Do I really need N different values for male, or active, or open, or closed? This also introduces additional complexity on joins. I now need to query include N values to identify all male genders. If I add an additional column in my gender dimension to make identifying all male genders easier for reporting, I've already done all the work to cross reference the codes!

ngalemmo Fri Nov 20, 2015 3:00 pm

Except that he is talking about a multi-tennant situation, not the integration of data for a large, single, enterprise. Co-mingling of different tenant's data is a big no-no, so there is no reason to have consistent attribute values across tenant data.

In such cases is it absolutely critical the data remains in the realm of the specific tenant that owns it. To that end, all natural keys should include the tenant ID to ensure data is never co-mingled.

BoxesAndLines Fri Nov 20, 2015 3:08 pm

ngalemmo wrote:Except that he is talking about a multi-tennant situation, not the integration of data for a large, single, enterprise.

Hmm, somehow I missed that point, in memory of the great Emily Litella, "Never mind!"

Sponsored content

Designing Single set of dimensions for disparate source data

Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data

Re: Designing Single set of dimensions for disparate source data