Do I need a second fact table?
3 posters
Page 1 of 1
Do I need a second fact table?
I have a "factless" fact table that contains daily snapshots.
The fact table is constructed:
PriKey,
Report_date,
SDCompKey (this connects to a dim table for Sales DistrictComposition - SalesDistrict,State,County - SCD)
CustomerType1 Surrogate Key (this connects to a dim table for CustomerType1 - name, address, etc - SCD)
CustomerType2 Surrogate Key (this connects to a dim table for CustomerType2 - name, address, etc - SCD)
ObjectCount (factored Counter specifically associated to CustomerType1 to sum for a daily particular activity)
I am stuck using a cube to serve the results as the UI, but I need to get daily counts for CustomerType2 by SalesDistrict grouping on the sales district name and customertype 2 name.
I currently do this using a query in my staging database area:
select report_date, salesdistrictname, count(*) as CT2Count
from (select distinct report_date, salesdistrictname, ct2Name
from facttable f
join salesdistrictcompdim s on f.sdcompkey=s.sdcompkey
join ctype2dim c on f.ctype2key=c.ctype2key ) a
group by report_date, salesdistrictname
Should I build a second fact table and cube measure to get the results I need? I can't figure out how to query the current cube to give me the correct total. Record count won't do it and distinct counts using just the keys on the fact table won't do it either. I really don't want to build another ObjectCount field for CustomerType2 if I can avoid it. I am using MS Analysis Server with the fact table and dim tables in a cube.
Thanks
The fact table is constructed:
PriKey,
Report_date,
SDCompKey (this connects to a dim table for Sales DistrictComposition - SalesDistrict,State,County - SCD)
CustomerType1 Surrogate Key (this connects to a dim table for CustomerType1 - name, address, etc - SCD)
CustomerType2 Surrogate Key (this connects to a dim table for CustomerType2 - name, address, etc - SCD)
ObjectCount (factored Counter specifically associated to CustomerType1 to sum for a daily particular activity)
I am stuck using a cube to serve the results as the UI, but I need to get daily counts for CustomerType2 by SalesDistrict grouping on the sales district name and customertype 2 name.
I currently do this using a query in my staging database area:
select report_date, salesdistrictname, count(*) as CT2Count
from (select distinct report_date, salesdistrictname, ct2Name
from facttable f
join salesdistrictcompdim s on f.sdcompkey=s.sdcompkey
join ctype2dim c on f.ctype2key=c.ctype2key ) a
group by report_date, salesdistrictname
Should I build a second fact table and cube measure to get the results I need? I can't figure out how to query the current cube to give me the correct total. Record count won't do it and distinct counts using just the keys on the fact table won't do it either. I really don't want to build another ObjectCount field for CustomerType2 if I can avoid it. I am using MS Analysis Server with the fact table and dim tables in a cube.
Thanks
lprince- Posts : 9
Join date : 2015-11-11
Re: Do I need a second fact table?
Why do you need the name to do a count? Are you not simply counting customers?
Would not
select report_date, salesdistrictname, count(distinct ctype1key) as CT2Count
from facttable f
join salesdistrictcompdim s on f.sdcompkey=s.sdcompkey
group by report_date, salesdistrictname
do the same thing?
Would not
select report_date, salesdistrictname, count(distinct ctype1key) as CT2Count
from facttable f
join salesdistrictcompdim s on f.sdcompkey=s.sdcompkey
group by report_date, salesdistrictname
do the same thing?
Re: Do I need a second fact table?
Oh I wish it did.... But no, I'm dealing with historic trashy data (not my source, so I've been struggling with how far to clean it.)
I'm using name because all historical reporting had been done grouping on salesdistrict and CustomerType2name so I'm trying to verify my counts against history to make sure I have accurate numbers. That said, It's a twofold problem, because the customerType2 is a company, they may have field offices in multiple locations, which is in the raw data. Additionally, because different people are doing data entry, there are typos (Inc. vs ,Inc, etc) in the contact info across counties/states for same company on the same day. The best I could come up with was to dump all the unique values into the Dim table and then create a durablekey for ones that I could confirm are the same based on reasonable variation in name, same address, same phone #, etc. So as a result, the Dim table has a surrogate key: CType2_key and Cust_DurableKey, customer_name, customer_rollup_name (could be a parent company or corrected version of the name), address, phone, city, state, and zip.
So, unfortunately, while CType2_SurrogateKey is unique in the fact table, the grouping returns too many records because what I technically need is a rollup on the Cust_DurableKey and not the Customername (in the final version). I'm new to learning DWH, but I didn't think I should put the DurableKey into the fact table. If that's what I should do, that might fix my problem.
I'm using name because all historical reporting had been done grouping on salesdistrict and CustomerType2name so I'm trying to verify my counts against history to make sure I have accurate numbers. That said, It's a twofold problem, because the customerType2 is a company, they may have field offices in multiple locations, which is in the raw data. Additionally, because different people are doing data entry, there are typos (Inc. vs ,Inc, etc) in the contact info across counties/states for same company on the same day. The best I could come up with was to dump all the unique values into the Dim table and then create a durablekey for ones that I could confirm are the same based on reasonable variation in name, same address, same phone #, etc. So as a result, the Dim table has a surrogate key: CType2_key and Cust_DurableKey, customer_name, customer_rollup_name (could be a parent company or corrected version of the name), address, phone, city, state, and zip.
So, unfortunately, while CType2_SurrogateKey is unique in the fact table, the grouping returns too many records because what I technically need is a rollup on the Cust_DurableKey and not the Customername (in the final version). I'm new to learning DWH, but I didn't think I should put the DurableKey into the fact table. If that's what I should do, that might fix my problem.
lprince- Posts : 9
Join date : 2015-11-11
Re: Do I need a second fact table?
Isn't ctype1key your durable key?
If this key represents a field office, but you are trying to count head offices/companies, you need either a hierarchy or attributes for the head office in your dimension. You could then count head offices based on distinct values of the head office id.
If this key represents a field office, but you are trying to count head offices/companies, you need either a hierarchy or attributes for the head office in your dimension. You could then count head offices based on distinct values of the head office id.
Re: Do I need a second fact table?
Nope, sorry to cause confusion, but Ctype1Key belongs to a totally different set of businesses. There are roughly 200 type1s and 3000 type2s with random subsets that come and go on a daily basis. The two groups work in tandem at the same location on the same item, and both of them can be customers of our business so we think of them separately and as a result, I have separate dim tables for each of them. I was able to solve the problem for the CustomerType1 counts by introducing a factored object count (that totals to 1 for a given day/CustomerType1) because it was easy to do and the worst case was a division by 4. In the case of CustomerType2, it could be a division by 2 all the way up to 40 something to come up with a factored count so I was looking for a more elegant solution. I'm thinking a calculated measure might do the trick, but I'm only halfway through the MDX book that I'm reading and I don't know if that would be better or if I really should make a second cube just for CustomerType2 that's aggregated the way I need it. My problem with that though is that they won't be able to drill down through the data once I do it that way.
lprince- Posts : 9
Join date : 2015-11-11
Re: Do I need a second fact table?
I solved it using the following mdx for a single day:
with member measures.uniqCt2
as count(nonempty(([Cust_Type2Dim].[CT2Name].[Ct2Name]*[Measures].[RowCount])))
select measures.uniqCt2 on 0,
nonempty([salesdistrictdim].[salesdistrictnumber].[salesdistrictnumber]) on 1
from CubeName
where ([YMD_date].[DateValue].&[2015-01-05])
I'll still have to fuss with it a bit more to get the date into rows, but it does a distinct count of CustType2 for each sales district.
It took me a couple of days of reading/googling to get the MDX syntax right, so I thought i'd post back and maybe save someone else time.
with member measures.uniqCt2
as count(nonempty(([Cust_Type2Dim].[CT2Name].[Ct2Name]*[Measures].[RowCount])))
select measures.uniqCt2 on 0,
nonempty([salesdistrictdim].[salesdistrictnumber].[salesdistrictnumber]) on 1
from CubeName
where ([YMD_date].[DateValue].&[2015-01-05])
I'll still have to fuss with it a bit more to get the date into rows, but it does a distinct count of CustType2 for each sales district.
It took me a couple of days of reading/googling to get the MDX syntax right, so I thought i'd post back and maybe save someone else time.
Last edited by lprince on Thu Dec 17, 2015 11:38 am; edited 1 time in total (Reason for editing : typo in syntax)
lprince- Posts : 9
Join date : 2015-11-11
Re: Do I need a second fact table?
So, you are trying to allocate measures to different grains?
Usually you handle this with a bridge table. The bridge table joins with a fact dimension fk and the bridge references whichever dimension it needs to. The bridge may contain an allocation factor if you need it.
My question would be, do the facts always contain references to a ctype1 and ctype2 dimension row, or can it sometimes be null (i.e. any)?
The nice thing about a bridge is you build it to support a specific allocation need. Everything needed to support queries is pre-calculated and stored on the bridge. And each specific bridge can be combined with another bridge. Allocation of those facts would be derived by multiplying the allocation factors in the bridges being used.
Usually you handle this with a bridge table. The bridge table joins with a fact dimension fk and the bridge references whichever dimension it needs to. The bridge may contain an allocation factor if you need it.
My question would be, do the facts always contain references to a ctype1 and ctype2 dimension row, or can it sometimes be null (i.e. any)?
The nice thing about a bridge is you build it to support a specific allocation need. Everything needed to support queries is pre-calculated and stored on the bridge. And each specific bridge can be combined with another bridge. Allocation of those facts would be derived by multiplying the allocation factors in the bridges being used.
Re: Do I need a second fact table?
maybe, but not intentionally. The grain of the fact table is "all activities reported on a given day for any given location by these two groups of businesses/customers that use any given piece of equipment." It's not like anything I can find in a book or neat and tidy lesson and it's "factless" which makes looking for examples of the right way to structure my tables/dims that much harder. The businesses are always in tandem, so there isn't ever a record that has one without the other. CustomerType1Dim is the "owner" of the equipment and CustomerType2Dim is the operator of the equipment, but they can both be our customers or be a potential customer, that's a different key I have in each dim table. I tried to reduce the complexity in my example to avoid confusing the issue I was trying to solve. Sales districts are geographic based, but if Customer A is active in District 1 and District 2, they want it counted as 2, not 1 because each sales district is its own universe and ultimately we are using this as a measure of 'potential customers' that exist at a point in time in each district.
Would it be better structure to build an aggregated fact table that contains a single record for each Day, SalesDistrict, Cust1, Cust2 combination and lose the details of what they are working on? The total dataset is less than a million records. I wanted to make it so that the people could look at the high level totals, but ultimately, if they were curious enough, be able to drill down through the data to see what activity and location information related to where the businesses are working.
I've read about bridge tables a little bit on line, but honestly am only halfway through the Data Warehouse Toolkit book, so I'm not sure I understand how to apply them in this situation.
Would it be better structure to build an aggregated fact table that contains a single record for each Day, SalesDistrict, Cust1, Cust2 combination and lose the details of what they are working on? The total dataset is less than a million records. I wanted to make it so that the people could look at the high level totals, but ultimately, if they were curious enough, be able to drill down through the data to see what activity and location information related to where the businesses are working.
I've read about bridge tables a little bit on line, but honestly am only halfway through the Data Warehouse Toolkit book, so I'm not sure I understand how to apply them in this situation.
lprince- Posts : 9
Join date : 2015-11-11
Re: Do I need a second fact table?
I have to admit to getting a bit confused by all the previous posts but reading your last post am I correct in thinking that the grain/uniqueness of your fact table is defined by these keys:
Day, SalesDistrict, Cust1, Cust2 and EquipmentID (or whatever you've called it)?
Also, what you are trying to achieve is a count of unique Cust2 by Day and SalesDistrict?
An aggregate is not going to help, IMO, as aggregates only provide performance improvements they don't give you any more information.
Why won't "SELECT Day, SalesDistrict, count (distinct Cust2) from FactTable GROUP BY Day, SalesDistrict" work?
Day, SalesDistrict, Cust1, Cust2 and EquipmentID (or whatever you've called it)?
Also, what you are trying to achieve is a count of unique Cust2 by Day and SalesDistrict?
An aggregate is not going to help, IMO, as aggregates only provide performance improvements they don't give you any more information.
Why won't "SELECT Day, SalesDistrict, count (distinct Cust2) from FactTable GROUP BY Day, SalesDistrict" work?
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: Do I need a second fact table?
Thank you for that confirmation on aggregation.
The query I originally posted in my first post does the totals correctly for me, but I can't query a Cube that way, the interface that we're exposing to the users is measures/dims from a Cube using SSAS and a vendor based solution I think is equivalent to PowerPivot. The uniqueness of the table is based on fields: report_date, geo_locationkey. Everything else isn't unique unless I include report_date and geo_locationkey. There can be multiples of owner/operator (CustType1/CustType2, respectively) doing multiple activities using multiple pieces of equipment in multiple sales districts. The CustType2ID represents unique entries in the Dim Table (containing CustType2ID, Cust_DurableKey, CustName, Cust_Rollup_Name, Phone, Address, etc) for the SAME Customer (Due to Typos in data, field office locations and other variations in contact info (multiple phone #s) - think SCD not only in time, but in many cases no single record per customer that is truly the "right" one, just unique variations that have to be grouped together by the DurableKey). It's the nature of the source data which is outside our control. I pondered building an outrigger dim table that had only the DurableKey and rollup_name in it (so I would have a table with single entry per Customer), but wasn't sure if there was an advantage.
The business explanation: report is used to count "potential customers available on any given day", which is later compared to a count of "customers sold to on any given day" to determine sales effectiveness. The reason I need a count of CustType2 (grouped on customer name) by salesdistrict is to match an existing report that is currently generated manually on a weekly basis to one that will come from the cube "on-demand" and can be sliced/diced in ways other than just the daily totals. The new report will group on SalesDistrict, Cust_DurableKey rather than name, so it will correct some other errors I've found in the report (name identical, but they are not the same company).
The MDX query I posted gives me the solution I needed so I can live with the structures I've got while I do some more research on bridge tables.
The query I originally posted in my first post does the totals correctly for me, but I can't query a Cube that way, the interface that we're exposing to the users is measures/dims from a Cube using SSAS and a vendor based solution I think is equivalent to PowerPivot. The uniqueness of the table is based on fields: report_date, geo_locationkey. Everything else isn't unique unless I include report_date and geo_locationkey. There can be multiples of owner/operator (CustType1/CustType2, respectively) doing multiple activities using multiple pieces of equipment in multiple sales districts. The CustType2ID represents unique entries in the Dim Table (containing CustType2ID, Cust_DurableKey, CustName, Cust_Rollup_Name, Phone, Address, etc) for the SAME Customer (Due to Typos in data, field office locations and other variations in contact info (multiple phone #s) - think SCD not only in time, but in many cases no single record per customer that is truly the "right" one, just unique variations that have to be grouped together by the DurableKey). It's the nature of the source data which is outside our control. I pondered building an outrigger dim table that had only the DurableKey and rollup_name in it (so I would have a table with single entry per Customer), but wasn't sure if there was an advantage.
The business explanation: report is used to count "potential customers available on any given day", which is later compared to a count of "customers sold to on any given day" to determine sales effectiveness. The reason I need a count of CustType2 (grouped on customer name) by salesdistrict is to match an existing report that is currently generated manually on a weekly basis to one that will come from the cube "on-demand" and can be sliced/diced in ways other than just the daily totals. The new report will group on SalesDistrict, Cust_DurableKey rather than name, so it will correct some other errors I've found in the report (name identical, but they are not the same company).
The MDX query I posted gives me the solution I needed so I can live with the structures I've got while I do some more research on bridge tables.
lprince- Posts : 9
Join date : 2015-11-11
Similar topics
» 'Routing' the grain of the fact table to multpile members of multiple dimensions causes the fact table to 'explode'
» Integrating new fact table which has one to many relationship with the main fact table in existing star schema
» How to create fact table with measures derived from comparing two fact table rows
» Healthcare Hospital Stay fact table - transaction fact table - a row for In and a row for Out?
» Design a snapshot Fact table in data mart from compliance-enable fact table in enterprise data warehouse
» Integrating new fact table which has one to many relationship with the main fact table in existing star schema
» How to create fact table with measures derived from comparing two fact table rows
» Healthcare Hospital Stay fact table - transaction fact table - a row for In and a row for Out?
» Design a snapshot Fact table in data mart from compliance-enable fact table in enterprise data warehouse
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum