First DW and hopelessly out of my depth
5 posters
Page 1 of 1
First DW and hopelessly out of my depth
Hi. A brief bit of history - my background is in SQL Server database development including SSRS and SSIS and I have done reporting from a data warehouse, so when I was asked in the interview for my current job whether I would be able to redesign their existing data warehouse, I said yes but everyone was aware that I'd not built one before. I read some of the Kimball book and most of what I read seemed so blindingly obvious, that I decided in my wisdom to scrap the book and just get on with building the warehouse the way I thought was best, basing most of my architechture on the existing DW and my own knowledge of building databases and applications, plus common sense.
The whole point of building a new warehouse when they already had one, was because it was designed by someone who had no experience of DW design and therefore didn't really know what he was doing and as a result its hard to manage and full of holes, not to mention hundreds of unknown products and customers (but more on that later). The fact that they've employed me with no DW experience to build them a better one is an irony utterly lost on the management here.
Anyway it turns out that a data warehouse is a whole different beast as I'm sure you're all aware and having managed 90% of the journey from blank piece of paper to working DW, I'm struggling with some of the finer points.
The basic architechture of my warehouse is as follows. We sell things, so I have a central SalesFact table. We sell products to customers so I have a ProductDimension and a CustomerDimension. We operate in global divisions so I have a RegionDimension which contains the DivisionName, Currency, ExchangeRate etc. I also have a DateDimension table. I would imagine that most of this is pretty standard stuff.
I've even built the SSIS packages to bring all this together. It works and all my components are the same sizes and lined up neatly so it looks lovely, especially when its all green.
My problem is with products, specifically handling unknowns.
We have literally tens of thousands of products because we make these big complex machines and they have loads of piddling little parts so its quite common to have new parts come through the process and often they aren't listed in my source tables for one reason or another so they get assigned as unknown products. I've added a -1 "Unknown Product" to my ProductDimension table and these rows in the SalesFact table pick these up but I don't know what is supposed to happen after that. Kimball's book says to assign these -1s but are these products supposed to remain unknown forever? How do I go about adding a new Product manually to the ProductDimension? You see my ProductDimension table is truncated daily and re-populated from the source tables so if I add a Product in, it just gets wiped every morning.
I'm thinking that I could build a table for manually added products and use a UNION ALL to add these to the ProductDimension as a part of the import process but that seems like using a sticking plaster, when a more robust solution would be better. Another option would be to use the MERGE function instead of a TRUNCATE and total repopulation of the table but again that's messy because I'd have to put the data in a staging table first.
Does anyone have any advice you can give me to better handle these unknowns?
Many thanks.
Nick
The whole point of building a new warehouse when they already had one, was because it was designed by someone who had no experience of DW design and therefore didn't really know what he was doing and as a result its hard to manage and full of holes, not to mention hundreds of unknown products and customers (but more on that later). The fact that they've employed me with no DW experience to build them a better one is an irony utterly lost on the management here.
Anyway it turns out that a data warehouse is a whole different beast as I'm sure you're all aware and having managed 90% of the journey from blank piece of paper to working DW, I'm struggling with some of the finer points.
The basic architechture of my warehouse is as follows. We sell things, so I have a central SalesFact table. We sell products to customers so I have a ProductDimension and a CustomerDimension. We operate in global divisions so I have a RegionDimension which contains the DivisionName, Currency, ExchangeRate etc. I also have a DateDimension table. I would imagine that most of this is pretty standard stuff.
I've even built the SSIS packages to bring all this together. It works and all my components are the same sizes and lined up neatly so it looks lovely, especially when its all green.
My problem is with products, specifically handling unknowns.
We have literally tens of thousands of products because we make these big complex machines and they have loads of piddling little parts so its quite common to have new parts come through the process and often they aren't listed in my source tables for one reason or another so they get assigned as unknown products. I've added a -1 "Unknown Product" to my ProductDimension table and these rows in the SalesFact table pick these up but I don't know what is supposed to happen after that. Kimball's book says to assign these -1s but are these products supposed to remain unknown forever? How do I go about adding a new Product manually to the ProductDimension? You see my ProductDimension table is truncated daily and re-populated from the source tables so if I add a Product in, it just gets wiped every morning.
I'm thinking that I could build a table for manually added products and use a UNION ALL to add these to the ProductDimension as a part of the import process but that seems like using a sticking plaster, when a more robust solution would be better. Another option would be to use the MERGE function instead of a TRUNCATE and total repopulation of the table but again that's messy because I'd have to put the data in a staging table first.
Does anyone have any advice you can give me to better handle these unknowns?
Many thanks.
Nick
Re: First DW and hopelessly out of my depth
Why is your Product dimension getting truncated and reloaded each day?
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: First DW and hopelessly out of my depth
Truncating and reloading a dimension is really bad form. You are supposed to assign surrogate keys to dimension rows to ensure stability with the facts. This requires a process that inserts new dimension rows and updates to changed rows. Facts reference rows using the surrogate keys, that should not change.
As far as new items coming in from facts, you should infer new dimension rows. In addition to the surrogate key, there should be a natural, or business, key that identifies the product. Assuming the fact information provides the business key, you create a new row with the business key and a new surrogate key (not -1) and have the fact refer to that row using the new surrogate key. When the source system is eventually updated with that new product, you update the dimensions row, based on the business key, so that the attributes for that product are populated and become available to the fact.
The '-1' row is only for a situation where you have absolutely nothing (no business key). Such an assignment cannot be changed, unless you get an update to the fact itself.
As far as new items coming in from facts, you should infer new dimension rows. In addition to the surrogate key, there should be a natural, or business, key that identifies the product. Assuming the fact information provides the business key, you create a new row with the business key and a new surrogate key (not -1) and have the fact refer to that row using the new surrogate key. When the source system is eventually updated with that new product, you update the dimensions row, based on the business key, so that the attributes for that product are populated and become available to the fact.
The '-1' row is only for a situation where you have absolutely nothing (no business key). Such an assignment cannot be changed, unless you get an update to the fact itself.
Re: First DW and hopelessly out of my depth
Jeff Smith wrote:Why is your Product dimension getting truncated and reloaded each day?
Good question. Truth is its because this is what the previous DW did and as I know my predecessor was a Kimball disciple I assumed it was common practice. We live and learn, so I'm now in the process of fixing this. Thanks for your input.
Re: First DW and hopelessly out of my depth
ngalemmo wrote:Truncating and reloading a dimension is really bad form. You are supposed to assign surrogate keys to dimension rows to ensure stability with the facts. This requires a process that inserts new dimension rows and updates to changed rows. Facts reference rows using the surrogate keys, that should not change.
As far as new items coming in from facts, you should infer new dimension rows. In addition to the surrogate key, there should be a natural, or business, key that identifies the product. Assuming the fact information provides the business key, you create a new row with the business key and a new surrogate key (not -1) and have the fact refer to that row using the new surrogate key. When the source system is eventually updated with that new product, you update the dimensions row, based on the business key, so that the attributes for that product are populated and become available to the fact.
The '-1' row is only for a situation where you have absolutely nothing (no business key). Such an assignment cannot be changed, unless you get an update to the fact itself.
First of all thanks for your feedback and help, I can't explain how valuable this is.
I get the surrogate keys thing and I already have those implemented (I just didn't know that was what they were called). I'm working on scrapping the truncate and reload feature and I can now see what I should be doing with the ProductDimension.
I was planning to use a stored proc and the SQL MERGE function (whch is one of my favourite new SQL features) to handle inserting new dimension rows and updates to changed rows. However, I'm using SSIS for moving all my data around, doing lookups etc, so is there a specific SSIS component that can do this?
Re: First DW and hopelessly out of my depth
I've not used SSIS. My knowledge of it is fairly high-level. I would suspect there is a component that would do inference, as it is a common technique.
Re: First DW and hopelessly out of my depth
ngalemmo wrote:I've not used SSIS. My knowledge of it is fairly high-level. I would suspect there is a component that would do inference, as it is a common technique.
Well just as a matter of interest then, if you don't mind me asking, what do you use if not SSIS (which I, in my limited experience, had assumed was the industry standard for getting data into a warehouse and processing it)?
Re: First DW and hopelessly out of my depth
SSIS is a Microsoft product that comes with SQL Server. There are a host of other ETL products (Informatica, Data Stage, Ab-Initio, etc…) that work with any database system. They vary in how things are done, but overall they are all decent products.
Re: First DW and hopelessly out of my depth
If you have SQL Server, then the vast majority use SSIS. I've only seen one that does not. If you have a different database(s) then here you go, .
As you can see, SSIS is a niche player since it is only used on SQL Server.
As you can see, SSIS is a niche player since it is only used on SQL Server.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: First DW and hopelessly out of my depth
Ah ok. I'd naively assumed we were all Microsoft SQL Server users as it the only DB system of note that I've ever used & probably ever will.
Thanks all.
Thanks all.
Re: First DW and hopelessly out of my depth
Oracle was the most used DBMS for data warehousing for a long time, but I would guess SQL Server has probably surpassed them in number of implementations. SQL Server implementations tend to be smaller than Oracle implementations. Combined they are probably 60-70% of the market by # of implementations. Other players include (in no particular order) IBM DB2, Teradata, Netezza (aka IBM PureData), Vertica, Greenplum, ParAccel, and many others.
Re: First DW and hopelessly out of my depth
We're on SQL Server but use Cognos Data Manager as the ETL tool. We are planning to switch to Infomatica in 2014/15.
Jeff Smith- Posts : 471
Join date : 2009-02-03
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: First DW and hopelessly out of my depth
Hi Nick,
I use SSIS, and it does have a "Slowly Changing Dimension" wizard which builds components to deliver dimension data. However it has a quite a few flaws and quirks, and architecturally it processes row-by-row, so I wouldnt recommend it.
I normally build an SSIS Data Flow task to populate a Staging table, then follow that with an Execute SQL task to run a SQL MERGE statement.
Good luck!
Mike
I use SSIS, and it does have a "Slowly Changing Dimension" wizard which builds components to deliver dimension data. However it has a quite a few flaws and quirks, and architecturally it processes row-by-row, so I wouldnt recommend it.
I normally build an SSIS Data Flow task to populate a Staging table, then follow that with an Execute SQL task to run a SQL MERGE statement.
Good luck!
Mike
Re: First DW and hopelessly out of my depth
Mike Honey wrote:Hi Nick,
I use SSIS, and it does have a "Slowly Changing Dimension" wizard which builds components to deliver dimension data. However it has a quite a few flaws and quirks, and architecturally it processes row-by-row, so I wouldnt recommend it.
I normally build an SSIS Data Flow task to populate a Staging table, then follow that with an Execute SQL task to run a SQL MERGE statement.
Good luck!
Mike
Cheers Mike, that was initially my thinking only I don't like to use staging tables too much as they end up cluttering the database. I'll probably have a play with the Slowly Changing Dimension wizard but I expect I'll end up using MERGE, as I'm sure you'll agree its one of the best features to be added to SQL Server in years.
Re: First DW and hopelessly out of my depth
Hi Nick,
Yes SQL MERGE is effective, but I wish they had implemented it as an SSIS component - that might've avoid needing staging tables and the pages of custom code required to drive a SQL MERGE.
If you are trying the SCD Wizard, here is some info on its limitations:
http://forum.kimballgroup.com/t1772-microsoft-ssis-bids-scd-type-2-development
http://connect.microsoft.com/SQLServer/feedback/details/574910/ssis-slowly-changing-dimension-scd-transformation-with-fixed-attributes-never-detects-changing-attributes
http://connect.microsoft.com/SQLServer/feedback/details/586000/ssis-slowly-changing-dimension-component-spurious-error-inferred-indicator-column-must-be-of-type-dt-bool
http://connect.microsoft.com/SQLServer/feedback/details/658605/ssis-slowly-changing-dimension-component-does-not-detect-historical-changes-in-dt-dbtimestamp2-datetime2-columns
Good luck!
Mike
Yes SQL MERGE is effective, but I wish they had implemented it as an SSIS component - that might've avoid needing staging tables and the pages of custom code required to drive a SQL MERGE.
If you are trying the SCD Wizard, here is some info on its limitations:
http://forum.kimballgroup.com/t1772-microsoft-ssis-bids-scd-type-2-development
http://connect.microsoft.com/SQLServer/feedback/details/574910/ssis-slowly-changing-dimension-scd-transformation-with-fixed-attributes-never-detects-changing-attributes
http://connect.microsoft.com/SQLServer/feedback/details/586000/ssis-slowly-changing-dimension-component-spurious-error-inferred-indicator-column-must-be-of-type-dt-bool
http://connect.microsoft.com/SQLServer/feedback/details/658605/ssis-slowly-changing-dimension-component-does-not-detect-historical-changes-in-dt-dbtimestamp2-datetime2-columns
Good luck!
Mike
Similar topics
» New Approaches to Variable Depth Hierarchies?
» Dimensional Modeling In depth training
» Pictures during the course DW lifecycle in depth, White Plains, NY
» Alternative 4: Forced Fixed Depth Hierarchy Technique
» Dimensional Modeling In depth training
» Pictures during the course DW lifecycle in depth, White Plains, NY
» Alternative 4: Forced Fixed Depth Hierarchy Technique
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum