Data Vault vs. Dimensional Model
There seems to be a lot of interest over at LinkedIn around the Data Vault model, and I wondered what the thoughts here were about it. When would you use it versus a Dimensional Model, or even a 3NF model, for a data warehouse?
more info here: http://www.danlinstedt.com/AboutDV.php#Data_Vault_Through_Pictures
Looks like a hybrid, maybe the best of both or the worst of both?
Would welcome comments and thoughts to help guide me.
Re: Data Vault vs. Dimensional Model
I would say there are a few people proselytizing the merits of data vault modeling. Hard information is more difficult to come by. For example, if I have a question on dimensional modeling, there are umpteen different books, thousands of web sites, and countless companies that have implemented dimensional data warehouses. Even the posters proselytizing are wary of expressing too much detail in their posts! Until the data vault folks openly share their "secret sauce", I'm sticking with tried-and-true, successful data warehouse architectures.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Data Vault vs. Dimensional Model
The data vault approach has been around for a long time and has not really gained much traction in the field. It's interesting, but as you can gather from the link you provided, you wind up with a far more complex schema than you would with a dimensional approach. (And even more complex than the 3NF example from which the vault model was derived).
Also, there are flaws in the model he presents. If you look at the Complete Data Vault model, there is no link between customers, the sales force and orders UNLESS a shipment link exists. I don't know where he is going with this, but in my experience, orders occur, and are reported on well before any shipment takes place. And, sometimes orders are never shipped.
Re: Data Vault vs. Dimensional Model
Interesting topic. In recent years I have been implementing Kimball solutions quite a lot, and at the moment I am reading a great deal about the different methodologies. According to Linstedt, one reason for choosing Data Vault is the 'system of record': because of the cleansing, enriching, and so on, data lineage is very hard to establish in 'traditional' data warehousing approaches like Kimball and Inmon, and auditing is very difficult as well. Linstedt says that you should import the data into the data warehouse exactly as it is in the source system; business rules should be applied close to the business use, in the reports. I know too little (yet) about Data Vault for a complete answer.
Last edited by hennie7863 on Wed Dec 09, 2009 5:18 am; edited 1 time in total
hennie7863- Posts : 31
Join date : 2009-10-19
Re: Data Vault vs. Dimensional Model
I agree with Linstedt that the data warehouse should reflect the data in the source system. The primary goal is to be accurate, not 'correct'. Cleansing should be a function of the operational systems, with the DW serving as an aid to accomplish it. Cleaning the data warehouse should be a side effect of correcting the source system data.
But I don't see where his approach has any advantage or disadvantage over the other methodologies as far as audit or lineage is concerned. Also, no data warehouse, no matter what methodology you use, is the 'system of record'.
Re: Data Vault vs. Dimensional Model
Can't the warehouse be the system of record for derived attributes such as Total Customer Value? How about other attributes calculated through predictive models?
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Data Vault vs. Dimensional Model
BoxesAndLines wrote: Can't the warehouse be the system of record for derived attributes such as Total Customer Value? How about other attributes calculated through predictive models?
Not really. "System of record", from a legal/audit/SOX et al. point of view, is the system in which the actual transaction took place and was recorded. These are the operational systems. You may use the data warehouse to derive various metrics, but the data used to derive those metrics (if they are being used for public disclosure) must ultimately tie back to the individual transactions in the system of record from which they were derived.
Technically, this rigor does not apply to data being used to make business decisions, because such decisions are also based on a significant amount of subjective thought. However, a properly implemented BI solution should have controls in place to prevent fraud, so that metrics from the system accurately reflect what is truly occurring in the business.
I appreciate the thoughts...
Hi Folks,
I appreciate the thoughts - and yes, you are right to mention that there is not enough published on the subject. Please remember that Dr. Kimball has published more information than I have to date, and he also has had many more years of marketing, speaking, and sharing than I have. I published the first article/paper on Data Vault in 2001 on www.TDAN.com - it's a technique that's only been known to the EDW / BI world for 9 or 10 years now. However, it is a strange phenomenon that people are not discussing the Data Vault Model as freely as the Kimball Star Schema.
Anyhow, in case you were wondering: I agree with Nick G in some areas - any good enterprise data warehouse should be accountable and auditable, able to pass audits, regardless of the data modeling technique used. Yes, the Data Vault Modeling techniques are a hub-and-spoke design - a mix of normalized modeling components with type 2 dimensional properties in the Satellites. It is in fact a hybrid data model, spanning some of the flexibility of 3NF and combining it with some of the techniques of dimensional modelling. But that's where the similarities stop. Too often people confuse the data model implementation aspects with the "approach" of building a data warehouse, or the framework as it were. The framework that the Data Vault Modeling components rely on includes a staging area, an enterprise data warehouse (physical, not logical), followed by Star Schemas, cubes, and other data release mechanisms to properly push the data to the end users or the business. This framework is different from that of a "logical EDW" made up of loosely affiliated Star Schemas combined with a staging area.
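To make the hub-and-spoke structure concrete, here is a minimal sketch of the three table types in generic SQL - the table and column names are purely illustrative, invented for this post rather than taken from any particular implementation:

-- Hub: one row per unique business key, plus load metadata
CREATE TABLE hub_customer (
    customer_hkey  INTEGER      NOT NULL PRIMARY KEY,
    customer_bk    VARCHAR(30)  NOT NULL UNIQUE,   -- business key from the source
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(20)  NOT NULL
);

-- Link: one row per relationship (intersection) between business keys
CREATE TABLE lnk_customer_order (
    link_hkey      INTEGER      NOT NULL PRIMARY KEY,
    customer_hkey  INTEGER      NOT NULL REFERENCES hub_customer (customer_hkey),
    order_hkey     INTEGER      NOT NULL,           -- would reference a hub_order
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(20)  NOT NULL
);

-- Satellite: descriptive attributes tracked over time (type 2 style)
CREATE TABLE sat_customer (
    customer_hkey  INTEGER      NOT NULL REFERENCES hub_customer (customer_hkey),
    load_dts       TIMESTAMP    NOT NULL,
    load_end_dts   TIMESTAMP,                       -- NULL while the row is current
    name           VARCHAR(100),
    region         VARCHAR(50),
    record_source  VARCHAR(20)  NOT NULL,
    PRIMARY KEY (customer_hkey, load_dts)
);

The hubs carry only keys, the links carry only relationships, and all change-over-time lives in the satellites - which is where the type 2 behaviour mentioned above comes from.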
Another difference is where the business rules occur. Nick G hit it on the head, again - the data warehouse should be accountable and auditable - which by default means pushing the business rules that change or alter data downstream, to "between the data warehouse and the data marts". This is definitely a principle I agree with.
I can't speak for why people are not writing about it, but let me weigh in for a minute: DOD, US Navy, US Army, Edmonton Police, SNS Bank, World Bank, ABN-AMRO, UBS, Central Bank of Indonesia, Monash University, Central Bureau of Statistics (Netherlands), Denver University, City & County of Denver, USDA, Daimler Motors, Air France, Nutreco, Microsoft, JP Morgan Chase, Lockheed Martin, and the list goes on - these companies are using/building or have used the Data Vault successfully over the past several years. There must be something new or different that the data model and standards bring to the table, or it wouldn't be accepted and used by so many companies.
OK, that said, let's talk about the definition of system of record. Here is where I disagree with Nick G. I would argue that the definition of what a data warehouse IS, and its purpose, has changed over the years - in particular, when it comes to being the ONLY place that real-time information is stored historically, or integrated with the other data. In this case, it becomes the ONLY place where an auditor can actually perform an audit of what data was collected along the way and how, hence it becomes a system of record.
It's changing again, with a concept called "Operational Data Warehousing", where the principles of a data warehouse are being pushed and redefined by the business. The business actually builds an operational application right on top of the data warehouse, and edits the data set directly, creates new records directly - writing the history back to the warehouse in real time. In this case, too, it becomes a system of record. Three different companies have built systems like this that are functioning today: ING Real Estate HQ, a congressional effort for medical records management of the armed forces, and Cendant Timeshare Resource Group. Like it or not, the introduction of "real-time data" flowing into and being captured by the data warehouse makes it (by default) a system of record. More often than not the source data (after delivery to the EDW) is destroyed on the machine that performed the capture... especially if it's a capture machine outside the EDW.
Anyhow, the point is that data warehouses also frequently house data from "old/retired", often dismantled, legacy systems. Even if the auditor wanted to, they couldn't audit the source system because it's been scrapped or junk-heaped somewhere. So in this case, the auditor treats the EDW as a system of record, like it or not. I've actually been in this situation several times, and passed audits because of the manner in which I treat the data warehouse.
Now, all of that is separate from the data modeling constructs used to build your EDW. I've simply chosen to use the Data Vault for flexibility and scalability of the systems design. Let's talk about that for a minute...
Would you agree that the most scalable architecture (among NUMA, MPP, SMP, etc.) is MPP - Massively Parallel Processing? Or would you tend to say that some other architecture is more suitable to economies of scale? It seems to me that MPP is the clear winner over all the others. Actually, a hybrid is used these days to support cloud-based systems: massive SMP (symmetric multiprocessing, i.e. clustered machines) combined with an MPP shared-nothing architecture, to create a scale-out, scale-free cloud. Anyhow, if the mathematics behind MPP have proven to be true (for example, divide and conquer), then why not utilize this type of architecture within database engines? Why not take MPP mathematics and apply them to the basics of data modeling, by dividing and conquering or parallelizing the database operations?
A first shot at this is partitioning of the tables, but it has to go deeper - to where the joins live. There are age-old arguments about more joins versus fewer joins and which is better, but at the end of the day it comes down to a balanced approach: how many joins can your hardware and RDBMS software execute in parallel effectively? How much data can your machine access in parallel, independent processes? Can you effectively make use of both types of partitioning (vertical and horizontal)? Can you apply compression across the board and enhance performance? Vertical partitioning is what the column-based appliances and databases have been doing for massive performance increases on the query side.
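To picture the two kinds of partitioning with a concrete (and purely invented) example: a wide customer Satellite can be split vertically by column group, and each piece can additionally be partitioned horizontally on its load date - the exact horizontal partitioning syntax varies by engine, so it is only described in a comment here:

-- Vertical partitioning: one wide satellite becomes two narrower ones,
-- grouped by how often the attributes tend to change
CREATE TABLE sat_customer_profile (
    customer_hkey  INTEGER   NOT NULL,
    load_dts       TIMESTAMP NOT NULL,
    name           VARCHAR(100),
    date_of_birth  DATE,
    PRIMARY KEY (customer_hkey, load_dts)
);

CREATE TABLE sat_customer_contact (
    customer_hkey  INTEGER   NOT NULL,
    load_dts       TIMESTAMP NOT NULL,
    email          VARCHAR(100),
    phone          VARCHAR(30),
    PRIMARY KEY (customer_hkey, load_dts)
);

-- Horizontal partitioning: each satellite can additionally be range-partitioned
-- on load_dts (declarative PARTITION BY RANGE where the engine supports it), so
-- old history sits in separate partitions that compress well and prune out of queries.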
The Data Vault Model makes use of some of the basic ideas buried within the MPP realm - divide and conquer, apply compression to columns where redundancy lives, and apply unique, high-speed, 100%-coverage indexes where possible. Hence the hub-and-spoke design. The larger the system (or the larger the number of tables), the more I can split the tables and the system up across multiple independent machines. It's a pure fit for virtualized computing/cloud computing resources. It's also a scale-free architecture, because I don't need to store ALL my data in a single geographical place to get at it quickly. Anyhow, the hub and link structures enable the model to join in 100% parallel across MPP hardware and RDBMS software that is configured properly, and at economies of scale the MPP query will always win over a serialized join (once the machine and database are tuned appropriately).
Anyhow, there's a lot more to this than meets the eye. It may look like a simple data model, but it's the reasons why we do what we do that make it powerful. It's the purpose behind the architecture that is currently driving companies like the Netherlands Tax Authority and the DOD to select the Data Vault for use in some of their largest (data-set-wise) projects.
Now folks might ask: well, isn't it overkill for small data warehouses? Maybe, maybe not - it depends on weighing the cost of modifying a "stage/star schema" architecture down the road against the cost of modifications to a Data Vault. Hence a company like Philips Electronics is also undertaking a Data Vault effort to consolidate and centralize raw data feeds from all across its divisions.
What I would say is this: If you're happy building Star Schemas as your data warehouse, great - if it's working for you, wonderful - don't change. But if you run into situations that require heavy re-engineering, cause your project to "stop & restart" or be completely rebuilt every 2 to 3 years, then you might want to look at the Data Vault as an alternative. Again, if you have little to no pain in the current architecture you are using, then don't change - don't fix what's not broken....
I hope this helps, as always - take what you like, use the good things that are there. Take the time to investigate new ideas, throw away the rest. After all, this is just more information. And remember: the world NEEDS Star Schemas, Cubes, and alternative release mechanisms - we couldn't feed the business needs without them! I'm merely proposing the Data Vault Model as an Enterprise Data Warehouse layer, NOT as a data delivery layer... Leave that to the expert: Dr. Ralph Kimball.
Please feel free to contact me directly if you wish.
Thank-you kindly,
Dan Linstedt
DanL@DanLinstedt.com
dlinstedt- Posts : 7
Join date : 2010-03-15
Re: Data Vault vs. Dimensional Model
I know this is an old post, but I just noticed something relevant might have happened recently in the SQL Server world. Has anyone heard about Master Data Services (MDS) in the Denali version of SQL Server? It sounds to me like there is a close connection between Data Vault and MDS. I believe that in Microsoft's vision, the Data Vault and the dimensional model could coexist in a DW/BI system.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Data Vault vs. Dimensional Model
Microsoft MDS is simply Master Data Management, i.e. managing your product or customer reference data.
It allows versioning, hierarchy management, etc. It is already in 2008 R2.
Denali provides an easy-to-use Excel add-in and fixes some performance issues.
What MDS does is create versioned views that other systems or the DW can use.
Re: Data Vault vs. Dimensional Model
However, is the underlying data model of most Master Data Management systems based on Data Vault? I have heard terms like Hub and Spoke used for MDS, which sound like pretty much the same concepts as Hub and Satellite in Data Vault.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Proper assumptions should drive the model technique
The Vault vs. Dimensional vs. Inmon question is a subject that has been debated a lot. Unfortunately, I have not found much concrete information about it that takes the discussion down to the level of actually solving the business problems in the enterprise:
1) Do not lose data (auditing/compliance)
2) Model it, so you can understand it
3) Make it physical, so you can query it
From my experience with building warehouses (which is significant - I have built large telco CDR, investment banking, and retail systems, to name a few), dimensional modeling is THE best delivery mechanism for data - and this is becoming increasingly evident with the mainstream acceptance of column store technologies. But dimensional models DO suffer from throwing away data (1 above) - especially in type 2 dimension design. There is an option for extending the dimensional model to be a better fit for purpose here.
I still see some light controversy about 3NF vs. the dimensional model as the end-user accessible format, but it boils down to a simple fact: joins are expensive. And this brings me to Dan's point on join/sec, because that argument simply does not hold up in highly scaled (100s of TB) systems that need low-latency access (minutes or seconds). Having many tables does not mean you work well on MPP systems - on the contrary, you saturate the interconnect links between the nodes. The join/sec price is, in other words, only linear if the data model supports it. And this leads to the crux of the argument: in order to get from 1 to 3 above, you must have a model that allows a fast way to get from "never lose data" storage to a dimensional model. Seen in that perspective, Data Vault is an immensely inefficient way to store data in step 1, since it involves expensive joins that do not lend themselves well to optimization, even less so than an Inmon-style 3NF model. Being generous, there are some ideas in Data Vault that can be hybridized into the Kimball model to meet all three benefits above elegantly - but that really boils down to an age-old trick from the OLTP world: storing history in a separate table from the main entity.
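As a rough sketch of that old trick (table names invented for illustration): keep the dimension itself lean and current for query speed, and append every superseded version to a companion history table so nothing is ever thrown away:

-- The current version of each member lives in the dimension itself...
CREATE TABLE dim_customer (
    customer_key  INTEGER      NOT NULL PRIMARY KEY,
    customer_id   VARCHAR(30)  NOT NULL,
    name          VARCHAR(100),
    segment       VARCHAR(50)
);

-- ...while superseded versions are appended to a history table on every change
CREATE TABLE dim_customer_history (
    customer_key  INTEGER      NOT NULL REFERENCES dim_customer (customer_key),
    valid_from    TIMESTAMP    NOT NULL,
    valid_to      TIMESTAMP    NOT NULL,
    name          VARCHAR(100),
    segment       VARCHAR(50),
    PRIMARY KEY (customer_key, valid_from)
);

Day-to-day queries never touch the history table, yet point 1 above is still satisfied.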
There are also some assumptions in Data Vault that simply are not true - most importantly, that the loading of a star schema is expensive or somehow harder to do in parallel than in a data vault. I have personally tuned the first "ETL World Record" and can today load over 1 TB in 10 minutes into a star schema - before I even start thinking about scale-out/MPP systems. I have built systems that slam millions of rows every second into real-time query data structures and can update billions of rows in minutes. All of them relied on the dimensional model alone and have no issues with concurrency or high-speed loading in that model.
In Data Vault Series 1 - Data Vault Overview we also find the interesting quote: "The current data models based on 3NF or star schema are proving difficult to modify, maintain and query, let alone backup and restore." First of all, this is not really the case: the disciplines required to build, maintain and grow warehouses are well understood and implemented (with success) in a large number of organizations and it would be a stretch to call them "difficult". Secondly, while it is true that large data sizes present some unique challenges - Data Vault does not provide any novel solutions to them. Yes, it is easy to build a highly normalized model that only has indiscriminate inserts - but if that model is hard to bring into a dimensional schema (or needs materialized tables to query properly), one has achieved nothing and might as well not have bothered with modeling the data in the first place.
Of course, it requires a rather long argument to fully refute Data Vault as a modeling technique, while at the same time extending the Kimball model to deal more efficiently with point 1 above. This is beyond a forum post. I am currently moving this argument forward in my blog. The outline of the argument can be found by Googling "kejser blog intermezzo" (sorry, I can't post links since I'm new in the forum) - but as said, it requires a long series of blog posts to flesh out; looking forward to comments.
Last edited by thomaskejser on Sun Aug 14, 2011 6:45 pm; edited 1 time in total (Reason for editing : spelling errors)
thomaskejser- Posts : 12
Join date : 2011-08-14
Proper Assumptions should include scalability, flexibility, and maintenance
Thomas,
I appreciate your response - however, I have worked with hundreds of TB in a Data Vault model, on Teradata no less. I've also worked with very large-scale systems into the 3 PB range using Data Vault techniques.
There are several points I want to make with your statements:
Your Statement:
"And this brings me to Dan's point on join/sec, because that argument simply does not hold up in highly scaled (100s of TB) system that need low latency access (minutes or seconds)."
I disagree, as this has not been my experience on-site with customers. When was the last time you needed to join "all" the tables in a data warehouse (for any data modeling technique)? If that is what you are doing, then sure, the Data Vault has more joins, and therefore if you query the entire model, yes, it will perform worse than other models. However, due to the normalization, AND query optimization techniques such as table elimination, AND the co-located use of elements such as join indexes, the join performance is in fact FASTER than with other modeling techniques. Why? Because the row sets are normalized, and because the tables can be queried in parallel due to that normalization. Co-location is KEY, as is data spread or distribution in an MPP environment.
The Data Vault model offers you the ability to choose the right distribution over the right nodes. The links act as join-indexes, and can be co-located with the largest of the tables depending on what you choose as the Primary Index (or hash distribution column set) for the nodes in the MPP environment. The queries have less I/O because the data is normalized, the Data Vault can fit more rows per block on disk, and furthermore the engine runs in parallel - so we see a performance multiplier in a truly independent MPP system.
IF the MPP design for the RDBMS is done properly, AND the links (think join indexes) are co-located properly, then you do NOT saturate the interconnects - in fact, just the opposite. The data coming from each independent node is removed from the join if it's not used, and it never leaves the node to be checked against the other nodes at all. Only the pre-filtered results are moved, and at that, because the Link tables are narrow (optimized join indexes), only the KEYS to the non-co-located tables are moved to other nodes. THEN each node again, in parallel, fetches the matching data set off disk, to return a much smaller result set.
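A minimal Teradata-flavoured sketch of the co-location point (the names and types are invented for the example, not from any customer system): if the hub, its satellite, and the link all hash-distribute on the same key, the rows that have to be joined land on the same node and the join resolves locally:

CREATE TABLE hub_customer (
    customer_hkey  BIGINT       NOT NULL,
    customer_bk    VARCHAR(30)  NOT NULL,
    load_dts       TIMESTAMP(0) NOT NULL
) PRIMARY INDEX (customer_hkey);

CREATE TABLE sat_customer (
    customer_hkey  BIGINT       NOT NULL,
    load_dts       TIMESTAMP(0) NOT NULL,
    name           VARCHAR(100),
    region         VARCHAR(50)
) PRIMARY INDEX (customer_hkey);   -- same distribution key: co-located with the hub

CREATE TABLE lnk_customer_order (
    link_hkey      BIGINT       NOT NULL,
    customer_hkey  BIGINT       NOT NULL,
    order_hkey     BIGINT       NOT NULL,
    load_dts       TIMESTAMP(0) NOT NULL
) PRIMARY INDEX (customer_hkey);   -- narrow link on the same key: only the order
                                   -- keys ever need to cross the interconnect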
Your statement:
"Seen in that perspective, Data Vault is an immensely ineffecient way to store data in step 1, since it involves expensive joins that do not lend themselves well to optimization, even less so than an Inmon style 3NF model."
Again I have to disagree; see my explanation above. In fact, with the given parallelism at the query level and at the hardware level fetching data, and with the elimination of tables that don't need to be part of the query, the execution of queries against the structure is extremely efficient, and fast.
Let's talk for a minute about the notion of denormalization, or removing joins... There is a benefit to denormalization, but there is also an upper limit to the WIDTH of the row that can be set. The real question is: how many rows can I fit in a disk block before I push I/O higher on an exponential scale? Block sizes vary between database systems. In Teradata it used to be 64 KB; I think now they've boosted that to 256 KB. Oracle can go to 64 KB only on certain platforms, and so on... Any RDBMS on Microsoft is locked into 8 KB physical blocks, unless the physical disk is reformatted to 64 KB - but even then, the Microsoft OS must work with 8 KB block sizes.
What happens with denormalization (i.e., removal of joins)? You reach a sweet spot where performance is optimal - great, that means the I/Os required to get the rows out are the least amount needed to access the data, and are balanced with the query. HOWEVER, you are still left with a SEQUENTIAL SCAN of a single table - and now you are relying on the "smartness" of the disk and the RDBMS to know that you want a table scan, and to read ahead the blocks into RAM to gather the information.
But this has limitations... It cannot scale beyond this point (try it). Continue to add width to the row (which doesn't necessarily mean adding new columns): you will find that once you pass this sweet spot in width (denormalize by removing another join and combining in another table's columns), the performance will begin to degrade on an exponential scale. Why is that? Because the I/O is STILL stuck on sequential access, and it is still forced to TABLE SCAN; now there are fewer rows per block, and something else starts to happen... Rows start to "chain" over and across blocks, which doubles, then triples, then quadruples the I/O - and for gosh sakes, if you hit fragmentation, watch out...
This is where normalization (many rows per block), combined with proper parallel query and the right co-location, can far surpass any denormalized model, because you can achieve a BALANCE with the right layout and distribution of the data. It also means that the scalability is limitless (in keeping with the hardware changes). These are the laws of MPP: divide and conquer.
It's an easy thing to try: run performance tests with different row sizes - 1 KB, 2 KB, 4 KB, 8 KB, 16 KB per row. THEN run performance tests on a system like Teradata where there are equivalent amounts of data, but the rows are split and joined together. Which runs faster at 100 MB, 500 GB, 1 TB, 250 TB, 500 TB, 1 PB? You can try this with "1 denormalized table" vs. "2, 3, 4, 5 normalized tables and joins"... I have.
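As a back-of-the-envelope illustration of the block arithmetic (all numbers invented, ignoring block headers and compression):

    8 KB block, 400-byte rows  ->  ~20 rows per block  ->  scanning 100M rows reads ~5M blocks
    8 KB block, 2 KB rows      ->   ~4 rows per block  ->  the same 100M rows read ~25M blocks
    rows wider than the block  ->  every row chains across blocks, multiplying the I/O again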
Your Statement:
"the disciplines required to build, maintain and grow warehouses are well understood and implemented (with success) in a large number of organizations and it would be a stretch to call them "difficult".
If this is the case, why are so many organizations struggling with "spread-marts"? Why are so many organizations complaining about the lack of IT agility every time they want a change? Why can't data warehouses be built and developed the same way twice, three times, etc? If this is the case, why hasn't the industry matured and optimized the building of data warehouse systems by automating 80% or more of the mundane tasks? If this is the case, why are businesses still challenging the value and cost of implementing BI solutions? If this is the case, why does it take IT months or even years to "cross-integrate" new sources, new data sets, and provide new answers?
Let me back up a minute... There are many architects out there who are _very_ good at systems architecture, data architecture, and process architecture. You may be one of them; as I don't know your background, I can't comment on your level of knowledge. What I will say is that the folks I meet like this are few and far between - however, the systems they build (regardless of the data model applied) are solid, well documented, understood, and yes, considered successes by their business customers. OK, kudos to them. But when I begin asking questions (as I have above), they begin to leave the room... The answers simply don't exist.
Something we learn in SEI/CMMI training is: repeatability, measurement, cost & risk, optimization, documentation, problem finding, consistency, standardization - and then of course: finally (after nearly all quantities are known) we can begin to "automate" using tool sets to construct systems, and do the work of 3 to 6 people at once. But I ramble....
Your Statement:
"Secondly, while it is true that large data sizes present some unique challenges - Data Vault does not provide any novel solutions to them. Yes, it is easy to build a highly normalized model that only has indiscriminate inserts - but if that model is hard to bring into a dimensional schema (or need materialized tables to query properly) one has achieved nothing and might as well not have bothered with modeling the data in the first place. "
Hmmm... First, the articles you read were written in 2001 and 2002. I have far more advanced topics, explanations, and proven successes in the market today that I share during the Data Vault Modeling Certification Class. And if there were "this many holes" in the data model, then customers like some of the world's leading financial institutions would NOT be using these techniques. Customers like the DoD, NSA, FAA, FDA, USDA, and more would not be implementing Data Vault models. As for indiscriminate inserts... that couldn't be further from the truth. The Data Vault Model is built around 100% inserts 98% of the time, and the only reason for that last 2% is that the RDBMS engines today don't fully handle complete temporality of data sets. We (just as with star schema type 2 dimensions) must issue an update to the END-DATE columns of Satellite descriptor rows.
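To show how small that exception really is, here is a minimal sketch of the end-date handling, reusing the illustrative sat_customer table from my earlier post (the key and values are made up):

-- Close out the current version for a business key whose attributes changed
UPDATE sat_customer
SET    load_end_dts = CURRENT_TIMESTAMP
WHERE  customer_hkey = 42
AND    load_end_dts IS NULL;

-- ...then the new version is a plain append, like every other Data Vault load
INSERT INTO sat_customer
    (customer_hkey, load_dts, load_end_dts, name, region, record_source)
VALUES
    (42, CURRENT_TIMESTAMP, NULL, 'New Name', 'EMEA', 'CRM');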
As for your claim that it is "hard to bring in to a dimensional model": that is simply flat-out false. I can (and have) built and loaded Star Schemas from a Data Vault within 45 minutes of receiving a two-page requirements document filled out by business users and customers. So I'm not sure where you're getting your information about it being difficult to load or build star schemas.
Your claim about "needing to build materialized views" to bring the data in is also false; this is not necessary in all cases. In fact, in Teradata I can build VIRTUAL star schemas (views that look like dimensions and facts), and in those cases I never have to land the data physically anywhere after the Data Vault. In SQL Server, I can push the data from the Data Vault directly to an OLAP cube without resorting to materialized views. There are hundreds of successful implementations, and many customers on Oracle, Teradata, SQL Server, DB2 EEE, MySQL, and even Netezza.
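To illustrate what a virtual dimension can look like (again with the invented tables from my earlier sketch, not anyone's production schema), a type 1 style dimension is just a view over the hub and the current satellite row:

CREATE VIEW dim_customer_v AS
SELECT h.customer_hkey  AS customer_key,
       h.customer_bk    AS customer_id,
       s.name,
       s.region
FROM   hub_customer h
JOIN   sat_customer s
  ON   s.customer_hkey = h.customer_hkey
WHERE  s.load_end_dts IS NULL;   -- current version only

Point a fact view or a cube at this and nothing ever needs to be physically re-landed.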
Your Statement:
I have personally tuned the first "ETL World Record" and can today load over 1 TB in 10 minutes into a star schema - before I even start thinking about scale-out/MPP systems. I have built systems that slam millions of rows every second into real-time query data structures and can update billions of rows in minutes
OK - please justify this. Was this on Teradata? Oracle? DB2 UDB? Netezza? DATAllegro? SQL Server? ParAccel? Or something else? What was the underlying hardware? How much money did the customer spend on making the infrastructure viable? And more to the point, did you then build a comparable Data Vault model and test the exact same data set against it?
These numbers have no bearing on this conversation without equal comparison in the same environment using both data modeling approaches at the same time. I can & have done the same thing as you have done, only using Data Vaults, so I'm of the opinion this is a moot point.
Given the right hardware and infrastructure (once you shut off foreign keys, defaults, and constraints), any reasonably well-modeled design is capable of this behaviour, but there is simply not enough information in your statement to compare the Star Schema data model against the Data Vault model in terms of performance.
I too have a blog, danlinstedt (dot) com, and have posted many, many topics about the Data Vault and its success. I also have a new book on the subject at learndatavault (dot) com; feel free to take a look. You can also find me (if you would like to contact me) at linkedin (dot) com (slash) dlinstedt
I mean no disrespect, I just have concerns about your statements.
Thank-you kindly,
Dan Linstedt
dlinstedt- Posts : 7
Join date : 2010-03-15
Re: Data Vault vs. Dimensional Model
dlinstedt wrote: In fact, in Teradata I can build VIRTUAL star schemas (views that look like dimensions and facts), and in those cases I never have to land the data physically anywhere after the Data Vault. In SQL Server, I can push the data from the Data Vault directly to an OLAP cube without resorting to materialized views.
In Kimball's methodology, there is an important fact table type among only three types: the periodic snapshot fact. I know a lot of people feel very uncomfortable repeating mostly the same data period after period, as the table can get very big over time, especially for a daily snapshot.
However, there is another efficient way to store this type of fact, using an effective date pair and treating the fact table like an SCD dimension. Is this something similar to the Link concept in Data Vault? I can see a couple of reasons for doing periodic snapshots: a simpler load process, and better performance for trend analysis, since you don't need to cross join to the date dimension to produce the records on the fly. I think it really comes down to performance at the cost of a much deeper fact table, a cost which may become less significant with a highly normalised fact table and horizontal partitioning techniques.
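To be clear about what I mean by an effective-dated fact, here is a sketch with made-up names:

-- One row per change to an account balance, carrying a date range,
-- instead of one row per account per day.
CREATE TABLE FACT_ACCOUNT_BALANCE (
    ACCOUNT_KEY     INTEGER        NOT NULL,
    EFFECTIVE_FROM  DATE           NOT NULL,
    EFFECTIVE_TO    DATE           NOT NULL,   -- e.g. 9999-12-31 for the current row
    BALANCE_AMOUNT  DECIMAL(18,2)  NOT NULL,
    PRIMARY KEY (ACCOUNT_KEY, EFFECTIVE_FROM)
);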
Another point about performance is star schema optimisation, a feature that has been implemented by many major RDBMS vendors. Can data vault's virtual star schema leverage the feature effectively? One convincing point on performance in dimensional modeling is that the join is based on surrogate keys without date constraints, whereas business key based joins plus date-range constraints would compromise performance significantly.
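To show the contrast I have in mind (hypothetical table names; in the second query the fact is assumed to carry the business key and sale date):

-- Dimensional: a single equality join on the surrogate key, friendly to star-join optimisation.
SELECT d.CUSTOMER_SEGMENT, SUM(f.SALES_AMOUNT) AS SALES
FROM   FACT_SALES f
JOIN   DIM_CUSTOMER d ON d.CUSTOMER_KEY = f.CUSTOMER_KEY
GROUP  BY d.CUSTOMER_SEGMENT;

-- Virtual star over a vault: business-key join plus a date-range predicate to pick
-- the satellite row that was current on the day of the sale.
SELECT s.CUSTOMER_SEGMENT, SUM(f.SALES_AMOUNT) AS SALES
FROM   FACT_SALES f
JOIN   HUB_CUSTOMER h ON h.CUSTOMER_BK = f.CUSTOMER_BK
JOIN   SAT_CUSTOMER s ON s.CUSTOMER_SID = h.CUSTOMER_SID
                     AND f.SALE_DATE >= s.LOAD_DATE
                     AND (s.END_DATE IS NULL OR f.SALE_DATE < s.END_DATE)
GROUP  BY s.CUSTOMER_SEGMENT;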
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Specific answers to your questions
Hi Hang,
The virtual (SQL views) can provide type 1 and type 2 dimension views without storing copies of the data. Type 3 dimensions can be done, but require much more complex SQL because we are pivoting data over time into hard-coded columns. I would not recommend producing type 3 dimensions virtually from the Data Vault model.
All of the "data over time" is stored in the Satellites. The Links are parent tables (much like factless facts), with child tables (Satellites) hanging off them - except in the case of a transactional Link, which looks exactly like a fact table. However, there are specific rules about when and how to use a transactional Link, otherwise flexibility and data integrity can be lost. My book at learnDataVault (dot) com (slash) purchase-book discusses the architecture and the rule sets in detail. Because the Data Vault model is a hybrid approach (a cross between 3NF and Star Schema), it inherits the best of breed of both models, along with some (but not all) of the structural components.
Without a table example about your comments about a "deeper fact table" I am having trouble seeing your points about performance.
Now, regarding surrogate keys: the Data Vault model uses surrogate keys for all tables, and again, because it's a hybrid of 3NF and Star Schema, it can make use of not only the "star-join" optimizations, but also the query elimination techniques AND the parallel query engines. Effectively the Satellite data is vertically partitioned, and can be further partitioned horizontally for added gains. The Satellite structure is the surrogate key plus the date constraint, but again, because of table elimination and partitioning mechanisms, only the tables that are needed are brought into the query.
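For readers following along, a bare-bones sketch of the three structures (names and types are illustrative only):

CREATE TABLE HUB_CUSTOMER (
    CUSTOMER_SID     INTEGER      NOT NULL PRIMARY KEY,   -- surrogate key
    CUSTOMER_BK      VARCHAR(30)  NOT NULL UNIQUE,        -- business key
    LOAD_DATE        TIMESTAMP    NOT NULL,
    RECORD_SOURCE    VARCHAR(12)  NOT NULL
);

CREATE TABLE LINK_CUSTOMER_ORDER (
    CUST_ORDER_SID   INTEGER      NOT NULL PRIMARY KEY,
    CUSTOMER_SID     INTEGER      NOT NULL REFERENCES HUB_CUSTOMER,
    ORDER_SID        INTEGER      NOT NULL,               -- would reference HUB_ORDER
    LOAD_DATE        TIMESTAMP    NOT NULL,
    RECORD_SOURCE    VARCHAR(12)  NOT NULL
);

CREATE TABLE SAT_CUSTOMER (
    CUSTOMER_SID     INTEGER      NOT NULL REFERENCES HUB_CUSTOMER,
    LOAD_DATE        TIMESTAMP    NOT NULL,
    END_DATE         TIMESTAMP,                           -- NULL for the current row
    CUSTOMER_NAME    VARCHAR(100),
    CUSTOMER_SEGMENT VARCHAR(30),
    RECORD_SOURCE    VARCHAR(12)  NOT NULL,
    PRIMARY KEY (CUSTOMER_SID, LOAD_DATE)
);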
Hope this helps,
Dan Linstedt
http (colon) learnDataVault (dot) com
dlinstedt- Posts : 7
Join date : 2010-03-15
Re: Data Vault v's Dimensional Model
Hi Dan,

dlinstedt wrote: Without a table example about your comments about a "deeper fact table" I am having trouble seeing your points about performance.
Sorry for the confusion. I really meant that Kimball's periodic snapshot fact table produces better performance at the cost of redundant data materialisation on disk, resulting in a much deeper fact table than an effective-dated fact or perhaps DV Link tables. However, without materialising the snapshot data in a physical table, you may need to cross join the fact with the date dimension to have a virtual snapshot in memory to enable trend analysis, for instance by a cube. I am concerned about the performance cost caused by that cross join.
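The on-the-fly expansion I am worried about would look roughly like this (illustrative names again):

-- Expand an effective-dated balance fact into a daily snapshot at query time.
SELECT d.CALENDAR_DATE,
       f.ACCOUNT_KEY,
       f.BALANCE_AMOUNT
FROM   FACT_ACCOUNT_BALANCE f
JOIN   DIM_DATE d ON  d.CALENDAR_DATE >= f.EFFECTIVE_FROM
                  AND d.CALENDAR_DATE <  f.EFFECTIVE_TO;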
I can see the point of vertical partitioning by Satellites, as the changing attributes in dimensions are stored in respective tables according to their changing patterns, instead of in wide denormalised dimension tables. There is no doubt the performance gain from normalising big dimensions would be significant. In dimensional modeling, Kimball also suggests normalising monster dimensions by breaking them down into a set of tables based on the attributes' changing patterns and cardinalities.
Correct me if I am wrong. I guess the critical difference between dimensional modeling and data vault is the treatment of dimensions. The former denormalises in general but normalises in some special cases, whereas in data vault, dimensions must be physically in 3NF, and the denormalised dimension format may only be achieved through a virtual layer for user consumption. Kimball suggests having minimal dimension entries in the fact table, through reasonable dimension denormalisation, to avoid a centipede fact table. Data vault keeps dimensions normalised and connected through link tables to minimise the data redundancy in the dimension tables, and controls dimension growth at the most granular level.
The common ground for both methodologies would be minimal snowflaking and shallow joins between dimensions and facts. I am not sure how DV handles multivalued attributes. I guess it would use Links and store attributes in their respective satellites at the level of their own grain. Does DV accept junk dimensions or mini-dimensions? Or is that too denormalised, so that the fast-changing, low-cardinality attributes should sit in individual, barely worthwhile satellites? I have seen designs with 20 dimension tables of only two records each, in the name of normalisation. Even in relational thinking, so many piecemeal tables would make you feel uncomfortable. Believe me, a few joins making tons of low-cardinality attributes available in one place is awfully convenient and comforting, although understanding the concept and implementing it is mind bending.
I don't think performance would be a major concern in terms of dimension joins for either model. It really comes down to ease of use at the physical level. But ease of use through denormalisation is achieved at the cost of data quality, which can only be effectively addressed by normalisation at some stage of data warehousing, given that most source systems don't provide high quality data.
In designing a BI system, you always end up with some sort of ODS, no matter how undesirable it is normally regarded in pure dimensional thinking. In reality, most legacy OLTP systems are not well-normalised systems. I think that's why Master Data Management (MDM) has become increasingly popular as a part of BI strategy. The question is what model these MDM systems are based on. Can we build our own MDM system if the model is standardised, like dimensional modeling or data vault?
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Hi Hang
Hi Hang
FYI, I'm a Dutch DV expert and president of the Dutch DV user group.
To give my take on MDM: most MDM systems lack a good memory of the past, which means that DV can work for MDM systems as well. I know that some have built MDM systems with methods that are related to Data Vault. Especially when you start doing bitemporal data, DV will be a lot easier than 3NF or dimensional modeling. My take is that Data Vault is a good foundation for MDM, and can also be used to create fully historic MDM solutions, or to keep the history for MDM solutions that do not keep any history themselves.
To Thomas Kejser,
Extreme scalability has a lot of factors, and one of them is knowing the underlying system and hardware. I'm quite sure that big SQL Server DV implementations have their own challenges, and I know that at the extreme scale SQL Server is NOT a join champion in any edition (PDW, Fast Track...). There are, however, DBMSes that do allow for superior join performance, like MonetDB/Vectorwise.
I also want to caution against extending dimensional modelling for full temporalisation. Bi-temporalisation and multi-temporalisation in particular do not work well with dimensional modelling; for this aspect Data Vault does far better. When time becomes a complication, model normalisation and generic temporalisation (and hence join performance) become a must, regardless of the kind of 'modelling techniques' (actually model-to-model transformations) you'd like to perform.
dm_unseen- Posts : 4
Join date : 2009-02-03
Location : Netherlands
Assumptions and data structures
Hi Dan
Good to see you are still alive in this forum. I look forward to hearing your views and discussing with you.
First of all, let me start by establishing my credentials. I design data warehouses for everything from retailers, manufacturers and telcos to, now, investment banks. I have over 15 years of experience in the database industry and am currently leading the design of the largest SQL Server installations in the world. One of my specialities is the movement of data between sources, "EDW" and marts - and this typically means that my systems will be moving several TB of data every day. I too work with major financial institutions (which I, for obvious confidentiality reasons, cannot name). I think I have even run across a few of your designs out there in the field. I hope we can also agree from our field experience that just because customers are doing something doesn't mean it is a good idea. Hence, the fact that someone built something using your techniques does not mean it is the best way to do things.
You made a lot of points, so I will go over them one at a time. Let me just reiterate that data models should, in my opinion, be built using proper assumptions about database engines and their capabilities. Let me also state where I agree with you: normalization the way you use it for satellites in your modeling technique is actually a good idea in many cases. This is something I am going to elaborate on in my blog as I advance the argument there. I will be tracking your blog, so it will be interesting to ping back and forth.
To your points:
Dan: "When was the last time you needed to join "all" the tables in a data warehouse (for any data modeling technique)? If this is the case, then sure, the Data Vault has more joins, and therefore if you query the entire model, yes, it will perform worse than other models."
Joining "All" tables is really not needed to get the benefit - and I hope I did not come across as putting up such a far fetched straw man. The problem can be boiled down to joining TWO tables. Take TPCH as an example. In that schema, they split LINEITEM and ORDERS - this means that queries grouping on "dimensions" from both will incur an expensive join. A more efficient design would be to store the two tables pre-joined. Of course, you could co-locate the two tables, but this just brings me to:
Dan: "However, due to the normalization - AND the query optimization techniques such as table elimination, AND the co-location use of elements such as join indexes, the join performance is in fact FASTER than other modeling techniques."
Co-location will allow you to join two large tables together at a reasonable speed (using a merge strategy). However, it only works if you are not seeking into those tables (only if you scan or bitmap scan). If you take a data vault model with two large hubs, linked by a large link, the co-location is often not feasible, because you can only store data sorted one way (and bitmaps won't help you either). In other words: you can only co-locate if there is a common column to co-locate ON in the model - interestingly, the data vault's super-normalization guarantees that NOT to be the case.
JOIN indexes is "cheating" you are giving away large storage to get more performance. Granted: Trading storage for performance is normally a good idea- for small tables or aggregates. But if you have large table structures, and the only reason you implement join indexes is to make them reasonably fast to query, then I ask you: why did you bother modeling them like that in the first place?
Dan: "Why? Because the row sets are normalized, because the tables can be "queried in parallel due to the normalization". Co-location is KEY as is data spread or distribution in an MPP environment."
I have no clue which database engine you are referring to that creates more parallelism by having tables "queried in parallel due to normalization". As you may recall, databases implement read locks that allow multiple threads to touch the same table at the same time - this includes threads that are part of the same parallel query. To give you an example, SQL Server will happily scan a single table at 36GB/sec (and after that, you typically run out of IOPS on an SMP system).
Dan: "how many rows can I fit in a disk block before I push I/O higher on an exponential scale? The block-size of different database systems vary. In Teradata it used to be 64k, I think now they've boosted that to 256kb. Oracle can go to 64kb only on certain platforms, and so on.... Any RDBMS system on Microsoft is locked in to 8k physical, unless the physical disk is reformatted to be 64k, but even then, the Microsoft OS must work with 8k block sizes."
A few factual corrections: Microsoft Windows allows 64K blocks, but that is just the allocation unit. You can request any I/O size during scans (for example, SQL Server will request 512K blocks). Oracle, I believe, can go all the way to 1MB block sizes on certain platforms. Not that it matters, because there is no point where I/O goes exponential in a star schema. You scan a table: the IOPS are linear with the table size. Of course, if you insist on normalizing, then all bets are off, since joins - especially the loop joins that the from/to dates in the vault model force - have log n * m complexity, and if you lack the right indexes you get your exponential time (which, interestingly, only happens on normalized models).
Dan: "you reach a sweet spot, where performance is optimal - great, that means the I/O's required to get the rows out are the least amount of I/O's needed to access the data, and are balanced with the query. HOWEVER, you are still left with a SEQUENTIAL SCAN of a single table - and now, you are relying on the "smartness" of the disk and the RDBMS to know that you want a table scan, and to read-ahead the blocks in to RAM to gather the information. "
As opposed to what? Trying to cram the join into memory or relying on random disk access? Let me just reiterate: having an index-only access strategy on a large table is incredibly expensive, both in disk space and performance (since you will be thrashing the RAM as you access each index). Also, please note that fast disk scans are exactly what the specialized DW database engines like Vertica, Greenplum, PDW and Netezza are made to excel at.
Dan: "Continue to add width to the row (doesn't necessarily mean adding new columns). You will find that once you pass this sweet spot in width (denormalize by removing another join, and combining another tables' columns) that the performance will begin to degrade on an exponential scale. Why is that? Because the I/O is STILL stuck on sequential access, and it is still forced to TABLE SCAN, now, there are less rows per block - and something else starts to happen... Rows start to "chain over/across" blocks, which doubles, then triples, then quadruples the I/O - and for gosh sakes, if you hit fragmentation - watch out...."
I am curious which database vendor you have observed this behavior for? I am deeply familiar with the code base of SQL Server - and there is no such code or behavior in there. On the note of having wide rows: it is correct that this is an issue a star schema model must struggle with, since surrogate keys are de-normalized into the star schema. But let us do a quick back-of-the-envelope calculation: a dimension key is typically 4-8 bytes, and so is a measure (some decimal, float or int). Let us take the worst case and say 8 bytes per column. In our smallest case, SQL Server, a page is 8K - so this leaves you room for roughly 1000 columns in the fact table. So can we agree this is not such a big deal?
On the other note about row width: I think most DW designers acknowledge that there is an overhead of "paying" for the scan of the columns you are not using. However, this is what column stores are all about: they allow you to touch only the columns you need and only incur the IOPS for those.
Again, I also have to say: as opposed to what? Index access to huge tables?
Dan: "This is where Normalization (many rows per block) combined with proper parallel query, and the right co-location can far surpass any denormalized model. Because you can achieve a BALANCE with the right layout and distribution of the data. It also means that the scalability is limitless (in keeping with the hardware changes). These are the laws of MPP, divide and conquer. "
I agree that MPP is divide and conquer. But I think we disagree on how to divide. "Divide" in MPP is typically done by hash distributing the large tables across the nodes and co-locating the small tables: by replicating them, or by applying the same hash if the tables share columns (see the above note on co-location). So you are not dividing the tables vertically, you are doing it horizontally (distributing rows, not columns). This is what leads to balance - and it just so happens that star schemas, with their small dimension tables and large facts, lend themselves very well to co-locating the dimensions (by replicating them) and evenly distributing the facts.
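As a sketch of what I mean - the DDL below uses PDW-style distribution syntax, which varies by engine, so treat it as illustrative:

-- Large fact table: hash-distributed on a high-cardinality key so rows spread evenly across nodes.
CREATE TABLE FACT_SALES
(
    SALE_DATE_KEY INT,
    CUSTOMER_KEY  INT,
    PRODUCT_KEY   INT,
    SALES_AMOUNT  DECIMAL(18,2)
)
WITH (DISTRIBUTION = HASH(CUSTOMER_KEY));

-- Small dimension: replicated to every node, so its join to the fact never crosses nodes.
CREATE TABLE DIM_PRODUCT
(
    PRODUCT_KEY  INT,
    PRODUCT_NAME VARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);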
I believe I have already answered the many rows/block point with the column store (and if you should find yourself in a DB engine that does not have those, feel free to vertically partition the fact table in the star schema or add a narrow, scan friendly index).
Dan: "Why can't data warehouses be built and developed the same way twice, three times, etc? If this is the case, why hasn't the industry matured and optimized the building of data warehouse systems by automating 80% or more of the mundane tasks? If this is the case, why are businesses still challenging the value and cost of implementing BI solutions? If this is the case, why does it take IT months or even years to "cross-integrate" new sources, new data sets, and provide new answers?"
I certainly hope you appreciate the complexity of even getting ACCESS to sources (and having someone explain WHAT the source is), not to mention the large business questions about how you want to QUERY the data. I think there is also wide agreement that this effort, not the modeling, represents the 80% of the work needed to build a warehouse. I think everyone in the IT industry has beaten their fists on the table saying: why can't building IT systems be like building bridges? Books have been written on why this is the case - I don't think the data warehouse is a special case here.
When it comes to the mundane tasks of modeling, this can be automated with meta-frameworks (interestingly, independent of which model you use) - something that several companies now specialize in (I personally built one that automates the creation of star schemas). Of course, if you are to use a meta-modeling tool, you must find yourself a business that appreciates that a well-written meta-model tool, operated by a few experts, beats the alternative strategy: outsourcing to an army of "ETL programmers" and "BI developers".
Dan: "because the RDBMS engines today don't fully handle complete temporality of data sets."
The temporal database is one of those dreams we have had in IT for many years. There are good reasons it has not been done in a performant way (there are inherent complexities in this problem that do not lend themselves well to good solutions). I happen to agree that the satellite construct ("flaking off the history") is the best bet so far.
Dan: As for your claim "hard to bring in to a dimensional model" That simply is flat out false and un-true. I can (and have) built and loaded Star Schemas from a Data Vault within 45 minutes of receiving a 2 page requirements document filled out by business users / customers. So I'm not sure where you're getting your information about it being difficult to load or build star schemas.
This comes back to the point about co-location. Joining large links and hubs together is very expensive - and so is creating de-normalized type 2 dimensions on top of satellites, joined over hubs, joined over links, joined over more hubs/satellites/links. Keeping track of the aggregated history between them is a proper nightmare in a big, "parking lot sized" data vault model.
Dan: "In fact, in Teradata I can build VIRTUAL star schemas (views that look like dimensions and facts), and for these cases, I never have to land the data physically anywhere after the Data Vault. In SQLServer, I can push the data from the Data Vault direct to an OLAP cube without resorting to materialized views. "
Building such views only defers the cost of the join to the cube or to query time. Moving the problem around like that does not make it go away - but you are now paying the join cost several times (every query) instead of only once (during the ETL). This is one of the points I will make in my blog in more detail; suffice it to say that it is a big waste of hardware resources to implement non-materialized data structures that are joined at query time.
Dan: Ok - please justify this. Was this on Teradata? Oracle? DB2 UDB? Netezza? DatAllegro? SQLServer? Paraccel? or something else? What was the underlying hardware? How much money did the customer spend on making the infrastructure viable?
My first run was on a Unisys R7600 using SQL Server 2008. I have since done this on an HP DL980, using SQL Server 2008 R2. The cost of such a system is in the 200,000 USD range. MPP is not really required to get this sort of speed.
thomaskejser- Posts : 12
Join date : 2011-08-14
Not here to debate performance.
Hi Thomas,
Before I get going... the first thing I'd like to say is: I'm not here to debate performance. The problem we get into when debating performance is that it is all just hearsay unless we put the two models to the test, on the same infrastructure, with the same data sources loaded in full, and tuned properly. Even then, it's like comparing apples and oranges. So the best we can hope to get out of a "head-to-head" performance test will be approximations. Furthermore, we would need sufficient data to overrun database in-memory caches and in-memory disk caches (I/O caches and read-aheads) in order to make the test valid. Debating performance with you without hard and fast head-to-head numbers on the same infrastructure, on the same machine, would be like comparing steel to sand - two completely different things - and of course, in the end it's all just hot air without justification.
I do realize that we are both speaking from experience, great, wonderful... but just because we have experience doesn't mean you and I have seen the same systems, or even designed the same systems. Furthermore, if you've not been trained as a certified Data Vault modeler, then you have not had the full experience - and are missing some of the nuances or finer points of the data modeling techniques and standards that are available.
What I will say is this: Yes, performance is important, yes - the size of the "live or used" data set matters, yes, parallelism, partitioning, and the ability to co-locate data sets matter. Design matters.
One point that is very difficult to argue over is end-user access. Why? Because the Data Vault is built to be a back-end data warehouse that feeds star schemas, cubes, and other modeled data marts on the front end. With this in mind, the Data Vault is tuned for high speed loading AND high speed querying (but, of course, it is NOT tuned for ad-hoc querying) - it is tuned for specific processes (consistent, repeatable processes that use the same indexes and the same data sets over and over again) to produce the front ends.
In tuning: you can tune for ad-hoc access, or you can tune for processing speed - it is really difficult to tune the architecture (physical design) for both, unless you have in-memory constructs. This is why we see the rise of "in-memory" processing from traditional RDBMS vendors, and the rise of in-memory processing from appliance vendors - as well as query vendors.
Now, regarding "large Hubs and Large Links"... that depends on what you mean by "large?" and how you vertically partition and horizontally partition can make all the difference in the world, especially when the implementation is built on a system that is truly MPP enabled. Back to the point: the average "row width" of a Hub is approximately 28 to 100 bytes (depends on if you use unicode, and if you have a VARCHAR as a business key). With that, assuming a 64 kb block size (you said could be attained, ok... can you point me to the documentation that provides this, I'd love to look it up?).... means approximately 640 rows per block for the Hub. Especially if you set the fill-factor to 100% (% free to zero). It's 100% inserts which lead to zero fragmentation. The Links, well, the links average size (assuming 5 foreign keys @ 8 bytes each - which most links have 2 or 3) + load date (8 bytes) + sequence (8 bytes) + record source (varchar(12)), you are looking at a row width of approximately 70 bytes. Which again, for 64kb block size means approximately: 914 rows per block. Also fill factor of 100% (% free set to zero), and it too is 100% inserts, which lead to zero fragmentation.
One more note about SQLServer 2008 R2, which is really about the Windows OS underneath... SQLServer still uses PageFile.sys (because it uses the Windows OS), and the Windows OS (not even the 64-bit OS) does not currently provide fully parallel access to PageFile.sys at the core level - Windows is incapable of "opening multiple threads at exactly the same time" to perform I/O against the page file. This means the RAM swapping that the Windows OS does is still capable of becoming blocked I/O, with a single process waiting for the page file. This is NOT true of Linux, Unix, and mainframe systems. This one small detail can make a huge difference in large systems that desire parallelism. In other words, Windows and SQLServer are NOT truly pre-emptive, even in their RAM swap algorithms (which means there can be tremendous bottlenecks for parallel threaded code, AS WELL as for data sets that are being run in parallel processes).
As we both know, fragmentation is the enemy of performance on any database. Devices like Netezza add built-in compression on top of this, but because of the way Hubs and Links are built, column compression (value compression) can only occur for the load dates and record sources. Even then, that's approximately a 50% compression rate, increasing the number of rows per block (doubling it).
Anyhow, run these numbers with co-location AND parallel query (true parallelism - pre-emptive tasks), which SQLServer (because of the OS) can only execute in non-pre-emptive mode, meaning some tasks can BLOCK the I/O for longer than they should. In truly pre-emptive parallelism, running multiple parallel tasks (parallel query threads) against different data sets (or computational nodes in an MPP world) provides you with near-linear performance gains, compared to singular tasks (or table scans) against denormalized data sets. This has been proven over and over again as the only way to scale beyond a certain point. If anyone wants to know more, there are hundreds of books, articles and proofs available. You can start with HPC (high performance computing) https://computing.llnl.gov/tutorials/parallel_comp/ as an example of a high level discussion.
Now, a join index in the Data Vault is not cheating... In fact, it IS the Link table: the Link (many-to-many) relationship is the only way to join two different structures together. The Link table is stored in its own right as a part of the design. And no, you can't co-locate both Hubs (part of the design is to achieve balanced data across the MPP nodes), but you can co-locate one Hub (the largest Hub) with its corresponding index (both of which are very narrow, and can fit many rows per block).
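In query terms (illustrative names), the Link simply is the join path between the two Hubs:

-- Resolve the customer/order relationship through the Link table.
SELECT hc.CUSTOMER_BK,
       ho.ORDER_BK
FROM   LINK_CUSTOMER_ORDER l
JOIN   HUB_CUSTOMER hc ON hc.CUSTOMER_SID = l.CUSTOMER_SID
JOIN   HUB_ORDER    ho ON ho.ORDER_SID    = l.ORDER_SID;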
Anyhow, I teach all of this implementation, parallelism, and query & more, in my performance and tuning classes - as well as my implementation classes. Those of you interested in the "science" behind the scalability, and the implementation of the best practices can learn more at: learnDataVault (dot) com - I'll be releasing on-line classes between now and the end of the year. Drop me an email if you are interested.
With normalization the point is this: truly pre-emptive parallel threads executing against large numbers of rows, with a SMALL I/O count (because they can retrieve many rows per block), can execute faster than a single table scan against a wide table, especially when the width of that table exceeds the optimal rows-per-block and begins to double, triple, and quadruple the I/O counts. This type of model (table scans) is not sustainable in large data set environments - this is also why each Netezza box has an upper terabyte limit, and why you have to string multiple Netezza boxes together to achieve and manage growth.
Anyhow, the proof is out there regarding parallelism, partitioning, and row-size...
Regarding the width of rows, it's mathematical in nature - it has nothing to do with the code base. Again, try this experiment on multiple database engines: run a single table scan against 1k, 2k, 4k, and 8k row sizes (consecutively). Build many rows (with compression off, no parallel query, and no partitioning). What you find is this: for 64kb block sizes, you can put 64, 32, 16, and 8 rows per block respectively. Now load the tables with 5 million, 50 million, 200 million, and 500 million rows. Test the performance against each by querying ALL columns and ALL rows (it is impossible to tune with just a simple select count(*)). I go through this test case and discuss the nature of testing in one of my performance and tuning on-line classes; I outline the entire test, the testing procedures, and the results - there is too much to explain it all in this one post. But the end result is this: the wider the row sets, the slower the scan becomes - it will begin to push the I/O count up (double, then triple, then quadruple it) - so there is a sweet spot for denormalization that the data modeler should not exceed. The sweet spot will change depending on the block size, and on the hardware and database platform you are on. You can find out more by contacting me: learnDataVault (dot) com (slash) contact-us
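One leg of that test, in bare-bones form (the padding width is whatever row size you are testing; names are illustrative):

-- Test table with the row padded to roughly 2 KB, so only a handful of rows fit per block.
CREATE TABLE SCAN_TEST_2K (
    ID     INTEGER    NOT NULL PRIMARY KEY,
    FILLER CHAR(2000) NOT NULL       -- fixed-width padding that controls the row size
);

-- The measurement query must touch every column and every row, not just COUNT(*).
SELECT MAX(ID), MAX(FILLER)
FROM   SCAN_TEST_2K;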
Oh yes, your claim of "trying to jam the joins in to an in-memory access"... I never made such statements; either the purpose of a Link table is not understood, or the lack of indexing required in the Data Vault (other than the primary and foreign keys) is not understood. One more point... the Data Vault is tuned for processing speed (loading and mechanical querying - meaning NO ad hoc)... Again, we do not need to "jam" the joins into an all-in-memory process; quite the contrary, a true MPP system (like Teradata) will route the joins to the right execution node. DB2 EEE has its own method; Netezza handles joins at the firmware level and uses flat, wide tables underneath (building on the COBOL notions of data access, but adding compression and hash-based data access algorithms to compute where data lives on disk). Other databases do things differently; even SQLServer clustered machines and its new data warehouse edition with partitioning do things differently (especially after its purchase of DatAllegro, which was PostgreSQL based).
One last point I wish to make: just because you've seen Data Vault implementations out there doesn't mean I've been involved, doesn't mean they were created correctly (especially if the model has not been audited or verified by me), and doesn't mean they can scale (that depends on the hardware and database system as well). Just like with any data model, people can and have "gotten it wrong", especially when they've built it without proper training or proper review. I've seen the same things with 3NF, Star Schema, and just about every other data modeling technique out there - so it's not fair for you to make comparisons to "models you've seen" without some basis of justification: letting me know who, when, and where it was built, and checking whether it was built by Certified Data Vault Consultants or reviewed by me.
Before I get going... the first thing I'd like to say is: I'm not here to debate performance. The problem we get in to when debating performance is it is all just here-say unless we put the two models to the test, on the same infrastructure with the same "data sources" loaded in complete, and tuned properly. Even then, it's like comparing apples and oranges. So, the best we can hope to get out of a "head-to-head" performance test will be approximations. Furthermore, we would need sufficient data to overrun database in-memory caches, and in-memory disk caches (I/O caches and read-aheads) in order to make the test valid. Debating performance with you without hard-fast head to head numbers on the same infrastructure on the same machine would be like comparing steel to sand, two completely different things - and of course - in the end it's all just hot air without justification.
I do realize that we are both speaking from experience, great, wonderful... but just because we have experience doesn't mean you and I have seen the same systems, or even designed the same systems. Furthermore, if you've not been trained as a certified Data Vault modeler, then you have not had the full experience - and are missing some of the nuances or finer points of the data modeling techniques and standards that are available.
What I will say is this: Yes, performance is important, yes - the size of the "live or used" data set matters, yes, parallelism, partitioning, and the ability to co-locate data sets matter. Design matters.
One point that is very difficult to argue over is end-user access. Why? Because the Data Vault is built to be a back-end data warehouse, that feeds star schemas, cubes, and other data modeled data marts on the front end. With this in mind, the Data Vault is tuned for high speed loading, AND high speed querying (but, of course - it is NOT tuned for ad-hoc querying) - it is tuned for specific processes (consistent, repeatable processes that use the same indexes and the same data sets over and over again) to produce the front-ends.
In tuning: you can tune for ad-hoc access, or you can tune for processing speed - it is really difficult to tune the architecture (physical design) for both, unless you have in-memory constructs. This is why we see the rise of "in-memory" processing from traditional RDBMS vendors, and the rise of in-memory processing from appliance vendors - as well as query vendors.
Now, regarding "large Hubs and Large Links"... that depends on what you mean by "large?" and how you vertically partition and horizontally partition can make all the difference in the world, especially when the implementation is built on a system that is truly MPP enabled. Back to the point: the average "row width" of a Hub is approximately 28 to 100 bytes (depends on if you use unicode, and if you have a VARCHAR as a business key). With that, assuming a 64 kb block size (you said could be attained, ok... can you point me to the documentation that provides this, I'd love to look it up?).... means approximately 640 rows per block for the Hub. Especially if you set the fill-factor to 100% (% free to zero). It's 100% inserts which lead to zero fragmentation. The Links, well, the links average size (assuming 5 foreign keys @ 8 bytes each - which most links have 2 or 3) + load date (8 bytes) + sequence (8 bytes) + record source (varchar(12)), you are looking at a row width of approximately 70 bytes. Which again, for 64kb block size means approximately: 914 rows per block. Also fill factor of 100% (% free set to zero), and it too is 100% inserts, which lead to zero fragmentation.
One more note about SQLServer 2008 R2, is more about the Windows OS underneath... SQLServer still uses PageFile.sys - (because it uses the Windows OS)... Windows OS (not even 64 bit OS) is currently fully-parallel access to PageFile.sys, at the core level - Windows is incapable of "opening multiple threads at exactly the same time" to perform I/O against the PageFile. which means the RAM swapping that Windows OS does is still capable of becoming blocked I/O on a single process waiting for PageFile. This is NOT true of Linux, Unix, and Mainframe systems. This one small detail can make a huge difference in large systems that desire parallelism. In other words, Windows and SQLServer are NOT truly pre-emptive, even in their RAM swap algorithms (which means there can be tremendous bottlenecks for parallel threaded code, AS WELL as data sets that are being run in parallel processes).
Which as we both know, fragmentation is the enemy of performance on any database. Devices like Netezza add on top of this built in compression (BUT because of the way Hubs and Links are built), column compression (value compression) can only occur for the load dates and record sources. But, even then, that's approximately a 50% compression rate, increasing the number of rows per block (doubling it).
Anyhow, once you run these numbers with co-location, AND parallel query (true parallelism - pre-emptive tasks) which SQLServer (because of the OS) can only execute in non-pre-emptive mode, meaning some tasks can BLOCK the I/O for longer than they should... Anyhow in truly pre-emptive parallelism, the ideas of running multiple parallel tasks (parallel query threads) against different data sets (or computational nodes in an MPP world) would provide you with near-linear performance gains, compared to singular tasks (or table scans) of denormalized data sets. This has been proven, over and over and over again as the only way to scale beyond a certain point. If anyone want's to know more, there are hundreds of books, articles and proofs available. You can start with HPC (high performance computing) https://computing.llnl.gov/tutorials/parallel_comp/ as an example of a high level discussion.
Now, a join index in the Data Vault is not cheating... In fact, it IS the Link table, the Link table (many to many) relationship is the only way to join two different structures together. The Link table is "stored in it's own right" as a part of the design. And no, you can't co-locate both Hubs (that's part of the design, is to achieve balanced data across the MPP nodes), but you can co-locate one Hub (the largest Hub) with its' corresponding index (both of which are very very narrow, and can fit many rows per block).
Anyhow, I teach all of this implementation, parallelism, and query & more, in my performance and tuning classes - as well as my implementation classes. Those of you interested in the "science" behind the scalability, and the implementation of the best practices can learn more at: learnDataVault (dot) com - I'll be releasing on-line classes between now and the end of the year. Drop me an email if you are interested.
With normalization the point is this: truly pre-emptive parallel threads executing against large numbers of rows, with a SMALL I/O count (because they can retrieve large numbers of rows because of the rows per block), can execute faster than a single table scan against a wide table, especially when the width of that table exceeds the optimal rows-per-block, and begins to double, triple, and quadruple the I/O counts. This type of model (table scans) are not sustainable in large data set environments - this is also why each Netezza box has an upper terabyte limit, and why you have to string multiple Netezza boxes together to achieve and manage growth.
Anyhow, the proof is out there regarding parallelism, partitioning, and row-size...
Regarding width of rows, it's mathematical in nature - it has nothing to do with the code base. Again, try this experiment on multiple database engines: run a single table scan against 1k, 2k, 4k, 8k row sizes (consecutively). Built many many rows (with compression off, no parallel query, and no partitioning). What you find is this: for 64kb block sizes, you can put 64 rows, 32 rows, 16 rows, and 8 rows per block respectively. Now load the tables with 5 million, 50 million, 200 million, and 500 million rows. Test the performance against each by querying ALL columns, and ALL rows (it is impossible to tune with just a simple select count(*)) Anyhow, I go through this test case and discuss the nature of testing in one of my performance and tuning on-line classes. I outline the entire test, the testing procedures, and discuss the results. There is too much to this test case to explain it all in this one test. But, the end result is this: the wider the row sets, the slower the scan will become - it will begin to push the I/O count up (double, then triple, then quadruple it) - there is then a sweet spot to denormalization that the data modeler should not exceed. The sweet spot will change depending on the block size, and the hardware and database platform you are on. You can find out more by contacting me: learnDataVault (dot) com (slash) contact-us
Oh yes, your claim that I am "trying to jam the joins in to an in-memory access"... I never made such statements; either the purpose of a Link table is not understood, or the lack of indexing required in the Data Vault (other than the primary and foreign keys) is not understood. One more point: the Data Vault is tuned for processing speed (loading and mechanical querying - meaning NO AD HOC). Again, we do not need to "jam" the joins into an all-in-memory process; quite the contrary, a true MPP system (like Teradata) will route the joins to the right execution node. DB2 EEE has its own method; Netezza handles joins at the firmware level and uses flat-wide tables underneath (building on the COBOL notions of data access, but adding compression and hash-based data access algorithms to compute where data lives on disk). Other databases do things differently - even SQLServer clustered machines and its new Data Warehouse edition with partitioning do things differently (especially after its purchase of DATAllegro, which was PostGreSQL based).
One last point I wish to make is this: just because you've seen Data Vault implementations out there doesn't mean I've been involved, doesn't mean the model was created correctly (especially if it has not been audited or verified by me), and doesn't mean it can scale (that depends on the hardware and database system as well). Just like any data model, people can and have "gotten it wrong," especially when they've built it without proper training or proper review. I've seen the same with 3NF, Star Schema, and just about every other data modeling technique out there - so it's not fair to make comparisons to "models you've seen" without some basis of justification: letting me know who built it, when and where, and whether it was built by Certified Data Vault Consultants or reviewed by me.
dlinstedt- Posts : 7
Join date : 2010-03-15
Assumptions - still getting it wrong
Hi Dan
First of all, let me just direct you to the Windows Internals book by Mark Russinovich. There you will read about the multi-threaded nature of the I/O system in Windows, and how it has nothing to do with pagefile.sys. You really shouldn't give advice about Windows, and how databases run on top of this platform, without the necessary background - which, I am sorry to say, you display a lack of. The section you might want to read up on is I/O completion ports and how they work. Incidentally, you will also find information there about how block access works in Windows (to get large block scans, for example).
Secondly, while I agree that performance is only one aspect of a data model, it does matter, since it determines the eventual success of the installation (no performance = too long a batch window and a poor user experience). And it seems you are interested in making this point too, since you make statements about the science of tuning and the mathematics of databases.
Third: let us define "large" tables as tables that don't fit in memory. As you rightly point out, pure in-memory engines do not form a basis for a good comparison of any model (since they don't incur any of the unique challenges that big data sizes create). So, consider the join of three large tables (let us say two Hubs and a Link). In order to perform this join on a large data set (say, when you load a big fact table or respond to a query at low granularity), you have several options. Let us look at some common ones:
1) Scan one table, loop into the next two
2) Scan one, hash the other two
3) Scan one, co-located merge into the other (assumes the same ordering in the physical structure)
4) Denormalize the co-location column and merge
5) Use bitmapped or hash-based disk access
Now, all options that do hashes in memory are very expensive, because it is expensive to "spill" a hash - and even if the hashes are large and DO fit in memory, you get thrashing. So, hashing is out for the three-big-table join.
Going after loop and merge strategies: loops are expensive, since they are random, small-block IOPS (expensive, slow, pointer walking, lg n * m). Bitmaps and hashes speed this up, but only to a fraction of the speed achievable with a column store - and you still have to pay the compare cost on the CPU cores and in memory (which you would not have to pay if you were de-normalized).
Merges require that all tables are sorted the same way. This either means large memory usage (incurring the thrashing effect under concurrency) or that you denormalize the merge column into all the joined tables to get proper co-location. Unfortunately, this only works for the simple cases.
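To make the options above concrete, here is a hedged T-SQL sketch (reusing the hypothetical hub/link names from the sketch earlier in the thread) of steering the optimizer toward a loop, hash or merge strategy with query hints; which plan is actually sensible depends entirely on data sizes and available memory:

-- 1) Scan one table, loop into the others (random, small lookups)
SELECT hc.customer_bk, ho.order_bk
FROM HUB_CUSTOMER hc
JOIN LNK_CUSTOMER_ORDER l ON l.customer_sqn = hc.customer_sqn
JOIN HUB_ORDER ho ON ho.order_sqn = l.order_sqn
OPTION (LOOP JOIN);

-- 2) Scan one, hash the others (spills to tempdb if the hash tables
--    do not fit in the memory grant)
SELECT hc.customer_bk, ho.order_bk
FROM HUB_CUSTOMER hc
JOIN LNK_CUSTOMER_ORDER l ON l.customer_sqn = hc.customer_sqn
JOIN HUB_ORDER ho ON ho.order_sqn = l.order_sqn
OPTION (HASH JOIN);

-- 3) Merge join, which requires both inputs ordered on the join key
SELECT hc.customer_bk, ho.order_bk
FROM HUB_CUSTOMER hc
JOIN LNK_CUSTOMER_ORDER l ON l.customer_sqn = hc.customer_sqn
JOIN HUB_ORDER ho ON ho.order_sqn = l.order_sqn
OPTION (MERGE JOIN);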
Fifth: If you are extrapolating from "wider rows incur more IOPS" into guidance for a data model, I am sorry to say you are seriously overstretching your assumptions. Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s). And even if you don't have column stores in whatever database engine you target, you can use aggregate (materialized) views to achieve the same effect of bringing down row width for the queries that need fast scans (since it seems we have now established that spending even more disk in an already expensive system is not cheating). Again, Data Vault does not address this issue; you simply trade fast disk access for more, slower joins. It does not follow from "wide rows are slower" that normalizing is a good idea.
Oh, and DATAllegro was not spawned from PostGreSQL - it came out of Ingres. Please get your facts right before you go out there preaching your data model.
thomaskejser- Posts : 12
Join date : 2011-08-14
Let's not get personal here.
First, I don't approve of using the forum as a place for public attacks. If you really want to help me and the rest of the readers understand your points, then kindly provide us with on-line documentation (as I requested) that shows where I have it wrong. Like all humans, I make mistakes, and for that I am sorry. However, please don't cross the line and hit below the belt - and keep in mind that I am attempting to have this discussion with you even though you have not taken the time to become a certified Data Vault Modeler.
Performance matters, yes it always matters - but again, until you have side-by-side facts (which I keep asking you to produce), you cannot make the claims you are making either...
For the data modeling aspect it's quite simple:
Take a Dimension, strip it down to a type 1, remove all other elements except the Business key and the surrogate sequence. Do this for one more dimension. Then, build a factless fact table with these two surrogate dimension keys. What do you have? Two Hubs and a Link.
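As an illustration only (a sketch using the same hypothetical names as the join example earlier in the thread, not a reference design):

-- A type 1 dimension stripped down to its business key and surrogate
-- sequence is structurally a Hub...
CREATE TABLE HUB_CUSTOMER (
    customer_sqn INTEGER     NOT NULL PRIMARY KEY,  -- surrogate sequence
    customer_bk  VARCHAR(30) NOT NULL UNIQUE        -- business key
);

CREATE TABLE HUB_ORDER (
    order_sqn INTEGER     NOT NULL PRIMARY KEY,
    order_bk  VARCHAR(30) NOT NULL UNIQUE
);

-- ...and a factless fact table holding only the two surrogate keys is
-- structurally a Link.
CREATE TABLE LNK_CUSTOMER_ORDER (
    customer_sqn INTEGER NOT NULL REFERENCES HUB_CUSTOMER (customer_sqn),
    order_sqn    INTEGER NOT NULL REFERENCES HUB_ORDER (order_sqn),
    PRIMARY KEY (customer_sqn, order_sqn)
);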
Now if I am to understand you correctly, your claims (against the Data Vault) are that a) it doesn't perform because of the joins, b) that it can't scale, c) that it is NOT better than one "big denormalized table"...
So to summarize (and put this quite succinctly): if what is under the covers of the Data Vault model is in fact a hybrid approach that uses PARTS of star-schema design, then to knock the Data Vault model is by default to knock the Star Schema design... Is this correct? If it doesn't perform because of the joins, then that is a problem that will carry through to ALL star schemas (given the design parameters set above). Generalizations get us ALL into trouble, and I am sorry if I generalized earlier... again, I defer the performance discussion until we have head-to-head numbers.
First of all, let me just direct you to the Windows Internals book by Mark Russinovich. There you will read about the multi-threaded nature of the I/O system in Windows, and how it has nothing to do with pagefile.sys. You really shouldn't give advice about Windows, and how databases run on top of this platform, without the necessary background - which, I am sorry to say, you display a lack of. The section you might want to read up on is I/O completion ports and how they work. Incidentally, you will also find information there about how block access works in Windows (to get large block scans, for example).
I am not referring to "block access". I am referring to the physical disk fetch that happens at the firmware and CPU levels of the machine. At this level, I/O BLOCKING occurs in sequential fashion, causing other "threads" to wait on the CPU until the disk has returned from its non-pre-emptive interrupt. It has nothing to do with the upper-level operating system, and everything to do with the I/O subsystem and the firmware at these levels (also known as the BIOS layers). Can you tell me definitively: is the Windows OS pre-emptive or non-pre-emptive in its multi-tasking?
Secondly, while I agree that performance is only one aspect of a data model, it does matter, since it determines the eventual success of the installation (no performance = too long a batch window and a poor user experience). And it seems you are interested in making this point too, since you make statements about the science of tuning and the mathematics of databases.
Of course performance matters - it's one of the major reasons for having this discussion - and you misinterpreted me. What I am saying (again) is that it's not worth debating here until and unless hard numbers from a head-to-head, approved case study can be produced and published. Otherwise, it's just hot air between us.
Third: let us define "large" tables as tables that don't fit in memory. As you rightly point out, pure in-memory engines do not form a basis for a good comparison of any model (since they don't incur any of the unique challenges that big data sizes create). So, consider the join of three large tables (let us say two Hubs and a Link). In order to perform this join on a large data set (say, when you load a big fact table or respond to a query at low granularity), you have several options. Let us look at some common ones.
Ok - again, how does this "case" differ from the star-schema design I suggested above? In one way: the case you suggest is querying a single table with a table scan. What happens when you join a type 2 dimension to a fact table? From a key perspective the query looks exactly the same as it does against the Hub and Link structures I suggested above. It just so happens that Hubs and Links can store more rows per block.
Going after loop and merge strategies: loops are expensive, since they are random, small-block IOPS (expensive, slow, pointer walking, lg n * m). Bitmaps and hashes speed this up, but only to a fraction of the speed achievable with a column store - and you still have to pay the compare cost on the CPU cores and in memory (which you would not have to pay if you were de-normalized).
Pointer walking? Small-block IOPS? I did not state this was the case... in fact I said quite the opposite (again, I am misinterpreted). I stated at the beginning that the block sizes were to be 64kb; regardless of data model, that means fairly large-block IOPS - so I am missing the argument you are trying to make here. Now, in regard to a column store... I had no idea we were even discussing a column store - I thought this was a discussion about Data Vault on SQLServer 2008 R2 (which, to my knowledge, is not a column store).
Merges require that all tables are sorted the same way. This either means large memory usage (incurring the thrashing effect under concurrency) or that you denormalize the merge column into all the joined tables to get proper co-location. Unfortunately, this only works for the simple cases.
If this only works in simple cases, then why is Teradata enjoying huge success in the marketplace with very real, very large, and very complex cases? It is a set (a tuple) of join columns (sequences) that are co-located on specific nodes, allowing the separate nodes to do the work before the data set is pulled onto the interconnect bus for the physical join. It does not flood RAM. And if this only worked for simple cases, why are database vendors playing "catch-up" to NoSQL, Hadoop, and other vendors (including column-based stores) where the performance of extreme normalization has taken over?
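For what it's worth, a hedged Teradata-flavoured sketch of that co-location idea (hypothetical tables again; the primary index determines which AMP a row hashes to):

-- Rows that share a primary-index value hash to the same AMP, so a
-- hub-to-link join on customer_sqn can be resolved locally on each node
-- before anything is moved across the interconnect.
CREATE TABLE HUB_CUSTOMER (
    customer_sqn INTEGER NOT NULL,
    customer_bk  VARCHAR(30) NOT NULL
) PRIMARY INDEX (customer_sqn);

CREATE TABLE LNK_CUSTOMER_ORDER (
    customer_sqn INTEGER NOT NULL,
    order_sqn    INTEGER NOT NULL
) PRIMARY INDEX (customer_sqn);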
Fifth: If you are extrapolating from "wider rows incur more IOPS" into guidance for a data model, I am sorry to say you are seriously overstretching your assumptions. Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s). And even if you don't have column stores in whatever database engine you target, you can use aggregate (materialized) views to achieve the same effect of bringing down row width for the queries that need fast scans (since it seems we have now established that spending even more disk in an already expensive system is not cheating). Again, Data Vault does not address this issue; you simply trade fast disk access for more, slower joins. It does not follow from "wide rows are slower" that normalizing is a good idea.
Wider rows that overflow the optimal row-to-block ratio WILL cause more IOPS, and they also cause CHAINED ROWS to occur. When a row becomes too wide to fit within a block, it "straddles" into the next block, causing 2 I/Os instead of one to read the same amount of information. Once the data has become this wide (or keeps getting wider), the I/Os do in fact double, then triple (if the data is widened again), and so on. It also leads to additional fragmentation and, at high volumes, massive performance issues.
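For what it's worth, a hedged Oracle-flavoured sketch of checking for that effect (the table name is hypothetical; other platforms expose different counters):

-- ANALYZE populates CHAIN_CNT, the number of chained or migrated rows.
ANALYZE TABLE wide_fact COMPUTE STATISTICS;

SELECT table_name, num_rows, chain_cnt
FROM   user_tables
WHERE  table_name = 'WIDE_FACT';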
Regarding aggregate or materialized views - they (at least in Oracle) form physical wide tables. In SQLServer I am not exactly sure what is built under the covers; I admit I haven't studied it. However, a materialized view in Oracle IS a single wide table that represents all the columns in the view (Oracle folks, feel free to correct me if I'm wrong). Bringing down the row "width" only happens when you put fewer columns in the materialized view than are in the table. I never said that "spending more disk in an already expensive system is not cheating". If I haven't said it yet, let me say it now: because of the normalization, I do not store copies of the data sets across multiple tables, thus reducing the total amount of data that actually needs to be stored. With the reduction in data size, less disk is required for storing the SAME data set than would be needed for a type-2 dimension.
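For example (a minimal Oracle-style sketch; the view and table names are made up, and the refresh options are deliberately simplified):

-- The materialized view is persisted as a physical table whose row
-- width is whatever the SELECT list makes it.
CREATE MATERIALIZED VIEW mv_order_summary
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT customer_sqn,
       COUNT(*)       AS order_count,
       SUM(order_amt) AS total_amt
FROM   orders_fact
GROUP BY customer_sqn;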
Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s).
I did not think our discussion included column stores. Column stores have their own scalability problems; that said, column stores don't give a hoot about which physical data model is thrown at them. They "vertically partition" every single column; there is no concept of a "table" in a column store, except in the logical sense for organization and query methods. Last I checked, SQLServer 2008 R2 was not a column store - and I thought we were discussing the merits of the Data Vault in a SQLServer and Windows environment?
Of course column stores change the game, just like Netezza with its flat-wide denormalized table structures. The reason flat-wide works on Netezza is (again) the firmware, along with the column-based compression that is automatically built in. But Netezza is NOT a column store.
One more point: if the discussion needs to be about column stores and performance, then why is it so bad to "normalize" data sets and join data together? Column stores do it - they take physical data models literally to 6th normal form. The Data Vault takes normalization one step further than dimensional modeling, but not as far as a column store...
So, if table scans were the answer to massive performance and massive scalability, I ask again: Why haven't ALL the vendors jumped on the Netezza bus? If Table scans are so great and powerful, how come Teradata enjoys so much success in MPP and normalized modeling?
Best wishes,
Dan Linstedt
dlinstedt- Posts : 7
Join date : 2010-03-15
Re: Data Vault v's Dimensional Model
Why haven't ALL the vendors jumped on the Netezza bus?
Netezza's architecture is patented. Their combination of an FPGA and disk is unique and integral to their performance advantage for certain operations (i.e. massive queries) over traditional MPP architectures.
If Table scans are so great and powerful, how come Teradata enjoys so much success in MPP and normalized modeling?
Netezza has been very successful over Teradata in heads-up comparisons.
Start from an example
Hi Dan
I don't mean to attack you as a person. But please understand that when you make a claim that is simply wrong (as with the Windows OS) - and present yourself as an authority on it - then it is not a blow below the belt to point out that you are not doing yourself a favour.
To answer your question about Windows: it can be BOTH a pre-emptive and a voluntary-yield system. The kernel itself is pre-emptive by default, but it is possible to build other scheduling mechanisms on top of it. This happens to be the case for SQL Server, which has a user-mode, voluntary-yield scheduler (called SQLOS). However, note that this really does not matter, as I/O operations are handled by a threading system dedicated to them. Because it takes micro- or milliseconds to issue IOPS (as compared to nanoseconds for CPU cycles), it is a good idea to either task-switch or pipeline them for best system throughput. This also means that you will want the most out of every I/O operation - which I think we agree on?
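For readers who want to see this for themselves, both the SQLOS schedulers and the outstanding I/O are visible through DMVs (a hedged sketch; only a few columns are shown):

-- One SQLOS scheduler per visible logical CPU, cooperatively scheduled
SELECT scheduler_id, cpu_id, status,
       current_tasks_count, runnable_tasks_count
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE';

-- I/O requests issued by the engine that have not yet completed
SELECT io_type, io_pending, io_pending_ms_ticks
FROM   sys.dm_io_pending_io_requests;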
I am not sure I follow your point on disk access fully. Are you claiming that sequential scans are unachievable, or somehow not good? Please elaborate here. To give you some more information about disk subsystems (which I work with a lot): there is nothing inherently blocking at the hardware level that makes sequential access a problem, so we can rule that out as an argument. Also, note that you don't need sequential access to optimize disk performance, you just need large block sizes. To give you some hard numbers on that: a 15K spindle will typically deliver within 80-90% of the theoretical max sequential rate (which is roughly 125MB/sec) if the block size is above 512K. Oracle, Netezza and SQL Server are all perfectly capable of generating this pattern, especially in a star schema. Compare this with a 64K block fetch for loop strategies: even if you squeeze a lot of rows into a block, you are still issuing random, low-block-size IOPS. Typically, a 15K full-stroked spindle delivers around 220 random IOPS, so those 64K blocks only land you around 14MB/sec (220 IOPS x 64KB ≈ 13.75MB/sec). You have to squeeze in a LOT of rows to get even close to the sequential scan.
With regards to the model I am arguing for: I am NOT arguing for a single, super-wide table. I am talking about a star schema. I hope we agree that column stores do NOT require all the data to be pre-joined; they can be applied to the fact table alone, while maintaining the benefits of row-based storage (fast singletons) at the dimension. Can we agree that in such a model, having fact tables with many hundreds of columns is not really an issue we need to concern ourselves with? If you disagree, then we must assume the Vault would suffer the same problem in the Link tables - yes?
With regards to your class and the online evidence: I think it would be fair to say that the burden of evidence rests on your shoulders, in this forum, for showing why the Data Vault as an EDW model is superior to dimensional models. This is, after all, the question this thread concerns itself with. Since we are both in this forum, I hope we can agree about the immense contribution that Kimball has made to warehouses, and that neither of our experiences match his - we are both the upstarts here, and until we can show otherwise, the star schema is still one of the most proven DW models out there, even at very high scale (I have worked on a 1PB star schema). Please also note that I am not fully agreeing with the star schema as the full solution, especially when it comes to history tracking (where we to a large extent agree).
The point I am trying to make here is that your assumptions about what is "wrong" with a star schema are not issues with the model technique, but with the implementations people do. I am also trying to point out that normalization, as described in your data vault, is not the answer to those problems.
With regards to Teradata - they have a wonderful database engine, but it too suffers under the same laws of computing that all the other engines do. The fact that they are successful (as are others) I don't think can be used to advance any argument either for or against the Data Vault technique. Take SAN as a parallel example: a successful storage solution that many companies swear by - but that does not make it optimal, or even very good, for a lot of scenarios. In fact, SAN implementations, and their insistence on "magic solutions", have seriously damaged many database projects by over-promising and under-delivering on storage (at a very high price point).
I think it would be useful to advance this argument by using a concrete modeling example and then going after those hard numbers you ask for. This I think would remove confusion about which specific technologies we are talking about and how to compare models. I would very much welcome such an argument.
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
I am currently leading a team on a large Teradata implementation for a Telco.
As far as performance goes, normalised structures are nowhere near as performant as a dimensional model. Inmon recognises this, which is why he recommends using Star Schema data marts for the access layer, and Teradata recognise this, which is why I am building summarised fact tables and conformed dimensions in a Teradata database, as directed by Teradata consultants.
Nothing beats dimensional modeling for performance of queries.