Data Vault vs. Dimensional Model
Again, no direct links - no answers to my questions.
1) A head-to-head test for performance is necessary, but are you going to give your time to do it? I too would welcome these numbers.
2) The Data Vault model is built on semi-normalization. As indicated, it is essentially a type 1 star schema dimension stripped down to the business key, plus a factless fact (see the sketch after this list). The star-join optimizer can and does work to the advantage of the Data Vault in much the same way it works with star schemas.
3) You have not shown definitively that more rows per block = slower performance. You stated that it means higher random I/O, but the only way it could be higher random I/O is if there is more fragmentation in the Data Vault. Please explain how and why that happens; it has NOT been my experience.
4) You insist on continuing to compare a front-end star schema delivery system with a back-end Data Vault EDW storage system. Why? I've said before: the Data Vault is for back-end data warehousing, NOT front-end delivery. I have always said: put the star schemas on the front end for delivery, be it virtual or not.
5) The Data Vault model solves other problems beyond performance, problems the star schema has suffered from when used as the data model for enterprise data warehousing.
6) There is always a trade-off between flexibility and performance (normalization versus denormalization). Hardware vendors are always pushing the boundaries of performance and capability. Before they began working at the level of internal algorithms, firmware, and hardware layouts (as Netezza does), star schema denormalization suffered greatly in performance; I've seen it and lived through it first-hand.
7) You argue the finer points of column-based databases, which are highly normalized (vertically partitioned down to the individual column in the physical implementation), yet you also argue that denormalization is the way forward. How are both possible at the same time? I would suggest that if column-based databases are to make it, they need to break the scalability barriers they face today. Sybase IQ has an upper limit (a large credit card company had to dump it because the Sybase engineering team couldn't make the product scale beyond that point), and the same goes for ParAccel, Vertica, and other column-based data stores. My point is this: either normalization is the "best way forward" or it's "not"... you can't have both and still scale into extremely large, petabyte-sized environments. If denormalization were the answer, Teradata and MPP would not be succeeding at volumes where many other vendors can't even compete today.
8) I am familiar with Netezza, and I know they have patents, but here's the point: when there's money to be made in a specific niche, companies always find a way to compete; they often file competing patents. I ask again: if denormalization is the way forward for all vendors, why haven't the rest of the vendors of the world jumped on the denormalization, extreme-scalability, and performance bus? Why was Netezza the only one? And yes, I know all about the FPGA. I toured Netezza engineering in Massachusetts many years ago with Foster Hinshaw, co-founder of the company.
9) Netezza has beaten Teradata in head-to-head comparisons. True, but ONLY up to a certain SIZE. There is an upper limit to how much denormalization can handle before it stops scaling. If you know of a comparison between Netezza and Teradata at or above 500 TB in which Netezza beats Teradata, great: I'd love to read about it. I've not seen one yet. If denormalization is the way forward, why aren't we all building star schemas on mainframes?
10) You still do not answer my questions by providing the references I seek; re-read the posts for the questions I have asked over and over again.
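Here is the sketch referred to in point 2. It is my own minimal illustration in T-SQL, with hypothetical table and column names (not anyone's published standard): the hub carries only the business key plus load metadata, descriptive attributes live in a satellite keyed by hub key and load date, and the link is effectively a factless fact connecting hub keys.

CREATE TABLE HubCustomer (
    CustomerHubKey  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    CustomerBusKey  VARCHAR(30) NOT NULL UNIQUE,     -- the business key, nothing else
    LoadDate        DATETIME    NOT NULL,
    RecordSource    VARCHAR(50) NOT NULL
);

CREATE TABLE SatCustomer (                           -- descriptive attributes, history by LoadDate
    CustomerHubKey  INT         NOT NULL REFERENCES HubCustomer(CustomerHubKey),
    LoadDate        DATETIME    NOT NULL,
    CustomerName    VARCHAR(100),
    City            VARCHAR(50),
    RecordSource    VARCHAR(50) NOT NULL,
    PRIMARY KEY (CustomerHubKey, LoadDate)
);

CREATE TABLE LinkCustomerOrder (                     -- the "factless fact": just the relationship
    LinkKey         INT IDENTITY(1,1) PRIMARY KEY,
    CustomerHubKey  INT NOT NULL REFERENCES HubCustomer(CustomerHubKey),
    OrderHubKey     INT NOT NULL,                    -- would reference a HubOrder, omitted here
    LoadDate        DATETIME NOT NULL,
    RecordSource    VARCHAR(50) NOT NULL
);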
How on earth did we get to the point where we are debating "vendor implementations" anyhow? This is not the point of this thread. We are and were supposed to be discussing the benefits and drawbacks of modeling techniques.
Please point out my assumptions about star-schema that you say are "wrong." I'd like to get back to the original claims/issues/thoughts which have everything to do with modeling, and nothing to do with vendors or hardware.
Please also note that I am not fully agreeing with the star schema as the full solution, especially when it comes to history tracking (where we to a large extent agree)
Then why are we having this discussion in the first place? This is exactly my original point for using the Data Vault.
In other words: use the Data Vault for back-end historical data tracking, and use star schemas, flat-wides, cubes, and whatever else is good for delivering the data to the business users... This has always been my original claim.
Cheers,
Dan L
dlinstedt- Posts : 7
Join date : 2010-03-15
More on block sizes:
With regards to block sizing: I am referring specifically to database block sizes (the consistent measure by which most RDBMS engines can be tuned). Note: I am referring to Oracle, Teradata, DB2 UDB (EEE), SQL Server, Sybase, and MySQL. I am NOT referring to file allocation size, which is a different measure and is not available in all databases. NOTE: I have tried to provide the links, but I am not sure this forum will allow it, so I have provided the titles and authors of the articles so they can still be referenced.
You and I have had a discussion about block sizes, pagefile.sys, and pre-emptive multi-tasking. I have provided references to each of these points below.
Regarding SQLServer block sizes:
SQLServer 2008 R2:
"SQL Server uses a fixed block size as a page of 8 KB. While it reduces a DBA’s ability to fine-tune I/O operations, it does provide for greater transportability and automation. With a fixed block size, SQL Server can automatically optimize I/O operations and memory utilization since it also knows exactly how much space is utilized with each I/O operation."
Microsoft: Practical SQL Server 2008 for Oracle Professionals, written by Microsoft engineering, October 2008 (a Microsoft Engineering technical document); see the header "Multiple Block Sizes" on page 20. (I would provide the link, but we are not allowed to post links here.)
This is different from the file allocation size, which CAN be 64 KB:
File Allocation Size
An appropriate value for most installations should be 65,536 bytes (that is, 64 KB) for partitions on which SQL Server data or log files reside. In many cases, this is the same size for Analysis Services data or log files, but there are times where 32 KB provides better performance. To determine the right size, you will need to do performance testing with your workload comparing the two different block sizes.
Found in: Disk Partition Alignment Best Practices for SQL Server (written for 2008, available online as a technical article), by Jimmy May and Denny Lee, 2009.
I found a reference for you to show that PageFile.sys can and DOES have an impact on the performance of SQL Server 2008 (you told me that the page file is not used by SQL Server, and that I should get my facts straight...).
Pagefile Configuration for SQLServer 2008 (Can be done, and IS used by the OS)
Document: Troubleshooting Performance Problems in SQL Server 2008, MSDN Technical Article
Written by: Sunil Agarwal, Boris Baryshnikov, Keith Elmore, Juergen Thomas, Kun Cheng, Burzin Patel
It indicates that page file settings are used to control the performance of swapping and memory blocks in the Windows OS.
In fact, SQLServerCentral posted this article:
Pre Check List for 64 BIT SQL Server 2008 Cluster Installation, Author: Rudy Panigas, 2009/12/31
in which the checklist has the following item:
Create a Local Disk for Page File
Create a dedicated disk on the locally attached storage unit that will contain the operating systems paging files. See System Performance Settings for more information on how to assign the paging disk to this dedicated disk.
Clearly, SQL Server 2008 R2 DOES use PageFile.sys, and PageFile.sys configuration and usage DOES have an impact on the performance of both normalized and denormalized data models. Therefore any argument about pagefile.sys and its utilization is a moot point UNLESS it is a barrier to parallel operations. My claim is that table scans have an upper limit: when that limit is breached (by data width), parallelism cannot be achieved to break down the query and go after the underlying data set, causing potential locking at the PageFile.sys level.
More on the PageFile.sys components:
Paging
This is an issue that has cropped up with the advent of 64bit OS’s. Windows Server 2003 changed it’s priority for non OS processes. This severely impacted SQL Server and created a heck of a time for SQL Server support. What happens, is when the OS feels it is experiencing external memory pressure, it begins choosing victims to steal their memory. In which cases non system level process will find their working set swapped to the paging file. The most obvious indication of this is the amount of page file usage. It should preferably be under 8%. I have seen some SQL Servers run ok on 10% however this is pushing it.
Written in: Troubleshooting SQL Server Slowness, Author: Derek Dieter, June 2009.
Link: SQLServer Trouble Shooting
Regarding pre-emptive & non-pre-emptive...
Now, SQL Server is in Non-Preemptive mode by default and it works fine. When CLR, extended Stored Procedures and other external components run, they run in Preemptive mode, leading to the creation of these wait types.
Article: SQL SERVER – PREEMPTIVE and Non-PREEMPTIVE – Wait Type – Day 19 of 28, written by: pinaldave
Joe Stack, from Microsoft, in PERFORMANCE TUNING WITH WAIT STATISTICS, wrote the following on slide 38:
Preemption is act of an OS temporarily interrupting an executing task, involuntarily
•Higher priority tasks can preempt lower priority tasks
•When this happens -can be expensive
•Preemptive mode used in SQL Server for external code calls, CLR with an UNSAFE assemblies, APIs that could block for extended periods (example –extended stored procedures)
What I was referring to was NOT the preemptive/non-preemptive behavior of SQL Server, but the behavior of the Windows operating system. I hereby concede that SQL Server 2008 (in the form of external code calls) can execute preemptively. However, this says nothing about SQL Server's behavior when executing SELECT queries, or whether the SQL task is preemptive or not. I am still looking for references on this point, because it makes a difference to parallel query execution and performance.
Why? Because the SQL Server 2008 engine clearly replaces SOME functionality of the OS layers; however, it still relies on Windows OS calls and PageFile.sys to manage swapping, and thus, in a non-preemptive situation, a process can become "blocked" (waiting on disk I/O) and held in the CPU for execution cycles until the disk returns with answers. Even though this operation takes milliseconds or microseconds, it can and does have an impact on parallelism, performance, and scalability.
As pointed out in yet another Microsoft Technet article:
Cooperative scheduling does not solve all scheduling problems. For example, tasks that do not fairly yield to other tasks can consume all available computing resources and prevent other tasks from making progress. The Concurrency Runtime uses the efficiency benefits of cooperative scheduling to complement the fairness guarantees of preemptive scheduling. By default, the Concurrency Runtime provides a cooperative scheduler that uses a work-stealing algorithm to efficiently distribute work among computing resources. However, the Concurrency Runtime scheduler also relies on the preemptive scheduler of the operating system to fairly distribute resources among applications. You can also create custom schedulers and scheduler policies in your applications to produce fine-grained control over thread execution.
Document: Comparing the Concurrency Runtime to Other Concurrency Models, July 2010
A little further down, the article discusses the scheduling and threading models of SQL Server 2008. The question is: does a table-scan read-ahead end up blocking I/O in order to read what is considered a wide-row table (4 KB or 8 KB row size) when you want all of the data from a SQL SELECT? Or will a parallel query with sufficient engines, and more rows per block, end up with more I/Os in parallel (taking less overall time), fewer I/Os in parallel (also taking less overall time), or the same number of I/Os in parallel (which should also take less time)?
Why less time? Because parallel tasks going after the complete data set split the work into parallel execution units. Do they force the disk to work harder? Maybe. Do they take less time? Usually; this has been the proven case in parallel computing since parallelism was first understood and implemented. If this weren't the case, we wouldn't have multi-threaded CPUs today, nor would we ever have implemented parallel query engines or threads at the database level.
Regarding FPGA and Netezza:
Article: FPGA: What’s in it for a Database?, written by: Rene Mueller, Jens Teubner
Parallelism and processing power now shifted away from CPU, closer to data / disk.
FPGA has similar dimensions as a CPU, consumes 5 times less power and clock speed is about 5 times less.
Filtering out unnecessary data.
My point:
FPGA moves the parallelism out of the software and onto the hardware/firmware layers. Parallelism is still parallelism no matter which way you slice it. The argument Thomas puts forward that I have a hard time swallowing is: "parallelism is bad (read: normalization is a form of parallelism), table scans and sequential access are better/faster." I cannot believe this to be the case, especially when hardware vendors like Netezza PUT parallelism at the hardware level to handle the join problems against the data sets. Netezza goes one better by offering (forced) column compression, reducing I/O and storage needs and increasing performance. Netezza has yet another trick up its sleeve: it uses hash-based functions to avoid indexing, but this IS an MPP data distribution (normalization) technique; based on key sets, the data is laid out and distributed to specific disk blocks. It just so happens that what executes next makes all the difference: FILTERING at the hardware level. I am suggesting that, with normalization and the way the Data Vault is laid out, anyone can take advantage of this. With parallel query optimizers against the Data Vault, both table and column elimination (filtering) can take place; unfortunately it's just not at the hardware/firmware level, which leads us to believe that "joins are bad or slower performing than normalized tables".
Ahh but there are limitations:
Limits to this flexibility are set by the availability of each type of resources. The most apparent limitation is the number of lookup tables of a given FPGA chip, which can limit the complexity of the logic circuit that can be programmed to it. Other potential limits include memory requirements, the availability of additional in-silicon functionality, or the bandwidth of the interconnect fabric.
If normalization were so bad, it would make sense to use mainframe computing power (with its vast ability to handle super-wide file stores) to build and house our data warehousing systems.
I want to make one more point: I agree with everyone that OVER-normalization can lead to a down-swing in performance that has to be compensated for by hardware layers (hence the column-based appliances; they do EXACTLY that: they over-normalize but make up for it with column compression and hardware performance). I still maintain that over-denormalization (rows that are too wide) will lead to a down-swing in performance as well, which again must be compensated for by hardware layers (hence mainframes and Netezza).
The secret to any and all performance is striking a balance between architecture, flexibility, scalability, and purpose.
A balance between parallelism, normalization, hardware capabilities, and design must be struck for common sense and the best possible outcome.
The purpose of the Data Vault model is to be a back-end historical data store, NOT a front-end ad-hoc query machine.
The purpose of the Star Schema model (my opinion only) is to be a front-end pivot, semi-historical data store, not a back-end fully accountable historical machine.
One more note: different data model designs will perform differently on different hardware and different infrastructure... hence the separation of logical and physical data models over the years.
My questions to the readers:
So: if they split the work up (divide the processing), and then suggest that horizontal (range/hash) partitioning of data is another scalability and performance measure, why can't vertical (column-based) partitioning further improve performance?
What the Data Vault proposes is exactly that: horizontal and vertical partitioning, working together with parallel query and parallel processing engines to reach the most even distribution of work across execution units. Does it mean more joins? Yes. Is it slower than a single table scan? The answer depends on: block size, page file (or an operating system that doesn't use one), disk speed, tuning switches at the database level, the overall amount of data in the table, the width of each row, the number of parallel processes that can be executed, and read-ahead buffers (for table scans). A sketch of such a vertical split follows.
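To illustrate the vertical-partitioning point, here is a minimal sketch (my own, with made-up tables and columns) of the same logical row stored once as a wide table and once as a 1-1 split:

-- One wide table: every scan drags the long text columns along
CREATE TABLE SaleWide (
    SaleId     BIGINT PRIMARY KEY,
    SaleDate   DATE,
    Amount     DECIMAL(12,2),
    LongNote1  VARCHAR(2000),
    LongNote2  VARCHAR(2000)
);

-- The same data as a 1-1 vertical split: hot columns stay narrow, cold columns move out
CREATE TABLE SaleCore (
    SaleId     BIGINT PRIMARY KEY,
    SaleDate   DATE,
    Amount     DECIMAL(12,2)
);

CREATE TABLE SaleNotes (
    SaleId     BIGINT PRIMARY KEY REFERENCES SaleCore(SaleId),
    LongNote1  VARCHAR(2000),
    LongNote2  VARCHAR(2000)
);
-- A scan of SaleCore touches far fewer bytes per row; queries that need the notes pay one extra join.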
The only way to truly test the performance is to set up a specific, agreed-upon test case where the data models are made public and approved by both parties, and the hardware, software, and database versions are well documented and understood. Then try it with different row widths as well as different numbers of rows (overall data sizes). Finally, run the same tests on different database engines. The numbers from different engines will of course be apples and oranges (no more than a rough approximation, because there are too many variables to count), but they will still give a good indication of results across multiple platforms.
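For what it is worth, on SQL Server a crude first pass at such a test can be made with SET STATISTICS IO/TIME. A sketch only, using the hypothetical tables above:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- wide-row scan
SELECT SUM(Amount) FROM SaleWide WHERERE SaleDate >= '2011-01-01';

-- narrow scan, with the join added only when the note columns are actually needed
SELECT SUM(c.Amount)
FROM SaleCore c
JOIN SaleNotes n ON n.SaleId = c.SaleId
WHERE c.SaleDate >= '2011-01-01';

-- Compare logical/physical reads and CPU vs. elapsed time in the Messages output;
-- run DBCC DROPCLEANBUFFERS between runs for cold-cache numbers.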
One final note: I was unable to locate ANY statement describing the physical, exact operation of PageFile.sys, something that says: "A single write to the pagefile or a single read from the pagefile is [or is not] a blocking operation." Without that evidence, there is no way to conclusively determine whether PageFile.sys is truly a "preemptive" operation. I do realize that the OS has preemptive scheduling layers, and that's great, but when it comes down to brass tacks, the answer to this question needs to be found and documented.
dlinstedt- Posts : 7
Join date : 2010-03-15
Re: Data Vault vs. Dimensional Model
dlinstedt wrote: use the Data Vault for back-end historical data tracking, use star schemas, flat-wide, cubes, and whatever is good to deliver the data to the business users... This has always been my original claim.
Virtual or physical star schema? There is a fundamental difference between them, which basically draws the line between Kimball and Inmon.
I would be more than happy to see a standard and disciplined modeling technique like Data Vault underpin MDM and track all the change history as another upstream data layer feeding the dimensional data store (DDS), just as the DDS feeds the cube. In this model it would be a lot easier to balance the ETL process between two distinctly different focuses: one dealing with data integration, data entry through applications, cleansing, data quality, etc., the other focused mainly on SCD handling, surrogate key substitution, etc.
I think we all agree we definitely need a star schema for cubing and trend analysis. However, some of the key elements, like fact grain and dimensionality, need to be clear at the physical level. While we are debating performance, let's not forget that ease of use is another purpose of dimensional modeling, and of data warehousing in general.
Under this structure, the MDM layer could also provide near-real-time reporting if needed. It would also help to rebuild the DDS and the cube without any hiccups. I would not be surprised to see many BI vendors come up with a similar product line.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Data Vault vs. Dimensional Model
When we start talking about hardware architectures and block sizes and stuff, it's a good indication that the discussion has gone off the rails. Frankly, none of that stuff matters, and it changes every 18 months or so.
When discussing dimensional modelling versus Data Vault (or Inmon's CDF), the question is: can a dimensional EDW stand alone as the primary data store and historical record, or is it necessary to create a repository from which data marts are published? Frankly, I believe a dimensional EDW can stand on its own.
So, when comparing a dimensional EDW with a Data Vault, the issue is not about joins, complexity, or performance. When you commit to a store-and-publish architecture, access to the repository (Data Vault or 3NF DB) is a background process that publishes into cubes, star schemas, flat files, or standard reports. No (or very little) ad-hoc work at all... as that is all done in the published marts.
The issue is: do you need to do it at all? Is there some data structure, relationship, or historical record that cannot be represented in a dimensional model? If not, why would one want to build an entire infrastructure simply to store data, and then build another infrastructure so people can use the data?
Some answers...
Hi Dan
I think we are moving toward common ground now. Let me restate a few things I think we agree on (please chime in):
- There are two different problems that must be solved: ad-hoc reporting and audit/archive
- We agree that for reporting purposes, a star schema is the right model
- We agree that hard numbers are difficult to come by and that vendor-to-vendor comparisons are hard
- I hope we also agree that if we position two models against each other, we can calculate backwards to the IOPS, memory, and network required to perform operations (and get in the right O-notation ballpark on CPU)
- We agree that parallelism is good and that it is a desired characteristic of a database
- We agree that stripping down a star schema to type 1 is an interesting approach (and one that I am also recommending). But I disagree with your idea of making every relationship into a link. Also, if "strip down a star schema to type 1" is what Data Vault is, then call it that. I don't understand why you want to invent all those new terms and make grand claims that it solves scale or archival in some magic way that star schemas do not.
I am NOT claiming that parallelism is bad, but I am telling you that normalization does not give you more of it. On the contrary, loop-join strategies (the ones that scale well in normalized models) are extremely expensive in CPU cycles compared to hash strategies.
With regards to pre-emptive vs. cooperative multitasking: I think that is going off on a tangent. I assume we agree that every modern operating system scales well, even on large nodes?
Can we also agree that super-wide tables with a LOT of columns (thousands) are a non-issue for data models? You can always model "downwards" instead by unpivoting the data (a sketch follows).
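A minimal T-SQL sketch of that "model downwards" idea (the table and column names are hypothetical, and a real table would list all of its measure columns):

SELECT EntityId, MeasureName, MeasureValue
FROM dbo.WideTable
UNPIVOT (MeasureValue FOR MeasureName IN (Measure001, Measure002, Measure003)) AS u;
-- turns one wide row into one narrow row per (EntityId, MeasureName, MeasureValue)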
If I understand you correctly, you claim:
- That column stores and star schemas do not scale to the PB size for queries (we should agree WHICH queries you mean, reporting?)
- That a normalized model a la Data Vault DOES scale to the PB size, even when you need to join two or more huge tables that are not co-located?
On this note: I simply don't follow how you get from "you need more IOPS to scan a wide table if you only need a few columns" to "you should normalize by using Data Vault guidance". If you said that there are cases where vertical partitioning of fact tables matters (i.e., a 1-1 "normalization"), then I think we would be more on common ground.
Could you also elaborate on what you mean by "blocking I/O"? Do you mean that the task blocks while it waits for I/O, or that the I/O system itself gets blocked? To use SQL Server as an example, it uses I/O completion ports to issue I/O requests. They are described here:
http://msdn.microsoft.com/en-us/library/aa365198(v=vs.85).aspx
Highlight this note: "I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system"; the details are described in the paper. You will see that neither pagefile.sys nor blocking of the entire I/O system is involved. Note that EACH port has its own concurrency value, and you can open as many as you like (until you run out of memory or threads :))
With regards to the 8K, 64K, and 512K questions: the Fast Track guidance gives you some good basic information on this: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/fast-track.aspx. Apart from that, the best source really is Russinovich: http://technet.microsoft.com/en-us/sysinternals/bb963901. If you want to convince yourself that larger-than-64K I/O requests are possible, it is easy to use a tool like SQLIO or IOMeter to set up a test and simply measure the request sizes with Perfmon (the counter you are looking for is Avg. Disk Bytes/Read). You can also use this reference to read about how SQL Server scales and balances I/O: http://msdn.microsoft.com/en-us/library/ee410782.aspx (incidentally, partially written by yours truly). You might also want to look at the read-ahead described here: http://msdn.microsoft.com/en-us/library/ms191475.aspx; a similar construct exists in all major databases.
With regards to pagefile.sys: I don't understand why you are so focused on that file. It is just the virtual memory backing the RAM. Database engines on the Windows platform do not use this file unless an exceptional condition occurs (they typically pin memory in non-swappable large pages instead), and I/O operations just don't touch it. Perhaps the exceptional conditions you found are why you are wondering about it? It is true that we will sometimes dedicate a disk to pagefile.sys, but this is done to handle the rare cases where the SQL Server process fights with another process for memory, for example when Analysis Services runs at the same time (the large pages do not protect the exe itself from being swapped out, so you want the exe back in main memory ASAP). This is NOT what is done on a large-scale SQL system. On a poorly configured cluster, this can cause the heartbeat to fail, which is why a dedicated drive is recommended "just in case". So please read this in the right context.
With regards to fragmented reads and random/sequential: let us just assume that all pages are full. This is achievable with proper loading techniques (described here: http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx). In that case, a high read-ahead, 512K block still beats random seeks. Also note that SQL Server has an optimization called "unordered scan" where the order of index pages does not matter with regards to sequential IOPS (you can read about unordered scans here: http://www.sqlmag.com/article/sql-server/quaere-verum-clustered-index-scans-part-i)
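If you want to verify the "pages are full" assumption on a given table, one quick check is the physical-stats DMV. A sketch, with a hypothetical table name:

SELECT index_id, avg_page_space_used_in_percent, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.FactOrders'), NULL, NULL, 'SAMPLED');
-- avg_page_space_used_in_percent close to 100 means the pages are essentially full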
I will post more, but I am still trying to catch up on all our open questions.
thomaskejser- Posts : 12
Join date : 2011-08-14
Dimensional repository standing on its own
Hi ngalemmo
Thanks for chiming in. I agree that there is no need for any repository other than the dimensional model. However, I think Dan makes a very strong point in: http://www.tdan.com/view-articles/5054/
"The problem is scalability and flexibility. If an additional parent table is added, the change is forced to cascade down through all subordinate table structures. Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key."
This is really just a fancy way of saying that if you change your mind about type 2 history tracking, you are faced with a problem with your fact table keys, because you now have to update the keys in all affected facts. I think this is the big one that has hurt some dimensional modelers when they try to scale their data model, especially if you have trouble writing a fast update (and there are clearly engines that are better at this than others) or believe that there is some law that makes updates hard for databases to do at scale.
I think there is some point in "carving off" the type 2 attributes into a separate table because of this issue (this is similar to Dan's satellite). The catch is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data. Such joins are hard for query optimizers to handle because it is difficult to get a good cardinality estimate. Of course, the increased price of the BETWEEN also bubbles up to either the user or the data mart loader, something that has to be accounted for.
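To illustrate the BETWEEN join in question, here is a sketch (hypothetical table and column names) of a fact joined to a carved-off type 2 attribute table on a validity date range instead of a plain equality key:

SELECT f.OrderId, f.Amount, c2.City
FROM FactOrders f
JOIN DimCustomerHistory c2
  ON  c2.CustomerKey = f.CustomerKey
  AND f.OrderDate >= c2.ValidFrom     -- the range predicate that makes
  AND f.OrderDate <  c2.ValidTo;      -- cardinality estimation hard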
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault vs. Dimensional Model
If an additional parent table is added, the change is forced to cascade down through all subordinate table structures.
I guess it refers to adding a new dimension to an existing fact table. It can happen, but if the warehouse was designed right to begin with (an enterprise view rather than building to specific reports) it would be rare. And if you have to apply the change retroactively, it is a little bit of work (maybe 10-20 hours).
Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key.
I have no clue what this is referring to. If it is a type 2, you never go back and change fact keys when new rows are added. Even if you change the dimension from a type 2 to a type 1, there is never any reason to rekey the facts... you simply adjust the dimension.
The issue with doing this is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data.
Why would you need to do this? It's a simple join between a fact and a type 2. The only time you would filter a join against a type 2 is if you want to adjust the dimension to a point in time other than the point in time of the fact - in other words, getting the current dimension values. Simply implementing both a type 1 and a type 2 version of the dimension clears that up.
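A minimal SQL sketch of the two cases being described - the table and column names (fact_sales, dim_customer, customer_sk, valid_from/valid_to) are hypothetical, not from the thread, and the from/to periods are assumed half-open:
-- the usual case: a plain equi-join on the surrogate key already stored on the fact
SELECT f.sales_amount, d.city
FROM   fact_sales AS f
JOIN   dim_customer AS d ON d.customer_sk = f.customer_sk;
-- only an "as of some other date" question needs the from/to filter, e.g. restating the
-- dimension as it looked on 2001-06-30 (the current row is just the special case valid_to = '9999-12-31')
SELECT f.sales_amount, d.city
FROM   fact_sales AS f
JOIN   dim_customer AS d ON d.customer_id = f.customer_id
                        AND d.valid_from <= '2001-06-30'
                        AND d.valid_to   >  '2001-06-30';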
The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the data vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you will not still need to do some rework. It's a lot of effort to maybe save a few man-weeks over a period of years because of unknown or unanticipated model changes.
There are a few other choice tidbits in the article:
One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right. That means understanding the data as it is aggregated for each fact table and assuring that the aggregation will stay consistent for all time (during the life of the relationship) and the structure of each fact table will not change (i.e., no new dimensions will be added to either fact table). This limits design, scalability and flexibility of the data model.
This is kind of alluding to my earlier comment about building it right. If for a moment you think that data is aggregated for a fact table, you don't understand dimensional data warehousing. Fact tables should be built at the lowest level of detail available from the source data. Period. Aggregation is purely optional, stored in separate fact tables, and usually done for performance reasons.
Another issue is the “helper table.” This table is defined to be a dimension-to-dimension relationship Link.
A helper or bridge table sits between a fact and a dimension. It is not a dimension-to-dimension link. From a relational point of view, it is an associative entity used to resolve many-to-many relationships. Since a many-to-many relationship is the function of the relationships in the data, and not the modeling methodology, I fail to see how this is an issue with dimensional design.
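For reference, a minimal SQL sketch of the usual bridge pattern (hypothetical names - here many customers per account group; the weight column is the optional allocation factor often carried on bridges):
CREATE TABLE bridge_account_customer (
    account_group_sk INT          NOT NULL,   -- this key is what the fact row carries
    customer_sk      INT          NOT NULL,   -- member of the group, points at dim_customer
    weight_factor    DECIMAL(9,6) NULL,       -- optional: allocation weight so measures can be apportioned
    PRIMARY KEY (account_group_sk, customer_sk)
);
-- the fact joins the bridge on account_group_sk, and the bridge joins dim_customer on customer_sk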
Here's the thing. The data vault introduces a level of abstraction over what we would normally view as a data structure (a row and its columns). Fine, nothing wrong with that. And, if I were to implement a store-and-publish architecture for a data warehouse, it is a methodology well worth considering. But are data structures used in business really so dynamic that relatively minor maintenance issues matter? I haven't seen that. What I have usually seen is large organizations spending years and countless millions of dollars updating their ERP. With timeframes like that, I wouldn't sweat the occasional schema change.
The Type2 driven update statement
The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the data vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you will not still need to do some rework. It's a lot of effort to maybe save a few man-weeks over a period of years because of unknown or unanticipated model changes.
...We are in agreement, ask Dan :-)
With regards to the Type2 changes, here is the scenario:
Let us say you have some fact tables keyed to the customer dimension. You decide to do type 2 change tracking on the customer city and type 1 on the name.
At T(0) your world looks like this:
SK, CustomerID, Name, City, From, To
1, A, Thomas, London, 2000, 2001
2, A, Thomas, Berlin, 2002, 9999
Now, at T(1), you decide that you also want to do type 2 tracking on customer education. This changes faster than city, so you now get a larger customer dimension with more rows and more surrogate keys. For example (using floats for the SK, just for readability):
SK, Customer, Name, City, Education, From, To
1, A, Thomas, London, None, 2000, 2001
1.1, A, Thomas, London, Master, 2001, 2002
2, A, Thomas, Berlin, Master, 2002, 9999
Here is the issue: the large fact tables that are keyed to the London row (SK=1) now need to have their SK updated to 1.1 if the transaction date is in 2001 and the education has changed to "Master". This could potentially be a rather large update statement on several fact tables (travel the dependency tree here). If you are a PB-sized installation and not careful about writing that update correctly, you could be looking at days, or even weeks, until the change has taken effect (DBAs get this type of stuff wrong all the time).
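A minimal T-SQL-style sketch of the kind of rekeying update meant here - fact_sales, transaction_date and customer_id are hypothetical names, the valid_from/valid_to periods are assumed half-open, and the statement would have to be repeated for every fact table keyed to the dimension:
UPDATE f
SET    f.customer_sk = new_d.customer_sk
FROM   fact_sales AS f
JOIN   dim_customer AS old_d ON old_d.customer_sk = f.customer_sk          -- the row the fact points at today
JOIN   dim_customer AS new_d ON new_d.customer_id = old_d.customer_id      -- same customer...
                            AND f.transaction_date >= new_d.valid_from     -- ...but the version that now
                            AND f.transaction_date <  new_d.valid_to       --    covers the transaction date
WHERE  old_d.customer_sk <> new_d.customer_sk;                             -- only rewrite rows that actually move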
The alternative is to "fork off" the history. So you get this two-table "dimensional" structure
StableSK, Customer, Name
1, A, Thomas
HistorySK, StableSK, City, Education, From, To
42, 1, London, None, 2000, 2001
43, 1, London, Master, 2001, 2002
44, 1, Berlin, Master, 2002, 9999
If you key the fact on StableSK, you can change your mind as much as you like about the type 2 changes, without incurring the big update. Of course, this means that the join between the history table and the fact table is much more expensive (since it is now a non-equi join - the BETWEEN I was referring to). I think that trade-off is generally worth it, but the obvious issues about optimizer trouble rear their ugly heads here.
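A minimal SQL sketch of the query shape that results, using the column names above plus a hypothetical fact_sales table with a transaction_date column (From/To are bracketed only because they clash with SQL keywords; half-open periods assumed):
SELECT f.sales_amount, c.Name, h.City, h.Education
FROM   fact_sales AS f
JOIN   customer AS c         ON c.StableSK = f.StableSK                   -- cheap equi-join for the type 1 attributes
JOIN   customer_history AS h ON h.StableSK = f.StableSK
                            AND f.transaction_date >= h.[From]            -- the non-equi (BETWEEN) part that the
                            AND f.transaction_date <  h.[To];             -- optimizer struggles to estimate
Queries that only need type 1 attributes can skip the history join entirely, which is the point below about type 1 queries still going to the StableSK table.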
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
The issue is retroactively adjusting history, not changing the conditions of a type 2. You can always change a type 2 and let things fall into place moving forward without a lot of effort.
And, yeah, if you had to retroactively change things, it would be a real pain in the neck. But, as yet, I haven't had to. And the idea of a stable surrogate key has been discussed before, but as an alternate key to a type 2, not as the PK (in conjunction with date) in one table rather than two.
I don't like the forking idea (no pun intended) of the static key, as it impacts query performance for all time to avoid a rare occurrence: a one-time rekeying of the fact table to do retroactive history adjustments - assuming old history is even available from the source system.
To fork or not to fork
ngalemmo wrote:
I don't like the forking idea (no pun intended) of the static key, as it impacts query performance for all time to avoid a rare occurrence: a one-time rekeying of the fact table to do retroactive history adjustments - assuming old history is even available from the source system.
Still, a good pun :-)
I think it is a valid question to ask: "How large is that performance impact?" The re-keying is a real problem that I have experienced at scale. With badly written update statements, we are literally talking weeks of running, even on a "largish" 100 TB system. You are amortizing a potentially VERY high pain over a lot of smaller user queries - and this may be a good idea, depending on the impact. But a well-written UPDATE statement on an MPP should get you into the tens of GB/sec to the disks (especially if you are laid out well for sequential I/O :-)
So the question is: is it worth giving up hash joins and trading them for loop joins (painful)? Also, most column stores struggle with this particular query pattern.
Note that by doing the forking trick, you only affect queries that need the type 2 history; type 1 queries can still go to the "StableSK" table.
I am also curious to know if any database vendors have a nifty way to perform this BETWEEN From/To join efficiently (there are clearly ways to get some of the hash performance here). Hope a DBA will chime in here.
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
ngalemmo wrote: Can a dimensional EDW stand alone as the primary data store and historical record, or is it necessary to create a repository from which data marts are published?
Hi Thomas, I suspect this is the intention of Master Data Services (MDS) in Denali (SQL Server). I have not got my hands on the new feature, but I have seen a powerful demo at PASS where you can incrementally add (publish) new attributes to any dimension without a developer doing any DDL or updates on the table.
I have also seen the session by Joy Mundy explaining the role MDS can play to ensure data quality, proper metadata management etc. to streamline the ETL process.
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
MDS and metadata
Hang
I think that metadata and MDS are orthogonal to the "store and publish, or just use star schema" architectural question. No matter which architecture you build, you will need both metadata and MDS to make it manageable at scale. Also, both Dan's Vault model and the star schema can be the recipients of semi-automated ETL flows auto-generated from metadata. I have built meta frameworks that load both of them - and they are equally simple to load, though the star schema lends itself better to parallelism because it has fewer key constraints to worry about. Of course, you can turn off key validation in the data vault, but that kind of defeats the purpose (and also leads optimizers astray).
All that being said, the notion that users can build their own hierarchies and add their own attributes to dimensions is a rather tempting idea and one you can see in Denali. There are cases where it makes sense for a user to have his own copy of a dimension table and then populate the attributes as they see fit (but maintaining the surrogate key). This of course creates some interesting indexing problems - but I think they are surmountable, even at scale.
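A minimal sketch of what such a user-maintained extension could look like - the names are hypothetical and this is not the MDS implementation; the only contract with the core dimension is the shared surrogate key:
CREATE TABLE dim_customer_user_ext (
    customer_sk     INT         NOT NULL PRIMARY KEY,   -- same surrogate key as the core customer dimension
    my_region_group VARCHAR(50) NULL,                   -- user-defined grouping attribute
    my_abc_class    CHAR(1)     NULL                    -- user-defined classification
);
-- reports simply LEFT JOIN the extension onto the core dimension on customer_sk,
-- which is where the interesting indexing questions come in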
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
Thomas
Thanks for the comments. However, MDS does have a big implication for the ETL process, which plays a big part in determining the success of a dimensional modelling architecture. I guess with MDS in place the ETL process would become more manageable, or somewhat automated, as you put it.
Could you please also clarify whether the underlying data structure for MDS is more in line with Dan's data vault model, and quite different from a dimensional model or any other traditional 3NF model per se?
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
MDM looks more like Vault - or does it?
Hi Hang
I agree that a master data repository looks a lot more like a vault at first glance. However, just because a schema has tables with start and end dates does not make it a data vault - it just makes it a system that captures changes to tables (OLTP designers may have designed such "log tables" using triggers before).
If you are designing the schema for a master data management system, you generally want that schema to be extensible from the client, so business users can model the world as they think it looks. This means you are often building something that is essentially a form of name/value-pair database - but one which should also materialize the same data into something that looks a lot like type 2 dimensions, so it is easy to query. Is that a Data Vault, or just sound 3NF design?...
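A minimal sketch of the kind of name/value-pair (entity-attribute-value) structure meant here - the names are hypothetical and this is not the actual MDS schema; the type-2-looking materialized form would be pivoted out of the value table:
CREATE TABLE mdm_entity (
    entity_id   INT          NOT NULL PRIMARY KEY,
    entity_name VARCHAR(100) NOT NULL                 -- e.g. 'Customer'
);
CREATE TABLE mdm_attribute (
    attribute_id   INT          NOT NULL PRIMARY KEY,
    entity_id      INT          NOT NULL REFERENCES mdm_entity(entity_id),
    attribute_name VARCHAR(100) NOT NULL              -- business users add attributes as rows, no DDL needed
);
CREATE TABLE mdm_member (
    member_id    INT          NOT NULL PRIMARY KEY,
    entity_id    INT          NOT NULL REFERENCES mdm_entity(entity_id),
    business_key VARCHAR(100) NOT NULL                -- the durable key handed to the EDW
);
CREATE TABLE mdm_attribute_value (
    member_id       INT           NOT NULL REFERENCES mdm_member(member_id),
    attribute_id    INT           NOT NULL REFERENCES mdm_attribute(attribute_id),
    attribute_value VARCHAR(4000) NULL,
    valid_from      DATE          NOT NULL,
    valid_to        DATE          NOT NULL,           -- per-value change tracking
    PRIMARY KEY (member_id, attribute_id, valid_from)
);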
When I think about MDM, I think of it as one (maybe the only, if done right) source of dimension data - not facts. Because of that, it is a small database system that can take some liberties about the schema it uses (performance is less critical, though still important). But even small databases like that still have a lot of value - especially if the MDM knows how to handle keys properly.
Be aware that the MDM can actually make your life harder as an EDW designer: cleaning of data can take a long time if it is not done with good tuning. This means that the MDM can be "lagging behind" the sources from the perspective of the EDW (which may be connected directly to message buses and source extracts). This can lead to all sorts of interesting late/early-arriving key problems. It is also not unusual to see the EDW solve data quality problems (because it needs that solution NOW in its reports) before the rest of the organization is done doing vendor selection on an MDM system. You may have experienced the interesting political situations and discussions about who does the best job of handling keys and data cleaning - the EDW or the MDM?
With regard to key management: in my blog at http://blog.kejser.org I have described a structure that I call "map tables", which are an important part of the ETL flow that I argue for. They look somewhat like Dan's satellites but, as you can see in some of the entries, they serve a different purpose and have some interesting characteristics that help you manage key errors in the sources. Ideally, an MDM repository would supply tables with that structure to the EDW - along with the cleaned source data for dimensions.
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
Thomas,
Your "map tables" appear (in slightly different form) in a Data Vault in the Business (Rule) Data Vault section. Your alternate keys can be satistfied with keysats. Basically the same look and feel but slightly more flexible and because of auditability also usable for data quality feedback loops to OLTP source systems as well.
Another problem I have with all of your stuff is that most of it should handle bitemporality as well, else you will never be able to fix business rules without destroying your auditability. This goes for most of this disussion BTW.
Also when only the present is important in a DV you can always partition off the past, logically the past and present of the same info still form 1 entity (except for model changes). Physically splitting it and saying it is different is just doing physical optimizations on the database schema level.
However I do seriously doubt that your architecture will allow for an (small) incremental and flexible DWH implementation without serious rework. IMO it looks either still too much "big bang" which i think is one of the biggest issues of Kimball and Inmon "old style".
One of the interesting philisophical differences between Data Vault and kimball is that Data Vault tries to reconcilliate and align source and the DWH as much as possible whille Kimball tries to have "improved" or even "the best" data quite distinctly from the sources. Or to put it another way, Kimball polarizes your data between sources and DWH while DV tries to unify them. From a more holistic Data/information Architecture standpoint (not just DWH) I think unification is the better approach.
Your "map tables" appear (in slightly different form) in a Data Vault in the Business (Rule) Data Vault section. Your alternate keys can be satistfied with keysats. Basically the same look and feel but slightly more flexible and because of auditability also usable for data quality feedback loops to OLTP source systems as well.
Another problem I have with all of your stuff is that most of it should handle bitemporality as well, else you will never be able to fix business rules without destroying your auditability. This goes for most of this disussion BTW.
Also when only the present is important in a DV you can always partition off the past, logically the past and present of the same info still form 1 entity (except for model changes). Physically splitting it and saying it is different is just doing physical optimizations on the database schema level.
However I do seriously doubt that your architecture will allow for an (small) incremental and flexible DWH implementation without serious rework. IMO it looks either still too much "big bang" which i think is one of the biggest issues of Kimball and Inmon "old style".
One of the interesting philisophical differences between Data Vault and kimball is that Data Vault tries to reconcilliate and align source and the DWH as much as possible whille Kimball tries to have "improved" or even "the best" data quite distinctly from the sources. Or to put it another way, Kimball polarizes your data between sources and DWH while DV tries to unify them. From a more holistic Data/information Architecture standpoint (not just DWH) I think unification is the better approach.
dm_unseen- Posts : 4
Join date : 2009-02-03
Location : Netherlands
Re: Data Vault v's Dimensional Model
Kimball polarizes your data between sources and DWH while DV tries to unify them.
Could you elaborate?
Re: Data Vault v's Dimensional Model
With Data Vault we start off by assuming the source data is correct and complete. If not, we initiate feedback loops to improve DQ. We also don't delete source data, because we want to measure DQ improvement over time. We also try to rely solely on source key integration. In a Data Vault we don't want to create new "truth" by processing data; we want to report on the facts given by the source. When this is not possible we will reluctantly add additional layers to cleanse/conform the data, but by doing that we create another truth/view on the data, different from the source, hampering auditability and obscuring the relation between operational processes and the DWH data.
Kimball just says to ignore the source systems as soon as data is made available through the EDW, so source system data actually becomes second-best compared to the cleansed & conformed data in the EDW. For Kimball, bad DQ is an argument for the DWH to take over instead of trying to improve the DQ of the sources.
IMO Kimball tries to sit on top (of the sources) while Data Vault tries to align with them.
dm_unseen- Posts : 4
Join date : 2009-02-03
Location : Netherlands
Unifying the sources
Hi Dan
Agree that the maps are very similar to satellites - though note that they are specifically handled per source, and that I see them as an add-on to the dimensional model, a part of the ETL flow, not as some new and revolutionary modeling technique.
I don't follow why a data vault (or an Inmon 3NF) somehow unifies the sources. Putting every source into the same database (even if you model it using some fixed ruleset) is just another way of saying "stovepipe"; it does not embrace the fact that sources ARE different and need cleaning and semantic realignment to be turned into usable data - which is where the big work happens and where 80% of the time is spent. My point is exactly that you should NOT model the source, because it is a waste of time. Instead, if you must, throw it into some cheap big piles (Hadoop is good; if not, use some cheap database engine). Just keep the source schema as it is, perhaps adding a little log table to each table to track changes (if the source system doesn't do that already). That is as close to agile as it comes (but no silver bullet - which I think we agree does not really exist). If the source schema changes, create a new table or modify the existing one - no big deal, no special modeling technique.
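A minimal sketch of the kind of per-table change log meant here (hypothetical names; one such log table per source table, with the source columns carried over as-is):
CREATE TABLE src_customer_log (
    log_id        BIGINT       NOT NULL PRIMARY KEY,   -- surrogate for the log entry itself
    customer_id   INT          NOT NULL,               -- the source table's own key, unchanged
    customer_name VARCHAR(100) NULL,                   -- source columns copied untouched
    city          VARCHAR(100) NULL,
    change_type   CHAR(1)      NOT NULL,               -- 'I'nsert / 'U'pdate / 'D'elete
    changed_at    DATETIME     NOT NULL,               -- when the change was captured
    load_batch_id INT          NOT NULL                -- which extract picked the row up
);
If the source schema changes, you simply alter or recreate the corresponding log table - no special modeling technique involved.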
Temporal relations must be handled in any data load scenario; it depends on what you consider the "valid date" for a fact table and which dates you use as the basis of valid from/to in the type 2/history tables. But this does not change the fact that you still need some agreed-on, common definition of validity that users can report on. It may not be the "truth", but it is the agreed-upon version of it. Again, it is the ETL flow that creates the reporting schema which must eventually handle this.
Paraphrasing a point made earlier in this forum: just because you shove a lot of data into a common model does not mean you are dodging a "big bang" implementation or being any more structured about it. The Kimball-style EDW (as opposed to using Kimball just for marts) is characterized exactly by the fact that you can gradually build it up, one fact and dimension table at a time. This means you can get off the ground quickly and deliver business value, without engaging in an academic modeling exercise. From the "one version of the truth" requirement, it does not follow that "everything must be in one, unified and pre-defined data model".
thomaskejser- Posts : 12
Join date : 2011-08-14
Re: Data Vault v's Dimensional Model
Thomas,
Some small but interesting remarks (rest comes later):
- With Data Vault we like to use the phrase "One version of the Facts". The reason is that truth is subjective and will change over the course of the life of a DWH. It means that all interpretation (business rules, cleansing, conforming and calculating) is to be done after the central DWH and not before. This is different from classical Inmon style and from Kimball, but in line with the latest Inmon architecture.
- Data Vault does try to conform, but just not up to the point of doing actual complex ETL stuff. That is reserved for downstream data marts or the business data vault. If this leaves you with a pile of data junk instead of a nice DWH, well... whose data problem is that?
- The argument between using a Data Vault or a Historical Staging Area IMO also depends on DWH usage. We try to serve ODS-style reports (real-time as well) from our central DWH, so Data Vault works quite well there. IMO an HSA is the second-best option after a Data Vault (better than nothing ;)
- Kimball works per fact and dimension table, but Data Vault works incrementally per *source* table (actually a set of source tables). In a Data Vault, dimension design & conformation is optional (as is the whole Kimball architecture).
PS: I'm not sure if you are confusing me (dm_unseen, AKA Martijn Evers) with Dan Linstedt. Although we both know each other and Data Vault very well, we are still different people :). It appears I'm currently standing in for him (fine by me), which means you get two "DV heads" for the thread of one.
PPS: Will I see you at the upcoming SQLBits? I'll be doing a "Data Vault in a Day" precon there.
dm_unseen- Posts : 4
Join date : 2009-02-03
Location : Netherlands
Re: Data Vault v's Dimensional Model
dm_unseen wrote: With Data Vault we start off by assuming the source data is correct and complete.
I don't assume anything. Well, actually not true... I assume it's all bad and don't worry about it. I am concerned about accuracy, not correctness.
We also don't delete source data because we want to measure DQ improvement over time.
What does this have to do with the model? It sounds more like a data retention policy. There is nothing in a dimensional model that precludes retention of old data. It's a design choice, not a mandate.
We also try to solely rely on source key integration. In a Data Vault we don't want to create new "truth" by processing data, we want to report on the facts given by the source.
Again, not a modeling difference. There are load techniques that allow for soft integration of disparate sources in a dimensional model. All source data is retained, facts are tied to the dimension rows from the same source, and provisions are made to accommodate a combined view across systems. Also, the primary purpose of a data warehouse is to be accurate. Correctness comes later.
When this is not possible we will reluctantly add additional layers to cleanse/conform the data, but by doing that we create another truth/view on the data different from the source, hampering auditability and obscuring the relation between operational processes and the DWH data.
Kimball just says to ignore the source systems as soon as data is made available through the EDW, so source system data actually becomes second-best compared to the cleansed & conformed data in the EDW. For Kimball, bad DQ is an argument for the DWH to take over instead of trying to improve the DQ of the sources.
I don't quite follow. If you have placed the data into an accurate data warehouse, why would you want to have BI users go back to the source? What is the point of building a data warehouse if not to serve as the primary source for tactical and strategic data analysis? It is not a quality issue.
You are implying that dimensional modeling somehow dictates where and when data cleansing should take place. While Ralph may or may not have expressed opinions on this matter (I don't know), they have nothing to do with the modeling technique. It is a process issue. You can implement DQ feedback loops, in-line cleansing, or whatever technique you feel is appropriate to the case at hand.
IMO Kimball tries to sit on top (of the sources) while Data Vault tries to align with them.
I don't understand what you mean by this.
Oops :-)
Hi Dm_unseen
Indeed, I did confuse you with Dan, sorry. Yes, I will be at SQLBits, but unfortunately I will not have time for the pre-con there.
I will be doing a presentation about scalability though, hope to see you there.
thomaskejser- Posts : 12
Join date : 2011-08-14
Dimensional Modeling and Data Vault is a project scope vs a program scope issue !
Hi,
From previous posts I noticed that performance may be an issue, and if that is the case then, for me, Dimensional Modeling is the best way to answer business questions efficiently, regardless of speed comparisons tied to specific physical infrastructures.
The bigger issue is auditing; see Kimball Design Tip 26 about adding an audit table in the EDW:
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2001/KimballDT26AddingAnAudit.pdf
From the above, Dimensional Modeling is the most efficient solution for answering enterprise-wide needs, whereas Data Vault would be the best answer for a business wanting a fast-track solution (which could be cost-effective) for a project scope rather than a program scope.
Sofiane
ssofiane- Posts : 1
Join date : 2012-01-10