Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
I've been dimensional modelling for well over 10 years, and strongly held the opinion that only a dimensional model would work for an EDW. However, I'm now beginning to think that there are certain requirements that lead to a 3NF design for an 'Enterprise Data Warehouse'. I'd like to share these and understand what other requirements lead to a 3NF design. I'm not looking for a religious debate over one or the other, just the facts of the matter. These are the requirements that I have seen that I think need a 3NF design:
1. Use of Teradata for the core data warehouse. Why? - It's just designed better to work with that type of model.
2. Real-time data warehousing. Why? - A 3NF data model should allow a quicker update of data.
I would also add that even with a 3NF design, I'd always look to create dimensional marts for specific applications.
Andy Painter- Posts : 1
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
I will be the first to admit 3NF sure is easier to implement if you are looking for real-time data, but the problem lies in the use of the resulting EDW. If the whole intention of the data warehouse is to supply data to analytical applications, then you lose the two major advantages of the Dimensional Model:
- The first is ease of use: the dimensional model will be much easier for your users to understand and easier for your query and analysis tool to consume.
- Query performance: 3NF databases suffer from performance problems when aggregating or retrieving the many dimensional values that analysis may require. If you are only going to do operational reports then you may be able to get by with 3NF, because your operational user will be looking for very fine-grained data.
The major problem with dimensional models and real-time data is the transformation. I worked on a project where we attempted to populate a dimensional model in real-time but we fell flat because we could not find a mechanism to transform the data in real time.
We were using a tool that provides a stream for each table's change log but we could not synchronize the transactions to ensure we built a valid fact row. I think if we had been able to change the application schema we might have been able to overcome this obstacle so we'll get it right on the next project.
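For the curious, the correlation we were missing boils down to something like this - just a sketch, with made-up table names and event fields, not what our tool actually provided:

```python
# Sketch: correlate per-table change-log events into one fact row by
# transaction id. Table names and event fields here are hypothetical.
from collections import defaultdict

REQUIRED_TABLES = {"order_header", "order_line", "payment"}

pending = defaultdict(dict)  # txn_id -> {table_name: event}

def on_change_event(event):
    """Buffer each table's event until the whole transaction has arrived."""
    events = pending[event["txn_id"]]
    events[event["table"]] = event
    if REQUIRED_TABLES.issubset(events):
        build_fact_row(pending.pop(event["txn_id"]))

def build_fact_row(events):
    # Surrogate key lookups and measure derivation would happen here.
    print("fact row ready for txn:", events["order_header"]["txn_id"])
```

Without something holding the partial transactions together, a fact row built from any single stream was incomplete, which is exactly where we fell over.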
Chris Cammers- Posts : 1
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
I agree with Chris on this. While there are general guidelines and best practices around building data warehouses, the architect has to take into account the complexity, volume and organization of data in their respective enterprise when designing one.
In my view, 3NF is a set of rules that focus on maintaining data integrity when you have a lot of transactional updates to your database. As such it is tuned for easy retrieval and updates of detail data. Performance of aggregate queries will most often suffer, especially when you have complex reports spanning multiple attributes and fact tables.
Dimensional modeling is more of a practice, a disciplined approach if you will, to creating an infrastructure for the most efficient aggregate query performance with drill-down/roll-up capabilities. The data model should support easy navigability (business users seldom want to navigate complex join paths involving n levels of normalized tables) and deliver query performance that is more or less agnostic of data volume.
While the influx of powerful hardware has vastly improved the aggregate performance of 3NF data models, they generally lack the discipline, ad-hoc query performance and ease of use associated with dimensional models.
Again, an EDW is a significant investment, and hence I have always tried to highlight, with varying levels of success, the scalability and benefits of the dimensional model. And yes, I have been beaten and bruised, often by very senior data architects who want to reuse existing 3NF transaction processing databases for analytics. While I have proudly maintained, scaled and enhanced my 10-year-old dimensional data model with ease, I have also quietly mourned the demise of some of the departmental 3NF versions that required multiple design revisions to accommodate requests for additional reports.
dharidas- Posts : 5
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Advantages of 3NF over Dimensional model I see are:
- Reduced ETL effort: makes the data warehouse more agile and able to meet changing business needs much quicker. I regularly see the business change its mind on what it wants to capture, and sometimes it is impossible to keep up - the warehouse can become semi-redundant.
- Changes to source data systems can quickly be fed through to the EDW and on to users.
- Real time for a dimensional model is complex (e.g. inferred dimensions), which means the EDW is even slower.
Disadvantages of 3NF over Dimensional model I see are:
- Complex joins - how can you allow the business to perform ad-hoc queries? I can't see how you can populate an OLAP cube from a 3NF data model.
- Slow performance - multiple joins mean query time can be long, and no user wants to wait for data to load, although this can be mitigated with powerful hardware.
- Where does the calculation logic go?! ETL performs complex calculations to provide insight into the business.
- High overhead to create reports for users - if a user wants a specific report, there is significant effort to understand and create it from the 3NF model.
Sounds like you're talking about a form of operational data store (ODS), Andy.
tomstagg- Posts : 1
Join date : 2009-02-03
3NF cannot handle certain SCD
The problem with 3NF is that it cannot handle type 2 or type 3 slowly changing dimensions. For type 2, you have to reload all of the dimensional data during every load with a date stamp. And if rollups change or relationships change, you're hosed. It's harder to build/design the dimensional database, but once it's built, it's a thing of beauty.
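To make the comparison concrete, here is roughly what a type 2 change looks like on the dimensional side - a bare-bones sketch with made-up table and column names (placeholder syntax follows sqlite3):

```python
# Sketch: apply a type 2 change by expiring the current dimension row and
# inserting a new version. Table/column names are hypothetical.
from datetime import date

def apply_scd2_change(conn, customer_id, new_attrs):
    today = date.today().isoformat()
    cur = conn.cursor()
    # 1. Close out the currently active row for this natural key.
    cur.execute(
        "UPDATE customer_dim SET expiry_date = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (today, customer_id),
    )
    # 2. Insert the new version; the surrogate key is auto-generated.
    cur.execute(
        "INSERT INTO customer_dim "
        "(customer_id, name, segment, effective_date, expiry_date, is_current) "
        "VALUES (?, ?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_attrs["name"], new_attrs["segment"], today),
    )
    conn.commit()
```

Two statements and you have full history, with the old fact rows still pointing at the old surrogate key.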
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Jeff Smith wrote: "The problem with 3NF is that it cannot handle type 2 or type 3 slowly changing dimensions"
Eh? Are you saying 3NF models can't capture history?
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Argument "Real Time"
Hello,
One of the arguments for using a 3NF DWH model is that the dimensional model is too slow for "real time". Most of the common database systems now support partitioning of tables; even OLAP database models support partitioning. Could this not be a solution for making a dimensional DWH model easier to use in real time? Isn't real time about the latest state of information, i.e. information no older than one day, so that closed dimensions and older changed dimensions do not really need to be visible in real time?
So why not partition your data based on dates and build a two-phase ETL process for your dimensional model? First, the regular process for processing all partitions, including the older dimension records. Second, a process to update only the current partitions holding the latest information of the day.
I have to say I've never done it, but I just want to dump an idea and hear other opinions. So what do you think of the idea, and what are the arguments not to do this?
Don't come with the money argument, because when they need "real time" and they want to be able to report in an understandable way, dimensional modeling is still the best and worth the money! I know it may be very complex, but not impossible... or is it?
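Roughly the two-phase idea in sketch form - untested, with a made-up partition layout and loader helper:

```python
# Sketch: two-phase load against a fact table partitioned by date.
# Phase 1 (nightly) reprocesses every partition, including older
# dimension history; phase 2 (intraday) touches only today's partition.
from datetime import date

def nightly_load(rows, load_partition):
    """Phase 1: full dimensional processing across every date partition."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row["txn_date"], []).append(row)
    for partition_date, partition_rows in by_date.items():
        load_partition(partition_date, partition_rows)

def intraday_refresh(rows, load_partition):
    """Phase 2: replace only the current day's partition with fresh data."""
    today = date.today().isoformat()
    load_partition(today, [r for r in rows if r["txn_date"] == today])
```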
Best Regards,
Arjan Fraaij
What exactly is 3NF versus Star?
Hi all
I am a bit puzzled by this post because my understanding is that in a Star schema the fact table is "always" fully normalised and optimised. The source for the fact table is an OLTP transaction table and in my experience it follows its structure closely; this source 3NF table is often less efficient than the target 3NF fact table because of the OLTP operational attributes it has to carry for transaction processing. So this Star schema fact table, being 3NF, should be extremely efficient to load in real-time, and I really do not follow most of the issues around loading fact tables in real-time - and the fact table is where the high-volume performance attack happens.
My basic understanding is that the denormalisation happens in the dimension tables in order to: simplify the analytical context structures, further optimise the 3NF fact table (by removing redundant foreign keys), and optimise queries by reducing the number of joins at query time. The basic inserting and updating of dimensional structures should be a simple and efficient matter of triggers, since the volumes will be minute compared to the fact table. BUT the 'real-time' maintenance of these could still be a challenge, to my mind in only two potentially problematic areas: 1) surrogate key maintenance and availability for the 3NF fact table to look up, and 2) real-time maintenance of type 2 slowly changing dimensions in complex dimensions.
In this case for real-time reporting I tend to just use the system keys instead of surrogate keys, since the reasons for the existence of surrogate keys MAY be reduced for real-time reporting. (And I am sure there are clever people who can work out an efficient SK generation mechanism if, e.g., there are two source OLTP systems with clashing keys that both need to be loaded in real-time into a single reporting DW/ODS.) For the fact table key lookup there may have to be some sort of periodic 'housekeeping baby-batch': all transaction records for which the SK does not yet exist are put into a 'to-be-processed-later' cache table, and every few minutes these are retried until the dimension table triggers have picked up the new dimension members - or something smarter.
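That baby-batch could be as simple as this sketch (the lookup table and loader are hypothetical helpers):

```python
# Sketch: facts whose dimension SK is not yet available wait in a cache
# and are retried every few minutes as the dimension triggers catch up.
def process_transaction(txn, sk_lookup, load_fact, retry_cache):
    sk = sk_lookup.get(txn["customer_id"])
    if sk is None:
        retry_cache.append(txn)  # dimension member not loaded yet
    else:
        load_fact({**txn, "customer_sk": sk})

def retry_pending(sk_lookup, load_fact, retry_cache):
    """Run every few minutes: retry cached facts as dimensions catch up."""
    still_waiting = []
    for txn in retry_cache:
        sk = sk_lookup.get(txn["customer_id"])
        if sk is None:
            still_waiting.append(txn)
        else:
            load_fact({**txn, "customer_sk": sk})
    retry_cache[:] = still_waiting
```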
As for SCD type 2, I do not really use these much personally, since they cause all sorts of other complications: large dimension sizes, complicated reconciliation and recovery, and they complicate end user queries somewhat. So I try to convince the users that all changes 'inherit the sins of their forefathers' and hence become type 1 (or 3) - and when I really do have to track changes - which of course does happen, but not without a fight - I use SCD Type 4 (my name), a change-tracking factless fact table removed from the dimension table it is tracking. For real-time loading I think it would be possible to convince the users that tracking changes does not make much economic sense or may not be economically feasible in real-time.
So in summary, where it matters for real-time (in the fact tables) the Star schema IS already in 3NF. And for dimensions I suggest keeping the source system OLTP primary keys instead of surrogate keys, using triggers to update the denormalised dimensions, keeping these dimensions as 'lean and mean', i.e. as simple, as possible, and not allowing SCD type 2. For the EDW there can be another version of these dimension tables with the full, rich complement of attributes and SCD x's.
Regards, Paul Hough
PHough- Posts : 3
Join date : 2009-02-04
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
BoxesAndLines wrote:Jeff Smith wrote: "The problem with 3NF is that it cannot handle type 2 or type 3 slowly changing dimensions"
Eh? Are you saying 3NF models can't capture history?
I think it is not that 3NF cannot capture historical data. It can, but the nature of the design (normalization of dimension/master tables) results in a lot of complexity when making updates for SCD type 2 or 3. In the case of the Star Schema, however, it is truly an elegant design for handling SCD type 2 and 3 data.
mark.tan- Posts : 14
Join date : 2009-02-04
Nice post, Andy....
Andy Painter wrote:I've been dimensional modelling for well over 10 years, and strongly held the opinion that only a dimensional model would work for an EDW. However, I'm now beginning to think that there are certain requirements that lead to a 3NF design for an 'Enterprise Data Warehouse'. I'd like to share these and understand what other requirements lead to a 3NF design. I'm not looking for a religious debate over one or the other, just the facts of the matter. These are the requirements that I have seen that I think need a 3NF design:
1. Use of Teradata for the core data warehouse. Why? - It's just designed better to work with that type of model.
2. Real-time data warehousing. Why? - A 3NF data model should allow a quicker update of data.
I would also add that even with a 3NF design, I'd always look to create dimensional marts for specific applications.
Sounds like the classic "hub and spoke" model that Inmon advocates:
1. A 3NF DW to "integrate" data from source systems
2. Data Marts downstream from the 3NF DW
Best case scenario is that the dimensions are conformed and the subject areas are discrete and cleanly delineated.
Personally, I think this approach is a nightmare.
The 3NF DW becomes a single point of failure that impacts all the downstream data marts.
The 3NF DW is very hard (ie, time consuming) to design. Then it takes a long time to dev, test and get signed off. Finally, it tends towards a "big bang" approach. It is hard to "phase" 3NF work, IMO. Subject areas added at different times produce asymmetry in the data....
steve_waters- Posts : 4
Join date : 2009-04-23
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.
I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.
I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
The Case against Teradata
I think the argument that is being made is for not using Teradata, as opposed to not implementing a 3NF DW. While hardware/software should be considered when designing the DW, IMHO the hardware/software shouldn't dictate the design.
The only argument that I know of for using Teradata is based on the size of the DW. But a large DW needs to be designed to be simple to use. I admit I'm biased - I like the star schema design. It's easy to use. It's easy to set up a reporting front end. Cubes are easy to build. Sorry if it makes it more difficult on the modelers and hardware guys - but the intent of the DW design is to make it easier on the users.
Jeff Smith- Posts : 471
Join date : 2009-02-03
I challenge this notion....
BoxesAndLines wrote:I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.
I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.
I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.
When a conformed dimension is built, the data gets cleansed and integrated. Then other fact tables in various data marts can connect to it.
Simple staging tables that hold extracts from the source systems are all that is needed. The process to insert/update rows into the conformed dimension can do the cleansing/integrating. This approach is just as efficient and more effective. (More effective because the Dim will have a surrogate key for history tracking, and there will be a well-defined method for "connecting" to the Dimension.)
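Something like this is all I mean - a rough sketch, with made-up source fields and a hypothetical upsert helper:

```python
# Sketch: cleanse and integrate two source extracts while loading the
# conformed dimension, so the cleansing happens exactly once.
def load_conformed_customer(crm_rows, billing_rows, upsert_dimension):
    for row in crm_rows:
        upsert_dimension({
            "source_key": f"CRM:{row['cust_no']}",
            "name": row["name"].strip().title(),   # standardize here, once
            "country": row["country"].upper(),
        })
    for row in billing_rows:
        upsert_dimension({
            "source_key": f"BILL:{row['account_id']}",
            "name": row["holder_name"].strip().title(),
            "country": row["iso_country"].upper(),
        })
```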
steve_waters- Posts : 4
Join date : 2009-04-23
It's not the difference between a Chevy and a Ford
BoxesAndLines wrote:I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.
I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.
I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.
Sorry, but I think the argument is not the same as a Chevy and a Ford. That analogy is appropriate for discussing DB2 vs Oracle, but 3NF vs Star Schema is much, much more at the core. A more appropriate analogy might be taking the highway vs. city roads. The traveling speed on the best designed highway is always going to be faster than the best designed city street.
Building a 3NF DB in the core with star schemas in marts is a waste of space. Sure, the source data can be normalized in the staging area, but there is no need to keep history in 3NF. Process it through the staging area, load it into the fact tables, and then delete it from the staging area. By keeping all of the history in a 3NF database, you're just increasing the amount of space required to store your data. You still have to deal with issues around the reuse of codes by the source system, and now you run the risk of having data in the 3NF that does not match the data in your marts - which, in my opinion, is the worst possible scenario. A primary goal of a DW should be to make it the single source of truth. Mixing a 3NF with marts will at some point violate this goal.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
So all the Teradata data warehouses in the world are a waste of space since they are built in a normalized fashion? I've seen successful implementations of both warehouse models at different clients. Since both models can be successful, I avoid statements that say one way is vastly superior to another. Each approach has its pluses and minuses.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Jeff Smith wrote:BoxesAndLines wrote:I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.
I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.
I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.
Sorry, but I think the argument is not the same as a Chevy and a Ford. That analogy is appropriate for discussing DB2 vs Oracle, but 3NF vs Star Schema is much, much more at the core. A more appropriate analogy might be taking the highway vs. city roads. The traveling speed on the best designed highway is always going to be faster than the best designed city street.
Building a 3NF DB in the core with star schemas in marts is a waste of space. Sure, the source data can be normalized in the staging area, but there is no need to keep history in 3NF. Process it through the staging area, load it into the fact tables, and then delete it from the staging area. By keeping all of the history in a 3NF database, you're just increasing the amount of space required to store your data. You still have to deal with issues around the reuse of codes by the source system, and now you run the risk of having data in the 3NF that does not match the data in your marts - which, in my opinion, is the worst possible scenario. A primary goal of a DW should be to make it the single source of truth. Mixing a 3NF with marts will at some point violate this goal.
boom.
Great points, Jeff...I agree!
steve_waters- Posts : 4
Join date : 2009-04-23
Duplication of data is a waste of space
BoxesAndLines wrote:So all the Teradata data warehouses in the world are a waste of space since they are built in a normalized fashion? I've seen successful implementations of both warehouse models at different clients. Since both models can be successful, I avoid statements that say one way is vastly superior to another. Each approach has its pluses and minuses.
Just because a model can be successful doesn't make it the preferred approach. Are there badly designed dimensional models? Of course. Are there well designed 3NF models? Of course. But a well designed dimensional model blows away an equally well designed 3NF. I have never encountered a situation in the DW that cannot be handled with a dimensional model. In fact, I think there are many situations that are handled much more easily in the dimensional model than in a 3NF.
There are some downsides to dimensional models - they increase the difficulty level for developers, and they can be hard to conceptualize for people who don't have a lot of experience with them.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Jeff...
"A more appropriate analogy might be taking the highway vs. city roads. The traveling speed on the best designed highway is always going to be faster than the best designed city street."
I take it you have never been in Los Angeles.
Back to topic... every time this question comes up there are always those who say 'you can't do __(fill in your favorite)___ in a dimensional/3NF (pick one) data warehouse'. Fact of the matter is, none of it is true... except for one thing: END USERS DO NOT QUERY 3NF DATA WAREHOUSES. Well, actually they can, but it is not considered best practice by proponents of the Inmon architecture.
So, the question becomes... do you want to build a data warehouse AND THEN build a publication layer to produce dimensional datamarts so people can actually use the thing? Or do you want to cut out the middleman and just go with a dimensional model to begin with?
You can be equally successful with either architecture eventually... it all becomes a matter of time and money. If you've got a whole lot of both and really don't care if or when anything will be delivered... go the 3NF route. Better yet, if 'black hole' solutions are preferred, implement SAP/BW.
"A more appropriate analogy might be taking the highway vs. city roads. The traveling speed on the best designed highway is always going to be faster than the best designed city street."
I take it you have never been in Los Angeles.
Back to topic... every time this quesion comes up there are always those that say 'you can't do __(fill in your favorite)___ in a dimensional/3NF (pick one) data warehouse'. Fact of the matter is, none of it is true... except for one thing: END USERS DO NOT QUERY 3NF DATAWAREHOUSES. Well, actually they can, but it is not considered best practice by proponents of the Inmon architecture.
So, the question becomes... do you want to build a data warehouse AND THEN build a publication layer to produce dimensional datamarts so people can actually use the thing? Or do you want to cut out the middleman and just go with a dimensional model to begin with?
You can be equally successful with either architecture eventually... it all becomes a matter of time and money. If you got a whole lot of both and really don't care if or when anything will be delivered... go the 3NF route. Better yet, if 'black hole' solutions are preferred, implement SAP/BW.
Nicely Summed Up By ngalemmo
I am relatively new to DW and was recently tasked to "clean up" the DW. My background is mostly with OLTP. We have a (permanent) staging area which is made up mostly of ODS snapshots and flat-file tables. The ODS snapshots are invariably in 3NF (though the relationships are not enforced in the warehouse) and the flat files tend to be denormalised. Most of the ETL process involves aggregating the data into "dimensions" and "fact" tables for reporting. However, these are not well constructed, especially with regard to SCD, and it is my job to untangle the mess. The star schema (especially the snowflake) is to me a normalised form shorn of association tables, making it "close to 3NF" but not quite 2NF either. So that is the route I have taken. The one thing we also have to decide is whether to empty the staging tables once we have aggregated the data. That will free up a lot of space.
All in all ngalemmo summed the discussion up nicely. Thanks.
bowie- Posts : 1
Join date : 2009-06-08
you need both
While I advocate starting (and staying) with data marts whenever possible, in the real world you almost always have to structure data relationally AND dimensionally for large-scale EDWs. I'll call the relational part staging or an ODS depending on its purpose. Remember you can do this in 2NF for both relational and dimensional structures.
I also agree that the Inmon-esque CIF is a relic. Why the heck would you want 3NF for a DW? Why not just make a copy of the source and get on with it?
I usually reserve 3NF for staging areas.
rjp73- Posts : 4
Join date : 2009-07-24
Consider Sparsity and Density of Dimensions
I think one more thing should be highlighted: although the star schema is effective for retrieval, most of the dimensional columns are null, and hence can create storage issues. Further, when you perform any type of SCD it is a pretty expensive scenario, so I think it is better to snowflake a star schema to some extent.
Mohsin- Posts : 4
Join date : 2009-03-03
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Mohsin wrote:I think one more thing should be highlighted: although the star schema is effective for retrieval, most of the dimensional columns are null, and hence can create storage issues. Further, when you perform any type of SCD it is a pretty expensive scenario, so I think it is better to snowflake a star schema to some extent.
I've never seen a situation where "most of the dimensional columns are null". Could you elaborate?
3NF or Dimensional? Factors to evaluate
Evaluate the scope
First of all, you need to evaluate whether you are building a data warehouse or just a data mart.
Next, evaluate: are you building an "enterprise" data warehouse, or is your corporation too big to have ONE data warehouse?
Solution Architecture - Factors to consider for DW architecture:
1) How many sources will you be integrating the data from? Are there any unstructured data sources?
2) Data consumption patterns: just BI users, or are there power users? Extracts? Downstream applications? Also, the grain of the data at the consumption layer: are most of the views summarized/aggregated?
3) Volumetric analysis: how much data will you be loading every day?
4) Incremental/CDC strategy
5) Logical/physical data integration architecture, including the extraction, transformation and loading mechanism
6) Subject areas based on your company's business: what are you modeling for? Business entities, business processes
7) SLAs: are users looking for T+1, T+2 or near-real-time data in their reports/BI tool?
8) The solution will include a "golden copy" model (or data warehouse) where everything gets loaded and which is not really requirement-specific
9) Requirement-specific "marts" which will be used for your reporting purposes
No argument on #9; those have got to be dimensional models with facts and dims, as they are easier and better suited for that purpose.
For #8, you can either go dimensional again or with a 3NF core model. One thing to note is that this is not duplicating the data: you "have to" have a golden copy which holds the lowest grain of data and has no business-facing requirements; if you disagree here, go back to my first point - are you just building one data mart? All filtering, summarization and aggregation occurs after the golden copy has been loaded. If you go with a dimensional model there, you have to decide: how fast can you update the daily data? Would it meet your SLAs? Would it survive with years and years of history? Would it scale if you need to grow the size of your facts and dimensions and add new requirements?
Call it an ODS (a term I personally hate, because an ODS serves a specific purpose and is volatile in its basic form, but people perceive it differently) or a data warehouse, but your solution needs to include those fundamental components.
To people who argue "why should I duplicate data": it's not about duplication but about the fundamental differences between your data warehouse and a data mart; that's why my opening statement was to evaluate whether you are just building a mart or a warehouse. Your data mart may just need aggregated/filtered (most current) data, or may not need all the dimensions you have in the golden copy. So again, approach the fundamental design in the right way and with the right information, and you will be able to make the right decision.
Hope this helps,
sanjay.
sanjayvyas- Posts : 3
Join date : 2010-04-13
Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion
Hi all;
This thread might be a few months old but it rings true for me right now. I'm struggling to implement several dimensional models on a Teradata platform, running up against ETL folks and DBAs who tell me that Teradata just doesn't handle dimensional models. Dimensional modeling is relatively new to this shop, and quite frankly I'm tired of hearing how relational tables or even flat tables will perform much better than star schemas.
I've implemented plenty of dimensional models on DB2 and Oracle, but Teradata is a different animal altogether - PIs vs PKs, etc. Should the db really drive the design?
In one Teradata shop, their version of dimensional modeling was to implement huge multi-join views (i.e, 20+ tables) which overlapped with other views, and just as I left that assignment, they ran their first test and it failed miserably - because, you guessed it, too many joins.
The current project has reports/queries calling for a dump of a whole subject area (multiple marts) instead of staying within a specific business process. I've resorted to creating a dumbed-down set of flattened tables that real-time reports can access rather than drilling across 4, 5, 6 or more fact tables. The requirements these are based on are vague, and that's another issue altogether, but still, shouldn't the star schema model hold up on Teradata as well as on other dbs? Am I mad?
seadog2010- Posts : 23
Join date : 2010-03-04
Implementing a dimensional data model in Teradata
Whoever says you cannot implement a dimensional data model on Teradata is wrong, because "they" have never implemented a dimensional data model on Teradata. First off, you are absolutely right: a db should not "drive" your design, because that makes you db-dependent. Yes, of course, you have to optimize your model for the target database, but that's about it.
A dimensional data model on Teradata is discouraged for two main reasons: distribution of data, and colocation of your fact/dimension data.
Distribution of your data: you distribute your data over several units within the system in accordance with its Massively Parallel Processing architecture (other databases, e.g. Netezza, use a similar architecture). If you do not distribute your data on a "good" distribution key, you will get a lot of skew, which is very likely in a dimensional data model because data is duplicated owing to its denormalized nature.
Colocation: if you pick a unique distribution key for a fact table, the system will redistribute the data for every query, as there is no colocation with the dimensions.
Few questions for you to answer:
1) Are you creating just a data mart or a data warehouse?
2) Who would consume the data? BI tool and/or power users?
3) If a BI tool, which one? (Most BI tools are dimensional-model friendly, so eventually you will have to create a dimensional model anyway.)
Also check my previous response (the one before yours) and see if any of that makes sense.
The key to your implementation will be choosing good distribution candidates for your fact tables; mostly it will be the key of the largest dimension attached to the fact, which also gives you better performance.
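One quick way to sanity-check candidate distribution keys on a sample of fact rows before committing - a rough sketch, with a made-up unit count and column names:

```python
# Sketch: hash each row's candidate key to a pseudo parallel unit and
# measure skew. A skew factor of 1.0 means a perfectly even spread.
from collections import Counter

N_UNITS = 64  # hypothetical number of parallel units

def skew_factor(rows, key_col):
    buckets = Counter(hash(row[key_col]) % N_UNITS for row in rows)
    average = len(rows) / N_UNITS
    return max(buckets.values()) / average

# e.g. compare skew_factor(sample, "order_id") against
#      skew_factor(sample, "customer_id") and pick the lower one.
```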
hope this helps,
sanjay.
sanjayvyas- Posts : 3
Join date : 2010-04-13