Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Page 1 of 3 1, 2, 3  Next

View previous topic View next topic Go down

Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  Andy Painter on Tue Feb 03, 2009 6:40 am

I've been dimensional modelling for well over 10 years, and strongly held the opinon that only a dimensional model would work for an EDW, however i'm now begining to think that there are certain requirements that lead to a 3NF design for an 'Enterprise Data Warehouse'. I'd like to share these and understand what other requirements lead to a 3NF design. I'm not look for a relgious debate over one or the other just the facts of the matter. These are the requirements that i have seen that i think need a 3NF design:

1. Use of Teradata for the core data warehouse. Why? - It's just designed better to work with that type of model.
2. Real-time data warehousing. Why? - A 3NF data model should allow a quicker update of data.

I would also add that even with a 3NF design, i'd alwasy look to create dimensional marts for specific applications.

Andy Painter

Posts: 1
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  Chris Cammers on Tue Feb 03, 2009 7:26 am

I will be the first to admit 3NF sure is easier to implement if you are looking for real-time data but the problem lies in the use of the resulting EDW. If the whole intention of the data warehouse is to supply data to analytical applications then you lose the two major advantages of the Dimensional Model
  1. The first is ease of use, the dimensional model will be much easier for your users to understand and easier for your Query and Analysis tool to consume.
  2. Query Performance, 3NF databases suffer from performance problems when aggregating or retrieving many dimensional values that analysis may require. If you are only going to do operational reports then you may be able to get by with 3NF because your operational user will be looking for very fine grain data.


The major problem with dimensional models and real-time data is the transformation. I worked on a project where we attempted to populate a dimensional model in real-time but we fell flat because we could not find a mechanism to transform the data in real time.

We were using a tool that provides a stream for each table's change log but we could not synchronize the transactions to ensure we built a valid fact row. I think if we had been able to change the application schema we might have been able to overcome this obstacle so we'll get it right on the next project.

Chris Cammers

Posts: 1
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  dharidas on Tue Feb 03, 2009 9:15 am

I agree with Chris on this. While there are general guidelines and best practices around building data warehouses, the architect will have to take into account the complexity, volume and organization of data in their respective enterprise while designing data warehouses.

In my view, 3NF is a set a rules that focus on maintaining data integrity when you have a lot of transactional updates to your database. As such it is tuned for easy retrieval and updates for detail data.Performance of aggregate queries will most often suffer, especially when you have complex reports spanning multiple attributes and fact tables.

Dimensional modeling is a more of a practice, a disciplined approach, if you will, of creating an infrastructure for most efficient aggregate query performance with drill down/roll up capabilities. The data model should support easy navigability (business users seldom want to navigate complex join paths involving n levels of normalized tables) and demand query performance that is more or less agnostic of data volume.

While the influx of powerful hardware has vastly improved aggregate performance of 3NF data models, it generally lacks the discipline, ad-hoc query performance and ease of use associated dimensional models.

Again, EDW is a significant investment and hence I have always tried to highlight, with varying levels of success, about the scalability and benefits of dimensional model. And Yes!, I have been beaten and bruised, often by very senior data architects, who wants to reuse existing 3NF transaction processing databases for analytics. While I have proudly maintained, scaled and enhanced my 10 year old dimensional data model with ease, I have also quietly mourned at the demise of some of the departmental 3NF versions that required multiple design revisions to accommodate requests for additional reports.

dharidas

Posts: 5
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  tomstagg on Tue Feb 03, 2009 1:17 pm

Advantages of 3NF over Dimensional model I see are:
- reduced ETL effort: makes the data warehouse more agile and meets changing business needs much quicker - I regularly see that the business changes there mind on what they want to capture and sometimes it is impossible to keep up - The warehouse can become semi-redundant
- Changes to source data systems can quickly be fed through to the EDW and on to users
- Real time for Dimension model is complex (e.g. inferred dimensions,etc) means EDW is even slower.

Disadvantages of 3NF over Dimensional model I see are:
- Complex joins - How can you allow the business to perform ad-hoc queries? I can't see how you can populate an OLAP cube from a 3NF data model.
- Slow performance - multi-joins means query time can be long - no users want to wait for data to load. Although we be mitigated with powerful hardware
- Where does the calcuation logic go?! ETL performs complex calculations to provide insight into business.
- High over-head to create reports for users - If a user wants a specific report there is a significant effort to understand and create the report from the 3NF model.

Sounds like you're talking about a form of operational Data Store (ODS) Andy.

tomstagg

Posts: 1
Join date: 2009-02-03

View user profile

Back to top Go down

3NF cannot handle certain SCD

Post  Jeff Smith on Tue Feb 03, 2009 2:51 pm

The problem with 3NF is that it cannot handle type 2 or type 3 slowing changing dimensions. For Type 2, you have to reload all of the dimensional data during every load with a date stamp. And if rollups change or relationships change, you're hosed. It's harder to build/design the dimensional database, but once it's built, it's a thing of beauty

Jeff Smith

Posts: 468
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  BoxesAndLines on Tue Feb 03, 2009 3:17 pm

Jeff Smith wrote: "The problem with 3NF is that it cannot handle type 2 or type 3 slowing changing dimensions"

Eh? Are you saying 3NF models can't capture history?

BoxesAndLines

Posts: 1104
Join date: 2009-02-03
Location: USA

View user profile

Back to top Go down

Argument "Real Time"

Post  ArjanF on Tue Feb 03, 2009 11:10 pm

Hello,

One of the arguments for using a 3NF DWH model is that the Dimensional model is to slow for "Real Time". Most of the common Database Systems now support partitioning of tables, even OLAP database models support partitioning. Could this not be a solution for making a Dimensional DWH model easier to use in real time? Isn't real time about the latest state of information, so information not older then 1 day, closed dimensions, older changed dimensions are not really needed to be visible on real time?

So why not partition your data based on dates and build a 2 based ETL process for your dimensional model? First the regular process for processing all partitions also the older dimension records. Second build a process to update only current partitions holding the latest information of the day.

I have to say, I've never done it but I just want to dump an idea and listen to other opinion. So what do you think of the idea and what are the arguments not to do this?

Don't come with the money argument, because when they need "real time" and they want to be able to report in a understandable way, dimensional modeling is still the best and word the money! I know that it should be very complex but not impossible or is it?

Best Regards,
Arjan Fraaij

ArjanF

Posts: 7
Join date: 2009-02-03
Age: 40
Location: Netherlands

View user profile http://www.arjanfraaij.com

Back to top Go down

What exactly is 3NF versus Star?

Post  PHough on Wed Feb 04, 2009 12:06 am

Hi all

I am a bit puzzled by this post beacuse my understanding is that in a Star schema the fact table is "always" fully normalised and optimised. The source for the fact table is an OLTP transaction table and in my experience follws its structure closely, and this source 3NF table is often less efficient that the target 3NF fact table because of OLTP operational attributes it has to carry for transaction processing. So this Start schema fact table being 3NF should be extremely efficient to load in real-time. So I really do not follow most of the issues around loading fact tables in real-time. This is the high-volume, performance attacking point.

My basic understanding is that the denormalisation happens in the dimension tables in order to: simplify the analytical context structures, further optimise the 3NF fact table (by removing redundant foreign keys), and optimise queries by reducing the number of joins at query time. The basic Inserting and Updating of dimensional structures should be a simple and efficient matter of triggers since the volumes will be minute compared to the fact table. BUT the 'real-time' maintenance of these could still be a challenge, to my mind only in two potentially problematic areas: 1) surrogate key maintenance and availability for the 3NF fact table to look up, and 2) real-time maintenance of type 2 slowly changing dimensions in complex dimensions.

In this case for real-time reporting I tend to just use the system keys instead of surrogate keys since the reasons for the existence of surrogate keys MAY be reduced for real-time reporting. (And I am sure that there are clever people that will be able to work out an efficient SK generation mechanism if, e.g., there are two source OLTP systems with clashing keys that both need to be loaded in real-time into a single reporting DW/ODS). For the fact table key lookup there may have to be some sort of periodic 'housekeeping baby-batch' run thus: for all transaction records for which the SK does not yet exist put into a 'to-be-processed-later' cache table and every few minutes retry loading these until the dimension table triggers have picked up the new dimension members - or something smarter.

As for SCD type 2, I do not really use these much personally since they cause all sorts of other complications for large dimensions' sizes, complicated reconciliation and recovery, and they complicate end user queries somewhat. So I try to convince the users that all changes 'inherit the sins of their forefathers' hence become type 1 (or 3) - and when I really do have to track changes - which of course does happen, but not without a fight - I use SCD Type 4 (my name) which is a change tracking factless fact table removed from the dimension table it is tracking. For real-time loading I think it would be possible to convince the users that in real-time, tracking changes does not make much economic sense or may not be economically feasible in real-time.

So in summary, where it matters for real-time (in the fact tables) the Star schema IS already in 3NF. And for dimensions I suggest keeping the source system OLTP primary keys instead of surrogate keys, use triggers to update the denormalised dimesions, keep these dimensions as 'lean and mean', i.e. as simple, as possible, and do not allow SCD type 2. For the EDW there can be another version of these dimension tables with the full, rich complement of attributes and SCD x's.

Regards, Paul Hough

PHough

Posts: 3
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  mark.tan on Thu Apr 23, 2009 2:14 am

BoxesAndLines wrote:Jeff Smith wrote: "The problem with 3NF is that it cannot handle type 2 or type 3 slowing changing dimensions"

Eh? Are you saying 3NF models can't capture history?

I think it is not that 3NF cannot capture historical data. They can, but the nature of the design (normalization of dimension/master tables) result in a lot of complexity when making updates for SCD type 2 or 3. However, in the case of Star Schema, it is truely a elegant design to handle SCD 2 and 3 type of data.

mark.tan

Posts: 14
Join date: 2009-02-03

View user profile

Back to top Go down

Nice post, Andy....

Post  steve_waters on Thu Apr 23, 2009 11:36 am

Andy Painter wrote:I've been dimensional modelling for well over 10 years, and strongly held the opinon that only a dimensional model would work for an EDW, however i'm now begining to think that there are certain requirements that lead to a 3NF design for an 'Enterprise Data Warehouse'. I'd like to share these and understand what other requirements lead to a 3NF design. I'm not look for a relgious debate over one or the other just the facts of the matter. These are the requirements that i have seen that i think need a 3NF design:

1. Use of Teradata for the core data warehouse. Why? - It's just designed better to work with that type of model.
2. Real-time data warehousing. Why? - A 3NF data model should allow a quicker update of data.

I would also add that even with a 3NF design, i'd alwasy look to create dimensional marts for specific applications.

Sounds like the classic "hub and spoke" model that Inmon advocates:

1. A 3NF DW to "integrate" data from source systems
2. Data Marts downstream from the 3NF DW

Best case scenario is that the dimensions are conformed and the subject areas are discrete and cleanly delineated.

Personally, I think this approach is a nightmare.

The 3NF DW becomes a single point of failure that impacts all the downstream data marts.

The 3NF DW is very hard (ie, time consuming) to design. Then it takes a long time to dev, test and get signed off. Finally, it tends towards a "big bang" approach. It is hard to "phase" 3NF work, IMO. Subject areas added at different times produce asymetry in the data....

steve_waters

Posts: 4
Join date: 2009-04-23

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  BoxesAndLines on Thu Apr 23, 2009 1:36 pm

I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.

I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.

I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.

BoxesAndLines

Posts: 1104
Join date: 2009-02-03
Location: USA

View user profile

Back to top Go down

The Case against teradata

Post  Jeff Smith on Mon Apr 27, 2009 11:03 am

I think the argument that is being made is for not using teradata as opposed to implementing a 3NF DW. While Hardware/softare should be considered when designing the DW, IMHO the hardware/software shouldn't dictate the design.

The only argument that I know of for using Teradata is based on the size of the DW. But a large DW needs to be designed to be simple to use. I admit I'm biased - I like the star schema design. It's easy to use. It's easy to set up a reporting font end. Cubes are easy to build. Sorry if it makes it more difficult on the modelers and hardware guys - but the intent of the DW design is to make it easier on the users.

Jeff Smith

Posts: 468
Join date: 2009-02-03

View user profile

Back to top Go down

I challenge this notion....

Post  steve_waters on Mon Apr 27, 2009 11:44 am

BoxesAndLines wrote:I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.

I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.

I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.

When a conformed dimension is built, the data gets cleansed and integrated. Then other fact tables in various data marts can connect to it.

Simple staging tables that hold extracts from the source systems are all that is needed. The process to insert/update rows into the conformed dimension can do the cleansing/integrating. This approach is just as efficient and more effective. (More effective because the Dim will have a surrogate key for history tracking, and there will be a well-defined method for "connecting" to the Dimension.)

steve_waters

Posts: 4
Join date: 2009-04-23

View user profile

Back to top Go down

It's not the difference between a Chevy and ford

Post  Jeff Smith on Tue Apr 28, 2009 7:51 am

BoxesAndLines wrote:I agree on the single point of failure problem. On the other hand, a lot of value is derived by cleansing and integrating the data once before extracting to data marts. I disagree on the big bang only concept though. Every Teradata shop in town is built on a normalized data structure. You can build these data structures incrementally just as easy as you can build or enhance a star. The normalized data structure is also more resilient to change since data (should) only appear in one table.

I'm not proposing one methodology over the other. But I see this more as a Chevy truck vs. Ford truck argument than one approach is hands down better than the other.

I apologize to all the folks not from Texas who don't sit around the camp fire arguing over the merits of Ford and Chevy trucks.

Sorry, but I think the argument is not the same as a Chevy and a Ford. That analogy is appropriate for discussing DB2 vs Oracle, but 3NF vs Star Schema is much, much more at the core. A more appropriate analogy might be taking the highway vs. city roads. The traveling speed on the best designed highway is always going to be faster than the best designed city street.

Building a 3NF DB in the core with star schemas in marts is a waste of space. Sure, the source data can be normalized in the staging area, but there is no need to keep history in 3NF. Process it through the staging area, load it into the fact tables, and then delete it from the staging area. By keeping all of the history in a 3NF database, your just increasing the amount of space required to store your data. You still have to deal with issues around the reuse of codes by the source system, and now you run the risk of having data in the 3NF that does not match the data in your marts - which, in my opinion, is the worst possible scenario. A primary goal of a DW should be to make it the single source of truth. Mixing a 3NF with marts will at some point violate this goal.

Jeff Smith

Posts: 468
Join date: 2009-02-03

View user profile

Back to top Go down

Re: Reasons to use a 3NF design over a Dimensional Model design for an EDW - Discussion

Post  BoxesAndLines on Tue Apr 28, 2009 9:32 am

So all the Teradata datawarehouses in the world are a waste of space since they are built in a normalized fashion? I've seen successful implementation of both warehouse models at different clients. Since both models can be successful, I avoid statements that say one way is vastly superior to another. Each approach has their pluses and minuses.

BoxesAndLines

Posts: 1104
Join date: 2009-02-03
Location: USA

View user profile

Back to top Go down

Page 1 of 3 1, 2, 3  Next

View previous topic View next topic Back to top


Permissions in this forum:
You cannot reply to topics in this forum