Advantage of this approach??
Advantage of this approach??
Hi,
I have seen dimensions where StartDate and EndDate are used to maintain history and natural keys are used instead of surrogate keys. I just want to understand the advantage of this approach, other than that the ETL processes will be quite fast compared to maintaining history with surrogate keys. Beyond that I don't see any advantage to it.
Reporting out of this design is very difficult because performance is too slow. I experienced this myself in a previous project: it had 8 dimensions, and if a report used all of them I had to put date filters on all 8 dimensions.
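To illustrate the reporting pain, here is a minimal sketch (hypothetical table and column names) of the join pattern when the fact carries natural keys and every dimension must be filtered to the row version in effect on the transaction date:
-- Each of the eight dimensions needs its own date-range predicate,
-- because the fact row alone does not identify the correct row version.
SELECT c.region, p.product_group, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   customer_dim c
       ON  c.customer_nk = f.customer_nk
       AND f.transaction_date BETWEEN c.start_date AND c.end_date
JOIN   product_dim p
       ON  p.product_nk = f.product_nk
       AND f.transaction_date BETWEEN p.start_date AND p.end_date
-- ...and the same BETWEEN predicate repeated for the remaining six dimensions
GROUP BY c.region, p.product_group;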
Vishy- Posts : 75
Join date : 2012-02-28
Location : India
Re: Advantage of this approach??
Hi,
If you use natural keys as FKs in the fact table, then of course you will need to find the proper dimension attributes during reporting.
But if you use surrogate dimension keys as FKs, then the fact data is always linked to the proper dimension row version, so there is no need to find the proper dimension row while querying. It is enough to constrain the query from the Calendar (Date) dimension.
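By contrast, a hedged sketch of the same query against a surrogate-keyed design (same hypothetical schema as above):
-- Each fact row already points at the correct historical dimension version,
-- so the only filter needed is on the Date dimension.
SELECT c.region, p.product_group, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   customer_dim c ON c.customer_key = f.customer_key
JOIN   product_dim  p ON p.product_key  = f.product_key
JOIN   date_dim     d ON d.date_key     = f.date_key
WHERE  d.calendar_year = 2012
GROUP BY c.region, p.product_group;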
gvarga- Posts : 43
Join date : 2010-12-15
Re: Advantage of this approach??
Yes, I need something in the dimension table to identify which row is current.
What I would like to know is: what is the advantage of this approach? I have seen it in a few projects, and when I pointed out as a report developer that the design is not good, I neither got a proper answer nor was the design changed by the dimensional modelling architect.
I am not saying the design is good or bad; I just feel there must be some advantage to it, otherwise people would not design this way.
If we use surrogate keys we cannot join two facts directly, but if we use natural keys we will be able to join two facts. Is this an advantage, or again a disadvantage?
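For what it's worth, the usual dimensional-modelling answer to combining two facts is not a direct fact-to-fact join but a drill-across through a conformed dimension. A hedged sketch with hypothetical tables:
-- Drill-across: aggregate each fact separately to a common grain,
-- then join the two result sets on the conformed dimension attribute.
SELECT s.product_group, s.total_sales, r.total_returns
FROM  (SELECT d.product_group, SUM(f.sales_amount) AS total_sales
       FROM   sales_fact f
       JOIN   product_dim d ON d.product_key = f.product_key
       GROUP BY d.product_group) s
JOIN  (SELECT d.product_group, SUM(f.return_amount) AS total_returns
       FROM   returns_fact f
       JOIN   product_dim d ON d.product_key = f.product_key
       GROUP BY d.product_group) r
  ON r.product_group = s.product_group;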
Vishy- Posts : 75
Join date : 2012-02-28
Location : India
Re: Advantage of this approach??
Hi ,
If you use surrogate keys then you can also maintain slowly changing dimensions; if you just want SCD type 0 or type 1 behaviour, then you can use natural keys.
When you join fact tables, if the granularity is at the same level and the dimensions are conformed, then you can join them.
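As a minimal illustration of the type 2 maintenance that surrogate keys enable (hypothetical table and values; production ETL would typically do this with a MERGE or a lookup stage):
-- Expire the current version of the changed customer...
UPDATE customer_dim
SET    end_date = DATE '2012-06-30', current_flag = 'N'
WHERE  customer_nk = 'C1001' AND current_flag = 'Y';

-- ...then insert a new version with a fresh surrogate key.
INSERT INTO customer_dim
       (customer_key, customer_nk, region, start_date, end_date, current_flag)
VALUES (98765, 'C1001', 'APAC', DATE '2012-07-01', DATE '9999-12-31', 'Y');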
thanks
hkandpal- Posts : 113
Join date : 2010-08-16
Re: Advantage of this approach??
I'm not aware of any advantage to using natural keys other than making the ETL slightly quicker and the coding slightly easier.
Surrogate keys are simple, small and stable; natural keys are not. Surrogate keys enable a very specific, repeatable query pattern. Joins on natural keys are context sensitive, requiring specific knowledge of the relationship between two tables (i.e. which columns make up the key). A natural key based DW cannot easily accommodate new sources that use a different key structure. NK based designs perform poorly and take up more space than SK based designs.
In other words, a natural key based data warehouse loads slightly quicker, but it is more difficult to use, takes more space, and queries run slower. What is more important: shaving a bit of time off the load, or a better end-user experience?
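To make the "context sensitive" point concrete, here is a hedged sketch (hypothetical source with a compound natural key) contrasting the two join styles:
-- Natural-key join: the query writer must know that the key is compound
-- (source system plus account number) and must supply the as-of predicate.
SELECT a.account_type, SUM(f.amount) AS total_amount
FROM   txn_fact f
JOIN   account_dim a
       ON  a.source_system  = f.source_system
       AND a.account_number = f.account_number
       AND f.txn_date BETWEEN a.start_date AND a.end_date
GROUP BY a.account_type;

-- Surrogate-key join: one small integer column, the same pattern for every dimension.
SELECT a.account_type, SUM(f.amount) AS total_amount
FROM   txn_fact f
JOIN   account_dim a ON a.account_key = f.account_key
GROUP BY a.account_type;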
Re: Advantage of this approach??
thanks for the insight,
I don't know why, but last night I was doing some dimensional modelling in my dream.
Maybe it is because for the last few days I have really been thinking about providing DWH reports in almost real time.
I am ready to make a few compromises, but providing reports the next day is very old fashioned, and DWH detail reports might become old fashioned if the software that inserts data into the OLTP system comes with a few more ready-made report objects. The DWH still holds good for summary analysis, but for detail reports, waiting almost a full day sometimes frustrates me.
I was thinking about what I can do so that DWH reports are available the same day. I don't want people to use the ODS, as it cannot bring cubes into the picture.
For that I require the following:
1) ETL processes should be fast.
2) Data should be small for the daytime run; it may have multiple runs.
3) One part of the business area should be covered, or the business area divided into small parts that can be used independently.
4) No history whatsoever: all type 1 dimensions, just overwrite (a sketch follows below). History will be maintained by the DWH, which will get the data overnight (the next day, the conventional way).
------------------------------------------------------------------
I am still searching for ideas, but my dream is that an executive who finishes his job at 3pm gets the BI details from cube/reports/adhoc at 5pm.
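For point 4, a minimal sketch of the type 1 overwrite (hypothetical names; exact MERGE syntax varies by DBMS):
-- Type 1: overwrite changed attributes in place, keep the same key, keep no history.
MERGE INTO customer_dim AS tgt
USING staging_customer AS src
  ON tgt.customer_nk = src.customer_nk
WHEN MATCHED THEN
  UPDATE SET tgt.region = src.region, tgt.segment = src.segment
WHEN NOT MATCHED THEN
  INSERT (customer_nk, region, segment)
  VALUES (src.customer_nk, src.region, src.segment);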
Vishy- Posts : 75
Join date : 2012-02-28
Location : India
Re: Advantage of this approach??
I'd suggest you might want to take a step back and clarify the challenge you are trying to solve.
The challenge with delivering "Real Time" (or close to real-time) solutions is not so much the data model, but more the ETL processes. As ngalemmo suggested, there are numerous decisions to be made, and a number of different approaches to consider (e.g. ODS vs RealTime Partition etc).
Assignment and use of a surrogate key is a minor part of the process - the time-consuming activities are usually related to cleansing and conforming the information, performing data quality checks etc. A common compromise is having an intra-day ETL process that differs from the nightly process.
Chapter 11 of the Data Warehouse Toolkit is dedicated to real-time ETL issues.
Hope this helps.
LAndrews- Posts : 132
Join date : 2010-05-13
Location : British Columbia, Canada
Re: Advantage of this approach??
The notion of a nightly update is not 'old fashioned' nor is real-time updating necessarily better.
Real time updating introduces a lot of issues to the usage of the DW. First, and foremost, do the consumers of this information have the ability to act on it in a timely manner? Do you really want anyone to act on such information? For example, if you are a retailer sitting in HQ, what use is there to know sales for the last hour? If it is a matter of seeing that product gets shipped to the store tonight, other, simpler, automated systems are better suited for handling such a requirement... systems that have little or nothing to do with data warehouses.
Second, real time updating makes it very difficult to do analysis because the data is not stable. If someone does some analysis and would like a co-worker to confirm it, how can that be done when the data is constantly changing? If the data changes once a night, it is easy to isolate the data so that additional analysis can be performed within the same context. If data is constantly changing, and the users don't know when such changes occur, it is much more difficult to isolate the data as of a point in time. This is because it is not obvious what that point in time should be, and even if you could identify the precise point in time, it is not clear that the results would be the same if concurrent updates were taking place (a later query would see updates performed by transactions running at the time of the original analysis).
Reducing latency (approaching real time) is not a difficult problem for data warehouse loading. There are well-defined techniques for doing it, and performance/latency is a cost issue (how much horsepower do you need). 'Why do you need it?' is the question that can be difficult to answer.
Re: Advantage of this approach??
Vishy wrote: I am still searching for ideas but my dream is that an executive who finishes his job at 3pm gets the BI details from cube/reports/adhoc at 5pm.
Near real-time data warehousing is still possible under dimensional modelling (surrogate keyed). The best latency I could get is 20 minutes, with an ETL load time of 5 minutes plus a few seconds of cube reprocessing. There are a few techniques I have employed to minimise the load time and ensure the reporting results stay consistent, with minimal disruption from the frequent touches on the data.
Firstly, partition the fact table into current-day and history partitions, either physically or by using the table-partitioning feature available in your DBMS, so that you can quickly truncate and reload the data for just one day. I think this technique is discussed in Kimball's dimensional modelling book.
Secondly, apply a data-switching mechanism to avoid the confusion of inconsistent data and slow response during the hourly load, while the underlying tables are being updated. You may need two copies of the daily data, one for viewing and one for the ETL load. You then switch the data online by backup and restore, by data replication, or by changing views or synonyms on the fly, which is the most sophisticated but fastest approach. The idea is that switching the underlying data does not involve any transformation and takes only seconds or less, instead of the minutes of a typical hourly ETL load.
Data switching may be less significant for a nightly load, but it can be very effective even in daily-load data warehousing, providing a stable report-viewing environment in case the ETL load fails.
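A hedged sketch of the view-switching idea (hypothetical names; a synonym-based variant would swap a synonym instead, where the DBMS supports it):
-- Two physical copies of the current-day fact; reports read through a view.
-- While copy A is visible, the ETL truncates and reloads copy B...
TRUNCATE TABLE sales_fact_today_b;
INSERT INTO sales_fact_today_b SELECT * FROM staging_sales_today;

-- ...then the switch is a metadata-only view change, taking well under a second.
CREATE OR REPLACE VIEW sales_fact_today AS
SELECT * FROM sales_fact_today_b;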
hang- Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia
Re: Advantage of this approach??
I work in a Teradata environment where we use date ranges. The advantage is that all history is captured, whereas with SCD Type 2, you will only have history for the columns you determined were relevant at some time in the past.
I don't agree with it as a concept though, because range lookups are always slow.
Re: Advantage of this approach??
Thanks.
Then the ideal solution would be surrogate key, start date, end date and a current flag, all in the dimension (provided the dimension is not too big, which could force us to throw the hot attributes requiring SCD 2 treatment into a separate dimension).
Surrogate key -- gives all its usual advantages, i.e. it avoids the limitations of natural keys, and processing is fast because the SK is an integer.
Start date and end date -- help in getting all the history from the dimension itself, without joining to the fact.
Current flag -- helps with fast processing again, because you put this filter in the query, so only the filtered rows of the dimension are available for the join with the fact table.
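A minimal sketch of such a dimension (hypothetical names and types):
-- Surrogate key + natural key + SCD 2 housekeeping columns in one dimension.
CREATE TABLE customer_dim (
    customer_key   INTEGER      NOT NULL,  -- surrogate key
    customer_nk    VARCHAR(20)  NOT NULL,  -- natural key from the source
    region         VARCHAR(50),
    start_date     DATE         NOT NULL,  -- first day this version is valid
    end_date       DATE         NOT NULL,  -- last day (e.g. 9999-12-31 for current)
    current_flag   CHAR(1)      NOT NULL,  -- 'Y' for the current version
    PRIMARY KEY (customer_key)
);
Current-state reports then just add WHERE current_flag = 'Y', while point-in-time history can still be read straight from the dimension via the start/end dates.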
Vishy- Posts : 75
Join date : 2012-02-28
Location : India