Building Star Schema in real time?
Building Star Schema in real time?
I am trying to understand the details of how to build star schemas in real time, or how to convert a traditional star into a real-time star.
In the traditional DW (star schema) Kimball approach, on a nightly basis we read data from disparate source systems and then populate the dimension tables and the fact table. The bottom line is that we wait until processing is done in all the OLTP systems and other sources before we process, so we can connect the fact record to most of the dimension tables.
In real time we have to have a messaging system (say MQ) send us messages from the OLTP systems throughout the day, and then process them into the fact table and possibly the dimension tables. If we do that, we may not have most of the dimensions ready when we load the fact table: some of the older source systems may not even have messaging capability, and the event that triggered the fact row might not have occurred yet in the other systems the star needs. So, how do we build real-time data marts? Do all sources need to be converted to messaging if we want real time? Do we need separate structures for real time, maybe not a star schema but just a flat table? Also, in any ETL tool, if a job is already running, a second initiation of that job will fail, and files created between jobs can be overwritten by the next incoming row.
Has anyone had experience converting a traditional DW to a real-time DW?
VTK - Posts: 50, Join date: 2011-07-15
Re: Building Star Schema in real time?
Fundamentally the processes are the same; they just occur more often. Missing dimensions for facts is something that can happen with batch systems as well.
The basic technique is to infer dimension rows when they are needed by a fact load. If a dimension row is missing, you create the row using the natural key received with the fact and flag it as inferred. When the row eventually comes in from the dimension feed, the attributes are loaded and the inferred flag is reset. The flag has two purposes: it alerts users that this condition exists, and it controls updates in type 2 dimensions. In a type 2 dimension, if there is an inferred row, that row is updated in place rather than creating a new row; this way, facts loaded against the inferred row reference the attributes from that first update. Subsequent updates are handled normally.
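For concreteness, here is a minimal sketch of that inferred-member pattern using SQLite; the customer dimension and its columns (dim_customer, customer_nk, is_inferred) are made up for the example and not taken from any particular tool.

```python
import sqlite3

# Illustrative type 2 dimension with an inferred-member flag.
DDL = """
CREATE TABLE dim_customer (
    customer_sk   INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_nk   TEXT NOT NULL,      -- natural key from the source
    customer_name TEXT,
    is_inferred   INTEGER NOT NULL DEFAULT 0,
    is_current    INTEGER NOT NULL DEFAULT 1
)
"""

def lookup_or_infer(conn, natural_key):
    """Return the surrogate key for a natural key, creating an inferred
    (attribute-less) row if the dimension feed hasn't delivered it yet."""
    row = conn.execute(
        "SELECT customer_sk FROM dim_customer "
        "WHERE customer_nk = ? AND is_current = 1", (natural_key,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_nk, is_inferred) VALUES (?, 1)",
        (natural_key,),
    )
    return cur.lastrowid

def apply_dimension_update(conn, natural_key, name):
    """First update of an inferred row happens in place (and clears the
    flag); later changes get normal type 2 treatment: expire and insert."""
    row = conn.execute(
        "SELECT customer_sk, is_inferred FROM dim_customer "
        "WHERE customer_nk = ? AND is_current = 1", (natural_key,)
    ).fetchone()
    if row and row[1] == 1:
        conn.execute(
            "UPDATE dim_customer SET customer_name = ?, is_inferred = 0 "
            "WHERE customer_sk = ?", (name, row[0]),
        )
    elif row:
        conn.execute(
            "UPDATE dim_customer SET is_current = 0 WHERE customer_sk = ?",
            (row[0],),
        )
        conn.execute(
            "INSERT INTO dim_customer (customer_nk, customer_name) "
            "VALUES (?, ?)", (natural_key, name),
        )
    else:
        conn.execute(
            "INSERT INTO dim_customer (customer_nk, customer_name) "
            "VALUES (?, ?)", (natural_key, name),
        )

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
sk = lookup_or_infer(conn, "CUST-42")                 # fact arrives first: inferred row
apply_dimension_update(conn, "CUST-42", "Acme Ltd")   # feed catches up: in-place update
apply_dimension_update(conn, "CUST-42", "Acme Corp")  # real change: new type 2 row
```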
From an operational standpoint, you need to review the ETL processes to adapt them to smaller load sizes. Most of the time 'real time' is actually implemented as small batches run frequently (say, at 15-minute intervals). Most ETL tools handle surrogate key assignment by caching natural key/surrogate key pairs in memory rather than querying the source tables. There is initial overhead to build these caches at the beginning of the job. This is very efficient for processing large batches, but may be unnecessary overhead (particularly with very large dimensions) for small batches. Different ETL tools give you different options for dealing with this, such as maintaining a permanent cache or performing the lookup directly against the dimension table. You need to experiment to see what works best for you.
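To illustrate that trade-off, here is a hypothetical lazy resolver that skips the up-front full cache build and looks keys up one at a time, remembering only the pairs it has actually seen; it reuses the made-up dim_customer table from the sketch above.

```python
import sqlite3

class SurrogateKeyResolver:
    """Lazy natural-key -> surrogate-key lookup for micro-batches:
    no full cache build at job start, just one-row lookups on demand."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}   # natural key -> surrogate key

    def resolve(self, natural_key):
        sk = self.cache.get(natural_key)
        if sk is None:
            # Single-row lookup against the dimension instead of a full
            # cache load; cheap when the batch touches only a few keys.
            row = self.conn.execute(
                "SELECT customer_sk FROM dim_customer "
                "WHERE customer_nk = ? AND is_current = 1",
                (natural_key,),
            ).fetchone()
            if row is None:
                return None   # caller falls back to inferring a row
            sk = row[0]
            self.cache[natural_key] = sk
        return sk
```

Kept alive across frequent runs (for example in a long-running load service), the cache gradually covers the hot keys without ever paying the full build cost.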
The other consideration is the load on the system while users are querying the database. It may be necessary to beef up hardware and reorganize physical table structures to handle the load.
Re: Building Star Schema in real time?
I've always wondered about the need for a "real time" Data Warehouse.
The DW is all about full table scans and looking at trends. There really isn't anything that is going to happen in the next hour that is going to change the trend.
I would think real-time data is all about reporting on a very limited time frame - the past couple of hours, or how the world looks right now. This involves a very limited set of data. And because the data is limited, it doesn't really need a data structure such as a star. That doesn't mean the real-time data can't be joined to a dimension table on something other than the primary key (maybe the business key).
Jeff Smith - Posts: 471, Join date: 2009-02-03
Re: Building Star Schema in real time?
Jeff Smith wrote: I've always wondered about the need for a "real time" Data Warehouse.
The DW is all about full table scans and looking at trends. There really isn't anything that is going to happen in the next hour that is going to change the trend.
I would think real-time data is all about reporting on a very limited time frame - the past couple of hours, or how the world looks right now. This involves a very limited set of data. And because the data is limited, it doesn't really need a data structure such as a star. That doesn't mean the real-time data can't be joined to a dimension table on something other than the primary key (maybe the business key).
Where's the Like button? I'm with you. You want current? Look in the OLTP.
BoxesAndLines - Posts: 1212, Join date: 2009-02-03, Location: USA
Re: Building Star Schema in real time?
I agree with y'all 99% of the time… there are only a handful of situations where real-time analytic processing is necessary, and even then it is not so much real-time data warehousing (loading) as it is real-time data warehouse usage (query). Things like fraud detection and next best offer (retail) are analytic applications that use historical data to make near-real-time decisions. But neither of them really needs an up-to-the-minute data warehouse.
It's a matter of tiers in the data. Recent activity should be held and maintained by the application, not the data warehouse, since that activity is usually most critical for the application but of little use for tactical and strategic analytics.
Re: Building Star Schema in real time?
And don't forget that having real-time data in one place and the DW in another doesn't preclude combining the data in a report. With Cognos, for example, you can write one query against the DW and another against the real-time data source, then do a union between the two queries. Not something that should be done when the queries bring back a ton of data, but if someone wanted an up-to-the-second count of claims processed for the month, that's one row from the DW, one row from the real-time source, then a union and re-sum for one row.
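As a rough sketch of that union-and-resum pattern outside any particular BI tool (the connections and the table names fact_claims and claims_today are invented for the example):

```python
def claims_month_to_date(dw_conn, rt_conn, month_key):
    """One aggregate row from the warehouse plus one from the real-time
    source, combined into a single month-to-date count."""
    dw_row = dw_conn.execute(
        "SELECT COUNT(*) FROM fact_claims WHERE month_key = ?", (month_key,)
    ).fetchone()
    rt_row = rt_conn.execute(
        "SELECT COUNT(*) FROM claims_today"
    ).fetchone()
    # 'Union' of the two one-row results, re-summed to a single row.
    return dw_row[0] + rt_row[0]
```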
Jeff Smith - Posts: 471, Join date: 2009-02-03
Re: Building Star Schema in real time?
The Enterprise Architecture group in our company is talking about delivering system and business events through a canonical model on an enterprise service bus (ESB) and driving all of BI on top of it. So at some point, once all front-office applications talk to the ESB, BI/EDW would lose access to the underlying OLTP databases and everything would be based on real-time messages. I see this happening across industries sooner rather than later, which brings the requirement of designing the staging layer to be more write-friendly than read-friendly for the star schema.
I agree with you all that the dimensional model does not need to be populated in real time; however, I do see a need to write messages from the queue into staging in real time.
I would suggest designing the staging schema to be write-friendly. This could be smaller, vertically growing satellite tables, one per message schema / business event; that lets you make sure reads and writes between the ESB and staging complement each other. Also, the fragmentation level of these column-store tables will be much lower than that of a flat history/staging table, if you were to transform real-time messages into the same grain as the fact tables in-process.
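As a hypothetical sketch of that kind of write-friendly staging (the event names, payload fields, and tables are invented for the example, and SQLite stands in for whatever column store is actually used), each bus message is appended untouched to a narrow per-event table:

```python
import json
import sqlite3

# One narrow, append-only staging table per business event type,
# loaded straight off the bus with no transformation at write time.
EVENT_TABLES = {
    "order_created": "stg_order_created",
    "order_shipped": "stg_order_shipped",
}

def stage_message(conn, raw_message):
    """Append one bus message to its event's staging table."""
    event = json.loads(raw_message)
    table = EVENT_TABLES[event["event_type"]]
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " payload TEXT NOT NULL)"
    )
    # Append-only: no lookups or updates on the write path, so ESB writes
    # never contend with the nightly read that builds the star.
    conn.execute(f"INSERT INTO {table} (payload) VALUES (?)",
                 (json.dumps(event),))

conn = sqlite3.connect(":memory:")
stage_message(conn, '{"event_type": "order_created", "order_id": 42}')
```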
Moving from column-store staging to the star schema during nightly jobs could be a bit of a pain, but if you have the right hardware (these joins will be CPU-intensive) and software (SQL 2012 / Oracle 10g), it seems feasible.
shrikant.kulkarni - Posts: 2, Join date: 2014-02-22
Similar topics
» Real time datawarehousing
» Star Schema for MPP databases
» star schema designing
» Help designing star schema
» Snowflake or Star Schema?