Dimensional Model from a Hierarchical Data Source

rademola Thu Oct 27, 2011 7:57 am

Hi,

I am facing a rather challenging situation which I don't believe is peculiar because someone/people may have resolved it before. Here we go...

My data source is hierarchical from which I have been tasked with creating a dimensional model. While in the past I have worked with transactional models that naturally lend themselves to a dimensional model translation i.e. many to many joins becoming facts, reference types becoming dimensions, cascading relationships becoming hierarchies, etc., my current data source behaves slightly differently.

We migrated the source data to a "relational - hierarchical model" where the keys of the parent entities are migrated to the children. In most cases, I find measure-type attributes existing across the ancestor - descendant leves. I'm also able to identify dimensional entities based on context and usage from the transactional data analyses but even these entities have the parent "fact" natural key as well as measure type attributes. The risk is that I might end up with a dimensional model that has more fact tables that dimensions. I have thought about creating entities that play a dual role - both fact and dimensions based on usage context in reporting but I know that goes against everything I know about dimensional modeling.

I don't have the luxury of creating a completely source-decoupled dimensional model (a true logical model) which might help address the issues I am having IMHO.

Has anybody got any practical ideas or come across similar in their experience?

Thanks,
Rotimi

ngalemmo Thu Oct 27, 2011 10:26 am

I don't have the luxury of creating a completely source-decoupled dimensional model (a true logical model) which might help address the issues I am having IMHO.

Really? What have you done in the past? The data structures in the source system have little to do which what the dimensional model looks like. If the structure of the source affects the structure or your model, you are not creating a dimensional model, but rather a pseudo dimensional rehash of the original structure.

Forget about how data is stored in the source. Concern youself with the business, identifying facts and dimensions based on how the business operates. How you get the data from the source is an ETL problem, not a modeling problem.

rademola Thu Oct 27, 2011 11:22 am

Thanks for your reply.

I agree with the points you raised. It will be good to sit down with the business, understand the workflows, how performance or operations metrics will be reported or monitored, etc and develop an OLTP source - agnostic logical dimensional model. In my relatively short experience I haven't built dimensional models that are completely decoupled from the transaction systems. And I would be interested in knowing those that have done so...or maybe it's just me...

"

Forget about how data is stored in the source

."??? Yes I don't need to profile the data, understand the relationships or the transactional database model, etc.

I have got constraints, which I can't ignore or divulge unfortunately. I need to build a dimensional (or quasi-dimensional :)) data model for reporting out of a hierarchical database. If anyone has done this before I will be happy to hear how they went about it.

Regards,
Rotimi

ngalemmo Thu Oct 27, 2011 2:12 pm

You need to understand the content, its just that the structures used to store it should not be taken into consideration when you design the model... that is what transformation is all about.

Hierarchical structures are not a big deal. The modern equivalent is an XML file or an object model. In fact, the earliest databases where hierarchical because it made sense from a business perspective (and was performant with the hardware of the day). The idea that you have an order header and attached to it are order lines is hierarchical. In a hierarchical database the only way to get to the lines is through the header. The drawback is if you want to see orders that have product X, you first need to get an order, then look at each line on the order to see if that line has the product. So, to do queries like that you need to know the structure and traverse it.

In a world where hierarchical databases ( and later CODASYL network databases) ruled, the concept of a relational database was foreign and controversial. Even after Codd published his proof, it still took a while for some to accept the concept. It then took even longer for hardware to catch up. In a day when 128K memory and 200MB of disk was large mainframe territory, relational databases didn't perform very well. Software engineering and improved hardware capacities finally let relational databases make inroads in the early 80's.

Bottom line, the source is what it is. Figure it out. Don't compromise a design all because of source data structures. (Unless you need to deal with MUMPS... which is a real ugly and bizarre database system. Call in someone who knows it to do extracts or explain the model. It will save a lot of time.)

rademola Fri Oct 28, 2011 6:22 am

Thanks once again for your very enlightening response.

Like I alluded to before, there are existing constraints to how much (or how long) I can do. Yes, XML and many of the existing mission critical applications (even in the US!) are still based on hierarchical/network DBs. And not may organizations will commit resources to rewrite or migrate to Oracle, SQLServer, Sybase, DB2, etc. Vendors of DB adaptors/bridges, legacy DB drivers, etc will always exist. SOA and Messaging Systems vendors have been developing with this fact in mind...

My role extends beyond providing a data model which is why I am concerned with how data is provisioned into my dimensional models. In the case of delivering a dimensional data warehouse from this hierarchical database source, I decided to adopt the following:
- in my data acquisition phase, I first translate the hierarchical data model to "relational" by migrating the parent key to the child entities. The resulting database is not completely normalized but I don't care because it is transient.
- my data warehouse model is dimensional: fact and dimensions and outriggers, snowflakes, etc. I concede it may not be perfect but nothing in life is!
- In the ETL pipeline, when generating the dimension table surrogate keys only child key columns are used to derive dimension surrogate keys. The composite key (parent key + child key) on the child entity is used as a means of assigning the surrogate key to the fact tables i.e. it is a bridging dataset within the ETL workflow...

Will appreciate comments, suggestions...

Thanks.

ngalemmo Fri Oct 28, 2011 3:09 pm

The process looks fine, however when staging the extract data prior to loading, I wouldn't spend much time modeling it, I would just flatten things out so its easy to deal with. For example, in an Order/Order Line source, I would just stage a wide row with header data repeated for each line.

Also you should find that not all tables have child tables under them, so they are basically the same as a relational table.

Also, depending on how old this system is, the ability to index it may be limited, so performance may be dependent on how you access it.

Sponsored content

Dimensional Model from a Hierarchical Data Source

Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source

Re: Dimensional Model from a Hierarchical Data Source