Modeling In-Memory
Modeling In-Memory
Hello,
I am an avid supporter of dimensional modeling, but we are running into design concerns when applying its principles and practices to a completely in-memory solution. Our initial design is such that source system tables are replicated into an appliance and the modeling is entirely virtual (versus ETL in a typical solution). This means basic principles, such as surrogate keys, are no longer practical.
Has anyone had success extending dimensional methodologies in an entirely virtual layer? The simplest approach is to duplicate the data into dims/facts, but this is not cost-effective (memory storage is very expensive), and the base tables must be retained for unrelated reasons. Any ideas on how to approach this, or is dimensional modeling the wrong approach in this scenario?
Thanks!
real_mhardy- Posts : 4
Join date : 2014-06-13
Re: Modeling In-Memory
What are you asking? What do you mean by 'virtual'? Surely the data exists in some structure in the memory-based repository. Assuming the data is normalized (to reduce storage requirements), do you mean creating views to give the appearance of a dimensional model? Sure, why not?
I mean, a materialized dimensional model on traditional storage devices provides some performance advantages for large analytic queries. But, if the data resides in a high-speed memory device, the performance difference (if any) becomes negligible.
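For illustration only, a minimal sketch of what such a view layer might look like, with invented table and column names (not anyone's actual schema): the dimensional "appearance" comes from flattening the normalized sources in a view.

-- Hypothetical replicated source tables: src_customer, src_customer_address.
CREATE VIEW dim_customer AS
SELECT c.customer_id        AS customer_key,   -- natural key doubles as the join key
       c.customer_name,
       c.customer_segment,
       a.city,
       a.country
FROM   src_customer c
LEFT JOIN src_customer_address a
       ON a.customer_id = c.customer_id
      AND a.address_type = 'PRIMARY';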
Re: Modeling In-Memory
The data would physically exist in the memory-based repository, replicated 1-for-1 as it exists in the source system (with the exception of additional metadata columns). The dimensionalization and conformation would happen as a set of views built on top of the source system data model. However, using this approach I believe a number of best practices no longer make sense. For example, surrogate keys are no longer possible, and junk dimensions/attributes would have to go on the fact table directly. I don't even know how to approach type 2 history. Any suggestions on how to proceed?
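To illustrate what I mean (table and column names invented): without an ETL step there is nowhere to assign sequence-generated surrogate keys, so the views can only expose keys the source already carries, and junk-style flags end up as plain columns on the fact view.

CREATE VIEW fact_order_line AS
SELECT ol.customer_id   AS customer_key,   -- natural key standing in for a surrogate
       ol.product_id    AS product_key,
       ol.order_date,
       ol.rush_order_flag,                 -- would normally be moved to a junk dimension
       ol.gift_flag,
       ol.quantity,
       ol.extended_amount
FROM   src_order_line ol;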
Thanks!
real_mhardy- Posts : 4
Join date : 2014-06-13
Re: Modeling In-Memory
Don't confuse hardware with data modeling. The reasons why models are the way they are have nothing to do with the hardware.
The ideas behind dimensional models and surrogate keys have to do with the long term stability of the relationships among data in the model. A copy of the operational tables does nothing to ensure this stability, or history for that matter.
What you are doing is replicating a bunch of tables and throwing views over them. Which, for what it is, is not a bad thing, but it is not a data warehouse (or at least not a proper one). It doesn't matter what media you are using to store the data.
If you are concerned about type 2 history, you need to build a proper model. What you intend to do won't cut it.
Re: Modeling In-Memory
I'm with ngalemmo. You still have to load the tables to memory. Creating one super-wide table with all columns has the same drawbacks whether it resides in-memory or on disk. Your in-memory dimensional model should run exceptionally fast, as disk access is still the slowest part of a given query.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Modeling In-Memory
Thanks ngalemmo and BoxesAndLines for your replies.
One thing to note is that we have a requirement to load the source system tables as-is into the memory appliance. We are trying to avoid instantiating the data physically a second time, hence why we are considering a virtual data warehouse through views built on the replicated source system tables. Sorry to sound naive, but why can't the modeled data exist virtually through views rather than physically? Wouldn't the same principles apply?
ngalemmo: I don't think I'm confusing hardware with data modeling. Hardware allows for different approaches and/or removes previous limitations. The model is still the model and should look the same regardless of technology.
BoxesAndLines: I wasn't thinking of creating super wide views as it limits the ability to reuse and conform.
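To make that concrete (again with invented names), what I picture is one conformed dimension view shared by several fact views, rather than one super-wide view per subject area:

CREATE VIEW dim_product AS
SELECT product_id AS product_key,
       product_name,
       category
FROM   src_product;

-- Both fact views join to the same dim_product, so the attribute
-- definitions stay conformed across subject areas.
CREATE VIEW fact_sales AS
SELECT s.product_id AS product_key, s.sale_date, s.amount
FROM   src_sales s;

CREATE VIEW fact_inventory AS
SELECT i.product_id AS product_key, i.snapshot_date, i.on_hand_qty
FROM   src_inventory i;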
Thanks!
real_mhardy- Posts : 4
Join date : 2014-06-13
Re: Modeling In-Memory
real_mhardy wrote: Thanks ngalemmo and BoxesAndLines for your replies.
One thing to note is that we have a requirement to load the source system tables as-is into the memory appliance...
I'm beginning to see the problem. Speaking as a data modeling professional: nobody on the business side is qualified to impose that requirement. That said, if you can virtualize a dimensional model off of arbitrarily normalized source tables, that's great! I know Teradata has been claiming to be able to do that for years, but I've yet to see it done successfully (i.e. with good performance).
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Re: Modeling In-Memory
If the data model is properly normalized you should have no problem creating a dimensional view of the same.
You cannot support change history unless you make changes to the data model.
Why would building a proper model incur another instance of the data? You have an operational instance and an analytic instance. Why would using a different model for the analytics instance result in a third copy?
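To make the history point concrete, here is a minimal sketch (invented names) of the kind of structure a type 2 dimension implies. It has to be persisted somewhere, because it holds rows the source no longer does:

CREATE TABLE dim_customer_history (
    customer_key      INTEGER      NOT NULL,  -- surrogate key assigned at load time
    customer_id       INTEGER      NOT NULL,  -- natural key from the source
    customer_name     VARCHAR(100),
    customer_segment  VARCHAR(30),
    effective_date    DATE         NOT NULL,
    expiry_date       DATE         NOT NULL,  -- e.g. 9999-12-31 for the current row
    is_current        CHAR(1)      NOT NULL
);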
Re: Modeling In-Memory
ngalemmo: We are hoping to have only one physical instance of the data, the operational instance (the source system model). The analytical instance would not exist physically, but as a collection of views sitting on top that denormalize, format, and apply business rules.
Is this a viable approach? I think we recognise that for certain requirements, such as type 2 history, storing the data a second time may be necessary. Do you know if there is any literature on the subject or is this a "bad idea"?
Thanks!
real_mhardy- Posts : 4
Join date : 2014-06-13
Re: Modeling In-Memory
real_mhardy wrote:
Is this a viable approach? I think we recognise that for certain requirements, such as type 2 history, storing the data a second time may be necessary. Do you know if there is any literature on the subject or is this a "bad idea"?
From talking to my SAP HANA colleagues, this isn't necessarily seen as an issue yet, so there isn't literature on it. Thinking about it more deeply after those conversations, there are two cases. (I recognise you might not be talking HANA, but it should still be relevant.)
1) Live analytics within the application. Based on live data, no need for history. One of the biggest sales pitches for HANA is removing the layers and latency within this model to enable operational reports (and remove the need for a separate operational data store).
2) Historic analytics - needs 'as-was' data for analytic and reporting purposes.
Both can run in memory on a database that supports ACID (again, e.g. HANA).
Case 1) is the usual sales pitch for HANA - running analytics based on live data, which doesn't need history and so can be run on top of the views on the ERP.
2) is the hard case that isn't often part of the sales pitch, and it involves reengineering the data warehouse/BW to strip out the view and summary layers that make up the current architecture. This is where history tracking and dimension conformance come in. History tracking, whether done the CIF way, with data vault, or with SCDs, involves persistent inserts and updates to some table that is different from the live production tables (whether CRM, ERP or core platform). That clearly has to be a separate functional requirement and a separate piece of functionality. Analytic results from case 2) may be fed back into case 1) for live analytics. In case 2), the SCD customer is maintained separately from the source customer, and similarly the fact tables are maintained separately; otherwise key management is not implemented properly and you don't have history.
From a dimensional modelling perspective, you can certainly model the views in case 1 dimensionally, but don't expect history tracking (or conformance, if you're not in a full SAP world - I'm simplifying dramatically here). Case 2 is simply a data warehouse - dimensional or CIF as you wish to model (and obviously here isn't the place to argue for the second of those) - and it has to manage all the things a data warehouse manages. Given the cost of RAM, I imagine there also needs to be some dynamic caching so that older, less-used data can be hosted on slower disk and SSD rather than RAM.
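To illustrate that last point, a case 2 load is a separate process writing to a separate structure. A rough sketch of the two statements involved, reusing the invented dim_customer_history structure from earlier in the thread (no claim that this is how HANA/BW implements it):

-- Step 1: close off current dimension rows whose tracked attributes changed in the source.
UPDATE dim_customer_history
SET    expiry_date = CURRENT_DATE,
       is_current  = 'N'
WHERE  is_current = 'Y'
AND    EXISTS (SELECT 1
               FROM   src_customer s
               WHERE  s.customer_id = dim_customer_history.customer_id
               AND   (s.customer_name    <> dim_customer_history.customer_name
                   OR  s.customer_segment <> dim_customer_history.customer_segment));

-- Step 2: insert a fresh current row for every source row with no current version
--         (brand-new customers plus the ones just expired above).
INSERT INTO dim_customer_history
      (customer_key, customer_id, customer_name, customer_segment,
       effective_date, expiry_date, is_current)
SELECT NEXT VALUE FOR customer_key_seq,   -- placeholder: key generation varies by platform
       s.customer_id, s.customer_name, s.customer_segment,
       CURRENT_DATE, DATE '9999-12-31', 'Y'
FROM   src_customer s
LEFT JOIN dim_customer_history d
       ON d.customer_id = s.customer_id
      AND d.is_current  = 'Y'
WHERE  d.customer_id IS NULL;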
nathanjones77- Posts : 11
Join date : 2014-06-03
Location : Zurich