Enormous data size

jaiveeru Thu May 14, 2009 11:04 am

This all started from my query 2 weeks back on solving data modeling problem in recursive hierarchical data table.
I posted a query and got replies, almost instantly. All replies pointing to one solution.

I did exactly as suggested i.e. I resolved the tree structure into a flat table so that there is no many to many items left. This was done in addition to resolving all one to many data other tables as well. I happily added all redundant fields into table and my fact table size now is 200 GB and with indexes it can easily cross 500 GB.

Now guys problem is in my whole big company I am the alone who is working on datawarehouse. I have no previous track record to see or verify if what I did was right or wrong. I myself do not have experience of working in datawarehouse.

On quering from dimension tables I do get the right rows, but can you imagine how much time would it take if I query billions of rows to select few thousands from them.
Each index cost me 50-100 GB of disk space, sometimes I wonder where I am heading to?
I am already slammed with a letter from my head to justify the disk space requirements. Logically I see I am doing the right but is this kind of data size expected?
Transactional data table had around 20GB for the same data.
But upto what size a warehouse can grow and still perform well? Is the size that I am dealing with is acceptable?
Please guys say something !!! It's time to back the concept of datawarehouse (in my company)

Jeff Smith Thu May 14, 2009 11:57 am

How do you go from 20 GB in the transactional system to 200 GB in the DW if it's the same data? And you mentioned "billions of rows". Even if it's 1 billion, 20 bytes per row in the transaction system getting blown up to 200 bytes in the DW seems odd.

jaiveeru Thu May 14, 2009 6:42 pm

Jeff thanks for the response.

Let me show you the picture (tip of the iceberg)

Data table in transaction db is my meter data table, which records every hour's meters data.

All these meters actually spread like nodes in the tree, so tree has meters.

Meters are grouped and the group forms tree.

SO

Just taking 1 meter to understand it better, let's assume A is parent of B which is parent of C which in turn has got 3 meters m1, m2, m3

in transaction db, groups are stored separate and readings are separate. For simplicity I have not included time but to understand spread i did include 1 extra reading for a day.

Group Table
child ------------parent------------ meter
A ------------ Null------------ null
B ------------ A ------------ null
C ------------ B ------------ m1
C ------------ B ------------ m2
C ------------ B ------------ m3

Meter Table
Meter ------------ Reading ------------ date
m1 ------------ 100 ------------ d1
m2 ------------ 24 ------------ d1
m3 ------------ 50 ------------ d1
m3 ------------ 56 ------------ d2

work like charm,

now we plan to make a datawarehouse:
we can't have same structure in dimensional model.

on separating we get a

bridge table

parent -----subsidiary child --------meter -------bridgeID
A ------------ A ------------ null ------------ 1
A ------------B ------------null ------------2
A ------------C ------------ m1------------ 3
A ------------C ------------m2 ------------4
A ------------C ------------m3 ------------5
B ------------B ------------null ------------6
B ------------C ------------m1 ------------7
B ------------C ------------m2 ------------8
B ------------C ------------m3 ------------9
C ------------C ------------m1 ------------10
C ------------C ------------m2 ------------11
C ------------C ------------m3 ------------12

999 is represent fake meterid

Fact Table

FKbridgeID -------Meter ------ reading --------- date
1 ------------999 ------------0 ------------d1
2 ------------999 ------------0 ------------d1
3 ------------m1 ------------100 ------------d1
4 ------------m2 ------------24 ------------d1
5 ------------m3 ------------50 ------------d1
6 ------------999 ------------0 ------------d1
7 ------------m1 ------------100 ------------d1
8 ------------m2 ------------24 ------------d1
9 ------------m3 ------------50 ------------d1
10 ------------m1 ------------100 ------------d1
11 ------------m2 ------------24 ------------d1
12 ------------m3 ------------50 ------------d1
1 ------------999 ------------0 ------------d2
2 ------------999 ------------0 ------------d2
3 ------------m1 ------------0 ------------d2
4 ------------m2 ------------0 ------------d2
5 ------------m3 ------------56 ------------d2
6 ------------999 ------------0 ------------d2
7 ------------m1 ------------0 ------------d2
8 ------------m2 ------------0 ------------d2
9 ------------m3 ------------56 ------------d2
10 ------------m1 ------------0 ------------d2
11 ------------m2 ------------0 ------------d2
12 ------------m3 ------------56 ------------d2

So we clearly see that rows in fact are 6 times the rows in meter table. And I needed to break the recursive group table to resolve many to many relation so I made the bridge. Then I gave bridge a surrogated key to use in fact table and this helped me in getting rid of a many to many between fact and bridge.

Do I have a better way of doing this that also regard dimension modeling principles?

tropically Mon May 18, 2009 11:53 am

What is the grain for your fact?

ngalemmo Mon May 18, 2009 8:25 pm

Me thinks you got the model wrong...

There is nothing that should change the number of rows in your fact table. The facts are meter reads... there should be one row per read, no more, no less. One of the foreign keys should be for the meter being read.

As for hierarchies, I do not understand what the 'group' table is supposed to represent. You use groups to cluster combinations of multivalued dimensions, not to implement hierarchies.

The hierarchy bridge table is an explosion of the hierarchy expressed as:
parent key
child key
distance from top for parent
distance from parent
... other attributes, such as a leaf (bottom) flag.

So, if there is a hierarchy such as A is parent of B is parent of C the exploded bridge table would look like:

A, A, 0, 0
A, B, 0, 1
A, C, 0, 2
B, B, 1, 0
B, C, 1, 1
C, C, 2, 0

You join the foreign key in the fact table (the meter key, as the child) through the bridge to aggregate measures to any parent. There is no change to the fact table, or the dimension table for that matter.

Jeff Smith Tue May 19, 2009 1:44 pm

I'm not sure I fully understand the situation, but I think this is a ragged heirarchy. Some ETL software handle this better than others.

A Ragged Hierarchy is a situation that I encountered with Cost Centers. It is typical for cost centers to roll up to other cost centers that have their own costs. For example, My manager has 3 direct reports. Each Direct has their own cost center (1, 2, 3). These 3 cost centers roll up to my manager's cost center (40). But his cost center also contains his costs. He has a manager with cost center 100, who has 3 directs (including my manager, 40, 50, 60). Only my manager has direct reports. The hierarchy is as follows

Base-L1-L2-L3
1-1-40-100
2-1-40-100
3-1-40-100
40- -40-100
50- -50-100
60- -60-100
100- - -100

Direct costs, such as salaries, can come in for every cost center therefore, every cost center has to be at the lowest level of the hierarchy. Certain levels are left blank for certain values because no data should be reported at that level. For example, if 40 were at level_1, it would only contain the direct costs on my manager (his salary and the salary of his assistant), but not the costs for centers 1, 2, and 3. If you reported costs for 40 at level 1, it would not accurately state the full costs of cost center 40. The correct cost for 40 can only be reported at level 2, therefore, the rollup for 40 at the base level is null at level 1.

jaiveeru Tue May 19, 2009 5:52 pm

tropically wrote:What is the grain for your fact?

My grain size is hourly reading.

So its entered as 1 reading per hour per day per meter.

jaiveeru Tue May 19, 2009 6:16 pm

ngalemmo wrote:Me thinks you got the model wrong...

There is nothing that should change the number of rows in your fact table. The facts are meter reads... there should be one row per read, no more, no less. One of the foreign keys should be for the meter being read.

As for hierarchies, I do not understand what the 'group' table is supposed to represent. You use groups to cluster combinations of multivalued dimensions, not to implement hierarchies.

The hierarchy bridge table is an explosion of the hierarchy expressed as:
parent key
child key
distance from top for parent
distance from parent
... other attributes, such as a leaf (bottom) flag.

So, if there is a hierarchy such as A is parent of B is parent of C the exploded bridge table would look like:

A, A, 0, 0
A, B, 0, 1
A, C, 0, 2
B, B, 1, 0
B, C, 1, 1
C, C, 2, 0

You join the foreign key in the fact table (the meter key, as the child) through the bridge to aggregate measures to any parent. There is no change to the fact table, or the dimension table for that matter.

Thanks for response.

Group is not group as in meaning. Lets say we have meters installed in multiple companies and we have software that tracks consumption in various sites.
Now these sites in themselve are quite big and have multiple floors, meters are installed in many across each of these sites. Even among floors there can be wings and sections and so on and we can have meters anywhere.
So when we calculate the consumption for a site we need to add meters under the site from all the floors, wings, sections etc. We may also need to compare consumption between two section say HR and finance sections and in that case meter summed up will only be respective meters.
for simplicity to understand I referred to these companies, buildings, sites, wings, sections etc a group.
So a group can have many groups and in turn they can have many. At any level it's possible to have meters, BUT one meter will be immediate child to one and only one group. One Group on the other hand can have as many meters as it likes.

I am not sure if I understood your solution completely. Can I treat a meterid from the fact table as a child or parent? Because meterid is neither unique in bridge nor unique in fact table. So it has a many to many relation with bridge. That's the reason I gave a surrogated primary key to the bridge called PKBridge and used it in fact as FKBridge so that I can establish a one to many relation between two tables.

ngalemmo Tue May 19, 2009 7:10 pm

A meter key should be a unique surrogate... do whatever you need to do to make the business key unique, but I would assume every physical meter would have a unique identifier, correct?

Anyway, meters would be children and groups would be parents.

So, you have a fact table with, minimally:

meter key (fk to meter dimension)
reading (measure)
date/time/etc...

Both meters and groups should be defined in a meter dimension, or you can define groups in a separate dimension... (but the former is more flexible)

So, in the previous example A & B would be groups and C a meter. Lets also add D, another meter under group B, the bridge table would contain:

Parent, Child, parent level, distance from parent
A, A, 0, 0
A, B, 0, 1
A, C, 0, 2
A, D, 0, 2
B, B, 1, 0
B, C, 1, 1
B, D, 1, 1
C, C, 2, 0
D, D, 2, 0

You fact table contains:
Meter, reading
C, 5
C, 10
D, 8
D, 3

If you want the sum of readings for group B, you query the bridge for parent = B and join the child in the bridge to the meter key in the fact table. The query would get rows for meters C and D because both are listed as children of B. If you did the same query for parent = A it would look for children A, B, C & D. Because A and B are not meters and have no readings of their own, it would not find any child rows for A or B in the fact table, so it would return the sum of readings for meters C & D.

The inclusion of identity rows, such as 'C, C, 2, 0' allow you to always use the bridge regardless of what you are looking for... allowing you to use the same query to sum a specific meter or a group of meters.

jaiveeru Wed May 20, 2009 5:40 am

ngalemmo wrote:A meter key should be a unique surrogate... do whatever you need to do to make the business key unique, but I would assume every physical meter would have a unique identifier, correct?

Anyway, meters would be children and groups would be parents.

So, you have a fact table with, minimally:

meter key (fk to meter dimension)
reading (measure)
date/time/etc...

Both meters and groups should be defined in a meter dimension, or you can define groups in a separate dimension... (but the former is more flexible)

So, in the previous example A & B would be groups and C a meter. Lets also add D, another meter under group B, the bridge table would contain:

Parent, Child, parent level, distance from parent
A, A, 0, 0
A, B, 0, 1
A, C, 0, 2
A, D, 0, 2
B, B, 1, 0
B, C, 1, 1
B, D, 1, 1
C, C, 2, 0
D, D, 2, 0

You fact table contains:
Meter, reading
C, 5
C, 10
D, 8
D, 3

If you want the sum of readings for group B, you query the bridge for parent = B and join the child in the bridge to the meter key in the fact table. The query would get rows for meters C and D because both are listed as children of B. If you did the same query for parent = A it would look for children A, B, C & D. Because A and B are not meters and have no readings of their own, it would not find any child rows for A or B in the fact table, so it would return the sum of readings for meters C & D.

The inclusion of identity rows, such as 'C, C, 2, 0' allow you to always use the bridge regardless of what you are looking for... allowing you to use the same query to sum a specific meter or a group of meters.

This morning I was trying hard to work out on the solution you gave. I still don't quite understand I suppose.

I guess it's easy to remain on the same page and discuss the same example. There were two possibilities between fact and bridge. One is bridge has surrogate unique key and fact has foreign key (This solution had resulted in huge fact size and hence I came with my issue in this forum). Second as you suggest fact has unique key and bridge has foreign key? Is that right?
I give IDs to fact table to make it clearer.

FACTForeign, Parent, Child, parent level, distance from parent
null, A, A, 0, 0
null, A, B, 0, 1
null, A, C, 0, 2
null, A, D, 0, 2
null, B, B, 1, 0
1 or 2, B, C, 1, 1
3 or 4, B, D, 1, 1
1 or 2, C, C, 2, 0
3 or 4, D, D, 2, 0

You fact table contains:
FACTID, Meter, reading
1, C, 5
2, C, 10
3, D, 8
4, D, 3

I need to resolve the key value in bridge table (mentioned in bold).

jaiveeru Wed May 20, 2009 6:00 am

To answer other items from your last post:

every meter has a unique identifier, it's called meterID.

There is no Meter Dimension becasue it's part of Site Dimension which has group and meters both information.
It's the dimension that we have on the other side of the bridge table. So SiteDimension (having PKSite, Group, ParentGroup, MeterID, and other columns) then Bridge dimension (PKBridge, FKSite, Group, SubsidiaryGroup,MeterID) then Fact(FKBridge,MeterID,Date,Time,Reading, other aggregatable columns).
Site Dimension contains recursive hierarchy
Bridge contains exploded hierarchy
and the fact got atomic explosion

ngalemmo Wed May 20, 2009 9:36 pm

I'm confused. What is the purpose of the bridge dimension and the bridge FK in the fact table?

If the bridge is just a flattened hierarchy for a meter, then the PK should be the meter key, right? Because you are going to have one row per meter. There is no need for another fk (bridge fk) in the fact table because the meter is the smallest unit you are dealing with. Also, if you have such a flattened hierarchy, there is no need for an exploded hierarchy as well. An exploded hierarchy is used to represent recursive hierarchies.

jaiveeru Thu May 21, 2009 4:15 am

ngalemmo wrote:I'm confused. What is the purpose of the bridge dimension and the bridge FK in the fact table?

If the bridge is just a flattened hierarchy for a meter, then the PK should be the meter key, right? Because you are going to have one row per meter. There is no need for another fk (bridge fk) in the fact table because the meter is the smallest unit you are dealing with. Also, if you have such a flattened hierarchy, there is no need for an exploded hierarchy as well. An exploded hierarchy is used to represent recursive hierarchies.

Please correct me if I am wrong, I suppose what you are trying to say is I must establish a relation from bridge to fact table on meterid because meterid uniquely represent one meter. So there is no bridge key required between the fact and bridge.
bridge table would look like (Group, Subsidiary Group, MeterID ...) and fact look like (MeterID, FKdate, FKTime, ConsumptionReading...) in fact this fact table is nothing but old MeterData table from transaction db with additional column that represent FKDate, and FKTime from two other dimensions.
Am I right?

jaiveeru Thu May 21, 2009 6:17 am

ngalemmo wrote:I'm confused. What is the purpose of the bridge dimension and the bridge FK in the fact table?

If the bridge is just a flattened hierarchy for a meter, then the PK should be the meter key, right? Because you are going to have one row per meter. There is no need for another fk (bridge fk) in the fact table because the meter is the smallest unit you are dealing with. Also, if you have such a flattened hierarchy, there is no need for an exploded hierarchy as well. An exploded hierarchy is used to represent recursive hierarchies.

Just wanted to answer your questions hidden in the last post. Find them in red-
1. I am going to have 1 row per meter per hour per day in the fact (So there will be multiple rows for 1 meter). Meters give there reading just for that very hour, and once the reading is with us they re calculate starting from 0. So Meter only gives an hourly reading and we need to aggregate them.

2. I used exploded hierarchy for groups, becasue consumption can be compared between any groups so I list all the subsidiary groups and meters if there are any.

e.g.

Group A
	Group B	M1
		Group C	M2
			M3
	Group X	M4
		M5

This makes a recursive hierarchy between groups hence I exploded the group as follows

My bridge is:

Group A --- Group B --- M1
Group A --- Group C --- M2
Group A --- Group C --- M3
Group A --- Group X --- M4
Group A --- Group X --- M5
Group B --- Group B --- M1
Group B --- Group C --- M2
Group B --- Group C --- M3
Group C --- Group C --- M2
Group C --- Group C --- M3
Group X --- Group X --- M4
Group X --- Group X --- M5

My fact in this case might look like

M1 --- 01/01/2004 --- 00:00:00 --- 10
M1 --- 01/01/2004 --- 00:00:01 --- 12
M1 --- 01/01/2004 --- 00:00:02 --- 19
M1 --- 01/01/2004 --- 00:00:03 --- 87

and so on...

Now we can very well connect meter to meter from bridge to fact and this relationship will be called a many to many relationship in dimensional model, which is not acceptable

ngalemmo Thu May 21, 2009 8:55 pm

Who says its not acceptable? Exploded hierarchy bridge tables are a very acceptable way to support recursive, ragged hierarchies.

From a model point of view, it is a many to many relationship because it is a redundant representation of the recursive heirarchy. (the child is associated with all its parents in the hierarchy tree). The redundancy is necessary to remove the recursive aspect, allowing you to use the hierarchy using simple SQL queries. In any given query you are restricting the join to a given parent (be it a meter or a location), so in practice, it is a one-to-many relationship.

Jeff Smith Tue Jun 02, 2009 9:20 am

ngalemmo wrote:Who says its not acceptable? Exploded hierarchy bridge tables are a very acceptable way to support recursive, ragged hierarchies.

From a model point of view, it is a many to many relationship because it is a redundant representation of the recursive heirarchy. (the child is associated with all its parents in the hierarchy tree). The redundancy is necessary to remove the recursive aspect, allowing you to use the hierarchy using simple SQL queries. In any given query you are restricting the join to a given parent (be it a meter or a location), so in practice, it is a one-to-many relationship.

Why do you need a bridge table to support a ragged hierarchy? Why not create a dimension table with a ragged hierarchy?

ngalemmo Tue Jun 02, 2009 1:59 pm

There are basically two ways to represent hierarchies in a DW. A flat structure (i.e. dimension) or a bridge. The former is used for fixed hierarchies with known depth and defined levels (such as company, division, department)... in other words you have columns in the dimension that mean something to the business. The other case is a hierarchy that has unknown depth and levels that don't mean anything in particular (such as a bill of materials, employee-manager relationships, etc), and are usually represented in the operational system as parent-child relationships (i.e. a recursive hierarchy).

In general, recursive structures are difficult to work with in SQL (some databases have extensions, but even so, implementing them with BI tools is troublesome). The purpose of the bridge is to remove the recursion by exploding the structure so that a query can access the hierarchy at any level and report on children at any level using simple joins in one-pass queries.

Jeff Smith Wed Jun 03, 2009 9:13 am

ngalemmo wrote:There are basically two ways to represent hierarchies in a DW. A flat structure (i.e. dimension) or a bridge. The former is used for fixed hierarchies with known depth and defined levels (such as company, division, department)... in other words you have columns in the dimension that mean something to the business. The other case is a hierarchy that has unknown depth and levels that don't mean anything in particular (such as a bill of materials, employee-manager relationships, etc), and are usually represented in the operational system as parent-child relationships (i.e. a recursive hierarchy).

In general, recursive structures are difficult to work with in SQL (some databases have extensions, but even so, implementing them with BI tools is troublesome). The purpose of the bridge is to remove the recursion by exploding the structure so that a query can access the hierarchy at any level and report on children at any level using simple joins in one-pass queries.

My experience has been that many BI tools handle ragged hierarchies very well. Many of the Cube software support such structures. The training material for Cognos's Data Manager ETL tool has a section describing how to build dimensions with ragged hierarchies that is applicable to other ETL tools. I've seen comments in other forums about how Star Schemas have a hard time with ragged hierarchies but that hasn't been my experience. Maybe I'm missing something.

Financial Cost Centers is a typical ragged hierarchy where a specific cost center can be a roll up point as well as the lowest level of the hierarchy. The danger with this type of data is allowing information to be reported at the wrong level. A dimension with a ragged hierachy can easily prevent a Cost Center from appearing at the incorrect level.

I'm just trying to understand where the difficulty is with using the ragged hierarchy.

ngalemmo Wed Jun 03, 2009 12:02 pm

I agree that many tools today support use of recursive structures with features various vendors have embedded in their products, but not all. Also, not all BI involves the use of 'cubes'... often it involves reports and interfaces based on relational sources (i.e the data warehouse). If you happen to have toolset that does not support recursive structures, you are in a world of hurt unless you implement a bridge. Bridges are not difficult to build and do not consume much space. It is easy enough to include them in the initial implementation so you have a solution that works regardless of how the business wishes to get to the data.

Jeff Smith Wed Jun 03, 2009 4:17 pm

ngalemmo wrote:I agree that many tools today support use of recursive structures with features various vendors have embedded in their products, but not all. Also, not all BI involves the use of 'cubes'... often it involves reports and interfaces based on relational sources (i.e the data warehouse). If you happen to have toolset that does not support recursive structures, you are in a world of hurt unless you implement a bridge. Bridges are not difficult to build and do not consume much space. It is easy enough to include them in the initial implementation so you have a solution that works regardless of how the business wishes to get to the data.

Maybe I'm think but I still don't see the need for bridge table with a ragged hierarchy. I see it referenced in the literature but I haven't seen the need with the design that I have used.

This is how I've implemented ragged heirarchies in a dimension table:

Dimension Key Base Level Level 1 Level 2 Level 3 Top Level
1 AAA ABC BBD BBE BBZ
2 ABC ABC BBD BBE BBZ
3 BBB BBC BBD BBE BBZ
4 BBC BBC BBD BBE BBZ
5 BBD BBD BBE BBZ
6 BBE BBE BBZ
7 BBZ BBZ

This is typical of cost centers, where a regional cost center includes the costs of each district plus the costs of the regional office. All cost centers come in at the base level but only appear in a Hierarchy Level once. Cost Center BBZ, for example, would include the costs of the CEO office at the base level but would not appear again until the very top level when it would include the costs of the entire company. Such a design would prevent cost center BBZ for appearing with just the costs of the CEO's Office, which prevents the database from displaying correct information.

The ease or difficulty of coding such a heirarchy would depend on the ETL software.

ngalemmo Thu Jun 04, 2009 12:07 pm

Look at it this way...

In SQL, if I want to sum activity for 'BBD' in your example, how would I construct the query against a detailed fact table that only contains facts for leaf level children of 'BBD'? If the hierarchy is truly ragged, how do I know what level(s) 'BBD' will appear?

Sponsored content

Enormous data size

Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

I think this is a Ragged hierarchy

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size

Re: Enormous data size