The grain level
The grain level
I have a question about the grain of a fact table. Currently our code coverage data warehouse tracks code coverage per function, and the measure is the number of blocks covered. Some scenarios require us to provide the actual lines covered so that users can drill all the way down to the covered code. The source database stores the coverage data as a compressed bit vector (type = image). I know it wouldn't be wise to store this column in a fact table in the DW. My question is: how can this be done? Or should it be placed somewhere outside the DW, like in an NDS or ODS?
Thanks.
yariv
Re: The grain level
The issue isn't where it should go... it's a dimension... but what to do with it once it is in there.
You can either leave it as it is (and create the necessary functions to interpret it) or you can transform it in some manner (bridge table) to make it more SQL friendly.
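If you go the "interpret it" route, here is a minimal sketch of what such a function could look like, in T-SQL terms since your source type is image. Assumptions: the vector has been decompressed into a VARBINARY(MAX), bit N set means block N was covered, and dbo.CoveredBlocks is a made-up name.

    -- Hypothetical sketch: expand an uncompressed coverage bit vector
    -- (bit N set = block N covered) into one row per covered block.
    CREATE FUNCTION dbo.CoveredBlocks (@vec VARBINARY(MAX))
    RETURNS TABLE
    AS RETURN
        WITH bits AS (
            -- one row per bit position in the vector
            SELECT TOP (DATALENGTH(@vec) * 8)
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS bit_pos
            FROM sys.all_objects a CROSS JOIN sys.all_objects b
        )
        SELECT bit_pos AS block_no
        FROM bits
        -- pick the byte, then mask the bit within it
        WHERE SUBSTRING(@vec, bit_pos / 8 + 1, 1) & POWER(2, bit_pos % 8) > 0;

Even with something like this, every query that drills into coverage pays the cost of unpacking the vector at run time, which is one reason to consider the bridge approach discussed below.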
If you leave it as a bit vector, should it be stored as a degenerate dimension? Depends. If it will only appear in one fact table... forever... and your database can store it in a separate data structure, then go ahead. By the latter I mean that in some databases, Oracle for example, you have the option of storing a BLOB in the row or having the DB store a reference to the BLOB and place the BLOB in a separate data area. The latter is preferable, as queries that do not use the BLOB are not burdened with the overhead (i.e. a very big row) of dealing with it.
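In Oracle terms, the out-of-line choice is just a storage clause on the table. A sketch, with all table and column names made up:

    -- Hypothetical Oracle DDL: keep the BLOB out of the row so queries
    -- that never touch it are not burdened by it.
    CREATE TABLE fact_coverage (
        function_key   NUMBER NOT NULL,
        build_key      NUMBER NOT NULL,
        blocks_covered NUMBER NOT NULL,
        coverage_vec   BLOB
    )
    LOB (coverage_vec) STORE AS (
        DISABLE STORAGE IN ROW  -- always store the LOB out of line
    );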
However, 99% of the time the best decision would be to put it in its own dimension table. For a natural key, rather than use the BLOB itself (and, frankly, I don't know if DBs allow you to define an index on a BLOB), I would calculate a hash value (such as a 32-bit CRC) and use the hash as a non-unique index into the dimension table. This would speed lookups when assigning surrogate keys. The lookup would compare the hash as well as the BLOB value to get the correct surrogate key value.
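A sketch of that lookup in SQL Server terms (all names are hypothetical, and CHECKSUM stands in here for the 32-bit CRC; any decent 32-bit hash works):

    -- Hypothetical vector dimension with a non-unique hash index.
    CREATE TABLE dim_coverage_vector (
        coverage_key INT IDENTITY (1, 1) PRIMARY KEY,
        vector_hash  INT            NOT NULL,
        coverage_vec VARBINARY(MAX) NOT NULL
    );
    CREATE INDEX ix_vector_hash ON dim_coverage_vector (vector_hash);

    -- Surrogate key lookup during ETL: the hash narrows the candidates,
    -- the full-value comparison resolves hash collisions.
    SELECT coverage_key
    FROM   dim_coverage_vector
    WHERE  vector_hash  = CHECKSUM(@new_vec)
    AND    coverage_vec = @new_vec;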
With that said, what the vector represents is a grouping of a multivalued dimension. I would consider transforming the vector into a bridge table.
The grain level
Thank you for your response.
Not sure I understand why you consider this bit vector a (degenerate) dimension. In my opinion it is the measure, not an additive one, but nevertheless a way to represent the code coverage in a function. Putting it in a dimension indirectly implies that there could be other functions that share the same bit vector, which is not true. The bit vector is not an attribute describing a function but the result of a test, and therefore it should be in the fact table, if stored in the DW.
yariv
Re: The grain level
My reading is, as you said, that the measure is the number of blocks covered. I assume the bit vector represents which blocks. In a dimensional model, dimensions represent the context of the measures. WHAT blocks (along with Who, Where, When, and Why) are context.
If I misread, what does the bit vector represent and what do you plan to do with it as a measure?
The grain level
You read correctly. The current measure is expressed as the number of blocks covered, and the bit vector shows which blocks were covered. The idea of including more detailed information on the covered blocks was brought up by a program manager, who described a user scenario in which the user can not only see what percentage of the code was covered but also drill down and see the exact blocks in a function that were hit. I don't know yet how to enable this drill-down even if I have the bit vector available, but what I am trying to figure out is where to put it once I extract it from the source DB. Putting the bit vector in a dimension will result in a very large dimension. For example, I have 5 million functions in the code base, with an average of 11 blocks in each function (maximum = 13,000 blocks). Do you suggest treating it as a degenerate dimension, which means it would reside in the fact table, with its storage specified outside the row?
What about using an NDS (normalized data store)?
yariv
Re: The grain level
Is there more information about the block other than its identity? In other words, would there be a block dimension?
Frankly I would dump the vector idea and treat the blocks as a multivalued dimension. Have a simple bridge table with function key and block key. Such a structure can be handled by SQL. If different functions can reference the same set of blocks, then you may want to consider creating block groups to reduce the size of the bridge. You would identify a unique group and store the block group key in the fact table and build a bridge with block group key/block key pairs.
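A sketch of the grouped variant, in the same SQL Server terms as above (every name is hypothetical), including the drill-down query the program manager asked for:

    -- Block group bridge: the fact row carries block_group_key,
    -- the bridge expands each group into its member blocks.
    CREATE TABLE dim_block (
        block_key INT PRIMARY KEY,
        block_no  INT NOT NULL
    );

    CREATE TABLE bridge_block_group (
        block_group_key INT NOT NULL,
        block_key       INT NOT NULL,
        PRIMARY KEY (block_group_key, block_key)
    );

    -- Drill down: exactly which blocks were covered for this fact row?
    SELECT b.block_no
    FROM   fact_coverage      f
    JOIN   bridge_block_group g ON g.block_group_key = f.block_group_key
    JOIN   dim_block          b ON b.block_key       = g.block_key
    WHERE  f.function_key = @function_key
    AND    f.build_key    = @build_key;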
If you use surrogate keys (always use surrogate keys) this bridge will not be very big. Worst case (without grouping blocks) you would have 55M rows (5M functions averaging 11 blocks)... if you average 12 bytes a row, the entire table is only around 660MB. No big deal.
And, unless you are compressing the vectors, the storage requirements for the vector approach are far worse. If you allow for a 13,000-bit vector, uncompressed that is about 1,600 bytes; with 5M rows, you use around 8GB to store the same information AND need to write some sort of function to deal with it in the database, which means performance is going to stink.
The grain level
The current dimensions do not go beyond the function level. So there is no Block dimension. Some of the dimensions we have are: Build, Class, Feature, Team, Person, File, Binary, etc.
The bit vector is compressed in the current implementation and its average size is 3 KB (max is 160 KB).
Thanks again for your suggestions; even if I decide not to follow a particular suggestion, you have certainly provided me with more options than I had before.
yariv