Monster dimension, joining fact tables

dwfavor Tue Feb 01, 2011 12:02 pm

I am newbie to the DW. We have document processing OLTP system with the following tables

Document table (partitioned table, approx. 1 billion rows)
DocId -PK
Title
DocDate
… 10 other columns

Users table
UserId -PK
UserName
.. 3 other columns

DocumentUsers (joining Document and Users table) (partitioned table, about 10x Document rows = approx. 10 billion rows)
DocId – PK, FK
UserId – PK, FK
Type – PK (user role in the document, e.g. in an email document it would store recipient type of To, From, CC etc.)

DocumentProcessInfo table (Stores document processing event, 5 billion rows)
DocId – PK, FK
EventTypeId – PK, FK
ProcessDate – PK, FK
DocumentSourceId – PK, FK
ProcessJobId – PK, FK
.. 5 other attributes

The DW query requirements are, number/size of documents per user, average processing time, how many times documents got processed etc.
Grain needed is document level

For the dimensional model for this, the options are
Option 1.
DimDocument with snowflaking of DimDocumentUser and DimUser

DimDocument table
DocKey
DocAltKey
… 10 other columns

DimDocumentUser table
DocUserKey
DocKey – FK
UserKey – FK
Type

FactDocumentEvent table
DocKey
EventKey (table omitted for brevity)
ProcessDateKey
DocumentSourceKey (table omitted for brevity)
JobKey

The issues with this approach are DimDocument and DimDocumentUser can grow to 1 and 10 billion rows respectively, super monster dimensions, which is not recommended. Snowflaking complexity. Also, it is recommended not to partition the dimension table per MSSQL Server DW best practices.

Option 2.
DimDocument table ( can grow to 1 billion)
DimUser table (not too large, few millions)
FactDocumentUser table (can grow to 10 billion, partition candidate)
FactDocumentEvent table (can grow to 5-10 billion, partition candidate)

The issues with this approach are
a.Monstrous DimDocument dimension is still present in this option.
b.How to get document user details along with document event data since FactDocumentUser and FactDocumentEvent tables cannot be joined e.g. show all the details of the top 100 documents that took longest to process. If UserKey is included in FactDocumentEvent, the table can grow tremendously since each document can have avg of 10 users and each document can have avg of 5-10 events. So with potential of 1 billion documents x 10 users x 10 events = 100 billion.

Your guidance on this design would be appreciated

ngalemmo Tue Feb 01, 2011 3:04 pm

What is it about the document dimension that user's typically select on? Have you considered breaking down the dimension?

dwfavor Tue Feb 01, 2011 3:42 pm

Thanks for your reply. Typically, Document dimension will get used two ways.
1. Show details of all the documents participating in an aggregate. For example, show all documents details (with associated users) which took more than 100ms to process. Another example, show all documents within certain size that were processed yesterday.
2. Looking for specific document(s) by DocumentAltKey, DocumentDate. For example, show the processing details of specific document id or all the documents within specific document date.

ngalemmo Tue Feb 01, 2011 5:04 pm

But what are those document details? By breaking these details out into new dimensions and storing FKs to those details in the fact, you can significantly reduce the size of the dimension tables used in queries and significantly improve performance when selecting documents based on those attributes.

For example, document size could appear as a degenerate dimension in the fact table.

If I was to have a document dimension, about the only thing I would have in there would be the name of the document and it's URL or file name. I would also break the document name down into keywords and build a bridge structure so that facts could be searched by keyword rather than scanning the entire document dimension.

dwfavor Tue Feb 01, 2011 5:16 pm

Document details
DocumentId = Content hash based id, unique, search-able
NativeId = Id at the source, e.g. exchange id of email message
Title = e.g. email subject
DocumentDate
DocumentType (integer type)
PlatformType (integer type)
Size
EmbeddedDocumentCount
SourceLocation = e.g. source directory path

So even if Document dimension is broken out into common attributes, number of the unique documents in the normalized dimension could still be in billions. That would be an issue.

ngalemmo Tue Feb 01, 2011 5:46 pm

You are not going to avoid a big table with document info such as file name, id, title and so on, but, you can create much smaller dimensions to cover what people would normally search on.

So, if you were to take things such as document date and make it a FK to the date dimension in the fact, and make a junk (aka mini) dimension out of document and platform type, and put size and # embedded documents as degenerate dimensions, you can significantly improve query times because fact selection occurs against much smaller dimension tables. Once facts are selected, it would then join to the huge dimension to get information about documents of interest (if needed for the query). Most databases will handle the query this way if you do not have any predicates on attributes in the huge dimension.

dwfavor Tue Feb 01, 2011 6:02 pm

These are very interesting ideas. Let me rework the design and come back with more questions, if any.

Thanks for your prompt and thorough replies.

Sponsored content

Monster dimension, joining fact tables

Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables

Re: Monster dimension, joining fact tables