Monster dimension, joining fact tables
2 posters
Page 1 of 1
Monster dimension, joining fact tables
I am newbie to the DW. We have document processing OLTP system with the following tables
Document table (partitioned table, approx. 1 billion rows)
DocId -PK
Title
DocDate
… 10 other columns
Users table
UserId -PK
UserName
.. 3 other columns
DocumentUsers (joining Document and Users table) (partitioned table, about 10x Document rows = approx. 10 billion rows)
DocId – PK, FK
UserId – PK, FK
Type – PK (user role in the document, e.g. in an email document it would store recipient type of To, From, CC etc.)
DocumentProcessInfo table (Stores document processing event, 5 billion rows)
DocId – PK, FK
EventTypeId – PK, FK
ProcessDate – PK, FK
DocumentSourceId – PK, FK
ProcessJobId – PK, FK
.. 5 other attributes
The DW query requirements are, number/size of documents per user, average processing time, how many times documents got processed etc.
Grain needed is document level
For the dimensional model for this, the options are
Option 1.
DimDocument with snowflaking of DimDocumentUser and DimUser
DimDocument table
DocKey
DocAltKey
… 10 other columns
DimDocumentUser table
DocUserKey
DocKey – FK
UserKey – FK
Type
FactDocumentEvent table
DocKey
EventKey (table omitted for brevity)
ProcessDateKey
DocumentSourceKey (table omitted for brevity)
JobKey
The issues with this approach are DimDocument and DimDocumentUser can grow to 1 and 10 billion rows respectively, super monster dimensions, which is not recommended. Snowflaking complexity. Also, it is recommended not to partition the dimension table per MSSQL Server DW best practices.
Option 2.
DimDocument table ( can grow to 1 billion)
DimUser table (not too large, few millions)
FactDocumentUser table (can grow to 10 billion, partition candidate)
FactDocumentEvent table (can grow to 5-10 billion, partition candidate)
The issues with this approach are
a.Monstrous DimDocument dimension is still present in this option.
b.How to get document user details along with document event data since FactDocumentUser and FactDocumentEvent tables cannot be joined e.g. show all the details of the top 100 documents that took longest to process. If UserKey is included in FactDocumentEvent, the table can grow tremendously since each document can have avg of 10 users and each document can have avg of 5-10 events. So with potential of 1 billion documents x 10 users x 10 events = 100 billion.
Your guidance on this design would be appreciated
Document table (partitioned table, approx. 1 billion rows)
DocId -PK
Title
DocDate
… 10 other columns
Users table
UserId -PK
UserName
.. 3 other columns
DocumentUsers (joining Document and Users table) (partitioned table, about 10x Document rows = approx. 10 billion rows)
DocId – PK, FK
UserId – PK, FK
Type – PK (user role in the document, e.g. in an email document it would store recipient type of To, From, CC etc.)
DocumentProcessInfo table (Stores document processing event, 5 billion rows)
DocId – PK, FK
EventTypeId – PK, FK
ProcessDate – PK, FK
DocumentSourceId – PK, FK
ProcessJobId – PK, FK
.. 5 other attributes
The DW query requirements are, number/size of documents per user, average processing time, how many times documents got processed etc.
Grain needed is document level
For the dimensional model for this, the options are
Option 1.
DimDocument with snowflaking of DimDocumentUser and DimUser
DimDocument table
DocKey
DocAltKey
… 10 other columns
DimDocumentUser table
DocUserKey
DocKey – FK
UserKey – FK
Type
FactDocumentEvent table
DocKey
EventKey (table omitted for brevity)
ProcessDateKey
DocumentSourceKey (table omitted for brevity)
JobKey
The issues with this approach are DimDocument and DimDocumentUser can grow to 1 and 10 billion rows respectively, super monster dimensions, which is not recommended. Snowflaking complexity. Also, it is recommended not to partition the dimension table per MSSQL Server DW best practices.
Option 2.
DimDocument table ( can grow to 1 billion)
DimUser table (not too large, few millions)
FactDocumentUser table (can grow to 10 billion, partition candidate)
FactDocumentEvent table (can grow to 5-10 billion, partition candidate)
The issues with this approach are
a.Monstrous DimDocument dimension is still present in this option.
b.How to get document user details along with document event data since FactDocumentUser and FactDocumentEvent tables cannot be joined e.g. show all the details of the top 100 documents that took longest to process. If UserKey is included in FactDocumentEvent, the table can grow tremendously since each document can have avg of 10 users and each document can have avg of 5-10 events. So with potential of 1 billion documents x 10 users x 10 events = 100 billion.
Your guidance on this design would be appreciated
Last edited by dwfavor on Tue Feb 01, 2011 12:09 pm; edited 1 time in total (Reason for editing : Additional design point)
dwfavor- Posts : 4
Join date : 2011-02-01
Re: Monster dimension, joining fact tables
What is it about the document dimension that user's typically select on? Have you considered breaking down the dimension?
Re: Monster dimension, joining fact tables
Thanks for your reply. Typically, Document dimension will get used two ways.
1. Show details of all the documents participating in an aggregate. For example, show all documents details (with associated users) which took more than 100ms to process. Another example, show all documents within certain size that were processed yesterday.
2. Looking for specific document(s) by DocumentAltKey, DocumentDate. For example, show the processing details of specific document id or all the documents within specific document date.
1. Show details of all the documents participating in an aggregate. For example, show all documents details (with associated users) which took more than 100ms to process. Another example, show all documents within certain size that were processed yesterday.
2. Looking for specific document(s) by DocumentAltKey, DocumentDate. For example, show the processing details of specific document id or all the documents within specific document date.
dwfavor- Posts : 4
Join date : 2011-02-01
Re: Monster dimension, joining fact tables
But what are those document details? By breaking these details out into new dimensions and storing FKs to those details in the fact, you can significantly reduce the size of the dimension tables used in queries and significantly improve performance when selecting documents based on those attributes.
For example, document size could appear as a degenerate dimension in the fact table.
If I was to have a document dimension, about the only thing I would have in there would be the name of the document and it's URL or file name. I would also break the document name down into keywords and build a bridge structure so that facts could be searched by keyword rather than scanning the entire document dimension.
For example, document size could appear as a degenerate dimension in the fact table.
If I was to have a document dimension, about the only thing I would have in there would be the name of the document and it's URL or file name. I would also break the document name down into keywords and build a bridge structure so that facts could be searched by keyword rather than scanning the entire document dimension.
Re: Monster dimension, joining fact tables
Document details
DocumentId = Content hash based id, unique, search-able
NativeId = Id at the source, e.g. exchange id of email message
Title = e.g. email subject
DocumentDate
DocumentType (integer type)
PlatformType (integer type)
Size
EmbeddedDocumentCount
SourceLocation = e.g. source directory path
So even if Document dimension is broken out into common attributes, number of the unique documents in the normalized dimension could still be in billions. That would be an issue.
DocumentId = Content hash based id, unique, search-able
NativeId = Id at the source, e.g. exchange id of email message
Title = e.g. email subject
DocumentDate
DocumentType (integer type)
PlatformType (integer type)
Size
EmbeddedDocumentCount
SourceLocation = e.g. source directory path
So even if Document dimension is broken out into common attributes, number of the unique documents in the normalized dimension could still be in billions. That would be an issue.
dwfavor- Posts : 4
Join date : 2011-02-01
Re: Monster dimension, joining fact tables
You are not going to avoid a big table with document info such as file name, id, title and so on, but, you can create much smaller dimensions to cover what people would normally search on.
So, if you were to take things such as document date and make it a FK to the date dimension in the fact, and make a junk (aka mini) dimension out of document and platform type, and put size and # embedded documents as degenerate dimensions, you can significantly improve query times because fact selection occurs against much smaller dimension tables. Once facts are selected, it would then join to the huge dimension to get information about documents of interest (if needed for the query). Most databases will handle the query this way if you do not have any predicates on attributes in the huge dimension.
So, if you were to take things such as document date and make it a FK to the date dimension in the fact, and make a junk (aka mini) dimension out of document and platform type, and put size and # embedded documents as degenerate dimensions, you can significantly improve query times because fact selection occurs against much smaller dimension tables. Once facts are selected, it would then join to the huge dimension to get information about documents of interest (if needed for the query). Most databases will handle the query this way if you do not have any predicates on attributes in the huge dimension.
Re: Monster dimension, joining fact tables
These are very interesting ideas. Let me rework the design and come back with more questions, if any.
Thanks for your prompt and thorough replies.
Thanks for your prompt and thorough replies.
dwfavor- Posts : 4
Join date : 2011-02-01
Similar topics
» DegenerateDimension joining Fact tables
» Joining fact tables
» Joining Fact tables
» Joining fact tables
» Joining two Fact tables???
» Joining fact tables
» Joining Fact tables
» Joining fact tables
» Joining two Fact tables???
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum