Beginning Data Modeling for DW
4 posters
Page 1 of 1
Beginning Data Modeling for DW
Hello,
Please bear with me as these are a beginner's struggle to implement the DW the best way possible. I recently started studying about Kimball modeling.
I would like to ask a few questions on common practices on modeling the DW. I need to store historical data all the time; and I'm thinking I should use SCD2.
The following are a few tables. I just need a little help grouping them into a fact/dim category and if there are any intermediate tables that I need to group them appropriately.
Policy (Dim/Fact)
PolicyHistory (Dim/Fact)
======================================================
Id Policy# Term Eff.Date Exp. Date TermPremium Billed Written Status CommissionCode
---------------------------------------------------------------------------------------------------------------------------
1 AB0001 1 1/1/2000 5/31/2000 $1000 900 900 ACTIVE CO-130
2 AB0001 2 6/1/2000 12/31/2000 $1900 94.8 1100 CANCELLED TR-1A0
3 AB0001 3 1/1/2001 5/31/2001 $1200 302.44 1000 ACTIVE BB-1C0
AutoCoverages (AutoCoverageHistory) (Dim/Fact)?
================================================
Id PolicyId Type Premium Billed Written Fee1 Fee2
------------------------------------------------------------------------------
1 1 BI $10 $20 $10 $1 $5
2 1 COLL $20 $30 $20 $1 $5
3 1 PD $30 $40 $20 $1 $5
HomeCoverages
=============
Id PolicyId Type Premium Billed Written Fee1 Fee2
------------------------------------------------------------------------------
1 1 COVA $10 $20 $10 $1 $5
2 1 COVB $20 $30 $20 $1 $5
3 1 COVC $30 $40 $20 $1 $5
The dimensions are
AutoDriver (auto policy)
AutoVehicle (auto policy)
HomeLocation(home policy)(city, state, zip)
Questions:
==========
1. What criteria should I use in the above tables to group them into Fact/Dim? All the tables can be categorized into Fact/Dim depending on how you look at it. This is where I'm stumped.
2. Surrogate Keys.
In the above example, I have a policy with 3 rows, each with different term.
a. Should I use the combination of Policy# and Term as the natural key?
b. Or should I just have the same PolId with (IDENTITY) in addition to Surrogate Key?
3. Since AutoCoverages and HomeCoverages are subsets of Policy, how should I handle this design?
4. Is the history maintained in OLTP or OLAP?
I. Run ETL every night to capture changes and bring them over to DW. Which of the following methodologies is used?
1) Use checksum to determine if a row has changed and then process the rows accordingly.
2) Create triggers on source tables (I, U, D) and process the queue frequently (or at the end of the day).
Any help is greatly appreciated.
Thanks,
RK.
Please bear with me as these are a beginner's struggle to implement the DW the best way possible. I recently started studying about Kimball modeling.
I would like to ask a few questions on common practices on modeling the DW. I need to store historical data all the time; and I'm thinking I should use SCD2.
The following are a few tables. I just need a little help grouping them into a fact/dim category and if there are any intermediate tables that I need to group them appropriately.
Policy (Dim/Fact)
PolicyHistory (Dim/Fact)
======================================================
Id Policy# Term Eff.Date Exp. Date TermPremium Billed Written Status CommissionCode
---------------------------------------------------------------------------------------------------------------------------
1 AB0001 1 1/1/2000 5/31/2000 $1000 900 900 ACTIVE CO-130
2 AB0001 2 6/1/2000 12/31/2000 $1900 94.8 1100 CANCELLED TR-1A0
3 AB0001 3 1/1/2001 5/31/2001 $1200 302.44 1000 ACTIVE BB-1C0
AutoCoverages (AutoCoverageHistory) (Dim/Fact)?
================================================
Id PolicyId Type Premium Billed Written Fee1 Fee2
------------------------------------------------------------------------------
1 1 BI $10 $20 $10 $1 $5
2 1 COLL $20 $30 $20 $1 $5
3 1 PD $30 $40 $20 $1 $5
HomeCoverages
=============
Id PolicyId Type Premium Billed Written Fee1 Fee2
------------------------------------------------------------------------------
1 1 COVA $10 $20 $10 $1 $5
2 1 COVB $20 $30 $20 $1 $5
3 1 COVC $30 $40 $20 $1 $5
The dimensions are
AutoDriver (auto policy)
AutoVehicle (auto policy)
HomeLocation(home policy)(city, state, zip)
Questions:
==========
1. What criteria should I use in the above tables to group them into Fact/Dim? All the tables can be categorized into Fact/Dim depending on how you look at it. This is where I'm stumped.
2. Surrogate Keys.
In the above example, I have a policy with 3 rows, each with different term.
a. Should I use the combination of Policy# and Term as the natural key?
b. Or should I just have the same PolId with (IDENTITY) in addition to Surrogate Key?
3. Since AutoCoverages and HomeCoverages are subsets of Policy, how should I handle this design?
4. Is the history maintained in OLTP or OLAP?
I. Run ETL every night to capture changes and bring them over to DW. Which of the following methodologies is used?
1) Use checksum to determine if a row has changed and then process the rows accordingly.
2) Create triggers on source tables (I, U, D) and process the queue frequently (or at the end of the day).
Any help is greatly appreciated.
Thanks,
RK.
raghuk- Posts : 8
Join date : 2009-06-16
Re: Beginning Data Modeling for DW
I highly recommend that you read the Insurance chapter of The Data Warehouse Toolkit, 2nd Edition (starting on page 305). It addresses a lot of your questions.
Re: Beginning Data Modeling for DW
Joy wrote:I highly recommend that you read the Insurance chapter of The Data Warehouse Toolkit, 2nd Edition (starting on page 305). It addresses a lot of your questions.
Joy,
Thanks for a quick reply. I just placed an order for this book from Amazon after reading your message.
Can you please let me know your thoughts about other general questions?
I have other situations where I cannot determine if a table should be deemed a fact or dim. Am I wrong to think that all the data in the warehouse is either a fact or a dim? or are there situations where a table is both used as Fact and Dim? What is your recommendation in the afore-mentioned situation? Probably I'll have to study more on this in another book that I placed the order for 'The Complete Guide To Dimensional Modeling' by RKimball and MRoss.
It would be good to hear your thoughts before I get my hands on those books.
Questions:
==========
1. What criteria should I use in the above tables to group them into Fact/Dim? All the tables can be categorized into Fact and Dim both depending on how you look at it. This is where I'm stumped.
2. Surrogate Keys.
In the above example, I have a policy with 3 rows, each with different term.
a. Should I use the combination of Policy# and Term as the natural key?
b. Or should I just have the same PolId with (IDENTITY) in addition to Surrogate Key?
4. Is the history maintained in OLTP or OLAP?
I. Run ETL every night to capture changes and bring them over to DW. Which of the following methodologies is used?
1) Use checksum to determine if a row has changed and then process the rows accordingly.
2) Create triggers on source tables (I, U, D) and process the queue frequently (or at the end of the day).
Thanks,
raghuk- Posts : 8
Join date : 2009-06-16
Fact or Dim Determination
A common myth is that a given source table will only ever be use as a fact or dimension table in a dimensional model.
One given business process a business data entity like employee may be a dimension while for another it may be a fact (e.g. HR metrics).
The key is that you follow a process of design analysis that really does not look at specific data tables and columns but focuses on what you are trying to create with the dimensional model. You seem very data centric, as am I; as such it is really comfortable to just look at data and try to classify it as fact or dimension.
The process really invovles getting to a whiteboard and answering the following questions (from DW Toolkit Ch2):
1. What business process are you modeling and what are the interesting questions that come out of those processes?
2. What is the grain for the business process - What will a row in the fact table represent?
3. For fact rows, how do you describe/classify/group those measurements? [Dimensions]
4. What are the numeric (usually additive) measurements of the business process? [Facts]
After answering those questions you begin to revist data to determine where that information resides and how it would map to the design generated out of yours answers.
One given business process a business data entity like employee may be a dimension while for another it may be a fact (e.g. HR metrics).
The key is that you follow a process of design analysis that really does not look at specific data tables and columns but focuses on what you are trying to create with the dimensional model. You seem very data centric, as am I; as such it is really comfortable to just look at data and try to classify it as fact or dimension.
The process really invovles getting to a whiteboard and answering the following questions (from DW Toolkit Ch2):
1. What business process are you modeling and what are the interesting questions that come out of those processes?
2. What is the grain for the business process - What will a row in the fact table represent?
3. For fact rows, how do you describe/classify/group those measurements? [Dimensions]
4. What are the numeric (usually additive) measurements of the business process? [Facts]
After answering those questions you begin to revist data to determine where that information resides and how it would map to the design generated out of yours answers.
JoeSalvatore- Posts : 4
Join date : 2009-06-19
Re: Beginning Data Modeling for DW
Think transaction or event, these types of entities are prime candidates for fact tables. For dimensions, think context. Entities that provide context for a transaction or event are good candidates for dimensions.raghuk wrote:Hello,
--snip--
==========
1. What criteria should I use in the above tables to group them into Fact/Dim? All the tables can be categorized into Fact/Dim depending on how you look at it. This is where I'm stumped.
Ralph has written extensively on using surrogate keys. Use them here.
2. Surrogate Keys.
In the above example, I have a policy with 3 rows, each with different term.
a. Should I use the combination of Policy# and Term as the natural key?
b. Or should I just have the same PolId with (IDENTITY) in addition to Surrogate Key?
Several options available for you here, a hierarchy, different dimensions, even fact tables.3. Since AutoCoverages and HomeCoverages are subsets of Policy, how should I handle this design?
I use the easiest method supported by the application. If they support sending me deltas, I use deltas. These are all just different tools in the toolbox. My goal is to process the least amount of data to support the business requirements. History is supported in the warehouse. It may also be supported in the OLTP application.4. Is the history maintained in OLTP or OLAP?
I. Run ETL every night to capture changes and bring them over to DW. Which of the following methodologies is used?
1) Use checksum to determine if a row has changed and then process the rows accordingly.
2) Create triggers on source tables (I, U, D) and process the queue frequently (or at the end of the day).
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Similar topics
» Data Modeling for UserUtilization
» Dimension Modeling for Big Data
» Data Warehouse Modeling
» Uses of a data modeling tool
» Physician Network Data Modeling
» Dimension Modeling for Big Data
» Data Warehouse Modeling
» Uses of a data modeling tool
» Physician Network Data Modeling
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum