Incremental Extraction Methods From Highly Normalized Source
Incremental Extraction Methods From Highly Normalized Source
Hi,
I need help implementing the scenario below. The source system is highly normalized: say Table A is related to Table B on Key1, Table B to Table C on Key2, Table C to Table D on Key3, and Table D to Table E on Key4. The requirement is to load daily changed data into the DWH. Typically, on a given day, data in A need not have any data yet in the child tables (Table B, Table C, etc.), so the delta arriving for Table E on a given day might relate to data that arrived in Table A 3-4 days earlier. To get the complete dataset from all 5 tables, I have to join all the tables and use an OR condition on the modified date of each table. That means I am actually joining the complete dataset across all the tables and only then filtering the delta based on each table's modified date. This approach puts extra load on the source as well as on ETL processing, so it might not serve daily loads well.
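Roughly, the extraction I am describing looks like this (table and column names are just placeholders, not our real schema):

    -- Join the full chain, then keep any row where at least one table changed since the last run.
    SELECT a.case_id, b.life_id, c.person_id, d.benefit_id, e.decision_code
    FROM      table_a a
    LEFT JOIN table_b b ON b.key1 = a.key1
    LEFT JOIN table_c c ON c.key2 = b.key2
    LEFT JOIN table_d d ON d.key3 = c.key3
    LEFT JOIN table_e e ON e.key4 = d.key4
    WHERE  a.modified_date >= :last_run
       OR  b.modified_date >= :last_run
       OR  c.modified_date >= :last_run
       OR  d.modified_date >= :last_run
       OR  e.modified_date >= :last_run;
    -- The OR conditions mean no single table's modified-date index can restrict the join,
    -- so effectively the full history of all five tables is scanned every day.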
To make the example clearer with an insurance scenario: an underwriting case is created in Table A. Table B might hold data on the lives assured against that case, Table C might hold personal details for those lives, ... and eventually Table E might hold the decision taken on the underwriting case created in A. The entire lifecycle typically does not happen in one day, which rules out joining the delta extraction from each table. The business also confirmed there is no maximum timeframe in which the A-to-E lifecycle must complete, which rules out the option of only considering the last few months of data from each table.
Is there a better way to handle this? As the data grows, we really don't want the daily loads running ever longer.
Thanks.
PS1 - Key1, Key2, Key3 etc. are source-specific keys that have no significance to the data warehouse. There is no way to join A to C without going through B, or A to D without going through B and C.
PS2 - The business wants to report on the data even when the lifecycle is not complete, e.g. they want to see all cases created, or the number of lives covered in those cases. So our DWH table needs to hold this information with the decision left unpopulated for some cases; it will be populated later, once a decision is available.
miteshc
Incremental Extraction Methods From Highly Normalized Source
One more thing: here I have given an example with only 5 tables. In reality we always have to pull data from 10+ tables to complete the dataset.
miteshc
Re: Incremental Extraction Methods From Highly Normalized Source
Hi - don't these tables all load into separate dimensions/facts in your target? A = Case Dim/Fact, B (possibly joined to C) = People Dim, ..., E = Decision Dim/Fact, etc.
I'm not clear why you need to join all this data into a single dataset in order to load it into your target. For example, if the life/personal information in tables B/C changes, you can load it into the relevant Dim(s) - you don't need to go anywhere near the source Case table A, do you?
If table A is the source for a Case Fact then you would load the related Dims first from the relevant tables and then load the fact from table A - which presumably already contains the FKs necessary to look up the Dim SKs you'll add to the Fact table?
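Something along these lines - every table/column name here is made up for illustration, and it assumes table A really does carry the relevant source keys:

    -- Load the fact from A's delta after the dims are loaded,
    -- resolving each dimension surrogate key from the source keys held on A.
    INSERT INTO fact_case (case_sk, party_sk, decision_sk, created_date_sk)
    SELECT  dc.case_sk,
            COALESCE(dp.party_sk, -1),        -- -1 = "unknown / not yet available" member
            COALESCE(dd.decision_sk, -1),
            COALESCE(dt.date_sk, -1)
    FROM       stg_table_a a
    JOIN       dim_case     dc ON dc.case_natural_key     = a.case_key
    LEFT JOIN  dim_party    dp ON dp.party_natural_key    = a.party_key
    LEFT JOIN  dim_decision dd ON dd.decision_natural_key = a.decision_key
    LEFT JOIN  dim_date     dt ON dt.calendar_date        = a.created_date
    WHERE a.modified_date >= :last_run;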
If for some reason you do need a complete dataset to load into your target, then why not hold this as denormalised record(s) in your staging area: take extracts from your individual tables, use these to update your staging area record(s), and then load your target from your staging area.
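Purely as a sketch (names invented) - you would have one such statement per source table, each touching only that table's delta:

    -- Patch the denormalised staging row for a case from table E's daily delta.
    -- The staging row will already hold key4 from when the table D row arrived.
    MERGE INTO stg_case_denorm t
    USING (SELECT key4, decision_code, modified_date
           FROM   src_table_e
           WHERE  modified_date >= :last_run) e
    ON (t.key4 = e.key4)
    WHEN MATCHED THEN
      UPDATE SET t.decision_code = e.decision_code,
                 t.last_updated  = e.modified_date;
    -- The target load then reads stg_case_denorm, picking up rows whose
    -- last_updated has moved since the previous run.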
nick_white
Re: Incremental Extraction Methods From Highly Normalized Source
Thank you Nick for your response.
Actually there is no measure information attached. The reporting requirement is basically counts of cases per benefit, gender, age, etc. The attributes from these different tables together form a dataset describing the UW case. Table A acts as the master here. So if it has 100 records on the first day, Table B might have only 80 of those on the first day, Table C might have only 60 of those 80 from B, and so on. The next day A might get 10 more records, whereas B might get 20 of the previous day's plus maybe 5 of today's, etc.
This is why I want them as a complete dataset; the individual tables cannot be treated as Dims. An ODS or key-store table, as you suggested, is what we were considering as well. And when the entire lifecycle is done, we might delete the record from that key-store table so it only keeps the cases that are not yet complete.
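Rough idea of the key-store table we had in mind (column names and sizes are only indicative):

    -- One row per open case; child keys are filled in as the related rows arrive.
    CREATE TABLE etl_case_keystore (
        key1            VARCHAR(30) NOT NULL,    -- case key from Table A
        key2            VARCHAR(30),             -- filled when the Table B row arrives
        key3            VARCHAR(30),             -- filled when the Table C row arrives
        key4            VARCHAR(30),             -- filled when the Table D row arrives
        decision_loaded CHAR(1) DEFAULT 'N',     -- set to 'Y' once Table E has been processed
        last_updated    DATE,
        CONSTRAINT pk_case_keystore PRIMARY KEY (key1)
    );

    -- Housekeeping after each load: drop cases whose lifecycle has completed.
    DELETE FROM etl_case_keystore WHERE decision_loaded = 'Y';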
miteshc
Re: Incremental Extraction Methods From Highly Normalized Source
The source for a data warehouse usually includes a 3NF database, so what you are doing doesn't seem out of the ordinary. Unless every dimension/fact table in your target takes data from every source table, I would not expect you to need to join everything in your source in order to load your target - you might have to join some of your tables to populate each Dim, but not all of them, e.g. source tables 1, 2 & 3 populate Dim X, source tables 2 & 3 populate Dim Y, source table 4 populates Dim Z, etc.
However, you obviously know your data structures better than me so if you really do need to do this then the ODS/Key Store/Staging Area is the obvious place to do it.
Regards,
nick_white
Re: Incremental Extraction Methods From Highly Normalized Source
miteshc wrote: Thank you Nick for your response.
Actually there is no measure information attached. The reporting requirement is basically counts of cases per benefit, gender, age, etc. The attributes from these different tables together form a dataset describing the UW case. Table A acts as the master here. So if it has 100 records on the first day, Table B might have only 80 of those on the first day, Table C might have only 60 of those 80 from B, and so on. The next day A might get 10 more records, whereas B might get 20 of the previous day's plus maybe 5 of today's, etc.
This is why I want them as a complete dataset; the individual tables cannot be treated as Dims. An ODS or key-store table, as you suggested, is what we were considering as well. And when the entire lifecycle is done, we might delete the record from that key-store table so it only keeps the cases that are not yet complete.
Are you building a dimensional model or are you just making life difficult for yourself? If you are not building a dimensional model, just pull the data from individual tables as-is and don't worry about it. Your extracted tables will change as the source tables change and things will be fine.
If you are building a dimensional model, then do so. You have a fact table that connects the various dimensions together. A fact table does not need to have measures… it would be reflecting the state of a case. There is no need to coordinate changes across tables. Only in unusual cases where two tables exist with a 1:1 relationship would you even consider extracting together.
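For example, something along these lines (all names invented) would cover the counts you mention, with the decision key simply pointing at an "unknown" member until a decision arrives:

    -- A measure-less fact recording the state of each underwriting case.
    CREATE TABLE fact_underwriting_case (
        case_sk          INT NOT NULL,              -- from Table A, via the case dimension
        party_sk         INT NOT NULL DEFAULT -1,   -- from Tables B/C, -1 until known
        benefit_sk       INT NOT NULL DEFAULT -1,   -- from Table D, -1 until known
        decision_sk      INT NOT NULL DEFAULT -1,   -- from Table E, -1 until a decision exists
        case_created_dsk INT NOT NULL,              -- date key for case creation
        decision_dsk     INT NOT NULL DEFAULT -1    -- date key for the decision, -1 until known
    );

    -- Counts such as "lives covered per benefit" are then simple aggregations:
    SELECT benefit_sk, COUNT(DISTINCT party_sk) AS lives_covered
    FROM   fact_underwriting_case
    GROUP BY benefit_sk;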