need solution for parallel load late arriving dimension
need solution for parallel load late arriving dimension
Hello,
We have implemented a data lake (a PDM based on the IBM Banking DW) on a Netezza appliance, and we use DataStage as the ETL tool.
I need to design multiple ETL flows for data coming from the same source system (events such as the opening of an account, and transactions such as a payment at the grocery store or an incoming salary credit). These flows would run every 20 minutes and load several different tables plus a common table (columns: surrogate key, source system id, unique id in the source system = account number).
The trouble lies in the following:
- The ETL flows must be able to run in parallel.
- When they run in parallel, they could load the same reference row into the common table, because an insert made by one load is not visible to another session until that batch insert completes and implicitly commits. So if two loads process the same account, not yet existing in the target, at the same time, we will have duplicates.
We have a "one commit" principle at our site, so we cannot load account_target during the processing of an ETL flow; it can only be loaded as the last step of the flow. Hence the issue.
I suggested using an intermediate table that we commit/write to during the execution of the job, using the Netezza connector's brute-force primary key validation: https://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.netezza.use.doc/topics/netezza_config_prinmary_key_validation.html
and then using this intermediate table to load the fact/relationship tables.
So, with the brute-force method, I ensure that only one of the conflicting/potentially duplicate rows finds its way into the database, and the rest of the jobs that build the relationship tables/subtypes use the surrogate key that made it into the intermediate table - roughly as in the sketch below.
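To make the idea concrete, a minimal sketch under assumed names (int_account_key as the intermediate table, stg_events as the staged input, seq_account_sk as a sequence for the surrogate key); the linked connector option is what is meant to let only one of the potential duplicates through:

-- mid-job step, committed immediately: stage surrogate keys for the accounts seen in this run
-- (the connector's brute-force primary key validation is intended to keep only one row per account)
INSERT INTO int_account_key (account_sk, source_system_id, account_number)
SELECT NEXT VALUE FOR seq_account_sk, e.source_system_id, e.account_number
FROM   (SELECT DISTINCT source_system_id, account_number FROM stg_events) e;

-- later steps: the jobs that build the fact/relationship/subtype tables reuse whatever
-- surrogate key made it into the intermediate table
SELECT k.account_sk, e.event_ts, e.event_type
FROM   stg_events e
JOIN   int_account_key k
  ON   k.source_system_id = e.source_system_id
 AND   k.account_number   = e.account_number;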
However, my solution was turned down by the architect and I am now looking for alternatives.
From your experience, do you know of one, apart from:
- Having a separate flow for the common table/lookup
- Serializing the ETL flows
Thanks
Iamundercover
Re: need solution for parallel load late arriving dimension
Well, nothing like making up 'standards' that only serve to cause problems.
Why does the ETL flow need to run in parallel? On Netezza it doesn't buy you anything; throughput would be pretty much the same whether the jobs run serially or in parallel.
What is the common table and how do you expect multiple processes to update it without getting a serialization error?
Re: need solution for parallel load late arriving dimension
I agree
We have data coming from the same source system, which loads:
Every 20 minutes (FACTs):
- ETL Flow I: Events - opening of an account, etc.
- ETL Flow II: Transactions - payment made at the grocery store, incoming salary credit, etc.
Once a day (DIM):
- ETL Flow III: Daily inventory of all the accounts, which feeds the DIM (subtype).
The design pattern we have chosen for late arriving dimensions is that each flow loads the lookup/reference/supertype table (columns: surrogate key, source system id, unique id in the source system = account number). We want to avoid:
- Introducing dependencies among ETL Flows I, II and III
- Loading duplicates into the supertype/reference/lookup table, which can happen for a technical reason: when the jobs of ETL Flows I, II or III that load the supertype table execute in parallel, each checks whether a surrogate key has already been generated for the natural key; if none is found yet, each tries to create one, and we get duplicates (see the sketch below).
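For illustration, a sketch of the key-assignment step each of these flows runs (table, column and sequence names are hypothetical); two sessions executing it concurrently can each see no existing row and both insert:

-- create a surrogate key for any natural key not yet present in the supertype table
INSERT INTO lkp_account (account_sk, source_system_id, account_number)
SELECT NEXT VALUE FOR seq_account_sk, s.source_system_id, s.account_number
FROM   (SELECT DISTINCT source_system_id, account_number FROM stg_input) s
WHERE  NOT EXISTS (SELECT 1 FROM lkp_account l
                   WHERE l.source_system_id = s.source_system_id
                     AND l.account_number   = s.account_number);
-- Netezza does not enforce primary key or unique constraints, so if two flows run this
-- before either has committed, both see "no existing row" and the account is keyed twice.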
Iamundercover
Re: need solution for parallel load late arriving dimension
Why not have a durable alternate key in the dimension (a type 1 key) and store that in the fact? You would then use the business timestamp of the fact against the business timestamps of the dimension to locate the correct row. This does slightly alter the queries, but on a Netezza box it is probably not very costly, and you can put the logic in a view, making it transparent to users. Distribution and organization of the tables can significantly help performance if this is a particularly large dimension.
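A minimal sketch of such a view, with hypothetical table and column names (a durable key on both fact and dimension, plus effective-date columns on the dimension):

-- resolve the dimension row that was in effect at the fact's business timestamp
CREATE VIEW v_transaction_account AS
SELECT f.*,
       d.account_sk
FROM   fact_transaction f
JOIN   dim_account d
  ON   d.account_durable_key = f.account_durable_key
 AND   f.transaction_ts >= d.effective_from_ts
 AND   f.transaction_ts <  d.effective_to_ts;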
Re: need solution for parallel load late arriving dimension
When you load facts from ETL Flows I and II presumably you are populating them with the Account dummy SK value which is also going into your lookup table?
When you load your Account Dim in ETL Flow III all you need to do is translate your dummy SKs into the real SKs and update your fact tables - standard Late Arriving Dimension processing as described by Kimball.
It makes no difference if you have "duplicates" in your lookup table, as you are going to translate them to the same real SK, e.g.:
Lookup Table has:
Source System = A; Source Natural Key = ABC; Dummy SK = 123
Source System = A; Source Natural Key = ABC; Dummy SK = 456
Dimension Load has:
Source System = A; Source Natural Key = ABC; Real SK = 987
Using the dimension load data to find records in your lookup table, you identify that any fact table rows with values of 123 or 456 in their Account SK column need to have this updated to 987, for example:
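A sketch of that translation step (table and column names assumed; the exact UPDATE ... FROM syntax may vary by platform):

-- map each dummy SK in the lookup table to the real SK just loaded in the dimension,
-- then rewrite the fact rows that still carry a dummy SK
UPDATE fact_transaction
SET    account_sk = sk_map.real_sk
FROM  (SELECT l.dummy_sk, d.account_sk AS real_sk
       FROM   lookup_account l
       JOIN   dim_account d
         ON   d.source_system = l.source_system
        AND   d.natural_key   = l.natural_key) sk_map
WHERE  fact_transaction.account_sk = sk_map.dummy_sk;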
Regards,
nick_white
Re: need solution for parallel load late arriving dimension
@ngalemmo
The FACTs get loaded throughout the day (every 20 minutes) and are queried throughout the day as well. Thus, we would like the FACTs to carry the correct SK for the NK of the dimension from our lookup table right away, so that we don't have to play catch-up all day long.
Iamundercover
Re: need solution for parallel load late arriving dimension
It is not clear if you are working with a type 1 or a type 2 dimension. In the case of type 2, using a durable key and a timestamp to locate the correct late arriving dimension is a common approach.
If you are dealing with a type 1, things are much simpler. All you need to do is infer a dimension row if you encounter a new natural key when loading facts. You create a new dimension row using the natural key. Assign the surrogate key to the fact. Dimension data will be populated later when the dimension feed comes in. The fact will reference the populated row.
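A minimal sketch of that approach, with hypothetical names (dim_account keyed by account_number, stg_fact as the staged facts, seq_account_sk as a sequence):

-- 1) infer a placeholder dimension row for any natural key in the facts not yet in the dimension
INSERT INTO dim_account (account_sk, account_number)
SELECT NEXT VALUE FOR seq_account_sk, s.account_number
FROM   (SELECT DISTINCT account_number FROM stg_fact) s
WHERE  NOT EXISTS (SELECT 1 FROM dim_account d
                   WHERE d.account_number = s.account_number);

-- 2) load the facts, resolving the surrogate key through the natural key
INSERT INTO fact_transaction (account_sk, transaction_ts, amount)
SELECT d.account_sk, s.transaction_ts, s.amount
FROM   stg_fact s
JOIN   dim_account d ON d.account_number = s.account_number;

-- 3) when the daily dimension feed arrives, a type 1 update overwrites the placeholder
--    attributes in place; the facts already reference the (now populated) row
UPDATE dim_account
SET    account_name = f.account_name
FROM   stg_dim_feed f
WHERE  dim_account.account_number = f.account_number;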
Re: need solution for parallel load late arriving dimension
@nick_white
No, it's not a dummy SK value that we load. We receive the NK in the FACT records, and that NK is used to create the SK in the supertype table (SCD Type 0). All that table has is the SK, the NK and model-related/audit columns.
Lookup Table has:
Source System = A; Source Natural Key = ABC; Dummy SK = 123
The record below should not be created:
Source System = A; Source Natural Key = ABC; Dummy SK = 456
The dimension load will utilize the SK already created in the lookup table and load the subtype table with its attributes.
Iamundercover
Re: need solution for parallel load late arriving dimension
@ngalemmo
It's SCD Type 0 in the supertype/lookup/reference table. Thus, we just want the first insert to go through and the rest to utilize it. However, as the lookup table can be loaded by any of the ETL flows (I, II and III), I am looking for a technical solution to avoid duplicates.
Iamundercover
Re: need solution for parallel load late arriving dimension
Why is there a single lookup/reference table?
Dimensions should be in their own tables. The dimension should contain the NK and SK. When you infer dimension rows you do so in the dimension itself.
The simplest way to avoid two jobs inserting duplicate new rows into the same table is to not have two jobs inserting into the same table at the same time. Get rid of the common global key table and fix your job scheduling.
Re: need solution for parallel load late arriving dimension
You said you are putting the NKs in the fact table, but that breaks the basics of (standard) dimensional modelling. There are a few reasons to put NKs in fact tables, but normally that is to solve complex SCD/temporal reporting issues, and the NKs complement, rather than replace, the SKs.
You seem to have a pretty straightforward late arriving dimension issue but have tried to solve it with a super/subtype solution that, IMO, is never going to work and is unnecessarily complicated. Just read Kimball's solution to Late Arriving Dimensions and implement that.
nick_white