need solution for parallel load late arriving dimension
need solution for parallel load late arriving dimension
Hello,
We have implemented a data lake (a PDM based on the IBM Banking DW) on a Netezza appliance, and we use DataStage as the ETL tool.
I need to design multiple ETL flows for data coming from the same source system (events such as the opening of an account, and transactions such as a payment at the grocery store or an incoming salary credit). These flows would run every 20 minutes and load several different tables plus a common table (columns: surrogate key, source system id, unique id in the source system = account number).
The trouble lies in the following:
- The ETL flows must be able to run in parallel.
- When they run in parallel, they could load the same reference row into the common table, because an insert made by one load is not visible to another session until that batch insert completes and implicitly commits. So if two loads process the same account, not yet existing in the target, at the same time, we will have duplicates.
We have a "one commit" principle at our site, so we cannot load account_target during the processing of an ETL flow; it can only be loaded as the last step of the flow. Hence the issue.
I suggested using an intermediate table that we commit/write to during the execution of the job, using the Netezza connector's brute-force primary key validation: https://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.netezza.use.doc/topics/netezza_config_prinmary_key_validation.html
and then using this intermediate table to load the fact/relationship tables.
So, with the brute-force method, I ensure that only one of the conflicting/potentially duplicate rows finds its way into the database, and the rest of the jobs that build the relationship tables/subtypes use the surrogate key that made it into the intermediate table - roughly as in the sketch below.
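To make the idea concrete, a minimal sketch under assumed names (int_account_key as the intermediate table, stg_events as the staged input, seq_account_sk as a sequence for the surrogate key); the linked connector option is what is meant to let only one of the potential duplicates through:

-- mid-job step, committed immediately: stage surrogate keys for the accounts seen in this run
-- (the connector's brute-force primary key validation is intended to keep only one row per account)
INSERT INTO int_account_key (account_sk, source_system_id, account_number)
SELECT NEXT VALUE FOR seq_account_sk, e.source_system_id, e.account_number
FROM   (SELECT DISTINCT source_system_id, account_number FROM stg_events) e;

-- later steps: the jobs that build the fact/relationship/subtype tables reuse whatever
-- surrogate key made it into the intermediate table
SELECT k.account_sk, e.event_ts, e.event_type
FROM   stg_events e
JOIN   int_account_key k
  ON   k.source_system_id = e.source_system_id
 AND   k.account_number   = e.account_number;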
However, my solution was turned down by the architect and I am now looking for alternatives.
From your experience, do you know of one, apart from:
- Having a separate flow for the common table/lookup
- Serializing the ETL flows
Thanks
Iamundercover
Re: need solution for parallel load late arriving dimension
Well, nothing like making up 'standards' that only serve to cause problems.
Why does the ETL flow need to run in parallel? On Netezza it doesn't buy you anything; throughput would be pretty much the same whether the jobs run serially or in parallel.
What is the common table and how do you expect multiple processes to update it without getting a serialization error?
Re: need solution for parallel load late arriving dimension
I agree
We have data coming from the same source system, which loads:
Every 20 minutes (FACTs):
- ETL Flow I: Events - opening of an account, etc.
- ETL Flow II: Transactions - payment made at the grocery store, incoming salary credit, etc.
Once a day (DIM):
- ETL Flow III: Daily inventory of all the accounts, which feeds the DIM (subtype).
The design pattern we have chosen for late arriving dimensions is that each flow loads the lookup/reference/supertype table (columns: surrogate key, source system id, unique id in the source system = account number). We want to avoid:
- Introducing dependencies among ETL Flows I, II and III
- Loading duplicates into the supertype/reference/lookup table, which can happen for a technical reason: when the jobs of ETL Flows I, II or III that load the supertype table execute in parallel, each checks whether a surrogate key has already been generated for the natural key; if none is found yet, each tries to create one, and we get duplicates (see the sketch below).
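For illustration, a sketch of the key-assignment step each of these flows runs (table, column and sequence names are hypothetical); two sessions executing it concurrently can each see no existing row and both insert:

-- create a surrogate key for any natural key not yet present in the supertype table
INSERT INTO lkp_account (account_sk, source_system_id, account_number)
SELECT NEXT VALUE FOR seq_account_sk, s.source_system_id, s.account_number
FROM   (SELECT DISTINCT source_system_id, account_number FROM stg_input) s
WHERE  NOT EXISTS (SELECT 1 FROM lkp_account l
                   WHERE l.source_system_id = s.source_system_id
                     AND l.account_number   = s.account_number);
-- Netezza does not enforce primary key or unique constraints, so if two flows run this
-- before either has committed, both see "no existing row" and the account is keyed twice.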
Iamundercover
Re: need solution for parallel load late arriving dimension
Why not have a durable alternate key in the dimension (a type 1 key) and store that in the fact? You would then use the business timestamp of the fact against the business timestamps of the dimension to locate the correct row. This does slightly alter the queries, but on a Netezza box it is probably not very costly, and you can put the logic in a view, making it transparent to users. Distribution and organization of the tables can significantly help performance if this is a particularly large dimension.
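A minimal sketch of such a view, with hypothetical table and column names (a durable key on both fact and dimension, plus effective-date columns on the dimension):

-- resolve the dimension row that was in effect at the fact's business timestamp
CREATE VIEW v_transaction_account AS
SELECT f.*,
       d.account_sk
FROM   fact_transaction f
JOIN   dim_account d
  ON   d.account_durable_key = f.account_durable_key
 AND   f.transaction_ts >= d.effective_from_ts
 AND   f.transaction_ts <  d.effective_to_ts;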
Re: need solution for parallel load late arriving dimension
When you load facts from ETL Flows I and II presumably you are populating them with the Account dummy SK value which is also going into your lookup table?
When you load your Account Dim in ETL Flow III all you need to do is translate your dummy SKs into the real SKs and update your fact tables - standard Late Arriving Dimension processing as described by Kimball.
It makes no difference if you have "duplicates" in your lookup table, as you are going to translate them to the same real SK, e.g.:
Lookup Table has:
Source System = A; Source Natural Key = ABC; Dummy SK = 123
Source System = A; Source Natural Key = ABC; Dummy SK = 456
Dimension Load has:
Source System = A; Source Natural Key = ABC; Real SK = 987
Using the dimension load data to find records in your lookup table, you identify that any fact table rows with values of 123 or 456 in their Account SK column need to have this updated to 987, for example:
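A sketch of that translation step (table and column names assumed; the exact UPDATE ... FROM syntax may vary by platform):

-- map each dummy SK in the lookup table to the real SK just loaded in the dimension,
-- then rewrite the fact rows that still carry a dummy SK
UPDATE fact_transaction
SET    account_sk = sk_map.real_sk
FROM  (SELECT l.dummy_sk, d.account_sk AS real_sk
       FROM   lookup_account l
       JOIN   dim_account d
         ON   d.source_system = l.source_system
        AND   d.natural_key   = l.natural_key) sk_map
WHERE  fact_transaction.account_sk = sk_map.dummy_sk;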
Regards,
nick_white
Re: need solution for parallel load late arriving dimension
@ngalemmo
The FACTs get loaded throughout the day (every 20 minutes) and are queried throughout the day as well. Thus, we would like the FACTs to carry the correct SK for the NK of the dimension from our lookup table right away, so that we don't have to play catch-up all day long.
Iamundercover
Re: need solution for parallel load late arriving dimension
It is not clear if you are working with a type 1 or a type 2 dimension. In the case of type 2, using a durable key and a timestamp to locate the correct late arriving dimension is a common approach.
If you are dealing with a type 1, things are much simpler. All you need to do is infer a dimension row if you encounter a new natural key when loading facts. You create a new dimension row using the natural key. Assign the surrogate key to the fact. Dimension data will be populated later when the dimension feed comes in. The fact will reference the populated row.
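A minimal sketch of that approach, with hypothetical names (dim_account keyed by account_number, stg_fact as the staged facts, seq_account_sk as a sequence):

-- 1) infer a placeholder dimension row for any natural key in the facts not yet in the dimension
INSERT INTO dim_account (account_sk, account_number)
SELECT NEXT VALUE FOR seq_account_sk, s.account_number
FROM   (SELECT DISTINCT account_number FROM stg_fact) s
WHERE  NOT EXISTS (SELECT 1 FROM dim_account d
                   WHERE d.account_number = s.account_number);

-- 2) load the facts, resolving the surrogate key through the natural key
INSERT INTO fact_transaction (account_sk, transaction_ts, amount)
SELECT d.account_sk, s.transaction_ts, s.amount
FROM   stg_fact s
JOIN   dim_account d ON d.account_number = s.account_number;

-- 3) when the daily dimension feed arrives, a type 1 update overwrites the placeholder
--    attributes in place; the facts already reference the (now populated) row
UPDATE dim_account
SET    account_name = f.account_name
FROM   stg_dim_feed f
WHERE  dim_account.account_number = f.account_number;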
Re: need solution for parallel load late arriving dimension
@nick_white
No, it's not a dummy SK value that we load. We receive the NK in the FACT records, and that NK is used to create the SK in the supertype table (SCD Type 0). All that table has is the SK, the NK and model-related/audit columns.
Lookup Table has:
Source System = A; Source Natural Key = ABC; Dummy SK = 123
The record below should not be created:
Source System = A; Source Natural Key = ABC; Dummy SK = 456
The dimension load will utilize the SK already created in the lookup table and load the subtype table with its attributes.
Iamundercover
Re: need solution for parallel load late arriving dimension
@ngalemmo
It's SCD Type 0 in the supertype/lookup/reference table. Thus, we just want the first insert to go through and the rest to utilize it. However, as the lookup table can be loaded by any of the ETL flows (I, II and III), I am looking for a technical solution to avoid duplicates.
Iamundercover
Re: need solution for parallel load late arriving dimension
Why is there a single lookup/reference table?
Dimensions should be in their own tables. The dimension should contain the NK and SK. When you infer dimension rows you do so in the dimension itself.
The simplest way to avoid two jobs inserting duplicate new rows into the same table is to not have two jobs inserting into the same table at the same time. Get rid of the common global key table and fix your job scheduling.
Re: need solution for parallel load late arriving dimension
You said you are putting the NKs in the fact table, but that breaks the basics of (standard) dimensional modelling. There are a few reasons to put NKs in fact tables, but normally that is to solve complex SCD/temporal reporting issues, and the NKs complement, rather than replace, the SKs.
You seem to have a pretty straightforward late arriving dimension issue but have tried to solve it with a super/subtype solution that, IMO, is never going to work and is unnecessarily complicated. Just read Kimball's solution to Late Arriving Dimensions and implement that.
nick_white