Overall design and data model and Informatica CDC

kjfischer Fri May 06, 2011 12:12 pm

We are building data warehouse environment from scratch. I am looking for validation of our architecture. There are multiple sources from Rdb, SQL, and Oracle.

At a high level we will have the source, a landing area, a staging area, a atomic unified warehouse, and a mart information delivery area.

The main source is Rdb which is highly normalized. We are replicating all 300 tables into a landing schema in an Oracle database using JCC Logminer. In essense, the landing area is a complete copy and in sync with the source. It is there solely for the purpose that our ETL tool will have a single source environment.

We have a business requirement to track history and for the warehouse to be near real time. We have purchased Informatica CDC for the ETL tool, but I have no experience with it. The idea is that Informatica will read the changes in the landing area from the Oracle logs and insert changed rows into a staging schema. The staging tables are identical in structure to the source tables but with a cdc_sequence_id, cdc_timestamp, cdc_transaction_type (I, U, D), cdc_status_indicator.

Informatica will regularly batch up changed rows in staging that haven't been processed and insert them into a unified warehouse schema that for the most part is structured the same as the source, but because we need to track history, we have a new surrogate primary key and timestamp for each table. So an insert followed by an update in the source would result in 2 inserts into our warehouse schema with 2 unique surrogate keys (and retaining the cdc tracking fields as well so we know the transaction_type).

We will then be modelling out our mart area which would be dimensional in nature consisting of facts and dimension tables.

When moving data from source (our landing essentially) to stage to atomic warehouse, what exactly is important to physically model? At what point are surrogate keys necessary to introduce? What are key tracking fields moving through the areas?

BoxesAndLines Fri May 06, 2011 2:05 pm

You should model all tables in the lifecycle. Each schema (staging, EDW) should undergo the same rigor. The staging will be much simpler since it is essentially a direct copy of the source with tracking timestamps. The EDW also needs modeling. Kimball's methodology is documented in his book, "The complete guide to dimensional modeling". It's a great resource for dimensional modeling solutions. If this is your first pass at dimensional modeling, then the methodology may leave you wanting (my only complaint when I was starting). Ralph overcomes this complaint by showing example after example of how-to's. The two most important folks in a succesful ETL project are the ETL and Data architects. If they are not in-sync then the risk for your project goes way up.

Informatica also has an extensive partner network to assist users who have used up the professional services freebies from the software purchase. I also would recommend leveraging that network.

Surrogate keys are normally introduced at the EDW model. Do a search on audit columns to identify threads that address ETL tracking columns in the EDW. Considerations include insert date, update date, process identifier (set of tables to track counts related to what was inserted, deleted, updated).

Good luck and keep us posted.

Overall design and data model and Informatica CDC

Overall design and data model and Informatica CDC

Re: Overall design and data model and Informatica CDC