Full history staging tables

ryno1234 Tue Aug 18, 2015 8:23 am

In my ETL process, I typically move data from my source system(s), to a staging table of some sort and then from the staging tables into my dimensional model. My routines run nightly and because of that, my highest level of resolution on data is 1 day. I detect daily changes only.

The source systems tend to not have very good history tracking / date tracking of modifications, so I've relied on simply bringing source-system data in and comparing it to my staged data to see if there are any updates I need to bring into the staging DB. FYI - I'm referring to data such as "Vehicle", "Employee", etc. Things that would typically end up being modeled as an actual dimension.

In the past I've been all over the board with what would constitute an update (or even *how* to persist the updated data, i.e. overwrite, new record, etc) because it's been case by case depending upon the table in question, the fields in question, etc.

I recently had a thought: Why don't I just bring in every field for a given table, and if there is a change to any of them, stamp a new row in my staging DB with row_effective_date and row_expiration_date fields. This will allow me to have a full history of every record which creates limitless possibilities with my data warehouse and modeling new data from the staging tables.

This feels like it's using a sledgehammer to solve a problem, but really, it's the most straight forward to implement, I can treat all tables the same vs. having vastly different rules per each staging table and I can have an enormous swath of options when it comes to actually creating a dimensional model from the staged data.

Thoughts? Has anyone done this before?

seemiyah Tue Aug 18, 2015 11:55 pm

I think maintaining a full history in your staging tables is a sound approach.

I am currently involved in a DW build where the following is used:
- Tables with a (reliable) Create/UpdateTime field in the source system have a new record written to the staging table when the UpdateTime is above the high-water mark. No checks are made to see if there are any changes to the underlying data.
- Other tables are extracted in full each day and compared to the staging table using a MD5 hash. If there are any changes, a new record is written.

ngalemmo Wed Aug 19, 2015 12:24 am

I guess I don't understand how the 'staging area' would be any different than the data warehouse itself. After all, isn't the goal to retain all data and a historical record? If one has the capacity and the business need, why not just make all the dimensions Type 2?

The reality is a business simply does not need a full historical record of all data. A lot of data has no historical business value. That's why we make decisions in the design.

But, if you go this route, the work to maintain a Type 2 dimension is pretty much the same as maintaing the staging area. So why create another repository?

Sponsored content

Full history staging tables

Full history staging tables

Re: Full history staging tables

Re: Full history staging tables

Re: Full history staging tables