Extract Archive Process
Extract Archive Process
Hi,
I have a requirement to archive my staging extracts on a daily basis. This is much like the approach mentioned in the Kimball books, where he says "The untransformed extracts should be archived for at least a week. Many organizations archive the extracts for a month or even forever."
For some of my data, I will be getting only changed rows from the source. That's fine; those rows can go straight into some sort of archive table, no problem.
But for the other tables, I can't get changed rows. For those, I'll have to pull in all rows and then use something like SQL MINUS or EXCEPT to figure out what's changed since the previous day. These tables are not massive, but one is around 100,000 rows. That means I'll be putting 100,000 rows into the archive table every day, so it will grow to roughly 36 million rows in a single year (100,000 x 365), which seems a bit excessive. There are probably bigger tables too.
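For illustration, here's a minimal sketch of that comparison, assuming today's full extract lands in stg_customer_today and yesterday's snapshot is kept in stg_customer_prev with identical columns (all table and column names here are made up):

-- New or changed rows since yesterday's snapshot.
-- Note: Oracle spells this operator MINUS; SQL Server and PostgreSQL use EXCEPT.
SELECT customer_id, name, status
FROM   stg_customer_today
EXCEPT
SELECT customer_id, name, status
FROM   stg_customer_prev;

Note this only catches inserts and updates; rows deleted at the source only show up in the reverse query (prev EXCEPT today), so a complete CDC step needs both directions.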
I'm toying with whether to archive the exact extract or to archive the results of the SQL EXCEPT query. In theory the EXCEPT query should always give you the changed rows, but I suppose it's not your true extract. I'm leaning towards archiving the full extract every day (as is) but only keeping it for 3 months.
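If I went with the full-extract option, the daily load and 3-month purge might look something like this sketch (again with hypothetical names; date arithmetic syntax varies by platform):

-- Archive today's extract as-is, stamped with the extract date.
INSERT INTO stg_customer_archive (extract_date, customer_id, name, status)
SELECT CURRENT_DATE, customer_id, name, status
FROM   stg_customer_today;

-- Daily purge: keep only the last 3 months of extracts.
DELETE FROM stg_customer_archive
WHERE  extract_date < CURRENT_DATE - INTERVAL '3' MONTH;

On a bigger table, partitioning the archive by extract_date and dropping old partitions would be cheaper than a bulk DELETE.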
What do you think?
jryan · Posts: 33 · Join date: 2010-09-27
Re: Extract Archive Process
My concern would be: what if you need to recreate the CDC result? If you archive only the EXCEPT output, you have already minimally processed the input data and kept just the result, so you can't go back and fix an error in the CDC process.
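To make that concrete: with the full extracts archived, the diff for any past day can be regenerated after a bug fix, along these lines (hypothetical names again, and Oracle would use MINUS instead of EXCEPT):

-- Re-derive the changed rows for a past day from the archived full extracts.
SELECT customer_id, name, status
FROM   stg_customer_archive
WHERE  extract_date = DATE '2010-10-01'
EXCEPT
SELECT customer_id, name, status
FROM   stg_customer_archive
WHERE  extract_date = DATE '2010-09-30';

With only the EXCEPT results on hand, there is no equivalent query: the inputs are gone.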
BoxesAndLines · Posts: 1212 · Join date: 2009-02-03 · Location: USA