Extract Archive Process
Extract Archive Process
Hi,
I have a requirement to archive my staging extracts on a daily basis. This is much like the approach mentioned in the Kimball books, where he says "The untransformed extracts should be archived for at least a week. Many organizations archive the extracts for a month or even forever."
For some of my data, I will be getting only changed rows from the source. That's fine; those rows can go straight into some sort of archive table, no problem.
But for the other tables, I can't get changed rows. For those, I'll have to pull in all rows and then use something like SQL MINUS or EXCEPT to figure out what's changed since the previous day. These tables are not massive, but one is around 100,000 rows. That means I'll be putting 100,000 rows into the archive table every day, so it will grow to roughly 36 million rows in a single year (100,000 x 365), which seems a bit excessive. There are probably bigger tables too.
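For illustration, here's a minimal sketch of that comparison, assuming today's full extract lands in stg_customer_today and yesterday's snapshot is kept in stg_customer_prev with identical columns (all table and column names here are made up):

-- New or changed rows since yesterday's snapshot.
-- Note: Oracle spells this operator MINUS; SQL Server and PostgreSQL use EXCEPT.
SELECT customer_id, name, status
FROM   stg_customer_today
EXCEPT
SELECT customer_id, name, status
FROM   stg_customer_prev;

Note this only catches inserts and updates; rows deleted at the source only show up in the reverse query (prev EXCEPT today), so a complete CDC step needs both directions.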
I'm toying with whether to archive the exact extract or to archive the results of the SQL EXCEPT query. In theory the EXCEPT query should always give you the changed rows, but I suppose it's not your true extract. I'm leaning towards archiving the full extract every day (as is) but only keeping it for 3 months.
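If I went with the full-extract option, the daily load and 3-month purge might look something like this sketch (again with hypothetical names; date arithmetic syntax varies by platform):

-- Archive today's extract as-is, stamped with the extract date.
INSERT INTO stg_customer_archive (extract_date, customer_id, name, status)
SELECT CURRENT_DATE, customer_id, name, status
FROM   stg_customer_today;

-- Daily purge: keep only the last 3 months of extracts.
DELETE FROM stg_customer_archive
WHERE  extract_date < CURRENT_DATE - INTERVAL '3' MONTH;

On a bigger table, partitioning the archive by extract_date and dropping old partitions would be cheaper than a bulk DELETE.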
What do you think?
jryan · Posts: 33 · Join date: 2010-09-27
Re: Extract Archive Process
My concern would be: what if you need to recreate the CDC result? If you archive only the EXCEPT output, you have already minimally processed the input data and kept just the result, so you can't go back and fix an error in the CDC process.
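To make that concrete: with the full extracts archived, the diff for any past day can be regenerated after a bug fix, along these lines (hypothetical names again, and Oracle would use MINUS instead of EXCEPT):

-- Re-derive the changed rows for a past day from the archived full extracts.
SELECT customer_id, name, status
FROM   stg_customer_archive
WHERE  extract_date = DATE '2010-10-01'
EXCEPT
SELECT customer_id, name, status
FROM   stg_customer_archive
WHERE  extract_date = DATE '2010-09-30';

With only the EXCEPT results on hand, there is no equivalent query: the inputs are gone.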
BoxesAndLines · Posts: 1212 · Join date: 2009-02-03 · Location: USA