ETL Architecture and Control Flow
Hi Folks,
I am new to the forums; they look like an interesting place to discuss BI issues and troubles.
I am not new to BI, though, but we have done a lot of re-architecting of the ETL systems, with servers being moved, WAN considerations, etc.
Currently, our source system extracts, compresses and FTPs data to our ETL area. File names are representative of the 'day' the data refers to, which allows data to be queued in case of FTP failure, etc.
My ETL system is entirely file based. I use PowerShell to pull together the earliest available batch of files, check that the batch is complete, then unpackage the batch ready for loading. I use token files to lock/unlock the staging area and to prevent things getting overwritten in case of error.
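To give a flavour of it, the batch selection and token locking looks roughly like the sketch below (the paths, file naming convention and feed list are illustrative, not our real ones):

```powershell
# Illustrative sketch only - real paths, file naming and the feed list differ.
$incoming = 'D:\etl\incoming'                       # where the FTP'd files land (assumed)
$staging  = 'D:\etl\staging'                        # staging area guarded by a token file (assumed)
$lock     = Join-Path $staging 'staging.lock'       # token file = staging area is in use

if (Test-Path $lock) {
    Write-Host 'Staging area is locked by a previous run - exiting.'
    return
}

# Files are assumed to be named like SALES_20090421.csv.gz; group them by the embedded day.
$batches = @(Get-ChildItem $incoming -Filter '*_*.csv.gz' |
    Group-Object { ($_.Name -split '[_.]')[1] } |   # group on the yyyyMMdd token
    Sort-Object Name)                               # earliest day first

if ($batches.Count -eq 0) { return }
$earliest = $batches[0]

# Completeness check: one file per feed per day is expected.
$expectedFeeds = @('SALES', 'CUSTOMER', 'PRODUCT')  # assumed feed list
$gotFeeds = $earliest.Group | ForEach-Object { ($_.Name -split '_')[0] }
$missing  = $expectedFeeds | Where-Object { $gotFeeds -notcontains $_ }

if ($missing) {
    Write-Host "Batch $($earliest.Name) incomplete, still waiting for: $($missing -join ', ')"
    return
}

# Lock the staging area, then unpackage the batch into it.
New-Item -ItemType File -Path $lock | Out-Null
# ... decompress / unpackage $earliest.Group into $staging, remove the lock on success ...
```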
I then push my source files through a number of transforms, which roughly summarise as preparation (junking rows, initial column cleaning), dimension prep, dimension delivery, fact prep, then fact delivery. I do not have a surrogate key generator, and two critical dimensions are SCDs. For this reason I need to deliver dimensions before conforming fact data.
Each step writes to temporary files. If the step completes successfully, they are moved to the input of the next step, and the source files for the previous step are deleted. This method gives a 'pipeline' of transformations, each of which is restartable at any point. A rough sketch of the pattern is below.
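This is roughly the shape of each step (the step names, directories and transform bodies are placeholders, not our real code):

```powershell
# Illustrative sketch - step names, directories and transform bodies are placeholders.
function Invoke-EtlStep {
    param(
        [string]$StepName,       # e.g. 'prep', 'dim_prep', 'dim_deliver', 'fact_prep', 'fact_deliver'
        [string]$InDir,          # this step's input (the previous step's output)
        [string]$OutDir,         # the next step's input
        [scriptblock]$Transform  # reads from the input dir, writes to the temp dir it is given
    )

    $tempDir = "$OutDir.tmp"
    if (Test-Path $tempDir) { Remove-Item $tempDir -Recurse -Force }   # restart: throw away half-written output
    New-Item -ItemType Directory -Path $tempDir -Force | Out-Null
    New-Item -ItemType Directory -Path $OutDir  -Force | Out-Null

    try {
        & $Transform $InDir $tempDir             # do the work into the temp area
    }
    catch {
        Write-Host "Step '$StepName' failed: $_  (inputs left intact, safe to restart)"
        throw
    }

    # Only reached on success: promote the temp output to the next step's input,
    # then delete this step's input so a rerun cannot process it twice.
    Get-ChildItem $tempDir | Move-Item -Destination $OutDir
    Remove-Item $tempDir -Recurse -Force
    Remove-Item (Join-Path $InDir '*') -Force
}

# The steps run in this order so the SCD dimensions are delivered before the facts are conformed:
#   prep -> dim_prep -> dim_deliver -> fact_prep -> fact_deliver
```

Writing into a temp directory and only moving on success is what makes each step safely restartable.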
This all sounds quite satisfactory, and works well at the moment, but complexity is going to increase. We have two other data sources to bring online, so I need to synchronise the loading of dimensions. As a note, all packages are audited, but I have no concept of a 'batch' with a particular batch/run ID carried from start to finish.
We are running MS SQL Server 64-bit Standard Edition, so we don't have the ability to partition tables or cubes. Also, to make things harder, our transactional system updates transactions, which makes implementing incremental processing rather difficult.
So, I am interested in knowing how people architect / manage...
. Synchronising multiple source systems. What happens if one system is unable to supply data in a timely fashion - do you code overrides or hold out for the data?
. State management. Do people use file pipeline methods like me, or do you maintain state in the db and rely on that to control the flow of data through the ETL pipeline?
. Restartability. My transforms take some time, so I am keen to allow jobs to be restarted at any time without danger of breaking the ETL. DB-based state management would allow me to do that, I think (there's a rough sketch of what I mean after this list).
. DB- or file-based staging. I am a big fan of file-based work, so the db overhead seems unnecessary.
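On the state-management point, what I have in mind for db-based state is nothing fancier than a small control schema keyed on a batch/run id, roughly like this (the table and column names are invented, the server/database names are assumptions, and Invoke-Sqlcmd assumes the SQL Server PowerShell snap-in/module is available):

```powershell
# Illustrative only - table/column names are invented; server and database names are assumptions.
$server = 'MYETLSERVER'
$db     = 'etl_control'

# One row per batch, one row per step. Restart logic would look for the earliest
# batch that isn't 'succeeded' and the first step within it that hasn't finished.
Invoke-Sqlcmd -ServerInstance $server -Database $db -Query @"
IF OBJECT_ID('dbo.etl_batch') IS NULL
    CREATE TABLE dbo.etl_batch (
        batch_id     INT IDENTITY(1,1) PRIMARY KEY,
        business_day CHAR(8)     NOT NULL,            -- yyyyMMdd, matching the file names
        status       VARCHAR(20) NOT NULL,            -- 'running', 'succeeded', 'failed'
        started_at   DATETIME    NOT NULL DEFAULT GETDATE()
    );
IF OBJECT_ID('dbo.etl_step') IS NULL
    CREATE TABLE dbo.etl_step (
        batch_id   INT         NOT NULL REFERENCES dbo.etl_batch(batch_id),
        step_name  VARCHAR(50) NOT NULL,
        status     VARCHAR(20) NOT NULL,
        updated_at DATETIME    NOT NULL DEFAULT GETDATE(),
        PRIMARY KEY (batch_id, step_name)
    );
"@

# Open a batch for the day being loaded and carry its id through every step's audit rows.
$batchId = (Invoke-Sqlcmd -ServerInstance $server -Database $db -Query @"
INSERT dbo.etl_batch (business_day, status) VALUES ('20090421', 'running');
SELECT CAST(SCOPE_IDENTITY() AS INT) AS batch_id;
"@).batch_id
```

Every audit row the packages already write would then carry that batch id, which is the start-to-finish run id I'm missing today.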
Thoughts, comments?
All welcome.
Paul
monsieur_arrie
File based ETL
All three of the problems posted have their answer in the same philosophy. The power of file-based ETL is twofold. First, as the original post highlights, it doesn't have the overhead of an RDBMS. Second, each file can essentially contain one record, and you can have a million files with the same format and the same processing; you just run the code again and again.
So basically, if certain sources are unable to provide data on time, you make a copy of the code for those files and run them separately through the same file-based processing supply chain. The aggregated facts will have a problem and will have to be redone; everything else, including the SCDs, should be fine.
The argument is: if one complete file at the start of the process has 100,000 records, would it make any difference if it arrived instead as 80,000 and then 20,000? If not, then you are fine.
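To make that concrete, something like the loop below (folder names are just for illustration) treats one complete file and several late-arriving pieces identically:

```powershell
# Illustrative only - the point is that the per-file code does not care how many
# files a day's data was split into, or in which run each piece turns up.
$incoming  = 'D:\etl\incoming'     # assumed folder for files waiting to be processed
$processed = 'D:\etl\processed'    # assumed folder for files already run

Get-ChildItem $incoming -Filter '*.csv' | Sort-Object Name | ForEach-Object {
    # The same transform runs whether this is the whole 100,000-row day
    # or the late 20,000-row remainder of it.
    # ... transform $_ and deliver it ...
    Move-Item $_.FullName -Destination $processed   # done files drop out of the next run
}
# Aggregated facts are then rebuilt from the delivered detail, as noted above.
```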
Please note that file-based ETL requires rigorous data integrity controls, and it is not recommended for the faint of heart who do not have ten years in data architecture. A file-based processing mistake affecting 2 records out of 10,000,000 can take you up to a month to figure out which 2 need to be fixed (speaking from personal experience).
Eddy22Ho