ETL: process row by row with pipeline between steps or process the whole datasets in every step?
I have a doubt about ETL process performance.
I have always believed that the best way to process a dataset in an ETL process, in terms of performance, is to process the whole dataset in every step and pass the completed, transformed dataset on to the next step of the flow.
I have worked that way in SAS (typical SAS data steps), but working in Pentaho DI I can see that rows flow to the next step before the whole dataset is completed. Is that good for performance? In that case the dataset is processed row by row and the processed rows are sent on to the next step.
I know there is a limit to row-by-row processing, because some steps need the whole dataset in order to be processed.
I want to know which is the best way to process.
Thanks in advance
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
SAS is not an ETL tool. It can be used for ETL, but so can a lot of things.
The most expensive operation any process can perform is I/O to disk, writing being worse than reading. So, purpose built ETL tools are optimized to do ETL. SAS isn't.
row by row or whole dataset in every step
Thanks ngalemmo,
I know SAS is not an ETL tool, although SAS Data Integration Studio could be considered one.
Anyway, my question is not about SAS. My question is:
Which is more efficient in an ETL process: processing rows one by one with a pipeline between steps, or processing the whole dataset in every step and waiting for it to complete before sending it to the next step?
Thanks,
Juan
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
Pipeline. It reduces I/O. The fewer times you need to touch disk, the faster things will run. Purpose-built ETL tools do a lot of in-memory caching for lookups and other purposes.
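For illustration only (this is not how Pentaho DI is implemented internally), here is a minimal Python sketch of the pipelined, row-at-a-time style the original question describes: each step is a generator, rows flow to the next step as soon as they are produced, and a plain dict plays the role of an in-memory lookup cache. The file names, field names and lookup data are hypothetical.

```python
import csv

def extract(path):
    # Stream rows from the source file one at a time; nothing is materialized.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def lookup_customer_key(rows, customer_dim):
    # In-memory lookup cache: customer_dim is a dict loaded once
    # (natural key -> surrogate key), so there is no per-row trip to disk.
    for row in rows:
        row["customer_key"] = customer_dim.get(row["customer_id"], -1)
        yield row

def convert_amount(rows):
    # A simple per-row transformation step.
    for row in rows:
        row["amount"] = float(row["amount"])
        yield row

def load(rows, out_path):
    # A single sequential write at the end of the pipeline.
    with open(out_path, "w", newline="") as f:
        writer = None
        for row in rows:
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)

# Hypothetical lookup data and file names, purely for illustration.
customer_dim = {"C001": 1, "C002": 2}
rows = convert_amount(lookup_customer_key(extract("sales.csv"), customer_dim))
load(rows, "sales_out.csv")
```

Each step starts working as soon as the first row reaches it, which is the behaviour you observe in Pentaho DI, and no intermediate dataset is ever written to disk between steps.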
But there are exceptions. For example, doing transformations in-database on an MPP platform such as Netezza (aka IBM PureData) is extremely fast. It is a matter of understanding the architecture of the system you are working with and leveraging its strengths. A well designed process can transform hundreds of millions (if not billions) of rows in a matter of minutes.
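By contrast, a sketch of the in-database (push-down / ELT) style mentioned above: the whole transformation is expressed as one set-based SQL statement and the database engine does the work. sqlite3 is used here only so the example is self-contained; on an MPP platform such as Netezza the same pattern would be distributed across the data slices. Table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales    (customer_id TEXT, amount TEXT);
    CREATE TABLE dim_customer (customer_id TEXT, customer_key INTEGER);
    CREATE TABLE fact_sales   (customer_key INTEGER, amount REAL);
    INSERT INTO stg_sales    VALUES ('C001', '10.50'), ('C002', '3.25');
    INSERT INTO dim_customer VALUES ('C001', 1), ('C002', 2);
""")

# One set-based statement: the join, the type conversion and the load all
# happen inside the database engine instead of row by row in the ETL tool.
conn.execute("""
    INSERT INTO fact_sales (customer_key, amount)
    SELECT COALESCE(d.customer_key, -1), CAST(s.amount AS REAL)
    FROM stg_sales s
    LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
""")
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())
```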
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
Thanks ngalemmo,
Another related question.
It's always better to avoid I/O, but when you process a dataset in memory, is it better to do each step row by row or to process the whole dataset?
I think it is better to process the whole dataset, but if you process row by row you can pipe the processed rows to the next step.
In terms of best practices, which is better?
Thanks for your help
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
It really depends on the tool you are using. There is no 'best' choice, only a best choice for the particular architecture.
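To make the trade-off concrete, here is a toy contrast (hypothetical step names and data) between the two in-memory styles: materializing the whole dataset at every step keeps each intermediate result alive at once, while the generator version keeps only one row in flight per step and lets downstream steps start immediately.

```python
# Whole-dataset style: each step returns a complete list, so every
# intermediate dataset is held in memory at the same time.
def clean_all(rows):
    return [dict(r, amount=float(r["amount"])) for r in rows]

def enrich_all(rows):
    return [dict(r, amount_with_tax=round(r["amount"] * 1.21, 2)) for r in rows]

# Row-by-row (pipelined) style: the same steps as generators; only one row
# is in flight per step, and the next step starts before the previous one ends.
def clean_stream(rows):
    for r in rows:
        yield dict(r, amount=float(r["amount"]))

def enrich_stream(rows):
    for r in rows:
        yield dict(r, amount_with_tax=round(r["amount"] * 1.21, 2))

source = [{"amount": "10.50"}, {"amount": "3.25"}]          # stand-in source data
batch  = enrich_all(clean_all(source))                      # whole dataset, step by step
stream = list(enrich_stream(clean_stream(source)))          # rows flow through the pipe
assert batch == stream                                      # same result, different memory profile
```

Which style wins in practice depends on the engine: tools built around streaming (like Pentaho DI) favour the pipelined form, while set-based engines favour operating on the whole dataset at once.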
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
I differ slightly, in that I believe in-database transformations are generally faster than in-tool for production-scale workloads. I've seen complex dimension processing through SQL-based ELT beat both SSIS and Kettle (for example) on the same hardware, DBMS and data.
(I spend a lot of my time performance testing ETL/ELT tools)