ETL: process row by row with pipeline between steps or process the whole datasets in every step?
I have a doubt about ETL process performance.
I have always believed that the best way to process a dataset in an ETL process, in terms of performance, is to process the whole dataset in every step and pass the completed, transformed dataset on to the next step of the flow.
I have worked that way in SAS (typical SAS data steps), but working in Pentaho DI I can see that rows flow to the next step before the whole dataset is completed. Is that good for performance? In that case the dataset is processed row by row and the processed rows are sent on to the next step.
I know there is a limit to row-by-row processing, because some steps need the whole dataset in order to be processed.
I want to know which is the best way to process.
Thanks in advance
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
SAS is not an ETL tool. It can be used for ETL, but so can a lot of things.
The most expensive operation any process can perform is I/O to disk, writing being worse than reading. So, purpose built ETL tools are optimized to do ETL. SAS isn't.
row by row or whole dataset in every step
Thanks ngalemmo,
I know SAS is not an ETL tool, although SAS Data Integration Studio could be considered one.
Anyway, my question is not about SAS. My question is:
Which is more efficient in an ETL process: processing rows one by one with a pipeline between steps, or processing the whole dataset in every step and waiting for it to complete before sending it to the next step?
Thanks,
Juan
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
Pipeline. It reduces I/O. The fewer times you need to touch disk, the faster things will run. Purpose-built ETL tools do a lot of in-memory caching for lookups and other purposes.
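For illustration only (this is not how Pentaho DI is implemented internally), here is a minimal Python sketch of the pipelined, row-at-a-time style the original question describes: each step is a generator, rows flow to the next step as soon as they are produced, and a plain dict plays the role of an in-memory lookup cache. The file names, field names and lookup data are hypothetical.

```python
import csv

def extract(path):
    # Stream rows from the source file one at a time; nothing is materialized.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def lookup_customer_key(rows, customer_dim):
    # In-memory lookup cache: customer_dim is a dict loaded once
    # (natural key -> surrogate key), so there is no per-row trip to disk.
    for row in rows:
        row["customer_key"] = customer_dim.get(row["customer_id"], -1)
        yield row

def convert_amount(rows):
    # A simple per-row transformation step.
    for row in rows:
        row["amount"] = float(row["amount"])
        yield row

def load(rows, out_path):
    # A single sequential write at the end of the pipeline.
    with open(out_path, "w", newline="") as f:
        writer = None
        for row in rows:
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)

# Hypothetical lookup data and file names, purely for illustration.
customer_dim = {"C001": 1, "C002": 2}
rows = convert_amount(lookup_customer_key(extract("sales.csv"), customer_dim))
load(rows, "sales_out.csv")
```

Each step starts working as soon as the first row reaches it, which is the behaviour you observe in Pentaho DI, and no intermediate dataset is ever written to disk between steps.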
But there are exceptions. For example, doing transformations in-database on an MPP platform such as Netezza (aka IBM PureData) is extremely fast. It is a matter of understanding the architecture of the system you are working with and leveraging its strengths. A well designed process can transform hundreds of millions (if not billions) of rows in a matter of minutes.
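By contrast, a sketch of the in-database (push-down / ELT) style mentioned above: the whole transformation is expressed as one set-based SQL statement and the database engine does the work. sqlite3 is used here only so the example is self-contained; on an MPP platform such as Netezza the same pattern would be distributed across the data slices. Table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales    (customer_id TEXT, amount TEXT);
    CREATE TABLE dim_customer (customer_id TEXT, customer_key INTEGER);
    CREATE TABLE fact_sales   (customer_key INTEGER, amount REAL);
    INSERT INTO stg_sales    VALUES ('C001', '10.50'), ('C002', '3.25');
    INSERT INTO dim_customer VALUES ('C001', 1), ('C002', 2);
""")

# One set-based statement: the join, the type conversion and the load all
# happen inside the database engine instead of row by row in the ETL tool.
conn.execute("""
    INSERT INTO fact_sales (customer_key, amount)
    SELECT COALESCE(d.customer_key, -1), CAST(s.amount AS REAL)
    FROM stg_sales s
    LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
""")
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())
```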
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
Thanks ngalemmo,
Another related question.
It's always better to avoid I/O, but when you process a dataset in memory, is it better to do each step row by row or to process the whole dataset?
I think it is better to process the whole dataset, but if you process row by row you can pipe the processed rows to the next step.
In terms of best practices, which is better?
Thanks for your help
juanvg1972- Posts : 25
Join date : 2015-05-05
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
It really depends on the tool you are using. There is no 'best' choice, only a best choice for the particular architecture.
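To make the trade-off concrete, here is a toy contrast (hypothetical step names and data) between the two in-memory styles: materializing the whole dataset at every step keeps each intermediate result alive at once, while the generator version keeps only one row in flight per step and lets downstream steps start immediately.

```python
# Whole-dataset style: each step returns a complete list, so every
# intermediate dataset is held in memory at the same time.
def clean_all(rows):
    return [dict(r, amount=float(r["amount"])) for r in rows]

def enrich_all(rows):
    return [dict(r, amount_with_tax=round(r["amount"] * 1.21, 2)) for r in rows]

# Row-by-row (pipelined) style: the same steps as generators; only one row
# is in flight per step, and the next step starts before the previous one ends.
def clean_stream(rows):
    for r in rows:
        yield dict(r, amount=float(r["amount"]))

def enrich_stream(rows):
    for r in rows:
        yield dict(r, amount_with_tax=round(r["amount"] * 1.21, 2))

source = [{"amount": "10.50"}, {"amount": "3.25"}]          # stand-in source data
batch  = enrich_all(clean_all(source))                      # whole dataset, step by step
stream = list(enrich_stream(clean_stream(source)))          # rows flow through the pipe
assert batch == stream                                      # same result, different memory profile
```

Which style wins in practice depends on the engine: tools built around streaming (like Pentaho DI) favour the pipelined form, while set-based engines favour operating on the whole dataset at once.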
Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?
I differ slightly, in that I believe in-database transformations are generally faster than in-tool for production-scale workloads. I've seen complex dimension processing through SQL-based ELT beat both SSIS and Kettle (for example) on the same hardware, DBMS and data.
(I spend a lot of my time performance testing ETL/ELT tools)