Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho

On Oct. 21, OpenSys released a new version of its ETL tool, CloverETL Designer 2.8.1. It is mainly a bugfix release, but it also brings a new component, ParallelReader, which speeds up processing of delimited (CSV) data files.
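
How ParallelReader works internally is not covered in this post. Purely as an illustration of the general idea of reading a single delimited file in parallel, here is a minimal Python sketch that splits one file into byte ranges aligned to line boundaries and reads each range in its own worker. The function names (line_aligned_ranges, read_range, parallel_read) are hypothetical and this is not CloverETL's implementation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def line_aligned_ranges(path, parts):
    """Split a file into up to `parts` byte ranges that each start at a line start."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            f.seek(size * i // parts)
            f.readline()  # move forward to the end of the line we landed in
            starts.append(min(f.tell(), size))
    starts.append(size)
    # Drop empty ranges (possible when the file is small relative to `parts`).
    return [(a, b) for a, b in zip(starts, starts[1:]) if a < b]


def read_range(path, start, end):
    """Read all complete lines stored in the byte range [start, end)."""
    records = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            records.append(f.readline().rstrip(b"\r\n"))
    return records


def parallel_read(path, parts=2):
    """Read one file with several workers; records come back in file order."""
    ranges = line_aligned_ranges(path, parts)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        chunks = pool.map(lambda r: read_range(path, *r), ranges)
        return [rec for chunk in chunks for rec in chunk]
```

Because every range boundary sits at the end of a line, no record is split between two workers and none is read twice or skipped. In CPython, threads mostly overlap I/O rather than CPU work, so this sketch shows the partitioning logic, not a realistic speedup.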

I decided to run a test comparing ParallelReader’s performance with CloverETL’s UniversalDataReader and with two competing ETL tools: Talend Open Studio (3.1.3) and Pentaho Data Integration (3.2.0).

As the test task, I chose a simple SQL query and rewrote it as an ETL transformation.

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-09-03'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

This query is one of the standard queries used for performance testing of database engines. More information is available at http://www.tpc.org/tpch/. The test dataset was generated with the dbgen utility, which is also available on tpc.org. The dataset is 725 MB in size.
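
To make the logic of the transformation concrete, here is a minimal Python sketch of the same filter, in-memory group-by and sort that the query above describes. It is only an illustration, not one of the tested transformation graphs. The column positions assume the pipe-delimited lineitem file produced by dbgen with the TPC-H column order, and the function name q1 is hypothetical.

```python
import csv
from collections import defaultdict
from datetime import date
from decimal import Decimal

# 0-based column positions in dbgen's pipe-delimited lineitem file
# (an assumption taken from the TPC-H lineitem column order).
QTY, PRICE, DISC, TAX, RFLAG, LSTATUS, SHIPDATE = 4, 5, 6, 7, 8, 9, 10
CUTOFF = date(1998, 9, 3)


def q1(lines):
    """Filter, group, aggregate and sort lineitem rows like the SQL query."""
    # Per group: [sum_qty, sum_base_price, sum_disc_price, sum_charge, sum_disc, count]
    groups = defaultdict(lambda: [Decimal(0)] * 5 + [0])
    for row in csv.reader(lines, delimiter="|"):
        if date.fromisoformat(row[SHIPDATE]) > CUTOFF:
            continue  # where l_shipdate <= date '1998-09-03'
        qty, price = Decimal(row[QTY]), Decimal(row[PRICE])
        disc, tax = Decimal(row[DISC]), Decimal(row[TAX])
        acc = groups[(row[RFLAG], row[LSTATUS])]
        acc[0] += qty
        acc[1] += price
        acc[2] += price * (1 - disc)
        acc[3] += price * (1 - disc) * (1 + tax)
        acc[4] += disc
        acc[5] += 1
    result = []
    for flag, status in sorted(groups):  # order by l_returnflag, l_linestatus
        s_qty, s_price, s_disc_price, s_charge, s_disc, n = groups[(flag, status)]
        result.append((flag, status, s_qty, s_price, s_disc_price, s_charge,
                       s_qty / n, s_price / n, s_disc / n, n))
    return result
```

The function accepts any iterable of lines, so it can be fed an open file object (for example a file opened with newline="") and it streams through the input once, keeping only one small accumulator per (l_returnflag, l_linestatus) group in memory.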

All transformations were run on my laptop: Intel Core 2 Duo @ 1666 MHz, 2048 MB RAM, 200 GB SATA 5400 RPM disk, Windows Vista Home Premium 32-bit.

The Java parameters were set to -server -Xmx256m -Xmx1536m.

The results aren’t surprising :-) :

  1. CloverETL – ParallelReader
  2. CloverETL – UniversalDataReader
  3. Talend
  4. Pentaho

Results

If you don’t trust me, you can verify the results on your own computer. All transformation graphs and the test dataset are available on rapidshare.com or filefactory.com (200 MB). CloverETL Designer can be downloaded from www.cloveretl.com.

A deeper and more extensive comparison will be published soon. Watch www.cloveretl.com and this blog. The latest news about CloverETL is also available in the CloverETL LinkedIn group and the CloverETL Facebook group. Don’t hesitate to join.

Transformation graphs

CloverETL ParallelReader

CloverETL UniversalDataReader

Talend

Pentaho

12 thoughts on “Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho”

  1. If you are reading the data in parallel, why are you not doing the same in Pentaho Data Integration?
    I’ll have a look later, but previous experience with the Clover “Benchmarks” tells me something fishy is going on.

    • Hi Matt! I tried it, but I was unsuccessful. Instead of the “Text file input” step I used “CSV file input” with “Running in parallel?” switched on and 2 running copies, but the resulting data was wrong. It looked like only half of the input records were read.
      If you want to show me that Pentaho can read faster, please send me a modified transformation graph and I will run it on my laptop. Download links for the input data and my transformations are available in my blog post.
      Important condition: you can read only one instance of the input file! No additional copies.

  2. Hi Peter,

    You were right, obviously: we don’t have an in-memory aggregation step in Pentaho Data Integration.
    The reason for that is quite simple if you think about it: we do our reporting and analysis in other parts of the Pentaho software stack.
    For performance-driven (immediate) response times I would recommend Pentaho Analysis (Mondrian) in that regard.
    It’s also possible to report directly on the data from Pentaho Data Integration too: http://michaeltarallo.blogspot.com/2009/07/pentaho-goes-to-movies-data-integration.html

    However, as with all benchmarks, there is this hidden implication that since one performance metric is bad, the whole tool must be bad. :-)

    So suppose we *would* have an in memory group by, what would the performance *then* be?
    Well, I created a new JIRA feature request in your honor this morning: http://jira.pentaho.com/browse/PDI-2804
    And, since the code is pretty trivial compared to the functionality of the streaming version, I implemented it in 4.0 as well.
    It’s unfortunately not possible to add the step to the stable 3.2 branches and I lack the time (and user requests) to create it as a plugin.

    As such, if you or a reader wants to play with it, get a recent (preview) build over here: http://ci.pentaho.com/job/Kettle/

    The result is more in line with yours for a CPU-bound process. Pentaho Data Integration did it in 45 seconds, but your system is a lot slower (1.66 GHz vs 2.33 GHz): http://imagebin.ca/img/oo-MhJ.jpg
    Get the transformation here : http://www.kettle.be/dloads/Pentaho2.ktr

    The transformation displayed is interesting in the sense that you can use data partitioning to parallelize the aggregation step as well. That gives you maybe another 5-10% gain. Maybe that’s something you should try too. The run-time for the transformation is really a bit too short to draw any major conclusions. Maybe we should try with larger data sets to flush out other bottlenecks. :-)

    On a side note: it takes Kettle 100 seconds to bulk load the data into MySQL.
    Executing the query takes MySQL 43 seconds, even when you put an index on the group columns, so I don’t think either tool is doing that bad of a job.

    Again, thanks for the puzzle/challenge. I certainly had my fun with it :-)
    Best of luck with CloverETL, it looks like you’re doing a great job.

    Kind regards,
    Matt

  3. Pingback: ParallelReader Versus Competitors Part 2 « CloverETL's Blog

  4. Pingback: ParallelReader Versus Competitors Finish « CloverETL's Blog
