CloverETL's Blog

October 26, 2009

ParallelReader versus competitors

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 3:37 am

On Oct. 21 OpenSys released a new version of its ETL tool, CloverETL Designer version 2.8.1. It’s mainly bugfix version but also brings a new component, ParallelReader, that makes delimited data file (CSV) processing faster than ever before.

I decided to make a test and compare ParallelReader’s performance with CloverETL’s UniversalDataReader and also with two ETL competitors Talend Open Studio (3.1.3) and Pentaho Data Integration (3.2.0).

As a testing task I chose simple SQL query and I tried to rewrite it to ETL transformation.

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date ‘1998-09-03’
group by l_returnflag, l_linestatus
order byl_returnflag, l_linestatus;

This query is one of standard queries used for performance testing of database engines. More info on http://www.tpc.org/tpch/. The dataset for testing was generated by dbgen utility available on tpc.org too. The size of dataset is 725MB.

All transformation was run on my laptop: Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.

Java parameters was set up to -server -Xmx256m -Xmx1536m.

The results aren’t surprising :-) :

  1. CloverETL – ParallelReader
  2. CloverETL – UniversalDataReader
  3. Talend
  4. Pentaho

Results

If you don’t trust me you can verify results on your own computer. All transformation graphs and testing dataset are available on rapidshare.com or filefactory.com (200 MB). CloverETL Designer can be downloaded on www.cloveretl.com.

Deeper and more extensive comparison will be published soon. Watch www.cloveretl.com, watch this blog. The latest news about CloverETL are also available on CloverETL linkedin group and CloverETL facebook group. Don’t hesitate and join.

Transformation graphs

CloverETL ParallelReader

CloverETL ParallelReader


CloverETL UniversalDataReader

CloverETL UniversalDataReader


Talend

Talend


Pentaho

Pentaho

12 Comments »

  1. If you are reading the data in parallel, why are you not doing the same in Pentaho Data Integration?
    I’ll have a look later, but previous experience with the Clover “Benchmarks” tells me something fishy is going on.

    Comment by Matt Casters — October 29, 2009 @ 5:50 pm

    • Hi Matt! I tried it but I was unsuccessful. Instead of “Text file input” step I used “CSV file input” with “Running in parallel?” switched on and 2 running copies but the result data was wrong. It looked like only a half of input records was read.
      If you want to show me that Pentaho can read faster, please provide me modified transformation graph and I will run it on my laptop. Download link for input data and my transformations are available in my blog post.
      Important condition: you can read only one instance of input file! No additional copies.

      Comment by Petr Uher — October 29, 2009 @ 7:54 pm

      • The file lineitems.tbl has an extra trailing column. (according to the CSV standard) Define a dummy value at the end and you’re fine.

        Comment by Matt Casters — October 29, 2009 @ 8:05 pm

  2. By the way, I’m still trying to download the file without too much luck. I’ll give it another try later.

    Comment by Matt Casters — October 29, 2009 @ 8:06 pm

    • It looks like rapidshare is too busy. I’ll reupload it to another server and add download link to the post.

      Comment by Petr Uher — October 29, 2009 @ 8:12 pm

      • OK, since I know that the reading is fast enough, I can only guess by looking at the image that Kettle is being hurt by the sort BEFORE the aggregate.

        Comment by Matt Casters — October 29, 2009 @ 8:21 pm

      • Yes, it is but it isn’t my fault. It’s Pentaho’s “feature”. Pentaho “Group by” step requires sorted input, CloverETL’s and Talend’s aggregates don’t.

        Comment by Petr Uher — October 29, 2009 @ 8:29 pm

    • Alternative link for download added.
      http://www.filefactory.com/file/a0732fg/n/ParallelReaderComparison_zip

      Comment by Petr Uher — October 29, 2009 @ 8:39 pm

      • Thanks for the puzzle Peter, I’ll have a look tomorrow.

        Comment by Matt Casters — October 29, 2009 @ 8:41 pm

  3. Hi Peter,

    You were right obviously, we don’t have an in memory aggregation step in Pentaho Data Integration.
    The reason for that is quite simple if you think about it actually: we do our reporting and analyses in other parts of the Pentaho software stack.
    For performance driven (immediate) response times I would recommend Pentaho Analyses (Mondrian) in that regard.
    It’s also possible to report directly on the data from Pentaho Data Integration too: http://michaeltarallo.blogspot.com/2009/07/pentaho-goes-to-movies-data-integration.html

    However, as with all benchmarks, there is this hidden implication that since one performance metric is bad, the whole tool must be bad. :-)

    So suppose we *would* have an in memory group by, what would the performance *then* be?
    Well, I created a new JIRA feature request in your honor this morning: http://jira.pentaho.com/browse/PDI-2804
    And, since the code is pretty trivial compared to the functionality of the streaming version, I implemented it in 4.0 as well.
    It’s unfortunately not possible to add the step to the stable 3.2 branches and I lack the time (and user requests) to create it as a plugin.

    As such, if you or a reader wants to play with it, get a recent (preview) build over here: http://ci.pentaho.com/job/Kettle/

    The result is more in-line with yours for a CPU-bound process. Pentaho Data Integration did it in 45 seconds, but your system is a lot slower (1.66Ghz vs 2.33Ghz) http://imagebin.ca/img/oo-MhJ.jpg
    Get the transformation here : http://www.kettle.be/dloads/Pentaho2.ktr

    The transformation displayed is interesting in the sense that you can use data partitioning to parallelize the aggregation step as well. That gives you maybe another 5-10% gain. Maybe that’s something you should try too. The run-time for the transformation is really a bit too short to draw any major conclusions. Maybe we should try with larger data sets to flush out other bottlenecks. :-)

    On a side note: it takes Kettle 100 seconds to bulk load the data into MySQL.
    Executing the query takes MySQL 43 seconds, even when you put an index on the group columns, so I don’t think either tool is doing that bad of a job.

    Again, thanks for the puzzle/challenge. I certainly had my fun with it :-)
    Best of luck with CloverETL, it looks like you’re doing a great job.

    Kind regards,
    Matt

    Comment by Matt Casters — October 30, 2009 @ 10:56 am

  4. [...] pm Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the [...]

    Pingback by ParallelReader Versus Competitors Part 2 « CloverETL's Blog — November 11, 2009 @ 4:08 pm

  5. [...] transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part [...]

    Pingback by ParallelReader Versus Competitors Finish « CloverETL's Blog — December 9, 2009 @ 3:09 pm


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.