CloverETL's Blog

November 11, 2009

ParallelReader Versus Competitors Part 2

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 4:08 pm

Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the comparison.

The second transformation is also based on SQL query that I rewrote to ETL transformation. I chose Query 3 from http://www.tpc.org/tpch.

select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from customer, orders, lineitem
where c_mktsegment = ‘BUILDING’
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date ‘1995-03-15’
and l_shipdate > date ‘1995-03-15’
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate

Input data are generated by dbgen utility and stored in CSV files.

  • lineitem.tbl – 6,001,215 records, 724 MB
  • customers.tbl – 15,000 records, 23.2 MB
  • orders.tbl – 1,500,000 records, 163 MB

Expected output should contain 11,620 records.

There is a new item in the results. After a discussion in my previous post I added „Pentaho parallel“, Pentaho transformation that reads data in parallel mode. Thanks Matt for your transformation :-) without it I wasn’t able to set it up.

Matt Caster also presented an opinion that Pentaho is discriminated because the transformation sorts the data before aggregation in Pentaho transformation. Yes, I agree that sorting of 6,000,000 records takes a significant amount of execution time of the transformation. But I have no choice, Pentaho aggregate component requires sorted input. Today’s transformation is more fair in this aspect. The number of records flowing to aggregate component is smaller (30,519 records) so they can be easily sorted in memory and the sorting doesn’t influence the total execution time in such volume.

The versions of used ETL tools stay the same ones: CloverETL Designer 2.8.1, Talend Open Studio 3.1.3 and Pentaho Data Integration 3.2.0.

Also the hardware configuration and Java runtime parameters are the same:

  • Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.
  • -server -Xmx256m -Xmx1536m

Results:

  1. CloverETL ParallelReader
  2. Talend
  3. Pentaho parallel
  4. CloverETL UniversalDataReader
  5. Pentaho

Results

Transformations and the input data are available on filefactory.com. Today’s transformation are named TPCH2. The transformation from my previous post are named TPCH1.

Please give me a feedback, especially on Talend transformation if it’s correct.

Transformation graphs

CloverETL ParallelReader & UniversalDataReader

CloverETL ParallelReader & UniversalDataReader

Talend

Talend

Pentaho

Pentaho

Pentaho parallel

Pentaho parallel

2 Comments »

  1. [...] ParallelReader Versus Competitors Part 2 « CloverETL's Blog [...]

    Pingback by Yves de Montcheuil auf YouTube - Dijit — November 11, 2009 @ 10:15 pm

  2. [...] Short summary of my previous posts: For testing I used two transformations based on TPCH test and the input data generated by dbgen utility. The transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part 2. [...]

    Pingback by ParallelReader Versus Competitors Finish « CloverETL's Blog — December 9, 2009 @ 3:09 pm


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.