Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the comparison.
The second transformation is also based on SQL query that I rewrote to ETL transformation. I chose Query 3 from http://www.tpc.org/tpch.
sum(l_extendedprice*(1-l_discount)) as revenue,
from customer, orders, lineitem
where c_mktsegment = ‘BUILDING’
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date ‘1995-03-15’
and l_shipdate > date ‘1995-03-15’
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
Input data are generated by
dbgen utility and stored in CSV files.
- lineitem.tbl – 6,001,215 records, 724 MB
- customers.tbl – 15,000 records, 23.2 MB
- orders.tbl – 1,500,000 records, 163 MB
Expected output should contain 11,620 records.
There is a new item in the results. After a discussion in my previous post I added „Pentaho parallel“, Pentaho transformation that reads data in parallel mode. Thanks Matt for your transformation without it I wasn’t able to set it up.
Matt Caster also presented an opinion that Pentaho is discriminated because the transformation sorts the data before aggregation in Pentaho transformation. Yes, I agree that sorting of 6,000,000 records takes a significant amount of execution time of the transformation. But I have no choice, Pentaho aggregate component requires sorted input. Today’s transformation is more fair in this aspect. The number of records flowing to aggregate component is smaller (30,519 records) so they can be easily sorted in memory and the sorting doesn’t influence the total execution time in such volume.
The versions of used ETL tools stay the same ones: CloverETL Designer 2.8.1, Talend Open Studio 3.1.3 and Pentaho Data Integration 3.2.0.
Also the hardware configuration and Java runtime parameters are the same:
- Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.
-server -Xmx256m -Xmx1536m
- CloverETL ParallelReader
- Pentaho parallel
- CloverETL UniversalDataReader
Transformations and the input data are available on filefactory.com. Today’s transformation are named TPCH2. The transformation from my previous post are named TPCH1.
Please give me a feedback, especially on Talend transformation if it’s correct.
CloverETL ParallelReader & UniversalDataReader