Tag Archives: delimited file

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho

On Oct. 21 OpenSys released a new version of its ETL tool, CloverETL Designer version 2.8.1. It’s mainly bugfix version but also brings a new component, ParallelReader, that makes delimited data file (CSV) processing faster than ever before.

I decided to make a test and compare ParallelReader’s performance with CloverETL’s UniversalDataReader and also with two ETL competitors Talend Open Studio (3.1.3) and Pentaho Data Integration (3.2.0).

As a testing task I chose simple SQL query and I tried to rewrite it to ETL transformation.

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date ‘1998-09-03’
group by l_returnflag, l_linestatus
order byl_returnflag, l_linestatus;

This query is one of standard queries used for performance testing of database engines. More info on http://www.tpc.org/tpch/. The dataset for testing was generated by dbgen utility available on tpc.org too. The size of dataset is 725MB.

All transformation was run on my laptop: Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.

Java parameters was set up to -server -Xmx256m -Xmx1536m.

The results aren’t surprising :-) :

  1. CloverETL – ParallelReader
  2. CloverETL – UniversalDataReader
  3. Talend
  4. Pentaho

Results

If you don’t trust me you can verify results on your own computer. All transformation graphs and testing dataset are available on rapidshare.com or filefactory.com (200 MB). CloverETL Designer can be downloaded on www.cloveretl.com.

Deeper and more extensive comparison will be published soon. Watch www.cloveretl.com, watch this blog. The latest news about CloverETL are also available on CloverETL linkedin group and CloverETL facebook group. Don’t hesitate and join.

Transformation graphs

CloverETL ParallelReader

CloverETL ParallelReader

CloverETL UniversalDataReader

CloverETL UniversalDataReader

Talend

Talend

Pentaho

Pentaho

ParallelReader Component: Performance Boost in Data Processing

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

Hidden Features: Using Mutable Delimiter for Data Parsing

CloverETL provides a very useful feature: mutable delimiter. When you parse a delimited file (eg. CSV) you can specify different delimiter for each field. This isn’t surprising for daily CloverETL users however for users of other ETL tools it can be. It might not be very well known that in CloverETL you can even define more delimiters for one field (so called “mutable delimiter”) and CloverETL chooses the right one. It reveals new ways of file processing with irregular structure in CloverETL. I believe this functionality isn’t provided by any other ETL tool on the market. If I am wrong you can leave me a message in comments. I’m always happy to find “hidden features” of other ETL tools.

Syntax of a mutable delimiter: delimiters have to be separated by ‘\|‘. For example if you want to define that field delimiter can be ‘;‘ or ‘,‘ or ‘#‘ you have to write ‘;\|,\|#‘.
The simple example of using a mutable delimiter you can download here as a zipped CloverETL project. The import of existing CloverETL project to your CloverETL Designer is described in CloverETL documentation.