Tag Archives: performance

Big Time with Big Records

Those of you who have ever tried to process big records with CloverETL already learned that it required some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that has optimized memory usage when running graphs with big records.

Let’s take a look inside to see what this is all about…

Pipeline Approach

Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point in inter-component communication. They have to ensure a fast transfer of records from one component to another. Our approach for edge data transfer has always been based on records serialization into byte stream on the starting end of an edge and deserialization back to record form on the other end. This ensures a basic invariant for all of our components - no record instance sharing. Each component has its own instance of a record populated from data on the input edge. It is then processed by the component and serialized into an output edge. This simple idea delivers excellent performance gains. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)

This imposes a painful decision to make on the edge itself – the capacity of the buffer that stores the bytes as they are passed from one end to the other. Obviously, that buffer must have enough room to hold the biggest record which passes through it. Those familiar with the CloverETL engine already know where I am going – the Records.MAX_RECORD_SIZE parameter.

In versions prior to 3.2, we used standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were set to accommodate at least MAX_RECORD_SIZE bytes worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XMLs, unstructured data and other modern marvels is still far from being enough.) Yet, increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes in the Server where the default was applied, regardless of whether or not the graph processed big records.

Introducing CloverBuffer

Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for an edge and component buffer is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for serialized byte form of records – a CloverBuffer. It acts as a full replacement of a ByteBuffer, but what sets it apart is the ability to grow. CloverBuffer starts small but can transparently grow up to a predefined maximum limit (newly introduced RECORD_INITIAL_SIZE and RECORD_LIMIT_SIZE) without needing any programmer intervention.

So although there still is one global setting for all, it just sets boundaries that cannot be crossed. But anything in between those limits is allocated automatically to ensure the smallest memory footprint of each transformation run based on its needs in real time, not estimated ones. Graphs combining the processing of big records and small records, e.g. main stream of data combined with some logging branch, utilize only as much memory per edge/component as the size of the data passing through them.

Programmers Only

All CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations can run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible so even without changing your code, it still complies with the new release.

For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);

This now should be substituted with following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);

The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants was introduced:
Record.RECORD_INITIAL_SIZE – initial buffer size, for now 64KB and will be probably decreased in upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimalize initial memory allocation for regular graphs.

The second constant ‘Record.RECORD_LIMIT_SIZE’ is actually one-to-one replacement for MAX_RECORD_SIZE (keeping MAX_RECORD_SIZE backward compatible for the sake of unmodified components), which sets the maximum upper bound per one CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or increasing this upper bound affects memory consumption only in the cases where there is a real need for such big buffer – otherwise the buffers are kept at RECORD_INITIAL_SIZE and are grown gradually towards the upper limit.

As you can see, the CloverBuffer now makes it possible to process bigger records with less memory footprint, since only buffers for edges or components that actually manipulate big records grow, while the others still remain small.

Performance Optimization of Metrics in CloverETL Data Profiler

The first beta version CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides bug fixing and adding a few new features, we also worked on performance optimization of profiling metrics. This article will describe this improvement and how profiling is interconnected with CloverETL Engine.

The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets this makes using the operating system memory inappropriate. Therefore, external memory needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets– larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data, if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort components, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as SQL database) are used during profiling.

The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom transform Java code to calculate the metrics. Rollup allows for producing variable amount of output records for any amount of incoming records.

Another performance improvement in second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The profiler will now make better use of available CPUs and there will be less CPU time consumed on context-switching. This results in a boost up to 15% for data with a high amount of fields. Also, simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

Meet Joda and get 30% more power!

Joda… don’t get too excited – Clover has (most probably) not contracted a Jedi master to squeeze in a portion of extra power to the product. In the real world, Joda is a quite useful third-party library for handling date and time operations. It has been in CloverETL for some time as an alternative option to the standard Java date implementation. Although not having superpowers of the aforementioned sci-fi character it is well worth being friends with and using it wisely might give your data transformations a noticeable punch in terms of performance.

The goal of Joda creators is simple – to create a perfect implementation of date and time handling functions in Java which would be superior to the standard Java built-in ones. Joda is well-designed, extensible and easy to use. But for us data maniacs, there’s one single sweet thing about it – it’s damn fast.

Using Joda proves especially useful for flat files and parsing dates from strings and putting them back to formatted strings on the output end of a transformation. Joda can yield around 30 % speed increase compared to the standard Java! So if you’re dealing with data files with lots of date fields in various formats and need to read them, perform some date operations and then output the results to a formatted string representation, you’re definitely not going to regret switching to Joda.

Although such huge performance gains usually come at a price, with Joda there’s actually very little to be concerned about. There are basically two things you could think of: The first one being that Joda is really strict on the format of the data – far more than Java. In Java you don’t need to care too much when your data is of incorrect case, contains extra white spaces, etc. However, Joda fails on data that do not exactly conform to the specified format. So if “1-JAn-2010” (notice the “A”) is fine for Java, Joda ends with an error on it. So bear this in mind.

Joda uses the same formatting symbols as you get in Java. But here comes another small drawback of the current Joda version: you cannot parse a time zone name – like “Pacific Standard Time” or “PST”. For those of you who are familiar with formatting strings, it’s the “z” symbol. The “Z” option works quite well and formatting dates using both “z” and “Z” (output) works just fine too.

Using Joda in CloverETL is fairly easy. You don’t need to install or link anything – it’s right there already. As you know each “date” field in CloverETL metadata can have a specific format specified. By specifying a prefix in the format string you can control whether you want to use Java or Joda engine. Let’s see an example:

Use Java engine:

DD/MM/YYYY (default)

java:DD/MM/YYYY

Use Joda engine:

joda:DD/MM/YYYY

So by prefixing the date format with either java: or joda: you can explicitly say which date engine that field should use. As you can see you can even control which engine to use on a per-field basis. That’s it, nothing else. So try it for yourself and see the difference!

Interesting resources:

Joda project page:

http://joda-time.sourceforge.net/

Formatting string symbols:

http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html

Sorting Data: ExtSort vs. FastSort – which one is better for me? (Part 2)

In my  previous post I have focused on tips for tweaking the FastSort component – performance sort component available in all commercial CloverETL editions. Today, I would like to touch the original ExtSort component which has been in CloverETL for a while and is available in both commercial and free (Community, opensource engine) editions.

While FastSort is optimized to deliver stunning performance in resource-rich environment, ExtSort is a modest hard worker used to working with limited resources. This proves to be useful especially in tranformations with lots of parallel branches with sorting.

Again I won’t discuss the case where there are only a few records and all sorting can be done in memory. For external sorting, ExtSort organizes its temp data in sorted chunks and places them onto a fixed number of tapes – each tape being a single temporary file. Since ExtSort is single-threaded, multiple ExtSort-s running at once do not consume that much resources as FastSort does.

With ExtSort, there isn’t that much magic as with tweaking FastSort. First what you can do is specify a number of temporary directories where tapes will be stored. This is useful when you’re able to harness multiple physical drives under multiple mount points, otherwise just make sure you have enough free space on the drive.

The only two attributes you can fiddle with is Buffer capacity and Number of tapes. There’s also the Sorter initial capacity but this will be deprecated soon and nowadays acts exactly the same as Buffer capacity. Buffer capacity determines the size of each chunk and thus needs to fit into memory. If you’re familiar with FastSort, Run size is the equivalent parameter there. Increasing Buffer capacity is generally a good idea for boosting ExtSort’s performance, but expect increased memory requirements, although still far from those of FastSort.

Number of tapes defaults to 6 and generally will yield best results, however again, increasing the number slightly does not take too much resources and can help with performance.

Compared to FastSort, ExtSort is pretty easy to configure and will work almost under any conditions. However, there’s always a catch – with ExtSort it is the rather inferior performance compared to its stronger brother.

Sorting Data – ExtSort vs. FastSort – which one is better for me?

I often get asked why CloverETL offers two sort components instead of just one and what’s the right key for determining which one is better for a particular purpose.

The reason for having two sort components in CloverETL is simply to keep things as easy as possible. Since the inner natures of ExtSort and FastSort are quite different it would be really difficult to implement a nice and clean universal one.

Luckily, the decision is simple and straightforward. In case you can dedicate enough system resources (CPU cores and/or memory) for the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want a more conservative behavior, pick ExtSort which will give you steady performance at minimum system requirements.

FastSort is a very powerful tool, but to truly witness its power, users must set it up correctly to use their hardware’s maximum potential. We will now dive into the settings behind this impressive component and learn how to max out it’s ability while being careful to avoid crashes.

Tweaking FastSort

FastSort is greedy for both memory and CPU cores and in case the system does not have enough of either, FastSort can quite easily crash with out-of-memory, especially if the records you’re going to sort are big (long string fields, tens or hundreds of fields, etc.).

Parallelism

Unlike ExtSort, FastSort can utilize potentially unlimited number of CPU cores to do its job. You can control how many worker threads are used by overriding default value for “Concurrency (threads)”. My experience shows however, that unless you’re able to use really fast disk drives, going for more than 2 threads does not necessarily help and can even slow the process back down a bit. So basically you don’t need to worry about parallelism at all unless you have the hardware to take advantage of it. Remember, that parallelism adds extra memory load for each additional thread!

Memory

FastSort can be a bit tricky with memory, since there are multiple settings which affect it. The most important is the “Run size (records)” which denotes the size of the data chunk being sorted at a time. Note, that actual record size and level of parallelism increase the overall memory consumption – so be careful with this setting. The default is 20k records, if you set the “Estimated record count” – which is your rough guess on total count of records to be sorted, the Run size is computed for you based on a experimentally derived formula. This formula tries to get the right “Run size” based on number of records and amount of available memory (which you can limit with “Maximum memory” – defaults to unlimited). This “computed guess” works in most cases, but can fail under certain conditions. You need to test and tweak on your data a bit to get the best result. Run size is definitely a parameter worth playing with!

Be sure to have enough memory dedicated to your JVM – with large, numerous records. You want to give FastSort plenty of free memory – going for 512 MB up to 2 gigs is worth it! (e.g. –Xmx1536m) With a lot of memory, FastSort will do an amazing job. However with default 64 MB heap space setting, FastSort can crash.

‘In memory only sorting’ is an option you can use in case you’re sure that all data will fit into your memory – you can either force it (and then possibly crash due to out-of-memory) or leave it to default auto. Auto means that at first, FastSort tries to sort the data in memory and if that fails, on disk sorting is used instead.

Other limits and valuable parameters

Apart from memory settings, you can impose more limits on FastSort to reflect your needs. For example, if your system works with disk quotas which limit the number of open files, you can cap temp files of FastSort with “Max open files”. Note that FastSort uses LOTS of files – hundreds, thousands. If you cap it too much (500 or less) FastSort will continue to work, but  its performance will decrease significantly. So should you need to limit the number of open files, consider switching to ExtSort.

Settings you can forget

There are other advanced options for FastSort, but you can leave them to default values unless you are really trying to optimize your sort. Number of read buffers defines how many chunks of data will be held in memory at a time – which must be at least the number of Concurrency – otherwise some of the workers wouldn’t have data to work on. Using too large a number, you’ll end up with out-of-memory – the default is based on current concurrency setting and is just fine.

Average record size is nothing else than a helper guess on average byte size of records in the data – if not set, FastSort computes this automatically from the real data so it’s usually more precise than setting an explicit value.

Tape buffer is a buffer used for each worker for filling the output and slightly affects performance, but the default is fine in almost all cases.

The last two options control how temp files are created, they can be either compressed (defaults to false) and you can even control the charset of string fields (default UTF16). Both are there for space saving purposes (space occupied by temp files during graph execution) and decrease performance.

The Decision

FatSort is very powerful sorting component and can significantly speed up your transformation process. But it has to be set up correctly. So, if you are not sure and you want the always safe and simple sort, go with ExtSort. On the other hand, if you know your hardware and want to utilize it to optimize your sort for speed, dive into FastSort and explore it a bit. The results can be extraordinary.

To be continued … (Part 2 will discuss ExtSort component)

DataDirect’s OracleDB JDBC Driver Speed Test

Purpose

Compare the speed of data loading into Oracle database (Oracle Database 11g Release 11.1.0.6.0 – Enterprise Edition) with Oracle corp. JDBC driver, DataDirect JDBC Oracle driver and direct data loading (OracleDataWriter component – sqlldr utility) in CloverETL.

 

Test description

 

Graph used for testing:

DDdbLoad.grf

The above graph loads data into database table that contains 3 number columns and 127 varchar columns.

Database table for storing data is truncated before each data loading – DBExecute components, each with query: TRUNCATE TABLE dd_test1 REUSE STORAGE

Phase 1: loading data with DDBulkLoad (DataDirect) object from csv file (loader.load(file))

Phase 3: loading data with DDBulkLoad (DataDirect) object from ResultSet (loader.load(resultSet)) – created ResultSet implementation, that reads data from DataRecord (read from the edge).

Phase 4: loading data with DBOutputTable with Oracle corp’s  JDBC driver:

Manifest-Version: 1.0
Specification-Title:    Oracle JDBC driver classes for use with JDK14
Sealed: true
Created-By: 1.4.2_08 (Sun Microsystems Inc.)
Implementation-Title:   ojdbc14.jar
Specification-Vendor:   Oracle Corporation
Specification-Version:  Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Version: Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Vendor:  Oracle Corporation
Implementation-Time:    Wed Jan 25 01:28:31 2006

Phase 7: loading data with DBOutputTable with DataDirect Oracle JDBC driver – enabled bulk load feature

Phase 9: loading data with OracleDataWriter component from csv file (sqlldr utility)

Phase 11: loading data with OracleDataWriter component from edge (sqlldr utility)

Test processing

Graph run 3 times for 10,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with following settings:

 

Test results

Results in seconds.

1,000,000 records with default DataDirect settings:

Phase 1: 178, 167, 132 – min: 132, max: 178, average: 159

Phase 3: 128, 166, 152 – min: 128, max: 166, average: 149

Phase 5: 228, 246, 290 – min: 228, max: 290, average: 255

Phase 7: 176, 170, 239 – min: 170, max: 239, average: 195

Phase 9: 44, 45, 56 – min: 44, max: 56, average: 48

Phase 11: 104, 95, 106 – min: 95, max: 104, average: 102

1,000,000 records with custom settings:

Phase 1: 163, 152, 142 – min: 142, max: 163, average: 152

Phase 3: 166, 133, 134 – min: 133, max: 166, average: 144

Phase 5: 278, 263, 260 – min: 260, max: 278, average: 267

Phase 7: 239, 172, 209 – min: 172, max: 239, average: 207

10,000,000 records with default DataDirect settings:

Phase 1: 1553, 1818, 1352 – min: 1352, max: 1818, average: 1574

Phase 3: 1475, 1299, 1298 – min: 1298, max: 1475, average: 1357

Phase 5: 3041, 2592, 2550 – min: 2550, max: 3041, average: 2728

Phase 7: 1824, 1623, 1722 – min: 1722, max: 1824, average: 1723

Phase 9: 404, 432, 472 – min: 404, max: 472, average: 436

Phase 11: 1096, 975, 1012 – min: 975, max: 1096, average: 1028

Summary

Loading data was slowest when DBOutputTable with original Oracle corp. driver was used. All loadings with DataDirect driver were faster than with Oracle corp. driver but the usage of DDBulkLoad object (DataDirect) plainly increases the speed of loading data in comparison with setting EnableBulkLoad=true and using DBOutputTable. The results for loading data from csv file and edge (result set) are very similar with slight predomination of ResultSet method. All three methods with DataDirect driver usage, get to more steady execution times with number of records to load.

The fastest way of loading data is unquestionably direct data loading with sqlldr utility. Even when inter-storing data in pipe, the sqlldr utility is about 50% faster than any other method, but is less convenient.
1,000,000 records
10,000,000 records

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho (Part 3)

As I have promised I bring you a complex comparison of ETL tools: CloverETL, Talend and Pentaho.

Short summary of my previous posts: For testing I used two transformations based on TPCH test and the input data generated by dbgen utility. The transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part 2.

New testing:
To ensure my comparison a full complexity, all tools were tested as “desktop” and “enterprise” ETL tools. The “desktop” tools were running on laptop computer with a small amount of data. The “enterprise” ETL tools were running on server class machine with a large amount of data stored both in flat files and in a database. The transformation executed on server class machine was the same as the one I executed on desktop, only the size of input data was changed:

  • lineitem.tbl – 59,986,052 records, 7.24 GB
  • customers.tbl – 1,500,000 records, 233 MB
  • orders.tbl – 15,000,000 records, 1.62 GB

The results of flat file reading:

TPCH-Q1

TPCH-Q1

TPCH-Q3

TPCH-Q3

The new results of database reading, all previously published results, detailed information about used hardware and a summary are available in this final document.

I also described main features of all tools and my experiences to work with them. This part of the document expresses my opinions so it could be biased since I work mostly with CloverETL. If you don’t agree with anything, please express your opinion in comments. I will be pleased to discuss them with you.

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho (Part 2)

Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the comparison.

The second transformation is also based on SQL query that I rewrote to ETL transformation. I chose Query 3 from http://www.tpc.org/tpch.

select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from customer, orders, lineitem
where c_mktsegment = ‘BUILDING’
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date ‘1995-03-15’
and l_shipdate > date ‘1995-03-15’
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate

Input data are generated by dbgen utility and stored in CSV files.

  • lineitem.tbl – 6,001,215 records, 724 MB
  • customers.tbl – 15,000 records, 23.2 MB
  • orders.tbl – 1,500,000 records, 163 MB

Expected output should contain 11,620 records.

There is a new item in the results. After a discussion in my previous post I added „Pentaho parallel“, Pentaho transformation that reads data in parallel mode. Thanks Matt for your transformation :-) without it I wasn’t able to set it up.

Matt Caster also presented an opinion that Pentaho is discriminated because the transformation sorts the data before aggregation in Pentaho transformation. Yes, I agree that sorting of 6,000,000 records takes a significant amount of execution time of the transformation. But I have no choice, Pentaho aggregate component requires sorted input. Today’s transformation is more fair in this aspect. The number of records flowing to aggregate component is smaller (30,519 records) so they can be easily sorted in memory and the sorting doesn’t influence the total execution time in such volume.

The versions of used ETL tools stay the same ones: CloverETL Designer 2.8.1, Talend Open Studio 3.1.3 and Pentaho Data Integration 3.2.0.

Also the hardware configuration and Java runtime parameters are the same:

  • Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.
  • -server -Xmx256m -Xmx1536m

Results:

  1. CloverETL ParallelReader
  2. Talend
  3. Pentaho parallel
  4. CloverETL UniversalDataReader
  5. Pentaho

Results

Transformations and the input data are available on filefactory.com. Today’s transformation are named TPCH2. The transformation from my previous post are named TPCH1.

Please give me a feedback, especially on Talend transformation if it’s correct.

Transformation graphs

CloverETL ParallelReader & UniversalDataReader

CloverETL ParallelReader & UniversalDataReader

Talend

Talend

Pentaho

Pentaho

Pentaho parallel

Pentaho parallel

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho

On Oct. 21 OpenSys released a new version of its ETL tool, CloverETL Designer version 2.8.1. It’s mainly bugfix version but also brings a new component, ParallelReader, that makes delimited data file (CSV) processing faster than ever before.

I decided to make a test and compare ParallelReader’s performance with CloverETL’s UniversalDataReader and also with two ETL competitors Talend Open Studio (3.1.3) and Pentaho Data Integration (3.2.0).

As a testing task I chose simple SQL query and I tried to rewrite it to ETL transformation.

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date ‘1998-09-03’
group by l_returnflag, l_linestatus
order byl_returnflag, l_linestatus;

This query is one of standard queries used for performance testing of database engines. More info on http://www.tpc.org/tpch/. The dataset for testing was generated by dbgen utility available on tpc.org too. The size of dataset is 725MB.

All transformation was run on my laptop: Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.

Java parameters was set up to -server -Xmx256m -Xmx1536m.

The results aren’t surprising :-) :

  1. CloverETL – ParallelReader
  2. CloverETL – UniversalDataReader
  3. Talend
  4. Pentaho

Results

If you don’t trust me you can verify results on your own computer. All transformation graphs and testing dataset are available on rapidshare.com or filefactory.com (200 MB). CloverETL Designer can be downloaded on www.cloveretl.com.

Deeper and more extensive comparison will be published soon. Watch www.cloveretl.com, watch this blog. The latest news about CloverETL are also available on CloverETL linkedin group and CloverETL facebook group. Don’t hesitate and join.

Transformation graphs

CloverETL ParallelReader

CloverETL ParallelReader

CloverETL UniversalDataReader

CloverETL UniversalDataReader

Talend

Talend

Pentaho

Pentaho

ParallelReader Component: Performance Boost in Data Processing

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?