CloverETL's Blog

December 18, 2009

CloverETL extends ETL Price/Performance Leadership with Launch of CloverETL Cluster

Filed under: Using CloverETL. Posted by Lucie Felixova @ 3:54 pm

On December 9, 2009, CloverETL Cluster Edition was launched at the PriceWaterhouseCoopers premises. CloverETL Cluster intelligently partitions data and distributes it evenly across multiple nodes in a cluster for parallel execution. Its ability to load-balance large data transformations increases throughput, fault tolerance and flexibility.

CloverETL Cluster can be deployed on premises in a customer’s own data center or in a variety of cloud configurations, such as Amazon EC2, which can drive costs even lower. During the launch in Prague, CloverETL Cluster was demonstrated running on four Amazon EC2 servers.

The following list shows the time CloverETL requires to execute a moderately complex transformation of six million records (725 MB) in a variety of local and cloud configurations:

  • CloverETL Desktop Designer on a MacBook: 150 seconds
  • CloverETL Cluster Load Balanced Across Two EC2 servers: 60 seconds
  • CloverETL Cluster Load Balanced Across Three EC2 servers: 43 seconds
  • CloverETL Cluster Load Balanced Across Four EC2 servers: 31 seconds
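To put these timings in perspective, here is a small sketch that turns them into throughput and speedup figures. The numbers are copied from the list above; the helper names (`throughput`, `speedup`) are made up for this illustration. Keep in mind that the MacBook and the EC2 servers are different hardware, so the ratio against the MacBook is only a rough indication, not a pure measure of cluster scaling.

```python
# Timings copied from the list above: configuration -> seconds.
TIMINGS = {
    "MacBook (Desktop Designer)": 150,
    "2 EC2 servers": 60,
    "3 EC2 servers": 43,
    "4 EC2 servers": 31,
}
RECORDS = 6_000_000  # six million records
SIZE_MB = 725        # total input size in megabytes


def throughput(seconds):
    """Return (records per second, megabytes per second) for one run."""
    return RECORDS / seconds, SIZE_MB / seconds


def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the chosen baseline run."""
    return baseline_seconds / seconds


macbook = TIMINGS["MacBook (Desktop Designer)"]
two_nodes = TIMINGS["2 EC2 servers"]

for name, seconds in TIMINGS.items():
    records_per_s, mb_per_s = throughput(seconds)
    print(
        f"{name}: {records_per_s:,.0f} records/s, {mb_per_s:.1f} MB/s, "
        f"{speedup(macbook, seconds):.2f}x vs MacBook, "
        f"{speedup(two_nodes, seconds):.2f}x vs 2 servers"
    )
```

Comparing the cluster runs with each other (the last column) is the fairer way to judge how the transformation scales as servers are added.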

October 23, 2009

Parallel reader

Filed under: Using CloverETL. Posted by Martin Zatopek @ 11:36 am

In the October release of Clover, 2.8.1, we introduced a new component that deserves your attention: the Parallel Reader. The name itself suggests the goal of the component, which is to improve reading speed by going parallel. In function, the component is very similar to Universal Data Reader – it reads delimited flat files such as CSV, tab-delimited, etc. – so not much has changed here. The real difference is under the hood.

There are two major optimizations that allow Parallel Reader to deliver excellent performance, especially on server-class machines with fast modern disks or, better yet, disk arrays. The first is, of course, reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed to reading threads. The threads all work at the same time, each one parsing data records only from its own part of the file. The number of threads is set by the component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. the number of disks in a striped RAID – to harness the maximum power of Parallel Reader. The second big performance gain comes simply from simplifying the data parser inside. This parser is as simple as possible, with limited validation, error handling and functionality, but it is really, really fast.
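The chunking idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Parallel Reader's actual implementation: the function names, the `level_of_parallelism` argument and the rule for assigning a record to a chunk are all assumptions made for this example. Here a record belongs to the chunk in which its first byte falls, so a thread that starts in the middle of a record skips ahead to the next line break, and a thread whose last record crosses its chunk boundary reads that record to the end.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(file_size, level_of_parallelism):
    """Split [0, file_size) into contiguous byte ranges, one per thread."""
    step = max(1, -(-file_size // level_of_parallelism))  # ceiling division
    return [(start, min(start + step, file_size))
            for start in range(0, file_size, step)]


def read_chunk(path, start, end, delimiter=b","):
    """Parse the records whose first byte lies in [start, end)."""
    records = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume up to the next line break.
            # If the chunk starts exactly on a record boundary, only the
            # preceding newline is consumed and no record is lost.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # Deliberately minimal parser: no quoting, no validation.
            records.append(line.rstrip(b"\r\n").split(delimiter))
    return records


def parallel_read(path, level_of_parallelism=4):
    """Read a delimited flat file using one thread per virtual chunk."""
    ranges = chunk_ranges(os.path.getsize(path), level_of_parallelism)
    with ThreadPoolExecutor(max_workers=level_of_parallelism) as pool:
        parts = pool.map(lambda r: read_chunk(path, *r), ranges)
        return [record for part in parts for record in part]
```

For simplicity this sketch collects each chunk's records and concatenates the chunks in file order. A reader that emits records as soon as any thread has parsed them would interleave the chunks, so the original order would no longer be preserved. Note also that Python threads mainly overlap I/O here; the sketch shows the chunking logic rather than a realistic speedup.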

Although the new reader has a few limitations inherent in its design, its extreme speed in common use cases compensates for these drawbacks. If you are processing large amounts of data (hundreds of megabytes or more) and your transformation does not depend on data records being read in their original order, Parallel Reader might be just the right choice for you – why not give it a try?
