CloverETL's Blog

October 23, 2009

Parallel reader

Filed under: Using CloverETL | Martin Zatopek @ 11:36 am

In the October release of Clover, version 2.8.1, we introduced a new component that deserves your attention: the Parallel Reader. The name already suggests its goal, which is to improve reading speed by going parallel. In function, the component is very similar to Universal Data Reader: it reads delimited flat files such as CSV or tab-delimited files, and not much has changed there. The real difference is under the hood.

Two major optimizations allow Parallel Reader to deliver excellent performance, especially on server-class machines with fast modern disks or, better yet, disk arrays. The first is, of course, reading the file in parallel. The input file is divided into a set of virtual data chunks, which are fed to reading threads. The threads work at the same time, each parsing data records only from its own part of the file. The number of threads is set by the component parameter “Level Of Parallelism” and should reflect the hardware setup (e.g. the number of disks in a striped RAID) to get the most out of Parallel Reader. The second big performance gain comes simply from simplifying the data parser inside. The parser is as simple as possible, with limited validation, error handling, and functionality, but it is really, really fast.
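To make the chunking idea concrete, here is a minimal Python sketch. It is not CloverETL's implementation: the function names (`chunk_bounds`, `read_chunk`, `parallel_read`), the semicolon delimiter, and the newline-terminated records are all assumptions made for illustration. Each worker parses only the records whose first byte falls inside its own byte range.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(path, parts):
    """Split the file into up to `parts` byte ranges of roughly equal size."""
    size = os.path.getsize(path)
    step = max(1, size // parts)
    starts = list(range(0, size, step))[:parts]
    return [(s, starts[i + 1] if i + 1 < len(starts) else size)
            for i, s in enumerate(starts)]


def read_chunk(path, start, end, delimiter=b";"):
    """Parse the records whose first byte lies in [start, end).

    Assumption: records end with a newline and fields are split on
    `delimiter`; there is no quoting, validation or error handling.
    """
    records = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline. A record that
            # begins exactly at `start` is kept; a record that began earlier
            # belongs to the previous chunk and is skipped here.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            records.append(line.rstrip(b"\r\n").split(delimiter))
    return records


def parallel_read(path, level_of_parallelism=4):
    """Read all records using one worker thread per chunk."""
    bounds = chunk_bounds(path, level_of_parallelism)
    with ThreadPoolExecutor(max_workers=level_of_parallelism) as pool:
        parts = pool.map(lambda b: read_chunk(path, *b), bounds)
        return [record for part in parts for record in part]
```

This sketch collects the chunks back in file order, which is a convenience of the example only. In plain Python the threads mainly overlap disk reads rather than parsing, so treat it as an illustration of the chunk boundaries, not as a benchmark.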

Although the new reader has a few limitations that follow from its design, its extreme speed in common use cases compensates for these drawbacks. If you are processing large amounts of data (hundreds of megabytes or more) and your transformation does not depend on records being read in their original order, Parallel Reader might be just the right choice for you. Why not give it a try?

August 18, 2009

Hidden features: Mutable delimiter

Filed under: Using CloverETL | Petr Uher @ 9:38 am

CloverETL provides a very useful feature: the mutable delimiter. When you parse a delimited file (e.g. CSV), you can specify a different delimiter for each field. This is no surprise to daily CloverETL users, but it can be for users of other ETL tools. It is less well known that in CloverETL you can even define several delimiters for one field (a so-called “mutable delimiter”), and CloverETL chooses the right one. This opens up new ways of processing files with an irregular structure. I believe no other ETL tool on the market provides this functionality. If I am wrong, leave me a message in the comments. I’m always happy to find “hidden features” of other ETL tools.

Syntax of a mutable delimiter: the delimiters have to be separated by ‘\\|’. For example, to define that the field delimiter can be ‘;’ or ‘,’ or ‘#’, write ‘;\\|,\\|#’.
A simple example of using a mutable delimiter can be downloaded here as a zipped CloverETL project. Importing an existing CloverETL project into CloverETL Designer is described in the CloverETL documentation.
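For readers who want to see the effect outside of CloverETL, the following Python sketch mimics what a mutable delimiter does to a single line. The function name `split_mutable` is hypothetical, and the regular-expression approach is just one way to get the same result, not how CloverETL implements it.

```python
import re


def split_mutable(line, delimiters=(";", ",", "#")):
    """Split `line` wherever any one of the given delimiters occurs."""
    # Try longer delimiters first so that a multi-character delimiter is not
    # shadowed by a shorter one that happens to be its prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, line)


fields = split_mutable("john;smith,london#uk")
```

Here `fields` holds the four values `john`, `smith`, `london` and `uk`, even though each pair is separated by a different character.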

May 7, 2009

Parsing of an Apache access log

Filed under: Using CloverETL | Vaclav Matous @ 7:21 am

The UniversalDataReader is designed for reading files in various formats, and we use it for many purposes. One of them is parsing an Apache access log. The file normally contains records in the commonly used combined log format, e.g.:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Fields in the record are delimited by a space character. But a space can also appear inside some quoted fields, such as “GET /apache_pb.gif HTTP/1.0”, so a single space alone is not a suitable delimiter. Fortunately, CloverETL allows you to define variable delimiters in metadata, a different one for each field. Parsing the log therefore depends only on setting up the metadata properly on the output edge of the reader. In our case we defined the following delimiters: space, space, space + left square bracket, right square bracket + space + quotation mark, quotation mark + space, etc.
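The per-field delimiter idea can be sketched in a few lines of Python. This is only an illustration, not the CloverETL metadata itself: the field names are made up, and the delimiters after the first five are an assumed continuation of the “etc.” above.

```python
# (field name, delimiter that ends the field); all names are hypothetical.
FIELDS = [
    ("ip", " "),
    ("ident", " "),
    ("user", " ["),
    ("time", '] "'),
    ("request", '" '),
    # Assumed continuation of the delimiter list for the remaining fields:
    ("status", " "),
    ("bytes", ' "'),
    ("referer", '" "'),
    ("user_agent", '"'),
]


def parse_line(line):
    """Cut one combined-format log line into fields, delimiter by delimiter."""
    record, pos = {}, 0
    for name, delimiter in FIELDS:
        end = line.index(delimiter, pos)  # raises ValueError on a malformed line
        record[name] = line[pos:end]
        pos = end + len(delimiter)
    return record


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
record = parse_line(sample)
```

Because each field is cut at its own delimiter, the spaces inside the request and user agent fields are left alone.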

The complete example, which additionally computes the most visited pages and the most frequently visiting IP addresses, can be found in the Advanced Examples from release 2-7-0.
