Tag Archives: data parsing

CloverETL as a High-throughput XML Processor

XML is a markup language that has been around for some years now. Originally, it comes from the world of documents – used in web hypertext, word processors and other representations. Today, it is very popular in many areas, including the world of data exchange. The reasons are simple – the format is straightforward, well defined, and easily transferable accross platforms. XML can be easily read and modified by users in contrast to proprietary and binary formats. It also represents structured hierarchical data, which can be very difficult to express in plain CSV format. XML is self-descriptive, which heavily increases the user’s ability to understand data and eliminates the need of data format description and parsing instructions.

XML is often used to transport data between potentionally incompatible systems, resulting in a task to parse and store data of this format and eventually to process this data. CloverETL provides powerful tools to accomplish this task.

One of the components that provides XML parsing is XMLXPathReader. The user simply defines the mapping of each data element or attribute to a given CloverETL field. In the background of the component there is a DOM parser which allows the user to include general XPath expressions in the mapping definition.

In practice, users will often encounter vast XML files, which typically follow a standard structure. This structure contains records which represent a given entity (company, person, etc.) that can be repeated many times in a large XML data source. It is quite common that these sources of data come in sizes of 10s or even 100s of gigabytes. When this happens, DOM parsing is greatly inappropriate as all this data cannot be contained in memory. For this reason, another CloverETL XML parsing component becomes handy – XMLExtract. This handles records individually which are usually quite small, at least small enough to be processed in memory.

In XMLExtract, the user is able to define how each element can be mapped to a CloverETL record at every level of the XML structure . XMLExtract also provides the possibility of including a parent key at each structure level, thus allowing later complete reconstructions of the entire data structure. If the XML does not contain the unique key itself, it can also be easily generated using a CloverETL sequence object.

XML data and their basic integrity rules can be very well specified using XML Schema which today is a standard part of well defined data exchange. If you use XML Schema, CloverETL provides a very convenient visual drag&drop editor which helps the user build an XML mapping:

This screenshot represents an XML mapping which defines how XML and Clover fields are mapped. This mapping can also be displayed as text:

To provide an example where these methods were essential, CloverETL successfully completed a master data consolidation and matching project for an international insurance company. The XML Schemas were very complex, containing hundreds of different XML element types in its structure. The volume of data was over a hundred GBs describing tens of millions of customers as organizations and 4-5 million customers as persons. One of the many tasks assigned to CloverETL was to read and store the vast amount of data in XML in which it performed substantially greater due to XML’s fast sequential processing.

ParallelReader Component: Performance Boost in Data Processing

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

Hidden Features: Using Mutable Delimiter for Data Parsing

CloverETL provides a very useful feature: mutable delimiter. When you parse a delimited file (eg. CSV) you can specify different delimiter for each field. This isn’t surprising for daily CloverETL users however for users of other ETL tools it can be. It might not be very well known that in CloverETL you can even define more delimiters for one field (so called “mutable delimiter”) and CloverETL chooses the right one. It reveals new ways of file processing with irregular structure in CloverETL. I believe this functionality isn’t provided by any other ETL tool on the market. If I am wrong you can leave me a message in comments. I’m always happy to find “hidden features” of other ETL tools.

Syntax of a mutable delimiter: delimiters have to be separated by ‘\|‘. For example if you want to define that field delimiter can be ‘;‘ or ‘,‘ or ‘#‘ you have to write ‘;\|,\|#‘.
The simple example of using a mutable delimiter you can download here as a zipped CloverETL project. The import of existing CloverETL project to your CloverETL Designer is described in CloverETL documentation.

Parsing of an Apache access log

 

The UniversalDataReader is designed for reading files in various formats. We use this component for many purposes. One of them is parsing of an Apache access log. The file normally includes records in a commonly used combined log format, e.g.:

127.0.0.1 – frank [10/Oct/2000:13:55:36 -0700] “GET /apache_pb.gif HTTP/1.0″ 200 2326 “http://www.example.com/start.html” “Mozilla/4.08 [en] (Win98; I ;Nav)”

 

Fields in the record are delimited by a space mark. But a space can be included in some quoted fields, such as “GET /apache_pb.gif HTTP/1.0″, so a single space is an improper delimiter. Fortunately, CloverETL allows you to define variable delimiters in metadata. So parsing of the log depends only on a proper setting of metadata on an output edge from the reader. In our case we defined following delimiters: space, space, space+left square bracket, right square bracket+space+quotation mark, quotation mark+space etc.

 

The complete example with an additional computing of the most visited pages and the most visiting IP addresses can be found in Advanced Examples (AccessLogParsing.grf) included in CloverETL Designer or you can download all examples from SourceForge.