Tag Archives: data processing

Usage of Internal and External Graph Elements

GRAPH ELEMENTS

In addition to components, edges, notes, etc., any CloverETL graphs may contain the following graph elements:

  • Metadata
  • Connections
  • Lookup tables
  • Sequences
  • Parameters

Each of these graph elements may be either created and written in the graph XML source file, or specified in an external file and linked to the graph. Such elements are called internal, or external, respectively. In case of external elements, graph XML file only contains a link to such file with graph element definition.

Internal Graph Elements

Graph XML file can contain internal graph elements that may look like those in the following examples:

 <Metadata id="Metadata1"> <Record fieldDelimiter="|" name="LUT" recordDelimiter="rn" type="delimited"> <Field name="field1" type="integer"/> <Field name="field2" type="string"/> </Record> </Metadata> <Connection database="POSTGRE" dbURL="jdbc:postgresql://hostname/database" id="JDBC0" jdbcSpecific="POSTGRE" name="NewConnection" password="mypassword" type="JDBC" user="username"/> <LookupTable charset="ISO-8859-1" id="LookupTable0" initialSize="512" key="field1" keyDuplicates="true" metadata="Metadata1" name="simpleLookup0" type="simpleLookup"/> <Sequence cached="1" fileURL="${SEQ_DIR}/seq_withdata.txt" id="Sequence0" name="seq" start="1" step="1" type="SIMPLE_SEQUENCE"/> <Property id="GraphParameter1" name="NUMBER" value="6"/>

(In addition to such internal graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

The resulting Outline pane looks like this:

Internal Graph Elements

External Graph Elements

When you want to link an external graph element, you need to right-click any of the five categories (Metadata, Connections, Parameters, Sequences, or Lookups) in the Outline pane and select in the context menu:

  • New metadata → Link shared definition
  • Connections → Link DB connection, Link JMS connection, or Link QuickBase connection
  • Parameters → Link parameter file
  • Sequences → Link shared sequence
  • Lookup tables → Link shared lookup table

After that, the URL Dialog opens.

URL Dialog

The dialog consists of three tabs: Workspace view, Local files, and Remote files tabs. They serve for browsing local files within your workspace, outside your workspace, or files located on a remote computer, respectively.

Files with graph elements are located in the same project as the graph itself.

To browse your project, you need to use the Workspace view tab of the URL Dialog.

The workspace.prm file in the project itself contains the definition of parameters for the conn, meta, lookup, seq directories (among others). They are designed as “CONN_DIR”, “META_DIR”, “SEQ_DIR”, and “LOOKUP_DIR”, respectively. The value of each of these parameters is resolved to its value using a dollar and curly brackets. For example, “${META_DIR}”.

Once you link a file containing graph element definition located in the project, the “_DIR” termination of these names assures that the path to such files located in the conn, meta, lookup, or seq directories is automatically replaced with one of these parameters.

The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="${META_DIR}/LUT.fmt" id="Metadata1"/> <Connection dbConfig="${CONN_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/> <LookupTable id="LookupTable0" lookupConfig="${LOOKUP_DIR}/simpleLookup0.cfg"/> <Sequence id="Sequence0" seqConfig="${SEQ_DIR}/seq.cfg"> <attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr> </Sequence> <Property fileURL="parameters.prm" id="GraphParameter1"/>

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

The resulting Outline pane looks like this:

External Graph Elements

Files with graph elements are located outside the project containing the graph.

To locate external graph elements OUTSIDE the project, you need to use the Local files tab of the URL Dialog. It allows to browse the file system of your local computer.

You can also define a parameter for the whole path with the mentioned “_DIR” termination. If you have defined such parameter (e.g, “PATH_TO_GRAPH_ELEMENTS_DIR”), the whole path will automatically be replaced with this parameter (as ${PATH_TO_GRAPH_ELEMENTS_DIR}).

Remember that you need to define such parameter BEFORE you link a graph element in the specified location, otherwise, the whole path will not be replaced with the parameter name!

The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="D:/ExternalGraphElements/meta/HashJoinInput.fmt" id="Metadata0"/> <Metadata fileURL="${PATH_TO_GRAPH_ELEMENTS_DIR}/meta/LUT.fmt" id="Metadata1"/> <Connection dbConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/> <LookupTable id="LookupTable0" lookupConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/simpleLookup0.cfg"/> <Sequence id="Sequence0" seqConfig="${_TO_GRAPH_ELEMENTS_DIR}/seq.cfg"> <attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr> </Sequence> <Property fileURL="parameters.prm" id="GraphParameter1"/>

The last mentioned parameters.prm file defines two parameters:

NUMBER, whose value is 6

and

PATH_TO_GRAPH_ELEMENTS_DIR, whose value is D:/ExternalGraphElements.

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

Note:

Note that the value of the PATH_TO_GRAPH_ELEMENTS_DIR parameter is D:/ExternalGraphElements.

  • The first external metadata element (with ”Metadata0” id) was linked BEFORE the PATH_TO_GRAPH_ELEMENTS_DIR parameter was defined. The whole path was NOT converted into ${PATH_TO_GRAPH_ELEMENTS}/…
  • But the next elements (with “Metadata1” id, and all of the other elements) were automatically converted to the ${PATH_TO_GRAPH_ELEMENTS_DIR}/… AS SOON AS they were linked to the graph.

The resulting Outline pane looks like this:

External Outside Project

Files with graph elements are located on a remote computer.

To locate the files on remote computer, you need to use the Remote files tab of the URL Dialog. It allows to specify details of the remote computer file system.

To connect a remote computer, click the Create/Edit URL button at the right side from the Server URL combobox. In the Edit URL Dialog that opens, you need to specify all the authentication details:

Edit URL Dialog

The general structure of a remote path is:

<protocol>://<username>:<password>@<hostname|IP>:<portnumber>/<pathtoexternalelements>

Supported protocols are: http, https, ftp, ftps, sftp. The first two do not allow browsing the remote file system, whereas the other four allow it.

The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/meta/LUT.fmt" id="Metadata1"/>

<Connection dbConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/conn/NewConnection.cfg" id="JDBC0" type="JDBC"/>

<LookupTable id="LookupTable0"
lookupConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/lookup/simpleLookup0.cfg"/>

<Sequence id="Sequence0" seqConfig="://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/seq/seq.cfg">
<attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr>
</Sequence>

<Property fileURL="parameters.prm" id="GraphParameter1"/>

The last mentioned parameters.prm file defines two parameters:

NUMBER, whose value is 6

and

PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR, whose value is sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements.

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/> 

The resulting Outline pane without parameter usage looks like this:

Remote Graph Elements

Note:

Remember that the paths to linked external elements located on a remote computer do NOT use automatically the ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} value instead of sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements.

You may switch to the Source tab of your graph and replace the sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements with ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} by hand.

Thus, after replacing the paths in the Source tab of the Graph editor with the value of graph parameter, the Outline pane looks like this:

Remote Graph Elements With Parameter

COMPARISON OF INTERNAL AND EXTERNAL GRAPH ELEMENTS

All the graph elements, both the internal and the external, may be converted into the other form. Any internal element may become an external one, and vice versa.

For more details consult our documentation.

VARIOUS FORMATS OF GRAPH ELEMENTS MAY BE USED AT THE SAME TIME

Remember that you can use all forms of graph elements in a single graph: internal, external, located on local computer, and external elements, located remotely, accessible via various protocols.

How to Easily Enrich Data Using CloverETL’s Auto-filling feature

Users often need to retrieve data from a data source which does not contain this related data but is easily defined. Thus, it is important to be able to add further information to your source that is not already present in the file (e.g. time stamp, name of excel sheet). Such additional information can simplify further data processing.

For example:

  • Each file has a time of creation, its size, name and the path where it is located, and (in case of an XLS file) the name of the sheet that is read.
  • When a data source is being read, the reader starts to work at an explicit time; each record is also read at an explicit time.
  • Records can be numbered in the order in which they are read.
  • Information about errors may be available (in DBExecute and DBOutputTable).

All this information can easily enrich the read data in CloverETL by using the auto-filling functionality.

Auto-filling is a feature that is available on the metadata definition level. When you open the Metadata Editor and select any field of the metadata there is an Autofilling property (under the Advance tab). You can select one of the following values:

Name Date type Description
default_value any type When the null value is assign to the field and the field is marked as non-nullable, the null is replaced by the value of Default property of the field
global_row_count any numeric Sequence number of the record in a data source starting from 0. The number isn’t reset for each input file while the wildcards are used in File URL (${DATAIN_DIR}/input*.txt)
source_row_count any numeric Sequence number of the record in data source, starting from 0. The number is reset to 0 for each input file while the wildcards are used in File URL (${DATAIN_DIR}/input*.txt)
metadata_row_count any numeric Similar to global_row_count. But when the reader component supports more than one type of output metadata (XMLExtract, XMLXPathReader) each metadata has a separate counter.
metadata_source_row_count any numeric Similar to source_row_count. But when the reader component supports more than one type of output metadata (XMLExtract, XMLXPathReader) each metadata has a separate counter.
source_name string Name of data source. For file readers it’s fully qualified path (ex. /home/user/input.csv), for DataGenerator it’s ID of graph component, SQL query for DBInputTable, fully qualified class name for JMSReader
source_size any numeric Size of the file in bytes. 0 for non-file readers.
source_timestamp date Date and time of the creation of the file. Empty (null) in all non-file readers.
reader_timestamp date Date and time when reader starts read data
row_timestamp date Date and time when the reader starts read the record
sheet_name string Name of the sheet, only in XLSDataReader
ErrCode any numeric Error code returned by database engine, only in DBExecute and DBOutputTable
ErrText string Error message returned by database engine, only in DBExecute and DBOutputTable

Remember that any of these functions can be applied to a field not contained in a file, database table, generated data, or JMS message.

Use case

Imagine that you have an export of customers from a database in an Excel file. The Excel file is organized into many sheets; each sheet is named by a state abbreviation and contains only customers from one state. See figure bellow.

Now you want to merge all customers to one CSV file but for each customer you want to also store the state in separate column. It looks very easy :-) . For CloverETL it is, not necessarily so for other ETL tools.

The final graph is very simple, there are only two components.

The most important part of this graph is hidden in the definition of the metadata on the edge. We have to enrich the metadata, which we created from the Excel file by using the proper wizard in CloverETL Designer, with a new field “state” by setting the Autofilling property to sheet_name as you can see below. And that’s all.

The Autofilling field can be placed on any position in a metadata definition. It’s conveniant that we place the autofilling fields at the end of the metadata definition, after the field with the record delimiter. But this conveniance does not apply when the same metadata are also used for writing to a flat file (in our case). Thus we moved the autofilling fields to the position before the field with a record delimiter.

When you run the graph, you will get the following results:

You can use multiple autofilling fields with different functions at the same time to get more useful information from the file. For example, you can get the file name and the sheet names at the same time. Or you can get the number the records, etc. For example, when the following auto-filling functions are used (sheet_name, source_size, row_timestamp, global_row_count), the result will look like this:

Remember that only the edges connected to the output port(s) of the following reader components can use the auto-filling functionality:

  • UniversalDataReader
  • CloverDataReader
  • XLSDataReader
  • DBFDataReader
  • MultiLevelReader
  • XMLExtract
  • XMLXPathReader
  • DBInputTable
  • DataGenerator
  • JMSReader

DBOutputTable and DBExecute can use only two error auto-filling functions – ErrCode and ErrText.

CloverETL as a High-throughput XML Processor

XML is a markup language that has been around for some years now. Originally, it comes from the world of documents – used in web hypertext, word processors and other representations. Today, it is very popular in many areas, including the world of data exchange. The reasons are simple – the format is straightforward, well defined, and easily transferable accross platforms. XML can be easily read and modified by users in contrast to proprietary and binary formats. It also represents structured hierarchical data, which can be very difficult to express in plain CSV format. XML is self-descriptive, which heavily increases the user’s ability to understand data and eliminates the need of data format description and parsing instructions.

XML is often used to transport data between potentionally incompatible systems, resulting in a task to parse and store data of this format and eventually to process this data. CloverETL provides powerful tools to accomplish this task.

One of the components that provides XML parsing is XMLXPathReader. The user simply defines the mapping of each data element or attribute to a given CloverETL field. In the background of the component there is a DOM parser which allows the user to include general XPath expressions in the mapping definition.

In practice, users will often encounter vast XML files, which typically follow a standard structure. This structure contains records which represent a given entity (company, person, etc.) that can be repeated many times in a large XML data source. It is quite common that these sources of data come in sizes of 10s or even 100s of gigabytes. When this happens, DOM parsing is greatly inappropriate as all this data cannot be contained in memory. For this reason, another CloverETL XML parsing component becomes handy – XMLExtract. This handles records individually which are usually quite small, at least small enough to be processed in memory.

In XMLExtract, the user is able to define how each element can be mapped to a CloverETL record at every level of the XML structure . XMLExtract also provides the possibility of including a parent key at each structure level, thus allowing later complete reconstructions of the entire data structure. If the XML does not contain the unique key itself, it can also be easily generated using a CloverETL sequence object.

XML data and their basic integrity rules can be very well specified using XML Schema which today is a standard part of well defined data exchange. If you use XML Schema, CloverETL provides a very convenient visual drag&drop editor which helps the user build an XML mapping:

This screenshot represents an XML mapping which defines how XML and Clover fields are mapped. This mapping can also be displayed as text:

To provide an example where these methods were essential, CloverETL successfully completed a master data consolidation and matching project for an international insurance company. The XML Schemas were very complex, containing hundreds of different XML element types in its structure. The volume of data was over a hundred GBs describing tens of millions of customers as organizations and 4-5 million customers as persons. One of the many tasks assigned to CloverETL was to read and store the vast amount of data in XML in which it performed substantially greater due to XML’s fast sequential processing.

ParallelReader Component: Performance Boost in Data Processing

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

Processing Data from QuickBase

We have great news for users of on-line database QuickBase from Intuit . CloverETL became a next tool which can be used to manipulate the data in this database. Now you can work with data in QuickBase without restraints and with full power of CloverETL. We introduced several specialized components dedicated for this purpose.

QuickBaseQueryReader serves as a reader of records queried from a table. You only need to specify a table name which you want to read from and a query in QuickBase query language. It is that simple. The second reader, QuickBaseRecordReader, is even simpler. Again, you specify a table name and instead of the little bit confusing query, you just specify the particular record identifiers you require.

Data writing is provided also by a set of two components – QuickBaseRecordWriter and QuickBaseImportCSV. First one is a complex writer of incoming data records to a given table and second one is a bit simpler, however much faster bulk data loader to QuickBase.

So we have prepared the whole family of components fully supporting the data manipulation in QuickBase – do not hesitate and try it.

CloverETL for Data Processing of Sport Results

logo_maxibasketballCloverETL can be used not only in enterprise environment, but also in sport and entertainment industry. Prague hosted 10th FIMBA World Maxibasketball Championship in the first week of july. More than 160 teams from 31 countries took part in this popular event.

Match data (results and statistics) was transmitted in xls format, the most popular format for these purposes I guess. Although the excel files are user-friendly, for automatical processing are slightly inconvenient. CloverETL transformed xls data and stored them into a database to be used in more comfortable way from data engineer’s point of view.

CloverETL – more than just extract, transform & load

Can ETL be used for something useful – aside extract, transform & load ?

I played with an idea that common applications, which include data entry & data processing, can be split into simple, web based data entry application (optionally also data presentation) and business logic implemented as an ETL process. With CloverETL, the transformation piece can be also called as a WEB application – thank to Clover’s “launch services” which seamlessly convert any transformation into WEB service where data are passed-in and sent-out through HTTP protocol (POST request).

The whole idea is based on fact, that data processing is easily (visually) designed & debugged using ETL tool. Unfortunately, the data entry is usually out of scope for most ETL tools (are there any which support it ?)  therefore WEB based “glue” with data entry forms is the easiest way.

Other option is to use the modern curse of IT – Excel. Excel is popular data entry & data presentation tool – why not to exploit it ?

I wholeheartedly believe that using this approach, development time can be cut to half as opposed to standard way of coding it in PHP or as a J2EE application. Also further modifications can be done more quickly. Visual programming is here, finally !

PS: I am going to bid for a project using this approach. So if I convince the customer, in two months I should know what this idea stands for….