Tag Archives: data transformation

Launch Services – ETL Transformation as a WebService

Transformations on CloverETL Server can be run by users in a very simple way by using just a web browser and a correct link. When a user click on the link, a transformation is triggered on the Server and requested data are generated. These data can be in several different formats as Excel spreadsheet, CSV file, XML, or HTML. One can even place a form on the web page that serves as a source for parameters for the transformation.

In this blog post, I will demonstrate how Launch Services can be used from user point-of-view. The next part will show how Launch Services can be configured.

Each installation of CloverETL Server contains a sandbox named default that has many examples of Launch Services in. So you can immediately experiment with all the examples described below. If you do not have a licensed version of CloverETL Server, you can use an online server demo running at http://www.cloveretl.com/examples/server-demo

Example 1 – Glossary

There is a database with economic terms which is updated everyday on our online demo server. You can put a web form on a web page where user enters an unknown term and press the execute button. A request is sent to the server where CloverETL transformation is run, then matching and similar terms are returned and displayed as an HTML page.

Note: This example is using HTTP POST request. However, Launch Services can handle also GET requests. So you can place anywhere a normal URL that would link to an explanation of a specific term. Format of such link is: <server-url>/launch/<launch-service-name>?<paramenter-name>=<parameter-value>, e.g.: http://server-demo-ec2.cloveretl.com/clover/launch/glossary?term=trend

Example 2 – Mountains

There is a database with the highest mountains on Earth with their heights. User enters an elevation above sea-level and hits an enter key. Excel sheet is then displayed listing all mountains with the given minimal elevation.

Note: It is necessary to provide credentials to trigger Launch Services (using HTTP basic authentification). A transformation may depend on a logged user or a group, the user belongs to. It is thus possible to configure the service in such a way that it will run under one URL but returned data format will depend on a specific user (e.g. users will get results in an Excel sheet, but one specific user can receive results as CSV file).

Example 3 – Mountains Upload

Launch Service can accept also a file containing data as a parameter. These data can be inserted to the server database or can be used for a transformation run. As a variation to the previous example, it is possible to submit a file with a list of highest mountains that is used for a selection.

Example 4 – Everest Ascents

Though input fields are of text type, Launch Services can parse them, according to how they are configured, into other data types as integer, real, date, boolean etc. They can have specific formats and locale settings, even different for each user. If a user enters data in a wrong format, he is alerted and CloverETL transformation can then receive only correct, parsed data.

In our example, we have a list of mountain climbers that reached Mount Everest without using an oxygen mask. Launch Service enables you to select a climber by date of the first ascent, gender, number of successful ascents, and nationality.

Example 5 – Scripting

Up to now, we have been accessing Launch Services from a web browser. However, they can be easily used also from other systems. It is possible to access a given URL through any application supporting HTTP protocol as wget on Unix. You could write a script downloading all women that climbed Mount Everest and sending them to a given email address:

wget –user clover –password clover -q -O – http://localhost:8080/clover/launch/ascents-everest?sex=F | tail -n +2 | cut -f 1 | mail i@somwhe.re -s “Ladies on Everest”

Example 6 – Client Library

The last possibility how to call Launch Services is via a Java library cloverETL-server-client.zip. This package will come handy if you need to integrate Launch Service with your Java application or if you need to trigger Launch Service from a command line with more parameters than wget allows for.

If you need to run Launch Service called acents-everest running on a server http://localhost:8080/clover as a user clover and all results save to a folder acents-everest-result, you can use this command:
java -jar cloverETL-server-client.jar acents-everest -u http://localhost:8080/clover -l clover -p clover -o acents-everest-result

Read the part two: Launch Services – Configuration

Introducing Rollup

Rollup is a general transformation component introduced in CloverETL version 2.8. It serves as an executor of rollup transformations written in Java or CTL. Basically, rollup transformations are used to process groups of data records. Each group consisting of M data records may be used to output N different data records. The number of output data records is determined at runtime within the rollup transformation. If no group key is defined, all input data records belong to a single group.

Each group of data records may share a data record referred to as a group accumulator. It may be used to store intermediate results on group basis. A group accumulator is created and initialized when the first data record of a group is encountered and may be updated for each data record in this group (including the first and the last data record). When the last data record of a group is encountered, processing of this group is finished and the group accumulator is disposed. If the input data records are not sorted, each group is finished as soon as all input data records are read and processed.

Because Rollup is a general transformation component, it can be used instead of Aggregate, Dedup, Normalizer and Denormalizer components. Rollup does not require sorted input data records and therefore can save overhead required to sort them. So if you have millions of unsorted data records belonging to a few groups, Rollup is definitely the way to go. It is also possible to use Rollup to output intermediate results related to each input data record within a group.

Lifecycle of a Rollup Transformation

The lifecycle of a rollup transform is described in the org.jetel.component.rollup.RecordRollup Java interface. Here goes its nicer version:

  • The init(Properties, DataRecordMetadata, DataRecordMetadata, DataRecordMetadata[]) method is called to initialize the transform.
  • For each input data record as a current data record:
    • If the current data recordbelongs to a new group:
      • If requested, a group accumulator is created.
      • The initGroup(DataRecord, DataRecord) method is called to initialize processing of the group and to initialize the group accumulator (if it exists).
    • The updateGroup(DataRecord, DataRecord) method is called for the current data record and the corresponding group accumulator (if it was requested).
    • If the method returned true, the updateTransform(int, DataRecord, DataRecord, DataRecord[]) method is called repeatedly to generate output data record(s) until it returns RecordRollup.SKIP. If it returns value < RecordRollup.SKIP, the getMessage() method is called.
    • If the current data recordis the last one in its group:
      • The finishGroup(DataRecord, DataRecord) method is called to finish the group processing.
      • If the method returned true, the transform(int, DataRecord, DataRecord, DataRecord[]) method is called repeatedly to generate output data record(s) until it returns RecordRollup.SKIP. If it returns value < RecordRollup.SKIP, the getMessage() method is called.
      • If the group accumulator was requested, its contents is disposed.
  • The finished() method is called to notify that the rollup transform finished.
  • The reset() method may optionally be called to reset the transformation and so that the previous two steps may be executed again.

The lifecycle seems complicated but the main idea is quite simple – process each group and record within that group and notify the user about each relevant event. This way it is possible to implement virtually any transformation with just a single component.

Example of Rollup Transformation

Let’s have a very simple rollup transformation graph:

The DataGenerator component is used as a source of sample data records with an integer key and a decimal value. The Rollup component is used to count the number of occurrences within each group, the sum of all decimal values within a group, and the average value within a group. The Trash component serves as /dev/null.

The rollup transform itself is not complicated either:

function void initGroup(output groupAccumulator) {
// default values are provided here
groupAccumulator.key = $0.key;
groupAccumulator.occurrences = 0;
groupAccumulator.sum = 0;
}

function boolean updateGroup(output groupAccumulator) {
groupAccumulator.occurrences++;
groupAccumulator.sum = groupAccumulator.sum + $0.value;
return false; // no intermediate output records will be generated
}

function boolean finishGroup(output groupAccumulator) {
groupAccumulator.average = groupAccumulator.sum / groupAccumulator.occurrences;
return true; // group output records will be generated
}

function integer updateTransform(integer counter, output groupAccumulator) {
raiseError("This function will not be called at all!");
}

function integer transform(integer counter, output groupAccumulator) {
// only a single output data record will be generated
if (counter > 0) {
return SKIP;
}
$0.* = groupAccumulator.*;
return ALL;
}

The above five functions are required functions that must be implemented for the rollup transform to work correctly. All the other (commented out) functions presented in the CTL template are optional. As you might see, the possibilities are endless.

Do you use the Rollup component in your project? Please let us know what do you use it for or what attributes/functionality you miss.

Sorting Data: ExtSort vs. FastSort – which one is better for me? (Part 2)

In my  previous post I have focused on tips for tweaking the FastSort component – performance sort component available in all commercial CloverETL editions. Today, I would like to touch the original ExtSort component which has been in CloverETL for a while and is available in both commercial and free (Community, opensource engine) editions.

While FastSort is optimized to deliver stunning performance in resource-rich environment, ExtSort is a modest hard worker used to working with limited resources. This proves to be useful especially in tranformations with lots of parallel branches with sorting.

Again I won’t discuss the case where there are only a few records and all sorting can be done in memory. For external sorting, ExtSort organizes its temp data in sorted chunks and places them onto a fixed number of tapes – each tape being a single temporary file. Since ExtSort is single-threaded, multiple ExtSort-s running at once do not consume that much resources as FastSort does.

With ExtSort, there isn’t that much magic as with tweaking FastSort. First what you can do is specify a number of temporary directories where tapes will be stored. This is useful when you’re able to harness multiple physical drives under multiple mount points, otherwise just make sure you have enough free space on the drive.

The only two attributes you can fiddle with is Buffer capacity and Number of tapes. There’s also the Sorter initial capacity but this will be deprecated soon and nowadays acts exactly the same as Buffer capacity. Buffer capacity determines the size of each chunk and thus needs to fit into memory. If you’re familiar with FastSort, Run size is the equivalent parameter there. Increasing Buffer capacity is generally a good idea for boosting ExtSort’s performance, but expect increased memory requirements, although still far from those of FastSort.

Number of tapes defaults to 6 and generally will yield best results, however again, increasing the number slightly does not take too much resources and can help with performance.

Compared to FastSort, ExtSort is pretty easy to configure and will work almost under any conditions. However, there’s always a catch – with ExtSort it is the rather inferior performance compared to its stronger brother.

Sorting Data – ExtSort vs. FastSort – which one is better for me?

I often get asked why CloverETL offers two sort components instead of just one and what’s the right key for determining which one is better for a particular purpose.

The reason for having two sort components in CloverETL is simply to keep things as easy as possible. Since the inner natures of ExtSort and FastSort are quite different it would be really difficult to implement a nice and clean universal one.

Luckily, the decision is simple and straightforward. In case you can dedicate enough system resources (CPU cores and/or memory) for the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want a more conservative behavior, pick ExtSort which will give you steady performance at minimum system requirements.

FastSort is a very powerful tool, but to truly witness its power, users must set it up correctly to use their hardware’s maximum potential. We will now dive into the settings behind this impressive component and learn how to max out it’s ability while being careful to avoid crashes.

Tweaking FastSort

FastSort is greedy for both memory and CPU cores and in case the system does not have enough of either, FastSort can quite easily crash with out-of-memory, especially if the records you’re going to sort are big (long string fields, tens or hundreds of fields, etc.).

Parallelism

Unlike ExtSort, FastSort can utilize potentially unlimited number of CPU cores to do its job. You can control how many worker threads are used by overriding default value for “Concurrency (threads)”. My experience shows however, that unless you’re able to use really fast disk drives, going for more than 2 threads does not necessarily help and can even slow the process back down a bit. So basically you don’t need to worry about parallelism at all unless you have the hardware to take advantage of it. Remember, that parallelism adds extra memory load for each additional thread!

Memory

FastSort can be a bit tricky with memory, since there are multiple settings which affect it. The most important is the “Run size (records)” which denotes the size of the data chunk being sorted at a time. Note, that actual record size and level of parallelism increase the overall memory consumption – so be careful with this setting. The default is 20k records, if you set the “Estimated record count” – which is your rough guess on total count of records to be sorted, the Run size is computed for you based on a experimentally derived formula. This formula tries to get the right “Run size” based on number of records and amount of available memory (which you can limit with “Maximum memory” – defaults to unlimited). This “computed guess” works in most cases, but can fail under certain conditions. You need to test and tweak on your data a bit to get the best result. Run size is definitely a parameter worth playing with!

Be sure to have enough memory dedicated to your JVM – with large, numerous records. You want to give FastSort plenty of free memory – going for 512 MB up to 2 gigs is worth it! (e.g. –Xmx1536m) With a lot of memory, FastSort will do an amazing job. However with default 64 MB heap space setting, FastSort can crash.

‘In memory only sorting’ is an option you can use in case you’re sure that all data will fit into your memory – you can either force it (and then possibly crash due to out-of-memory) or leave it to default auto. Auto means that at first, FastSort tries to sort the data in memory and if that fails, on disk sorting is used instead.

Other limits and valuable parameters

Apart from memory settings, you can impose more limits on FastSort to reflect your needs. For example, if your system works with disk quotas which limit the number of open files, you can cap temp files of FastSort with “Max open files”. Note that FastSort uses LOTS of files – hundreds, thousands. If you cap it too much (500 or less) FastSort will continue to work, but  its performance will decrease significantly. So should you need to limit the number of open files, consider switching to ExtSort.

Settings you can forget

There are other advanced options for FastSort, but you can leave them to default values unless you are really trying to optimize your sort. Number of read buffers defines how many chunks of data will be held in memory at a time – which must be at least the number of Concurrency – otherwise some of the workers wouldn’t have data to work on. Using too large a number, you’ll end up with out-of-memory – the default is based on current concurrency setting and is just fine.

Average record size is nothing else than a helper guess on average byte size of records in the data – if not set, FastSort computes this automatically from the real data so it’s usually more precise than setting an explicit value.

Tape buffer is a buffer used for each worker for filling the output and slightly affects performance, but the default is fine in almost all cases.

The last two options control how temp files are created, they can be either compressed (defaults to false) and you can even control the charset of string fields (default UTF16). Both are there for space saving purposes (space occupied by temp files during graph execution) and decrease performance.

The Decision

FatSort is very powerful sorting component and can significantly speed up your transformation process. But it has to be set up correctly. So, if you are not sure and you want the always safe and simple sort, go with ExtSort. On the other hand, if you know your hardware and want to utilize it to optimize your sort for speed, dive into FastSort and explore it a bit. The results can be extraordinary.

To be continued … (Part 2 will discuss ExtSort component)

Sending E-mails from Data Transformation (Part 2 – Attachements)

In my previous post I talked about using the EmailSender component featured in CloverETL 2.8 and later to send messages from inside a running transformation graph. EmailSender is used in cases when you need to compose a message from the data that you process in your graph. For example, bulk mailing computed reports, reporting faulty data to administrators, etc. Read my previous post on EmailSender to learn the basics.

In this post, I will focus just on composing messages with attachments using CloverETL’s EmailSender component.

You can add as many attachments to your message as you want. Any attachment can be, as everything in EmailSender, either taken from your input data record or passed as a parameter or static text.

EmailSender distinguishes between two kinds of attachments – files (on the computer where the transformation runs) and data attachments.

File attachments work as one would expect – the file is read from your local computer and attached to the message. In CloverETL you can choose how you specify the file name – either pick a static path or you pass it in the input data record.

Data attachments are composed directly from data coming to the input of EmailSender. This way you can assemble for example XML file attachments, texts and even binary attachments. Obviously, in such case you need to specify additional information so that EmailSender knows how to send your data – specifically its mime type and the attachment name.

To start with attachments, open the EmailSender edit dialog and go to the “Attachments” attribute editor.

The easiest way to add an attachment is to use the “->” button or drag&drop from Fields to Attachments.

This way you create a file attachment whose path is passed from an input data record (in field “attchFile” in the example above). Please notice the description column which tells you how the attachment will be handled.

You can also add a new attachment using the “+” button and then editing the newly created item with the “…” editor button (you may find this a bit confusing at first – just hit the the plus “+” button and a new line in the Attachments table appears. Then click in the first column of the new row and click the three-dot edit button “…” – and bingo, you’re there!).

In the dialog, you can either browse for a local file or pick a field to use as file name. Notice the screenshot – I can even use the field as just a part of some predefined path. This allows me to compose the attachment path from static path, field value or even parameters (the full example might look like “${PROJECT}attachments$attchFile”).

Data attachments can be defined with the last option – Attachment data from record. Simply pick the field where your attachment data will be, then specify the attachment name and mime type – both can again be either taken from a field or static. See the following screenshot to get the hang of it.

That’s it! You can now send attachments with CloverETL’s EmailSender. Just always remember, that you can set any value to either constant, parameter or field value using the $field notation. With this in mind,  you can set up pretty much everything you’ll ever need.

Loop Execution of Data Transformation

Case study description

Czech Insolvency Registry (http://isir.justice.cz) basically contains data about economic subjects that entered insolvency and have financial difficulties with paying off their debts. The registry allows everybody to download data using public SOAP Web Service. It can be done manually or automatically with the right software.
https://isir.justice.cz:8443/isir_ws/services/IsirPub001?wsdl

CloverETL can easily help with the automatically download that would save time and technical difficulties. CloverETL graph can get required data by calling the web service, processes data and store it in required format. Unfortunately the Registry’s web service is very poorly designed. The service doesn’t give you current status of each of the economic subjects, but provides the whole history of the required company. Therefore we have to download not only the current information we need but the whole information since the year 2008 (the registry foundation). That is a lot of data to process – actually thousands of log records for each company! Moreover the Registry’s Web service „GetIsirPub0012“ provides only maximum of 1000 records per one call. If one company has few thousands of records you have to undertake more calls.  So we have to download data in thousand-records bunches, but we don’t know in advance exactly how many of these bunches (records) there are for each company. That makes the whole process quiet difficult.

But solution with CloverETL is simple. CloverETL Server provides features “graph event listener” and “groovy task” that help us with all the above described challenges. Firstly, we will of course design a CloverETL graph that processes for the beginning just one thousand –record bunch of data (see picture bellow).

Transformation graph

This graph has a parameter „startID“ which has value “0” by default. If we want to process 1000 records starting let’s say from no. 2541, then  start ID will be startID=2541, and the first downloaded record will be identified by. If we run graph without parameters, it’ll download and process first thousand of records (no. 0 – 999).

Graph also contains couple of components to store ID of last downloaded record so that  the next bunch to download may use the last ID  as the startID. It will be automatically stored to graph parameters as lasted parameter. It can be done by in-line Java code in Reformat component:
String id = GetVal.getString(source[0],"id");
getGraph().getGraphProperties().setProperty("lastID", id );

The loop

The graph we designed must be executed n-times to download and process all records. At first we don’t know how many times, but we know, that we can stop the downloading process as soon as there are no more records to read. It means we can stop the process as soon as  “started” and “lasted” are equal.

How to achieve such loop?

Graph event listener

To achieve the automatic loop, for the graph that we designed and described previously, we’ll define graph event listener for “FINISHED_OK” graph event on CloverETL Server. So every time transformation finishes without error („FINISHED_OK“), listener will trigger task that we selected. We need to specify this tasks now. Since we want to  execute the same graph repeatedly, we have to specify “execute graph” task. This task will repeat executing the graph indefinitely. However we need to stop this loop at some point. We need to “break the loop” when the startID and the lastID parameters are equal. Therefore it is actually better to create “groovy task” instead of „execute graph“.

Groovy task

Groovy is scripting language with Java syntax. In addition, groovy scripts may access java objects and use java libraries. See Groovy project site http://groovy.codehaus.org/ for more details.

We’ll create a simple groovy script which decides whether execute the graph again or not. To decide it, we’ll need to get graph properties from the finished graph. These properties are accessible by calling method event.getProperties().

Then, we’ll need to execute graph using CloverETL Server Java API. It’s done by calling method serverFacade.executeGraph().

Script may return String value which is stored in „Task history log“.

// these variables are predefined:
// sessionToken
// event
// serverFacade

import com.cloveretl.server.persistent.RunRecord;
import org.apache.log4j.Logger;
import com.cloveretl.server.api.*;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

Logger log = Logger.getLogger("groovy-ISIR-graphEventListener");

Properties eventProps = event.getProperties();
log.info("event properties: " + eventProps);

// get lastID and startID from previous graph execution
String lastIDString = eventProps.getProperty("lastID");
String startIDString = eventProps.getProperty("startID");
long lastID = Long.valueOf(lastIDString);
long startID = Long.valueOf(startIDString);

// lastID and startID from last graph execution are equal – break the loop
if (lastID == startID)
return "no more records to download";

// prepare startID which will be passed to next graph execution
Properties properties = new Properties();
properties.setProperty("startID", lastIDString);

String SANDBOX = eventProps.getProperty("SANDBOX_CODE");
String GRAPH = eventProps.getProperty("GRAPH_FILE");
GraphExecutionCommand graphExecutionCommand = new GraphExecutionCommand(
null, SANDBOX, GRAPH, null, null, null, true, properties, null, null);
Response respExec = serverFacade.executeGraph(sessionToken, graphExecutionCommand);
String result = "graph "+SANDBOX+"/"+GRAPH+" executed: "+respExec.getBean();
log.info(result);
return result;

Graph results

All graph results for each bunch of data are stored to only one CSV file. They are always added, so don’t worry there is no danger that some of them will be overwritten :-) . So when the whole batch of transformations is finished, we have only one CSV file with all processed records. Or if somebody wishes we can consolidate records and store them directly into database where the data can be stored in more friendly and usable format.

Sending E-mails from Data Transformation (Part 1)

If you ever have a situation where you need to send an email from your data transformation, CloverETL gives you several options to do it. If you need to monitor your transformation’s health and status, you would be better off with CloverETL Server, which offers plenty of monitoring and reporting features. These can be hooked to an “e-mail” action which sends you e-mail alerts based on predefined rules. This is very useful feature in an enterprise environment. We will hopefully cover these options in some later posts.

Today, I would like to focus on simpler, yet powerful way of sending e-mail messages from CloverETL – sending them directly from the transformation graph. You use a dedicated component – the EmailSender – to send e-mail messages from inside a graph.

The component falls into writers category which suggests that it can take data on the input and output it somewhere. In this case, it puts it into an e-mail message. You can map your input onto any aspect of the message – from “To”, “Cc”, “Subject”, etc. to the message body or as an attachment(s). EmailSender reads input records and for each one it composes and sends out a single e-mail message.

A small, but quite nice feature is that you don’t need to map all message properties to your input. You need to prepare only such parts that vary with each record and leave the rest to defaults or set up parameters for it. For example, if you want to compose your message body from data input records but want the message recipient to be fixed, you can use a parameter for “To” and keep it separate from data records.

Basic setup of EmailSender requires setting basic parameters for SMTP – host, and username and password if required. You can also setup SSL and TLS if the server supports it.

Please note, that it is a common CloverETL approach to use parameters to pass sensitive data – password in this case (see screenshot).

Let’s now take a peek at setting the actual message composition and the mapping onto input – the Message attribute. It opens an editor which displays message headers and your input.

The mapping is actually extremely simple – you use fields from your input as variables to fill in the values. Each field is used as $field (see use of $text in MessageBody). Also note the Alternative column – for records where the “text” field in the example has an empty value, this alternative value will be used (can be constant or defined in a parameter).

You can even combine several fields into one to form (any) text – not only in message body, but also “Subject”, “To”, etc.

If you need a more advanced approach, you can use the “+” button to add additional keys and values. They are passed as-is to your e-mail message as extra headers (e.g. for “X-Mailer: CloverETL” you set key Name to “X-Mailer” and set its Value to “CloverETL”).

That’s all for now, hope you will find many cases where you find this component to be a useful tool.

Also, wait for my next post where I explain how you can use EmailSender to send messages with attachments!

 

Data Profiling with CloverETL

Before you start to develop any data transformation you should explore your data (make data profiling). There are a lot of tools on the market that can help you. But why to install and learn another software when you can use the tool you are familiar with? CloverETL is mainly data transformation tool but it can be easily used for data profiling as well (as I will show you in this blog post).

It is very easy to do data statistic with latest version of CloverETL. You can find DataProfiling project in CloverETL Examples Projects. The project consists of two graphs: BasicStatistic and AdvancedStatistic.

The first one finds basic statistic for input data file:

  • minimum value for numeric fields or minimum length of data for string and byte fields
  • maximum value for numeric fields or maximum length of data for string and byte fields
  • average value for numeric fields or average length of data for string and byte fields
  • number of records in data file
  • number of not null values for each data field
  • number of null values for each data field

Additionally, for string data fields, it finds:

  • first not null value
  • if all values are ASCII

The second one calculates for each data field:

  • number of records in data file
  • number of not null values
  • number of unique values
  • minimum value
  • maximum value
  • average value for numeric fields
  • median value
  • modus value

It also finds frequency counts for fields with not many (the threshold is defined by a parameter HISTOGRAM_THRESHOLD) unique values.

BasicStatistic graph

The graphs in the project are prepared to analyze data from the excel file ORDERS.xls placed in data-in directory. But for purpose of this post we will analyze data stored in a flat file employees.list.dat (also placed in data-in directory).

To do that we need to set following parameters:

input_file=${DATAIN_DIR}/employees.list.dat

metadata=${META_DIR}/employees.fmt

READER_TYPE=DATA_READER

Metadata file (employees.fmt) has to contain metadata for employees.list.dat:

<?xml version="1.0" encoding="UTF-8"?>
<Record name="EMPLOYEE" recordDelimiter="n" recordSize="-1" type="delimited">
<Field delimiter="," format="#" name="EMP_NO" nullable="true" shift="0" type="integer"/>
<Field delimiter="," name="FIRST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="LAST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="PHONE_EXT" nullable="true" shift="0" type="string"/>
<Field delimiter="," format="dd/MM/yyyy" name="HIRE_DATE" nullable="true" shift="0" type="date"/>
<Field delimiter="," name="DEPT_NO" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_CODE" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_GRADE" nullable="true" shift="0" type="numeric"/>
<Field delimiter="," name="JOB_COUNTRY" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="SALARY" nullable="true" shift="0" type="numeric"/>
<Field name="FULL_NAME" nullable="true" shift="0" type="string"/>
</Record>

Lets see the graph with mid-results and output:

DataReader parses data from the input file. Normalizer creates record that consists of three basic fields: original field name, field type and “normalized” value, which is original value for numeric fields,  time in milliseconds for date fields and data length for string or byte fields. Moreover this component collects basic information about string data e.g.: finds first not null value, checks if string contains only ASCII characters and if it can be converted to number. The next component (Rollup – Statistic) calculates minimum, maximum and average value for each group of records (with the same field name). It also propagates first not null value, checks if all isAscii and isNumber fields are not false and sets result value for the whole group. Fourth and fifth component has only “cosmetic” aim – they convert times in milliseconds back to user friendly form and sort output records. The writer converts results to report.

By default the final report is a plain html file:

Data statistic for data-in/employees.list.dat

Field name Field type min max average count count not null count null first not null is Ascii is number
DEPT_NO string 3.0 3.0 3.0 51 43 8 600 true true
EMP_NO integer 0.0 145.0 58.392156862745104 51 51 0 true true
FIRST_NAME string 3.0 11.0 5.549019607843137 51 51 0 Robert true false
FULL_NAME string 10.0 22.0 14.549019607843137 51 51 0 Nelson, Robert true false
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 0
JOB_CODE string 2.0 24.0 6.1568627450980395 51 51 0 VP true false
JOB_COUNTRY string 2.0 11.0 3.6470588235294117 51 51 0 USA true false
JOB_GRADE number 0.0 5.0 3.4117647058823524 51 51 0 true true
LAST_NAME string 3.0 12.0 6.8431372549019605 51 51 0 Nelson true false
PHONE_EXT string 1.0 5.0 2.8627450980392157 51 51 0 250 true false
SALARY number 0.0 9.9E7 2267125.137254902 51 51 0 true true

but it can be easily changed to excel file (just adjust graph parameter WRITER_TYPE=XLS_WRITER).

AdvancedStatistic graph

The phase 0 of AdvancedStatistic graph works similarly as graph BasicStatistic, but it uses Aggregators for statistic calculations instead of Rollup. Be particular about component called Simplification in phase 0 of the graph: it stores number of records in file and names of fields with number of unique values under threshold in dictionary (marked by red ellipses on the picture above). Then Histogram filter component can read this field’s names and skip the records that aren’t between fields for frequency calculations (green eclipses on the picture). Phase 1 Aggregators count frequencies for fields that were filtered out by previous component.

Resulting file looks as follows:

Advanced data statistic and histograms for ./data-in/employees.list.dat

Statistics

Field name Field type min max average number count count not null count unique median modus
DEPT_NO string 000 900 51 43 20 600 623
EMP_NO integer 0.0 145.0 58.3921568627451 51 51 47 45.0 2.0
FIRST_NAME string Andrew Yuki 51 51 44 Mark Robert
FULL_NAME string Baldwin, Janet Young, Katherine 51 51 50 Lee, Terri Sutherland, Claudia
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 44 20/04/1992 00:00:00 02/01/1994 00:00:00
JOB_CODE string Admin Vice President 51 51 16 Mktg Eng
JOB_COUNTRY string Canada USA 51 51 9 USA USA
JOB_GRADE number 0.0 5.0 3.411764705882353 51 51 6 4.0 4.0
LAST_NAME string Baldwin Young 51 51 48 Lambert Johnson
PHONE_EXT string 1 null 51 51 46 3355 null
SALARY number 0.0 9.9E7 2267125.137254902 51 51 40 61637.8125 0.0

Histograms

Field name Field type value count count %
DEPT_NO string 8 ###############·····················································································
DEPT_NO string 000 2 ###·································································································
DEPT_NO string 100 2 ###·································································································
DEPT_NO string 110 2 ###·································································································
DEPT_NO string 115 2 ###·································································································
DEPT_NO string 120 3 #####·······························································································
DEPT_NO string 121 1 #···································································································
DEPT_NO string 123 1 #···································································································
DEPT_NO string 125 1 #···································································································
DEPT_NO string 130 2 ###·································································································
DEPT_NO string 140 2 ###·································································································
DEPT_NO string 180 2 ###·································································································
DEPT_NO string 600 2 ###·································································································
DEPT_NO string 621 4 #######·····························································································
DEPT_NO string 622 3 #####·······························································································
DEPT_NO string 623 5 #########···························································································
DEPT_NO string 670 2 ###·································································································
DEPT_NO string 671 3 #####·······························································································
DEPT_NO string 672 2 ###·································································································
DEPT_NO string 900 2 ###·································································································
JOB_CODE string Admin 4 #######·····························································································
JOB_CODE string CEO 1 #···································································································
JOB_CODE string CFO 1 #···································································································
JOB_CODE string Dir 1 #···································································································
JOB_CODE string Doc 1 #···································································································
JOB_CODE string Eng 15 #############################·······································································
JOB_CODE string Finan 1 #···································································································
JOB_CODE string Inside Sales Coordinator 1 #···································································································
JOB_CODE string Mktg 1 #···································································································
JOB_CODE string Mngr 4 #######·····························································································
JOB_CODE string PRel 1 #···································································································
JOB_CODE string SRep 9 #################···················································································
JOB_CODE string Sales 2 ###·································································································
JOB_CODE string Sales Representative 6 ###########·························································································
JOB_CODE string VP 2 ###·································································································
JOB_CODE string Vice President 1 #···································································································
JOB_COUNTRY string Canada 2 ###·································································································
JOB_COUNTRY string England 3 #####·······························································································
JOB_COUNTRY string France 1 #···································································································
JOB_COUNTRY string Italy 1 #···································································································
JOB_COUNTRY string Japan 2 ###·································································································
JOB_COUNTRY string Sales 1 #···································································································
JOB_COUNTRY string Switzerland 1 #···································································································
JOB_COUNTRY string UK 4 #######·····························································································
JOB_COUNTRY string USA 36 ######################################################################······························
JOB_GRADE number 0.0 1 #···································································································
JOB_GRADE number 1.0 2 ###·································································································
JOB_GRADE number 2.0 8 ###############·····················································································
JOB_GRADE number 3.0 14 ###########################·········································································
JOB_GRADE number 4.0 16 ###############################·····································································
JOB_GRADE number 5.0 10 ###################·················································································

Joining Data with RelationalJoin

CloverETL version 2.8 offered a brand new commercial  component called RelationalJoin. It extends the CloverETL pallete of joiner components with new functionality – joining of records with relational operators different from equal (=). This component has two input ports for master and slave data records, and a single output port for joined data records. Master and slave data records are joined and sent to the output port if they are in a specific relation to each other. In other words, it’s just another joiner that uses a relational operator to drive the joining process.

Relations Between Data Records

The relation between two data records is specified by any of the following relational operators: !=, <, <=, >, >=. For example the < operator means that each master data record is less than all the slave data records that it was joined with. If you look at it from the slave’s perspective, all slave data records joined with a certain master data records are greater in this case. It’s valid in both ways. ;-)

In order to make the joining process as effective as possible, input data records coming through both input ports have to be sorted appropriately. Except for the != operator that doesn’t require any sorting at all. However, if you choose the < or <= operator, you need to sort both streams of input data records in the descending order. In case of the > and >= operators, you need to make sure the data records are sorted in the ascending order. If the sort order is invalid, execution of the component fails.

When processing large data sets, be aware that the master data records are processed one by one while the slave data records need to be buffered. In the worst case, all the slave data records need to be buffered and thus their number should be as small as possible.

Practical Example

They say that an example is worth a thousand words so let’s create a simple one. Imagine that you’ve got a set of several distinct numbers. If for some reason you need to pair each of them with all numbers from the same set that are greater, it is a pretty simple task for RelationalJoin! :-) See the example graph below.

Example graph with RelationJoin.

We read the numbers from a flat file, sort them appropriately, send them as two independent data streams to RelationalJoin which produces the desired output, and finally write them to another flat file. Pretty simple, don’t you think? ;-)

In order to configure RelationalJoin, we need to specify a transformation, join key and join relation. Setting the first two is simple, you have done that a hundred times. In case of join relation, it is simple as well, just select “master(D) < slave(D)” from the combo box. The letter D in the round brackets denotes that we need to sort both streams of data records in the descending order. Thus we need to configure ExtSort in this way.

That’s all we had to do, just run the graph and see the desired results!

Iteration through the Record Fields in CTL

Note: this post relates to CloverETL 2.x version with CTL version 1. Starting with CloverETL 3.x , CTL version 2 was introduced which has a bit different syntax. However you can still use CTL 1 in transformations – see CTL1 vs CTL2 comparison.

Recently, I have been facing a very common problem. Imagine this scenario: I have two files – the first one with origin records and the second one with slightly modified new records. Each record had a unique key and aproximately 50 fields. My task was to compare these two files and find out how many fields in every pair differ from each other in the corresponding records.

The simplified graph can be seen in following picture:

A comparison of two records can be processed by the CloverETL DataIntersection component which joins the records with the same keys. In the joined records, you can compare fields that are not part of the key. But remember, you have to write the comparison in the CTL transformation ;-) . Of course you could write the following block of code for each pair of compared fields:

int count = 0;
if(nvl($0.field_N, '') != nvl($1.field_N, '')) {
count++;
//but typically more actions :-(
}
//and imagine this block 50times :-(
//final mapping
$0.key := $0.key;
$0.count := count;

But this solution takes too much time when you have to repeat it for many fields (approx. 50 in my case). It is also very slow, uncomfortable and increases the probability of making a mistake in your code (e.g. omitting some fields). Fortunately, CloverETL allows you to iterate through the fields of processed records! :-) The code is then more briefer and more generic:

//declaration of variables for copies of input records
record(Metadata1) myrec1;
record(Metadata1) myrec2;
function transform() {
int i = 0;
int count = 0;
//asign value of input records to local variables
myrec1 = @0; //myrec1 is a copy of a current record on input port 0
myrec2 = @1; //myrec2 is a copy of a current record on input port 1
//iterate through fields, suppose that field with index 0 is the key
for(i = 1; i < length(myrec1); i++) {
if(nvl(myrec1[i], '') != nvl(myrec2[i], '')) {
count++;
}
}
//final mapping
$0.key := $0.key;
$0.count := count;
}

Someone could object to the necessity of making copies of records. In this case I have good news. CTL in CloverETL version 2.9 introduces the possibility of iterating directly through the fields of input records.

@0[i];//i-th field of the input record on port 0

Moreover new functions for getting field names and data types are introduced in version 2.9. Personally, I am looking forward to such features that will make CTL code simplier and clearer.