Author Archives: Miroslav Stys

CloverETL Jobflows – Build, Monitor, and Manage Complex Workflows

Lately, I have been confronted with a data integration task of extracting customer data from an Oracle database to an XML file, uploading it to FTP, and finally encrypting that file. If you use a single CloverETL graph to do this, the overall logic might get lost in complexity of the graph. You will need many components, execute them in different phases and, most importantly, your solution will be vulnerable to all kinds of errors – causing the graph to fail for not-so-obvious reasons.

Now what would a mighty tool for such complex tasks be? The answer is the new jobflow component in CloverETL 3.3.

Jobflows is a high-level approach to managing ETL processes. It allows you to combine ETL with other activities to form complex business processes – providing orchestration, conditional job execution, and error handling. Actions participating in jobflow include scripts, executables, web services (REST/SOAP) and common file operations (both local and remote).

Jobflow example

First of all, let’s take a look at the picture below. It shows the principal logic of the example I will be discussing in this article.

(yellow blocks – jobflow *.jbf, green blocks – ETL graphs *.grf)

Step One – Extract from DB

Let us say that the first step in our complex set of tasks will be responsible for extracting rows from the Oracle DB, converting the data to XML and uploading it to an FTP. It will do nothing more and nothing less, and after doing so, it will report its results to the next step in the jobflow pipeline (Did the job fail? How long did it run? When was it started? What was the error message?). This step will be performed by an underlying ETL graph, whose execution is managed by ExecuteGraph component.

If you are scratching your head to figure out how jobs communicate with each other, keep in mind that ExecuteGraph is a component like any other, i.e. it is connected to other components by edges. Jobflow components can use pre-defined (template) metadata containing fields that allow you to send job results through the upstream chain (try right-click on an edge → New metadata from template).

Step Two – Send XML to FTP

In this example, the DB extraction step is connected to a File Operation component called CopyFiles that uploads the result to an FTP server:

In fact, the metadata describes a token – a piece of information that flows on jobflow edges – similarly to a record in an ETL graph.
Tokens are responsible for triggering events in a jobflow; you can make components react to incoming tokens in various ways (read more).

Input mapping

Back to the example: How would I let the CopyFiles component know the name and path of the XML file to be uploaded to FTP? I added an extra metadata field to the output edge of the ExecuteGraph component. As soon as the ExecuteGraph component finishes, it populates the field with a file name the executed graph has created (see Dictionary and Output Mapping section). I used the Output mapping of the ExecuteGraph to map the file name to the output edge. Respectively, Input mapping on CopyFiles component lets you read source/target file paths from input edge. The affected attribute in this case is sourceURL, so you would write the Input mapping like this:

$out.0.sourceURL = "${DATAOUT_DIR}" + "/" + $in.0.xmlName;

Notice the $out.0 – in Input Mapping it represents the internal structure of the component attributes – it acts as a record so that you can use it in CTL2. Its metadata is fixed by the component definition – check the transformation dialog to see all the attributes you can map.

Dictionary and Output Mapping

I have not yet mentioned how the DB extraction job communicates with its depended jobs, particularly, where and how it gets the XML file name. Obviously, it needs to ‘send out’ its internal parameter – the XML name. This is carried out via one of CloverETL’s earlier features – dictionary.

Dictionary is an interface between the graph and its surroundings, e.g. other jobflow components, externally called procedures, etc. Using a dictionary to pass parameters between two jobs works like this:

  • First, you create a dictionary entry in the ETL graph (via Outline → Dictionary) and configure its data type. This determines what kind of data will be exchanged – string in this case.
  • Second, you populate the dictionary entry with a value within your ETL graph – either using the SetJobParameters component or in CTL (e.g. Reformat).
  • Now, in the parent jobflow, you can map the dictionary entry content in ExecuteGraph’s Output mapping. The dictionary entry will appear in the left panel under Dictionary folder:

Note – Dictionary can also be manipulated with the GetJobInput and SetJobOutput components. These allow you to €˜read and write€™ (respectively) from/to a dictionary inside an ETL graph. GetJobInput is usually a starting component that reads the dictionary value and maps it to its output edge, passing it to the rest of the graph. SetJobOutput, on the other hand, will typically be placed at the end of your graph where you want to make the results accessible to the parent.

Error Handling

So far, I have not taken into account any errors either on the graph or on the jobflow level. Let’s look at the middle step in my jobflow hierarchy – what happens if, for example, the file server goes down during the FTP transfer?

  • Obviously, the XML file will not be uploaded completely and the ETL graph will fail.
  • Next, the error will be signalled to the ExecuteGraph component which called the graph.
  • The error info is sent to the second output port of the component where you can react to it – stop the jobflow by placing the Fail component there, send email to the support team (EmailSender), or handle the error by another graph or nested jobflow.

In the picture below, I react to an FTP transfer error by writing an error log (notice this is handled by a separate graph) and then stopping the jobflow using Fail component:

There are many more possible reactions to failures. I could have configured the executing component not to Stop processing on fail – if an error occurred in the executed graph, the jobflow would not break and incoming tokens would sill trigger new graph executions (i.e. ignoring errors).

Step Three – Analyse Jobflow Stats

The penultimate step in my jobflow analyses previous job runs from a statistical point of view. First, it combines (Combine) tokens from preceding jobs and then computes average duration of each one. If that goes okay, the jobflow ends with Success.

Step Four – Encrypt Output

This overall result is then signalled to a logic one level higher (the top jobflow in the main diagram), which executed all the underlying tasks by a single ExecuteJobflow component. Using ExecuteScript, it calls a binary utility to produce a hash of the XML created earlier. The ListFiles component in front of it helps assembling the complete File URL.

When Two Become One: CloverETL OEM Embedded White Labeling

In our previous post, we reviewed the three approaches to CloverETL OEM—Here, I’ll discuss the more technical aspects of CloverETL OEM Embedded White Labeling.

OEM White Labeling – What it Takes

When partners needs white labeling, they are mostly considering the seamless integration of the ETL piece with their application. They have to manage things like version control of their application (making sure it works with Clover), the look and feel of the GUI, and a good sense of automation, for example. The Clover team works with the partner to architect this and can do so quickly.

Partners needing this option receive CloverETL Integration documentation to guide them through the process of achieving stable, effective white labeling. Both the CloverETL Designer and CloverETL Server can also be white labeled as part of a company strategy.

Which steps does white labeling consist of in Designer and Server?

CloverETL Designer

CloverETL Designer branding is based on showing the product name, splash screen, welcome screen, etcetera inside the Eclipse environment. Applications embedding or extending CloverETL Designer can use the same branding elements to visually customize the product. Features available via branding:

  • Product naming and description: These are visible in multiple places inside Eclipse (however, not all occurrences can be changed.)
  • Default configuration: Eclipse configuration options can be changed according to users’ needs, e.g. whether a splash screen should be shown at start-up.
  • Splash screen: The initial image shown at start-up and the progress bar can be customized.
  • Welcome screen: An introduction screen shown to users on the application’s first run; it can contain useful links to documentation, examples etc.

CloverETL Server

CloverETL Server is mentioned in multiple messages, web GUI images, directory names, etc. Following steps described in the integration documentation (available to OEM partners), you can get rid of all CloverETL occurrences. White labeling Server thus involves:

  • Replacing images, logos plus related work on graphics
  • Editing a couple of properties stored in server configuration files
  • Tuning the resulting application

Handling Errors in Heterogeneous Input Data

ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, all data cannot be read easily even if you spend a lot of time configuring the component. Sometimes you need to think in advance: What if you come across unknown metadata you have not handled? Normally, the graph crashes.

This post will examine a way of preventing that or, more specifically, how to handle errors in input data.

Example Input Data

Input Data

What We Will Do

We can instantly distinguish three kinds of metadata on the input: product, product_range and service. ComplexDataReader is the best component to parse these using three states of a state machine. As you can see, there is one line that does not fit into the data. The magic trick of this example lies in preparing one extra state – the error state. The state will be responsible for “catching” all incorrect data which would cause the component to fail. In order to be able to decide which data are “bad,” or, more precisely, when to switch to the error state, you have to write a custom Selector class in Java. The idea behind the code is very simple and will be explained below:

“Prep Work”

First, we need to prepare metadata for all three states of the state machine plus one extra. The extra metadata will represent error lines on the input we need to “throw away.”

Second, do not forget to connect the component to its succeeding components and assign metadata to output edges.

Third, set the “File URL” property to point the component to the input file.

Here are the three aforementioned metadata:

Metadata: Product

Metadata: Service

Metadata: Product Range

And one extra metadata for error lines:

Metadata for Error Lines

Designing  State Machine

We are going to create four states:

Note: There are no transition edges to be seen in the graph. It is because the Selector itself will decide when to change between states.

Start configuring the component via the “Transform” property. Create four states corresponding to the metadata and set “Initial state” to “Let selector decide”:

Switch to state “$0 product” and define its output mapping. In this state, we will send all fields to the output. Thus, drag state $0 to the “Value” column in the right-hand pane. You will produce the “$0.*” directive. In the “Transition table”, switch “Target state” to “Let selector decide”:

Repeat the same procedure for all remaining states (including the error state). Always send everything to the output port and “Let selector decide” about the target state:

Writing Custom Selector

We are now going to prepare a Java class that will do the magic of this example – switch between states “$0 product”, “$1 service”, “$2 product_range” and the “$3 error” state in case there are errors on reading. This particular prefix Selector will assume there is another record on the following line(s) and will try to read it. If there really is a new record, we can recover from the error line and carry on reading.

You can prepare the Java class in any editor of your choice. After writing it, just remember to place it into the “trans” folder of your project. On that condition, CloverETL will automatically compile the class for you.

The Selector class will look like this:

public class CustomPrefixInputMetadataSelector1 extends com.opensys.cloveretl.component.complexdatareader.PrefixInputMetadataSelector {

	private static final int DEFAULT = 3;

	@Override
	public int select(int prevState) {
		int result = super.select(prevState);
		if(result == org.jetel.component.RecordTransform.ALL) {
			return DEFAULT;
		}
		return result;
	}
}

A few comments concerning the code:

  • int result = super.select(prevState);
    First, we try to call the default selector and store the number of the next state into result.
  • if(result == org.jetel.component.RecordTransform.ALL)
    And if the default selector cannot decide…
  • return DEFAULT;
    We return the default state number – number 3. This is the error state.

Now that you are done with the code, switch to the “Selector” tab in “State transitions”. In “Selector URL”, browse for your custom Selector. Notice that after you specify its location, the “Selector properties” area changes:

Conclusions & Pitfalls

In this article, we have presented a way of handling flaws in the input data. We have been capable of addressing a situation when the selector looks on the following metadata and cannot decide which state goes next.

However, there are numerous cases when you just cannot prevent reading errors from occurring. For instance, even if the selector recognizes the following metadata but then fails on parsing them, we cannot react and the graph fails. You can imagine that as a file whose field types suddenly change, (e.g. from integer to date – the selector starts parsing an integer and crashes). Another known case we cannot handle is changeable number of fields in one record. If new fields occur or their number decreases, the graph execution fails. The only exception to this are fields added at the end of a record. These can be handled with the help of lenient data policy.

Download a complete CloverETL project – error handling in ComplexDataReader