Handling Errors in Heterogeneous Input Data

ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, not all data can be read easily, even if you spend a lot of time configuring the component. Sometimes you need to think ahead: what if the input contains a record type you have not defined metadata for? Normally, the graph crashes.

This post will examine a way of preventing that or, more specifically, how to handle errors in input data.

Example Input Data

Input Data

What We Will Do

We can instantly distinguish three kinds of metadata on the input: product, product_range and service. ComplexDataReader is well suited to parsing these using three states of a state machine. As you can see, there is one line that does not fit into the data. The trick of this example lies in preparing one extra state: the error state. This state is responsible for “catching” all incorrect data that would otherwise cause the component to fail. To decide which data are “bad” or, more precisely, when to switch to the error state, you have to write a custom Selector class in Java. The idea behind the code is very simple and is explained below.

“Prep Work”

First, we need to prepare metadata for all three states of the state machine plus one extra. The extra metadata will represent error lines on the input we need to “throw away.”

Second, do not forget to connect the component to its succeeding components and assign metadata to output edges.

Third, set the “File URL” property to point the component to the input file.

Here are the three aforementioned metadata:

Metadata: Product

Metadata: Service

Metadata: Product Range

And one extra metadata for error lines:

Metadata for Error Lines

Designing the State Machine

We are going to create four states, one for each metadata: “$0 product”, “$1 service”, “$2 product_range” and “$3 error”.

Note: There are no transition edges to be seen in the graph. This is because the Selector itself decides when to change between states.

Start configuring the component via the “Transform” property. Create four states corresponding to the metadata and set “Initial state” to “Let selector decide”:

Switch to state “$0 product” and define its output mapping. In this state, we will send all fields to the output. To do so, drag state $0 to the “Value” column in the right-hand pane; this produces the “$0.*” directive. In the “Transition table”, switch “Target state” to “Let selector decide”:

Repeat the same procedure for all remaining states (including the error state). Always send everything to the output port and “Let selector decide” about the target state:

Writing Custom Selector

We are now going to prepare a Java class that will do the magic of this example – switch between states “$0 product”, “$1 service”, “$2 product_range” and the “$3 error” state in case there are errors on reading. This particular prefix Selector will assume there is another record on the following line(s) and will try to read it. If there really is a new record, we can recover from the error line and carry on reading.

You can prepare the Java class in any editor of your choice. After writing it, remember to place it into the “trans” folder of your project. If you do, CloverETL will automatically compile the class for you.

The Selector class will look like this:

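/**
 * Prefix selector that falls back to the error state whenever the
 * default prefix selector cannot decide which state comes next.
 */
public class CustomPrefixInputMetadataSelector1 extends com.opensys.cloveretl.component.complexdatareader.PrefixInputMetadataSelector {

	// Number of the error state ("$3 error") in the state machine.
	private static final int DEFAULT = 3;

	@Override
	public int select(int prevState) {
		// Ask the default prefix selector for the next state.
		int result = super.select(prevState);
		// ALL means the default selector could not decide.
		if(result == org.jetel.component.RecordTransform.ALL) {
			return DEFAULT;
		}
		return result;
	}
}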

A few comments concerning the code:

  • int result = super.select(prevState);
    First, we try to call the default selector and store the number of the next state into result.
  • if(result == org.jetel.component.RecordTransform.ALL)
    And if the default selector cannot decide…
  • return DEFAULT;
    We return the default state number – number 3. This is the error state.
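
The fallback idea can also be tried outside CloverETL. The following is a minimal, self-contained sketch and not the CloverETL API: the `LineSelector` interface, the `UNDECIDED` constant and the single-letter prefixes `P;`, `S;` and `R;` are all hypothetical stand-ins, chosen only to show how a wrapper can turn “cannot decide” into the error state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a selector: picks the state for the next line. */
interface LineSelector {
    int UNDECIDED = Integer.MAX_VALUE; // stand-in for "cannot decide"

    int select(String nextLine);
}

/** Chooses a state by matching the start of the line against known prefixes. */
class PrefixSelector implements LineSelector {
    private final Map<String, Integer> prefixToState = new LinkedHashMap<>();

    PrefixSelector add(String prefix, int state) {
        prefixToState.put(prefix, state);
        return this;
    }

    @Override
    public int select(String nextLine) {
        for (Map.Entry<String, Integer> entry : prefixToState.entrySet()) {
            if (nextLine.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return UNDECIDED; // no prefix matched
    }
}

/** Wraps another selector and routes undecided lines to the error state. */
class FallbackSelector implements LineSelector {
    private final LineSelector delegate;
    private final int errorState;

    FallbackSelector(LineSelector delegate, int errorState) {
        this.delegate = delegate;
        this.errorState = errorState;
    }

    @Override
    public int select(String nextLine) {
        int result = delegate.select(nextLine);
        return result == UNDECIDED ? errorState : result;
    }
}

class FallbackSelectorDemo {
    /** States 0-2 are the regular record types, state 3 is the error state. */
    static LineSelector build() {
        LineSelector prefixes = new PrefixSelector()
                .add("P;", 0)   // product (assumed prefix)
                .add("S;", 1)   // service (assumed prefix)
                .add("R;", 2);  // product_range (assumed prefix)
        return new FallbackSelector(prefixes, 3);
    }

    public static void main(String[] args) {
        LineSelector selector = build();
        System.out.println(selector.select("P;1;Widget"));
        System.out.println(selector.select("this line fits nothing"));
    }
}
```

Keeping the fallback in a wrapper leaves the prefix matching untouched, which mirrors how the custom Selector above only post-processes the result of `super.select(prevState)`.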

Now that you are done with the code, switch to the “Selector” tab in “State transitions”. In “Selector URL”, browse for your custom Selector. Notice that after you specify its location, the “Selector properties” area changes:

Conclusions & Pitfalls

In this article, we have presented a way of handling flaws in the input data. We were able to address the situation in which the selector looks at the following metadata and cannot decide which state comes next.

However, there are numerous cases in which you simply cannot prevent reading errors. For instance, if the selector recognizes the following metadata but parsing them then fails, we cannot react and the graph fails. Imagine a file whose field types suddenly change, e.g. from integer to date: the component starts parsing an integer and crashes. Another known case we cannot handle is a varying number of fields in one record. If new fields appear or their number decreases, the graph execution fails. The only exception is fields added at the end of a record. These can be handled with the help of the lenient data policy.
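
To make these pitfalls concrete, here is a small plain-Java sketch. It does not use CloverETL, and it only loosely imitates a lenient policy. The three-field layout `prefix;id;name` and the `ProductLineParser` class are assumptions made for illustration. The point is that the record type is chosen before the fields are parsed, so a wrong field type or a missing field only surfaces after the state has already been selected.

```java
import java.util.Arrays;

class ProductLineParser {
    /** Hypothetical product layout: prefix;id;name */
    static final int EXPECTED_FIELDS = 3;

    /**
     * Splits a line into fields. In lenient mode, surplus trailing fields
     * are dropped; missing fields are always an error.
     */
    static String[] split(String line, boolean lenient) {
        String[] fields = line.split(";", -1);
        if (fields.length == EXPECTED_FIELDS) {
            return fields;
        }
        if (lenient && fields.length > EXPECTED_FIELDS) {
            return Arrays.copyOf(fields, EXPECTED_FIELDS);
        }
        throw new IllegalArgumentException(
                "Expected " + EXPECTED_FIELDS + " fields but found " + fields.length);
    }

    /** Parses the id field; throws NumberFormatException if it is not an integer. */
    static int parseId(String line, boolean lenient) {
        String[] fields = split(line, lenient);
        return Integer.parseInt(fields[1]);
    }

    public static void main(String[] args) {
        System.out.println(parseId("P;42;Widget", false));
        System.out.println(parseId("P;42;Widget;extra", true));
        try {
            parseId("P;2011-01-01;Widget", true); // prefix is fine, type is not
        } catch (NumberFormatException e) {
            System.out.println("Parsing failed after the record type was chosen");
        }
    }
}
```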

Download a complete CloverETL project – error handling in ComplexDataReader

Processing Heterogeneous Data with ComplexDataReader

ComplexDataReader – Example How-to

ComplexDataReader is a new component for reading heterogeneous data (data which contains multiple types of records that can also depend on each other) without the need for hard-coding. Instead, the component is driven by a state machine which can be set up using the GUI.

The following example will present some of the capabilities of ComplexDataReader, as well as guide you through the design of a simple automaton, which is used for processing a text file containing two types of shipments grouped into batches. Each batch starts with a batch header; the number of items in a batch is variable and it is part of the header.

Input Data

Input Data for ComplexDataReader

What We Want to Achieve

For every parcel and every letter, send the address and the charge to the output, and add the batch ID, customer ID, and the date from the respective batch header.

The first element of a batch header determines the type of its elements, and the third element contains the number of items in the batch.

Preparation

Before starting the configuration of the component, all the required metadata should be defined. Also, the component should be connected to the succeeding component(s) and the output edge(s) should have metadata assigned.

You may also set the “File URL” property of the component to point to the input file.

Internal metadata (used for parsing the input):

Batch Metadata

Parcel Metadata

Letter Metadata

Output metadata (used for output mapping):

Shipment Metadata

ComplexDataReader Configuration

First, we have to design an automaton, which will guide the component through parsing the input. The automaton may look like this:

ComplexDataReader Automaton

The idea behind it is that we start by reading a batch header, therefore the initial state is set to “$0 – Batch”. Then we can decide, depending on the value of the “type” field, whether to proceed to “$1 – Letter” or “$2 – Parcel”. In either of these states, we read as many records as specified in the “count” field of the previous batch header, then return to “$0 – Batch” and expect a new batch header.
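
The automaton can be simulated in a few lines of plain Java. This is only a sketch of the state logic, not of the component itself. The semicolon delimiter and the position of the fields other than the first (type) and third (count) are assumptions, and a single counter stands in for the two per-state counters because only one of them is active within a batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchAutomaton {
    static final int BATCH = 0;
    static final int LETTER = 1;
    static final int PARCEL = 2;

    /** Returns, for each input line, the state in which that line is read. */
    static List<Integer> trace(List<String> lines) {
        List<Integer> visited = new ArrayList<>();
        int state = BATCH; // initial state: expect a batch header
        int count = 0;     // "count" field of the last batch header
        int counter = 0;   // items read so far in the current batch
        for (String line : lines) {
            visited.add(state);
            if (state == BATCH) {
                String[] header = line.split(";");
                count = Integer.parseInt(header[2]); // third element: item count
                counter = 0;                         // reset on every new batch
                if (header[0].equals("LETTERS")) {
                    state = LETTER;
                } else if (header[0].equals("PARCELS")) {
                    state = PARCEL;
                } else {
                    throw new IllegalStateException("Unexpected batch type: " + header[0]);
                }
            } else {
                counter++;
                // Stay in the item state while counter < count,
                // otherwise return to BATCH and expect a new header.
                if (counter >= count) {
                    state = BATCH;
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "LETTERS;B1;2;C7;2011-05-01", // made-up sample data
                "1 Main St;0.50",
                "2 Oak Ave;0.75",
                "PARCELS;B2;1;C9;2011-05-02",
                "3 Elm Rd;4.20");
        System.out.println(trace(lines));
    }
}
```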

To start building the automaton, open the configuration dialog by double-clicking the component and then opening its “Transform” property.

Create three states by dragging the “batch”, “letter” and “parcel” metadata, respectively, from the list of Available Metadata on the left to the list of States on the right. You can also edit the labels of the states. Set the Initial state to “State $0” by selecting it from the drop-down list.

Optionally, you may switch to the Overview tab and press the Undock button to get an interactive overview of the automaton being built.

Switch to the State $0 tab. This state represents a new batch. Set the automaton to reset the counters for states $1 and $2 by pressing the Actions button and ticking Reset counter for “State $1” and “State $2”. Add two rows to the Transition table. Set the condition of the first row to $batch.type == "LETTERS" and the condition of the second row to $batch.type == "PARCELS". Set their target states to “State $1” and “State $2”, respectively. You may also set the target of the default transition to Fail so that unexpected batch types are detected.

Note that in state $0, no output mapping is defined; hence no data will be sent to the output.
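
One way to picture the Transition table of state $0 is as an ordered list of condition rows with a default target. The sketch below is a hypothetical model, not the component's implementation. It assumes that rows are tried from top to bottom and that the first matching row wins, and it uses `-1` as a made-up marker for the Fail target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class TransitionTable {
    static final int FAIL = -1; // hypothetical marker for the "Fail" target

    private final List<Predicate<String>> conditions = new ArrayList<>();
    private final List<Integer> targets = new ArrayList<>();
    private final int defaultTarget;

    TransitionTable(int defaultTarget) {
        this.defaultTarget = defaultTarget;
    }

    /** Appends a row: if the condition holds, go to the given target state. */
    TransitionTable row(Predicate<String> condition, int target) {
        conditions.add(condition);
        targets.add(target);
        return this;
    }

    /** Returns the target of the first matching row, or the default target. */
    int next(String batchType) {
        for (int i = 0; i < conditions.size(); i++) {
            if (conditions.get(i).test(batchType)) {
                return targets.get(i);
            }
        }
        return defaultTarget;
    }

    /** The table described for state $0: two rows plus a failing default. */
    static TransitionTable forBatchState() {
        return new TransitionTable(FAIL)
                .row(type -> type.equals("LETTERS"), 1)
                .row(type -> type.equals("PARCELS"), 2);
    }

    public static void main(String[] args) {
        TransitionTable table = forBatchState();
        System.out.println(table.next("LETTERS"));
        System.out.println(table.next("BOXES"));
    }
}
```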

The configuration of state $1 and $2 will be very similar. In these states we want to produce output, therefore we have to define output mapping. For example, in state $1 we need to send to the output “address” and “charge” fields from internal record $1 (last letter record) and “batchID”, “customerID” and “date” from internal record $0 (last batch header record).

For state $1, define Output mapping by dragging row “$1” from the left table onto “Port 0” in the right table. Then expand row $0 on the left and Port 0 on the right and drag “batchID”, “customerID” and “date” from the left onto “$0.batchID”, “$0.customerID” and “$0.date” on the right, respectively.
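
In code terms, this output mapping combines the last letter record with the last batch header into one shipment record. The classes below are hypothetical plain-Java stand-ins for the metadata; the field names follow the article, while the field types (all strings here) are an assumption.

```java
/** Stand-in for internal record $0 (last batch header). */
class BatchHeader {
    final String batchID;
    final String customerID;
    final String date;

    BatchHeader(String batchID, String customerID, String date) {
        this.batchID = batchID;
        this.customerID = customerID;
        this.date = date;
    }
}

/** Stand-in for internal record $1 (last letter record). */
class Letter {
    final String address;
    final String charge;

    Letter(String address, String charge) {
        this.address = address;
        this.charge = charge;
    }
}

/** Stand-in for the output record sent to Port 0. */
class Shipment {
    final String address;
    final String charge;
    final String batchID;
    final String customerID;
    final String date;

    Shipment(String address, String charge, String batchID, String customerID, String date) {
        this.address = address;
        this.charge = charge;
        this.batchID = batchID;
        this.customerID = customerID;
        this.date = date;
    }
}

class ShipmentMapper {
    /** Item fields come from the letter, the remaining fields from the batch header. */
    static Shipment map(BatchHeader batch, Letter letter) {
        return new Shipment(letter.address, letter.charge,
                batch.batchID, batch.customerID, batch.date);
    }

    public static void main(String[] args) {
        Shipment s = map(new BatchHeader("B1", "C7", "2011-05-01"),
                new Letter("1 Main St", "0.50")); // made-up sample values
        System.out.println(s.batchID + " " + s.address + " " + s.charge);
    }
}
```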

Add one row to the Transition table and set its condition to counter1 < $batch.count and its target to “State $1”. Also set the target of the default transition to “State $0”.

Letter State

Similarly, for state $2, drag row “$2” onto “Port 0” and “batchID”, “customerID” and “date” from row $0 onto “$0.batchID”, “$0.customerID” and “$0.date”. Add one row to the Transition table and set its condition to counter2 < $batch.count and its target to “State $2”. Again, set the target of the default transition to “State $0”.

Parcel State

Download the transformation graph with data