Tag Archives: CloverETL Server

CloverETL Visions for 2012: Evolution and Revolution in Data Integration

Part one – celebrating 10 years

In 2012, CloverETL will celebrate its 10th anniversary as an open source project. It all started back in 2002. On October 3rd, 2002, version 0.1 was first announced on the Freshmeat (now Freecode) portal. That day, CloverETL’s official life began.

I don’t want to look into Clover’s history too much, though. I do, however, want to take this time to make a few comments about the principles on which CloverETL was established and how these principles continue to determine its future.

Principle number 1: Elegant and robust architecture guarantees a stable foundation

CloverETL started more as a framework on which other projects could be based, rather than as an end-user product with a “sexy” GUI. As a matter of fact, the real GUI was built in 2005, almost three years after the release of the first CloverETL engine, which is now present in every tool of the CloverETL family: the Designer, the Server, and CloverETL Profiler.

Even though we are now on version 3.2, there has, so far, only been one change which significantly broke backward compatibility: when we switched from Java 1.4 to Java 1.5 and changed some key interface definitions.

This particular principle is what gives a certain peace of mind to the projects and software products embedding or otherwise deploying Clover, as they know there won’t be any sudden surprises with future versions. It also proves that the original architecture was robust and flexible enough at the outset to support all the later additions and improvements.

Principle number 2: Less is better

CloverETL is based on the idea of cooperating components, each specialized in a single function. However, each component is flexible enough to support the various “outer” conditions in which it works.

For example, our UniversalDataReader is meant for parsing text data. The data can come in variations like fixed-length, delimited, or combined; can be read locally or from remote locations; and can be available in plain form or compressed. All these variations are supported, which means that subtle changes, like data becoming available through a different protocol or perhaps being suddenly compressed, require only slight reconfiguration of our DataReader. Contrast this with other players, whose hundreds of different components require architectural changes to a transformation (replacing one component with another) whenever a small shift in the input data occurs (e.g. when moving from a DEV to a PROD environment), and you’ll notice the difference.

It also means that a programmer or analyst designing data transformations in Clover does not need to carry a dictionary of components; a short list covers all possible scenarios.

Principle number 3: Agility is sexy, but long term planning is wise

CloverETL is used in many applications by many customers. Some of them are large, global corporations that embed Clover in their products. Through our OEM program, we work with many customers who take a very agile approach to the development of their applications. Some of them have release cycles as short as two weeks, in which they must not only develop and debug, but also release new features. Clover’s development team tries to keep up with this sprint, but we still take our time to plan, architect, and develop new, fundamental features that extend CloverETL’s capabilities and help our customers do their jobs faster and more simply.

The reason we insist on thinking through every new feature request, beyond simple tweaks, is that sometimes a relatively small and quick change may break compatibility somewhere or prevent future extensions. Whenever our development team touches the core (engine), we make sure the change is properly evaluated from several points of view, including:

  • Backward compatibility – at least at transformation graph level.
  • Performance – Slowdown of just a few percent on big data can mean extra kW of energy consumed by data crunching servers.
  • Future extensibility – We hate deprecating APIs or components just because we might not be able to continue enhancing and improving them.

This principle is further supported by the fact that CloverETL continues to be developed by the same, stable development team year in, year out. Many team members have been around since 2005, when the commercial life of Clover began.

Part two – What will appear on the menu in 2012

In short, there will be evolution and, in certain areas, some revolution. We are always sorting out the dilemma of whether to break from the “past” and come up with something completely new and revolutionary – at least in our minds – or continue to improve the old-faithful engine architecture laid out years ago.

As we weren’t able to choose one or the other, we decided to continue improving what works well (and should continue to, even in future) and overhaul some things that have had occasional hiccups with modern data structures and formats brought to us by the CLOUD.

Evolution

Expanding CloverETL OEM program

As CloverETL attracts new OEM customers, we continue improving our OEM program by making it simpler to embed, modify, white-label, or otherwise enhance our technology stack. This includes better documentation, example projects, and extended training.
We are also investing in our support team, which has always strived to provide timely and accurate answers to all support requests submitted through various channels, from e-mail to the technology forum and hotline.

Our support staff is comprised of experienced consultants and programmers who have real-life experience with our technology—they aren’t just people a few manual pages ahead of a user seeking an answer.

GUI – continuous improvement of the user experience

We will continue our effort to make the Designer more and more user-friendly. Our motto is: CloverETL is built by professionals for professionals and, truly, professional DI experts or Java programmers usually give us high marks. Nonetheless, we want to make our technology accessible to the broadest possible audience seeking solutions to certain data needs.

Enhancing CloverETL Cluster – our BigData recipe

These days, BigData is usually mentioned together with Hadoop as the solution. As much as we like Hadoop for various reasons, we have our own recipe for processing BigData, and we think it’s better suited for classical data integration/ETL tasks. It is based on a split/transform/merge idea, where big input data are partitioned and then processed in parallel on multiple nodes of the CloverETL Cluster. The advantage of this, as opposed to Hadoop, is that the transformation may be developed & debugged locally, then easily deployed onto CloverETL Cluster for fast execution. Even if executed in a cluster environment, all the debugging and monitoring options of our Designer are available. It is also worth mentioning that deploying CloverETL Cluster is much easier than setting up the Hadoop cluster.
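
The split/transform/merge idea can be illustrated outside of Clover with a few lines of plain Java. This is only a sketch under stated assumptions: the class and method names are made up for illustration, threads stand in for cluster nodes, records are dealt out round-robin, and nothing here is CloverETL API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only: threads play the role of cluster nodes.
class SplitTransformMerge {

    // Split: deal records round-robin into `parts` partitions.
    static <T> List<List<T>> split(List<T> input, int parts) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < input.size(); i++) {
            partitions.get(i % parts).add(input.get(i));
        }
        return partitions;
    }

    // Transform every partition in parallel, then merge the partial results.
    static <T, R> List<R> run(List<T> input, int parts, Function<T, R> transform) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> partition : split(input, parts)) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T record : partition) {
                        out.add(transform.apply(record));
                    }
                    return out;
                }));
            }
            List<R> merged = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                merged.addAll(future.get()); // merge in partition order
            }
            return merged;
        } catch (Exception e) {
            throw new IllegalStateException("parallel run failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the merged output follows partition order, not input order; a real transformation that needs the original order would have to sort or merge by key afterwards.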

Our big enhancement of CloverETL Cluster in 2012 will be the merging of our technology with Hadoop (more precisely, with the HDFS filesystem), which should combine the best of both worlds. HDFS provides some cool features, namely robustness and high performance, and we want to utilize its automated data partitioning to make it easier to grow (or shrink) the storage of data depending on actual needs.

Revolution

Rich data structures – trees, unstructured data, etc.

It has to come with age, but I can’t resist and must admire those who devised Cobol and CopyBook. In those times, every byte of storage counted and CPUs were slow, yet programmers were still able to process rich data structures. Then relational databases came and brought the idea of tables and normal forms. Well, today, we are back to rich structures, but this time, we’ve stopped counting bytes or CPU cycles (which has a huge impact on power consumption of servers, but that’s a different story.) That is why XML, JSON, or other rich structures are becoming the norm today.

In order to support these structures and formats as first class passengers, we decided to overhaul our metadata and record storage model and allow direct support of tree structures, multi-values of fields, and even loosely typed data organized in maps/properties collections.

This in itself is a big adventure, as every single piece of our technology platform will be affected and thus will have to be adapted. The effort will be huge, and the necessary regression testing of the whole platform will be endless. Despite this, the prize is enticing: almost any type of data (and the cloud will be a bonanza for this) will be 1:1 representable by Clover. That will include XML, JSON, POJO, and complex properties, and, in the future, who knows what else!

—–

We have always claimed that CloverETL is future-proof. Therefore, in 2012, we will be improving our foundations so they withstand the next 10 years.

If what I’ve talked about above is of interest to you, then please stay tuned. We will be publishing more details on our new functionality as we implement it.

For now, I wish everyone a very successful 2012!

Exporting Data Transformation Projects to CloverETL Server

CloverETL Designer, in its full or trial version, provides integration with CloverETL Server. The CloverETL Server serves as an ETL runtime environment and brings enterprise features such as automation, workflows, monitoring, user management and many others. The integration allows users to design and maintain Server data integration projects locally in their Designers. However, sometimes you may find yourself in a situation where you need to export a project originally developed locally on your computer and deploy it to the Server. A quick how-to is described below.

1. Select File > Export

2. Select Export to CloverETL Server sandbox

3. Select desired projects
4. If you want to export a project to our demo server, you can select it from the combo box; otherwise type in the URL of your CloverETL Server. Enter a username and password (clover/clover for the demo server).
5. Click the Reload button to load the available sandboxes and select the desired one (playground1 or playground2 for the demo server; the other demo sandboxes are read-only).
6. Click Finish

7. Check the exported project in the CloverETL Server under Sandboxes.

Warning: Graphs are copied to the Server together with their parameters (e.g. file paths). These parameters need to be adjusted.

Launch Services – Part 2 – Configuration

In the last blog post, you learned what the Launch Services are. In this post you will see how to configure them.

Let us study an example scenario to become acquainted with configuration. We have a database containing the highest mountains on Earth along with their heights. The user enters an elevation above sea-level and hits the enter key. The Excel sheet is then displayed listing all mountains with the given minimal elevation.

Mountains example of Launch Services

How to Configure It?

First, we must create a transformation graph that uses a dictionary to receive parameters and to store results. Create a new graph in CloverETL Designer. In the Outline pane, right-click on Dictionary and choose Edit. Add a new entry named heightMin, with the “As Input” field set to true and “type” set to Integer. Then add another entry named mountains.xls of type writable.channel, content type text/csv, and “As Output” set to true.

Dictionary

Now we may build a transformation graph. Components can use a dictionary in three different ways:

  1. Via file URL: data readers and writers may specify a File URL in the format dictionary:field-name. In our example, we set a data writer File URL to dictionary:mountains.xls
  2. In CTL: anywhere in CTL code, we can use an expression of the form dictionary.field-name to read or write the dictionary. In our example we use the Filter expression $0.heightM >= dictionary.heightMin
  3. In Java code: using the methods transformationGraph.getDictionary().getValue(String fieldName) and transformationGraph.getDictionary().setValue(String fieldName, Object value)
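
To make the Java variant a little more concrete, here is a minimal self-contained sketch. The Dictionary class below is a hypothetical stand-in backed by a map: it only mirrors the getValue/setValue method names quoted above and is not the CloverETL class. The output entry name "mountains" and the mountain heights are illustrative sample values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionarySketch {

    // Hypothetical stand-in for the graph dictionary; not the CloverETL class.
    static class Dictionary {
        private final Map<String, Object> entries = new HashMap<>();

        Object getValue(String fieldName) {
            return entries.get(fieldName);
        }

        void setValue(String fieldName, Object value) {
            entries.put(fieldName, value);
        }
    }

    // Same rule as the CTL filter: keep mountains with heightM >= heightMin.
    static List<String> filter(Map<String, Integer> heightsInMetres, Dictionary dictionary) {
        int heightMin = (Integer) dictionary.getValue("heightMin");
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : heightsInMetres.entrySet()) {
            if (entry.getValue() >= heightMin) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    static List<String> demo(int heightMin) {
        Map<String, Integer> mountains = new LinkedHashMap<>();
        mountains.put("Mount Everest", 8848); // sample heights in metres
        mountains.put("K2", 8611);
        mountains.put("Kangchenjunga", 8586);

        Dictionary dictionary = new Dictionary();
        dictionary.setValue("heightMin", heightMin); // plays the role of the input entry
        List<String> selected = filter(mountains, dictionary);
        dictionary.setValue("mountains", selected);  // plays the role of the output entry
        return selected;
    }
}
```

Calling DictionarySketch.demo(8600) keeps Mount Everest and K2 and drops Kangchenjunga, which is the behaviour the Filter expression describes.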

When the transformation graph is designed and ready, we must publish it as a Launch Service. In the CloverETL Server administration, go to the Launch Services section and click New launch configuration. Now enter a name, a sandbox and a graph name. Then open the Detail page for the new service and click on the Edit Parameters tab. Create a new parameter named heightMin.

CloverETL Server Interface

Now we may test it. When we click a test link, the server generates a simple form which executes a launch service. We can copy, customize and use this form in a web site.

Test Page

Loop Execution of Data Transformation

Case study description

The Czech Insolvency Registry (http://isir.justice.cz) contains data about economic subjects that have entered insolvency and have difficulties paying off their debts. The registry allows anybody to download its data through a public SOAP web service. This can be done manually or automatically with the right software.
https://isir.justice.cz:8443/isir_ws/services/IsirPub001?wsdl

CloverETL can easily automate the download, saving time and sparing you the technical difficulties. A CloverETL graph can get the required data by calling the web service, process the data, and store it in the required format. Unfortunately, the Registry’s web service is very poorly designed. It doesn’t give you the current status of each economic subject, but provides the whole history of the requested company. Therefore we have to download not only the current information we need but the whole history since 2008 (when the registry was founded). That is a lot of data to process: actually thousands of log records for each company! Moreover, the Registry’s web service “GetIsirPub0012” returns a maximum of 1000 records per call. If a company has a few thousand records, you have to make several calls. So we have to download the data in batches of a thousand records, but we don’t know in advance how many batches (records) there are for each company. That makes the whole process quite difficult.

But the solution with CloverETL is simple. CloverETL Server provides the “graph event listener” and “groovy task” features, which help us with all the challenges described above. First, of course, we design a CloverETL graph that, to begin with, processes just one batch of a thousand records (see the picture below).

Transformation graph

This graph has a parameter “startID”, which has the value “0” by default. If we want to process 1000 records starting from, let’s say, no. 2541, then we set startID=2541, and the first downloaded record will be the one with that ID. If we run the graph without parameters, it’ll download and process the first thousand records (no. 0 – 999).

The graph also contains a couple of components that store the ID of the last downloaded record, so that the next batch can use the last ID as its startID. The ID is automatically stored in the graph parameters as the lastID parameter. This can be done with in-line Java code in a Reformat component:
String id = GetVal.getString(source[0],"id");
getGraph().getGraphProperties().setProperty("lastID", id );

The loop

The graph we designed must be executed n times to download and process all records. At first we don’t know how many times, but we know that we can stop downloading as soon as there are no more records to read. That means we can stop the process as soon as “startID” and “lastID” are equal.

How do we achieve such a loop?
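
The stop condition can be tried out in isolation. The sketch below is plain Java with a hypothetical fetchBatch method standing in for one graph run. It assumes that a run returns at most 1000 records beginning with the record whose ID equals startID, and that record IDs are consecutive; under that assumption the boundary record is fetched twice, and the final run returns only that one record, which is what makes startID and lastID equal.

```java
import java.util.ArrayList;
import java.util.List;

class BatchLoopSketch {

    static final int BATCH_SIZE = 1000; // at most 1000 records per call

    // Hypothetical stand-in for one graph run: returns record IDs from startID
    // (inclusive) up to the batch size or the end of the data.
    static List<Long> fetchBatch(long startID, long totalRecords) {
        List<Long> ids = new ArrayList<>();
        for (long id = startID; id < totalRecords && ids.size() < BATCH_SIZE; id++) {
            ids.add(id);
        }
        return ids;
    }

    // Repeats the run until startID and lastID are equal; returns the number of runs.
    static int countRuns(long totalRecords) {
        long startID = 0; // default value of the graph parameter
        int runs = 0;
        while (true) {
            List<Long> batch = fetchBatch(startID, totalRecords);
            runs++;
            if (batch.isEmpty()) {
                break; // nothing to read at all
            }
            long lastID = batch.get(batch.size() - 1);
            if (lastID == startID) {
                break; // no new records since the previous run
            }
            startID = lastID; // the next run starts where this one ended
        }
        return runs;
    }
}
```

With 2500 records (IDs 0 to 2499) the runs start at 0, 999, 1998 and 2499, so countRuns(2500) returns 4; the last run only confirms that nothing new is left.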

Graph event listener

To achieve the automatic loop for the graph designed and described above, we’ll define a graph event listener for the “FINISHED_OK” graph event on CloverETL Server. Every time the transformation finishes without error (“FINISHED_OK”), the listener triggers the task that we selected. We need to specify this task now. Since we want to execute the same graph repeatedly, we could specify the “execute graph” task. On its own, however, this task would keep re-executing the graph indefinitely, and we need to stop the loop at some point: we need to “break the loop” when the startID and lastID parameters are equal. Therefore it is actually better to create a “groovy task” instead of “execute graph”.

Groovy task

Groovy is a scripting language with Java-like syntax. In addition, Groovy scripts may access Java objects and use Java libraries. See the Groovy project site http://groovy.codehaus.org/ for more details.

We’ll create a simple Groovy script which decides whether to execute the graph again or not. To decide, we’ll need to get the graph properties from the finished graph. These properties are accessible by calling the method event.getProperties().

Then we’ll need to execute the graph using the CloverETL Server Java API. This is done by calling the method serverFacade.executeGraph().

The script may return a String value, which is stored in the “Task history log”.

// these variables are predefined:
// sessionToken
// event
// serverFacade

import com.cloveretl.server.persistent.RunRecord;
import org.apache.log4j.Logger;
import com.cloveretl.server.api.*;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

Logger log = Logger.getLogger("groovy-ISIR-graphEventListener");

Properties eventProps = event.getProperties();
log.info("event properties: " + eventProps);

// get lastID and startID from previous graph execution
String lastIDString = eventProps.getProperty("lastID");
String startIDString = eventProps.getProperty("startID");
long lastID = Long.valueOf(lastIDString);
long startID = Long.valueOf(startIDString);

// lastID and startID from last graph execution are equal – break the loop
if (lastID == startID)
    return "no more records to download";

// prepare startID which will be passed to next graph execution
Properties properties = new Properties();
properties.setProperty("startID", lastIDString);

String SANDBOX = eventProps.getProperty("SANDBOX_CODE");
String GRAPH = eventProps.getProperty("GRAPH_FILE");
GraphExecutionCommand graphExecutionCommand = new GraphExecutionCommand(
    null, SANDBOX, GRAPH, null, null, null, true, properties, null, null);
Response respExec = serverFacade.executeGraph(sessionToken, graphExecutionCommand);
String result = "graph "+SANDBOX+"/"+GRAPH+" executed: "+respExec.getBean();
log.info(result);
return result;
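
The decision part of the script can be isolated and tested without a server. The following is a sketch in plain Java; nextRunProperties is a hypothetical helper name, not part of the CloverETL Server API. As an extra assumption it also stops the loop when lastID is missing and falls back to 0 when startID is missing, two cases the script above does not guard against.

```java
import java.util.Optional;
import java.util.Properties;

class LoopDecisionSketch {

    // Returns the properties for the next run, or empty when the loop should stop.
    static Optional<Properties> nextRunProperties(Properties eventProps) {
        String lastID = eventProps.getProperty("lastID");
        String startID = eventProps.getProperty("startID", "0"); // graph default is 0
        if (lastID == null) {
            return Optional.empty(); // nothing was downloaded, so stop
        }
        if (Long.parseLong(lastID.trim()) == Long.parseLong(startID.trim())) {
            return Optional.empty(); // no new records, break the loop
        }
        Properties next = new Properties();
        next.setProperty("startID", lastID.trim()); // continue from the last record
        return Optional.of(next);
    }
}
```

Keeping the decision in one small function makes the loop's exit condition easy to check before wiring it into a listener task.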

Graph results

All graph results for every batch of data are stored in a single CSV file. New records are always appended, so don’t worry, there is no danger that some of them will be overwritten :-) . So when the whole series of transformations is finished, we have just one CSV file with all processed records. Alternatively, if somebody wishes, we can consolidate the records and store them directly in a database, where the data can be kept in a friendlier and more usable format.
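
Appending rather than overwriting is the important detail here. In the graph this is a matter of how the writer component is configured; the sketch below only illustrates the same idea in plain Java with the standard java.nio.file API. The class name, file prefix and sample lines are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class CsvAppendSketch {

    // Appends one batch of CSV lines; earlier batches in the file are left intact.
    static void appendBatch(Path csvFile, List<String> csvLines) {
        try {
            Files.write(csvFile, csvLines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes two batches into a temporary file and returns the total line count.
    static int demo() {
        try {
            Path file = Files.createTempFile("isir-sketch", ".csv");
            try {
                appendBatch(file, List.of("0;first record", "999;boundary record"));
                appendBatch(file, List.of("1998;later record"));
                return Files.readAllLines(file, StandardCharsets.UTF_8).size();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```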

Parallel Data Processing with CloverETL Cluster

For the upcoming release of CloverETL 2.9, we are working on improvements in CloverETL Server which will allow transformations to run in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so multiple instances may cooperate with each other. The current stable version already implements the common cluster features: fail-over/high availability and scalability for large numbers of requests, which are load-balanced across the available cluster nodes. These features have actually been implemented since version 1.3.

The basic concept of new parallelism
A transformation may be automatically executed in parallel on multiple cluster nodes according to configuration, and each of these “worker” transformations processes just its part of the data. Because there is one “master” transformation, which manages the other transformations and gathers tracking data from the “worker” transformations, the parallelism is transparent to the CloverETL Server client. By default the client “sees” just one (master) execution and aggregated tracking data. However, there are still logs and tracking data for each of the “worker” transformations, so it’s still possible to inspect the details of the parallel execution. The outputs of the “worker” transformations are gathered to the “master”, so the client has one single transformation output which may be processed further.

So how to get parts of input data?
Basically, transformation can process data which is already partitioned, which is the best case and there is no overhead with partitioning of data, or CloverETL Server itself can partition input data from one single source and distribute data on the fly (during the transformation) to several cluster nodes using the network connection. Overhead of this operation depends on the speed of network communication and other conditions.
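
The text does not say which rule the Server uses to decide where a record goes, so the following is only one common option, shown as a plain Java sketch with hypothetical names: partitioning by the hash of a key, so that records sharing a key always land on the same node.

```java
import java.util.ArrayList;
import java.util.List;

class KeyPartitionSketch {

    // Maps a key to a node index in the range 0..nodeCount-1.
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // Distributes the keys into one list per node.
    static List<List<String>> partition(List<String> keys, int nodeCount) {
        List<List<String>> nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ArrayList<>());
        }
        for (String key : keys) {
            nodes.get(nodeFor(key, nodeCount)).add(key);
        }
        return nodes;
    }
}
```

Math.floorMod is used instead of the % operator because hashCode() may be negative. Round-robin distribution is the usual alternative when records do not need to be grouped by key.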

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. Thus there will be just a couple of extra components in the graph which is intended to run in parallel. These components will handle partitioning/departitioning of data in case it’s not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for the scalability of transformations. Once a graph is designed for parallel run, the number of computers that run the transformation depends only on the cluster configuration. The graph itself stays the same. Configuration of the parallelism includes:

  • a working CloverETL Server cluster; standalone server instances won’t be able to handle such execution
  • a “partitioned” sandbox (see below) with a list of locations

New sandbox types
On the server side, graphs and related files are organized in so-called sandboxes. Until version 2.8, there was just one type: the “shared” sandbox. It contains the same files and directory structure on all cluster nodes. From version 2.9 on there will be two more types:

  • “local” sandbox – is (locally) accessible on just one cluster node. It’s intended for huge input/output data which is not meant to be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical locations contains just part of the data. It’s intended as storage for partitioned input/output data of transformations which are supposed to run in parallel. The list of physical locations actually specifies the nodes which will run the “worker” transformations.

Master – worker responsibilities
The master observes all related workers, and when a transformation phase has finished on all workers, it’s the master’s responsibility to allow the workers to process the next phase. When any of the workers fails for any reason, it’s the master’s responsibility to abort all the other workers and mark the whole execution as failed. The terms master and worker have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but that doesn’t mean that all the nodes are equal. There may be differences between nodes in their access to physical sources. The configuration of sandboxes should reflect this.
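
The two responsibilities of the master can be sketched with standard java.util.concurrent classes. This is an illustration under stated assumptions, not the Server's implementation: threads stand in for worker nodes, every worker contributes one task per phase, and runPhases is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MasterWorkerSketch {

    // phases.get(p) holds one task per worker for phase p.
    // Returns true when every phase finished on every worker.
    static boolean runPhases(List<List<Callable<Void>>> phases, int workerCount) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        try {
            for (List<Callable<Void>> phase : phases) {
                CompletionService<Void> done = new ExecutorCompletionService<>(pool);
                List<Future<Void>> running = new ArrayList<>();
                for (Callable<Void> task : phase) {
                    running.add(done.submit(task));
                }
                for (int i = 0; i < running.size(); i++) {
                    try {
                        done.take().get(); // wait for the next worker to finish
                    } catch (ExecutionException workerFailure) {
                        for (Future<Void> other : running) {
                            other.cancel(true); // abort the remaining workers
                        }
                        return false; // the whole execution is marked as failed
                    }
                }
                // all workers finished this phase, so the next phase may start
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The inner loop acts as the barrier between phases; a failure in any worker ends the run before the next phase is ever submitted.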

Designer-Server Integration: HTTPS made easy

In CloverETL Designer 2.8.0, connecting to CloverETL Server over the HTTPS protocol is supported. However, the client requires some configuration, including import of the client’s certificate to the server. Starting with CloverETL Designer 2.8.1, the situation is much simpler: HTTPS can be used without any additional client configuration.

The usage scenario is similar to using a web browser – if the Designer detects an unknown server certificate, it asks the user if the certificate should be accepted & imported. A server certificate can be imported either permanently or temporarily for one Designer session.

Connecting to CloverETL Server over HTTPS

In the above screenshot you can see an example of connecting to the CloverETL Server over HTTPS. The Designer detected an unknown certificate and asks the user whether the certificate should be accepted. The user can of course examine the certificate’s content before accepting or refusing it.

This simple HTTPS connection works when the application server running CloverETL Server does not require a certificate from its clients. When it does require client certificates, the Designer must be configured as before.

Designer-Server Integration Testing

CloverETL’s development team is preparing an amazing new feature: integration of CloverETL Designer with CloverETL Server. This feature makes working with Clover much more comfortable.

I was asked to take part in testing it, and I decided to share my impressions.

The main feature of the integration allows you to work with CloverETL graphs located on CloverETL Server in the same way as if they were located on your desktop machine. So no more copying of files from desktop to server and no more out-of-date files: all items are located only on the server, yet they are accessible and editable in Eclipse with CloverETL on your desktop, and transformation graphs are editable in graphical form.

All graphs are run on the server machine, but you don’t lose any of the advantages useful for developing and debugging: you can view debug data on an edge, view data on a reader without running the graph, see tracking information in the tracking view, etc. In addition, all graph runs are tracked on the server, so you can see all execution logs in the Executions History tab of the server administration interface.

After initial doubts I have realized that it works, and now I’m fascinated with it :-) . You can expect it, along with many other improvements, in version 2.8 of CloverETL Designer. So forget Informatica, forget DataStage, use CloverETL :-) .