CloverETL's Blog

November 4, 2009

New level of parallelism in CloverETL

Filed under: Developing Clover – mvarecha @ 12:28 pm

For the upcoming release of CloverETL 2.9, we are working on improvements to CloverETL Server that will allow transformations to run in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so multiple instances can cooperate with each other. The current stable version already implements the common cluster features: fail-over/high availability and scalability for large numbers of requests, which are load-balanced across the available cluster nodes. These features have been implemented since version 1.3.

The basic concept of new parallelism
A transformation may be automatically executed in parallel on several cluster nodes, according to the configuration, and each of these “worker” transformations processes just its part of the data. Because there is one “master” transformation, which manages the other transformations and gathers tracking data from the “worker” transformations, the parallelism is transparent to the CloverETL Server client. By default, the client “sees” just one (master) execution and aggregated tracking data. However, logs and tracking data are still kept for each “worker” transformation, so it’s still possible to inspect the details of the parallel execution. The outputs of the “worker” transformations are gathered at the “master”, so the client gets one single transformation output, which may be processed further.
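
To make the idea of aggregated tracking data more concrete, here is a minimal sketch in Python. It is only an illustration, not CloverETL Server code: the `WorkerTracking` class, its fields and the node names are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class WorkerTracking:
    """Tracking data reported by one "worker" (hypothetical structure)."""
    node_id: str              # hypothetical cluster node name
    records_processed: int
    failed: bool = False


def aggregate_tracking(workers):
    """Combine per-worker tracking into the single view a client would see."""
    return {
        "records_processed": sum(w.records_processed for w in workers),
        "failed": any(w.failed for w in workers),
        "workers": len(workers),
    }


workers = [
    WorkerTracking("node1", 1200),
    WorkerTracking("node2", 1150),
    WorkerTracking("node3", 1310),
]
summary = aggregate_tracking(workers)
print(summary)
```

The per-worker entries stay available in the `workers` list, which mirrors the point above: the client sees one summary, but the details of each worker can still be inspected.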

So how to get parts of input data?
Basically, there are two options. A transformation can process data that is already partitioned; this is the best case, because partitioning adds no overhead. Alternatively, CloverETL Server itself can partition input data from a single source and distribute it on the fly (during the transformation) to several cluster nodes over the network. The overhead of this operation depends on the speed of network communication and other conditions.
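
Which partitioning strategies the server will offer is not covered in this post, so the following Python sketch only illustrates two common ones: round-robin and partitioning by key. The function names, node names and sample records are assumptions for illustration, and in the real server the parts would travel over the network rather than into in-memory lists.

```python
import zlib


def partition_round_robin(records, nodes):
    """Deal records out to nodes one by one, giving evenly sized parts."""
    parts = {node: [] for node in nodes}
    for i, record in enumerate(records):
        parts[nodes[i % len(nodes)]].append(record)
    return parts


def partition_by_key(records, nodes, key):
    """Send all records with the same key to the same node.

    CRC32 is used instead of hash() so that the result is stable
    across processes and machines.
    """
    parts = {node: [] for node in nodes}
    for record in records:
        digest = zlib.crc32(str(key(record)).encode("utf-8"))
        parts[nodes[digest % len(nodes)]].append(record)
    return parts


nodes = ["node1", "node2"]  # hypothetical node names
rr = partition_round_robin(list(range(5)), nodes)

orders = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "alice", "amount": 7},
]
by_customer = partition_by_key(orders, nodes, key=lambda r: r["customer"])
```

Round-robin balances the load well, while partitioning by key matters when a worker has to see all records of one group, for example when aggregating per customer.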

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. A graph which is intended to run in parallel will therefore contain just a couple of extra components. These components will handle partitioning/departitioning of the data if it is not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for the scalability of transformations. Once a graph is designed for parallel run, the number of computers that run the transformation depends only on the cluster configuration. The graph itself stays the same. Configuration of the parallelism includes:

  • a working CloverETL Server cluster; standalone server instances won’t be able to handle such execution
  • a “partitioned” sandbox (see below) with a list of locations

New sandbox types
On the server side, graphs and related files are organized in so-called sandboxes. Until version 2.8, there was just one type: the “shared” sandbox, which contains the same files and directory structure on all cluster nodes. From version 2.9 there will be two more types:

  • “local” sandbox – accessible (locally) on just one cluster node. It’s intended for huge input/output data which should not be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical locations contains just part of the data. It’s intended as storage for the partitioned input/output data of transformations which are supposed to run in parallel. The list of physical locations actually specifies the nodes which will run “worker” transformations.

Master – worker responsibilities
The master observes all related workers, and when a transformation phase has finished on all workers, it’s the master’s responsibility to allow the workers to process the next phase. When any of the workers fails for any reason, it’s the master’s responsibility to abort all the other workers and mark the whole execution as failed. The terms master and worker have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but that doesn’t mean that all the nodes are equal. Nodes may differ in their access to physical sources, and the configuration of sandboxes should reflect that.
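
The phase handling described above can be sketched in a few lines of Python. This is a simplified, sequential simulation under stated assumptions: real workers run concurrently on different nodes, and the `run_phased` function, the worker callables and the result dictionary are invented for this illustration rather than taken from the server.

```python
def run_phased(workers, phases):
    """Simulate a master driving workers through phases.

    workers: dict mapping a node name to a callable(phase) that raises
             an exception when that worker fails in the given phase.
    """
    for phase in phases:
        for name, run_phase in workers.items():
            try:
                run_phase(phase)
            except Exception as exc:
                # One worker failed: abort the others, fail the whole run.
                aborted = sorted(n for n in workers if n != name)
                return {
                    "status": "FAILED",
                    "phase": phase,
                    "failed_worker": name,
                    "aborted": aborted,
                    "reason": str(exc),
                }
        # Every worker finished this phase, so the next one may start.
    return {"status": "FINISHED", "phases_completed": len(phases)}


def healthy(phase):
    return None


def broken_in_phase_1(phase):
    if phase == 1:
        raise RuntimeError("disk full")


ok = run_phased({"node1": healthy, "node2": healthy}, phases=[0, 1, 2])
bad = run_phased(
    {"node1": healthy, "node2": broken_in_phase_1, "node3": healthy},
    phases=[0, 1, 2],
)
```

The important property is that no worker enters the next phase before all workers have finished the current one, and that a single failure ends the whole execution.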

September 7, 2009

Designer-Server integration: HTTPS made easy

Filed under: Using CloverETL – Jaroslav Urban @ 11:32 am

In CloverETL Designer 2.8.0, connecting to CloverETL Server over the HTTPS protocol is supported. However, the client requires some configuration, including import of the client’s certificate to the server. Starting with CloverETL Designer 2.8.1, the situation is much simpler: HTTPS can be used without any additional client configuration.

The usage scenario is similar to using a web browser: if the Designer detects an unknown server certificate, it asks the user whether the certificate should be accepted and imported. A server certificate can be imported either permanently or temporarily for one Designer session.

Connecting to CloverETL Server over HTTPS

In the above screenshot you can see an example of connecting to the CloverETL Server over HTTPS. The Designer detected an unknown certificate and asks the user whether the certificate should be accepted. The user can of course examine the certificate’s content before accepting or refusing it.
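
The difference between accepting a certificate permanently and accepting it for one session can be illustrated with a small Python sketch. It is not how the Designer stores certificates internally; the `CertificateStore` class and its methods are assumptions for this example, and the certificate is represented simply by its raw bytes.

```python
import hashlib


class CertificateStore:
    """Remembers accepted server certificates by SHA-256 fingerprint."""

    def __init__(self):
        self.permanent = set()  # a real client would persist this to disk
        self.session = set()    # forgotten when the session ends

    @staticmethod
    def fingerprint(cert_bytes):
        return hashlib.sha256(cert_bytes).hexdigest()

    def is_trusted(self, cert_bytes):
        fp = self.fingerprint(cert_bytes)
        return fp in self.permanent or fp in self.session

    def accept(self, cert_bytes, permanently):
        target = self.permanent if permanently else self.session
        target.add(self.fingerprint(cert_bytes))

    def end_session(self):
        self.session.clear()


store = CertificateStore()
cert = b"example certificate bytes"  # placeholder, not a real certificate

store.accept(cert, permanently=False)
trusted_during_session = store.is_trusted(cert)
store.end_session()
trusted_after_session = store.is_trusted(cert)
```

With a temporary acceptance the user is asked again in the next session, while a permanent acceptance survives `end_session()`.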

This simple HTTPS connection works only if the application server running CloverETL Server does not require a certificate from its clients. If it does require client certificates, the Designer must be configured as before.

July 28, 2009

Designer-Server integration testing

Filed under: Using CloverETL – Petr Uher @ 4:21 am

CloverETL’s development team is preparing an amazing new feature: integration of CloverETL Designer with CloverETL Server. This feature makes working with Clover much more comfortable.

I was asked to take part in testing it, and I decided to share my impressions.

The main feature of the integration allows you to work with a CloverETL graph located on CloverETL Server in the same way as if it were located on your desktop machine. So no more copying of files from desktop to server and no more out-of-date files: all items are located only on the server, and they are accessible and editable in Eclipse with CloverETL on your desktop. Transformation graphs are editable in graphical form.

All graphs are run on the server machine, but you don’t lose any of the advantages useful for developing and debugging: you can view debug data on an edge, view data on a reader without running the graph, see tracking information in the tracking view, etc. In addition, all graph runs are tracked on the server, so you can see all execution logs in the Executions History tab of the server administration interface.

After initial doubts I have realized that it works, and now I’m fascinated by it :-) . You can expect it, along with many other improvements, in version 2.8 of CloverETL Designer. So forget Informatica, forget DataStage, use CloverETL :-) .
