Tips for Integrating CloverETL to Third Party Applications

I am sure you are often finding yourself in your work in a situation when you have  data so-called „in your hands“ and you would like to transform them to a proper form, clean up them or load them to a data destination (e.g. a database). But you do not want to reinvent the wheel again and again. It is the time when you start searching a suitable technology which can help you to avoid developing what has been already developed, tuned and is  stable. Most of the ETL tools often satisfy all these requirements and CloverETL is not an exception. In addition, CloverETL Engine provides a few tricky approaches how to process your data a little bit unconventional.

  1. Transformation graph does not have to be statically defined in a standard XML file with an .grf extension – if you need to generate a graph based on dynamic user settings, it is nothing easier than to create your transformation dynamically (details are described at http://wiki.cloveretl.com/doku.php?id=embedding_clover) in your Java application. If you have your graph already created (you have its instance), no matter if it was created as a result of an .grf file or you created it manually, then you can easily reuse it – run the graph repeatedly without a need to rebuild it again and again. This ability is very useful for transformations that run really short and need to be triggered a couple of times per second (for instance as a reaction on a user ‘click’ in your massive multi-user application).
  2. Next interesting extension of your development tools can be creating your own shell stream filter – a simple application which transforms standard input (stdin) to standard output (stdout) (for example sort, cat) – everything based on CloverETL. You can prepare a whole set of specialized command line applications which can be exploited in your complex shell scripts. It will be enough if your graph contains just one UniversalDataReader and just one UniversalDataWriter with a fileURL parameter set to the dash character ‘-’ which stands for standard input (stdin) or standard output (stdout) stream. I have prepared a simple example that filters out undesirable lines from standard input. A pretty straightforward usage:cat employees.txt | clover -D:metadata=“employees.fmt“ -D:filter=“$salary>10000“ filter.grf > beFired.txt
  3. In the first paragraph, I tackled a possibility to build transformation graphs directly from your Java code. In this approach, you will encounter nature limitation of most of ETL tools very soon – narrow set of data types processed by public set of components. Typically you have java.io.InputStream with the data ready for further processing. If we decide to use classical Clover instruments, we would have to store the data into a temporary file first and prepare a graph which reads the temporary file. We would have a similar issue with handling output data, a temporary file would have to be used again. The issue is obvious – how to avoid needless temporary files? CloverETL engine has a prepared solution for this purpose called Dictionary. It is a sort of shared memory which can be concurrently accessed by all components all the time a graph is running. The dictionary is typically used for passing a data object as an input into the graph and retrieving another data object as an output from the graph. We can look at Dictionary as a classical Map<key, value>. Key is a string identifier of a value which can take a whole set of different dictionary types – integer, boolean, decimal, … and last but not least ReadableChannel and WritableChannel. The two last mentioned dictionary entry types can be exploited for passing and retrieving data to and from a graph. See the following piece of code how dictionary can be populated://prepare channel with the data for ETL processing
    InputStream inputData = getInputDataStream();

    //prepare channel where the resuled data will be formatted
    OutputStream outputData = prepareOutputDataStream();

    //create graph instance based on grf file and initialize it
    TransformationGraph graph = TransformationGraphXMLReaderWriter.read(File);
    EngineInitializer.initGraph(graph);

    //initialize graph dictionary
    //our input channel will be registered under „inputStream“ key
    graph.getDictionary().setValue(„inputStream“, „ReadableChannel“, inputData);

    //our output channel will be registered under „outputStream“ key
    graph.getDictionary().setValue(„outputStream“, „WritableChannel“, outputData);

    //execute graph – output data will be pushed to output stream during graph run
    runGraph.executeGraph(graph);

    Now you probably ask yourself how the graph knows that input data are ready in dictionary under „inputStream“ key and on the other hand how it knows where to write the result output data. The answer is simple – fileURL attribute of UniversalDataReader/Writer has a specialized syntax for dictionary entries. Reader can have fileURL set to “dict:inputStream”. In case of Writer we need to setup fileURL attribute to “dict:outputStream”. That is all – the CloverETL engine takes care of data transmission between your data streams and the CloverETL graph automatically. Data prepared in input stream will be parsed by a dedicated data reader and will be passed as Clover data records for further processing to next components down the graph. And incoming data to a data writer will be formatted into your channel prepared in dictionary under „outputStream“ key.

    As it was already mentioned,  Dictionary can handle various data types. Beside already described data streams, it is possible to store all basic Clover data types – string, integer, long, number (Java equivalent double), decimal (Java equivalent BigDecimal), byte, and boolean. So Dictionary can be used for passing input values to a graph or also for inter-component communication – the first component writes some semi-result into Dictionary and the second component can pick up this value for further processing.

    Probably the most advanced way how to exploit Dictionary is possibility to define your own proprietary dictionary data types. Similarly to components, connections, CTL functions and so on, the dictionary entry types are also fully pluginable. So you can easily introduce your own type that corresponds to your needs. For example, you can extend CTL by your own function that allows you to access this data value from Clover and converts it to a CloverETL data record – the basic data element processed by CloverETL engine. It is certainly possible to create a new set of components that understand your specific data format. In the scope of  component run your custom data format can be retrieved from Dictionary, transformed into a standard CloverETL data record and passed to an output port for following processing. We have been using this approach successfully in several projects where the data format was totally incompatible with CloverETL records.

I hope this little bit technical insight into CloverETL engine inspires you for its usage in situations that seemed inappropriate till now.

Integration of CloverETL with PHP

To witness the power of Clover, consider the following scenario. Our customer required an application whose main purpose was generating reports in xls format. What would seem like a simple task was presented with difficult challenges:

  • Reports had  many different formats specified by many parameters.
  • Reports were generated from heterogeneous sources – xls files, database tables and IBM iSeries files (former AS/400).
  • Application users modified data (mainly added records) in a report before its final confirmation.
  • The confirmation of a report invoked updates in several database tables and generated log records.
  • Short response times were needed for the most of operations.

We knew right away we would approach this situation with a web application written in PHP because web application architecture brought us plenty of needed advantages. But how can we quickly and reliably integrate heterogeneous sources? This is exactly the type of task that ETL tools were developed for. That’s why we built the whole solution using CloverETL Server with the Launch Services (equivalent of web services) feature. The following schema should illustrate the architecture of our solution.

Users of the application communicate only with the web application layer (1) that uses its own database for storing temporary data, setup information, and a history of reported records. This part also manages access permissions based on user roles.

When a user wants to view some report, he or she sets the proper parameters and other restrictions (e.g. type of records to be reported or dates scope) in the web application using his/her preferred web browser (IE, Firefox …). Confirming the filled parameters invokes a call to a Clover Launch Service (2) and a transformation connected with this service. The transformation integrates data from heterogeneous sources and stores it into a database dedicated to the web application. All stored records are identified by runID (transformation execution’s unique identifier within Clover Server) and the service returns the same runID as a response from its call.

The web application allows users to modify only certain data (according to partial runID). Users can insert, change or delete records (3). These changes have an effect only within the database of the web application and play only a temporary role until the user confirms the report. The confirmation again invokes  a call to the Clover Launch Service that then generates the needed reports, logs records about processed changes, and propagates changes into a history of reported records (4).

To summarize this solution, we moved the entire application logic into a CloverETL Server which allows us to solve such problems with ease. Furthermore we took advantage of Clover Launch Services – a combination of ETL integration power and online processing.

In my opinion, there are many similar problems where the concept described above cuts the time and effort to reach a solution.

Working with CloverETL as a New User

I would like to share my experience with CloverETL as an external person. I study at Georgia Institute of Technology in Atlanta, GA with a major in Computer Science. On my search for interesting internships in Europe I found about Javlin a.s., a company based in Prague. As I wanted to get on hand experience with programming and software development I was immediately interested in this company. It sounded like an appealing opportunity where I could gain a lot of helpful knowledge and work experience.

In January I arrived in Prague and started working with CloverETL immediately when I began my internship with Javlin. When I arrived here, I had no clue exactly what CloverETL ‘did’, much less what date warehousing and extract-transform-load tools were. After the first week or so, I really figured out what ETL tools are, and especially, what CloverETL is. I played around with CloverETL, and was intrigued by what it can do. I really liked what it could – I don’t have any need for it now as a student, but I can see how incredibly powerful it can be for certain people or companies. I really like how simple it is to just connect components with a GUI, you can read a client XLS sheet, compare it with your Outlook address book (exported as a CSV file), removing duplicate records, collecting all data for each client into one record, and then making a new XLS sheet or CSV file for Outlook. Also, I like how the engine itself is open-source, enabling anyone to download and use it – but in my opinion, the Designer is much quicker and easier to use.

Oracle Uses CloverETL for Data Integration in its OIA Server

We are very happy to have learned that Oracle uses CloverETL for data manipulation in its OIA server. Oracle Identity Analytics (OIA) is a solution that provides enterprise role lifecycle management and identity compliance functionality. See Oracle whitepaper http://www.oracle.com/technology/products/id_mgmt/oia/pdf/oia_wp.pdf

Originally Sun Microsystems used CloverETL in its Sun Role Manager. It is a great pleasure to see that Oracle decided to continue using CloverETL for its software after aquiring Sun Microsystems.

Data Integration with Web Services – How to Set Up a WebServiceClient

WebServiceClient is a new CloverETL component that was released in 2.9 version. It provides an easy intuitive way to interact with web-service technology that has gained a huge popularity in the last few years. The following simple example shows the new component´s functionality. The WS component is user friendly and very easy to use.

Example for WS component usage:

Our goal is to get the actual weather information for specific ZIP codes from the web service that is running on http://ws.cdyne.com/WeatherWS/Weather.asmx?op=GetCityWeatherByZIP. The zip codes are stored in the flat file and results will be saved to Excel file.

I suppose that you are familiar with CloverETL so I will not describe the input and output setting up but will focus directly on WebService Client component.

Step 1 – WebService URL and operation

Firstly we have to specify the URL of web service WSDL file and select the operation you want to use from the list of all operation WS provides. WSDL URL is the basic attribute of the component and have to be set to http://ws.cdyne.com/WeatherWS/Weather.asmx?WSDL. The dialog for Operation name attribute allows you to select the operation you want to use. In our case it’s the operation GetCityWeatherByZIP from the group WeatherSoap (Port:WeatherSoap) as you can see on the figure bellow.

Step 2 – Generate request

The request is a way how to pass the arguments to webservice. It’s the XML file with pre-defined structure which depends on the used webservice and writting it by hand is very annoying. CloverETL provides a functionality that generates the required structure of request in one click. Simply open the dialog for Request structure attribute and click on „Generate request“ button. The skeleton of the request is genereted and we have to map the input port fields of the component to the request placeholders. In our case the only placeholder is string. By drag-and-drop we replace string by the field name from the right panel of the dialog.

Step 3 – Process response

Finally we define how to process the response from webservice. The response is XML file so we will create the mapping among elements and attributes of XML files and the output fields of the component. CloverETL makes it easy for us, it provides drag-and-drop dialog where we can select the item from „XML Fields“ part of dialog and drop them to „Clover Fields“.

And that’s all :-) . Communication with webservice is set-up. Now we can collect actual weather conditions from Internet.

Parallel Data Processing with CloverETL Cluster

For the upcoming release of CloverETL 2.9, we are working on improvements in CloverETL Server which will allow run transformations in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so more instances may cooperate to each other. Current stable version already implements common cluster features: fail-over/high-availability and scalability of lots of requests which are load-balanced on available cluster nodes. These features are actually implemented since version 1.3.

The basic concept of new parallelism
Transformation may be automatically executed in parallel on more cluster nodes according to configuration and each of these “worker” transformations processes just its part of data. Because there is one “master” transformation, which manages the other transformations and which gathers tracking data from “worker” transformations, the parallelism is transparent for CloverETL Server client. Client by default “sees” just one (master) execution and aggregated tracking data. However there are still logs and tracking data for each of “worker” transformations, so it’s still possible to inspect details of this parallel execution. “Worker” transformations outputs are gathered to the “master”, thus client has one single transformation output which may be processed further.

So how to get parts of input data?
Basically, transformation can process data which is already partitioned, which is the best case and there is no overhead with partitioning of data, or CloverETL Server itself can partition input data from one single source and distribute data on the fly (during the transformation) to several cluster nodes using the network connection. Overhead of this operation depends on the speed of network communication and other conditions.

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. Thus there will be just a couple of extra components in the graph which is intended to run in parallel. These components will handle partitioning/departitioning of data in case it’s not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for scalability of the transformations. Ever since the graph is designed for paraller run, the number of computers which run this transformation depends just on cluster configuration. Graph itself is still the same. Configuration of the parallelism includes:

  • working CloverETL Server cluster, thus standalone server instances won’t be able to handle such execution
  • “partitioned” sandbox(see below) with list of locations

New sandbox types
On server side, graphs and related files are organized in so-called sandboxes. Until version 2.8, there was just one type: “shared” sandbox. It means that it contains the same files and directory structure on all cluster nodes. Since version 2.9 there will be two more types:

  • “local” sandbox – is (locally) accessible on just one cluster node. It’s intended for huge input/output data which is not intended to be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical location contains just part of data. It’s intended as a storage for partitioned input/output data of transformations which are supposed to run in parallel. List of physical locations actually specifies nodes which will run “worker” transformations.

Master – worker responsibilities
Master observes all related workers and when some transformation phase is finished on all workers, it’s master’s responsibility to allow the workers to process next phase. When any of the workers fails from any reason, it’s master’s responsibility to abort all the other workers and select whole execution as failed. Master/worker – These terms have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but it doesn’t mean that all the nodes are equal. There may be differences between nodes in accessibility to physical sources. Configuration of sandboxes should reflect it.

Designer-Server Integration: HTTPS made easy

In CloverETL Designer 2.8.0, connecting to CloverETL Server over HTTPS protocol is supported. However, the client requires some configuration including import of client’s certificate to the server. Starting with CloverETL Designer 2.8.1, the situation is much simplified. The HTTPS can be used without any additional client configuration.

The usage scenario is similar to using a web browser – if the Designer detects an unknown server certificate, it asks the user if the certificate should be accepted & imported. A server certificate can be imported either permanently or temporarily for one Designer session.

Connecting to CloverETL Server over HTTPS

Connecting to CloverETL Server over HTTPS

In the above screenshot you can see an example of connecting to the CloverETL Server over HTTPS. The Designer detected an unknown certificate and asks the user whether the certificate should be accepted. The user can of course examine certificate’s content prior to accepting or refusing.

This simple HTTP connection work in case that the application server running CloverETL Server does not require a certificate from its clients. When it requires client certificates, then the Designer must be configured as previously.

Designer-Server Integration Testing

CloverETL’s development team is preparing a new amazing feature, integration of CloverETL Designer with CloverETL Server. This feature shifts work with Clover to a much more comfortable level.

I was asked to participate on testing of it. And I decided to share my impressions.

The main feature of integration allows you to work with CloverETL’s graph located on CloverETL Server in  the same way as if it would be located on your desktop machine. So no more copying of files from desktop to server, no more out-of-date files, all items are located only on server and accesible and editable in the Eclipse with CloverETL on your desktop, transformation graphs are editable in graphic format.

All graphs are run on server machine but you don’t lose any of advantages useful for developing and debugging, you can view debug data on edge, view data on reader without running of the graph, see tracking information in tracking view etc. In addition all runs of graphs are tracked on server so you can see all execution logs in the Executions History tab of server administration interface.

After initial doubts I have realized that it works and now I’m fascinated with it :-) . You can expect it with many other improvement in version 2.8 of CloverETL Designer. So forget Informatica, forget DataStage, use CloverETL :-) .

CloverETL version 2.7 released

OpenSys released today new version of CloverETL Engine – 2.7.

According to ChangeLog, there have been more then 300 changes – some of them new functionality, some fixes of reported problems.
Together with the new engine, CloverETL Designer (previously CloverETL GUI) version 2.2 is also available.

Details about changes of both CloverETL Engine and Designer can be found on CloverETL’s main site.