Tips for Integrating CloverETL to Third Party Applications

I am sure you are often finding yourself in your work in a situation when you have  data so-called „in your hands“ and you would like to transform them to a proper form, clean up them or load them to a data destination (e.g. a database). But you do not want to reinvent the wheel again and again. It is the time when you start searching a suitable technology which can help you to avoid developing what has been already developed, tuned and is  stable. Most of the ETL tools often satisfy all these requirements and CloverETL is not an exception. In addition, CloverETL Engine provides a few tricky approaches how to process your data a little bit unconventional.

  1. Transformation graph does not have to be statically defined in a standard XML file with an .grf extension – if you need to generate a graph based on dynamic user settings, it is nothing easier than to create your transformation dynamically (details are described at http://wiki.cloveretl.com/doku.php?id=embedding_clover) in your Java application. If you have your graph already created (you have its instance), no matter if it was created as a result of an .grf file or you created it manually, then you can easily reuse it – run the graph repeatedly without a need to rebuild it again and again. This ability is very useful for transformations that run really short and need to be triggered a couple of times per second (for instance as a reaction on a user ‘click’ in your massive multi-user application).
  2. Next interesting extension of your development tools can be creating your own shell stream filter – a simple application which transforms standard input (stdin) to standard output (stdout) (for example sort, cat) – everything based on CloverETL. You can prepare a whole set of specialized command line applications which can be exploited in your complex shell scripts. It will be enough if your graph contains just one UniversalDataReader and just one UniversalDataWriter with a fileURL parameter set to the dash character ‘-’ which stands for standard input (stdin) or standard output (stdout) stream. I have prepared a simple example that filters out undesirable lines from standard input. A pretty straightforward usage:cat employees.txt | clover -D:metadata=“employees.fmt“ -D:filter=“$salary>10000“ filter.grf > beFired.txt
  3. In the first paragraph, I tackled a possibility to build transformation graphs directly from your Java code. In this approach, you will encounter nature limitation of most of ETL tools very soon – narrow set of data types processed by public set of components. Typically you have java.io.InputStream with the data ready for further processing. If we decide to use classical Clover instruments, we would have to store the data into a temporary file first and prepare a graph which reads the temporary file. We would have a similar issue with handling output data, a temporary file would have to be used again. The issue is obvious – how to avoid needless temporary files? CloverETL engine has a prepared solution for this purpose called Dictionary. It is a sort of shared memory which can be concurrently accessed by all components all the time a graph is running. The dictionary is typically used for passing a data object as an input into the graph and retrieving another data object as an output from the graph. We can look at Dictionary as a classical Map<key, value>. Key is a string identifier of a value which can take a whole set of different dictionary types – integer, boolean, decimal, … and last but not least ReadableChannel and WritableChannel. The two last mentioned dictionary entry types can be exploited for passing and retrieving data to and from a graph. See the following piece of code how dictionary can be populated://prepare channel with the data for ETL processing
    InputStream inputData = getInputDataStream();

    //prepare channel where the resuled data will be formatted
    OutputStream outputData = prepareOutputDataStream();

    //create graph instance based on grf file and initialize it
    TransformationGraph graph = TransformationGraphXMLReaderWriter.read(File);
    EngineInitializer.initGraph(graph);

    //initialize graph dictionary
    //our input channel will be registered under „inputStream“ key
    graph.getDictionary().setValue(„inputStream“, „ReadableChannel“, inputData);

    //our output channel will be registered under „outputStream“ key
    graph.getDictionary().setValue(„outputStream“, „WritableChannel“, outputData);

    //execute graph – output data will be pushed to output stream during graph run
    runGraph.executeGraph(graph);

    Now you probably ask yourself how the graph knows that input data are ready in dictionary under „inputStream“ key and on the other hand how it knows where to write the result output data. The answer is simple – fileURL attribute of UniversalDataReader/Writer has a specialized syntax for dictionary entries. Reader can have fileURL set to “dict:inputStream”. In case of Writer we need to setup fileURL attribute to “dict:outputStream”. That is all – the CloverETL engine takes care of data transmission between your data streams and the CloverETL graph automatically. Data prepared in input stream will be parsed by a dedicated data reader and will be passed as Clover data records for further processing to next components down the graph. And incoming data to a data writer will be formatted into your channel prepared in dictionary under „outputStream“ key.

    As it was already mentioned,  Dictionary can handle various data types. Beside already described data streams, it is possible to store all basic Clover data types – string, integer, long, number (Java equivalent double), decimal (Java equivalent BigDecimal), byte, and boolean. So Dictionary can be used for passing input values to a graph or also for inter-component communication – the first component writes some semi-result into Dictionary and the second component can pick up this value for further processing.

    Probably the most advanced way how to exploit Dictionary is possibility to define your own proprietary dictionary data types. Similarly to components, connections, CTL functions and so on, the dictionary entry types are also fully pluginable. So you can easily introduce your own type that corresponds to your needs. For example, you can extend CTL by your own function that allows you to access this data value from Clover and converts it to a CloverETL data record – the basic data element processed by CloverETL engine. It is certainly possible to create a new set of components that understand your specific data format. In the scope of  component run your custom data format can be retrieved from Dictionary, transformed into a standard CloverETL data record and passed to an output port for following processing. We have been using this approach successfully in several projects where the data format was totally incompatible with CloverETL records.

I hope this little bit technical insight into CloverETL engine inspires you for its usage in situations that seemed inappropriate till now.

10 thoughts on “Tips for Integrating CloverETL to Third Party Applications

  1. Is there a way to “save” you transformationGraph as XML again?

    TransformationGraphXMLReaderWriter.read(File) reads it in,

    but

    TransformationGraphXMLReaderWriter.write?

    Thanks
    Des

  2. Unfortunately it is not possible now. Most of the components are able to serialize back to xml form (see toXML() methods), however this functionality on transformation graph is not maintained and few core graph elements does not support this feature at all. We are hard working on Clover 4 (new generation) where should be new graph model layer available and this functionality should be integral part of new engine. For now you have to work/manipulate with xml form of graph directly.

    Martin Zatopek

    • Dictionary is the most suitable way how to pass whatsoever to a graph. However, you need a component or a way how to process this input. For example if you pass a HashTable to a graph, you need a custom component or custom java transformation for Reformat component, which picks up the hash table and does something with it. What exactly do you want to do?

        • Hi Bijesh,

          I see two options how to achieve what you request.

          1) First you can create you own custom component, which has only one attribute ‘dictionaryEntryName’, where user can specify name of dictionary entry with your prepared data bean. This component would be really primitive and good template for that would be DataGenerator component. Component wouldn’t have any input ports, just one output port, where the bean data will be sent. This component is then very easy to use whenever in your graphs. However, maybe you don’t want to create your own component. In that case try the second option…

          2) Create simple graph {DataGenerator}->{Normalizer}->{XLSWriter}
          DataGenerator component sends just one dummy record. This record triggers your Normalizer java transformation. Generally, normalizer component can generate many records based on one input records (one to many conversion). The trigger record just starts your code used in normalizer which could look like the following code snippet:

          public class BeanNormalizer extends DataRecordNormalize {

          private MyBean myBean;

          @Override
          public int count(DataRecord source) throws TransformException {
          //find your bean in dictionary
          myBean = (MyBean) getGraph().getDictionary().getEntry("entryName").getValue();

          return getNumberOfRecords();
          }

          @Override
          public int transform(DataRecord source, DataRecord target, int idx) throws TransformException {
          //variable 'source' is the dummy records which is ignored now
          populateRecord(target, idx);
          return 0;
          }

          private int getNumberOfRecords() {
          //returns number of data records stored in the given bean
          }

          private void populateRecord(DataRecord record, int idx) {
          //populate given records with data stored in the given bean
          }

          }

          This is probably more straightforward way how reach your goal.

          Martin

    • ‘runGraph’ is name of class, where we have our main method of clover engine. Its name is little bit confusing due small first letter, but it is only class in whole project which is named like this :)

      So following line is invocation of a static method on class ‘runGraph’:

      runGraph.executeGraph(graph);

      Martin

Leave a Reply

Your email address will not be published. Required fields are marked *

*


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>