Tag Archives: CloverETL

Java Vulnerabilities: No Impact on CloverETL Products

The recent discoveries of Java’s vulnerabilities have caused concerns for many organizations and individuals alike. With recent questions from multiple customers, we’d like to reassure you that our products are not impacted by these vulnerabilities.

The security holes are related to a Java plug-in on the browser where it can be used by hackers to overtake or silently install malware on the visitor’s computer (more details can be found in this article: http://krebsonsecurity.com/2013/01/what-you-need-to-know-about-the-java-exploit/).

We do not use Java in the frontend (browser plug-in) in any of our products; even for our Server application, the Server Console UI is built on top of Rich Faces – which only uses HTML and basic Javascript in the browser.

However, we recommend that you update your Java package for your own protection in other applications. Please also consider upgrading to our latest stable build: CloverETL version 3.3.1 – http://www.cloveretl.com/resources/releases/3-3-1; the Designer installer package for Windows comes with an updated Java bundle that contains the latest patches.

New Release Model Offers Clover Users The Choice

As you may have noticed, the latest CloverETL release is labeled as a milestone release. What does this mean exactly? Well, this release marks the beginning of our new release model that will fulfill two separate, but important requirements that our customers have been asking for: the need for sustained stability in operations as well as the availability of new features as they come out.

Milestone Releases

As a major provider of OEM data integration for many BI or MDM solution suppliers, milestone releases addresses the desire for more frequent release cycles– to not only match the suppliers’ product development, but also to provide new features on an ongoing basis. Milestone releases are meant for these kinds of users and customers. Each milestone release, occurring every two months or so, will focus on introducing major new features and improvements, along with bug fixes. Milestone releases are carefully QA’d and ready for implementation in our user community.

Production Releases

The production release should be selected by those who seek high stability. Production releases happen usually twice a year and contain all features and fixes previously released in milestone releases;  these releases endure a rigorous QA process and testing by the community.

With this new release model, CloverETL users and customers have the freedom to choose whether to maintain operations and stronger stability on the production track or to implement the newest features into their flow with the milestone release.

Big Time with Big Records

Those of you who have ever tried to process big records with CloverETL already learned that it required some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that has optimized memory usage when running graphs with big records.

Let’s take a look inside to see what this is all about…

Pipeline Approach

Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point in inter-component communication. They have to ensure a fast transfer of records from one component to another. Our approach for edge data transfer has always been based on records serialization into byte stream on the starting end of an edge and deserialization back to record form on the other end. This ensures a basic invariant for all of our components - no record instance sharing. Each component has its own instance of a record populated from data on the input edge. It is then processed by the component and serialized into an output edge. This simple idea delivers excellent performance gains. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)

This imposes a painful decision to make on the edge itself – the capacity of the buffer that stores the bytes as they are passed from one end to the other. Obviously, that buffer must have enough room to hold the biggest record which passes through it. Those familiar with the CloverETL engine already know where I am going – the Records.MAX_RECORD_SIZE parameter.

In versions prior to 3.2, we used standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were set to accommodate at least MAX_RECORD_SIZE bytes worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XMLs, unstructured data and other modern marvels is still far from being enough.) Yet, increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes in the Server where the default was applied, regardless of whether or not the graph processed big records.

Introducing CloverBuffer

Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for an edge and component buffer is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for serialized byte form of records – a CloverBuffer. It acts as a full replacement of a ByteBuffer, but what sets it apart is the ability to grow. CloverBuffer starts small but can transparently grow up to a predefined maximum limit (newly introduced RECORD_INITIAL_SIZE and RECORD_LIMIT_SIZE) without needing any programmer intervention.

So although there still is one global setting for all, it just sets boundaries that cannot be crossed. But anything in between those limits is allocated automatically to ensure the smallest memory footprint of each transformation run based on its needs in real time, not estimated ones. Graphs combining the processing of big records and small records, e.g. main stream of data combined with some logging branch, utilize only as much memory per edge/component as the size of the data passing through them.

Programmers Only

All CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations can run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible so even without changing your code, it still complies with the new release.

For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);

This now should be substituted with following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);

The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants was introduced:
Record.RECORD_INITIAL_SIZE – initial buffer size, for now 64KB and will be probably decreased in upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimalize initial memory allocation for regular graphs.

The second constant ‘Record.RECORD_LIMIT_SIZE’ is actually one-to-one replacement for MAX_RECORD_SIZE (keeping MAX_RECORD_SIZE backward compatible for the sake of unmodified components), which sets the maximum upper bound per one CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or increasing this upper bound affects memory consumption only in the cases where there is a real need for such big buffer – otherwise the buffers are kept at RECORD_INITIAL_SIZE and are grown gradually towards the upper limit.

As you can see, the CloverBuffer now makes it possible to process bigger records with less memory footprint, since only buffers for edges or components that actually manipulate big records grow, while the others still remain small.

CloverETL Visions for 2012: Evolution and Revolution in Data Integration

Part one – celebrating 10 years

In 2012, CloverETL will celebrate its 10th anniversary as an open source project. It all started back in 2002. On October 3rd, 2002, version 0.1 was first announced on the Freshmeat (now Freecode) portal. That day, CloverETL’s official life began.

I don’t want to look into Clover’s history too much, though. I do, however, want to take this time to make a few comments about the principles on which CloverETL was established and how these principles continue to determine its future.

Principle number 1: Elegant and robust architecture guarantees a stable foundation

CloverETL started more as a framework on which other projects could be based, rather than as an end-user product with a “sexy” GUI. As a matter of fact, the real GUI was built in 2005, almost three years after the release of first CloverETL engine, which is now present in every tool of the CloverETL family – the Designer, the Server and also CloverETL Profiler.

Even though we are now on version 3.2, there has, so far, only been one change which significantly broke backward compatibility: when we switched from Java 1.4 to Java 1.5 and changed some key interface definitions.

This particular principle is what gives a certain peace of mind to the projects and software products embedding or otherwise deploying Clover, as they know there won’t be any sudden surprises with future versions. It also proves that the original architecture was robust and flexible enough at the outset to support all the later additions and improvements.

Principle number 2: Less is better

CloverETL is based on idea of cooperating components, each specialized with one certain functionality only. However each component is flexible enough to support various “outer” conditions in which the component works.

For example, our UniversalDataReader is meant for parsing text data. The data can come in variations like fixed-length, delimited, or combined; can be read locally or from remote locations; and can be available in plain form or compressed. All these variations are supported, which means that subtle changes, like data becoming available through a different protocol or perhaps being suddenly compressed, require only slight reconfiguration of our DataReader. Contrast this with other players, whose hundreds of different components require architecture changes in transformation (replacement of one component with other) when small shift in input data happens (e.g. due to moving from DEV to PROD environment) and you’ll notice the difference.

It also means that a programmer or analyst designing data transformations in Clover does not need to carry a dictionary of components; a short list covers all possible scenarios.

Principle number 3: Agility is sexy, but long term planning is wise

CloverETL is used in many applications by many customers. Some of them are large, global corporations that embed Clover in their products. Through our OEM program, we work with many customers with a very agile approach to the development of their applications. Some of them have release cycle as short as two weeks where they must  not only develop & debug, but also release new features. Clover’s development team tries to keep up with this sprint, but we still take our time to plan, architect, and develop new, fundamental features to extend CloverETL’s capabilities and help our customers do their jobs faster and simpler.

The reason we insist on thinking through every new feature request, beyond simple tweaks, is that sometimes relatively small and quick change may break compatibility somewhere or prevent future extensions. Whenever our development team touches the core (engine) we make sure the change is properly evaluated from several points of view, including:

  • Backward compatibility – at least at transformation graph level.
  • Performance – Slowdown of just a few percent on big data can mean extra kW of energy consumed by data crunching servers.
  • Future extensibility – We hate deprecating APIs or components just because we might not be able to continue enhancing and improving them.

This principle is further supported by the fact that CloverETL continues to be developed by the same, stable development team year in and out. Many team members have been around since 2005, when the commercial life of Clover began.

Part two – What will appear on the menu in 2012

In short, there will be evolution and, in certain areas, some revolution. We are always sorting out the dilemma of whether to break from the “past” and come up with something completely new and revolutionary – at least in our minds – or continue to improve the old-faithful engine architecture laid out years ago.

As we weren’t able to choose one or the other, we decided to continue improving what works well (and should continue to, even in future) and overhaul some things that have had occasional hiccups with modern data structures and formats brought to us by the CLOUD.

Evolution

Expanding CloverETL OEM program

As CloverETL attracts new OEM customers, we continue improving our OEM program by making it simpler to embed, modify, white-label, or otherwise enhance our technology stack. This includes better documentation, example projects, and extended training.
We are also investing in our support team, which has always strived to provide timely and accurate answers to all support requests submitted through various channels, from e-mail to the technology forum and hotline.

Our support staff is comprised of experienced consultants and programmers who have real-life experience with our technology—they aren’t just people a few manual pages ahead of a user seeking an answer.

GUI – continuous improvement of the user experience

We will continue our effort to make the Designer more and more user-friendly. Our motto is: CloverETL is built by professionals for professionals and, truly, professional DI experts or Java programmers usually give us high marks. Nonetheless, we want to make our technology accessible to the broadest possible audience seeking solutions to certain data needs.

Enhancing CloverETL Cluster – our BigData recipe

These days, BigData is usually mentioned together with Hadoop as the solution. As much as we like Hadoop for various reasons, we have our own recipe for processing BigData, and we think it’s better suited for classical data integration/ETL tasks. It is based on a split/transform/merge idea, where big input data are partitioned and then processed in parallel on multiple nodes of the CloverETL Cluster. The advantage of this, as opposed to Hadoop, is that the transformation may be developed & debugged locally, then easily deployed onto CloverETL Cluster for fast execution. Even if executed in a cluster environment, all the debugging and monitoring options of our Designer are available. It is also worth mentioning that deploying CloverETL Cluster is much easier than setting up the Hadoop cluster.

Our big enhancement of CloverETL Cluster in 2012 will be the merging of our technology with Hadoop – more precisely HDFS filesystem – which should combine the best from both worlds. HDFS provides some cool features, namely robustness and high performance, and we want to utilize its automated data partitioning to make it easier to grow (or shrink) the storage of data depending on actual needs.

Revolution

Rich data structures – trees, unstructured data, etc.

It has to come with age, but I can’t resist and must admire those who devised Cobol and CopyBook. In those times, every byte of storage counted and CPUs were slow, yet programmers were still able to process rich data structures. Then relational databases came and brought the idea of tables and normal forms. Well, today, we are back to rich structures, but this time, we’ve stopped counting bytes or CPU cycles (which has a huge impact on power consumption of servers, but that’s a different story.) That is why XML, JSON, or other rich structures are becoming the norm today.

In order to support these structures and formats as first class passengers, we decided to overhaul our metadata and record storage model and allow direct support of tree structures, multi-values of fields, and even loosely typed data organized in maps/properties collections.

This independently constitutes as a big adventure, as every single piece of our technology platform will be affected, and thus will have to be adapted. The effort will be huge, and necessary regression testing of the whole platform will be endless. Despite this, the prize is enticing: almost any type of data (and the cloud will be bonanza for this) will be 1:1 representable by Clover. That will include XML, JSON, POJO, and complex properties – and, in the future, who knows what else!

—–

We have always claimed that CloverETL is future-proof. Therefore, in 2012, we will be improving our foundations so they withstand the next 10 years.

If what I’ve talked about above is of interest to you, then please stay tuned. We will be publishing more details on our new functionality as we implement it.

For now, I wish everyone a very successful 2012!

A Look Back: CloverETL and Data Integration in 2011

As 2011 comes to a close, we’d like to take the time to reflect on what this year has brought CloverETL, its users, and our customers.

Since CloverETL is, after all, a data integration platform, the world of integration is at our core. We’re constantly striving to challenge ourselves in new ways and improve how we approach data integration. This year was no different.

Enhancing Our Core – Two Upgrades of CloverETL

In the past six months, we released two upgraded versions of CloverETL. CloverETL 3.1, published in June, brought significant changes to the platform in several areas. With a deeper focus on connectivity and enhanced support of various data formats, CloverETL 3.1 helped users better process data with complex structure, emails, and Lotus documents, to name a few. The latest version of CloverETL, version 3.2, offered further enhancements to the user experience, as well as improved the processing of large data records.

Data Integration Meets Data Quality – CloverETL Profiler

This year was also a year for new products. With Clover, we’ve moved forward with an evolved sense of the data world. Because data integration, data quality, and other data disciplines are becoming more and more intertwined, we developed the CloverETL Profiler, data profiling application. Released in beta back in October, the profiler helps users make informed decisions on how to improve the quality of transformed data, which is particularly useful as precursor to a greater data integration projects. CloverETL also integrates more easily with the AddressDoctor solution to improve the quality of geographical information.

Strengthening CloverETL Presence in the US Market

In June, Javlin, the developer of CloverETL, opened up its new office in the Washington D.C. area, which became the headquarters of Javlin Inc., our US presence. Javlin Inc., with both a dedicated sales and customer service force, brings Clover to a whole new market of possibilities.

Last but not least, we are pleased to see that our OEM data integration offer will have a number of important implementations in the upcoming year. (But more on that later. Stay tuned.)

As we leave 2011, we can say that this past year was a whirlwind of hard work, exciting releases, and interesting customers and stories. We’re looking forward to another great year with CloverETL. Cheers to the New Year.

Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2

In the last part of our data warehouse (DWH) tutorial, I showed you how to load a dimension table that stores historical data according to the Slowly Changing Dimension Type 1 (SCD1). In today’s post, I will focus on a Slowly Changing Dimension Type 2 (SCD2) dimension table. I think that SCD2 is the most challenging sub-task of ETL part of DWH design and each ETL architect should be able to deal with it.

In contrast to SCD1, SCD2 table stores preserves history of attributes. So once the value of attribute is changed in external system  (OLTP) we have to create a new record in SCD2 dimension table with the actual value but we also have to mark the old record in SCD2 table as obsolete. The most common way to obsolete the record is to maintain two additional attributes: valid_from and valid_to. Then the record is considered valid at particular date D when valid_from < D ≤ valid_to. You can find a detailed explanation of SCD2 principles in Kimball’s DWH bible or on wikipedia.org.

Let us show how SCD2 works in real in a small example. We will use DWH schema introduced in SCD1’s post.

It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price.

Store table is populated as SCD1 and we will load Customer table that was marked as SCD2 dimension table. Let’s imagine that Customer changed his email. What will happen in OLTP and DWH Customer table named D_CUSTOMER?

OLTP:

C0001;John;Newman;john.newman@hotmail.com => C0001;John;Newman;newman.john@gmail.com

DWH:

0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;null =>
0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;2010-05-20
0002;C0001;John;Newman;newman.john@gmail.com;2010-05-21;null

Notice especially the first two attributes (columns) and the last two attributes of DWH table. The first attribute is a surrogate key, it is a unique identifier of the record in D_CUSTOMER table. It is generated by ETL process.  The second one (C0001) is a natural key, a unique identifier of customer in OLTP. When you list all records of the same natural key in D_CUSTOMER you will get a complete history of one customer.

The principle how SCD2 works is explained now I will describe an implementation of SCD2 in CloverETL. See the CloverETL’s graph bellow.

The basic data-flow of the graph is very simple: in the lower branch we read the data from OLTP (for us as in previous DWH post it’s a CSV file), in the upper branch we read data that is already stored in our SCD2 table. We have to use the Dedup component, as we want only one actual record for each customer. The two branches intersect in DataIntersection component that processes the records according to the natural key. The component has three output ports, as there are three possible outcomes:

  1. The record exists only in DWH. This should not happen, it means that the record was deleted in OLTP. The “normal” OLTPs do not allow delete of records. That kind of records end in Trash component.
  2. In DWH table there exists at least one record with the same natural key as the record coming from OLTP. That record goes through the second output port to the component that identifies whether the record was changed (ExtFilter component). And then the record is copied to two records: the first one that obsoletes the current record in D_CUSTOMER (identified by surrogate key) and  the second one that is inserted to D_CUSTOMER and stores the new values read from OLTP. The first one set column valid_to = today()-1 and the second record is inserted with valid_from = today() and valid_to = null.
  3. The record coming from OLTP is a new one, there is no record with the same natural key in DWH. In that case the record is sent to the third output port and in following components is inserted to D_CUSTOMER table with valid_from = today() and valid_to = null.

If you want to verify that your CloverETL SCD2 graph works correctly or if you are looking for sample data, you can simply import example project to your Clover installation. It is embedded to your CloverETL Designer as a DWHExample project. For more information how to import example project see online documentation.

Loop Execution of Data Transformation

Case study description

Czech Insolvency Registry (http://isir.justice.cz) basically contains data about economic subjects that entered insolvency and have financial difficulties with paying off their debts. The registry allows everybody to download data using public SOAP Web Service. It can be done manually or automatically with the right software.
https://isir.justice.cz:8443/isir_ws/services/IsirPub001?wsdl

CloverETL can easily help with the automatically download that would save time and technical difficulties. CloverETL graph can get required data by calling the web service, processes data and store it in required format. Unfortunately the Registry’s web service is very poorly designed. The service doesn’t give you current status of each of the economic subjects, but provides the whole history of the required company. Therefore we have to download not only the current information we need but the whole information since the year 2008 (the registry foundation). That is a lot of data to process – actually thousands of log records for each company! Moreover the Registry’s Web service „GetIsirPub0012“ provides only maximum of 1000 records per one call. If one company has few thousands of records you have to undertake more calls.  So we have to download data in thousand-records bunches, but we don’t know in advance exactly how many of these bunches (records) there are for each company. That makes the whole process quiet difficult.

But solution with CloverETL is simple. CloverETL Server provides features “graph event listener” and “groovy task” that help us with all the above described challenges. Firstly, we will of course design a CloverETL graph that processes for the beginning just one thousand –record bunch of data (see picture bellow).

Transformation graph

This graph has a parameter „startID“ which has value “0” by default. If we want to process 1000 records starting let’s say from no. 2541, then  start ID will be startID=2541, and the first downloaded record will be identified by. If we run graph without parameters, it’ll download and process first thousand of records (no. 0 – 999).

Graph also contains couple of components to store ID of last downloaded record so that  the next bunch to download may use the last ID  as the startID. It will be automatically stored to graph parameters as lasted parameter. It can be done by in-line Java code in Reformat component:
String id = GetVal.getString(source[0],"id");
getGraph().getGraphProperties().setProperty("lastID", id );

The loop

The graph we designed must be executed n-times to download and process all records. At first we don’t know how many times, but we know, that we can stop the downloading process as soon as there are no more records to read. It means we can stop the process as soon as  “started” and “lasted” are equal.

How to achieve such loop?

Graph event listener

To achieve the automatic loop, for the graph that we designed and described previously, we’ll define graph event listener for “FINISHED_OK” graph event on CloverETL Server. So every time transformation finishes without error („FINISHED_OK“), listener will trigger task that we selected. We need to specify this tasks now. Since we want to  execute the same graph repeatedly, we have to specify “execute graph” task. This task will repeat executing the graph indefinitely. However we need to stop this loop at some point. We need to “break the loop” when the startID and the lastID parameters are equal. Therefore it is actually better to create “groovy task” instead of „execute graph“.

Groovy task

Groovy is scripting language with Java syntax. In addition, groovy scripts may access java objects and use java libraries. See Groovy project site http://groovy.codehaus.org/ for more details.

We’ll create a simple groovy script which decides whether execute the graph again or not. To decide it, we’ll need to get graph properties from the finished graph. These properties are accessible by calling method event.getProperties().

Then, we’ll need to execute graph using CloverETL Server Java API. It’s done by calling method serverFacade.executeGraph().

Script may return String value which is stored in „Task history log“.

// these variables are predefined:
// sessionToken
// event
// serverFacade

import com.cloveretl.server.persistent.RunRecord;
import org.apache.log4j.Logger;
import com.cloveretl.server.api.*;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

Logger log = Logger.getLogger("groovy-ISIR-graphEventListener");

Properties eventProps = event.getProperties();
log.info("event properties: " + eventProps);

// get lastID and startID from previous graph execution
String lastIDString = eventProps.getProperty("lastID");
String startIDString = eventProps.getProperty("startID");
long lastID = Long.valueOf(lastIDString);
long startID = Long.valueOf(startIDString);

// lastID and startID from last graph execution are equal – break the loop
if (lastID == startID)
return "no more records to download";

// prepare startID which will be passed to next graph execution
Properties properties = new Properties();
properties.setProperty("startID", lastIDString);

String SANDBOX = eventProps.getProperty("SANDBOX_CODE");
String GRAPH = eventProps.getProperty("GRAPH_FILE");
GraphExecutionCommand graphExecutionCommand = new GraphExecutionCommand(
null, SANDBOX, GRAPH, null, null, null, true, properties, null, null);
Response respExec = serverFacade.executeGraph(sessionToken, graphExecutionCommand);
String result = "graph "+SANDBOX+"/"+GRAPH+" executed: "+respExec.getBean();
log.info(result);
return result;

Graph results

All graph results for each bunch of data are stored to only one CSV file. They are always added, so don’t worry there is no danger that some of them will be overwritten :-) . So when the whole batch of transformations is finished, we have only one CSV file with all processed records. Or if somebody wishes we can consolidate records and store them directly into database where the data can be stored in more friendly and usable format.

Accessing Files with New File URL Dialog

The CloverETL Designer has a brand new File URL Dialog, which was introduced in the version 2.9. The newly designed file dialog is very friendly and intuitive to navigate. There are a lot of new features and improvements. The dialog is separated into several tabs to simplify navigation. They enable users to easily specify resources such as local files, remote files or shared memory (dictionary). The new dialog is more comfortable to use and has simplified clear design as you can see in the picture bellow. The dialog window adjusts itself according to the context.

Clover Server

In the new File dialog you can also find a new CloverETL Server tab specially designed to work with files located on CloverETL Server. It is only visible if you have opened the dialog from existing CloverETL Server project. It looks very similar to the tab you work with on your local computer but you can browse remote CloverETL sandboxes. All names of sandboxes for which you have permissions are in the bookmarks. So you can easily access them.

File URLs

This tab handles all types of URLs but it’s mainly designed to browse remote file system via http/https/ftp/ftps/sftp protocols. It also brings special dialog where you can specify advanced parameters of connection like proxy server, HTTP properties.

Port / Dictionary

The port and dictionary tabs are specific to CloverETL. The Port tab is visible only if the component or graph element allows reading/writing data from/to the port. Dictionary is a shared memory between parts of the graph. It is identified by name and processing type parameter. Both tabs help you to specify the URLs in a visual way so you don’t have to know the exact syntax of CloverETL’s URLs and your work will be easier and more productive.

Extensibility

Due to new modular dialog architecture, the dialog itself can be extended for specific tabs if needed.

Writing to Infobright Database

In response to customer requests we created a new component that writes data into Infobright software, a very popular column-oriented relational database. The CloverETL’s just released version 2.9 offers this new InfobrightDataWriter component.

Infobright is a highly compressed column-oriented database, based on MySql engine. In this database data are stored column-by-column instead of more typically row-by-row. There are many advantages of column-orientation, including the ability to do more efficient data compression and allowing compression to be optimized for each particular data type. The higher efficiency can be achieved because each column stores a single data type as opposed to rows that typically contain several data type. However the main purpose of Infobright software is to deliver a scalable data warehouse database optimized for analytic queries.

Our new InfobrightDataWriter component allows to easily load data into Infobright database directly from CloverETL transformations. It can be used both with Infobright Community Edition (ICE) as well as with Infobright Enterprise Edition (IEE). With IEE it is possible to use fast data loading in binary format. In binary format particular rows aren’t separated by any special character and there are no values’ delimiters or qualifiers either.

The connector uses named pipe for data transfer. In Linux systems it is system pipe, on Windows you need native library placed on the Java library path. The library is part of  infobright-core-vX_X package that can be downloaded from Infobright site (Download Contributed Software).

For remote load it is necessary to start Infobright remote load agent on the server where Infobright is running (part of infobright-core-v3).

Example graph:

 

InfobrightDataWriter configuration:

The above graph loads data from selected input fields (“Clover fields” attribute) to database table test (“Database table” attribute) on remote server. It uses fast binary method of data loading (“Data format” attribute). Data loaded into database can be sent to output port. Input metadata has free format but the output one has to meet certain conditions:

  • comma as a field delimiter
  • system new line ('n' for Linux, 'rn' for Windows) as a record delimiter
  • date/time formatting: yyyy-MM-dd HH:mm:ss (can be date only, time only or both)

As we load data to the remote server, the Infobright agent must run on the server. To start the agent we need to call:

java -jar infobright-core-3.0-remote.jar -p 6666 -l all

The port the agent is listening to must be the same as the agent port set on the component (“Remote agent port” attribute).

Full description of the component can be found on CloverETL’s wiki pages (InfobrightDataWriter component).

Iteration through the Record Fields in CTL

Note: this post relates to CloverETL 2.x version with CTL version 1. Starting with CloverETL 3.x , CTL version 2 was introduced which has a bit different syntax. However you can still use CTL 1 in transformations – see CTL1 vs CTL2 comparison.

Recently, I have been facing a very common problem. Imagine this scenario: I have two files – the first one with origin records and the second one with slightly modified new records. Each record had a unique key and aproximately 50 fields. My task was to compare these two files and find out how many fields in every pair differ from each other in the corresponding records.

The simplified graph can be seen in following picture:

A comparison of two records can be processed by the CloverETL DataIntersection component which joins the records with the same keys. In the joined records, you can compare fields that are not part of the key. But remember, you have to write the comparison in the CTL transformation ;-) . Of course you could write the following block of code for each pair of compared fields:

int count = 0;
if(nvl($0.field_N, '') != nvl($1.field_N, '')) {
count++;
//but typically more actions :-(
}
//and imagine this block 50times :-(
//final mapping
$0.key := $0.key;
$0.count := count;

But this solution takes too much time when you have to repeat it for many fields (approx. 50 in my case). It is also very slow, uncomfortable and increases the probability of making a mistake in your code (e.g. omitting some fields). Fortunately, CloverETL allows you to iterate through the fields of processed records! :-) The code is then more briefer and more generic:

//declaration of variables for copies of input records
record(Metadata1) myrec1;
record(Metadata1) myrec2;
function transform() {
int i = 0;
int count = 0;
//asign value of input records to local variables
myrec1 = @0; //myrec1 is a copy of a current record on input port 0
myrec2 = @1; //myrec2 is a copy of a current record on input port 1
//iterate through fields, suppose that field with index 0 is the key
for(i = 1; i < length(myrec1); i++) {
if(nvl(myrec1[i], '') != nvl(myrec2[i], '')) {
count++;
}
}
//final mapping
$0.key := $0.key;
$0.count := count;
}

Someone could object to the necessity of making copies of records. In this case I have good news. CTL in CloverETL version 2.9 introduces the possibility of iterating directly through the fields of input records.

@0[i];//i-th field of the input record on port 0

Moreover new functions for getting field names and data types are introduced in version 2.9. Personally, I am looking forward to such features that will make CTL code simplier and clearer.