When Two Become One: CloverETL OEM Embedded White Labeling

In our previous post, we reviewed the three approaches to CloverETL OEM. Here, I’ll discuss the more technical aspects of CloverETL OEM Embedded White Labeling.

OEM White Labeling – What it Takes

When partners need white labeling, they are mostly concerned with the seamless integration of the ETL piece into their application. They have to manage things like version control of their application (making sure it works with Clover), the look and feel of the GUI, and a sensible level of automation, for example. The Clover team works with the partner to architect this and can do so quickly.

Partners needing this option receive CloverETL Integration documentation to guide them through the process of achieving stable, effective white labeling. Both the CloverETL Designer and CloverETL Server can also be white labeled as part of a company strategy.

Which steps does white labeling consist of in Designer and Server?

CloverETL Designer

CloverETL Designer branding is based on showing the product name, splash screen, welcome screen, etcetera inside the Eclipse environment. Applications embedding or extending CloverETL Designer can use the same branding elements to visually customize the product. Features available via branding:

  • Product naming and description: These are visible in multiple places inside Eclipse (however, not all occurrences can be changed).
  • Default configuration: Eclipse configuration options can be changed according to users’ needs, e.g. whether a splash screen should be shown at start-up.
  • Splash screen: The initial image shown at start-up and the progress bar can be customized.
  • Welcome screen: An introduction screen shown to users on the application’s first run; it can contain useful links to documentation, examples etc.

CloverETL Server

CloverETL Server is mentioned in multiple messages, web GUI images, directory names, etc. By following the steps described in the integration documentation (available to OEM partners), you can remove all CloverETL occurrences. White labeling Server thus involves:

  • Replacing images, logos plus related work on graphics
  • Editing a couple of properties stored in server configuration files
  • Tuning the resulting application

Big Time with Big Records

Those of you who have ever tried to process big records with CloverETL have already learned that it required some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that optimizes memory usage when running graphs with big records.

Let’s take a look inside to see what this is all about…

Pipeline Approach

Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point in inter-component communication. They have to ensure a fast transfer of records from one component to another. Our approach to edge data transfer has always been based on serializing records into a byte stream at the starting end of an edge and deserializing them back into record form at the other end. This ensures a basic invariant for all of our components – no record instance sharing. Each component has its own instance of a record populated from data on the input edge. The record is then processed by the component and serialized into an output edge. This simple idea delivers excellent performance gains. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)
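To make the "no record instance sharing" invariant concrete, here is a minimal sketch in plain Java. SimpleRecord and EdgeSketch are hypothetical names made up for this illustration; they are not CloverETL classes, and the real engine's record and edge types are more elaborate.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type for illustration only; not a CloverETL class.
final class SimpleRecord {
    int id;
    String name;

    SimpleRecord(int id, String name) {
        this.id = id;
        this.name = name;
    }

    // Producing end of an edge: write the fields into the byte buffer.
    void serialize(ByteBuffer buffer) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        buffer.putInt(id);
        buffer.putInt(nameBytes.length);
        buffer.put(nameBytes);
    }

    // Consuming end of an edge: overwrite this instance's fields from the buffer.
    void deserialize(ByteBuffer buffer) {
        id = buffer.getInt();
        byte[] nameBytes = new byte[buffer.getInt()];
        buffer.get(nameBytes);
        name = new String(nameBytes, StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for an edge between two components.
final class EdgeSketch {
    // Moves a record across the simulated edge and returns the consumer's own instance.
    static SimpleRecord transfer(SimpleRecord producerRecord, ByteBuffer edgeBuffer) {
        edgeBuffer.clear();
        producerRecord.serialize(edgeBuffer);
        edgeBuffer.flip(); // switch the buffer from writing to reading
        SimpleRecord consumerRecord = new SimpleRecord(0, "");
        consumerRecord.deserialize(edgeBuffer);
        return consumerRecord;
    }
}
```

The consumer ends up with an equal but separate object, so a change made by one component can never leak into another. Note that the buffer has to be large enough for the serialized record, which is exactly the sizing question that comes next.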

This imposes a painful decision to make on the edge itself – the capacity of the buffer that stores the bytes as they are passed from one end to the other. Obviously, that buffer must have enough room to hold the biggest record that passes through it. Those familiar with the CloverETL engine already know where I am going – the Record.MAX_RECORD_SIZE parameter.

In versions prior to 3.2, we used a standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were set to accommodate at least MAX_RECORD_SIZE bytes' worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XMLs, unstructured data and other modern marvels, is still far from being enough). Yet increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by the number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes in the Server where the default was applied, regardless of whether or not the graph processed big records.
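A quick back-of-the-envelope calculation shows why this hurt. The buffer count and the multiple below are made-up numbers for illustration, not measurements from a real graph, and StaticBufferCost is a hypothetical helper.

```java
// Hypothetical helper; the inputs used with it are illustrative, not measured.
final class StaticBufferCost {
    // Bytes reserved when every buffer is sized up front to a multiple of the record size.
    static long totalBytes(int bufferCount, int multiple, long maxRecordSize) {
        return (long) bufferCount * multiple * maxRecordSize;
    }
}
```

Assuming a graph with 100 buffers, each sized at twice MAX_RECORD_SIZE, the 64 KB default reserves 13,107,200 bytes (12.5 MB). Raising the setting to 1 MB to fit a few large XML records reserves 209,715,200 bytes (200 MB), whether or not most edges ever see a big record.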

Introducing CloverBuffer

Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for an edge and component buffer is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for the serialized byte form of records – a CloverBuffer. It acts as a full replacement for a ByteBuffer, but what sets it apart is the ability to grow. CloverBuffer starts small (at the newly introduced RECORD_INITIAL_SIZE) but can transparently grow up to a predefined maximum limit (the newly introduced RECORD_LIMIT_SIZE) without needing any programmer intervention.

So although there is still one global setting for all, it just sets boundaries that cannot be crossed. Anything in between those limits is allocated automatically to ensure the smallest memory footprint for each transformation run, based on its actual needs at run time rather than estimated ones. Graphs combining the processing of big records and small records, e.g. a main stream of data combined with some logging branch, use only as much memory per edge/component as the size of the data passing through them.
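The idea of "start small, grow on demand, never cross the limit" can be sketched in a few lines of Java. GrowableBuffer below is a hypothetical class written for this post to show the principle; it is not the CloverBuffer implementation, and the doubling strategy is an assumption.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration of a growable buffer; not the real CloverBuffer.
final class GrowableBuffer {
    private ByteBuffer buffer;
    private final int limitSize;

    GrowableBuffer(int initialSize, int limitSize) {
        if (initialSize <= 0 || initialSize > limitSize) {
            throw new IllegalArgumentException("need 0 < initialSize <= limitSize");
        }
        this.buffer = ByteBuffer.allocate(initialSize);
        this.limitSize = limitSize;
    }

    // Appends data, growing the backing buffer first if it is too small.
    void put(byte[] data) {
        ensureRemaining(data.length);
        buffer.put(data);
    }

    int capacity() {
        return buffer.capacity();
    }

    int position() {
        return buffer.position();
    }

    private void ensureRemaining(int needed) {
        if (buffer.remaining() >= needed) {
            return; // enough room already, nothing to allocate
        }
        long required = (long) buffer.position() + needed;
        if (required > limitSize) {
            throw new BufferOverflowException(); // the hard upper bound was hit
        }
        long newCapacity = buffer.capacity();
        while (newCapacity < required) {
            newCapacity *= 2; // assumed growth strategy: double until it fits
        }
        newCapacity = Math.min(newCapacity, limitSize);
        ByteBuffer bigger = ByteBuffer.allocate((int) newCapacity);
        buffer.flip();      // prepare the old content for reading
        bigger.put(buffer); // copy the old content into the new buffer
        buffer = bigger;
    }
}
```

A buffer created with an initial size of 16 and a limit of 64 stays at 16 bytes for small writes, grows to 32 once 20 bytes have been written, and throws as soon as a write would push it past 64.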

Programmers Only

The entire CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations can run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible, so even without changes your code still works with the new release.

For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);

This should now be replaced with the following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);

The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants has been introduced.

The first constant, Record.RECORD_INITIAL_SIZE, sets the initial buffer size. For now it is 64 KB; it will probably be decreased in an upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimize initial memory allocation for regular graphs.

The second constant, Record.RECORD_LIMIT_SIZE, is a one-to-one replacement for MAX_RECORD_SIZE (MAX_RECORD_SIZE itself is kept for backward compatibility with unmodified components) and sets the upper bound for a single CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or raising this upper bound affects memory consumption only where there is a real need for such a big buffer – otherwise the buffers start at RECORD_INITIAL_SIZE and grow gradually towards the upper limit.

As you can see, the CloverBuffer now makes it possible to process bigger records with a smaller memory footprint, since only the buffers for edges or components that actually handle big records grow, while the others remain small.

CloverETL OEM Advantage: ETL As You Like It

Shakespeare didn’t know the first thing about OEM, or Original Equipment Manufacturers, but he did have quite a grasp on relationships. That being said, we’re here to put what Will preached back into the data world, channeling his penchant for the complexity of true relationships, with our OEM Program. Introducing, “ETL As You Like It: A True OEM Partnership.”

What is a CloverETL OEM Partnership?

Well, it really is a relationship, isn’t it? When CloverETL aligns with a company, we bring together not just our unique products, but also our tested dev teams, solutions, and ideas. It’s actually like a marriage—with joys and delights, compromise and adjustment. We’ve been active in OEM Partnerships for a while now and think it’s a great way for Clover to flourish in all sorts of challenges.

Today, CloverETL is at the core of many data service platforms as a vital data integration piece. As an OEM to many customers’ larger offerings – be it with IBM’s MDM offer, Good Data’s BI platform, or Mulesoft’s ESB service – Clover, with its flexibility, lends itself to match any OEM business strategy. It’s just a matter of deciding which OEM approach works best. CloverETL can work on a “partner basis” (side by side as a toolset), be embedded as a data integration piece, or even be embedded and white labeled.

Let’s take a look at what OEM programs are available for Clover today.

Partner Approach

Some companies use CloverETL alongside their offer to provide an integrated commercial package to their end-user clients. These clients can then access the Clover community through our forum to learn best practices for using CloverETL. Often, companies are replacing just the ETL bit because it is either hand-coded or just too cumbersome.

The data services companies that choose this option tend to need a lower volume of licenses. Another reason might also be to simply add a stronger ETL product to an application or service without the need to fully embed Clover. In a sense, the partner approach works like a “power couple” to run compelling applications that provide value to data clients.

Embedded OEM

Many businesses prefer the embedded OEM approach for CloverETL because they only have to manage one offering for their development teams. To embed means that as they make changes and improvements to their application, Clover and its team can adapt with a company’s growth in an evolving market. We offer this option in higher volume scenarios on a per unit basis, or even on an enterprise (one time fee) basis as the case allows.

For example, there are literally hundreds of Clover users for IBM’s Initiate MDM offer; the company, a partner since 2007, even produced a CloverETL user guide to help their clients learn and use Clover more effectively.

You can check out this IBM WorkBench document at: http://publib.boulder.ibm.com/infocenter/initiate/v9r5/topic/com.ibm.initiatepdfs.doc/topics/i46wecug.pdf

Embedded OEM – White Labeling

Lately, we are seeing a trend of customers asking for a white labeled ETL/data integration piece for their service offer. This is part of an interesting strategy where service providers simply want to present their offer with a focus on the application or the power of the company brand. IT and data professionals know that the ETL is there, cranking through and organizing data, but a white label strategy puts the ETL behind the scenes for a complete, branded look: Voila! The service just works, it seems. White labeling has the same licensing strategy as the embedded-only OEM approach.

In all, David Pavlis, Javlin’s President, put it best: “What’s great about Clover is that we want to work with you, and you can rely on us. We can adapt alongside your offering or become part of it, fully integrated. What we do know is that, in every case, we can support a lifetime partnership–adapting, improving, and aligning to where our customers and their clients are going.”

And there you have it, as you like us.

Interested in the technical nuances of OEM? Our OEM blog series will continue, detailing each approach reviewed today, so check back soon.

Performance Optimization of Metrics in CloverETL Profiler

The first beta version of CloverETL Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides fixing bugs and adding a few new features, we also worked on performance optimization of profiling metrics. This article describes this improvement and how profiling is interconnected with CloverETL Engine.

CloverETL Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table). At the end of reading, the metrics return their results, which are then stored in the results database.
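Simple metrics such as minimum and maximum fit this streaming model naturally, because they only need a constant amount of state no matter how many values pass by. The sketch below illustrates that idea; StreamingMinMax is a hypothetical class for this post, not part of the Profiler's code.

```java
// Hypothetical streaming metric; not taken from the CloverETL Profiler code base.
final class StreamingMinMax {
    private long count;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Called once per input value as it arrives from the source.
    void accept(long value) {
        count++;
        if (value < min) {
            min = value;
        }
        if (value > max) {
            max = value;
        }
    }

    long count() {
        return count;
    }

    // Only meaningful once at least one value has been accepted.
    long min() {
        return min;
    }

    long max() {
        return max;
    }
}
```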

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets, keeping everything in main memory is not feasible. Therefore, external storage needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets – larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort component, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as an SQL database) are used during profiling.
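To see why sorting helps, take the number of distinct values as an example (an assumed example; this post does not list which metrics require sorting). On unsorted data you would have to remember every value seen so far, but on sorted data equal values are adjacent, so remembering only the previous value is enough. SortedStreamMetrics is a hypothetical class for illustration.

```java
import java.util.Iterator;
import java.util.Objects;

// Hypothetical illustration; not taken from the CloverETL Profiler code base.
final class SortedStreamMetrics {
    // Counts distinct values in a sorted stream while keeping only the previous value in memory.
    static long countDistinct(Iterator<String> sortedValues) {
        long distinct = 0;
        boolean first = true;
        String previous = null;
        while (sortedValues.hasNext()) {
            String current = sortedValues.next();
            // In sorted input, a new distinct value shows up exactly when it differs from its predecessor.
            if (first || !Objects.equals(previous, current)) {
                distinct++;
            }
            previous = current;
            first = false;
        }
        return distinct;
    }
}
```

Calling countDistinct on the sorted sequence a, a, b, c, c, c returns 3, and the memory used does not depend on the length of the input.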

The overall performance of CloverETL Profiler has improved, especially for large data sets. Even with the full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom Java transform code to calculate the metrics. Rollup can produce a variable number of output records for any number of incoming records.

Another performance improvement in the second beta version of CloverETL Profiler also affects the metrics that do not require the input data to be sorted. The Profiler now makes better use of available CPUs and spends less CPU time on context switching. This results in a boost of up to 15% for data with a large number of fields. Also, the simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

CloverETL 2.9 Released: Infobright Data Writer, Web Services Component and Other New Features.

CloverETL version 2.9 has just been released. This version brings a new Infobright Data Writer component, enhances connectivity by adding a Web Services component, and adds features that simplify common data transformation tasks.

New Features and Components:
Infobright Data Writer
In response to customer requests, this component writes data into Infobright software, a column-oriented relational database. Infobright is a provider of solutions designed to deliver a scalable data warehouse optimized for analytic queries.

Web Services component
The new component makes communication with Web Services easier than ever. It provides a user-friendly graphical interface for mapping your data into Web Service fields, and it automatically generates requests and processes responses. It offers a faster, easier and more comfortable way to interact with remote Web Services.

Reading formatted values from XLS
In addition to reading plain data from Microsoft™ Excel™ sheets, the Excel component is now also capable of reading user-formatted values such as currencies, dates or numbers.

New tracking option
Customers can now see all absolute speed rates for finished data transformations, facilitating comparative analysis in pursuit of process improvements.

New Aspell Lookup table
A brand new implementation of this component brings better performance, improved configuration and better customization.

Improved treatment of empty (NULL) values
Developers can now specify special strings that should be treated as empty (NULL) when data is being parsed. This feature simplifies processing of typical application export files, which often contain values that are insignificant for ETL processing. Additionally, it may lead to improved processing throughput and lower memory consumption of data transformations.
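Conceptually, the feature behaves like the small sketch below. NullValueParser and its API are hypothetical and written only to illustrate the idea; in CloverETL itself the special strings are configured on the data being parsed rather than coded by hand, and the placeholder strings used here are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the concept; not the CloverETL parser API.
final class NullValueParser {
    private final Set<String> nullStrings;

    NullValueParser(Set<String> nullStrings) {
        // Defensive copy so later changes to the caller's set do not affect parsing.
        this.nullStrings = new HashSet<>(nullStrings);
    }

    // Returns null for any configured placeholder string, otherwise the raw value unchanged.
    String parseField(String raw) {
        if (raw == null || nullStrings.contains(raw)) {
            return null;
        }
        return raw;
    }
}
```

With placeholders such as "N/A" and "-" configured, an export file row like `42,N/A,-` yields one real value and two NULLs, with no extra clean-up step in the transformation.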

More user friendly File URL dialog and improved LDAP functionality.

Customers can evaluate these new features along with CloverETL’s other leading capabilities with a free 30-day trial of the CloverETL Designer Pro evaluation, which is available at www.cloveretl.com. Information management professionals can also evaluate the enterprise integration features of CloverETL Server via an online demo, which is also available at www.cloveretl.com.

Parallel Data Processing with CloverETL Cluster

For the upcoming release of CloverETL 2.9, we are working on improvements in CloverETL Server that will allow transformations to run in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so multiple instances can cooperate with each other. The current stable version already implements common cluster features: fail-over/high availability and scalability for large numbers of requests, which are load-balanced across the available cluster nodes. These features have been implemented since version 1.3.

The basic concept of new parallelism
A transformation may be automatically executed in parallel on multiple cluster nodes according to the configuration, and each of these “worker” transformations processes just its part of the data. Because there is one “master” transformation, which manages the other transformations and gathers tracking data from the “worker” transformations, the parallelism is transparent to the CloverETL Server client. By default, the client “sees” just one (master) execution and aggregated tracking data. However, there are still logs and tracking data for each of the “worker” transformations, so it’s still possible to inspect the details of the parallel execution. The outputs of the “worker” transformations are gathered to the “master”, so the client has one single transformation output which may be processed further.

So how to get parts of input data?
Basically, there are two options. A transformation can process data that is already partitioned, which is the best case because there is no partitioning overhead. Alternatively, CloverETL Server itself can partition input data from one single source and distribute it on the fly (during the transformation) to several cluster nodes over the network. The overhead of this operation depends on the speed of network communication and other conditions.
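The post does not say which partitioning strategy the server uses, so the following is only a generic sketch of one common option, hash partitioning, in which records with the same key always land on the same node. PartitionSketch and its methods are hypothetical names; records are represented by plain string keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hash partitioning; not CloverETL Server code.
final class PartitionSketch {
    // Picks a node index in the range [0, nodeCount) for the given key.
    static int nodeFor(String key, int nodeCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // Splits the records into one list per node.
    static List<List<String>> partition(List<String> records, int nodeCount) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            parts.add(new ArrayList<>());
        }
        for (String record : records) {
            parts.get(nodeFor(record, nodeCount)).add(record);
        }
        return parts;
    }
}
```

Every record ends up in exactly one part, so the parts can be processed independently and their outputs gathered afterwards. A round-robin scheme would balance the load more evenly but would not keep equal keys together.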

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. Thus there will be just a couple of extra components in a graph that is intended to run in parallel. These components will handle partitioning/departitioning of data in case it’s not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for the scalability of transformations. Once a graph is designed for a parallel run, the number of computers that run the transformation depends only on the cluster configuration. The graph itself stays the same. Configuration of the parallelism includes:

  • a working CloverETL Server cluster (standalone server instances won’t be able to handle such an execution)
  • a “partitioned” sandbox (see below) with a list of locations

New sandbox types
On the server side, graphs and related files are organized in so-called sandboxes. Up to version 2.8, there was just one type: the “shared” sandbox, which contains the same files and directory structure on all cluster nodes. From version 2.9 on, there will be two more types:

  • “local” sandbox – is (locally) accessible on just one cluster node. It’s intended for huge input/output data which should not be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical locations contains just part of the data. It’s intended as storage for partitioned input/output data of transformations which are supposed to run in parallel. The list of physical locations actually specifies the nodes which will run the “worker” transformations.

Master – worker responsibilities
The master observes all related workers, and when a transformation phase is finished on all workers, it’s the master’s responsibility to allow the workers to process the next phase. When any of the workers fails for any reason, it’s the master’s responsibility to abort all the other workers and mark the whole execution as failed. The terms master and worker have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but that doesn’t mean that all the nodes are equal. There may be differences between nodes in their access to physical sources, and the configuration of sandboxes should reflect that.
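The phase rule can be summed up in a few lines of Java. This is a deliberately simplified, single-threaded sketch: the real workers run in parallel on different nodes, and MasterSketch and its Worker interface are hypothetical names, not CloverETL Server classes.

```java
import java.util.List;

// Hypothetical, sequential sketch of the master's phase control; not CloverETL Server code.
final class MasterSketch {
    // A worker runs one phase and reports whether it succeeded.
    interface Worker {
        boolean runPhase(int phase);
    }

    // Returns true if every phase succeeded on every worker, false as soon as one worker fails.
    static boolean execute(List<Worker> workers, int phaseCount) {
        for (int phase = 0; phase < phaseCount; phase++) {
            // The next phase starts only after all workers have finished the current one.
            for (Worker worker : workers) {
                if (!worker.runPhase(phase)) {
                    // A single failure stops the remaining work and fails the whole execution.
                    return false;
                }
            }
        }
        return true;
    }
}
```

With two workers and two phases, both workers complete phase 0 before either starts phase 1. If any worker reports a failure, no further phases are started and the execution as a whole is reported as failed.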

Designer-Server Integration Testing

CloverETL’s development team is preparing an amazing new feature: integration of CloverETL Designer with CloverETL Server. This feature makes working with Clover much more comfortable.

I was asked to take part in testing it, and I decided to share my impressions.

The main feature of the integration allows you to work with a CloverETL graph located on CloverETL Server in the same way as if it were located on your desktop machine. So no more copying of files from desktop to server and no more out-of-date files: all items are located only on the server, yet they are accessible and editable in Eclipse with CloverETL on your desktop, and transformation graphs are editable in graphical form.

All graphs are run on the server machine, but you don’t lose any of the advantages useful for developing and debugging: you can view debug data on an edge, view data on a reader without running the graph, see tracking information in the tracking view, etc. In addition, all graph runs are tracked on the server, so you can see all execution logs in the Executions History tab of the server administration interface.

After initial doubts I have realized that it works, and now I’m fascinated with it :-) . You can expect it, along with many other improvements, in version 2.8 of CloverETL Designer. So forget Informatica, forget DataStage, use CloverETL :-) .

CloverETL version 2.7 released

OpenSys today released a new version of CloverETL Engine – 2.7.

According to the ChangeLog, there have been more than 300 changes – some of them new functionality, some fixes for reported problems.
Together with the new engine, CloverETL Designer (previously CloverETL GUI) version 2.2 is also available.

Details about changes of both CloverETL Engine and Designer can be found on CloverETL’s main site.

Upcoming 2.7 Release of CloverETL – Faster Sorting of Data and Improved Reading Data

As of today (Mar 31st), the Clover Engine 2.7 branch has been created and the testing/QA process has started. Within approximately two weeks, a brand new version of CloverETL will be ready. It brings many small new features and bug fixes, but also several significant improvements – mostly in speed.

The aging ExtSort component is being replaced by the new FastSort, which can deliver up to 2.5 times the performance of the old ExtSort. I am sure there will be a special post on this blog by FastSort’s developer, Pavel Najvar, who will explain in detail where he found those hidden 250% of speed.

There are also speed improvements in our Universal Data Reader (a reader of text data, delimited or fixed-length). We thoroughly profiled its code and were able to find 20-25% of additional speed. This puts us even farther ahead of the competition!