DataMotion: CloverETL’s Newest OEM Partner

CloverETL offers a choice. This has been part of our philosophy since day one. Flexibility. Scalability. Robustness. These principles are fundamental to us, and they also set us apart in the market. With these strengths at the forefront of our offering, our mission is simple: to craft a tailored approach for customers looking for that “just right” ETL/Data Integration relationship. This just makes sense to us. So with this firm belief, we continue to look for new ways to make great things happen for our customers.

Our OEM strategy is no different. As described in an earlier blog, CloverETL works as OEM in three ways: OEM Partner, OEM Embedded, and OEM Embedded White Label.

For us, the ability for customers to shape and style a unique OEM relationship with us really transforms the OEM conversation. When a partnership is involved, it’s a serious marriage—one in which we work closely together, count on each other, and move forward towards success. A strong OEM relationship comes not only when two product offerings sync well, but when the companies’ goals do too. This is what we strive for because after all, an OEM partner’s success is, in a sense, ours too.

The OEM Partner

In certain cases, a company may just need a little something extra. This is a great way for two companies to shine individually, together. Naturally, this is what we call the partner approach.

OEM Partner companies use CloverETL alongside their offer to provide an integrated package to their end-user clients. And this is great not only for our customers, but also for their clients, who gain access to the Clover community through both our resources and licensing of CloverETL. We support these indirect customers—users who come to CloverETL through our OEM. We welcome them; they are the in-laws, so to speak. Helping our partners means making sure that their clients get exactly what they need to move forward with projects too. Here is where our flexible architecture is a plus: our dev team can work on our ETL offering, while their team can offer the right support to their users who are implementing it. This partnership means we work together to bring the right solutions to the right customers.

CloverETL and DataMotion

Let’s take a look at a real life partner of ours—a really exciting and interesting case. DataMotion, a company that offers software, consulting, development and implementation services of CRM, Direct Marketing and Data Quality solutions in Brazil, fits under this partner umbrella well.

Before they found Javlin, “DataMotion was a frustrated user of SQLServer SSIS. So, we were searching for a new, powerful and accessible ETL solution,” said Ricardo Rego, Managing Partner of DataMotion.

He continues, “In the middle of last year, we started to look at the market for a new data integration flagship technology. We were performing due diligence in readiness for start-up. Part of this planning phase included selecting a software vendor that could support the delivery of professional service engagements and help the business grow using a common integration platform. Our decision went to Javlin and CloverETL.”

With them, CloverETL works as a critical software component together with their growing suite of data solutions. As business in Brazil propels forward, especially in the data-heavy verticals of healthcare, retail, and finance, companies are looking to harness the business value from their data assets. For this, they turn to DataMotion with CloverETL for support. “With CloverETL, DataMotion will be able to provide a solid rock data management and integration services to all SME in Brazil and Latin America. This is a special momentum where the Brazilian economy is growing very fast and the enterprise commitment combined with the quality of the information is critical and mandatory to the success of businesses,” said Rego.

As we have seen working with a diverse mix of partner relationships, this approach is definitely something special: it works like a “power couple” to provide real value to data clients.

New Release Model Offers Clover Users The Choice

As you may have noticed, the latest CloverETL release is labeled as a milestone release. What does this mean exactly? Well, this release marks the beginning of our new release model that will fulfill two separate, but important requirements that our customers have been asking for: the need for sustained stability in operations as well as the availability of new features as they come out.

Milestone Releases

As a major provider of OEM data integration for many BI or MDM solution suppliers, we introduced milestone releases to address the desire for more frequent release cycles, not only to match the suppliers’ product development, but also to provide new features on an ongoing basis. Milestone releases are meant for these kinds of users and customers. Each milestone release, occurring every two months or so, will focus on introducing major new features and improvements, along with bug fixes. Milestone releases are carefully QA’d and ready for implementation in our user community.

Production Releases

The production release should be selected by those who seek high stability. Production releases usually happen twice a year and contain all features and fixes previously released in milestone releases; these releases undergo a rigorous QA process and testing by the community.

With this new release model, CloverETL users and customers have the freedom to choose whether to maintain operations and stronger stability on the production track or to implement the newest features into their flow with the milestone release.

When Two Become One: CloverETL OEM Embedded White Labeling

In our previous post, we reviewed the three approaches to CloverETL OEM. Here, I’ll discuss the more technical aspects of CloverETL OEM Embedded White Labeling.

OEM White Labeling – What it Takes

When partners need white labeling, they are mostly considering the seamless integration of the ETL piece with their application. They have to manage things like version control of their application (making sure it works with Clover), the look and feel of the GUI, and a good sense of automation, for example. The Clover team works with the partner to architect this and can do so quickly.

Partners needing this option receive CloverETL Integration documentation to guide them through the process of achieving stable, effective white labeling. Both the CloverETL Designer and CloverETL Server can also be white labeled as part of a company strategy.

Which steps does white labeling consist of in Designer and Server?

CloverETL Designer

CloverETL Designer branding is based on showing the product name, splash screen, welcome screen, etcetera inside the Eclipse environment. Applications embedding or extending CloverETL Designer can use the same branding elements to visually customize the product. Features available via branding:

  • Product naming and description: These are visible in multiple places inside Eclipse (however, not all occurrences can be changed.)
  • Default configuration: Eclipse configuration options can be changed according to users’ needs, e.g. whether a splash screen should be shown at start-up.
  • Splash screen: The initial image shown at start-up and the progress bar can be customized.
  • Welcome screen: An introduction screen shown to users on the application’s first run; it can contain useful links to documentation, examples etc.

CloverETL Server

CloverETL Server is mentioned in multiple messages, web GUI images, directory names, etc. Following steps described in the integration documentation (available to OEM partners), you can get rid of all CloverETL occurrences. White labeling Server thus involves:

  • Replacing images, logos plus related work on graphics
  • Editing a couple of properties stored in server configuration files
  • Tuning the resulting application

Big Time with Big Records

Those of you who have ever tried to process big records with CloverETL have already learned that it required some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that optimizes memory usage when running graphs with big records.

Let’s take a look inside to see what this is all about…

Pipeline Approach

Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point in inter-component communication. They have to ensure a fast transfer of records from one component to another. Our approach to edge data transfer has always been based on serializing records into a byte stream on the starting end of an edge and deserializing them back to record form on the other end. This ensures a basic invariant for all of our components: no record instance sharing. Each component has its own instance of a record, populated from data on the input edge. The record is then processed by the component and serialized into an output edge. This simple idea delivers excellent performance. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)
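
The hand-off described above can be sketched in a few lines of Java. This is only an illustration of the idea: the Rec and EdgeSketch classes are hypothetical stand-ins, not CloverETL classes, and a real edge also deals with threading, many field types and buffer reuse.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical two-field record; not a CloverETL class.
final class Rec {
    int id;
    String name;

    Rec(int id, String name) { this.id = id; this.name = name; }

    // Writer side of the edge: turn the record into bytes.
    void serialize(ByteBuffer buf) {
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        buf.putInt(id);
        buf.putInt(bytes.length);
        buf.put(bytes);
    }

    // Reader side of the edge: populate this component's own instance.
    void deserialize(ByteBuffer buf) {
        id = buf.getInt();
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        name = new String(bytes, StandardCharsets.UTF_8);
    }
}

final class EdgeSketch {
    // Passes a record across a buffer and returns the reader's private copy.
    static Rec transfer(Rec produced, ByteBuffer edgeBuffer) {
        edgeBuffer.clear();
        produced.serialize(edgeBuffer);
        edgeBuffer.flip();               // switch the buffer from writing to reading
        Rec consumed = new Rec(0, "");   // instance owned by the downstream component
        consumed.deserialize(edgeBuffer);
        return consumed;
    }

    public static void main(String[] args) {
        Rec in = new Rec(7, "clover");
        Rec out = transfer(in, ByteBuffer.allocate(64));
        System.out.println(out.id + " " + out.name + " shared=" + (in == out));
    }
}
```

The downstream component ends up with an equal but separate instance, which is the "no record instance sharing" invariant in miniature.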

This imposes a painful decision to make on the edge itself – the capacity of the buffer that stores the bytes as they are passed from one end to the other. Obviously, that buffer must have enough room to hold the biggest record which passes through it. Those familiar with the CloverETL engine already know where I am going – the Records.MAX_RECORD_SIZE parameter.

In versions prior to 3.2, we used the standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were set to accommodate at least MAX_RECORD_SIZE bytes worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XML, unstructured data and other modern marvels, is still far from enough). Yet, increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by the number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes in the Server where the default was applied, regardless of whether or not the graph processed big records.
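
To make the multiplication effect concrete, here is a small back-of-the-envelope calculation in Java. The graph size of 100 buffers is a made-up figure, and the sketch assumes exactly one buffer of MAX_RECORD_SIZE bytes per edge or component, which understates the real cost since buffers were allocated in multiples of that size.

```java
final class StaticBufferCost {
    // Total bytes reserved when every buffer is sized to the same static maximum.
    static long totalBytes(int buffers, long maxRecordSize) {
        return buffers * maxRecordSize;
    }

    public static void main(String[] args) {
        int buffers = 100; // hypothetical graph: 100 edge and component buffers
        long kb = 1024;
        System.out.println(totalBytes(buffers, 8 * kb));    // 819200 bytes (about 0.8 MB)
        System.out.println(totalBytes(buffers, 64 * kb));   // 6553600 bytes (about 6.3 MB)
        System.out.println(totalBytes(buffers, 1024 * kb)); // 104857600 bytes (100 MB)
    }
}
```

Raising the setting to handle a single 1 MB record would reserve 100 MB in this hypothetical graph, even if only one edge ever carries such a record.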

Introducing CloverBuffer

Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for an edge and component buffer is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for the serialized byte form of records – a CloverBuffer. It acts as a full replacement for a ByteBuffer, but what sets it apart is the ability to grow. CloverBuffer starts at a small initial size and can transparently grow up to a predefined maximum limit (the newly introduced RECORD_INITIAL_SIZE and RECORD_LIMIT_SIZE) without needing any programmer intervention.

So although there is still one global setting for all, it just sets boundaries that cannot be crossed. Anything in between those limits is allocated automatically to ensure the smallest memory footprint for each transformation run, based on its real needs at run time rather than on estimates. Graphs combining the processing of big records and small records, e.g. a main stream of data combined with a logging branch, utilize only as much memory per edge/component as the size of the data passing through them.
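
The grow-on-demand behaviour can be sketched roughly as follows. GrowableBuffer is a hypothetical illustration written for this post, not the actual CloverBuffer implementation, and the doubling policy is an assumption made for the example.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration of a buffer that grows on demand up to a limit.
final class GrowableBuffer {
    private ByteBuffer buf;
    private final int limitSize;

    GrowableBuffer(int initialSize, int limitSize) {
        this.buf = ByteBuffer.allocate(initialSize);
        this.limitSize = limitSize;
    }

    int capacity() { return buf.capacity(); }

    // Appends bytes, enlarging the backing buffer first if they do not fit.
    void put(byte[] data) {
        if (data.length > buf.remaining()) {
            expand(buf.position() + data.length);
        }
        buf.put(data);
    }

    private void expand(int required) {
        if (required > limitSize) {
            throw new BufferOverflowException(); // the hard upper bound is never crossed
        }
        // Assumed policy: double until the data fits, capped at the limit.
        long newCapacity = Math.max(1, buf.capacity());
        while (newCapacity < required) newCapacity *= 2;
        ByteBuffer bigger = ByteBuffer.allocate((int) Math.min(newCapacity, limitSize));
        buf.flip();
        bigger.put(buf); // copy what was already written
        buf = bigger;
    }

    public static void main(String[] args) {
        GrowableBuffer b = new GrowableBuffer(16, 1024);
        b.put(new byte[10]);
        System.out.println(b.capacity()); // 16: small records keep the buffer small
        b.put(new byte[100]);
        System.out.println(b.capacity()); // 128: grew to fit 110 bytes
    }
}
```

A buffer that only ever sees small records stays at its initial size, while a request beyond the limit fails fast instead of silently consuming memory.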

Programmers Only

The entire CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations can run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible, so even without changes your code still works with the new release.

For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);

This should now be replaced with the following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);

The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants was introduced:
Record.RECORD_INITIAL_SIZE – the initial buffer size, currently 64 KB; it will probably be decreased in an upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimize the initial memory allocation for regular graphs.

The second constant, ‘Record.RECORD_LIMIT_SIZE’, is actually a one-to-one replacement for MAX_RECORD_SIZE (keeping MAX_RECORD_SIZE backward compatible for the sake of unmodified components); it sets the upper bound for a single CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or increasing this upper bound affects memory consumption only in cases where there is a real need for such a big buffer – otherwise the buffers start at RECORD_INITIAL_SIZE and grow gradually towards the upper limit.

As you can see, the CloverBuffer now makes it possible to process bigger records with a smaller memory footprint, since only the buffers for edges or components that actually handle big records grow, while the others remain small.

CloverETL OEM Advantage: ETL As You Like It

Shakespeare didn’t know the first thing about OEM, or Original Equipment Manufacturers, but he did have quite a grasp on relationships. That being said, we’re here to put what Will preached back into the data world, channeling his penchant for the complexity of true relationships, with our OEM Program. Introducing, “ETL As You Like It: A True OEM Partnership.”

What is a CloverETL OEM Partnership?

Well, it really is a relationship, isn’t it? When CloverETL aligns with a company, we bring together not just our unique products, but also our tested dev teams, solutions, and ideas. It’s actually like a marriage—with joys and delights, compromise and adjustment. We’ve been active in OEM Partnerships for a while now and think it’s a great way for Clover to flourish in all sorts of challenges.

Today, CloverETL is at the core of many data service platforms as a vital data integration piece. As an OEM to many customers’ larger offerings – be it with IBM’s MDM offer, Good Data’s BI platform, or Mulesoft’s ESB service – Clover, with its flexibility, lends itself to match any OEM business strategy. It’s just a matter of deciding which OEM approach works best. CloverETL can work on a “partner basis” (side by side as a toolset), be embedded as a data integration piece, or even be embedded and white labeled.

Let’s take a look at what OEM programs are available for Clover today.

Partner Approach

Some companies use CloverETL alongside their offer to provide an integrated commercial package to their end-user clients. These clients can then have access to the Clover community through our forum to gain knowledge about the best use practice of CloverETL. Often, companies are replacing just the ETL bit because it is either hand-coded or just too cumbersome.

The data services companies that choose this option tend to need a lower volume of licenses. Another reason might also be to simply add a stronger ETL product to an application or service without the need to fully embed Clover. In a sense, the partner approach works like a “power couple” to run compelling applications that provide value to data clients.

Embedded OEM

Many businesses prefer the embedded OEM approach for CloverETL because they only have to manage one offering for their development teams. To embed means that as they make changes and improvements to their application, Clover and its team can adapt with a company’s growth in an evolving market. We offer this option in higher volume scenarios on a per unit basis, or even on an enterprise (one time fee) basis as the case allows.

For example, there are literally hundreds of Clover users for IBM’s Initiate MDM offer; the company, a partner since 2007, even produced a CloverETL user guide to help their clients learn and use Clover more effectively.

You can check out this IBM WorkBench document at: http://publib.boulder.ibm.com/infocenter/initiate/v9r5/topic/com.ibm.initiatepdfs.doc/topics/i46wecug.pdf

Embedded OEM – White Labeling

Lately, we are seeing a trend of customers asking for a white labeled ETL/data integration piece for their service offer. This is part of an interesting strategy where service providers simply want to present their offer with a focus on the application or the power of the company brand. IT and data professionals know that the ETL is there, cranking through and organizing data, but a white label strategy puts the ETL behind the scenes for a complete, branded look: Voila! The service just works, it seems. White labeling has the same licensing strategy as the embedded-only OEM approach.

In all, David Pavlis, Javlin’s President, put it best: “What’s great about Clover is that we want to work with you, and you can rely on us. We can adapt alongside your offering or become part of it, fully integrated. What we do know is that, in every case, we can support a lifetime partnership: adapting, improving, and aligning to where our customers and their clients are going.”

And there you have it, as you like us.

Interested in the technical nuances of OEM? Our OEM blog series will continue, detailing each approach reviewed today, so check back soon.

Performance Optimization of Metrics in CloverETL Data Profiler

The first beta version of CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides fixing bugs and adding a few new features, we also worked on performance optimization of profiling metrics. This article describes this improvement and how profiling is interconnected with CloverETL Engine.

The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets, holding everything in main memory is not feasible, so external storage needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets, larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort component, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as an SQL database) are used during profiling.
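
To see why sorting helps, consider a count of distinct values, used here as a representative example (this post does not list which metrics are affected). On unsorted input it needs a set of everything seen so far; on sorted input it only needs the previous value. A minimal Java sketch, assuming non-null values and with an in-memory sort standing in for the external sort step:

```java
import java.util.Arrays;
import java.util.Iterator;

final class SortedStreamMetrics {
    // Counts distinct values in a stream that is already sorted.
    // Only the previous value is kept, so memory use is constant.
    static long countDistinct(Iterator<String> sorted) {
        long distinct = 0;
        String previous = null;
        boolean first = true;
        while (sorted.hasNext()) {
            String value = sorted.next();
            if (first || !value.equals(previous)) {
                distinct++;
            }
            previous = value;
            first = false;
        }
        return distinct;
    }

    public static void main(String[] args) {
        String[] values = {"b", "a", "c", "a", "b"};
        Arrays.sort(values); // stands in for the external sort step
        System.out.println(countDistinct(Arrays.asList(values).iterator())); // 3
    }
}
```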

The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with the full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom Java transform code to calculate the metrics. Rollup can produce a variable number of output records for any number of incoming records.
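
The idea of a group-wise pass that emits a variable number of outputs can be sketched in plain Java. This is not the CloverETL Rollup API; RollupSketch and its "report duplicates" metric are hypothetical and only show the shape of the computation on sorted input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RollupSketch {
    // Walks sorted values group by group and emits "value=count"
    // only for values that occur more than once.
    static List<String> duplicates(List<String> sorted) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String value : sorted) {
            if (count > 0 && value.equals(current)) {
                count++;                                        // same group: update running state
            } else {
                if (count > 1) out.add(current + "=" + count);  // previous group finished
                current = value;                                // start a new group
                count = 1;
            }
        }
        if (count > 1) out.add(current + "=" + count);          // flush the last group
        return out;
    }

    public static void main(String[] args) {
        System.out.println(duplicates(Arrays.asList("a", "a", "b", "c", "c", "c")));
        // prints [a=2, c=3]
    }
}
```

Six input records produce two output records here; input without duplicates would produce none.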

Another performance improvement in the second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The profiler now makes better use of available CPUs, and less CPU time is consumed by context switching. This results in a boost of up to 15% for data with a large number of fields. Also, the simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

CloverETL Visions for 2012: Evolution and Revolution in Data Integration

Part one – celebrating 10 years

In 2012, CloverETL will celebrate its 10th anniversary as an open source project. It all started back in 2002. On October 3rd, 2002, version 0.1 was first announced on the Freshmeat (now Freecode) portal. That day, CloverETL’s official life began.

I don’t want to look into Clover’s history too much, though. I do, however, want to take this time to make a few comments about the principles on which CloverETL was established and how these principles continue to determine its future.

Principle number 1: Elegant and robust architecture guarantees a stable foundation

CloverETL started more as a framework on which other projects could be based, rather than as an end-user product with a “sexy” GUI. As a matter of fact, the real GUI was built in 2005, almost three years after the release of the first CloverETL engine, which is now present in every tool of the CloverETL family – the Designer, the Server and also CloverETL Profiler.

Even though we are now on version 3.2, there has, so far, only been one change which significantly broke backward compatibility: when we switched from Java 1.4 to Java 1.5 and changed some key interface definitions.

This particular principle is what gives a certain peace of mind to the projects and software products embedding or otherwise deploying Clover, as they know there won’t be any sudden surprises with future versions. It also proves that the original architecture was robust and flexible enough at the outset to support all the later additions and improvements.

Principle number 2: Less is better

CloverETL is based on the idea of cooperating components, each specialized in one particular function. However, each component is flexible enough to support the various “outer” conditions in which it works.

For example, our UniversalDataReader is meant for parsing text data. The data can come in variations like fixed-length, delimited, or combined; can be read locally or from remote locations; and can be available in plain form or compressed. All these variations are supported, which means that subtle changes, like data becoming available through a different protocol or perhaps being suddenly compressed, require only slight reconfiguration of our DataReader. Contrast this with other players, whose hundreds of different components require architecture changes in the transformation (replacing one component with another) when a small shift in input data happens (e.g. due to moving from a DEV to a PROD environment), and you’ll notice the difference.

It also means that a programmer or analyst designing data transformations in Clover does not need to carry a dictionary of components; a short list covers all possible scenarios.

Principle number 3: Agility is sexy, but long term planning is wise

CloverETL is used in many applications by many customers. Some of them are large, global corporations that embed Clover in their products. Through our OEM program, we work with many customers with a very agile approach to the development of their applications. Some of them have release cycles as short as two weeks, in which they must not only develop and debug, but also release new features. Clover’s development team tries to keep up with this sprint, but we still take our time to plan, architect, and develop new, fundamental features to extend CloverETL’s capabilities and help our customers do their jobs faster and more simply.

The reason we insist on thinking through every new feature request, beyond simple tweaks, is that sometimes a relatively small and quick change may break compatibility somewhere or prevent future extensions. Whenever our development team touches the core (engine), we make sure the change is properly evaluated from several points of view, including:

  • Backward compatibility – at least at transformation graph level.
  • Performance – Slowdown of just a few percent on big data can mean extra kW of energy consumed by data crunching servers.
  • Future extensibility – We hate deprecating APIs or components just because we might not be able to continue enhancing and improving them.

This principle is further supported by the fact that CloverETL continues to be developed by the same, stable development team year in and out. Many team members have been around since 2005, when the commercial life of Clover began.

Part two – What will appear on the menu in 2012

In short, there will be evolution and, in certain areas, some revolution. We are always sorting out the dilemma of whether to break from the “past” and come up with something completely new and revolutionary – at least in our minds – or continue to improve the old-faithful engine architecture laid out years ago.

As we weren’t able to choose one or the other, we decided to continue improving what works well (and should continue to, even in the future) and overhaul some things that have had occasional hiccups with modern data structures and formats brought to us by the CLOUD.

Evolution

Expanding CloverETL OEM program

As CloverETL attracts new OEM customers, we continue improving our OEM program by making it simpler to embed, modify, white-label, or otherwise enhance our technology stack. This includes better documentation, example projects, and extended training.

We are also investing in our support team, which has always strived to provide timely and accurate answers to all support requests submitted through various channels, from e-mail to the technology forum and hotline.

Our support staff is comprised of experienced consultants and programmers who have real-life experience with our technology—they aren’t just people a few manual pages ahead of a user seeking an answer.

GUI – continuous improvement of the user experience

We will continue our effort to make the Designer more and more user-friendly. Our motto is: CloverETL is built by professionals for professionals and, truly, professional DI experts or Java programmers usually give us high marks. Nonetheless, we want to make our technology accessible to the broadest possible audience seeking solutions to certain data needs.

Enhancing CloverETL Cluster – our BigData recipe

These days, BigData is usually mentioned together with Hadoop as the solution. As much as we like Hadoop for various reasons, we have our own recipe for processing BigData, and we think it’s better suited for classical data integration/ETL tasks. It is based on a split/transform/merge idea, where big input data are partitioned and then processed in parallel on multiple nodes of the CloverETL Cluster. The advantage of this, as opposed to Hadoop, is that the transformation may be developed and debugged locally, then easily deployed onto CloverETL Cluster for fast execution. Even when a transformation is executed in a cluster environment, all the debugging and monitoring options of our Designer are available. It is also worth mentioning that deploying CloverETL Cluster is much easier than setting up a Hadoop cluster.
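
The split/transform/merge idea itself is easy to sketch. The Java example below is only an analogy: threads in one JVM stand in for cluster nodes, upper-casing stands in for the transformation, and none of the names come from the CloverETL Cluster API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SplitTransformMerge {
    // Splits the input into `parts` partitions (parts >= 1), transforms each
    // partition in parallel, and merges the partial results in partition order.
    static List<String> run(List<String> input, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            int chunk = (input.size() + parts - 1) / parts; // ceiling division
            for (int p = 0; p < parts; p++) {
                int from = Math.min(p * chunk, input.size());
                int to = Math.min(from + chunk, input.size());
                List<String> partition = input.subList(from, to);   // split
                futures.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String s : partition) {
                        out.add(s.toUpperCase(Locale.ROOT));        // transform
                    }
                    return out;
                }));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());                             // merge
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c", "d", "e"), 2));
        // prints [A, B, C, D, E]
    }
}
```

Because the per-partition logic is an ordinary function, it can be run and debugged on a single partition locally before being fanned out, which mirrors the develop-locally, deploy-to-cluster workflow described above.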

Our big enhancement of CloverETL Cluster in 2012 will be the merging of our technology with Hadoop – more precisely, the HDFS filesystem – which should combine the best of both worlds. HDFS provides some cool features, namely robustness and high performance, and we want to utilize its automated data partitioning to make it easier to grow (or shrink) the storage of data depending on actual needs.

Revolution

Rich data structures – trees, unstructured data, etc.

It must come with age, but I can’t help admiring those who devised Cobol and CopyBook. In those times, every byte of storage counted and CPUs were slow, yet programmers were still able to process rich data structures. Then relational databases came and brought the idea of tables and normal forms. Well, today, we are back to rich structures, but this time, we’ve stopped counting bytes or CPU cycles (which has a huge impact on the power consumption of servers, but that’s a different story). That is why XML, JSON, and other rich structures are becoming the norm today.

In order to support these structures and formats as first class passengers, we decided to overhaul our metadata and record storage model and allow direct support of tree structures, multi-values of fields, and even loosely typed data organized in maps/properties collections.

This in itself constitutes a big adventure, as every single piece of our technology platform will be affected, and thus will have to be adapted. The effort will be huge, and the necessary regression testing of the whole platform will be endless. Despite this, the prize is enticing: almost any type of data (and the cloud will be a bonanza for this) will be 1:1 representable by Clover. That will include XML, JSON, POJO, and complex properties – and, in the future, who knows what else!

-----

We have always claimed that CloverETL is future-proof. Therefore, in 2012, we will be improving our foundations so they withstand the next 10 years.

If what I’ve talked about above is of interest to you, then please stay tuned. We will be publishing more details on our new functionality as we implement it.

For now, I wish everyone a very successful 2012!