New Release Model Offers Clover Users The Choice

As you may have noticed, the latest CloverETL release is labeled as a milestone release. What does this mean exactly? Well, this release marks the beginning of our new release model that will fulfill two separate, but important requirements that our customers have been asking for: the need for sustained stability in operations as well as the availability of new features as they come out.

Milestone Releases

Clover is a major provider of OEM data integration for many BI or MDM solution suppliers, and milestone releases address their desire for more frequent release cycles – not only to match the suppliers’ product development, but also to provide new features on an ongoing basis. Milestone releases are meant for these kinds of users and customers. Each milestone release, occurring every two months or so, will focus on introducing major new features and improvements, along with bug fixes. Milestone releases are carefully QA’d and ready for implementation in our user community.

Production Releases

The production release should be selected by those who seek high stability. Production releases usually happen twice a year and contain all features and fixes previously released in milestone releases; these releases undergo a rigorous QA process and testing by the community.

With this new release model, CloverETL users and customers have the freedom to choose whether to maintain operations and stronger stability on the production track or to implement the newest features into their flow with the milestone release.

When Two Become One: CloverETL OEM Embedded White Labeling

In our previous post, we reviewed the three approaches to CloverETL OEM. Here, I’ll discuss the more technical aspects of CloverETL OEM Embedded White Labeling.

OEM White Labeling – What it Takes

When partners need white labeling, they are mostly considering the seamless integration of the ETL piece with their application. They have to manage things like version control of their application (making sure it works with Clover), the look and feel of the GUI, and a good sense of automation, for example. The Clover team works with the partner to architect this and can do so quickly.

Partners needing this option receive CloverETL Integration documentation to guide them through the process of achieving stable, effective white labeling. Both the CloverETL Designer and CloverETL Server can also be white labeled as part of a company strategy.

Which steps does white labeling consist of in Designer and Server?

CloverETL Designer

CloverETL Designer branding is based on showing the product name, splash screen, welcome screen, etcetera inside the Eclipse environment. Applications embedding or extending CloverETL Designer can use the same branding elements to visually customize the product. Features available via branding:

  • Product naming and description: These are visible in multiple places inside Eclipse (however, not all occurrences can be changed).
  • Default configuration: Eclipse configuration options can be changed according to users’ needs, e.g. whether a splash screen should be shown at start-up.
  • Splash screen: The initial image shown at start-up and the progress bar can be customized.
  • Welcome screen: An introduction screen shown to users on the application’s first run; it can contain useful links to documentation, examples etc.

CloverETL Server

CloverETL Server is mentioned in multiple messages, web GUI images, directory names, etc. By following the steps described in the integration documentation (available to OEM partners), you can get rid of all CloverETL occurrences. White labeling Server thus involves:

  • Replacing images, logos plus related work on graphics
  • Editing a couple of properties stored in server configuration files
  • Tuning the resulting application

Big Time with Big Records

Those of you who have ever tried to process big records with CloverETL have already learned that it required some tweaking and special care to make it run smoothly and efficiently. In some cases, CloverETL could get too greedy with memory requirements for a graph run, making it quite cumbersome to set up. With CloverETL 3.2 we have introduced improved memory management in the runtime layer that optimizes memory usage when running graphs with big records.

Let’s take a look inside to see what this is all about…

Pipeline Approach

Clover’s approach to record processing is based on a pipeline – a chain of processing components connected by edges. The edges are the key point in inter-component communication. They have to ensure a fast transfer of records from one component to another. Our approach to edge data transfer has always been based on serializing records into a byte stream at the starting end of an edge and deserializing them back into record form at the other end. This ensures a basic invariant for all of our components: no record instance sharing. Each component has its own instance of a record, populated from data on the input edge. The record is then processed by the component and serialized into an output edge. This simple idea delivers excellent performance. (We have tried many times to find an even better approach, but have always returned to this one. Believe me – we have tried hard and many, many times.)

This imposes a painful decision to make on the edge itself – the capacity of the buffer that stores the bytes as they are passed from one end to the other. Obviously, that buffer must have enough room to hold the biggest record that passes through it. Those familiar with the CloverETL engine already know where I am going – the Record.MAX_RECORD_SIZE parameter.
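To make the edge idea concrete, here is a minimal sketch of a record being serialized into a fixed-capacity buffer on one end and deserialized into a separate instance on the other. The classes DemoRecord and EdgeDemo, the two-field record layout and the 1024-byte capacity are all assumptions made for illustration; none of them are CloverETL classes or defaults.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical two-field record used only for this sketch; not a CloverETL class. */
final class DemoRecord {
    int id;
    String name;

    DemoRecord(int id, String name) {
        this.id = id;
        this.name = name;
    }

    /** Writing end of the edge: turn the record into bytes. */
    void serialize(ByteBuffer edge) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        edge.putInt(id);
        edge.putInt(nameBytes.length);
        edge.put(nameBytes);
    }

    /** Reading end of the edge: refill this component's own record instance. */
    void deserialize(ByteBuffer edge) {
        id = edge.getInt();
        byte[] nameBytes = new byte[edge.getInt()];
        edge.get(nameBytes);
        name = new String(nameBytes, StandardCharsets.UTF_8);
    }
}

final class EdgeDemo {
    /** Passes a record through a byte-buffer "edge" and returns the reader's own copy. */
    static DemoRecord transfer(DemoRecord produced, ByteBuffer edge) {
        edge.clear();
        produced.serialize(edge);
        edge.flip(); // switch the buffer from writing to reading
        DemoRecord consumed = new DemoRecord(0, "");
        consumed.deserialize(edge);
        return consumed;
    }

    public static void main(String[] args) {
        DemoRecord in = new DemoRecord(42, "clover");
        // The capacity is fixed up front, so it must fit the biggest record.
        DemoRecord out = transfer(in, ByteBuffer.allocate(1024));
        System.out.println(out.id + " " + out.name + " shared=" + (in == out));
    }
}
```

In this sketch, a record whose serialized form exceeded the 1024 bytes would fail with a BufferOverflowException, which is exactly why the buffer capacity has to be chosen with the biggest record in mind.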

In versions prior to 3.2, we used the standard java.nio.ByteBuffer allocated to various multiples of MAX_RECORD_SIZE. That meant that all edges, component buffers, and just about anything with records passing through it were set to accommodate at least MAX_RECORD_SIZE bytes worth of the “guessed” biggest record possible. Over time, we gradually raised the default from 8 KB up to 64 KB (which, in the world of XMLs, unstructured data and other modern marvels, is still far from being enough). Yet, increasing MAX_RECORD_SIZE had quite a few negative effects on memory consumption, as any small increase was immediately multiplied by the number of components and edges in a graph that shared this static buffer size. It was also shared among all graphs and sandboxes in the Server where the default was applied, regardless of whether or not the graph processed big records.

Introducing CloverBuffer

Now we are proud to say that with release 3.2, we have brought a significant improvement to this area. No more MAX_RECORD_SIZE trade-off decisions are necessary. Memory allocation for an edge and component buffer is now smart: it grows with higher demands and stays low for low demands. We have stepped up from the plain ByteBuffer to our own new container for the serialized byte form of records – a CloverBuffer. It acts as a full replacement of a ByteBuffer, but what sets it apart is the ability to grow. CloverBuffer starts small but can transparently grow up to a predefined maximum limit (set by the newly introduced RECORD_INITIAL_SIZE and RECORD_LIMIT_SIZE) without needing any programmer intervention.

So although there is still one global setting for all, it just sets boundaries that cannot be crossed. Anything in between those limits is allocated automatically to ensure the smallest memory footprint of each transformation run, based on its real needs at run time rather than on estimates. Graphs combining the processing of big records and small records, e.g. a main stream of data combined with some logging branch, utilize only as much memory per edge/component as the size of the data passing through them.
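The grow-on-demand behavior described above can be sketched in a few lines. This is only an illustration of the general technique, not the actual CloverBuffer implementation: the class name GrowableBuffer, the doubling strategy and the sizes used in main are assumptions.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

/** Illustrative growable buffer; NOT the actual CloverBuffer implementation. */
final class GrowableBuffer {
    private ByteBuffer data;
    private final int limitSize;

    GrowableBuffer(int initialSize, int limitSize) {
        this.data = ByteBuffer.allocate(initialSize);
        this.limitSize = limitSize;
    }

    /** Appends bytes, growing the backing buffer first if they do not fit. */
    void put(byte[] bytes) {
        ensureRemaining(bytes.length);
        data.put(bytes);
    }

    int capacity() { return data.capacity(); }

    int position() { return data.position(); }

    private void ensureRemaining(int needed) {
        if (data.remaining() >= needed) {
            return; // small records never trigger a reallocation
        }
        long required = (long) data.position() + needed;
        if (required > limitSize) {
            throw new BufferOverflowException(); // the hard upper bound was hit
        }
        // Assumed strategy: double the capacity until the data fits, capped at the limit.
        long newCapacity = Math.max(data.capacity(), 1);
        while (newCapacity < required) {
            newCapacity *= 2;
        }
        newCapacity = Math.min(newCapacity, limitSize);
        ByteBuffer bigger = ByteBuffer.allocate((int) newCapacity);
        data.flip();      // prepare the old content for copying
        bigger.put(data); // copy it; the write position carries over
        data = bigger;
    }

    public static void main(String[] args) {
        GrowableBuffer small = new GrowableBuffer(16, 1024);
        small.put(new byte[10]);  // fits, stays at the initial size
        GrowableBuffer big = new GrowableBuffer(16, 1024);
        big.put(new byte[300]);   // forces the buffer to grow
        System.out.println(small.capacity() + " vs " + big.capacity());
    }
}
```

Only the buffer that actually sees a big record pays for it; the other one stays at its initial size, which mirrors the per-edge behavior described above.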

Programmers Only

The entire CloverETL code base has been refactored to use the new CloverBuffer. We recommend that everyone adopt it too, so that your transformations can run seamlessly. In any case, you don’t need to worry – we keep our code backward compatible, so even without changes your code still works with the new release.

For completeness, here is an example of old record container allocation:
ByteBuffer recordBuffer = ByteBuffer.allocateDirect(Defaults.Record.MAX_RECORD_SIZE);

This should now be replaced with the following code:
CloverBuffer recordBuffer = CloverBuffer.allocateDirect(Defaults.Record.RECORD_INITIAL_SIZE, Defaults.Record.RECORD_LIMIT_SIZE);

The constant Record.MAX_RECORD_SIZE is now deprecated and a pair of new constants has been introduced.
The first constant, Record.RECORD_INITIAL_SIZE, sets the initial buffer size. For now it is 64 KB, and it will probably be decreased in an upcoming release (http://bug.javlin.eu/browse/CL-2070) to minimize the initial memory allocation for regular graphs.

The second constant, Record.RECORD_LIMIT_SIZE, is actually a one-to-one replacement for MAX_RECORD_SIZE (MAX_RECORD_SIZE is kept backward compatible for the sake of unmodified components). It sets the upper bound for a single CloverBuffer instance. This can be virtually anything – for convenient early detection of real buffer overruns, it is set to 32 MB by default. Lowering or increasing this upper bound affects memory consumption only in cases where there is a real need for such a big buffer – otherwise the buffers start at RECORD_INITIAL_SIZE and grow gradually towards the upper limit.

As you can see, the CloverBuffer now makes it possible to process bigger records with a smaller memory footprint, since only the buffers for edges or components that actually handle big records grow, while the others remain small.

CloverETL OEM Advantage: ETL As You Like It

Shakespeare didn’t know the first thing about OEM, or Original Equipment Manufacturers, but he did have quite a grasp on relationships. That being said, we’re here to put what Will preached back into the data world, channeling his penchant for the complexity of true relationships, with our OEM Program. Introducing, “ETL As You Like It: A True OEM Partnership.”

What is a CloverETL OEM Partnership?

Well, it really is a relationship, isn’t it? When CloverETL aligns with a company, we bring together not just our unique products, but also our tested dev teams, solutions, and ideas. It’s actually like a marriage—with joys and delights, compromise and adjustment. We’ve been active in OEM Partnerships for a while now and think it’s a great way for Clover to flourish in all sorts of challenges.

Today, CloverETL is at the core of many data service platforms as a vital data integration piece. As an OEM to many customers’ larger offerings – be it with IBM’s MDM offer, Good Data’s BI platform, or Mulesoft’s ESB service – Clover, with its flexibility, lends itself to match any OEM business strategy. It’s just a matter of deciding which OEM approach works best. CloverETL can work on a “partner basis” (side by side as a toolset), be embedded as a data integration piece, or even be embedded and white labeled.

Let’s take a look at what OEM programs are available for Clover today.

Partner Approach

Some companies use CloverETL alongside their offer to provide an integrated commercial package to their end-user clients. These clients can then have access to the Clover community through our forum to learn best practices for using CloverETL. Often, companies are replacing just the ETL bit because it is either hand-coded or just too cumbersome.

The data services companies that choose this option tend to need a lower volume of licenses. Another reason might also be to simply add a stronger ETL product to an application or service without the need to fully embed Clover. In a sense, the partner approach works like a “power couple” to run compelling applications that provide value to data clients.

Embedded OEM

Many businesses prefer the embedded OEM approach for CloverETL because they only have to manage one offering for their development teams. To embed means that as they make changes and improvements to their application, Clover and its team can adapt with a company’s growth in an evolving market. We offer this option in higher volume scenarios on a per unit basis, or even on an enterprise (one time fee) basis as the case allows.

For example, there are literally hundreds of Clover users for IBM’s Initiate MDM offer; the company, a partner since 2007, even produced a CloverETL user guide to help their clients learn and use Clover more effectively.

You can check out this IBM WorkBench document at: http://publib.boulder.ibm.com/infocenter/initiate/v9r5/topic/com.ibm.initiatepdfs.doc/topics/i46wecug.pdf

Embedded OEM – White Labeling

Lately, we are seeing a trend of customers asking for a white labeled ETL/data integration piece for their service offer. This is part of an interesting strategy where service providers simply want to present their offer with a focus on the application or the power of the company brand. IT and data professionals know that the ETL is there, cranking through and organizing data, but a white label strategy puts the ETL behind the scenes for a complete, branded look: Voila! The service just works, it seems. White labeling has the same licensing strategy as the embedded-only OEM approach.

In all, David Pavlis, Javlin’s President, put it best: “What’s great about Clover is that we want to work with you, and you can rely on us. We can adapt alongside your offering or become part of it, fully integrated. What we do know is that, in every case, we can support a lifetime partnership–adapting, improving, and aligning to where our customers and their clients are going.”

And there you have it, as you like us.

Interested in the technical nuances of OEM? Our OEM blog series will continue, detailing each approach reviewed today, so check back soon.

Performance Optimization of Metrics in CloverETL Data Profiler

The first beta version of CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides fixing bugs and adding a few new features, we also worked on performance optimization of profiling metrics. This article describes this improvement and how profiling is interconnected with CloverETL Engine.

The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table). Once reading is finished, the metrics return their results, which are then stored in the results database.

For most of the metrics (e.g. minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets, keeping everything in main memory is not feasible, so external storage needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets – larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort component, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as an SQL database) are used during profiling.
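Why sorting helps can be shown with a small sketch. Once equal values arrive next to each other, a metric only has to remember the current run of values, not the whole data set. The metrics chosen here (distinct count and most frequent value) and the class name SortedStreamMetrics are assumptions for illustration; the in-memory Arrays.sort call merely stands in for the external sort step.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Illustrative single-pass metrics over sorted input; not the Profiler's actual code. */
final class SortedStreamMetrics {
    long distinctCount;
    String mostFrequentValue;
    long mostFrequentCount;

    /** Single pass over non-null values that MUST arrive in sorted order. */
    static SortedStreamMetrics analyze(Iterator<String> sortedValues) {
        SortedStreamMetrics metrics = new SortedStreamMetrics();
        String current = null;
        long runLength = 0;
        while (sortedValues.hasNext()) {
            String value = sortedValues.next();
            if (runLength > 0 && value.equals(current)) {
                runLength++; // still inside the same run of equal values
            } else {
                metrics.closeRun(current, runLength);
                current = value;
                runLength = 1;
            }
        }
        metrics.closeRun(current, runLength);
        return metrics;
    }

    /** Folds one finished run of equal values into the metrics. */
    private void closeRun(String value, long runLength) {
        if (runLength == 0) {
            return; // nothing has been read yet
        }
        distinctCount++;
        if (runLength > mostFrequentCount) {
            mostFrequentCount = runLength;
            mostFrequentValue = value;
        }
    }

    public static void main(String[] args) {
        String[] column = {"b", "a", "c", "a", "b", "a"};
        Arrays.sort(column); // stand-in for the external sort step
        SortedStreamMetrics m = analyze(Arrays.asList(column).iterator());
        System.out.println(m.distinctCount + " distinct, most frequent: " + m.mostFrequentValue);
    }
}
```

The memory needed by analyze does not depend on the number of input values, which is the property that makes the sort-then-stream approach attractive for large data sets.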

The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with the full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom Java transform code to calculate the metrics. Rollup can produce a variable number of output records for any number of incoming records.

Another performance improvement in the second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The profiler now makes better use of available CPUs, and less CPU time is consumed by context switching. This results in a boost of up to 15% for data with a large number of fields. Also, the simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

CloverETL Visions for 2012: Evolution and Revolution in Data Integration

Part one – celebrating 10 years

In 2012, CloverETL will celebrate its 10th anniversary as an open source project. It all started back in 2002. On October 3rd, 2002, version 0.1 was first announced on the Freshmeat (now Freecode) portal. That day, CloverETL’s official life began.

I don’t want to look into Clover’s history too much, though. I do, however, want to take this time to make a few comments about the principles on which CloverETL was established and how these principles continue to determine its future.

Principle number 1: Elegant and robust architecture guarantees a stable foundation

CloverETL started more as a framework on which other projects could be based, rather than as an end-user product with a “sexy” GUI. As a matter of fact, the real GUI was built in 2005, almost three years after the release of the first CloverETL engine, which is now present in every tool of the CloverETL family – the Designer, the Server and also CloverETL Profiler.

Even though we are now on version 3.2, there has, so far, only been one change which significantly broke backward compatibility: when we switched from Java 1.4 to Java 1.5 and changed some key interface definitions.

This particular principle is what gives a certain peace of mind to the projects and software products embedding or otherwise deploying Clover, as they know there won’t be any sudden surprises with future versions. It also proves that the original architecture was robust and flexible enough at the outset to support all the later additions and improvements.

Principle number 2: Less is better

CloverETL is based on the idea of cooperating components, each specialized in one particular function only. However, each component is flexible enough to support the various “outer” conditions in which it works.

For example, our UniversalDataReader is meant for parsing text data. The data can come in variations like fixed-length, delimited, or combined; can be read locally or from remote locations; and can be available in plain form or compressed. All these variations are supported, which means that subtle changes, like data becoming available through a different protocol or perhaps being suddenly compressed, require only slight reconfiguration of our DataReader. Contrast this with other players, whose hundreds of different components require architecture changes in the transformation (replacement of one component with another) when a small shift in input data happens (e.g. due to moving from a DEV to a PROD environment), and you’ll notice the difference.

It also means that a programmer or analyst designing data transformations in Clover does not need to carry a dictionary of components; a short list covers all possible scenarios.

Principle number 3: Agility is sexy, but long term planning is wise

CloverETL is used in many applications by many customers. Some of them are large, global corporations that embed Clover in their products. Through our OEM program, we work with many customers with a very agile approach to the development of their applications. Some of them have release cycles as short as two weeks, in which they must not only develop and debug, but also release new features. Clover’s development team tries to keep up with this sprint, but we still take our time to plan, architect, and develop new, fundamental features to extend CloverETL’s capabilities and help our customers do their jobs faster and more simply.

The reason we insist on thinking through every new feature request, beyond simple tweaks, is that sometimes a relatively small and quick change may break compatibility somewhere or prevent future extensions. Whenever our development team touches the core (engine), we make sure the change is properly evaluated from several points of view, including:

  • Backward compatibility – at least at transformation graph level.
  • Performance – Slowdown of just a few percent on big data can mean extra kW of energy consumed by data crunching servers.
  • Future extensibility – We hate deprecating APIs or components just because we might not be able to continue enhancing and improving them.

This principle is further supported by the fact that CloverETL continues to be developed by the same, stable development team year in and year out. Many team members have been around since 2005, when the commercial life of Clover began.

Part two – What will appear on the menu in 2012

In short, there will be evolution and, in certain areas, some revolution. We are always sorting out the dilemma of whether to break from the “past” and come up with something completely new and revolutionary – at least in our minds – or continue to improve the old-faithful engine architecture laid out years ago.

As we weren’t able to choose one or the other, we decided to continue improving what works well (and should continue to, even in the future) and overhaul some things that have had occasional hiccups with modern data structures and formats brought to us by the CLOUD.

Evolution

Expanding CloverETL OEM program

As CloverETL attracts new OEM customers, we continue improving our OEM program by making it simpler to embed, modify, white-label, or otherwise enhance our technology stack. This includes better documentation, example projects, and extended training.

We are also investing in our support team, which has always strived to provide timely and accurate answers to all support requests submitted through various channels, from e-mail to the technology forum and hotline.

Our support staff is comprised of experienced consultants and programmers who have real-life experience with our technology—they aren’t just people a few manual pages ahead of a user seeking an answer.

GUI – continuous improvement of the user experience

We will continue our effort to make the Designer more and more user-friendly. Our motto is: CloverETL is built by professionals for professionals and, truly, professional DI experts or Java programmers usually give us high marks. Nonetheless, we want to make our technology accessible to the broadest possible audience seeking solutions to certain data needs.

Enhancing CloverETL Cluster – our BigData recipe

These days, BigData is usually mentioned together with Hadoop as the solution. As much as we like Hadoop for various reasons, we have our own recipe for processing BigData, and we think it’s better suited for classical data integration/ETL tasks. It is based on a split/transform/merge idea, where big input data are partitioned and then processed in parallel on multiple nodes of the CloverETL Cluster. The advantage of this, as opposed to Hadoop, is that the transformation may be developed and debugged locally, then easily deployed onto CloverETL Cluster for fast execution. Even if executed in a cluster environment, all the debugging and monitoring options of our Designer are available. It is also worth mentioning that deploying CloverETL Cluster is much easier than setting up a Hadoop cluster.
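The split/transform/merge idea can be illustrated on a single machine. In the sketch below, worker threads stand in for cluster nodes, the round-robin split and the upper-casing transformation are arbitrary choices, and the class name SplitTransformMerge is hypothetical; it says nothing about how CloverETL Cluster actually distributes work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Single-machine illustration of split/transform/merge; threads stand in for nodes. */
final class SplitTransformMerge {

    static List<String> run(List<String> input, int partitions) throws Exception {
        // Split: distribute the records round-robin across the partitions.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < input.size(); i++) {
            parts.get(i % partitions).add(input.get(i));
        }

        ExecutorService workers = Executors.newFixedThreadPool(partitions);
        try {
            // Transform: every partition is processed independently and in parallel.
            List<Future<List<String>>> results = new ArrayList<>();
            for (List<String> part : parts) {
                results.add(workers.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String record : part) {
                        out.add(record.toUpperCase(Locale.ROOT));
                    }
                    return out;
                }));
            }
            // Merge: collect the partial results, partition by partition.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> result : results) {
                merged.addAll(result.get());
            }
            return merged;
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Because each partition is transformed by the same code, the transformation logic can be written and debugged on a small local input before it is run against partitioned data.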

Our big enhancement of CloverETL Cluster in 2012 will be the merging of our technology with Hadoop – more precisely, the HDFS filesystem – which should combine the best of both worlds. HDFS provides some cool features, namely robustness and high performance, and we want to utilize its automated data partitioning to make it easier to grow (or shrink) the storage of data depending on actual needs.

Revolution

Rich data structures – trees, unstructured data, etc.

It has to come with age, but I can’t resist and must admire those who devised Cobol and CopyBook. In those times, every byte of storage counted and CPUs were slow, yet programmers were still able to process rich data structures. Then relational databases came and brought the idea of tables and normal forms. Well, today, we are back to rich structures, but this time, we’ve stopped counting bytes or CPU cycles (which has a huge impact on power consumption of servers, but that’s a different story.) That is why XML, JSON, or other rich structures are becoming the norm today.

In order to support these structures and formats as first class passengers, we decided to overhaul our metadata and record storage model and allow direct support of tree structures, multi-values of fields, and even loosely typed data organized in maps/properties collections.

This in itself constitutes a big adventure, as every single piece of our technology platform will be affected, and thus will have to be adapted. The effort will be huge, and the necessary regression testing of the whole platform will be endless. Despite this, the prize is enticing: almost any type of data (and the cloud will be a bonanza for this) will be 1:1 representable by Clover. That will include XML, JSON, POJO, and complex properties – and, in the future, who knows what else!

—–

We have always claimed that CloverETL is future-proof. Therefore, in 2012, we will be improving our foundations so they withstand the next 10 years.

If what I’ve talked about above is of interest to you, then please stay tuned. We will be publishing more details on our new functionality as we implement it.

For now, I wish everyone a very successful 2012!

A Look Back: CloverETL and Data Integration in 2011

As 2011 comes to a close, we’d like to take the time to reflect on what this year has brought CloverETL, its users, and our customers.

Since CloverETL is, after all, a data integration platform, the world of integration is at our core. We’re constantly striving to challenge ourselves in new ways and improve how we approach data integration. This year was no different.

Enhancing Our Core – Two Upgrades of CloverETL

In the past six months, we released two upgraded versions of CloverETL. CloverETL 3.1, published in June, brought significant changes to the platform in several areas. With a deeper focus on connectivity and enhanced support of various data formats, CloverETL 3.1 helped users better process data with complex structure, emails, and Lotus documents, to name a few. The latest version of CloverETL, version 3.2, offered further enhancements to the user experience, as well as improved the processing of large data records.

Data Integration Meets Data Quality – CloverETL Profiler

This year was also a year for new products. With Clover, we’ve moved forward with an evolved sense of the data world. Because data integration, data quality, and other data disciplines are becoming more and more intertwined, we developed the CloverETL Profiler, a data profiling application. Released in beta back in October, the profiler helps users make informed decisions on how to improve the quality of transformed data, which is particularly useful as a precursor to larger data integration projects. CloverETL also integrates more easily with the AddressDoctor solution to improve the quality of geographical information.

Strengthening CloverETL Presence in the US Market

In June, Javlin, the developer of CloverETL, opened up its new office in the Washington D.C. area, which became the headquarters of Javlin Inc., our US presence. Javlin Inc., with both a dedicated sales and customer service force, brings Clover to a whole new market of possibilities.

Last but not least, we are pleased to see that our OEM data integration offer will have a number of important implementations in the upcoming year. (But more on that later. Stay tuned.)

As we leave 2011, we can say that this past year was a whirlwind of hard work, exciting releases, and interesting customers and stories. We’re looking forward to another great year with CloverETL. Cheers to the New Year.

CloverETL Data Profiler: Under the Hood of Data Profiling Application

Recently we’ve released the second public beta release of a new member of our CloverETL family – CloverETL Data Profiler. With this article we’d like to share an overview of the technical architecture of CloverETL Profiler – the why and how of the design.

The CloverETL Data Profiler is a data profiling application, i.e. it provides users with various information about their data, such as averages, the number of empty values, histogram-like charts etc. When designing the application we had several goals in mind:

  • Performance – Data of interest can be huge. Profiling it needs to be reasonably quick so that the user does not have to wait too long for results – we want profiling to be an interactive process, not a batch job.
  • CloverETL Integration – Data profiling is often just one part of a more complex workflow – for example, it can be the starting point of a data integration project. We want to make the transition from data profiling to the following steps in data integration with CloverETL as smooth as possible.
  • Web presentation – An important group of users of data profiling tools are people interested in just viewing the profiling results; they don’t need to design or execute data profiling jobs. Presenting the profiling results on the web gives such users a very simple way to access the results.

CloverETL Profiler Architecture

Let’s have a look at the main building blocks of CloverETL Data Profiler:

  • CloverETL Data Profiler GUI – This is the graphical tool where users create and execute profiling jobs. We’ll talk about the jobs later in more detail, but basically they describe a data source to profile (e.g. file, database etc.) and how to profile it using metrics (e.g. calculate maximum value on a database column). The GUI is based on the Eclipse RCP platform and reuses components of CloverETL Designer – an obvious way to reuse GUIs for concepts shared with CloverETL (e.g. metadata, database connections, data analytics etc.), and one that paves the way for future deep integration.
  • CloverETL Data Profiler Engine – Profiling jobs created and edited in the CloverETL Profiler GUI are stored as ordinary XML files, similarly to CloverETL graphs. The profiler engine takes these jobs, analyzes them, and automatically generates and runs a CloverETL graph. The CloverETL graph performs the actual profiling – so the CloverETL Data Profiler Engine is a relatively thin wrapper on top of CloverETL Engine.
  • CloverETL Engine – As described above, the real work of data profiling is performed by CloverETL Engine, which runs a specially designed transformation graph. Because of this, the profiler gains access to the wide feature set and great performance provided by the CloverETL Engine.
  • Result Storage – CloverETL graphs performing the profiling store the results in a Result Storage database. The database can be an embedded Derby database for single-user scenarios or a shared external database for multiple users. The shared external database suits scenarios where multiple users create and execute profiling jobs and, at the same time, need to share their results. We support a range of databases for the storage of results – Oracle, MySQL, PostgreSQL and MSSQL.
  • Reporting Server – Because profiling results are viewed through a web interface, we have a Reporting Server that both serves the web content and supplies it with the profiling results stored in the Result Storage. The Reporting Server isolates the web interface from the details of the Result Storage by providing the profiling results as JSON via a REST interface. We kept the Reporting Server intentionally simple – more complex server-side functionality such as scheduling will come from integration with CloverETL Server.
  • Reporting Console – This is the tool for observing the profiling results – values of metrics, charts of the data, troubleshooting details about rejected records etc. The Reporting Console is an AJAX web application based on HTML, JavaScript and related standard technologies. Because it’s a web application, it’s very easy for anyone to view the results – there’s no need to install a special application; you only need a reasonably modern web browser.

All parts of CloverETL Data Profiler are currently bundled in one simple-to-use package as a standalone application – just start the CloverETL Data Profiler and it automatically launches an embedded Result Storage, Reporting Server etc. However, we’ve laid the foundations for separating the building blocks for bigger deployment scenarios and better integration with the rest of the CloverETL family.

Data Profiling Process

Now let’s look at the nitty-gritty details of actually profiling data. The basic premise is that we already have all the tools needed for profiling in CloverETL – a transformation graph has all the expressive power needed to profile data:

  • Reading of a data source – CloverETL has a wide range of reader components and connectors to 3rd party systems.
  • Calculating metrics – Transformations in CloverETL are extensible by Java code or by scripts in our proprietary scripting language CTL.
  • Storing results – This is just writing to a database, which is one of the basic use cases of CloverETL.
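To make the three steps concrete, here is a minimal sketch of the same read, calculate, store sequence in plain Python rather than in a CloverETL graph. Everything in it is an assumption made for illustration: the sample data, the `profile_csv` function, the two metrics and the `results` table are invented, and an in-memory SQLite database stands in for the Result Storage.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a real data source.
SAMPLE_CSV = "id,name\n1,Alice\n2,Bob\n3,Christopher\n"


def profile_csv(text, db):
    """Read CSV text, compute two simple per-field metrics, store them in SQLite."""
    # 1. Reading of a data source
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []

    # 2. Calculating metrics (hypothetical metric names)
    results = []
    for name in fields:
        values = [row[name] for row in rows]
        results.append((name, "row_count", str(len(values))))
        results.append((name, "longest_string", max(values, key=len, default="")))

    # 3. Storing results
    db.execute(
        "CREATE TABLE IF NOT EXISTS results (field TEXT, metric TEXT, value TEXT)"
    )
    db.executemany("INSERT INTO results VALUES (?, ?, ?)", results)
    db.commit()
    return results


db = sqlite3.connect(":memory:")
profile_csv(SAMPLE_CSV, db)
stored = db.execute(
    "SELECT field, metric, value FROM results ORDER BY field, metric"
).fetchall()
for row in stored:
    print(row)
```

Even this toy version has to deal with parsing, metric logic and storage schema at once, which is the kind of plumbing a hand-built profiling graph would also require.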

So anyone could manually create a CloverETL graph that profiles data. But that is quite a complex task, and it would take the user’s focus away from the core questions: which data source to profile and which metrics to use. In the CloverETL Data Profiler, the user describes this core information in a profiling job, and the CloverETL Data Profiler Engine then transforms it into a CloverETL transformation graph. The profiling job is defined in a relatively simple XML file that can be edited in a graphical editor. The job primarily contains the following information:

  • Data source – Description of data source to profile – path to a file, definition of a database connection etc.
  • Metadata – Description of the data source structure by CloverETL metadata
  • Metrics – Multiple metrics can be enabled on each field of the metadata, e.g. minimum value, longest string, median etc.
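As a rough mental model, the three parts of a job can be pictured as a small data structure. The sketch below is not the actual job format (which is XML and is not shown here); the `ProfilingJob` class, its attribute names, the type names and the metric names are all hypothetical and chosen only to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilingJob:
    """Hypothetical in-memory model of a profiling job (not the real XML schema)."""

    source: str    # data source, e.g. a file path or a connection reference
    metadata: dict  # structure of the source: field name -> type name
    metrics: dict = field(default_factory=dict)  # field name -> enabled metric names

    def validate(self):
        """Return a list of problems; an empty list means the job is consistent."""
        problems = []
        for field_name in self.metrics:
            if field_name not in self.metadata:
                problems.append(f"metrics reference unknown field: {field_name}")
        return problems


job = ProfilingJob(
    source="customers.csv",
    metadata={"id": "integer", "name": "string"},
    metrics={"id": ["min", "max", "median"], "name": ["longest_string"]},
)
print(job.validate())
```

The point of the model is that metrics hang off metadata fields, so a job is only meaningful when every profiled field is actually described by the metadata.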

The above picture demonstrates the process of running a profiling job:

  • The profiling job is stored in an XML file that contains all required information. The job is created and edited in the CloverETL Profiler GUI.
  • The job XML file is taken by the CloverETL Profiler Engine which creates a CloverETL transformation graph and runs it.
  • The picture of the transformation graph contains an actual profiling graph which reads the data source, stores the results and, most importantly, calculates metrics in parallel branches. Details of the branching depend on the kind of metrics used: some can be computed on-the-fly on the original data, while some require all data to be sorted (e.g. median). The number of branches depends on the selection of metrics and the number of available CPU cores – we optimize the graph for high performance.
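The difference between metrics computed on-the-fly and metrics that need sorted data can be sketched in a few lines of Python. This only illustrates the idea and is not the engine’s implementation: the grouping of metric names into `STREAMING` and `NEEDS_SORT` and all function names are assumptions, and the branches run sequentially here rather than in parallel.

```python
# Assumed grouping of metrics, for illustration only.
STREAMING = {"min", "max", "longest_string"}
NEEDS_SORT = {"median"}


def plan_branches(metrics):
    """Split the requested metrics into a streaming branch and a sorted branch."""
    requested = set(metrics)
    unknown = requested - STREAMING - NEEDS_SORT
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    return sorted(requested & STREAMING), sorted(requested & NEEDS_SORT)


def streaming_min_max(values):
    """One pass over the data in its original order, constant memory."""
    lo = hi = None
    for v in values:
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    return lo, hi


def median_sorted(values):
    """The median needs all values in sorted order before it can be read off."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return None
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [7, 3, 9, 1, 5]
print(plan_branches(["median", "min", "max"]))
print(streaming_min_max(data))
print(median_sorted(data))
```

Because the sorted branch must hold or spill the whole column while the streaming branch does not, keeping the two kinds of metrics in separate branches avoids paying the sorting cost for metrics that do not need it.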

This article is a brief introduction to the architecture of CloverETL Data Profiler. As you can see, we’ve saved a lot of effort by using the power of CloverETL at the core. The CloverETL Data Profiler can also be seen as a successful example of embedding CloverETL.

A Second Wind: New Features Strengthen CloverETL Data Profiler

In Data Profiling with CloverETL Data Profiler beta, we discussed the value of the new product that will soon enrich our product portfolio – CloverETL Data Profiler. The beta testing has been up and running, and so far we have received a lot of valuable feedback from the first beta build that allows us to further enhance the features and usability of CloverETL Data Profiler. Let’s see what the new build offers.

Integration with CloverETL Designer

The new build of the CloverETL Data Profiler introduces a key feature – the first step toward integration with the CloverETL suite. If you are interested in taking part in the development and feedback process of CloverETL products or the beta testing, you can register and get the new build.

The CloverETL Data Profiler has several important use cases, one being its integral role in analysis in the early stages of your projects. The insight into data gained from the CloverETL Data Profiler is valuable for project planning: it illustrates the importance of clean data and offers some direction for the design of your ETL process.

With this build’s new feature, defining a profiling job does more than gather information: you are also preparing resources to use later in CloverETL Designer, namely the database connection and the metadata that describe the schema of your data source. The profiler can assist you in this process, which is especially useful for data sources that don’t carry a schema as an integral part, such as CSV files.

Exporting Metadata and Transformation

The second beta build of CloverETL Data Profiler includes the option to export both database connection metadata and a sample CloverETL transformation graph containing the reader that accesses the data source you previously defined. These are ready to be used in CloverETL Designer. See the official CloverETL Data Profiler documentation to learn more about this feature.

Other Improvements in the New Build

Here is a short summary of the most notable improvements in the second build:

  • export to CloverETL Designer graph
  • visual enhancement of CloverETL Data Profiler job
  • removed dependency on internal database for metrics calculation
    • this results in a performance boost: 4 GB of data with 30 fields and all metrics enabled takes 30 minutes
  • metrics that might be time consuming are visually distinguished in the profiler
  • other tweaks and changes that enhance the overall performance and stability

The Next Steps

This is our first step in providing integration of CloverETL Data Profiler with the rest of the CloverETL suite. The next stages will be:

  • full integration with CloverETL Designer that will let you profile any data flow inside your CloverETL graph, and
  • integration with CloverETL Server, where you will be able to set up events based on conditions detected in these data flows inside CloverETL graphs running on the server.