Performance Optimization of Metrics in CloverETL Profiler

The first beta version CloverETL Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides bug fixing and adding a few new features, we also worked on performance optimization of profiling metrics. This article will describe this improvement and how profiling is interconnected with CloverETL Engine.

CloverETL Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets this makes using the operating system memory inappropriate. Therefore, external memory needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets– larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data, if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort components, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as SQL database) are used during profiling.

The overall performance of CloverETL Profiler has improved, especially for large data sets. Even with full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom transform Java code to calculate the metrics. Rollup allows for producing variable amount of output records for any amount of incoming records.

Another performance improvement in second beta version of CloverETL Profiler also affects the metrics that do not require the input data to be sorted. The profiler will now make better use of available CPUs and there will be less CPU time consumed on context-switching. This results in a boost up to 15% for data with a high amount of fields. Also, simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

CloverETL Visions for 2012: Evolution and Revolution in Data Integration

Part one – celebrating 10 years

In 2012, CloverETL will celebrate its 10th anniversary as an open source project. It all started back in 2002. On October 3rd, 2002, version 0.1 was first announced on the Freshmeat (now Freecode) portal. That day, CloverETL’s official life began.

I don’t want to look into Clover’s history too much, though. I do, however, want to take this time to make a few comments about the principles on which CloverETL was established and how these principles continue to determine its future.

Principle number 1: Elegant and robust architecture guarantees a stable foundation

CloverETL started more as a framework on which other projects could be based, rather than as an end-user product with a “sexy” GUI. As a matter of fact, the real GUI was built in 2005, almost three years after the release of first CloverETL engine, which is now present in every tool of the CloverETL family – the Designer, the Server and also CloverETL Profiler.

Even though we are now on version 3.2, there has, so far, only been one change which significantly broke backward compatibility: when we switched from Java 1.4 to Java 1.5 and changed some key interface definitions.

This particular principle is what gives a certain peace of mind to the projects and software products embedding or otherwise deploying Clover, as they know there won’t be any sudden surprises with future versions. It also proves that the original architecture was robust and flexible enough at the outset to support all the later additions and improvements.

Principle number 2: Less is better

CloverETL is based on idea of cooperating components, each specialized with one certain functionality only. However each component is flexible enough to support various “outer” conditions in which the component works.

For example, our UniversalDataReader is meant for parsing text data. The data can come in variations like fixed-length, delimited, or combined; can be read locally or from remote locations; and can be available in plain form or compressed. All these variations are supported, which means that subtle changes, like data becoming available through a different protocol or perhaps being suddenly compressed, require only slight reconfiguration of our DataReader. Contrast this with other players, whose hundreds of different components require architecture changes in transformation (replacement of one component with other) when small shift in input data happens (e.g. due to moving from DEV to PROD environment) and you’ll notice the difference.

It also means that a programmer or analyst designing data transformations in Clover does not need to carry a dictionary of components; a short list covers all possible scenarios.

Principle number 3: Agility is sexy, but long term planning is wise

CloverETL is used in many applications by many customers. Some of them are large, global corporations that embed Clover in their products. Through our OEM program, we work with many customers with a very agile approach to the development of their applications. Some of them have release cycle as short as two weeks where they must  not only develop & debug, but also release new features. Clover’s development team tries to keep up with this sprint, but we still take our time to plan, architect, and develop new, fundamental features to extend CloverETL’s capabilities and help our customers do their jobs faster and simpler.

The reason we insist on thinking through every new feature request, beyond simple tweaks, is that sometimes relatively small and quick change may break compatibility somewhere or prevent future extensions. Whenever our development team touches the core (engine) we make sure the change is properly evaluated from several points of view, including:

  • Backward compatibility – at least at transformation graph level.
  • Performance – Slowdown of just a few percent on big data can mean extra kW of energy consumed by data crunching servers.
  • Future extensibility – We hate deprecating APIs or components just because we might not be able to continue enhancing and improving them.

This principle is further supported by the fact that CloverETL continues to be developed by the same, stable development team year in and out. Many team members have been around since 2005, when the commercial life of Clover began.

Part two – What will appear on the menu in 2012

In short, there will be evolution and, in certain areas, some revolution. We are always sorting out the dilemma of whether to break from the “past” and come up with something completely new and revolutionary – at least in our minds – or continue to improve the old-faithful engine architecture laid out years ago.

As we weren’t able to choose one or the other, we decided to continue improving what works well (and should continue to, even in future) and overhaul some things that have had occasional hiccups with modern data structures and formats brought to us by the CLOUD.

Evolution

Expanding CloverETL OEM program

As CloverETL attracts new OEM customers, we continue improving our OEM program by making it simpler to embed, modify, white-label, or otherwise enhance our technology stack. This includes better documentation, example projects, and extended training.
We are also investing in our support team, which has always strived to provide timely and accurate answers to all support requests submitted through various channels, from e-mail to the technology forum and hotline.

Our support staff is comprised of experienced consultants and programmers who have real-life experience with our technology—they aren’t just people a few manual pages ahead of a user seeking an answer.

GUI – continuous improvement of the user experience

We will continue our effort to make the Designer more and more user-friendly. Our motto is: CloverETL is built by professionals for professionals and, truly, professional DI experts or Java programmers usually give us high marks. Nonetheless, we want to make our technology accessible to the broadest possible audience seeking solutions to certain data needs.

Enhancing CloverETL Cluster – our BigData recipe

These days, BigData is usually mentioned together with Hadoop as the solution. As much as we like Hadoop for various reasons, we have our own recipe for processing BigData, and we think it’s better suited for classical data integration/ETL tasks. It is based on a split/transform/merge idea, where big input data are partitioned and then processed in parallel on multiple nodes of the CloverETL Cluster. The advantage of this, as opposed to Hadoop, is that the transformation may be developed & debugged locally, then easily deployed onto CloverETL Cluster for fast execution. Even if executed in a cluster environment, all the debugging and monitoring options of our Designer are available. It is also worth mentioning that deploying CloverETL Cluster is much easier than setting up the Hadoop cluster.

Our big enhancement of CloverETL Cluster in 2012 will be the merging of our technology with Hadoop – more precisely HDFS filesystem – which should combine the best from both worlds. HDFS provides some cool features, namely robustness and high performance, and we want to utilize its automated data partitioning to make it easier to grow (or shrink) the storage of data depending on actual needs.

Revolution

Rich data structures – trees, unstructured data, etc.

It has to come with age, but I can’t resist and must admire those who devised Cobol and CopyBook. In those times, every byte of storage counted and CPUs were slow, yet programmers were still able to process rich data structures. Then relational databases came and brought the idea of tables and normal forms. Well, today, we are back to rich structures, but this time, we’ve stopped counting bytes or CPU cycles (which has a huge impact on power consumption of servers, but that’s a different story.) That is why XML, JSON, or other rich structures are becoming the norm today.

In order to support these structures and formats as first class passengers, we decided to overhaul our metadata and record storage model and allow direct support of tree structures, multi-values of fields, and even loosely typed data organized in maps/properties collections.

This independently constitutes as a big adventure, as every single piece of our technology platform will be affected, and thus will have to be adapted. The effort will be huge, and necessary regression testing of the whole platform will be endless. Despite this, the prize is enticing: almost any type of data (and the cloud will be bonanza for this) will be 1:1 representable by Clover. That will include XML, JSON, POJO, and complex properties – and, in the future, who knows what else!

—–

We have always claimed that CloverETL is future-proof. Therefore, in 2012, we will be improving our foundations so they withstand the next 10 years.

If what I’ve talked about above is of interest to you, then please stay tuned. We will be publishing more details on our new functionality as we implement it.

For now, I wish everyone a very successful 2012!

A Look Back: CloverETL and Data Integration in 2011

As 2011 comes to a close, we’d like to take the time to reflect on what this year has brought CloverETL, its users, and our customers.

Since CloverETL is, after all, a data integration platform, the world of integration is at our core. We’re constantly striving to challenge ourselves in new ways and improve how we approach data integration. This year was no different.

Enhancing Our Core – Two Upgrades of CloverETL

In the past six months, we released two upgraded versions of CloverETL. CloverETL 3.1, published in June, brought significant changes to the platform in several areas. With a deeper focus on connectivity and enhanced support of various data formats, CloverETL 3.1 helped users better process data with complex structure, emails, and Lotus documents, to name a few. The latest version of CloverETL, version 3.2, offered further enhancements to the user experience, as well as improved the processing of large data records.

Data Integration Meets Data Quality – CloverETL Profiler

This year was also a year for new products. With Clover, we’ve moved forward with an evolved sense of the data world. Because data integration, data quality, and other data disciplines are becoming more and more intertwined, we developed the CloverETL Profiler, data profiling application. Released in beta back in October, the profiler helps users make informed decisions on how to improve the quality of transformed data, which is particularly useful as precursor to a greater data integration projects. CloverETL also integrates more easily with the AddressDoctor solution to improve the quality of geographical information.

Strengthening CloverETL Presence in the US Market

In June, Javlin, the developer of CloverETL, opened up its new office in the Washington D.C. area, which became the headquarters of Javlin Inc., our US presence. Javlin Inc., with both a dedicated sales and customer service force, brings Clover to a whole new market of possibilities.

Last but not least, we are pleased to see that our OEM data integration offer will have a number of important implementations in the upcoming year. (But more on that later. Stay tuned.)

As we leave 2011, we can say that this past year was a whirlwind of hard work, exciting releases, and interesting customers and stories. We’re looking forward to another great year with CloverETL. Cheers to the New Year.

CloverETL Profiler: Under the Hood of Data Profiling Application

Recently we’ve released the second public beta release of a new member of our CloverETL family – CloverETL Profiler. With this article we’d like to share an overview of the technical architecture of CloverETL Profiler – the why and how of the design.

CloverETL Profiler is a data profiling application, i.e. it provides the users with various information about their data, such as average, number of empty values , histogram-like charts etc. When designing the application we had several goals in mind:

  • Performance – Data of interest can be huge. The profiling of them needs to be reasonably quick so that the user does not need to wait for results for too long – we want profiling to be an interactive process, not a batch job.
  • CloverETL Integration – Data profiling is often just one part of a more complex workflow – for example, it can be the starting point of a data integration project.  We want to make the transition from data profiling to the following steps in data integration with CloverETL as smooth as possible.
  • Web presentation – An important group of users of data profiling tools are people interested in just viewing the profiling results; they don’t need to design or execute data profiling jobs. Presenting the profiling results on web gives such users a very simple way to access the results.

CloverETL Profiler Architecture

Let’s have a look at the main building blocks of CloverETL Profiler:

  • CloverETL Profiler GUI – This is the graphical tool where users create and execute profiling jobs. We’ll talk about the jobs later in more detail, but basically they describe a data source to profile (e.g. file, database etc.) and how to profile it using metrics (e.g. calculate maximum value on a database column). The GUI is based on the Eclipse RCP platform and reuses components of CloverETL Designer – this is the obvious way for the reuse of GUIs for concepts common with CloverETL (e.g. metadata, database connections, data analytics etc.) and it paves way for future deep integration.
  • CloverETL Profiler Engine – Profiling jobs created and edited in the CloverETL Profiler GUI are stored as ordinary XML files, similarly to CloverETL graphs. The profiler engine takes these jobs, analyzes them, and automatically generates and runs a CloverETL graph. The CloverETL graph performs the actual profiling – so the CloverETL Profiler Engine is a relatively thin wrapper on top of CloverETL Engine.
  •  CloverETL Engine – As described above, the real work of data profiling is performed by CloverETL Engine that runs a specially designed transformation graph. Because of this, the profiler gains access to the wide feature set and great performance provided by the CloverETL Engine.
  • Result Storage – CloverETL graphs performing the profiling store the results in a Result Storage database. The database can be an embedded Derby database for single user scenarios or a shared external database for multiple users. The shared external database can be used in a scenario where multiple users create and execute profiling jobs and at the same time, need to share their results. We support a range of databases for the storage of results – Oracle, MySQL, PostgreSQL and MSSQL.
  • Reporting Server – As observing profiling results is done via a web interface, we have a Reporting Server that both serves the web content and provides it with the profiling results stored in ResultsStorage. The Reporting Server isolates the web interface from details of the Result Storage by providing the profiling results as JSON via a REST interface. We kept the Reporting Server intentionally simple – more complex server-side functionality such as scheduling will be reused by integration with CloverETL Server.
  • Reporting Console – This is the tool for observing the profiling results – values of metrics, charts of the data, troubleshooting details about rejected records etc. The Reporting Console is an AJAX web application based on HTML, JavaScript and related standard technologies. Because it’s a web application, it’s very easy for anyone to view the results – there’s no need to install a special application; you only need a reasonably modern web browser.

All parts of CloverETL Profiler are currently bundled in one simple-to-use package as a standalone application – just start the CloverETL Profiler and it automatically launches an embedded Result Storage, Reporting Server etc. However we’ve laid the foundations for separating the building blocks for bigger deployment scenarios and better integration with the rest of CloverETL family.

Data Profiling Process

Now let’s have a look on the nitty-gritty details of actually profiling data. The basic premise is that we already have all the tools needed for profiling in CloverETL – a transformation graph has all the expressive power needed for profiling of data:

  • Reading of a data source – CloverETL has a wide range of reader components and connectors to 3rd party systems.
  • Calculating metrics – Transformations in CloverETL are extensible by Java code or by scripts in our proprietary scripting language CTL.
  • Storing results – This is just writing to a database which is one of the basic use cases of CloverETL.

So anyone could manually create a CloverETL graph that profiles data. But it’s quite a complex task which would take the user’s focus away from the core: what data source does he want to profile and which metrics does he want to use. In CloverETL Profiler the user describes this core information in a profiling job, and then the CloverETL Profiler Engine transforms it into a CloverETL transformation graph. The profiling job is defined in a relatively simple XML file that can be edited in a graphical editor. The job primarily contains the following information:

  • Data source – Description of data source to profile – path to a file, definition of a database connection etc.
  • Metadata – Description of the data source structure by CloverETL metadata
  • Metrics – Multiple metrics can be enabled on each field of the metadata, e.g. minimum value, longest string, median etc.

The above picture demonstrates the process of running a profiling job:

  • The profiling job is stored in an XML file that contains all required information. The job is created and edited in the CloverETL Profiler GUI.
  • The job XML file is taken by the CloverETL Profiler Engine which creates a CloverETL transformation graph and runs it.
  • The picture of the transformation graph contains an actual profiling graph which reads the data source, stores the results and, most importantly, calculates metrics in parallel branches. Details of the branching depend on the kind of metrics used: some can be computed on-the-fly on the original data, while some require all data to be sorted (e.g. median). The number of branches depends on the selection of metrics and the number of available CPU cores – we optimize the graph for high performance.

This article is a brief introduction into the architecture of CloverETL Profiler. As you can see, we’ve saved a lot of effort by using the power of CloverETL at the core. CloverETL Profiler can be also seen as a successful example of embedding CloverETL.

A Second Wind: New Features Strengthen CloverETL Profiler

In Data Profiling with CloverETL Profiler beta, we discussed the value of the new product that will soon enrich our product portfolio – CloverETL Profiler. The beta testing has been up and running, and so far we have received a lot of valuable feedback from the first beta build that allows us to further enhance the features and usability of CloverETL Profiler. Let’s see what the new build offers.

Integration with CloverETL Designer

The new build of the CloverETL Profiler introduces a key feature – the first step toward integration with the CloverETL suite. If you are interested in taking part in the development and feedback process of CloverETL products or the beta testing, you can register and get the new build.

CloverETL Profiler has several important use cases, one being its integral role in analysis in the early stages of your projects. The insight into data gained from CloverETL Profiler is valuable for project planning, illustrating the importance of clean data, while offering some direction in the design of your ETL process.

Aside from gathering information during the definition of the profiling job, with this build’s new feature, you are also preparing your resources to use later in the CloverETL Designer. This work includes: defining the database connection and metadata that describe the schemas of your data source. The profiler may assist you in this process. It is especially useful for data sources that don’t include schema as their integral part, such as CSV files.

Exporting Metadata and Transformation

The second beta build of CloverETL Profiler includes the option to export both database connection metadata and a sample CloverETL transformation graph containing the reader that accessed the data source you previously defined. These are ready to be used in CloverETL Designer. You can see official CloverETL Profiler documentation to learn more about this feature.

Other Improvements in the New Build

Here is a short summary of the most notable improvements in the second build:

  • export to CloverETL Designer graph
  • visual enhancement of CloverETL Profiler job
  • removed dependency on internal database for metrics calculation
    • this results in performance boost, 4 GB of data with 30 fields and all metrics enabled takes 30 minutes
  • metrics that might be time consuming are visually distinguished in profiler
  • other tweaks and changes that enhance the overall performance and stability

The Next Steps

This is our first step in providing integration of CloverETL Profiler with the rest of the CloverETL suite. The next stages will be:

  • full integration with CloverETL Designer that will allow you the capability to profile any data flow inside your CloverETL graph, and
  • integration with CloverETL Server, where you will be able to set up events based on conditions detected in these data flows inside CloverETL graphs running on the server.

Lotus Notes Integration – export or exchange data easily

Lotus Domino (formerly Lotus Notes) is an IBM server product that provides enterprise-grade e-mail, collaboration capabilities, and a custom application platform. The platform is used by many large corporations, and most of them also employ other enterprise systems that need to exchange data with the Lotus platform. This is where CloverETL can now be helpful.

Since version 3.1 CloverETL is able to connect to Lotus Domino servers and read and write its data, CloverETL’s Lotus Notes components help you connect the data-record CloverETL world with key-value concepts of Lotus Notes databases. This connectivity allows a lot easier exchange of Lotus data with Excel, relational databases or even with various WebServices. CloverETL Designer enables even less technical persons to setup connectivity to Lotus Domino and create simple data extraction or data ingest apps.

Lotus Notes (Domino) database stores data in documents. A document contains key-value pairs where the key is always a string and the value can be of various formats and types (number, string, image, OLE object, RTF, …). The data in Lotus Notes are therefore structured differently than in classical data-record world of CloverETL. Typically there are no guarantees of what values with what keys will be present in a document.

To overcome this gap, the concept of Lotus Notes Views is used to our advantage. A view in Lotus Notes provides tabular structure for the data in Lotus Notes database. Therefore, to read the data it is necessary to first define a view through which the data will be exported. This task needs to be done by the manager of the Lotus Notes database.

Finally, in the Lotus Domino server, remote access needs to be enabled. This is typically done by issuing the following commands in the server console:

load http load diiop

Connecting to Lotus Notes from CloverETL

The first task on the side of CloverETL is to define a Lotus Notes connection. The connection specifies the address of the Lotus Domino server and user credentials that will be used to log in. The user’s name should be in the format that can be found in the Person document in the Domino Directory. The database can be either specified by its file name (typically a *.nsf file) or by the Replica ID value.

Finally, a path to the Lotus Notes connection drivers (NCSO.jar) needs to be specified. These drivers are not part of CloverETL and can usually be found in your Lotus Domino server installation, e.g. C:\Program Files\Lotus\Domino\data\domino\java\NSCO.jar

Creating metadata

After the connection is prepared, the next step is to create metadata that will be used for reading data from Lotus Notes.

First, choose the connection that you’ve created in the previous steps and also, at the bottom of the dialog, specify the view which will provide the input data. On the following page of the wizard you will see new metadata extracted from the view. Based on the preview of the data you can then manually fine-tune the types of the fields, as initially all field types will be set to string.

Reading from Lotus Notes

LotusReader is the name of the CloverETL component which reads data from Lotus Notes views. Thanks to the preparation steps above, configuring this component is fairly simple. Place this component into your graph, open its configuration and set the Domino connection created in previous steps and select the view from which the data will be extracted. Finally, connect this reader to your graph with an edge and assign metadata we extracted in the steps above.

Writing to Lotus Notes

Complementary to the reader component, there is also a component to import data into Lotus Notes. Basic configuration of the LotusWriter component is quite simple too. In the following paragraphs, we will pay better attention to the different ways the data can be written to Lotus Notes.

The default mode to write the data is insert. In this mode, LotusWriter will create new document in Lotus Notes database for every incoming record. After inserting the document, it will also launch form calculations. This is a procedure defined in the Lotus Notes database by the author of the database that can for example fill-in missing fields, fix formatting of specific string fields, check various constraints, and so on.

The computation can mark a record as invalid in the case of errors during the computation. You can choose to further process these records (e.g. dump them to a log file) by connecting any writer component to the only output port of LotusWriter. You also have the option not to save the invalid records, by enabling the Skip invalid documents option. By default, even documents marked as invalid will be stored into the Lotus Notes database.

The other write mode is update. In this mode, the LotusWriter tries to update existing documents in the database, instead of inserting new ones. The component will look up the documents by searching the view specified in the component configuration (required parameter for update mode). Each view marks certain columns as sorted. These columns will be used too look up the documents to update.

There are two more advanced and in certain occasions quite important concepts in writing to Lotus Notes database with CloverETL. The first one is lazy update replication, the other one is custom field mapping.

For the update mode, after the document fields are updated, save operation will be commenced. This operation may also launch the optional replication of the document. However, in case there were no actual changes in the contents of the document, the replication would be unnecessary. Therefore, the writer component uses lazy update mechanism – it will only launch the save operation on the document, if some of the fields’ values actually changed. This can save significant amount of time for large data manipulations.

Finally, the last feature that will be described here is custom field mapping. By default, the LotusWriter component will create documents with values based on the names of the input metadata fields. However, it might be preferable to create documents with the help of a Form. Forms often specify fields documents should contain and are a common way how to create new documents in Lotus Notes.

The mapping dialog helps you to map fields of input metadata to the fields of selected Form. This allows writing single field multiple times, skip fields, but most importantly choose proper names for the target document fields.

Data Profiling with CloverETL Profiler beta

The process of data integration, data migration, consolidation and other data manipulation projects consists of a variety of steps and tasks. Javlin supports many critical tasks within these projects with a versatile ETL tool that provides technical solutions to transform data and connect different systems and data sources with various data formats.

However, there are additional aspects in order to achieve success with data processing projects. One important project phase to ensure future success is the analysis phase. At Javlin, we are now on the verge of introducing new tool that will boost productivity in the early stages of a project. This product is our new data profiling tool – CloverETL Profiler.

CloverETL Profiler helps give you insight into your data. By creating a profile of your data using various metrics, you can get the big picture of the current condition of your data. This information has a large impact on the next stages of your project.

Real World Use – Data Warehousing Project

Imagine you are project manager with a complex task at hand – building a data warehouse. You have couple of data sources – database tables, CSV files, log files you want to analyze and probably some others. After you define the basic requirements for your data warehouse and its structure, you need to check the condition of your source data.

As you use the CloverETL Profiler, you will gain:

  • insight into the quality of your data indicating whether you need the perform a data cleansing phase in your project:
    • detect what data are missing (high percentage of missing or invalid values) looking for a different source
    • you may include data cleansing tools in the project budget and reserve tasks in your project plan
  • information about the structure of data in all your sources:
    • detect the need for ETL tools on specific data sources to transform your input data into a common format

Instead of writing your own tools and testing the data in complex way, you are just a few clicks away from creating detailed profile of your data.

Product Vision

The usage of the CloverETL Profiler tool for the project analysis phase is just the beginning. As part of our vision for the future, we are on the path to create a fully integrated product in CloverETL Designer and CloverETL Server family.

Its main purposes will be to aid:

  • CloverETL Designer: to assist the development of ETL transformation by allowing the creation of profiles of data in any stage of the transformation and checking the results during the debugging process
  • CloverETL Server: to constantly monitor the data condition on the production server  and invoke actions based on the specific condition detected in the profile of your data

Beta Program

Right now, we are excited to announce the beginning of our beta program for CloverETL Profiler. We invite you, people from the data integration community, to join the testing phase; it’s a great opportunity to not only have a look behind the scenes, but to also be a part of the development process by contributing your insights. For more information, please visit www.cloveretl.com/products/profiler.

Exporting Data Transformation Projects to CloverETL Server

CloverETL Designer in its full or trial version provides integration with CloverETL Server. The CloverETL Server serves as an ETL runtime environment and brings such enterprise features as automation, workflows, monitoring, user management and many others. The integration allows users to design and maintain data integration Server projects locally with their Designers. However, sometimes you may find yourself in a situation when you need to export and deploy a project originally developed locally on your computer to the Server. A quick how-to is described below.

1. Select File > Export

2. Select Export to CloverETL Server sandbox

3. Select desired projects
4. In case you want to export a project to our demo server you can select it from a combo box, or type in URL of your CloverETL Server. Enter a username and password (clover/clover for the demo server).
5. Click the Reload button to load available sandboxes and select a desired one (playground1 or playground2 for the demo server. Other demo sandboxes are readonly).
6. Click Finish

7. Check the exported project in the CloverETL Server under Sandboxes.

Warning: Graphs, including their parameters are copied to the Server (i.e. file paths.) These parameters needs to be adjusted.

Handling Errors in Heterogeneous Input Data

ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, all data cannot be read easily even if you spend a lot of time configuring the component. Sometimes you need to think in advance: What if you come across unknown metadata you have not handled? Normally, the graph crashes.

This post will examine a way of preventing that or, more specifically, how to handle errors in input data.

Example Input Data

Input Data

What We Will Do

We can instantly distinguish three kinds of metadata on the input: product, product_range and service. ComplexDataReader is the best component to parse these using three states of a state machine. As you can see, there is one line that does not fit into the data. The magic trick of this example lies in preparing one extra state – the error state. The state will be responsible for “catching” all incorrect data which would cause the component to fail. In order to be able to decide which data are “bad,” or, more precisely, when to switch to the error state, you have to write a custom Selector class in Java. The idea behind the code is very simple and will be explained below:

“Prep Work”

First, we need to prepare metadata for all three states of the state machine plus one extra. The extra metadata will represent error lines on the input we need to “throw away.”

Second, do not forget to connect the component to its succeeding components and assign metadata to output edges.

Third, set the “File URL” property to point the component to the input file.

Here are the three aforementioned metadata:

Metadata: Product

Metadata: Service

Metadata: Product Range

And one extra metadata for error lines:

Metadata for Error Lines

Designing  State Machine

We are going to create four states:

Note: There are no transition edges to be seen in the graph. It is because the Selector itself will decide when to change between states.

Start configuring the component via the “Transform” property. Create four states corresponding to the metadata and set “Initial state” to “Let selector decide”:

Switch to state “$0 product” and define its output mapping. In this state, we will send all fields to the output. Thus, drag state $0 to the “Value” column in the right-hand pane. You will produce the “$0.*” directive. In the “Transition table”, switch “Target state” to “Let selector decide”:

Repeat the same procedure for all remaining states (including the error state). Always send everything to the output port and “Let selector decide” about the target state:

Writing Custom Selector

We are now going to prepare a Java class that will do the magic of this example – switch between states “$0 product”, “$1 service”, “$2 product_range” and the “$3 error” state in case there are errors on reading. This particular prefix Selector will assume there is another record on the following line(s) and will try to read it. If there really is a new record, we can recover from the error line and carry on reading.

You can prepare the Java class in any editor of your choice. After writing it, just remember to place it into the “trans” folder of your project. On that condition, CloverETL will automatically compile the class for you.

The Selector class will look like this:

public class CustomPrefixInputMetadataSelector1 extends com.opensys.cloveretl.component.complexdatareader.PrefixInputMetadataSelector {

	private static final int DEFAULT = 3;

	@Override
	public int select(int prevState) {
		int result = super.select(prevState);
		if(result == org.jetel.component.RecordTransform.ALL) {
			return DEFAULT;
		}
		return result;
	}
}

A few comments concerning the code:

  • int result = super.select(prevState);
    First, we try to call the default selector and store the number of the next state into result.
  • if(result == org.jetel.component.RecordTransform.ALL)
    And if the default selector cannot decide…
  • return DEFAULT;
    We return the default state number – number 3. This is the error state.

Now that you are done with the code, switch to the “Selector” tab in “State transitions”. In “Selector URL”, browse for your custom Selector. Notice that after you specify its location, the “Selector properties” area changes:

Conclusions & Pitfalls

In this article, we have presented a way of handling flaws in the input data. We have been capable of addressing a situation when the selector looks on the following metadata and cannot decide which state goes next.

However, there are numerous cases when you just cannot prevent reading errors from occurring. For instance, even if the selector recognizes the following metadata but then fails on parsing them, we cannot react and the graph fails. You can imagine that as a file whose field types suddenly change, (e.g. from integer to date – the selector starts parsing an integer and crashes). Another known case we cannot handle is changeable number of fields in one record. If new fields occur or their number decreases, the graph execution fails. The only exception to this are fields added at the end of a record. These can be handled with the help of lenient data policy.

Download a complete CloverETL project – error handling in ComplexDataReader

Usability Improvements in CloverETL 3.1

One of the most noticeable set of changes in CloverETL version 3.1 is the interface improvements, substantially improving Clover’s usability and understandability. These improvements save both new and old users valuable time when creating or manipulating their data transformation graphs and further cement CloverETL’s place as one of the most easy to use ETL tools on the market.

The biggest improvement was the addition of drag-and-drop functionality to a number of different aspects of Clover. You can drag files to the graph, files to components, files to metadata, and metadata to edges, saving innumerable clicks through menus.

We have also made it easier to link your metadata and edges while creating the edges. If you right-click on the Edge tool in your palette, it will give you a list of every metadata you have created on the current graph. If you select one of the metadata, whenever you create an edge with the edge tool, it will automatically assign that metadata to the edge.

Not only is it easier to link metadata and edges, we’ve also made it easier to create and manipulate the edges themselves. Edges can now be created simply by dragging from one component’s out port to another’s in port. If you find you want to change where the edge is connected, that too is now one-click. Simply click and drag an edge’s endpoints to any other port.

The last shortcut that version 3.1 added to CloverETL is an easier way to set the description on a component. Before, the description field was buried in the component’s properties, but now it has been moved to the header of the properties window. This improvement makes it substantially easier to clarify the purpose of your components, making your graph easier to read overall.