Tag Archives: CloverETL Profiler

Performance Optimization of Metrics in CloverETL Data Profiler

The first beta version CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides bug fixing and adding a few new features, we also worked on performance optimization of profiling metrics. This article will describe this improvement and how profiling is interconnected with CloverETL Engine.

The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets this makes using the operating system memory inappropriate. Therefore, external memory needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets– larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data, if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort components, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as SQL database) are used during profiling.

The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom transform Java code to calculate the metrics. Rollup allows for producing variable amount of output records for any amount of incoming records.

Another performance improvement in second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The profiler will now make better use of available CPUs and there will be less CPU time consumed on context-switching. This results in a boost up to 15% for data with a high amount of fields. Also, simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

A Look Back: CloverETL and Data Integration in 2011

As 2011 comes to a close, we’d like to take the time to reflect on what this year has brought CloverETL, its users, and our customers.

Since CloverETL is, after all, a data integration platform, the world of integration is at our core. We’re constantly striving to challenge ourselves in new ways and improve how we approach data integration. This year was no different.

Enhancing Our Core – Two Upgrades of CloverETL

In the past six months, we released two upgraded versions of CloverETL. CloverETL 3.1, published in June, brought significant changes to the platform in several areas. With a deeper focus on connectivity and enhanced support of various data formats, CloverETL 3.1 helped users better process data with complex structure, emails, and Lotus documents, to name a few. The latest version of CloverETL, version 3.2, offered further enhancements to the user experience, as well as improved the processing of large data records.

Data Integration Meets Data Quality – CloverETL Profiler

This year was also a year for new products. With Clover, we’ve moved forward with an evolved sense of the data world. Because data integration, data quality, and other data disciplines are becoming more and more intertwined, we developed the CloverETL Profiler, data profiling application. Released in beta back in October, the profiler helps users make informed decisions on how to improve the quality of transformed data, which is particularly useful as precursor to a greater data integration projects. CloverETL also integrates more easily with the AddressDoctor solution to improve the quality of geographical information.

Strengthening CloverETL Presence in the US Market

In June, Javlin, the developer of CloverETL, opened up its new office in the Washington D.C. area, which became the headquarters of Javlin Inc., our US presence. Javlin Inc., with both a dedicated sales and customer service force, brings Clover to a whole new market of possibilities.

Last but not least, we are pleased to see that our OEM data integration offer will have a number of important implementations in the upcoming year. (But more on that later. Stay tuned.)

As we leave 2011, we can say that this past year was a whirlwind of hard work, exciting releases, and interesting customers and stories. We’re looking forward to another great year with CloverETL. Cheers to the New Year.

CloverETL Data Profiler: Under the Hood of Data Profiling Application

Recently we’ve released the second public beta release of a new member of our CloverETL family – CloverETL Data Profiler. With this article we’d like to share an overview of the technical architecture of CloverETL Profiler – the why and how of the design.

The CloverETL Data Profiler is a data profiling application, i.e. it provides the users with various information about their data, such as average, number of empty values , histogram-like charts etc. When designing the application we had several goals in mind:

  • Performance – Data of interest can be huge. The profiling of them needs to be reasonably quick so that the user does not need to wait for results for too long – we want profiling to be an interactive process, not a batch job.
  • CloverETL Integration – Data profiling is often just one part of a more complex workflow – for example, it can be the starting point of a data integration project.  We want to make the transition from data profiling to the following steps in data integration with CloverETL as smooth as possible.
  • Web presentation – An important group of users of data profiling tools are people interested in just viewing the profiling results; they don’t need to design or execute data profiling jobs. Presenting the profiling results on web gives such users a very simple way to access the results.

CloverETL Profiler Architecture

Let’s have a look at the main building blocks of CloverETL Data Profiler:

  • CloverETL Data Profiler GUI – This is the graphical tool where users create and execute profiling jobs. We’ll talk about the jobs later in more detail, but basically they describe a data source to profile (e.g. file, database etc.) and how to profile it using metrics (e.g. calculate maximum value on a database column). The GUI is based on the Eclipse RCP platform and reuses components of CloverETL Designer – this is the obvious way for the reuse of GUIs for concepts common with CloverETL (e.g. metadata, database connections, data analytics etc.) and it paves way for future deep integration.
  • CloverETL Data Profiler Engine – Profiling jobs created and edited in the CloverETL Profiler GUI are stored as ordinary XML files, similarly to CloverETL graphs. The profiler engine takes these jobs, analyzes them, and automatically generates and runs a CloverETL graph. The CloverETL graph performs the actual profiling – so the CloverETL Data Profiler Engine is a relatively thin wrapper on top of CloverETL Engine.
  •  CloverETL Engine – As described above, the real work of data profiling is performed by CloverETL Engine that runs a specially designed transformation graph. Because of this, the profiler gains access to the wide feature set and great performance provided by the CloverETL Engine.
  • Result Storage – CloverETL graphs performing the profiling store the results in a Result Storage database. The database can be an embedded Derby database for single user scenarios or a shared external database for multiple users. The shared external database can be used in a scenario where multiple users create and execute profiling jobs and at the same time, need to share their results. We support a range of databases for the storage of results – Oracle, MySQL, PostgreSQL and MSSQL.
  • Reporting Server – As observing profiling results is done via a web interface, we have a Reporting Server that both serves the web content and provides it with the profiling results stored in ResultsStorage. The Reporting Server isolates the web interface from details of the Result Storage by providing the profiling results as JSON via a REST interface. We kept the Reporting Server intentionally simple – more complex server-side functionality such as scheduling will be reused by integration with CloverETL Server.
  • Reporting Console – This is the tool for observing the profiling results – values of metrics, charts of the data, troubleshooting details about rejected records etc. The Reporting Console is an AJAX web application based on HTML, JavaScript and related standard technologies. Because it’s a web application, it’s very easy for anyone to view the results – there’s no need to install a special application; you only need a reasonably modern web browser.

All parts of CloverETL  DataProfiler are currently bundled in one simple-to-use package as a standalone application – just start the CloverETL Data Profiler and it automatically launches an embedded Result Storage, Reporting Server etc. However we’ve laid the foundations for separating the building blocks for bigger deployment scenarios and better integration with the rest of CloverETL family.

Data Profiling Process

Now let’s have a look on the nitty-gritty details of actually profiling data. The basic premise is that we already have all the tools needed for profiling in CloverETL – a transformation graph has all the expressive power needed for profiling of data:

  • Reading of a data source – CloverETL has a wide range of reader components and connectors to 3rd party systems.
  • Calculating metrics – Transformations in CloverETL are extensible by Java code or by scripts in our proprietary scripting language CTL.
  • Storing results – This is just writing to a database which is one of the basic use cases of CloverETL.

So anyone could manually create a CloverETL graph that profiles data. But it’s quite a complex task which would take the user’s focus away from the core: what data source does he want to profile and which metrics does he want to use. In the CloverETL Data Profiler, the user describes this core information in a profiling job, and then the CloverETL Profiler Engine transforms it into a CloverETL transformation graph. The profiling job is defined in a relatively simple XML file that can be edited in a graphical editor. The job primarily contains the following information:

  • Data source – Description of data source to profile – path to a file, definition of a database connection etc.
  • Metadata – Description of the data source structure by CloverETL metadata
  • Metrics – Multiple metrics can be enabled on each field of the metadata, e.g. minimum value, longest string, median etc.

The above picture demonstrates the process of running a profiling job:

  • The profiling job is stored in an XML file that contains all required information. The job is created and edited in the CloverETL Profiler GUI.
  • The job XML file is taken by the CloverETL Profiler Engine which creates a CloverETL transformation graph and runs it.
  • The picture of the transformation graph contains an actual profiling graph which reads the data source, stores the results and, most importantly, calculates metrics in parallel branches. Details of the branching depend on the kind of metrics used: some can be computed on-the-fly on the original data, while some require all data to be sorted (e.g. median). The number of branches depends on the selection of metrics and the number of available CPU cores – we optimize the graph for high performance.

This article is a brief introduction into the architecture of CloverETL Data Profiler. As you can see, we’ve saved a lot of effort by using the power of CloverETL at the core. The CloverETL Data Profiler can be also seen as a successful example of embedding CloverETL.

A Second Wind: New Features Strengthen CloverETL Data Profiler

In Data Profiling with CloverETL Data Profiler beta, we discussed the value of the new product that will soon enrich our product portfolio – CloverETL Data Profiler. The beta testing has been up and running, and so far we have received a lot of valuable feedback from the first beta build that allows us to further enhance the features and usability of CloverETL Data Profiler. Let’s see what the new build offers.

Integration with CloverETL Designer

The new build of the CloverETL Data Profiler introduces a key feature – the first step toward integration with the CloverETL suite. If you are interested in taking part in the development and feedback process of CloverETL products or the beta testing, you can register and get the new build.

The CloverETL Data Profiler has several important use cases, one being its integral role in analysis in the early stages of your projects. The insight into data gained from CloverETL Profiler is valuable for project planning, illustrating the importance of clean data, while offering some direction in the design of your ETL process.

Aside from gathering information during the definition of the profiling job, with this build’s new feature, you are also preparing your resources to use later in the CloverETL Designer. This work includes: defining the database connection and metadata that describe the schemas of your data source. The profiler may assist you in this process. It is especially useful for data sources that don’t include schema as their integral part, such as CSV files.

Exporting Metadata and Transformation

The second beta build of CloverETL Data Profiler includes the option to export both database connection metadata and a sample CloverETL transformation graph containing the reader that accessed the data source you previously defined. These are ready to be used in CloverETL Designer. You can see official CloverETL Data Profiler documentation to learn more about this feature.

Other Improvements in the New Build

Here is a short summary of the most notable improvements in the second build:

  • export to CloverETL Designer graph
  • visual enhancement of CloverETL Data Profiler job
  • removed dependency on internal database for metrics calculation
    • this results in performance boost, 4 GB of data with 30 fields and all metrics enabled takes 30 minutes
  • metrics that might be time consuming are visually distinguished in profiler
  • other tweaks and changes that enhance the overall performance and stability

The Next Steps

This is our first step in providing integration of CloverETL Data Profiler with the rest of the CloverETL suite. The next stages will be:

  • full integration with CloverETL Designer that will allow you the capability to profile any data flow inside your CloverETL graph, and
  • integration with CloverETL Server, where you will be able to set up events based on conditions detected in these data flows inside CloverETL graphs running on the server.

Data Profiling with CloverETL Profiler beta

The process of data integration, data migration, consolidation and other data manipulation projects consists of a variety of steps and tasks. Javlin supports many critical tasks within these projects with a versatile ETL tool that provides technical solutions to transform data and connect different systems and data sources with various data formats.

However, there are additional aspects in order to achieve success with data processing projects. One important project phase to ensure future success is the analysis phase. At Javlin, we are now on the verge of introducing new tool that will boost productivity in the early stages of a project. This product is our new data profiling tool – CloverETL Profiler.

CloverETL Profiler helps give you insight into your data. By creating a profile of your data using various metrics, you can get the big picture of the current condition of your data. This information has a large impact on the next stages of your project.

Real World Use – Data Warehousing Project

Imagine you are project manager with a complex task at hand – building a data warehouse. You have couple of data sources – database tables, CSV files, log files you want to analyze and probably some others. After you define the basic requirements for your data warehouse and its structure, you need to check the condition of your source data.

As you use the CloverETL Profiler, you will gain:

  • insight into the quality of your data indicating whether you need the perform a data cleansing phase in your project:
    • detect what data are missing (high percentage of missing or invalid values) looking for a different source
    • you may include data cleansing tools in the project budget and reserve tasks in your project plan
  • information about the structure of data in all your sources:
    • detect the need for ETL tools on specific data sources to transform your input data into a common format

Instead of writing your own tools and testing the data in complex way, you are just a few clicks away from creating detailed profile of your data.

Product Vision

The usage of the CloverETL Profiler tool for the project analysis phase is just the beginning. As part of our vision for the future, we are on the path to create a fully integrated product in CloverETL Designer and CloverETL Server family.

Its main purposes will be to aid:

  • CloverETL Designer: to assist the development of ETL transformation by allowing the creation of profiles of data in any stage of the transformation and checking the results during the debugging process
  • CloverETL Server: to constantly monitor the data condition on the production server  and invoke actions based on the specific condition detected in the profile of your data

Beta Program

Right now, we are excited to announce the beginning of our beta program for CloverETL Profiler. We invite you, people from the data integration community, to join the testing phase; it’s a great opportunity to not only have a look behind the scenes, but to also be a part of the development process by contributing your insights. For more information, please visit www.cloveretl.com/products/profiler.