Performance Optimization of Metrics in CloverETL Profiler

The first beta version CloverETL Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides bug fixing and adding a few new features, we also worked on performance optimization of profiling metrics. This article will describe this improvement and how profiling is interconnected with CloverETL Engine.

CloverETL Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets this makes using the operating system memory inappropriate. Therefore, external memory needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets– larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data, if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort components, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as SQL database) are used during profiling.

The overall performance of CloverETL Profiler has improved, especially for large data sets. Even with full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom transform Java code to calculate the metrics. Rollup allows for producing variable amount of output records for any amount of incoming records.

Another performance improvement in second beta version of CloverETL Profiler also affects the metrics that do not require the input data to be sorted. The profiler will now make better use of available CPUs and there will be less CPU time consumed on context-switching. This results in a boost up to 15% for data with a high amount of fields. Also, simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

Lotus Notes Integration – export or exchange data easily

Lotus Domino (formerly Lotus Notes) is an IBM server product that provides enterprise-grade e-mail, collaboration capabilities, and a custom application platform. The platform is used by many large corporations, and most of them also employ other enterprise systems that need to exchange data with the Lotus platform. This is where CloverETL can now be helpful.

Since version 3.1 CloverETL is able to connect to Lotus Domino servers and read and write its data, CloverETL’s Lotus Notes components help you connect the data-record CloverETL world with key-value concepts of Lotus Notes databases. This connectivity allows a lot easier exchange of Lotus data with Excel, relational databases or even with various WebServices. CloverETL Designer enables even less technical persons to setup connectivity to Lotus Domino and create simple data extraction or data ingest apps.

Lotus Notes (Domino) database stores data in documents. A document contains key-value pairs where the key is always a string and the value can be of various formats and types (number, string, image, OLE object, RTF, …). The data in Lotus Notes are therefore structured differently than in classical data-record world of CloverETL. Typically there are no guarantees of what values with what keys will be present in a document.

To overcome this gap, the concept of Lotus Notes Views is used to our advantage. A view in Lotus Notes provides tabular structure for the data in Lotus Notes database. Therefore, to read the data it is necessary to first define a view through which the data will be exported. This task needs to be done by the manager of the Lotus Notes database.

Finally, in the Lotus Domino server, remote access needs to be enabled. This is typically done by issuing the following commands in the server console:

load http load diiop

Connecting to Lotus Notes from CloverETL

The first task on the side of CloverETL is to define a Lotus Notes connection. The connection specifies the address of the Lotus Domino server and user credentials that will be used to log in. The user’s name should be in the format that can be found in the Person document in the Domino Directory. The database can be either specified by its file name (typically a *.nsf file) or by the Replica ID value.

Finally, a path to the Lotus Notes connection drivers (NCSO.jar) needs to be specified. These drivers are not part of CloverETL and can usually be found in your Lotus Domino server installation, e.g. C:\Program Files\Lotus\Domino\data\domino\java\NSCO.jar

Creating metadata

After the connection is prepared, the next step is to create metadata that will be used for reading data from Lotus Notes.

First, choose the connection that you’ve created in the previous steps and also, at the bottom of the dialog, specify the view which will provide the input data. On the following page of the wizard you will see new metadata extracted from the view. Based on the preview of the data you can then manually fine-tune the types of the fields, as initially all field types will be set to string.

Reading from Lotus Notes

LotusReader is the name of the CloverETL component which reads data from Lotus Notes views. Thanks to the preparation steps above, configuring this component is fairly simple. Place this component into your graph, open its configuration and set the Domino connection created in previous steps and select the view from which the data will be extracted. Finally, connect this reader to your graph with an edge and assign metadata we extracted in the steps above.

Writing to Lotus Notes

Complementary to the reader component, there is also a component to import data into Lotus Notes. Basic configuration of the LotusWriter component is quite simple too. In the following paragraphs, we will pay better attention to the different ways the data can be written to Lotus Notes.

The default mode to write the data is insert. In this mode, LotusWriter will create new document in Lotus Notes database for every incoming record. After inserting the document, it will also launch form calculations. This is a procedure defined in the Lotus Notes database by the author of the database that can for example fill-in missing fields, fix formatting of specific string fields, check various constraints, and so on.

The computation can mark a record as invalid in the case of errors during the computation. You can choose to further process these records (e.g. dump them to a log file) by connecting any writer component to the only output port of LotusWriter. You also have the option not to save the invalid records, by enabling the Skip invalid documents option. By default, even documents marked as invalid will be stored into the Lotus Notes database.

The other write mode is update. In this mode, the LotusWriter tries to update existing documents in the database, instead of inserting new ones. The component will look up the documents by searching the view specified in the component configuration (required parameter for update mode). Each view marks certain columns as sorted. These columns will be used too look up the documents to update.

There are two more advanced and in certain occasions quite important concepts in writing to Lotus Notes database with CloverETL. The first one is lazy update replication, the other one is custom field mapping.

For the update mode, after the document fields are updated, save operation will be commenced. This operation may also launch the optional replication of the document. However, in case there were no actual changes in the contents of the document, the replication would be unnecessary. Therefore, the writer component uses lazy update mechanism – it will only launch the save operation on the document, if some of the fields’ values actually changed. This can save significant amount of time for large data manipulations.

Finally, the last feature that will be described here is custom field mapping. By default, the LotusWriter component will create documents with values based on the names of the input metadata fields. However, it might be preferable to create documents with the help of a Form. Forms often specify fields documents should contain and are a common way how to create new documents in Lotus Notes.

The mapping dialog helps you to map fields of input metadata to the fields of selected Form. This allows writing single field multiple times, skip fields, but most importantly choose proper names for the target document fields.