CloverETL's Blog

March 9, 2010

Oracle uses CloverETL for data manipulation in its OIA server

Filed under: Using CloverETL — Tags: , , , , , — Lucie Felixova @ 12:00 am

We are very happy to have learned that Oracle uses CloverETL for data manipulation in its OIA server. Oracle Identity Analytics (OIA) is a solution that provides enterprise role lifecycle management and identity compliance functionality. See Oracle whitepaper http://www.oracle.com/technology/products/id_mgmt/oia/pdf/oia_wp.pdf

Originally Sun Microsystems used CloverETL in its Sun Role Manager. It is a great pleasure to see that Oracle decided to continue using CloverETL for its software after aquiring Sun Microsystems.

March 8, 2010

How to set up a WebServiceClient Component in 3 steps

Filed under: Using CloverETL — Tags: , , — Martin Zatopek @ 12:41 pm

WebServiceClient is a new CloverETL component that was released in 2.9 version. It provides an easy intuitive way to interact with web-service technology that has gained a huge popularity in the last few years. The following simple example shows the new component´s functionality. The WS component is user friendly and very easy to use.

Example for WS component usage:

Our goal is to get the actual weather information for specific ZIP codes from the web service that is running on http://ws.cdyne.com/WeatherWS/Weather.asmx?op=GetCityWeatherByZIP. The zip codes are stored in the flat file and results will be saved to Excel file.

I suppose that you are familiar with CloverETL so I will not describe the input and output setting up but will focus directly on WebService Client component.

Step 1 – WebService URL and operation

Firstly we have to specify the URL of web service WSDL file and select the operation you want to use from the list of all operation WS provides. WSDL URL is the basic attribute of the component and have to be set to http://ws.cdyne.com/WeatherWS/Weather.asmx?WSDL. The dialog for Operation name attribute allows you to select the operation you want to use. In our case it’s the operation GetCityWeatherByZIP from the group WeatherSoap (Port:WeatherSoap) as you can see on the figure bellow.

Step 2 – Generate request

The request is a way how to pass the arguments to webservice. It’s the XML file with pre-defined structure which depends on the used webservice and writting it by hand is very annoying. CloverETL provides a functionality that generates the required structure of request in one click. Simply open the dialog for Request structure attribute and click on „Generate request“ button. The skeleton of the request is genereted and we have to map the input port fields of the component to the request placeholders. In our case the only placeholder is string. By drag-and-drop we replace string by the field name from the right panel of the dialog.

Step 3 – Process response

Finally we define how to process the response from webservice. The response is XML file so we will create the mapping among elements and attributes of XML files and the output fields of the component. CloverETL makes it easy for us, it provides drag-and-drop dialog where we can select the item from „XML Fields“ part of dialog and drop them to „Clover Fields“.

And that’s all :-) . Communication with webservice is set-up. Now we can collect actual weather conditions from Internet.

February 24, 2010

CloverETL File-URL Dialog

Filed under: Developing Clover — Tags: , , , , , , — jausperger @ 3:36 pm

The CloverETL Designer has a brand new File URL Dialog, which was introduced in the version 2.9. The newly designed file dialog is very friendly and intuitive to navigate. There are a lot of new features and improvements. The dialog is separated into several tabs to simplify navigation. They enable users to easily specify resources such as local files, remote files or shared memory (dictionary). The new dialog is more comfortable to use and has simplified clear design as you can see in the picture bellow. The dialog window adjusts itself according to the context.

Clover Server

In the new File dialog you can also find a new CloverETL Server tab specially designed to work with files located on CloverETL Server. It is only visible if you have opened the dialog from existing CloverETL Server project. It looks very similar to the tab you work with on your local computer but you can browse remote CloverETL sandboxes. All names of sandboxes for which you have permissions are in the bookmarks. So you can easily access them.

File URLs

This tab handles all types of URLs but it’s mainly designed to browse remote file system via http/https/ftp/ftps/sftp protocols. It also brings special dialog where you can specify advanced parameters of connection like proxy server, HTTP properties.

Port / Dictionary

The port and dictionary tabs are specific to CloverETL. The Port tab is visible only if the component or graph element allows reading/writing data from/to the port. Dictionary is a shared memory between parts of the graph. It is identified by name and processing type parameter. Both tabs help you to specify the URLs in a visual way so you don’t have to know the exact syntax of CloverETL’s URLs and your work will be easier and more productive.

Extensibility

Due to new modular dialog architecture, the dialog itself can be extended for specific tabs if needed.

February 11, 2010

RelationalJoin Commercial Component

Filed under: Using CloverETL — Tags: , , — Martin Janík @ 5:33 pm

CloverETL version 2.8 offered a brand new commercial  component called RelationalJoin. It extends the CloverETL pallete of joiner components with new functionality – joining of records with relational operators different from equal (=). This component has two input ports for master and slave data records, and a single output port for joined data records. Master and slave data records are joined and sent to the output port if they are in a specific relation to each other. In other words, it’s just another joiner that uses a relational operator to drive the joining process.

Relations Between Data Records

The relation between two data records is specified by any of the following relational operators: !=, <, <=, >, >=. For example the < operator means that each master data record is less than all the slave data records that it was joined with. If you look at it from the slave’s perspective, all slave data records joined with a certain master data records are greater in this case. It’s valid in both ways. ;-)

In order to make the joining process as effective as possible, input data records coming through both input ports have to be sorted appropriately. Except for the != operator that doesn’t require any sorting at all. However, if you choose the < or <= operator, you need to sort both streams of input data records in the descending order. In case of the > and >= operators, you need to make sure the data records are sorted in the ascending order. If the sort order is invalid, execution of the component fails.

When processing large data sets, be aware that the master data records are processed one by one while the slave data records need to be buffered. In the worst case, all the slave data records need to be buffered and thus their number should be as small as possible.

Practical Example

They say that an example is worth a thousand words so let’s create a simple one. Imagine that you’ve got a set of several distinct numbers. If for some reason you need to pair each of them with all numbers from the same set that are greater, it is a pretty simple task for RelationalJoin! :-) See the example graph below.

Example graph with RelationJoin.

We read the numbers from a flat file, sort them appropriately, send them as two independent data streams to RelationalJoin which produces the desired output, and finally write them to another flat file. Pretty simple, don’t you think? ;-)

In order to configure RelationalJoin, we need to specify a transformation, join key and join relation. Setting the first two is simple, you have done that a hundred times. In case of join relation, it is simple as well, just select “master(D) < slave(D)” from the combo box. The letter D in the round brackets denotes that we need to sort both streams of data records in the descending order. Thus we need to configure ExtSort in this way.

That’s all we had to do, just run the graph and see the desired results!

February 3, 2010

InfobrightDataWriter component

Filed under: Using CloverETL — Tags: , , , , — Agata Vackova @ 8:40 am

In response to customer requests we created a new component that writes data into Infobright software, a very popular column-oriented relational database. The CloverETL’s just released version 2.9 offers this new InfobrightDataWriter component.

Infobright is a highly compressed column-oriented database, based on MySql engine. In this database data are stored column-by-column instead of more typically row-by-row. There are many advantages of column-orientation, including the ability to do more efficient data compression and allowing compression to be optimized for each particular data type. The higher efficiency can be achieved because each column stores a single data type as opposed to rows that typically contain several data type. However the main purpose of Infobright software is to deliver a scalable data warehouse database optimized for analytic queries.

Our new InfobrightDataWriter component allows to easily load data into Infobright database directly from CloverETL transformations. It can be used both with Infobright Community Edition (ICE) as well as with Infobright Enterprise Edition (IEE). With IEE it is possible to use fast data loading in binary format. In binary format particular rows aren’t separated by any special character and there are no values’ delimiters or qualifiers either.

The connector uses named pipe for data transfer. In Linux systems it is system pipe, on Windows you need native library placed on the Java library path. The library is part of  infobright-core-vX_X package that can be downloaded from Infobright site (Download Contributed Software).

For remote load it is necessary to start Infobright remote load agent on the server where Infobright is running (part of infobright-core-v3).

Example graph:

InfobrightDataWriter configuration:

The above graph loads data from selected input fields (“Clover fields” attribute) to database table test (“Database table” attribute) on remote server. It uses fast binary method of data loading (“Data format” attribute). Data loaded into database can be sent to output port. Input metadata has free format but the output one has to meet certain conditions:

  • comma as a field delimiter
  • system new line ('\n' for Linux, '\r\n' for Windows) as a record delimiter
  • date/time formatting: yyyy-MM-dd HH:mm:ss (can be date only, time only or both)

As we load data to the remote server, the Infobright agent must run on the server. To start the agent we need to call:

java -jar infobright-core-3.0-remote.jar -p 6666 -l all

The port the agent is listening to must be the same as the agent port set on the component (“Remote agent port” attribute).

Full description of the component can be found on CloverETL’s wiki pages (InfobrightDataWriter component).

February 1, 2010

CloverETL version 2.9 was released. It adds Infobright Data Writer, Web Services component and other new features.

Filed under: Developing Clover — Tags: , , , , , — Lucie Felixova @ 10:18 am

New CloverETL version 2.9. was just released. This version brings a new Infobright Data Writer component, enhances the connectivity by adding Web Services component and adds features that simplify common data transformation tasks.

New Features and Components:
Infobright Data Writer
In response to customer requests, this component writes data into Infobright software, a column-oriented relational database. Infobright is a provider of solutions designed to deliver a scalable data warehouse optimized for analytic queries.

Web Services component
The new component makes communication with Web Services easier than ever. It provides user friendly graphical interface for mapping your data into Web Service fields, automatically generates requests and process responses. It offers faster, easier and more comfortable way to interact with remote Web Services.

Reading formatted values from XLS
Additionally to reading plain data from MicrosoftTM ExcelTM sheets, the Excel component is now also capable of reading user-formatted values such as currencies, dates or numbers.

New tracking option
Customers can now see all absolute speed rates for finished data transformations, facilitating comparative analysis in pursuit of process improvements.

New Aspell Lookup table
Brand new implementation of this component brings better performance, improved configuration and better customization.

Improved treatment of empty (NULL) values
Developers can now specify special strings that should be treated as empty (NULL) when data is being parsed. This feature simplifies processing of typical application export files which often contain values insignificant for ETL processing. Additionally it may lead to improved processing throughput and lower memory consumption of data transformation.

More user friendly File URL dialog and improved LDAP functionality.

Customers can evaluate these new features along with CloverETL’s other leading capabilities with a free 30-day trial of the CloverETL Designer Pro evaluation, which is available at www.cloveretl.com Information management professionals can also evaluate the enterprise integration features of CloverETL Server via an online demo, which is also available at www.cloveretl.com.

January 25, 2010

Iteration through the record fields in CTL

Filed under: Using CloverETL — Tags: , , — Vaclav Matous @ 9:52 am

Recently, I have been facing a very common problem. Imagine this scenario: I have two files – the first one with origin records and the second one with slightly modified new records. Each record had a unique key and aproximately 50 fields. My task was to compare these two files and find out how many fields in every pair differ from each other in the corresponding records.

The simplified graph can be seen in following picture:

A comparison of two records can be processed by the CloverETL DataIntersection component which joins the records with the same keys. In the joined records, you can compare fields that are not part of the key. But remember, you have to write the comparison in the CTL transformation ;-) . Of course you could write the following block of code for each pair of compared fields:

int count = 0;
if(nvl($0.field_N, '') != nvl($1.field_N, '')) {
count++;
//but typically more actions :-(
}
//and imagine this block 50times :-(
//final mapping
$0.key := $0.key;
$0.count := count;

But this solution takes too much time when you have to repeat it for many fields (approx. 50 in my case). It is also very slow, uncomfortable and increases the probability of making a mistake in your code (e.g. omitting some fields). Fortunately, CloverETL allows you to iterate through the fields of processed records! :-) The code is then more briefer and more generic:

//declaration of variables for copies of input records
record(Metadata1) myrec1;
record(Metadata1) myrec2;
function transform() {
int i = 0;
int count = 0;
//asign value of input records to local variables
myrec1 = @0; //myrec1 is a copy of a current record on input port 0
myrec2 = @1; //myrec2 is a copy of a current record on input port 1
//iterate through fields, suppose that field with index 0 is the key
for(i = 1; i < length(myrec1); i++) {
if(nvl(myrec1[i], '') != nvl(myrec2[i], '')) {
count++;
}
}
//final mapping
$0.key := $0.key;
$0.count := count;
}

Someone could object to the necessity of making copies of records. In this case I have good news. CTL in CloverETL version 2.9 introduces the possibility of iterating directly through the fields of input records.

@0[i];//i-th field of the input record on port 0

Moreover new functions for getting field names and data types are introduced in version 2.9. Personally, I am looking forward to such features that will make CTL code simplier and clearer.

January 12, 2010

DataDirect’s OracleDB JDBC driver speed test

Filed under: Using CloverETL — Tags: , , , , , , — Agata Vackova @ 12:39 pm

Purpose

Compare the speed of data loading into Oracle database (Oracle Database 11g Release 11.1.0.6.0 – Enterprise Edition) with Oracle corp. JDBC driver, DataDirect JDBC Oracle driver and direct data loading (OracleDataWriter component – sqlldr utility) in CloverETL.

Test description

Graph used for testing:

DDdbLoad.grf

The above graph loads data into database table that contains 3 number columns and 127 varchar columns.

Database table for storing data is truncated before each data loading – DBExecute components, each with query: TRUNCATE TABLE dd_test1 REUSE STORAGE

Phase 1: loading data with DDBulkLoad (DataDirect) object from csv file (loader.load(file))

Phase 3: loading data with DDBulkLoad (DataDirect) object from ResultSet (loader.load(resultSet)) – created ResultSet implementation, that reads data from DataRecord (read from the edge).

Phase 4: loading data with DBOutputTable with Oracle corp’s  JDBC driver:

Manifest-Version: 1.0
Specification-Title:    Oracle JDBC driver classes for use with JDK14
Sealed: true
Created-By: 1.4.2_08 (Sun Microsystems Inc.)
Implementation-Title:   ojdbc14.jar
Specification-Vendor:   Oracle Corporation
Specification-Version:  Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Version: Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Vendor:  Oracle Corporation
Implementation-Time:    Wed Jan 25 01:28:31 2006

Phase 7: loading data with DBOutputTable with DataDirect Oracle JDBC driver – enabled bulk load feature

Phase 9: loading data with OracleDataWriter component from csv file (sqlldr utility)

Phase 11: loading data with OracleDataWriter component from edge (sqlldr utility)

Test processing

Graph run 3 times for 10,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with following settings:

Test results

Results in seconds.

1,000,000 records with default DataDirect settings:

Phase 1: 178, 167, 132 – min: 132, max: 178, average: 159

Phase 3: 128, 166, 152 – min: 128, max: 166, average: 149

Phase 5: 228, 246, 290 – min: 228, max: 290, average: 255

Phase 7: 176, 170, 239 – min: 170, max: 239, average: 195

Phase 9: 44, 45, 56 – min: 44, max: 56, average: 48

Phase 11: 104, 95, 106 – min: 95, max: 104, average: 102

1,000,000 records with custom settings:

Phase 1: 163, 152, 142 – min: 142, max: 163, average: 152

Phase 3: 166, 133, 134 – min: 133, max: 166, average: 144

Phase 5: 278, 263, 260 – min: 260, max: 278, average: 267

Phase 7: 239, 172, 209 – min: 172, max: 239, average: 207

10,000,000 records with default DataDirect settings:

Phase 1: 1553, 1818, 1352 – min: 1352, max: 1818, average: 1574

Phase 3: 1475, 1299, 1298 – min: 1298, max: 1475, average: 1357

Phase 5: 3041, 2592, 2550 – min: 2550, max: 3041, average: 2728

Phase 7: 1824, 1623, 1722 – min: 1722, max: 1824, average: 1723

Phase 9: 404, 432, 472 – min: 404, max: 472, average: 436

Phase 11: 1096, 975, 1012 – min: 975, max: 1096, average: 1028

Summary

Loading data was slowest when DBOutputTable with original Oracle corp. driver was used. All loadings with DataDirect driver were faster than with Oracle corp. driver but the usage of DDBulkLoad object (DataDirect) plainly increases the speed of loading data in comparison with setting EnableBulkLoad=true and using DBOutputTable. The results for loading data from csv file and edge (result set) are very similar with slight predomination of ResultSet method. All three methods with DataDirect driver usage, get to more steady execution times with number of records to load.

The fastest way of loading data is unquestionably direct data loading with sqlldr utility. Even when inter-storing data in pipe, the sqlldr utility is about 50% faster than any other method, but is less convenient.
1,000,000 records
10,000,000 records

December 18, 2009

CloverETL extends ETL Price/Performance Leadership with Launch of CloverETL Cluster

Filed under: Using CloverETL — Tags: , , , , — Lucie Felixova @ 3:54 pm

On December 9, 2009 CloverETL Cluster Edition was launched at PriceWaterhouseCoopers premises. CloverETL Cluster intelligently partitions data and distributes them evenly across multiple nodes in a cluster for execution in parallel. CloverETL Cluster’s ability to load balance large data transformations increases throughput, fault tolerance and flexibility.

CloverETL Cluster can be deployed on premise in a customer’s own data center or in a variety of cloud configurations, such as Amazon EC2, which can drive costs even lower. During the launch in Prague, CloverETL Cluster was demonstrated running on four Amazon EC2 servers.

The following table shows the time CloverETL requires to execute a moderately complex transformation of six million records (725 MB) in a variety of local and cloud configurations:

  • CloverETL Desktop Designer on a MacBook: 150 seconds
  • CloverETL Cluster Load Balanced Across Two EC2 servers: 60 seconds
  • CloverETL Cluster Load Balanced Across Three EC2 servers: 43 seconds
  • CloverETL Cluster Load Balanced Across Four EC2 servers: 31 seconds

December 9, 2009

ParallelReader Versus Competitors Finish

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 3:09 pm

As I have promised I bring you a complex comparison of ETL tools: CloverETL, Talend and Pentaho.

Short summary of my previous posts: For testing I used two transformations based on TPCH test and the input data generated by dbgen utility. The transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part 2.

New testing:
To ensure my comparison a full complexity, all tools were tested as “desktop” and “enterprise” ETL tools. The “desktop” tools were running on laptop computer with a small amount of data. The “enterprise” ETL tools were running on server class machine with a large amount of data stored both in flat files and in a database. The transformation executed on server class machine was the same as the one I executed on desktop, only the size of input data was changed:

  • lineitem.tbl – 59,986,052 records, 7.24 GB
  • customers.tbl – 1,500,000 records, 233 MB
  • orders.tbl – 15,000,000 records, 1.62 GB

The results of flat file reading:

TPCH-Q1

TPCH-Q1

TPCH-Q3

TPCH-Q3

The new results of database reading, all previously published results, detailed information about used hardware and a summary are available in this final document.

I also described main features of all tools and my experiences to work with them. This part of the document expresses my opinions so it could be biased since I work mostly with CloverETL. If you don’t agree with anything, please express your opinion in comments. I will be pleased to discuss them with you.

Older Posts »

Blog at WordPress.com.