CloverETL's Blog

February 3, 2010

InfobrightDataWriter component

Filed under: Using CloverETL — Tags: , , , , — Agata Vackova @ 8:40 am

In response to customer requests we created a new component that writes data into Infobright software, a very popular column-oriented relational database. The CloverETL’s just released version 2.9 offers this new InfobrightDataWriter component.

Infobright is a highly compressed column-oriented database, based on MySql engine. In this database data are stored column-by-column instead of more typically row-by-row. There are many advantages of column-orientation, including the ability to do more efficient data compression and allowing compression to be optimized for each particular data type. The higher efficiency can be achieved because each column stores a single data type as opposed to rows that typically contain several data type. However the main purpose of Infobright software is to deliver a scalable data warehouse database optimized for analytic queries.

Our new InfobrightDataWriter component allows to easily load data into Infobright database directly from CloverETL transformations. It can be used both with Infobright Community Edition (ICE) as well as with Infobright Enterprise Edition (IEE). With IEE it is possible to use fast data loading in binary format. In binary format particular rows aren’t separated by any special character and there are no values’ delimiters or qualifiers either.

The connector uses named pipe for data transfer. In Linux systems it is system pipe, on Windows you need native library placed on the Java library path. The library is part of  infobright-core-vX_X package that can be downloaded from Infobright site (Download Contributed Software).

For remote load it is necessary to start Infobright remote load agent on the server where Infobright is running (part of infobright-core-v3).

Example graph:

InfobrightDataWriter configuration:

The above graph loads data from selected input fields (“Clover fields” attribute) to database table test (“Database table” attribute) on remote server. It uses fast binary method of data loading (“Data format” attribute). Data loaded into database can be sent to output port. Input metadata has free format but the output one has to meet certain conditions:

  • comma as a field delimiter
  • system new line ('\n' for Linux, '\r\n' for Windows) as a record delimiter
  • date/time formatting: yyyy-MM-dd HH:mm:ss (can be date only, time only or both)

As we load data to the remote server, the Infobright agent must run on the server. To start the agent we need to call:

java -jar infobright-core-3.0-remote.jar -p 6666 -l all

The port the agent is listening to must be the same as the agent port set on the component (“Remote agent port” attribute).

Full description of the component can be found on CloverETL’s wiki pages (InfobrightDataWriter component).

February 1, 2010

CloverETL version 2.9 was released. It adds Infobright Data Writer, Web Services component and other new features.

Filed under: Developing Clover — Tags: , , , , , — lucieopensys @ 10:18 am

New CloverETL version 2.9. was just released. This version brings a new Infobright Data Writer component, enhances the connectivity by adding Web Services component and adds features that simplify common data transformation tasks.

New Features and Components:
Infobright Data Writer
In response to customer requests, this component writes data into Infobright software, a column-oriented relational database. Infobright is a provider of solutions designed to deliver a scalable data warehouse optimized for analytic queries.

Web Services component
The new component makes communication with Web Services easier than ever. It provides user friendly graphical interface for mapping your data into Web Service fields, automatically generates requests and process responses. It offers faster, easier and more comfortable way to interact with remote Web Services.

Reading formatted values from XLS
Additionally to reading plain data from MicrosoftTM ExcelTM sheets, the Excel component is now also capable of reading user-formatted values such as currencies, dates or numbers.

New tracking option
Customers can now see all absolute speed rates for finished data transformations, facilitating comparative analysis in pursuit of process improvements.

New Aspell Lookup table
Brand new implementation of this component brings better performance, improved configuration and better customization.

Improved treatment of empty (NULL) values
Developers can now specify special strings that should be treated as empty (NULL) when data is being parsed. This feature simplifies processing of typical application export files which often contain values insignificant for ETL processing. Additionally it may lead to improved processing throughput and lower memory consumption of data transformation.

More user friendly File URL dialog and improved LDAP functionality.

Customers can evaluate these new features along with CloverETL’s other leading capabilities with a free 30-day trial of the CloverETL Designer Pro evaluation, which is available at www.cloveretl.com Information management professionals can also evaluate the enterprise integration features of CloverETL Server via an online demo, which is also available at www.cloveretl.com.

January 25, 2010

Iteration through the record fields in CTL

Filed under: Using CloverETL — Tags: , , — Vaclav Matous @ 9:52 am

Recently, I have been facing a very common problem. Imagine this scenario: I have two files – the first one with origin records and the second one with slightly modified new records. Each record had a unique key and aproximately 50 fields. My task was to compare these two files and find out how many fields in every pair differ from each other in the corresponding records.

The simplified graph can be seen in following picture:

A comparison of two records can be processed by the CloverETL DataIntersection component which joins the records with the same keys. In the joined records, you can compare fields that are not part of the key. But remember, you have to write the comparison in the CTL transformation ;-) . Of course you could write the following block of code for each pair of compared fields:

int count = 0;
if(nvl($0.field_N, '') != nvl($1.field_N, '')) {
count++;
//but typically more actions :-(
}
//and imagine this block 50times :-(
//final mapping
$0.key := $0.key;
$0.count := count;

But this solution takes too much time when you have to repeat it for many fields (approx. 50 in my case). It is also very slow, uncomfortable and increases the probability of making a mistake in your code (e.g. omitting some fields). Fortunately, CloverETL allows you to iterate through the fields of processed records! :-) The code is then more briefer and more generic:

//declaration of variables for copies of input records
record(Metadata1) myrec1;
record(Metadata1) myrec2;
function transform() {
int i = 0;
int count = 0;
//asign value of input records to local variables
myrec1 = @0; //myrec1 is a copy of a current record on input port 0
myrec2 = @1; //myrec2 is a copy of a current record on input port 1
//iterate through fields, suppose that field with index 0 is the key
for(i = 1; i < length(myrec1); i++) {
if(nvl(myrec1[i], '') != nvl(myrec2[i], '')) {
count++;
}
}
//final mapping
$0.key := $0.key;
$0.count := count;
}

Someone could object to the necessity of making copies of records. In this case I have good news. CTL in CloverETL version 2.9 introduces the possibility of iterating directly through the fields of input records.

@0[i];//i-th field of the input record on port 0

Moreover new functions for getting field names and data types are introduced in version 2.9. Personally, I am looking forward to such features that will make CTL code simplier and clearer.

January 12, 2010

DataDirect’s OracleDB JDBC driver speed test

Filed under: Using CloverETL — Tags: , , , , , , — Agata Vackova @ 12:39 pm

Purpose

Compare the speed of data loading into Oracle database (Oracle Database 11g Release 11.1.0.6.0 – Enterprise Edition) with Oracle corp. JDBC driver, DataDirect JDBC Oracle driver and direct data loading (OracleDataWriter component – sqlldr utility) in CloverETL.

Test description

Graph used for testing:

DDdbLoad.grf

The above graph loads data into database table that contains 3 number columns and 127 varchar columns.

Database table for storing data is truncated before each data loading – DBExecute components, each with query: TRUNCATE TABLE dd_test1 REUSE STORAGE

Phase 1: loading data with DDBulkLoad (DataDirect) object from csv file (loader.load(file))

Phase 3: loading data with DDBulkLoad (DataDirect) object from ResultSet (loader.load(resultSet)) – created ResultSet implementation, that reads data from DataRecord (read from the edge).

Phase 4: loading data with DBOutputTable with Oracle corp’s  JDBC driver:

Manifest-Version: 1.0
Specification-Title:    Oracle JDBC driver classes for use with JDK14
Sealed: true
Created-By: 1.4.2_08 (Sun Microsystems Inc.)
Implementation-Title:   ojdbc14.jar
Specification-Vendor:   Oracle Corporation
Specification-Version:  Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Version: Oracle JDBC Driver version - "10.2.0.1.0XE"
Implementation-Vendor:  Oracle Corporation
Implementation-Time:    Wed Jan 25 01:28:31 2006

Phase 7: loading data with DBOutputTable with DataDirect Oracle JDBC driver – enabled bulk load feature

Phase 9: loading data with OracleDataWriter component from csv file (sqlldr utility)

Phase 11: loading data with OracleDataWriter component from edge (sqlldr utility)

Test processing

Graph run 3 times for 10,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with default DataDirect settings.

Graph run 3 times for 1,000,000 records with following settings:

Test results

Results in seconds.

1,000,000 records with default DataDirect settings:

Phase 1: 178, 167, 132 – min: 132, max: 178, average: 159

Phase 3: 128, 166, 152 – min: 128, max: 166, average: 149

Phase 5: 228, 246, 290 – min: 228, max: 290, average: 255

Phase 7: 176, 170, 239 – min: 170, max: 239, average: 195

Phase 9: 44, 45, 56 – min: 44, max: 56, average: 48

Phase 11: 104, 95, 106 – min: 95, max: 104, average: 102

1,000,000 records with custom settings:

Phase 1: 163, 152, 142 – min: 142, max: 163, average: 152

Phase 3: 166, 133, 134 – min: 133, max: 166, average: 144

Phase 5: 278, 263, 260 – min: 260, max: 278, average: 267

Phase 7: 239, 172, 209 – min: 172, max: 239, average: 207

10,000,000 records with default DataDirect settings:

Phase 1: 1553, 1818, 1352 – min: 1352, max: 1818, average: 1574

Phase 3: 1475, 1299, 1298 – min: 1298, max: 1475, average: 1357

Phase 5: 3041, 2592, 2550 – min: 2550, max: 3041, average: 2728

Phase 7: 1824, 1623, 1722 – min: 1722, max: 1824, average: 1723

Phase 9: 404, 432, 472 – min: 404, max: 472, average: 436

Phase 11: 1096, 975, 1012 – min: 975, max: 1096, average: 1028

Summary

Loading data was slowest when DBOutputTable with original Oracle corp. driver was used. All loadings with DataDirect driver were faster than with Oracle corp. driver but the usage of DDBulkLoad object (DataDirect) plainly increases the speed of loading data in comparison with setting EnableBulkLoad=true and using DBOutputTable. The results for loading data from csv file and edge (result set) are very similar with slight predomination of ResultSet method. All three methods with DataDirect driver usage, get to more steady execution times with number of records to load.

The fastest way of loading data is unquestionably direct data loading with sqlldr utility. Even when inter-storing data in pipe, the sqlldr utility is about 50% faster than any other method, but is less convenient.
1,000,000 records
10,000,000 records

December 18, 2009

CloverETL extends ETL Price/Performance Leadership with Launch of CloverETL Cluster

Filed under: Using CloverETL — Tags: , , , , — lucieopensys @ 3:54 pm

On December 9, 2009 CloverETL Cluster Edition was launched at PriceWaterhouseCoopers premises. CloverETL Cluster intelligently partitions data and distributes them evenly across multiple nodes in a cluster for execution in parallel. CloverETL Cluster’s ability to load balance large data transformations increases throughput, fault tolerance and flexibility.

CloverETL Cluster can be deployed on premise in a customer’s own data center or in a variety of cloud configurations, such as Amazon EC2, which can drive costs even lower. During the launch in Prague, CloverETL Cluster was demonstrated running on four Amazon EC2 servers.

The following table shows the time CloverETL requires to execute a moderately complex transformation of six million records (725 MB) in a variety of local and cloud configurations:

  • CloverETL Desktop Designer on a MacBook: 150 seconds
  • CloverETL Cluster Load Balanced Across Two EC2 servers: 60 seconds
  • CloverETL Cluster Load Balanced Across Three EC2 servers: 43 seconds
  • CloverETL Cluster Load Balanced Across Four EC2 servers: 31 seconds

December 9, 2009

ParallelReader Versus Competitors Finish

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 3:09 pm

As I have promised I bring you a complex comparison of ETL tools: CloverETL, Talend and Pentaho.

Short summary of my previous posts: For testing I used two transformations based on TPCH test and the input data generated by dbgen utility. The transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part 2.

New testing:
To ensure my comparison a full complexity, all tools were tested as “desktop” and “enterprise” ETL tools. The “desktop” tools were running on laptop computer with a small amount of data. The “enterprise” ETL tools were running on server class machine with a large amount of data stored both in flat files and in a database. The transformation executed on server class machine was the same as the one I executed on desktop, only the size of input data was changed:

  • lineitem.tbl – 59,986,052 records, 7.24 GB
  • customers.tbl – 1,500,000 records, 233 MB
  • orders.tbl – 15,000,000 records, 1.62 GB

The results of flat file reading:

TPCH-Q1

TPCH-Q1

TPCH-Q3

TPCH-Q3

The new results of database reading, all previously published results, detailed information about used hardware and a summary are available in this final document.

I also described main features of all tools and my experiences to work with them. This part of the document expresses my opinions so it could be biased since I work mostly with CloverETL. If you don’t agree with anything, please express your opinion in comments. I will be pleased to discuss them with you.

November 24, 2009

Clover and Web 2.0 Marketing

Filed under: Uncategorized — gtewallace @ 9:00 pm

A couple weeks ago I had the pleasure of talking about my favorite topic – B2B marketing – to a great group of business people at the Pricewaterhousecoopers office in Prague.  The event was organized by the folks at Clover, and it was a real treat for me to be able to share some of the things I’ve learned as a B2B tech marketer with the group.

To be honest, I entered into the presentation with some trepidation.  Social media and online communities have become such popular topics these days, that I wondered if I had anything new or useful to add to the conversation.  Ultimately, I think (hope) what sets my stuff apart is that it’s all based on my experience as a marketer, so I hope it is practical.  There are a lot of academicians out there who are probably much better informed than I on all the latest trends and techniques – but what I can say is that I’ve put the stuff in this presentation to work in practice to help drive marketing results.

As I worked on the presentation with the team here at Clover, a very cool thing happened.  Not only did the team get to see my vision for some of new ways to tell the CloverETL story, but I also learned how Clover’s data management technology can be used to help businesses extract value from their online communities – you’ll see this on one of the last slides.

Below is the presentation, edited slightly so the slides can stand on their own.  I hope you find them interesting and useful. (some of the text is hard to read so you’ll probably want to go through it in full screen mode).

November 13, 2009

Data integration survey

Filed under: Uncategorized — Tags: , — Petr Uher @ 5:22 pm

The folks at Telesperience have a really useful survey on data integration going. It’s just 10 questions and participants can request results.

Here’s the link to the survey: http://bit.ly/2rtFga

November 11, 2009

ParallelReader Versus Competitors Part 2

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 4:08 pm

Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the comparison.

The second transformation is also based on SQL query that I rewrote to ETL transformation. I chose Query 3 from http://www.tpc.org/tpch.

select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from customer, orders, lineitem
where c_mktsegment = ‘BUILDING’
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date ‘1995-03-15’
and l_shipdate > date ‘1995-03-15’
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate

Input data are generated by dbgen utility and stored in CSV files.

  • lineitem.tbl – 6,001,215 records, 724 MB
  • customers.tbl – 15,000 records, 23.2 MB
  • orders.tbl – 1,500,000 records, 163 MB

Expected output should contain 11,620 records.

There is a new item in the results. After a discussion in my previous post I added „Pentaho parallel“, Pentaho transformation that reads data in parallel mode. Thanks Matt for your transformation :-) without it I wasn’t able to set it up.

Matt Caster also presented an opinion that Pentaho is discriminated because the transformation sorts the data before aggregation in Pentaho transformation. Yes, I agree that sorting of 6,000,000 records takes a significant amount of execution time of the transformation. But I have no choice, Pentaho aggregate component requires sorted input. Today’s transformation is more fair in this aspect. The number of records flowing to aggregate component is smaller (30,519 records) so they can be easily sorted in memory and the sorting doesn’t influence the total execution time in such volume.

The versions of used ETL tools stay the same ones: CloverETL Designer 2.8.1, Talend Open Studio 3.1.3 and Pentaho Data Integration 3.2.0.

Also the hardware configuration and Java runtime parameters are the same:

  • Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.
  • -server -Xmx256m -Xmx1536m

Results:

  1. CloverETL ParallelReader
  2. Talend
  3. Pentaho parallel
  4. CloverETL UniversalDataReader
  5. Pentaho

Results

Transformations and the input data are available on filefactory.com. Today’s transformation are named TPCH2. The transformation from my previous post are named TPCH1.

Please give me a feedback, especially on Talend transformation if it’s correct.

Transformation graphs

CloverETL ParallelReader & UniversalDataReader

CloverETL ParallelReader & UniversalDataReader

Talend

Talend

Pentaho

Pentaho

Pentaho parallel

Pentaho parallel

November 4, 2009

New level of parallelism in CloverETL

Filed under: Developing Clover — Tags: , , , , , — mvarecha @ 12:28 pm

For the upcoming release of CloverETL 2.9, we are working on improvements in CloverETL Server which will allow run transformations in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so more instances may cooperate to each other. Current stable version already implements common cluster features: fail-over/high-availability and scalability of lots of requests which are load-balanced on available cluster nodes. These features are actually implemented since version 1.3.

The basic concept of new parallelism
Transformation may be automatically executed in parallel on more cluster nodes according to configuration and each of these “worker” transformations processes just its part of data. Because there is one “master” transformation, which manages the other transformations and which gathers tracking data from “worker” transformations, the parallelism is transparent for CloverETL Server client. Client by default “sees” just one (master) execution and aggregated tracking data. However there are still logs and tracking data for each of “worker” transformations, so it’s still possible to inspect details of this parallel execution. “Worker” transformations outputs are gathered to the “master”, thus client has one single transformation output which may be processed further.

So how to get parts of input data?
Basically, transformation can process data which is already partitioned, which is the best case and there is no overhead with partitioning of data, or CloverETL Server itself can partition input data from one single source and distribute data on the fly (during the transformation) to several cluster nodes using the network connection. Overhead of this operation depends on the speed of network communication and other conditions.

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. Thus there will be just a couple of extra components in the graph which is intended to run in parallel. These components will handle partitioning/departitioning of data in case it’s not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for scalability of the transformations. Ever since the graph is designed for paraller run, the number of computers which run this transformation depends just on cluster configuration. Graph itself is still the same. Configuration of the parallelism includes:

  • working CloverETL Server cluster, thus standalone server instances won’t be able to handle such execution
  • “partitioned” sandbox(see below) with list of locations

New sandbox types
On server side, graphs and related files are organized in so-called sandboxes. Until version 2.8, there was just one type: “shared” sandbox. It means that it contains the same files and directory structure on all cluster nodes. Since version 2.9 there will be two more types:

  • “local” sandbox – is (locally) accessible on just one cluster node. It’s intended for huge input/output data which is not intended to be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical location contains just part of data. It’s intended as a storage for partitioned input/output data of transformations which are supposed to run in parallel. List of physical locations actually specifies nodes which will run “worker” transformations.

Master – worker responsibilities
Master observes all related workers and when some transformation phase is finished on all workers, it’s master’s responsibility to allow the workers to process next phase. When any of the workers fails from any reason, it’s master’s responsibility to abort all the other workers and select whole execution as failed. Master/worker – These terms have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but it doesn’t mean that all the nodes are equal. There may be differences between nodes in accessibility to physical sources. Configuration of sandboxes should reflect it.

Older Posts »

Blog at WordPress.com.