CloverETL's Blog

November 4, 2009

New level of parallelism in CloverETL

Filed under: Developing Clover — Tags: , , , , , — mvarecha @ 12:28 pm

For the upcoming release of CloverETL 2.9, we are working on improvements in CloverETL Server which will allow run transformations in parallel on multiple cluster nodes.

CloverETL Server already supports clustering, so more instances may cooperate to each other. Current stable version already implements common cluster features: fail-over/high-availability and scalability of lots of requests which are load-balanced on available cluster nodes. These features are actually implemented since version 1.3.

The basic concept of new parallelism
Transformation may be automatically executed in parallel on more cluster nodes according to configuration and each of these “worker” transformations processes just its part of data. Because there is one “master” transformation, which manages the other transformations and which gathers tracking data from “worker” transformations, the parallelism is transparent for CloverETL Server client. Client by default “sees” just one (master) execution and aggregated tracking data. However there are still logs and tracking data for each of “worker” transformations, so it’s still possible to inspect details of this parallel execution. “Worker” transformations outputs are gathered to the “master”, thus client has one single transformation output which may be processed further.

So how to get parts of input data?
Basically, transformation can process data which is already partitioned, which is the best case and there is no overhead with partitioning of data, or CloverETL Server itself can partition input data from one single source and distribute data on the fly (during the transformation) to several cluster nodes using the network connection. Overhead of this operation depends on the speed of network communication and other conditions.

Design changes in the graph
We aim to keep the transformation graph almost the same as it would be for “standalone” execution. Thus there will be just a couple of extra components in the graph which is intended to run in parallel. These components will handle partitioning/departitioning of data in case it’s not already partitioned.

Scalability
The new parallelism in CloverETL Server is a giant leap for scalability of the transformations. Ever since the graph is designed for paraller run, the number of computers which run this transformation depends just on cluster configuration. Graph itself is still the same. Configuration of the parallelism includes:

  • working CloverETL Server cluster, thus standalone server instances won’t be able to handle such execution
  • “partitioned” sandbox(see below) with list of locations

New sandbox types
On server side, graphs and related files are organized in so-called sandboxes. Until version 2.8, there was just one type: “shared” sandbox. It means that it contains the same files and directory structure on all cluster nodes. Since version 2.9 there will be two more types:

  • “local” sandbox – is (locally) accessible on just one cluster node. It’s intended for huge input/output data which is not intended to be shared/replicated among multiple cluster nodes.
  • “partitioned” sandbox – each of its physical location contains just part of data. It’s intended as a storage for partitioned input/output data of transformations which are supposed to run in parallel. List of physical locations actually specifies nodes which will run “worker” transformations.

Master – worker responsibilities
Master observes all related workers and when some transformation phase is finished on all workers, it’s master’s responsibility to allow the workers to process next phase. When any of the workers fails from any reason, it’s master’s responsibility to abort all the other workers and select whole execution as failed. Master/worker – These terms have meaning only in the scope of one transformation. Since 2.9 there is no privileged node configured as “master” in the cluster, but it doesn’t mean that all the nodes are equal. There may be differences between nodes in accessibility to physical sources. Configuration of sandboxes should reflect it.

October 26, 2009

ParallelReader versus competitors

Filed under: Using CloverETL — Tags: , , , , , — Petr Uher @ 3:37 am

On Oct. 21 OpenSys released a new version of its ETL tool, CloverETL Designer version 2.8.1. It’s mainly bugfix version but also brings a new component, ParallelReader, that makes delimited data file (CSV) processing faster than ever before.

I decided to make a test and compare ParallelReader’s performance with CloverETL’s UniversalDataReader and also with two ETL competitors Talend Open Studio (3.1.3) and Pentaho Data Integration (3.2.0).

As a testing task I chose simple SQL query and I tried to rewrite it to ETL transformation.

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date ‘1998-09-03’
group by l_returnflag, l_linestatus
order byl_returnflag, l_linestatus;

This query is one of standard queries used for performance testing of database engines. More info on http://www.tpc.org/tpch/. The dataset for testing was generated by dbgen utility available on tpc.org too. The size of dataset is 725MB.

All transformation was run on my laptop: Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.

Java parameters was set up to -server -Xmx256m -Xmx1536m.

The results aren’t surprising :-) :

  1. CloverETL – ParallelReader
  2. CloverETL – UniversalDataReader
  3. Talend
  4. Pentaho

Results

If you don’t trust me you can verify results on your own computer. All transformation graphs and testing dataset are available on rapidshare.com or filefactory.com (200 MB). CloverETL Designer can be downloaded on www.cloveretl.com.

Deeper and more extensive comparison will be published soon. Watch www.cloveretl.com, watch this blog. The latest news about CloverETL are also available on CloverETL linkedin group and CloverETL facebook group. Don’t hesitate and join.

Transformation graphs

CloverETL ParallelReader

CloverETL ParallelReader


CloverETL UniversalDataReader

CloverETL UniversalDataReader


Talend

Talend


Pentaho

Pentaho

October 23, 2009

Parallel reader

Filed under: Using CloverETL — Tags: , , , , — Martin Zatopek @ 11:36 am

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab
delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

October 8, 2009

Building DWH with CloverETL: Slowly Changing Dimension Type 1

Filed under: Using CloverETL — Tags: , , , , — Petr Uher @ 10:21 am

The very typical usage of ETL tools is loading the data warehouse (DWH). So I decided to write a tutorial that will describe typical data warehouse tasks (slowly changing dimensions, date dimension, filling fact tables) and propose solutions with using of CloverETL.

If you are a newbie in data warehousing I recommend you reading some of the books by Ralph Kimball or H. I. Imon.

Sample data warehouse collects the information about sales for a small store chain that offers electronics like iPod, MP3, laptops etc.

The DB schema of my data warehouse is very simple. It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price. you can see complete DB scheme on figure below.

DB schema of sample DWH

DB schema of sample DWH

Store dimension

One thing you will surely face when you build data warehouse is working with several types of slowly changing dimension (SCD). In this part of the tutorial I used the simplest SCD type.

The simplest and surely the most popular SCD type among ETL developers is slowly changing dimension type 1. It doesn’t store any history, so once the value in online transaction processing system (OLTP) has been changed, the value in DWH is immediately overwritten as well.

I decided to use SCD1 for store dimension which collects basic information about stores: address, store identifier, store manager etc. Each store is identified in OLTP by unique store number (natural key). But for DWH I have to generate an own surrogate identifier ID_D_STORE.

The basic idea of processing SCD1 is very simple: compare records in DWH and OLTP, insert missing records into DWH, update the DWH records according to OLTP. For all these tasks the attribute that helps us to find corresponding records is the natural key – STORE_NUMBER.

So let’s go to develop CloverETL graph. For better portability all inputs and output data are stored in csv files, thus you don’t have to configure any database. The store dimension of DWH is stored in D_STORE.tbl file, the actual data from OLTP are stored in Store_25092009.csv. In both of these files we have to read, sort on natural key STORE_NUMBER and find the records that aren’t in D_STORE (third output of DataIntersection). In this last step we will use DataIntersection component. Simultaneously (by second output of DataIntersection) we get the potential records that can be different in OLTP and DWH. These records are then filtered and only the records having any different value of any attribute are processed and new values are stored to D_STORE_update.tbl file. New records are written to D_STORE_insert.tbl file once ID_D_STORE attribute is added. ID_D_STORE attribute gains its value from sequence that we have already defined in CloverETL in advance. And that’s all. You can see the resulting graph below.

CloverETL graph D_STORE_SCD1

CloverETL graph D_STORE_SCD1

If you want to read/write data from/to database easily replace UniversalDataReaders with DBInputTables and UniversalDataWriters with DBOutputTable components.

You can download complete CloverETL project here.

To be continued. In the next part we will deal with slowly changing dimension type 2.

September 16, 2009

The Beauty of Inline CTL Expressions

Filed under: Using CloverETL — Tags: , , , — Martin Janík @ 2:19 pm

Have you ever wondered how to write data records to a file with current date in its name? Then I’ve got a brand new solution for you! Since version 2.8, CloverETL supports inline CTL expressions within graph attributes and parameters. What does that mean? It means that you can use attributes and parameters containing CTL expressions which are evaluated dynamically at run-time!

Do you recall referencing graph parameters using the ${PARAM} syntax? Using inline CTL expressions is even simpler! Let’s assume you’d like to store some data records to a unique file every day, e.g. orders_2009-09-16.dat. Changing the file name every day might be pretty annoying. Well, it’s much more pleasant to use an inline CTL expression. Take UniversalDataWriter for example and simply set its fileURL attribute to something like this:

${DATAOUT_DIR}/orders_`date2str(today(), "yyyy-MM-dd")`.dat

Notice the `date2str(today(), "yyyy-MM-dd")` part — that’s an inline CTL expression. Yes, that’s all you need to do! Just enclose your CTL expression within back quotes and place it anywhere you like. Except CTL transformations of course, that would be meaningless. ;-)

The good news is that you can reference graph parameters from inline CTL expressions. Such graph parameters might again contain CTL expressions. And so forth… Assuming you defined parameters FIRST_NAME and LAST_NAME, defining another parameter, let’s say FULL_NAME, in the following way is perfectly valid:

`substring('${FIRST_NAME}', 0, 1)`. ${LAST_NAME}

Beware, two adjacent back quotes are treated as an empty CTL expression and always evaluated to an empty string. You might also ask how to use back quotes within inline CTL expressions. Well, it’s pretty straightforward, just escape them using a back slash, i.e. \`. (Both these features work since version 2.8.1.)

By default, evaluation of inline CTL expressions is turned on. If you want to turn this feature off for any reason, you can simply do so by setting the GraphProperties.EXPRESSION_EVALUATION_ENABLED configuration property to false.

We hope you’ll find this brand new feature useful! You can let us know where and why you use it by leaving a comment to this post.

September 7, 2009

Designer-Server integration: HTTPS made easy

Filed under: Using CloverETL — Tags: , , , , — Jaroslav Urban @ 11:32 am

In CloverETL Designer 2.8.0, connecting to CloverETL Server over HTTPS protocol is supported. However, the client requires some configuration including import of client’s certificate to the server. Starting with CloverETL Designer 2.8.1, the situation is much simplified. The HTTPS can be used without any additional client configuration.

The usage scenario is similar to using a web browser – if the Designer detects an unknown server certificate, it asks the user if the certificate should be accepted & imported. A server certificate can be imported either permanently or temporarily for one Designer session.

Connecting to CloverETL Server over HTTPS

Connecting to CloverETL Server over HTTPS

In the above screenshot you can see an example of connecting to the CloverETL Server over HTTPS. The Designer detected an unknown certificate and asks the user whether the certificate should be accepted. The user can of course examine certificate’s content prior to accepting or refusing.

This simple HTTP connection work in case that the application server running CloverETL Server does not require a certificate from its clients. When it requires client certificates, then the Designer must be configured as previously.

September 2, 2009

New QuickBase components

Filed under: Using CloverETL — Tags: , , , — Martin Zatopek @ 8:34 am

We have great news for users of on-line database QuickBase from Intuit . CloverETL became a next tool which can be used to manipulate the data in this database. Now you can work with data in QuickBase without restraints and with full power of CloverETL. We introduced several specialized components dedicated for this purpose.

QuickBaseQueryReader serves as a reader of records queried from a table. You only need to specify a table name which you want to read from and a query in QuickBase query language. It is that simple. The second reader, QuickBaseRecordReader, is even simpler. Again, you specify a table name and instead of the little bit confusing query, you just specify the particular record identifiers you require.

Data writing is provided also by a set of two components – QuickBaseRecordWriter and QuickBaseImportCSV. First one is a complex writer of incoming data records to a given table and second one is a bit simpler, however much faster bulk data loader to QuickBase.

So we have prepared the whole family of components fully supporting the data manipulation in QuickBase – do not hesitate and try it.

August 18, 2009

Hidden features: Mutable delimiter

Filed under: Using CloverETL — Tags: , , , — Petr Uher @ 9:38 am

CloverETL provides a very useful feature: mutable delimiter. When you parse a delimited file (eg. CSV) you can specify different delimiter for each field. This isn’t surprising for daily CloverETL users however for users of other ETL tools it can be. It might not be very well known that in CloverETL you can even define more delimiters for one field (so called “mutable delimiter”) and CloverETL chooses the right one. It reveals new ways of file processing with irregular structure in CloverETL. I believe this functionality isn’t provided by any other ETL tool on the market. If I am wrong you can leave me a message in comments. I’m always happy to find “hidden features” of other ETL tools.

Syntax of a mutable delimiter: delimiters have to be separated by ‘\\|‘. For example if you want to define that field delimiter can be ‘;‘ or ‘,‘ or ‘#‘ you have to write ‘;\\|,\\|#‘.
The simple example of using a mutable delimiter you can download here as a zipped CloverETL project. The import of existing CloverETL project to your CloverETL Designer is described in CloverETL documentation.

August 11, 2009

Hidden features: Environment variables in CloverETL transformation

Filed under: Using CloverETL — Tags: , , — Petr Uher @ 12:43 pm

Using environment variables

“Environment variable is named value that can affect the way running process will behave on a computer.”

In daily praxis we usually use environment variables with different syntax depending on operation system. On UNIX-like systems we use them with the syntax: $variable_name, on DOS and Windows systems the syntax is: %variable_name%. To list the variables on UNIX-like system we can use env shell command, on DOS and Windows systems set cmd command. You can find more general information on environment variables at en.wikipedie.org.
But enough of general information. Now how we can use environment variables in CloverETL transformation? It’s very simple, you can use it in the same way as you routinely use graph parameters. So if you want to add username of the user under whom the transformation is running to your processed data, it’s nothing more than adding a new field to metadata and write following in CTL (Clover Transformation Language):

function transform(){
...
$0.username := '${USER}'; //UNIX-like systems
OR
$0.username := '${USERNAME}'; //DOS and Windows systems
...
}

But be careful, value of environment variables can contain “bad characters” (\,") that have to be escaped by ‘\‘ in CloverETL. The safest way to use env variables in CloverETL is to enclose them in quotation marks  ‘'‘.

Overwriting environment variables & priority of parameter definitions

Often it’s very helpful to use environment variables inside CloverETL transformation. But sometimes you want to define your own graph parameter with the same name as the existing environment variable has. And you may ask the question: “Is it possible?”. I answer: “Yes, it is :-) .” Because there is a hierarchy of graph parameter definitions:

  1. Parameter from external parameter file specified at the start of graph execution by -cfg option
  2. Parameter defined at the start of graph execution by -P option
  3. Parameter from external parameter file that is linked to the graph during the graph development
  4. Internal graph parameter
  5. Environment variable

Parameter definitions from the list are sorted by priority (highest to lowest). So if you have internal graph parameter with the same name as the environment variable, the value from internal parameter is always used in CloverETL.

July 28, 2009

Designer-Server integration testing

Filed under: Using CloverETL — Tags: , , , — Petr Uher @ 4:21 am

CloverETL’s development team is preparing a new amazing feature, integration of CloverETL Designer with CloverETL Server. This feature shifts work with Clover to a much more comfortable level.

I was asked to participate on testing of it. And I decided to share my impressions.

The main feature of integration allows you to work with CloverETL’s graph located on CloverETL Server in  the same way as if it would be located on your desktop machine. So no more copying of files from desktop to server, no more out-of-date files, all items are located only on server and accesible and editable in the Eclipse with CloverETL on your desktop, transformation graphs are editable in graphic format.

All graphs are run on server machine but you don’t lose any of advantages useful for developing and debugging, you can view debug data on edge, view data on reader without running of the graph, see tracking information in tracking view etc. In addition all runs of graphs are tracked on server so you can see all execution logs in the Executions History tab of server administration interface.

After initial doubts I have realized that it works and now I’m fascinated with it :-) . You can expect it with many other improvement in version 2.8 of CloverETL Designer. So forget Informatica, forget DataStage, use CloverETL :-) .

Older Posts »

Blog at WordPress.com.