CloverETL's Blog

October 23, 2009

Parallel reader

Filed under: Using CloverETL — Tags: , , , , — Martin Zatopek @ 11:36 am

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab
delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

October 8, 2009

Building DWH with CloverETL: Slowly Changing Dimension Type 1

Filed under: Using CloverETL — Tags: , , , , — Petr Uher @ 10:21 am

The very typical usage of ETL tools is loading the data warehouse (DWH). So I decided to write a tutorial that will describe typical data warehouse tasks (slowly changing dimensions, date dimension, filling fact tables) and propose solutions with using of CloverETL.

If you are a newbie in data warehousing I recommend you reading some of the books by Ralph Kimball or H. I. Imon.

Sample data warehouse collects the information about sales for a small store chain that offers electronics like iPod, MP3, laptops etc.

The DB schema of my data warehouse is very simple. It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price. you can see complete DB scheme on figure below.

DB schema of sample DWH

DB schema of sample DWH

Store dimension

One thing you will surely face when you build data warehouse is working with several types of slowly changing dimension (SCD). In this part of the tutorial I used the simplest SCD type.

The simplest and surely the most popular SCD type among ETL developers is slowly changing dimension type 1. It doesn’t store any history, so once the value in online transaction processing system (OLTP) has been changed, the value in DWH is immediately overwritten as well.

I decided to use SCD1 for store dimension which collects basic information about stores: address, store identifier, store manager etc. Each store is identified in OLTP by unique store number (natural key). But for DWH I have to generate an own surrogate identifier ID_D_STORE.

The basic idea of processing SCD1 is very simple: compare records in DWH and OLTP, insert missing records into DWH, update the DWH records according to OLTP. For all these tasks the attribute that helps us to find corresponding records is the natural key – STORE_NUMBER.

So let’s go to develop CloverETL graph. For better portability all inputs and output data are stored in csv files, thus you don’t have to configure any database. The store dimension of DWH is stored in D_STORE.tbl file, the actual data from OLTP are stored in Store_25092009.csv. In both of these files we have to read, sort on natural key STORE_NUMBER and find the records that aren’t in D_STORE (third output of DataIntersection). In this last step we will use DataIntersection component. Simultaneously (by second output of DataIntersection) we get the potential records that can be different in OLTP and DWH. These records are then filtered and only the records having any different value of any attribute are processed and new values are stored to D_STORE_update.tbl file. New records are written to D_STORE_insert.tbl file once ID_D_STORE attribute is added. ID_D_STORE attribute gains its value from sequence that we have already defined in CloverETL in advance. And that’s all. You can see the resulting graph below.

CloverETL graph D_STORE_SCD1

CloverETL graph D_STORE_SCD1

If you want to read/write data from/to database easily replace UniversalDataReaders with DBInputTables and UniversalDataWriters with DBOutputTable components.

You can download complete CloverETL project here.

To be continued. In the next part we will deal with slowly changing dimension type 2.

September 7, 2009

Designer-Server integration: HTTPS made easy

Filed under: Using CloverETL — Tags: , , , , — Jaroslav Urban @ 11:32 am

In CloverETL Designer 2.8.0, connecting to CloverETL Server over HTTPS protocol is supported. However, the client requires some configuration including import of client’s certificate to the server. Starting with CloverETL Designer 2.8.1, the situation is much simplified. The HTTPS can be used without any additional client configuration.

The usage scenario is similar to using a web browser – if the Designer detects an unknown server certificate, it asks the user if the certificate should be accepted & imported. A server certificate can be imported either permanently or temporarily for one Designer session.

Connecting to CloverETL Server over HTTPS

Connecting to CloverETL Server over HTTPS

In the above screenshot you can see an example of connecting to the CloverETL Server over HTTPS. The Designer detected an unknown certificate and asks the user whether the certificate should be accepted. The user can of course examine certificate’s content prior to accepting or refusing.

This simple HTTP connection work in case that the application server running CloverETL Server does not require a certificate from its clients. When it requires client certificates, then the Designer must be configured as previously.

September 2, 2009

New QuickBase components

Filed under: Using CloverETL — Tags: , , , — Martin Zatopek @ 8:34 am

We have great news for users of on-line database QuickBase from Intuit . CloverETL became a next tool which can be used to manipulate the data in this database. Now you can work with data in QuickBase without restraints and with full power of CloverETL. We introduced several specialized components dedicated for this purpose.

QuickBaseQueryReader serves as a reader of records queried from a table. You only need to specify a table name which you want to read from and a query in QuickBase query language. It is that simple. The second reader, QuickBaseRecordReader, is even simpler. Again, you specify a table name and instead of the little bit confusing query, you just specify the particular record identifiers you require.

Data writing is provided also by a set of two components – QuickBaseRecordWriter and QuickBaseImportCSV. First one is a complex writer of incoming data records to a given table and second one is a bit simpler, however much faster bulk data loader to QuickBase.

So we have prepared the whole family of components fully supporting the data manipulation in QuickBase – do not hesitate and try it.

Blog at WordPress.com.