Tag Archives: ETL

Loading Data Warehouse – Fact Table Load Wizard

Users who are using CloverETL for data-warehousing often need to create a data transformation which populates DWH fact table with transaction data from source system. Each such transformation contains many components that join keys from dimensional tables to processing records.

There is no problem, in case a user uses only few dimensional tables in process of loading fact table. But if there are much more tables, it is quite uncomfortable to build a whole transformation graph and set all components, create needed metadata and establish a database connection after that. For this we created in CloverETL 3.0 a brand new Fact Table Load Wizard that dramatically simplifies and speeds up development process of a transformation. This wizard can be found in the menu of CloverETL wizards.

What does a user needs as an input into the wizard is just a data file with source data which should be loaded into the fact table and a metadata describing structure of this data file. As the input exists, a user can use the wizard and enter requested information. For more information how to use the wizard please see related documentation on the official CloverETL web site.

The wizard generates ready-to-run graphs, but a user is welcome to change whatever is needed.

Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2

In the last part of our data warehouse (DWH) tutorial, I showed you how to load a dimension table that stores historical data according to the Slowly Changing Dimension Type 1 (SCD1). In today’s post, I will focus on a Slowly Changing Dimension Type 2 (SCD2) dimension table. I think that SCD2 is the most challenging sub-task of ETL part of DWH design and each ETL architect should be able to deal with it.

In contrast to SCD1, SCD2 table stores preserves history of attributes. So once the value of attribute is changed in external system  (OLTP) we have to create a new record in SCD2 dimension table with the actual value but we also have to mark the old record in SCD2 table as obsolete. The most common way to obsolete the record is to maintain two additional attributes: valid_from and valid_to. Then the record is considered valid at particular date D when valid_from < D ≤ valid_to. You can find a detailed explanation of SCD2 principles in Kimball’s DWH bible or on wikipedia.org.

Let us show how SCD2 works in real in a small example. We will use DWH schema introduced in SCD1’s post.

It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price.

Store table is populated as SCD1 and we will load Customer table that was marked as SCD2 dimension table. Let’s imagine that Customer changed his email. What will happen in OLTP and DWH Customer table named D_CUSTOMER?

OLTP:

C0001;John;Newman;john.newman@hotmail.com => C0001;John;Newman;newman.john@gmail.com

DWH:

0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;null =>
0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;2010-05-20
0002;C0001;John;Newman;newman.john@gmail.com;2010-05-21;null

Notice especially the first two attributes (columns) and the last two attributes of DWH table. The first attribute is a surrogate key, it is a unique identifier of the record in D_CUSTOMER table. It is generated by ETL process.  The second one (C0001) is a natural key, a unique identifier of customer in OLTP. When you list all records of the same natural key in D_CUSTOMER you will get a complete history of one customer.

The principle how SCD2 works is explained now I will describe an implementation of SCD2 in CloverETL. See the CloverETL’s graph bellow.

The basic data-flow of the graph is very simple: in the lower branch we read the data from OLTP (for us as in previous DWH post it’s a CSV file), in the upper branch we read data that is already stored in our SCD2 table. We have to use the Dedup component, as we want only one actual record for each customer. The two branches intersect in DataIntersection component that processes the records according to the natural key. The component has three output ports, as there are three possible outcomes:

  1. The record exists only in DWH. This should not happen, it means that the record was deleted in OLTP. The “normal” OLTPs do not allow delete of records. That kind of records end in Trash component.
  2. In DWH table there exists at least one record with the same natural key as the record coming from OLTP. That record goes through the second output port to the component that identifies whether the record was changed (ExtFilter component). And then the record is copied to two records: the first one that obsoletes the current record in D_CUSTOMER (identified by surrogate key) and  the second one that is inserted to D_CUSTOMER and stores the new values read from OLTP. The first one set column valid_to = today()-1 and the second record is inserted with valid_from = today() and valid_to = null.
  3. The record coming from OLTP is a new one, there is no record with the same natural key in DWH. In that case the record is sent to the third output port and in following components is inserted to D_CUSTOMER table with valid_from = today() and valid_to = null.

If you want to verify that your CloverETL SCD2 graph works correctly or if you are looking for sample data, you can simply import example project to your Clover installation. It is embedded to your CloverETL Designer as a DWHExample project. For more information how to import example project see online documentation.

CloverETL as a High-throughput XML Processor

XML is a markup language that has been around for some years now. Originally, it comes from the world of documents – used in web hypertext, word processors and other representations. Today, it is very popular in many areas, including the world of data exchange. The reasons are simple – the format is straightforward, well defined, and easily transferable accross platforms. XML can be easily read and modified by users in contrast to proprietary and binary formats. It also represents structured hierarchical data, which can be very difficult to express in plain CSV format. XML is self-descriptive, which heavily increases the user’s ability to understand data and eliminates the need of data format description and parsing instructions.

XML is often used to transport data between potentionally incompatible systems, resulting in a task to parse and store data of this format and eventually to process this data. CloverETL provides powerful tools to accomplish this task.

One of the components that provides XML parsing is XMLXPathReader. The user simply defines the mapping of each data element or attribute to a given CloverETL field. In the background of the component there is a DOM parser which allows the user to include general XPath expressions in the mapping definition.

In practice, users will often encounter vast XML files, which typically follow a standard structure. This structure contains records which represent a given entity (company, person, etc.) that can be repeated many times in a large XML data source. It is quite common that these sources of data come in sizes of 10s or even 100s of gigabytes. When this happens, DOM parsing is greatly inappropriate as all this data cannot be contained in memory. For this reason, another CloverETL XML parsing component becomes handy – XMLExtract. This handles records individually which are usually quite small, at least small enough to be processed in memory.

In XMLExtract, the user is able to define how each element can be mapped to a CloverETL record at every level of the XML structure . XMLExtract also provides the possibility of including a parent key at each structure level, thus allowing later complete reconstructions of the entire data structure. If the XML does not contain the unique key itself, it can also be easily generated using a CloverETL sequence object.

XML data and their basic integrity rules can be very well specified using XML Schema which today is a standard part of well defined data exchange. If you use XML Schema, CloverETL provides a very convenient visual drag&drop editor which helps the user build an XML mapping:

This screenshot represents an XML mapping which defines how XML and Clover fields are mapped. This mapping can also be displayed as text:

To provide an example where these methods were essential, CloverETL successfully completed a master data consolidation and matching project for an international insurance company. The XML Schemas were very complex, containing hundreds of different XML element types in its structure. The volume of data was over a hundred GBs describing tens of millions of customers as organizations and 4-5 million customers as persons. One of the many tasks assigned to CloverETL was to read and store the vast amount of data in XML in which it performed substantially greater due to XML’s fast sequential processing.

Data Profiling with CloverETL

Before you start to develop any data transformation you should explore your data (make data profiling). There are a lot of tools on the market that can help you. But why to install and learn another software when you can use the tool you are familiar with? CloverETL is mainly data transformation tool but it can be easily used for data profiling as well (as I will show you in this blog post).

It is very easy to do data statistic with latest version of CloverETL. You can find DataProfiling project in CloverETL Examples Projects. The project consists of two graphs: BasicStatistic and AdvancedStatistic.

The first one finds basic statistic for input data file:

  • minimum value for numeric fields or minimum length of data for string and byte fields
  • maximum value for numeric fields or maximum length of data for string and byte fields
  • average value for numeric fields or average length of data for string and byte fields
  • number of records in data file
  • number of not null values for each data field
  • number of null values for each data field

Additionally, for string data fields, it finds:

  • first not null value
  • if all values are ASCII

The second one calculates for each data field:

  • number of records in data file
  • number of not null values
  • number of unique values
  • minimum value
  • maximum value
  • average value for numeric fields
  • median value
  • modus value

It also finds frequency counts for fields with not many (the threshold is defined by a parameter HISTOGRAM_THRESHOLD) unique values.

BasicStatistic graph

The graphs in the project are prepared to analyze data from the excel file ORDERS.xls placed in data-in directory. But for purpose of this post we will analyze data stored in a flat file employees.list.dat (also placed in data-in directory).

To do that we need to set following parameters:

input_file=${DATAIN_DIR}/employees.list.dat

metadata=${META_DIR}/employees.fmt

READER_TYPE=DATA_READER

Metadata file (employees.fmt) has to contain metadata for employees.list.dat:

<?xml version="1.0" encoding="UTF-8"?>
<Record name="EMPLOYEE" recordDelimiter="n" recordSize="-1" type="delimited">
<Field delimiter="," format="#" name="EMP_NO" nullable="true" shift="0" type="integer"/>
<Field delimiter="," name="FIRST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="LAST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="PHONE_EXT" nullable="true" shift="0" type="string"/>
<Field delimiter="," format="dd/MM/yyyy" name="HIRE_DATE" nullable="true" shift="0" type="date"/>
<Field delimiter="," name="DEPT_NO" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_CODE" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_GRADE" nullable="true" shift="0" type="numeric"/>
<Field delimiter="," name="JOB_COUNTRY" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="SALARY" nullable="true" shift="0" type="numeric"/>
<Field name="FULL_NAME" nullable="true" shift="0" type="string"/>
</Record>

Lets see the graph with mid-results and output:

DataReader parses data from the input file. Normalizer creates record that consists of three basic fields: original field name, field type and “normalized” value, which is original value for numeric fields,  time in milliseconds for date fields and data length for string or byte fields. Moreover this component collects basic information about string data e.g.: finds first not null value, checks if string contains only ASCII characters and if it can be converted to number. The next component (Rollup – Statistic) calculates minimum, maximum and average value for each group of records (with the same field name). It also propagates first not null value, checks if all isAscii and isNumber fields are not false and sets result value for the whole group. Fourth and fifth component has only “cosmetic” aim – they convert times in milliseconds back to user friendly form and sort output records. The writer converts results to report.

By default the final report is a plain html file:

Data statistic for data-in/employees.list.dat

Field name Field type min max average count count not null count null first not null is Ascii is number
DEPT_NO string 3.0 3.0 3.0 51 43 8 600 true true
EMP_NO integer 0.0 145.0 58.392156862745104 51 51 0 true true
FIRST_NAME string 3.0 11.0 5.549019607843137 51 51 0 Robert true false
FULL_NAME string 10.0 22.0 14.549019607843137 51 51 0 Nelson, Robert true false
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 0
JOB_CODE string 2.0 24.0 6.1568627450980395 51 51 0 VP true false
JOB_COUNTRY string 2.0 11.0 3.6470588235294117 51 51 0 USA true false
JOB_GRADE number 0.0 5.0 3.4117647058823524 51 51 0 true true
LAST_NAME string 3.0 12.0 6.8431372549019605 51 51 0 Nelson true false
PHONE_EXT string 1.0 5.0 2.8627450980392157 51 51 0 250 true false
SALARY number 0.0 9.9E7 2267125.137254902 51 51 0 true true

but it can be easily changed to excel file (just adjust graph parameter WRITER_TYPE=XLS_WRITER).

AdvancedStatistic graph

The phase 0 of AdvancedStatistic graph works similarly as graph BasicStatistic, but it uses Aggregators for statistic calculations instead of Rollup. Be particular about component called Simplification in phase 0 of the graph: it stores number of records in file and names of fields with number of unique values under threshold in dictionary (marked by red ellipses on the picture above). Then Histogram filter component can read this field’s names and skip the records that aren’t between fields for frequency calculations (green eclipses on the picture). Phase 1 Aggregators count frequencies for fields that were filtered out by previous component.

Resulting file looks as follows:

Advanced data statistic and histograms for ./data-in/employees.list.dat

Statistics

Field name Field type min max average number count count not null count unique median modus
DEPT_NO string 000 900 51 43 20 600 623
EMP_NO integer 0.0 145.0 58.3921568627451 51 51 47 45.0 2.0
FIRST_NAME string Andrew Yuki 51 51 44 Mark Robert
FULL_NAME string Baldwin, Janet Young, Katherine 51 51 50 Lee, Terri Sutherland, Claudia
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 44 20/04/1992 00:00:00 02/01/1994 00:00:00
JOB_CODE string Admin Vice President 51 51 16 Mktg Eng
JOB_COUNTRY string Canada USA 51 51 9 USA USA
JOB_GRADE number 0.0 5.0 3.411764705882353 51 51 6 4.0 4.0
LAST_NAME string Baldwin Young 51 51 48 Lambert Johnson
PHONE_EXT string 1 null 51 51 46 3355 null
SALARY number 0.0 9.9E7 2267125.137254902 51 51 40 61637.8125 0.0

Histograms

Field name Field type value count count %
DEPT_NO string 8 ###############·····················································································
DEPT_NO string 000 2 ###·································································································
DEPT_NO string 100 2 ###·································································································
DEPT_NO string 110 2 ###·································································································
DEPT_NO string 115 2 ###·································································································
DEPT_NO string 120 3 #####·······························································································
DEPT_NO string 121 1 #···································································································
DEPT_NO string 123 1 #···································································································
DEPT_NO string 125 1 #···································································································
DEPT_NO string 130 2 ###·································································································
DEPT_NO string 140 2 ###·································································································
DEPT_NO string 180 2 ###·································································································
DEPT_NO string 600 2 ###·································································································
DEPT_NO string 621 4 #######·····························································································
DEPT_NO string 622 3 #####·······························································································
DEPT_NO string 623 5 #########···························································································
DEPT_NO string 670 2 ###·································································································
DEPT_NO string 671 3 #####·······························································································
DEPT_NO string 672 2 ###·································································································
DEPT_NO string 900 2 ###·································································································
JOB_CODE string Admin 4 #######·····························································································
JOB_CODE string CEO 1 #···································································································
JOB_CODE string CFO 1 #···································································································
JOB_CODE string Dir 1 #···································································································
JOB_CODE string Doc 1 #···································································································
JOB_CODE string Eng 15 #############################·······································································
JOB_CODE string Finan 1 #···································································································
JOB_CODE string Inside Sales Coordinator 1 #···································································································
JOB_CODE string Mktg 1 #···································································································
JOB_CODE string Mngr 4 #######·····························································································
JOB_CODE string PRel 1 #···································································································
JOB_CODE string SRep 9 #################···················································································
JOB_CODE string Sales 2 ###·································································································
JOB_CODE string Sales Representative 6 ###########·························································································
JOB_CODE string VP 2 ###·································································································
JOB_CODE string Vice President 1 #···································································································
JOB_COUNTRY string Canada 2 ###·································································································
JOB_COUNTRY string England 3 #####·······························································································
JOB_COUNTRY string France 1 #···································································································
JOB_COUNTRY string Italy 1 #···································································································
JOB_COUNTRY string Japan 2 ###·································································································
JOB_COUNTRY string Sales 1 #···································································································
JOB_COUNTRY string Switzerland 1 #···································································································
JOB_COUNTRY string UK 4 #######·····························································································
JOB_COUNTRY string USA 36 ######################################################################······························
JOB_GRADE number 0.0 1 #···································································································
JOB_GRADE number 1.0 2 ###·································································································
JOB_GRADE number 2.0 8 ###############·····················································································
JOB_GRADE number 3.0 14 ###########################·········································································
JOB_GRADE number 4.0 16 ###############################·····································································
JOB_GRADE number 5.0 10 ###################·················································································

CloverETL 2.9 Released: Infobright Data Writer, Web Services Component and Other New Features.

New CloverETL version 2.9. was just released. This version brings a new Infobright Data Writer component, enhances the connectivity by adding Web Services component and adds features that simplify common data transformation tasks.

New Features and Components:
Infobright Data Writer
In response to customer requests, this component writes data into Infobright software, a column-oriented relational database. Infobright is a provider of solutions designed to deliver a scalable data warehouse optimized for analytic queries.

Web Services component
The new component makes communication with Web Services easier than ever. It provides user friendly graphical interface for mapping your data into Web Service fields, automatically generates requests and process responses. It offers faster, easier and more comfortable way to interact with remote Web Services.

Reading formatted values from XLS
Additionally to reading plain data from MicrosoftTM ExcelTM sheets, the Excel component is now also capable of reading user-formatted values such as currencies, dates or numbers.

New tracking option
Customers can now see all absolute speed rates for finished data transformations, facilitating comparative analysis in pursuit of process improvements.

New Aspell Lookup table
Brand new implementation of this component brings better performance, improved configuration and better customization.

Improved treatment of empty (NULL) values
Developers can now specify special strings that should be treated as empty (NULL) when data is being parsed. This feature simplifies processing of typical application export files which often contain values insignificant for ETL processing. Additionally it may lead to improved processing throughput and lower memory consumption of data transformation.

More user friendly File URL dialog and improved LDAP functionality.

Customers can evaluate these new features along with CloverETL’s other leading capabilities with a free 30-day trial of the CloverETL Designer Pro evaluation, which is available at www.cloveretl.com Information management professionals can also evaluate the enterprise integration features of CloverETL Server via an online demo, which is also available at www.cloveretl.com.

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho (Part 3)

As I have promised I bring you a complex comparison of ETL tools: CloverETL, Talend and Pentaho.

Short summary of my previous posts: For testing I used two transformations based on TPCH test and the input data generated by dbgen utility. The transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part 2.

New testing:
To ensure my comparison a full complexity, all tools were tested as “desktop” and “enterprise” ETL tools. The “desktop” tools were running on laptop computer with a small amount of data. The “enterprise” ETL tools were running on server class machine with a large amount of data stored both in flat files and in a database. The transformation executed on server class machine was the same as the one I executed on desktop, only the size of input data was changed:

  • lineitem.tbl – 59,986,052 records, 7.24 GB
  • customers.tbl – 1,500,000 records, 233 MB
  • orders.tbl – 15,000,000 records, 1.62 GB

The results of flat file reading:

TPCH-Q1

TPCH-Q1

TPCH-Q3

TPCH-Q3

The new results of database reading, all previously published results, detailed information about used hardware and a summary are available in this final document.

I also described main features of all tools and my experiences to work with them. This part of the document expresses my opinions so it could be biased since I work mostly with CloverETL. If you don’t agree with anything, please express your opinion in comments. I will be pleased to discuss them with you.

Parallel Data Processing Comparison – CloverETL vs. Talend vs. Pentaho (Part 2)

Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the comparison.

The second transformation is also based on SQL query that I rewrote to ETL transformation. I chose Query 3 from http://www.tpc.org/tpch.

select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from customer, orders, lineitem
where c_mktsegment = ‘BUILDING’
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date ‘1995-03-15’
and l_shipdate > date ‘1995-03-15’
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate

Input data are generated by dbgen utility and stored in CSV files.

  • lineitem.tbl – 6,001,215 records, 724 MB
  • customers.tbl – 15,000 records, 23.2 MB
  • orders.tbl – 1,500,000 records, 163 MB

Expected output should contain 11,620 records.

There is a new item in the results. After a discussion in my previous post I added „Pentaho parallel“, Pentaho transformation that reads data in parallel mode. Thanks Matt for your transformation :-) without it I wasn’t able to set it up.

Matt Caster also presented an opinion that Pentaho is discriminated because the transformation sorts the data before aggregation in Pentaho transformation. Yes, I agree that sorting of 6,000,000 records takes a significant amount of execution time of the transformation. But I have no choice, Pentaho aggregate component requires sorted input. Today’s transformation is more fair in this aspect. The number of records flowing to aggregate component is smaller (30,519 records) so they can be easily sorted in memory and the sorting doesn’t influence the total execution time in such volume.

The versions of used ETL tools stay the same ones: CloverETL Designer 2.8.1, Talend Open Studio 3.1.3 and Pentaho Data Integration 3.2.0.

Also the hardware configuration and Java runtime parameters are the same:

  • Intel Core 2 Duo @ 1666 Mhz, 2048 MB RAM, 200GB SATA 5400 RPM, Windows Vista Home Premium 32bit.
  • -server -Xmx256m -Xmx1536m

Results:

  1. CloverETL ParallelReader
  2. Talend
  3. Pentaho parallel
  4. CloverETL UniversalDataReader
  5. Pentaho

Results

Transformations and the input data are available on filefactory.com. Today’s transformation are named TPCH2. The transformation from my previous post are named TPCH1.

Please give me a feedback, especially on Talend transformation if it’s correct.

Transformation graphs

CloverETL ParallelReader & UniversalDataReader

CloverETL ParallelReader & UniversalDataReader

Talend

Talend

Pentaho

Pentaho

Pentaho parallel

Pentaho parallel

ParallelReader Component: Performance Boost in Data Processing

In October release 2.8.1 of Clover we introduced a new component which definitely should attract your attention – the Parallel Reader. The name itself already suggests the goal of the component – improve reading speed by going parallel. The component is very similar to Universal Data Reader in function – it reads delimited flat files like CSV, tab delimited, etc. – much hasn’t changed here. But the real difference comes from under the hood.

There are two major optimizations which allow Parallel Reader to exhibit excellent performance results, especially on server-class machines with fast modern disks or better yet, disk arrays. The first optimization we have done is – of course – reading the file in parallel. The input file is divided into a set of virtual data chunks which are fed into reading threads. These work all together at the same time – each one parsing data records just from its own file part. The number of threads can be specified by component parameter “Level Of Parallelism” and should reflect the hardware setup – e.g. number of disks in a stripped RAID – to harness the maximum power of Parallel Reader. Another great performance gain we achieved is merely by just simplifying the data parser inside. This parser is as simple as possible – although with limited validation, error handling, and some functionality – but really, really fast.

Although the new reader has a few limitations coming from its nature, extreme speed in common use cases compensates all these drawbacks. If you are processing big amounts of data (hundreds of megabytes and more) and your transformation does not depend on data records being read in original order, Parallel Reader is here and it might just be the right choice for you – why not give it a try?

Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1

The very typical usage of ETL tools is loading the data warehouse (DWH). So I decided to write a tutorial that will describe typical data warehouse tasks (slowly changing dimensions, date dimension, filling fact tables) and propose solutions with using of CloverETL.

If you are a newbie in data warehousing I recommend you reading some of the books by Ralph Kimball or H. I. Imon.

Sample data warehouse collects the information about sales for a small store chain that offers electronics like iPod, MP3, laptops etc.

The DB schema of my data warehouse is very simple. It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price. you can see complete DB scheme on figure below.

DB schema of sample DWH

DB schema of sample DWH

Store dimension

One thing you will surely face when you build data warehouse is working with several types of slowly changing dimension (SCD). In this part of the tutorial I used the simplest SCD type.

The simplest and surely the most popular SCD type among ETL developers is slowly changing dimension type 1. It doesn’t store any history, so once the value in online transaction processing system (OLTP) has been changed, the value in DWH is immediately overwritten as well.

I decided to use SCD1 for store dimension which collects basic information about stores: address, store identifier, store manager etc. Each store is identified in OLTP by unique store number (natural key). But for DWH I have to generate an own surrogate identifier ID_D_STORE.

The basic idea of processing SCD1 is very simple: compare records in DWH and OLTP, insert missing records into DWH, update the DWH records according to OLTP. For all these tasks the attribute that helps us to find corresponding records is the natural key – STORE_NUMBER.

So let’s go to develop CloverETL graph. For better portability all inputs and output data are stored in csv files, thus you don’t have to configure any database. The store dimension of DWH is stored in D_STORE.tbl file, the actual data from OLTP are stored in Store_25092009.csv. In both of these files we have to read, sort on natural key STORE_NUMBER and find the records that aren’t in D_STORE (third output of DataIntersection). In this last step we will use DataIntersection component. Simultaneously (by second output of DataIntersection) we get the potential records that can be different in OLTP and DWH. These records are then filtered and only the records having any different value of any attribute are processed and new values are stored to D_STORE_update.tbl file. New records are written to D_STORE_insert.tbl file once ID_D_STORE attribute is added. ID_D_STORE attribute gains its value from sequence that we have already defined in CloverETL in advance. And that’s all. You can see the resulting graph below.

CloverETL graph D_STORE_SCD1

CloverETL graph D_STORE_SCD1

If you want to read/write data from/to database easily replace UniversalDataReaders with DBInputTables and UniversalDataWriters with DBOutputTable components.

You can download complete CloverETL project here.

To be continued. In the next part we will deal with slowly changing dimension type 2.

Processing Data from QuickBase

We have great news for users of on-line database QuickBase from Intuit . CloverETL became a next tool which can be used to manipulate the data in this database. Now you can work with data in QuickBase without restraints and with full power of CloverETL. We introduced several specialized components dedicated for this purpose.

QuickBaseQueryReader serves as a reader of records queried from a table. You only need to specify a table name which you want to read from and a query in QuickBase query language. It is that simple. The second reader, QuickBaseRecordReader, is even simpler. Again, you specify a table name and instead of the little bit confusing query, you just specify the particular record identifiers you require.

Data writing is provided also by a set of two components – QuickBaseRecordWriter and QuickBaseImportCSV. First one is a complex writer of incoming data records to a given table and second one is a bit simpler, however much faster bulk data loader to QuickBase.

So we have prepared the whole family of components fully supporting the data manipulation in QuickBase – do not hesitate and try it.