Tag Archives: data quality

A Look Back: CloverETL and Data Integration in 2011

As 2011 comes to a close, we’d like to take the time to reflect on what this year has brought CloverETL, its users, and our customers.

Since CloverETL is, after all, a data integration platform, the world of integration is at our core. We’re constantly striving to challenge ourselves in new ways and improve how we approach data integration. This year was no different.

Enhancing Our Core – Two Upgrades of CloverETL

In the past six months, we released two upgraded versions of CloverETL. CloverETL 3.1, published in June, brought significant changes to the platform in several areas. With a deeper focus on connectivity and enhanced support of various data formats, CloverETL 3.1 helped users better process data with complex structure, emails, and Lotus documents, to name a few. The latest version of CloverETL, version 3.2, offered further enhancements to the user experience, as well as improved the processing of large data records.

Data Integration Meets Data Quality – CloverETL Profiler

This year was also a year for new products. With Clover, we’ve moved forward with an evolved sense of the data world. Because data integration, data quality, and other data disciplines are becoming more and more intertwined, we developed the CloverETL Profiler, data profiling application. Released in beta back in October, the profiler helps users make informed decisions on how to improve the quality of transformed data, which is particularly useful as precursor to a greater data integration projects. CloverETL also integrates more easily with the AddressDoctor solution to improve the quality of geographical information.

Strengthening CloverETL Presence in the US Market

In June, Javlin, the developer of CloverETL, opened up its new office in the Washington D.C. area, which became the headquarters of Javlin Inc., our US presence. Javlin Inc., with both a dedicated sales and customer service force, brings Clover to a whole new market of possibilities.

Last but not least, we are pleased to see that our OEM data integration offer will have a number of important implementations in the upcoming year. (But more on that later. Stay tuned.)

As we leave 2011, we can say that this past year was a whirlwind of hard work, exciting releases, and interesting customers and stories. We’re looking forward to another great year with CloverETL. Cheers to the New Year.

Data Profiling with CloverETL Profiler beta

The process of data integration, data migration, consolidation and other data manipulation projects consists of a variety of steps and tasks. Javlin supports many critical tasks within these projects with a versatile ETL tool that provides technical solutions to transform data and connect different systems and data sources with various data formats.

However, there are additional aspects in order to achieve success with data processing projects. One important project phase to ensure future success is the analysis phase. At Javlin, we are now on the verge of introducing new tool that will boost productivity in the early stages of a project. This product is our new data profiling tool – CloverETL Profiler.

CloverETL Profiler helps give you insight into your data. By creating a profile of your data using various metrics, you can get the big picture of the current condition of your data. This information has a large impact on the next stages of your project.

Real World Use – Data Warehousing Project

Imagine you are project manager with a complex task at hand – building a data warehouse. You have couple of data sources – database tables, CSV files, log files you want to analyze and probably some others. After you define the basic requirements for your data warehouse and its structure, you need to check the condition of your source data.

As you use the CloverETL Profiler, you will gain:

  • insight into the quality of your data indicating whether you need the perform a data cleansing phase in your project:
    • detect what data are missing (high percentage of missing or invalid values) looking for a different source
    • you may include data cleansing tools in the project budget and reserve tasks in your project plan
  • information about the structure of data in all your sources:
    • detect the need for ETL tools on specific data sources to transform your input data into a common format

Instead of writing your own tools and testing the data in complex way, you are just a few clicks away from creating detailed profile of your data.

Product Vision

The usage of the CloverETL Profiler tool for the project analysis phase is just the beginning. As part of our vision for the future, we are on the path to create a fully integrated product in CloverETL Designer and CloverETL Server family.

Its main purposes will be to aid:

  • CloverETL Designer: to assist the development of ETL transformation by allowing the creation of profiles of data in any stage of the transformation and checking the results during the debugging process
  • CloverETL Server: to constantly monitor the data condition on the production server  and invoke actions based on the specific condition detected in the profile of your data

Beta Program

Right now, we are excited to announce the beginning of our beta program for CloverETL Profiler. We invite you, people from the data integration community, to join the testing phase; it’s a great opportunity to not only have a look behind the scenes, but to also be a part of the development process by contributing your insights. For more information, please visit www.cloveretl.com/products/profiler.

Address Cleansing and Transliteration with CloverETL and AddressDoctor

Data quality usually goes hand in hand with data integration. The new version CloverETL 3.1 has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mail addresses, AddressDoctor can also be used for transliteration of non-Latin writing systems into Latin characters or enriching addresses with latitude and longitude information.

CloverETL integrates AddressDoctor software through a dedicated AddressDoctor component. In all cases you need to have java library AddressDoctor5.jar, native libraries (they all need to be on the class-path when running a graph) and country databases. You also need unlock codes for your databases.

AddressDoctor component has 4 required parameters:

  • Configuration is driven by configXml or configFileparameter.  The simplest configuration file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <SetConfig>
             <General MaxAddressObjectCount="10"/>
             <UnlockCode><Your unlock code here></UnlockCode>
             <DataBase CountryISO3="ALL" Path="/home/user/AD/db"/>
           </SetConfig>
  • Parameters can be set by parameterXml or parameterFileattribute. The simplest parameter file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <Parameters>
             <Process Mode="PARSE"/>
             <Input/>
             <Result/>
           </Parameters>
  • Input mapping defines mapping between Clover input fields and AddressDoctor address properties.
  • Output mapping defines mapping between AddressDoctor output address properties and your output.

Transliteration example

Imagine that you have addresses all over the world saved in their original languages. For some languages, you would not even recognize them– to say nothing of their correct reading. Let’s try with following addresses:

Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic
Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland
Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska
Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION
John Smith;100 Main Street;New York NY 10023;USB

To transliterate above addresses to Ascii, we need to set following parameters:

  • Mode to PARSE to transliterate without processing the address, Optimization level to STANDARD as it produces the best results. Both in Process tab of Parameters attribute:
  • Encoding to UTF-16 in Input tab of Parametersattribute:
  • Preferred language to ENGLISH and Preferred script to ASCII_SIMPLIFIED in Result tab of Parameters attribute:

Now we need to feed the Addressdoctor with input mapping. Input metadata:

Corresponds with Contact name, Street complete, Locality complete and Country name address properties:

–>

Similar mapping needs to be provided for output properties:

–>

We can also connect the Error port for invalid (unrecognised) addresses:

Now we can run our graph getting following results:

Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

But one address is missing. Look at the error port:

N1 means Validation Error: No validation performed because country was not recognized

The output can be even improved by setting ASCII_EXTENDED as Preferred script in Result tab of Parameters attribute:

Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

Now the transcription corresponds with pronunciation.

Enrichment example

If you have proper database, you can enrich addresses with data you don’t know, e.g. ZIP code, geocoding, certification status.

Let’s consider following addresses:

To validate an address according to the Canada Post SERP rules we would need certified database for Canada. We need to set in Configuration as well as in Parameters:

and

After AddressDoctor transformation we get:

Count ResultNumber ElementItem Country Province Locality PostalCode Street Number SERPStatus SERPCategory ResultProcessStatus ResultModeUsed ResultPreferredScript ResultPreferredLanguage ResultDataMailabilityScore ResultDataElementResultStatus ResultDataElementInputStatus ResultDataElementRelevance
1 1 1 CANADA ON TORONTO M4P 3J5 YONGE STREET 2384 ESE1 C C3 CERTIFIED ASCII_SIMPLIFIED DATABASE 3 80F0F0F0F000004000E0 00606060600000200060 10101010100000100010
1 1 1 CANADA SK LA RONGE FINLAYSON ST 108 ESE1 N I1 CERTIFIED ASCII_SIMPLIFIED DATABASE 0 000000000000000000E0 00601010100000000060 00000000000000000000
1 1 1 CANADA SK PRINCE ALBERT S6V 0C7 15TH ST E 801 ESE1 C C4 CERTIFIED ASCII_SIMPLIFIED DATABASE 4 80F0F0F0F00000F000E0 00606060600000600060 10101010100000100010

Note the fields:

  • PostalCode – for recognized addresses we get valid postal code
  • SERPCategory :
    • C: Corrected
    • N: Incorrect
  • ResultProcessStatus and ResultDataMailabilityScore:
    • C3 (3) : Corrected – but some elements could not be checked (should be fine)
    • I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)
    • C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)

Download the transformation graph with data

Data sampling with CloverETL

Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How to create a representative data probe on the full data set? Obviously, the easiest way is to read just part of data from the beginning. But such data sample can be very unreliable.  I’ve prepared a few simple graphs that create a data probe which can be regarded as representative for the full data set.

All graphs were created based on the sampling methods described in the article Sampling (statistics).

Simple random sampling

In this method each record has the same probability of selection. Filtering is based on double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): record is selected if the drawn number is lower than required sample set size:

Systematic sampling

Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements in regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:

Sorting can be disabled in this graph. Then it is selected just every k-th element from the full data set, starting from a randomly selected record from the interval [1, k].

Stratified sampling

If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each stratum must be selected:

Probability proportional to size sampling

Probability for each record is set to be proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the selected field. For each group of records it it is used systematic sampling method:

Methods comparison

Simple random sampling method is the simplest and fastest. It is sufficient in most cases. Systematic sampling with disabled sorting is as fast as simple random sampling and produces also strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into the separated groups that have reasonable sizes. In other cases the data probe is a lot of bigger than requested.

Please see the attached CloverETL project with the above graphs. It also contains the graph for comparison of samples created with different sampling methods. I’ve done some tests for the file containing 5,000,000 rows with information about financial transactions. Each row contains unique transaction id, id of a customer, transaction amount and currency info. Total number of customers is 50,001; number of possible currencies is 35. I performed two sets of tests: one for the group defined by customer id and one defined by currency id.

Results for the sampling_field = CustomerId

Stratum is defined by id of customer. All data can be split to 50,001 groups with sizes from 61 to 143 transactions.

Following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 772 ms 0 h 1 m 34 s 965 ms 0 h 1 m 33 s 831 ms 0 h 1 m 30 s 973 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 4 71 0.0563 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 2 110 0.0181 1 110 0.0090 2 110 0.0181 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49937 5000000 0.0099 50000 5000000 0.0100 68172 5000000 0.0136 50011 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 741 ms 0 h 1 m 34 s 474 ms 0 h 1 m 32 s 628 ms 0 h 1 m 33 s 949 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 1 110 0.0090 1 110 0.0090 3 110 0.0272 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 2 83 0.0240 0 83 0.0000
10000 1 101 0.0099 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 2 99 0.0202 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 1 109 0.0091 1 109 0.0091 1 109 0.0091
10003 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 0 86 0.0000
total 49931 5000000 0.0099 50000 5000000 0.0100 68369 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 24 s 975 ms 0 h 1 m 37 s 446 ms 0 h 1 m 29 s 98 ms 0 h 1 m 32 s 857 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 1 110 0.0090 2 110 0.0181 2 110 0.0181
100 0 93 0.0000 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49983 5000000 0.0099 50000 5000000 0.0100 68258 5000000 0.0136 49900 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Results for the same test but with data sorting disabled in systematic sampling method:

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 168 ms 0 h 0 m 23 s 117 ms 0 h 1 m 35 s 414 ms 0 h 1 m 30 s 985 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 1 110 0.0090 2 110 0.0181 1 110 0.0090 1 110 0.0090
100 0 93 0.0000 0 93 0.0000 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 3 109 0.0275 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 2 86 0.0232 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 0 86 0.0000 2 86 0.0232 0 86 0.0000
total 50081 5000000 0.0100 50000 5000000 0.0100 68227 5000000 0.0136 49966 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 23 s 78 ms 0 h 0 m 19 s 178 ms 0 h 1 m 33 s 148 ms 0 h 1 m 29 s 261 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 0 94 0.0000 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 3 110 0.0272 1 110 0.0090 1 110 0.0090
100 3 93 0.0322 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 0 83 0.0000 2 83 0.0240 1 83 0.0120
10000 0 101 0.0000 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 3 99 0.0303 1 99 0.0101
10002 1 109 0.0091 0 109 0.0000 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 0 86 0.0000 1 86 0.0116 1 86 0.0116
10004 3 86 0.0348 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50056 5000000 0.0100 50000 5000000 0.0100 68528 5000000 0.0137 50033 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 244 ms 0 h 0 m 27 s 52 ms 0 h 1 m 35 s 49 ms 0 h 1 m 27 s 725 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 2 71 0.0281 0 71 0.0000
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 0 110 0.0000 2 110 0.0181 4 110 0.0363 1 110 0.0090
100 2 93 0.0215 2 93 0.0215 1 93 0.0107 1 93 0.0107
1000 2 83 0.0240 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 0 101 0.0000 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 4 99 0.0404 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 2 109 0.0183 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 0 86 0.0000
10004 0 86 0.0000 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50116 5000000 0.0100 50000 5000000 0.0100 68470 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Since the groups are really small, there should be selected none or one record from each group and for the smaller groups we should have more often zero selected records. In relation to this criteria the PPS sampling method and systematic sampling method with sorting data enabled give the best results. Data sample created with stratified method is always oversized.

Results for the sampling_field = CurrencyId

Stratum is defined by id of currency. All data can be split to 35 groups with very similar sizes from 142,042 to 143,572 transactions.

The following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CurrencyId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 21 s 681 ms 0 h 1 m 26 s 859 ms 0 h 1 m 25 s 970 ms 0 h 1 m 27 s 85 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1450 142623 0.0101 1427 142623 0.0100 1447 142623 0.0101 1426 142623 0.0099
1 1371 142925 0.0095 1429 142925 0.0099 1430 142925 0.0100 1429 142925 0.0099
10 1420 142897 0.0099 1429 142897 0.0100 1432 142897 0.0100 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1443 142896 0.0100 1429 142896 0.0100
12 1383 142522 0.0097 1425 142522 0.0099 1488 142522 0.0104 1425 142522 0.0099
13 1468 142461 0.0103 1425 142461 0.0100 1395 142461 0.0097 1424 142461 0.0099
14 1449 142997 0.0101 1430 142997 0.0100 1479 142997 0.0103 1430 142997 0.0100
15 1401 142697 0.0098 1426 142697 0.0099 1438 142697 0.0100 1427 142697 0.0100
16 1396 143137 0.0097 1432 143137 0.0100 1387 143137 0.0096 1431 143137 0.0099
17 1464 142517 0.0102 1425 142517 0.0099 1413 142517 0.0099 1425 142517 0.0099
total 49959 5000000 0.0099 50000 5000000 0.0100 50075 5000000 0.0100 49997 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 949 ms 0 h 1 m 25 s 726 ms 0 h 1 m 27 s 629 ms 0 h 1 m 24 s 537 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1449 142623 0.0101 1427 142623 0.0100 1496 142623 0.0104 1426 142623 0.0099
1 1468 142925 0.0102 1429 142925 0.0099 1442 142925 0.0100 1429 142925 0.0099
10 1436 142897 0.0100 1429 142897 0.0100 1406 142897 0.0098 1429 142897 0.0100
11 1436 142896 0.0100 1429 142896 0.0100 1402 142896 0.0098 1429 142896 0.0100
12 1410 142522 0.0098 1425 142522 0.0099 1454 142522 0.0102 1425 142522 0.0099
13 1438 142461 0.0100 1425 142461 0.0100 1414 142461 0.0099 1425 142461 0.0100
14 1420 142997 0.0099 1430 142997 0.0100 1450 142997 0.0101 1430 142997 0.0100
15 1412 142697 0.0098 1427 142697 0.0100 1400 142697 0.0098 1427 142697 0.0100
16 1453 143137 0.0101 1431 143137 0.0099 1442 143137 0.0100 1431 143137 0.0099
17 1431 142517 0.0100 1425 142517 0.0099 1372 142517 0.0096 1425 142517 0.0099
total 50163 5000000 0.0100 50000 5000000 0.0100 49709 5000000 0.0099 50000 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 27 s 716 ms 0 h 1 m 26 s 865 ms 0 h 1 m 26 s 657 ms 0 h 1 m 26 s 254 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1488 142623 0.0104 1426 142623 0.0099 1416 142623 0.0099 1426 142623 0.0099
1 1353 142925 0.0094 1429 142925 0.0099 1434 142925 0.0100 1429 142925 0.0099
10 1417 142897 0.0099 1429 142897 0.0100 1390 142897 0.0097 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1438 142896 0.0100 1429 142896 0.0100
12 1448 142522 0.0101 1425 142522 0.0099 1408 142522 0.0098 1425 142522 0.0099
13 1412 142461 0.0099 1425 142461 0.0100 1432 142461 0.0100 1424 142461 0.0099
14 1440 142997 0.0100 1430 142997 0.0100 1471 142997 0.0102 1430 142997 0.0100
15 1445 142697 0.0101 1427 142697 0.0100 1530 142697 0.0107 1427 142697 0.0100
16 1436 143137 0.0100 1431 143137 0.0099 1456 143137 0.0101 1432 143137 0.0100
17 1381 142517 0.0096 1425 142517 0.0099 1365 142517 0.0095 1426 142517 0.0100
total 50089 5000000 0.0100 50000 5000000 0.0100 49707 5000000 0.0099 49999 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling

With such large groups all the methods give very good results. Although no doubt we get the best results using the systematic sampling or PPS sampling methods where the sample size is always within the limits 0.0099 to 0.0100.

Download the transformation graph with data

Spell Checking for Better Data Quality – AspellLookup Table

AspellLookupTable is a commercial lookup table which has been around since CloverETL 2.6. Because Aspell is a free software spell checker, you might  be wondering what it is used for in CloverETL. In fact, AspellLookupTable does not perform any spell checking at all, it “just” allows you to lookup data records with keys similar to the one you provide. This may be useful e.g. when looking for a street whose name is misspelled to a certain extent.

Similarity of String Keys

The similarity of two string keys is measured in terms of edit operations used to “transform” a misspelled key to the correct one. At the moment, CloverETL supports five edit operations: change of character case, character transposition, deletion of a character, insertion of a character, and finally character replacement. Each edit operation is assigned an edit cost which reflects its “seriousness”. By default, edit costs are set to 10, 90, 95, 95, and 100, respectively. As you can see, the difference in character case is not considered as serious as the character transposition. However, it is possible to tweak the edit costs as you wish.

If  you add all edit costs of the minimum number of edit operations required to “transform” the misspelled key to the correct one, you get an edit distance of the two keys. In other words, you get the similarity measure of the keys. However, the edit distance itself is just a number which will vary for different keys. Therefore you need to somehow determine which data records should be included in the result set and which should be thrown away. And that’s why you also need to specify a spelling threshold. This threshold is 230 by default but may be changed to best suit your needs.

An example is worth a thousand words, so let’s assume that one of the keys is "Hello". If you use "helo" for lookup, the edit distance will be 105 because you had to change the case of 'h' (cost = 10) and to insert another 'l' (cost = 95). If you used "Halllo" instead, the edit distance would be 195 ('a' was replaced by 'e' and one 'l' had to be removed). Both these queries would return a non-empty result set as their edit distance is below the spelling threshold. On the other hand, "Bye" as a lookup key would result in an empty result set as its edit distance is way above the threshold.

AspellLookupTable in Action

AspellLookupTable is a great candidate for address cleansing. Imagine you have a huge list of incorrectly spelled street names and you need to determine whether these streets actually exist in a certain town or city. You also have a directory of correctly spelled street names with several additional information, e.g. the town name, its region, etc. In the end, you want a list of all the streets, and corresponding towns, that do exist. Does it seem difficult? No, it’s a piece of cake with CloverETL:

First you need to set up your AspellLookupTable instance to load the street directory. Then you just load the invalid addresses and “pair” them with existing streets using the lookup table. When you have all the streets joined with whatever data you need, you may perform some further processing and then store the result in a flat file. Pretty simple, huh? :-)

Download the transformation graph with data

Data Quality at a Glance Conference

Javlin a.s., producer of CloverETL, took part in a Data Quality at a Glance Conference held on April 20th, at PriceWaterhouseCoopers’s premises in Prague. This conference was organized by IDG and Javlin served there as the professional supervisor partner.

Javlin together with other conference partners PriceWaterhouseCoopers, SAS and Ataccama each held presentations on Data Quality topics.

The first presentation was held by Mr. Snytr. It discussed Personal data and its quality from the view of the Office for the Protection of Personal Data.

Mr. Maly, senior manager at PriceWaterhouseCoopers talked in his presentation about Optimizing business processes. According to him, data quality is closely related to quality processes. Poor data has an impact on strategic decision-making and can cause a loss of business opportunity and/or profit. Therefore, it is important to find the source of data errors and set the right data processes. Mr. Maly emphasized that it is essential to cleanse data continuously.

Mr. Kyjonka from SAS held a presentation on One version of truth for all or MDM in business life. He highlighted importance of Data integration, Data Warehousing and MDM for getting the right data on time. He showed 3 important parts of MDM – System of Record, MDM Hub and Integration infrastructure. Further it was shown that 4 different types of MDM solutions can be used for various purposes – Registry style, Transaction style, Hybrid style, and Consolidation style (ETL). Choosing the right style of MDM solution depends on budget and how much time the company has. Quality cleansed data, technical infrastructure, Data Governance program, willingness to share data, etc are some of the important factors for MDM.

Mr. Matous, Javlin’s consultant had a presentation on Data Cleansing. According to him data quality is the process of detection, reporting and correction of the invalid or missing values in data. Mr. Matous made it clear how important it is to do data audits and why to use a data quality scorecard. A data quality scorecard tracks the financial impact of poor data and estimates the return on investment into Data Quality activities. It helps managers determine whether or not to invest into data cleansing tools. Mr. Matous talked about several Data Quality benefits in business. Some of them include increased efficiency of marketing campaigns, early warning system, increased credibility and reputation among customers. It was also shown how data cleansing could be done using CloverETL.

Mr. Vojtek, VP of Engineering at Javlin, discussed a specific case study where Javlin had undertaken data quality improvements. He emphasized how data quality is essential for quality business results. It was shown what kinds of pitfalls could be experienced in a data quality implementation in multinational companies.

The last presentation of the conference was hold by Mr. Kyjonka from SAS. He discussed a case study named The clever way to cleanse data. He argues that when companies think they have only 10% of poor or bad data, they usually have about 40% bad data. In his particular case 90% of the data had to be corrected during the data cleansing process.

Live from the Conference – @CloverETL on Twitter. Follow the tag #DQPWC.

Data Profiling with CloverETL

Before you start to develop any data transformation you should explore your data (make data profiling). There are a lot of tools on the market that can help you. But why to install and learn another software when you can use the tool you are familiar with? CloverETL is mainly data transformation tool but it can be easily used for data profiling as well (as I will show you in this blog post).

It is very easy to do data statistic with latest version of CloverETL. You can find DataProfiling project in CloverETL Examples Projects. The project consists of two graphs: BasicStatistic and AdvancedStatistic.

The first one finds basic statistic for input data file:

  • minimum value for numeric fields or minimum length of data for string and byte fields
  • maximum value for numeric fields or maximum length of data for string and byte fields
  • average value for numeric fields or average length of data for string and byte fields
  • number of records in data file
  • number of not null values for each data field
  • number of null values for each data field

Additionally, for string data fields, it finds:

  • first not null value
  • if all values are ASCII

The second one calculates for each data field:

  • number of records in data file
  • number of not null values
  • number of unique values
  • minimum value
  • maximum value
  • average value for numeric fields
  • median value
  • modus value

It also finds frequency counts for fields with not many (the threshold is defined by a parameter HISTOGRAM_THRESHOLD) unique values.

BasicStatistic graph

The graphs in the project are prepared to analyze data from the excel file ORDERS.xls placed in data-in directory. But for purpose of this post we will analyze data stored in a flat file employees.list.dat (also placed in data-in directory).

To do that we need to set following parameters:

input_file=${DATAIN_DIR}/employees.list.dat

metadata=${META_DIR}/employees.fmt

READER_TYPE=DATA_READER

Metadata file (employees.fmt) has to contain metadata for employees.list.dat:

<?xml version="1.0" encoding="UTF-8"?>
<Record name="EMPLOYEE" recordDelimiter="n" recordSize="-1" type="delimited">
<Field delimiter="," format="#" name="EMP_NO" nullable="true" shift="0" type="integer"/>
<Field delimiter="," name="FIRST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="LAST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="PHONE_EXT" nullable="true" shift="0" type="string"/>
<Field delimiter="," format="dd/MM/yyyy" name="HIRE_DATE" nullable="true" shift="0" type="date"/>
<Field delimiter="," name="DEPT_NO" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_CODE" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_GRADE" nullable="true" shift="0" type="numeric"/>
<Field delimiter="," name="JOB_COUNTRY" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="SALARY" nullable="true" shift="0" type="numeric"/>
<Field name="FULL_NAME" nullable="true" shift="0" type="string"/>
</Record>

Lets see the graph with mid-results and output:

DataReader parses data from the input file. Normalizer creates record that consists of three basic fields: original field name, field type and “normalized” value, which is original value for numeric fields,  time in milliseconds for date fields and data length for string or byte fields. Moreover this component collects basic information about string data e.g.: finds first not null value, checks if string contains only ASCII characters and if it can be converted to number. The next component (Rollup – Statistic) calculates minimum, maximum and average value for each group of records (with the same field name). It also propagates first not null value, checks if all isAscii and isNumber fields are not false and sets result value for the whole group. Fourth and fifth component has only “cosmetic” aim – they convert times in milliseconds back to user friendly form and sort output records. The writer converts results to report.

By default the final report is a plain html file:

Data statistic for data-in/employees.list.dat

Field name Field type min max average count count not null count null first not null is Ascii is number
DEPT_NO string 3.0 3.0 3.0 51 43 8 600 true true
EMP_NO integer 0.0 145.0 58.392156862745104 51 51 0 true true
FIRST_NAME string 3.0 11.0 5.549019607843137 51 51 0 Robert true false
FULL_NAME string 10.0 22.0 14.549019607843137 51 51 0 Nelson, Robert true false
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 0
JOB_CODE string 2.0 24.0 6.1568627450980395 51 51 0 VP true false
JOB_COUNTRY string 2.0 11.0 3.6470588235294117 51 51 0 USA true false
JOB_GRADE number 0.0 5.0 3.4117647058823524 51 51 0 true true
LAST_NAME string 3.0 12.0 6.8431372549019605 51 51 0 Nelson true false
PHONE_EXT string 1.0 5.0 2.8627450980392157 51 51 0 250 true false
SALARY number 0.0 9.9E7 2267125.137254902 51 51 0 true true

but it can be easily changed to excel file (just adjust graph parameter WRITER_TYPE=XLS_WRITER).

AdvancedStatistic graph

The phase 0 of AdvancedStatistic graph works similarly as graph BasicStatistic, but it uses Aggregators for statistic calculations instead of Rollup. Be particular about component called Simplification in phase 0 of the graph: it stores number of records in file and names of fields with number of unique values under threshold in dictionary (marked by red ellipses on the picture above). Then Histogram filter component can read this field’s names and skip the records that aren’t between fields for frequency calculations (green eclipses on the picture). Phase 1 Aggregators count frequencies for fields that were filtered out by previous component.

Resulting file looks as follows:

Advanced data statistic and histograms for ./data-in/employees.list.dat

Statistics

Field name Field type min max average number count count not null count unique median modus
DEPT_NO string 000 900 51 43 20 600 623
EMP_NO integer 0.0 145.0 58.3921568627451 51 51 47 45.0 2.0
FIRST_NAME string Andrew Yuki 51 51 44 Mark Robert
FULL_NAME string Baldwin, Janet Young, Katherine 51 51 50 Lee, Terri Sutherland, Claudia
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 44 20/04/1992 00:00:00 02/01/1994 00:00:00
JOB_CODE string Admin Vice President 51 51 16 Mktg Eng
JOB_COUNTRY string Canada USA 51 51 9 USA USA
JOB_GRADE number 0.0 5.0 3.411764705882353 51 51 6 4.0 4.0
LAST_NAME string Baldwin Young 51 51 48 Lambert Johnson
PHONE_EXT string 1 null 51 51 46 3355 null
SALARY number 0.0 9.9E7 2267125.137254902 51 51 40 61637.8125 0.0

Histograms

Field name Field type value count count %
DEPT_NO string 8 ###############·····················································································
DEPT_NO string 000 2 ###·································································································
DEPT_NO string 100 2 ###·································································································
DEPT_NO string 110 2 ###·································································································
DEPT_NO string 115 2 ###·································································································
DEPT_NO string 120 3 #####·······························································································
DEPT_NO string 121 1 #···································································································
DEPT_NO string 123 1 #···································································································
DEPT_NO string 125 1 #···································································································
DEPT_NO string 130 2 ###·································································································
DEPT_NO string 140 2 ###·································································································
DEPT_NO string 180 2 ###·································································································
DEPT_NO string 600 2 ###·································································································
DEPT_NO string 621 4 #######·····························································································
DEPT_NO string 622 3 #####·······························································································
DEPT_NO string 623 5 #########···························································································
DEPT_NO string 670 2 ###·································································································
DEPT_NO string 671 3 #####·······························································································
DEPT_NO string 672 2 ###·································································································
DEPT_NO string 900 2 ###·································································································
JOB_CODE string Admin 4 #######·····························································································
JOB_CODE string CEO 1 #···································································································
JOB_CODE string CFO 1 #···································································································
JOB_CODE string Dir 1 #···································································································
JOB_CODE string Doc 1 #···································································································
JOB_CODE string Eng 15 #############################·······································································
JOB_CODE string Finan 1 #···································································································
JOB_CODE string Inside Sales Coordinator 1 #···································································································
JOB_CODE string Mktg 1 #···································································································
JOB_CODE string Mngr 4 #######·····························································································
JOB_CODE string PRel 1 #···································································································
JOB_CODE string SRep 9 #################···················································································
JOB_CODE string Sales 2 ###·································································································
JOB_CODE string Sales Representative 6 ###########·························································································
JOB_CODE string VP 2 ###·································································································
JOB_CODE string Vice President 1 #···································································································
JOB_COUNTRY string Canada 2 ###·································································································
JOB_COUNTRY string England 3 #####·······························································································
JOB_COUNTRY string France 1 #···································································································
JOB_COUNTRY string Italy 1 #···································································································
JOB_COUNTRY string Japan 2 ###·································································································
JOB_COUNTRY string Sales 1 #···································································································
JOB_COUNTRY string Switzerland 1 #···································································································
JOB_COUNTRY string UK 4 #######·····························································································
JOB_COUNTRY string USA 36 ######################################################################······························
JOB_GRADE number 0.0 1 #···································································································
JOB_GRADE number 1.0 2 ###·································································································
JOB_GRADE number 2.0 8 ###############·····················································································
JOB_GRADE number 3.0 14 ###########################·········································································
JOB_GRADE number 4.0 16 ###############################·····································································
JOB_GRADE number 5.0 10 ###################·················································································