Tag Archives: data cleansing

Address Cleansing and Transliteration with CloverETL and AddressDoctor

Data quality usually goes hand in hand with data integration. The new version CloverETL 3.1 has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mail addresses, AddressDoctor can also be used for transliteration of non-Latin writing systems into Latin characters or enriching addresses with latitude and longitude information.

CloverETL integrates AddressDoctor software through a dedicated AddressDoctor component. In all cases you need to have java library AddressDoctor5.jar, native libraries (they all need to be on the class-path when running a graph) and country databases. You also need unlock codes for your databases.

AddressDoctor component has 4 required parameters:

  • Configuration is driven by configXml or configFileparameter.  The simplest configuration file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <SetConfig>
             <General MaxAddressObjectCount="10"/>
             <UnlockCode><Your unlock code here></UnlockCode>
             <DataBase CountryISO3="ALL" Path="/home/user/AD/db"/>
           </SetConfig>
  • Parameters can be set by parameterXml or parameterFileattribute. The simplest parameter file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <Parameters>
             <Process Mode="PARSE"/>
             <Input/>
             <Result/>
           </Parameters>
  • Input mapping defines mapping between Clover input fields and AddressDoctor address properties.
  • Output mapping defines mapping between AddressDoctor output address properties and your output.

Transliteration example

Imagine that you have addresses all over the world saved in their original languages. For some languages, you would not even recognize them– to say nothing of their correct reading. Let’s try with following addresses:

Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic
Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland
Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska
Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION
John Smith;100 Main Street;New York NY 10023;USB

To transliterate above addresses to Ascii, we need to set following parameters:

  • Mode to PARSE to transliterate without processing the address, Optimization level to STANDARD as it produces the best results. Both in Process tab of Parameters attribute:
  • Encoding to UTF-16 in Input tab of Parametersattribute:
  • Preferred language to ENGLISH and Preferred script to ASCII_SIMPLIFIED in Result tab of Parameters attribute:

Now we need to feed the Addressdoctor with input mapping. Input metadata:

Corresponds with Contact name, Street complete, Locality complete and Country name address properties:

–>

Similar mapping needs to be provided for output properties:

–>

We can also connect the Error port for invalid (unrecognised) addresses:

Now we can run our graph getting following results:

Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

But one address is missing. Look at the error port:

N1 means Validation Error: No validation performed because country was not recognized

The output can be even improved by setting ASCII_EXTENDED as Preferred script in Result tab of Parameters attribute:

Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

Now the transcription corresponds with pronunciation.

Enrichment example

If you have proper database, you can enrich addresses with data you don’t know, e.g. ZIP code, geocoding, certification status.

Let’s consider following addresses:

To validate an address according to the Canada Post SERP rules we would need certified database for Canada. We need to set in Configuration as well as in Parameters:

and

After AddressDoctor transformation we get:

Count ResultNumber ElementItem Country Province Locality PostalCode Street Number SERPStatus SERPCategory ResultProcessStatus ResultModeUsed ResultPreferredScript ResultPreferredLanguage ResultDataMailabilityScore ResultDataElementResultStatus ResultDataElementInputStatus ResultDataElementRelevance
1 1 1 CANADA ON TORONTO M4P 3J5 YONGE STREET 2384 ESE1 C C3 CERTIFIED ASCII_SIMPLIFIED DATABASE 3 80F0F0F0F000004000E0 00606060600000200060 10101010100000100010
1 1 1 CANADA SK LA RONGE FINLAYSON ST 108 ESE1 N I1 CERTIFIED ASCII_SIMPLIFIED DATABASE 0 000000000000000000E0 00601010100000000060 00000000000000000000
1 1 1 CANADA SK PRINCE ALBERT S6V 0C7 15TH ST E 801 ESE1 C C4 CERTIFIED ASCII_SIMPLIFIED DATABASE 4 80F0F0F0F00000F000E0 00606060600000600060 10101010100000100010

Note the fields:

  • PostalCode – for recognized addresses we get valid postal code
  • SERPCategory :
    • C: Corrected
    • N: Incorrect
  • ResultProcessStatus and ResultDataMailabilityScore:
    • C3 (3) : Corrected – but some elements could not be checked (should be fine)
    • I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)
    • C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)

Download the transformation graph with data

Spell Checking for Better Data Quality – AspellLookup Table

AspellLookupTable is a commercial lookup table which has been around since CloverETL 2.6. Because Aspell is a free software spell checker, you might  be wondering what it is used for in CloverETL. In fact, AspellLookupTable does not perform any spell checking at all, it “just” allows you to lookup data records with keys similar to the one you provide. This may be useful e.g. when looking for a street whose name is misspelled to a certain extent.

Similarity of String Keys

The similarity of two string keys is measured in terms of edit operations used to “transform” a misspelled key to the correct one. At the moment, CloverETL supports five edit operations: change of character case, character transposition, deletion of a character, insertion of a character, and finally character replacement. Each edit operation is assigned an edit cost which reflects its “seriousness”. By default, edit costs are set to 10, 90, 95, 95, and 100, respectively. As you can see, the difference in character case is not considered as serious as the character transposition. However, it is possible to tweak the edit costs as you wish.

If  you add all edit costs of the minimum number of edit operations required to “transform” the misspelled key to the correct one, you get an edit distance of the two keys. In other words, you get the similarity measure of the keys. However, the edit distance itself is just a number which will vary for different keys. Therefore you need to somehow determine which data records should be included in the result set and which should be thrown away. And that’s why you also need to specify a spelling threshold. This threshold is 230 by default but may be changed to best suit your needs.

An example is worth a thousand words, so let’s assume that one of the keys is "Hello". If you use "helo" for lookup, the edit distance will be 105 because you had to change the case of 'h' (cost = 10) and to insert another 'l' (cost = 95). If you used "Halllo" instead, the edit distance would be 195 ('a' was replaced by 'e' and one 'l' had to be removed). Both these queries would return a non-empty result set as their edit distance is below the spelling threshold. On the other hand, "Bye" as a lookup key would result in an empty result set as its edit distance is way above the threshold.

AspellLookupTable in Action

AspellLookupTable is a great candidate for address cleansing. Imagine you have a huge list of incorrectly spelled street names and you need to determine whether these streets actually exist in a certain town or city. You also have a directory of correctly spelled street names with several additional information, e.g. the town name, its region, etc. In the end, you want a list of all the streets, and corresponding towns, that do exist. Does it seem difficult? No, it’s a piece of cake with CloverETL:

First you need to set up your AspellLookupTable instance to load the street directory. Then you just load the invalid addresses and “pair” them with existing streets using the lookup table. When you have all the streets joined with whatever data you need, you may perform some further processing and then store the result in a flat file. Pretty simple, huh? :-)

Download the transformation graph with data

Data Quality at a Glance Conference

Javlin a.s., producer of CloverETL, took part in a Data Quality at a Glance Conference held on April 20th, at PriceWaterhouseCoopers’s premises in Prague. This conference was organized by IDG and Javlin served there as the professional supervisor partner.

Javlin together with other conference partners PriceWaterhouseCoopers, SAS and Ataccama each held presentations on Data Quality topics.

The first presentation was held by Mr. Snytr. It discussed Personal data and its quality from the view of the Office for the Protection of Personal Data.

Mr. Maly, senior manager at PriceWaterhouseCoopers talked in his presentation about Optimizing business processes. According to him, data quality is closely related to quality processes. Poor data has an impact on strategic decision-making and can cause a loss of business opportunity and/or profit. Therefore, it is important to find the source of data errors and set the right data processes. Mr. Maly emphasized that it is essential to cleanse data continuously.

Mr. Kyjonka from SAS held a presentation on One version of truth for all or MDM in business life. He highlighted importance of Data integration, Data Warehousing and MDM for getting the right data on time. He showed 3 important parts of MDM – System of Record, MDM Hub and Integration infrastructure. Further it was shown that 4 different types of MDM solutions can be used for various purposes – Registry style, Transaction style, Hybrid style, and Consolidation style (ETL). Choosing the right style of MDM solution depends on budget and how much time the company has. Quality cleansed data, technical infrastructure, Data Governance program, willingness to share data, etc are some of the important factors for MDM.

Mr. Matous, Javlin’s consultant had a presentation on Data Cleansing. According to him data quality is the process of detection, reporting and correction of the invalid or missing values in data. Mr. Matous made it clear how important it is to do data audits and why to use a data quality scorecard. A data quality scorecard tracks the financial impact of poor data and estimates the return on investment into Data Quality activities. It helps managers determine whether or not to invest into data cleansing tools. Mr. Matous talked about several Data Quality benefits in business. Some of them include increased efficiency of marketing campaigns, early warning system, increased credibility and reputation among customers. It was also shown how data cleansing could be done using CloverETL.

Mr. Vojtek, VP of Engineering at Javlin, discussed a specific case study where Javlin had undertaken data quality improvements. He emphasized how data quality is essential for quality business results. It was shown what kinds of pitfalls could be experienced in a data quality implementation in multinational companies.

The last presentation of the conference was hold by Mr. Kyjonka from SAS. He discussed a case study named The clever way to cleanse data. He argues that when companies think they have only 10% of poor or bad data, they usually have about 40% bad data. In his particular case 90% of the data had to be corrected during the data cleansing process.

Live from the Conference – @CloverETL on Twitter. Follow the tag #DQPWC.