Data quality usually goes hand in hand with data integration. The new version CloverETL 3.1 has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mail addresses, AddressDoctor can also be used for transliteration of non-Latin writing systems into Latin characters or enriching addresses with latitude and longitude information.
CloverETL integrates AddressDoctor software through a dedicated AddressDoctor component. In all cases you need to have java library AddressDoctor5.jar, native libraries (they all need to be on the class-path when running a graph) and country databases. You also need unlock codes for your databases.
AddressDoctor component has 4 required parameters:
- Configuration is driven by configXml or configFileparameter. The simplest configuration file can look as follows:
<?xml version="1.0" encoding="UTF-8"?> <SetConfig> <General MaxAddressObjectCount="10"/> <UnlockCode><Your unlock code here></UnlockCode> <DataBase CountryISO3="ALL" Path="/home/user/AD/db"/> </SetConfig> - Parameters can be set by parameterXml or parameterFileattribute. The simplest parameter file can look as follows:
<?xml version="1.0" encoding="UTF-8"?> <Parameters> <Process Mode="PARSE"/> <Input/> <Result/> </Parameters> - Input mapping defines mapping between Clover input fields and AddressDoctor address properties.
- Output mapping defines mapping between AddressDoctor output address properties and your output.
Transliteration example
Imagine that you have addresses all over the world saved in their original languages. For some languages, you would not even recognize them– to say nothing of their correct reading. Let’s try with following addresses:
Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION John Smith;100 Main Street;New York NY 10023;USB
To transliterate above addresses to Ascii, we need to set following parameters:
- Mode to PARSE to transliterate without processing the address, Optimization level to STANDARD as it produces the best results. Both in Process tab of Parameters attribute:

- Encoding to UTF-16 in Input tab of Parametersattribute:

- Preferred language to ENGLISH and Preferred script to ASCII_SIMPLIFIED in Result tab of Parameters attribute:

Now we need to feed the Addressdoctor with input mapping. Input metadata:
Corresponds with Contact name, Street complete, Locality complete and Country name address properties:
Similar mapping needs to be provided for output properties:
We can also connect the Error port for invalid (unrecognised) addresses:
Now we can run our graph getting following results:
Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY Michal Dabrowski;Marszalkowska;Warsaw;POLAND Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION
But one address is missing. Look at the error port:
N1 means Validation Error: No validation performed because country was not recognized
The output can be even improved by setting ASCII_EXTENDED as Preferred script in Result tab of Parameters attribute:
Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY Michal Dabrowski;Marszalkowska;Warsaw;POLAND Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION
Now the transcription corresponds with pronunciation.
Enrichment example
If you have proper database, you can enrich addresses with data you don’t know, e.g. ZIP code, geocoding, certification status.
Let’s consider following addresses:
To validate an address according to the Canada Post SERP rules we would need certified database for Canada. We need to set in Configuration as well as in Parameters:
After AddressDoctor transformation we get:
| Count | ResultNumber | ElementItem | Country | Province | Locality | PostalCode | Street | Number | SERPStatus | SERPCategory | ResultProcessStatus | ResultModeUsed | ResultPreferredScript | ResultPreferredLanguage | ResultDataMailabilityScore | ResultDataElementResultStatus | ResultDataElementInputStatus | ResultDataElementRelevance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | CANADA | ON | TORONTO | M4P 3J5 | YONGE STREET | 2384 | ESE1 | C | C3 | CERTIFIED | ASCII_SIMPLIFIED | DATABASE | 3 | 80F0F0F0F000004000E0 | 00606060600000200060 | 10101010100000100010 |
| 1 | 1 | 1 | CANADA | SK | LA RONGE | FINLAYSON ST | 108 | ESE1 | N | I1 | CERTIFIED | ASCII_SIMPLIFIED | DATABASE | 0 | 000000000000000000E0 | 00601010100000000060 | 00000000000000000000 | |
| 1 | 1 | 1 | CANADA | SK | PRINCE ALBERT | S6V 0C7 | 15TH ST E | 801 | ESE1 | C | C4 | CERTIFIED | ASCII_SIMPLIFIED | DATABASE | 4 | 80F0F0F0F00000F000E0 | 00606060600000600060 | 10101010100000100010 |
Note the fields:
- PostalCode – for recognized addresses we get valid postal code
- SERPCategory :
- C: Corrected
- N: Incorrect
- ResultProcessStatus and ResultDataMailabilityScore:
- C3 (3) : Corrected – but some elements could not be checked (should be fine)
- I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)
- C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)












