Before you start to develop any data transformation you should explore your data (make data profiling). There are a lot of tools on the market that can help you. But why to install and learn another software when you can use the tool you are familiar with? CloverETL is mainly data transformation tool but it can be easily used for data profiling as well (as I will show you in this blog post).
It is very easy to do data statistic with latest version of CloverETL. You can find DataProfiling project in CloverETL Examples Projects. The project consists of two graphs: BasicStatistic and AdvancedStatistic.
The first one finds basic statistic for input data file:
- minimum value for numeric fields or minimum length of data for string and byte fields
- maximum value for numeric fields or maximum length of data for string and byte fields
- average value for numeric fields or average length of data for string and byte fields
- number of records in data file
- number of not null values for each data field
- number of null values for each data field
Additionally, for string data fields, it finds:
- first not null value
- if all values are ASCII
The second one calculates for each data field:
- number of records in data file
- number of not null values
- number of unique values
- minimum value
- maximum value
- average value for numeric fields
- median value
- modus value
It also finds frequency counts for fields with not many (the threshold is defined by a parameter HISTOGRAM_THRESHOLD) unique values.
BasicStatistic graph
The graphs in the project are prepared to analyze data from the excel file ORDERS.xls placed in data-in directory. But for purpose of this post we will analyze data stored in a flat file employees.list.dat (also placed in data-in directory).
To do that we need to set following parameters:
input_file=${DATAIN_DIR}/employees.list.dat
metadata=${META_DIR}/employees.fmt
READER_TYPE=DATA_READER
Metadata file (employees.fmt) has to contain metadata for employees.list.dat:
<?xml version="1.0" encoding="UTF-8"?>
<Record name="EMPLOYEE" recordDelimiter="n" recordSize="-1" type="delimited">
<Field delimiter="," format="#" name="EMP_NO" nullable="true" shift="0" type="integer"/>
<Field delimiter="," name="FIRST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="LAST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="PHONE_EXT" nullable="true" shift="0" type="string"/>
<Field delimiter="," format="dd/MM/yyyy" name="HIRE_DATE" nullable="true" shift="0" type="date"/>
<Field delimiter="," name="DEPT_NO" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_CODE" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_GRADE" nullable="true" shift="0" type="numeric"/>
<Field delimiter="," name="JOB_COUNTRY" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="SALARY" nullable="true" shift="0" type="numeric"/>
<Field name="FULL_NAME" nullable="true" shift="0" type="string"/>
</Record>
Lets see the graph with mid-results and output:

DataReader parses data from the input file. Normalizer creates record that consists of three basic fields: original field name, field type and “normalized” value, which is original value for numeric fields, time in milliseconds for date fields and data length for string or byte fields. Moreover this component collects basic information about string data e.g.: finds first not null value, checks if string contains only ASCII characters and if it can be converted to number. The next component (Rollup – Statistic) calculates minimum, maximum and average value for each group of records (with the same field name). It also propagates first not null value, checks if all isAscii and isNumber fields are not false and sets result value for the whole group. Fourth and fifth component has only “cosmetic” aim – they convert times in milliseconds back to user friendly form and sort output records. The writer converts results to report.
By default the final report is a plain html file:
Data statistic for data-in/employees.list.dat
| Field name |
Field type |
min |
max |
average |
count |
count not null |
count null |
first not null |
is Ascii |
is number |
| DEPT_NO |
string |
3.0 |
3.0 |
3.0 |
51 |
43 |
8 |
600 |
true |
true |
| EMP_NO |
integer |
0.0 |
145.0 |
58.392156862745104 |
51 |
51 |
0 |
|
true |
true |
| FIRST_NAME |
string |
3.0 |
11.0 |
5.549019607843137 |
51 |
51 |
0 |
Robert |
true |
false |
| FULL_NAME |
string |
10.0 |
22.0 |
14.549019607843137 |
51 |
51 |
0 |
Nelson, Robert |
true |
false |
| HIRE_DATE |
date |
28/12/1988 00:00:00 |
15/11/1994 00:00:00 |
|
51 |
51 |
0 |
|
|
|
| JOB_CODE |
string |
2.0 |
24.0 |
6.1568627450980395 |
51 |
51 |
0 |
VP |
true |
false |
| JOB_COUNTRY |
string |
2.0 |
11.0 |
3.6470588235294117 |
51 |
51 |
0 |
USA |
true |
false |
| JOB_GRADE |
number |
0.0 |
5.0 |
3.4117647058823524 |
51 |
51 |
0 |
|
true |
true |
| LAST_NAME |
string |
3.0 |
12.0 |
6.8431372549019605 |
51 |
51 |
0 |
Nelson |
true |
false |
| PHONE_EXT |
string |
1.0 |
5.0 |
2.8627450980392157 |
51 |
51 |
0 |
250 |
true |
false |
| SALARY |
number |
0.0 |
9.9E7 |
2267125.137254902 |
51 |
51 |
0 |
|
true |
true |
but it can be easily changed to excel file (just adjust graph parameter WRITER_TYPE=XLS_WRITER).
AdvancedStatistic graph

The phase 0 of AdvancedStatistic graph works similarly as graph BasicStatistic, but it uses Aggregators for statistic calculations instead of Rollup. Be particular about component called Simplification in phase 0 of the graph: it stores number of records in file and names of fields with number of unique values under threshold in dictionary (marked by red ellipses on the picture above). Then Histogram filter component can read this field’s names and skip the records that aren’t between fields for frequency calculations (green eclipses on the picture). Phase 1 Aggregators count frequencies for fields that were filtered out by previous component.
Resulting file looks as follows:
Advanced data statistic and histograms for ./data-in/employees.list.dat
Statistics
| Field name |
Field type |
min |
max |
average number |
count |
count not null |
count unique |
median |
modus |
| DEPT_NO |
string |
000 |
900 |
|
51 |
43 |
20 |
600 |
623 |
| EMP_NO |
integer |
0.0 |
145.0 |
58.3921568627451 |
51 |
51 |
47 |
45.0 |
2.0 |
| FIRST_NAME |
string |
Andrew |
Yuki |
|
51 |
51 |
44 |
Mark |
Robert |
| FULL_NAME |
string |
Baldwin, Janet |
Young, Katherine |
|
51 |
51 |
50 |
Lee, Terri |
Sutherland, Claudia |
| HIRE_DATE |
date |
28/12/1988 00:00:00 |
15/11/1994 00:00:00 |
|
51 |
51 |
44 |
20/04/1992 00:00:00 |
02/01/1994 00:00:00 |
| JOB_CODE |
string |
Admin |
Vice President |
|
51 |
51 |
16 |
Mktg |
Eng |
| JOB_COUNTRY |
string |
Canada |
USA |
|
51 |
51 |
9 |
USA |
USA |
| JOB_GRADE |
number |
0.0 |
5.0 |
3.411764705882353 |
51 |
51 |
6 |
4.0 |
4.0 |
| LAST_NAME |
string |
Baldwin |
Young |
|
51 |
51 |
48 |
Lambert |
Johnson |
| PHONE_EXT |
string |
1 |
null |
|
51 |
51 |
46 |
3355 |
null |
| SALARY |
number |
0.0 |
9.9E7 |
2267125.137254902 |
51 |
51 |
40 |
61637.8125 |
0.0 |
Histograms
| Field name |
Field type |
value |
count |
count % |
| DEPT_NO |
string |
|
8 |
###############····················································································· |
| DEPT_NO |
string |
000 |
2 |
###································································································· |
| DEPT_NO |
string |
100 |
2 |
###································································································· |
| DEPT_NO |
string |
110 |
2 |
###································································································· |
| DEPT_NO |
string |
115 |
2 |
###································································································· |
| DEPT_NO |
string |
120 |
3 |
#####······························································································· |
| DEPT_NO |
string |
121 |
1 |
#··································································································· |
| DEPT_NO |
string |
123 |
1 |
#··································································································· |
| DEPT_NO |
string |
125 |
1 |
#··································································································· |
| DEPT_NO |
string |
130 |
2 |
###································································································· |
| DEPT_NO |
string |
140 |
2 |
###································································································· |
| DEPT_NO |
string |
180 |
2 |
###································································································· |
| DEPT_NO |
string |
600 |
2 |
###································································································· |
| DEPT_NO |
string |
621 |
4 |
#######····························································································· |
| DEPT_NO |
string |
622 |
3 |
#####······························································································· |
| DEPT_NO |
string |
623 |
5 |
#########··························································································· |
| DEPT_NO |
string |
670 |
2 |
###································································································· |
| DEPT_NO |
string |
671 |
3 |
#####······························································································· |
| DEPT_NO |
string |
672 |
2 |
###································································································· |
| DEPT_NO |
string |
900 |
2 |
###································································································· |
| JOB_CODE |
string |
Admin |
4 |
#######····························································································· |
| JOB_CODE |
string |
CEO |
1 |
#··································································································· |
| JOB_CODE |
string |
CFO |
1 |
#··································································································· |
| JOB_CODE |
string |
Dir |
1 |
#··································································································· |
| JOB_CODE |
string |
Doc |
1 |
#··································································································· |
| JOB_CODE |
string |
Eng |
15 |
#############################······································································· |
| JOB_CODE |
string |
Finan |
1 |
#··································································································· |
| JOB_CODE |
string |
Inside Sales Coordinator |
1 |
#··································································································· |
| JOB_CODE |
string |
Mktg |
1 |
#··································································································· |
| JOB_CODE |
string |
Mngr |
4 |
#######····························································································· |
| JOB_CODE |
string |
PRel |
1 |
#··································································································· |
| JOB_CODE |
string |
SRep |
9 |
#################··················································································· |
| JOB_CODE |
string |
Sales |
2 |
###································································································· |
| JOB_CODE |
string |
Sales Representative |
6 |
###########························································································· |
| JOB_CODE |
string |
VP |
2 |
###································································································· |
| JOB_CODE |
string |
Vice President |
1 |
#··································································································· |
| JOB_COUNTRY |
string |
Canada |
2 |
###································································································· |
| JOB_COUNTRY |
string |
England |
3 |
#####······························································································· |
| JOB_COUNTRY |
string |
France |
1 |
#··································································································· |
| JOB_COUNTRY |
string |
Italy |
1 |
#··································································································· |
| JOB_COUNTRY |
string |
Japan |
2 |
###································································································· |
| JOB_COUNTRY |
string |
Sales |
1 |
#··································································································· |
| JOB_COUNTRY |
string |
Switzerland |
1 |
#··································································································· |
| JOB_COUNTRY |
string |
UK |
4 |
#######····························································································· |
| JOB_COUNTRY |
string |
USA |
36 |
######################################################################······························ |
| JOB_GRADE |
number |
0.0 |
1 |
#··································································································· |
| JOB_GRADE |
number |
1.0 |
2 |
###································································································· |
| JOB_GRADE |
number |
2.0 |
8 |
###############····················································································· |
| JOB_GRADE |
number |
3.0 |
14 |
###########################········································································· |
| JOB_GRADE |
number |
4.0 |
16 |
###############################····································································· |
| JOB_GRADE |
number |
5.0 |
10 |
###################················································································· |