Tag Archives: data profiling

Performance Optimization of Metrics in CloverETL Data Profiler

The first beta version CloverETL Data Profiler was released in October, and since then we have been working on improvements for the second beta version, which was released at the end of last year. Besides bug fixing and adding a few new features, we also worked on performance optimization of profiling metrics. This article will describe this improvement and how profiling is interconnected with CloverETL Engine.

The CloverETL Data Profiler processes input data as a stream. All metrics read input values as they are obtained from the source (CSV file, Excel sheet, or database table) and, at the end of a reading, metrics return their results and these results are then stored into the results database.

For most of the metrics (minimum value, maximum value) this approach works just fine. However, certain metrics cannot work like this – not only do they have higher computation-time requirements, but they also require all the input data to be kept in memory. For large data sets this makes using the operating system memory inappropriate. Therefore, external memory needs to be used.

In the first beta version we used Profiler’s internal SQL database to store all the values for these memory-consuming metrics. The data were first inserted into the SQL database and then a database query was used to calculate the result of the metric. This allowed for profiling large data sets– larger than the amount of available memory.

However, there was a large overhead caused by inserting the data into the database; also, the final result query computation consumed lots of system resources. We were not happy with the performance and architecture of this solution, so we decided to redesign it and use the powerful CloverETL Engine to get the job done.

We exploit the fact that the memory-consuming metrics can still be computed on a stream of data, if the incoming data are sorted. In the improved version we use the CloverETL Engine to first sort the values using the ExtSort components, and then we analyze the sorted data as a stream. In this approach, no other external facilities (such as SQL database) are used during profiling.

The overall performance of CloverETL Data Profiler has improved, especially for large data sets. Even with full set of metrics enabled, we are now able to analyze 4 GB of data with 30 fields in 30 minutes. Also, memory consumption has improved significantly.

Finally, in the Profiler GUI, we have marked the metrics that require sorting, and therefore have longer computation time, with a small clock icon. These metrics are not enabled by default.

The following picture shows in detail the different phases of metrics calculation. First, we calculate the metrics that can work on unsorted streams of data. In the following phases, for each field in its own separate phase, we run an ExtSort component and connect it to a component that calculates the metrics that expect sorted data. We use Rollup with custom transform Java code to calculate the metrics. Rollup allows for producing variable amount of output records for any amount of incoming records.

Another performance improvement in second beta version of CloverETL Data Profiler also affects the metrics that do not require the input data to be sorted. The profiler will now make better use of available CPUs and there will be less CPU time consumed on context-switching. This results in a boost up to 15% for data with a high amount of fields. Also, simpler structure of the internal CloverETL graph results in a significantly lower memory footprint.

In summary, in the second beta version of CloverETL Data Profiler, we have improved both performance and memory consumption by fully exploiting the capabilities of CloverETL Engine.

CloverETL Data Profiler: Under the Hood of Data Profiling Application

Recently we’ve released the second public beta release of a new member of our CloverETL family – CloverETL Data Profiler. With this article we’d like to share an overview of the technical architecture of CloverETL Profiler – the why and how of the design.

The CloverETL Data Profiler is a data profiling application, i.e. it provides the users with various information about their data, such as average, number of empty values , histogram-like charts etc. When designing the application we had several goals in mind:

  • Performance – Data of interest can be huge. The profiling of them needs to be reasonably quick so that the user does not need to wait for results for too long – we want profiling to be an interactive process, not a batch job.
  • CloverETL Integration – Data profiling is often just one part of a more complex workflow – for example, it can be the starting point of a data integration project.  We want to make the transition from data profiling to the following steps in data integration with CloverETL as smooth as possible.
  • Web presentation – An important group of users of data profiling tools are people interested in just viewing the profiling results; they don’t need to design or execute data profiling jobs. Presenting the profiling results on web gives such users a very simple way to access the results.

CloverETL Profiler Architecture

Let’s have a look at the main building blocks of CloverETL Data Profiler:

  • CloverETL Data Profiler GUI – This is the graphical tool where users create and execute profiling jobs. We’ll talk about the jobs later in more detail, but basically they describe a data source to profile (e.g. file, database etc.) and how to profile it using metrics (e.g. calculate maximum value on a database column). The GUI is based on the Eclipse RCP platform and reuses components of CloverETL Designer – this is the obvious way for the reuse of GUIs for concepts common with CloverETL (e.g. metadata, database connections, data analytics etc.) and it paves way for future deep integration.
  • CloverETL Data Profiler Engine – Profiling jobs created and edited in the CloverETL Profiler GUI are stored as ordinary XML files, similarly to CloverETL graphs. The profiler engine takes these jobs, analyzes them, and automatically generates and runs a CloverETL graph. The CloverETL graph performs the actual profiling – so the CloverETL Data Profiler Engine is a relatively thin wrapper on top of CloverETL Engine.
  •  CloverETL Engine – As described above, the real work of data profiling is performed by CloverETL Engine that runs a specially designed transformation graph. Because of this, the profiler gains access to the wide feature set and great performance provided by the CloverETL Engine.
  • Result Storage – CloverETL graphs performing the profiling store the results in a Result Storage database. The database can be an embedded Derby database for single user scenarios or a shared external database for multiple users. The shared external database can be used in a scenario where multiple users create and execute profiling jobs and at the same time, need to share their results. We support a range of databases for the storage of results – Oracle, MySQL, PostgreSQL and MSSQL.
  • Reporting Server – As observing profiling results is done via a web interface, we have a Reporting Server that both serves the web content and provides it with the profiling results stored in ResultsStorage. The Reporting Server isolates the web interface from details of the Result Storage by providing the profiling results as JSON via a REST interface. We kept the Reporting Server intentionally simple – more complex server-side functionality such as scheduling will be reused by integration with CloverETL Server.
  • Reporting Console – This is the tool for observing the profiling results – values of metrics, charts of the data, troubleshooting details about rejected records etc. The Reporting Console is an AJAX web application based on HTML, JavaScript and related standard technologies. Because it’s a web application, it’s very easy for anyone to view the results – there’s no need to install a special application; you only need a reasonably modern web browser.

All parts of CloverETL  DataProfiler are currently bundled in one simple-to-use package as a standalone application – just start the CloverETL Data Profiler and it automatically launches an embedded Result Storage, Reporting Server etc. However we’ve laid the foundations for separating the building blocks for bigger deployment scenarios and better integration with the rest of CloverETL family.

Data Profiling Process

Now let’s have a look on the nitty-gritty details of actually profiling data. The basic premise is that we already have all the tools needed for profiling in CloverETL – a transformation graph has all the expressive power needed for profiling of data:

  • Reading of a data source – CloverETL has a wide range of reader components and connectors to 3rd party systems.
  • Calculating metrics – Transformations in CloverETL are extensible by Java code or by scripts in our proprietary scripting language CTL.
  • Storing results – This is just writing to a database which is one of the basic use cases of CloverETL.

So anyone could manually create a CloverETL graph that profiles data. But it’s quite a complex task which would take the user’s focus away from the core: what data source does he want to profile and which metrics does he want to use. In the CloverETL Data Profiler, the user describes this core information in a profiling job, and then the CloverETL Profiler Engine transforms it into a CloverETL transformation graph. The profiling job is defined in a relatively simple XML file that can be edited in a graphical editor. The job primarily contains the following information:

  • Data source – Description of data source to profile – path to a file, definition of a database connection etc.
  • Metadata – Description of the data source structure by CloverETL metadata
  • Metrics – Multiple metrics can be enabled on each field of the metadata, e.g. minimum value, longest string, median etc.

The above picture demonstrates the process of running a profiling job:

  • The profiling job is stored in an XML file that contains all required information. The job is created and edited in the CloverETL Profiler GUI.
  • The job XML file is taken by the CloverETL Profiler Engine which creates a CloverETL transformation graph and runs it.
  • The picture of the transformation graph contains an actual profiling graph which reads the data source, stores the results and, most importantly, calculates metrics in parallel branches. Details of the branching depend on the kind of metrics used: some can be computed on-the-fly on the original data, while some require all data to be sorted (e.g. median). The number of branches depends on the selection of metrics and the number of available CPU cores – we optimize the graph for high performance.

This article is a brief introduction into the architecture of CloverETL Data Profiler. As you can see, we’ve saved a lot of effort by using the power of CloverETL at the core. The CloverETL Data Profiler can be also seen as a successful example of embedding CloverETL.

A Second Wind: New Features Strengthen CloverETL Data Profiler

In Data Profiling with CloverETL Data Profiler beta, we discussed the value of the new product that will soon enrich our product portfolio – CloverETL Data Profiler. The beta testing has been up and running, and so far we have received a lot of valuable feedback from the first beta build that allows us to further enhance the features and usability of CloverETL Data Profiler. Let’s see what the new build offers.

Integration with CloverETL Designer

The new build of the CloverETL Data Profiler introduces a key feature – the first step toward integration with the CloverETL suite. If you are interested in taking part in the development and feedback process of CloverETL products or the beta testing, you can register and get the new build.

The CloverETL Data Profiler has several important use cases, one being its integral role in analysis in the early stages of your projects. The insight into data gained from CloverETL Profiler is valuable for project planning, illustrating the importance of clean data, while offering some direction in the design of your ETL process.

Aside from gathering information during the definition of the profiling job, with this build’s new feature, you are also preparing your resources to use later in the CloverETL Designer. This work includes: defining the database connection and metadata that describe the schemas of your data source. The profiler may assist you in this process. It is especially useful for data sources that don’t include schema as their integral part, such as CSV files.

Exporting Metadata and Transformation

The second beta build of CloverETL Data Profiler includes the option to export both database connection metadata and a sample CloverETL transformation graph containing the reader that accessed the data source you previously defined. These are ready to be used in CloverETL Designer. You can see official CloverETL Data Profiler documentation to learn more about this feature.

Other Improvements in the New Build

Here is a short summary of the most notable improvements in the second build:

  • export to CloverETL Designer graph
  • visual enhancement of CloverETL Data Profiler job
  • removed dependency on internal database for metrics calculation
    • this results in performance boost, 4 GB of data with 30 fields and all metrics enabled takes 30 minutes
  • metrics that might be time consuming are visually distinguished in profiler
  • other tweaks and changes that enhance the overall performance and stability

The Next Steps

This is our first step in providing integration of CloverETL Data Profiler with the rest of the CloverETL suite. The next stages will be:

  • full integration with CloverETL Designer that will allow you the capability to profile any data flow inside your CloverETL graph, and
  • integration with CloverETL Server, where you will be able to set up events based on conditions detected in these data flows inside CloverETL graphs running on the server.

Data Profiling with CloverETL Profiler beta

The process of data integration, data migration, consolidation and other data manipulation projects consists of a variety of steps and tasks. Javlin supports many critical tasks within these projects with a versatile ETL tool that provides technical solutions to transform data and connect different systems and data sources with various data formats.

However, there are additional aspects in order to achieve success with data processing projects. One important project phase to ensure future success is the analysis phase. At Javlin, we are now on the verge of introducing new tool that will boost productivity in the early stages of a project. This product is our new data profiling tool – CloverETL Profiler.

CloverETL Profiler helps give you insight into your data. By creating a profile of your data using various metrics, you can get the big picture of the current condition of your data. This information has a large impact on the next stages of your project.

Real World Use – Data Warehousing Project

Imagine you are project manager with a complex task at hand – building a data warehouse. You have couple of data sources – database tables, CSV files, log files you want to analyze and probably some others. After you define the basic requirements for your data warehouse and its structure, you need to check the condition of your source data.

As you use the CloverETL Profiler, you will gain:

  • insight into the quality of your data indicating whether you need the perform a data cleansing phase in your project:
    • detect what data are missing (high percentage of missing or invalid values) looking for a different source
    • you may include data cleansing tools in the project budget and reserve tasks in your project plan
  • information about the structure of data in all your sources:
    • detect the need for ETL tools on specific data sources to transform your input data into a common format

Instead of writing your own tools and testing the data in complex way, you are just a few clicks away from creating detailed profile of your data.

Product Vision

The usage of the CloverETL Profiler tool for the project analysis phase is just the beginning. As part of our vision for the future, we are on the path to create a fully integrated product in CloverETL Designer and CloverETL Server family.

Its main purposes will be to aid:

  • CloverETL Designer: to assist the development of ETL transformation by allowing the creation of profiles of data in any stage of the transformation and checking the results during the debugging process
  • CloverETL Server: to constantly monitor the data condition on the production server  and invoke actions based on the specific condition detected in the profile of your data

Beta Program

Right now, we are excited to announce the beginning of our beta program for CloverETL Profiler. We invite you, people from the data integration community, to join the testing phase; it’s a great opportunity to not only have a look behind the scenes, but to also be a part of the development process by contributing your insights. For more information, please visit www.cloveretl.com/products/profiler.

Data sampling with CloverETL

Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How to create a representative data probe on the full data set? Obviously, the easiest way is to read just part of data from the beginning. But such data sample can be very unreliable.  I’ve prepared a few simple graphs that create a data probe which can be regarded as representative for the full data set.

All graphs were created based on the sampling methods described in the article Sampling (statistics).

Simple random sampling

In this method each record has the same probability of selection. Filtering is based on double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): record is selected if the drawn number is lower than required sample set size:

Systematic sampling

Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements in regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:

Sorting can be disabled in this graph. Then it is selected just every k-th element from the full data set, starting from a randomly selected record from the interval [1, k].

Stratified sampling

If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each stratum must be selected:

Probability proportional to size sampling

Probability for each record is set to be proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the selected field. For each group of records it it is used systematic sampling method:

Methods comparison

Simple random sampling method is the simplest and fastest. It is sufficient in most cases. Systematic sampling with disabled sorting is as fast as simple random sampling and produces also strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into the separated groups that have reasonable sizes. In other cases the data probe is a lot of bigger than requested.

Please see the attached CloverETL project with the above graphs. It also contains the graph for comparison of samples created with different sampling methods. I’ve done some tests for the file containing 5,000,000 rows with information about financial transactions. Each row contains unique transaction id, id of a customer, transaction amount and currency info. Total number of customers is 50,001; number of possible currencies is 35. I performed two sets of tests: one for the group defined by customer id and one defined by currency id.

Results for the sampling_field = CustomerId

Stratum is defined by id of customer. All data can be split to 50,001 groups with sizes from 61 to 143 transactions.

Following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 772 ms 0 h 1 m 34 s 965 ms 0 h 1 m 33 s 831 ms 0 h 1 m 30 s 973 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 4 71 0.0563 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 2 110 0.0181 1 110 0.0090 2 110 0.0181 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49937 5000000 0.0099 50000 5000000 0.0100 68172 5000000 0.0136 50011 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 741 ms 0 h 1 m 34 s 474 ms 0 h 1 m 32 s 628 ms 0 h 1 m 33 s 949 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 1 110 0.0090 1 110 0.0090 3 110 0.0272 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 2 83 0.0240 0 83 0.0000
10000 1 101 0.0099 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 2 99 0.0202 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 1 109 0.0091 1 109 0.0091 1 109 0.0091
10003 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 0 86 0.0000
total 49931 5000000 0.0099 50000 5000000 0.0100 68369 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 24 s 975 ms 0 h 1 m 37 s 446 ms 0 h 1 m 29 s 98 ms 0 h 1 m 32 s 857 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 1 110 0.0090 2 110 0.0181 2 110 0.0181
100 0 93 0.0000 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49983 5000000 0.0099 50000 5000000 0.0100 68258 5000000 0.0136 49900 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Results for the same test but with data sorting disabled in systematic sampling method:

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 168 ms 0 h 0 m 23 s 117 ms 0 h 1 m 35 s 414 ms 0 h 1 m 30 s 985 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 1 110 0.0090 2 110 0.0181 1 110 0.0090 1 110 0.0090
100 0 93 0.0000 0 93 0.0000 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 3 109 0.0275 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 2 86 0.0232 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 0 86 0.0000 2 86 0.0232 0 86 0.0000
total 50081 5000000 0.0100 50000 5000000 0.0100 68227 5000000 0.0136 49966 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 23 s 78 ms 0 h 0 m 19 s 178 ms 0 h 1 m 33 s 148 ms 0 h 1 m 29 s 261 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 0 94 0.0000 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 3 110 0.0272 1 110 0.0090 1 110 0.0090
100 3 93 0.0322 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 0 83 0.0000 2 83 0.0240 1 83 0.0120
10000 0 101 0.0000 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 3 99 0.0303 1 99 0.0101
10002 1 109 0.0091 0 109 0.0000 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 0 86 0.0000 1 86 0.0116 1 86 0.0116
10004 3 86 0.0348 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50056 5000000 0.0100 50000 5000000 0.0100 68528 5000000 0.0137 50033 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 244 ms 0 h 0 m 27 s 52 ms 0 h 1 m 35 s 49 ms 0 h 1 m 27 s 725 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 2 71 0.0281 0 71 0.0000
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 0 110 0.0000 2 110 0.0181 4 110 0.0363 1 110 0.0090
100 2 93 0.0215 2 93 0.0215 1 93 0.0107 1 93 0.0107
1000 2 83 0.0240 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 0 101 0.0000 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 4 99 0.0404 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 2 109 0.0183 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 0 86 0.0000
10004 0 86 0.0000 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50116 5000000 0.0100 50000 5000000 0.0100 68470 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Since the groups are really small, there should be selected none or one record from each group and for the smaller groups we should have more often zero selected records. In relation to this criteria the PPS sampling method and systematic sampling method with sorting data enabled give the best results. Data sample created with stratified method is always oversized.

Results for the sampling_field = CurrencyId

Stratum is defined by id of currency. All data can be split to 35 groups with very similar sizes from 142,042 to 143,572 transactions.

The following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CurrencyId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 21 s 681 ms 0 h 1 m 26 s 859 ms 0 h 1 m 25 s 970 ms 0 h 1 m 27 s 85 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1450 142623 0.0101 1427 142623 0.0100 1447 142623 0.0101 1426 142623 0.0099
1 1371 142925 0.0095 1429 142925 0.0099 1430 142925 0.0100 1429 142925 0.0099
10 1420 142897 0.0099 1429 142897 0.0100 1432 142897 0.0100 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1443 142896 0.0100 1429 142896 0.0100
12 1383 142522 0.0097 1425 142522 0.0099 1488 142522 0.0104 1425 142522 0.0099
13 1468 142461 0.0103 1425 142461 0.0100 1395 142461 0.0097 1424 142461 0.0099
14 1449 142997 0.0101 1430 142997 0.0100 1479 142997 0.0103 1430 142997 0.0100
15 1401 142697 0.0098 1426 142697 0.0099 1438 142697 0.0100 1427 142697 0.0100
16 1396 143137 0.0097 1432 143137 0.0100 1387 143137 0.0096 1431 143137 0.0099
17 1464 142517 0.0102 1425 142517 0.0099 1413 142517 0.0099 1425 142517 0.0099
total 49959 5000000 0.0099 50000 5000000 0.0100 50075 5000000 0.0100 49997 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 949 ms 0 h 1 m 25 s 726 ms 0 h 1 m 27 s 629 ms 0 h 1 m 24 s 537 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1449 142623 0.0101 1427 142623 0.0100 1496 142623 0.0104 1426 142623 0.0099
1 1468 142925 0.0102 1429 142925 0.0099 1442 142925 0.0100 1429 142925 0.0099
10 1436 142897 0.0100 1429 142897 0.0100 1406 142897 0.0098 1429 142897 0.0100
11 1436 142896 0.0100 1429 142896 0.0100 1402 142896 0.0098 1429 142896 0.0100
12 1410 142522 0.0098 1425 142522 0.0099 1454 142522 0.0102 1425 142522 0.0099
13 1438 142461 0.0100 1425 142461 0.0100 1414 142461 0.0099 1425 142461 0.0100
14 1420 142997 0.0099 1430 142997 0.0100 1450 142997 0.0101 1430 142997 0.0100
15 1412 142697 0.0098 1427 142697 0.0100 1400 142697 0.0098 1427 142697 0.0100
16 1453 143137 0.0101 1431 143137 0.0099 1442 143137 0.0100 1431 143137 0.0099
17 1431 142517 0.0100 1425 142517 0.0099 1372 142517 0.0096 1425 142517 0.0099
total 50163 5000000 0.0100 50000 5000000 0.0100 49709 5000000 0.0099 50000 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 27 s 716 ms 0 h 1 m 26 s 865 ms 0 h 1 m 26 s 657 ms 0 h 1 m 26 s 254 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1488 142623 0.0104 1426 142623 0.0099 1416 142623 0.0099 1426 142623 0.0099
1 1353 142925 0.0094 1429 142925 0.0099 1434 142925 0.0100 1429 142925 0.0099
10 1417 142897 0.0099 1429 142897 0.0100 1390 142897 0.0097 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1438 142896 0.0100 1429 142896 0.0100
12 1448 142522 0.0101 1425 142522 0.0099 1408 142522 0.0098 1425 142522 0.0099
13 1412 142461 0.0099 1425 142461 0.0100 1432 142461 0.0100 1424 142461 0.0099
14 1440 142997 0.0100 1430 142997 0.0100 1471 142997 0.0102 1430 142997 0.0100
15 1445 142697 0.0101 1427 142697 0.0100 1530 142697 0.0107 1427 142697 0.0100
16 1436 143137 0.0100 1431 143137 0.0099 1456 143137 0.0101 1432 143137 0.0100
17 1381 142517 0.0096 1425 142517 0.0099 1365 142517 0.0095 1426 142517 0.0100
total 50089 5000000 0.0100 50000 5000000 0.0100 49707 5000000 0.0099 49999 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling

With such large groups all the methods give very good results. Although no doubt we get the best results using the systematic sampling or PPS sampling methods where the sample size is always within the limits 0.0099 to 0.0100.

Download the transformation graph with data

Data Profiling with CloverETL

Before you start to develop any data transformation you should explore your data (make data profiling). There are a lot of tools on the market that can help you. But why to install and learn another software when you can use the tool you are familiar with? CloverETL is mainly data transformation tool but it can be easily used for data profiling as well (as I will show you in this blog post).

It is very easy to do data statistic with latest version of CloverETL. You can find DataProfiling project in CloverETL Examples Projects. The project consists of two graphs: BasicStatistic and AdvancedStatistic.

The first one finds basic statistic for input data file:

  • minimum value for numeric fields or minimum length of data for string and byte fields
  • maximum value for numeric fields or maximum length of data for string and byte fields
  • average value for numeric fields or average length of data for string and byte fields
  • number of records in data file
  • number of not null values for each data field
  • number of null values for each data field

Additionally, for string data fields, it finds:

  • first not null value
  • if all values are ASCII

The second one calculates for each data field:

  • number of records in data file
  • number of not null values
  • number of unique values
  • minimum value
  • maximum value
  • average value for numeric fields
  • median value
  • modus value

It also finds frequency counts for fields with not many (the threshold is defined by a parameter HISTOGRAM_THRESHOLD) unique values.

BasicStatistic graph

The graphs in the project are prepared to analyze data from the excel file ORDERS.xls placed in data-in directory. But for purpose of this post we will analyze data stored in a flat file employees.list.dat (also placed in data-in directory).

To do that we need to set following parameters:

input_file=${DATAIN_DIR}/employees.list.dat

metadata=${META_DIR}/employees.fmt

READER_TYPE=DATA_READER

Metadata file (employees.fmt) has to contain metadata for employees.list.dat:

<?xml version="1.0" encoding="UTF-8"?>
<Record name="EMPLOYEE" recordDelimiter="n" recordSize="-1" type="delimited">
<Field delimiter="," format="#" name="EMP_NO" nullable="true" shift="0" type="integer"/>
<Field delimiter="," name="FIRST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="LAST_NAME" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="PHONE_EXT" nullable="true" shift="0" type="string"/>
<Field delimiter="," format="dd/MM/yyyy" name="HIRE_DATE" nullable="true" shift="0" type="date"/>
<Field delimiter="," name="DEPT_NO" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_CODE" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="JOB_GRADE" nullable="true" shift="0" type="numeric"/>
<Field delimiter="," name="JOB_COUNTRY" nullable="true" shift="0" type="string"/>
<Field delimiter="," name="SALARY" nullable="true" shift="0" type="numeric"/>
<Field name="FULL_NAME" nullable="true" shift="0" type="string"/>
</Record>

Lets see the graph with mid-results and output:

DataReader parses data from the input file. Normalizer creates record that consists of three basic fields: original field name, field type and “normalized” value, which is original value for numeric fields,  time in milliseconds for date fields and data length for string or byte fields. Moreover this component collects basic information about string data e.g.: finds first not null value, checks if string contains only ASCII characters and if it can be converted to number. The next component (Rollup – Statistic) calculates minimum, maximum and average value for each group of records (with the same field name). It also propagates first not null value, checks if all isAscii and isNumber fields are not false and sets result value for the whole group. Fourth and fifth component has only “cosmetic” aim – they convert times in milliseconds back to user friendly form and sort output records. The writer converts results to report.

By default the final report is a plain html file:

Data statistic for data-in/employees.list.dat

Field name Field type min max average count count not null count null first not null is Ascii is number
DEPT_NO string 3.0 3.0 3.0 51 43 8 600 true true
EMP_NO integer 0.0 145.0 58.392156862745104 51 51 0 true true
FIRST_NAME string 3.0 11.0 5.549019607843137 51 51 0 Robert true false
FULL_NAME string 10.0 22.0 14.549019607843137 51 51 0 Nelson, Robert true false
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 0
JOB_CODE string 2.0 24.0 6.1568627450980395 51 51 0 VP true false
JOB_COUNTRY string 2.0 11.0 3.6470588235294117 51 51 0 USA true false
JOB_GRADE number 0.0 5.0 3.4117647058823524 51 51 0 true true
LAST_NAME string 3.0 12.0 6.8431372549019605 51 51 0 Nelson true false
PHONE_EXT string 1.0 5.0 2.8627450980392157 51 51 0 250 true false
SALARY number 0.0 9.9E7 2267125.137254902 51 51 0 true true

but it can be easily changed to excel file (just adjust graph parameter WRITER_TYPE=XLS_WRITER).

AdvancedStatistic graph

The phase 0 of AdvancedStatistic graph works similarly as graph BasicStatistic, but it uses Aggregators for statistic calculations instead of Rollup. Be particular about component called Simplification in phase 0 of the graph: it stores number of records in file and names of fields with number of unique values under threshold in dictionary (marked by red ellipses on the picture above). Then Histogram filter component can read this field’s names and skip the records that aren’t between fields for frequency calculations (green eclipses on the picture). Phase 1 Aggregators count frequencies for fields that were filtered out by previous component.

Resulting file looks as follows:

Advanced data statistic and histograms for ./data-in/employees.list.dat

Statistics

Field name Field type min max average number count count not null count unique median modus
DEPT_NO string 000 900 51 43 20 600 623
EMP_NO integer 0.0 145.0 58.3921568627451 51 51 47 45.0 2.0
FIRST_NAME string Andrew Yuki 51 51 44 Mark Robert
FULL_NAME string Baldwin, Janet Young, Katherine 51 51 50 Lee, Terri Sutherland, Claudia
HIRE_DATE date 28/12/1988 00:00:00 15/11/1994 00:00:00 51 51 44 20/04/1992 00:00:00 02/01/1994 00:00:00
JOB_CODE string Admin Vice President 51 51 16 Mktg Eng
JOB_COUNTRY string Canada USA 51 51 9 USA USA
JOB_GRADE number 0.0 5.0 3.411764705882353 51 51 6 4.0 4.0
LAST_NAME string Baldwin Young 51 51 48 Lambert Johnson
PHONE_EXT string 1 null 51 51 46 3355 null
SALARY number 0.0 9.9E7 2267125.137254902 51 51 40 61637.8125 0.0

Histograms

Field name Field type value count count %
DEPT_NO string 8 ###############·····················································································
DEPT_NO string 000 2 ###·································································································
DEPT_NO string 100 2 ###·································································································
DEPT_NO string 110 2 ###·································································································
DEPT_NO string 115 2 ###·································································································
DEPT_NO string 120 3 #####·······························································································
DEPT_NO string 121 1 #···································································································
DEPT_NO string 123 1 #···································································································
DEPT_NO string 125 1 #···································································································
DEPT_NO string 130 2 ###·································································································
DEPT_NO string 140 2 ###·································································································
DEPT_NO string 180 2 ###·································································································
DEPT_NO string 600 2 ###·································································································
DEPT_NO string 621 4 #######·····························································································
DEPT_NO string 622 3 #####·······························································································
DEPT_NO string 623 5 #########···························································································
DEPT_NO string 670 2 ###·································································································
DEPT_NO string 671 3 #####·······························································································
DEPT_NO string 672 2 ###·································································································
DEPT_NO string 900 2 ###·································································································
JOB_CODE string Admin 4 #######·····························································································
JOB_CODE string CEO 1 #···································································································
JOB_CODE string CFO 1 #···································································································
JOB_CODE string Dir 1 #···································································································
JOB_CODE string Doc 1 #···································································································
JOB_CODE string Eng 15 #############################·······································································
JOB_CODE string Finan 1 #···································································································
JOB_CODE string Inside Sales Coordinator 1 #···································································································
JOB_CODE string Mktg 1 #···································································································
JOB_CODE string Mngr 4 #######·····························································································
JOB_CODE string PRel 1 #···································································································
JOB_CODE string SRep 9 #################···················································································
JOB_CODE string Sales 2 ###·································································································
JOB_CODE string Sales Representative 6 ###########·························································································
JOB_CODE string VP 2 ###·································································································
JOB_CODE string Vice President 1 #···································································································
JOB_COUNTRY string Canada 2 ###·································································································
JOB_COUNTRY string England 3 #####·······························································································
JOB_COUNTRY string France 1 #···································································································
JOB_COUNTRY string Italy 1 #···································································································
JOB_COUNTRY string Japan 2 ###·································································································
JOB_COUNTRY string Sales 1 #···································································································
JOB_COUNTRY string Switzerland 1 #···································································································
JOB_COUNTRY string UK 4 #######·····························································································
JOB_COUNTRY string USA 36 ######################################################################······························
JOB_GRADE number 0.0 1 #···································································································
JOB_GRADE number 1.0 2 ###·································································································
JOB_GRADE number 2.0 8 ###############·····················································································
JOB_GRADE number 3.0 14 ###########################·········································································
JOB_GRADE number 4.0 16 ###############################·····································································
JOB_GRADE number 5.0 10 ###################·················································································