Tag Archives: sorting data

Sorting Data: ExtSort vs. FastSort – which one is better for me? (Part 2)

In my  previous post I have focused on tips for tweaking the FastSort component – performance sort component available in all commercial CloverETL editions. Today, I would like to touch the original ExtSort component which has been in CloverETL for a while and is available in both commercial and free (Community, opensource engine) editions.

While FastSort is optimized to deliver stunning performance in resource-rich environment, ExtSort is a modest hard worker used to working with limited resources. This proves to be useful especially in tranformations with lots of parallel branches with sorting.

Again I won’t discuss the case where there are only a few records and all sorting can be done in memory. For external sorting, ExtSort organizes its temp data in sorted chunks and places them onto a fixed number of tapes – each tape being a single temporary file. Since ExtSort is single-threaded, multiple ExtSort-s running at once do not consume that much resources as FastSort does.

With ExtSort, there isn’t that much magic as with tweaking FastSort. First what you can do is specify a number of temporary directories where tapes will be stored. This is useful when you’re able to harness multiple physical drives under multiple mount points, otherwise just make sure you have enough free space on the drive.

The only two attributes you can fiddle with is Buffer capacity and Number of tapes. There’s also the Sorter initial capacity but this will be deprecated soon and nowadays acts exactly the same as Buffer capacity. Buffer capacity determines the size of each chunk and thus needs to fit into memory. If you’re familiar with FastSort, Run size is the equivalent parameter there. Increasing Buffer capacity is generally a good idea for boosting ExtSort’s performance, but expect increased memory requirements, although still far from those of FastSort.

Number of tapes defaults to 6 and generally will yield best results, however again, increasing the number slightly does not take too much resources and can help with performance.

Compared to FastSort, ExtSort is pretty easy to configure and will work almost under any conditions. However, there’s always a catch – with ExtSort it is the rather inferior performance compared to its stronger brother.

Sorting Data – ExtSort vs. FastSort – which one is better for me?

I often get asked why CloverETL offers two sort components instead of just one and what’s the right key for determining which one is better for a particular purpose.

The reason for having two sort components in CloverETL is simply to keep things as easy as possible. Since the inner natures of ExtSort and FastSort are quite different it would be really difficult to implement a nice and clean universal one.

Luckily, the decision is simple and straightforward. In case you can dedicate enough system resources (CPU cores and/or memory) for the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want a more conservative behavior, pick ExtSort which will give you steady performance at minimum system requirements.

FastSort is a very powerful tool, but to truly witness its power, users must set it up correctly to use their hardware’s maximum potential. We will now dive into the settings behind this impressive component and learn how to max out it’s ability while being careful to avoid crashes.

Tweaking FastSort

FastSort is greedy for both memory and CPU cores and in case the system does not have enough of either, FastSort can quite easily crash with out-of-memory, especially if the records you’re going to sort are big (long string fields, tens or hundreds of fields, etc.).

Parallelism

Unlike ExtSort, FastSort can utilize potentially unlimited number of CPU cores to do its job. You can control how many worker threads are used by overriding default value for “Concurrency (threads)”. My experience shows however, that unless you’re able to use really fast disk drives, going for more than 2 threads does not necessarily help and can even slow the process back down a bit. So basically you don’t need to worry about parallelism at all unless you have the hardware to take advantage of it. Remember, that parallelism adds extra memory load for each additional thread!

Memory

FastSort can be a bit tricky with memory, since there are multiple settings which affect it. The most important is the “Run size (records)” which denotes the size of the data chunk being sorted at a time. Note, that actual record size and level of parallelism increase the overall memory consumption – so be careful with this setting. The default is 20k records, if you set the “Estimated record count” – which is your rough guess on total count of records to be sorted, the Run size is computed for you based on a experimentally derived formula. This formula tries to get the right “Run size” based on number of records and amount of available memory (which you can limit with “Maximum memory” – defaults to unlimited). This “computed guess” works in most cases, but can fail under certain conditions. You need to test and tweak on your data a bit to get the best result. Run size is definitely a parameter worth playing with!

Be sure to have enough memory dedicated to your JVM – with large, numerous records. You want to give FastSort plenty of free memory – going for 512 MB up to 2 gigs is worth it! (e.g. –Xmx1536m) With a lot of memory, FastSort will do an amazing job. However with default 64 MB heap space setting, FastSort can crash.

‘In memory only sorting’ is an option you can use in case you’re sure that all data will fit into your memory – you can either force it (and then possibly crash due to out-of-memory) or leave it to default auto. Auto means that at first, FastSort tries to sort the data in memory and if that fails, on disk sorting is used instead.

Other limits and valuable parameters

Apart from memory settings, you can impose more limits on FastSort to reflect your needs. For example, if your system works with disk quotas which limit the number of open files, you can cap temp files of FastSort with “Max open files”. Note that FastSort uses LOTS of files – hundreds, thousands. If you cap it too much (500 or less) FastSort will continue to work, but  its performance will decrease significantly. So should you need to limit the number of open files, consider switching to ExtSort.

Settings you can forget

There are other advanced options for FastSort, but you can leave them to default values unless you are really trying to optimize your sort. Number of read buffers defines how many chunks of data will be held in memory at a time – which must be at least the number of Concurrency – otherwise some of the workers wouldn’t have data to work on. Using too large a number, you’ll end up with out-of-memory – the default is based on current concurrency setting and is just fine.

Average record size is nothing else than a helper guess on average byte size of records in the data – if not set, FastSort computes this automatically from the real data so it’s usually more precise than setting an explicit value.

Tape buffer is a buffer used for each worker for filling the output and slightly affects performance, but the default is fine in almost all cases.

The last two options control how temp files are created, they can be either compressed (defaults to false) and you can even control the charset of string fields (default UTF16). Both are there for space saving purposes (space occupied by temp files during graph execution) and decrease performance.

The Decision

FatSort is very powerful sorting component and can significantly speed up your transformation process. But it has to be set up correctly. So, if you are not sure and you want the always safe and simple sort, go with ExtSort. On the other hand, if you know your hardware and want to utilize it to optimize your sort for speed, dive into FastSort and explore it a bit. The results can be extraordinary.

To be continued … (Part 2 will discuss ExtSort component)

Joining Data with RelationalJoin

CloverETL version 2.8 offered a brand new commercial  component called RelationalJoin. It extends the CloverETL pallete of joiner components with new functionality – joining of records with relational operators different from equal (=). This component has two input ports for master and slave data records, and a single output port for joined data records. Master and slave data records are joined and sent to the output port if they are in a specific relation to each other. In other words, it’s just another joiner that uses a relational operator to drive the joining process.

Relations Between Data Records

The relation between two data records is specified by any of the following relational operators: !=, <, <=, >, >=. For example the < operator means that each master data record is less than all the slave data records that it was joined with. If you look at it from the slave’s perspective, all slave data records joined with a certain master data records are greater in this case. It’s valid in both ways. ;-)

In order to make the joining process as effective as possible, input data records coming through both input ports have to be sorted appropriately. Except for the != operator that doesn’t require any sorting at all. However, if you choose the < or <= operator, you need to sort both streams of input data records in the descending order. In case of the > and >= operators, you need to make sure the data records are sorted in the ascending order. If the sort order is invalid, execution of the component fails.

When processing large data sets, be aware that the master data records are processed one by one while the slave data records need to be buffered. In the worst case, all the slave data records need to be buffered and thus their number should be as small as possible.

Practical Example

They say that an example is worth a thousand words so let’s create a simple one. Imagine that you’ve got a set of several distinct numbers. If for some reason you need to pair each of them with all numbers from the same set that are greater, it is a pretty simple task for RelationalJoin! :-) See the example graph below.

Example graph with RelationJoin.

We read the numbers from a flat file, sort them appropriately, send them as two independent data streams to RelationalJoin which produces the desired output, and finally write them to another flat file. Pretty simple, don’t you think? ;-)

In order to configure RelationalJoin, we need to specify a transformation, join key and join relation. Setting the first two is simple, you have done that a hundred times. In case of join relation, it is simple as well, just select “master(D) < slave(D)” from the combo box. The letter D in the round brackets denotes that we need to sort both streams of data records in the descending order. Thus we need to configure ExtSort in this way.

That’s all we had to do, just run the graph and see the desired results!

Light Speed Sorting with FastSort

 

Recently I’ve been struggling to squeeze a little speed increase out of current CloverETL’s sorting component – the ExtSort. Benchmarks show that the performance of ExtSort is very good, yet we again wanted to push things a few steps forward and make them even better. Finally, after a little research and tweaking we came up with a compromise. The development split into two paths – the original ExtSort remained and a new component – FastSort – was introduced in CloverETL 2.7. Let’s have a small peek at what it is about!

FastSort is based on modified merging algorithm and can usually produce double or sometimes even 2.5 times better performance results than ExtSort. No matter how good this sounds it surprisingly doesn’t make ExtSort obsolete or generally inferior sort component! Let’s see why it is and how you will benefit from learning to use the right sort component for the right purpose.

Let’s go back to the beginning. ExtSort is based on merge sort algorithm using fixed number of tapes (temporary disk files). Records in ExtSort are read from input, sorted into groups – chunks – and added onto the end one of the tapes balancing their lengths. The chunks are then merged together and sent out. The number of tapes and chunk size play important role in performance tweaking. ExtSort can work on any size of input data (provided that there is enough disk space for temporary files) and has reasonably low demands on system memory (almost constant relatively to input data set size).

On the other hand, FastSort does not put chunks onto tapes – it creates a new file (i.e. tape) for each sorted chunk instead – we call them “sorted runs” here. Along with parallel processing of multiple chunks at once, larger memory utilization, etc. great speed achievements are possible. But there is a cost for everything and this fast sort approach is no exception. Since FastSort has to keep many open files it slowly increases its resource demands proportionally to the size of the data set. That means there is a theoretical cap which is a trade-off between run size (which needs to fit into memory for sorting) and overhead with keeping open runs (around 10KB each). However, on most production systems, hitting this cap is far beyond practical use. Let’s see a small example:

Let’s have a billion (10^9) data records of average size of 200 bytes, i.e. around 200 GB of data – quite a large set. Ideal run size, which is computed automatically, is around 500 000 records = 100 megs for a single run.. Under ideal conditions there are 3 sort buffers which makes 300 megs of memory for sorting. This setup produces roughly 2000 temporary files, i.e. another 20 megs for keeping track of them. That adds to a total of 320 MB of memory plus some system overhead for sorting a 200 GB file – this surely is acceptable, especially when we take FastSort’s tweaking possibilities into account. There are always ways of sacrificing a little performance to decrease resource requirements – e.g. shrinking run size, limiting buffers and open files, etc. – there’s a lot of parameters to fiddle with if you wish to.

As you can see, FastSort is a bit more greedy that ExtSort but given the resources, it gets the job done significantly faster. In cases where speed is crucial and enough system resources can be dedicated for sorting, FastSort is a great choice. For moderate and resource critical applications ExtSort is less demanding and provides steady performance at very little cost.

For further information please referr to CloverETL’s wiki page http://wiki.cloveretl.org/doku.php?id=components:transformers#fastsort or Documentation page.

Upcoming 2.7 Release of CloverETL – Faster Sorting of Data and Improved Reading Data

As of today (Mar 31st), Clover Engine 2.7 branch has been created and the testing/QA process has started. Within approx 2 weeks, brand new version of CloverETL will be ready. It brings many small new features and bug fixes, but also several significant improvements – mostly in speed.

The aging ExtSort component is being replaced by new FastSort, which can bring up to 2.5 times the performance of old ExtSort. I am sure, there will be special post on this blog by FastSort’s developer Pavel Najvar, who will explan in detail where he found those hidden 250% of speed.

There are also speed improvements in our Universal Data Reader (reader of text data, delimited or fixed). We thoroughly profiled its code and were able to find 20-25% of additional speed. This puts us even farher in front of competition !