<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>CloverETL&#039;s Blog &#187; sort</title>
	<atom:link href="http://blog.cloveretl.com/tag/sort/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.cloveretl.com</link>
	<description>Life, the Universe, CloverETL and everything ...</description>
	<lastBuildDate>Thu, 15 Jul 2010 14:12:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.cloveretl.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/dd4c2411bcdf90b36e88bda58e3fce7c?s=96&#038;d=http://s2.wp.com/i/buttonw-com.png</url>
		<title>CloverETL&#039;s Blog &#187; sort</title>
		<link>http://blog.cloveretl.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.cloveretl.com/osd.xml" title="CloverETL&#039;s Blog" />
	<atom:link rel='hub' href='http://blog.cloveretl.com/?pushpress=hub'/>
		<item>
		<title>ExtSort vs. FastSort – which one is better for me?</title>
		<link>http://blog.cloveretl.com/2010/06/15/extsort-vs-fastsort-%e2%80%93-which-one-is-better-for-me/</link>
		<comments>http://blog.cloveretl.com/2010/06/15/extsort-vs-fastsort-%e2%80%93-which-one-is-better-for-me/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 12:08:59 +0000</pubDate>
		<dc:creator>bigpavel</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[sort]]></category>
		<category><![CDATA[sort components]]></category>
		<category><![CDATA[sorting]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=704</guid>
		<description><![CDATA[I often get asked why CloverETL offers two sort components instead of just one, and how to decide which one is better for a particular purpose.

The reason for having two sort components in CloverETL is simply to keep things as easy as possible. Since the inner workings of ExtSort and FastSort are quite different, it would be really difficult to implement a nice and clean universal one.

Luckily, the decision is simple and straightforward. If you can dedicate enough system resources (CPU cores and/or memory) to the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want more conservative behavior, pick ExtSort, which will give you steady performance at minimum system requirements.]]></description>
			<content:encoded><![CDATA[<p>I often get asked why CloverETL offers two sort components instead of just one, and how to decide which one is better for a particular purpose.</p>
<p>The reason for having two sort components in CloverETL is simply to keep things as easy as possible. Since the inner workings of ExtSort and FastSort are quite different, it would be really difficult to implement a nice and clean universal one.</p>
<p>Luckily, the decision is simple and straightforward. If you can dedicate enough system resources (CPU cores and/or memory) to the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want more conservative behavior, pick ExtSort, which will give you steady performance at minimum system requirements.</p>
<p>FastSort is a very powerful tool, but to truly witness its power, users must set it up correctly to use their hardware&#8217;s full potential. We will now dive into the settings behind this impressive component and learn how to max out its ability while being careful to avoid crashes.</p>
<h4>Tweaking FastSort</h4>
<p>FastSort is greedy for both memory and CPU cores. If the system does not have enough of either, FastSort can quite easily crash with an out-of-memory error, especially if the records you’re going to sort are big (long string fields, tens or hundreds of fields, etc.).</p>
<h5>Parallelism</h5>
<p>Unlike ExtSort, FastSort can use a potentially unlimited number of CPU cores to do its job. You can control how many worker threads are used by overriding the default value of “Concurrency (threads)”. My experience shows, however, that unless you have really fast disk drives, going beyond 2 threads does not necessarily help and can even slow the process down a bit. So basically you don’t need to worry about parallelism at all unless you have the hardware to take advantage of it. Remember that each additional thread adds extra memory load!</p>
<h5>Memory</h5>
<p>FastSort can be a bit tricky with memory, since multiple settings affect it. The most important is “Run size (records)”, which denotes the size of the data chunk sorted at a time. Note that the actual record size and the level of parallelism increase the overall memory consumption, so be careful with this setting. The default is 20k records. If you set “Estimated record count”, your rough guess at the total number of records to be sorted, the Run size is computed for you from an experimentally derived formula. This formula tries to pick the right “Run size” based on the number of records and the amount of available memory (which you can limit with “Maximum memory”; it defaults to unlimited). This “computed guess” works in most cases but can fail under certain conditions, so test and tweak on your own data to get the best result. Run size is definitely a parameter worth playing with!</p>
<p>Be sure to dedicate enough memory to your JVM, especially when sorting many large records. You want to give FastSort plenty of free memory: going for 512 MB up to 2 GB is worth it (e.g. -Xmx1536m). With a lot of memory, FastSort will do an amazing job. With the default 64 MB heap space setting, however, FastSort can crash.</p>
<p>&#8216;In memory only sorting&#8217; is an option you can use if you’re sure that all data will fit into memory. You can either force it (and then possibly crash with out-of-memory) or leave it at the default, auto. Auto means that FastSort first tries to sort the data in memory and, if that fails, falls back to on-disk sorting.</p>
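<p>To get a feel for how Run size, record size and the number of in-memory buffers interact, here is a rough back-of-the-envelope sketch. It is not the formula FastSort uses internally; the function names, the two buffers and the record sizes are assumptions made purely for illustration, as is the simplification that each buffer holds one full run.</p>

```python
def sort_buffer_bytes(run_size_records, avg_record_bytes, buffers):
    """Rough memory held by the sort buffers (assumes one full run per buffer)."""
    return run_size_records * avg_record_bytes * buffers


def max_run_size(memory_budget_bytes, avg_record_bytes, buffers):
    """Largest run size (in records) whose buffers still fit into the budget."""
    return memory_budget_bytes // (avg_record_bytes * buffers)


DEFAULT_RUN_SIZE = 20_000  # the default "Run size (records)" mentioned above

# Same run size, very different footprint depending on record width.
narrow = sort_buffer_bytes(DEFAULT_RUN_SIZE, avg_record_bytes=200, buffers=2)
wide = sort_buffer_bytes(DEFAULT_RUN_SIZE, avg_record_bytes=5_000, buffers=2)

# Hypothetical budget: 512 MiB set aside for sorting wide records.
budget = 512 * 1024 * 1024
limit = max_run_size(budget, avg_record_bytes=5_000, buffers=2)

print(f"narrow records: {narrow / 1e6:.0f} MB")
print(f"wide records:   {wide / 1e6:.0f} MB")
print(f"largest run size within budget: {limit} records")
```

<p>Real memory use will be higher, since Java objects carry overhead beyond the raw record bytes, so treat numbers like these as a starting point for testing rather than a guarantee.</p>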
<h5>Other limits and valuable parameters</h5>
<p>Apart from memory settings, you can impose more limits on FastSort to reflect your needs. For example, if your system works with disk quotas that limit the number of open files, you can cap FastSort’s temp files with “Max open files”. Note that FastSort uses LOTS of files: hundreds, even thousands. If you cap it too tightly (500 or less), FastSort will continue to work, but its performance will decrease significantly. So should you need to limit the number of open files, consider switching to ExtSort.</p>
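<p>A quick way to see whether a file cap is likely to bite is to estimate how many temp files a sort would produce. The sketch below assumes one temp file per sorted run; the function names and the 100 million record input are hypothetical and chosen only for illustration.</p>

```python
def temp_file_count(total_records, run_size_records):
    """Number of sorted runs, assuming one temp file per run (ceiling division)."""
    return -(-total_records // run_size_records)


def run_size_for_cap(total_records, max_open_files):
    """Smallest run size that keeps the run count within the cap."""
    return -(-total_records // max_open_files)


TOTAL_RECORDS = 100_000_000  # hypothetical input size
DEFAULT_RUN_SIZE = 20_000    # default "Run size (records)"

files = temp_file_count(TOTAL_RECORDS, DEFAULT_RUN_SIZE)
needed = run_size_for_cap(TOTAL_RECORDS, max_open_files=500)

print(f"temp files at default run size: {files}")
print(f"run size that would stay within 500 files: {needed} records")
```

<p>If the run size needed to stay under the cap is larger than your memory can hold, that is a good hint that ExtSort is the better fit.</p>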
<h5>Settings you can forget</h5>
<p>There are other advanced options for FastSort, but you can leave them at their default values unless you are really trying to optimize your sort. Number of read buffers defines how many chunks of data are held in memory at a time. It must be at least equal to the Concurrency setting, otherwise some of the workers wouldn’t have data to work on. Set it too high and you’ll end up with out-of-memory. The default is based on the current concurrency setting and is just fine.</p>
<p>Average record size is nothing more than a helper guess at the average size of records in bytes. If it is not set, FastSort computes it automatically from the real data, which is usually more precise than setting an explicit value.</p>
<p>Tape buffer is a buffer each worker uses for filling the output. It slightly affects performance, but the default is fine in almost all cases.</p>
<p>The last two options control how temp files are created: they can be compressed (defaults to false), and you can even control the charset of string fields (default UTF16). Both exist to save space (the space occupied by temp files during graph execution), and both decrease performance.</p>
<h4>The Decision</h4>
<p>FastSort is a very powerful sorting component and can significantly speed up your transformation process. But it has to be set up correctly. So, if you are not sure and you want the always safe and simple sort, go with ExtSort. On the other hand, if you know your hardware and want to use it to optimize your sort for speed, dive into FastSort and explore it a bit. The results can be extraordinary.</p>
<p><em>To be continued</em> … (Part 2 will discuss ExtSort component)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloveretl.com/2010/06/15/extsort-vs-fastsort-%e2%80%93-which-one-is-better-for-me/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0fb57473985720d4d29eac8a52337a73?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">bigpavel</media:title>
		</media:content>
	</item>
		<item>
		<title>Light speed sorting with FastSort</title>
		<link>http://blog.cloveretl.com/2009/05/15/light-speed-sorting-with-fastsort/</link>
		<comments>http://blog.cloveretl.com/2009/05/15/light-speed-sorting-with-fastsort/#comments</comments>
		<pubDate>Fri, 15 May 2009 14:40:12 +0000</pubDate>
		<dc:creator>bigpavel</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[sort]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=40</guid>
		<description><![CDATA[A short post introducing FastSort component in CloverETL 2.7]]></description>
			<content:encoded><![CDATA[
<p style="margin-bottom:0;" lang="en-US">Recently I’ve been struggling to squeeze a little more speed out of CloverETL’s current sorting component, ExtSort. Benchmarks show that the performance of ExtSort is very good, yet we wanted to push things a few steps further and make it even better. Finally, after a little research and tweaking, we came up with a compromise. The development split into two paths: the original ExtSort remained, and a new component, FastSort, was introduced in CloverETL 2.7. Let&#8217;s have a small peek at what it is about!</p>
<p style="margin-bottom:0;" lang="en-US">FastSort is based on a modified merge algorithm and can usually deliver double, sometimes even 2.5 times, the performance of ExtSort. No matter how good this sounds, it surprisingly doesn’t make ExtSort obsolete or a generally inferior sort component! Let&#8217;s see why, and how you will benefit from learning to use the right sort component for the right purpose.</p>
<p style="margin-bottom:0;" lang="en-US">Let&#8217;s go back to the beginning. ExtSort is based on a merge sort algorithm using a fixed number of tapes (temporary disk files). Records in ExtSort are read from the input and sorted in groups (chunks), and each chunk is appended to the end of one of the tapes so that the tape lengths stay balanced. The chunks are then merged together and sent out. The number of tapes and the chunk size play an important role in performance tweaking. ExtSort can work on any size of input data (provided that there is enough disk space for temporary files) and has reasonably low demands on system memory (almost constant relative to the input data set size).</p>
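<p style="margin-bottom:0;" lang="en-US">The chunk-then-merge idea itself can be shown in a few lines. The sketch below is a generic in-memory illustration only, not CloverETL&#8217;s implementation: it keeps every chunk in a Python list instead of writing it to a tape, and the function names are made up for the example.</p>

```python
import heapq
from itertools import islice


def sorted_chunks(records, chunk_size):
    """Yield successive sorted lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = sorted(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_merge_sort(records, chunk_size):
    """Sort by building sorted chunks first, then merging them in one pass."""
    chunks = list(sorted_chunks(records, chunk_size))
    # heapq.merge lazily merges inputs that are already sorted.
    return list(heapq.merge(*chunks))


data = [42, 7, 19, 3, 88, 1, 56, 23, 5]
result = chunked_merge_sort(data, chunk_size=4)
print(result)
```

<p style="margin-bottom:0;" lang="en-US">A disk-based sorter follows the same two phases, except that each sorted chunk is spilled to a temporary file and the merge step reads the files back sequentially.</p>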
<p style="margin-bottom:.14in;" lang="en-US">On the other hand, FastSort does not put chunks onto tapes. Instead, it creates a new file (i.e. tape) for each sorted chunk; we call these “sorted runs” here. Combined with parallel processing of multiple chunks at once, larger memory utilization, etc., this makes great speed gains possible. But there is a cost for everything, and this fast sort approach is no exception. Since FastSort has to keep many files open, its resource demands slowly increase in proportion to the size of the data set. That means there is a theoretical cap, which is a trade-off between run size (a run needs to fit into memory for sorting) and the overhead of keeping runs open (around 10KB each). However, on most production systems, this cap lies far beyond practical use. Let&#8217;s see a small example:</p>
<p style="margin-bottom:.14in;" lang="en-US"><em>Let&#8217;s have a billion (10^9) data records with an average size of 200 bytes, i.e. around 200 GB of data, quite a large set. The ideal run size, which is computed automatically, is around 500 000 records = 100 MB for a single run. Under ideal conditions there are 3 sort buffers, which makes 300 MB of memory for sorting. This setup produces roughly 2000 temporary files, i.e. another 20 MB for keeping track of them. That adds up to a total of 320 MB of memory plus some system overhead for sorting a 200 GB file. This surely is acceptable, especially when we take FastSort&#8217;s tweaking possibilities into account. There are always ways of sacrificing a little performance to decrease resource requirements, e.g. shrinking run size, limiting buffers and open files, etc. There are a lot of parameters to fiddle with if you wish to.</em></p>
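<p style="margin-bottom:.14in;" lang="en-US">The arithmetic in this example is easy to reproduce and to replay with your own numbers. The short script below uses only the figures quoted above; the variable names are made up for the illustration, and decimal units (1 MB = 10^6 bytes) are assumed.</p>

```python
RECORDS = 10**9                  # a billion records
AVG_RECORD_BYTES = 200           # average record size
RUN_SIZE_RECORDS = 500_000       # run size from the example
SORT_BUFFERS = 3                 # sort buffers under ideal conditions
PER_RUN_OVERHEAD_BYTES = 10_000  # "around 10KB" per open run

MB = 10**6
GB = 10**9

total_data_gb = RECORDS * AVG_RECORD_BYTES / GB
run_mb = RUN_SIZE_RECORDS * AVG_RECORD_BYTES / MB
sort_mb = SORT_BUFFERS * run_mb
temp_files = RECORDS // RUN_SIZE_RECORDS
tracking_mb = temp_files * PER_RUN_OVERHEAD_BYTES / MB
total_mb = sort_mb + tracking_mb

print(f"input data:       {total_data_gb:.0f} GB")
print(f"one run:          {run_mb:.0f} MB")
print(f"sort buffers:     {sort_mb:.0f} MB")
print(f"temporary files:  {temp_files}")
print(f"run bookkeeping:  {tracking_mb:.0f} MB")
print(f"total (estimate): {total_mb:.0f} MB")
```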
<p style="margin-bottom:.14in;" lang="en-US">As you can see, FastSort is a bit greedier than ExtSort, but given the resources, it gets the job done significantly faster. In cases where speed is crucial and enough system resources can be dedicated to sorting, FastSort is a great choice. For moderate and resource-critical applications, ExtSort is less demanding and provides steady performance at very little cost.</p>
<p style="margin-bottom:.14in;" lang="en-US">For further information please refer to CloverETL&#8217;s wiki page <a href="http://wiki.cloveretl.org/doku.php?id=components:transformers#fastsort">http://wiki.cloveretl.org/doku.php?id=components:transformers#fastsort</a> or the Documentation page.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloveretl.com/2009/05/15/light-speed-sorting-with-fastsort/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0fb57473985720d4d29eac8a52337a73?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">bigpavel</media:title>
		</media:content>
	</item>
	</channel>
</rss>