<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: ParallelReader versus competitors</title>
	<atom:link href="http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/</link>
	<description>Life, the Universe, CloverETL and everything ...</description>
	<lastBuildDate>Tue, 19 Jan 2010 18:17:45 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: ParallelReader Versus Competitors Finish &#171; CloverETL&#39;s Blog</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-54</link>
		<dc:creator>ParallelReader Versus Competitors Finish &#171; CloverETL&#39;s Blog</dc:creator>
		<pubDate>Wed, 09 Dec 2009 15:09:21 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-54</guid>
		<description>[...] transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part [...]</description>
		<content:encoded><![CDATA[<p>[...] transformations were run on my laptop with Windows Vista Home Premium. For detail information see part 1 and part [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ParallelReader Versus Competitors Part 2 &#171; CloverETL&#39;s Blog</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-47</link>
		<dc:creator>ParallelReader Versus Competitors Part 2 &#171; CloverETL&#39;s Blog</dc:creator>
		<pubDate>Wed, 11 Nov 2009 16:08:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-47</guid>
		<description>[...] pm   Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the [...]</description>
		<content:encoded><![CDATA[<p>[...] pm   Before we will release a complete comparison of open source ETL tools and after a success of my previous blog post I decided to publish the second transformation that we used in the [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-40</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Fri, 30 Oct 2009 10:56:09 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-40</guid>
		<description>Hi Peter,

You were right, obviously: we don&#039;t have an in-memory aggregation step in Pentaho Data Integration.
The reason is quite simple if you think about it: we do our reporting and analysis in other parts of the Pentaho software stack.
For performance-driven (immediate) response times I would recommend Pentaho Analysis (Mondrian).
It&#039;s also possible to report directly on the data from Pentaho Data Integration too: http://michaeltarallo.blogspot.com/2009/07/pentaho-goes-to-movies-data-integration.html

However, as with all benchmarks, there is this hidden implication that since one performance metric is bad, the whole tool must be bad. :-)

So suppose we *would* have an in-memory group by, what would the performance *then* be?
Well, I created a new JIRA feature request in your honor this morning: http://jira.pentaho.com/browse/PDI-2804
And, since the code is pretty trivial compared to the functionality of the streaming version, I implemented it in 4.0 as well.
It&#039;s unfortunately not possible to add the step to the stable 3.2 branches and I lack the time (and user requests) to create it as a plugin.

As such, if you or a reader wants to play with it, get a recent (preview) build over here: http://ci.pentaho.com/job/Kettle/ 

The result is more in line with yours for a CPU-bound process.  Pentaho Data Integration did it in 45 seconds, but your system is a lot slower (1.66 GHz vs 2.33 GHz)  http://imagebin.ca/img/oo-MhJ.jpg
Get the transformation here: http://www.kettle.be/dloads/Pentaho2.ktr

The transformation displayed is interesting in the sense that you can use data partitioning to parallelize the aggregation step as well.  That gives you maybe another 5-10% gain. Maybe that&#039;s something you should try too. The run-time for the transformation is really a bit too short to draw any major conclusions.  Maybe we should try with larger data sets to flush out other bottlenecks. :-)

On a side note: it takes Kettle 100 seconds to bulk load the data into MySQL.
Executing the query takes MySQL 43 seconds, even when you put an index on the group columns, so I don&#039;t think either tool is doing that bad of a job.

Again, thanks for the puzzle/challenge.  I certainly had my fun with it :-)
Best of luck with CloverETL, it looks like you&#039;re doing a great job.

Kind regards,
Matt</description>
		<content:encoded><![CDATA[<p>Hi Peter,</p>
<p>You were right, obviously: we don&#8217;t have an in-memory aggregation step in Pentaho Data Integration.<br />
The reason is quite simple if you think about it: we do our reporting and analysis in other parts of the Pentaho software stack.<br />
For performance-driven (immediate) response times I would recommend Pentaho Analysis (Mondrian).<br />
It&#8217;s also possible to report directly on the data from Pentaho Data Integration too: <a href="http://michaeltarallo.blogspot.com/2009/07/pentaho-goes-to-movies-data-integration.html" rel="nofollow">http://michaeltarallo.blogspot.com/2009/07/pentaho-goes-to-movies-data-integration.html</a></p>
<p>However, as with all benchmarks, there is this hidden implication that since one performance metric is bad, the whole tool must be bad. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>So suppose we *would* have an in-memory group by, what would the performance *then* be?<br />
Well, I created a new JIRA feature request in your honor this morning: <a href="http://jira.pentaho.com/browse/PDI-2804" rel="nofollow">http://jira.pentaho.com/browse/PDI-2804</a><br />
And, since the code is pretty trivial compared to the functionality of the streaming version, I implemented it in 4.0 as well.<br />
It&#8217;s unfortunately not possible to add the step to the stable 3.2 branches and I lack the time (and user requests) to create it as a plugin.</p>
<p>As such, if you or a reader wants to play with it, get a recent (preview) build over here: <a href="http://ci.pentaho.com/job/Kettle/" rel="nofollow">http://ci.pentaho.com/job/Kettle/</a> </p>
<p>The result is more in line with yours for a CPU-bound process.  Pentaho Data Integration did it in 45 seconds, but your system is a lot slower (1.66 GHz vs 2.33 GHz)  <a href="http://imagebin.ca/img/oo-MhJ.jpg" rel="nofollow">http://imagebin.ca/img/oo-MhJ.jpg</a><br />
Get the transformation here: <a href="http://www.kettle.be/dloads/Pentaho2.ktr" rel="nofollow">http://www.kettle.be/dloads/Pentaho2.ktr</a></p>
<p>The transformation displayed is interesting in the sense that you can use data partitioning to parallelize the aggregation step as well.  That gives you maybe another 5-10% gain. Maybe that&#8217;s something you should try too. The run-time for the transformation is really a bit too short to draw any major conclusions.  Maybe we should try with larger data sets to flush out other bottlenecks. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>On a side note: it takes Kettle 100 seconds to bulk load the data into MySQL.<br />
Executing the query takes MySQL 43 seconds, even when you put an index on the group columns, so I don&#8217;t think either tool is doing that bad of a job.</p>
<p>Again, thanks for the puzzle/challenge.  I certainly had my fun with it <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
Best of luck with CloverETL, it looks like you&#8217;re doing a great job.</p>
<p>Kind regards,<br />
Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-39</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:41:52 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-39</guid>
		<description>Thanks for the puzzle Peter, I&#039;ll have a look tomorrow.</description>
		<content:encoded><![CDATA[<p>Thanks for the puzzle Peter, I&#8217;ll have a look tomorrow.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Petr Uher</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-38</link>
		<dc:creator>Petr Uher</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:39:07 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-38</guid>
		<description>Alternative link for download added.
http://www.filefactory.com/file/a0732fg/n/ParallelReaderComparison_zip</description>
		<content:encoded><![CDATA[<p>Alternative link for download added.<br />
<a href="http://www.filefactory.com/file/a0732fg/n/ParallelReaderComparison_zip" rel="nofollow">http://www.filefactory.com/file/a0732fg/n/ParallelReaderComparison_zip</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Petr Uher</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-37</link>
		<dc:creator>Petr Uher</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:29:25 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-37</guid>
		<description>Yes, it is, but it isn&#039;t my fault. It&#039;s Pentaho&#039;s &quot;feature&quot;. Pentaho&#039;s &quot;Group by&quot; step requires sorted input; CloverETL&#039;s and Talend&#039;s aggregates don&#039;t.</description>
		<content:encoded><![CDATA[<p>Yes, it is, but it isn&#8217;t my fault. It&#8217;s Pentaho&#8217;s &#8220;feature&#8221;. Pentaho&#8217;s &#8220;Group by&#8221; step requires sorted input; CloverETL&#8217;s and Talend&#8217;s aggregates don&#8217;t.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-36</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:21:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-36</guid>
		<description>OK, since I know that the reading is fast enough, I can only guess by looking at the image that Kettle is being hurt by the sort BEFORE the aggregate.</description>
		<content:encoded><![CDATA[<p>OK, since I know that the reading is fast enough, I can only guess by looking at the image that Kettle is being hurt by the sort BEFORE the aggregate.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Petr Uher</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-35</link>
		<dc:creator>Petr Uher</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:12:08 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-35</guid>
		<description>It looks like RapidShare is too busy. I&#039;ll re-upload it to another server and add a download link to the post.</description>
		<content:encoded><![CDATA[<p>It looks like RapidShare is too busy. I&#8217;ll re-upload it to another server and add a download link to the post.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-34</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:06:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-34</guid>
		<description>By the way, I&#039;m still trying to download the file without too much luck.  I&#039;ll give it another try later.</description>
		<content:encoded><![CDATA[<p>By the way, I&#8217;m still trying to download the file without too much luck.  I&#8217;ll give it another try later.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Casters</title>
		<link>http://blog.cloveretl.com/2009/10/26/parallelreader-versus-competitors/#comment-33</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:05:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.cloveretl.com/?p=223#comment-33</guid>
		<description>According to the CSV standard, the file lineitems.tbl has an extra trailing column. Define a dummy value at the end and you&#039;re fine.</description>
		<content:encoded><![CDATA[<p>According to the CSV standard, the file lineitems.tbl has an extra trailing column. Define a dummy value at the end and you&#8217;re fine.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
