<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>perro's bite</title>
	<atom:link href="http://perro.si/feed" rel="self" type="application/rss+xml" />
	<link>http://perro.si</link>
	<description>about personal projects, mostly data analysis, programming, etc.</description>
	<lastBuildDate>Mon, 22 Feb 2010 16:47:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Django profiling again</title>
		<link>http://perro.si/django-profiling-again</link>
		<comments>http://perro.si/django-profiling-again#comments</comments>
		<pubDate>Sun, 22 Nov 2009 10:15:46 +0000</pubDate>
		<dc:creator>Peter Ljubič</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[bottleneck]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[profiling]]></category>

		<guid isPermaLink="false">http://perro.si/?p=274</guid>
		<description><![CDATA[Profiling means identifying bottlenecks (parts of the code that run slowly) and then removing them, more or less successfully. As described in the previous post about profiling, a profile middleware can help with that. The solution is fine, but one problem remains: how to find the bottleneck in the first place. [...]]]></description>
			<content:encoded><![CDATA[<p>Profiling means identifying bottlenecks (parts of the code that run slowly) and then removing them, more or less successfully. As described in <a href="http://perro.si/profiling-django-applications">the previous post</a> about profiling, a profile middleware can help with that. The solution is fine, but one problem remains: how to find the bottleneck in the first place.</p>
<p>You don&#8217;t want to test all of the views in a huge application by appending ?profile to the urls. At least I don&#8217;t. And even if you do, you might detect a view that takes a few seconds to respond but is called only twice a year, while another view that takes a millisecond is called a million times a day. The following decorator and shell script can help you identify such views.</p>
<p>The decorator stores the function&#8217;s name, the time it spent, and the timestamp of the call in a temporary file. The shell script then parses that file and displays the total time used by each function. All you have to do is add @execution_time in front of every function you&#8217;d like to measure.</p>
<pre class="sh_python">
class execution_time(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        from datetime import datetime
        n1 = datetime.now()
        result = self.func(*args, **kwargs)
        n2 = datetime.now()
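        delta = n2 - n1
        # total elapsed microseconds; delta.microseconds alone drops whole seconds
        elapsed = (delta.days * 86400 + delta.seconds) * 1000000 + delta.microseconds
        log = open('/tmp/exectime', 'a')
        log.write('%d %s %d %d %d %d %d %d %d\n' % (
            elapsed, self.func.__name__,
            n1.year, n1.month, n1.day, n1.hour, n1.minute,
            n1.second, n1.microsecond)
        )
        log.close()
        return result

    def __repr__(self):
        """Return the function's docstring (empty string if it has none)."""
        return self.func.__doc__ or ''</pre>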
<p>The shell script uses cat, (g)awk, and sort. Call it with the file produced by the decorator as its only argument.</p>
<pre class="sh_sh">#!/bin/bash
echo '  absolute    relative     #calls   time/call function'

cat "$1" |
gawk '{ all += $1; a[$2] += $1; freq[$2] += 1; }
END { for(x in a) printf "%10d %11g %10d %11g %s\n",
    a[x], a[x]/all, freq[x], a[x]/freq[x], x; }' |
sort -n</pre>
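<p>If gawk is not at hand, the same aggregation can be sketched in Python. This is only an illustration, not part of the original setup: the function name <code>summarize</code> and the sample lines are made up, and the input is assumed to have the format written by the decorator (elapsed time first, function name second).</p>
<pre class="sh_python">
from collections import defaultdict


def summarize(lines):
    """Return (total, share, calls, time per call, name) rows, smallest total first."""
    total = defaultdict(int)
    calls = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[1]
        total[name] += int(fields[0])
        calls[name] += 1
    grand = float(sum(total.values())) or 1.0  # avoid dividing by zero
    rows = [(t, t / grand, calls[n], t / float(calls[n]), n)
            for n, t in total.items()]
    return sorted(rows)


# made-up sample in the decorator's log format
sample = [
    '100 load 2009 11 22 10 15 46 1',
    '300 design 2009 11 22 10 15 47 2',
    '100 load 2009 11 22 10 15 48 3',
]
for row in summarize(sample):
    print('%10d %11g %10d %11g %s' % row)
</pre>
<p>On the real log you would pass <code>open('/tmp/exectime')</code> instead of the sample list.</p>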
<p>Finally, the output, ordered by the absolute time spent in each function (in microseconds, ascending, so the most expensive functions come last). The second column shows the share of time spent in a function, e.g., the model_preview function takes about 10.6% of all measured time. Note that this is relative to the total time spent only in <em>measured</em> views. One drawback of this decorator is that it does not take into account that one function might call another. So if function A calls B, and you measure both, the time reported for A is actually the time for A and B together. But that&#8217;s where profiling with the middleware comes in handy.</p>
<pre>  absolute    relative     #calls   time/call function
    211772 0.000133614         21     10084.4 load
    368835  0.00023271         72     5122.71 sort_boxes
    384175 0.000242389         33     11641.7 remove_from_cart

...

 139603047   0.0880802       4585     30447.8 design
 160016266     0.10096       7074     22620.3 render_image
 168310848    0.106193       4172       40343 model_preview</pre>
]]></content:encoded>
			<wfw:commentRss>http://perro.si/django-profiling-again/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Profiling Django Applications</title>
		<link>http://perro.si/profiling-django-applications</link>
		<comments>http://perro.si/profiling-django-applications#comments</comments>
		<pubDate>Fri, 28 Nov 2008 16:35:35 +0000</pubDate>
		<dc:creator>Peter Ljubič</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[hotshot]]></category>
		<category><![CDATA[profiler]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://perro.si/?p=153</guid>
		<description><![CDATA[I don&#8217;t really like putting print statements in my code in order to debug it. Actually, the only thing I hate more is putting get-time-like functions before and after a certain part of the code, then subtracting the before value from the after value and printing the result. Wouldn&#8217;t it be nice to have a tool like gprof for [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t really like putting print statements in my code in order to debug it. Actually, the only thing I hate more is putting get-time-like functions before and after a certain part of the code, then subtracting the before value from the after value and printing the result. Wouldn&#8217;t it be nice to have a tool like <a href="http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html" target="_blank">gprof</a> for profiling <a title="Django (web framework)" rel="homepage" href="http://www.djangoproject.com">Django</a> applications?</p>
<p>You can use the Python module named <a href="http://docs.python.org/lib/module-hotshot.html">hotshot</a>, a high performance logging profiler. How to use it is described on its documentation page. Here I will only mention that its statistics module (hotshot.stats) prints results to the standard output.</p>
<p>Let&#8217;s say we want to trigger the profiler on any url and show its results in the browser whenever a <em>profile</em> parameter is appended to the url query string. In order to intercept that parameter we create a so-called view <a href="http://djangobook.com/en/1.0/chapter15/">middleware</a>. In this middleware we inspect the url that was used to call a certain view.</p>
<p>Now, in the middleware code we just have to check whether there is a GET parameter named <em>profile</em> in the query string (e.g. http://example.com/some/path/?profile). If there is one, we create the Profile object, use it to call the actual view, and then return the statistics in an HttpResponse. Since the statistics are printed to stdout, we redirect stdout to a StringIO object to be able to display them in the response.</p>
<p>If no such parameter exists, None is returned and the normal procedure of calling the view takes place. Put together, the middleware code looks like this:</p>
<p><code><br />
<span style="color: magenta;">from</span> django.http <span style="color: magenta;">import</span> HttpResponse<br />
<span style="color: magenta;">import</span> hotshot, hotshot.stats<br />
<span style="color: magenta;">import</span> sys, StringIO, os</code></p>
<p><code><span style="color: #cc0;">class</span> <span style="color: cyan;">ProfileMiddleware</span>():<br />
  <span style="color: #cc0;">def</span> <span style="color: cyan;">__init__</span>(self):<br />
    <span style="color: #cc0;">pass</span></code></p>
<p><code> </code></p>
<p><code>  <span style="color: #cc0;">def</span> <span style="color: cyan;">process_view</span>(self, request, view, *args, **kwargs):<br />
    <span style="color: #cc0;">for</span> item <span style="color: #cc0;">in</span> request.META['<span style="color: red;">QUERY_STRING</span>'].split('<span style="color: red;">&amp;</span>'):<br />
      <span style="color: #cc0;">if</span> item.split('<span style="color: red;">=</span>')[0] == '<span style="color: red;">profile</span>':<span style="color: blue;"> # profile in query string</span></code></p>
<p><code> </code></p>
<p><code><span style="color: blue;">        # catch the output, must happen before stats object is created<br />
        # see https://bugs.launchpad.net/webpy/+bug/133080 for the details</span><br />
        std_old, std_new = sys.stdout, StringIO.StringIO()<br />
        sys.stdout = std_new</code></p>
<p><code> </code></p>
<p><code><span style="color: blue;">        # now let's do some profiling</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tmpfile = '<span style="color: red;">/tmp/%s</span>' % request.COOKIES['<span style="color: red;">sessionid</span>']<br />
        prof = hotshot.Profile(tmpfile)</code></p>
<p><code> </code></p>
<p><code><span style="color: blue;">        # make a call to the actual view function with the given arguments</span><br />
        response = prof.runcall(view, request, *args[0], **args[1])<br />
        prof.close()</code></p>
<p><code><span style="color: blue;">        # and then statistical reporting</span><br />
        stats = hotshot.stats.load(tmpfile)<br />
        stats.strip_dirs()<br />
        stats.sort_stats('<span style="color: red;">time</span>')</code></p>
<p><code><span style="color: blue;">        # do the output</span><br />
        stats.print_stats(1.0)</code></p>
<p><code><span style="color: blue;">        # restore default output</span><br />
        sys.stdout = std_old</code></p>
<p><code> </code></p>
<p><code><span style="color: blue;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# delete file</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;os.remove(tmpfile)</code></p>
<p><code> </code></p>
<p><code>        <span style="color: #cc0;">return</span> HttpResponse('<span style="color: red;">&lt;pre&gt;%s&lt;/pre&gt;</span>' % std_new.getvalue())</code></p>
<p><code>    <span style="color: #cc0;">return</span> None</code></p>
<p>Next, save this code in a file named <code>middleware/profile.py</code>. For the middleware to work, you must enable it in the <code>settings.py</code> file:</p>
<p><code><br />
MIDDLEWARE_CLASSES = (<br />
&nbsp;&nbsp;...<br />
&nbsp;&nbsp;'<span style="color: red;">djangocode.middleware.profile.ProfileMiddleware</span>',<br />
)<br />
</code></p>
<p>The sample output might look like this:</p>
<pre>
19743&nbsp;function&nbsp;calls&nbsp;(19238&nbsp;primitive&nbsp;calls)&nbsp;in&nbsp;0.064&nbsp;CPU&nbsp;seconds

&nbsp;&nbsp;&nbsp;Ordered&nbsp;by:&nbsp;internal&nbsp;time

&nbsp;&nbsp;&nbsp;ncalls&nbsp;&nbsp;tottime&nbsp;&nbsp;percall&nbsp;&nbsp;cumtime&nbsp;&nbsp;percall&nbsp;filename:lineno(function)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4465&nbsp;&nbsp;&nbsp;&nbsp;0.007&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.007&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;encoding.py:37(force_unicode)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;995&nbsp;&nbsp;&nbsp;&nbsp;0.006&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.010&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;html.py:30(escape)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;&nbsp;0.005&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.041&nbsp;&nbsp;&nbsp;&nbsp;0.005&nbsp;defaulttags.py:108(render)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;995&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.030&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;debug.py:85(render)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1006&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.014&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;functional.py:246(wrapper)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;32/1&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.049&nbsp;&nbsp;&nbsp;&nbsp;0.049&nbsp;__init__.py:764(render)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1027&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;safestring.py:89(mark_safe)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1010&nbsp;&nbsp;&nbsp;&nbsp;0.003&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.005&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:690(_resolve_lookup)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1010&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;context.py:38(__getitem__)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1009&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.009&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:533(resolve)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1020&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.007&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:672(resolve)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;42/3&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.005&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;__init__.py:254(parse)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.064&nbsp;&nbsp;&nbsp;&nbsp;0.064&nbsp;pages.py:76(account_statistics)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;324&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;util.py:39(__getattr__)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1405&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:790(render)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;&nbsp;&nbsp;&nbsp;0.004&nbsp;pages.py:53(_sql)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;317&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;cursors.py:320(fetchone)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;191&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;debug.py:25(create_token)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;37&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:487(__init__)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.002&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;debug.py:10(tokenize)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;191&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:229(create_token)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;10&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;cursors.py:273(_do_query)
&nbsp;&nbsp;&nbsp;293/76&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;__init__.py:750(get_nodes_by_type)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;290&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;&nbsp;&nbsp;&nbsp;0.001&nbsp;&nbsp;&nbsp;&nbsp;0.000&nbsp;connections.py:189(string_decoder)
</pre>
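<p>Note that hotshot ships only with Python 2 and was removed in Python 3. The capture step (run a callable under the profiler, then collect the report as text) can also be sketched with the standard cProfile and pstats modules. This is a minimal sketch and not the middleware above: <code>profile_call</code> and <code>slow_sum</code> are made-up names, and <code>slow_sum</code> merely stands in for a view function.</p>
<pre class="sh_python">
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    prof = cProfile.Profile()
    result = prof.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    # pstats writes to the given stream, so stdout needs no redirecting
    stats = pstats.Stats(prof, stream=buf)
    stats.strip_dirs().sort_stats('time').print_stats()
    return result, buf.getvalue()


def slow_sum(n):
    # hypothetical stand-in for a view function
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 1000)
print(report)
</pre>
<p>Because the report goes to a stream of our choosing, there is no temporary file to create or delete afterwards.</p>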
<p>Note that you should install the profiling middleware only in the development version, since you don&#8217;t want just anybody to see the structure of your code on the live site. The next time you want to check why a certain url is slow, just append ?profile to it.</p>
]]></content:encoded>
			<wfw:commentRss>http://perro.si/profiling-django-applications/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Rich Can Play</title>
		<link>http://perro.si/the-rich-can-play</link>
		<comments>http://perro.si/the-rich-can-play#comments</comments>
		<pubDate>Mon, 25 Aug 2008 12:24:58 +0000</pubDate>
		<dc:creator>Peter Ljubič</dc:creator>
				<category><![CDATA[Visual]]></category>
		<category><![CDATA[beijing]]></category>
		<category><![CDATA[gdp]]></category>
		<category><![CDATA[medals]]></category>
		<category><![CDATA[olympic games]]></category>
		<category><![CDATA[population]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://perro.si/?p=9</guid>
		<description><![CDATA[Glancing through the Beijing 2008 Olympic medal table made me curious about what matters most for finishing at or near the top. What actually triggered my curiosity was the fact that India had only one medal at the time. And India's population is enormous. Just like China's. Yet China is leading. Another trigger is the fact that I come from Slovenia, a country with a population just above 2M. And we have 5 medals.]]></description>
			<content:encoded><![CDATA[<p>Glancing through the Beijing 2008 Olympic medal table made me curious about what matters most for finishing at or near the top. What actually triggered my curiosity was the fact that India had only one medal at the time. And India&#8217;s population is enormous. Just like China&#8217;s. Yet China is leading. Another trigger is the fact that I come from Slovenia, a country with a population just above 2M. And we have 5 medals.</p>
<p>To get an answer I downloaded some GDP and population data from the World Bank website and the <a title="Overall medal standings, Beijing 2008" href="http://results.beijing2008.cn/WRM/ENG/INF/GL/95A/GL0000000.shtml" target="_blank">overall medal standings</a> from the Beijing 2008 official website. After some data alignment, a plot was created using <a title="The R Project for Statistical Computing" href="http://www.r-project.org/" target="_blank">R</a>, an environment for statistical exploration of data sets. Each point of the graph represents one country. Blue circles represent countries whose athletes won at least one medal in Beijing. Their size corresponds to the number of medals won. Red crosses represent countries that won no medals. The position of a point (red or blue) on the X axis represents the log of the country&#8217;s population, while the Y axis plots the log of its GDP per capita.</p>
<p><span style="text-decoration: underline; color: #551a8b;"><a href="http://perro.si/wp-content/uploads/2008/08/beijing.gif"><img class="alignleft size-medium wp-image-65" title="Beijing 2008 Medals Count With Extra Countries Info" src="http://perro.si/wp-content/uploads/2008/08/beijing.gif" alt="" width="300" height="231" /></a></span>The first pattern that appears in the figure is the upper right triangle containing big blue points, representing the countries that won most of the medals. So GDP per capita matters. So does population size. The left side and the bottom of the figure hold the countries too small or too poor to win any medals.</p>
<p>Interesting are the countries of the former communist bloc with not so much wealth, but with a great tradition in sports and a population big enough to win a slightly bigger chunk of medals &#8211; such countries are Ukraine, Belarus, Romania, Kazakhstan, Poland, etc.</p>
<p>Countries from Africa mostly populate the lower part of the figure, where three examples stand out: Kenya, Zimbabwe, and Ethiopia. I believe they won mostly in athletics, where they&#8217;re known for mastering long distance running.</p>
<p>What follows is the R code used to create this graph, where <code>country</code>, <code>pop</code>, <code>gdppc</code>, <code>all</code>, and <code>color</code> represent each country&#8217;s name, its population, its GDP per capita, its total number of medals (gold + silver + bronze), and a color showing whether the country won any medals at all (blue) or none (red):<br />
<code><br />
postscript(file='beijing.ps')<br />
plot(log(pop), log(gdppc), cex=all, col=color, pch=16)<br />
text(log(pop), log(gdppc), labels=country, pos=4)<br />
dev.off()<br />
</code></p>
<p>Conclusions? The rich can definitely play, and size matters as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://perro.si/the-rich-can-play/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Netflix Prize with Nearest Neighbours</title>
		<link>http://perro.si/netflix-prize-with-nearest-neighbours</link>
		<comments>http://perro.si/netflix-prize-with-nearest-neighbours#comments</comments>
		<pubDate>Mon, 25 Aug 2008 12:24:14 +0000</pubDate>
		<dc:creator>Peter Ljubič</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[nearest neighbours]]></category>
		<category><![CDATA[netflix]]></category>

		<guid isPermaLink="false">http://perro.si/?p=7</guid>
		<description><![CDATA[Those not familiar with the contest can find the details here. After three weeks of work I managed to score an RMSE of 0.9422, which puts me just above 1000th place. How? Not too hard, but not too easy either. I&#8217;ve put the k-nearest neighbours algorithm to work. The algorithm finds the nearest k samples to [...]]]></description>
			<content:encoded><![CDATA[<p>Those not familiar with the contest can find the details <a href="http://netflixprize.com/">here</a>. After three weeks of work I managed to score an RMSE of 0.9422, which puts me just above 1000th place. How? Not too hard, but not too easy either.</p>
<p>I&#8217;ve put the <a href="http://en.wikipedia.org/wiki/Nearest_neighbor_%28pattern_recognition%29">k-nearest neighbours</a> algorithm to work. The algorithm finds the k samples nearest to the given one (the one we&#8217;d like to predict), and then derives its class/value from them. So there are three tasks to make it work:</p>
<ul>
<li>define <strong>distance</strong> that will tell the algorithm which are the closest samples</li>
<li>set <strong>k</strong> &#8211; neighbourhood size</li>
<li>choose <strong>combination</strong> of the k values from the closest samples to obtain the final results</li>
</ul>
<p>For a (user, movie) tuple I found the k closest users who had watched the movie, and combined their ratings to get a prediction.</p>
<h3>Distance</h3>
<p>There are many different distance measures to consider. One can try Pearson&#8217;s correlation coefficient. The sparsity of the vectors brought cosine similarity to my mind. It works fine for clustering documents represented by sparse vectors. But none of those produced satisfactory results, so I tried another method that tries to capture some common sense.</p>
<p>Imagine walking into a room full of unknown people. Initially everybody has the same (some average) distance to you. Then you talk to them, and the first one likes the same music band as you. So the distance to this person diminishes a little bit. Then another person says the same band is crap, so his/her distance increases. But there are more bands, movies, actors, moral issues, work related issues etc. to talk about. The more &#8216;slots&#8217; you have in common, and the more you agree on each slot beyond some average agreement, the closer you are to the person.</p>
<p>So that&#8217;s exactly what I used to calculate the distance between users. First, I calculated the average distance between ratings. Actually, I took around one million randomly picked ratings and got an average rating-to-rating distance of 1.53465. When comparing two users, their distance was the sum of |r1 - r2| - average_distance over the movies they had in common. When a movie was not seen by one or both of them, the average distance was assumed, so that movie contributed nothing to the sum. That&#8217;s how the problem of non-overlapping movies was solved when users had only a few movies in common.</p>
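<p>The distance described above can be sketched in a few lines of Python. This is not the original C++ code: the function name, the dictionary layout (movie id mapped to rating), and the sample users are made up, and taking the absolute difference of two ratings as their distance is my reading of the description.</p>
<pre class="sh_python">
AVERAGE_DISTANCE = 1.53465  # average rating-to-rating distance quoted above


def user_distance(ratings_a, ratings_b):
    """Sum of (|r1 - r2| - AVERAGE_DISTANCE) over movies rated by both users.

    Movies rated by only one user are assumed to sit at the average
    distance, so they would add zero and are simply skipped.
    """
    common = set(ratings_a).intersection(ratings_b)
    return sum(abs(ratings_a[m] - ratings_b[m]) - AVERAGE_DISTANCE
               for m in common)


# hypothetical users: movie id mapped to rating
alice = {1: 5, 2: 3, 3: 4}
bob = {1: 5, 2: 1, 4: 2}
print(user_distance(alice, bob))
</pre>
<p>A negative value means the two users agree more than a random pair of ratings would, and the more movies they share, the further the value can move away from zero.</p>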
<h3>K</h3>
<p>With the distance set, I initially took 100 neighbours and was a bit lucky, because later I also tried 25, 50, 75, 125, 150, and 200, but none of them improved the score. Note that for another distance measure the best number is not necessarily the same.</p>
<h3>Combination</h3>
<p>I tried only two of them. One is averaging the 100 ratings obtained from the closest samples, and the other is a weighted average, where the weights drop linearly from 100 for the closest sample to 1 for the farthest one. The latter gave better results.</p>
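<p>As an illustration, the linearly weighted combination might look like this in Python. The function name and the sample ratings are made up; the only thing taken from the text is that the closest neighbour gets weight k and the farthest gets weight 1.</p>
<pre class="sh_python">
def weighted_prediction(neighbour_ratings):
    """Weighted average of ratings ordered from closest to farthest neighbour.

    With k ratings the weights are k, k-1, ..., 1.
    """
    k = len(neighbour_ratings)
    weights = range(k, 0, -1)
    total = sum(w * r for w, r in zip(weights, neighbour_ratings))
    return total / float(sum(weights))


# three hypothetical neighbours, closest first
print(weighted_prediction([5, 3, 1]))
</pre>
<p>For these three ratings the plain average is 3, while the weighted version is pulled toward the closest neighbour&#8217;s rating of 5.</p>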
<h3>Implementation</h3>
<p>I built two C++ classes, namely <code>user</code> and <code>movie</code>, with everything necessary in their interfaces, such as <code>user::get_closest_users()</code>. Initially everything was written in C, but it was a messy solution. Actually, one month later I couldn&#8217;t read the code anymore, I got lost in the arrays and pointers&#8230; <img src='http://perro.si/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<div id="attachment_144" class="wp-caption alignright" style="width: 160px"><a href="http://perro.si/wp-content/uploads/2008/10/netflix.png"><img class="size-thumbnail wp-image-144" title="48 cores attacking netflix prize" src="http://perro.si/wp-content/uploads/2008/10/netflix.png" alt="48 cores attacking netflix prize" width="150" height="119" /></a><p class="wp-caption-text">48 cores attacking netflix prize</p></div>
<p>Running the qualifying dataset took me about five days. Some friends tried caching the neighbourhoods, but that didn&#8217;t work well. Then one day I got lucky. At work they asked me if I had some software for heavy load testing of 7 brand new HP blades (each armed with eight cores). That&#8217;s 56 cores, heaven. Using a Python script I split the dataset into small pieces, distributed them among the blades, and later collected the results. Using ssh I managed to run it remotely. It took 4 to 6 hours to finish.</p>
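<p>The splitting step is simple to sketch. The original script is not shown here, so the round-robin scheme and the function name below are assumptions; any scheme that gives each worker a roughly equal share would do.</p>
<pre class="sh_python">
def split_round_robin(items, n_workers):
    """Deal items out to n_workers lists, like dealing cards."""
    return [items[i::n_workers] for i in range(n_workers)]


rows = list(range(10))  # stand-in for the rows of the qualifying dataset
for worker, chunk in enumerate(split_round_robin(rows, 3)):
    print(worker, chunk)
</pre>
<p>Each chunk can then be written to its own file and handed to one machine.</p>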
<h3>Wishes</h3>
<p>I wish I had more time to play with it.</p>
]]></content:encoded>
			<wfw:commentRss>http://perro.si/netflix-prize-with-nearest-neighbours/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spaghetti Code</title>
		<link>http://perro.si/spaghetti-code</link>
		<comments>http://perro.si/spaghetti-code#comments</comments>
		<pubDate>Mon, 25 Aug 2008 12:23:38 +0000</pubDate>
		<dc:creator>Peter Ljubič</dc:creator>
				<category><![CDATA[Visual]]></category>
		<category><![CDATA[complexity]]></category>
		<category><![CDATA[function call]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://perro.si/?p=4</guid>
		<description><![CDATA[Having trouble explaining to your non-programmer boss that the whole story of creating software is just going too fast? Or tried to convince someone you need more time to plan software, not just to write software? Show these people some graphs. They&#8217;ll get it. Or at least you&#8217;ll get that they&#8217;ll never get it. The title [...]]]></description>
			<content:encoded><![CDATA[<p><span style="color: #551a8b; text-decoration: underline;"><a href="http://perro.si/wp-content/uploads/2008/08/bb.gif"><img class="alignleft size-medium wp-image-69" title="Shitty Code Graph" src="http://perro.si/wp-content/uploads/2008/08/bb.gif" alt="" width="294" height="300" /></a></span>Having trouble explaining to your non-programmer boss that the whole story of creating software is just going too fast? Or tried to convince someone you need more time to <em>plan</em> software, not just to <em>write</em> software? Show these people some graphs. They&#8217;ll get it. Or at least you&#8217;ll get that they&#8217;ll never get it.</p>
<p>The title should actually be <em>tight coupling visualized</em>. The graphs come from one of the companies where I used to work. Each node in a graph represents one function from the code (not necessarily in the same file). An edge running from node <strong>a</strong> to node <strong>b</strong> represents a call of function <strong>b</strong> from function <strong>a</strong>.</p>
<p>How to obtain such a graph from your code? Ok, first of all, it was PHP code. Function calls were extracted and saved to a file using a tool named <a title="PHPCallGraph" href="http://phpcallgraph.sourceforge.net/" target="_blank">PHPCallGraph</a>. In this case I used it from the command line on Ubuntu. After that it is suggested to use GraphViz, but it was too slow and the graphics were not nice. So I converted the output to the .gml format and visualized it with another must-have companion &#8211; <a title="yEd" href="http://www.yworks.com/products/yed/" target="_blank">yEd</a>. It&#8217;s Java-based, it&#8217;s free, it can handle large graphs, and it has a powerful arsenal of algorithms to get the desired layout. The graphs were obtained using the organic layout (check out the tool&#8217;s screenshot gallery on its website).</p>
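<p>The conversion to .gml is the only step that needs code of your own. A minimal Python sketch follows. It assumes the calls have already been parsed into (caller, callee) pairs; the parsing of the PHPCallGraph output is not shown, and the function name and sample calls are made up.</p>
<pre class="sh_python">
def to_gml(calls):
    """Render (caller, callee) pairs as a directed graph in GML."""
    ids = {}
    for caller, callee in calls:
        for name in (caller, callee):
            ids.setdefault(name, len(ids))  # number nodes in order of appearance
    lines = ['graph [', '  directed 1']
    for name, node_id in sorted(ids.items(), key=lambda item: item[1]):
        lines.append('  node [ id %d label "%s" ]' % (node_id, name))
    for caller, callee in calls:
        lines.append('  edge [ source %d target %d ]'
                     % (ids[caller], ids[callee]))
    lines.append(']')
    return '\n'.join(lines)


# hypothetical call pairs: register() calls validate() and save(), and so on
calls = [('register', 'validate'), ('register', 'save'), ('save', 'validate')]
print(to_gml(calls))
</pre>
<p>The resulting text can be saved with a .gml extension and opened in yEd, where a layout algorithm places the nodes.</p>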
<p><a href="http://perro.si/wp-content/uploads/2008/08/registration.gif"><img class="alignright size-medium wp-image-68" title="Registration Code with OpenID" src="http://perro.si/wp-content/uploads/2008/08/registration.gif" alt="" width="300" height="273" /></a>Now, back to code and the business. One graph comes from a local project and the other from a registration and billing system written using JanRain&#8217;s OpenID library. Guess which is which.</p>
]]></content:encoded>
			<wfw:commentRss>http://perro.si/spaghetti-code/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

