<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval on the Live Web &#187; evaluation</title>
	<atom:link href="http://livewebir.com/blog/tag/evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>http://livewebir.com/blog</link>
	<description>by Paul Ogilvie</description>
	<lastBuildDate>Thu, 26 May 2011 18:47:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Useful evaluations</title>
		<link>http://livewebir.com/blog/2008/06/useful-evaluations/</link>
		<comments>http://livewebir.com/blog/2008/06/useful-evaluations/#comments</comments>
		<pubDate>Mon, 23 Jun 2008 12:51:02 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[evaluation]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=7</guid>
		<description><![CDATA[I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn&#8217;t able to dig up the exact quote, but it was something along the lines of &#8220;statistical significance is not enough; you must also have practical significance.&#8221; To measure [...]]]></description>
			<content:encoded><![CDATA[<p>I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn&#8217;t able to dig up the exact quote, but it was something along the lines of &#8220;statistical significance is not enough; you must also have practical significance.&#8221;</p>
<p>To measure a practical difference between systems, we must:</p>
<ol>
<li>have an evaluation measure that correlates with user satisfaction,</li>
<li>understand how large a difference under that measure must be before a user notices that one system is preferable to another, and</li>
<li>have confidence (statistically) that the difference between the two systems is larger than that minimum noticeable difference.</li>
</ol>
<p>The third challenge is the most straightforward, as we can rely on statistics.  The first two, however, are far more challenging.  Traditional system evaluation measures such as mean average precision often do not correlate well with human satisfaction.  Other measures focusing on early precision may correlate better with human experience, but per-query scores vary much more when only the top-ranked documents are considered.  This means we need a much larger number of queries to establish statistical significance.</p>
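<p>A minimal sketch of the third step, under stated assumptions: we have per-query score differences between two systems, the minimum noticeable difference is already known, and a normal approximation is acceptable (a t-distribution would be more appropriate for small query sets).  The function names and parameters are hypothetical and chosen only for illustration.</p>
<pre>
```python
import math
from statistics import NormalDist, mean, stdev


def lower_bound(diffs, alpha=0.05):
    """One-sided lower confidence bound on the mean per-query difference.

    Assumes a normal approximation to the sampling distribution of the mean.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return mean(diffs) - z * stdev(diffs) / math.sqrt(len(diffs))


def exceeds_noticeable_difference(diffs, min_noticeable, alpha=0.05):
    """True when we are confident the mean difference is above the threshold."""
    return lower_bound(diffs, alpha) > min_noticeable


def queries_needed(sd, true_diff, min_noticeable, alpha=0.05, power=0.8):
    """Approximate query count for a one-sided z-test of mean > min_noticeable.

    sd is the standard deviation of the per-query differences, and true_diff
    is the (assumed) real mean difference between the two systems.
    """
    margin = true_diff - min_noticeable
    if not margin > 0:
        raise ValueError("true_diff must exceed min_noticeable")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd / margin) ** 2)
```
</pre>
<p>Because the required query count grows with the square of the standard deviation, doubling the per-query variability roughly quadruples the number of queries needed, which is why noisier early-precision measures call for larger query sets.</p>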
<p>There has been a lot of recent research in the Information Retrieval community on evaluation.  Much of this work has focused on statistical techniques and on building test collections efficiently.  I think this work is great to have, but I wish more of it addressed the first two challenges.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/06/useful-evaluations/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

