<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval on the Live Web &#187; Statistics</title>
	<atom:link href="http://livewebir.com/blog/category/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://livewebir.com/blog</link>
	<description>by Paul Ogilvie</description>
	<lastBuildDate>Mon, 08 Mar 2010 17:20:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Why we model comment counts</title>
		<link>http://livewebir.com/blog/2008/07/why-we-model-comment-counts/</link>
		<comments>http://livewebir.com/blog/2008/07/why-we-model-comment-counts/#comments</comments>
		<pubDate>Tue, 08 Jul 2008 19:34:01 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[comment counts]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=15</guid>
		<description><![CDATA[Last week, I wrote a very technical post on how I model the distribution of comment counts for an RSS feed in FeedHub.  I originally drafted the post to document its derivation.  It was non-trivial enough that I feel that there is some merit in sharing this information with others, but it started with the [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, I wrote a <a title="Modeling Blog Post Comment Counts" href="http://livewebir.com/blog/2008/07/modeling-blog-post-comment-counts/">very technical post</a> on how I model the distribution of comment counts for an RSS feed in <a title="FeedHub" href="http://www.feedhub.com">FeedHub</a>.  I originally drafted the post to document its derivation.  It was non-trivial enough that I feel that there is some merit in sharing this information with others, but it started with the assumption that my readers either are interested in the derivation for its own sake or already understand the importance of modeling these distributions per input feed.  I&#8217;d like to correct that with a little discussion about how this information can be used.</p>
<p>FeedHub&#8217;s goal is to provide our users with a personalized feed that delivers only the most relevant posts.  This feed is unique to each of our users, and I use &#8220;relevant&#8221; here to refer to any post the user would like to see in their personalized feed.  By observing a user&#8217;s interactions with their feed items (such as clicking on the thumbs up or thumbs down icons), FeedHub tries to learn a model of which feed item attributes are predictive of the user&#8217;s interests.  These attributes may be inferred from the content of the feed item, such as the Wikipedia category based labels we assign to items.  Alternatively, they could be based on other attributes of the feed item, such as the number of comments people have made to that post.</p>
<p>The base assumption is that this comment count is predictive of whether or not the item is interesting to the user.  The more discussion around the item, the more likely it is to be interesting to a user.</p>
<p>However, it is important that we normalize the input to the model used in FeedHub such that the comment counts are useful predictors across all input feeds.  Here are some factors that may be important:</p>
<ol>
<li>Some feed sources have a lot of discussion around every item (e.g. <a title="Slashdot" href="http://slashdot.org">Slashdot</a>).  Other feeds may have many fewer comments per post (e.g. John Langford&#8217;s <a title="Machine Learning (Theory)" href="http://hunch.net">Machine Learning (Theory)</a> blog).  If you&#8217;ve subscribed to both of these feeds in FeedHub, that implies you are probably interested in receiving content from both of these feeds.  Naively using the comment count directly as a feature in a learning algorithm is unlikely to work well; it&#8217;s hard to believe that the post with the most comments from John Langford&#8217;s blog is worse than the post with the fewest comments on Slashdot.  Some per-feed normalization is likely to be important.</li>
<li>For a feed which typically receives a small number of comments per item, the difference between 1 comment and 10 comments is likely to reflect a larger difference in &#8220;interest&#8221; than the difference between 51 and 60 comments.  Careful normalization within a feed may be necessary.</li>
</ol>
<p>These two concerns led me to hypothesize that a good way to normalize the number of comments an item has received would be to model the probability distribution of comments per item for each feed.  Given an observed number of comments <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> for an item, we can estimate <img src='http://s.wordpress.com/latex.php?latex=P%28X%3Cx%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(X&lt;x)' title='P(X&lt;x)' class='latex' />, the probability that another randomly selected item from that same feed will have fewer comments.  I&#8217;m not an expert in the combination of evidence, but this approach does mitigate the above concerns.  Using this probability estimate also has a nice intuitive interpretation: a normalized value of 0.9 means that we expect an item to have more comments than 90% of the items from that feed.</p>
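<p>As a rough sketch of this normalization (not FeedHub&#8217;s actual code), the snippet below sums a negative binomial probability mass function, the model described in the earlier post, to estimate P(X &lt; x).  The function names and the two sets of feed parameters are made up for illustration.</p>
<pre>
```python
import math

def neg_binomial_pmf(x, p, r):
    """Probability of exactly x comments under a negative binomial with parameters p and r."""
    log_pmf = (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
               + r * math.log(p) + x * math.log1p(-p))
    return math.exp(log_pmf)

def normalized_comment_score(x, p, r):
    """Estimate P(X < x): the chance that another item from the same feed has fewer comments."""
    return sum(neg_binomial_pmf(k, p, r) for k in range(x))

# Hypothetical parameters: a quiet feed and a busy feed (assumed values, not fitted).
quiet_feed = {"p": 0.5, "r": 1.0}
busy_feed = {"p": 0.02, "r": 2.0}
for name, feed in (("quiet", quiet_feed), ("busy", busy_feed)):
    print(name, round(normalized_comment_score(10, feed["p"], feed["r"]), 3))
```
</pre>
<p>Under these assumed parameters, the same raw count of 10 comments maps to a high score on the quiet feed and a low score on the busy one, which is the per-feed behavior argued for above.</p>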
<p>The Bayesian estimation of the distribution described in my previous post is motivated by the concern that we may not be able to estimate the distribution well from a small number of items, as is the case for an input feed where posts are relatively infrequent.  Using a Bayesian estimate of <img src='http://s.wordpress.com/latex.php?latex=P%28X%3Cx%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(X&lt;x)' title='P(X&lt;x)' class='latex' /> allows us to reduce the bumpiness of the estimates compared to maximum likelihood estimates.</p>
<p>For example, consider coin flips.  If we observe two heads in two flips, the maximum likelihood estimate of the probability of heads is 1, which suggests that the coin is unfair.  In practice, most people hold a prior belief that a coin is unlikely to be unfair.  It would take many more coin flips all coming up heads for us to overcome that prior belief.  A Bayesian estimate formally models a similar estimation process.  When we have few observations, our estimates will be close to those specified by our prior beliefs.  As we get more data, the estimates from the observations become more reliable and we place less emphasis on our prior beliefs.</p>
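<p>To make the coin example concrete, here is a small sketch comparing the maximum likelihood estimate with a maximum a posteriori estimate.  The Beta(10, 10) prior is an assumption chosen for illustration; it encodes a fairly strong belief that the coin is fair.</p>
<pre>
```python
def mle_heads(heads, flips):
    """Maximum likelihood estimate of the probability of heads."""
    return heads / flips

def map_heads(heads, flips, a, b):
    """Posterior mode under a Beta(a, b) prior on the probability of heads (a, b > 1)."""
    return (heads + a - 1) / (flips + a + b - 2)

# Hypothetical prior: Beta(10, 10), centered on a fair coin.
PRIOR_A = PRIOR_B = 10.0
for heads, flips in ((2, 2), (20, 20), (200, 200)):
    print(flips, mle_heads(heads, flips), round(map_heads(heads, flips, PRIOR_A, PRIOR_B), 3))
```
</pre>
<p>With only two flips the MAP estimate stays close to the prior, and it moves toward the observed frequency as the run of heads gets longer.</p>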
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/07/why-we-model-comment-counts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Modeling blog post comment counts</title>
		<link>http://livewebir.com/blog/2008/07/modeling-blog-post-comment-counts/</link>
		<comments>http://livewebir.com/blog/2008/07/modeling-blog-post-comment-counts/#comments</comments>
		<pubDate>Tue, 01 Jul 2008 20:51:46 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[comment counts]]></category>
		<category><![CDATA[expectation maximization]]></category>
		<category><![CDATA[maximum a posteriori]]></category>
		<category><![CDATA[negative binomial distribution]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=8</guid>
		<description><![CDATA[As part of my work for FeedHub, I found the need to model the distribution of comment counts for blog posts in RSS feeds.  In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1.  It turns out that the  negative binomial distribution is a [...]]]></description>
			<content:encoded><![CDATA[<p>As part of my work for <a title="FeedHub" href="http://www.feedhub.com">FeedHub</a>, I found the need to model the distribution of comment counts for blog posts in RSS feeds.  In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1.  It turns out that the <a title="negative binomial distribution" href="http://en.wikipedia.org/wiki/Negative_binomial_distribution">negative binomial distribution</a> is a good fit for comment counts.  The negative binomial distribution is a discrete distribution with a probability mass function of</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=f%28x%3Bp%2Cr%29%20%3D%20%5Cfrac%7B%5CGamma%28x%20%2B%20r%29%7D%7B%5CGamma%28r%29%20x%21%7D%20p%5Er%20%281-p%29%5Ex&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x;p,r) = \frac{\Gamma(x + r)}{\Gamma(r) x!} p^r (1-p)^x' title='f(x;p,r) = \frac{\Gamma(x + r)}{\Gamma(r) x!} p^r (1-p)^x' class='latex' /></p>
<p style="text-align:left;">where <img src='http://s.wordpress.com/latex.php?latex=r%20%3E%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r &gt; 0' title='r &gt; 0' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=0%20%5Cleq%20p%20%5Cleq%201.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0 \leq p \leq 1.' title='0 \leq p \leq 1.' class='latex' /></p>
<p style="text-align:left;">However, in many cases we have small sample sizes and I felt the natural urge to specify prior distributions over the negative binomial&#8217;s parameters. The <a title="beta distribution" href="http://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a> is conjugate for <img src='http://s.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> when <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> is known, but for our data <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> is not fixed.</p>
<p style="text-align:left;">This left me a little stuck.  What I really want is a closed-form conjugate prior for the case where both parameters of the negative binomial distribution are unknown, and one that is efficient to compute.  What I found was <a title="Conjugate Bayesian Analysis of the Negative Binomial Distribution" href="http://www.soa.org/library/research/actuarial-research-clearing-house/1990-99/1993/arch-1/arch93v112.pdf">Morgan and Hickman</a>, which requires a large sample to be accurate (kind of defeats the point).  I also found <a title="Bayesian Inference for the Negative Binomial Distribution via Polynomial Expansions" href="http://www.ingentaconnect.com/content/asa/jcgs/2002/00000011/00000001/art00009">Bradlow, Hardie, and Fader</a>, which has a &#8220;closed-form&#8221; solution that requires a 300-term expansion to accurately estimate the posterior.  While noticeably faster than <a title="Markov chain Monte Carlo" href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo</a> methods, it is still more heavyweight than I want.</p>
<p style="text-align:left;">I felt defeated until I realized that if I accept inference using <a title="maximum a posteriori" href="http://en.wikipedia.org/wiki/Maximum_a_posteriori">maximum a posteriori</a> estimates of <img src='http://s.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> as a good enough alternative to full-blown Bayesian inference, things become much simpler.  I need only periodically recompute <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{p}' title='\hat{p}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{r}' title='\hat{r}' class='latex' /> and perform inference directly on the negative binomial distribution using those estimates.  I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use <a title="expectation-maximization" href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm">expectation-maximization</a> to estimate <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{p}' title='\hat{p}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7Br%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{r}' title='\hat{r}' class='latex' />.</p>
<p style="text-align:left;">This task is quite easy when estimating the parameters of a negative binomial.  Given observations <img src='http://s.wordpress.com/latex.php?latex=x%5En%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x^n,' title='x^n,' class='latex' /> estimates <img src='http://s.wordpress.com/latex.php?latex=p%5E%7B%5Bt%5D%7D%2C%20r%5E%7B%5Bt%5D%7D%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{[t]}, r^{[t]},' title='p^{[t]}, r^{[t]},' class='latex' /> and priors over <img src='http://s.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=r%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r,' title='r,' class='latex' /> we compute new maximum a posteriori estimates <img src='http://s.wordpress.com/latex.php?latex=p%5E%7B%5Bt%2B1%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{[t+1]}' title='p^{[t+1]}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=r%5E%7B%5Bt%2B1%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r^{[t+1]}' title='r^{[t+1]}' class='latex' />.  Wash, rinse, repeat until convergence.  This iterative procedure means that when estimating <img src='http://s.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />, we can assume a constant <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> (and vice versa).</p>
<p style="text-align:left;">To estimate <img src='http://s.wordpress.com/latex.php?latex=p%7Cx%5En&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p|x^n' title='p|x^n' class='latex' />, write</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=f%28p%7Cx%5En%29%20%5Cpropto%20f%28p%29L_n%28r%2Cp%29%5C%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(p|x^n) \propto f(p)L_n(r,p)\,' title='f(p|x^n) \propto f(p)L_n(r,p)\,' class='latex' /></p>
<p style="text-align:left;">where</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=%5Cbegin%7Barray%7D%7Brl%7D%20L_n%28r%2Cp%29%20%26%20%3D%20%5Cprod_%7Bi%3D1%7D%5En%20f%28x_i%3Bp%2Cr%29%20%5C%5C%20%5C%5C%20%26%20%3D%20p%5E%7Brn%7D%281-p%29%5E%7B%5Csum_%7Bi%3D1%7D%5En%20x_i%7D%20%5Cprod_%7Bi%3D1%7D%5En%20%5Cfrac%7B%5CGamma%28x_i%20%2B%20r%29%7D%7B%5CGamma%28r%29x_i%21%7D%20.%5Cend%7Barray%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\begin{array}{rl} L_n(r,p) &amp; = \prod_{i=1}^n f(x_i;p,r) \\ \\ &amp; = p^{rn}(1-p)^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + r)}{\Gamma(r)x_i!} .\end{array}' title='\begin{array}{rl} L_n(r,p) &amp; = \prod_{i=1}^n f(x_i;p,r) \\ \\ &amp; = p^{rn}(1-p)^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + r)}{\Gamma(r)x_i!} .\end{array}' class='latex' /></p>
<p style="text-align:left;">If <img src='http://s.wordpress.com/latex.php?latex=p%20%5Csim%20Beta%28%5Calpha%2C%20%5Cbeta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p \sim Beta(\alpha, \beta)' title='p \sim Beta(\alpha, \beta)' class='latex' /> then</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=%5Cbegin%7Barray%7D%7Brl%7D%20f%28p%7Cx%5En%29%20%26%20%5Cpropto%20p%5E%7B%5Calpha-1%7D%20%281-p%29%5E%7B%5Cbeta-1%7D%20L_n%28r%2Cp%29%20%5C%5C%20%5C%5C%20%26%20%5Cpropto%20p%5E%7B%5Calpha%2Brn%20-%201%7D%281-p%29%5E%7B%5Cbeta%20%2B%20%5Csum_%7Bi%3D1%7D%5En%20x_i%20-%201%7D%2C%5Cend%7Barray%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\begin{array}{rl} f(p|x^n) &amp; \propto p^{\alpha-1} (1-p)^{\beta-1} L_n(r,p) \\ \\ &amp; \propto p^{\alpha+rn - 1}(1-p)^{\beta + \sum_{i=1}^n x_i - 1},\end{array}' title='\begin{array}{rl} f(p|x^n) &amp; \propto p^{\alpha-1} (1-p)^{\beta-1} L_n(r,p) \\ \\ &amp; \propto p^{\alpha+rn - 1}(1-p)^{\beta + \sum_{i=1}^n x_i - 1},\end{array}' class='latex' /></p>
<p style="text-align:left;">which indicates <img src='http://s.wordpress.com/latex.php?latex=p%7Cx%5En%20%5Csim%20Beta%28%5Calpha%20%2B%20rn%2C%20%5Cbeta%20%2B%20%5Csum_%7Bi%3D1%7D%5En%20x_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p|x^n \sim Beta(\alpha + rn, \beta + \sum_{i=1}^n x_i)' title='p|x^n \sim Beta(\alpha + rn, \beta + \sum_{i=1}^n x_i)' class='latex' />.  To estimate <img src='http://s.wordpress.com/latex.php?latex=p%5E%7B%5Bt%2B1%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{[t+1]}' title='p^{[t+1]}' class='latex' />, we use the posterior mode of <img src='http://s.wordpress.com/latex.php?latex=p%7Cx%5En&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p|x^n' title='p|x^n' class='latex' />:</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=p%5E%7B%5Bt%2B1%5D%7D%20%3D%20%5Cfrac%7B%5Calpha%20%2B%20r%5E%7B%5Bt%5D%7Dn%20-%201%7D%7B%5Calpha%20%2B%20%5Cbeta%20%2B%20r%5E%7B%5Bt%5D%7Dn%20%2B%20%5Csum_%7Bi%3D1%7D%5En%20x_i%20-%202%7D.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.' title='p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.' class='latex' /></p>
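<p>The update for p is just the mode of the Beta posterior, so it is a one-liner in code.  A minimal sketch follows; the function name and the example prior values are illustrative assumptions, not taken from FeedHub.</p>
<pre>
```python
def update_p(xs, r, alpha, beta):
    """Posterior mode of p given counts xs, the current estimate of r, and a Beta(alpha, beta) prior."""
    n = len(xs)
    numerator = alpha + r * n - 1
    denominator = alpha + beta + r * n + sum(xs) - 2
    return numerator / denominator

# Hypothetical comment counts and prior parameters.
comment_counts = [0, 1, 2, 3]
print(update_p(comment_counts, r=2.0, alpha=2.0, beta=2.0))
```
</pre>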
<p style="text-align:left;">That was the easy part.  Now for the harder part.</p>
<p style="text-align:left;">For my purposes, the beta prime distribution is a good fit for <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' />.  The <a title="beta prime distribution" href="http://en.wikipedia.org/wiki/Beta_prime_distribution">beta prime distribution</a> has a similar shape to the more familiar gamma distribution.  If <img src='http://s.wordpress.com/latex.php?latex=X%20%5Csim%20Beta%28a%2C%20b%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X \sim Beta(a, b)' title='X \sim Beta(a, b)' class='latex' /> then <img src='http://s.wordpress.com/latex.php?latex=Y%20%3D%20%5Cfrac%7BX%7D%7B1%20-%20X%7D%20%5Csim%20Beta%5E%5Cprime%28a%2C%20b%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y = \frac{X}{1 - X} \sim Beta^\prime(a, b)' title='Y = \frac{X}{1 - X} \sim Beta^\prime(a, b)' class='latex' />.  For the beta prime distribution,</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=f%28r%3Ba%2Cb%29%20%3D%20%5Cfrac%7B%5CGamma%28a%20%2B%20b%29%7D%7B%5CGamma%28a%29%5CGamma%28b%29%7D%20r%5Ea%20%281%2Br%29%5E%7B-a-b%7D%20.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(r;a,b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} r^a (1+r)^{-a-b} .' title='f(r;a,b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} r^a (1+r)^{-a-b} .' class='latex' /></p>
<p style="text-align:left;">Our posterior <img src='http://s.wordpress.com/latex.php?latex=r%20%7Cx%5En&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r |x^n' title='r |x^n' class='latex' /> is distributed</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=f%28r%7Cx%5En%29%3Df%28r%3Ba%2Cb%29L_n%28r%2Cp%29%5Cleft%2F%5Cint%20f%28r%3Ba%2Cb%29L_n%28r%2Cp%29dr%20%5Cright.%20%2C%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(r|x^n)=f(r;a,b)L_n(r,p)\left/\int f(r;a,b)L_n(r,p)dr \right. , ' title='f(r|x^n)=f(r;a,b)L_n(r,p)\left/\int f(r;a,b)L_n(r,p)dr \right. , ' class='latex' /></p>
<p style="text-align:left;">where</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=f%28r%3Ba%2Cb%29L_n%28r%2Cp%29%5Cpropto%20r%5Ea%281%2Br%29%5E%7B-a-b%7Dp%5E%7Brn%7D%5Cprod_%7Bi%3D1%7D%5En%5CGamma%28r%2Bx_i%29%2F%5CGamma%28r%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(r;a,b)L_n(r,p)\propto r^a(1+r)^{-a-b}p^{rn}\prod_{i=1}^n\Gamma(r+x_i)/\Gamma(r)' title='f(r;a,b)L_n(r,p)\propto r^a(1+r)^{-a-b}p^{rn}\prod_{i=1}^n\Gamma(r+x_i)/\Gamma(r)' class='latex' /></p>
<p style="text-align:left;">Following Bradlow, Hardie, and Fader, we note that the ratio of the two Gamma functions can be computed exactly:</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=%5Cprod_%7Bi%3D1%7D%5En%20%5CGamma%28r%2Bx_i%29%20%5Cleft%2F%5CGamma%28r%29%5Cright.%3D%5Cprod_%7Bi%3D1%7D%5E%7Bx%5E%2A%7D%28r%2Bi-1%29%5E%7Bs_i%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\prod_{i=1}^n \Gamma(r+x_i) \left/\Gamma(r)\right.=\prod_{i=1}^{x^*}(r+i-1)^{s_i}' title='\prod_{i=1}^n \Gamma(r+x_i) \left/\Gamma(r)\right.=\prod_{i=1}^{x^*}(r+i-1)^{s_i}' class='latex' /></p>
<p style="text-align:left;">where <img src='http://s.wordpress.com/latex.php?latex=x%5E%2A%20%3D%20%5Cmax%28x%5En%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x^* = \max(x^n)' title='x^* = \max(x^n)' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=s_i%3D%5Csum_%7Bj%3Di%7D%5E%7Bx%5E%2A%7Dn_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='s_i=\sum_{j=i}^{x^*}n_j' title='s_i=\sum_{j=i}^{x^*}n_j' class='latex' />, and <img src='http://s.wordpress.com/latex.php?latex=n_j%3D%7C%5C%7Bx_i%20%5Cin%20x%5En%3Ax_i%3Dj%5C%7D%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_j=|\{x_i \in x^n:x_i=j\}|' title='n_j=|\{x_i \in x^n:x_i=j\}|' class='latex' /> is the number of observations equal to <img src='http://s.wordpress.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.</p>
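<p>The identity is easy to check numerically.  The sketch below builds the suffix counts s_i and compares the log of the product form against a direct evaluation with log-gamma; the helper names and the sample data are made up for illustration.</p>
<pre>
```python
import math
from collections import Counter

def suffix_counts(xs):
    """Return a list s where s[i] is the number of observations with value at least i (i >= 1)."""
    counts = Counter(xs)
    x_max = max(xs)
    s = [0] * (x_max + 2)
    for j in range(x_max, 0, -1):
        s[j] = s[j + 1] + counts[j]
    return s

def log_ratio_via_product(xs, r):
    """log of prod_{i=1}^{x*} (r + i - 1)^{s_i}."""
    s = suffix_counts(xs)
    return sum(s[i] * math.log(r + i - 1) for i in range(1, max(xs) + 1))

def log_ratio_via_lgamma(xs, r):
    """log of prod_i Gamma(r + x_i) / Gamma(r), evaluated directly."""
    return sum(math.lgamma(r + x) - math.lgamma(r) for x in xs)

sample = [0, 1, 1, 3, 5]  # hypothetical comment counts
print(log_ratio_via_product(sample, 1.7), log_ratio_via_lgamma(sample, 1.7))
```
</pre>
<p>The two values should agree up to floating point error.  The product form only needs one term per distinct count value up to the maximum, no matter how many items the feed has.</p>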
<p style="text-align:left;">While <img src='http://s.wordpress.com/latex.php?latex=f%28r%3Ba%2Cb%29L_n%28r%2Cp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(r;a,b)L_n(r,p)' title='f(r;a,b)L_n(r,p)' class='latex' /> can be computed exactly, it is not easy to integrate analytically.  Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates.  So I can rely on black-box estimation functions in <a title="R" href="http://www.r-project.org">R</a> when testing or the <a title="Apache Commons Math" href="http://commons.apache.org/math/">Apache Commons Math</a> library for use in our system.  The resulting recipe for estimating <img src='http://s.wordpress.com/latex.php?latex=r%5E%7B%5Bt%2B1%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r^{[t+1]}' title='r^{[t+1]}' class='latex' /> is:</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=%5Cbegin%7Barray%7D%7Brl%7D%20r%5E%7B%5Bt%2B1%5D%7D%20%26%20%3D%20%5Carg%5Cmax_r%20f%28r%3Ba%2Cb%29L_n%28r%2Cp%5E%7B%5Bt%5D%7D%29%20%5C%5C%20%5C%5C%20%26%20%3D%20%5Carg%5Cmax_r%20%20r%5Ea%281%2Br%29%5E%7B-a-b%7Dp%5E%7Brn%7D%20%5Cprod_%7Bi%3D1%7D%5E%7Bx%5E%2A%7D%28r%2Bi-1%29%5E%7Bs_i%7D%20.%20%5Cend%7Barray%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\begin{array}{rl} r^{[t+1]} &amp; = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ &amp; = \arg\max_r  r^a(1+r)^{-a-b}p^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}' title='\begin{array}{rl} r^{[t+1]} &amp; = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ &amp; = \arg\max_r  r^a(1+r)^{-a-b}p^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}' class='latex' /></p>
<p>In practice, what I actually maximize is the log of the above quantity, which reduces the chance of overflow or underflow during computation.  While <img src='http://s.wordpress.com/latex.php?latex=f%28r%3Ba%2Cb%29L_n%28r%2Cp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(r;a,b)L_n(r,p)' title='f(r;a,b)L_n(r,p)' class='latex' /> is not easily integrable, it is differentiable (as is its derivative).  One could define first and second derivatives and find the maximum value using Newton-Raphson, but the derivatives require recurrence relations to state succinctly (and I couldn&#8217;t be bothered).</p>
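<p>For readers without R or Apache Commons Math at hand, here is a sketch of the update for r that maximizes the log objective with a plain golden-section search.  This is a stand-in for the black-box optimizers mentioned above, not the routine used in FeedHub.  It assumes the objective has a single peak inside the search bracket, and the bracket endpoints and sample values are arbitrary choices for illustration.</p>
<pre>
```python
import math

def log_objective_r(r, xs, p, a, b):
    """Log of r^a (1+r)^(-a-b) p^(rn) prod_i Gamma(r+x_i)/Gamma(r), following the post as written."""
    n = len(xs)
    log_prior = a * math.log(r) - (a + b) * math.log1p(r)
    # Equivalent to sum_i s_i * log(r + i - 1); lgamma is used here for brevity.
    log_gamma_ratio = sum(math.lgamma(r + x) - math.lgamma(r) for x in xs)
    return log_prior + r * n * math.log(p) + log_gamma_ratio

def golden_section_max(f, lo, hi, tol=1e-7):
    """Locate the maximizer of a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            # The maximum lies in [lo, d]; reuse c as the new upper probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # The maximum lies in [c, hi]; reuse d as the new lower probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2

def update_r(xs, p, a, b, lo=1e-6, hi=1e3):
    """One MAP update of r for a fixed p; lo and hi are an assumed search bracket."""
    return golden_section_max(lambda r: log_objective_r(r, xs, p, a, b), lo, hi)

print(update_r([0, 1, 2, 3, 5, 8, 2, 1], p=0.3, a=2.0, b=2.0))
```
</pre>
<p>Golden-section search needs only function evaluations, so it sidesteps the derivatives entirely, at the cost of slower convergence than Newton-Raphson.</p>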
<p>All that remains is the initial choice of <img src='http://s.wordpress.com/latex.php?latex=p%5E%7B%5B0%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{[0]}' title='p^{[0]}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=r%5E%7B%5B0%5D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r^{[0]}' title='r^{[0]}' class='latex' />.  I chose as starting points the method of moments estimator for these parameters (<a title="Efficient Estimation of Parameters in the Negative Binomial Distribution" href="http://dx.doi.org/10.1080/03610920500501346">Savani and Zhigljavsky</a>):</p>
<p style="text-align:center;"><img src='http://s.wordpress.com/latex.php?latex=%5Cbegin%7Barray%7D%7Brl%7D%20r%5E%7B%5B0%5D%7D%20%26%20%3D%20%5Cbar%7Bx%7D%5E2%2F%28v%20-%20%5Cbar%7Bx%7D%29%20%5C%5C%20%5C%5C%20p%5E%7B%5B0%5D%7D%20%26%20%3D%20r%5E%7B%5B0%5D%7D%20%2F%20%28r%5E%7B%5B0%5D%7D%20%2B%20%5Cbar%7Bx%7D%29%20%20%5Cend%7Barray%7D%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\begin{array}{rl} r^{[0]} &amp; = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} &amp; = r^{[0]} / (r^{[0]} + \bar{x})  \end{array} ' title='\begin{array}{rl} r^{[0]} &amp; = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} &amp; = r^{[0]} / (r^{[0]} + \bar{x})  \end{array} ' class='latex' /></p>
<p>where <img src='http://s.wordpress.com/latex.php?latex=%5Cbar%7Bx%7D%20%3D%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En%20x_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i' title='\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=v%20%3D%20%5Cfrac%7B1%7D%7Bn%7D%5Cleft%28%5Csum_%7Bi%3D1%7D%5En%20x_i%5E2%5Cright%29%20-%20%5Cbar%7Bx%7D%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='v = \frac{1}{n}\left(\sum_{i=1}^n x_i^2\right) - \bar{x}^2' title='v = \frac{1}{n}\left(\sum_{i=1}^n x_i^2\right) - \bar{x}^2' class='latex' />.  When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.</p>
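<p>A sketch of these starting values in code is below.  Returning None when the sample variance does not exceed the mean is my own convention for signaling the not-well-defined case; the caller would then fall back to the prior means as described.</p>
<pre>
```python
def method_of_moments(xs):
    """Return starting values (r0, p0), or None when the estimator is not well-defined."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum(x * x for x in xs) / n - mean ** 2
    if variance <= mean:
        # The negative binomial needs variance above the mean; otherwise r0 is negative or infinite.
        return None
    r0 = mean ** 2 / (variance - mean)
    p0 = r0 / (r0 + mean)
    return r0, p0

print(method_of_moments([0, 0, 1, 2, 7]))  # hypothetical comment counts
print(method_of_moments([1, 1, 1, 1]))     # no overdispersion, so no estimate
```
</pre>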
<p>Here&#8217;s a snapshot of a few feeds.  The red line indicates the method of moments estimator and the blue line shows the maximum a posteriori estimates. Yes, I know that the negative binomial is a discrete distribution and plotting it as a line is misleading, but I wanted to look at a large number of feeds at a time and using lines to plot the density is easier to see.</p>
<p style="text-align:center;"><a href="http://pogil.files.wordpress.com/2008/07/comment-counts.png"><img class="size-full wp-image-12 aligncenter" src="http://pogil.files.wordpress.com/2008/07/comment-counts.png" alt="Distribution of comment counts" width="460" height="447" /></a></p>
<h3 style="text-align: left;">Update:</h3>
<p style="text-align: left;"><a href="http://livewebir.com/blog/2008/07/modeling-blog-post-comment-counts/#comment-6">Michelle asked a couple of questions</a> about the use of the beta prime distribution as a prior for the <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> parameter of the negative binomial, so I figured I&#8217;d update this post with a little more detail about this choice.</p>
<p style="text-align: left;">When choosing a distribution to model the prior for <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' />, I started by looking at the histogram of <img src='http://s.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> values estimated using the method of moments estimator described above.  I first considered using the gamma distribution, but it didn&#8217;t turn out to be a great fit.  The red line on the histogram below shows the method of moments estimator for the gamma distribution proposed by <a title="A CLASS OF METHOD OF MOMENTS ESTIMATORS FOR THE TWO-PARAMETER GAMMA FAMILY" href="http://www.stat.ualberta.ca/~wiens/pubs/gamma.pdf">Wiens et al (Pak. J. Statist. 2003 Vol.19(1) pp129-141)</a> with <img src='http://s.wordpress.com/latex.php?latex=k%3D0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k=0' title='k=0' class='latex' />.  The blue line shows the much better fit given by the beta prime distribution.</p>
<p style="text-align: center;"><a href="http://livewebir.com/blog/wp-content/uploads/2008/11/negbinom_r_parameter.png"><img class="size-full wp-image-39 aligncenter" title="Distribution of r parameter for negative binomial " src="http://livewebir.com/blog/wp-content/uploads/2008/11/negbinom_r_parameter.png" alt="Distribution of r parameter for negative binomial " width="375" height="375" /></a></p>
<p>To fit the beta prime parameters, I fit a beta distribution to <img src='http://s.wordpress.com/latex.php?latex=r%20%2F%20%28r%20%2B%201%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r / (r + 1)' title='r / (r + 1)' class='latex' />.  Here&#8217;s a plot showing the histogram for <img src='http://s.wordpress.com/latex.php?latex=r%20%2F%20%28r%20%2B%201%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r / (r + 1)' title='r / (r + 1)' class='latex' /> and the method of moments estimator for the beta distribution:</p>
<p style="text-align: center;"><a href="http://livewebir.com/blog/wp-content/uploads/2008/11/beta_transformation.png"><img class="size-full wp-image-40 aligncenter" title="beta_transformation" src="http://livewebir.com/blog/wp-content/uploads/2008/11/beta_transformation.png" alt="The beta distribution fit for r / (1 + r)" width="375" height="375" /></a></p>
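<p>The fitting step can be sketched as follows: map each r to r / (r + 1), then apply the textbook method of moments estimator for the beta distribution.  The function names and sample values are illustrative, and I am assuming the standard moment-matching formulas here.</p>
<pre>
```python
def fit_beta_method_of_moments(ys):
    """Match the mean and variance of values in (0, 1) to a Beta(a, b) distribution."""
    n = len(ys)
    mean = sum(ys) / n
    variance = sum((y - mean) ** 2 for y in ys) / n
    # Requires a non-constant sample so that the variance is positive.
    common = mean * (1 - mean) / variance - 1
    return mean * common, (1 - mean) * common

def fit_beta_prime(rs):
    """Fit beta prime parameters to positive values rs via the transform r / (r + 1)."""
    return fit_beta_method_of_moments([r / (r + 1) for r in rs])

# Hypothetical per-feed estimates of r.
print(fit_beta_prime([0.5, 1.0, 2.0, 4.0]))
```
</pre>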
<p style="text-align:left;">I hope this makes my choice of the beta prime distribution and its estimation a little clearer.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/07/modeling-blog-post-comment-counts/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Useful evaluations</title>
		<link>http://livewebir.com/blog/2008/06/useful-evaluations/</link>
		<comments>http://livewebir.com/blog/2008/06/useful-evaluations/#comments</comments>
		<pubDate>Mon, 23 Jun 2008 12:51:02 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[evaluation]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=7</guid>
		<description><![CDATA[I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn&#8217;t able to dig up the exact quote, but it was something along the lines of &#8220;statistical significance is not enough; you must also have practical significance.&#8221; To measure [...]]]></description>
			<content:encoded><![CDATA[<p>I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn&#8217;t able to dig up the exact quote, but it was something along the lines of &#8220;statistical significance is not enough; you must also have practical significance.&#8221;</p>
<p>To measure a practical difference between systems, we must:</p>
<ol>
<li>have an evaluation measure that correlates with user satisfaction,</li>
<li>understand the difference needed under the evaluation measure for a user to notice that one system is preferable to another, and</li>
<li>have confidence (statistically) that the difference between the two systems is larger than that minimum noticeable difference.</li>
</ol>
<p>The third challenge is the most straightforward, as we can rely on statistics.  The first two, however, are far more challenging.  It is often the case that traditional system evaluation measures such as mean average precision do not correlate well with human satisfaction.  Other measures focusing on early precision may correlate better with human experience, but there is much higher variability when focusing only on the top-ranked documents.  This means we need a much larger number of queries to establish statistical significance.</p>
<p>There has been a lot of recent research in the Information Retrieval community on evaluation.  Much of this work has focused on statistical techniques and on efficiently building test collections.  I think these works are great to have, but I wish more of the research addressed the first two questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/06/useful-evaluations/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Multiple significance testing</title>
		<link>http://livewebir.com/blog/2008/06/multiple-significance-testing/</link>
		<comments>http://livewebir.com/blog/2008/06/multiple-significance-testing/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 18:47:16 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[false discovery rate]]></category>
		<category><![CDATA[multiple testing]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=6</guid>
		<description><![CDATA[Panos Ipeirotis recently posted some interesting thoughts about statistical significance tests for comparing systems when incrementally developing techniques to solve a problem.  While I don&#8217;t have the answer to his question, I did notice that he made note of the Bonferroni method for correcting for multiple testing.  The need for multiple test correction arises when [...]]]></description>
			<content:encoded><![CDATA[<p>Panos Ipeirotis recently posted some interesting thoughts about <a title="Statistical Significance of Sequential Comparisons" href="http://behind-the-enemy-lines.blogspot.com/2008/06/statistical-significance-of-sequential.html">statistical significance tests</a> for comparing systems when incrementally developing techniques to solve a problem.  While I don&#8217;t have the answer to his question, I did notice that he made note of the Bonferroni method for correcting for multiple testing.  The need for multiple test correction arises when you make more than one comparison; the more comparisons you make, the more likely it is that an uncorrected test will reject some null hypothesis purely by chance.</p>
<p>One limitation of the Bonferroni method is its conservativeness.  If you would normally reject the null hypothesis when <img src='http://s.wordpress.com/latex.php?latex=p%20%3C%20%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p &lt; \alpha' title='p &lt; \alpha' class='latex' />, then with the Bonferroni correction you would reject only when <img src='http://s.wordpress.com/latex.php?latex=p%20%3C%20%5Calpha%2Fm&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p &lt; \alpha/m' title='p &lt; \alpha/m' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> is the number of tests.  This can make it very difficult to reject any null hypothesis when <img src='http://s.wordpress.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> is large.  The conservativeness arises because the Bonferroni method controls the probability of falsely rejecting <em>any</em> null hypothesis; that is, it controls the family-wise error rate.</p>
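<p>As a rough illustration of that threshold, here is a minimal Python sketch.  The function name <code>bonferroni_reject</code> and the sample p-values are my own assumptions for the example, not something taken from a particular library or experiment.</p>

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one flag per p-value: True where p is below alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # the corrected per-test significance level
    return [threshold > p for p in p_values]


# Hypothetical p-values from four tests; the corrected threshold is 0.05 / 4 = 0.0125.
flags = bonferroni_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

<p>With only four tests, two of the four results already fail the corrected threshold even though all four are below 0.05 on their own.</p>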
<p>There&#8217;s a newer technique that addresses the conservative nature of the Bonferroni method.  Instead of controlling the probability of falsely rejecting any null hypothesis, the idea is to control the <a title="false discovery rate" href="http://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate</a>.  The false discovery rate is the expected proportion of rejected null hypotheses that were rejected falsely.  By controlling the false discovery rate, we accept that, among the null hypotheses we reject, the expected proportion rejected in error is <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' /> or less.</p>
<p>The <a title="Controlling the False Discovery Rate" href="http://www.math.tau.ac.il/~ybenja/MyPapers/benjamini_hochberg1995.pdf">Benjamini-Hochberg</a> method is a simple approach for controlling the false discovery rate.  Given the p-values <img src='http://s.wordpress.com/latex.php?latex=P_1%2C%20P_2%2C%20%5Cdots%20P_m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P_1, P_2, \dots P_m' title='P_1, P_2, \dots P_m' class='latex' /> resulting from a set of statistical significance tests:</p>
<ol>
<li>Let <img src='http://s.wordpress.com/latex.php?latex=P_%7B%281%29%7D%20%5Cleq%20P_%7B%282%29%7D%20%5Cleq%20%5Cdots%20%5Cleq%20P_%7B%28m%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(m)}' title='P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(m)}' class='latex' /> be the p-values sorted in increasing order.</li>
<li>Define <img src='http://s.wordpress.com/latex.php?latex=l_i%20%3D%20%5Cfrac%7Bi%20%5Calpha%7D%7BC_m%20m%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_i = \frac{i \alpha}{C_m m}' title='l_i = \frac{i \alpha}{C_m m}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=k%20%3D%20%5Cmax%5C%7Bi%20%3A%20P_%7B%28i%29%7D%20%3C%20l_i%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = \max\{i : P_{(i)} &lt; l_i\}' title='k = \max\{i : P_{(i)} &lt; l_i\}' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=C_m%20%3D%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='C_m = 1' title='C_m = 1' class='latex' /> if the p-values are independent and <img src='http://s.wordpress.com/latex.php?latex=C_m%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%201%2Fi&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='C_m = \sum_{i=1}^{m} 1/i' title='C_m = \sum_{i=1}^{m} 1/i' class='latex' /> otherwise.</li>
<li>Reject all null hypotheses where <img src='http://s.wordpress.com/latex.php?latex=P_i%20%5Cleq%20P_%7B%28k%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P_i \leq P_{(k)}' title='P_i \leq P_{(k)}' class='latex' />.</li>
</ol>
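<p>The three steps above can be sketched in Python roughly as follows.  This is a minimal sketch rather than a reference implementation: the function name <code>benjamini_hochberg</code>, the <code>independent</code> flag, and the sample p-values are assumptions made for the example.  The strict inequality follows step 2 as written above.</p>

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05,
                       independent: bool = True) -> List[bool]:
    """Return one flag per p-value: True where the null hypothesis is rejected."""
    m = len(p_values)
    if m == 0:
        return []
    # C_m = 1 for independent p-values, otherwise the harmonic sum 1/1 + ... + 1/m.
    c_m = 1.0 if independent else sum(1.0 / i for i in range(1, m + 1))
    # Step 1: sort the p-values in increasing order, P_(1) ... P_(m).
    ordered = sorted(p_values)
    # Step 2: k is the largest rank i whose p-value is below l_i = i * alpha / (C_m * m).
    k = 0
    for i, p in enumerate(ordered, start=1):
        if i * alpha / (c_m * m) > p:
            k = i
    if k == 0:
        return [False] * m  # no rank satisfied the condition, so nothing is rejected
    # Step 3: reject every hypothesis whose p-value is at most P_(k).
    cutoff = ordered[k - 1]
    return [cutoff >= p for p in p_values]


# Hypothetical p-values from four tests.
independent_flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.2], alpha=0.05)
dependent_flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.2], alpha=0.05,
                                     independent=False)
```

<p>In the independent case the p-value 0.03 is rejected even though it is above its own limit of 0.025, because the larger p-value 0.035 falls below its limit of 0.0375 and step 3 rejects everything up to that point.  With the harmonic-sum correction only the smallest p-value is rejected.</p>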
<p>In my <a title="Investigating the Exhaustivity Dimension in Content-Oriented XML Element Retrieval Evaluation" href="http://www.cs.cmu.edu/~pto/papers/CIKM_2006_INEX_EXH.pdf">CIKM paper</a> from 2006 with <a title="Home Page of Mounia Lalmas" href="http://www.dcs.qmul.ac.uk/~mounia/">Mounia Lalmas</a>, we performed extensive pairwise system comparisons to understand some aspects of evaluation measures in XML element retrieval.  We did look into controlling the family-wise error rate through the Bonferroni method, but because we were doing pairwise tests on roughly 40 different result lists per task, none of the system differences were identified as statistically significant.  Controlling the false discovery rate allowed us to identify differences, despite the large number of comparisons and relatively small sample sizes.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/06/multiple-significance-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
