Modeling blog post comment counts

As part of my work for FeedHub, I found the need to model the distribution of comment counts for blog posts in RSS feeds.  In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1.  It turns out that the  negative binomial distribution is a good fit for comment counts. The negative binomial distribution is a discrete distribution with a probability density function of

f(x;p,r) = \frac{\Gamma(x + r)}{\Gamma(r) x!} p^r (1-p)^x

where r > 0 and 0 \leq p \leq 1.

However, in many cases we have small sample sizes and I felt the natural urge to specify prior distributions over the negative binomial’s parameters. The beta distribution is conjugate for p when r is known, but for our data r is not fixed.

This left me a little stuck.  What I really want is a closed form conjugate prior when both parameters of the negative binomial distribution that is efficient to compute.  What I found was Morgan and Hickman, which requires a large sample to be accurate (kind of defeats the point).  I also found Bradlow, Hardie, and Fader, which has a “closed-form” solution which requires a 300-term expansion to accurately estimate the posterior.   While noticeably faster than Markov chain Monte Carlo methods, it is still more heavyweight than I want.

I felt defeated until I realized that if I accept inference using maximum a posteriori estimates of p and r as a good enough alternative to full-blown Bayesian inference, things become much simpler.  I need only periodically recompute \hat{p} and \hat{r} and perform inference directly on the negative binomial distribution using those estimates.  I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use expectation-maximization to estimate \hat{p} and \hat{r}.

This task is quite easy when estimating the parameters of a negative binomial.  Given observations x^n, estimates p^{[t]}, r^{[t]}, and priors over p and r, we estimate new maximum a posteriori estimates p^{[t+1]}, r^{[t+1]}  Wash, rinse, repeat until convergence.  This iterative procedure means that when estimating p, we can assume a constant r (and vice versa).

To estimate p|x^n, write

f(p|x^n) \propto f(p)L_n(r,p)\,

where

\begin{array}{rl} L_n(r,p) & = \prod_{i=1}^n f(x_i;p,n) \\ \\ & = p^{rn}(1-p)^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + r)}{\Gamma(r)x_i!} .\end{array}

If p \sim Beta(\alpha, \beta) then

\begin{array}{rl} f(p|x^n) & \propto p^{\alpha-1} (1-p)^{\beta-1} L_n(r,p) \\ \\ & \propto p^{\alpha+rn - 1}(1-p)^{\beta + \sum_{i=1}^n x_i - 1},\end{array}

which indicates p|x^n \sim Beta(\alpha + rn, \beta + \sum_{i=1}^n x_i).  To estimate p^{[t+1]}, we use the posterior mode of p|x^n:

p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.

That was the easy part.  Now for the harder part.

For my purposes, the beta prime distribution is a good fit for r.  The beta prime distribution has a similar shape to the more familiar gamma distribution.  If X \sim Beta(a, b) then Y = \frac{X}{1 - X} \sim Beta^\prime(a, b).  For the beta prime distribution,

f(r;a,b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} r^a (1+r)^{-a-b} .

Our posterior r |x^n is distributed

f(r|x^n)=f(r;a,b)L_n(r,p)\left/\int f(r;a,b)L_n(r,p)dr \right. ,

where

f(r;a,b)L_n(r,p)\propto r^a(1+r)^{-a-b}p^{rn}\prod_{i=1}^n\Gamma(r+x_i)/\Gamma(r)

Following Bradlow, Hardie, and Fader, we note that the ratio of the two Gamma functions can computed exactly:

\prod_{i=1}^n \Gamma(r+x_i) \left/\Gamma(r)\right.=\prod_{i=1}^{x^*}(r+i-1)^{s_i}

where x^* = \max(x^n), s_i=\sum_{j=i}^{x^*}n_j, and n_j=|\{x_i \in x^n:x_i=j\}| is the number of observations equal to j.

While f(r;a,b)L_n(r,p) can be computed exactly, it is not easy to integrate analytically.  Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates.  So I can rely on black-box estimation functions in R when testing or the Apache Commons Math library for use in our system.  The resulting recipe to estimate for r^{[t+1]}:

\begin{array}{rl} r^{[t+1]} & = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ & = \arg\max_r  r^a(1+r)^{-a-b}p^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}

In practice, what I actually maximize is the log of the above quantity, which reduces the chance of overflow or underflow during computation.  While f(r;a,b)L_n(r,p) is not easily integrable, it is differentiable (as is its derivative).  One could define first and second derivatives and find the maximum value using Newton-Raphson, but the derivatives require recurrence relations to state succinctly (and I couldn’t be bothered).

All that remains is the initial choice of p^{[0]} and r^{[0]}.  I chose as starting points the method of moments estimator for these parameters (Savani and Zhigljavsky):

\begin{array}{rl} r^{[0]} & = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} & = r^{[0]} / (r^{[0]} + \bar{x})  \end{array}

where \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and v = \frac{1}{n}\left(\sum_{i=1} x^2\right) - \bar{x}^2.  When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.

Here’s a snapshot of a few feeds.  The red line indicates the method of moments estimator and the blue line shows the maximum a posteriori estimates. Yes, I know that the negative binomial is a discrete distribution and plotting it as a line is misleading, but I wanted to look at a large number of feeds at a time and using lines to plot the density is easier to see.

Distribution of comment counts

Update:

Michelle asked a couple of questions about the use of the beta prime distribution as a prior for the r parameter of the negative binomial, so I figured I’d update this post with a little more detail about this choice.

When choosing a distribution to model the prior for r, I started by looking at the histogram of r values estimated using the method of moments estimator described above.  I first considered using the gamma distribution, but it didn’t turn out to be a great fit.  The red line on the histogram below shows the method of moments estimator for the gamma function proposed by Wiens et al (Pak. J. Statist. 2003 Vol.19(1) pp129-141) with k=0.  The blue line shows the much better fit given by the beta prime distribution.

Distribution of r parameter for negative binomial

To fit the beta prime parameters, I fit a beta distribution to r / (r + 1).  Here’s a plot showing the histogram for r / (r + 1) and the method of moments estimator for the beta distribution:

The beta distribution fit for r / (1 + r)

I hope this makes my choice of the beta prime distribution and its estimation a little more clear.

Useful evaluations

I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn’t able to dig up the exact quote, but it was something along the lines of “statistical significance is not enough; you must also have practical significance.”

To measure a practical difference between systems, we must:

  1. have an evaluation measure that correlates with user satisfaction,
  2. understand the difference needed under the evaluation measure for a user to notice that one system is preferable to another, and
  3. have confidence (statistically) that the difference between the two systems is larger than that minimum noticeable difference.

The third challenge is the most straightforward as we can rely on statistics.  The first two, however, are far more challenging.  It is often the case that traditional system evaluation measures such as mean-average precision do not correlate well with human satisfaction.  Other measures focusing on early precision may correlate better with human experience, but there is much higher variability when focusing only on the top ranked documents.  This means we need a much larger number of queries to measure statistical significance.

There has been a lot of recent research in the Information Retrieval community on evaluation.  Much of this work has focused on statistical techniques and efficiently building test collections.  I think these works are great to have, but I wish there were more working to answer the first two questions.

Multiple significance testing

Panos Ipeirotis recently posted some interesting thoughts about statistical significance tests for comparing systems when incrementally developing techniques to solve a problem.  While I don’t have the answer to his question, I did notice that he made note of the Bonferroni method for correcting for mutiple testing.  The need for multiple test correction arises when you need to make more than one comparison; the more comparisons you make, the more likely an uncorrected test will be rejected by chance.

One limitation of Bonferroni method is its conservativeness.  If you would normally reject the null hypothesis when p < \alpha, when using Bonferroni correction you would only reject when p < \alpha/m, where m is the numer of tests.  This can make it very difficult to reject any of the tests when m is large.  This arises because the Bonferroni method controls the probability of falsely rejecting any null hypothesis; it controls the family-wise error rate.

There’s a newer technique that addresses the conservative nature of the Bonferroni method.  Instead of controlling the probability of falsely rejecting any null hypothesis, the idea is to control the false discovery rate.  The false discovery rate is the expected proportion of false rejections of the null hypothesis.  By controlling the false discovery rate, we acknowledge that we are willing to accept that for each rejection of the null hypothesis, we expected that the probability it was rejected in error is \alpha or less.

The Benjamini-Hochberg method is a simple approach for controlling the false discovery rate.  Given P_1, P_2, \dots P_m p-values resulting from a statistical significance test:

  1. Let P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(m)} be the p-values sorted in increasing order.
  2. Define l_i = \frac{i \alpha}{C_m m} and k = \max\{i : P_{(i)} < l_i\} where C_m = 1 if the p-values are independent and C_m = \sum_{i=1}^{m} 1/i otherwise.
  3. Reject all null hypotheses where P_i \leq P_{(k)}.

In my CIKM paper from 2006 with Mounia Lalmas, we performed extensive system pairwise comparisons to understand some aspects of evaluation measures in XML element retrieval.  We did look into controlling family-wise error rate through the Bonferroni method, but because we were doing pair-wise tests on roughly 40 different result lists per task, none of the system differences were identified as statistically significant.   Controlling the false discovery rate allowed us to identify differences, despite the large number of comparisons and relatively small sample sizes.

Summize + Twitter for WWDC

My friends at Summize have been doing some really interesting things.  Their first demo, now offline, was a new twist on product search.  They spidered for reviews from various sources and performed sentiment analysis to unify ratings and reviews.  I was particularly fond of their use of stacked bar histograms to efficiently summarize people’s sentiments.  Display such as these convey much more distributional information than simply reporting averages.

Currently, Summize’s showcase product is a Twitter search application.  Their search application is nice because you can get real-time updates of matches as the tweets are happening.  Twitter’s recognized the value of Summize’s conversational search, and linked to Summize for live coverage of WWDC.  I think it’s great when small companies collaborate in these ways.  It’s also a smart move for Twitter today, given the troubles they’ve been having recently.  Diverting some of their traffic to Summize might help them handle today’s load.

Looking at the results of Twitter’s suggested search, I feel a little overwhelmed.  Just a few seconds after clicking on the link, Summize’s search results informed me that there have been an additional 26 posts matching the search since the page was loaded.  Letting the page go for a minute without refreshing showed hundreds of new tweets.  That’s not terribly surprising, given the excitement around WWDC.  But here’s the rub.  Most of these tweets are useless to me or contain redundant information.  I want only the unique information, not the chatter.  For high volume topics of discussion, I really need the conversational search to filter or summarize the results for me.  Abdur and Eric, please bring some of the great summarization and organizational aspects of your product search to your conversational search.