Modeling blog post comment counts

As part of my work for FeedHub, I found the need to model the distribution of comment counts for blog posts in RSS feeds.  In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1.  It turns out that the  negative binomial distribution is a good fit for comment counts. The negative binomial distribution is a discrete distribution with a probability density function of

f(x;p,r) = \frac{\Gamma(x + r)}{\Gamma(r) x!} p^r (1-p)^x

where r > 0 and 0 \leq p \leq 1.

However, in many cases we have small sample sizes and I felt the natural urge to specify prior distributions over the negative binomial’s parameters. The beta distribution is conjugate for p when r is known, but for our data r is not fixed.

This left me a little stuck.  What I really want is a closed form conjugate prior when both parameters of the negative binomial distribution that is efficient to compute.  What I found was Morgan and Hickman, which requires a large sample to be accurate (kind of defeats the point).  I also found Bradlow, Hardie, and Fader, which has a “closed-form” solution which requires a 300-term expansion to accurately estimate the posterior.   While noticeably faster than Markov chain Monte Carlo methods, it is still more heavyweight than I want.

I felt defeated until I realized that if I accept inference using maximum a posteriori estimates of p and r as a good enough alternative to full-blown Bayesian inference, things become much simpler.  I need only periodically recompute \hat{p} and \hat{r} and perform inference directly on the negative binomial distribution using those estimates.  I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use expectation-maximization to estimate \hat{p} and \hat{r}.

This task is quite easy when estimating the parameters of a negative binomial.  Given observations x^n, estimates p^{[t]}, r^{[t]}, and priors over p and r, we estimate new maximum a posteriori estimates p^{[t+1]}, r^{[t+1]}  Wash, rinse, repeat until convergence.  This iterative procedure means that when estimating p, we can assume a constant r (and vice versa).

To estimate p|x^n, write

f(p|x^n) \propto f(p)L_n(r,p)\,

where

\begin{array}{rl} L_n(r,p) & = \prod_{i=1}^n f(x_i;p,n) \\ \\ & = p^{rn}(1-p)^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + r)}{\Gamma(r)x_i!} .\end{array}

If p \sim Beta(\alpha, \beta) then

\begin{array}{rl} f(p|x^n) & \propto p^{\alpha-1} (1-p)^{\beta-1} L_n(r,p) \\ \\ & \propto p^{\alpha+rn - 1}(1-p)^{\beta + \sum_{i=1}^n x_i - 1},\end{array}

which indicates p|x^n \sim Beta(\alpha + rn, \beta + \sum_{i=1}^n x_i).  To estimate p^{[t+1]}, we use the posterior mode of p|x^n:

p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.

That was the easy part.  Now for the harder part.

For my purposes, the beta prime distribution is a good fit for r.  The beta prime distribution has a similar shape to the more familiar gamma distribution.  If X \sim Beta(a, b) then Y = \frac{X}{1 - X} \sim Beta^\prime(a, b).  For the beta prime distribution,

f(r;a,b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} r^a (1+r)^{-a-b} .

Our posterior r |x^n is distributed

f(r|x^n)=f(r;a,b)L_n(r,p)\left/\int f(r;a,b)L_n(r,p)dr \right. ,

where

f(r;a,b)L_n(r,p)\propto r^a(1+r)^{-a-b}p^{rn}\prod_{i=1}^n\Gamma(r+x_i)/\Gamma(r)

Following Bradlow, Hardie, and Fader, we note that the ratio of the two Gamma functions can computed exactly:

\prod_{i=1}^n \Gamma(r+x_i) \left/\Gamma(r)\right.=\prod_{i=1}^{x^*}(r+i-1)^{s_i}

where x^* = \max(x^n), s_i=\sum_{j=i}^{x^*}n_j, and n_j=|\{x_i \in x^n:x_i=j\}| is the number of observations equal to j.

While f(r;a,b)L_n(r,p) can be computed exactly, it is not easy to integrate analytically.  Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates.  So I can rely on black-box estimation functions in R when testing or the Apache Commons Math library for use in our system.  The resulting recipe to estimate for r^{[t+1]}:

\begin{array}{rl} r^{[t+1]} & = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ & = \arg\max_r  r^a(1+r)^{-a-b}p^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}

In practice, what I actually maximize is the log of the above quantity, which reduces the chance of overflow or underflow during computation.  While f(r;a,b)L_n(r,p) is not easily integrable, it is differentiable (as is its derivative).  One could define first and second derivatives and find the maximum value using Newton-Raphson, but the derivatives require recurrence relations to state succinctly (and I couldn’t be bothered).

All that remains is the initial choice of p^{[0]} and r^{[0]}.  I chose as starting points the method of moments estimator for these parameters (Savani and Zhigljavsky):

\begin{array}{rl} r^{[0]} & = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} & = r^{[0]} / (r^{[0]} + \bar{x})  \end{array}

where \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and v = \frac{1}{n}\left(\sum_{i=1} x^2\right) - \bar{x}^2.  When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.

Here’s a snapshot of a few feeds.  The red line indicates the method of moments estimator and the blue line shows the maximum a posteriori estimates. Yes, I know that the negative binomial is a discrete distribution and plotting it as a line is misleading, but I wanted to look at a large number of feeds at a time and using lines to plot the density is easier to see.

Distribution of comment counts

Update:

Michelle asked a couple of questions about the use of the beta prime distribution as a prior for the r parameter of the negative binomial, so I figured I’d update this post with a little more detail about this choice.

When choosing a distribution to model the prior for r, I started by looking at the histogram of r values estimated using the method of moments estimator described above.  I first considered using the gamma distribution, but it didn’t turn out to be a great fit.  The red line on the histogram below shows the method of moments estimator for the gamma function proposed by Wiens et al (Pak. J. Statist. 2003 Vol.19(1) pp129-141) with k=0.  The blue line shows the much better fit given by the beta prime distribution.

Distribution of r parameter for negative binomial

To fit the beta prime parameters, I fit a beta distribution to r / (r + 1).  Here’s a plot showing the histogram for r / (r + 1) and the method of moments estimator for the beta distribution:

The beta distribution fit for r / (1 + r)

I hope this makes my choice of the beta prime distribution and its estimation a little more clear.

Comments (5)

  1. Michelle wrote::

    Hi, this is a great post. I understand the part the posterior p|{x1,…,xn} was induced. However, what is not equivalently intuitive to me is the part about the posterior r|{x1,…,xn}. Could you talk a bit more about why beta prime distribution was chosen for r?

    As mentioned, “if X ~ Beta(a, b), then Y = X/(1+X) ~ Beta’(a, b)”. In another word, if Y ~ Beta’(a, b), then X = Y/(1-Y) ~ Beta(a, b). If `Y’ is substituted by r, which variable will be the substitution of `X’?

    Sunday, November 23, 2008 at 6:21 am #
  2. Michelle wrote::

    Hi, Paul, thanks for your reply and the updated details. Now I understand it better.

    I apologize for two typos, i.e., Y=X/(1+X) (should be Y=X/(1-X)) and X=Y/(1-Y) (should be X=Y/(1+Y)) in my earlier comment.

    Tuesday, November 25, 2008 at 3:28 am #
  3. Mark Carman wrote::

    Fantastic post Paul. Thanks for sharing your technique!

    Sunday, January 9, 2011 at 4:52 pm #
  4. kyle ward wrote::

    uhhhmm are there any random variate or variable involved here?

    Sunday, April 3, 2011 at 11:52 pm #
  5. pogil wrote::

    Kyle, I’m not exactly sure what you’re asking, so let me take a stab at briefly summarizing the model and its random variables.

    In this task X ~ NB(r, p), where X is the random variable (the observed number of comments on a blog post from a given blog) and NB refers to the negative binomial distribution. Additionally, I’ve taken a Bayesian view of r and p for the parameters of the negative binomial, choosing to view them as random variables as well. I’ve modeled r using a beta prime and p as a beta. Computing the exact Bayesian distribution for X was overkill for my use case and not terribly computationally efficient, so I chose to use maximum-a-posteriori estimates for r and p when computing the distribution of X for a blog.

    I hope this helps clarify how I’ve modeled blog post comment counts.

    Monday, April 4, 2011 at 8:36 am #

Trackbacks/Pingbacks (2)

  1. Why We Model Comment Counts « Paul Ogilvie on Tuesday, July 8, 2008 at 7:34 pm

    [...] at 7:34 pm · Filed under Blogs, Statistics ·Tagged comment counts Last week, I wrote a very technical post on how I model the distribution of comment counts for an RSS feed in FeedHub.  I originally [...]

  2. [...] Retrieval on the Live Web by Paul Ogilvie < Modeling blog post comment counts What makes a blog post popular? series at mSpoke blog [...]