As part of my work for FeedHub, I found the need to model the distribution of comment counts for blog posts in RSS feeds. In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1. It turns out that the negative binomial distribution is a good fit for comment counts. The negative binomial distribution is a discrete distribution with a probability density function of
where and
However, in many cases we have small sample sizes and I felt the natural urge to specify prior distributions over the negative binomial’s parameters. The beta distribution is conjugate for when
is known, but for our data
is not fixed.
This left me a little stuck. What I really want is a closed form conjugate prior when both parameters of the negative binomial distribution that is efficient to compute. What I found was Morgan and Hickman, which requires a large sample to be accurate (kind of defeats the point). I also found Bradlow, Hardie, and Fader, which has a “closed-form” solution which requires a 300-term expansion to accurately estimate the posterior. While noticeably faster than Markov chain Monte Carlo methods, it is still more heavyweight than I want.
I felt defeated until I realized that if I accept inference using maximum a posteriori estimates of and
as a good enough alternative to full-blown Bayesian inference, things become much simpler. I need only periodically recompute
and
and perform inference directly on the negative binomial distribution using those estimates. I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use expectation-maximization to estimate
and
.
This task is quite easy when estimating the parameters of a negative binomial. Given observations estimates
and priors over
and
we estimate new maximum a posteriori estimates
,
Wash, rinse, repeat until convergence. This iterative procedure means that when estimating
, we can assume a constant
(and vice versa).
To estimate , write
where
If then
which indicates . To estimate
, we use the posterior mode of
:
That was the easy part. Now for the harder part.
For my purposes, the beta prime distribution is a good fit for . The beta prime distribution has a similar shape to the more familiar gamma distribution. If
then
. For the beta prime distribution,
Our posterior is distributed
where
Following Bradlow, Hardie, and Fader, we note that the ratio of the two Gamma functions can computed exactly:
where ,
, and
is the number of observations equal to
.
While can be computed exactly, it is not easy to integrate analytically. Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates. So I can rely on black-box estimation functions in R when testing or the Apache Commons Math library for use in our system. The resulting recipe to estimate for
:
In practice, what I actually maximize is the log of the above quantity, which reduces the chance of overflow or underflow during computation. While is not easily integrable, it is differentiable (as is its derivative). One could define first and second derivatives and find the maximum value using Newton-Raphson, but the derivatives require recurrence relations to state succinctly (and I couldn’t be bothered).
All that remains is the initial choice of and
. I chose as starting points the method of moments estimator for these parameters (Savani and Zhigljavsky):
where and
. When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.
Here’s a snapshot of a few feeds. The red line indicates the method of moments estimator and the blue line shows the maximum a posteriori estimates. Yes, I know that the negative binomial is a discrete distribution and plotting it as a line is misleading, but I wanted to look at a large number of feeds at a time and using lines to plot the density is easier to see.
Update:
Michelle asked a couple of questions about the use of the beta prime distribution as a prior for the parameter of the negative binomial, so I figured I’d update this post with a little more detail about this choice.
When choosing a distribution to model the prior for , I started by looking at the histogram of
values estimated using the method of moments estimator described above. I first considered using the gamma distribution, but it didn’t turn out to be a great fit. The red line on the histogram below shows the method of moments estimator for the gamma function proposed by Wiens et al (Pak. J. Statist. 2003 Vol.19(1) pp129-141) with
. The blue line shows the much better fit given by the beta prime distribution.
To fit the beta prime parameters, I fit a beta distribution to . Here’s a plot showing the histogram for
and the method of moments estimator for the beta distribution:
I hope this makes my choice of the beta prime distribution and its estimation a little more clear.


