Last week, I wrote a very technical post on how I model the distribution of comment counts for an RSS feed in FeedHub. I originally drafted the post to document the derivation, which was non-trivial enough that I thought it worth sharing. However, the post assumed that my readers either are interested in the derivation for its own sake or already understand why it is important to model these distributions per input feed. I’d like to correct that with a little discussion about how this information can be used.
FeedHub’s goal is to provide our users with a personalized feed that delivers only the most relevant posts. This feed is unique to each of our users, and I use “relevant” here to refer to any post the user would like to see in their personalized feed. By observing a user’s interactions with their feed items (such as clicking on the thumbs up or thumbs down icons), FeedHub tries to learn a model of which feed item attributes are predictive of the user’s interests. These attributes may be inferred from the content of the feed item, such as the Wikipedia-category-based labels we assign to items. Alternatively, they could be based on other attributes of the feed item, such as the number of comments people have made on that post.
The base assumption is that this comment count is predictive of whether or not the item is interesting to the user. The more discussion around the item, the more likely it is to be interesting to a user.
However, it is important that we normalize the input to the model used in FeedHub such that the comment counts are useful predictors across all input feeds. Here are some factors that may be important:
- Some feed sources have a lot of discussion around every item (e.g. Slashdot). Other feeds may have far fewer comments per post (e.g. John Langford’s Machine Learning (Theory) blog). If you’ve subscribed to both of these feeds in FeedHub, you are probably interested in receiving content from both of them. Naively using the comment count directly as a feature in a learning algorithm is unlikely to work well; it’s hard to believe that the post with the most comments from John Langford’s blog is worse than the post with the fewest comments on Slashdot. Some per-feed normalization is likely to be important.
- For a feed which typically receives a small number of comments per item, the difference between 1 comment and 10 comments is likely to reflect a larger difference in “interest” than the difference between 51 and 60 comments. Careful normalization within a feed may be necessary.
These two concerns led me to hypothesize that a good way to normalize the number of comments an item has received would be to model the probability distribution of comments per item for each feed. Given an observed number of comments $x$ for an item, we can estimate $P(X < x)$, the probability that another randomly selected item from that same feed will have fewer comments. I’m not an expert in the combination of evidence, but this approach does mitigate the above concerns. Using this probability estimate also has a nice intuitive interpretation: a normalized value of 0.9 means that we expect an item to have more comments than 90% of the items from that feed.
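To make the normalization above concrete, here is a minimal sketch in Python. It is not FeedHub's code: it uses the plain empirical fraction of past items with fewer comments rather than the Bayesian model from my previous post, and the function name and the comment counts for the two feeds are made up for illustration.

```python
from bisect import bisect_left


def fraction_with_fewer_comments(feed_counts, observed):
    """Empirical estimate of the probability that a randomly chosen
    item from this feed has strictly fewer comments than `observed`.

    `feed_counts` is a list of historical comment counts for one feed.
    (Hypothetical helper, not part of FeedHub.)
    """
    if not feed_counts:
        raise ValueError("need at least one historical comment count")
    ordered = sorted(feed_counts)
    # bisect_left returns how many entries are strictly less than `observed`.
    return bisect_left(ordered, observed) / len(ordered)


# Made-up histories: one busy feed and one quiet feed.
busy_feed = [120, 250, 310, 400, 95, 510, 180, 275, 330, 60]
quiet_feed = [0, 1, 0, 3, 2, 0, 5, 1, 0, 2]

# 100 comments is unremarkable on the busy feed...
print(fraction_with_fewer_comments(busy_feed, 100))   # 0.2
# ...while 5 comments is near the top of the quiet feed.
print(fraction_with_fewer_comments(quiet_feed, 5))    # 0.9
```

With these made-up numbers, a 5-comment post on the quiet feed gets a higher normalized value than a 100-comment post on the busy feed, which is the per-feed behavior the two concerns above call for.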
The Bayesian estimate of the distribution described in my previous post is motivated by a concern: for an input feed where posts are relatively infrequent, we may have only a small number of items, too few to estimate the distribution well. Using a Bayesian estimate allows us to reduce the bumpiness compared to maximum likelihood estimates.
For example, consider coin flips. If we observe two heads in two flips, the maximum likelihood estimate of the probability of heads is 1; that is, it suggests the coin always comes up heads. In practice, most people have a prior belief that the coin is unlikely to be unfair. It would take many more coin flips all coming up heads for us to overcome that prior belief. A Bayesian estimate formally models a similar estimation process. When we have few observations of the data, our estimates will be close to those specified by our prior beliefs. As we get more data, the estimates from the observations become more reliable and we place less emphasis on our prior beliefs.
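The coin example can be worked through numerically. The sketch below compares the maximum likelihood estimate with the posterior mean under a Beta prior. The choice of a Beta(10, 10) prior (roughly "I have already seen 10 heads and 10 tails") is an assumption made purely for illustration, as are the function names; it is not the prior used for comment counts in FeedHub.

```python
def mle_heads(heads, flips):
    """Maximum likelihood estimate of P(heads): the observed fraction."""
    return heads / flips


def posterior_mean_heads(heads, flips, prior_heads=10.0, prior_tails=10.0):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    The prior acts like pseudo-counts added to the observed flips.
    The default Beta(10, 10) prior is an illustrative assumption.
    """
    return (heads + prior_heads) / (flips + prior_heads + prior_tails)


# All flips come up heads; watch the Bayesian estimate move away from 0.5
# only as the evidence accumulates.
for heads, flips in [(2, 2), (20, 20), (200, 200)]:
    print(
        f"{heads}/{flips} heads: "
        f"MLE={mle_heads(heads, flips):.3f}, "
        f"Bayes={posterior_mean_heads(heads, flips):.3f}"
    )
```

After two heads in two flips the maximum likelihood estimate is already 1.0, while the posterior mean is 12/22, still close to the prior's 0.5. After 200 straight heads the posterior mean is 210/220, and the prior has little remaining influence.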