Freebase topic descriptions in Mechanical Turk

As a component of mSpoke’s named entity detection algorithm, we disambiguate known named-entities to Freebase topics.  To gather better evaluation and tuning data, I recently spent some time improving our Mechanical Turk template for assessing named entity assignment and disambiguation.  I didn’t want to upload a Freebase topic description with each HIT, so I instead spent a little time figuring out how to load the topic description dynamically.  It wasn’t terribly difficult, but because it took me a little time to figure out how everything fits together, I figure it is probably worth sharing.

The basic approach is to include the Freebase id and label in the data we upload to Mechanical Turk, then use AJAX to request the description from Freebase and fill it into the HIT when the template is rendered.

The first thing we need is some HTML for the Freebase attribution and places to fill in the Wikipedia text attribution (all of our disambiguated named entities have descriptions originating in Wikipedia) and topic description.

<div id="description"></div>
<div style="font-size:x-small">
  <img src="http://www.freebase.com/api/trans/raw/freebase/attribution"
    style="float:left; margin-right: 5px" />
  <div style="margin-left:30px"> Source:
    <a href="http://www.freebase.com" title="Freebase &ndash; The World's Database">Freebase</a>
    &ndash; The World&apos;s Database<br />
    &quot;<a href="http://www.freebase.com/view/${freebase_id}"title="${label}:
    Freebase &ndash; The World's Database">${label} </a>&quot; Freely licensed under
    <a href="http://www.freebase.com/view/common/license/cc_attribution_25">CC-BY</a>.
  </div>
</div>
<div id="attribution"></div>

The Freebase attribution in the middle is mostly boilerplate and we can fill in the Freebase id and topic label from our HIT data using ${freebase_id} and ${label}. The description and attribution divs will be filled in using AJAX:

<script src="http://code.jquery.com/jquery-1.3.min.js"></script>
<script src="http://jquery-json.googlecode.com/files/jquery.json-1.3.min.js"></script>
<script>
  var envelope = { query :
  {
    id :   "${freebase_id}",
    type : "/common/topic",
    article : [{
      id : null,
      "/common/document/source_uri" : null
    }]
  }};

  jQuery.getJSON(
    "http://api.freebase.com/api/service/mqlread?callback=?",
    { query : jQuery.toJSON( envelope ) },
    processIdRequest
  );

  function processIdRequest( response ) {
    if ( response.code == "/api/status/ok"
         && response.result
         && response.result.article ) {
      jQuery.each(response.result.article, function() {
        requestDescription(this.id);
        addAttribution(this["/common/document/source_uri"]);
      });
    }
  }

  function requestDescription( id ) {
    jQuery.getJSON(
      "http://api.freebase.com/api/trans/raw/" + id + "?callback=?",
      processDescriptionRequest
    );
  }

  function processDescriptionRequest( response ) {
    if ( response.code == "/api/status/ok"
         && response.result
         && response.result.body ) {
      jQuery("div#description").html(response.result.body);
    }
  }

  function addAttribution( uri ) {
    var id = uri.substr(uri.lastIndexOf('/') + 1);
    jQuery("div#attribution").html(
      "The original description for this topic was automatically generated from the " +
      "<a href=\"http://en.wikipedia.org/w/index.php?curid=" + id + "\">Wikipedia article \"" +
      "${label}\"</a> licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">" +
      "GNU Free Documentation License.</a>"
    );
  }
</script>

Since Mechanical Turk fills in the HIT template variables prior to rendering the web page, we can fill in the Freebase page id and topic label where needed, such as in the query to Freebase’s API and the Wikipedia attribution text.  The query to Freebase, represented by envelope, requests both the topic descriptions id and the Wikipedia source URI.  The script uses jQuery to request the data, and processIdRequest passes on the article id to requestDescription and the Wikipedia source uri to addAttribution. processIdRequest then uses Freebase to look up the description of the topic given its id.  Finally, since the Wikipedia source URI isn’t an actual link and looks like http://wp/en/1194195, addAttribution parses out the article id and generates a link to the actual Wikipedia page in the attribution.

Twitter search ain’t that bad

My friend Daniel Tunkelang recently made the arguments that Twitter isn’t a search engine, nor is their search engine hard to build.  While I certainly agree that Twitter is not a search engine, I disagree with several of his comments with regards to their search engine. As an advocate of HCIR, I’m surprised Daniel didn’t take the time to think about what makes searching tweets different.

  1. Daniel argues that in order for there to be “search,” there must be an information need.  I use Twitter frequently to gauge sentiment and quantity of discussions around products.  Twitter search can greatly aid my decision making process of whether or not to use a product (such as an open source tool).
  2. It is arguable that the recency of a tweet is a very large component of relevance.  Twitter is a way of discussing and interacting with others about what is happening right now.  A reverse chronological ordering of tweets makes sense.
  3. At 140 characters, traditional text ranking algorithms might not perform well.  There’s very little context within a tweet to determine how well such a short tweet matches a short query.  Ordering by well tuned, state-of-the-art text retrieval approaches such as Okapi-BM25 may give bad orderings on tweets.
  4. At 140 characters, the entire tweet can be comfortably displayed in the result list.  There is no need for snippet generation, and query term highlighting facilitates quick scanning of the results.  Other aspects of relevance other than timeliness may be easier to gauge than in other search tasks.
  5. Given the emphasis on timeliness, the search engine must be indexing tweets in real time and making new tweets available very quickly.  While this is no doubt possible with in memory inverted indexes and many people have built this functionality, textbooks rarely address issues of searching and indexing documents in real time.  The real-time demands on Twitter search are greater than other web search engines.

Now that I’m done defending Twitter search, I agree there are things they could do better.  Daniel alludes in comments that one dimension of relevance that is currently not reflected in search results is the influence of the Twitter user on relevance.  There are search tasks where who is participating in the discussion is just as important as when the statement was made.

Also, I mentioned that I use Twitter to measure activity and sentiment.  Twitter search does little to summarize or aggregate results.  External tools such as twist.flaptor.com have done a better job of using tweets to measure and show volume of discussion.  Measuring sentiment from 140 characters is difficult, but it may be possible to measure general reaction as an aggregate.

Close Reading

My girlfriend Christine Mahady (Chris) and I recently met with my friend Alexander Flurie to discuss a potential project.  The project revolves around making a tool to assist in performing close readings in an educational environment.

Close reading is a technique used in literary criticism which focuses on the careful analysis of the language of a text.  These analyses are typically done on passages or short texts and critique low level features of the text, such as word choice, syntax, and the structure and flow of the ideas expressed in the text.  Performing a close reading often begins with a careful annotation of the text, making notes on observations of structure, tone, rhetorical devices, or anything about the text the reader notices.  During this process, the reader will also look up any unfamiliar words, make note of word choice, such as the use of slang, and so on.

Interestingly, we aren’t the only people to consider doing close reading online.  The Golden Notebook Project launched about a month ago.  Seven authors and critics are collaborating to do a close reading of Doris Lessing’s The Golden Notebook.  Their goals our different than ours; this is an interesting case study on one novel.

Our goals are more focused on making it easy for instructors to create close readings in educational environments.  In order to ease adoption, we’d like to seed the tool with publicly available texts, such as those found in Project Gutenberg.  We also want to allow instructors to easily create discussion groups for students, where all students’ annotations could be shared for collaborative reading or private for individual assignments.  Finally, we’d like this tool to allow the instructor to import annotations they’ve previously created for the text into a group, so that instructors can easily reuse their own annotations from semester to semester.  I’m personally very excited about this side project, because I think it is a great example of how online tools can be easier to use than traditional tools.

Popularity, subjectivity and polarity in blogs

My next post in the What makes a blog post popluar? series at the mSpoke blog is now up.  This post looks at subjectivity and polarity in blog posts.  I think it makes a small but interesting complement to some of the other recent discussions about subjectivity, including Lexalytic’s analysis of the reaction on Twitter to the Motrin ad and Matthew Hurst’s ongoing series of posts on the subject of sentiment mining.

I didn’t report the equivalent per-feed analysis I had performed in the previous post; there was no correlation after that transformation.  I wanted to focus on highlighting the very slight positive correlation without bogging the post down with a lot of additional analysis.

Dojo charts that sing: box and whisker plots

A big part of my job at mSpoke is to understand our users, system performance, and our ability to deliver relevant content.  In order to facilitate this, we’ve been building a dashboard.  The dashboard is a group of web pages that enables us to easily monitor different aspects of our systems’ status.  I really can’t understate the importance of having the right tools available in order to do data analysis.  Our dashboard is an interactive, frequently updated suite of web pages and graphs that allow us to easily diagnose issues in our system, observe topical trends in live web content, and better understand which content receives our users’ attention.

We use a lot of Dojo at mSpoke, so it was an easy choice for us to use Dojo Charting to create the plots in our dashboard.  However, it doesn’t yet include some chart types we’d like.  So I built them.  This post covers our custom box and whisker plot.

A box and whisker plot elegantly displays distributional statistics about a set of data.  The box covers the middle 50% of the data, the line in the box shows the median, the whiskers cover the range of points that are not outliers, and the outliers are indicated using markers.

Dojo box and whisker chart

Dojo box and whisker chart

I’ve hosted an interactive version of the chart and the source code.  mSpoke is happy to share this code, and I would be happy to work with Dojo to include it in later releases.

Using the box and whisker plot

Creating a box and whisker plot is very similar to the creation of other Dojo charts.  In addition to including dojo.js, you’ll need to include BoxAndWhisker.js:

<script type="text/javascript" 
        src="mspoke/charting/plot2d/BoxAndWhisker.js"></script>

The javascript to create the plot is simple:

    var chart = new dojox.charting.Chart2D("chart");
    chart.addPlot("default", {type: "BoxAndWhisker", markers: true});

I’ve included two different ways to pass in the data for a series.  The first is to simply list all of the data points; BoxAndWhisker.js will compute the median, middle 50% of points, whiskers and outliers.

    var fill = new dojo.Color([0x00, 0x4f, 0x80, 0.7]);
    var stroke = {color:fill};

    // List all points approach
    chart.addSeries( "Mon",
      [4.4, 6.7, 5.7, 4.6, 4.1, 4.4, 4.3, 4, 3.5, 6.2, 6.5, 0.5],
      {stroke:stroke, fill:fill, center:1, width:0.2});

You must specify the center and width for the box to display properly.  The center indicates where the x-coordinate the box and whisker for the series will be placed and the width indicates the width along the x-axis of the box.  You can specify the range and labels of the axes using chart.addAxis.  View the source of the interactive version for an example.

The other approach is better suited for handling large data sets, when passing in all of the data would make the script run slowly.  In this case, you need to figure out the box and whisker coordinates yourself.

    // Pass in box and whisker coordinates and outliers
    chart.addSeries( "Tue",
      {lwhisker: 1.4, lbox: 2.9, median: 3.35, ubox: 5.1, uwhisker: 7.5,
       outliers: [0.2, 8.9]},
      {stroke:stroke, fill:fill , center:2, width:0.2});

lwhisker and uwhisker are the lower and upper points for the whiskers, lbox and ubox specify the range of the box, and median is the coordinate for the median line in the box.  You can specify the list of markers with the outliers parameter.  Complete source for constructing a box and whisker plot is available in the source of the interactive version.

Creating custom plot types

Building a custom Dojo plot isn’t extremely hard, but it takes some patience to get things just right.  The biggest quirk was that I had indicate in code that the BoxAndWhisker plot type is a part of dojox.charting.  I suspect I may be misunderstanding how to use the Dojo Toolkit correctly.  In order to make the chart visible to Dojo, BoxAndWhisker.js must use dojo.provide at the top of the file:

dojo.provide("dojox.charting.plot2d.BoxAndWhisker");

Following the example in other Dojo plots, I used dojo.declare to declare the plot constructur. As with the call to dojo.provide, I had to declare the plot to be a part of dojox.charting:

	dojo.declare("dojox.charting.plot2d.BoxAndWhisker",
                          dojox.charting.plot2d.Base, {

Otherwise, I was able to develop the chart by following the examples already provided in dojox.charting.plot2d.

What makes a blog post popular? series at mSpoke blog

I will be writing a series of blog posts for the mSpoke blog investigating correlations between the popularity of feed items and various properties of the items.  The first post looks for a relationship between popularity and reading difficulty.  Later posts will consider other properties of feed items, such as the use of opinionated language, topicality, and novelty.

Why we model comment counts

Last week, I wrote a very technical post on how I model the distribution of comment counts for an RSS feed in FeedHub.  I originally drafted the post to document its derivation.  It was non-trivial enough that I feel that there is some merit in sharing this information with others, but it started with the assumption that my readers either are interested in the derivation for its own sake or already understand the importance of modeling these distributions per input feed.  I’d like to correct that with a little discussion about how this information can be used.

FeedHub’s goal is to provide our users with a personalized feed that delivers only the most relevant posts.  This feed is unique to each of our users, and I use “relevant” here to refer to any post the user would like to see in their personalized feed.  By observing a user’s interactions with their feed items (such as clicking on the thumbs up or thumbs down icons), FeedHub tries to learn a model of which feed item attributes are predictive of the user’s interests.  These attributes may be inferred from the content of the feed item, such as the Wikipedia category based labels we assign to items.  Alternatively, they could be based on other attributes of the feed item, such as the number of comments people have made to that post.

The base assumption is that this comment count is predictive of whether or not the item is interesting to the user.  The more discussion around the item, the more likely it is to be interesting to a user.

However, it is important that we normalize the input to the model used in FeedHub such that the comment counts are useful predictors across all input feeds.  Here are some factors that may be important:

  1. Some feed sources have a lot of discussion around every item (e.g. Slashdot).  Other feeds may have many fewer comments per post (e.g. John Langford’s Machine Learning (Theory) blog).  If you’ve subscribed to both of these feeds in FeedHub, that implies you are probably interested in receiving content from both of these feeds.  Naively using the comment count directly as a feature in a learning algorithm is unlikely to work well; it’s hard to believe that the post with the most comments from John Langford’s blog is worse than the post with the fewest comments on Slashdot.  Some per-feed normalization is likely to be important.
  2. For a feed which typically receives a small number of comments per item, the difference between 1 comment and 10 comments is likely to reflect a larger difference in “interest” than the difference between 51 and 60 comments.  Careful normalization within a feed may be necessary.

These two concerns led me to hypothesize that a good way to normalize the number of comments an item has received would be to model the probability distribution of comments per item for each feed.  Given an observed number of comments x for an item, we can estimate P(X<x), the probability that another randomly selected item from that same feed will have fewer comments.  I’m not an expert in the combination of evidence, but this approach does mitigate the above concerns.  Using this probability estimate also has a nice intuitive interpretation: a normalized value of 0.9 means that we expect an item to have more comments than 90% of the items from that feed.

The desire for the Bayesian estimation for the distribution described in my previous post arises from concerns that we may not be able to estimate the distribution well from a small number of items in an input feed where posts are relatively infrequent.  Using a Bayesian estimate of P(X<x) allows us to reduce the bumpiness compared to maximum likelihood estimates.

For example, consider coin flips.  If we observe two heads in two flips, the maximum likelihood estimator would suggest that the coin is unfair.  In practice, most people have a prior belief that the coin is unlikely to be unfair.  It would take many more coin flips all coming up heads for us to overcome that prior belief.  A Bayesian estimate formally models a similar estimation process.  When we have few observations of the data, our estimates will be close to those specified by our prior beliefs.  As we get more data, the estimates from the observations become more reliable and we place less emphasis on our prior beliefs.

Modeling blog post comment counts

As part of my work for FeedHub, I found the need to model the distribution of comment counts for blog posts in RSS feeds.  In particular, I want to normalize the number of comments an item receives to a score ranging from 0 to 1.  It turns out that the  negative binomial distribution is a good fit for comment counts. The negative binomial distribution is a discrete distribution with a probability density function of

f(x;p,r) = \frac{\Gamma(x + r)}{\Gamma(r) x!} p^r (1-p)^x

where r > 0 and 0 \leq p \leq 1.

However, in many cases we have small sample sizes and I felt the natural urge to specify prior distributions over the negative binomial’s parameters. The beta distribution is conjugate for p when r is known, but for our data r is not fixed.

This left me a little stuck.  What I really want is a closed form conjugate prior when both parameters of the negative binomial distribution that is efficient to compute.  What I found was Morgan and Hickman, which requires a large sample to be accurate (kind of defeats the point).  I also found Bradlow, Hardie, and Fader, which has a “closed-form” solution which requires a 300-term expansion to accurately estimate the posterior.   While noticeably faster than Markov chain Monte Carlo methods, it is still more heavyweight than I want.

I felt defeated until I realized that if I accept inference using maximum a posteriori estimates of p and r as a good enough alternative to full-blown Bayesian inference, things become much simpler.  I need only periodically recompute \hat{p} and \hat{r} and perform inference directly on the negative binomial distribution using those estimates.  I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use expectation-maximization to estimate \hat{p} and \hat{r}.

This task is quite easy when estimating the parameters of a negative binomial.  Given observations x^n, estimates p^{[t]}, r^{[t]}, and priors over p and r, we estimate new maximum a posteriori estimates p^{[t+1]}, r^{[t+1]}  Wash, rinse, repeat until convergence.  This iterative procedure means that when estimating p, we can assume a constant r (and vice versa).

To estimate p|x^n, write

f(p|x^n) \propto f(p)L_n(r,p)\,

where

\begin{array}{rl} L_n(r,p) & = \prod_{i=1}^n f(x_i;p,n) \\ \\ & = p^{rn}(1-p)^{\sum_{i=1}^n x_i} \prod_{i=1}^n \frac{\Gamma(x_i + r)}{\Gamma(r)x_i!} .\end{array}

If p \sim Beta(\alpha, \beta) then

\begin{array}{rl} f(p|x^n) & \propto p^{\alpha-1} (1-p)^{\beta-1} L_n(r,p) \\ \\ & \propto p^{\alpha+rn - 1}(1-p)^{\beta + \sum_{i=1}^n x_i - 1},\end{array}

which indicates p|x^n \sim Beta(\alpha + rn, \beta + \sum_{i=1}^n x_i).  To estimate p^{[t+1]}, we use the posterior mode of p|x^n:

p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.

That was the easy part.  Now for the harder part.

For my purposes, the beta prime distribution is a good fit for r.  The beta prime distribution has a similar shape to the more familiar gamma distribution.  If X \sim Beta(a, b) then Y = \frac{X}{1 - X} \sim Beta^\prime(a, b).  For the beta prime distribution,

f(r;a,b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} r^a (1+r)^{-a-b} .

Our posterior r |x^n is distributed

f(r|x^n)=f(r;a,b)L_n(r,p)\left/\int f(r;a,b)L_n(r,p)dr \right. ,

where

f(r;a,b)L_n(r,p)\propto r^a(1+r)^{-a-b}p^{rn}\prod_{i=1}^n\Gamma(r+x_i)/\Gamma(r)

Following Bradlow, Hardie, and Fader, we note that the ratio of the two Gamma functions can computed exactly:

\prod_{i=1}^n \Gamma(r+x_i) \left/\Gamma(r)\right.=\prod_{i=1}^{x^*}(r+i-1)^{s_i}

where x^* = \max(x^n), s_i=\sum_{j=i}^{x^*}n_j, and n_j=|\{x_i \in x^n:x_i=j\}| is the number of observations equal to j.

While f(r;a,b)L_n(r,p) can be computed exactly, it is not easy to integrate analytically.  Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates.  So I can rely on black-box estimation functions in R when testing or the Apache Commons Math library for use in our system.  The resulting recipe to estimate for r^{[t+1]}:

\begin{array}{rl} r^{[t+1]} & = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ & = \arg\max_r  r^a(1+r)^{-a-b}p^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}

In practice, what I actually maximize is the log of the above quantity, which reduces the chance of overflow or underflow during computation.  While f(r;a,b)L_n(r,p) is not easily integrable, it is differentiable (as is its derivative).  One could define first and second derivatives and find the maximum value using Newton-Raphson, but the derivatives require recurrence relations to state succinctly (and I couldn’t be bothered).

All that remains is the initial choice of p^{[0]} and r^{[0]}.  I chose as starting points the method of moments estimator for these parameters (Savani and Zhigljavsky):

\begin{array}{rl} r^{[0]} & = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} & = r^{[0]} / (r^{[0]} + \bar{x})  \end{array}

where \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and v = \frac{1}{n}\left(\sum_{i=1} x^2\right) - \bar{x}^2.  When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.

Here’s a snapshot of a few feeds.  The red line indicates the method of moments estimator and the blue line shows the maximum a posteriori estimates. Yes, I know that the negative binomial is a discrete distribution and plotting it as a line is misleading, but I wanted to look at a large number of feeds at a time and using lines to plot the density is easier to see.

Distribution of comment counts

Update:

Michelle asked a couple of questions about the use of the beta prime distribution as a prior for the r parameter of the negative binomial, so I figured I’d update this post with a little more detail about this choice.

When choosing a distribution to model the prior for r, I started by looking at the histogram of r values estimated using the method of moments estimator described above.  I first considered using the gamma distribution, but it didn’t turn out to be a great fit.  The red line on the histogram below shows the method of moments estimator for the gamma function proposed by Wiens et al (Pak. J. Statist. 2003 Vol.19(1) pp129-141) with k=0.  The blue line shows the much better fit given by the beta prime distribution.

Distribution of r parameter for negative binomial

To fit the beta prime parameters, I fit a beta distribution to r / (r + 1).  Here’s a plot showing the histogram for r / (r + 1) and the method of moments estimator for the beta distribution:

The beta distribution fit for r / (1 + r)

I hope this makes my choice of the beta prime distribution and its estimation a little more clear.

Useful evaluations

I have thought many times over the last few years about evaluation of information retrieval systems.  Karen Spärck Jones stated my thoughts quite eloquently.  Unfortunately, I wasn’t able to dig up the exact quote, but it was something along the lines of “statistical significance is not enough; you must also have practical significance.”

To measure a practical difference between systems, we must:

  1. have an evaluation measure that correlates with user satisfaction,
  2. understand the difference needed under the evaluation measure for a user to notice that one system is preferable to another, and
  3. have confidence (statistically) that the difference between the two systems is larger than that minimum noticeable difference.

The third challenge is the most straightforward as we can rely on statistics.  The first two, however, are far more challenging.  It is often the case that traditional system evaluation measures such as mean-average precision do not correlate well with human satisfaction.  Other measures focusing on early precision may correlate better with human experience, but there is much higher variability when focusing only on the top ranked documents.  This means we need a much larger number of queries to measure statistical significance.

There has been a lot of recent research in the Information Retrieval community on evaluation.  Much of this work has focused on statistical techniques and efficiently building test collections.  I think these works are great to have, but I wish there were more working to answer the first two questions.

Multiple significance testing

Panos Ipeirotis recently posted some interesting thoughts about statistical significance tests for comparing systems when incrementally developing techniques to solve a problem.  While I don’t have the answer to his question, I did notice that he made note of the Bonferroni method for correcting for mutiple testing.  The need for multiple test correction arises when you need to make more than one comparison; the more comparisons you make, the more likely an uncorrected test will be rejected by chance.

One limitation of Bonferroni method is its conservativeness.  If you would normally reject the null hypothesis when p < \alpha, when using Bonferroni correction you would only reject when p < \alpha/m, where m is the numer of tests.  This can make it very difficult to reject any of the tests when m is large.  This arises because the Bonferroni method controls the probability of falsely rejecting any null hypothesis; it controls the family-wise error rate.

There’s a newer technique that addresses the conservative nature of the Bonferroni method.  Instead of controlling the probability of falsely rejecting any null hypothesis, the idea is to control the false discovery rate.  The false discovery rate is the expected proportion of false rejections of the null hypothesis.  By controlling the false discovery rate, we acknowledge that we are willing to accept that for each rejection of the null hypothesis, we expected that the probability it was rejected in error is \alpha or less.

The Benjamini-Hochberg method is a simple approach for controlling the false discovery rate.  Given P_1, P_2, \dots P_m p-values resulting from a statistical significance test:

  1. Let P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(m)} be the p-values sorted in increasing order.
  2. Define l_i = \frac{i \alpha}{C_m m} and k = \max\{i : P_{(i)} < l_i\} where C_m = 1 if the p-values are independent and C_m = \sum_{i=1}^{m} 1/i otherwise.
  3. Reject all null hypotheses where P_i \leq P_{(k)}.

In my CIKM paper from 2006 with Mounia Lalmas, we performed extensive system pairwise comparisons to understand some aspects of evaluation measures in XML element retrieval.  We did look into controlling family-wise error rate through the Bonferroni method, but because we were doing pair-wise tests on roughly 40 different result lists per task, none of the system differences were identified as statistically significant.   Controlling the false discovery rate allowed us to identify differences, despite the large number of comparisons and relatively small sample sizes.