Panos Ipeirotis recently posted some interesting thoughts about statistical significance tests for comparing systems when incrementally developing techniques to solve a problem. While I don’t have the answer to his question, I did notice that he mentioned the Bonferroni method for correcting for multiple testing. The need for multiple-test correction arises when you make more than one comparison: the more comparisons you make, the more likely it is that at least one null hypothesis will be rejected purely by chance if the tests are left uncorrected.
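To see how quickly chance rejections accumulate, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is true, which real system comparisons will not satisfy exactly. The function name `prob_any_false_rejection` and the 0.05 level are illustrative choices, not something from Panos's post.

```python
def prob_any_false_rejection(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection among n_tests independent
    tests, each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Illustrative numbers of comparisons, all tested at the 0.05 level.
    for n in (1, 10, 100):
        print(n, round(prob_any_false_rejection(0.05, n), 3))
```

Under these assumptions, ten uncorrected tests at the 0.05 level already give roughly a 40% chance of at least one spurious rejection.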
One limitation of the Bonferroni method is its conservativeness. If you would normally reject the null hypothesis when $p < \alpha$, then with the Bonferroni correction you would only reject when $p < \alpha / n$, where $n$ is the number of tests. This can make it very difficult to reject any of the null hypotheses when $n$ is large. This arises because the Bonferroni method controls the family-wise error rate: the probability of falsely rejecting any null hypothesis at all.
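As a rough illustration of the correction, the sketch below compares each p-value against the significance level divided by the number of tests. The function name `bonferroni_reject` and the p-values are hypothetical and do not come from any real experiment.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject a null hypothesis only if its p-value is below alpha / n."""
    n = len(p_values)
    if n == 0:
        return []
    threshold = alpha / n  # the per-test level shrinks as n grows
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values, made up for illustration.
    p_values = [0.001, 0.008, 0.02, 0.03, 0.2]
    print(bonferroni_reject(p_values))
```

With five tests at the 0.05 level, the per-test threshold drops to 0.01, so only the two smallest of these made-up p-values lead to a rejection.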
There’s a newer technique that addresses the conservative nature of the Bonferroni method. Instead of controlling the probability of falsely rejecting any null hypothesis, the idea is to control the false discovery rate. The false discovery rate is the expected proportion of false rejections among all rejections of the null hypothesis. By controlling the false discovery rate at a level $\alpha$, we accept that, on average, a fraction of at most $\alpha$ of the null hypotheses we reject will have been rejected in error.
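Written as a formula, the definition might look like the following. The symbols $V$ (number of false rejections) and $R$ (total number of rejections) are my own notation for this sketch, and the $\max(R, 1)$ in the denominator is just one common convention for avoiding division by zero when nothing is rejected.

```latex
\mathrm{FDR} = \mathbb{E}\left[ \frac{V}{\max(R, 1)} \right]
```

By contrast, the family-wise error rate is $P(V \ge 1)$, which is why controlling it is the stricter requirement.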
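The Benjamini-Hochberg method is a simple approach for controlling the false discovery rate. Given $n$ p-values resulting from a statistical significance test:
- Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$ be the p-values sorted in increasing order.
- Define $l_i = \frac{i \alpha}{C_n n}$ and $R = \max \{ i : p_{(i)} \le l_i \}$, where $C_n = 1$ if the p-values are independent and $C_n = \sum_{i=1}^{n} \frac{1}{i}$ otherwise.
- Reject all null hypotheses where $p_i \le p_{(R)}$.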
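The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code used in any paper. It assumes the standard Benjamini-Hochberg thresholds $i \alpha / (C_n n)$ with a non-strict comparison, and it rejects nothing when no p-value falls under its threshold. The function name `benjamini_hochberg`, its `independent` flag, and the p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(
    p_values: Sequence[float], alpha: float = 0.05, independent: bool = True
) -> List[bool]:
    """Return one reject flag per p-value, in the original input order."""
    n = len(p_values)
    if n == 0:
        return []
    # C_n is 1 for independent p-values and the harmonic sum otherwise.
    c_n = 1.0 if independent else sum(1.0 / i for i in range(1, n + 1))
    ordered = sorted(p_values)
    # R is the largest 1-based rank i with p_(i) <= i * alpha / (C_n * n).
    r = 0
    for i, p in enumerate(ordered, start=1):
        if p <= i * alpha / (c_n * n):
            r = i
    if r == 0:
        return [False] * n  # no p-value is under its threshold
    cutoff = ordered[r - 1]  # the rejection threshold p_(R)
    return [p <= cutoff for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values, made up for illustration.
    p_values = [0.2, 0.001, 0.03, 0.008, 0.02]
    print(benjamini_hochberg(p_values))
    print(benjamini_hochberg(p_values, independent=False))
```

For these made-up p-values at the 0.05 level, the independent case rejects four of the five hypotheses, while the more cautious dependent case rejects only the two smallest.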
In my CIKM paper from 2006 with Mounia Lalmas, we performed extensive pairwise system comparisons to understand some aspects of evaluation measures in XML element retrieval. We did look into controlling the family-wise error rate through the Bonferroni method, but because we were doing pairwise tests on roughly 40 different result lists per task, none of the system differences were identified as statistically significant. Controlling the false discovery rate allowed us to identify differences, despite the large number of comparisons and relatively small sample sizes.