As a component of mSpoke’s named entity detection algorithm, we disambiguate known named entities to Freebase topics. To gather better evaluation and tuning data, I recently spent some time improving our Mechanical Turk template for assessing named entity assignment and disambiguation. I didn’t want to upload a Freebase topic description with each HIT, so I instead worked out how to load the topic description dynamically. It wasn’t terribly difficult, but it took me a while to see how everything fits together, so it seems worth sharing.
The basic approach is to include the Freebase id and label in the data we upload to Mechanical Turk, then use AJAX to request the description from Freebase and fill it into the HIT when the template is rendered.
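To make the substitution step concrete, here is a minimal sketch that imitates how the `${...}` placeholders get replaced with values from the uploaded data. The `fillTemplate` function and the sample row are hypothetical and only illustrate the idea; Mechanical Turk performs this step itself before the page is rendered.

```javascript
// Hypothetical stand-in for Mechanical Turk's template substitution:
// replaces each ${name} with the matching value from one row of HIT data.
function fillTemplate(template, row) {
  return template.replace(/\$\{(\w+)\}/g, function (match, name) {
    // Leave unknown placeholders untouched so missing columns are easy to spot.
    return Object.prototype.hasOwnProperty.call(row, name) ? row[name] : match;
  });
}

// Made-up example row; a real upload would have one such row per HIT.
var exampleRow = { freebase_id: "/en/example_topic", label: "Example Topic" };
var rendered = fillTemplate('id : "${freebase_id}", label : "${label}"', exampleRow);
```

With this made-up row, `rendered` contains the id and label in place of the two placeholders, which is the form the page's script sees at run time.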
The first thing we need is some HTML for the Freebase attribution and places to fill in the Wikipedia text attribution (all of our disambiguated named entities have descriptions originating in Wikipedia) and topic description.
<div id="description"></div>
<div style="font-size:x-small">
<img src="http://www.freebase.com/api/trans/raw/freebase/attribution"
style="float:left; margin-right: 5px" />
<div style="margin-left:30px"> Source:
<a href="http://www.freebase.com" title="Freebase – The World's Database">Freebase</a>
– The World's Database<br />
"<a href="http://www.freebase.com/view/${freebase_id}" title="${label}:
Freebase – The World's Database">${label}</a>" Freely licensed under
<a href="http://www.freebase.com/view/common/license/cc_attribution_25">CC-BY</a>.
</div>
</div>
<div id="attribution"></div>
The Freebase attribution in the middle is mostly boilerplate and we can fill in the Freebase id and topic label from our HIT data using ${freebase_id} and ${label}. The description and attribution divs will be filled in using AJAX:
<script src="http://code.jquery.com/jquery-1.3.min.js"></script>
<script src="http://jquery-json.googlecode.com/files/jquery.json-1.3.min.js"></script>
<script>
var envelope = { query :
{
id : "${freebase_id}",
type : "/common/topic",
article : [{
id : null,
"/common/document/source_uri" : null
}]
}};
jQuery.getJSON(
"http://api.freebase.com/api/service/mqlread?callback=?",
{ query : jQuery.toJSON( envelope ) },
processIdRequest
);
function processIdRequest( response ) {
if ( response.code == "/api/status/ok"
&& response.result
&& response.result.article ) {
jQuery.each(response.result.article, function() {
requestDescription(this.id);
addAttribution(this["/common/document/source_uri"]);
});
}
}
function requestDescription( id ) {
jQuery.getJSON(
"http://api.freebase.com/api/trans/raw/" + id + "?callback=?",
processDescriptionRequest
);
}
function processDescriptionRequest( response ) {
if ( response.code == "/api/status/ok"
&& response.result
&& response.result.body ) {
jQuery("div#description").html(response.result.body);
}
}
function addAttribution( uri ) {
// Skip the attribution if Freebase returned no source URI.
if ( !uri ) {
return;
}
// The source URI looks like http://wp/en/1194195; the last path segment is the Wikipedia page id.
var id = uri.substr(uri.lastIndexOf('/') + 1);
jQuery("div#attribution").html(
"The original description for this topic was automatically generated from the " +
"<a href=\"http://en.wikipedia.org/w/index.php?curid=" + id + "\">Wikipedia article \"" +
"${label}\"</a> licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">" +
"GNU Free Documentation License</a>."
);
}
</script>
Since Mechanical Turk fills in the HIT template variables before the page is rendered, we can use the Freebase id and topic label wherever they are needed, such as in the query to Freebase’s API and in the Wikipedia attribution text. The query to Freebase, represented by envelope, requests both the id of the topic’s description article and its Wikipedia source URI. The script uses jQuery to send the query, and processIdRequest passes the article id to requestDescription and the Wikipedia source URI to addAttribution. requestDescription then asks Freebase for the description text of that article, and processDescriptionRequest writes it into the description div. Finally, since the Wikipedia source URI isn’t an actual link and looks like http://wp/en/1194195, addAttribution parses out the article id and generates a link to the actual Wikipedia page in the attribution.
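The URI handling can be isolated into a small function that is easy to check on its own. This is only a sketch: the name `wikipediaUrlFromSourceUri` is hypothetical, and the digits-only check is an extra assumption about what the source URIs look like, based on the single example above.

```javascript
// Hypothetical helper: turn a Freebase source URI such as
// "http://wp/en/1194195" into a link to the Wikipedia page with that id.
function wikipediaUrlFromSourceUri(uri) {
  if (typeof uri !== "string") {
    return null; // no source URI was returned
  }
  // Everything after the last slash is taken to be the Wikipedia page id.
  var id = uri.substr(uri.lastIndexOf("/") + 1);
  // Assumption: page ids are purely numeric; anything else is rejected.
  if (!/^\d+$/.test(id)) {
    return null;
  }
  return "http://en.wikipedia.org/w/index.php?curid=" + id;
}

var exampleLink = wikipediaUrlFromSourceUri("http://wp/en/1194195");
```

Returning null for anything unexpected lets the caller skip the attribution instead of rendering a broken link.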

for an item, we can estimate
, the probability that another randomly selected item from that same feed will have fewer comments. I’m not an expert in the combination of evidence, but this approach does mitigate the above concerns. Using this probability estimate also has a nice intuitive interpretation: a normalized value of 0.9 means that we expect an item to have more comments than 90% of the items from that feed.
and 
when
is known, but for our data
and
and perform inference directly on the negative binomial distribution using those estimates. I also get an additional benefit from being willing to use maximum a posteriori estimates; I can use
estimates
and priors over
we estimate new maximum a posteriori estimates
,
Wash, rinse, repeat until convergence. This iterative procedure means that when estimating
, write

then
. To estimate
$$p^{[t+1]} = \frac{\alpha + r^{[t]}n - 1}{\alpha + \beta + r^{[t]}n + \sum_{i=1}^n x_i - 2}.$$
then
. For the beta prime distribution,
is distributed


,
, and
is the number of observations equal to
.
can be computed exactly, it is not easy to integrate analytically. Although I am unwilling to perform computationally expensive operations regularly, I am very comfortable with using numeric techniques to update the MAP estimates. So I can rely on black-box estimation functions in
$$\begin{array}{rl} r^{[t+1]} & = \arg\max_r f(r;a,b)L_n(r,p^{[t]}) \\ \\ & = \arg\max_r r^a(1+r)^{-a-b}\left(p^{[t]}\right)^{rn} \prod_{i=1}^{x^*}(r+i-1)^{s_i} . \end{array}$$
and
. I chose as starting points the method of moments estimator for these parameters (
$$\begin{array}{rl} r^{[0]} & = \bar{x}^2/(v - \bar{x}) \\ \\ p^{[0]} & = r^{[0]} / (r^{[0]} + \bar{x}) \end{array}$$
and
. When the method of moments estimates are not well-defined for the sample data, I use the means of the prior distributions as starting points.
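As a rough sketch, the method of moments starting points and a single update of p could be computed as follows. The function names are hypothetical, the use of the sample variance (dividing by n - 1) for v is an assumption, and alpha and beta are taken to be the parameters of the prior over p as they appear in the update formula. The update of r needs a numeric optimizer and is not shown.

```javascript
// Method of moments starting points for the negative binomial parameters:
//   r0 = mean^2 / (v - mean),  p0 = r0 / (r0 + mean)
// Returns null when the estimates are not well-defined (v <= mean).
function methodOfMomentsStart(xs) {
  var n = xs.length;
  if (n < 2) {
    return null;
  }
  var sum = 0;
  for (var i = 0; i < n; i++) {
    sum += xs[i];
  }
  var mean = sum / n;
  var squares = 0;
  for (var j = 0; j < n; j++) {
    squares += (xs[j] - mean) * (xs[j] - mean);
  }
  var v = squares / (n - 1); // assumption: sample variance
  if (mean <= 0 || v <= mean) {
    return null; // caller should fall back to other starting points
  }
  var r0 = (mean * mean) / (v - mean);
  return { r: r0, p: r0 / (r0 + mean) };
}

// One update of p given the current r:
//   p = (alpha + r*n - 1) / (alpha + beta + r*n + sum(x) - 2)
function updateP(alpha, beta, r, xs) {
  var n = xs.length;
  var sum = 0;
  for (var i = 0; i < n; i++) {
    sum += xs[i];
  }
  return (alpha + r * n - 1) / (alpha + beta + r * n + sum - 2);
}

// Made-up counts, purely for illustration.
var counts = [0, 0, 2, 2, 6];
var start = methodOfMomentsStart(counts);
var nextP = updateP(2, 2, start.r, counts);
```

When `methodOfMomentsStart` returns null, the caller would substitute the prior means, as described above.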
. The blue line shows the much better fit given by the beta prime distribution.
. Here’s a plot showing the histogram for 
, when using Bonferroni correction you would only reject when
, where
is the number of tests. This can make it very difficult to reject any of the tests when
or less.
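For reference, a minimal sketch of the Bonferroni rule in its standard form: each test is rejected only when its p-value is at most the significance level divided by the number of tests. The function name and the example values are made up for illustration.

```javascript
// Standard Bonferroni correction: compare every p-value against alpha / (number of tests).
function bonferroniReject(pValues, alpha) {
  var threshold = alpha / pValues.length;
  return pValues.map(function (p) {
    return p <= threshold;
  });
}

// Made-up p-values; with alpha = 0.05 and three tests the threshold is about 0.0167.
var decisions = bonferroniReject([0.001, 0.02, 0.04], 0.05);
```

Only the first of these three tests is rejected, even though all three p-values are below 0.05 on their own.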
p-values resulting from a statistical significance test:
be the p-values sorted in increasing order.
and
where
if the p-values are independent and
otherwise.
.