Engineering for experiments: measuring performance

The main goal of split testing is to compare the performance of multiple components or parameters.  Naturally, we cannot do so without measuring that performance.  Our performance metric may be a click-through rate, as in our related content widget example, or another conversion metric, such as visits to a retail website resulting in a purchase.  In addition, in some cases it can also be important to measure the response time of the components being tested.  This, coupled with other system monitoring, can help you estimate the costs of running the new component at scale.  It may not be worth deploying a new algorithm if the improvements to the performance metric are marginal but the increased costs of running the algorithm are large.

You will probably want to measure and report global measures for each testing group in real time.  You may also have some pre-defined demographic groups of interest you wish to track.  An obvious one for our related content example is to report logged-in users separately from users who are not logged in.

However, it is very important that you be able to revisit this data later and explore it in new ways.  For users who are not logged in, it is important that you store the session information and the user’s actions.  For users who are logged in, you will want this information in addition to your system’s internal user identifier.  With this information, you can later analyze your logs to separate logged-in users by more subtle demographics, such as occupation, interests, and historical levels of activity.

The value of this post hoc analysis cannot be overstated.  It is not enough to simply choose the better of two methods; an understanding of when one method outperforms another can help motivate the next method to test.  A thorough understanding of the strengths and weaknesses of an approach is essential for motivating the next round of experiments.  Without it, you rely solely on your own intuition and ability to guess new methods.  I don’t know about you, but I’d place more bets on data than luck.

When examining your data, it can be helpful to look at the same numbers in different ways.  For example, consider hourly traffic to Wikipedia for the week of March 1 to March 7, 2010.  The following plots show traffic to the English and German language Wikipedia sites, normalized by dividing each hour’s page views by the sum of all page views to that language’s site over the week.  These plots give us a sense of when users visit the two different sites, but are not useful for comparing relative traffic from one language to another.  While these data are not the results of a split test, I think they are helpful for illustrating the point that different representations of data can lead to different insights.
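The normalization used here is simply each hour’s share of the total; a sketch with made-up counts:

```javascript
// Normalize hourly page-view counts to fractions of the total, so
// sites with very different traffic volumes share one scale.  This
// preserves the shape of the traffic curve but discards its magnitude.
function normalizeTraffic(hourlyCounts) {
  var total = hourlyCounts.reduce(function (a, b) { return a + b; }, 0);
  return hourlyCounts.map(function (count) { return count / total; });
}
```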

This first plot shows a simple line plot with time on the horizontal axis and the percent of weekly traffic on the vertical axis.  This is a very traditional representation for this type of data.  Some things immediately jump out.  The usage of the German language Wikipedia varies much more widely with time than the English Wikipedia, although we do see some time-of-day effects on English too.  We also see lower traffic levels on both Wikipedia sites on the weekend (March 06 and March 07).  In this plot it is quite easy to read the values of peaks and valleys, but it is harder to isolate trends across days.

Another way to present these data is to plot them with day on the horizontal axis and hour of day on the vertical axis.  The width between two lines for a day is proportional to the percent of traffic observed during that hour of the day.  This presentation of the data allows us to see effects that weren’t obvious in the other plot.  For example, we can see that usage on the German language website ramped up between 7 and 10 am on Saturday and Sunday, while on a typical weekday this ramp-up occurred earlier, between 6 and 9 am.  In short, traffic to the German website picked up later in the day on weekends than on weekdays.

To conclude, measuring performance is perhaps the most important aspect of split testing.  In addition to measuring key performance measures such as click-through rates and conversion rates, it is wise to track the performance of new components to help in the estimation of cost.  It is important to store enough data to later explore the results of a test in new ways, digging deeper to gain insights, such as which demographics were well served by the test and for which users the tested approach did poorly.  During this investigation, it can be helpful to look at your data in multiple ways, because different presentations of the data may facilitate different insights.  These insights can be used to help you form a better mental model of your users and techniques and motivate the next set of hypotheses you wish to explore.

Engineering for experiments: the role of modularity

My last post outlined some initial thoughts about requirements for easy split testing of web sites.  In this post I assert that modular design is a critical component of any web system you wish to improve through testing.

Software developers are well educated about the value of modular design for reuse and for testing correctness.  The same value extends to testing the performance of components.  By defining a clean interface for a component, it becomes easy to substitute other implementations or algorithms.

When we consider our related content widget example from the last post, we may have a new snippet generation algorithm we wish to compare to our existing algorithm.  By deploying a service implementing a standard snippet generation interface, a test framework could direct a portion of all requests to the new algorithm, recording this new component’s impact on click-through-rate.
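A minimal sketch of this idea (the interface and all names here are invented for illustration): any function from an article and keywords to a snippet string satisfies the interface, and a splitter routes a fraction of requests to the new implementation.

```javascript
// Two interchangeable implementations of a snippet-generation
// interface: (article, keywords) -> snippet string.
function baselineSnippets(article, keywords) {
  return article.text.slice(0, 100);  // first 100 characters
}

function keywordSnippets(article, keywords) {
  // Hypothetical new algorithm: center the snippet on the first
  // keyword match, falling back to the baseline when there is none.
  var i = keywords.length ? article.text.indexOf(keywords[0]) : -1;
  return i < 0 ? baselineSnippets(article, keywords)
               : article.text.slice(Math.max(0, i - 50), i + 50);
}

// The test framework directs a fraction of requests to the new
// implementation; callers see only the common interface.
function makeSplitter(control, treatment, treatmentFraction, rng) {
  return function (article, keywords) {
    var impl = rng() < treatmentFraction ? treatment : control;
    return impl(article, keywords);
  };
}
```

Because both implementations honor one interface, the framework can swap them per request and log which variant produced each impression.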

Components of a related content selection widget: article ranker and snippet generator

Complementary to modularity is the need to handle data flow between components.  Without generic support for data flow in the web architecture, the developer is left reimplementing these interactions routinely.  With data flow management, the web architecture can intercept the output of components, transform results, monitor component throughput, and manage split tests.
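To make this concrete, here is one way interception might look (purely illustrative; no real middleware works exactly like this): the architecture wraps each component, and the same hook point that records throughput could also transform results or route requests for a split test.

```javascript
// Wrap a component so the architecture can observe every call.
// Here the wrapper records per-component latency; the same spot
// could rewrite the output or divert the call to a test variant.
function instrument(name, component, log) {
  return function (input) {
    var start = Date.now();
    var output = component(input);
    log.push({ component: name, millis: Date.now() - start });
    return output;
  };
}

// Compose instrumented components into a pipeline, feeding each
// component's output into the next.
function pipeline(stages) {
  return function (input) {
    return stages.reduce(function (value, stage) { return stage(value); }, input);
  };
}
```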

In some ways, the combination of modular components and generic data flow handling in a web architecture for testing is similar to an enterprise service bus.  An enterprise service bus is middleware which handles messaging between the components of complex information architectures.  The bus handles data flow between components, which are frequently legacy systems or written in a variety of programming languages.  The routing, message handling, monitoring, and management support provided by many enterprise service buses may be attractive for the design and development of a web architecture for testing.  On the other hand, the multi-system support and extra bells and whistles which may not be necessary for your web architecture could add overhead and consume computing resources undesirable in your context.  Most enterprise service buses use XML as the communication language, which is a hefty representation of data; the processing and transformation of messages could greatly add to the latency of your website.  I personally have not evaluated any enterprise service bus for use with web systems, so it is possible that my fears are misplaced.

To conclude, modularity is an important component of any software system, and its use in a web architecture can facilitate web testing.  Coupled with modularity is the need for the test-oriented web architecture to handle data flow in a lightweight way.  Enterprise service buses provide some of this functionality, but may have more overhead than is desirable for use in a low-latency web system, where every millisecond counts.  Any existing middleware solution for handling data flow would need to be extended to provide additional support for testing.

Engineering for experiments

It’s not hard to see the importance of performing tests on web systems.  We regularly hear how companies such as Amazon, Google, or Facebook employ A/B testing (also known as split testing) and multivariate testing to quickly test variations of their websites.  Deploying a test to a percentage of a site’s requests can quickly measure the impact of a site change or a variation of an algorithm.  The impact of testing on a company can be greater than an isolated experiment measuring which color maximizes the click-through rate.  An infrastructure for and culture of testing can positively impact the entire organization.  An organization that doesn’t perform frequent tests is doomed to adapt slowly or without information.

In order for a culture of experimentation to exist, it must be easy to perform these experiments and measure the results.  There are some great frameworks for offline experimentation.  MapReduce solutions are great for experimentation and examination of large data sets.  I’ve seen the importance of a good software architecture for experimentation in my own experience with the Lemur Toolkit.  However, I haven’t seen much discussion of frameworks for performing online experiments in complex web systems.

This post considers as an example split testing for a related content widget.  Such a widget could be deployed on a news publisher’s article pages.  It would show related articles in a sidebar or below the content of the article.  We may wish to test variations of the related content widget to see if they increase some measure of success, such as the click-through rate.

Conceptually, a related content widget is quite simple.  There are two major components, one that returns a list of related items and one that renders the items on the web page.  While this system is quite simple conceptually, a typical implementation of such a system may be much more complex.  The renderer in the client may make use of a mix of XML, CSS, and HTML.  Parts of the rendering algorithm may also live server-side.  For example, if the user visited the web page from a search engine result page, the keywords could be used to create context sensitive snippets for the related content items.  A snippet generation algorithm would more likely be run on the server than on the client.

The experiments we may wish to perform on the related content widget could depend on any of the components used in its creation.  For example, here is just a small sample of things we may wish to test for their impact on the click-through rate:

  • the color of the recommended item titles (CSS changes),
  • the number of items presented (parameters passed to the related content service), and
  • keyword highlighting, use of keywords in ranking content selection, and custom snippet generation when the user arrives from search engine landing page (the ranking algorithm used by the related content service, the snippet generation algorithm, and possibly CSS changes).

To me, this means that a test framework will interface with many of a web system’s components.  Here are some additional thoughts about the attributes I’d like a test management framework to have.  It should

  1. handle data flow between components,
  2. track metrics such as click-through rates and the response times of components,
  3. handle multiple parallel tests along with component dependencies,
  4. be able to handle various user types (such as split by session, long-term cookie, or logged-in user),
  5. make it easy to register new tests,
  6. handle deployment of code and files to servers,
  7. verify to some degree the integrity of new code before returning results of the code to user requests,
  8. be cloud-aware, and
  9. be able to do all of this without site downtime.
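As a sketch of what point 4 involves, here is deterministic bucketing by a stable identifier (the hash is a toy; any reasonable hash works): hashing a session id, long-term cookie, or logged-in user id guarantees the same visitor always lands in the same group, with no stored state.

```javascript
// A small string hash (illustrative only); a production system would
// use a stronger hash to avoid correlated assignments across tests.
function hashString(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;  // keep as unsigned 32-bit
  }
  return h;
}

// Deterministically assign a unit (session, cookie, or user id) to a
// test group, so the same user always sees the same variant.
function assignGroup(unitId, groups) {
  return groups[hashString(unitId) % groups.length];
}
```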

I should acknowledge that there are some tools out there for managing split tests.  However, I believe (possibly in error) that they only scratch the surface of the requirements I’ve listed above.  Also, it may be unfair for me to talk of this as a test framework, because it’s really more of a web framework with adequate support for testing.

I know I’m asking for quite a lot, but I believe these attributes are important for the creation of a culture of experimentation.  Over the next few weeks I hope to go into more detail about why these attributes are important and share some initial thoughts on how a framework may support these goals.  Since many of these ideas are still formative, I’d love to hear your own thoughts as well.

Freebase topic descriptions in Mechanical Turk

As a component of mSpoke’s named entity detection algorithm, we disambiguate known named entities to Freebase topics.  To gather better evaluation and tuning data, I recently spent some time improving our Mechanical Turk template for assessing named entity assignment and disambiguation.  I didn’t want to upload a Freebase topic description with each HIT, so I instead spent a little time figuring out how to load the topic description dynamically.  It wasn’t terribly difficult, but because it took me a little time to figure out how everything fits together, I figure it is probably worth sharing.

The basic approach is to include the Freebase id and label in the data we upload to Mechanical Turk, then use AJAX to request the description from Freebase and fill it into the HIT when the template is rendered.

The first thing we need is some HTML for the Freebase attribution and places to fill in the Wikipedia text attribution (all of our disambiguated named entities have descriptions originating in Wikipedia) and topic description.

<div id="description"></div>
<div style="font-size:x-small">
  <img src=""
    style="float:left; margin-right: 5px" />
  <div style="margin-left:30px"> Source:
    <a href="" title="Freebase &ndash; The World's Database">Freebase</a>
    &ndash; The World&apos;s Database<br />
    &quot;<a href="${freebase_id}" title="${label}:
    Freebase &ndash; The World's Database">${label}</a>&quot; Freely licensed under
    <a href="">CC-BY</a>.
  </div>
</div>
<div id="attribution"></div>

The Freebase attribution in the middle is mostly boilerplate and we can fill in the Freebase id and topic label from our HIT data using ${freebase_id} and ${label}. The description and attribution divs will be filled in using AJAX:

<script src=""></script>
<script src=""></script>
<script type="text/javascript">
  // MQL query requesting the topic's description article and its
  // Wikipedia source URI.
  var envelope = { query : {
    id :   "${freebase_id}",
    type : "/common/topic",
    article : [{
      id : null,
      "/common/document/source_uri" : null
    }]
  } };

  // Request the article id and source URI from Freebase.
  jQuery.getJSON(
    "",  // mqlread service URL
    { query : jQuery.toJSON( envelope ) },
    processIdRequest
  );

  function processIdRequest( response ) {
    if ( response.code == "/api/status/ok"
         && response.result
         && response.result.article ) {
      jQuery.each(response.result.article, function() {
        requestDescription( this.id );
        addAttribution( this["/common/document/source_uri"] );
      });
    }
  }

  function requestDescription( id ) {
    jQuery.getJSON(
      "" + id + "?callback=?",  // raw-content service URL + article id
      processDescriptionRequest
    );
  }

  function processDescriptionRequest( response ) {
    if ( response.code == "/api/status/ok"
         && response.result
         && response.result.body ) {
      jQuery("#description").html( response.result.body );
    }
  }

  function addAttribution( uri ) {
    var id = uri.substr(uri.lastIndexOf('/') + 1);
    jQuery("#attribution").html(
      "The original description for this topic was automatically generated from the " +
      "<a href=\"" + id + "\">Wikipedia article \"" +
      "${label}\"</a> licensed under the <a href=\"\">" +
      "GNU Free Documentation License.</a>"
    );
  }
</script>

Since Mechanical Turk fills in the HIT template variables prior to rendering the web page, we can fill in the Freebase topic id and label where needed, such as in the query to Freebase’s API and the Wikipedia attribution text.  The query to Freebase, represented by envelope, requests both the topic description’s article id and the Wikipedia source URI.  The script uses jQuery to request the data, and processIdRequest passes the article id on to requestDescription and the Wikipedia source URI to addAttribution.  requestDescription then uses Freebase to look up the description of the topic given its article id.  Finally, since the Wikipedia source URI isn’t an actual link and looks like http://wp/en/1194195, addAttribution parses out the article id and generates a link to the actual Wikipedia page in the attribution.

Twitter search ain’t that bad

My friend Daniel Tunkelang recently argued that Twitter isn’t a search engine, and that its search engine isn’t hard to build.  While I certainly agree that Twitter is not a search engine, I disagree with several of his comments regarding its search engine.  As an advocate of HCIR, I’m surprised Daniel didn’t take the time to think about what makes searching tweets different.

  1. Daniel argues that in order for there to be “search,” there must be an information need.  I use Twitter frequently to gauge sentiment and quantity of discussions around products.  Twitter search can greatly aid my decision making process of whether or not to use a product (such as an open source tool).
  2. It is arguable that the recency of a tweet is a very large component of relevance.  Twitter is a way of discussing and interacting with others about what is happening right now.  A reverse chronological ordering of tweets makes sense.
  3. At 140 characters, traditional text ranking algorithms might not perform well.  There’s very little context within a tweet to determine how well such a short tweet matches a short query.  Ordering by well-tuned, state-of-the-art text retrieval approaches such as Okapi BM25 may give bad orderings on tweets.
  4. At 140 characters, the entire tweet can be comfortably displayed in the result list.  There is no need for snippet generation, and query term highlighting facilitates quick scanning of the results.  Aspects of relevance other than timeliness may be easier to gauge than in other search tasks.
  5. Given the emphasis on timeliness, the search engine must index tweets in real time and make new tweets available very quickly.  While this is no doubt possible with in-memory inverted indexes, and many people have built this functionality, textbooks rarely address the issues of searching and indexing documents in real time.  The real-time demands on Twitter search are greater than those on other web search engines.
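A toy sketch of the real-time flavor of point 5 (nothing like Twitter’s actual system): an in-memory inverted index where a tweet becomes searchable the moment it is posted, with results returned newest-first.

```javascript
// A tiny in-memory inverted index.  post() makes a tweet searchable
// immediately; search() returns matches in reverse chronological
// order, reflecting recency's large role in relevance here.
function TweetIndex() {
  this.tweets = [];    // tweets in arrival order
  this.postings = {};  // term -> indexes of tweets containing it
}

TweetIndex.prototype.post = function (text) {
  var idx = this.tweets.length;
  this.tweets.push(text);
  var terms = text.toLowerCase().split(/\s+/);
  for (var i = 0; i < terms.length; i++) {
    var list = this.postings[terms[i]] = this.postings[terms[i]] || [];
    if (list[list.length - 1] !== idx) {  // skip duplicate terms in one tweet
      list.push(idx);
    }
  }
};

TweetIndex.prototype.search = function (term) {
  var hits = this.postings[term.toLowerCase()] || [];
  var tweets = this.tweets;
  return hits.slice().reverse().map(function (i) { return tweets[i]; });
};
```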

Now that I’m done defending Twitter search, I agree there are things they could do better.  Daniel alludes in the comments to one dimension that is currently not reflected in search results: the influence of the Twitter user.  There are search tasks where who is participating in the discussion is just as important as when the statement was made.

Also, I mentioned that I use Twitter to measure activity and sentiment.  Twitter search does little to summarize or aggregate results.  External tools have done a better job of using tweets to measure and show the volume of discussion.  Measuring sentiment from 140 characters is difficult, but it may be possible to measure general reaction as an aggregate.

Close Reading

My girlfriend Christine Mahady (Chris) and I recently met with my friend Alexander Flurie to discuss a potential project.  The project revolves around making a tool to assist in performing close readings in an educational environment.

Close reading is a technique used in literary criticism which focuses on the careful analysis of the language of a text.  These analyses are typically done on passages or short texts and critique low level features of the text, such as word choice, syntax, and the structure and flow of the ideas expressed in the text.  Performing a close reading often begins with a careful annotation of the text, making notes on observations of structure, tone, rhetorical devices, or anything about the text the reader notices.  During this process, the reader will also look up any unfamiliar words, make note of word choice, such as the use of slang, and so on.

Interestingly, we aren’t the only people to consider doing close reading online.  The Golden Notebook Project launched about a month ago.  Seven authors and critics are collaborating to do a close reading of Doris Lessing’s The Golden Notebook.  Their goals are different from ours; theirs is an interesting case study on one novel.

Our goals are more focused on making it easy for instructors to create close readings in educational environments.  In order to ease adoption, we’d like to seed the tool with publicly available texts, such as those found in Project Gutenberg.  We also want to allow instructors to easily create discussion groups for students, where all students’ annotations could be shared for collaborative reading or kept private for individual assignments.  Finally, we’d like this tool to allow the instructor to import annotations they’ve previously created for the text into a group, so that instructors can easily reuse their own annotations from semester to semester.  I’m personally very excited about this side project, because I think it is a great example of how online tools can be easier to use than traditional tools.

Popularity, subjectivity and polarity in blogs

My next post in the What makes a blog post popular? series at the mSpoke blog is now up.  This post looks at subjectivity and polarity in blog posts.  I think it makes a small but interesting complement to some of the other recent discussions of subjectivity, including Lexalytics’ analysis of the reaction on Twitter to the Motrin ad and Matthew Hurst’s ongoing series of posts on sentiment mining.

I didn’t report the equivalent per-feed analysis I had performed in the previous post; there was no correlation after that transformation.  I wanted to focus on highlighting the very slight positive correlation without bogging the post down with a lot of additional analysis.

Dojo charts that sing: box and whisker plots

A big part of my job at mSpoke is to understand our users, system performance, and our ability to deliver relevant content.  In order to facilitate this, we’ve been building a dashboard.  The dashboard is a group of web pages that enables us to easily monitor different aspects of our systems’ status.  I really can’t overstate the importance of having the right tools available in order to do data analysis.  Our dashboard is an interactive, frequently updated suite of web pages and graphs that allow us to easily diagnose issues in our system, observe topical trends in live web content, and better understand which content receives our users’ attention.

We use a lot of Dojo at mSpoke, so it was an easy choice for us to use Dojo Charting to create the plots in our dashboard.  However, it doesn’t yet include some chart types we’d like.  So I built them.  This post covers our custom box and whisker plot.

A box and whisker plot elegantly displays distributional statistics about a set of data.  The box covers the middle 50% of the data, the line in the box shows the median, the whiskers cover the range of points that are not outliers, and the outliers are indicated using markers.
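For concreteness, these statistics can be computed as follows, using the common 1.5 × IQR convention for outliers (a sketch; BoxAndWhisker.js may differ in details such as quantile interpolation).  The result uses the same field names accepted by addSeries later in this post.

```javascript
// Compute box-and-whisker statistics for a sample: quartiles, median,
// whiskers at the most extreme points within 1.5 * IQR of the box,
// and everything beyond those fences reported as outliers.
function boxStats(points) {
  var s = points.slice().sort(function (a, b) { return a - b; });
  function quantile(p) {  // linear interpolation between order statistics
    var pos = p * (s.length - 1);
    var lo = Math.floor(pos), hi = Math.ceil(pos);
    return s[lo] + (s[hi] - s[lo]) * (pos - lo);
  }
  var lbox = quantile(0.25), median = quantile(0.5), ubox = quantile(0.75);
  var iqr = ubox - lbox;
  var loFence = lbox - 1.5 * iqr, hiFence = ubox + 1.5 * iqr;
  var inside = s.filter(function (x) { return x >= loFence && x <= hiFence; });
  return {
    lwhisker: inside[0],
    lbox: lbox,
    median: median,
    ubox: ubox,
    uwhisker: inside[inside.length - 1],
    outliers: s.filter(function (x) { return x < loFence || x > hiFence; })
  };
}
```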

Dojo box and whisker chart

I’ve hosted an interactive version of the chart and the source code.  mSpoke is happy to share this code, and I would be happy to work with Dojo to include it in later releases. [Update: in 2008 mSpoke placed this code into public domain, with no warranty of any kind. I’ve updated the source code to reflect this status.]

Using the box and whisker plot

Creating a box and whisker plot is very similar to the creation of other Dojo charts.  In addition to including dojo.js, you’ll need to include BoxAndWhisker.js:

<script type="text/javascript"
  src=""></script>

The javascript to create the plot is simple:

    var chart = new dojox.charting.Chart2D("chart");
    chart.addPlot("default", {type: "BoxAndWhisker", markers: true});

I’ve included two different ways to pass in the data for a series.  The first is to simply list all of the data points; BoxAndWhisker.js will compute the median, middle 50% of points, whiskers and outliers.

    var fill = new dojo.Color([0x00, 0x4f, 0x80, 0.7]);
    var stroke = {color:fill};

    // List all points approach
    chart.addSeries( "Mon",
      [4.4, 6.7, 5.7, 4.6, 4.1, 4.4, 4.3, 4, 3.5, 6.2, 6.5, 0.5],
      {stroke:stroke, fill:fill, center:1, width:0.2});

You must specify the center and width for the box to display properly.  The center indicates the x-coordinate where the box and whisker for the series will be placed, and the width indicates the width of the box along the x-axis.  You can specify the range and labels of the axes using chart.addAxis.  View the source of the interactive version for an example.

The other approach is better suited for handling large data sets, when passing in all of the data would make the script run slowly.  In this case, you need to figure out the box and whisker coordinates yourself.

    // Pass in box and whisker coordinates and outliers
    chart.addSeries( "Tue",
      {lwhisker: 1.4, lbox: 2.9, median: 3.35, ubox: 5.1, uwhisker: 7.5,
       outliers: [0.2, 8.9]},
      {stroke:stroke, fill:fill , center:2, width:0.2});

lwhisker and uwhisker are the lower and upper points for the whiskers, lbox and ubox specify the range of the box, and median is the coordinate for the median line in the box.  You can specify the list of markers with the outliers parameter.  Complete source for constructing a box and whisker plot is available in the source of the interactive version.

Creating custom plot types

Building a custom Dojo plot isn’t extremely hard, but it takes some patience to get things just right.  The biggest quirk was that I had to indicate in code that the BoxAndWhisker plot type is a part of dojox.charting.  I suspect I may be misunderstanding how to use the Dojo Toolkit correctly.  In order to make the chart visible to Dojo, BoxAndWhisker.js must call dojo.provide at the top of the file:

    dojo.provide("dojox.charting.plot2d.BoxAndWhisker");
Following the example in other Dojo plots, I used dojo.declare to declare the plot constructor.  As with the call to dojo.provide, I had to declare the plot to be a part of dojox.charting:

    dojo.declare("dojox.charting.plot2d.BoxAndWhisker",
                          dojox.charting.plot2d.Base, {

Otherwise, I was able to develop the chart by following the examples already provided in dojox.charting.plot2d.

What makes a blog post popular? series at mSpoke blog

I will be writing a series of blog posts for the mSpoke blog investigating correlations between the popularity of feed items and various properties of the items.  The first post looks for a relationship between popularity and reading difficulty.  Later posts will consider other properties of feed items, such as the use of opinionated language, topicality, and novelty.

Why we model comment counts

Last week, I wrote a very technical post on how I model the distribution of comment counts for an RSS feed in FeedHub.  I originally drafted the post to document its derivation.  It was non-trivial enough that I feel that there is some merit in sharing this information with others, but it started with the assumption that my readers either are interested in the derivation for its own sake or already understand the importance of modeling these distributions per input feed.  I’d like to correct that with a little discussion about how this information can be used.

FeedHub’s goal is to provide our users with a personalized feed that delivers only the most relevant posts.  This feed is unique to each of our users, and I use “relevant” here to refer to any post the user would like to see in their personalized feed.  By observing a user’s interactions with their feed items (such as clicking on the thumbs up or thumbs down icons), FeedHub tries to learn a model of which feed item attributes are predictive of the user’s interests.  These attributes may be inferred from the content of the feed item, such as the Wikipedia category based labels we assign to items.  Alternatively, they could be based on other attributes of the feed item, such as the number of comments people have made to that post.

The base assumption is that this comment count is predictive of whether or not the item is interesting to the user.  The more discussion around the item, the more likely it is to be interesting to a user.

However, it is important that we normalize the input to the model used in FeedHub such that the comment counts are useful predictors across all input feeds.  Here are some factors that may be important:

  1. Some feed sources have a lot of discussion around every item (e.g. Slashdot).  Other feeds may have many fewer comments per post (e.g. John Langford’s Machine Learning (Theory) blog).  If you’ve subscribed to both of these feeds in FeedHub, that implies you are probably interested in receiving content from both of these feeds.  Naively using the comment count directly as a feature in a learning algorithm is unlikely to work well; it’s hard to believe that the post with the most comments from John Langford’s blog is worse than the post with the fewest comments on Slashdot.  Some per-feed normalization is likely to be important.
  2. For a feed which typically receives a small number of comments per item, the difference between 1 comment and 10 comments is likely to reflect a larger difference in “interest” than the difference between 51 and 60 comments.  Careful normalization within a feed may be necessary.

These two concerns led me to hypothesize that a good way to normalize the number of comments an item has received would be to model the probability distribution of comments per item for each feed.  Given an observed number of comments x for an item, we can estimate P(X<x), the probability that another randomly selected item from that same feed will have fewer comments.  I’m not an expert in the combination of evidence, but this approach does mitigate the above concerns.  Using this probability estimate also has a nice intuitive interpretation: a normalized value of 0.9 means that we expect an item to have more comments than 90% of the items from that feed.
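A sketch of the empirical version of this estimate (the Bayesian estimation from the previous post smooths exactly this quantity):

```javascript
// Empirical estimate of P(X < x): the fraction of items in the feed
// with strictly fewer comments than the item being normalized.
function normalizeCommentCount(x, feedCommentCounts) {
  var fewer = feedCommentCounts.filter(function (c) { return c < x; }).length;
  return fewer / feedCommentCounts.length;
}
```

The same raw count normalizes very differently in a quiet feed than in a busy one, which is exactly the per-feed behavior we want.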

The desire for the Bayesian estimation of the distribution described in my previous post arises from the concern that we may not be able to estimate the distribution well from a small number of items in an input feed where posts are relatively infrequent.  Using a Bayesian estimate of P(X<x) allows us to reduce the bumpiness of the estimate compared to maximum likelihood estimates.

For example, consider coin flips.  If we observe two heads in two flips, the maximum likelihood estimator would suggest that the coin is unfair.  In practice, most people have a prior belief that the coin is unlikely to be unfair.  It would take many more coin flips all coming up heads for us to overcome that prior belief.  A Bayesian estimate formally models a similar estimation process.  When we have few observations of the data, our estimates will be close to those specified by our prior beliefs.  As we get more data, the estimates from the observations become more reliable and we place less emphasis on our prior beliefs.
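The coin example in numbers, as a sketch using a symmetric Beta prior (the prior strength here is a free parameter, chosen only for illustration):

```javascript
// Posterior mean of a coin's heads probability under a Beta(a, b)
// prior after observing `heads` heads in `flips` flips.  With little
// data the estimate stays near the prior mean a / (a + b); with lots
// of data the observed frequency dominates.
function posteriorHeadsProbability(heads, flips, a, b) {
  return (a + heads) / (a + b + flips);
}
```

Two heads in two flips moves a Beta(10, 10) prior only from 0.500 to about 0.545, while two hundred heads in two hundred flips pushes the same prior above 0.95.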