The main goal of split testing is to compare the performance of multiple components or parameters. Naturally, we cannot do so without measuring this performance. Our performance metric may be a click-through rate, as in our related content widget example, or another conversion metric, such as the fraction of visits to a retail website that result in a purchase. In some cases it can also be important to measure the response time of the components being tested. This, coupled with other system monitoring, can help you estimate the cost of running the new component at scale. It may not be worth deploying a new algorithm if the improvement to the performance metric is marginal but the increase in running costs is large.
You will probably want to measure and report overall metrics for each testing group in real time. You may also have some predefined demographic groups that you wish to track. An obvious one for our related content example is to report logged-in users separately from users who are not logged in.
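As a minimal sketch of this kind of reporting, the Python below computes a click-through rate and a mean response time for each combination of test group and login status. The `Impression` record, its field names, and the sample values are assumptions made up for illustration; a real system would read these from its own event stream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Impression:
    """One display of the tested component (hypothetical record layout)."""
    group: str          # test group label, e.g. "A" or "B"
    logged_in: bool     # whether the user was logged in
    clicked: bool       # whether the impression led to a click
    response_ms: float  # time the component took to respond


def summarize(impressions):
    """Return per-(group, logged_in) counts, click-through rate, and mean response time."""
    totals = defaultdict(lambda: {"n": 0, "clicks": 0, "ms": 0.0})
    for imp in impressions:
        t = totals[(imp.group, imp.logged_in)]
        t["n"] += 1
        t["clicks"] += int(imp.clicked)
        t["ms"] += imp.response_ms
    return {
        key: {
            "n": t["n"],
            "ctr": t["clicks"] / t["n"],
            "mean_response_ms": t["ms"] / t["n"],
        }
        for key, t in totals.items()
    }


# Made-up records, for illustration only.
sample = [
    Impression("A", True, True, 40.0),
    Impression("A", True, False, 60.0),
    Impression("B", True, True, 120.0),
    Impression("B", False, False, 80.0),
]
summary = summarize(sample)
for key in sorted(summary):
    print(key, summary[key])
```

Reporting response time next to the click-through rate keeps the cost side of the comparison visible alongside the benefit.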
However, it is very important that you be able to revisit this data later and explore it in new ways. For users who are not logged in, it is important that you store the session information and the user’s actions. For users who are logged in, you will want this information in addition to your system’s internal user identifier. With this information, you can analyze your logs later and split logged-in users into groups by more subtle demographics, such as occupation, interests, and historical levels of activity.
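One way to keep this information is to log each action as a self-describing record and segment it afterwards. The sketch below is only an illustration under assumed names: the `LogEvent` fields, the JSON-lines encoding, and the `user_segment` lookup (standing in for whatever user attributes your system holds) are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LogEvent:
    """One logged user action (hypothetical schema)."""
    timestamp: str
    session_id: str
    user_id: Optional[str]  # None for users who are not logged in
    group: str              # test group the session was assigned to
    action: str             # e.g. "impression" or "click"


def to_json_line(event):
    """Serialize an event as one line of JSON, suitable for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


def clicks_by_segment(lines, user_segment):
    """Count clicks per (group, segment), looking segments up after the fact."""
    counts = {}
    for line in lines:
        event = json.loads(line)
        if event["action"] != "click":
            continue
        if event["user_id"] is None:
            segment = "anonymous"
        else:
            segment = user_segment.get(event["user_id"], "unknown")
        key = (event["group"], segment)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up events and attributes, for illustration only.
log_lines = [
    to_json_line(LogEvent("2010-03-01T09:00:00", "s1", "u1", "A", "click")),
    to_json_line(LogEvent("2010-03-01T09:01:00", "s2", None, "A", "click")),
    to_json_line(LogEvent("2010-03-01T09:02:00", "s3", "u2", "B", "impression")),
    to_json_line(LogEvent("2010-03-01T09:03:00", "s4", "u2", "B", "click")),
]
occupation = {"u1": "teacher", "u2": "engineer"}
segment_counts = clicks_by_segment(log_lines, occupation)
```

Because the segment is looked up at analysis time, the same log can be cut again later by interests or activity level without changing what was recorded.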
The value of this post hoc analysis cannot be overstated. It is not enough to simply choose the better of two methods; understanding when one method outperforms another, and where each approach is strong or weak, is essential for motivating the next round of experiments. Without it, you rely solely on your own intuition and ability to guess new methods. I don’t know about you, but I’d place more bets on data than luck.
When examining your data, it can be helpful to look at the same numbers in different ways. For example, consider hourly traffic to Wikipedia for the week of March 1 to March 7, 2010. I downloaded the data for these plots from http://dammit.lt/wikistats/. The following plots show traffic to the English and German language Wikipedia sites, with each hourly count normalized by dividing by the sum of all page views to that language’s site over the week. These plots will give us a sense of when users visit the two different sites, but will not be useful for comparing relative traffic from one language to another. While these data are not the results of a split test, I think they are helpful for illustrating the point that different representations of data can lead to different insights.
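The normalization described here is simple to express in code. The following is a small sketch; the function name and the four sample counts are invented for illustration and are not the Wikipedia figures.

```python
def normalize_counts(hourly_counts):
    """Convert raw hourly page views into each hour's share of the total."""
    total = sum(hourly_counts)
    if total == 0:
        raise ValueError("cannot normalize a series with no traffic")
    return [count / total for count in hourly_counts]


# Made-up counts, for illustration only (a real week would have 168 hourly values).
example_counts = [120, 80, 200, 100]
shares = normalize_counts(example_counts)
```

After this step every series sums to one, which is why the shape of each site's traffic can be compared but the absolute volumes cannot.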
This first plot shows a simple line plot with time on the horizontal axis and the percent of weekly traffic on the vertical axis. This is a very traditional representation for this type of data. Some things immediately jump out. The usage of the German language Wikipedia varies much more widely with time than the English Wikipedia, although we do see some time-of-day effects on English too. We also see lower traffic levels on both Wikipedia sites on the weekend (March 06 and March 07). In this plot it is quite easy to read the values of peaks and valleys, but it is harder to isolate trends across days.
Another way to present this data is to plot it with day on the horizontal axis and hour of day on the vertical axis. The width between the two lines for a day varies in proportion to the percent of traffic observed in that hour of that day. This presentation of the data allows us to see effects that weren’t obvious in the other plot. For example, we can see that usage on the German language website increased between 7 and 10 am on Saturday and Sunday, while on a typical weekday this increase was observed earlier in the day, between 6 and 9 am. Traffic to the German website also peaked later in the day on the weekends than on weekdays.
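The same change of view can be made on the numbers themselves by regrouping a flat hourly series into one row per day. The sketch below assumes the series starts at midnight and has 24 values per day; the helper names and the two-day sample are hypothetical and do not reproduce the Wikipedia data.

```python
def split_by_day(hourly_values, hours_per_day=24):
    """Regroup a flat hourly series into one list of values per day."""
    if len(hourly_values) % hours_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    return [
        hourly_values[start:start + hours_per_day]
        for start in range(0, len(hourly_values), hours_per_day)
    ]


def peak_hour_per_day(hourly_values, hours_per_day=24):
    """Return, for each day, the hour of day (0-23) with the highest value."""
    return [
        max(range(len(day)), key=day.__getitem__)
        for day in split_by_day(hourly_values, hours_per_day)
    ]


# Made-up two-day series: flat traffic with one busy hour per day.
day_one = [1.0] * 24
day_one[8] = 5.0
day_two = [1.0] * 24
day_two[10] = 5.0
peaks = peak_hour_per_day(day_one + day_two)
```

Lining the days up this way makes it easy to compare the timing of a peak from one day to the next, which is hard to read off a single continuous line.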
To conclude, measuring performance is perhaps the most important aspect of split testing. In addition to measuring key metrics such as click-through rates and conversion rates, it is wise to track the response time of new components to help estimate their cost. It is important to store enough data to later explore the results of a test in new ways, digging deeper to gain insights, such as which demographics were well served by the test and for which users the tested approach did poorly. During this investigation, it can be helpful to look at your data in multiple ways, because different presentations of the data may facilitate different insights. These insights can help you form a better mental model of your users and techniques, and motivate the next set of hypotheses you wish to explore.

