<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval on the Live Web &#187; Web ?.0</title>
	<atom:link href="http://livewebir.com/blog/category/web-0/feed/" rel="self" type="application/rss+xml" />
	<link>http://livewebir.com/blog</link>
	<description>by Paul Ogilvie</description>
	<lastBuildDate>Thu, 26 May 2011 18:47:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Engineering for experiments: measuring performance</title>
		<link>http://livewebir.com/blog/2010/03/engineering-for-experiments-measuring-performance/</link>
		<comments>http://livewebir.com/blog/2010/03/engineering-for-experiments-measuring-performance/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 17:20:24 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>

		<guid isPermaLink="false">http://livewebir.com/blog/?p=120</guid>
		<description><![CDATA[The main goal of split testing is to compare the performance of multiple components or parameters.  Naturally, we cannot do so without measuring this performance.  Our performance metric may be a click-through rate, such as in our related content widget example, or another conversion metric, such as visits to a retail website resulting in a purchase. [...]]]></description>
			<content:encoded><![CDATA[<p>The main goal of split testing is to compare the performance of multiple components or parameters.  Naturally, we cannot do so without measuring this performance.  Our performance metric may be a click-through rate, such as in our related content widget example, or another conversion metric, such as visits to a retail website resulting in a purchase.  In addition, in some cases it can also be important to measure the response time of the components being tested.  This, coupled with other system monitoring, can help you estimate the costs of running the new component at scale.  It may not be worth deploying a new algorithm if the improvements to the performance metric are marginal but the increased costs of running the algorithm are large.</p>
<p>You will probably want to measure and report global measures for each testing group in real time.  You may also have some pre-defined demographic groups of interest that you wish to track.  An obvious one for our related content example is to report logged in users separately from the users not logged in.</p>
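<p>As a minimal sketch of this kind of reporting, the Python below computes a click-through rate for each testing group, both overall and split by login status.  The event tuples and the function name are assumptions made up for illustration; a real system would read these events from its logs.</p>

```python
from collections import defaultdict

# Hypothetical event records: (test_group, logged_in, clicked).
events = [
    ("control", True, True),
    ("control", False, False),
    ("control", False, True),
    ("variant", True, True),
    ("variant", True, False),
    ("variant", False, False),
]

def click_through_rates(events):
    """Return {(group, logged_in): clicks / impressions}.

    The key (group, None) holds the overall rate for that group.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for group, logged_in, clicked in events:
        # Count every event once for the group total and once for its segment.
        for key in ((group, None), (group, logged_in)):
            impressions[key] += 1
            clicks[key] += int(clicked)
    return {key: clicks[key] / impressions[key] for key in impressions}

rates = click_through_rates(events)
```

<p>Keeping the overall and per-segment counts in the same pass means the two reports cannot drift apart.</p>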
<p>However, it is very important that you be able to revisit this data later and explore it in new ways.  For users who are not logged in, it is important that you store the session information and the user&#8217;s actions.   For users who are logged in, you will want this information in addition to your system&#8217;s internal user identifier.  With this information, you can later do analysis on your logs to separate user groups by more subtle demographics for logged in users, such as occupation, interests, and historical levels of activity.</p>
<p>The value of this post hoc analysis cannot be overstated.  It is not enough to simply choose the better of two methods; an understanding of when one method outperforms another can help motivate the next method to test.  A thorough understanding of the strengths and weaknesses of an approach is essential for the motivation of the next round of experiments.  Without it, you rely solely on your own intuition and ability to guess new methods.  I don&#8217;t know about you, but I&#8217;d place more bets on data than luck.</p>
<p>When examining your data, it can be helpful to look at the same numbers in different ways.  For example, consider hourly traffic to Wikipedia for the week of March 01 to March 07 2010.  I downloaded the data for these plots from <a title="Wikipedia Traffic Stats" href="http://dammit.lt/wikistats/">http://dammit.lt/wikistats/</a>.  The following plots show traffic to the English and German language Wikipedia sites, normalized by dividing by the sum of all page views to the language&#8217;s site.  These plots will give us a sense of when users visit the two different sites, but will not be useful for comparing relative traffic from one language to another.  While these data are not the results of a split test, I think they are helpful for illustrating the point that different representations of data can lead to different insights.</p>
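<p>The normalization described above can be sketched in a few lines of Python.  The counts below are made up for illustration and are not the Wikipedia data behind the plots; the function name is also an assumption.</p>

```python
def normalize(hourly_views):
    """Divide each hourly count by the total, so the shares sum to 1."""
    total = sum(hourly_views)
    if total == 0:
        raise ValueError("no page views to normalize")
    return [views / total for views in hourly_views]

# Made-up hourly page view counts for two sites of very different size.
english = [400, 600, 1000]
german = [10, 30, 60]

english_share = normalize(english)
german_share = normalize(german)
```

<p>After normalization both series are on the same scale, which is why the plots show when each site is busy but say nothing about how much traffic one site gets relative to the other.</p>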
<p>This first plot shows a simple line plot with time on the horizontal axis and the percent of weekly traffic on the vertical axis.  This is a very traditional representation for this type of data.  Some things immediately jump out.  The usage of the German language Wikipedia varies much more widely with time than the English Wikipedia, although we do see some time-of-day effects on English too.  We also see lower traffic levels on both Wikipedia sites on the weekend (March 06 and March 07).  In this plot it is quite easy to read the values of peaks and valleys, but it is harder to isolate trends across days.</p>
<p><a href="http://livewebir.com/blog/wp-content/uploads/2010/03/wikiLine.png"><img class="aligncenter size-full wp-image-123" title="Wikipedia Traffic, Line View" src="http://livewebir.com/blog/wp-content/uploads/2010/03/wikiLine.png" alt="" width="600" height="400" /></a>Another way to present this data is to plot the data with day on the horizontal axis and hour of day on the vertical axis.  The width between two lines for a day varies proportionally to the percent of traffic observed in that day&#8217;s hour.  This presentation of the data allows us to see effects that weren&#8217;t obvious in the other plot.  For example, we can see that usage on the German language website increased between 7 and 10 am on Saturday and Sunday, while on a typical weekday, this behavior was observed earlier in the day, between 6 and 9.  Traffic to the German website also peaked later in the day on the weekend than on weekdays.</p>
<p><a href="http://livewebir.com/blog/wp-content/uploads/2010/03/wikiViolin.png"><img class="aligncenter size-full wp-image-124" title="Wikipedia Traffic, Violin View" src="http://livewebir.com/blog/wp-content/uploads/2010/03/wikiViolin.png" alt="" width="600" height="600" /></a>To conclude, measuring performance is perhaps the most important aspect of split testing.  In addition to measuring key performance measures such as click-through rates and conversion rates, it is wise to track the performance of new components to help in the estimation of cost.   It is important to store enough data to later explore the results of a test in new ways, digging deeper to gain insights, such as which demographics were well served by the test and for which users the tested approach did poorly.  During this investigation, it can be helpful to look at your data in multiple ways, because different presentations of the data may facilitate different insights.  These insights can be used to help you form a better mental model of your users and techniques and motivate the next set of hypotheses you wish to explore.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2010/03/engineering-for-experiments-measuring-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Engineering for experiments: the role of modularity</title>
		<link>http://livewebir.com/blog/2010/02/engineering-for-experiments-the-role-of-modularity/</link>
		<comments>http://livewebir.com/blog/2010/02/engineering-for-experiments-the-role-of-modularity/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 03:14:45 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>

		<guid isPermaLink="false">http://livewebir.com/blog/?p=108</guid>
		<description><![CDATA[My last post outlined some initial thoughts about requirements for easy split testing of web sites.  In this post I assert that modular design is a critical component of any web system you wish to improve through testing. Software developers are well educated about the value of modular design for reuse and testing of correctness.  [...]]]></description>
			<content:encoded><![CDATA[<p>My <a title="Engineering for experiments" href="http://livewebir.com/blog/2010/01/engineering-for-experiments/">last post</a> outlined some initial thoughts about requirements for easy split testing of web sites.  In this post I assert that modular design is a critical component of any web system you wish to improve through testing.</p>
<p>Software developers are well educated about the value of modular design for reuse and testing of correctness.  This can also extend to testing the value of components.  By defining a clean interface for a component, it becomes possible to easily substitute other implementations or algorithms.</p>
<p>When we consider our related content widget example from the last post, we may have a new snippet generation algorithm we wish to compare to our existing algorithm.  By deploying a service implementing a standard snippet generation interface, a test framework could direct a portion of all requests to the new algorithm, recording this new component&#8217;s impact on click-through-rate.</p>
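<p>To make this concrete, here is a minimal Python sketch of the idea, under stated assumptions: both snippet algorithms, the <code>SplitTestRouter</code> class, and its method names are hypothetical and invented for illustration.  The only thing the two generators share is their interface, which is what lets the router swap one for the other.</p>

```python
import random
from typing import Callable, Optional, Tuple

# The shared interface: any callable taking (article_text, query) and returning a snippet.
SnippetGenerator = Callable[[str, str], str]

def leading_text_snippet(article: str, query: str) -> str:
    """Hypothetical existing algorithm: the first 40 characters of the article."""
    return article[:40]

def query_biased_snippet(article: str, query: str) -> str:
    """Hypothetical new algorithm: the first sentence that mentions a query term."""
    terms = query.lower().split()
    for sentence in article.split(". "):
        if any(term in sentence.lower() for term in terms):
            return sentence
    return leading_text_snippet(article, query)

class SplitTestRouter:
    """Sends a share of requests to the variant and counts impressions and clicks per arm."""

    def __init__(self, control: SnippetGenerator, variant: SnippetGenerator,
                 variant_share: float, rng: Optional[random.Random] = None) -> None:
        self.arms = {"control": control, "variant": variant}
        self.variant_share = variant_share
        self.rng = rng or random.Random()
        self.impressions = {"control": 0, "variant": 0}
        self.clicks = {"control": 0, "variant": 0}

    def generate(self, article: str, query: str) -> Tuple[str, str]:
        """Pick an arm at random, record the impression, and return (arm, snippet)."""
        arm = "variant" if self.rng.random() < self.variant_share else "control"
        self.impressions[arm] += 1
        return arm, self.arms[arm](article, query)

    def record_click(self, arm: str) -> None:
        """Record a click on a snippet that was served by the given arm."""
        self.clicks[arm] += 1

    def click_through_rate(self, arm: str) -> float:
        shown = self.impressions[arm]
        return self.clicks[arm] / shown if shown else 0.0
```

<p>A caller would construct the router with, say, <code>variant_share=0.1</code>, serve the returned snippet, and call <code>record_click</code> with the returned arm name when the user clicks.  Choosing the arm per request is the simplest option; splitting per user or per session is often preferable in practice.</p>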
<p><a href="http://livewebir.com/blog/wp-content/uploads/2010/02/modularity.png"><img class="aligncenter size-full wp-image-114" title="Components of a related content selection widget" src="http://livewebir.com/blog/wp-content/uploads/2010/02/modularity.png" alt="Components of a related content selection widget: article ranker and snippet generator" width="402" height="141" /></a>Complementary to modularity is the need to handle data flow between components.  Without the ability to handle data flow between components in a generic web architecture, the developer is left reimplementing these interactions routinely.  With data flow management, the web architecture can intercept the output of components, transform results, monitor component throughput, and manage split tests.</p>
<p>In some ways, a generic web architecture for testing, with its modular components and its need to handle data flow between them, is similar to an <a href="http://en.wikipedia.org/wiki/Enterprise_service_bus">enterprise service bus</a>.  An enterprise service bus is middleware which handles messaging between components of complex information architectures.  The bus handles data flow between components, which are frequently legacy systems or written in a variety of programming languages.  The routing, message handling, monitoring, and management support provided by many enterprise service buses may be attractive for the design and development of a web architecture for testing.  On the other hand, the multi-system support and extra bells and whistles, which may not be necessary for your web architecture, could add overhead and consume computing resources you would rather spend elsewhere.  Most enterprise service buses use XML as the communication language.  Because XML is a hefty representation of data, processing and transforming messages could add greatly to the latency of your website.  I personally have not evaluated any enterprise service bus for use with web systems, so it is possible that my fears are misplaced.</p>
<p>To conclude, modularity is an important component of any software system, and its use in a web architecture can facilitate web testing.  Coupled with modularity is the need for the test-oriented web architecture to handle data flow in a lightweight way.  Enterprise service buses provide some of this functionality, but may have more overhead than is desirable for use in a low-latency web system, where every millisecond counts.  Any existing middleware solution for handling data flow would need to be extended to provide additional support for testing.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2010/02/engineering-for-experiments-the-role-of-modularity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Engineering for experiments</title>
		<link>http://livewebir.com/blog/2010/01/engineering-for-experiments/</link>
		<comments>http://livewebir.com/blog/2010/01/engineering-for-experiments/#comments</comments>
		<pubDate>Tue, 05 Jan 2010 20:20:57 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>

		<guid isPermaLink="false">http://livewebir.com/blog/?p=90</guid>
		<description><![CDATA[It&#8217;s not hard to see the importance of performing tests on web systems.  We regularly hear how companies such as Amazon, Google, or Facebook employ A/B testing, also known as split testing and multivariate testing, to quickly test variations of their websites.  Deploying a test to a percentage of a site&#8217;s requests can quickly measure [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s not hard to see the importance of performing tests on web systems.  We regularly hear how companies such as Amazon, Google, or Facebook employ A/B testing, also known as split testing and multivariate testing, to quickly test variations of their websites.  Deploying a test to a percentage of a site&#8217;s requests can quickly measure the impact of a site change or a variation of an algorithm.  The impact of testing in a company can be greater than that of an isolated experiment measuring which color maximizes the click-through rate.  An infrastructure for and culture of testing can positively impact the entire organization.  An organization that doesn&#8217;t perform frequent tests is one doomed to adapt slowly or without information.</p>
<p>In order for a culture of experimentation to exist, it must be easy to perform these experiments and measure the results.  There are some great frameworks for offline experimentation.  MapReduce solutions are great for experimentation and examination of large data sets.  I&#8217;ve seen the importance of a good software architecture for experimentation in my own experience with the <a title="The Lemur Toolkit for Language Modeling and Information Retrieval" href="http://lemurproject.org">Lemur Toolkit</a>.  However, I haven&#8217;t seen much discussion of frameworks for performing online experiments in complex web systems.</p>
<p>This post considers, as an example, split testing for a related content widget.  Such a related content widget could be deployed on a news publisher&#8217;s article pages.  The widget would show related articles in a sidebar or below the content of the article.  We may wish to test variations of the related content widget to see if they increase some measure of success, such as the click-through rate.</p>
<p>Conceptually, a related content widget is quite simple.  There are two major components, one that returns a list of related items and one that renders the items on the web page.  While this system is quite simple conceptually, a typical implementation of such a system may be much more complex.  The renderer in the client may make use of a mix of XML, CSS, and HTML.  Parts of the rendering algorithm may also live server-side.  For example, if the user visited the web page from a search engine result page, the keywords could be used to create context sensitive snippets for the related content items.  A snippet generation algorithm would more likely be run on the server than on the client.</p>
<p>The experiments we may wish to perform on the related content widget could depend on any of the components used in the creation of the widget.  For example, here is just a small sample of things we may wish to test for the impact on the click-through rate:</p>
<ul>
<li>the color of the recommended item titles (CSS changes),</li>
<li>the number of items presented (parameters passed to the related content service), and</li>
<li>keyword highlighting, use of keywords in ranking and content selection, and custom snippet generation when the user arrives from a search engine result page (the ranking algorithm used by the related content service, the snippet generation algorithm, and possibly CSS changes).</li>
</ul>
<p>To me, this means that a test framework will interface with many of a web system&#8217;s components.  Here are some additional thoughts about the attributes I&#8217;d like a test management framework to have.  It should</p>
<ol>
<li>handle data flow between components,</li>
<li>track metrics such as click-through-rates and response time of components,</li>
<li>handle multiple parallel tests along with component dependencies,</li>
<li>be able to handle various user types (such as split by session, long-term cookie, or logged-in user),</li>
<li>be easy to register new tests,</li>
<li>handle deployment of code and files to servers,</li>
<li>verify to some degree the integrity of new code before returning results of the code to user requests,</li>
<li>be cloud-aware, and</li>
<li>be able to do all of this without site downtime.</li>
</ol>
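<p>As one illustration of the fourth attribute, splitting by session, long-term cookie, or logged-in user can all reduce to the same step: hash a stable identifier together with the test name and use the result to pick a group.  The Python below is a sketch under stated assumptions; the request dictionary, its key names, and both function names are hypothetical.</p>

```python
import hashlib

def split_unit(request: dict) -> str:
    """Choose the identifier to split on, preferring the most durable one available.

    The key names are assumptions for this sketch: a logged-in user id,
    then a long-term cookie, then the session.
    """
    for key in ("user_id", "cookie_id", "session_id"):
        value = request.get(key)
        if value:
            return f"{key}={value}"
    raise ValueError("request carries no identifier to split on")

def assign_group(test_name: str, unit_id: str, groups=("control", "variant")) -> str:
    """Deterministically map an identifier to a group for one named test.

    Including the test name in the hash keeps assignments in different
    tests independent of one another.
    """
    digest = hashlib.sha256(f"{test_name}:{unit_id}".encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]
```

<p>Because the assignment is a pure function of the test name and the identifier, no per-user state has to be stored for a visitor to see the same variation on every request, and parallel tests do not interfere with each other&#8217;s splits.</p>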
<p>I should acknowledge that there are some tools out there for managing split tests.  However, I believe (possibly in error) that they only scratch the surface of the requirements I&#8217;ve listed above.  Also, it may be unfair for me to talk of this as a test framework, because it&#8217;s really more of a web framework which has adequate support for testing.</p>
<p>I know I&#8217;m asking for quite a lot, but I believe these attributes are important for the creation of a culture of experimentation.  Over the next few weeks I hope to go into more detail about why these attributes are important and share some initial thoughts on how a framework may support these goals.  Since many of these ideas are still formative, I&#8217;d love to hear your own thoughts as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2010/01/engineering-for-experiments/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Twitter search ain&#8217;t that bad</title>
		<link>http://livewebir.com/blog/2009/03/twitter-search-aint-that-bad/</link>
		<comments>http://livewebir.com/blog/2009/03/twitter-search-aint-that-bad/#comments</comments>
		<pubDate>Sun, 08 Mar 2009 20:41:31 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>

		<guid isPermaLink="false">http://livewebir.com/blog/?p=78</guid>
		<description><![CDATA[My friend Daniel Tunkelang recently made the arguments that Twitter isn&#8217;t a search engine, nor is their search engine hard to build.  While I certainly agree that Twitter is not a search engine, I disagree with several of his comments with regard to their search engine. As an advocate of HCIR, I&#8217;m surprised Daniel didn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>My friend Daniel Tunkelang recently made the arguments that <a href="http://thenoisychannel.com/2009/03/05/twitter-is-not-a-search-engine/">Twitter isn&#8217;t a search engine</a>, <a href="http://thenoisychannel.com/2009/03/04/twitters-real-time-search-aint-that-hard/">nor is their search engine hard to build</a>.  While I certainly agree that Twitter is not a search engine, I disagree with several of his comments with regard to their search engine.  As an advocate of HCIR, I&#8217;m surprised Daniel didn&#8217;t take the time to think about what makes searching tweets different.</p>
<ol>
<li>Daniel argues that in order for there to be &#8220;search,&#8221; there must be an information need.  I use Twitter frequently to gauge sentiment and quantity of discussions around products.  Twitter search can greatly aid my decision making process of whether or not to use a product (such as an open source tool).</li>
<li>It is arguable that the recency of a tweet is a very large component of relevance.  Twitter is a way of discussing and interacting with others about what is happening right now.  A reverse chronological ordering of tweets makes sense.</li>
<li>At 140 characters, traditional text ranking algorithms might not perform well.  There&#8217;s very little context within a tweet to determine how well such a short tweet matches a short query.  Ordering by well tuned, state-of-the-art text retrieval approaches such as Okapi-BM25 may give bad orderings on tweets.</li>
<li>At 140 characters, the entire tweet can be comfortably displayed in the result list.  There is no need for snippet generation, and query term highlighting facilitates quick scanning of the results.  Aspects of relevance other than timeliness may be easier to gauge than in other search tasks.</li>
<li>Given the emphasis on timeliness, the search engine must be indexing tweets in real time and making new tweets available very quickly.  While this is no doubt possible with in-memory inverted indexes, and many people have built this functionality, textbooks rarely address issues of searching and indexing documents in real time.  The real-time demands on Twitter search are greater than those on other web search engines.</li>
</ol>
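<p>To illustrate the second and fifth points, here is a toy in-memory inverted index in Python in which a tweet becomes searchable the moment it is added and results come back newest first.  It is only a sketch under stated assumptions: the <code>TweetIndex</code> class is made up for this example, tokenization is naive whitespace splitting, and nothing here reflects how Twitter search is actually built.</p>

```python
from collections import defaultdict

class TweetIndex:
    """Toy in-memory inverted index with reverse chronological results."""

    def __init__(self):
        self.tweets = []                   # tweet id is the position in arrival order
        self.postings = defaultdict(list)  # term -> tweet ids, oldest first

    def add(self, text):
        """Index a tweet immediately and return its id."""
        tweet_id = len(self.tweets)
        self.tweets.append(text)
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)
        return tweet_id

    def search(self, query, limit=10):
        """Return tweets containing every query term, newest first."""
        terms = query.lower().split()
        if not terms:
            return []
        matching = set(self.postings.get(terms[0], []))
        for term in terms[1:]:
            matching = matching.intersection(self.postings.get(term, []))
        # Ids grow with arrival time, so descending id order is newest first.
        newest_first = sorted(matching, reverse=True)[:limit]
        return [self.tweets[tweet_id] for tweet_id in newest_first]
```

<p>Because ids are assigned in arrival order, no relevance scoring is needed to get the reverse chronological ordering argued for above.  The hard parts of a production system, such as concurrency, sharding, and persistence, are exactly what this sketch leaves out.</p>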
<p>Now that I&#8217;m done defending <a href="http://search.twitter.com">Twitter search</a>, I agree there are things they could do better.  Daniel alludes in comments that one dimension of relevance that is currently not reflected in search results is the influence of the Twitter user on relevance.  There are search tasks where who is participating in the discussion is just as important as when the statement was made.</p>
<p>Also, I mentioned that I use Twitter to measure activity and sentiment.  Twitter search does little to summarize or aggregate results.  External tools such as <a href="http://twist.flaptor.com">twist.flaptor.com</a> have done a better job of using tweets to measure and show volume of discussion.  Measuring sentiment from 140 characters is difficult, but it may be possible to measure general reaction as an aggregate.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2009/03/twitter-search-aint-that-bad/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Close Reading</title>
		<link>http://livewebir.com/blog/2008/12/close-reading/</link>
		<comments>http://livewebir.com/blog/2008/12/close-reading/#comments</comments>
		<pubDate>Fri, 12 Dec 2008 21:51:51 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>

		<guid isPermaLink="false">http://livewebir.com/blog/?p=47</guid>
		<description><![CDATA[My girlfriend Christine Mahady (Chris) and I recently met with my friend Alexander Flurie to discuss a potential project.  The project revolves around making a tool to assist in performing close readings in an educational environment. Close reading is a technique used in literary criticism which focuses on the careful analysis of the language of [...]]]></description>
			<content:encoded><![CDATA[<p>My girlfriend Christine Mahady (Chris) and I recently met with my friend Alexander Flurie to discuss a potential project.  The project revolves around making a tool to assist in performing close readings in an educational environment.</p>
<p>Close reading is a technique used in literary criticism which focuses on the careful analysis of the language of a text.  These analyses are typically done on passages or short texts and critique low level features of the text, such as word choice, syntax, and the structure and flow of the ideas expressed in the text.  Performing a close reading often begins with a careful annotation of the text, making notes on observations of structure, tone, rhetorical devices, or anything about the text the reader notices.  During this process, the reader will also look up any unfamiliar words, make note of word choice, such as the use of slang, and so on.</p>
<p>Interestingly, we aren&#8217;t the only people to consider doing close reading online.  <a title="The Golden Notebook Project" href="http://thegoldennotebook.org/">The Golden Notebook Project</a> launched about a month ago.  Seven authors and critics are collaborating to do a close reading of Doris Lessing&#8217;s The Golden Notebook.  Their goals are different from ours; theirs is an interesting case study on one novel.</p>
<p>Our goals are more focused on making it easy for instructors to create close readings in educational environments.  In order to ease adoption, we&#8217;d like to seed the tool with publicly available texts, such as those found in <a title="Project Gutenberg" href="http://www.gutenberg.org/">Project Gutenberg</a>.  We also want to allow instructors to easily create discussion groups for students, where all students&#8217; annotations could be shared for collaborative reading or private for individual assignments.  Finally, we&#8217;d like this tool to allow the instructor to import annotations they&#8217;ve previously created for the text into a group, so that instructors can easily reuse their own annotations from semester to semester.  I&#8217;m personally very excited about this side project, because I think it is a great example of how online tools can be easier to use than traditional tools.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/12/close-reading/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Summize + Twitter for WWDC</title>
		<link>http://livewebir.com/blog/2008/06/summize-twitter-for-wwdc/</link>
		<comments>http://livewebir.com/blog/2008/06/summize-twitter-for-wwdc/#comments</comments>
		<pubDate>Mon, 09 Jun 2008 16:06:55 +0000</pubDate>
		<dc:creator>pogil</dc:creator>
				<category><![CDATA[Web ?.0]]></category>
		<category><![CDATA[conversational search]]></category>
		<category><![CDATA[summize]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[wwdc]]></category>

		<guid isPermaLink="false">http://pogil.wordpress.com/?p=3</guid>
		<description><![CDATA[Twitter has linked to Summize's conversation search to show real-time coverage of tweets for WWDC.  It's great to see Twitter promoting Summize, but I felt overwhelmed by the volume of the search results.  I'd really like to see Summize provide some summarization, filtering, or categorization to their results to help me deal with the volume of tweets on popular subjects.]]></description>
			<content:encoded><![CDATA[<p>My friends at <a title="Summize" href="http://summize.com">Summize</a> have been doing some really interesting things.  Their first demo, now offline, was a new twist on product search.  They spidered for reviews from various sources and performed sentiment analysis to unify ratings and reviews.  I was particularly fond of their use of <a title="Stars and Bars" href="http://blog.summize.com/2008/01/stars-and-bars.html">stacked bar histograms</a> to efficiently summarize people&#8217;s sentiments.  Displays such as these convey much more distributional information than simply reporting averages.</p>
<p>Currently, Summize&#8217;s showcase product is a <a title="Twitter" href="http://www.twitter.com">Twitter</a> search application.  Their search application is nice because you can get real-time updates of matches as the tweets are happening.  Twitter&#8217;s recognized the value of Summize&#8217;s conversational search, and <a title="WWDC - Live Coverage" href="http://blog.summize.com/2008/06/wwdc---live-cov.html">linked</a> to Summize for live coverage of <a title="WWDC" href="http://developer.apple.com/wwdc/">WWDC</a>.  I think it&#8217;s great when small companies collaborate in these ways.  It&#8217;s also a smart move for Twitter today, given the troubles they&#8217;ve been having recently.  Diverting some of their traffic to Summize might help them handle today&#8217;s load.</p>
<p>Looking at the results of <a title="Twitter coverage of WWDC" href="http://summize.com/search?q=wwdc+OR+apple+OR+iphone+OR+%22steve+jobs%22">Twitter&#8217;s suggested search</a>, I feel a little overwhelmed.  Just a few seconds after clicking on the link, Summize&#8217;s search results informed me that there have been an additional 26 posts matching the search since the page was loaded.  Letting the page go for a minute without refreshing showed hundreds of new tweets.  That&#8217;s not terribly surprising, given the excitement around WWDC.  But here&#8217;s the rub.  Most of these tweets are useless to me or contain redundant information.  I want only the unique information, not the chatter.  For high volume topics of discussion, I really need the conversational search to filter or summarize the results for me.  Abdur and Eric, please bring some of the great summarization and organizational aspects of your product search to your conversational search.</p>
]]></content:encoded>
			<wfw:commentRss>http://livewebir.com/blog/2008/06/summize-twitter-for-wwdc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

