By Peter Bailis - October 3, 2018
During his seminal Graduate Computer Systems course at UC Berkeley, Eric Brewer offered the following anecdote:
Eric’s company Inktomi provided web search for major sites such as Yahoo! and ultimately peaked at a public valuation of $25 billion. Their massive success led to an unexpected problem: Inktomi had run out of capacity in their only datacenter. As a result, their growth was capped.
This was the mid-1990s, so Inktomi couldn’t just spin up servers like today on EC2. Instead, they had to lease a second datacenter, over 50 miles away.
Migrating their servers between these datacenters posed a serious challenge. Inktomi couldn’t simply turn their servers off and drive them to the new datacenter because they were serving live traffic. On the other hand, replicating their data would be complex, and full replication would mean doubling their server count.
Nevertheless, Brewer and colleagues devised a clever plan to guarantee 100% uptime without buying a single additional server.
This is seemingly impossible: in a regular database, if the data on any server (i.e., shard) is inaccessible, the engine can’t guarantee a correct result. In fact, some databases simply won’t execute queries if a single server is unavailable.
Inktomi’s clever insight was to rethink the definition of “correct.” In web search, some results are more relevant than others. Web search is subjective: when I search for “Kafka,” I may want to read the Wikipedia page for Franz Kafka the author, while you may be looking for docs for the Kafka message queue. While results vary in quality, even modern search engines like Google don’t make claims of optimality. Instead, we use search engines because they’re generally helpful and relevant.
Over the course of a weekend, Inktomi turned off half the servers and drove them to the new datacenter, serving queries from the original half. Then, they redirected traffic to the newly installed servers. This meant that during the migration, searches only reached one half of Inktomi’s servers.
While half of the Web was effectively excluded from searches during the migration, Inktomi provided 100% uptime. These search results might not have been as useful as the day before, but they were “good enough” for a weekend’s worth of queries. And today, these techniques are still widely used in distributed computing.
As machine learning and “Software 2.0” become dominant in software development, Inktomi’s lessons are increasingly relevant. Our modern compute stack – from compilers to runtimes to hardware – is built around a deterministic, precise model of computation. But statistical forms of correctness are becoming the new normal. Researchers have begun to explore the impact of this relaxed statistical correctness in domains including language design, concurrent programming, microarchitecture, sensor processing, and query optimization. But in practice, everything’s still up for grabs, and the mainstream software stack stands to be reinvented.
Google’s decades-long dominance in web search highlights the importance of scale and quality in ML-powered products: the highest quality results win. As more of the world’s workloads shift to ML training and inference, which are similar to web search in their statistical robustness, software that can efficiently exploit this robustness is becoming key to success.
If you’re interested in helping build the future of data analytics, join us at Sisu.
Illustration by Michie Cao.