By Peter Bailis - December 9, 2020
In the last five years, a fundamental inequality in analytics has reversed. Conventionally, data and compute were critical bottlenecks. A typical enterprise data architecture consisted of many expensive data silos spread across on-premises servers; these servers were in turn carefully tended by small armies of database administrators and only accessed by a priesthood of analysts. In these settings, databases were expensive, and people were relatively cheap.
Today, we see the opposite. Cloud storage and cloud data warehouses like Snowflake and Amazon Redshift harness cloud economies of scale to enable fast and inexpensive query execution across elastic compute resources. These systems are unprecedented in both affordability and scalability.
To put this scale in context: in 1996, storing the entire 147.8GB internet crawl at Google required a dedicated large-scale storage array; today, that same storage costs $3.40 per month on Amazon’s S3 service. According to its recent S-1 filing, Snowflake processes an average of 507 million queries per day – over 162,000 per customer, every day. Consequently, the speed of compute is no longer a constraint in analysis – spend enough money with a cloud vendor, and today’s data infrastructure can return queries at virtually arbitrary speed.
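The arithmetic behind these figures is easy to check. Here is a back-of-the-envelope sketch; the S3 standard-tier price of roughly $0.023 per GB-month and a Snowflake customer count of roughly 3,100 at the time of the S-1 are assumptions from public sources, not figures stated in this article:

```python
# Back-of-the-envelope check of the figures above.
# Assumed inputs (not from the article): S3 standard storage at ~$0.023/GB-month,
# and roughly 3,100 Snowflake customers around the time of the S-1 filing.
s3_price_per_gb_month = 0.023      # USD, assumed standard-tier price
crawl_gb = 147.8                   # size of the 1996 crawl, from the article

monthly_cost = s3_price_per_gb_month * crawl_gb
print(f"1996 crawl on S3: ${monthly_cost:.2f}/month")   # ~$3.40

queries_per_day = 507_000_000      # from the S-1, as cited in the article
customers = 3_100                  # assumed
print(f"Queries per customer per day: {queries_per_day / customers:,.0f}")
```

With these assumed inputs, the per-customer query volume lands just above the 162,000 figure quoted above.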
As a result, the bottleneck in analytics has shifted from the expensive, and therefore limited, data management infrastructure to the people who author the queries and determine which questions are best to ask. But with more data available than ever before, the rate at which we can point and click through today’s analytics tools is a snail’s pace compared to cloud processing speeds. It can take days – and in some cases weeks – for even the most sophisticated analysts to understand which data is most relevant to a given business question, generate the hundreds of queries required to answer it sufficiently, and present the answer to the business in an interpretable, actionable format. Today, relative to our data architectures, it’s the people who are overwhelmingly expensive and slow.
Coping with this growing inequality between people and machines requires a new approach to analytics. In my work at Stanford with advanced tech companies like Google and Facebook, it became clear that it is no longer feasible to scale people in an effort to keep up with data. Rather, automation is required to reduce the human bottleneck in putting data to work and to unlock the potential of data at scale.
While total automation of the analytics process – removing the human bottleneck entirely – is tantalizing, this goal remains out of reach for today’s AI/ML capabilities. Today’s state-of-the-art ML models excel at repetitive, rote tasks in controlled environments. Playing Atari? Easy. Playing StarCraft? Okay. Highway driving on a clear day? Fine. Navigating a congested urban intersection with jaywalkers? Not so much. Making a single decision about marketing or product strategy? Not a chance.
The uncertain business environment analysts operate in requires them to navigate overlapping competitive, political, and strategic considerations. The world is messy, and enterprise data – although impressive in scope – is incomplete. (I have yet to meet an enterprise data team that explicitly modeled the possibility of a global pandemic in their forecasting methodologies.)
The resulting need is to augment the analyst by automating the rote, routine, and boring parts of analysis. Making a strategic recommendation? Let the human decide. Slicing and dicing the sales pipeline for the upcoming monthly business review? Automate it. Picking a methodology for measuring customer lifetime value (CLV)? Human. Looking for unexpected changes from forecast? Automate.
This cooperative approach optimizes the scarcest resource in analytics: human ingenuity and judgment. Humans are brilliant at formulating metrics and creatively determining appropriate responses. By building these collaborative loops, analysts can craft better strategy and more tactful responses to changes, with less drudgery and more efficiency. Moreover, given the inherent bias in many ML models, human-in-the-loop analysts retain their agency, employing good judgment and having the final say on tough moral and ethical cases.
While this augmented, collaborative approach may sound far-fetched, it is much closer than you might realize. When’s the last time you wondered whether a Google search result was correct? We’ve learned to take automated, high-quality web search and information retrieval – which only 20 years ago were manually maintained by large teams of people – for granted. Consumer applications like Facebook, Twitter, Netflix, and TikTok similarly rely on sophisticated models to highlight content and compete for user attention. These applications fall short of “full-service automation” – Netflix doesn’t choose a movie for me, and Google still doesn’t book my flights and hotels automatically – but they dramatically simplify the process of information discovery and decision-making.
However, employing these kinds of recommendation and filtering models in an enterprise context is a far greater technical challenge than in consumer settings. Internet services like Google and Facebook each leverage billions of user interactions as supervision, or statistical signal, for their ranking and relevance models. In contrast, even the largest enterprise analytics deployments only number in the low hundreds of thousands of users per day. Moreover, Google and Netflix provide recommendations based on a mostly homogeneous corpus – the public internet, and the Netflix catalog, respectively. In contrast, most enterprise data is private and will remain private – forever. This means relevance models are not easily transferable.
In a nutshell, the success of recommendation models in consumer internet scenarios shows that augmentation of common retrieval and ranking tasks is possible given a large corpus of user interaction data. While Augmented Analytics tools have access to less user supervision, the economic impact of even marginally better automation in asking, answering, and interpreting analyses is immense. In a world where every business decision-maker has access to more data than when Google launched, there are decisions totaling trillions of dollars of impact every quarter that stand to be better informed by data.
By building analytics tools that cooperate with human judgment, it’s possible to start with merely “good” results, then improve the models over time. The return on investment of even one well-timed analytics insight is easily quantifiable – for example, improving sales conversion rates in a mature enterprise software business by 0.1% can pay for an entire analyst team. Over time, a system can learn and increase that 0.1% to 0.2%, and beyond. Moreover, the data required to support that insight is, in many cases, already accessible to the business decision-maker who will leverage it. What’s missing are the tools to remove the bottleneck between data at scale and the decision-makers who stand to benefit from it. Augmented Analytics platforms are poised to fill this gap, and to rebalance the human bottleneck in the modern analytics stack.
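To make the 0.1% arithmetic concrete, here is a purely illustrative sketch; the pipeline size and analyst-team cost are hypothetical assumptions chosen for the example, not figures from this article:

```python
# Illustrative ROI sketch (all figures hypothetical, not from the article).
annual_pipeline = 1_000_000_000   # assume $1B in qualified annual pipeline
conversion_lift = 0.001           # a 0.1% (absolute) improvement in conversion
analyst_team_cost = 1_000_000     # assumed fully loaded cost of a small analyst team

incremental_revenue = annual_pipeline * conversion_lift
print(f"Incremental revenue: ${incremental_revenue:,.0f}")
print(f"Covers the analyst team: {incremental_revenue >= analyst_team_cost}")
```

Under these assumptions, a single 0.1% lift generates $1M in incremental revenue, enough to fund the team that found it; doubling the lift to 0.2% doubles that return.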