What to do with Big Data?
Making ML useful is a platform problem

By Peter Bailis - December 4, 2018

Big Data: Where’s the value?

With today’s excitement about AI and machine learning, it’s easy to forget that we were only recently enamored with Big Data and its promise of extracting value from high-volume, high-dimensional data. In just over a decade, Big Data changed our perception of data at scale: far cheaper data storage and processing led to a widespread shift, from treating data as a cost center requiring expensive data warehouses to treating data as an asset with huge value, waiting to be unlocked.

So, what happened to the promise of Big Data? Despite the widespread collection of data at scale, there’s little evidence that most enterprises are realizing this value efficiently. Data science remains one of the most in-demand skills, and several name-brand efforts based on bespoke and ad-hoc analytics have publicly struggled to scale and deliver on promises. In many warehouses, only a small fraction of data is ever used. Given its promise, it’s surprising that Big Data isn’t considered more of a failure.

Deep learning to the rescue…

Around the time we might have expected a “trough of disillusionment” for Big Data, we saw massive advances in machine learning capabilities via the rise of modern deep learning. Thanks to increased amounts of annotated data and cheap compute, deep networks roared onto the scene in 2012. New deep network architectures like AlexNet obliterated past approaches to machine learning tasks that relied on hand-tuned, manually engineered features. Given continued advances in hard tasks like object detection and question answering, it seems that extracting the value from Big Data is finally in sight.

While modern deep networks are a major advance for machine learning, they excel at processing data that’s different from much of the data stored in today’s Big Data lakes. Historically, deep networks have performed especially well on unstructured data, like visual and textual data. However, for making predictions using structured data (i.e., data with a common, repetitive, and well-defined schema) like transaction or customer records in a data warehouse, deep networks aren’t a panacea: in fact, on this structured data, much simpler models often perform nearly as well. Instead, the bottleneck is in simply putting the data to use.
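To give a sense of how little machinery a "much simpler model" needs on structured records, here is a minimal sketch of logistic regression trained by batch gradient descent in plain Python. The toy records, the feature names in the comments, and the hyperparameters are all invented for illustration; they are assumptions, not data or settings from any paper or product mentioned here.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability of the positive class for one record."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with full-batch gradient descent on log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y  # gradient of log loss w.r.t. the score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical structured records with two columns, both scaled to [0, 1]:
# (purchases_last_month, support_tickets). Label 1 means "customer retained".
rows = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.0], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(rows, labels)
predictions = [1 if predict(weights, bias, x) > 0.5 else 0 for x in rows]
```

The learned weights map directly onto the columns of the record, so one can read off which fields push a prediction up or down. That is part of why such models are attractive for warehouse-style data.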

One of my favorite examples of this phenomenon comes from Google’s recent paper on “Scalable and accurate deep learning with electronic health records.” Buried on page 12 of the Supplemental Materials, we see that logistic regression (appearing in lecture 3 of our intro ML class at Stanford) “essentially performs just as well as Deep Nets” for these predictive tasks, coming within 2-3% accuracy without any manual feature engineering.


A platform for putting data to work

For many use cases, putting data to work doesn’t require a new deep network, or more efficient neural architecture search. Instead, it requires new software tools. What does such a toolkit for using structured data at scale look like? At Sisu, we believe it will:

  1. Help navigate organizations’ existing data at scale. Modern organizations are sitting on massive amounts of data in warehouses like Redshift, BigQuery, and Snowflake. Displaying raw data in a table or set of dashboards is insufficient and impractical—the volume of this data is just too great. A usable ML platform will need to help users proactively identify where to look and how to respond for a given predictive task, in real time.
  2. Provide results users can trust. Deep networks are notoriously difficult to interpret, so why should we trust their output? Usable ML platforms must explain their rationale for making a given prediction or recommendation so a user can understand and verify their output. As a result, we believe black-box AutoML-oriented solutions that fail to earn user trust will only see near-term uptake for the lowest-value tasks.
  3. Work alongside users. Except for the most mechanical and precisely specified tasks like datacenter scheduling, we’re years away from complete automation of even routine business workflows. As a result, usable ML platforms will work alongside users, augmenting their intuition and their existing workflows. Users are smart, and ML platforms can make them smarter.

A usable analytics platform with these capabilities would enable fundamentally new platform architectures. In contrast with spreadsheet software or modern business intelligence, which are focused on a manual, user-driven interaction model, the vast amount of data available in a modern lake allows us to obtain high-quality results using weaker specifications from users; instead of requiring users to completely specify their queries of interest, we can infer user intent. Moreover, we can utilize historical interactions to perform personalized ranking and relevance, and predict future intents using variants of reinforcement learning.

Today, these technologies are common in consumer internet applications (e.g., Google’s keyword search, Facebook’s news feed) but are completely foreign to enterprise analytics settings. Given the volume of data available in data lakes, we can finally afford to apply these techniques to private, first-party data as found in modern organizations.

The fundamental challenge lies in leveraging this structured data effectively without requiring expert intervention. At Sisu, we have a strong hypothesis about how to do so. To learn more, sign up for access, or come work with us.

Illustration by Michie Cao.

