By Peter Bailis - August 11, 2021
Running a cloud data analytics platform is hard, and requires a lot of planning and execution around data governance, compute, and security. While there’s no silver bullet that can solve these problems, there are a few things that cloud warehouses could do to make it far easier to deploy modern analytics platforms at scale.
In the past three years at Sisu, I’ve been regularly surprised by how challenging it is to actually deliver cloud-based analytics to users. Unlike products like Airtable or Notion, where users input data that is created and lives in the application itself, most cloud analytics platforms like Sisu work on data that is collected and stored in external data warehouses. That means analytics platforms need to connect to and process data they don’t own themselves, and this data is often highly valuable and sensitive.
Conceptually, what cloud analytics platforms do is simple. Platforms like Databricks, Looker, and Sisu apply some kind of transformations and compute to users’ data, which is usually stored in another place in the cloud. In theory, this “just” requires spinning up some compute resources that are co-located with users’ data, and returning a result.
While this sounds simple, designing a compute mechanism that is flexible (i.e., allows a range of workflows), scalable, and secure is very hard in practice.
Databases have supported User-Defined Functions (UDFs) for decades as a secure and flexible mechanism for co-locating compute and data, and people have even built entire analytics systems on the UDF abstraction and in-database compute. Products like Google’s AutoML Tables are a modern take on this idea.
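For readers who haven't used one, here is a minimal sketch of the UDF abstraction using Python's built-in `sqlite3` module rather than a cloud warehouse. The `orders` table, the `zscore_flag` function, and the hard-coded mean and standard deviation are all made up for illustration; the point is only that the custom logic is registered with the database and then runs inside the query, next to the data.

```python
import sqlite3


def zscore_flag(value, mean, stddev):
    """Return 1 if value is more than two standard deviations from mean, else 0."""
    if value is None or stddev == 0:
        return 0
    return 1 if abs(value - mean) > 2 * stddev else 0


# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")

# Register the Python function as a 3-argument SQL function.
conn.create_function("zscore_flag", 3, zscore_flag)

conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 12.0), (3, 95.0)],
)

# The UDF is invoked by the query engine; no rows are exported first.
rows = conn.execute(
    "SELECT id FROM orders WHERE zscore_flag(amount, 11.0, 1.5) = 1"
).fetchall()
print(rows)
```

Here the function executes in the same process as the database, which is exactly what makes co-location convenient and what makes sandboxing third-party code hard in a shared, multi-tenant warehouse.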
Implementing cloud-based analytics platforms as UDFs inside of databases like Snowflake and BigQuery would be amazing. This would enable:
However, there are a few practical considerations standing in the way:
As a result, almost every cloud analytics provider – from ELT (Fivetran, dbt Labs, Census) to ML and visualization (Databricks, Sisu, Looker) – has to offer a separate compute environment with its own metering, billing, and security. Some cloud providers like Snowflake have offered interfaces like external functions that essentially ship data to a third-party API. But this mostly simplifies the process of invoking functions, not running them.
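To make the "invoking, not running" distinction concrete, here is a rough sketch of the remote side of an external function. It assumes the batched JSON shape Snowflake documents for external functions (a `data` array whose rows begin with a row number); the `handle_batch` and `score_row` names and the doubling logic are hypothetical placeholders, and a real deployment would also need an HTTP endpoint, authentication, and error handling.

```python
import json


def score_row(amount):
    """Placeholder for the provider's real analytics logic (hypothetical)."""
    return amount * 2


def handle_batch(request_body: str) -> str:
    """Process one batch of rows shipped out of the warehouse.

    Assumed request shape:  {"data": [[row_number, arg1], ...]}
    Assumed response shape: {"data": [[row_number, result], ...]}
    """
    payload = json.loads(request_body)
    results = [[row[0], score_row(row[1])] for row in payload["data"]]
    return json.dumps({"data": results})


# Example batch as the warehouse might send it.
request = json.dumps({"data": [[0, 10.0], [1, 2.5]]})
response = json.loads(handle_batch(request))
print(response)
```

Everything in `handle_batch` still runs on infrastructure the analytics provider has to host, secure, and bill for; the warehouse only takes care of calling it.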
Fortunately, consumers have gotten a lot more comfortable with third-party analytics services, and a new breed of SaaS security products like Vanta makes it easier than ever to attest to the security of a given analytics platform. Moreover, data sharing is getting a lot easier too, with Delta Sharing and Snowflake Secure Data Sharing. However, it can still take months to complete a true enterprise-grade security assessment, and the compute resources still need to live somewhere.
There’s a potential alternative that looks a lot more like a UDF: a kind of “secure enclave” in which third-party computation runs over hosted data. Specifically, a cloud warehouse vendor could offer the ability to co-locate compute natively within a trusted private environment, and then share data with that environment for processing.
This proposal is effectively what many analytics providers offer customers in the form of a “hybrid VPC” deployment – where analytics run inside of a customer’s cloud environment. That said, today’s hybrid VPC deployments aren’t turn-key, and require substantial work for the customer in terms of software deployment and management, networking, and monitoring. In contrast, a standardized enclave model could simplify much of this by templating and standardizing both the deployment process and the interfaces between data and compute. The economics of this kind of enclave are easy: a warehouse provider could charge for this compute, much like cloud providers themselves charge for compute.
There are a ton of hurdles to clear before this kind of co-located enclave model becomes a reality, but such an approach would help close the gap between theory and practice by making it as easy to consume cloud analytics as it is to consume services offered natively by major cloud providers. I wouldn’t be surprised to see cloud providers start to offer this kind of service in the next few years. It’d be a huge benefit to users, and to the pace of analytics in the cloud.