Why aren't cloud analytics platforms just UDFs?

By Peter Bailis - August 11, 2021

Running a cloud data analytics platform is hard, and requires a lot of planning and execution around data governance, compute, and security. While there’s no silver bullet that can solve these problems, there are a few things that cloud warehouses could do to make it far easier to deploy modern analytics platforms at scale.

In the past three years at Sisu, I’ve been regularly surprised by how challenging it is to actually deliver cloud-based analytics to users. Unlike products like Airtable or Notion, where users input data that is created and lives in the application itself, most cloud analytics platforms like Sisu work on data that is collected and stored in external data warehouses. That means analytics platforms need to connect to and process data they don’t own themselves, and this data is often highly valuable and sensitive.

Conceptually, what cloud analytics platforms do is simple. Platforms like Databricks, Looker, and Sisu apply transformations and compute to users’ data, which is usually stored elsewhere in the cloud. In theory, this “just” requires spinning up some compute resources co-located with users’ data and returning a result.

While this sounds simple, designing a compute mechanism that is flexible (i.e., supports a range of workflows), scalable, and secure is very hard in practice.

As a secure and flexible mechanism for co-locating compute and data, databases have provided support for user-defined functions (UDFs) for decades, and people have even built entire analytics systems on top of the UDF abstraction and in-database compute. Products like Google’s AutoML Tables are a modern take on this idea.
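A related Google offering, BigQuery ML, illustrates what in-database analytics looks like in practice: model training and inference are exposed as SQL statements that run where the data lives, with no separate compute environment to manage. As a rough sketch (the dataset, table, and column names below are hypothetical):

```sql
-- Train a model entirely inside the warehouse, directly on warehouse tables.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT * FROM mydataset.customers;

-- Inference is just another SQL query against the trained model.
SELECT *
FROM ML.PREDICT(MODEL mydataset.churn_model,
                (SELECT * FROM mydataset.new_customers));
```

Because both statements execute inside BigQuery, data sharing, access control, metering, and billing all ride on the warehouse’s existing machinery — exactly the properties listed below.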

Implementing cloud-based analytics platforms as UDFs inside of databases like Snowflake and BigQuery would be amazing. This would enable:

  • Easier data sharing and data privacy controls
  • Easier metering and billing
  • Reuse of familiar user interfaces like SQL and database consoles

However, there are a few practical considerations standing in the way:

  • There’s no open standard for UDF definitions – UDF runtime support and languages vary quite a bit between databases in general, and cloud databases in particular. As a result, you’d have to reimplement functionality over and over to run on different databases.
  • There’s typically not a lot of compute available to UDFs. ML algorithms running in platforms like Databricks and Sisu can take a ton of cycles and memory, if only for short bursts. That means analytics run inside of a database environment will be slow, or may not run at all.
  • Supporting external services and dependencies is tough; for example, supporting analytics that make external network calls poses a number of security risks that are hard to mitigate.
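To make the portability problem in the first bullet concrete, here is roughly the same scalar JavaScript UDF written for BigQuery and for Snowflake. Note the different type names, quoting conventions, and Snowflake’s uppercased argument binding; even this trivial function (illustrative logic, hypothetical name) can’t be shared verbatim across the two systems:

```sql
-- BigQuery: FLOAT64 types, triple-quoted JS body, lowercase argument.
CREATE TEMP FUNCTION normalize_score(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return (x - 50.0) / 50.0;
""";

-- Snowflake: FLOAT types, single-quoted body, and the argument is
-- exposed to JavaScript as the uppercase identifier X.
CREATE OR REPLACE FUNCTION normalize_score(x FLOAT)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS 'return (X - 50.0) / 50.0;';
```

Multiply this divergence across runtimes, packaging, and resource limits, and reimplementing per-warehouse becomes the norm.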

As a result of this challenge, almost every cloud analytics provider – from ELT (Fivetran, dbt Labs, Census) to ML and visualization (Databricks, Sisu, Looker) – has to offer a separate compute environment, with its own metering, billing, and security. Some cloud providers like Snowflake have offered interfaces like external functions that essentially ship data to a third-party API. But this mostly just simplifies the process of invoking functions, not running them.
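As a sketch of the external-function pattern: in Snowflake, a UDF-like call can be bound to an external HTTPS endpoint through an API integration, so the warehouse ships batches of rows to a third-party service and collects the responses. (The integration name and URL below are placeholders, and the prerequisite `CREATE API INTEGRATION` setup is omitted.)

```sql
-- Bind a SQL-callable function to an external service endpoint.
-- The function body lives outside the warehouse; Snowflake only
-- handles serializing rows out and results back in.
CREATE OR REPLACE EXTERNAL FUNCTION score_rows(features VARIANT)
RETURNS VARIANT
API_INTEGRATION = my_api_integration
AS 'https://example.com/score';

-- Invocation looks like any other UDF call...
SELECT score_rows(OBJECT_CONSTRUCT('age', age, 'spend', spend))
FROM customers;
```

...but the actual compute still runs in the provider’s environment, with its own security posture and billing — which is exactly the limitation described above.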

Fortunately, consumers have gotten a lot more comfortable with third-party analytics services, and a new breed of SaaS security products like Vanta make it easier than ever to attest to the security of a given analytics platform. Moreover, data sharing is getting a lot easier too, with Delta Sharing and Snowflake Secure Data Sharing. However, it can still take months to complete a true enterprise-grade security assessment, and the compute resources still need to live somewhere.

There’s a potential alternative that looks a lot more like a UDF: a “secure enclave” for third-party computation to run over hosted data. Specifically, a cloud warehouse vendor could offer the ability to co-locate compute natively within a trusted private environment, and then share data with that environment for processing.

This proposal is effectively what many analytics providers offer customers in the form of a “hybrid VPC” deployment – where analytics run inside of a customer’s cloud environment. That said, today’s hybrid VPC deployments aren’t turn-key, and require substantial work from the customer in terms of software deployment and management, networking, and monitoring. In contrast, a standardized enclave model could simplify much of this by making the interfaces between data and compute, as well as the deployment process, templated and standardized. The economics of this kind of enclave are easy: a warehouse provider could charge for this compute, much like cloud providers themselves charge for compute today.

There are a ton of hurdles required to realize this kind of co-located enclave model, but such an approach would help close the gap between theory and practice by making it as easy to consume cloud analytics as it is to consume services offered natively by major cloud providers. I wouldn’t be surprised to see cloud providers start to offer this kind of service in the next few years. It’d be a huge benefit to users, and to the pace of analytics in the cloud.
