By Peter Bailis - August 26, 2021
Despite a recent proliferation of tools in the modern data stack, it’s unclear whether we’re seeing an unbundling of data tooling into many separate layers, or the first steps towards consolidation of data tools. The answer has a huge impact on end users of modern analytics.
Architecturally, modern data stacks look much different than they did even five years ago. We’ve seen an explosion of new types of tools for managing data, including dedicated systems for data pipelines, transformations, catalogs and governance, metrics definitions, data observability, and reverse ETL. While many of these ideas are not strictly new, their latest, cloud-native incarnations are enjoying unprecedented degrees of popularity and attention.
One popular interpretation of this explosion of data tools is that we are witnessing the “unbundling” of the data stack. Under this interpretation, classically monolithic data tools like data warehouses are being dismantled into constituent parts. These parts are in turn more modular and higher quality than their predecessors. For example, instead of bundling data movement and data transformation within a single ETL tool, we can utilize tools that are specialized for each (e.g., Fivetran and dbt).
This “unbundling” thesis is attractive in that it accurately captures the sheer interest in, and engagement with, the modern data stack – and it points towards a more “open” future of innovation and experimentation. This is arguably in users’ best interests, as it allows more flexible choice and less lock-in than quasi-open, vendor-controlled standards like LookML.
However, it’s also possible that this “unbundling” represents a temporary state of affairs. Specifically, under this alternative thesis – which we’ll call “consolidation” – the proliferation of data tools today reflects what will ultimately become a standard set of features within just a few discrete, consolidated layers of the data stack.
Consider the following analytics workflow:
Entire product categories today are devoted to solving each of these steps, and many steps require multiple platforms. This has a huge impact on consumers, who must not only purchase, provision, monitor, and manage 10+ different tools to complete this workflow, but also move data from SaaS provider to SaaS provider as it is processed, transformed, and computed.
Contrast this state of affairs with an idealized world in which the above processes are consolidated into a small set of large services:
Such a two-tier consolidated architecture is, in theory, far simpler and easier for users: data-driven work is performed in a limited number of platforms, enabling greater impact for data within an organization, less data movement, and greater economies of scale.
If this looks familiar, it should be: historically, this is roughly how data stacks were architected – with a monolithic data warehousing layer to capture and organize this data, and a business intelligence layer to access and analyze it (and probably a separate set of tools for data scientists to write models in programmatic languages).
So, if consolidation is so beneficial to users, why are we seeing “unbundling” now? My thesis is that this unbundling is a response to the rapidly-evolving demands on and capabilities of cloud data. Cloud data isn’t like on-premises data. On-premises, data is often siloed, doesn’t change much, and is a scarce commodity. In the cloud, data changes fast, comes from many (often standardized) sources, and – due to compute elasticity – can be scaled to serve thousands of people inside a modern organization. These cloud-specific changes require new capabilities – like data observability – and approaches – like automation of common, repetitive analysis tasks. In a nutshell, the data ecosystem is slowly rebuilding the warehouse and analysis layers to adapt to the new reality of cloud data.
The key challenge in completing this rebuilding and consolidation is that each consolidated layer represents a monumental undertaking, requiring substantial resources and excellent execution. It’s tempting to think that the major cloud providers will build each layer in its entirety, but based on the breakaway successes of products like Snowflake and Databricks, I believe these layers are up for grabs. And given that each likely represents a $100B+ company, many will try.
In the next two years, I expect we’ll see more attempts to consolidate the modern data stack, albeit in intermediate stages – for example, the consolidation of data pipelines with transformation, data catalogs with metrics layers, and dashboards with diagnostics. My hope is that, if this consolidation occurs and users benefit, we can still keep the “good parts” of unbundling along the way – not just more useful, modern features, but also open, interoperable APIs and standards for common tasks such as metrics definitions.
While it’s too early to tell who might “win” each consolidated layer, it’s clear that, in aggregate, the consolidated warehouse and analysis layers will look far different than their predecessors. Fundamentally, cloud-native architectures that can take advantage of cloud scalability and compute will be at a natural advantage. Moreover, much of the work – especially in the analysis layer – is spread across an absurd number of tools today – not just business intelligence, but also spreadsheets, docs, and slides. Consolidating this work has the potential to transform the future of work for every modern organization, and to redefine the future of data.