By Peter Bailis - September 2, 2021
The rise of analytics engineering as a discipline is one of the buzziest trends in data. If you aren’t familiar, analytics engineering involves “provid[ing] clean data sets to end users, modeling data in a way that empowers end users to answer their own questions… [and] transforming, testing, deploying, and documenting data.” This role encompasses many fields, and is closer to software engineering than conventional data analytics since it involves writing programmatic scripts and generally maintaining data as a living set of software-produced artifacts.
While there’s some debate about how analytics engineering and data analytics roles will evolve, I’m a firm believer that analytics engineering is a critical step forward for analytics: empowering humans to solve one of the most intractable Computer Science problems in analytics.
Despite heroic efforts to automate the processes of data modeling and Extract-Load-Transform/Extract-Transform-Load (ELT/ETL), these tasks are still too hard for computers to perform. In fact, they may never be automatable (or are very likely AI-complete).
Why? Every business and organization is different, with different concepts and entities to model, and the ways in which data is represented in these organizations are bound only by the limits of human creativity. There are common patterns in data modeling (from guidance on normalization, like star schemas, to canonical representations of events, like server logs) but, in the limit, there is no one true way to model every business. As a result, even as an academic, I viewed the field of algorithmic attempts at data cleaning and normalization as a tar pit, where “everything is possible, but nothing of interest is easy.”
The rise of the analytics engineer pays homage to the complexity of data cleaning and semantic modeling. It’s not enough to simply ship bits to a common warehouse (i.e., data engineering) — these bits need to be organized and mapped into concepts that the rest of the organization can use and rely on. Business users don’t care if tying product engagement metrics to retention rates requires joining two tables or seven; they just want the metrics and attributes they care about laid out and ready for analysis.
From this lens, the analytics engineering role is the missing glue that connects business context with an organization’s data. Because of the inherent complexity in that task, connecting business context to data requires people and some amount of code customized for each organization. The role of an analytics engineer is also getting better defined and more efficient, thanks to tools designed to work with humans at cloud scale, including new platforms for defining custom transforms (e.g., dbt), exporting semantic concepts to the business (e.g., metrics layers), and monitoring the results (e.g., data observability systems).
Analytics engineering is a good thing for data analysis inside a modern organization.
It’s become a trope that we have a shortage of analysts in the labor market, but it bears repeating: not only are there not enough analysts, but it’s actually not cost-effective to hire enough analysts to answer every question in a modern organization. While at Stanford, working with some of the most advanced teams in FAANG, like Google AdWords, I was struck by how few people actually had dedicated analysts: most people simply had access to data, but didn’t have access to a human to interpret that data for them. Today, as the CEO of a data analytics startup, I struggle to decide how much to invest in analytics: if analysts were free, every department would have one, maybe two analysts. But analytics isn’t free; we have the data for it, but not the resources.
By standardizing data sets and data models within a given organization, analytics engineers reduce friction in downstream analyses. Rather than spending hours or days struggling to define the right metric or pull the right columns and double-check their meanings, analysts can jump straight to the questions they want to ask. And, for routine and repetitive analyses, analysts can put the analysis on autopilot.
This latter point is especially interesting: most business users care about a handful of key metrics. So, if an analyst can identify the canonical data sources, metrics, and tables that represent the key metrics within a business unit, then, in theory, the questions that arise day-in, day-out are easier to automatically answer than ever. “Why is engagement down?” Check columns seven through ten, and seventy through ninety of the big flat table the analytics engineering team set up. “What happened in APAC last week?” Check columns four, seventeen, and twenty-two of the other big flat table.
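As a rough illustration of what this kind of "autopilot" check could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the flat table, its column names, its values, the `flag_metric_changes` helper, and the 10% threshold are all hypothetical, not a description of any particular tool. It simply compares the two most recent rows of a curated flat table and reports which key metrics moved noticeably.

```python
from typing import Dict, List

# Hypothetical flat table: one row per week, one column per curated metric.
# Column names and values are invented; rows are assumed to be in week order.
FLAT_TABLE: List[Dict[str, float]] = [
    {"week": 1, "daily_active_users": 1000.0, "sessions_per_user": 4.0, "retention_rate": 0.50},
    {"week": 2, "daily_active_users": 800.0, "sessions_per_user": 4.1, "retention_rate": 0.49},
]

# Hypothetical list of the columns a business unit treats as "engagement".
ENGAGEMENT_COLUMNS = ["daily_active_users", "sessions_per_user", "retention_rate"]


def flag_metric_changes(
    rows: List[Dict[str, float]],
    metric_columns: List[str],
    threshold: float = 0.10,
) -> Dict[str, float]:
    """Return {metric: relative change} for metrics whose relative change
    between the last two rows exceeds `threshold` in absolute value."""
    if len(rows) < 2:
        return {}  # nothing to compare against
    previous, current = rows[-2], rows[-1]
    flagged: Dict[str, float] = {}
    for column in metric_columns:
        baseline = previous[column]
        if baseline == 0:
            continue  # relative change is undefined for a zero baseline
        change = (current[column] - baseline) / baseline
        if abs(change) > threshold:
            flagged[column] = change
    return flagged


changes = flag_metric_changes(FLAT_TABLE, ENGAGEMENT_COLUMNS)
for metric, change in changes.items():
    print(f"{metric}: {change:+.1%} week over week")
```

A check like this only narrows down where to look. It says nothing about why a metric moved, which is the part that still needs a person who knows the business context.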
In this “autopilot” scenario, the analyst still plays a critical role: the translator between the business and the data. There is so much context required to truly understand what is relevant to a given business operator that isn’t usually captured in data or metadata, like the department’s OKRs and key activities in a given week or quarter. As a result, it’s often the analyst’s job to tie the business context to the data that’s available, and to obtain, curate, and join in new data for analysis as new questions arise. Empowering data engineers to curate and tend to the underlying data frees up the analyst to ultimately serve more of the business.
The upshot is that analytics engineering as a discipline is really a broader recognition of the multi-faceted role that analysts today already perform, and is one stratum of the analytics workflow that finally has some decent, human-centric tools with which to specialize and become more efficient. Over time, this stratification will likely continue: some analysts will be analytics engineers, and others will be metrics engineers, root cause analysts, scenario planners, and/or presentation-builders. And, as last-mile analytics tools get better, we’ll see more of a world in which everyone’s an analyst. Analytics engineers just happen to have one of the hottest and most rapidly advancing toolkits in the modern data stack; the rest of the stack just has to catch up!