By Sid Sharma - July 16, 2020
Like any good story arc, data analytics has come a long way since its origins. The first phase of BI started with rigid, IT-owned systems. The second phase followed with a wave of more flexible, business-oriented tools that fostered a more business-facing Data Analyst mindset, along with a tsunami of pretty, easy-to-filter, but often static dashboards.
Today, with the rise of cloud-native data warehouses and advancements in scalable inference methods, we’re on the cusp of a third phase that not only affords better, faster processing of data but also lets operational data analysts impact business decisions like never before. I call this phase Data Analyst 3.0.
Before we look at the factors bringing in Data Analyst 3.0, let’s take a look at how far we’ve come. It used to be that a single person within the IT team could gain all the relevant domain and technology skills necessary to become a “data expert.” Data wasn’t big or wide, which meant that people could obtain new data skills (Excel, lightweight SQL, SAS, etc.) as problems arose, and the process of sending over a CSV to answer questions worked just fine.
But, from the organization’s perspective, most data requests failed in the handoff between IT and the business because technologists didn’t know how to make their data infrastructure consumable to an everyday Excel user. The queries that IT teams could deliver only answered a single question about a specific KPI. This had two major issues:
Fortunately, this system has largely disappeared over the last ten years alongside the rise of more business-centric data modeling, BI, and visualization tools. These modern tools define the second wave of BI and help make Data Analyst 2.0 a more agile member of the team.
Beneath these end-user tools, this second wave is supported by several platforms that make it easier to derive value from the vast amounts of data we’re storing. Collectively, these tools make up a modern analytics stack.
The exact evolution of this analytics stack is a fascinating topic, but I’ll save it for another post.
To navigate and maintain this stack efficiently, businesses needed more than just the IT team, so a few common roles emerged:
One way that you can think about the distinction in these roles is whether they act before or after the data is collected. Data Engineers are responsible for operations before the data is collected (and transformed), while Analysts and Data Scientists are responsible for operations after the data is collected.
As Google’s Cassie Kozyrkov mentions in one of her insightful posts, if your primary skill falls closest to that of a Data Analyst, chances are you feel left behind in your “technical” expertise by your Data Science counterparts. Even the job market views the Data Scientist role as a level up from yours. Only a few people realize that these two roles are entirely different from one another.
Data Scientists provide high-effort solutions to specific problems. If the issues they tackle aren’t worth solving, businesses end up wasting their time. They are narrow-and-deep workers, so it’s imperative to point them at problems that deserve the effort. To make good use of their time, you either need to be sure you already have the right problem, or you need a wide-and-shallow approach to finding one.
This is where a Data Analyst can help the business. A Data Analyst’s primary goal is to surf vast datasets quickly, liaise with the business stakeholders, and surface potential insights. Speed is their highest virtue. The result: the company gets a finger on its pulse and eyes on previously unknown unknowns. This generates the inspiration for decision-makers to select the most valuable quests for Data Scientists.
Unfortunately, many Data Analysts today are stuck in a quandary. They’re sitting on a treasure trove of rich, wide data, but they’re often torn between two roles: summarizing data to report key metrics, and the deep, comprehensive work of metric diagnosis.
These second-wave BI tools are well-equipped to create rich, rolled-up dashboards to answer ‘what has happened.’ However, in our experience, these dashboards often fall short in the precise moments when businesses need Analysts to add the most value – when something goes wrong (like those Monday morning meetings when the VP of Sales asks, “Why did sales drop 50% in EMEA last month?”).
Frequently, this is because these dashboards are built on simplified, aggregated views of data that put handcuffs on data exploration and diagnosis. If you only operate on aggregates, you can’t explore in detail. Part of creating aggregates is presupposing what questions people are going to ask, as if those questions were cast in stone. So, a Data Analyst commits up front to what they want to show in the finite real estate of a dashboard.
Sure, Analysts can add ‘filters’ or enable ‘drill-downs’ on dashboards, but as the columns and unique values within each column grow, the number of ways in which they could slice the data explodes (50 stores * 100 SKUs * 5 coupon codes * 20 cities … you get the point). This is why the dream of self-serve analytics morphed into death by a thousand filters.
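To make the arithmetic concrete, here is a minimal Python sketch that counts the slices implied by the numbers above. The dimension names are just labels for the example in the text, and the second count rests on an added assumption: that each filter can also be left unset ("all").

```python
from math import prod

# Distinct values per dimension, taken from the example in the text.
cardinalities = {"store": 50, "sku": 100, "coupon_code": 5, "city": 20}

# Slices where every dimension is pinned to one specific value.
full_slices = prod(cardinalities.values())

# Assumption: each filter may also be left unset ("all"), which adds
# one extra choice per dimension.
slices_with_all = prod(n + 1 for n in cardinalities.values())

print(f"{full_slices:,}")      # 500,000
print(f"{slices_with_all:,}")  # 649,026
```

Four modest columns already yield half a million distinct views, and every additional column multiplies that number again.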
When an ad-hoc question comes in, a Data Analyst often starts the diagnosis from scratch. It’s a manual process: writing SQL to fetch the granular data, adding relevant dimensions, and finally using Python/R to dig up the insight. The process is reactive, has to be repeated for every new question, and hampers decision-making velocity. The result: businesses end up in a situation similar to the first BI wave, with business stakeholders queuing up tickets, this time to get answers to their “why” questions.
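As a rough illustration of the last step in that manual process, the sketch below ranks the values of one dimension by how much a metric changed between two periods. Everything in it is hypothetical: the `change_by_dimension` helper, the column names, and the toy rows are made up for illustration, and a real analysis would run against granular data pulled from the warehouse.

```python
from collections import defaultdict

def change_by_dimension(rows, dimension, metric="sales"):
    """Return (value, delta) pairs for one dimension, biggest drop first."""
    totals = defaultdict(lambda: {"previous": 0.0, "current": 0.0})
    for row in rows:
        totals[row[dimension]][row["period"]] += row[metric]
    deltas = {value: t["current"] - t["previous"] for value, t in totals.items()}
    return sorted(deltas.items(), key=lambda item: item[1])

# Toy, made-up granular rows (hypothetical columns: period, region, sku, sales).
rows = [
    {"period": "previous", "region": "EMEA", "sku": "A", "sales": 100.0},
    {"period": "previous", "region": "EMEA", "sku": "B", "sales": 100.0},
    {"period": "previous", "region": "APAC", "sku": "A", "sales": 80.0},
    {"period": "current", "region": "EMEA", "sku": "A", "sales": 95.0},
    {"period": "current", "region": "EMEA", "sku": "B", "sales": 5.0},
    {"period": "current", "region": "APAC", "sku": "A", "sales": 85.0},
]

# Repeat the same breakdown for every dimension of interest.
for dim in ("region", "sku"):
    print(dim, change_by_dimension(rows, dim))
```

On this toy data the drop is concentrated in EMEA and in SKU B. The point is less the code than the repetition: each new “why” question means re-running a loop like this over a different set of dimensions.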
What Data Analysts need are faster, easier, and more comprehensive ways to build, monitor, and diagnose granular, high-dimensional datasets — a new paradigm that can quickly answer, “Why does the data look the way it does?” and “What changed from last week?” This new paradigm could piggyback on two recent technological advancements:
1. Cloud-native storage and compute: Today, not only is it cheaper than ever to store data in cloud warehouses, but it is also 25%-50% faster* to query data from big flat denormalized tables than star schemas – thanks to advancements in massively parallel processing. Based on this new reality, Analysts need to rethink how