Proactive Analytics: Root-Cause Analysis vs. Outlier Detection

By Peter Bailis - June 16, 2020

Businesses have used dashboards and reports to understand what’s happening with their key metrics for decades. The problem is, these dashboards are typically static – analysts have to know what to ask and where to look in their data for answers. And with data changing faster than analysts can keep up, data teams are caught in a vicious, reactive cycle of noticing changes, digging in to diagnose them, and repeating – all while trying to stay on top of their normal work.

In response, a new wave of proactive analytics platforms is emerging to break this reactive pattern using root-cause analysis and anomaly detection. These proactive platforms help analysts focus their limited attention on the most useful and impactful answers their data can provide at each point in time. Just like we get suggestions for timely, relevant content on platforms like Netflix and Facebook, proactive data analytics can suggest key findings to review, as well as new questions to ask of data.

Two types of proactive analytics:
Automated anomaly detection and automated root-cause analysis

Currently, there are two major feature sets in proactive analytics platforms: anomaly (outlier) detection and root-cause analysis.

To understand the difference, let’s imagine we want to optimize user engagement for a recently released mobile game. We have a dataset that looks at engagement across 23 countries, 14 demographics, and 24,000 Android device types.

Automated anomaly detection identifies unusual or deviant behavior in a monitored metric, typically over time. With the data in our example, we could use anomaly detection to look for unexpected behavior across all 7,728,000 slices of the data (23 × 14 × 24,000 combinations of country, demographic, and device type). We might find that Polish Galaxy S7 owners on the latest Android operating system were 30% more engaged this week than last week.
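To make the slice count and the week-over-week comparison concrete, here is a minimal Python sketch. The dimension sizes come from the example above; the function names, the cohort keys, and the 25% threshold are assumptions chosen for illustration. A fixed relative-change cutoff is only the simplest possible detector, not how any particular platform works.

```python
from math import prod

# Dimension sizes taken from the mobile-game example above.
cardinalities = {"country": 23, "demographic": 14, "device_type": 24_000}


def count_slices(cards):
    """Number of fully specified cohorts (one value chosen per dimension)."""
    return prod(cards.values())


def flag_changes(last_week, this_week, min_rel_change=0.25):
    """Return cohorts whose metric moved by at least min_rel_change.

    Both arguments map a cohort key to its metric value. The threshold
    is a hypothetical default, not a recommended setting.
    """
    flagged = {}
    for cohort, before in last_week.items():
        after = this_week.get(cohort)
        if after is None or before == 0:
            continue  # no baseline to compare against
        rel_change = (after - before) / before
        if abs(rel_change) >= min_rel_change:
            flagged[cohort] = rel_change
    return flagged
```

Calling `count_slices(cardinalities)` reproduces the 7,728,000 figure, and `flag_changes` would surface a cohort like the Polish Galaxy S7 owners whenever its engagement moves by 25% or more.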

This type of discovery is finding the “needle in the haystack” – this cohort is statistically unusual, but it’s not clear whether this 30% increase has a meaningful impact, or lift, on our overall engagement. Moreover, since we’re treating each slice of data as a different metric, there are likely to be many such unusual cohorts – it’s hard to tell where to spend our time. We’ll need to investigate each cohort before we have actionable results. 

Finding an outlier is like finding a needle in a haystack

Automated root-cause analysis, on the other hand, identifies the key factors that contribute to a metric (i.e., deliver positive or negative lift), particularly as it changes over time. With our example above, we could use root-cause analysis (also called metric diagnosis) to look for countries, demographics, and device types (and combinations of those variables) that disproportionately contribute to overall engagement.

We might find that overall engagement was 2% higher this week compared to last week because of users in Spain on phones released in 2020. In contrast to automated anomaly detection, in this case, the actual engagement numbers for this cohort are not what’s important; rather, we’re interested in this cohort because it’s disproportionately contributing to our overall metric. 

Root-cause analysis identifies the key factors that contribute to a metric - not just one outlier

In another period, we may observe that even if the overall engagement metric is unchanged (e.g. flat week over week), the drivers behind the metric may have changed in significant ways (e.g., US engagement is down, European engagement is up, and overall flat).

This kind of “key driver analysis” / “root-cause analysis” typically looks at one metric at a time, and each slice of data represents a potential contributing factor. By identifying the slices with the largest impact on the overall metric, it’s easy to see where the biggest opportunities for improvement and the largest sources of lift come from.
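One simple way to picture this ranking is sketched below in Python. It assumes the metric is an additive total (for example, total sessions), so that each slice's contribution is just the change in its subtotal and the contributions sum to the overall change. The function names and the regional numbers are hypothetical, and production systems use more sophisticated methods than this.

```python
def slice_contributions(before, after):
    """Change in each slice's subtotal between two periods.

    For an additive metric, these changes sum to the overall change.
    """
    keys = sorted(set(before) | set(after))
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def top_drivers(before, after, k=5):
    """The k slices with the largest absolute contribution to the change."""
    contributions = slice_contributions(before, after)
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


# Hypothetical weekly engagement totals per region.
last_week = {"US": 1000, "EU": 800, "APAC": 500}
this_week = {"US": 880, "EU": 900, "APAC": 520}

overall_change = sum(this_week.values()) - sum(last_week.values())
drivers = top_drivers(last_week, this_week, k=2)
```

With these made-up numbers the overall metric is flat (`overall_change` is 0), yet `drivers` still ranks the US decline and the EU increase as the largest movements, which is the "flat overall, shifting drivers" situation described above.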

Anomaly detection and metric diagnosis are complementary, and it’s common to see them used together. Together, these techniques allow you to see what’s unusual and understand why it’s happening.

The many challenges of outlier detection

For most businesses, however, anomaly detection is less useful and less interpretable because getting anomaly detection right is hard. What counts as an “outlier” or an “anomaly” is subjective: an outlier to one business may be normal for another. The terms “outlier detection” and “anomaly detection” are no more informative than the phrase “machine learning classification.” In addition, if – like the examples above and many platforms in the market – we treat each slice of the data as a separate metric, the number of metrics to track multiplies. At scale, this makes anomaly detection more sensitive to noise and leads to a higher rate of false positives and “alert fatigue.”
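A quick back-of-the-envelope calculation shows why slicing inflates false positives. The Python sketch below assumes every slice is tested at the same per-slice false-positive rate and that nothing unusual is actually happening. The function names are hypothetical, and the Bonferroni correction shown is a standard textbook remedy, not something this article prescribes.

```python
def expected_false_alarms(n_slices, false_positive_rate):
    """Expected number of spurious alerts when no slice is truly anomalous."""
    return n_slices * false_positive_rate


def bonferroni_threshold(n_slices, family_error_rate):
    """Per-slice significance level that keeps the chance of any false
    alarm across all slices at or below family_error_rate."""
    return family_error_rate / n_slices


# Slice count from the mobile-game example: 23 * 14 * 24,000.
N_SLICES = 23 * 14 * 24_000

alarms = expected_false_alarms(N_SLICES, 0.001)
strict_alpha = bonferroni_threshold(N_SLICES, 0.05)
```

Even at a 0.1% per-slice false-positive rate, the 7,728,000 slices would produce roughly 7,700 spurious alerts per run. Tightening the per-slice threshold enough to suppress them (here, to well under one in a hundred million) also makes it much harder to detect real changes in small cohorts.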

Given enough human supervision and tuning, it is possible to do anomaly detection well in specific domains, but it’s extremely hard (i.e., equivalent to ‘solving all of AutoML / automated data science’) to actually get anomaly detection right in a general-purpose analytics platform.

The most successful applications of general-purpose outlier detection platforms tend to revolve around IT and fraud monitoring, where there’s a clear misconfiguration or bug that, left unattended, leads to extremely bad outcomes for the business. The resulting fixes that IT and engineering teams ship are valuable. From a statistical perspective, finding a metric that suddenly drops to zero or spikes very high due to a misconfiguration or bug is relatively easy.

In contrast, many “front office” business analytics tasks must capture complex consumer and operational behavior with many inputs, from customer behaviors to marketing channels and pricing packages. Finding the most actionable, impactful results is challenging for these applications, and for metrics like consumer spending, activation and churn, and gross margin. While every marketer would love to find “one weird factor” that explains 100% of conversions, such a magical factor rarely exists.

In fact, in our research at Stanford DAWN that led to Sisu, we initially started looking at anomaly detection and root-cause analysis together. In addition to the issues above, we found that in unsupervised and multivariate settings, the results of anomaly detection were hard to explain to users. For example, in a building occupancy setting, we probably care about high occupancy levels, but do we care about simultaneously high occupancy, low temperature, and low noise? And how do we explain it to users?

In contrast, we found that businesses often know what a good “threshold” for their metrics looks like. Teams already spot many of the important anomalies in their metrics manually, and spend more of their time on “why is a metric changing?” than on “is anything happening?”


The power of general-purpose root-cause analysis

We realized that root-cause analysis was far more powerful for helping businesses understand why something was changing and what they could take action on.

With root-cause analysis, it is possible to identify the factors and populations that make a difference in customer metrics with higher precision (i.e., fewer false positives). That is because these high-impact, actionable populations are more resistant to noise; in production, we found that many high-impact populations are also large, with thousands or more data points. The probability that a difference observed across thousands of data points arises from random chance is small – and we can quantify this probability using a range of methods, too! Moreover, instead of requiring a user to look through possibly thousands of statistically significant outliers, we could provide a “top five” list of factors that made a difference.
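As one illustration of quantifying that probability, the Python sketch below uses a two-proportion z-test, a standard textbook method chosen here for simplicity rather than the specific technique used in production. The function name and the conversion counts are hypothetical.

```python
from math import erfc, sqrt


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the hypothesis that both groups share one rate.

    Uses the normal approximation with a pooled standard error, which is
    reasonable only when both groups have enough observations.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # identical, degenerate groups: no evidence of a difference
    z = (rate_a - rate_b) / std_err
    return erfc(abs(z) / sqrt(2))


# The same 60% vs. 50% gap, observed in a small and a large cohort.
p_small = two_proportion_p_value(12, 20, 10, 20)
p_large = two_proportion_p_value(6_000, 10_000, 5_000, 10_000)
```

The observed rates are identical in both comparisons, but only the large cohort yields a very small p-value. That is the sense in which large, high-impact populations are more resistant to noise than the small cohorts that slice-level anomaly detection tends to surface.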

Diagnosis isn’t just a statistics problem – it’s also a workflow problem. The business metrics that stand to benefit most from diagnosis are already defined by business owners, in the form of OKRs and team KPIs. Moreover, analytics teams already diagnose changes in these metrics on a regular basis – for example, in weekly business reviews. But this diagnosis is manual, repetitive, and time-consuming, and has to be repeated for every business request. The idea of replacing this back-breaking manual diagnosis with a faster and more comprehensive automated process was a no-brainer for many analyst teams.

Analytics teams find it easy to fit diagnoses into their existing day-to-day work. And because analysts stay in the loop, they can provide feedback to iterate on and even improve results over time. IT and data engineering teams like this too – we can put their hard work curating and enriching their data lakes to work on a daily basis!


Proactive analytics with Sisu

At Sisu, we see the benefits of diagnosis in a range of industries first-hand. Samsung uses metric diagnosis to improve conversions on new marketing campaigns and product launches. Mixt, a fast-casual restaurant, relies on metric diagnosis to fine-tune store operations and improve customer loyalty programs. Upwork uses metric diagnosis to understand match rates in its two-sided marketplace. We also work with advertising campaign managers to understand what’s driving their spend, and online retailers to analyze how Instagram ads drive up-sell and cross-sell opportunities across their global customer base. 

Instead of finding needles in the haystack, Sisu helps identify meaningful opportunities based on factors already present in these organizations’ existing data. These opportunities are not only actionable by the business, but they can drive significant lift. By focusing on high-volume, high-dimensional data, we’ve shown that proactive analytics can scale across industries.

Sure, an analyst could have found these on their own – if they had enough time. Instead, Sisu reduces analyst time to diagnose by up to 98%, so analysts can spend more time making informed recommendations and less time doing work that can be automated. And since metrics are defined once a quarter – if not once a year – once Sisu is set up, it automatically notifies analysts when the factors behind their key metrics change.

This type of proactive data analytics isn’t easy, of course. There’s a high computational cost to enabling this diagnosis over the high-dimensional, high-volume data in most organizations — we routinely see data with hundreds of millions of records and billions of possible factors. You can’t run this on a laptop. To run this kind of fast, comprehensive diagnosis, you need a completely new distributed architecture for processing. That’s where Sisu comes in.

For us, proactive analytics is all about keeping up with the changes unique to your business. While there are plenty of anomaly detection tools, these are small but important components of a truly proactive data analytics platform. Our customers already know their businesses are changing. Sisu tells them why they’re changing, and helps analysts figure out the best response in the window of opportunity to act. If you’re ready to get proactive with your data, let us know!
