Solving the Data Analyst’s Dilemma:
Fast answers to more questions in a sea of data

By Callum McCann - January 26, 2021

If you’ve worked with data at any point in the past ten years, you’re probably familiar with two key parts of the analyst life.

1. The Buzzword of ‘Big Data.’ Everyone has their own definition of what it is, but we all know what it means — every company has more data than ever before. So much so that this industrious author has started a movement to kill that term. If everyone has Big Data, then no one has Big Data, and we can all go back to just calling it data.

2. The Endless List of Questions. Now that all data is Big Data, we’ve stumbled onto a problem that didn’t exist before. In the past, we were constrained by the number of questions that we could ask. Nowadays, we’re constrained by the number of questions that we can answer. And every answer spawns more questions in return. Business users, executives, customers: we’ve become a data-driven society, and people want answers now. What is causing my sales to go down week over week? What is driving my conversion up? Why are new users spending less time in our app? There aren’t enough analysts in the world to answer all the questions people are asking.

But answering the why is tricky. Let’s say you’re an analyst at an e-commerce company, and your company rolls out a new “Best Sellers” section of the homepage. Traffic to the featured products seems to suggest it’s a huge hit: two million events a day. Way to go, team!

But that’s not the end of the story. Now your boss is messaging you over Slack (or Teams but probably not Hangouts) and wants to know if the Best Sellers are increasing conversions, which you both define as a successful sale. And to throw another wrench into that mix, he wants to know if the new section is increasing conversions across ALL products, not just those listed as Best Sellers. Welp – goodbye weekend, hello endless SQL queries.

And even after you’ve run hundreds of grouping queries and tracked down what you think the answer is, can you really be sure that you checked every possibility? Human beings are notoriously affected by our own biases, even subtle ones that we might not consider when looking at data, and they may lead us to incorrect conclusions.

Worry not, dear sleep-deprived analyst; there are solutions to this problem that don’t require burning the midnight oil. In fact, this is one of the reasons that we built Sisu. We let the machines do what they’re great at (computation, aggregation, statistics) and allow humans to focus on doing what we do best — interpreting the results, adding the business context, and building the story to communicate.

Now you can set Sisu up, make yourself a beautiful cup of coffee (or tea but hey, years of being an analyst probably built up that caffeine addiction), and come back to have all the results analyzed, prioritized, and ready for you to figure out what’s actually driving change.

To show you what I mean, let’s continue with the previous example. Imagine you’re looking at the following dataset (simplified to three rows). Following your boss’s request, you’re most likely going to spend hours going through the dataset and trying to determine what factors are driving conversions. Is it all mobile customers whose final event was Dino_Jump? Users from google who looked at Slip-N-Slides? And what if these two combinations have overlap? Which one of them is actually driving change?

If we make some assumptions about this example data, we can determine how many combinations there are. For the purposes of this calculation, we’ll assume:

  • Source has four values – google, direct, ad, and unknown.
  • Is_Best_Seller is a true/false boolean value.
  • Geo_State has 56 values (50 states, 1 DC, 5 territories).
  • Device_Type has three values – mobile, web, and unknown.
  • Final_Event can contain any of the 50 event types.
  • Final_Product_Page can contain any of the 1800 pages on your website.
  • Cust_Age_Ranges has 10 different options.

Let’s limit ourselves to combinations up to 3rd order*. First-order combinations would be [4+2+56+…+1800+10], which equals 1,925 possible single factors to explain the change. Not too bad! But what about second order? That’s [4*2+4*56+…+1800*10], which equals 229,930. That’s a lot more to parse through and could take a hefty chunk of time!

Finally, what about 3rd order? That’s [4*2*56+…+50*1800*10], which equals 8,939,780. That’s way too many combinations to try and parse through!

So in total, even with this small dataset, there are roughly 9.2 million potential combinations (1,925 + 229,930 + 8,939,780 = 9,171,635) to check when analyzing conversions. Determining which of these combinations are statistically significant is a task that computers were built to perform.
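If you’d like to check that arithmetic yourself, here is a minimal Python sketch. It assumes the column cardinalities from the list above, counting Source as four values since four are listed (google, direct, ad, unknown). The dictionary keys and the count_combinations helper are names made up for this illustration, not part of any Sisu API.

```python
from itertools import combinations
from math import prod

# Distinct values per column, taken from the assumptions listed above.
# Key names are illustrative only.
cardinalities = {
    "source": 4,  # google, direct, ad, unknown
    "is_best_seller": 2,
    "geo_state": 56,
    "device_type": 3,
    "final_event": 50,
    "final_product_page": 1800,
    "cust_age_range": 10,
}


def count_combinations(cards, order):
    """Count filters that pin exactly `order` distinct columns to one value each."""
    # Pick `order` columns, then multiply their cardinalities to get the
    # number of value combinations for that particular set of columns.
    return sum(prod(group) for group in combinations(cards.values(), order))


counts = {order: count_combinations(cardinalities, order) for order in (1, 2, 3)}
total = sum(counts.values())
print(counts)  # {1: 1925, 2: 229930, 3: 8939780}
print(total)   # 9171635
```

Adding one more column, or going to 4th order, makes the total grow quickly, which is the whole point: nobody is checking these by hand.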

We’re building Sisu with this problem in mind. It’s 2021, and we think it’s about time analysts stop needing to trawl through their data for answers.

Using a combination of machine learning, techniques from causal inference, and statistical significance testing, Sisu goes through each combination to determine which specific subpopulations are driving the most significant impacts on your metric column (e.g., Conversion). And it does this in seconds, for datasets much larger than my improbably small example above.
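To give a feel for what “statistical significance testing” means for a single subpopulation, here is a rough sketch of a textbook two-proportion z-test in Python. To be clear, this is not Sisu’s algorithm; it is a standard test I’m swapping in for illustration, and both the function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) under H0: both groups share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best single estimate if the two groups do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, invented for illustration only:
# a subgroup (say, mobile users) versus everyone else.
z, p = two_proportion_z_test(conv_a=120, n_a=1_000, conv_b=900, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Running one test like this is easy. Running millions of them is where it gets hard: at a 0.05 threshold, pure chance alone would flag a large number of combinations, so any real system also needs to correct for multiple comparisons and rank what survives.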

This approach means analysts can now spend their time contributing to long-term planning around unstructured decision making. You can have confidence that the answers you’re providing the business, the story that you’re crafting to answer a business question, are statistically significant, accurate, and surfaced by world-class algorithms written by people who understand your pain.

And did I mention that it’s fast? One of our users, Vanessa Cirannek, VP of Analytics at Housecall Pro, described the whole analytics process, from dataset creation to analysis to deliverable: “Before Sisu, if I asked one of my analysts a question, it would take them at least three days to come back to me with an answer. Now, I have an answer within the hour.”

Now, Sisu isn’t here to replace analysts — they’re our north star. Our goal is to augment analysts so that they can answer all the questions they’re given and free up their time so the midnight hunt for answers becomes a thing of the past.

Don’t believe me? Reach out to our team for a POC. In a few days’ time we can be running on top of your data warehouse and ending any late-night fact-finding missions.

* 1st, 2nd, and 3rd order refer to the number of factors that make up the combination. 1st order would be a single factor (geo_state = CA), 2nd order would be two factors (geo_state = CA AND cust_age_range = 18-24), and 3rd order would be three factors (geo_state = CA AND cust_age_range = 18-24 AND final_event = dino_jump).
