Designing Datasets: Four Principles to Advance Data Diagnosis

By Brynne Henn - June 11, 2020

For years, business intelligence (BI) tools and legacy analytics warehouses trained data teams to aggregate, simplify, and streamline complex datasets into narrow, normalized schemas to work within the tools’ limited capabilities.

But now, a new class of cloud-native warehouses — like Redshift, Snowflake, BigQuery, and Azure Synapse — are overcoming many of the scale and speed challenges faced by their predecessors, making it easier for companies to store all the rich, wide data they can capture. These new warehouses have eliminated the need to transform and simplify data before load.

That means it’s time to break some old BI habits. While desktop-BI tools still assume some of the limitations of older warehouse architectures, cloud-native diagnostic tools like Sisu are meeting the opportunity offered by these rich, flat tables. Aggregating and simplifying this data is like downsampling an audio file – you lose too much of the richness in the process.

To help data teams take a few steps backward to a richer set of features, we’ve put together these four principles to advance data diagnosis from our guide on Designing Better Datasets for Diagnostic Analytics.

1. Get granular: Tie each record to a unit of value

Whether you’re looking to diagnose changes in revenue, shifts in content consumption, or the weekly fluctuations in new player downloads for a mobile game, it’s critical to construct datasets where every row in the table ties directly to a unit of value.

For revenue and sales metrics, this means building tables where every row is an individual transaction. For trial conversion and customer retention use cases, account- or customer-level rows are most useful. For content engagement and game mechanics analyses, session-level data is usually optimal. In each case, there’s a direct tie to the KPI unit and the measure: Revenue per transaction, conversion and retention rates, or ARPU.

2. Flat is beautiful: Disaggregate and leverage the power of cloud-native platforms

The more features you can examine in a diagnosis, the more comprehensive and accurate your explanation can be.

By flattening out data, rather than aggregating it, you can look at individual records for counts, durations, interactions, and even SKUs and find those interesting interactions at a more actionable level.

You can also eliminate most of the functional dependencies often observed in datasets by disaggregating the data and shrugging off the restrictions of dashboard-based BI. When you’re not forced to pre-aggregate calculations like one-month, three-month, and twelve-month revenues, you can often get richer information from fewer features.