Designing Datasets: Four Principles to Advance Data Diagnosis

By Brynne Henn - June 11, 2020

For years, business intelligence (BI) tools and legacy analytics warehouses trained data teams to aggregate, simplify, and streamline complex datasets into narrow, normalized schemas to work within the tools’ limited capabilities.

But now, a new class of cloud-native warehouses, like Redshift, Snowflake, BigQuery, and Azure Synapse, is overcoming many of the scale and speed challenges faced by its predecessors, making it easier for companies to store all the rich, wide data they can capture. These new warehouses have eliminated the need to transform and simplify data before loading it.

That means it’s time to break some old BI habits. While desktop-BI tools still assume some of the limitations of older warehouse architectures, cloud-native diagnostic tools like Sisu are meeting the opportunity offered by these rich, flat tables. Aggregating and simplifying this data is like downsampling an audio file – you lose too much of the richness in the process.

To help data teams take a few steps backward to a richer set of features, we’ve put together these four principles to advance data diagnosis from our guide on Designing Better Datasets for Diagnostic Analytics.

1. Get granular: Tie each record to a unit of value

Whether you’re looking to diagnose changes in revenue, shifts in content consumption, or the weekly fluctuations in new player downloads for a mobile game, it’s critical to construct datasets where every row in the table ties directly to a unit of value.

For revenue and sales metrics, this means building tables where every row is an individual transaction. For trial conversion and customer retention use cases, account- or customer-level rows are most useful. For content engagement and game mechanics analyses, session-level data is usually optimal. In each case, there’s a direct tie to the KPI unit and the measure: Revenue per transaction, conversion and retention rates, or ARPU.
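
As a rough illustration of the revenue case, here is a minimal Python sketch. The rows, the column names (`region`, `channel`, `revenue`), and the `revenue_per_transaction` helper are all hypothetical and only meant to show how a table with one row per transaction lets the same KPI be recomputed along any feature.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: one row per unit of value (a transaction).
transactions = [
    {"order_id": 1, "region": "US", "channel": "web", "revenue": 40.0},
    {"order_id": 2, "region": "US", "channel": "app", "revenue": 60.0},
    {"order_id": 3, "region": "EU", "channel": "web", "revenue": 20.0},
    {"order_id": 4, "region": "EU", "channel": "web", "revenue": 80.0},
]

def revenue_per_transaction(rows, by=None):
    """Average revenue per row, optionally grouped by one feature."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [revenue sum, row count]
    for row in rows:
        key = row[by] if by else "all"
        totals[key][0] += row["revenue"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

overall = revenue_per_transaction(transactions)
by_region = revenue_per_transaction(transactions, by="region")
by_channel = revenue_per_transaction(transactions, by="channel")
```

In this made-up data, both regions average 50.0 per transaction, while the channel split shows a gap between web and app. Had the table been pre-aggregated by region, that channel-level difference would no longer be recoverable.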

2. Flat is beautiful: Disaggregate and leverage the power of cloud-native platforms

The more features you can examine in a diagnosis, the more comprehensive and accurate your explanation can be.

By flattening out data, rather than aggregating it, you can look at individual records for counts, durations, interactions, and even SKUs and find those interesting interactions at a more actionable level.

You can also eliminate most of the functional dependencies often observed in datasets by disaggregating the data and shrugging off the restrictions of dashboard-based BI. When you’re not forced to pre-aggregate calculations like one-month, three-month, and twelve-month revenues, you can often get richer information from fewer features.
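
To make the point about pre-aggregated revenue windows concrete, here is a small Python sketch under stated assumptions: the rows and the `trailing_revenue` helper are hypothetical, and 30, 90, and 365 days stand in for one-, three-, and twelve-month windows. It shows that when the flat rows are kept, any window can be derived on demand instead of being stored as three overlapping columns.

```python
from datetime import date, timedelta

# Hypothetical flat table: one row per transaction, no pre-aggregated columns.
transactions = [
    {"customer": "a", "day": date(2020, 5, 20), "revenue": 30.0},
    {"customer": "a", "day": date(2020, 3, 15), "revenue": 50.0},
    {"customer": "a", "day": date(2019, 9, 1), "revenue": 20.0},
    {"customer": "b", "day": date(2020, 6, 1), "revenue": 10.0},
]

def trailing_revenue(rows, customer, as_of, days):
    """Sum one customer's revenue in the `days`-day window ending on `as_of`."""
    start = as_of - timedelta(days=days)
    return sum(
        r["revenue"]
        for r in rows
        if r["customer"] == customer and start < r["day"] <= as_of
    )

as_of = date(2020, 6, 11)
# Each window is computed from the same rows; none is stored in the table.
windows = {
    days: trailing_revenue(transactions, "a", as_of, days)
    for days in (30, 90, 365)
}
```

Because the three windows overlap, storing them as separate columns would add features that largely depend on one another. Keeping only the transaction date and amount avoids that dependency and leaves room for any other window later.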

3. Tell me more: Build the widest, most diverse set of features possible

Diversity in data, long the bane of BI tools and overtaxed analysts, is now the most valuable feature of any given dataset. It’s possible to explore millions of records in a few seconds and test hypotheses based on hundreds of unique factors.

In the past, the computational requirements of checking tens of billions of possible combinations in a dataset were prohibitive, but now cloud-native analytics platforms like Sisu can comprehensively test these spaces orders of magnitude faster than their predecessors.
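
For a sense of how quickly these spaces grow, here is a back-of-the-envelope Python sketch. The numbers are illustrative assumptions, not figures from Sisu: it supposes 1,000 independent yes/no conditions and counts every way to combine up to four of them, using the hypothetical helper `count_combinations`.

```python
from math import comb

def count_combinations(n_conditions, max_size):
    """Number of ways to pick between 1 and `max_size` of `n_conditions` conditions."""
    return sum(comb(n_conditions, k) for k in range(1, max_size + 1))

# Illustrative assumption: 1,000 yes/no conditions, combined up to four at a time.
total = count_combinations(1000, 4)
```

Under these assumptions, `total` comes to a little over 41 billion candidate combinations, almost all of them from the four-way term, which is why each additional feature or interaction level is expensive to test exhaustively.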

Instead of culling secondary features like demographics, marketing acquisition, content details, loyalty data, ecosystem flags, and tenure, it’s far more beneficial to include as many factors as possible in an analysis.

4. Know the metrics that matter: Build to the KPIs you care about

The way a business defines its KPIs will determine the appropriate level of granularity for these datasets. The general rule of thumb is to maintain datasets that support the definitions of each KPI you use to manage the business.

The reason for this? Over time, the key indicators an organization chooses to measure its performance will not change that rapidly. Both leading and trailing metrics in every category will remain consistent, which allows for more predictability in what data to collect, and at what level. This also enables extreme flexibility and growth in the features collected to describe each record in the data.

When (dis)aggregated to the right level of granularity, you can minimize the amount of ongoing work required to maintain a dataset. Features can be added or dropped with flexibility without disrupting the ongoing analysis of these critical operational metrics.
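
A minimal Python sketch of this idea, with hypothetical account-level rows and helper names (`conversion_rate`, `add_feature`, `drop_feature`): because the KPI is defined on the row itself, descriptive columns can come and go without changing the metric.

```python
# Hypothetical account-level rows for a trial-conversion KPI.
accounts = [
    {"account_id": 1, "converted": True, "plan": "pro"},
    {"account_id": 2, "converted": False, "plan": "free"},
    {"account_id": 3, "converted": True, "plan": "free"},
    {"account_id": 4, "converted": False, "plan": "pro"},
]

def conversion_rate(rows):
    """KPI definition: depends only on the `converted` column."""
    return sum(1 for r in rows if r["converted"]) / len(rows)

def add_feature(rows, name, values_by_id):
    """Return new rows with one extra descriptive column (None if unknown)."""
    return [{**r, name: values_by_id.get(r["account_id"])} for r in rows]

def drop_feature(rows, name):
    """Return new rows without the named descriptive column."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

before = conversion_rate(accounts)
widened = add_feature(
    accounts, "acquisition_channel", {1: "ads", 2: "organic", 3: "ads"}
)
narrowed = drop_feature(widened, "plan")
after = conversion_rate(narrowed)
```

The conversion rate is the same before and after the columns change, since the row granularity and the `converted` field that define the KPI were left alone.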

These four principles (Get granular, Flat is beautiful, Tell me more, and Know the metrics that matter) should guide you to build datasets that provide the most useful, actionable facts.

To read more tips on designing datasets, download Designing Better Datasets for Diagnostic Analytics: A Sisu How-to Guide.
