With more data available than ever before, you should have the answers to all your questions… somewhere.
No matter where you sit in the organization, you’re well aware of your data’s potential value, if only you had the tools, infrastructure, and time to analyze it quickly and effectively. Unfortunately, there is so much data available that it’s a struggle to figure out what to look for, when to look, and whether the answers you find are useful. Typically, by the time you find an answer, new data has changed the equation and the business has missed its window of opportunity to act. Sound familiar?
When we have more data, we have less of something else: time and attention to tend to it. As Herbert Simon wrote, “[Information] consumes the attention of its recipients. Hence, a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”* In other words, working with big data today feels like trying to move a mountain with a shovel. The bigger the mountain, the more shovels you need – and the less effective they are. But what if you could throw down those shovels and get behind the wheel of heavy machinery that really is capable of moving mountains?
That machinery is the crucial element at the center of this guide: a blueprint for a better, faster, cloud-native data architecture. The future of analytics in all its fast, proactive, comprehensive glory is in the cloud. But, to successfully unlock the speed and agility of your team, there are key data analysis platforms, data structures, and processes you’ll need to invest in first to get truly proactive in your use of data.
Breaking down data silos
Data silos occur organically and accidentally. You’ll need clear intent and strong diplomacy to break them apart.
When businesses grow, they add new applications and software, create new departments, and cultivate multiple subcultures, each of which tends to breed data silos. Perhaps one department competes with another for funding or IT resources. Or the company acquires another business, which brings legacy tools and processes along for the ride. However it happens, your analytics team inherits these disparate, disconnected data silos that you need to integrate to fully unlock the potential of the information they capture.
For tips on working with your organization to overcome issues with data silos, view Part Two.
Making data-driven decisions isn’t a new approach (for more on that, read analyst Sid Sharma’s perspective on the history of data-driven decision-making). What’s changed is the structure of that data and the frequency with which it’s collected. While on-premises platforms have advanced, they can struggle to handle the volume, diversity, and speed of today’s data. To get ahead, you need a modern analytics stack.
New cloud-native data architectures move both the management and security of the data warehouse to the cloud, simultaneously creating access to data that is nearly infinite, low-cost, and scalable. Other cloud technologies downstream let businesses capitalize on the cloud-native data architectures by making it possible to answer questions business teams have never been able to ask until now.
Whether your business is large or small, you’ll have access to the data you need to inform key decisions using an architecture that grows with your business.
Whether you’re building a new data architecture from scratch or planning a strategy to migrate your hybrid infrastructure, let’s take a look at the benefits of the cloud.
1. Real-time access to data
Here’s the legacy data architecture scenario that many analyst teams are familiar with: large, diverse datasets that tax the data architecture and slow down the analysis process. In many organizations, it may take as long as 24 hours between the time data enters the warehouse and the time analysts can start working with it. When you need daily reports but it takes days to create each one, no real-time decisions are possible, no follow-up questions are asked, and no patterns are found before it’s too late.
Then, once data is in the warehouse, the team may spend hours or even days running complex analyses against their hypotheses. There are workarounds, like queueing or data lakes, but these bring their own limitations and lead to the same outcome: a delay between creating your data and finding actionable facts within it.
But with a cloud-native data architecture, data can be streamed into the warehouse and made accessible immediately, and with proactive analytics tools, data can be created and understood in time for the business to pivot quickly on data-driven decisions.
2. Freedom from computational requirements
“The fast parts learn, the slow parts remember.” So said Stewart Brand in reference to the layers of change within a system. Unfortunately, business intelligence tools and legacy analytics warehouses have, in many cases, trained data teams to aggregate, simplify, and streamline complex datasets into narrow, normalized schema that could fit within their limited processing capabilities.
So, while we have the technological capability for faster intake of data into the cloud, the old habits business intelligence tools created have trained us to stay down in the weeds, where it’s safer and easier to process our data. In many companies, data analysts have been asked to oversimplify their data models to meet the requirements of their business intelligence tools and legacy analytics warehouses, while businesses are hobbled and opportunities are lost for lack of insight.
For a time, this aggregation succeeded in making it easier for data visualization and business intelligence tools to query these datasets and build accessible dashboards for broad consumption. Unfortunately, these over-simplified datasets dramatically reduced the utility of the data for more critical diagnosis and root cause analysis. And if we continue using data under these constraints, we’ll forever be prevented from asking the questions we’d love to ask if we freed our minds.
When it comes to effective diagnosis and root cause analysis, the more features, the better. In the past, the computational requirements of checking tens of billions of possible combinations in a dataset were prohibitive, but cloud-native analytics platforms can comprehensively test these spaces at rates orders of magnitude faster than their predecessors. This means you get more information and facts from your data without increasing the amount of work necessary to get it.
3. Improved focus
It all comes down to finding actionable data in real time, and that’s what cloud-native architectures facilitate. Cloud-native analytics platforms allow your team to use a declarative approach: your analyses are based on metrics and KPIs that affect your company’s bottom line, and never have to stray from that strategic focus.
Particularly when analyst resources are at a premium, you need tools that can proactively seek out and recommend the data and the answers that your business teams really need.
With the rich breadth of data available and the added pressure of fast-moving markets, the current model of manually digging through data, testing individual hypotheses, and generating static reports is labor-intensive, slow, and fails to scale to the hundreds of columns and millions of rows found in most datasets.
When using cloud-native analytics platforms, you can stack-rank the data results. Interesting populations rise to the surface on their own, allowing you to test competing hypotheses in parallel. This simplifies an analyst’s tasks, allowing them to get answers faster, maintain a strategic focus, and go further into the data.
In order to build or migrate to a complete, cloud-native analytics platform, you need to establish a vision first built on a unified, single source of truth (we’ll get to the analytics engine in the next section).
We’ll help you understand where your data is coming from today and where you can capture it in the future. This knowledge will influence the tools, pipelines, and warehouses you need to process that data.
1. Business applications
These specialized data sources help run business processes and capture information about critical activities and objects. Usually, a company buys subscriptions to several distinct business applications, each supporting a different process.
For example, imagine a fictitious company called DairyTrail, which provides services for the dairy supply chain, including production, transport, processing, packaging, and storage of dairy products for a large and growing network of independent dairies. Descartes provides the bulk of their data collection. In addition, the DairyTrail sales team uses Salesforce, and data also comes into DairyTrail via their clients’ subscriptions to Smart Milk, which tracks weight, fat percentage, and solids nonfat (SNF) percentage of raw product. They’re also collecting weather and feed data.
There’s no other company in the world structured exactly like DairyTrail, so it stands to reason that there is no single, bespoke solution that perfectly accommodates all their possible software needs. This problem is how disparate data sources are born. In the past, diverse data sources meant days’ worth of number-crunching by the analytics team with the hopes that patterns across various data sources would become visible. Today, however, a well-constructed cloud-native data architecture integrates disparate data sources and capitalizes on them.
2. Data providers
Data providers are third parties who offer additional, often industry-specific data to augment a company’s proprietary data. Data-as-a-Service (DaaS) companies like Nielsen and Comscore, Crux, and SafeGraph are all classic examples of third-party data providers.
In our fictitious company, DairyTrail, the executive team also wants to know everything they can about their target market, dairy farmers, and the public’s dairy product-buying and dairy product-consumption habits. So they also contract with Nielsen and SafeGraph to obtain intel on trends among their markets, resulting in additional data sources.
Fortunately, with a cloud-native data architecture and other ingredients that make it run smoothly, this powerful intel gets used to its fullest potential, allowing data analysts to do the work they know is possible.
Data pipelines move data from its source (like a CRM application) to a destination (such as a data warehouse), possibly transforming it into a more usable structure along the way. Historically, data pipelines performed an extract, transform, load (ETL) process: normalizing the data before loading made it easier to manage and to integrate with multiple data sources. Switching the order of those steps, however, preserves the raw data intact in the warehouse until it’s ready to be analyzed.
That brings us to another benefit of a cloud-native data architecture: its ability to transform raw data after it’s in the cloud. You can shift from an ETL process to an ELT (extract, load, transform) process. Why does it matter where in the pipeline you transform data? The simple answer is that loading your raw data into the warehouse first means that transformation enjoys the same speed, freedom, and increased capability that every other process in your cloud-native data architecture can access (see Part One of this guide). You’ll want to employ a cloud-first data connector that performs the minimum transformations required for your needs.
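To make the ETL-versus-ELT distinction concrete, here’s a minimal sketch in Python, using an in-memory SQLite database as a stand-in for a cloud warehouse (the table and column names are hypothetical). The key point is the ordering: raw data lands first, and the cleanup and aggregation happen inside the warehouse afterward.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database.
warehouse = sqlite3.connect(":memory:")

# Extract: raw records pulled from a hypothetical source application.
raw_rows = [
    ("2024-01-05", "north", "12.50"),
    ("2024-01-05", "south", "8.00"),
    ("2024-01-06", "north", "9.25"),
]

# Load: land the data as-is, untransformed, so nothing is lost.
warehouse.execute("CREATE TABLE raw_sales (day TEXT, region TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: clean types and aggregate *inside* the warehouse, after loading,
# where the computation benefits from the warehouse's own horsepower.
warehouse.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY region
""")

for region, total in warehouse.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"):
    print(region, total)
```

Because the untouched `raw_sales` table survives, analysts can always re-derive new transformations from the original records, something a transform-before-load pipeline makes much harder.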
A business with multiple departments is like a full orchestra with numerous sections: each plays a crucial part, but they all need to play the same song. When executive teams aren’t conditioned to think in silos, they will ask questions about the business as a whole: how does a factor in one department affect a factor in another department? Today, with a cloud-native data architecture, your analytics team can answer questions that implicate multiple departments and business processes; in fact, they can even find those patterns before the executive team asks. Insights become apparent when all data is in one location.
Enter the cloud data warehouse. While on-premises warehouses have stored data on trends, patterns, and correlations for years, a new breed of platform has emerged, custom-built to tackle massive quantities of historical data and to power fast, complex queries. Today, not only is it cheaper to store data, it can also be 25–50% faster to query data from big, flat, denormalized tables than from star schemas, thanks to advancements in massively parallel processing.
Unlike the rolled-up data often presented by business intelligence tools, these warehouses can store data as one giant, rich, granular table, so that analysts can navigate freely, without traditional constraints. A rich, wide dataset is far more useful for diagnosis.
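To illustrate why wide tables simplify analysis, the sketch below (plain Python with SQLite; all table and column names are hypothetical) answers the same question two ways: against a small star schema, which requires a join, and against its denormalized equivalent, which is a single-table scan.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Star schema: a fact table referencing a dimension table.
db.execute("CREATE TABLE dim_product (product_id INTEGER, category TEXT)")
db.execute("CREATE TABLE fact_orders (product_id INTEGER, amount REAL)")
db.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "milk"), (2, "cheese")])
db.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(1, 4.0), (2, 7.5), (1, 3.0)])

# Answering "revenue by category" requires a join:
star = db.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_orders f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category
""").fetchall()

# Wide, denormalized table: every attribute lives on the row itself.
db.execute("CREATE TABLE orders_wide (category TEXT, amount REAL)")
db.executemany("INSERT INTO orders_wide VALUES (?, ?)",
               [("milk", 4.0), ("cheese", 7.5), ("milk", 3.0)])

# The same question becomes a single-table scan, which massively parallel
# warehouses distribute across nodes especially well.
flat = db.execute(
    "SELECT category, SUM(amount) FROM orders_wide GROUP BY category").fetchall()

assert sorted(star) == sorted(flat)  # same answer, simpler access path
```

The wide table trades storage (values are repeated per row) for query simplicity; in the cloud, where storage is cheap, that trade usually favors the wide table.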
In our fictitious dairy supply chain business, the executive team might ask, “How do packaging and shipping affect milk quality?” A six-year-old can tell you what happens when milk goes bad. It stands to reason that DairyTrail’s executive team will want to know why, when, and how it goes bad in their supply chain. But why stop there? The analytics team can spot these factors before the executive team asks and provide actionable insights in real time, becoming a proactive entity within the business, rather than a reactive one. Milk, like any product on the market, has a limited shelf life. Insights aren’t actionable forever.
Even tightly regulated industries like finance, government, and healthcare can now take advantage of cloud-based platforms. These warehouses have made significant advances in security, availability, and compliance, providing the same benefits in a cloud-native analytics stack that you could expect from a fully on-premises deployment.
If you have to adhere to on-premise storage regulations, we have seen that it is in most companies’ best interests to take charge of their own data storage, rather than purchase add-on storage capabilities from another application, like a business intelligence tool with built-in data connectors. Having control of how and where your data is stored means that you can move and control it whenever and however you want, without added costs or risks of data loss.
On-Premise/Hybrid Hosting Challenges
Traditional on-premise data warehouses are just like the name states—servers located onsite at your organization. These warehouses require a more significant upfront investment, as you’ll need to buy the hardware and hire the personnel to manage the data you expect to collect, and they may carry a higher total cost of ownership (TCO). Hybrid options let you store data onsite and sync to the cloud.
Total control also means total responsibility. In the on-premise/hybrid scenario, there is more pressure on your IT teams to handle all maintenance of the hardware, software, and integrations. Most importantly, it’s difficult for on-premise solutions to scale with the speed and volume of data most companies are collecting today. Scaling up to meet these changing needs means replacing systems and absorbing additional costs.
Cloud Hosting Relief
As part of your cloud-native data architecture, store data in the cloud, if possible. Cloud-based data warehouses are gaining more attention and market share because of their flexibility and cost-effectiveness. According to Gartner, 75% of all databases will be deployed or migrated to the cloud by 2022.** Like on-premises hosting, cloud data warehouses let you collect, store, query, and analyze data, but without the need for up-front investments in hardware or personnel.
Cloud hosting is typically cheaper, faster, more scalable and elastic, and enjoys a lower risk of downtime, all of which are interrelated.
Cost is the primary reason businesses switch to cloud hosting, followed quickly by speed. Rather than forecast needs, purchase expensive hardware, and train up an IT team to administer them, a business can be up and running for a fraction of the price in a matter of minutes or hours.
As for scalability and elasticity, if the business is successful, it will grow. You’ll undoubtedly add new sources of data as time goes on, and your executive team will want more significant insights and increased capabilities. Cloud hosting scales at the click of a button, with pricing structures that allow you to turn hosting and computing on and off based on activity. For example, the fictitious company DairyTrail could be heavily impacted by animal diseases and consumer diet trends, which could introduce major, unpredictable volatility into the business and affect its computing needs. Not a problem: the analytics team could adjust its hosting accordingly.
Finally, distributing your data storage across a cloud inherently reduces the risk of downtime and its implications, should it occur. Storage providers like Snowflake use continuous data protection features that replace traditional back-up procedures, so you can restore data to specified points in time.
Cloud hosting’s greatest challenge to companies that are able to use it—in other words, those companies without regulatory or compliance issues that require or favor on-premise hosting of data—is selling the transition internally. As always, the fast parts learn, the slow parts remember; in the case of data warehouse hosting, stakeholders in your organization may be holding onto outdated rationale for on-premise storage. See “Address Data Silos” at the end of this section for guidance on selling the transition internally.
Our advice? Listen to the concerns of stakeholders before insisting on moving to the cloud. For most businesses, a modern cloud-native data architecture can address concerns and protect against liabilities and worst-case scenarios. Take the time to ensure your migration plan addresses those concerns explicitly when laying it out. Getting buy-in and building relationships internally may speed adoption in the long run.
It’s not enough to collect a mountain of data. To get the ROI you expect, you have to transform that static, dusty archive into a living source of value for the organization. The tools to accomplish that are your data analytics platforms, which enable users to analyze and pull insights from their data via queries, charts, and collaborative tools. When selecting business analytics platforms—and yes, you may need tools from more than one category—consider the following:
With cloud-based business intelligence platforms, you can monitor business processes with pre-built reports and visualizations. An effective BI platform illustrates the “what” and the “where,” as in “What is changing in my business?” and “Where is that change impacting our different business units?” These views deliver data on predefined key performance indicators (KPIs) via a dashboard, and many allow users to write ad-hoc queries via an expert-user interface.
But as AI and ML begin to transform BI, the utility of statically-defined dashboards shifts, no matter how frequently they are updated. These dashboards and visualizations are designed to improve access to data, showing whether metrics and KPIs have changed. But, they’re not designed to automatically explore the subtle changes beneath the surface that explain why the change occurred. What happens when the factors impacting a KPI change faster than a static dashboard can visualize?
Back in the old days of building models, outliers were often intentionally removed because this statistical noise was frequently a distraction from the primary trends in the data. But outliers are beacons of business threats like fraud or intrusion. They provide first-blush insights into emerging trends, but once identified, outliers will require further investigation to understand “what else” is happening in our business.
Because of this, outlier and anomaly detection platforms are best suited for operations metrics over business metrics. Using mathematical models and machine learning, anomaly detection platforms can help you identify outliers in your data to inform quick business responses to urgent threats. Anytime there is an outlier that has real meaning for the business, acting swiftly can free managers up from the slow, laborious process of righting a ship and instead let them spend that time making proactive decisions.
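As a simple illustration of the underlying idea (a generic z-score rule, not any particular vendor’s method), the sketch below flags points in an operations metric that sit unusually far from the mean; the metric name and threshold are hypothetical.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) > threshold * sigma]

# Hypothetical daily error counts for an operations metric: one day spikes.
daily_errors = [12, 11, 13, 12, 14, 11, 95, 13, 12]
print(zscore_outliers(daily_errors, threshold=2.0))  # flags the spike, 95
```

Real anomaly detection platforms layer seasonality models and machine learning on top of rules like this, but the workflow is the same: surface the unusual point fast, then hand it to an analyst to find out why it happened.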
If you’re like most companies, your analysts have been attempting to answer the “why” question manually: using Excel pivot tables, spending hours slicing and dicing their dashboards across many different dimensions, or performing one-off hypothesis tests based on their gut feel or that of various business leaders.
Augmented analytics platforms like Sisu use automated root-cause analysis to identify the key factors and populations in your data that contribute to your core metrics — more comprehensively and with higher precision (i.e., lower false positives). In contrast to anomaly detection platforms, this approach not only surfaces outliers but also considers how a cohort contributes to the overall metric. By identifying the factors with the largest impact on the overall metric, it’s easy to see where the actionable opportunities for improvement are or where the largest sources of lift come from.
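The core idea, stripped of the statistics a platform like Sisu layers on top, can be sketched as ranking cohorts by how much each one contributes to a metric’s overall change. The helper below is a hypothetical illustration, not Sisu’s actual API, and the DairyTrail-style field names are invented.

```python
from collections import defaultdict

def cohort_contributions(before, after, key, value):
    """Rank cohorts by their contribution to the change in a summed metric.

    `before`/`after` are lists of record dicts; `key` names the cohort
    column and `value` names the metric column. (Hypothetical helper.)
    """
    def totals(rows):
        agg = defaultdict(float)
        for row in rows:
            agg[row[key]] += row[value]
        return agg

    b, a = totals(before), totals(after)
    deltas = {k: a.get(k, 0.0) - b.get(k, 0.0) for k in set(b) | set(a)}
    # Largest absolute contribution first.
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

last_year = [{"region": "north", "fat_pct_total": 40.0},
             {"region": "south", "fat_pct_total": 42.0}]
this_year = [{"region": "north", "fat_pct_total": 41.0},
             {"region": "south", "fat_pct_total": 52.0}]

# The south cohort drives most of the year-over-year change.
print(cohort_contributions(last_year, this_year, "region", "fat_pct_total"))
```

An augmented analytics engine does this across every column and combination of columns at once, with statistical tests to control false positives, which is what makes the comprehensive search tractable.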
With augmented analytics tools, you can continuously monitor and proactively surface the subpopulations in your data that are driving change—no need to slice and dice to find what’s interesting or what changed, and no need to spend days or even weeks number-crunching different hypotheses. In fact, Sisu integrates into existing business intelligence tools and workflows. As an example, you can hyperlink Sisu objectives from an existing dashboard or potentially embed an “analyze in Sisu” option from your dashboard.
Using Sisu, you can tell your business team why metric performance changes over time or differs between groups. For example, in the DairyTrail scenario, you’ll not only notice that fat content is higher in dairy products this year than last year, you’ll know why.
This approach accelerates the process of answering any kind of “why” question.
This data-rich output provides your analytics team with answers to the “why.” And that’s what business executives have been asking for all along. In fact, it’s the reason they were conditioned to ask “what” in the first place: so they could find the “why” and fix it. It’s also the reason your analytics team exists. So it stands to reason that answering “why” justifies a cloud-native data architecture and all of the tools it comprises.
By using augmented analytics, you’re eliminating the archaic “what” step that only existed as a best effort to begin with.
When you have mountains of data to dig through, a cloud-native analytics architecture augments your data team’s ability to power through every question your business has in mind and, ultimately, to build a more proactive workflow. By choosing the right analytics tools and transitioning to a cloud-based infrastructure, you’ll start answering questions before the business can think to ask them. You’ll free up resources to get ahead of changing markets, and free everyone—from analysts to the CEO—from the constraints of traditional BI tools. There is a plethora of “what” in your hands. Now use it to find out “why.”
To make the case internally, here’s a quick game plan for moving to a more effective cloud-native data architecture:
1. Align data sources and key questions (KPIs)
Go back to the source and catalog the data you’re collecting. The goal is to strip away the rolled-up reporting views built for on-premises constraints and rediscover the rich, transactional, and granular records generated by your business application and third-party data sources. This process will identify where you have the best alignment between data and KPIs and the factors that may define them.
2. Establish your Source of Truth for analytics
Next, consolidate this rich data into a flexible, centralized store. Having a single source of truth enables your team to quickly and flexibly create the wide, flat, denormalized tables that help diagnose the ever-changing questions about performance and process. The key ingredients here are a modern data pipeline and a cloud-native warehouse.
For your data pipeline, consider a cloud-first ELT (extract, load, transform) platform over a more traditional ETL tool. Transforming data after it enters the data warehouse lets you take full advantage of the speed and increased capability of the cloud (and reduce some work on your part, too). Similarly, when selecting your warehouse, choose cloud-first, to take advantage of the flexibility, speed, and cost benefits.
3. All about the engine: Choose analytics platforms that augment your team
Now, it’s time to start asking questions of your data. BI and visualization tools are necessary to improve access to information. But to truly get ROI from the data in your warehouse, pair your BI tools with augmented analytics engines advanced enough to accelerate testing, diagnosis, and root cause analysis across billions of possible combinations. By pairing these descriptive and diagnostic tools together, you’ll supercharge your ability to answer the two critical questions of “what’s changed?” and “why?” about your key metrics.
4. Get proactive with the why
With proactive analytics tools, you can finally answer “why” faster. Using all of your data will enable you to automatically and comprehensively surface the factors and subpopulations impacting your metrics. When you get proactive, you’ll be able to diagnose data and uncover answers before business decision-makers even ask.