Analyzing Racial and Geographic Disparities in COVID-19 Cases with Snowflake Data Shares and Sisu

By Charles Zhu - August 4, 2020

As the COVID-19 pandemic redefines day-to-day life for many, it’s also casting a spotlight on the inequities and flaws in U.S. systems — especially along racial lines. In fact, a recent New York Times report analyzed CDC data and found that Black and Latino sub-populations are disproportionately affected by the virus.

With COVID-19 cases rising across the U.S. as states reopen, this report had our team wondering what other trends we could observe between the first spike in cases in March and the current spike through July.

Using readily accessible data up to July 15th (courtesy of the Snowflake share of Demyst and Starschema data), we found critical differences in just a few clicks:

  • A 6.5x surge in Latino counties. Overall cases surged 2x in June/July vs. the first March/April surge. However, counties with more Latinos than average surged 6.5x. And in contrast to the first surge, counties with larger Black populations and other majority-minority counties are seeing much slower increases than the rest of the country.
  • A 4.3x-4.7x increase in cases in rural and sparsely populated counties. While March’s cases grew the fastest in cities, June/July cases in rural and sparsely populated counties were 4.3x to 4.7x their March/April level.
  • Hotspots moving from the Northeast to the South. The location of the majority of cases has also completely shifted, from the Northeast and coastal states in March to Southern states that largely missed the first surge of cases.
  • California double-take. One of the most glaring examples of a state with a “double surge” is California. But following the urban-to-rural pattern observed nationwide, California case growth has shifted away from the LA and San Francisco metro areas and towards the Inland Empire and Central Valley cities.

Important note: While experts in public health have reviewed these findings, they do not constitute policy recommendations in any way.

Getting the full picture

Before we could dive into the data to see trends, we had to make sure we had a full view of the factors contributing to this surge.

Using Snowflake’s ability to do cross-database joins and a simple SQL query, we quickly joined three datasets. The first dataset, from Demyst, is a current list of nationwide COVID-19 cases from the New York Times, refreshed every 24 hours. We joined this with Starschema’s dataset, which tracks U.S. policy actions by state, giving us context on which measures and restrictions were lifted. Finally, we joined all of this with a census dataset from the American Community Survey to understand each county’s demographic breakdown. A screenshot of these DBs in action is below.
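To make the shape of that join concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Snowflake. The table names, column names, and sample rows are all hypothetical and do not reflect the real Demyst, Starschema, or American Community Survey schemas. In Snowflake, each table would instead be referenced by its fully qualified database.schema.table name.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the three datasets.
# Table names, column names, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nyt_cases (fips TEXT, state TEXT, report_date TEXT, new_cases INTEGER);
    CREATE TABLE state_policies (state TEXT, policy TEXT, date_lifted TEXT);
    CREATE TABLE acs_demographics (fips TEXT, population INTEGER, pct_latino REAL);

    INSERT INTO nyt_cases VALUES
        ('00001', 'AA', '2020-06-16', 40),
        ('00002', 'BB', '2020-06-16', 15);
    INSERT INTO state_policies VALUES
        ('AA', 'stay_at_home', '2020-05-01');
    INSERT INTO acs_demographics VALUES
        ('00001', 100000, 35.0),
        ('00002', 20000, 5.0);
""")

# One row per county-day, enriched with state policy and census context.
# LEFT JOINs keep every case row even when a state has no policy record.
JOIN_SQL = """
    SELECT c.fips, c.state, c.report_date, c.new_cases,
           p.policy, p.date_lifted,
           d.population, d.pct_latino
    FROM nyt_cases AS c
    LEFT JOIN state_policies AS p ON p.state = c.state
    LEFT JOIN acs_demographics AS d ON d.fips = c.fips
    ORDER BY c.fips
"""
rows = conn.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
```

Joining cases to demographics on a county identifier and to policies on the state is one plausible choice of keys; the actual join keys used in the analysis are not stated.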

Snowflake schema set up to join datasets

Setting a baseline: Understanding the nationwide surge

With these datasets joined in our Snowflake database, we pointed Sisu at the table and set up a time-based objective comparing New_Cases in the four-week period beginning March 17th and ending April 13th to New_Cases in the four-week period beginning June 16th and ending July 13th.
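The comparison itself is simple to sketch outside of Sisu. The example below is an illustration only, not Sisu's implementation: it sums a new-cases column over the two inclusive date windows named above and takes their ratio. The sample rows and helper names are hypothetical.

```python
from datetime import date

# The two four-week windows named in the article (inclusive bounds).
BASELINE = (date(2020, 3, 17), date(2020, 4, 13))
RECENT = (date(2020, 6, 16), date(2020, 7, 13))


def window_days(window):
    """Number of calendar days in an inclusive (start, end) window."""
    start, end = window
    return (end - start).days + 1


def sum_in_window(rows, window):
    """Sum new cases over rows whose date falls inside the window."""
    start, end = window
    return sum(n for d, n in rows if start <= d <= end)


# Hypothetical (date, new_cases) rows, for illustration only.
rows = [
    (date(2020, 3, 16), 5),   # before the baseline window, ignored
    (date(2020, 3, 17), 10),
    (date(2020, 4, 13), 20),
    (date(2020, 6, 16), 25),
    (date(2020, 7, 13), 35),
    (date(2020, 7, 14), 50),  # after the recent window, ignored
]

baseline_total = sum_in_window(rows, BASELINE)
recent_total = sum_in_window(rows, RECENT)
fold_change = recent_total / baseline_total
print(window_days(BASELINE), window_days(RECENT), fold_change)
```

Both windows span exactly 28 days, so the two sums are directly comparable without normalizing by window length.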

We see that compared to the first four-week surge in March, when much of the country began adopting lock-down and social distancing measures, the number of new cases doubled (2x) from 593,000 to 1.2 million.

But this aggregate 2x change obscures the myriad of subpopulations that reported far fewer new cases or far more.

Rapid growth in COVID-19 cases in predominantly Latino communities; slower growth in cases in Black and other majority-minority counties compared to the overall country

When we look at the subpopulations in Sisu, we immediately see that certain populations bore the brunt of this second surge in cases.

For example, counties with a larger Black population than average saw slower growth in cases than the national average, while predominantly Latino counties saw a greater rise in new cases than the national average.

Counties with a greater proportion of Latinos than the average county saw a 6.5x increase in total new cases in June/July, compared to the first surge in March/April. In other words, while these more-Latino counties make up roughly 27% of all counties in the United States, they accounted for 57% of the 2x increase in coronavirus cases between 3/17-4/13 and 6/16-7/13.

From 3/17-4/13, the total number of new cases in more Latino counties was 65K. From 6/16-7/13, the total number of new cases in more Latino counties was 421K, a 6.5x increase.
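As a rough check on how a subgroup's share of the overall increase can be computed, the sketch below divides the subgroup's absolute change by the national absolute change. It uses only the rounded figures quoted in this post, and the function name is ours, so treat it as an approximation of the calculation rather than the exact one behind the reported number.

```python
def share_of_increase(group_before, group_after, total_before, total_after):
    """Fraction of the overall increase attributable to one subgroup."""
    total_delta = total_after - total_before
    if total_delta == 0:
        raise ValueError("no overall change to attribute")
    return (group_after - group_before) / total_delta


# Rounded figures quoted in the article.
latino_before, latino_after = 65_000, 421_000
national_before, national_after = 593_000, 1_200_000

share = share_of_increase(
    latino_before, latino_after, national_before, national_after
)
print(f"share of national increase: {share:.0%}")
```

With these rounded inputs the share comes out near 59%, close to but not exactly the 57% reported above, which presumably reflects unrounded totals.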

But excluding the predominantly Latino counties already discussed, the data shows that counties with 1) a minority population above 50% and 2) a larger-than-average share of residents below the poverty line actually saw a smaller increase in new cases in 6/16-7/13 relative to 3/17-4/13.

Specifically, while nationally there was a 100% (2x) increase in new cases in the June/July surge over the March/April surge:

  • Counties with a greater population below the poverty line than the average saw cases increase by 91.9%
  • Counties with a greater Black population than average saw cases increase by 68.1%
  • Majority-minority counties saw cases increase by 19.2%
  • Counties with a higher proportion of Asians saw cases increase by 49%

While we acknowledge that testing rates may differ among counties, it appears that this second surge of cases is afflicting more Latino and White populations than the first surge of cases in March.

Surge in rural COVID-19 cases and sparsely populated states

While the first wave of cases in March was primarily in coastal states and metropolitan areas, in the most recent spike rural areas and sparsely populated states are becoming hotspots. Cases in these areas increased a shocking 4.3x to 4.7x, more than double the nationwide increase.

New cases in rural counties increased 4.3x in the period 6/16-7/13 (129,400) over the period 3/17-4/13 (30,100). This is an absolute increase of 99,300.

Rural and sparsely populated areas account for just 14% of the U.S. population, and yet in June, rural counties accounted for 16.1% of the increase in COVID-19 cases. In the first wave, rural counties were relatively unaffected by the virus, but in this new wave these counties have now caught up with other counties. This is especially concerning, as most rural health systems are ill-equipped to handle the influx of critical patients. In fact, the Pew Center reported last year that 128 rural hospitals have closed since 2010, including a record 18 hospitals last year, and many existing hospitals are underfunded and at risk of closure.

Hotspots state-by-state and city-by-city show a drastic shift from the Northeast to the Sun Belt

Since each state and city has had a different response to containing the coronavirus, we wanted to look at where the biggest hotspots have been and how they’ve shifted between the March surge and the current rise in cases. Using the filtering capabilities in Sisu, we quickly filtered the dataset to look at states and understand the change.

Change in sum reflects the relative increase in new cases for a specific subpopulation from March/April to June/July. Impact reflects the absolute change in new cases between the two periods.
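The two metrics in that caption can be sketched as follows, using the subpopulation totals quoted earlier in this post. Expressing change in sum as a multiplier (after divided by before) is our assumption based on the caption; this is not Sisu's code, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subpopulation:
    name: str
    before: int  # new cases, 3/17-4/13
    after: int   # new cases, 6/16-7/13

    @property
    def change_in_sum(self) -> float:
        """Relative change, expressed as a multiplier (assumed form)."""
        return self.after / self.before

    @property
    def impact(self) -> int:
        """Absolute change in new cases between the two periods."""
        return self.after - self.before


# Totals quoted in the article for two subpopulations.
groups = [
    Subpopulation("More-Latino counties", 65_000, 421_000),
    Subpopulation("Rural counties", 30_100, 129_400),
]

# Rank by absolute impact, largest first.
for g in sorted(groups, key=lambda g: g.impact, reverse=True):
    print(f"{g.name}: {g.change_in_sum:.1f}x, impact {g.impact:+,}")
```

Keeping the two metrics separate matters because they can rank groups differently: a small group can have the sharpest relative change while a large group contributes the most cases in absolute terms.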

As we can see, in this most recent surge, Florida and Texas have had the greatest number of new cases between the last peak and the current one. They are also two states that had laxer shelter-in-place restrictions and that lifted restrictions the earliest. On the other hand, Arizona had the sharpest increase in new cases, with a 19.6x increase between the March surge and the current one.

In contrast, Northeastern and New England states have maintained a low caseload between the first surge of cases in March and this new surge of cases in June. These states were hit hard in March, so their mixture of stern shelter-in-place policies and a cautious approach to reopening could be curbing the spread.

Hotspot twice around: California COVID-19 cases continue to grow

The only state where the idea of “one spike is enough” is not holding true is California. Despite being a hotspot in March, cases in the state increased 6.6x between March and July. We drilled down into California’s data to see why.

While LA County has contributed the largest absolute number of new cases between the last surge in March/April and the current one in June/July, we’ve seen significant spikes in the more rural and smaller-population counties like those in the Central Valley and Inland Empire. And as we’ve seen elsewhere in this June surge of cases, counties with a larger Latino population than average have been hit particularly hard, with a 7.1x increase in cases.

Final Thoughts

We know that this data is only part of the picture. The U.S. is still struggling to make testing widely available, and in June the CDC estimated that the true tally of COVID-19 cases is likely 10 times the number reported. While the CDC is not able to determine if the unreported cases have similar racial and ethnic inequities, they say it is clear that there have been significant disparities in the number of both deaths and cases.

But the data we do have paints a clear picture: the U.S. is nowhere close to putting this pandemic behind us. As every public health expert predicted, we can see a clear correlation between the counties that lifted stay-at-home orders and mandatory quarantines (or, in some cases, never had mandatory quarantines to begin with) and the counties that saw an exponential increase in new cases in the June surge. In most cases, even as states reopen and relax stay-at-home orders, there are more cases in every county and city than when states first issued stay-at-home orders in March.

While so much is unprecedented, what’s encouraging to us is the possibility for states and businesses to make informed decisions by marrying these datasets with their own data. And with more data becoming available, it’s more important than ever for journalists and activists to be able to quickly dive into the data to bring light to these inequities and hold decision-makers to account.
