Analyzing Racial and Geographic Disparities in COVID-19 Cases with Snowflake Data Shares and Sisu

By Charles Zhu - August 4, 2020

As the COVID-19 pandemic redefines day-to-day life for many, it’s also casting a spotlight on the inequities and flaws in U.S. systems — especially along racial lines. In fact, a recent New York Times report analyzed CDC data and found that Black and Latino sub-populations are disproportionately affected by the virus.

With COVID-19 cases rising across the U.S. as states reopen, this report had our team wondering what other trends we could observe between the first spike in cases in March and the current spike through July.

Using readily accessible data up to July 15th (courtesy of the Snowflake share of Demyst and Starschema data), we found critical differences in just a few clicks:

  • A 6.5x surge in Latino counties. Overall cases surged 2x in June/July vs. the first March/April surge. However, counties with more Latinos than average surged 6.5x. And in contrast to the first surge, counties with larger Black populations and other majority-minority counties are seeing much slower increases than the rest of the country.
  • A 4.3x-4.7x increase in cases as growth moves from urban to rural counties. While March’s cases grew fastest in cities, June-July cases in rural and sparsely populated counties increased 4.3x to 4.7x over March.
  • Hotspots moving from the Northeast to the South. The location of the majority of cases has also shifted completely, from the Northeast and coastal states in March to Southern states that largely missed the first surge of cases.
  • California double-take. One of the most glaring examples of a state with a “double surge” is California. But following the urban-to-rural pattern observed nationwide, California’s case growth has shifted away from the LA and San Francisco metro areas and toward Inland Empire and Central Valley cities.

Important note: While experts in public health have reviewed these findings, they do not constitute policy recommendations in any way.

Getting the full picture

Before we could dive into the data to see trends, we had to make sure we had a full view of the factors contributing to this surge.

Using Snowflake’s ability to do cross-database joins and a simple SQL query, we quickly joined three datasets. The first dataset, from Demyst, is a current list of nationwide COVID-19 cases from the New York Times, refreshed every 24 hours. We joined this with Starschema’s dataset, which tracks U.S. policy actions by state, giving us context on which measures and restrictions were lifted. Finally, we joined all of this with a census dataset from the American Community Survey to understand each county’s demographic breakdown. A screenshot of these DBs in action is below.
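To make the shape of that three-way join concrete, here is a minimal sketch. It is not the query we ran: it uses Python's built-in sqlite3 module with attached in-memory databases as a stand-in for Snowflake's cross-database joins, and every database, table, and column name, along with every data row, is a hypothetical placeholder rather than the actual Demyst, Starschema, or ACS schema.

```python
import sqlite3

# Stand-in for Snowflake: three separate databases queried in one statement.
# All names and rows below are hypothetical placeholders.
conn = sqlite3.connect(":memory:")                         # "main" holds cases
conn.execute("ATTACH DATABASE ':memory:' AS policy_db")    # policy actions
conn.execute("ATTACH DATABASE ':memory:' AS census_db")    # demographics

conn.execute(
    "CREATE TABLE main.cases "
    "(report_date TEXT, state TEXT, county_fips TEXT, new_cases INTEGER)"
)
conn.execute(
    "CREATE TABLE policy_db.actions "
    "(state TEXT, action TEXT, effective_date TEXT)"
)
conn.execute(
    "CREATE TABLE census_db.demographics "
    "(county_fips TEXT, population INTEGER, pct_latino REAL)"
)

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO main.cases VALUES (?, ?, ?, ?)",
    [("2020-06-20", "XX", "99001", 120), ("2020-06-20", "YY", "99002", 45)],
)
conn.executemany(
    "INSERT INTO policy_db.actions VALUES (?, ?, ?)",
    [("XX", "restriction_lifted", "2020-06-12")],
)
conn.executemany(
    "INSERT INTO census_db.demographics VALUES (?, ?, ?)",
    [("99001", 1000, 40.0), ("99002", 2000, 10.0)],
)

# Qualify each table with its database name, then join on shared keys:
# state for policy actions, county FIPS code for demographics.
JOIN_SQL = """
SELECT c.report_date, c.state, c.county_fips, c.new_cases,
       p.action, d.population, d.pct_latino
FROM main.cases AS c
LEFT JOIN policy_db.actions AS p
       ON p.state = c.state AND p.effective_date <= c.report_date
JOIN census_db.demographics AS d
       ON d.county_fips = c.county_fips
ORDER BY c.county_fips
"""
rows = conn.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN on policy actions keeps counties in states with no recorded action, so missing policy data does not silently drop case rows. In Snowflake, the analogous query would refer to each table by its fully qualified database, schema, and table name.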

Snowflake schema set up to join datasets

Setting a baseline: Understanding the nationwide surge

With these datasets joined in our Snowflake database, we pointed Sisu at the table and set up a time-based objective comparing New_Cases in the four-week period from March 17th through April 13th with New_Cases in the four-week period from June 16th through July 13th.

Compared to the first four-week surge in March, when much of the country began adopting lockdown and social distancing measures, the number of new cases roughly doubled (2x), from 593,000 to 1.2 million.
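For readers who want to reproduce this kind of baseline outside Sisu, the period-over-period comparison can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not how Sisu computes it: the window dates come from the text above, while the function names and the sample rows are hypothetical.

```python
from datetime import date

# Window boundaries from the text (both ends inclusive, 28 days each).
FIRST_SURGE = (date(2020, 3, 17), date(2020, 4, 13))
RECENT_SURGE = (date(2020, 6, 16), date(2020, 7, 13))


def window_total(rows, window):
    """Sum new_cases over rows whose date falls inside the inclusive window."""
    start, end = window
    return sum(new_cases for day, new_cases in rows if start <= day <= end)


def surge_ratio(rows, baseline=FIRST_SURGE, comparison=RECENT_SURGE):
    """Return comparison-window cases divided by baseline-window cases."""
    base = window_total(rows, baseline)
    if base == 0:
        raise ValueError("baseline window has no cases")
    return window_total(rows, comparison) / base


# Hypothetical (date, new_cases) rows, for illustration only.
sample = [
    (date(2020, 3, 16), 50),   # before the first window, ignored
    (date(2020, 3, 17), 100),
    (date(2020, 4, 13), 200),
    (date(2020, 6, 16), 250),
    (date(2020, 7, 13), 350),
    (date(2020, 7, 14), 999),  # after the second window, ignored
]
ratio = surge_ratio(sample)  # (250 + 350) / (100 + 200)
print(ratio)
```

Applied to the rounded totals reported above, the same division gives 1.2 million / 593,000, or roughly 2.02, which is where the 2x figure comes from.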