Lockdown Hack interlude: Correlating mobility stats & pandemic growth

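I crunched some data related to the COVID-19 pandemic, following an impulse after I came across two separate news items: a paper relating city mobility data to the growth of the pandemic, and a mobility dataset published by Descartes Labs.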

The paper uses the Citymapper Mobility Index, which is published for 41 major cities across the world. The Descartes Labs data contains a very similar mobility index, but it’s limited to the US and a lot denser, covering individual US counties. (Both datasets likely have problematic biases since they come from specific populations of mobile phone users, but that’s a discussion for another day.)

I’m no epidemiologist, so I can’t judge the merit of the paper, but it did give me a glimpse into how an epidemiologist might think about putting these kinds of datasets together. So I set out to recreate this figure from the paper using the Descartes Labs data:

[Figure: screenshot of the original figure from the paper]

I used Python with pandas, and the New York Times as a source for COVID-19 data at the county level. The full code is available in this Jupyter notebook. In the resulting graph the dots now represent counties of New York state with at least 20 cumulative confirmed cases of COVID-19 by March 29 (excluding New York City counties), while the axes correspond directly to the original figure:

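[Figure: us-counties, scatter plot of mean daily growth against the mobility index for New York state counties, with a fitted trend line]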

At least the trend looks right.
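The trend line in my plot is a straight line fitted to the logarithm of the growth rate, so it shows up as an exponential curve on linear axes. Here is a minimal sketch of that kind of fit on made-up numbers. The values of `true_slope` and `true_intercept` and the `mobility_index` array are purely illustrative assumptions, not taken from either dataset.

```python
import numpy as np

# Synthetic, illustrative data only: growth depends exponentially on mobility.
true_slope, true_intercept = 0.02, -3.0
mobility_index = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
growth = np.exp(true_slope * mobility_index + true_intercept)

# Fit a straight line to log(growth); polyfit returns the highest degree first.
slope, intercept = np.polyfit(mobility_index, np.log(growth), deg=1)

# Map the fitted line back to the original scale for plotting.
growth_estimated = np.exp(slope * mobility_index + intercept)
```

On noise-free data like this the fit recovers the slope and intercept that generated it. With real county data the points scatter around the curve.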

In the notebook I included a lot of intermediate results to make it easy to follow along, but stripped down, the code is quite compact, an awesome demonstration of the power of pandas and the SciPy ecosystem in general:

# This code is released under an MIT license. In notebook form with more explanation:
# https://github.com/nikhaldi/covid-notebooks/blob/master/Mobility%20vs%20COVID-19%20growth%20plot.ipynb
import numpy as np
import pandas as pd

# County-level COVID-19 case counts (New York Times) and mobility data (Descartes Labs).
us_counties = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
    dtype={"fips": str},
    parse_dates=["date"],
)
mobility = pd.read_csv(
    "https://raw.githubusercontent.com/descarteslabs/DL-COVID-19/master/DL-us-mobility-daterow.csv",
    dtype={"fips": str},
    parse_dates=["date"],
)

# Cases from the fourth week of March, mobility from the second week of March.
us_counties_march_week4 = us_counties.set_index("date").loc["2020-03-23":"2020-03-29"]
mobility_march_week2 = mobility.set_index("date").loc["2020-03-09":"2020-03-15"]

# Restrict both datasets to counties in New York state.
us_counties_march_week4_ny = us_counties_march_week4[(us_counties_march_week4["state"] == "New York")]
mobility_march_week2_ny = mobility_march_week2[
    (mobility_march_week2["admin_level"] == 2) & (mobility_march_week2["admin1"] == "New York")
]


def mean_daily_growth(series):
    # Mean day-over-day relative change; the first day has no predecessor and is skipped as NaN.
    return np.mean(series / series.shift(1) - 1)


# Per county: the highest case count in the week and the mean daily growth.
us_counties_march_week4_ny_mean_growth = us_counties_march_week4_ny[["fips", "cases"]].groupby("fips").aggregate(
    max_cases=pd.NamedAgg(column="cases", aggfunc="max"),
    mean_daily_growth=pd.NamedAgg(column="cases", aggfunc=mean_daily_growth),
)

# Per county: mean mobility over the week. numeric_only skips the text columns,
# which newer pandas versions refuse to average.
mobility_march_week2_ny_mean = mobility_march_week2_ny.groupby("fips").mean(numeric_only=True)

# Keep only counties with enough cases and join them with their mobility index.
min_cases_threshold = 20
merged = pd.merge(
    us_counties_march_week4_ny_mean_growth[
        us_counties_march_week4_ny_mean_growth["max_cases"] >= min_cases_threshold
    ],
    mobility_march_week2_ny_mean["m50_index"],
    on="fips",
)

# Fit a straight line to log(growth) and plot it as an exponential trend over the scatter.
slope, intercept = np.polyfit(merged.m50_index, np.log(merged.mean_daily_growth), deg=1)
y_log_estimated = slope * merged.m50_index + intercept
ax = merged.plot.scatter(x="m50_index", y="mean_daily_growth", figsize=(10, 10))
ax.plot(merged.m50_index, np.exp(y_log_estimated))
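To make the growth metric concrete, here is a small sketch on a made-up series of cumulative case counts. It assumes the metric is the mean of the day-over-day relative change, which is how I read `mean_daily_growth` above. The `cases` values are invented for illustration.

```python
import pandas as pd


def mean_daily_growth(series):
    # Relative change from each day to the next. The first entry has no
    # previous day, becomes NaN, and is skipped by Series.mean().
    return (series / series.shift(1) - 1).mean()


# Hypothetical cumulative case counts growing by 10% per day.
cases = pd.Series([100.0, 110.0, 121.0, 133.1])
growth = mean_daily_growth(cases)
```

For this series every daily step is a 10% increase, so the mean daily growth comes out at roughly 0.1.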

Even though I have experience with numpy and other parts of the SciPy ecosystem, I hadn’t used pandas before. Another motivation for this mini-project was to apply pandas to something practical. The bulk of this work was done within 3 hours.

This isn’t serious science of course – I’m not a scientist (but some of my best friends are!). It’s a proof of concept, proving to myself that I understand these datasets and that I roughly understand what was happening in that paper. I’m now thinking about how to build more serious tools around this idea of correlating physical distancing measures and COVID-19 outcomes.