I crunched some data related to the COVID-19 pandemic, following an impulse after I came across two separate news items:
- Descartes Labs released a new set of mobility stats, based on mobile phone data aggregated at the US county level (full disclosure: I used to work for Descartes Labs).
- On Twitter I saw a pre-print paper by a group of epidemiologists: “Estimating the effect of physical distancing on the COVID-19 pandemic using an urban mobility index”
The paper uses the Citymapper Mobility Index, which is published for 41 major cities across the world. The Descartes Labs data contains a very similar mobility index, but it’s limited to the US and much denser, covering individual US counties. (Both datasets likely have problematic biases since they come from specific populations of mobile phone users, but that’s a discussion for another day.)
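For a quick first look before the full analysis below, here is a minimal sketch of loading the Descartes Labs dataset and checking its county-level granularity. The URL and column names (`fips`, `admin_level`, `m50_index`) are the same ones used in the code later in this post; treat the snippet as illustrative rather than definitive:

```python
import pandas as pd

# Load the Descartes Labs mobility data: one row per region per day.
# Per the repository's documentation, m50_index expresses mobility as a
# percentage of a pre-pandemic baseline.
mobility = pd.read_csv(
    "https://raw.githubusercontent.com/descarteslabs/DL-COVID-19/master/DL-us-mobility-daterow.csv",
    dtype={"fips": str},
    parse_dates=["date"],
)

# County-level rows are admin_level == 2; peek at a few to confirm.
print(mobility[mobility["admin_level"] == 2].head())
```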
I’m no epidemiologist, so I can’t judge the merits of the paper, but it did give me a glimpse into how an epidemiologist might think about putting these kinds of datasets together. So I set out to recreate this figure from the paper using the Descartes Labs data:
I used Python with pandas, and The New York Times dataset as a source for county-level COVID-19 case counts. The full code is available in this Jupyter notebook. In the resulting graph the dots now represent New York State counties with at least 20 cumulative confirmed cases of COVID-19 by March 29 (excluding New York City counties), while the axes correspond directly to the original figure:
At least the trend looks right.
In the notebook I included a lot of intermediate results to make it easy to follow along, but stripped down, the code is quite compact: an awesome demonstration of the power of pandas and the SciPy ecosystem in general:
```python
# This code is released under an MIT license. In notebook form with more explanation:
# https://github.com/nikhaldi/covid-notebooks/blob/master/Mobility%20vs%20COVID-19%20growth%20plot.ipynb
import numpy as np
import pandas as pd

# NYT cumulative COVID-19 case counts per US county per day.
us_counties = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
    dtype={"fips": str},
    parse_dates=["date"],
)

# Descartes Labs mobility index per US county per day.
mobility = pd.read_csv(
    "https://raw.githubusercontent.com/descarteslabs/DL-COVID-19/master/DL-us-mobility-daterow.csv",
    dtype={"fips": str},
    parse_dates=["date"],
)

# Compare case growth in the fourth week of March against mobility in the
# second week, reflecting the lag between behavior changes and reported cases.
us_counties_march_week4 = us_counties.set_index("date").loc["2020-03-23":"2020-03-29"]
mobility_march_week2 = mobility.set_index("date").loc["2020-03-09":"2020-03-15"]

# Restrict both datasets to New York at the county level (admin_level == 2).
# Note: the NYT reports all of New York City as a single row with no FIPS
# code, so it drops out of the fips groupby below.
us_counties_march_week4_ny = us_counties_march_week4[us_counties_march_week4["state"] == "New York"]
mobility_march_week2_ny = mobility_march_week2[
    (mobility_march_week2["admin_level"] == 2) & (mobility_march_week2["admin1"] == "New York")
]

# Mean day-over-day growth rate of a series of cumulative counts.
def mean_daily_growth(series):
    return np.mean(series / series.shift(1) - 1)

us_counties_march_week4_ny_mean_growth = us_counties_march_week4_ny[["fips", "cases"]].groupby("fips").aggregate(
    max_cases=pd.NamedAgg(column="cases", aggfunc=np.max),
    mean_daily_growth=pd.NamedAgg(column="cases", aggfunc=mean_daily_growth),
)
mobility_march_week2_ny_mean = mobility_march_week2_ny.groupby("fips").mean()

# Keep only counties with at least 20 cumulative cases by the end of the week,
# then join case growth with the week-two mobility index on the county FIPS code.
min_cases_threshold = 20
merged = pd.merge(
    us_counties_march_week4_ny_mean_growth[
        us_counties_march_week4_ny_mean_growth["max_cases"] >= min_cases_threshold
    ],
    mobility_march_week2_ny_mean["m50_index"],
    on="fips",
)

# Fit a line to log(growth rate) vs. mobility index and overlay it on the scatter.
slope, intercept = np.polyfit(merged.m50_index, np.log(merged.mean_daily_growth), deg=1)
y_log_estimated = slope * merged.m50_index + intercept
ax = merged.plot.scatter(x="m50_index", y="mean_daily_growth", figsize=(10, 10))
ax.plot(merged.m50_index, np.exp(y_log_estimated))
```
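To put a number on “the trend looks right”, one could also compute the correlation between the mobility index and the log growth rate. This isn’t in the notebook; it’s a minimal sketch, assuming the `merged` DataFrame from the listing above:

```python
import numpy as np
from scipy import stats

# Pearson correlation between the week-two mobility index and the log of the
# week-four mean daily case growth rate, continuing from `merged` above.
r, p_value = stats.pearsonr(merged.m50_index, np.log(merged.mean_daily_growth))
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

The sign and magnitude of r give a quick sanity check on the direction and strength of the fitted line.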
Even though I have experience with NumPy and other parts of the SciPy ecosystem, I hadn’t used pandas before, so another motivation for this mini-project was to apply pandas to something practical. The bulk of the work was done within about three hours.
This isn’t serious science, of course; I’m not a scientist (but some of my best friends are!). It’s a proof of concept, proving to myself that I understand these datasets and that I roughly understand what was happening in that paper. I’m now thinking about how to build more serious tools around this idea of correlating physical distancing measures with COVID-19 outcomes.