1 Introduction

The COVID-19 pandemic took the world by surprise in late 2019 and spread to the United States in early 2020. With it came an unprecedented period in history, as well as uncertainty about what safe behavior and routines would look like in the “new normal”. Compounding the difficulty was the United States government’s inability to coordinate collective action; instead, each state was left to determine its own limitations on behavior and routine, including movement restrictions. Some states enacted legislation that forced previously common activities, such as going to restaurants, to cease, further inhibiting movement. All of this was done out of safety concerns with respect to COVID-19, but it came at the expense of people’s freedom. With this in mind, we ask the question: What is the relationship between the mobility of individuals and the COVID-19 cases per capita of their state?

We explore this question using a range of variables related to mobility and COVID-19, building three linear specifications that differ in the quantity and type of their inputs. These input variables are based on three types of data: 1) Google’s mobility dataset, 2) state legislation and policies relevant to mobility, and 3) state characteristics relevant to mobility during the pandemic. Google’s mobility dataset compares different types of mobility within each state on each day to that state’s pre-pandemic mobility, such as mobility to retail and recreation locations, grocery stores and pharmacies, and workplaces. State legislation variables are based on the start and end dates of different types of state legislation that limit mobility, such as stay-at-home orders and restaurant closures. Last, state characteristics include variables such as population density and political association, both of which describe the environment and provide context for the mobility that occurred. The output variable, COVID-19 cases per capita (specifically, per 100,000 people) by state, is used to measure the relationship between mobility and COVID-19 transmission. While many different COVID-19 measurements could be used, we believe case counts are best suited to a non-time-series study such as this one. The reason is that deaths, arguably the best alternative measure of COVID-19 transmission, are a lagging measure, meaning that multiple time frames would have to be used. While it is possible to use multiple time frames to describe the relationship between the left- and right-hand sides of a model, it is difficult in this case to be certain of the correct lag to apply. With this in mind, we opted for a measure that is better suited to a single time frame: COVID-19 cases per capita.

At a time when the federal government has been unable to take decisive action with respect to movement, and differing partisan views among state governments have left the nation concerned, confused, and at times even immobile, it is important to identify the relationship between specific types of movement and COVID-19 cases per capita.

1.1 Data Sources

We conducted this study using data from March 15 to May 15, 2020, drawn from several data sources. For COVID-19 case information, we relied on data from Johns Hopkins, a leader in COVID-19 data, especially at the beginning of the pandemic in the United States, available on data.world. As stated above, this is a per capita metric: cases in this study means the number of new cases per 100,000 people from March 15 to May 15, 2020. This was calculated by subtracting the cumulative number of cases per 100,000 people up to March 15 from the cumulative number of cases per 100,000 people up to May 15. In general, this is a very clean dataset with very little missing data.
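The subtraction described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function name and the example counts and population are hypothetical.

```python
def window_cases_per_100k(cum_start: int, cum_end: int, population: int) -> float:
    """New cases per 100,000 residents between two cumulative case counts."""
    if population <= 0:
        raise ValueError("population must be positive")
    if cum_end < cum_start:
        raise ValueError("cumulative counts should not decrease over time")
    # Difference of cumulative counts, scaled to a per-100,000 rate.
    return (cum_end - cum_start) * 100_000 / population


# Hypothetical state: 120 cumulative cases by March 15, 14,120 by May 15,
# and 2,000,000 residents.
rate = window_cases_per_100k(120, 14_120, 2_000_000)
print(rate)
```

Subtracting raw counts and then scaling gives the same result as subtracting two per-100,000 rates, as long as the same population figure is used for both dates.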

Google’s mobility dataset provides state-level mobility data for specific types of locations, expressed relative to a baseline measured between January 3 and February 6, 2020 for each state. While this is in general a very clean dataset, some states have minor gaps in time; we consider their effect on the credibility of the dataset to be negligible. We take the average of each state’s mobility over the March 15 to May 15 time frame to determine the mobility to each type of location. It is important to note that mobility to certain locations, such as parks, may vary significantly in some states between the baseline time frame and the study’s time frame.
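As a rough sketch of this averaging step, the snippet below computes each state's mean percent change over the study window and simply skips missing days. The row layout, the function name, and the sample values are assumptions made for illustration; they are not taken from the Google dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def average_mobility(rows, start, end):
    """Average percent change from baseline per state within [start, end].

    rows: iterable of (state, day, pct_change); pct_change may be None
    for days with no reported value, and those days are skipped.
    """
    by_state = defaultdict(list)
    for state, day, pct in rows:
        if pct is not None and start <= day <= end:
            by_state[state].append(pct)
    return {state: mean(values) for state, values in by_state.items()}


# Hypothetical daily records for two states.
rows = [
    ("A", date(2020, 3, 14), -2.0),   # before the window, ignored
    ("A", date(2020, 3, 15), -10.0),
    ("A", date(2020, 3, 16), None),   # gap in the data, skipped
    ("A", date(2020, 3, 17), -30.0),
    ("B", date(2020, 5, 15), -40.0),
    ("B", date(2020, 5, 16), -5.0),   # after the window, ignored
]
avg = average_mobility(rows, date(2020, 3, 15), date(2020, 5, 15))
print(avg)
```

Skipping gaps rather than imputing them is one possible choice; it is reasonable only when the gaps are short, as the text above suggests they are.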

The state policy database provided us with information on the start and end dates of each state’s policies with respect to COVID-19. Our calculations were based on the total number of days during our time frame that a policy was in place, such as the number of days a shelter-in-place policy was in effect. It is worth noting that the data is generally clean; while we believe the source to be reliable, we did not independently verify the information provided for any policy.
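One way to count the days a policy was in place within the study window is to intersect the policy's date range with the window. The sketch below assumes that both endpoints are counted, that a missing start date means the policy was never enacted, and that a missing end date means it was still in force at the end of the window; the study does not state these conventions, so they are assumptions.

```python
from datetime import date
from typing import Optional

WINDOW_START = date(2020, 3, 15)
WINDOW_END = date(2020, 5, 15)


def days_in_place(policy_start: Optional[date],
                  policy_end: Optional[date],
                  window_start: date = WINDOW_START,
                  window_end: date = WINDOW_END) -> int:
    """Number of days in the window on which the policy was active."""
    if policy_start is None:          # assumed: policy never enacted
        return 0
    first = max(policy_start, window_start)
    # Assumed: an open-ended policy lasts through the end of the window.
    last = window_end if policy_end is None else min(policy_end, window_end)
    # Inclusive day count; zero when the ranges do not overlap.
    return max(0, (last - first).days + 1)


# Hypothetical policy in force for all of April 2020.
print(days_in_place(date(2020, 4, 1), date(2020, 4, 30)))
```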

The study also uses additional contextual information for each state, such as population density and the political party that the majority of voters supported in the 2020 presidential election. This provides an understanding of the environment in which mobility was occurring, which we felt may help in better understanding its relationship with the number of COVID-19 cases per capita.

Here is a summary table for the variables in our dataset:

| Data | Variables | Treatment |
| --- | --- | --- |
| COVID-19 Cases | cumulative_cases_per_100_000 | N/A |
| Google Mobility | avg_grocery_and_pharmacy_percent_change_from_baseline, avg_parks_percent_change_from_baseline, avg_transit_stations_percent_change_from_baseline, avg_workplaces_percent_change_from_baseline, avg_residential_percent_change_from_baseline | Average of each day for each state |
| Policy Data | days_shelter_in_place_policy, days_interstate_travel_quarantine_policy, days_face_mask_policy, days_restaurants_closure_policy, days_gyms_closure_policy, days_movie_theaters_closure_policy, days_bars_closure_policy, days_casinos_closure_policy | Sum of days in place for each state |
| 2020 Presidential Election Party | party_democrat | 1 if state voted Democrat in the 2020 presidential election, 0 otherwise |
| State Population Density | density_sq_mile | N/A |

2 Model Building Process

In this section we go through our process of building the regression models. We start with data cleansing and wrangling to prepare our data for further analysis. We then explain the exploratory data analysis (EDA) we performed to evaluate the distributions of variables, outliers, and correlations between variables in our dataset. Guided by the knowledge gained from the EDA, we built the linear regression models described below using the explanatory variables we were interested in. For each model, we examined both its statistical significance and its practical significance.

2.1 Data Cleansing and Data Wrangling

Our goal in this research is to build descriptive models that identify the correlation between mobility and the number of new COVID-19 cases. We investigated whether mobility, specifically mobility to certain types of locations, had a linear relationship with the number of COVID-19 cases during what we considered to be the beginning of the pandemic (March 15, 2020 to May 15, 2020). Furthermore, we examined which government policies and restrictions have a statistically significant relationship with the number of new COVID-19 cases per capita, as well as which contextual factors related to mobility are important in the spread of the disease. First, we examine histograms of the variables to determine which transformations are needed.

COVID-19 cases per 100,000 has a right-skewed distribution, as seen in Figure 2.1. This shows that while most states had only a few hundred cases per 100,000 individuals during the beginning of the pandemic, this was not the case for all states, as some topped 1,500 cases per 100,000 residents. After a log transformation, the distribution becomes much closer to normal. With this in mind, we used the log transform of the variable for the rest of the study.

Figure 2.1: Histogram of COVID-19 Cases Per Capita

State population density also has a right-skewed distribution, as seen in Figure 2.2. While most states have a population density of less than 105 people per square mile, some states are outliers in this regard, such as New Jersey with approximately 1,210 people per square mile. A log transformation made this variable more normally distributed, and the log-transformed variable is used throughout the rest of the study.

Figure 2.2: Histogram of Population Density
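To illustrate why a log transformation helps with right-skewed variables such as these, the sketch below compares the moment-based skewness of a variable before and after taking natural logs. The values are made up for illustration and are not the study's data, and the `skewness` helper is a hypothetical name.

```python
import math
from statistics import mean


def skewness(values):
    """Moment-based sample skewness (third central moment over sd cubed)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Illustrative right-skewed values (for example, people per square mile).
densities = [1.3, 6, 11, 25, 40, 57, 70, 95, 105, 180, 285, 420, 890, 1210]
logged = [math.log(v) for v in densities]

print(round(skewness(densities), 2), round(skewness(logged), 2))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is the property the histograms are used to check.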

We also explored transformations of the state policy variables, but ultimately concluded that those variables are bimodal in nature, most likely because state policies are highly dependent on the political associations and ideologies of the state governor. Many states did not implement specific state policies in response to COVID-19, especially in the early phase of the pandemic. In later sections, we explore what roles these state policy variables play in our regression models.

2.2 Exploratory Data Analysis

We focused on three aspects in our exploratory data analysis: normality checks, outlier detection and handling, and variable correlation assessment. First, we checked whether the variables in our data are normally distributed. While linear regression does not assume normality of the predictor or outcome variables, checking normality gave us insight into the distribution of the data, such as whether there is significant skewness that can be handled through data transformation. It also helped us identify outliers, which brings us to the next part: outlier detection and handling. There, we discuss the outliers found in our dataset and explain how we arrived at our decisions for handling them. Last, we examined correlations between the variables in our dataset to build intuition about their relationships and to preview potential collinearity issues in our regression model building process.

2.2.1 Normality check

We generated a Q-Q plot for each variable to check its normality. From the Q-Q plots in Figures 2.3 and 2.4, we can tell that cases per capita, retail and recreation mobility, grocery and pharmacy mobility, parks mobility, density per square mile, transit mobility, and workplace mobility are generally normally distributed with light tails. In contrast, the shelter-in-place, interstate quarantine, and face mask policy variables are not normally distributed and instead show bimodal results. We believe this may be because some states were slow to react to COVID-19, while others reacted faster and created restriction policies. Future studies could focus on whether the political party preference of a state is a contributing factor to its COVID-19 policies and, in turn, to the growth rate of COVID-19 cases.

Figure 2.3: Q-Q Plot of Variables

Figure 2.4: Q-Q Plot of Variables
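A Q-Q plot pairs the sorted sample values with the quantiles a normal distribution would produce at the same positions; points close to a straight line suggest approximate normality. The sketch below builds those pairs and summarizes straightness with a correlation coefficient. The plotting positions `(i + 0.5) / n`, the helper names, and the sample data are assumptions for illustration; this is not the plotting code behind the figures above.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def qq_points(values):
    """(theoretical normal quantile, sample quantile) pairs."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


def qq_correlation(values):
    """Correlation of the Q-Q points; values near 1 mean a straight line."""
    points = qq_points(values)
    return pearson([p[0] for p in points], [p[1] for p in points])


# Hypothetical samples: one bell-shaped, one bimodal like a policy-days variable.
bell = [NormalDist(50, 10).inv_cdf((i + 0.5) / 20) for i in range(20)]
bimodal = [0] * 10 + [60] * 10

print(qq_correlation(bell), qq_correlation(bimodal))
```

A bimodal variable, such as a policy that was either never enacted or in place for most of the window, produces a step-shaped Q-Q plot and a noticeably lower correlation than a bell-shaped one.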

2.2.2 Outliers detections and handling

Some outliers were detected in cases per capita, retail and recreation mobility, and population density per square mile, as shown in the figures below. Alaska, Hawaii, and Montana are outliers with respect to COVID-19 cases per capita, having far fewer cases than the rest of the states. This could be due to either the isolation of these states or their low population density. To deal with these outliers, we considered removing them or capping the case values at the lower quantile. However, because we only had a small dataset of 50 state-level records to work with, we decided to keep all of the data as-is to avoid data loss.

New York and New Jersey are also outliers with respect to retail and recreation mobility. This could be because New York and New Jersey are densely populated states where retail and recreation make up much of the space, compared with less densely populated states. After recreation and retail businesses were shut down due to the pandemic, retail and recreation mobility in these states dropped drastically, making them outliers. Again, we believe this reflects the true mobility changes and should not be ignored or removed from our study, and as such, we kept the retail and recreation percent change from baseline for New York and New Jersey in the dataset.

The population density per square mile of Alaska is also an outlier. We considered adjusting this data point to the 25th percentile of the overall data, but again, since we only had 50 data points, we ultimately decided to keep the dataset as it is.

Figure 2.5: Outliers Detection with Box and Whisker Plots

Figure 2.6: Outliers Detection with Box and Whisker Plots
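Box-and-whisker plots conventionally flag points that fall more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly. The 1.5 multiplier and the inclusive quantile method are assumptions (plotting libraries differ slightly in how they compute quartiles), and the state labels and values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(named_values, k=1.5):
    """Return the (name, value) pairs outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [v for _, v in named_values]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(name, v) for name, v in named_values if v < lower or v > upper]


# Hypothetical log case rates for eight states.
data = [
    ("S1", 3.8), ("S2", 5.6), ("S3", 5.9), ("S4", 6.0),
    ("S5", 6.2), ("S6", 6.4), ("S7", 6.7), ("S8", 7.1),
]
print(iqr_outliers(data))
```

Flagging a point this way does not by itself justify removing it; as discussed above, with only 50 state-level records the flagged states were kept.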

Figure 2.7: Boxplot of Log of COVID-19 Cases

##      State cumulative_cases_per_100_000_log
## 2   Alaska                         3.985831
## 11  Hawaii                         3.795264
## 26 Montana                         3.768384

Figure 2.8: Boxplot of avg_retail_and_recreation_percent_change_from_baseline

##         State avg_retail_and_recreation_percent_change_from_baseline
## 30 New Jersey                                              -52.04839
## 32   New York                                              -54.32258

Figure 2.9: Boxplot of density_sq_mile_log

##    State density_sq_mile_log
## 2 Alaska           0.2623643

The log of density per square mile for Alaska is an outlier, which is explainable because Alaska is very sparsely populated. We weighed the options of keeping it as is, removing Alaska, and setting its density_sq_mile_log value to the 25th percentile of the data. Because we believe the value reflects reality, and because we want to avoid data loss in a small sample, we decided to keep the values for Alaska.

2.2.3 Variables Correlations Assessment

We generated a correlation matrix of COVID-19 cases and the mobility variables to explore how strongly the variables are correlated with one another. From this, we determined that residential mobility and workplace mobility are highly correlated with the number of cases per capita. We were at first surprised by the high positive correlation between residential mobility and COVID-19 cases, because it suggests that the greater the residential mobility (time at home), the more cases of COVID-19 in the state. While not immediately intuitive, we believe this makes sense because once someone has COVID-19, they are very likely to stay at home for the foreseeable future. This period would be at least two weeks in order not to spread the disease, and even longer if the person has symptoms. Alternatively, as the number of COVID-19 cases increases, more and more people stay at home in an effort to avoid catching the virus, so residential mobility is higher.

Workplace mobility and COVID-19 cases have a high negative correlation. This can be explained by the fact that as COVID-19 cases increase, fewer and fewer people physically go to work, and therefore workplace mobility tends to decrease. It is also worth noting that workplace and residential mobility are highly negatively correlated. This is likely because as more people start to work from home, residential mobility goes up and workplace mobility goes down.
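A pairwise correlation matrix of this kind can be sketched as follows. The column names are shortened stand-ins and the values are hypothetical, chosen only to mimic the signs described above; they are not the study's data.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations between named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical values for six states.
columns = {
    "cases_log": [4.0, 5.0, 5.5, 6.0, 6.5, 7.2],
    "residential": [8, 10, 12, 13, 15, 19],
    "workplaces": [-25, -30, -33, -34, -38, -45],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Strong correlation between two explanatory variables, as between residential and workplace mobility here, is the kind of pattern that signals potential collinearity when both enter the same regression.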