1 Introduction

The COVID-19 pandemic took the world by surprise in late 2019 and reached the United States in early 2020. With it came an unprecedented period in history, as well as uncertainty about safe behavior and routines in the “new normal”. Compounding the difficulty was the United States government’s inability to coordinate collective action; instead, each state was left to determine its own limitations on behavior and routine, including movement restrictions. Some states enacted legislation that halted previously common activities, such as going to restaurants, further inhibiting movement. All of this was done out of safety concerns with respect to COVID-19, but came at the expense of people’s freedom. With this in mind, we ask the question: What is the relationship between the mobility of individuals and the COVID-19 cases per capita of their state?

We explore this question using a range of variables related to mobility and COVID-19, building three linear specifications that differ in the number and type of inputs. These input variables are based on three types of data: 1) Google’s mobility dataset, 2) state legislation and policies relevant to mobility, and 3) state characteristics relevant to mobility during the pandemic. Google’s mobility dataset compares different types of mobility within each state each day to that state’s pre-pandemic baseline, such as mobility to retail and recreation locations, grocery stores and pharmacies, and workplaces. State legislation variables are based on the start and end dates of different types of state legislation that limit or inhibit mobility, such as stay-at-home orders and restaurant closures. Last, state characteristics include variables such as population density and political association, both of which describe the environment and provide context for the mobility that occurred. The output variable, COVID-19 cases per capita (specifically, per 100,000 people) by state, is used to measure the impact of mobility on COVID-19 transmission. While many different COVID-19 measurements could be used, we believe case counts are best suited to a non-time-series study such as this one. The reason is that deaths, arguably the best alternative measure of COVID-19 transmission, are a lagging measure, meaning that multiple time frames would have to be used. While it is possible to use multiple time frames to describe the relationship between the left- and right-hand sides of a model, it is difficult to be certain of the correct lag to use. With this in mind, we have opted for a measure that is better suited to a single time frame: COVID-19 cases per capita.

In a time when the federal government has been unable to take decisive action with respect to movement, and partisan differences among state governments have left the nation concerned, confused, and even at times immobile, it is important to identify the relationship between specific types of movement and COVID-19 cases per capita.

1.1 Data Sources

We conducted this study using data from March 15 to May 15 of 2020 from several data sources. For COVID-19 case information, we relied upon data from Johns Hopkins, a leader on COVID-19 data especially at the beginning of the pandemic in the United States, available on data.world. As stated in the introduction, this is a per capita metric, with cases in this study meaning the total number of cases per 100,000 people from March 15 to May 15, 2020. This was calculated by subtracting the total number of cases per 100,000 people up until March 15 from the total number of cases per 100,000 people up until May 15. In general, this is a very clean dataset with very little missing data.
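The subtraction described above can be sketched in a few lines of R. This is a minimal illustration rather than the code used in the study: the state labels, counts, populations, and the helper `per_100k()` are all hypothetical.

```r
# Hypothetical cumulative case counts and populations (illustrative values only).
cases <- data.frame(
  state      = c("A", "B"),
  cum_mar15  = c(20, 5),          # cumulative cases through March 15
  cum_may15  = c(12020, 1505),    # cumulative cases through May 15
  population = c(4000000, 1000000)
)

# Hypothetical helper: convert a raw count to a rate per 100,000 people.
per_100k <- function(count, population) count / population * 1e5

# Cases per 100,000 accrued inside the window = rate at the end of the
# window minus rate at the start of the window.
cases$cumulative_cases_per_100_000 <-
  per_100k(cases$cum_may15, cases$population) -
  per_100k(cases$cum_mar15, cases$population)

print(cases[, c("state", "cumulative_cases_per_100_000")])
```

Differencing the two cumulative rates removes any cases recorded before the study window opened, so the metric reflects only the March 15 to May 15 period.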

Google’s mobility dataset provides state mobility data for specific types of locations relative to a baseline measured between January 3 and February 6, 2020 for each state. While this is in general a very clean dataset, some states have minor gaps in time; we consider their effect on the reliability of the data to be negligible. We take the average of each state’s mobility during the March 15 - May 15 time frame to determine the mobility to each type of location. It is important to note that mobility to certain locations, such as parks, may vary significantly in some states between the baseline time frame and the study’s time frame.
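As a rough sketch of the averaging step, the snippet below filters daily records to the study window and takes a per-state mean, skipping missing days. The daily values and state labels are hypothetical, and dropping gaps with `na.omit` is an assumption about how missing days were handled.

```r
# Hypothetical daily mobility records for two states (illustrative values only).
daily <- data.frame(
  state = c("A", "A", "A", "B", "B", "B"),
  date  = as.Date(c("2020-03-14", "2020-03-15", "2020-05-15",
                    "2020-03-20", "2020-04-01", "2020-05-16")),
  workplaces_percent_change_from_baseline = c(-5, -30, -40, -20, NA, -10)
)

# Keep only days inside the study window (both endpoints included).
in_window <- daily$date >= as.Date("2020-03-15") &
             daily$date <= as.Date("2020-05-15")

# Average per state; na.omit drops days with a gap in the data (assumption).
avg_mobility <- aggregate(
  workplaces_percent_change_from_baseline ~ state,
  data = daily[in_window, ],
  FUN = mean,
  na.action = na.omit
)
names(avg_mobility)[2] <- "avg_workplaces_percent_change_from_baseline"

print(avg_mobility)
```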

The state policy database provided us with information on the start and end dates of each state’s policies with respect to COVID-19. Our calculations were based on the total number of days during our time frame that a policy was in place, such as the number of days a shelter in place policy was in effect. It is worth noting that the data is generally clean; while we believe the source to be reliable, we did not independently confirm the information provided for any policies.
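One way to count the days a policy was in place during the study window is to intersect the policy's date range with the window. The sketch below is illustrative only: the dates and the helper `days_in_window()` are hypothetical, and counting both endpoints as full days is an assumption.

```r
window_start <- as.Date("2020-03-15")
window_end   <- as.Date("2020-05-15")

# Hypothetical policy dates; NA means the state never enacted the policy.
policies <- data.frame(
  state        = c("A", "B", "C"),
  policy_start = as.Date(c("2020-03-01", "2020-04-01", NA)),
  policy_end   = as.Date(c("2020-04-14", "2020-06-30", NA))
)

# Hypothetical helper: days of overlap between [start, end] and the window.
days_in_window <- function(start, end) {
  overlap_start <- pmax(as.numeric(start), as.numeric(window_start))
  overlap_end   <- pmin(as.numeric(end),   as.numeric(window_end))
  days <- overlap_end - overlap_start + 1   # inclusive of both endpoints
  days[is.na(days) | days < 0] <- 0         # no policy, or no overlap
  days
}

policies$days_policy <- days_in_window(policies$policy_start,
                                       policies$policy_end)
print(policies)
```

Under these assumptions a policy that spans the entire window would score 62 days, which is the upper bound for any of the `days_*_policy` variables.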

The study also uses additional contextual information, such as population density and the political party that the majority of voters in each state voted for in the 2020 presidential election. This provides an understanding of the environment in which mobility was occurring, which we felt might help in better understanding its relationship with the number of COVID-19 cases per capita.

Here is a summary table for the variables in our dataset:

Data	Variables	Treatment
COVID-19 Cases	cumulative_cases_per_100_000	N/A
Google Mobility	avg_grocery_and_pharmacy_percent_change_from_baseline, avg_parks_percent_change_from_baseline, avg_transit_stations_percent_change_from_baseline, avg_workplaces_percent_change_from_baseline, avg_residential_percent_change_from_baseline	Average of each day for each state
Policy Data	days_shelter_in_place_policy, days_interstate_travel_quarantine_policy, days_face_mask_policy, days_restaurants_closure_policy, days_gyms_closure_policy, days_movie_theaters_closure_policy, days_bars_closure_policy, days_casinos_closure_policy	Sum of days in place for each state
2020 Presidential Election Party	party_democrat	1 if state voted democrat in 2020 presidential election, 0 otherwise
State Population Density	density_sq_mile	N/A

2 Model Building Process

In this section we go through our process of building the regression models. We start with data cleansing and wrangling to prepare our data for further analysis. Then we explain the exploratory data analysis (EDA) we performed to evaluate the distributions of variables, outliers, and correlations between variables in our dataset. Guided by the knowledge gained from the EDA, we built the linear regression models described below with the explanatory variables we were interested in. For each model, we examined both its statistical significance and its practical significance.

2.1 Data Cleansing and Data Wrangling

Our goal in this research is to build descriptive models to identify the correlation between mobility and the number of new COVID-19 cases. We investigated whether mobility, specifically mobility to certain types of locations, had a linear relationship with the number of COVID-19 cases during what we considered to be the beginning of the pandemic (March 15, 2020 to May 15, 2020). Furthermore, we determined which government policies and restrictions have a statistically significant relationship with the number of new COVID-19 cases per capita, as well as which contextual factors related to mobility are important in the spread of the disease. First, we examine histograms of the variables to explain what transformations are needed.

COVID-19 cases per 100,000 has a right-skewed distribution, as seen in Figure 2.1. This shows that while most states had only a few hundred cases per 100,000 individuals during the beginning of the pandemic, this was not the case for all states, as some topped 1500 cases per 100,000 residents. After a log transformation, the distribution is much closer to normal. With this in mind, we used the log transform of the variable for the rest of the study.

Figure 2.1: Histogram of COVID-19 Cases Per Capita

State population density also has a right-skewed distribution, as seen in Figure 2.2. While most states have a population density of less than 105 people per square mile, some states are outliers in this regard, such as New Jersey with approximately 1,210 people per square mile. A log transformation brought this variable closer to a normal distribution, and the log-transformed version is used throughout the rest of the study.
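To illustrate why the log transformation helps with right-skewed variables such as these, the sketch below compares sample skewness before and after taking logs. The density values are placeholders chosen for illustration (only the approximate extremes of 1.3 and 1,210 people per square mile echo figures from this report), and `skewness()` is a hypothetical helper, not a function used in the study.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Placeholder densities (people per square mile); illustrative values only.
density_sq_mile <- c(1.3, 7, 11, 25, 43, 57, 70, 105, 110,
                     180, 240, 420, 890, 1210)
density_sq_mile_log <- log(density_sq_mile)

skew_raw <- skewness(density_sq_mile)
skew_log <- skewness(density_sq_mile_log)
print(round(c(raw = skew_raw, log = skew_log), 2))

# hist(density_sq_mile) and hist(density_sq_mile_log) show the same contrast.
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is the pattern the histograms are meant to show.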

Figure 2.2: Histogram of Population Density

We also explored transformations of the state policy variables, but ultimately concluded that those variables are bimodal in nature, most likely because state policies are highly dependent on the political associations and ideologies of the state governor. Many states did not implement specific state policies in response to COVID-19, especially in the early phase of the pandemic. In the later sections, we explore what roles these state policy variables have in our regression models.

2.2 Exploratory Data Analysis

We focused on three aspects in our exploratory data analysis: normality checks, outlier detection and handling, and variable correlation assessment. First, we checked whether the variables in our data are normally distributed. While linear regression does not assume normality of the predictor variables or the outcome variable, checking normality gave us insight into the distribution of the data, such as whether there is significant skewness that can be handled through data transformation. It also helped us identify outliers, which brings us to the second aspect: outlier detection and handling. There, we discuss the outliers found in our dataset and explain how we arrived at our decisions for handling them. Finally, we examine correlations between variables in our dataset to build intuition about their relationships and to preview potential collinearity issues in our regression model building process.

2.2.1 Normality check

We generated a Q-Q plot for each variable to check its normality. From the Q-Q plots in Figures 2.3 and 2.4, we can tell that cases per capita, retail and recreation mobility, grocery and pharmacy mobility, parks mobility, density per square mile, transit mobility, and workplace mobility are generally normally distributed with light tails. By contrast, the shelter in place, interstate quarantine, and face mask policy variables are not normally distributed and show bimodal results. We believe this may be because some states were slow to react to COVID-19, while others reacted faster and created restriction policies. Future studies could examine whether the political party preference of a state is a contributing factor to its COVID-19 policies and hence affects the growth rate of COVID-19 cases.
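A numeric companion to the visual Q-Q check is the correlation between the theoretical normal quantiles and the sample values: it is close to 1 when the points fall near a straight line. The sketch below is an illustration on simulated data, not the study's data; `qq_correlation()` and both example vectors are hypothetical.

```r
# Hypothetical helper: correlation between theoretical normal quantiles and
# the sample values. Values near 1 mean the Q-Q points lie near a line.
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # compute Q-Q coordinates without drawing
  cor(qq$x, qq$y)
}

set.seed(1)
roughly_normal <- rnorm(50, mean = 5.5, sd = 0.8)  # simulated, bell-shaped
bimodal_policy <- rep(c(0, 45), each = 25)         # simulated policy days

normal_r  <- qq_correlation(roughly_normal)
bimodal_r <- qq_correlation(bimodal_policy)
print(c(roughly_normal = normal_r, bimodal_policy = bimodal_r))

# To draw the plot itself: qqnorm(roughly_normal); qqline(roughly_normal)
```

A variable that piles up at two values, as the policy-day counts do, produces a step-shaped Q-Q plot and a visibly lower correlation.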

Figure 2.3: Q-Q Plot of Variables

Figure 2.4: Q-Q Plot of Variables

2.2.2 Outliers detections and handling

Outliers were detected in cases per capita, retail and recreation mobility, and population density per square mile, as shown in Figures 2.5 and 2.6. Alaska, Hawaii, and Montana are outliers with respect to COVID-19 cases per capita, with far fewer cases than the rest of the states. This could be due to either the isolation of these states or their low population density. To deal with these outliers, we considered removing them or replacing their case values with the lower quantile. However, because we only had a small dataset of 50 state-level records to work with, we decided to keep all of the data as-is to avoid data loss.

New York and New Jersey are outliers with respect to retail and recreation mobility. This could be because New York and New Jersey are densely populated states where retail and recreation make up much of the space, compared with less densely populated states. After recreation and retail businesses shut down due to the pandemic, retail and recreation mobility in these states dropped drastically, making them outliers. Again, we believe this reflects the true mobility changes and should not be ignored or removed from our study, and as such, we kept the retail and recreation percent change from baseline for New York and New Jersey in the dataset.

The population density per square mile of Alaska is also an outlier. We considered adjusting this data point to the 25% quantile of the overall data, but again, since we only had 50 data points, we ultimately decided to keep the dataset as it is.

Figure 2.5: Outliers Detection with Box and Whisker Plots

Figure 2.6: Outliers Detection with Box and Whisker Plots

The individual boxplots and flagged states for cumulative_cases_per_100_000_log, avg_retail_and_recreation_percent_change_from_baseline, and density_sq_mile_log follow.

Figure 2.7: Boxplot of Log of COVID-19 Cases

##      State cumulative_cases_per_100_000_log
## 2   Alaska                         3.985831
## 11  Hawaii                         3.795264
## 26 Montana                         3.768384

COVID-19 cases per capita in Alaska, Hawaii, and Montana are outliers, with far fewer cases than the rest of the states. This could be due to either the isolation of these states or their low population density. We considered removing them or replacing their case values with the lower quantile. However, because we only have a small dataset of 50 state-level records to work with, we decided to keep all of the data as-is to avoid data loss.

Figure 2.8: Boxplot of avg_retail_and_recreation_percent_change_from_baseline

##         State avg_retail_and_recreation_percent_change_from_baseline
## 30 New Jersey                                              -52.04839
## 32   New York                                              -54.32258

Retail and recreation mobility in New York and New Jersey are outliers. This could be because New York and New Jersey are densely populated states where retail and recreation make up much of the space, compared with less densely populated states. After recreation and retail businesses shut down due to the pandemic, retail and recreation mobility in these states dropped drastically, making them outliers. Again, we believe this reflects the true mobility changes and should not be ignored or removed from our study, and as such, we keep the retail and recreation percent change from baseline for New York and New Jersey in the dataset.

Figure 2.9: Boxplot of density_sq_mile_log

##    State density_sq_mile_log
## 2 Alaska           0.2623643

Density per square mile in the state of Alaska is an outlier. This is explainable, since far fewer people per square mile live in Alaska than in other states. We weighed the options of keeping the value as is, removing Alaska, and replacing density_sq_mile_log with the 25% quantile, and decided to keep it as is. Again, we believe the value reflects reality, and keeping it avoids data loss in a small sample.
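The box-and-whisker flagging used in this section follows the usual 1.5 × IQR rule, which base R exposes through `boxplot.stats()`. The sketch below is a minimal illustration: the three named values are the log case rates listed in the output above, while the remaining 17 values are made-up placeholders standing in for the other states.

```r
# Log case rates for the three flagged states, taken from the listing above.
log_cases <- c(Alaska = 3.985831, Hawaii = 3.795264, Montana = 3.768384)

# Placeholder values for other states (hypothetical, for illustration only).
placeholders <- c(5.3, 5.4, 5.5, 5.5, 5.6, 5.6, 5.7, 5.7, 5.8,
                  5.8, 5.9, 5.9, 6.0, 6.0, 6.1, 6.2, 6.4)
names(placeholders) <- sprintf("placeholder_%02d", seq_along(placeholders))

x <- c(log_cases, placeholders)

# boxplot.stats() returns the points lying beyond 1.5 * IQR from the hinges.
stats   <- boxplot.stats(x, coef = 1.5)
flagged <- x[x %in% stats$out]

print(flagged)
```

Keeping flagged points, as done here, is a judgment call; the same `stats$out` vector could equally be used to drop or cap them if a larger sample made that affordable.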

2.2.3 Variables Correlations Assessment

We generated a correlation matrix between COVID-19 cases and the mobility data to explore how strongly the variables are correlated with one another. From this, we determined that residential mobility and workplace mobility are highly correlated with the number of cases per capita. We were initially surprised by the high positive correlation between residential mobility and COVID-19 cases, because it suggests that the greater the residential mobility (time at home), the more cases of COVID-19 in the state. While not immediately intuitive, we believe this makes sense because once someone has COVID-19, they are very likely to stay at home for some time: at least two weeks to avoid spreading the disease, and longer if the person has symptoms. It could also be that as the number of COVID-19 cases increases, more people stay at home in an effort to avoid catching the virus, so residential mobility is higher.

Workplace mobility and COVID-19 cases have a high negative correlation. One explanation is that as COVID-19 cases increase, fewer and fewer people physically go to work, and therefore workplace mobility tends to decrease. It is also worth noting that workplace and residential mobility are highly negatively correlated. This is likely because as more people start to work from home, residential mobility goes up and workplace mobility goes down.

Figure 2.10: ggpairs Plot of Mobility Variables

We also generated a correlation matrix to analyze the correlation between the policy variables and the number of COVID-19 cases. With this, we see that face mask policy and COVID-19 cases have a strong positive correlation. This suggests that face mask policies were in place for longer in states where COVID-19 cases grew more.

We can also see that population density and cases have a strong positive correlation. This aligns with the intuition that more densely populated areas had higher COVID-19 cases at the beginning of the pandemic.
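For readers who want the numbers behind a pairs plot, base R's `cor()` produces the same pairwise Pearson correlations in matrix form. The sketch below uses a tiny made-up data frame; the values are hypothetical and chosen only to mimic the signs described above.

```r
# Hypothetical state-level values (illustrative only).
toy <- data.frame(
  cumulative_cases_per_100_000_log             = c(4.5, 5.0, 5.2, 5.9, 6.1, 6.6),
  avg_residential_percent_change_from_baseline = c(8, 10, 12, 14, 16, 18),
  avg_workplaces_percent_change_from_baseline  = c(-25, -30, -33, -38, -42, -47)
)

# Pairwise Pearson correlations between every pair of columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

Large off-diagonal values between two explanatory variables, such as residential and workplace mobility here, are an early warning of the collinearity that shows up later in the VIF checks.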

Figure 2.11: ggpairs Plot of State Policies and Context Variables

2.3 Modeling

Building on the findings from the exploratory data analysis, we built several descriptive models to help us understand the linear relationship between mobility data and COVID-19 cases. We demonstrate our linear modeling and model selection exercises below.

2.3.1 Model 1:

In the first attempt, we put together a descriptive model to get a general idea of the relationship between COVID-19 cases and average mobility in each state. To do this, we averaged all mobility variables from Google’s mobility dataset into a single mobility metric. The resulting model has a very low R-squared of only 0.0549, and the coefficient on the mobility metric is not statistically significant at the 0.05 level (p = 0.101). This tells us that we need a more complex model, one that examines different types of mobility as well as different policies, to better understand the relationship between mobility and the number of COVID-19 cases per capita.

# Model 1
model1 <- lm(cumulative_cases_per_100_000_log ~ mobility_mean, data=mobility_df)
summary(model1)

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ mobility_mean, 
##     data = mobility_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30379 -0.51172 -0.02607  0.64149  1.52484 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.25721    0.23036   22.82   <2e-16 ***
## mobility_mean -0.02366    0.01417   -1.67    0.101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8369 on 48 degrees of freedom
## Multiple R-squared:  0.0549, Adjusted R-squared:  0.03521 
## F-statistic: 2.788 on 1 and 48 DF,  p-value: 0.1015

While other single-variable models would have met the threshold for statistical significance, we felt it was important to show the simplest way of capturing mobility with the data we had. Many variables explain the context in which mobility occurs and may even affect one’s ability to be mobile, but only the Google data measures mobility directly. Although certain individual mobility variables are more statistically significant than the average of all of them combined, we wanted this model to represent mobility of all types in a single variable, and therefore felt an average was appropriate.
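The single mobility metric could be built with a row-wise mean over the Google mobility columns. The sketch below is one plausible construction, not necessarily the one used for `mobility_mean` in Model 1: it assumes all six mobility categories are averaged with equal weight, and the two rows of data are hypothetical.

```r
# Hypothetical per-state averages for the six mobility categories.
mobility_df <- data.frame(
  State = c("A", "B"),
  avg_retail_and_recreation_percent_change_from_baseline = c(-30, -50),
  avg_grocery_and_pharmacy_percent_change_from_baseline  = c(-5, -15),
  avg_parks_percent_change_from_baseline                 = c(20, -10),
  avg_transit_stations_percent_change_from_baseline      = c(-25, -55),
  avg_workplaces_percent_change_from_baseline            = c(-30, -40),
  avg_residential_percent_change_from_baseline           = c(10, 20)
)

# Select every mobility column by its shared suffix.
mobility_cols <- grep("_percent_change_from_baseline$",
                      names(mobility_df), value = TRUE)

# Equal-weight average across categories (assumption).
mobility_df$mobility_mean <- rowMeans(mobility_df[, mobility_cols])

print(mobility_df[, c("State", "mobility_mean")])
```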

2.3.2 Model 2_v1:

The second of the three models we developed for this study focused on utilizing a handful of variables that capture mobility and mobility’s context. Specifically with reference to context, a multitude of variables related to state policies and population density were examined. There are many possible combinations of this data to configure the model. With this in mind, we analyzed a few different possibilities below with respect to their statistical results and practical significance.

# Model 2
model2_v1 <- lm(cumulative_cases_per_100_000_log ~ 
                  avg_residential_percent_change_from_baseline + 
                  avg_workplaces_percent_change_from_baseline + 
                  avg_grocery_and_pharmacy_percent_change_from_baseline + 
                  days_interstate_travel_quarantine_policy + 
                  days_face_mask_policy +
                  density_sq_mile_log, data=mobility_df)

summary(model2_v1)

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ avg_residential_percent_change_from_baseline + 
##     avg_workplaces_percent_change_from_baseline + avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     days_interstate_travel_quarantine_policy + days_face_mask_policy + 
##     density_sq_mile_log, data = mobility_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93045 -0.32761 -0.00514  0.30022  1.02360 
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                           1.5129530  0.8502028
## avg_residential_percent_change_from_baseline          0.3154007  0.0698232
## avg_workplaces_percent_change_from_baseline           0.0346150  0.0286931
## avg_grocery_and_pharmacy_percent_change_from_baseline 0.1148316  0.0243530
## days_interstate_travel_quarantine_policy              0.0001327  0.0037932
## days_face_mask_policy                                 0.0276177  0.0073840
## density_sq_mile_log                                   0.2639510  0.0649685
##                                                       t value Pr(>|t|)    
## (Intercept)                                             1.780 0.082223 .  
## avg_residential_percent_change_from_baseline            4.517 4.84e-05 ***
## avg_workplaces_percent_change_from_baseline             1.206 0.234263    
## avg_grocery_and_pharmacy_percent_change_from_baseline   4.715 2.56e-05 ***
## days_interstate_travel_quarantine_policy                0.035 0.972250    
## days_face_mask_policy                                   3.740 0.000539 ***
## density_sq_mile_log                                     4.063 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4653 on 43 degrees of freedom
## Multiple R-squared:  0.7383, Adjusted R-squared:  0.7018 
## F-statistic: 20.22 on 6 and 43 DF,  p-value: 4.507e-11

vif(model2_v1)

##          avg_residential_percent_change_from_baseline 
##                                              9.510787 
##           avg_workplaces_percent_change_from_baseline 
##                                              6.448486 
## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                              5.863944 
##              days_interstate_travel_quarantine_policy 
##                                              1.630390 
##                                 days_face_mask_policy 
##                                              1.473139 
##                                   density_sq_mile_log 
##                                              1.864627

anova(model1, model2_v1, test='F')

## Analysis of Variance Table
## 
## Model 1: cumulative_cases_per_100_000_log ~ mobility_mean
## Model 2: cumulative_cases_per_100_000_log ~ avg_residential_percent_change_from_baseline + 
##     avg_workplaces_percent_change_from_baseline + avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     days_interstate_travel_quarantine_policy + days_face_mask_policy + 
##     density_sq_mile_log
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 33.621                                  
## 2     43  9.308  5    24.313 22.462 5.252e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
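The F statistic in the table above can be reproduced by hand from the residual sums of squares and residual degrees of freedom that it reports. The sketch below uses the rounded values printed in the table, so the result matches the reported 22.462 only up to rounding.

```r
# Values copied from the analysis of variance table above.
rss_model1 <- 33.621; df_model1 <- 48   # Model 1 residual SS and residual df
rss_model2 <- 9.308;  df_model2 <- 43   # Model 2_v1 residual SS and residual df

# F = (drop in RSS per added parameter) / (residual variance of larger model)
df_added <- df_model1 - df_model2
f_stat   <- ((rss_model1 - rss_model2) / df_added) / (rss_model2 / df_model2)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df1 = df_added, df2 = df_model2, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

This F-test is designed for nested models; whether Model 1 is strictly nested in Model 2_v1 depends on how `mobility_mean` was constructed, so the comparison is best read as descriptive.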

From the ggpairs plots, we determined that avg_residential_percent_change_from_baseline, avg_workplaces_percent_change_from_baseline, avg_grocery_and_pharmacy_percent_change_from_baseline, days_interstate_travel_quarantine_policy, days_face_mask_policy, and density_sq_mile_log are strongly correlated with COVID-19 cases per capita. Therefore, for the first version of Model 2, we added all of these variables to the linear model. The model has an R-squared of 0.738 and an adjusted R-squared of 0.702, indicating that the selected explanatory variables explain COVID-19 cases per capita reasonably well.

However, in conducting a VIF test, we found that the VIF scores for avg_residential_percent_change_from_baseline, avg_workplaces_percent_change_from_baseline, and avg_grocery_and_pharmacy_percent_change_from_baseline were all above 4. This indicated that we had included variables with strong multicollinearity. Furthermore, the summary() output for the model shows that days_interstate_travel_quarantine_policy and avg_workplaces_percent_change_from_baseline have p-values greater than 0.05 and are therefore not statistically significant at that level.

With this in mind, we decided to remove some explanatory variables to resolve the multicollinearity issue and to include only variables that are statistically significant.
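For intuition about what a VIF score measures, it can be computed by hand: regress each explanatory variable on the others and take 1 / (1 - R²). The sketch below is a base-R illustration on made-up data, not a replacement for the `vif()` call used in this report; `vif_by_hand()` and all values are hypothetical.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all of the other predictors.
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(name) {
    others <- setdiff(names(predictors), name)
    aux <- lm(reformulate(others, response = name), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Made-up predictors: workplace mobility is nearly a linear function of
# residential mobility, while density is largely unrelated to both.
residential <- c(8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
predictors <- data.frame(
  avg_residential_percent_change_from_baseline = residential,
  avg_workplaces_percent_change_from_baseline  =
    -2 * residential + c(1, -1, 0.5, -0.5, 1, -1, 0.5, -0.5, 1, -1),
  density_sq_mile_log = c(4, 6, 3, 5, 4.5, 3.5, 6.5, 5.5, 3, 5)
)

vifs <- vif_by_hand(predictors)
print(round(vifs, 2))
```

The two tightly linked mobility variables receive large VIFs while the unrelated variable stays near 1, which is the pattern that motivates dropping one of a collinear pair.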

2.3.3 Model 2_v2:

model2_v2<-lm(log(cumulative_cases_per_100_000_log) ~ 
                avg_residential_percent_change_from_baseline + 
                avg_grocery_and_pharmacy_percent_change_from_baseline + 
                days_face_mask_policy + density_sq_mile_log, data=mobility_df)

summary(model2_v2)

## 
## Call:
## lm(formula = log(cumulative_cases_per_100_000_log) ~ avg_residential_percent_change_from_baseline + 
##     avg_grocery_and_pharmacy_percent_change_from_baseline + days_face_mask_policy + 
##     density_sq_mile_log, data = mobility_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.166977 -0.052998  0.004576  0.042675  0.189801 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           0.892221   0.104536
## avg_residential_percent_change_from_baseline          0.046502   0.009372
## avg_grocery_and_pharmacy_percent_change_from_baseline 0.022206   0.003649
## days_face_mask_policy                                 0.004788   0.001392
## density_sq_mile_log                                   0.049528   0.012163
##                                                       t value Pr(>|t|)    
## (Intercept)                                             8.535 5.81e-11 ***
## avg_residential_percent_change_from_baseline            4.962 1.04e-05 ***
## avg_grocery_and_pharmacy_percent_change_from_baseline   6.085 2.33e-07 ***
## days_face_mask_policy                                   3.440 0.001267 ** 
## density_sq_mile_log                                     4.072 0.000186 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08783 on 45 degrees of freedom
## Multiple R-squared:  0.7056, Adjusted R-squared:  0.6795 
## F-statistic: 26.97 on 4 and 45 DF,  p-value: 1.896e-11

vif(model2_v2)

##          avg_residential_percent_change_from_baseline 
##                                              4.808311 
## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                              3.695200 
##                                 days_face_mask_policy 
##                                              1.469115 
##                                   density_sq_mile_log 
##                                              1.833938

After removing variables, we arrived at model2_v2, which has an adjusted R-squared of 0.679. Note that the call above applies log() to cumulative_cases_per_100_000_log, which is already log-transformed, so the residual scale and R-squared of this version are not directly comparable with those of the other versions. In addition, the VIF score for avg_residential_percent_change_from_baseline is 4.81, which is above 4 and suggests multicollinearity. With this in mind, we decided to remove avg_residential_percent_change_from_baseline and build another linear model.

2.3.4 Model 2_v3:

model2_v3 <- lm(cumulative_cases_per_100_000_log ~ 
                  avg_grocery_and_pharmacy_percent_change_from_baseline 
                + days_face_mask_policy + density_sq_mile_log, data=mobility_df)

summary(model2_v3)

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     days_face_mask_policy + density_sq_mile_log, data = mobility_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21346 -0.42770 -0.09103  0.44592  1.31867 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           3.674962   0.291460
## avg_grocery_and_pharmacy_percent_change_from_baseline 0.046011   0.016940
## days_face_mask_policy                                 0.038034   0.008868
## density_sq_mile_log                                   0.402062   0.072939
##                                                       t value Pr(>|t|)    
## (Intercept)                                            12.609  < 2e-16 ***
## avg_grocery_and_pharmacy_percent_change_from_baseline   2.716  0.00928 ** 
## days_face_mask_policy                                   4.289 9.11e-05 ***
## density_sq_mile_log                                     5.512 1.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5813 on 46 degrees of freedom
## Multiple R-squared:  0.5631, Adjusted R-squared:  0.5346 
## F-statistic: 19.76 on 3 and 46 DF,  p-value: 2.247e-08

vif(model2_v3)

## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                              1.817792 
##                                 days_face_mask_policy 
##                                              1.361210 
##                                   density_sq_mile_log 
##                                              1.505677

Model 2_v3 has a multiple R-squared of 0.5631 (adjusted R-squared of 0.5346), which is still a reasonable fit. Furthermore, the VIF for all variables is less than 4. Because this is a study of mobility, we wanted to add more mobility information to the model, so we tried adding avg_workplaces_percent_change_from_baseline to see whether it improved the R-squared while keeping multicollinearity at a reasonable level.
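The relationship between the multiple and adjusted R-squared values in the summary output can be checked with the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The helper name below is hypothetical; the inputs are the values printed in the model summaries above.

```r
# Hypothetical helper implementing the standard adjusted R-squared formula.
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# model2_v3: 50 states, 3 explanatory variables, multiple R-squared 0.5631.
adj_v3 <- adjusted_r_squared(0.5631, n = 50, p = 3)

# model2_v1: 50 states, 6 explanatory variables, multiple R-squared 0.7383.
adj_v1 <- adjusted_r_squared(0.7383, n = 50, p = 6)

print(round(c(model2_v3 = adj_v3, model2_v1 = adj_v1), 4))
```

Because the adjustment penalizes each added variable, the adjusted value is the more appropriate one to compare across model versions with different numbers of predictors.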

2.3.5 Model 2_v4:

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ avg_workplaces_percent_change_from_baseline + 
##     avg_grocery_and_pharmacy_percent_change_from_baseline + days_face_mask_policy + 
##     density_sq_mile_log, data = mobility_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02418 -0.40039 -0.06982  0.32390  1.36539 
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                            2.062630   0.801607
## avg_workplaces_percent_change_from_baseline           -0.053977   0.025137
## avg_grocery_and_pharmacy_percent_change_from_baseline  0.075527   0.021331
## days_face_mask_policy                                  0.034563   0.008691
## density_sq_mile_log                                    0.348698   0.074502
##                                                       t value Pr(>|t|)    
## (Intercept)                                             2.573 0.013446 *  
## avg_workplaces_percent_change_from_baseline            -2.147 0.037195 *  
## avg_grocery_and_pharmacy_percent_change_from_baseline   3.541 0.000942 ***
## days_face_mask_policy                                   3.977 0.000251 ***
## density_sq_mile_log                                     4.680 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5597 on 45 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.5685 
## F-statistic: 17.14 on 4 and 45 DF,  p-value: 1.317e-08

##           avg_workplaces_percent_change_from_baseline 
##                                              3.419717 
## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                              3.108606 
##                                 days_face_mask_policy 
##                                              1.409981 
##                                   density_sq_mile_log 
##                                              1.694200

## Analysis of Variance Table
## 
## Model 1: cumulative_cases_per_100_000_log ~ avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     days_face_mask_policy + density_sq_mile_log
## Model 2: cumulative_cases_per_100_000_log ~ avg_workplaces_percent_change_from_baseline + 
##     avg_grocery_and_pharmacy_percent_change_from_baseline + days_face_mask_policy + 
##     density_sq_mile_log
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     46 15.543                              
## 2     45 14.098  1    1.4446 4.6108 0.03719 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding avg_workplaces_percent_change_from_baseline back to the linear model resulted in a slightly higher adjusted R-squared (0.5685, up from 0.5631). Also, the VIF for all input variables is less than 4, and all variables are statistically significant at the 5% level (p-values below 0.05).

To further examine the models, we applied an F test comparing model2_v3 and model2_v4. Model2_v4 had a smaller residual sum of squares (RSS = 14.098, versus 15.543 for model2_v3) and the F test p-value was less than 0.05 (p = 0.037). This tells us that including avg_workplaces_percent_change_from_baseline leads to an improvement in model performance, and therefore we decided to keep model2_v4 as our final model for model 2. The variables also strike a balance between contextual / control variables and mobility data variables, which we believe is appropriate, as model 1 indicated to us that not all mobilities can be treated the same and that the context of the mobility is important to include. Please note that when “model two” is mentioned throughout the remainder of this paper, it is referring to model2_v4.
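
As a check on the ANOVA table above, the F statistic can be reproduced by hand from the two residual sums of squares. The notation is ours (RSS_r for the restricted three-variable model, RSS_u for the unrestricted four-variable model, q for the number of added variables, df_u for the unrestricted residual degrees of freedom), and the result is approximate because the inputs are rounded:

```latex
F = \frac{(RSS_r - RSS_u)/q}{RSS_u / df_u}
  = \frac{(15.543 - 14.098)/1}{14.098/45}
  \approx \frac{1.445}{0.3133}
  \approx 4.61
```

Because only one variable is added, this F statistic equals the square of that variable's t value in the four-variable model, (-2.147)^2 ≈ 4.61, which is why both tests report the same p-value of about 0.037.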

2.3.6 Model 3:

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     avg_parks_percent_change_from_baseline + avg_transit_stations_percent_change_from_baseline + 
##     avg_workplaces_percent_change_from_baseline + avg_residential_percent_change_from_baseline + 
##     days_shelter_in_place_policy + days_interstate_travel_quarantine_policy + 
##     days_face_mask_policy + days_restaurants_closure_policy + 
##     days_gyms_closure_policy + days_movie_theaters_closure_policy + 
##     days_bars_closure_policy + days_casinos_closure_policy + 
##     party_DEMOCRAT + density_sq_mile_log, data = mobility_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79434 -0.28457  0.02547  0.22551  0.72148 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                            2.1024961  1.0582163
## avg_grocery_and_pharmacy_percent_change_from_baseline  0.0799028  0.0328089
## avg_parks_percent_change_from_baseline                 0.0034518  0.0031061
## avg_transit_stations_percent_change_from_baseline      0.0048593  0.0113864
## avg_workplaces_percent_change_from_baseline            0.0288006  0.0351393
## avg_residential_percent_change_from_baseline           0.2892807  0.0835355
## days_shelter_in_place_policy                          -0.0029161  0.0061320
## days_interstate_travel_quarantine_policy               0.0001748  0.0043066
## days_face_mask_policy                                  0.0237862  0.0080977
## days_restaurants_closure_policy                       -0.0152909  0.0143164
## days_gyms_closure_policy                              -0.0001379  0.0158362
## days_movie_theaters_closure_policy                     0.0073297  0.0152668
## days_bars_closure_policy                              -0.0057315  0.0168013
## days_casinos_closure_policy                            0.0063933  0.0028149
## party_DEMOCRAT                                         0.1301219  0.1989809
## density_sq_mile_log                                    0.2845353  0.0800491
##                                                       t value Pr(>|t|)   
## (Intercept)                                             1.987  0.05505 . 
## avg_grocery_and_pharmacy_percent_change_from_baseline   2.435  0.02027 * 
## avg_parks_percent_change_from_baseline                  1.111  0.27424   
## avg_transit_stations_percent_change_from_baseline       0.427  0.67224   
## avg_workplaces_percent_change_from_baseline             0.820  0.41815   
## avg_residential_percent_change_from_baseline            3.463  0.00146 **
## days_shelter_in_place_policy                           -0.476  0.63743   
## days_interstate_travel_quarantine_policy                0.041  0.96786   
## days_face_mask_policy                                   2.937  0.00590 **
## days_restaurants_closure_policy                        -1.068  0.29301   
## days_gyms_closure_policy                               -0.009  0.99310   
## days_movie_theaters_closure_policy                      0.480  0.63423   
## days_bars_closure_policy                               -0.341  0.73510   
## days_casinos_closure_policy                             2.271  0.02958 * 
## party_DEMOCRAT                                          0.654  0.51755   
## density_sq_mile_log                                     3.555  0.00114 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4533 on 34 degrees of freedom
## Multiple R-squared:  0.8036, Adjusted R-squared:  0.7169 
## F-statistic: 9.274 on 15 and 34 DF,  p-value: 4.821e-08

## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                             11.211195 
##                avg_parks_percent_change_from_baseline 
##                                              1.857410 
##     avg_transit_stations_percent_change_from_baseline 
##                                              7.013354 
##           avg_workplaces_percent_change_from_baseline 
##                                             10.187629 
##          avg_residential_percent_change_from_baseline 
##                                             14.339718 
##                          days_shelter_in_place_policy 
##                                              2.962568 
##              days_interstate_travel_quarantine_policy 
##                                              2.213718 
##                                 days_face_mask_policy 
##                                              1.866217 
##                       days_restaurants_closure_policy 
##                                              7.260957 
##                              days_gyms_closure_policy 
##                                              7.910815 
##                    days_movie_theaters_closure_policy 
##                                              7.269959 
##                              days_bars_closure_policy 
##                                              6.697563 
##                           days_casinos_closure_policy 
##                                              1.625212 
##                                        party_DEMOCRAT 
##                                              2.311956 
##                                   density_sq_mile_log 
##                                              2.981821

## Analysis of Variance Table
## 
## Model 1: cumulative_cases_per_100_000_log ~ avg_workplaces_percent_change_from_baseline + 
##     avg_grocery_and_pharmacy_percent_change_from_baseline + days_face_mask_policy + 
##     density_sq_mile_log
## Model 2: cumulative_cases_per_100_000_log ~ avg_grocery_and_pharmacy_percent_change_from_baseline + 
##     avg_parks_percent_change_from_baseline + avg_transit_stations_percent_change_from_baseline + 
##     avg_workplaces_percent_change_from_baseline + avg_residential_percent_change_from_baseline + 
##     days_shelter_in_place_policy + days_interstate_travel_quarantine_policy + 
##     days_face_mask_policy + days_restaurants_closure_policy + 
##     days_gyms_closure_policy + days_movie_theaters_closure_policy + 
##     days_bars_closure_policy + days_casinos_closure_policy + 
##     party_DEMOCRAT + density_sq_mile_log
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     45 14.0984                                
## 2     34  6.9872 11    7.1112 3.1458 0.005006 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In model 3, we included all available mobility and state policy variables in our dataset. We also included political party preference, measured by which party carried the state in the 2020 election, and the population density of each state as explanatory variables. Despite achieving a higher adjusted R-squared (0.7169), we believe this model overfits our relatively small dataset, as we describe further in the section below.

3 Regression Table

# stargazer(model1, model2, model3, title="Regression Results")
stargazer(model1, model2_v4, model3, title="Regression Results",
dep.var.labels=c("COVID-19 Cases Per 100,000 People"),
covariate.labels=c("Mobility Average", "Retail and Rec. Mobility","Grocery and Pharm. Mobility",
"Parks Mobility","Transit Stations Mobility", "Workplaces Mobility", "Residential Mobility",
"Shelter in Place Plcy.", "Interstate Trav. Quarant. Plcy.", "Face Mask Plcy.", 
"Restaurant Closure Plcy.", "Gym Closure Plcy.", "Movie Theater Closure Plcy.", 
"Bars Closure Plcy.", "Casino Closure Plcy.", "Democratic Party", 
"State Population Density"), ci=TRUE, ci.level=0.90, single.row=TRUE, header=FALSE)

3.1 Statistical Significance

The statistical significance of each model varies, with fit and number of features increasing with each model number, as illustrated in Table 1. In model one, the mobility average, a variable formed by averaging all types of mobility in Google’s mobility dataset, falls just short of statistical significance at the 10% level, with a p-value of 0.101. This model does a poor job of fitting the data, with an adjusted R-squared indicating that it explains just 3.5% of the variance in the left hand side of the model. The F statistic of 2.79 (on 1 and 48 degrees of freedom) is also not statistically significant, showing that the model does not significantly improve on the restricted constant-only model, which again indicates a poor fit to the data.
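
The figures quoted for model one are mutually consistent, which can be checked with a few lines of base R. This is a minimal sketch that assumes n = 50 observations and a single regressor, as the 48 residual degrees of freedom imply; the variable names are ours.

```r
# Values reported in the text for model one.
f_stat <- 2.79
df_num <- 1   # one regressor (the mobility average)
df_den <- 48  # residual degrees of freedom

# Upper-tail p-value of the overall F test.
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# R-squared implied by the F statistic: R2 = F * k / (F * k + df_den).
r_squared <- (f_stat * df_num) / (f_stat * df_num + df_den)

# Adjusted R-squared penalises for the number of regressors.
n <- df_num + df_den + 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_den

print(round(c(p_value = p_value, r_squared = r_squared,
              adj_r_squared = adj_r_squared), 3))
```

With a single regressor the overall F test and the coefficient's t test coincide, so the p-value here should land close to the 0.101 reported for the mobility average, and the implied adjusted R-squared close to 3.5%.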

The second model contains four variables that balance Google’s mobility data against data describing the context in which mobility occurs. The two variables used from Google’s mobility data are 1) grocery and pharmacy mobility, and 2) workplaces mobility. Most notably, grocery and pharmacy mobility has a positive coefficient in the model but a slightly negative direct relationship with COVID-19 cases in figure 2.10, indicating that the other variables alter its relationship with the left hand side. Both mobility variables are statistically significant, with workplaces mobility significant at the 5% level and grocery and pharmacy mobility at the 0.1% level. Both contextual variables (number of days of the face mask policy and state population density) are significant at the 0.1% level and have positive coefficients, with a notably high coefficient for state population density. The adjusted R-squared indicates that 56.8% of the variance in the log of COVID-19 cases per 100,000 people is explained by this model, and the model’s F statistic is likewise significant at the 0.1% level.
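
Because the outcome is log-transformed, the coefficients of model two are easier to read as approximate percentage changes. The sketch below assumes that the `_log` variables use the natural logarithm (R's default for `log()`), which the text does not state explicitly. The vector name `coefs` is ours, and the values are copied from the model2_v4 summary.

```r
# Coefficients copied from the model2_v4 summary output.
coefs <- c(
  workplaces = -0.053977,
  grocery_pharmacy = 0.075527,
  face_mask_days = 0.034563
)

# For a log outcome and a level predictor, a one-unit increase in the
# predictor multiplies the outcome by exp(beta), holding the rest fixed.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 2))

# density_sq_mile_log is itself logged, so its coefficient is an elasticity:
# a 1% increase in density goes with roughly a 0.35% increase in cases.
density_elasticity <- 0.348698
print(density_elasticity)
```

Under that assumption, one additional percentage point of grocery and pharmacy mobility is associated with roughly 8% more cumulative cases per 100,000 people, one additional day of face mask policy with about 3.5% more, and one additional percentage point of workplaces mobility with about 5% fewer. These are associations conditional on the other variables, not causal effects.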

The level of fit continues to strengthen in model three, with more variables being added to the model and the F statistic again significant at the 0.1% level. It is notable that face mask policy, state population density, and grocery and pharmacy mobility again reached statistical significance, indicating their persistence and importance in explaining cases of COVID-19. While the adjusted R-squared has increased to 0.717, indicating a better fit, this should be taken with caution, as the data is almost surely being overfit: with 15 variables and only 50 data points, the model violates the rule of thumb of at most one variable per 10 data points. It is also notable that the residual standard error does not greatly improve between models two and three (0.560 versus 0.453). This, along with the addition of many statistically insignificant variables, indicates that these variables are not adding much benefit to the model. The VIF test also shows a significant multicollinearity issue in model 3, with VIF values above 10 for grocery and pharmacy, workplaces, and residential mobility. Furthermore, while testing various combinations of variables in model 2, we found high multicollinearity whenever a mobility variable was combined with residential mobility. Because of this, and because residential mobility does not accurately represent the type of mobility we are interested in, we decided not to include it in model 2 even though it clearly has a strong correlation with COVID-19 cases per capita.
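
The gap between the multiple and adjusted R-squared values illustrates the overfitting concern. As a worked check using the standard adjusted R-squared formula (with n = 50 states and k regressors; the symbols are ours), the reported values can be reproduced from the summary outputs:

```latex
\begin{aligned}
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \\
\text{Model two } (k = 4):\quad \bar{R}^2 &= 1 - (1 - 0.6037)\cdot\frac{49}{45} \approx 0.5685 \\
\text{Model three } (k = 15):\quad \bar{R}^2 &= 1 - (1 - 0.8036)\cdot\frac{49}{34} \approx 0.7169
\end{aligned}
```

The penalty factor grows from 49/45 ≈ 1.09 in model two to 49/34 ≈ 1.44 in model three, so a noticeably larger share of model three's raw fit is discounted once the number of variables is taken into account.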

3.2 Practical Significance

From a practical standpoint, model one does a poor job of explaining cases as a function of mobility. While we believe there is a relationship between mobility and COVID-19 cases, this is not captured in this model due to the conflicting nature of the mobilities measured in the dataset. As is seen in figure 2.10, some types of mobility show a positive correlation with the number of COVID-19 cases, whereas others have a negative correlation. Averaging these together creates an incoherent variable that is not well correlated with COVID-19 cases. While this means that the model is not equipped to describe the relationship between COVID-19 and mobility, it is an important finding that not all types of mobility should be treated as the same with respect to their impact on the number of COVID-19 cases.

The second model’s balance of context and specific types of relevant mobility, as well as its strong significance without overfitting, led us to believe that this is our best model. Due to the limited number of data points, we could only use a few variables within the model without overfitting the data. We also felt it important to include the context in which mobility occurred, which led us to include state population density and face mask policy. These turned out to be two strong indicators related to COVID-19 cases, as both are statistically significant. However, the positive coefficient tied to face masks is interesting, as there has been plenty of information detailing the importance of face masks in stopping the transmission of COVID-19. Furthermore, figure 2.11, which shows the direct relationship between face mask policy and COVID-19 cases, also finds this relationship to be positive. One possible explanation is that conservative states, which more often opposed the use of face masks during the pandemic, also tend to be less densely populated. As shown in model two, state population density has a sizable coefficient, and figure 2.11 also shows a strong correlation between COVID-19 cases and state population density. This does not suggest that face masks fail to help stop the spread of COVID-19; instead, it shows that there is a complex picture at play among all of the factors that explain the number of COVID-19 cases per capita.

The mobility data in this model is notable for multiple reasons. As explained through the exploratory data analysis above, these were statistically the most logical types of mobility to include with respect to multicollinearity and statistical significance. We believe this choice is further backed by the state of the world from March 15 to May 15, 2020. At that time, going to the grocery store was a notable event with respect to COVID-19. Governments urged citizens not to go often for fear of transmitting the disease, and trips were at times chaotic, as many necessities such as toilet paper were sold out. While the data indicates a negative relationship between COVID-19 cases and workplaces mobility, this should be taken with a grain of salt. First, Google’s workplaces mobility focuses on places like offices, not places like grocery stores, where many people still worked. Second, as far fewer people went to work, as is described in the data, it may in fact have been fairly safe for those who still did go in, as they were able to socially distance themselves from others. The bottom line is that this metric does not do a good job of covering essential workers, as people in roles such as bus drivers, healthcare workers, and grocery store cashiers were in much different environments than those captured in the workplaces mobility metric. Finally, while there is a slight negative correlation in figure 2.10 between grocery and pharmacy mobility and COVID-19 cases, it is not surprising that grocery and pharmacy mobility has a positive coefficient in our model, where contextual factors like face mask policy and population density are accounted for.

Model three is the most complex model we utilized in this study. However, this complexity makes its relationships difficult to describe. Because so many variables are used in an overfit model, it is very challenging to find meaning in any single coefficient, as no coefficient can be interpreted without respect to the many other variables in the model. With this in mind, this model bears little practical significance, though the similar coefficients and significance of state population density and face mask policy in both models two and three are notable in backing model two’s findings.

No model is perfect, and the models presented in this study are no exception. However, we believe that our best model for describing the relationship between COVID-19 cases and mobility is model two. We are confident in this based on exploratory data analysis, the type and quantity of variables included, and the other statistical comparison methods mentioned above. Below we discuss the limitations of this model with respect to the six classical linear model assumptions.

4 Limitations of Model

In the section below we examine the limitations of our linear regression models. To do this, we compared the classical linear model (CLM) assumptions against our second model (model2_v4). The CLM assumptions are:

Linear Conditional Expectation
i.i.d.
No Perfect Collinearity
Zero Conditional Mean Error
Homoskedasticity
Normality of Error term

These checks are used to establish the credibility and validity of our model. After assessing the test results for these assumptions, we are able to make strong arguments about the insight we gained from the explanatory variables in these models.

Before using plots and other statistical tests to compare against the CLM assumptions, we use the ‘gvlma’ package in R to gain an understanding of the limitations of our model with respect to the metrics listed below. Some of these relate directly to the CLM assumptions, whereas others provide further information about this model.

gvlma(model2_v4)

## 
## Call:
## lm(formula = cumulative_cases_per_100_000_log ~ avg_workplaces_percent_change_from_baseline + 
##     avg_grocery_and_pharmacy_percent_change_from_baseline + days_face_mask_policy + 
##     density_sq_mile_log, data = mobility_df)
## 
## Coefficients:
##                                           (Intercept)  
##                                               2.06263  
##           avg_workplaces_percent_change_from_baseline  
##                                              -0.05398  
## avg_grocery_and_pharmacy_percent_change_from_baseline  
##                                               0.07553  
##                                 days_face_mask_policy  
##                                               0.03456  
##                                   density_sq_mile_log  
##                                               0.34870  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model2_v4) 
## 
##                      Value p-value                Decision
## Global Stat        3.54812  0.4706 Assumptions acceptable.
## Skewness           1.71336  0.1906 Assumptions acceptable.
## Kurtosis           0.23807  0.6256 Assumptions acceptable.
## Link Function      0.08619  0.7691 Assumptions acceptable.
## Heteroscedasticity 1.51051  0.2191 Assumptions acceptable.

Global Stat is the overall metric; it states whether the model, as a whole, passes or fails. It combines the four directional tests listed below into a single test of whether the relationship between the input variables (Xs) and the outcome variable (COVID-19 cases) is adequately described by the linear model. Rejection of the null (p < .05) indicates that at least one assumption is violated, for example through a non-linear relationship between one or more of the Xs and COVID-19 cases. In our gvlma result, we fail to reject the null hypothesis (p = 0.4706), which suggests that the linear model assumptions are acceptable overall.

Skewness - This tests the skewness of the data’s distribution, checking if additional transformations are needed to meet the assumption of normality. Rejection of the null (p < .05) indicates that the data is significantly skewed. In the gvlma result, we fail to reject the null hypothesis, meaning that the test indicates this metric is satisfactory.
Kurtosis - This tests if our distribution has kurtosis (highly peaked or very shallowly peaked), necessitating a transformation to meet the assumption of normality. Rejection of the null (p < .05) indicates that the distribution has kurtosis. In our gvlma result, this result is acceptable. This means no additional data transformation is needed.
Link Function - This tests whether the link function is misspecified, which can happen, for example, when the dependent variable is not truly continuous. Rejection of the null (p < .05) indicates that an alternative form of the generalized linear model (e.g. logistic or binomial regression) should be used. We fail to reject the null hypothesis, leading us to believe that a linear model is acceptable to use in this case.
Heteroscedasticity - This test is used to determine if the variance of residuals is constant across the range of X, indicating homoscedasticity. Rejection of the null (p < .05) indicates that the residuals are heteroscedastic, and thus non-constant across the range of X. We fail to reject the null hypothesis, indicating that this assumption is met.
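
The relationship between the Global Stat and the four directional tests can be checked directly from the gvlma output above. The sketch below assumes, as the output header suggests, that the global statistic is referred to a chi-squared distribution with 4 degrees of freedom and each directional statistic to a chi-squared distribution with 1 degree of freedom. The variable names are ours.

```r
# Directional test statistics copied from the gvlma output.
components <- c(
  skewness = 1.71336,
  kurtosis = 0.23807,
  link_function = 0.08619,
  heteroscedasticity = 1.51051
)

# The global statistic is the sum of the four directional statistics.
global_stat <- sum(components)

# p-values under the assumed chi-squared reference distributions.
global_p <- pchisq(global_stat, df = 4, lower.tail = FALSE)
component_p <- pchisq(components, df = 1, lower.tail = FALSE)

print(round(global_stat, 4))
print(round(global_p, 4))
print(round(component_p, 4))
```

The sum comes to about 3.548 with a p-value of about 0.47, in line with the reported Global Stat row. This also means that a strong failure of any single directional test would push the global test toward rejection.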

This provided us a good start for testing our model. Since we only have 50 records to work with, a very small dataset, it is necessary to break down our model with respect to the six CLM assumptions as listed at the beginning of this section. Below, we examine these for our model 2.

par(mfrow=c(2,2))
plot(model2_v4)

Figure 4.1: Model2_v4 Plots

4.1 Linearity

In the residuals vs fitted plot in figure 4.1, we see that the red line is nearly flat, indicating that the relationship between the explanatory variables and the outcome is approximately linear. With this in mind, the linearity assumption is met.

4.2 iid random sample

There are many issues with independent and identically distributed data in this study. With respect to the left hand side of the models, each state’s cases per capita are not independent of one another. Many states did not require interstate quarantine policies throughout this period of the pandemic, and even if this was in place, it was not necessarily strictly enforced. This means that people were almost surely transmitting the virus between people who lived in separate states, resulting in them not being independent of one another. Furthermore, our ability to capture the number of cases in each state is dependent on state testing, and states varied significantly in their ability and desire to pursue this, resulting in a lack of identically distributed data on cases of COVID-19.

When looking at the right hand side of the models, mobility is no different in this regard. People were free to travel to other states throughout the beginning of the pandemic, and even if they entered a state with an interstate travel quarantine in place, after 14 days they could move freely in that area. With this in mind, it is clear that people could have logged mobility data in Google’s mobility dataset within multiple states, meaning that this data is not independent. Furthermore, as states began to roll out policies with COVID-19 restrictions, the choice to do so became largely politically motivated. With this in mind, it quickly became clear that state governments listened to one another for guidance on how to proceed in outputting said policies. This also means that each state was not independent of one another in this manner either.

4.3 No Perfect Multi-Collinearity

We conduct this test to make sure that no explanatory variable has a perfect linear relationship with any other explanatory variables. We assess this assumption by conducting the VIF test seen below.

vif(model2_v4)

##           avg_workplaces_percent_change_from_baseline 
##                                              3.419717 
## avg_grocery_and_pharmacy_percent_change_from_baseline 
##                                              3.108606 
##                                 days_face_mask_policy 
##                                              1.409981 
##                                   density_sq_mile_log 
##                                              1.694200

All variables in model2_v4 have a VIF score of less than 4. This indicates that our model does not have a perfect multicollinearity issue and hence meets this assumption.
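
To make the VIF values more concrete: the VIF for a variable equals 1 / (1 - R²), where R² comes from regressing that variable on the other explanatory variables. The sketch below inverts this standard definition to recover the implied auxiliary R² for each variable from the reported VIFs. The vector names are ours, with the first two variable names shortened.

```r
# VIF values copied from the vif(model2_v4) output.
vif_values <- c(
  avg_workplaces = 3.419717,
  avg_grocery_and_pharmacy = 3.108606,
  days_face_mask_policy = 1.409981,
  density_sq_mile_log = 1.694200
)

# VIF_j = 1 / (1 - R2_j), so R2_j = 1 - 1 / VIF_j.
aux_r_squared <- 1 - 1 / vif_values
print(round(aux_r_squared, 3))

# Perfect collinearity would mean R2_j = 1, i.e. an infinite VIF.
stopifnot(all(aux_r_squared < 1))
```

About 71% of the variation in workplaces mobility is explained by the other three variables. That is sizable, but well short of the perfect collinearity that this assumption rules out.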

4.4 Conditional mean error is zero

To test this assumption, we check whether the residuals of our model have a mean of zero across the range of fitted values. This can be assessed with the residuals vs fitted plot in figure 4.1. In this plot, we see that the residuals are fairly evenly spread above and below zero and that the red line stays close to the zero line. This indicates that the residuals show no systematic pattern against the fitted values, suggesting that this assumption is met.

4.5 The error term has a constant variance (no heteroskedasticity).

We check the assumption of constant variance of the residuals (homoscedasticity) for our model by assessing the Scale-Location (or Spread-Location) plot in figure 4.1. In this plot, the red line is not flat with equally spread points; instead, it slopes at both small and large fitted values. So, although the gvlma heteroscedasticity test above did not reject the null hypothesis, our model may have a heteroscedasticity problem. We believe this may be caused by outliers in the data. Specifically, as our findings note at the beginning of the study, mobility data, population density and COVID-19 cases take very high values in certain states (New York and New Jersey), whereas they take very low values in Alaska. Similarly, due to political preferences as well as COVID-19 awareness within states, COVID-19 related policies could range from extremely liberal to extremely conservative.

This is a limitation of our model. To resolve the issues with heteroscedasticity in a future study, we could attempt to apply transformations to the data or sample the data in a different manner, as not meeting this assumption may be in part because of the data not being i.i.d. as discussed above. Another technique to help resolve this could be to use a non-linear model.
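
A formal test can complement the visual read of the Scale-Location plot. The sketch below runs a studentized Breusch-Pagan test by hand in base R on simulated data; the data, the single regressor `x`, and all variable names are hypothetical and are not the study's dataset. It is a swapped-in illustration of the technique, not the analysis performed in this paper.

```r
set.seed(42)

# Hypothetical data: 50 observations whose error spread grows with x.
n <- 50
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan test: regress the squared residuals on the
# regressors; n * R-squared is asymptotically chi-squared with degrees
# of freedom equal to the number of regressors (here 1).
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

For a model with four regressors, such as model two, the auxiliary regression would include all four and the reference distribution would have 4 degrees of freedom. The `bptest()` function in the lmtest package provides a packaged version of this test.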

4.6 Errors are normally distributed

We use the normal Q-Q plot of the residuals in figure 4.1 to assess whether our model meets this assumption. The points follow a fairly straight line with a light tail, indicating that the residuals are close to normally distributed, meaning our model meets this assumption. However, the residuals vs leverage plot in figure 4.1 highlights the three most extreme points, with standardized residuals higher than 3. In future studies, we would look for a different way to transform the data, or fit a non-linear model, to address these data points.
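
The Q-Q plot can be paired with a numerical check. The sketch below applies the Shapiro-Wilk test and counts large standardized residuals using base R functions (`shapiro.test` and `rstandard`). It runs on simulated data with hypothetical variable names, so its output says nothing about the study's model; it only shows the shape of the check.

```r
set.seed(1)

# Hypothetical data standing in for 50 state-level records.
n <- 50
x <- rnorm(n)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# Count observations whose standardized residual exceeds 3 in magnitude.
n_extreme <- sum(abs(rstandard(fit)) > 3)
print(n_extreme)
```

A small p-value from the Shapiro-Wilk test would be evidence against normally distributed errors, while the count of extreme standardized residuals gives a quick way to confirm what the residuals vs leverage plot highlights.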

5 Discussion of Omitted Variables

Among our regression models to explain the number of cases per capita (cumulative_cases_per_100_000_log as the model outcome), we conclude that model2_v4 is able to provide the best statistical significance and practical significance to allow us to associate mobility-related factors to COVID-19 cases for each state. Since explanatory variable density_sq_mile_log has the most prominent coefficient in our model2_v4, we decided to focus on this variable as our subject to understand how omitted variable bias impacts our regression model.

Here is the summary table of our 5 most important omitted variables:

Omitted Variable	Correlation to COVID-19 Cases	Correlation to Population Density	Direction of Bias
long_term_health_problem	Positive	Negative	Towards Zero
age_65_or_above	Positive	Negative	Towards Zero
social_distancing	Negative	Negative	Away from Zero
poverty	Positive	Negative	Towards Zero
awareness_responsibility	Negative	Positive	Towards Zero
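
The direction-of-bias column follows from the standard omitted variable bias expression. As a sketch in our own notation, suppose the true model includes an omitted variable z with coefficient γ, and let δ₁ be the slope from regressing z on population density (the other regressors are ignored here for simplicity). The expected value of the density coefficient estimated without z is then:

```latex
\begin{aligned}
\text{cases} &= \beta_0 + \beta_1\,\text{density} + \gamma\, z + u \\
z &= \delta_0 + \delta_1\,\text{density} + v \\
E[\hat{\beta}_1] &= \beta_1 + \gamma\,\delta_1
\end{aligned}
```

Since the estimated density coefficient is positive, a negative product γδ₁ (for example long_term_health_problem, with γ > 0 and δ₁ < 0) pulls the estimate towards zero, while a positive product (social_distancing, with γ < 0 and δ₁ < 0) pushes it away from zero.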

Our first omitted variable would be the number of people per capita with a long term health problem in each state. We believe people with health issues would be more prone to develop serious illness from COVID-19 infection, which would lead to a higher chance of being tested and confirmed as a case of the virus. By that reasoning, the higher the number of people per capita with serious long term health problems, the more cases a state would have. Furthermore, rural areas tend to have a larger share of their population with long term health problems. Hence the number of people per capita with serious long term health problems would be negatively correlated with density_sq_mile_log. Given that this omitted variable correlates positively with the number of COVID-19 cases and negatively with the population density of each state (which has a positive coefficient in our regression model), the bias from this omitted variable would be towards zero.
The number of residents aged 65 or above per capita in each state would also be a significant omitted variable, since it may be strongly correlated with COVID-19 cases per capita: those with weaker immune systems are more likely to develop symptoms of the disease. Like people with long-term health problems, older residents are more prone to developing serious illness from COVID-19 infection and are more likely to require medical attention (and thus to be tested and confirmed as a positive case). By that logic, the number of residents aged 65 or above per capita should be positively correlated with the number of COVID-19 cases. In the US, rural regions tend to have an older population, so we believe this variable is negatively correlated with each state’s population density. Given that the coefficient of density_sq_mile_log in our regression model is positive, the direction of this OVB would be towards zero.
While we do not have contact tracing data to understand how people practice social distancing, we believe this omitted variable has a significant impact on the number of COVID-19 cases. This variable, measuring the percentage of the population practicing social distancing (i.e. keeping at least 6 ft away from other people more than 90% of the time when outside their home), would be negatively correlated with the number of COVID-19 cases. We also believe that it is easier for residents to practice social distancing if their communities are less crowded, meaning there is a negative correlation between the level of social distancing practiced and the population density of each state. Because both correlations are negative, the resulting bias is positive, the same sign as the coefficient of density_sq_mile_log, so the direction of this OVB would be away from zero.
We believe the share of each state’s population with annual income below the poverty line would have some effect on the number of COVID-19 cases in that state. People in poverty often have jobs that require physical presence at the workplace, and their employers are less likely to provide appropriate personal protective equipment (PPE). They may also rely on social support programs (e.g. shelters), making them more vulnerable to COVID-19. We therefore believe there is a positive relationship between the percentage of the population in poverty and the number of COVID-19 cases in each state. At the same time, rural regions tend to have a higher percentage of their population in poverty because of fewer job opportunities and fewer government programs to help fight poverty. Hence we believe there is a negative correlation between the poverty rate and population density, which would bring the OVB from this omitted variable towards zero.
Last but not least, we believe the level of individual awareness of, and responsibility towards, COVID-19 precautions and guidelines would affect the number of COVID-19 cases in every state. This data would be challenging to collect (it would likely take the form of non-parametric data from a survey asking randomly selected individuals how they view and take precautions against COVID-19). However, we believe that it would be negatively correlated with COVID-19 cases. We also believe that the average level of individual awareness/responsibility would be higher in urban regions, since residents there are exposed to more media coverage of COVID-19, especially during the early phase of the pandemic in 2020. Therefore, there is a positive relationship between this omitted variable and the population density of each state. The OVB from this omitted variable would be towards zero, which reduces the statistical significance of the coefficient of the population density variable density_sq_mile_log.
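
The sign reasoning applied to the five omitted variables above can be sketched in a few lines. This is an illustration under stated assumptions, not part of our fitted models: it treats the sign of each believed correlation as a stand-in for the sign of the corresponding partial regression coefficient, considers one omitted variable at a time, and assumes the estimated coefficient of density_sq_mile_log shares the (positive) sign of the true coefficient. The function name `ovb_direction` and the `believed_signs` dictionary are hypothetical helpers introduced only for this example.

```python
def ovb_direction(effect_on_outcome, relation_to_regressor, coef_sign=1):
    """Classify omitted variable bias for a single omitted variable.

    effect_on_outcome:     +1 or -1, sign of the omitted variable's
                           relationship with COVID-19 cases.
    relation_to_regressor: +1 or -1, sign of its relationship with
                           population density.
    coef_sign:             +1 or -1, sign of the included coefficient
                           (positive for density_sq_mile_log).
    """
    # The bias has the sign of the product of the two relationships.
    bias_sign = effect_on_outcome * relation_to_regressor
    # Bias with the same sign as the coefficient inflates its magnitude.
    return "Away from Zero" if bias_sign == coef_sign else "Towards Zero"

# Signs as stated in the summary table (cases, population density).
believed_signs = {
    "long_term_health_problem": (+1, -1),
    "age_65_or_above": (+1, -1),
    "social_distancing": (-1, -1),
    "poverty": (+1, -1),
    "awareness_responsibility": (-1, +1),
}

for name, (cases_sign, density_sign) in believed_signs.items():
    print(name, ovb_direction(cases_sign, density_sign))
```

Under these assumptions the sketch reproduces the "Direction of Bias" column of the summary table; only social_distancing pushes the density coefficient away from zero.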

6 Conclusion

In this study, we aimed to describe the relationship between COVID-19 cases per capita and mobility. Our findings suggest that the type of correlation between the two is largely dependent on contextual variables and on the type of mobility that is occurring. The strongest variable at play in our models was state population density. This makes sense in many ways, as transmission of COVID-19 is known to occur most often when people are bunched together in tight spaces, which is more likely in a highly populated area. Each of the mobility variables from Google added to our understanding of the complex relationship between mobility and COVID-19 cases per capita. When combined into a single model, however, they often led to high multicollinearity or too much noise for a linear model to explain, resulting in a clear need for contextual variables. This points to the context in which the mobility occurs, and not the mobility itself, as the most important type of variable in explaining COVID-19 cases per capita.

With the first specification lacking statistical significance and the third containing too many variables for the dataset, the second model discussed in this study is clearly the best from a statistical and practical standpoint. With just four variables, it is able to explain 56.8% of the variance in COVID-19 cases per capita for each state. Four is also an acceptable number of variables for this small dataset of 50 data points. Without more data points, it was challenging to split the data into training, validation, and test sets to address potential overfitting. We also aimed to keep transformations and alterations of outliers to a minimum for ease of interpretability. Some of the assumptions of the classical linear model are violated, specifically the IID and homoscedasticity assumptions. As the data is fundamentally not set up to answer this question in a clean manner, all results from this study should be taken with caution. Furthermore, as explained in the omitted variables section and as the clear complexity of the study’s topic suggests, there are factors at play that are not accounted for in our models.
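
To make the trade-off between explained variance and model size concrete, the sketch below computes an adjusted R-squared from the figures quoted above. It rests on two assumptions that the text does not confirm: that 56.8% is the unadjusted R-squared, and that the model has four predictors plus an intercept fitted on 50 observations. The function name `adjusted_r_squared` is introduced only for this illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Assumed inputs: unadjusted R-squared of 0.568, 50 states, 4 predictors.
adj = adjusted_r_squared(0.568, n_obs=50, n_predictors=4)
print(round(adj, 4))
```

If those assumptions hold, the penalty for four predictors on 50 observations is modest, which fits the view that the second specification is not over-parameterized for a dataset of this size.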

The findings of this study provide insight into specific types of mobility that are noteworthy to consider during a pandemic, as well as key contextual factors at play, such as state population density and face mask policies. Our findings suggest that mobility should be of greatest concern in states with a high population density. This is especially noteworthy for state governments in the event of a future pandemic, as certain state regulations may make more sense depending on contextual factors such as state population density. Future work with a dataset that captures not just mobility to specific locations but also the total mobility of individuals within each state would help to describe this relationship further. Additionally, conducting a time-series study spanning the entire pandemic may provide more detailed results that would help policymakers and everyday citizens make informed decisions.

7 References

COVID-19 Cases (https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker)
Google Mobility Dataset (https://www.google.com/covid19/mobility/)
State Policy Dataset (https://docs.google.com/spreadsheets/d/1zu9qEWI8PsOI_i8nI_S29HDGHlIp2lfVMsGxpQ5tvAQ/edit#gid=973655443)
Political Dataset (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX)

Lab 2: Regression to Study the Spread of Covid-19

Patrick Old, Dicky Woo, Rui Li