Project Setup

Start by loading the COVID Tracking Project (CTP) dataset (states daily) from the API. Documentation can be found on the website. https://covidtracking.com/data/api

I’ll also load the libraries we’ll need for the project.

# load libraries
library(tidyverse)
library(jsonlite)
library(lubridate)
library(pracma)
library(tidymodels)
library(naniar)
# load and view
ctp <- fromJSON("https://api.covidtracking.com/v1/states/daily.json")
names(ctp)
##  [1] "date"                        "state"                      
##  [3] "positive"                    "probableCases"              
##  [5] "negative"                    "pending"                    
##  [7] "totalTestResultsSource"      "totalTestResults"           
##  [9] "hospitalizedCurrently"       "hospitalizedCumulative"     
## [11] "inIcuCurrently"              "inIcuCumulative"            
## [13] "onVentilatorCurrently"       "onVentilatorCumulative"     
## [15] "recovered"                   "dataQualityGrade"           
## [17] "lastUpdateEt"                "dateModified"               
## [19] "checkTimeEt"                 "death"                      
## [21] "hospitalized"                "dateChecked"                
## [23] "totalTestsViral"             "positiveTestsViral"         
## [25] "negativeTestsViral"          "positiveCasesViral"         
## [27] "deathConfirmed"              "deathProbable"              
## [29] "totalTestEncountersViral"    "totalTestsPeopleViral"      
## [31] "totalTestsAntibody"          "positiveTestsAntibody"      
## [33] "negativeTestsAntibody"       "totalTestsPeopleAntibody"   
## [35] "positiveTestsPeopleAntibody" "negativeTestsPeopleAntibody"
## [37] "totalTestsPeopleAntigen"     "positiveTestsPeopleAntigen" 
## [39] "totalTestsAntigen"           "positiveTestsAntigen"       
## [41] "fips"                        "positiveIncrease"           
## [43] "negativeIncrease"            "total"                      
## [45] "totalTestResultsIncrease"    "posNeg"                     
## [47] "deathIncrease"               "hospitalizedIncrease"       
## [49] "hash"                        "commercialScore"            
## [51] "negativeRegularScore"        "negativeScore"              
## [53] "positiveScore"               "score"                      
## [55] "grade"

Data Preparation

There’s a lot of columns here, but we don’t need all of them. I’m going to focus on new positive cases, new tests, new deaths, and number of people currently hospitalized – along with the state and date variables.

Before selecting these, I’m just going to confirm that state and date uniquely identify the observations in the data.

# check uniqueness of state and date
if (nrow(ctp %>% distinct(state, date)) / nrow(ctp) != 1) {
  print("Go find your duplicates")
} else{
    print("There's one observation for each state and date.")
}
## [1] "There's one observation for each state and date."

Now just pull in the six variables we need. I’m also going to do some simple renaming and date formatting to prepare for analysis. I’ll also check whether there are negative values. If there are negative values, I’ll remove them.

# get only the variables that we need
ctp <- ctp %>%
  select(state, date, totalTestResultsIncrease, positiveIncrease, deathIncrease, hospitalizedCurrently) %>%
  
  # rename variables to make them a bit shorter and easier
  rename(cases_new = positiveIncrease,
         tests_new = totalTestResultsIncrease,
         deaths_new = deathIncrease,
         hosp_current = hospitalizedCurrently) %>%
  
  # make sure date is in date format
  mutate(date = ymd(date))

knitr::kable(head(ctp))
state date tests_new cases_new deaths_new hosp_current
AK 2020-12-06 10545 757 0 164
AL 2020-12-06 7880 2288 12 1927
AR 2020-12-06 14704 1542 40 1076
AS 2020-12-06 0 0 0 NA
AZ 2020-12-06 20586 5376 25 2977
CA 2020-12-06 293071 30075 85 10624
# check for negative values
nrow(ctp %>%
  filter(cases_new < 0 | deaths_new < 0 | tests_new < 0))
## [1] 119
# remove negative values
ctp <- ctp %>%
  replace_with_na_if(is.numeric, ~.x < 0)

# check again for negative values
nrow(ctp %>%
  filter(cases_new < 0 | deaths_new < 0 | tests_new < 0))
## [1] 0

We’re also going to load data on states’ populations, which is useful for normalizing results at the state level. In general, we’re going to focus on cases, deaths, etc. per unit of population

# get data from CTP GitHub page
state_pop <- read_csv("https://raw.githubusercontent.com/COVID19Tracking/associated-data/master/us_census_data/us_census_2018_population_estimates_states.csv") %>%
  select(state, population)

# join to our main dataset
ctp <- ctp %>%
  left_join(state_pop, by = "state")

knitr::kable(head(ctp))
state date tests_new cases_new deaths_new hosp_current population
AK 2020-12-06 10545 757 0 164 737438
AL 2020-12-06 7880 2288 12 1927 4887871
AR 2020-12-06 14704 1542 40 1076 3013825
AS 2020-12-06 0 0 0 NA NA
AZ 2020-12-06 20586 5376 25 2977 7171646
CA 2020-12-06 293071 30075 85 10624 39557045

Exploratory Analysis

Before diving in to modeling, we need to get a sense of what the data looks like. We’ll use the ggplot2 package for data visualization to explore how trends in new cases, deaths, tests, and hospitalizations have varied over time. First, we’ll look at a national level, using group_by and summarize functions to get a single observation for the US on each date that is the sum of values in individual states and territories.

# first, need to get data to a national level
national_ctp <- ctp %>%
  
  # group by date since we want one row for each date
  group_by(date) %>%
  
  # sum our main columns to get country-level variables
  summarize_if(is.numeric, sum, na.rm = TRUE) %>%
  mutate(entity = "United States")

# I'll be using `tail` from now on to look at the data, because the data is 
# better populated more recently than it was in March when CTP began
knitr::kable(tail(national_ctp))
date tests_new cases_new deaths_new hosp_current population entity
2020-12-01 2340996 176753 2473 98777 330362587 United States
2020-12-02 1470464 195796 2733 100322 330362587 United States
2020-12-03 1828230 210204 2706 100755 330362587 United States
2020-12-04 1854869 224831 2563 101276 330362587 United States
2020-12-05 2169756 211073 2445 101190 330362587 United States
2020-12-06 1634532 176771 1138 101487 330362587 United States

Now that we have a national level dataset, let’s begin plotting. Start with a simple bar plot of cases and then deaths and then all metrics together

# cases plot
cases.plot <- national_ctp %>%
  ggplot(aes(x = date, y = cases_new)) +
  geom_col(color = "orange") +
  labs(title = "New COVID-19 Cases Reported by Date in the United States")
cases.plot

# deaths plot
deaths.plot <- national_ctp %>%
  ggplot(aes(x = date, y = deaths_new)) +
  geom_col(color = "gray") +
  labs(title = "New COVID-19 Deaths Reported by Date in the United States")
deaths.plot

# plot all metrics together by reshaping the data to long format
all_metrics.plot <- national_ctp %>%
  pivot_longer(cols = -c(date, entity, population),
               values_to = "value",
               names_to = "metric") %>%
  ggplot(aes(x = date, y = value, color = metric)) +
  geom_col() +
  facet_wrap(~metric, scales = "free") +
  theme(legend.position = "none") +
  labs(title = "COVID-19 Metrics by Date in the United States")
all_metrics.plot