Start by loading the COVID Tracking Project (CTP) dataset (states daily) from the API. Documentation can be found on the website. https://covidtracking.com/data/api
I’ll also load the libraries we’ll need for the project.
# load libraries
library(tidyverse)
library(jsonlite)
library(lubridate)
library(pracma)
library(tidymodels)
library(naniar)
# load and view
ctp <- fromJSON("https://api.covidtracking.com/v1/states/daily.json")
names(ctp)
## [1] "date" "state"
## [3] "positive" "probableCases"
## [5] "negative" "pending"
## [7] "totalTestResultsSource" "totalTestResults"
## [9] "hospitalizedCurrently" "hospitalizedCumulative"
## [11] "inIcuCurrently" "inIcuCumulative"
## [13] "onVentilatorCurrently" "onVentilatorCumulative"
## [15] "recovered" "dataQualityGrade"
## [17] "lastUpdateEt" "dateModified"
## [19] "checkTimeEt" "death"
## [21] "hospitalized" "dateChecked"
## [23] "totalTestsViral" "positiveTestsViral"
## [25] "negativeTestsViral" "positiveCasesViral"
## [27] "deathConfirmed" "deathProbable"
## [29] "totalTestEncountersViral" "totalTestsPeopleViral"
## [31] "totalTestsAntibody" "positiveTestsAntibody"
## [33] "negativeTestsAntibody" "totalTestsPeopleAntibody"
## [35] "positiveTestsPeopleAntibody" "negativeTestsPeopleAntibody"
## [37] "totalTestsPeopleAntigen" "positiveTestsPeopleAntigen"
## [39] "totalTestsAntigen" "positiveTestsAntigen"
## [41] "fips" "positiveIncrease"
## [43] "negativeIncrease" "total"
## [45] "totalTestResultsIncrease" "posNeg"
## [47] "deathIncrease" "hospitalizedIncrease"
## [49] "hash" "commercialScore"
## [51] "negativeRegularScore" "negativeScore"
## [53] "positiveScore" "score"
## [55] "grade"
There’s a lot of columns here, but we don’t need all of them. I’m going to focus on new positive cases, new tests, new deaths, and number of people currently hospitalized – along with the state and date variables.
Before selecting these, I’m just going to confirm that state and date uniquely identify the observations in the data.
# check uniqueness of state and date
if (nrow(ctp %>% distinct(state, date)) / nrow(ctp) != 1) {
print("Go find your duplicates")
} else{
print("There's one observation for each state and date.")
}
## [1] "There's one observation for each state and date."
Now just pull in the six variables we need. I’m also going to do some simple renaming and date formatting to prepare for analysis. I’ll also check whether there are negative values. If there are negative values, I’ll remove them.
# get only the variables that we need
ctp <- ctp %>%
select(state, date, totalTestResultsIncrease, positiveIncrease, deathIncrease, hospitalizedCurrently) %>%
# rename variables to make them a bit shorter and easier
rename(cases_new = positiveIncrease,
tests_new = totalTestResultsIncrease,
deaths_new = deathIncrease,
hosp_current = hospitalizedCurrently) %>%
# make sure date is in date format
mutate(date = ymd(date))
knitr::kable(head(ctp))
state | date | tests_new | cases_new | deaths_new | hosp_current |
---|---|---|---|---|---|
AK | 2020-12-06 | 10545 | 757 | 0 | 164 |
AL | 2020-12-06 | 7880 | 2288 | 12 | 1927 |
AR | 2020-12-06 | 14704 | 1542 | 40 | 1076 |
AS | 2020-12-06 | 0 | 0 | 0 | NA |
AZ | 2020-12-06 | 20586 | 5376 | 25 | 2977 |
CA | 2020-12-06 | 293071 | 30075 | 85 | 10624 |
# check for negative values
nrow(ctp %>%
filter(cases_new < 0 | deaths_new < 0 | tests_new < 0))
## [1] 119
# remove negative values
ctp <- ctp %>%
replace_with_na_if(is.numeric, ~.x < 0)
# check again for negative values
nrow(ctp %>%
filter(cases_new < 0 | deaths_new < 0 | tests_new < 0))
## [1] 0
We’re also going to load data on states’ populations, which is useful for normalizing results at the state level. In general, we’re going to focus on cases, deaths, etc. per unit of population
# get data from CTP GitHub page
state_pop <- read_csv("https://raw.githubusercontent.com/COVID19Tracking/associated-data/master/us_census_data/us_census_2018_population_estimates_states.csv") %>%
select(state, population)
# join to our main dataset
ctp <- ctp %>%
left_join(state_pop, by = "state")
knitr::kable(head(ctp))
state | date | tests_new | cases_new | deaths_new | hosp_current | population |
---|---|---|---|---|---|---|
AK | 2020-12-06 | 10545 | 757 | 0 | 164 | 737438 |
AL | 2020-12-06 | 7880 | 2288 | 12 | 1927 | 4887871 |
AR | 2020-12-06 | 14704 | 1542 | 40 | 1076 | 3013825 |
AS | 2020-12-06 | 0 | 0 | 0 | NA | NA |
AZ | 2020-12-06 | 20586 | 5376 | 25 | 2977 | 7171646 |
CA | 2020-12-06 | 293071 | 30075 | 85 | 10624 | 39557045 |
Before diving in to modeling, we need to get a sense of what the data looks like. We’ll use the ggplot2
package for data visualization to explore how trends in new cases, deaths, tests, and hospitalizations have varied over time. First, we’ll look at a national level, using group_by
and summarize
functions to get a single observation for the US on each date that is the sum of values in individual states and territories.
# first, need to get data to a national level
national_ctp <- ctp %>%
# group by date since we want one row for each date
group_by(date) %>%
# sum our main columns to get country-level variables
summarize_if(is.numeric, sum, na.rm = TRUE) %>%
mutate(entity = "United States")
# I'll be using `tail` from now on to look at the data, because the data is
# better populated more recently than it was in March when CTP began
knitr::kable(tail(national_ctp))
date | tests_new | cases_new | deaths_new | hosp_current | population | entity |
---|---|---|---|---|---|---|
2020-12-01 | 2340996 | 176753 | 2473 | 98777 | 330362587 | United States |
2020-12-02 | 1470464 | 195796 | 2733 | 100322 | 330362587 | United States |
2020-12-03 | 1828230 | 210204 | 2706 | 100755 | 330362587 | United States |
2020-12-04 | 1854869 | 224831 | 2563 | 101276 | 330362587 | United States |
2020-12-05 | 2169756 | 211073 | 2445 | 101190 | 330362587 | United States |
2020-12-06 | 1634532 | 176771 | 1138 | 101487 | 330362587 | United States |
Now that we have a national level dataset, let’s begin plotting. Start with a simple bar plot of cases and then deaths and then all metrics together
# cases plot
cases.plot <- national_ctp %>%
ggplot(aes(x = date, y = cases_new)) +
geom_col(color = "orange") +
labs(title = "New COVID-19 Cases Reported by Date in the United States")
cases.plot
# deaths plot
deaths.plot <- national_ctp %>%
ggplot(aes(x = date, y = deaths_new)) +
geom_col(color = "gray") +
labs(title = "New COVID-19 Deaths Reported by Date in the United States")
deaths.plot
# plot all metrics together by reshaping the data to long format
all_metrics.plot <- national_ctp %>%
pivot_longer(cols = -c(date, entity, population),
values_to = "value",
names_to = "metric") %>%
ggplot(aes(x = date, y = value, color = metric)) +
geom_col() +
facet_wrap(~metric, scales = "free") +
theme(legend.position = "none") +
labs(title = "COVID-19 Metrics by Date in the United States")
all_metrics.plot