TOC
In a previous article, I mentioned the need to find mentors to enhance your ability to create clear, concise, and meaningful data visualizations. The New York Times is one example I highly recommend you follow. They do an excellent job of creating information rich charts that are understandable to a wide variety of people. One example is their COVID 19 tacker.
You may be thinking, sure, that chart looks great, but I have no idea how to create something like this. Well, you are in luck. I am going recreate their COVID tracker using R
and ggplot2
. Since I am using ggplot2
, it will not be interactive like their chart, but, the interaction feature on this particular chart does not add much value and depending on the resources available to you, it is often easier to distribute static charts to your consumers, especially if your work is required to go into a PowerPoint.
We will start by loading the libraries needed for this project.
library(dplyr)
library(readr)
library(ggplot2)
Import Data
Next, we need to grab the data. Luckily, the NYT posts their COVID data on github.
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
df_states <- readr::read_csv(url)
glimpse(df_states)
## Observations: 38,702
## Variables: 5
## $ date <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-24, 2…
## $ state <chr> "Washington", "Washington", "Washington", "Illinois", "Washing…
## $ fips <chr> "53", "53", "53", "17", "53", "06", "17", "53", "04", "06", "1…
## $ cases <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,…
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Transform the Data
This data set has the cumulative cases and deaths on a given day for each state. To simplify the process, we will start with one state: New York.
df_ny_raw <- df_states %>%
filter(state == "New York")
glimpse(df_ny_raw)
## Observations: 703
## Variables: 5
## $ date <date> 2020-03-01, 2020-03-02, 2020-03-03, 2020-03-04, 2020-03-05, 2…
## $ state <chr> "New York", "New York", "New York", "New York", "New York", "N…
## $ fips <chr> "36", "36", "36", "36", "36", "36", "36", "36", "36", "36", "3…
## $ cases <dbl> 1, 1, 2, 11, 22, 44, 89, 106, 142, 173, 217, 326, 421, 610, 73…
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 6, 10, 18, 32, 39, 6…
As you can see below, the cases
column has the cumulative cases up to that date.
df_ny_raw %>%
ggplot(aes(x = date, y = cases)) +
geom_line()
In order to get the case count for each day, we need to take the difference between the current date and the previous date. We can do this using the lag()
function from dplyr
.
df_ny = df_ny_raw %>%
mutate(cases = cases - lag(cases,
default = 0))
df_ny %>%
ggplot(aes(x = date, y = cases)) +
geom_line()
We now have cases-per-day, but our line is not as smooth as the NYT chart above. The NYT uses a smoothed 7-day average in order to minimize the day-to-day variability and highlight the longer-term trends.
To create the 7-day average, we will sum the case counts from the current day and the previous 6 days and divide by 7.
df_ny_smooth = df_ny %>%
mutate(smooth = sum(cases,
lag(cases, 1), lag(cases, 2),
lag(cases, 3), lag(cases, 4),
lag(cases, 5), lag(cases, 6))
/7)
If we want to save some typing and make it easy to repeat this process, we could write a custom function to complete the same goal.
rolling_average <- function(x, period = 7){
total = x
for(i in 1:period-1){
total = total + lag(x,i)
}
return(total/period)
}
df_ny_smooth = df_ny %>%
mutate(smooth = rolling_average(cases))
df_ny_smooth
## # A tibble: 703 x 6
## date state fips cases deaths smooth
## <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-03-01 New York 36 1 0 NA
## 2 2020-03-02 New York 36 0 0 NA
## 3 2020-03-03 New York 36 1 0 NA
## 4 2020-03-04 New York 36 9 0 NA
## 5 2020-03-05 New York 36 11 0 NA
## 6 2020-03-06 New York 36 22 0 NA
## 7 2020-03-07 New York 36 45 0 19.1
## 8 2020-03-08 New York 36 17 0 17.4
## 9 2020-03-09 New York 36 36 0 25.3
## 10 2020-03-10 New York 36 31 0 28.9
## # … with 693 more rows
Now that the data is ready, we will build the chart in layers.
Basic Chart
plot_nyt <- df_ny_smooth %>%
filter(!is.na(smooth)) %>%
ggplot(aes(x = date, y = smooth)) +
geom_line(color = "red") +
geom_area(fill = "red", alpha = .25)
plot_nyt
Update the Theme
plot_nyt <- plot_nyt +
theme(
panel.background = element_blank(),
axis.text.y = element_text(angle = 0, vjust = -.5,
margin = margin(r = -30)),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour = 'light grey',
linetype = 'dashed',size = .35),
axis.ticks.x = element_line(color = "light grey"),
axis.ticks.y = element_blank(),
axis.line.x = element_line(colour = "light grey", linetype = "solid",
size = .5))
plot_nyt
Adjust the Scales
breaks <- scales::extended_breaks()(df_ny_smooth$smooth)
breaks <- breaks[2:length(breaks)]
plot_nyt <- plot_nyt +
scale_x_date(expand = c(0,0),
date_labels = "%b-%Y") +
scale_y_continuous(expand = c(0,0),
breaks = breaks,
limits = c(0,max(df_ny_smooth$smooth))) +
labs(y = NULL,
x = NULL,
title = "New Reported Cases: New York")
plot_nyt
Add Annotations
Annotations really take your chart to the next level. They are a great way to add context to your chart and tell a story. For this example, we will add text and a line segment to describe what data represents. Annotating the smoothed line directly, as opposed to adding a legend, is a great way to save space on your chart. Also, the user does not have to hunt for the information because it is dead center in the chart!
If you look closely, you will notice the NYT uses logic to put their annotation toward the middle of the chart but always on the apex of curve (look at different examples of the states to see this in action). This is a nice touch that really gives their charts that extra wow factor.
Below is a way to recreate their process. We simply take a slice from the center of the chart (120 day period) and find the maximum point on the smooth line. We will then add our annotations off this reference point.
max_date = df_ny$date %>% median() + 60
min_date = df_ny$date %>% median() - 60
label_vals = df_ny_smooth %>%
filter(date > min_date, date < max_date) %>%
arrange(desc(smooth)) %>%
top_n(1)
label_vals
## # A tibble: 1 x 6
## date state fips cases deaths smooth
## <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2021-01-12 New York 36 14791 39588 18725.
Now that we have our reference point, let’s add the annotations.
plot_nyt +
annotate("text", x = label_vals$date,
y = label_vals$smooth * 1.4,
label = "7-day\naverage",
size = 3.5,
fontface = "plain") +
annotate("segment", x = label_vals$date, xend = label_vals$date,
y = label_vals$smooth,
yend = label_vals$smooth * 1.2,
size = .25)
There you have it, a quality chart worthy of publication in the New York Times. This example should be easy to adapt to any similar time-series data you might have. Good luck!