1 Introduction
Incomplete reporting of epidemiological data at recent times can result in case count data that is right-truncated. Right-truncated case counts can be misleading to interpret at face-value, as they will typically show a decline in the number of reported observations in the most recent time points. These are the time points where the highest proportion of the data has yet to be observed in the dataset.
The imputation of the cases that will eventually be observed up until the current time is referred to as a nowcast.
A number of methods have been developed to nowcast epidemiological case count data.
The purpose of baselinenowcast
is to provide a nowcast computed directly from
the most recent observations to estimate a delay distribution empirically, and
apply that to the partially observed data to generate a nowcast.
In the below section, we will describe an example of a nowcasting problem, and
demonstrate how to use baselinenowcast
to estimate a delay distribution from
the data and apply that estimate to generate a probabilistic nowcast. This
example will walk through the low-level functionality of the “default” model
permutation. In future vignettes, we will demonstrate examples of how to
create more complex model permutations.
2 Packages
As well as the baselinenowcast
package this vignette also uses epinowcast
,
ggplot2
, tidyr
, and dplyr
.
The installation of epinowcast
is not required for using the package,
however, its pre and post-processing functions provide a lot of the data
wrangling needed to set up the nowcasting problem.
We note that no components of the vignette require installing CmdStan
,
which is a downstream dependency of epinowcast
.
We will just be using the R
components of epinowcast
, which can be
installed using the example lines of code below, so there is no need to
additionally install CmdStan
.
3 Data
Nowcasting of right-truncated case counts involves the estimation of reporting delays for recently reported data. For this, we need case counts indexed both by when they were diagnosed (often called the “reference date”) and by when they were reported (i.e. when administratively recorded via public health surveillance; often called “report date”). The difference between the reference date and the report date is the reporting delay. For this quick start, we use daily level data from the Robert Koch Institute via the Germany Nowcasting hub. These data represent hospital admission counts by date of positive test and date of test report in Germany up to October 1, 2021.
4 Filtering and plotting the data
We will filter the data to just look at the national-level data, for all age
groups.
We will pretend that we are making a nowcast as of August 1, 2021, therefore
we will exclude all reference dates and report dates before that date.
germany_covid19_hosp
is provided as package data from epinowcast
Let’s start by plotting the sum of the reports at each reference date,
and then compare that to what we will eventually observe as of the latest date
in the complete dataset (data available through October 1, 2021)
The red line shows the cumulative number of confirmed admissions on each report
date, across all delays, using the data available as of August 1, 2021.
It demonstrates the characteristic behaviour of right-truncation.
This is because we have not yet observed the data that will become available
for the longer delays at recent time points.
Our task will be to estimate what the “final” cumulative number of cases at each reference date, observed as of the “fully observed” data on October 2021.
nowcast_date <- "2021-08-01"
eval_date <- "2021-10-01"
target_data <- germany_covid19_hosp[location == "DE"][age_group == "00+"] |>
enw_filter_report_dates(latest_date = eval_date) |>
enw_filter_reference_dates(
latest_date = nowcast_date
)
latest_data <- enw_latest_data(target_data)
observed_data <- enw_filter_report_dates(
target_data,
latest_date = nowcast_date
)
head(observed_data)
## reference_date location age_group confirm report_date
## <IDat> <fctr> <fctr> <int> <IDat>
## 1: 2021-04-06 DE 00+ 149 2021-04-06
## 2: 2021-04-07 DE 00+ 312 2021-04-07
## 3: 2021-04-08 DE 00+ 424 2021-04-08
## 4: 2021-04-09 DE 00+ 288 2021-04-09
## 5: 2021-04-10 DE 00+ 273 2021-04-10
## 6: 2021-04-11 DE 00+ 107 2021-04-11
obs_data_by_reference_date <- enw_latest_data(observed_data)
ggplot() +
geom_line(
data = obs_data_by_reference_date,
aes(x = reference_date, y = confirm), color = "darkred"
) +
geom_line(
data = latest_data,
aes(x = reference_date, y = confirm), color = "black"
) +
theme_bw() +
xlab("Reference date") +
ylab("Confirmed admissions") +
scale_y_continuous(trans = "log10") +
ggtitle("Comparing real-time and later observed cases")
Here the black line represents the quantity we will evaluate our nowcast against, the final observed cases, and the red line represents the observed cases we have observed up until August 1st, 2021.
5 Pre-processing
In order to compute a nowcast for this data, we will need to start by creating what we call a reporting triangle. This is a matrix where each row represents one of the time points being referenced and each column represents the delay, starting from 0 up until the maximum delay. The entries represent the number of new cases assigned to that reference time point with a particular delay, with entries in the bottom right triangle missing as the data reported with longer delays has yet to be observed for recent reference times. The reporting triangle will be used to estimate the delay distribution, or the proportion of the final number of cases reported on a particular delay. Since this data is both reported and referenced daily, we will use the time scale of days to create the reporting triangle, but the delay and the reference date can have any temporal granularity.
In this example, we will both fit our delay distribution, and apply it to generate a nowcast matrix using the same data, the national level data from Germany for all age groups.
In the below sections, we will specify the maximum delay, the number of reference times to use for delay estimation, and the number of reference times to use for uncertainty quantification. We will use 3 times the maximum delay for the total training volume, with the latest 50% of the reference times used for delay estimation. For uncertainty estimation, we will generate retrospective nowcast datasets with the latest 50% of the reference times. This means that in order to estimate the delay using the same amount of data, the oldest retrospective nowcast dataset will use the first 50% of the reference times.
We recommend choosing the maximum delay and number of historical observations based on an exploratory data analysis, as these specifications will change significantly depending on the dataset.
# Specify the maximum delay, which will determine the length of your delay
# distribution. Empirical data outside this delay window will not be used for
# training.
max_delay <- 30
n_training_volume <- 3 * max_delay
# Specify the number of reference times to use to estimate the delay
# distribution. Note this assumes you want the most recent observations.
n_history_delay <- 0.5 * n_training_volume
# Specify the number of retrospective nowcast datasets
# to use for uncertainty estimation.
n_retrospective_nowcasts <- 0.5 * n_training_volume
Next we will use the epinowcast
function, enw_preprocess_data()
and the
data in the form of a long tidy dataframe indexed by reference date and
report date and filtered to the strata we are interested in, to generate a
reporting triangle.
training_data <- enw_filter_reference_dates(
observed_data,
include_days = n_training_volume - 1
)
latest_training_data <- enw_latest_data(training_data)
target_data <- enw_filter_reference_dates(
latest_data,
include_days = n_training_volume - 1
)
# Get the reporting triangle, adding an additional day because epinowcast
# we want the max_delay + 1 entries since 0 is a valid delay.
# This also validates that the data is in the correct format and
# runs preprocessing see ?enw_preprocess_data for more details
pobs <- enw_preprocess_data(
obs = training_data,
max_delay = max_delay + 1
)
# as we only have one group here we only need reference_date, delay,
# and new_confirm
reporting_triangle_df <- select(
pobs$new_confirm[[1]],
reference_date,
delay,
new_confirm
)
# we now pivot to wide format, dropping the reference_date column, and
# convert to a matrix
# this is the format that baselinenowcast expects
reporting_triangle <- reporting_triangle_df |>
pivot_wider(names_from = delay, values_from = new_confirm) |>
select(-reference_date) |>
as.matrix()
To see what this looks like, we can plot it using ggplot()
.
In this figure, the grey indicates matrix elements that are NA
, which we
would expect to be the case in the bottom right portion of the reporting
triangle where the counts have yet to be observed.
triangle_df <- as.data.frame(reporting_triangle) |>
mutate(time = row_number()) |>
pivot_longer(!time,
values_to = "count",
names_prefix = "V",
names_to = "delay"
) |>
mutate(delay = as.numeric(delay))
ggplot(
triangle_df,
aes(x = delay, y = time, fill = count)
) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Reporting triangle", x = "Delay", y = "Time") +
theme_bw() +
scale_y_reverse()
6 Estimate delay
Now that we have the reporting triangle, we are now ready to pass it in to the
baselinenowcast
package to estimate the delay distribution.
We will tell the function the maximum delay and the number of observations we
want to use, though the default will be to use the whole reporting triangle.
If the reporting triangle is too small for the user-specified delays and number
of training observations, the function will throw an error.
We only want to pass in the reporting triangle (for a single group!) to this
function.
If reference date are repeated because the reporting triangle contains multiple
strata, the function will throw an error.
The get_delay_estimate()
function expects the following inputs:
-
reporting_triangle
: a matrix with the reporting triangle for a single strata. Here the rows represent the time points and the columns represent the observed delays, starting at 0. This can also be a reporting matrix or an incomplete reporting matrix (so not all elements of the bottom right triangle need to be missing). -
max_delay
: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of rows intriangle
minus 1, since we assumetriangle
is indexed at 0. -
n
: an integer indicating the number of observations by reference date to use to fit the delay distribution. This must be less than or equal to the number of rows intriangle
. Here we will usen_history_delay
.
delay_pmf <- get_delay_estimate(
reporting_triangle = reporting_triangle,
max_delay = max_delay,
n = n_history_delay
)
delay_df <- data.frame(
delay = 0:(length(delay_pmf) - 1),
pmf = delay_pmf
)
ggplot(delay_df) +
geom_line(aes(x = delay, y = cumsum(pmf))) +
xlab("Delay") +
ylab("Cumulative proportion reported") +
ggtitle("Empirical point estimate of cumulative proportion reported by delay") + # nolint
theme_bw()
ggplot(delay_df) +
geom_line(aes(x = delay, y = pmf)) +
xlab("Delay") +
ylab("Proportion reported") +
ggtitle("Empirical point estimate of proportion reported by delay") +
theme_bw()
7 Apply the delay to generate a point nowcast
The next step in our workflow is to take the estimated delay distribution and
apply it to the partially observed reporting triangle, generating an estimate
of the number of new cases confirmed at each reference date and delay.
This will generate a point estimate of what we can call the reporting square,
which is the complete set of reference dates and delays.
In this case, we will be applying the delay to the same reporting triangle we
used to generate the estimate, but this doesn’t always have to be the case.
The reporting triangle we are applying it to must have the same max_delay
as the delay estimate.
It is worth noting that we could also have estimated the delay and applied it
in one single step by calling generate_pt_nowcast_mat()
. In subsequent steps
to estimate the uncertainty, both delay estimation and generating a point
nowcast matrix happen in a single step.
point_nowcast_matrix <- apply_delay(
rep_tri_to_nowcast = reporting_triangle,
delay_pmf = delay_pmf
)
We’ll make a quick plot to compare the nowcasted confirmed cases through August 1, 2021, from the observations up until October 1, 2021. We’ll compare this to the right-truncated data available up until August 1, 2021.
point_nowcast_df <- target_data |>
mutate(nowcast = rowSums(point_nowcast_matrix))
prep_latest_data <- latest_training_data |>
mutate(type = "Real-time data") |>
select(type, reference_date, count = confirm)
# Combine data into a single dataframe for plotting
plot_data <- point_nowcast_df |>
pivot_longer(
cols = c(confirm, nowcast),
names_to = "type",
values_to = "count"
) |>
mutate(type = case_when(
type == "confirm" ~ "Final observed data",
type == "nowcast" ~ "Point nowcast",
TRUE ~ type
)) |>
bind_rows(prep_latest_data)
# Create plot with data type as a variable
ggplot(plot_data, aes(x = reference_date, y = count, color = type)) +
geom_line() +
scale_color_manual(values = c(
"Real-time data" = "darkred",
"Final observed data" = "black",
"Point nowcast" = "darkblue"
)) +
theme_bw() +
xlab("Reference date") +
ylab("Confirmed admissions") +
scale_y_continuous(trans = "log10") +
ggtitle("Comparing real-time, nowcasted, and later observed cases") +
theme(legend.position = "bottom") +
labs(color = "Type")
Here we can see that our point nowcast slightly underestimates what was eventually reported (black line), but does a decent overall job of correcting for the right-truncation observed in the red line (the data before the nowcast).
8 Estimate uncertainty
So far, we’ve demonstrated how to generate a point estimate of a nowcast. We would like to generate probabilistic nowcasts. To do so, we will use the error between the predicted components of retrospectively generated nowcasts (aggregated across the delays) and the corresponding observations that we have already observed. We will assume that the observations follow a negative binomial observation model, and independently estimate the dispersion in the negative binomial at each forecast horizon.
The method used to estimate the uncertainty works by generating retrospective reporting triangles using what would have been available as of each retrospective nowcast time to estimate a delay distribution and generate a point nowcast matrix.
We repeat this process for n_retrospective_nowcasts
reference times in the
current reporting triangle, starting from the latest reference
time and working backwards, ultimately using all n_retrospective_nowcasts
and n_history_delay
reference times.
trunc_rep_tri_list <- truncate_triangles(reporting_triangle,
n = n_retrospective_nowcasts
)
retro_rep_tri_list <- generate_triangles(trunc_rep_tri_list)
The generate_triangles()
function returns a list of
reporting triangles, in order from the most to least
recent, starting from the most recent. It is not filtered to exclude any
rows that may not ultimately be used to generate the nowcast, which means
that each triangle will have a different number of rows.
Next we will pass this list of reporting triangles to the
generate_pt_nowcast_mat_list()
. In this function, we can also specify
n
, the number of reference times to be used to estimate the delay for each
nowcast, which we will set as the n_history_delay
previous specified.
retro_pt_nowcast_mat_list <- generate_pt_nowcast_mat_list(
reporting_triangle_list = retro_rep_tri_list,
n = n_history_delay
)
Next, we will use the retrospective reporting triangles, the point nowcasted reporting triangles, and the truncated reporting triangles to estimate the uncertainty at each horizon, starting at horizon 0.
disp_params <- estimate_dispersion(
pt_nowcast_mat_list = retro_pt_nowcast_mat_list,
trunc_rep_tri_list = trunc_rep_tri_list,
reporting_triangle_list = retro_rep_tri_list,
n = n_retrospective_nowcasts
)
9 Generate probabilistic nowcast
Now that we have estimated the dispersion, we can generate a probabilistic
nowcast. We can use the get_nowcast_draws()
which:
- generates draws from the nowcast distribution
- combines the draws with the observed data to form a single draw of the nowcast
- repeats this process for
draws
draws
nowcast_draws_df <- get_nowcast_draws(
point_nowcast_matrix, reporting_triangle,
dispersion = disp_params,
draws = 100
)
head(nowcast_draws_df)
## pred_count time draw
## 1 736 1 1
## 2 897 2 1
## 3 893 3 1
## 4 804 4 1
## 5 722 5 1
## 6 492 6 1
As we can see, the output dataframe from get_nowcast_draws()
contains:
-
time
: The reference time index -
draw
: The draw number -
pred_count
: The predicted final count for this reference time and draw (combining observations and predictions when present)
10 Visualizing the nowcast
Let’s visualize the nowcast compared to the final observed data. We first need to link our nowcast with the original data so we can see our nowcast by reference date.
# Prepare the datasets for joining
latest_data_prepped <- latest_training_data |>
mutate(time = row_number()) |>
rename(obs_confirm = confirm) |>
mutate(reference_date = as.Date(reference_date))
final_data_prepped <- target_data |>
select(reference_date, final_confirm = confirm) |>
mutate(reference_date = as.Date(reference_date))
# Join the datasets
obs_with_nowcast_draws_df <- nowcast_draws_df |>
left_join(latest_data_prepped, by = "time") |>
left_join(final_data_prepped, by = "reference_date")
head(obs_with_nowcast_draws_df)
## pred_count time draw reference_date location age_group obs_confirm
## 1 736 1 1 2021-05-04 DE 00+ 823
## 2 897 2 1 2021-05-05 DE 00+ 1028
## 3 893 3 1 2021-05-06 DE 00+ 1016
## 4 804 4 1 2021-05-07 DE 00+ 892
## 5 722 5 1 2021-05-08 DE 00+ 822
## 6 492 6 1 2021-05-09 DE 00+ 561
## report_date final_confirm
## 1 2021-07-24 823
## 2 2021-07-25 1028
## 3 2021-07-26 1016
## 4 2021-07-27 892
## 5 2021-07-28 822
## 6 2021-07-29 561
# Create a separate dataframe for observed and final data
combined_data <- obs_with_nowcast_draws_df |>
select(reference_date, obs_confirm, final_confirm) |>
distinct() |>
pivot_longer(
cols = c(obs_confirm, final_confirm),
names_to = "type",
values_to = "count"
) |>
mutate(type = case_when(
type == "obs_confirm" ~ "Observed data",
type == "final_confirm" ~ "Final observed data"
))
# Plot with draws for nowcast only
ggplot() +
# Add nowcast draws as thin gray lines
geom_line(
data = obs_with_nowcast_draws_df,
aes(
x = reference_date, y = pred_count, group = draw,
color = "Nowcast draw", linewidth = "Nowcast draw"
)
) +
# Add observed data and final data once
geom_line(
data = combined_data,
aes(
x = reference_date,
y = count,
color = type,
linewidth = type
)
) +
theme_bw() +
scale_color_manual(
values = c(
"Nowcast draw" = "gray",
"Observed data" = "darkred",
"Final observed data" = "black"
),
name = ""
) +
scale_linewidth_manual(
values = c(
"Nowcast draw" = 0.2,
"Observed data" = 1,
"Final observed data" = 1
),
name = ""
) +
scale_y_continuous(trans = "log10") +
xlab("Reference date") +
ylab("Hospital admissions") +
theme(legend.position = "bottom") +
ggtitle("Comparison of admissions as of the nowcast date, later observed counts, \n and probabilistic nowcasted counts") # nolint