1 Introduction
Incomplete reporting of recent epidemiological data can result in case count data that is right-truncated. Right-truncated case counts can be misleading to interpret at face value, as they will typically show a decline in the number of reported observations at the most recent time points. These are the time points where the highest proportion of the data has yet to be observed in the dataset.
The imputation of the cases that will eventually be observed up until the current time is referred to as a nowcast.
A number of methods have been developed to nowcast epidemiological case count data. The purpose of baselinenowcast is to provide a nowcast computed directly from the data: it uses the most recent observations to estimate a delay distribution empirically, and applies that distribution to the partially observed data to generate a nowcast. In the sections below, we describe an example of a nowcasting problem and demonstrate how to use baselinenowcast to estimate a delay distribution from the data and apply that estimate to generate a probabilistic nowcast.
2 Packages
As well as the baselinenowcast package, this vignette also uses epinowcast, ggplot2, dplyr, tidyr, and lubridate. Installing epinowcast is not required for using baselinenowcast; however, its pre- and post-processing functions provide a lot of the data wrangling needed to set up the nowcasting problem. We will only be using the R components of epinowcast, which can be installed using the example lines of code below, so there is no need to additionally install CmdStan, which is a downstream dependency of epinowcast.
# Installing epinowcast
# install.packages( #nolint
# "epinowcast", repos = "https://epinowcast.r-universe.dev" #nolint
# ) #nolint
# Load packages
library(baselinenowcast)
library(epinowcast)
library(ggplot2)
library(dplyr)
library(lubridate)
library(tidyr)
# Set seed for reproducibility
set.seed(123)
3 Data
Nowcasting of right-truncated case counts involves the estimation of reporting delays for recently reported data. For this, we need case counts indexed both by when they were diagnosed (often called “reference date”) and by when they were reported (i.e. when administratively recorded via public health surveillance; often called “report date”). The difference between the reference date and the report date is the reporting delay. For this quick start, we use daily level data from the Robert Koch Institute via the Germany Nowcasting hub. These data represent hospital admission counts by date of positive test and date of test report in Germany up to October 1, 2021.
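As a quick illustration (our own sketch, using the germany_covid19_hosp data that ships with epinowcast and is loaded in the next section), the reporting delay is simply the difference between the two date indices:
# Illustrative sketch: the reporting delay is the difference between the
# report date and the reference date (in days for this dataset)
germany_covid19_hosp |>
  mutate(delay = as.integer(report_date - reference_date)) |>
  select(reference_date, report_date, delay) |>
  head()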
4 Filtering and plotting the data
We will filter the data to just look at the national-level data, for all age groups. We will pretend that we are making a nowcast as of July 1, 2021; therefore, we will exclude all reference dates and report dates after that date. The dataset germany_covid19_hosp is provided as package data from epinowcast.
Let’s start by plotting the sum of the reports at each reference date, and then compare that to what we will eventually observe as of the final date in the complete dataset. The red line shows the cumulative number of confirmed admissions at each reference date, across all delays, using the data available as of July 1, 2021. It demonstrates the characteristic behaviour of right-truncation: we have not yet observed the data that will become available at the longer delays for the recent time points. Our task will be to estimate the “final” cumulative number of cases at each reference date, as observed in the fully observed data from October 2021.
data_long <- germany_covid19_hosp # import data from epinowcast
data_filtered <- data_long |>
enw_filter_report_dates(latest_date = "2021-07-01") |>
filter(
location == "DE", age_group == "00+",
report_date == "2021-07-01"
)
data_filtered_max <- data_long |>
filter(
location == "DE", age_group == "00+",
reference_date <= "2021-07-01"
) |>
group_by(reference_date) |>
summarise(confirm = max(confirm))
ggplot() +
geom_line(
data = data_filtered,
aes(x = reference_date, y = confirm), color = "darkred"
) +
geom_line(
data = data_filtered_max,
aes(x = reference_date, y = confirm), color = "black"
) +
theme_bw() +
xlab("Reference date") +
ylab("Confirmed admissions") +
scale_y_continuous(trans = "log10") +
ggtitle("Comparing real-time and later observed cases")
Here the black line represents the quantity we will evaluate our nowcast against (the final observed cases), and the red line represents the cases observed up until July 1, 2021.
5 Pre-processing
In order to compute a nowcast for this data, we will need to start by creating what we call a reporting triangle. This is a matrix where each row represents one of the time points being referenced and each column represents a delay, starting from 0 up until the maximum delay. The entries are the number of new cases assigned to that reference time point that were reported at a particular delay, with the entries in the bottom right triangle missing because the data reported at longer delays has yet to be observed for recent reference times. The reporting triangle will be used to estimate the delay distribution, i.e. the proportion of the final number of cases reported at each delay. Since this data is both reported and referenced daily, we will use a time scale of days to create the reporting triangle, but the delay and the reference date can have any temporal granularity.
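For intuition, here is a toy reporting triangle (purely illustrative, not drawn from the dataset): five reference times, a maximum delay of 2, and NAs in the bottom right where reports have yet to arrive.
# A toy reporting triangle (illustrative only): 5 reference times and a
# maximum delay of 2. The bottom right entries are NA because reports at
# those delays have not yet arrived for the most recent reference times.
toy_triangle <- matrix(
  c(
    10, 6, 2,
    12, 7, 3,
    14, 8, 4,
    16, 9, NA,
    18, NA, NA
  ),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, paste0("delay_", 0:2))
)
toy_triangle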
In this example, we will both fit our delay distribution, and apply it to generate a nowcast matrix using the same data, the national level data from Germany for all age groups. However, these components can be separated, so for example, we could use the national level data for all age groups to estimate a delay distribution, and then we could apply that elsewhere, for example to the data stratified by age group and location. This type of “borrowing” from another training dataset can be really useful when you have low counts or relatively sparse data, which is likely to be the case for smaller populations.
In the code below, we specify our nowcast date, the maximum delay, and the number of reference time observations we want to use to estimate the delay distribution. We recommend choosing the maximum delay and the number of historical observations based on an exploratory data analysis, as appropriate values will differ significantly across datasets.
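As an example of the kind of exploratory analysis that can inform these choices, the sketch below (our own illustration, not a package function) computes, for each reference date with ample follow-up in the national-level data, the delay by which 95% of the eventually reported admissions had arrived:
# Illustrative EDA sketch: for each reference date with ample follow-up,
# find the delay by which 95% of the eventually reported total had arrived.
delay_to_95 <- data_long |>
  filter(
    location == "DE", age_group == "00+",
    reference_date <= as.Date("2021-07-01")
  ) |>
  group_by(reference_date) |>
  filter(max(confirm) > 0) |>
  summarise(
    d95 = min(
      as.integer(report_date - reference_date)[confirm >= 0.95 * max(confirm)]
    )
  )
summary(delay_to_95$d95)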
nowcast_date <- "2021-07-01"
# Specify the maximum delay, which will determine the length of your delay
# distribution. Empirical data outside this delay window will not be used for
# training.
max_delay <- 40
# Specify the number of reference times to use to estimate the delay
# distribution. Note this assumes you want the most recent observations.
n_history_delay <- 60
Next we will use the epinowcast function enw_preprocess_data(), along with the data in the form of a long tidy dataframe (indexed by reference date and report date, and filtered to the strata we are interested in), to generate a reporting triangle.
# Note: we filter with data.table syntax here because the epinowcast
# preprocessing functions expect data.table (IDate) input, and errored
# when given the dplyr-processed long data created above.
observed_long <- data_long[location == "DE"][age_group == "00+"] |> # nolint
enw_filter_report_dates(latest_date = nowcast_date) |>
enw_filter_reference_dates(include_days = n_history_delay - 1)
head(observed_long)
## reference_date location age_group confirm report_date
## <IDat> <fctr> <fctr> <int> <IDat>
## 1: 2021-05-03 DE 00+ 107 2021-05-03
## 2: 2021-05-04 DE 00+ 240 2021-05-04
## 3: 2021-05-05 DE 00+ 245 2021-05-05
## 4: 2021-05-06 DE 00+ 259 2021-05-06
## 5: 2021-05-07 DE 00+ 263 2021-05-07
## 6: 2021-05-08 DE 00+ 189 2021-05-08
# Get the reporting triangle. We pass max_delay + 1 to epinowcast because
# we want max_delay + 1 delay entries, since 0 is a valid delay.
pobs <- enw_preprocess_data(
obs = observed_long,
max_delay = max_delay + 1
)
triangle_full <- pobs$reporting_triangle[[1]]
head(triangle_full)
## Key: <.group, reference_date>
## .group reference_date 0 1 2 3 4 5 6 7 8
## <num> <IDat> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 1 2021-05-03 107 76 45 25 23 16 5 4 6
## 2: 1 2021-05-04 240 178 60 36 33 11 5 21 19
## 3: 1 2021-05-05 245 158 70 42 22 17 36 43 34
## 4: 1 2021-05-06 259 163 69 22 16 42 57 36 10
## 5: 1 2021-05-07 263 169 46 15 55 32 27 15 25
## 6: 1 2021-05-08 189 97 34 78 42 30 23 27 9
## 9 10 11 12 13 14 15 16 17 18 19 20
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 25 1 3 7 3 5 0 6 10 5 3 1
## 2: 15 4 7 8 5 17 11 7 10 5 5 6
## 3: 17 21 10 4 24 23 35 27 12 2 1 2
## 4: 21 11 4 24 30 26 22 16 7 3 2 9
## 5: 8 8 23 26 27 11 9 0 0 6 4 0
## 6: 7 18 31 18 28 13 5 1 8 13 12 11
## 21 22 23 24 25 26 27 28 29 30 31 32
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 3 1 2 1 0 5 0 1 1 1 0 0
## 2: 2 1 5 5 5 2 1 4 7 1 1 0
## 3: 12 10 13 4 1 0 0 7 2 3 5 1
## 4: 8 8 8 5 2 1 5 1 2 4 1 1
## 5: 7 6 6 0 2 7 4 2 0 1 0 2
## 6: 6 2 0 4 5 5 3 2 0 1 1 1
## 33 34 35 36 37 38 39 40
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 2 0 0 4 1 0 3 0
## 2: 1 0 1 2 1 2 3 0
## 3: 3 2 5 2 6 2 0 0
## 4: 1 3 4 4 3 3 1 10
## 5: 3 4 2 0 2 0 4 4
## 6: 2 1 4 0 0 0 3 7
reporting_triangle <- triangle_full |>
select(-`.group`, -reference_date) |>
as.matrix() |>
unname()
To see what this looks like, we can plot it using ggplot().
triangle_df <- as.data.frame(reporting_triangle) |>
mutate(time = row_number()) |>
pivot_longer(!time,
values_to = "count",
names_prefix = "V",
names_to = "delay"
) |>
mutate(delay = as.numeric(delay))
ggplot(
triangle_df,
aes(x = delay, y = time, fill = count)
) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Reporting triangle", x = "Delay", y = "Time") +
theme_bw() +
scale_y_reverse()
It’s clear from this figure that the bottom right of the reporting triangle contains 0s as a result of the pre-processing we have done. In this case, we want to assume that we are making the nowcast as of the last reference time, which means the bottom right portion of the reporting triangle should contain NAs for the yet-to-be-observed values. To make this change, we can use the package function replace_lower_right_with_NA().
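For intuition, the masking rule can also be written by hand (an illustrative sketch, assuming daily data where the last reference time is the nowcast date; the package function is the recommended approach):
# Illustrative sketch of the masking rule: the entry for reference time t
# and delay d has not yet been observed if t + d runs past the latest
# time point T (rows are times 1..T, columns are delays 0..max_delay).
mask_lower_right <- function(mat) {
  t_max <- nrow(mat)
  for (t in seq_len(t_max)) {
    for (d in seq_len(ncol(mat))) {
      if (t + (d - 1) > t_max) mat[t, d] <- NA
    }
  }
  mat
}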
cleaned_reporting_triangle <- replace_lower_right_with_NA(reporting_triangle)
triangle_df <- as.data.frame(cleaned_reporting_triangle) |>
mutate(time = row_number()) |>
pivot_longer(!time,
values_to = "count",
names_prefix = "V",
names_to = "delay"
) |>
mutate(delay = as.numeric(delay))
ggplot(
triangle_df,
aes(x = delay, y = time, fill = count)
) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(
title = "Reporting triangle with NAs for missing values",
x = "Delay",
y = "Time"
) +
theme_bw() +
scale_y_reverse()
Now we see that the bottom right of the triangle contains NAs where the observations are missing.
6 Estimate delay
Now that we have the reporting triangle, we are ready to pass it to the baselinenowcast package to estimate the delay distribution. We will tell the function the maximum delay and the number of observations we want to use, though the default is to use the whole reporting triangle. If the reporting triangle is too small for the user-specified delays and number of training observations, the function will throw an error. We only want to pass in the reporting triangle for a single group to this function; if reference dates are repeated because the reporting triangle contains multiple strata, the function will throw an error.
The get_delay_estimate() function expects the following inputs:
- reporting_triangle: a matrix containing the reporting triangle for a single stratum, where the rows represent the time points and the columns represent the observed delays, starting at 0. This can also be a reporting matrix, or an incomplete reporting matrix (so not all elements of the bottom right triangle need to be missing).
- max_delay: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of rows in reporting_triangle minus 1, since we assume delays are indexed from 0.
- n: an integer indicating the number of observations by reference date to use to fit the delay distribution (here we pass n_history_delay). This must be less than or equal to the number of rows in reporting_triangle.
delay_pmf <- get_delay_estimate(
reporting_triangle = cleaned_reporting_triangle,
max_delay = max_delay,
n = n_history_delay
)
delay_df <- data.frame(
delay = 0:(length(delay_pmf) - 1),
pmf = delay_pmf
)
ggplot(delay_df) +
geom_line(aes(x = delay, y = cumsum(pmf))) +
xlab("Delay") +
ylab("Cumulative proportion reported") +
ggtitle("Empirical point estimate of cumulative proportion reported by delay") + # nolint
theme_bw()
ggplot(delay_df) +
geom_line(aes(x = delay, y = pmf)) +
xlab("Delay") +
ylab("Proportion reported") +
ggtitle("Empirical point estimate of proportion reported by delay") +
theme_bw()
7 Apply delay to generate point nowcast
The next step in our workflow is to take the estimated delay distribution and
apply it to the partially observed reporting triangle, generating an estimate
of the number of new cases confirmed at each reference date and delay.
This will generate a point estimate of what we can call the reporting square,
which is the complete set of reference dates and delays.
In this case, we will be applying the delay to the same reporting triangle we
used to generate the estimate, but this doesn’t always have to be the case.
The reporting triangle we are applying it to must have the same max_delay
as the delay estimate.
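For intuition, we can sketch the kind of iterative scaling that such baseline methods use (this is the general logic, not necessarily the exact implementation; see the package documentation for details). Writing \(\pi_d\) for the estimated probability of a delay of \(d\), each missing entry is imputed from the reports already available at shorter delays:
\[ \hat{x}_{t,d} = \frac{\pi_d}{\sum_{d' < d} \pi_{d'}} \sum_{d' < d} x_{t,d'}, \]
where the sum runs over the observed entries \(x_{t,d'}\) and any previously imputed values, working outward from the observed diagonal.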
point_nowcast_matrix <- apply_delay(
rep_tri_to_nowcast = cleaned_reporting_triangle,
delay_pmf = delay_pmf
)
We’ll make a quick plot comparing the nowcasted confirmed cases through July 1, 2021 to what was eventually observed in the data available as of October 1, 2021, as well as to the right-truncated data available as of July 1, 2021.
final_data <- data_long[location == "DE"][age_group == "00+"] |> # nolint
enw_filter_report_dates(latest_date = "2021-10-01") |>
enw_filter_reference_dates(
latest_date = "2021-07-01",
include_days = n_history_delay - 1
) |>
group_by(reference_date) |>
summarise(
total_confirmed = max(confirm)
) |>
mutate(nowcast = rowSums(point_nowcast_matrix))
summary_data <- observed_long |>
group_by(reference_date) |>
summarise(total_confirmed = max(confirm))
ggplot() +
geom_line(
# Plot the data summed across reporting days as of July 1, 2021
data = summary_data,
aes(x = reference_date, y = total_confirmed), color = "darkred"
) +
geom_line(
data = final_data,
aes(x = reference_date, y = total_confirmed), color = "black"
) +
geom_line(
data = final_data,
aes(x = reference_date, y = nowcast), color = "darkblue"
) +
theme_bw() +
xlab("Reference date") +
ylab("Confirmed admissions") +
scale_y_continuous(trans = "log10") +
ggtitle("Comparing real-time, nowcasted, and later observed cases")
Here we can see that our point nowcast slightly overestimates what was eventually reported (black line), but does a decent overall job of correcting for the right-truncation seen in the red line (the data available as of the nowcast date).
8 Estimate uncertainty
So far, we’ve demonstrated how to generate a point estimate of a nowcast; we would, however, like to generate probabilistic nowcasts. To do so, we will use the error between the observations we have and what we would have nowcasted retrospectively. We will assume that the observations follow a negative binomial observation model, and independently estimate the dispersion of the negative binomial at each delay \(d\).
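Concretely, this observation model means each not-yet-observed count at delay \(d\) is treated as a negative binomial draw centred on its point nowcast. A minimal sketch, with hypothetical values (this is not the package internals):
# Sketch of the assumed observation model: one draw for a single
# nowcasted cell at delay d, using hypothetical values.
mu_td <- 25 # hypothetical point nowcast for one cell
disp_d <- 5 # hypothetical negative binomial dispersion at delay d
rnbinom(n = 1, size = disp_d, mu = mu_td)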
The method used to estimate the uncertainty in the nowcast mimics this process exactly but retrospectively: it generates retrospective reporting triangles using only what would have been available at each earlier time point, and then, for each reporting triangle, estimates a delay distribution and generates a nowcast.
To do this, we will first generate truncated reporting matrices from the latest reporting triangle, then generate retrospective reporting triangles by removing the observations that would not have been observed as of each truncation point, and lastly generate nowcasts from that list of retrospective reporting triangles. The function defaults to generating as many retrospective reporting triangles as nowcasts can be generated from. In this case, it will return 18 retrospective reporting triangles, with the smallest one containing 42 rows, all of which are needed to generate a nowcast for a reporting triangle with 41 delay columns.
trunc_rep_mat_list <- truncate_triangles(
reporting_triangle = cleaned_reporting_triangle
)
retro_rep_tri_list <- generate_triangles(
trunc_rep_mat_list = trunc_rep_mat_list
)
The generate_triangles() function returns a list of reporting triangles, ordered from most to least recent. It is not filtered to exclude any rows that may not ultimately be used to generate the nowcast, which means that each triangle will have a different number of rows.
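We can quickly sanity-check this output (an illustrative check, using the counts quoted above):
# Quick sanity checks on the retrospective triangles
length(retro_rep_tri_list) # expect 18 triangles
nrow(retro_rep_tri_list[[1]]) # the most recent (largest) triangle
nrow(retro_rep_tri_list[[length(retro_rep_tri_list)]]) # smallest: 42 rows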
Next we will pass this list of reporting triangles to generate_pt_nowcast_mat_list(), which estimates a delay distribution from, and generates a point nowcast matrix for, each retrospective reporting triangle.
retro_pt_nowcast_mat_list <- generate_pt_nowcast_mat_list(
reporting_triangle_list = retro_rep_tri_list
)
Next, we will use the list of retrospective point nowcast matrices, together with the corresponding truncated reporting matrices, to estimate the dispersion at each delay \(d\).
disp_params <- estimate_dispersion(
pt_nowcast_mat_list = retro_pt_nowcast_mat_list,
trunc_rep_mat_list = trunc_rep_mat_list
)
9 Generate probabilistic nowcast
Now that we have estimated the dispersion, we take the point nowcast matrix we previously generated, point_nowcast_matrix, and generate many draws of the expected observed values to get a probabilistic nowcast.
nowcast_matrix_list <- add_uncertainty(
point_nowcast_matrix = point_nowcast_matrix,
disp = disp_params,
n_draws = 1000
)
Next, we will take the list of nowcast matrices and convert it into a long tidy dataframe, indexed by reference time t, delay d, and draw.
nowcast_draws <- nowcast_matrix_list_to_df(
nowcast_matrix_list = nowcast_matrix_list
)
head(nowcast_draws)
## time delay count draw
## 1 1 1 107 1
## 2 1 2 76 1
## 3 1 3 45 1
## 4 1 4 25 1
## 5 1 5 23 1
## 6 1 6 16 1
The package also provides a helper function, aggregate_df_by_ref_time(), which converts the draws indexed by t and d into draws of the total count summed over all delays at each reference time. We can then use this to plot the nowcasts against the data available as of the nowcast date and the later observed data.
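For reference, the aggregation this helper performs could also be written directly with dplyr (a sketch; we assume the summed column is named total_count, matching the plotting code below):
# Sketch of an equivalent aggregation: sum counts over delays for each
# reference time and draw.
summary_nowcast_manual <- nowcast_draws |>
  group_by(time, draw) |>
  summarise(total_count = sum(count), .groups = "drop")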
summary_nowcast <- aggregate_df_by_ref_time(nowcast_draws)
# Join with the original data
final_data <- data_long[location == "DE"][age_group == "00+"] |> # nolint
enw_filter_report_dates(latest_date = "2021-10-01") |>
enw_filter_reference_dates(
latest_date = "2021-07-01",
include_days = n_history_delay - 1
) |>
group_by(reference_date) |>
summarise(
final_confirmed = max(confirm)
)
original_data <- observed_long |>
group_by(reference_date) |>
summarise(original_confirmed = max(confirm))
time_df <- data.frame(
reference_date =
seq(
from = min(original_data$reference_date),
to = max(original_data$reference_date),
by = "day"
)
) |> mutate(time = row_number())
sampled_draws <- sample.int(max(summary_nowcast$draw), 100)
summary_nowcast_w_data <- summary_nowcast |>
left_join(time_df,
by = "time"
) |>
left_join(original_data, by = "reference_date") |>
left_join(final_data, by = "reference_date") |>
# Filter to a set of draws for plotting + zoom in on nowcast period
filter(
draw %in% sampled_draws,
reference_date >= max(ymd(reference_date)) - days(max_delay)
)
ggplot(summary_nowcast_w_data) +
geom_line(aes(x = reference_date, y = total_count, group = draw),
alpha = 0.2, color = "gray", linewidth = 0.2
) +
geom_line(aes(x = reference_date, y = original_confirmed),
color = "darkred"
) +
geom_line(aes(x = reference_date, y = final_confirmed),
color = "black"
) +
theme_bw() +
xlab("Reference date") +
ylab("Hospital admissions") +
ggtitle("Comparison of admissions as of the nowcast date, later observed counts, \n and probabilistic nowcasted counts") # nolint