1 Introduction

Incomplete reporting of recent epidemiological data can result in case count data that are right-truncated. Right-truncated case counts are misleading to interpret at face value, as they typically show an apparent decline in the number of reported observations at the most recent time points, which are exactly the time points where the largest proportion of the data has yet to be observed.

The imputation of the cases that will eventually be observed up until the current time is referred to as a nowcast.

A number of methods have been developed to nowcast epidemiological case count data.

The purpose of baselinenowcast is to estimate a delay distribution empirically from the most recent observations and apply that estimate to the partially observed data to generate a nowcast.

In the sections below, we describe an example of a nowcasting problem and demonstrate how to use baselinenowcast to estimate a delay distribution from the data and apply that estimate to generate a probabilistic nowcast.

2 Packages

As well as the baselinenowcast package, this vignette uses epinowcast, ggplot2, and dplyr. Installing epinowcast is not required to use baselinenowcast; however, its pre- and post-processing functions provide much of the data wrangling needed to set up the nowcasting problem. We will only be using the R components of epinowcast, which can be installed with the example code below, so there is no need to install CmdStan, a downstream dependency of epinowcast.

# Installing epinowcast
# install.packages( #nolint
#  "epinowcast", repos = "https://epinowcast.r-universe.dev" #nolint
# ) #nolint
# Load packages
library(baselinenowcast)
library(epinowcast)
library(ggplot2)
library(dplyr)
# Set seed for reproducibility
set.seed(123)

3 Data

Nowcasting of right-truncated case counts involves estimating the reporting delays of recently reported data. For this, we need case counts indexed both by when they were diagnosed (often called the “reference date”) and by when they were reported (i.e. when administratively recorded via public health surveillance; often called the “report date”). The difference between the reference date and the report date is the reporting delay. For this quick start, we use daily-level data from the Robert Koch Institute via the Germany Nowcasting hub. These data represent hospitalisation counts by date of positive test and date of test report in Germany, up to October 1, 2021.

4 Filtering and plotting the data

We will filter the data to the national-level data for all age groups. We will pretend that we are making a nowcast as of July 1, 2021, and therefore exclude all reference dates and report dates after that date. germany_covid19_hosp is provided as package data from epinowcast. Let’s start by plotting the sum of the reports at each reference date, and then compare that to what we will eventually observe as of the final date in the complete dataset. The red line shows the cumulative number of confirmed admissions at each reference date, summed across all delays, using the data available as of July 1, 2021. It demonstrates the characteristic behaviour of right-truncation: at the most recent time points we have not yet observed the data that will eventually arrive at longer delays.

Our task will be to estimate what the “final” cumulative number of cases will be at each reference date, as observed in the fully observed data as of October 1, 2021.

data_long <- germany_covid19_hosp # import data from epinowcast

ggplot() +
  geom_line(
    # Plot the data summed across reporting days
    data = data_long |>
      enw_filter_report_dates(latest_date = "2021-07-01") |>
      filter(
        location == "DE", age_group == "00+",
        report_date == "2021-07-01"
      ),
    aes(x = reference_date, y = confirm), color = "darkred"
  ) +
  geom_line(
    data = data_long |>
      filter(
        location == "DE", age_group == "00+",
        reference_date <= "2021-07-01"
      ) |>
      group_by(reference_date) |>
      summarise(confirm = max(confirm)),
    aes(x = reference_date, y = confirm), color = "black"
  ) +
  theme_bw() +
  xlab("Reference date") +
  ylab("Confirmed admissions") +
  scale_y_continuous(trans = "log10") +
  ggtitle("Comparing real-time and later observed cases")

Here the black line represents the quantity we will evaluate our nowcast against, and the red line represents the data available to us up until July 1, 2021.

5 Pre-processing

In order to compute a nowcast for these data, we first create what we call a reporting triangle. This is a wide-format data frame in which each row represents one of the reference time points and each column represents a delay, starting from 0 up to the maximum delay. The entries are the number of new cases assigned to that reference time point that were reported with a particular delay. The reporting triangle will be used to estimate the delay distribution: the proportion of the final number of cases reported at each delay. Since these data are indexed daily by both reference and report date, we will build the reporting triangle on the time scale of days, but the delay and the reference date can have any temporal granularity.
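To make this structure concrete, below is a small toy reporting triangle (illustrative only, not drawn from the data used in this vignette) with five reference time points and a maximum delay of 3. NA marks cells whose report dates fall after the nowcast date and are therefore not yet observed.

# A toy reporting triangle: rows are reference time points, columns are
# delays 0 to 3, entries are new cases reported at that delay. NA marks
# cells not yet observed as of the nowcast date.
toy_triangle <- matrix(
  c(
    10, 6, 3, 1,
    12, 7, 2, 2,
    9, 5, 4, NA,
    11, 6, NA, NA,
    8, NA, NA, NA
  ),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, paste0("delay_", 0:3))
)
toy_triangle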

In this example, we will both fit our delay distribution and apply it to generate a nowcast using the same data: the national-level data from Germany for all age groups. However, these components can be separated. For example, we could use the national-level data for all age groups to estimate a delay distribution, and then apply that estimate elsewhere, such as to the data stratified by age group and location. This type of “borrowing” from another training dataset can be very useful when you have low counts or relatively sparse data, which is likely to be the case for smaller populations.
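As a sketch of that separated workflow (triangle_national and triangle_age_00_04 are hypothetical reporting triangles, each constructed for a single stratum in the same way we construct the national triangle below):

# Estimate the delay from a large, well-reported stratum... #nolint
# delay_pmf <- get_delay_estimate( #nolint
#   triangle = triangle_national, max_delay = 40, n_history = 60 #nolint
# ) #nolint
# ...and apply it to a sparser stratum's reporting triangle. #nolint
# point_square_age <- apply_delay( #nolint
#   triangle_to_nowcast = triangle_age_00_04, delay_pmf = delay_pmf #nolint
# ) #nolint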

In the sections below, we will specify our nowcast date, the maximum delay, and the number of observations by reference date to use to estimate the delay distribution. We recommend choosing the maximum delay and the number of historical observations based on an exploratory data analysis, as appropriate choices will vary across datasets.
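As one possible exploratory check (a sketch; the 40-day follow-up cutoff is illustrative), we can look at the average proportion of each reference date’s final count that had been reported by each delay, using only reference dates with sufficient follow-up:

# For reference dates with at least 40 days of follow-up, compute the mean
# proportion of the final count reported by each delay.
data_long |>
  filter(
    location == "DE", age_group == "00+",
    reference_date <= as.Date("2021-10-01") - 40
  ) |>
  mutate(delay = as.numeric(report_date - reference_date)) |>
  group_by(reference_date) |>
  mutate(prop_reported = confirm / max(confirm)) |>
  ungroup() |>
  filter(delay <= 40) |>
  group_by(delay) |>
  summarise(mean_prop_reported = mean(prop_reported))

Based on checks like this, we settle on the specifications below.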

nowcast_date <- "2021-07-01"
# Specify the maximum delay, which will determine the length of your delay
# distribution. Empirical data outside this delay window will not be used for
# training.
max_delay <- 40
# Specify the number of reference dates to use to estimate the delay
# distribution. Note this assumes you want to use the most recent
# observations.
n_history <- 60

Next we will use the epinowcast function enw_preprocess_data(), along with the data in the form of a long, tidy data frame indexed by reference date and report date and filtered to the strata we are interested in, to generate a reporting triangle.

# Note: we filter using data.table syntax here because epinowcast's
# preprocessing expects a data.table; passing the dplyr-processed long data
# from above can throw IDate errors.
observed_long <- data_long[location == "DE"][age_group == "00+"] |> # nolint
  enw_filter_report_dates(latest_date = nowcast_date) |>
  enw_filter_reference_dates(include_days = n_history - 1)
head(observed_long)
##    reference_date location age_group confirm report_date
##            <IDat>   <fctr>    <fctr>   <int>      <IDat>
## 1:     2021-05-03       DE       00+     107  2021-05-03
## 2:     2021-05-04       DE       00+     240  2021-05-04
## 3:     2021-05-05       DE       00+     245  2021-05-05
## 4:     2021-05-06       DE       00+     259  2021-05-06
## 5:     2021-05-07       DE       00+     263  2021-05-07
## 6:     2021-05-08       DE       00+     189  2021-05-08
# Get the reporting triangle, passing max_delay + 1 to epinowcast because
# a delay of 0 is valid, so we want max_delay + 1 delay entries.
pobs <- enw_preprocess_data(
  obs = observed_long,
  max_delay = max_delay + 1
)

triangle_full <- pobs$reporting_triangle[[1]]
head(triangle_full)
## Key: <.group, reference_date>
##    .group reference_date     0     1     2     3     4     5     6     7     8
##     <num>         <IDat> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1:      1     2021-05-03   107    76    45    25    23    16     5     4     6
## 2:      1     2021-05-04   240   178    60    36    33    11     5    21    19
## 3:      1     2021-05-05   245   158    70    42    22    17    36    43    34
## 4:      1     2021-05-06   259   163    69    22    16    42    57    36    10
## 5:      1     2021-05-07   263   169    46    15    55    32    27    15    25
## 6:      1     2021-05-08   189    97    34    78    42    30    23    27     9
##        9    10    11    12    13    14    15    16    17    18    19    20
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1:    25     1     3     7     3     5     0     6    10     5     3     1
## 2:    15     4     7     8     5    17    11     7    10     5     5     6
## 3:    17    21    10     4    24    23    35    27    12     2     1     2
## 4:    21    11     4    24    30    26    22    16     7     3     2     9
## 5:     8     8    23    26    27    11     9     0     0     6     4     0
## 6:     7    18    31    18    28    13     5     1     8    13    12    11
##       21    22    23    24    25    26    27    28    29    30    31    32
##    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1:     3     1     2     1     0     5     0     1     1     1     0     0
## 2:     2     1     5     5     5     2     1     4     7     1     1     0
## 3:    12    10    13     4     1     0     0     7     2     3     5     1
## 4:     8     8     8     5     2     1     5     1     2     4     1     1
## 5:     7     6     6     0     2     7     4     2     0     1     0     2
## 6:     6     2     0     4     5     5     3     2     0     1     1     1
##       33    34    35    36    37    38    39    40
##    <int> <int> <int> <int> <int> <int> <int> <int>
## 1:     2     0     0     4     1     0     3     0
## 2:     1     0     1     2     1     2     3     0
## 3:     3     2     5     2     6     2     0     0
## 4:     1     3     4     4     3     3     1    10
## 5:     3     4     2     0     2     0     4     4
## 6:     2     1     4     0     0     0     3     7
# Check that the row sums of the reporting triangle match the cumulative
# totals across report dates for each reference date
observed_sums <- observed_long |>
  group_by(reference_date) |>
  summarise(total_confirmed = max(confirm)) |>
  mutate(triangle_sums = rowSums(triangle_full |>
    select(-`.group`, -reference_date) |>
    as.matrix() |>
    unname()))
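
The two totals should agree; a quick, illustrative check (recent rows of the triangle may contain NAs, which we drop from the comparison):

# Confirm the row sums of the triangle match the totals by reference date.
all(
  observed_sums$total_confirmed == observed_sums$triangle_sums,
  na.rm = TRUE
)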

6 Estimate delay

Now that we have the reporting triangle, we are ready to pass it to the baselinenowcast package to estimate the delay distribution. We will tell the function the maximum delay and the number of observations we want to use, though the default is to use the whole reporting triangle. If the reporting triangle is too small for the user-specified maximum delay and number of training observations, the function will throw an error. We only want to pass the reporting triangle for a single group to this function: if reference dates are repeated because the reporting triangle contains multiple strata, the function will throw an error.

The get_delay_estimate() function expects the following inputs:

- triangle: a matrix containing the reporting triangle for a single stratum. Rows represent the reference time points and columns represent the observed delays, starting at 0.
- max_delay: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of columns in triangle minus 1, since the delays are indexed from 0.
- n_history: an integer indicating the number of observations by reference date to use to fit the delay distribution. This must be less than or equal to the number of rows in triangle.

enw_preprocess_data() returns a triangle with the columns .group and reference_date, with the delays indicated by the remaining column names. While we will eventually write methods that map from other input formats such as this, we will start by demonstrating the functionality on the bare matrix of the reporting triangle.

triangle <- triangle_full |>
  select(-`.group`, -reference_date) |> # remove unnecessary columns
  as.matrix() |>
  unname()
head(triangle)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]  107   76   45   25   23   16    5    4    6    25     1     3     7     3
## [2,]  240  178   60   36   33   11    5   21   19    15     4     7     8     5
## [3,]  245  158   70   42   22   17   36   43   34    17    21    10     4    24
## [4,]  259  163   69   22   16   42   57   36   10    21    11     4    24    30
## [5,]  263  169   46   15   55   32   27   15   25     8     8    23    26    27
## [6,]  189   97   34   78   42   30   23   27    9     7    18    31    18    28
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]     5     0     6    10     5     3     1     3     1     2     1     0
## [2,]    17    11     7    10     5     5     6     2     1     5     5     5
## [3,]    23    35    27    12     2     1     2    12    10    13     4     1
## [4,]    26    22    16     7     3     2     9     8     8     8     5     2
## [5,]    11     9     0     0     6     4     0     7     6     6     0     2
## [6,]    13     5     1     8    13    12    11     6     2     0     4     5
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,]     5     0     1     1     1     0     0     2     0     0     4     1
## [2,]     2     1     4     7     1     1     0     1     0     1     2     1
## [3,]     0     0     7     2     3     5     1     3     2     5     2     6
## [4,]     1     5     1     2     4     1     1     1     3     4     4     3
## [5,]     7     4     2     0     1     0     2     3     4     2     0     2
## [6,]     5     3     2     0     1     1     1     2     1     4     0     0
##      [,39] [,40] [,41]
## [1,]     0     3     0
## [2,]     2     3     0
## [3,]     2     0     0
## [4,]     3     1    10
## [5,]     0     4     4
## [6,]     0     3     7
delay_pmf <- get_delay_estimate(
  triangle = triangle,
  max_delay = max_delay,
  n_history = n_history
)
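
Since delay_pmf is an empirical probability mass function over delays 0 through max_delay, it should sum to (approximately) one; a quick sanity check:

# The estimated delay PMF should sum to (approximately) 1.
sum(delay_pmf)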

delay_df <- data.frame(
  delay = 0:(length(delay_pmf) - 1),
  pmf = delay_pmf
)

ggplot(delay_df) +
  geom_line(aes(x = delay, y = cumsum(pmf))) +
  xlab("Delay") +
  ylab("Cumulative proportion reported") +
  ggtitle("Empirical point estimate of cumulative proportion reported by delay") + # nolint
  theme_bw()

ggplot(delay_df) +
  geom_line(aes(x = delay, y = pmf)) +
  xlab("Delay") +
  ylab("Proportion reported") +
  ggtitle("Empirical point estimate of proportion reported by delay") +
  theme_bw()

7 Apply delay to generate point nowcast

The next step in our workflow is to take the estimated delay distribution and apply it to the partially observed reporting triangle, generating an estimate of the number of new cases confirmed at each reference date and delay. This produces a point estimate of what we can call the reporting square: the completed set of reference dates and delays. In this case, we will apply the delay to the same reporting triangle we used to generate the estimate, but this doesn’t always have to be the case. The reporting triangle we apply it to must, however, have the same max_delay as the delay estimate.
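To make the mechanics concrete, here is the standard multiplicative completion such a baseline performs; this is a sketch of the idea, not necessarily the exact internals of apply_delay(). Let $x_{t,d}$ be the count for reference time $t$ at delay $d$, and $p_d$ the estimated probability of reporting at delay $d$. The first unobserved cell in row $t$ is filled with

$$\hat{x}_{t,d} = \left( \sum_{d' < d} x_{t,d'} \right) \frac{p_d}{\sum_{d' < d} p_{d'}},$$

and each imputed value is then treated as observed when filling the next delay, moving across the row.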

point_reporting_square <- apply_delay(
  triangle_to_nowcast = triangle,
  delay_pmf = delay_pmf
)

We’ll make a quick plot comparing the nowcasted confirmed cases through July 1, 2021 with what is eventually observed as of October 1, 2021, and with the right-truncated data available up until July 1, 2021.

final_data <- data_long[location == "DE"][age_group == "00+"] |> # nolint
  enw_filter_report_dates(latest_date = "2021-10-01") |>
  enw_filter_reference_dates(
    latest_date = "2021-07-01",
    include_days = n_history - 1
  ) |>
  group_by(reference_date) |>
  summarise(
    total_confirmed = max(confirm)
  ) |>
  mutate(nowcast = rowSums(point_reporting_square))

ggplot() +
  geom_line(
# Plot the data summed across reporting days as of July 1, 2021
    data = observed_long |>
      group_by(reference_date) |>
      summarise(total_confirmed = max(confirm)),
    aes(x = reference_date, y = total_confirmed), color = "darkred"
  ) +
  geom_line(
    data = final_data,
    aes(x = reference_date, y = total_confirmed), color = "black"
  ) +
  geom_line(
    data = final_data,
    aes(x = reference_date, y = nowcast), color = "darkblue"
  ) +
  theme_bw() +
  xlab("Reference date") +
  ylab("Confirmed admissions") +
  scale_y_continuous(trans = "log10") +
  ggtitle("Comparing real-time, nowcasted, and later observed cases")

Here we can see that our point nowcast slightly overestimates what was eventually reported (black line), but does a decent overall job of correcting for the right-truncation observed in the red line (the data prior to the nowcast).

What we really want is a probabilistic estimate of the nowcasted case counts.

8 Estimate uncertainty

So far, we’ve demonstrated how to generate a point estimate of a nowcast. We would like to generate probabilistic nowcasts, which requires estimating the empirical error between the expected and the actual observations when applying the delay distribution. For this we assume a negative binomial observation model.

We will start by estimating the dispersion parameter of the negative binomial as a function of the delay, d.
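Concretely, our reading of this model (the mean-dispersion parameterisation below is an assumption) is that the count at reference time $t$ and delay $d$ satisfies

$$x_{t,d} \sim \mathrm{NegBin}\left(\mu = \hat{x}_{t,d},\; \phi = \phi_d\right),$$

where $\hat{x}_{t,d}$ is the expected count from applying the delay distribution and $\phi_d$ is a delay-specific dispersion parameter.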

disp_params <- estimate_uncertainty(
  triangle_to_nowcast = triangle,
  delay_pmf = delay_pmf
)

9 Generate probabilistic nowcast

Now that we have estimated the dispersion, we can use the delay distribution, the reporting triangle that we want to nowcast, and the dispersion parameters to generate a probabilistic nowcast.

# nowcast <- generate_nowcast( #nolint
#   triangle_to_nowcast = triangle, #nolint
#   delay_df = delay_df, #nolint
#   disp_params = disp_params #nolint
# ) #nolint
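
In the meantime, here is a minimal sketch of the sampling step, assuming that unobserved cells of triangle are NA and that disp_params contains one negative binomial dispersion value per delay column (the exact structure of disp_params may differ; nowcast_draw is a hypothetical helper, not part of the package):

# Sketch (not the package API): sample one probabilistic nowcast draw by
# replacing each unobserved cell with a negative binomial draw centred on
# the point estimate, using a delay-specific dispersion.
nowcast_draw <- function(triangle, point_square, disp_params) {
  draw <- triangle
  for (d in seq_len(ncol(triangle))) {
    missing <- is.na(triangle[, d])
    if (any(missing)) {
      draw[missing, d] <- rnbinom(
        n = sum(missing),
        mu = point_square[missing, d],
        size = disp_params[d]
      )
    }
  }
  rowSums(draw) # nowcasted total for each reference date
}
# Repeating this many times gives samples from the predictive distribution.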

For the eventual generate_nowcast() function, we may want to ensure that any metadata the user passes in is carried through; for example, we could write methods that dispatch on objects of a particular type and return outputs with dates when dates were provided as inputs.