For a sound use of health care data in epidemiology: evaluation of a calibration model for count data with application to prediction of cancer incidence in areas without cancer registry

For a sound use of health care data in epidemiology: evaluation of a calibration model for count... Summary There is a growing interest in using health care (HC) data to produce epidemiological surveillance indicators such as incidence. Typically, in the field of cancer, incidence is provided by local cancer registries which, in many countries, do not cover the whole territory; using proxy measures from available nationwide HC databases would appear to be a suitable approach to fill this gap. However, in most cases, direct counts from these databases do not provide reliable measures of incidence. To obtain accurate incidence estimations and prediction intervals, these databases need to be calibrated using a registry-based gold standard measure of incidence. This article presents a calibration model for count data developed to predict cancer incidence from HC data in geographical areas without cancer registries. First, the ratio between the proxy measure and incidence is modeled in areas with registries using a Poisson mixed model that allows for heterogeneity between areas (calibration stage). This ratio is then inverted to predict incidence from the proxy measure in areas without registries. Prediction error admits closed-form expression which accounts for heterogeneity in the ratio between areas. A simulation study shows the accuracy of our method in terms of prediction and coverage probability. The method is further applied to predict the incidence of two cancers in France using hospital data as the proxy measure. We hope this approach will encourage sound use of the usually imperfect information extracted from HC data. 1. Introduction Measuring disease incidence at a local level is an important issue in public health. Incidence figures are needed to spot disparities in health status, establish public health goals, and compare the performance of health systems between different geographical areas. In the field of cancer, incidence figures are needed to plan prevention programs, establish diagnosis, and treatment strategies, as well as to assess the benefit of screening programs. Cancer incidence is generally monitored by local population-based cancer registries, which ensure exhaustive recording of new cancer cases occurring in the population living in a given area. However, in many countries, cancer registries were established on local initiatives and do not cover the whole territory. Thus, cancer incidence data are not available at all local levels and should thus rely on predictions based on external data (i.e. proxy) and statistical modeling. In France, for example, only 16 “Départements” (local level administrative districts) out of 96 are covered by a cancer registry, which corresponds to 20% of the French population. This gap has led to interest in using information from existing health care (HC) databases available for the whole territory as a proxy (e.g. hospitalization or HC reimbursement databases) (Mitton and others, 2011). However, the use of this data as a proxy for epidemiological purposes calls for caution because these HC databases were established for economic purposes. Accordingly, a direct count of cases from these databases is not a reliable measure of incidence (Carré and others, 2006; Couris and others, 2009). Indeed, depending on the cancer being studied, several factors, such as HC pathways and coding habits, may affect the ability of HC data to accurately reflect incidence values. For instance, the accidental discovery of thyroid cancer after surgery for a benign condition may not be mentioned in a patient’s medical record. Similarly, skin melanomas removed by private dermatologists will not appear in hospital data, nor cases of prostate cancer treated by radiotherapy alone in a private radiologist unit outside of clinics and hospitals. Moreover, HC data also contain “false positive” incident cases, mainly prevalent cases, and even coding errors (e.g. in situ cancer or topography misspecification) (Grosclaude and others, 2012). Consequently, the sensitivity and positive predicted values of HC databases in identifying incident cancer cases may vary between geographical areas and between cancer sites (Carré and others, 2006; Grosclaude and others, 2012). To be able to predict cancer incidence in geographical areas without a cancer registry, HC data must be calibrated on gold standard (i.e. coming from registries). As these predictions are necessarily subject to errors, it is crucial to provide correct prediction intervals. This is a critical issue when using HC data in epidemiology: the question of “veracity” (Ehrenstein and others, 2017). Indeed, irrespective of the advantages of HC data sources to build epidemiological indicators, the question remains about how valid the indicators are and whether we are able to provide confidence intervals (CIs) which correctly reflect their accuracy. This article proposes a method to address this issue in the case where such a gold standard is available only in certain geographical areas. For this purpose, we propose using a calibration approach. The objectives of this methodological article are to: (i) present a calibration approach developed specifically to predict cancer incidence at the district level; (ii) formally derive the statistical properties of the predicted incidence estimators and build corresponding prediction intervals; and (iii) evaluate the performance of the method in terms of bias, standard errors estimation and coverage probability (CP), through realistic simulations. The article is organized as follows: Section 2 presents the calibration approach, the incidence estimators considered and the derivation of their variance. Section 3 presents a simulation study, which evaluates the methodology’s application in various scenarios. Section 4 illustrates the use of the method with real cancer data from local level French cancer registries and hospital discharge data. The final section includes a general discussion. 2. Calibration model, prediction, and variance derivation The approach proposes developing a predictor of local cancer incidence by using a proxy measure. As the latter stems from incidence (e.g. hospitalization is a consequence of having cancer), we use a calibration approach (Osborne, 1991; Carroll and others, 2006) in which the proxy measure is modeled as a function of incidence, with data from districts where both measures are available. This model is then inversed to predict the local incidence from the proxy measure. To facilitate understanding of the approach, we will describe the calibration approach in the specific case where the gold standard incidence value comes from a cancer registry and the proxy measure is the number of newly hospitalized patients. However, the method is suitable for any context where a gold standard is available for only a part of the territory and where the proxy measure is available for the whole territory. 2.1. Notations In a given district $$j (\,j=1\ldots J)$$, $${\bf{C_j}}=(C_{1j}\ldots C_{Ij})^\top$$ denotes the random vector of the number of incident cancer cases by age group $$i (i=1\ldots I)$$. $${\bf{C_j}}$$ is assumed to follow a Poisson distribution of expectation $${\bf{\bf{\gamma}_j}}=(\gamma_{1j}\ldots \gamma_{Ij})^\top$$. $${\bf{H_j}}=(H_{1j}\ldots H_{Ij})^\top$$ denotes the random vector of newly hospitalized cancer patients. $${\bf{H_j}}$$ is related to $${\bf{C_j}}$$ by the calibration model (2.1) defined in Section 2.2 of this text. $${\tau_j}={\mathbf{w_j}}^\top{\bf{\bf{\gamma}_j}}= \sum_i w_{ij}\gamma_{ij}$$ denotes a weighted sum of incident cases expectations. According to the weights used, $${\tau_j}$$ corresponds either to the total number of cases expectation ($$w_j={\mathbf{1}}$$) or to the age-standardized incidence rate (ASR) expectation ($$w_{ij}=w^r_i/PY_{ij}$$), where $$w^r_i$$ is the weight of age class $$i$$ in the reference population and $${\rm PY}_{ij}$$ the vector of the number of person-years (PY) in districts $$j$$ for age class $$i$$. Finally, for a matrix $${\mathbf{X}}=(x_{ij})$$, $$\exp({\mathbf{X}})$$ refers here to the matrix of exponentiated values $$[\exp({\mathbf{X}})]_{ij}=\exp(x_{ij})$$ and for a vector $${\mathbf{y}}$$, $${}^\mathrm{D}{\mathbf{y}}$$ refers to the square diagonal matrix with the elements of $${\mathbf{y}}$$ on the main diagonal. Symbol $${\mathbf{A}}\circ{\mathbf{B}}$$ denotes the element-wise product of matrices $${\mathbf{A}}$$ and $${\mathbf{B}}$$ (or Hadamard product—see Section 1.2 of the supplementary material available at Biostatistics online for calculation properties used in the article); $$E_Y$$ and $$V_Y$$ denote the expectation and the variance–covariance matrix over the probability distribution of the random variable $$Y$$. 2.2. Calibration model We assume that the number of hospitalized patients is related to the gold standard incidence using a Poisson log-normal mixed model that accounts for the hierarchical structure of the data (McCulloch and others, 2008). More specifically, the calibration model assumes that, given the number of incident cases and a district random effect $$b_j$$, the number of newly hospitalized patients $$H_{ij}$$ of age class $$i$$ (central age $$a_i$$) in a district $$j$$ follows a Poisson distribution with mean $$\eta_{ij}$$, i.e. ($$H_{ij}| b_j,C_{ij}=c_{ij}\sim\mathcal{P}(\eta_{ij})$$) where:   \begin{equation} \left\{ \begin{array}[H]{l} \eta_{ij}=E(H_{ij} |b_j,C_{ij} =c_{ij} )=c_{ij}\exp(f(a_i)+b_j)\\[3pt] b_j\sim\mathcal{N}(0,\sigma_d^2) \end{array} \right. \end{equation} (2.1)$$f$$ is modeled with a restricted cubic spline with one knot at the median age (median over all cases) and boundary knots placed at 5 years above and below the minimum and maximum age respectively (over all cases). The marginal expectation of (2.1) over $$b_j$$ is (Skrondal and Rabe-Hesketh, 2009):   \[ E(H_{ij} |C_{ij}=c_{ij})=E_{b_j}(\eta_{ij} )=c_{ij}\exp(f(a_i)+ \sigma_d^2/2). \] Thus, $$h(a)=\exp(\,f(a)+ \sigma_d^2/2)$$ gives the mean $$H/C$$ ratio of age $$a$$ over all districts. In matrix notation, calibration model (2.1) may be written:   \[ \left\{ \begin{array}[H]{l} ({\bf{H_j}} |b_j,{\bf{C_j}}={\mathbf{c_j}})\sim\mathcal{P}({\bf{\boldsymbol{\eta}_j}})\\[3pt] \log({\bf{\boldsymbol{\eta}_j}})=\log({\mathbf{c_j}})+{\mathbf{{X\boldsymbol{\beta}}}}+b_j\\[3pt] b_j\sim\mathcal{N}(0,\sigma_d^2), \end{array} \right. \] where $${\mathbf{X}}$$ is the matrix of the restricted cubic spline basis for $$f(a)$$ and $$\boldsymbol{\beta}$$ the corresponding vector of coefficients. Using data from registry areas, where both $${\bf{H_j}}$$ and $${\bf{C_j}}$$ are known, this calibration model provides an estimate $$\hat{\boldsymbol{\beta}}$$ of $$\boldsymbol{\beta}$$, an estimate $$\bf{\hat{\Sigma}}_{\hat{\boldsymbol{\beta}}}$$ of the variance–covariance matrix $$\bf{\Sigma}_{\hat{\boldsymbol{\beta}}}$$ of $$\hat{\boldsymbol{\beta}}$$, with $$\hat{\boldsymbol{\beta}}\sim\mathcal{N}(\boldsymbol{\beta},\bf{\Sigma}_{\hat{\boldsymbol{\beta}}})$$, and an estimate $$\hat{\sigma}_d$$ of $$\sigma_d$$. 2.3. Prediction of incidence Now let us turn to a district where the incidence is unknown. Several incidence predictions may be considered. We are primarily interested in the expectation $${\bf{\bf{\gamma}_j}}$$ of $${\bf{C_j}}$$, in particular because predictions are used to describe incidence variations across districts. An estimator $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ for $${\bf{\bf{\gamma}_j}}$$ is thus required. Synthetic measures of incidence over all ages, such as the overall number of cases or the ASR ($$\hat{\tau}_j$$) derived from $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, are also useful and are usually the main incidence indicators of interest. 2.3.1. Prediction of the expectation of the number of incident cases by age: vector $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ If the theoretical parameters $$\boldsymbol{\beta}$$ and $$\sigma_d$$ were known, then the average $$H/C$$ ratio (noted $$h(a)$$) would be known and the random vector $$\{H_{aj}/h(a)=H_{aj}\exp(-f(a)-\sigma_d^2/2)\}$$ (or $${\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2)$$ in matrix notation) would be an unbiased estimator of $${\bf{\gamma}_j} (E[{\bf{H_j}}\, \circ\, \exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2)] = E_{{\bf{C_j}}}[E( {\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2) | {\bf{C_j}})]=E_{{\bf{C_j}}}[{\bf{C_j}}]={\bf{\gamma}_j}$$). Thus, a natural estimator for $${\bf{\bf{\gamma}_j}}$$ is the corresponding plug-in estimator using $$\hat{\boldsymbol{\beta}}$$ and $$\hat{\sigma}_d^2$$ as estimated in the calibration stage:   \[ {\mathbf{{\bf{\boldsymbol{\hat{\gamma}_j}}}}}={\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\hat{\beta}}}}}-\hat{\sigma}_d^2/2)={\bf{H_j}}\circ\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}}), \] with $${\mathbf{Z}}=({\mathbf{{X,\frac{1}{2}}}})$$ and $${\mathbf{{\boldsymbol{\hat{\alpha}}}}}=({\mathbf{{\boldsymbol{\hat{\beta}}^\top}}},\hat{\sigma}_d^2)^\top$$, with expectation $$\bf{\alpha}=E({\mathbf{{\boldsymbol{\hat{\alpha}}})}}=({\boldsymbol{\beta^\top}},\sigma_d^2)^\top$$. Neglecting the uncertainty in the estimation of $$\sigma_d$$, $${\bf{Z}\boldsymbol{\hat{\alpha}}}$$ may be assumed to be normally distributed with mean $${\bf{Z\boldsymbol{\alpha}}}$$ and variance–covariance matrix . Then $$\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}})$$ follows a multivariate log-normal distribution with mean $$E(\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}}))=\exp(-{\bf{Z\boldsymbol{\alpha}}}+\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}})$$, where $${\bf{D_{\boldsymbol{\Sigma}}}}$$ is the vector of the diagonal elements of $$\bf{\Sigma}$$, and admits a closed form for the variance–covariance matrix (see Section 3 of the supplementary material available at Biostatistics online). 2.3.2. Prediction of the expectation of standardized incidence rates and overall number of cases: $$\hat{\tau}_j$$ Predicting incidence over all ages is then straightforward. The expectations of the overall number of cases and standardized incidence rates are both estimated as a weighted sum of age-specific predictions $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$: $$\hat{\tau}_j={\mathbf{w_j}}{\bf{\boldsymbol{\hat{\gamma}_j}}}$$, with $${\mathbf{w_j}}$$ as defined in the notation paragraph. 2.4. Variance–covariance matrix of the predictions 2.4.1. Variance–covariance matrix of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ To calculate the variance of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, we first note that:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}})= E_{{\bf{C_j}}}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})]+V_{{\bf{C_j}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})]. \] The first term in the above equation can be derived using the following conditional variance decomposition:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})=V_{{\boldsymbol{\hat{\alpha}}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})]+E_{{\boldsymbol{\hat{\alpha}}}}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})], \] The second term of the latter can be further decomposed by noting that:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})=V_{b_j}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]+E_{b_j}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)] \] Regrouping the preceding equations gives the final expression for $$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$:   \[ \begin{split} V({\bf{\boldsymbol{\hat{\gamma}_j}}})=&E_{\bf{C_j}}[V_{{\boldsymbol{\hat{\alpha}}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})]]+E_{\bf{C_j}}[E_{{\boldsymbol{\hat{\alpha}}}}[V_{b_j}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]]]\\ &+E_{\bf{C_j}}[E_{{\boldsymbol{\hat{\alpha}}}}[E_{b_j}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]]]+V_{\bf{C_j}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})] \end{split} \] The first term comes from the variability of the estimation of $${\boldsymbol{\hat{\alpha}}}$$ in the calibration model (2.1). The second term comes from the between-district variability of the $$H/C$$ ratio $$\sigma_d^2$$ and indicates that the wider this variability, the larger is the prediction error. Finally, the third and fourth terms are due to the Poisson variability of $${\bf{H_j}}$$ and $${\bf{C_j}}$$ around their mean and are referred to hereafter as the random Poisson part of the overall variance. Using the proprieties of the multivariate log-normal distribution, we see that $$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is finally expressed in a closed form (see Section 1.3 of the supplementary material available at Biostatistics online):   \begin{equation} V({\bf{\boldsymbol{\hat{\gamma}_j}}}) = (e^{\boldsymbol{\Sigma}+\sigma_d^2}-1)\circ\big[{}^\mathrm{D}{\bf{\bf{\gamma}_j}}\circ e^{{\bf{D_{\boldsymbol{\Sigma}}}}}+({\bf{\bf{\gamma}_j}}\circ e^{\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}}})({\bf{\bf{\gamma}_j}}\circ e^{\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}}})^\top\big]+{}^\mathrm{D}{\bf{\bf{\gamma}_j}}\circ(e^{-{\bf{Z\boldsymbol{\alpha}}}+2{\bf{D_{\boldsymbol{\Sigma}}}}}+1) \end{equation} (2.2)$$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is estimated with the plug-in estimator. 2.4.2. Variance of $$\hat{\tau}_j={\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ The variance of any weighted sum of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, $$\hat{\tau}_j={\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ is straightforward: $$ V(\hat{\tau}_j)=V({\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}})={\mathbf{w_j}}^\top V({\bf{\boldsymbol{\hat{\gamma}_j}}}){\mathbf{w_j}}$$. 2.5. Distribution and predictions intervals for $$\hat{\tau}_j={\mathbf{w_j}}^T{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ With respect to the distribution of $$\hat{\tau}_j$$, let us first remark that $$\log({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is distributed as the sum of a log-Poisson and of a normal variable ($$\log({\bf{\boldsymbol{\hat{\gamma}_j}}})=\log({\bf{h_j}})-{\bf{Z}\boldsymbol{\hat{\alpha}}}$$). As the approximation of a log-Poisson by a normal distribution is generally accurate, we assume that $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ follows a log-normal distribution. Thus $$\hat{\tau}_j$$, as a weighted sum of log-normal distributions, may also be approximated by a log-normal distribution (Fenton, 1960). The prediction interval of $$\hat{\tau}_j$$ at the $$\alpha$$ level is then easily written, using the coefficient of variation (CV) of $$\hat{\tau}_j$$, $$\hat{cv}_j=\frac{\sqrt{\hat{V}(\hat{\tau}_j)}}{\hat{\tau}_j}$$ (see Section 2 of the supplementary material available at Biostatistics online):   \[ PI(\hat{\tau}_j)= \left[\hat{\tau}_j\sqrt{\hat{cv}_j^2+1} e^{-z_{\frac{\alpha}{2}}\sqrt{\log(\hat{cv}_j^2+1)}}; \hat{\tau}_j\sqrt{\hat{cv}_j^2+1} e^{z_{\frac{\alpha}{2}}\sqrt{\log(\hat{cv}_j^2+1)}}\right] \] with $$z_{\frac{\alpha}{2}}$$ the quantile at the $$\frac{\alpha}{2}$$ level of a standard normal distribution. 2.6. Implementation In the simulation study and real data analyzes below, the calibration model (2.1) was fitted using the glmer function from lme4 package in R (Bates and others, 2015). An adaptive Gauss–Hermite quadrature with 20 points was used to obtain the marginal likelihood. The variance and prediction intervals were also calculated using R software. 3. Simulation study In this section, we consider realistic simulation scenarios to evaluate the performance of the proposed approach. More specifically, we check whether our approach allows: (i) an unbiased prediction of incidence; (ii) a correct CP of the prediction intervals; and (iii) a correct estimation of the key parameter $$\sigma_d^2$$ [between-district variance of the $$H/C$$ ratio in the calibration model, (2.1)]. In addition, we assess the robustness of the approach in case of misspecification in the calibration model (2.1) of the distribution of the district random effects. Here, we will focus on the prediction of the overall number of cases expectation (i.e. $${\tau_j}={\mathbf{w_j}}^\top{\bf{\bf{\gamma}_j}}$$ with $${\mathbf{w_j}}={\mathbf{1}}$$). We expect that predicting standardized incidence rates will give similar results in terms of relative bias and coverage properties, because both indicators are basically weighted sums of predicted numbers of cases by age. Several factors may affect the performance of our approach. Accordingly, different scenarios were considered, especially regarding $$\sigma_d$$ whose values directly affect the prediction intervals. The scenarios were designed to mimic various situations observed in practice, and we relied on real data and results applying the model to 24 cancer sites for the 2007–2009 period (data not shown) to choose or estimate the parameters used to generate the data as detailed hereafter. 3.1. Data generation and simulation design With respect to the design, the study period is 2007–2009, and we assume that the area with registries includes 14 districts (which reflects reality in France for this period). Table S1 of the supplementary material available at Biostatistics online details the input parameters and the data sources used to estimate them. Nine scenarios were considered combining three levels of incidence and three levels for $$\sigma_d$$. The incidence levels considered were 3.5, 14, and 70 cases per 100 000 PY; that is, approximately 500, 2000, and 10 000 cases in the registry area (i.e. the total area covered by different district-level registries) available for the calibration model (2.1), which corresponds to the 25th, 50th, and 75th percentile of the distribution of the number of cases per cancer site in the French registry area. The values considered for $$\sigma_d$$ were 0.05, 0.1, and 0.3, which correspond to the range observed in our application to 24 cancer sites. The numbers of PY were directly taken from the observed male PY in the 14 registries (range 180 000–600 000). The H/C ratio, i.e. $$f(a)$$, comes from calibration model (2.1) applied to real data on colorectal cancer in males. With respect to the prediction stage, incidence is predicted in three prediction districts, corresponding to three populations sizes (5th, 50th, and 95th percentiles of the distribution of the number of PY for men over all French districts during the study period—see Table S1 of the supplementary material available at Biostatistics online). This led to 27 possibilities when combined with the 9 scenarios described above. For a given scenario, each simulation which we ran consisted in: Data generation in the districts of the registry area and in the three prediction districts. For each district $$j$$, the number of cases $${\bf{c_j}}$$ was drawn from a Poisson distribution with incidence rate $$g(a)\phi$$ (see Table S1 of the supplementary material available at Biostatistics online), the district random effect $$b_j$$ drawn from a normal distribution $$\mathcal{N}(0,\sigma_d^2)$$, and the number of hospitalized patients $${\bf{h_j}}$$ drawn from a Poisson distribution conditionally on $${\bf{C_j}}$$ and $$b_j$$, using $$f(a)$$; Analysis of the $$H/C$$ ratio in the registry area with the calibration model (2.1) in order to obtain the calibration parameters estimates; Incidence prediction and calculation of the prediction intervals in the three prediction districts, using number of hospitalized cases generated in Step 1 and calibration parameters’ estimated in Step 2. In each scenario, we simulated 5000 data sets. This number allowed us to estimate the parameters of interest (i.e. the number of predicted cases and their variances, and the variance of the random effect $$\sigma_d^2$$) with a precision of at least 5%. In addition, we performed a sensitivity analysis on the impact of the misspecification of distribution of the random effects $$b_j$$, which is assumed Gaussian in the calibration model (2.1). To this end, two additional scenarios were considered, in which $$b_j$$ were drawn from a non-Gaussian distribution. In the first scenario, $$b_j$$ was drawn from a centered uniform distribution with a standard deviation (SD) of 0.1. In the second scenario, $$b_j$$ was drawn from a mixture of two Gaussian distributions with density $$\pi\mathcal{N}(\mu,\sigma^2/100)+(1-\pi)\mathcal{N}(\mu_2,\sigma_2^2)$$, to mimic a case where a portion $$\pi$$ of the districts are clustered away from the mean. $$\mu$$ was set to 0.1, $$\sigma$$ to 0.1, $$\pi$$ to 0.3, and $$\mu_2$$ ($$=-0.04$$) and $$\sigma_2 (=0.09)$$ were determined to ensure a zero mean and a 0.1 SD of the distribution. Finally, additional simulations were run in order to evaluate the influence of the size (i.e. number of districts) of the calibration set on the performance of the method. We used the same simulation design and scenarios described above, but allowed the number of districts in the calibration set to vary between 5 and 100. Results from these additional simulations are presented in Section 6 of the supplementary material available at Biostatistics online. 3.2. Performance indicators The prediction of the overall number of incident cases expectation $$\hat{\tau}_j$$ was assessed in terms of: (i) relative bias (i.e. difference between the empirical mean of the 5000 $$\hat{\tau}_j$$ estimates and the true value $${\tau_j}$$ divided by $${\tau_j}$$); (ii) empirical CP (i.e. the proportion of samples in which the 95% prediction interval contains $${\tau_j}$$); and (iii) relative bias in the estimation of the standard error (i.e. the relative difference between the mean of the 5000 estimated standard errors of $$\hat{\tau}_j$$ and the empirical SD of the 5000 $$\hat{\tau}_j$$ estimates). Finally, the estimation of the between-district variance in the $$H/C$$ ratio ($$\hat{\sigma}_d^2$$) was assessed in terms of mean, median, and relative bias. 3.3. Results of the simulation study Table 1 shows all performance indicators concerning $${\tau_j}$$. Overall, they were rather good. Relative biases were very small ($$<$$1.5%) and the empirical CP ranged between 91% and 96%. The smaller the number of incident cases and the smaller the between-district variability of the $$H/C$$ ratio ($$\sigma_d$$), the closer was the CP to their nominal level. For a given overall number of cases $${\tau_j}$$ (i.e. at fixed incidence level and district size), the CV logically increased with $$\sigma_d$$. This inflation of the CV with $$\sigma_d$$ increased with the number of cases. For example, at the smallest $${\tau_j}$$ value ($${\tau_j}=10$$), the CV increased only slightly from 0.47 to 0.56 whereas, at its highest value ($${\tau_j}={1233}$$), a greater than 4-fold increase in the CV was observed, from 0.07 to 0.32. This shows that when the number of cases to predict $${\tau_j}$$ is very small, the random Poisson part is large and dominates the overall variance. In this case, the overall variance is not very sensitive to $$\sigma_d$$. On the contrary, when the number of cases to predict $${\tau_j}$$ is large, the random Poisson part of the overall variance shrinks and $$\sigma_d$$ contributes mainly to the overall variance as $$\sigma_d$$ increases. With respect to the estimation of the standard error, in almost all scenarios, a negative bias (up to $$-$$6%) was observed, which means that the SD is on average slightly underestimated. Table 1. Expected number of cases ($${\tau_j}$$), relative bias ($$\% \text{bias}$$), empirical SD, $${\rm CV}={\rm SD}/{\tau_j}$$, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\% \text{bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different variability $$\sigma_d$$ of the $$H/C$$ ratio $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  Table 1. Expected number of cases ($${\tau_j}$$), relative bias ($$\% \text{bias}$$), empirical SD, $${\rm CV}={\rm SD}/{\tau_j}$$, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\% \text{bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different variability $$\sigma_d$$ of the $$H/C$$ ratio $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  3.3.1. Estimation of district variability of the $$H/C$$ ratio ($$\hat{\sigma}_d$$) Table 2 shows the estimations of the SD of the district random effect in the calibration model ($$\hat{\sigma}_d$$). In all scenarios, $$\sigma_d$$ was underestimated (up to 27%), the bias decreasing with increasing $$\sigma_d$$ values and incidence levels. Table 2. Mean, median, and relative bias of the estimator of $$\sigma_d$$ (the standard error of the $$H/C$$ ratio), by incidence level for different true value of $$\sigma_d$$    $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0     $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0  Table 2. Mean, median, and relative bias of the estimator of $$\sigma_d$$ (the standard error of the $$H/C$$ ratio), by incidence level for different true value of $$\sigma_d$$    $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0     $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0  3.3.2. Sensitivity analysis As shown in Table 3, the misspecification of the distribution of the random effect in the calibration model did not alter the good performance of the approach. Whether this distribution was correctly specified or not had only a minor impact on the performance indicators presented (relative bias, mean estimated standard errors, or CPs of the prediction intervals). Table 3. Relative bias ($$\% \text{bias}$$), SD, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\%\text{ bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different generating distributions of district random effect~: uniform ($$\mathcal{U}$$), bimodal ($$\mathcal{B}$$), and Gaussian ($$\mathcal{G}$$)       % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7        % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7  Table 3. Relative bias ($$\% \text{bias}$$), SD, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\%\text{ bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different generating distributions of district random effect~: uniform ($$\mathcal{U}$$), bimodal ($$\mathcal{B}$$), and Gaussian ($$\mathcal{G}$$)       % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7        % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7  4. Application to real data To illustrate our approach, we estimated the incidence of two cancers [lip-oral cavity-pharynx cancer in men (LOPm) and non-Hodgkin lymphoma in women (NHLw)] in all French districts over the 2007–2011 period. Those sites were chosen to illustrate situations with, respectively, a low and medium district-variability of the $$H/I$$ ratio. 4.1. Data sources and methods Patients hospitalized for LOPm or NHLw during the study period were identified using the French national hospital discharge database [Programme de Médicalisation des Systémes d’Information (PMSI)], which collects data on patient stays in all public and private HC institutions. This database includes information on the principal diagnosis (according to the International Classification of Diseases ICD-10) and certain personal characteristics (sex, age, code of the residence area, and an anonymous patient identifier). The number of newly hospitalized patients, i.e. not hospitalized for the same cancer during the two preceding years, was counted per district and per 5-year age group over the whole period. Gold standard incidence data were provided by the French network of cancer registries (FRANCIM). This network ensures an exhaustive collection of incident cancer cases in several French districts. Data over the study period were available from 13 registries for LOPm and 15 registries for NHLw, which corresponds to 17% and 18% coverage of the French population, respectively. Registry data are coded according to the third version of the International Classification of Diseases for Oncology (ICD-O-3). Incident cases of cancer were aggregated per district and for the same age groups as above. We used the calibration approach detailed above to predict the total number of incident cases over all 96 districts over the 2007–2011 period. The results are presented as ASR using the age structure of the world population as a reference and standardized incidence ratios (SIR) using an internal standardization. The SIR prediction intervals were directly derived from those of the total number of cases. The SIR were charted on a map where districts with non-significant SIR (i.e. with prediction intervals containing 1) were hatched. The SIR were also plotted against their precision (inverse of its CV) in a funnel plot. Funnel plots include 95% confidence bands (respectively 99.8%), which corresponds to the expected 95% CI of a SIR given the precision (respectively 99.8% CI), under the null hypothesis $${\rm SIR}=1$$. These funnel plots allow a quick examination whether the observed SIR are outside the expected range under the null hypothesis, given their precision. A graph plotting the precision of the SIR against the square root of the number of predicted cases is also provided. 4.2. Results During the 2007–2011 period, 8086 incident LOPm and 5111 incident NHLw were recorded by cancer registries, and 8918 and 4362, respectively, new hospitalizations were identified in the PMSI. The between-district SD $$\sigma_d$$ of the $$H/C$$ ratio as estimated from the registry area was 0 for LOPm and 0.09 for NHLw. The predicted national ASR were 20.5 and 8.4 per 100 000 $$PY$$ for LOPm and NHLw, respectively (95% PIs: [20.0; 21.1] and [7.8; 9.0]) with wide between-district variability (see Table S2 of the supplementary material available at Biostatistics online). The value of $$\sigma_d$$ had a great impact on the precision of the estimated SIR. In NHLw, the variance of the SIR (see 2.2) was dominated by the $$\sigma_d$$ component and the SIR precisions were clustered close to 1/$$\sigma_d$$ (Figure 1A), the closest corresponding to the districts with the highest number of predicted cases (Figure 1B). On the contrary, in LOPm, the variance of the number of predicted cases was dominated by its random Poisson part. The SIR precisions then behaved like $$\sqrt{\hat{\tau}_j}$$ and was more scattered (Figure 1A and B). Fig. 1 View largeDownload slide Precision of the SIR of cancer incidence predictions. Panel A: funnel plot of precision vs. SIR with the 95 and 99.8% confidence limits. Panel B: precision of the SIR vs. the square root of the number of predicted cases. CV: coefficient of variation; LOPm: lip, oral cavity, pharynx cancer in men; NHLw: non-Hodgkin lymphoma in women. Fig. 1 View largeDownload slide Precision of the SIR of cancer incidence predictions. Panel A: funnel plot of precision vs. SIR with the 95 and 99.8% confidence limits. Panel B: precision of the SIR vs. the square root of the number of predicted cases. CV: coefficient of variation; LOPm: lip, oral cavity, pharynx cancer in men; NHLw: non-Hodgkin lymphoma in women. The predictions of the incidence of LOPm at the district level were of much more practical use than those of NHLw (Figure 2). In the first case, the geographic variation of incidence outweighed the imprecision of the estimates: districts with above-average incidence were mainly located in the north and north-west of France while those with lower incidence were located in the south-west. This geographical contrast in incidence is probably linked to large and well-documented geographical difference in tobacco and alcohol consumption in France (Ligier and others, 2011). In the second case (NHLw), the extent of the geographical variation in SIR was the same as the prediction intervals and the predictions were not sufficiently precise to provide a valuable description of incidence disparities. Fig. 2 View largeDownload slide France maps of cancer incidence predictions SIR for the 2007–2011 period. The hatched districts are those with a 95% prediction interval that contains 1. Fig. 2 View largeDownload slide France maps of cancer incidence predictions SIR for the 2007–2011 period. The hatched districts are those with a 95% prediction interval that contains 1. Accordingly, incidence predictions with poor precision may be of little practical interest. In particular, when the $$H/C$$ ratio varies widely between districts (high $$\sigma_d$$), $$H$$ is a “poor” proxy measure for $$C$$ and, logically, the predictions obtained using $$H$$ and the average $$H/C$$ ratio will also be poor, as observed for NHLw. This variability in the $$H/C$$ ratio probably reflects differences in coding practices between hospitals and/or geographical differences in the medical care pathway of patients (Grosclaude and others, 2012). 5. Discussion 5.1. Calibration model The present article proposes a solution to a univariate calibration problem for count data. It enables local incidence to be predicted using a proxy measure whenever a gold standard exists. The approach falls within the context of the classic univariate calibration (Osborne, 1991): in districts where both measures are available (calibration experiment), the mean ratio between the proxy and the gold standard measures (calibration curve) is estimated by age with a Poisson regression model, district-specific deviations from the mean being modeled with random effects. This mean ratio is then inverted to predict incidence from the proxy measure (prediction stage) and the variance of the random effects used to derive appropriate prediction intervals. The present approach has specific characteristics different from the traditional univariate statistical calibration. First, as the proxy measures are count data, the calibration model was estimated with a generalized linear mixed model (GLM(M)), despite the fact that the vast majority of literature on statistical calibration concentrates on the normal linear or non-linear models. The advantage of a GLM(M) Poisson model for calibration is that the classic estimator obtained by inverting the calibration curve has finite moments and that its distribution can be approximated by a log-normal distribution. We found very few publications dealing with calibration involving count data. For example, to calculate the CIs in an inverse prediction problem where the proxy measure follows a Poisson distribution, Vidoni (2003) proposed inverting the prediction intervals from the calibration model. Another example was found in the field of biological dosimetry: the dose of irradiation received by a patient was predicted from the number of blood lymphocytes using an inverse regression method where the number of lymphocytes was supposed to follow a Poisson distribution (Higueras and others, 2015). The above-mentioned methods were not directly applicable to our context. Indeed, in addition to the traditional “experimental” error (e.g. variation of the proxy measure around an invariant gold standard), our calibration problem has to deal with two additional sources of variability: the variability of the calibration curve between districts and the Poisson variability of the gold standard around its mean. With respect to the “experimental” error, we assumed that the proxy measure followed a Poisson distribution conditionally on the gold standard (H$$|$$C). We initially opted for a quasi-Poisson approach (Breslow and Clayton, 1993), which allowed for under-/over-dispersion, since the variance of $$H|C$$ would obviously vary with its expectation, and because we expected some under-dispersion (H being a proxy of C). This quasi-likelihood approach may be implemented using the function glmmPQL of the R library MASS (Venables and Ripley, 2002). The dispersion coefficient obtained may be directly incorporated in the overall variance (2.2) and only impacts the component related to the variance of H. In the systematic explorations we conducted (on 20 cancer sites for both sexes, results not shown), this model error was supported by examining residuals, which confirmed that the (quasi) Poisson framework was appropriate. However, we observed only slight under- or over-dispersion (the second being less frequent) with a minor impact in both cases on the overall variance (since the latter was dominated by the other components). We therefore finally opted for a Poisson model, using glmer which ensured exact Poisson likelihood estimation. In practice, checking for dispersion in the residuals is recommended, and if substantial dispersion is observed, using the quasi-Poisson approach is advised. The R programs for both versions (Poisson and quasi-Poisson) are available on the GitHub repository. The between-district variability of $$H/C$$ ratio (i.e. the calibration curve) was taken into account by introducing a district random intercept into the calibration model and assuming it followed a normal distribution. With real data, this distribution might not be normal. However, our simulations confirmed the robustness of the approach in case of misspecification of this distribution (McCulloch and Neuhaus, 2011). We assumed a simple random-effect specification, but we could have considered complex models instead, allowing a district variability in the age effect. However, our purpose was to retrieve a simple measure of the main district differences between the calibration curves and the parsimony of the model is crucial to enhance predictive performance. Furthermore, we are modeling the ratio between a proxy and incidence, which are thought to be “close” indicators; so that in our context, age–district interactions represent tapering effects which are beyond our scope and fall outside the focus of sparsity we strive for this model. It is worth noting however that any factor explaining the $$H/C$$ ratio (at individual or district scale), if available, should be included in the model, or at least tested. It may help to decrease the district variability of the ratio. Finally, our approach, to be accurate, requires that the set of districts available for the calibration phase (i.e. covered by a registry) is “representative,” in some sense, of all districts. More precisely, it assumes that the mean and the between-district variability of the ratio ($$\sigma_d$$) are correctly estimated with the calibration set of districts. Whether these assumptions reasonably hold depends heavily on the calibration-set available and also on the proxy used. In order to support discussion on this issue, it is recommended to describe how the calibration-set is distributed among all districts regarding factors that are potentially related to the proxy-to-incidence ratio. In practice, if there is little knowledge on these factors, one may describe basic population indicators to support the hypothesis that the calibration-set is representative of the overall population (e.g. in terms of deprivation, rural population, medical density, etc.—see Section 5 of the supplementary material available at Biostatistics online). 5.2. Bias and coverage probabilities Overall, the results of the simulations proved the accuracy of the present approach: the predicted incidence values were unbiased and the CPs of the prediction intervals were close to their nominal values. However, both the accuracy of the estimates of the standard errors and the CPs of the prediction intervals tended to decrease slightly with the increase in the number of incident cases or in the variability of the $$H/C$$ ratio. As shown by the simulations in different settings, the under-coverage of the prediction intervals is mainly due to the underestimation of $$\sigma_d$$, the between-district variability of the $$H/C$$ ratio. However, the impact is smaller for smaller $$\sigma_d$$ and, for a given $$\sigma_d$$, for smaller incidence and smaller district size (i.e. for smaller number of cases $${\tau_j}$$). The underlying mechanisms are explained in Section 3.3: the underestimation of $$\sigma_d$$ only impacts the first term in the overall variance of the predicted incidence (see 2.2), and thus its effect on the overall variance depends on the relative contribution of this term. Therefore, the underestimation of $$\sigma_d$$ has a minor impact on the overall variance when $$\sigma_d$$ is small, whereas its impact increases for larger $$\sigma_d$$, and as a consequence the CP slightly decreases. However, for a given $$\sigma_d$$, the impact is reduced for a smaller number of cases, because the random Poisson part contribution to the overall variance then increases and consequently the relative importance of the other terms reduces. The downward bias in the estimates of the variance component in GLMM is well known and diminishes with increasing observation per cluster and with a larger number of cluster (Austin, 2010). The influence of the number, K, of districts in the calibration set was confirmed by additional simulation results given in the supplementary material available at Biostatistics online (see Section 6, Figure S2 of the supplementary material available at Biostatistics online): the estimations of $$\sigma_d$$ are biased downward, to a larger extent for small $$\sigma_d$$, and this bias decreases with increasing K. The impact of this bias on the CP of the prediction interval is detailed in the supplementary material available at Biostatistics online for all scenarios (see Section 6, Figures S3 and S4 of the supplementary material available at Biostatistics online). Overall, the CPs of the prediction intervals were satisfactory (above 90%) when $$K\ge$$10 and in our epidemiological context, such prediction intervals are valuable for practical use. As a guide, the reader may refer to Figure S3 of the supplementary material available at Biostatistics online for a detailed description of the performance of the method, which depends on the number of calibration districts, between-district variability of the H/C ratio $$\sigma_d$$, incidence level and size of the prediction district. 5.3. Value of the calibration approach and related epidemiological considerations Although predicting incidence at a local level is an important issue in countries without a national registry, there are as yet few studies on the topic, and we did not find any outside of the cancer field. In Italy, regional incidence estimates were back-calculated for major cancer sites, based on regional mortality projections and regional survival estimates derived from registries, using the relation between incidence, survival, and mortality (Verdecchia and others, 2007). This multi-step methodology requires numerous assumptions however and brings uncertainty by using unnecessary projections. In France, the incidence/mortality ratio was previously used at regional levels with some limitations (Colonna and others, 2008), in particular since the ratio was assumed to be identical across regions. However, the variability of I/M ratio is too high to be used at a district level for almost all cancer sites. Extensions of the incidence/mortality ratio approach have been developed to predict county-level incidence in the USA for the three most frequent cancer sites, adding ecological covariates to improve prediction, such as socio-demographic characteristics and lifestyle habits (Pickle and others, 2003). Another natural direction, rather than attempting to enrich the modeling of incidence/mortality ratio, is to try to find a better incidence proxy. Routinely available HC databases provide appealing and promising incidence proxies. Some authors proposed to correct HC data based on sensitivity and specificity of the HC data to identify incident cases (e.g. Couris and others, 2009). One major limitation of this approach is that it requires individual linkage between registries and the HC data to estimate the required sensitivity and specificity. A few studies were conducted in France to predict cancer incidence at a district level from HC data using the HC/incidence ratio (Remontet and others, 2008; Mitton and others, 2011; Uhry and others, 2013). However, these were essentially papers oriented towards real-world applications without formal statistical derivation and evaluation. To our knowledge, the present article is the first methodological work to address the issue of predicting local incidence from a proxy based on calibration model, with an appropriate formal mathematical derivation together with a simulation-based evaluation. Given the multiplication of data sources, proxy measures will multiply too and be easy to obtain. This fact highlights the need for validated and practical calibration methods before data use for incidence prediction. From a practical point of view, although the proposed method provides unbiased incidence values and prediction intervals with good CPs, some predictions may be insufficiently precise to present any value in public health decision making processes. In the application section of this article, we saw for example that hospitalization data were poor proxies in NHLw, which resulted in a prediction interval of the same order of magnitude as the extent of geographical variation of the incidence. Thus, in practice, if the aim of the analysis is to describe geographical disparities in incidence, one may decide to discard predictions of cancer incidence where prediction errors are large relative to the magnitude of incidence disparities across districts and focus instead on cancer sites for which predictions are considered valuable. 5.4. Perspectives The incidence predictions obtained with our calibration approach may be used for posterior analyzes, since their distribution is known. This is especially true for geographical models including a spatial component which may be used to smooth and stabilize the SIR and reduce the variance. Moreover, future work should focus on predicting incidence through joint modeling of the expectations of the incidence and the proxy measures. Such a framework may enable, in addition, the simultaneous use of several proxies of incidence. 6. Conclusion Predicting disease incidence at sub-national levels is a difficult task, and to date has not been explored in depth. For countries without national registries, we present a convenient method to predict incidence from health insurance data, hospital discharge data, or other source of proxy of incidence data. We hope our approach will encourage the sound use of the usually imperfect information present in these data. Supplementary material supplementary material is available at http://biostatistics.oxfordjournals.org. Acknowledgments The authors thank the FRANCIM network for providing cancer incidence data and C. Bonaldi for his helpful suggestions. Conflict of Interest: None declared. References Austin P. C. ( 2010). Estimating multilevel logistic regression models when the number of clusters is low: a comparison of different statistical software procedures. The International Journal of Biostatistics  6, Article 16. Bates D., Mächler M., Bolker B. and Walker S. ( 2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software  67, 1– 48. Google Scholar CrossRef Search ADS   Breslow N. E. and Clayton D. G. ( 1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association  88, 9. Carré N., Uhry Z., Velten M., Trétarre B., Schvartz C., Molinié F., Maarouf N., Langlois C., Kadi-Hanifi A. M., Grosclaude P. and Colonna M. ( 2006). Predictive value and sensibility of hospital discharge system (PMSI) compared to cancer registries for thyroïd cancer (1999–2000). Revue d’epidemiologie et de sante publique  54, 367– 376. Google Scholar CrossRef Search ADS PubMed  Carroll R. J., Ruppert D., Stefanski L. A. and Crainiceanu C. M. ( 2006). Measurement Error in Nonlinear Models: A Modern Perspective , 2nd edition. Florida, USA: Chapman and Hall/CRC. Google Scholar CrossRef Search ADS   Colonna M., Bossard N., Mitton N., Remontet L., Belot A., Delafosse P. and Grosclaude P. ( 2008). Some interpretation of regional estimates of the incidence of cancer in France over the period 1980–2005. Revue d’épidémiologie et de santé publique  56, 434– 440. Google Scholar CrossRef Search ADS PubMed  Couris C. M., Polazzi S., Olive F., Remontet L., Bossard N., Gomez F., Schott A. M., Mitton N., Colonna M. and Trombert B. ( 2009). Breast cancer incidence using administrative data: correction with sensitivity and specificity. Journal of Clinical Epidemiology  62, 660– 666. Google Scholar CrossRef Search ADS PubMed  Ehrenstein V., Nielsen H., Pedersen A. B., Johnsen S. P. and Pedersen L. ( 2017). Clinical epidemiology in the era of big data: new opportunities, familiar challenges. Clinical Epidemiology  9, 245– 250. Google Scholar CrossRef Search ADS PubMed  Fenton L. ( 1960). The sum of log-normal probability distributions in scatter transmission systems. IEEE Transactions on Communications  8, 57– 67. Google Scholar CrossRef Search ADS   Grosclaude P., Dentan C., Trétarre B., Velten M., Fournier E. and Molinié F. ( 2012). Utilité des bases de données médico-administratives pour le suivi épidémiologique des cancers. Comparaison avec les données des registres au niveau individuel. Bulletin Epidémiologique Hebdomadaire , 5-6, 63– 67. Higueras M., Puig P., Ainsbury E. A. and Rothkamm K. ( 2015). A new inverse regression model applied to radiation biodosimetry. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences  471, 20140588. Google Scholar CrossRef Search ADS   Ligier K., Belot A., Launoy G., Velten M., Bossard N., Iwaz J., Righini C. A., Delafosse P. and Guizard A. V. ( 2011). Descriptive epidemiology of upper aerodigestive tract cancers in France: incidence over 1980–2005 and projection to 2010. Oral Oncology  47, 302– 307. Google Scholar CrossRef Search ADS PubMed  McCulloch C. E. and Neuhaus J. M. ( 2011). Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical Science  26, 388– 402. Google Scholar CrossRef Search ADS   McCulloch C. E., Searle S. R. and Neuhaus J. M. ( 2008). Generalized, Linear, and Mixed Models, 2nd Edition . Hoboken, New Jersey, USA: John Wiley & Sons, Inc. Mitton N., Colonna M., Trombert B., Olive F., Gomez F., Iwaz J., Polazzi S., Schott-Petelaz A. M., Uhry Z., Bossard N. and Remontet L. ( 2011). A suitable approach to estimate cancer incidence in area without cancer registry. Journal of Cancer Epidemiology  2011, 418968. Google Scholar CrossRef Search ADS PubMed  Osborne C. ( 1991). Statistical calibration: a review. International Statistical Review/Revue Internationale de Statistique  59, 309. Pickle L. W., Feuer E. J., Edwards B. K. ( 2003). U.S. Predicted Cancer Incidence, 1999: Complete Maps by County and State From Spatial Projection Models . NCI Cancer Surveillance Monograph Series, number 5. Bethesda, MD: National Cancer Institute. Remontet L., Mitton N., Couris C.M., Iwaz J., Gomez F., Olive F., Polazzi S., Schott A. M., Trombert B., Bossard N. and Colonna M. ( 2008). Is it possible to estimate the incidence of breast cancer from medico-administrative databases? European Journal of Epidemiology  23, 681– 688. Google Scholar CrossRef Search ADS PubMed  Skrondal A. and Rabe-Hesketh S. ( 2009). Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society: Series A (Statistics in Society)  172, 659– 687. Google Scholar CrossRef Search ADS   Uhry Z., Remontet L., Colonna M., Belot A., Grosclaude P., Mitton N., Delacour-Billon S., Gentil J., Boussac-Zarebska M., Bossard N. and others ( 2013). Cancer incidence estimation at a district level with out a national registry: a validation study for 24 cancer sites using French health insurance and registry data. Cancer Epidemiology  37, 99– 114. Google Scholar CrossRef Search ADS PubMed  Venables W. N. and Ripley B. D. ( 2002). Modern Applied Statistics with S , 4th edition. New York: Springer. Google Scholar CrossRef Search ADS   Verdecchia A., De Angelis R., Francisci S. and Grande E. ( 2007). Methodology for estimation of cancer incidence, survival and prevalence in Italian regions. Tumori  93, 337– 44. Google Scholar CrossRef Search ADS PubMed  Vidoni P. ( 2003). Prediction and calibration in generalized linear models. Annals of the Institute of Statistical Mathematics  55, 169– 185. © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biostatistics Oxford University Press

For a sound use of health care data in epidemiology: evaluation of a calibration model for count data with application to prediction of cancer incidence in areas without cancer registry

Loading next page...
 
/lp/ou_press/for-a-sound-use-of-health-care-data-in-epidemiology-evaluation-of-a-sLFKnrvF0k
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
ISSN
1465-4644
eISSN
1468-4357
D.O.I.
10.1093/biostatistics/kxy012
Publisher site
See Article on Publisher Site

Abstract

Summary There is a growing interest in using health care (HC) data to produce epidemiological surveillance indicators such as incidence. Typically, in the field of cancer, incidence is provided by local cancer registries which, in many countries, do not cover the whole territory; using proxy measures from available nationwide HC databases would appear to be a suitable approach to fill this gap. However, in most cases, direct counts from these databases do not provide reliable measures of incidence. To obtain accurate incidence estimations and prediction intervals, these databases need to be calibrated using a registry-based gold standard measure of incidence. This article presents a calibration model for count data developed to predict cancer incidence from HC data in geographical areas without cancer registries. First, the ratio between the proxy measure and incidence is modeled in areas with registries using a Poisson mixed model that allows for heterogeneity between areas (calibration stage). This ratio is then inverted to predict incidence from the proxy measure in areas without registries. Prediction error admits closed-form expression which accounts for heterogeneity in the ratio between areas. A simulation study shows the accuracy of our method in terms of prediction and coverage probability. The method is further applied to predict the incidence of two cancers in France using hospital data as the proxy measure. We hope this approach will encourage sound use of the usually imperfect information extracted from HC data. 1. Introduction Measuring disease incidence at a local level is an important issue in public health. Incidence figures are needed to spot disparities in health status, establish public health goals, and compare the performance of health systems between different geographical areas. In the field of cancer, incidence figures are needed to plan prevention programs, establish diagnosis, and treatment strategies, as well as to assess the benefit of screening programs. Cancer incidence is generally monitored by local population-based cancer registries, which ensure exhaustive recording of new cancer cases occurring in the population living in a given area. However, in many countries, cancer registries were established on local initiatives and do not cover the whole territory. Thus, cancer incidence data are not available at all local levels and should thus rely on predictions based on external data (i.e. proxy) and statistical modeling. In France, for example, only 16 “Départements” (local level administrative districts) out of 96 are covered by a cancer registry, which corresponds to 20% of the French population. This gap has led to interest in using information from existing health care (HC) databases available for the whole territory as a proxy (e.g. hospitalization or HC reimbursement databases) (Mitton and others, 2011). However, the use of this data as a proxy for epidemiological purposes calls for caution because these HC databases were established for economic purposes. Accordingly, a direct count of cases from these databases is not a reliable measure of incidence (Carré and others, 2006; Couris and others, 2009). Indeed, depending on the cancer being studied, several factors, such as HC pathways and coding habits, may affect the ability of HC data to accurately reflect incidence values. For instance, the accidental discovery of thyroid cancer after surgery for a benign condition may not be mentioned in a patient’s medical record. Similarly, skin melanomas removed by private dermatologists will not appear in hospital data, nor cases of prostate cancer treated by radiotherapy alone in a private radiologist unit outside of clinics and hospitals. Moreover, HC data also contain “false positive” incident cases, mainly prevalent cases, and even coding errors (e.g. in situ cancer or topography misspecification) (Grosclaude and others, 2012). Consequently, the sensitivity and positive predicted values of HC databases in identifying incident cancer cases may vary between geographical areas and between cancer sites (Carré and others, 2006; Grosclaude and others, 2012). To be able to predict cancer incidence in geographical areas without a cancer registry, HC data must be calibrated on gold standard (i.e. coming from registries). As these predictions are necessarily subject to errors, it is crucial to provide correct prediction intervals. This is a critical issue when using HC data in epidemiology: the question of “veracity” (Ehrenstein and others, 2017). Indeed, irrespective of the advantages of HC data sources to build epidemiological indicators, the question remains about how valid the indicators are and whether we are able to provide confidence intervals (CIs) which correctly reflect their accuracy. This article proposes a method to address this issue in the case where such a gold standard is available only in certain geographical areas. For this purpose, we propose using a calibration approach. The objectives of this methodological article are to: (i) present a calibration approach developed specifically to predict cancer incidence at the district level; (ii) formally derive the statistical properties of the predicted incidence estimators and build corresponding prediction intervals; and (iii) evaluate the performance of the method in terms of bias, standard errors estimation and coverage probability (CP), through realistic simulations. The article is organized as follows: Section 2 presents the calibration approach, the incidence estimators considered and the derivation of their variance. Section 3 presents a simulation study, which evaluates the methodology’s application in various scenarios. Section 4 illustrates the use of the method with real cancer data from local level French cancer registries and hospital discharge data. The final section includes a general discussion. 2. Calibration model, prediction, and variance derivation The approach proposes developing a predictor of local cancer incidence by using a proxy measure. As the latter stems from incidence (e.g. hospitalization is a consequence of having cancer), we use a calibration approach (Osborne, 1991; Carroll and others, 2006) in which the proxy measure is modeled as a function of incidence, with data from districts where both measures are available. This model is then inversed to predict the local incidence from the proxy measure. To facilitate understanding of the approach, we will describe the calibration approach in the specific case where the gold standard incidence value comes from a cancer registry and the proxy measure is the number of newly hospitalized patients. However, the method is suitable for any context where a gold standard is available for only a part of the territory and where the proxy measure is available for the whole territory. 2.1. Notations In a given district $$j (\,j=1\ldots J)$$, $${\bf{C_j}}=(C_{1j}\ldots C_{Ij})^\top$$ denotes the random vector of the number of incident cancer cases by age group $$i (i=1\ldots I)$$. $${\bf{C_j}}$$ is assumed to follow a Poisson distribution of expectation $${\bf{\bf{\gamma}_j}}=(\gamma_{1j}\ldots \gamma_{Ij})^\top$$. $${\bf{H_j}}=(H_{1j}\ldots H_{Ij})^\top$$ denotes the random vector of newly hospitalized cancer patients. $${\bf{H_j}}$$ is related to $${\bf{C_j}}$$ by the calibration model (2.1) defined in Section 2.2 of this text. $${\tau_j}={\mathbf{w_j}}^\top{\bf{\bf{\gamma}_j}}= \sum_i w_{ij}\gamma_{ij}$$ denotes a weighted sum of incident cases expectations. According to the weights used, $${\tau_j}$$ corresponds either to the total number of cases expectation ($$w_j={\mathbf{1}}$$) or to the age-standardized incidence rate (ASR) expectation ($$w_{ij}=w^r_i/PY_{ij}$$), where $$w^r_i$$ is the weight of age class $$i$$ in the reference population and $${\rm PY}_{ij}$$ the vector of the number of person-years (PY) in districts $$j$$ for age class $$i$$. Finally, for a matrix $${\mathbf{X}}=(x_{ij})$$, $$\exp({\mathbf{X}})$$ refers here to the matrix of exponentiated values $$[\exp({\mathbf{X}})]_{ij}=\exp(x_{ij})$$ and for a vector $${\mathbf{y}}$$, $${}^\mathrm{D}{\mathbf{y}}$$ refers to the square diagonal matrix with the elements of $${\mathbf{y}}$$ on the main diagonal. Symbol $${\mathbf{A}}\circ{\mathbf{B}}$$ denotes the element-wise product of matrices $${\mathbf{A}}$$ and $${\mathbf{B}}$$ (or Hadamard product—see Section 1.2 of the supplementary material available at Biostatistics online for calculation properties used in the article); $$E_Y$$ and $$V_Y$$ denote the expectation and the variance–covariance matrix over the probability distribution of the random variable $$Y$$. 2.2. Calibration model We assume that the number of hospitalized patients is related to the gold standard incidence using a Poisson log-normal mixed model that accounts for the hierarchical structure of the data (McCulloch and others, 2008). More specifically, the calibration model assumes that, given the number of incident cases and a district random effect $$b_j$$, the number of newly hospitalized patients $$H_{ij}$$ of age class $$i$$ (central age $$a_i$$) in a district $$j$$ follows a Poisson distribution with mean $$\eta_{ij}$$, i.e. ($$H_{ij}| b_j,C_{ij}=c_{ij}\sim\mathcal{P}(\eta_{ij})$$) where:   \begin{equation} \left\{ \begin{array}[H]{l} \eta_{ij}=E(H_{ij} |b_j,C_{ij} =c_{ij} )=c_{ij}\exp(f(a_i)+b_j)\\[3pt] b_j\sim\mathcal{N}(0,\sigma_d^2) \end{array} \right. \end{equation} (2.1)$$f$$ is modeled with a restricted cubic spline with one knot at the median age (median over all cases) and boundary knots placed at 5 years above and below the minimum and maximum age respectively (over all cases). The marginal expectation of (2.1) over $$b_j$$ is (Skrondal and Rabe-Hesketh, 2009):   \[ E(H_{ij} |C_{ij}=c_{ij})=E_{b_j}(\eta_{ij} )=c_{ij}\exp(f(a_i)+ \sigma_d^2/2). \] Thus, $$h(a)=\exp(\,f(a)+ \sigma_d^2/2)$$ gives the mean $$H/C$$ ratio of age $$a$$ over all districts. In matrix notation, calibration model (2.1) may be written:   \[ \left\{ \begin{array}[H]{l} ({\bf{H_j}} |b_j,{\bf{C_j}}={\mathbf{c_j}})\sim\mathcal{P}({\bf{\boldsymbol{\eta}_j}})\\[3pt] \log({\bf{\boldsymbol{\eta}_j}})=\log({\mathbf{c_j}})+{\mathbf{{X\boldsymbol{\beta}}}}+b_j\\[3pt] b_j\sim\mathcal{N}(0,\sigma_d^2), \end{array} \right. \] where $${\mathbf{X}}$$ is the matrix of the restricted cubic spline basis for $$f(a)$$ and $$\boldsymbol{\beta}$$ the corresponding vector of coefficients. Using data from registry areas, where both $${\bf{H_j}}$$ and $${\bf{C_j}}$$ are known, this calibration model provides an estimate $$\hat{\boldsymbol{\beta}}$$ of $$\boldsymbol{\beta}$$, an estimate $$\bf{\hat{\Sigma}}_{\hat{\boldsymbol{\beta}}}$$ of the variance–covariance matrix $$\bf{\Sigma}_{\hat{\boldsymbol{\beta}}}$$ of $$\hat{\boldsymbol{\beta}}$$, with $$\hat{\boldsymbol{\beta}}\sim\mathcal{N}(\boldsymbol{\beta},\bf{\Sigma}_{\hat{\boldsymbol{\beta}}})$$, and an estimate $$\hat{\sigma}_d$$ of $$\sigma_d$$. 2.3. Prediction of incidence Now let us turn to a district where the incidence is unknown. Several incidence predictions may be considered. We are primarily interested in the expectation $${\bf{\bf{\gamma}_j}}$$ of $${\bf{C_j}}$$, in particular because predictions are used to describe incidence variations across districts. An estimator $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ for $${\bf{\bf{\gamma}_j}}$$ is thus required. Synthetic measures of incidence over all ages, such as the overall number of cases or the ASR ($$\hat{\tau}_j$$) derived from $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, are also useful and are usually the main incidence indicators of interest. 2.3.1. Prediction of the expectation of the number of incident cases by age: vector $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ If the theoretical parameters $$\boldsymbol{\beta}$$ and $$\sigma_d$$ were known, then the average $$H/C$$ ratio (noted $$h(a)$$) would be known and the random vector $$\{H_{aj}/h(a)=H_{aj}\exp(-f(a)-\sigma_d^2/2)\}$$ (or $${\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2)$$ in matrix notation) would be an unbiased estimator of $${\bf{\gamma}_j} (E[{\bf{H_j}}\, \circ\, \exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2)] = E_{{\bf{C_j}}}[E( {\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\beta}}}}-\sigma_d^2/2) | {\bf{C_j}})]=E_{{\bf{C_j}}}[{\bf{C_j}}]={\bf{\gamma}_j}$$). Thus, a natural estimator for $${\bf{\bf{\gamma}_j}}$$ is the corresponding plug-in estimator using $$\hat{\boldsymbol{\beta}}$$ and $$\hat{\sigma}_d^2$$ as estimated in the calibration stage:   \[ {\mathbf{{\bf{\boldsymbol{\hat{\gamma}_j}}}}}={\bf{H_j}}\circ\exp(-{\mathbf{{X\boldsymbol{\hat{\beta}}}}}-\hat{\sigma}_d^2/2)={\bf{H_j}}\circ\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}}), \] with $${\mathbf{Z}}=({\mathbf{{X,\frac{1}{2}}}})$$ and $${\mathbf{{\boldsymbol{\hat{\alpha}}}}}=({\mathbf{{\boldsymbol{\hat{\beta}}^\top}}},\hat{\sigma}_d^2)^\top$$, with expectation $$\bf{\alpha}=E({\mathbf{{\boldsymbol{\hat{\alpha}}})}}=({\boldsymbol{\beta^\top}},\sigma_d^2)^\top$$. Neglecting the uncertainty in the estimation of $$\sigma_d$$, $${\bf{Z}\boldsymbol{\hat{\alpha}}}$$ may be assumed to be normally distributed with mean $${\bf{Z\boldsymbol{\alpha}}}$$ and variance–covariance matrix . Then $$\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}})$$ follows a multivariate log-normal distribution with mean $$E(\exp(-{\bf{Z}\boldsymbol{\hat{\alpha}}}))=\exp(-{\bf{Z\boldsymbol{\alpha}}}+\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}})$$, where $${\bf{D_{\boldsymbol{\Sigma}}}}$$ is the vector of the diagonal elements of $$\bf{\Sigma}$$, and admits a closed form for the variance–covariance matrix (see Section 3 of the supplementary material available at Biostatistics online). 2.3.2. Prediction of the expectation of standardized incidence rates and overall number of cases: $$\hat{\tau}_j$$ Predicting incidence over all ages is then straightforward. The expectations of the overall number of cases and standardized incidence rates are both estimated as a weighted sum of age-specific predictions $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$: $$\hat{\tau}_j={\mathbf{w_j}}{\bf{\boldsymbol{\hat{\gamma}_j}}}$$, with $${\mathbf{w_j}}$$ as defined in the notation paragraph. 2.4. Variance–covariance matrix of the predictions 2.4.1. Variance–covariance matrix of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ To calculate the variance of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, we first note that:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}})= E_{{\bf{C_j}}}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})]+V_{{\bf{C_j}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})]. \] The first term in the above equation can be derived using the following conditional variance decomposition:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})=V_{{\boldsymbol{\hat{\alpha}}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})]+E_{{\boldsymbol{\hat{\alpha}}}}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})], \] The second term of the latter can be further decomposed by noting that:   \[ V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})=V_{b_j}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]+E_{b_j}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)] \] Regrouping the preceding equations gives the final expression for $$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$:   \[ \begin{split} V({\bf{\boldsymbol{\hat{\gamma}_j}}})=&E_{\bf{C_j}}[V_{{\boldsymbol{\hat{\alpha}}}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}})]]+E_{\bf{C_j}}[E_{{\boldsymbol{\hat{\alpha}}}}[V_{b_j}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]]]\\ &+E_{\bf{C_j}}[E_{{\boldsymbol{\hat{\alpha}}}}[E_{b_j}[V({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\boldsymbol{\hat{\alpha}}},{\bf{C_j}},b_j)]]]+V_{\bf{C_j}}[E({\bf{\boldsymbol{\hat{\gamma}_j}}}|{\bf{C_j}})] \end{split} \] The first term comes from the variability of the estimation of $${\boldsymbol{\hat{\alpha}}}$$ in the calibration model (2.1). The second term comes from the between-district variability of the $$H/C$$ ratio $$\sigma_d^2$$ and indicates that the wider this variability, the larger is the prediction error. Finally, the third and fourth terms are due to the Poisson variability of $${\bf{H_j}}$$ and $${\bf{C_j}}$$ around their mean and are referred to hereafter as the random Poisson part of the overall variance. Using the proprieties of the multivariate log-normal distribution, we see that $$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is finally expressed in a closed form (see Section 1.3 of the supplementary material available at Biostatistics online):   \begin{equation} V({\bf{\boldsymbol{\hat{\gamma}_j}}}) = (e^{\boldsymbol{\Sigma}+\sigma_d^2}-1)\circ\big[{}^\mathrm{D}{\bf{\bf{\gamma}_j}}\circ e^{{\bf{D_{\boldsymbol{\Sigma}}}}}+({\bf{\bf{\gamma}_j}}\circ e^{\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}}})({\bf{\bf{\gamma}_j}}\circ e^{\frac{1}{2}{\bf{D_{\boldsymbol{\Sigma}}}}})^\top\big]+{}^\mathrm{D}{\bf{\bf{\gamma}_j}}\circ(e^{-{\bf{Z\boldsymbol{\alpha}}}+2{\bf{D_{\boldsymbol{\Sigma}}}}}+1) \end{equation} (2.2)$$V({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is estimated with the plug-in estimator. 2.4.2. Variance of $$\hat{\tau}_j={\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ The variance of any weighted sum of $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$, $$\hat{\tau}_j={\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ is straightforward: $$ V(\hat{\tau}_j)=V({\mathbf{w_j}}^\top{\bf{\boldsymbol{\hat{\gamma}_j}}})={\mathbf{w_j}}^\top V({\bf{\boldsymbol{\hat{\gamma}_j}}}){\mathbf{w_j}}$$. 2.5. Distribution and predictions intervals for $$\hat{\tau}_j={\mathbf{w_j}}^T{\bf{\boldsymbol{\hat{\gamma}_j}}}$$ With respect to the distribution of $$\hat{\tau}_j$$, let us first remark that $$\log({\bf{\boldsymbol{\hat{\gamma}_j}}})$$ is distributed as the sum of a log-Poisson and of a normal variable ($$\log({\bf{\boldsymbol{\hat{\gamma}_j}}})=\log({\bf{h_j}})-{\bf{Z}\boldsymbol{\hat{\alpha}}}$$). As the approximation of a log-Poisson by a normal distribution is generally accurate, we assume that $${\bf{\boldsymbol{\hat{\gamma}_j}}}$$ follows a log-normal distribution. Thus $$\hat{\tau}_j$$, as a weighted sum of log-normal distributions, may also be approximated by a log-normal distribution (Fenton, 1960). The prediction interval of $$\hat{\tau}_j$$ at the $$\alpha$$ level is then easily written, using the coefficient of variation (CV) of $$\hat{\tau}_j$$, $$\hat{cv}_j=\frac{\sqrt{\hat{V}(\hat{\tau}_j)}}{\hat{\tau}_j}$$ (see Section 2 of the supplementary material available at Biostatistics online):   \[ PI(\hat{\tau}_j)= \left[\hat{\tau}_j\sqrt{\hat{cv}_j^2+1} e^{-z_{\frac{\alpha}{2}}\sqrt{\log(\hat{cv}_j^2+1)}}; \hat{\tau}_j\sqrt{\hat{cv}_j^2+1} e^{z_{\frac{\alpha}{2}}\sqrt{\log(\hat{cv}_j^2+1)}}\right] \] with $$z_{\frac{\alpha}{2}}$$ the quantile at the $$\frac{\alpha}{2}$$ level of a standard normal distribution. 2.6. Implementation In the simulation study and real data analyzes below, the calibration model (2.1) was fitted using the glmer function from lme4 package in R (Bates and others, 2015). An adaptive Gauss–Hermite quadrature with 20 points was used to obtain the marginal likelihood. The variance and prediction intervals were also calculated using R software. 3. Simulation study In this section, we consider realistic simulation scenarios to evaluate the performance of the proposed approach. More specifically, we check whether our approach allows: (i) an unbiased prediction of incidence; (ii) a correct CP of the prediction intervals; and (iii) a correct estimation of the key parameter $$\sigma_d^2$$ [between-district variance of the $$H/C$$ ratio in the calibration model, (2.1)]. In addition, we assess the robustness of the approach in case of misspecification in the calibration model (2.1) of the distribution of the district random effects. Here, we will focus on the prediction of the overall number of cases expectation (i.e. $${\tau_j}={\mathbf{w_j}}^\top{\bf{\bf{\gamma}_j}}$$ with $${\mathbf{w_j}}={\mathbf{1}}$$). We expect that predicting standardized incidence rates will give similar results in terms of relative bias and coverage properties, because both indicators are basically weighted sums of predicted numbers of cases by age. Several factors may affect the performance of our approach. Accordingly, different scenarios were considered, especially regarding $$\sigma_d$$ whose values directly affect the prediction intervals. The scenarios were designed to mimic various situations observed in practice, and we relied on real data and results applying the model to 24 cancer sites for the 2007–2009 period (data not shown) to choose or estimate the parameters used to generate the data as detailed hereafter. 3.1. Data generation and simulation design With respect to the design, the study period is 2007–2009, and we assume that the area with registries includes 14 districts (which reflects reality in France for this period). Table S1 of the supplementary material available at Biostatistics online details the input parameters and the data sources used to estimate them. Nine scenarios were considered combining three levels of incidence and three levels for $$\sigma_d$$. The incidence levels considered were 3.5, 14, and 70 cases per 100 000 PY; that is, approximately 500, 2000, and 10 000 cases in the registry area (i.e. the total area covered by different district-level registries) available for the calibration model (2.1), which corresponds to the 25th, 50th, and 75th percentile of the distribution of the number of cases per cancer site in the French registry area. The values considered for $$\sigma_d$$ were 0.05, 0.1, and 0.3, which correspond to the range observed in our application to 24 cancer sites. The numbers of PY were directly taken from the observed male PY in the 14 registries (range 180 000–600 000). The H/C ratio, i.e. $$f(a)$$, comes from calibration model (2.1) applied to real data on colorectal cancer in males. With respect to the prediction stage, incidence is predicted in three prediction districts, corresponding to three populations sizes (5th, 50th, and 95th percentiles of the distribution of the number of PY for men over all French districts during the study period—see Table S1 of the supplementary material available at Biostatistics online). This led to 27 possibilities when combined with the 9 scenarios described above. For a given scenario, each simulation which we ran consisted in: Data generation in the districts of the registry area and in the three prediction districts. For each district $$j$$, the number of cases $${\bf{c_j}}$$ was drawn from a Poisson distribution with incidence rate $$g(a)\phi$$ (see Table S1 of the supplementary material available at Biostatistics online), the district random effect $$b_j$$ drawn from a normal distribution $$\mathcal{N}(0,\sigma_d^2)$$, and the number of hospitalized patients $${\bf{h_j}}$$ drawn from a Poisson distribution conditionally on $${\bf{C_j}}$$ and $$b_j$$, using $$f(a)$$; Analysis of the $$H/C$$ ratio in the registry area with the calibration model (2.1) in order to obtain the calibration parameters estimates; Incidence prediction and calculation of the prediction intervals in the three prediction districts, using number of hospitalized cases generated in Step 1 and calibration parameters’ estimated in Step 2. In each scenario, we simulated 5000 data sets. This number allowed us to estimate the parameters of interest (i.e. the number of predicted cases and their variances, and the variance of the random effect $$\sigma_d^2$$) with a precision of at least 5%. In addition, we performed a sensitivity analysis on the impact of the misspecification of distribution of the random effects $$b_j$$, which is assumed Gaussian in the calibration model (2.1). To this end, two additional scenarios were considered, in which $$b_j$$ were drawn from a non-Gaussian distribution. In the first scenario, $$b_j$$ was drawn from a centered uniform distribution with a standard deviation (SD) of 0.1. In the second scenario, $$b_j$$ was drawn from a mixture of two Gaussian distributions with density $$\pi\mathcal{N}(\mu,\sigma^2/100)+(1-\pi)\mathcal{N}(\mu_2,\sigma_2^2)$$, to mimic a case where a portion $$\pi$$ of the districts are clustered away from the mean. $$\mu$$ was set to 0.1, $$\sigma$$ to 0.1, $$\pi$$ to 0.3, and $$\mu_2$$ ($$=-0.04$$) and $$\sigma_2 (=0.09)$$ were determined to ensure a zero mean and a 0.1 SD of the distribution. Finally, additional simulations were run in order to evaluate the influence of the size (i.e. number of districts) of the calibration set on the performance of the method. We used the same simulation design and scenarios described above, but allowed the number of districts in the calibration set to vary between 5 and 100. Results from these additional simulations are presented in Section 6 of the supplementary material available at Biostatistics online. 3.2. Performance indicators The prediction of the overall number of incident cases expectation $$\hat{\tau}_j$$ was assessed in terms of: (i) relative bias (i.e. difference between the empirical mean of the 5000 $$\hat{\tau}_j$$ estimates and the true value $${\tau_j}$$ divided by $${\tau_j}$$); (ii) empirical CP (i.e. the proportion of samples in which the 95% prediction interval contains $${\tau_j}$$); and (iii) relative bias in the estimation of the standard error (i.e. the relative difference between the mean of the 5000 estimated standard errors of $$\hat{\tau}_j$$ and the empirical SD of the 5000 $$\hat{\tau}_j$$ estimates). Finally, the estimation of the between-district variance in the $$H/C$$ ratio ($$\hat{\sigma}_d^2$$) was assessed in terms of mean, median, and relative bias. 3.3. Results of the simulation study Table 1 shows all performance indicators concerning $${\tau_j}$$. Overall, they were rather good. Relative biases were very small ($$<$$1.5%) and the empirical CP ranged between 91% and 96%. The smaller the number of incident cases and the smaller the between-district variability of the $$H/C$$ ratio ($$\sigma_d$$), the closer was the CP to their nominal level. For a given overall number of cases $${\tau_j}$$ (i.e. at fixed incidence level and district size), the CV logically increased with $$\sigma_d$$. This inflation of the CV with $$\sigma_d$$ increased with the number of cases. For example, at the smallest $${\tau_j}$$ value ($${\tau_j}=10$$), the CV increased only slightly from 0.47 to 0.56 whereas, at its highest value ($${\tau_j}={1233}$$), a greater than 4-fold increase in the CV was observed, from 0.07 to 0.32. This shows that when the number of cases to predict $${\tau_j}$$ is very small, the random Poisson part is large and dominates the overall variance. In this case, the overall variance is not very sensitive to $$\sigma_d$$. On the contrary, when the number of cases to predict $${\tau_j}$$ is large, the random Poisson part of the overall variance shrinks and $$\sigma_d$$ contributes mainly to the overall variance as $$\sigma_d$$ increases. With respect to the estimation of the standard error, in almost all scenarios, a negative bias (up to $$-$$6%) was observed, which means that the SD is on average slightly underestimated. Table 1. Expected number of cases ($${\tau_j}$$), relative bias ($$\% \text{bias}$$), empirical SD, $${\rm CV}={\rm SD}/{\tau_j}$$, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\% \text{bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different variability $$\sigma_d$$ of the $$H/C$$ ratio $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  Table 1. Expected number of cases ($${\tau_j}$$), relative bias ($$\% \text{bias}$$), empirical SD, $${\rm CV}={\rm SD}/{\tau_j}$$, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\% \text{bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different variability $$\sigma_d$$ of the $$H/C$$ ratio $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  $$\sigma_d$$  District size  Incidence level  $$\tau_j$$  Estimator $$\hat{\tau}_j$$  % bias  SD  CV  CP (%)  % bias $$\hat{\text{SE}}$$  0.05  Small  Low  10  $$-$$0.1  4.6  0.47  94.9  $$-$$3.6        Medium  40  $$-$$0.2  9.2  0.23  95.3  -0.2        High  197  $$-$$0.1  22.3  0.11  94.7  $$-$$0.2     Medium  Low  26  0.9  7.6  0.29  95.2  $$-$$0.2        Medium  106  0.3  15.9  0.15  94.8  $$-$$1.2        High  525  0.3  42.8  0.08  93.8  $$-$$2.8     Large  Low  62  0.4  12.1  0.20  95.6  -1.6        Medium  248  0.0  26.3  0.11  94.1  $$-$$1.0        High  1233  $$-$$0.1  81.4  0.07  92.2  $$-$$4.2  0.1  Small  Low  10  1.2  4.7  0.48  94.3  $$-$$4.0        Medium  40  1.2  10.1  0.26  94.3  $$-$$2.9        High  197  0.6  28.3  0.14  94.4  $$-$$1.2     Medium  Low  26  0.0  7.9  0.30  95.1  $$-$$0.6        Medium  106  0.1  18.2  0.17  94.3  $$-$$1.5        High  525  0.0  64.5  0.12  92.5  $$-$$5.6     Large  Low  62  1.1  13.4  0.22  94.4  $$-$$1.5        Medium  248  0.3  35.0  0.14  92.4  $$-$$5.2        High  1233  0.2  137.3  0.11  91.4  $$-$$4.7  0.3  Small  Low  10  0.4  5.5  0.56  95.4  $$-$$2.2        Medium  40  1.0  15.7  0.40  94.0  $$-$$3.4        High  197  1.0  67.2  0.34  92.4  $$-$$5.5     Medium  Low  26  0.9  11.2  0.43  94.5  $$-$$1.8        Medium  106  0.3  36.6  0.35  93.2  $$-$$3.6        High  525  0.2  168.7  0.32  92.5  $$-$$3.6     Large  Low  62  1.5  22.9  0.37  92.8  $$-$$2.8        Medium  248  0.7  81.6  0.33  92.4  $$-$$3.1        High  1233  0.8  392.0  0.32  92.4  $$-$$3.0  3.3.1. Estimation of district variability of the $$H/C$$ ratio ($$\hat{\sigma}_d$$) Table 2 shows the estimations of the SD of the district random effect in the calibration model ($$\hat{\sigma}_d$$). In all scenarios, $$\sigma_d$$ was underestimated (up to 27%), the bias decreasing with increasing $$\sigma_d$$ values and incidence levels. Table 2. Mean, median, and relative bias of the estimator of $$\sigma_d$$ (the standard error of the $$H/C$$ ratio), by incidence level for different true value of $$\sigma_d$$    $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0     $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0  Table 2. Mean, median, and relative bias of the estimator of $$\sigma_d$$ (the standard error of the $$H/C$$ ratio), by incidence level for different true value of $$\sigma_d$$    $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0     $$\sigma_d$$ = 0.05  $$\sigma_d$$ = 0.1  $$\sigma_d$$ = 0.3  Incidence level  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Mean ($$\times$$100)  Med. ($$\times$$100)  Bias (%)  Low  3.9  0.0  $$-$$21.8  7.2  7.4  $$-$$27.8  27.6  27.6  $$-$$7.9  Medium  3.6  3.7  $$-$$27.0  8.7  9.0  $$-$$12.9  28.3  28.1  $$-$$5.5  High  4.4  4.5  $$-$$11.3  9.3  9.3  $$-$$6.8  28.5  28.3  $$-$$5.0  3.3.2. Sensitivity analysis As shown in Table 3, the misspecification of the distribution of the random effect in the calibration model did not alter the good performance of the approach. Whether this distribution was correctly specified or not had only a minor impact on the performance indicators presented (relative bias, mean estimated standard errors, or CPs of the prediction intervals). Table 3. Relative bias ($$\% \text{bias}$$), SD, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\%\text{ bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different generating distributions of district random effect~: uniform ($$\mathcal{U}$$), bimodal ($$\mathcal{B}$$), and Gaussian ($$\mathcal{G}$$)       % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7        % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7  Table 3. Relative bias ($$\% \text{bias}$$), SD, and empirical CP of predicted expected number of cases $$\hat{\tau}_j$$, and relative bias of standard error estimates ($$\%\text{ bias } \hat{\text{SE}}$$), for expected number of cases predicted by district size and incidence levels for different generating distributions of district random effect~: uniform ($$\mathcal{U}$$), bimodal ($$\mathcal{B}$$), and Gaussian ($$\mathcal{G}$$)       % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7        % bias  SD  CP (%)  % bias $$\hat{\text{SE}}$$  District size  Incidence level  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  $$\mathcal{U}$$  $$\mathcal{B}$$  $$\mathcal{G}$$  Small  Low  1.2  0.8  1.2  4.7  4.7  4.7  94.6  94.5  94.3  $$-$$2.5  $$-$$3.3  $$-$$4.0     Medium  0.4  $$-$$0.9  1.2  9.8  9.8  10.1  94.8  95.1  94.3  0.0  $$-$$1.4  $$-$$2.9     High  0.0  $$-$$0.1  0.6  28.9  28.0  28.3  94.2  95.2  94.4  $$-$$3.4  $$-$$0.1  $$-$$1.2  Medium  Low  1.3  0.9  0.0  8.0  8.0  7.9  94.5  94.6  95.1  $$-$$1.2  $$-$$1.0  $$-$$0.6     Medium  0.5  0.0  0.1  18.4  18.6  18.2  93.9  93.8  94.3  $$-$$2.2  $$-$$4.2  $$-$$1.5     High  0.3  0.1  0.0  64.9  62.7  64.5  93.5  94.1  92.5  $$-$$5.6  $$-$$2.1  $$-$$5.6  Large  Low  0.5  0.7  1.1  13.3  13.3  13.4  94.0  94.5  94.4  $$-$$1.8  $$-$$1.6  $$-$$1.5     Medium  0.1  0.3  0.3  34.7  34.7  35.0  92.9  92.4  92.4  $$-$$4.4  $$-$$5.3  $$-$$5.2     High  0.1  $$-$$0.1  0.2  138.4  134.8  137.3  94.0  93.8  91.4  $$-$$5.1  $$-$$2.5  $$-$$4.7  4. Application to real data To illustrate our approach, we estimated the incidence of two cancers [lip-oral cavity-pharynx cancer in men (LOPm) and non-Hodgkin lymphoma in women (NHLw)] in all French districts over the 2007–2011 period. Those sites were chosen to illustrate situations with, respectively, a low and medium district-variability of the $$H/I$$ ratio. 4.1. Data sources and methods Patients hospitalized for LOPm or NHLw during the study period were identified using the French national hospital discharge database [Programme de Médicalisation des Systémes d’Information (PMSI)], which collects data on patient stays in all public and private HC institutions. This database includes information on the principal diagnosis (according to the International Classification of Diseases ICD-10) and certain personal characteristics (sex, age, code of the residence area, and an anonymous patient identifier). The number of newly hospitalized patients, i.e. not hospitalized for the same cancer during the two preceding years, was counted per district and per 5-year age group over the whole period. Gold standard incidence data were provided by the French network of cancer registries (FRANCIM). This network ensures an exhaustive collection of incident cancer cases in several French districts. Data over the study period were available from 13 registries for LOPm and 15 registries for NHLw, which corresponds to 17% and 18% coverage of the French population, respectively. Registry data are coded according to the third version of the International Classification of Diseases for Oncology (ICD-O-3). Incident cases of cancer were aggregated per district and for the same age groups as above. We used the calibration approach detailed above to predict the total number of incident cases over all 96 districts over the 2007–2011 period. The results are presented as ASR using the age structure of the world population as a reference and standardized incidence ratios (SIR) using an internal standardization. The SIR prediction intervals were directly derived from those of the total number of cases. The SIR were charted on a map where districts with non-significant SIR (i.e. with prediction intervals containing 1) were hatched. The SIR were also plotted against their precision (inverse of its CV) in a funnel plot. Funnel plots include 95% confidence bands (respectively 99.8%), which corresponds to the expected 95% CI of a SIR given the precision (respectively 99.8% CI), under the null hypothesis $${\rm SIR}=1$$. These funnel plots allow a quick examination whether the observed SIR are outside the expected range under the null hypothesis, given their precision. A graph plotting the precision of the SIR against the square root of the number of predicted cases is also provided. 4.2. Results During the 2007–2011 period, 8086 incident LOPm and 5111 incident NHLw were recorded by cancer registries, and 8918 and 4362, respectively, new hospitalizations were identified in the PMSI. The between-district SD $$\sigma_d$$ of the $$H/C$$ ratio as estimated from the registry area was 0 for LOPm and 0.09 for NHLw. The predicted national ASR were 20.5 and 8.4 per 100 000 $$PY$$ for LOPm and NHLw, respectively (95% PIs: [20.0; 21.1] and [7.8; 9.0]) with wide between-district variability (see Table S2 of the supplementary material available at Biostatistics online). The value of $$\sigma_d$$ had a great impact on the precision of the estimated SIR. In NHLw, the variance of the SIR (see 2.2) was dominated by the $$\sigma_d$$ component and the SIR precisions were clustered close to 1/$$\sigma_d$$ (Figure 1A), the closest corresponding to the districts with the highest number of predicted cases (Figure 1B). On the contrary, in LOPm, the variance of the number of predicted cases was dominated by its random Poisson part. The SIR precisions then behaved like $$\sqrt{\hat{\tau}_j}$$ and was more scattered (Figure 1A and B). Fig. 1 View largeDownload slide Precision of the SIR of cancer incidence predictions. Panel A: funnel plot of precision vs. SIR with the 95 and 99.8% confidence limits. Panel B: precision of the SIR vs. the square root of the number of predicted cases. CV: coefficient of variation; LOPm: lip, oral cavity, pharynx cancer in men; NHLw: non-Hodgkin lymphoma in women. Fig. 1 View largeDownload slide Precision of the SIR of cancer incidence predictions. Panel A: funnel plot of precision vs. SIR with the 95 and 99.8% confidence limits. Panel B: precision of the SIR vs. the square root of the number of predicted cases. CV: coefficient of variation; LOPm: lip, oral cavity, pharynx cancer in men; NHLw: non-Hodgkin lymphoma in women. The predictions of the incidence of LOPm at the district level were of much more practical use than those of NHLw (Figure 2). In the first case, the geographic variation of incidence outweighed the imprecision of the estimates: districts with above-average incidence were mainly located in the north and north-west of France while those with lower incidence were located in the south-west. This geographical contrast in incidence is probably linked to large and well-documented geographical difference in tobacco and alcohol consumption in France (Ligier and others, 2011). In the second case (NHLw), the extent of the geographical variation in SIR was the same as the prediction intervals and the predictions were not sufficiently precise to provide a valuable description of incidence disparities. Fig. 2 View largeDownload slide France maps of cancer incidence predictions SIR for the 2007–2011 period. The hatched districts are those with a 95% prediction interval that contains 1. Fig. 2 View largeDownload slide France maps of cancer incidence predictions SIR for the 2007–2011 period. The hatched districts are those with a 95% prediction interval that contains 1. Accordingly, incidence predictions with poor precision may be of little practical interest. In particular, when the $$H/C$$ ratio varies widely between districts (high $$\sigma_d$$), $$H$$ is a “poor” proxy measure for $$C$$ and, logically, the predictions obtained using $$H$$ and the average $$H/C$$ ratio will also be poor, as observed for NHLw. This variability in the $$H/C$$ ratio probably reflects differences in coding practices between hospitals and/or geographical differences in the medical care pathway of patients (Grosclaude and others, 2012). 5. Discussion 5.1. Calibration model The present article proposes a solution to a univariate calibration problem for count data. It enables local incidence to be predicted using a proxy measure whenever a gold standard exists. The approach falls within the context of the classic univariate calibration (Osborne, 1991): in districts where both measures are available (calibration experiment), the mean ratio between the proxy and the gold standard measures (calibration curve) is estimated by age with a Poisson regression model, district-specific deviations from the mean being modeled with random effects. This mean ratio is then inverted to predict incidence from the proxy measure (prediction stage) and the variance of the random effects used to derive appropriate prediction intervals. The present approach has specific characteristics different from the traditional univariate statistical calibration. First, as the proxy measures are count data, the calibration model was estimated with a generalized linear mixed model (GLM(M)), despite the fact that the vast majority of literature on statistical calibration concentrates on the normal linear or non-linear models. The advantage of a GLM(M) Poisson model for calibration is that the classic estimator obtained by inverting the calibration curve has finite moments and that its distribution can be approximated by a log-normal distribution. We found very few publications dealing with calibration involving count data. For example, to calculate the CIs in an inverse prediction problem where the proxy measure follows a Poisson distribution, Vidoni (2003) proposed inverting the prediction intervals from the calibration model. Another example was found in the field of biological dosimetry: the dose of irradiation received by a patient was predicted from the number of blood lymphocytes using an inverse regression method where the number of lymphocytes was supposed to follow a Poisson distribution (Higueras and others, 2015). The above-mentioned methods were not directly applicable to our context. Indeed, in addition to the traditional “experimental” error (e.g. variation of the proxy measure around an invariant gold standard), our calibration problem has to deal with two additional sources of variability: the variability of the calibration curve between districts and the Poisson variability of the gold standard around its mean. With respect to the “experimental” error, we assumed that the proxy measure followed a Poisson distribution conditionally on the gold standard (H$$|$$C). We initially opted for a quasi-Poisson approach (Breslow and Clayton, 1993), which allowed for under-/over-dispersion, since the variance of $$H|C$$ would obviously vary with its expectation, and because we expected some under-dispersion (H being a proxy of C). This quasi-likelihood approach may be implemented using the function glmmPQL of the R library MASS (Venables and Ripley, 2002). The dispersion coefficient obtained may be directly incorporated in the overall variance (2.2) and only impacts the component related to the variance of H. In the systematic explorations we conducted (on 20 cancer sites for both sexes, results not shown), this model error was supported by examining residuals, which confirmed that the (quasi) Poisson framework was appropriate. However, we observed only slight under- or over-dispersion (the second being less frequent) with a minor impact in both cases on the overall variance (since the latter was dominated by the other components). We therefore finally opted for a Poisson model, using glmer which ensured exact Poisson likelihood estimation. In practice, checking for dispersion in the residuals is recommended, and if substantial dispersion is observed, using the quasi-Poisson approach is advised. The R programs for both versions (Poisson and quasi-Poisson) are available on the GitHub repository. The between-district variability of $$H/C$$ ratio (i.e. the calibration curve) was taken into account by introducing a district random intercept into the calibration model and assuming it followed a normal distribution. With real data, this distribution might not be normal. However, our simulations confirmed the robustness of the approach in case of misspecification of this distribution (McCulloch and Neuhaus, 2011). We assumed a simple random-effect specification, but we could have considered complex models instead, allowing a district variability in the age effect. However, our purpose was to retrieve a simple measure of the main district differences between the calibration curves and the parsimony of the model is crucial to enhance predictive performance. Furthermore, we are modeling the ratio between a proxy and incidence, which are thought to be “close” indicators; so that in our context, age–district interactions represent tapering effects which are beyond our scope and fall outside the focus of sparsity we strive for this model. It is worth noting however that any factor explaining the $$H/C$$ ratio (at individual or district scale), if available, should be included in the model, or at least tested. It may help to decrease the district variability of the ratio. Finally, our approach, to be accurate, requires that the set of districts available for the calibration phase (i.e. covered by a registry) is “representative,” in some sense, of all districts. More precisely, it assumes that the mean and the between-district variability of the ratio ($$\sigma_d$$) are correctly estimated with the calibration set of districts. Whether these assumptions reasonably hold depends heavily on the calibration-set available and also on the proxy used. In order to support discussion on this issue, it is recommended to describe how the calibration-set is distributed among all districts regarding factors that are potentially related to the proxy-to-incidence ratio. In practice, if there is little knowledge on these factors, one may describe basic population indicators to support the hypothesis that the calibration-set is representative of the overall population (e.g. in terms of deprivation, rural population, medical density, etc.—see Section 5 of the supplementary material available at Biostatistics online). 5.2. Bias and coverage probabilities Overall, the results of the simulations proved the accuracy of the present approach: the predicted incidence values were unbiased and the CPs of the prediction intervals were close to their nominal values. However, both the accuracy of the estimates of the standard errors and the CPs of the prediction intervals tended to decrease slightly with the increase in the number of incident cases or in the variability of the $$H/C$$ ratio. As shown by the simulations in different settings, the under-coverage of the prediction intervals is mainly due to the underestimation of $$\sigma_d$$, the between-district variability of the $$H/C$$ ratio. However, the impact is smaller for smaller $$\sigma_d$$ and, for a given $$\sigma_d$$, for smaller incidence and smaller district size (i.e. for smaller number of cases $${\tau_j}$$). The underlying mechanisms are explained in Section 3.3: the underestimation of $$\sigma_d$$ only impacts the first term in the overall variance of the predicted incidence (see 2.2), and thus its effect on the overall variance depends on the relative contribution of this term. Therefore, the underestimation of $$\sigma_d$$ has a minor impact on the overall variance when $$\sigma_d$$ is small, whereas its impact increases for larger $$\sigma_d$$, and as a consequence the CP slightly decreases. However, for a given $$\sigma_d$$, the impact is reduced for a smaller number of cases, because the random Poisson part contribution to the overall variance then increases and consequently the relative importance of the other terms reduces. The downward bias in the estimates of the variance component in GLMM is well known and diminishes with increasing observation per cluster and with a larger number of cluster (Austin, 2010). The influence of the number, K, of districts in the calibration set was confirmed by additional simulation results given in the supplementary material available at Biostatistics online (see Section 6, Figure S2 of the supplementary material available at Biostatistics online): the estimations of $$\sigma_d$$ are biased downward, to a larger extent for small $$\sigma_d$$, and this bias decreases with increasing K. The impact of this bias on the CP of the prediction interval is detailed in the supplementary material available at Biostatistics online for all scenarios (see Section 6, Figures S3 and S4 of the supplementary material available at Biostatistics online). Overall, the CPs of the prediction intervals were satisfactory (above 90%) when $$K\ge$$10 and in our epidemiological context, such prediction intervals are valuable for practical use. As a guide, the reader may refer to Figure S3 of the supplementary material available at Biostatistics online for a detailed description of the performance of the method, which depends on the number of calibration districts, between-district variability of the H/C ratio $$\sigma_d$$, incidence level and size of the prediction district. 5.3. Value of the calibration approach and related epidemiological considerations Although predicting incidence at a local level is an important issue in countries without a national registry, there are as yet few studies on the topic, and we did not find any outside of the cancer field. In Italy, regional incidence estimates were back-calculated for major cancer sites, based on regional mortality projections and regional survival estimates derived from registries, using the relation between incidence, survival, and mortality (Verdecchia and others, 2007). This multi-step methodology requires numerous assumptions however and brings uncertainty by using unnecessary projections. In France, the incidence/mortality ratio was previously used at regional levels with some limitations (Colonna and others, 2008), in particular since the ratio was assumed to be identical across regions. However, the variability of I/M ratio is too high to be used at a district level for almost all cancer sites. Extensions of the incidence/mortality ratio approach have been developed to predict county-level incidence in the USA for the three most frequent cancer sites, adding ecological covariates to improve prediction, such as socio-demographic characteristics and lifestyle habits (Pickle and others, 2003). Another natural direction, rather than attempting to enrich the modeling of incidence/mortality ratio, is to try to find a better incidence proxy. Routinely available HC databases provide appealing and promising incidence proxies. Some authors proposed to correct HC data based on sensitivity and specificity of the HC data to identify incident cases (e.g. Couris and others, 2009). One major limitation of this approach is that it requires individual linkage between registries and the HC data to estimate the required sensitivity and specificity. A few studies were conducted in France to predict cancer incidence at a district level from HC data using the HC/incidence ratio (Remontet and others, 2008; Mitton and others, 2011; Uhry and others, 2013). However, these were essentially papers oriented towards real-world applications without formal statistical derivation and evaluation. To our knowledge, the present article is the first methodological work to address the issue of predicting local incidence from a proxy based on calibration model, with an appropriate formal mathematical derivation together with a simulation-based evaluation. Given the multiplication of data sources, proxy measures will multiply too and be easy to obtain. This fact highlights the need for validated and practical calibration methods before data use for incidence prediction. From a practical point of view, although the proposed method provides unbiased incidence values and prediction intervals with good CPs, some predictions may be insufficiently precise to present any value in public health decision making processes. In the application section of this article, we saw for example that hospitalization data were poor proxies in NHLw, which resulted in a prediction interval of the same order of magnitude as the extent of geographical variation of the incidence. Thus, in practice, if the aim of the analysis is to describe geographical disparities in incidence, one may decide to discard predictions of cancer incidence where prediction errors are large relative to the magnitude of incidence disparities across districts and focus instead on cancer sites for which predictions are considered valuable. 5.4. Perspectives The incidence predictions obtained with our calibration approach may be used for posterior analyzes, since their distribution is known. This is especially true for geographical models including a spatial component which may be used to smooth and stabilize the SIR and reduce the variance. Moreover, future work should focus on predicting incidence through joint modeling of the expectations of the incidence and the proxy measures. Such a framework may enable, in addition, the simultaneous use of several proxies of incidence. 6. Conclusion Predicting disease incidence at sub-national levels is a difficult task, and to date has not been explored in depth. For countries without national registries, we present a convenient method to predict incidence from health insurance data, hospital discharge data, or other source of proxy of incidence data. We hope our approach will encourage the sound use of the usually imperfect information present in these data. Supplementary material supplementary material is available at http://biostatistics.oxfordjournals.org. Acknowledgments The authors thank the FRANCIM network for providing cancer incidence data and C. Bonaldi for his helpful suggestions. Conflict of Interest: None declared. References Austin P. C. ( 2010). Estimating multilevel logistic regression models when the number of clusters is low: a comparison of different statistical software procedures. The International Journal of Biostatistics  6, Article 16. Bates D., Mächler M., Bolker B. and Walker S. ( 2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software  67, 1– 48. Google Scholar CrossRef Search ADS   Breslow N. E. and Clayton D. G. ( 1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association  88, 9. Carré N., Uhry Z., Velten M., Trétarre B., Schvartz C., Molinié F., Maarouf N., Langlois C., Kadi-Hanifi A. M., Grosclaude P. and Colonna M. ( 2006). Predictive value and sensibility of hospital discharge system (PMSI) compared to cancer registries for thyroïd cancer (1999–2000). Revue d’epidemiologie et de sante publique  54, 367– 376. Google Scholar CrossRef Search ADS PubMed  Carroll R. J., Ruppert D., Stefanski L. A. and Crainiceanu C. M. ( 2006). Measurement Error in Nonlinear Models: A Modern Perspective , 2nd edition. Florida, USA: Chapman and Hall/CRC. Google Scholar CrossRef Search ADS   Colonna M., Bossard N., Mitton N., Remontet L., Belot A., Delafosse P. and Grosclaude P. ( 2008). Some interpretation of regional estimates of the incidence of cancer in France over the period 1980–2005. Revue d’épidémiologie et de santé publique  56, 434– 440. Google Scholar CrossRef Search ADS PubMed  Couris C. M., Polazzi S., Olive F., Remontet L., Bossard N., Gomez F., Schott A. M., Mitton N., Colonna M. and Trombert B. ( 2009). Breast cancer incidence using administrative data: correction with sensitivity and specificity. Journal of Clinical Epidemiology  62, 660– 666. Google Scholar CrossRef Search ADS PubMed  Ehrenstein V., Nielsen H., Pedersen A. B., Johnsen S. P. and Pedersen L. ( 2017). Clinical epidemiology in the era of big data: new opportunities, familiar challenges. Clinical Epidemiology  9, 245– 250. Google Scholar CrossRef Search ADS PubMed  Fenton L. ( 1960). The sum of log-normal probability distributions in scatter transmission systems. IEEE Transactions on Communications  8, 57– 67. Google Scholar CrossRef Search ADS   Grosclaude P., Dentan C., Trétarre B., Velten M., Fournier E. and Molinié F. ( 2012). Utilité des bases de données médico-administratives pour le suivi épidémiologique des cancers. Comparaison avec les données des registres au niveau individuel. Bulletin Epidémiologique Hebdomadaire , 5-6, 63– 67. Higueras M., Puig P., Ainsbury E. A. and Rothkamm K. ( 2015). A new inverse regression model applied to radiation biodosimetry. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences  471, 20140588. Google Scholar CrossRef Search ADS   Ligier K., Belot A., Launoy G., Velten M., Bossard N., Iwaz J., Righini C. A., Delafosse P. and Guizard A. V. ( 2011). Descriptive epidemiology of upper aerodigestive tract cancers in France: incidence over 1980–2005 and projection to 2010. Oral Oncology  47, 302– 307. Google Scholar CrossRef Search ADS PubMed  McCulloch C. E. and Neuhaus J. M. ( 2011). Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical Science  26, 388– 402. Google Scholar CrossRef Search ADS   McCulloch C. E., Searle S. R. and Neuhaus J. M. ( 2008). Generalized, Linear, and Mixed Models, 2nd Edition . Hoboken, New Jersey, USA: John Wiley & Sons, Inc. Mitton N., Colonna M., Trombert B., Olive F., Gomez F., Iwaz J., Polazzi S., Schott-Petelaz A. M., Uhry Z., Bossard N. and Remontet L. ( 2011). A suitable approach to estimate cancer incidence in area without cancer registry. Journal of Cancer Epidemiology  2011, 418968. Google Scholar CrossRef Search ADS PubMed  Osborne C. ( 1991). Statistical calibration: a review. International Statistical Review/Revue Internationale de Statistique  59, 309. Pickle L. W., Feuer E. J., Edwards B. K. ( 2003). U.S. Predicted Cancer Incidence, 1999: Complete Maps by County and State From Spatial Projection Models . NCI Cancer Surveillance Monograph Series, number 5. Bethesda, MD: National Cancer Institute. Remontet L., Mitton N., Couris C.M., Iwaz J., Gomez F., Olive F., Polazzi S., Schott A. M., Trombert B., Bossard N. and Colonna M. ( 2008). Is it possible to estimate the incidence of breast cancer from medico-administrative databases? European Journal of Epidemiology  23, 681– 688. Google Scholar CrossRef Search ADS PubMed  Skrondal A. and Rabe-Hesketh S. ( 2009). Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society: Series A (Statistics in Society)  172, 659– 687. Google Scholar CrossRef Search ADS   Uhry Z., Remontet L., Colonna M., Belot A., Grosclaude P., Mitton N., Delacour-Billon S., Gentil J., Boussac-Zarebska M., Bossard N. and others ( 2013). Cancer incidence estimation at a district level with out a national registry: a validation study for 24 cancer sites using French health insurance and registry data. Cancer Epidemiology  37, 99– 114. Google Scholar CrossRef Search ADS PubMed  Venables W. N. and Ripley B. D. ( 2002). Modern Applied Statistics with S , 4th edition. New York: Springer. Google Scholar CrossRef Search ADS   Verdecchia A., De Angelis R., Francisci S. and Grande E. ( 2007). Methodology for estimation of cancer incidence, survival and prevalence in Italian regions. Tumori  93, 337– 44. Google Scholar CrossRef Search ADS PubMed  Vidoni P. ( 2003). Prediction and calibration in generalized linear models. Annals of the Institute of Statistical Mathematics  55, 169– 185. © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BiostatisticsOxford University Press

Published: Mar 29, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off