Valid and efficient subgroup analyses using nested case-control data

Abstract

Background: It is not uncommon for investigators to conduct further analyses of subgroups, using data collected in a nested case-control design. Since the sampling of the participants is related to the outcome of interest, the data at hand are not a representative sample of the population, and subgroup analyses need to be carefully considered for their validity and interpretation.

Methods: We performed simulation studies, generating cohorts within the proportional hazards model framework and with covariate coefficients chosen to mimic realistic data and more extreme situations. From the cohorts we sampled nested case-control data and analysed the effect of a binary exposure on a time-to-event outcome in subgroups defined by a covariate (an independent risk factor, a confounder or an effect modifier) and compared the estimates with the corresponding subcohort estimates. Cohort analyses were performed with Cox regression, and nested case-control samples or restricted subsamples were analysed with both conditional logistic regression and weighted Cox regression.

Results: For all studied scenarios, the subgroup analyses provided unbiased estimates of the exposure coefficients, with conditional logistic regression being less efficient than the weighted Cox regression.

Conclusions: For the study of a subpopulation, analysis of the corresponding subgroup of individuals sampled in a nested case-control design provides an unbiased estimate of the effect of exposure, regardless of whether the variable used to define the subgroup is a confounder, effect modifier or independent risk factor. Weighted Cox regression provides more efficient estimates than conditional logistic regression.

Keywords: conditional logistic regression, risk set sampling, weighted likelihood, weighted Cox regression

Key Messages

Subgroup analyses of nested case-control data provide valid estimates of exposure effects.
The subgroups can be defined by any covariate measured at baseline, whether an independent risk factor, a confounder or an effect modifier.
Weighted Cox regression analysis of the subgroups provides more precise estimates.
This provides reassurance for investigators who conduct such analyses, but with the usual caveat concerning overuse and over-interpretation of subgroup analyses.

Background

The time-matched nested case-control (NCC) design uses incidence density sampling for selecting controls within the cohort: for each case, controls are randomly selected from the individuals still at risk at the event time of the case (i.e. the risk set) [1]. This design combines the benefits of the time aspect of the cohort design and the significant savings in cost and time by sampling only a few non-cases at each event time [2–4]. Cost savings can be substantial when, in addition to the usual costs of enrolling study participants, the covariates of interest include expensive laboratory assays such as whole genome sequencing [5]. For modern biobank-based epidemiological research, time-matching can also be an important advantage when biomarkers are affected by storage time or batch effects [4,6].

Whereas the NCC design will usually have been chosen to answer a specific research question, there are numerous examples in the literature of investigators conducting subsequent analyses of subgroups of interest [7–13]. Our interest in investigating the validity of such subgroup analysis was motivated by our own NCC study of the association between radiation therapy for breast cancer and subsequent risk of lung cancer, an association that was shown to be modified by smoking status [14]. This prompted us to investigate the subgroup of smokers in more detail, but as smoking was not a matching factor, it was unclear whether restricting the analysis to this subgroup of the NCC sample could raise some statistical issues.
In contrast to the cohort design, which readily enables such analyses, subgroup analyses with the NCC design need to be carefully considered for their validity and interpretation. Since in a case-control study the sampling of the participants is related to the outcome of interest, the data at hand are not a representative sample of the source population: the cases are over-represented in the sample, and will have a different distribution for some covariates (e.g. risk factors). Furthermore, matching cases to controls can induce additional differences in the covariate distributions [15]. Although conditional logistic regression analysis of NCC data provides valid estimates of the effect of exposure adjusted for ‘imbalance’ in the covariates, it is not clear whether a subgroup selected from the NCC sample will provide valid estimates (see Figure 1, panels A and B).

Figure 1. Illustration of non-matched and matched nested case-control studies and potential concerns with subgroup analyses. As an illustration, we consider a study where the potential matching variable is “smoking habits” (illustrated with the pipes) and assume that the case in each pair is on the left-hand side. In panel A, there is no additional matching and the distribution of smoking is not the same among cases and controls. Restricting the analysis (of some exposure of interest) to the subgroup of smokers raises concerns illustrated in panel B: the conditional logistic regression analysis will include only the case-control pairs with two smokers, which, in addition to a loss of power, raises the problem of a non-representative sub-sample and the consequent potential for bias. In addition, each matched set that is retained needs to be discordant for the exposure variable in order to enter the likelihood expression in the analysis, further reducing the power. In panel C, cases were matched to controls on “smoking habits”, so that restricting the analysis to the subgroup of smokers (or non-smokers) does not raise any statistical concern other than the loss of power.

Subgroups can be defined by: (i) the outcome (special subgroups of cases); (ii) a factor that was a matching variable; or (iii) a covariate not used for the sampling. For the first situation, analysis is restricted to the case-control sets for which the case experienced the ‘restricted’ outcome (for example, a specific histological subtype of a cancer).
Regarding the second situation, analysis of the exposure-outcome association within a subgroup defined by a matching variable is equivalent to a redefinition of the inclusion criteria and hence the target population (see Figure 1, panel C). Whereas the matching factor is usually a potential confounder, it may transpire to be an effect modifier. In contrast to these first two situations, which will only suffer from a loss of statistical power, for the third setting (subgroups defined by a covariate not used for the sampling), it is unclear to what extent such subgroup analyses are valid and whether we should prefer one statistical approach to another in a given situation.

Data from NCC studies are usually analysed with conditional logistic regression. An alternative approach uses weighted likelihood methods where the individuals in the dataset are weighted by the inverse of their probability of being sampled into the study [16], and the data are analysed by running a weighted Cox regression analysis.

In this paper, we investigate the validity and efficiency of subgroup analyses of NCC studies. In a series of simulation studies, we compare the weighted likelihood method with conditional logistic regression, to identify which situation could be problematic and what statistical methods should be preferred.

Methods

In the survival analysis framework, the hazard function for an individual i to experience the outcome of interest at time t is usually modelled using the Cox proportional hazards model:

$$h_i(t \mid X_i, \beta) = h_0(t)\,\exp(\beta' X_i)$$

where $h_0(t)$ is the baseline hazard function, $X_i$ is the vector of covariates for individual i and $\beta$ is the vector of corresponding coefficients. The classical approach for estimating $\beta$ with NCC data is to maximize the partial likelihood [1]:

$$L(\beta) = \prod_{t_i} \frac{\exp(\beta' X_i)}{\sum_{k \in R'_i} \exp(\beta' X_k)}$$

where $R'_i$ is the sampled risk set for case i. In practice, this is solved by using a conditional logistic regression analysis.
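As a rough illustration of the partial likelihood over sampled risk sets, the sketch below evaluates its logarithm for a single binary exposure and maximizes it by a simple ternary search. The function names (`ncc_log_partial_likelihood`, `maximise_beta`) and the toy data are hypothetical and are not taken from the study, which used the clogit function in R. With 1:1 sets, the sets that are concordant for the exposure only add a constant, so the estimate is driven by the discordant sets.

```python
import math


def ncc_log_partial_likelihood(beta, sampled_sets):
    """Log partial likelihood for NCC data with one scalar covariate.

    sampled_sets is a list of (case_x, control_xs) pairs, one per case:
    case_x is the covariate of the case and control_xs holds the covariate
    values of the controls sampled from that case's risk set.
    """
    loglik = 0.0
    for case_x, control_xs in sampled_sets:
        # The denominator sums over the sampled risk set: the case plus its controls.
        denom = math.exp(beta * case_x) + sum(math.exp(beta * x) for x in control_xs)
        loglik += beta * case_x - math.log(denom)
    return loglik


def maximise_beta(sampled_sets, lo=-5.0, hi=5.0, iterations=200):
    """Ternary search for the maximizer; relies on the log-likelihood being concave."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if ncc_log_partial_likelihood(m1, sampled_sets) < ncc_log_partial_likelihood(m2, sampled_sets):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Hypothetical 1:1 matched sets (exposure of the case, exposures of its controls).
toy_sets = (
    [(1, [0])] * 6    # case exposed, control unexposed
    + [(0, [1])] * 3  # case unexposed, control exposed
    + [(1, [1])] * 4  # concordant exposed: contributes only a constant
    + [(0, [0])] * 5  # concordant unexposed: contributes only a constant
)
beta_hat = maximise_beta(toy_sets)
print(f"estimated log hazard ratio: {beta_hat:.3f}")
```

In this toy example the estimate coincides with the log of the ratio of the two types of discordant sets, log(6/3), which is the familiar matched-pair result.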
An alternative method pools all selected unique individuals in the analysis, each weighted by the inverse of their probability of being sampled [2,17], and $\beta$ is estimated by maximizing a weighted partial likelihood:

$$L(\beta) = \prod_{t_i} \frac{\exp(\beta' X_i)}{\sum_{k \in R^*_i} \exp(\beta' X_k)\, w_k}$$

where $R^*_i$ is the collection of all cases and sampled controls at risk at time $t_i$ and $w_k$ is the weight for individual k [16]. As the weight is the inverse of the probability that the individual is sampled for the study, cases are usually given a weight of one (NCC studies commonly include all cases) and the weight for a non-case has to be calculated. Samuelsen [16] suggested calculating the probability for individual k to be sampled ($p_k$) with an expression that mimics the sampling procedure, i.e.

$$p_k = 1 - \prod_{i:\, S_k \le t_i \le T_k} \left\{ 1 - \frac{m_i}{R_i - 1}\,\bigl[1 - Y_k(t_i)\bigr] \right\}$$

where $S_k$ and $T_k$ are the start and end of follow-up for individual k, $m_i$ is the number of controls sampled at event time $t_i$, $R_i$ is the number of individuals at risk at $t_i$ and $Y_k(t_i)$ is an indicator of the case-control status of individual k at time $t_i$. When the sampling involves additional matching on a confounder $X_c$, the probabilities are calculated within the strata defined by $X_c$ [18].

The weights reconstruct the number of individuals at risk over time in the cohort. In each of the strata defined by any covariate, the number of individuals at risk is also expected to be reconstructed appropriately. As a result, the individuals enter any subgroup analysis with the weight calculated for the full NCC data.

Simulation settings

We simulated cohorts of individuals characterized by four covariates: a binary exposure, a continuous independent risk factor, a binary confounder and a binary effect modifier. The exposure in our motivating study was radiation therapy, the effect modifier was the smoking status at baseline (which was both a strong risk factor and a strong effect modifier), the independent risk factor was age at baseline and the confounder was adjuvant treatment.
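The inclusion probability p_k described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the wpl implementation used in the study: the function names are hypothetical, the number of controls per case m is taken as constant, there are no tied event times and no additional matching. An individual counts as at risk at an event time when the start of follow-up is at or before it and the end of follow-up is at or after it.

```python
def inclusion_probabilities(entry, exit_time, is_case, m):
    """Probability of ever being sampled as a control, for each cohort member.

    entry, exit_time: start (S_k) and end (T_k) of follow-up.
    is_case: True when follow-up ended with the event of interest.
    m: number of controls sampled per case (assumed constant in this sketch).
    """
    n = len(entry)
    event_idx = [i for i in range(n) if is_case[i]]
    # R_i: number of individuals at risk at each event time t_i.
    risk_size = {
        i: sum(1 for k in range(n) if entry[k] <= exit_time[i] <= exit_time[k])
        for i in event_idx
    }
    probs = []
    for k in range(n):
        not_sampled = 1.0
        for i in event_idx:
            t_i = exit_time[i]
            at_risk = entry[k] <= t_i <= exit_time[k]
            # The case at t_i cannot be its own control, so that term is skipped.
            if at_risk and i != k:
                # Capped at 1 when fewer than m other individuals are at risk.
                not_sampled *= 1.0 - min(1.0, m / (risk_size[i] - 1))
        probs.append(1.0 - not_sampled)
    return probs


def sampling_weights(probs, is_case):
    """Cases get weight 1; non-cases get 1/p_k (None if they could never be sampled)."""
    return [
        1.0 if case else (1.0 / p if p > 0 else None)
        for p, case in zip(probs, is_case)
    ]


# Hypothetical four-person cohort, everyone entering at time 0.
entry = [0.0, 0.0, 0.0, 0.0]
exit_time = [1.0, 2.0, 3.0, 4.0]
is_case = [True, False, True, False]

probs = inclusion_probabilities(entry, exit_time, is_case, m=1)
weights = sampling_weights(probs, is_case)
print(probs)
print(weights)
```

In this toy cohort the second individual is at risk only at the first event time, where one control is drawn from three candidates, giving an inclusion probability of 1/3 and a weight of 3. The fourth individual is the only candidate at the second event time and is therefore sampled with certainty.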
The values for the coefficients (β) and the baseline hazard, h0(t), were chosen to roughly mimic the values from our own study of lung cancer among breast cancer patients [14] or from other studies of the same subject [19]. For each setting, we simulated 500 realizations of a cohort with N = 100 000 individuals, generating data from a proportional hazards model with constant baseline hazard and exponential censoring. In each of these cohorts, we sampled a time-matched NCC study with two controls per case. Sampling was conducted without matching and also with matching on the confounder or on the effect modifier.

Main simulation setting

We generated the outcome for our main setting using the values presented in Table 1, with a baseline hazard of 0.0005.

Table 1. Values of all parameters in the main simulation setting

| Variable | Distribution | Coefficient | Hazard ratio |
|---|---|---|---|
| Exposure Xe | Binomial (N, pc) | βe = 0.405 | 1.5 |
| Independent risk factor Xirf | Normal (μ = 0, σ = 10) | βirf = 0.020 | 1.02 |
| Confounder Xc | Binomial (N, 0.5) | βc = 0.0953 | 1.1 |
| Effect modifier Xem | Binomial (N, 0.3) | βem = 1.386 | 4 |
| Interaction between Xe and Xem | | βinteract = 0.69 | 2 |

N, total number of individuals simulated in the cohort. pc depends on the value of Xc: pc(Xc = 0) = 0.8 and pc(Xc = 1) = 0.5.

The association between the exposure and confounder was generated through the following function:

$$P(X_e = 1 \mid X_c) = \frac{\exp\bigl(\log(4) + \log(0.25)\,X_c\bigr)}{1 + \exp\bigl(\log(4) + \log(0.25)\,X_c\bigr)}$$

This association corresponds to a probability of 0.8 of being exposed if the binary confounder has value zero and a probability of 0.5 for a value of one. Denoting the coefficients by a vector β and the covariates of individual i as the ith row Xi of the design matrix X, the model is thus written as:

$$h_i(t \mid X, \beta) = h_0 \exp(\beta' X_i)$$

where Xi may also include a multiplicative interaction between the exposure and effect modifier. All individuals entered the cohort at time t = 0 and were followed until the event of interest or a censoring time (generated from an exponential distribution with a rate of 0.05) with a maximum follow-up time of t = 25. This setting resulted in approximately 2.6% of cases over the entire follow-up.
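The main simulation setting can be sketched with the Python standard library as follows. This is a simplified illustration, not the authors' R code: the cohort is much smaller than N = 100 000 to keep it fast, the effect modifier is assumed to be generated independently of the other covariates, and the names (`exposure_prob`, `simulate_cohort`, `sample_ncc`) and the random seed are hypothetical. Parameter values are those of Table 1 and the surrounding text.

```python
import math
import random

H0 = 0.0005            # constant baseline hazard
CENSOR_RATE = 0.05     # rate of the exponential censoring distribution
MAX_FOLLOW_UP = 25.0
BETA = {"e": 0.405, "irf": 0.020, "c": 0.0953, "em": 1.386, "interact": 0.69}


def exposure_prob(xc):
    """P(Xe = 1 | Xc): 0.8 when Xc = 0 and 0.5 when Xc = 1."""
    lin = math.log(4) + math.log(0.25) * xc
    return math.exp(lin) / (1 + math.exp(lin))


def simulate_cohort(n, rng):
    cohort = []
    for _ in range(n):
        xc = 1 if rng.random() < 0.5 else 0
        xe = 1 if rng.random() < exposure_prob(xc) else 0
        xem = 1 if rng.random() < 0.3 else 0
        xirf = rng.gauss(0, 10)
        lin = (BETA["e"] * xe + BETA["irf"] * xirf + BETA["c"] * xc
               + BETA["em"] * xem + BETA["interact"] * xe * xem)
        # A constant hazard implies an exponentially distributed event time.
        event_time = rng.expovariate(H0 * math.exp(lin))
        censor_time = min(rng.expovariate(CENSOR_RATE), MAX_FOLLOW_UP)
        cohort.append({
            "xe": xe, "xc": xc, "xem": xem, "xirf": xirf,
            "time": min(event_time, censor_time),
            "case": event_time <= censor_time,
        })
    return cohort


def sample_ncc(cohort, m, rng):
    """Unmatched incidence density sampling: m controls per case from its risk set."""
    sets = []
    for i, subj in enumerate(cohort):
        if not subj["case"]:
            continue
        # Risk set: everyone else still under follow-up at the case's event time.
        at_risk = [k for k, other in enumerate(cohort)
                   if k != i and other["time"] >= subj["time"]]
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        sets.append((i, controls))
    return sets


rng = random.Random(2017)
cohort = simulate_cohort(5000, rng)
ncc_sets = sample_ncc(cohort, 2, rng)
n_cases = sum(subj["case"] for subj in cohort)
print(f"{n_cases} cases among {len(cohort)} individuals, {len(ncc_sets)} sampled sets")
```

Note that individuals who later become cases remain eligible as controls at earlier event times, as incidence density sampling requires.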
Other simulation settings

To simulate more challenging settings, we modified the simulation described above by: (i) using more extreme values for the covariate coefficients; (ii) changing the strength and direction of the confounder-exposure association; (iii) adding a correlation between the independent risk factor and the confounder; (iv) having censoring depend on the effect modifier; (v) generating outcomes with lower prevalence; (vi) generating smaller cohorts; and (vii) combinations of these modifications.

Subgroups: definitions and statistical analyses

From the three covariates used in our simulation, we defined eight subgroups (two for the confounder, two for the effect modifier and four for the continuous risk factor categorized in quartiles). To analyse a subgroup (for example, the subgroup of ‘smokers’), we selected all individuals from the cohort or from the NCC sample who belonged to this subgroup. For each of the subgroups, the exposure coefficient βe was estimated from the restricted cohort using Cox regression, and from the restricted NCC data using both conditional logistic regression and weighted Cox regression. Results were compared with the estimates obtained from the 500 ‘full’ cohorts analysed with Cox regression and the 500 ‘full’ NCC samples analysed with both conditional logistic regression and weighted Cox regression. In all analyses, we used the fully specified model that generated the data. The efficiency of a subgroup analysis was estimated as the ratio of the variance reported by the corresponding cohort analysis to the variance reported by the subgroup regression analysis, so that values below 1 indicate a loss of precision relative to the cohort.

Supplementary analyses

For weighted Cox regression of matched NCC studies, it was previously noted that using unstratified or stratified weights did not influence the estimated exposure coefficient [20]. We investigated this assertion in the matched settings by using both types of weights. We assessed how well the unstratified weights performed in estimating the coefficient of interest βe.
We also calculated the ratio over time (from t = 0 to t = 25 in steps of 0.5) of the number of individuals recovered from the NCC study with the weighting system (stratified and unstratified), to the number of individuals in the cohort, and we report the mean and the standard deviation of this ratio over 500 realizations.

Application to real data

To investigate whether the gender of a non-Hodgkin’s lymphoma (NHL) patient is a risk factor for his/her siblings to develop NHL, we used a cohort of 15 727 siblings of all NHL patients, among whom there were 87 cases of NHL: 37 in brothers and 50 in sisters of patients. We analysed the full cohort and the two subgroups of brothers and sisters of the patients using Cox regression with age as the time scale. We sampled 100 unmatched NCC studies with two controls per case from the cohort, and analysed the full NCC data and the two subgroups with both conditional logistic regression and weighted Cox regression. Our models included the main exposure (gender of the patient), an independent risk factor (year of birth of the sibling) and an effect modifier (gender of the sibling) [21]. We also sampled 100 NCC studies matched on the sibling’s gender.

Software

All data management and analyses were performed with the R statistical package (version 3.2.2). To estimate the weights, we used the wpl function (multipleNCC package). We used the coxph function (survival package) to run the Cox and the weighted Cox regressions (with weights retrieved from the wpl-object). The conditional logistic regressions were run with the clogit function.

Results

Unmatched simulation settings

The adjusted estimates of βe in our main (unmatched) simulation setting are presented in Table 2. All analyses provided unbiased estimates for βe, and the analysis of the subgroup defined by the effect modifier provided an unbiased estimate of βe + βinteract.
In all analyses, with full data or subgroup data, the weighted Cox regression was more efficient than the conditional logistic regression. This latter analysis experienced a dramatic loss of efficiency when restricting the analysis to subgroups (Table 2). All other unmatched settings provided similar results (data not shown). The smaller cohorts that we analysed included 18 000 individuals. The conditional logistic regression was prone to non-convergence in approximately 10% of the analyses of the smaller subgroups. Disregarding these realizations, the average conditional logistic regression estimates were close to the average cohort estimates and close to the simulated value of βe (Supplementary Table, available as Supplementary data at IJE online).

Table 2. Simulation results for cohorts including 100 000 individuals

| Subgroup | True value | Data | Analysis | βe | Model-based s.e. | Emp. s.e. | Efficiency (a) | n cases (b) | n controls (c, d) |
|---|---|---|---|---|---|---|---|---|---|
| Full data | 0.405 | Cohort | Cox regression | 0.408 | 0.087 | 0.089 | 1 | 2656 | 97344 |
| Full data | 0.405 | NCC | Cond.log.reg | 0.408 | 0.099 | 0.099 | 0.772 | 2656 | 5312 |
| Full data | 0.405 | NCC | Weighted Cox | 0.407 | 0.094 | 0.095 | 0.845 | 2656 | 5213 |
| Binary confounder: baseline level | 0.405 | Cohort subgroup | Cox regression | 0.401 | 0.153 | 0.157 | 1 | 1412 | 48586 |
| Binary confounder: baseline level | 0.405 | NCC subgroup | Cond.log.reg | 0.398 | 0.212 | 0.218 | 0.524 | 1058 | 1410 |
| Binary confounder: baseline level | 0.405 | NCC subgroup | Weighted Cox | 0.398 | 0.164 | 0.166 | 0.874 | 1412 | 2597 |
| Binary confounder: level 1 | 0.405 | Cohort subgroup | Cox regression | 0.415 | 0.111 | 0.115 | 1 | 1244 | 48758 |
| Binary confounder: level 1 | 0.405 | NCC subgroup | Cond.log.reg | 0.412 | 0.159 | 0.166 | 0.484 | 935 | 1248 |
| Binary confounder: level 1 | 0.405 | NCC subgroup | Weighted Cox | 0.413 | 0.120 | 0.119 | 0.851 | 1244 | 2615 |
| Independent risk factor (four categories): level 1 | 0.405 | Cohort subgroup | Cox regression | 0.407 | 0.189 | 0.194 | 1 | 570 | 26861 |
| Independent risk factor (four categories): level 1 | 0.405 | NCC subgroup | Cond.log.reg | 0.429 | 0.351 | 0.366 | 0.295 | 271 | 314 |
| Independent risk factor (four categories): level 1 | 0.405 | NCC subgroup | Weighted Cox | 0.406 | 0.201 | 0.206 | 0.879 | 570 | 1440 |
| Independent risk factor (four categories): level 2 | 0.405 | Cohort subgroup | Cox regression | 0.427 | 0.184 | 0.177 | 1 | 613 | 24010 |
| Independent risk factor (four categories): level 2 | 0.405 | NCC subgroup | Cond.log.reg | 0.437 | 0.362 | 0.364 | 0.265 | 266 | 304 |
| Independent risk factor (four categories): level 2 | 0.405 | NCC subgroup | Weighted Cox | 0.426 | 0.198 | 0.194 | 0.860 | 613 | 1291 |
| Independent risk factor (four categories): level 3 | 0.405 | Cohort subgroup | Cox regression | 0.415 | 0.174 | 0.177 | 1 | 663 | 23085 |
| Independent risk factor (four categories): level 3 | 0.405 | NCC subgroup | Cond.log.reg | 0.444 | 0.350 | 0.359 | 0.254 | 278 | 316 |
| Independent risk factor (four categories): level 3 | 0.405 | NCC subgroup | Weighted Cox | 0.417 | 0.189 | 0.188 | 0.843 | 663 | 1234 |
| Independent risk factor (four categories): level 4 | 0.405 | Cohort subgroup | Cox regression | 0.406 | 0.157 | 0.156 | 1 | 809 | 23388 |
| Independent risk factor (four categories): level 4 | 0.405 | NCC subgroup | Cond.log.reg | 0.416 | 0.308 | 0.326 | 0.263 | 343 | 391 |
| Independent risk factor (four categories): level 4 | 0.405 | NCC subgroup | Weighted Cox | 0.404 | 0.174 | 0.171 | 0.817 | 809 | 1247 |
| Binary effect modifier: baseline | 0.405 | Cohort subgroup | Cox regression | 0.408 | 0.090 | 0.091 | 1 | 696 | 69310 |
| Binary effect modifier: baseline | 0.405 | NCC subgroup | Cond.log.reg | 0.406 | 0.121 | 0.117 | 0.550 | 637 | 987 |
| Binary effect modifier: baseline | 0.405 | NCC subgroup | Weighted Cox | 0.407 | 0.097 | 0.095 | 0.857 | 696 | 3737 |
| Binary effect modifier: level 1 | 0.405 + 0.690 | Cohort subgroup | Cox regression | 1.101 | 0.063 | 0.063 | 1 | 1959 | 28034 |
| Binary effect modifier: level 1 | 0.405 + 0.690 | NCC subgroup | Cond.log.reg | 1.106 | 0.118 | 0.117 | 0.287 | 977 | 1143 |
| Binary effect modifier: level 1 | 0.405 + 0.690 | NCC subgroup | Weighted Cox | 1.105 | 0.084 | 0.083 | 0.563 | 1959 | 1476 |

Adjusted exposure coefficients (βe), model-based standard errors (Model-based s.e.), empirical standard errors (Emp. s.e.), efficiency relative to the corresponding cohort analysis and mean number (n) of cases and controls included in the different analyses, for the main simulation setting described in the text. All numbers are averages over 500 runs. NCC, nested case-control; Cond.log.reg, conditional logistic regression.
(a) The efficiency is calculated as the squared ratio of the model-based standard error (s.e.) provided by the NCC data analysis (conditional logistic regression or weighted Cox regression) and the s.e. provided by the corresponding cohort analysis.
(b, c) The numbers of cases (b) and controls (c) that are reported for the conditional logistic regression for the subgroup analyses are those from sets with at least one control and one case. In addition, all numbers are averages over 500 runs and have been rounded so that the cases across different strata, or cases and controls in the subgroups of the cohort, sum up to the cohort totals ±1.
(d) The number of controls that are reported for the weighted Cox regression are the numbers of non-cases in the cohort at the end of the study.

We investigated the ability of the weighting method to recover the correct number of individuals at risk over time. On average, the weights reproduced the pattern of the number of individuals at risk in the full cohort and in each subgroup. The estimated numbers of individuals at risk were in excellent agreement with the corresponding actual numbers at risk in the cohort, with ratios always close to 1 (Figure 2). All unmatched settings provided similar patterns.

Figure 2. Performance of the weighting system in recovering the number of individuals in the cohort (unmatched NCC study). Ratio (with two standard deviations) of the numbers of individuals at risk over time recovered from the unmatched NCC data with the weighting system, to the numbers of individuals at risk in the cohort, presented for the full data and for the two subgroups defined by the binary confounder.
Matched simulation settings

Similar to the unmatched situation, all analyses provided unbiased estimates of βe or of βe + βinteract. In contrast to the unmatched situation, for the two subgroups defined by the confounder (i.e. the matching factor), the efficiency of the conditional logistic regression was similar to that of the weighted Cox regression. The results were similar when matching on the effect modifier.

On average, the stratified weights reproduced the pattern of the number of individuals at risk in the cohort with similar performance to the unmatched situation, whereas the unstratified weights performed poorly in the subgroups defined by the matching factor (Figure 3). Despite this, using the unstratified weights provided valid estimates of βe in all subgroups, whether or not defined by the matching variable (data not shown).

Figure 3. Performance of the weighting system in recovering the number of individuals in the cohort (matched NCC study). Ratio (with two standard deviations) of the numbers of individuals at risk over time recovered with unstratified weights (upper panel) and stratified weights (lower panel) from NCC data that was matched on the confounder, to the actual numbers of individuals at risk in the cohort.

Even when we challenged the weighted method using more extreme situations, the estimates from the weighted Cox analysis of the NCC data, overall and in subgroups, were unbiased when using unstratified weights (Figure 4).

Figure 4. Coefficient estimates with 95% confidence interval provided by the Cox regression analysis of the full cohort and subgroups (solid line), and weighted Cox regression analysis of the corresponding NCC data with unstratified weights despite stratified sampling (dashed line) in an extreme setting (βirf = 0.95 with censoring associated with the effect modifier).

Real data application

The results from the 100 unmatched NCC studies, sampled within a cohort of siblings of NHL patients, are presented in Figure 5, where we coded the exposure of interest using an indicator variable for female patients. Analysis of the subgroups defined by the sex of the sibling provided unbiased estimates for the exposure effects: βe for brothers of the patient and βe + βinteract for sisters, and the weighted Cox regression was more efficient than the conditional logistic regression.
A sibling of a female NHL patient had a slightly increased risk of developing NHL compared with a sibling of a male patient, but this risk was modified by the gender of the sibling, so that the risk for sisters was more than double the risk for brothers (e^0.979 = 2.66 for sisters versus e^0.146 = 1.16 for brothers). Similarly, the subgroup analyses provided unbiased estimates of exposure coefficients when matching on the effect modifier (data not shown).

Figure 5. Coefficient estimates with 95% confidence intervals in the study of NHL. Estimates of the effect of sex of patient on the risk of NHL in siblings, with 95% confidence intervals. Estimates from Cox regression analyses of the full cohort and the two subgroups defined by the sex of sibling (solid lines), and average estimates obtained from 100 NCC samples, analysed by conditional logistic regression (dashed lines) and weighted Cox regression (dotted lines).

Discussion

In a series of simulation studies and in a real data application, we have shown that analyses of subgroups of NCC data provide unbiased estimates of the exposure-outcome association, regardless of whether the subgroup is defined by an independent risk factor, a matched or unmatched confounder or an effect modifier. We also showed that using a conditional logistic regression to analyse the subgroup is less efficient than using a weighted Cox regression.
Exploring the weighting system in our simulated data, we found that on average the weights succeeded in recovering the pattern of risk-set sizes over time, not only in the cohort but in all subgroups, for both unmatched and matched designs, provided stratified weights were used for the latter. More importantly, even when unstratified weights were used for a matched design, weighted Cox regression analyses still provided unbiased estimates of the exposure coefficient βe, confirming what was already reported by Støer et al. [20]

Whereas the topic of subgroup analyses is abundantly documented for randomized controlled trials [22], we did not find any study addressing the validity of such analyses for NCC studies. One might argue that such subgroup analyses are not of interest, as a subgroup of interest can be accommodated by including an interaction term in the analysis of the full data [22]. However, there may be good reasons for investigating a restricted population in more detail. For example, the absence of an overall treatment effect does not necessarily imply that an intervention has no effect in any of the subgroups [23]: the safety and effectiveness of a drug may depend on patients’ characteristics and/or co-medication [24], as in the case of concomitant exposure to a proton pump inhibitor and clopidogrel after acute myocardial infarction [24,25]. Our goal is thus to reassure researchers who do such analyses with NCC data, but not to promote overuse of subgroup analyses.

The conditional logistic regression was much less efficient than the weighted Cox regression in analysing subgroups. Since conditional logistic regression is performed on matched sets, the number of sets including at least one case and one control can decrease dramatically if the subgroup is not defined by a matched covariate. This decrease depends on the association between the outcome and the covariate and the prevalence of the various levels of the covariate, as well as the prevalence of exposure.
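The loss of matched sets described above can be illustrated with a small counting sketch. The data layout, the key names (`case`, `exposed`, `smoker`) and the four example sets are hypothetical; the sketch only counts which sets survive a subgroup restriction and which of those are discordant for exposure.

```python
def informative_sets(matched_sets, in_subgroup):
    """Count matched sets that survive restriction to a subgroup.

    matched_sets: list of sets; each set is a list of dicts with boolean keys
    'case', 'exposed' and 'smoker' (hypothetical names for this sketch).
    Returns (n_retained, n_informative): sets that keep at least one case and
    one control, and, among those, sets whose members differ in exposure.
    """
    retained = 0
    informative = 0
    for members in matched_sets:
        kept = [m for m in members if in_subgroup(m)]
        has_case = any(m["case"] for m in kept)
        has_control = any(not m["case"] for m in kept)
        if not (has_case and has_control):
            continue  # the set drops out of the conditional analysis
        retained += 1
        if len({m["exposed"] for m in kept}) > 1:
            informative += 1  # discordant for exposure
    return retained, informative


def person(case, exposed, smoker):
    return {"case": case, "exposed": exposed, "smoker": smoker}


example_sets = [
    [person(True, True, True), person(False, False, True), person(False, True, False)],
    [person(True, True, True), person(False, False, False), person(False, True, False)],
    [person(True, False, False), person(False, True, True), person(False, False, True)],
    [person(True, True, True), person(False, True, True), person(False, True, True)],
]

counts_smokers = informative_sets(example_sets, lambda m: m["smoker"])
```

In this toy example only two of the four sets keep both a case and a control among smokers, and only one of those two is discordant for exposure.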
In our main simulation setting, the number of cases retained in the conditional logistic regression analysis could be two to three times smaller than the number in the weighted Cox regression analysis (Table 2). These numbers will be further reduced in the analysis, as sets whose members all have the same exposure value do not contribute to the conditional likelihood. As a consequence, when subgroups are small, the conditional logistic regression is more likely to face convergence problems (Supplementary Table).

A limitation of our study is that our simulated data were complete (no incomplete data in the ‘collected’ information on covariates), a situation seldom met in reality. In addition, as all these analyses used the correctly and fully specified model (i.e. all variables used to generate the data were used in the analyses), we did not encounter the bias that can arise from model misspecification [26,27]. Similarly, when the censoring scheme depended on the effect modifier, bias in the exposure estimates was avoided by including this variable in our model [28,29].

In conclusion, when there is good reason to analyse a specific subgroup from NCC data, it is valid to do so, but we recommend performing the analysis with a weighted Cox regression to optimize efficiency.

Supplementary Data

Supplementary data are available at IJE online.

Funding

This research was supported by the Swedish Cancer Society (Cancerfonden) through the grant CAN 2015/493.

Conflict of interest: None declared.

References

1. Thomas DC. Addendum in: Liddell JR, McDonald JC, Thomas DC. Methods of cohort analysis: Appraisal by application to asbestos mining. J R Stat Soc A 1977;140:469–91.
2. Borgan O, Samuelsen SO. A review of cohort sampling designs for Cox’s regression model: potentials in epidemiology. Nor Epidemiol 2003;3:239–48.
3. Bertke S, Hein M, Schubauer-Berigan M, Deddens J. A simulation study of relative efficiency and bias in the nested case-control study design. Epidemiol Methods 2013;2:85–93.
4. Ohneberg K, Wolkewitz M, Beyersmann J et al. Analysis of clinical cohort data using nested case-control and case-cohort sampling designs: a powerful and economical tool. Methods Inf Med 2015;54:505–14.
5. Joubert BR, Lange EM, Franceschini N, Mwapasa V, North KE, Meshnick SR; NIAID Center for HIV/AIDS Vaccine Immunology. A whole genome association study of mother-to-child transmission of HIV in Malawi. Genome Med 2010;2:17.
6. Läärä E. Study designs for biobank-based epidemiologic research on chronic diseases. In: Dillner J (ed). Methods in Biobanking. New York, NY: Springer, 2011.
7. Li H, Beeghly-Fadiel A, Wen W et al. Gene-environment interactions for breast cancer risk among Chinese women: a report from the Shanghai Breast Cancer Genetics Study. Am J Epidemiol 2013;177:161–70.
8. Song H, Michel A, Nyren O, Ekstrom AM, Pawlita M, Ye W. A CagA-independent cluster of antigens related to the risk of noncardia gastric cancer: Associations between Helicobacter pylori antibodies and gastric adenocarcinoma explored by multiplex serology. Int J Cancer 2014;134:2942–50.
9. Schoemaker MJ, Folkerd EJ, Jones ME et al. Combined effects of endogenous sex hormone levels and mammographic density on postmenopausal breast cancer risk: results from the Breakthrough Generations Study. Br J Cancer 2014;110:1898–907.
10. Boursi B, Mamtani R, Haynes K, Yang YX. Pernicious anemia and colorectal cancer risk - A nested case-control study. Dig Liver Dis 2016;48:1386–90.
11. Liu HC, Yang SY, Liao YT, Chen CC, Kuo CJ. Antipsychotic medications and risk of acute coronary syndrome in schizophrenia: a nested case-control study. PLoS One 2016;11:e0163533.
12. Kim G, Jang SY, Han E et al. Effect of statin on hepatocellular carcinoma in patients with type 2 diabetes: A nationwide nested case-control study. Int J Cancer 2017;140:798–806.
13. Devore EE, Warner ET, Eliassen H et al. Urinary melatonin in relation to postmenopausal breast cancer risk according to melatonin 1 receptor status. Cancer Epidemiol Biomarkers Prev 2017;26:413–19.
14. Delcoigne B, Colzani E, Prochazka M et al. Breaking the matching in nested case-control data offered several advantages for risk estimation. J Clin Epidemiol 2017;82:79–86.
15. Pearce N. Analysis of matched case-control studies. BMJ 2016;352:i969.
16. Samuelsen S. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 1997;84:379–94.
17. Borgan O, Samuelsen SO. Nested case-control and case-cohort studies. In: Klein JP, van Houwelingen HC, Ibrahim JG et al. (eds). Handbook of Survival Analysis. Boca Raton, FL: Chapman & Hall/CRC, 2013.
18. Delcoigne B, Hagenbuch N, Schelin ME et al. Feasibility of reusing time-matched controls in an overlapping cohort. Stat Methods Med Res 2016, Sep 21. pii:0962280216669744.
19. Grantzau T, Thomsen MS, Væth M, Overgaard J. Risk of second primary lung cancer in women after radiotherapy for breast cancer. Radiother Oncol 2014;111:366–73.
20. Støer N, Samuelsen S. Inverse probability weighting in nested case-control studies with additional matching - a simulation study. Stat Med 2013;32:5328–39.
21. Lee M, Rebora P, Valsecchi MG, Czene K, Reilly M. A unified model for estimating and testing familial aggregation. Stat Med 2013;32:5353–65.
22. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med 2002;21:2917–30.
23. Abrahamowicz M, Beauchamp ME, Fournier P, Dumont A. Evidence of subgroup-specific treatment effect in the absence of an overall effect: is there really a contradiction? Pharmacoepidemiol Drug Saf 2013;22:1178–88.
24. Greenfield S, Kravitz R, Duan N et al. Heterogeneity of treatment effects: Implications for guidelines, payment, and quality assessment. Am J Med 2007;120:3–9.
25. Juurlink DN, Gomes T, Ko DT et al. A population-based study of the drug interaction between proton pump inhibitors and clopidogrel. Can Med Assoc J 2009;180:713–18.
26. Ford I, Norrie J, Ahmadi S. Model inconsistency, illustrated by the Cox proportional hazard model. Stat Med 1995;14:735–46.
27. Lin DY, Psaty BM, Kronmal RA. Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 1998;54:948–63.
28. Jackson D, White IR, Seaman S, Evans H, Baisley K, Carpenter J. Relaxing the independent censoring assumption in the Cox proportional hazards model using multiple imputation. Stat Med 2014;33:4681–94.
29. O’Quigley J, Xu R. Robustness of proportional hazards regression. In: Klein JP, van Houwelingen HC, Ibrahim JG et al. (eds). Handbook of Survival Analysis. Boca Raton, FL: Chapman & Hall/CRC, 2013.

© The Author(s) 2018; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

International Journal of Epidemiology, Oxford University Press

Publisher
Oxford University Press
Copyright
© The Author(s) 2018; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association
ISSN
0300-5771
eISSN
1464-3685
D.O.I.
10.1093/ije/dyx282

Abstract

Background: It is not uncommon for investigators to conduct further analyses of subgroups, using data collected in a nested case-control design. Since the sampling of the participants is related to the outcome of interest, the data at hand are not a representative sample of the population, and subgroup analyses need to be carefully considered for their validity and interpretation.

Methods: We performed simulation studies, generating cohorts within the proportional hazards model framework and with covariate coefficients chosen to mimic realistic data and more extreme situations. From the cohorts we sampled nested case-control data and analysed the effect of a binary exposure on a time-to-event outcome in subgroups defined by a covariate (an independent risk factor, a confounder or an effect modifier) and compared the estimates with the corresponding subcohort estimates. Cohort analyses were performed with Cox regression, and nested case-control samples or restricted subsamples were analysed with both conditional logistic regression and weighted Cox regression.

Results: For all studied scenarios, the subgroup analyses provided unbiased estimates of the exposure coefficients, with conditional logistic regression being less efficient than the weighted Cox regression.

Conclusions: For the study of a subpopulation, analysis of the corresponding subgroup of individuals sampled in a nested case-control design provides an unbiased estimate of the effect of exposure, regardless of whether the variable used to define the subgroup is a confounder, effect modifier or independent risk factor. Weighted Cox regression provides more efficient estimates than conditional logistic regression.

Keywords: Conditional logistic regression, risk set sampling, weighted likelihood, weighted Cox regression

Key Messages

Subgroup analyses of nested case-control data provide valid estimates of exposure effects.
The subgroups can be defined by any covariate measured at baseline, whether an independent risk factor, a confounder or an effect modifier.

Weighted Cox regression analysis of the subgroups provides more precise estimates.

This provides reassurance for investigators who conduct such analyses, but with the usual caveat concerning overuse and over-interpretation of subgroup analyses.

Background

The time-matched nested case-control (NCC) design uses incidence density sampling for selecting controls within the cohort: for each case, controls are randomly selected from the individuals still at risk at the event time of the case (i.e. the risk set) [1]. This design combines the benefits of the time aspect of the cohort design with significant savings in cost and time, since only a few non-cases are sampled at each event time [2–4]. Cost savings can be substantial when, in addition to the usual costs of enrolling study participants, the covariates of interest include expensive laboratory assays such as whole genome sequencing [5]. For modern biobank-based epidemiological research, time-matching can also be an important advantage when biomarkers are affected by storage time or batch effects [4,6].

Whereas the NCC design will usually have been chosen to answer a specific research question, there are numerous examples in the literature of investigators conducting subsequent analyses of subgroups of interest [7–13]. Our interest in investigating the validity of such subgroup analysis was motivated by our own NCC study of the association between radiation therapy for breast cancer and subsequent risk of lung cancer, an association that was shown to be modified by smoking status [14]. This prompted us to investigate the subgroup of smokers in more detail, but as smoking was not a matching factor, it was unclear whether restricting the analysis to this subgroup of the NCC sample could raise statistical issues.
In contrast to the cohort design, which readily enables such analyses, subgroup analyses with the NCC design need to be carefully considered for their validity and interpretation. Since in a case-control study the sampling of the participants is related to the outcome of interest, the data at hand are not a representative sample of the source population: the cases are over-represented in the sample, and will have a different distribution for some covariates (e.g. risk factors). Furthermore, matching cases to controls can induce additional differences in the covariate distributions [15]. Although conditional logistic regression analysis of NCC data provides valid estimates of the effect of exposure adjusted for ‘imbalance’ in the covariates, it is not clear whether a subgroup selected from the NCC sample will provide valid estimates (see Figure 1, panels A and B).

Figure 1. Illustration of non-matched and matched nested case-control studies and potential concerns with subgroup analyses. As an illustration, we consider a study where the potential matching variable is “smoking habits” (illustrated with the pipes) and assume that the case in each pair is on the left-hand side. In panel A, there is no additional matching and the distribution of smoking is not the same among cases and controls. Restricting the analysis (of some exposure of interest) to the subgroup of smokers raises concerns illustrated in panel B: the conditional logistic regression analysis will include only the case-control pairs with two smokers, which, in addition to a loss of power, raises the problem of a non-representative sub-sample and the consequent potential for bias. In addition, each matched set that is retained needs to be discordant for the exposure variable in order to enter the likelihood expression in the analysis, further reducing the power.
In panel C, cases were matched to controls on “smoking habits”, so that restricting the analysis to the subgroup of smokers (or non-smokers) does not raise any statistical concern other than the loss of power.

Subgroups can be defined by: (i) the outcome (special subgroups of cases); (ii) a factor that was a matching variable; or (iii) a covariate not used for the sampling. For the first situation, analysis is restricted to the case-control sets for which the case experienced the ‘restricted’ outcome (for example, a specific histological subtype of a cancer).
Regarding the second situation, analysis of the exposure-outcome association within a subgroup defined by a matching variable is equivalent to a redefinition of the inclusion criteria and hence the target population (see Figure 1, panel C). Whereas the matching factor is usually a potential confounder, it may transpire to be an effect modifier. In contrast to these first two situations, which will only suffer from a loss of statistical power, for the third setting (subgroups defined by a covariate not used for the sampling), it is unclear to what extent such subgroup analyses are valid and whether we should prefer one statistical approach to another in a given situation.

Data from NCC studies are usually analysed with conditional logistic regression. An alternative approach uses weighted likelihood methods, where the individuals in the dataset are weighted by the inverse of their probability of being sampled into the study [16], and the data are analysed by running a weighted Cox regression analysis.

In this paper, we investigate the validity and efficiency of subgroup analyses of NCC studies. In a series of simulation studies, we compare the weighted likelihood method with conditional logistic regression, to identify which situations could be problematic and which statistical methods should be preferred.

Methods

In the survival analysis framework, the hazard function for an individual i to experience the outcome of interest at time t is usually modelled using the Cox proportional hazards model:

$$h_i(t \mid X_i, \beta) = h_0(t)\,\exp(\beta' X_i)$$

where h0(t) is the baseline hazard function, Xi is the vector of covariates for individual i and β is the vector of corresponding coefficients. The classical approach for estimating β with NCC data is to maximize the partial likelihood [1]:

$$L(\beta) = \prod_{t_i} \frac{\exp(\beta' X_i)}{\sum_{k \in R'_i} \exp(\beta' X_k)}$$

where R’i is the sampled risk set for case i. In practice, this is solved by using a conditional logistic regression analysis.
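The partial likelihood over sampled risk sets can be written out directly for a single covariate. The sketch below is an illustration under simplifying assumptions (one scalar covariate, one case per set); the function name and the example sets are hypothetical, and in practice the maximization is delegated to a conditional logistic regression routine.

```python
import math


def ncc_log_partial_likelihood(beta, sampled_sets):
    """Log partial likelihood for NCC data with one scalar covariate.

    sampled_sets: list of (x_case, [x_control, ...]) pairs, one per case;
    the sampled risk set R'_i is the case together with its controls.
    """
    total = 0.0
    for x_case, x_controls in sampled_sets:
        denominator = math.exp(beta * x_case)
        denominator += sum(math.exp(beta * x) for x in x_controls)
        total += beta * x_case - math.log(denominator)
    return total


# Hypothetical sets with a binary exposure and two controls per case.
example_sets = [(1, [0, 0]), (1, [1, 0]), (0, [1, 0]), (1, [1, 1])]
loglik_at_zero = ncc_log_partial_likelihood(0.0, example_sets)
```

A set in which the case and all controls share the same covariate value, such as the last one above, contributes the same constant for every value of β, so it carries no information about the exposure effect.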
An alternative method pools all selected unique individuals in the analysis, each weighted by the inverse of their probability of being sampled [2,17], and β is estimated by maximizing a weighted partial likelihood:

$$L(\beta) = \prod_{t_i} \frac{\exp(\beta' X_i)}{\sum_{k \in R^*_i} \exp(\beta' X_k)\, w_k}$$

where R*i is the collection of all cases and sampled controls at risk at time ti and wk is the weight for individual k [16]. As the weight is the inverse of the probability that the individual is sampled for the study, cases are usually given a weight of one (NCC studies commonly include all cases) and the weight for a non-case has to be calculated. Samuelsen [16] suggested calculating the probability for individual k to be sampled (pk) with an expression that mimics the sampling procedure, i.e.

$$p_k = 1 - \prod_{i:\, S_k \le t_i \le T_k} \left[ 1 - \frac{m_i}{R_i - 1} \right]^{1 - Y_k(t_i)}$$

where Sk and Tk are the start and end of follow-up for individual k, mi is the number of controls sampled at event time ti, Ri is the number of individuals at risk at ti, and Yk(ti) is an indicator of the case-control status of individual k at time ti. When the sampling involves additional matching on a confounder Xc, the probabilities are calculated within the strata defined by Xc [18].

The weights reconstruct the number of individuals at risk over time in the cohort. In each of the strata defined by any covariate, the number of individuals at risk is also expected to be reconstructed appropriately. As a result, the individuals enter any subgroup analysis with the weight calculated for the full NCC data.

Simulation settings

We simulated cohorts of individuals characterized by four covariates: a binary exposure, a continuous independent risk factor, a binary confounder and a binary effect modifier. The exposure in our motivating study was radiation therapy, the effect modifier was the smoking status at baseline (which was both a strong risk factor and a strong effect modifier), the independent risk factor was age at baseline and the confounder was adjuvant treatment.
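Returning to the inclusion probability pk given in the Methods above, a direct transcription might look as follows. This is a sketch under stated assumptions, not the multipleNCC implementation: the function names, the argument layout and the toy event times are hypothetical, and event times at which the individual is the case are skipped, which corresponds to a factor of 1 in the product.

```python
def inclusion_probability(entry, exit_time, events, m=2, own_event_time=None):
    """Probability that an individual is ever sampled as a control.

    events: list of (t_i, n_at_risk_i) pairs, one per event time in the cohort.
    m: number of controls sampled per case.
    own_event_time: event time of this individual if it is a case, else None.
    """
    prob_never_sampled = 1.0
    for t_i, n_at_risk in events:
        if not (entry <= t_i <= exit_time):
            continue  # not under follow-up at this event time
        if own_event_time is not None and t_i == own_event_time:
            continue  # the individual is the case here, not a candidate control
        if n_at_risk <= 1:
            continue  # nobody else at risk, so no control can be drawn
        prob_never_sampled *= 1.0 - min(1.0, m / (n_at_risk - 1))
    return 1.0 - prob_never_sampled


def sampling_weight(p, is_case):
    """Inverse-probability weight; cases get weight 1 as described in the text."""
    return 1.0 if is_case else 1.0 / p


# Hypothetical event times with the number at risk at each.
events = [(1.0, 11), (2.0, 6)]
p_short = inclusion_probability(0.0, 1.5, events)  # at risk at t = 1 only
p_long = inclusion_probability(0.0, 3.0, events)   # at risk at both event times
w_short = sampling_weight(p_short, is_case=False)
```

For matched designs the text indicates that the same calculation is done within strata of the matching variable, which in this sketch would mean passing stratum-specific event times and numbers at risk.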
The values for the coefficients (β) and the baseline hazard, h0(t), were chosen to roughly mimic the values from our own study of lung cancer among breast cancer patients [14] or from other studies of the same subject [19]. For each setting, we simulated 500 realizations of a cohort with N = 100 000 individuals, generating data from a proportional hazards model with constant baseline hazard and exponential censoring. In each of these cohorts, we sampled a time-matched NCC study with two controls per case. Sampling was conducted without matching and also with matching on the confounder or on the effect modifier.

Main simulation setting

We generated the outcome for our main setting using the values presented in Table 1, with a baseline hazard of 0.0005.

Table 1. Values of all parameters in the main simulation setting

| Variable | Distribution | Coefficient | Hazard ratio |
| --- | --- | --- | --- |
| Exposure Xe | Binomial (N, pc) | βe = 0.405 | 1.5 |
| Independent risk factor Xirf | Normal (μ = 0, σ = 10) | βirf = 0.020 | 1.02 |
| Confounder Xc | Binomial (N, 0.5) | βc = 0.0953 | 1.1 |
| Effect modifier Xem | Binomial (N, 0.3) | βem = 1.386 | 4 |
| Interaction between Xe and Xem | | βinteract = 0.69 | 2 |

N, total number of individuals simulated in the cohort. pc depends on the value of Xc: pc(Xc = 0) = 0.8 and pc(Xc = 1) = 0.5.
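The time-matched sampling with two controls per case mentioned in this section can be sketched as below. It is a simplified, unmatched illustration in which everyone enters at t = 0; the function name, the tuple layout and the toy cohort are assumptions, and the quadratic scan over the cohort is chosen for readability rather than speed.

```python
import random


def sample_ncc(cohort, m=2, seed=1):
    """Draw a time-matched nested case-control sample.

    cohort: list of (person_id, exit_time, is_case); all enter at t = 0.
    For each case, up to m controls are drawn at random from the individuals
    still at risk at the case's event time (future cases are eligible).
    Returns a list of (case_id, event_time, [control_ids]).
    """
    rng = random.Random(seed)
    sampled_sets = []
    for case_id, event_time, is_case in cohort:
        if not is_case:
            continue
        risk_set = [pid for pid, exit_time, _ in cohort
                    if exit_time >= event_time and pid != case_id]
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        sampled_sets.append((case_id, event_time, controls))
    return sampled_sets


# Hypothetical five-person cohort: persons 0 and 2 are cases.
toy_cohort = [(0, 2.0, True), (1, 5.0, False), (2, 3.0, True),
              (3, 8.0, False), (4, 1.0, False)]
toy_sets = sample_ncc(toy_cohort)
```

Matching on a confounder or effect modifier would, under the same assumptions, restrict `risk_set` to individuals sharing the case's value of the matching variable.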
The association between the exposure and confounder was generated through the following function:

$$P(X_e = 1 \mid X_c) = \frac{\exp\big(\log(4) + \log(0.25)\,X_c\big)}{1 + \exp\big(\log(4) + \log(0.25)\,X_c\big)}$$

This association corresponds to a probability of 0.8 of being exposed if the binary confounder has value zero and a probability of 0.5 for a value of one. Denoting the coefficients by a vector β and the covariates of individual i as the ith row Xi of the design matrix X, the model is thus written as:

$$h_i(t \mid X, \beta) = h_0 \exp(\beta' X_i)$$

where Xi may also include a multiplicative interaction between the exposure and effect modifier. All individuals entered the cohort at time t = 0 and were followed until the event of interest or a censoring time (generated from an exponential distribution with a rate of 0.05), with a maximum follow-up time of t = 25. This setting resulted in approximately 2.6% of cases over the entire follow-up.
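The data-generating mechanism of the main setting can be sketched as follows, using the parameter values from Table 1. This is an illustrative reconstruction rather than the authors' R code: the function and key names are hypothetical, and event times are drawn from an exponential distribution, which is what a constant baseline hazard implies.

```python
import math
import random

H0 = 0.0005             # constant baseline hazard
CENSOR_RATE = 0.05      # rate of the exponential censoring distribution
MAX_FOLLOW_UP = 25.0    # administrative end of follow-up
BETA = {"e": 0.405, "irf": 0.020, "c": 0.0953, "em": 1.386, "interact": 0.69}


def exposure_probability(x_c):
    """P(Xe = 1 | Xc) from the logistic function given in the text."""
    linear = math.log(4) + math.log(0.25) * x_c
    return math.exp(linear) / (1.0 + math.exp(linear))


def simulate_individual(rng):
    x_c = 1 if rng.random() < 0.5 else 0
    x_e = 1 if rng.random() < exposure_probability(x_c) else 0
    x_em = 1 if rng.random() < 0.3 else 0
    x_irf = rng.gauss(0.0, 10.0)
    linear_predictor = (BETA["e"] * x_e + BETA["irf"] * x_irf + BETA["c"] * x_c
                        + BETA["em"] * x_em + BETA["interact"] * x_e * x_em)
    event_time = rng.expovariate(H0 * math.exp(linear_predictor))
    censor_time = min(rng.expovariate(CENSOR_RATE), MAX_FOLLOW_UP)
    return {"x_e": x_e, "x_irf": x_irf, "x_c": x_c, "x_em": x_em,
            "time": min(event_time, censor_time),
            "event": event_time <= censor_time}


def simulate_cohort(n, seed=1):
    rng = random.Random(seed)
    return [simulate_individual(rng) for _ in range(n)]
```

Calling `simulate_cohort(100000)` would produce one cohort of the size used in the paper; the proportion of cases should land in the neighbourhood of the reported 2.6%, though the exact figure depends on the random seed and on details of the original implementation that are not given in the text.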
Other simulation settings

To simulate more challenging settings, we modified the simulation described above by: (i) using more extreme values for the covariate coefficients; (ii) changing the strength and direction of the confounder-exposure association; (iii) adding a correlation between the independent risk factor and the confounder; (iv) having censoring depend on the effect modifier; (v) generating outcomes with lower prevalence; (vi) generating smaller cohorts; and (vii) combinations of these modifications.

Subgroups: definitions and statistical analyses

From the three covariates used in our simulation, we defined eight subgroups (two for the confounder, two for the effect modifier and four for the continuous risk factor categorized in quartiles). To analyse a subgroup (for example, the subgroup of ‘smokers’), we selected all individuals from the cohort or from the NCC sample who belonged to this subgroup. For each of the subgroups, the exposure coefficient βe was estimated from the restricted cohort using Cox regression, and from the restricted NCC data using both conditional logistic regression and weighted Cox regression. Results were compared with the estimates obtained from the 500 ‘full’ cohorts analysed with Cox regression and the 500 ‘full’ NCC samples analysed with both conditional logistic regression and weighted Cox regression. In all analyses, we used the fully specified model that generated the data. The efficiency of a subgroup analysis was estimated as the ratio of the variance reported by the corresponding cohort analysis to the variance reported by the subgroup regression analysis of the NCC data.

Supplementary analyses

For weighted Cox regression of matched NCC studies, it was previously noted that using unstratified or stratified weights did not influence the estimated exposure coefficient [20]. We investigated this assertion in the matched settings by using both types of weights. We assessed how well the unstratified weights performed in estimating the coefficient of interest βe.
We also calculated the ratio over time (from t = 0 to t = 25 in steps of 0.5) of the number of individuals recovered from the NCC study with the weighting system (stratified and unstratified), to the number of individuals in the cohort, and we report the mean and the standard deviation of this ratio over 500 realizations.

Application to real data

To investigate whether the gender of a non-Hodgkin’s lymphoma (NHL) patient is a risk factor for his/her siblings to develop NHL, we used a cohort of 15 727 siblings of all NHL patients, among whom there were 87 cases of NHL: 37 in brothers and 50 in sisters of patients. We analysed the full cohort and the two subgroups of brothers and sisters of the patients using Cox regression with age as the time scale. We sampled 100 unmatched NCC studies with two controls per case from the cohort, and analysed the full NCC data and the two subgroups with both conditional logistic regression and weighted Cox regression. Our models included the main exposure (gender of the patient), an independent risk factor (year of birth of the sibling) and an effect modifier (gender of the sibling) [21]. We also sampled 100 NCC studies matched on the sibling’s gender.

Software

All data management and analyses were performed with the R statistical package (version 3.2.2). To estimate the weights, we used the wpl function (multipleNCC package). We used the coxph function (survival package) to run the Cox and the weighted Cox regressions (with weights retrieved from the wpl object). The conditional logistic regressions were run with the clogit function.

Results

Unmatched simulation settings

The adjusted estimates of βe in our main (unmatched) simulation setting are presented in Table 2. All analyses provided unbiased estimates for βe, and the analysis of the subgroup defined by the effect modifier provided an unbiased estimate of βe + βinteract.
In all analyses, with full data or subgroup data, the weighted Cox regression was more efficient than the conditional logistic regression. The latter suffered a dramatic loss of efficiency when the analysis was restricted to subgroups (Table 2). All other unmatched settings provided similar results (data not shown). The smaller cohorts that we analysed included 18 000 individuals. The conditional logistic regression failed to converge in approximately 10% of the analyses of the smaller subgroups. Disregarding these realizations, the average conditional logistic regression estimates were close to the average cohort estimates and close to the simulated value of βe (Supplementary Table, available as Supplementary data at IJE online).

Table 2 Simulation results for cohorts including 100 000 individuals

| Analysis | βe | Model-based s.e. | Emp. s.e. | Efficiency (a) | n cases (b) | n controls (c,d) |
| --- | --- | --- | --- | --- | --- | --- |
| True value | 0.405 | | | | | |
| Full data | | | | | | |
| Cohort, Cox regression | 0.408 | 0.087 | 0.089 | 1 | 2656 | 97344 |
| NCC, Cond.log.reg | 0.408 | 0.099 | 0.099 | 0.772 | 2656 | 5312 |
| NCC, Weighted Cox | 0.407 | 0.094 | 0.095 | 0.845 | 2656 | 5213 |
| Subgroups defined by the binary confounder: Baseline level | | | | | | |
| Cohort subgroup, Cox regression | 0.401 | 0.153 | 0.157 | 1 | 1412 | 48586 |
| NCC subgroup, Cond.log.reg | 0.398 | 0.212 | 0.218 | 0.524 | 1058 | 1410 |
| NCC subgroup, Weighted Cox | 0.398 | 0.164 | 0.166 | 0.874 | 1412 | 2597 |
| Subgroups defined by the binary confounder: Level 1 | | | | | | |
| Cohort subgroup, Cox regression | 0.415 | 0.111 | 0.115 | 1 | 1244 | 48758 |
| NCC subgroup, Cond.log.reg | 0.412 | 0.159 | 0.166 | 0.484 | 935 | 1248 |
| NCC subgroup, Weighted Cox | 0.413 | 0.120 | 0.119 | 0.851 | 1244 | 2615 |
| Subgroups defined by the independent risk factor with four categories: Level 1 | | | | | | |
| Cohort subgroup, Cox regression | 0.407 | 0.189 | 0.194 | 1 | 570 | 26861 |
| NCC subgroup, Cond.log.reg | 0.429 | 0.351 | 0.366 | 0.295 | 271 | 314 |
| NCC subgroup, Weighted Cox | 0.406 | 0.201 | 0.206 | 0.879 | 570 | 1440 |
| Subgroups defined by the independent risk factor with four categories: Level 2 | | | | | | |
| Cohort subgroup, Cox regression | 0.427 | 0.184 | 0.177 | 1 | 613 | 24010 |
| NCC subgroup, Cond.log.reg | 0.437 | 0.362 | 0.364 | 0.265 | 266 | 304 |
| NCC subgroup, Weighted Cox | 0.426 | 0.198 | 0.194 | 0.860 | 613 | 1291 |
| Subgroups defined by the independent risk factor with four categories: Level 3 | | | | | | |
| Cohort subgroup, Cox regression | 0.415 | 0.174 | 0.177 | 1 | 663 | 23085 |
| NCC subgroup, Cond.log.reg | 0.444 | 0.350 | 0.359 | 0.254 | 278 | 316 |
| NCC subgroup, Weighted Cox | 0.417 | 0.189 | 0.188 | 0.843 | 663 | 1234 |
| Subgroups defined by the independent risk factor with four categories: Level 4 | | | | | | |
| Cohort subgroup, Cox regression | 0.406 | 0.157 | 0.156 | 1 | 809 | 23388 |
| NCC subgroup, Cond.log.reg | 0.416 | 0.308 | 0.326 | 0.263 | 343 | 391 |
| NCC subgroup, Weighted Cox | 0.404 | 0.174 | 0.171 | 0.817 | 809 | 1247 |
| Subgroups defined by the binary effect modifier: Baseline | | | | | | |
| Cohort subgroup, Cox regression | 0.408 | 0.090 | 0.091 | 1 | 696 | 69310 |
| NCC subgroup, Cond.log.reg | 0.406 | 0.121 | 0.117 | 0.550 | 637 | 987 |
| NCC subgroup, Weighted Cox | 0.407 | 0.097 | 0.095 | 0.857 | 696 | 3737 |
| True value | 0.405 + 0.690 | | | | | |
| Subgroups defined by the binary effect modifier: Level 1 | | | | | | |
| Cohort subgroup, Cox regression | 1.101 | 0.063 | 0.063 | 1 | 1959 | 28034 |
| NCC subgroup, Cond.log.reg | 1.106 | 0.118 | 0.117 | 0.287 | 977 | 1143 |
| NCC subgroup, Weighted Cox | 1.105 | 0.084 | 0.083 | 0.563 | 1959 | 1476 |

Adjusted exposure coefficients (βe), model-based standard errors (Model-based s.e.), empirical standard errors (Emp. s.e.), efficiency relative to the corresponding cohort analysis and mean number (n) of cases and controls included in the different analyses, for the main simulation setting described in the text. All numbers are averages over 500 runs. NCC, nested case-control; Cond.log.reg, conditional logistic regression.

a The efficiency is calculated as the squared ratio of the model-based standard error (s.e.) provided by the corresponding cohort analysis to the model-based s.e. provided by the NCC data analysis (conditional logistic regression or weighted Cox regression).

b,c The numbers of cases (b) and controls (c) reported for the conditional logistic regression in the subgroup analyses are those from sets with at least one control and one case. In addition, all numbers are averages over 500 runs and have been rounded so that the cases across different strata, or cases and controls in the subgroups of the cohort, sum up to the cohort totals ±1.

d The number of controls that are reported for the weighted Cox regression are the numbers of non-cases in the cohort at the end of the study.

We investigated the ability of the weighting method to recover the correct number of individuals at risk over time. On average, the weights reproduced the pattern of the number of individuals at risk in the full cohort and in each subgroup. The estimated numbers of individuals at risk were in excellent agreement with the corresponding actual numbers at risk in the cohort, with ratios always close to 1 (Figure 2). All unmatched settings provided similar patterns.

Figure 2 Performance of the weighting system in recovering the number of individuals in the cohort (unmatched NCC study). Ratio (with two standard deviations) of the numbers of individuals at risk over time recovered from the unmatched NCC data with the weighting system, to the numbers of individuals at risk in the cohort, presented for the full data and for the two subgroups defined by the binary confounder.
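The ratio summarized in Figure 2 can be sketched as follows: at each time point, the weights of the sampled individuals still at risk are summed and divided by the actual number at risk in the cohort. This is an illustrative Python sketch rather than the authors' code; the function name, the assumption that everyone enters at time 0, and the toy numbers are all hypothetical. In the study, the mean and standard deviation of this ratio were taken over 500 realizations.

```python
def at_risk_ratio(times, cohort_exit, sample_exit, sample_weight):
    """Weighted number at risk in the NCC sample over the cohort number at risk.

    cohort_exit:   end of follow-up for every cohort member (entry at time 0)
    sample_exit:   end of follow-up for each distinct subject in the NCC sample
    sample_weight: inverse inclusion probability of each NCC subject
    """
    ratios = []
    for t in times:
        n_cohort = sum(1 for x in cohort_exit if x >= t)
        n_weighted = sum(w for x, w in zip(sample_exit, sample_weight) if x >= t)
        # Undefined once nobody is left at risk in the cohort.
        ratios.append(n_weighted / n_cohort if n_cohort else float("nan"))
    return ratios


# Time grid described in the text: t = 0 to t = 25 in steps of 0.5.
grid = [0.5 * k for k in range(51)]

# Hypothetical toy data: six cohort members, three of them in the NCC sample.
toy_ratios = at_risk_ratio(
    times=[0, 6],
    cohort_exit=[3, 5, 8, 10, 12, 20],
    sample_exit=[5, 10, 20],
    sample_weight=[1.0, 3.0, 2.0],
)
```

A ratio near 1 at every time point indicates that the weighted sample reproduces the cohort risk sets; applying the same function to a subgroup only requires restricting both the cohort and the sample to that subgroup first.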
Matched simulation settings

Similar to the unmatched situation, all analyses provided unbiased estimates of βe or of βe + βinteract. In contrast to the unmatched situation, for the two subgroups defined by the confounder (i.e. the matching factor), the efficiency of the conditional logistic regression was similar to that of the weighted Cox regression. The results were similar when matching on the effect modifier. On average, the stratified weights reproduced the pattern of the number of individuals at risk in the cohort with similar performance to the unmatched situation, whereas the unstratified weights performed poorly in the subgroups defined by the matching factor (Figure 3). Despite this, using the unstratified weights provided valid estimates of βe in all subgroups, whether or not defined by the matching variable (data not shown).

Figure 3 Performance of the weighting system in recovering the number of individuals in the cohort (matched NCC study). Ratio (with two standard deviations) of the numbers of individuals at risk over time recovered with unstratified weights (upper panel) and stratified weights (lower panel) from NCC data that was matched on the confounder, to the actual numbers of individuals at risk in the cohort.

Even when we challenged the weighted method using more extreme situations, the estimates from the weighted Cox analysis of the NCC data, overall and in subgroups, were unbiased when using unstratified weights (Figure 4).

Figure 4 Coefficient estimates with 95% confidence interval. Coefficient estimates with 95% confidence interval provided by the Cox regression analysis of the full cohort and subgroups (solid line), and weighted Cox regression analysis of the corresponding NCC data with unstratified weights despite stratified sampling (dashed line) in an extreme setting (βirf = 0.95 with censoring associated with the effect modifier).

Real data application

The results from the 100 unmatched NCC studies, sampled within a cohort of siblings of NHL patients, are presented in Figure 5, where we coded the exposure of interest using an indicator variable for female patients. Analysis of the subgroups defined by the sex of the sibling provided unbiased estimates for the exposure effects: βe for brothers of the patient and βe + βinteract for sisters, and the weighted Cox regression was more efficient than the conditional logistic regression.
A sibling of a female NHL patient had a slightly increased risk of developing NHL compared with a sibling of a male patient, but this risk was modified by the gender of the sibling, so that the hazard ratio for sisters was more than double that for brothers (e^0.979 = 2.66 for sisters versus e^0.146 = 1.16 for brothers). Similarly, the subgroup analyses provided unbiased estimates of exposure coefficients when matching on the effect modifier (data not shown).

Figure 5 Coefficient estimates with 95% confidence interval in the study of NHL. Estimates of the effect of sex of patient on the risk of NHL in siblings, with 95% confidence intervals. Estimates from Cox regression analyses of the full cohort and the two subgroups defined by the sex of sibling (solid lines), and average estimates obtained from 100 NCC samples, analysed by conditional logistic regression (dashed lines), and weighted Cox regression (dotted lines).

Discussion

In a series of simulation studies and in a real data application, we have shown that analyses of subgroups of NCC data provide unbiased estimates of the exposure-outcome association, regardless of whether the subgroup is defined by an independent risk factor, a matched or unmatched confounder or an effect modifier. We also showed that using a conditional logistic regression to analyse the subgroup is less efficient than using a weighted Cox regression.
Exploring the weighting system in our simulated data, we found that on average the weights succeeded in recovering the pattern of risk-set sizes over time, not only in the cohort but in all subgroups, for both unmatched and matched designs, provided stratified weights were used for the latter. More importantly, when unstratified weights were used for a matched design, weighted Cox regression analyses still provided unbiased estimates of the exposure coefficient βe, confirming what was already reported by Støer et al.20

Whereas the topic of subgroup analyses is abundantly documented for randomized controlled trials,22 we did not find any study addressing the question of the validity of such analyses for NCC studies. One might argue that such subgroup analyses are not of interest, as a subgroup of interest can be accommodated by including an interaction term in the analysis of the full data.22 However, there may be good reasons for investigating a restricted population in more detail. For example, the absence of an overall treatment effect does not necessarily imply that an intervention does not have any effect in any of the subgroups:23 the safety and effectiveness of a drug may depend on patients’ characteristics and/or co-medication,24 as in the case of concomitant exposure to a proton pump inhibitor and clopidogrel after acute myocardial infarction.24,25 Our goal is thus to reassure researchers who do such analyses with NCC data, but not to promote overuse of subgroup analyses.

The conditional logistic regression was much less efficient than the weighted Cox regression in analysing subgroups. Since conditional logistic regression is performed on matched sets, the number of sets including at least one case and one control can decrease dramatically if the subgroup is not defined by a matched covariate. This decrease depends on the association between the outcome and the covariate and the prevalence of the various levels of the covariate, as well as the prevalence of exposure.
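The loss of matched sets described above can be illustrated with a small Python sketch: after restriction to a subgroup, a set remains usable by conditional logistic regression only if its case and at least one of its controls belong to that subgroup. The data layout (one tuple per sampled subject), the function name and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def informative_sets(rows, subgroup):
    """Ids of matched sets still usable after restriction to one subgroup.

    rows: iterable of (set_id, is_case, group) tuples, one per sampled subject
    """
    n_cases = defaultdict(int)
    n_controls = defaultdict(int)
    for set_id, is_case, group in rows:
        if group != subgroup:
            continue  # subject lies outside the subgroup of interest
        if is_case:
            n_cases[set_id] += 1
        else:
            n_controls[set_id] += 1
    # Keep a set only if both the case and at least one control remain.
    return sorted(s for s in n_cases if n_controls[s] > 0)


# Hypothetical NCC sample: three sets, each with one case and two controls.
rows = [
    (1, True, "smoker"), (1, False, "smoker"), (1, False, "non-smoker"),
    (2, True, "smoker"), (2, False, "non-smoker"), (2, False, "non-smoker"),
    (3, True, "non-smoker"), (3, False, "smoker"), (3, False, "smoker"),
]
kept_smokers = informative_sets(rows, "smoker")
kept_non_smokers = informative_sets(rows, "non-smoker")
```

In this toy sample only one of the three sets survives restriction to smokers and none survives restriction to non-smokers, whereas a weighted analysis breaks the matching and can use every sampled subgroup member with its weight.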
In our main simulation setting, the number of cases retained in the conditional logistic regression analysis could be two to three times smaller than the number in the weighted Cox regression analysis (Table 2). These numbers will be further reduced in the analysis, as sets whose members all have the same exposure value do not contribute to the conditional likelihood. As a consequence, when subgroups are small, the conditional logistic regression is more likely to face convergence problems (Supplementary Table).

A limitation of our study is that our simulated data were complete (no incomplete data in the ‘collected’ information on covariates), a situation seldom met in reality. In addition, as all these analyses used the correctly and fully specified model (i.e. all variables used to generate the data were used in the analyses), we did not encounter the bias that can arise from model misspecification.26,27 Similarly, when the censoring scheme depended on the effect modifier, bias in the exposure estimates was avoided by including this variable in our model.28,29

In conclusion, when there is good reason to analyse a specific subgroup from NCC data, it is valid to do so, but we recommend performing the analysis with a weighted Cox regression to optimize efficiency.

Supplementary Data

Supplementary data are available at IJE online.

Funding

This research was supported by the Swedish Cancer Society (Cancerfonden) through the grant CAN 2015/493.

Conflict of interest: None declared.

References

1 Thomas DC. Addendum in: Liddell JR, McDonald JC, Thomas DC. Methods of cohort analysis: Appraisal by application to asbestos mining. J R Stat Soc A 1977;140:469–91.
2 Borgan O, Samuelsen SO. A review of cohort sampling designs for Cox’s regression model: potentials in epidemiology. Nor Epidemiol 2003;3:239–48.
3 Bertke S, Hein M, Schubauer-Berigan M, Deddens J. A simulation study of relative efficiency and bias in the nested case-control study design. Epidemiol Methods 2013;2:85–93.
4 Ohneberg K, Wolkewitz M, Beyersmann J et al. Analysis of clinical cohort data using nested case-control and case-cohort sampling designs: a powerful and economical tool. Methods Inf Med 2015;54:505–14.
5 Joubert BR, Lange EM, Franceschini N, Mwapasa V, North KE, Meshnick SR; NIAID Center for HIV/AIDS Vaccine Immunology. A whole genome association study of mother-to-child transmission of HIV in Malawi. Genome Med 2010;2:17.
6 Läärä E. Study designs for biobank-based epidemiologic research on chronic diseases. In: Dillner J (ed). Methods in Biobanking. New York, NY: Springer, 2011.
7 Li H, Beeghly-Fadiel A, Wen W et al. Gene-environment interactions for breast cancer risk among Chinese women: a report from the Shanghai Breast Cancer Genetics Study. Am J Epidemiol 2013;177:161–70.
8 Song H, Michel A, Nyren O, Ekstrom AM, Pawlita M, Ye W. A CagA-independent cluster of antigens related to the risk of noncardia gastric cancer: Associations between Helicobacter pylori antibodies and gastric adenocarcinoma explored by multiplex serology. Int J Cancer 2014;134:2942–50.
9 Schoemaker MJ, Folkerd EJ, Jones ME et al. Combined effects of endogenous sex hormone levels and mammographic density on postmenopausal breast cancer risk: results from the Breakthrough Generations Study. Br J Cancer 2014;110:1898–907.
10 Boursi B, Mamtani R, Haynes K, Yang YX. Pernicious anemia and colorectal cancer risk - A nested case-control study. Dig Liver Dis 2016;48:1386–90.
11 Liu HC, Yang SY, Liao YT, Chen CC, Kuo CJ. Antipsychotic medications and risk of acute coronary syndrome in schizophrenia: a nested case-control study. PLoS One 2016;11:e0163533.
12 Kim G, Jang SY, Han E et al. Effect of statin on hepatocellular carcinoma in patients with type 2 diabetes: A nationwide nested case-control study. Int J Cancer 2017;140:798–806.
13 Devore EE, Warner ET, Eliassen H et al. Urinary melatonin in relation to postmenopausal breast cancer risk according to melatonin 1 receptor status. Cancer Epidemiol Biomarkers Prev 2017;26:413–19.
14 Delcoigne B, Colzani E, Prochazka M et al. Breaking the matching in nested case-control data offered several advantages for risk estimation. J Clin Epidemiol 2017;82:79–86.
15 Pearce N. Analysis of matched case-control studies. BMJ 2016;352:i969.
16 Samuelsen S. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 1997;84:379–94.
17 Borgan O, Samuelsen SO. Nested case-control and case-cohort studies. In: Klein JP, van Houwelingen HC, Ibrahim JG et al. (eds). Handbook of Survival Analysis. Boca Raton, FL: Chapman & Hall/CRC, 2013.
18 Delcoigne B, Hagenbuch N, Schelin ME et al. Feasibility of reusing time-matched controls in an overlapping cohort. Stat Methods Med Res 2016, Sep 21. pii:0962280216669744.
19 Grantzau T, Thomsen MS, Væth M, Overgaard J. Risk of second primary lung cancer in women after radiotherapy for breast cancer. Radiother Oncol 2014;111:366–73.
20 Støer N, Samuelsen S. Inverse probability weighting in nested case-control studies with additional matching - a simulation study. Stat Med 2013;32:5328–39.
21 Lee M, Rebora P, Valsecchi MG, Czene K, Reilly M. A unified model for estimating and testing familial aggregation. Stat Med 2013;32:5353–65.
22 Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med 2002;21:2917–30.
23 Abrahamowicz M, Beauchamp ME, Fournier P, Dumont A. Evidence of subgroup-specific treatment effect in the absence of an overall effect: is there really a contradiction? Pharmacoepidemiol Drug Saf 2013;22:1178–88.
24 Greenfield S, Kravitz R, Duan N et al. Heterogeneity of treatment effects: Implications for guidelines, payment, and quality assessment. Am J Med 2007;120:3–9.
25 Juurlink DN, Gomes T, Ko DT et al. A population-based study of the drug interaction between proton pump inhibitors and clopidogrel. Can Med Assoc J 2009;180:713–18.
26 Ford I, Norrie J, Ahmadi S. Model inconsistency, illustrated by the Cox proportional hazard model. Stat Med 1995;14:735–46.
27 Lin DY, Psaty BM, Kronmal RA. Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 1998;54:948–63.
28 Jackson D, White IR, Seaman S, Evans H, Baisley K, Carpenter J. Relaxing the independent censoring assumption in the Cox proportional hazards model using multiple imputation. Stat Med 2014;33:4681–94.
29 O’Quigley J, Xu R. Robustness of proportional hazards regression. In: Klein JP, van Houwelingen HC, Ibrahim JG et al. (eds). Handbook of Survival Analysis. Boca Raton, FL: Chapman & Hall/CRC, 2013.

© The Author(s) 2018; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

International Journal of Epidemiology, Oxford University Press

Published: Jan 29, 2018
