American Journal of Epidemiology, Volume 187 (3) – Mar 1, 2018

10 pages


- Publisher: Oxford University Press
- Copyright: © The Author(s) 2018. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
- ISSN: 0002-9262
- eISSN: 1476-6256
- DOI: 10.1093/aje/kwx248

Abstract

Propensity score methods are a popular tool with which to control for confounding in observational data, but their bias-reduction properties (as well as internal validity, generally) are threatened by covariate measurement error. There are few easy-to-implement methods of correcting for such bias. In this paper, we describe and demonstrate how existing sensitivity analyses for unobserved confounding (propensity score calibration, VanderWeele and Arah’s bias formulas, and Rosenbaum’s sensitivity analysis) can be adapted to address this problem. In a simulation study, we examine the extent to which these sensitivity analyses can correct for several measurement error structures: classical, systematic differential, and heteroscedastic covariate measurement error. We then apply these approaches to address covariate measurement error in estimating the association between depression and weight gain in a cohort of adults in Baltimore, Maryland. We recommend the use of VanderWeele and Arah’s bias formulas and propensity score calibration (assuming it is adapted appropriately for the measurement error structure), as both approaches perform well for a variety of propensity score estimators and measurement error structures.

Keywords: confounding factors (epidemiology), measurement error, propensity score, unobserved confounding

Propensity score methods are a popular tool in the analysis of observational data (1). However, just as with any parametric analytical approach, the theory justifying their use assumes that the covariates included are measured without error and in the same way across treatment groups. In reality, covariates are often measured with error and may be measured differently (e.g., by different instruments) or with differential measurement error across treatment groups.
For example, disability measures from instruments such as the Instrumental Activities of Daily Living scale (2) and the Physical Component Summary of the Medical Outcomes Study 36-Item Short Form Health Survey (SF-36-Phys) (3) are mismeasured versions of the true, latent construct of disability. Such measures may not only be “noisier” versions of the truth but may differ between treatment groups (e.g., Instrumental Activities of Daily Living in the intervention group and SF-36-Phys in the nonintervention group). Ignoring these measurement issues may lead to incorrect effect estimates. In fact, Steiner et al. (4) showed that using covariates measured with error (in which the mismeasured covariates were noisier versions of the true unobserved covariates) attenuated the bias-reducing properties of propensity score methods (including subclassification, weighting, and regression with the propensity score as a covariate). Although covariate measurement error is likely in propensity score models—particularly in public health and the social sciences—there has been little research into methods that can correct for such error when using propensity score approaches (5), with a few exceptions (6–9). The number of strategies available could be increased by recognizing the link between covariate measurement error and unobserved confounding and adapting methods that assess sensitivity to an unobserved confounder to address covariate measurement error when using a propensity score approach. 
In this paper, we aim to: 1) explicate the link between covariate measurement error and unobserved confounding; 2) adapt existing, easy-to-implement sensitivity analyses for unobserved confounding to address covariate measurement error in propensity score methods; 3) describe scenarios under which each approach may be appropriate; 4) evaluate their performance in a limited simulation study; and 5) apply these approaches to address covariate measurement error in estimating the association between depression and weight gain in a cohort of adults in Baltimore, Maryland. Most discussions of measurement error in the literature have focused on classical measurement error, which is nondifferential and homoscedastic. However, measurement error that is differential by treatment status may be especially pertinent in a propensity score context; for example, when propensity scores are used to match untreated subjects from one study or population to treated subjects in an intervention study (10). Consequently, we will specify how the approaches considered herein can be applied to classical measurement error as well as to measurement error that is differential by treatment status in terms of both the location and scale parameters.

NOTATION, ESTIMANDS, AND ASSUMPTIONS

Let observed data O = (Z, A, W, Y) and complete data C = (Z, X, A, W, Y), where Z is an observed continuous covariate measured without error; X is an unobserved continuous covariate measured without error; A is an observed binary (0/1) variable indicating treatment, measured without error; W is an observed mismeasured version of X; and Y is an observed continuous outcome measured without error.

Measurement error scenarios

We consider additive, non-Berkson measurement error of the form W = X + U, where U ∼ N(f(O), σ²f²(O)). We consider 3 measurement error scenarios. First, in classical measurement error, W is an unbiased, noisier version of X: U ∼ N(0, σ²).
An example of classical measurement error is blood pressure, where readings from an automatic blood pressure cuff are noisier versions of the true value (11). Second, we consider measurement error that is differential by treatment status in the location parameter: U ∼ N(f(A, X), σ²), subsequently referred to as systematic differential measurement error. An example of this could be blood pressure measured by means of 2 different automatic cuffs, one for each of 2 treatment groups, that are not calibrated to each other. Third, we consider measurement error that is differential by treatment status in the scale parameter: U ∼ N(0, σ²f²(A, X)), subsequently referred to as heteroscedastic measurement error. An example of this could be blood pressure measured using a manual sphygmomanometer in the intervention group but using an automatic blood pressure cuff in the nonintervention group. In such a scenario, we might expect more variability in the nonintervention group than in the intervention group. Figure 1 depicts each of the 3 measurement error scenarios, and Figure 2 compares the naive and true propensity scores under each scenario.

Figure 1. Scatterplots of the covariate measured without error, X, and the covariate measured with error, W, by treatment status (depressed (1) vs. not depressed (0)) for each of the following measurement error scenarios: A) classical measurement error, B) systematic differential measurement error, and C) heteroscedastic measurement error.

Figure 2. Scatterplots of the true propensity score (when the unobserved covariate measured without error, X, is used) and the naive propensity score (when the observed covariate measured with error, W, is used) by treatment status (depressed (1) vs. not depressed (0)) for each of the following measurement error scenarios: A) classical measurement error, B) systematic differential measurement error, and C) heteroscedastic measurement error.

Estimands and assumptions

We consider 2 estimands of interest: the average treatment effect (ATE), E(Y(1) − Y(0)), and the conditional ATE, E(Y(1) − Y(0) | Z, X), where Y(a) is the counterfactual outcome setting A = a and the expectations are taken across all i individuals. If X is observed, identification of both estimands relies on the following assumptions. First, we assume strongly ignorable treatment assignment: For each a ∈ {0, 1}, we have Y(a) ⫫ A | X, Z under positivity: 0 < P(A = 1) < 1. We also assume consistency: For each a ∈ {0, 1}, we have Y(a) = Y on the event A = a. Finally, we make the stable unit treatment value assumption: There is one version of each treatment condition, and the treatment assignment of individual i does not influence the potential outcome in another individual. However, if we only observe a mismeasured version of X, designated W, then the estimands are not identifiable because the strongly ignorable treatment assignment assumption is not met.
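The three additive error structures defined earlier in this section (W = X + U with classical, systematic differential, or heteroscedastic U) can be simulated with a few lines of Python. This is a minimal sketch, not the authors' simulation code (theirs is in the Web Appendices): the location shift of −0.5 in the treated group and the doubling of the error standard deviation in the untreated group are hypothetical choices for f(A, X), picked only to make the three scenarios visibly different.

```python
import random
import statistics


def measurement_error(a, scenario, sigma, rng):
    """Draw U for one subject with treatment a under the named scenario."""
    if scenario == "classical":
        # U ~ N(0, sigma^2): same error distribution in both arms
        return rng.gauss(0.0, sigma)
    if scenario == "systematic":
        # Location differs by arm; the shift of -0.5 * a is a hypothetical f(A, X)
        return rng.gauss(-0.5 * a, sigma)
    if scenario == "heteroscedastic":
        # Scale differs by arm; doubling the SD when a == 0 is a hypothetical f(A, X)
        return rng.gauss(0.0, sigma * (2.0 if a == 0 else 1.0))
    raise ValueError(f"unknown scenario: {scenario}")


def summarize(scenario, n=20000, seed=0):
    """Return {arm: (mean of U, SD of U)} for simulated data W = X + U."""
    rng = random.Random(seed)
    errors = {0: [], 1: []}
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)             # true covariate X
        a = 1 if rng.random() < 0.5 else 0  # treatment A
        w = x + measurement_error(a, scenario, 1.0, rng)
        errors[a].append(w - x)             # recover U = W - X
    return {a: (statistics.mean(u), statistics.stdev(u)) for a, u in errors.items()}


for name in ("classical", "systematic", "heteroscedastic"):
    print(name, summarize(name))
```

Under these assumed settings, the mean of U differs by arm only in the systematic scenario, and the spread of U differs by arm only in the heteroscedastic scenario.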
Below we describe in more detail how using W instead of X in an estimator fails to completely control for confounding.

COVARIATE MEASUREMENT ERROR AND UNOBSERVED CONFOUNDING

In curricula and the literature, measurement error and unobserved confounding (also called omitted variable bias) (12) are usually discussed as threats to valid causal inference, but typically as separate topics without consideration of their intersection in the case of covariate measurement error. There are some notable exceptions (4, 13–16). The equivalency between covariate measurement error and unobserved confounding can be seen through their impact on the assumption of ignorable treatment assignment. Let {X, Z} be the set of confounding variables, consisting of the subset of observed confounding variables, {W, Z}, and unobserved confounding variables, ∆. When there is unobserved confounding (17), This can be recast in measurement error terms, using the notation defined previously: The equivalency can also be seen through directed acyclic graphs, as described in Hernán and Robins (13). We can rewrite Hernán and Robins’ Figure 9.8 as in Figure 3. This measurement error directed acyclic graph is easily recast as an unobserved confounding directed acyclic graph, because we see that X is a confounder and it is unobserved.

Figure 3. Directed acyclic graph representing measurement error and unobserved confounding. A, treatment; X, unobserved (in the main data) covariate measured without error; W, observed, mismeasured version of X; Z, observed covariate measured without error; Y, outcome.
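The two displayed conditions that this section refers to (one for unobserved confounding, one for its measurement error counterpart) do not appear in the text above. One plausible way to write them, assuming they mirror the strongly ignorable treatment assignment condition stated in the notation section, is sketched below. This is a reconstruction from the surrounding definitions, not a quotation of the original equations.

```latex
% Unobserved confounding: ignorability fails given the observed confounders {W, Z}
% but holds once the unobserved confounders \Delta are also conditioned on.
Y(a) \not\perp\!\!\!\perp A \mid W, Z
\qquad \text{but} \qquad
Y(a) \perp\!\!\!\perp A \mid W, Z, \Delta

% Measurement error: the same statement with the error-free covariate X
% playing the role of the unobserved information.
Y(a) \not\perp\!\!\!\perp A \mid W, Z
\qquad \text{but} \qquad
Y(a) \perp\!\!\!\perp A \mid X, Z
```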
SENSITIVITY ANALYSES

We now describe several easy-to-implement sensitivity analyses for unobserved confounding and describe how they can be adapted for measurement error. Sample software code is presented in Web Appendices 1 and 2 (available at https://academic.oup.com/aje).

Propensity score calibration

Propensity score calibration (PSC) has been described previously as a method for reducing bias due to unobserved confounding (18). It uses a validation data set that contains the treatment variable, A, and all covariates present in the propensity score model, including covariates measured without error, {X, Z}, and the subset of those variables that are also measured with error, W (see Table 1). The validation data set does not need to include Y, so it can be cross-sectional. PSC is similar to regression calibration (19) except that instead of modeling the mismeasured and correctly measured covariates, the naive (using mismeasured covariates) and true (using correctly measured covariates) propensity scores are modeled in the validation subset and then extrapolated to calibrate the naive propensity score in the main study data set. The authors of the method, Stürmer et al. (20), state that the calibrated propensity score can be used in any propensity score approach, including matching, subclassification, and controlling for the propensity score in an outcome regression model. We found empirical support for this statement in our simulations. However, we found that using this method in an inverse-probability-of-treatment-weighted estimator increased rather than decreased bias due to measurement error (results available upon request). Standard errors should be estimated by bootstrapping to propagate uncertainty from the calibration procedure.

Table 1. Variables Present in Each of the Validation and Main Data Sets (a)

| Data Set | A | X | W | Z | Y |
|---|---|---|---|---|---|
| Validation | Yes | Yes | Yes | Yes | |
| Main | Yes | | Yes | Yes | Yes |

(a) A, treatment; X, unobserved (in the main data) covariate measured without error; W, observed, mismeasured version of X; Z, observed covariate measured without error; Y, outcome.

PSC makes the following assumptions: 1) that a validation data set exists that contains A and all covariates, including versions measured with and without error, {X, W, Z}; 2) that the true propensity score is a linear function of the mismeasured propensity score, treatment, and any other covariates in the validation data set; 3) that the PSC model in the validation data set generalizes to the main data set; and 4) that the naive propensity score, e_naive, is a surrogate for the true propensity score, e_true. The surrogacy assumption is violated if the naive propensity score contains additional information about the outcome that is not contained in the true propensity score. Perhaps contrary to initial intuition, this is a restrictive assumption, as “surrogacy may not be as natural or credible for the propensity score as it is for measurement error” (21, p. 1295). Moreover, there is no formal test for this assumption in the PSC setting (20). Assumption 2 may also be problematic. PSC fits the model E(e_true | A, e_naive) using linear regression (18), an assumption of which is constant variance of the residuals. However, in the differential measurement error scenarios we consider, the residuals’ variance varies by A. Consequently, the approach is theoretically only appropriate for classical measurement error (18) (see Table 2). However, in simulations (detailed in Web Appendix 3 and Web Table 1; results shown in Table 3), we find that it has similar performance in the case of heteroscedastic measurement error.
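The calibration step itself, fitting E(e_true | A, e_naive) in the validation data and applying the fitted model to the naive propensity scores, can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation (their R and SAS code is in Web Appendices 1 and 2). It assumes the naive and true propensity scores have already been estimated, and all function names and the toy data are hypothetical. The optional weights show one way to let the residual variance differ by strata of A (inverse residual variance within each stratum), which is an assumption about how a weighted fit might be set up.

```python
import random


def solve_linear(m, v):
    """Solve m x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_calibration(a, e_naive, e_true, weights=None):
    """(Weighted) least squares fit of e_true on an intercept, A, and e_naive."""
    if weights is None:
        weights = [1.0] * len(a)
    rows = [[1.0, float(ai), en] for ai, en in zip(a, e_naive)]
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(3)]
            for i in range(3)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, e_true))
            for i in range(3)]
    return solve_linear(xtwx, xtwy)


def calibrate(a, e_naive, coef):
    """Predict the calibrated propensity score from A and the naive score."""
    return [coef[0] + coef[1] * ai + coef[2] * en for ai, en in zip(a, e_naive)]


def stratum_weights(a, e_naive, e_true, coef):
    """Inverse residual variance within each stratum of A (hypothetical weighting)."""
    fitted = calibrate(a, e_naive, coef)
    resid = {0: [], 1: []}
    for ai, et, fi in zip(a, e_true, fitted):
        resid[ai].append(et - fi)
    var = {k: sum(r * r for r in v) / len(v) for k, v in resid.items()}
    return [1.0 / var[ai] for ai in a]


# Toy validation data (made up): residual spread differs by treatment arm.
rng = random.Random(1)
a_val = [i % 2 for i in range(400)]
naive_val = [rng.uniform(0.1, 0.9) for _ in a_val]
true_val = [0.1 + 0.05 * ai + 0.8 * en + rng.gauss(0.0, 0.01 if ai == 0 else 0.03)
            for ai, en in zip(a_val, naive_val)]

ols_coef = fit_calibration(a_val, naive_val, true_val)
wls_coef = fit_calibration(a_val, naive_val, true_val,
                           stratum_weights(a_val, naive_val, true_val, ols_coef))

# Apply the validation-data model to (hypothetical) main-study naive scores.
calibrated_main = calibrate([1, 0], [0.4, 0.6], wls_coef)
```

In practice the calibrated score would then replace the naive score in matching, subclassification, or outcome regression, with bootstrapped standard errors as the text recommends.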
To relax the assumption of constant variance of the residuals in the case of systematic differential measurement error, we modify the algorithm to use weighted least squares (WLS), allowing the variance to differ by strata of A (sample code is provided in Web Appendices 1 and 2).

Table 2. Overview of Sensitivity Analyses, Including the Applicable Estimands and Measurement Error Structures for Each Sensitivity Analysis

| Sensitivity Analysis | Estimand | Classical | Differential: Systematic | Differential: Heteroscedastic |
|---|---|---|---|---|
| PSC (18), modified approach | Marginal ATE, conditional ATE | Yes | (Yes) | (Yes) (a) |
| Rosenbaum’s sensitivity analysis (23) | Test statistic, P value | Yes | (Yes) | |
| VanderWeele and Arah’s bias formula (22) | Marginal ATE, conditional ATE | Yes | Yes | (Yes) |

Abbreviations: ATE, average treatment effect; PSC, propensity score calibration.
(a) In the case of systematic differential error, “(Yes)” means that the sensitivity analysis can be applied if the weighted least squares adaptation discussed in the text is used. In the case of heteroscedastic measurement error, “(Yes)” means that the sensitivity analysis can be applied with little practical consequences, even though it is not theoretically appropriate.

Table 3. Simulation Results According to Sensitivity Analysis and Measurement Error Structure (a)

Column groups: Naive, naive ATE estimate; True, true treatment effect estimate; Corrected, corrected ATE estimate. Cov., 95% CI coverage.

| Sensitivity Analysis | Naive % Bias | Naive Variance | Naive Cov. | Naive MSE | True % Bias | True Variance | True Cov. | True MSE | Corrected % Bias | Corrected Variance | Corrected Cov. | Corrected MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classical measurement error | | | | | | | | | | | | |
| Rosenbaum (23) | −99.7 | 0.001 | | 0.290 | 0.0 | 0.090 | | 0.090 | −54.5 | 0.060 | | 0.146 |
| VanderWeele and Arah (22) | −43.5 | 0.004 | 0.0 | 1.710 | 0.0 | 0.004 | 92.8 | 0.004 | −0.3 | 0.001 | 92.8 | 0.002 |
| PSC (18), least squares | −43.2 | 0.003 | 0.0 | 1.684 | 0.0 | 0.001 | 94.5 | 0.001 | −9.6 | 0.053 | 74.3 | 0.139 |
| Systematic differential measurement error | | | | | | | | | | | | |
| Rosenbaum | −100.0 | 0.000 | | 0.292 | 0.0 | 0.085 | | 0.085 | 95.8 | 0.004 | | 0.272 |
| VanderWeele and Arah | −84.7 | 0.002 | 0.0 | 6.454 | 0.0 | 0.004 | 92.8 | 0.004 | −1.1 | 0.001 | 83.3 | 0.002 |
| PSC, least squares | −83.6 | 0.001 | 0.0 | 6.291 | 0.0 | 0.001 | 94.5 | 0.001 | −33.9 | 0.077 | 14.2 | 1.154 |
| PSC, WLS | | | | | | | | | −4.7 | 0.716 | 96.3 | 0.446 |
| Heteroscedastic measurement error | | | | | | | | | | | | |
| Rosenbaum | −99.8 | 0.001 | | 0.290 | 0.0 | 0.085 | | 0.085 | −64.3 | 0.046 | | 0.169 |
| VanderWeele and Arah | −36.2 | 0.004 | 0.0 | 1.181 | 0.0 | 0.004 | 92.8 | 0.004 | −0.2 | 0.002 | 93.2 | 0.002 |
| PSC, least squares | −35.8 | 0.003 | 0.0 | 1.159 | 0.0 | 0.001 | 94.5 | 0.001 | −7.6 | 0.040 | 77.2 | 0.094 |
| PSC, WLS | | | | | | | | | −10.8 | 0.126 | 80.9 | 0.255 |

Abbreviations: ATE, average treatment effect; CI, confidence interval; MSE, mean squared error; PSC, propensity score calibration; WLS, weighted least squares.
(a) Performance measures included percent bias, variance, 95% confidence interval coverage, and mean squared error. For each measurement error structure, the naive analysis approach was compared with the truth and with the sensitivity analysis approach that attempted to correct the covariate measurement error.

VanderWeele and Arah’s sensitivity analysis

The VanderWeele and Arah sensitivity analysis for unobserved confounding has been described previously (22). Briefly, VanderWeele and Arah (22) provide formulas with which to calculate the bias caused by unobserved confounding in estimating conditional or marginal effects. These bias formulas involve setting values for various sensitivity parameters, the number of which depends on the simplifying assumptions made. Without internal validation data containing complete data, C, we may have little information with which to guess reasonable values for each parameter. However, using available internal and external sources as a guide, one could explore a matrix of reasonable combinations, identifying those which result in a change in inference. The bias formulas are estimand-specific but can be used for any method of estimation. For example, the same bias formula for the ATE can be used regardless of whether propensity score matching or inverse-probability-of-treatment weighting is used. The standard error of the bias-corrected ATE is the same as the standard error of the biased ATE if the parameters in the bias formula do not vary by strata of covariates. Otherwise, standard errors can be estimated by bootstrapping.
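Exploring a matrix of sensitivity parameter combinations, as suggested above, is easy to script. The sketch below assumes the simplified form of the bias formula, in which the bias of the naive estimate is the product of two sensitivity parameters: gamma, the difference in mean outcome per unit of the unmeasured component U (holding treatment and observed covariates fixed), and delta, the difference in the mean of U between treated and untreated subjects given the observed covariates. That simplification requires, among other things, that neither parameter varies across covariate strata. The naive estimate of 0.30 and the parameter grids are made-up values, and flagging a sign change is only one crude way to define a "change in inference".

```python
def bias_simplified(gamma, delta):
    """Bias of the naive estimate under the simplified formula: gamma * delta."""
    return gamma * delta


def corrected_estimate(naive, gamma, delta):
    """Bias-corrected effect estimate: naive estimate minus the assumed bias."""
    return naive - bias_simplified(gamma, delta)


def explore(naive, gammas, deltas):
    """Evaluate every (gamma, delta) pair and flag sign changes of the estimate."""
    rows = []
    for g in gammas:
        for d in deltas:
            corrected = corrected_estimate(naive, g, d)
            rows.append({
                "gamma": g,
                "delta": d,
                "corrected": corrected,
                "sign_flips": (corrected > 0) != (naive > 0),
            })
    return rows


# Hypothetical naive ATE and hypothetical grids of sensitivity parameters.
grid = explore(naive=0.30, gammas=[0.0, 0.5, 1.0], deltas=[-0.5, 0.0, 0.5])
flips = [(r["gamma"], r["delta"]) for r in grid if r["sign_flips"]]

for r in grid:
    print(r)
```

With these made-up inputs, only the combination with the strongest outcome association and the largest positive imbalance in U reverses the sign of the estimate.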
This approach can be used to correct for classical measurement error and systematic differential measurement error (see Table 2). Although it is not clear how to adapt the bias formulas for heteroscedastic measurement error, we find in simulations that bias formulas that ignore the differential measurement error in the scale parameter perform well (see Table 3). For classical measurement error, VanderWeele and Arah’s simplified bias formula can be used, setting sensitivity parameters that describe the association between treatment, A, and the portion of X not captured by W, designated U, and between Y and U (22, p. 44). For systematic differential measurement error, the bias equation from VanderWeele and Arah’s theorem 1 can be used, setting the following sensitivity parameters: 1) the association between Y and U conditional on W and Z when A = 1 and 2) when A = 0; 3) the difference in the mean value of U conditional on A = 1 and W and Z and the mean value of U conditional on W and Z only; and 4) the difference in mean values detailed in point 3 when A = 0. See Web Appendices 1 and 2 for sample software code. Rosenbaum’s sensitivity analysis Rosenbaum’s approach has been described previously (23–25). It assumes that 1) the data are in propensity-score-matched pairs and 2) the intervention and nonintervention groups in the matched data set are balanced on observed confounding variables. There are several versions of this sensitivity analysis: the original version, which assumes—adapted for measurement error—that the portion of the true, unobserved covariate, X, that is not captured in the observed, mismeasured covariate, W—designated U—is a near-perfect predictor of the outcome, Y; the dual version, which assumes that U is a near-perfect predictor of treatment, A; and the simultaneous version, which sets sensitivity parameters for the association between U and A and for the association between U and Y, similar to VanderWeele and Arah’s bias formulas. 
Because of the restrictive assumptions of the original and dual versions, we consider the simultaneous version here. The 2 sensitivity parameters in the simultaneous sensitivity analysis are Γ and Δ. For simplicity, we consider a binary Y. The 2 sensitivity parameters are given by the following equations:

$$\log\frac{\pi_A}{1-\pi_A} = \beta_0 + \beta_W W + \log(\Gamma)\,U + \beta_Z Z$$

$$\log\frac{\pi_Y}{1-\pi_Y} = \alpha_0 + \alpha_A A + \alpha_W W + \log(\Delta)\,U + \alpha_Z Z$$

Γ is the multiplier by which the portion of X that is not captured by W, U, increases or decreases the odds of treatment assignment. If measurement error does not affect the odds of treatment assignment, then Γ = 1. Similarly, Δ is the multiplier by which U increases or decreases the odds of the outcome, Y. We perform the sensitivity analysis by varying Γ and Δ simultaneously. If Y is a binary outcome variable, the 2 sensitivity parameters are used to set the upper or lower bound of McNemar’s test statistic (24). The resulting P value from this test is the P value corrected for measurement error. If Y is a continuous outcome, then the 2 sensitivity parameters are used to set the upper or lower bound of the normalized Wilcoxon signed-rank test statistic. When Y is continuous, Δ is interpreted as the conditional odds that the subject with greater U also has greater Y for the pair with median rank. This clunky interpretation makes it difficult to choose reasonable values for this sensitivity analysis parameter. Rosenbaum’s approach can be used to correct for classical measurement error and systematic differential measurement error (see Table 2). As with the other two approaches we have considered, although it is not clear how to adapt the approach to address heteroscedastic measurement error, we find in simulations that the method performs similarly for the heteroscedastic measurement error scenario as it does for classical measurement error (see Table 3).

SIMULATION RESULTS

Table 3 presents the simulation results. Details of the simulation setup are provided in Web Appendix 3.
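The four performance measures reported in Table 3 can be computed from a set of simulation replicates as sketched below. The exact definitions the authors used are in Web Appendix 3 and are not reproduced here, so the formulas in this sketch are common conventions adopted as assumptions: percent bias is taken relative to a nonzero true effect, variance is the sample variance of the estimates across replicates, and coverage is the percentage of confidence intervals containing the truth. The toy replicates are made-up numbers, not values from Table 3.

```python
import statistics


def performance(estimates, ci_bounds, truth):
    """Summarize simulation replicates against a known, nonzero true effect."""
    mean_est = statistics.mean(estimates)
    covered = sum(1 for lo, hi in ci_bounds if lo <= truth <= hi)
    return {
        # Percent bias: average deviation from the truth, relative to the truth
        "pct_bias": 100.0 * (mean_est - truth) / truth,
        # Sample variance of the estimates across replicates
        "variance": statistics.variance(estimates),
        # Percentage of confidence intervals that contain the truth
        "coverage": 100.0 * covered / len(ci_bounds),
        # Mean squared error: average squared deviation from the truth
        "mse": statistics.mean([(e - truth) ** 2 for e in estimates]),
    }


# Three toy replicates: point estimates and their confidence intervals.
result = performance(
    estimates=[0.8, 1.0, 1.2],
    ci_bounds=[(0.5, 1.1), (0.7, 2.1), (0.9, 2.5)],
    truth=2.0,
)
print(result)
```

A percent bias near −100 with low variance, as in some naive rows of Table 3, corresponds to estimates that are tightly clustered near zero rather than near the truth.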
For each method and for each measurement error scenario, we present the performance of the naive approach that uses W instead of X, the no-measurement-error approach that uses X, and the approach that corrects for measurement error using one of the 3 sensitivity analyses. We estimate the parameters for the VanderWeele and Arah approach and the Rosenbaum approach using complete data. We fit the PSC model using an internal validation data set. In most data analyses, many of these values would be unknown. As with parametric model misspecification, there is no theoretical basis for how the different methods would perform when using incorrect parameter values. For this reason, we focus on comparing performance using estimates from complete data, representing an optimal bound. For all 3 measurement error scenarios, the VanderWeele and Arah bias formula approach has the greatest potential for reducing bias due to measurement error, since it can reduce bias by 100% if the correct sensitivity parameters are used. In addition, the VanderWeele and Arah bias-corrected estimates have variances and mean squared errors that are similar to those of the true estimates. Ninety-five percent confidence interval coverage is also high. We see that the PSC approach reduces but does not eliminate bias due to measurement error under all 3 scenarios. Performance of the unmodified, least-squares PSC is better for the classical and heteroscedastic measurement error scenarios than for the systematic differential measurement error scenario. For both the classical and heteroscedastic measurement error scenarios, using PSC to obtain a corrected estimate reduces bias by nearly 80% as compared with the naive approach and results in 95% confidence interval coverage of more than 70%. The assumption of constant variance of residuals in the PSC model is violated under both differential measurement error scenarios. 
However, performance is not affected in the heteroscedastic case, which corroborates previous results that demonstrated little practical impact of heteroscedastic error (26, p. 81). As Table 3 shows, using the WLS modification improved the performance of PSC in the systematic differential measurement error scenario. Using the standard PSC algorithm, the corrected estimates remained 34% biased, on average, while the WLS implementation resulted in 5% bias, similar to the PSC results in the other measurement error scenarios. In addition, the WLS modification increased 95% confidence interval coverage from 14.2% to 96%. However, these gains were made at the expense of increased variance. The Rosenbaum sensitivity analysis performed least well in terms of its ability to correct for covariate measurement error. This was true regardless of whether or not we performed matching with replacement. In addition, the corrected P values were highly variable and spanned the range from 0 to 1 over the 1,000 simulation iterations.

APPLICATION

Overview and setup

We now apply the above approaches to correct for covariate measurement error in estimating the association between baseline depression and subsequent change in body mass index (weight (kg)/height (m)²) among middle-aged-to-older women enrolled in the Baltimore Memory Study. The Baltimore Memory Study has been described previously (27). Data included in this analysis covered the period May 2001–April 2005. Participants gave informed consent, and the Johns Hopkins University Institutional Review Board approved the study protocol. Disability status may be an important confounding variable, as it has been shown to be associated with depression (28) and may also be associated with change in body mass index.
It is plausible that disability is measured with error, which may differ by depression status; for example, persons with depression could have disability scores that are measured too low and with more noise than those without depression. Persons with depression have more variable SF-36-Phys scores (variance of 0.625 vs. 0.434) and scores that are lower, on average (mean of −1.1 vs. −0.6). We have no gold standard measurement of disability. For the purposes of illustration, we use the SF-36-Phys score as the “alloyed gold standard” (29) and simulate additional measurement error: W = X + N(0, 1) + I(A = 1) × N(−0.5, 2), where A is the depression indicator, X is the SF-36-Phys score, and W is the mismeasured version of the SF-36-Phys score. We would expect similar performance regardless of whether X is a perfect gold standard or an “alloyed gold standard” (29). W exhibits both systematic differential and heteroscedastic measurement error by exposure status (shown in Figure 4), in that persons who are depressed score slightly lower and have more variability. The reliability of the mismeasured version (σ_X²/σ_W²) among persons who are depressed is 0.12; the reliability among those who are not depressed is 0.31. (The measurement details of other variables and further analytical details are included in Web Appendix 4.)

Figure 4. Measurement of disability using the Physical Component Summary of the Medical Outcomes Study 36-Item Short Form Health Survey (SF-36-Phys), with and without added measurement error.

For this simple example, we include the 597 women with complete data. We take a random one-third sample to create a validation subset (n = 193).
We then use this validation data set to create the PSC model and to inform the sensitivity analysis parameters for the VanderWeele and Arah approach (which is possible, since the validation data set includes information on Y). We control for confounding by conditioning on the linear propensity score for PSC and by inverse-probability-of-treatment weighting for the VanderWeele and Arah approach.

Results

We expect the WLS PSC approach and VanderWeele and Arah’s bias formula to correct for our simulated systematic differential and heteroscedastic measurement error, as seen in Table 2. We include results using the original least-squares PSC approach for comparison. Figure 5 shows the estimated effects comparing 1) the “naive” estimate, using the version of the SF-36-Phys with added measurement error, 2) the “true” estimate, using the SF-36-Phys without added error, and 3) the “corrected” estimates for each of the sensitivity analysis approaches.

Figure 5. Estimates of the association of depression with subsequent change in body mass index (BMI; weight (kg)/height (m)²), conditional on covariates, using the VanderWeele and Arah (22) bias formula (A) and propensity score calibration (B) to correct for covariate measurement error. Bars, 95% confidence intervals. LS, least squares; WLS, weighted least squares.

We find that using the VanderWeele and Arah bias formula reduces bias by 94%, WLS PSC reduces bias by 59%, and the original least-squares PSC slightly increases the bias.
Although the bias is reduced using each of the 2 appropriate sensitivity analysis approaches, the confidence intervals widen. These results should be interpreted with caution, as we have made multiple simplifications. A more comprehensive analysis would account for missing data and informative dropout, time-varying confounding, and potential mediation. In addition, we simulated additional measurement error and used an internal validation data set with complete data to estimate parameters for the VanderWeele and Arah approach and fit the PSC model. In a real-world setting, it is unlikely that we would have such data.

DISCUSSION

Covariate measurement error and unobserved confounding can be thought of as equivalent forms of bias. Consequently, similar methods can be used to address their threats to validity. Few researchers undertake sensitivity analyses to estimate the potential impact of unobserved confounding, and even fewer do so for measurement error. Moreover, when measurement error is considered, it is typically limited to classical measurement error, even though more complex error structures may be present. In this paper, we have described and demonstrated how several easy-to-implement sensitivity analyses for unobserved confounding can be adapted to address classical, systematic differential, and heteroscedastic covariate measurement error in propensity score methods. In a limited simulation study, we have provided optimal performance bounds for the extent to which these sensitivity analyses can correct for measurement error. We have also applied these approaches to a data example estimating the association between depression and subsequent change in body mass index among women, addressing measurement error in the confounding variable of disability.
To further lower barriers to implementation, we provide annotated R (R Foundation for Statistical Computing, Vienna, Austria) and SAS (SAS Institute, Inc., Cary, North Carolina) code in Web Appendices 1 and 2, respectively, that serves as a tutorial. We describe the strengths and limitations of each sensitivity analysis below. Advantages of PSC (18) include the fact that it can address multiple covariates measured with error simultaneously and does not depend on having information about the outcome in the validation data. It can be used with propensity score matching, subclassification, and regression adjustment of the propensity score, but not with weighting. However, the surrogacy assumption may be restrictive. An adaptation relaxes this assumption by using a parameter representing the association between the unobserved portion of the covariate and the outcome to correct for the remaining bias (21). This parameter could be estimated from an internal validation study with outcome information or could be treated as a sensitivity analysis. Our adaptation using WLS allows PSC to be used for systematic differential measurement error. This adaptation has the benefit of reducing bias and improving confidence interval coverage, but at the expense of greater variance. Other limitations of PSC include the fact that it tends to overadjust and break down when measurement error is large and/or the association between the naive and true propensity scores is weak (18). Most importantly, it reduces but may not eliminate bias due to measurement error. For example, in simulation studies in which all assumptions were met, bias reductions varied between 32% and 106% (20). We found similar results (see Table 3) with bias reductions closest to 100% in scenarios where ATE = 0 (results available upon request). 
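To make the calibration step of PSC more concrete, the following sketch fits a regression of the gold-standard propensity score on the error-prone propensity score and the exposure in validation data, using weighted least squares, and then predicts calibrated scores for main-study records. It is a minimal illustration and not the authors' implementation. The weighting scheme (inverse residual variance within each exposure group, estimated from a first-stage unweighted fit) is an assumption made for this example, and the simulated data, coefficient values, and function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the square linear system a * beta = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def wls(rows, y, weights):
    """Weighted least squares via the normal equations (X'WX) beta = X'Wy."""
    k = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * yi for r, yi, w in zip(rows, y, weights))
            for i in range(k)]
    return solve(xtwx, xtwy)


def fit_calibration(ps_error_prone, ps_gold, a):
    """Fit E[PS_gold | PS_error_prone, A] in the validation data by WLS."""
    rows = [[1.0, p, float(t)] for p, t in zip(ps_error_prone, a)]
    ols = wls(rows, ps_gold, [1.0] * len(rows))  # first-stage unweighted fit
    resid = [g - sum(b * r for b, r in zip(ols, row))
             for g, row in zip(ps_gold, rows)]
    # Assumed weighting scheme: inverse residual variance within exposure group.
    var = {}
    for group in (0, 1):
        e = [ei for ei, t in zip(resid, a) if t == group]
        var[group] = sum(ei * ei for ei in e) / len(e)
    return wls(rows, ps_gold, [1.0 / var[t] for t in a])


def calibrate(gamma, ps_error_prone, a):
    """Predict the gold-standard propensity score for main-study records."""
    return [gamma[0] + gamma[1] * p + gamma[2] * t
            for p, t in zip(ps_error_prone, a)]


# Hypothetical validation data with noisier calibration among the exposed.
rng = random.Random(7)
a_val = [rng.randint(0, 1) for _ in range(2000)]
ps_ep = [rng.gauss(0.0, 1.0) for _ in a_val]  # error-prone linear propensity score
ps_gs = [0.2 + 0.5 * p + 0.1 * t + rng.gauss(0.0, 0.3 if t else 0.1)
         for p, t in zip(ps_ep, a_val)]

gamma = fit_calibration(ps_ep, ps_gs, a_val)
print([round(g, 2) for g in gamma])
```

In an analysis, the calibrated scores returned by `calibrate` would take the place of the error-prone propensity score in matching, subclassification, or regression adjustment.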
A significant advantage of the VanderWeele and Arah bias formulas (22) is that if the correct sensitivity parameters are used and the assumptions are met, then the bias estimate is itself unbiased in expectation. Thus, this approach can fully correct for bias due to all 3 types of covariate measurement error considered. In addition, the approach can be used for any estimation method. However, without an internal validation data set containing complete data, it is unlikely that the parameters will be correctly specified, and the degree of bias correction is unclear when using incorrect parameter values. Rosenbaum’s sensitivity analysis (23) is perhaps the most familiar of all sensitivity analyses for unobserved confounding. However, it has several disadvantages in our context. The first is that it can only be used with propensity-score-matched data. The second is that it can be used to obtain a corrected test statistic or P value, but it is not clear how it would provide an adjusted estimate for the ATE. Another disadvantage is that the interpretation of Δ is not straightforward when Y is continuous, which makes it difficult to posit sensitivity analysis values. Finally, we find that the method may reduce bias but that this reduction is far from complete (see Table 3). This paper was limited in scope. Our goals were 1) to show the connection between bias caused by unobserved confounding and that caused by covariate measurement error and 2) to demonstrate how several simple approaches for addressing unobserved confounding could also be used to address covariate measurement error. There exist numerous other approaches for addressing covariate measurement error that may perform better and rely on fewer assumptions (6–8, 14–16, 30–37). However, with few exceptions (14, 15), many of these approaches are not as easy to implement for nonstatisticians. Comparing performance among these other approaches and lowering barriers to implementation are areas for future work. 
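As a reference point for the VanderWeele and Arah approach discussed above, the sketch below uses the simplified form of their bias formula (22), in which the bias is the product of 2 sensitivity parameters. Here gamma is the association between the unobserved component U and the outcome, conditional on exposure and measured covariates and assumed to be the same in both exposure groups; delta is the difference in the mean (or prevalence) of U between exposure groups, assumed constant across covariate strata. Reading U as the part of the true covariate that is not captured by its mismeasured version is an interpretation made for this example. The function names and all numerical values are hypothetical.

```python
def simple_bias(gamma, delta):
    """Simplified bias term: gamma * delta (valid only under the stated assumptions)."""
    return gamma * delta


def corrected_estimate(naive, gamma, delta):
    """Subtract the estimated bias from the naive effect estimate."""
    return naive - simple_bias(gamma, delta)


# Hypothetical naive estimate and a small grid of sensitivity parameters.
naive = 0.80
for gamma in (0.25, 0.50):
    for delta in (0.2, 0.6):
        est = corrected_estimate(naive, gamma, delta)
        print(f"gamma={gamma:.2f}, delta={delta:.1f}: corrected estimate {est:.3f}")
```

With an internal validation data set, gamma and delta could be estimated directly; without one, a grid such as this shows how sensitive the corrected estimate is to the values posited.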
In conclusion, we recommend the use of VanderWeele and Arah’s bias formulas and PSC (assuming it is adapted appropriately for the measurement error structure) to assess sensitivity of results to covariate measurement error. Both approaches are appropriate for a variety of propensity score estimators and measurement error structures. Real-world data are messy. Concerns about bias due to unobserved confounding and/or measurement error should be addressed rather than ignored. We hope that methods such as the ones examined in this paper will be more widely utilized in addressing such concerns.

ACKNOWLEDGMENTS

Department of Epidemiology, School of Public Health, University of California, Berkeley, Berkeley, California (Kara E. Rudolph); and Departments of Mental Health, Biostatistics, and Health Policy and Management, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland (Elizabeth A. Stuart).

K.E.R.’s time was supported by the Drug Dependence Epidemiology Training Program of the National Institute on Drug Abuse (grant T32DA007292-21; Principal Investigator: Dr. Deborah Furr-Holden) and the Robert Wood Johnson Foundation Health & Society Scholars program. E.A.S.’s time was supported by the National Institute of Mental Health (grant R01MH099010; Principal Investigator: Dr. Elizabeth A. Stuart).

We thank Drs. Brian Schwartz and Thomas Glass for support in providing the Baltimore Memory Study data. We thank Ian Schmid for providing the SAS code.

Conflict of interest: none declared.

Abbreviations

ATE: average treatment effect
PSC: propensity score calibration
SF-36-Phys: Physical Component Summary of the Medical Outcomes Study 36-Item Short Form Health Survey
WLS: weighted least squares

REFERENCES

1. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1):1–21.
2. Katz S, Ford AB, Moskowitz RW, et al. Studies of illness and the aged. The index of ADL, a standardized measure of biological and psychological function. JAMA. 1963;185(12):914–919.
3. Ware JE Jr, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36): I. Conceptual framework and item selection. Med Care. 1992;30(6):473–483.
4. Steiner PM, Cook TD, Shadish WR. On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. J Educ Behav Stat. 2011;36(2):213–236.
5. Millimet DL. The elephant in the corner: a cautionary tale about measurement error in treatment effects models. In: Drukker DM, ed. Missing Data Methods: Cross-Sectional Methods and Applications. (Advances in Econometrics, vol. 27, part 1). Bingley, United Kingdom: Emerald Group Publishing Ltd.; 2011:1–39.
6. Hong H, Rudolph KE, Stuart EA. Bayesian approach for addressing differential covariate measurement error in propensity score methods [published online ahead of print October 13, 2016]. Psychometrika. 2016. (doi: 10.1007/s11336-016-9533-x).
7. McCaffrey DF, Lockwood J, Setodji CM. Inverse probability weighting with error-prone covariates. Biometrika. 2013;100(3):671–680.
8. Lockwood J, McCaffrey DF. Matching and weighting with functions of error-prone covariates for causal inference. J Am Stat Assoc. 2016;111(516):1831–1839.
9. Webb-Vargas Y, Rudolph KE, Lenis D, et al. An imputation-based solution to using mismeasured covariates in propensity score analysis. Stat Methods Med Res. 2017;26(4):1824–1837.
10. Dehejia RH, Wahba S. Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94(448):1053–1062.
11. Higgins JR, de Swiet M. Blood-pressure measurement and classification in pregnancy. Lancet. 2001;357(9250):131–135.
12. Heckman JJ, Singer B. Longitudinal Analysis of Labor Market Data. Cambridge, United Kingdom: Cambridge University Press; 2008.
13. Hernán M, Robins J. Causal Inference. Boca Raton, FL: Chapman & Hall/CRC Press. In press.
14. Blackwell M, Honaker J, King G. A unified approach to measurement error and missing data: overview and applications. Sociol Methods Res. 2017;46(3):303–341.
15. Blackwell M, Honaker J, King G. A unified approach to measurement error and missing data: details and extensions. Sociol Methods Res. 2017;46(3):342–369.
16. Cole SR, Chu H, Greenland S. Multiple imputation for measurement-error correction. Int J Epidemiol. 2006;35(4):1074–1081.
17. Pearl J. Causality. Cambridge, United Kingdom: Cambridge University Press; 2009.
18. Stürmer T, Schneeweiss S, Avorn J, et al. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol. 2005;162(3):279–289.
19. Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutr. 1997;65(4 suppl):1179S–1186S.
20. Stürmer T, Schneeweiss S, Rothman KJ, et al. Performance of propensity score calibration—a simulation study. Am J Epidemiol. 2007;165(10):1110–1118.
21. Lunt M, Glynn RJ, Rothman KJ, et al. Propensity score calibration in the absence of surrogacy. Am J Epidemiol. 2012;175(12):1294–1302.
22. VanderWeele TJ, Arah OA. Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology. 2011;22(1):42–52.
23. Rosenbaum PR. Design of Observational Studies. (Springer Series in Statistics). New York, NY: Springer Publishing Company; 2010.
24. Keele LJ. rbounds: Perform Rosenbaum Bounds Sensitivity Tests for Matched and Unmatched Data. Version 2.1. Vienna, Austria: R Foundation for Statistical Computing; 2014. https://cran.r-project.org/web/packages/rbounds/index.html. Accessed August 18, 2016.
25. Gastwirth JL, Krieger AM, Rosenbaum PR. Dual and simultaneous sensitivity analysis for matched pairs. Biometrika. 1998;85(4):907–920.
26. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton, FL: Chapman & Hall/CRC Press; 2012.
27. Schwartz BS, Glass TA, Bolla KI, et al. Disparities in cognitive functioning by race/ethnicity in the Baltimore Memory Study. Environ Health Perspect. 2004;112(3):314–320.
28. Turner RJ, Noh S. Physical disability and depression: a longitudinal analysis. J Health Soc Behav. 1988;29(1):23–37.
29. Spiegelman D, Schneeweiss S, McDermott A. Measurement error correction for logistic regression models with an “alloyed gold standard.” Am J Epidemiol. 1997;145(2):184–196.
30. McCandless LC, Gustafson P, Levy A. Bayesian sensitivity analysis for unmeasured confounding in observational studies. Stat Med. 2007;26(11):2331–2347.
31. Gustafson P, McCandless LC, Levy AR, et al. Simplified Bayesian sensitivity analysis for mismeasured and unobserved confounders. Biometrics. 2010;66(4):1129–1137.
32. Lin HW, Chen YH. Adjustment for missing confounders in studies based on observational databases: 2-stage calibration combining propensity scores from primary and validation data. Am J Epidemiol. 2014;180(3):308–317.
33. Carroll R, Gail M, Lubin J. Case-control studies with errors in covariates. J Am Stat Assoc. 1993;88(421):185–199.
34. Ghosh-Dastidar B, Schafer JL. Multiple edit/multiple imputation for multivariate continuous data. J Am Stat Assoc. 2003;98(464):807–817.
35. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Pol Sci. 2010;54(2):543–560.
36. Hossain S, Gustafson P. Bayesian adjustment for covariate measurement errors: a flexible parametric approach. Stat Med. 2009;28(11):1580–1600.
37. Gustafson P. Measurement error modelling with an approximate instrumental variable. J R Stat Soc Series B Stat Methodol. 2007;69(5):797–815.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

American Journal of Epidemiology – Oxford University Press

**Published:** Mar 1, 2018
