
Landmark Models for Optimizing the Use of Repeated Measurements of Risk Factors in Electronic Health Records to Predict Future Disease Risk

American Journal of Epidemiology, Vol. 187, No. 7. DOI: 10.1093/aje/kwy018. Advance Access publication: March 23, 2018.
Downloaded from https://academic.oup.com/aje/article/187/7/1530/4952104 by DeepDyve user on 20 July 2022.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Practice of Epidemiology

Ellie Paige, Jessica Barrett, David Stevens, Ruth H. Keogh, Michael J. Sweeting, Irwin Nazareth, Irene Petersen, and Angela M. Wood*

* Correspondence to Dr. Angela Wood, Department of Public Health and Primary Care, University of Cambridge, Strangeways Research Laboratory, Cambridge CB1 8RN, United Kingdom (e-mail: [email protected]).

Initially submitted July 27, 2017; accepted for publication January 25, 2018.

The benefits of using electronic health records (EHRs) for disease risk screening and personalized health-care decisions are being increasingly recognized. Here we present a computationally feasible statistical approach with which to address the methodological challenges involved in utilizing historical repeat measures of multiple risk factors recorded in EHRs to systematically identify patients at high risk of future disease. The approach is principally based on a 2-stage dynamic landmark model.
The first stage estimates current risk factor values from all available historical repeat risk factor measurements via landmark-age–specific multivariate linear mixed-effects models with correlated random intercepts, which account for sporadically recorded repeat measures, unobserved data, and measurement errors. The second stage predicts future disease risk from a sex-stratified Cox proportional hazards model, with estimated current risk factor values from the first stage. We exemplify these methods by developing and validating a dynamic 10-year cardiovascular disease risk prediction model using primary-care EHRs for age, diabetes status, hypertension treatment, smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol in 41,373 persons from 10 primary-care practices in England and Wales contributing to The Health Improvement Network (1997–2016). Using cross-validation, the model was well-calibrated (Brier score = 0.041, 95% confidence interval: 0.039, 0.042) and had good discrimination (C-index = 0.768, 95% confidence interval: 0.759, 0.777).

cardiovascular disease; dynamic risk prediction; electronic health records; landmarking; mixed-effects models; primary care records

Abbreviations: CI, confidence interval; CVD, cardiovascular disease; EHRs, electronic health records; HDL-C, high-density lipoprotein cholesterol; SBP, systolic blood pressure.

Am J Epidemiol. 2018;187(7):1530–1538

Using electronic health records (EHRs) to systematically identify persons at high risk of developing future disease outcomes has the potential to increase the cost-effectiveness of health care (1); however, existing risk prediction models do not fully optimize available historical data. The development of computationally feasible statistical methods for predicting future disease risk from existing EHRs presents specific methodological challenges and opportunities.

First, risk prediction models are typically developed using traditional prospective study designs, which define a baseline origin at which risk factors were observed and from which to predict future disease risk. However, EHRs are dynamic in nature: for example, in primary-care records, an individual’s follow-up begins at registration with a general practice, risk factors are measured sporadically during general practice visits, and follow-up continues until the person transfers out or dies. Defining arbitrary time origins for model development without allowing for the in- and outflow of study participants over time can introduce bias (2). Second, risk prediction models typically use single measures of error-prone risk factors (e.g., blood pressure and cholesterol), but EHRs often contain data on risk factors measured repeatedly over time which could be utilized both for model development and for predicting future disease risk. In particular, repeated measurements can be used to predict error-free “estimated current values” of risk factors, which may increase their predictive ability (3). Third, most risk prediction models require complete risk factor data in order to predict future risk. An exception in cardiovascular disease (CVD) risk prediction is the QRISK2 model (4), which has a built-in tool for substituting missing data on risk factors using age- and sex-specific population average values. Notably, this substitution approach is not compatible with the multiple-imputation approach used for model development of QRISK2 and has not yet been formally validated (5). Since EHR systems are primarily designed for patient management and administrative purposes, there can be large amounts of unobserved information on risk factors that needs to be handled appropriately and compatibly both in model development and for predicting future disease risk.

While multiple methods exist for developing risk prediction models using EHRs, a previous systematic review found that only 8% of studies modeled repeated longitudinal measures, only 54% accounted for missing data, only 16% appropriately accounted for censoring and loss to follow-up, and none assessed informative observations (where the clinic visit itself provides meaningful information) (6). Our aim was to establish a computationally feasible generic statistical framework that accounts for these potential advantages and biases of EHRs in the development of dynamic risk prediction models that leverage repeated measurements and handle unobserved data on routinely recorded risk factors. Our approach combines 2 existing methods, landmark-age models and multivariate linear mixed-effects models (2, 7). A landmark age is a reference point (e.g., 40, 45, 50,…, 85 years) at which we want to make risk predictions using risk factor information collected up to that age. A series of prediction models, which we call landmark-age models, are constructed with time origin at the landmark age and past risk factor information from eligible individuals (e.g., in our setting these are persons who are currently registered with a general practice and at future risk of disease at the landmark age). As such, individuals may contribute to one or more prediction models, depending on their eligibility at the landmark age reference points.

Typically, landmark-age models are constructed using Cox proportional hazards models with the last observed risk factor values. We propose an extension to this, whereby we replace the last observed values with error-free risk factor values estimated from a multivariate linear mixed-effects model using all available repeated measures of past risk factor values for each landmark age (8). Multivariate mixed-effects models intrinsically handle unobserved data and sporadically recorded repeat measures (9) and their measurement errors (10). The approach also provides flexibility to account for the number (or rate) of clinic visits as a proxy for illness severity or health anxiety. There is a strong body of statistical evidence showing the benefits and potential applications of modeling longitudinal data using mixed-effects linear regression models (3, 11–14), but this method is not often employed in the development of risk prediction models using EHRs (6). Moreover, using landmarking to model data in EHRs has been previously proposed (15) and has been combined with univariate mixed-effects modeling (16, 17) but not in the context of dynamic risk prediction models.

In the current study, we explore how landmarking can be combined with multivariate mixed-effects linear regression models to leverage the advantages of each method in order to generate dynamic risk prediction models suitable for use in EHRs. We illustrate our approach through the estimation of 10-year CVD risk using EHRs from 10 general practices in England and Wales.

METHODS

Data source

We used patient data from 10 randomly selected general practices that contributed data to The Health Improvement Network (18), a United Kingdom general practice database that derives data from routine administrative and clinical practice. During consultations with patients, family physicians enter data on medical symptoms and diagnoses using Read codes (19) (a hierarchical classification system), while information on drug prescriptions is entered automatically into the EHRs. The Health Improvement Network captures information on patient demographic characteristics, practice-level data, diagnoses and symptoms, specialist referrals, laboratory testing, disease monitoring, prescribing, and death. For this study, we created code lists for the risk factors and outcomes using previously described methods (20). Code lists were reviewed by a clinician (I.N.) and have been published on ClinicalCodes.org.

The main outcome was newly recorded diagnoses of nonfatal or fatal CVD, where CVD was defined, as with previous primary care risk scores (4), as angina, myocardial infarction, stroke, transient ischemic attack, or major coronary surgery and revascularization. Cause of death was ascertained using Read codes.

Risk factors were selected on the basis of those in the validated American College of Cardiology/American Heart Association Pooled Cohort Risk Assessment Equations (21, 22) and included age, sex, diabetes status (binary, ascertained using Read codes (23)), smoking status (binary), systolic blood pressure (SBP) (adjusted for hypertension treatment), total cholesterol level, and high-density lipoprotein cholesterol (HDL-C) level. Once an individual had a diabetes diagnosis or a prescription for a blood-pressure–lowering medication, he or she was considered to have this condition/treatment throughout follow-up. Values for SBP, total cholesterol, and HDL-C were standardized by centering on sex-specific means and dividing by the standard deviation.

Study population

Data were available from January 1, 1997, to January 18, 2016. Individuals entered the study from the latest of the following dates: 1) the date of registration at a general practice plus 6 months; 2) the date for acceptable computer usage (quality measurement defined as the year in which a general practice continuously used their computer system for recording of medical events and prescribing) (24); 3) the date for acceptable mortality reporting (the date on which mortality recording reflected that of the United Kingdom general population) (25); 4) the date on which the individual turned 30 years of age; or 5) January 1, 1997. Individuals exited the study at the earliest of the following dates: 1) their first (i.e., “incident”) newly recorded CVD event; 2) transfer out of the general practice; 3) their date of death; or 4) January 18, 2016. The target population for which we wanted to estimate CVD risk included persons with general practice records and without a history of CVD or statin prescriptions (see Web Figure 1, available at https://academic.oup.com/aje). We excluded participants with statin prescriptions, as these individuals are already being treated for being at risk of developing CVD and as such would not need to be identified by a screening algorithm. In addition, the study sample excluded persons with unknown sex, persons with a study entry date after age 85 years, and persons with no measurements of smoking status, SBP, total cholesterol, or HDL-C between study entry and study exit (Web Figure 1).

The following measurements were considered biologically implausible and were changed to “missing” for the analysis: SBP <60 mm Hg or >250 mm Hg (26); total cholesterol level <1.75 mmol/L or >20 mmol/L (27); and HDL-C level <0.3 mmol/L or >3.1 mmol/L (26) (out of a total 1,675,241 measurements, 12,352 measurements were changed to missing).

The scheme under which The Health Improvement Network was to obtain and provide anonymized patient data was approved by the National Health Service South-East Multicenter Research Ethics Committee in 2002, and scientific approval to undertake this study was obtained from the IQVIA World Publications Scientific Review Committee (IQVIA, Durham, North Carolina). E.P., J.B., D.S., I.P., and A.M.W. had full access to the data used to create the study population. This article follows RECORD reporting guidelines (Web Table 1) (28).

Statistical analysis

Two-stage dynamic risk prediction model. We used a 2-stage approach to construct a dynamic risk prediction model, first modeling historical repeated risk factor measurements using multivariate mixed-effects linear models and then estimating 10-year CVD risk using Cox proportional hazards models (Figure 1). We briefly present the methods here and provide more detail in the Web Appendix. In both stages, models were developed at landmark ages (40, 45,…, 85 years) for eligible participants, defined as those 1) registered with a general practice at the landmark age, 2) with no CVD diagnoses prior to the landmark age, and 3) with no statin prescription prior to the landmark age. Treating each landmark age as a time origin, past risk factor information was extracted from age 30 years onwards and participants were followed up for 10 years until their first CVD event or the study exit date (Figure 1). Crude incidence rates by age at study entry, sex, and calendar year of statin prescription were calculated.

Estimation of error-free current risk factor values. For each landmark age and separately for males and females, we fitted multivariate mixed-effects linear regression models (9) on past repeat measurements for smoking status, SBP, total cholesterol, and HDL-C. Each model included fixed intercepts and slopes for each risk factor, a time-dependent covariate for initiation of blood-pressure–lowering medications for SBP, and correlated individual-specific random intercepts for all 4 risk factors. These models were estimable for persons with at least 1 measurement of at least 1 risk factor. From each model, we estimated the error-free current risk factor values (i.e., the predicted values at the landmark age) using the best linear unbiased predictors from the empirical Bayes posterior distribution of the random intercepts, conditional on the past observed risk factor measurements.

Estimating 10-year CVD risk. Ten-year CVD risk was estimated from a landmark age Cox proportional hazards model, stratified by sex and with time since landmark age as the underlying time variable. The model adjusted for landmark age and landmark age squared and included the following risk factors: last observed diabetes status; last observed treatment for hypertension; and estimated current risk factor values for smoking status, SBP, total cholesterol, and HDL-C. Participants were followed up for a maximum of 10 years. Therefore, proportional hazards are assumed only across a 10-year period. A “super-landmark model” approach (7) was used with robust standard errors. A super-landmark model is a version of landmarking in which the data sets contributing to the landmark models across all landmark ages are stacked and a single time-to-event model is fitted to the stacked data set (Web Appendix).

[Figure 1 image not reproduced. Horizontal axis: Age, years (30 to 95).]

Figure 1. Schematic showing the landmark age approach. The dashed lines indicate historical repeat measures of smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol, modeled by means of landmark-age–specific multivariate linear mixed-effects models. The diamonds show the landmark age (time of risk prediction). The arrows indicate the 10-year follow-up to the point of a cardiovascular disease event or censoring, modeled via a landmark Cox model.
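
As a rough illustration of the stacking step in the super-landmark approach, the sketch below builds one row per person per landmark age at which that person is eligible, with follow-up time measured from the landmark age and capped at 10 years. It is a minimal sketch, not the authors' implementation (the analyses were run in Stata): the `Person` and `StackedRow` records, their field names, and the `eligible` and `stack` helpers are all hypothetical, and the eligibility rule is a simplified reading of the three criteria in the text. A single sex-stratified Cox model would then be fitted to the stacked rows; because the same person can appear at several landmark ages, the robust standard errors would presumably be clustered by person, which is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Optional

LANDMARK_AGES = range(40, 90, 5)  # 40, 45, ..., 85 years, as in the text
HORIZON_YEARS = 10.0              # maximum follow-up per landmark, as in the text


@dataclass
class Person:
    """Hypothetical per-person record; all field names are illustrative."""
    pid: int
    entry_age: float                    # age at study entry
    exit_age: float                     # age at exit (CVD event, transfer, death, study end)
    cvd_event: bool                     # True if the exit was an incident CVD event
    statin_age: Optional[float] = None  # age at first statin prescription, if any


@dataclass
class StackedRow:
    pid: int
    landmark_age: int
    time: float   # years since the landmark age, capped at the horizon
    event: bool   # True if a CVD event occurred within the horizon


def eligible(p: Person, lm: int) -> bool:
    """Simplified eligibility: under follow-up at the landmark age (which,
    because exit happens at the first CVD event, also implies no prior CVD)
    and no statin prescription before the landmark age."""
    in_follow_up = p.entry_age <= lm < p.exit_age
    no_prior_statin = p.statin_age is None or p.statin_age >= lm
    return in_follow_up and no_prior_statin


def stack(people: List[Person]) -> List[StackedRow]:
    """Stack the landmark-specific data sets into one long data set."""
    rows: List[StackedRow] = []
    for lm in LANDMARK_AGES:
        for p in people:
            if not eligible(p, lm):
                continue
            follow_up = p.exit_age - lm
            time = min(follow_up, HORIZON_YEARS)
            # Events after the 10-year horizon are treated as censored at 10 years.
            event = p.cvd_event and follow_up <= HORIZON_YEARS
            rows.append(StackedRow(p.pid, lm, time, event))
    return rows
```

For example, a person who enters at age 38 and has a CVD event at age 52 contributes rows at landmark ages 40, 45, and 50; at landmark age 40 the event falls outside the 10-year window, so that row is censored at 10 years. The sketch does not cover stage 1 (the estimated current risk factor values) or any censoring at statin initiation after the landmark age.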
Assessment of predictive ability. The performance of the 10-year CVD risk predictions was assessed with measures of calibration (i.e., calibration plots by decile of predicted risk), predictive accuracy (i.e., Brier scores; an average of the squared difference between the observed outcome and predicted risk, where lower scores indicate better predictive accuracy and zero means perfect calibration), and discrimination (i.e., C-index; a measure of how well the model discriminates between persons with and without CVD (29, 30)). We estimated the C-index over all participants (calculated over pairs of different individuals) and also separately at each landmark age. The latter is estimated on subsets of persons of the same age; thus, we call this an age-adjusted C-index, which naturally will have lower values to reflect poorer discrimination (31). We used 10-fold cross-validation, splitting the data by general practice, to account for overoptimism.

The above 10-year CVD risk predictions were compared against predictions from 1) a “basic” landmark-age model, which included sex, age, last observed diabetes status, and last observed treatment for hypertension; 2) a dynamic landmark-age model with landmark age interactions with each covariate; 3) a dynamic landmark-age model with last observed measurements of all risk factors instead of estimated current risk factor values; and 4) a dynamic landmark-age model using cumulative mean values of all historical measurements recorded before each landmark age, of smoking status, SBP, total cholesterol, and HDL-C. Predictions from models 3 and 4 were only estimable for persons with 1 or more measurements of all risk factors, which we call the restricted sample.

Sensitivity analyses. We conducted 4 sensitivity analyses. First, instead of using all available historical repeat measurements of risk factors, we restricted the data to be within 10 years before each landmark age. Second, we adjusted the results of the multivariate mixed-effects models for the annual rate of repeated measurements in the 5 years before each landmark age (as a proxy to account for bias due to sicker or more health-conscious individuals’ having more repeats (32)). Third, instead of estimating current risk factor values from only past information, we estimated the future 10-year average risk factor levels from a multivariate mixed-effects model derived from both past and future risk factor information within the 10-year future horizon (Web Figure 2). Importantly, only past observed risk factors were subsequently used in the prediction of the future 10-year average risk factor levels for the Cox model. Fourth, since it might be useful to identify patients who are still at high absolute risk even after treatment with statins, we reran the main analyses including statin users in the models. In this analysis, the mixed-effects model included a time-dependent covariate for statin therapy initiation for total cholesterol, and statin therapy at the landmark age was included as a risk factor in the Cox model.

All analyses were performed using Stata 14.2 (StataCorp LLC, College Station, Texas), and 95% confidence intervals were generated for all measures of association.

RESULTS

Study sample

The target population included 41,373 persons with general practice records and without a history of CVD or statin use at study entry. Of these individuals, 32,328 persons (78%) had at least 1 measurement of smoking status, SBP, total cholesterol, or HDL-C recorded before the first CVD event or statin prescription (Web Figure 1). Mean age at study entry was 47.9 (standard deviation, 13.6) years; 17,592 participants (54%) were male, and 5,617 (17%) were prescribed statins after study entry (Table 1). Participants generally had more repeat measures of SBP than of smoking status, total cholesterol, and HDL-C (Table 1). On average, there were 1.1 years between repeated measurements of smoking status, 0.5 years between repeated measurements of SBP, 1.1 years between repeated measurements of total cholesterol, and 1.2 years between repeated measurements of HDL-C.

Overall, 2,861 participants (7%) had a newly recorded CVD event over the course of a mean 10.4 (standard deviation, 5.6) years of follow-up. Crude CVD incidence rates per 1,000 person-years increased from 2.9 for persons aged 40–44 years to 35.2 for persons aged 80–84 years; rates were higher in men than in women, and they decreased among statin users by increasing calendar year (Table 2). Participants in the study sample and the restricted sample (n = 12,292 (30% of the target population); Web Figure 1) were similar in terms of age at study entry, sex, SBP, and total and HDL-C levels, but those in the restricted sample were more likely to have diabetes (Table 1). The study sample had more males than the target population but was otherwise similar (Web Table 2).

Table 1. Characteristics of Participants in the Study Sample, The Health Improvement Network, United Kingdom, 1997–2016

                                                      Study Sample (n = 32,328)           Restricted Sample (n = 12,292)      Mean (SD) No. of Measurements per Year
Characteristic                                        No. of Persons  %    Mean (SD)      No. of Persons  %    Mean (SD)      Study Sample  Restricted Sample
Age at study entry, years                                                  47.9 (13.6)                         47.5 (12.3)
Male sex                                              17,592          54                  6,819           55
History of diabetes                                   3,743           12                  2,175           18
Prescription for blood-pressure–lowering medication   9,935           31                  4,685           38
Prescription for statins                              5,617           17                  2,003           16
Current smoker                                        9,453           29                  3,358           27                  0.6 (0.4)     0.6 (0.4)
Systolic blood pressure, mm Hg                                             134.8 (21.0)                        135.3 (21.1)   1.4 (1.4)     1.6 (1.4)
Total cholesterol level, mmol/L                                            5.5 (1.1)                           5.4 (1.0)      0.4 (0.4)     0.5 (0.4)
HDL-C level, mmol/L                                                        1.4 (0.4)                           1.4 (0.4)      0.3 (0.3)     0.4 (0.3)

Abbreviations: HDL-C, high-density lipoprotein cholesterol; SD, standard deviation.
The restricted sample contained only patients with at least 1 measurement for each variable (smoking status, systolic blood pressure, total cholesterol, and HDL-C).
Number and percentage were calculated across the follow-up period (e.g., a diagnosis of diabetes at any point during follow-up was counted as a history of diabetes for that individual).
Based on the first measurement taken after study entry.

Table 2. Crude Cardiovascular Disease Incidence Rate per 1,000 Person-Years According to Age at Study Entry, Sex, and Calendar Year of Statin Prescription, The Health Improvement Network, United Kingdom, 1997–2016

Factor                               No. of Incident CVD Cases   Total No. of PY   Crude IR per 1,000 PY
Age at study entry, years
  40–44                              167                         57,754            2.9
  45–49                              239                         53,056            4.5
  50–54                              307                         49,903            6.2
  55–59                              356                         37,132            9.6
  60–64                              382                         29,552            12.9
  65–69                              396                         22,417            17.7
  70–74                              386                         15,626            24.7
  75–79                              299                         10,575            28.3
  80–84                              187                         5,317             35.2
Sex
  Male                               1,520                       198,797           7.6
  Female                             1,341                       232,166           5.8
Calendar year of statin initiation
  1997–2001                          225                         4,828             46.6
  2002–2006                          968                         38,857            24.9
  2007–2011                          687                         46,662            14.7
  2012–2016                          365                         27,543            13.3

Abbreviations: CVD, cardiovascular disease; IR, incidence rate; PY, person-years.
Calendar year of the prescribing date of the index statin prescription.

Estimates from the landmark models

Regression coefficients from the age- and sex-specific multivariate linear mixed-effects models and hazard ratios for the Cox models, without 10-fold cross-validation, are provided in Web Tables 3–6. Overall, the values of the fixed intercepts from the multivariate mixed-effects linear models show that SBP and total cholesterol level increased over the landmark ages, whereas HDL-C and smoking status decreased (Web Table 3). In addition, hazard ratios were generally stronger for the model using estimated current risk factor values than for the model using the last observed values or cumulative mean values (Web Table 6).

Assessment of 10-year CVD risk

In the landmark model with estimated current risk factor values, 28% of individuals had an estimated 10-year CVD risk of ≥10%, and 10% had an estimated risk of ≥20%. The model appeared well-calibrated (Web Figure 3A), had a Brier score of 0.041 (95% confidence interval (CI): 0.039, 0.042) (Figure 2A), and had an overall C-index of 0.768 (95% CI: 0.759, 0.777) (Figure 2B). The C-index was improved by 0.016 (95% CI: 0.013, 0.020) in comparison with the basic model (Figure 2C). Discrimination was better at younger ages (Figure 3). Additional age interactions did not further improve calibration or risk discrimination (Web Figure 3B and Figure 2B). The basic model (including only age, diabetes status, and treatment for hypertension) also appeared well calibrated (Web Figure 3C), had a Brier score of 0.041 (95% CI: 0.040, 0.043) (Figure 2A), and had a lower overall C-index of 0.752 (95% CI: 0.742, 0.761) (Figure 2B). Similar to the main model, the basic model also discriminated risk better at younger ages than at older ages (Web Figure 4).

Estimated 10-year CVD risk appeared slightly higher in models using last observed and cumulative mean risk factor values as compared with estimated current values (Web Figure 5). Calibration, Brier scores, and C-indices were similar across the landmark models with last observed, cumulative mean, or estimated current risk factor values (Web Figures 6 and 7). Risk discrimination was better at younger ages than at older ages across all models (Web Figure 8).

Sensitivity analyses

There was no difference in risk discrimination when the model was restricted to using historical repeated-measures data collected up to 10 years before the landmark age (C-index = 0.768, 95% CI: 0.758, 0.777) or when the estimated current risk factor values were adjusted for the rate of clinic visits (C-index = 0.766, 95% CI: 0.756, 0.775). However, we observed an increase in risk discrimination using estimated future 10-year average risk factor levels (C-index = 0.774, 95% CI: 0.765, 0.783) instead of estimated current risk factor values. C-indices were lower when statin users were included in the analysis, but the patterns of risk discrimination and calibration remained the same as in the main analysis (Web Tables 7 and 8).

DISCUSSION

In this paper, we have presented a computationally feasible statistical framework for developing dynamic risk prediction models for use on EHRs with historical repeated measures of risk factors. The 2-stage landmark approach combines Cox proportional
hazards regression and age-specific multivariate linear mixed-effects models, which account for sporadically recorded repeat measures, unobserved data, and measurement errors. We illustrated the framework for the derivation and validation of a primary-care dynamic risk prediction model for 10-year CVD risk, but it has potential for wider application to other diseases and conditions and for use on other electronic patient records in which repeated measurements are recorded, such as those collected in secondary-care settings.

[Figure 2 image not reproduced. Values shown in the figure:]
A) Brier score (95% CI): basic model, 0.041 (0.040, 0.043); model with estimated current values of risk factors, 0.041 (0.039, 0.042); model with age interactions, 0.040 (0.039, 0.042).
B) C-index (95% CI): basic model, 0.752 (0.742, 0.761); model with estimated current values of risk factors, 0.768 (0.759, 0.777); model with age interactions, 0.769 (0.760, 0.778).
C) Change in C-index (95% CI), P value: basic model, 0.000 (referent); model with estimated current values of risk factors, 0.016 (0.013, 0.020), P < 0.01; model with age interactions, 0.017 (0.013, 0.022), P < 0.01.

Figure 2. Calibration and risk discrimination statistics for 3 models of cardiovascular disease risk prediction (n = 32,328), The Health Improvement Network, United Kingdom, 1997–2016. A) Calibration statistics for each risk prediction model. The graph shows the Brier score (▪) and 95% confidence interval (CI; bars) for each model. A lower Brier score is interpreted as better calibration. B) Risk discrimination statistics for each risk prediction model. The graph shows the C-index (▪) and 95% CI (bars) for each model. A higher C-index value is interpreted as better discrimination. C) Change in risk discrimination for each risk prediction model. The graph shows the change in C-index (▪) and its 95% CI (bars) for each risk prediction model in relation to the basic model (referent). The basic model included age and sex plus the last observed measures for diabetes status and hypertension treatment. The model with estimated current values of the risk factors included all factors in the basic model plus predicted current values for smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol. The model with age interactions included all factors in the basic model plus predicted current values for smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol, plus interactions of age with all risk factors.

Our motivation was based on optimizing electronic primary-care data for automatically identifying high-risk individuals for full formal disease risk assessment, rather like a prescreening tool with the potential to increase the cost-effectiveness of health care. For example, several international guidelines for CVD risk assessment and management (21, 33–35) recommend using a systematic strategy for prioritizing people for full formal risk assessment on the basis of an estimate of their CVD risk using risk factors already recorded in EHRs. CVD risk assessment tools, such as the Framingham risk model (36) and QRISK2 (4), are now integrated into electronic primary-care record systems, but they are not purposefully designed for prescreening use. The QRISK2 model estimates CVD risk using the last observed values for the numerous risk factors, and when data are missing, imputes them using age- and sex-specific population averages for continuous risk factors or assumes no adverse clinical indicators. Our proposed framework optimizes all available historical risk factor values, handling potential bias from spurious one-off measurements, and when data are missing, intrinsically imputes them using all other risk factor information. Future work should formally compare such models for prescreening use and assess their cost-effectiveness.

[Figure 3 image not reproduced. Vertical axis: C-index (0.4 to 0.8); horizontal axis: Landmark Age, years (40 to 85); series: Age-Adjusted C-Index, Overall C-Index.]

Figure 3. Overall and age-adjusted values for C-index, The Health Improvement Network, United Kingdom, 1997–2016. Dashed lines, 95% confidence intervals.

For illustration, we compared a basic CVD risk model using sex, age, diabetes status, and treatment for hypertension against extended risk models with additional risk factors incorporated as cumulative means, last observed values, or estimated current risk factor values for smoking status, SBP, total cholesterol, and HDL-C. Our findings showed a modest improvement in risk discrimination when including estimated current values of additional risk factors but no difference in risk discrimination in the restricted data set when comparing additional risk factors incorporated as last observed, cumulative means, or estimated current risk factor values. Cumulative mean risk factor values handle sporadically recorded repeat measurements and account for measurement errors, but they are only estimable for persons with at least 1 historical measurement on all risk factors and thus are not suitable for population-wide screening. A major strength of the landmark [...] complexity and messiness of the EHRs that would be used to estimate future disease risk for individuals, unlike risk prediction models developed using purpose-designed cohort studies. Importantly, the assumptions made about the dynamic nature of the historical repeat-measures data, unobserved risk factors, and measurement errors in the model development are compatible with the assumptions required for making a risk prediction for a new individual using data from EHRs.

In our sensitivity analysis, we investigated the use of predicted future 10-year average risk factor levels instead of estimated current values and observed a modest improvement in risk discrimination. This suggests that future risk factor values for smoking, SBP, total cholesterol, and HDL-C are more predictive of future 10-year CVD risk than current values. A considerable limitation in this analysis is that it ignores informative censoring of individuals due to death or CVD events in the multivariate mixed-effects model, although evidence from empirical and simulation studies (11, 14) suggests that there is often little to be gained from more complex modeling (e.g., joint models (37)).

Other methods with which to develop risk prediction models for use on EHRs exist, including machine learning approaches such as neural networks (14, 38, 39) and statistical approaches such as joint models (14). Prediction models developed using landmark and joint models for single risk factors have been previously compared (40) but not in a setting using multivariate risk factors. Joint models are more computationally burdensome than landmark models, and further development is required before they are computationally feasible for application to large EHR data sets. However, landmark models can be developed using any standard statistical software with multivariate mixed-effects models and Cox regression. Analyses employing the landmark-age- and sex-specific multivariate mixed-effects models can be run in parallel, since the most computationally burdensome part is extracting the out-of-sample individual-specific random intercepts for estimation of the current risk factor values.

Certain limitations of our proposed method remain. First, our approach assumes a multivariate normal distribution for estimated current values of continuous and binary risk factors. Such an assumption is not uncommon in statistical methodology for epidemiology (e.g., in regression calibration (10) and multiple imputation (41)); however, it would be possible to replace it with a mixture of regression models with correlated latent variables (42). Second, the added distributional assumptions on the risk factors may limit transferability of the model to other populations and implicate recalibration methods for use of the model in other populations, especially in comparison with conventional CVD prediction models. Investigating the impact of model misspecification is on our future research agenda. Third, uncertainties in the estimated current risk factor values are not accounted for in the Cox model. However, our previous work suggested that such uncertainties are often negligible relative to the estimated standard errors of the β coefficients in the Cox model (10).
Fourth, persons with more frequent EHRs are more likely mark model with estimated current values of risk factors is that it to have health conditions or health anxiety. We attempted to can be applied to persons with at least 1 measure on any of the risk account for this by adjusting the estimated current risk factor val- factors included in the multivariate mixed model (in our illustra- ues by the annual rate of repeated measurements, although it tion, this was approximately 80% of individuals). may be plausible to additionally include this as a risk factor Another strength of our landmark framework is that it was in the Cox model. Fifth, for our illustration, we assumed developed and internally validated using data that reflected the alack ofspecific Read or drug codes to indicate no diagnosis Am J Epidemiol. 2018;187(7):1530–1538 C-Index Downloaded from https://academic.oup.com/aje/article/187/7/1530/4952104 by DeepDyve user on 20 July 2022 Repeat Risk Factors in Electronic Health Records 1537 3. Paige E, Barrett J, Pennells L, et al. Use of repeated blood or medication use, and information on cause of death was only pressure and cholesterol measurements to improve available for 13% of participants who died, meaning CVD cardiovascular disease risk prediction: an individual- incidence was underestimated in this study. Sixth, we used the participant-data meta-analysis. Am J Epidemiol. 2017;186(8): same definition of CVD events as used in CVD risk prediction 899–907. models employed in practice, such as QRISK2, which includes 4. Hippisley-Cox J, Coupland C, Vinogradova Y, et al. Predicting “soft” outcomes such as angina. However, while angina can be cardiovascular risk in England and Wales: prospective a symptom of coronary heart disease, it is not a disease itself, derivation and validation of QRISK2. BMJ. 2008;336(7659): and the appropriateness of including it in the outcome defini- 1475–1482. tion of CVD risk prediction models will depend on the clinical 5. 
Collins GS, Altman DG. An independent and external context. Finally, despite the use of contemporary data, CVD validation of QRISK2 cardiovascular disease risk score: a prospective open cohort study. BMJ. 2010;340:c2442. screening and treatment practices have changed over time and 6. Goldstein BA, Navar AM, Pencina MJ, et al. Opportunities and are not accounted for in the models. These limitations are unlikely challenges in developing risk prediction models with electronic to have affected our between-model comparisons. health records data: a systematic review. J Am Med Inform The benefits of optimizing EHRs for disease risk screening and Assoc. 2017;24(1):198–208. personalized health-care decisions are increasingly being 7. van Houwelingen HC, Putter H. Dynamic Prediction in recognized. There is a growing need for suitable statistical Clinical Survival Analysis. Boca Raton, FL: CRC Press; 2012. methods, data analytics, and machine learning approaches 8. Xanthakis V, Sullivan LM, Vasan RS. Multilevel modeling with which to address the computational and methodological versus cross-sectional analysis for assessing the longitudinal challenges involved in the analysis of such “big data.” The tracking of cardiovascular risk factors over time. Stat Med. framework presented in this paper provides a practical, trans- 2013;32(28):5028–5038. 9. Laird NM, Ware JH. Random-effects models for longitudinal parent, and flexible solution for the development of dynamic data. Biometrics. 1982;38(4):963–974. risk prediction models for use on EHRs. 10. Fibrinogen Studies Collaboration. Correcting for multivariate measurement error by regression calibration in meta-analyses of epidemiological studies. Stat Med. 2009; 28(7):1067–1092. ACKNOWLEDGMENTS 11. Sweeting MJ, Barrett JK, Thompson SG, et al. 
The use of repeated blood pressure measures for cardiovascular risk Author affiliations: Department of Public Health and prediction: a comparison of statistical models in the ARIC Primary Care, School of Clinical Medicine, University of Study. Stat Med. 2017;36(28):4514–4528. Cambridge, Cambridge, United Kingdom (Ellie Paige, 12. Singh A, Nadkarni G, Gottesman O, et al. Incorporating temporal EHR data in predictive models for risk stratification Jessica Barrett, David Stevens, Michael J. Sweeting, Angela of renal function deterioration. J Biomed Inform. 2015;53: M. Wood); National Centre for Epidemiology and 220–228. Population Health, Research School of Population, The 13. Akbarov A, Williams R, Brown B, et al. A two-stage dynamic Australian National University, Canberra, Australia (Ellie model to enable updating of clinical risk prediction from Paige); MRC Biostatistics Unit, University of Cambridge, longitudinal health record data: illustrated with kidney Cambridge, United Kingdom (Jessica Barrett); Department function. Stud Health Technol Inform. 2015;216:696–700. of Medical Statistics, London School of Hygiene and Tropical 14. Goldstein BA, Pomann GM, Winkelmayer WC, et al. A Medicine, London, United Kingdom (Ruth H. Keogh); and comparison of risk prediction methods using repeated Institute of Epidemiology and Health, Research Department of observations: an application to electronic health records for Primary Care and Population Health, Institute of Epidemiology hemodialysis. Stat Med. 2017;36(17):2750–2763. 15. Wells BJ, Chagin KM, Li L, et al. Using the landmark method and Health Care, University College London, London, United for creating prediction models in large datasets derived from Kingdom (Irwin Nazareth, Irene Petersen). electronic health records. Health Care Manag Sci. 2015;18(1): This work was funded by the Medical Research Council 86–92. (MRC) (grant MR/K014811/1). J.B. was supported by an 16. Damman K, Jaarsma T, Voors AA, et al. 
Both in- and out- MRC fellowship (grant G0902100) and the MRC Unit hospital worsening of renal function predict outcome in Program (grant MC_UU_00002/5). R.H.K. was supported by patients with heart failure: results from the Coordinating an MRC Methodology Fellowship (grant MR/M014827/1). Study Evaluating Outcome of Advising and Counseling in The study funders played no role in the design, analysis, Heart Failure (COACH). Eur J Heart Fail. 2009;11(9): or interpretation of the study. 847–854. Conflict of interest: none declared. 17. Maziarz M, Heagerty P, Cai T, et al. On longitudinal prediction with time-to-event outcome: comparison of modeling options. Biometrics. 2017;73(1):83–93. 18. In Practice Systems Ltd. The Health Improvement Network (THIN). 2016. http://www.inps.co.uk/vision/health- REFERENCES improvement-network-thin. Accessed July 5, 2016. 1. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health 19. Chisholm J. The Read clinical classification. BMJ. 1990; care: using analytics to identify and manage high-risk and high- 300(6732):1092. cost patients. Health Aff (Millwood). 2014;33(7):1123–1131. 20. Davé S, Petersen I. Creating medical and drug code lists to 2. Dafni U. Landmark analysis at the 25-year landmark point. identify cases in primary care databases. Pharmacoepidemiol Circ Cardiovasc Qual Outcomes. 2011;4(3):363–371. Drug Saf. 2009;18(8):704–707. Am J Epidemiol. 2018;187(7):1530–1538 Downloaded from https://academic.oup.com/aje/article/187/7/1530/4952104 by DeepDyve user on 20 July 2022 1538 Paige et al. 21. Goff DC Jr, Lloyd-Jones DM, Bennett G, et al. 2013 ACC/ 32. Goldstein BA, Bhavsar NA, Phelan M, et al. Controlling for AHA guideline on the assessment of cardiovascular risk: a informed presence bias due to the number of health encounters in an report of the American College of Cardiology/American Heart electronic health record. Am J Epidemiol. 2016;184(11):847–855. Association Task Force on Practice Guidelines. J Am College 33. 
National Institute for Health and Care Excellence. Cardiol. 2014;63(25):2935–2959. Cardiovascular Disease: Risk Assessment and Reduction, 22. Muntner P, Colantonio LD, Cushman M, et al. Validation of Including Lipid Modification. (Clinical guideline CG181). the atherosclerotic cardiovascular disease Pooled Cohort risk London, United Kingdom: National Institute for Health and equations. JAMA. 2014;311(14):1406–1415. Care Excellence; 2014. 23. Sharma M, Petersen I, Nazareth I, et al. An algorithm for 34. New Zealand Ministry of Health. Cardiovascular Disease Risk identification and classification of individuals with type 1 and Assessment: Updated 2013. New Zealand Primary Care Handbook type 2 diabetes mellitus in a large primary care database. Clin 2012. Wellington, New Zealand: Ministry of Health; 2013. Epidemiol. 2016;8:373–380. 35. Perk J, De Backer G, Gohlke H, et al. European Guidelines on 24. Horsfall L, Walters K, Petersen I. Identifying periods of Cardiovascular Disease Prevention in Clinical Practice acceptable computer usage in primary care research databases. (version 2012). The Fifth Joint Task Force of the European Pharmacoepidemiol Drug Saf. 2013;22(1):64–69. Society of Cardiology and Other Societies on Cardiovascular 25. Maguire A, Blak BT, Thompson M. The importance of Disease Prevention in Clinical Practice (constituted by defining periods of complete mortality reporting for research representatives of nine societies and by invited experts). Eur using automated data from primary care. Pharmacoepidemiol Heart J. 2012;33(13):1635–1701. Drug Saf. 2009;18(1):76–83. 36. D’Agostino RB Sr, Vasan RS, Pencina MJ, et al. General 26. Littman AJ, Boyko EJ, McDonell MB, et al. Evaluation of a cardiovascular risk profile for use in primary care: the weight management program for veterans. Prev Chronic Dis. Framingham Heart Study. Circulation. 2008;117(6):743–753. 2012;9:E99. 37. Rizopoulos D. Dynamic predictions and prospective accuracy 27. 
Hajifathalian K, Ueda P, Lu Y, et al. A novel risk score to in joint models for longitudinal and time-to-event data. predict cardiovascular disease risk in national populations Biometrics. 2011;67(3):819–829. (Globorisk): a pooled analysis of prospective cohorts and 38. Miotto R, Li L, Kidd BA, et al. Deep patient: an unsupervised health examination surveys. Lancet Diabetes Endocrinol. representation to predict the future of patients from the 2015;3(5):339–355. electronic health records. Sci Rep. 2016;6:26094. 28. Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of 39. Shameer K, Johnson KW, Yahi A, et al. Predictive modeling of studies Conducted using Observational Routinely collected hospital readmission rates using electronic medical record- health Data (RECORD) statement. PLoS Med. 2015;12(10): wide machine learning: a case-study using Mount Sinai heart e1001885. failure cohort. Pac Symp Biocomput. 2016;22:276–287. 29. Lloyd-Jones DM. Cardiovascular risk prediction: basic 40. Suresh K, Taylor JMG, Spratt DE, et al. Comparison of joint concepts, current status, and future directions. Circulation. modeling and landmarking for dynamic prediction under an 2010;121(15):1768–1777. illness-death model. Biom J. 2017;59(6):1277–1300. 30. Harrell FE Jr, Califf RM, Pryor DB, et al. Evaluating the yield 41. Schafer JL. Analysis of Incomplete Multivariate Data. Boca of medical tests. JAMA. 1982;247(18):2543–2546. Raton, FL: CRC Press; 1997. 31. White IR, Rapsomaniki E; Emerging Risk Factors 42. Fitzmaurice GM, Laird NM. Regression models for mixed Collaboration. Covariate-adjusted measures of discrimination discrete and continuous responses with potentially missing for survival data. Biom J. 2015;57(4):592–613. values. Biometrics. 1997;53(1):110–122. Am J Epidemiol. 2018;187(7):1530–1538 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png American Journal of Epidemiology Oxford University Press

Landmark Models for Optimizing the Use of Repeated Measurements of Risk Factors in Electronic Health Records to Predict Future Disease Risk



Publisher
Oxford University Press
Copyright
Copyright © 2022 Johns Hopkins Bloomberg School of Public Health
ISSN
0002-9262
eISSN
1476-6256
DOI
10.1093/aje/kwy018

Abstract

American Journal of Epidemiology, Vol. 187, No. 7. © The Author(s) 2018. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. DOI: 10.1093/aje/kwy018. Advance Access publication: March 23, 2018. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Practice of Epidemiology

Ellie Paige, Jessica Barrett, David Stevens, Ruth H. Keogh, Michael J. Sweeting, Irwin Nazareth, Irene Petersen, and Angela M. Wood*

* Correspondence to Dr. Angela Wood, Department of Public Health and Primary Care, University of Cambridge, Strangeways Research Laboratory, Cambridge CB1 8RN, United Kingdom (e-mail: [email protected]).

Initially submitted July 27, 2017; accepted for publication January 25, 2018.

The benefits of using electronic health records (EHRs) for disease risk screening and personalized health-care decisions are being increasingly recognized. Here we present a computationally feasible statistical approach with which to address the methodological challenges involved in utilizing historical repeat measures of multiple risk factors recorded in EHRs to systematically identify patients at high risk of future disease. The approach is principally based on a 2-stage dynamic landmark model.
The first stage estimates current risk factor values from all available historical repeat risk factor measurements via landmark-age–specific multivariate linear mixed-effects models with correlated random intercepts, which account for sporadically recorded repeat measures, unobserved data, and measurement errors. The second stage predicts future disease risk from a sex-stratified Cox proportional hazards model, with estimated current risk factor values from the first stage. We exemplify these methods by developing and validating a dynamic 10-year cardiovascular disease risk prediction model using primary-care EHRs for age, diabetes status, hypertension treatment, smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol in 41,373 persons from 10 primary-care practices in England and Wales contributing to The Health Improvement Network (1997–2016). Using cross-validation, the model was well-calibrated (Brier score = 0.041, 95% confidence interval: 0.039, 0.042) and had good discrimination (C-index = 0.768, 95% confidence interval: 0.759, 0.777).

cardiovascular disease; dynamic risk prediction; electronic health records; landmarking; mixed-effects models; primary care records

Abbreviations: CI, confidence interval; CVD, cardiovascular disease; EHRs, electronic health records; HDL-C, high-density lipoprotein cholesterol; SBP, systolic blood pressure.

Using electronic health records (EHRs) to systematically identify persons at high risk of developing future disease outcomes has the potential to increase the cost-effectiveness of health care (1); however, existing risk prediction models do not fully optimize available historical data. The development of computationally feasible statistical methods for predicting future disease risk from existing EHRs presents specific methodological challenges and opportunities.

First, risk prediction models are typically developed using traditional prospective study designs, which define a baseline origin at which risk factors were observed and from which to predict future disease risk. However, EHRs are dynamic in nature; for example, in primary-care records, an individual’s follow-up begins at registration with a general practice, risk factors are measured sporadically during general practice visits, and follow-up continues until the person transfers out or dies. Defining arbitrary time origins for model development without allowing for the in- and outflow of study participants over time can introduce bias (2). Second, risk prediction models typically use single measures of error-prone risk factors (e.g., blood pressure and cholesterol), but EHRs often contain data on risk factors measured repeatedly over time which could be utilized both for model development and for predicting future disease risk. In particular, repeated measurements can be used to predict error-free “estimated current values” of risk factors, which may increase their predictive ability (3). Third, most risk prediction models require complete risk factor data in order to predict future risk. An exception in cardiovascular disease (CVD) risk prediction is the QRISK2 model (4), which has a built-in tool for substituting missing data on risk factors using age- and sex-specific population average values. Notably, this substitution approach is not compatible with the multiple-imputation approach used for model development of QRISK2 and has not yet been formally validated (5). Since EHR systems are primarily designed for patient management and administrative purposes, there can be large amounts of unobserved information on risk factors that needs to be handled appropriately and compatibly both in model development and for predicting future disease risk.

While multiple methods exist for developing risk prediction models using EHRs, a previous systematic review found that only 8% of studies modeled repeated longitudinal measures, only 54% accounted for missing data, only 16% appropriately accounted for censoring and loss to follow-up, and none assessed informative observations (where the clinic visit itself provides meaningful information) (6). Our aim was to establish a computationally feasible generic statistical framework that accounts for these potential advantages and biases of EHRs in the development of dynamic risk prediction models that leverage repeated measurements and handle unobserved data on routinely recorded risk factors. Our approach combines 2 existing methods, landmark-age models and multivariate linear mixed-effects models (2, 7). A landmark age is a reference point (e.g., 40, 45, 50,…, 85 years) at which we want to make risk predictions using risk factor information collected up to that age. A series of prediction models, which we call landmark-age models, are constructed with time origin at the landmark age and past risk factor information from eligible individuals (e.g., in our setting these are persons who are currently registered with a general practice and at future risk of disease at the landmark age). As such, individuals may contribute to one or more prediction models, depending on their eligibility at the landmark age reference points.

Typically, landmark-age models are constructed using Cox proportional hazards models with the last observed risk factor values. We propose an extension to this, whereby we replace the last observed values with error-free risk factor values estimated from a multivariate linear mixed-effects model using all available repeated measures of past risk factor values for each landmark age (8). Multivariate mixed-effects models intrinsically handle unobserved data and sporadically recorded repeat measures (9) and their measurement errors (10). The approach also provides flexibility to account for the number (or rate) of clinic visits as a proxy for illness severity or health anxiety. There is a strong body of statistical evidence showing the benefits and potential applications of modeling longitudinal data using mixed-effects linear regression models (3, 11–14), but this method is not often employed in the development of risk prediction models using EHRs (6). Moreover, using landmarking to model data in EHRs has been previously proposed (15) and has been combined with univariate mixed-effects modeling (16, 17) but not in the context of dynamic risk prediction models.

In the current study, we explore how landmarking can be combined with multivariate mixed-effects linear regression models to leverage the advantages of each method in order to generate dynamic risk prediction models suitable for use in EHRs. We illustrate our approach through the estimation of 10-year CVD risk using EHRs from 10 general practices in England and Wales.

METHODS

Data source

We used patient data from 10 randomly selected general practices that contributed data to The Health Improvement Network (18), a United Kingdom general practice database that derives data from routine administrative and clinical practice. During consultations with patients, family physicians enter data on medical symptoms and diagnoses using Read codes (19) (a hierarchical classification system), while information on drug prescriptions is entered automatically into the EHRs. The Health Improvement Network captures information on patient demographic characteristics, practice-level data, diagnoses and symptoms, specialist referrals, laboratory testing, disease monitoring, prescribing, and death. For this study, we created code lists for the risk factors and outcomes using previously described methods (20). Code lists were reviewed by a clinician (I.N.) and have been published on ClinicalCodes.org.

The main outcome was newly recorded diagnoses of nonfatal or fatal CVD, where CVD was defined, as with previous primary care risk scores (4), as angina, myocardial infarction, stroke, transient ischemic attack, or major coronary surgery and revascularization. Cause of death was ascertained using Read codes.

Risk factors were selected on the basis of those in the validated American College of Cardiology/American Heart Association Pooled Cohort Risk Assessment Equations (21, 22) and included age, sex, diabetes status (binary, ascertained using Read codes (23)), smoking status (binary), systolic blood pressure (SBP) (adjusted for hypertension treatment), total cholesterol level, and high-density lipoprotein cholesterol (HDL-C) level. Once an individual had a diabetes diagnosis or a prescription for a blood-pressure–lowering medication, he or she was considered to have this condition/treatment throughout follow-up. Values for SBP, total cholesterol, and HDL-C were standardized by centering on sex-specific means and dividing by the standard deviation.

Study population

Data were available from January 1, 1997, to January 18, 2016. Individuals entered the study from the latest of the following dates: 1) the date of registration at a general practice plus 6 months; 2) the date for acceptable computer usage (quality measurement defined as the year in which a general practice continuously used their computer system for recording of medical events and prescribing) (24); 3) the date for acceptable mortality reporting (the date on which mortality recording reflected that of the United Kingdom general population) (25); 4) the date on which the individual turned 30 years of age; or 5) January 1, 1997. Individuals exited the study at the earliest of the following dates: 1) their first (i.e., “incident”) newly recorded CVD event; 2) transfer out of the general practice; 3) their date of death; or 4) January 18, 2016. The target population for which we wanted to estimate CVD risk included persons with general practice records and without a history of CVD or statin prescriptions (see Web Figure 1, available at https://academic.oup.com/aje). We excluded participants with statin prescriptions, as these individuals are already being treated for being at risk of developing CVD and as such would not need to be identified by a screening algorithm. In addition, the study sample excluded persons with unknown sex, persons with a study entry date after age 85 years, and persons with no measurements of smoking status, SBP, total cholesterol, or HDL-C between study entry and study exit (Web Figure 1).

The following measurements were considered biologically implausible and were changed to “missing” for the analysis: SBP <60 mm Hg or >250 mm Hg (26); total cholesterol level <1.75 mmol/L or >20 mmol/L (27); and HDL-C level <0.3 mmol/L or >3.1 mmol/L (26) (out of a total 1,675,241 measurements, 12,352 measurements were changed to missing).

The scheme under which The Health Improvement Network was to obtain and provide anonymized patient data was approved by the National Health Service South-East Multicenter Research Ethics Committee in 2002, and scientific approval to undertake this study was obtained from the IQVIA World Publications Scientific Review Committee (IQVIA, Durham, North Carolina). E.P., J.B., D.S., I.P., and A.M.W. had full access to the data used to create the study population. This article follows RECORD reporting guidelines (Web Table 1) (28).

Statistical analysis

Two-stage dynamic risk prediction model. We used a 2-stage approach to construct a dynamic risk prediction model, first modeling historical repeated risk factor measurements using multivariate mixed-effects linear models and then estimating 10-year CVD risk using Cox proportional hazards models (Figure 1). We briefly present the methods here and provide more detail in the Web Appendix. In both stages, models were developed at landmark ages (40, 45,…, 85 years) for eligible participants, defined as those 1) registered with a general practice at the landmark age, 2) with no CVD diagnoses prior to the landmark age, and 3) with no statin prescription prior to the landmark age. Treating each landmark age as a time origin, past risk factor information was extracted from age 30 years onwards and participants were followed up for 10 years until their first CVD event or the study exit date (Figure 1). Crude incidence rates by age at study entry, sex, and calendar year of statin prescription were calculated.

Estimation of error-free current risk factor values. For each landmark age and separately for males and females, we fitted multivariate mixed-effects linear regression models (9) on past repeat measurements for smoking status, SBP, total cholesterol, and HDL-C. Each model included fixed intercepts and slopes for each risk factor, a time-dependent covariate for initiation of blood-pressure–lowering medications for SBP, and correlated individual-specific random intercepts for all 4 risk factors. These models were estimable for persons with at least 1 measurement of at least 1 risk factor. From each model, we estimated the error-free current risk factor values (i.e., the predicted values at the landmark age) using the best linear unbiased predictors from the empirical Bayes posterior distribution of the random intercepts, conditional on the past observed risk factor measurements.

Estimating 10-year CVD risk. Ten-year CVD risk was estimated from a landmark age Cox proportional hazards model, stratified by sex and with time since landmark age as the underlying time variable. The model adjusted for landmark age and landmark age squared and included the following risk factors: last observed diabetes status; last observed treatment for hypertension; and estimated current risk factor values for smoking status, SBP, total cholesterol, and HDL-C. Participants were followed up for a maximum of 10 years. Therefore, proportional hazards are assumed only across a 10-year period. A “super-landmark model” approach (7) was used with robust standard errors. A super-landmark model is a version of landmarking in which the data sets contributing to the landmark models across all landmark ages are stacked and a single time-to-event model is fitted to the stacked data set (Web Appendix).

Figure 1. Schematic showing the landmark age approach (horizontal axis: age, years, 30 to 95). The dashed lines indicate historical repeat measures of smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol, modeled by means of landmark-age–specific multivariate linear mixed-effects models. The diamonds show the landmark age (time of risk prediction). The arrows indicate the 10-year follow-up to the point of a cardiovascular disease event or censoring, modeled via a landmark Cox model.
Am J Epidemiol. 2018;187(7):1530–1538

Assessment of predictive ability. The performance of the 10-year CVD risk predictions was assessed with measures of calibration (i.e., calibration plots by decile of predicted risk), predictive accuracy (i.e., Brier scores; an average of the squared difference between the observed outcome and predicted risk, where lower scores indicate better predictive accuracy and zero means perfect calibration), and discrimination (i.e., C-index; a measure of how well the model discriminates between persons with and without CVD (29, 30)). We estimated the C-index over all participants (calculated over pairs of different individuals) and also separately at each landmark age. The latter is estimated on subsets of persons of the same age; thus, we call this an age-adjusted C-index, which naturally will have lower values to reflect poorer discrimination (31). We used 10-fold cross-validation, splitting the data by general practice, to account for overoptimism.

The above 10-year CVD risk predictions were compared against predictions from 1) a “basic” landmark-age model, which included sex, age, last observed diabetes status, and last observed treatment for hypertension; 2) a dynamic landmark-age model with landmark age interactions with each covariate; 3) a dynamic landmark-age model with last observed measurements of all risk factors instead of estimated current risk factor values; and 4) a dynamic landmark-age model using cumulative mean values of all historical measurements recorded before each landmark age, of smoking status, SBP, total cholesterol, and HDL-C. Predictions from models 3 and 4 were only estimable for persons with 1 or more measurements of all risk factors, which we call the restricted sample.

Sensitivity analyses. We conducted 4 sensitivity analyses. First, instead of using all available historical repeat measurements of risk factors, we restricted the data to be within 10 years before each landmark age. Second, we adjusted the results of the multivariate mixed-effects models for the annual rate of repeated measurements in the 5 years before each landmark age (as a proxy to account for bias due to sicker or more health-conscious individuals’ having more repeats (32)). Third, instead of estimating current risk factor values from only past information, we estimated the future 10-year average risk factor levels from a multivariate mixed-effects model derived from both past and future risk factor information within the 10-year future horizon (Web Figure 2). Importantly, only past observed risk factors were subsequently used in the prediction of the future 10-year average risk factor levels for the Cox model. Fourth, since it might be useful to identify patients who are still at high absolute risk even after treatment with statins, we reran the main analyses including statin users in the models. The mixed-effects model included a time-dependent covariate for statin therapy initiation for total cholesterol, and statin therapy at the landmark age was included as a risk factor in the Cox model.

All analyses were performed using Stata 14.2 (StataCorp LLC, College Station, Texas), and 95% confidence intervals were generated for all measures of association.

RESULTS

Study sample

The target population included 41,373 persons with general practice records and without a history of CVD or statin use at study entry. Of these individuals, 32,328 persons (78%) had at least 1 measurement of smoking status, SBP, total cholesterol, or HDL-C recorded before the first CVD event or statin prescription (Web Figure 1). Mean age at study entry was 47.9 (standard deviation, 13.6) years; 17,592 participants (54%) were male, and 5,617 (17%) were prescribed statins after study entry (Table 1). Participants generally had more repeat measures of SBP than of smoking status, total cholesterol, and HDL-C (Table 1). On average, there were 1.1 years between repeated measurements of smoking status, 0.5 years between repeated measurements of SBP, 1.1 years between repeated measurements of total cholesterol, and 1.2 years between repeated measurements of HDL-C. Overall, 2,861 participants (7%) had a newly recorded CVD event over the course of a mean 10.4 (standard deviation, 5.6) years of follow-up. Crude CVD incidence rates per 1,000 person-years increased from 2.9 for persons aged 40–44 years to 35.2 for persons aged 80–84 years; rates were higher in men than in women, and they decreased among statin users by increasing calendar year (Table 2). Participants in the study sample and the restricted sample (n = 12,292 (30% of the target population); Web Figure 1) were similar in terms of age at study entry, sex, SBP, and total and HDL-C levels, but those in the restricted sample were more likely to have diabetes (Table 1). The study sample had more males than the target population but was otherwise similar (Web Table 2).

Table 1. Characteristics of Participants in the Study Sample, The Health Improvement Network, United Kingdom, 1997–2016

| Characteristic | Study Sample (n = 32,328): No. of Persons | Study Sample: % | Study Sample: Mean (SD) | Restricted Sample (n = 12,292): No. of Persons | Restricted Sample: % | Restricted Sample: Mean (SD) | Mean (SD) No. of Measurements per Year: Study Sample | Mean (SD) No. of Measurements per Year: Restricted Sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age at study entry, years | | | 47.9 (13.6) | | | 47.5 (12.3) | | |
| Male sex | 17,592 | 54 | | 6,819 | 55 | | | |
| History of diabetes | 3,743 | 12 | | 2,175 | 18 | | | |
| Prescription for blood-pressure–lowering medication | 9,935 | 31 | | 4,685 | 38 | | | |
| Prescription for statins | 5,617 | 17 | | 2,003 | 16 | | | |
| Current smoker | 9,453 | 29 | | 3,358 | 27 | | 0.6 (0.4) | 0.6 (0.4) |
| Systolic blood pressure, mm Hg | | | 134.8 (21.0) | | | 135.3 (21.1) | 1.4 (1.4) | 1.6 (1.4) |
| Total cholesterol level, mmol/L | | | 5.5 (1.1) | | | 5.4 (1.0) | 0.4 (0.4) | 0.5 (0.4) |
| HDL-C level, mmol/L | | | 1.4 (0.4) | | | 1.4 (0.4) | 0.3 (0.3) | 0.4 (0.3) |

Abbreviations: HDL-C, high-density lipoprotein cholesterol; SD, standard deviation.
Table notes: The restricted sample contained only patients with at least 1 measurement for each variable (smoking status, systolic blood pressure, total cholesterol, and HDL-C). Number and percentage were calculated across the follow-up period (e.g., a diagnosis of diabetes at any point during follow-up was counted as a history of diabetes for that individual). Based on the first measurement taken after study entry.

Table 2. Crude Cardiovascular Disease Incidence Rate per 1,000 Person-Years According to Age at Study Entry, Sex, and Calendar Year of Statin Prescription, The Health Improvement Network, United Kingdom, 1997–2016

| Factor | No. of Incident CVD Cases | Total No. of PY | Crude IR per 1,000 PY |
| --- | --- | --- | --- |
| Age at study entry, years | | | |
| 40–44 | 167 | 57,754 | 2.9 |
| 45–49 | 239 | 53,056 | 4.5 |
| 50–54 | 307 | 49,903 | 6.2 |
| 55–59 | 356 | 37,132 | 9.6 |
| 60–64 | 382 | 29,552 | 12.9 |
| 65–69 | 396 | 22,417 | 17.7 |
| 70–74 | 386 | 15,626 | 24.7 |
| 75–79 | 299 | 10,575 | 28.3 |
| 80–84 | 187 | 5,317 | 35.2 |
| Sex | | | |
| Male | 1,520 | 198,797 | 7.6 |
| Female | 1,341 | 232,166 | 5.8 |
| Calendar year of statin initiation | | | |
| 1997–2001 | 225 | 4,828 | 46.6 |
| 2002–2006 | 968 | 38,857 | 24.9 |
| 2007–2011 | 687 | 46,662 | 14.7 |
| 2012–2016 | 365 | 27,543 | 13.3 |

Abbreviations: CVD, cardiovascular disease; IR, incidence rate; PY, person-years.
Table note: Calendar year of the prescribing date of the index statin prescription.

Estimates from the landmark models

Regression coefficients from the age- and sex-specific multivariate linear mixed-effects models and hazard ratios for the Cox models, without 10-fold cross-validation, are provided in Web Tables 3–6. Overall, the values of the fixed intercepts from the multivariate mixed-effects linear models show that SBP and total cholesterol level increased over the landmark ages, whereas HDL-C and smoking status decreased (Web Table 3). In addition, hazard ratios were generally stronger for the model using estimated current risk factor values than for the model using the last observed values or cumulative mean values (Web Table 6).

Assessment of 10-year CVD risk

In the landmark model with estimated current risk factor values, 28% of individuals had an estimated 10-year CVD risk of ≥10%, and 10% had an estimated risk of ≥20%. The model appeared well calibrated (Web Figure 3A), had a Brier score of 0.041 (95% confidence interval (CI): 0.039, 0.042) (Figure 2A), and had an overall C-index of 0.768 (95% CI: 0.759, 0.777) (Figure 2B). The C-index was improved by 0.016 (95% CI: 0.013, 0.020) in comparison with the basic model (Figure 2C). Discrimination was better at younger ages (Figure 3). Additional age interactions did not further improve calibration or risk discrimination (Web Figure 3B and Figure 2B). The basic model (including only age, diabetes status, and treatment for hypertension) also appeared well calibrated (Web Figure 3C), had a Brier score of 0.041 (95% CI: 0.040, 0.043) (Figure 2A), and had a lower overall C-index of 0.752 (95% CI: 0.742, 0.761) (Figure 2B). Similar to the main model, the basic model also discriminated risk better at younger ages than at older ages (Web Figure 4).

Estimated 10-year CVD risk appeared slightly higher in models using last observed and cumulative mean risk factor values as compared with estimated current values (Web Figure 5). Calibration, Brier scores, and C-indices were similar across the landmark models with last observed, cumulative mean, or estimated current risk factor values (Web Figures 6 and 7). Risk discrimination was better at younger ages than at older ages across all models (Web Figure 8).

Sensitivity analyses

There was no difference in risk discrimination when the model was restricted to using historical repeated-measures data collected up to 10 years before the landmark age (C-index = 0.768, 95% CI: 0.758, 0.777) or when the estimated current risk factor values were adjusted for the rate of clinic visits (C-index = 0.766, 95% CI: 0.756, 0.775). However, we observed an increase in risk discrimination using estimated future 10-year average risk factor levels (C-index = 0.774, 95% CI: 0.765, 0.783) instead of estimated current risk factor values. C-indices were lower when statin users were included in the analysis, but the patterns of risk discrimination and calibration remained the same as in the main analysis (Web Tables 7 and 8).

DISCUSSION

In this paper, we have presented a computationally feasible statistical framework for developing dynamic risk prediction models for use on EHRs with historical repeated measures of risk factors. The 2-stage landmark approach combines Cox proportional
hazards regression and age-specific multivariate linear mixed-effects models, which account for sporadically recorded repeat measures, unobserved data, and measurement errors. We illustrated the framework for the derivation and validation of a primary-care dynamic risk prediction model for 10-year CVD risk, but it has potential for wider application to other diseases and conditions and for use on other electronic patient records in which repeated measurements are recorded, such as those collected in secondary-care settings.

Figure 2, panel A. Brier score (95% CI): basic model, 0.041 (0.040, 0.043); model with estimated current values of risk factors, 0.041 (0.039, 0.042); model with age interactions, 0.040 (0.039, 0.042).
Figure 2, panel B. C-index (95% CI): basic model, 0.752 (0.742, 0.761); model with estimated current values of risk factors, 0.768 (0.759, 0.777); model with age interactions, 0.769 (0.760, 0.778).
Figure 2, panel C. Change in C-index (95% CI): basic model, 0.000 (referent); model with estimated current values of risk factors, 0.016 (0.013, 0.020), P < 0.01; model with age interactions, 0.017 (0.013, 0.022), P < 0.01.

Figure 2. Calibration and risk discrimination statistics for 3 models of cardiovascular disease risk prediction (n = 32,328), The Health Improvement Network, United Kingdom, 1997–2016. A) Calibration statistics for each risk prediction model. The graph shows the Brier score (▪) and 95% confidence interval (CI; bars) for each model. A lower Brier score is interpreted as better calibration. B) Risk discrimination statistics for each risk prediction model. The graph shows the C-index (▪) and 95% CI (bars) for each model. A higher C-index value is interpreted as better discrimination. C) Change in risk discrimination for each risk prediction model. The graph shows the change in C-index (▪) and its 95% CI (bars) for each risk prediction model in relation to the basic model (referent). The basic model included age and sex plus the last observed measures for diabetes status and hypertension treatment. The model with estimated current values of the risk factors included all factors in the basic model plus predicted current values for smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol. The model with age interactions included all factors in the basic model plus predicted current values for smoking status, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol, plus interactions of age with all risk factors.

Our motivation was based on optimizing electronic primary-care data for automatically identifying high-risk individuals for full formal disease risk assessment, rather like a prescreening tool with the potential to increase the cost-effectiveness of health care. For example, several international guidelines for CVD risk assessment and management (21, 33–35) recommend using a systematic strategy for prioritizing people for full formal risk assessment on the basis of an estimate of their CVD risk using risk factors already recorded in EHRs. CVD risk assessment tools, such as the Framingham risk model (36) and QRISK2 (4), are now integrated into electronic primary-care record systems, but they are not purposefully designed for prescreening use. The QRISK2 model estimates CVD risk using the last observed values for the numerous risk factors, and when data are missing, imputes them using age- and sex-specific population averages for continuous risk factors or assumes no adverse clinical indicators. Our proposed framework optimizes the use of all available historical risk factor values, handling potential bias from spurious one-off measurements, and when data are missing, intrinsically imputes them using all other risk factor information. Future work should formally compare such models for prescreening use and assess their cost-effectiveness.

Figure 3. Overall and age-adjusted values for C-index, The Health Improvement Network, United Kingdom, 1997–2016. Dashed lines, 95% confidence intervals. (Vertical axis: C-index, 0.4 to 0.8; horizontal axis: landmark age, years, 40 to 85; series: age-adjusted C-index and overall C-index.)

For illustration, we compared a basic CVD risk model using sex, age, diabetes status, and treatment for hypertension against extended risk models with additional risk factors incorporated as cumulative means, last observed values, or estimated current risk factor values for smoking status, SBP, total cholesterol, and HDL-C. Our findings showed a modest improvement in risk discrimination when including estimated current values of additional risk factors but no difference in risk discrimination in the restricted data set when comparing additional risk factors incorporated as last observed, cumulative means, or estimated current risk factor values. Cumulative mean risk factor values handle sporadically recorded repeat measurements and account for measurement errors, but they are only estimable for persons with at least 1 historical measurement on all risk factors and thus are not suitable for population-wide screening. A major strength of the landmark model with estimated current values of risk factors is that it can be applied to persons with at least 1 measure on any of the risk factors included in the multivariate mixed model (in our illustration, this was approximately 80% of individuals).

Another strength of our landmark framework is that it was developed and internally validated using data that reflected the complexity and messiness of the EHRs that would be used to estimate future disease risk for individuals, unlike risk prediction models developed using purpose-designed cohort studies. Importantly, the assumptions made about the dynamic nature of the historical repeat-measures data, unobserved risk factors, and measurement errors in the model development are compatible with the assumptions required for making a risk prediction for a new individual using data from EHRs.

In our sensitivity analysis, we investigated the use of predicted future 10-year average risk factor levels instead of estimated current values and observed a modest improvement in risk discrimination. This suggests that future risk factor values for smoking, SBP, total cholesterol, and HDL-C are more predictive of future 10-year CVD risk than current values. A considerable limitation in this analysis is that it ignores informative censoring of individuals due to death or CVD events in the multivariate mixed-effects model, although evidence from empirical and simulation studies (11, 14) suggests that there is often little to be gained from more complex modeling (e.g., joint models (37)).

Other methods with which to develop risk prediction models for use on EHRs exist, including machine learning approaches such as neural networks (14, 38, 39) and statistical approaches such as joint models (14). Prediction models developed using landmark and joint models for single risk factors have been previously compared (40) but not in a setting using multivariate risk factors. Joint models are more computationally burdensome than landmark models, and further development is required before they are computationally feasible for application to large EHR data sets. However, landmark models can be developed using any standard statistical software with multivariate mixed-effects models and Cox regression. Analyses employing the landmark-age- and sex-specific multivariate mixed-effects models can be run in parallel, since the most computationally burdensome part is extracting the out-of-sample individual-specific random intercepts for estimation of the current risk factor values.

Certain limitations of our proposed method remain. First, our approach assumes a multivariate normal distribution for estimated current values of continuous and binary risk factors. Such an assumption is not uncommon in statistical methodology for epidemiology (e.g., in regression calibration (10) and multiple imputation (41)); however, it would be possible to replace it with a mixture of regression models with correlated latent variables (42). Second, the added distributional assumptions on the risk factors may limit transferability of the model to other populations and may require recalibration methods for use of the model in other populations, especially in comparison with conventional CVD prediction models. Investigating the impact of model misspecification is on our future research agenda. Third, uncertainties in the estimated current risk factor values are not accounted for in the Cox model. However, our previous work suggested that such uncertainties are often negligible relative to the estimated standard errors of the β coefficients in the Cox model (10). Fourth, persons with more frequent EHRs are more likely to have health conditions or health anxiety. We attempted to account for this by adjusting the estimated current risk factor values by the annual rate of repeated measurements, although it may be plausible to additionally include this as a risk factor in the Cox model. Fifth, for our illustration, we assumed a lack of specific Read or drug codes to indicate no diagnosis or medication use, and information on cause of death was only available for 13% of participants who died, meaning CVD incidence was underestimated in this study. Sixth, we used the same definition of CVD events as used in CVD risk prediction models employed in practice, such as QRISK2, which includes “soft” outcomes such as angina. However, while angina can be a symptom of coronary heart disease, it is not a disease itself, and the appropriateness of including it in the outcome definition of CVD risk prediction models will depend on the clinical context. Finally, despite the use of contemporary data, CVD screening and treatment practices have changed over time and are not accounted for in the models. These limitations are unlikely to have affected our between-model comparisons.

The benefits of optimizing EHRs for disease risk screening and personalized health-care decisions are increasingly being recognized. There is a growing need for suitable statistical methods, data analytics, and machine learning approaches with which to address the computational and methodological challenges involved in the analysis of such “big data.” The framework presented in this paper provides a practical, transparent, and flexible solution for the development of dynamic risk prediction models for use on EHRs.

ACKNOWLEDGMENTS

Author affiliations: Department of Public Health and Primary Care, School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom (Ellie Paige, Jessica Barrett, David Stevens, Michael J. Sweeting, Angela M. Wood); National Centre for Epidemiology and Population Health, Research School of Population, The Australian National University, Canberra, Australia (Ellie Paige); MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom (Jessica Barrett); Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, United Kingdom (Ruth H. Keogh); and Institute of Epidemiology and Health, Research Department of Primary Care and Population Health, Institute of Epidemiology and Health Care, University College London, London, United Kingdom (Irwin Nazareth, Irene Petersen).

This work was funded by the Medical Research Council (MRC) (grant MR/K014811/1). J.B. was supported by an MRC fellowship (grant G0902100) and the MRC Unit Program (grant MC_UU_00002/5). R.H.K. was supported by an MRC Methodology Fellowship (grant MR/M014827/1).

The study funders played no role in the design, analysis, or interpretation of the study.

Conflict of interest: none declared.

REFERENCES

1. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff (Millwood). 2014;33(7):1123–1131.
2. Dafni U. Landmark analysis at the 25-year landmark point. Circ Cardiovasc Qual Outcomes. 2011;4(3):363–371.
3. Paige E, Barrett J, Pennells L, et al. Use of repeated blood pressure and cholesterol measurements to improve cardiovascular disease risk prediction: an individual-participant-data meta-analysis. Am J Epidemiol. 2017;186(8):899–907.
4. Hippisley-Cox J, Coupland C, Vinogradova Y, et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ. 2008;336(7659):1475–1482.
5. Collins GS, Altman DG. An independent and external validation of QRISK2 cardiovascular disease risk score: a prospective open cohort study. BMJ. 2010;340:c2442.
6. Goldstein BA, Navar AM, Pencina MJ, et al. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc. 2017;24(1):198–208.
7. van Houwelingen HC, Putter H. Dynamic Prediction in Clinical Survival Analysis. Boca Raton, FL: CRC Press; 2012.
8. Xanthakis V, Sullivan LM, Vasan RS. Multilevel modeling versus cross-sectional analysis for assessing the longitudinal tracking of cardiovascular risk factors over time. Stat Med. 2013;32(28):5028–5038.
9. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38(4):963–974.
10. Fibrinogen Studies Collaboration. Correcting for multivariate measurement error by regression calibration in meta-analyses of epidemiological studies. Stat Med. 2009;28(7):1067–1092.
11. Sweeting MJ, Barrett JK, Thompson SG, et al. The use of repeated blood pressure measures for cardiovascular risk prediction: a comparison of statistical models in the ARIC Study. Stat Med. 2017;36(28):4514–4528.
12. Singh A, Nadkarni G, Gottesman O, et al. Incorporating temporal EHR data in predictive models for risk stratification of renal function deterioration. J Biomed Inform. 2015;53:220–228.
13. Akbarov A, Williams R, Brown B, et al. A two-stage dynamic model to enable updating of clinical risk prediction from longitudinal health record data: illustrated with kidney function. Stud Health Technol Inform. 2015;216:696–700.
14. Goldstein BA, Pomann GM, Winkelmayer WC, et al. A comparison of risk prediction methods using repeated observations: an application to electronic health records for hemodialysis. Stat Med. 2017;36(17):2750–2763.
15. Wells BJ, Chagin KM, Li L, et al. Using the landmark method for creating prediction models in large datasets derived from electronic health records. Health Care Manag Sci. 2015;18(1):86–92.
16. Damman K, Jaarsma T, Voors AA, et al. Both in- and out-hospital worsening of renal function predict outcome in patients with heart failure: results from the Coordinating Study Evaluating Outcome of Advising and Counseling in Heart Failure (COACH). Eur J Heart Fail. 2009;11(9):847–854.
17. Maziarz M, Heagerty P, Cai T, et al. On longitudinal prediction with time-to-event outcome: comparison of modeling options. Biometrics. 2017;73(1):83–93.
18. In Practice Systems Ltd. The Health Improvement Network (THIN). 2016. http://www.inps.co.uk/vision/health-improvement-network-thin. Accessed July 5, 2016.
19. Chisholm J. The Read clinical classification. BMJ. 1990;300(6732):1092.
20. Davé S, Petersen I. Creating medical and drug code lists to identify cases in primary care databases. Pharmacoepidemiol Drug Saf. 2009;18(8):704–707.
21. Goff DC Jr, Lloyd-Jones DM, Bennett G, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am College Cardiol. 2014;63(25):2935–2959.
22. Muntner P, Colantonio LD, Cushman M, et al. Validation of the atherosclerotic cardiovascular disease Pooled Cohort risk equations. JAMA. 2014;311(14):1406–1415.
23. Sharma M, Petersen I, Nazareth I, et al. An algorithm for identification and classification of individuals with type 1 and type 2 diabetes mellitus in a large primary care database. Clin Epidemiol. 2016;8:373–380.
24. Horsfall L, Walters K, Petersen I. Identifying periods of acceptable computer usage in primary care research databases. Pharmacoepidemiol Drug Saf. 2013;22(1):64–69.
25. Maguire A, Blak BT, Thompson M. The importance of defining periods of complete mortality reporting for research using automated data from primary care. Pharmacoepidemiol Drug Saf. 2009;18(1):76–83.
26. Littman AJ, Boyko EJ, McDonell MB, et al. Evaluation of a weight management program for veterans. Prev Chronic Dis. 2012;9:E99.
27. Hajifathalian K, Ueda P, Lu Y, et al. A novel risk score to predict cardiovascular disease risk in national populations (Globorisk): a pooled analysis of prospective cohorts and health examination surveys. Lancet Diabetes Endocrinol. 2015;3(5):339–355.
28. Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely collected health Data (RECORD) statement. PLoS Med. 2015;12(10):e1001885.
29. Lloyd-Jones DM. Cardiovascular risk prediction: basic concepts, current status, and future directions. Circulation. 2010;121(15):1768–1777.
30. Harrell FE Jr, Califf RM, Pryor DB, et al. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543–2546.
31. White IR, Rapsomaniki E; Emerging Risk Factors Collaboration. Covariate-adjusted measures of discrimination for survival data. Biom J. 2015;57(4):592–613.
32. Goldstein BA, Bhavsar NA, Phelan M, et al. Controlling for informed presence bias due to the number of health encounters in an electronic health record. Am J Epidemiol. 2016;184(11):847–855.
33. National Institute for Health and Care Excellence. Cardiovascular Disease: Risk Assessment and Reduction, Including Lipid Modification. (Clinical guideline CG181). London, United Kingdom: National Institute for Health and Care Excellence; 2014.
34. New Zealand Ministry of Health. Cardiovascular Disease Risk Assessment: Updated 2013. New Zealand Primary Care Handbook 2012. Wellington, New Zealand: Ministry of Health; 2013.
35. Perk J, De Backer G, Gohlke H, et al. European Guidelines on Cardiovascular Disease Prevention in Clinical Practice (version 2012). The Fifth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of nine societies and by invited experts). Eur Heart J. 2012;33(13):1635–1701.
36. D’Agostino RB Sr, Vasan RS, Pencina MJ, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–753.
37. Rizopoulos D. Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics. 2011;67(3):819–829.
38. Miotto R, Li L, Kidd BA, et al. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.
39. Shameer K, Johnson KW, Yahi A, et al. Predictive modeling of hospital readmission rates using electronic medical record-wide machine learning: a case-study using Mount Sinai heart failure cohort. Pac Symp Biocomput. 2016;22:276–287.
40. Suresh K, Taylor JMG, Spratt DE, et al. Comparison of joint modeling and landmarking for dynamic prediction under an illness-death model. Biom J. 2017;59(6):1277–1300.
41. Schafer JL. Analysis of Incomplete Multivariate Data. Boca Raton, FL: CRC Press; 1997.
42. Fitzmaurice GM, Laird NM. Regression models for mixed discrete and continuous responses with potentially missing values. Biometrics. 1997;53(1):110–122.

Journal

American Journal of Epidemiology, Oxford University Press

Published: Jul 1, 2018

Keywords: primary health care; high density lipoprotein cholesterol; total cholesterol; smoking; hypertension
