Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018

JAMIA Open, 1(1), 2018, 87–98
doi: 10.1093/jamiaopen/ooy011
Advance Access Publication Date: 4 June 2018
Research and Applications

Predictive modeling in urgent care: a comparative study of machine learning approaches

Fengyi Tang (1), Cao Xiao (2), Fei Wang (3), and Jiayu Zhou (1)

(1) Department of Computer Science and Engineering, Michigan State University College of Engineering, East Lansing, Michigan, USA; (2) AI for Healthcare, IBM Research, Cambridge, Massachusetts, USA; and (3) Department of Healthcare Policy and Research, Weill Cornell Medical School Cornell University, New York, New York, USA

Corresponding Author: Jiayu Zhou, 428 S Shaw Ln, East Lansing, Michigan 48824, USA. (email@example.com)

Received 14 December 2017; Revised 30 March 2018; Accepted 2 April 2018

ABSTRACT

Objective: The growing availability of rich clinical data such as patients' electronic health records provides great opportunities to address a broad range of real-world questions in medicine. At the same time, artificial intelligence and machine learning (ML)-based approaches have shown great promise in extracting insights from those data and helping with various clinical problems. The goal of this study is to conduct a systematic comparative study of different ML algorithms for several predictive modeling problems in urgent care.

Design: We assess the performance of 4 benchmark prediction tasks (eg mortality and length of stay prediction, differential diagnostics, and disease marker discovery) using medical histories, physiological time-series, and demographics data from the Medical Information Mart for Intensive Care (MIMIC-III) database.

Measurements: For each given task, performance was estimated using standard measures including the area under the receiver operating characteristic curve (AUC), F1 score, sensitivity, and specificity.
Microaveraged AUC was used for multiclass classification models.

Results and Discussion: Our results suggest that recurrent neural networks show the most promise in mortality prediction where temporal patterns in physiologic features alone can capture in-hospital mortality risk (AUC > 0.90). Temporal models did not provide additional benefit compared to deep models in differential diagnostics. When comparing the training–testing behaviors of readmission and mortality models, we illustrate that readmission risk may be independent of patient stability at discharge. We also introduce a multiclass prediction scheme for length of stay which preserves sensitivity and AUC with outliers of increasing duration despite decrease in sample size.

Key words: predictive modeling, machine learning, urgent care

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact firstname.lastname@example.org

INTRODUCTION AND BACKGROUND

The increasing adoption of electronic health record (EHR) systems has brought unprecedented opportunities to the field of medical informatics. Much research has used such data for tasks such as predictive modeling, disease subtyping [2], and comparative effectiveness research [3]. Machine learning (ML) approaches are common tools for implementing these tasks.

With the popularity of artificial intelligence (AI) in recent years, ML, as a way of realizing AI, has been developing rapidly, and numerous ML approaches have been proposed. From an application perspective, however, users may have difficulty choosing the right ML algorithm for the right problem. This is why different papers often adopt different approaches without explicitly explaining their motivation and rationale.

In this article, we aim to fill this gap by conducting a systematic comparative study of the applications of different ML approaches to predictive modeling in health care. The scenario we care about specifically is the Emergency Room/Urgent Care setting, where fast-paced decisions must be made to determine the acuity of each visit and allocate an appropriate amount of resources. A growing community in medical informatics focusing on quality improvement has elucidated the relevance of these factors to medical errors and overall quality of care. Accurate predictive modeling can help recognize the status of patients and the environment in time and allow decision makers to work out better plans. Much research on predictive modeling in the emergency room has been conducted in recent years, such as identification of high-risk patients for in-hospital mortality, length of intensive care unit (ICU) stay outliers, 30-day all-cause readmissions, and prediction of differential diagnoses for admissions [7,8], which have proven useful in different respects, including decreasing unnecessary lab tests and increasing the accuracy of inpatient triage for admission decisions [10,11].

Many ML algorithms have been applied to those tasks [12–15]. In particular, since 2012, deep learning models have achieved great success in many applications involving images [16,17], speech [18], and natural language processing [19]. Researchers in medical informatics have also been exploring the potential of those powerful models. Lipton et al showed that recurrent neural networks (RNN) using only physiologic markers from EHR can achieve expert-level differential diagnoses over a wide range of diseases. Choi et al showed that, by using word embedding techniques for contextual embedding of medical data, diagnostic and procedural codes alone can predict future diagnoses with sensitivity as high as 0.79. More recently, benchmark performances for decompensation and length of stay (LOS) predictions have also been investigated. The key technical differences in these studies come from 2 major components: (1) the patient representation, which turns each patient into a structured data point for modeling, and (2) the learning algorithm, which infers patterns from the patient representations and delivers a predictive model. In this article, we compare several state-of-the-art patient representations and ML algorithms across 4 benchmark tasks and discuss clinical insights derived from the results.

METHODS

Data set description
The Medical Information Mart for Intensive Care (MIMIC-III) database obtained from Physionet.org was used in our study. This data set was made available by Beth Israel Hospital and includes deidentified inpatient data from their critical care unit from 2005 to 2011. MIMIC-III captures hospital admission data, laboratory measurements, procedure event recordings, pharmacologic prescriptions, transfer and discharge data, diagnostic data, and microbiological data from 46 520 unique patients. In total, there were 58 976 unique admissions and 14 567 unique International Statistical Classification of Diseases and Related Health Problems (ICD)-9 diagnostic codes. When considering only nonpediatric patients (age ≥18) and discounting admissions without multiple transfers or length of ICU stay <24 h, there were a total of 30 414 unique patients and 37 787 unique admissions. A summary of the demographic distribution of patients can be found in Supplementary Table S1.

Predictive tasks in assessment
Four learning tasks are adopted in our study as the benchmarks of those ML algorithms.

In-hospital mortality
The in-hospital mortality task was modeled as a binary classification problem. In total, there were 4155 adult patients (13%) who experienced in-hospital mortality, of which 3138 (75.5%) were in the ICU setting. Traditionally, SAPS and SOFA scores are used to evaluate mortality risk. SAPS-II predicts within a wide range (0.65–0.89) of area under receiver operating characteristic curve (AUC) scores, depending on the critical conditions being studied. Our study evaluates the performance of predictive models using AUC. Sensitivity, specificity, and f1-scores were included to aid the interpretation of AUC scores due to the presence of class imbalance.

Length of stay
Prediction of length of ICU stay remains an important task for identifying high-cost hospital admissions in terms of staffing cost and resource management [6]. Accurate predictions of outliers in ICU stays (eg 1–2 weeks) may greatly improve inpatient clinical decisions. We formulated LOS as a multiclass classification problem using bins of lengths (1, 2), (3, 5), (5, 8), (8, 14), (14, 21), (21, 30), (30+, ) to reflect the range of possible LOS values in terms of days. As shown in Figure 1, this binning scheme smoothly captured the exponential decay of LOS with increasing number of days.

To evaluate the performance on the LOS task, AUC, f1-score, sensitivity, and specificity were calculated for each bin, and microaveraged AUC and f1-scores were calculated for the overall performance of the model across all bins. AUC and f1-scores were chosen to facilitate the interpretation of LOS performance in comparison with other tasks.

Differential diagnoses
We examined the top 25 most commonly appearing conditions (ICD-9 codes) in MIMIC-III using a multilabel classification framework (see Supplementary Material Section S8.3). Supplementary Table S2 shows these diagnoses with their associated absolute and relative prevalence (%) among the MIMIC-III population. To evaluate the performance of predictions, AUC, f1-score, sensitivity, and specificity scores were calculated for each disease label, and microaveraged AUC and f1-score were calculated for each admission.

Readmission prediction
We investigate 2 types of readmissions: all-cause 30-day readmission, where the number of positive cases amounts to 1884 (5.1%) of total admissions; and variable length readmissions. For the latter, we use bins to generate 6 classes associated with each admission that correspond to observed time-to-readmission: (1, 30), (30, 90), (90, 180), (180, 360), (360, 720), (720+, ), measured in days, and the prediction problem is formulated as a multiclass prediction problem. Both approaches are evaluated with AUC, F1, sensitivity, and specificity scores.

Patient features

Diagnosis codes
There are 14 567 unique ICD-9 diagnostic codes in MIMIC-III data, which would lead to high-dimensional, very sparse representations for patients if we treated each distinct code as one dimension. Therefore, we use the ICD-9-CM group codes instead, which capture group-level disease specificity by using only the first 3 characters of the full-length codes. In this way, we reduce the feature dimension to 942 ICD-9 group codes. Supplementary Figure S1 shows the distribution of diagnostic codes and diagnostic categories in MIMIC-III.

Figure 1. Distribution of length of stay (LOS) and readmission in MIMIC-III. A, The distribution of patient volume for each ICU length of stay range.
This binning scheme allowed for patient volumes to follow smooth exponential decay with increasing LOS time. Bins 5–8 and 8–14 are of particular interest, as these are frequently used as lower thresholds for defining "LOS outliers" for identifying high-cost admissions. B, The distribution of patient volume for each time-to-readmission range, measured in days. Because few patients in MIMIC-III had multiple admissions, the number of patients that fall under the 720+ days category greatly outnumbers the rest. MIMIC-III, Medical Information Mart for Intensive Care.

Temporal variables
To capture the temporal patterns of complex diseases, we also consider temporal variables of the 6 most frequently sampled vital signs and the top 13 most frequently sampled laboratory values for downstream prediction tasks. Since sampling frequency differs greatly per inpatient admission, we took hourly averages of time-series variables up to the first 48 hours of each admission across all prediction tasks. This approach resembles hourly sampling methods from previous studies [21,23]. Each temporal variable was standardized by taking the difference with its mean and dividing by its standard deviation. Figure 2 further summarizes the distributions of these variables. Missing data were imputed with the carry-forward strategy at each time-stamp. For sequential models, we simply use the standardized hourly average data per admission as the physiologic baseline, denoted as x19.

History-level representation (diagnostic history only)
In more recent papers, it has been proposed that sequential data may be more effectively represented in embedded representations, where each event is mapped onto a vector space of related events [22,28]. Embedding techniques such as word2vec allow sparse representations of medical history to be transformed into dense word vectors whose mappings also capture contextual information based on co-occurrence.
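To make the co-occurrence idea concrete, the following minimal sketch enumerates the (center, neighbor) pairs that a skip-gram model such as word2vec would be trained on for one admission's diagnostic history. It is an illustration under stated assumptions, not the authors' implementation: the function name `skipgram_pairs`, the window size, and the example ICD-9 group codes are all hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(events: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for one admission treated as a sentence.

    Each event is paired with every other event that lies at most
    `window` positions away from it in the sequence.
    """
    pairs = []
    for i, center in enumerate(events):
        lo = max(0, i - window)                # left edge of the sliding window
        hi = min(len(events), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # skip the center event itself
                pairs.append((center, events[j]))
    return pairs


# Hypothetical admission: ICD-9 group codes in the order they were recorded.
admission = ["428", "427", "584", "401"]
pairs = skipgram_pairs(admission, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Codes that frequently appear in each other's windows across many admissions end up with similar embedding vectors; the window size controls how much of the surrounding admission counts as context.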
As shown in Figure 3, each admission was treated like a sen- tence, with medical events occurring as neighboring words. In a slid- Demographics ing window fashion, word2vec takes the middle word of each In addition, we also consider patients’ demographic variables such sliding window and learns the most likely neighboring words. This as age, marital status, ethnicity and insurance information for each representation strategy was denoted as w2v. As an additional base- patient. Age was quantized into 5 bins (18, 25), (25, 45), (45, 65), line, sum of one-hot vectors was also used to represent diagnostic (65, 89), (89þ,). history for collapse models, denoted as onehot. Feature representations Combined representation Based on the aforementioned types of features, performances were Mixed time-series and static representations were used for both se- compared across 4 types of feature representation strategies: (i) Physi- quential and collapse models. For collapse models, Word2Vec ologic features only which is denoted x19 (19 physiologic time-series embeddings of diagnostic history was concatenated with summary variables) for sequential and x48 (48 h average) for classic models. features from time-series data as features for prediction. This was (ii) Diagnostic histories only, denoted as w2v for word2vec embed- denoted as W48 (w2vþ x48). For sequential models, we utilized 2 28 29 dings and onehot for one-hot vector representations. (iii) Com- separate layers of input: the x19 input was fed into recurrent layer, bined visit-level and demographic information-level representation as and its output was merged with the w2v input layer. The hierarchi- denoted by w48 for classic models and x19_h2v or x19_demo for se- cal sequential models were labeled as x19_h2v when both diagnostic quential models. (iv) Embedded sentence-level representation, and demographic histories were used for the w2v input, and denoted as sentences for all kinds of models. 
Specifics of these repre- x19_demo when only demographic word2vec inputs were used. The sentations can be found in the Supplementary Material of this article. latter case applied only to the prediction of differential diagnoses, where diagnostic history of admissions were used as labels rather than as features. Visit-level representation (physiologic features only) For collapse models [Support Vector Machines (SVM), Logistic Re- gression (LR), Ensemble Classifiers, and Feed-Forward MLPs), raw Embedded representation hourly averages for each time-series variable was converted into 5 In this representation scheme, both diagnostic history and time- summary features per variable: minimum value, maximum value, series variables were treated as word vectors for representation. For mean, standard deviation, and number of observations for the each admission, time-series data (l_) and diagnostic history (d_)in duration of the admission. We denote this representation as X48. the sequence they were encountered during the admission. Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 90 JAMIA Open, 2018, Vol. 1, No. 1 Figure 2. Distribution of physiologic time-series variables in MIMC-III. A, The kernel density distribution of lab values used in the comparative study. Each variable follows a Gaussian distribution with magnesium and PH having the lowest variance. B, The histogram view of laboratory variable distributions. BUN, creatinine, platelets, and serum lactate measurements demonstrate right-skew behavior while PH is left-skewed. To differentiate the type of event, each feature is labeled with prefix (eg l_51265, sodium) was 2 standard deviations above its normal “l_” for labs or vital signs and “d_” for diagnosis. 
Time-series varia- value, the sentence vector for the admission would include the bles were discretized and included in the feature vector depending ITEMID of the lab (eg [ .., l_52165, d_341, .. . ]). In this setting, we on whether or not the observed event was within 1 standard were able to map abnormal time-series values with frequently co- deviation of its mean value. For example, if an observed lab value occurring diagnostic codes in the same word-vector space. Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 JAMIA Open, 2018, Vol. 1, No. 1 91 Figure 2. Continued Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 92 JAMIA Open, 2018, Vol. 1, No. 1 Figure 3. word2vec embedding of medical events. A, The general architecture of skip-gram embedding used to map sparse one-hot representation of medical codes into dense word vector embeddings. Given a series of discrete medical events, center, and neighboring events are generated in a sliding window fashion, where the neural network learns the relationships nearby words for contextual representation. The weights which map input events onto the hidden layers are used as a ﬁltering layer for future inputs for prediction tasks. B, An overview of the word2vec pipeline for transforming input features from the EHR into word vec- tor representations. Sentence-level representation is being shown here, but word2vec can be used exclusively for diagnostic codes in visit-level representations as well. Types of predictive models 0.76 range across most models, but it did not capture the same level Collapse models of sensitivity and specificity as did exclusively time-series and mixed Collapse models are standard ML models which do not consider feature representations. temporal information. 
In this study, we examined SVM, Random When comparing mortality prediction performance between var- Forest (RF), Gradient Boost Classifier (GBC), LR, and Feed-forward ious embedding techniques, the most notable performance boost oc- Multi-Layer Perceptron (MLP). curred when RNN models achieved significantly greater AUC (0.907 for LSTM and 0.933 for CNN) and f1-scores (0.526 for LSTM and 0.587 for CNN) while using visit-level features when Sequential models compared to the next best model (feed-forward MLP architecture w/ Two RNN models were examined in this study: the bidirectional 0.816 AUC, 0.519 f1-score). Similarly, when using mixed visit- and 30,31 Long Short-term Memory (LSTM) model and the Convolu- history-level features, LSTM and CNN preserved around 10% AUC tional Neural Network w/ LSTM model (CNN-LSTM). Regulari- increase and 15% f1-score increase in comparison to MLP and en- zation was implemented with Dropout and L2 regularization at semble models. The key advantage of sequential models over MLP is each LSTM layer. For binary and multilabel classification tasks, sig- that they capture temporal relationships between time-steps with se- moid activation function was used at the fully connected output quentially presented data. While previous studies have cited ability layer, and binary cross-entropy was used as loss function. For the of inflammatory markers and vital signs for in-hospital mortality multiclass case (eg LOS and readmission bins), softmax activation 13,33 prediction, notable performance difference between our col- was used at the output layer with categorical cross-entropy as loss lapse and sequential models suggests that 48 h temporal trends may function. Adam optimizer with initial learning rate of 0.005 was greatly augment the predictive ability of physiologic markers. used in both cases. 
Refer to Supplementary Material Section S8.2 for details about the mechanics of these models as well as the hyperparameter tuning LOS prediction procedures. Our code is available at https://github.com/illidanlab/ur- Table 2 summarizes performance for various models across 8 LOS gent-care-comparative for the features and models presented in this ranges. In admissions resulting in 1–5 ICU days, MLP w/ x48 article. Figure 4 provides an overview of the workflow of our experi- achieved the highest AUC and f1-scores. LR w/ w48 achieved the ment from preprocessing to prediction. highest AUC and f1-scores for durations greater than 5 days. In fact, the highest performance achieved by LR w/48 was in predicting out- lier cases >30 days with AUC of 0.934 and f1-score of 0.173. In pre- dicting LOS outliers between 8 and 14 days, LR w/48 achieved AUC RESULTS of 0.840 and f1-score of 0.372. Performance patterns were similar In-hospital mortality prediction between sequential and LR, where the lowest performance occurred Table 1 summarizes the top performances of models on the mortal- for predictions between 2 and 5 days (AUC ranging from 0.62 to ity prediction task. Sequential models significantly out-performed 0.74) and highest performance occurred for predictions between the collapse models in AUC (P-value <.05 for all sequential vs col- 8 and 30þ outlier days (AUC ranging from 0.83 to 0.89). lapse comparisons, see Supplementary Table S6.) and achieved the One notable trend was that while the AUC scores consistently in- highest AUC score of 0.949 (0.003 std). In general, diagnostic codes creased as the outlier days increased, the f1-scores decreased, as did alone yielded the poor performance for both classic and sequential the sample size of the bins. For example, LR with mixed physiologic models. 
Time-series data alone achieved the closest performance to and diagnostic features produced average AUC scores of 0.748, combined visit- and history-level representations for both sequential 0.579, 0.705, 0.84, 0.887, and 0.917 for LOS ranges (1, 2), (2, 3), and classical models. In fact, the highest sensitivity score (0.911) (3, 5), (5, 8), (8, 14), (14, 21), and (21, 30). The progression of f1- was achieved by vanilla LR with only physiologic data (x48). scores were: 0.704, 0.372, 0.34, 0.298, 0.372, 0.264, and 0.173. In- Sentence-level representation yielded consistent scores in the 0.70– terestingly, the sensitivity values also progressively increased for in- Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 JAMIA Open, 2018, Vol. 1, No. 1 93 Figure 4. Overview of workﬂow. A, The overview of our experimental pipeline from preprocessing to prediction. Raw EHR data is ﬁrst processed into Uniform Feature Matrix (UFM), where key features such as hourly averaged vital signs, ICD-9 group codes and lab values are extracted per patient and aligned. Labels for each task is then extracted for each relevant patient. Additional preprocessing is performed for different features (eg embedding, described below). Once features are normalized and aligned, prediction is performed for each task. B Uniform Feature Matrix (UFM) used for prediction. The “prediction window” refers to the elapsed time between data used for feature construction and the event of prediction (eg 30 days postdischarge in readmission). Table 1. 
Summary of top performing mortality models w/ representation schemes Rank Model AUC F1 Sn Sp P-value Classic models 1 MLP w/ W48 0.855 (0.0058) 0.546 (0.011) 0.877 (0.0071) 0.834 (0.007) .0019 2 RF w/ W48 0.843 (0.0073) 0.523 (0.005) 0.864 (0.019) 0.821 (0.0052) .0018 3 GBC w/ W48 0.773 (0.0098) 0.437 (0.013) 0.759 (0.024) 0.786 (0.017) .014 Sequential models 1 LSTM w/ x19 1 h2v 0.949 (0.003) 0.623 (0.012) 0.883 (0.016) 0.887 (0.0073) .0001 2 CNN-LSTM w/ x19þh2v 0.940 (0.0071) 0.633 (0.031) 0.852 (0.04) 0.895 (0.023) .0022 3 CNN-LSTM w/ x19 0.933 (0.006) 0.587 (0.025) 0.854 (0.016) 0.868 (0.018) .0025 Note: Each performance metric is evaluated across 5 stratiﬁed shufﬂe splits. The mean performance is reported with the standard deviation in parenthesis. The P-value is calculated by comparing the AUC of a given model with the baseline performance with LR and physiologic markers. More extensive pairwise statistical t-tests are shown in Supplementary Table S6. Abbreviations: AUC: area under receiver operating characteristic curve; F1: f1-score; Sn: sensitivity; Sp: speciﬁcity; MLP: Multi-Layer Perceptron; RF: Random Forest; LSTM: Long Short-term Memory; CNN: Convolutional Neural Network; GBC: Gradient Boost Classiﬁer. Bold values indicate best performance. creasing LOS bins: 0.804, 0.695, 0.659, 0.748, 0.878, 0.916, 0.953, Differential diagnoses prediction and 0.955. Such pattern suggests that the trade-off occurred for pos- Table 3 summarizes the performances of models across various key itive predictive values (PPV), which dramatically decreased for lon- differential diagnoses in MIMIC-III. Overall, sequential models did ger LOS days. This can be attributed to the fact that the absolute not significantly improve performance when compared to MLP (see number of outlier patients decreased dramatically with increasing Supplementary Table S7). CNN-LSTM using hierarchal inputs from LOS days. 
Since PPV is sensitive to the proportion of positive sam- visit- and history-level information performed best among sequential ples while sensitivity is not, the change in f1-score can be explained models, but differences were not significant (P-value >.05). by the distribution of labels rather than a decrease in true-positive Our models were able to predict renal diseases with the highest prediction by the models. In fact, the AUC, sensitivity, and specific- performance (0.887–0.895 AUC between MLP and CNN-LSTM ity increased with LOS bins for most models, suggesting that our models) presumably due to the inclusion of blood urea nitrogen lev- binning technique was especially helpful in discriminating LOS out- els (BUN) and creatinine as features. BUN-to-creatinine ratio is liers with increasing duration of stay. commonly used as a clinical metric for evaluating glomerular Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 94 JAMIA Open, 2018, Vol. 1, No. 1 Table 2. 
Summary of top performing LOS predictors w/ representation schemes Bins Model AUC F1 P-value Classic models 1–3 d MLP w/ x48 0.791 (0.0043) 0.746 (0.0072) .0034 3–5 d MLP w/ w48 0.653 (0.018) 0.444 (0.029) .081 5–8 d LR w/ w48 0.705 (0.006) 0.298 (0.007) .121 8–14 d LR w/ w48 0.840 (0.0079) 0.372 (0.014) .029 14–21 d LR w/ x48 0.887 (0.019) 0.264 (0.015) .033 21–30 d LR w/ x48 0.917 (0.011) 0.182 (0.01) .0016 30þ LR w/ w48 0.934 (0.011) 0.173 (0.0041) .0028 Micro LR w/ w48 0.747 (0.0025) 0.419 (0.0018) .051 Sequential models 1–3 d CNN-LSTM w/ x19 0.758 (0.0055) 0.615 (0.015) .013 3–5 d CNN-LSTM w/ x19 0.645 (0.0047) 0.139 (0.031) .092 5–8 d CNN-LSTM w/ x19 0.736 (0.0029) 0.103 (0.012) .088 8–14 d CNN-LSTM w/ x19 0.838 (0.0055) 0.181 (0.037) .055 14–21 d CNN-LSTM w/ x19 0.877 (0.009) 0.112 (0.025) .0046 21–30 d LSTM w/ x19þh2v 0.879 (0.025) 0.135 (0.032) .011 30þ LSTM w/ x19þh2v 0.889 (0.027) 0.165 (0.07) .005 Micro CNN-LSTM w/ x19 0.846 (0.001) 0.368 (0.010) .00014 Note: Each performance metric is evaluated across 5 stratiﬁed shufﬂe splits. The mean performance is reported with the standard deviation in parenthesis. The P-value is calculated by comparing the AUC of a given model with the baseline performance with random forest classiﬁer and diagnostic histories. More extensive pairwise statistical t-tests are shown in Supplementary Table S8. Abbreviations: LOS: length of stay; AUC: area under receiver operating characteristic curve; F1: f1-score; CNN: Convolutional Neural Network; MLP: Multi- Layer Perceptron; LR: Logistic Regression; LSTM: Long Short-term Memory. Bold values indicate best performance. Table 3. 
Summary of top performing DDX predictors w/ representation schemes DDX Model AUC F1 P-value Classic models CHF MLP w/ x48 0.784 (0.00238) 0.488 (0.00689) .000273 CAD MLP w/ x48 0.798 (0.00612) 0.52 (0.011) .000498 Aﬁb MLP w/ x48 0.745 (0.00218) 0.401 (0.0121) .00260 Sepsis MLP w/ x48 0.883 (0.00422) 0.312 (0.0101) 9.99E5 AKF MLP w/ x48 0.886 (0.00387) 0.505 (0.0106) 3.82E5 CKD MLP w/ x48 0.870 (0.00612) 0.276 (0.0173) .000121 T2DM MLP w/ x48 0.742 (0.00584) 0.199 (0.0175) .00435 Hyperlipidemia MLP w/ sentences 0.751 (0.00519) 0.17 (0.00178) .00269 Pneumonia MLP w/ x48 0.723 (0.00492) 0.001 (0.00112) .00658 Micro MLP w/ x48 0.806 (0.0021) 0.328 (0.003) .000123 Sequential models CHF LSTM w/ x19 þ demo 0.785 (0.00346) 0.455 (0.0211) .000469 CAD CNN w/ x19 þ demo 0.793 (0.00486) 0.480 (0.0382) .000629 Aﬁb LSTM w/ x19 þ demo 0.768 (0.00534) 0.341 (0.0494) .00161 Sepsis LSTM w/ x19 0.862 (0.00892) 0.254 (0.0268) .000332 AKF CNN w/ x19 0.863 (0.00729) 0.488 (0.0285) .000208 CKD CNN w/ x19 þ demo 0.872 (0.00611) 0.172 (0.0154) .000115 T2DM LSTM w/ x19 þ demo 0.746 (0.00881) 0.144 (0.0213) .00736 Hyperlipidemia CNN w/ x19 þ demo 0.749 (0.0122) 0.175 (0.048) .0115 Pneumonia CNN w/ x19 þ demo 0.723 (0.0115) 0.006 (0.00106) .0216 Micro CNN w/ 19 þ demo 0.803 (0.00308) 0.306 (0.0105) .000224 Note: Each performance metric is evaluated across 5 stratiﬁed shufﬂe splits. The mean performance is reported with the standard deviation in parenthesis. The P-value is calculated by comparing the AUC of a given model with the baseline performance using LR using physiologic markers. More extensive pairwise statisti- cal t-tests are shown in Supplementary Table S7. 
Abbreviations: DDX: differential diagnoses; AUC: area under receiver operating characteristic curve; F1: f1-score; Sn: sensitivity; Sp: speciﬁcity; CHF: conges- tive heart failure; CAD: coronary arteriolar disease; Aﬁb: atrial ﬁbrillation; AKF: acute kidney failure; CKD: chronic kidney disease; T2DM: type II diabetes mellitus; MLP: Multi-Layer Perceptron; CNN: Convolutional Neural Network; LSTM: Long Short-term Memory. Bold values indicate best performance. performance and intactness of renal nephrons. Similarly, essential data across time. However, interesting patterns emerge when we hypertension yielded high AUC scores across most models due to were able to identify disease phenotypes without using the gold stan- our inclusion of systolic bood pressure and diastolic blood pressure dard clinical markers typically associated with those conditions. Downloaded from https://academic.oup.com/jamiaopen/article-abstract/1/1/87/5032901 by Ed 'DeepDyve' Gillespie user on 07 November 2018 JAMIA Open, 2018, Vol. 1, No. 1 95 Table 4. Summary of top performing readmission models w/ representation schemes Rank Model AUC F1 Sn Sp P-value Classic models 1 RF w/ w48 0.582 (0.0067) 0.122 (0.0025) 0.601 (0.02) 0.563 (0.0086) .0387 2 LR w/ w2v 0.577 (0.0067) 0.123 (0.0023) 0.574 (0.031) 0.592 (0.0211) .0469 3 RF w/ 48h 0.577 (0.009) 0.121 (0.003) 0.571 (0.021) 0.583 (0.004) .0657 Sequential models 1 LSTM w/ x19 þ h2v 0.580 (0.00914) 0.112 (0.0043) 0.548 (0.0192) 0.565 (0.0206) .0606 2 LSTM w/ x19 0.554 (0.00648) 0.108 (0.0028) 0.538 (0.0168) 0.554 (0.0214) .107 3 LSTM w/ w2v 0.552 (0.0154) 0.107 (0.0038) 0.567 (0.0404) 0.524 (0.0272) .199 Note: Each performance metric is evaluated across 5 stratiﬁed shufﬂe splits. The mean performance is reported with the standard deviation in parenthesis. The P-value is calculated by comparing the AUC of a given model with the random classiﬁer with AUC of 0.50 and variance of 0.0015. 
Abbreviations (Table 4): AUC: area under the receiver operating characteristic curve; F1: F1-score; Sn: sensitivity; Sp: specificity; RF: Random Forest; LR: Logistic Regression; LSTM: Long Short-term Memory.

For example, cardiovascular conditions such as atrial fibrillation (Afib) and congestive heart failure (CHF) are often confirmed by ECG (usually via 24 h Holter monitor) and echocardiography (stress-induced or otherwise), respectively. Our study shows that RNNs, using only vital signs, demographic information, and a subset of metabolic panel lab values, were able to capture their prevalence with as high as 0.785 AUC and 0.395 sensitivity scores for CHF and 0.768 AUC and 0.328 sensitivity scores for Afib. In comparison, the gold standard measurement with 24 h Holter monitor detects Afib with sensitivity of 0.319 at annual screenings and tops out at 0.71 if done monthly. Because Afib episodes occur spontaneously in many cases, they can be easily missed during physical exams unless Holter monitors or expensive implantable devices are used for longitudinal monitoring. The predictive ability of physiologic markers alone for CHF and arrhythmic events suggests the possibility that arrhythmic cardiac pathologies yield temporal changes in physiologic regulation that are screenable in the acute setting.

There were several diseases for which sensitivity and F1-scores were very low across all model predictions. For example, all classic models with the exception of MLP failed to predict (AUC of 0.50) depressive disorder (psychiatric), esophageal reflux (GI), hypothyroidism (endocrine), tobacco use disorder (behavioral), pneumonia and food/vomit pneumonitis (infectious), chronic air obstruction (respiratory, may be seasonal or trigger-dependent), and nonspecific anemia (hematologic). The most surprising condition of the above-mentioned cases was hypothyroidism, which is known to cause long-term physiologic changes in metabolism and vital signs. While it is possible that the physiologic markers did not capture the progression of these diseases, the cause of underperformance was likely the duration of our observation window (48 h), which may have failed to capture the longitudinal or trigger-based temporal patterns of more chronic diseases.

Prediction for all-cause readmission within a 30-day window

Table 4A summarizes the top performing models for binary and multiclass classification of readmission events. Ensemble classifiers (RF and GBC) produced comparable performances to RNN models in both tasks. In both the multiclass and the binary classification case, the best performing sequential model was a hierarchical LSTM using mixed visit- and history-level features. However, this architecture was only able to achieve a mean AUC score of 0.580 (0.009 std) and F1-score of 0.112 (0.004 std) on test sets across 5-fold cross-validation. The best performing collapsed model was an RF classifier using mixed physiologic and history features (RF w/ w48), which achieved an AUC of 0.582 (0.007 std) and F1-score of 0.122 (0.003 std).

DISCUSSION

Key features and models for each task

Our results show that sequential models are most suitable for in-hospital mortality prediction, where temporal patterns of simple physiologic features are adequate in capturing mortality risk. Deep models in general significantly outperform nondeep models for the differential diagnostic task (Supplementary Table S5), but temporal information from sequential models did not provide additional benefit when compared to MLP. For LOS prediction, collapsed models and deep models provided similar performance across various time ranges. A more important difference was in feature selection, where physiologic markers significantly outperformed diagnostic histories in predicting LOS range for both deep and nondeep models (Supplementary Table S8). Our results for all-cause readmission suggest the need for additional features for this particular task. Physiologic and diagnostic histories alone do not capture the defining elements of this particular clinical problem. Supplementary Table S9 briefly summarizes the best model and features for each task.

Readmission as a separate problem from patient stability

Figures 5A and B show the differences in generalizability of RNN models for the mortality and readmission tasks. In both cases, bidirectional LSTMs were trained with 5-fold cross-validation to illustrate learning behavior and test-set generalization. For both cases, it was clear that the training AUC was increasing with each training iteration (epoch), while the loss function was decreasing consistently. However, only in the mortality case did we observe an increase in testing AUC, which should ideally follow the training AUC behavior. In the readmission case, the training AUC approached 0.90+ over 30 epochs, but the testing AUC increased from 0.50 toward the 0.57–0.61 range and fluctuated for the following epochs (>5). Such behavior exemplified most, if not all, of our model training behaviors for this task. This discrepancy points to the idea that perhaps our feature representation was inadequate in capturing risk factors for readmission. In particular, examining patterns in diagnostic history, health care coverage (as represented by insurance policy, marital status, and ethnicity in our case), and physiological markers may be insufficient in capturing the key contributing factors of hospital readmission.

Figure 5. Comparison of training performance for readmission and mortality tasks. A, 5-fold validation training data of vanilla bidirectional LSTM trained on physiologic time-series data only. Training AUC is demarcated tr_auc while testing AUC is demarcated te_auc. B, 5-fold validation training data of the same model architecture and feature selection on the readmission task. In both cases, the training AUC scores increased with decreasing loss on the training set, but only in the mortality task are the train-test results generalized. This suggests a wide disparity between the readmission task samples which the models could not capture. C, Model training data captured in multitask learning of readmission, in-hospital mortality, 30-day mortality, and 1-year mortality. All AUC scores shown in C are testing data only. With increasing epochs, only mortality models improved. More importantly, the training patterns show that knowledge transfer from mortality tasks did not improve readmission predictions.

We further examined the dependence of readmission on the “stability” of patients. All-cause 30-day readmission has classically been formulated as a problem of accurately depicting patient stability upon discharge from inpatient facilities. If this were the case, then there should exist parallel patterns in postdischarge mortality and readmission. Figure 5C demonstrates a supplementary experiment done with multitask learning of in-hospital mortality, 30-day readmission, 30-day mortality, and 1-year mortality. Here, a vanilla bidirectional LSTM was used for training across 100 epochs over 5-fold validation, with the average values across different k-folds visualized in the summary plot. We see that while there was knowledge transfer across in-hospital mortality, 30-day mortality and 1-year mortality, the 30-day readmission task did not gain any additional performance boost from the added knowledge captured by the mortality prediction tasks. In fact, the testing AUC patterns of 30-day mortality differed greatly from those of 30-day readmission. The LSTM model, using only temporal physiologic data, was able to capture generalizable performance across all mortality tasks but not the readmission task.

CONCLUSION

In this study, we leveraged performance differences between patient feature representations and predictive model architectures to capture insight from clinical tasks. One notable limitation of this study is the exclusion of procedural and medication data from our analysis of clinical outcomes. The fact that inclusion of demographic features such as insurance policy, marital status, gender and race of the patients did not benefit our readmission prediction models points to the possibility that accurate risk models for more complex tasks such as readmission may require feature selection to include environmental factors such as medication progression, procedural follow-up and access to transportation. For example, previous studies have cited that system-level factors such as medicine reconciliation, access to transportation and coordination with primary providers may play pivotal roles in unplanned readmission and postdischarge mortality [35–39]. Future studies may include medication administration, drug history and adverse effect data to build a more comprehensive picture of postdischarge risk factors. Lastly, we note that the scope of this study includes identifying the optimal model and feature representation techniques for various clinical tasks; future investigations may address the interpretability of deep models and differences in feature importance for the various tasks.

FUNDING

This research project was supported in part by the National Science Foundation under grants IIS-1615597 (JZ), IIS-1565596 (JZ), IIS-1749940 (JZ), IIS-1650723 (FW), IIS-1716432 (FW), IIS-1750326 (FW) and the Office of Naval Research under grant N00014-17-1-2265 (JZ).

Conflict of interest statement. None declared.

CONTRIBUTORS

All authors provided significant contributions to:
• the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work.
• drafting the work or revising it critically for important intellectual content.
• final approval of the version to be published.
• agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

FT was responsible for the majority of data acquisition, implementation of experiments, and result interpretation. CX and FW provided major editing of writing and guided the design of experiments. JZ provided original conception of the project and guided the majority of experimental formulations.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

REFERENCES

1. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2008; 77 (2): 81–97.
2. Saria S, Goldenberg A. Subtyping: what it is and its role in precision medicine. IEEE Intell Syst 2015; 30 (4): 70–5.
3. Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013; 51 (8 Suppl 3): S30.
4. Kluge EHW. Resource allocation in healthcare: implications of models of medicine as a profession. Medscape Gen Med 2007; 9 (1): 57.
5. Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med 2006; 34 (5): 1297–310.
6. Dahl D, Wojtal GG, Breslow MJ, Huguez D, Stone D, Korpi G. The high cost of low-acuity ICU outliers. J Healthcare Manag 2012; 57 (6): 421–33.
7. Kansagara D, Englander H, Salanitro A, et al. Risk prediction models for hospital readmission: a systematic review. JAMA 2011; 306 (15): 1688–98.
8. Ben-Assuli O, Shabtai I, Leshno M. The impact of EHR and HIE on reducing avoidable admissions: controlling main differential diagnoses. BMC Med Inform Decis Mak 2013; 13 (1): 49.
9. Cismondi F, Celi LA, Fialho AS, et al. Reducing unnecessary lab testing in the ICU with artificial intelligence. Int J Med Inform 2013; 82 (5): 345–58.
10. Zhang X, Kim J, Patzer RE, et al. Prediction of emergency department hospital admission based on natural language processing and neural networks. Methods Inf Med 2017; 56 (5): 377–89.
11. Politano AD, Riccio LM, Lake DE, et al. Predicting the need for urgent intubation in a surgical/trauma intensive care unit. Surgery 2013; 154 (5): 1110–6.
12. Warner JL, Zhang P, Liu J, Alterovitz G. Classification of hospital acquired complications using temporal clinical information from a large electronic health record. J Biomed Inform 2016; 59: 209–17.
13. Calvert JS, Price DA, Barton CW, Chettipally UK, Das R. Discharge recommendation based on a novel technique of homeostatic analysis. J Am Med Inform Assoc 2016; 24 (1): 24–9.
14. Forkan ARM, Khalil I. A probabilistic model for early prediction of abnormal clinical events using vital sign correlations in home-based monitoring. In: 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE; 2016: 1–9.
15. Farhan W, Wang Z, Huang Y, Wang S, Wang F, Jiang X. A predictive model for medical events based on contextual embedding of temporal sequences. JMIR Med Inform 2016; 4 (4).
16. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521 (7553).
17. Wan J, Wang D, Hoi SCH, et al. Deep learning for content-based image retrieval: a comprehensive study. In: Proceedings of the 22nd ACM International Conference on Multimedia. ACM; 2014: 157–166.
18. Deng L, Hinton G, Kingsbury B. New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2013: 8599–8603.
19. Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning. ACM; 2008: 160–167.
20. Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform 2017: bbx044.
21. Lipton ZC, Kale DC, Elkan C, Wetzel RC. Learning to diagnose with LSTM recurrent neural networks. arXiv Preprint arXiv:1511.03677.
22. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference; 2016: 301–318.
23. Harutyunyan H, Khachatrian H, Kale DC, Galstyan A. Multitask learning and benchmarking with clinical time series data. arXiv Preprint arXiv:1703.07771 2017.
24. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035.
25. Bisbal M, Jouve E, Papazian L, et al. Effectiveness of SAPS III to predict hospital mortality for post-cardiac arrest patients. Resuscitation 2014; 85 (7): 939–44.
26. Pantet O, Faouzi M, Brusselaers N, Vernay A, Berger M. Comparison of mortality prediction models and validation of SAPS II in critically ill burns patients. Ann Burns Fire Disasters 2016; 29 (2): 123.
27. Zhou J, Wang F, Hu J, Ye J. From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2014: 135–144.
28. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems; 2013: 3111–9.
29. Uriarte-Arcia AV, López-Yáñez I, Yáñez-Márquez C. One-hot vector hybrid associative classifier for medical data classification. PLoS One 2014; 9 (4): e95715.
30. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9 (8): 1735–80.
31. Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. 1999.
32. Dumoulin V, Visin F. A guide to convolution arithmetic for deep learning. arXiv Preprint arXiv:1603.07285 2016.
33. Ljunggren M, Castrén M, Nordberg M, Kurland L. The association between vital signs and mortality in a retrospective cohort study of an unselected emergency department population. Scand J Trauma Resusc Emerg Med 2016; 24 (1): 21.
34. Andrade JG, Field T, Khairy P. Detection of occult atrial fibrillation in patients with embolic stroke of uncertain source: a work in progress. Front Physiol 2015; 6: 100.
35. Futoma J, Morris J, Lucas J. A comparison of models for predicting early hospital readmissions. J Biomed Inform 2015; 56: 229–38.
36. Mueller SK, Sponsler KC, Kripalani S, Schnipper JL. Hospital-based medication reconciliation practices: a systematic review. Arch Internal Med 2012; 172 (14): 1057–69.
37. Medlock MM, Cantilena LR, Riel MA. Adverse events following discharge from the hospital. Ann Internal Med 2004; 140 (3): 231–2.
38. Budnitz DS, Shehab N, Kegler SR, Richards CL. Medication use leading to emergency department visits for adverse drug events in older adults. Ann Internal Med 2007; 147 (11): 755–65.
39. Oddone EZ, Weinberger M, Horner M, et al. Classifying general medicine readmissions. J Gen Internal Med 1996; 11 (10): 597–607.
40. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011; 12 (Oct): 2825–30.
41. Chollet F. Keras, GitHub; 2015. https://github.com/fchollet/keras.
42. Tu JV, Guerriere MR. Use of a neural network as a predictive instrument for length of stay in the intensive care unit following cardiac surgery. Comput Biomed Res 1993; 26 (3): 220–9.
43. Hachesu PR, Ahmadi M, Alizadeh S, Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare Inform Res 2013; 19 (2): 121–9.
44. Christopher MB. Pattern Recognition and Machine Learning. New York: Springer; 2016.
45. Elliott M. Readmission to intensive care: a review of the literature. Aust Crit Care 2006; 19 (3): 96–104.
46. Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intell Data Anal 2002; 6 (5): 429–49.
47. King G, Zeng L. Logistic regression in rare events data. Polit Anal 2001; 9 (2): 137–63.
JAMIA Open – Oxford University Press
Published: Jul 1, 2018