A rank weighted classification for plasma proteomic profiles based on case-based reasoning

A rank weighted classification for plasma proteomic profiles based on case-based reasoning Background: It is a challenge to precisely classify plasma proteomic profiles into their clinical status based solely on their patterns even though distinct patterns of plasma proteomic profiles are regarded as potential to be a biomarker because the profiles have large within-subject variances. Methods: The present study proposes a rank-based weighted CBR classifier (RWCBR). We hypothesized that a CBR classifier is advantageous when individual patterns are specific and do not follow the general patterns like proteomic profiles, and robust feature weights can enhance the performance of the CBR classifier. To validate RWCBR, we conducted numerical experiments, which predict the clinical status of the 70 subjects using plasma proteomic profiles by comparing the performances to previous approaches. Results: According to the numerical experiment, SVM maintained the highest minimum values of Precision and Recall, but RWCBR showed highest average value in all information indices, and it maintained the smallest standard deviation in F-1 score and G-measure. Conclusions: RWCBR approach showed potential as a robust classifier in predicting the clinical status of the subjects for plasma proteomic profiles. Keywords: Case-based reasoning, Plasma proteomic profiles, Classification, Rank Background CBR has less risk of overfitting in prediction, and medical Case-based reasoning (CBR) is an artificial intelligent cases can’t be often explained by general patterns of the approach based on an inference technique that is said to case-base. It is important to classify the plasma proteomic be the most effective method to construct an expert profiles solely depending on their shapes because their system [1]. When a target case occurs, CBR is mainly distinct patterns of profiles are regarded as a potential bio- performed according to the following four procedures: marker according to clinical status [4]. However, plasma retrieving, reusing, revising and retaining [2, 3]. It solves a proteomic profiles may be a typical example not following target problem by revising the solution with the previous the general patterns which lead to poor accuracies in cases in similar situations retrieved from the case-base, prediction by classification methods based on overall and the target case is retained in the case-base for the next means of similarity due to large within-subject variance, problem once the problem is solved. Thus, up-to-date and there is no gold standard to analyze the plasma case-base is always maintained in CBR system. The CBR proteomic profiles yet. The present study conducts a CBR system has been applied in many learning or problem- based classification method with the plasma proteomic solving techniques of real-world applications. In particular, profiles which does not make decision for classification the prediction techniques based on CBR can be more depending on the overall mean. However, CBR often also appropriate in bio-medical field than other fields because shows lower prediction performance compared to other learning techniques. Previous studies proposed some methods to improve the performance of CBR. Those stud- Correspondence: amykwon@korea.ac.kr ies were primarily focused on either weight optimization Big Data Science, Division of Economics & Statistics, College of Public Policy, Korea University, Sejong, Korea © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 2 of 9 methods [5–9] or feature (or subset) selection methods the proposed method using Rank, and section 4 describes [10, 11], and one study proposed a hybrid generic our data schemes and empirical results. At last, the con- approach to optimize the both with the number of neigh- clusion and further research are discussed in section 5. bor cases to compute in the case retrieval procedure of CBR [12]. The meaningful set of features is often predeter- Methods mined by experts in bio-medical fields, and the most simi- The CBR classifier with plasma proteomic profiles lar case may result in the best accuracy in prediction The general problem-solving process by the CBR classifier when output values of each feature are wide-spread like is described in Fig. 1. The CBR classifier describes a target plasma proteomic profiles. If that is the case, a proper problem using old experiences, and finds a solution of the weight optimization may only enhance the prediction problem by retrieving similar cases stored in the case-base performance of CBR. The weights are optimized either to the target problem where the case-base is the specific subjectively or objectively. Subject weights are typically knowledge base of past experiences. The case is typically allocated according to the preference scores or informa- retrieved by learning techniques for the CBR classifier, tion of experts such as Delphi method [5]. Objective and the most common technique is k-nearest neighbor weights can be allocated by entropy method [7], statistical (NN). The original CBR classifier uses 1-NN which method [8] or they can be optimized while proceeding retrieves the most similar case from the case-base to the algorithms such as generic algorithm (GA) [12]orneural- target problem. The problem is adapted from the retrieved net (NN) [6]. Among these approaches, NN needs a large cases, and is revised. Once the problem is solved, the cases number of inter-connected neurons to allocate weights, so are retained. The CBR classifier with plasma proteomic small or moderate-size samples may not attain a standard profiles maintained the same scheme. The problem is to structure of NN [9]. GA is also criticized due to premature identify the class of a target case by comparing the pattern convergence or low reliability [9]. A weight by a statistical of the target case with those in the case-base where the method was allocated by the proportion of Wald’s statis- case-base consists of trained samples with their class- tics [8] which is obtained by assuming asymptotically nor- labels. The case is retrieved from the case-base by k-near- mal distributions of parameters. The present study est neighbor (NN) to solve the problem, and the target proposes a non-parametric weight allocation method case as well as the retrieved case are stored in the case- without using normality assumption. We investigate the base once the class of the target case is determined. accuracy of a CBR based classification with plasma prote- omic profiles to diagnose cervical cancer, and observe the Prior studies for weight optimization enhancement of the prediction performance of the CBR The original classifier assesses the similarity of a target classifier by allocating feature weights. To validate our ap- case with cases in the case-base under the assumption proach, we also conduct previous weight allocation that all features are equally likely important. However, it methods for the CBR classifier together with the plasma may be practical to think of the relative importance proteomic profiles. The paper is organized as follows. among the features, so some researches differently After introduction, section 2 briefly describes the CBR allocated the weights on features considering the relative system and reviews previous studies. Section 3 presents importance. Since different weights for the attributes Fig. 1 Problem solving process of CBR Classifier. The CBR classifier finds the solution of the problem by retrieving similar cases stored in the case-base to the target problem as described Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 3 of 9 can vary the distribution of the overall similarities Table 1 Prior studies about the weight optimization methods among the cases, the retrieved cases by the CBR Authors Year Methods Weights classifier can be different depending on the altered Cardie & Howe [14] (1997) Information gain G(f) distribution of the similarities. Regarding that matter, Ahn & Kim [12] (2009) Relative importance [0-7] the weight allocation or optimization is closely related with the performance of the CBR classifier. f ¼1 In particular, the weight allocation or optimization Gu et al. [13] (2010) Delphi method – techniques have gained attention as a way to Chang et al. [5] (2011) Delphi method – enhance the performance of the CBR classifier in entropy Zhao et al. [7] (2011) Entropy method previous studies. entropy DELPHI method is one of the most common f ¼1 approaches to allocate feature weights to the CBR Wald Liang et al. [8] (2012) Logistic regression classifier. DELPHI method directly reflects experts’ opin- Wald ions about the features as the corresponding weights like f ¼1 Gu et al. [13] or Chang et al. [5], so the weights can be indicates information gain of the f-th feature, and entropy is defined as − p ∙ log p changed by the point of view of the subjects. Alterna- i 2 i tively, weights have been objectively allocated using information gain or entropy. Cardie and Howe [14] first in all methods in their studies [18, 19]. Most classifiers selected a set of relevant features using a decision tree, try to use a distance metric that keeps data points close and assigned the weights with information gain to the if the class labels are the same while keeps distance from feature which was chosen by the tree. Ahn and Kim [12] the data points if the class labels are different. The goal encoded feature weights with numbers from 0 to 7 of the CBR classifier is to predict a class label of a target which represented the relative importance of the case of x by retrieving the most similar case from the features. These numbers were processed as 3-bit binary case-base using the proper distance metric. Let χ ¼f x numbers and transformed into floating decimal numbers ; ⋯; x g be a collection of n data points in the case- (x ) for weights. Zhao et al. [7] used information entropy base with the known class labels of C ={c , ⋯, c } where 1 n for feature weights to select suppliers. They computed ! m x ∈R and c ∈ {1, ⋯, K}. The CBR classifier typically the average regression coefficients to seek the integrated adapts the k-NN approach to retrieve the similar cases average index of each supplier, and calculated both the to the target case with a given k.The k-NN approach as- information gain in ID3 of the decision tree and the sumes that the class conditional probability in the near- entropy. These values were later standardized as the ! ! est neighbors to x ; Nð x Þ , is constant, and tries to 0 0 numbers in the range of [0,1] for weights. Besides, Liang maintain consistency in predicting class labels for x by et al. [8] optimized feature weights by a statistical obtaining its neighborhood as follows where I(·) is an in- approach. They fitted features with binary logistic dicator function. regression, and computed the Wald statistics of param- eter estimates for the features. Then, the statistics are P ! ! Ix ∈Nx IcðÞ ¼ j i 0 i ! i¼1 standardized by dividing them by the sum of all the pj x ¼ P ð1Þ ! ! Ix ∈Nx statistics before they are allocated to the features as the i 0 i¼1 weights. Suitable weights may vary depending upon problems we encountered. Prior studies about the The global distance between the target case and any weight optimization or allocation methods are summa- case in the case-base is computed by summing up the rized in Table 1. local distances to determine the nearest neighbors for the target case on Eq. (1). The local distance is Rank-based weight optimization computed for each feature between the target case and Distance functions and problem setting any case in the case-base by the pre-defined local A typical similarity or dissimilarity measure is a distance distance metric, and the types of local distance metric metric, and it is crucial to learn a good distance metric do not have to be the same among the features. Euclid- to represent the similarity or dissimilarity in feature ean distance metric is typically used to compute the space although there are considerable researches on physical distance between the two data points, but it distance metrics [15–17]. Some researches have been suffers in the case that vectors of data points aren’t focused on the comparison of their impacts on the linearly distributed like default measurements of performance in classification with known public proteomic profiles. On the contrary, Fre’chet distance database [18, 19]. However, no single similarity or metric is known to be useful to measure the distance dissimilarity showed dominantly superior to the others between the data points when the vectors of those data Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 4 of 9 points lie on the non-linear curve [20]. According to the similarity is regarded as a function of ranks in the characteristics of the feature types, the distance metric present study because the similarity is computed consists of either Euclidean distance metric or Fre’chet according to the corresponding rank-order informa- distance metric for the present study. Euclidean distance tion of the distances for features between the target metric and Fre’chet distance metric are defined for the case and any case in the case-base. Thus, the weights ð f Þ ð f Þ ! ! can be naturally allocated to the features in the feature, f, as follows where x and x are sub- i 0 similarity measure maintaining the same property vectors of any data point in the case-base and the target from the objective function based on Wilcoxon’srank case consisting of the feature, f, respectively. sum statistics. The objective function for the present ðÞ f ðÞ f ðÞ f ðÞ f ðÞ f ðÞ f study can be summarized as follows where n is the ! ! ! ! ! ! d x ; x ¼ x − x x − x ð2Þ i 0 i 0 i 0 number of cases having the class label of 1 when the class labels are denoted as either 0 or 1 and the ðÞ f ðÞ f ðÞ f ðÞ f ! ! ! 0 ! 0 d x ; x ¼ inf max dx ðÞ αðÞ t ; x ðÞ βðÞ t ð3Þ i 0 i 0 α;β t ∈½ 0;1 number of classes, J, is set to 2. arg max ω  r ω : f ¼1;⋯;m f f f ¼1 Conversion to rank-order information X 1 ð f Þ ð0Þ ! ! where r ¼ rank d x ; x f f i 0 i¼1 Plasma proteomic profiles have the large within-subject variance. Although the class labels are the same, the n ðn þ 1Þ 1 1 profiles can be distributed over a considerable extent as well as they are not following the general pattern. We 0≤ω ≤1 constraint to determined the proximity of the cases using the global m ω ¼ 1 f ¼1 similarity based on rank-order information of the distances [21] instead of using the distance itself to ð5Þ enhance robustness in predicting the class label of the On Eq. (5), as the probability increases that the two target case in the present study. The similarity is com- groups of the cases are truly drawn from the population- puted as follows where N′ is the number of cases having cases having the different class labels, the corresponding a unique ranking-order in the case-base and ω is an feature weight of ω becomes large because the resulted unknown weight for a feature of f. statistics, W, is large. The significance of the test statis- 2 3 f f 0 ! ! tics is directly represented by the magnitude of the cor- N − rank d x ; x X f i 0 ! ! 4 5 Sx ; x ¼ ωf  ð4Þ responding p-value, so the feature weights can be i 0 N −1 f ¼1 computed using the magnitudes of p-values from the test statistics as follows. According to Eq. (4), the higher the rank, the greater the similarity between the i-th case and the target case. 1−pðjW j≥r Þ ω ¼ ð6Þ f m 1−pðjW j≥r Þ f ¼1 Weight optimization Every feature is equally likely important to the where W denotes the test statistics of the Wilcoxon’s original CBR classifier. Since the original CBR rank sum test. The feature weights from Eq. (6) are used classifier often showed lower predictability, there have to compute the similarity of Eq. (4). been some researches to improve the predictability by assigning different weights to the features according Application and experiments to their relative importance. In the same line of thoughts, we adopted the different weights to the Data description The proteomic profiles were obtained feature in calculating the similarity, and optimized the from the blood plasma samples which were collected weights according to the objective function from from recruited subjects at the University of Louisville, Wilcoxon’s rank sum test statistics. The ability of the KY, USA. Total 70 female subjects were recruited for objective function is mainly influenced by the feature this study, and 50% of those subjects were diagnosed weights, and the weights are determined to maximize with cervical carcinoma while the others are healthy the ability of the objective function to differentiate controls, without any known diseases. The study proto- the cases having different class labels. Wilcoxon’srank col was approved by the institutional review board of the sum test is a non-parametric test to assess the University of Louisville, and informed consent forms difference of the mean ranks for two samples, and it were voluntarily signed by the participants. The origin of is known to be useful when outliers exist in the the data can be referred to [22], and the secondary data observations compared to the parametric tests. The was used for the study. The default output measurement Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 5 of 9 of the proteomic profiles was the excess heat capacity The statistical model was introduced to show the dif- (ΔC ), which were recorded at the different tempera- ference of the plasma proteomic profiles between two tures from 45 to 90 °C by incrementally adding 1 °C to groups having different clinical status using a composite the previous measuring temperature. The proteomic coefficient which was a weighted product of an average profiles were preprocessed prior to the analysis. The probability being in the same group of the reference excess heat capacity (ΔC ) as the default measurement is sample and Pearson’s correlation coefficient. In the a vector of real numbers of length 451 and it typically present study, this model was conducted with the default shows one or two peaks on the range of temperatures setting of the composite coefficient as in the literature during the experiment. We newly extracted 5 features [23]. Namely, the reference set was composed with the from the pre-processed data besides the excess heat cases in the ‘control’ class for this model, and the weight capacity. The feature information is summarized in factor of the composite coefficient was set to 1 as Table 2. The class information for each proteomic described in the literature. This classification model is profile was labeled as either ‘control’ or ‘cancer’ ac- abbreviated as (SCUCC) indicating statistically classified cording to the clinical status of the corresponding sub- using the composite coefficient for the experiments as a ject. On Table 2, PEAK1 and PEAK2 indicate those reference method. Among the CBR approaches, the first peaks, and T1 and T2 are temperatures that those model is a classical CBR approach (CLCBR), which gives peaks occur at where {PEAK1, PEAK2, T1, T2} were attributes equal-weights and uses 1-NN for the case re- estimated by Gaussian kernel regression from the ex- trieval. This model would be the base model to examine cess heat capacity patterns. IND indicates a set of 451 the effect of the CBR classifiers having different weights individual measurement of the excess heat capacity, on features. ETCBR and LWCBR are the weighted CBR and IR is a binary value indicating the initial direc- approaches. The feature weights were allocated with tional tendency of the excess heat capacity as the standardized entropy value in ETCBR [7]. In the present temperature increases. IR is 1 if the directional ten- study, IR is Berno’ulli and the other features are dency is positive, 0 otherwise. assumed as normal distribution. The computed entropy was standardized by dividing each entropy by the sum of Numerical experiments The purpose of the numerical all entropy values prior to allocation, and 1-NN was experiments is to study the performance of the CBR used for the case retrieval. LWCBR indicates a weighted classifier in prediction with the plasma proteomic pro- CBR approach from logistic regression model. This files by comparing with the previous approaches. In par- model adopted standardized Wald statistics of the re- ticular, we observed whether the rank-based feature gression coefficients for feature weights by fitting the weights enhance the performance of the CBR classifier observation with binary logistic regression. Namely, with proteomic profiles or not. As reference methods, the Wald statistics were divided by the sum of the all two common machine learning methods, k-NN (k -NN) statistics before they were allocated to the features and support vector machine (SVM), a statistical ap- [8], and also used 1-NN for the case retrieval. Logis- proach using the composite coefficient [23], and three tic regression is a typical parametric approach and CBR approaches weighted by different allocation Wald statistics are derived from the regression coeffi- methods in previous studies [7, 8] were conducted to cients under the asymptotic normal assumption, so validate the performance of the proposed CBR approach this model can be a good reference to observe the for the present study. The number of neighbors for k performance of the proposed feature weights. The -NN was 5 which was determined by cross-validation proposed CBR approach is abbreviated as RWCBR (CV) with training samples, and SVM was conducted indicating a rank-weighted CBR approach. As based on the radial basis kernel. described in the above sections, the feature weights Table 2 Description of the features Features (Abbreviation) Type Contents Initial Response (IR) Binary number 0: decreasing 1: increasing Temperature 1 (T1) Real number Range [45 - 55] Temperature 2 (T2) Real number Range [56 - 90] Maximum Peak at T1 (PEAK1) Real number Range [0 - ∞] Maximum Peak at T2 (PEAK2) Real number Range [0 - ∞] A set of individual ΔC (IND) A vector of real numbers Range [0 - ∞] p Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 6 of 9 were computed from Wilcoxon’s rank sum test and true positive Precesion ¼ P the most similar case was retrieved as the other CBR ðÞ true positive þ false positive approaches. true positive Recall ¼ P The data set of proteomic profiles consists of the pre- ðÞ true positive þ false negative defined features on Table 2 with 70 subjects, and the precision  recall F‐1 score ¼ 2 class labels are fully given with the number of classes as ðÞ precision þ recall pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi two. The data set was randomly partitioned into five G‐measure ¼ precision  recall equal-sized subsets for 5-fold CV. At each fold, a subset was selected as a test set, and the other four subsets be- came a training set where the proportion of the cases With the retrieved case for each target case, the class were equally distributed from the two classes during the label of the target case was predicted according to Eq. (1), experiments. The feature weights for the ETCBR, and the prediction results were used to compute the LWCBR and RWCBR were estimated with the training information indices at each fold according to the above set, and the optimized weights were allocated to the fea- definitions. tures in the test set. The resulting information indices at each fold are summarized on Table 4, and the comprehensive statistics using the minimum (MIN), average, (AVG), standard Results deviation (STD) and the maximum (MAX) for each Each fold has the same size of cases in the test set index are summarized on Table 5. Among the models, and the training set as 14 and 56, respectively during RWCBR and SVM consistently showed good perfor- the experiments. The estimated feature weights by mances over different sample sets in predicting the class ETCBR, LWCBR and RWCBR at each fold are sum- labels with plasma proteomic profiles in comparison marized in Table 3. The performances of the seven with others, but the performance of RWCBR was slightly different models were evaluated at each fold in terms better. In case of Precision and Recall indices, LWCBR of Precision, Recall, F-1 score, [24, 25]and G- had the biggest range from the minimum of 33% to the measure [25]. The estimated weights in Table 3 were maximum of 80% and from the minimum of 29% to the allocated to the features when the CBR classifiers re- maximum of 100%, respectively while SVM had the trieved the most similar case from the case-base by smallest range from the minimum of 85.7% to the max- ETCBR, LWCBR and RWCBR. The information indi- imum of 100% in both indices. However, RWCBR ces of Precision, Recall, F-1 score and G-measure are showed the highest average value of 91% among the all defined as follows. models in both Precision and Recall, and the perform- ance was maintained at least 71%. ETCBR and RWCBR showed better performance than CLCBR in both Preci- sion and Recall, but LWCBR worked poor in compari- son with CLCBR by showing lower mean values in all Table 3 Estimated feature weights indices although we generally expected a weighted CBR Fold Model IR PEAK1 PEAK2 T1 T2 IND approach to perform better than CLCBR. Comparing I ETCBR 0.0067 0.0145 0.0347 0.0595 0.0220 0.8625 CLCBR to SCUCC, a statistical approach, average preci- LWCBR 0.0158 0.1145 0.3845 0.0960 0.0179 0.3713 sion index of CLCBC was lower, but average recall index RWCBR 0.1704 0.1890 0.1488 0.1886 0.1596 0.1436 was higher. F-1 score and G-measure were similar II ETCBR 0.0089 0.0311 0.0477 0.0766 0.0314 0.8042 between the two, so it appears that CBR approaches do LWCBR 0.0877 0.0752 0.2550 0.0016 0.3244 0.2561 not always work well with plasma proteomic profiles. Regarding F1-score and G-measure, SVM also main- RWCBR 0.2190 0.2222 0.0954 0.2203 0.1325 0.1106 tained the shortest ranges, but RWCBR showed the best III ETCBR 0.0068 0.0265 0.0391 0.0654 0.0243 0.8377 performance at most aspects of summary statistics LWCBR 0.0175 0.1780 0.2481 0.1188 0.1584 0.2793 among the seven models. In particular, it maintained the RWCBR 0.1274 0.2603 0.1144 0.2837 0.1092 0.1051 smallest standard deviations in comparison with the IV ETCBR 0.0076 0.0259 0.0383 0.0643 0.0247 0.8392 other models. LWCBR 0.0001 0.3524 0.0682 0.4252 0.1484 0.0056 The retrieved cases for each target case from the test set by RWCBR model at the first fold are displayed on RWCBR 0.1627 0.2266 0.1279 0.2206 0.1233 0.1388 Fig. 2. The black solid lines represent the 14 target cases V ETCBR 0.0065 0.0223 0.0345 0.0581 0.0213 0.8573 from the test set, and the red solid lines are the most LWCBR 0.0250 0.0036 0.1628 0.2751 0.3219 0.2116 similar cases retrieved from the case-bases according to RWCBR 0.2205 0.2186 0.1024 0.2334 0.1076 0.1175 the similarity measure of Eq. (4). Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 7 of 9 Table 4 Information indices by 5-Fold CV Fold Measures K-NN SVM SCUCC CLCBR ETCBR LWCBR RWCBR I Precision 0.5000 0.8571 1.0000 0.6667 1.0000 0.8000 1.0000 Recall 0.7143 0.8571 0.2857 0.5714 0.7143 0.5714 0.7143 F1-score 0.5882 0.8571 0.4444 0.6154 0.8333 0.6667 0.8333 G-measure 0.5976 0.8571 0.5345 0.6171 0.8452 0.6761 0.8452 II Precision 0.8571 0.8571 1.0000 0.8000 0.7500 0.3333 1.0000 Recall 0.8571 0.8571 0.8571 0.5714 0.8571 0.2857 0.8571 F1-score 0.8571 0.8571 0.9231 0.6667 0.8000 0.3077 0.9231 G-measure 0.8571 0.8571 0.9258 0.6761 0.8018 0.3086 0.9258 III Precision 0.6000 0.8571 0.8333 0.7500 0.8333 0.7778 0.7778 Recall 0.8571 0.8571 0.7143 0.8571 0.7143 1.0000 1.0000 F1-score 0.7059 0.8571 0.7692 0.8000 0.7692 0.8750 0.8750 G-measure 0.7171 0.8571 0.7715 0.8018 0.7715 0.8819 0.8819 IV Precision 0.7500 1.0000 1.0000 1.0000 0.8750 0.7778 1.0000 Recall 0.8571 1.0000 0.5714 1.0000 1.0000 1.0000 1.0000 F1-score 0.8000 1.0000 0.7273 1.0000 0.9333 0.8750 1.0000 G-measure 0.8018 1.0000 0.7559 1.0000 0.9354 0.8819 1.0000 V Precision 0.5455 0.8571 0.7000 0.6000 1.0000 0.6000 0.7778 Recall 0.8571 0.8571 1.0000 0.8571 1.0000 0.8571 1.0000 F1-score 0.6667 0.8571 0.8235 0.7059 1.0000 0.7059 0.8750 G-measure 0.6838 0.8571 0.8366 0.7171 1.0000 0.7171 0.8819 Table 5 Comprehensive statistics for information indices Measures Statistics K-NN SVM SCUCC CLCBR ETCBR LWCBR RWCBR Precision MIN 0.5000 0.8571 0.7000 0.6000 0.7500 0.3333 0.7778 AVG 0.6506 0.8857 0.9067 0.7633 0.8917 0.6578 0.9111 STD 0.1490 0.0639 0.1362 0.1529 0.1087 0.1985 0.1217 MAX 0.8571 1.0000 1.0000 1.0000 1.0000 0.8000 1.0000 Recall MIN 0.7143 0.8571 0.2857 0.5714 0.7143 0.2857 0.7143 AVG 0.8285 0.8857 0.6857 0.7714 0.8571 0.7428 0.9143 STD 0.0639 0.0639 0.2748 0.1917 0.1429 0.3097 0.1278 MAX 0.8571 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 F1-score MIN 0.5882 0.8571 0.4444 0.6154 0.7692 0.3077 0.8333 AVG 0.7236 0.8857 0.7375 0.7576 0.8672 0.6861 0.9013 STD 0.1067 0.0639 0.1795 0.1514 0.0965 0.2320 0.0637 MAX 0.8571 1.0000 0.9231 1.0000 1.0000 0.8750 1.0000 G-measure MIN 0.5976 0.8571 0.5345 0.6171 0.7715 0.3086 0.8452 AVG 0.7221 0.8857 0.7649 0.7624 0.8708 0.6931 0.9070 STD 0.1088 0.0639 0.1451 0.1488 0.0951 0.2345 0.0593 MAX 0.8571 1.0000 0.9258 1.0000 1.0000 0.8819 1.0000 Note: MIN the minimum, AVG average, STD standard deviation, MAX the maximum Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 8 of 9 Fig. 2 The retrieved cases by RWCBR at Fold 1. The figure illustrates 14 cases of proteomic profiles at fold 1. Black-solid lines represent target cases, and red-solid lines represent the solutions optimized by RWCBR Conclusion and Discussion outperform in average value in all information indices, Plasma proteomic profiles have been regarded as a and it maintained the lowest standard deviation in F-1 potential biomarker to diagnose certain diseases score and G-measure. Also, LWCBR showed lower per- according to their specific patterns. It is challenging to formance than CLCBR in most information indices. A precisely predict the clinical status based solely on the weighted CBR approaches do not always perform well, so patterns of profiles because some profiles do not the weight allocation or optimization methods should take frequently follow the general patterns, which leads to into accounts the characteristics of the data set to enhance large within-subject variance. The prediction based on the performance of CBR classifier. CBR based approaches may be effective in that case. The The sample size of the plasma proteomic profiles was CBR classifier predicts the clinical status of a target case small in the present study. However, RWCBR approach by retrieving the most similar case from the case-base, showed potential to predict the clinical status based so it would be advantageous in prediction because it can solely on plasma proteomic profiles as a robust classifier avoid the risk to make decision according to deviated over different sample sets in the present study. overall means due to the outlying pattern. However, CBR classifier often shows low predictability, and some Abbreviations AVG: Average; CBR: Case-based reasoning; CLCBR: Classical CBR; CV: Cross- studies made efforts to enhance the predictability using Validation; ETCBR: Entropy-based CBR; k-NN: k-Nearest neighbor; weight optimization for features. There is still no golden LWCBR: Logistic regression-based Weighted CBR; MAX: The maximum; standard to optimize or allocate the feature weights, MIN: The minimum; RWCBR: Rank-based Weighted CBR; SCUCC: Statistically classified using composite coefficient; STD: Standard deviation; SVM: Support which can be dependent upon the characteristics of the vector machine data we encounter. The present study suggests a rank-based weighted Acknowledgements CBR classifier (RWCBR) to predict the clinical status of We thank Assistant Professor NC Garbett at James Graham Brown Cancer plasma proteomic profiles. The rank-based weighted Center in the University of Louisville for providing us with the data set and valuable comments. Also, we give thanks to Assistant Professor M. Ouyang at CBR classifier uses a weighted similarity based on rank- U Mass Boston for thorough scientific reviews and valuable comments. order information of distance metrics to retrieve the most similar case from the case-base where the feature Funding weights are optimized from Wilcoxon’s rank sum This work was supported by the Korea University, Sejong, Korea [K1720701, statistics. We conducted numerical experiments to 2017], and also supported by College of Public Policy, Korea University, Sejong, Korea [K1729001, 2018]. validate the performance of RWCBR. As reference methods, two machine learning techniques, k-NN and Availability of data and materials SVM, a statistical method, SCUCC, a classical CBR The dataset for the current study are not publicly available due to the (CLCBR) and two differently weighted CBR, ETCBR and repository policy from the institute at this moment. LWCBR methods were compared in terms of Precision, Recall, F-1 score and G-measure. According to the results, Author’s contributions SVM showed the lowest standard deviation and the high- AMK do the model development, the numerical experiment and manuscript est minimum value for Precision, Recall, but RWCBR writing. The author read and approved the final manuscript. Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 9 of 9 Ethics approval and consent to participate 21. Li H, Sun J. Ranking-order case-based reasoning for financial distress The study protocol was approved by the institutional review board of the prediction. Knowl-Based Syst. 2008;21:868–78. University of Louisville in regard to the collection of the original data, and 22. Garbett NC, Merchant ML, Helm CW, Jenson AB, Klein JB, Chaires JB. informed consent forms were voluntarily signed by the all participants [IRB\# Detection of cervical Cancer biomarker patterns in blood plasma and urine 08.0108, 608.03]. All information about the subjects had been de-identified by differential scanning calorimetry and mass spectrometry. PLoS One. from the stage of data collection, and the de-identified data set was directly 2014;9:e84710. provided by the principal investigator only for the purpose of methodological 23. Fish DJ, Brewood GP, Kim JS, et al. Statistical analysis of plasma development research, and the administrative permission will be required thermograms measured by differential scanning calorimetry. Biophys Chem. to access the raw data from the principal investigator, Dr. Garbett NC as 2005;152:184–90. well as the administrative policy corresponding with James Brown Graham Cancer 24. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, center. The data was the secondary data for this study, and particular procedure Informedness, Markedness & Correlation. J Mach Learn Technol. 2011;2:37–63. of ethics approval for data analysis was not required. 25. Nicolas PR. Scala for machine learning. Birmingham: PACKT Publishing Ltd; 2015. p. 37–63. Competing interests The author declares that she has no competing interests. Publisher’sNote Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Received: 26 June 2017 Accepted: 3 May 2018 References 1. Burke EK, MacCarthy B, Petrovic S, MacCarthy B, Petrovic S, Qu R. Structured cases in case-based reasoning-re-using and adapting cases for time-tabling problems. Knowledge Based Syst. 2000;13:159–65. 2. Althoff KD, Auriol E, Barletta R, Manago M. A review of industrial case-based reasoning tools, an AI perspective report. AI intelligence; 1995. p. 3–4. 3. Aamodt A, Plaza E. Case-based reasoning: foundational issues, methodological variations and system approach. AI Commun. 1994;7:39–59. 4. Garbett NC, Miller JJ, Jenson AB, Chaires JB. Calorimetry outside the box a new window into the plasma proteome. Biophys J. 2008;94:1377–83. 5. Chang WL. A CBR-based Delphi model for quality group decisions. Cybern Syst. 2011;42:402–14. 6. Huang YS, Chiang CC, Shieh JW, Grimson E. Prototype optimization for nearest-neighbor classification. Pattern Recogn. 2002;35:1237–45. 7. Zhao K, Yu X. A case-based reasoning approach on supplier selection in petroleum enterprises. Expert Syst Appl. 2011;38:6839–47. 8. Liang C, Gu D, Bichindaritz I, Li X, Zuo C, Cheng W. Integrating gray system theory and logistic regression into case-base reasoning for safety assessment f thermal power plants. Expert Syst Appl. 2012;39:5154–67. 9. Yan A, Shao H, Guo Z. Weight optimization for case-based reasoning using membrane computing. Inf Sci. 2014;287:109–20. 10. Domingos P. Context-sensitive feature selection for lazy learners. Artif Intell Rev. 1997;11:1–5. 11. Skalac DB. Prototype and feature selection by sampling and random mutation hill climbing algorithms. Proceedings of the 11th international conference on machine learning, vol. 2; 1994. p. 293–301. 12. Ahn H, Kim K. Global optimization of case-based reasoning for breast cytology diagnosis. Expert Syst Appl. 2009;36:724–34. 13. Gu DX, Liang CY, Li XG, Yang SL, Zhang P. Intelligent technique for knowledge reuse of dental medical records based on Case-Base reasoning. J Med Syst. 2010;34:213–22. 14. Cardie C, Howe N. Improving minority class prediction using case-specific feature weights. Proceedings of the 14th international conference on machine learning; 1997. p. 57–65. 15. Cha SH. Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Model Methods Appl Sci. 2007;1:300–7. 16. Chen Y, Garcia EK, Gupta MR, Rahimi A, Cazzanti L. Similarity-based classification: concepts and algorithms. J Mach Learn Res. 2009;20:747–76. 17. Jain AK, Murty MN, Flynn PJ. Data clustering: a review ACM computing survey. ACM Comput Surv. 1999;31:264–323. 18. Shirkhorshidi AS, Aghabozorgi S, Wah TY. Comparision study on similarity and dissimilarity measures in clustering continuous data. PLoS One. 2015;10: e0144059. 19. Khalifa A. Al., Haranczyk, M., Holliday, J. Comparison of nonbinary similarity coefficients for similarity searching, clustering and compound selection. J Chem Inf Model. 2009;49:1193–201. 20. Alt H, Godau M. Computing the Fre’chet distance between two polygonal curves. Int J Comput Geom Appl. 1995;5:75–91. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Medical Informatics and Decision Making Springer Journals

A rank weighted classification for plasma proteomic profiles based on case-based reasoning

Free
9 pages

Loading next page...
 
/lp/springer_journal/a-rank-weighted-classification-for-plasma-proteomic-profiles-based-on-01kDVoXQn0
Publisher
BioMed Central
Copyright
Copyright © 2018 by The Author(s).
Subject
Medicine & Public Health; Health Informatics; Information Systems and Communication Service; Management of Computing and Information Systems
eISSN
1472-6947
D.O.I.
10.1186/s12911-018-0610-1
Publisher site
See Article on Publisher Site

Abstract

Background: It is a challenge to precisely classify plasma proteomic profiles into their clinical status based solely on their patterns even though distinct patterns of plasma proteomic profiles are regarded as potential to be a biomarker because the profiles have large within-subject variances. Methods: The present study proposes a rank-based weighted CBR classifier (RWCBR). We hypothesized that a CBR classifier is advantageous when individual patterns are specific and do not follow the general patterns like proteomic profiles, and robust feature weights can enhance the performance of the CBR classifier. To validate RWCBR, we conducted numerical experiments, which predict the clinical status of the 70 subjects using plasma proteomic profiles by comparing the performances to previous approaches. Results: According to the numerical experiment, SVM maintained the highest minimum values of Precision and Recall, but RWCBR showed highest average value in all information indices, and it maintained the smallest standard deviation in F-1 score and G-measure. Conclusions: RWCBR approach showed potential as a robust classifier in predicting the clinical status of the subjects for plasma proteomic profiles. Keywords: Case-based reasoning, Plasma proteomic profiles, Classification, Rank Background CBR has less risk of overfitting in prediction, and medical Case-based reasoning (CBR) is an artificial intelligent cases can’t be often explained by general patterns of the approach based on an inference technique that is said to case-base. It is important to classify the plasma proteomic be the most effective method to construct an expert profiles solely depending on their shapes because their system [1]. When a target case occurs, CBR is mainly distinct patterns of profiles are regarded as a potential bio- performed according to the following four procedures: marker according to clinical status [4]. However, plasma retrieving, reusing, revising and retaining [2, 3]. It solves a proteomic profiles may be a typical example not following target problem by revising the solution with the previous the general patterns which lead to poor accuracies in cases in similar situations retrieved from the case-base, prediction by classification methods based on overall and the target case is retained in the case-base for the next means of similarity due to large within-subject variance, problem once the problem is solved. Thus, up-to-date and there is no gold standard to analyze the plasma case-base is always maintained in CBR system. The CBR proteomic profiles yet. The present study conducts a CBR system has been applied in many learning or problem- based classification method with the plasma proteomic solving techniques of real-world applications. In particular, profiles which does not make decision for classification the prediction techniques based on CBR can be more depending on the overall mean. However, CBR often also appropriate in bio-medical field than other fields because shows lower prediction performance compared to other learning techniques. Previous studies proposed some methods to improve the performance of CBR. Those stud- Correspondence: amykwon@korea.ac.kr ies were primarily focused on either weight optimization Big Data Science, Division of Economics & Statistics, College of Public Policy, Korea University, Sejong, Korea © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 2 of 9 methods [5–9] or feature (or subset) selection methods the proposed method using Rank, and section 4 describes [10, 11], and one study proposed a hybrid generic our data schemes and empirical results. At last, the con- approach to optimize the both with the number of neigh- clusion and further research are discussed in section 5. bor cases to compute in the case retrieval procedure of CBR [12]. The meaningful set of features is often predeter- Methods mined by experts in bio-medical fields, and the most simi- The CBR classifier with plasma proteomic profiles lar case may result in the best accuracy in prediction The general problem-solving process by the CBR classifier when output values of each feature are wide-spread like is described in Fig. 1. The CBR classifier describes a target plasma proteomic profiles. If that is the case, a proper problem using old experiences, and finds a solution of the weight optimization may only enhance the prediction problem by retrieving similar cases stored in the case-base performance of CBR. The weights are optimized either to the target problem where the case-base is the specific subjectively or objectively. Subject weights are typically knowledge base of past experiences. The case is typically allocated according to the preference scores or informa- retrieved by learning techniques for the CBR classifier, tion of experts such as Delphi method [5]. Objective and the most common technique is k-nearest neighbor weights can be allocated by entropy method [7], statistical (NN). The original CBR classifier uses 1-NN which method [8] or they can be optimized while proceeding retrieves the most similar case from the case-base to the algorithms such as generic algorithm (GA) [12]orneural- target problem. The problem is adapted from the retrieved net (NN) [6]. Among these approaches, NN needs a large cases, and is revised. Once the problem is solved, the cases number of inter-connected neurons to allocate weights, so are retained. The CBR classifier with plasma proteomic small or moderate-size samples may not attain a standard profiles maintained the same scheme. The problem is to structure of NN [9]. GA is also criticized due to premature identify the class of a target case by comparing the pattern convergence or low reliability [9]. A weight by a statistical of the target case with those in the case-base where the method was allocated by the proportion of Wald’s statis- case-base consists of trained samples with their class- tics [8] which is obtained by assuming asymptotically nor- labels. The case is retrieved from the case-base by k-near- mal distributions of parameters. The present study est neighbor (NN) to solve the problem, and the target proposes a non-parametric weight allocation method case as well as the retrieved case are stored in the case- without using normality assumption. We investigate the base once the class of the target case is determined. accuracy of a CBR based classification with plasma prote- omic profiles to diagnose cervical cancer, and observe the Prior studies for weight optimization enhancement of the prediction performance of the CBR The original classifier assesses the similarity of a target classifier by allocating feature weights. To validate our ap- case with cases in the case-base under the assumption proach, we also conduct previous weight allocation that all features are equally likely important. However, it methods for the CBR classifier together with the plasma may be practical to think of the relative importance proteomic profiles. The paper is organized as follows. among the features, so some researches differently After introduction, section 2 briefly describes the CBR allocated the weights on features considering the relative system and reviews previous studies. Section 3 presents importance. Since different weights for the attributes Fig. 1 Problem solving process of CBR Classifier. The CBR classifier finds the solution of the problem by retrieving similar cases stored in the case-base to the target problem as described Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 3 of 9 can vary the distribution of the overall similarities Table 1 Prior studies about the weight optimization methods among the cases, the retrieved cases by the CBR Authors Year Methods Weights classifier can be different depending on the altered Cardie & Howe [14] (1997) Information gain G(f) distribution of the similarities. Regarding that matter, Ahn & Kim [12] (2009) Relative importance [0-7] the weight allocation or optimization is closely related with the performance of the CBR classifier. f ¼1 In particular, the weight allocation or optimization Gu et al. [13] (2010) Delphi method – techniques have gained attention as a way to Chang et al. [5] (2011) Delphi method – enhance the performance of the CBR classifier in entropy Zhao et al. [7] (2011) Entropy method previous studies. entropy DELPHI method is one of the most common f ¼1 approaches to allocate feature weights to the CBR Wald Liang et al. [8] (2012) Logistic regression classifier. DELPHI method directly reflects experts’ opin- Wald ions about the features as the corresponding weights like f ¼1 Gu et al. [13] or Chang et al. [5], so the weights can be indicates information gain of the f-th feature, and entropy is defined as − p ∙ log p changed by the point of view of the subjects. Alterna- i 2 i tively, weights have been objectively allocated using information gain or entropy. Cardie and Howe [14] first in all methods in their studies [18, 19]. Most classifiers selected a set of relevant features using a decision tree, try to use a distance metric that keeps data points close and assigned the weights with information gain to the if the class labels are the same while keeps distance from feature which was chosen by the tree. Ahn and Kim [12] the data points if the class labels are different. The goal encoded feature weights with numbers from 0 to 7 of the CBR classifier is to predict a class label of a target which represented the relative importance of the case of x by retrieving the most similar case from the features. These numbers were processed as 3-bit binary case-base using the proper distance metric. Let χ ¼f x numbers and transformed into floating decimal numbers ; ⋯; x g be a collection of n data points in the case- (x ) for weights. Zhao et al. [7] used information entropy base with the known class labels of C ={c , ⋯, c } where 1 n for feature weights to select suppliers. They computed ! m x ∈R and c ∈ {1, ⋯, K}. The CBR classifier typically the average regression coefficients to seek the integrated adapts the k-NN approach to retrieve the similar cases average index of each supplier, and calculated both the to the target case with a given k.The k-NN approach as- information gain in ID3 of the decision tree and the sumes that the class conditional probability in the near- entropy. These values were later standardized as the ! ! est neighbors to x ; Nð x Þ , is constant, and tries to 0 0 numbers in the range of [0,1] for weights. Besides, Liang maintain consistency in predicting class labels for x by et al. [8] optimized feature weights by a statistical obtaining its neighborhood as follows where I(·) is an in- approach. They fitted features with binary logistic dicator function. regression, and computed the Wald statistics of param- eter estimates for the features. Then, the statistics are P ! ! Ix ∈Nx IcðÞ ¼ j i 0 i ! i¼1 standardized by dividing them by the sum of all the pj x ¼ P ð1Þ ! ! Ix ∈Nx statistics before they are allocated to the features as the i 0 i¼1 weights. Suitable weights may vary depending upon problems we encountered. Prior studies about the The global distance between the target case and any weight optimization or allocation methods are summa- case in the case-base is computed by summing up the rized in Table 1. local distances to determine the nearest neighbors for the target case on Eq. (1). The local distance is Rank-based weight optimization computed for each feature between the target case and Distance functions and problem setting any case in the case-base by the pre-defined local A typical similarity or dissimilarity measure is a distance distance metric, and the types of local distance metric metric, and it is crucial to learn a good distance metric do not have to be the same among the features. Euclid- to represent the similarity or dissimilarity in feature ean distance metric is typically used to compute the space although there are considerable researches on physical distance between the two data points, but it distance metrics [15–17]. Some researches have been suffers in the case that vectors of data points aren’t focused on the comparison of their impacts on the linearly distributed like default measurements of performance in classification with known public proteomic profiles. On the contrary, Fre’chet distance database [18, 19]. However, no single similarity or metric is known to be useful to measure the distance dissimilarity showed dominantly superior to the others between the data points when the vectors of those data Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 4 of 9 points lie on the non-linear curve [20]. According to the similarity is regarded as a function of ranks in the characteristics of the feature types, the distance metric present study because the similarity is computed consists of either Euclidean distance metric or Fre’chet according to the corresponding rank-order informa- distance metric for the present study. Euclidean distance tion of the distances for features between the target metric and Fre’chet distance metric are defined for the case and any case in the case-base. Thus, the weights ð f Þ ð f Þ ! ! can be naturally allocated to the features in the feature, f, as follows where x and x are sub- i 0 similarity measure maintaining the same property vectors of any data point in the case-base and the target from the objective function based on Wilcoxon’srank case consisting of the feature, f, respectively. sum statistics. The objective function for the present ðÞ f ðÞ f ðÞ f ðÞ f ðÞ f ðÞ f study can be summarized as follows where n is the ! ! ! ! ! ! d x ; x ¼ x − x x − x ð2Þ i 0 i 0 i 0 number of cases having the class label of 1 when the class labels are denoted as either 0 or 1 and the ðÞ f ðÞ f ðÞ f ðÞ f ! ! ! 0 ! 0 d x ; x ¼ inf max dx ðÞ αðÞ t ; x ðÞ βðÞ t ð3Þ i 0 i 0 α;β t ∈½ 0;1 number of classes, J, is set to 2. arg max ω  r ω : f ¼1;⋯;m f f f ¼1 Conversion to rank-order information X 1 ð f Þ ð0Þ ! ! where r ¼ rank d x ; x f f i 0 i¼1 Plasma proteomic profiles have the large within-subject variance. Although the class labels are the same, the n ðn þ 1Þ 1 1 profiles can be distributed over a considerable extent as well as they are not following the general pattern. We 0≤ω ≤1 constraint to determined the proximity of the cases using the global m ω ¼ 1 f ¼1 similarity based on rank-order information of the distances [21] instead of using the distance itself to ð5Þ enhance robustness in predicting the class label of the On Eq. (5), as the probability increases that the two target case in the present study. The similarity is com- groups of the cases are truly drawn from the population- puted as follows where N′ is the number of cases having cases having the different class labels, the corresponding a unique ranking-order in the case-base and ω is an feature weight of ω becomes large because the resulted unknown weight for a feature of f. statistics, W, is large. The significance of the test statis- 2 3 f f 0 ! ! tics is directly represented by the magnitude of the cor- N − rank d x ; x X f i 0 ! ! 4 5 Sx ; x ¼ ωf  ð4Þ responding p-value, so the feature weights can be i 0 N −1 f ¼1 computed using the magnitudes of p-values from the test statistics as follows. According to Eq. (4), the higher the rank, the greater the similarity between the i-th case and the target case. 1−pðjW j≥r Þ ω ¼ ð6Þ f m 1−pðjW j≥r Þ f ¼1 Weight optimization Every feature is equally likely important to the where W denotes the test statistics of the Wilcoxon’s original CBR classifier. Since the original CBR rank sum test. The feature weights from Eq. (6) are used classifier often showed lower predictability, there have to compute the similarity of Eq. (4). been some researches to improve the predictability by assigning different weights to the features according Application and experiments to their relative importance. In the same line of thoughts, we adopted the different weights to the Data description The proteomic profiles were obtained feature in calculating the similarity, and optimized the from the blood plasma samples which were collected weights according to the objective function from from recruited subjects at the University of Louisville, Wilcoxon’s rank sum test statistics. The ability of the KY, USA. Total 70 female subjects were recruited for objective function is mainly influenced by the feature this study, and 50% of those subjects were diagnosed weights, and the weights are determined to maximize with cervical carcinoma while the others are healthy the ability of the objective function to differentiate controls, without any known diseases. The study proto- the cases having different class labels. Wilcoxon’srank col was approved by the institutional review board of the sum test is a non-parametric test to assess the University of Louisville, and informed consent forms difference of the mean ranks for two samples, and it were voluntarily signed by the participants. The origin of is known to be useful when outliers exist in the the data can be referred to [22], and the secondary data observations compared to the parametric tests. The was used for the study. The default output measurement Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 5 of 9 of the proteomic profiles was the excess heat capacity The statistical model was introduced to show the dif- (ΔC ), which were recorded at the different tempera- ference of the plasma proteomic profiles between two tures from 45 to 90 °C by incrementally adding 1 °C to groups having different clinical status using a composite the previous measuring temperature. The proteomic coefficient which was a weighted product of an average profiles were preprocessed prior to the analysis. The probability being in the same group of the reference excess heat capacity (ΔC ) as the default measurement is sample and Pearson’s correlation coefficient. In the a vector of real numbers of length 451 and it typically present study, this model was conducted with the default shows one or two peaks on the range of temperatures setting of the composite coefficient as in the literature during the experiment. We newly extracted 5 features [23]. Namely, the reference set was composed with the from the pre-processed data besides the excess heat cases in the ‘control’ class for this model, and the weight capacity. The feature information is summarized in factor of the composite coefficient was set to 1 as Table 2. The class information for each proteomic described in the literature. This classification model is profile was labeled as either ‘control’ or ‘cancer’ ac- abbreviated as (SCUCC) indicating statistically classified cording to the clinical status of the corresponding sub- using the composite coefficient for the experiments as a ject. On Table 2, PEAK1 and PEAK2 indicate those reference method. Among the CBR approaches, the first peaks, and T1 and T2 are temperatures that those model is a classical CBR approach (CLCBR), which gives peaks occur at where {PEAK1, PEAK2, T1, T2} were attributes equal-weights and uses 1-NN for the case re- estimated by Gaussian kernel regression from the ex- trieval. This model would be the base model to examine cess heat capacity patterns. IND indicates a set of 451 the effect of the CBR classifiers having different weights individual measurement of the excess heat capacity, on features. ETCBR and LWCBR are the weighted CBR and IR is a binary value indicating the initial direc- approaches. The feature weights were allocated with tional tendency of the excess heat capacity as the standardized entropy value in ETCBR [7]. In the present temperature increases. IR is 1 if the directional ten- study, IR is Berno’ulli and the other features are dency is positive, 0 otherwise. assumed as normal distribution. The computed entropy was standardized by dividing each entropy by the sum of Numerical experiments The purpose of the numerical all entropy values prior to allocation, and 1-NN was experiments is to study the performance of the CBR used for the case retrieval. LWCBR indicates a weighted classifier in prediction with the plasma proteomic pro- CBR approach from logistic regression model. This files by comparing with the previous approaches. In par- model adopted standardized Wald statistics of the re- ticular, we observed whether the rank-based feature gression coefficients for feature weights by fitting the weights enhance the performance of the CBR classifier observation with binary logistic regression. Namely, with proteomic profiles or not. As reference methods, the Wald statistics were divided by the sum of the all two common machine learning methods, k-NN (k -NN) statistics before they were allocated to the features and support vector machine (SVM), a statistical ap- [8], and also used 1-NN for the case retrieval. Logis- proach using the composite coefficient [23], and three tic regression is a typical parametric approach and CBR approaches weighted by different allocation Wald statistics are derived from the regression coeffi- methods in previous studies [7, 8] were conducted to cients under the asymptotic normal assumption, so validate the performance of the proposed CBR approach this model can be a good reference to observe the for the present study. The number of neighbors for k performance of the proposed feature weights. The -NN was 5 which was determined by cross-validation proposed CBR approach is abbreviated as RWCBR (CV) with training samples, and SVM was conducted indicating a rank-weighted CBR approach. As based on the radial basis kernel. described in the above sections, the feature weights Table 2 Description of the features Features (Abbreviation) Type Contents Initial Response (IR) Binary number 0: decreasing 1: increasing Temperature 1 (T1) Real number Range [45 - 55] Temperature 2 (T2) Real number Range [56 - 90] Maximum Peak at T1 (PEAK1) Real number Range [0 - ∞] Maximum Peak at T2 (PEAK2) Real number Range [0 - ∞] A set of individual ΔC (IND) A vector of real numbers Range [0 - ∞] p Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 6 of 9 were computed from Wilcoxon’s rank sum test and true positive Precesion ¼ P the most similar case was retrieved as the other CBR ðÞ true positive þ false positive approaches. true positive Recall ¼ P The data set of proteomic profiles consists of the pre- ðÞ true positive þ false negative defined features on Table 2 with 70 subjects, and the precision  recall F‐1 score ¼ 2 class labels are fully given with the number of classes as ðÞ precision þ recall pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi two. The data set was randomly partitioned into five G‐measure ¼ precision  recall equal-sized subsets for 5-fold CV. At each fold, a subset was selected as a test set, and the other four subsets be- came a training set where the proportion of the cases With the retrieved case for each target case, the class were equally distributed from the two classes during the label of the target case was predicted according to Eq. (1), experiments. The feature weights for the ETCBR, and the prediction results were used to compute the LWCBR and RWCBR were estimated with the training information indices at each fold according to the above set, and the optimized weights were allocated to the fea- definitions. tures in the test set. The resulting information indices at each fold are summarized on Table 4, and the comprehensive statistics using the minimum (MIN), average, (AVG), standard Results deviation (STD) and the maximum (MAX) for each Each fold has the same size of cases in the test set index are summarized on Table 5. Among the models, and the training set as 14 and 56, respectively during RWCBR and SVM consistently showed good perfor- the experiments. The estimated feature weights by mances over different sample sets in predicting the class ETCBR, LWCBR and RWCBR at each fold are sum- labels with plasma proteomic profiles in comparison marized in Table 3. The performances of the seven with others, but the performance of RWCBR was slightly different models were evaluated at each fold in terms better. In case of Precision and Recall indices, LWCBR of Precision, Recall, F-1 score, [24, 25]and G- had the biggest range from the minimum of 33% to the measure [25]. The estimated weights in Table 3 were maximum of 80% and from the minimum of 29% to the allocated to the features when the CBR classifiers re- maximum of 100%, respectively while SVM had the trieved the most similar case from the case-base by smallest range from the minimum of 85.7% to the max- ETCBR, LWCBR and RWCBR. The information indi- imum of 100% in both indices. However, RWCBR ces of Precision, Recall, F-1 score and G-measure are showed the highest average value of 91% among the all defined as follows. models in both Precision and Recall, and the perform- ance was maintained at least 71%. ETCBR and RWCBR showed better performance than CLCBR in both Preci- sion and Recall, but LWCBR worked poor in compari- son with CLCBR by showing lower mean values in all Table 3 Estimated feature weights indices although we generally expected a weighted CBR Fold Model IR PEAK1 PEAK2 T1 T2 IND approach to perform better than CLCBR. Comparing I ETCBR 0.0067 0.0145 0.0347 0.0595 0.0220 0.8625 CLCBR to SCUCC, a statistical approach, average preci- LWCBR 0.0158 0.1145 0.3845 0.0960 0.0179 0.3713 sion index of CLCBC was lower, but average recall index RWCBR 0.1704 0.1890 0.1488 0.1886 0.1596 0.1436 was higher. F-1 score and G-measure were similar II ETCBR 0.0089 0.0311 0.0477 0.0766 0.0314 0.8042 between the two, so it appears that CBR approaches do LWCBR 0.0877 0.0752 0.2550 0.0016 0.3244 0.2561 not always work well with plasma proteomic profiles. Regarding F1-score and G-measure, SVM also main- RWCBR 0.2190 0.2222 0.0954 0.2203 0.1325 0.1106 tained the shortest ranges, but RWCBR showed the best III ETCBR 0.0068 0.0265 0.0391 0.0654 0.0243 0.8377 performance at most aspects of summary statistics LWCBR 0.0175 0.1780 0.2481 0.1188 0.1584 0.2793 among the seven models. In particular, it maintained the RWCBR 0.1274 0.2603 0.1144 0.2837 0.1092 0.1051 smallest standard deviations in comparison with the IV ETCBR 0.0076 0.0259 0.0383 0.0643 0.0247 0.8392 other models. LWCBR 0.0001 0.3524 0.0682 0.4252 0.1484 0.0056 The retrieved cases for each target case from the test set by RWCBR model at the first fold are displayed on RWCBR 0.1627 0.2266 0.1279 0.2206 0.1233 0.1388 Fig. 2. The black solid lines represent the 14 target cases V ETCBR 0.0065 0.0223 0.0345 0.0581 0.0213 0.8573 from the test set, and the red solid lines are the most LWCBR 0.0250 0.0036 0.1628 0.2751 0.3219 0.2116 similar cases retrieved from the case-bases according to RWCBR 0.2205 0.2186 0.1024 0.2334 0.1076 0.1175 the similarity measure of Eq. (4). Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 7 of 9 Table 4 Information indices by 5-Fold CV Fold Measures K-NN SVM SCUCC CLCBR ETCBR LWCBR RWCBR I Precision 0.5000 0.8571 1.0000 0.6667 1.0000 0.8000 1.0000 Recall 0.7143 0.8571 0.2857 0.5714 0.7143 0.5714 0.7143 F1-score 0.5882 0.8571 0.4444 0.6154 0.8333 0.6667 0.8333 G-measure 0.5976 0.8571 0.5345 0.6171 0.8452 0.6761 0.8452 II Precision 0.8571 0.8571 1.0000 0.8000 0.7500 0.3333 1.0000 Recall 0.8571 0.8571 0.8571 0.5714 0.8571 0.2857 0.8571 F1-score 0.8571 0.8571 0.9231 0.6667 0.8000 0.3077 0.9231 G-measure 0.8571 0.8571 0.9258 0.6761 0.8018 0.3086 0.9258 III Precision 0.6000 0.8571 0.8333 0.7500 0.8333 0.7778 0.7778 Recall 0.8571 0.8571 0.7143 0.8571 0.7143 1.0000 1.0000 F1-score 0.7059 0.8571 0.7692 0.8000 0.7692 0.8750 0.8750 G-measure 0.7171 0.8571 0.7715 0.8018 0.7715 0.8819 0.8819 IV Precision 0.7500 1.0000 1.0000 1.0000 0.8750 0.7778 1.0000 Recall 0.8571 1.0000 0.5714 1.0000 1.0000 1.0000 1.0000 F1-score 0.8000 1.0000 0.7273 1.0000 0.9333 0.8750 1.0000 G-measure 0.8018 1.0000 0.7559 1.0000 0.9354 0.8819 1.0000 V Precision 0.5455 0.8571 0.7000 0.6000 1.0000 0.6000 0.7778 Recall 0.8571 0.8571 1.0000 0.8571 1.0000 0.8571 1.0000 F1-score 0.6667 0.8571 0.8235 0.7059 1.0000 0.7059 0.8750 G-measure 0.6838 0.8571 0.8366 0.7171 1.0000 0.7171 0.8819 Table 5 Comprehensive statistics for information indices Measures Statistics K-NN SVM SCUCC CLCBR ETCBR LWCBR RWCBR Precision MIN 0.5000 0.8571 0.7000 0.6000 0.7500 0.3333 0.7778 AVG 0.6506 0.8857 0.9067 0.7633 0.8917 0.6578 0.9111 STD 0.1490 0.0639 0.1362 0.1529 0.1087 0.1985 0.1217 MAX 0.8571 1.0000 1.0000 1.0000 1.0000 0.8000 1.0000 Recall MIN 0.7143 0.8571 0.2857 0.5714 0.7143 0.2857 0.7143 AVG 0.8285 0.8857 0.6857 0.7714 0.8571 0.7428 0.9143 STD 0.0639 0.0639 0.2748 0.1917 0.1429 0.3097 0.1278 MAX 0.8571 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 F1-score MIN 0.5882 0.8571 0.4444 0.6154 0.7692 0.3077 0.8333 AVG 0.7236 0.8857 0.7375 0.7576 0.8672 0.6861 0.9013 STD 0.1067 0.0639 0.1795 0.1514 0.0965 0.2320 0.0637 MAX 0.8571 1.0000 0.9231 1.0000 1.0000 0.8750 1.0000 G-measure MIN 0.5976 0.8571 0.5345 0.6171 0.7715 0.3086 0.8452 AVG 0.7221 0.8857 0.7649 0.7624 0.8708 0.6931 0.9070 STD 0.1088 0.0639 0.1451 0.1488 0.0951 0.2345 0.0593 MAX 0.8571 1.0000 0.9258 1.0000 1.0000 0.8819 1.0000 Note: MIN the minimum, AVG average, STD standard deviation, MAX the maximum Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 8 of 9 Fig. 2 The retrieved cases by RWCBR at Fold 1. The figure illustrates 14 cases of proteomic profiles at fold 1. Black-solid lines represent target cases, and red-solid lines represent the solutions optimized by RWCBR Conclusion and Discussion outperform in average value in all information indices, Plasma proteomic profiles have been regarded as a and it maintained the lowest standard deviation in F-1 potential biomarker to diagnose certain diseases score and G-measure. Also, LWCBR showed lower per- according to their specific patterns. It is challenging to formance than CLCBR in most information indices. A precisely predict the clinical status based solely on the weighted CBR approaches do not always perform well, so patterns of profiles because some profiles do not the weight allocation or optimization methods should take frequently follow the general patterns, which leads to into accounts the characteristics of the data set to enhance large within-subject variance. The prediction based on the performance of CBR classifier. CBR based approaches may be effective in that case. The The sample size of the plasma proteomic profiles was CBR classifier predicts the clinical status of a target case small in the present study. However, RWCBR approach by retrieving the most similar case from the case-base, showed potential to predict the clinical status based so it would be advantageous in prediction because it can solely on plasma proteomic profiles as a robust classifier avoid the risk to make decision according to deviated over different sample sets in the present study. overall means due to the outlying pattern. However, CBR classifier often shows low predictability, and some Abbreviations AVG: Average; CBR: Case-based reasoning; CLCBR: Classical CBR; CV: Cross- studies made efforts to enhance the predictability using Validation; ETCBR: Entropy-based CBR; k-NN: k-Nearest neighbor; weight optimization for features. There is still no golden LWCBR: Logistic regression-based Weighted CBR; MAX: The maximum; standard to optimize or allocate the feature weights, MIN: The minimum; RWCBR: Rank-based Weighted CBR; SCUCC: Statistically classified using composite coefficient; STD: Standard deviation; SVM: Support which can be dependent upon the characteristics of the vector machine data we encounter. The present study suggests a rank-based weighted Acknowledgements CBR classifier (RWCBR) to predict the clinical status of We thank Assistant Professor NC Garbett at James Graham Brown Cancer plasma proteomic profiles. The rank-based weighted Center in the University of Louisville for providing us with the data set and valuable comments. Also, we give thanks to Assistant Professor M. Ouyang at CBR classifier uses a weighted similarity based on rank- U Mass Boston for thorough scientific reviews and valuable comments. order information of distance metrics to retrieve the most similar case from the case-base where the feature Funding weights are optimized from Wilcoxon’s rank sum This work was supported by the Korea University, Sejong, Korea [K1720701, statistics. We conducted numerical experiments to 2017], and also supported by College of Public Policy, Korea University, Sejong, Korea [K1729001, 2018]. validate the performance of RWCBR. As reference methods, two machine learning techniques, k-NN and Availability of data and materials SVM, a statistical method, SCUCC, a classical CBR The dataset for the current study are not publicly available due to the (CLCBR) and two differently weighted CBR, ETCBR and repository policy from the institute at this moment. LWCBR methods were compared in terms of Precision, Recall, F-1 score and G-measure. According to the results, Author’s contributions SVM showed the lowest standard deviation and the high- AMK do the model development, the numerical experiment and manuscript est minimum value for Precision, Recall, but RWCBR writing. The author read and approved the final manuscript. Kwon BMC Medical Informatics and Decision Making (2018) 18:34 Page 9 of 9 Ethics approval and consent to participate 21. Li H, Sun J. Ranking-order case-based reasoning for financial distress The study protocol was approved by the institutional review board of the prediction. Knowl-Based Syst. 2008;21:868–78. University of Louisville in regard to the collection of the original data, and 22. Garbett NC, Merchant ML, Helm CW, Jenson AB, Klein JB, Chaires JB. informed consent forms were voluntarily signed by the all participants [IRB\# Detection of cervical Cancer biomarker patterns in blood plasma and urine 08.0108, 608.03]. All information about the subjects had been de-identified by differential scanning calorimetry and mass spectrometry. PLoS One. from the stage of data collection, and the de-identified data set was directly 2014;9:e84710. provided by the principal investigator only for the purpose of methodological 23. Fish DJ, Brewood GP, Kim JS, et al. Statistical analysis of plasma development research, and the administrative permission will be required thermograms measured by differential scanning calorimetry. Biophys Chem. to access the raw data from the principal investigator, Dr. Garbett NC as 2005;152:184–90. well as the administrative policy corresponding with James Brown Graham Cancer 24. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, center. The data was the secondary data for this study, and particular procedure Informedness, Markedness & Correlation. J Mach Learn Technol. 2011;2:37–63. of ethics approval for data analysis was not required. 25. Nicolas PR. Scala for machine learning. Birmingham: PACKT Publishing Ltd; 2015. p. 37–63. Competing interests The author declares that she has no competing interests. Publisher’sNote Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Received: 26 June 2017 Accepted: 3 May 2018 References 1. Burke EK, MacCarthy B, Petrovic S, MacCarthy B, Petrovic S, Qu R. Structured cases in case-based reasoning-re-using and adapting cases for time-tabling problems. Knowledge Based Syst. 2000;13:159–65. 2. Althoff KD, Auriol E, Barletta R, Manago M. A review of industrial case-based reasoning tools, an AI perspective report. AI intelligence; 1995. p. 3–4. 3. Aamodt A, Plaza E. Case-based reasoning: foundational issues, methodological variations and system approach. AI Commun. 1994;7:39–59. 4. Garbett NC, Miller JJ, Jenson AB, Chaires JB. Calorimetry outside the box a new window into the plasma proteome. Biophys J. 2008;94:1377–83. 5. Chang WL. A CBR-based Delphi model for quality group decisions. Cybern Syst. 2011;42:402–14. 6. Huang YS, Chiang CC, Shieh JW, Grimson E. Prototype optimization for nearest-neighbor classification. Pattern Recogn. 2002;35:1237–45. 7. Zhao K, Yu X. A case-based reasoning approach on supplier selection in petroleum enterprises. Expert Syst Appl. 2011;38:6839–47. 8. Liang C, Gu D, Bichindaritz I, Li X, Zuo C, Cheng W. Integrating gray system theory and logistic regression into case-base reasoning for safety assessment f thermal power plants. Expert Syst Appl. 2012;39:5154–67. 9. Yan A, Shao H, Guo Z. Weight optimization for case-based reasoning using membrane computing. Inf Sci. 2014;287:109–20. 10. Domingos P. Context-sensitive feature selection for lazy learners. Artif Intell Rev. 1997;11:1–5. 11. Skalac DB. Prototype and feature selection by sampling and random mutation hill climbing algorithms. Proceedings of the 11th international conference on machine learning, vol. 2; 1994. p. 293–301. 12. Ahn H, Kim K. Global optimization of case-based reasoning for breast cytology diagnosis. Expert Syst Appl. 2009;36:724–34. 13. Gu DX, Liang CY, Li XG, Yang SL, Zhang P. Intelligent technique for knowledge reuse of dental medical records based on Case-Base reasoning. J Med Syst. 2010;34:213–22. 14. Cardie C, Howe N. Improving minority class prediction using case-specific feature weights. Proceedings of the 14th international conference on machine learning; 1997. p. 57–65. 15. Cha SH. Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Model Methods Appl Sci. 2007;1:300–7. 16. Chen Y, Garcia EK, Gupta MR, Rahimi A, Cazzanti L. Similarity-based classification: concepts and algorithms. J Mach Learn Res. 2009;20:747–76. 17. Jain AK, Murty MN, Flynn PJ. Data clustering: a review ACM computing survey. ACM Comput Surv. 1999;31:264–323. 18. Shirkhorshidi AS, Aghabozorgi S, Wah TY. Comparision study on similarity and dissimilarity measures in clustering continuous data. PLoS One. 2015;10: e0144059. 19. Khalifa A. Al., Haranczyk, M., Holliday, J. Comparison of nonbinary similarity coefficients for similarity searching, clustering and compound selection. J Chem Inf Model. 2009;49:1193–201. 20. Alt H, Godau M. Computing the Fre’chet distance between two polygonal curves. Int J Comput Geom Appl. 1995;5:75–91.

Journal

BMC Medical Informatics and Decision MakingSpringer Journals

Published: May 31, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off