TY - JOUR AU - Pravin, A AB - Abstract Medical data clustering is an important part of medical decision systems, as it refines highly sensitive information from huge medical datasets. Medical data clustering includes processes such as determining initial clusters, assigning data to specified clusters and handling data clusters dynamically. Hence, handling and clustering medical data streams remains a challenging issue. This paper proposes a technique, namely Rider-based sunflower optimization (RSFO), for medical data clustering. Initially, the significant features are selected from the input data using the Tversky index with holoentropy. The holoentropy is utilized to analyze the relationship between the attributes and features. Here, the clustering is done by a Black Hole Entropic Fuzzy Clustering (BHEFC) algorithm, wherein the optimal cluster centroids are selected by the proposed RSFO algorithm. The proposed RSFO is designed by incorporating the Rider optimization algorithm (ROA) and sunflower optimization (SFO). The effectiveness of the proposed BHEFC+RSFO algorithm is analyzed on the Dermatology Data Set, where the proposed method attains a maximal accuracy of 94.480%, a Jaccard coefficient of 94.224% and a Rand coefficient of 91.307%. 1 INTRODUCTION Nowadays, numerous industrial data mining clustering methodologies have been designed, and their use is rising enormously. Researchers endeavour to design fast and effective techniques for abstracting spatial medical data sets. The huge volume of hidden data in massive databases has produced considerable interest in data mining. Huge amounts of data are accumulated from medical records, remote sensing and geographic information systems. Thus, a vast amount of data is generated in day-to-day life that needs proper analysis [1]. Medical data are usually high dimensional.
In medical data processing, the selection of optimal features is an important task for enhancing the value of the model built from the selected data [2]. The major constituents of predictive accuracy are the quality and quantity of the data. However, data in medicine are accumulated as a result of patient-care activity to benefit the patient, and research is a secondary consideration. Accordingly, the medical database consists of numerous features, which creates issues for data mining methods and tools. The medical database [39,40] contains a huge amount of heterogeneous data with heterogeneous data fields. However, the heterogeneity of the data [3] complicates the data mining tools. Moreover, massive medical databases comprise missing values, which should be addressed prior to the use of data mining tools. The medical database consists of data that are imprecise, redundant, inconsistent or incomplete, which can influence the results of data mining tools [4]. Data clustering is a significant method in the data-mining domain wherein the data are placed in homogeneous groups based on similarity. In contrast to static clustering, medical data stream clustering is a complex research area whose goal is to extract hidden information from the incessant flow of medical data. For realistic applications, medical data stream clustering can be an extremely challenging task due to unlimited incoming data, huge size and high frequencies. Many data clustering methodologies have been devised in the literature [5]. Thus, to manage inconsistent biomedical data, intelligent techniques are devised for machine learning [6] and data mining, which are needed for logical reasoning over the saved raw data. Novel medical equipment is adopted in diagnosis and generates composite and large data.
Thus, to manage the ill-structured biomedical data [7,8], many intelligent techniques for data mining and machine learning are needed for taking decisions from the raw data, which is termed medical data mining [9]. The aim of clustering is to group data when a huge amount of data resides in the database. The huge data are summarized into small groups or classes to ease the analysis. Also, the majority of the data is collected based on natural groupings. Nonetheless, determining the groups to categorize the data is a complex task for humans, except when the data are low dimensional. Thus, data clustering methods in soft computing are devised in the literature for clustering the data with ease [10]. Clustering techniques can be categorized into different groups like hierarchical algorithms, grid-based clustering, partitioning-based clustering and density-based clustering [1]. Furthermore, the clustering methods can be devised in two forms, namely hard and soft. Fuzzy clustering methods contain certain methods like Fuzzy C-Shell clustering algorithms [11], the Mountain method (MM) [12], Gustafson Kessel (GK) and Fuzzy-C-means (FCM) [13]. Numerous hybrid optimization methods are devised to attain global optimality. In [14], particle swarm optimization (PSO)-based clustering algorithms are devised, in which the PSO is utilized for determining the centroids of the clusters and provides the count of cluster centers as an input to k-means for improved clustering. The hybrid methods provide faster convergence than the k-means algorithm [15,16]. The hybrid artificial bee colony (HABC) algorithm for data clustering was devised in [17] to enhance the information exchange between the bees using the crossover operator of the genetic algorithm (GA) [18] along with the artificial bee colony (ABC).
The hybrid method based on the k-means algorithm and the Heuristic Kalman Algorithm (HKA) is devised in [19], which integrates the benefits of the global exploration of HKA and the fast convergence of the k-means algorithm. The Blackhole phenomenon is devised in [20] for data clustering. The plan of the method is based on the gravitational pull of nearby objects towards the Blackhole. Meanwhile, nature-inspired optimization techniques and deep learning classifiers, including neural networks and support vector machines, are devised for data clustering. In [21], a fuzzy least square support vector machine (FLS-SVM) technique is devised for clustering the data, whereas in [22], a Back Propagation Neural Network (BPNN) is employed for clustering the data. Clustering big data is a difficult task for several reasons. In real-time applications, the major issues arise from the large volume, the infinite sequence of incoming data and high frequencies [5]. Also, there is a possibility of information loss, as valuable data are removed [23]. A major issue faced by the beneficiary is that real-world datasets contain complex data forms, time-variant datasets, inconsistent features and missing values, which occur due to data transformation, data selection and data collection [24]. Moreover, high time complexity makes these methods unable to cluster huge amounts of heterogeneous data [25]. To overcome these challenges in medical data clustering, this work proposes a novel clustering approach for clustering the medical data. In this work, the clustering is done by the BHEFC, in which the optimal cluster centroids are selected by the proposed RSFO algorithm. The ROA [26] performs fault diagnosis with improved classification accuracy. Also, this method mitigates the non-trivial issues while producing a new set of individuals. The SFO [27] is capable of determining good locations, which improves the performance.
Thus, the integration of ROA and SFO addresses the computational complexity issues and obtains a better global optimal solution. Also, in the proposed method, the complexity of analyzing data is minimized, because the data are represented as a reduced set of features. Therefore, it requires less memory and time. Moreover, the Markov-Chain Monte Carlo method is utilized to cluster large-volume data. Hence, the proposed medical data clustering overcomes the challenges in the existing methods and renders effective performance results. The research aims to devise a novel data clustering method using the BHEFC [28] algorithm. The proposed model undergoes three steps for medical data clustering, namely pre-processing, feature selection and data clustering. The pre-processing step is carried out for removing the artifacts and redundant data from the documents. Then, the next step is the feature selection using the Tversky index [29] with holoentropy [30], after which clustering is done using the proposed BHEFC-based RSFO algorithm, which is the integration of ROA [26] and SFO [27]. The clustering task is made effective using the proposed RSFO algorithm, which determines the optimal centroid. Hence, the proposed BHEFC-based RSFO renders effective prediction accuracy by facilitating the data clustering using the input data. 1.1 The major contribution of the research paper 1.1.1 Proposed BHEFC-based RSFO algorithm for medical data clustering Finding the cluster centroids is time-consuming and still does not guarantee the global optimum. In the proposed method, the medical data clustering is done by using the BHEFC algorithm, which uses the proposed RSFO for selecting the optimal cluster centers. The proposed RSFO is designed by incorporating ROA and SFO. By using the proposed RSFO, the weighted coefficients are modified, wherein the number of clusters is user-defined.
Thus, the proposed clustering technique can perform data clustering to find the optimal centroids, such that the convergence speed is enhanced, which improves the performance of the proposed method. The paper is structured in the following manner: section 1 provides the introductory part of the medical data clustering and section 2 discusses the existing methods of medical data clustering along with the challenges of the methods that remain as the motivation for the research. The proposed method of medical data clustering is demonstrated in section 3 and section 4 demonstrates the results of the methods. Finally, section 5 concludes the research work. 2 LITERATURE REVIEW The existing methods devised in the literature are elaborated in this section along with their drawbacks. Yelipe et al. [23] developed an imputation approach named imputation based on class-based clustering (IM-CBC) using the class-based-clustering classifier (CBCC) for evaluating the similarity between two medical records. This method employed fuzzy-similarity functions and Euclidean distance for finding the similarity between the clusters. Then, the classification was performed using classifiers such as the support vector machine (SVM), k-nearest neighbours (k-NN) or C4.5. The result proved that the performance of the classifier was enhanced, depicting high accuracy, but the method failed to consider fuzzy measures for classifying and predicting medical data values. Al-Shammari et al. [5] developed a dynamic maintenance framework based on density-based clustering for classifying patients with similar properties. In this method, medical data clusters were generated based on piece-wise aggregate approximation and the density-based spatial clustering (PAA + DBSCAN) algorithm. New medical clusters are generated when a new set of medical data streams arrives.
The incremental cluster maintenance is a lengthy process, and thus an advanced cluster maintenance (ACM) approach was utilized for enhancing the performance of the dynamic cluster maintenance. However, this method failed to consider high-frequency data streams for updating the data clusters. Das et al. [31] developed a modified Bee colony optimization (MBCO) approach for clustering the data. Here, a probability-based selection (Pbselection) approach was devised for allocating the unassigned data points to each location. This method gives faster convergence as compared to other methods. The MBCO incorporates the k-means algorithm for enhancing its performance to obtain a global optimal solution. Chaotic theory was employed to enhance the rate of convergence and the classification accuracy. However, the method failed to adopt multi-objective optimization functions for initiating clustering and was unable to process high-frequency data streams. Chauhan et al. [24] designed a two-step clustering technique for analyzing the disorders of the patient using various data variables to determine the optimal clusters of different shapes and sizes. The goal was to determine the hidden knowledge and significant factors that could help medical practitioners diagnose liver disease at an earlier stage. The aim was to find efficient patterns with a predictive data analytics method for determining the hidden patterns from a huge database. The two-step clustering method produced three clusters wherein each cluster represented correlated factors, which helped to diagnose the disease in the future. Chen et al. [32] designed a manifold learning-based method for enhancing the model training and data division. The method was analyzed with the atlas registration framework [32] along with the deep learning framework. The segmented results with and without data were used for the comparison.
The final segmentation was enhanced by adapting manifold learning into the framework. However, the method did not apply to other machine learning methods. Khanmohammadi et al. [33] developed a hybridized method, which integrated the k-harmonic means and overlapping k-means algorithms (KHM-OKM), to overcome this drawback. The goal of the KHM-OKM method was to utilize the output of the KHM method for initializing the cluster centers of the OKM method. The method was effective in clustering medical datasets. However, the method failed to integrate metaheuristic optimization algorithms to solve the issue of local minima. Zhang et al. [25] developed a high-order possibilistic c-means algorithm (HOPCM) for clustering huge data by optimizing the fitness in the tensor space. Furthermore, the distributed HOPCM method was devised based on MapReduce for dealing with huge heterogeneous data. Moreover, a privacy-preserving HOPCM algorithm was utilized for preserving the private data in the cloud using the BGV encryption scheme. The method was effective in clustering huge data without revealing confidential data. However, the method failed to consider other datasets for medical data clustering. Verma et al. [34] designed a hybrid method for coronary artery disease (CAD) diagnosis using the identification of risk factors along with correlation-based feature subset (CFS) selection, the K-means clustering algorithm and PSO. Also, supervised learning algorithms like multinomial logistic regression (MLR), C4.5, the multi-layer perceptron (MLP) and the fuzzy unordered rule induction algorithm (FURIA) were used to model CAD cases. The method showed improved accuracy as compared to other conventional methods. Baliarsingh et al. [35] developed a medical data classification based on a memetic algorithm-based SVM (M-SVM). This was a fusion of the social engineering optimizer (SEO) and Emperor penguin optimization (EPO), which accurately classified the medical data.
However, this method did not apply to large-scale datasets. Zemmal et al. [36] developed a method for medical data classification, which combines active learning (AL) and PSO. It reduced the requirement of experts for medical data annotation and produced accurate results. However, labelling was difficult in this method. 2.1 Challenges The challenges faced by conventional methods are enlisted in this section. Finding the cluster centroids is time-consuming and still does not guarantee the global optimum. In the proposed method, the RSFO algorithm is proposed to find the optimal cluster centroids. The medical data undergoes a mining process for extracting the knowledge, which is performed by handling the missing values or by adopting imputation in such a way that the imputed values acquire better classification rates. However, there is a possibility of information loss, as valuable data is removed [23]. To overcome this problem, the inconsistent words in the data are eliminated during the pre-processing of the proposed method. In clustering, medical data stream clustering is a complex research area, wherein the goal is to extract the hidden information from the incessant flow of medical data. However, in real-time applications, medical data stream clustering is a major obstacle due to the massive volume, the infinite sequence of incoming data and high-frequency issues [5]. To overcome this problem, in the proposed method, the Markov-Chain Monte Carlo method is utilized to cluster the data. A major issue faced by the beneficiary is that real-world datasets contain complex data forms, time-variant datasets, inconsistent features and missing values, which occur due to data transformation, data selection and data collection. Thus, the discovery of disease using inconsistent data records may lead to an incoherent diagnosis [24]. Clustering big data is a difficult task for several reasons.
Initially, the concatenation of features is performed using different modalities and heterogeneous data sets, so these methods are unable to generate the required results. Secondly, these methods pose high time complexity issues and are unable to cluster huge amounts of heterogeneous data [25]. In the proposed method, only highly relevant features are used for clustering. Hence, the complexity of analyzing data is minimized, since the data are represented as a reduced set of features. The conventional clustering methods do not yield satisfactory results due to different reasons, like huge data samples, sparsity and high dimensionality. Thus, clustering methods are needed for reducing the difficulty in terms of memory and time. In the proposed method, the complexity of analyzing data is minimized, because the data are represented as a reduced set of features. Therefore, it requires less memory and time. 3 PROPOSED RSFO FOR MEDICAL DATA CLUSTERING This section presents the proposed technique, namely RSFO, which is utilized for medical data clustering based on the BHEFC algorithm. At first, the medical data are subjected to the pre-processing phase for removing the noise and artifacts contained in the input data. Once the pre-processed data are obtained, the features are selected from the pre-processed data to find similar data. Here, the Tversky index is employed for finding the features from the data. After obtaining the significant features, the features are fed to the data clustering phase, which is carried out using the BHEFC algorithm. In the data clustering phase, the medical data are clustered using the BHEFC [28] algorithm, which finds the cluster centers. The weighted coefficients that correspond to the cluster centers are optimally found using RSFO. The proposed RSFO is designed by incorporating ROA [26] and SFO [27]. The clustering task is made effective using the proposed RSFO algorithm, which determines the optimal centroid.
With the proposed RSFO, the weighted coefficients are modified, wherein the number of clusters is user-defined. Thus, the proposed clustering technique can perform data clustering to find the optimal centroids such that the convergence speed is enhanced. Figure 1 illustrates the schematic view of the proposed BHEFC-based RSFO clustering scheme for medical data clustering. FIGURE 1. Schematic view of proposed RSFO-based medical data clustering. Consider a database |$D$| containing |$g$| documents, represented as, $$\begin{equation} D=\left\{{d}_1,{d}_2,\dots{d}_h,\dots{d}_g\right\} \end{equation}$$(1) where |${d}_h$| indicates the |${h}^{th}$| document such that |$1\le h\le g$|⁠, and |$g$| represents the total number of documents contained in the database. 3.1 Pre-processing In this section, pre-processing is considered a significant step for smoothly organizing thousands of records to provide effective results. Pre-processing helps to describe the processing of data to obtain better representations. The database consists of unnecessary words or phrases, which impact the clustering process. Thus, pre-processing becomes an important process for eliminating inconsistent words from the data. 3.2 Feature selection This section deliberates the significant features selected from the input data; the significance of feature selection is to generate highly relevant features that enable better clustering of the available data. On the other hand, the complexity of analyzing data is minimized, as the data are represented as a reduced set of features. Feature selection is applied after pre-processing using the Tversky index with holoentropy. Moreover, the accuracy associated with the clustering is assured through the effective feature selection using the input data. Feature selection is initiated by parsing the input data using the Tversky index with holoentropy.
Two issues are addressed by the feature selection. The first issue considered is to mine the topics from the database |$D$|⁠, with the goal that each topic consists of a set of documents under that topic. The second issue is to mine the most suitable feature words from the database |$D$|⁠. 3.2.1 Tversky Index using holoentropy The Tversky Index refers to an asymmetric similarity measure that compares two sets of data to check their similarity. Due to the huge number of similar words, the accuracy of the proposed system is affected. These issues are addressed using the Tversky Index with holoentropy. Here, the holoentropy-enabled feature evaluation function is used to extract the most suitable words from the database in order to reduce the computational overhead as well as improve the accuracy [30]. The Tversky Index is considered a generalization of Dice’s coefficient and the Tanimoto coefficient. For two sets |${d}_i$| and |${d}_m$|⁠, the Tversky Index [29] gives a number between 0 and 1, which is represented as, $$\begin{equation} T\left({d}_i,{d}_m\right)=\frac{\mid{d}_i\cap{d}_m\mid }{\mid{d}_i\cap{d}_m\mid +\alpha \mid{d}_i-{d}_m\mid +\beta \mid{d}_m-{d}_i\mid } \end{equation}$$(2) where |${d}_i$| and |${d}_m$| indicate the data from the database such that |$1\le i\le m\le g$|⁠; |$\alpha$| and |$\beta$| represent the Tversky parameters, where |$\alpha, \beta \ge 0$|⁠, and |${d}_i-{d}_m$| represents the relative complement of |${d}_m$| in |${d}_i$|⁠. Here, the setting |$\alpha =\beta =1$| yields the Tanimoto coefficient and |$\alpha =\beta =0.5$| yields the Dice coefficient. The holoentropy [30] is used in the Tversky parameters |$\alpha$| and |$\beta$| to analyze the relationship between attributes and features. The product of the weight function and the entropy is the holoentropy, which is utilized for selecting the best feature from the complete set of data.
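As a concrete illustration, the set-based similarity of Equation (2) can be sketched in Python as follows; the function name and the toy word sets are illustrative, not from the paper:

```python
def tversky_index(d_i, d_m, alpha=0.5, beta=0.5):
    """Tversky similarity of Equation (2) between two feature-word sets.

    alpha = beta = 1 recovers the Tanimoto (Jaccard) coefficient and
    alpha = beta = 0.5 recovers the Dice coefficient.
    """
    d_i, d_m = set(d_i), set(d_m)
    common = len(d_i & d_m)          # |d_i ∩ d_m|
    only_i = len(d_i - d_m)          # |d_i − d_m|, relative complement
    only_m = len(d_m - d_i)          # |d_m − d_i|
    denom = common + alpha * only_i + beta * only_m
    return common / denom if denom else 0.0
```

For example, for the word sets {a, b, c} and {b, c, d}, the Tanimoto setting gives 2/4 = 0.5 and the Dice setting gives 2/3.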
The holoentropy-enabled Tversky parameters are given as, $$\begin{equation} {H}_e\left(\gamma \right)=\omega \times E\left(\gamma \right) \end{equation}$$(3) where |$\omega$| indicates the weight function and |$E(\gamma )$| represents the entropy measure. Here, |$E(\gamma )$| is defined as the sum of the entropies of the individual attribute values. $$\begin{equation} E\left(\gamma \right)=-\sum \limits_{i=1}^{u\left(\gamma \right)}{P}_i\log{P}_i \end{equation}$$(4) where |$u(\gamma )$| represents the number of unique values in the data. Here, |$\gamma$| is applied to both the Tversky parameters |$\alpha$| and |$\beta$|⁠. 3.2.2 Proposed BHEFC-based RSFO for clustering medical data The BHEFC [28] achieves effective clustering by determining the maximum-a-posteriori (MAP) values of the attributes in the Bayesian fuzzy clustering (BFC) model for the obtained data. This clustering technique utilizes the Markov-Chain Monte Carlo method [28] to cluster data. The method utilizes the Metropolis-within-Gibbs sampler [28] to produce samples from the BFC posterior distribution. As the samples are produced, they are evaluated under the posterior so that the best sample is retained, facilitating the usage of the BFC algorithm. Consider the database |$D$| with |$C$| clusters. According to the Lagrangian optimization method, $$\begin{eqnarray} &&M=\sum \limits_{p=1}^C\sum \limits_{h=1}^g\left({u}_{hp}{\left\Vert{d}_h-{t}_p\right\Vert}^2+\rho\;\ln\;{u}_{hp}+\rho\;\ln{\left\Vert{d}_h-{t}_p\right\Vert}^2\right)\nonumber\\&&\quad+\sum \limits_{h=1}^g{\eta}_h\left(1-\sum \limits_{p=1}^C{u}_{hp}\right) \end{eqnarray}$$(5) where |${d}_h$| represents a data object, |$\rho$| indicates a user-defined constant, |${t}_p$| indicates a cluster center, |${u}_{hp}$| denotes the fuzzy membership of the |${h}^{th}$| data object in the |${p}^{th}$| cluster center, |$C$| indicates the total number of cluster centers and |$g$| the total number of data objects.
Consider the derivative of the above equation expressed as, $$\begin{equation} \left(\frac{\partial M}{\partial{u}_{hp}}\right)={\left\Vert{d}_h-{t}_p\right\Vert}^2+\rho /{u}_{hp}-{\eta}_h=0, \end{equation}$$(6) $$\begin{equation} {u}_{hp}=\rho /\left({\eta}_h-{\left\Vert{d}_h-{t}_p\right\Vert}^2\right) \end{equation}$$(7) Since |$\sum \limits_{p=1}^C{u}_{hp}=1$|⁠, the equation is given as, $$\begin{equation} \sum \limits_{p=1}^C\left[1/\left({\eta}_h-{\left\Vert{d}_h-{t}_p\right\Vert}^2\right)\right]=\frac{1}{\rho } \end{equation}$$(8) For making |${u}_{hp}$| significant, the numerical solution of |${\eta}_h$| should ensure |${u}_{hp}\ge 0$|⁠. Referring to the optimization method, an iterative method is used for obtaining the non-closed-form solution of |${\eta}_h$| with a guarantee of |${u}_{hp}\ge 0.$| Here, |${\eta}_h$| and |${t}_p$| are computed using the proposed RSFO algorithm, which is designed by integrating the ROA and SFO algorithms. The solution encoding, fitness function and proposed RSFO algorithm are illustrated in the following subsections. (i) Solution encoding Figure 2 represents the solution encoding of the proposed RSFO algorithm, which is the simplest view for finding the clusters. The solution is the set of cluster centroids, initialized at random depending on the intermediate data. Here, |$\omega$| denotes the weights obtained by the proposed RSFO algorithm by tuning the cluster centroids, where |$g$| indicates the number of data objects and |$C$| represents the number of clusters. Thus, the solution is a vector whose size is equivalent to the number of clusters and the data. Based on the fitness evaluated, the weighted coefficients equivalent to the cluster centroids are determined optimally using the proposed RSFO algorithm, as represented below. (ii) Fitness function FIGURE 2. Solution encoding of proposed BHEFC-based RSFO. The fitness function decides the quality of the solution and is represented using Equations (5) and (6).
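The membership update of Equations (7) and (8) can be sketched as follows. Since Equation (8) has no closed-form solution for |${\eta}_h$|⁠, this sketch solves it by bisection, which is an assumed numerical choice for illustration (the paper obtains |${\eta}_h$| via the proposed RSFO); it also assumes |$\rho >0$|⁠, and the function name is illustrative:

```python
import numpy as np

def memberships(d_h, centroids, rho=1.0, tol=1e-10):
    """Fuzzy memberships u_hp = rho / (eta_h - ||d_h - t_p||^2), Eq. (7).

    eta_h is found numerically from Eq. (8):
        sum_p 1 / (eta_h - ||d_h - t_p||^2) = 1 / rho,
    which keeps every u_hp >= 0 (assuming rho > 0) and makes the
    memberships sum to one.
    """
    dist2 = np.array([np.sum((d_h - t) ** 2) for t in centroids])
    # The left side of Eq. (8) decreases monotonically for
    # eta_h > max(dist2), so a unique root can be bracketed there.
    lo = dist2.max() + 1e-12
    hi = dist2.max() + rho * len(centroids) + 1.0
    while np.sum(1.0 / (hi - dist2)) > 1.0 / rho:   # widen until bracketed
        hi = lo + 2.0 * (hi - lo)
    for _ in range(200):                            # bisection on eta_h
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (mid - dist2)) > 1.0 / rho:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    eta = 0.5 * (lo + hi)
    u = rho / (eta - dist2)                         # Eq. (7)
    return u / u.sum()                              # enforce sum-to-one exactly
```

For two centroids at 0 and 1 with the data object at 0 and |$\rho =1$|⁠, Equation (8) reduces to |$1/\eta +1/(\eta -1)=1$|⁠, whose root is |$\eta =(3+\sqrt{5})/2$|⁠, which the bisection recovers.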
(iii) Proposed RSFO algorithm The data clustering is performed by updating the weighted coefficients that correspond to the cluster centroids using the proposed RSFO algorithm, which is designed by combining SFO and ROA. The proposed RSFO algorithm inherits the advantages of both ROA and SFO and provides the best data clustering performance. ROA [26] is inspired by the behaviour of rider groups, who travel to reach a common target location to become the winner. Here, the riders are selected from the total riders of each group, namely the bypass rider, attacker, follower and overtaker. Every group undergoes several strategies for reaching the target. This technique performs fault diagnosis with improved classification accuracy. Also, the ROA is highly efficient and follows the steps of fictional computing for solving optimization issues, but possesses slower convergence and is highly sensitive to the hyperparameters. The SFO [27] algorithm is a nature-inspired optimization technique devised based on the motion of the sunflower. The method is capable of determining good locations, which improves the performance. The method mitigates the non-trivial issues while producing a new set of individuals. However, this method suffers from huge computational complexity. Thus, the ROA method is integrated with the SFO algorithm to mitigate the computational complexity issues and obtain a global optimal solution. The algorithmic steps are illustrated in the following section: Step 1: Initialization The first step is the initialization of the solutions and other algorithmic parameters, which is given as, $$\begin{equation} A=\left\{{A}_1,{A}_2,\dots, {A}_b,\dots, {A}_c\right\} \end{equation}$$(9) where |${A}_b$| indicates the |${b}^{th}$| solution and |$c$| represents the total number of solutions.
Step 2: Evaluate fitness function The success rate or fitness of the solution is computed based on the Lagrangian optimization principle, which is elaborated in section 3.2.2. Thus, the fitness of the solution is depicted in Equations (5) and (6). Step 3: Position update of the rider groups Here, the updated position of the bypass rider is used in the update process for maximizing the success rate by determining the position of the bypass rider. The bypass riders follow a common path without tracking the leading rider. The update equation of the bypass rider is expressed as, $$\begin{equation} {A}_{k+1}\left(r,s\right)=\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)+{A}_k\left(\mu, s\right)\ast \left[1-\ell (s)\right]\right] \end{equation}$$(10) where |$\vartheta$| denotes a random number between 0 and 1, |$\kappa$| and |$\mu$| are random rider indices, |$\ell$| is a random number between 0 and 1, and |$k$| indicates the iteration. Assuming |$\mu =r$|⁠, the equation is rewritten as, $$\begin{equation} {A}_{k+1}\left(r,s\right)=\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)+{A}_k\left(r,s\right)\ast \left[1-\ell (s)\right]\right] \end{equation}$$(11) The attacker tends to seize the position of the leader by following the update process of the leader, but the attacker updates the values in all the coordinates rather than the selected ones; thus, the update process of the attacker is given by, $$\begin{equation} {A}_{k+1}\left(r,s\right)={A}^X\left(X,s\right)+\left[\mathrm{Cos}\left({J}_{r,s}^v\right)\ast{A}^X\left(X,s\right)+{f}_r^k\right] \end{equation}$$(12) where |${A}^X(X,s)$| represents the position of the leading rider, |$\mathrm{Cos}({J}_{r,s}^v)$| denotes the steering angle of the |${r}^{th}$| rider in the |${v}^{th}$| coordinate and |${f}_r^k$| is the distance travelled by the |${r}^{th}$| rider.
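Assuming NumPy arrays for the rider positions, the bypass and attacker updates of Equations (11) and (12) can be sketched as follows; the function names and random-number handling are illustrative:

```python
import numpy as np

def bypass_update(A, r, rng):
    """Bypass rider update of Equation (11): a coordinate-wise blend of a
    randomly chosen rider (index kappa) and rider r, scaled by a random
    factor vartheta in [0, 1)."""
    kappa = rng.integers(len(A))      # random rider index kappa
    ell = rng.random(A.shape[1])      # per-coordinate mix ell(s) in [0, 1)
    theta = rng.random()              # scale factor vartheta
    return theta * (A[kappa] * ell + A[r] * (1 - ell))

def attacker_update(A_leader, steer, dist):
    """Attacker update of Equation (12): move relative to the leading
    rider using the steering angle J and the distance travelled f."""
    return A_leader + (np.cos(steer) * A_leader + dist)
```

With a zero steering angle and zero distance, the attacker lands at twice the leader's position, as Equation (12) implies.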
The follower tends to update its position using the position of the leading rider to reach the target more quickly, and the update equation of the follower is given by, $$\begin{equation} {A}_{k+1}\left(r,q\right)={A}^X\left(X,q\right)+\left[\mathrm{Cos}\left({J}_{r,q}^v\right)\ast{A}^X\left(X,q\right)\ast{f}_r^k\right] \end{equation}$$(13) where |$q$| is the coordinate selector, |${A}^X$| represents the position of the leading rider, |${J}_{r,q}^v$| denotes the steering angle of the |${r}^{th}$| rider in the |${q}^{th}$| coordinate and |${f}_r^k$| is the distance travelled by the |${r}^{th}$| rider. The update process of the overtaker using the ROA is given by, $$\begin{equation} {A}_{k+1}\left(r,q\right)={A}_k\left(r,q\right)+\left[{F}_k(r)\ast{A}^X\left(X,q\right)\right] \end{equation}$$(14) where |${A}_k(r,q)$| represents the position of the |${r}^{th}$| rider in the |${q}^{th}$| coordinate and |${F}_k(r)$| represents the direction indicator of the |${r}^{th}$| rider at time |$k$|⁠. The SFO algorithm updates the solution space based on the sunflower’s motion. The alterations in the motion make the sunflower move in one direction. Thus, the update solution of SFO is given by, $$\begin{equation} {A}_{k+1}\left(r,s\right)={A}_k\left(r,s\right)+{x}_r\times{t}_r \end{equation}$$(15) where |${A}_k(r,s)$| represents the current solution, |${x}_r$| represents the step of the sunflower and |${t}_r$| indicates the direction of the sunflower.
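The remaining update rules, Equations (13)–(15), can likewise be sketched as small NumPy functions; again, the function and parameter names are illustrative rather than from the paper:

```python
import numpy as np

def follower_update(A_leader, steer, dist):
    """Follower update of Equation (13): move around the leading rider,
    scaled by the steering angle J and the distance travelled f."""
    return A_leader + np.cos(steer) * A_leader * dist

def overtaker_update(A_r, A_leader, direction):
    """Overtaker update of Equation (14); `direction` is the direction
    indicator F_k(r)."""
    return A_r + direction * A_leader

def sfo_update(A_r, step, direction):
    """Sunflower motion of Equation (15): take a step x_r along the
    direction t_r towards the sun (the best solution)."""
    return A_r + step * direction
```

For instance, a zero steering angle and unit distance place the follower at twice the leader's position, mirroring the attacker's behaviour in Equation (12).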
Equation (15) is rearranged as, $$\begin{equation} {A}_k\left(r,s\right)={A}_{k+1}\left(r,s\right)-{x}_r\times{t}_r \end{equation}$$(16) After substituting Equation (16) into the bypass-rider update in Equation (11), the derived update equation is, $$\begin{eqnarray} {A}_{k+1}\left(r,s\right)&=&\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)+\left({A}_{k+1}\left(r,s\right)-{x}_r\times{t}_r\right)\right.\nonumber\\&&\quad\ast \left[1-\ell (s)\right]\big] \end{eqnarray}$$(17) $$\begin{eqnarray} {A}_{k+1}\left(r,s\right)&=&\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)+{A}_{k+1}\left(r,s\right)\left[1-\ell (s)\right]\right.\nonumber\\&&\quad-{x}_r\times{t}_r\ast \left[1-\ell (s)\right]\big] \end{eqnarray}$$(18) After rearranging the above equation, $$\begin{eqnarray} {A}_{k+1}\left(r,s\right)&=&\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)+{A}_{k+1}\left(r,s\right)\right.\nonumber\\&&\ -{A}_{k+1}\left(r,s\right)\ell (s)-{x}_r{t}_r+{x}_r{t}_r\ell (s)\big] \end{eqnarray}$$(19) $$\begin{eqnarray} &&{A}_{k+1}\left(r,s\right)-{A}_{k+1}\left(r,s\right)\vartheta +\vartheta{A}_{k+1}\left(r,s\right)\ell (s)=\nonumber\\&&\quad\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)-{x}_r{t}_r+{x}_r{t}_r\ell (s)\right] \end{eqnarray}$$(20) $$\begin{eqnarray}&& {A}_{k+1}\left(r,s\right)\left[1-\vartheta +\vartheta \ell (s)\right]=\nonumber\\&&\quad\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)-{x}_r{t}_r+{x}_r{t}_r\ell (s)\right] \end{eqnarray}$$(21) The final update equation is given by, $$\begin{eqnarray} &&{A}_{k+1}\left(r,s\right)=\frac{1}{\left[1-\vartheta \left[1-\ell (s)\right]\right]}\nonumber\\&&\quad\left[\vartheta \left[{A}_k\left(\kappa, s\right)\ast \ell (s)-{x}_r{t}_r\left[1-\ell (s)\right]\right]\right] \end{eqnarray}$$(22) Step 4: Determining the best solution The solution with the maximal fitness value is considered the best. Also, the update of the rider parameters is essential to determine the best solution.
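The closed form in Equation (22) can be checked numerically: substituting the result back into the bypass-rider rule of Equation (11), with the rider's own position replaced by the SFO term of Equation (16), must reproduce the same value. The scalar inputs below are illustrative values chosen only for this consistency check.

```python
import numpy as np

def rsfo_update(a_kappa, ell, theta, step, direction):
    """Hybrid RSFO update, Eq. (22): bypass-rider rule with the SFO motion
    term substituted for the rider's own previous position."""
    num = theta * (a_kappa * ell - step * direction * (1.0 - ell))
    den = 1.0 - theta * (1.0 - ell)
    return num / den

# illustrative scalar values (assumed, for the check only)
a_kappa, ell, theta, x, t = 0.8, 0.3, 0.6, 0.1, 1.0
a_next = rsfo_update(a_kappa, ell, theta, x, t)

# Eq. (11) with A_k(r,s) = A_{k+1}(r,s) - x*t, per the substitution in Eq. (17)
rhs = theta * (a_kappa * ell + (a_next - x * t) * (1.0 - ell))
assert np.isclose(a_next, rhs)
```

The assertion confirms that Equation (22) is the fixed point of the substituted bypass-rider recurrence, which is exactly what the derivation in Equations (17)-(21) establishes algebraically.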
Step 5: Termination The steps are repeated until the iteration count reaches its maximum. Thus, the optimization renders the optimal values for the centroid update, which makes the proposed approach effective in dealing with medical data. The proposed RSFO algorithm is represented in Algorithm 1.

Algorithm 1. Pseudocode of proposed RSFO.

RSFO algorithm
Input: random positions of the riders |$A$|
Output: leading rider |${A}^X(X,s)$|
Parameters: |${J}_{r,q}^v$| ➔ steering angle of the |${r}^{th}$| rider in the |${q}^{th}$| coordinate, |$M$| ➔ success rate, |${A}_k(r,q)$| ➔ position of the |${r}^{th}$| rider in the |${q}^{th}$| coordinate, |${A}_k(r,s)$| ➔ current solution, |${f}_r^k$| ➔ distance travelled by the |${r}^{th}$| rider, |${F}_k(r)$| ➔ direction indicator of the |${r}^{th}$| rider at time |$k$|, |${A}^X$| ➔ position of the leading rider.
Begin
  Initialize the positions of the riders
  Initialize the rider parameter: steering angle |${J}_{r,q}^v$|
  Calculate the success rate |$M$|
  While |$n<{K}_{OFF}$|
    For |$k=1$| to |$S$|
      Update the position of the bypass rider using
      |${A}_{k+1}(r,s)=\vartheta [{A}_k(\kappa, s)\ast \ell (s)+{A}_k(r,s)\ast [1-\ell (s)]]$|
      Update the position of the attacker using
      |${A}_{k+1}(r,s)={A}^X(X,s)+[\mathrm{Cos}({J}_{r,s}^v)\ast{A}^X(X,s)+{f}_r^k]$|
      Update the position of the follower using
      |${A}_{k+1}(r,q)={A}^X(X,q)+[\mathrm{Cos}({J}_{r,q}^v)\ast{A}^X(X,q)\ast{f}_r^k]$|
      Update the position of the overtaker using
      |${A}_{k+1}(r,q)={A}_k(r,q)+[{F}_k(r)\ast{A}^X(X,q)]$|
      Rank the riders based on the success rate
      Select the rider having the maximum |$M$| as the leading rider
      Update the rider parameter |${J}_{r,q}^v$|
      Return |$M$|
      |$n=n+1$|
    End for
  End while
Terminate

4 RESULTS AND DISCUSSION This section illustrates the analysis of the methods for data clustering by comparing their performance with conventional methods. 4.1 Experimental setup The proposed technique is executed on a PC with the Windows 10 operating system, 4 GB RAM and an Intel Core i3 processor, and is implemented in MATLAB. 4.2 Database description The experimentation is conducted on medical data taken from the Dermatology Data Set, available online at 'https://archive.ics.uci.edu/ml/datasets/dermatology' [37]. Nilsel Ilter and H. Altay Guvenir are the owners of this dataset, which is used to determine the type of erythemato-squamous disease. It is a multivariate dataset containing categorical and integer attribute characteristics. The database contains 34 attributes, 33 of which are linear valued and one of which is nominal.
In dermatology, the differential diagnosis of erythemato-squamous diseases is a major problem. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. This dataset contains 366 instances and has received 232 965 web hits. FIGURE 3. Experimentation results of the proposed BHEFC-based RSFO: (a) sample data, (b) cluster size-2 and (c) cluster size-4. FIGURE 4. Analysis of proposed BHEFC-based RSFO with feature size using (a) accuracy, (b) Jaccard coefficient and (c) Rand coefficient. FIGURE 5. Analysis of proposed BHEFC-based RSFO with chunk size using (a) accuracy, (b) Jaccard coefficient and (c) Rand coefficient. FIGURE 6. Analysis of methods with feature size using (a) accuracy, (b) Jaccard coefficient and (c) Rand coefficient. 4.3 Performance metric The metrics employed for the analysis are accuracy, Jaccard coefficient and Rand coefficient. 4.3.1 Accuracy The accuracy indicates the correctness of the detection process and is calculated as, $$\begin{equation} Accuracy=\frac{T^p+{T}^n}{T^p+{T}^n+{F}^p+{F}^n} \end{equation}$$(23) where |${T}^p$| specifies the number of true positives, |${F}^p$| the number of false positives, |${T}^n$| the number of true negatives and |${F}^n$| the number of false negatives. 4.3.2 Jaccard coefficient The Jaccard coefficient is employed to compare the data of two clusters by computing the similarity between the two sets. $$\begin{equation} J\left(A,B\right)=\frac{\mid A\cap B\mid }{\mid A\cup B\mid } \end{equation}$$(24) where |$A$| and |$B$| represent the two sets.
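The accuracy and Jaccard measures above translate directly into code. A minimal sketch, with made-up counts and sets purely for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (23): fraction of correct assignments among all assignments."""
    return (tp + tn) / (tp + tn + fp + fn)

def jaccard(a, b):
    """Eq. (24): intersection over union of two clusters given as sets."""
    return len(a & b) / len(a | b)

print(accuracy(tp=80, tn=10, fp=5, fn=5))   # 0.9
print(jaccard({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6
```

Note that Equation (25) in the next subsection gives the Rand coefficient the same true/false positive/negative form as the accuracy in Equation (23); the difference lies in how the pairwise decisions are counted for clustering.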
4.3.3 Rand coefficient The Rand coefficient measures the percentage of correct decisions made by the algorithm and is given by, $$\begin{equation} {R}_c=\frac{T^p+{T}^n}{T^p+{F}^p+{F}^n+{T}^n} \end{equation}$$(25) 4.4 Experimentation results Figure 3 shows the experimental results of the proposed BHEFC-based RSFO for different cluster sizes. Figure 3a shows the sample data visualized in a 2D graph, with each cluster indicated in a different colour. Figure 3b shows cluster size-2: the blue colour indicates the presence of disease and the pink colour indicates the absence of disease. Figure 3c shows cluster size-4: the pink colour indicates the absence of disease, the red colour a low possibility of disease, the blue colour an intermediate possibility of disease and the green colour a high possibility of disease. 4.5 Performance analysis The performance analysis of the proposed BHEFC-based RSFO using clustering accuracy, Jaccard coefficient and Rand coefficient is illustrated in this section. The analysis is performed by varying the chunk size and the feature size. 4.5.1 Analysis based on feature size Figure 4 illustrates the analysis of the proposed BHEFC-based RSFO using the accuracy, Jaccard coefficient and Rand coefficient parameters. The number of selected features is varied from 5 to 30 to evaluate the performance. The analysis of the proposed BHEFC-based RSFO with the accuracy parameter is portrayed in Fig. 4a. When the number of selected features is 5, the corresponding accuracy values computed by BHEFC-based RSFO with iterations 100, 200, 300, 400, 500 and 600 are 87.108, 87.287, 87.466, 87.645, 87.825 and 88.004%, respectively.
When the number of selected features is 30, the corresponding accuracy values computed by BHEFC-based RSFO with iterations 100, 200, 300, 400, 500 and 600 are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively. The analysis of the proposed BHEFC-based RSFO in terms of the Jaccard coefficient is portrayed in Fig. 4b. When the number of selected features is 5, the corresponding Jaccard coefficient values with iterations 100, 200, 300, 400, 500 and 600 are 81.112, 81.279, 81.446, 81.613, 81.78 and 81.946%, respectively; when it is 30, the values are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively. The analysis in terms of the Rand coefficient is portrayed in Fig. 4c. When the number of selected features is 5, the corresponding Rand coefficient values with iterations 100, 200, 300, 400, 500 and 600 are 84.074, 84.247, 84.420, 84.593, 84.766 and 84.939%, respectively; when it is 30, the values are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively.

TABLE 1. Comparative analysis.

Variations | Metrics | KHM-OKM | CBCC | FCM | BFCM | PPHOPCM | Proposed BHEFC-based RSFO
Selected features | Accuracy (%) | 80.983 | 82.021 | 91.771 | 91.885 | 92.663 | 94.480
Selected features | Jaccard coefficient (%) | 75.612 | 75.612 | 90.176 | 90.346 | 91.633 | 94.224
Selected features | Rand coefficient (%) | 69.088 | 70.532 | 86.248 | 86.455 | 87.882 | 91.307
Chunk size | Accuracy (%) | 64.760 | 80.730 | 91.442 | 92.339 | 94.117 | 94.117
Chunk size | Jaccard coefficient (%) | 49.766 | 77.571 | 89.960 | 91.190 | 93.699 | 93.699
Chunk size | Rand coefficient (%) | 52.635 | 69.252 | 86.287 | 87.929 | 91.280 | 91.280

FIGURE 7.
Analysis of methods with chunk size using (a) accuracy, (b) Jaccard coefficient and (c) Rand coefficient. 4.5.2 Analysis based on chunk size Figure 5 illustrates the analysis of the proposed BHEFC-based RSFO using the accuracy, Jaccard coefficient and Rand coefficient parameters. The chunk size is varied from 2 to 10 to evaluate the performance. The analysis with the accuracy parameter is portrayed in Fig. 5a. When the chunk size is 2, the corresponding accuracy values computed by BHEFC-based RSFO with iterations 100, 200, 300, 400, 500 and 600 are 85.734, 85.911, 86.087, 86.263, 86.440 and 86.616%, respectively; when the chunk size is 10, the values are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively. The analysis in terms of the Jaccard coefficient is portrayed in Fig. 5b. When the chunk size is 2, the corresponding Jaccard coefficient values with iterations 100, 200, 300, 400, 500 and 600 are 64.116, 64.248, 64.380, 64.512, 64.643 and 64.775%, respectively; when the chunk size is 10, the values are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively. The analysis in terms of the Rand coefficient is portrayed in Fig. 5c.
When the chunk size is 2, the corresponding Rand coefficient values computed by BHEFC-based RSFO with iterations 100, 200, 300, 400, 500 and 600 are 71.608, 71.756, 71.903, 72.050, 72.198 and 72.345%, respectively; when the chunk size is 10, the values are 92.34, 92.53, 92.72, 92.91, 93.1 and 93.29%, respectively. 4.6 Comparative analysis This section compares the proposed BHEFC-based RSFO with the existing methods, namely KHM-OKM [33], CBCC [23], FCM [13], Fuzzy C-means clustering with bilateral filtering (BFCM) [38] and the privacy-preserving high-order possibilistic C-means algorithm (PPHOPCM) [25], using clustering accuracy, Jaccard coefficient and Rand coefficient. The analysis is performed by varying the chunk size and the feature size. 4.6.1 Analysis based on feature size Figure 6 illustrates the comparative analysis of the methods in terms of accuracy, Jaccard coefficient and Rand coefficient. The analysis is done by varying the number of selected features from 5 to 30. The analysis based on the accuracy parameter is illustrated in Fig. 6a. When the number of selected features is 5, the accuracy values computed by the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM and the proposed BHEFC-based RSFO are 50.095, 57.394, 61.516, 61.516, 76.384 and 79.218%, respectively. Likewise, when the number of selected features is 10, the corresponding accuracy values are 80.983, 82.021, 91.771, 91.885, 92.663 and 94.480%, respectively. The analysis based on the Jaccard coefficient is illustrated in Fig. 6b.
When the number of selected features is 5, the Jaccard coefficient values computed by the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM and the proposed BHEFC-based RSFO are 27.918, 38.822, 44.979, 44.979, 67.190 and 71.424%, respectively. Likewise, when the number of selected features is 10, the corresponding Jaccard coefficient values are 75.612, 75.612, 90.176, 90.346, 91.633 and 94.224%, respectively. The analysis based on the Rand coefficient is illustrated in Fig. 6c. When the number of selected features is 5, the corresponding Rand coefficient values are 46.288, 48.153, 50.169, 50.169, 63.219 and 66.733%, respectively. Likewise, when the number of selected features is 10, the corresponding Rand coefficient values are 69.088, 70.532, 86.248, 86.455, 87.882 and 91.307%, respectively. 4.6.2 Analysis based on chunk size Figure 7 illustrates the comparative analysis of the methods in terms of accuracy, Jaccard coefficient and Rand coefficient. The analysis is done by varying the chunk size from 2 to 10. The analysis based on the accuracy parameter is illustrated in Fig. 7a. When the chunk size is 2, the accuracy values computed by the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM and the proposed BHEFC-based RSFO are 38.629, 53.890, 60.301, 63.777, 92.114 and 93.660%, respectively. Likewise, when the chunk size is 10, the corresponding accuracy values are 64.760, 80.730, 91.442, 92.339, 94.117 and 94.117%, respectively. The analysis based on the Jaccard coefficient is illustrated in Fig. 7b.
When the chunk size is 2, the Jaccard coefficient values computed by the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM and the proposed BHEFC-based RSFO are 10.750, 33.577, 43.156, 48.321, 90.874 and 93.039%, respectively. Likewise, when the chunk size is 10, the corresponding Jaccard coefficient values are 49.766, 77.571, 89.960, 91.190, 93.699 and 93.699%, respectively. The analysis based on the Rand coefficient is illustrated in Fig. 7c. When the chunk size is 2, the corresponding Rand coefficient values are 46.795, 47.335, 49.870, 51.955, 87.515 and 90.406%, respectively. Likewise, when the chunk size is 10, the corresponding Rand coefficient values are 52.635, 69.252, 86.287, 87.929, 91.280 and 91.280%, respectively. 4.7 Comparative discussion Table 1 illustrates the comparative analysis of the methods using the accuracy, Jaccard coefficient and Rand coefficient parameters. Here, the feature size and the chunk size are varied for the performance evaluation. Considering the feature size, the maximal accuracy of 94.480% is attained by the proposed BHEFC-based RSFO, whereas the accuracies of the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM are 80.983, 82.021, 91.771, 91.885 and 92.663%, respectively. Likewise, the maximal Jaccard coefficient of 94.224% is computed by the proposed BHEFC-based RSFO, whereas those of the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM are 75.612, 75.612, 90.176, 90.346 and 91.633%, respectively. The maximal Rand coefficient of 91.307% is obtained by the proposed BHEFC-based RSFO, whereas the Rand coefficient values computed by the existing KHM-OKM, CBCC, FCM, BFCM and PPHOPCM are 69.088, 70.532, 86.248, 86.455 and 87.882%, respectively.
Considering the chunk size, the proposed BHEFC-based RSFO attained the maximal accuracy of 94.117%, Jaccard coefficient of 93.699% and Rand coefficient of 91.280%, respectively. The comparative analysis shows that the proposed method offers better performance than the existing methods. Finding the cluster centroids is time-consuming and still does not guarantee the global optimum. The performance of the proposed method is enhanced using various techniques, such as the Tversky index with holoentropy, the Markov-Chain Monte Carlo method and the integration of the ROA and SFO algorithms. The holoentropy-enabled feature evaluation function is used to extract the most suitable feature words from the database, which reduces the computational overhead and improves the accuracy of the proposed system. Also, the Markov-Chain Monte Carlo method effectively clusters the data in the database. The RSFO algorithm reduces the computational complexity and obtains a globally optimal solution. Hence, the performance of the proposed system is superior to that of the existing methods. 5 CONCLUSION This paper presents an effective data clustering method using the hybrid optimization-based BHEFC algorithm. The proposed model undergoes three steps for data clustering, namely pre-processing, feature selection and data clustering. The pre-processing step is significant for the effective management of the data categorization, which enables the research to concentrate on the clustering of dynamic data. The most significant features are selected using the Tversky index with holoentropy. Finally, the data clustering is performed based on BHEFC, which employs RSFO for selecting the optimal centroids. The proposed RSFO is designed by integrating ROA with the SFO algorithm. In the clustering process, the set of medical data clusters is generated using the proposed BHEFC-based RSFO algorithm.
The clustering task is made effective using the proposed RSFO algorithm, which determines the optimal centroids. The effectiveness of the proposed BHEFC+RSFO algorithm is revealed with a maximal accuracy of 94.480%, Jaccard coefficient of 94.224% and Rand coefficient of 91.307%, respectively. This work can be extended by considering multi-objective optimization functions for effective clustering. The proposed system can be used for the diagnosis of various diseases, which improves patient care and treatment. It also identifies similar patients based on their attributes to explore costs, treatments or outcomes, and can be used to detect unknown diseases, causes of diseases and suitable medical treatment methods. The proposed method considers only medical data clustering; in the future, it will be further improved by clustering medical images. DATA AVAILABILITY STATEMENT The data underlying this article are available in the Dermatology Data Set at https://archive.ics.uci.edu/ml/datasets/dermatology. References [1] Chauhan, R., Kaur, H. and Alam, M.A. (2010) Data clustering method for discovering clusters in spatial cancer databases. International Journal of Computer Applications, 10, 9-14. [2] Ghazavi, S.N. and Liao, T.W. (2008) Medical data mining by fuzzy modeling with selected features. Artif. Intell. Med., 43, 195-206. [3] Conti, L., Goli, G., Monti, M., Pellegrini, P., Rossi, G. and Barbari, M. (2017) Simplified method for the characterization of rectangular straw bales (RSB) thermal conductivity. In Proceedings of the IOP Conference Series: Materials Science and Engineering, 245. [4] Delen, D., Walker, G. and Kadam, A. (2005) Predicting breast cancer survivability: A comparison of three data mining methods. Artif. Intell. Med., 34, 113-127.
[5] Al-Shammari, A., Zhou, R., Naseriparsaa, M. and Liu, C. (2019) An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int. J. Med. Inform., 126, 176-186. [6] Chithra, R.S. and Jagatheeswari, P. (2019) Enhanced WOA and modular neural network for severity analysis of tuberculosis. Multimedia Research, 2, 43-55. [7] Leonardi, S., Parisi, G.F., Capizzi, A., Manti, S., Cuppari, C., Scuderi, M.G., Rotolo, N., Lanzafame, A., Musumeci, M. and Salpietro, C. (2016) YKL-40 as marker of severe lung disease in cystic fibrosis patients. J. Cyst. Fibros., 15, 583-586. [8] Parisi, G.F., Papale, M., Rotolo, N., Aloisio, D., Tardino, L., Scuderi, M.G., Di Benedetto, V., Nenna, R., Midulla, F. and Leonardi, S. (2017) Severe disease in cystic fibrosis and fecal calprotectin levels. Immunobiology, 222, 582-586. [9] Likitha, V., Naik, S. and Manjunath, R. (2019) Development of predictive model to improve accuracy of medical data processing using machine learning techniques. International Journal of Scientific Research and Review, 7, 233-240. [10] Hammouda, K. and Karray, F. (2000) A comparative study of data clustering techniques. University of Waterloo, Ontario, Canada. http://www.pami.uwaterloo.ca/pub/hammouda/sde625-paper.pdf. [11] Abdullah, M., Al-Anzi, F.S. and Al-Sharhan, S. (2017) Efficient fuzzy techniques for medical data clustering. In Proceedings of the 9th IEEE-GCC Conference and Exhibition, IEEE, Manama, Bahrain, pp. 1-9.
[12] Yager, R.R. and Filev, D.P. (1994) Approximate clustering via the mountain method. IEEE Trans. Syst. Man Cybern., 24, 1279-1284. [13] Bezdek, J.C. (2013) Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, New York. [14] Rana, S., Jasola, S. and Kumar, R. (2011) A review on particle swarm optimization algorithms and their applications to data clustering. Artificial Intelligence Review, 35. [15] Jadhav, A.N. and Gomathi, N. (2019) DIGWO: Hybridization of dragonfly algorithm with improved grey wolf optimization algorithm for data clustering. Multimedia Research, 2, 1-11. [16] Brajula, W. and Praveena, S. (2018) Energy efficient genetic algorithm based clustering technique for prolonging the life time of wireless sensor network. Journal of Networking and Communication Systems, 1, 1-9. [17] Yan, X., Zhu, Y., Zou, W. and Wang, L. (2012) A new approach for data clustering using hybrid artificial bee colony algorithm. Neurocomputing, 97, 241-250. [18] Swamy, S.M., Rajakumar, B.R. and Valarmathi, I.R. (2013) Design of hybrid wind and photovoltaic power system using opposition-based genetic algorithm with Cauchy mutation. In IET Chennai Fourth International Conference on Sustainable Energy and Intelligent Systems (SEISCON 2013), Chennai, India. [19] Pakrashi, A. and Chaudhuri, B.B. (2016) A Kalman filtering induced heuristic optimization based partitional data clustering. Inform. Sci., 369, 704-717. [20] Hatamlou, A. (2013) Black hole: A new heuristic optimization approach for data clustering. Inform. Sci., 222, 175-184. [21] Tsujinishi, D. and Abe, S. (2003) Fuzzy least squares support vector machines for multiclass problems. Neural Netw., 16, 785-792. [22] Nawi, N.M., Khan, A. and Rehman, M.Z. (2013) A new back-propagation neural network optimized with cuckoo search algorithm. In Proceedings of the International Conference on Computational Science and Its Applications (ICCSA 2013), Springer, pp. 413-426. [23] Yelipe, U., Porika, S. and Golla, M. (2018) An efficient approach for imputation and classification of medical data values using class-based clustering of medical records. Computers & Electrical Engineering, 66, 487-504. [24] Chauhan, R., Kumar, N. and Rekapally, R. (2019) Predictive data analytics technique for optimization of medical databases. In Soft Computing: Theories and Applications, pp. 433-441. Springer, New York. [25] Zhang, Q., Yang, L.T., Chen, Z. and Li, P. (2017) PPHOPCM: Privacy-preserving high-order possibilistic c-means algorithm for big data clustering with cloud computing. IEEE Transactions on Big Data, pp. 1-11. [26] Binu, D. and Kariyappa, B.S. (2018) RideNN: A new rider optimization algorithm-based neural network for fault diagnosis in analog circuits. IEEE Transactions on Instrumentation and Measurement, 68, 2-26. [27] Gomes, G.F., da Cunha, S.S. and Ancelotti, A.C. (2019) A sunflower optimization (SFO) algorithm applied to damage identification on laminated composite plates. Engineering with Computers, 35, 619-626. [28] Liu, J., Chung, F.L. and Wang, S. (2017) Black hole entropic fuzzy clustering. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48, 1622-1636. [29] Tversky, A. (1977) Features of similarity. Psychological Review, 84, 327-352. [30] Bhutada, D., Balaram, V.V.S.S.S. and Bulusu, V.V. (2016) Holoentropy based dynamic semantic latent Dirichlet allocation for topic extraction. International Journal of Applied Engineering Research, 11, 1304-1313. [31] Das, P., Das, D.K. and Dey, S. (2018) A modified bee colony optimization (MBCO) and its hybridization with k-means for an application to data clustering. Appl. Soft Comput., 70, 590-603. [32] Chen, S., Dorn, S., Lell, M., Kachelrieß, M. and Maier, A. (2018) Manifold learning-based data sampling for model training. In Bildverarbeitung für die Medizin, Informatik aktuell, Springer, New York, pp. 269-274. [33] Khanmohammadi, S., Adibeig, N. and Shanehbandy, S. (2017) An improved overlapping k-means clustering method for medical applications. Expert Systems with Applications, 67, 12-18. [34] Verma, L., Srivastava, S. and Negi, P.C. (2016) A hybrid data mining model to predict coronary artery disease cases using non-invasive clinical data. J. Med. Syst., 40, 178. [35] Baliarsingh, S.K., Ding, W., Vipsita, S. and Bakshi, S. (2019) A memetic algorithm using emperor penguin and social engineering optimization for medical data classification. Applied Soft Computing, 85. [36] Zemmal, N., Azizi, N., Sellami, M., Cheriguene, S., Ziani, A., AlDwairi, M. and Dendani, N. (2020) Particle swarm optimization based swarm intelligence for active learning improvement: Application on medical data classification. Cognitive Computation, 12, 991-1010. [37] Dermatology Data Set. https://archive.ics.uci.edu/ml/datasets/dermatology, accessed August 2019. [38] Liu, Y., Xiao, K., Liang, A. and Guan, H. (2012) Fuzzy C-means clustering with bilateral filtering for medical image segmentation. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Lecture Notes in Computer Science, vol. 7208, Springer Nature, Switzerland, pp. 221-230. [39] Bhambere, S. 'The long wait for health in India': A study of waiting time for patients in a tertiary care hospital in western India. International Journal of Basic and Applied Research, 7, 108-111. [40] Bhambere, S.D. (2017) Oral health status, knowledge and caries occurrence in visually impaired students. International Journal of Health Sciences and Research, 7, 118-121. © The Author(s) 2021. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
For permissions, please e-mail: journals.permissions@oup.com TI - Optimization Enabled Black Hole Entropic Fuzzy Clustering Approach for Medical Data JF - The Computer Journal DO - 10.1093/comjnl/bxab021 DA - 2022-07-15 UR - https://www.deepdyve.com/lp/oxford-university-press/optimization-enabled-black-hole-entropic-fuzzy-clustering-approach-for-1CUc0ShcgV SP - 1795 EP - 1811 VL - 65 IS - 7 DP - DeepDyve ER -