Asim Zeb

Abstract

Motivation: Many real applications, such as business and health, generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden and non-trivial patterns in such large uncertain categorical datasets. Since the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering algorithms do not provide a suitable means of dealing with categorical data, uncertainty, and stability.

Problem statement: The ability to make decisions in the presence of vagueness and uncertainty in data can be handled using Rough Set Theory. However, recent categorical clustering techniques based on Rough Set Theory suffer from low accuracy, high computational complexity, and poor generalizability, especially on data sets where they sometimes fail, or can hardly manage, to select their best clustering attribute.

Objectives: The main objective of this research is to propose a new information-theoretic Rough Purity Approach (RPA). Another objective of this work is to handle the problems of traditional Rough Set Theory based categorical clustering techniques. The ultimate goal is to cluster uncertain categorical datasets efficiently in terms of performance, generalizability and computational complexity.

Methods: The RPA takes into consideration the information-theoretic attribute purity of categorical-valued information systems. Several extensive experiments are conducted to evaluate the efficiency of RPA using a real Supplier Base Management (SBM) dataset and six benchmark UCI datasets. The proposed RPA is also compared with several recent categorical data clustering techniques.

Results: The experimental results show that RPA outperforms the baseline algorithms.
The significant percentage improvements with respect to time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), and accuracy (12.15%), as well as the Rough Accuracy of clusters, show that RPA is suitable for practical use.

Conclusion: We conclude that, compared to other techniques, the attribute purity of categorical-valued information systems can better cluster the data. Hence, the RPA technique can be recommended for large-scale clustering in multiple domains, and enhancing its performance is a direction for further research.

1 Introduction

Advances in computation and in faster, cheaper storage and communication technologies have led to the generation and storage of very large and complex data by businesses, governmental agencies and other organizations. The collected data can be used for important business decisions such as better understanding market dynamics, customers' spending trends, operations and internal business processes. However, the size and complexity of the data render it beyond the ability of a human analyst to process for decision making. Similarly, in these processes, uncertain attribute values appear as a result of instrument faults, approximations in measurement, or even subjective assessments by experts [1]. Moreover, as much of the data is uncertain and categorical in nature, it poses a challenge to conventional data analytic approaches. As a result, there has recently been a surge of interest in methods for mining uncertain categorical data [2–5]. Discovering useful knowledge from these data sets efficiently is a serious requirement and a huge economic need. Clustering a set of objects into homogeneous groups is a fundamental operation in data mining.
Clustering methods are often used to support data-driven decision making in numerous domains such as business (e.g., market dynamics analysis) [6], healthcare (e.g., protein sequence analysis) [7–9], science (e.g., environmental data analysis) [10], information security [11], computer networks [12], image segmentation [13] and software maintenance [14, 15]. In data analytics, clustering lies at the core of successful data analysis tasks such as data summarization, classification, data reduction, filtering, exploratory data analysis and many more [14, 16–19]. A variety of cluster analysis methods for numerical data are commonly deployed by organizations; these methods are not appropriate for processing categorical datasets. The increasing proliferation of large uncertain categorical data sets poses significant challenges to contemporary clustering techniques. Recently, attention has turned to data with non-numerical, or categorical, attributes, and there has been progress in categorical data clustering [20–24]. Although these methods show advances in categorical data clustering and analysis, they are not suitable for uncertain categorical datasets and suffer from stability issues [25]. Recently, approaches based on fuzzy sets [20, 26–28] and Rough Set Theory (RST) [25, 29–33] for clustering categorical data have appeared in the literature. However, fuzzy set based methods incur heavy computational cost, as they require several runs, each with new initial values, to assess the stability of the clustering outcome. Moreover, a parameter that controls the membership fuzziness needs to be adjusted to achieve better clustering results. For dealing with categorical data and handling uncertainty, Rough Set Theory has become a well-established mechanism in a wide variety of applications, including databases.
Two types of uncertainty can be modeled inherently by Rough Set Theory [34–36]. The first type arises from the indiscernibility relation, which is imposed on the universe and partitions all values into a finite set of equivalence classes. The second type is modeled through the approximation regions of Rough Sets: elements of the upper approximation region have uncertain participation, whereas elements of the lower approximation region have total participation. Rough Set Theory (RST) is a mathematical approach to the analysis of imperfect data; it is discussed in greater detail in [12, 30]. RST is a viable system for dealing with uncertainty in the clustering of categorical data. Originally a symbolic data analysis tool, RST is now being developed for cluster analysis. RST partitions the universe and describes its subsets as equivalence classes. It also helps in decision making on uncertain data [31]. For example, symptoms form the information about patients with a certain disease: in view of their available symptoms, similar or indiscernible patients are characterized by the same symptoms. This way of generating the indiscernibility relation is the mathematical basis of Rough Set Theory. Maximum Dependency Attribute (MDA), Maximum Significance of Attribute (MSA), Information Theoretic Dependency Roughness (ITDR) and other recent rough set based techniques [31–33] outperformed their predecessors [25, 37] for clustering categorical data. However, these recent techniques suffer from low accuracy, high computational complexity and generalizability issues, especially on data sets where they sometimes fail, or can hardly manage, to select their best clustering attribute. Some of their limitations are outlined below: the MDA technique cannot perform well on data sets whose attributes have zero or equal dependency values, and the MSA technique likewise fails to select a clustering attribute on data sets whose attributes have zero or equal significance values.
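The indiscernibility relation and the two approximation regions described above can be sketched in a few lines of Python. The patient data below is a hypothetical illustration (not taken from the paper's tables), and `indiscernibility_classes` and `approximations` are names chosen here for clarity:

```python
from collections import defaultdict

# Hypothetical categorical information system: patients described by symptoms.
# (Illustrative data; attribute names and values are not from the paper.)
patients = {
    "p1": {"Headache": "yes", "Temperature": "high"},
    "p2": {"Headache": "yes", "Temperature": "high"},
    "p3": {"Headache": "no",  "Temperature": "high"},
    "p4": {"Headache": "no",  "Temperature": "normal"},
}

def indiscernibility_classes(objects, attrs):
    """Partition objects into equivalence classes: two objects are
    indiscernible when they agree on every attribute in attrs."""
    classes = defaultdict(set)
    for name, row in objects.items():
        classes[tuple(row[a] for a in attrs)].add(name)
    return list(classes.values())

def approximations(partition, X):
    """Lower approximation: union of classes fully contained in X (total
    participation); upper approximation: union of classes intersecting X
    (possibly uncertain participation)."""
    lower = {o for c in partition if c <= X for o in c}
    upper = {o for c in partition if c & X for o in c}
    return lower, upper

part = indiscernibility_classes(patients, ["Headache"])
print(sorted(sorted(c) for c in part))        # [['p1', 'p2'], ['p3', 'p4']]

X = {"p1", "p2", "p3"}                        # a target concept
lower, upper = approximations(part, X)
print(sorted(lower), sorted(upper))           # ['p1', 'p2'] ['p1', 'p2', 'p3', 'p4']
# boundary = upper - lower is non-empty, so X is rough w.r.t. Headache
```

The non-empty boundary region here ({p3, p4}) is exactly the second type of uncertainty: those objects may or may not belong to the concept X.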
ITDR techniques face issues such as random attribute selection and compromised integrity of classes due to the presence of the entropy measure. Hence, an efficient technique is needed to cluster uncertain categorical datasets in terms of accuracy, generalizability and computational complexity. In this paper, we propose a new information-theoretic Rough Purity Approach (RPA) for categorical data clustering that addresses the problems inherent in existing RST based clustering techniques. RPA utilizes rough attribute dependencies based on the purity measure [38–41] in categorical-valued information systems. The representation of uncertain information by purity has been applied across areas of databases, including data mining [39], knowledge extraction [40], cluster validation [42] and information retrieval [41]. Hence, this paper relates the concept of information-theoretic purity to Rough Sets to establish a new Rough Set metric of uncertainty: Rough Purity. A real Supplier Base Management data set and several UCI benchmark data sets are used to validate the effectiveness of the proposed approach [43]. Accuracy, Entropy, Purity, Rough Accuracy, Iterations and Time are the measures used to test the quality of the obtained clusters; validating clustering results is a non-trivial task. Accuracy is the ratio of correctly clustered objects to total objects [44]. Entropy measures the degree to which each cluster consists of objects from a single class; better clustering has smaller entropy [39, 45]. Purity measures the extent to which a cluster contains objects of a single class [39]; a better clustering result has high overall purity, and a value of 1 indicates perfect clustering. The mean roughness of the selected clustering attribute gives the Rough Accuracy; higher mean roughness implies better accuracy [31].
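The purity and entropy measures described above can be sketched as follows. This is a minimal illustration of the standard definitions; the tiny labeled clustering is invented for the example and is not from the paper's experiments:

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Overall purity: weighted fraction of the majority class in each
    cluster; 1.0 indicates perfect clustering."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[o] for o in c).most_common(1)[0][1]
               for c in clusters) / n

def entropy(clusters, labels):
    """Overall entropy: size-weighted entropy of the class distribution
    inside each cluster; lower values indicate better clustering."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[o] for o in c)
        h = -sum((k / len(c)) * math.log2(k / len(c)) for k in counts.values())
        total += (len(c) / n) * h
    return total

# Invented ground-truth labels and two candidate clusterings.
labels = {"a": "flu", "b": "flu", "c": "cold", "d": "cold"}
perfect = [["a", "b"], ["c", "d"]]
mixed = [["a", "c"], ["b", "d"]]

print(purity(perfect, labels), entropy(perfect, labels))  # 1.0 0.0
print(purity(mixed, labels), entropy(mixed, labels))      # 0.5 1.0
```

As expected, the perfect clustering attains purity 1 and entropy 0, while the fully mixed clustering scores worse on both measures.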
The computational complexity of the clustering task can be determined by the number of iterations required for finding the indiscernibility relations; it also includes finding the maximum or minimum values of dependency, significance, Rough Entropy, Rough Purity, etc. The computational complexity of any technique can also be illustrated in terms of response time. Here, the CPU response time in milliseconds is counted to examine the performance of the clustering task; a better technique in terms of response time will always consume less time. The rest of this paper is organized as follows. Section 2 gives an overview of related work in cluster analysis, Rough Set Theory, and categorical data clustering. To explore the limitations of rough categorical clustering techniques, an analysis of existing techniques on an illustrative example is presented in Section 3. Section 4 introduces the concept of the proposed information-theoretic Rough Purity measure; an illustrative example and a proposition illustrating the methodology and significance of the proposed approach are also presented. The experimental setup and data sets are described in Section 5. The experiments and the discussion of results are presented in Section 6. The summary of results and threats to validity are discussed in Sections 7 and 8, respectively. Section 9 concludes the article.

2 Related work

2.1 Cluster analysis

Clustering builds a concise, generative summary model of the data without explicit labels. The basic problem of clustering is splitting the data objects into groups of potentially similar objects; there are significant variations of this problem depending on the clustering model and data type. Clustering methods are utilized to support data-driven decision making in many domains such as software maintenance, information security, science, business and health care [29].
Application areas in which clustering is required include social network analysis, biological data analysis, multimedia data analysis, dynamic trend detection, data summarization, customer segmentation and collaborative filtering [46]. Moreover, it is also utilized as an intermediate step in other fundamental data mining problems. A wide variety of cluster analysis techniques are employed to address clustering problems [42, 47]. Commonly used clustering techniques include feature selection methods, probabilistic and generative models, distance-based algorithms, density and grid-based methods, dimensionality reduction methods, model-based methods, matrix factorization and co-clustering, and spectral methods [17]. The existing work on cluster analysis techniques is summarized in Table 1. Table 1. Summary of related work on cluster analysis. https://doi.org/10.1371/journal.pone.0265190.t001

2.2 Rough Set Theory

Uncertain categorical data is used in several areas nowadays, and classical clustering methods are unable to handle such data. Accordingly, several uncertain categorical clustering methods have received attention. Pawlak introduced Rough Set Theory (RST) in 1982 as an approach to deal with uncertainty and vagueness. RST has emerged as an essential concept for tasks such as identifying and evaluating data dependencies, reasoning about uncertain data and computing reducts of information. Moreover, it is useful for representing and analyzing uncertain, vague and imprecise knowledge, data patterns and the accessibility of consistent information [30]. The viewpoint of RST is that every object of the universe has some associated information (knowledge, data), and objects characterized by identical information are similar or indiscernible. The indiscernibility relation generated in this way is the fundamental mathematical concept of RST.
This relation somewhat resembles Leibniz's Law of Indiscernibility. Rough indiscernibility relations are developed in the context of an arbitrary set of attributes. Other data analysis tools need additional information, such as basic probability assignments in Dempster–Shafer theory, probability distributions in statistics, and grades of membership in fuzzy set theory, whereas RST has no such requirement about the data, which is an advantage. Precise concepts, in contrast to vague ones, can be characterized in terms of information about the objects. Accordingly, RST replaces any vague concept by a pair of precise concepts: its upper and lower approximations. The upper approximation of a concept includes all objects that possibly belong to it, whereas the lower approximation includes all objects that surely belong to it. The boundary region of a concept is the difference between its upper and lower approximations. Hence, instead of set membership, a boundary region is employed in RST to express vagueness [12]. The boundary region of a set is non-empty when the knowledge about the set is insufficient to describe the set precisely; therefore, a set with an empty boundary region is crisp, and otherwise it is rough. This idea of vagueness exactly resembles that proposed by Frege [61], whereas the lower and upper approximations of a set coincide with the interior and closure operations of topology [62]. Various effective RST based techniques have been developed for exploring hidden patterns and determining optimal sets in data; RST also assists in evaluating data significance and deriving decision rules from data [31]. RST has been utilized by researchers in numerous applications, as summarized in Table 2. Table 2. Summary of related work on rough set theory.
https://doi.org/10.1371/journal.pone.0265190.t002

2.3 Categorical data clustering

Classical clustering techniques are limited to numeric data; categorical data, however, is multi-valued, and similarity may be defined over identical objects, values, or both. For categorical data, tables with fields such as a patient's symptoms, names of automobile producers, or manufacturers' products are not naturally described by a metric. Therefore, clustering categorical data is more challenging, as there is no inherent distance measure. Although several valuable categorical clustering algorithms have been introduced, they are not designed to deal with uncertainty [31]. Accordingly, the clustering of categorical data in which no sharp boundary exists between clusters arises as an important problem in real-world applications. This uncertainty in categorical data clustering is handled using fuzzy sets, where clusters of categorical data are represented by fuzzy centroids [26]. The fuzzy set based algorithm and conventional algorithms were tested and compared on some categorical clustering data sets. Although better performance is obtained by the fuzzy set based algorithm, obtaining a satisfactory value for even one parameter requires multiple runs; similarly, the fuzzy membership needs to be controlled to achieve stability. Substantial contributions are offered by rough set based techniques, which handle uncertainty and cluster categorical data. The rough set based Total Roughness (TR) and Bi-clustering (BC) techniques select the best clustering attribute and handle the uncertainty issue [37]. The BC technique is limited to bi-valued attributes, whereas TR works on multi-valued attributes. Moreover, limited data, arbitrary selection and imbalanced clustering are key limitations of both techniques.
Min–Min-Roughness (MMR) is another rough set based clustering technique for categorical data, with the significant ability to let the user handle uncertainty [25]. The MMR technique outperforms K-modes, fuzzy K-modes and fuzzy centroids on the Zoo and Soybean data. It was also tested against ROCK, Squeezer, hierarchical and other algorithms on the comparatively larger Mushroom data. The stable results of the MMR technique are subject to the number of clusters given as input. The MMR clustering technique was extended as MMeR for dealing with uncertainty and with numerical and categorical features at the same time [79]. MMeR can deal with heterogeneous data by generalizing the Hamming distance; a new modified Hamming distance was accordingly developed for any two data objects. The experimental results show better performance of MMeR compared to some existing algorithms on several data sets. Certain limitations in the computational complexity and accuracy of previous techniques were resolved by an improved rough set based categorical clustering technique named Maximum Dependency Attributes (MDA) [31]. The MDA technique chooses the clustering attribute with maximum attribute dependency in the information system. The MDA technique outperforms its predecessors but itself lacks generalizability and efficiency. A Variable Precision Rough Set (VPRS) approach utilizes the mean accuracy of approximation to cluster categorical data [5]. VPRS considers noisy data and, without a predefined clustering attribute, successfully clusters some UCI data sets; furthermore, the final clusters obtained using a divide and conquer method were found comparatively better and were also visualized. The performance of MMeR in terms of data heterogeneity and uncertainty was further enhanced by the Standard Deviation Roughness (SDR) clustering algorithm [80].
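The Hamming-style idea underlying MMeR's distance can be illustrated with a basic simple-matching dissimilarity for purely categorical objects. This is a hedged sketch of the general idea only: MMeR's actual generalization in [79] also covers numerical attributes, which this minimal version does not.

```python
# Simple-matching dissimilarity: the fraction of attributes on which two
# categorical objects disagree (the classical Hamming-style measure).
def matching_distance(x, y):
    assert x.keys() == y.keys(), "objects must share the same attributes"
    return sum(x[a] != y[a] for a in x) / len(x)

# Two hypothetical patient records over three categorical attributes.
a = {"Headache": "yes", "Vomiting": "no",  "Temperature": "high"}
b = {"Headache": "yes", "Vomiting": "yes", "Temperature": "high"}
print(matching_distance(a, b))  # 0.3333... (disagreement on 1 of 3 attributes)
```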
The experimental results on certain data sets, in terms of cluster purity, show the worth of SDR compared to other techniques. Later, a Standard Deviation of Standard Deviation Roughness (SSDR) technique was introduced in this line of work [81]. SSDR can cluster uncertain numerical and categorical data at the same time and hence proved better than its predecessors SDR, MMeR and MMR. Maximum Significance of Attributes (MSA) also computes an appropriate clustering attribute, based on the RST concept of attribute significance [32]. MSA handles uncertainty and stability in the categorical clustering process; its accuracy and purity were also improved to some extent compared to the MDA, MMR, TR and BC techniques. A clustering technique known as Information-Theoretic Dependency Roughness (ITDR) for categorical data utilizes information-theoretic dependencies [33]. It introduced a new measure of uncertainty in categorical data, named information-theoretic entropy. The complexity and purity of ITDR's clustering attribute selection were better than those of SSDR, SDR, MMeR and MMR. The likelihood function and indiscernibility relation of multivariate multinomial distributions were utilized to develop a novel modified Fuzzy k-Partition method [82]. The idea was effective, as it rests on extensive theoretical analysis and still achieves lower computational complexity than the Fuzzy k-Partition and Fuzzy Centroid approaches; clustering accuracy and response time were also improved on some real and UCI data. The rough intuitionistic fuzzy K-Mode algorithm is an extension of rough fuzzy k-mode for clustering categorical data. A parameter for the intuitionistic degree in a given cluster was added, which calculates the element membership value. The efficiency of the suggested scheme was tested on some categorical data from the UCI repository, highlighting better results than the rough fuzzy k-mode algorithm.
An algorithm called Min-Mean-Mean-Roughness (MMeMeR) was introduced based on enhancements in MMeR and MMR algorithms [83]. A coherent and logical effect of considering the minimum or mean on better accuracy is also analyzed using standard UCI data. They found the objects at edge of a heterogeneous data can be clustered with certainty and are more captivating. Hence, MMeMeR technique was termed effective over existing SDR, MMeR and MMR techniques. Recently, Maximum Value Attribute (MVA) technique is suggested that efficiently cluster the uncertain categorical data [84]. A supplier’s data and several UCI data sets are considered to validate the performance of MVA technique with existing approaches. Despite of better performance, it sometimes produce singleton clusters and subject to only domain knowledge. The existing work on rough categorical data clustering is summarized in Table 3. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 3. Summary of existing work on rough categorical data clustering. https://doi.org/10.1371/journal.pone.0265190.t003 2.1 Cluster analysis Clustering is a summary and generative or concise model of the data without explicit labels. The basic issue of clustering is splitting the data objects into potential similar sets. There are significant variations in this issue depending on clustering model and data type. The clustering methods are utilized to support data-driven decision making in many domains such as software maintenance, information security, science, businesses and health care [29]. The application areas in which the clustering is required are social network analysis, biological data analysis, multimedia data analysis, dynamic trend detection, data summarization, customer segmentation and collaborative filtering [46]. Moreover, it is also utilized as intermediate step for other fundamental data mining problems. 
A wide variety of cluster analysis techniques is employed to address the clustering problems [42, 47]. The commonly used clustering techniques include Feature Selection Methods, Probabilistic and Generative Models, Distance-Based Algorithms, Density and Grid-Based Methods, Leveraging Dimensionality Reduction Methods, Model-based Methods, Matrix Factorization and Co-Clustering, Spectral Methods [17]. The existing work on cluster analysis techniques is summarized in Table 1. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. Summary of related work on cluster analysis. https://doi.org/10.1371/journal.pone.0265190.t001 2.2 Rough Set Theory The uncertain categorical data is used in several areas nowadays and the classical clustering methods are unable to handle such data. Accordingly, several uncertain categorical clustering methods got attention. Pawlak in 1982 introduces Rough Set Theory (RTS) which is an approach to deal with uncertainty and vagueness. The RST has appeared as an essential concept for dealing with different tasks like identifying and evaluating data dependency, reasoning of uncertain data and reduct of information. Moreover, it is useful for representing and analyzing the uncertain, vague and imprecise knowledge, data patterns and accessibility of consistent information [30]. In RST, the viewpoint is that every object of the universe has associated some information (knowledge, data) and the objects are similar or indiscernible characterized by the identical information. Accordingly, an indiscernibility relation is generated in this way which is the fundamental mathematical concept of RST. This relation somehow resembles with Leibniz’s Law of Indiscernibility. The rough indiscernibility relations are developed in context of an arbitrary set of attributes. 
Other data analysis tool need additional information like basic probability assignments in Dempster–Shafer theory, probability distributions in statistics and grade of membership of fuzzy set theory whereas the RST does not have any such requirement about data hence it is better. The precise concepts in contrast to vague concepts can be characterized in terms of information about the objects. Accordingly, as pair of precise concepts the RST replaces any vague concept by an upper and lower and approximation. All possibly belonged objects for each concept are included in upper approximation whereas all surely belonged objects are in lower approximation. A boundary region of any concept is the difference of upper and lower and approximation. Hence, despite of membership of a set a boundary region is employed in RST to express the vagueness [12]. The boundary region of a set is non-empty when the knowledge about set is not enough to describe the set precisely. Therefore, a set having empty boundary region is crisp otherwise it is rough. This idea of vagueness resembles exactly that is proposed by Frege [61] whereas the lower and upper and approximations of a set coincides with the interior and closure operations of topology [62]. Different effective RST based techniques were developed for exploring hidden patterns and determining optimal sets in data. Moreover, it assists in evaluating the data significance and developing the decision rules from data [31]. The RST utilized in numerous applications by researchers which is summarized in Table 2. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Summary of related work on rough set theory. https://doi.org/10.1371/journal.pone.0265190.t002 2.3 Categorical data clustering The classical techniques for clustering are limited for numeric data however, the categorical data is multi-valued and similarity may be termed as identical objects, values or both. 
In categorical type of data, the tables with fields are not naturally illustrated by a metric for example certain symptoms of a patient, names of automobiles producers and manufacturer products. Therefore, the clustering of categorical data is more challenging as there is no inherent distance measure. Though, several valuable categorical clustering algorithms are introduced but they are not designed to deal with uncertainty [31]. Accordingly, the clustering of categorical data where no sharp boundary is present between clusters rises as an important problem of the real world applications. This uncertainty in categorical data clustering is handled using fuzzy sets where the clusters of categorical data is represented with fuzzy centroids [26]. The fuzzy set based algorithm and conventional algorithms are tested and compared on some categorical clustering data sets. Though, better performance is obtained by the fuzzy set based algorithm but to get a satisfactory value for even one parameter it requires multiple runs. Similarly, to achieve stability the fuzzy membership need to be controlled. Some substantial contributions are offered by rough set based techniques which handles uncertainty and cluster categorical data. The rough set based Total roughness (TR) and Bi-clustering (BC) techniques select best clustering attribute and handle the uncertainty issue [37]. The BC technique is limited to bi-valued attributes whereas the TR works on multiple-valued attributes. Moreover, the limited data, arbitrarily selection and imbalance clustering are key limitations of both techniques. Min–Min-Roughness (MMR) is another rough set based clustering technique for categorical data having the significant ability to handle uncertainty by user itself [25]. The MMR technique outperforms against K-modes, fuzzy K-modes and fuzzy centroids on Zoo and Soybean data. 
The proposed technique is also tested against ROCK, Squeezer, hierarchical and other algorithms on comparatively larger date of Mushroom data. The stable results of MMR technique are subject to number of clusters as input. The MMR clustering technique is modified as MMeR for dealing with uncertainty, numerical and categorical features at the same time [79]. The MMeR has ability to deal with heterogeneous data by generalizing the hamming distance. A new modified hamming distance is accordingly developed for any two data objects. The experimental results show better performance of MMeR as compare to some existing algorithms on several data sets. Certain limitations related to computational complexity and accuracy of previous techniques were resolved by suggesting an improved rough set based categorical clustering technique named Maximum Dependency Attributes (MDA) [31]. The clustering attribute in information systems with maximum attribute dependency is chosen by the MDA technique. The MDA technique outperforms its predecessor approaches but itself lacks the generalizability and efficiency. A Variable Precision Rough Set (VPRS) approach utilizes the mean accuracy of approximation to cluster categorical data [5]. The VPRS consider a noisy data and without a predefined clustering attribute it successfully clusters some UCI data sets. Furthermore, the final clusters using divide and conquer method were found comparatively better and are also visualized. The performance of MMeR in terms of data heterogeneity and uncertainty algorithm was further enhanced by suggesting the Standard Deviation Roughness (SDR) clustering algorithm [80]. The experimental results on certain data sets in terms of cluster purity shows the worth of SDR as compare to other techniques. Later on, a Standard deviation of Standard Deviation Roughness (SSDR) was introduced in this sequence [81]. 
The SSDR has the capability to cluster uncertain numerical and categorical data at the same time and hence is proven better than its predecessors like SDR, MMeR and MMR. Maximum Significance of Attributes (MSA) also computes an appropriate clustering attribute based on the significance of attributes RST concept [32]. The MSA handles the uncertainty and stability for categorical clustering process. The accuracy and purity was also improved up to some extent as compare to MDA, MMR, TR and BC techniques. A clustering technique known as Information-Theoretic Dependency Roughness (ITDR) for categorical data is developed that utilizes the information-theoretic dependencies [33]. A new measure of uncertainty in categorical data was introduced named as information-theoretic entropy. The complexity and purity for the appropriate clustering attribute selection by ITDR was better against SSDR, SDR, MMeR and MMR. The likelihood function and indiscernibility relation of multivariate multinomial distributions was utilized to develop a novel modified Fuzzy k-Partition method [82]. The idea was effective as it performs extensive theoretical analysis and still achieve lower computational complexity as compare to Fuzzy k-Partition and Fuzzy Centroid approaches. The clustering accuracy and response time were also improved on some real and UCI data. The rough intuitionistic fuzzy K-Mode algorithm was an extension of rough fuzzy k-mode for clustering the categorical data. The parameter of intuitionistic degree in a given cluster was added which calculate the element membership value. The efficiency of suggested scheme as tested on some categorical data of UCI repository which highlights the better results against rough fuzzy k-mode algorithm. An algorithm called Min-Mean-Mean-Roughness (MMeMeR) was introduced based on enhancements in MMeR and MMR algorithms [83]. 
The effect of taking the minimum or the mean on accuracy is also analyzed coherently using standard UCI data. The authors found that objects at the edge of heterogeneous data can be clustered with certainty and are more appealing. Hence, MMeMeR was deemed effective over the existing SDR, MMeR and MMR techniques. Recently, the Maximum Value Attribute (MVA) technique was suggested to efficiently cluster uncertain categorical data [84]. A supplier's data set and several UCI data sets were used to validate the performance of MVA against existing approaches. Despite its better performance, it sometimes produces singleton clusters and relies solely on domain knowledge. The existing work on rough categorical data clustering is summarized in Table 3. Table 3. Summary of existing work on rough categorical data clustering. https://doi.org/10.1371/journal.pone.0265190.t003 3 An empirical analysis of existing categorical clustering techniques based on Rough Set Theory Some existing Rough Set based techniques for selecting a clustering attribute in categorical data are analyzed. A well-known technique, Maximum Dependency Attribute (MDA) [31], takes into account the Rough dependency of attributes: it selects the best clustering attribute in an information system on the basis of the highest dependency degree [87]. Hassanein and Elmelegy [32] propose an alternative Rough clustering technique known as Maximum Significance Attribute (MSA). In an information system, MSA utilizes the significance of attributes; the attribute with the highest degree of significance is chosen as the best clustering attribute.
Though MDA and MSA perform well in clustering categorical data compared with their predecessors, they sometimes fail, or barely succeed, on data sets containing the following cases: (i) independent attributes, (ii) non-significant attributes, (iii) equally dependent attributes, and (iv) equally significant attributes. To illustrate these issues, we consider the following example. Example 1 Table 4 is a modified data set showing patients with possible viral symptoms [62]. There are three conditional attributes, Headache (H), Vomiting (V) and Temperature (T), for six patients; Viral Illness is the decision attribute in Table 4. Table 4. A Viral Illness information system. https://doi.org/10.1371/journal.pone.0265190.t004 The indiscernibility relation of each attribute induces equivalence classes and, following the MDA technique, we calculate the dependency degree of the attributes. The dependency degrees for the viral data set are given in Table 5. Here, selecting the best clustering attribute is not possible because the dependency degrees are all equal to 0; accordingly, the MDA technique fails. Table 5. Dependency degree of attributes from Table 4. https://doi.org/10.1371/journal.pone.0265190.t005 In the case of the MSA technique, we compute the significance of subsets of U. The significance degrees of all attributes are presented in Table 6. In this situation, selecting the best clustering attribute with MSA is likewise not possible because all significance values are equal to 0; therefore, the MSA technique also fails. Table 6. Significance degree of attributes from Table 4. https://doi.org/10.1371/journal.pone.0265190.t006 The above example illustrates the inability of existing techniques to deal with attributes of zero or equal dependency and significance.
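The failure mode of Example 1 can be reproduced with a short sketch of the dependency-degree computation that MDA relies on. The table below is a hypothetical stand-in (the actual values of Table 4 are not reproduced here), chosen so that every pairwise dependency degree is zero, which is exactly the situation in which MDA cannot rank any attribute.

```python
from collections import defaultdict

def partition(table, attr):
    """Group object ids by their value on one attribute (indiscernibility classes)."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[row[attr]].add(obj)
    return list(blocks.values())

def dependency_degree(table, a, b):
    """gamma_a(b) = |POS_a(b)| / |U|: the fraction of objects whose a-class
    lies wholly inside some b-class (i.e. inside a lower approximation)."""
    b_blocks = partition(table, b)
    pos = set()
    for block in partition(table, a):
        if any(block <= target for target in b_blocks):
            pos |= block
    return len(pos) / len(table)

# Hypothetical stand-in for Table 4: attribute names follow the text, values are
# chosen so that no class of one attribute is contained in a class of another.
viral = {
    1: {"H": "yes", "V": "yes", "T": "high"},
    2: {"H": "yes", "V": "no",  "T": "normal"},
    3: {"H": "no",  "V": "yes", "T": "normal"},
    4: {"H": "no",  "V": "no",  "T": "high"},
}
for a in ("H", "V", "T"):
    for b in ("H", "V", "T"):
        if a != b:
            print(a, "->", b, dependency_degree(viral, a, b))  # all 0.0
```

Every dependency degree comes out 0.0, so the maximum-dependency criterion has no basis for choosing a clustering attribute; an analogous construction yields all-zero significance values for MSA.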
Another recent categorical clustering technique, ITDR, works on entropy roughness to find the clustering attribute [5, 33]. However, entropy is one type of purity measure [42]: it considers the entire class distribution of a cluster, whereas the purity measure [88] considers only the largest class. In other words, the homogeneity or heterogeneity of the cluster does not affect the entropy results [89]. The strengths and limitations of existing Rough Set based categorical clustering techniques are highlighted in Table 7. The summary of the literature review leading to the proposed research framework is presented in Fig 1, which shows how various researchers have contributed towards the main issue of clustering categorical data. Table 7. Strengths and limitations of existing Rough categorical clustering techniques. https://doi.org/10.1371/journal.pone.0265190.t007 Fig 1. Scenario leading to the proposed framework. https://doi.org/10.1371/journal.pone.0265190.g001 The analysis of existing techniques presented in Table 7 and Fig 1 motivates the development of a more comprehensive measure of uncertainty. Accordingly, a measure based on classical information-theoretic purity is derived. 4 Information-theoretic purity measure with Rough Set Theory The first and most commonly used purity measure is information gain, which is based on Shannon's entropy from information theory [40, 90]. Several variations of classical purity have been introduced depending on the type of application and the particular uncertainty measurement [39, 41, 42, 89, 91, 92]. In this work, purity is defined so that it can be applied to Rough databases. Hence, the purity of a Rough Set X is defined as below. Definition 1 In an approximation space S = (U, Y, V, ξ), let L, M ⊆ Y and L, M ≠ ϕ.
Rough Purity (RP) of attribute M on attributes L, written L⇒P M, is defined by the following equation, (1) where P(Mi|Lj) is a function from Y. Definition 2 Suppose yi ∈ Y has k different values of V(yi), say βk, k = 1, 2, …, n. Consider a subset of the attributes M(yi = βk) having the k different values of attribute yi. The max-roughness of the set M(yi = βk) with respect to yj, where i ≠ j, denoted MRP(Mi[γ]|Lj), is (2) Definition 3 MMRP(Mi|Lj) denotes the max-mean-roughness of yi ∈ Y w.r.t yj ∈ Y and is calculated as (3) where V(yi) is the set of values of attribute yi ∈ Y and i ≠ j. Definition 4 Consider a number of attributes a; the max-mean-max-roughness of yi ∈ M with respect to yj ∈ L, where i ≠ j, is the maximum of MMRP(yi|yj), denoted MMMRP(Mi|Lj), and is obtained by the following formula: (4) The Rough Purity Approach (RPA) takes into account the mean degree of Rough Purity to find the partitioning attribute. The justification is that a higher Rough Purity value implies that a more accurate partitioning attribute is selected; the maximum total roughness of each attribute decides the best crispness [37]. In general, high purity indicates a better clustering combination, and the clusters are pure subsets of the input classes when the purity value is high [93]. Definition 5 To analyze the computational complexity of the RPA technique, let there be n objects, m attributes and l values per attribute in an information system. RPA needs nm computations to find the elementary sets of all attributes. Computing the Rough Purity of all subsets of U having different values, and the maximum Rough Purity of all attributes with respect to each other, takes n²l computation steps. Finding all mean max-rough purity values adds a further n steps. Therefore, the computational complexity of RPA is the polynomial O(n²l + nm + n). The steps involved in the RPA technique are presented in Fig 2. Next, we present an illustrative example of the RPA technique.
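Before turning to the example, the selection procedure can be sketched in code. The formula for P(Mi|Lj) is inferred from the worked arithmetic of the example that follows (the fraction of an L-class covered by an M-class), and the student table is reconstructed from the partitions listed there, with value labels abbreviated; treat this as a minimal sketch of Definitions 1-3 rather than a definitive implementation.

```python
from collections import defaultdict

def value_classes(table, attr):
    """Map each value of `attr` to its equivalence class of objects."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[row[attr]].add(obj)
    return blocks

def mean_rough_purity(table, m, l):
    """Mean, over the values of attribute m, of the maximum purity
    P(m=v | l=w) = |X(m=v) & X(l=w)| / |X(l=w)| taken over the values w of l
    (as inferred from the paper's worked example)."""
    l_blocks = list(value_classes(table, l).values())
    maxima = [max(len(mset & lset) / len(lset) for lset in l_blocks)
              for mset in value_classes(table, m).values()]
    return sum(maxima) / len(maxima)

# Table 8, reconstructed from the partitions listed in Example 2 (labels shortened).
students = {
    1: {"D": "BSc", "E": "low", "S": "no",  "P": "fluent", "M": "poor"},
    2: {"D": "BSc", "E": "int", "S": "yes", "P": "poor",   "M": "fluent"},
    3: {"D": "MSc", "E": "adv", "S": "no",  "P": "poor",   "M": "poor"},
    4: {"D": "MSc", "E": "int", "S": "no",  "P": "fluent", "M": "poor"},
    5: {"D": "PhD", "E": "low", "S": "yes", "P": "poor",   "M": "fluent"},
    6: {"D": "PhD", "E": "adv", "S": "no",  "P": "poor",   "M": "fluent"},
    7: {"D": "PhD", "E": "adv", "S": "yes", "P": "fluent", "M": "poor"},
    8: {"D": "MSc", "E": "adv", "S": "yes", "P": "fluent", "M": "fluent"},
}

# Mean Rough Purity of Statistics w.r.t. Degree, matching the worked example:
print(round(mean_rough_purity(students, "S", "D"), 2))  # 0.67

# Score each candidate attribute by its mean purity over the remaining attributes;
# the maximum identifies the best clustering attribute.
attrs = ["D", "E", "S", "P", "M"]
scores = {a: sum(mean_rough_purity(students, a, b) for b in attrs if b != a)
             / (len(attrs) - 1)
          for a in attrs}
print(max(scores, key=scores.get))  # M (Mathematics)
```

On this data the sketch reproduces the example's numbers: the mean Rough Purity of Statistics with respect to Degree is 0.67, and Mathematics attains the highest overall mean purity.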
Fig 2. The RPA algorithm. https://doi.org/10.1371/journal.pone.0265190.g002 Example 2 A student's enrollment qualification information system is presented in Table 8. Degree (D), English (E), Statistics (S), Programming (P) and Mathematics (M) are five categorical attributes of eight students. The best clustering attribute needs to be selected, given that no decision attribute is pre-defined. To calculate the Rough Purity values, first the indiscernibility relation of each attribute is obtained, which induces the equivalence classes. Table 8 gives the following partitions of objects: X(D=B.Sc.)={1, 2}, X(D=M.Sc.)={3, 4, 8}, X(D=Ph.D.)={5, 6, 7}, U/D={{1, 2}, {3, 4, 8}, {5, 6, 7}} X(E=low)={1, 5}, X(E=intermediate)={2, 4}, X(E=advanced)={3, 6, 7, 8}, U/E={{1, 5}, {2, 4}, {3, 6, 7, 8}} X(S=no)={1, 3, 4, 6}, X(S=yes)={2, 5, 7, 8}, U/S={{1, 3, 4, 6}, {2, 5, 7, 8}} X(P=fluent)={1, 4, 7, 8}, X(P=poor)={2, 3, 5, 6}, U/P={{1, 4, 7, 8}, {2, 3, 5, 6}} X(M=poor)={1, 3, 4, 7}, X(M=fluent)={2, 5, 6, 8}, U/M={{1, 3, 4, 7}, {2, 5, 6, 8}} Table 8. Student's enrollment qualification information system. https://doi.org/10.1371/journal.pone.0265190.t008 The definitions above are used to find the Rough Purity of Statistics (S) w.r.t Degree (D): P(S=yes∣B.Sc.)=|{2, 5, 7, 8} ∩ {1, 2}| / |{1, 2}|=1/2=0.5 P(S=yes∣M.Sc.)=|{2, 5, 7, 8} ∩ {3, 4, 8}| / |{3, 4, 8}|=1/3=0.33 P(S=yes∣Ph.D.)=|{2, 5, 7, 8} ∩ {5, 6, 7}| / |{5, 6, 7}|=2/3=0.67 P(S=no∣B.Sc.)=|{1, 3, 4, 6} ∩ {1, 2}| / |{1, 2}|=1/2=0.5 P(S=no∣M.Sc.)=|{1, 3, 4, 6} ∩ {3, 4, 8}| / |{3, 4, 8}|=2/3=0.67 P(S=no∣Ph.D.)=|{1, 3, 4, 6} ∩ {5, 6, 7}| / |{5, 6, 7}|=1/3=0.33 The maximum roughness degree of Statistics (S) w.r.t Degree (D) is then MP(Syes)=max(0.5, 0.33, 0.67)=0.67, MP(Sno)=max(0.5, 0.67, 0.33)=0.67.
The mean Rough Purity of attribute Statistics (S) with respect to Degree (D) is MMP(S)=(MP(S=no)+MP(S=yes))/|V(S)|=(0.67+0.67)/2=0.67 Proceeding similarly, the mean Rough Purity of every attribute is computed. Table 9 summarizes the RPA calculations, which show that the Mathematics attribute has the highest mean purity value. Following the heuristic that high purity indicates better clustering combinations, Mathematics is selected as the best clustering attribute. Hence, the clusters obtained are {1, 3, 4, 7} and {2, 5, 6, 8}. Table 9. MMP roughness of Table 8. https://doi.org/10.1371/journal.pone.0265190.t009 The comparison of Rough Purity with other measures of uncertainty is given in Proposition 1. Proposition 1 Rough Purity is a more comprehensive measure of uncertainty than Rough Dependency and significance of attributes. Proof: If the attributes are not dependent on each other, then the dependency degree [31] is zero. Similarly, it can be shown that independent attributes are also non-significant, so the significance of attributes [32] is likewise zero. Irrespective of whether the attributes are dependent on, or significant for, each other, the Rough Purity measure always gives a non-zero value. In other words, Eq 2 always gives, (5) Hence, Rough Purity is a more comprehensive measure of uncertainty than Rough Dependency and significance of attributes. 5 Experimental setup and data sets description The RPA technique is implemented and validated using C#. The results are presented in the form of tables. The domain of Supplier Base Management (SBM) is used to validate the proposed RPA technique [43]. The SBM data set comprises ten attributes (shown in Table 10) describing the performance information and supplier capability of 23 Suppliers (S).
The attributes included are Quality Management Practices and systems (Qm), Documentation and Self-audit (Ds), Process/manufacturing Capability (Pc), Management of Firm (Mf), Design and Development Capabilities (Dc), Cost (C), Quality (Q), Price (P), Delivery (D), Cost Reduction Performance (Cp) and Others (O). The efficiency of each supplier is determined by applying Data Envelopment Analysis [43]; the last column of Table 10 shows the resulting conclusion on each supplier. The domains of all attributes contain continuous values because the categorical data is already normalized. Table 10. Discretized supply base management data set. https://doi.org/10.1371/journal.pone.0265190.t010 The RPA technique is also validated using six data sets taken from the UCI Machine Learning repository: Balloons (16 instances, 4 attributes), Car Evaluation (1728 instances, 6 attributes), Zoo (101 instances, 17 attributes), Chess (3196 instances, 37 attributes), Balance Scale (625 instances, 5 attributes) and Monk's Problems (432 instances, 8 attributes). RPA is tested on all these data sets and compared with the recent Rough categorical techniques MDA, MSA and ITDR on the basis of various evaluation measures: Time, Iterations, Purity, Entropy, Accuracy and Rough Accuracy. 6 Results and discussion Table 11 reports the time taken by the MDA, MSA, ITDR and RPA techniques to complete the clustering task. For the Balloons data set, the number of instances is small, so the response time is the same for all techniques. Moreover, RPA takes less time than all other techniques for the Car, Zoo and Chess data sets. Table 11. Time complexity of all techniques. https://doi.org/10.1371/journal.pone.0265190.t011 The iterative complexity depends on the number of attributes and attribute values of a data set.
It also includes steps such as finding the dependency degree of all attributes for MDA, the maximum significance of all possible combinations of attributes for MSA, the minimum Rough Entropy for ITDR and the maximum Rough Purity for RPA. Table 12 shows that RPA requires fewer iterations than the MDA and MSA techniques on all data sets. According to Table 12, although the RPA and ITDR techniques undergo almost the same iterative complexity to obtain their best clustering attribute, RPA still takes less time. The reason is that the Rough Purity formula is computationally simpler than Rough Entropy, and the effect is visible in the response time. The indiscernibility relation induced by the selected best attribute yields the resulting clusters. Table 12. Iterative complexity of all techniques. https://doi.org/10.1371/journal.pone.0265190.t012 Table 13 shows the performance of the RPA, MDA, MSA and ITDR techniques in terms of Purity, Entropy, Accuracy and Rough Accuracy. The accuracy achieved on all data sets, as presented in Table 13, shows that the proposed RPA technique outperforms the other techniques except on Balance Scale and Monk's Problems, where the accuracy is the same. Similarly, Table 13 also reports the entropy of the clusters obtained by each technique. Lower entropy indicates a better clustering technique [45], and the proposed technique shows lower entropy for all data sets except Balance Scale and Monk's Problems, where the entropy is the same; hence RPA also performs better on the entropy measure. Moreover, the purity of the clusters obtained by each technique, as presented in Table 13, shows that RPA has better purity for all data sets except Car Evaluation, Balance Scale and Monk's Problems, where all techniques produce equal purity for their best clustering attribute.
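The contrast between the purity and entropy measures reported in Table 13 (purity looks only at a cluster's largest class, entropy at the whole class distribution, as noted in Section 3) can be seen in a small sketch; the class counts below are illustrative only, not taken from the paper's data.

```python
import math

def purity(counts):
    """Fraction of the cluster occupied by its largest class."""
    return max(counts) / sum(counts)

def entropy(counts):
    """Shannon entropy (bits) of the class distribution inside the cluster."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Two clusters with the same largest class (6 of 10) but different tails:
print(purity([6, 2, 2]), round(entropy([6, 2, 2]), 3))  # 0.6  1.371
print(purity([6, 4, 0]), round(entropy([6, 4, 0]), 3))  # 0.6  0.971
```

Both clusters have identical purity, yet their entropies differ, which is why a technique can rank the same on one measure and differently on the other.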
Finally, Table 13 presents the Rough Accuracy of the techniques. The reason for low or zero Rough Accuracy values is that this measure is not a comprehensive measure of uncertainty [34]. The overall performance of RPA in terms of Rough Accuracy is still better than that of the other techniques. Table 13. Comparative performance of techniques for all data sets. https://doi.org/10.1371/journal.pone.0265190.t013 If two or more techniques select the same clustering attribute, the evaluation measures they produce are also the same. For example, in the case of the Monk's Problems data set, the MDA and MSA techniques select the same clustering attribute, hence their Accuracy, Purity and Entropy values are identical; for the same data set, ITDR and RPA choose the same attribute as best. Even when these techniques choose the same best clustering attribute, the number of iterations, the time taken and hence the complexity remain favourable for RPA as the data set size increases. 7 Summary of results This section summarizes the average percentage improvement and overall percentage improvement achieved by the RPA technique for clustering categorical data compared with MDA, MSA and ITDR. The summary shows that the RPA technique significantly improves time, iterations, purity, entropy and accuracy. Table 14 shows a slight response-time improvement by RPA over ITDR, but compared with the MSA and MDA techniques the percentage improvement is large. It is also observed in Table 15 that RPA requires almost half the iterations of ITDR, and 100% fewer iterations than the MDA and MSA techniques, to choose the best clustering attribute. Similarly, Tables 16–18 clearly show the significant improvement in several clustering evaluation measures, namely purity, entropy and accuracy, by the RPA technique against the MDA and MSA techniques.
Although the ITDR technique outperforms MDA and MSA on these measures, the performance of RPA is still reasonably improved over ITDR for clustering categorical data. Finally, Table 19 highlights the comparative overall improvement by RPA in terms of Time, Iterations, Purity, Entropy and Accuracy. It can be clearly seen that the RPA technique is not only less complex but also more efficient in selecting the best clustering attribute and clustering categorical data. Hence, the whole body of experimental results shows that the proposed RPA technique is not only simple, more generalized and fast, but also obtains better clusters with lower entropy and higher purity and accuracy. 8 Threats to validity The primary threat to validity for this study is that the tools of the existing approaches MDA, MSA and ITDR are not available, so they were re-implemented as a prototype system. This system was developed in C# for experimental purposes; however, our code for the previous approaches is strictly based on the descriptions and pseudocode available in their respective research articles. To reduce this bias, the same data sets and the same evaluation measures were used as in the existing techniques, and it was verified that all evaluation measures of the existing techniques give the same results as computed in their original work. Another threat to validity concerns the number of instances and attributes of the data sets. In this study, a real SBM data set and six benchmark data sets were chosen for the experiments. To generalize our results, it was necessary to perform experiments with data sets of various numbers of instances and attributes; accordingly, the data sets considered for experimentation were chosen from different application domains. However, this study only focused on small and medium size data sets.
Experiments on large data sets may be performed to further validate the proposed technique. Table 14. Average percentage improvement of time by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t014 Table 15. Average percentage improvement of iterations by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t015 Table 16. Average percentage improvement of purity by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t016 Table 17. Average percentage improvement of entropy by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t017 Table 18. Average percentage improvement of accuracy by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t018 Table 19. Overall percentage improvement by RPA technique. https://doi.org/10.1371/journal.pone.0265190.t019 9 Conclusion Traditional clustering techniques are not able to deal with uncertainty in a data set because they are not designed to do so, and several categorical data clustering techniques have emerged as a new trend for handling uncertainty in the clustering process. The motivation for a better Rough clustering technique arises from exposing some potential issues of recently developed Rough clustering techniques such as MDA, MSA and ITDR. These issues include data with attributes having zero or equal dependency, attributes with zero or equal significance values, and random attribute selection. The key contribution of this paper is that these limitations of existing Rough Set based clustering techniques for categorical data are handled successfully and effectively.
A Rough Set based information-theoretic approach for clustering uncertain categorical data, named the Rough Purity Approach (RPA), is hence presented. An extensive experimental analysis of the proposed RPA and existing approaches using a real supplier base management data set and UCI benchmark data sets is discussed. Significant improvement can be seen in the experimental outcomes in terms of time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), accuracy (12.15%) and the Rough Accuracy of the clusters. This improvement shows that RPA can be extended for further research in the fields of Data Mining, Artificial Intelligence, Rough Set Theory and soft computing. One limitation of this research is that only the relevant Rough Set based categorical techniques, namely MDA, MSA and ITDR, are analyzed. Although this comparison provides strong evidence of the efficiency of the proposed approach in terms of several evaluation parameters, other approaches such as fuzzy bipolar soft sets and Pythagorean fuzzy bipolar soft sets need to be compared to further analyze the RPA technique. TI - Rough set based information theoretic approach for clustering uncertain categorical data JF - PLoS ONE DO - 10.1371/journal.pone.0265190 DA - 2022-05-13 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/rough-set-based-information-theoretic-approach-for-clustering-sIGmREWrZx SP - e0265190 VL - 17 IS - 5 DP - DeepDyve ER -