Introduction

Credit scoring refers to the process of using statistics to classify credit applicants into different risk categories [1], in order to “determine the likelihood that a prospective borrower will default on a loan” [2]. The history of credit scoring is relatively short, about sixty years [3], despite the long history of credit itself, which can be traced back to 2000 BC [4]. In practice, credit scoring transforms “relevant data into numerical measures that guide credit decisions” [5]. A variety of statistical models are therefore applied in the process. A simple parametric statistical model, linear discriminant analysis (LDA), was one of the first models for credit scoring, although it has been questioned for presuming a normal distribution of the data [6]. This deficiency of LDA is largely overcome by more sophisticated models such as logistic regression, k nearest neighbor [7, 8], decision trees [8, 9], and neural networks [8, 10–13]. Notably, large financial intermediaries like American Express and Security Pacific Bank (SPB) built their credit-scoring systems on neural networks, as this model outperforms others by roughly 10% in accuracy [14]. A body of prior literature on credit-scoring techniques [15–20] is mostly based on classical statistical theories, which are less adaptive in the context of large samples. Fintech, the fusion of finance and technology [21], has recently been applied to credit scoring. However, the performance of these approaches relies on their parameters, and their application is limited by the difficulty of determining parameters in the absence of prior knowledge. Zhao and Xu [22] show that fintech approaches like the neural network perform well as long as the parameters are properly set. In other words, parameter setting determines the performance of these approaches. Notably, remarkable progress has taken place in swarm intelligence (SI) algorithms [23, 24]. Based on the logic of natural selection, swarm intelligence algorithms mimic the individual and in-group behaviors of species to seek the optimal solution. As Hurley and Adebayo [25] suggest, “all data is credit data”. This paper attempts a novel framework that combines conventional credit information in the credit-scoring industry with emerging big data technology. Notably, some state-of-the-art techniques have been proposed in recent decades to determine neural network architectures [26–28]. However, these techniques, consuming several GPU-days, are more applicable to image/audio recognition scenarios where high-dimensional, large-size datasets prevail. Credit scoring, to the contrary, is a completely different scenario. First, credit datasets are smaller with fewer dimensions, so the above-mentioned techniques are prone to overfitting. Second, high-performance computing (HPC) is inaccessible to most banking practitioners, especially small- and medium-size depository institutions. Thus, the purpose of this research is to construct a realistic framework, tailored for credit scoring, that optimizes the hyper-parameters of a neural network with swarm intelligence algorithms. This paper further benchmarks the performance of the novel framework against classical models as well as hybrid and ensemble models proposed in recent literature [29–34]. This paper answers the following questions. First, does a neural network with hyper-parameters determined by swarm intelligence algorithms outperform the classical credit-scoring models (i.e.
logistic regression, naive Bayesian, discriminant analysis, k nearest neighbor, decision tree, support vector machine, K-means, and random forest) and the state-of-the-art models proposed in recent literature [29–34]? Second, are the fitting and generalization abilities of a neural network stable after its parameters are determined by swarm intelligence algorithms? Third, does our framework perform robustly as the number of hidden layers of the neural network increases? Fourth, what is the comparative advantage of our framework against the state-of-the-art techniques for optimizing neural networks [26–28]? This paper sheds new light on the application of swarm intelligence algorithms to credit scoring. To this end, this paper proposes a novel credit-scoring framework to determine the optimal SI algorithm for hyper-parameter optimization of a neural network, and carries out an experiment to test the generalization and robustness of neural networks trained by swarm intelligence algorithms. Specifically, eight other prevalent credit-scoring models, as well as hybrid and ensemble models constructed in recent literature [29–34], make up the control group, and seven swarm intelligence algorithms are extracted from prior literature. The neural networks with parameters trained by the seven swarm intelligence algorithms form the treated group. This paper compares the performance of models in the treated and control groups to identify the appropriate model for credit scoring. The findings show that models constructed within this framework outperform models in the control group. The application of fintech in this paper implies that, despite the challenges brought by fintech [21], tech-driven services are complements rather than replacements of the traditional banking system [35, 36], and commercial banks could embrace fintech to gain new growth [35, 37]. The rest of this paper is organized as follows. Section 2 constructs a theoretical framework of the prevalent classical credit-scoring models and the typical swarm intelligence algorithms. In Section 3, a novel framework is proposed for optimizing the hyper-parameters of neural networks with swarm intelligence algorithms. Section 4 describes the data used in the empirical research, and findings are reported and analyzed in Section 5. In the last section, this paper draws conclusions and provides suggestions on model selection for credit scoring accordingly.

Theories of algorithm

Prevalent classical models for credit scoring

Credit scoring models support lenders during the decision-making process of loans. A body of credit-scoring models has developed into maturity in recent decades, including statistics-based models like logistic regression, naive Bayes, and discriminant analysis [38, 39], and machine-learning-based models like k nearest neighbor, decision tree, support vector machine, and artificial neural network [40–42]. As mentioned in the introduction, the artificial neural network is widely accepted for its outstanding accuracy and is selected as the underlying model of this paper. These models apply to different scenarios because of their distinct assumptions and instance characteristics.

Artificial neural network (ANN). The artificial neural network is an important quantitative technique in credit scoring [43], widely used in the contexts of microfinance [44], imbalanced data [45], real-time assessment [39], etc.
The neural network model has evolved into different forms to deal with credit-scoring problems, e.g. the partial logistic artificial neural network [46], the artificial metaplasticity neural network [47], and hybrid neural networks [48]. Prior experiments show that neural networks outperform a number of conventional techniques (e.g. discriminant analysis, probit analysis, logistic regression, etc.) in credit scoring [49–51]. Furthermore, neural networks trained by more sophisticated algorithms outperform those trained by ordinary gradient descent [22, 52]. Besides, the hybrid of neural network and genetic algorithm has proved an excellent classifier in credit scoring [53]. However, some issues remain in the application of neural networks to credit scoring. For example, the determination of the training-to-validation sample ratio remains controversial in the prior literature [54, 55]. Notably, like most machine learning techniques, the neural network is prone to overfitting and consequently poor generalization [56]. Thus, the aim of this paper is to improve the generalization of neural networks with swarm intelligence algorithms.

The artificial neural network (ANN) simulates the neural network of the human brain [57]. With the artificial neuron as the unit of information processing, the weight of a connection between artificial neurons indicates the intensity of that connection. The connections and the structure reflect how information is represented, transmitted, and operated on in the network. Back-propagation (BP) is the most prevalent neural network in the context of credit scoring. We apply recurrent back-propagation in this paper, in which the network is fed forward until a fixed value is achieved while the error is computed and propagated backward. A typical neural network consists of an input layer, hidden layers, and an output layer. The training procedure is as follows. First, the network is fed forward: the hidden layer accepts data from the input layer and modifies them with a non-linear transformation before output. Once the output value is generated, we measure the difference between the actual and desired output values and obtain the error value. In this stage, the error value is transferred backward from the output layer to the hidden layer and then to the input layer. At the same time, the error value is shared across layers, and the weight of every unit is adjusted accordingly. Aiming at a gradient decrease of the error value, we adjust the connections between layers (i.e. the input-hidden and hidden-output connections) and the thresholds. This training process goes on until we identify the network parameters (i.e. weight and threshold values) that minimize the error value. After this training process, when fed with an input value, the neural network automatically outputs values with minimal error after the non-linear transformation.
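To make the procedure concrete, the following is a minimal numpy sketch of the forward pass and backward weight/threshold adjustment described above, for a single-hidden-layer network with sigmoid activations. The layer sizes, learning rate, epoch count, and toy data are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 10, 8, 1
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # input-hidden weights
b1 = np.zeros(n_hidden)                             # hidden thresholds (biases)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))  # hidden-output weights
b2 = np.zeros(n_out)

X = rng.normal(size=(100, n_in))       # toy feature matrix
y = rng.integers(0, 2, size=(100, 1))  # toy default labels

lr = 0.5
for epoch in range(1000):
    # Feed forward: the hidden layer applies a non-linear transformation.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error between actual and desired output.
    err = out - y
    # Back-propagate: output-layer gradient, then hidden-layer gradient.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Adjust connections and thresholds toward a gradient decrease of the error.
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0)
```

Note that the hyper-parameters fixed by hand here (the hidden-layer size and the learning rate) are exactly the quantities the framework of Section 3 hands over to a swarm intelligence algorithm.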
Logistic regression. Logistic regression, proposed by Berkson [58], is the most widely used model in both the industry and the academy of banking, thanks to its simple architecture and low time complexity [59–61]. The conditional probability for logistic regression is given by

p(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_m x_{im})}}  (1)

p(y_i = 0 \mid x_i) = 1 - p(y_i = 1 \mid x_i)  (2)

where β0, β1, ⋯, βm are estimated with maximum likelihood estimation (MLE). To be specific, as the dependent variable yi takes the value either zero or one, writing p_i = p(y_i = 1 \mid x_i),

p(y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}  (3)

Since the instances are independent of each other, the likelihood function is given by

L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}  (4)

and the log-likelihood function is

\ln L(\beta) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]  (5)

The values of β0, β1, ⋯, βm are estimated by setting the partial derivatives with respect to each parameter to zero. However, instead of a closed-form solution, we estimate the non-linear likelihood function by iteration. As suggested in prior literature, logistic regression performs weakly on non-linear problems [62].

Naive Bayesian (NB) approach. The Naive Bayes (NB) approach is born from the classical Bayesian approach of statistics and provides theoretical justification for classifiers that do not even use the Bayesian theorem explicitly [62]. Based on a solid statistical framework, the NB model remains robust in the presence of missing values. However, the underlying assumption that all indicators are independent of each other is too strong for the real world. The Bayesian conditional probability of events y and x is given by

p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)}  (6)

i.e. dividing the probability that both events y and x take place by the probability that event x takes place gives the probability of event y on the condition that event x takes place. In the context of credit scoring, event x refers to the case that a certain character (measured by an indicator) of the instance takes a particular value. Specifically, instances of credit history are divided into several categories (e.g. two categories: yi = 1 for default; yi = 0 for non-default), and p(y) in Eq (6) refers to the frequency of each category. Then p(x | y) in Eq (6) equals the frequency of the indicator value in the subsample of each category, and p(x) equals its frequency in the full sample. Thereby, p(y | x) is measured according to Eq (6). If there is more than one indicator among the independent variables, then

p(y \mid x_1, \ldots, x_m) = \frac{p(y) \prod_{j=1}^{m} p(x_j \mid y)}{p(x_1, \ldots, x_m)}  (7)

Discriminant analysis (DA). Discriminant analysis (DA), proposed in [63], is often cited in comparison with other techniques in credit scoring [64, 65]. This approach forms classification criteria based on instances whose categories are known and predicts unknown categories according to those criteria. DA is either parametric or non-parametric. Parametric DA constructs the model under a certain assumption about the instance distribution (e.g. normal distribution). However, the model is biased because the true distribution is unobservable, so parametric DA is not widely used, and DA performs weakly when dealing with non-linear problems [62]. Instead, non-parametric DA prevails in the context of credit scoring. This approach investigates the instance distribution with non-parametric methods and constructs classification criteria accordingly. Thus, the results of non-parametric DA are more robust.

K nearest neighbor (KNN). As one of the most classical methods of data mining, k nearest neighbor (KNN) is carried out as follows. Suppose xi remains to be categorized. In the instance set whose categories are known, find the k instances most similar to (nearest to) xi, known as the k nearest neighbors of xi. According to the rule of majority voting, xi is classified into the category with the largest number of instances among its neighbors. If k = 1, then xi is classified into the same category as its nearest neighbor. Since KNN relies on comparison with a set of known values rather than estimation, this approach is efficient in terms of modelling.
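As an illustration of the majority-voting rule, a short scikit-learn sketch follows; the toy data and the choice k = 5 are placeholder assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_known = rng.normal(size=(200, 5))     # instances with known categories
y_known = rng.integers(0, 2, size=200)  # 1 = default, 0 = non-default

# k nearest neighbors with majority voting; k = 5 is an arbitrary choice here.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_known, y_known)

x_new = rng.normal(size=(1, 5))  # an instance that remains to be categorized
print(knn.predict(x_new))        # the category held by most of its 5 neighbors
```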
However, the predictive accuracy of KNN is determined by the distance measure and the cardinality k of the neighborhood [62].

Decision tree (DT). The decision tree (DT), a basic technique of ensemble learning [66], is another prevalent machine-learning-based approach to credit scoring [67]. With some new techniques introduced [68, 69], the DT approach is efficient in categorization and presents its results in an explicit, interpretable manner. However, this approach leads to biased results when the instances are time series with complex categories. Within the framework of a greedy algorithm, the DT constructs a tree-shaped structure. First, identify the optimal split value of a certain indicator and classify the instances accordingly. Then, divide each subsample by its optimal split value until a predefined stopping criterion is reached. To be specific: Step 1: take the full sample set as the root node of the tree. Step 2: test every possible split of every indicator until the optimal value is identified in the recursion. Step 3: set decision nodes using the optimal value from Step 2 and split the root node into leaf nodes. Step 4: repeat Steps 2 and 3 until every leaf node is pure enough. At the core of the DT approach, purity measures the proportion of homogeneous instances within a leaf node; i.e. a leaf node is “purer” as this proportion is higher. The optimal split is the one that improves purity the most in the recursion. The concept of entropy, shown in Eq (8), measures the uncertainty of categorization in each subsample:

\mathrm{Ent}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i  (8)

where c denotes the number of categories in the instance set D (e.g. in real-world credit scoring, c equals two, as the instance set is categorized into “default” and “non-default”), and pi denotes the ratio of the subsample size of category i to the size of the full sample set D, i.e. p_i = |D_i| / |D|. In general, a larger value of entropy indicates that more information is required for categorization.

Support vector machine (SVM). The support vector machine (SVM) has been employed as a credit-scoring technique over the past decade [19, 70–72]. Prior literature using real-world credit-scoring data from the US and Taiwan (China) indicates that support vector machines achieve accuracy comparable to that of neural networks [73]. The SVM is a non-probabilistic binary linear classifier that identifies the optimal hyperplane splitting the instances in the space with the maximum margin [74]. “Maximum margin” indicates that the distance between the subsamples and the hyperplane is maximized, and the categorization error is thereby minimized. By applying kernel functions, the SVM generalizes the classification to various scenarios. However, its performance relies on the selection of the kernel function, and it consumes a large amount of storage for computation. Three types of kernel function are used to find the optimal hyperplane: linear, polynomial, and radial basis function. The linear kernel divides instances by a plane, attempting to find the hyperplane in the original feature space. The polynomial kernel transforms the original instances into high-dimensional instances with polynomial characteristics and then divides the transformed instances with a curve. The radial basis function (RBF) kernel finds the hyperplane after mapping instances into a higher-dimensional feature space via the RBF. In most cases, the RBF outperforms the other two kernels. In this paper, all three kernel functions and their parameter sets are employed to obtain the best performance via grid search.
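A sketch of such a grid search over the three kernels, using scikit-learn, is shown below; the candidate values of C, degree, and gamma are illustrative placeholders, not the grid actually used in this paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))       # toy feature matrix
y = rng.integers(0, 2, size=300)    # toy default labels

# Search over the three kernels discussed above; the C, degree, and gamma
# values are assumed placeholders for illustration.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```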
K-means. The K-means clustering method is an unsupervised learning algorithm aimed at solving clustering problems with iterative calculation. The algorithm starts with a group of centroids: K instances randomly selected from the original dataset as the starting points of the clusters. For every instance, we calculate its distance from every centroid and assign it to the nearest cluster (i.e. the cluster whose centroid is closer to it than any other). Once an instance is assigned to a cluster, we re-select the centroid from all the instances in that cluster. This iteration goes on until either no (or minimally few) instances remain to be reassigned or no (or minimal) centroid moves. After this iterative training, once fed with an instance, the algorithm assigns it to the nearest cluster [75]. Despite its simplicity and speed, the K-means technique is limited in terms of robustness, since the clusters are determined by the initial random assignments [76].

Random forest (RF). The random forest (RF) method proposed by Breiman [77] is a supervised learning algorithm based on decision trees, which is used in credit scoring, inter alia, on imbalanced datasets [45]. The term “forest” refers to the model being built on multiple decision trees. The term “random” indicates that the training sets and the candidate features are selected randomly. Thus, this algorithm will not overfit [78], provided there are enough trees in the forest. The procedure is as follows. First, we randomly bootstrap m samples with replacement and acquire n training sets after n rounds of such bootstrapping (bagging). Second, we train a decision tree on each training set. Third, we split every decision tree using information gain or Gini importance to determine the root node. Fourth, the forest chooses the classification with the most votes (when each tree votes for a certain class), or takes the mean of the trees’ predictions as the prediction of the forest (when each tree predicts class probabilities). Combining multiple independent models, the random forest resists overfitting and noise. Furthermore, by incorporating a multitude of features, this approach excels on high-dimensional data and readily detects underlying non-linear characteristics. Besides, the random forest is renowned for its speed of implementation.

Swarm intelligence algorithm

For problems that traditional optimization, based on a single agent and a single criterion, fails to solve, swarm intelligence algorithms provide solutions by mimicking natural biological evolution and/or the social behaviour of species. The systematic and organizational principles underlying the individual and in-group behaviour of species are the core mechanism of these approaches. For example, herds and flocks cooperate in the search for food or mates. Every individual in the herd or flock learns from the experience of other members as well as its own and adjusts its search strategy accordingly.

Bat algorithm (BA). The bat algorithm [79], aimed at global optimization, mimics the echolocation of bats. It assumes that all bats sense distance with echolocation and distinguish objectives from obstacles, and that each bat flies randomly from point xi at speed vi, emitting sound of fixed frequency fmin with varying wavelength λ and volume.
According to the distance from the objective, the bat adjusts the wavelength and the pulse emission rate γ ∈ [0, 1], while the volume A0 decreases from its maximum to a fixed minimum. The optimization in the BA approach mimics the motion and food-seeking process of bats: the BA maps individual bats to feasible solutions in the space of a high-dimensional problem, the location of each bat is assessed with the fitness function of the objective, and the solution is identified through recursion.

Cuckoo search optimization (CSO). The cuckoo search optimization (CSO) algorithm [80] mimics the brood parasitism of cuckoos. Moreover, instead of a random walk, the search process of the CS algorithm mimics the discrete exploration of a Lévy flight, composed of a series of straight motions and abrupt 90-degree turns. The algorithm assumes that every cuckoo lays one egg and randomly incubates it in a host nest; that the best nest, with a high-quality egg, is passed on to the next generation; and that the number of available host nests is fixed, while the host detects a cuckoo’s egg with probability p ∈ (0, 1). Once it detects the invading egg, the host either destroys it or abandons the nest. The CS approach is widely used in social science thanks to its efficient optimization with few parameters.

Firefly algorithm (FA). The firefly algorithm [81] mimics the behavior of fireflies, which flash for food and mates. With its explicit logic, the FA converges quickly to the global optimum. The algorithm takes the brightness of a firefly as the objective value and assumes that the attractiveness of a firefly is positively related to its brightness (i.e. the less bright one moves towards the brighter one); that brightness is negatively related to the distance between two fireflies; and that a firefly moves randomly if it detects no firefly brighter than itself.

Gravitational search (GS) algorithm. The gravitational search algorithm [82] treats search agents as masses obeying the Newtonian laws of gravitation and motion. Law of gravitation: agents attract each other, and the gravity between two agents is directly proportional to the product of their masses and inversely proportional to the distance between them. Law of motion: the velocity of each agent stays constant unless an external force acts upon it, and the change of velocity equals the force divided by the mass of the agent. Agents thus move in accordance with these two laws until the optimal position is reached.

Gray wolf optimization (GWO) algorithm. The gray wolf optimization algorithm mimics the hunting mechanism of gray wolves in nature [83]. Wolves keep a strict social hierarchy, represented by the division of a pack into four levels, each with different authority and responsibility. Gray wolves hunt in three steps: tracking, chasing, and approaching the prey; pursuing, encircling, and harassing the prey until it stops moving; and attacking the prey.

Particle swarm optimization (PSO) algorithm. The particle swarm optimization (PSO) algorithm [84] is inspired by the predation of birds. Using massless particles moving in the solution space, this algorithm mimics birds seeking food. Every particle finds the optimal solution on its own, known as its local extremum, and shares the solution with all other particles. The global optimum is the best local extremum. Comparing its local extremum with the global optimum, every particle adjusts the direction and velocity of its motion. As a comparatively simple algorithm, the PSO approach is efficient in seeking the optimum and applicable to problems with real values.
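To preview how a swarm intelligence algorithm can tune a neural network, the following is a simplified, self-contained sketch in the spirit of the framework proposed later, using PSO (the simplest of the seven) to search over two hyper-parameters of a BP network. The swarm size, inertia and acceleration coefficients, search ranges, and toy data are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))    # toy credit features
y = rng.integers(0, 2, size=300)  # toy default labels

def fitness(pos):
    """Cross-validated accuracy of a BP network with candidate hyper-parameters."""
    hidden, lr = int(round(pos[0])), 10 ** pos[1]
    net = MLPClassifier(hidden_layer_sizes=(hidden,), learning_rate_init=lr,
                        max_iter=300, random_state=0)
    return cross_val_score(net, X, y, cv=3).mean()

# Search space: hidden units in [2, 30], log10(learning rate) in [-4, -1] (assumed).
lo, hi = np.array([2.0, -4.0]), np.array([30.0, -1.0])
n_particles, n_iter = 8, 10
pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()]

w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (assumed)
for _ in range(n_iter):
    r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
    # Each particle is pulled toward its local extremum and the global optimum.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()]

print("best hyper-parameters:", gbest, "accuracy:", pbest_fit.max())
```

Presumably any of the seven SI algorithms could wrap the same fitness function; only the position-update rule changes from one algorithm to another.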
Social spider algorithm (SSA). The social spider algorithm [85] mimics the colony behavior of social spiders. The SS algorithm assumes that the search space is a communal web on which individual spiders interact with each other; each solution within the search space represents a spider’s position on the communal web, and each social spider is weighted according to the fitness value of its solution. The spider colony is a highly female-biased population with a predefined proportion, and every social spider is assigned a set of cooperative behaviors according to its gender, known as evolutionary operators.

Whale swarm algorithm (WSA). The whale swarm algorithm [86] mimics the predation of humpback whales, which seek food in a cooperative manner. The algorithm iterates as follows. All whales communicate with each other by ultrasound in the search space, and each whale has a certain computing ability to calculate its distance to the others. The quality and quantity of food found are associated with the fitness of a whale’s objective value, and each whale follows the nearest whale with better fitness.
A novel framework: SI algorithm based BP-ANN credit-scoring model

In this section, we propose our novel framework based on swarm intelligence algorithms and the BP-ANN model.
The flowchart in Fig 1 plots the three steps of constructing the framework: pre-processing, training, and test. In the first step, imputation, normalization, and re-ordering are conducted to make the datasets suitable for further modelling. In the second step, we optimize the hyper-parameters of the BP-ANN with swarm intelligence algorithms to find the optimal model and the corresponding SI algorithm that suits the specific scenario and dataset. In the third step, we apply the BP-ANN model with the optimal parameters to evaluate the credit of new samples.

Fig 1. Flowchart of the framework. The figure plots the framework of this paper in three steps: pre-processing, training, and test. https://doi.org/10.1371/journal.pone.0234254.g001

Step 1: Pre-processing

Imputation. Missing values are prevalent in real-world credit-scoring datasets, partly owing to insufficient information collection in the business process or the carelessness of business staff and data managers. First, we identify the label (good or bad, default or not default) of each sample containing missing values (samples with missing labels are dropped). Second, we select all the other samples with the same label whose values for the same attributes are not missing and assign them to a reference group. Third, each missing value is replaced by the corresponding attribute of a sample randomly selected from the reference group. This imputation is comparatively efficient, for the time it consumes is proportional to the sample size of the dataset. Besides, for any specific attribute, the values with higher frequency (indicating a strong connection with samples of the same label) are more likely to be selected in imputation, which helps to predict the label in the modelling stage.

Normalization. Although some models remain robust to data scaling, distance-based models (e.g. KNN) depend heavily on attribute standardization. Moreover, since all models are evaluated on identically scaled data, normalization makes the comparison between different models possible. We employ the most commonly used scaling method shown in Eq (9):

$$x' = \frac{x - \min(X)}{\max(X) - \min(X)} \tag{9}$$

where X represents the vector of a specific attribute; x is the value of the attribute for one sample; and x′ is the scaled value of the same attribute for that sample.

Re-ordering. The order of samples matters. In some cases, the order of samples with different labels could affect the performance of sequential learning models. Furthermore, the sample imbalance problem might occur during the k-fold modelling process: if the good (or bad) samples make up a larger proportion in one fold, the performance is biased across folds. It is easier for models to identify the pattern of samples with the dominant label, which causes unstable performance. Before modelling, we re-order samples in accordance with the following procedure (a sketch appears after Step 3). First, samples are divided into two cohorts based on the binary label, denoted as "majority" and "minority" (depending on the sample proportion). Second, we count the number of samples in each cohort and calculate the ratio of majority to minority; for example, the ratio for a dataset with 100 bad samples (minority) and 300 good samples (majority) is three. The ratio is rounded to an integer. Third, we conduct sampling without replacement from the two cohorts according to the proportion of each cohort in the population.
Consider a context where the ratio is three: three good samples are selected first, followed by one bad sample; then another three good samples followed by one bad sample; and so on. Finally, the samples left in the two cohorts after this sampling are appended to the end of the sequence. Hence, all samples are re-assigned to a predetermined sequence in which the two cohorts are distributed more evenly.

Step 2: Training

We need to determine the optimal SI algorithm before using it to find the optimal parameter set for the BP-ANN model. However, no "grail algorithm" exists, so we search for the optimal SI algorithm for each heterogeneous dataset. First, we construct a pool of alternative algorithms containing several typical and widely used SI algorithms. For each SI algorithm in the pool, we construct a comparable scenario by setting the same key hyper-parameters, including the number of individuals and the number of iterations. Next, we set the feasible parameter space of the BP-ANN model according to prior literature, including the size of the hidden layer, the learning rate, the maximum iteration limit, and the error tolerance. Specifically, we set only the upper and lower boundary of each parameter and let the SI algorithms search for the optimal parameters within this space. Finally, we apply the SI algorithms in the pool one by one to optimize the parameter set of the BP-ANN model (whose performance is sensitive to its parameters) and identify the optimal SI algorithm as the one yielding the highest area under the curve (AUC) (see the next section for details). The BP-ANN model optimized by the optimal SI algorithm is the core of our framework; a sketch of this search loop appears below.

Step 3: Test

In this step, we apply the BP-ANN model whose parameters were optimized by the optimal SI algorithm to another real-world dataset. We employ the data pre-processed in Step 1 and test the BP-ANN model with the hyper-parameters determined in Step 2.
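To make Step 1 concrete, the sketch below implements the min-max scaling of Eq (9) and the majority/minority interleaving described above. The function names are illustrative assumptions; the paper does not publish code, and the routine assumes an imbalanced binary label.

```python
import numpy as np

def min_max_scale(X):
    """Eq (9): scale each attribute (column) of X into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def interleave(X, y, seed=0):
    """Re-order samples so that majority and minority labels alternate at
    roughly the rounded majority:minority ratio; leftovers go to the end."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)   # assumes two labels
    maj = rng.permutation(np.flatnonzero(y == labels[counts.argmax()]))
    mino = rng.permutation(np.flatnonzero(y == labels[counts.argmin()]))
    ratio = max(1, round(len(maj) / len(mino)))
    order, m, n = [], 0, 0
    while m < len(maj) and n < len(mino):
        order.extend(maj[m:m + ratio]); m += ratio       # `ratio` majority samples
        order.append(mino[n]); n += 1                    # then one minority sample
    order.extend(maj[m:]); order.extend(mino[n:])        # remaining samples
    return X[order], y[order]
```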
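A minimal sketch of the Step 2 search loop follows, reusing the `pso` helper sketched earlier and scikit-learn's MLPClassifier as a stand-in for the BP-ANN (the paper's own implementation is not published). The bounds below are placeholders rather than the values in Table 3, and each candidate's four dimensions map to the hidden-layer size, learning rate, maximum iterations, and error tolerance.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

# Placeholder bounds for: hidden-layer size, learning rate,
# max iterations, and error tolerance (cf. Table 3).
LOWER = np.array([5.0, 1e-4, 100.0, 1e-6])
UPPER = np.array([50.0, 1e-1, 1000.0, 1e-2])

def neg_val_auc(params, X_tr, y_tr, X_va, y_va):
    """Train one candidate network and return the negative validation AUC,
    so an SI algorithm written as a minimizer effectively maximizes AUC."""
    hidden, lr, max_iter, tol = params
    net = MLPClassifier(hidden_layer_sizes=(int(hidden),),
                        learning_rate_init=lr, max_iter=int(max_iter),
                        tol=tol, random_state=0)
    net.fit(X_tr, y_tr)
    return -roc_auc_score(y_va, net.predict_proba(X_va)[:, 1])

# For each SI algorithm in the pool (PSO shown; 10 individuals, 20
# iterations), search the space and keep the algorithm whose best
# solution yields the highest AUC:
# best_params, neg_auc = pso(
#     lambda p: neg_val_auc(p, X_train, y_train, X_val, y_val),
#     lower=LOWER, upper=UPPER, n_particles=10, n_iter=20)
```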
Methodology

This paper carries out an experiment to test whether the BP-ANN model trained by swarm intelligence algorithms (the treated group) outperforms prevalent classical models (the control group) as well as several typical hybrid or ensemble models constructed in recent literature [29–34] within the context of credit scoring. First, the experiment investigates the performance of the seven swarm intelligence algorithms (see Section “Swarm intelligence algorithm”) as optimizers of the BP-ANN model across different datasets. Second, we compare the performance of the trained BP-ANN models with those in the control group, in order to identify the best in terms of generalization and robustness. Third, we analyze how the number of hidden layers affects the performance of the trained BP-ANN models. Last, the time complexity of the trained BP-ANN models is analyzed.
Data

The instances of this paper are extracted from four public UCI (University of California, Irvine) datasets and one credit-scoring dataset, HELOC: first, the German Credit Dataset (the German dataset, https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29); second, the Australian Credit Approval Dataset (the Australian dataset, https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29); third, the Japanese Credit Dataset (the Japanese dataset, https://archive.ics.uci.edu/ml/datasets/Credit+Approval); fourth, the Default of Credit Card Clients Dataset from Taiwan (the Taiwan dataset, https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients); and fifth, the Home Equity Line of Credit Dataset from the U.S. (the HELOC dataset, https://community.fico.com/s/explainable-machine-learning-challenge). The reasons for selecting these datasets are as follows. First, because datasets from commercial banks are unavailable [87], public datasets are widely used in prior literature and results on the same dataset are comparable. Second, the German dataset (over 500 thousand page views), the Australian dataset (over 155 thousand page views), the Japanese dataset (over 393 thousand page views), and the Taiwan dataset (over 350 thousand page views) are the most widely used UCI public credit-scoring datasets, with which a large body of prior literature studies the performance of various models [22, 88]. Third, we focus on the diversity of datasets. On one hand, there are significant differences in the number of samples and attribute dimensions. On the other hand, the five datasets are extracted from five different financial markets (Germany, Australia, Japan, Taiwan, and the US). Thus, we test not only the performance of each algorithm with different sample sizes and dimensions but also its feasibility in real-world settings with different financial risks. A summary of the datasets is presented in Table 1.

Table 1. Summary of datasets. https://doi.org/10.1371/journal.pone.0234254.t001

Model evaluation

In line with the evaluation techniques proposed by recent literature [79], we select eight indicators (i.e. AUC, accuracy, precision (pos), precision (neg), sensitivity, specificity, Brier score, and G-mean); the confusion matrix is presented in Table 2.

Table 2. Confusion matrix. https://doi.org/10.1371/journal.pone.0234254.t002

Based on the confusion matrix in Table 2, we construct five evaluation indicators as shown in Eqs (10) to (14):

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

$$\text{precision (pos)} = \frac{TP}{TP + FN} \tag{11}$$

$$\text{precision (neg)} = \frac{TN}{TN + FP} \tag{12}$$

$$\text{sensitivity} = \frac{TP}{TP + FP} \tag{13}$$

$$\text{specificity} = \frac{TN}{TN + FN} \tag{14}$$

The indicator accuracy in Eq (10) measures the ratio of correctly identified samples over the entire sample. The indicator precision (pos) in Eq (11) measures the ratio of correctly identified positive samples over all actual positive samples, which evaluates how the model performs when classifying positive samples. Similarly, the indicator precision (neg) in Eq (12) measures the ratio of correctly identified negative samples over all actual negative samples, which evaluates how the model performs when classifying negative samples. The indicator sensitivity in Eq (13) measures the ratio of correctly predicted positive samples over all samples predicted as positive, which evaluates how precisely the model predicts positive samples.
Similarly, the indicator specificity in Eq (14) measures the ratio of correctly predicted negative samples over all samples predicted as negative, which evaluates how precisely the model predicts negative samples. To evaluate the comprehensive performance of each model, we employ the AUC (see details below); using only one indicator, this method simultaneously measures the classification ability over the entire sample and the balance of the classified samples. The AUC builds on the confusion matrix (see Table 2) and is computed as follows. All of the model's predicted positive (default) probabilities form the sequence P. Each probability in P serves in turn as the classification threshold; i.e. an instance is classified as default if its predicted positive probability is larger than the threshold. We then have the false positive rate and true positive rate as shown in Eqs (15) and (16):

$$FPR = \frac{FP}{FP + TN} \tag{15}$$

$$TPR = \frac{TP}{TP + FN} \tag{16}$$

Thus, we have two sequences: the sequence of false positive rates and the sequence of true positive rates. With the false positive rates on the x-axis and the true positive rates on the y-axis, we draw the receiver operating characteristic (ROC) curve (see Fig 2); the area under the ROC curve (AUC) is positively related to the performance of the classification model.

Fig 2. Area under the ROC curve (AUC). The figure plots the ROC curve, where the horizontal axis denotes the sequence of false positive rates and the vertical axis denotes the sequence of true positive rates. The area under the ROC curve indicates the performance of the model as a classifier. https://doi.org/10.1371/journal.pone.0234254.g002

In the real world, datasets are usually imbalanced (e.g., good samples make up a greater proportion than bad samples), so three principled evaluation metrics (i.e. the Brier score, G-mean, and H-measure) have been introduced. The H-measure requires a predetermined distribution of misclassification costs [89] and is less prevalent in recent evaluation. Thus, we use the other two metrics (i.e. the Brier score and G-mean) to evaluate the performance of each model on the imbalanced datasets. As shown in Eq (17), the Brier score measures the mean squared error between predicted and true values:

$$\text{Brier score} = \frac{1}{N}\sum_{i=1}^{N}(p_i - t_i)^2 \tag{17}$$

where pi and ti are the predicted value and true value, respectively, and N is the sample size. With the results of Eqs (13) and (14), the G-mean is measured as shown in Eq (18):

$$\text{G-mean} = \sqrt{\text{sensitivity} \times \text{specificity}} \tag{18}$$

Settings

First, this paper trains the parameters of the BP-ANN model with the seven swarm intelligence algorithms described in Section “Swarm intelligence algorithm”. Considering the trade-off between time complexity and accuracy, each swarm algorithm contains 10 individuals and iterates 20 times. The neural network initially contains one input layer, one hidden layer, and one output layer; at a later stage, we increase the number of hidden layers in the BP-ANN. The parameters of the neural network are trained in the parameter space presented in Table 3.

Table 3. Parameters of neural network. https://doi.org/10.1371/journal.pone.0234254.t003

The procedure for optimizing the BP-ANN model with a swarm intelligence algorithm is as follows.
Step 1: Ten four-dimensional feasible solutions are randomly selected from the solution space (the four dimensions are the number of neurons in the hidden layer, the learning rate, the maximum number of iterations in the network, and the maximum fault tolerance; the value range of each dimension is shown in Table 3). Training the BP-ANN model with the parameter set of each feasible solution, we obtain 10 models with different parameters. After evaluation, we take the best performance as the “present global optimal value” of the given swarm intelligence algorithm and the corresponding feasible solution as the “present global optimal solution”.

Step 2: Based on the optimization mechanism and principles of the given swarm intelligence algorithm, we set out from each present feasible solution and explore a “novel feasible solution”. If the novel feasible solution is better than the present feasible solution, we replace the present feasible solution with the novel one and compare it with the “present global optimal solution”. If the novel one is better, we replace the “present global optimal solution” with it and take its performance as the “present global optimal value”.

Step 3: If the stopping condition of the given swarm intelligence algorithm is not met, we repeat Step 2. Otherwise, we stop the optimization, take the “present global optimal solution” as the final solution, and set the parameters of the BP-ANN accordingly.

In line with prior literature, this paper applies 5-fold cross-validation. To be specific, all the instances from a dataset are divided into five pairs of training-test sets. For each pair, the training set is used to estimate parameters and construct the model; we then examine the generalization of the model on the test set, in order to decide whether it fits new instances isolated from the training set. The process runs five times to ensure that the model is robust. All models introduced in Section “Prevalent classical models for credit scoring” (i.e. logistic regression, the NB approach, DA, KNN, DT, SVM, k-means, and RF) are enrolled in the control group. Within the context of the same public datasets, we also enroll several typical hybrid or ensemble models proposed in recent literature [29–34] in the control group. When evaluating the performance of different models, we report the values of the eight indicators in Section “Model evaluation” while focusing on the AUC (see Section “Model evaluation” for the detailed calculation). As mentioned above, the AUC not only reflects the overall precision of the model but also indicates how the model performs when classifying a certain category of instances. Models in Section “Prevalent classical models for credit scoring” are constructed with the built-in packages of Matlab 2017a, with the “Optimize Hyperparameters” option set to “all”. As the command indicates, a range of parameter sets, including the kernel functions of the SVM model, is applied to each baseline model. Notably, we carry out all the experiments in this study on a PC with a 3.4 GHz Intel Core i5-7500 CPU and 8 GB RAM, running Microsoft Windows 10.
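For concreteness, the sketch below computes the indicators of Eqs (10)–(18) from true labels and predicted default probabilities, following this paper's verbal definitions (its precision (pos)/(neg) are taken over the actual positives/negatives and its sensitivity/specificity over the predicted ones); the function name and the 0.5 classification threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, p_pred, threshold=0.5):
    """Indicators of Eqs (10)-(18) for binary labels (1 = positive/default)."""
    y_true = np.asarray(y_true)
    p_pred = np.asarray(p_pred)
    y_pred = (p_pred > threshold).astype(int)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    sens = tp / (tp + fp)                      # Eq (13): over predicted positives
    spec = tn / (tn + fn)                      # Eq (14): over predicted negatives
    return {
        "AUC": roc_auc_score(y_true, p_pred),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),         # Eq (10)
        "precision_pos": tp / (tp + fn),                      # Eq (11)
        "precision_neg": tn / (tn + fp),                      # Eq (12)
        "sensitivity": sens,
        "specificity": spec,
        "brier": float(np.mean((p_pred - y_true) ** 2)),      # Eq (17)
        "g_mean": float(np.sqrt(sens * spec)),                # Eq (18)
    }

# Usage within the 5-fold cross-validation described above (sketch):
# from sklearn.model_selection import StratifiedKFold
# for tr, te in StratifiedKFold(5).split(X, y):
#     net.fit(X[tr], y[tr])
#     print(evaluate(y[te], net.predict_proba(X[te])[:, 1]))
```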
Findings

First, we employ the seven swarm intelligence algorithms to train the BP-ANN and report the performance of the trained models in the first subsection, “Optimization”. The performance of the control group is then reported and compared in the second subsection, “Control group”.
We also present the performance of our model as the number of hidden layers of the BP-ANN increases, followed by an analysis of computational complexity (i.e. runtime). Last, by comparison with the control group, we show how our framework balances accuracy and efficiency.

Optimization

Table 4 reports the performance of BP-ANN models on the five datasets described in Section “Data”, with parameters trained by the different swarm intelligence algorithms. The values of the eight indicators in Section “Model evaluation” are reported, with the optimal performance presented in bold.

Table 4. Performance of BP-ANN trained by different swarm intelligence algorithms. https://doi.org/10.1371/journal.pone.0234254.t004

As shown in Table 4, we first compare model performance on the German dataset. In the training phase, the CSO-BP-ANN model scores the highest in terms of all indicators except precision (pos), which indicates that the CSO-BP-ANN performs best when distinguishing samples with opposite labels from each other. In addition, the WSA-BP-ANN scores the highest precision (pos) (0.5674), indicating that this model performs best when identifying positive samples. The performance across models differs within a comparatively limited range in terms of the AUC (approx. 0.08) and accuracy (approx. 0.06). Besides, the overall performance of the PSO-BP-ANN and the GS-BP-ANN is strong in terms of all indicators, while the performance of the BA-BP-ANN and the GS-BP-ANN is comparatively weak. In the testing phase, the PSO-BP-ANN shows the best generalization in terms of overall classification and prediction with imbalanced datasets; to be specific, it scores 0.8004 for the AUC, 0.7660 for accuracy, and 0.2340 for the Brier score. Furthermore, the WSA-BP-ANN scores the highest in terms of precision (pos), specificity, and G-mean, while the BA-BP-ANN, which performs moderately during the training phase, scores the highest in terms of precision (neg) and sensitivity. Within the context of the Australian dataset, in the training phase, the CSO-BP-ANN scores the highest in terms of all indicators except precision (pos) and specificity.
Similar to the German dataset, the WSA-BP-ANN scores the highest in terms of precision (pos) and specificity. The performance of the other models varies within a limited range in terms of the AUC (approx. 0.02) and accuracy (approx. 0.02), indicating that BP-ANNs trained by these SI algorithms perform strongly when identifying the classifiable attributes of the training set. In the testing phase, the GWO-BP-ANN and the BS-BP-ANN perform best in terms of the AUC (0.9373) and accuracy (0.8638), respectively. Meanwhile, the GS-BP-ANN performs best in terms of the Brier score and G-mean, indicating the strongest balance of generalization among the models. Besides, the BA-BP-ANN and the CSO-BP-ANN present strong performance in identifying the positive and negative samples, respectively.

Within the context of the Japanese dataset, in the training phase, the CSO-BP-ANN, which performed best in the previous context, does not score the highest on any indicator. The PSO-BP-ANN scores the highest AUC (0.9455), while the BA-BP-ANN performs best in terms of accuracy (0.8736) as well as specificity, the Brier score, and G-mean. In addition, the GS-BP-ANN and the GWO-BP-ANN present strong performance when identifying the positive and negative samples. In the testing phase, the GWO-BP-ANN presents strong overall performance and balance of prediction by scoring the best in terms of the AUC, accuracy, Brier score, and G-mean; despite its moderate performance in the training phase, it also scores the best in terms of precision (neg), sensitivity, and specificity.

Within the context of the Taiwan dataset, in the training phase, the BA-BP-ANN performs best in terms of all indicators except precision (pos), indicating excellent fitting of the training set. In terms of overall identification (i.e. the AUC and accuracy), the difference in performance between the BA-BP-ANN and the other models remains around 0.04 (for the AUC) and 0.05 (for accuracy); in terms of balance of identification (i.e. the Brier score and G-mean), the difference remains around 0.05. In other words, performance remains comparatively stable across the different models. In the testing phase, the BA-BP-ANN still performs best in terms of all indicators except precision (pos) and specificity. The difference in performance between the BA-BP-ANN and the other models remains around 0.04 (for the AUC) and 0.05 (for accuracy) in terms of overall identification, and 0.05 (for the Brier score) and 0.04 (for G-mean) in terms of balanced identification. Thus, all the models perform robustly through the two phases.

Finally, we compare performance within the context of the HELOC dataset. In the training phase, the PSO-BP-ANN performs best in terms of overall and balanced fitting, with the best scores in the AUC, accuracy, Brier score, and G-mean. Meanwhile, the BA-BP-ANN and the WSA-BP-ANN perform better when identifying samples with certain attributes. The performance of the models varies within an average range of 0.0274. In the testing phase, the PSO-BP-ANN, the BA-BP-ANN, and the WSA-BP-ANN retain their highest scores from the training phase, which indicates robust performance. The difference in performance across models remains within an average range of 0.0282, moderately greater than the range during the training phase but still limited.

A further comparison of performance across the five datasets is conducted as follows. First, consider volatility across models.
With the five datasets, the average range of scores measures 0.1008 (German), 0.0203 (Australian), 0.0091 (Japanese), 0.0687 (Taiwan), and 0.0274 (HELOC) during the training phase, and 0.0608 (German), 0.0195 (Australian), 0.0145 (Japanese), 0.0656 (Taiwan), and 0.0282 (HELOC) during the testing phase. Generally, the range of scores is limited, which indicates robustness across BP-ANN models trained by different SI algorithms. Second, consider stability within each model. The performance of the models during the testing phase is slightly weaker than during the training phase; comparing the scores of the two phases, the difference measures no more than 0.01 for most of the indicators. In other words, the BP-ANN models trained by SI algorithms are robust across the training and testing phases and are thereby useful in the real world, where practitioners select models based on training performance. Third, consider the optimal SI algorithm. Unfortunately, we see no evidence that one model performs best on all five datasets. Instead, the characteristics of each dataset affect how the SI algorithm optimizes the model: the BP-ANN models trained by the CSO, GS, GWO, BA, PSO, and WSA present the best performance in terms of different indicators in different contexts. That is why we propose the selection of the SI algorithm and the modelling framework in the section “Methodology”. In the real world, lacking prior knowledge of a new context, we have to search for the optimal SI algorithm rather than determine it in advance; the search is feasible because of the stability within each model.

Control group

In this section, we compare the performance of the BP-ANN trained by SI algorithms with models in the control group. The control group includes the classical credit-scoring models mentioned in Section “Prevalent classical models for credit scoring” (i.e. logistic regression, the NB approach, DA, KNN, DT, linear and polynomial SVM, SVM-RBF, k-means, and RF) and several hybrid or ensemble models constructed in recent literature [29–34]. Table 5 reports how the classical models perform on the five datasets (see Section “Data” for details). The values of the evaluation metrics (see Section “Model evaluation” for details) are reported, with the optimal performance (i.e. the lowest Brier score and the highest value for the other indicators) presented in bold. Notably, as a lazy learning model, the KNN has no training phase.

Table 5. Performance of classical models in control group. https://doi.org/10.1371/journal.pone.0234254.t005

As shown in Table 5, we first focus on the performance of the classical models on the German dataset. During the training phase, the RF model outperforms all the other competing models in terms of all indicators. For most indicators, the evaluation is “nearly perfect” (i.e. with a value very close to 1 or to 0), suggesting a possibility of overfitting in the training phase. Besides, some competing models (e.g. logistic regression, the NB approach, DA, and SVM) also perform well in terms of the AUC, accuracy, Brier score, and G-mean. Due to the greater proportion of majority samples, the value of precision (neg) is always greater than that of precision (pos), and specificity is greater than sensitivity.
On the other hand, the k-means model performs the worst in terms of all indicators, indicating that an unsupervised model might not suit the context of credit scoring. In the testing phase, the DA model performs best in terms of the AUC, precision (neg), sensitivity, and Brier score. The logistic regression model also performs well, scoring the highest in terms of accuracy and G-mean, while the KNN model performs best in terms of precision (pos) and specificity. Apart from the k-means model, the RF model performs the worst during the testing phase, which indicates that its “nearly perfect” performance during the training phase is a sign of overfitting.

For the Australian dataset, in the training phase, the RF model still performs best among all the competing models in terms of all indicators except precision (pos) and specificity. The other models also score high in terms of the AUC and accuracy. Furthermore, the logistic regression, DT, and DA models present a balanced ability to distinguish the majority and minority classes: their scores for sensitivity and specificity are quite close, and the gap between precision (pos) and precision (neg) remains less than 0.07. For the other models, the identification is imbalanced; i.e. they are better at identifying a certain group of samples. In the testing phase, the DA model performs best in terms of accuracy, precision (neg), sensitivity, and G-mean, while the logistic model scores the highest AUC and the lowest Brier score. At the same time, the KNN model scores the highest precision (pos) (0.8373) and the NB model the highest specificity (0.901). Besides, the SVM and RF models also perform well, with accuracy greater than 0.85. On the other hand, the k-means model performs the worst, with most indicators below 0.7.

For the Japanese dataset, in the training phase, the RF and k-means models perform best and worst, respectively, in terms of all indicators, and the RF model still achieves the best performance in terms of accuracy (0.8594) and precision (pos) (0.8287) in the testing phase. Among the other models, the logistic, DA, and SVM models score high AUC (greater than 0.92) and accuracy (greater than 0.86) in the training phase and score the optimal values in terms of the AUC, precision (pos), specificity, Brier score, and G-mean in the testing phase. Still, the k-means model performs the worst, with an AUC of 0.6409.

For the Taiwan dataset, the RF model performs best in terms of all indicators during the training phase but only in terms of accuracy during the testing phase. For the other competing models, the AUC remains below 0.74 in the training phase, declining from the values on the small-size datasets; this implies that identifying the structural characteristics of the sample becomes more difficult as the sample size grows. As for overall prediction during the testing phase, the KNN model scores the highest AUC (0.7213) while the RF model scores the highest accuracy (0.6783). For the other competing models, the AUC and accuracy fluctuate around 0.7 and 0.67, respectively, with the k-means model scoring the lowest. As for the identification of the minority class, the SVM scores the highest precision (pos) (0.7553), followed by the DT model with a precision (pos) of 0.7415. Meanwhile, the NB model performs the strongest when identifying the bad samples, with the highest sensitivity (0.8453).
Besides, the NB and SVM models perform best in predicting and identifying the majority class, respectively. For the HELOC dataset, the RF model performs best in all aspects during the training phase but in none during the testing phase. Specifically, in the training phase, the AUC and accuracy remain above 0.7 for all models except the k-means. As for the balance between the two groups of samples, the gap between precision (pos) and precision (neg) remains less than 0.01 for the DT, DA, and logistic models; for these three models, the Brier score remains below 0.2 and the G-mean above 0.71, indicating better balance. In the testing phase, the SVM model performs best in terms of the AUC, accuracy, precision (neg), sensitivity, and G-mean, suggesting excellent generalization. Besides, the NB model performs best in terms of precision (pos) and specificity, while the logistic regression outperforms the others in terms of the Brier score. In addition, across most of these models, the values of the indicators vary within a relatively limited range, which suggests strong performance on a balanced dataset.

To sum up, the comparison across classical models on the different datasets suggests the following propositions. First, in terms of comprehensive performance, the logistic regression and DA models achieve better performance on the small-size datasets (i.e. the German, Australian, and Japanese datasets) during the testing phase, while the SVM and NB perform better on the large-size datasets (i.e. the Taiwan and HELOC datasets). Second, in terms of minority-class identification (testing phase), the KNN and SVM models each achieve the best precision (pos) twice, while the NB model performs best only on the HELOC dataset; the DA and NB models each achieve the best sensitivity twice, while the SVM model performs best only on the HELOC dataset. In other words, the SVM and NB perform better when predicting the minority class. Third, in terms of balance (testing phase), the logistic regression achieves the best Brier score three times, while the DA and DT models achieve it once each; the DA and SVM models each achieve the best G-mean twice, while the logistic regression achieves it once. By comparison, the DA and logistic models present better balance in identifying the two groups of samples. Fourth, in terms of robustness (i.e. the performance difference between the training and testing phases), the performance during the testing phase is generally weaker than during the training phase. Models grounded in classical statistics (e.g. the DT and logistic regression) are more robust than models based on newer machine learning theories (e.g. the SVM and RF). Besides, after hyper-parameter optimization, the performance difference between the two phases is comparatively small for most baseline models.

Table 6 reports the performance of several state-of-the-art hybrid or ensemble credit-scoring models. These models are constructed in recent literature [29–34] and applied to three prevalent datasets (i.e. the German, Australian, and Japanese datasets). We list the values of the evaluation metrics as reported in the source literature and present the optimal performance in bold.
Table 6. Performance of hybrid or ensemble models in control group. https://doi.org/10.1371/journal.pone.0234254.t006

As shown in Table 6, performance varies across the different models within the context of the three datasets. On the German dataset, the AGHE presents the best overall identification, with the optimal values for accuracy, AUC, and AUC-H; meanwhile, the XGBoost-TPE and the NRS+LWV score best in terms of the Brier score and G-mean, respectively. On the Australian dataset, the multi-stage hybrid model scores best in terms of AUC and AUC-H, while the DGCEC scores the highest accuracy; again, the XGBoost-TPE and the NRS+LWV score best in terms of the Brier score and G-mean, respectively, as they do on the German dataset. Three models are included in the comparison on the Japanese dataset: the AGHE and the ConsA score the best AUC and Brier score, respectively, while the multi-stage hybrid model and the AGHE score best in terms of AUC-H and accuracy, respectively. To sum up, across the three datasets, the AGHE scores best in terms of the six overall-performance metrics (i.e. the AUC, accuracy, precision (pos), precision (neg), sensitivity, and specificity) and presents a strong ability of overall prediction, while the XGBoost-TPE and the NRS+LWV present a good balance between identifying the two groups of samples. Compared with the classical models in Table 5, the hybrid models in Table 6 perform better in terms of both overall prediction and balanced identification.

BP-ANN model with increasing hidden layers

The number of hidden layers partly determines the performance of BP-ANN models. However, if we set the number of hidden layers as one of the parameters to optimize, the swarm intelligence algorithms will not work, because the dimension of the solution space becomes ambiguous. Instead, in this subsection we compare the performance of BP-ANN models with different numbers of hidden layers and thereby analyze the robustness of model performance. We rerun the experiment described in the Methodology section with two and three hidden layers (the number of neurons in the added hidden layers is in line with the settings in Table 3) and present the performance in Table 7.

Table 7. Performance of BP-ANN models with increasing hidden layers. https://doi.org/10.1371/journal.pone.0234254.t007

Comparing the results in Table 7 with those in Table 4, we notice that, despite fluctuating scores on the eight indicators, the fitting ability in training as well as the generalization in testing improve as the number of hidden layers increases.
To be specific, while the number of hidden layers goes from one to two and to three, for the German dataset the optimal AUC moves from 0.8859 to 0.8989 and to 0.8927 in the training phase and from 0.8004 to 0.8001 and to 0.8047 in the testing phase; for the Australian dataset, it moves from 0.9514 to 0.9526 and to 0.9524 in the training phase and from 0.9373 to 0.9382 and to 0.9380 in the testing phase; for the Japanese dataset, it moves from 0.9455 to 0.9439 and to 0.9432 in the training phase and from 0.9354 to 0.9340 and to 0.9333 in the testing phase; for the Taiwan dataset, it moves from 0.7596 to 0.767 and to 0.7695 in the training phase and from 0.7403 to 0.7969 and to 0.7985; for the HELOC dataset, it moves from 0.7936 to 0.767 and to 0.7695 in the training phase and from 0.7898 to 0.7909 and to 0.7909 in the testing phase. The trend of AUC indicates that BP-ANN with more hidden layers outperform those with less hidden layers and such outperformance is evident when comparing the scores of other indicators. We conduct further comparison with models in the control group (see Tables 5 and 6) and find that BP-ANN performs better with increased number of hidden layers. In the context of the German dataset, the model of Zhang, He [32] and that of Xu, Zhang [33] in the control group outperform our BP-ANN model with two hidden layers. However, in the context of other datasets, our optimized BP-ANN model with increased hidden layers outperforms any model in the control group. Therefore, we propose that the fitness and generalization of BP-ANN models improve with the number of hidden layers increasing. Notably, the model performs comparatively stably (with fluctuation within an acceptable extent) while the number of hidden layers increasing, which indicates greater robustness of BP-ANN trained by SI algorithms. In addition, increasing hidden layers would not lead to overfitting. Thus, we recommend that users train BP-ANN models with different number of hidden layer in order to find out the optimal setting for the certain context. Time complexity When selecting the applicable model to score credit, we take the complexity as well as accuracy into consideration. Table 8 reports the time complexity of BP-ANN models trained by different swarm intelligence algorithms. Each algorithm contains 10 individuals and iterates 20 times on the same computer. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 8. Time complexity of BP-ANN models trained by swarm intelligence algorithms. https://doi.org/10.1371/journal.pone.0234254.t008 As shown in Table 8, the time complexity varies across different datasets with uniform parameter set. Within the context of small-size dataset (e.g. the German, Australian and Japanese dataset), the PSO and the WSA optimize the most efficiently (i.e. runs within 50 sec) while the CSO and the FA optimize with least efficiency (more than 1 min). For the other SI algorithms, the optimization consumes around one minute. When the dataset is small sized, the time complexity varies within one minute. However, within the large-size datasets, the time complexity increases while the size of dataset grow. Within the Taiwan dataset, the WSA consumes least time (133.581 sec) followed by the GS (2 min) and the GWO (3 min) while the CSO consumes the most (nearly 7 min). Within the HELOC dataset, despite longer running time because of the increased sample size, the WSA consumes still the least time (approx. 
3.5 min) followed by the GS (267.296 sec) and the GWO (340.226 sec) while the FA and the PSO consumes the most (approx. 11 min). In other word, the WSA, GS, and GWO consume less time and perform more robustly with large-size datasets. Analytic comparison This section is intended to answer whether our BP-ANN model trained by SI algorithms outperforms the classical and state-of-the-art models for credit scoring. First, the overall prediction. Within the German dataset, the PSO-BP-ANN (with the AUC of 0.8004) outperforms most models in the control group, albeit slightly weaker than that of [30, 32, 33]. Within the other four datasets, our model outperforms all the models in the control group. Specifically, the optimal AUC measures 0.9370 for the control group but 0.9373 for our model within the Australian dataset; 0.9330 for the control group but 0.9354 for our model within the Japanese dataset; 0.7213 for the control group but 0.7403 for our model within the Taiwan dataset; and 0.7851 for the control group but 0.7898 for our model within the HELOC dataset. In addition, our model presents best performance in terms of the mean value. Second, balanced prediction during testing phase. Unlike the BP-ANN whose output is the predicted probability of credit default, models with output in the form of labels (e.g. the DT, RF, and SVM) present outstanding performance in binary classification. With a uniform threshold of classification (0.5), these models outperform ours in terms of balance metrics (i.e. accuracy, Brier score, and G-mean). Nevertheless, our model performs better than the numerical regression models in the control group. Third, robustness of prediction. We focus on the value range of evaluation metrics for each model. Some state-of-the-art models are excluded from this comparison, for the evaluation metrics are missing during training phase. Within the German dataset, the average range for control group is 0.4313 (training) and 0.185 (test), while the average range for our models is 0.1008 (training) and 0.0608 (test). Within the Australian dataset, the average range for control group is 0.4089 (training) and 0.2552 (test), while the average range for our models is 0.0203 (training) and 0.0195 (test). Within the Japanese dataset, the average range for control group is 0.4242 (training) and 0.5148 (test), while the average range for our models is 0.0091 (training) and 0.0145 (test). Within the Taiwan dataset, the average range for control group is 0.5148 (training) and 0.2323 (test), while the average range for our models is 0.0687 (training) and 0.0656 (test). Within the HELOC dataset, the average range for control group is 0.3675 (training) and 0.1400 (test), while the average range for our models is 0.0274 (training) and 0.0282 (test). Thus, with less variant prediction, our model performs with increasing robustness across datasets. Fourth, time complexity. Baseline models in the control group conduct grid search to determine the hyper-parameter set. Consequently, their time complexity is several times of ours. As is proposed in prior literature [26–28], some state-of-the-art techniques to determine neural network architecture requires several GPU-days. However, with small-size datasets, the hyper-parameters of BP-ANN are determined by SI algorithms within one minute (i.e. 59.703 sec, 46.878 sec, and 45.025 sec for the German, Australian, and Japanese dataset, respectively). With large-size dataset, the process completes within 12 min (i.e. 
Optimization

Table 4 reports the performance of the BP-ANN models on the five datasets described in the section “Data”, with hyper-parameters trained by the different swarm intelligence algorithms. The values of the eight indicators defined in the section “Model evaluation” are reported, with the optimal performance presented in bold.

Table 4. Performance of BP-ANN trained by different swarm intelligence algorithms. https://doi.org/10.1371/journal.pone.0234254.t004

As shown in Table 4, we first compare the model performance on the German dataset. In the training phase, the CSO-BP-ANN scores the highest in terms of all indicators except precision (pos), indicating that it performs best at distinguishing samples with opposite labels. In addition, the WSA-BP-ANN scores the highest precision (pos) (0.5674), indicating that it performs best at identifying the positive samples. The performance across models varies within a comparatively limited range in terms of the AUC (approx. 0.08) and accuracy (approx. 0.06). Besides, the overall performance of the PSO-BP-ANN and the GS-BP-ANN is strong in terms of all indicators, while that of the BA-BP-ANN is comparatively weak.
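For illustration, the optimization loop behind Table 4 can be sketched as follows. This is a minimal sketch assuming a plain PSO over a two-dimensional hyper-parameter space (learning rate and hidden-layer width) with validation AUC as the fitness function; the bounds, the PSO coefficients, and the scikit-learn MLPClassifier standing in for the BP-ANN are illustrative assumptions, not the exact implementation used in this paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

def fitness(params, X_tr, y_tr, X_va, y_va):
    """Train a candidate BP-ANN and return its validation AUC."""
    lr, neurons = params
    net = MLPClassifier(hidden_layer_sizes=(int(round(neurons)),),
                        learning_rate_init=float(lr),
                        max_iter=300, random_state=0)
    net.fit(X_tr, y_tr)
    return roc_auc_score(y_va, net.predict_proba(X_va)[:, 1])

def pso_search(X_tr, y_tr, X_va, y_va, n_particles=10, n_iter=20,
               lo=(1e-4, 2.0), hi=(1e-1, 50.0),
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Plain PSO over (learning rate, hidden-layer width), maximizing AUC."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)

    def evaluate(P):
        return np.array([fitness(p, X_tr, y_tr, X_va, y_va) for p in P])

    pbest, pbest_fit = pos.copy(), evaluate(pos)
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard velocity update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        fit = evaluate(pos)
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest, pbest_fit.max()
```

With 10 particles and 20 iterations, the budget of this sketch matches the setting used in the time-complexity analysis below: 10 initial evaluations plus 200 update evaluations, i.e. 210 network trainings per run.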
In the testing phase, the PSO-BP-ANN shows the best generalization in terms of overall classification and prediction on this imbalanced dataset. Specifically, the PSO-BP-ANN scores 0.8004 for the AUC, 0.7660 for accuracy, and 0.2340 for the Brier score. Furthermore, the WSA-BP-ANN scores the highest in terms of precision (pos), specificity, and G-mean, while the BA-BP-ANN, which performs moderately during the training phase, scores the highest in terms of precision (neg) and sensitivity.

Within the context of the Australian dataset, in the training phase, the CSO-BP-ANN scores the highest in terms of all indicators except precision (pos) and specificity. As with the German dataset, the WSA-BP-ANN scores the highest in terms of precision (pos) and specificity. The performance of the other models varies within a limited range in terms of the AUC (approx. 0.02) and accuracy (approx. 0.02), indicating that the BP-ANN trained by these SI algorithms performs strongly when identifying the classifiable attributes of the training set. In the testing phase, the GWO-BP-ANN and the BA-BP-ANN perform best in terms of the AUC (0.9373) and accuracy (0.8638), respectively. Meanwhile, the GS-BP-ANN performs best in terms of the Brier score and G-mean, indicating the strongest balance of generalization among the models. Besides, the BA-BP-ANN and the CSO-BP-ANN present strong performance in identifying the positive and negative samples, respectively.

Within the context of the Japanese dataset, in the training phase, the CSO-BP-ANN, which performed best in the previous contexts, scores the highest on no indicator. The PSO-BP-ANN scores the highest AUC (0.9455), while the BA-BP-ANN performs best in terms of accuracy (0.8736) as well as specificity, Brier score and G-mean. In addition, the GS-BP-ANN and the GWO-BP-ANN present strong performance when identifying the positive and negative samples, respectively. In the testing phase, the GWO-BP-ANN, despite its moderate performance in the training phase, presents strong overall and balanced prediction, scoring best in terms of the AUC, accuracy, precision (neg), sensitivity, specificity, Brier score, and G-mean.

Within the context of the Taiwan dataset, in the training phase, the BA-BP-ANN performs best in terms of all indicators except precision (pos), indicating excellent fitting of the training set. In terms of overall identification (i.e. the AUC and accuracy), the performance gap between the BA-BP-ANN and the other models remains around 0.04 (AUC) and 0.05 (accuracy); in terms of balanced identification (i.e. the Brier score and G-mean), the gap remains around 0.05. In other words, performance remains comparatively stable across the different models. In the testing phase, the BA-BP-ANN still performs best in terms of all indicators except precision (pos) and specificity. The gap between the BA-BP-ANN and the other models remains around 0.04 (AUC) and 0.05 (accuracy) for overall identification, and 0.05 (Brier score) and 0.04 (G-mean) for balanced identification. Thus, all the models perform robustly through the two phases.

Finally, we compare the performance within the context of the HELOC dataset. In the training phase, the PSO-BP-ANN performs best in terms of overall and balanced fitting, with the best scores in the AUC, accuracy, Brier score, and G-mean.
Meanwhile, the BA-BP-ANN and the WSA-BP-ANN perform better when identifying samples with certain attributes. The performance of the models varies within an average range of 0.0274. In the testing phase, the PSO-BP-ANN, the BA-BP-ANN, and the WSA-BP-ANN retain the highest scores they achieved in the training phase, which indicates robust performance. The difference in performance across models remains within an average range of 0.0282, moderately greater than during the training phase but still limited.

A further comparison of performance across the five datasets is conducted as follows. First, volatility across models. Over the five datasets, the average range of scores measures 0.1008 (German), 0.0203 (Australian), 0.0091 (Japanese), 0.0687 (Taiwan), and 0.0274 (HELOC) during the training phase, and 0.0608 (German), 0.0195 (Australian), 0.0145 (Japanese), 0.0656 (Taiwan), and 0.0282 (HELOC) during the testing phase. Generally, the range of scores is limited, which indicates robustness across the BP-ANN models trained by different SI algorithms. Second, stability within each model. The performance of the models during the testing phase is slightly weaker than that during the training phase; comparing the scores of the two phases, the difference measures no more than 0.01 for most of the indicators. In other words, the BP-ANN models trained by SI algorithms are robust across the training and testing phases and are thereby useful in the real world, where practitioners select models based on training performance. Third, the optimal SI algorithm. Unfortunately, we see no evidence that a single model performs best on all five datasets. Instead, the characteristics of each dataset affect how the SI algorithm optimizes the model: the BP-ANN models trained by the CSO, GS, GWO, BA, PSO, and WSA each present the best performance on different indicators in different contexts. That is why we propose the selection of the SI algorithm within the modelling framework in the section “Methodology”. In the real world, facing a new context without prior knowledge, we have to search for the optimal SI algorithm rather than pick it in advance; such a search is feasible precisely because of the within-model stability.
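The comparisons above, and those that follow for the control group, rest on eight indicators. For reference, a minimal sketch of their computation, assuming the standard definitions with the default (“bad”) class coded as positive; the paper’s exact conventions may differ slightly:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score

def score_model(y_true, p_hat, threshold=0.5):
    """Eight indicators from true labels and predicted probabilities.
    No zero-division guards, for brevity."""
    y_pred = (np.asarray(p_hat) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall on the positive ("bad") class
    specificity = tn / (tn + fp)   # recall on the negative ("good") class
    return {
        "AUC": roc_auc_score(y_true, p_hat),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision (pos)": tp / (tp + fp),
        "precision (neg)": tn / (tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "Brier score": brier_score_loss(y_true, p_hat),
        "G-mean": float(np.sqrt(sensitivity * specificity)),
    }
```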
Control group

In this section, we compare the performance of the BP-ANN trained by SI algorithms with the models in the control group. The control group includes the classical credit-scoring models mentioned in the section “Prevalent classical models for credit scoring” (i.e. logistic regression, NB, DA, KNN, DT, linear and polynomial SVM, SVM-RBF, k-means, and RF) and several hybrid or ensemble models constructed in recent literature [29–34].

Table 5 reports how the classical models perform within the context of the five datasets (see the section “Data” for details). The values of the evaluation metrics (see the section “Model evaluation” for details) are reported, with the optimal performance (i.e. the lowest value for the Brier score and the highest value for the other indicators) presented in bold. Notably, as a lazy learning model, the KNN has no training phase.

Table 5. Performance of classical models in control group. https://doi.org/10.1371/journal.pone.0234254.t005

As shown in Table 5, we first focus on the performance of the classical models within the context of the German dataset. During the training phase, the RF model outperforms all the other competing models in terms of all indicators. For most indicators, its evaluation is “nearly perfect” (i.e. very close to 1 or to 0), suggesting possible overfitting in the training phase. Some competing models (e.g. the logistic regression, NB, DA, and SVM) also perform well in terms of the AUC, accuracy, Brier score, and G-mean. Due to the greater proportion of majority samples, the value of precision (neg) is always greater than that of precision (pos), and specificity is greater than sensitivity. On the other hand, the k-means model performs the worst in terms of all indicators, indicating that an unsupervised clustering model may not suit the context of credit scoring. In the testing phase, the DA model performs best in terms of the AUC, precision (neg), sensitivity and Brier score. The logistic regression model also performs well, scoring the highest accuracy and G-mean, while the KNN model performs best in terms of precision (pos) and specificity. Apart from the k-means model, the RF model performs the worst during the testing phase, which indicates that its “nearly perfect” training performance is indeed a sign of overfitting.

For the Australian dataset, in the training phase, the RF model again performs best among all the competing models in terms of all indicators except precision (pos) and specificity. The other models also score high in terms of the AUC and accuracy. Furthermore, the logistic regression, DT, and DA present a balanced ability to distinguish the majority and minority classes.
Specifically, their scores for sensitivity and specificity are quite close, and the gap between precision (pos) and precision (neg) remains less than 0.07. For the other models, the identification is imbalanced; that is, they are better at identifying one particular group of samples. In the testing phase, the DA model performs best in terms of accuracy, precision (neg), sensitivity and G-mean, while the logistic model scores the highest AUC and the lowest Brier score. At the same time, the KNN model scores the highest precision (pos) (0.8373) and the NB model the highest specificity (0.901). Besides, the SVM and RF models also perform well, with accuracy greater than 0.85. On the other hand, the k-means model performs the worst, with most indicators below 0.7.

For the Japanese dataset, in the training phase, the RF model again performs best, achieving the highest accuracy (0.8594) and precision (pos) (0.8287). Among the other models, the logistic regression, DA, and SVM score a high AUC (greater than 0.92) and accuracy (greater than 0.86) in the training phase and, between them, take the optimal values of the AUC, precision (pos), specificity, Brier score and G-mean in the testing phase. Still, the k-means model performs the worst, with an AUC of 0.6409.

For the Taiwan dataset, the RF model performs best in terms of all indicators during the training phase but only in terms of accuracy during the testing phase. For the other competing models, the AUC remains below 0.74 in the training phase, lower than with the small-size datasets, which implies that identifying the structural characteristics of the sample becomes more difficult as the sample size grows. As for overall prediction during the testing phase, the KNN model scores the highest AUC (0.7213), while the RF model scores the highest accuracy (0.6783); for the other competing models, the AUC and accuracy fluctuate around 0.7 and 0.67 respectively, with the k-means model scoring the lowest. As for the identification of the minority class, the SVM predicts it most precisely, with the highest precision (pos) (0.7553), followed by the DT model (0.7415). Meanwhile, the NB model is the strongest at identifying the bad samples, with the highest sensitivity (0.8453). Besides, the NB and SVM models perform best in predicting and identifying the majority class, respectively.

For the HELOC dataset, the RF model performs best in all aspects during the training phase but in no aspect during the testing phase. Specifically, in the training phase, the AUC and accuracy remain above 0.7 for all models except the k-means. As for the balance between the two groups of samples, the gap between precision (pos) and precision (neg) remains less than 0.01 for the DT, DA and logistic models; for these three models, the Brier score remains below 0.2 and the G-mean above 0.71, indicating better balance. In the testing phase, the SVM model performs best in terms of the AUC, accuracy, precision (neg), sensitivity and G-mean, suggesting excellent generalization. Besides, the NB model performs best in terms of precision (pos) and specificity, while the logistic regression outperforms the others in terms of the Brier score. In addition, across most of these models, the indicators vary within a relatively limited range, which suggests strong performance on a balanced dataset.
To sum up, the comparison of the classical models across the different datasets suggests the following propositions. First, in terms of comprehensive performance, the logistic regression and the DA model achieve better performance on the small-size datasets (i.e. the German, Australian and Japanese datasets) during the testing phase, while the SVM and NB perform better on the large-size datasets (i.e. the Taiwan and HELOC datasets). Second, in terms of minority-class identification (testing phase), the KNN and SVM models each achieve the best precision (pos) twice, while the NB model performs best only on the HELOC dataset; likewise, the DA and NB models each achieve the best sensitivity twice, while the SVM model performs best only on the HELOC dataset. In other words, the SVM and NB perform better when predicting the minority class. Third, in terms of balance (testing phase), the logistic regression achieves the best Brier score three times, while the DA and DT models achieve it once each; the DA and SVM models each achieve the best G-mean twice, while the logistic regression achieves it once. By comparison, the DA and the logistic models present better balance in identifying the two groups of samples. Fourth, in terms of robustness (i.e. the performance difference between the training and testing phases), the performance during the testing phase is generally weaker than that during the training phase. Models grounded in classical statistics (e.g. the DT and the logistic regression) are more robust than models based on newer machine-learning theories (e.g. the SVM and RF). Besides, after hyper-parameter optimization, the performance difference between the two phases is comparatively small for most baseline models.

Table 6 reports the performance of several state-of-the-art hybrid or ensemble credit-scoring models. These models are constructed in recent literature [29–34] and applied in the context of three prevalent datasets (i.e. the German, Australian, and Japanese datasets). We list the values of the evaluation metrics as reported in the source literature and present the optimal performance in bold.

Table 6. Performance of hybrid or ensemble models in control group. https://doi.org/10.1371/journal.pone.0234254.t006

As shown in Table 6, the performance varies across the different models within the three datasets. Within the German dataset, the AGHE presents the best overall identification, with the optimal values for accuracy, AUC, and AUC-H, while the XGBoost-TPE and the NRS+LWV score best in terms of the Brier score and G-mean, respectively. Within the Australian dataset, the multi-stage hybrid model scores best in terms of the AUC and AUC-H, while the DGCEC scores the highest accuracy; as within the German dataset, the XGBoost-TPE and the NRS+LWV score best in terms of the Brier score and G-mean, respectively. Three models are included in the comparison within the Japanese dataset: the AGHE and the ConsA score the best AUC and Brier score, while the multi-stage hybrid model and the AGHE score best in terms of AUC-H and accuracy, respectively.
To sum up, across the three datasets (i.e. the German, Australian, and Japanese datasets), the AGHE scores best in terms of the six metrics for overall performance (i.e. the AUC, accuracy, precision (pos), precision (neg), sensitivity, and specificity) and presents strong overall predictive ability. On the other hand, the XGBoost-TPE and the NRS+LWV present a good balance between the identification of the two groups of samples. Compared with the classical models in Table 5, the hybrid models in Table 6 perform better in terms of both overall prediction and balanced identification.

BP-ANN model with increasing hidden layers

The number of hidden layers partly determines the performance of BP-ANN models. However, the number of hidden layers cannot itself be included among the parameters to optimize: the swarm intelligence algorithms require a solution space of fixed dimension, and varying the depth changes the dimension of the hyper-parameter vector. Instead, in this subsection we compare the performance of BP-ANN models with different numbers of hidden layers and thereby analyze the robustness of model performance. We rerun the experiment described in the section “Methodology” with two and three hidden layers (the number of neurons in the added hidden layers follows the setting in Table 3) and present the performance in Table 7.

Table 7. Performance of BP-ANN models with increasing hidden layers. https://doi.org/10.1371/journal.pone.0234254.t007

Comparing the results in Table 7 with the single-hidden-layer results in Table 4, we notice that, despite fluctuating scores on the eight indicators, both the fitting in the training phase and the generalization in the testing phase improve as the number of hidden layers increases. To be specific, as the number of hidden layers goes from one to two to three, for the German dataset the optimal AUC moves from 0.8859 to 0.8989 to 0.8927 in the training phase and from 0.8004 to 0.8001 to 0.8047 in the testing phase; for the Australian dataset, from 0.9514 to 0.9526 to 0.9524 in training and from 0.9373 to 0.9382 to 0.9380 in testing; for the Japanese dataset, from 0.9455 to 0.9439 to 0.9432 in training and from 0.9354 to 0.9340 to 0.9333 in testing; for the Taiwan dataset, from 0.7596 to 0.767 to 0.7695 in training and from 0.7403 to 0.7969 to 0.7985 in testing; and for the HELOC dataset, from 0.7936 to 0.767 to 0.7695 in training and from 0.7898 to 0.7909 to 0.7909 in testing. The trend of the AUC indicates that a BP-ANN with more hidden layers outperforms those with fewer hidden layers, and this advantage is also evident in the scores of the other indicators. A further comparison with the models in the control group (see Tables 5 and 6) confirms that the BP-ANN performs better with an increased number of hidden layers. In the context of the German dataset, the models of Zhang, He [32] and Xu, Zhang [33] in the control group outperform our BP-ANN model with two hidden layers; however, in the context of the other datasets, our optimized BP-ANN model with increased hidden layers outperforms every model in the control group. Therefore, we propose that the fitting and generalization of BP-ANN models improve as the number of hidden layers increases. Notably, performance remains comparatively stable (fluctuating within an acceptable range) as the number of hidden layers increases, which indicates the robustness of BP-ANN trained by SI algorithms. In addition, increasing the number of hidden layers did not lead to overfitting in our experiments. Thus, we recommend that users train BP-ANN models with different numbers of hidden layers to find the optimal setting for a given context, as the sketch below illustrates.
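Because each swarm run requires a search space of fixed dimension, the depth sweep happens outside the optimizer: one search per layer count, each with a position vector of fixed length. A minimal sketch, where pso_search_depth is a hypothetical variant of the earlier PSO sketch whose position vector is (learning rate, n1, ..., nd) for a fixed depth d:

```python
import numpy as np

def bounds_for_depth(d, lr_lo=1e-4, lr_hi=1e-1, n_lo=2, n_hi=50):
    """Fixed-length bounds for a depth-d network: (learning rate, n1, ..., nd)."""
    lo = np.array([lr_lo] + [n_lo] * d, dtype=float)
    hi = np.array([lr_hi] + [n_hi] * d, dtype=float)
    return lo, hi

# One fixed-dimension search per depth, mirroring the sweep behind Table 7:
# for d in (1, 2, 3):
#     lo, hi = bounds_for_depth(d)
#     best, auc = pso_search_depth(X_tr, y_tr, X_va, y_va, lo=lo, hi=hi)
```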
Time complexity

When selecting an applicable model for credit scoring, we take complexity as well as accuracy into consideration. Table 8 reports the running time of the BP-ANN models trained by the different swarm intelligence algorithms. Each algorithm contains 10 individuals and iterates 20 times on the same computer.

Table 8. Time complexity of BP-ANN models trained by swarm intelligence algorithms. https://doi.org/10.1371/journal.pone.0234254.t008

As shown in Table 8, the running time varies across datasets under a uniform parameter set. Within the small-size datasets (i.e. the German, Australian and Japanese datasets), the PSO and the WSA optimize most efficiently (running within 50 sec), while the CSO and the FA optimize least efficiently (more than 1 min); for the other SI algorithms, the optimization takes around one minute. When the dataset is small, the running time thus varies within about one minute. Within the large-size datasets, however, the running time increases as the size of the dataset grows. On the Taiwan dataset, the WSA takes the least time (133.581 sec), followed by the GS (about 2 min) and the GWO (about 3 min), while the CSO takes the most (nearly 7 min). On the HELOC dataset, despite longer running times because of the increased sample size, the WSA still takes the least time (approx. 3.5 min), followed by the GS (267.296 sec) and the GWO (340.226 sec), while the FA and the PSO take the most (approx. 11 min). In other words, the WSA, GS, and GWO consume less time and scale more robustly to large datasets.
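In principle, the wall-clock figures above can be reproduced by timing one full run (10 individuals and 20 iterations) per algorithm on the same machine. A minimal sketch, where OPTIMIZERS is a hypothetical mapping from algorithm names to search functions with a uniform signature:

```python
import time

def time_optimizer(search_fn, *args, **kwargs):
    """Wall-clock one full optimization run."""
    t0 = time.perf_counter()
    result = search_fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# for name, fn in OPTIMIZERS.items():  # hypothetical {"PSO": pso_search, ...}
#     (_, auc), seconds = time_optimizer(fn, X_tr, y_tr, X_va, y_va,
#                                        n_particles=10, n_iter=20)
#     print(f"{name}: {seconds:.1f} s, AUC = {auc:.4f}")
```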
Analytic comparison

This section answers whether our BP-ANN models trained by SI algorithms outperform the classical and state-of-the-art models for credit scoring.

First, overall prediction. Within the German dataset, the PSO-BP-ANN (with an AUC of 0.8004) outperforms most models in the control group, albeit slightly weaker than the models of [30, 32, 33]. Within the other four datasets, our model outperforms all the models in the control group. Specifically, the optimal AUC measures 0.9370 for the control group but 0.9373 for our model within the Australian dataset; 0.9330 versus 0.9354 within the Japanese dataset; 0.7213 versus 0.7403 within the Taiwan dataset; and 0.7851 versus 0.7898 within the HELOC dataset. In addition, our model presents the best mean performance.

Second, balanced prediction during the testing phase. Unlike the BP-ANN, whose output is the predicted probability of credit default, models whose output takes the form of labels (e.g. the DT, RF, and SVM) present outstanding performance in binary classification. With a uniform classification threshold (0.5), these models outperform ours in terms of the balance metrics (i.e. accuracy, Brier score, and G-mean). Nevertheless, our model performs better than the numerical regression models in the control group.

Third, robustness of prediction. We focus on the value range of the evaluation metrics for each model. Some state-of-the-art models are excluded from this comparison because their evaluation metrics during the training phase are not reported. Within the German dataset, the average range for the control group is 0.4313 (training) and 0.185 (testing), while the average range for our models is 0.1008 (training) and 0.0608 (testing). Within the Australian dataset, the corresponding figures are 0.4089 and 0.2552 for the control group versus 0.0203 and 0.0195 for our models; within the Japanese dataset, 0.4242 and 0.5148 versus 0.0091 and 0.0145; within the Taiwan dataset, 0.5148 and 0.2323 versus 0.0687 and 0.0656; and within the HELOC dataset, 0.3675 and 0.1400 versus 0.0274 and 0.0282. Thus, with less variable predictions, our models perform more robustly across datasets.

Fourth, time complexity. The baseline models in the control group conduct grid search to determine the hyper-parameter set; consequently, their running time is several times ours. As proposed in prior literature [26–28], some state-of-the-art techniques for determining neural network architecture require several GPU-days. By contrast, with the small-size datasets, the hyper-parameters of the BP-ANN are determined by SI algorithms within one minute (i.e. 59.703 sec, 46.878 sec, and 45.025 sec for the German, Australian, and Japanese datasets, respectively); with the large-size datasets, the process completes within 12 min (i.e. 265.836 sec and 432.497 sec for the Taiwan and HELOC datasets, respectively). Instead of a grid search over the parameter space, our models conduct a guided search based on available information. Therefore, our models determine a well-performing parameter set for the BP-ANN within acceptable runtime, which is more practical in the real world.

To sum up, the prominent advantage of our framework is that it searches the hyper-parameter space of the BP-ANN within acceptable runtime and determines a preferable hyper-parameter set efficiently, thereby improving the fitting and generalization of the BP-ANN. In addition, our models predict more precisely when the new samples are “bad”. Furthermore, our models enjoy greater robustness, as their performance varies only within a limited range between the training and testing phases.
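To put the fourth point in perspective, a back-of-envelope comparison of evaluation budgets; the figures are illustrative, not taken from the experiments:

```python
# Exhaustive grid: multiplicative in the number of hyper-parameters.
grid_values_per_param, n_params = 10, 3
grid_evals = grid_values_per_param ** n_params  # 1000 network trainings

# Swarm: a fixed budget of individuals x (iterations + initialization).
individuals, iterations = 10, 20
swarm_evals = individuals * (iterations + 1)    # 210 network trainings

print(grid_evals, swarm_evals)  # 1000 vs 210
```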
Conclusions

This paper proposes a novel framework for credit scoring, conducted in three steps. First, pre-processing the data, including imputation to fill in missing values, normalization to eliminate the effect of measurement scale, and re-ordering to balance the occurrence of samples with binary labels. Second, training the model: we employ several SI algorithms to optimize the hyper-parameters of the BP-ANN and determine the optimal algorithm based on the value of the AUC, with the search space of hyper-parameters set in line with prior literature. Third, applying the model: we apply the optimal model determined in the second step to predict new samples pre-processed in the first step.
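A minimal sketch of the first step follows; the mean imputation, min-max normalization, and random oversampling shown here are assumptions consistent with the description above, not the exact code used in this paper:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def preprocess(X_tr, y_tr, X_te, seed=0):
    """Pre-processing pipeline; X_tr, y_tr, X_te are NumPy arrays."""
    rng = np.random.default_rng(seed)
    # Step 1a: impute missing values (column means fitted on training data).
    imp = SimpleImputer(strategy="mean").fit(X_tr)
    X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)
    # Step 1b: min-max normalization to remove the effect of measurement scale.
    sc = MinMaxScaler().fit(X_tr)
    X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)
    # Step 1c: balance and re-order the training labels (random oversampling
    # of the minority class, followed by a shuffle).
    pos, neg = np.flatnonzero(y_tr == 1), np.flatnonzero(y_tr == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    order = rng.permutation(np.concatenate([pos, neg, extra]))
    return X_tr[order], y_tr[order], X_te
```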
Our framework determines a preferable hyper-parameter set for the BP-ANN within acceptable runtime and thereby improves the fitting and generalization of the neural network. Compared with the classical and the hybrid or ensemble models in the control group, our framework performs more robustly across the training and testing phases. Additionally, the models proposed in this paper predict with greater precision when applied to credit-default samples. An interesting follow-up idea is to develop an ensemble or hybrid version of the SI-trained BP-ANN. To improve the identification of the minority class, we recommend including the penalty factor for misclassifying the minority class among the hyper-parameters used to train the neural network. Furthermore, a body of evaluation metrics (e.g. precision (pos), Brier score, G-mean, etc.) could be employed so that the hyper-parameters of the BP-ANN are optimized with multiple objectives.

Acknowledgments

The authors are deeply grateful to the two anonymous reviewers for their insightful comments on this manuscript.

TI - Optimizing hyper-parameters of neural networks with swarm intelligence: A novel framework for credit scoring JO - PLoS ONE DO - 10.1371/journal.pone.0234254 DA - 2020-06-05 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/optimizing-hyper-parameters-of-neural-networks-with-swarm-intelligence-RBkGSLOGbv SP - e0234254 VL - 15 IS - 6 DP - DeepDyve ER -