# An identification method for important nodes based on k-shell and structural hole

### Abstract

Identifying the important nodes that strongly influence propagation in social networks is significant for understanding and controlling the spread of information in the network. Based on the idea that the location of nodes affects their ability to communicate, we point out that the important nodes in a network consist of two kinds of nodes, namely opinion leaders and structural holes. However, most existing methods for identifying important nodes do not take structural holes into account. This article proposes a hybrid important node identification algorithm that combines k-shell and network effective size, aiming to find a group of nodes that are both the most central and the most bridging. Experiments on five real data sets under the susceptible–infected–removed propagation model show that, compared with algorithms based on degree, closeness, k-shell, betweenness and network effective size, the proposed algorithm has the highest identification accuracy.

### 1. Introduction

Current social networks not only provide a new platform for disseminating and sharing information but have also become an important way for users to show their values, express their interests and demands, and maintain interpersonal relationships. Quickly and accurately identifying the important nodes that have great influence on news dissemination in social networks is of great practical significance, including for the monitoring and management of social networks. Evaluating a node's importance according to its centrality has always been the mainstream method of identifying important nodes in social networks. At present, the evaluation indicators mainly include degree, closeness and k-shell [1–3]. Methods of this kind use graph theory to analyse the degree of centralization of nodes in the network.
However, the importance of a node in a network is related not only to its degree of centralization but also to its functional characteristics. For example, structural hole nodes occupy intermediary positions between communities. Although they are not in the centre of the network, they sit between communities, so they affect and even control social relations and the spread of information [4]. Most existing important node identification methods do not consider this kind of node. Therefore, it is necessary to study an identification method that combines centralization information with structural hole information. The key to discovering important nodes is to find nodes that can produce information, influence it and act as an information bridge. Because the importance of nodes is affected by many factors, we first put forward the evaluation indicator D-importance, which reflects both centralization degree and bridging ability, and then propose an algorithm for identifying important nodes that combines k-shell with structural holes. Finally, the validity of the proposed algorithm is evaluated through the susceptible–infected–removed (SIR) model.

### 2. Preliminaries

#### 2.1 The definition of important nodes

An important node is a special node that can affect network structure and function to a greater extent than other nodes. From the perspective of complex networks, nodes with strong influence on information dissemination and network structure usually fall into two types, namely aggregation nodes and bridge nodes [5]. Aggregation nodes are central nodes in the network and have a great influence on internal information dissemination. Bridge nodes connect different communities and play very important roles in information circulation across the whole network. In the social sciences, aggregation nodes and bridge nodes correspond to opinion leaders and structural holes, respectively, in role analysis.
The concept of opinion leaders was first proposed by Lazarsfeld. It refers to people or organizations that have high influence and a strong ability to provide objective facts and subjective judgement in the process of information transmission and interpersonal interaction. It has been pointed out that opinion leaders play a significant role in information dissemination and communication, and that ordinary users can promote information dissemination by establishing relationships with opinion leaders. Structural hole theory, on the other hand, tries to explain the key positions in a group [4]. Non-redundant positions between heterogeneous nodes are considered structural holes, and they play an important role as information bridges. Following Burt [4], nodes that occupy structural hole positions are themselves often referred to as structural holes.

Because both types of nodes have an important influence on information dissemination in a network, they are all important nodes, but the methods for identifying them are not the same. Existing methods of identifying important nodes are mostly directed at opinion leaders. The main measures are degree, closeness, k-shell and betweenness centrality, and the corresponding node identification algorithms are referred to as DC, CI, KS and BC, respectively.

#### 2.2 Structural hole

##### 2.2.1 Structural hole theory

Burt held that a structural hole arises when one or more individuals in a social network are directly related to some individuals but have no direct relation to others, that is, when there are discontinuities in relationships. Viewed from the whole network, there appear to be some 'caves' in the network structure [7]. In other words, if one actor connects directly to two other actors, but those two actors do not connect to each other directly, then the network location occupied by the first actor is a structural hole. As shown in Fig. 1, there are no direct links between nodes B, C and D, and node A connects directly with all three. Compared with the other three nodes, node A clearly has a competitive advantage: the other three nodes have to pass through it to contact each other, so node A occupies the structural hole position. The second network in the figure is closed, so it has no structural hole.

Fig. 1. Examples of structural holes.

The theory of structural holes combines social network exchange theory, relational intensity theory and so on. It explains the key position of the individual in the group and holds that an individual in a structural hole position can gain more competitive advantage and innovation ability through information filtering and information dissemination.

##### 2.2.2 Structure hole identification algorithm

At present, there are two types of classical structural hole identification algorithms. The first, the BC algorithm, uses betweenness centrality to calculate structural holes [8]. It is considered to be a good algorithm for identifying structural holes, and it characterizes the degree to which a single node controls network resources. Its basic idea is that if a node lies on the shortest paths between many other nodes, the node has a high betweenness centrality and is more likely to occupy a structural hole position. The betweenness centrality of node q is:

\begin{align} BC(q) = \sum_i\sum_j B_{ij}(q) = \sum_i\sum_j \frac{g_{ij}(q)}{g_{ij}}, \tag{2.1} \end{align}

where $$g_{ij}$$ is the number of shortest paths from node i to node j, $$g_{ij}(q)$$ is the number of those shortest paths that pass through node q, and $$B_{ij}(q)$$ indicates the ability of node q to control communication between nodes i and j.
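As a rough illustration of formula 2.1, the following Python sketch counts shortest paths with breadth-first search and accumulates the fractions $$g_{ij}(q)/g_{ij}$$ using Brandes' accumulation scheme. The function name `betweenness`, the adjacency-dictionary input and the choice to count each unordered pair {i, j} once are assumptions made for this example; they are not taken from the article.

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness of every node in an undirected, unweighted graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    Each unordered pair {i, j} contributes once to the sum in formula 2.1.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: sigma counts shortest paths from s,
        # preds records the predecessors of each node on those paths.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2.0 for v, b in bc.items()}


# The open network of Fig. 1: A is linked to B, C and D, which are not linked to each other.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
print(betweenness(star))
```

In the star network all three pairs among B, C and D must pass through A, so A is the only node with non-zero betweenness, which matches the structural hole reading of Fig. 1.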
The algorithm has two assumptions: the weights of all paths are equal, and information from one actor to another always takes the shortest path. The disadvantages of BC are its shortest path assumption and its high computational complexity, so it is not suitable for large networks.

The other classical structural hole identification algorithm was proposed by Burt. He proposed a series of structural hole indicators, such as the effective size of the network, the constraint coefficient, grade of degree and so on [4]. Network effective size is the most useful indicator; it is the size of a node's personal network minus the redundancy among its links. The algorithm that calculates structural holes by effective size is called the ES algorithm. The effective size of node i is:

\begin{align} ES(i) = \sum_j\left[1-\sum_q p_{iq}m_{jq}\right], \tag{2.2} \end{align}

where j ranges over the nodes connected to i, q ranges over the nodes other than i and j, and $$p_{iq}m_{jq}$$ represents the redundancy between i and j through q.

#### 2.3 Centrality indicators

##### 2.3.1 Degree

Degree [9] is the number of a node's neighbours and reflects the direct impact of the node. A higher degree means the node occupies a more central location in the network. The degree of node i is:

\begin{align} DC(i) = \sum_j a_{ij}, \tag{2.3} \end{align}

where $$a_{ij} = 1$$ if node i is connected to node j, and $$a_{ij} = 0$$ otherwise.

##### 2.3.2 Closeness

Closeness [1, 10] reflects how difficult it is for a node to reach other nodes through the network and measures the ability of a node to influence other nodes over the network. A larger closeness means the node is closer to other nodes and spreads information more easily.
The formula for calculating closeness of i is:

\begin{align} CI(i) = \frac{n-1}{\sum_j d_{ij}}, \tag{2.4} \end{align}

where $$d_{ij}$$ is the length of the shortest path from node i to node j and $$n-1$$ is the maximum possible number of neighbouring nodes.

##### 2.3.3 K-shell

The k-shell [11] is a coarse-grained method for ranking node importance that reveals the hierarchical levels of the network topology. A node's k-shell represents the depth of the node in the network. A greater k-shell means the node is more central in the network and more important.

The k-shell of a node is obtained by the following decomposition process. In a given undirected network, all nodes with degree 1 and the edges attached to them are removed. New nodes with degree 1 may then appear in the network, and these nodes and their edges are removed as well. This operation is repeated until no node with degree 1 remains, and all nodes removed so far are assigned a k-shell of 1. The remaining nodes are processed in the same manner with degree 2, 3 and so on, until every node has been assigned a k-shell.

### 3. An important node identification algorithm based on k-shell and structure hole

#### 3.1 The basic idea of the algorithm

Kitsak pointed out that the importance of a node depends on its position in the network. Existing important node identification methods usually consider only opinion leaders. On the one hand, this is because the central nodes of a network easily attract attention; on the other hand, there has been no evaluation method that takes both structural holes and opinion leaders into account. Rogers, in his theory of innovation and diffusion, mentioned that the spread of new things in society generally has to go through different stages. In the early days, innovators first adopt such new things, and they tend to act as opinion leaders in later stages [12].
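The peeling procedure of Section 2.3.3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `k_shell` and the adjacency-dictionary input are assumptions, and isolated nodes (which the text does not discuss) are given shell 0 here.

```python
def k_shell(adj):
    """Return {node: k-shell number} for an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose current degree is at most k belong to shell k.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1  # nothing left to remove at this level, move to the next shell
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue  # already removed through a duplicate entry
            remaining.discard(v)
            shell[v] = k
            # Removing v lowers the degree of its neighbours, which may
            # push them into the same shell.
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        peel.append(w)
    return shell


# A triangle a-b-c with a pendant node d attached to a.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(k_shell(g))
```

In this small graph the pendant node d is peeled first and receives shell 1, while the triangle survives until degree 2 and receives shell 2, even though a has the largest degree.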
In social networks that have a strong community structure, important nodes should have high local centrality and high global diversity at the same time. In other words, they should also have high bridging ability. Previous studies have shown that k-shell is a good measure of the core level of nodes in networks [3], and nodes with a large shell number are in the centre of the network. Combining the k-shell decomposition algorithm with a structural hole identification algorithm should therefore identify important nodes in a way that takes both centrality and inter-community bridging ability into account, and it can also overcome the problem that the k-shell algorithm cannot distinguish the importance of nodes with the same shell value.

#### 3.2 Definition and objectives of the algorithm

Based on the aforementioned ideas, we state the goal of identifying important nodes in social networks as follows. Let $$G = (V, E)$$ represent an undirected social network, $$V = \{v_1, v_2, \ldots, v_N\}$$ the set of network nodes and $$|V| = N$$ the total number of nodes. The importance of each node $$v_i$$ $$(i = 1, 2, \ldots, N)$$ is calculated, and the top k nodes with the greatest importance are selected as the recognition result.

To calculate the importance of nodes, we put forward the concept of node importance degree, which considers both the centrality degree and the structural hole degree of nodes. The relevant concepts and definitions are given below, taking node i as an example.

Definition 1 (k-shell number of a node) The shell number of node i is obtained by k-shell decomposition and is denoted $$KS_i$$.

Definition 2 (structural hole degree of a node) The structural hole degree of node i is obtained from the network effective size proposed by Burt (formula 2.2) and is denoted $$ES_i$$.

Definition 3 (importance degree of a node) D-importance, denoted as DI(i), is a linear combination of k-shell number and structural hole degree.
The values of parameters $$\alpha$$ and $$\beta$$ are set according to the structure and function of actual networks:

\begin{align} DI(i) = \alpha \cdot KS_i + \beta \cdot ES_i, \quad \alpha + \beta = 1. \tag{3.1} \end{align}

From the above definition, finding the important nodes in the network means finding the top k nodes with the largest DI.

#### 3.3 Algorithm description

To achieve the above-mentioned objectives, the following algorithm is proposed:

Input: G (V, E), k

Output: the top k nodes in descending order of D-importance

Method:

Step 1: Calculate the k-shell numbers of all nodes by the KS algorithm;

Step 2: Arrange the nodes in descending order of their k-shell numbers;

Step 3: Calculate the structural hole degree of the top 10*k nodes by the ES algorithm;

Step 4: Calculate the D-importance of the nodes in Step 3 according to formula 3.1;

Step 5: Arrange the nodes in descending order of their D-importance and output the top k nodes.

The core of the above algorithm is the calculation of the D-importance of nodes, so the algorithm is called the DI algorithm.

### 4. Experimental analyses

Based on the SIR model [13], we conducted experiments comparing the DI algorithm with the DC, CI, KS, BC and ES algorithms on real data sets. The Kendall correlation coefficient [14] was used as the accuracy measure.

#### 4.1 Experimental data sets

Because different types of social networks have different topological characteristics, five typical real social network data sets are selected: (1) Karate, a karate club membership network; (2) Lesmis, the relationship network of characters in Les Misérables; (3) Polbooks, a network of political books in the USA; (4) Email, the mail communication network of Rovira i Virgili University and (5) Netscience, a co-authorship network of scientists.
The basic information of the five data sets is shown in Table 1, where N represents the total number of nodes, M the total number of edges, C the network aggregation coefficient, $$D_m$$ the maximum degree of nodes and d the average path length of the network. The DI algorithm can also be applied to other networks.

Table 1 Basic property of the networks

| Data sets | N | M | C | $$D_m$$ | d |
|---|---|---|---|---|---|
| Karate$$^1$$ | 34 | 78 | 0.57 | 17 | 2.40 |
| Lesmis$$^2$$ | 77 | 508 | 0.57 | 36 | 2.64 |
| Polbooks$$^3$$ | 105 | 441 | 0.48 | 25 | 3.07 |
| Netscience$$^3$$ | 1589 | 5484 | 0.63 | 34 | 6.04 |
| Email$$^4$$ | 1133 | 5451 | 0.22 | 71 | 3.60 |

Remarks:

1. http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm#zachary. Last accessed date is 9 August 2017.
2. http://moreno.ss.uci.edu/data.html#lesmis. Last accessed date is 9 August 2017.
3. http://www-personal.umich.edu/~mejn/netdata. Last accessed date is 9 August 2017.
4. http://deim.urv.cat/~alexandre.arenas/data/welcome.htm. Last accessed date is 9 August 2017.

#### 4.2 Experimental results and analysis

##### 4.2.1 The values of $$\alpha$$ and $$\beta$$

The DI algorithm needs a combination of $$\alpha$$ and $$\beta$$ chosen according to the actual data set. Table 2 shows 11 combinations of $$\alpha$$ and $$\beta$$ on the Email data set.
The third column represents the correlation coefficient between the ranking of nodes by actual communication ability and the ranking produced by the DI algorithm. The larger the $$\tau$$ value is, the closer the actual communication ability is to the ranking of the DI algorithm. That is to say, DI has high accuracy when $$\tau$$ is large.

Table 2 $$\tau(DI,S(t))$$ in different combinations of $$\alpha$$ and $$\beta$$

| $$\alpha$$ | $$\beta$$ | $$\tau(DI,S(t))$$ |
|---|---|---|
| 1.0 | 0.0 | 0.6301 |
| 0.9 | 0.1 | 0.6879 |
| 0.8 | 0.2 | 0.7401 |
| 0.7 | 0.3 | 0.7758 |
| 0.6 | 0.4 | 0.7989 |
| 0.5 | 0.5 | 0.8157 |
| 0.4 | 0.6 | 0.8244 |
| 0.3 | 0.7 | 0.8306 |
| 0.2 | 0.8 | 0.8340 |
| 0.1 | 0.9 | 0.8370 |
| 0.0 | 1.0 | 0.8345 |

According to Table 2, the accuracy of the DI algorithm increases as $$\alpha$$ decreases and $$\beta$$ increases, but it declines when $$\alpha = 0.0$$, $$\beta = 1.0$$. Thus, $$\alpha = 0.1$$, $$\beta = 0.9$$ is the optimal combination. The same experiment on the other data sets also supports this conclusion. Therefore, the DI algorithm in the experiments uses the combination $$\alpha = 0.1$$ and $$\beta = 0.9$$. In actual applications, the appropriate combination should be chosen for each network.
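With $$\alpha = 0.1$$ and $$\beta = 0.9$$ fixed, Steps 2 to 5 of Section 3.3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `effective_size` and `top_k_by_di` are hypothetical, the graph is assumed unweighted and undirected, k-shell numbers are passed in precomputed, and formula 2.2 is evaluated in its simplified form for binary ties, where the redundancy of contact j is the fraction of i's contacts that j is also tied to.

```python
def effective_size(adj, i):
    """Burt's effective size of node i (formula 2.2) for binary, undirected ties.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    nbrs = adj[i]
    n = len(nbrs)
    if n == 0:
        return 0.0
    # For each contact j, count how many of i's other contacts j is tied to;
    # dividing by n gives the summed redundancy over all contacts.
    redundancy = sum(len(adj[j] & nbrs) for j in nbrs) / n
    return n - redundancy


def top_k_by_di(adj, shell, k, alpha=0.1, beta=0.9):
    """Return the k nodes with the largest D-importance as (node, DI) pairs.

    shell maps each node to its k-shell number (Step 1, computed elsewhere).
    """
    # Steps 2-3: keep only the 10*k nodes with the largest k-shell number.
    candidates = sorted(adj, key=lambda v: shell[v], reverse=True)[:10 * k]
    # Step 4: formula 3.1.
    di = {v: alpha * shell[v] + beta * effective_size(adj, v) for v in candidates}
    # Step 5: rank by D-importance and output the top k.
    ranked = sorted(di, key=lambda v: di[v], reverse=True)
    return [(v, di[v]) for v in ranked[:k]]


# The open network of Fig. 1, where every node has k-shell number 1.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
print(top_k_by_di(star, {"A": 1, "B": 1, "C": 1, "D": 1}, k=1))
```

Because all four nodes of the example share the same shell number, the k-shell term cannot separate them; the effective size term does, which is the tie-breaking behaviour the combination is meant to provide.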
##### 4.2.2 Correlation analysis between the algorithms and the actual influence

For nodes with strong propagation ability, if the transmission rate in the SIR model is set too large, the nodes will soon infect the whole network, which makes the importance of a single node difficult to distinguish. A smaller transmission rate is therefore used in the experiments so that infection within a limited time is easier to compare. During the experiments, only one node in the network was selected as the source node and the propagation time was set to 10 s. The total number of infected and recovered nodes was taken as the final value of the influence S(t), and the simulation was repeated 100 times for each node.

Because of length limitations, only the correlation graphs between the six node importance evaluation algorithms and the actual influence S(t) in the two larger networks are given. Figure 2 shows the correlation graphs for Netscience, and Fig. 3 shows the correlation graphs for Email.

Fig. 2. Correlation analyses between the algorithms and the actual influence on Netscience.

Fig. 3. Correlation analyses between the algorithms and the actual influence on Email.

Figures 2 and 3 show that the relationships between the CI, KS and BC algorithms and the influence S(t) are not very clear in either Netscience or Email. Taking KS as an example, there are only a few grades of k-shell, so the importance of nodes with the same k-shell number cannot be distinguished. In the BC algorithm, the betweenness of most nodes is small while S(t) varies greatly between nodes, and S(t) of nodes with large betweenness is not necessarily large.
The results of the DC, ES and DI algorithms correlate strongly with S(t), so these algorithms can be used to assess the propagation influence of nodes accurately. The results on the Karate, Lesmis and Polbooks data sets are similar. From Figs 2 and 3 alone it is not possible to judge visually which of DC, ES and DI is better, so the Kendall $$\tau$$ coefficient was further calculated, as shown in Table 3. The second column of Table 3 gives the transmission rate of the SIR model, the third column gives the immune probability, and the fourth to ninth columns give the correlation coefficients between the actual communication ability of the nodes and the results of the different algorithms: a represents $$\tau(\overline{S(t)},DC)$$, b represents $$\tau(\overline{S(t)},BC)$$, c represents $$\tau(\overline{S(t)},KS)$$, d represents $$\tau(\overline{S(t)},CI)$$, e represents $$\tau(\overline{S(t)},ES)$$ and f represents $$\tau(\overline{S(t)},DI)$$. According to Table 3, the DI algorithm gives the best result on every data set except Email, where its accuracy is just below that of the DC algorithm. Overall, the DI algorithm can be considered to have the highest accuracy.
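The influence values $$\overline{S(t)}$$ behind these coefficients come from the SIR runs described at the start of this section. A minimal Python sketch of such a run is given below. Several details are assumptions, since the article does not specify them: the propagation time is read as 10 discrete steps, each infected node tries to infect every susceptible neighbour independently with the transmission rate, and it recovers with the immune probability at the end of each step. The function name `sir_influence` and the `seed` parameter are hypothetical.

```python
import random


def sir_influence(adj, source, infect_p, recover_p, steps=10, runs=100, seed=0):
    """Average number of infected plus recovered nodes after `steps` rounds.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    rng = random.Random(seed)  # fixed seed so repeated calls are reproducible
    total = 0
    for _ in range(runs):
        infected = {source}
        recovered = set()
        for _ in range(steps):
            newly_infected = set()
            for v in infected:
                for w in adj[v]:
                    susceptible = w not in infected and w not in recovered
                    if susceptible and rng.random() < infect_p:
                        newly_infected.add(w)
            # Nodes infected in this round can only recover from the next round on.
            healed = {v for v in infected if rng.random() < recover_p}
            infected = (infected - healed) | newly_infected
            recovered |= healed
        total += len(infected) + len(recovered)
    return total / runs


# A path a-b-c-d-e with a as the single source node.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(sir_influence(path, "a", infect_p=0.3, recover_p=0.1))
```

Calling the function once per node and ranking the nodes by the returned average gives the reference ordering against which a centrality ranking can be compared.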
Table 3 Correlation coefficients $$\tau$$ between the algorithms and the actual influence

| Networks | $$\beta$$ | $$\alpha$$ | a | b | c | d | e | f |
|---|---|---|---|---|---|---|---|---|
| Karate | 0.33 | 0.90 | 0.7618 | 0.6762 | 0.5839 | 0.7944 | 0.7455 | 0.7961 |
| Lesmis | 0.13 | 0.01 | 0.7954 | 0.5479 | 0.7220 | 0.6311 | 0.7923 | 0.7977 |
| Polbooks | 0.07 | 0.01 | 0.7428 | 0.4677 | 0.4107 | 0.5408 | 0.7343 | 0.7468 |
| Email | 0.01 | 0.01 | 0.8509 | 0.6763 | 0.6347 | 0.7724 | 0.7891 | 0.8430 |
| Netscience | 0.02 | 0.01 | 0.8645 | 0.3597 | 0.7644 | 0.6602 | 0.8639 | 0.8665 |

##### 4.2.3 Efficiency analysis of DI algorithm

The DI algorithm was designed to find nodes in the network that have both centrality and bridging ability between communities. The algorithms that reflect centrality are DC, CI and KS. The algorithms that reflect inter-community bridging are BC and ES.
The analysis in the previous section reveals that DC gives the best result among the DC, CI and KS algorithms and that ES gives a better result than BC. Therefore, correlation analysis among DI, DC and ES can show how well the DI algorithm reflects centrality and bridging ability. The key of the DI algorithm is formula 3.1, and because the experiments used $$\alpha = 0.1$$, $$\beta = 0.9$$, the DI algorithm and the ES algorithm are necessarily highly correlated. So we only carry out an additional correlation analysis of the DI algorithm and the DC algorithm. Table 4 shows the correlation coefficient of the DI algorithm and the DC algorithm in each actual network. Figure 4 shows the relationship between DI and DC in each experimental data set, in which the horizontal axis represents the D-importance of each node, the vertical axis represents the degree of each node and the colour of each dot represents S(t).

Fig. 4. Correlation analyses between DI and DC in different networks.

Table 4 Correlation coefficient $$\tau$$ between DI and DC in different networks

| Networks | $$\tau(DI,DC)$$ |
|---|---|
| Karate | 0.9289 |
| Lesmis | 0.9786 |
| Polbooks | 0.9531 |
| Email | 0.9808 |
| Netscience | 0.9576 |

The correlation coefficients in Table 4 and the plots in Fig. 4 show that the DI algorithm and the DC algorithm are strongly correlated. This indicates that the important nodes found by the DI algorithm have a high level of centrality.
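The rank correlations reported in Tables 2 to 4 can be reproduced in outline with a direct pairwise count. The article does not state which variant of the Kendall coefficient was used, so the sketch below assumes the simplest one (often called tau-a), in which tied pairs count as neither concordant nor discordant; the function name `kendall_tau` is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall rank correlation (tau-a variant, assumed) between two score lists.

    x[i] and y[i] are the two scores of the same node, for example its
    D-importance and its degree.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way by both scores
            elif s < 0:
                discordant += 1   # the pair is ordered in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)


# Two rankings that disagree on exactly one of three pairs.
print(kendall_tau([1, 2, 3], [1, 3, 2]))
```

The double loop is quadratic in the number of nodes, which is acceptable for networks of the size used here; a library routine would be the natural choice for larger graphs.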
Through the above experiments, it can be seen that the DI algorithm can accurately identify important nodes that have both centrality and bridging ability between network communities.

### 5. Conclusions

This article studied the problem of identifying important nodes in social networks. Inspired by the notion that the location of nodes in a network influences their transmission ability, we divided the important nodes in a network into two kinds, opinion leaders and structural holes, and proposed the idea that important nodes should have high centrality and high bridging ability at the same time. The DI algorithm was then proposed to identify important nodes by combining the ES algorithm with the KS algorithm. Experiments were carried out on five real networks with the SIR model, and the correlation coefficient was used to compare the DI algorithm with the DC, CI, BC, KS and ES algorithms. The results show that: (1) compared with the other algorithms, DI can accurately identify nodes with high propagation ability; (2) DI correlates well with both DC and ES, indicating that the identified nodes have high centrality and bridging ability. The computational complexity of the DI algorithm is still high; it will be optimized in future work.

### Funding

Shaanxi Provincial Education Commission (Program No. 15JK1468) and Shaanxi Provincial Natural Science Foundation Project (Program No. 2017JQ6053).

### References

1. Kai Y., Ning Z. H. & Shuqing S. (2015) Node centrality on individual microblog user network. J. Univ. Shanghai Sci. Tech., 37, 43–48.
2. Yingxia S. H., Bin C., Lin M. & Hongzhi Y. (2016) A fast sketch-based approach of Top-k closeness centrality search on large networks. Chinese J. Comput., 39, 1965–1978.
3. Zhuoming R., Jianguo L., Feng S. H., Zhaolong H. & Qiang G. (2013) Analysis of the spreading influence of the nodes with minimum K-shell value in complex networks. Acta Phys. Sin., 62, 466–471.
4. Burt R. S. (1993) The social structure of competition. Explor. Econ. Sociol., 65, 103.
5. Xing L., Zhinong Z. H. & Yang L. (2013) Fast algorithm for random walk centrality. Appl. Res. Comput., 30, 2337–2340.
6. Xingdong W., Yi L. & Lei L. (2014) Influence analysis of online social networks. Chinese J. Comput., 37, 735–752.
7. Jian F. & Yuanyuan D. (2016) A structural hole identification algorithm in social networks based on overlapping communities and structural hole degree. Comput. Eng. Sci., 38, 897–904.
8. Freeman L. C. (1977) A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.
9. Xiaofan W., Xiang L. & Guanrong C. H. (2012) Network Science: An Introduction. Beijing: Higher Education Press.
10. Zhongming H., Yan C. H., Wen L., Bihong Y., Mengqi L. & Dagao D. (2017) Research on node influence analysis in social networks. J. Software, 28, 84–104.
11. Jingqiao L., Xiufen F. & Zaiqiao M. (2015) Identification of influential spreading nodes in microblog network. Appl. Res. Comput., 32, 2305–2308.
12. Cha M., Haddadi H., Benevenuto F. & Gummadi P. K. (2010) Measuring user influence in twitter: The million follower fallacy. Int. Conf. Web Soc. Media, 10, 30.
13. Xiaolong R. & Linyuan L. V. (2014) Review of ranking nodes in complex networks. Chinese Sci. Bull., 59, 1175–1197.
14. Zeqin D., Fengzhen H., Jiafei D., Xinfeng L., Jin L. & Jun W. (2014) An improved synchronous algorithm based Kendall for analyzing epileptic brain network. Acta Phys. Sin., 63, 208701–208705.

© The authors 2017. Published by Oxford University Press. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal of Complex Networks, Oxford University Press

Volume Advance Article (3) – Sep 14, 2017
11 pages

Publisher: Oxford University Press
ISSN: 2051-1310
eISSN: 2051-1329
DOI: 10.1093/comnet/cnx035

### Abstract

Abstract Identifying important nodes that have great influence on propagation in social networks has an important significance on understanding and controlling spread of information in the network. On the basis of the idea that location of nodes can affect their ability to communicate, it is pointed out that the important nodes in networks consisted by two kinds of nodes, namely opinion leaders and structural holes. But most of the existing methods for identifying important nodes does not take the structural holes into account. This article proposes a hybrid important nodes identification algorithm, which combines k-shell and network effective size, aiming at finding a group of nodes that are the most central and bridging. Doing experiments on five actual data sets based on susceptible–infected–removed propagation model, the experimental results show that, comparing with algorithms on the basis of degree, closeness, k-shell, between-ness and network effective size, the proposed algorithm has the highest identification accuracy. 1. Introduction Current social networks not only provide new platform for the dissemination and sharing of information but also become important way for users to show their own values, express interest demands and maintain interpersonal relationships. It is of great practical significance to quickly and accurately identify the important nodes that have great influence on news dissemination in social networks and also has important practical significance for monitoring and management of social networks. Evaluating node’s importance according to its centricity has always been mainstream method of identifying important nodes in social networks. At present, evaluating indicators mainly include degree, tightness and k-shell, and so on. [1–3]. This kind of method is based on graph theory to analyse the centralization degree of nodes in the network. 
However, the importance of a node in a network is related not only to its degree of centralization but also to its functional characteristics. For example, structural hole nodes occupy intermediary positions between communities. Although they are not in the centre of the network, they sit between communities, so they affect and even control social relations and the spread of information [4]. Most existing important-node identification methods do not consider this kind of node. Therefore, it is necessary to study identification methods that combine centralization information with structural hole information. The key to discovering important nodes is thus to find nodes that can produce and influence information and act as information bridges. To address the problem that the importance of nodes is affected by many factors, an evaluation indicator called D-importance, which reflects both the degree of centralization and the bridging ability, is put forward first, and then an algorithm for identifying important nodes that combines k-shell with structural holes is proposed. Finally, the validity of the proposed algorithm is evaluated through the susceptible–infected–removed (SIR) model.

### 2. Preliminaries

#### 2.1 The definition of important nodes

An important node is a special node that can affect network structure and function to a greater extent than other nodes. From the perspective of complex networks, nodes with a strong influence on information dissemination and network structure usually fall into two types, namely aggregation nodes and bridge nodes [5]. Aggregation nodes are central nodes in the network and have a great influence on internal information dissemination. Bridge nodes connect different communities and play very important roles in the circulation of information across the whole network. In the social sciences, aggregation nodes and bridge nodes correspond to opinion leaders and structural holes, respectively, in role analysis.
The concept of opinion leaders was first proposed by Lazarsfeld; it refers to people or organizations that have high influence and a strong ability to provide objective facts and subjective judgement in the process of information transmission and interpersonal interaction. It has been pointed out that opinion leaders play a significant role in information dissemination and communication, and that ordinary users can promote information dissemination by establishing relationships with opinion leaders. Structural holes, on the other hand, are used to explain key positions in a group [4]: non-redundant positions between heterogeneous nodes are considered structural holes, and they play an important role as information bridges. Following Burt [4], the nodes that occupy structural hole positions are themselves often referred to as structural holes. Because both types of nodes have an important influence on information dissemination in a network, both are important nodes, but the methods for identifying them differ. Existing methods of identifying important nodes are mostly directed at opinion leaders. The main measures are degree, closeness, k-shell and betweenness centrality; the corresponding node identification algorithms are referred to as DC, CI, KS and BC, respectively.

#### 2.2 Structural hole

##### 2.2.1 Structural hole theory

Burt regarded structural holes as positions in which one or more individuals in a social network are directly related to some individuals but not directly related, or not related at all, to others; that is to say, there are discontinuities in the relationships. Viewed from the whole network, there appear to be ‘caves’ in the network structure [7]. In other words, if one actor connects directly to two other actors, but those two actors do not connect to each other directly, then the network location occupied by the first actor is a structural hole. As shown in Fig. 1, there are no direct links between nodes B, C and D, and node A connects directly with all three nodes. Compared with the other three nodes, node A clearly has a competitive advantage: the other three nodes have to pass through it to contact each other, so node A occupies the structural hole position. The second picture shows a closed network, so there is no structural hole.

Fig. 1. Examples of structural holes.

The theory of structural holes combines social network exchange theory, relational intensity theory and so on. It explains the key position of an individual in a group and holds that an individual in a structural hole position can gain more competitive advantage and innovation ability through information filtering and information dissemination.

##### 2.2.2 Structural hole identification algorithm

At present, there are two types of classical structural hole identification algorithms. The first, the BC algorithm, uses betweenness centrality to calculate structural holes [8]. It is considered to be a better algorithm for identifying structural holes, and it characterizes the degree to which a single node controls network resources. Its basic idea is that if a node lies on the shortest paths between many other nodes, the node has a high betweenness centrality and is more likely to occupy a structural hole position. The betweenness centrality of node q is:

\begin{align}
BC(q) = \sum\limits_i\sum\limits_j B_{ij}(q) = \sum\limits_i\sum\limits_j \frac{g_{ij}(q)}{g_{ij}}, \tag{2.1}
\end{align}

where $$g_{ij}$$ is the number of shortest paths from node i to node j, $$g_{ij}(q)$$ is the number of those shortest paths that pass through node q, and $$B_{ij}(q)$$ indicates the ability of node q to control communication between nodes i and j.
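
To make formula 2.1 concrete, the following is a minimal sketch rather than the implementation used in the article. It assumes an unweighted, undirected network stored as a dictionary of neighbour sets, and it counts each unordered pair {i, j} once; summing over ordered pairs would simply double every value. The function names `bfs_counts` and `betweenness` are illustrative.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Hop distances from `source` and the number of shortest paths to each node."""
    dist = {source: 0}
    sigma = {source: 1}  # sigma[v] = number of shortest source-v paths
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """BC(q) = sum over unordered pairs {i, j} of g_ij(q) / g_ij (formula 2.1)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    bc = {q: 0.0 for q in adj}
    for i, j in combinations(adj, 2):
        dist_i, sigma_i = info[i]
        if j not in dist_i:              # i and j are disconnected
            continue
        for q in adj:
            if q in (i, j) or q not in dist_i:
                continue
            dist_q, sigma_q = info[q]
            # q lies on a shortest i-j path iff the two distances add up
            if j in dist_q and dist_i[q] + dist_q[j] == dist_i[j]:
                bc[q] += sigma_i[q] * sigma_q[j] / sigma_i[j]
    return bc


# Star network of Fig. 1: A is linked to B, C and D, which are not linked to each other.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
bc_star = betweenness(star)
print(bc_star)
```

On the star network, node A lies on the only shortest path of each of the three pairs (B, C), (B, D) and (C, D), so BC(A) = 3 while the other nodes score 0. The all-pairs loop is cubic in the number of nodes, which illustrates why betweenness becomes costly on large networks.
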
The algorithm makes two assumptions: all paths have equal weight, and information from one actor to another always takes the shortest path. The disadvantages of BC are this shortest-path assumption and its high computational complexity, so it is not suitable for large networks.

The other classical structural hole identification algorithm was proposed by Burt, who introduced a series of structural hole indicators, such as the effective size of the network, the constraint coefficient, grade of degree and so on [4]. Network effective size is the most useful indicator; it accounts for the redundancy among nodes by subtracting the number of redundant links from the size of the personal network. The algorithm that calculates structural holes by effective size is called the ES algorithm. The effective size of node i is:

\begin{align}
ES(i) = \sum\limits_j\left[1-\sum\limits_q p_{iq}m_{jq}\right], \tag{2.2}
\end{align}

where j ranges over the nodes connected to i, q ranges over the nodes other than i and j, and $$p_{iq}m_{jq}$$ represents the redundancy between i and j contributed by q.

#### 2.3 Centrality indicators

##### 2.3.1 Degree

Degree [9] is the number of a node’s neighbours and reflects the direct impact of the node. The higher the degree, the more central the node’s location in the network. The degree of node i is:

\begin{align}
DC(i) = \sum\limits_j a_{ij}, \tag{2.3}
\end{align}

where $$a_{ij} = 1$$ if node i is connected to node j and $$a_{ij} = 0$$ otherwise.

##### 2.3.2 Closeness

Closeness [1, 10] reflects how difficult it is for a node to reach other nodes through the network and measures the ability of a node to influence other nodes over the network. A larger closeness means that the node is closer to the other nodes and spreads information more easily.
The closeness of node i is:

\begin{align}
CI(i) = \frac{n-1}{\sum\limits_j d_{ij}}, \tag{2.4}
\end{align}

where $$d_{ij}$$ is the length of the shortest path from node i to node j and $$n-1$$ is the maximum possible number of neighbouring nodes.

##### 2.3.3 K-shell

The k-shell [11] is a coarse-grained method for ranking node importance that reveals the hierarchical levels of the network topology. A node’s k-shell represents the depth of the node in the network: the greater the k-shell, the more central and the more important the node. The k-shell of a node is obtained by the following decomposition process. In a given undirected network, all nodes with degree 1 are removed together with their edges. Some new nodes with degree 1 may then appear in the network; these nodes and their edges are removed as well. This operation is repeated until no node with degree 1 remains, and the k-shell of all the removed nodes is 1. The remaining shells are obtained in the same manner for degree 2, 3 and so on, which gives the k-shell of every node.

### 3. An important node identification algorithm based on k-shell and structural hole

#### 3.1 The basic idea of the algorithm

Kitsak pointed out that the importance of a node depends on its position in the network. Existing important-node identification methods usually consider only opinion leaders. On the one hand, this is because the central nodes of a network easily attract attention; on the other hand, there has been no evaluation method that takes both structural holes and opinion leaders into account. Rogers, in his theory of innovation diffusion, mentioned that the spread of new things in society generally goes through different stages. In the early days, innovators first adopt such new things, and they tend to act as opinion leaders in later stages [12].
In social networks with a strong community structure, important nodes should have high local centrality and high global diversity at the same time; in other words, they should also have a high bridging ability. Previous studies have shown that k-shell is a good measure of the core level of nodes in networks [3], and nodes with a large shell number are in the centre of the network. Combining the k-shell decomposition algorithm with the structural hole identification algorithm should therefore identify important nodes by taking both centrality and inter-community bridging ability into account, and should also overcome the problem that the k-shell algorithm cannot distinguish the importance of nodes with the same shell value.

#### 3.2 Definition and objectives of the algorithm

Based on the aforementioned ideas, we state the goal of identifying important nodes in social networks as follows. Let $$G = (V, E)$$ represent an undirected social network, where $$V = \{v_1, v_2, \ldots, v_n\}$$ is the set of network nodes and $$|V| = N$$ is the total number of nodes. The importance of each node $$v_i$$ $$(i = 1, 2, \ldots, N)$$ is calculated, and the k nodes with the greatest importance are selected as the identification result. To calculate the importance of nodes, the concept of node importance degree is put forward, which considers both the centrality and the structural hole degree of nodes. The relevant concepts are defined below, taking node i as an example.

Definition 1 (k-shell number of a node). The shell number of node i is obtained by k-shell decomposition and is denoted $$KS_i$$.

Definition 2 (structural hole degree of a node). The structural hole degree of node i is given by the network effective size proposed by Burt (formula 2.2) and is denoted $$ES_i$$.

Definition 3 (importance degree of a node). D-importance, denoted DI(i), is a linear combination of the k-shell number and the structural hole degree.
The values of the parameters $$\alpha$$ and $$\beta$$ are set according to the structure and function of the actual network:

\begin{align}
DI(i) = \alpha \, KS_i + \beta \, ES_i, \qquad \alpha + \beta = 1. \tag{3.1}
\end{align}

By this definition, finding the important nodes in the network means finding the k nodes with the largest DI.

#### 3.3 Algorithm description

To achieve the above-mentioned objective, the following algorithm is proposed.

Input: G(V, E), k

Output: the first k nodes in descending order of D-importance

Method:

Step 1: Calculate the k-shell numbers of all nodes by the KS algorithm;

Step 2: Arrange the nodes in descending order of their k-shell numbers;

Step 3: Calculate the structural hole degree of the top 10*k nodes by the ES algorithm;

Step 4: Calculate the D-importance of the nodes from Step 3 according to formula 3.1;

Step 5: Arrange these nodes in descending order of their D-importance and output the first k nodes.

The core of the above algorithm is the calculation of the D-importance of nodes, so the algorithm is called the DI algorithm.

### 4. Experimental analyses

Based on the SIR model [13], we conducted experiments comparing the DI algorithm with the DC, CI, KS, BC and ES algorithms on real data sets. The Kendall correlation coefficient [14] was used as the accuracy measure.

#### 4.1 Experimental data sets

Because different types of social networks have different topological characteristics, five typical real social network data sets were selected: (1) Karate, a karate club membership network; (2) Lesmis, the relationship network of Les Misérables; (3) Polbooks, a network of books about US politics; (4) Email, the e-mail communication network of Rovira i Virgili University; and (5) Netscience, a co-authorship network of scientists.
The basic information of the five data sets is shown in Table 1, where N is the total number of nodes, M is the total number of edges, C is the clustering coefficient of the network, $$D_m$$ is the maximum node degree and d is the average path length of the network. The DI algorithm can also be applied to other networks.

Table 1. Basic properties of the networks

| Data sets | N | M | C | $$D_m$$ | d |
|---|---|---|---|---|---|
| Karate$$^1$$ | 34 | 78 | 0.57 | 17 | 2.40 |
| Lesmis$$^2$$ | 77 | 508 | 0.57 | 36 | 2.64 |
| Polbooks$$^3$$ | 105 | 441 | 0.48 | 25 | 3.07 |
| Netscience$$^3$$ | 1589 | 5484 | 0.63 | 34 | 6.04 |
| Email$$^4$$ | 1133 | 5451 | 0.22 | 71 | 3.60 |

Remarks:

1. http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm#zachary. Last accessed 9 August 2017.
2. http://moreno.ss.uci.edu/data.html#lesmis. Last accessed 9 August 2017.
3. http://www-personal.umich.edu/~mejn/netdata. Last accessed 9 August 2017.
4. http://deim.urv.cat/~alexandre.arenas/data/welcome.htm. Last accessed 9 August 2017.

#### 4.2 Experimental results and analysis

##### 4.2.1 The values of $$\alpha$$ and $$\beta$$

The DI algorithm needs a combination of $$\alpha$$ and $$\beta$$ chosen according to the actual data set. Table 2 shows 11 combinations of $$\alpha$$ and $$\beta$$ on the Email data set.
The third column gives the correlation coefficient $$\tau$$ between the ranking of nodes by their actual spreading ability and the ranking produced by the DI algorithm. The larger the value of $$\tau$$, the closer the DI ranking is to the ranking by actual spreading ability; that is, DI is more accurate when $$\tau$$ is large.

Table 2. $$\tau(DI,S(t))$$ for different combinations of $$\alpha$$ and $$\beta$$

| $$\alpha$$ | $$\beta$$ | $$\tau(DI,S(t))$$ |
|---|---|---|
| 1.0 | 0.0 | 0.6301 |
| 0.9 | 0.1 | 0.6879 |
| 0.8 | 0.2 | 0.7401 |
| 0.7 | 0.3 | 0.7758 |
| 0.6 | 0.4 | 0.7989 |
| 0.5 | 0.5 | 0.8157 |
| 0.4 | 0.6 | 0.8244 |
| 0.3 | 0.7 | 0.8306 |
| 0.2 | 0.8 | 0.8340 |
| 0.1 | 0.9 | 0.8370 |
| 0.0 | 1.0 | 0.8345 |

According to Table 2, the accuracy of the DI algorithm increases as $$\alpha$$ decreases and $$\beta$$ increases, but it declines again when $$\alpha = 0.0$$ and $$\beta = 1.0$$. Thus, $$\alpha = 0.1$$, $$\beta = 0.9$$ is the optimal combination. The same experiment on the other data sets also supports this conclusion. Therefore, the DI algorithm in the experiments uses $$\alpha = 0.1$$ and $$\beta = 0.9$$. In practical applications, the appropriate combination should be chosen for each network.
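
With $$\alpha$$ and $$\beta$$ fixed, the five steps of Section 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the network is unweighted and undirected, so in formula 2.2 the sketch takes $$p_{iq} = 1/\deg(i)$$ for each neighbour q of i and $$m_{jq} = 1$$ when j and q are adjacent (the usual unweighted specialization of Burt's measure). The helper names (`build_adjacency`, `k_shell`, `effective_size`, `di_top_k`), the `pool_factor` parameter and the eight-node toy network are all hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_shell(adj):
    """k-shell number of every node by iterative pruning (Section 2.3.3)."""
    degree = {u: len(vs) for u, vs in adj.items()}
    alive = set(adj)
    shell = {}
    k = 0
    while alive:
        peel = [u for u in alive if degree[u] <= k]
        if not peel:              # nothing left to prune at this level
            k += 1
            continue
        for u in peel:
            alive.remove(u)
            shell[u] = k
            for v in adj[u]:
                if v in alive:
                    degree[v] -= 1
    return shell


def effective_size(adj, i):
    """Formula 2.2 with p_iq = 1/deg(i) and m_jq = 1 if j and q are adjacent."""
    neighbours = adj[i]
    if not neighbours:
        return 0.0
    p = 1.0 / len(neighbours)
    total = 0.0
    for j in neighbours:
        redundancy = sum(p for q in neighbours if q != j and q in adj[j])
        total += 1.0 - redundancy
    return total


def di_top_k(adj, k, alpha=0.1, beta=0.9, pool_factor=10):
    """Steps 1-5 of Section 3.3; returns the top-k nodes and the DI scores."""
    ks = k_shell(adj)                                          # Step 1
    by_shell = sorted(adj, key=lambda u: ks[u], reverse=True)  # Step 2
    candidates = by_shell[:pool_factor * k]                    # Step 3: top 10*k
    di = {u: alpha * ks[u] + beta * effective_size(adj, u)     # Step 4
          for u in candidates}
    ranking = sorted(di, key=di.get, reverse=True)[:k]         # Step 5
    return ranking, di


# Hypothetical toy network: two triangles (a, b, c) and (d, e, f),
# bridged by node h, which also has a pendant neighbour g.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("h", "a"), ("h", "d"), ("h", "g")]
adj = build_adjacency(edges)
top3, di = di_top_k(adj, k=3)
print(top3)
```

In this toy network every node except g has k-shell number 2, so k-shell alone cannot separate them. The effective-size term does: the bridge node h, whose three neighbours are mutually unconnected, ranks first, followed by a and d.
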
##### 4.2.2 Correlation analysis between the algorithms and the actual influence

For nodes with strong propagation ability, if the transmission rate in the SIR model is set too large, the nodes will soon infect the whole network, which makes it difficult to distinguish the importance of single nodes. A smaller transmission rate is therefore used in the experiments so that the infection within a limited time is more informative. In each experiment, a single node in the network was selected as the source node and the propagation time was set to 10 s. The total number of infected and recovered nodes was taken as the final value of the influence S(t), and the simulation was repeated 100 times for each node. Because of space limitations, only the correlation graphs between the six node importance evaluation algorithms and the actual influence S(t) in the two larger networks are given. Figure 2 shows the correlation graphs for Netscience, and Fig. 3 shows those for Email.

Fig. 2. Correlation analyses between the algorithms and the actual influence on Netscience.

Fig. 3. Correlation analyses between the algorithms and the actual influence on Email.

Figures 2 and 3 show that the relationships between the CI, KS and BC algorithms and the influence S(t) are not very clear in either Netscience or Email. Taking KS as an example, there are only a few k-shell levels, so the importance of nodes with the same k-shell number cannot be distinguished. In the BC algorithm, the betweenness of most nodes is small while their S(t) values differ greatly, and the S(t) of nodes with large betweenness is not necessarily large.
The results of the DC, ES and DI algorithms correlate strongly with S(t), so these algorithms can be used to assess the propagation influence of nodes accurately. The results on the Karate, Lesmis and Polbooks data sets are similar. From Figs 2 and 3 alone, it is not possible to determine visually which of the DC, ES and DI algorithms is better, so the Kendall $$\tau$$ coefficient was further calculated, as shown in Table 3. The second column of Table 3 gives the transmission rate of the SIR model, the third column gives the immune probability, and the fourth to ninth columns give the correlation coefficients between the actual spreading ability of the nodes and the results of the different algorithms: a represents $$\tau(\overline{S(t)},DC)$$, b represents $$\tau(\overline{S(t)},BC)$$, c represents $$\tau(\overline{S(t)},KS)$$, d represents $$\tau(\overline{S(t)},CI)$$, e represents $$\tau(\overline{S(t)},ES)$$ and f represents $$\tau(\overline{S(t)},DI)$$. According to Table 3, the DI algorithm gives the best result on every data set except Email, where its accuracy is just below that of the DC algorithm. Overall, the DI algorithm can be considered to have the highest accuracy, ranking first on four of the five data sets and second on Email.
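
The way a node's influence $$\overline{S(t)}$$ is estimated in this section can be sketched as follows. The article does not spell out its simulation procedure, so this is only one plausible reading: a discrete-time SIR process with a single source node, 10 steps standing in for the propagation time, and the immune probability treated as a per-step recovery probability. The names `sir_spread`, `mean_influence`, `infect_prob` and `recover_prob`, and the toy network, are hypothetical.

```python
import random


def sir_spread(adj, seed, infect_prob, recover_prob, steps=10, rng=random):
    """One SIR run from a single source; returns infected + recovered at the end."""
    infected = {seed}
    recovered = set()
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < infect_prob:
                    newly_infected.add(v)
        # Only nodes that were already infected at the start of the step may recover.
        newly_recovered = {u for u in infected if rng.random() < recover_prob}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(infected) + len(recovered)


def mean_influence(adj, seed, infect_prob, recover_prob,
                   steps=10, runs=100, rng_seed=0):
    """Average S(t) over repeated runs (100 repetitions per node in the article)."""
    rng = random.Random(rng_seed)
    total = sum(sir_spread(adj, seed, infect_prob, recover_prob, steps, rng)
                for _ in range(runs))
    return total / runs


# Hypothetical toy network: two triangles bridged by node h, plus a pendant node g.
toy = {
    "a": {"b", "c", "h"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e", "f", "h"}, "e": {"d", "f"}, "f": {"d", "e"},
    "h": {"a", "d", "g"}, "g": {"h"},
}
influence = {u: mean_influence(toy, u, infect_prob=0.3, recover_prob=0.2) for u in toy}
for node, s in sorted(influence.items(), key=lambda item: -item[1]):
    print(node, round(s, 2))
```

Ranking the nodes by this averaged influence and comparing that ranking with the ranking produced by each algorithm gives the kind of correlation reported in Table 3.
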

Table 3. Correlation coefficients $$\tau$$ between the algorithms and the actual influence

| Networks | $$\beta$$ | $$\alpha$$ | a | b | c | d | e | f |
|---|---|---|---|---|---|---|---|---|
| Karate | 0.33 | 0.90 | 0.7618 | 0.6762 | 0.5839 | 0.7944 | 0.7455 | 0.7961 |
| Lesmis | 0.13 | 0.01 | 0.7954 | 0.5479 | 0.7220 | 0.6311 | 0.7923 | 0.7977 |
| Polbooks | 0.07 | 0.01 | 0.7428 | 0.4677 | 0.4107 | 0.5408 | 0.7343 | 0.7468 |
| Email | 0.01 | 0.01 | 0.8509 | 0.6763 | 0.6347 | 0.7724 | 0.7891 | 0.8430 |
| Netscience | 0.02 | 0.01 | 0.8645 | 0.3597 | 0.7644 | 0.6602 | 0.8639 | 0.8665 |

##### 4.2.3 Efficiency analysis of DI algorithm

DI algorithm was designed to find nodes in the network that have both centrality and bridging ability in communities. Algorithms that reflect centrality are DC, CI and KS. Algorithms that reflect inter-community bridging are BC and ES.
The analysis in the previous section reveals that the DC algorithm gives the best results overall among the DC, CI and KS algorithms and that the ES algorithm gives better results than the BC algorithm. Therefore, correlation analysis among DI, DC and ES can show whether the DI algorithm reflects both centrality and bridging ability. The key of the DI algorithm is formula 3.1. Because the experiments used $$\alpha = 0.1$$ and $$\beta = 0.9$$, DI is dominated by the ES term, so the DI algorithm and the ES algorithm are necessarily highly correlated. We therefore only carry out an additional correlation analysis between the DI algorithm and the DC algorithm. Table 4 shows the correlation coefficient between the DI algorithm and the DC algorithm in each real network. Figure 4 shows the relationship between DI and DC in each experimental data set; the horizontal axis represents the D-importance of each node, the vertical axis represents the degree of each node and the colour of the dots represents S(t).

Fig. 4. Correlation analyses between DI and DC in different networks.

Table 4. Correlation coefficient $$\tau$$ between DI and DC in different networks

| Networks | $$\tau(DI,DC)$$ |
|---|---|
| Karate | 0.9289 |
| Lesmis | 0.9786 |
| Polbooks | 0.9531 |
| Email | 0.9808 |
| Netscience | 0.9576 |

According to the correlation coefficients in Table 4 and to Fig. 4, the DI algorithm and the DC algorithm are strongly correlated. This shows that the important nodes found by the DI algorithm have a high level of centrality.
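
For reference, the rank correlation used in Tables 2 to 4 can be sketched as below. The article does not say which variant of the Kendall coefficient it uses or how ties are treated; this sketch assumes the simplest form (often called tau-a), in which tied pairs count as neither concordant nor discordant. The score lists are hypothetical values chosen for illustration, not data from the experiments.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:        # both scores order nodes i and j the same way
            concordant += 1
        elif sign < 0:      # the two scores disagree on the order
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical D-importance scores and degrees of five nodes.
di_scores = [2.3, 2.0, 1.1, 3.4, 0.9]
degrees = [3, 2, 2, 4, 1]
tau = kendall_tau(di_scores, degrees)
print(tau)
```

Of the ten node pairs in this example, nine are ordered the same way by both scores and one is tied in degree, so the coefficient is 0.9. Variants that correct for ties (such as tau-b) would give a slightly different value on data with many ties, which is common for degree and k-shell.
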
Through the above experiments, it can be seen that the DI algorithm can accurately identify important nodes that have both centrality and bridging ability in network communities.

### 5. Conclusions

This article studied the problem of identifying important nodes in social networks. Inspired by the notion that the location of a node in a network influences its transmission ability, important nodes in a network are divided into two kinds, opinion leaders and structural holes, which leads to the idea that important nodes should have high centrality and high bridging ability at the same time. The DI algorithm is then proposed to identify important nodes by combining the ES algorithm with the KS algorithm. Five real networks are used in experiments with the SIR model, and the Kendall correlation coefficient is used to compare the DI algorithm with the DC, CI, BC, KS and ES algorithms. The results show that: (1) compared with the other algorithms, DI can accurately identify nodes with high propagation ability; (2) DI correlates well with both DC and ES, indicating that the identified nodes have high centrality and bridging ability. The computational complexity of the DI algorithm is still high; it will be optimized in future work.

### Funding

Shaanxi Provincial Education Commission (Program No. 15JK1468) and Shaanxi Provincial Natural Science Foundation Project (Program No. 2017JQ6053).

### References

1. Kai Y., Ning Z. H. & Shuqing S. (2015) Node centrality on individual microblog user network. J. Univ. Shanghai Sci. Tech., 37, 43–48.
2. Yingxia S. H., Bin C., Lin M. & Hongzhi Y. (2016) A fast sketch-based approach of Top-k closeness centrality search on large networks. Chinese J. Comput., 39, 1965–1978.
3. Zhuoming R., Jianguo L., Feng S. H., Zhaolong H. & Qiang G. (2013) Analysis of the spreading influence of the nodes with minimum K-shell value in complex networks. Acta Phys. Sin., 62, 466–471.
4. Burt R. S. (1993) The social structure of competition. Explor. Econ. Sociol., 65, 103.
5. Xing L., Zhinong Z. H. & Yang L. (2013) Fast algorithm for random walk centerlity. Appl. Res. Comput., 30, 2337–2340.
6. Xingdong W., Yi L. & Lei L. (2014) Influence analysis of online social networks. Chinese J. Comput., 37, 735–752.
7. Jian F. & Yuanyuan D. (2016) A structural hole identification algorithm in social networks based on overlapping communities and structural hole degree. Comput. Eng. Sci., 38, 897–904.
8. Freeman L. C. (1977) A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.
9. Xiaofan W., Xiang L. & Guanrong C. H. (2012) Network Science: An Introduction. Beijing: Higher Education Press.
10. Zhongming H., Yan C. H., Wen L., Bihong Y., Mengqi L. & Dagao D. (2017) Research on node influence analysis in social networks. J. Software, 28, 84–104.
11. Jingqiao L., Xiufen F. & Zaiqiao M. (2015) Identification of influential spreading nodes in microblog network. Appl. Res. Comput., 32, 2305–2308.
12. Cha M., Haddadi H., Benevenuto F. & Gummadi P. K. (2010) Measuring user influence in twitter: The million follower fallacy. Int. Conf. Web Soc. Media, 10, 30.
13. Xiaolong R. & Linyuan L. V. (2014) Review of ranking nodes in complex networks. Chinese Sci. Bull., 59, 1175–1197.
14. Zeqin D., Fengzhen H., Jiafei D., Xinfeng L., Jin L. & Jun W. (2014) An improved synchronous algorithm based Kendall for analyzing epileptic brain network. Acta Phys. Sin., 63, 208701–208705.

© The authors 2017. Published by Oxford University Press. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

### Journal

Journal of Complex Networks, Oxford University Press

Published: Sep 14, 2017
