Representation of texts as complex networks: a mesoscopic approach

Abstract

Statistical techniques that analyse texts, referred to as text analytics, have departed from the use of simple word count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including representations in terms of complex networks. While well-established word-adjacency (co-occurrence) methods successfully grasp syntactical features of written texts, they are unable to represent important aspects of textual data, such as its topical structure, that is, the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. In order to grasp the mesoscopic characteristics of semantical content in written texts, we devised a network model which is able to analyse documents in a multi-scale fashion. In the proposed model, each limited set of adjacent paragraphs is represented as a node, and nodes are connected whenever they share a minimum semantical content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of ‘Alice’s Adventures in Wonderland’. We show that the mesoscopic structure of a document, modelled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified as real texts or randomized instances.

1. Introduction

The availability of an ever-growing amount of data brought about by the information age has strongly impacted science, giving rise to a novel perspective on data analysis. The use and development of systematic approaches to analyse data has already become mandatory in a wide range of knowledge areas, such as physics [1], biology [2, 3], medicine [4] and even humanities [5, 6].
This also includes techniques devoted to the systematic analysis of texts, known as text mining [7]. Traditionally, approaches involving text analytics were solely based on simple statistics considering mostly the frequency of words [8, 9], which are, in general, suitable for the task of text classification [10]. However, more sophisticated methods have been devised for complex tasks, such as quantifying the relevance of words [11, 12] in a document. These techniques can be employed to detect, for instance, important topics in a given text [13, 14]. Even more challenging are the methods used to study the relationships among words or topics in a document or a set of documents. This kind of analysis can be undertaken by considering semantic similarities [15] or linguistic characteristics [16]. By using these new techniques, many other applications could be achieved, for example, automatic summarization [17], event summarization from many documents [18], sentiment analysis [19] or authorship detection [20]. Applications that illustrate the temporal dynamics [21–23] are also important. In these works, texts or movies are analysed according to the way entities (mainly characters) interact through time. Recently, [24] investigated how the emotional content evolves in a story.

Moreover, text datasets can also be analysed in terms of the relationships among their elements, such as words and paragraphs. Texts can thus be regarded as complex structures and, therefore, be suitably represented in terms of complex networks. A well-known approach to construct complex networks from texts is the word-adjacency (or co-occurrence) technique [25, 26], which is based on connecting pairs of words that are immediately adjacent. The strategy of mapping texts according to co-occurrence relationships is a simplification of networks formed by syntactical links [27].
Despite this seeming limitation, word adjacency networks have been employed successfully to address a great variety of natural language processing problems. This includes sentiment analysis [28], authorship detection [29–31], stylometry [32], text classification [33], word sense disambiguation [34–36], text summarization [37, 38], machine translation [37, 39] and others. Two critical disadvantages of the word adjacency approach are its inability to properly characterize the semantic similarity in written texts [40] and to portray the topical structure present in many texts. The topical structure of a text is expected to emerge naturally from its network representation through a pronounced heterogeneous macro-structure. However, this hardly happens in typical co-occurrence networks, which present no community structure [41]. This suggests that the co-occurrence representation does not completely capture the information at the mesoscopic level of the text, such as topics and subtopics. In addition, the information regarding the temporal evolution along a text is also overlooked in co-occurrence networks.

In order to address the above limitations, we propose a mesoscopic representation of texts, where a node represents a large context, for example, a set of adjacent sentences or paragraphs. More specifically, in our approach each node corresponds to $$\Delta$$ subsequent paragraphs. The relationship between these nodes is then established by a similarity criterion. As such, edges are created whenever a large number of words is shared between two nodes. Note that, by doing so, the network structure becomes more dependent on how the author approaches the topics along the text. As we shall show, the main goal of the proposed representation is to reflect the semantic complexity of texts, a feature that cannot be straightforwardly obtained in traditional word adjacency networks.
This manuscript is organized as follows: Section 2 describes our approach to create the mesoscopic network from a given document. Section 3 describes a case study of our approach. Section 4 illustrates the mesoscopic approach in a machine learning context. Finally, Section 5 concludes our article and suggests perspectives for further studies.

2. Methods

This section describes the procedure to obtain mesoscopic complex networks from texts, which include books and other documents with paragraph structure. Here, we also briefly present the technique employed to visualize these networks.

2.1 From texts to networks

In recent years, a new set of techniques has been introduced to create networks from documents, which takes into account their mesoscopic structure [41]. In that work, the networks are generated by connecting words existing in the same context, which is defined in terms of a fixed window length. This approach was able to produce modular networks, with each community related to contextual topics or subtopics of the text [41]. Even though the semantical organization of texts is captured by this representation, it is not straightforward to obtain the temporal evolution of the story being told.

Here, we extend the concepts introduced by [41] to derive a new technique to construct networks from texts. Our methodology addresses two important aspects typically overlooked by more traditional approaches: (a) the mesoscopic structure of a text and (b) its unfolding along time. To consider (a), instead of linking adjacent words, we use larger pieces of text as the basic representational unit. These pieces are connected according to the similarity among themselves. Regarding (b), the temporal evolution of ideas and concepts is incorporated into our model because, by construction, successive nodes are always connected as a consequence of their shared content. Henceforth, we consider an organized text as a sequence of words delimited by paragraphs.
In our analysis, the paragraphs can be retained from the text, or can be inferred from the text's own structure, for instance, by considering sequences with a fixed number of words.

Our approach starts with a pre-processing step typically employed for semantical-based text analysis. First, punctuation marks and numbers are removed. We also discard words conveying little contextual meaning, that is, the stopwords. Examples of stopwords are articles and prepositions. If a lemmatization technique [7] is available for the language being considered, it is used to normalize concepts. In this step, words are reduced to their canonical forms, so that inflections in verbal tense, number, case or gender are disregarded. For example, the sentence ‘“Oh, I’ve had such a curious dream!” said Alice’ becomes ‘curious dream say alice’ after being pre-processed.

Next, we employ the tf-idf (term frequency-inverse document frequency) technique [7], which defines a map $$\text{tf-idf}(w,d,D)$$ quantifying the importance of each word $$w$$ in a given document $$d$$ from a set of documents $$D$$. The $$\text{tf-idf}(w,d,D)$$ map is computed as

$$\text{tf-idf}(w,d,D) = \text{tf}(w,d) \times \text{idf}(w,D),$$ (2.1)

where $$\text{tf}(w,d)$$, the term-frequency component, accounts for the relevance of $$w \in d$$ and $$\text{idf}(w,D)$$, the inverse document frequency, decreases with the number of documents $$d \in D$$ in which $$w$$ occurs. Many variations of both tf and idf terms have been proposed [7]. In this article, we consider $$\text{tf}(w,d)$$ as the raw frequency of a given word $$w$$ in a document $$d$$ divided by the total number of terms in the document, which is a normalization for each document. Consequently, documents with different numbers of words can be compared.
The $$\text{idf}(w,D)$$ is calculated as

$$\text{idf}(w,D) = \log\left(\frac{|D|}{f_w}\right),$$ (2.2)

where $$|D|$$ is the total number of documents in $$D$$ and $$f_w$$ is the number of documents in which $$w$$ occurs at least once. Such a term is employed in order to increase the weight of the keywords.

The mesoscopic network is generated from the pre-processed text, hereafter referred to as organized text $$O$$. The organized text $$O$$ consists of a sequence of paragraphs $$O = (p_0,p_1,p_2,\dots)$$ with each paragraph $$p_i$$ comprising a sequence of words $$p_i = (w_{i0},w_{i1},w_{i2},\dots)$$. Unlike the co-occurrence model, where nodes represent words, here we map entire paragraphs or sequences of consecutive paragraphs as nodes. In particular, for a choice of window size $$\Delta$$, each possible subsequence comprising $$\Delta$$ paragraphs in $$O$$, $$P_{k}^{\Delta}=(p_{k},p_{k+1},\dots, p_{k+\Delta-1})$$, is represented by a node in the devised mesoscopic network. Fig. 1(a) illustrates the process of obtaining the nodes of the mesoscopic network.

The edges of the mesoscopic network are identified by calculating a contextual similarity measurement considering all pairs of sequences of paragraphs $$P_{k}^\Delta$$ in the investigated document. Here, we employed the traditional bag of words combined with the cosine similarity measurement [7]. Bearing in mind that the number of words in each paragraph can vary significantly, the cosine similarity was used because it does not depend on the length of the text chunks being compared [42]. First, for each considered sequence of paragraphs $$P$$, a vector $$W_P$$, with one component for each distinct word present in $$O$$, is obtained from the $$\text{tf-idf}(w, P, O)$$ map applied to each word $$w$$ in $$O$$. Note that, when a certain word $$w$$ is not present in $$P$$, $$\text{tf-idf}(w, P, O)=0$$.
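The window construction and the tf-idf map of Eqs. (2.1) and (2.2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `paragraph_windows` and `tfidf_vectors` are hypothetical, the input is assumed to be already pre-processed (each paragraph a list of lemmatized, stopword-free words), and the set of documents $$D$$ is assumed here to be the set of windows itself, which the text does not state explicitly.

```python
import math
from collections import Counter


def paragraph_windows(paragraphs, delta):
    """Return every overlapping window of `delta` consecutive paragraphs.

    Each paragraph is a list of pre-processed words; each window is the
    concatenation of the words of its paragraphs (a bag of words).
    """
    return [
        [w for p in paragraphs[k:k + delta] for w in p]
        for k in range(len(paragraphs) - delta + 1)
    ]


def tfidf_vectors(windows):
    """Map each window to a sparse vector {word: tf-idf weight}.

    tf is the raw count divided by the window length; idf is
    log(|D| / f_w). Assumption: D is the set of windows.
    """
    n_docs = len(windows)
    doc_freq = Counter()
    for win in windows:
        doc_freq.update(set(win))  # count each word once per window

    vectors = []
    for win in windows:
        total = len(win)
        counts = Counter(win)
        if total == 0:
            vectors.append({})  # empty window: all weights are zero
            continue
        vectors.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


# Toy usage with three tiny "paragraphs" and a window of two paragraphs.
toy = [["alice", "rabbit"], ["rabbit", "hole"], ["queen", "tart"]]
toy_windows = paragraph_windows(toy, delta=2)
toy_vectors = tfidf_vectors(toy_windows)
```

Words absent from a window are simply missing from its dictionary, which corresponds to a tf-idf value of zero. A word that occurs in every window also receives weight zero, since its idf is log(1).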
The content similarity measurement $$S(P_A,P_B)$$ between two paragraph windows $$P_A$$ and $$P_B$$ is obtained using

$$S(P_A,P_B) = \frac{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_A, O) \times \text{tf-idf}(w, P_B, O)}}{\sqrt{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_A, O)^2}} \sqrt{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_B, O)^2}}}.$$ (2.3)

As a result, a fully connected network is created (see Fig. 1(b)), in which the edge weights correspond to the similarity $$S(P_A,P_B)$$ between each pair of nodes. The final mesoscopic network is obtained by pruning the weakest connections, that is, the links whose weight takes a value below a given threshold $$T$$. After this procedure, edge weights are ignored, resulting in an unweighted network (see Fig. 1(c)).

Fig. 1. Illustration of the presented methodology. Initially, the text is organized in sets of subsequent and overlapping windows $$P_{k}^{3}$$, each containing three structural paragraphs, as shown in (a). Next, the cosine similarity is calculated among all pairs of text windows (illustrated by the width of the lines in b). The mesoscopic network is obtained by maintaining only connections among pairs with similarity higher than a threshold value $$T$$. This is illustrated by the network visualization in (c). Color is available online.
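The similarity of Eq. (2.3) and the pruning step can be sketched as below. This is only an illustrative sketch under stated assumptions: the names `cosine_similarity` and `mesoscopic_edges` are hypothetical, each node is assumed to be given as a sparse dictionary of tf-idf weights, edges with weight equal to the threshold are kept, and a similarity of zero is returned by convention when one of the vectors has zero norm.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_u * norm_v)


def mesoscopic_edges(vectors, threshold):
    """Return the unweighted edge set {(i, j) with i < j}.

    All pairs of nodes are compared (the fully connected weighted
    network); only pairs whose similarity reaches `threshold` are kept,
    and the weights are then discarded.
    """
    edges = set()
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                edges.add((i, j))
    return edges


# Toy usage: nodes 0 and 1 share the word "a"; node 2 shares nothing.
toy_nodes = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"c": 2.0}]
toy_edges = mesoscopic_edges(toy_nodes, threshold=0.5)
```

The pairwise loop is quadratic in the number of nodes, which is acceptable for a single book but would likely need a vectorized implementation for large collections.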
To better understand the overall structure of mesoscopic networks, we visualized the network structure using a technique based on force-directed node placement. In particular, we used a technique inspired by the Fruchterman–Reingold (FR) [43] algorithm, in which the network is regarded as a system of nodes behaving like particles that interact by the action of two types of forces: attractive forces, existing only between connected nodes, and repulsive forces, which exist between all pairs of nodes. By minimizing the energy of that system, the network organizes itself in a graphically appealing layout. This visualization technique naturally highlights many aspects of the topological structure of networks [43].

2.2 Results evaluation

In order to show the potential of our networks to reflect the document story, we compared networks created from Real Texts (RT) with networks created from Shuffled Texts (ST), where clearly no story exists. The ST were created in two ways: by shuffling words (SW) or by shuffling paragraphs (SP) from RT. To generate the SW version, all words from a given text were shuffled and the paragraphs were created with the same number of words as those in the original document. It is important to highlight that the number of paragraphs, their respective order and the number of words in each paragraph were preserved. In the second version of ST, SP, we shuffled all paragraphs from a given RT. Thus, the structure of each single paragraph is kept, but the new sequence of paragraphs may not generate a consistent, coherent story. For each document, a single weighted mesoscopic network was created for each class (RT, SW and SP). Consequently, the classes have the same number of networks. Considering the classes of text (RT, SW and SP), for each weighted network, we generated unweighted networks from a set of thresholds.
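The two shuffling procedures described above (SW and SP) can be sketched as follows. This is a minimal sketch rather than the authors' code: the function names `shuffle_words` and `shuffle_paragraphs` and the optional `seed` parameter are assumptions introduced for illustration, and a text is assumed to be a list of paragraphs, each a list of words.

```python
import random


def shuffle_words(paragraphs, seed=None):
    """SW: shuffle all words of the text, then refill the paragraphs.

    The number of paragraphs and the number of words in each paragraph
    are preserved; only the placement of the words changes.
    """
    rng = random.Random(seed)
    words = [w for p in paragraphs for w in p]
    rng.shuffle(words)

    shuffled, start = [], 0
    for p in paragraphs:
        shuffled.append(words[start:start + len(p)])
        start += len(p)
    return shuffled


def shuffle_paragraphs(paragraphs, seed=None):
    """SP: shuffle the order of the paragraphs, keeping each one intact."""
    rng = random.Random(seed)
    shuffled = [list(p) for p in paragraphs]  # copy; input is not modified
    rng.shuffle(shuffled)
    return shuffled


# Toy usage on a three-paragraph "text".
toy_text = [["a", "b", "c"], ["d"], ["e", "f"]]
toy_sw = shuffle_words(toy_text, seed=0)
toy_sp = shuffle_paragraphs(toy_text, seed=0)
```

Passing a fixed seed makes the shuffled versions reproducible, which is convenient when the same shuffled text must be rebuilt for several thresholds.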
Because the similarity measurement depends exclusively on the content of each text, the obtained edge weights are not comparable across texts. Moreover, a fixed similarity threshold $$T$$ is impractical because it could lead to the removal of all edges in a network, if it is too high, or to no removals, if it is too low. Because of that, a set of thresholds for each network was defined as the values that would keep a given percentage of edges in a network, so that the strongest connections were maintained. The choice of an optimal percentage is not trivial, and it might change for different datasets. Therefore, the set of thresholds of a given text is defined as $$\{T_{5\%}, T_{10\%}, T_{15\%},\ldots,T_{95\%}\}$$ where $$T_{5\%}$$ is the threshold $$T$$ that keeps only 5% of the edges with the highest weights, $$T_{10\%}$$ is the threshold $$T$$ for 10%, and so on. Each text was then characterized by the network measurements extracted from all networks created by applying the different thresholds. Other approaches could also be employed in order to remove edges. One option would be to modify all networks so that they share a specific property, such as the same average degree.

We used two measurements to compare the mesoscopic networks:

Clustering coefficient: this measurement is well known in complex network analysis [44] and it has been used in many text classification applications [45–47]. The clustering coefficient quantifies the fraction of loops of order three (i.e. triangles) for each network node, and it is computed as

$$C_i = \frac{N_\Delta(i)}{N_3(i)},$$ (2.4)

where $$N_\Delta(i)$$ is the number of triangles in which node $$i$$ takes part and $$N_3(i)$$ is the number of connected triples in which $$i$$ is the central node.

Matching index: for each edge, this measure computes the similarity between the two nodes connected by the edge according to their number of common neighbours [48–50].
In other words, this measurement quantifies the similarity between two network regions connected by an edge. This measurement is computed as

$$\mu_{i,j} = \frac{\sum_{k \neq i,j} a_{ik} a_{jk}}{\sum_{k \neq j} a_{ik} + \sum_{k \neq i} a_{jk}},$$ (2.5)

where $$a_{ij}$$ is an element of the adjacency matrix, with $$a_{ij} = 1$$ if nodes $$i$$ and $$j$$ are connected and $$a_{ij} = 0$$ otherwise. This measurement is used to identify long-range links since, by construction, nodes representing overlapping sets of paragraphs share several neighbours. Conversely, nodes representing distant regions in the network usually do not share many neighbours.

The books were considered in their entirety. As a consequence, the number of network nodes varies, which can influence many complex network measurements. As a solution for this problem, we analysed the network in terms of local measurements of clustering and matching index. In order to provide additional information about the text, the two measurements were calculated for all nodes/edges and sorted according to the text sequence, giving rise to a time series. For the matching index, we created the time series by establishing the following order of edges:

$$\{\mu_{0,0},\mu_{0,1},\ldots,\mu_{0,n-1},\mu_{1,0},\mu_{1,1},\ldots,\mu_{1,n-1},\ldots,\mu_{n-1,n-1}\}.$$

If there is no edge linking two nodes, the corresponding value in the time series is not taken into account.

3. Case study: mesoscopic analysis of ‘Alice’s Adventures in Wonderland’

In order to illustrate the potential of modelling RT as mesoscopic networks, we applied our methodology to the well-known book ‘Alice’s Adventures in Wonderland’. This story revolves around the adventures of a little girl, called Alice, after she falls down a hole and arrives in an unknown fantasy world. The book was written in 1865 by Charles Lutwidge Dodgson under the pseudonym Lewis Carroll.
It is divided into the following 12 chapters: (1) Down the Rabbit-Hole; (2) The Pool of Tears; (3) A Caucus-Race and a Long Tale; (4) The Rabbit Sends in a Little Bill; (5) Advice from a Caterpillar; (6) Pig and Pepper; (7) A Mad Tea-Party; (8) The Queen’s Croquet-Ground; (9) The Mock Turtle’s Story; (10) The Lobster Quadrille; (11) Who Stole the Tarts?; (12) Alice’s Evidence.

After the pre-processing steps had been undertaken, we chose a fixed window size $$\Delta = 20$$ paragraphs. Two mesoscopic networks, $$\mathcal{G}_1$$ and $$\mathcal{G}_2$$, were constructed from the book with distinct thresholds to prune connections, $$T_1=0.31$$ and $$T_2=0.18$$, respectively. With these thresholds, only 5% and 10% of the edges, respectively, remained in the network. It is important to highlight that the parameter $$\Delta$$ was set in order to maintain almost all network nodes in a single connected component, for all considered $$T$$ values. Additionally, a higher $$\Delta$$ value implies a higher number of paragraphs in each node, which can hide information from the smaller paragraphs.

We start the analysis of the mesoscopic structures by investigating the properties of the $$\mathcal{G}_1$$ network, which is simpler than $$\mathcal{G}_2$$. For this analysis, we consider a 2D visualization of $$\mathcal{G}_1$$, which is shown in Fig. 2. This visualization was obtained by employing the FR algorithm mentioned in Section 2. Because nodes sharing the same paragraphs become strongly connected among themselves, a pronounced chain-like structure naturally emerges in the mesoscopic network. In addition, this structure is related to the order of the nodes along the book. This property is better observed in Fig. 2(a), where the colour of each node indicates its position along the text. In mesoscopic networks, connections among distant nodes indicate regions of high contextual similarity that are not a result of overlapping sequences of paragraphs.
In these networks, such connections link contextually similar regions of nodes, which, in turn, brings them closer together along the chain-like structure of the network.

Fig. 2. Visualization of the network $$\mathcal{G}_1$$, representing the book Alice’s Adventures in Wonderland with a threshold $$T_1=0.31$$. Each node indicates a sequence of paragraphs. The order of the nodes according to the story is shown in (a). The first nodes of the story appear in blue, while the last nodes are represented in an orange colour. In (b), the chapters of the nodes are represented with distinct colours. Color is available online.

In order to better understand the relationship between the mesoscopic structure and the contextual information of the book, we segmented the obtained network according to the chapter organization of the book. This is visualized in Fig. 2(b), in which the chapter of each node is indicated by a colour, according to the legend. Considering the connectivity among the chapters of the book, we derived the following observations:

In Chapter 1, we note that there is no strong connection among its paragraphs and those from other chapters, except for Chapters 2 and 3, which is explained by the aforementioned overlap between subsequent paragraphs. The lack of long-range connections among some nodes of the first chapter may happen because the beginning of the book is substantially different from almost every other part.
In this chapter, the story starts in a more realistic scenario and it presents fewer descriptions of the fantasy locations and creatures than the rest of the book;

Chapters 2, 3 and the beginning of Chapter 4 are connected among themselves. The vocabulary used in Chapter 2 includes many negative words, such as poor, hopeless, cry, tears, tired, in order to express the situation and how Alice was feeling. Chapters 3 and 4 are not so emotional, but their texts still have a few negative words and some recollections of past events. In particular, all these chapters describe the period of the story when Alice was very frightened of the world she had just jumped into, and she constantly remembered her cat, Dinah, and how good her pet was at catching other animals. In addition, all these chapters mention the moment when she cried and her tears formed a pool;

In Chapter 5, there are strong connections between regions from the same chapter. This probably happens because the long conversation between Alice and the Caterpillar revolved around a few topics. They mainly discussed the many sizes she had during that day and how confused she was about herself;

Even though Chapters 7 and 11 describe different events (the first is about the tea party and the other is about a trial), there are some edges connecting nodes from these chapters. One possible explanation is that both chapters include the characters Alice, March Hare, Dormouse and the Hatter, who is having tea in both situations. Furthermore, these chapters mention specific kinds of food and beverage related to the tea party, for example, tea, bread and butter;

There is a group of highly connected nodes at the end of Chapter 9 and at the beginning of Chapter 10. This probably happens because Alice is introduced to the Mock Turtle in the last paragraphs of Chapter 9 and their conversation ends only in Chapter 10. Moreover, one topic shared between these chapters is related to teaching.
At the end of Chapter 9, the Mock Turtle tells Alice about his days at school and the lessons he used to have. In the next chapter, some animals ask Alice to repeat some lessons, and she feels as if she were at school.

Figure 3 displays a visualization of the $$\mathcal{G}_2$$ network, which was constructed using a lower threshold value, $$T_2=0.18$$. By using two threshold choices, it has been possible to illustrate the potential of our method in describing the characteristics of the network in a multi-scale fashion. From Fig. 3(a), we can observe that the network still has a chain-like structure similar to that found in $$\mathcal{G}_1$$. However, this network presents more connections among nodes from different parts of the book. This is because the $$\mathcal{G}_2$$ network captures more fine-grained information about the relationships among the paragraphs. Comparing $$\mathcal{G}_1$$ and $$\mathcal{G}_2$$, we note that while Chapter 1 is connected only with Chapters 2 and 3 in $$\mathcal{G}_1$$, in $$\mathcal{G}_2$$ it also connects with other parts of the book, in particular, with Chapters 4 and 7. However, the analysis of fine-grained networks may present some disadvantages because these networks tend to incorporate more local characteristics. Moreover, they may include noise and relationships not driven by a strong contextual content.

Fig. 3. Visualization of the network $$\mathcal{G}_2$$, representing the book Alice’s Adventures in Wonderland with a threshold $$T_2=0.18$$. The nodes indicate sets of 20 adjacent paragraphs. Item (a) shows the order of the nodes according to the story, where the first nodes of the story appear in blue and the last nodes in orange. Item (b) represents the chapters, in which the nodes are represented with distinct colours. Color is available online.
4. Discriminating real from shuffled texts

To illustrate the ability of the proposed representation to grasp semantical information of texts by considering topological features, we evaluated the efficiency of the method in discriminating RT from texts conveying no meaning, which are here represented by ST. This is an important potential application of the proposed approach as an aid to fraud identification, such as inferring whether texts in unknown languages are meaningful or not.

In Fig. 4, we show the two networks obtained from the book ‘Alice’s Adventures in Wonderland’ and the respective values of clustering coefficient (for two thresholds) along the document. Note that an interesting pattern emerges in both cases. Regions encompassing many long-range connections are characterized by low values of clustering. In addition, there is a complex pattern of intermittent appearances of low values of clustering coefficient in regions devoid of long-range connections. A similar behaviour occurred with the matching index (result not shown), which also captures the presence of long-range links. As explained in Section 2.2, the matching index quantifies the number of shared neighbours between two interconnected nodes. This measurement is used to identify long-range links since, by construction, nodes representing overlapping sets of paragraphs share several neighbours. Conversely, nodes representing distant regions in the network usually do not share many neighbours.
Fig. 4. Visualization of the networks representing ‘Alice’s Adventures in Wonderland’. Item (a) represents the network $$\mathcal{G}_1$$ with a threshold $$T_1=0.31$$ and item (b) represents the network $$\mathcal{G}_2$$ with a threshold $$T_2=0.18$$. The node colours indicate the value of the clustering coefficient, in which nodes with the highest values are represented in orange. Note that there is a non-trivial pattern of clustering coefficient along the network nodes. Color is available online.

In Fig. 5, we show the behaviour of the clustering coefficient along time for the real book and its two respective meaningless versions formed by shuffled paragraphs and words. It is clear from the figure that, on average, the clustering coefficient of all three versions fluctuates around $$C\simeq0.78$$. However, the patterns of fluctuations are markedly dissimilar. The largest variations arise for the real book, while both shuffled versions seem to display larger regions of weak fluctuations (see e.g. nodes from 180 to 280 in Fig. 5(b)). A similar pattern was obtained for the matching index measurement. Owing to the clear patterns in the fluctuations of local density discriminating real and meaningless texts, we applied measurements to quantify the mentioned fluctuations in order to check how much the proposed model depends on the text unfolding.
Fig. 5. Clustering coefficient for all network nodes of real and shuffled versions (RT, SW and SP) created from the book ‘Alice’s Adventures in Wonderland’. The threshold $$T_1=0.31$$ was chosen to select the strongest semantical links.

The fluctuations observed in Fig. 5 were characterized using the coefficient of variation of a set of observations $$X$$, where $$X$$ here represents the ordered set of values of $$C$$ or $$\mu$$. The coefficient of variation, $$c_v(X)$$, is defined [51] as

$$c_v(X) = {\sigma(X)}/{\langle X \rangle},$$ (4.1)

where $$\sigma(X)$$ and $$\langle X \rangle$$ are the standard deviation and the average of $$X$$, respectively. For a choice of window size $$\delta$$ and for each possible subsequence of $$X$$, $$\mathcal{X}^\delta_k = \{x_k, x_{k+1},\dots,x_{k+\delta-1}\}$$, the coefficient of variation $$c_v(\mathcal{X}^\delta_k)$$ is calculated. In order to consider the coefficient of variation at the mesoscale, we analyse the series in a multi-scale fashion, by considering $$\delta \in \{3,5,7,10,15,20,25,30,35,40,50\}$$. For each value of window size $$\delta$$, we summarize the values of fluctuations by averaging over all $$c_v(\mathcal{X}^\delta_k)$$, that is:

$$\mathcal{C}_v^\delta(X) = \frac{1}{n-\delta+1} \sum_{k=1}^{n-\delta+1} c_v(\mathcal{X}^\delta_k),$$ (4.2)

where $$n$$ is the number of elements in $$X$$, so that $$n-\delta+1$$ is the number of windows being averaged. Finally, each network was characterized by the set of features $$\mathcal{F} = \{\mathcal{C}_v^{\delta=3},\mathcal{C}_v^{\delta=5},\mathcal{C}_v^{\delta=7},\ldots\}$$, with $$X$$ being the values of clustering coefficient and matching index.
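The local measurements of Eqs. (2.4) and (2.5) and the multi-scale fluctuation features of Eqs. (4.1) and (4.2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the network is assumed to be an undirected edge set without self-loops, the population standard deviation is assumed for $$\sigma$$ (the text does not say which estimator was used), and a value of zero is returned by convention whenever a denominator vanishes.

```python
from statistics import mean, pstdev


def adjacency(edges, n):
    """Build neighbour sets for an undirected network with n nodes."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def clustering(adj, i):
    """Local clustering: triangles through i over triples centred at i."""
    nb = sorted(adj[i])
    k = len(nb)
    triples = k * (k - 1) // 2
    if triples == 0:
        return 0.0  # convention assumed for nodes with degree < 2
    triangles = sum(
        1 for a in range(k) for b in range(a + 1, k) if nb[b] in adj[nb[a]]
    )
    return triangles / triples


def matching_index(adj, i, j):
    """Common neighbours of i and j over their degrees, excluding i and j."""
    common = len((adj[i] & adj[j]) - {i, j})
    total = len(adj[i] - {j}) + len(adj[j] - {i})
    return common / total if total else 0.0


def coefficient_of_variation(xs):
    """Standard deviation over mean (population standard deviation assumed)."""
    m = mean(xs)
    return pstdev(xs) / m if m else 0.0


def multiscale_cv(series, deltas):
    """Average windowed coefficient of variation for each window size."""
    features = {}
    for d in deltas:
        windows = [series[k:k + d] for k in range(len(series) - d + 1)]
        if windows:  # skip window sizes longer than the series
            features[d] = mean(coefficient_of_variation(w) for w in windows)
    return features


# Toy usage: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = adjacency({(0, 1), (1, 2), (0, 2), (2, 3)}, n=4)
toy_series = [clustering(toy_adj, i) for i in range(4)]
toy_features = multiscale_cv(toy_series, deltas=[2, 3])
```

Window sizes larger than the series are skipped rather than raising an error, so the same list of window sizes can be applied to books of different lengths.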
To validate the potential of our mesoscopic model to extract information from the story of a document, we considered the problem of discriminating real from meaningless (shuffled) texts using a dataset comprising several books (see details in Appendix A). We first visualized all three classes of texts in a two-dimensional principal component analysis (PCA) projection [52]. The results are shown in Fig. 6(a), in which the first two components account for approximately 76% of the variance. Remarkably, the networks are usually placed close to others from the same class, while being well separated from other classes. This latter effect is confirmed by the average distance between classes shown in Table 1.

Our results are compared with those obtained with the traditional approach based on co-occurrence networks (see details in Appendix B). The PCA projection of these networks is shown in Fig. 6(b). Although the first two PCA components account for 70% of the variance, the groups of networks from RT and SP are not distinguishable. This behaviour was expected because co-occurrence networks were first devised to grasp linguistic/syntactical features. When the language structure is kept and only the mesoscopic structure is changed (as in SP texts), the co-occurrence approach is unable to discriminate real from meaningless texts. The poor discriminability observed is confirmed by the distances shown in Table 2.

Table 1 Mesoscopic network: average distance among networks from the same class. Note that, when using mesoscopic networks, it is possible to discriminate real texts from those generated by shuffling both words and paragraphs

        RT     SW     SP
RT    0.00  13.67  10.98
SW   13.67   0.00  13.21
SP   10.98  13.21   0.00

Table 2 Co-occurrence network: average distance among networks from the same class. If co-occurrence networks are used, real texts and texts formed by shuffled paragraphs cannot be discriminated

        RT     SW     SP
RT    0.00   7.18   0.12
SW    7.18   0.00   7.08
SP    0.12   7.08   0.00

Fig. 6. PCA projections of the networks generated from RT, SP and SW texts. (a) Mesoscopic networks and (b) co-occurrence networks. Color is available online.

The discriminability between RT and the two classes of ST was also evaluated using an unsupervised approach based on the K-means algorithm [53]. Here, we used six principal components as features, as this choice yielded the best results. Considering all documents of the datasets, only 8.9% of instances were incorrectly clustered with the mesoscopic approach. Interestingly, the clustering generated by the algorithm yielded only 0.02% of false negatives for the SP class. A feature relevance analysis revealed that the clustering coefficient outperforms the matching index for the clustering task when the algorithm is applied to each measurement separately. When only the clustering coefficient or only the matching index is used, the percentages of incorrectly assigned instances are 11.7% and 16.7%, respectively. The unsupervised approach was also used to compare the proposed methodology with traditional co-occurrence networks. In this analysis, we used 10 principal components, as this number of features yielded the best results. The quality of the clusters was estimated in terms of the accuracy and the adjusted Rand index (ARI) [54].
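As a worked illustration of the two cluster-quality scores named above, the sketch below computes the adjusted Rand index from the contingency table between true classes and cluster labels, following the standard Hubert and Arabie formula. It is a plain-Python sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, and the accuracy shown maps each cluster to its majority class, which is an assumption because the text does not state how clusters were matched to the RT, SW and SP classes.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index between a reference partition and a clustering."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Contingency counts and their marginals.
    cells = Counter(zip(true_labels, cluster_labels))
    rows = Counter(true_labels)
    cols = Counter(cluster_labels)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate partitions; treated as perfect agreement (assumption).
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def majority_vote_accuracy(true_labels, cluster_labels):
    """Fraction of items whose class is the majority class of their cluster.

    Assumption: this cluster-to-class mapping is illustrative only.
    """
    cells = Counter(zip(true_labels, cluster_labels))
    best = {}
    for (_, cluster), count in cells.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)
```

Both functions are invariant to how the clusters are numbered, so a clustering that recovers the classes under different labels still scores 1 on both measures.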
The cluster quality indexes obtained for both types of networks are shown in Table 3. Co-occurrence networks could not properly distinguish RT from SP classes, as expected from the analysis of Fig. 6(b). In this scenario, 72.5% of SP texts were incorrectly classified as RT. This inability is also reflected in the ARI, which is much lower for co-occurrence networks. Such a result confirms that co-occurrence networks are unable to distinguish between RT and SP. This is because in both cases the syntax of the original texts is maintained, and co-occurrence networks represent the syntactical relationships between words. In contrast, the mesoscopic network model, which takes into consideration the organization of the paragraphs along the text, was able to distinguish the texts among the three classes.

Table 3 Comparison of the K-means clustering performance among different network approaches. Two different measurements were applied: Adjusted Rand Index (ARI) and Accuracy. In both measurements, 1 indicates that all instances are correctly classified and 0 indicates the opposite

                               ARI    Accuracy
Mesoscopic (Clustering)       0.679   0.883
Mesoscopic (Matching Index)   0.576   0.833
Mesoscopic (all features)     0.749   0.911
Co-occurrence                 0.268   0.575

A particular feature of the mesoscopic model is the existence of long-range connections. More specifically, a long-range connection is a link that connects two nodes that are far apart in the document. This type of link usually appears when a subject or context previously mentioned in the book is revisited in the story. It has been conjectured that such links, a consequence of the long-range correlation effect [55], are essential for mapping a multidimensional conceptual space into a lower-dimensional space [56].
Such a characteristic can be related to the writing style or the nature of the book (e.g. tales, novels or scientific writing). To quantify the presence of long-range links, we show (Fig. 7) scatterplots of all edge weights versus the time difference between linked nodes, where time corresponds to the natural reading order. The wide distribution of weights obtained for the smallest time differences is imposed by the construction rules of mesoscopic networks, in which many edges are established between successive nodes. Long-range connections were observed in all three classes of texts (i.e. RT, SW and SP). However, most of these connections are very weak. As depicted in the inset of Fig. 7, RT tend to present stronger long-range connections than ST, especially in the time frame of 300 to 400 paragraphs. The presence of strong long-range connections only in the RT case can be interpreted as a consequence of contextual relationships along the story. These characteristics were observed in Section 3, where we described the network connections according to the unfolding of the book.

Fig. 7. Time difference between linked nodes vs. their respective edge weights. Note that it is hard to distinguish among the points in the scatterplots because they are very close. This measurement was computed for all network edges and the inset represents a region of long-range links, that is, links with time difference larger than 100. Comparing the different classes of texts, it is evident that strong long-range connections are more likely to appear in real networks. (a) RT, (b) SP and (c) SW.

5. Conclusion

In order to grasp semantical, mesoscopic properties of texts modelled as networks, we proposed an approach that considers the semantical similarity between textual segments. Unlike previous representations, we modelled sequences of adjacent paragraphs as nodes, whose links are established by content similarity. By doing so, we could capture two important features present in written texts: long-range correlations and the temporal unfolding of documents. In addition, the proposed approach for text representation also allows a multi-scale representation of documents. Specifically, two parameters control the scale: (i) $$\Delta$$: the number of consecutive paragraphs in each window, and (ii) $$T$$: the threshold used to prune connections among nodes with low contextual similarity. As a case study, we tested our approach on ‘Alice’s Adventures in Wonderland’, by employing network visualization techniques on the generated mesoscopic network. Many insights could be drawn from the visualization by tracing a parallel between its underlying structure and the story. In particular, we investigated the correspondence between the content of each chapter and the underlying network structure arising from the proposed model. Our model uncovered many relationships among different contexts sharing the same topics, such as similar characters or places throughout the story. For example, the high contextual similarity found between chapters 7 and 11 can be explained by the fact that both chapters share a recurrent subject revolving around the character The Hatter and the tea party theme.
Note that similar textual inferences could not be drawn from models based solely on local features, as is the case for traditional word-adjacency or syntactical networks, since these emphasize mostly stylistic textual subtleties. The effectiveness of our model was also evaluated with respect to the task of discriminating real from ST. The shuffled versions were created by mixing either the words or the paragraphs of RT. We have found that, if we consider only two simple local density measurements, it is possible to separate all three classes of texts with high accuracy. The traditional co-occurrence model turned out to grasp only local subtleties, as it was not able to discriminate RT from texts generated by shuffling paragraphs. This happens because, when paragraphs are shuffled, only a few edges (those at the paragraph boundaries) are modified. These results confirm the suitability of the proposed model for capturing larger contexts in a mesoscopic fashion. A further analysis of the model also revealed that RT are characterized by stronger long-range links, a feature that could be explored in tests of informativeness of written documents [40]. An important step in our methodology is the selection of parameters, in particular the choice of the window size $$\Delta$$ and the threshold $$T$$, which define the characteristic scale of analysis for the resulting network. Here, we provided means to configure the method so that the network forms a single connected component. However, more sophisticated approaches can be used to improve such a selection, for instance by using complexity measurements such as Estrada’s heterogeneity index [57], the Laplacian energy [58] and the von Neumann entropy [59]. The proposed methodology still presents some limitations regarding the selection of $$\Delta$$.
For instance, for low values of $$\Delta$$, the resulting network would account for a smaller scale of analysis, thus allowing the investigation of the text at levels lower than chapters, down to paragraphs ($$\Delta = 1$$). However, this may degrade the quality of the statistics extracted from the text because of the reduced vocabulary available in each node, which is used to obtain the similarity among nodes. Alternatively, other methods could be employed to account for the textual similarity. A well-known vectorial representation, word embeddings [60, 61], could be used to incorporate not only first-order statistics but also the semantical relationships among words.

The proposed network representation paves the way for developing new techniques that could be applied to automatically analyse the mesoscopic structure of documents. These techniques could improve traditional approaches used to tackle typical text mining problems under a new perspective. This capability should be further explored in future works, for instance, by measuring the efficiency of our model in text classification, summarization and similar applications in which an accurate semantic analysis plays a prominent role in the characterization of written texts.

Funding

The authors acknowledge financial support from Capes-Brazil, Sao Paulo Research Foundation (FAPESP) (grant nos. 2016/19069-9, 2015/08003-4, 2015/05676-8, 2014/20830-0 and 2011/50761-2), CNPq-Brazil (grant no. 307333/2013-2) and NAP-PRP-USP.

Appendix A. Dataset

All the texts used in our dataset were extracted from the open access Project Gutenberg dataset (see footnote 1). We divided the dataset into two major groups, according to the original language: (i) English and (ii) Other languages.
The books, sorted by language and author, are listed below:

English:
Arthur Conan Doyle: The Adventures of Sherlock Holmes; The Tragedy of the Korosko; The Valley of Fear; Through the Magic Door and Uncle Bernac - A Memory of the Empire;
Bram Stoker: Dracula’s Guest; The Lair of the White Worm; The Jewel Of Seven Stars; The Man and The Mystery of the Sea;
Charles Dickens: A Tale of Two Cities; American Notes; Barnaby Rudge: A Tale of the Riots of Eighty; Great Expectations and Hard Times;
Edgar Allan Poe: The Works of Edgar Allan Poe (Volumes 1–5);
Hector H. Munro (Saki): Beasts and Super-Beasts; The Chronicles of Clovis; The Toys of Peace; When William Came and The Unbearable Bassington;
P. G. Wodehouse: The Girl on the Boat; My Man Jeeves; Something New; The Adventures of Sally and The Clicking of Cuthbert;
Thomas Hardy: A Pair of Blue Eyes; Far from the Madding Crowd; Jude the Obscure; The Mayor of Casterbridge and The Hand of Ethelberta;
William M. Thackeray: Barry Lyndon; The Book of Snobs; The History of Pendennis; The Virginians and Vanity Fair.

Other languages:
French: Gustave Aimard: Le fils du Soleil; Jules Verne: Face au Drapeau; Louis Amédée Achard: Pierre de Villerglé; Louis Reybaud: Les Idoles d’argile; Victor Hugo: Han d’Islande.
German: Goethe: Die Wahlverwandtschaften; Jakob Wassermann: Der Moloch; Robert Walser: Geschwister Tanner; Thomas Mann: Königliche Hoheit; Wilhelm Hauff: Lichtenstein.
Italian: Alberto Boccardi: Il Peccato di Loreta; Anton Giulio Barrili: La Montanara; Enrico Castelnuovo: Alla Finestra; Guido da Verona: Sciogli la treccia, Maria Maddalena; Virginia Mulazzi: La Pergamena Distrutta.
Portuguese: Camilo Castelo Branco: Amor de Perdição; Eça de Queirós: A cidade e as Serras; Faustino da Fonseca: Os Bravos do Mindello; Jaime de Magalhães Lima: Transviado; Júlio Dinis: Uma Família Inglesa.

Appendix B. Characterization of co-occurrence networks

Typically, co-occurrence (or word adjacency) networks are formed by mapping each concept into a distinct node of the network. The edges are established by adjacency relationships, that is, if two words are adjacent in the text, they are connected in the network. Such networks have been extensively explored in the context of text analysis and pattern recognition [46]. In the present work, we compare the properties of the mesoscopic and co-occurrence models. We compare the mesoscopic results with a set of centrality measurements of co-occurrence networks used in Ref. [33], which are: accessibility [62], betweenness centrality [63], closeness centrality, clustering coefficient, degree, eccentricity [64], eigenvector centrality [65], generalized accessibility [66], modularity [67] (computed with the fast greedy algorithm [68]), neighbourhood connectivity, number of nodes, PageRank [69] and symmetry [33]. Apart from modularity, we compute the following quantities for each measurement: maximum value ($$\max(X)$$), median ($$\tilde X$$), minimum value ($$\min(X)$$) and standard deviation ($$\sigma(X)$$). To create the co-occurrence networks, we trimmed the texts to the same number of words because many of the above complex network measurements are influenced by the number of nodes. Because the number of network nodes varies in mesoscopic networks, we did not use the same set of measurements as for the co-occurrence networks. Furthermore, in the co-occurrence network analysis, we only used texts written in English because this kind of representation captures information regarding the syntax, which is different for each language.

References

1. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D.-U. (2006) Complex networks: structure and dynamics. Phys. Rep., 424, 175–308.
2. Barabasi, A.-L. & Oltvai, Z. N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet., 5, 101–113.
3. de Arruda, H. F., Comin, C. H., Miazaki, M., Viana, M. P. & da Fontoura Costa, L. (2015) A framework for analyzing the relationship between gene expression and morphological, topological, and dynamical patterns in neuronal networks. J. Neurosci. Methods, 245, 1–14.
4. Barabási, A.-L., Gulbahce, N. & Loscalzo, J. (2011) Network medicine: a network-based approach to human disease. Nat. Rev. Genet., 12, 56–68.
5. Kalimeri, M., Constantoudis, V., Papadimitriou, C., Karamanos, K., Diakonos, F. K. & Papageorgiou, H. (2015) Word-length entropies and correlations of natural language written texts. J. Quant. Linguist., 22, 101–118.
6. Moreno, Y., Nekovee, M. & Pacheco, A. F. (2004) Dynamics of rumor spreading in complex networks. Phys. Rev. E, 69, 066130.
7. Manning, C. D. & Schütze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
8. Altmann, E. G., Pierrehumbert, J. B. & Motter, A. E. (2009) Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4, e7678.
9. Nahm, U. Y. & Mooney, R. J. (2002) Text mining with information extraction. AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, Vol. 1.
10. Joachims, T. (2001) A statistical learning model of text classification for support vector machines. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New Orleans, Louisiana, USA: ACM, pp. 128–136.
11. Hotho, A., Nürnberger, A. & Paaß, G. (2005) A brief survey of text mining. Ldv Forum, Vol. 20, pp. 19–62.
12. Ramos, J. (2003) Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning.
13. AlSumait, L., Barbará, D. & Domeniconi, C. (2008) On-line lda: adaptive topic models for mining text streams with applications to topic detection and tracking. ICDM’08 Eighth IEEE International Conference on Data Mining, 2008. Pisa, Italy: IEEE, pp. 3–12.
14. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) Latent dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022.
15. Landauer, T. K., Foltz, P. W. & Laham, D. (1998) An introduction to latent semantic analysis. Discourse Process., 25, 259–284.
16. Hatzivassiloglou, V., Gravano, L. & Maganti, A. (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece: ACM, pp. 224–231.
17. Chang, Y.-L. & Chien, J.-T. (2009) Latent dirichlet learning for document summarization. IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. New York, NY, USA: IEEE, pp. 1689–1692.
18. Wei, Y., Singh, L., Gallagher, B. & Buttler, D. (2016) Overlapping target event and story line detection of online newspaper articles. IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016. Montreal, QC, Canada: IEEE, pp. 222–232.
19. Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011) Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150.
20. Chen, X., Hao, P., Chandramouli, R. & Subbalakshmi, K. (2011) Authorship similarity detection from email messages. Machine Learning and Data Mining in Pattern Recognition. Berlin, Heidelberg: Springer, pp. 375–386.
21. Liu, S., Wu, Y., Wei, E., Liu, M. & Liu, Y. (2013) Storyflow: tracking the evolution of stories. IEEE Trans. Vis. Comput. Graph., 19, 2436–2445.
22. Prado, S. D., Dahmen, S. R., Bazzan, A. L., Carron, P. M. & Kenna, R. (2016) Temporal network analysis of literary texts. Adv. Complex Syst., 19(3), 1650005 (19 pages).
23. Tanahashi, Y. & Ma, K.-L. (2012) Design considerations for optimizing storyline visualizations. IEEE Trans. Vis. Comput. Graph., 18, 2679–2688.
24. Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M. & Dodds, P. S. (2016) The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science, 5, 31.
25. Amancio, D. R., Oliveira Jr, O. N. & Costa, L. F. (2012) Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts. Phys. A, 391, 4406–4419.
26. Kulig, A., Drożdż, S., Kwapień, J. & Oświecimka, P. (2015) Modeling the average shortest-path length in growth of word-adjacency networks. Phys. Rev. E, 91, 032810.
27. Ferrer i Cancho, R., Solé, R. V. & Köhler, R. (2004) Patterns in syntactic dependency networks. Phys. Rev. E, 69, 051915.
28. Feldman, R. (2013) Techniques and applications for sentiment analysis. Commun. ACM, 56, 82–89.
29. Amancio, D. R. (2015) Authorship recognition via fluctuation analysis of network topology and word intermittency. J. Stat. Mech. Theory Exp., 2015, P03005.
30. Mehri, A., Darooneh, A. H. & Shariati, A. (2012) The complex networks approach for authorship attribution of books. Phys. A, 391, 2429–2437.
31. Segarra, S., Eisen, M. & Ribeiro, A. (2015) Authorship attribution through function word adjacency networks. IEEE Trans. Signal Process., 63, 5464–5478.
32. Amancio, D. R. (2015) A complex network approach to stylometry. PLoS One, 10, e0136076.
33. de Arruda, H. F., Costa, L. d. F. & Amancio, D. R. (2016) Using complex networks for text classification: discriminating informative and imaginative documents. Europhys. Lett., 113, 28007.
34. Amancio, D. R., Oliveira Jr, O. N. & da F. Costa, L. (2012) Unveiling the relationship between complex networks metrics and word senses. Europhys. Lett., 98, 18002.
35. Mihalcea, R., Tarau, P. & Figa, E. (2004) Pagerank on semantic networks, with application to word sense disambiguation. Proceedings of the 20th International Conference on Computational Linguistics, COLING ’04. Stroudsburg, PA: Association for Computational Linguistics, 1126 pages.
36. Silva, T. C. & Amancio, D. R. (2012) Word sense disambiguation via high order of learning in complex networks. Europhys. Lett., 98, 58001.
37. Amancio, D. R., Nunes, M. G. V., Oliveira Jr, O. N. & da F. Costa, L. (2012) Extractive summarization using complex networks and syntactic dependency. Phys. A, 391, 1855–1864.
38. Antiqueira, L., Oliveira Jr, O. N., Costa, L. F. & Nunes, M. G. V. (2009) A complex network approach to text summarization. Inf. Sci., 179, 584–599.
39. Xuan, Q. & Wu, T.-J. (2009) Node matching between complex networks. Phys. Rev. E, 80, 026103.
40. Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr, O. N. & Costa, L. F. (2013) Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLOS One, 8, 1–10.
41. de Arruda, H. F., Costa, L. d. F. & Amancio, D. R. (2016) Topic segmentation via community detection in complex networks. Chaos, 26(6), 063120.
42. Han, J. (2005) Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers Inc.
43. Fruchterman, T. & Reingold, E. (1991) Graph drawing by force-directed placement. Software: Practice and Experience, 21, 1129–1164.
44. Watts, D. J. & Strogatz, S. H. (1998) Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442.
45. Amancio, D. R., Altmann, E. G., Oliveira Jr, O. N. & Costa, L. d. F. (2011) Comparing intermittency and network measurements of words and their dependence on authorship. New J. Phys., 13, 123024.
46. Masucci, A. & Rodgers, G. (2006) Network properties of written human language. Phys. Rev. E, 74, 026102.
47. Sheng, L. & Li, C. (2009) English and chinese languages as weighted complex networks. Phys. A, 388, 2561–2570.
48. Kaiser, M. & Hilgetag, C. C. (2004) Edge vulnerability in neural and metabolic networks. Biol. Cybernet., 90, 311–317.
49. Newman, M. (2010) Networks: An Introduction. New York, NY: Oxford University Press, Inc.
50. Sporns, O. (2003) Graph theory methods for the analysis of neural connectivity patterns. Neuroscience Databases (Kötter, R. ed.). Boston, MA: Springer, pp. 171–185.
51. Das, N. (2008) Statistical Methods-Combined Edition, Vols. i and ii. West Patel Nagar, New Delhi, India: Tata McGraw Hill Education Private Limited, PAGES-4, 5: 290.
52. Jolliffe, I. (2002) Principal Component Analysis. New York: Springer Verlag.
53. Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., Witten, I. H. & Trigg, L. (2009) Weka-a machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook (Maimon, O. & Rokach, L. eds). Boston, MA: Springer, pp. 1269–1277.
54. Hubert, L. & Arabie, P. (1985) Comparing partitions. J. Classif., 2, 193–218.
55. Ebeling, W. & Neiman, A. (1995) Long-range correlations between letters and sentences in texts. Phys. A, 215, 233–241.
56. Alvarez-Lacalle, E., Dorow, B., Eckmann, J.-P. & Moses, E. (2006) Hierarchical structures induce long-range dynamical correlations in written texts. Proc. Natl. Acad. Sci., 103, 7956–7961.
57. Estrada, E. (2010) Quantifying network heterogeneity. Phys. Rev. E, 82, 066102.
58. Gutman, I. & Zhou, B. (2006) Laplacian energy of a graph. Linear Algebra Appl., 29–37.
59. Braunstein, S. L., Ghosh, S. & Severini, S. (2006) The Laplacian of a graph as a density matrix: a basic combinatorial approach to separability of mixed states. Ann. Comb., 10, 291–317.
60. Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013) Efficient estimation of word representations in vector space. ArXiv preprint arXiv:1301.3781.
61. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q. eds). Curran Associates, Inc., pp. 3111–3119.
62. Travençolo, B. A. N. & Costa, L. d. F. (2008) Accessibility in complex networks. Phys. Lett. A, 373, 89–95.
63. Freeman, L. (1977) A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.
64. Estrada, E. (2012) The Structure of Complex Networks: Theory and Applications. New York, NY, USA: Oxford University Press.
65. Bonacich, P. (1987) Power and centrality: a family of measures. Amer. J. Sociol., 92, 1170–1182.
66. de Arruda, G. F., Barbieri, A. L., Rodríguez, P. M., Rodrigues, F. A., Moreno, Y. & Costa, L. d. F. (2014) Role of centrality for the identification of influential spreaders in complex networks. Phys. Rev. E, 90, 032812.
67. Newman, M. E. & Girvan, M. (2004) Finding and evaluating community structure in networks. Phys. Rev. E, 69, 026113.
68. Clauset, A., Newman, M. E. & Moore, C. (2004) Finding community structure in very large networks. Phys. Rev. E, 70, 066111.
69. Langville, A. N. & Meyer, C. D. (2011) Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton, NJ, USA: Princeton University Press.

Footnotes

1 Project Gutenberg - https://www.gutenberg.org/

© The authors 2017. Published by Oxford University Press. All rights reserved.

Journal of Complex Networks, Oxford University Press

Representation of texts as complex networks: a mesoscopic approach
Volume 6 (1), Feb 1, 2018, 20 pages
Publisher: Oxford University Press
ISSN: 2051-1310
eISSN: 2051-1329
DOI: 10.1093/comnet/cnx023

Abstract

Abstract Statistical techniques that analyse texts, referred to as text analytics, have departed from the use of simple word count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including the representations in terms of complex networks. While well-established word-adjacency (co-occurrence) methods successfully grasp syntactical features of written texts, they are unable to represent important aspects of textual data, such as its topical structure, that is the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. In order to grasp the mesoscopic characteristics of semantical content in written texts, we devised a network model which is able to analyse documents in a multi-scale fashion. In the proposed model, a limited amount of adjacent paragraphs are represented as nodes, which are connected whenever they share a minimum semantical content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of ‘Alice’s Adventures in Wonderland’. We show that the mesoscopic structure of a document, modelled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified among real texts and randomized instances. 1. Introduction The availability of an ever growing amount of data brought up by the age of information has strongly impacted science, giving rise to a novel perspective on data analysis. The use and development of systematic approaches to analyse data has already become mandatory in a wide range of knowledge areas, such as physics [1], biology [2, 3], medicine [4] and even humanities [5, 6]. This also includes techniques devoted to the systematic analysis of texts, known as text mining [7]. 
Traditionally, approaches involving text analytics were solely based on simple statistics considering mostly the frequency of words [8, 9], which are, in general, suitable for the task of text classification [10]. However, more sophisticated methods have been devised for complex tasks, such as quantifying the relevance of words [11, 12] in a document. These techniques can be employed to detect, for instance, important topics in a given text [13, 14]. Even more challenging are the methods used to study the relationships among words or topics in a document or a set of documents. This kind of analysis can be undertaken by considering semantic similarities [15] or linguistic characteristics [16]. By using these new techniques, many other applications can be achieved, for example, automatic summarization [17], event summarization from many documents [18], sentiment analysis [19] or authorship detection [20]. Applications that illustrate temporal dynamics [21–23] are also important. In these works, texts or movies are analysed according to the way entities (mainly characters) interact through time. Recently, [24] investigated how the emotional content evolves in a story. Moreover, text datasets can also be analysed in terms of the relationships among their elements, such as words and paragraphs. Texts can thus be regarded as complex structures and, therefore, be suitably represented in terms of complex networks. A well-known approach to construct complex networks from texts is the word-adjacency (or co-occurrence) technique [25, 26], which is based on connecting pairs of words that are immediately adjacent. The strategy of mapping texts according to co-occurrence relationships is a simplification of networks formed by syntactical links [27]. Despite this seeming limitation, word adjacency networks have been employed successfully to address a great variety of natural language processing problems.
This includes sentiment analysis [28], authorship detection [29–31], stylometry [32], text classification [33], word sense disambiguation [34–36], text summarization [37, 38], machine translation [37, 39] and others. Perhaps the most critical disadvantages of the word adjacency approach are its inability to properly characterize the semantic similarity in written texts [40] and to portray the topical structure present in many texts. The topical structure of a text is expected to naturally emerge from its network representation through a pronounced heterogeneous macro-structure. However, this hardly happens in typical co-occurrence networks, which present no community structure [41]. This suggests that the co-occurrence representation does not completely capture the information at the mesoscopic level of the text, such as topics and subtopics. In addition, the information regarding the temporal evolution along a text is also overlooked in co-occurrence networks. In order to address the above limitations, we propose a mesoscopic representation of texts, where a node represents a large context, for example, a set of adjacent sentences or paragraphs. More specifically, in our approach each node corresponds to $$\Delta$$ subsequent paragraphs. The relationship between these nodes is then established by a similarity criterion. As such, edges are created whenever a large number of words is shared between two nodes. Note that, by doing so, the network structure becomes more dependent on how the author approaches the topics along the text. As we shall show, the main goal of the proposed representation is to reflect the semantic complexity of texts, a feature that cannot be straightforwardly obtained in traditional word adjacency networks. This manuscript is organized as follows: Section 2 describes our approach to create the mesoscopic network from a given document. Section 3 describes a case study of our approach.
Section 4 illustrates the mesoscopic approach in a machine learning context. Finally, Section 5 concludes our article and suggests perspectives for further studies.

2. Methods

This section describes the procedure to obtain mesoscopic complex networks from texts, which include books and other documents with paragraph structure. Here, we also briefly present the technique employed to visualize these networks.

2.1 From texts to networks

In recent years, a new set of techniques has been introduced to create networks from documents, which takes into account their mesoscopic structure [41]. In that work, the networks are generated by connecting words existing in the same context, which is defined in terms of a fixed window length. This approach was able to produce modular networks, with each community related to contextual topics or subtopics of the text [41]. Even though the semantical organization of texts is captured by this representation, it is not straightforward to obtain the temporal evolution of the story being told. Here, we extend the concepts introduced by [41] to derive a new technique to construct networks from texts. Our methodology addresses two important aspects typically overlooked by more traditional approaches: (a) the mesoscopic structure of a text and (b) its unfolding along time. To consider (a), instead of linking adjacent words, we use larger pieces of text as the basic representational unit. These pieces are connected according to the similarity among themselves. To consider (b), the temporal evolution of ideas and concepts is incorporated into our model because, by construction, successive nodes are always connected as a consequence of their shared content. Henceforth, we consider an organized text as a sequence of words delimited by paragraphs. In our analysis, the paragraphs can be retained from the text, or can be inferred from the text's own structure, for instance, by considering sequences with a fixed number of words.
Our approach starts with a pre-processing step typically employed for semantical-based text analysis. First, punctuation marks and numbers are removed. We also discard words conveying little contextual meaning, that is, the stopwords. Examples of stopwords are articles and prepositions. If a lemmatization technique [7] is available for the language being considered, it is used to normalize concepts. In this step, words are reduced to their canonical forms, so that inflections in verbal tense, number, case or gender are disregarded. For example, the sentence ‘“Oh, I’ve had such a curious dream!” said Alice’ becomes ‘curious dream say alice’ after being pre-processed. Next, we employ the tf-idf (term frequency-inverse document frequency) technique [7], which defines a map $$\text{tf-idf}(w,d,D)$$ quantifying the importance of each word $$w$$ in a given document $$d$$ from a set of documents $$D$$. The $$\text{tf-idf}(w,d,D)$$ map is computed as

$$\text{tf-idf}(w,d,D) = \text{tf}(w,d) \times \text{idf}(w,D),$$ (2.1)

where $$\text{tf}(w,d)$$, the term-frequency component, accounts for the relevance of $$w \in d$$ and $$\text{idf}(w,D)$$, the inverse document frequency, decreases with the number of documents $$d \in D$$ in which $$w$$ occurs. Many variations of both tf and idf terms have been proposed [7]. In this article, we consider $$\text{tf}(w,d)$$ as the raw frequency of a given word $$w$$ in a document $$d$$ divided by the total number of terms in the document, which is a normalization for each document. Consequently, documents with different numbers of words can be compared. The $$\text{idf}(w,D)$$ is calculated as

$$\text{idf}(w,D) = \log\left(\frac{|D|}{f_w}\right),$$ (2.2)

where $$|D|$$ is the total number of documents in $$D$$ and $$f_w$$ is the number of documents in which $$w$$ occurs at least once. Such a term is employed in order to increase the weight of the keywords.
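As a minimal sketch of Eqs. (2.1) and (2.2), the two components can be computed as follows. The function names, the toy documents and the use of the natural logarithm are assumptions made for illustration (the text does not state the base of the logarithm), and documents are assumed to be already pre-processed lists of words.

```python
import math
from collections import Counter


def tf(word, doc):
    """Raw frequency of `word` in `doc` divided by the number of terms in `doc`."""
    if not doc:
        return 0.0
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """log(|D| / f_w), with f_w the number of documents containing `word`.

    Assumption: natural logarithm; a word found in no document gets weight 0.
    """
    f_w = sum(1 for d in docs if word in d)
    if f_w == 0:
        return 0.0
    return math.log(len(docs) / f_w)


def tf_idf(word, doc, docs):
    """Eq. (2.1): product of the term frequency and the inverse document frequency."""
    return tf(word, doc) * idf(word, docs)


# Hypothetical pre-processed documents (stopwords removed, words lemmatized).
docs = [
    ["curious", "dream", "say", "alice"],
    ["alice", "rabbit", "hole"],
    ["queen", "croquet", "alice"],
]
dream_weight = tf_idf("dream", docs[0], docs)
alice_weight = tf_idf("alice", docs[0], docs)
```

In this toy example ‘alice’ occurs in every document, so its idf, and hence its tf-idf weight, is zero, whereas ‘dream’ occurs in a single document and receives a positive weight.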
The mesoscopic network is generated from the preprocessed text, hereafter referred to as organized text $$O$$. The organized text $$O$$ consists of a sequence of paragraphs $$O = (p_0,p_1,p_2,\dots)$$ with each paragraph $$p_i$$ comprising a sequence of words $$p_i = (w_{i0},w_{i1},w_{i2},\dots)$$. Differently from the co-occurrence model, where nodes represent words, here we map entire paragraphs or sequences of consecutive paragraphs as nodes. In particular, for a choice of window size $$\Delta$$, each possible subsequence comprising $$\Delta$$ paragraphs in $$O$$, $$P_{k}^{\Delta}=(p_{k},p_{k+1},\dots,p_{k+\Delta-1})$$, is represented by a node in the devised mesoscopic network. Figure 1(a) illustrates the process of obtaining the nodes of the mesoscopic network. The edges of the mesoscopic network are identified by calculating a contextual similarity measurement considering all pairs of sequences of paragraphs $$P_{k}^\Delta$$ in the investigated document. Here, we employed the traditional bag of words combined with the cosine similarity measurement [7]. Bearing in mind that the number of words in each paragraph can vary significantly, the cosine similarity was used because it does not depend on the length of the text chunks being compared [42]. First, for each considered sequence of paragraphs $$P$$, a vector $$W_P$$, with one component for each word present in $$O$$, is obtained from the $$\text{tf-idf}(w, P, O)$$ map applied to each word $$w$$ in $$O$$. Note that, when a certain word $$w$$ is not present in $$P$$, $$\text{tf-idf}(w, P, O)=0$$. The content similarity measurement $$S(P_A,P_B)$$ between two paragraph windows $$P_A$$ and $$P_B$$ is obtained using

$$S(P_A,P_B) = \frac{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_A, O) \times \text{tf-idf}(w, P_B, O)}}{\sqrt{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_A, O)^2}} \sqrt{\sum\limits_{w\,\in\,O} {\text{tf-idf}(w, P_B, O)^2}}}.$$ (2.3)

As a result, a fully connected network is created (see Fig.
1(b)), in which the edge weights correspond to the similarity $$S(P_A,P_B)$$ between each pair of nodes. The final mesoscopic network is obtained by pruning the weakest connections, that is, the links whose weight takes a value below a given threshold $$T$$. After this procedure, edge weights are ignored, resulting in an unweighted network (see Fig. 1(c)).

Fig. 1. Illustration of the presented methodology. Initially, the text is organized in sets of subsequent and overlapping windows $$P_{k}^{3}$$, each containing three structural paragraphs, as shown in (a). Next, the cosine similarity is calculated among all pairs of text windows (illustrated by the width of the lines in b). The mesoscopic network is obtained by maintaining only connections among pairs with similarity higher than a threshold value $$T$$. This is illustrated by the network visualization in (c). Color is available online.

To better understand the overall structure of mesoscopic networks, we visualized the network structure using a technique based on force-directed node placement. In particular, we used a technique inspired by the Fruchterman–Reingold (FR) [43] algorithm, in which the network is regarded as a system of nodes behaving like particles that interact by the action of two types of forces: attractive forces, existing only between connected nodes, and repulsive forces, which exist between all pairs of nodes.
By minimizing the energy of that system, the network organizes itself in a graphically appealing layout. This visualization technique naturally highlights many aspects of the topological structure of networks [43].

2.2 Results evaluation

In order to show the potential of our networks to reflect the document story, we compared networks created from Real Texts (RT) with networks created from Shuffled Texts (ST), where clearly no story exists. The ST were created in two ways: by shuffling the words (SW) or the paragraphs (SP) of the RT. To generate the SW version, all words from a given text were shuffled and the paragraphs were created with the same number of words as those in the original document. It is important to highlight that the number of paragraphs, their respective order and the number of words in each paragraph were preserved. In the second version of ST, SP, we shuffled all paragraphs from a given RT. Thus, the structure of each single paragraph is kept, but the new sequence of paragraphs may not generate a consistent, coherent story. For each document, a single weighted mesoscopic network was created for each class (RT, SW and SP). Consequently, the classes have the same number of networks. Considering the classes of text (RT, SW and SP), for each weighted network, we generated unweighted networks from a set of thresholds. Because the similarity measurement depends exclusively on the content of each text, the obtained edge weights are not comparable across texts. Moreover, a fixed similarity threshold $$T$$ is impractical because it could lead to the removal of all edges in a network, if it is too high, or to no removals, if it is too low. Because of that, a set of thresholds for each network was defined as the values that would keep a given percentage of edges in a network, so that the strongest connections were maintained. The choice of an optimal percentage is not trivial, and it might change for different datasets.
Therefore, the set of thresholds of a given text is defined as $$\{T_{5\%}, T_{10\%}, T_{15\%},\ldots,T_{95\%}\}$$ where $$T_{5\%}$$ is the threshold $$T$$ that keeps only 5% of the edges with the highest weights, $$T_{10\%}$$ is the threshold $$T$$ for 10%, and so on. Each text was then characterized by the network measurements extracted from all networks created by applying the different thresholds. Other approaches could also be employed in order to remove edges. One option would be to prune all networks so that they share a specific property, such as the same average degree. We used two measurements to compare the mesoscopic networks:

(i) Clustering coefficient: this measurement is well known in complex network analysis [44] and it has been used in many text classification applications [45–47]. The clustering coefficient quantifies the fraction of loops of order three (i.e. triangles) for each network node, and it is computed as

$$C_i = \frac{N_\Delta(i)}{N_3(i)},$$ (2.4)

where $$N_\Delta(i)$$ is the number of connected triangles in which node $$i$$ takes part and $$N_3(i)$$ is the number of connected triples in which $$i$$ is the central node;

(ii) Matching index: for each edge, this measurement computes the similarity between the two nodes connected by the edge according to the number of common neighbours [48–50]. In other words, this measurement quantifies the similarity between two network regions connected by an edge. It is computed as

$$\mu_{i,j} = \frac{\sum_{k \neq i,j} a_{ik} a_{jk}}{\sum_{k \neq j} a_{ik} + \sum_{k \neq i} a_{jk}},$$ (2.5)

where $$a_{ij}$$ is an element of the adjacency matrix, and $$a_{ij} = 1$$ if nodes $$i$$ and $$j$$ are connected. This measurement is used to identify long-range links since, by construction, nodes representing overlapping sets of paragraphs share several neighbours. Conversely, nodes representing distant regions in the network usually do not share many neighbours.
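The two measurements can be illustrated with a short sketch that follows Eqs. (2.4) and (2.5) as written. The representation of the network as a dictionary mapping each node to the set of its neighbours, the function names and the toy graph are assumptions made for this example; nodes with fewer than two neighbours are assigned a clustering coefficient of zero, a convention the text does not specify.

```python
def clustering(adj, i):
    """Eq. (2.4): triangles through node i divided by connected triples centred on i."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no triple is centred on i
    triples = k * (k - 1) // 2
    # Count each pair of neighbours (a, b) once and check whether they are linked.
    triangles = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return triangles / triples


def matching_index(adj, i, j):
    """Eq. (2.5): common neighbours of i and j over the sum of their remaining degrees."""
    common = len((adj[i] & adj[j]) - {i, j})
    total = len(adj[i] - {j}) + len(adj[j] - {i})
    return common / total if total else 0.0


# Hypothetical undirected network: a triangle (0, 1, 2) plus a pendant node 3.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
c_values = [clustering(adj, i) for i in sorted(adj)]
mu_01 = matching_index(adj, 0, 1)
mu_13 = matching_index(adj, 1, 3)
```

With the denominator of Eq. (2.5), each common neighbour is counted once in the numerator and twice in the denominator, so the index computed in this sketch cannot exceed 0.5.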
The books were considered in their entirety. As a consequence, the number of network nodes varies, which can influence many complex network measurements. As a solution for this problem, we analysed the network in terms of local measurements of clustering and matching index. In order to provide additional information about the text, the two measurements were calculated for all nodes/edges and sorted according to the text sequence, giving rise to a time series. For the matching index, we created the time series by establishing the following order of edges:

$$\{\mu_{0,0},\mu_{0,1},\ldots,\mu_{0,n-1},\mu_{1,0},\mu_{1,1},\ldots,\mu_{1,n-1},\ldots,\mu_{n-1,n-1}\}.$$

If there is no edge linking two nodes, the corresponding value in the time series is not taken into account.

3. Case study: mesoscopic analysis of ‘Alice’s Adventures in Wonderland’

In order to illustrate the potential of modelling RT as mesoscopic networks, we applied our methodology to the well-known book ‘Alice’s Adventures in Wonderland’. This story revolves around the adventures of a little girl, called Alice, after she falls down a hole and arrives in an unknown fantasy world. The book was written in 1865 by Charles Lutwidge Dodgson under the pseudonym Lewis Carroll. It is divided into the following 12 chapters:

(1) Down the Rabbit-Hole
(2) The Pool of Tears
(3) A Caucus-Race and a Long Tale
(4) The Rabbit Sends in a Little Bill
(5) Advice from a Caterpillar
(6) Pig and Pepper
(7) A Mad Tea-Party
(8) The Queen’s Croquet-Ground
(9) The Mock Turtle’s Story
(10) The Lobster Quadrille
(11) Who Stole the Tarts?
(12) Alice’s Evidence

After the pre-processing steps had been undertaken, we chose a fixed window size $$\Delta = 20$$ paragraphs. Two mesoscopic networks, $$\mathcal{G}_1$$ and $$\mathcal{G}_2$$, were constructed from the book with distinct thresholds to prune connections, $$T_1=0.31$$ and $$T_2=0.18$$, respectively. With these thresholds, only 5 and 10% of the edges, respectively, remained in the network.
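The construction used here (overlapping windows of $$\Delta$$ paragraphs, cosine similarity between tf-idf vectors, and pruning so that a given fraction of the strongest edges remains) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, each window is treated as one document when computing idf, the natural logarithm is used, and ties at the pruning boundary are broken arbitrarily.

```python
import math
from collections import Counter
from itertools import combinations


def windows(paragraphs, delta):
    """All overlapping windows of `delta` consecutive paragraphs, flattened to word lists."""
    return [
        [w for p in paragraphs[k:k + delta] for w in p]
        for k in range(len(paragraphs) - delta + 1)
    ]


def tfidf_vectors(nodes):
    """Sparse tf-idf vector per node (assumption: every window counts as one document)."""
    n = len(nodes)
    doc_freq = Counter(w for node in nodes for w in set(node))
    vectors = []
    for node in nodes:
        counts = Counter(node)
        vectors.append(
            {w: (c / len(node)) * math.log(n / doc_freq[w]) for w, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors, as in Eq. (2.3); 0 if either is null."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mesoscopic_edges(paragraphs, delta, keep_fraction):
    """Unweighted edge set keeping the `keep_fraction` strongest similarities."""
    vectors = tfidf_vectors(windows(paragraphs, delta))
    weighted = [
        (cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    weighted.sort(reverse=True)  # strongest connections first
    n_keep = round(keep_fraction * len(weighted))
    return {(i, j) for _, i, j in weighted[:n_keep]}


# Hypothetical pre-processed paragraphs; the first and last share their vocabulary.
paragraphs = [
    ["alice", "rabbit"],
    ["rabbit", "hole"],
    ["queen", "croquet"],
    ["queen", "tarts"],
    ["alice", "rabbit"],
]
edges = mesoscopic_edges(paragraphs, delta=1, keep_fraction=0.1)
```

In this toy example, keeping 10% of the ten possible edges retains only the link between the first and the last paragraph, which is a long-range connection in the sense used in this article.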
It is important to highlight that the parameter $$\Delta$$ was set in order to maintain almost all network nodes in a single connected component, for all considered $$T$$ values. Additionally, a higher $$\Delta$$ value implies a higher number of paragraphs representing each node, which can obscure information at smaller scales. We start the analysis of the mesoscopic structures by investigating the properties of the $$\mathcal{G}_1$$ network, which is simpler than $$\mathcal{G}_2$$. For this analysis, we consider a 2D visualization of $$\mathcal{G}_1$$, which is shown in Fig. 2. This visualization was obtained by employing the FR algorithm mentioned in Section 2. Because nodes sharing the same paragraphs become strongly connected among themselves, a pronounced chain-like structure naturally emerges in the mesoscopic network. In addition, this structure is related to the order of the nodes along the book. This property is better observed in Fig. 2(a), where the colour of each node indicates its position along the text. In mesoscopic networks, connections among distant nodes indicate regions of high contextual similarity that are not a result of overlapping sequences of paragraphs. In these networks, the structure connects contextually similar regions of nodes which, in turn, brings them closer along the chain-like structure of the network.

Fig. 2. Visualization of the network $$\mathcal{G}_1$$, representing the book Alice’s Adventures in Wonderland with a threshold $$T_1=0.31$$. Each node indicates a sequence of paragraphs. The order of the nodes according to the story is shown in (a). The first nodes of the story appear in blue, while the last nodes are represented in an orange colour. In (b), the chapters of the nodes are represented with distinct colours. Color is available online.

In order to better understand the relationship between the mesoscopic structure and the contextual information of the book, we segmented the obtained network according to the chapter organization of the book. This is visualized in Fig. 2(b), in which the chapter of each node is indicated by a colour, according to the legend. Considering the connectivity among the chapters of the book, we derived the following observations:

(i) In Chapter 1, we note that there is no strong connection among its paragraphs and those from other chapters, except for Chapters 2 and 3, which is explained by the aforementioned overlap between subsequent paragraphs. The lack of long-range connections among some nodes of the first chapter may happen because the beginning of the book is substantially different from almost every other part. In this chapter, the story starts in a more realistic scenario and it presents fewer descriptions of the fantasy locations and creatures than the rest of the book;

(ii) Chapters 2, 3 and the beginning of Chapter 4 are connected among themselves. The vocabulary used in Chapter 2 includes many negative words, such as poor, hopeless, cry, tears and tired, in order to express the situation and how Alice was feeling. Chapters 3 and 4 are not so emotional, but their texts still have a few negative words and some recollections of past events. In particular, all these chapters describe the period of the story when Alice was very frightened of the world she had just jumped into, and she constantly remembered her cat, Dinah, and how good her pet was at catching other animals. In addition, all these chapters mention when she cried and formed a pool of tears;

(iii) In Chapter 5, there are strong connections between regions from the same chapter. This probably happens because the long conversation between Alice and the Caterpillar revolved around a few topics. They mainly discussed the many sizes she had during that day and how confused she was about herself;

(iv) Even though Chapters 7 and 11 describe different events (the first is about the tea party and the other is about a trial), there are some edges connecting nodes from these chapters. One possible explanation is that both chapters include the characters Alice, March Hare, Dormouse and the Hatter, who is having tea in both situations. Furthermore, these chapters present specific kinds of food and beverage related to the tea party, for example, tea, bread and butter;

(v) There is a group of highly connected nodes at the end of Chapter 9 and at the beginning of Chapter 10. This probably happens because Alice is introduced to the Mock Turtle in the last paragraphs of Chapter 9 and their conversation ends only in Chapter 10. Moreover, one topic shared between these chapters is related to teaching. At the end of Chapter 9, the Mock Turtle explains to Alice his days at school and the lessons he used to have. In the next chapter, some animals ask Alice to repeat some lessons, and she feels as if she were at school.

Figure 3 displays a visualization of the $$\mathcal{G}_2$$ network, which was constructed using a lower threshold value, $$T_2=0.18$$. By using two threshold choices, it has been possible to illustrate the potential of our method in describing the characteristics of the network in a multi-scale fashion. From Fig.
3(a), we can observe that the network still has a chain-like structure similar to that found in $$\mathcal{G}_1$$. However, this network presents more connections among nodes from different parts of the book. This is because the $$\mathcal{G}_2$$ network captures more fine-grained information about the relationships among the paragraphs. Comparing $$\mathcal{G}_1$$ and $$\mathcal{G}_2$$, we note that while Chapter 1 is connected only with Chapters 2 and 3 in $$\mathcal{G}_1$$, in $$\mathcal{G}_2$$ it also connects with other parts of the book, in particular, with Chapters 4 and 7. However, the analysis of fine-grained networks may present some disadvantages because these networks tend to incorporate more local characteristics. Moreover, they may include noise and relationships not driven by a strong contextual content.

Fig. 3. Visualization of the network $$\mathcal{G}_2$$, representing the book Alice’s Adventures in Wonderland with a threshold $$T_2=0.18$$. The nodes indicate sets of 20 adjacent paragraphs. Item (a) shows the order of the nodes according to the story, where the first nodes of the story appear in blue and the last nodes in orange. Item (b) represents the chapters, in which the nodes are represented with distinct colours. Color is available online.

4. Discriminating real from ST

To illustrate the ability of the proposed representation to grasp semantical information of texts by considering topological features, we evaluated the efficiency of the method in discriminating RT from texts conveying no meaning, which are here represented by ST. This is an important potential application of the proposed approach as an aid to fraud identification, such as inferring whether texts in unknown languages are meaningful or not. In Fig. 4, we show the two networks obtained from the book ‘Alice’s Adventures in Wonderland’ and the respective values of clustering coefficient (for two thresholds) along the document. Note that an interesting pattern emerges in both cases. Regions encompassing many long-range connections are characterized by low values of clustering. In addition, there is a complex pattern of intermittent appearances of low values of clustering coefficient in regions devoid of long-range connections. A similar behaviour occurred with the matching index (result not shown), which also captures the presence of long-range links. As explained in Section 2.2, the matching index quantifies the number of shared neighbours between two interconnected nodes. This measurement is used to identify long-range links since, by construction, nodes representing overlapping sets of paragraphs share several neighbours. Conversely, nodes representing distant regions in the network usually do not share many neighbours.

Fig. 4. Visualization of the networks representing ‘Alice’s Adventures in Wonderland’. Item (a) represents the network $$\mathcal{G}_1$$ with a threshold $$T_1=0.31$$ and item (b) represents the network $$\mathcal{G}_2$$ with a threshold $$T_2=0.18$$. The node colours indicate the value of the clustering coefficient, in which nodes with the highest values are represented in orange. Note that there is a non-trivial pattern of clustering coefficient along the network nodes. Color is available online.
In Fig. 5, we show the behaviour of the clustering coefficient along time for the real book and its two respective meaningless versions formed by shuffled paragraphs and words. It is clear from the figure that, on average, the clustering coefficient of all three versions fluctuates around $$C\simeq0.78$$. However, the patterns of fluctuations are markedly dissimilar. The largest variations arise for the real book, while both shuffled versions seem to display larger regions of weak fluctuations (see e.g. nodes from 180 to 280 in Fig. 5(b)). A similar pattern was obtained for the matching index measurement. Owing to the clear patterns in the fluctuations of local density discriminating real and meaningless texts, we applied measurements to quantify the mentioned fluctuations in order to check how much the proposed model depends on the text unfolding.

Fig. 5. Clustering coefficient for all network nodes of real and shuffled versions (RT, SW and SP) created from the book ‘Alice’s Adventures in Wonderland’. The threshold $$T_1=0.31$$ was chosen to select the strongest semantical links.

The fluctuations observed in Fig.
5 were characterized with the coefficient of variation in a set of observations $$X$$, where $$X$$ here represents the ordered set of values of $$C$$ or $$\mu$$. The coefficient of variation ($$c_v(X)$$) is defined [51] as

$$c_v(X) = {\sigma(X)}/{\langle X \rangle},$$ (4.1)

where $$\sigma(X)$$ and $$\langle X \rangle$$ are the standard deviation and the average of $$X$$, respectively. For a choice of a window size, $$\delta$$, and for each possible subsequence of $$X$$, $$\mathcal{X}^\delta_k = \{x_k, x_{k+1},\dots,x_{k+\delta-1}\}$$, the coefficient of variation, $$c_v(\mathcal{X}^\delta_k)$$, is calculated. In order to consider the coefficient of variation in a mesoscale, we analyse the series in a multi-scale fashion, by considering $$\delta \in \{3,5,7,10,15,20,25,30,35,40,50\}$$. For each value of window size $$\delta$$, we summarize the values of fluctuations by averaging over all $$c_v(\mathcal{X}^\delta_k)$$, that is,

$$\mathcal{C}_v^\delta(X) = \frac{1}{N} \sum_{k=1}^{n-\delta+1} c_v(\mathcal{X}^\delta_k).$$ (4.2)

Finally, each network was characterized by the set of features $$\mathcal{F} = \{\mathcal{C}_v^{\delta=3},\mathcal{C}_v^{\delta=5},\mathcal{C}_v^{\delta=7},\ldots\}$$, with $$X$$ being the values of clustering coefficient and matching index. To validate the potential of our mesoscopic model to extract the information from the document story, we considered the problem of discriminating real from meaningless (shuffled) texts using a dataset comprising several books (see details in Appendix A). We first visualized all three classes of texts in a bidimensional principal component analysis (PCA) projection [52]. The results are shown in Fig. 6(a), in which the two first components account for approximately 76% of the projection. Remarkably, the networks are usually placed close to others from the same class, while being well-separated from other classes.
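The multi-scale fluctuation features of Eqs. (4.1) and (4.2) can be sketched as below. Two choices in this sketch are assumptions not fixed by the text: the population standard deviation is used for $$\sigma$$, and the normalization $$N$$ is taken to be the number of windows of size $$\delta$$. The function names are hypothetical, and window sizes longer than the series are simply skipped.

```python
from statistics import mean, pstdev


def coefficient_of_variation(xs):
    """Eq. (4.1): standard deviation over mean (assumption: population std)."""
    m = mean(xs)
    return pstdev(xs) / m if m else 0.0


def multiscale_cv(series, deltas):
    """Eq. (4.2) for each window size: average c_v over all sliding windows.

    Assumption: the normalization N is the number of windows of size delta.
    """
    features = {}
    for delta in deltas:
        n_windows = len(series) - delta + 1
        if n_windows < 1:
            continue  # the series is shorter than the window
        cvs = [
            coefficient_of_variation(series[k:k + delta]) for k in range(n_windows)
        ]
        features[delta] = sum(cvs) / n_windows
    return features


# Hypothetical series of local clustering values ordered along the text.
flat = [0.8] * 10
alternating = [1, 3, 1, 3]
flat_features = multiscale_cv(flat, [3, 5, 7])
alt_features = multiscale_cv(alternating, [2, 5])
```

A constant series yields zero for every window size, whereas the alternating series gives a value of 0.5 for windows of two elements; applying the function to both the clustering and the matching index series gives the feature set $$\mathcal{F}$$.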
This latter effect is confirmed in terms of the average distance between classes shown in Table 1. Our results are compared with those obtained with the traditional approach based on co-occurrence networks (see details in Appendix B). The PCA projection of these networks is shown in Fig. 6(b). Although the first two principal components account for 70% of the variance, the groups of networks from RT and SP are not distinguishable. This behaviour was expected because co-occurrence networks were first devised to grasp linguistic/syntactical features. When the language structure is kept and only the mesoscopic structure is changed (as in SP texts), the co-occurrence approach is unable to discriminate real from meaningless texts. The poor discriminability observed is confirmed by the distances shown in Table 2.

Table 1 Mesoscopic network: average distance between classes. Note that, when using mesoscopic networks, it is possible to discriminate real texts from those generated by both shuffled words and shuffled paragraphs

|    | RT    | SW    | SP    |
|----|-------|-------|-------|
| RT | 0.00  | 13.67 | 10.98 |
| SW | 13.67 | 0.00  | 13.21 |
| SP | 10.98 | 13.21 | 0.00  |

Table 2 Co-occurrence network: average distance between classes. If co-occurrence networks are used, real texts and texts formed by shuffled paragraphs cannot be discriminated

|    | RT   | SW   | SP   |
|----|------|------|------|
| RT | 0.00 | 7.18 | 0.12 |
| SW | 7.18 | 0.00 | 7.08 |
| SP | 0.12 | 7.08 | 0.00 |

Fig. 6. PCA projections of the networks generated from RT, SP and SW texts: (a) mesoscopic networks and (b) co-occurrence networks. Color is available online.

The discriminability between RT and the two classes of ST was also evaluated using an unsupervised approach based on the K-means algorithm [53]. Here, we used six principal components as features, as this choice yielded optimized results. Considering all documents of the datasets, only 8.9% of instances were incorrectly clustered with the mesoscopic approach. Interestingly, the clustering generated by the algorithm yielded only 0.02% of false negatives for the SP class. A feature relevance analysis revealed that the clustering coefficient outperforms the matching index for the clustering task when the algorithm is applied using the measurements separately. When only the clustering coefficient or only the matching index is used, the percentages of incorrectly assigned instances are 11.7% and 16.7%, respectively. The unsupervised approach was also used to compare the proposed methodology and traditional co-occurrence networks. In this analysis, we used 10 principal components, as this number of features yielded optimized results. The quality of clusters was estimated in terms of the accuracy and the adjusted Rand index (ARI) [54]. The cluster quality indexes obtained for both types of networks are shown in Table 3. Co-occurrence networks could not properly distinguish the RT from the SP class, as expected from the analysis of Fig. 6(b). In this scenario, 72.5% of SP texts were incorrectly classified as RT. This inability is also reflected in the ARI, which is much lower for co-occurrence networks. Such a result confirms that the co-occurrence networks are unable to distinguish between RT and SP. This is because in both cases, RT and SP, the syntax of the original texts is maintained and co-occurrence networks represent the syntactical relationship between words.
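To make the two cluster-quality measurements concrete, the sketch below computes the adjusted Rand index from the pair-counting formulation of [54] and a clustering accuracy from scratch. It is an illustration under assumptions rather than the evaluation code used in the study: the function names are hypothetical, and the accuracy is obtained by trying every one-to-one assignment of cluster identifiers to class labels, which is feasible for three classes but is only one possible convention (the text does not state how clusters were matched to classes).

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # a single item admits only one partition
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)


def clustering_accuracy(labels_true, labels_pred):
    """Best fraction of correct items over all cluster-to-class assignments.

    Assumption: there are no more clusters than classes.
    """
    classes = list(set(labels_true))
    clusters = list(set(labels_pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best = max(best, hits)
    return best / len(labels_true)
```

For instance, with true labels `["RT", "RT", "SW", "SW", "SP", "SP"]` and clusters `[0, 0, 1, 1, 2, 0]`, one SP text falls into the cluster dominated by RT, so the accuracy is 5/6 while the ARI drops to roughly 0.44, which shows how much more strongly the ARI penalizes a single misplaced item in a small sample.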
In contrast, the mesoscopic network model, which takes into consideration the organization of the paragraphs along the text, was able to distinguish the texts among the three classes.

Table 3 Comparison of the K-means clustering performance among different network approaches. Two different measurements were applied: Adjusted Rand Index (ARI) and Accuracy. In both measurements, 1 indicates that all instances are correctly classified and 0 indicates the opposite

| Network approach            | ARI   | Accuracy |
|-----------------------------|-------|----------|
| Mesoscopic (Clustering)     | 0.679 | 0.883    |
| Mesoscopic (Matching Index) | 0.576 | 0.833    |
| Mesoscopic (all features)   | 0.749 | 0.911    |
| Co-occurrence               | 0.268 | 0.575    |

A particular feature of the mesoscopic model is the existence of long-range connections. More specifically, a long-range connection is a link that connects two nodes that are far apart in the document. This type of link usually appears when a subject/context previously mentioned in the book is revisited in the story. It has been conjectured that such links, a consequence of the long-range correlation effect [55], are essential for mapping a multidimensional conceptual space into a space of lower dimension [56]. Such a characteristic can be related to the writing style or the nature of the book (e.g. tales, novels or scientific writing). To quantify the presence of long-range links, Fig. 7 shows scatterplots of all edge weights versus the time difference between linked nodes, where time corresponds to the natural reading order. The wide distribution of weights obtained for the smallest time differences is imposed by the construction rules of mesoscopic networks, in which many edges are established between successive nodes. Long-range connections were observed in all three classes of texts (i.e. RT, SW and SP). However, most of such connections are very weak.
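The quantities behind such a scatterplot are simple to extract. The following sketch assumes, for illustration only, that the network is available as a list of `(i, j, weight)` tuples in which `i` and `j` are the positions of the two nodes in reading order; the function names and the toy edge list are hypothetical, and the cut-off of 100 follows the definition of long-range links given in the caption of Fig. 7.

```python
def time_difference_vs_weight(edges):
    """Return (time difference, weight) pairs, one per edge.

    Each edge is (i, j, weight); i and j are node positions in reading order.
    """
    return [(abs(j - i), w) for i, j, w in edges]


def long_range_links(edges, min_gap=100):
    """Keep only links whose time difference is larger than min_gap."""
    return [(gap, w) for gap, w in time_difference_vs_weight(edges) if gap > min_gap]


def mean_long_range_weight(edges, min_gap=100):
    """Average weight of long-range links (0.0 if there are none)."""
    weights = [w for _, w in long_range_links(edges, min_gap)]
    return sum(weights) / len(weights) if weights else 0.0


# Toy edge list: two links between successive nodes and two long-range links.
toy_edges = [(0, 1, 0.9), (1, 2, 0.8), (0, 150, 0.3), (10, 400, 0.5)]
print(long_range_links(toy_edges))
print(mean_long_range_weight(toy_edges))
```

Comparing a summary such as `mean_long_range_weight` across the RT, SP and SW networks of the same book would be one simple way to put a number on the visual impression given by the scatterplots.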
As depicted in the inset of Fig. 7, RT tend to present stronger long-range connections than ST, especially in the time frame of 300 to 400 paragraphs. The presence of strong long-range connections only in the RT case can be interpreted as a consequence of contextual relationships along the story. These characteristics were observed in Section 3, where we described the network connections according to the unfolding of the book.

Fig. 7. Time difference between linked nodes vs. their respective edge weights: (a) RT, (b) SP and (c) SW. Note that it is hard to distinguish among the points in the scatterplots because they are very close. This measurement was computed for all network edges and the inset represents a region of long-range links, that is, links with a time difference larger than 100. Comparing the different classes of texts, it is evident that strong long-range connections are more likely to appear in the networks of real texts.

5. Conclusion

In order to grasp semantical, mesoscopic properties of texts modelled as networks, we proposed an approach that considers the semantical similarity between textual segments. Differently from previous representations, we modelled sequences of adjacent paragraphs as nodes, whose links are established by content similarity. By doing so, we could capture two important features present in written texts: long-range correlations and the temporal unfolding of documents.
In addition, the proposed approach for text representation also allows a multi-scale representation of documents. Specifically, two parameters control the scale: (i) $$\Delta$$: the number of consecutive paragraphs in each window, and (ii) $$T$$: the threshold used to prune connections among nodes with low contextual similarity. As a case study, we tested our approach on ‘Alice’s Adventures in Wonderland’, by employing network visualization techniques on the generated mesoscopic network. Many insights could be drawn from the visualization by tracing a parallel between its underlying structure and the story. In particular, we investigated the correspondence between the content of each chapter and the underlying network structure arising from the proposed model. Our model uncovered many relationships among different contexts sharing the same topics, such as similar characters or places throughout the story. For example, the high contextual similarity found between chapters 7 and 11 can be explained by the fact that both chapters share a recurrent subject revolving around the Hatter and the tea party theme. Note that similar textual inferences could not be drawn from models solely based on local features, as is the case for traditional word-adjacency or syntactical networks, which mostly emphasize stylistic textual subtleties. The effectiveness of our model was also evaluated with respect to the task of discriminating RT from ST. The shuffled versions were created by shuffling either the words or the paragraphs of RT. We have found that, if we consider only two simple local density measurements, it is possible to separate all three classes of texts with high accuracy. The traditional co-occurrence model turned out to grasp only local subtleties, as it was not able to discriminate RT from texts generated by shuffling paragraphs.
This happens because, when paragraphs are shuffled, only a few edges (those at the paragraph boundaries) are modified. These results confirm the suitability of the proposed model in capturing larger contexts in a mesoscopic fashion. A further analysis of the model also revealed that RT are characterized by stronger long-range links, a feature that could be explored in tests of informativeness of written documents [40]. An important step in our methodology is the selection of parameters, in particular the choice of the window size $$\Delta$$ and the threshold $$T$$, which define the characteristic scale of analysis for the resulting network. Here, we provided means to configure the method so that the network forms a single connected component. However, more sophisticated approaches can be used to improve such a selection, for instance by using complexity measurements such as Estrada’s heterogeneity index [57], the Laplacian energy [58] and the von Neumann entropy [59]. The proposed methodology still presents some limitations regarding the selection of $$\Delta$$. For instance, for low values of $$\Delta$$, the resulting network would account for a smaller scale of analysis, thus allowing the investigation of the text at levels lower than chapters, down to paragraphs ($$\Delta = 1$$). However, this may affect the quality of the statistics extracted from the text, because each node then covers a smaller vocabulary and the similarity among nodes is computed from that vocabulary. Alternatively, other methods could be employed to account for the textual similarity. For instance, word embeddings [60, 61], a well-known vectorial representation, could be used to incorporate not only first-order statistics but also the semantical relationships among words. The proposed network representation paves the way for developing new techniques that could be applied to automatically analyse the mesoscopic structure of documents.
These techniques could improve traditional approaches used to tackle typical text mining problems under a new perspective. This capability should be further explored in future works, for instance, by measuring the efficiency of our model in text classification, summarization and similar applications in which an accurate semantic analysis plays a prominent role in the characterization of written texts.

Funding

The authors acknowledge financial support from Capes-Brazil, Sao Paulo Research Foundation (FAPESP) (grant no. 2016/19069-9, 2015/08003-4, 2015/05676-8, 2014/20830-0 and 2011/50761-2), CNPq-Brazil (grant no. 307333/2013-2) and NAP-PRP-USP.

Appendix A. Dataset

All the texts used in our dataset were extracted from the open access Project Gutenberg dataset (see footnote 1). We divided the dataset into two major groups, according to the original language: (i) English and (ii) Other languages. The books, sorted by language and author, are listed below:

English:
Arthur Conan Doyle: The Adventures of Sherlock Holmes; The Tragedy of the Korosko; The Valley of Fear; Through the Magic Door and Uncle Bernac - A Memory of the Empire;
Bram Stoker: Dracula’s Guest; The Lair of the White Worm; The Jewel of Seven Stars; The Man and The Mystery of the Sea;
Charles Dickens: A Tale of Two Cities; American Notes; Barnaby Rudge: A Tale of the Riots of Eighty; Great Expectations and Hard Times;
Edgar Allan Poe: The Works of Edgar Allan Poe (Volumes 1–5);
Hector H. Munro (Saki): Beasts and Super-Beasts; The Chronicles of Clovis; The Toys of Peace; When William Came and The Unbearable Bassington;
P. G. Wodehouse: The Girl on the Boat; My Man Jeeves; Something New; The Adventures of Sally and The Clicking of Cuthbert;
Thomas Hardy: A Pair of Blue Eyes; Far from the Madding Crowd; Jude the Obscure; The Mayor of Casterbridge and The Hand of Ethelberta;
William M. Thackeray: Barry Lyndon; The Book of Snobs; The History of Pendennis; The Virginians and Vanity Fair.

Other languages:
French: Gustave Aimard: Le fils du Soleil; Jules Verne: Face au Drapeau; Louis Amédée Achard: Pierre de Villerglé; Louis Reybaud: Les Idoles d’argile; Victor Hugo: Han d’Islande.
German: Goethe: Die Wahlverwandtschaften; Jakob Wassermann: Der Moloch; Robert Walser: Geschwister Tanner; Thomas Mann: Königliche Hoheit; Wilhelm Hauff: Lichtenstein.
Italian: Alberto Boccardi: Il Peccato di Loreta; Anton Giulio Barrili: La Montanara; Enrico Castelnuovo: Alla Finestra; Guido da Verona: Sciogli la treccia, Maria Maddalena; Virginia Mulazzi: La Pergamena Distrutta.
Portuguese: Camilo Castelo Branco: Amor de Perdição; Eça de Queirós: A cidade e as Serras; Faustino da Fonseca: Os Bravos do Mindello; Jaime de Magalhães Lima: Transviado; Júlio Dinis: Uma Família Inglesa.

Appendix B. Characterization of co-occurrence networks

Typically, co-occurrence (or word adjacency) networks are formed by mapping each concept into a distinct node of the network. The edges are established by adjacency relationships, that is, if two words are adjacent in the text, they are connected in the network. Such networks have been extensively explored in the context of text analysis and pattern recognition [46]. In the present work, we compare the properties of the mesoscopic and co-occurrence models. We compare the mesoscopic results with a set of centrality measurements of co-occurrence networks used in Ref. [33], which are: accessibility [62], betweenness centrality [63], closeness centrality, clustering coefficient, degree, eccentricity [64], eigenvector centrality [65], generalized accessibility [66], modularity [67] (computed with the fast greedy algorithm [68]), neighbourhood connectivity, number of nodes, PageRank [69] and symmetry [33].
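The adjacency rule described above, and the reason why shuffling paragraphs barely changes a co-occurrence network, can be illustrated with a short sketch. It rests on simplifying assumptions that are not taken from the study: words are obtained by lower-casing and splitting on whitespace (no stopword removal or lemmatization), edges are undirected and unweighted, and all function names are hypothetical.

```python
import random


def cooccurrence_edges(paragraphs):
    """Undirected edges between immediately adjacent words.

    Assumption: the text is read as one word sequence, so the last word of a
    paragraph is also adjacent to the first word of the next paragraph.
    """
    words = [w for p in paragraphs for w in p.lower().split()]
    return {frozenset(pair) for pair in zip(words, words[1:]) if pair[0] != pair[1]}


def shuffle_paragraphs(paragraphs, seed=0):
    """Return a copy of the paragraph list in random order (SP-like text)."""
    rng = random.Random(seed)
    shuffled = list(paragraphs)
    rng.shuffle(shuffled)
    return shuffled


def jaccard(edges_a, edges_b):
    """Fraction of edges shared by two networks."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


text = ["alice saw the rabbit", "the queen shouted loudly", "a cat grinned"]
original = cooccurrence_edges(text)
shuffled = cooccurrence_edges(shuffle_paragraphs(text))
print(jaccard(original, shuffled))
```

Every edge formed inside a paragraph survives the shuffle, so only the handful of edges that straddle paragraph boundaries can differ. With paragraphs of realistic length the overlap is therefore close to 1, consistent with the very small distance between RT and SP reported for co-occurrence networks in Table 2.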
Apart from modularity, we compute the following quantities for each measurement: maximum value ($$\max(X)$$), median ($$\tilde X$$), minimum value ($$\min(X)$$) and standard deviation ($$\sigma(X)$$). To create the co-occurrence networks, we trimmed the texts to the same number of words because many of the above complex network measurements are influenced by the number of nodes. Because the number of network nodes varies in mesoscopic networks, we did not use the same set of measurements as for the co-occurrence networks. Furthermore, in the co-occurrence network analysis, we only used texts written in English because this kind of representation captures information regarding the syntax, which is different for each language.

References

1. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D.-U. (2006) Complex networks: structure and dynamics. Phys. Rep., 424, 175–308.
2. Barabási, A.-L. & Oltvai, Z. N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet., 5, 101–113.
3. de Arruda, H. F., Comin, C. H., Miazaki, M., Viana, M. P. & da Fontoura Costa, L. (2015) A framework for analyzing the relationship between gene expression and morphological, topological, and dynamical patterns in neuronal networks. J. Neurosci. Methods, 245, 1–14.
4. Barabási, A.-L., Gulbahce, N. & Loscalzo, J. (2011) Network medicine: a network-based approach to human disease. Nat. Rev. Genet., 12, 56–68.
5. Kalimeri, M., Constantoudis, V., Papadimitriou, C., Karamanos, K., Diakonos, F. K. & Papageorgiou, H. (2015) Word-length entropies and correlations of natural language written texts. J. Quant. Linguist., 22, 101–118.
6. Moreno, Y., Nekovee, M. & Pacheco, A. F. (2004) Dynamics of rumor spreading in complex networks. Phys. Rev. E, 69, 066130.
7. Manning, C. D. & Schütze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
8. Altmann, E. G., Pierrehumbert, J. B. & Motter, A. E. (2009) Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4, e7678.
9. Nahm, U. Y. & Mooney, R. J. (2002) Text mining with information extraction. AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, Vol. 1.
10. Joachims, T. (2001) A statistical learning model of text classification for support vector machines. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New Orleans, Louisiana, USA: ACM, pp. 128–136.
11. Hotho, A., Nürnberger, A. & Paaß, G. (2005) A brief survey of text mining. Ldv Forum, Vol. 20, pp. 19–62.
12. Ramos, J. (2003) Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning.
13. AlSumait, L., Barbará, D. & Domeniconi, C. (2008) On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. ICDM’08 Eighth IEEE International Conference on Data Mining, 2008. Pisa, Italy: IEEE, pp. 3–12.
14. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022.
15. Landauer, T. K., Foltz, P. W. & Laham, D. (1998) An introduction to latent semantic analysis. Discourse Process., 25, 259–284.
16. Hatzivassiloglou, V., Gravano, L. & Maganti, A. (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece: ACM, pp. 224–231.
17. Chang, Y.-L. & Chien, J.-T. (2009) Latent Dirichlet learning for document summarization. IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. New York, NY, USA: IEEE, pp. 1689–1692.
18. Wei, Y., Singh, L., Gallagher, B. & Buttler, D. (2016) Overlapping target event and story line detection of online newspaper articles. IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016. Montreal, QC, Canada: IEEE, pp. 222–232.
19. Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011) Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150.
20. Chen, X., Hao, P., Chandramouli, R. & Subbalakshmi, K. (2011) Authorship similarity detection from email messages. Machine Learning and Data Mining in Pattern Recognition. Berlin, Heidelberg: Springer, pp. 375–386.
21. Liu, S., Wu, Y., Wei, E., Liu, M. & Liu, Y. (2013) Storyflow: tracking the evolution of stories. IEEE Trans. Vis. Comput. Graph., 19, 2436–2445.
22. Prado, S. D., Dahmen, S. R., Bazzan, A. L., Carron, P. M. & Kenna, R. (2016) Temporal network analysis of literary texts. Adv. Complex Syst., 19(3), 1650005 (19 pages).
23. Tanahashi, Y. & Ma, K.-L. (2012) Design considerations for optimizing storyline visualizations. IEEE Trans. Vis. Comput. Graph., 18, 2679–2688.
24. Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M. & Dodds, P. S. (2016) The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science, 5, 31.
25. Amancio, D. R., Oliveira Jr, O. N. & Costa, L. F. (2012) Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts. Phys. A, 391, 4406–4419.
26. Kulig, A., Drożdż, S., Kwapień, J. & Oświecimka, P. (2015) Modeling the average shortest-path length in growth of word-adjacency networks. Phys. Rev. E, 91, 032810.
27. Ferrer i Cancho, R., Solé, R. V. & Köhler, R. (2004) Patterns in syntactic dependency networks. Phys. Rev. E, 69, 051915.
28. Feldman, R. (2013) Techniques and applications for sentiment analysis. Commun. ACM, 56, 82–89.
29. Amancio, D. R. (2015) Authorship recognition via fluctuation analysis of network topology and word intermittency. J. Stat. Mech. Theory Exp., 2015, P03005.
30. Mehri, A., Darooneh, A. H. & Shariati, A. (2012) The complex networks approach for authorship attribution of books. Phys. A, 391, 2429–2437.
31. Segarra, S., Eisen, M. & Ribeiro, A. (2015) Authorship attribution through function word adjacency networks. IEEE Trans. Signal Process., 63, 5464–5478.
32. Amancio, D. R. (2015) A complex network approach to stylometry. PLoS One, 10, e0136076.
33. de Arruda, H. F., Costa, L. d. F. & Amancio, D. R. (2016) Using complex networks for text classification: discriminating informative and imaginative documents. Europhys. Lett., 113, 28007.
34. Amancio, D. R., Oliveira Jr, O. N. & da F. Costa, L. (2012) Unveiling the relationship between complex networks metrics and word senses. Europhys. Lett., 98, 18002.
35. Mihalcea, R., Tarau, P. & Figa, E. (2004) PageRank on semantic networks, with application to word sense disambiguation. Proceedings of the 20th International Conference on Computational Linguistics, COLING ’04. Stroudsburg, PA: Association for Computational Linguistics, 1126 pages.
36. Silva, T. C. & Amancio, D. R. (2012) Word sense disambiguation via high order of learning in complex networks. Europhys. Lett., 98, 58001.
37. Amancio, D. R., Nunes Jr, M. G., O. N. O. & da F. Costa, L. (2012) Extractive summarization using complex networks and syntactic dependency. Phys. A, 391, 1855–1864.
38. Antiqueira, L., Oliveira Jr, O. N., Costa, L. F. & Nunes, M. G. V. (2009) A complex network approach to text summarization. Inf. Sci., 179, 584–599.
39. Xuan, Q. & Wu, T.-J. (2009) Node matching between complex networks. Phys. Rev. E, 80, 026103.
40. Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr, O. N. & Costa, L. F. (2013) Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLoS One, 8, 1–10.
41. de Arruda, H. F., Costa, L. d. F. & Amancio, D. R. (2016) Topic segmentation via community detection in complex networks. Chaos, 26(6), 063120.
42. Han, J. (2005) Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers Inc.
43. Fruchterman, T. & Reingold, E. (1991) Graph drawing by force-directed placement. Software: Practice and Experience, 21, 1129–1164.
44. Watts, D. J. & Strogatz, S. H. (1998) Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442.
45. Amancio, D. R., Altmann, E. G., Oliveira Jr, O. N. & Costa, L. d. F. (2011) Comparing intermittency and network measurements of words and their dependence on authorship. New J. Phys., 13, 123024.
46. Masucci, A. & Rodgers, G. (2006) Network properties of written human language. Phys. Rev. E, 74, 026102.
47. Sheng, L. & Li, C. (2009) English and Chinese languages as weighted complex networks. Phys. A, 388, 2561–2570.
48. Kaiser, M. & Hilgetag, C. C. (2004) Edge vulnerability in neural and metabolic networks. Biol. Cybernet., 90, 311–317.
49. Newman, M. (2010) Networks: An Introduction. New York, NY: Oxford University Press, Inc.
50. Sporns, O. (2003) Graph theory methods for the analysis of neural connectivity patterns. Neuroscience Databases (Kötter R. ed.). Boston, MA: Springer, pp. 171–185.
51. Das, N. (2008) Statistical Methods-Combined Edition, Vols. i and ii. West Patel Nagar, New Delhi, India: Tata McGraw Hill Education Private Limited, PAGES-4, 5: 290.
52. Jolliffe, I. (2002) Principal Component Analysis. New York: Springer Verlag.
53. Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., Witten, I. H. & Trigg, L. (2009) Weka-a machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook (Maimon O. & Rokach L. eds). Boston, MA: Springer, pp. 1269–1277.
54. Hubert, L. & Arabie, P. (1985) Comparing partitions. J. Classif., 2, 193–218.
55. Ebeling, W. & Neiman, A. (1995) Long-range correlations between letters and sentences in texts. Phys. A, 215, 233–241.
56. Alvarez-Lacalle, E., Dorow, B., Eckmann, J.-P. & Moses, E. (2006) Hierarchical structures induce long-range dynamical correlations in written texts. Proc. Natl. Acad. Sci., 103, 7956–7961.
57. Estrada, E. (2010) Quantifying network heterogeneity. Phys. Rev. E, 82, 066102.
58. Gutman, I. & Zhou, B. (2006) Laplacian energy of a graph. Linear Algebra Appl., 29–37.
59. Braunstein, S. L., Ghosh, S. & Severini, S. (2006) The Laplacian of a graph as a density matrix: a basic combinatorial approach to separability of mixed states. Ann. Comb., 10, 291–317.
60. Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013) Efficient estimation of word representations in vector space. ArXiv preprint arXiv:1301.3781.
61. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q. eds). Curran Associates, Inc., pp. 3111–3119.
62. Travençolo, B. A. N. & Costa, L. d. F. (2008) Accessibility in complex networks. Phys. Lett. A, 373, 89–95.
63. Freeman, L. (1977) A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.
64. Estrada, E. (2012) The Structure of Complex Networks: Theory and Applications. New York, NY, USA: Oxford University Press.
65. Bonacich, P. (1987) Power and centrality: a family of measures. Amer. J. Sociol., 92, 1170–1182.
66. de Arruda, G. F., Barbieri, A. L., Rodríguez, P. M., Rodrigues, F. A., Moreno, Y. & Costa, L. d. F. (2014) Role of centrality for the identification of influential spreaders in complex networks. Phys. Rev. E, 90, 032812.
67. Newman, M. E. & Girvan, M. (2004) Finding and evaluating community structure in networks. Phys. Rev. E, 69, 026113.
68. Clauset, A., Newman, M. E. & Moore, C. (2004) Finding community structure in very large networks. Phys. Rev. E, 70, 066111.
69. Langville, A. N. & Meyer, C. D. (2011) Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton, NJ, USA: Princeton University Press.

Footnotes

1 Project Gutenberg - https://www.gutenberg.org/

© The authors 2017. Published by Oxford University Press. All rights reserved.

Journal

Journal of Complex Networks, Oxford University Press

Published: Feb 1, 2018
