# Identifying networks with common organizational principles

Identifying networks with common organizational principles Abstract Many complex systems can be represented as networks, and the problem of network comparison is becoming increasingly relevant. There are many techniques for network comparison, from simply comparing network summary statistics to sophisticated but computationally costly alignment-based approaches. Yet it remains challenging to accurately cluster networks that are of a different size and density, but hypothesized to be structurally similar. In this article, we address this problem by introducing a new network comparison methodology that is aimed at identifying common organizational principles in networks. The methodology is simple, intuitive and applicable in a wide variety of settings ranging from the functional classification of proteins to tracking the evolution of a world trade network. 1. Introduction Many complex systems can be represented as networks, including friendships, the World Wide Web, global trade flows and protein-protein interactions [1]. The study of networks has been a very active area of research in recent years, and in particular, network comparison has become increasingly relevant for example [2–5]. Network comparison itself has many wide-ranging applications, for example, comparing protein–protein interaction networks could lead to increased understanding of underlying biological processes [4, 6]. Network comparison can also be used to study the evolution of networks over time and for identifying sudden changes and shocks. Methods for comparing networks range from comparison of summary statistics to sophisticated alignment-based approaches [3, 7, 8]. Alignment based approaches aim to identify structurally similar regions of networks by finding a mapping between network nodes that maximizes the overlap between the networks. This mapping information is often the principal aim of such methods which for instance in the context of protein–protein interactions, could be used to infer the biological function and structure of proteins in different organisms [3, 6–8]. Although network alignment is NP-hard in general several efficient heuristics have emerged over recent years [8, 9]. On the other hand, alignment-free methods in the form of network similarity measures and network distances aim to quantify the overall similarity of networks on the basis of network features. Alignment free methods typically are computationally less expensive than alignment based methods and can effectively compare large sets of networks. Network comparison measures have many applications such as goodness of fit tests of random graph models of real world networks [10–12] and the tracking the evolution of network time series [13]. Network comparison measures have also attracted increasing attention in the field of machine learning, where they are mostly referred to as graph kernels, with applications for example in personalized medicine for example [14], computer vision and drug discovery for example [15]. Real-world networks can be very large and are often inhomogeneous, which makes the problem of network comparison challenging, especially when networks differ significantly in terms of size and density. In this article, we address this problem by introducing a new network comparison methodology that is aimed at comparing networks according to their common organizational principles. The observation that the degree distribution of many real world networks is highly right skewed and in many cases approximately follows a power law has been very influential in the development of network science [16]. Consequently, it has become widely accepted that the shape of the degree distribution (e.g. binomial vs power law) is indicative of the generating mechanism underlying the network. In this article, we formalize this idea by introducing a measure that captures the shape of distributions. The measure emerges from the requirement that a metric between forms of distributions should be invariant under rescalings and translations of the observables. Based on this measure, we then introduce a new network comparison methodology, which we call NetEmd. Although our methodology is applicable to almost any type of feature that can be associated to nodes or edges of a graph, we focus mainly on distributions of small connected subgraphs, also known as graphlets. Graphlets form the basis of many of the state of the art network comparison methods [4, 5, 11] and hence using graphlet based features allows for a comparative assessment of the presented methodology. Moreover, certain subgraph patterns, called network motifs [17, 18], occur much more frequently in many real world networks than is expected on the basis of pure chance. Network motifs are considered to be basic building blocks of networks that contribute to the function of the network by performing modular tasks and have therefore been conjectured to be favoured by natural selection. This is supported by the observation that network motifs are largely conserved within classes of networks [10, 19]. Our methodology provides an effective tool for comparing networks even when networks differ significantly in size and density, which is the case in most applications. The methodology performs well on a wide variety of networks ranging from chemical compounds having as few as 10 nodes to internet network with tens of thousands of nodes. The method achieves state of the art performance even when based on a rather restricted set of features, specifically distributions of graphlets up to size 3. Graphlets of size 3 can be enumerated very efficiently and hence when based on these features the method scales favourably to networks with millions and even billions of nodes. The method also behaves well under the network sub-sampling [5, 20–22]. The methodology further meets the needs of researchers from a variety of fields, from the social sciences to the biological and life sciences, by being computationally efficient and simple to implement. We test the presented methodology in a large number of settings, starting with clustering synthetic and real world networks, where we find that the presented methodology outperforms state of the art graphlet-based network comparison methods in clustering networks of different sizes and densities. We then test the more fine grained properties of NetEmd using data sets that represent evolving networks at different points in time. Finally, we test whether NetEmd can predict functional categories of networks by exploring machine learning applications and find that classifiers based on NetEmd outperform state-of-the art graph classifiers on several benchmark data sets. 2. A measure for comparing shapes of distributions Here we build on the idea that the information encapsulated in the shape of the degree distribution and other network properties reflects the topological organization of the network. From an abstract point of view we think of the shape of a distribution as a property that is invariant under linear deformations i.e. translations and re-scalings of the axis. For example, a Gaussian distribution always has its characteristic bell curve shape regardless of its mean and standard deviation. Consequently, we postulate that any metric that aims to capture the similarity of shapes should be invariant under such linear transformations of its inputs. Based on these ideas we define the following measure between distributions $$p$$ and $$q$$ that are supported on $$\mathbb{R}$$ and have non-zero, finite variances:   $$\label{emdmet} {\rm EMD}^*(p,q)=\mathrm{inf}_{c\in\mathbb{R}}\left( {\rm EMD}\big(\tilde{p}(\cdot+c),\tilde{q}(\cdot)\big)\right)\!,$$ (2.1) where $${\rm EMD}$$ is the earth mover’s distance and $$\tilde{p}$$ and $$\tilde{q}$$ are the distributions obtained by rescaling $$p$$ and $$q$$ to have variance 1. More precisely, $$\tilde{p}$$ is the distribution obtained from $$p$$ by the transformation $$x\rightarrow \frac{x}{\sigma(p)}$$, where $$\sigma(p)$$ is the standard deviation of $$p$$. Intuitively, $${\rm EMD}$$ (also known as the first Wasserstein metric [23]) can be thought of as the minimal work, that is mass times distance, needed to ‘transport’ the mass of one distribution onto the other. For probability distributions $$p$$ and $$q$$ with support in $$\mathbb{R}$$ and bounded absolute first moment, the $${\rm EMD}$$ between $$p$$ and $$q$$ is given by $${\rm EMD}(p,q)=\int_{-\infty}^\infty|F(x)-G(x)|\,\mathrm{d}x$$, where $$F$$ and $$G$$ are the cumulative distribution functions of $$p$$ and $$q$$ respectively. In principle, $${\rm EMD}$$ in equation (2.1) can be replaced by almost any other probability metric $$d$$ to obtain a corresponding metric $$d^*$$. Here we choose $${\rm EMD}$$ because it is well suited to comparing shapes, as shown by its many applications in the area of pattern recognition and image retrieval [23]. Moreover, we found that $${\rm EMD}$$ produces superior results to classical $$L^1$$ and Kolmogorov distances, especially for highly irregular distributions that one frequently encounters in real world networks. For two networks $$G$$ and $$G'$$ and given network feature $$t$$, we define the corresponding $${\rm NetEmd}_t$$ measure by:   $${\rm NetEmd}_t (G,G')={\rm EMD}^*(p_t(G),p_t(G')),$$ (2.2) where $$p_t(G)$$ and $$p_t(G')$$ are the distributions of $$t$$ on $$G$$ and $$G'$$ respectively. $${\rm NetEmd}_t$$ can be shown to be a pseudometric between graphs for any feature $$t$$ (see Appendix Section B), that is it is non-negative, symmetric and satisfies the triangle inequality. Figure 1 gives examples where $$t$$ is taken to be the degree distribution, and $$p_t(G)$$ is the degree distribution of $$G$$. Fig. 1. View largeDownload slide Plots of rescaled and translated degree distributions for Barabasi–Albert (BA) and Erdős–Rényi (ER) models with $$N$$ nodes and average degree $$k$$: (a) BA $$N=5000$$, $$k=100$$ vs BA $$N=50,000$$, $$k=10$$. (b) ER $$N=5000$$, $$k=100$$ vs ER $$N=50000$$, $$k=10$$. (c) BA $$N=5000$$, $$k=100$$ vs ER $$N=5000$$, $$k=100$$. The $${\rm EMD}^*$$ distances between the degree distribution of two BA or ER models with quite different values of $$N$$ and $$k$$ are smaller than the $${\rm EMD}^*$$ distance between the degree distribution of a BA and ER model when the number of nodes and average degree are equal. Fig. 1. View largeDownload slide Plots of rescaled and translated degree distributions for Barabasi–Albert (BA) and Erdős–Rényi (ER) models with $$N$$ nodes and average degree $$k$$: (a) BA $$N=5000$$, $$k=100$$ vs BA $$N=50,000$$, $$k=10$$. (b) ER $$N=5000$$, $$k=100$$ vs ER $$N=50000$$, $$k=10$$. (c) BA $$N=5000$$, $$k=100$$ vs ER $$N=5000$$, $$k=100$$. The $${\rm EMD}^*$$ distances between the degree distribution of two BA or ER models with quite different values of $$N$$ and $$k$$ are smaller than the $${\rm EMD}^*$$ distance between the degree distribution of a BA and ER model when the number of nodes and average degree are equal. Metrics based on a single feature might not be effective in capturing the structural differences between networks as two networks that agree in terms of a single feature such as the degree distribution might still differ significantly in terms of other structures. For instance, there are many generation mechanisms that produce networks with power-law type degree distributions but differ significantly in other aspects. Consequently, measures that are based on the comparison of multiple features can be expected to be more effective at identifying structural differences between networks than measures that are based on a single feature $$t$$, because for two networks to be considered similar they must show similarity across multiple features. Hence, for a given set $$T=\{t_1,t_2,...,t_m\}$$ of network features, we define the $${\rm NetEmd}$$ measure corresponding to $$T$$ simply as:   $$\label{eq:def_netemd} {\rm NetEmd}_T(G,G')=\frac{1}{m}\sum_{j=1}^{m} {\rm NetEmd}_{t_j} (G,G').$$ (2.3) Although $${\rm NetEmd}$$ can in principle be based on any set $$T$$ of network features to which one can associate distributions, we initially consider only features that are based on distributions of small connected subgraphs, also known as graphlets. Graphlets form the basis of many state of the art network comparison methods and hence allow for a comparative assessment of the proposed methodology that is independent of the choice of features. First, we consider graphlet degree distributions (GDDs) [11] as our set of features. For a given graphlet $$m$$, the graphlet degree of a node is the number of graphlet-$$m$$ induced subgraphs that are attached to the node. One can distinguish between the different positions the node can have in $$m$$, which correspond to the automorphism orbits of $$m$$, see Fig. 2. We initially take the set of 73 GDDs corresponding to graphlets up to size 5 to be the default set of inputs, for which we denote the metric as $${\rm NetEmd}_{G5}$$. Fig. 2. View largeDownload slide Graphlets on two to four nodes. The different shades in each graphlet represent different automorphism orbits, numbered from 0 to 14. Fig. 2. View largeDownload slide Graphlets on two to four nodes. The different shades in each graphlet represent different automorphism orbits, numbered from 0 to 14. Later, we also explore alternative definitions of subgraph distributions based on ego networks, as well as the effect of varying the size of subgraphs considered in the input. Finally, we consider the eigenvalue spectra of the graph Laplacian and the normalized graph Laplacian as inputs. 3. Results In order to give a comparative assessment of $${\rm NetEmd}$$, we consider other graphlet based network comparison methods, namely GDDA [11], GCD [5] and Netdis [4]. These represent the most effective alignment-free network comparison methodologies in the existing literature. While GDDA directly compares distributions of graphlets up to size 5 in a pairwise fashion, GCD is based on comparing rank correlations between graphlet degrees. Here we consider both default settings of GCD [5], namely GCD11, which is based on a non-redundant subset of 11 graphlets up to size 4, and GCD73 which uses all graphlets up to size 5. Netdis differs from GDDA and GCD in that it is based on subgraph counts in ego-networks of nodes. Another important distinction is that Netdis first centers these raw counts by comparing them to the counts that could be expected under a particular null model before computing the final statistics. In our analysis, we consider two null models: an Erdös–Rényi random graph and a duplication divergence graph [24] which has a scale-free degree distribution as well as a high clustering coefficient. We denote these two variants as $${\rm Netdis}_{\rm ER}$$ and $${\rm Netdis}_{\rm SF}$$, respectively. We initially report results of a single variant $${\rm NetEmd}_{G5}$$ that is based on graphlets up to size 5 in order to provide a comparison between the different methods that is independent of the choice of features. Results obtained using different feature sets are discussed in Section 3.3. 3.1 Clustering synthetic and real world networks We start with the classical setting of network comparison where the task is to identify groups of structurally similar networks. The main challenge in this setting is to identify structurally similar networks even though they might differ substantially in terms of size and density. Given a set $$S=\{G_1,G_2,...,G_n\}$$ of networks consisting of disjoint classes $$C=\{c_1,c_2,...,c_m\}$$ one would like a network comparison measure $$d$$ to position networks from the same class closer to each other when compared to networks from other classes. Given a network $$G$$, this can be measured in terms of the empirical probability $$P(G)$$ that $$d(G,G_1)<d(G,G_2)$$ where $$G_1$$ is a randomly selected network from the same class as $$G$$ (excluding itself) and $$G_2$$ is a randomly selected network from outside the class of $$G$$ and $$d$$ is the network comparison statistic. Consequently, the performance over the whole data set is measured in terms of the quantity $$\overline{P}=\frac{1}{|S|}\sum_{G\in S}P(G)$$. It can be shown that $$P(G)$$ is equivalent to the area under the receiver operator characteristic curve of a binary classifier that for a given network $$G$$ classifies the $$k$$ nearest neighbours of $$G$$ with respect to $$d$$ as being in the same class as $$G$$ (See Appendix H). Hence, a measure that positions networks randomly has an expected $$\overline{P}$$ of 0.5 whereas $$\overline{P}=1$$ corresponds to perfect separation between classes. Other measures are discussed in the Appendix. Conclusions reached in this article hold regardless of which of these performance measures one uses. We first test $${\rm NetEmd}$$ on synthetic networks corresponding to realizations of eight random graph models, namely the Erdős–Rényi random graphs [25], the Barabasi Albert preferential attachment model [16], two duplication divergence models [24, 26], the geometric gene duplication model [27], 3D geometric random graphs [28], the configuration model [29], and Watts–Strogatz small world networks [30] (see Section G.1 in the Appendix for details). For synthetic networks, we consider three experimental settings of increasing difficulty, starting with the task of clustering networks that have same size $$N$$ and average degree $$k$$ according to generating mechanism—a task that is relevant in a model selection setting. For this we generate 16 data sets, which collectively we call $$RG_1$$, corresponding to combinations of $$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$, each containing 10 realizations per model, that is 80 networks. This is an easier problem than clustering networks of different sizes and densities, and in this setting we find that the $$\overline{P}$$ scores (see Fig. 3c) of top performing measures tend to be within one standard deviation of each other. We find that $${\rm NetEmd}_{G5}$$ and GCD73 achieve the highest scores, followed by GCD11 and $${\rm Netdis}_{\rm SF}$$. Fig. 3. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_2$$ ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) according to $${\rm NetEmd}_{G5}$$ and GCD73, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree and node count. The heatmap of $${\rm NetEmd}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models while the heatmap of GCD73 shows signs of off-diagonal mixing. (c) $$\overline{P}$$ values for various comparison measures for data sets of synthetic and real world networks. For $$RG_1$$ we calculated the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets. Fig. 3. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_2$$ ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) according to $${\rm NetEmd}_{G5}$$ and GCD73, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree and node count. The heatmap of $${\rm NetEmd}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models while the heatmap of GCD73 shows signs of off-diagonal mixing. (c) $$\overline{P}$$ values for various comparison measures for data sets of synthetic and real world networks. For $$RG_1$$ we calculated the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets. Having established that $${\rm NetEmd}$$ is able to differentiate networks according to generating mechanism, we move on to the task of clustering networks of different sizes and densities. For this we generate two data sets: $$RG_2$$ in which the size $$N$$ and average degree $$k$$ are increased independently in linear steps to twice their initial value ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) and $$RG_3$$ in which the size and average degree are increased independently in multiples of 2 to 8 times their initial value ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$). In $$RG_3$$, the number of nodes and average degrees of the networks both vary by one order of magnitude, and therefore clustering according to model type is challenging. Both $$RG_2$$ and $$RG_3$$ contain 10 realizations per model parameter so that these contain $$3\times6\times8\times10=1440$$ and $$4\times4\times8\times10=1280$$ networks, respectively. Finally, we consider a data set consisting of networks from 10 different classes of real world networks (RWN) as well as a data set from [4] that consists of real world and synthetic networks from the larger collection compiled by Onnela $$et$$$$al.$$ [31]. Since the generation mechanisms of real world networks are generally unknown the ‘ground truth’ of real world data sets is based on the assumption that networks from a certain domain are structurally more similar to each other than to networks from different domains. We find that $${\rm NetEmd}_{G5}$$ outperforms all of the other three methods at clustering networks of different sizes and densities on all data sets. The difference can also be seen in the heatmaps of $${\rm NetEmd}_{G5}$$ and GCD73, the second best performing method for $$RG_2$$, given in Fig. (3a,b). While the heatmap of $${\rm NetEmd}_{G5}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models, the heatmap of GCD73 shows signs of off-diagonal mixing. The difference in performance becomes even more pronounced on more challenging data sets, that is on $$RG_3$$ (see Fig. D2 in the Appendix) and the Onnela et al. data set. 3.2 Time ordered networks A network comparison measure should ideally not only be able to identify groups of similar networks but should also be able to capture structural similarity at a finer local scale. To study the behaviour of $${\rm NetEmd}$$ at a more local level, we consider data sets that represent a system measured at different points in time. Since it is reasonable to assume that such networks evolve gradually over time towards their final state they offer a sensible experimental setting for testing the local properties of network comparison methodologies. We consider two data sets named AS-caida and AS-733 [32] that represent the topology of the Internet at the level of autonomous systems and a third data set that consists of bilateral trade flows between countries for the years 1962–2014 [33, 34]. Both edges and nodes are added and deleted over time in all three data sets. As was noted in [32] the time ranking in evolving networks is reflected to a certain degree in simple summary statistics. Hence, recovering the time ranking of evolving networks should be regarded as a test of consistency rather than an evaluation of performance. In order to minimize the dependence of our results on the algorithm that is used to rank networks, we consider four different ways of ranking networks based on their pairwise distances as follows. We assume that either the first or last network in the time series is given. Rankings are then constructed in a step-wise fashion. At each step one either adds the network that is closest to the last added network (Algorithm 1), or adds the network that has smallest average distance to all the networks in the ranking constructed so far (Algorithm 2). The performance of a measure in ranking networks is then measured in terms of Kendall’s rank correlation coefficient $$\tau$$ between the true time ranking and the best ranking obtained by any of the four methods. We find that $${\rm NetEmd}_{G5}$$ successfully recovers the time ordering for all three data sets, as can be seen in the time ordered heatmaps given in Fig. 4 which all show clear groupings along the diagonal. The heat maps for the two internet data sets show small groups of outliers which can also be identified as sudden jumps in summary statistics for example the number of nodes. The two large clusters in the heatmap of world trade networks (Fig. 4(c)) coincide with a change in the data gathering methodology in 1984 [33]. Although $${\rm NetEmd}_{G5}$$ comes second to $${\rm Netdis}_{\rm SF}$$ on AS-733 and to GCD11 on AS-caida, $${\rm NetEmd}_{G5}$$ has the highest overall score and is the only measure that achieves consistently high scores on all three data sets. Fig. 4. View largeDownload slide (a), (b) and (c) Heatmaps of $${\rm NetEmd}_{G5}$$ for networks representing the internet at the level of autonomous systems networks and world trade networks. The date of measurement increases from left to right/ top to bottom. $${\rm NetEmd}_{G5}$$ accurately captures the evolution over time in all three data sets by positioning networks that are close in time closer to each other resulting in a clear signal along the diagonal. (d) Kendall’s rank correlation coefficient between the true time ranking and rankings inferred from different network comparison measures. Fig. 4. View largeDownload slide (a), (b) and (c) Heatmaps of $${\rm NetEmd}_{G5}$$ for networks representing the internet at the level of autonomous systems networks and world trade networks. The date of measurement increases from left to right/ top to bottom. $${\rm NetEmd}_{G5}$$ accurately captures the evolution over time in all three data sets by positioning networks that are close in time closer to each other resulting in a clear signal along the diagonal. (d) Kendall’s rank correlation coefficient between the true time ranking and rankings inferred from different network comparison measures. 3.3 NetEmd based on different sets of inputs We examine the effect of reducing the size of graphlets considered in the input of $${\rm NetEmd}$$, which is also relevant from a computational point of view, since enumerating graphlets up to size 5 can be challenging for very large networks. We consider variants based on the graphlet degree distributions of graphlets up to sizes 3 and 4, which we denote as $${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$. We also consider $${\rm NetEmd}_{\rm DD}$$ which is based only on the degree distribution as a baseline. Results are given in Table 1. We find that reducing the size of graphlets from 5 to 4 does not significantly decrease the performance of $${\rm NetEmd}$$ and actually produces better results on three data sets ($$RG_3$$, Real world and Onnela et $$al.$$). Even when based on only graphlets up to size 3, that is just edges, 2-paths and triangles, $${\rm NetEmd}$$ outperforms all other non-$${\rm NetEmd}$$ methods that we tested on at least 6 out of 8 data sets. Table 1 Results for different variants of $${\rm NetEmd}$$ based on distributions of graphlets up to size 3 and 4 ($${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$), counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$), eigenvalue spectra of Laplacian operators ($${\rm NetEmd}_{s}$$) and the degree distribution ($${\rm NetEmd}_{DD}$$). Values in bold indicate that a measure achieves the highest score among all measures considered in the manuscript. For $$RG_1$$ we calculate the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Table 1 Results for different variants of $${\rm NetEmd}$$ based on distributions of graphlets up to size 3 and 4 ($${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$), counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$), eigenvalue spectra of Laplacian operators ($${\rm NetEmd}_{s}$$) and the degree distribution ($${\rm NetEmd}_{DD}$$). Values in bold indicate that a measure achieves the highest score among all measures considered in the manuscript. For $$RG_1$$ we calculate the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Given that the complexity of enumerating graphlets up to size $$s$$ in a network on $$N$$ nodes having maximum degree $$k_{\rm max}$$ is $$O(Nk_{\max}^{s-1})$$, $${\rm NetEmd}_{G4}$$ offers an optimal combination of performance and computational efficiency in most cases. The even less computationally costly $${\rm NetEmd}_{G3}$$ scales favourably even to networks of billions of edges for which enumerating graphlets of size 4 can be computationally prohibitive. This opens the door for comparing very large networks which are outside the reach of current methods while still retaining state of the art performance. Furthermore, $${\rm NetEmd}$$ behaves well under the bootstrap which samples nodes and uses the only the graphlet degrees of these nodes as inputs; see for example [5, 20–22]. Sub-sampling can be leveraged to further improve computational efficiency of $${\rm NetEmd}$$ (see Appendix D). We find that in some cases restricting the set of inputs actually leads to an increase in the performance of $${\rm NetEmd}$$. This indicates that not all graphlet distributions are equally informative in all settings [35]. Consequently, identifying (learning) which graphlet distributions contain the most pertinent information for a given task might lead to significant improvements in performance. Such generalizations can be incorporated into $${\rm NetEmd}$$ in a straightforward manner, for instance by modifying the sum in Equation (2.3) to incorporate weights. $${\rm NetEmd}$$ is ideally suited for such metric learning [36] type generalizations since it constructs an individual distance for each graphlet distribution. Moreover, such single feature $${\rm NetEmd}$$ measures are in many cases highly informative even on their own. For instance $${\rm NetEmd}_{\rm DD}$$, which only uses the degree distribution, outperforms the non-$${\rm NetEmd}$$ measures we tested individually on more than half the data sets we considered. We also considered counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$) [4] as an alternative way of capturing subgraph distributions, for which we denote the measure as $${\rm NetEmd}_{E4}$$. Although we find that $${\rm NetEmd}_{E4}$$ achieves consistently high scores variants based on graphlet degree distributions tend to perform better on most data sets. Finally, we consider spectral distributions of graphs as a possible alternative to graphlet based features. The spectra of various graph operators are closely related to topological properties of graphs [37–39] and have been widely used to characterize and compare graphs [2, 40]. We used the spectra of the graph Laplacian and normalized graph Laplacian as inputs for $${\rm NetEmd}$$ for which we denote the measure as $${\rm NetEmd}_S$$. For a given graph the Laplacian is defined as $$L=D-A$$ where $$A$$ is the adjacency matrix of the graph and $$D$$ is the diagonal matrix whose diagonal entries are the node degrees. The normalized Laplacian $$\hat{L}$$ is defined as $$D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$$. Given the eigenvalue distributions $$S(L)$$ and $$S(\hat{L})$$ of $$L$$ and $$\hat{L}$$ we define $${\rm NetEmd}_S$$ to be $$\frac{1}{2}({\rm NetEmd}_{S(L)}+{\rm NetEmd}_{S(\hat{L})})$$. We find that in general $${\rm NetEmd}_S$$ performs better in clustering random graphs of different sizes and densities when compared to graphlet based network comparison measures. However, on the RWN and Onnela et al. data sets graphlet based $${\rm NetEmd}$$ measures tend to perform better than the spectral variant which can be attributed to the prevalence of network motifs in real world networks, giving graphlet based measures an advantage. The spectral variant is also outperformed on the time ordering of data sets which in turn might be a result of the sensitivity of graph spectra to small changes in the underlying graph [2]. 3.4 Functional classification of networks One of the primary motivations in studying the structure of networks is to identify topological features that can be related to the function of a network. In the context of network comparison this translates into the problem of finding metrics that can identify functionally similar networks based on their topological structure. In order to test whether $${\rm NetEmd}$$ can be used to identify functionally similar networks, we use several benchmarks from the machine learning literature where graph similarity measures, called graph kernels, have been intensively studied over the past decade. In the context of machine learning the goal is to construct classifiers that can accurately predict the class membership of unknown graphs. We test $${\rm NetEmd}$$ on benchmark data sets representing social networks [41] consisting of Reddit posts, scientific collaborations and ego networks in the Internet Movie Database (IMDB). The Reddit data sets Reddit-Binary, Reddit-Multi-5k and Reddit-Multi-12k consist of networks representing Reddit treads where nodes correspond to users and two users are connected whenever one responded to the other’s comments. While for the Reddit-Binary data sets the task is to classify networks into discussion based and question/answer based communities, in the data sets Reddit-Multi-5k and Reddit-Multi-12k the task is to classify networks according to their subreddit categories. COLLAB is a data set consisting of ego-networks of scientists from the fields High Energy Physics, Condensed Matter Physics and Astro Physics and the task is to determine which of these fields a given researcher belongs to. Similarly, the data sets IMDB-Binary and IMDB-Multi represent collaborations between film actors derived from the IMDB and the task is to classify ego-networks into different genres, that is action and romance in the case of IMDB-Binary and comedy, action and Sci-Fi genres in the case of IMDB-Multi. We use C-support vector machine (C-SVM) [42] classifiers with a Gaussian kernel $$K(G,G')=\exp(-\frac{{\rm NetEmd}(G,G')^2}{2\alpha^2})$$, where $$\alpha$$ is a free parameter to be learned during training. Performance evaluation is carried out by 10-fold cross validation, where at each step of the validation 9 folds are used for training and 1 fold for evaluation. Free parameters of classifiers are learned via 10-fold cross validation on the training data only. Finally, every experiment is repeated 10 fold, and average prediction accuracy and standard deviation are reported. Table 2 gives classification accuracies obtained using $${\rm NetEmd}$$ measures based on graphlets up to size five ($${\rm NetEmd}_{G5}$$) and spectra of Laplacian operators ($${\rm NetEmd}_{S}$$) on the data sets representing social networks. We compare $${\rm NetEmd}$$ based kernels to graphlet kernels [43] and deep graphlet kernels [41] as well as two non-SVM classifiers namely the random forest (RF) classifier introduced in [44] and the convolutional neural network based classifier introduced in [45]. Short descriptions of the classifiers is given in the Appendix (Section F.2). Table 2 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}$$ measures using the distributions of graphlets up to size 5 ($${\rm NetEmd}_{G5}$$) and Laplacian spectra ($${\rm NetEmd}_S$$) and other graph kernels, namely the deep graphlet kernels (DGK)[41] and the graphlet kernel (GK) [43]. We also consider alternatives to support vector machines classifiers, namely the random forest classifiers (RF) introduced in [44] and convolutional neural networks (PCSN) [45]. Values in bold correspond to significantly higher scores, which are scores with t-test p-values less than 0.05 when compared to the highest score Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Table 2 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}$$ measures using the distributions of graphlets up to size 5 ($${\rm NetEmd}_{G5}$$) and Laplacian spectra ($${\rm NetEmd}_S$$) and other graph kernels, namely the deep graphlet kernels (DGK)[41] and the graphlet kernel (GK) [43]. We also consider alternatives to support vector machines classifiers, namely the random forest classifiers (RF) introduced in [44] and convolutional neural networks (PCSN) [45]. Values in bold correspond to significantly higher scores, which are scores with t-test p-values less than 0.05 when compared to the highest score Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  On the Reddit data sets and the COLLAB data set, $${\rm NetEmd}_{G5}$$ significantly outperforms other state-of-the-art graph classifiers. On the other hand, we find that $${\rm NetEmd}_{G5}$$ performs poorly on the IMDB data sets. This can be traced back to the large number of complete graphs present in the IMDB data sets: 139 out of the 1000 graphs in IMDB-Binary and 789 out of 1500 graphs in IMDB-Multi are complete graphs which mainly correspond to ego-networks of actors having acted only in a single film. By definition, $${\rm NetEmd}_{G5}$$ cannot distinguish between complete graphs of different sizes since all graphlet degree distributions are concentrated on a single value in complete graphs. The spectral variant $${\rm NetEmd}_S$$ is not affected by this and we find that $${\rm NetEmd}_S$$ is either on par with or outperforms the other non-$${\rm NetEmd}$$ graph classifiers on all six data sets. We also tested $${\rm NetEmd}$$ on benchmark data sets representing chemical compounds and protein structures. Unlike the social network data sets, in these data sets nodes and edges are labeled to reflect domain specific knowledge such as atomic number, amino acid type and bond type. Although $${\rm NetEmd}$$, in contrast to the other graph kernels, does not rely on domain specific knowledge in the form of node or edge labels, we found that $${\rm NetEmd}$$ outperforms many of the considered graph kernels coming only second to the Weisfeiler–Lehman [46] type kernels in terms of overall performance (see Appendix E). 4. Discussion Starting from basic principles, we have introduced a general network comparison methodology, $${\rm NetEmd}$$, that is aimed at capturing common generating processes in networks. We tested $${\rm NetEmd}$$ in a large variety of experimental settings and found that $${\rm NetEmd}$$ successfully identifies similar networks at multiple scales even when networks differ significantly in terms of size and density, generally outperforming other graphlet based network comparison measures. Even when based only on graphlets up to size 3 (i.e. edges, 2-paths and triangles), $${\rm NetEmd}$$ has performance comparable to the state of the art, making $${\rm NetEmd}$$ feasible even for networks containing billions of edges and nodes. By exploring machine learning applications we showed that $${\rm NetEmd}$$ captures topological similarity in a way that relates to the function of networks and outperforms state-of-the art graph classifiers on several graph classification benchmarks. Although we only considered variants of $${\rm NetEmd}$$ that are based on distributions of graphlets and spectra of Laplacian operators in this article, $${\rm NetEmd}$$ can also be applied to other graph features in a straightforward manner. For instance, distributions of paths and centrality measures might capture larger scale properties of networks and their inclusion into $${\rm NetEmd}$$ might lead to a more refined measure. Data availability The source code for $${\rm NetEmd}$$ is freely available at: http://opig.stats.ox.ac.uk/resources Funding EPSRC (grant EP/K032402/1 to A.W, G.R, C.D and R.G); and EPSRC (grants EP/G037280/1 and EP/L016044/1 to C.D). Colciencias through (grant 568 to L.O.); COST Action (CA15109 to R.G.) and Dame Kathleen Ollerenshaw Research Fellowship; Alan Turing Institute (grant EP/NS10129/1 to C.D. and G.R.). Acknowledgements We thank Xiaochuan Xu and Martin O’Reilly for useful discussions. L.O. acknowledges the support of Colciencias. R.G. acknowledges support from the COST Action. C.D. and G.R. acknowledge the support of the Alan Turing Institute. Appendix A. Implementation A.1 Graphlet distributions. In the main article, both the graphlet degree distribution and graphlet counts in 1-step ego networks were used as inputs for $${\rm NetEmd}$$. Graphlet degree distributions. The graphlet degree [11] of a node specifies the number of graphlets (small induced subgraphs) of a certain type the node appears in, while distinguishing between different positions the node can have in a graphlet. Different positions within a graphlet correspond to the orbits of the automorphism group of the graphlet. Among graphs on two to four nodes, there are 9 possible graphs and 15 possible orbits. Among graphs on two to five nodes there are 30 possible graphs and 73 possible orbits. Graphlet distributions based on ego-networks. Another way of obtaining graphlet distributions is to consider graphlet counts in ego-networks [4]. The $$k$$-step ego-network of a node $$i$$ is defined as the subgraph induced on all the nodes that can be reached from $$i$$ (including $$i$$) in less than $$k$$ steps. For a given $$k$$, the distribution of a graphlet $$m$$ in a network $$G$$ is then simply obtained by counting the occurrence of $$m$$ as an induced subgraph in the $$k$$-step ego-networks of each individual node. A.2 Step-wise implementation In this article, for integer valued network features such as graphlet based distributions, we base our implementation on the probability distribution that corresponds to the histogram of feature $$t$$ with bin width 1 as $$p_t(G)$$. $${\rm NetEmd}$$ can also be defined on the basis of discrete empirical distributions, that is distributions consisting of point masses (see Section C). Here we summarise the calculation of the $${\rm NetEmd}_T(G,G')$$ distance between networks $$G$$ and $$G'$$ (with $$N$$ and $$N'$$ nodes respectively), based on the comparison of the set of local network features $$T=\{t_1,\ldots,t_m\}$$ of graphlet degrees corresponding to graphlets up to size $$k$$. (1) First one computes the graphlet degree sequences corresponding to graphlets up to size $$k$$ for networks $$G$$ and $$G'$$. This can be done efficiently using the algorithm ORCA [47]. For the graphlet degree $$t_1$$ compute a histogram across all $$N$$ nodes of $$G$$ having bins of width 1 of which the centers are at their respective values. This histogram is then normalized to have total mass 1. We then interpret the histogram as the (piecewise continuous) probability density function of a random variable. This probability density function is denoted by $$p_{t_1}(G)$$. The standard deviation of $$p_{t_1}(G)$$ is then computed, and is used to rescale the distribution so that it has variance 1. This distribution is denoted by $$\widetilde{p_{t_1}(G)}$$. (2) Repeat the above step for network $$G'$$, and denote the resulting distribution by $$\widetilde{p_{t_1}(G')}$$. Now compute   \begin{equation*} {\rm NetEmd}_{t_1}^*(G,G')=\mathrm{inf}_{c\in\mathbb{R}}\big({\rm EMD}\big(\widetilde{p_{t_1}(G)}(\cdot+c),\widetilde{p_{t_1}(G')}(\cdot)\big)\big). \end{equation*} In practice, this minimisation over $$c$$ is computed using a suitable optimization algorithm. In our implementation we use the Brent–Dekker algorithm [48] with an error tolerance of 0.00001 and with the number of iterations upper bounded by 150. (3) Repeat the above two steps for the network features $$t_2,\ldots,t_m$$ and compute   \begin{equation*} {\rm NetEmd}_T(G,G')=\frac{1}{m}\sum_{j=1}^m {\rm NetEmd}_{t_j}^*(G,G'). \end{equation*} A.3 Example: $${\rm EMD}^*$$ for Gaussian distributions Suppose that $$p$$ and $$q$$ are $$N(\mu_1,\sigma_1^2)$$ and $$N(\mu_2,\sigma_2^2)$$ distributions, respectively. Then   \begin{align*} {\rm EMD}^*(p,q) &=\inf_{c\in\mathbb{R}}\Big({\rm EMD}\big(\tilde{p}(\cdot+c),\tilde{q}(\cdot)\big)\Big)\\ &={\rm EMD} \left(\tilde{p}\left(\cdot-\frac{\mu_1}{\sigma_1}+\frac{\mu_2}{\sigma_2}\right),\tilde{q}(\cdot)\right)\\ &={\rm EMD}\big(\tilde{q}(\cdot),\tilde{q}(\cdot)\big)=0. \end{align*} Here we used that if $$X\sim N(\mu_1,\sigma_1^2)$$ and $$Y\sim N(\mu_2,\sigma_2^2)$$, then $$\frac{X}{\sigma_1}+c\sim N(\frac{\mu_1}{\sigma_1}-c,1)$$ and $$\frac{Y}{\sigma_2}\sim N(\frac{\mu_2}{\sigma_2},1)$$, and these two distributions are equal if $$c=\frac{\mu_1}{\sigma_1}-\frac{\mu_2}{\sigma_2}$$. A.4 Spectral NetEmd When using spectra of graph operators, which take real values instead of the integer values one has in the case of graphlet distributions, we use the empirical distribution consisting of point masses for computing $${\rm NetEmd}$$. For more details see Section C of this appendix. A.5 Computational complexity The computational complexity of graphlet based comparison methods is dominated by the complexity of enumerating graphlets. For a network of size $$N$$ and maximum degree $$d$$, enumerating all connected graphlets up to size $$m$$ has complexity $$O(Nd^{m-1})$$, while counting all graphlets up to size $$m$$ in all $$k$$-step ego-networks has complexity $$O(Nd^{k+m-1})$$. Because most real world networks are sparse, graphlet enumeration algorithms tends to scale more favourably in practice than the worst case upper bounds given above. In the case of spectral measures, the most commonly used algorithms for computing the eigenvalue spectrum have complexity $$O(N^3)$$. Recent results show that the spectra of graph operators can be approximated efficiently in $$O(N^2)$$ time [49]. Given the distribution of a feature $$t$$, computing $${\rm EMD}_t^*(G,G')$$ has complexity $$O(k(s+s')\mathrm{log}(s+s'))$$, where $$s$$ and $$s'$$ are the number of different values $$t$$ takes in $$G$$ and $$G'$$ respectively and $$k$$ is the maximum number function calls of the optimization algorithm used to align the distributions. For node based features such as graphlet distributions, the worst case complexity is $$O(k(N(G)+N(G'))\mathrm{log}(N(G)+N(G')))$$, where $$N(G)$$ is the number of nodes of $$G$$, since the number of different values $$t$$ can take is bounded by the number of nodes. We use the combinatorial graphlet enumeration algorithm ORCA [47] for enumerating graphlets. Typical runtimes for enumerating all graphlets up to size 5 are can be found in Table A1 along with runtimes for $${\rm NetEmd}_{G5}$$ given the corresponding graphlet degree distributions. Enumerating graphlets of size 5 for networks of size around 1000 nodes tends to take several minutes allowing data sets consisting of several hundreds of such networks to be analysed on a single CPU in a matter of hours. Larger data sets containing networks of the order of 10000 nodes or larger are computationally challenging since enumerating graphlets of size 5 for such networks can take several hours [47]. Consequently, computing $${\rm NetEmd}_{G5}$$ on data sets such as $$RG_3$$ and $$Reddit$$-$$Multi$$-12$$k$$ are computationally challenging tasks that can take up to 3 days on a 24 CPU cluster. Table A1 Runtimes for enumerating graphlets up to size 5 using ORCA for Erdős–Rényi and a Barabási Albert random graphs and computation times for evaluating $${\rm NetEmd}_{G5}$$ given the graphlet degree distributions. For $${\rm NetEmd}_{G5}$$ the range $$t_{\min}$$–$$t_{\max}$$ obtained over all pairwise comparisons of all networks of a given size is shown. Experiments were performed on a single core of an Intel i7-6700HQ processor. ORCA is implemented in C++ and $${\rm NetEmd}_{G5}$$ is implemented in Python and Cython Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Table A1 Runtimes for enumerating graphlets up to size 5 using ORCA for Erdős–Rényi and a Barabási Albert random graphs and computation times for evaluating $${\rm NetEmd}_{G5}$$ given the graphlet degree distributions. For $${\rm NetEmd}_{G5}$$ the range $$t_{\min}$$–$$t_{\max}$$ obtained over all pairwise comparisons of all networks of a given size is shown. Experiments were performed on a single core of an Intel i7-6700HQ processor. ORCA is implemented in C++ and $${\rm NetEmd}_{G5}$$ is implemented in Python and Cython Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Appendix B. Proof that NetEmd is a distance measure We begin by stating a definition. A pseudometric on a set $$X$$ is a non-negative real-valued function $$d:X\times X\rightarrow [0,\infty)$$ such that, for all $$x,y,z\in X$$, (1) $$d(x,x)= 0$$; (2) $$d(x,y)=d(y,x)$$ (symmetry); (3) $$d(x,z)\leq d(x,y)+d(y,z)$$ (triangle inequality). If Condition 1 is replaced by the condition that $$d(x,y)=0\iff x=y$$ then $$d$$ defines a metric. Note that this requirement can only be satisfied by a network comparison measure that is based on a complete set of graph invariants and hence network comparison measures in general will not satisfy this requirement. Proposition Let $$M$$ denote the space of all real-valued probability measures supported on $$\mathbb{R}$$ with finite, non-zero variance. Then the $${\rm EMD}^*$$ distance between probability measures, $$\mu_X$$ and $$\mu_Y$$ in $$M$$ defined by   \begin{equation*}{\rm EMD}^*(\mu_X,\mu_Y)=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_Y(\cdot+c)), \end{equation*} defines a pseudometric on the space of probability measures $$M$$. Proof. We first note that if $$\mu_X\in M$$ then $$\tilde{\mu}_X(\cdot+c)\in M$$ for any $$c\in\mathbb{R}$$. Let us now verify that $${\rm EMD}^*$$ satisfies all properties of a pseudometric. Clearly, for any $$\mu_X\in M$$, we have $$0\leq {\rm EMD}^*(\mu_X,\mu_X)\leq {\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_X(\cdot))=0$$, and so $${\rm EMD}^*(\mu_X,\mu_X)=0$$. Symmetry holds, since for, any $$\mu_X$$ and $$\mu_Y$$ in $$M$$,   \begin{align*}{\rm EMD}^*(\mu_X,\mu_Y)&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_Y(\cdot+c))\\&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+c),\tilde{\mu}_X(\cdot))\\ &=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot),\tilde{\mu}_X(\cdot+c))\\&={\rm EMD}^*(\mu_Y,\mu_X). \end{align*} Finally, we verify that $${\rm EMD}^*$$ satisfies the triangle inequality. Suppose $$\mu_X$$, $$\mu_Y$$ and $$\mu_Z$$ are probability measures from the space $$M$$, then so are $$\tilde{\mu}_X(\cdot+a)$$, $$\tilde{\mu}_Y(\cdot+b)$$ for any $$a,b\in\mathbb{R}$$. Since EMD satisfies the triangle inequality, we have, for any $$a,b\in\mathbb{R}$$,   \begin{equation*}\label{emdeqn1}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Y(\cdot+b))\leq {\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot)). \end{equation*} Since the above inequality holds for all $$a,b\in\mathbb{R}$$, we have that   \begin{align*} {\rm EMD}^*(\mu_X,\mu_Y)&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+c),\tilde{\mu}_Y(\cdot))\\ &=\inf_{a,b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Y(\cdot+b)) \\ &\leq \inf_{a,b\in\mathbb{R}}\big[{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\mu_Z(\cdot))\big] \\ &=\inf_{a\in\mathbb{R}}\big[{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+\inf_{b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot))\big] \\ &=\inf_{a\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+\inf_{b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot))\\ &={\rm EMD}^*(\mu_X,\mu_Z)+{\rm EMD}^*(\mu_Y,\mu_Z), \end{align*} as required. We have thus verified that $${\rm EMD}^*$$ satisfies all properties of a pseudometric. □ Appendix C. Generalization of $${\rm EMD}^*$$ to point masses Although in the case of graphlet based features we based our implementation of $${\rm NetEmd}$$ on probability distribution functions that correspond to normalized histograms having bin width 1 $${\rm NetEmd}$$ can also be based on empirical distributions consisting of collections of point masses located at the observed values. The definition of $${\rm EMD}^*$$ can be generalized to include distributions of zero variance, that is unit point masses. Mathematically, the distribution of a point mass at $$x_0$$ is given by the Dirac measure $$\delta_x(x_0)$$. Such distributions are frequently encountered in practice since some graphlets do not occur in certain networks. First, we note that unit point masses are always mapped onto unit point masses under rescaling operations. Moreover, for a unit point mass $$\delta_x(x_0)$$ we have that $$\mathrm{inf}_{c\in\mathbb{R}}({\rm EMD}(\tilde{p}(\cdot+c),\delta_x(x_0)))$$$$=\mathrm{inf}_{c\in\mathbb{R}}\left({\rm EMD}(\tilde{p}(\cdot+c),\delta_x(kx_0))\right)$$ for all $$p\in M$$ and $$k>0$$. Consequently, $${\rm EMD}^*$$ can be generalized to include unit point masses in a consistent fashion by always rescaling them by 1:   \begin{equation*}\label{emdmet2} {\rm EMD}^*(p,q)=\mathrm{inf}_{c\in\mathbb{R}}\big({\rm EMD}(\hat{p}(\cdot+c),\hat{q})\big), \end{equation*} where $$\hat{p}=\tilde{p}$$ (as in Eq. 2.1) if $$p$$ has a non-zero variance, and $$\hat{p}=p$$ if $$p$$ has variance zero. Appendix D. Sub-sampling $${\rm NetEmd}$$ is well suited for network sub-sampling [5, 21, 22]. In the case of graphlets the sub-sampling procedure consist of sampling nodes and using the graphlet degrees corresponding to these nodes only as inputs for $${\rm NetEmd}$$. Figure D1 shows the $$\overline{P}$$ scores for variants of $${\rm NetEmd}$$ on a set of synthetic networks and the Onnela et al. data set. We find that the performance of $${\rm NetEmd}$$ is stable under sub-sampling and that in general using a sample of only $$10\%$$ of the nodes produces results comparable to the case where all nodes are used. Fig. D1. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_3$$ ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) according to $${\rm NetEmd}_{G5}$$ and next best performing measure GCD11, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree, and node count. Although we observe some degree of off diagonal mixing, the heatmap of $${\rm NetEmd}$$ still shows 8 diagonal blocks corresponding to different generative models in contrast to the heat map of GCD11. Fig. D1. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_3$$ ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) according to $${\rm NetEmd}_{G5}$$ and next best performing measure GCD11, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree, and node count. Although we observe some degree of off diagonal mixing, the heatmap of $${\rm NetEmd}$$ still shows 8 diagonal blocks corresponding to different generative models in contrast to the heat map of GCD11. Fig. D2. View largeDownload slide The $$\overline{P}$$ values for different variants of $${\rm NetEmd}$$ under sub-sampling for (a) a set of 80 synthetic networks coming from eight different random graph models with 2500 nodes and average degree 20, (b) for the Onnela et al. data set showing the average and standard deviation over 50 experiments for each sampled proportion. Note that the performance of $${\rm NetEmd}$$ under sub-sampling is remarkably stable and is close to optimal even when only $$10\%$$ of nodes are sampled. For synthetic networks, we find that the stability of $${\rm NetEmd}$$ increases as the size of the graphlets used in the input is increased. Fig. D2. View largeDownload slide The $$\overline{P}$$ values for different variants of $${\rm NetEmd}$$ under sub-sampling for (a) a set of 80 synthetic networks coming from eight different random graph models with 2500 nodes and average degree 20, (b) for the Onnela et al. data set showing the average and standard deviation over 50 experiments for each sampled proportion. Note that the performance of $${\rm NetEmd}$$ under sub-sampling is remarkably stable and is close to optimal even when only $$10\%$$ of nodes are sampled. For synthetic networks, we find that the stability of $${\rm NetEmd}$$ increases as the size of the graphlets used in the input is increased. Appendix E. Results for data sets of chemical compounds and proteins We also tested $${\rm NetEmd}$$ on benchmark data sets representing chemical compounds (MUTAG, NCI1 and NCI109) and protein structures (ENZYMES and D&D). MUTAG [50] is a data set of 188 chemical compounds that are labelled according to their mutagenic effect on Salmonella typhimurium. NCI1 and NCI109 represent sets of chemical compounds which are labelled for their activity against non-small cell lung cancer and ovarian cancer cell lines, respectively [15]. Nodes and edges in MUTAG, NCI1 and NCI109 are labelled by atomic number and bound type, respectively. ENZYMES and D&D [51] consist of networks representing protein structures at the level of tertiary structure and amino acids, respectively. While networks in ENZYMES are classified into six different enzyme classes, networks in D&D are classified according to whether or not they correspond to an enzyme. Nodes in ENZYMES are labelled according to structural element type and according to amino acid types in D&D. Classification accuracies obtained using $${\rm NetEmd}$$ on the data sets of chemical compounds and protein structures are given in Table E1, along with results for other graph kernels reported in [46]. For a detailed description of these kernels we refer to [46] and the references therein. Note that, in contrast to all other kernels in Table E1, $${\rm NetEmd}$$ does not use any domain specific knowledge in the form of node or edge labels. Node and edge labels are highly informative for all five classification tasks—as shown in [52]. Table E1 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}_{G5}$$ and $${\rm NetEmd}_S$$ and other kernels reported in [46] Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Table E1 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}_{G5}$$ and $${\rm NetEmd}_S$$ and other kernels reported in [46] Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  On MUTAG, $${\rm NetEmd}$$ achieves an accuracy that is comparable to the Weisfeiler–Lehman (WL) shortest path kernel, but is outperformed by the shortest path kernel and the kernel by Ramon & Gärtner. While on NCI1, NCI109 and ENZYMES, $${\rm NetEmd}$$ is outperformed only by WL kernels, on D&D $${\rm NetEmd}$$ achieves a classification accuracy that is comparable to the best performing kernels. Notably, on D&D $${\rm NetEmd}$$ also outperforms the vector model by Dobson and Doig [53] (classification accuracy: 76.86 $$\pm$$ 1.23) which is based on 52 physical and chemical features without using domain specific knowledge, that is solely based on graph topology. Appendix F. Graph classifiers F.1 Implementation of C-SVMs Following the procedure in [46] we use 10-fold cross validation with a C-SVM [42] to test classification performance. We use the python package scikit-learn [54] which is based on libsvm [55]. The $$C-value$$ of the C-SVM and the $$\alpha$$ for the Gaussian kernel is tuned independently for each fold using training data from that fold only. We consider values $$\{2^{-7},2^{-6},...,2^7\}$$ for $$C$$ and the $$\{2^{-7},2^{-6},...,2^7\}$$ multiples of the median of the distance matrix for $$\alpha$$. Each experiment is repeated 10 times, and average prediction accuracies and their standard deviations are reported. We also note that the Gaussian NetEmd kernel is not necessarily positive semidefinite for all values of $$\alpha$$ [56]. The implication is that the C-SVM might converge to a stationary point that is not always guaranteed to be a global optimum. Although there exist alternative algorithms [57] for training C-SVMs with indefinite kernels which might result in better classification accuracy, here we chose to use the standard libsvm-algorithm in order to ensure a fair comparison between kernels. For a discussion of support vector machines with indefinite kernels see [58]. F.2 Other kernels and classifiers F.2.1 Graph kernels Graph kernels are in general based on counting various types substructures in graphs such as walks (p-random walk, random walk), paths (Shortest path) and sub-trees (Ramon & Gärtner). Given the count vectors of the substructures of interest the corresponding kernel is defined as their inner product. F.2.2 The graphlet kernel and the deep graphlet kernel The graphlet kernel and the deep graphlet kernel use the normalized counts of graphlets on $$k$$ nodes, including disconnected graphlets, as features. The deep graphlet kernel involves the additional step of learning similarities between graphlets based on the edit distance. In the case of the social network data sets $$k=7$$. F.2.3 Weisfeiler Lehman kernels The Weisfeiler Lehman (WL) kernels are based on the Weisfeiler Lehman label propagation procedure in which a node is iteratively labeled by the labels of its neighbours. Given a base kernel the corresponding WL kernel is obtained by combining the base kernel over several iterations of the WL procedure. The WL-subtree, WL-edge and WL-shortest path kernels have base kernels that use the following counts as features: node labels, labelled edges, and shortest paths with labelled end points, respectively. F.2.4 The random forest classifier The random forest classifier in [44] is based on the following features: number of nodes, number of edges, average degree, degree assortativity, number of triangles and the global clustering coefficient. F.2.5 PCSN The convolutional neural network classifier of Niepert et al. is based on learning feature representations of graphs that correspond to locally connected regions of networks. The approach first identifies a sequence of nodes for which neighbourhood graphs are created and then maps these onto a vector space representation. Appendix G. Detailed description of data sets and models G.1 Synthetic networks and random graph models $$RG_1$$ consists of 16 sub data sets corresponding to combinations of $$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$ containing 10 realizations for each model so that each of the 16 sub data set contains 80 networks. In $$RG_2$$ the size $$N$$ and average degree $$k$$ are increased independently in linear steps to twice their initial value ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) and contains 10 realizations per model parameter combination, resulting in a data set of $$3\times6\times8\times10=1440$$ networks. In $$RG_3$$ the size $$N$$ and average degree $$k$$ are increased independently in multiples of 2 to 8 times their initial value ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) and again contains 10 realizations per model parameter combination, resulting in a data set of $$4\times4\times8\times10=1280$$ networks. The models are as follows. G.1.1 The Erdős–Rényi model We consider the Erdős–Rényi (ER) model [25] $$G(N,m)$$ where $$N$$ is the number of nodes and $$m$$ is the number of edges. The edges are chosen uniformly at random without replacement from the $$\binom{N}{2}$$ possible edges. G.1.2 The configuration model Given a graphical degree sequence, the configuration model creates a random graph that is drawn uniformly at random from the space of all graphs with the given degree sequence. The degree sequence of the configuration models used in the article is taken to be degree sequence of a duplication divergence model that has the desired average degree. G.1.3 The Barabási Albert preferential attachment model In the Barabási–Albert model [16], a network is generated starting from a small initial network to which nodes of degree $$m$$ are added iteratively and the probability of connecting the new node to an existing node is proportional to the degree of the existing node. G.1.4 Geometric random graphs Geometric random graphs [59] are constructed under the assumption that the nodes in the network are embedded into a $$D$$ dimensional space, and the presence of an edge depends only on the distance between the nodes and a given threshold $$r$$. The model is constructed by placing $$N$$ nodes uniformly at random in an $$D$$-dimensional square $$[0,1]^D$$. Then edges are placed between any pair of nodes for which the distance between them is less or equal to the threshold $$r$$. We use $$D=3$$ and set $$r$$ to be the threshold that results in a network with the desired average degree, while the distance is the Euclidean distance. G.1.5 The geometric gene duplication model The geometric gene duplication model is a geometric model [27] in which the nodes are distributed in three-dimensional Euclidean space $$\mathbb{R}^3$$ according to the following rule. Starting from an small initial set of nodes in three dimensions, at each step a randomly chosen node is selected and a new node is placed at random within a Euclidean distance $$d$$ of this node. The process is repeated until the desired number of nodes is reached. Nodes within a certain distance $$r$$ are then connected. We fix $$r$$ to obtain the desired average degree. G.1.6 The duplication divergence model of Vázquez et al. The duplication divergence model of Vázquez et al. [24] is defined by the following growing rules: (1) Duplication: A node $$v_i$$ is randomly selected and duplicated ($$v_i'$$) along with all of its interactions. An edge between $$v_i$$ and $$v_i'$$ is placed with probability $$p$$. (2) Divergence: For each pair of duplicated edges $$\{(v_i,v_k);(v_i',v_k)\}$$; one of the duplicated edges is selected uniformly at random and then deleted with probability $$q$$. This process is followed until the desired number of nodes is reached. In our case we fix $$p$$ to be 0.05 and adjust $$q$$ through a grid search to obtain a network that on average has the desired average degree. G.1.7 The duplication divergence of Ispolatov et al The duplication divergence model of Ispolatov et al. [26] starts with an initial network consisting of a single edge and then at each step a random node is chosen for duplication and the duplicate is connected to each of the neighbours of its parent with probability $$p$$. We adjust $$p$$ to obtain networks that have on average the desired average degree. G.1.8 The Watts–Strogatz model The Watts–Strogatz model, [30] creates graphs that interpolate between regular graphs and ER graphs. The model starts with a ring of $$n$$ nodes in which each node is connected to its $$k$$-nearest neighbours in both directions of the ring. Each edges is rewired with probability $$p$$ to a node which is selected uniformly at random. While $$k$$ is adjusted to obtain networks having the desired average degree we take $$p$$ to be 0.05. G.2 Real world data sets Summary statistics of the data sets are given in Table G1. Table G1 Summary statistics of data sets $$N$$, $$E$$ and $$d$$ stand for the number of nodes, number of edges and edge density, respectively. Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Table G1 Summary statistics of data sets $$N$$, $$E$$ and $$d$$ stand for the number of nodes, number of edges and edge density, respectively. Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  G.2.1 Real world networks from different classes (RWN) We compiled a data set consisting of 10 different classes of real world networks: social networks, metabolic networks, protein interaction networks, protein structure networks, food webs, autonomous systems networks of the internet, world trade networks, airline networks, peer to peer file sharing networks and scientific collaboration networks. Although in some instances larger versions of these data sets are available, we restrict the maximum number of networks in a certain class to 20 by taking random samples of larger data sets in order to avoid scores being dominated by larger network classes. The class of social networks consists of 10 social networks from the Pajek data set which can be found at http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm (22 January 2018) (Networks: ’bkfrat’, ’bkham’, ’dolphins’, ’kaptailS1’, ’kaptailS2’, ’kaptailT1’, ’kaptailT2’, ’karate’, ’lesmis’, ’prison’) and a sample of 10 Facebook networks from [60] (Networks: ’Bucknell39’, ’Duke14’, ’Harvard1’, ’MU78’, ’Maine59’, ’Rice31’, ’UC61’, ’UCLA26’, ’UVA16’,’Yale4’). The class of metabolic networks consists of 20 networks taken from [61] (Networks: ’AB’, ’AG’, ’AP’, ’AT’, ’BS’, ’CE’, ’CT’, ’EF’, ’HI’, ’MG’, ’MJ’, ’ML’, ’NG’, ’OS’, ’PA’, ’PN’, ’RP’, ’TH’, ’TM’, ’YP’). The class of protein interaction networks consists of six networks from BIOGRID [62] (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae downloaded: October 2015) and 5 networks from HINT [63] (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Mus musculus (Version: June 1 2014)) and the protein interaction network of Escherichia coli by Rajagopala et al. [64]. The class of protein structure networks consists of a sample of 20 networks from the data set D&D (Networks: 20, 119, 231, 279, 335, 354, 355, 369, 386, 462, 523, 529, 597, 748, 833, 866, 990, 1043, 1113, 1157). The class of food webs consists of 20 food webs from the Pajek data set: http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm (22 January 2018) (Networks: ’ChesLower’, ’ChesMiddle’, ’ChesUpper’, ’Chesapeake’, ’CrystalC’, ’CrystalD’, ’Everglades’, ’Florida’, ’Michigan’, ’Mondego’, ’Narragan’, ’StMarks’, ’baydry’, ’baywet’, ’cypdry’, ’cypwet’, ’gramdry’, ’gramwet’, ’mangdry’, ’mangwet’). The class of internet networks consists of 10 randomly chosen networks from AS-733 [32] (Networks:’1997/11/12’, ’1997/12/28’, ’1998/01/01’, ’1998/06/06’, ’1998/08/13’, ’1998/12/04’, ’1999/03/30’, ’1999/04/17’, ’1999 /06/18’, ’1999/08/30’) and 10 randomly chosen networks from AS-caida [32] (Networks: ’2004/10/04’, ’2006/01/23’, ’2006/03/27’, ’2006/07/10’, ’2006/09/25’, ’2006/11/27’, ’2007/01/15’, ’2007/04/30’, ’2007/05/28’, ’2007/09/24’). Both data sets are from SNAP [65](June 1 2016). The class of world trade networks is a sample of 20 networks of the larger data set considered in [33, 34] (Networks: 1968, 1971, 1974, 1975, 1976, 1978, 1980, 1984, 1989, 1992, 1993, 1996, 1998, 2001, 2003, 2005, 2007, 2010, 2011, 2012). The airline networks were derived from the data available at: http://openflights.org/ (22 January 2018). For this we considered the 50 largest airlines from the database in terms of the number of destinations that the airline serves. For each airline a network is obtained by the considering all airports that are serviced by the airlines which are connected whenever there is direct flight between a pair of nodes. We then took a sample of 20 networks from this larger data set (Airline codes of the networks: ’AD’, ’AF’, ’AM’, ’BA’, ’DY’, ’FL’, ’FR’, ’JJ’, ’JL’, ’MH’, ’MU’, ’NH’, ’QF’, ’SU’, ’SV’, ’U2’, ’UA’, ’US’, ’VY’, ’ZH’). The class of peer to peer networks consist of nine networks of the Gnutella file sharing platform measured at different dates which are available at [65]. The scientific collaboration networks consists of five networks representing different scientific disciplines which were obtained from [65] (1 June 2015). G.2.2 Onnela et al. data set The Onnela et al. data set consists of all undirected and unweighted networks from the larger collection analysed in [31]. A complete list of networks and class membership can be found in the supplementary information of [4]. G.2.3 Time ordered data sets The data sets AS-caida and AS-733 each represent the internet measured at the level of autonomous systems at various points in time. Both data sets were downloaded from [65] (1 June 2015). The World Trade Networks data set is based on the data set [33] for the years 1962–2000 and on UN COMTRADE [34] for the years 2001-2015. Two countries are connected in the network whenever they import or export a commodity from a each other within the given calendar year. The complete data set was downloaded from : http://atlas.media.mit.edu/en/resources/data/ on 22 January 2018. G.2.4 Machine learning benchmarks A short description of the social networks datasets was given in the main text. A more detailed description can be found in [41]. The social network data sets were downloaded from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets on 22 January 2018. A short short description of the chemical compound and protein structure data sets was given in Section E. A more detailed description of the data set can be found in [46]. These data sets were downloaded from: https://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels.html on 22 January 2018. Appendix H. $$\overline{P}$$ and area under ROC We assume that our data set $$S=\{G_1,G_2,...,G_n\}$$ of networks consists of disjoint classes $$C=\{c_1,c_2,...,c_m\}$$. For a given distance measure $$d$$ and given a network $$G\in S$$ we want to calculate the empirical probability $$P(G)$$ that $$d(G,G_1)<d(G,G_2)$$ where $$G_1$$ is a randomly selected network from the same class as $$G$$ (excluding itself) and $$G_2$$ is a randomly selected network from outside the class of $$G$$. Given measure $$d$$, $$P(G)$$ is simply given by:   $$P(G)=\frac{1}{(|S|-|C(G)|)(|C(G)|-1)}\sum_{G_1\in C_G\setminus \{G\}} \sum_{G_2\in S\setminus C_G} I(d(G,G_1)<d(G,G_2)),\label{pg}$$ (H.1) where $$C_G$$ denotes the set of networks in the same class of $$G$$ and $$I$$ is the indicator function. Now we compare this to the area under ROC of the classifier $$kNN(G)$$ that classifies the k-nearest neighbours of $$G$$ to be in the same class a $$G$$. Let the $$k^{th}$$ nearest neighbour of $$G$$ according to $$d$$ be $$G_k$$. For a given $$k$$ the true positive (TP) and false positive (FP) rates can be written as:   $$\label{tp} {\rm TP}(k)=\frac{1}{(|C(G)|-1)}\sum_{G_1\in C_G\setminus \{G\}} I(d(G,G_1)\leq d(G,G_k)).$$ (H.2)  \begin{eqnarray} {\rm FP}(k)=\frac{1}{(|S|-|C(G)|)}\sum_{G_2\in S\setminus C_G} I(d(G,G_2)\leq d(G,G_k)). \end{eqnarray} (H.3) The ROC curve (i.e. $${\rm FP}(k)$$ vs $${\rm TP}(k)$$) is a step function as $${\rm FP}(k)$$ increases by $$\frac{1}{(|S|-|C(G)|)}$$ if $$G_k$$ is a false positive and $${\rm TP}(k)$$ increases its value by $$\frac{1}{(|C(G)|-1)}$$ if $$G_k$$ is a true positive. Consequently the area under the ROC curve is given by $$\sum_k \frac{1}{(|S|-|C(G)|)} {\rm TP}(k) I(G_k\in S\setminus C_G)$$. Substituting Eq. (H.2) into this equation yields Eq. (H.1) showing the equivalence of $$P(G)$$ and the area under the ROC curve of $$kNN(G)$$. H.1 Area under precision recall curve Given the relation of $$\overline{P}$$ to the area under ROC one can also define a performance measure analogous to $$\overline{P}$$ based on the area under precision-recall curve of $$kNN(G)$$. More specifically, we define $$\overline{\rm PR}=\frac{1}{|S|}\sum_{G}{\rm PR}(kNN(G))$$, where $${\rm PR}(kNN(G))$$ is the area under the precision recall curve of $$kNN(G)$$. Although $$\overline{{\rm PR}}$$ produces results that are consistent with $$\overline{P}$$ the area under the precision recall curve lacks the clear statistical interpretation of $$\overline{P}$$, especially in the multi-class setting [66], hence we choose $$\overline{P}$$ as our default measure of performance. $$\overline{{\rm PR}}$$ scores are given Table H1. Note that $${\rm NetEmd}$$ measures achieve the highest $$\overline{{\rm PR}}$$ score on all data sets. Table H1 $$\overline{{\rm PR}}$$ scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest $$\overline{{\rm PR}}$$ score (given in bold) on all data sets. For $$RG_1$$ we calculated the value of the $$\overline{{\rm PR}}$$ score for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{{\rm PR}}$$ values obtained over these 16 sub-data sets    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566  Table H1 $$\overline{{\rm PR}}$$ scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest $$\overline{{\rm PR}}$$ score (given in bold) on all data sets. For $$RG_1$$ we calculated the value of the $$\overline{{\rm PR}}$$ score for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{{\rm PR}}$$ values obtained over these 16 sub-data sets    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566  Appendix I. Performance evaluation via area under precision recall curve by Yaveroglu et al.? The area under precision recall curve (AUPRC) was used as a performance metric for network comparison measures by Yaveroglu et al. [5]. The AUPRC is based on a classifier that for a given distance threshold $$\epsilon$$ classifies pairs of networks to be similar whenever $$d(G,G')<\epsilon$$. A pair satisfying $$d(G,G')<\epsilon$$ is taken to be a true positive whenever $$G$$ and $$G'$$ are from the same class. The AUPRC is then defined to be the area under the precision recall curve obtained by varying $$\epsilon$$ in small increments. However, AUPRC is problematic, especially in settings where one has more than two classes and when classes are separated at different scales. Figure I1 gives three examples of metrics for a problem that has three classes: (a) shows a metric $$d_1$$ (AUPRC = 0.847) that clearly separates the 3-classes which, however, has a lower AUPRC than the metrics given in (b) (AUPRC = 0.902) which confuses half of Class-1 with Class-2 and (c) ( AUPRC = 0.896) which shows 2 rather than 3 classes. The colour scale in the figure represents the magnitude of a comparison between a pair of individuals according to the corresponding metric. Fig. I1. View largeDownload slide Heat maps of three measures for in an example of 3 equally sized classes. (a) Metric $$d_1$$ shows clear separation between the 3 classes. (b) $$d_2$$ shows 3 classes with half of Class-1 positioned closer to Class-2. (c) $$d_3$$ identifies 2 rather than 3 classes. Note that $$d_1$$ has lower AUPRC than $$d_2$$ and $$d_3$$ despite being best at identifying the 3 classes whereas $$\overline{P}$$ values for the metrics are $$\overline{P}(d_1)$$=1.0, $$\overline{P}(d_2)$$=0.887 and $$\overline{P}(d_3)$$=0.869. Similarly, $$\overline{{\rm PR}}$$ values for the metrics are $$\overline{{\rm PR}}(d_1)$$=1.0, $$\overline{{\rm PR}}(d_2)$$=0.893 and $$\overline{{\rm PR}}(d_3)$$=0.882. Fig. I1. View largeDownload slide Heat maps of three measures for in an example of 3 equally sized classes. (a) Metric $$d_1$$ shows clear separation between the 3 classes. (b) $$d_2$$ shows 3 classes with half of Class-1 positioned closer to Class-2. (c) $$d_3$$ identifies 2 rather than 3 classes. Note that $$d_1$$ has lower AUPRC than $$d_2$$ and $$d_3$$ despite being best at identifying the 3 classes whereas $$\overline{P}$$ values for the metrics are $$\overline{P}(d_1)$$=1.0, $$\overline{P}(d_2)$$=0.887 and $$\overline{P}(d_3)$$=0.869. Similarly, $$\overline{{\rm PR}}$$ values for the metrics are $$\overline{{\rm PR}}(d_1)$$=1.0, $$\overline{{\rm PR}}(d_2)$$=0.893 and $$\overline{{\rm PR}}(d_3)$$=0.882. Some of the problems of AUPRC are the following. First, AUPRC is based on a classifier that identifies pairs of similar networks and hence is only indirectly related to the problem of separating classes. Moreover, the classifier uses a single global threshold $$\epsilon$$ for all networks and classes, and hence implicitly assumes that all classes are separated on the same scale. The AUPRC further lacks a clear statistical interpretation, which complicates its use especially when one has multiple classes and when precision recall curves of different measures intersect. Despite its problems, we give AUPRC values for all measures we considered in the main text in Table I1 for the sake of completeness. Note that $${\rm NetEmd}$$ measures achieve the highest AUPRC on all data sets. Table I1 AUPRC scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest AUPRC score (given in bold) on all data sets. For $$RG_1$$, we calculated the value of the AUPRC score for each of the 16 sub-data sets. The table shows the average and standard deviation of the AUPRC values obtained over these 16 sub-data sets.    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625  Table I1 AUPRC scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest AUPRC score (given in bold) on all data sets. For $$RG_1$$, we calculated the value of the AUPRC score for each of the 16 sub-data sets. The table shows the average and standard deviation of the AUPRC values obtained over these 16 sub-data sets.    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625  References 1. Newman M. E. J. Networks: An Introduction . Oxford, UK: Oxford University Press, 2010. Google Scholar CrossRef Search ADS   2. Wilson R. C. & Zhu P. ( 2008) A study of graph spectra for comparing graphs and trees. Pattern Recogn. , 41, 2833– 2841. Google Scholar CrossRef Search ADS   3. Neyshabur B., Khadem A., Hashemifar S. & Arab S. S. ( 2013) Netal: a new graph- based method for global alignment of protein–protein interaction networks. Bioinformatics , 27, 1654– 1662. Google Scholar CrossRef Search ADS   4. Ali W., Rito T., Reinert G., Sun F. & Deane C. M. ( 2014) Alignment-free protein interaction network comparison. Bioinformatics , 30, i430– i437. Google Scholar CrossRef Search ADS PubMed  5. Yaveroglu Ö. N., Malod-Dognin N., Davis D., Levnajic Z., Janjic V., Karapandza R., Stojmirovic A. & Przulj N. ( 2014) Revealing the hidden language of complex networks. Sci. Rep. , 4. 6. Singh R., Xu J. & Berger B. ( 2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA , 105, 12763– 12768. Google Scholar CrossRef Search ADS   7. Kuchaiev O. & Pržulj N. ( 2011) Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics , 27, 1390– 1396. Google Scholar CrossRef Search ADS PubMed  8. Mamano N. & Hayes W. B. ( 2017) Sana: simulated annealing far outperforms many other search algorithms for biological network alignment. Bioinformatics , 33, 2156– 2164. Google Scholar CrossRef Search ADS PubMed  9. Hashemifar S. & Xu J. ( 2014) HubAlign: an accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics , 30, i438– i444. Google Scholar CrossRef Search ADS PubMed  10. Milo R., Itzkovitz S., Kashtan N., Levitt R., Shen-Orr S., Ayzenshtat I. Sheffer M. & Alon U. ( 2004) Superfamilies of evolved and designed networks. Science , 303, 1538– 1542. Google Scholar CrossRef Search ADS PubMed  11. Pržulj N. ( 2007) Biological network comparison using graphlet degree distribution. Bioinformatics , 23, e177– e183. Google Scholar CrossRef Search ADS PubMed  12. Rito T., Wang Z., Deane C. M. & Reinert G. ( 2010) How threshold behaviour affects the use of subgraphs for network comparison. Bioinformatics , 26, i611– i617. Google Scholar CrossRef Search ADS PubMed  13. Kossinets G. & Watts D. J. ( 2006) Empirical analysis of an evolving social network. Science , 311, 88– 90. Google Scholar CrossRef Search ADS PubMed  14. Borgwardt K. M., Kriegel H.-P., Vishwanathan S. V. N. & Schraudolph N. N. ( 2007) Graph kernels for disease outcome prediction from protein-protein interaction networks. Pacific Symposium on Biocomputing , vol. 12. Singapore: World Scientific Publishing, pp. 4– 15. 15. Wale N., Watson I. A. & Karypis G. ( 2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl. Inf. Syst. , 14, 347– 375. Google Scholar CrossRef Search ADS   16. Barabási A.-L. & Albert R. ( 1999) Emergence of scaling in random networks. Science , 286, 509– 512. Google Scholar CrossRef Search ADS PubMed  17. Milo R., Shen-Orr S., Itzkovitz S., Kashtan N., Chklovskii D. & Alon U. ( 2002) Network motifs: Simple building blocks of complex networks. Science , 298, 824– 827. Google Scholar CrossRef Search ADS PubMed  18. Masoudi-Nejad A., Schreiber F., Kashani Z. & Razaghi M. ( 2012) Building blocks of biological networks: a review on major network motif discovery algorithms. IET Syst. Biol. , 6, 164– 174. Google Scholar CrossRef Search ADS PubMed  19. Wegner A. E. ( 2014) Subgraph covers: an information-theoretic approach to motif analysis in networks. Phys. Rev. X , 4, 041026. 20. Holmes S. & Reinert G. ( 2004) Stein’s method for the bootstrap. Stein’s Method . Beachwood, OH, USA: Institute of Mathematical Statistics, pp. 93– 132. Google Scholar CrossRef Search ADS   21. Bhattacharya S. & Bickel P. J. ( 2015) Subsampling bootstrap of count features of networks. Ann. Statist. , 43, 2384– 2411. Google Scholar CrossRef Search ADS   22. Ali W., Wegner A. E., Gaunt R. E., Deane C. M. & Reinert G. ( 2016) Comparison of large networks with sub-sampling strategies. Sci. Rep. , 6, 28955. Google Scholar CrossRef Search ADS PubMed  23. Runber Y., Tomasi C. & Guibas L. J. ( 1998) A metric for distributions with applications to image databases. IEEE International Conference on Computer Vision . Piscataway, NJ, USA: IEEE, pp. 59– 66. 24. Vázquez A., Flammini A., Maritan A. & Vespignani A. ( 2003) Modeling of protein interaction networks. Complexus , 1, 38– 44. Google Scholar CrossRef Search ADS   25. Erdős P. & Rényi A. ( 1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. , 5, 1960. 26. Ispolatov I., Krapivsky P. L. & Yuryev A. ( 2005) Duplication-divergence model of protein interaction network. Phys. Rev. E , 71, 061911. Google Scholar CrossRef Search ADS   27. Higham D. J., Rašajski M. & Pržulj N. ( 2008) Fitting a geometric graph to a protein–protein interaction network. Bioinformatics , 24, 1093– 1099. Google Scholar CrossRef Search ADS PubMed  28. Penrose M. ( 2003) Random Geometric Graphs . Oxford, UK: Oxford University Press. Google Scholar CrossRef Search ADS   29. Molloy M. & Reed B. ( 1995) A critical point for random graphs with a given degree sequence. Random Structures Algorithms , 6, 161– 180. Google Scholar CrossRef Search ADS   30. Watts D. J. & Strogatz S. H. ( 1998) Collective dynamics of ‘small-world’ networks. Nature , 393, 440– 442. Google Scholar CrossRef Search ADS PubMed  31. Onnela J.-P., Fenn D. J., Reid S., Porter M. A., Mucha P. J., Fricker M. D. & Jones N. S. ( 2012) Taxonomies of networks from community structure. Phys. Rev. E , 86, 036104, 2012. Google Scholar CrossRef Search ADS   32. Leskovec J., Kleinberg J. & Faloutsos C. ( 2005) Graphs over time: densification laws, shrinking diameters and possible explanations. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining . New York, NY, USA: ACM, pp. 177– 187. 33. Feenstra R. C., Lipsey R. E., Deng H., Ma A. C. & Mo H. ( 2005) World trade flows: 1962–2000. Technical Report . National Bureau of Economic Research. 34. United-Nations-Statistics-Division ( 2015) United nations commodity trade statistics database (un comtrade). http://comtrade.un.org/ ( last date accessed 22 January 2018). 35. Maugis P. G., Priebe C. E., Olhede S. C. & Wolfe P. J. ( 2017) Statistical inference for network samples using subgraph counts. ArXiv preprint: arXiv:1701.00505 . 36. Xing E. P., Ng A. Y., Jordan M. I. & Russell S. ( 2003) Distance metric learning with application to clustering with side-information. Adv. Neur. Inf. Process. Syst. , 15, 505– 512. 37. Mohar B., Alavi Y., Chartrand G. & Oellermann O. R. ( 1991) The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications  ( Mohar B. ed.), 2, Wiley NY, USA: Wiley Intersci. Publ., pp. 871– 898. 38. Chung F. R. K. ( 1997) Spectral Graph Theory , vol. 92. Providence, RI, USA: American Mathematical Society. 39. Banerjee A. & Jost J. ( 2008) On the spectrum of the normalized graph laplacian. Linear Algebra Appl. , 428, 3015– 3022. Google Scholar CrossRef Search ADS   40. Gu J., Jost J., Liu S. & Stadler P. F. ( 2016) Spectral classes of regular, random, and empirical graphs. Linear Algebra Appl. , 489, 30– 49. Google Scholar CrossRef Search ADS   41. Yanardag P. & Vishwanathan S. V. N. ( 2015) Deep graph kernels. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . New York, NY, USA: ACM, pp. 1365– 1374. 42. Cortes C. & Vapnik V. ( 1995) Support-vector networks. Mach. Learn. , 20, 273– 297. 43. Shervashidze N., Vishwanathan S. V. N., Petri T., Mehlhorn K. & Borgwardt K. M. ( 2009) Efficient graphlet kernels for large graph comparison. AISTATS  ( van Dyk D. & Welling M. eds), 5, Florida, USA: PMLR, pp. 488– 495. 44. Barnett I., Malik N., Kuijjer M. L., Mucha P. J. & Onnela J.-P. ( 2016) Feature-based classification of networks. CoRR , arXiv preprint: arXiv:1610.05868. 45. Niepert M., Ahmed M. & Kutzkov K. ( 2016) Learning convolutional neural networks for graphs. International Conference on Machine Learning . New York, NY, USA: PMLR, pp. 2014– 2023. 46. Shervashidze N., Schweitzer P., van Leeuwen E. J., Mehlhorn K. & Borgwardt K. M. ( 2011) Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. , 12, 2539– 2561. 47. Hočevar T. & Demšar J. ( 2014) A combinatorial approach to graphlet counting. Bioinformatics , 30, 559– 565. Google Scholar CrossRef Search ADS PubMed  48. Brent R. P. ( 1971) An algorithm with guaranteed convergence for finding a zero of a function. Comput. J. , 14, 422– 425. Google Scholar CrossRef Search ADS   49. Thüne M. ( 2013) Eigenvalues of matrices and graphs. PhD Thesis , University of Leipzig. 50. Debnath A. K., Lopez de Compadre R. L., Debnath G., Shusterman A. J. & Hansch C. ( 1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. , 34, 786– 797. Google Scholar CrossRef Search ADS PubMed  51. Borgwardt K. M., Ong C. S., Schonauer S., Vishwanathan SVN., Smola A. J. & Kriegel H.-P. ( 2005) Protein function prediction via graph kernels. Bioinformatics , 21, i47– i56. Google Scholar CrossRef Search ADS PubMed  52. Sugiyama M. & Borgwardt K. ( 2015) Halting in random walk kernels. Advances in Neural Information Processing Systems . ( Cortes C. Lee D. D. Sugiyama M. & Garnett R. eds), Cambridge, MA, USA: MIT Press, pp. 1639– 1647. 53. Dobson P. D. & Doig A. J. ( 2003) Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. , 330, 771– 783. Google Scholar CrossRef Search ADS PubMed  54. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O. Blondel M., Prettenhofer P., Weiss R., Dubourg V. et al.   ( 2011) Scikit-learn: machine learning in python. J. Mach. Learn. Res. , 12, 2825– 2830. 55. Chang C.-C. & Lin C.-J. ( 2011) Libsvm: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) , 2, 27. 56. Jayasumana S., Hartley R., Salzmann M., Li H. & Harandi M. ( 2015) Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Trans. Pattern Anal. Mach. Intell. , 37, 2464– 2477. Google Scholar CrossRef Search ADS PubMed  57. Luss R. & d’Aspremont A. ( 2008) Support vector machine classification with indefinite kernels. Advances in Neural Information Processing Systems . ( Platt J. C. Koller D. Singer Y. & Roweis S. T. eds), USA: Curran Associates Inc., pp. 953– 960. Google Scholar CrossRef Search ADS   58. Haasdonk B. ( 2005) Feature space interpretation of SVMS with indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. , 27, 482– 492. Google Scholar CrossRef Search ADS PubMed  59. Gilbert E. N. ( 1961) Random plane networks. J. Soc. Ind. Appl. Math. , 9, 533– 543. Google Scholar CrossRef Search ADS   60. Traud A. L., Mucha P. J. & Porter M. A. ( 2012) Social structure of Facebook networks. Phys. A , 391, 4165– 4180. Google Scholar CrossRef Search ADS   61. Jeong H., Tombor B., Albert R., Oltvai Z. N. & Barabási A.-L. ( 2000) The large-scale organization of metabolic networks. Nature , 407, 651– 654. Google Scholar CrossRef Search ADS PubMed  62. Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A. & Tyers M. ( 2006) Biogrid: a general repository for interaction datasets. Nucleic Acids Res. , 34, D535– D539. Google Scholar CrossRef Search ADS PubMed  63. Das J. & Yu H. ( 2012) Hint: High-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol. , 6, 92. Google Scholar CrossRef Search ADS PubMed  64. Rajagopala S. V., Sikorski P., Kumar A., Mosca R., Vlasblom J., Arnold R., Franca-Koh J., Pakala S. B., Phanse S., Ceol A., et al.   ( 2014) The binary protein–protein interaction landscape of escherichia coli. Nat. Biotechnol. , 32, 285– 290. Google Scholar CrossRef Search ADS PubMed  65. Jure L. & Krevl A. ( 2014) SNAP datasets: Stanford large network dataset collection. https://snap.stanford.edu/data/ ( last date accessed 22 January 2018). 66. Hand D. J. & Till R. J. ( 2001) A simple generalisation of the area under the roc curve for multiple class classification problems. Mach. Learn. , 45, 171– 186. Google Scholar CrossRef Search ADS   © The authors 2018. Published by Oxford University Press. All rights reserved. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Complex Networks Oxford University Press

# Identifying networks with common organizational principles

, Volume Advance Article – Feb 3, 2018
27 pages

/lp/ou_press/identifying-networks-with-common-organizational-principles-Sfqr2qfEzB
Publisher
Oxford University Press
ISSN
2051-1310
eISSN
2051-1329
D.O.I.
10.1093/comnet/cny003
Publisher site
See Article on Publisher Site

### Abstract

Abstract Many complex systems can be represented as networks, and the problem of network comparison is becoming increasingly relevant. There are many techniques for network comparison, from simply comparing network summary statistics to sophisticated but computationally costly alignment-based approaches. Yet it remains challenging to accurately cluster networks that are of a different size and density, but hypothesized to be structurally similar. In this article, we address this problem by introducing a new network comparison methodology that is aimed at identifying common organizational principles in networks. The methodology is simple, intuitive and applicable in a wide variety of settings ranging from the functional classification of proteins to tracking the evolution of a world trade network. 1. Introduction Many complex systems can be represented as networks, including friendships, the World Wide Web, global trade flows and protein-protein interactions [1]. The study of networks has been a very active area of research in recent years, and in particular, network comparison has become increasingly relevant for example [2–5]. Network comparison itself has many wide-ranging applications, for example, comparing protein–protein interaction networks could lead to increased understanding of underlying biological processes [4, 6]. Network comparison can also be used to study the evolution of networks over time and for identifying sudden changes and shocks. Methods for comparing networks range from comparison of summary statistics to sophisticated alignment-based approaches [3, 7, 8]. Alignment based approaches aim to identify structurally similar regions of networks by finding a mapping between network nodes that maximizes the overlap between the networks. This mapping information is often the principal aim of such methods which for instance in the context of protein–protein interactions, could be used to infer the biological function and structure of proteins in different organisms [3, 6–8]. Although network alignment is NP-hard in general several efficient heuristics have emerged over recent years [8, 9]. On the other hand, alignment-free methods in the form of network similarity measures and network distances aim to quantify the overall similarity of networks on the basis of network features. Alignment free methods typically are computationally less expensive than alignment based methods and can effectively compare large sets of networks. Network comparison measures have many applications such as goodness of fit tests of random graph models of real world networks [10–12] and the tracking the evolution of network time series [13]. Network comparison measures have also attracted increasing attention in the field of machine learning, where they are mostly referred to as graph kernels, with applications for example in personalized medicine for example [14], computer vision and drug discovery for example [15]. Real-world networks can be very large and are often inhomogeneous, which makes the problem of network comparison challenging, especially when networks differ significantly in terms of size and density. In this article, we address this problem by introducing a new network comparison methodology that is aimed at comparing networks according to their common organizational principles. The observation that the degree distribution of many real world networks is highly right skewed and in many cases approximately follows a power law has been very influential in the development of network science [16]. Consequently, it has become widely accepted that the shape of the degree distribution (e.g. binomial vs power law) is indicative of the generating mechanism underlying the network. In this article, we formalize this idea by introducing a measure that captures the shape of distributions. The measure emerges from the requirement that a metric between forms of distributions should be invariant under rescalings and translations of the observables. Based on this measure, we then introduce a new network comparison methodology, which we call NetEmd. Although our methodology is applicable to almost any type of feature that can be associated to nodes or edges of a graph, we focus mainly on distributions of small connected subgraphs, also known as graphlets. Graphlets form the basis of many of the state of the art network comparison methods [4, 5, 11] and hence using graphlet based features allows for a comparative assessment of the presented methodology. Moreover, certain subgraph patterns, called network motifs [17, 18], occur much more frequently in many real world networks than is expected on the basis of pure chance. Network motifs are considered to be basic building blocks of networks that contribute to the function of the network by performing modular tasks and have therefore been conjectured to be favoured by natural selection. This is supported by the observation that network motifs are largely conserved within classes of networks [10, 19]. Our methodology provides an effective tool for comparing networks even when networks differ significantly in size and density, which is the case in most applications. The methodology performs well on a wide variety of networks ranging from chemical compounds having as few as 10 nodes to internet network with tens of thousands of nodes. The method achieves state of the art performance even when based on a rather restricted set of features, specifically distributions of graphlets up to size 3. Graphlets of size 3 can be enumerated very efficiently and hence when based on these features the method scales favourably to networks with millions and even billions of nodes. The method also behaves well under the network sub-sampling [5, 20–22]. The methodology further meets the needs of researchers from a variety of fields, from the social sciences to the biological and life sciences, by being computationally efficient and simple to implement. We test the presented methodology in a large number of settings, starting with clustering synthetic and real world networks, where we find that the presented methodology outperforms state of the art graphlet-based network comparison methods in clustering networks of different sizes and densities. We then test the more fine grained properties of NetEmd using data sets that represent evolving networks at different points in time. Finally, we test whether NetEmd can predict functional categories of networks by exploring machine learning applications and find that classifiers based on NetEmd outperform state-of-the art graph classifiers on several benchmark data sets. 2. A measure for comparing shapes of distributions Here we build on the idea that the information encapsulated in the shape of the degree distribution and other network properties reflects the topological organization of the network. From an abstract point of view we think of the shape of a distribution as a property that is invariant under linear deformations i.e. translations and re-scalings of the axis. For example, a Gaussian distribution always has its characteristic bell curve shape regardless of its mean and standard deviation. Consequently, we postulate that any metric that aims to capture the similarity of shapes should be invariant under such linear transformations of its inputs. Based on these ideas we define the following measure between distributions $$p$$ and $$q$$ that are supported on $$\mathbb{R}$$ and have non-zero, finite variances:   $$\label{emdmet} {\rm EMD}^*(p,q)=\mathrm{inf}_{c\in\mathbb{R}}\left( {\rm EMD}\big(\tilde{p}(\cdot+c),\tilde{q}(\cdot)\big)\right)\!,$$ (2.1) where $${\rm EMD}$$ is the earth mover’s distance and $$\tilde{p}$$ and $$\tilde{q}$$ are the distributions obtained by rescaling $$p$$ and $$q$$ to have variance 1. More precisely, $$\tilde{p}$$ is the distribution obtained from $$p$$ by the transformation $$x\rightarrow \frac{x}{\sigma(p)}$$, where $$\sigma(p)$$ is the standard deviation of $$p$$. Intuitively, $${\rm EMD}$$ (also known as the first Wasserstein metric [23]) can be thought of as the minimal work, that is mass times distance, needed to ‘transport’ the mass of one distribution onto the other. For probability distributions $$p$$ and $$q$$ with support in $$\mathbb{R}$$ and bounded absolute first moment, the $${\rm EMD}$$ between $$p$$ and $$q$$ is given by $${\rm EMD}(p,q)=\int_{-\infty}^\infty|F(x)-G(x)|\,\mathrm{d}x$$, where $$F$$ and $$G$$ are the cumulative distribution functions of $$p$$ and $$q$$ respectively. In principle, $${\rm EMD}$$ in equation (2.1) can be replaced by almost any other probability metric $$d$$ to obtain a corresponding metric $$d^*$$. Here we choose $${\rm EMD}$$ because it is well suited to comparing shapes, as shown by its many applications in the area of pattern recognition and image retrieval [23]. Moreover, we found that $${\rm EMD}$$ produces superior results to classical $$L^1$$ and Kolmogorov distances, especially for highly irregular distributions that one frequently encounters in real world networks. For two networks $$G$$ and $$G'$$ and given network feature $$t$$, we define the corresponding $${\rm NetEmd}_t$$ measure by:   $${\rm NetEmd}_t (G,G')={\rm EMD}^*(p_t(G),p_t(G')),$$ (2.2) where $$p_t(G)$$ and $$p_t(G')$$ are the distributions of $$t$$ on $$G$$ and $$G'$$ respectively. $${\rm NetEmd}_t$$ can be shown to be a pseudometric between graphs for any feature $$t$$ (see Appendix Section B), that is it is non-negative, symmetric and satisfies the triangle inequality. Figure 1 gives examples where $$t$$ is taken to be the degree distribution, and $$p_t(G)$$ is the degree distribution of $$G$$. Fig. 1. View largeDownload slide Plots of rescaled and translated degree distributions for Barabasi–Albert (BA) and Erdős–Rényi (ER) models with $$N$$ nodes and average degree $$k$$: (a) BA $$N=5000$$, $$k=100$$ vs BA $$N=50,000$$, $$k=10$$. (b) ER $$N=5000$$, $$k=100$$ vs ER $$N=50000$$, $$k=10$$. (c) BA $$N=5000$$, $$k=100$$ vs ER $$N=5000$$, $$k=100$$. The $${\rm EMD}^*$$ distances between the degree distribution of two BA or ER models with quite different values of $$N$$ and $$k$$ are smaller than the $${\rm EMD}^*$$ distance between the degree distribution of a BA and ER model when the number of nodes and average degree are equal. Fig. 1. View largeDownload slide Plots of rescaled and translated degree distributions for Barabasi–Albert (BA) and Erdős–Rényi (ER) models with $$N$$ nodes and average degree $$k$$: (a) BA $$N=5000$$, $$k=100$$ vs BA $$N=50,000$$, $$k=10$$. (b) ER $$N=5000$$, $$k=100$$ vs ER $$N=50000$$, $$k=10$$. (c) BA $$N=5000$$, $$k=100$$ vs ER $$N=5000$$, $$k=100$$. The $${\rm EMD}^*$$ distances between the degree distribution of two BA or ER models with quite different values of $$N$$ and $$k$$ are smaller than the $${\rm EMD}^*$$ distance between the degree distribution of a BA and ER model when the number of nodes and average degree are equal. Metrics based on a single feature might not be effective in capturing the structural differences between networks as two networks that agree in terms of a single feature such as the degree distribution might still differ significantly in terms of other structures. For instance, there are many generation mechanisms that produce networks with power-law type degree distributions but differ significantly in other aspects. Consequently, measures that are based on the comparison of multiple features can be expected to be more effective at identifying structural differences between networks than measures that are based on a single feature $$t$$, because for two networks to be considered similar they must show similarity across multiple features. Hence, for a given set $$T=\{t_1,t_2,...,t_m\}$$ of network features, we define the $${\rm NetEmd}$$ measure corresponding to $$T$$ simply as:   $$\label{eq:def_netemd} {\rm NetEmd}_T(G,G')=\frac{1}{m}\sum_{j=1}^{m} {\rm NetEmd}_{t_j} (G,G').$$ (2.3) Although $${\rm NetEmd}$$ can in principle be based on any set $$T$$ of network features to which one can associate distributions, we initially consider only features that are based on distributions of small connected subgraphs, also known as graphlets. Graphlets form the basis of many state of the art network comparison methods and hence allow for a comparative assessment of the proposed methodology that is independent of the choice of features. First, we consider graphlet degree distributions (GDDs) [11] as our set of features. For a given graphlet $$m$$, the graphlet degree of a node is the number of graphlet-$$m$$ induced subgraphs that are attached to the node. One can distinguish between the different positions the node can have in $$m$$, which correspond to the automorphism orbits of $$m$$, see Fig. 2. We initially take the set of 73 GDDs corresponding to graphlets up to size 5 to be the default set of inputs, for which we denote the metric as $${\rm NetEmd}_{G5}$$. Fig. 2. View largeDownload slide Graphlets on two to four nodes. The different shades in each graphlet represent different automorphism orbits, numbered from 0 to 14. Fig. 2. View largeDownload slide Graphlets on two to four nodes. The different shades in each graphlet represent different automorphism orbits, numbered from 0 to 14. Later, we also explore alternative definitions of subgraph distributions based on ego networks, as well as the effect of varying the size of subgraphs considered in the input. Finally, we consider the eigenvalue spectra of the graph Laplacian and the normalized graph Laplacian as inputs. 3. Results In order to give a comparative assessment of $${\rm NetEmd}$$, we consider other graphlet based network comparison methods, namely GDDA [11], GCD [5] and Netdis [4]. These represent the most effective alignment-free network comparison methodologies in the existing literature. While GDDA directly compares distributions of graphlets up to size 5 in a pairwise fashion, GCD is based on comparing rank correlations between graphlet degrees. Here we consider both default settings of GCD [5], namely GCD11, which is based on a non-redundant subset of 11 graphlets up to size 4, and GCD73 which uses all graphlets up to size 5. Netdis differs from GDDA and GCD in that it is based on subgraph counts in ego-networks of nodes. Another important distinction is that Netdis first centers these raw counts by comparing them to the counts that could be expected under a particular null model before computing the final statistics. In our analysis, we consider two null models: an Erdös–Rényi random graph and a duplication divergence graph [24] which has a scale-free degree distribution as well as a high clustering coefficient. We denote these two variants as $${\rm Netdis}_{\rm ER}$$ and $${\rm Netdis}_{\rm SF}$$, respectively. We initially report results of a single variant $${\rm NetEmd}_{G5}$$ that is based on graphlets up to size 5 in order to provide a comparison between the different methods that is independent of the choice of features. Results obtained using different feature sets are discussed in Section 3.3. 3.1 Clustering synthetic and real world networks We start with the classical setting of network comparison where the task is to identify groups of structurally similar networks. The main challenge in this setting is to identify structurally similar networks even though they might differ substantially in terms of size and density. Given a set $$S=\{G_1,G_2,...,G_n\}$$ of networks consisting of disjoint classes $$C=\{c_1,c_2,...,c_m\}$$ one would like a network comparison measure $$d$$ to position networks from the same class closer to each other when compared to networks from other classes. Given a network $$G$$, this can be measured in terms of the empirical probability $$P(G)$$ that $$d(G,G_1)<d(G,G_2)$$ where $$G_1$$ is a randomly selected network from the same class as $$G$$ (excluding itself) and $$G_2$$ is a randomly selected network from outside the class of $$G$$ and $$d$$ is the network comparison statistic. Consequently, the performance over the whole data set is measured in terms of the quantity $$\overline{P}=\frac{1}{|S|}\sum_{G\in S}P(G)$$. It can be shown that $$P(G)$$ is equivalent to the area under the receiver operator characteristic curve of a binary classifier that for a given network $$G$$ classifies the $$k$$ nearest neighbours of $$G$$ with respect to $$d$$ as being in the same class as $$G$$ (See Appendix H). Hence, a measure that positions networks randomly has an expected $$\overline{P}$$ of 0.5 whereas $$\overline{P}=1$$ corresponds to perfect separation between classes. Other measures are discussed in the Appendix. Conclusions reached in this article hold regardless of which of these performance measures one uses. We first test $${\rm NetEmd}$$ on synthetic networks corresponding to realizations of eight random graph models, namely the Erdős–Rényi random graphs [25], the Barabasi Albert preferential attachment model [16], two duplication divergence models [24, 26], the geometric gene duplication model [27], 3D geometric random graphs [28], the configuration model [29], and Watts–Strogatz small world networks [30] (see Section G.1 in the Appendix for details). For synthetic networks, we consider three experimental settings of increasing difficulty, starting with the task of clustering networks that have same size $$N$$ and average degree $$k$$ according to generating mechanism—a task that is relevant in a model selection setting. For this we generate 16 data sets, which collectively we call $$RG_1$$, corresponding to combinations of $$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$, each containing 10 realizations per model, that is 80 networks. This is an easier problem than clustering networks of different sizes and densities, and in this setting we find that the $$\overline{P}$$ scores (see Fig. 3c) of top performing measures tend to be within one standard deviation of each other. We find that $${\rm NetEmd}_{G5}$$ and GCD73 achieve the highest scores, followed by GCD11 and $${\rm Netdis}_{\rm SF}$$. Fig. 3. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_2$$ ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) according to $${\rm NetEmd}_{G5}$$ and GCD73, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree and node count. The heatmap of $${\rm NetEmd}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models while the heatmap of GCD73 shows signs of off-diagonal mixing. (c) $$\overline{P}$$ values for various comparison measures for data sets of synthetic and real world networks. For $$RG_1$$ we calculated the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets. Fig. 3. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_2$$ ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) according to $${\rm NetEmd}_{G5}$$ and GCD73, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree and node count. The heatmap of $${\rm NetEmd}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models while the heatmap of GCD73 shows signs of off-diagonal mixing. (c) $$\overline{P}$$ values for various comparison measures for data sets of synthetic and real world networks. For $$RG_1$$ we calculated the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets. Having established that $${\rm NetEmd}$$ is able to differentiate networks according to generating mechanism, we move on to the task of clustering networks of different sizes and densities. For this we generate two data sets: $$RG_2$$ in which the size $$N$$ and average degree $$k$$ are increased independently in linear steps to twice their initial value ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) and $$RG_3$$ in which the size and average degree are increased independently in multiples of 2 to 8 times their initial value ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$). In $$RG_3$$, the number of nodes and average degrees of the networks both vary by one order of magnitude, and therefore clustering according to model type is challenging. Both $$RG_2$$ and $$RG_3$$ contain 10 realizations per model parameter so that these contain $$3\times6\times8\times10=1440$$ and $$4\times4\times8\times10=1280$$ networks, respectively. Finally, we consider a data set consisting of networks from 10 different classes of real world networks (RWN) as well as a data set from [4] that consists of real world and synthetic networks from the larger collection compiled by Onnela $$et$$$$al.$$ [31]. Since the generation mechanisms of real world networks are generally unknown the ‘ground truth’ of real world data sets is based on the assumption that networks from a certain domain are structurally more similar to each other than to networks from different domains. We find that $${\rm NetEmd}_{G5}$$ outperforms all of the other three methods at clustering networks of different sizes and densities on all data sets. The difference can also be seen in the heatmaps of $${\rm NetEmd}_{G5}$$ and GCD73, the second best performing method for $$RG_2$$, given in Fig. (3a,b). While the heatmap of $${\rm NetEmd}_{G5}$$ shows eight clearly identifiable blocks on the diagonal corresponding to different generative models, the heatmap of GCD73 shows signs of off-diagonal mixing. The difference in performance becomes even more pronounced on more challenging data sets, that is on $$RG_3$$ (see Fig. D2 in the Appendix) and the Onnela et al. data set. 3.2 Time ordered networks A network comparison measure should ideally not only be able to identify groups of similar networks but should also be able to capture structural similarity at a finer local scale. To study the behaviour of $${\rm NetEmd}$$ at a more local level, we consider data sets that represent a system measured at different points in time. Since it is reasonable to assume that such networks evolve gradually over time towards their final state they offer a sensible experimental setting for testing the local properties of network comparison methodologies. We consider two data sets named AS-caida and AS-733 [32] that represent the topology of the Internet at the level of autonomous systems and a third data set that consists of bilateral trade flows between countries for the years 1962–2014 [33, 34]. Both edges and nodes are added and deleted over time in all three data sets. As was noted in [32] the time ranking in evolving networks is reflected to a certain degree in simple summary statistics. Hence, recovering the time ranking of evolving networks should be regarded as a test of consistency rather than an evaluation of performance. In order to minimize the dependence of our results on the algorithm that is used to rank networks, we consider four different ways of ranking networks based on their pairwise distances as follows. We assume that either the first or last network in the time series is given. Rankings are then constructed in a step-wise fashion. At each step one either adds the network that is closest to the last added network (Algorithm 1), or adds the network that has smallest average distance to all the networks in the ranking constructed so far (Algorithm 2). The performance of a measure in ranking networks is then measured in terms of Kendall’s rank correlation coefficient $$\tau$$ between the true time ranking and the best ranking obtained by any of the four methods. We find that $${\rm NetEmd}_{G5}$$ successfully recovers the time ordering for all three data sets, as can be seen in the time ordered heatmaps given in Fig. 4 which all show clear groupings along the diagonal. The heat maps for the two internet data sets show small groups of outliers which can also be identified as sudden jumps in summary statistics for example the number of nodes. The two large clusters in the heatmap of world trade networks (Fig. 4(c)) coincide with a change in the data gathering methodology in 1984 [33]. Although $${\rm NetEmd}_{G5}$$ comes second to $${\rm Netdis}_{\rm SF}$$ on AS-733 and to GCD11 on AS-caida, $${\rm NetEmd}_{G5}$$ has the highest overall score and is the only measure that achieves consistently high scores on all three data sets. Fig. 4. View largeDownload slide (a), (b) and (c) Heatmaps of $${\rm NetEmd}_{G5}$$ for networks representing the internet at the level of autonomous systems networks and world trade networks. The date of measurement increases from left to right/ top to bottom. $${\rm NetEmd}_{G5}$$ accurately captures the evolution over time in all three data sets by positioning networks that are close in time closer to each other resulting in a clear signal along the diagonal. (d) Kendall’s rank correlation coefficient between the true time ranking and rankings inferred from different network comparison measures. Fig. 4. View largeDownload slide (a), (b) and (c) Heatmaps of $${\rm NetEmd}_{G5}$$ for networks representing the internet at the level of autonomous systems networks and world trade networks. The date of measurement increases from left to right/ top to bottom. $${\rm NetEmd}_{G5}$$ accurately captures the evolution over time in all three data sets by positioning networks that are close in time closer to each other resulting in a clear signal along the diagonal. (d) Kendall’s rank correlation coefficient between the true time ranking and rankings inferred from different network comparison measures. 3.3 NetEmd based on different sets of inputs We examine the effect of reducing the size of graphlets considered in the input of $${\rm NetEmd}$$, which is also relevant from a computational point of view, since enumerating graphlets up to size 5 can be challenging for very large networks. We consider variants based on the graphlet degree distributions of graphlets up to sizes 3 and 4, which we denote as $${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$. We also consider $${\rm NetEmd}_{\rm DD}$$ which is based only on the degree distribution as a baseline. Results are given in Table 1. We find that reducing the size of graphlets from 5 to 4 does not significantly decrease the performance of $${\rm NetEmd}$$ and actually produces better results on three data sets ($$RG_3$$, Real world and Onnela et $$al.$$). Even when based on only graphlets up to size 3, that is just edges, 2-paths and triangles, $${\rm NetEmd}$$ outperforms all other non-$${\rm NetEmd}$$ methods that we tested on at least 6 out of 8 data sets. Table 1 Results for different variants of $${\rm NetEmd}$$ based on distributions of graphlets up to size 3 and 4 ($${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$), counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$), eigenvalue spectra of Laplacian operators ($${\rm NetEmd}_{s}$$) and the degree distribution ($${\rm NetEmd}_{DD}$$). Values in bold indicate that a measure achieves the highest score among all measures considered in the manuscript. For $$RG_1$$ we calculate the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Table 1 Results for different variants of $${\rm NetEmd}$$ based on distributions of graphlets up to size 3 and 4 ($${\rm NetEmd}_{G3}$$ and $${\rm NetEmd}_{G4}$$), counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$), eigenvalue spectra of Laplacian operators ($${\rm NetEmd}_{s}$$) and the degree distribution ($${\rm NetEmd}_{DD}$$). Values in bold indicate that a measure achieves the highest score among all measures considered in the manuscript. For $$RG_1$$ we calculate the value of $$\overline{P}$$ for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{P}$$ values obtained over these 16 sub-data sets Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Data set  $${\rm NetEmd}_{G3}$$  $${\rm NetEmd}_{G4}$$  $${\rm NetEmd}_{G5}$$  $${\rm NetEmd}_{E4}$$  $${\rm NetEmd}_{S}$$  $${\rm NetEmd}_{\rm DD}$$  Best other  $$RG_1$$  0.989  0.995  0.997  0.993  0.992  0.957  0.996     $$\pm$$ 0.008  $$\pm$$ 0.005  $$\pm$$ 0.003  $$\pm$$ 0.004  $$\pm$$ 0.007  $$\pm$$ 0.024  $$\pm$$ 0.005 (GCD73)  $$RG_2$$  0.982  0.987  0.988  0.983  0.992  0.944  0.976(GCD73)  $$RG_3$$  0.940  0.941  0.925  0.947  0.972  0.902  0.872(GCD11)  RWN  0.952  0.950  0.942  0.933  0.933  0.907  0.906(GCD73)  Onnela et al.  0.892  0.898  0.890  0.892  0.858  0.867  0.832($${\rm Netdis}_{\rm ER}$$)  AS-733  0.808  0.874  0.874  0.922  0.855  0.928  0.933($${\rm Netdis}_{\rm SF}$$)  AS-caida  0.898  0.892  0.890  0.820  0.780  0.821  0.897(GCD11)  World Trade  0.697  0.785  0.821  0.665  0.430  0.358  0.666($${\rm Netdis}_{\rm ER}$$)  Given that the complexity of enumerating graphlets up to size $$s$$ in a network on $$N$$ nodes having maximum degree $$k_{\rm max}$$ is $$O(Nk_{\max}^{s-1})$$, $${\rm NetEmd}_{G4}$$ offers an optimal combination of performance and computational efficiency in most cases. The even less computationally costly $${\rm NetEmd}_{G3}$$ scales favourably even to networks of billions of edges for which enumerating graphlets of size 4 can be computationally prohibitive. This opens the door for comparing very large networks which are outside the reach of current methods while still retaining state of the art performance. Furthermore, $${\rm NetEmd}$$ behaves well under the bootstrap which samples nodes and uses the only the graphlet degrees of these nodes as inputs; see for example [5, 20–22]. Sub-sampling can be leveraged to further improve computational efficiency of $${\rm NetEmd}$$ (see Appendix D). We find that in some cases restricting the set of inputs actually leads to an increase in the performance of $${\rm NetEmd}$$. This indicates that not all graphlet distributions are equally informative in all settings [35]. Consequently, identifying (learning) which graphlet distributions contain the most pertinent information for a given task might lead to significant improvements in performance. Such generalizations can be incorporated into $${\rm NetEmd}$$ in a straightforward manner, for instance by modifying the sum in Equation (2.3) to incorporate weights. $${\rm NetEmd}$$ is ideally suited for such metric learning [36] type generalizations since it constructs an individual distance for each graphlet distribution. Moreover, such single feature $${\rm NetEmd}$$ measures are in many cases highly informative even on their own. For instance $${\rm NetEmd}_{\rm DD}$$, which only uses the degree distribution, outperforms the non-$${\rm NetEmd}$$ measures we tested individually on more than half the data sets we considered. We also considered counts of graphlets up to size 4 in 1-step ego networks of nodes ($${\rm NetEmd}_{E4}$$) [4] as an alternative way of capturing subgraph distributions, for which we denote the measure as $${\rm NetEmd}_{E4}$$. Although we find that $${\rm NetEmd}_{E4}$$ achieves consistently high scores variants based on graphlet degree distributions tend to perform better on most data sets. Finally, we consider spectral distributions of graphs as a possible alternative to graphlet based features. The spectra of various graph operators are closely related to topological properties of graphs [37–39] and have been widely used to characterize and compare graphs [2, 40]. We used the spectra of the graph Laplacian and normalized graph Laplacian as inputs for $${\rm NetEmd}$$ for which we denote the measure as $${\rm NetEmd}_S$$. For a given graph the Laplacian is defined as $$L=D-A$$ where $$A$$ is the adjacency matrix of the graph and $$D$$ is the diagonal matrix whose diagonal entries are the node degrees. The normalized Laplacian $$\hat{L}$$ is defined as $$D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$$. Given the eigenvalue distributions $$S(L)$$ and $$S(\hat{L})$$ of $$L$$ and $$\hat{L}$$ we define $${\rm NetEmd}_S$$ to be $$\frac{1}{2}({\rm NetEmd}_{S(L)}+{\rm NetEmd}_{S(\hat{L})})$$. We find that in general $${\rm NetEmd}_S$$ performs better in clustering random graphs of different sizes and densities when compared to graphlet based network comparison measures. However, on the RWN and Onnela et al. data sets graphlet based $${\rm NetEmd}$$ measures tend to perform better than the spectral variant which can be attributed to the prevalence of network motifs in real world networks, giving graphlet based measures an advantage. The spectral variant is also outperformed on the time ordering of data sets which in turn might be a result of the sensitivity of graph spectra to small changes in the underlying graph [2]. 3.4 Functional classification of networks One of the primary motivations in studying the structure of networks is to identify topological features that can be related to the function of a network. In the context of network comparison this translates into the problem of finding metrics that can identify functionally similar networks based on their topological structure. In order to test whether $${\rm NetEmd}$$ can be used to identify functionally similar networks, we use several benchmarks from the machine learning literature where graph similarity measures, called graph kernels, have been intensively studied over the past decade. In the context of machine learning the goal is to construct classifiers that can accurately predict the class membership of unknown graphs. We test $${\rm NetEmd}$$ on benchmark data sets representing social networks [41] consisting of Reddit posts, scientific collaborations and ego networks in the Internet Movie Database (IMDB). The Reddit data sets Reddit-Binary, Reddit-Multi-5k and Reddit-Multi-12k consist of networks representing Reddit treads where nodes correspond to users and two users are connected whenever one responded to the other’s comments. While for the Reddit-Binary data sets the task is to classify networks into discussion based and question/answer based communities, in the data sets Reddit-Multi-5k and Reddit-Multi-12k the task is to classify networks according to their subreddit categories. COLLAB is a data set consisting of ego-networks of scientists from the fields High Energy Physics, Condensed Matter Physics and Astro Physics and the task is to determine which of these fields a given researcher belongs to. Similarly, the data sets IMDB-Binary and IMDB-Multi represent collaborations between film actors derived from the IMDB and the task is to classify ego-networks into different genres, that is action and romance in the case of IMDB-Binary and comedy, action and Sci-Fi genres in the case of IMDB-Multi. We use C-support vector machine (C-SVM) [42] classifiers with a Gaussian kernel $$K(G,G')=\exp(-\frac{{\rm NetEmd}(G,G')^2}{2\alpha^2})$$, where $$\alpha$$ is a free parameter to be learned during training. Performance evaluation is carried out by 10-fold cross validation, where at each step of the validation 9 folds are used for training and 1 fold for evaluation. Free parameters of classifiers are learned via 10-fold cross validation on the training data only. Finally, every experiment is repeated 10 fold, and average prediction accuracy and standard deviation are reported. Table 2 gives classification accuracies obtained using $${\rm NetEmd}$$ measures based on graphlets up to size five ($${\rm NetEmd}_{G5}$$) and spectra of Laplacian operators ($${\rm NetEmd}_{S}$$) on the data sets representing social networks. We compare $${\rm NetEmd}$$ based kernels to graphlet kernels [43] and deep graphlet kernels [41] as well as two non-SVM classifiers namely the random forest (RF) classifier introduced in [44] and the convolutional neural network based classifier introduced in [45]. Short descriptions of the classifiers is given in the Appendix (Section F.2). Table 2 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}$$ measures using the distributions of graphlets up to size 5 ($${\rm NetEmd}_{G5}$$) and Laplacian spectra ($${\rm NetEmd}_S$$) and other graph kernels, namely the deep graphlet kernels (DGK)[41] and the graphlet kernel (GK) [43]. We also consider alternatives to support vector machines classifiers, namely the random forest classifiers (RF) introduced in [44] and convolutional neural networks (PCSN) [45]. Values in bold correspond to significantly higher scores, which are scores with t-test p-values less than 0.05 when compared to the highest score Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Table 2 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}$$ measures using the distributions of graphlets up to size 5 ($${\rm NetEmd}_{G5}$$) and Laplacian spectra ($${\rm NetEmd}_S$$) and other graph kernels, namely the deep graphlet kernels (DGK)[41] and the graphlet kernel (GK) [43]. We also consider alternatives to support vector machines classifiers, namely the random forest classifiers (RF) introduced in [44] and convolutional neural networks (PCSN) [45]. Values in bold correspond to significantly higher scores, which are scores with t-test p-values less than 0.05 when compared to the highest score Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  Kernel  Reddit-Binary  Reddit-Multi-5k  Reddit-Multi-12k  COLLAB  IMDB-Binary  IMDB-Multi  $${\rm NetEmd}_{G5}$$  92.67 $$\pm$$ 0.30  54.61 $$\pm$$ 0.18  48.09 $$\pm$$ 0.21  79.32 $$\pm$$ 0.27  66.99 $$\pm$$ 1.19  41.45 $$\pm$$ 0.70  $${\rm NetEmd}_S$$  88.59 $$\pm$$ 0.35  53.05 $$\pm$$ 0.34  44.45 $$\pm$$ 0.18  79.05 $$\pm$$ 0.20  71.68 $$\pm$$ 0.88  46.06 $$\pm$$ 0.50  DGK  78.04 $$\pm$$ 0.39  41.27 $$\pm$$ 0.18  32.22 $$\pm$$ 0.10  73.09 $$\pm$$ 0.25  66.96 $$\pm$$ 0.56  44.55 $$\pm$$ 0.52  GK  77.34 $$\pm$$ 0.18  41.01 $$\pm$$ 0.17  31.82 $$\pm$$ 0.08  72.84 $$\pm$$ 0.28  65.87 $$\pm$$ 0.98  43.89 $$\pm$$ 0.38  RF  88.7 $$\pm$$ 1.99  50.9 $$\pm$$ 2.07  42.7 $$\pm$$ 1.28  76.5 $$\pm$$ 1.68  72.4 $$\pm$$ 4.68  47.8 $$\pm$$ 3.55  PCSN  86.30 $$\pm$$ 1.58  49.10 $$\pm$$ 0.70  41.32 $$\pm$$ 0.42  72.60 $$\pm$$ 2.15  71.00 $$\pm$$ 2.29  45.23 $$\pm$$ 2.84  On the Reddit data sets and the COLLAB data set, $${\rm NetEmd}_{G5}$$ significantly outperforms other state-of-the-art graph classifiers. On the other hand, we find that $${\rm NetEmd}_{G5}$$ performs poorly on the IMDB data sets. This can be traced back to the large number of complete graphs present in the IMDB data sets: 139 out of the 1000 graphs in IMDB-Binary and 789 out of 1500 graphs in IMDB-Multi are complete graphs which mainly correspond to ego-networks of actors having acted only in a single film. By definition, $${\rm NetEmd}_{G5}$$ cannot distinguish between complete graphs of different sizes since all graphlet degree distributions are concentrated on a single value in complete graphs. The spectral variant $${\rm NetEmd}_S$$ is not affected by this and we find that $${\rm NetEmd}_S$$ is either on par with or outperforms the other non-$${\rm NetEmd}$$ graph classifiers on all six data sets. We also tested $${\rm NetEmd}$$ on benchmark data sets representing chemical compounds and protein structures. Unlike the social network data sets, in these data sets nodes and edges are labeled to reflect domain specific knowledge such as atomic number, amino acid type and bond type. Although $${\rm NetEmd}$$, in contrast to the other graph kernels, does not rely on domain specific knowledge in the form of node or edge labels, we found that $${\rm NetEmd}$$ outperforms many of the considered graph kernels coming only second to the Weisfeiler–Lehman [46] type kernels in terms of overall performance (see Appendix E). 4. Discussion Starting from basic principles, we have introduced a general network comparison methodology, $${\rm NetEmd}$$, that is aimed at capturing common generating processes in networks. We tested $${\rm NetEmd}$$ in a large variety of experimental settings and found that $${\rm NetEmd}$$ successfully identifies similar networks at multiple scales even when networks differ significantly in terms of size and density, generally outperforming other graphlet based network comparison measures. Even when based only on graphlets up to size 3 (i.e. edges, 2-paths and triangles), $${\rm NetEmd}$$ has performance comparable to the state of the art, making $${\rm NetEmd}$$ feasible even for networks containing billions of edges and nodes. By exploring machine learning applications we showed that $${\rm NetEmd}$$ captures topological similarity in a way that relates to the function of networks and outperforms state-of-the art graph classifiers on several graph classification benchmarks. Although we only considered variants of $${\rm NetEmd}$$ that are based on distributions of graphlets and spectra of Laplacian operators in this article, $${\rm NetEmd}$$ can also be applied to other graph features in a straightforward manner. For instance, distributions of paths and centrality measures might capture larger scale properties of networks and their inclusion into $${\rm NetEmd}$$ might lead to a more refined measure. Data availability The source code for $${\rm NetEmd}$$ is freely available at: http://opig.stats.ox.ac.uk/resources Funding EPSRC (grant EP/K032402/1 to A.W, G.R, C.D and R.G); and EPSRC (grants EP/G037280/1 and EP/L016044/1 to C.D). Colciencias through (grant 568 to L.O.); COST Action (CA15109 to R.G.) and Dame Kathleen Ollerenshaw Research Fellowship; Alan Turing Institute (grant EP/NS10129/1 to C.D. and G.R.). Acknowledgements We thank Xiaochuan Xu and Martin O’Reilly for useful discussions. L.O. acknowledges the support of Colciencias. R.G. acknowledges support from the COST Action. C.D. and G.R. acknowledge the support of the Alan Turing Institute. Appendix A. Implementation A.1 Graphlet distributions. In the main article, both the graphlet degree distribution and graphlet counts in 1-step ego networks were used as inputs for $${\rm NetEmd}$$. Graphlet degree distributions. The graphlet degree [11] of a node specifies the number of graphlets (small induced subgraphs) of a certain type the node appears in, while distinguishing between different positions the node can have in a graphlet. Different positions within a graphlet correspond to the orbits of the automorphism group of the graphlet. Among graphs on two to four nodes, there are 9 possible graphs and 15 possible orbits. Among graphs on two to five nodes there are 30 possible graphs and 73 possible orbits. Graphlet distributions based on ego-networks. Another way of obtaining graphlet distributions is to consider graphlet counts in ego-networks [4]. The $$k$$-step ego-network of a node $$i$$ is defined as the subgraph induced on all the nodes that can be reached from $$i$$ (including $$i$$) in less than $$k$$ steps. For a given $$k$$, the distribution of a graphlet $$m$$ in a network $$G$$ is then simply obtained by counting the occurrence of $$m$$ as an induced subgraph in the $$k$$-step ego-networks of each individual node. A.2 Step-wise implementation In this article, for integer valued network features such as graphlet based distributions, we base our implementation on the probability distribution that corresponds to the histogram of feature $$t$$ with bin width 1 as $$p_t(G)$$. $${\rm NetEmd}$$ can also be defined on the basis of discrete empirical distributions, that is distributions consisting of point masses (see Section C). Here we summarise the calculation of the $${\rm NetEmd}_T(G,G')$$ distance between networks $$G$$ and $$G'$$ (with $$N$$ and $$N'$$ nodes respectively), based on the comparison of the set of local network features $$T=\{t_1,\ldots,t_m\}$$ of graphlet degrees corresponding to graphlets up to size $$k$$. (1) First one computes the graphlet degree sequences corresponding to graphlets up to size $$k$$ for networks $$G$$ and $$G'$$. This can be done efficiently using the algorithm ORCA [47]. For the graphlet degree $$t_1$$ compute a histogram across all $$N$$ nodes of $$G$$ having bins of width 1 of which the centers are at their respective values. This histogram is then normalized to have total mass 1. We then interpret the histogram as the (piecewise continuous) probability density function of a random variable. This probability density function is denoted by $$p_{t_1}(G)$$. The standard deviation of $$p_{t_1}(G)$$ is then computed, and is used to rescale the distribution so that it has variance 1. This distribution is denoted by $$\widetilde{p_{t_1}(G)}$$. (2) Repeat the above step for network $$G'$$, and denote the resulting distribution by $$\widetilde{p_{t_1}(G')}$$. Now compute   \begin{equation*} {\rm NetEmd}_{t_1}^*(G,G')=\mathrm{inf}_{c\in\mathbb{R}}\big({\rm EMD}\big(\widetilde{p_{t_1}(G)}(\cdot+c),\widetilde{p_{t_1}(G')}(\cdot)\big)\big). \end{equation*} In practice, this minimisation over $$c$$ is computed using a suitable optimization algorithm. In our implementation we use the Brent–Dekker algorithm [48] with an error tolerance of 0.00001 and with the number of iterations upper bounded by 150. (3) Repeat the above two steps for the network features $$t_2,\ldots,t_m$$ and compute   \begin{equation*} {\rm NetEmd}_T(G,G')=\frac{1}{m}\sum_{j=1}^m {\rm NetEmd}_{t_j}^*(G,G'). \end{equation*} A.3 Example: $${\rm EMD}^*$$ for Gaussian distributions Suppose that $$p$$ and $$q$$ are $$N(\mu_1,\sigma_1^2)$$ and $$N(\mu_2,\sigma_2^2)$$ distributions, respectively. Then   \begin{align*} {\rm EMD}^*(p,q) &=\inf_{c\in\mathbb{R}}\Big({\rm EMD}\big(\tilde{p}(\cdot+c),\tilde{q}(\cdot)\big)\Big)\\ &={\rm EMD} \left(\tilde{p}\left(\cdot-\frac{\mu_1}{\sigma_1}+\frac{\mu_2}{\sigma_2}\right),\tilde{q}(\cdot)\right)\\ &={\rm EMD}\big(\tilde{q}(\cdot),\tilde{q}(\cdot)\big)=0. \end{align*} Here we used that if $$X\sim N(\mu_1,\sigma_1^2)$$ and $$Y\sim N(\mu_2,\sigma_2^2)$$, then $$\frac{X}{\sigma_1}+c\sim N(\frac{\mu_1}{\sigma_1}-c,1)$$ and $$\frac{Y}{\sigma_2}\sim N(\frac{\mu_2}{\sigma_2},1)$$, and these two distributions are equal if $$c=\frac{\mu_1}{\sigma_1}-\frac{\mu_2}{\sigma_2}$$. A.4 Spectral NetEmd When using spectra of graph operators, which take real values instead of the integer values one has in the case of graphlet distributions, we use the empirical distribution consisting of point masses for computing $${\rm NetEmd}$$. For more details see Section C of this appendix. A.5 Computational complexity The computational complexity of graphlet based comparison methods is dominated by the complexity of enumerating graphlets. For a network of size $$N$$ and maximum degree $$d$$, enumerating all connected graphlets up to size $$m$$ has complexity $$O(Nd^{m-1})$$, while counting all graphlets up to size $$m$$ in all $$k$$-step ego-networks has complexity $$O(Nd^{k+m-1})$$. Because most real world networks are sparse, graphlet enumeration algorithms tends to scale more favourably in practice than the worst case upper bounds given above. In the case of spectral measures, the most commonly used algorithms for computing the eigenvalue spectrum have complexity $$O(N^3)$$. Recent results show that the spectra of graph operators can be approximated efficiently in $$O(N^2)$$ time [49]. Given the distribution of a feature $$t$$, computing $${\rm EMD}_t^*(G,G')$$ has complexity $$O(k(s+s')\mathrm{log}(s+s'))$$, where $$s$$ and $$s'$$ are the number of different values $$t$$ takes in $$G$$ and $$G'$$ respectively and $$k$$ is the maximum number function calls of the optimization algorithm used to align the distributions. For node based features such as graphlet distributions, the worst case complexity is $$O(k(N(G)+N(G'))\mathrm{log}(N(G)+N(G')))$$, where $$N(G)$$ is the number of nodes of $$G$$, since the number of different values $$t$$ can take is bounded by the number of nodes. We use the combinatorial graphlet enumeration algorithm ORCA [47] for enumerating graphlets. Typical runtimes for enumerating all graphlets up to size 5 are can be found in Table A1 along with runtimes for $${\rm NetEmd}_{G5}$$ given the corresponding graphlet degree distributions. Enumerating graphlets of size 5 for networks of size around 1000 nodes tends to take several minutes allowing data sets consisting of several hundreds of such networks to be analysed on a single CPU in a matter of hours. Larger data sets containing networks of the order of 10000 nodes or larger are computationally challenging since enumerating graphlets of size 5 for such networks can take several hours [47]. Consequently, computing $${\rm NetEmd}_{G5}$$ on data sets such as $$RG_3$$ and $$Reddit$$-$$Multi$$-12$$k$$ are computationally challenging tasks that can take up to 3 days on a 24 CPU cluster. Table A1 Runtimes for enumerating graphlets up to size 5 using ORCA for Erdős–Rényi and a Barabási Albert random graphs and computation times for evaluating $${\rm NetEmd}_{G5}$$ given the graphlet degree distributions. For $${\rm NetEmd}_{G5}$$ the range $$t_{\min}$$–$$t_{\max}$$ obtained over all pairwise comparisons of all networks of a given size is shown. Experiments were performed on a single core of an Intel i7-6700HQ processor. ORCA is implemented in C++ and $${\rm NetEmd}_{G5}$$ is implemented in Python and Cython Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Table A1 Runtimes for enumerating graphlets up to size 5 using ORCA for Erdős–Rényi and a Barabási Albert random graphs and computation times for evaluating $${\rm NetEmd}_{G5}$$ given the graphlet degree distributions. For $${\rm NetEmd}_{G5}$$ the range $$t_{\min}$$–$$t_{\max}$$ obtained over all pairwise comparisons of all networks of a given size is shown. Experiments were performed on a single core of an Intel i7-6700HQ processor. ORCA is implemented in C++ and $${\rm NetEmd}_{G5}$$ is implemented in Python and Cython Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Number of nodes average degrees  $$N=250$$ (10, 20, 50, 100)  $$N=1000$$ (10, 20, 50, 100)  $$N=5000$$ (10, 20, 50, 100)  $$N=10000$$ (10, 20, 50, 100)  Erdős–Rényi model  ORCA  (0.02 s, 0.17 s, 3.10 s, 16.5 s)  (0.09 s, 0.65 s, 9.68 s, 96.6 s)  (0.40 s, 2.38 s, 40.7 s, 490 s)  (1.36 s, 7.21 s, 82 s, 858 s)  Barabási Albert  ORCA  (0.06 s, 0.37 s, 3.29 s, 34.7 s)  (0.38 s, 2.29 s, 25.9 s, 184 s)  (3.53 s, 15.4 s, 272 s, 2366 s)  (6.66 s, 57.5 s, 786 s, 5870 s)  $${\rm NetEmd}_{G5}$$  $$t_{\min}$$–$$t_{\max}$$  0.27 s–0.72 s  0.66 s–2.53 s  1.63 s–10.5 s  2.58 s–18.0 s  Appendix B. Proof that NetEmd is a distance measure We begin by stating a definition. A pseudometric on a set $$X$$ is a non-negative real-valued function $$d:X\times X\rightarrow [0,\infty)$$ such that, for all $$x,y,z\in X$$, (1) $$d(x,x)= 0$$; (2) $$d(x,y)=d(y,x)$$ (symmetry); (3) $$d(x,z)\leq d(x,y)+d(y,z)$$ (triangle inequality). If Condition 1 is replaced by the condition that $$d(x,y)=0\iff x=y$$ then $$d$$ defines a metric. Note that this requirement can only be satisfied by a network comparison measure that is based on a complete set of graph invariants and hence network comparison measures in general will not satisfy this requirement. Proposition Let $$M$$ denote the space of all real-valued probability measures supported on $$\mathbb{R}$$ with finite, non-zero variance. Then the $${\rm EMD}^*$$ distance between probability measures, $$\mu_X$$ and $$\mu_Y$$ in $$M$$ defined by   \begin{equation*}{\rm EMD}^*(\mu_X,\mu_Y)=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_Y(\cdot+c)), \end{equation*} defines a pseudometric on the space of probability measures $$M$$. Proof. We first note that if $$\mu_X\in M$$ then $$\tilde{\mu}_X(\cdot+c)\in M$$ for any $$c\in\mathbb{R}$$. Let us now verify that $${\rm EMD}^*$$ satisfies all properties of a pseudometric. Clearly, for any $$\mu_X\in M$$, we have $$0\leq {\rm EMD}^*(\mu_X,\mu_X)\leq {\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_X(\cdot))=0$$, and so $${\rm EMD}^*(\mu_X,\mu_X)=0$$. Symmetry holds, since for, any $$\mu_X$$ and $$\mu_Y$$ in $$M$$,   \begin{align*}{\rm EMD}^*(\mu_X,\mu_Y)&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot),\tilde{\mu}_Y(\cdot+c))\\&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+c),\tilde{\mu}_X(\cdot))\\ &=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot),\tilde{\mu}_X(\cdot+c))\\&={\rm EMD}^*(\mu_Y,\mu_X). \end{align*} Finally, we verify that $${\rm EMD}^*$$ satisfies the triangle inequality. Suppose $$\mu_X$$, $$\mu_Y$$ and $$\mu_Z$$ are probability measures from the space $$M$$, then so are $$\tilde{\mu}_X(\cdot+a)$$, $$\tilde{\mu}_Y(\cdot+b)$$ for any $$a,b\in\mathbb{R}$$. Since EMD satisfies the triangle inequality, we have, for any $$a,b\in\mathbb{R}$$,   \begin{equation*}\label{emdeqn1}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Y(\cdot+b))\leq {\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot)). \end{equation*} Since the above inequality holds for all $$a,b\in\mathbb{R}$$, we have that   \begin{align*} {\rm EMD}^*(\mu_X,\mu_Y)&=\inf_{c\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+c),\tilde{\mu}_Y(\cdot))\\ &=\inf_{a,b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Y(\cdot+b)) \\ &\leq \inf_{a,b\in\mathbb{R}}\big[{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\mu_Z(\cdot))\big] \\ &=\inf_{a\in\mathbb{R}}\big[{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+\inf_{b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot))\big] \\ &=\inf_{a\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_X(\cdot+a),\tilde{\mu}_Z(\cdot))+\inf_{b\in\mathbb{R}}{\rm EMD}(\tilde{\mu}_Y(\cdot+b),\tilde{\mu}_Z(\cdot))\\ &={\rm EMD}^*(\mu_X,\mu_Z)+{\rm EMD}^*(\mu_Y,\mu_Z), \end{align*} as required. We have thus verified that $${\rm EMD}^*$$ satisfies all properties of a pseudometric. □ Appendix C. Generalization of $${\rm EMD}^*$$ to point masses Although in the case of graphlet based features we based our implementation of $${\rm NetEmd}$$ on probability distribution functions that correspond to normalized histograms having bin width 1 $${\rm NetEmd}$$ can also be based on empirical distributions consisting of collections of point masses located at the observed values. The definition of $${\rm EMD}^*$$ can be generalized to include distributions of zero variance, that is unit point masses. Mathematically, the distribution of a point mass at $$x_0$$ is given by the Dirac measure $$\delta_x(x_0)$$. Such distributions are frequently encountered in practice since some graphlets do not occur in certain networks. First, we note that unit point masses are always mapped onto unit point masses under rescaling operations. Moreover, for a unit point mass $$\delta_x(x_0)$$ we have that $$\mathrm{inf}_{c\in\mathbb{R}}({\rm EMD}(\tilde{p}(\cdot+c),\delta_x(x_0)))$$$$=\mathrm{inf}_{c\in\mathbb{R}}\left({\rm EMD}(\tilde{p}(\cdot+c),\delta_x(kx_0))\right)$$ for all $$p\in M$$ and $$k>0$$. Consequently, $${\rm EMD}^*$$ can be generalized to include unit point masses in a consistent fashion by always rescaling them by 1:   \begin{equation*}\label{emdmet2} {\rm EMD}^*(p,q)=\mathrm{inf}_{c\in\mathbb{R}}\big({\rm EMD}(\hat{p}(\cdot+c),\hat{q})\big), \end{equation*} where $$\hat{p}=\tilde{p}$$ (as in Eq. 2.1) if $$p$$ has a non-zero variance, and $$\hat{p}=p$$ if $$p$$ has variance zero. Appendix D. Sub-sampling $${\rm NetEmd}$$ is well suited for network sub-sampling [5, 21, 22]. In the case of graphlets the sub-sampling procedure consist of sampling nodes and using the graphlet degrees corresponding to these nodes only as inputs for $${\rm NetEmd}$$. Figure D1 shows the $$\overline{P}$$ scores for variants of $${\rm NetEmd}$$ on a set of synthetic networks and the Onnela et al. data set. We find that the performance of $${\rm NetEmd}$$ is stable under sub-sampling and that in general using a sample of only $$10\%$$ of the nodes produces results comparable to the case where all nodes are used. Fig. D1. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_3$$ ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) according to $${\rm NetEmd}_{G5}$$ and next best performing measure GCD11, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree, and node count. Although we observe some degree of off diagonal mixing, the heatmap of $${\rm NetEmd}$$ still shows 8 diagonal blocks corresponding to different generative models in contrast to the heat map of GCD11. Fig. D1. View largeDownload slide (a) and (b) show the heatmaps of pairwise distances on $$RG_3$$ ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) according to $${\rm NetEmd}_{G5}$$ and next best performing measure GCD11, respectively. In the heat map, networks are ordered from top to bottom in the following order: model, average degree, and node count. Although we observe some degree of off diagonal mixing, the heatmap of $${\rm NetEmd}$$ still shows 8 diagonal blocks corresponding to different generative models in contrast to the heat map of GCD11. Fig. D2. View largeDownload slide The $$\overline{P}$$ values for different variants of $${\rm NetEmd}$$ under sub-sampling for (a) a set of 80 synthetic networks coming from eight different random graph models with 2500 nodes and average degree 20, (b) for the Onnela et al. data set showing the average and standard deviation over 50 experiments for each sampled proportion. Note that the performance of $${\rm NetEmd}$$ under sub-sampling is remarkably stable and is close to optimal even when only $$10\%$$ of nodes are sampled. For synthetic networks, we find that the stability of $${\rm NetEmd}$$ increases as the size of the graphlets used in the input is increased. Fig. D2. View largeDownload slide The $$\overline{P}$$ values for different variants of $${\rm NetEmd}$$ under sub-sampling for (a) a set of 80 synthetic networks coming from eight different random graph models with 2500 nodes and average degree 20, (b) for the Onnela et al. data set showing the average and standard deviation over 50 experiments for each sampled proportion. Note that the performance of $${\rm NetEmd}$$ under sub-sampling is remarkably stable and is close to optimal even when only $$10\%$$ of nodes are sampled. For synthetic networks, we find that the stability of $${\rm NetEmd}$$ increases as the size of the graphlets used in the input is increased. Appendix E. Results for data sets of chemical compounds and proteins We also tested $${\rm NetEmd}$$ on benchmark data sets representing chemical compounds (MUTAG, NCI1 and NCI109) and protein structures (ENZYMES and D&D). MUTAG [50] is a data set of 188 chemical compounds that are labelled according to their mutagenic effect on Salmonella typhimurium. NCI1 and NCI109 represent sets of chemical compounds which are labelled for their activity against non-small cell lung cancer and ovarian cancer cell lines, respectively [15]. Nodes and edges in MUTAG, NCI1 and NCI109 are labelled by atomic number and bound type, respectively. ENZYMES and D&D [51] consist of networks representing protein structures at the level of tertiary structure and amino acids, respectively. While networks in ENZYMES are classified into six different enzyme classes, networks in D&D are classified according to whether or not they correspond to an enzyme. Nodes in ENZYMES are labelled according to structural element type and according to amino acid types in D&D. Classification accuracies obtained using $${\rm NetEmd}$$ on the data sets of chemical compounds and protein structures are given in Table E1, along with results for other graph kernels reported in [46]. For a detailed description of these kernels we refer to [46] and the references therein. Note that, in contrast to all other kernels in Table E1, $${\rm NetEmd}$$ does not use any domain specific knowledge in the form of node or edge labels. Node and edge labels are highly informative for all five classification tasks—as shown in [52]. Table E1 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}_{G5}$$ and $${\rm NetEmd}_S$$ and other kernels reported in [46] Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Table E1 10-fold cross validation accuracies of Gaussian kernels based on $${\rm NetEmd}_{G5}$$ and $${\rm NetEmd}_S$$ and other kernels reported in [46] Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  Kernel  MUTAG  NCI1  NCI109  ENZYMES  D & D  $${\rm NetEmd}_{G5}$$  83.71 $$\pm$$ 1.16  78.59 $$\pm$$ 0.28  76.71 $$\pm$$ 0.34  46.55 $$\pm$$ 1.25  78.01 $$\pm$$ 0.38  $${\rm NetEmd}_{S}$$  83.30 $$\pm$$ 1.20  77.36 $$\pm$$ 0.38  76.14 $$\pm$$ 0.27  42.75 $$\pm$$ 0.78  76.74 $$\pm$$ 0.43  WL subtree  82.05 $$\pm$$ 0.36  82.19 $$\pm$$ 0.18  82.46 $$\pm$$ 0.24  52.22 $$\pm$$ 1.26  79.78 $$\pm$$ 0.36  WL edge  81.06 $$\pm$$ 1.95  84.37 $$\pm$$ 0.30  84.49 $$\pm$$ 0.20  53.17 $$\pm$$ 2.04  77.95 $$\pm$$ 0.70  WL shortest path  83.78 $$\pm$$ 1.46  84.55 $$\pm$$ 0.36  83.53 $$\pm$$ 0.30  59.05 $$\pm$$ 1.05  79.43 $$\pm$$ 0.55  Ramon & Gärtner  85.72 $$\pm$$ 0.49  61.86 $$\pm$$ 0.27  61.67 $$\pm$$ 0.21  13.35 $$\pm$$ 0.87  57.27 $$\pm$$ 0.07  p-random walk  79.19 $$\pm$$ 1.09  58.66 $$\pm$$ 0.28  58.36 $$\pm$$ 0.94  27.67 $$\pm$$ 0.95  66.64 $$\pm$$ 0.83  Random walk  80.72 $$\pm$$ 0.38  64.34 $$\pm$$ 0.27  63.51 $$\pm$$ 0.18  21.68 $$\pm$$ 0.94  71.70 $$\pm$$ 0.47  Graphlet count  75.61 $$\pm$$ 0.49  66.00 $$\pm$$ 0.07  66.59 $$\pm$$ 0.08  32.70 $$\pm$$ 1.20  78.59 $$\pm$$ 0.12  Shortest path  87.28 $$\pm$$ 0.55  73.47 $$\pm$$ 0.11  73.07 $$\pm$$ 0.11  41.68 $$\pm$$ 1.79  78.45 $$\pm$$ 0.26  On MUTAG, $${\rm NetEmd}$$ achieves an accuracy that is comparable to the Weisfeiler–Lehman (WL) shortest path kernel, but is outperformed by the shortest path kernel and the kernel by Ramon & Gärtner. While on NCI1, NCI109 and ENZYMES, $${\rm NetEmd}$$ is outperformed only by WL kernels, on D&D $${\rm NetEmd}$$ achieves a classification accuracy that is comparable to the best performing kernels. Notably, on D&D $${\rm NetEmd}$$ also outperforms the vector model by Dobson and Doig [53] (classification accuracy: 76.86 $$\pm$$ 1.23) which is based on 52 physical and chemical features without using domain specific knowledge, that is solely based on graph topology. Appendix F. Graph classifiers F.1 Implementation of C-SVMs Following the procedure in [46] we use 10-fold cross validation with a C-SVM [42] to test classification performance. We use the python package scikit-learn [54] which is based on libsvm [55]. The $$C-value$$ of the C-SVM and the $$\alpha$$ for the Gaussian kernel is tuned independently for each fold using training data from that fold only. We consider values $$\{2^{-7},2^{-6},...,2^7\}$$ for $$C$$ and the $$\{2^{-7},2^{-6},...,2^7\}$$ multiples of the median of the distance matrix for $$\alpha$$. Each experiment is repeated 10 times, and average prediction accuracies and their standard deviations are reported. We also note that the Gaussian NetEmd kernel is not necessarily positive semidefinite for all values of $$\alpha$$ [56]. The implication is that the C-SVM might converge to a stationary point that is not always guaranteed to be a global optimum. Although there exist alternative algorithms [57] for training C-SVMs with indefinite kernels which might result in better classification accuracy, here we chose to use the standard libsvm-algorithm in order to ensure a fair comparison between kernels. For a discussion of support vector machines with indefinite kernels see [58]. F.2 Other kernels and classifiers F.2.1 Graph kernels Graph kernels are in general based on counting various types substructures in graphs such as walks (p-random walk, random walk), paths (Shortest path) and sub-trees (Ramon & Gärtner). Given the count vectors of the substructures of interest the corresponding kernel is defined as their inner product. F.2.2 The graphlet kernel and the deep graphlet kernel The graphlet kernel and the deep graphlet kernel use the normalized counts of graphlets on $$k$$ nodes, including disconnected graphlets, as features. The deep graphlet kernel involves the additional step of learning similarities between graphlets based on the edit distance. In the case of the social network data sets $$k=7$$. F.2.3 Weisfeiler Lehman kernels The Weisfeiler Lehman (WL) kernels are based on the Weisfeiler Lehman label propagation procedure in which a node is iteratively labeled by the labels of its neighbours. Given a base kernel the corresponding WL kernel is obtained by combining the base kernel over several iterations of the WL procedure. The WL-subtree, WL-edge and WL-shortest path kernels have base kernels that use the following counts as features: node labels, labelled edges, and shortest paths with labelled end points, respectively. F.2.4 The random forest classifier The random forest classifier in [44] is based on the following features: number of nodes, number of edges, average degree, degree assortativity, number of triangles and the global clustering coefficient. F.2.5 PCSN The convolutional neural network classifier of Niepert et al. is based on learning feature representations of graphs that correspond to locally connected regions of networks. The approach first identifies a sequence of nodes for which neighbourhood graphs are created and then maps these onto a vector space representation. Appendix G. Detailed description of data sets and models G.1 Synthetic networks and random graph models $$RG_1$$ consists of 16 sub data sets corresponding to combinations of $$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$ containing 10 realizations for each model so that each of the 16 sub data set contains 80 networks. In $$RG_2$$ the size $$N$$ and average degree $$k$$ are increased independently in linear steps to twice their initial value ($$N\in\{2000,3000,4000\}$$ and $$k\in\{20,24,28,32,36,40\}$$) and contains 10 realizations per model parameter combination, resulting in a data set of $$3\times6\times8\times10=1440$$ networks. In $$RG_3$$ the size $$N$$ and average degree $$k$$ are increased independently in multiples of 2 to 8 times their initial value ($$N\in\{1250,2500,5000,10000\}$$ and $$k\in\{10,20,40,80\}$$) and again contains 10 realizations per model parameter combination, resulting in a data set of $$4\times4\times8\times10=1280$$ networks. The models are as follows. G.1.1 The Erdős–Rényi model We consider the Erdős–Rényi (ER) model [25] $$G(N,m)$$ where $$N$$ is the number of nodes and $$m$$ is the number of edges. The edges are chosen uniformly at random without replacement from the $$\binom{N}{2}$$ possible edges. G.1.2 The configuration model Given a graphical degree sequence, the configuration model creates a random graph that is drawn uniformly at random from the space of all graphs with the given degree sequence. The degree sequence of the configuration models used in the article is taken to be degree sequence of a duplication divergence model that has the desired average degree. G.1.3 The Barabási Albert preferential attachment model In the Barabási–Albert model [16], a network is generated starting from a small initial network to which nodes of degree $$m$$ are added iteratively and the probability of connecting the new node to an existing node is proportional to the degree of the existing node. G.1.4 Geometric random graphs Geometric random graphs [59] are constructed under the assumption that the nodes in the network are embedded into a $$D$$ dimensional space, and the presence of an edge depends only on the distance between the nodes and a given threshold $$r$$. The model is constructed by placing $$N$$ nodes uniformly at random in an $$D$$-dimensional square $$[0,1]^D$$. Then edges are placed between any pair of nodes for which the distance between them is less or equal to the threshold $$r$$. We use $$D=3$$ and set $$r$$ to be the threshold that results in a network with the desired average degree, while the distance is the Euclidean distance. G.1.5 The geometric gene duplication model The geometric gene duplication model is a geometric model [27] in which the nodes are distributed in three-dimensional Euclidean space $$\mathbb{R}^3$$ according to the following rule. Starting from an small initial set of nodes in three dimensions, at each step a randomly chosen node is selected and a new node is placed at random within a Euclidean distance $$d$$ of this node. The process is repeated until the desired number of nodes is reached. Nodes within a certain distance $$r$$ are then connected. We fix $$r$$ to obtain the desired average degree. G.1.6 The duplication divergence model of Vázquez et al. The duplication divergence model of Vázquez et al. [24] is defined by the following growing rules: (1) Duplication: A node $$v_i$$ is randomly selected and duplicated ($$v_i'$$) along with all of its interactions. An edge between $$v_i$$ and $$v_i'$$ is placed with probability $$p$$. (2) Divergence: For each pair of duplicated edges $$\{(v_i,v_k);(v_i',v_k)\}$$; one of the duplicated edges is selected uniformly at random and then deleted with probability $$q$$. This process is followed until the desired number of nodes is reached. In our case we fix $$p$$ to be 0.05 and adjust $$q$$ through a grid search to obtain a network that on average has the desired average degree. G.1.7 The duplication divergence of Ispolatov et al The duplication divergence model of Ispolatov et al. [26] starts with an initial network consisting of a single edge and then at each step a random node is chosen for duplication and the duplicate is connected to each of the neighbours of its parent with probability $$p$$. We adjust $$p$$ to obtain networks that have on average the desired average degree. G.1.8 The Watts–Strogatz model The Watts–Strogatz model, [30] creates graphs that interpolate between regular graphs and ER graphs. The model starts with a ring of $$n$$ nodes in which each node is connected to its $$k$$-nearest neighbours in both directions of the ring. Each edges is rewired with probability $$p$$ to a node which is selected uniformly at random. While $$k$$ is adjusted to obtain networks having the desired average degree we take $$p$$ to be 0.05. G.2 Real world data sets Summary statistics of the data sets are given in Table G1. Table G1 Summary statistics of data sets $$N$$, $$E$$ and $$d$$ stand for the number of nodes, number of edges and edge density, respectively. Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Table G1 Summary statistics of data sets $$N$$, $$E$$ and $$d$$ stand for the number of nodes, number of edges and edge density, respectively. Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  Data set  #Networks  $$N_{\min}$$  Median ($$N$$)  $$N_{\max}$$  $$E_{\min}$$  Median ($$E$$)  $$E_{\max}$$  $$d_{\min}$$  Median ($$d$$)  $$d_{\max}$$  RWN  167  24  351  62586  76  2595  824617  7.55$$e-$$05  0.0163  0.625  Onnela et al.  151  30  918  11586  62  2436  232794  4.26$$e-$$5  0.0147  0.499  AS-caida  122  8020  22883  26475  18203  46290  53601  1.48$$e-$$4  1.78$$e-$$4  5.66$$e-$$4  AS-733  732  493  4180.5  6474  1234  8380.5  13895  6.63$$e-$$4  9.71$$e-$$4  1.01$$e-$$2  World Trade  53  156  195  242  5132  7675  18083  0.333  0.515  0.625  Networks  Reddit-Binary  2000  6  304.5  3782  4  379  4071  5.69$$e-$$4  8.25$$e-$$3  0.286  Reddit-Multi-5k  4999  22  374  3648  21  422  4783  6.55$$e-$$4  6.03$$e-$$3  0.091  Reddit-Multi-12k  11929  2  280  3782  1  323  5171  5.69$$e-$$4  8.27$$e-$$3  1.0  COLLAB  5000  32  52  492  60  654.5  40120  0.029  0.424  1.0  IMDB-Binary  1000  12  17  136  26  65  1249  0.095  0.462  1.0  IMDB-Multi  1500  7  10  89  12  36  1467  0.127  1.0  1.0  MUTAG  188  10  17.5  28  10  19  33  0.082  0.132  0.222  NCI1  4110  3  27  111  2  29  119  0.0192  0.0855  0.667  NCI109  4127  4  26  111  3  29  119  0.0192  0.0862  0.5  ENZYMES  600  2  32  125  1  60  149  0.0182  0.130  1.0  D&D  1178  30  241  5748  63  610.5  14267  8.64$$e-$$4  0.0207  0.2  G.2.1 Real world networks from different classes (RWN) We compiled a data set consisting of 10 different classes of real world networks: social networks, metabolic networks, protein interaction networks, protein structure networks, food webs, autonomous systems networks of the internet, world trade networks, airline networks, peer to peer file sharing networks and scientific collaboration networks. Although in some instances larger versions of these data sets are available, we restrict the maximum number of networks in a certain class to 20 by taking random samples of larger data sets in order to avoid scores being dominated by larger network classes. The class of social networks consists of 10 social networks from the Pajek data set which can be found at http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm (22 January 2018) (Networks: ’bkfrat’, ’bkham’, ’dolphins’, ’kaptailS1’, ’kaptailS2’, ’kaptailT1’, ’kaptailT2’, ’karate’, ’lesmis’, ’prison’) and a sample of 10 Facebook networks from [60] (Networks: ’Bucknell39’, ’Duke14’, ’Harvard1’, ’MU78’, ’Maine59’, ’Rice31’, ’UC61’, ’UCLA26’, ’UVA16’,’Yale4’). The class of metabolic networks consists of 20 networks taken from [61] (Networks: ’AB’, ’AG’, ’AP’, ’AT’, ’BS’, ’CE’, ’CT’, ’EF’, ’HI’, ’MG’, ’MJ’, ’ML’, ’NG’, ’OS’, ’PA’, ’PN’, ’RP’, ’TH’, ’TM’, ’YP’). The class of protein interaction networks consists of six networks from BIOGRID [62] (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae downloaded: October 2015) and 5 networks from HINT [63] (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Mus musculus (Version: June 1 2014)) and the protein interaction network of Escherichia coli by Rajagopala et al. [64]. The class of protein structure networks consists of a sample of 20 networks from the data set D&D (Networks: 20, 119, 231, 279, 335, 354, 355, 369, 386, 462, 523, 529, 597, 748, 833, 866, 990, 1043, 1113, 1157). The class of food webs consists of 20 food webs from the Pajek data set: http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm (22 January 2018) (Networks: ’ChesLower’, ’ChesMiddle’, ’ChesUpper’, ’Chesapeake’, ’CrystalC’, ’CrystalD’, ’Everglades’, ’Florida’, ’Michigan’, ’Mondego’, ’Narragan’, ’StMarks’, ’baydry’, ’baywet’, ’cypdry’, ’cypwet’, ’gramdry’, ’gramwet’, ’mangdry’, ’mangwet’). The class of internet networks consists of 10 randomly chosen networks from AS-733 [32] (Networks:’1997/11/12’, ’1997/12/28’, ’1998/01/01’, ’1998/06/06’, ’1998/08/13’, ’1998/12/04’, ’1999/03/30’, ’1999/04/17’, ’1999 /06/18’, ’1999/08/30’) and 10 randomly chosen networks from AS-caida [32] (Networks: ’2004/10/04’, ’2006/01/23’, ’2006/03/27’, ’2006/07/10’, ’2006/09/25’, ’2006/11/27’, ’2007/01/15’, ’2007/04/30’, ’2007/05/28’, ’2007/09/24’). Both data sets are from SNAP [65](June 1 2016). The class of world trade networks is a sample of 20 networks of the larger data set considered in [33, 34] (Networks: 1968, 1971, 1974, 1975, 1976, 1978, 1980, 1984, 1989, 1992, 1993, 1996, 1998, 2001, 2003, 2005, 2007, 2010, 2011, 2012). The airline networks were derived from the data available at: http://openflights.org/ (22 January 2018). For this we considered the 50 largest airlines from the database in terms of the number of destinations that the airline serves. For each airline a network is obtained by the considering all airports that are serviced by the airlines which are connected whenever there is direct flight between a pair of nodes. We then took a sample of 20 networks from this larger data set (Airline codes of the networks: ’AD’, ’AF’, ’AM’, ’BA’, ’DY’, ’FL’, ’FR’, ’JJ’, ’JL’, ’MH’, ’MU’, ’NH’, ’QF’, ’SU’, ’SV’, ’U2’, ’UA’, ’US’, ’VY’, ’ZH’). The class of peer to peer networks consist of nine networks of the Gnutella file sharing platform measured at different dates which are available at [65]. The scientific collaboration networks consists of five networks representing different scientific disciplines which were obtained from [65] (1 June 2015). G.2.2 Onnela et al. data set The Onnela et al. data set consists of all undirected and unweighted networks from the larger collection analysed in [31]. A complete list of networks and class membership can be found in the supplementary information of [4]. G.2.3 Time ordered data sets The data sets AS-caida and AS-733 each represent the internet measured at the level of autonomous systems at various points in time. Both data sets were downloaded from [65] (1 June 2015). The World Trade Networks data set is based on the data set [33] for the years 1962–2000 and on UN COMTRADE [34] for the years 2001-2015. Two countries are connected in the network whenever they import or export a commodity from a each other within the given calendar year. The complete data set was downloaded from : http://atlas.media.mit.edu/en/resources/data/ on 22 January 2018. G.2.4 Machine learning benchmarks A short description of the social networks datasets was given in the main text. A more detailed description can be found in [41]. The social network data sets were downloaded from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets on 22 January 2018. A short short description of the chemical compound and protein structure data sets was given in Section E. A more detailed description of the data set can be found in [46]. These data sets were downloaded from: https://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels.html on 22 January 2018. Appendix H. $$\overline{P}$$ and area under ROC We assume that our data set $$S=\{G_1,G_2,...,G_n\}$$ of networks consists of disjoint classes $$C=\{c_1,c_2,...,c_m\}$$. For a given distance measure $$d$$ and given a network $$G\in S$$ we want to calculate the empirical probability $$P(G)$$ that $$d(G,G_1)<d(G,G_2)$$ where $$G_1$$ is a randomly selected network from the same class as $$G$$ (excluding itself) and $$G_2$$ is a randomly selected network from outside the class of $$G$$. Given measure $$d$$, $$P(G)$$ is simply given by:   $$P(G)=\frac{1}{(|S|-|C(G)|)(|C(G)|-1)}\sum_{G_1\in C_G\setminus \{G\}} \sum_{G_2\in S\setminus C_G} I(d(G,G_1)<d(G,G_2)),\label{pg}$$ (H.1) where $$C_G$$ denotes the set of networks in the same class of $$G$$ and $$I$$ is the indicator function. Now we compare this to the area under ROC of the classifier $$kNN(G)$$ that classifies the k-nearest neighbours of $$G$$ to be in the same class a $$G$$. Let the $$k^{th}$$ nearest neighbour of $$G$$ according to $$d$$ be $$G_k$$. For a given $$k$$ the true positive (TP) and false positive (FP) rates can be written as:   $$\label{tp} {\rm TP}(k)=\frac{1}{(|C(G)|-1)}\sum_{G_1\in C_G\setminus \{G\}} I(d(G,G_1)\leq d(G,G_k)).$$ (H.2)  \begin{eqnarray} {\rm FP}(k)=\frac{1}{(|S|-|C(G)|)}\sum_{G_2\in S\setminus C_G} I(d(G,G_2)\leq d(G,G_k)). \end{eqnarray} (H.3) The ROC curve (i.e. $${\rm FP}(k)$$ vs $${\rm TP}(k)$$) is a step function as $${\rm FP}(k)$$ increases by $$\frac{1}{(|S|-|C(G)|)}$$ if $$G_k$$ is a false positive and $${\rm TP}(k)$$ increases its value by $$\frac{1}{(|C(G)|-1)}$$ if $$G_k$$ is a true positive. Consequently the area under the ROC curve is given by $$\sum_k \frac{1}{(|S|-|C(G)|)} {\rm TP}(k) I(G_k\in S\setminus C_G)$$. Substituting Eq. (H.2) into this equation yields Eq. (H.1) showing the equivalence of $$P(G)$$ and the area under the ROC curve of $$kNN(G)$$. H.1 Area under precision recall curve Given the relation of $$\overline{P}$$ to the area under ROC one can also define a performance measure analogous to $$\overline{P}$$ based on the area under precision-recall curve of $$kNN(G)$$. More specifically, we define $$\overline{\rm PR}=\frac{1}{|S|}\sum_{G}{\rm PR}(kNN(G))$$, where $${\rm PR}(kNN(G))$$ is the area under the precision recall curve of $$kNN(G)$$. Although $$\overline{{\rm PR}}$$ produces results that are consistent with $$\overline{P}$$ the area under the precision recall curve lacks the clear statistical interpretation of $$\overline{P}$$, especially in the multi-class setting [66], hence we choose $$\overline{P}$$ as our default measure of performance. $$\overline{{\rm PR}}$$ scores are given Table H1. Note that $${\rm NetEmd}$$ measures achieve the highest $$\overline{{\rm PR}}$$ score on all data sets. Table H1 $$\overline{{\rm PR}}$$ scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest $$\overline{{\rm PR}}$$ score (given in bold) on all data sets. For $$RG_1$$ we calculated the value of the $$\overline{{\rm PR}}$$ score for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{{\rm PR}}$$ values obtained over these 16 sub-data sets    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566  Table H1 $$\overline{{\rm PR}}$$ scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest $$\overline{{\rm PR}}$$ score (given in bold) on all data sets. For $$RG_1$$ we calculated the value of the $$\overline{{\rm PR}}$$ score for each of the 16 sub-data sets. The table shows the average and standard deviation of the $$\overline{{\rm PR}}$$ values obtained over these 16 sub-data sets    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.962 $$\pm$$ 0.024  0.926  0.792  0.741  0.687  $${\rm NetEmd}_{G4}$$  0.984 $$\pm$$ 0.015  0.951  0.810  0.752  0.728  $${\rm NetEmd}_{G5}$$  0.988 $$\pm$$ 0.011  0.957  0.784  0.744  0.714  $${\rm NetEmd}_{S}$$  0.980 $$\pm$$ 0.017  0.980  0.908  0.689  0.646  $${\rm NetEmd}_{E4}$$  0.973 $$\pm$$ 0.018  0.945  0.814  0.699  0.668  $${\rm NetEmd}_{DD}$$  0.830 $$\pm$$ 0.066  0.748  0.606  0.598  0.578  $${\rm Netdis}_{\rm ER}$$  0.946 $$\pm$$ 0.032  0.709  0.492  0.610  0.546  $${\rm Netdis}_{\rm SF}$$  0.965 $$\pm$$ 0.021  0.792  0.541  0.571  0.507  GCD11  0.976 $$\pm$$ 0.022  0.923  0.692  0.686  0.570  GCD73  0.986 $$\pm$$ 0.016  0.935  0.673  0.711  0.618  GGDA  0.908 $$\pm$$ 0.092  0.877  0.618  0.511  0.566  Appendix I. Performance evaluation via area under precision recall curve by Yaveroglu et al.? The area under precision recall curve (AUPRC) was used as a performance metric for network comparison measures by Yaveroglu et al. [5]. The AUPRC is based on a classifier that for a given distance threshold $$\epsilon$$ classifies pairs of networks to be similar whenever $$d(G,G')<\epsilon$$. A pair satisfying $$d(G,G')<\epsilon$$ is taken to be a true positive whenever $$G$$ and $$G'$$ are from the same class. The AUPRC is then defined to be the area under the precision recall curve obtained by varying $$\epsilon$$ in small increments. However, AUPRC is problematic, especially in settings where one has more than two classes and when classes are separated at different scales. Figure I1 gives three examples of metrics for a problem that has three classes: (a) shows a metric $$d_1$$ (AUPRC = 0.847) that clearly separates the 3-classes which, however, has a lower AUPRC than the metrics given in (b) (AUPRC = 0.902) which confuses half of Class-1 with Class-2 and (c) ( AUPRC = 0.896) which shows 2 rather than 3 classes. The colour scale in the figure represents the magnitude of a comparison between a pair of individuals according to the corresponding metric. Fig. I1. View largeDownload slide Heat maps of three measures for in an example of 3 equally sized classes. (a) Metric $$d_1$$ shows clear separation between the 3 classes. (b) $$d_2$$ shows 3 classes with half of Class-1 positioned closer to Class-2. (c) $$d_3$$ identifies 2 rather than 3 classes. Note that $$d_1$$ has lower AUPRC than $$d_2$$ and $$d_3$$ despite being best at identifying the 3 classes whereas $$\overline{P}$$ values for the metrics are $$\overline{P}(d_1)$$=1.0, $$\overline{P}(d_2)$$=0.887 and $$\overline{P}(d_3)$$=0.869. Similarly, $$\overline{{\rm PR}}$$ values for the metrics are $$\overline{{\rm PR}}(d_1)$$=1.0, $$\overline{{\rm PR}}(d_2)$$=0.893 and $$\overline{{\rm PR}}(d_3)$$=0.882. Fig. I1. View largeDownload slide Heat maps of three measures for in an example of 3 equally sized classes. (a) Metric $$d_1$$ shows clear separation between the 3 classes. (b) $$d_2$$ shows 3 classes with half of Class-1 positioned closer to Class-2. (c) $$d_3$$ identifies 2 rather than 3 classes. Note that $$d_1$$ has lower AUPRC than $$d_2$$ and $$d_3$$ despite being best at identifying the 3 classes whereas $$\overline{P}$$ values for the metrics are $$\overline{P}(d_1)$$=1.0, $$\overline{P}(d_2)$$=0.887 and $$\overline{P}(d_3)$$=0.869. Similarly, $$\overline{{\rm PR}}$$ values for the metrics are $$\overline{{\rm PR}}(d_1)$$=1.0, $$\overline{{\rm PR}}(d_2)$$=0.893 and $$\overline{{\rm PR}}(d_3)$$=0.882. Some of the problems of AUPRC are the following. First, AUPRC is based on a classifier that identifies pairs of similar networks and hence is only indirectly related to the problem of separating classes. Moreover, the classifier uses a single global threshold $$\epsilon$$ for all networks and classes, and hence implicitly assumes that all classes are separated on the same scale. The AUPRC further lacks a clear statistical interpretation, which complicates its use especially when one has multiple classes and when precision recall curves of different measures intersect. Despite its problems, we give AUPRC values for all measures we considered in the main text in Table I1 for the sake of completeness. Note that $${\rm NetEmd}$$ measures achieve the highest AUPRC on all data sets. Table I1 AUPRC scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest AUPRC score (given in bold) on all data sets. For $$RG_1$$, we calculated the value of the AUPRC score for each of the 16 sub-data sets. The table shows the average and standard deviation of the AUPRC values obtained over these 16 sub-data sets.    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625  Table I1 AUPRC scores for measures and data sets considered in the main text. $${\rm NetEmd}$$ measures have the highest AUPRC score (given in bold) on all data sets. For $$RG_1$$, we calculated the value of the AUPRC score for each of the 16 sub-data sets. The table shows the average and standard deviation of the AUPRC values obtained over these 16 sub-data sets.    $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625     $$RG_1$$  $$RG_2$$  $$RG_3$$  RWN  Onnela et al.  $${\rm NetEmd}_{G3}$$  0.917 $$\pm$$ 0.039  0.869  0.702  0.800  0.756  $${\rm NetEmd}_{G4}$$  0.959 $$\pm$$ 0.030  0.930  0.759  0.774  0.786  $${\rm NetEmd}_{G5}$$  0.981 $$\pm$$ 0.018  0.957  0.766  0.722  0.757  $${\rm NetEmd}_{S}$$  0.967 $$\pm$$ 0.015  0.958  0.833  0.702  0.672  $${\rm NetEmd}_{E4}$$  0.966 $$\pm$$ 0.030  0.945  0.801  0.777  0.739  $${\rm NetEmd}_{\rm DD}$$  0.756 $$\pm$$ 0.044  0.708  0.516  0.655  0.612  $${\rm Netdis}_{\rm ER}$$  0.867 $$\pm$$ 0.044  0.579  0.396  0.607  0.621  $${\rm Netdis}_{\rm SF}$$  0.852 $$\pm$$ 0.028  0.657  0.437  0.522  0.592  GCD11  0.888 $$\pm$$ 0.084  0.709  0.478  0.713  0.693  GCD73  0.966 $$\pm$$ 0.052  0.858  0.571  0.736  0.743  GGDA  0.815 $$\pm$$ 0.176  0.740  0.481  0.500  0.625  References 1. Newman M. E. J. Networks: An Introduction . Oxford, UK: Oxford University Press, 2010. Google Scholar CrossRef Search ADS   2. Wilson R. C. & Zhu P. ( 2008) A study of graph spectra for comparing graphs and trees. Pattern Recogn. , 41, 2833– 2841. Google Scholar CrossRef Search ADS   3. Neyshabur B., Khadem A., Hashemifar S. & Arab S. S. ( 2013) Netal: a new graph- based method for global alignment of protein–protein interaction networks. Bioinformatics , 27, 1654– 1662. Google Scholar CrossRef Search ADS   4. Ali W., Rito T., Reinert G., Sun F. & Deane C. M. ( 2014) Alignment-free protein interaction network comparison. Bioinformatics , 30, i430– i437. Google Scholar CrossRef Search ADS PubMed  5. Yaveroglu Ö. N., Malod-Dognin N., Davis D., Levnajic Z., Janjic V., Karapandza R., Stojmirovic A. & Przulj N. ( 2014) Revealing the hidden language of complex networks. Sci. Rep. , 4. 6. Singh R., Xu J. & Berger B. ( 2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA , 105, 12763– 12768. Google Scholar CrossRef Search ADS   7. Kuchaiev O. & Pržulj N. ( 2011) Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics , 27, 1390– 1396. Google Scholar CrossRef Search ADS PubMed  8. Mamano N. & Hayes W. B. ( 2017) Sana: simulated annealing far outperforms many other search algorithms for biological network alignment. Bioinformatics , 33, 2156– 2164. Google Scholar CrossRef Search ADS PubMed  9. Hashemifar S. & Xu J. ( 2014) HubAlign: an accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics , 30, i438– i444. Google Scholar CrossRef Search ADS PubMed  10. Milo R., Itzkovitz S., Kashtan N., Levitt R., Shen-Orr S., Ayzenshtat I. Sheffer M. & Alon U. ( 2004) Superfamilies of evolved and designed networks. Science , 303, 1538– 1542. Google Scholar CrossRef Search ADS PubMed  11. Pržulj N. ( 2007) Biological network comparison using graphlet degree distribution. Bioinformatics , 23, e177– e183. Google Scholar CrossRef Search ADS PubMed  12. Rito T., Wang Z., Deane C. M. & Reinert G. ( 2010) How threshold behaviour affects the use of subgraphs for network comparison. Bioinformatics , 26, i611– i617. Google Scholar CrossRef Search ADS PubMed  13. Kossinets G. & Watts D. J. ( 2006) Empirical analysis of an evolving social network. Science , 311, 88– 90. Google Scholar CrossRef Search ADS PubMed  14. Borgwardt K. M., Kriegel H.-P., Vishwanathan S. V. N. & Schraudolph N. N. ( 2007) Graph kernels for disease outcome prediction from protein-protein interaction networks. Pacific Symposium on Biocomputing , vol. 12. Singapore: World Scientific Publishing, pp. 4– 15. 15. Wale N., Watson I. A. & Karypis G. ( 2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl. Inf. Syst. , 14, 347– 375. Google Scholar CrossRef Search ADS   16. Barabási A.-L. & Albert R. ( 1999) Emergence of scaling in random networks. Science , 286, 509– 512. Google Scholar CrossRef Search ADS PubMed  17. Milo R., Shen-Orr S., Itzkovitz S., Kashtan N., Chklovskii D. & Alon U. ( 2002) Network motifs: Simple building blocks of complex networks. Science , 298, 824– 827. Google Scholar CrossRef Search ADS PubMed  18. Masoudi-Nejad A., Schreiber F., Kashani Z. & Razaghi M. ( 2012) Building blocks of biological networks: a review on major network motif discovery algorithms. IET Syst. Biol. , 6, 164– 174. Google Scholar CrossRef Search ADS PubMed  19. Wegner A. E. ( 2014) Subgraph covers: an information-theoretic approach to motif analysis in networks. Phys. Rev. X , 4, 041026. 20. Holmes S. & Reinert G. ( 2004) Stein’s method for the bootstrap. Stein’s Method . Beachwood, OH, USA: Institute of Mathematical Statistics, pp. 93– 132. Google Scholar CrossRef Search ADS   21. Bhattacharya S. & Bickel P. J. ( 2015) Subsampling bootstrap of count features of networks. Ann. Statist. , 43, 2384– 2411. Google Scholar CrossRef Search ADS   22. Ali W., Wegner A. E., Gaunt R. E., Deane C. M. & Reinert G. ( 2016) Comparison of large networks with sub-sampling strategies. Sci. Rep. , 6, 28955. Google Scholar CrossRef Search ADS PubMed  23. Runber Y., Tomasi C. & Guibas L. J. ( 1998) A metric for distributions with applications to image databases. IEEE International Conference on Computer Vision . Piscataway, NJ, USA: IEEE, pp. 59– 66. 24. Vázquez A., Flammini A., Maritan A. & Vespignani A. ( 2003) Modeling of protein interaction networks. Complexus , 1, 38– 44. Google Scholar CrossRef Search ADS   25. Erdős P. & Rényi A. ( 1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. , 5, 1960. 26. Ispolatov I., Krapivsky P. L. & Yuryev A. ( 2005) Duplication-divergence model of protein interaction network. Phys. Rev. E , 71, 061911. Google Scholar CrossRef Search ADS   27. Higham D. J., Rašajski M. & Pržulj N. ( 2008) Fitting a geometric graph to a protein–protein interaction network. Bioinformatics , 24, 1093– 1099. Google Scholar CrossRef Search ADS PubMed  28. Penrose M. ( 2003) Random Geometric Graphs . Oxford, UK: Oxford University Press. Google Scholar CrossRef Search ADS   29. Molloy M. & Reed B. ( 1995) A critical point for random graphs with a given degree sequence. Random Structures Algorithms , 6, 161– 180. Google Scholar CrossRef Search ADS   30. Watts D. J. & Strogatz S. H. ( 1998) Collective dynamics of ‘small-world’ networks. Nature , 393, 440– 442. Google Scholar CrossRef Search ADS PubMed  31. Onnela J.-P., Fenn D. J., Reid S., Porter M. A., Mucha P. J., Fricker M. D. & Jones N. S. ( 2012) Taxonomies of networks from community structure. Phys. Rev. E , 86, 036104, 2012. Google Scholar CrossRef Search ADS   32. Leskovec J., Kleinberg J. & Faloutsos C. ( 2005) Graphs over time: densification laws, shrinking diameters and possible explanations. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining . New York, NY, USA: ACM, pp. 177– 187. 33. Feenstra R. C., Lipsey R. E., Deng H., Ma A. C. & Mo H. ( 2005) World trade flows: 1962–2000. Technical Report . National Bureau of Economic Research. 34. United-Nations-Statistics-Division ( 2015) United nations commodity trade statistics database (un comtrade). http://comtrade.un.org/ ( last date accessed 22 January 2018). 35. Maugis P. G., Priebe C. E., Olhede S. C. & Wolfe P. J. ( 2017) Statistical inference for network samples using subgraph counts. ArXiv preprint: arXiv:1701.00505 . 36. Xing E. P., Ng A. Y., Jordan M. I. & Russell S. ( 2003) Distance metric learning with application to clustering with side-information. Adv. Neur. Inf. Process. Syst. , 15, 505– 512. 37. Mohar B., Alavi Y., Chartrand G. & Oellermann O. R. ( 1991) The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications  ( Mohar B. ed.), 2, Wiley NY, USA: Wiley Intersci. Publ., pp. 871– 898. 38. Chung F. R. K. ( 1997) Spectral Graph Theory , vol. 92. Providence, RI, USA: American Mathematical Society. 39. Banerjee A. & Jost J. ( 2008) On the spectrum of the normalized graph laplacian. Linear Algebra Appl. , 428, 3015– 3022. Google Scholar CrossRef Search ADS   40. Gu J., Jost J., Liu S. & Stadler P. F. ( 2016) Spectral classes of regular, random, and empirical graphs. Linear Algebra Appl. , 489, 30– 49. Google Scholar CrossRef Search ADS   41. Yanardag P. & Vishwanathan S. V. N. ( 2015) Deep graph kernels. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . New York, NY, USA: ACM, pp. 1365– 1374. 42. Cortes C. & Vapnik V. ( 1995) Support-vector networks. Mach. Learn. , 20, 273– 297. 43. Shervashidze N., Vishwanathan S. V. N., Petri T., Mehlhorn K. & Borgwardt K. M. ( 2009) Efficient graphlet kernels for large graph comparison. AISTATS  ( van Dyk D. & Welling M. eds), 5, Florida, USA: PMLR, pp. 488– 495. 44. Barnett I., Malik N., Kuijjer M. L., Mucha P. J. & Onnela J.-P. ( 2016) Feature-based classification of networks. CoRR , arXiv preprint: arXiv:1610.05868. 45. Niepert M., Ahmed M. & Kutzkov K. ( 2016) Learning convolutional neural networks for graphs. International Conference on Machine Learning . New York, NY, USA: PMLR, pp. 2014– 2023. 46. Shervashidze N., Schweitzer P., van Leeuwen E. J., Mehlhorn K. & Borgwardt K. M. ( 2011) Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. , 12, 2539– 2561. 47. Hočevar T. & Demšar J. ( 2014) A combinatorial approach to graphlet counting. Bioinformatics , 30, 559– 565. Google Scholar CrossRef Search ADS PubMed  48. Brent R. P. ( 1971) An algorithm with guaranteed convergence for finding a zero of a function. Comput. J. , 14, 422– 425. Google Scholar CrossRef Search ADS   49. Thüne M. ( 2013) Eigenvalues of matrices and graphs. PhD Thesis , University of Leipzig. 50. Debnath A. K., Lopez de Compadre R. L., Debnath G., Shusterman A. J. & Hansch C. ( 1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. , 34, 786– 797. Google Scholar CrossRef Search ADS PubMed  51. Borgwardt K. M., Ong C. S., Schonauer S., Vishwanathan SVN., Smola A. J. & Kriegel H.-P. ( 2005) Protein function prediction via graph kernels. Bioinformatics , 21, i47– i56. Google Scholar CrossRef Search ADS PubMed  52. Sugiyama M. & Borgwardt K. ( 2015) Halting in random walk kernels. Advances in Neural Information Processing Systems . ( Cortes C. Lee D. D. Sugiyama M. & Garnett R. eds), Cambridge, MA, USA: MIT Press, pp. 1639– 1647. 53. Dobson P. D. & Doig A. J. ( 2003) Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. , 330, 771– 783. Google Scholar CrossRef Search ADS PubMed  54. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O. Blondel M., Prettenhofer P., Weiss R., Dubourg V. et al.   ( 2011) Scikit-learn: machine learning in python. J. Mach. Learn. Res. , 12, 2825– 2830. 55. Chang C.-C. & Lin C.-J. ( 2011) Libsvm: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) , 2, 27. 56. Jayasumana S., Hartley R., Salzmann M., Li H. & Harandi M. ( 2015) Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Trans. Pattern Anal. Mach. Intell. , 37, 2464– 2477. Google Scholar CrossRef Search ADS PubMed  57. Luss R. & d’Aspremont A. ( 2008) Support vector machine classification with indefinite kernels. Advances in Neural Information Processing Systems . ( Platt J. C. Koller D. Singer Y. & Roweis S. T. eds), USA: Curran Associates Inc., pp. 953– 960. Google Scholar CrossRef Search ADS   58. Haasdonk B. ( 2005) Feature space interpretation of SVMS with indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. , 27, 482– 492. Google Scholar CrossRef Search ADS PubMed  59. Gilbert E. N. ( 1961) Random plane networks. J. Soc. Ind. Appl. Math. , 9, 533– 543. Google Scholar CrossRef Search ADS   60. Traud A. L., Mucha P. J. & Porter M. A. ( 2012) Social structure of Facebook networks. Phys. A , 391, 4165– 4180. Google Scholar CrossRef Search ADS   61. Jeong H., Tombor B., Albert R., Oltvai Z. N. & Barabási A.-L. ( 2000) The large-scale organization of metabolic networks. Nature , 407, 651– 654. Google Scholar CrossRef Search ADS PubMed  62. Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A. & Tyers M. ( 2006) Biogrid: a general repository for interaction datasets. Nucleic Acids Res. , 34, D535– D539. Google Scholar CrossRef Search ADS PubMed  63. Das J. & Yu H. ( 2012) Hint: High-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol. , 6, 92. Google Scholar CrossRef Search ADS PubMed  64. Rajagopala S. V., Sikorski P., Kumar A., Mosca R., Vlasblom J., Arnold R., Franca-Koh J., Pakala S. B., Phanse S., Ceol A., et al.   ( 2014) The binary protein–protein interaction landscape of escherichia coli. Nat. Biotechnol. , 32, 285– 290. Google Scholar CrossRef Search ADS PubMed  65. Jure L. & Krevl A. ( 2014) SNAP datasets: Stanford large network dataset collection. https://snap.stanford.edu/data/ ( last date accessed 22 January 2018). 66. Hand D. J. & Till R. J. ( 2001) A simple generalisation of the area under the roc curve for multiple class classification problems. Mach. Learn. , 45, 171– 186. Google Scholar CrossRef Search ADS   © The authors 2018. Published by Oxford University Press. All rights reserved.

### Journal

Journal of Complex NetworksOxford University Press

Published: Feb 3, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations