An improved hyperbolic embedding algorithm

An improved hyperbolic embedding algorithm Abstract Because hyperbolic space has properties that make it amenable to graph representations, there is significant interest in scalable hyperbolic-space embedding methods. These embeddings enable constant-time approximation of shortest-path distances, and so are significantly more efficient than full shortest-path computations. In this article, we improve on existing landmark-based hyperbolic embedding algorithms for large-scale graphs. Whereas previous methods compute the embedding by using the derivative-free Nelder–Mead simplex optimization method, our approach uses the limited-memory BFGS (LBFGS) method, which is quasi-Newton optimization, with analytic gradients. Our method is not only significantly faster but also produces higher-quality embeddings. Moreover, we are able to include the hyperbolic curvature as a variable in the optimization. We compare our hyperbolic embedding method implementation in Python (called Hypy) against the best publicly available software, Rigel. Our method is an order of magnitude faster and shows significant improvements in the accuracy of the shortest-path distance calculations. Tests are performed on a variety of real-world networks, and we show the scalability of our method by embedding a graph with 1.8 billion edges and 65 million nodes. 1. Introduction and related work Hyperbolic embeddings are of growing interest because they exploit fundamental similarities between the geometry of hyperbolic spaces and properties of scale-free networks [1]. In general, the idea of an embedding is to map each node of a graph to a coordinate in some space, for example Euclidean, spherical or hyperbolic [2]. Once the mapping is obtained, which is by no means trivial, estimating the shortest-path distance between a pair of nodes is as simple as plugging their hyperbolic coordinates into the distance formula, that is, a constant-time calculation. This allows for extremely efficient estimates of quantities that may depend on distance calculations, such as graph centrality [3]. The most common type of embedding is into Euclidean space. Here, the distance formula is given by the usual Euclidean-squared distance. Shavitt and Tankel [4] explored Euclidean graph embedding as early as 2003. In subsequent years, they also explored embedding Internet graphs into a hyperbolic space, noting that the geometry of the Internet graph resembles the geometry of the hyperbolic spaces. In particular, Shavitt and Tankel noted that the Internet graph had a tightly connected core and long-stretched tendrils [5]. With this qualitative assessment of the geometry of Internet graphs, they achieve greater accuracy in shortest-path estimates by embedding into hyperbolic rather than Euclidean space. This assessment of the structure and geometry of graphs was more rigorously studied by Krioukov et al. [1]. They show how the geometry of hyperbolic space allows a more natural construction of a scale-free network that retains many characteristic properties, such as degree distributions that follow a power law. This particular work led to a probabilistic embedding algorithm in which a likelihood function was introduced, where the likelihood of a connection between two nodes exponentially decreases with distance from the origin [6, 7]. The exact hyperbolic distances are not directly enforced in this approach, but rather the average degree, power-law exponent, and average clustering coefficients of the graph are used as hyperparameters in the likelihood function. Once the likelihood function has been defined, a brute force embedding of all nodes at once is too computationally expensive for most networks, and so Boguna et al. [6] suggest embedding a representative, but much smaller, subgraph first and then iteratively adding more and more layers of the full graph. In some ways, this iterative subgraph style of embedding is similar to an approach known as the landmark method, which dates back to 2003 [8]. However, one major difference is that the true graph distances are directly utilized in the landmark embedding approach, whereas only the adjacency matrix is used in the likelihood approach. The work in this article is more closely related to landmark type methods, which have recently been used for embedding million- and billion-node graphs [3, 9, 10]. For more recent examples of the likelihood based approaches, we refer the reader to [11] and [12]. While these methods propose a global maximum likelihood algorithm, as opposed to our method which uses multiple solves of a local optimization algorithm, they are limited in their experiments to a two-dimensional embedding and graphs with only a few thousand nodes. It is important to note, however, that while the landmark approach is designed to give accurate estimates of graph distances between nodes, the likelihood-based approaches are not necessarily designed to do the same. For example, likelihood based methods can play a valuable role in helping to predict missing links in graphs. Thus, a comparison between the two methods should be taken with a grain of salt with the understanding that both approaches may have different end goals. Our proposed approach is an improvement on a well-known landmark-based hyperbolic-embedding method called Rigel [3]. Our method differs in the following ways: (1) the choice of the landmarks is random, with probabilities proportion to the node degrees; (2) the embedding algorithm uses the limited-memory BFGS (LBFGS) quasi-Newton optimization method instead of the Nelder–Mead simplex optimization method; (3) we derive and use analytical gradients; and (4) we optimize the hyperbolic curvature in addition to the hyperbolic coordinates of the vertices. Moreover, our hyperbolic embedding algorithm is implemented in Python (including Numpy and Scipy libraries) with an option to use MPI for large graphs. This article is organized as follows: Section 2 presents the problem formulation, introduces the hyperbolic distance formula, and derives the hyperbolic embedding objective functions. Section 3 explains the benefits of using LBFGS over Nelder–Mead and provides the gradient formulas with their respective implementation algorithms. Section 4 details our new hyperbolic embedding algorithms, from how we choose the landmarks all the way to how we perform the full embedding procedure. Section 5 illustrates the performance of our method in comparison to Rigel for five different graphs with sizes from 300 thousand nodes and 1 million edges to 3 million nodes and 117 million edges. We also run on a graph that is too large for Rigel, with 66 million nodes and 1.8 billion edges. Finally, Section 6 concludes our article with a summary of results. 2. Problem formulation Let $$\mathscr{G} = (\mathscr{V},\mathscr{E})$$ where $$\mathscr{V}$$ represents the vertices and $$\mathscr{E}$$ represents the edges. Let $$n=|\mathscr{V}|$$ denote the number of nodes and $$m=|\mathscr{E}|$$ denote the number of edges. 2.1 Graph distance The graph distance between two nodes $$i,j \in \mathscr{V}$$ is defined as the shortest path between $$i$$ and $$j$$ and denoted by $$\Delta_{ij}$$, which we hereafter refer to as the true distance. It is the minimum number of edges that need to be traversed to go from $$i$$ to $$j$$. For instance, $$\Delta_{ii} = 0$$; if $$(i,j) \in E$$, $$i \ne j$$, then $$\Delta_{ij}=1$$; and so on. The difficulty is that computing the distances can be prohibitively expensive, especially for large graphs. We assume that the graph is connected so that $$\Delta_{ij}$$ is finite for all pairs $$(i,j)\in \mathscr{V} \otimes \mathscr{V}$$. For hyperbolic embedding, the goal is to determine coordinates in hyperbolic space for each vertex so that the hyperbolic distance between coordinate representations can be used to estimate the true (shortest path) distance between vertices [5]. 2.2 Hyperbolic space and distances We let $$\mathscr{H}^d$$ denote the hyperbolic space of dimension $$d$$, that is, \begin{align*} \mathscr{H}^d = \left\{ {{\mathbf{u} : u_1^2 + \dots + u_d^2 - u_{d+1}^2 = -1}} \right\}\!. \end{align*} We consider only points in the upper sheet of the hyperboloid, so that the last coordinate is determined according to the first $$d$$ coordinates: \begin{align*} u_{d+1} = \sqrt{ u_1^2 + \dots + u_d^2 + 1 }. \end{align*} We hereafter represent any points in $$\mathscr{H}^d$$ by only the first $$d$$ coordinates that is, $$\mathbf{u} = (u_1,\dots,u_d)^{\top}$$. To define distance in hyperbolic space, we first give some preliminary definitions. For two points $$\mathbf{u},\mathbf{v} \in \mathscr{H}^d$$, let \begin{align*} \mathbf{u} \cdot \mathbf{v} = \sum_{k=1}^{d} u_k v_k {\quad\text{and}\quad} \| \mathbf{u} \| = \sqrt{ \mathbf{u} \cdot \mathbf{u} } \end{align*} be the standard Euclidean dot product and norm, respectively. The hyperbolic dot product is defined as \begin{align*} \mathbf{u} \ast \mathbf{v} = \mathbf{u} \cdot \mathbf{v} - \sqrt{ ( \| \mathbf{u} \|^2 + 1) ( \|\mathbf{v}\|^2 + 1) }, \end{align*} and then we define the hyperbolic distance between $$\mathbf{u},\mathbf{v} \in \mathscr{H}^d$$ to be $$d_\mathscr{H} (\mathbf{u},\mathbf{v},\kappa) = \frac{1}{\sqrt{\kappa}} {\text{arccosh}}( -\mathbf{u} \ast \mathbf{v} ),$$ (2.1) where $$\kappa > 0$$ and $$-\kappa$$ is commonly referred to as the curvature. Curvature does not actually change the shape of the hyperbolic half sheet $$\mathscr{H}^d$$ but instead changes the metric we use to measure distances between points, for example, geodesics. Note that the distance is symmetric so that $$d_\mathscr{H} (\mathbf{u},\mathbf{v}) = d_\mathscr{H} (\mathbf{v},\mathbf{u})$$. 2.3 Embedding objective and representation For each $$i \in \mathscr{V}$$, we let $$\mathbf{u}_i \in \mathscr{H}^d$$ denote its representation in hyperbolic space. For any subset $$\mathscr{A} \subseteq \mathscr{V}$$, we let $$\mathbf{U}_{\mathscr{A}}$$ denote the matrix of size $$d \times |\mathscr{A}|$$ whose columns are the representations, that is, $$\mathbf{u}_i$$ for all $$i \in \mathscr{A}$$. For instance, $$\mathbf{U}_\mathscr{V} = \left[ {\matrix{{{{\bf{u}}_1}} & {{{\bf{u}}_2}} & \cdots & {{{\bf{u}}_n}} \cr } } \right]$$. In our later discussion of algorithms, we assume for ease of notation that the matrix $$\mathbf{U}_{\mathscr{A}}$$ comes with implicit knowledge of the associated set $$\mathscr{A}$$. Since our goal is to choose the representations so that they minimize the embedding error, we define what this means precisely. The embedding error for vertices $$i,j\in \mathscr{V}$$ and curvature $$\kappa$$ is $$e(\mathbf{u}_i,\mathbf{u}_j,\kappa) = \frac{1}{2} (d_{\mathscr{H}}(\mathbf{u}_i,\mathbf{u}_j,\kappa) - \Delta_{ij})^2,$$ (2.2) where $$\Delta_{ij}$$ is the true distance as described in Section 2.1. For an arbitrary set $$\mathscr{A} \subseteq \mathscr{V}$$, we define two error measures. First, the measure $${\hat E_{\mathscr{A}}(\mathbf{u}_j,\kappa)} = \sum_{i\in \mathscr{A}} e(\mathbf{u}_i,\mathbf{u}_j,\kappa)$$ (2.3) is the embedding error for vertex $$j$$ with respect to the fixed embedding $$\mathbf{U}_{\mathscr{A}}$$ for the set $$\mathscr{A}$$. Second, the measure $${E_{\mathcal{A}}(\mathbf{U}_{\mathcal{A}},\kappa)} = \sum_{{i,j \in \mathscr{A}}\atop{i < j}} e(\mathbf{u}_i,\mathbf{u}_j,\kappa)$$ (2.4) is the sum of the embedding errors for all unique vertex pairs in $$\mathscr{A} \otimes \mathscr{A}$$. Ideally, we would pick $$\mathbf{U}_{\mathscr{V}}$$ and $$\kappa$$ such that the total embedding error $${E_{\mathscr{V}}(\mathbf{U}_{\mathscr{V}},\kappa)}$$ is minimized. This ensures that the hyperbolic distances are as close as possible to the true distances. Computing this objective function for large $$n$$ is prohibitive. We discuss how to approximately solve this optimization problem in the next subsection. We have not discussed the choice of the embedding dimension, $$d$$, which is typically between 2 and 10. We will solve the problems for several values of $$d$$ and pick the one the yields the best embedding by some validation metric. We discuss its selection in more detail in Section 5. 2.4 Optimization formulation using landmarks One problem with minimizing $${E_{\mathscr{V}}(\mathbf{U}_{\mathscr{V}},\kappa)}$$ directly is that it requires us to compute the true distances, $$\Delta_{ij}$$, between all $$n(n-1)/2$$ vertex pairs, which is prohibitively expensive and moreover defeats the purpose of estimating the distances. Instead, we use the idea of landmarks, introduced and studied in [8, 10]. We first embed the landmarks optimally in relation to each other and then embed the other points in relation to the landmarks. The idea is that embedding a few important vertices will determine the placement of the remainder. Let $$\mathscr{L} \subset \mathscr{V}$$ denote the set of landmarks and $$\ell = |\mathscr{L}|$$ be the number of landmarks. We typically choose $$\ell \ll n$$ to be very small, for example, 100. We find coordinates for the landmarks, $$\mathbf{U}_{\mathscr{L}}$$ of size $$d \times \ell$$, and the curvature, $$\kappa$$, that minimize the total landmark embedding error given by $${E_{\mathscr{L}}(\mathbf{U}_{\mathscr{L}},\kappa)}$$, which only requires $$\ell^2$$ values of $$\Delta_{ij}$$. The total number of optimization parameters in this case is $$\ell d + 1$$. We embed the landmarks by minimizing the discrepancy between the true and hyperbolic distances for every pair of nodes in the landmark set. Once the landmark embedding ($$\mathbf{U}_{\mathscr{L}}$$ and $$\kappa$$) is determined, then we determine the coordinate of every non-landmark. For each $$j \in \mathscr{V}\setminus\mathscr{L}$$ we find the embedding $$\mathbf{u}_j$$ of length $$d$$ that minimizes the discrepancy between $$\mathbf{u}_j$$ and its distance to every landmark as measured by $${\hat E_{j}(\mathbf{u}_j,\kappa)}$$. There are a total of $$(n-\ell)$$ optimization problems, each of which has only $$d$$ parameters. Moreover, since each non-landmark coordinate $$\mathbf{u}_j$$ for $$j \in \mathscr{V}\setminus\mathscr{L}$$ can be embedded independently of the others, the embedding of the non-landmark vertices is easily parallelizable. More specifically, each non-landmark coordinate $$\mathbf{u}_j$$ is chosen by minimizing 2.3 with $$\mathscr{A}$$ set to $$\mathscr{L}$$, where each of these objective functions are independent of one another. The advantage of the landmark approach is that we need only compute $$\Delta_{ij}$$ for each $$i \in \mathscr{L}$$ and $$j \in \mathscr{V}$$. In other words, we compute $$O(n)$$ distances rather than $$n^2$$. This preprocessing step can be quite expensive, requiring $$\ell$$ single-source-shortest-path computations, but is a necessity for any landmark approach. 3. Optimization improvements Rigel uses the Nelder–Mead simplex algorithm [13] to compute the minimums for the landmark and non-landmark embedding objective functions (2.4), while we propose using the LBFGS quasi-Newton approach [14]. The Nelder–Mead simplex algorithm is a direct search method that does not require explicit gradient calculations, which makes it an attractive option when the analytic gradient is difficult or impossible to obtain. Briefly, the Nelder–Mead algorithm evaluates a sequence of $$p+1$$ dimensional simplices, where $$p$$ is the dimension of the objective function. Each successive simplex attempts to move closer to the minimum, requiring only the evaluation of the objective function at the vertices of the simplices. However, for a broad class of problems, the Nelder–Mead simplex algorithm can be unstable and fail unpredictably [15]. In fact, it can be shown that the Nelder–Mead can fail to converge to a stationary point even if the objective function is strictly convex with three continuous derivatives [16]. For these reasons, we propose using an approach that uses gradient information and guarantees convergence. LBFGS is a quasi-Newton optimization method used to efficiently find local minima by utilizing gradient information. It is based on Newton’s method where successive guesses for the minimum are computed by approximating the objective function with a local quadratic Taylor series expansion. At each iteration, the local quadratic approximation can minimized by inverting the Hessian and computing the gradient. To avoid the computational burden of evaluating the full Hessian matrix at each iteration, the LBFGS method directly estimates the inverse of the Hessian matrix using rank-one updates based on the position and gradient vectors. This results in a computational cost that scales linearly with the dimension at each iteration [14] and makes the LBFGS algorithm well suited for large-scale problems [17]. Moreover, convergence to a stationary point can be guaranteed when the objective function is twice continuously differentiable [18]. If the function is also strictly convex, then we are assured convergence to the global minimizer. In the present context, our objective function is not convex, which means that there is no guarantee of convergence to the global minimum. However, by running the LBFGS algorithm multiple times with different starting points, we can obtain a set of local minimum solutions and select the best of these as an approximation to the global minimum. This procedure is described in more detail in Section 4.5. Below, we derive the analytic gradient needed for the LBFGS method. Theorem 3.1 The gradient for $$e(\mathbf{u},\mathbf{v},\kappa)$$ with respect to $$\mathbf{u}$$ is $${\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}} = \left( \frac{ d_\mathscr{H}(\mathbf{u},\mathbf{v},\kappa) - \Delta} {\sqrt{\kappa ((\mathbf{u} \ast \mathbf{v}) ^2 - 1) }} \right) \left( \frac{ \sqrt{ \| \mathbf{v} \|^2+1 }}{ \sqrt{ \| \mathbf{u} \|^2+1 }} \mathbf{u} - \mathbf{v} \right),$$ (3.1) and with respect to $$\kappa$$ is $${\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} = - \frac{ \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) {\text{arccosh}}(-\mathbf{u} \ast \mathbf{v})} {2\kappa^{3/2}},$$ (3.2) where $$\Delta$$ is the true distance from $$\mathbf{u}$$ to $$\mathbf{v}$$. Proof. To derive (3.1), by the chain rule, we have: \begin{align*} {\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa) }{\partial \mathbf{u}}} &= \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) \; {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}}, \\ {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}} &= \frac{1}{\sqrt{\kappa}} \; \frac{1}{\sqrt{ (\mathbf{u} \ast \mathbf{v})^2-1}} \; {\frac{ \partial (-\mathbf{u} \ast \mathbf{v})}{\partial \mathbf{u}}}, \\ {\frac{ \partial (-\mathbf{u} \ast \mathbf{v})}{\partial \mathbf{u}}} &= \frac{ (\| \mathbf{v} \|^2+1)}{\sqrt{ (\| \mathbf{u} \|^2+1)(\| \mathbf{v} \|^2+1)}} \mathbf{u} - \mathbf{v} . \end{align*} Similarly, (3.2) also follows from the chain rule by \begin{align*} {\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} &= \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) \; {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}}, \\ {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} &= \frac{-1}{2} \kappa^{-3/2} {\text{arccosh}}(-\mathbf{u} \ast \mathbf{v}).\\[-42pt] \end{align*} □ For implementation purposes, Algorithms 1 and 2 detail the matrix-vector calculations used to efficiently compute the hyperbolic distances and gradients between landmarks and non-landmarks, respectively. Note that the operation $$\odot$$ (e.g., in Line 9 in Algorithm 1), refers to elementwise multiplication for matrices and vectors. Algorithm 1 first computes $$e(\mathbf{u},\mathbf{v},\kappa)$$ for all non-unique pairs $$\mathbf{u},\mathbf{v} \in \mathscr{L}$$. While this is redundant due to the symmetry in (2.2), the calculations are much simpler and the additional computational cost is negligible. To compute $${E_{\mathcal{A}}(\mathbf{U}_{\mathcal{A}},\kappa)}$$ over unique pairs of indices in $$\mathscr{L}$$, as in (2.4), we simply include a factor of $$1/2$$ in Line 7 and Line 14 in Algorithm 1. If we define $$\mathscr{A} \subseteq \mathscr{V}$$ and $$k = |\mathscr{A}|$$, then the function and gradient evaluation for the embedding of points in $$\mathscr{A}$$ with respect to each other in Algorithm 1 involves $$\mathscr{O}(k^2 d)$$ floating point operations, while the function and gradient evaluation for the embedding of one point with respect to $$\mathscr{A}$$ in Algorithm 2 involves $$\mathscr{O}(k d)$$ floating point operations. 4. New hyperbolic embedding algorithm Following the landmark-based embedding procedure outlined in Section 2.4, we describe our approach. We first explain landmark selection in Section 4.1 and computation of true graph distances in Section 4.2. We describe a simple embedding method in Section 4.3 and a recursive version in Section 4.4, which enables efficient use of larger numbers of landmarks. 4.1 Selecting landmarks Ideally, the set of landmarks will be such that any vertex is close to a landmark and such that the landmarks are not redundant. In terms of selecting landmarks, Zhao et al. [10] compared choosing vertices at random so they are not close to each other, selecting the vertices with the highest degrees so that they have high centrality, and selecting the vertices that have high degree but are also at least $$k$$ hops away (for $$k = 2, 3$$ and $$4$$) from one another. Surprisingly, the differences in relative error between all these landmark-selection techniques were fairly small. Consequently, Rigel simply chooses the highest-degree nodes to be its landmarks [3]. While this approach is computationally inexpensive, it does not guarantee that the selected landmark nodes are far apart. A uniform random sampling technique tends to avoid choosing landmarks too close together, but does not necessarily choose landmarks with high degree. Our approach combines both techniques by randomly sampling the vertices proportional to their degrees. This allows us to take advantage of the benefits of both random sampling and selection-by-highest-degree techniques, while still remaining computationally inexpensive (this approach does not require any distance calculations between landmark nodes). Our landmark selection algorithm is shown in Algorithm 3. Algorithm 1 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 1 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 2 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 2 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 3 View largeDownload slide Landmark Selection Algorithm 3 View largeDownload slide Landmark Selection Algorithm 4 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 4 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. 4.2 Computing true graph distances We calculate the $$\Delta_{ij}$$ values for the landmarks only, as shown in Algorithm 4. Typically, the inner loop is calculated via a single-source-shortest-path method [19], that is, it finds the shortest path to every node from each $$i \in \mathscr{L}$$. 4.3 Simple embedding procedure We first consider embedding of the landmarks. Algorithm 5 defines the A2A function which embeds a set of vertices, $$\mathscr{A}$$, with respect to itself. It also determines the optimal $$\kappa$$. The number of optimization variables is $$kd + 1$$ where $$k = |\mathscr{A}|$$. If $$\mathscr{A}=\mathscr{L}$$, then this optimizes the set of landmarks with respect to itself, but we have left the function generic (i.e., using $$\mathscr{A}$$ rather than $$\mathscr{L}$$) for reasons that will become apparent when we discuss the recursive version. The function A2A accepts initial guesses for the matrix $$\mathbf{U}_{\mathscr{A}}$$ and the scalar $$\kappa > 0$$ as input, which may be either random guesses or something better. The LBFGS method is used for optimization and requires takes two inputs: (1) the method to evaluate the objective function and gradient from Algorithm 1 and (2) the initial starting point. Upon completion, the function A2A returns values of $$\mathbf{U}_{\mathscr{A}}$$ and $$\kappa$$ that give an optimal hyperbolic embedding, with the caveat that this may be a local optimum since the problem is not convex. Algorithm 5 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{A}$$ relative to itself Algorithm 5 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{A}$$ relative to itself Algorithm6 defines the function B2A that embeds vertices in the set $$\mathscr{B}$$ with respect to fixed $$\mathbf{U}_{\mathscr{A}}$$ and $$\kappa$$. We require $$\mathscr{B}$$ and $$\mathscr{A}$$ to be disjoint. It solves a series of $$|\mathscr{B}|$$ optimization problems, and each problem is of dimension $$d$$. Since the optimization problems are independent, the optimization solves can be done in parallel. Each optimization problem uses the objective function defined in Algorithm 2 and a random initial guess. Algorithm 6 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$ Algorithm 6 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$ Using the function discussed so far, the full procedure is straightforward and given in Algorithm 7. The inputs are the graph and the number of landmarks. The output is the embedding for all nodes $$\mathbf{U}_{\mathscr{V}}$$ and $$\kappa$$. We optimize the landmarks and $$\kappa$$ in a single pass and then optimize the non-landmarks. Algorithm 7 View largeDownload slide Hyperbolic embedding, simple version Algorithm 7 View largeDownload slide Hyperbolic embedding, simple version 4.4 Recursive embedding procedure In general, computing the landmark embedding using Algorithm 7 with a random initial guess can be inefficient and inaccurate for more than, say, $$\ell=100$$ landmarks. Instead, we propose an alternative algorithm which recursively builds a better initial guess for the landmark embedding. In order to do this, we first partition the landmarks into $$r$$ sets, that is, $$\mathscr{L} = \mathscr{B}_1 \cup \mathscr{B}_2 \cup \cdots \cup \mathscr{B}_r {\quad\text{with}\quad} \mathscr{B}_i \cap \mathscr{B}_j = \emptyset, \; \forall i\neq j \in \left\{ {{1,\dots,r}} \right\}\!.$$ (4.1) We then define $$\mathscr{A}_k$$ to be the union of the first $$k$$ partitions, that is, \begin{align*} \mathscr{A}_k = \bigcup_{i=1}^k \mathscr{B}_i. \end{align*} Note that $$\mathscr{A}_r = \mathscr{L}$$. To perform the embedding, we first embed $$\mathscr{A}_1$$ with a random guess for the initial embedding. We then embed the nodes $$\mathscr{B}_2$$ with respect to $$\mathscr{A}_1$$. The embeddings for these two sets of nodes is combined to form the initial guess for embedding $$\mathscr{A}_2$$ with respect to itself. This process is continued until all $$r$$ partitions of the landmarks are embedded. The recursive landmark embedding approach is detailed in Algorithm 8. Similarly, to obtain an improved starting point for the non-landmark embedding optimization we first embed each non-landmark node relative to a random subset of the original landmark nodes. This is similar to what Rigel does, however, we take it a step further and use this solution as an initial guess for the full non-landmark embedding procedure. The recursive non-landmark algorithm is shown in Algorithm 9. Algorithm 8 View largeDownload slide Hyperbolic embedding of vertices $$\mathscr{A}$$ relative to itself, recursive version Algorithm 8 View largeDownload slide Hyperbolic embedding of vertices $$\mathscr{A}$$ relative to itself, recursive version 4.5 Implementation details Before we move on to the examples, we briefly discuss implementation strategies for improving the landmark and non-landmark optimization procedures. Because our objective function is non-convex, there is a possibility that our optimization routine might get stuck in a local minimum. To lessen this possibility, we run the landmark embedding multiple times, in parallel, for different random starting positions. The optimal set of landmark embedding coordinates with the smallest error, $$E_{\mathscr{L}}(\mathbf{U}_{\mathscr{L}},\kappa)$$, is chosen to begin our non-landmark embedding routine. Second, the full $$\Delta_{ij}$$ matrix for the non-landmark embedding algorithm can be prohibitively large, for example, $$(n-\ell) \ell 2^5$$ bits using standard 32 bit integers. In order to efficiently manage memory for the non-landmark embedding procedure and allow us to embed a graph of any size, we split the $$\Delta_{ij}$$ distance matrix into smaller blocks of columns. Specifically, we split $$\mathscr{N}$$, the set of non-landmark indices, into approximately $$s \approx (n-\ell)/b$$ sets of indices that each have approximately $$b$$ elements, that is, $$\mathscr{N} = \mathscr{N}_1 \cup \cdots \cup \mathscr{N}_s {\quad\text{with}\quad} \mathscr{N}_p \cap \mathscr{N}_q = \emptyset\; \forall p\neq q \in \left\{ {{1,\dots,s}} \right\}, \; |\mathscr{N}_p| \approx b$$ (4.2) Algorithm 9 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$, recursive version Algorithm 9 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$, recursive version Algorithm 10 View largeDownload slide Main Hyperbolic embedding, recursive version with non-landmark splitting Algorithm 10 View largeDownload slide Main Hyperbolic embedding, recursive version with non-landmark splitting The $$\Delta_{ij}$$ matrix is then split according to the column partitions defined in (4.2). This results in $$s$$ non-landmark distance matrices each requiring only $$b \cdot \ell \cdot 2^5$$ bits of memory. The non-landmark embedding is then performed by loading a single partition at a time, where each partition can be run in parallel. The main procedure is detailed in Algorithm 10, where the recursive landmark embedding, Algorithm 4, can be run many times in parallel to obtain the embedding with the least error, and the non-landmark nodes in $$\mathscr{N}_j$$ can also be executed in parallel for each $$j = 1,\dots,s$$. 5. Experimental results We test our hyperbolic embedding algorithm on five datasets from SNAP [20], with up to 65 million nodes and 1.8 billion edges; see Table 1. All tests are performed on a Dual socket Intel server using 40 Intel Haswell 2000Mhz E5-2683v3 CPUs, each with 14 processors and 60 Gigabytes of shared memory, for a total of 560 cores. All shortest-path (i.e., true) distances are computed using SNAP for Python [21]. The validation data for each graph consists of $$100,000$$ randomly chosen pairs of nodes (excluding landmarks). Table 1 Datasets Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Table 1 Datasets Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 5.1 Comparison with Rigel For a baseline, we embed these graphs into hyperbolic space using Rigel [22], which is an open source C++ implementation by the Systems, Algorithms, Networking and Data Laboratory (SAND Lab) Group at University of California, Santa Barbara. It uses the general landmark embedding approach of Algorithm 7, with a few key differences. First, Rigel uses the Nelder–Mead simplex algorithm to minimize the landmark and non-landmark embedding error. Second, Rigel does not include the curvature as a variable in the optimization routine, nor does it incorporate the use of the gradient of the objective function. Rigel does, however, allow the user to fix the curvature manually during the embedding procedure. Experiments seem to indicate that the optimal curvature for Rigel is $$\kappa = 1$$ [3], which we use in the subsequent experiments. Finally, Rigel by default only embeds 16 landmark nodes relative to each other. These nodes are the primary landmark nodes. The rest of the landmarks are embedded relative only to the 16 primary landmark nodes. Furthermore, Rigel embeds the non-landmark nodes relative to only 16 randomly chosen landmark nodes. We compare the performance of Hypy (Algorithm 10) with Rigel; both use the same 100 landmarks generated using Algorithm 3. For Hypy, the landmarks indices are partitioned with $$\mathscr{B}_1 = \{1,\dots,32\},\;\mathscr{B}_2 = \{33,\dots,64\}$$ and $$\mathscr{B}_3 = \{65,\dots,100 \}$$ (see (4.1)). Hypy performs the landmarks embedding 560 times (equal to the number of processors on our computational server). The best result of these embeddings, in terms of the objective function defined by (2.4) with respect to $$\mathscr{A} = \mathscr{L}$$, is used in the results and subsequent computations. Since Rigel’s initial guess is unchanging, we only run it once and use that result for subsequent computation. The non-landmarks are split into groups of size $$250,000$$ nodes, that is, $$b \approx 250,000$$ (see (4.2)). Both Hypy and Rigel distribute the non-landmark embedding among the 560 processors. Before we take a look at the results, we note that Rigel failed to produce a solution for the Friendster dataset due to its large size, and so that dataset is not discussed in this subsection. We do include Friendster results in Section 5.2. In the leftmost columns of Figs 1 and 2, we show the mean squared error (MSE) and standard deviation for all pairs of landmark nodes, that is, $$(i,j) \in \mathscr{L} \otimes \mathscr{L}$$ with $$i \neq j$$ for $$d=2,3,\dots,10$$. The mean squared error for the landmarks is the average squared error between the true graph distance and the estimated hyperbolic distance among all unique landmark nodes pairs. For the non-landmarks, the mean squared error is defined as the average error in distance between each non-landmark node and all non-landmark nodes. In general, the MSE decreases with the dimension because there are more degrees of freedom. An increase in MSE as the dimension increase indicates that the optimization hit a local rather than a global solution. Observe that Hypy always obtains a lower landmark MSE value than Rigel. There are several reasons for this. First, Hypy has an additional free parameter–-curvature ($$\kappa$$). Second, Hypy is using LBFGS, a better optimization algorithm than the Nelder–Meade simplex algorithm. Third, Hypy optimizes the entire set of landmark nodes relative to each other rather than doing a small subset and optimizing the rest relative to those. Finally, Hypy is run multiple times with different random initializations. Fig. 1. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) Amazon, (b) DBLP and (c) YouTube. Fig. 1. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) Amazon, (b) DBLP and (c) YouTube. Fig. 2. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) LiveJournal and (b) Orkut. Fig. 2. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) LiveJournal and (b) Orkut. In the middle column of Figs 1 and 2, we show the MSE and standard deviation for all pairs in $$(i,j) \in \mathscr{L} \otimes (\mathscr{V} \setminus \mathscr{L})$$, that is, between the landmark and non-landmark nodes for $$d=2,3,\dots,10$$. Observe again that Hypy always achieves a lower non-landmark MSE than Rigel, due to the combination of a better landmark embedding to begin with and the improved optimization procedure for embedding the non-landmarks. In the rightmost column of Figs 1 and 2, we show the MSE for a validation set, that is, 100,000 randomly selected pairs $$(i,j) \in (\mathscr{V} \setminus \mathscr{L}) \otimes (\mathscr{V} \setminus \mathscr{L})$$. Note that the validation set is not used anywhere in the optimization procedure, so it serves as an independent check on the quality of the embedding. We indicate the minimum validation error with min for each method. Table 2 summarizes the validation results for the five test graphs. In all cases, Hypy improves the MSE versus Rigel by 23–44%. Moreover, the optimal embedding dimension ($$d$$) is generally lower for Hypy. This is important because it fundamentally changes our understanding of the intrinsic dimensionality of the graph. Table 2 Comparison of Rigel and Hypy using the same $$\ell=100$$ landmarks and a validation set of 100,000 random pairs for $$d = 1,\dots,10$$. Showing optimal embedding dimension and $$\kappa$$ (for Hypy) corresponding to the best validation set MSE. Used $$\kappa=1$$ for Rigel. The last column shows the percentage decrease in validation set MSE over Rigel for the validation dataset Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Table 2 Comparison of Rigel and Hypy using the same $$\ell=100$$ landmarks and a validation set of 100,000 random pairs for $$d = 1,\dots,10$$. Showing optimal embedding dimension and $$\kappa$$ (for Hypy) corresponding to the best validation set MSE. Used $$\kappa=1$$ for Rigel. The last column shows the percentage decrease in validation set MSE over Rigel for the validation dataset Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 The wall clock runtimes of Hypy and Rigel are compared in Fig. 3. The runtime for Hypy is the total for the 560 initial guesses (in parallel) for the landmark embedding and the non-landmark embedding (also in parallel). The runtime for Rigel is the total for a single run for the landmark embedding and the non-landmark embedding (in parallel). The figure shows that the runtime of Hypy does not depend on the embedding dimension ($$d$$); in fact, runtime slightly decreases as the dimension increases, indicating that the optimization routine has an easier job of embedding the graph in higher dimensions. Rigel, on the other hand, becomes much more expensive as the dimension increases. Moreover, for all graphs and all dimensions, Hypy is significantly faster than Rigel, sometimes by almost an order of magnitude. Fig. 3. View largeDownload slide Runtime (wall clock) comparison on five graphs between our Hypy (H) implementation and the existing Rigel (R) code for $$\ell = 100$$. The timing of Hypy is for 560 initial guesses (in parallel) and also parallelizes the non-landmark embedding. The timing of Rigel is for a single landmark embedding and parallelizng the non-landmark embedding over 560 processors. Fig. 3. View largeDownload slide Runtime (wall clock) comparison on five graphs between our Hypy (H) implementation and the existing Rigel (R) code for $$\ell = 100$$. The timing of Hypy is for 560 initial guesses (in parallel) and also parallelizes the non-landmark embedding. The timing of Rigel is for a single landmark embedding and parallelizng the non-landmark embedding over 560 processors. Table 3 Optimal embedding dimension for $$d = 1,\dots,10$$, corresponding MSE and percent decrease for $$\ell = 100, 400$$ and $$800$$. The percent decrease is calculated relative to the MSE for $$\ell = 100$$ Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Table 3 Optimal embedding dimension for $$d = 1,\dots,10$$, corresponding MSE and percent decrease for $$\ell = 100, 400$$ and $$800$$. The percent decrease is calculated relative to the MSE for $$\ell = 100$$ Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) In Fig. 4, we illustrate the modest benefit of allowing the curvature to vary as compared to fixing it to $$\kappa = 1$$. In this experiment, we vary the dimension from 2 to 10 and run each method 56 times with 56 different initial starting points (but the same starting points with and without fixed curvature). Fig. 4. View largeDownload slide Improvement in Hypy by letting $$\kappa$$ vary versus keeping it fixed at $$\kappa=1$$. (a) Amazon, (b) YouTube and (c) LiveJournal. Fig. 4. View largeDownload slide Improvement in Hypy by letting $$\kappa$$ vary versus keeping it fixed at $$\kappa=1$$. (a) Amazon, (b) YouTube and (c) LiveJournal. Using the optimal dimensions and curvature in Table 2, we break out the MSE values by true distance (# hops) for the validation data in Fig. 5. Note that the y-axis is the log of the MSE scores, in order to make the larger values easier to see. The MSE is low for the most frequent hop counts. Assuming this histogram is representative of the pair count for the entire graph, this is reasonable since the objective function (2.2) penalizes the discrepancies between the true and hyperbolic distances which occur most often. For almost all hop counts, especially those which occur with lowest frequencies, Hypy has a smaller MSE than Rigel. Fig. 5. View largeDownload slide Breakdown of validation errors by hop count (bottom) and number of pairs per hop counts (top). (a) Amazon, (b) DBLP, (c) YouTube, (d) LiveJournal and (e) Orkut. Fig. 5. View largeDownload slide Breakdown of validation errors by hop count (bottom) and number of pairs per hop counts (top). (a) Amazon, (b) DBLP, (c) YouTube, (d) LiveJournal and (e) Orkut. 5.2 Increasing the number of landmarks We now study the effect of increasing the number of landmarks with Hypy for the various graphs listed in Table 1, including the Friendster graph. We omit Rigel because our preliminary experiments indicated that increasing the number of landmarks did not reduce the validation error, most likely due to the fact that Rigel only use 16 relative landmarks to perform the embedding. Figure 6 shows the MSE for the validation data set for $$\ell = 100$$, $$400$$ and $$800$$. For $$\ell=100$$, we use the same landmark set partitioning as in Section 5.1; for $$\ell=400$$, we split the landmarks into sets of sizes 32, 32, 64 and 270; and for $$\ell=800$$, we use sizes 32, 32, 64, 128 and 544. There is a significant drop in MSE when going from $$\ell = 100$$ to $$\ell = 400$$ landmarks, but much less of a drop from $$\ell = 400$$ to $$\ell = 800$$, especially for graphs with a larger number of nodes. Also, the optimal embedding dimension ranges from eight to ten dimensions for $$\ell > 100$$. We can reasonably conclude that increasing the number of landmarks does improve the embedding, but it is less significant for larger graphs, that is, graphs with more than a million nodes. Table 3 summarizes the optimal embedding dimension for each choice of $$\ell$$ and the improvement compared to $$\ell=100$$. Fig. 6. View largeDownload slide MSE for validation data for Hypy comparing $$\ell = 100, 400$$ and $$800$$. (a) Amazon: validation, (b) DBLP: validation, (c) Youtube: validation, (d) LiveJournal: validation, (e) Orkut: validation and (f) Friendster: validation. Fig. 6. View largeDownload slide MSE for validation data for Hypy comparing $$\ell = 100, 400$$ and $$800$$. (a) Amazon: validation, (b) DBLP: validation, (c) Youtube: validation, (d) LiveJournal: validation, (e) Orkut: validation and (f) Friendster: validation. Figure 7 shows the run time for Hypy with different choices of $$\ell$$. As expected, the time for the complete embedding procedure using $$400$$ and $$800$$ landmarks is longer, roughly one to two orders of magnitude larger, respectively, than the embedding time for $$\ell = 100$$. Also, for $$\ell > 100$$, the embedding time as a function of dimension is roughly flat, whereas we see a slight decrease in seconds per dimension for $$\ell = 100$$. So while the optimization routine may have an easier time of embedding in higher dimensions, the objective functions for $$\ell > 100$$ are more costly to evaluate. Fig. 7. View largeDownload slide Runtime comparison for each graphs at different landmarks. The numbers in parentheses indicate the number of landmarks. Fig. 7. View largeDownload slide Runtime comparison for each graphs at different landmarks. The numbers in parentheses indicate the number of landmarks. 6. Discussion and conclusions In this article, we propose a fast and efficient hyperbolic embedding algorithm with a landmark approach that uses an LBFGS optimization routine with recursively built initial starting points. Our approach is an improvement over Rigel because it incorporates analytically computed gradients, includes the curvature as an optimization variable, and uses LBFGS to perform the optimization. We show that Hypy can provide a significant improvement in accuracy over Rigel, sometimes by as much as 44% (see Table 2). Moreover, Hypy can be almost an order of magnitude faster for large graphs (see Fig. 7). Hypy is designed to scale well for very large graphs by allowing parallelization of the landmark and non-landmark embedding procedures and has been successfully tested on graphs with 65 million nodes and 1.8 billion edges. Furthermore, because our routine is so efficient, we are able to explore embeddings with a number of landmarks greater than 100, which is often the limit in many studies up until now. These tests show that, indeed, more landmarks can further reduce the MSE validation errors by as much as 33% (see Table 3). In cases where multiple cores are not available, most of the examples in this article can still be computed using a single core machine in a reasonable amount time, that is, less than a day. The landmark embedding procedure is the cheapest part of the procedure and can be performed on the order of seconds to minutes for graphs with node sizes ranging from thousands to a few million, respectively. In order to run the landmark embedding at multiple starting points, this will have to be done in serial and certainly take longer, but will scale linearly with the number of random starting points. The non-landmark embedding procedure will most likely be the computational bottleneck due to shear number of nodes in some graphs. For graphs with a few million nodes and embeddings with 100 landmarks, the non-landmark embedding procedure on a single core may need to run on the order of hours, whereas landmarks with sizes greater than 100 can take days. For the largest data set, that is, Friendster with roughly 65 million nodes, running this on a single core would take months and would thus be impractical. Also, the preprocessing step, for example, computing the landmark nodes and distances, can take hours on a single core for graphs with nodes in the millions. We have shown the benefit of an improved optimization procedure, and future work may consider more sophisticated methods such as stochastic gradient descent, especially for extremely large problems. Lastly, the code will be available on github. Funding This work was supported in part by the Defense Advanced Research Project Agency (DARPA). Acknowledgement Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. We would also like to thank the reviewers for their time and help in finalizing this article. References 1. Krioukov D. , Papadopoulos F. , Kitsak M. , Vahdat A. & Boguñá M. ( 2010 ) Hyperbolic geometry of complex networks. Phys. Rev. E , 82 , 036106 . Google Scholar CrossRef Search ADS 2. Ng T. S. E. & Zhang H. ( 2002 ) Predicting internet network distance with coordinates-based approaches. INFOCOM 2002: Proceedings of the Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. pp. 170 – 179 . 3. Zhao X. , Sala A. , Zheng H. & Zhao B. Y. ( 2011 ) Efficient shortest paths on massive social graphs. COLCOM’11: Proceedings of the 7th International Conference on Collaborative Computing: Networking, Applications and Worksharing ( Georgakopoulos D. and Joshi J. eds). Orlando, FL : IEEE , pp. 77 – 86 . 4. Shavitt Y. & Tankel T. ( 2004 ) Big-bang Simulation for embedding network distances in Euclidean space. IEEE/ACM Trans. Netw. , 12 , 993 – 1006 . Google Scholar CrossRef Search ADS 5. Shavitt Y. & Tankel T. ( 2004 ) On the curvature of the internet and its usage for overlay construction and distance estimation. INFOCOM 2004: Proceedings of the Twenty-Third Annual Joint Conference of the IEEE Computer and Communications Societies. Hong Kong : IEEE , pp. 374 – 384 . 6. Boguñá M. , Papadopoulos F. & Krioukov D. ( 2010 ) Sustaining the internet with hyperbolic mapping. Nat. Commun. , 1 , 62 . Google Scholar CrossRef Search ADS PubMed 7. Papadopoulos F. , Kitsak M. , Serrano M. A. , Boguñá M. & Krioukov D. ( 2012 ) Popularity versus similarity in growing networks. Nature , 489 , 537 – 540 . Google Scholar CrossRef Search ADS PubMed 8. Tang L. & Crovella M. ( 2003 ) Virtual landmarks for the internet. IMC’03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement. New York, NY: ACM, pp. 143 – 152 . 9. Ajwani D. , Meyer U. & Veith D. ( 2015 ) An I/O-efficient distance oracle for evolving real-world graphs. ALENEX ’15: Proceedings of the Meeting on Algorithm Engineering & Expermiments ( Goodrich M. and Mitzenmacher M. eds). Arlington, VA : SIAM , pp. 159 – 172 . Google Scholar CrossRef Search ADS 10. Zhao X. , Sala A. , Wilson C. , Zheng H. & Zhao B. Y. ( 2010 ) Orion: shortest path estimation for large social graphs. WOSN’10: Proceedings of the 3rd Wonference on Online Social Networks ( El Abbadi A. and Krishnamurthy B. eds). New York, NY : ACM , pp. 1 – 9 . 11. Papadopoulos F. , Psomas C. & Krioukov D. ( 2015 ) Network mapping by replaying hyperbolic growth. IEEE/ACM Trans. Netw. , 23 , 198 – 211 . Google Scholar CrossRef Search ADS 12. Papadopoulos F. , Aldecoa R. & Krioukov D. ( 2015 ) Network geometry inference using common neighbors. Phys. Rev. E , 92 . 13. Nelder J. A. & Mead R. ( 1965 ) A simplex method for function minimization. Comput. J. , 7 , 308 – 313 . Google Scholar CrossRef Search ADS 14. Nocedal J. ( 1980 ) Updating quasi-Newton matrices with limited storage. Math. Comput. , 35 , 773 – 782 . Google Scholar CrossRef Search ADS 15. Kolda T. G. , Lewis R. M. & Torczon V. ( 2003 ) Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. , 45 , 385 – 482 . Google Scholar CrossRef Search ADS 16. McKinnon K. I. M. ( 1998 ) Convergence of the Nelder–Mead simplex method to a nonstationary point. SIAM J. Optim. , 9 , 148 – 158 . Google Scholar CrossRef Search ADS 17. Liu D. & Nocedal J. ( 1989 ) On the limited memory BFGS method for large scale optimization. Math. Program. , 45 , 503 – 528 . Google Scholar CrossRef Search ADS 18. Nocedal J. & Wright S. J. ( 2006 ) Numerical Optimization . Springer . 19. Lee C. Y. ( 1961 ) An algorithm for path connections and its applications. IRE Trans. Electron. Comput. , EC-10 ( 3 ), 346 – 365 . Google Scholar CrossRef Search ADS 20. Leskovec J. & Krevl A. ( 2014 ) SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. 21. Leskovec J. & Sosič R. ( 2016 ) SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Sys. Technol. , 8 , 1:1 – 1:20 . 22. Zhao X. , Sala A. , Zheng H. & Zhao B. Y. ( 2013 ) Rigel. http://http://sandlab.cs.ucsb.edu/rigel/. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Complex Networks Oxford University Press

An improved hyperbolic embedding algorithm

, Volume Advance Article (3) – Dec 11, 2017
21 pages

/lp/oxford-university-press/an-improved-hyperbolic-embedding-algorithm-bwcNgJtYjY
Publisher
Oxford University Press
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
ISSN
2051-1310
eISSN
2051-1329
D.O.I.
10.1093/comnet/cnx034
Publisher site
See Article on Publisher Site

Abstract

Abstract Because hyperbolic space has properties that make it amenable to graph representations, there is significant interest in scalable hyperbolic-space embedding methods. These embeddings enable constant-time approximation of shortest-path distances, and so are significantly more efficient than full shortest-path computations. In this article, we improve on existing landmark-based hyperbolic embedding algorithms for large-scale graphs. Whereas previous methods compute the embedding by using the derivative-free Nelder–Mead simplex optimization method, our approach uses the limited-memory BFGS (LBFGS) method, which is quasi-Newton optimization, with analytic gradients. Our method is not only significantly faster but also produces higher-quality embeddings. Moreover, we are able to include the hyperbolic curvature as a variable in the optimization. We compare our hyperbolic embedding method implementation in Python (called Hypy) against the best publicly available software, Rigel. Our method is an order of magnitude faster and shows significant improvements in the accuracy of the shortest-path distance calculations. Tests are performed on a variety of real-world networks, and we show the scalability of our method by embedding a graph with 1.8 billion edges and 65 million nodes. 1. Introduction and related work Hyperbolic embeddings are of growing interest because they exploit fundamental similarities between the geometry of hyperbolic spaces and properties of scale-free networks [1]. In general, the idea of an embedding is to map each node of a graph to a coordinate in some space, for example Euclidean, spherical or hyperbolic [2]. Once the mapping is obtained, which is by no means trivial, estimating the shortest-path distance between a pair of nodes is as simple as plugging their hyperbolic coordinates into the distance formula, that is, a constant-time calculation. This allows for extremely efficient estimates of quantities that may depend on distance calculations, such as graph centrality [3]. The most common type of embedding is into Euclidean space. Here, the distance formula is given by the usual Euclidean-squared distance. Shavitt and Tankel [4] explored Euclidean graph embedding as early as 2003. In subsequent years, they also explored embedding Internet graphs into a hyperbolic space, noting that the geometry of the Internet graph resembles the geometry of the hyperbolic spaces. In particular, Shavitt and Tankel noted that the Internet graph had a tightly connected core and long-stretched tendrils [5]. With this qualitative assessment of the geometry of Internet graphs, they achieve greater accuracy in shortest-path estimates by embedding into hyperbolic rather than Euclidean space. This assessment of the structure and geometry of graphs was more rigorously studied by Krioukov et al. [1]. They show how the geometry of hyperbolic space allows a more natural construction of a scale-free network that retains many characteristic properties, such as degree distributions that follow a power law. This particular work led to a probabilistic embedding algorithm in which a likelihood function was introduced, where the likelihood of a connection between two nodes exponentially decreases with distance from the origin [6, 7]. The exact hyperbolic distances are not directly enforced in this approach, but rather the average degree, power-law exponent, and average clustering coefficients of the graph are used as hyperparameters in the likelihood function. Once the likelihood function has been defined, a brute force embedding of all nodes at once is too computationally expensive for most networks, and so Boguna et al. [6] suggest embedding a representative, but much smaller, subgraph first and then iteratively adding more and more layers of the full graph. In some ways, this iterative subgraph style of embedding is similar to an approach known as the landmark method, which dates back to 2003 [8]. However, one major difference is that the true graph distances are directly utilized in the landmark embedding approach, whereas only the adjacency matrix is used in the likelihood approach. The work in this article is more closely related to landmark type methods, which have recently been used for embedding million- and billion-node graphs [3, 9, 10]. For more recent examples of the likelihood based approaches, we refer the reader to [11] and [12]. While these methods propose a global maximum likelihood algorithm, as opposed to our method which uses multiple solves of a local optimization algorithm, they are limited in their experiments to a two-dimensional embedding and graphs with only a few thousand nodes. It is important to note, however, that while the landmark approach is designed to give accurate estimates of graph distances between nodes, the likelihood-based approaches are not necessarily designed to do the same. For example, likelihood based methods can play a valuable role in helping to predict missing links in graphs. Thus, a comparison between the two methods should be taken with a grain of salt with the understanding that both approaches may have different end goals. Our proposed approach is an improvement on a well-known landmark-based hyperbolic-embedding method called Rigel [3]. Our method differs in the following ways: (1) the choice of the landmarks is random, with probabilities proportion to the node degrees; (2) the embedding algorithm uses the limited-memory BFGS (LBFGS) quasi-Newton optimization method instead of the Nelder–Mead simplex optimization method; (3) we derive and use analytical gradients; and (4) we optimize the hyperbolic curvature in addition to the hyperbolic coordinates of the vertices. Moreover, our hyperbolic embedding algorithm is implemented in Python (including Numpy and Scipy libraries) with an option to use MPI for large graphs. This article is organized as follows: Section 2 presents the problem formulation, introduces the hyperbolic distance formula, and derives the hyperbolic embedding objective functions. Section 3 explains the benefits of using LBFGS over Nelder–Mead and provides the gradient formulas with their respective implementation algorithms. Section 4 details our new hyperbolic embedding algorithms, from how we choose the landmarks all the way to how we perform the full embedding procedure. Section 5 illustrates the performance of our method in comparison to Rigel for five different graphs with sizes from 300 thousand nodes and 1 million edges to 3 million nodes and 117 million edges. We also run on a graph that is too large for Rigel, with 66 million nodes and 1.8 billion edges. Finally, Section 6 concludes our article with a summary of results. 2. Problem formulation Let $$\mathscr{G} = (\mathscr{V},\mathscr{E})$$ where $$\mathscr{V}$$ represents the vertices and $$\mathscr{E}$$ represents the edges. Let $$n=|\mathscr{V}|$$ denote the number of nodes and $$m=|\mathscr{E}|$$ denote the number of edges. 2.1 Graph distance The graph distance between two nodes $$i,j \in \mathscr{V}$$ is defined as the shortest path between $$i$$ and $$j$$ and denoted by $$\Delta_{ij}$$, which we hereafter refer to as the true distance. It is the minimum number of edges that need to be traversed to go from $$i$$ to $$j$$. For instance, $$\Delta_{ii} = 0$$; if $$(i,j) \in E$$, $$i \ne j$$, then $$\Delta_{ij}=1$$; and so on. The difficulty is that computing the distances can be prohibitively expensive, especially for large graphs. We assume that the graph is connected so that $$\Delta_{ij}$$ is finite for all pairs $$(i,j)\in \mathscr{V} \otimes \mathscr{V}$$. For hyperbolic embedding, the goal is to determine coordinates in hyperbolic space for each vertex so that the hyperbolic distance between coordinate representations can be used to estimate the true (shortest path) distance between vertices [5]. 2.2 Hyperbolic space and distances We let $$\mathscr{H}^d$$ denote the hyperbolic space of dimension $$d$$, that is, \begin{align*} \mathscr{H}^d = \left\{ {{\mathbf{u} : u_1^2 + \dots + u_d^2 - u_{d+1}^2 = -1}} \right\}\!. \end{align*} We consider only points in the upper sheet of the hyperboloid, so that the last coordinate is determined according to the first $$d$$ coordinates: \begin{align*} u_{d+1} = \sqrt{ u_1^2 + \dots + u_d^2 + 1 }. \end{align*} We hereafter represent any points in $$\mathscr{H}^d$$ by only the first $$d$$ coordinates that is, $$\mathbf{u} = (u_1,\dots,u_d)^{\top}$$. To define distance in hyperbolic space, we first give some preliminary definitions. For two points $$\mathbf{u},\mathbf{v} \in \mathscr{H}^d$$, let \begin{align*} \mathbf{u} \cdot \mathbf{v} = \sum_{k=1}^{d} u_k v_k {\quad\text{and}\quad} \| \mathbf{u} \| = \sqrt{ \mathbf{u} \cdot \mathbf{u} } \end{align*} be the standard Euclidean dot product and norm, respectively. The hyperbolic dot product is defined as \begin{align*} \mathbf{u} \ast \mathbf{v} = \mathbf{u} \cdot \mathbf{v} - \sqrt{ ( \| \mathbf{u} \|^2 + 1) ( \|\mathbf{v}\|^2 + 1) }, \end{align*} and then we define the hyperbolic distance between $$\mathbf{u},\mathbf{v} \in \mathscr{H}^d$$ to be $$d_\mathscr{H} (\mathbf{u},\mathbf{v},\kappa) = \frac{1}{\sqrt{\kappa}} {\text{arccosh}}( -\mathbf{u} \ast \mathbf{v} ),$$ (2.1) where $$\kappa > 0$$ and $$-\kappa$$ is commonly referred to as the curvature. Curvature does not actually change the shape of the hyperbolic half sheet $$\mathscr{H}^d$$ but instead changes the metric we use to measure distances between points, for example, geodesics. Note that the distance is symmetric so that $$d_\mathscr{H} (\mathbf{u},\mathbf{v}) = d_\mathscr{H} (\mathbf{v},\mathbf{u})$$. 2.3 Embedding objective and representation For each $$i \in \mathscr{V}$$, we let $$\mathbf{u}_i \in \mathscr{H}^d$$ denote its representation in hyperbolic space. For any subset $$\mathscr{A} \subseteq \mathscr{V}$$, we let $$\mathbf{U}_{\mathscr{A}}$$ denote the matrix of size $$d \times |\mathscr{A}|$$ whose columns are the representations, that is, $$\mathbf{u}_i$$ for all $$i \in \mathscr{A}$$. For instance, $$\mathbf{U}_\mathscr{V} = \left[ {\matrix{{{{\bf{u}}_1}} & {{{\bf{u}}_2}} & \cdots & {{{\bf{u}}_n}} \cr } } \right]$$. In our later discussion of algorithms, we assume for ease of notation that the matrix $$\mathbf{U}_{\mathscr{A}}$$ comes with implicit knowledge of the associated set $$\mathscr{A}$$. Since our goal is to choose the representations so that they minimize the embedding error, we define what this means precisely. The embedding error for vertices $$i,j\in \mathscr{V}$$ and curvature $$\kappa$$ is $$e(\mathbf{u}_i,\mathbf{u}_j,\kappa) = \frac{1}{2} (d_{\mathscr{H}}(\mathbf{u}_i,\mathbf{u}_j,\kappa) - \Delta_{ij})^2,$$ (2.2) where $$\Delta_{ij}$$ is the true distance as described in Section 2.1. For an arbitrary set $$\mathscr{A} \subseteq \mathscr{V}$$, we define two error measures. First, the measure $${\hat E_{\mathscr{A}}(\mathbf{u}_j,\kappa)} = \sum_{i\in \mathscr{A}} e(\mathbf{u}_i,\mathbf{u}_j,\kappa)$$ (2.3) is the embedding error for vertex $$j$$ with respect to the fixed embedding $$\mathbf{U}_{\mathscr{A}}$$ for the set $$\mathscr{A}$$. Second, the measure $${E_{\mathcal{A}}(\mathbf{U}_{\mathcal{A}},\kappa)} = \sum_{{i,j \in \mathscr{A}}\atop{i < j}} e(\mathbf{u}_i,\mathbf{u}_j,\kappa)$$ (2.4) is the sum of the embedding errors for all unique vertex pairs in $$\mathscr{A} \otimes \mathscr{A}$$. Ideally, we would pick $$\mathbf{U}_{\mathscr{V}}$$ and $$\kappa$$ such that the total embedding error $${E_{\mathscr{V}}(\mathbf{U}_{\mathscr{V}},\kappa)}$$ is minimized. This ensures that the hyperbolic distances are as close as possible to the true distances. Computing this objective function for large $$n$$ is prohibitive. We discuss how to approximately solve this optimization problem in the next subsection. We have not discussed the choice of the embedding dimension, $$d$$, which is typically between 2 and 10. We will solve the problems for several values of $$d$$ and pick the one the yields the best embedding by some validation metric. We discuss its selection in more detail in Section 5. 2.4 Optimization formulation using landmarks One problem with minimizing $${E_{\mathscr{V}}(\mathbf{U}_{\mathscr{V}},\kappa)}$$ directly is that it requires us to compute the true distances, $$\Delta_{ij}$$, between all $$n(n-1)/2$$ vertex pairs, which is prohibitively expensive and moreover defeats the purpose of estimating the distances. Instead, we use the idea of landmarks, introduced and studied in [8, 10]. We first embed the landmarks optimally in relation to each other and then embed the other points in relation to the landmarks. The idea is that embedding a few important vertices will determine the placement of the remainder. Let $$\mathscr{L} \subset \mathscr{V}$$ denote the set of landmarks and $$\ell = |\mathscr{L}|$$ be the number of landmarks. We typically choose $$\ell \ll n$$ to be very small, for example, 100. We find coordinates for the landmarks, $$\mathbf{U}_{\mathscr{L}}$$ of size $$d \times \ell$$, and the curvature, $$\kappa$$, that minimize the total landmark embedding error given by $${E_{\mathscr{L}}(\mathbf{U}_{\mathscr{L}},\kappa)}$$, which only requires $$\ell^2$$ values of $$\Delta_{ij}$$. The total number of optimization parameters in this case is $$\ell d + 1$$. We embed the landmarks by minimizing the discrepancy between the true and hyperbolic distances for every pair of nodes in the landmark set. Once the landmark embedding ($$\mathbf{U}_{\mathscr{L}}$$ and $$\kappa$$) is determined, then we determine the coordinate of every non-landmark. For each $$j \in \mathscr{V}\setminus\mathscr{L}$$ we find the embedding $$\mathbf{u}_j$$ of length $$d$$ that minimizes the discrepancy between $$\mathbf{u}_j$$ and its distance to every landmark as measured by $${\hat E_{j}(\mathbf{u}_j,\kappa)}$$. There are a total of $$(n-\ell)$$ optimization problems, each of which has only $$d$$ parameters. Moreover, since each non-landmark coordinate $$\mathbf{u}_j$$ for $$j \in \mathscr{V}\setminus\mathscr{L}$$ can be embedded independently of the others, the embedding of the non-landmark vertices is easily parallelizable. More specifically, each non-landmark coordinate $$\mathbf{u}_j$$ is chosen by minimizing 2.3 with $$\mathscr{A}$$ set to $$\mathscr{L}$$, where each of these objective functions are independent of one another. The advantage of the landmark approach is that we need only compute $$\Delta_{ij}$$ for each $$i \in \mathscr{L}$$ and $$j \in \mathscr{V}$$. In other words, we compute $$O(n)$$ distances rather than $$n^2$$. This preprocessing step can be quite expensive, requiring $$\ell$$ single-source-shortest-path computations, but is a necessity for any landmark approach. 3. Optimization improvements Rigel uses the Nelder–Mead simplex algorithm [13] to compute the minimums for the landmark and non-landmark embedding objective functions (2.4), while we propose using the LBFGS quasi-Newton approach [14]. The Nelder–Mead simplex algorithm is a direct search method that does not require explicit gradient calculations, which makes it an attractive option when the analytic gradient is difficult or impossible to obtain. Briefly, the Nelder–Mead algorithm evaluates a sequence of $$p+1$$ dimensional simplices, where $$p$$ is the dimension of the objective function. Each successive simplex attempts to move closer to the minimum, requiring only the evaluation of the objective function at the vertices of the simplices. However, for a broad class of problems, the Nelder–Mead simplex algorithm can be unstable and fail unpredictably [15]. In fact, it can be shown that the Nelder–Mead can fail to converge to a stationary point even if the objective function is strictly convex with three continuous derivatives [16]. For these reasons, we propose using an approach that uses gradient information and guarantees convergence. LBFGS is a quasi-Newton optimization method used to efficiently find local minima by utilizing gradient information. It is based on Newton’s method where successive guesses for the minimum are computed by approximating the objective function with a local quadratic Taylor series expansion. At each iteration, the local quadratic approximation can minimized by inverting the Hessian and computing the gradient. To avoid the computational burden of evaluating the full Hessian matrix at each iteration, the LBFGS method directly estimates the inverse of the Hessian matrix using rank-one updates based on the position and gradient vectors. This results in a computational cost that scales linearly with the dimension at each iteration [14] and makes the LBFGS algorithm well suited for large-scale problems [17]. Moreover, convergence to a stationary point can be guaranteed when the objective function is twice continuously differentiable [18]. If the function is also strictly convex, then we are assured convergence to the global minimizer. In the present context, our objective function is not convex, which means that there is no guarantee of convergence to the global minimum. However, by running the LBFGS algorithm multiple times with different starting points, we can obtain a set of local minimum solutions and select the best of these as an approximation to the global minimum. This procedure is described in more detail in Section 4.5. Below, we derive the analytic gradient needed for the LBFGS method. Theorem 3.1 The gradient for $$e(\mathbf{u},\mathbf{v},\kappa)$$ with respect to $$\mathbf{u}$$ is $${\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}} = \left( \frac{ d_\mathscr{H}(\mathbf{u},\mathbf{v},\kappa) - \Delta} {\sqrt{\kappa ((\mathbf{u} \ast \mathbf{v}) ^2 - 1) }} \right) \left( \frac{ \sqrt{ \| \mathbf{v} \|^2+1 }}{ \sqrt{ \| \mathbf{u} \|^2+1 }} \mathbf{u} - \mathbf{v} \right),$$ (3.1) and with respect to $$\kappa$$ is $${\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} = - \frac{ \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) {\text{arccosh}}(-\mathbf{u} \ast \mathbf{v})} {2\kappa^{3/2}},$$ (3.2) where $$\Delta$$ is the true distance from $$\mathbf{u}$$ to $$\mathbf{v}$$. Proof. To derive (3.1), by the chain rule, we have: \begin{align*} {\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa) }{\partial \mathbf{u}}} &= \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) \; {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}}, \\ {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \mathbf{u}}} &= \frac{1}{\sqrt{\kappa}} \; \frac{1}{\sqrt{ (\mathbf{u} \ast \mathbf{v})^2-1}} \; {\frac{ \partial (-\mathbf{u} \ast \mathbf{v})}{\partial \mathbf{u}}}, \\ {\frac{ \partial (-\mathbf{u} \ast \mathbf{v})}{\partial \mathbf{u}}} &= \frac{ (\| \mathbf{v} \|^2+1)}{\sqrt{ (\| \mathbf{u} \|^2+1)(\| \mathbf{v} \|^2+1)}} \mathbf{u} - \mathbf{v} . \end{align*} Similarly, (3.2) also follows from the chain rule by \begin{align*} {\frac{ \partial e(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} &= \left( d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa) - \Delta \right) \; {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}}, \\ {\frac{ \partial d_{\mathscr{H}}(\mathbf{u},\mathbf{v},\kappa)}{\partial \kappa}} &= \frac{-1}{2} \kappa^{-3/2} {\text{arccosh}}(-\mathbf{u} \ast \mathbf{v}).\\[-42pt] \end{align*} □ For implementation purposes, Algorithms 1 and 2 detail the matrix-vector calculations used to efficiently compute the hyperbolic distances and gradients between landmarks and non-landmarks, respectively. Note that the operation $$\odot$$ (e.g., in Line 9 in Algorithm 1), refers to elementwise multiplication for matrices and vectors. Algorithm 1 first computes $$e(\mathbf{u},\mathbf{v},\kappa)$$ for all non-unique pairs $$\mathbf{u},\mathbf{v} \in \mathscr{L}$$. While this is redundant due to the symmetry in (2.2), the calculations are much simpler and the additional computational cost is negligible. To compute $${E_{\mathcal{A}}(\mathbf{U}_{\mathcal{A}},\kappa)}$$ over unique pairs of indices in $$\mathscr{L}$$, as in (2.4), we simply include a factor of $$1/2$$ in Line 7 and Line 14 in Algorithm 1. If we define $$\mathscr{A} \subseteq \mathscr{V}$$ and $$k = |\mathscr{A}|$$, then the function and gradient evaluation for the embedding of points in $$\mathscr{A}$$ with respect to each other in Algorithm 1 involves $$\mathscr{O}(k^2 d)$$ floating point operations, while the function and gradient evaluation for the embedding of one point with respect to $$\mathscr{A}$$ in Algorithm 2 involves $$\mathscr{O}(k d)$$ floating point operations. 4. New hyperbolic embedding algorithm Following the landmark-based embedding procedure outlined in Section 2.4, we describe our approach. We first explain landmark selection in Section 4.1 and computation of true graph distances in Section 4.2. We describe a simple embedding method in Section 4.3 and a recursive version in Section 4.4, which enables efficient use of larger numbers of landmarks. 4.1 Selecting landmarks Ideally, the set of landmarks will be such that any vertex is close to a landmark and such that the landmarks are not redundant. In terms of selecting landmarks, Zhao et al. [10] compared choosing vertices at random so they are not close to each other, selecting the vertices with the highest degrees so that they have high centrality, and selecting the vertices that have high degree but are also at least $$k$$ hops away (for $$k = 2, 3$$ and $$4$$) from one another. Surprisingly, the differences in relative error between all these landmark-selection techniques were fairly small. Consequently, Rigel simply chooses the highest-degree nodes to be its landmarks [3]. While this approach is computationally inexpensive, it does not guarantee that the selected landmark nodes are far apart. A uniform random sampling technique tends to avoid choosing landmarks too close together, but does not necessarily choose landmarks with high degree. Our approach combines both techniques by randomly sampling the vertices proportional to their degrees. This allows us to take advantage of the benefits of both random sampling and selection-by-highest-degree techniques, while still remaining computationally inexpensive (this approach does not require any distance calculations between landmark nodes). Our landmark selection algorithm is shown in Algorithm 3. Algorithm 1 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 1 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 2 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 2 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 3 View largeDownload slide Landmark Selection Algorithm 3 View largeDownload slide Landmark Selection Algorithm 4 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. Algorithm 4 View largeDownload slide Distance and gradient calculation for landmark embedding algorithm. 4.2 Computing true graph distances We calculate the $$\Delta_{ij}$$ values for the landmarks only, as shown in Algorithm 4. Typically, the inner loop is calculated via a single-source-shortest-path method [19], that is, it finds the shortest path to every node from each $$i \in \mathscr{L}$$. 4.3 Simple embedding procedure We first consider embedding of the landmarks. Algorithm 5 defines the A2A function which embeds a set of vertices, $$\mathscr{A}$$, with respect to itself. It also determines the optimal $$\kappa$$. The number of optimization variables is $$kd + 1$$ where $$k = |\mathscr{A}|$$. If $$\mathscr{A}=\mathscr{L}$$, then this optimizes the set of landmarks with respect to itself, but we have left the function generic (i.e., using $$\mathscr{A}$$ rather than $$\mathscr{L}$$) for reasons that will become apparent when we discuss the recursive version. The function A2A accepts initial guesses for the matrix $$\mathbf{U}_{\mathscr{A}}$$ and the scalar $$\kappa > 0$$ as input, which may be either random guesses or something better. The LBFGS method is used for optimization and requires takes two inputs: (1) the method to evaluate the objective function and gradient from Algorithm 1 and (2) the initial starting point. Upon completion, the function A2A returns values of $$\mathbf{U}_{\mathscr{A}}$$ and $$\kappa$$ that give an optimal hyperbolic embedding, with the caveat that this may be a local optimum since the problem is not convex. Algorithm 5 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{A}$$ relative to itself Algorithm 5 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{A}$$ relative to itself Algorithm6 defines the function B2A that embeds vertices in the set $$\mathscr{B}$$ with respect to fixed $$\mathbf{U}_{\mathscr{A}}$$ and $$\kappa$$. We require $$\mathscr{B}$$ and $$\mathscr{A}$$ to be disjoint. It solves a series of $$|\mathscr{B}|$$ optimization problems, and each problem is of dimension $$d$$. Since the optimization problems are independent, the optimization solves can be done in parallel. Each optimization problem uses the objective function defined in Algorithm 2 and a random initial guess. Algorithm 6 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$ Algorithm 6 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$ Using the function discussed so far, the full procedure is straightforward and given in Algorithm 7. The inputs are the graph and the number of landmarks. The output is the embedding for all nodes $$\mathbf{U}_{\mathscr{V}}$$ and $$\kappa$$. We optimize the landmarks and $$\kappa$$ in a single pass and then optimize the non-landmarks. Algorithm 7 View largeDownload slide Hyperbolic embedding, simple version Algorithm 7 View largeDownload slide Hyperbolic embedding, simple version 4.4 Recursive embedding procedure In general, computing the landmark embedding using Algorithm 7 with a random initial guess can be inefficient and inaccurate for more than, say, $$\ell=100$$ landmarks. Instead, we propose an alternative algorithm which recursively builds a better initial guess for the landmark embedding. In order to do this, we first partition the landmarks into $$r$$ sets, that is, $$\mathscr{L} = \mathscr{B}_1 \cup \mathscr{B}_2 \cup \cdots \cup \mathscr{B}_r {\quad\text{with}\quad} \mathscr{B}_i \cap \mathscr{B}_j = \emptyset, \; \forall i\neq j \in \left\{ {{1,\dots,r}} \right\}\!.$$ (4.1) We then define $$\mathscr{A}_k$$ to be the union of the first $$k$$ partitions, that is, \begin{align*} \mathscr{A}_k = \bigcup_{i=1}^k \mathscr{B}_i. \end{align*} Note that $$\mathscr{A}_r = \mathscr{L}$$. To perform the embedding, we first embed $$\mathscr{A}_1$$ with a random guess for the initial embedding. We then embed the nodes $$\mathscr{B}_2$$ with respect to $$\mathscr{A}_1$$. The embeddings for these two sets of nodes is combined to form the initial guess for embedding $$\mathscr{A}_2$$ with respect to itself. This process is continued until all $$r$$ partitions of the landmarks are embedded. The recursive landmark embedding approach is detailed in Algorithm 8. Similarly, to obtain an improved starting point for the non-landmark embedding optimization we first embed each non-landmark node relative to a random subset of the original landmark nodes. This is similar to what Rigel does, however, we take it a step further and use this solution as an initial guess for the full non-landmark embedding procedure. The recursive non-landmark algorithm is shown in Algorithm 9. Algorithm 8 View largeDownload slide Hyperbolic embedding of vertices $$\mathscr{A}$$ relative to itself, recursive version Algorithm 8 View largeDownload slide Hyperbolic embedding of vertices $$\mathscr{A}$$ relative to itself, recursive version 4.5 Implementation details Before we move on to the examples, we briefly discuss implementation strategies for improving the landmark and non-landmark optimization procedures. Because our objective function is non-convex, there is a possibility that our optimization routine might get stuck in a local minimum. To lessen this possibility, we run the landmark embedding multiple times, in parallel, for different random starting positions. The optimal set of landmark embedding coordinates with the smallest error, $$E_{\mathscr{L}}(\mathbf{U}_{\mathscr{L}},\kappa)$$, is chosen to begin our non-landmark embedding routine. Second, the full $$\Delta_{ij}$$ matrix for the non-landmark embedding algorithm can be prohibitively large, for example, $$(n-\ell) \ell 2^5$$ bits using standard 32 bit integers. In order to efficiently manage memory for the non-landmark embedding procedure and allow us to embed a graph of any size, we split the $$\Delta_{ij}$$ distance matrix into smaller blocks of columns. Specifically, we split $$\mathscr{N}$$, the set of non-landmark indices, into approximately $$s \approx (n-\ell)/b$$ sets of indices that each have approximately $$b$$ elements, that is, $$\mathscr{N} = \mathscr{N}_1 \cup \cdots \cup \mathscr{N}_s {\quad\text{with}\quad} \mathscr{N}_p \cap \mathscr{N}_q = \emptyset\; \forall p\neq q \in \left\{ {{1,\dots,s}} \right\}, \; |\mathscr{N}_p| \approx b$$ (4.2) Algorithm 9 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$, recursive version Algorithm 9 View largeDownload slide Hyperbolic embedding of vertices in $$\mathscr{B}$$ relative to $$\mathscr{A}$$, recursive version Algorithm 10 View largeDownload slide Main Hyperbolic embedding, recursive version with non-landmark splitting Algorithm 10 View largeDownload slide Main Hyperbolic embedding, recursive version with non-landmark splitting The $$\Delta_{ij}$$ matrix is then split according to the column partitions defined in (4.2). This results in $$s$$ non-landmark distance matrices each requiring only $$b \cdot \ell \cdot 2^5$$ bits of memory. The non-landmark embedding is then performed by loading a single partition at a time, where each partition can be run in parallel. The main procedure is detailed in Algorithm 10, where the recursive landmark embedding, Algorithm 4, can be run many times in parallel to obtain the embedding with the least error, and the non-landmark nodes in $$\mathscr{N}_j$$ can also be executed in parallel for each $$j = 1,\dots,s$$. 5. Experimental results We test our hyperbolic embedding algorithm on five datasets from SNAP [20], with up to 65 million nodes and 1.8 billion edges; see Table 1. All tests are performed on a Dual socket Intel server using 40 Intel Haswell 2000Mhz E5-2683v3 CPUs, each with 14 processors and 60 Gigabytes of shared memory, for a total of 560 cores. All shortest-path (i.e., true) distances are computed using SNAP for Python [21]. The validation data for each graph consists of $$100,000$$ randomly chosen pairs of nodes (excluding landmarks). Table 1 Datasets Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Table 1 Datasets Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 Graph Nodes Edges Avg. Degree Amazon 334,863 925,872 5.53 DBLP 317,080 1,049,866 3.31 Youtube 1,134,890 2,987,624 5.27 LiveJournal 3,997,962 34,681,189 17.35 Orkut 3,072,441 117,185,083 76.28 Friendster 65,608,366 1,806,067,135 55.06 5.1 Comparison with Rigel For a baseline, we embed these graphs into hyperbolic space using Rigel [22], which is an open source C++ implementation by the Systems, Algorithms, Networking and Data Laboratory (SAND Lab) Group at University of California, Santa Barbara. It uses the general landmark embedding approach of Algorithm 7, with a few key differences. First, Rigel uses the Nelder–Mead simplex algorithm to minimize the landmark and non-landmark embedding error. Second, Rigel does not include the curvature as a variable in the optimization routine, nor does it incorporate the use of the gradient of the objective function. Rigel does, however, allow the user to fix the curvature manually during the embedding procedure. Experiments seem to indicate that the optimal curvature for Rigel is $$\kappa = 1$$ [3], which we use in the subsequent experiments. Finally, Rigel by default only embeds 16 landmark nodes relative to each other. These nodes are the primary landmark nodes. The rest of the landmarks are embedded relative only to the 16 primary landmark nodes. Furthermore, Rigel embeds the non-landmark nodes relative to only 16 randomly chosen landmark nodes. We compare the performance of Hypy (Algorithm 10) with Rigel; both use the same 100 landmarks generated using Algorithm 3. For Hypy, the landmarks indices are partitioned with $$\mathscr{B}_1 = \{1,\dots,32\},\;\mathscr{B}_2 = \{33,\dots,64\}$$ and $$\mathscr{B}_3 = \{65,\dots,100 \}$$ (see (4.1)). Hypy performs the landmarks embedding 560 times (equal to the number of processors on our computational server). The best result of these embeddings, in terms of the objective function defined by (2.4) with respect to $$\mathscr{A} = \mathscr{L}$$, is used in the results and subsequent computations. Since Rigel’s initial guess is unchanging, we only run it once and use that result for subsequent computation. The non-landmarks are split into groups of size $$250,000$$ nodes, that is, $$b \approx 250,000$$ (see (4.2)). Both Hypy and Rigel distribute the non-landmark embedding among the 560 processors. Before we take a look at the results, we note that Rigel failed to produce a solution for the Friendster dataset due to its large size, and so that dataset is not discussed in this subsection. We do include Friendster results in Section 5.2. In the leftmost columns of Figs 1 and 2, we show the mean squared error (MSE) and standard deviation for all pairs of landmark nodes, that is, $$(i,j) \in \mathscr{L} \otimes \mathscr{L}$$ with $$i \neq j$$ for $$d=2,3,\dots,10$$. The mean squared error for the landmarks is the average squared error between the true graph distance and the estimated hyperbolic distance among all unique landmark nodes pairs. For the non-landmarks, the mean squared error is defined as the average error in distance between each non-landmark node and all non-landmark nodes. In general, the MSE decreases with the dimension because there are more degrees of freedom. An increase in MSE as the dimension increase indicates that the optimization hit a local rather than a global solution. Observe that Hypy always obtains a lower landmark MSE value than Rigel. There are several reasons for this. First, Hypy has an additional free parameter–-curvature ($$\kappa$$). Second, Hypy is using LBFGS, a better optimization algorithm than the Nelder–Meade simplex algorithm. Third, Hypy optimizes the entire set of landmark nodes relative to each other rather than doing a small subset and optimizing the rest relative to those. Finally, Hypy is run multiple times with different random initializations. Fig. 1. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) Amazon, (b) DBLP and (c) YouTube. Fig. 1. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) Amazon, (b) DBLP and (c) YouTube. Fig. 2. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) LiveJournal and (b) Orkut. Fig. 2. View largeDownload slide Hypy versus Rigel in terms of mean standard error (MSE) using the same $$\ell=100$$ landmarks. For each value of $$d$$, the MSE values are computed using the optimal landmark embedding from 560 random starting points. Landmarks are all pairs of landmark nodes. Non-landmarks are all pairs comprising one non-landmark and one landmark. Validation is 100,000 random pairs (excluding landmarks). (a) LiveJournal and (b) Orkut. In the middle column of Figs 1 and 2, we show the MSE and standard deviation for all pairs in $$(i,j) \in \mathscr{L} \otimes (\mathscr{V} \setminus \mathscr{L})$$, that is, between the landmark and non-landmark nodes for $$d=2,3,\dots,10$$. Observe again that Hypy always achieves a lower non-landmark MSE than Rigel, due to the combination of a better landmark embedding to begin with and the improved optimization procedure for embedding the non-landmarks. In the rightmost column of Figs 1 and 2, we show the MSE for a validation set, that is, 100,000 randomly selected pairs $$(i,j) \in (\mathscr{V} \setminus \mathscr{L}) \otimes (\mathscr{V} \setminus \mathscr{L})$$. Note that the validation set is not used anywhere in the optimization procedure, so it serves as an independent check on the quality of the embedding. We indicate the minimum validation error with min for each method. Table 2 summarizes the validation results for the five test graphs. In all cases, Hypy improves the MSE versus Rigel by 23–44%. Moreover, the optimal embedding dimension ($$d$$) is generally lower for Hypy. This is important because it fundamentally changes our understanding of the intrinsic dimensionality of the graph. Table 2 Comparison of Rigel and Hypy using the same $$\ell=100$$ landmarks and a validation set of 100,000 random pairs for $$d = 1,\dots,10$$. Showing optimal embedding dimension and $$\kappa$$ (for Hypy) corresponding to the best validation set MSE. Used $$\kappa=1$$ for Rigel. The last column shows the percentage decrease in validation set MSE over Rigel for the validation dataset Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Table 2 Comparison of Rigel and Hypy using the same $$\ell=100$$ landmarks and a validation set of 100,000 random pairs for $$d = 1,\dots,10$$. Showing optimal embedding dimension and $$\kappa$$ (for Hypy) corresponding to the best validation set MSE. Used $$\kappa=1$$ for Rigel. The last column shows the percentage decrease in validation set MSE over Rigel for the validation dataset Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 Rigel ($$\boldsymbol{\kappa =} {\bf 1}$$) Hypy Hypy vs. Rigel Graph $$d$$ $$\text{MSE}$$ $$d$$ $$\kappa$$ $$\text{MSE}$$ MSE Decrease% Amazon 7 2.07 $$\geq$$10 0.08 1.17 44 DBLP $$\geq$$10 0.56 9 0.76 0.43 23 Youtube $$\geq$$10 0.31 7 2.27 0.18 42 LiveJournal $$\geq$$10 0.35 8 1.69 0.24 31 Orkut 10 0.30 5 5.56 0.19 37 The wall clock runtimes of Hypy and Rigel are compared in Fig. 3. The runtime for Hypy is the total for the 560 initial guesses (in parallel) for the landmark embedding and the non-landmark embedding (also in parallel). The runtime for Rigel is the total for a single run for the landmark embedding and the non-landmark embedding (in parallel). The figure shows that the runtime of Hypy does not depend on the embedding dimension ($$d$$); in fact, runtime slightly decreases as the dimension increases, indicating that the optimization routine has an easier job of embedding the graph in higher dimensions. Rigel, on the other hand, becomes much more expensive as the dimension increases. Moreover, for all graphs and all dimensions, Hypy is significantly faster than Rigel, sometimes by almost an order of magnitude. Fig. 3. View largeDownload slide Runtime (wall clock) comparison on five graphs between our Hypy (H) implementation and the existing Rigel (R) code for $$\ell = 100$$. The timing of Hypy is for 560 initial guesses (in parallel) and also parallelizes the non-landmark embedding. The timing of Rigel is for a single landmark embedding and parallelizng the non-landmark embedding over 560 processors. Fig. 3. View largeDownload slide Runtime (wall clock) comparison on five graphs between our Hypy (H) implementation and the existing Rigel (R) code for $$\ell = 100$$. The timing of Hypy is for 560 initial guesses (in parallel) and also parallelizes the non-landmark embedding. The timing of Rigel is for a single landmark embedding and parallelizng the non-landmark embedding over 560 processors. Table 3 Optimal embedding dimension for $$d = 1,\dots,10$$, corresponding MSE and percent decrease for $$\ell = 100, 400$$ and $$800$$. The percent decrease is calculated relative to the MSE for $$\ell = 100$$ Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Table 3 Optimal embedding dimension for $$d = 1,\dots,10$$, corresponding MSE and percent decrease for $$\ell = 100, 400$$ and $$800$$. The percent decrease is calculated relative to the MSE for $$\ell = 100$$ Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) Hypy, $$d ({\rm MSE}, \% \downarrow )$$ Graph $$\ell=100$$ $$\ell=400$$ $$\ell=800$$ Amazon $$\geq$$10 (1.17) 8 (0.97, 17%) $$\geq$$10 (0.93, 21%) DBLP 9 (0.43) $$\geq$$10 (0.38, 12%) $$\geq$$10 (0.34, 21%) Youtube 7 (0.18) $$\geq$$10 (0.13, 27%) 8 (0.12, 33%) LiveJournal 8 (0.24) 9 (0.20, 17%) $$\geq$$10 (0.19, 21%) Orkut 5 (0.19) $$\geq$$10 (0.16, 16%) $$\geq$$10 (0.15, 21%) Friendster 8 (0.18) $$\geq$$10 (0.15, 17%) $$\geq$$10 (0.15, 0%) In Fig. 4, we illustrate the modest benefit of allowing the curvature to vary as compared to fixing it to $$\kappa = 1$$. In this experiment, we vary the dimension from 2 to 10 and run each method 56 times with 56 different initial starting points (but the same starting points with and without fixed curvature). Fig. 4. View largeDownload slide Improvement in Hypy by letting $$\kappa$$ vary versus keeping it fixed at $$\kappa=1$$. (a) Amazon, (b) YouTube and (c) LiveJournal. Fig. 4. View largeDownload slide Improvement in Hypy by letting $$\kappa$$ vary versus keeping it fixed at $$\kappa=1$$. (a) Amazon, (b) YouTube and (c) LiveJournal. Using the optimal dimensions and curvature in Table 2, we break out the MSE values by true distance (# hops) for the validation data in Fig. 5. Note that the y-axis is the log of the MSE scores, in order to make the larger values easier to see. The MSE is low for the most frequent hop counts. Assuming this histogram is representative of the pair count for the entire graph, this is reasonable since the objective function (2.2) penalizes the discrepancies between the true and hyperbolic distances which occur most often. For almost all hop counts, especially those which occur with lowest frequencies, Hypy has a smaller MSE than Rigel. Fig. 5. View largeDownload slide Breakdown of validation errors by hop count (bottom) and number of pairs per hop counts (top). (a) Amazon, (b) DBLP, (c) YouTube, (d) LiveJournal and (e) Orkut. Fig. 5. View largeDownload slide Breakdown of validation errors by hop count (bottom) and number of pairs per hop counts (top). (a) Amazon, (b) DBLP, (c) YouTube, (d) LiveJournal and (e) Orkut. 5.2 Increasing the number of landmarks We now study the effect of increasing the number of landmarks with Hypy for the various graphs listed in Table 1, including the Friendster graph. We omit Rigel because our preliminary experiments indicated that increasing the number of landmarks did not reduce the validation error, most likely due to the fact that Rigel only use 16 relative landmarks to perform the embedding. Figure 6 shows the MSE for the validation data set for $$\ell = 100$$, $$400$$ and $$800$$. For $$\ell=100$$, we use the same landmark set partitioning as in Section 5.1; for $$\ell=400$$, we split the landmarks into sets of sizes 32, 32, 64 and 270; and for $$\ell=800$$, we use sizes 32, 32, 64, 128 and 544. There is a significant drop in MSE when going from $$\ell = 100$$ to $$\ell = 400$$ landmarks, but much less of a drop from $$\ell = 400$$ to $$\ell = 800$$, especially for graphs with a larger number of nodes. Also, the optimal embedding dimension ranges from eight to ten dimensions for $$\ell > 100$$. We can reasonably conclude that increasing the number of landmarks does improve the embedding, but it is less significant for larger graphs, that is, graphs with more than a million nodes. Table 3 summarizes the optimal embedding dimension for each choice of $$\ell$$ and the improvement compared to $$\ell=100$$. Fig. 6. View largeDownload slide MSE for validation data for Hypy comparing $$\ell = 100, 400$$ and $$800$$. (a) Amazon: validation, (b) DBLP: validation, (c) Youtube: validation, (d) LiveJournal: validation, (e) Orkut: validation and (f) Friendster: validation. Fig. 6. View largeDownload slide MSE for validation data for Hypy comparing $$\ell = 100, 400$$ and $$800$$. (a) Amazon: validation, (b) DBLP: validation, (c) Youtube: validation, (d) LiveJournal: validation, (e) Orkut: validation and (f) Friendster: validation. Figure 7 shows the run time for Hypy with different choices of $$\ell$$. As expected, the time for the complete embedding procedure using $$400$$ and $$800$$ landmarks is longer, roughly one to two orders of magnitude larger, respectively, than the embedding time for $$\ell = 100$$. Also, for $$\ell > 100$$, the embedding time as a function of dimension is roughly flat, whereas we see a slight decrease in seconds per dimension for $$\ell = 100$$. So while the optimization routine may have an easier time of embedding in higher dimensions, the objective functions for $$\ell > 100$$ are more costly to evaluate. Fig. 7. View largeDownload slide Runtime comparison for each graphs at different landmarks. The numbers in parentheses indicate the number of landmarks. Fig. 7. View largeDownload slide Runtime comparison for each graphs at different landmarks. The numbers in parentheses indicate the number of landmarks. 6. Discussion and conclusions In this article, we propose a fast and efficient hyperbolic embedding algorithm with a landmark approach that uses an LBFGS optimization routine with recursively built initial starting points. Our approach is an improvement over Rigel because it incorporates analytically computed gradients, includes the curvature as an optimization variable, and uses LBFGS to perform the optimization. We show that Hypy can provide a significant improvement in accuracy over Rigel, sometimes by as much as 44% (see Table 2). Moreover, Hypy can be almost an order of magnitude faster for large graphs (see Fig. 7). Hypy is designed to scale well for very large graphs by allowing parallelization of the landmark and non-landmark embedding procedures and has been successfully tested on graphs with 65 million nodes and 1.8 billion edges. Furthermore, because our routine is so efficient, we are able to explore embeddings with a number of landmarks greater than 100, which is often the limit in many studies up until now. These tests show that, indeed, more landmarks can further reduce the MSE validation errors by as much as 33% (see Table 3). In cases where multiple cores are not available, most of the examples in this article can still be computed using a single core machine in a reasonable amount time, that is, less than a day. The landmark embedding procedure is the cheapest part of the procedure and can be performed on the order of seconds to minutes for graphs with node sizes ranging from thousands to a few million, respectively. In order to run the landmark embedding at multiple starting points, this will have to be done in serial and certainly take longer, but will scale linearly with the number of random starting points. The non-landmark embedding procedure will most likely be the computational bottleneck due to shear number of nodes in some graphs. For graphs with a few million nodes and embeddings with 100 landmarks, the non-landmark embedding procedure on a single core may need to run on the order of hours, whereas landmarks with sizes greater than 100 can take days. For the largest data set, that is, Friendster with roughly 65 million nodes, running this on a single core would take months and would thus be impractical. Also, the preprocessing step, for example, computing the landmark nodes and distances, can take hours on a single core for graphs with nodes in the millions. We have shown the benefit of an improved optimization procedure, and future work may consider more sophisticated methods such as stochastic gradient descent, especially for extremely large problems. Lastly, the code will be available on github. Funding This work was supported in part by the Defense Advanced Research Project Agency (DARPA). Acknowledgement Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. We would also like to thank the reviewers for their time and help in finalizing this article. References 1. Krioukov D. , Papadopoulos F. , Kitsak M. , Vahdat A. & Boguñá M. ( 2010 ) Hyperbolic geometry of complex networks. Phys. Rev. E , 82 , 036106 . Google Scholar CrossRef Search ADS 2. Ng T. S. E. & Zhang H. ( 2002 ) Predicting internet network distance with coordinates-based approaches. INFOCOM 2002: Proceedings of the Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. pp. 170 – 179 . 3. Zhao X. , Sala A. , Zheng H. & Zhao B. Y. ( 2011 ) Efficient shortest paths on massive social graphs. COLCOM’11: Proceedings of the 7th International Conference on Collaborative Computing: Networking, Applications and Worksharing ( Georgakopoulos D. and Joshi J. eds). Orlando, FL : IEEE , pp. 77 – 86 . 4. Shavitt Y. & Tankel T. ( 2004 ) Big-bang Simulation for embedding network distances in Euclidean space. IEEE/ACM Trans. Netw. , 12 , 993 – 1006 . Google Scholar CrossRef Search ADS 5. Shavitt Y. & Tankel T. ( 2004 ) On the curvature of the internet and its usage for overlay construction and distance estimation. INFOCOM 2004: Proceedings of the Twenty-Third Annual Joint Conference of the IEEE Computer and Communications Societies. Hong Kong : IEEE , pp. 374 – 384 . 6. Boguñá M. , Papadopoulos F. & Krioukov D. ( 2010 ) Sustaining the internet with hyperbolic mapping. Nat. Commun. , 1 , 62 . Google Scholar CrossRef Search ADS PubMed 7. Papadopoulos F. , Kitsak M. , Serrano M. A. , Boguñá M. & Krioukov D. ( 2012 ) Popularity versus similarity in growing networks. Nature , 489 , 537 – 540 . Google Scholar CrossRef Search ADS PubMed 8. Tang L. & Crovella M. ( 2003 ) Virtual landmarks for the internet. IMC’03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement. New York, NY: ACM, pp. 143 – 152 . 9. Ajwani D. , Meyer U. & Veith D. ( 2015 ) An I/O-efficient distance oracle for evolving real-world graphs. ALENEX ’15: Proceedings of the Meeting on Algorithm Engineering & Expermiments ( Goodrich M. and Mitzenmacher M. eds). Arlington, VA : SIAM , pp. 159 – 172 . Google Scholar CrossRef Search ADS 10. Zhao X. , Sala A. , Wilson C. , Zheng H. & Zhao B. Y. ( 2010 ) Orion: shortest path estimation for large social graphs. WOSN’10: Proceedings of the 3rd Wonference on Online Social Networks ( El Abbadi A. and Krishnamurthy B. eds). New York, NY : ACM , pp. 1 – 9 . 11. Papadopoulos F. , Psomas C. & Krioukov D. ( 2015 ) Network mapping by replaying hyperbolic growth. IEEE/ACM Trans. Netw. , 23 , 198 – 211 . Google Scholar CrossRef Search ADS 12. Papadopoulos F. , Aldecoa R. & Krioukov D. ( 2015 ) Network geometry inference using common neighbors. Phys. Rev. E , 92 . 13. Nelder J. A. & Mead R. ( 1965 ) A simplex method for function minimization. Comput. J. , 7 , 308 – 313 . Google Scholar CrossRef Search ADS 14. Nocedal J. ( 1980 ) Updating quasi-Newton matrices with limited storage. Math. Comput. , 35 , 773 – 782 . Google Scholar CrossRef Search ADS 15. Kolda T. G. , Lewis R. M. & Torczon V. ( 2003 ) Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. , 45 , 385 – 482 . Google Scholar CrossRef Search ADS 16. McKinnon K. I. M. ( 1998 ) Convergence of the Nelder–Mead simplex method to a nonstationary point. SIAM J. Optim. , 9 , 148 – 158 . Google Scholar CrossRef Search ADS 17. Liu D. & Nocedal J. ( 1989 ) On the limited memory BFGS method for large scale optimization. Math. Program. , 45 , 503 – 528 . Google Scholar CrossRef Search ADS 18. Nocedal J. & Wright S. J. ( 2006 ) Numerical Optimization . Springer . 19. Lee C. Y. ( 1961 ) An algorithm for path connections and its applications. IRE Trans. Electron. Comput. , EC-10 ( 3 ), 346 – 365 . Google Scholar CrossRef Search ADS 20. Leskovec J. & Krevl A. ( 2014 ) SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. 21. Leskovec J. & Sosič R. ( 2016 ) SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Sys. Technol. , 8 , 1:1 – 1:20 . 22. Zhao X. , Sala A. , Zheng H. & Zhao B. Y. ( 2013 ) Rigel. http://http://sandlab.cs.ucsb.edu/rigel/. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.

Journal

Journal of Complex NetworksOxford University Press

Published: Dec 11, 2017

DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month Explore the DeepDyve Library Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve Freelancer DeepDyve Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations