van Wieringen, Wessel

1 Introduction

Twitter is a prominent social media tool that provides a rich resource of information. The huge volume of gathered information calls for powerful methods to translate large and complex data into small chunks of understandable signals that can be used in several areas, ranging from the social sciences and health research to marketing and e-commerce. As an example, studied later in detail, more than one million tweets related to the social upheaval surrounding the 2009 Iranian presidential election may be compressed into an accessible visual summary. Such a summary can reveal the different topics that are highlighted in a certain period of time and how they have evolved over time. This can be viewed as a form of network reconstruction, where collections of linked words, concepts or terms represent highlighted topics at a certain time-stamp. Any changes in such topics over time, whether they become outdated, expand, or newly emerge, can be explained by the evolution of the links between the words.

Our interest is in the reconstruction of networks of words/terms from multiple Twitter data sets corresponding to specific time periods, where different networks share topological similarities. In addition, data from a particular time period may be heterogeneous in that they cannot be represented by a single network. This phenomenon might be explained by differences among (parts of) the networks across different time periods, or can originate from hidden sub-networks within each time period. Naturally, to end up with interpretable networks we aim to reconstruct networks in which only a few terms or predictors play an important role. This means that we search for sparse networks: networks with relatively few links. Therefore we a) propose a framework to simultaneously reconstruct multiple sparse networks with a possibly shared structure and b) extend this idea to the case where there is more than one network for a given time period.
The problem of network reconstruction is operationalized here as the estimation of a number of Gaussian graphical models (GGMs) for which the nonzero elements of the precision matrices (inverse covariance matrices) correspond to edges of the network. Estimation of a single GGM, especially in a high-dimensional setting where the data dimension is larger than the sample size, often proceeds in a regularized fashion (see, for example, [1–5]). These methods typically minimize the negative log-likelihood of the data augmented with an ℓ1-penalty on the elements of the precision matrix. For instance, Meinshausen and Bühlmann [1] proposed to identify the edges of a GGM by sparse estimation of the precision matrix through lasso regressions of each random variable on all other variables. Friedman et al. [4] presented the graphical lasso, a procedure that directly produces a sparse estimate of the precision matrix. In [2, 3, 5] different estimation algorithms for the graphical lasso method are treated. An alternative approach is provided in [6], where it is illustrated that when the true graphical model need not be extremely sparse, i.e., need not contain many zero elements, ridge penalties coupled with post-hoc selection may outperform the lasso. A common challenge with these approaches is that they do not take into consideration the uncertainty of the parameter estimates, and that they require selection of the penalty parameters that control sparsity. Wang [7] proposed a Bayesian counterpart to the graphical lasso that addresses these shortcomings, and developed a fast algorithm to estimate a moderately large precision matrix. However, this method does not serve our purpose in the face of multiple networks with possibly evolving structures. There are a number of methods that consider simultaneous estimation of multiple graphical models corresponding to more than one data set (see for instance [8–12]). Guo et al.
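To make the neighborhood-selection idea of Meinshausen and Bühlmann [1] concrete, the following is an illustrative NumPy sketch, not the implementation used in this paper: each variable is lasso-regressed on all others, and an edge is drawn when either regression yields a nonzero coefficient (the "OR" rule). The coordinate-descent lasso solver and all function names are ours.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_norm = (X ** 2).sum(axis=0) / n  # assumes no all-zero columns
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            # soft-thresholding update
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norm[j]
    return b

def neighborhood_select(Y, lam):
    """Edge (i, j) if variable j gets a nonzero lasso coefficient when
    regressing variable i on all others, or vice versa ("OR" rule)."""
    n, p = Y.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [k for k in range(p) if k != i]
        coef = lasso_cd(Y[:, others], Y[:, i], lam)
        adj[i, others] = coef != 0
    return adj | adj.T
```

On standardized columns, the first coordinate-descent pass reduces to soft-thresholding the sample correlations, so only variables with a sufficiently strong conditional association survive.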
[8] extended the graphical lasso with a particular parametrization of multiple precision matrices whose elements are expressed as a product of shared and class-specific factors. Through a hierarchical penalty on both the shared and the class-specific factors, their method shrinks some elements in the inverse covariance matrices to zero. Danaher et al. [9] proposed a general framework allowing an arbitrary type of penalty and derived fused lasso and group lasso estimators, where the fused lasso estimator encourages shared structure and/or equal values for the elements across the precision matrices, while the group lasso estimator emphasizes only a shared sparse structure. More recently, a fused ridge version of multiple graphical model estimation has been proposed [11]. As another example, Zhu et al. [10] adapted the truncated ℓ1-penalization of [13] to stimulate elements of the precision matrices to be similar across data sets. Bayesian counterparts include [12, 14]. These methods give proper consideration to common characteristics of the data sets while estimating them simultaneously. However, they lack the flexibility to account for heterogeneity within each data set. In [15] the problem of learning the evolution of an interaction network, modeled as a GGM, from cross-sectional, high-dimensional data in the face of heterogeneity was addressed through fused ridge penalized estimation of a combination of mixtures of GGMs. Here we consider this problem from a Bayesian perspective. In this paper we present a novel Bayesian approach to the joint estimation of multiple graphical models that takes into account both shared topological structure between the networks and heterogeneity within the networks. In particular, we propose a Bayesian Gaussian fused graphical lasso estimation algorithm to estimate group-wise precision matrices that may exhibit network similarities, and augment this with a mixture model to account for heterogeneity of the data within a network.
This is done in the spirit of the data-integrative Bayesian inference method that we proposed in [16]: we form a new prior distribution on the elements of the precision matrices and obtain a posterior distribution that resembles the ℓ1-penalized likelihood plus a fused penalty. A data allocation scheme is employed to simultaneously uncover the hidden clustering components of the mixture model, while estimation of the cluster-specific precision matrices is achieved through column-wise block Gibbs sampling. In the application that we consider, the different networks originate from multiple time periods; for the method itself, however, the ordering over time is irrelevant. This makes our method widely applicable.

The paper is organized as follows. In Section 2.1 we propose a Bayesian Gaussian graphical network reconstruction method for data from multiple time periods or multiple groups. This is extended in Section 2.2 to allow for heterogeneity in the sense that each time period may encompass data from more than one (unknown) sub-population. In Section 3.1 these approaches are evaluated and compared by simulation. Section 3.2 illustrates the application of the proposed method to the analysis of tweets regarding the 2009 Iranian presidential election. We conclude in Section 4 with a discussion of future improvements.

2 Materials and methods

Throughout the paper we will use capital letters to denote random variables, random vectors or random matrices; bold type will be used for vectors and matrices. The symbol ∝ stands for "is proportional to". To emphasize that for the proposed method the ordering of the time periods is irrelevant, throughout this section we will use the word group instead of time period, and the unknown sub-populations belonging to one time period will be called subgroups, clusters, or components.

2.1 Bayesian fused graphical lasso

Characteristics from a sample of n individuals comprising T groups have been observed.
For t = 1, …, T, the number of individuals in group t will be denoted by n_t, and for i = 1, …, n, the random vector Y_i represents the p-dimensional vector of characteristics of individual i. In the sequel we will write Y = (Y_1, …, Y_n)^⊤ for the complete n × p data matrix. For later use we also define n_{<t} = ∑_{t′=1}^{t−1} n_{t′} and n_{<1} = 0. The grouping of individuals is exhaustive and exclusive in the sense that an individual appears in a single group only. The random vectors of characteristics are assumed to be independent and to follow a group-wise Gaussian law,

Y_i ∼ N_p(0_p, Ω_t^{−1})  for i = n_{<t} + 1, …, n_{<t} + n_t.

We consider the joint estimation of the groups' precision matrices Ω = {Ω_1, …, Ω_T}. For the purpose of interpretability sparse estimates are sought, while the context suggests that the structure of the precision matrices may be shared between groups. In a frequentist setting all requirements (high-dimensionality, sparsity and a possibly common structure of the precision matrices) are catered for by the fused graphical lasso estimator [9], which maximizes the following penalized joint log-likelihood with respect to Ω:

∑_{t=1}^{T} n_t [log det(Ω_t) − tr(S_t Ω_t)] − λ_1 ∑_{t=1}^{T} ∑_{i≠j} |ω_{t,ij}| − ∑_{t_1 < t_2} λ_{t_1,t_2} ∑_{i,j} |ω_{t_1,ij} − ω_{t_2,ij}|.   (1)

In (1), S_t = S_t(Y) denotes the sample covariance matrix of group t and ω_{t,ij} the (i, j)-th element of Ω_t. Note that the estimator above generalizes the one originally proposed in [9], which uses λ_{t_1,t_2} = λ_2 for all t_1 and t_2 with t_1 ≠ t_2. The last two summands of the penalized log-likelihood (1) comprise the fused graphical lasso penalty. The convexity of the penalty tackles the high-dimensionality, while its first summand (i.e., the lasso penalty) induces sparsity, and, finally, its second summand (i.e., the fused penalty) shrinks the precision matrices towards a common structure for large values of the penalty parameters λ_{t_1,t_2}. Here we present a Bayesian interpretation of the fused graphical lasso. This requires the evaluation of the joint posterior distribution p(Ω|Y, Λ) of the Ω_t.
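As a concrete reading of the objective, the following NumPy sketch evaluates the penalized joint log-likelihood (1) under the simplification λ_{t1,t2} = λ_2 of [9]. This is an illustration only; the function name and argument layout are ours, not part of the paper's method.

```python
import numpy as np

def fused_glasso_objective(Omegas, Ss, ns, lam1, lam2):
    """Penalized joint log-likelihood of Eq. (1), with a single fused
    penalty parameter lam2 for every pair of groups (as in [9])."""
    T = len(Omegas)
    # data fit: sum_t n_t * (log det(Omega_t) - tr(S_t Omega_t))
    fit = sum(n * (np.linalg.slogdet(O)[1] - np.trace(S @ O))
              for O, S, n in zip(Omegas, Ss, ns))
    # lasso penalty on the off-diagonal elements of each precision matrix
    lasso = lam1 * sum(np.abs(O).sum() - np.abs(np.diag(O)).sum()
                       for O in Omegas)
    # fused penalty over all unordered pairs of groups
    fused = lam2 * sum(np.abs(Omegas[a] - Omegas[b]).sum()
                       for a in range(T) for b in range(a + 1, T))
    return fit - lasso - fused
```

Maximizing this objective is what the frequentist fused graphical lasso does; the Bayesian treatment below instead samples from a posterior whose log-density has the same penalty structure.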
To this end we denote by {Ω}_{−t} the set of precision matrices with Ω_t excluded, and define for t = 2, …, T the prior distribution of each precision matrix given the others as

p(Ω_t | {Ω}_{−t}, Λ) ∝ ∏_{i<j} exp(−λ_t |ω_{t,ij}|) × ∏_i exp(−(λ_t/2) ω_{t,ii}) × ∏_{t′≠t} ∏_{i≤j} exp(−λ_{t′,t} |ω_{t,ij} − ω_{t′,ij}|) × I(Ω_t ≻ 0).   (2)

In the above, ω_{t,ij} denotes the (i, j)-th element of Ω_t. Note that (2) is invariant to the order of conditioning. Furthermore, the diagonal and off-diagonal elements of the precision matrices are thus assumed to follow a priori an exponential and a double exponential distribution, respectively (see for example [17] and [7] for similar approaches). The differences between corresponding precision elements of any pair of groups also obey a double exponential law. The term I(Ω_t ≻ 0) limits the support of the prior to the positive definite matrices. With the fused graphical lasso prior (2), the posterior distribution of Ω_t is not a well-known standard distribution, but an efficient Gibbs sampling scheme can be designed. This extends the work of [7] on the Bayesian graphical lasso. In a nutshell, the Gibbs sampler amounts to iteratively sampling one column of Ω_t at a time, which guarantees positive definiteness. As a first step towards our Gibbs sampler we derive a tractable formulation of the conditional posterior of each precision matrix given the others. Application of the definition of conditional probability and subsequent insertion of the equality p(Ω|Λ) = p(Ω_t|{Ω}_{−t}, Λ) p({Ω}_{−t}|Λ) yields

p(Ω_t | {Ω}_{−t}, Y, Λ) ∝ p(Y | Ω) p(Ω_t | {Ω}_{−t}, Λ).

An analytic expression is now readily available from the normality assumption of the data together with the prior (2). Gibbs sampling, however, is still hampered by the double exponential distributions employed in the prior of the precision elements. This is circumvented by a hierarchical representation of these distributions as a scale mixture of normal distributions [18],

(λ/2) exp(−λ|x|) = ∫_0^∞ (2πτ)^{−1/2} exp(−x²/(2τ)) (λ²/2) exp(−λ²τ/2) dτ.   (3)

Define, corresponding to each double exponential distribution in the prior, a latent scale parameter in accordance with the scale mixture representation (3) above.
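The scale-mixture representation can be checked numerically. In one common parameterization (that of the Bayesian lasso literature [18]), the double exponential DE(x|λ) = (λ/2) exp(−λ|x|) arises by drawing τ from an exponential distribution with rate λ²/2 and then x | τ ∼ N(0, τ). The following Monte Carlo sketch, ours and purely illustrative, verifies two moments of the resulting marginal:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.5, 200_000

# mixing density: tau ~ Exponential with rate lam^2 / 2, i.e. mean 2 / lam^2
tau = rng.exponential(scale=2.0 / lam ** 2, size=n)
# conditionally normal draws: x | tau ~ N(0, tau)
x = rng.normal(0.0, np.sqrt(tau))

# marginally x should be double exponential with rate lam:
# Var(x) = 2 / lam^2 and E|x| = 1 / lam
print(x.var(), np.mean(np.abs(x)))
```

This is exactly the device that restores conjugacy: conditional on the latent scales, every precision element is a priori Gaussian, so the column-wise updates of the Gibbs sampler become tractable.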
Furthermore, let τ_t denote the set of independent latent scale parameters corresponding to group t, and let τ = {τ_1, …, τ_T}. Finally, we endow λ_t and λ_{t′,t} with independent gamma priors with shape parameter s and rate parameter r. Conditioning on the latent scale mixture parameters, and rewriting the hierarchical model, we obtain for i = n