# A convex program for mixed linear regression with a recovery guarantee for well-separated data

A convex program for mixed linear regression with a recovery guarantee for well-separated data Abstract We introduce a convex approach for mixed linear regression over d features. This approach is a second-order cone program, based on L1 minimization, which assigns an estimate regression coefficient in $$\mathbb {R}^{d}$$ for each data point. These estimates can then be clustered using, for example, k-means. For problems with two or more mixture classes, we prove that the convex program exactly recovers all of the mixture components in the noiseless setting under technical conditions that include a well-separation assumption on the data. Under these assumptions, recovery is possible if each class has at least d-independent measurements. We also explore an iteratively reweighted least squares implementation of this method on real and synthetic data. 1. Introduction We study the mixed linear regression problem with k classes and m data points. Let $$\{S_{p}\}_{p=1}^{k}$$ be a partition of $${ \{1,\dots ,\mathrm {m}\}}$$, where i ∈ Sp means that the ith data point belongs to the pth class. In class p, we assume that the data is generated by a linear process with regression coefficients $$\beta _{p} \in \mathbb {R}^{d}$$. Let the measurements $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d}\times \mathbb {R}$$ be such that for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$, \begin{align} a_{i}^{\top}\beta_{p} = b_{i}\ \ \text{if}\ \ i \in S_{p}. \end{align} (1.1) The mixed linear regression problem is to simultaneously estimate $$\{\beta _{p}\}_{p =1}^{k}$$ and $$\{S_{p}\}_{p=1}^{k}$$ from $$\{(a_{i},b_{i})\}_{i=1}^{m}$$. In the above problem formulation, let np = |Sp|. We define ℓi as the p such that i ∈ Sp, i.e. ℓi is the label of the ith data point. Mixed linear regression has broad applications in scenarios that require disentangling data into multiple latent classes and estimating the parameters. One application is resource management in health care, where the classes are patients (with and without illness) and the mixed data is usage of resource and medical care . Other applications include music tone perception , subspace clustering  and population classification . Recent work on mixed linear regression has mostly focused on mixtures of two classes, i.e. k = 2 [4,15]. For independent, Gaussian measurements, the authors in  develop an initialization procedure for the Expectation Minimization (EM) algorithm. This initialization is based on a grid search on the eigenspace of a matrix of second order moments of measurements. In the noiseless setting, the authors prove that their initialization, followed by an EM algorithm recovers the two regression coefficients with high probability if there are $$O(d\log ^{2}d)$$ Gaussian measurements. In , the authors lift the mixed linear regression problem to a low rank non-symmetric matrix completion problem. In the noiseless case with O(d) sub-Gaussian measurements, the program exactly recovers the two mixture components with high probability. In both of these approaches, the number of mixture components were restricted to exactly two. For mixed linear regression with two or more mixture components, tensor-based methods have been introduced in  and, very recently, in . In , low-rank tensor regression and tensor factorization is used to recover the mixture components. In the noiseless setting, i.e. for all $$p \in \{1,\dots ,k\}$$, $$a_{i}^{\top }\beta _{p} = b_{i}$$ if i ∈ Sp, their recovery theorem does not guarantee exact recovery for any m. Instead, it establishes that the recovery error is $$O\left (\frac {1}{\sqrt {m}}\right )$$. In , the method of moments is used to generate an initialization for the EM algorithm. For i.i.d. Gaussian measurements, the algorithm provably recovers the mixture components if the number of measurements is $$O\left (k^{10}d\right )$$. For identification of k classes, there must be at least kd measurements. As a result, this approach has an optimal sample complexity in the number of feature elements, d, but its scaling with respect to the number of classes, k, is suboptimal. Another drawback of these tensor-based methods is that they are computationally expensive because they operate in high-dimensional spaces. The work in the present paper brings ideas from convex clustering to the problem of mixed linear regression. The algorithm we introduce is inspired by the algorithm in . In that algorithm, an ℓ1 minimization in the form of a fused lasso  acts to cluster noisy points in $$\mathbb {R}^{d}$$. The problem of mixed linear regressions can be viewed as a challenging generalization of clustering in $$\mathbb {R}^{d}$$ where every measurement is subject to a codimension-1 ambiguity. We note in particular that in the noiseless case $$\mathbb {R}^{d}$$ clustering is trivial whereas mixed linear regression is not. Consequently, this paper focuses only on the case of noiseless measurements in order to grapple with the difficulty of this codimension-1 ambiguity. The contribution of the present paper is to introduce a convex algorithm for the mixed linear regression problem with two or more components. In the noiseless setting and under some technical conditions that include well-separation of data, this algorithm exactly recovers the mixture components with kd-independent measurements. The convex algorithm is based on an ℓ1 minimization that assigns an estimate regression coefficient in $$\mathbb {R}^{d}$$ for each data point. These estimates can then be clustered by, for example, k-means. An estimate for each class can then be found by running k separate regressions. 1.1. Convex program and main result We introduce a two-step approach for mixed linear regression. This approach introduces a free variable for each data point that acts as the estimate of the mixture component corresponding to that data point. The first step is to solve a second-order cone program to obtain estimated mixture components for each data point, and the second step is to cluster these estimates using k-means. The details of these steps are as follows: 1. Solve the following second-order cone program: \begin{align} \begin{aligned} &\operatorname*{arg\,min}_{z_{1}, \dots, z_{m} \in \mathbb{R}^{d}}\ \ \sum_{i=1}^{m}\sum_{j = 1}^{m} \|z_{i} - z_{j}\|_{2}\\ &\text{subject to}\ \ a_{i}^{\top}z_{i} = b_{i}, \ i\in{ \{1,\dots,\mathrm{m}\}}. \end{aligned} \end{align} (1.2) Each zi in (1.2) is constrained to belong to the hyperplane corresponding to the measurement (ai, bi), as per (1.1). If $$\{z_{i}^{\natural }\}_{i=1}^{m}$$ minimizes (1.2), then $$z_{i}^{\natural }$$ is an estimate of βp for the class p such that i ∈ Sp. 2. Cluster $$\{z_{i}^{\natural }\}_{i=1}^{m}$$ using k-means and estimate the mixture components corresponding to each class by running k separate regressions. Program (1.2) is an ℓ1 minimization over all pairs (i, j). Due to the sparsity promoting property of ℓ1 minimization, in some cases, (1.2) can find a solution where $$z_{i}^{\natural } = z_{j}^{\natural }$$ for many pairs (i, j). Let $$\{z_{i}^{\sharp }\}_{i=1}^{m}$$ be the candidate solution of (1.2), i.e. $$z_{i}^{\sharp } = \beta _{\ell _{i}}$$. In the noiseless case, successful recovery would mean that the minimizer of (1.2) is equal to the candidate solution, i.e. $$z_{i}^{\natural } = \beta _{\ell _{i}}$$. Even in this case, note that $$\|z_{i}^{\natural }-z_{j}^{\natural }\|_{2} \neq 0$$ for most pairs (i, j). Nonetheless, we will prove that if the measurements are well- separated in a certain sense, the minimizer of (1.2) can still be the candidate solution. We prove that under two technical assumptions (1.2) can exactly recover the regression coefficients in the noiseless case. In order to state these assumptions, let \begin{align} v_{pq} = \frac{\beta_{p}-\beta_{q}}{\|\beta_{p}-\beta_{q}\|_{2}} \end{align} (1.3) and define the weighted directions, vp, by \begin{align} v_{p} := \frac{1}{\sum_{q\neq p} n_{q}} \sum_{q \neq p}n_{q}v_{pq}. \end{align} (1.4) That is, vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Our first assumption is that the measurements are ‘well-separated’ in the following sense: \begin{align} \max_{p \in [k]}\max_{i \in S_{p}} \frac{\left\|P_{v_{p}^{\perp}}a_{i}\right\|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} < \frac{1}{2}\min_{p\in [k]}\frac{n_{p}}{m}, \end{align} (1.5) where $$P_{v_{p}}$$ is the projector onto the span of {vp} and $$P_{v_{p}^{\perp }}$$ is the projector onto the subspace orthogonal to vp. Intuitively, $$\{(a_{i},b_{i})\}_{i =1}^{m}$$ is well-separated if ai is approximately parallel to $$v_{l_{i}}$$. This condition ensures that the hyperplanes corresponding to a fixed class are within a small angle of each other. Note that $$\frac {\|P_{v}^{\perp }a\|_{2}}{\|P_{v}a\|_{2}}=0$$ if v is parallel to a and is small if v is approximately parallel to a. Examples of well-separated data and not well-separated data are shown in Fig. 1(a and b). Fig. 1. View largeDownload slide The collection of lines in panel (a) is well-separated in the sense of (1.5). These measurements also satisfy the balance condition (1.6). In contrast, the collection of lines in panel (b) is not well-separated because the angle between ν1 and a few measurements corresponding to β1 is too big. Similarly, the collection of lines in panel (b) is not balanced because the weighted average of the normal vectors corresponding to measurements in β2 is not in the direction of ν2. Note that vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Fig. 1. View largeDownload slide The collection of lines in panel (a) is well-separated in the sense of (1.5). These measurements also satisfy the balance condition (1.6). In contrast, the collection of lines in panel (b) is not well-separated because the angle between ν1 and a few measurements corresponding to β1 is too big. Similarly, the collection of lines in panel (b) is not balanced because the weighted average of the normal vectors corresponding to measurements in β2 is not in the direction of ν2. Note that vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Our second assumption is that the measurements are ‘balanced’ in the following sense: \begin{align} \sum_{i\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} = 0,\text{ for all } p \in{ \{1,\dots,\mathrm{k}\}}. \end{align} (1.6) Intuitively, $$\{(a_{i},b_{i})\}_{i =1}^{m}$$ is balanced if a particular average of the $$\{a_{i}\}_{i\in S_{p}}$$ is exactly in the direction of vp. Examples of balanced data and not balanced data are shown in Fig. 1(a and b). Our main result is that in the noiseless case, (1.2) uniquely recovers the mixture components with well-separated and balanced measurements, provided there are d-independent measurements per class. Theorem 1.1 Fix $$\{\beta _{p}\}_{p=1}^{k} \subset \mathbb {R}^{d}$$. Let $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d} \times \mathbb {R}$$ be measurements that satisfy (1.1). If the measurements satisfy the well-separation condition (1.5), the balance condition (1.6) and for all p ∈ [k], span$$\{a_{i}:i\in S_{p}\} = \mathbb {R}^{d}$$, then the solution to (1.2) is unique and satisfies $${z_{i}} = \beta _{l_{i}}$$ for all i ∈ [m]. Numerical simulation on synthetic data verify Theorem 1.1. We present the results of numerical simulation on two types of synthetic data, both satisfying the well-separation property (1.5). In the first simulation, the data satisfy the balance property (1.6) as well. The second simulation shows that recovery is possible in the imbalanced case. We also present the results of numerical simulation on two real datasets and show that the algorithm presented in this paper estimates the regression coefficients, and classifies the datasets reasonably well. 1.2. Discussion and future work We propose a convex approach for mixed linear regression with k regression coefficients. An interesting observation in the noiseless case is that the convex program (1.2) can recover both the sparsity pattern and the regression coefficients exactly. In contrast, consider the Basis Pursuit Denoise (BPDN) program \begin{align} \min_{x\in \mathbb{R}^{n}} \|x\|_{1} + \lambda \|Ax-b\|_{2}^{2} \end{align} (1.7) in the case where b = Axo, for a sparse xo. For appropriate λ, the optimizer of (1.7) will have the correct sparsity pattern, but incorrect estimates of the coefficients of xo, see . Given that both BPDN and (1.2) are based on L1 minimization, a natural question is why can (1.2) recover both sparsity pattern and the estimates? This is because if the sparsity pattern is recovered using (1.2) in the noiseless case then zi = zj for i, j in the same class. Since there are sufficient independent points corresponding to each class, there is only one possible value of zi. In the noisy case of mixed linear regression, an estimation step like clustering using k-means is required. This is analogous to the estimation step required for estimating the regression coefficient from the solution to the BPDN program. Another interesting observation is that the program (1.2) does not have a free parameter when posed in the noiseless case. This is analogous to the Basis Pursuit program \begin{align} \min_{x\in \mathbb{R}^{n}} \|x\|_{1} \text{ subject to } Ax=b, \end{align} (1.8) which also does not have any parameter in the noiseless case. In this paper, we analyze the noiseless mixed linear regression problem, and provide a recovery guarantee when data satisfies conditions (1.5) and (1.6). The current paper is mainly focused in providing an understanding of the well-separation condition while assuming that the data are balanced. However, real data are never exactly balanced. Thus, understanding the level of imbalance the algorithm can handle is a fruitful area of future work. In the same vein, another important research direction is to consider a more complete model that include noise and corruption. 1.3. Organization of the paper The remainder of the paper is organized as follows. In Section 1.4, we present notations used throughout the paper. In Section 2, we present the proof of Theorem 1.1. In Section 3, we introduce an iteratively reweighted least squares (IRLS) implementation that solves (1.2). We observe its performance on real and synthetic data. 1.4. Notation Let $$[n] = \{1,\dots ,n\}$$. Let Id×d be the d × d identity matrix. Let $${B_{r}^{d}}$$ be the d dimensional Euclidean ball of radius r centered at the origin and let $${S_{r}^{d}}$$ be the d dimensional Euclidean sphere of radius r centered at the origin. For any matrix X and any vector v, let ∥X∥F be the Frobenius norm of X and let ∥v∥2 be the ℓ2 norm of v. For any non-zero vector v, let $$\hat {v} = \frac {v}{\ \|v\|_{2}}$$. For a vector $$v \in \mathbb {R}^{d}$$, let $$P_{v^{\perp }} = I_{d\times d} - \hat {v}\hat {v}^{\top }$$ and $$P_{v} = \hat {v}\hat {v}^{\top }$$. For a vector $$a_{i} \in \mathbb {R}^{d}$$, let ais be the sth element of ai. Throughout the paper, indices i and j are related to data points, and indices p and q are related to mixture components. 2. Proof We will prove Theorem 1.1 by constructing an exact dual certificate. A dual certificate is a dual optimal point which certifies that the candidate solution is globally optimal, see [2,8,10]. To arrive at the form of the dual certificate, we now derive the Karush-Kuhn-Tucker (KKT) conditions for (1.2). Let $$f:(\mathbb {R}^{d})^{m}\rightarrow \mathbb {R}$$ be \begin{align*} f(Z) = f(z_{1},\dots,z_{m}) = \sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2}. \end{align*} The augmented Lagrangian for (1.2) is \begin{align*} \mathscr{L}(Z,\nu) = \sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2} - \sum_{i=1}^{m}\nu_{i}\left(a_{i}^{\top}z_{i}-b_{i}\right), \end{align*} where $$\nu \in \mathbb {R}^{m}$$ is a Lagrange multiplier. Let Z♯ be the candidate solution, i.e. $$z_{i}^{\sharp } = \beta _{l_{i}}$$ for all $$i \in { \{1,\dots ,\mathrm {m}\}}$$. The first order optimality conditions for (1.2) at Z♯ are \begin{align} & 0 \in \partial_{Z} \left(\sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2}\right)(Z^{\sharp}) - \partial_{Z}\left(\sum_{i=1}^{m}\nu_{i}\left(a_{i}^{\top}z_{i}-b_{i}\right)\right)(Z^{\sharp}), \end{align} (2.1) \begin{equation} a_{i}^{\ T}z_{i}^{\sharp} = b_{i},\ i\in{ \{1,\dots,\mathrm{m}\}}, \end{equation} (2.2) where ∂Zf(Z♯) is the subdifferential of f with respect to Z evaluated at Z♯. Note that \begin{align} \partial_{(z_{i},z_{j})} \|z_{i}-z_{j}\|_{2}(z_{i}^{\sharp},z_{j}^{\sharp}) = \left\{ \begin{array}{ll} \left\{\left[\begin{array}{c} v_{pq}\\ -v_{pq} \end{array} \right]\right\}, & \ell_{i} = p,\ \ell_{j} = q,\ p\neq q\\[10pt] \left\{\left[\begin{array}{c} \xi_{ij}\\ -\xi_{ij} \end{array}\right]: \|\xi_{ij}\|_{2}\leqslant\ 1 \right\}, & z_{i}^{\sharp} = z_{j}^{\sharp} \end{array}\right. \end{align} (2.3) where vpq is defined in (1.3) and $$\xi _{ij} \in \mathbb {R}^{d}$$. So, the KKT conditions for (1.2) are for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i ∈ Sp, \begin{align} 0 \in \left\{\sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{pq} - \nu_{i}a_{i}: \xi_{ij} \in \mathbb{R}^{d},\ \|\xi_{ij}\|_{2}\leq 1,\ \xi_{ij} = -\xi_{ji},\ \nu_{i} \in \mathbb{R}\right\}\!\!. \end{align} (2.4) We will now prove that if such ξij and ν exist and if all ∥ξij∥2 < 1, then the unique output of (1.2) is the candidate solution Z♯. Lemma 2.1 Fix $$\{\beta _{p}\}_{p=1}^{k} \subset \mathbb {R}^{d}$$. Let $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d} \times \mathbb {R}$$ be measurements that satisfy (1.1). Assume for $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and for all i, j ∈ Sp, there exists $$\xi _{ij} \in \mathbb {R}^{d}$$ and $$\nu _{i} \in \mathbb {R}$$ such that \begin{equation} \nu_{i}a_{i}=\sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{p} , \end{equation} (2.5) \begin{equation} \|\xi_{ij}\|_{2} < 1 , \end{equation} (2.6) \begin{equation} \xi_{ij} = -\xi_{ji} . \end{equation} (2.7) Also, assume for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$, span$$\{a_{i}:i\in S_{p}\}=\mathbb {R}^{d}$$. Then, Z♯ is the unique solution to (1.2). Proof. We will show that f(Z♯ + H) > f(Z♯) for any feasible perturbation H ≠ 0. We first separate f(Z♯ + H) into two parts. \begin{align} f(Z^{\sharp}+H) = \sum_{p=1}^{k}\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + \underbrace{\sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} -z_{j}^{\sharp}+h_{i}-h_{j}\|_{2}}_{I}\!. \end{align} (2.8) To bound I from below, note that $$\|z_{i}^{\sharp } -z_{j}^{\sharp }+h_{i}-h_{j}\|_{2} \geqslant \ \|z_{i}^{\sharp } -z_{j}^{\sharp }\|_{2}+v_{pq}^{\top }(h_{i}-h_{j})$$ because vpq, defined in (1.3), is a subgradient of ∥⋅∥2 at $$z_{i}^{\sharp }-z_{j}^{\sharp }$$ if i ∈ Sp and j ∈ Sq. Thus, \begin{equation} I = \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} -z_{j}^{\sharp}+h_{i}-h_{j}\|_{2} \end{equation} (2.9) \begin{equation} \geqslant \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} - z_{j}^{\sharp}\|_{2} + \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}(h_{i}-h_{j}) \end{equation} (2.10) \begin{equation} = f(Z^{\sharp})+ \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}h_{i} - \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}h_{j} \end{equation} (2.11) \begin{equation} = f(Z^{\sharp})+\sum_{p=1}^{k}\sum_{q\neq p}\sum_{i \in S_{p}}n_{q}v_{pq}^{\top}h_{i}+\sum_{p=1}^{k}\sum_{q\neq p}\sum_{j\in S_{q}}n_{p}v_{qp}^{\top}h_{j} \end{equation} (2.12) \begin{equation} = f(Z^{\sharp})+2\sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}. \end{equation} (2.13) Note that in (2.12) we used vpq = −vqp and in (2.13) we used (1.4). Combining (2.8) and (2.13), it suffices to show \begin{align} \sum_{p=1}^{k}\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + 2\underbrace{\sum_{p=1}^{k}\sum_{i\in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}}_{II}>0, \end{align} (2.14) for all feasible H ≠ 0. First, we show that for all feasible H ≠ 0 there exists $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp such that hi ≠ hj. We provide a proof by contradiction. Assume for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, hi = hj. Fix $$p\in { \{1,\dots ,\mathrm {k}\}}$$. Let hj = cp for all j ∈ Sp. Since cp is a feasible perturbation, $$a_{j}^{\top }c_{p} = 0$$ for all j ∈ Sp. Thus, cp is orthogonal to any element in span$$\{a_{j}:j\in S_{p}\} = \mathbb {R}^{d}$$. Hence, hj = cp = 0 for all j ∈ Sp, which contradicts H ≠ 0. We now compute \begin{equation} II = \sum_{p=1}^{k}\sum_{i \in S_{p}}\left(\sum_{q\neq p}n_{q}v_{p}\right)^{\top}h_{i} \end{equation} (2.15) \begin{equation} = \sum_{p=1}^{k}\sum_{i \in S_{p}}\left(\nu_{i}a_{i} - \sum_{j\in S_{p},j\neq i}\xi_{ij}\right)^{\top}h_{i} \end{equation} (2.16) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}2\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}h_{i} \end{equation} (2.17) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}\left(\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}h_{i} + \sum_{j \in S_{p}}\sum_{i\in S_{p},i\neq j}\xi_{ji}^{\top}h_{j}\right) \end{equation} (2.18) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}\left(h_{i} -h_{j}\right) \end{equation} (2.19) \begin{equation} \geqslant -\frac{1}{2} \sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p}}\|\xi_{ij}\|_{2}\|h_{i}-h_{j}\|_{2} \end{equation} (2.20) \begin{equation} = -\frac{\gamma}{2} \sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p}}\|h_{i}-h_{j}\|_{2}, \end{equation} (2.21) where $$\gamma :=\max _{p \in [k]}\max _{i,j \in S_{\textrm {p}}}\|\xi _{ij}\|_{2}$$. By assumption (2.6), γ < 1. Note that in (2.16), we used (2.5). In (2.17), we used $$a_{i}^{\top }h_{i} = 0$$ (since H is feasible). In (2.18), both terms in parenthesis are equal by interchanging the dummy variables i and j. In (2.19), we used the antisymmetry condition (2.7). In (2.20), we used Cauchy–Schwartz inequality. Combining (2.14) and (2.21), we get \begin{align} \sum_{p=1}^{k}\left(\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + 2\sum_{i\in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}\right) &\geqslant\ (1-\gamma)\sum_{p=1}^{k}\sum_{i,j\in S_{p}}\|h_{p}-h_{q}\|_{2}> 0. \end{align} (2.22) The strict inequality in (2.22) holds because (1 − γ) > 0 and H ≠ 0 implies that there exists $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp such that hi ≠ hj. Hence, Z♯ is the unique solution to (1.2). As a consequence of Lemma 2.1, the proof of Theorem 1.1 is simplified to constructing an exact certificate that satisfies (2.5), (2.6) and (2.7). Proof of Theorem 1.1 For all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, let ξij and νi be \begin{equation} \nu_{i} = \operatorname*{sign}(v_{p}^{\top}a_{i})\frac{\|v_{p}\|_{2}\sum_{q\neq p}n_{q}}{\|P_{v_{p}}a_{i}\|_{2}} \end{equation} (2.23) \begin{equation} \xi_{ij} = \frac{1}{n_{p}}\left(\nu_{i}P_{v_{p}^{\perp}}a_{i} - \nu_{j} P_{v_{p}^{\perp}}a_{j}\right). \end{equation} (2.24) By Lemma 2.1, it is sufficient to show that, for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, these ξij and νi satisfy (2.5), (2.6) and (2.7). Note that (2.7) follows immediately. Condition (2.6) holds because \begin{equation} \|\xi_{ij}\|_{2} \leqslant\ \frac{1}{n_{p}}\left(\|\nu_{i}P_{v_{p}^{\perp}}a_{i}\|_{2} + \|\nu_{j} P_{v_{p}^{\perp}}a_{j}\|_{2}\right) \end{equation} (2.25) \begin{equation} \leqslant\ \frac{2}{n_{p}} \max_{i \in S_{p}} \|\nu_{i}P_{v_{p}^{\perp}}a_{i}\|_{2} \end{equation} (2.26) \begin{equation} = \frac{2}{n_{p}} \max_{i \in S_{p}}\|v_{p}\|_{2}\sum_{q\neq p}n_{q}\frac{\|P_{v_{p}^{\perp}}a_{i}\|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} \end{equation} (2.27) \begin{equation} < \frac{2}{n_{p}}m\frac{n_{p}}{2m} \end{equation} (2.28) \begin{equation} = 1. \end{equation} (2.29) In (2.27), we used (2.23), and in (2.28), we used (1.5) along with ∥vp∥2 ⩽ 1. Lastly, we will show that (2.5) holds through direct computation. Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i ∈ Sp. We now compute \begin{equation} \sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.30) \begin{equation} = \sum_{j\in S_{p}, j\neq i} \frac{1}{n_{p}}\left(\nu_{i}P_{v_{p}^{\perp}}a_{i} - \nu_{j} P_{v_{p}^{\perp}}a_{j}\right) + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.31) \begin{equation} = \sum_{j\in S_{p}, j\neq i} \frac{1}{n_{p}}\|v_{p}\|_{2}\left(\sum_{q\neq p}n_{q}\right)\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} - \operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right) + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.32) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}}{n_{p}}\sum_{j\in S_{p}}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} -\operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right)+\sum_{q\neq p}n_{q}v_{p} \end{equation} (2.33) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}}{n_{p}}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)n_{p}\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} - \sum_{j\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right)+\left(\sum_{q\neq p}n_{q}\right)v_{p} \end{equation} (2.34) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\|v_{p}\|_{2}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}}+\frac{v_{p}}{\|v_{p}\|_{2}}\right) \end{equation} (2.35) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)}{\|P_{v_{p}}a_{i}\|_{2}}\left(P_{v_{p}^{\perp}}a_{i}+\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{v_{p}}{\|v_{p}\|_{2}}\|P_{v_{p}}a_{i}\|_{2}\right) \end{equation} (2.36) \begin{equation} = \nu_{i}\left(P_{v_{p}^{\perp}}a_{i}+P_{v_{p}}a_{i}\right) \end{equation} (2.37) \begin{equation} = \nu_{i}a_{i}. \end{equation} (2.38) Note that in (2.32), we used (2.23). In (2.35), we used the assumed balance condition (1.6), and in (2.37), we used (2.23) and $$P_{v_{p}}a_{i} = \operatorname *{sign}(v_{p}^{\top }a_{i})\|P_{v_{p}}a_{i}\|_{2}\frac {v_{p}}{\|v_{p}\|_{2}}$$. Hence, (2.5) holds. By Lemma 2.1, Z♯ is the unique solution to (1.2). 2.1. Derivation of dual certificate In this section, we provide a derivation of constructing the dual certificate. Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$. Without loss of generality, assume $$S_{p} ={ \{1,\dots ,\mathrm {n}_{\mathrm {p}}\}}$$. Note that (2.5) can be decomposed into its components along and orthogonal to vp: \begin{equation} \sum_{j\in S_{p}, j\neq i} P_{v_{p}}\xi_{ij} + \sum_{q\neq p}n_{q}v_{p} = \nu_{i}P_{v_{p}}a_{i}, \ i \in S_{p} \end{equation} (2.39) \begin{equation} \sum_{j\in S_{p}, j\neq i} P_{v_{p}^{\perp}} \xi_{ij} = \nu_{i}P_{v_{p}^{\perp}} a_{i}, \ i \in S_{p}. \end{equation} (2.40) Note that solving (2.40) is equivalent to solving a system of linear equations Aξ = b, where $$A \in \mathbb {R}^{n_{p}d \times \frac {n_{p}(n_{p}-1)d}{2}}$$ and $$b\in \mathbb {R}^{n_{p}d}$$. Let $$\tilde {B}$$ be the first (np − 1) block rows of A and $$\tilde {b}$$ be the first (np − 1) blocks of b. So, the ith block of $$\tilde {b}$$ is $$\nu _{i}P_{v_{p}^{\perp }}a_{i}$$ and the matrix $$\tilde {B}$$ and the vector ξ are \begin{equation*} (\tilde{B})_{il} = \left\{\begin{array}{l l} I_{d\times d} & \ \text{if} \ l \in \left\{\sum_{r=1}^{i-1}(n_{p}-r) + s: \ s = 1,\dots, n_{p}-i\right\}\\[6pt] -I_{d\times d}& \ \text{if} \ i>1, \ l \in \left\{ \sum_{r=1}^{i-s}(n_{p}-r)+i-s+1: \ s = 2,\dots, i\right\}\\[6pt] 0 & \text{ otherwise,} \end{array} \right. \end{equation*} \begin{equation*} (\xi)_{l} = P_{v_{p}^{\perp}}\xi_{ij}\ \text{if} \ i = \operatorname*{arg\,max}_{\tilde{i}\in[n_{p}-1]}\sum_{s = 1}^{\tilde{i}}\frac{l-\sum_{r=1}^{s-1}(n_{p}-r)}{\left|l-\sum_{r=1}^{s-1}(n_{p}-r)\right|},\ j = l-\sum_{r=1}^{i-1}(n_{p}-r)+i. \end{equation*} It is straightforward to verify that $$\tilde {B}$$ is full row rank and the least squares solution of $$\tilde {B}\xi = \tilde {b}$$ is also a solution to Aξ = b if \begin{align} \sum_{i = 1}^{n_{p}} \nu_{i}P_{v_{p}^{\perp}}a_{i} = 0. \end{align} (2.41) Now, for all j ∈ Sp, choose $$P_{v_{p}}\xi _{ij} = 0$$ in (2.39). Then, for all i ∈ Sp, \begin{align} \nu_{i} = \operatorname*{sign}(v_{p}^{\top}a_{i})\frac{\|v_{p}\|_{2}\sum_{q\neq p}n_{q}}{\|P_{v_{p}}a_{i}\|_{2}}. \end{align} (2.42) Using these νi, (2.41) is satisfied because of the assumed balance condition (1.6). We note that the least two-norm solution of $$\tilde {B}\xi = \tilde {b}$$ is $$\xi = \tilde {B}^{T}(\tilde {B}\tilde {B}^{T})^{-1}\tilde {b}$$, where \begin{equation*} \left(\tilde{B}\tilde{B}^{T}\right)_{ls} = \left\{\begin{array}{cc} (n_{p} -1)I_{d\times d} & \text{if } l =s\\[.1em] -I_{d\times d} & \text{if } l \neq s, \end{array} \right. \text{ and} \end{equation*} \begin{equation*} \left(\left(\tilde{B}\tilde{B}^{T}\right)^{-1}\right)_{ls} = \left\{\begin{array}{cc} \frac{2}{n_{p}}I_{d\times d} & \text{if } l = s\\[.5em] \frac{1}{n_{p}}I_{d\times d} & \text{if } l \neq s. \end{array} \right. \end{equation*} Here, $$\tilde {B}\tilde {B}^{T} \in \mathbb {R}^{d(n_{j}-1)\times d(n_{j}-1)}$$. Solving for ξ, we get $$P_{v_{p}^{\perp }}\xi _{ij} = \frac {\nu _{i}P_{v_{p}^{\perp }}a_{i}}{n_{p}} - \frac {\nu _{j}P_{v_{p}^{\perp }}a_{j}}{n_{p}}$$. Since $$P_{v_{p}}\xi _{ij} = 0$$, we have $$\xi _{ij} = \frac {\nu _{i}P_{v_{p}^{\perp }}a_{i}}{n_{p}} - \frac {\nu _{j}P_{v_{p}^{\perp }}a_{j}}{n_{p}}$$. 3. Numerical results In this section we formulate an IRLS algorithm for (1.2) and provide numerical results on real and synthetic data. The general idea of the IRLS algorithm is to approximate a non-smooth objective function, as in (1.2), with a sequence of smooth functionals that converges to the objective function. In our implementation of the IRLS algorithm for (1.2), the smooth approximation results in a weighted least squares problem which can be solved through the normal equations. The pseudocode is given in Algorithm 1. We refer the readers to  for convergence result of the IRLS algorithm. Algorithm 1 outputs estimated mixture components for each data point. These estimates can then be clustered using k-means. The mixture component corresponding to each class is obtained by running k separate regressions. When implementing the IRLS algorithm, we observe convergence for a fixed δ ≪ 1 (δt in Algorithm 1 is fixed to 10−16). The maximum number of iterations is 150. Let $$Z_{t}^{\natural }$$ be the minimizer of the quadratic program in the IRLS algorithm at iteration t. The stopping criterion is $$\frac {1}{\sqrt {m}}\|Z_{t+1}^{\natural } - Z_{t}^{\natural }\|_{F} < 10^{-5}$$. We use Algorithm 1, along with k-means, to classify two real datasets. In both of these datasets, the true labels of the data points are unknown. We compare the output of Algorithm 1 followed by k-means with the output of an algorithm introduced in . The algorithm introduced in  recovers an estimate of the mixture components when used on a mixed linear regression problem with bounded but arbitrary noise. For ease of reference, this algorithm is presented in Algorithm 2. First, we analyze the BikeShare data that contain the number of bikes rented in a day in the San Francisco bay area from 1 September 2014 to 30 August 2015. The data are provided by the Bay Area BikeShare program.1 In this dataset, m = 364 and d = 2. For each measurement $$(a_{i},b_{i}) \in \mathbb {R}^{2}\times \mathbb {R}$$, ai1 = i is the independent variable of a simple linear model, ai2 = 1 corresponds to the constant part and bi is the response. In this model, bi is the number of bikes rented. Let $$\mu ={ \frac {1}{m}}\sum _{i=1}^{m}a_{i1}$$ and α = 0.005. Here, the average μ is used to center the dataset and α is used to ensure the dataset is sufficiently well-separated. Let $$\tilde {a}_{i1} = \alpha \left (a_{i1}-\mu \right )$$ and let $$\tilde {a}_{i} = \left (\tilde {a}_{i1},a_{i2}\right )$$. Algorithm 1 is then used on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$. For purpose of illustration, k-means with k = 2 is used on the output of Algorithm 1 to estimate the mixture components. The result of this process is shown in Fig. 2(c), where the classification seems to differentiate between weekend and weekday bike rental trends. For comparison, the result of Algorithm 2 on the dataset $$\left \{\left (\tilde {a}_{i},b_{i}\right )\right \}_{i=1}^{m}$$ with η set to 100 obtained using the semidefinite-quadratic-linear programing solver SDPT3 is shown in Fig. 2(d). Fig. 2. View largeDownload slide Panel (a) shows data consisting of the number of bikes rented per day in the Bay Area BikeShare program. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Qualitatively, the classification presented in panel (c) seems to differentiate between weekend and weekday bike rental trends. Panel (d) shows the result of Algorithm 2 with η = 100. Fig. 2. View largeDownload slide Panel (a) shows data consisting of the number of bikes rented per day in the Bay Area BikeShare program. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Qualitatively, the classification presented in panel (c) seems to differentiate between weekend and weekday bike rental trends. Panel (d) shows the result of Algorithm 2 with η = 100. Secondly, we consider music tone perception data which show the relationship between actual tone and tone perceived by a musician. These data were generated in an experiment conducted by Cohen in 1980 . In this dataset, m = 150 and d = 2. For each measurement $$(a_{i},b_{i}) \in \mathbb {R}^{2}\times \mathbb {R}$$, ai1 = i is the independent variable of a simple linear model, ai2 = 1 corresponds to the constant part and bi is the response. In this model, ai1 is the actual tone and bi is the perceived tone. Let $$\mu ={ \frac {1}{m}}\sum _{i=1}^{m}a_{i1}$$ and α = 40. Here, the average μ is used to center the dataset and α is used to ensure the dataset is sufficiently well-separated. Let $$\tilde {a}_{i1} = \alpha (a_{i1}-\mu )$$ and let $$\tilde {a}_{i} = (\tilde {a}_{i1},a_{i2})$$. Algorithm 1 is then used on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$. For purpose of illustration, k-means with k = 2 is used on the output of Algorithm 1 to estimate the mixture components. The result of this process is shown in Fig. 3(c). For comparison, the result of Algorithm 2 on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$ with η set to 4 obtained using SDPT3 solver is shown in Fig. 2(d). Fig. 3. View largeDownload slide Panel (a) shows perceived tone by a musician versus actual tone for a range of tones. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Panel (d) shows the result of Algorithm 2 with η = 4. Fig. 3. View largeDownload slide Panel (a) shows perceived tone by a musician versus actual tone for a range of tones. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Panel (d) shows the result of Algorithm 2 with η = 4. We also provide two simulation results on synthetic data. The first simulation result verifies Theorem 1.1 and the second simulation result shows recovery using (1.2) in the case with imbalanced measurements is possible. For the first simulation, let k = 3, $$d \in \{3,\dots ,15\}$$ and m = 48. Let $$\left \{\beta _{p}\right \}_{p=1}^{k}\subset \mathbb {R}^{d}$$ be such that $$\left (\beta _{p}\right )_{l} = 1$$ if l = p and 0 otherwise. Let n1 = n2 = n3 = 16. For $$P_{v_{p}^{\perp }}$$, let the columns of $$Q_{p}\in \mathbb {R}^{d\times (d-1)}$$ be an orthonormal basis of the column space of $$P_{v_{p}^{\perp }}$$. Consider the following measurements for the first simulation: Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$. For $$i \in { \left \{1,\dots ,\frac {\mathrm {n}_{\mathrm {p}}}{2}\right \}}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$, i.e. xi is a random point uniformly distributed in the d − 1 dimensional ℓ2 ball of radius α centered at the origin. Here, hyperplanes corresponding to labels in Sp are contained in an aperture α ∈ [0, 0.75]. For $$i \in { \left \{1,\dots ,\frac {\mathrm {n}_{\mathrm {p}}}{2}\right \}}$$, let $$a_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$a_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. These measurements are symmetric and satisfy the balance condition since there exists pairs i, j ∈ Sp such that \begin{align} \frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} = -\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}. \end{align} (3.1) Lastly, for $$i \in { \{1,\dots ,\mathrm {n}_{\mathrm {p}}\}}$$, let $$b_{i} = a_{i}^{\top }\beta _{p}$$. Figure 4 shows the fraction of successful recovery from 10 independent trials for mixed linear regression from data as described above. Black squares correspond to no successful recovery and white squares to 100% successful recovery. Let Z♯ be the candidate minimizer of (1.2) and let Z♮ be the output of (1.2). For each trial, we say (1.2) successfully recovers the mixture components if $$\frac {1}{\sqrt {m}}\|Z^{\natural } - Z^{\sharp }\|_{F} < 10^{-5}$$. This evaluation metric is used because we want to provide numerical verification of Theorem 1.1. In the area to the left of the line, the measurements are well-separated in the sense of (1.5), i.e. $$\max _{p\in }\max _{i\in S_{p}}\frac { \|P_{v_{p}^{\perp }}a_{i} \|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} < \frac {1}{6}$$. The figure also shows that when measurements are not well-separated, recovery using (1.2) will likely fail. Fig. 4. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and aperture, α. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Fig. 4. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and aperture, α. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. For the second simulation, let k = 3 and $$d \in \{3,\dots ,15\}$$. Let $$\left \{\beta _{p}\right \}_{p=1}^{k}\subset \mathbb {R}^{d}$$ be such that $$\left (\beta _{p}\right )_{l} = 1$$ if l = p and 0 otherwise. Let n1 = n2 = n3 = 4d. For $$P_{v_{p}^{\perp }}$$, let the columns of $$Q_{p}\in \mathbb {R}^{d\times (d-1)}$$ be an orthonormal basis of the column space of $$P_{v_{p}^{\perp }}$$, as in the first simulation. Consider the following measurements for the second simulation: Fix the aperture, α = 0.2. For p ∈{1, 2} and $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$. For $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$a_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$a_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. These measurements are symmetric as in the first simulation. For measurements that belong to the third class, we perturb the symmetric measurements to achieve the desired level of imbalance. Specifically, for p = 3 and $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$. For $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$\tilde {a}_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$\tilde {a}_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. Let $$w \sim \text {Uniform}\left (S_{\tau _{p}}^{d-1}\right )$$ and for $$i\in \left \{1,\dots ,n_{p}\right \}$$, let $$a_{i} = \tilde {a_{i}}+Q_{p}w$$. Here, τp ∈ [0, 0.062] is the measure of imbalance of the measurements that belongs to the pth class. Note that \begin{align} \tau_{p} = \frac{1}{n_{p}}\left \|\sum_{i\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\left\|P_{v_{p}}a_{i}\right\|_{2}}\right\|_{2}. \end{align} (3.2) Lastly, for $$i \in \{1,\dots ,n_{p}\}$$, let $$b_{i} = a_{i}^{\top }\beta _{p}$$. Figure 5 shows the fraction of successful recovery from 10 independent trials for mixed linear regression from data as described above. In Fig. 5, the number of measurements in each class is four times the dimension. Black squares correspond to no successful recovery and white squares to 100% successful recovery. Let Z♯ be the candidate minimizer of (1.2) and let Z♮ be the output of (1.2). For each trial, we say (1.2) successfully recovers the mixture components if $$\frac {1}{\sqrt {m}}\|Z^{\natural } - Z^{\sharp }\|_{F} < 10^{-5}$$. The figure shows that recovery using (1.2) from imbalanced measurements is possible if the number of measurements scale linearly with dimension of the mixture components. Fig. 5. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and imbalance, τ3, of measurements in third class. The measurements in classes 1 and 2 are exactly balanced, i.e. τ1 = τ2 = 0. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Fig. 5. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and imbalance, τ3, of measurements in third class. The measurements in classes 1 and 2 are exactly balanced, i.e. τ1 = τ2 = 0. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Funding National Science Foundation (DMS-1464525) to P.H. Footnotes Dataset can be obtained from the following URL: http://www.bayareabikeshare.com/open-data References 1. Bissantz , N. , Dümbgen , L. , Munk , A. & Stratmann , B. ( 2009 ) Convergence analysis of generalized iteratively reweighted least squares algorithms on convex function spaces . SIAM J. Optimization , 19 , 1828 -- 1845 . Google Scholar Crossref Search ADS 2. Candès , E. & Recht , B. ( 2012 ) Exact matrix completion via convex optimization . Commun. ACM , 55 , 111 -- 119 . Google Scholar Crossref Search ADS 3. Chaganty , A. & Liang , P. ( 2013 ) Spectral experts for estimating mixtures of linear regressions . J. Mach. Learn. Res. , 28 , 1040 -- 1048 . 4. Chen , Y. , Yi , X. & Caramanis , C. ( 2014 ) A convex formulation for mixed regression with two components: minimax optimal rates . J. Mach. Learn. Res. , 35 , 560 -- 604 . 5. Cohen , E. A. ( 1980 ) Inharmonic tone perception . Ph.D. Thesis, Stanford University . 6. Deb , P. & Holmes , A. ( 2000 ) Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models . Health Econ. , 6 , 475 -- 489 . Google Scholar Crossref Search ADS 7. Elhamifar , E. & Vidal , R. ( 2013 ) Sparse subspace clustering: algorithm, theory, and applications . IEEE Trans. Pattern Anal. Mach. Intell. , 35 11 , 2765 -- 2781 . Google Scholar Crossref Search ADS PubMed 8. Fuchs , J. J. ( 2004 ) On sparse representations in arbitrary redundant bases . IEEE Trans. Inf. Theory , 50 , 1341 -- 1344 . Google Scholar Crossref Search ADS 9. Grun , B. & Leisch , F. ( 2007 ) Applications of finite mixtures of regression models . https://cran.r-project.org/web/packages/flexmix/vignettes/regression-examples.pdf. 10. Hand , P . ( 2014 ) Conditions for existence of dual certificates in rank-one semidefinite problems . Communications in Mathematical Sciences, 12 , 1363 -- 1378 . 11. Hocking , T. , Philippe Vert , J. , Joulin , A. & Bach , F. R. ( 2011 ) Clusterpath: an algorithm for clustering using convex fusion penalties. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (L. Getoor & T. Scheffer eds). New York, NY : ACM , pp. 745 -- 752 . 12. Tibshirani , R. , Saunders , M. , Rosset , S. , Zhu , J. & Knight , K. ( 2005 ) Sparsity and smoothness via the fused lasso . J. R. Stat. Soc. Ser. B (Stat. Methodol.) , 67 , 91 -- 108 . Google Scholar Crossref Search ADS 13. Xinyang Yi , C. C. & Sanghavi , S. ( 2016 ) Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749 . 14. Li , Y.-H. , & Jonathan Scarlett , P. R. V. C. , ( 2015 ) Sparsistency of ℓ1-regularized M-estimators. Proceedings of the 18th International Conference on Artifical Intelligence and Statistics , San Diego, CA, USA . 15. Yi , X. , Caramanis , C. & Sanghavi , S. ( 2014 ) Alternating minimization for mixed linear regression. Proceedings of The 31st International Conference on Machine Learning , pp. 613 -- 621 . Symbol indicates reproducible data. © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Information and Inference: A Journal of the IMA Oxford University Press

# A convex program for mixed linear regression with a recovery guarantee for well-separated data

, Volume 7 (3) – Sep 19, 2018
17 pages      /lp/ou_press/a-convex-program-for-mixed-linear-regression-with-a-recovery-guarantee-wbr7tJUIxB
Publisher
Oxford University Press
ISSN
2049-8764
eISSN
2049-8772
DOI
10.1093/imaiai/iax018
Publisher site
See Article on Publisher Site

### Abstract

Abstract We introduce a convex approach for mixed linear regression over d features. This approach is a second-order cone program, based on L1 minimization, which assigns an estimate regression coefficient in $$\mathbb {R}^{d}$$ for each data point. These estimates can then be clustered using, for example, k-means. For problems with two or more mixture classes, we prove that the convex program exactly recovers all of the mixture components in the noiseless setting under technical conditions that include a well-separation assumption on the data. Under these assumptions, recovery is possible if each class has at least d-independent measurements. We also explore an iteratively reweighted least squares implementation of this method on real and synthetic data. 1. Introduction We study the mixed linear regression problem with k classes and m data points. Let $$\{S_{p}\}_{p=1}^{k}$$ be a partition of $${ \{1,\dots ,\mathrm {m}\}}$$, where i ∈ Sp means that the ith data point belongs to the pth class. In class p, we assume that the data is generated by a linear process with regression coefficients $$\beta _{p} \in \mathbb {R}^{d}$$. Let the measurements $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d}\times \mathbb {R}$$ be such that for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$, \begin{align} a_{i}^{\top}\beta_{p} = b_{i}\ \ \text{if}\ \ i \in S_{p}. \end{align} (1.1) The mixed linear regression problem is to simultaneously estimate $$\{\beta _{p}\}_{p =1}^{k}$$ and $$\{S_{p}\}_{p=1}^{k}$$ from $$\{(a_{i},b_{i})\}_{i=1}^{m}$$. In the above problem formulation, let np = |Sp|. We define ℓi as the p such that i ∈ Sp, i.e. ℓi is the label of the ith data point. Mixed linear regression has broad applications in scenarios that require disentangling data into multiple latent classes and estimating the parameters. One application is resource management in health care, where the classes are patients (with and without illness) and the mixed data is usage of resource and medical care . Other applications include music tone perception , subspace clustering  and population classification . Recent work on mixed linear regression has mostly focused on mixtures of two classes, i.e. k = 2 [4,15]. For independent, Gaussian measurements, the authors in  develop an initialization procedure for the Expectation Minimization (EM) algorithm. This initialization is based on a grid search on the eigenspace of a matrix of second order moments of measurements. In the noiseless setting, the authors prove that their initialization, followed by an EM algorithm recovers the two regression coefficients with high probability if there are $$O(d\log ^{2}d)$$ Gaussian measurements. In , the authors lift the mixed linear regression problem to a low rank non-symmetric matrix completion problem. In the noiseless case with O(d) sub-Gaussian measurements, the program exactly recovers the two mixture components with high probability. In both of these approaches, the number of mixture components were restricted to exactly two. For mixed linear regression with two or more mixture components, tensor-based methods have been introduced in  and, very recently, in . In , low-rank tensor regression and tensor factorization is used to recover the mixture components. In the noiseless setting, i.e. for all $$p \in \{1,\dots ,k\}$$, $$a_{i}^{\top }\beta _{p} = b_{i}$$ if i ∈ Sp, their recovery theorem does not guarantee exact recovery for any m. Instead, it establishes that the recovery error is $$O\left (\frac {1}{\sqrt {m}}\right )$$. In , the method of moments is used to generate an initialization for the EM algorithm. For i.i.d. Gaussian measurements, the algorithm provably recovers the mixture components if the number of measurements is $$O\left (k^{10}d\right )$$. For identification of k classes, there must be at least kd measurements. As a result, this approach has an optimal sample complexity in the number of feature elements, d, but its scaling with respect to the number of classes, k, is suboptimal. Another drawback of these tensor-based methods is that they are computationally expensive because they operate in high-dimensional spaces. The work in the present paper brings ideas from convex clustering to the problem of mixed linear regression. The algorithm we introduce is inspired by the algorithm in . In that algorithm, an ℓ1 minimization in the form of a fused lasso  acts to cluster noisy points in $$\mathbb {R}^{d}$$. The problem of mixed linear regressions can be viewed as a challenging generalization of clustering in $$\mathbb {R}^{d}$$ where every measurement is subject to a codimension-1 ambiguity. We note in particular that in the noiseless case $$\mathbb {R}^{d}$$ clustering is trivial whereas mixed linear regression is not. Consequently, this paper focuses only on the case of noiseless measurements in order to grapple with the difficulty of this codimension-1 ambiguity. The contribution of the present paper is to introduce a convex algorithm for the mixed linear regression problem with two or more components. In the noiseless setting and under some technical conditions that include well-separation of data, this algorithm exactly recovers the mixture components with kd-independent measurements. The convex algorithm is based on an ℓ1 minimization that assigns an estimate regression coefficient in $$\mathbb {R}^{d}$$ for each data point. These estimates can then be clustered by, for example, k-means. An estimate for each class can then be found by running k separate regressions. 1.1. Convex program and main result We introduce a two-step approach for mixed linear regression. This approach introduces a free variable for each data point that acts as the estimate of the mixture component corresponding to that data point. The first step is to solve a second-order cone program to obtain estimated mixture components for each data point, and the second step is to cluster these estimates using k-means. The details of these steps are as follows: 1. Solve the following second-order cone program: \begin{align} \begin{aligned} &\operatorname*{arg\,min}_{z_{1}, \dots, z_{m} \in \mathbb{R}^{d}}\ \ \sum_{i=1}^{m}\sum_{j = 1}^{m} \|z_{i} - z_{j}\|_{2}\\ &\text{subject to}\ \ a_{i}^{\top}z_{i} = b_{i}, \ i\in{ \{1,\dots,\mathrm{m}\}}. \end{aligned} \end{align} (1.2) Each zi in (1.2) is constrained to belong to the hyperplane corresponding to the measurement (ai, bi), as per (1.1). If $$\{z_{i}^{\natural }\}_{i=1}^{m}$$ minimizes (1.2), then $$z_{i}^{\natural }$$ is an estimate of βp for the class p such that i ∈ Sp. 2. Cluster $$\{z_{i}^{\natural }\}_{i=1}^{m}$$ using k-means and estimate the mixture components corresponding to each class by running k separate regressions. Program (1.2) is an ℓ1 minimization over all pairs (i, j). Due to the sparsity promoting property of ℓ1 minimization, in some cases, (1.2) can find a solution where $$z_{i}^{\natural } = z_{j}^{\natural }$$ for many pairs (i, j). Let $$\{z_{i}^{\sharp }\}_{i=1}^{m}$$ be the candidate solution of (1.2), i.e. $$z_{i}^{\sharp } = \beta _{\ell _{i}}$$. In the noiseless case, successful recovery would mean that the minimizer of (1.2) is equal to the candidate solution, i.e. $$z_{i}^{\natural } = \beta _{\ell _{i}}$$. Even in this case, note that $$\|z_{i}^{\natural }-z_{j}^{\natural }\|_{2} \neq 0$$ for most pairs (i, j). Nonetheless, we will prove that if the measurements are well- separated in a certain sense, the minimizer of (1.2) can still be the candidate solution. We prove that under two technical assumptions (1.2) can exactly recover the regression coefficients in the noiseless case. In order to state these assumptions, let \begin{align} v_{pq} = \frac{\beta_{p}-\beta_{q}}{\|\beta_{p}-\beta_{q}\|_{2}} \end{align} (1.3) and define the weighted directions, vp, by \begin{align} v_{p} := \frac{1}{\sum_{q\neq p} n_{q}} \sum_{q \neq p}n_{q}v_{pq}. \end{align} (1.4) That is, vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Our first assumption is that the measurements are ‘well-separated’ in the following sense: \begin{align} \max_{p \in [k]}\max_{i \in S_{p}} \frac{\left\|P_{v_{p}^{\perp}}a_{i}\right\|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} < \frac{1}{2}\min_{p\in [k]}\frac{n_{p}}{m}, \end{align} (1.5) where $$P_{v_{p}}$$ is the projector onto the span of {vp} and $$P_{v_{p}^{\perp }}$$ is the projector onto the subspace orthogonal to vp. Intuitively, $$\{(a_{i},b_{i})\}_{i =1}^{m}$$ is well-separated if ai is approximately parallel to $$v_{l_{i}}$$. This condition ensures that the hyperplanes corresponding to a fixed class are within a small angle of each other. Note that $$\frac {\|P_{v}^{\perp }a\|_{2}}{\|P_{v}a\|_{2}}=0$$ if v is parallel to a and is small if v is approximately parallel to a. Examples of well-separated data and not well-separated data are shown in Fig. 1(a and b). Fig. 1. View largeDownload slide The collection of lines in panel (a) is well-separated in the sense of (1.5). These measurements also satisfy the balance condition (1.6). In contrast, the collection of lines in panel (b) is not well-separated because the angle between ν1 and a few measurements corresponding to β1 is too big. Similarly, the collection of lines in panel (b) is not balanced because the weighted average of the normal vectors corresponding to measurements in β2 is not in the direction of ν2. Note that vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Fig. 1. View largeDownload slide The collection of lines in panel (a) is well-separated in the sense of (1.5). These measurements also satisfy the balance condition (1.6). In contrast, the collection of lines in panel (b) is not well-separated because the angle between ν1 and a few measurements corresponding to β1 is too big. Similarly, the collection of lines in panel (b) is not balanced because the weighted average of the normal vectors corresponding to measurements in β2 is not in the direction of ν2. Note that vp is the weighted average of the directions to βp from other mixture components {βq}q≠p. Our second assumption is that the measurements are ‘balanced’ in the following sense: \begin{align} \sum_{i\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} = 0,\text{ for all } p \in{ \{1,\dots,\mathrm{k}\}}. \end{align} (1.6) Intuitively, $$\{(a_{i},b_{i})\}_{i =1}^{m}$$ is balanced if a particular average of the $$\{a_{i}\}_{i\in S_{p}}$$ is exactly in the direction of vp. Examples of balanced data and not balanced data are shown in Fig. 1(a and b). Our main result is that in the noiseless case, (1.2) uniquely recovers the mixture components with well-separated and balanced measurements, provided there are d-independent measurements per class. Theorem 1.1 Fix $$\{\beta _{p}\}_{p=1}^{k} \subset \mathbb {R}^{d}$$. Let $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d} \times \mathbb {R}$$ be measurements that satisfy (1.1). If the measurements satisfy the well-separation condition (1.5), the balance condition (1.6) and for all p ∈ [k], span$$\{a_{i}:i\in S_{p}\} = \mathbb {R}^{d}$$, then the solution to (1.2) is unique and satisfies $${z_{i}} = \beta _{l_{i}}$$ for all i ∈ [m]. Numerical simulation on synthetic data verify Theorem 1.1. We present the results of numerical simulation on two types of synthetic data, both satisfying the well-separation property (1.5). In the first simulation, the data satisfy the balance property (1.6) as well. The second simulation shows that recovery is possible in the imbalanced case. We also present the results of numerical simulation on two real datasets and show that the algorithm presented in this paper estimates the regression coefficients, and classifies the datasets reasonably well. 1.2. Discussion and future work We propose a convex approach for mixed linear regression with k regression coefficients. An interesting observation in the noiseless case is that the convex program (1.2) can recover both the sparsity pattern and the regression coefficients exactly. In contrast, consider the Basis Pursuit Denoise (BPDN) program \begin{align} \min_{x\in \mathbb{R}^{n}} \|x\|_{1} + \lambda \|Ax-b\|_{2}^{2} \end{align} (1.7) in the case where b = Axo, for a sparse xo. For appropriate λ, the optimizer of (1.7) will have the correct sparsity pattern, but incorrect estimates of the coefficients of xo, see . Given that both BPDN and (1.2) are based on L1 minimization, a natural question is why can (1.2) recover both sparsity pattern and the estimates? This is because if the sparsity pattern is recovered using (1.2) in the noiseless case then zi = zj for i, j in the same class. Since there are sufficient independent points corresponding to each class, there is only one possible value of zi. In the noisy case of mixed linear regression, an estimation step like clustering using k-means is required. This is analogous to the estimation step required for estimating the regression coefficient from the solution to the BPDN program. Another interesting observation is that the program (1.2) does not have a free parameter when posed in the noiseless case. This is analogous to the Basis Pursuit program \begin{align} \min_{x\in \mathbb{R}^{n}} \|x\|_{1} \text{ subject to } Ax=b, \end{align} (1.8) which also does not have any parameter in the noiseless case. In this paper, we analyze the noiseless mixed linear regression problem, and provide a recovery guarantee when data satisfies conditions (1.5) and (1.6). The current paper is mainly focused in providing an understanding of the well-separation condition while assuming that the data are balanced. However, real data are never exactly balanced. Thus, understanding the level of imbalance the algorithm can handle is a fruitful area of future work. In the same vein, another important research direction is to consider a more complete model that include noise and corruption. 1.3. Organization of the paper The remainder of the paper is organized as follows. In Section 1.4, we present notations used throughout the paper. In Section 2, we present the proof of Theorem 1.1. In Section 3, we introduce an iteratively reweighted least squares (IRLS) implementation that solves (1.2). We observe its performance on real and synthetic data. 1.4. Notation Let $$[n] = \{1,\dots ,n\}$$. Let Id×d be the d × d identity matrix. Let $${B_{r}^{d}}$$ be the d dimensional Euclidean ball of radius r centered at the origin and let $${S_{r}^{d}}$$ be the d dimensional Euclidean sphere of radius r centered at the origin. For any matrix X and any vector v, let ∥X∥F be the Frobenius norm of X and let ∥v∥2 be the ℓ2 norm of v. For any non-zero vector v, let $$\hat {v} = \frac {v}{\ \|v\|_{2}}$$. For a vector $$v \in \mathbb {R}^{d}$$, let $$P_{v^{\perp }} = I_{d\times d} - \hat {v}\hat {v}^{\top }$$ and $$P_{v} = \hat {v}\hat {v}^{\top }$$. For a vector $$a_{i} \in \mathbb {R}^{d}$$, let ais be the sth element of ai. Throughout the paper, indices i and j are related to data points, and indices p and q are related to mixture components. 2. Proof We will prove Theorem 1.1 by constructing an exact dual certificate. A dual certificate is a dual optimal point which certifies that the candidate solution is globally optimal, see [2,8,10]. To arrive at the form of the dual certificate, we now derive the Karush-Kuhn-Tucker (KKT) conditions for (1.2). Let $$f:(\mathbb {R}^{d})^{m}\rightarrow \mathbb {R}$$ be \begin{align*} f(Z) = f(z_{1},\dots,z_{m}) = \sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2}. \end{align*} The augmented Lagrangian for (1.2) is \begin{align*} \mathscr{L}(Z,\nu) = \sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2} - \sum_{i=1}^{m}\nu_{i}\left(a_{i}^{\top}z_{i}-b_{i}\right), \end{align*} where $$\nu \in \mathbb {R}^{m}$$ is a Lagrange multiplier. Let Z♯ be the candidate solution, i.e. $$z_{i}^{\sharp } = \beta _{l_{i}}$$ for all $$i \in { \{1,\dots ,\mathrm {m}\}}$$. The first order optimality conditions for (1.2) at Z♯ are \begin{align} & 0 \in \partial_{Z} \left(\sum_{i = 1}^{m}\sum_{j=1}^{m} \|z_{i} - z_{j}\|_{2}\right)(Z^{\sharp}) - \partial_{Z}\left(\sum_{i=1}^{m}\nu_{i}\left(a_{i}^{\top}z_{i}-b_{i}\right)\right)(Z^{\sharp}), \end{align} (2.1) \begin{equation} a_{i}^{\ T}z_{i}^{\sharp} = b_{i},\ i\in{ \{1,\dots,\mathrm{m}\}}, \end{equation} (2.2) where ∂Zf(Z♯) is the subdifferential of f with respect to Z evaluated at Z♯. Note that \begin{align} \partial_{(z_{i},z_{j})} \|z_{i}-z_{j}\|_{2}(z_{i}^{\sharp},z_{j}^{\sharp}) = \left\{ \begin{array}{ll} \left\{\left[\begin{array}{c} v_{pq}\\ -v_{pq} \end{array} \right]\right\}, & \ell_{i} = p,\ \ell_{j} = q,\ p\neq q\\[10pt] \left\{\left[\begin{array}{c} \xi_{ij}\\ -\xi_{ij} \end{array}\right]: \|\xi_{ij}\|_{2}\leqslant\ 1 \right\}, & z_{i}^{\sharp} = z_{j}^{\sharp} \end{array}\right. \end{align} (2.3) where vpq is defined in (1.3) and $$\xi _{ij} \in \mathbb {R}^{d}$$. So, the KKT conditions for (1.2) are for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i ∈ Sp, \begin{align} 0 \in \left\{\sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{pq} - \nu_{i}a_{i}: \xi_{ij} \in \mathbb{R}^{d},\ \|\xi_{ij}\|_{2}\leq 1,\ \xi_{ij} = -\xi_{ji},\ \nu_{i} \in \mathbb{R}\right\}\!\!. \end{align} (2.4) We will now prove that if such ξij and ν exist and if all ∥ξij∥2 < 1, then the unique output of (1.2) is the candidate solution Z♯. Lemma 2.1 Fix $$\{\beta _{p}\}_{p=1}^{k} \subset \mathbb {R}^{d}$$. Let $$\{(a_{i},b_{i})\}_{i=1}^{m} \subset \mathbb {R}^{d} \times \mathbb {R}$$ be measurements that satisfy (1.1). Assume for $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and for all i, j ∈ Sp, there exists $$\xi _{ij} \in \mathbb {R}^{d}$$ and $$\nu _{i} \in \mathbb {R}$$ such that \begin{equation} \nu_{i}a_{i}=\sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{p} , \end{equation} (2.5) \begin{equation} \|\xi_{ij}\|_{2} < 1 , \end{equation} (2.6) \begin{equation} \xi_{ij} = -\xi_{ji} . \end{equation} (2.7) Also, assume for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$, span$$\{a_{i}:i\in S_{p}\}=\mathbb {R}^{d}$$. Then, Z♯ is the unique solution to (1.2). Proof. We will show that f(Z♯ + H) > f(Z♯) for any feasible perturbation H ≠ 0. We first separate f(Z♯ + H) into two parts. \begin{align} f(Z^{\sharp}+H) = \sum_{p=1}^{k}\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + \underbrace{\sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} -z_{j}^{\sharp}+h_{i}-h_{j}\|_{2}}_{I}\!. \end{align} (2.8) To bound I from below, note that $$\|z_{i}^{\sharp } -z_{j}^{\sharp }+h_{i}-h_{j}\|_{2} \geqslant \ \|z_{i}^{\sharp } -z_{j}^{\sharp }\|_{2}+v_{pq}^{\top }(h_{i}-h_{j})$$ because vpq, defined in (1.3), is a subgradient of ∥⋅∥2 at $$z_{i}^{\sharp }-z_{j}^{\sharp }$$ if i ∈ Sp and j ∈ Sq. Thus, \begin{equation} I = \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} -z_{j}^{\sharp}+h_{i}-h_{j}\|_{2} \end{equation} (2.9) \begin{equation} \geqslant \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}\|z_{i}^{\sharp} - z_{j}^{\sharp}\|_{2} + \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}(h_{i}-h_{j}) \end{equation} (2.10) \begin{equation} = f(Z^{\sharp})+ \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}h_{i} - \sum_{p=1}^{k}\sum_{q\neq p}\sum_{i\in S_{p}}\sum_{j \in S_{q}}v_{pq}^{\top}h_{j} \end{equation} (2.11) \begin{equation} = f(Z^{\sharp})+\sum_{p=1}^{k}\sum_{q\neq p}\sum_{i \in S_{p}}n_{q}v_{pq}^{\top}h_{i}+\sum_{p=1}^{k}\sum_{q\neq p}\sum_{j\in S_{q}}n_{p}v_{qp}^{\top}h_{j} \end{equation} (2.12) \begin{equation} = f(Z^{\sharp})+2\sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}. \end{equation} (2.13) Note that in (2.12) we used vpq = −vqp and in (2.13) we used (1.4). Combining (2.8) and (2.13), it suffices to show \begin{align} \sum_{p=1}^{k}\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + 2\underbrace{\sum_{p=1}^{k}\sum_{i\in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}}_{II}>0, \end{align} (2.14) for all feasible H ≠ 0. First, we show that for all feasible H ≠ 0 there exists $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp such that hi ≠ hj. We provide a proof by contradiction. Assume for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, hi = hj. Fix $$p\in { \{1,\dots ,\mathrm {k}\}}$$. Let hj = cp for all j ∈ Sp. Since cp is a feasible perturbation, $$a_{j}^{\top }c_{p} = 0$$ for all j ∈ Sp. Thus, cp is orthogonal to any element in span$$\{a_{j}:j\in S_{p}\} = \mathbb {R}^{d}$$. Hence, hj = cp = 0 for all j ∈ Sp, which contradicts H ≠ 0. We now compute \begin{equation} II = \sum_{p=1}^{k}\sum_{i \in S_{p}}\left(\sum_{q\neq p}n_{q}v_{p}\right)^{\top}h_{i} \end{equation} (2.15) \begin{equation} = \sum_{p=1}^{k}\sum_{i \in S_{p}}\left(\nu_{i}a_{i} - \sum_{j\in S_{p},j\neq i}\xi_{ij}\right)^{\top}h_{i} \end{equation} (2.16) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}2\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}h_{i} \end{equation} (2.17) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}\left(\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}h_{i} + \sum_{j \in S_{p}}\sum_{i\in S_{p},i\neq j}\xi_{ji}^{\top}h_{j}\right) \end{equation} (2.18) \begin{equation} = -\frac{1}{2}\sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p},j\neq i}\xi_{ij}^{\top}\left(h_{i} -h_{j}\right) \end{equation} (2.19) \begin{equation} \geqslant -\frac{1}{2} \sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p}}\|\xi_{ij}\|_{2}\|h_{i}-h_{j}\|_{2} \end{equation} (2.20) \begin{equation} = -\frac{\gamma}{2} \sum_{p=1}^{k}\sum_{i \in S_{p}}\sum_{j\in S_{p}}\|h_{i}-h_{j}\|_{2}, \end{equation} (2.21) where $$\gamma :=\max _{p \in [k]}\max _{i,j \in S_{\textrm {p}}}\|\xi _{ij}\|_{2}$$. By assumption (2.6), γ < 1. Note that in (2.16), we used (2.5). In (2.17), we used $$a_{i}^{\top }h_{i} = 0$$ (since H is feasible). In (2.18), both terms in parenthesis are equal by interchanging the dummy variables i and j. In (2.19), we used the antisymmetry condition (2.7). In (2.20), we used Cauchy–Schwartz inequality. Combining (2.14) and (2.21), we get \begin{align} \sum_{p=1}^{k}\left(\sum_{i,j \in S_{p}}\|h_{i}-h_{j}\|_{2} + 2\sum_{i\in S_{p}}\sum_{q\neq p}n_{q}v_{p}^{\top}h_{i}\right) &\geqslant\ (1-\gamma)\sum_{p=1}^{k}\sum_{i,j\in S_{p}}\|h_{p}-h_{q}\|_{2}> 0. \end{align} (2.22) The strict inequality in (2.22) holds because (1 − γ) > 0 and H ≠ 0 implies that there exists $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp such that hi ≠ hj. Hence, Z♯ is the unique solution to (1.2). As a consequence of Lemma 2.1, the proof of Theorem 1.1 is simplified to constructing an exact certificate that satisfies (2.5), (2.6) and (2.7). Proof of Theorem 1.1 For all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, let ξij and νi be \begin{equation} \nu_{i} = \operatorname*{sign}(v_{p}^{\top}a_{i})\frac{\|v_{p}\|_{2}\sum_{q\neq p}n_{q}}{\|P_{v_{p}}a_{i}\|_{2}} \end{equation} (2.23) \begin{equation} \xi_{ij} = \frac{1}{n_{p}}\left(\nu_{i}P_{v_{p}^{\perp}}a_{i} - \nu_{j} P_{v_{p}^{\perp}}a_{j}\right). \end{equation} (2.24) By Lemma 2.1, it is sufficient to show that, for all $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i, j ∈ Sp, these ξij and νi satisfy (2.5), (2.6) and (2.7). Note that (2.7) follows immediately. Condition (2.6) holds because \begin{equation} \|\xi_{ij}\|_{2} \leqslant\ \frac{1}{n_{p}}\left(\|\nu_{i}P_{v_{p}^{\perp}}a_{i}\|_{2} + \|\nu_{j} P_{v_{p}^{\perp}}a_{j}\|_{2}\right) \end{equation} (2.25) \begin{equation} \leqslant\ \frac{2}{n_{p}} \max_{i \in S_{p}} \|\nu_{i}P_{v_{p}^{\perp}}a_{i}\|_{2} \end{equation} (2.26) \begin{equation} = \frac{2}{n_{p}} \max_{i \in S_{p}}\|v_{p}\|_{2}\sum_{q\neq p}n_{q}\frac{\|P_{v_{p}^{\perp}}a_{i}\|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} \end{equation} (2.27) \begin{equation} < \frac{2}{n_{p}}m\frac{n_{p}}{2m} \end{equation} (2.28) \begin{equation} = 1. \end{equation} (2.29) In (2.27), we used (2.23), and in (2.28), we used (1.5) along with ∥vp∥2 ⩽ 1. Lastly, we will show that (2.5) holds through direct computation. Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$ and i ∈ Sp. We now compute \begin{equation} \sum_{j\in S_{p}, j\neq i} \xi_{ij} + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.30) \begin{equation} = \sum_{j\in S_{p}, j\neq i} \frac{1}{n_{p}}\left(\nu_{i}P_{v_{p}^{\perp}}a_{i} - \nu_{j} P_{v_{p}^{\perp}}a_{j}\right) + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.31) \begin{equation} = \sum_{j\in S_{p}, j\neq i} \frac{1}{n_{p}}\|v_{p}\|_{2}\left(\sum_{q\neq p}n_{q}\right)\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} - \operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right) + \sum_{q\neq p}n_{q}v_{p} \end{equation} (2.32) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}}{n_{p}}\sum_{j\in S_{p}}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} -\operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right)+\sum_{q\neq p}n_{q}v_{p} \end{equation} (2.33) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}}{n_{p}}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)n_{p}\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} - \sum_{j\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{j}\right)\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}\right)+\left(\sum_{q\neq p}n_{q}\right)v_{p} \end{equation} (2.34) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\|v_{p}\|_{2}\left(\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}}+\frac{v_{p}}{\|v_{p}\|_{2}}\right) \end{equation} (2.35) \begin{equation} = \left(\sum_{q\neq p}n_{q}\right)\frac{\|v_{p}\|_{2}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)}{\|P_{v_{p}}a_{i}\|_{2}}\left(P_{v_{p}^{\perp}}a_{i}+\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{v_{p}}{\|v_{p}\|_{2}}\|P_{v_{p}}a_{i}\|_{2}\right) \end{equation} (2.36) \begin{equation} = \nu_{i}\left(P_{v_{p}^{\perp}}a_{i}+P_{v_{p}}a_{i}\right) \end{equation} (2.37) \begin{equation} = \nu_{i}a_{i}. \end{equation} (2.38) Note that in (2.32), we used (2.23). In (2.35), we used the assumed balance condition (1.6), and in (2.37), we used (2.23) and $$P_{v_{p}}a_{i} = \operatorname *{sign}(v_{p}^{\top }a_{i})\|P_{v_{p}}a_{i}\|_{2}\frac {v_{p}}{\|v_{p}\|_{2}}$$. Hence, (2.5) holds. By Lemma 2.1, Z♯ is the unique solution to (1.2). 2.1. Derivation of dual certificate In this section, we provide a derivation of constructing the dual certificate. Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$. Without loss of generality, assume $$S_{p} ={ \{1,\dots ,\mathrm {n}_{\mathrm {p}}\}}$$. Note that (2.5) can be decomposed into its components along and orthogonal to vp: \begin{equation} \sum_{j\in S_{p}, j\neq i} P_{v_{p}}\xi_{ij} + \sum_{q\neq p}n_{q}v_{p} = \nu_{i}P_{v_{p}}a_{i}, \ i \in S_{p} \end{equation} (2.39) \begin{equation} \sum_{j\in S_{p}, j\neq i} P_{v_{p}^{\perp}} \xi_{ij} = \nu_{i}P_{v_{p}^{\perp}} a_{i}, \ i \in S_{p}. \end{equation} (2.40) Note that solving (2.40) is equivalent to solving a system of linear equations Aξ = b, where $$A \in \mathbb {R}^{n_{p}d \times \frac {n_{p}(n_{p}-1)d}{2}}$$ and $$b\in \mathbb {R}^{n_{p}d}$$. Let $$\tilde {B}$$ be the first (np − 1) block rows of A and $$\tilde {b}$$ be the first (np − 1) blocks of b. So, the ith block of $$\tilde {b}$$ is $$\nu _{i}P_{v_{p}^{\perp }}a_{i}$$ and the matrix $$\tilde {B}$$ and the vector ξ are \begin{equation*} (\tilde{B})_{il} = \left\{\begin{array}{l l} I_{d\times d} & \ \text{if} \ l \in \left\{\sum_{r=1}^{i-1}(n_{p}-r) + s: \ s = 1,\dots, n_{p}-i\right\}\\[6pt] -I_{d\times d}& \ \text{if} \ i>1, \ l \in \left\{ \sum_{r=1}^{i-s}(n_{p}-r)+i-s+1: \ s = 2,\dots, i\right\}\\[6pt] 0 & \text{ otherwise,} \end{array} \right. \end{equation*} \begin{equation*} (\xi)_{l} = P_{v_{p}^{\perp}}\xi_{ij}\ \text{if} \ i = \operatorname*{arg\,max}_{\tilde{i}\in[n_{p}-1]}\sum_{s = 1}^{\tilde{i}}\frac{l-\sum_{r=1}^{s-1}(n_{p}-r)}{\left|l-\sum_{r=1}^{s-1}(n_{p}-r)\right|},\ j = l-\sum_{r=1}^{i-1}(n_{p}-r)+i. \end{equation*} It is straightforward to verify that $$\tilde {B}$$ is full row rank and the least squares solution of $$\tilde {B}\xi = \tilde {b}$$ is also a solution to Aξ = b if \begin{align} \sum_{i = 1}^{n_{p}} \nu_{i}P_{v_{p}^{\perp}}a_{i} = 0. \end{align} (2.41) Now, for all j ∈ Sp, choose $$P_{v_{p}}\xi _{ij} = 0$$ in (2.39). Then, for all i ∈ Sp, \begin{align} \nu_{i} = \operatorname*{sign}(v_{p}^{\top}a_{i})\frac{\|v_{p}\|_{2}\sum_{q\neq p}n_{q}}{\|P_{v_{p}}a_{i}\|_{2}}. \end{align} (2.42) Using these νi, (2.41) is satisfied because of the assumed balance condition (1.6). We note that the least two-norm solution of $$\tilde {B}\xi = \tilde {b}$$ is $$\xi = \tilde {B}^{T}(\tilde {B}\tilde {B}^{T})^{-1}\tilde {b}$$, where \begin{equation*} \left(\tilde{B}\tilde{B}^{T}\right)_{ls} = \left\{\begin{array}{cc} (n_{p} -1)I_{d\times d} & \text{if } l =s\\[.1em] -I_{d\times d} & \text{if } l \neq s, \end{array} \right. \text{ and} \end{equation*} \begin{equation*} \left(\left(\tilde{B}\tilde{B}^{T}\right)^{-1}\right)_{ls} = \left\{\begin{array}{cc} \frac{2}{n_{p}}I_{d\times d} & \text{if } l = s\\[.5em] \frac{1}{n_{p}}I_{d\times d} & \text{if } l \neq s. \end{array} \right. \end{equation*} Here, $$\tilde {B}\tilde {B}^{T} \in \mathbb {R}^{d(n_{j}-1)\times d(n_{j}-1)}$$. Solving for ξ, we get $$P_{v_{p}^{\perp }}\xi _{ij} = \frac {\nu _{i}P_{v_{p}^{\perp }}a_{i}}{n_{p}} - \frac {\nu _{j}P_{v_{p}^{\perp }}a_{j}}{n_{p}}$$. Since $$P_{v_{p}}\xi _{ij} = 0$$, we have $$\xi _{ij} = \frac {\nu _{i}P_{v_{p}^{\perp }}a_{i}}{n_{p}} - \frac {\nu _{j}P_{v_{p}^{\perp }}a_{j}}{n_{p}}$$. 3. Numerical results In this section we formulate an IRLS algorithm for (1.2) and provide numerical results on real and synthetic data. The general idea of the IRLS algorithm is to approximate a non-smooth objective function, as in (1.2), with a sequence of smooth functionals that converges to the objective function. In our implementation of the IRLS algorithm for (1.2), the smooth approximation results in a weighted least squares problem which can be solved through the normal equations. The pseudocode is given in Algorithm 1. We refer the readers to  for convergence result of the IRLS algorithm. Algorithm 1 outputs estimated mixture components for each data point. These estimates can then be clustered using k-means. The mixture component corresponding to each class is obtained by running k separate regressions. When implementing the IRLS algorithm, we observe convergence for a fixed δ ≪ 1 (δt in Algorithm 1 is fixed to 10−16). The maximum number of iterations is 150. Let $$Z_{t}^{\natural }$$ be the minimizer of the quadratic program in the IRLS algorithm at iteration t. The stopping criterion is $$\frac {1}{\sqrt {m}}\|Z_{t+1}^{\natural } - Z_{t}^{\natural }\|_{F} < 10^{-5}$$. We use Algorithm 1, along with k-means, to classify two real datasets. In both of these datasets, the true labels of the data points are unknown. We compare the output of Algorithm 1 followed by k-means with the output of an algorithm introduced in . The algorithm introduced in  recovers an estimate of the mixture components when used on a mixed linear regression problem with bounded but arbitrary noise. For ease of reference, this algorithm is presented in Algorithm 2. First, we analyze the BikeShare data that contain the number of bikes rented in a day in the San Francisco bay area from 1 September 2014 to 30 August 2015. The data are provided by the Bay Area BikeShare program.1 In this dataset, m = 364 and d = 2. For each measurement $$(a_{i},b_{i}) \in \mathbb {R}^{2}\times \mathbb {R}$$, ai1 = i is the independent variable of a simple linear model, ai2 = 1 corresponds to the constant part and bi is the response. In this model, bi is the number of bikes rented. Let $$\mu ={ \frac {1}{m}}\sum _{i=1}^{m}a_{i1}$$ and α = 0.005. Here, the average μ is used to center the dataset and α is used to ensure the dataset is sufficiently well-separated. Let $$\tilde {a}_{i1} = \alpha \left (a_{i1}-\mu \right )$$ and let $$\tilde {a}_{i} = \left (\tilde {a}_{i1},a_{i2}\right )$$. Algorithm 1 is then used on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$. For purpose of illustration, k-means with k = 2 is used on the output of Algorithm 1 to estimate the mixture components. The result of this process is shown in Fig. 2(c), where the classification seems to differentiate between weekend and weekday bike rental trends. For comparison, the result of Algorithm 2 on the dataset $$\left \{\left (\tilde {a}_{i},b_{i}\right )\right \}_{i=1}^{m}$$ with η set to 100 obtained using the semidefinite-quadratic-linear programing solver SDPT3 is shown in Fig. 2(d). Fig. 2. View largeDownload slide Panel (a) shows data consisting of the number of bikes rented per day in the Bay Area BikeShare program. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Qualitatively, the classification presented in panel (c) seems to differentiate between weekend and weekday bike rental trends. Panel (d) shows the result of Algorithm 2 with η = 100. Fig. 2. View largeDownload slide Panel (a) shows data consisting of the number of bikes rented per day in the Bay Area BikeShare program. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Qualitatively, the classification presented in panel (c) seems to differentiate between weekend and weekday bike rental trends. Panel (d) shows the result of Algorithm 2 with η = 100. Secondly, we consider music tone perception data which show the relationship between actual tone and tone perceived by a musician. These data were generated in an experiment conducted by Cohen in 1980 . In this dataset, m = 150 and d = 2. For each measurement $$(a_{i},b_{i}) \in \mathbb {R}^{2}\times \mathbb {R}$$, ai1 = i is the independent variable of a simple linear model, ai2 = 1 corresponds to the constant part and bi is the response. In this model, ai1 is the actual tone and bi is the perceived tone. Let $$\mu ={ \frac {1}{m}}\sum _{i=1}^{m}a_{i1}$$ and α = 40. Here, the average μ is used to center the dataset and α is used to ensure the dataset is sufficiently well-separated. Let $$\tilde {a}_{i1} = \alpha (a_{i1}-\mu )$$ and let $$\tilde {a}_{i} = (\tilde {a}_{i1},a_{i2})$$. Algorithm 1 is then used on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$. For purpose of illustration, k-means with k = 2 is used on the output of Algorithm 1 to estimate the mixture components. The result of this process is shown in Fig. 3(c). For comparison, the result of Algorithm 2 on the dataset $$\{(\tilde {a}_{i},b_{i})\}_{i=1}^{m}$$ with η set to 4 obtained using SDPT3 solver is shown in Fig. 2(d). Fig. 3. View largeDownload slide Panel (a) shows perceived tone by a musician versus actual tone for a range of tones. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Panel (d) shows the result of Algorithm 2 with η = 4. Fig. 3. View largeDownload slide Panel (a) shows perceived tone by a musician versus actual tone for a range of tones. A circle or a dot in panel (b) is an estimate of one of the mixture components corresponding to a data point in panel (a). Panel (b) shows the output of (1.2) followed by k-means with k = 2. Panel (c) shows the fitted lines obtained using two separate regressions. In panel (c), points in first and second classes are represented by dots and circles, respectively. Panel (d) shows the result of Algorithm 2 with η = 4. We also provide two simulation results on synthetic data. The first simulation result verifies Theorem 1.1 and the second simulation result shows recovery using (1.2) in the case with imbalanced measurements is possible. For the first simulation, let k = 3, $$d \in \{3,\dots ,15\}$$ and m = 48. Let $$\left \{\beta _{p}\right \}_{p=1}^{k}\subset \mathbb {R}^{d}$$ be such that $$\left (\beta _{p}\right )_{l} = 1$$ if l = p and 0 otherwise. Let n1 = n2 = n3 = 16. For $$P_{v_{p}^{\perp }}$$, let the columns of $$Q_{p}\in \mathbb {R}^{d\times (d-1)}$$ be an orthonormal basis of the column space of $$P_{v_{p}^{\perp }}$$. Consider the following measurements for the first simulation: Fix $$p \in { \{1,\dots ,\mathrm {k}\}}$$. For $$i \in { \left \{1,\dots ,\frac {\mathrm {n}_{\mathrm {p}}}{2}\right \}}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$, i.e. xi is a random point uniformly distributed in the d − 1 dimensional ℓ2 ball of radius α centered at the origin. Here, hyperplanes corresponding to labels in Sp are contained in an aperture α ∈ [0, 0.75]. For $$i \in { \left \{1,\dots ,\frac {\mathrm {n}_{\mathrm {p}}}{2}\right \}}$$, let $$a_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$a_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. These measurements are symmetric and satisfy the balance condition since there exists pairs i, j ∈ Sp such that \begin{align} \frac{P_{v_{p}^{\perp}}a_{i}}{\|P_{v_{p}}a_{i}\|_{2}} = -\frac{P_{v_{p}^{\perp}}a_{j}}{\|P_{v_{p}}a_{j}\|_{2}}. \end{align} (3.1) Lastly, for $$i \in { \{1,\dots ,\mathrm {n}_{\mathrm {p}}\}}$$, let $$b_{i} = a_{i}^{\top }\beta _{p}$$. Figure 4 shows the fraction of successful recovery from 10 independent trials for mixed linear regression from data as described above. Black squares correspond to no successful recovery and white squares to 100% successful recovery. Let Z♯ be the candidate minimizer of (1.2) and let Z♮ be the output of (1.2). For each trial, we say (1.2) successfully recovers the mixture components if $$\frac {1}{\sqrt {m}}\|Z^{\natural } - Z^{\sharp }\|_{F} < 10^{-5}$$. This evaluation metric is used because we want to provide numerical verification of Theorem 1.1. In the area to the left of the line, the measurements are well-separated in the sense of (1.5), i.e. $$\max _{p\in }\max _{i\in S_{p}}\frac { \|P_{v_{p}^{\perp }}a_{i} \|_{2}}{\|P_{v_{p}}a_{i}\|_{2}} < \frac {1}{6}$$. The figure also shows that when measurements are not well-separated, recovery using (1.2) will likely fail. Fig. 4. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and aperture, α. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Fig. 4. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and aperture, α. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. For the second simulation, let k = 3 and $$d \in \{3,\dots ,15\}$$. Let $$\left \{\beta _{p}\right \}_{p=1}^{k}\subset \mathbb {R}^{d}$$ be such that $$\left (\beta _{p}\right )_{l} = 1$$ if l = p and 0 otherwise. Let n1 = n2 = n3 = 4d. For $$P_{v_{p}^{\perp }}$$, let the columns of $$Q_{p}\in \mathbb {R}^{d\times (d-1)}$$ be an orthonormal basis of the column space of $$P_{v_{p}^{\perp }}$$, as in the first simulation. Consider the following measurements for the second simulation: Fix the aperture, α = 0.2. For p ∈{1, 2} and $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$. For $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$a_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$a_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. These measurements are symmetric as in the first simulation. For measurements that belong to the third class, we perturb the symmetric measurements to achieve the desired level of imbalance. Specifically, for p = 3 and $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$x_{i} \sim \text {Uniform}\left (B_{\alpha }^{d-1}\right )$$. For $$i \in \left \{1,\dots ,\frac {n_{p}}{2}\right \}$$, let $$\tilde {a}_{i} = \hat {v}_{p}+Q_{p}x_{i}$$ and for $$i \in \left \{\frac {n_{p}}{2} +1, \dots , n_{p}\right \}$$, let $$\tilde {a}_{i}= \hat {v}_{p}-Q_{p}x_{i-\frac {n_{p}}{2}}$$. Let $$w \sim \text {Uniform}\left (S_{\tau _{p}}^{d-1}\right )$$ and for $$i\in \left \{1,\dots ,n_{p}\right \}$$, let $$a_{i} = \tilde {a_{i}}+Q_{p}w$$. Here, τp ∈ [0, 0.062] is the measure of imbalance of the measurements that belongs to the pth class. Note that \begin{align} \tau_{p} = \frac{1}{n_{p}}\left \|\sum_{i\in S_{p}}\operatorname*{sign}\left(v_{p}^{\top}a_{i}\right)\frac{P_{v_{p}^{\perp}}a_{i}}{\left\|P_{v_{p}}a_{i}\right\|_{2}}\right\|_{2}. \end{align} (3.2) Lastly, for $$i \in \{1,\dots ,n_{p}\}$$, let $$b_{i} = a_{i}^{\top }\beta _{p}$$. Figure 5 shows the fraction of successful recovery from 10 independent trials for mixed linear regression from data as described above. In Fig. 5, the number of measurements in each class is four times the dimension. Black squares correspond to no successful recovery and white squares to 100% successful recovery. Let Z♯ be the candidate minimizer of (1.2) and let Z♮ be the output of (1.2). For each trial, we say (1.2) successfully recovers the mixture components if $$\frac {1}{\sqrt {m}}\|Z^{\natural } - Z^{\sharp }\|_{F} < 10^{-5}$$. The figure shows that recovery using (1.2) from imbalanced measurements is possible if the number of measurements scale linearly with dimension of the mixture components. Fig. 5. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and imbalance, τ3, of measurements in third class. The measurements in classes 1 and 2 are exactly balanced, i.e. τ1 = τ2 = 0. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Fig. 5. View largeDownload slide The empirical recovery probability from synthetic data with three mixture components as a function of dimension, d, and imbalance, τ3, of measurements in third class. The measurements in classes 1 and 2 are exactly balanced, i.e. τ1 = τ2 = 0. The shades of black and white represents the fraction of successful simulation. White blocks correspond to successful recovery and black blocks correspond to unsuccessful recovery. Each block corresponds to the average from 10 independent trials. Funding National Science Foundation (DMS-1464525) to P.H. Footnotes Dataset can be obtained from the following URL: http://www.bayareabikeshare.com/open-data References 1. Bissantz , N. , Dümbgen , L. , Munk , A. & Stratmann , B. ( 2009 ) Convergence analysis of generalized iteratively reweighted least squares algorithms on convex function spaces . SIAM J. Optimization , 19 , 1828 -- 1845 . Google Scholar Crossref Search ADS 2. Candès , E. & Recht , B. ( 2012 ) Exact matrix completion via convex optimization . Commun. ACM , 55 , 111 -- 119 . Google Scholar Crossref Search ADS 3. Chaganty , A. & Liang , P. ( 2013 ) Spectral experts for estimating mixtures of linear regressions . J. Mach. Learn. Res. , 28 , 1040 -- 1048 . 4. Chen , Y. , Yi , X. & Caramanis , C. ( 2014 ) A convex formulation for mixed regression with two components: minimax optimal rates . J. Mach. Learn. Res. , 35 , 560 -- 604 . 5. Cohen , E. A. ( 1980 ) Inharmonic tone perception . Ph.D. Thesis, Stanford University . 6. Deb , P. & Holmes , A. ( 2000 ) Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models . Health Econ. , 6 , 475 -- 489 . Google Scholar Crossref Search ADS 7. Elhamifar , E. & Vidal , R. ( 2013 ) Sparse subspace clustering: algorithm, theory, and applications . IEEE Trans. Pattern Anal. Mach. Intell. , 35 11 , 2765 -- 2781 . Google Scholar Crossref Search ADS PubMed 8. Fuchs , J. J. ( 2004 ) On sparse representations in arbitrary redundant bases . IEEE Trans. Inf. Theory , 50 , 1341 -- 1344 . Google Scholar Crossref Search ADS 9. Grun , B. & Leisch , F. ( 2007 ) Applications of finite mixtures of regression models . https://cran.r-project.org/web/packages/flexmix/vignettes/regression-examples.pdf. 10. Hand , P . ( 2014 ) Conditions for existence of dual certificates in rank-one semidefinite problems . Communications in Mathematical Sciences, 12 , 1363 -- 1378 . 11. Hocking , T. , Philippe Vert , J. , Joulin , A. & Bach , F. R. ( 2011 ) Clusterpath: an algorithm for clustering using convex fusion penalties. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (L. Getoor & T. Scheffer eds). New York, NY : ACM , pp. 745 -- 752 . 12. Tibshirani , R. , Saunders , M. , Rosset , S. , Zhu , J. & Knight , K. ( 2005 ) Sparsity and smoothness via the fused lasso . J. R. Stat. Soc. Ser. B (Stat. Methodol.) , 67 , 91 -- 108 . Google Scholar Crossref Search ADS 13. Xinyang Yi , C. C. & Sanghavi , S. ( 2016 ) Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749 . 14. Li , Y.-H. , & Jonathan Scarlett , P. R. V. C. , ( 2015 ) Sparsistency of ℓ1-regularized M-estimators. Proceedings of the 18th International Conference on Artifical Intelligence and Statistics , San Diego, CA, USA . 15. Yi , X. , Caramanis , C. & Sanghavi , S. ( 2014 ) Alternating minimization for mixed linear regression. Proceedings of The 31st International Conference on Machine Learning , pp. 613 -- 621 . Symbol indicates reproducible data. © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

### Journal

Information and Inference: A Journal of the IMAOxford University Press

Published: Sep 19, 2018

Access the full text.