Abstract We study the problem of estimating a low-rank tensor when we have noisy observations of a subset of its entries. A rank-|$r$|, order-|$d$|, |$N \times N \times \cdots \times N$| tensor, where |$r=O(1)$|, has |$O(dN)$| free variables. On the other hand, prior to our work, the best sample complexity achieved in the literature is |$O\left(N^{\frac{d}{2}}\right)$|, obtained by solving a tensor nuclear-norm minimization problem. In this paper, we consider the ‘M-norm’, an atomic norm whose atoms are rank-1 sign tensors. We also consider a generalization of the matrix max-norm to tensors, which results in a quasi-norm that we call ‘max-qnorm’. We prove that solving an M-norm constrained least squares (LS) problem results in nearly optimal sample complexity for low-rank tensor completion (TC). A similar result holds for the max-qnorm as well. Furthermore, we show that these bounds are nearly minimax rate-optimal. We also provide promising numerical results for max-qnorm constrained TC, showing improved recovery compared to matricization and alternating LS. 1. Introduction Representing data as multi-dimensional arrays, i.e., tensors, arises naturally in many modern applications such as interpolating large-scale seismic data [17, 41], medical imaging [51], data mining [1], image compression [49, 60], hyper-spectral image analysis [46] and radar signal processing [54]. A more extensive list of such applications can be found in [40]. There are many reasons why one may want to work with a subset of the tensor entries: (i) when these data sets are large, we may wish to store only a small number of the entries for compression; (ii) in some applications, the acquisition of each entry can be expensive, e.g., each entry may be obtained by solving a large partial differential equation [70]; (iii) some of the tensor entries might be lost due to physical constraints while gathering them. These restrictions result in situations where one has access only to a subset of the tensor entries. The problem of tensor completion (TC) entails recovering a tensor from a subset of its entries. Without assuming further structure on the underlying tensor, the TC problem is ill-posed, as the missing entries can take on any value and are thus impossible to recover. Therefore, here (and in many applications) the tensors of interest are those that can be expressed (approximately) as a lower dimensional object compared to the ambient dimension of the tensor. In particular, in this paper, we consider tensors that have low CP-rank [14, 31] (which, often, we will simply call ‘rank’). The low-rank assumption makes TC a feasible problem. For example, an order-|$d$|, rank-|$r$| tensor of size |$N_1 \times N_2 \times \cdots \times N_d$|, where |$N_i=O(N)$|, has |$O(rNd)$| free variables, which is much smaller than |$N^d$|, the ambient dimension of the tensor. The TC problem focuses on two important goals, given a low-rank tensor: (1) identify the number of entries that need to be observed from which a good approximation of the tensor can be recovered; this number will depend on the size parameters |$N_i$|, the rank |$r$| and the order |$d$|. (2) Design stable and tractable methods that can recover a low-rank tensor from a subset of its entries. The order-|$2$| case, known as matrix completion, has been extensively studied in the literature [12, 20, 23, 38, 64]. A natural approach is to find the matrix with the lowest rank that is consistent with the measurements.
However, rank minimization is NP-hard; therefore, research focused on tractable alternatives. One such alternative is nuclear norm, also known as trace-norm, which is the convex relaxation of the rank function [23]. It was shown in [13] that solving a nuclear-norm minimization problem would recover a rank-|$r$|⁠, |$N \times N$| matrix from only |$O(rN\textrm{polylog}(N))$| samples under mild incoherence conditions on the matrix. The nuclear norm is the sum of the singular values of the matrix and it is also the dual of the spectral norm. Extensive research has been done in analysing variants of nuclear-norm minimization and designing efficient algorithms to solve it as shown in [8, 11, 13, 71]. Alternative interpretations for the rank, as well as nuclear norm, can be obtained by considering certain factorizations. In particular, the rank of a matrix |$M$| is the minimum number of columns of the factors |$U,\ V$|⁠, where |$M=UV^{\top }$|⁠; the nuclear norm is the minimum product of the Frobenius norms of the factors, i.e., |$\|M\|_{\ast }:= \min \|U\|_F \|V\|_F\ \textrm{subject to}\; M=UV^{\top }$| [64]. An alternative proxy for the rank of a matrix is its max-norm defined as |$\|M\|_{\max }:=\min \|U\|_{2,\infty }\|V\|_{2,\infty }$||$\textrm{subject to}\ M=UV^{\top }$| [64], where |$\|U\|_{2,\infty }$| is the maximum |$\ell _2$|-norm of the rows of |$U$|⁠. The max-norm bounds the norm of the rows of the factors |$U$| and |$V$| and was used for matrix completion in [24]. There, the authors studied both max-norm and trace-norm matrix completion by analysing the Rademacher complexity of the unit balls of these norms. They proved that under uniformly random sampling, either with or without replacement, |$m=O\left(\frac{rN}{\varepsilon }\log ^3\left(\frac{1}{\varepsilon }\right)\right)$| samples are sufficient for achieving mean squared recovery error |$\varepsilon $| using max-norm constrained estimation and |$m=O\left(\frac{rN\log (N)}{\varepsilon }\log ^3\left(\frac{1}{\varepsilon }\right)\right)$| samples are sufficient for achieving mean squared recovery error |$\varepsilon $| using nuclear-norm constrained estimation. Despite all the powerful tools and algorithms developed for matrix completion, TC problem is still fairly open and not as well understood. For instance, there is a large gap between theoretical guarantees and what is observed in numerical simulations. This is mainly due to the lack of efficient orthogonal decompositions, low-rank approximations and limited knowledge of the structure of low-rank tensors compared to matrices. This large gap has motivated much research connecting the general TC problem to matrix completion by rearranging the tensor as a matrix, including the sum of nuclear-norms (SNN) model that minimizes the SNN of matricizations of the tensor along all its dimensions, leading to sufficient recovery with |$m=O(r N^{d-1})$| samples [28, 49]. More balanced matricizations, such as the one introduced in [52], can result in a better bound of |$m=O\left(rN^{\lceil \frac{d}{2} \rceil{}}\right)$| samples. Once we move from matrices to higher-order tensors, many of the well-known facts of matrix algebra cease to be true. For example, even a best rank-|$k$| approximation may not exist for some tensors, as illustrated in [40, Section 3.3], showing that the set of tensors with rank |$r\le 2$| is not closed. 
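To make the factorization-based characterizations of the matrix nuclear norm and max-norm above concrete, here is a short numpy sketch. The sizes and the random test matrix are arbitrary illustrative choices (not from the paper), and any single factorization only upper-bounds the minima in the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 50, 3                      # illustrative sizes
U = rng.standard_normal((N, r))
V = rng.standard_normal((N, r))
M = U @ V.T                       # a rank-r matrix M = U V^T

# Any single factorization only upper-bounds the minima in the definitions:
nuc_upper = np.linalg.norm(U, 'fro') * np.linalg.norm(V, 'fro')
row_norm = lambda A: np.linalg.norm(A, axis=1).max()   # ||A||_{2,infty}
max_upper = row_norm(U) * row_norm(V)

# Exact nuclear norm = sum of singular values, for comparison:
nuc_exact = np.linalg.svd(M, compute_uv=False).sum()

print(f"||M||_*   = {nuc_exact:.2f}  <=  ||U||_F ||V||_F = {nuc_upper:.2f}")
print(f"||M||_max <= ||U||_{{2,inf}} ||V||_{{2,inf}} = {max_upper:.2f}")
```

Minimizing these products over all factorizations would recover the nuclear norm and max-norm exactly; the sketch only evaluates one candidate factorization.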
In fact, [32] proves that most tensor problems are NP-hard for tensors with |$d \geq 3$| (this is also the title of [32]), including computing the rank, the spectral norm and the nuclear norm. The computational complexity of directly solving TC without matricizing on the one hand, and the inferior results of matricization on the other, make TC challenging. With all these complications in mind, on the theoretical side, a low-rank tensor has |$O(rdN)$| free variables, but the best upper bound on the sample complexity prior to our paper, which is based on matricization, is |$O\left(rN^{\lceil \frac{d}{2} \rceil{}}\right)$|. When |$d>2$|, the polynomial dependence on |$N$| seems to have a lot of room for improvement. Moreover, it is well known that empirical recovery results are much better when the tensor is not rearranged as a matrix, even though in this case one attempts to solve an NP-hard problem. This has resulted in efforts towards narrowing this gap by means of heuristic algorithms [4, 49]. In spite of good empirical results and reasonable justifications, a theoretical study filling in the gap was not presented in these cases. The nuclear norm of a tensor, defined as the dual of the spectral norm, was originally introduced in [30, 58] and has been revisited in more depth in the past few years, e.g., in [21, 33]. Recently [73] studied TC using nuclear-norm minimization and proved that under mild conditions on the tensor, |$m=O(\sqrt{r}N^{\frac{d}{2}} \log (N))$| measurements are sufficient for successful recovery, which is still far from the number of free variables. With the goal of obtaining linear dependence on |$N$|, we analyse TC via a max-qnorm (max-quasi-norm) constrained least squares (LS) algorithm, where the max-qnorm is a direct generalization of the matrix max-norm to the case of tensors. Unfortunately, the max-qnorm is non-convex. Motivated by this and based on the unit ball of the dual of the max-qnorm (which is a convex norm), we define a convex atomic norm, which we call the M-norm. We then analyse an M-norm constrained LS problem, where we obtain recovery bounds that are optimal in their dependence on the size |$N$| of the tensor. The main contributions of this paper are as follows. For the sake of simplicity, consider an order-|$d$| tensor |$T \in \mathbb{R}^{N \times \cdots \times N}$|. Notice that all the results in this paper can be extended to |$(N_1 \times N_2 \times \cdots \times N_d)$|-tensors provided |$N_i = O(N)$| for |$1\leq i \leq d$|. Therefore, unless the general case is more informative, we assume |$N_i=N$| for |$1\leq i \leq d$|. 1.1 Notations and basics on tensors We adopt the notation of Kolda and Bader’s review on tensor decompositions [40]. Below, |$\lambda $|, |$\sigma $| and |$\alpha $| are used to denote scalars, and |$C$| and |$c$| denote universal constants. Vectors are denoted by lower case letters, e.g., |$u$| and |$v$|. Matrices and tensors are represented by upper case letters, usually using |$A$| and |$M$| for matrices and |$T$| and |$X$| for tensors. Tensors are a generalization of matrices to higher order, also called multi-dimensional arrays. For example, a first-order tensor is a vector and a second-order tensor is a matrix. |$X \in \bigotimes _{i=1}^{d} \mathbb{R}^{N_i}$| is a |$d$|th-order tensor whose |$i$|th dimension has size |$N_i$|. An order-|$d$| tensor is sometimes referred to as a |$d$|-dimensional tensor. We also denote |$\bigotimes _{i=1}^{d} \mathbb{R}^{N}$| as |$\mathbb{R}^{N^d}$|.
Entries of a tensor are either specified as |$X_{i_1, i_2, \cdots , i_d}$| or |$ X(i_1, i_2, \cdots , i_d)$|⁠, where |$1 \leq i_j \leq N_j$| for |$1 \leq j \leq d$|⁠. Alternatively, setting |$\omega =(i_1, i_2, \cdots , i_d)$|⁠, |$X_{\omega }:=X(i_1, i_2, \cdots , i_d)$|⁠. The inner product of |$T$| and |$X$| is denoted by |$\langle T, X\rangle $|⁠. The symbol |$\circ $| represents both matrix and vector outer products, where |$T=U_1 \circ U_2 \circ \cdots \circ U_d$| means |$T(i_1,i_2,\cdots ,i_d)=\sum _{k} U_1(i_1,k)U_2(i_2,k)$||$\cdots U_d(i_d,k)$|⁠, where |$k$| ranges over the columns of the factors. We denote |$U^{(1)} \circ U^{(2)} \circ \cdots \circ U^{(d)}$| as |$\bigcirc _{i=1}^{i=d} U^{(i)}$|⁠. In the special case when |$u_j$| are vectors, |$T=u_1 \circ u_2 \circ \cdots \circ u_d$| satisfies |$T(i_1,i_2,\cdots ,i_d)=u_1(i_1)u_2(i_2)\cdots u_d(i_d)$|⁠. The infinity-norm of a tensor |$T$|⁠, |$\|T\|_{\infty }$|⁠, is the infinity norm of the vectorized version of the tensor. In particular, |$\|T\|_{\infty } \leq \alpha $| means that the magnitudes of the entries of |$T$| are bounded by |$\alpha $|⁠. Finally, |$[N]:=\{1, \cdots , N\}$| and |$[N]^d$| denotes the set |$[N] \times [N] \times ... \times [N]$| that consists of |$d$|-tuples. 1.1.1 Rank of a tensor A unit tensor is a tensor |$U \in \bigotimes _{j=1}^{d} \mathbb{R}^{N_j}$| that can be written as \begin{equation} U=u^{(1)} \circ u^{(2)} \circ \cdots \circ u^{(d)}, \end{equation} (1.1) where |$u^{(j)} \in \mathbb{R}^{N_j}$| is a unit-norm vector. The vectors |$u^{(j)}$| are called the components of |$U$|⁠. Define |$\mathbb{U}_d$| to be the set of unit tensors of order |$d$|⁠. A rank-|$1$| tensor is a scalar multiple of a unit tensor. The rank of a tensor |$T$|⁠, denoted by rank(⁠|$T$|⁠), is defined as the smallest number of rank-|$1$| tensors that generate |$T$| as their sum, i.e., \begin{equation*} T = \sum_{i=1}^r \lambda_i U_i = \sum_{i=1}^r \lambda_i u_i^{(1)} \circ u_i^{(2)} \circ \cdots \circ u_i^{(d)}, \end{equation*} where |$U_i \in \mathbb{U}_d$| is a unit tensor. This low-rank decomposition is also known as CANDECOMP/PARAFAC (CP) decomposition [14, 31]. In this paper, we use CP decompositions; however, we note that there are other decompositions that are used in the literature such as Tucker decomposition [69]. For a detailed overview of alternate decompositions, refer to [40]. 1.1.2 Tensor norms Define |$\mathbb{T}_d$| to be set of all order-|$d$| tensors of size |$N_1 \times N_2 \times \cdots \times N_d$|⁠. For |$X, T \in \mathbb{T}_d$|⁠, the inner product of |$X$| and |$T$| is defined as \begin{equation*} \langle X, T\rangle = \sum_{i_1=1}^{N_1} \sum_{i_2=1}^{N_2} \cdots \sum_{i_d=1}^{N_d} X_{i_1, i_2, \cdots, i_d} T_{i_1, i_2, \cdots, i_d}. \end{equation*} Consequently, the Frobenius norm of a tensor is defined as \begin{equation} \|T\|_F^2:= \sum_{i_1=1}^{N_1} \sum_{i_2=1}^{N_2} \cdots \sum_{i_d=1}^{N_d} T_{i_1, i_2, \cdots, i_d}^2 = \langle T,T \rangle. \end{equation} (1.2) Next, using the definition of unit tensors, one can define the spectral norm of tensors as \begin{equation} \|T\|:= \max_{U \in \mathbb{U}_d} \langle T, U\rangle. \end{equation} (1.3) Similarly, nuclear norm was also generalized for tensors (see [26, 47], although the original idea dates back to Grothendieck [30]) as \begin{equation} \|T\|_{\ast}:= \max_{\|X\| \leq 1} \langle T, X\rangle. 
\end{equation} (1.4) Here, we generalize the definition of the max-norm to tensors as \begin{equation} \|T\|_{\max}:= \min_{T=\bigcirc_{i=1}^{i=d} U^{(i)}}\Bigg\{ \prod_{j=1}^{d} \|U^{(j)}\|_{2,\infty}\Bigg\}, \end{equation} (1.5) where |$\|U\|_{2,\infty } = \sup _{\|x\|_2=1} \|Ux\|_{\infty }$|. In Section 3 we prove that for |$d>2$| this generalization does not satisfy the triangle inequality and is a quasi-norm (which we call the max-qnorm). We analyse the max-qnorm thoroughly in Section 3. 1.2 Simplified upper bound on TC recovery error Without going into details, we briefly state and compute the upper bounds we establish (in Section 4.2) on the recovery errors associated with M-norm and max-qnorm constrained (MNC) TC. Given a rank-|$r$|, order-|$d$| tensor |$T^{\sharp } \in \bigotimes _{i=1}^{d} \mathbb{R}^N$|, and a random subset of indices |$\varOmega =\{\omega _1,\omega _2,\cdots ,\omega _m\},\ \omega _i \in [N]^d$|, we observe |$m$| noisy entries |$\{Y(\omega _t)\}_{t=1}^{m}$| of |$\{T^{\sharp }(\omega _t)\}_{t=1}^{m}$|, where each observation is perturbed by i.i.d. noise with mean zero and variance |$\sigma ^2$|. The purpose of TC is to recover |$T^{\sharp }$| from these |$m$| observations when |$m \ll N^d$|. To give a simple version of our result, we assume that the indices in |$\varOmega $| are drawn independently at random with the same probability for each observation, i.e., we assume uniform sampling. Note that we provide a general observation model in Section 4.1 and a general version of the theorem (which covers both uniform and non-uniform sampling) in Section 4. Theorem 1 Consider a rank-|$r$|, order-|$d$| tensor |$T^{\sharp } \in \bigotimes _{i=1}^{d} \mathbb{R}^N$| with |$\|T^{\sharp }\|_{\infty } \leq \alpha $|. Assume that we are given a collection of noisy observations \begin{equation*}Y(\omega_t)= T^{\sharp}(\omega_t) + \sigma \xi_t, \ \ t=1,\cdots,m,\end{equation*} where the noise variables |$\xi _t$| are i.i.d. standard normal random variables, and each index |$\omega _t$| is chosen uniformly at random over all the indices of the tensor. Then, there exists a constant |$C<20$| such that the solution of \begin{equation} \hat{T}_{M} = \arg \min_{X} \frac{1}{m}\sum_{t=1}^{m} (X(\omega_t)-Y(\omega_t))^2 \ \ \ \ \textrm{subject to}\ \ \ \ \|X\|_{\infty} \leq \alpha,\ \|X\|_{M} \leq (r\sqrt{r})^{d-1} \alpha, \end{equation} (1.6) satisfies \begin{equation*}\frac{\|T^{\sharp}-\hat{T}_{M}\|_F^2}{N^d} \leq C (\alpha + \sigma) \alpha (r\sqrt{r})^{d-1}\sqrt{\frac{d N}{m}},\end{equation*} with probability greater than |$1-\textrm{e}^{\frac{-N}{\ln (N)}}-\textrm{e}^{-\frac{dN}{2}}$|. Moreover, the solution of \begin{equation} \hat{T}_{\max} = \arg \min_{X} \frac{1}{m}\sum_{t=1}^{m} (X(\omega_t)-Y(\omega_t))^2 \ \ \ \ \textrm{subject to}\ \ \ \ \|X\|_{\infty} \leq \alpha,\ \|X\|_{\max} \leq (\sqrt{r^{d^2-d}}) \alpha, \end{equation} (1.7) satisfies \begin{equation*}\frac{\|T^{\sharp}-\hat{T}_{\max}\|_F^2}{N^d} \leq C_d (\alpha + \sigma) \alpha \sqrt{r^{d^2-d}}\sqrt{\frac{d N}{m}},\end{equation*} with probability greater than |$1-\textrm{e}^{\frac{-N}{\ln (N)}}-\textrm{e}^{-\frac{dN}{2}}$|. Remark 2 Above, |$\|X\|_M$| is the M-norm of the tensor |$X$|, an atomic norm whose atoms are rank-|$1$| sign tensors, defined in Section 3.2, equation (3.6), and |$\|X\|_{\max }$| is the max-qnorm defined in (1.5). Remark 3 (Theoretical contributions). The general framework for establishing these upper bounds is already available (the key is to control the Rademacher complexity of the set of interest).
The methods to adapt this to the matrix case are available in, e.g., [10, 63]. To move to TC, we study the interaction of the max-qnorm, the M-norm and the rank of a tensor in Section 3. The tools given in Section 3 allow us to generalize matrix completion to TC. 1.3 Organization In Section 2 we briefly overview recent results on TC and max-norm constrained matrix completion. In Section 3 we introduce the generalized tensor max-qnorm and characterize the max-qnorm unit ball, which is crucial in our analysis. This also results in defining a certain convex atomic norm that gives similar bounds for the constrained TC problem. We also prove that both the M-norm and the max-qnorm of a bounded rank-|$r$| tensor |$T$| can be bounded by a function of |$\|T\|_{\infty }$| and |$r$|, independently of |$N$|. We have deferred all the proofs to Section 8. In Section 4 we explain the TC problem and state the main results on recovering low-rank bounded tensors. We also compare our results with previous results on TC and max-norm constrained matrix completion. In Section 6 we state an information theoretic lower bound for M-norm constrained TC, which proves that the dependence of our upper bound on the size is optimal. In Section 7 we present numerical results on the performance of MNC TC and compare it with applying matrix completion to the matricized version of the tensor; Section 8 contains all the proofs. 2. Related work 2.1 Tensor matricization The process of reordering the entries of a tensor into a matrix is called matricization, also known as unfolding or flattening. For a tensor |$X \in \bigotimes _{i=1}^{d} \mathbb{R}^{N_i}$|, the mode-|$k$| fibres of the tensor are the |$\varPi _{j \neq k} N_j$| vectors obtained by fixing all indices of |$X$| except the |$k$|th one. The mode-|$k$| matricization of |$X$|, denoted by |$X_{(k)} \in \mathbb{R}^{N_k \times \varPi _{j \neq k} N_j}$|, is obtained by arranging all the mode-|$k$| fibres of |$X$| along columns of the matrix |$X_{(k)}$|. More precisely, |$X_{(k)}(i_k,j)=X(i_1, i_2, \cdots , i_d)$|, where \begin{equation*}j=1+\sum_{s=1,s \neq k}^d(i_s-1)J_s\ \ \ \ \textrm{with}\ \ \ \ J_s=\varPi_{l=1,l\neq k}^{s-1} N_l.\end{equation*} A detailed illustration of these definitions can be found in [39, 40]. A generalization of these unfoldings was proposed by [52] that rearranges |$X_{(1)}$| into a more balanced matrix: for |$j \in \{1,\cdots ,d\}$|, |$X_{[j]}$| is obtained by arranging the first |$j$| dimensions along the rows and the rest along the columns. In particular, using Matlab notation, |$X_{[j]}=\textrm{reshape}(X_{(1)},\varPi _{i=1}^{j} N_i,\varPi _{i=j+1}^{i=d} N_i)$| (a short numerical sketch of these unfoldings is given below). More importantly, for a rank-|$r$| tensor |$T = \sum _{i=1}^r \lambda _i u_i^{(1)} \circ u_i^{(2)} \circ \cdots \circ u_i^{(d)}$|, |$T_{[j]} = \sum _{i=1}^r \lambda _i (u_i^{(1)} \otimes u_i^{(2)} \otimes \cdots \otimes u_i^{(j)}) \circ (u_i^{(j+1)} \otimes \cdots \otimes u_i^{(d)})$|, which is a matrix of rank at most |$r$|. Here, the symbol |$\otimes $| represents the Kronecker product. Similarly, the ranks of all matricizations defined above are less than or equal to the rank of the tensor. 2.2 Past results Using the max-norm for learning low-rank matrices was pioneered in [64], where the max-norm was used for collaborative prediction. In this paper we use the max-qnorm for TC, generalizing a recent result on matrix completion using max-norm constrained optimization [10]. In this section we review some of the results that are related to M-norm and max-qnorm TC.
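The short sketch referenced in Section 2.1 above: it builds a rank-|$r$| CP tensor from its factors, forms a mode-|$1$| unfolding and a balanced unfolding, and checks that their ranks do not exceed |$r$|. The sizes are arbitrary, and the column ordering follows numpy's conventions rather than the Matlab-style reshape above, which does not affect the rank statement.

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, N3, r = 8, 9, 10, 3                    # illustrative sizes and rank
U1, U2, U3 = (rng.standard_normal((n, r)) for n in (N1, N2, N3))
lam = rng.standard_normal(r)

# T = sum_i lam_i * u_i^(1) o u_i^(2) o u_i^(3)  (CP / outer-product form)
T = np.einsum('i,ai,bi,ci->abc', lam, U1, U2, U3)

def unfold(X, mode):
    """Mode-k matricization: the mode-k fibres become the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

T1 = unfold(T, 0)                              # N1 x (N2*N3), mode-1 unfolding
Tbal = T.reshape(N1 * N2, N3)                  # balanced-style unfolding: modes {1,2} vs {3}

print(np.linalg.matrix_rank(T1), np.linalg.matrix_rank(Tbal))   # both are at most r
```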
In particular, we first go over results on matrix completion using nuclear-norm and max-norm minimization; we then review the literature on TC. Inspired by [23], which proved that the nuclear norm is the convex envelope of the matrix rank function, most research on matrix completion has focused on using nuclear-norm minimization. Assuming |$M^{\sharp }$| to be a rank-|$r$|, |$N \times N$| matrix and |$M_{\varOmega }$| to be the set of |$m$| independent samples of this matrix, it was proved in [12, 63] that \begin{equation} \hat{M}:= \arg \min\ \|X\|_{\ast} \ \ \ \ \textrm{subject to} \ \ M^{\sharp}_{\varOmega}=X_{\varOmega}, \end{equation} (2.1) recovers the matrix |$M^{\sharp }$| exactly if |$|\varOmega |> C N^{1.2} r \log (N)$|, provided that the row and column spaces of the matrix |$M^{\sharp }$| are ‘incoherent’. This result was later improved in [38] to |$|\varOmega |=O(Nr\log (N))$|. There has been significant research in this area since then, either in sharpening the theoretical bound, e.g., [5, 13, 57], or in designing efficient algorithms to solve (2.1), e.g., [8, 34]. More relevant to noisy TC are the results of [10, 11, 37], which consider recovering |$M^{\sharp }$| from measurements |$Y_{\varOmega }$|, where |$Y=M^{\sharp }+Z$| and |$|\varOmega |=m$|. Here |$Z$| is a noise matrix. It was proved in [11] that if |$\|Z_{\varOmega }\|_F \leq \delta $|, by solving the nuclear-norm minimization problem \begin{equation*}\arg \min \ \|X\|_{\ast}\ \ \textrm{subject to}\ \|(X-Y)_{\varOmega}\|_F \leq \delta,\end{equation*} we can recover an estimate |$\hat{M}$| satisfying \begin{equation*}\frac{1}{N}\|M^{\sharp}-\hat{M}\|_F \leq C\sqrt{\frac{N}{m}} \delta + 2\frac{\delta}{N},\end{equation*} provided that there are sufficiently many measurements for perfect recovery in the noiseless case. Another approach was taken in [37], where the authors assume that |$\|M^{\sharp }\|_{\infty } \leq \alpha $| and |$Z$| is a zero-mean random matrix whose entries are i.i.d. with sub-Gaussian norm |$\sigma $|. They then suggest initializing the left and right singular vectors (|$L$| and |$R$|) from the observations |$Y_{\varOmega }$| and they prove that by solving \begin{equation*}\underset{L,S,R} {\textrm{min}}\ \frac{1}{2}\|M^{\sharp}-LSR^{\top}\|_F^2\ \ \textrm{subject to}\ L^{\top}L=\mathbb{I}_r,R^{\top}R=\mathbb{I}_r,\end{equation*} one can recover a rank-|$r$| matrix |$\hat{M}$|, where \begin{equation*}\frac{1}{N}\|M^{\sharp} - \hat{M}\|_F \leq C \alpha\sqrt{\frac{Nr}{m}} + C^{\prime}\sigma \sqrt{\frac{Nr\alpha\log(N)}{m}}.\end{equation*} Inspired by promising results regarding the use of the max-norm for collaborative filtering [64], a max-norm constrained optimization was employed in [24] to solve the noisy matrix completion problem under the uniform sampling assumption. Nuclear-norm minimization has been proved to be rate-optimal for matrix completion. However, it is not entirely clear if it is the best approach for non-uniform sampling. In many applications, such as collaborative filtering, the uniform sampling assumption is not reasonable. For example, in the Netflix problem, some movies get much more attention and therefore have a higher chance of being rated than others. To tackle the issue of non-uniform samples, [53] suggested using a weighted nuclear norm, imposing probability distributions on samples belonging to each row or column.
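Before moving on to non-uniform sampling, here is a minimal sketch of the noisy nuclear-norm estimator reviewed above, written with cvxpy. The sizes, noise level, tolerance and sampling without replacement are illustrative choices made here, not taken from the paper or from [11].

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
N, r, m, delta = 30, 2, 400, 1.0                          # illustrative values
Msharp = rng.standard_normal((N, r)) @ rng.standard_normal((r, N))

mask = np.zeros((N, N))
mask.flat[rng.choice(N * N, size=m, replace=False)] = 1.0  # observed entries
Y = Msharp + 0.05 * rng.standard_normal((N, N))            # noisy data (used only where mask = 1)

X = cp.Variable((N, N))
constraints = [cp.norm(cp.multiply(mask, X - Y), 'fro') <= delta]
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()

print("relative error:", np.linalg.norm(X.value - Msharp) / np.linalg.norm(Msharp))
```

The max-norm based estimators discussed next use a different, factorization-based regularizer in place of the nuclear norm.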
Due to similar considerations, [10] generalized the max-norm matrix completion to the case of non-uniform sampling and proved that, with high probability, |$m=O\left(\frac{Nr}{\varepsilon }\log ^3\left(\frac{1}{\varepsilon }\right)\right)$| samples are sufficient for achieving mean squared recovery error |$\varepsilon $|⁠, where the mean squared error is dependent on the distribution of the observations. To be more precise, in the error bound of [10], entries that have higher probability of being observed are recovered more accurately compared to the entries that have a smaller probability of being observed. In particular, [10] assumed a general sampling distribution as explained in Section 1.2 (when |$d=2$|⁠) that includes both uniform and non-uniform sampling. Assuming that each entry of the noise matrix is a zero mean Gaussian random variable with variance |$\sigma $|⁠, and |$\|M^{\sharp }\|_{\infty } \leq \alpha $|⁠, they proved that the solution |$\hat{M}_{\max }$| of \begin{equation*}\underset{\|M\|_{\max}\leq \sqrt{r}\alpha} {\textrm{min}}\ \|(M^{\sharp}-M)_{\varOmega}\|_F^2\end{equation*} satisfies \begin{equation*}\frac{1}{N^2}\|\hat{M}_{\max}-M^{\sharp}\|_F^2 \leq C \mu (\alpha + \sigma) \alpha \sqrt{\frac{r N}{n}},\end{equation*} with probability greater than |$1-2\textrm{e}^{-dN}$|⁠, provided |$\pi _{\omega } \geq \frac{1}{\mu N^2},\forall \omega \in [N] \times [N]$|⁠. This paper generalizes the above result to TC. Finally, we briefly review the TC literature. There is a long list of heuristic algorithms that attempt to solve the TC problem by using different decompositions or matricizations that, in spite of showing good empirical results, are not backed with a theoretical explanation that shows the superiority of using the tensor structure instead of matricization, e.g., see [29, 49]. The most popular approach is minimizing the SNN of all the matricizations of the tensor along all modes. To be precise one solves \begin{equation} \underset{X}{\textrm{min}} \sum_{i=1}^d \beta_i \|X_{(i)}\|_{\ast}\ \textrm{subject to} \ X_{\varOmega}=T^{\sharp}_{\varOmega}, \end{equation} (2.2) where |$X_{(i)}$| is the mode-|$i$| matricization of the tensor (see [49, 62, 67]). The result obtained by solving (2.2) is highly sensitive on the choice of the weights |$\beta _{i}$| and an exact recovery requirement is not available. At least, in the special case of tensor sensing, where the measurements of the tensor are its inner products with random Gaussian tensors, [52] proves that |$m=O(rN^{d-1})$| is necessary for (2.2), whereas a more balanced matricization such as |$X_{[\lfloor \frac{d}{2} \rfloor{}]}$| (as explained in Section 2.1) can achieve successful recovery with |$m=O\left(r^{\lfloor \frac{d}{2} \rfloor{}}N^{\lceil \frac{d}{2} \rceil{}}\right)$| Gaussian measurements. Assuming |$T^{\sharp }$| is symmetric and has an orthogonal decomposition, in [35] it was proved that when |$d=3$|⁠, an alternating minimization algorithm can achieve exact recovery from |$O(r^5 N^{\frac{3}{2}} \log (N)^4)$| random samples. However, the empirical results of this work show good results for non-symmetric tensors as well if a good initial point can be found. In [74], a generalization of the singular value decomposition for tensors, called t-SVD, is used to prove that a third-order tensor (⁠|$d=3$|⁠) can be recovered from |$O(r N^2 \log (N)^2)$| measurements, provided that the tensor satisfies some incoherence conditions, called tensor incoherence conditions. 
The last related result that we will mention is an interesting theoretical result that generalizes the nuclear norm to tensors as the dual of the spectral norm, and avoids any kind of matricization in the proof [73]. They show that, for a tensor with low coherence, the sample size requirement using the nuclear norm is |$m=O(\sqrt{rN^d}\log (N))$|. Comparing our result with the result of [73], an important question that needs to be investigated is whether the max-qnorm is a better measure of the complexity of low-rank tensors than the nuclear norm, or whether the difference is just an artifact of the proofs. While we introduce the framework for the max-qnorm in this paper, an extensive comparison of these two norms is beyond its scope. Another difficulty of using the tensor nuclear norm is the lack of sophisticated, or even approximate, algorithms that can minimize the nuclear norm of a tensor. To our knowledge, this paper provides the first result that proves linear dependence of the sufficient number of random samples on |$N$|. It is worth mentioning though that [42] proves that |$O(N r^{d-0.5} d\log (r))$| adaptively chosen samples are sufficient for exact recovery of tensors. However, that result depends heavily on the samples being adaptive. We compare our results with some of the above-mentioned results in Sections 5 and 7. 3. Max-qnorm and atomic M-norm In this section we introduce the max-qnorm and M-norm of tensors, and characterize the unit balls of these norms as sets of tensors that have a specific decomposition with bounded factors. This characterization helps us in Section 3.4 to prove a bound on the max-qnorm and M-norm of low-rank tensors that is independent of |$N$|; we also note that the results in this section might be of independent interest. For a better flow we postpone all the proofs to Section 8. 3.1 Matrix max-norm First, we recall the max-norm of matrices [64], which was also defined in [48] as the |$\gamma _2$| norm. We also mention some of the properties of the matrix max-norm that we generalize later on in this section. Recall that the max-norm of a matrix is defined in [64] as \begin{equation} \|M\|_{\max}=\underset{M=U \circ V}{\textrm{min}}\lbrace \|U\|_{2,\infty}\|V\|_{2,\infty}\rbrace, \end{equation} (3.1) where |$\|U\|_{2,\infty } = \underset{\|x\|_2=1}{\sup } \|Ux\|_{\infty }$| is the maximum |$\ell _2$|-norm of the rows of |$U$| as defined in [48, 65]. Considering all possible factorizations of a matrix |$M=U\circ V$|, the rank of |$M$| is the minimum number of columns in the factors and the nuclear norm of |$M$| is the minimum product of the Frobenius norms of the factors. The max-norm, on the other hand, finds the factors with the smallest row norms, as |$\|U\|_{2,\infty }$| is the maximum |$\ell _2$|-norm of the rows of the matrix |$U$|. Furthermore, the max-norm is comparable with the nuclear norm in the following sense [48, 63]: \begin{equation} \|M\|_{\max} \approx \inf \left\{ \sum_j |\sigma_j|: M=\sum_j \sigma_j u_j v_j^T, \|u_j\|_{\infty}=\|v_j\|_{\infty}=1 \right\}. \end{equation} (3.2) Here, the factor of equivalence is Grothendieck’s constant |$K_G \in (1.67,1.79)$|. To be precise, \begin{equation} \|M\|_{\max} \leq \inf \left\{ \sum_j |\sigma_j|: M=\sum_j \sigma_j u_j v_j^T, \|u_j\|_{\infty}=\|v_j\|_{\infty}=1\right\} \leq K_G \|M\|_{\max}.
\end{equation} (3.3) Moreover, in connection with element-wise |$\ell _{\infty }$| norm we have \begin{equation} \|M\|_{\infty} \leq \|M\|_{\max} \leq \sqrt{\textrm{rank}(M)} \|M\|_{1,\infty} \leq \sqrt{\textrm{rank}(M)}\|M\|_{\infty}. \end{equation} (3.4) This is an interesting result that shows that we can bound the max-norm of a low-rank matrix by an upper bound that is independent of |$N$|⁠. 3.2 Tensor max-qnorm and atomic M-norm We generalize the definition of max-norm to tensors as follows. Let |$T$| be an order-|$d$| tensor and define \begin{equation} \|T\|_{\max}:= \underset{T=\bigcirc_{i=1}^{i=d} U^{(i)}}{\min}\left\lbrace \prod_{j=1}^{d} \|U^{(j)}\|_{2,\infty}\right\rbrace. \end{equation} (3.5) Notice that this definition with |$d=2$| agrees with the definition of max-norm for matrices. As in the matrix case, the rank of a tensor |$T$| is the minimum possible number of columns in the low-rank factorization of |$T=\bigcirc _{i=1}^{i=d} U^{(i)}$| and the max-qnorm is the minimum product of the row norms of the factors over all such decompositions. Theorem 4 For |$d \geq 3$|⁠, the max-qnorm (3.5) does not satisfy the triangle inequality. However, it satisfies a quasi-triangle inequality \begin{equation*}\|X+T\|_{\max} \leq 2^{\frac{d}{2}-1} \left(\|X\|_{\max}+\|T\|_{\max}\right)\!,\end{equation*} and, therefore, it is a quasi-norm. The proof of Theorem 4 is given in Section 8.1. Note that we construct in this proof a tensor |$T = T_1 + T_2$| such that |$T_1$| and |$T_2$| have unit max-qnorm, and |$\|T||_{\max }> 2$|⁠, i.e., the max-qnorm does not respect the triangle inequality. Since the max-qnorm is homogeneous, this shows that it is also non-convex. Later on, in Section 4.2, we show that an MNC LS estimation, with the max-qnorm as in (1.5), breaks the |$O\left(N^{\frac{d}{2}}\right)$| limitation on the number of measurements. This is mainly due to two main properties: (i) Max-qnorm of a bounded low-rank tensor does not depend on the size of the tensor. (ii) Defining |$T_{\pm }:=\lbrace T \in \{\pm 1\}^{N \times N \times \cdots \times N}\ |\ \textrm{rank}(T)=1\rbrace $|⁠, the unit ball of the tensor max-qnorm is a subset of |$C_d \operatorname{conv}(T_{\pm })$| that is a convex combination of |$2^{Nd}$| rank-|$1$| sign tensors. Here |$C_d$| is a constant that only depends on |$d$| and |$\operatorname{conv}(S)$| is the convex envelope of the set |$S$|⁠. Notice that by definition of the rank, any element of |$T_{\pm }$| can be rewritten as |$\bigcirc _{j=1}^{d} u^{(j)}$|⁠, where |$u^{(j)}(k) \in \{\pm 1\}$|⁠. One caveat is that the max-qnorm is non-convex. To obtain a convex alternative that still satisfies the properties (i) and (ii) above, we consider the norm induced by the set |$T_{\pm }$| directly; this is an atomic norm as discussed in [15]. We then define the atomic M-norm of a tensor |$T$| as the gauge of |$T_{\pm }$| given by \begin{equation} \|T\|_{M}:= \inf\{t>0:T \in t\ \operatorname{conv}(T_{\pm})\}. \end{equation} (3.6) As |$T_{\pm }$| is centrally symmetric around the origin and spans |$\bigotimes _{j=1}^{d} \mathbb{R}^{N_j}$|⁠, this is indeed a norm, thus convex, and the gauge function can be rewritten as \begin{equation} \|T\|_{M} = \inf\left\{ \sum_{X \in T_{\pm}} c_X:\ T=\sum_{X \in T_{\pm}} c_X X, c_X \geq 0,X \in T_{\pm}\right\}. \end{equation} (3.7) 3.3 Unit max-qnorm ball of tensors In the next lemma we prove that, similar to the matrix case, the tensor unit max-qnorm ball is comparable to the set |$T_{\pm }$|⁠. 
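Before stating the lemma, a small numerical sketch may help make the atoms in |$T_{\pm }$| and the gauge characterization (3.7) concrete. The sizes, number of atoms and weights below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, K = 6, 3, 4                               # illustrative sizes; K atoms (d = 3 is hard-coded below)

def sign_atom():
    """A rank-1 sign tensor: outer product of d random +/-1 vectors."""
    vecs = [rng.choice([-1.0, 1.0], size=N) for _ in range(d)]
    return np.einsum('a,b,c->abc', *vecs)

c = rng.random(K)                               # nonnegative weights c_X
T = sum(ci * sign_atom() for ci in c)           # T = sum_X c_X X with X in T_plus/minus

# By the gauge characterization (3.7), this representation certifies ||T||_M <= sum(c):
print("sum of weights (upper bound on ||T||_M):", c.sum())
print("||T||_inf =", np.abs(T).max(), "<= sum of weights")
```

Any such nonnegative combination only certifies the upper bound |$\|T\|_{M} \leq \sum c_X$|; computing the M-norm exactly would require minimizing over all such representations.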
First define |$\mathbb{B}_{\max }^T(1):= \lbrace T \in \mathbb{R}^{N \times \cdots \times N}\ |\ \|T\|_{\max } \leq 1\rbrace $| and |$\mathbb{B}_{M}(1):= \{ T: \|T\|_{M}\leq 1\}$|⁠. Lemma 5 The unit ball of the max-qnorm, the unit ball of the atomic M-norm and |$\operatorname{conv}(T_{\pm })$| satisfy the following: (i) |$\mathbb{B}_{M}(1)=\operatorname{conv}(T_{\pm })$|⁠, (ii) |$\mathbb{B}_{\max }^T(1)$||$\subset $||$c_1 c_2^d$||$\operatorname{conv}(T_{\pm })$|⁠. Here |$c_1$| and |$c_2$| are absolute constants, derived from the generalized Grothendieck theorem [6, 68] as given in Section 8.2. Using Lemma 5, it is easy to analyse the Rademacher complexity of the unit ball of these two norms. In fact, noticing that |$T_{\pm }$| is a finite class with |$|T_{\pm }|<2^{dN}$|⁠, together with some basic properties of Rademacher complexity, we can prove the following lemma. Below, |$\hat{R}_{\varOmega }(X)$| denotes the empirical Rademacher complexity of |$X$|⁠. To keep this section simple, we refer to Section A for the definition of Rademacher complexity and proof of Lemma 6. Notice that in this setup, a tensor is considered as a function from the set of indices |$[N]^d$| to |$\mathbb{R}$|⁠, where each index is mapped to the corresponding entry value. We then calculate the Rademacher complexity of tensors with low M-norm (a function class) or low max-qnorm tensors (another function class). Lemma 6 The Rademacher complexities of the unit balls of M-norm and max-qnorm satisfy the following: (i) |$\underset{\varOmega :|\varOmega |=m}{\sup } \hat{R}_{\varOmega }(\mathbb{B}_{M}(1)) < 6 \sqrt{\frac{dN}{m}}$|⁠, (ii) |$\underset{\varOmega :|\varOmega |=m}{\sup } \hat{R}_\varOmega (\mathbb{B}_{\max }^T(1)) < 6 c_1 c_2^d \sqrt{\frac{dN}{m}}$|⁠. Here |$c_1$| and |$c_2$| are absolute constants, derived from the generalized Grothendieck theorem. 3.4 Max-qnorm and M-norm of bounded low-rank tensors Next, we bound the max-qnorm and M-norm of a rank-|$r$| tensor whose (entry-wise) infinity norm is less than |$\alpha $|⁠. First, we bound the max-qnorm and a similar proof can be used to obtain a bound on the M-norm as well, which we explain in the Section 8.3. As mentioned before, for |$d=2$|⁠, i.e., the matrix case, an interesting inequality has been proved [48] that does not depend on the size of the matrix, i.e., |$\|M\|_{\max } \leq \sqrt{\textrm{rank}(M)}\ \alpha $|⁠. In what follows we generalize this bound to the max-qnorm and M-norm of a rank-|$r$| tensor |$T$| with |$\|T\|_{\infty } \leq \alpha $|⁠. Theorem 7 Assume |$T \in \mathbb{R}^{N \times \cdots \times N}$| is an order-|$d$|⁠, rank-|$r$| tensor with |$\|T\|_{\infty } = \alpha $|⁠. Then (i) |$\alpha \leq \|T\|_{M} \leq (r\sqrt{r})^{d-1} \alpha .$| (ii) |$\alpha \leq \|T\|_{\max } \leq \sqrt{r^{d^2-d}} \alpha .$| The proofs of these two bounds are similar and both of them can be found in Section 8.3. Notice the discrepancy of Theorem 7 when |$d=2$|⁠: setting |$d=2$| in (ii) yields |$\|T\|_{\max } \leq r \alpha $| rather than |$\|T\|_{\max } \leq \sqrt{r}\alpha $| as given by [48]. This is an artifact of the proof that hints that the bounds in Theorem 7 might be not optimal in their dependence on |$r$| for general |$d$| as well. 4. M-norm constrained TC In this section we consider the problem of TC from noisy measurements of a subset of the tensor entries. As explained before, we assume that the indices of the entries that are measured are drawn independently at random with replacement. Also, the tensor of interest is low-rank and has bounded entries. 
Instead of constraining the problem to the set of low-rank bounded tensors, we consider the more general set of bounded tensors with bounded M-norm, which includes the set of low-rank bounded tensors. Accordingly, we solve an M-norm constrained LS problem, given in (4.3) below, and show that we can achieve near-optimal sample complexity. Similar results can be obtained for an MNC LS problem. When |$d=2$|, i.e., the matrix case, max-norm constrained matrix completion has been thoroughly studied in [10], so we will not prove the lemmas and theorems that carry over directly to the tensor case; see [10] for more details. 4.1 Observation model Given an order-|$d$| tensor |$T^{\sharp } \in \mathbb{R}^{N^d}$| and a random subset of indices |$\varOmega =\{\omega _1,\omega _2,\cdots ,\omega _m\},$||$\omega _i \in [N]^d$|, we observe |$m$| noisy entries |$\{Y(\omega _t)\}_{t=1}^{m}$| \begin{equation} Y(\omega_t)= T^{\sharp}(\omega_t) + \sigma \xi_t, \ \ t=1,\cdots,m \end{equation} (4.1) for some |$\sigma> 0$|. The variables |$\xi _t$| are zero-mean i.i.d. random variables with |$\mathbb{E}(\xi _t^2)=1$|. The indices in |$\varOmega $| are drawn randomly with replacement from a predefined probability distribution |$\varPi =\{\pi _\omega \}$|, for |$\omega \in [N]^d$|, such that |$\sum _\omega \pi _\omega =1$|. Obviously |$\max \pi _\omega \geq \frac{1}{N^d}$|. Although it is not a necessary condition for our proof, it is natural to assume that there exists |$\mu \geq 1$| such that \begin{equation*} \pi_{\omega} \geq \frac{1}{\mu N^d}\ \ \forall \omega \in [N]^d, \end{equation*} which ensures that each entry is observed with some positive probability. This observation model encompasses both uniform and non-uniform sampling, and is a better fit than uniform sampling in many practical applications. 4.2 M-norm constrained LS estimation Given a collection of noisy observations |$\{Y(\omega _t)\}_{t=1}^{m}$| of a low-rank tensor |$T^{\sharp }$| following the observation model (4.1), we solve an LS problem to find an estimate of |$T^{\sharp }$|. Consider the set of tensors with bounded M-norm and bounded infinity norm \begin{equation*}K^T_{M}(\alpha,R):=\lbrace T \in \mathbb{R}^{N \times N \times \cdots \times N}: \|T\|_{\infty} \leq \alpha, \|T\|_{M} \leq R \rbrace.\end{equation*} Notice that, assuming that |$T^{\sharp }$| has rank |$r$| and |$\|T^{\sharp }\|_{\infty } \leq \alpha $|, Theorem 7 ensures that a choice of |$R = (r\sqrt{r})^{d-1} \alpha $| is sufficient for |$T^{\sharp } \in K^T_{M}(\alpha ,R)$|. Defining \begin{equation} \hat{\mathscr{L}}_{m}(X,Y):=\frac{1}{m}\sum_{t=1}^{m} (X(\omega_t)-Y(\omega_t))^2, \end{equation} (4.2) we obtain the estimate |$\hat{T}_{M}$| by solving the optimization problem \begin{equation} \hat{T}_{M} = \arg \min_{X} \hat{\mathscr{L}}_{m}(X,Y) \ \ \ \ \textrm{subject to}\ \ \ \ X \in K^T_{M}(\alpha,R). \end{equation} (4.3) In other words, |$\hat{T}_{M}$| is a tensor with entries bounded by |$\alpha $| and M-norm at most |$R$| that is closest to the observed samples in the (empirical) LS sense. Moreover, as for any tensor |$T$|, |$\|T\|_{M}$| and |$\|T\|_{\max }$| are greater than or equal to |$\|T\|_{\infty }$|, we assume |$R \geq \alpha $|. We now state our main result on the performance of M-norm constrained TC, as in (4.3), for recovering a bounded low-rank tensor. Theorem 8 Consider an order-|$d$| tensor |$T^{\sharp } \in \bigotimes _{i=1}^{d} \mathbb{R}^N$| with |$\|T^{\sharp }\|_{\infty } \leq \alpha $| and |$\|T^{\sharp }\|_{M} \leq R$|.
Given a collection of noisy observations |$\{Y(\omega _t)\}_{t=1}^{m}$| following the observation model (4.1), where the noise variables |$\xi _t$| are i.i.d. standard normal random variables, the minimizer |$\hat{T}_{M}$| of (4.3) satisfies \begin{equation} \|\hat{T}_{M}-T^{\sharp}\|_{\varPi}^2:=\sum_{\omega \in [N]^d} \pi_\omega (\hat{T}_{M}(\omega)-T^{\sharp}(\omega))^2 \leq C \left( \sigma(R+\alpha) + R\alpha \right) \sqrt{\frac{dN}{m}}, \end{equation} (4.4) with probability greater than |$1-\textrm{e}^{\frac{-N}{\ln (N)}}-\textrm{e}^{-\frac{dN}{2}}$|. Here, |$C$| is a universal constant with |$C<20$|. Moreover, following the steps in [10, Theorem 3.2], we can prove the following theorem. Theorem 16 In the setting of Theorem 8, assume that |$\sigma \leq \alpha $|. Then, with probability |$1-\frac{2}{m}$| over the choice of the sample set, \begin{equation} \|\hat{T}_{M}-T^{\sharp}\|_{\varPi}^2 \leq C \left[ \sigma \left( \sqrt{ \log(m)^3 \frac{R^2 N}{m}}+\sqrt{\log(m)^{\frac{3}{2}}\ \frac{\alpha^2}{m}} \right)+ \log(m)^3\ \frac{R^2 N}{m} + \log(m)^{\frac{3}{2}}\ \frac{\alpha^2}{m} \right]. \end{equation} (4.10) The proof is similar to the proof in [10, Theorem 6.2] and can be found in Section 8.5. Remark 17 For fixed |$N$| and |$r$| and |$\sigma =0$|, Theorem 16 proves a decay rate of |$O\left(\frac{\log (m)^3}{m}\right)$|, which is faster than the decay rate |$O\left(\sqrt{\frac{1}{m}}\right)$| proved in Theorem 8. Moreover, this decay rate is consistent with the result in [11] that considers matrix completion using nuclear-norm minimization. However, when |$\sigma>0$|, the error bound in Theorem 8 is tighter than the one in Theorem 16. 5. Comparison to past results As discussed in Section 2.2, there are several works that have considered the max-norm for matrix completion [22, 25, 45, 61, 64]. Among these, the closest work to our result is [10], where the authors study max-norm constrained matrix completion, which is a special case of MNC TC with |$d=2$|. Here, we generalize the framework of [10] to the problem of TC. Although the main approach is similar, the new ingredients include building the machinery for analysing the max-qnorm and M-norm of low-rank tensors, as explained in Section 3. As expected, Theorem 13 reduces to the one in [10] when |$d=2$|. More interestingly, when |$d>2$|, the only values in the upper bound in (4.8) that change compared to the matrix error bound are the upper bound on the max-qnorm of the |$d$|th-order tensor (which is independent of |$N$|), and the order |$d$|, which changes the constants slightly. As can be seen from Theorem 7, for a rank-|$r$| tensor |$T$| with |$\|T\|_{\infty } \leq \alpha $|, we have |$\|T\|_{M} \leq (r\sqrt{r})^{d-1} \alpha $|. Therefore, assuming |$\alpha =O(1)$|, to obtain |$\frac{1}{N^d}\|\hat{T}_{M}-T^{\sharp }\|_F^2 \leq \varepsilon $|, it is sufficient to have |$m> C \frac{(r\sqrt{r})^{d-1} d N} {\varepsilon ^2}$| samples. Similarly, using the max-qnorm, for an approximation error bounded by |$\varepsilon $|, it is sufficient to have |$m> C_d \frac{r^{d^2-d} d N}{\varepsilon ^2}$| samples. In contrast, the sufficient number of measurements with the best possible matricization is |$m> C \frac{r N^{\lceil \frac{d}{2} \rceil{}}}{\varepsilon ^2}$|, which is significantly bigger for higher-order tensors. TC using the nuclear norm gives inferior bounds as well.
In particular, fixing |$r$| and |$d$|⁠, compared to latest results on TC using nuclear norm [73], using M-norm lowers the theoretical sufficient number of measurements from |$O\left(N^{\frac{d}{2}}\right)$| to |$O(dN)$|⁠. 6. Information theoretic lower bound To obtain a lower bound on the performance of (4.3), we employ a classical information theoretic technique to establish a minimax lower bound for non-uniform sampling of random TC on the max-qnorm ball. A similar strategy in the matrix case has been used in [10, 19]. In order to derive such a lower bound, we find a set of tensors in the set |$K^T_{M}$| that are sufficiently far away from each other. Fano’s inequality implies that with the finite amount of information that we have, there is no method that can differentiate between all the elements of a set with too many elements, and therefore any method will fail to recover at least one of them with a large probability. The main ideas and techniques closely follow [10, Section 6.2]; therefore, we only explain the main steps we take to generalize this approach from matrices to tensors. Similar to the upper bound case, we analyse a general restriction on the M-norm or the max-qnorm of the tensors instead of concentrating on low-rank tensors. Note that by means of Theorem 7, this approach encompasses low-rank tensors as well. Recall that the set of bounded low M-norm tensors is given by \begin{equation} K_{M}^T(\alpha,R):=\lbrace T \in \mathbb{R}^{N \times N \times \cdots \times N}: \|T\|_{\infty} \leq \alpha, \|T\|_{M} \leq R \rbrace. \end{equation} (6.1) We will find a lower bound for the recovery error for any method that takes |$\{Y(\omega _t)\}_{t=1}^{m}$| as input and outputs an estimate |$\hat{T}$|⁠. This includes |$\hat{T}_{M}$| that is obtained by \begin{equation} \hat{T}_{M} = \arg \min_{X} \hat{\mathscr{L}}_{m}(X,Y) \ \ \ \ \textrm{subject to}\ \ \ \ X \in K_{M}^T(\alpha,R). \end{equation} (6.2) In particular, we show that when the sampling distribution satisfies \begin{equation*}\frac{\mu}{N^d} \leq \min_{\omega} \pi_{\omega} \leq \max_{\omega} \pi_{\omega} \leq \frac{L}{N^d},\end{equation*} the M-norm constrained LS estimator is rate-optimal on |$K_{M}^T(\alpha ,R)$|⁠. The following theorem is proved in Section 8.6. Theorem 18 Assume that the noise sequence |$\xi _t$| is i.i.d. standard normal random variable and the sampling distribution |$\varPi $| satisfies |$\max _{\omega } \pi _{\omega } \leq \frac{L}{N^d}$|⁠. Fix |$\alpha $|⁠, |$R$|⁠, |$N$| and |$m$| such that \begin{equation} R^2 \geq \frac{48\alpha^2 K_G^2}{N}, \end{equation} (6.3) where |$K_G$| is the Grothendieck’s constant. Then, the minimax recovery error is lower bounded by \begin{equation} \underset{\hat{T}_M}{\inf} \ \underset{T \in K_{M}^T(\alpha,R)}{\sup} \frac{1}{N^d} \mathbb{E}\|\hat{T}-T\|_F^2 \geq \min \left\{\frac{\alpha^2}{16},\frac{\sigma R}{128\sqrt{2}K_G}\sqrt{\frac{N}{mL}}\right\}. \end{equation} (6.4) Remark 19 Comparing the above theorem with (4.6), we observe that as long as |$\frac{\sigma R}{128\sqrt{2}K_G}\sqrt{\frac{N}{mL}} < \frac{\alpha ^2}{16}$|⁠, M-norm constrained TC is optimal in both |$N$| and |$R$|⁠. 7. Experiments In this section we present algorithms that we use to attempt to solve (4.7) and experiments concerning computing max-qnorm of specific classes of tensors and MNC TC. As mentioned before, most of the typical procedures such as calculating the nuclear norm or even calculating the rank of a tensor are NP-hard. 
The situation seems even more hopeless if we consider the results of [2] that connects three-dimensional TC with strongly refuting random |$3$|-SAT [16], which has a long line of research behind it. In short, if we assume that either max-qnorm or M-norm is computable in polynomial time, a conjecture of [18] for strongly refuting random |$3$|-SAT will be disproved. All these being said, this is the first paper considering max-qnorm for TC, and the preliminary results we show in this section are promising, outperforming matricization, as well as the TenALS algorithm of [35] consistently. Here, we concentrate on (4.7) instead of (4.3) as we are not aware of any algorithm that can even attempt to solve (4.3), and simple heuristic algorithms we designed for (4.7) give promising results even though we do not know if any of these provably converge due to the non-convexity of the optimization problem (4.7). There are two questions that need to be answered when solving (4.7). First is how to choose the max-qnorm bound |$R$| and second is how to solve the LS problem once |$R$| is fixed. We address both these questions next. We also run some experiments to estimate the tensor max-qnorm of some specific classes of tensors to get an idea of how the max-qnorm of a tensor depends on its size and rank. Finally, we compare the results of MNC TC with TenALS and matricization. 7.1 Algorithms for MNC LS estimation In this section we introduce a few algorithms that attempt to solve (or approximate the solution of) (4.7). Defining |$f(V_1, \cdots , V_d,Y):=\hat{\mathscr{L}}_m((V_1 \circ \cdots \circ V_d),Y)$|⁠, we minimize \begin{equation} \min f(V_1, \cdots, V_d,Y)\ \textrm{subject to}\ \max_i (\|V_i\|_{2,\infty}) \leq \sqrt[d]{R}, \end{equation} (7.1) where |$R$| is the max-qnorm constraint. Note that in the definition of the max-qnorm, there is no limitation on the column size of the factors |$V_i$|⁠. However, in our experiments, we limit the factor sizes to |$N \times 2N$|⁠, i.e., we limit the number of columns of the factors to |$2N$|⁠. Although |$2N$| is an arbitrary value and we have not derived an error bound in terms of the max-qnorm of tensors with this limitation, we believe (and our experiments support this belief) that this choice is large enough when |$r \ll N$|⁠. We defer the analysis of the effect of this choice on the error bounds to future work. All the algorithms mentioned in this section are first-order methods that are scalable for higher dimensions and just require access to first derivative of the loss function. 7.1.1 Projected gradient The first algorithm is the projected gradient algorithm where, for each factor, we fix all the other factors and take a step according to the gradient of the loss function. Next, we project back all the factors onto the set |$C:=\{X:\|X\|_{2,\infty } \leq \sqrt [d]{R}\}$|⁠. To be precise, for each factor |$V_i$|⁠, define the matricization of |$T=V_1 \circ \cdots \circ V_d$| along the |$i$|th dimension to be |$T_i:=V_i \circ R_i$| and define |$f_i(X):=\hat{\mathscr{L}}((X \circ R_i),Y_i)$|⁠, where |$Y_i$| is the matricization of |$Y$| along its |$i$|th dimension. Fixing a step size |$\gamma $|⁠, we update all the factors in parallel via \begin{equation} V_i \leftarrow \mathbb{P}_{C}(V_i - \gamma \bigtriangledown(\,f_i) R_i). \end{equation} (7.2) Here, |$\mathbb{P}_C$| simply projects the factor onto the set of matrices with |$\ell _2$|-infinity norm not exceeding |$\sqrt [d]{R}$|⁠. 
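A minimal numpy sketch of this update is given below. The function and variable names are ours, and the plain masked squared-loss gradient is an illustrative stand-in, not the authors' implementation; the projection it applies is exactly the row-wise rescaling described in the next paragraph, and `bound` plays the role of |$\sqrt [d]{R}$|.

```python
import numpy as np

def project_l2inf(V, bound):
    """Project the rows of V onto the l2 ball of radius `bound` (the set C)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.minimum(1.0, bound / np.maximum(norms, 1e-12))
    return V * scale

def projected_gradient_step(V, R, Y, mask, step, bound):
    """One projected-gradient update for a single factor V, with the remaining
    factors collected in R (so the matricized model is V @ R.T).  Y holds the
    matricized data and mask marks the observed entries."""
    resid = mask * (V @ R.T - Y)                 # residual on observed entries only
    grad = 2.0 * resid @ R / mask.sum()          # gradient of the empirical squared loss
    return project_l2inf(V - step * grad, bound)
```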
This projection is a row-wise operation: the rows with |$\ell _2$|-norm larger than |$\sqrt [d]{R}$| are scaled down so that their |$\ell _2$|-norm is |$\sqrt [d]{R}$|⁠; other rows remain untouched. This is a well-known algorithm with many efficient implementations and modifications. Furthermore, using armijo line search rule to ensure sufficient decrease of the loss function, it is guaranteed to find a stationary point of (7.2). 7.1.2 Projected quasi-Newton Stacking all the factors in a matrix |$X$| given by \begin{equation*} X =\left[ \begin{array}{@{}cc@{}} V_1\\ V_2\\ \vdots\\ V_d \end{array}\right] \end{equation*} and defining |$f(X):=\hat{\mathscr{L}}_m((V_1 \circ \cdots \circ V_d),Y)$|⁠, this algorithm uses the BFGS quasi-Newton method to form a quadratic approximation to the function at the current estimate, and then uses a spectral projected gradient (SPG) method to minimize this quadratic function, constrained to |$X \in C$|⁠. We use the implementation of [59] that uses limited memory Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and a Barzilai–Borwein scaling of the gradient. Finally, we use a non-monotone Armijo line search along the feasible direction to find the next iterate in the SPG step. 7.1.3 Stochastic gradient The loss function \begin{equation*}\hat{\mathscr{L}}_{m}(X,Y)=\frac{1}{m}\sum_{t=1}^{m} (X(\omega_t)-Y(\omega_t))^2\end{equation*} is decomposable into the sum of |$m$| loss functions, each concerning one observed entry. This makes it very easy to use stochastic gradient methods that, at each iteration, take one or more of the entries and find the feasible direction according to this subset of observations. In particular, at each iteration |$i$|⁠, we take a subset, |$\varOmega _i \subset \varOmega $| with |$|\varOmega _i|=k1$|⁠. This shows that the max-qnorm satisfies the triangle inequality when |$d=2$|⁠. However, this is not true for |$d>2$|⁠. Next, we prove that the max-qnorm does not satisfy the triangle inequality when |$d=3$|⁠; higher order cases can be proved similarly. The main challenge is that the size of the factors is not fixed. We circumvent this by constructing the following simple counter-example. Let |$T=T_1 + T_2$|⁠, where $$T_1=\begin{bmatrix} 1\\0 \end{bmatrix} \circ \begin{bmatrix} 1\\0 \end{bmatrix} \circ \begin{bmatrix} 1\\0 \end{bmatrix}$$ and $$T_2 =\begin{bmatrix} 1\\1 \end{bmatrix} \circ \begin{bmatrix} 1\\1 \end{bmatrix} \circ \begin{bmatrix} 1\\1 \end{bmatrix}$$ ⁠; note that |$T$| is a rank-|$2$|⁠, |$2\times 2 \times 2$| tensor. Here, |$T_1$| and |$T_2$| are rank-|$1$| tensors with |$\|T_1\|_{\max }=1$| and |$\|T_2\|_{\max }=1$| (notice that for any |$T$|⁠, |$\|T\|_{\max } \geq \|T\|_{\infty }$|⁠). Therefore, if max-qnorm satisfies the triangle-inequality, then |$\|T\|_{\max }$| cannot exceed |$2$|⁠. In what follows we prove that this is not possible. If |$\|T\|_{\max } \leq 2$|⁠, then there exists a decomposition |$T=U^{(1)} \circ U^{(2)} \circ U^{(3)}$| such that |$\|T\|_{\max }=\prod _{j=1}^{3} \|U^{(j)}\|_{2,\infty } \leq 2$|⁠, and with a simple rescaling of the factors, \begin{equation} \|U^{(1)}\|_{2,\infty} \leq \sqrt{2},\ \|U^{(2)}\|_{2,\infty} \leq \sqrt{2},\ \|U^{(3)}\|_{2,\infty} \leq 1. \end{equation} (8.1) First, notice that |$T$| is an all-ones tensor except for one entry, where |$T(1,1,1)=2$|⁠. 
Defining the generalized inner product as \begin{equation} \langle x_1, \cdots, x_d \rangle:= \sum_{i=1}^{k} \prod_{j=1}^{d} x_j(i), \end{equation} (8.2) this means that \begin{equation} \langle U^{(1)}(1,:), U^{(2)}(1,:), U^{(3)}(1,:)\rangle =2. \end{equation} (8.3) Using the Cauchy–Schwarz inequality, we obtain \begin{equation} \langle U^{(1)}(1,:), U^{(2)}(1,:), U^{(3)}(1,:)\rangle \leq \|U^{(1)}(1,:)\| \ \|U^{(2)}(1,:)\| \ \|U^{(3)}(1,:)\|_{\infty}. \end{equation} (8.4) Combining (8.1), (8.3) and (8.4) we get \begin{equation*}2 \leq \|U^{(1)}(1,:)\| \ \|U^{(2)}(1,:)\| \leq \|U^{(1)}\|_{2,\infty} \ \|U^{(2)}\|_{2,\infty} \leq 2,\end{equation*} which together with (8.1) proves that \begin{equation} \|U^{(1)}(1,:)\|=\sqrt{2},\ \textrm{and}\ \|U^{(2)}(1,:)\|=\sqrt{2}. \end{equation} (8.5) Moreover, similarly \begin{equation*}2 \leq 2 \|U^{(3)}(1,:)\|_{\infty}\leq 2\ \Rightarrow\ \|U^{(3)}(1,:)\|_{\infty}=1.\end{equation*} Notice that |$\|U^{(3)}(1,:)\| \leq 1$| and |$\|U^{(3)}(1,:)\|_{\infty }=1$|⁠, which proves that |$U^{(3)}(1,:)$| is an all zeros vector with a single non-zero entry of one. Remember that the number of columns of |$U^{(3)}$| is arbitrary. Without loss of generality, we can assume \begin{equation} U^{(3)}(1,:)=(1,0,\cdots,0). \end{equation} (8.6) Combining this with (8.3) and (8.5), we can also prove that \begin{equation} U^{(1)}(1,:)=U^{(2)}(1,:)=(\sqrt{2},0,\cdots,0). \end{equation} (8.7) Now from |$T(1,1,2)=1$| and the above two equations we have to have |$U^{(3)}(2,1)=\frac{1}{2}$| and, similarly, |$U^{(2)}(2,1)=\frac{1}{\sqrt{2}}$|⁠. Finally, |$T(1,2,2)=U^{(1)}(1,1)\ U^{(2)}(2,1)\ U^{(3)}(2,1)=\sqrt{2} \frac{1}{\sqrt{2}} \frac{1}{2} = \frac{1}{2}$|⁠, which is a contradiction. 8.2 Proof of Lemma 5 Characterization of the unit ball of the atomic M-norm follows directly from (3.6). By definition, any tensor |$T$| with |$\|T\|_{M} \leq 1$| is a convex combination of the atoms of |$T_{\pm }$|⁠, |$ T=\sum _{X \in T_{\pm }} c_X X, c_X>0$| with |$\sum _{X \in T_{\pm }} c_X=1$|⁠. This proves that |$\mathbb{B}_{M}(1)=\operatorname{conv}(T_{\pm })$|⁠. To characterize the unit ball of max-qnorm, we use a generalization of Grothendieck’s theorem to higher order tensors [6, 68]. First, we generalize the matrix |$\|\cdot \|_{\infty ,1}$| norm, defined by |$\{\|M\|_{\infty ,1}:=\underset{\|x\|_{\infty }=1}{\sup }\|Mx\|_1\}$| to tensors. Definition 20 Let |$T$| be an order-|$d$| tensor. Then \begin{equation*}\|T\|_{\infty,1}:=\underset{\|x_1\|_{\infty},\cdots,\|x_d\|_{\infty} \leq 1} {\sup}\left| \sum_{i_1=1}^{N} \cdots \sum_{i_d=1}^{N} T(i_1,\cdots,i_d)x_1(i_1)\cdots x_d(i_d)\right|.\end{equation*} Theorem 21 (Generalized Grothendieck theorem[6]). Let |$T \in \mathbb{R}^{N^d}$| be an order-|$d$| tensor such that |$\|T\|_{\infty ,1} \leq 1$|⁠, |$k$| be a positive integer and let |$u_{i_j}^j \in \mathbb{R}^k, 1 \leq j \leq d, 1 \leq i_j \leq N$| be |$d \times N$| vectors such that |$\|u_{i_j}^j\| \leq 1$|⁠. Then \begin{equation} \left| \sum_{i_1=1}^{N} \cdots \sum_{i_d=1}^{N} T(i_1,\cdots,i_d)\langle u_{i_1}^1, u_{i_2}^2, \cdots, u_{i_d}^d \rangle \right| \leq c_1 c_2^d, \end{equation} (8.8) where |$\left \langle u_{i_1}^1, u_{i_2}^2, \cdots , u_{i_d}^d \right \rangle $| is the generalized inner product of |$u_{i_1}^1, u_{i_2}^2, \cdots , u_{i_d}^d$| as defined in (8.2). Here, |$c_1 \leq \frac{K_G}{5}$| and |$c_2 \leq 2.83$|⁠. Now we use Theorem 21 to prove Lemma 5. 
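As a small aside before the proof: for tiny tensors, |$\|T\|_{\infty ,1}$| from Definition 20 can be computed by brute force, using the fact (exploited in the proof below) that the supremum is attained at sign vectors. The sketch is purely illustrative, with sizes chosen arbitrarily.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
N, d = 3, 3                                     # tiny sizes, so enumeration is feasible
T = rng.standard_normal((N,) * d)

signs = [np.array(s) for s in product([-1.0, 1.0], repeat=N)]   # all 2^N sign vectors

best = 0.0
for x1 in signs:
    for x2 in signs:
        for x3 in signs:
            # |sum_{i1,i2,i3} T(i1,i2,i3) x1(i1) x2(i2) x3(i3)|
            best = max(best, abs(np.einsum('abc,a,b,c->', T, x1, x2, x3)))

print("||T||_{inf,1} (by enumeration over sign vectors):", best)
```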
Proof of Lemma 5 The dual norm of the max-qnorm is \begin{equation} \|T\|_{\max}^{\ast} = \underset{\|U\|_{\max} \leq 1}{\max} \langle T,U \rangle = \underset{\|u_{i_1}^1\|,\cdots,\|u_{i_d}^d\| \leq 1}{\max}\sum_{i_1=1}^{N} \cdots \sum_{i_d=1}^{N} T(i_1,\cdots,i_d)\left\langle u_{i_1}^1, u_{i_2}^2, \cdots, u_{i_d}^d \right\rangle. \end{equation} (8.9) Above, |$u_{i_1}^1,\cdots ,u_{i_d}^d$| can be chosen in |$\mathbb{R}^k$| for any |$k \geq 1$|⁠. Using Theorem 21, |$\|T\|_{\max }^{\ast } \leq c_1 c_2^d \|T\|_{\infty ,1}$|⁠. On the other hand, for |$u_{i_1}^1,\cdots ,u_{i_d}^d \in \mathbb{R}$| the right-hand side of (8.9) is equal to |$\|T\|_{\infty ,1}$|⁠. Therefore, |$\|T\|_{\infty ,1} \leq \|T\|_{\max }^{\ast }$|⁠. Taking the dual, we obtain \begin{equation} \frac{\|T\|_{\infty,1}^{\ast}}{c_1 c_2^d} \leq (\|T\|_{\max}^{\ast})^{\ast} \leq \|T\|_{\infty,1}^{\ast}. \end{equation} (8.10) Notice that the max-qnorm, defined in (1.5) is a quasi-norm and therefore, |$(\|T\|_{\max }^{\ast })^{\ast }$| is not equal to |$\|T\|_{\max }$|⁠. However, the max-qnorm is absolutely homogeneous, and therefore, \begin{equation*}(\|T\|_{\max}^{\ast})^{\ast} = \underset{\|Z\|_{\max}^{\ast}\leq 1}{\max} \langle T,Z \rangle \leq \|T\|_{\max},\end{equation*} which implies that \begin{equation} \frac{\|T\|_{\infty,1}^{\ast}}{c_1 c_2^d} \leq \|T\|_{\max}. \end{equation} (8.11) To calculate the unit ball of |$\|.\|_{\infty ,1}^{\ast }$|⁠, notice that the argument of the supremum in Definition 20 is linear in each variable |$x_j(i_j)$| and as |$-1 \leq x_j(i_j) \leq 1$|⁠, the suprema are achieved when |$x_j(i_j)=\pm 1$|⁠. This means that |$\|T\|_{\infty ,1}=\underset{U \in T_{\pm }}{\sup } |\langle T,U \rangle |$|⁠. Therefore, |$\operatorname{conv}(T_{\pm })$| is the unit ball of |$\|.\|_{\infty ,1}^{\ast }$| and Lemma 5 (ii) follows from (8.11). 8.3 Proof of Theorem 7 In order to prove the tensor max-qnorm bound, we first sketch the proof of [56] for the matrix case. That is, assuming that |$M$| is a matrix with |$\textrm{rank}(M)=r$| and |$\|M\|_{\infty }\leq \alpha $|⁠, we show that there exists a decomposition |$M=U \circ V$|⁠, where |$U \in \mathbb{R}^{N_1 \times r}, V \in \mathbb{R}^{N_2 \times r}$| and |$\|U\|_{2,\infty } \leq \sqrt{r}, \|V\|_{2,\infty } \leq \alpha $|⁠. To that end, we first state a version of John’s theorem [56]. Theorem 22 (John’s theorem [36]). For any full-dimensional symmetric convex set |$K \subseteq \mathbb{R}^r$| and any ellipsoid |$E \subseteq \mathbb{R}^r$| (defined with respect to |$\ell _2$| norm) that is centred at the origin, there exists an invertible linear map |$S$| so that |$E \subseteq S(K) \subseteq \sqrt{r} E$|⁠. Corollary 23 [56, Corollary 2.2) For any rank-|$r$| matrix |$M \in \mathbb{R}^{N_1 \times N_2}$| with |$\|M\|_{\infty }\leq \alpha $| there exist vectors |$u_1,\cdots ,u_{N_1},v_1,\cdots ,v_{N_2} \in \mathbb{R}^r$| such that |$\langle u_i,v_j\rangle =M_{i,j}$| and |$\|u_i\|\leq \sqrt{r}$| and |$\|v_j\|\leq \alpha $|⁠. The proof of Corollary 23 is based on considering any rank-|$r$| decomposition of |$M=X\circ Y$|⁠, where |$X \in \mathbb{R}^{N_1 \times r}$|⁠, |$Y \in \mathbb{R}^{N_2 \times r}$| and |$M_{i,j}=\langle x_i,y_j\rangle $|⁠. Define |$K$| to be the convex hull of the set |$\{ \pm x_i: i \in [N_1]\}$|⁠. Then using the linear map |$S$| in John’s theorem for the set |$K$| with the ellipsoid |$E=\mathbb{B}_r:=\{x \in \mathbb{R}^r: \|x\|_2 \leq 1\}$|⁠, the decomposition |$M=(XS)\circ (YS^{-1})$| satisfies the conditions of Corollary 23 [56]. 
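For later reference, we record the matrix consequence of Corollary 23 that the base case (|$d=2$|) of the inductive proofs below relies on; this restatement is ours. Writing |$U$| and |$V$| for the matrices whose rows are the vectors |$u_i$| and |$v_j$| of Corollary 23, any rank-|$r$| matrix |$M$| with |$\|M\|_{\infty }\leq \alpha $| satisfies \begin{equation*} \|M\|_{\max} \leq \|U\|_{2,\infty}\,\|V\|_{2,\infty} \leq \sqrt{r}\,\alpha. \end{equation*} In other words, the max-norm of a bounded rank-|$r$| matrix grows at most like |$\sqrt{r}$|; Lemmas 24 and 25 below extend this type of bound to higher-order tensors.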
The following lemma proves the existence of a nuclear decomposition for bounded rank-|$r$| tensors, which can be used directly to bound the M-norm of a bounded rank-|$r$| tensor. Lemma 24 Any order-|$d$|, rank-|$r$| tensor |$T$| with |$\|T\|_{\infty } \leq \alpha $| can be decomposed into |$r^{d-1}$| rank-one tensors whose components have unit infinity norm such that \begin{equation} T=\sum_{j=1}^{r^{d-1}} \sigma_j u_j^1 \circ u_j^2 \circ \cdots \circ u_j^d,\ \|u_j^1\|_{\infty},\cdots,\|u_j^d\|_{\infty} \leq 1, \ \textrm{with} \ \sum_{j=1}^{r^{d-1}} |\sigma_j| \leq (r\sqrt{r})^{d-1} \alpha. \end{equation} (8.12) Proof. We prove this lemma by induction. The proof for |$d=2$| follows directly from applying John’s theorem to a rank-|$r$| decomposition of |$T$|, i.e., |$T=XS \circ YS^{-1}$|, where |$T=X \circ Y$| [56]. This is summarized in Corollary 23 above as well. Now let |$T$| be an order-|$d$| tensor that can be written as |$T=\sum _{j=1}^{r} \lambda _j u_j^{(1)} \circ u_j^{(2)} \circ \cdots \circ u_j^{(d)}$| with |$\|T\|_{\infty } \leq \alpha $|. Matricizing along the first dimension results in |$T_{[1]} = \sum _{i=1}^r (\lambda _i u_i^{(1)}) \circ (u_i^{(2)} \otimes \cdots \otimes u_i^{(d)})$|. Using MATLAB notation, we can write |$T_{[1]}=U \circ V$|, where |$U(:,i)=\lambda _i u_i^{(1)} \in \mathbb{R}^{N_1}$| and |$V(:,i)=u_i^{(2)} \otimes \cdots \otimes u_i^{(d)} \in \mathbb{R}^{\varPi _{k=2}^{d}N_k}$|. By John’s theorem, there exists an |$S \in \mathbb{R}^{r \times r}$| such that |$T_{[1]}=X \circ Y$| with |$X=US$|, |$Y=VS^{-1}$|, |$\|X\|_{\infty } \leq \|X\|_{2,\infty } \leq \sqrt{r}$| and |$\|Y\|_{\infty } \leq \|Y\|_{2,\infty } \leq \alpha $|. Furthermore, each column of |$Y$| is a linear combination of the columns of |$V$|, i.e., for each |$i$| there exist |$\zeta _{1}, \cdots, \zeta _{r}$| such that |$Y(:,i)=\sum _{j=1}^r \zeta _{j} (u_j^{(2)} \otimes \cdots \otimes u_j^{(d)})$|. Therefore, unfolding the |$i$|th column of |$Y$| into a |$(d-1)$|-dimensional tensor |$E_i \in \mathbb{R}^{N_2 \times \cdots \times N_d}$| results in a rank-|$r$|, order-|$(d-1)$| tensor with |$\|E_i\|_{\infty } \leq \|Y\|_{\infty } \leq \alpha $|. By induction, |$E_i$| can be decomposed into |$r^{d-2}$| rank-one tensors with bounded factors, i.e., \begin{equation*} E_i =\sum_{j=1}^{r^{d-2}} \sigma_{i,j} v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d,\ \|v_{i,j}^2\|_{\infty},\cdots,\|v_{i,j}^d\|_{\infty} \leq 1, \ \sum_{j=1}^{r^{d-2}} |\sigma_{i,j}| \leq (r\sqrt{r})^{d-2} \alpha, \end{equation*} where the factors are indexed from |$2$| to |$d$| to emphasize that |$E_i$| is generated from the dimensions |$2$| to |$d$| of |$T$|. Going back to the original tensor, as |$T_{[1]}= X \circ Y$|, we also have |$T= \sum _{i=1}^r X(:,i) \circ \left(\sum _{j=1}^{r^{d-2}} \sigma _{i,j} v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d\right)$|. Notice that |$\|X(:,i)\|_{\infty } \leq \sqrt{r}$|. Therefore, we can rewrite \begin{equation*} T=\sum_{i=1}^r \sum_{j=1}^{r^{d-2}} (\sigma_{i,j} \|X(:,i)\|_{\infty}) \ \frac{X(:,i)}{\|X(:,i)\|_{\infty}} \circ v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d.\end{equation*} By rearranging, we get |$T=\sum _{k=1}^{r^{d-1}} \sigma _k u_k^1 \circ u_k^2 \circ \cdots \circ u_k^d$| with |$\|u_k^1\|_{\infty },\cdots ,\|u_k^d\|_{\infty } \leq 1$| and \begin{equation*}\sum_{k=1}^{r^{d-1}} |\sigma_k| = \sum_{i=1}^r \sum_{j=1}^{r^{d-2}} |\sigma_{i,j}|\, \|X(:,i)\|_{\infty} \leq \sqrt{r}\sum_{i=1}^r \sum_{j=1}^{r^{d-2}} |\sigma_{i,j}| \leq \sum_{i=1}^r \sqrt{r} \left( (r\sqrt{r})^{d-2} \alpha \right) = (r\sqrt{r})^{d-1} \alpha,\end{equation*} which concludes the proof of Lemma 24. This lemma can be used directly to bound the M-norm of a bounded rank-|$r$| tensor.
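To spell out this last step (our reading of how the lemma is applied): each rank-one term |$u_j^1 \circ \cdots \circ u_j^d$| in (8.12) has factors with entries in |$[-1,1]$|, so it is a convex combination of rank-one sign tensors and hence has M-norm at most one. Since the M-norm is a norm, (8.12) therefore gives \begin{equation*} \|T\|_{M} \leq \sum_{j=1}^{r^{d-1}} |\sigma_j| \leq (r\sqrt{r})^{d-1} \alpha = r^{\frac{3(d-1)}{2}}\, \alpha \end{equation*} for any rank-|$r$| tensor |$T$| with |$\|T\|_{\infty } \leq \alpha $|.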
Next, we bound the max-qnorm of a bounded rank-|$r$| tensor. The following lemma proves the existence of a nuclear decomposition for such tensors, which can be used directly to bound their max-qnorm. As the max-qnorm is homogeneous, without loss of generality we assume |$\|T\|_{\infty } \leq 1$|. Lemma 25 Any order-|$d$|, rank-|$r$| tensor |$T \in \bigotimes _{i=1}^{d} \mathbb{R}^{N_i}$| with |$\|T\|_{\infty } \leq 1$| can be decomposed into |$r^{d-1}$| rank-one tensors, |$T=\sum _{j=1}^{r^{d-1}} u_j^1 \circ u_j^2 \circ \cdots \circ u_j^d$|, where \begin{equation} \sum_{j=1}^{r^{d-1}} (u_j^k(t))^2 \leq r^{d-1}\ \textrm{for any}\; 1\leq k \leq d,\ 1\leq t\leq N_k. \end{equation} (8.13) Notice that |$\sqrt{\sum _{j=1}^{r^{d-1}} (u_j^k(t))^2}$| is the |$\ell _2$|-norm of the |$t$|th row of the |$k$|th factor of |$T$|, i.e., |$\sum _{j=1}^{r^{d-1}} (u_j^k(t))^2 \leq r^{d-1}$| means that the two-infinity norm of each factor is bounded by |$\sqrt{r^{d-1}}$|. Remark 26 (Proof via Lemma 24) At the end of this subsection, we provide a proof of the lemma as stated above. However, using the decomposition obtained in Lemma 24, we can already find a decomposition with |$\sum _{j=1}^{r^{d-1}} (u_j^k(t))^2 \leq r^{d}$|. To do this, notice that by Lemma 24, defining |$\vec{\sigma }:=\{\sigma _1, \cdots ,\sigma _{r^{d-1}}\}$|, we can write \begin{equation*}T=\sum_{j=1}^{r^{d-1}} \sigma_j v_j^1 \circ v_j^2 \circ \cdots \circ v_j^d,\ \|v_j^1\|_{\infty},\cdots,\|v_j^d\|_{\infty} \leq 1, \ \textrm{with}\ \|\vec{\sigma}\|_1\leq (r\sqrt{r})^{d-1}.\end{equation*} Now define \begin{equation*}u_j^k:=|\sigma_j|^{\frac{1}{d}} v_j^k\ \textrm{for any}\; 1\leq k \leq d,\ 1\leq j\leq r^{d-1},\end{equation*} absorbing the sign of |$\sigma _j$| into, say, |$u_j^1$|. It is easy to check that |$T=\sum _{j=1}^{r^{d-1}} u_j^1 \circ u_j^2 \circ \cdots \circ u_j^d$| and \begin{equation*}\sum_{j=1}^{r^{d-1}} (u_j^k(t))^2 = \sum_{j=1}^{r^{d-1}} |\sigma_j|^{\frac{2}{d}} (v_j^k(t))^2 \leq \sum_{j=1}^{r^{d-1}} |\sigma_j|^{\frac{2}{d}} = \|\vec{\sigma}\|_{\frac{2}{d}}^{\frac{2}{d}}.\end{equation*} Using Hölder’s inequality, when |$d \geq 2$| we have \begin{equation*}\sum_{j=1}^{r^{d-1}} (u_j^k(t))^2 \leq \|\vec{\sigma}\|_{\frac{2}{d}}^{\frac{2}{d}} \leq \| \vec{\sigma} \|_1^{\frac{2}{d}} (r^{d-1})^{1-\frac{2}{d}} \leq r^{\frac{3d-3}{d}} r^{\frac{(d-1)(d-2)}{d}}=r^{\frac{(d-1)(d+1)}{d}} \leq r^d.\end{equation*} This proves an upper bound that is close to the one in the lemma. To obtain the sharper bound stated in Lemma 25, we need to go through the induction steps as explained below. Proof of Lemma 25 We prove this lemma by induction. For |$d=2$|, the proof follows directly from applying John’s theorem to a rank-|$r$| decomposition of |$T$|, i.e., |$T=XS \circ YS^{-1}$|, where |$T=X \circ Y$|. Now let |$T$| be an order-|$d$| tensor that can be written as |$T=\sum _{j=1}^{r} u_j^1 \circ u_j^2 \circ \cdots \circ u_j^d$| with |$\|T\|_{\infty } \leq 1$|. Matricizing along the first dimension results in |$T_{[1]} = \sum _{i=1}^r (u_i^{1}) \circ (u_i^{2} \otimes \cdots \otimes u_i^{d})$|. Using matrix notation, we can write |$T_{[1]}=U \circ V$|, where |$U(:,i)= u_i^{1} \in \mathbb{R}^{N_1}$| and |$V(:,i)=u_i^{2} \otimes \cdots \otimes u_i^{d} \in \mathbb{R}^{\varPi _{k=2}^{d}N_k}$|. By John’s theorem, there exists an |$S \in \mathbb{R}^{r \times r}$| such that |$T_{[1]}=X \circ Y$| with |$X=US$|, |$Y=VS^{-1}$|, |$\|X\|_{2,\infty } \leq \sqrt{r}$| and |$\|Y\|_{\infty } \leq \|Y\|_{2,\infty } \leq 1$|.
More importantly, each column of |$Y$| is a linear combination of the columns of |$V$|, i.e., for each |$i$| there exist |$\zeta _{1}, \cdots, \zeta _{r}$| such that |$Y(:,i)=\sum _{j=1}^r \zeta _{j} (u_j^{2} \otimes \cdots \otimes u_j^{d})$|. Therefore, unfolding the |$i$|th column of |$Y$| into the tensor |$E_i \in \mathbb{R}^{N_2 \times \cdots \times N_d}$| results in a rank-|$r$|, order-|$(d-1)$| tensor with |$\|E_i\|_{\infty } \leq \|Y\|_{\infty } \leq 1$|. By the induction hypothesis, |$E_i$| can be decomposed into |$r^{d-2}$| rank-one tensors, |$E_i =\sum _{j=1}^{r^{d-2}} v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d$|, where |$\sum _{j=1}^{r^{d-2}} (v_{i,j}^k(t))^2 \leq r^{d-2}$| for any |$2\leq k\leq d$| and any |$1\leq t \leq N_k$|. Notice that the factors start from |$v_{i,j}^2$| to emphasize that |$E_i$| is generated from the indices in dimensions |$2$| to |$d$|. Going back to the original tensor, as |$T_{[1]}= X \circ Y$|, we can write \begin{equation*}T= \sum_{i=1}^r X(:,i) \circ \left(\sum_{j=1}^{r^{d-2}} v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d \right).\end{equation*} By distributing the outer product, we get |$T= \sum _{i=1}^r \sum _{j=1}^{r^{d-2}} X(:,i) \circ v_{i,j}^2 \circ v_{i,j}^3 \circ \cdots \circ v_{i,j}^d$|. Renaming the vectors in the factors, enumerating the pairs |$(i,j)$| by |$k=(i-1)r^{d-2}+j$|, we get \begin{equation*}T = \sum_{k=1}^{r^{d-1}} u_k^1 \circ u_k^2 \circ \cdots \circ u_k^d.\end{equation*} Now we verify the bound (8.13) for this decomposition by considering each factor separately, using the information we have about |$X$| and the |$E_i$|. Starting from the first factor, notice that |$\|X\|_{2,\infty } \leq \sqrt{r}$|, or more precisely |$\sum _{i=1}^r X(t,i)^2 \leq r$| for any |$1 \leq t \leq N_1$|. By examining the two decompositions of |$T$| stated above, we get \begin{equation*}u_k^1=X\left(:,\left\lceil k/r^{d-2}\right\rceil\right),\end{equation*} so each column |$X(:,i)$| appears exactly |$r^{d-2}$| times among the vectors |$u_k^1$|, and therefore \begin{equation} \sum_{k=1}^{r^{d-1}} (u_k^1(t))^2= r^{d-2} \sum_{i=1}^{r} X(t,i)^2 \leq r^{d-2}\, r=r^{d-1},\ \textrm{for any}\ 1 \leq t \leq N_1, \end{equation} (8.14) which proves the lemma for the vectors in the first dimension of the decomposition. For the second dimension, with the same enumeration we have |$j=\textrm{mod}(k-1,r^{d-2})+1$| and |$i=\left\lceil k/r^{d-2}\right\rceil$|. Then \begin{equation*}u_k^2=v_{i,j}^2,\end{equation*} and therefore \begin{equation} \sum_{k=1}^{r^{d-1}} (u_k^2(t))^2= \sum_{i=1}^r \sum_{j=1}^{r^{d-2}} (v_{i,j}^2(t))^2 \leq \sum_{i=1}^r r^{d-2} = r^{d-1},\ \textrm{for any}\, 1 \leq t \leq N_2, \end{equation} (8.15) which finishes the proof of the lemma for the vectors in the second dimension. All the other dimensions can be bounded in exactly the same way as the second dimension. The bound on the max-qnorm of a bounded rank-|$r$| tensor follows directly from Lemma 25 and the definition of the tensor max-qnorm. Remark 27 In both Lemmas 24 and 25, we start by decomposing a tensor |$T=U_1 \circ U_2 \circ \cdots \circ U_d$| into |$T_{[1]}=U_1 \circ V$| and by generating |$K$| (in John’s theorem) using the rows of the factor |$U_1$|. Notice that John’s theorem requires the set |$K$| to be full-dimensional. This condition is satisfied in the matrix case, as the low-rank decomposition of a matrix (with the smallest rank) is full-dimensional. However, this is not necessarily the case for tensors. In other words, the matricization along a dimension might have smaller rank than the original tensor.
To deal with this issue, consider a factor |$U_{add}$| with the same size of |$U_1$| such that |$U_1+U_{add}$| is full-dimensional. Now the tensor |$T_{\varepsilon } = T + \varepsilon U_{add} \circ U_2 \circ \cdots \circ U_d$| satisfies the conditions of the John’s theorem and by taking |$\varepsilon $| to zero we can prove that |$\|T\|_M = \|T_{\varepsilon }\|_M$| and |$\|T\|_{\max } = \|T_{\varepsilon }\|_{\max }$|⁠. Notice that M-norm is convex and max-qnorm satisfies |$\|X+T\|_{\max } \leq \big ( \sqrt{\|X\|_{\max }^{\frac{2}{d}} + \|T\|_{\max }^{\frac{2}{d}}} \big )^d$|⁠. 8.4 Proof of Theorem 8 To prove Theorem 8, we make use of Lemma 5, Lemma 6 and Theorem 7 repeatedly. Note that some other calculations are simple manipulations of those in the proof in [10, Section 6]. For ease of notation define |$\hat{T}:=\hat{T}_{M}$|⁠. Notice that |$T^{\sharp }$| is feasible for (4.3) and therefore, \begin{equation*} \frac{1}{m}\sum_{t=1}^{m} (\hat{T}(\omega_t)-Y(\omega_t))^2 \leq \frac{1}{m}\sum_{t=1}^{m} (T^{\sharp}(\omega_t)-Y(\omega_t))^2. \end{equation*} Plugging in |$Y(\omega _t)= T^{\sharp }(\omega _t) + \sigma \xi _t$| and defining |$\varDelta =\hat{T}-T^{\sharp } \in K_{M}^T(2\alpha ,2R)$| we get \begin{equation} \frac{1}{m}\sum_{t=1}^{m} \varDelta(\omega_t)^2 \leq \frac{2\sigma}{m}\sum_{t=1}^{m} \xi_t \varDelta(\omega_t). \end{equation} (8.16) The proof is based on obtaining a lower bound for the left-hand side of (8.16) and an upper bound for its right-hand side. 8.4.1 Upper bound on right-hand side of (8.16) First, we bound |$\hat{R}_m(\alpha ,R)\!:=\! \underset{\varDelta \in K_M^T(\alpha ,R)} {\sup } \left|\!\frac{1}{m}\! \sum _{t=1}^{m}\! \xi _t \varDelta (\omega _t)\right|$|⁠, where |$\xi _t$| is a sequence of |$\mathcal{N}(0,1)$| random variables. With probability at least |$1-\delta $| over |$\xi =\{\xi _t\}$|⁠, we can relate this value to a Gaussian maxima as follows [55, Theorem 4.7]: \begin{equation*} \begin{aligned} \underset{\varDelta \in K_M^T(\alpha,R)} {\sup} \left|\frac{1}{m} \sum_{t=1}^{m} \xi_t \varDelta(\omega_t)\right| &\leq \mathbb{E}_{\xi}\left[\underset{\varDelta \in K_M^T(\alpha,R)} {\sup} \left|\frac{1}{m} \sum_{t=1}^{m} \xi_t \varDelta(\omega_t)\right|\right] + \pi \alpha \sqrt{\frac{\log (\frac{1}{\delta})}{2m}} \\ &\leq R\ \mathbb{E}_{\xi}\left[\underset{T \in T_{\pm}} {\sup} \left|\frac{1}{m} \sum_{t=1}^{m} \xi_t T(\omega_t)\right|\right] + \pi \alpha \sqrt{\frac{\log (\frac{1}{\delta})}{2m}}, \end{aligned} \end{equation*} where |$T \in T_{\pm }$| is the set of rank-one sign tensors with |$|T_{\pm }| < 2^{Nd}$|⁠. Notice that in the second inequality we use |$\mathbb{B}_{M}(1)=\operatorname{conv}(T_{\pm })$| (Lemma 5) and remove the convex hull from the supremum. Since for each |$T$|⁠, |$\sum _{t=1}^{m} \xi _t T(\omega _t)$| is a Gaussian with mean zero and variance |$m$|⁠, the expected maxima is bounded by |$\sqrt{2m \log (|T_{\pm }|)}$|⁠. Gathering all the above information, we end up with the following upper bound with probability larger than |$1-\delta $|⁠: \begin{equation} \underset{T \in K_M^T(\alpha,R)} {\sup} \left|\frac{1}{m} \sum_{t=1}^{m} \xi_t T(\omega_t)\right| \leq R \sqrt{\frac{2\log(2)Nd}{m}} + \pi \alpha \sqrt{\frac{\log \left(\frac{1}{\delta}\right)}{2m}}. 
\end{equation} (8.17) Choosing |$\delta =\textrm{e}^{-\frac{Nd}{2}}$| we get that with probability at least |$1-\textrm{e}^{-\frac{Nd}{2}}$| \begin{equation} \underset{T \in K_M^T(\alpha,R)} {\sup} \left|\frac{1}{m} \sum_{t=1}^{m} \xi_t T(\omega_t)\right| \leq 2(R+\alpha)\sqrt{\frac{Nd}{m}}. \end{equation} (8.18) 8.4.2 Lower bound on left-hand side of (8.16) In this section we prove that, with high probability, |$\frac{1}{m}\sum _{t=1}^{m} \varDelta (\omega _t)^2$| does not deviate much from its expectation |$\|\varDelta \|_{\varPi }^2$|⁠. For ease of notation, define |$T_{\varOmega }=(T(\omega _1),T(\omega _2),\cdots ,T(\omega _m))$| to be the set of chosen samples drawn from |$\varPi $| and \begin{equation*}\|T\|_{\varPi}^2=\frac{1}{m}\mathbb{E}_{\varOmega \sim \varPi} \|T_{\varOmega}\|_2^2=\sum_{\omega} \pi_{\omega} T(\omega)^2.\end{equation*} We next prove that with high probability over the samples, \begin{equation} \frac{1}{m} \|T_\varOmega\|_2^2 \geq \|T\|_{\varPi}^2 - C\beta \sqrt{\frac{Nd}{m}}, \end{equation} (8.19) holds uniformly for all tensors |$T \in K_M^T(1,\beta )$|⁠. Lemma 28 Defining |$\delta (\varOmega ):=\underset{T \in K_M^T(1,\beta )} {\sup } |\frac{1}{m} \|T_\varOmega \|_2^2 - \|T\|_{\varPi }^2|$| and assuming |$N,d>2$| and |$m \leq N^d$|⁠, there exists a constant |$C>20$| such that \begin{equation*}\mathbb{P}\left(\delta(\varOmega)> C\beta \sqrt{\frac{Nd}{m}}\right) \leq \textrm{e}^{\frac{-N}{ln(N)}}.\end{equation*} Proof of Lemma 28 To prove this lemma, we shall show that we can bound the |$t$|th moment of |$\delta (\varOmega )$| as \begin{equation} \mathbb{E}_{\varOmega \sim \varPi}[\delta(\varOmega)^t] \leq \left( \frac{8 \beta\sqrt{Nd+t\ \ln(m)}}{\sqrt{m}} \right)^t. \end{equation} (8.20) Assume that (8.20) holds (which we prove below). Then we can prove Lemma 28 by using Markov’s inequality together with (8.20). Specifically, \begin{equation} \mathbb{P}\left(\delta(\varOmega)> C \beta \sqrt{\frac{Nd}{m}}\right) = \mathbb{P}\left( (\delta(\varOmega))^t > \left(C \beta \sqrt{\frac{Nd}{m}}\right)^t \right) \leq \frac{\mathbb{E}_{\varOmega \sim \varPi}[\delta(\varOmega)^t]}{\left(C \beta \sqrt{\frac{Nd}{m}}\right)^t}. \end{equation} (8.21) Using (8.20) and simplifying we get \begin{equation*}\mathbb{P}\left(\delta(\varOmega)> C \beta \sqrt{\frac{Nd}{m}}\right) \leq \left( \frac{4 \sqrt{Nd + t \textrm{ln}(m)}}{C \sqrt{Nd}} \right) ^t. \end{equation*} Taking |$t = \frac{Nd}{\textrm{ln}(m)}$| and for |$C>12$|⁠, \begin{equation*}\mathbb{P}\left(\delta(\varOmega)> C \beta \sqrt{\frac{Nd}{m}}\right) \leq \textrm{e}^{\frac{-Nd}{\ln(m)}} \leq \textrm{e}^{\frac{-N}{\ln(N)}}.\end{equation*} Now we prove (8.20) by using some standard tools from probability in Banach spaces, including symmetrization and contraction inequality [19, 44]. Viewing the tensor |$T \in \bigotimes _{i=1}^d \mathbb{R}^N$| as a function from |$[N]^d \rightarrow R$|⁠, we define |$f_T(\omega _1, \omega _2, \cdots , \omega _d):= T(\omega _1, \omega _2, \cdots , \omega _d)^2$|⁠. We are interested in bounding |$\delta (\varOmega ):=\underset{f_T: T \in K_M^T(1,\beta )} {\sup } \left|\frac{1}{m} \sum _{i=1}^{m} f_T(\omega _i) - \mathbb{E}(f_T(\omega _i))\right |$|⁠. 
A standard symmetrization argument combined with the contraction principle yields \begin{align*} \mathbb{E}_{\varOmega \sim \varPi}[\delta(\varOmega)^t] &\leq \mathbb{E}_{\varOmega \sim \varPi} \left\lbrace 2 \mathbb{E}_{\varepsilon} \left[\underset{T \in K_M^T(1,\beta)} {\sup} \left|\frac{1}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i)^2\right| \right] \right\rbrace^t\\ &\leq \mathbb{E}_{\varOmega \sim \varPi} \left\lbrace 4 \mathbb{E}_{\varepsilon} \left[\underset{T \in K_M^T(1,\beta)} {\sup} \left|\frac{1}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i)\right| \right] \right\rbrace^t, \end{align*} where |$\varepsilon _i$| are Rademacher random variables. Note that if |$\|T\|_{M} \leq \beta $|, then |$T \in \beta \operatorname{conv}(T_{\pm })$| and therefore, \begin{align*} \mathbb{E}_{\varOmega \sim \varPi}[\delta(\varOmega)^t] &\leq \mathbb{E}_{\varOmega \sim \varPi} \left\lbrace 4 \beta \mathbb{E}_{\varepsilon} \left[\underset{T \in T_{\pm}} {\sup} \left|\frac{1}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i)\right| \right] \right\rbrace^t\\ &= \beta^t \mathbb{E}_{\varOmega \sim \varPi} \left\lbrace \mathbb{E}_{\varepsilon} \left[\underset{T \in T_{\pm}} {\sup} \left|\frac{4}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i) \right| \right] \right\rbrace^{t}. \end{align*} To bound the right-hand side above, note that for any fixed |$T \in T_{\pm}$| and any |$\alpha>0$| [63, Theorem 36] \begin{equation*}\mathbb{P}_{\varepsilon} \left(\frac{4}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i) \geq \frac{\alpha}{\sqrt{m}}\right) = \mathbb{P} \left (\textrm{Binom}\left(m,\frac{1}{2}\right) \geq \frac{m}{2} + \frac{\alpha \sqrt{m}}{8} \right) \leq \textrm{e}^{\frac{-\alpha^2}{16}}.\end{equation*} Taking a union bound over |$T_{\pm }$|, as |$|T_{\pm }| \leq 2^{Nd}$| we get \begin{equation*}\mathbb{P}_{\varepsilon} \left[\underset{T \in T_{\pm}} {\sup} \left( \left|\frac{4}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i)\right|\right) \geq \frac{\alpha}{\sqrt{m}}\right] \leq 2^{Nd+1} \textrm{e}^{\frac{-\alpha^2}{16}}. \end{equation*} Combining the above result and using Jensen’s inequality, when |$t>1$| \begin{align*} \beta^t \mathbb{E}_{\varOmega \sim \varPi} \left \lbrace \mathbb{E}_{\varepsilon} \left[\underset{T \in T_{\pm}} {\sup} \left|\frac{4}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i) \right| \right] \right\rbrace^t &\leq \beta^t \mathbb{E}_{\varOmega \sim \varPi} \left \lbrace \mathbb{E}_{\varepsilon} \left[\underset{T \in T_{\pm}} {\sup} \left|\frac{4}{m} \sum_{i=1}^m \varepsilon_i T(\omega_i) \right| \right]^t \right \rbrace\\ &\leq \beta^t \left( \left(\frac{\alpha}{\sqrt{m}}\right)^t + 4^t 2^{Nd+1} \textrm{e}^{\frac{-\alpha^2}{16}} \right) .\end{align*} Choosing |$\alpha =\sqrt{16 \ln (4 \times 2^{Nd+1}) + 4t\ln (m)}$| and simplifying proves (8.20). 8.4.3 Gathering the results of (8.16), (8.18) and (8.19) Now we combine the upper and lower bounds in the last two sections to prove Theorem 8. On one hand, from (8.18), as |$\varDelta \in K_M^T(2\alpha ,2R)$|, we get \begin{equation*} \frac{1}{m}\|\varDelta_\varOmega\|_2^2 \leq 8\sigma (R+\alpha) \sqrt{\frac{Nd}{m}},\end{equation*} with probability greater than |$1-\textrm{e}^{-\frac{Nd}{2}}$|. On the other hand, using Lemma 28 and rescaling, we get \begin{equation*}\|\varDelta\|_{\varPi}^2 \leq \frac{1}{m} \|\varDelta_\varOmega\|_2^2 + C R \alpha \sqrt{\frac{Nd}{m}},\end{equation*} with probability greater than |$1-\textrm{e}^{\frac{-N}{\ln (N)}}$|. The above two inequalities finish the proof of Theorem 8. Remark 29 There are only two differences between the proofs of Theorem 8 and Theorem 13.
First, an extra constant, |$c_1 c_2^d$| shows up in the Rademacher complexity of unit max-qnorm ball that changes the constant |$C$| in Theorem 8 to |$C c_1 c_2^d$| in Theorem 13. The second difference is the max-qnorm of the error tensor |$\varDelta =\hat{T}-T^{\sharp }$| (as in (8.16)), which now belongs to |$K_{\max }^T(2\alpha , 2^{d-1} R)$| instead of |$K_{\max }^T(2\alpha , 2R)$|⁠. 8.5 Proof of Theorem 16 The proof is based on [66], which establishes excess risk bounds of risk minimization with smooth loss functions and with a hypothesis class of bounded empirical Rademacher complexity. To that end, remember that |$Y(\omega _t)=T^{\sharp }(\omega _t) + \sigma \xi _t$|⁠. We define the loss function \begin{equation*} \mathscr{L}(X):=\mathbb{E}_{\omega_t \sim \varPi, \xi_t \sim \mathcal{N}(0,1)} (X(\omega_t)-Y(\omega_t))^2 = \|X-T^{\sharp}\|_{\varPi}^2 + \sigma^2\end{equation*} and its corresponding empirical loss function \begin{equation*}\hat{\mathscr{L}}_{m}(X,Y):=\frac{1}{m}\sum_{t=1}^{m} (X(\omega_t)-Y(\omega_t))^2,\end{equation*} where |$\{Y(\omega _t)\}_{t=1}^m$| is i.i.d. noisy samples generated according to |$\varPi $|⁠. Our goal is to bound |$\|\hat{T}_{M} - T^{\sharp }\|_{\varPi }^2$|⁠, where |$\hat{T}_{M}:= \arg \min _{X \in K_T(\alpha ,R)} \hat{\mathscr{L}}_{m}(X,Y)$|⁠. The first step is proving that with high probability the noise is bounded. In particular, with probability greater than |$1- \frac{1}{m}$| [10, equation (6.16)], \begin{equation*}\max_{1 \leq t \leq m} |\xi_t|< c \sqrt{\log(m)}.\end{equation*} Using this and [66, Theorem 1], we get that with probability greater than |$1- \delta $| \begin{equation} \begin{aligned} &\mathscr{L}(\hat{T}_{M}) - \min_{X \in K_T(\alpha,R)} \mathscr{L}(X) \\ & \quad \leq C \left[ \sigma \left( \log(m)^{\frac{3}{2}} R_m(K_T(\alpha,R))+ \sqrt{ \frac{b \log\left(\frac{1}{\delta}\right)}{m}}\right) + \log(m)^3\ R_m(K_T(\alpha,R))^2+ \frac{b \log\left(\frac{1}{\delta}\right)}{m} \right], \end{aligned} \end{equation} (8.22) where |$b=5 \alpha ^2 + 4 \alpha \sigma \sqrt{\log (m)}$|⁠, and |$R_m(K_T(\alpha ,R))$| is bounded by |$6 \sqrt{\frac{dN}{m}}$| (Lemma 32). Moreover, note that |$min_{X \in K_T(\alpha ,R)} \mathscr{L}(X) = \mathscr{L}(T^{\sharp }) = \sigma ^2$| and |$\mathscr{L}(\hat{T}_{M}) = \|\hat{T}_M - T^{\sharp }\|_{\varPi }^2 + \sigma ^2$|⁠. We complete the proof by plugging these two into the left-hand side of (8.22). 8.6 Proof of Theorem 18 8.6.1 Packing set construction First, we construct a packing for the set |$K_{M}^T(\alpha ,R)$|⁠. Lemma 30 Let |$r=\left\lfloor \left(\frac{R}{\alpha K_G}\right)^2 \right\rfloor{},$| let |$K_{M}^T(\alpha ,R)$| be defined as in (6.1) and suppose |$\gamma \leq 1$| is such that |$\frac{r}{\gamma ^2}$| is an integer and |$\frac{r}{\gamma ^2} \leq N$|⁠. Then there exists a set |$\chi ^T \subset K_{M}^T(\alpha ,R)$| with \begin{equation*}|\chi^T| \geq \exp\left(\frac{rN}{16\gamma^2}\right)\end{equation*} such that (i) For |$T \in \chi ^T$|⁠, |$|T(\omega )|=\alpha \gamma $| for |$\omega \in [N]^d$|⁠. (ii) For any |$T^{i},T^{j} \in \chi ^T$|⁠, |$T^{i} \neq T^{j}$| \begin{equation*}\|T^{i}-T^{j}\|_F^2 \geq \frac{\alpha^2 \gamma^2 N^d}{2}.\end{equation*} Proof. This packing is a tensor version of the packing set generated in [19, Lemma 3) with similar properties, and our construction is based on the packing set generated there for low-rank matrices with bounded entries. 
In particular, we know that there exists a set |$\chi \subset \lbrace M \in \mathbb{R}^{N \times N}: \|M\|_{\infty } \leq \alpha , \textrm{rank}(M)=r \rbrace $| with |$|\chi | \geq \exp \left(\frac{rN}{16\gamma ^2}\right)$| and for any |$M^{i},M^{j} \in \chi $|⁠, |$\|M^{i}-M^{j}\|_F^2 \geq \frac{\alpha ^2 \gamma ^2 N^2}{2}$| when |$i \neq j$|⁠. Take any |$M^{k} \in \chi $|⁠. |$M^{k}$| is a rank-|$r$| matrix with |$\|M^{k}\|_{\infty } =\alpha \gamma \leq \alpha $| and therefore |$\|M^{k}\|_{\max } \leq \sqrt{r} \alpha $|⁠. By (3.3), there exists a nuclear decomposition of |$M^{k}$| with bounded singular vectors |$M^{k}=\sum _{i} \sigma _i u_i \circ v_i,\ \|u_i\|_{\infty },\|v_i\|_{\infty } \leq 1$|⁠, such that |$\sum _{i=1} |\sigma _i| \leq K_G \sqrt{r} \alpha $|⁠. Define |$T^{k} = \sum _{i} \sigma _i u_i \circ v_i \circ \vec{\mathbf{1}} \cdots \circ \vec{\mathbf{1}}$|⁠, where |$\vec{\mathbf{1}} \in \mathbb{R}^N$| is the vector of all ones. Note that |$\|u_i\|_{\infty }, \|v_i\|_{\infty }, \|\vec{\mathbf{1}}\|_{\infty } \leq 1$|⁠, and therefore by definition, |$\|T^{k}\|_{M} \leq K_G \sqrt{r} \alpha \leq R$| by Lemma 5. The tensor is basically generated by stacking the matrix |$M^{k}$| along all the other |$d-2$| dimensions and therefore |$|T^{k}(\omega )| = \alpha \gamma $| for |$\omega \in [N]^d$|⁠, and |$\|T^{k}\|_{\infty } \leq \alpha $|⁠. Hence we build |$\chi ^T$| from |$\chi $| by taking the outer product of the matrices in |$\chi $| by the vector |$\vec{\mathbf{1}}$| along all the other dimensions. Obviously |$|\chi ^T|=|\chi | \geq \exp \left(\frac{rN}{16\gamma ^2}\right)$|⁠. It just remains to prove \begin{equation*}\|T^{i}-T^{j}\|_F^2 \geq \frac{\alpha^2 \gamma^2 N^d}{2}.\end{equation*} Assuming that |$T^{i}$| is generated from |$M^{i}$| and |$T^{j}$| is generated from |$M^{j}$|⁠, since \begin{align*} T^{i}(i_1,i_2, \cdots,i_d)&=M^{i}(i_1,i_2),\\ \|T^{i}-T^{j}\|_F^2 &= \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \cdots \sum_{i_d=1}^{N} (T^{i}(i_1,i_2, \cdots,i_d) - T^{j}(i_1,i_2, \cdots,i_d))^2\\&= N^{d-2} \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} (M^{i}(i_1,i_2)-M^{j}(i_1,i_2))^2 = N^{d-2} \|M^{i}-M^{j}\|_F^2 \geq \frac{\alpha^2 \gamma^2 N^d}{2}, \end{align*} which concludes proof of the lemma. 8.6.2 Proof of Theorem 18 Now we use the construction in Lemma 30 to obtain a |$\delta $|-packing set |$\chi ^T$| of |$K_{M}^T$| with |$\delta =\alpha \gamma \sqrt{\frac{N^d}{2}}$|⁠. For the lower bound, we assume that the sampling distribution satisfies \begin{equation} \max_{\omega} \pi_{\omega} \leq \frac{L}{N^d}. \end{equation} (8.23) The proof follows the proof in [10, Section 6.2], the main parts of which we will rewrite; we refer to [10] for more details. It is based on two main components. First is a lower bound on the |$\|\cdot \|_F$|-risk in terms of the error in a multi-way hypothesis testing problem [72], given by \begin{equation*}\underset{\hat{T}}{\textrm{inf}} \underset{T \in K_{M}^T(\alpha,R)}{\textrm{sup}} \mathbb{E}_T\|\hat{T}-T\|_F^2 \geq \frac{\delta^2}{4} \underset{\tilde{T}}{\min} \mathbb{P} (\tilde{T} \neq T^{\sharp}),\end{equation*} where |$T^{\sharp }$| is uniformly distributed over the pacing set |$\chi ^T$|⁠. The left-hand side above is the maximum expected error obtained by the best estimator in terms of Frobenius norm. 
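Before turning to the second component of the argument, the stacking step in Lemma 30 can be illustrated with a few lines of NumPy (an illustrative sketch of ours; the function name is hypothetical). It lifts an |$N \times N$| matrix to an order-|$d$| tensor by taking outer products with all-ones vectors and checks the Frobenius-norm scaling used above.

import numpy as np

def lift_matrix_to_tensor(M, d):
    # Stack the N x N matrix M along d-2 extra dimensions by taking
    # outer products with all-ones vectors: T = M o 1 o ... o 1.
    N = M.shape[0]
    T = M.copy()
    for _ in range(d - 2):
        T = np.multiply.outer(T, np.ones(N))
    return T

# the squared Frobenius distance scales by N^{d-2} under the lift
rng = np.random.default_rng(0)
N, d = 4, 3
Mi = rng.choice([-1.0, 1.0], size=(N, N))
Mj = rng.choice([-1.0, 1.0], size=(N, N))
Ti, Tj = lift_matrix_to_tensor(Mi, d), lift_matrix_to_tensor(Mj, d)
assert np.isclose(np.linalg.norm(Ti - Tj)**2,
                  N**(d - 2) * np.linalg.norm(Mi - Mj)**2)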
The second component is a variant of Fano’s inequality, which, conditioned on the observations |$\varOmega =\{\omega _1, \cdots , \omega _m\}$|, gives the lower bound \begin{equation} \mathbb{P}(\tilde{T} \neq T^{\sharp} | \varOmega) \geq 1-\frac{\Big({{|\chi^T|}\choose{2}}\Big)^{-1} \sum_{k\neq j} K(T^k || T^j)+\log(2)}{\log|\chi^T|}, \end{equation} (8.24) where |$K(T^k || T^j)$| is the Kullback–Leibler divergence between the distributions |$(Y_\varOmega |T^k)$| and |$(Y_\varOmega |T^j)$|. Here, |$(Y_\varOmega |T^k)$| is the probability of observing |$Y_\varOmega $| given that the measurements are taken from |$T^k$|. For our observation model with i.i.d. Gaussian noise, we have \begin{equation*}K(T^k || T^j) = \frac{1}{2\sigma^2}\sum_{t=1}^{m} (T^k(\omega_t) - T^j(\omega_t))^2,\end{equation*} and therefore, \begin{equation*}\mathbb{E}_\varOmega [K(T^k || T^j)] = \frac{m}{2\sigma^2} \big\|T^k - T^j\big\|_{\varPi}^2.\end{equation*} From property (i) of the packing set in Lemma 30, |$\|T^k-T^j\|_F^2 \leq 4 \alpha^2 \gamma ^2 N^d$|. This, combined with (8.23) and (8.24), yields \begin{equation*}\mathbb{P}(\tilde{T} \neq T^{\sharp}) \geq 1-\frac{\frac{32L\gamma^4\alpha^2m}{\sigma^2}+12\gamma^2}{rN}\geq 1-\frac{32L\gamma^4\alpha^2m}{\sigma^2rN}-\frac{12\gamma^{2}}{rN}\geq \frac{3}{4}-\frac{32L\gamma^4\alpha^{2}m}{\sigma^{2}rN},\end{equation*} provided |$rN>48$|. Now if |$\gamma ^4 \leq \frac{\sigma ^2rN}{128L\alpha ^2m}$|, \begin{equation*}\underset{\hat{T}}{\textrm{inf}} \underset{T \in K_{M}^T(\alpha,R)}{\textrm{sup}} \frac{1}{N^d} \mathbb{E}\big\|\hat{T}-T\big\|_F^2 \geq \frac{\alpha^2 \gamma^2}{16}.\end{equation*} Therefore, if |$\frac{\sigma ^2rN}{128L\alpha ^2m} \geq 1$|, choosing |$\gamma =1$| finishes the proof. Otherwise, choosing |$\gamma ^2 = \frac{\sigma }{8\sqrt{2}\alpha }\sqrt{\frac{rN}{Lm}}$| results in \begin{equation*}\underset{\hat{T}}{\textrm{inf}} \underset{T \in K_{M}^T(\alpha,R)}{\textrm{sup}} \frac{1}{N^d} \mathbb{E}\big\|\hat{T}-T\big\|_F^2 \geq \frac{\sigma\alpha}{128\sqrt{2}}\sqrt{\frac{rN}{Lm}} \geq \frac{\sigma R}{128\sqrt{2}K_G}\sqrt{\frac{N}{Lm}}.\end{equation*} 9. Future directions and open problems In this work, we considered M-norm constrained LS estimation for TC and showed that, theoretically, the number of required measurements is linear in the maximum size of the tensor. To the best of our knowledge, this is the first work that reduces the required number of measurements from |$N^{\frac{d}{2}}$| to |$N$|. Yet, there are many open problems to be addressed and complications to be resolved. In the following, we list a few of these. Calculating the nuclear-norm of a tensor is NP-hard [27]. Although it seems likely that calculating the M-norm is NP-hard as well, this has not been proven yet. An interesting future direction is analysing this norm and the possibility of defining a surrogate that is easier to calculate. Another interesting question is studying the dual norm of the M-norm. The discrepancy between the upper bounds on the nuclear-norm and the max-qnorm of a bounded low-rank tensor is significant, and it is also the main reason for the theoretical superiority of the max-qnorm over the nuclear-norm. In our proof, one of the main theoretical steps for bounding the LS estimation error, constrained with an arbitrary norm, is bounding the Rademacher complexity of unit-norm tensors and finding a tight bound for the norm of low-rank tensors. In the case of the max-qnorm, we are able to achieve an upper bound of |$\sqrt{r^{d^2-d}}\, \alpha $| and a Rademacher complexity of |$O\left(\sqrt{\frac{dN}{m}}\right)$|.
A calculation of these quantities for the nuclear-norm still needs to be done. However, a generalization of current results gives an upper bound of |$O(\sqrt{r^{d-1} N^d} \alpha )$| for the nuclear-norm of rank-|$r$| tensors. Considering the tensor |$\vec{\mathbf{1}} \circ \cdots \circ \vec{\mathbf{1}} $|, we can see that this bound is tight. We leave the exact answer to this question to future work. We know that the dependence of the upper bound on the max-qnorm of a bounded low-rank tensor, found in Theorem 7, is optimal in |$N$|. However, we believe the dependence on |$r$| can be improved. We saw in Section 7 that this is definitely the case for a specific class of tensors. In addition to the open problems concerning algorithms for calculating the max-qnorm of tensors and projecting onto max-qnorm balls, an interesting question is analysing exact tensor recovery using the max-qnorm. Most of the evidence points to this being NP-hard, including [32], which proves that many similar tensor problems are NP-hard, and [2], which establishes a connection between noisy TC and the 3-SAT problem and shows that if exact TC can be done in polynomial time, then the conjecture in [18] would be disproved. However, a definitive study of whether or not it is NP-hard, or of the availability of polynomial-time estimates, remains to be done. The preliminary results of Algorithm 2 show significant improvements over previous algorithms. This highlights the need for a more sophisticated (and provable) algorithm that utilizes the max-qnorm for TC. As a first step, in the matrix case the max-qnorm can be reformulated as a semidefinite programming problem, and together with [7], this shows that once we solve the problem using its factors, any local minimum is a global minimum, which proves global convergence of the algorithm. However, this is not the case for tensors and, in our experiments, we saw that the results are sensitive to the size of the low-rank factors. Analysing this behavior is an interesting future direction. Funding Ö.Y. was funded in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (22R82411), an NSERC Accelerator Award (22R68054) and an NSERC Collaborative Research and Development Grant DNOISE II (22R07504). Y.P. is partially supported by an NSERC Discovery Grant (22R23068). References 1. Acar, E., Çamtepe, S. A., Krishnamoorthy, M. S. & Yener, B. (2005) Modeling and multiway analysis of chatroom tensors. International Conference on Intelligence and Security Informatics. Springer, pp. 256–268. 2. Barak, B. & Moitra, A. (2016) Noisy tensor completion via the sum-of-squares hierarchy. Conference on Learning Theory, pp. 417–445. 3. Bartlett, P. L. & Mendelson, S. (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3, 463–482. 4. Bazerque, J. A., Mateos, G. & Giannakis, G. (2013) Rank regularization and Bayesian inference for tensor completion and extrapolation. IEEE Trans. Signal Process., 61, 5689–5703. 5. Bhojanapalli, S. & Jain, P. (2014) Universal matrix completion. International Conference on Machine Learning, pp. 1881–1889. 6. Blei, R. C. (1979) Multidimensional extensions of the Grothendieck inequality and applications. Ark. Mat., 17, 51–68. 7. Burer, S. & Choi, C. (2006) Computational enhancements in low-rank semidefinite programming. Optim.
Methods Softw., 21, 493–512. 8. Cai, J.-F., Candès, E. J. & Shen, Z. (2010) A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20, 1956–1982. 9. Cai, T. & Zhou, W.-X. (2013) A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14, 3619–3647. 10. Cai, T. T. & Zhou, W.-X. (2016) Matrix completion via max-norm constrained optimization. Electron. J. Stat., 10, 1493–1525. 11. Candes, E. J. & Plan, Y. (2010) Matrix completion with noise. Proc. IEEE, 98, 925–936. 12. Candès, E. J. & Recht, B. (2009) Exact matrix completion via convex optimization. Found. Comput. Math., 9, 717–772. 13. Candès, E. J. & Tao, T. (2010) The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 56, 2053–2080. 14. Carroll, J. D. & Chang, J.-J. (1970) Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart–Young decomposition. Psychometrika, 35, 283–319. 15. Chandrasekaran, V., Recht, B., Parrilo, P. A. & Willsky, A. S. (2012) The convex geometry of linear inverse problems. Found. Comput. Math., 12, 805–849. 16. Coja-Oghlan, A., Goerdt, A. & Lanka, A. (2007) Strong refutation heuristics for random k-SAT. Combin. Probab. Comput., 16, 5–28. 17. Da Silva, C. & Herrmann, F. J. (2015) Optimization on the Hierarchical Tucker manifold—applications to tensor completion. Linear Algebra Appl., 481, 131–173. 18. Daniely, A., Linial, N. & Shalev-Shwartz, S. (2013) More data speeds up training time in learning halfspaces over sparse vectors. Advances in Neural Information Processing Systems, pp. 145–153. 19. Davenport, M. A., Plan, Y., van den Berg, E. & Wootters, M. (2014) 1-bit matrix completion. Inf. Inference, 3, 189–223. 20. Davenport, M. A. & Romberg, J. (2016) An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Top. Signal Process., 10, 608–622. 21. Derksen, H. (2013) On the nuclear norm and the singular value decomposition of tensors. Found. Comput. Math., pp. 1–33. 22. Fang, E. X., Liu, H., Toh, K.-C. & Zhou, W.-X. (2015) Max-norm optimization for robust matrix recovery. Math. Programming, pp. 1–31. 23. Fazel, M. (2002) Matrix rank minimization with applications. Ph.D. Thesis, Stanford University. 24. Foygel, R. & Srebro, N. (2011) Concentration-based guarantees for low-rank matrix reconstruction. Proceedings of the 24th Annual Conference on Learning Theory, pp. 315–340. 25. Foygel, R., Srebro, N. & Salakhutdinov, R. R. (2012) Matrix reconstruction with the local max norm. Advances in Neural Information Processing Systems, pp. 935–943. 26. Friedland, S. (1982) Variation of tensor powers and spectra. Linear and Multilinear Algebra, 12, 81–98.
27. Friedland, S. & Lim, L.-H. (2018) Nuclear norm of higher-order tensors. Math. Comput., 87, 1255–1281. 28. Gandy, S., Recht, B. & Yamada, I. (2011) Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27, 025010. 29. Grasedyck, L., Kressner, D. & Tobler, C. (2013) A literature survey of low-rank tensor approximation techniques. GAMM-Mitt., 36, 53–78. 30. Grothendieck, A. (1955) Produits tensoriels topologiques et espaces nucléaires. Séminaire Bourbaki, 2, 193–200. 31. Harshman, R. A. (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-mode factor analysis. UCLA Working Papers in Phonetics, 16, 1–84. 32. Hillar, C. J. & Lim, L.-H. (2013) Most tensor problems are NP-hard. J. ACM, 60, 45. 33. Hu, S. (2015) Relations of the nuclear norm of a tensor and its matrix flattenings. Linear Algebra Appl., 478, 188–199. 34. Jain, P., Netrapalli, P. & Sanghavi, S. (2013) Low-rank matrix completion using alternating minimization. Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing. ACM, pp. 665–674. 35. Jain, P. & Oh, S. (2014) Provable tensor factorization with missing data. Advances in Neural Information Processing Systems, pp. 1431–1439. 36. John, F. (2014) Extremum problems with inequalities as subsidiary conditions. Traces and Emergence of Nonlinear Programming. Springer, pp. 197–215. 37. Keshavan, R. H., Montanari, A. & Oh, S. (2010) Matrix completion from noisy entries. J. Mach. Learn. Res., 11, 2057–2078. 38. Keshavan, R. H., Oh, S. & Montanari, A. (2009) Matrix completion from a few entries. 2009 IEEE International Symposium on Information Theory. IEEE, pp. 324–328. 39. Kolda, T. G. (2006) Multilinear operators for higher-order decompositions. United States Department of Energy. 40. Kolda, T. G. & Bader, B. W. (2009) Tensor decompositions and applications. SIAM Rev., 51, 455–500. 41. Kreimer, N., Stanton, A. & Sacchi, M. D. (2013) Tensor completion based on nuclear norm minimization for 5D seismic data reconstruction. Geophysics, 78, V273–V284. 42. Krishnamurthy, A. & Singh, A. (2013) Low-rank matrix and tensor completion via adaptive sampling. Advances in Neural Information Processing Systems, pp. 836–844. 43. Kushner, H. J. & Clark, D. S. (2012) Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26. Springer Science & Business Media. 44. Ledoux, M. & Talagrand, M. (2013) Probability in Banach Spaces: Isoperimetry and Processes, vol. 23. Springer Science & Business Media. 45. Lee, J. D., Recht, B., Srebro, N., Tropp, J. & Salakhutdinov, R. R. (2010) Practical large-scale optimization for max-norm regularization. Advances in Neural Information Processing Systems, pp. 1297–1305. 46. Li, N. & Li, B.
(2010) Tensor completion for on-board compression of hyperspectral images. 2010 IEEE International Conference on Image Processing. IEEE, pp. 517–520. 47. Lim, L.-H. & Comon, P. (2010) Multiarray signal processing: tensor decomposition meets compressed sensing. C. R. Acad. Sci. IIb Mec., 338, 311–320. 48. Linial, N., Mendelson, S., Schechtman, G. & Shraibman, A. (2007) Complexity measures of sign matrices. Combinatorica, 27, 439–463. 49. Liu, J., Musialski, P., Wonka, P. & Ye, J. (2013) Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Anal. Mach. Intell., 35, 208–220. 50. Ma, S., Goldfarb, D. & Chen, L. (2011) Fixed point and Bregman iterative methods for matrix rank minimization. Math. Programming, 128, 321–353. 51. Mocks, J. (1988) Topographic components model for event-related potentials and some biophysical considerations. IEEE Trans. Biomed. Eng., 6, 482–484. 52. Mu, C., Huang, B., Wright, J. & Goldfarb, D. (2014) Square deal: lower bounds and improved relaxations for tensor recovery. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 73–81. 53. Negahban, S. & Wainwright, M. J. (2012) Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res., 13, 1665–1697. 54. Nion, D. & Sidiropoulos, N. D. (2010) Tensor algebra and multidimensional harmonic retrieval in signal processing for MIMO radar. IEEE Trans. Signal Process., 58, 5693–5705. 55. Pisier, G. (1999) The Volume of Convex Bodies and Banach Space Geometry, vol. 94. Cambridge University Press. 56. Rashtchian, C. (2016) Bounded matrix rigidity and John’s theorem. Electronic Colloquium on Computational Complexity (ECCC), vol. 23, p. 93. 57. Recht, B. (2011) A simpler approach to matrix completion. J. Mach. Learn. Res., 12, 3413–3430. 58. Schatten, R. (1985) A Theory of Cross-Spaces, No. 26. Princeton University Press. 59. Schmidt, M. W., Van Den Berg, E., Friedlander, M. P. & Murphy, K. P. (2009) Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. AISTATS, vol. 5, p. 2009. 60. Shashua, A. & Levin, A. (2001) Linear image coding for regression and classification using the tensor-rank principle. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, pp. 1–42. 61. Shen, J., Xu, H. & Li, P. (2014) Online optimization for max-norm regularization. Advances in Neural Information Processing Systems, pp. 1718–1726. 62. Signoretto, M., De Lathauwer, L. & Suykens, J. A. (2010) Nuclear norms for tensors and their use for convex multilinear estimation. Linear Algebra and Its Applications, 43 (in press). 63. Srebro, N. (2004) Learning with matrix factorizations. Ph.D. Thesis. 64. Srebro, N., Rennie, J. & Jaakkola, T. S. (2005) Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, pp. 1329–1336. 65.
Srebro, N. & Shraibman, A. (2005) Rank, trace-norm and max-norm. International Conference on Computational Learning Theory. Springer, pp. 545–560. 66. Srebro, N., Sridharan, K. & Tewari, A. (2010) Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896. 67. Tomioka, R., Hayashi, K. & Kashima, H. (2010) Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789. 68. Tonge, A. (1978) The von Neumann inequality for polynomials in several Hilbert–Schmidt operators. J. London Math. Soc., 2, 519–526. 69. Tucker, L. R. (1966) Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311. 70. van Leeuwen, T. & Herrmann, F. J. (2013) Fast waveform inversion without source-encoding. Geophysical Prospecting, 61, 10–19. 71. Xu, Y., Yin, W., Wen, Z. & Zhang, Y. (2012) An alternating direction algorithm for matrix completion with nonnegative factors. Front. Math. China, 7, 365–384. 72. Yang, Y. & Barron, A. (1999) Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27, 1564–1599. 73. Yuan, M. & Zhang, C.-H. (2016) On tensor completion via nuclear norm minimization. Found. Comput. Math., 16, 1031–1068. 74. Zhang, Z. & Aeron, S. (2017) Exact tensor completion using t-SVD. IEEE Trans. Signal Process., 65, 1511–1526. Appendix A. Rademacher complexity A technical tool that we use in the proof of our main results involves data-dependent estimates of the Rademacher and Gaussian complexities of a function class. We refer to [3, 65] for a detailed introduction of these concepts. Definition 31 [9] Let |$\mathbb{P}$| be a probability distribution on a set |$\chi $| and let |$\varOmega =\lbrace X_1,\cdots , X_m\rbrace $| be a set of |$m$| independent samples drawn from |$\chi $| according to |$\mathbb{P}$|. The empirical Rademacher complexity of a class |$\mathbb{F}$| of functions defined from |$\chi $| to |$\mathbb{R}$| is defined as \begin{equation*}\hat{R}_\varOmega(\mathbb{F})=\frac{2}{|\varOmega|}\mathbb{E}_{\varepsilon}\left[\underset{f \in \mathbb{F}}{\sup}\left|\sum_{i=1}^{m}\varepsilon_if(X_i)\right|\right],\end{equation*} where |$\varepsilon _i$| are independent Rademacher random variables. Moreover, the Rademacher complexity with respect to the distribution |$\mathbb{P}$| over a sample |$\varOmega $| of |$|\varOmega |$| points drawn independently according to |$\mathbb{P}$| is defined as the expectation of the empirical Rademacher complexity, i.e., \begin{equation*}R_{|\varOmega|}(\mathbb{F})=\mathbb{E}_{\varOmega\sim \mathbb{P}}[\hat{R}_\varOmega(\mathbb{F})].\end{equation*} Two important properties that will be used in the following lemmas are: first, if |$\mathbb{F} \subset \mathbb{G}$|, then |$\hat{R}_\varOmega (\mathbb{F}) \leq \hat{R}_\varOmega (\mathbb{G})$|; and second, |$\hat{R}_\varOmega (\mathbb{F})=\hat{R}_\varOmega (\operatorname{conv}(\mathbb{F}))$|. Lemma 32 $$\underset{\varOmega :|\varOmega |=m}{\sup } \hat{R}_\varOmega (\mathbb{B}_{M}(1)) < 6 \sqrt{\frac{dN}{m}}.$$ Proof.
By definition, |$\mathbb{B}_{M}(1)=\operatorname{conv}(T_{\pm })$|, and |$T_{\pm }$| is a finite class with |$|T_{\pm }|<2^{dN}$|. Therefore, |$\hat{R}_\varOmega (T_{\pm }) < \sqrt{7\frac{2dN+\log|\varOmega |}{|\varOmega |}}$| [63]. Since |$\hat{R}_\varOmega (\mathbb{B}_{M}(1))=\hat{R}_\varOmega (\operatorname{conv}(T_{\pm }))=\hat{R}_\varOmega (T_{\pm })$| and |$\log |\varOmega | = \log m \leq d \log N \leq dN$| when |$m \leq N^d$|, this gives |$\hat{R}_\varOmega (\mathbb{B}_{M}(1)) < \sqrt{\frac{21dN}{m}} < 6 \sqrt{\frac{dN}{m}}$|, which concludes the proof. Lemma 33 $$\underset{\varOmega :|\varOmega |=m}{\sup } \hat{R}_\varOmega (\mathbb{B}_{\max }^T(1)) < 6 c_1 c_2^d \sqrt{\frac{dN}{m}}.$$ Proof. By Lemma 5, |$\mathbb{B}_{\max }^T(1)$| |$\subset $| |$c_1 c_2^d$| |$\operatorname{conv}(T_{\pm })$| and we have |$\hat{R}_\varOmega (T_{\pm }) < \sqrt{7\frac{2dN+\log|\varOmega |}{|\varOmega |}}$|. Taking the convex hull of this class, using |$|\varOmega |=m\leq N^d$| and scaling by |$c_1 c_2^d$|, we get |$\hat{R}_\varOmega (\mathbb{B}_{\max }^T(1)) \leq 6 c_1 c_2^d \sqrt{\frac{dN}{m}}$|. TI - Near-optimal sample complexity for convex tensor completion JF - Information and Inference: A Journal of the IMA DO - 10.1093/imaiai/iay019 DA - 2019-09-19 UR - https://www.deepdyve.com/lp/oxford-university-press/near-optimal-sample-complexity-for-convex-tensor-completion-q4jNGM2OVj SP - 577 VL - 8 IS - 3 DP - DeepDyve ER -