Add Journal to My Library
Biometrika
, Volume 105 (1) – Mar 1, 2018

18 pages

/lp/ou_press/dual-regression-c4y6HlPIYk

- Publisher
- Oxford University Press
- Copyright
- © 2018 Biometrika Trust
- ISSN
- 0006-3444
- eISSN
- 1464-3510
- D.O.I.
- 10.1093/biomet/asx074
- Publisher site
- See Article on Publisher Site

Summary We propose dual regression as an alternative to quantile regression for the global estimation of conditional distribution functions. Dual regression provides the interpretational power of quantile regression while avoiding the need to repair intersecting conditional quantile surfaces. We introduce a mathematical programming characterization of conditional distribution functions which, in its simplest form, is the dual program of a simultaneous estimator for linear location-scale models, and use it to specify and estimate a flexible class of conditional distribution functions. We present asymptotic theory for the corresponding empirical dual regression process. 1. Introduction Let $$Y$$ be a continuous random variable and let $$X$$ be a random vector. Then the conditional distribution function of $$Y$$ given $$X$$, written $$U=F_{Y\mid X}(Y\mid X)$$, has three properties: $$U$$ is standard uniform, $$U$$ is independent of $$X$$, and $$F_{Y\mid X}(y\mid x)$$ is strictly increasing in $$y$$ for any value $$x$$ of $$X$$. We will refer to these three properties as uniformity, independence and monotonicity. For some specified zero-mean and unit-variance distribution function $$F$$ with support the real line and inverse function $$F^{-1}$$, define $$\varepsilon=F^{-1}\{F_{Y\mid X}(Y\mid X)\}$$. Then $$\varepsilon$$ satisfies independence and monotonicity, has distribution $$F$$, and is transformed to uniformity by taking $$U=F(\varepsilon)$$. If we have a sample of $$n$$ points $$\{(x_{i},y_{i})\}_{i=1}^{n}$$ drawn independently from the joint distribution $$F_{Y\!,\,X}(Y\!,\,X)$$, how might we estimate the $$n$$ values $$\varepsilon_{i}=F^{-1}\{F_{Y\mid X}(y_{i}\mid x_{i})\}$$ using only the requirements that the estimate should display independence and monotonicity and have distribution $$F$$? We explore this by formulating a sequence of mathematical programming problems that embodies these requirements, with each element of this sequence providing an asymptotically valid characterization of an increasingly flexible class of conditional distribution functions. The term dual is motivated by the observation that the estimation of a conditional distribution function $$F_{Y\mid X}$$ indexed by a parameter $$\theta$$ is usually formulated in terms of a procedure that obtains $$\theta$$ directly and then characterizes $$F_{Y\mid X}$$ as a by-product. A classical example is the linear location-shift model $$F_{Y\mid X}(y_{i}\mid x_{i})=F\{(y_{i}-\beta^{{ \mathrm{\scriptscriptstyle T} }} x_{i})/\sigma\}$$, for which the parameter vector $$\theta=(\beta,\sigma)^{{ \mathrm{\scriptscriptstyle T} }}$$ needs to be estimated in order to obtain the $$n$$ values $$\varepsilon_{i}=(y_{i}-\beta^{{ \mathrm{\scriptscriptstyle T} }} x_{i})/\sigma$$. Here we turn that process around, obtaining $$\varepsilon_{i}$$ first from a mathematical programming problem and finding $$\theta$$ afterwards, if at all. In its simplest form, dual regression augments median regression dual programming (Koenker & Bassett, 1978) with second-moment orthogonality constraints, while expanding the support of parameter values from the unit interval to the real line. Adding further orthogonality constraints gives rise to a sequence of augmented, generalized dual regression programs. Although each of these seeks only to find the $$\varepsilon_{i}=F^{-1}(u_{i})$$, their first-order conditions show that the assignment of these $$n$$ values corresponds to a sequence of augmented location-scale representations, the simplest of which is a linear heteroscedastic model. Moreover, their second-order conditions are equivalent to monotonicity, so that optimal dual regression solutions satisfy this property. To each element of the sequence of dual programs corresponds a convex primal problem, both nontrivial to determine and difficult to implement, the convexity of which guarantees uniqueness of optimal dual regression solutions. For a given specification of $$F_{Y\mid X}(Y\mid X)$$, the first-order conditions of the corresponding primal problem also describe necessary and sufficient conditions for independence of the associated dual solutions. Thus our dual formulation reveals a sequence of convex optimization problems, gives a feasible and direct implementation of each of them, and uniquely characterizes the family of associated globally monotone representations, which can then be used as complete estimates of a flexible class of conditional distribution functions. 2. Basics 2.1. Dual regression We introduce the principles underlying our method by providing a new characterization of the conditional distribution function $$F_{Y\mid X}(Y\mid X)$$ associated with the linear location-scale model \begin{equation} Y=\beta_{1}^{{ \mathrm{\scriptscriptstyle T} }} X+(\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }} X)\varepsilon,\quad\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }} X>0,\quad\varepsilon \mid X\sim F, \end{equation} (1) where $$X$$ is a $$K\times1$$ vector of explanatory variables including an intercept, and $$F$$ is a zero-mean and unit-variance cumulative distribution function over the real line. Suppose that we observe $$n$$ identically and independently distributed realizations $$\{(y_{i},x_{i})\}_{i=1}^{n}$$ generated according to model (1). The primary population target of our analysis is \[ \varepsilon_{i}=\frac{y_{i}-\beta_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}{\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}=F^{-1}\{F_{Y\mid X}(y_{i}\mid x_{i})\} \quad(i=1,\ldots,n), \] knowledge of which is equivalent to knowledge of the $$n$$ values $$F_{Y\mid X}(y_{i}\mid x_{i})$$, up to the monotone transformation $$F$$. Let $$\lambda=(\lambda_{1},\lambda_{2})^{{ \mathrm{\scriptscriptstyle T} }}\in\mathbb{R}^{2\times K}$$ and $$e_{o}\in\mathbb{R}^{n}$$ satisfy the system of $$n$$ equations and $$n$$ inequality constraints \begin{equation} y_{i}=\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}+(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{oi},\quad\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}>0 \quad(i=1,\ldots,n), \end{equation} (2) where $$e_{o}$$ further satisfies the $$2\times K$$ orthogonality conditions $$\sum_{i=1}^{n}x_{i}e_{oi}=0$$ and $$\sum_{i=1}^{n}x_{i}(e_{oi}^{2}-1)=0$$. Since $$x_{i}$$ includes an intercept, the sample moments of $$e_{o}$$ and $$e_{o}^{2}$$ are $$0$$ and $$1$$, and $$e_{o}$$ and $$e_{o}^{2}$$ are orthogonal to each column of the $$n\times K$$ matrix $$(x_{1},\ldots,x_{n})^{{ \mathrm{\scriptscriptstyle T} }}$$ of explanatory variables. We propose a characterization of the sequence of vectors $$e_{o}$$ that satisfy representation (2) and the associated orthogonality constraints for each $$n$$. The corresponding sequence of empirical distribution functions then provides an asymptotically valid characterization of the conditional distribution function $$F_{Y\mid X}(Y\mid X)$$ corresponding to the data-generating process (1). As a by-product of this approach, we simultaneously obtain a characterization of the parameter vector $$\lambda$$ in (2), which then provides a consistent estimator of the population parameter $$\beta$$ in (1). For each $$x_{i}$$, with the scale function $$\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i} > 0$$, $$y_{i}$$ is an increasing function of $$e_{oi}$$, and to representation (2) corresponds a convex function \[ C(x_{i},e_{oi},\lambda)=\int_{0}^{e_{oi}}\{\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}+(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})s\}\,{\rm d} s=(\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{oi}+\frac{1}{2}(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{oi}^{2} \quad(e_{oi}\in\mathbb{R}), \] whose quadratic form corresponds to a location-scale representation for $$F_{Y\mid X}(Y\mid X)$$. Letting $$y$$ be the $$n\times1$$ vector of dependent variable values, and assuming knowledge of $$\lambda$$ and $$e_{o}$$, we consider assigning a value $$e_{i}$$ to each observation in the sample by maximizing the correlation between $$y$$ and $$e=(e_{1},\ldots,e_{n})^{{ \mathrm{\scriptscriptstyle T} }}$$ subject to a constraint that embodies the properties of $$e_{o}$$: \begin{equation} \max_{e\in\mathbb{R}^{n}} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}e : \sum_{i=1}^{n}C(x_{i},e_{i},\lambda)=\sum_{i=1}^{n}C(x_{i},e_{oi},\lambda)\right\}\!\text{.} \end{equation} (3) Problem (3) describes the assignment of $$e$$ values to $$y$$ values in a sample generated according to a location-scale model, and it admits $$e=e_{o}$$ as its only solution. Since $$e_{o}$$ and $$\lambda$$ are unknown, the assignment problem (3) is infeasible; we therefore introduce the equivalent, feasible formulation \begin{equation} \max_{e\in\mathbb{R}^{n}}\left\{ y^{{ \mathrm{\scriptscriptstyle T} }}e : \sum_{i=1}^{n}x_{i}e_{i} =0, \; \frac{1}{2}\sum_{i=1}^{n}x_{i}(e_{i}^{2}-1) =0\right\}\!, \end{equation} (4) i.e., the dual regression program. 2.2. Solving the dual program The solution to (4) is easily found from the Lagrangian \[ \mathscr{L}=\sum_{i=1}^{n}y_{i}e_{i}-\lambda_{1}\sum_{i=1}^{n}x_{i}e_{i}-\frac{1}{2}\lambda_{2}\sum_{i=1}^{n}x_{i}(e_{i}^{2}-1)\text{.} \] Differentiating with respect to $$e_{i}$$ gives first-order conditions \[ \frac{\partial\mathscr{L}}{\partial e_{i}}=y_{i}-\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}-(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{i}=0 \quad(i=1,\ldots,n)\text{.} \] Upon rearranging we obtain the closed-form solution \begin{equation*} e_{i}=\frac{y_{i}-\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}{\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}} \quad (i=1,\ldots,n), \end{equation*} which is of the location-scale form $$e_{i}=\{y_{i}-\mu(x_{i})\}/\sigma(x_{i})$$, with $$\mu(x_{i})$$ and $$\sigma(x_{i})$$ linear in $$x_{i}$$. Another view is obtained by writing the first-order conditions as \begin{equation} y_{i}=\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}+(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{i} \quad(i=1,\ldots,n), \end{equation} (5) a linear location-scale representation, with corresponding quantile regression representation \begin{equation} y_{i}=(\lambda_{1}+\lambda_{2}e_{i})^{{ \mathrm{\scriptscriptstyle T} }} x_{i}=\{\lambda_{1}+\lambda_{2}F_{n}^{-1}(u_{i})\}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}\equiv\beta(u_{i})^{{ \mathrm{\scriptscriptstyle T} }} x_{i}\quad(i=1,\ldots,n)\text{.} \end{equation} (6) Program (4) thus provides a complete characterization of linear representations of the form (5) and (6), as they arise from its first-order conditions. Moreover, the parameters of these representations are the Lagrange multipliers $$\lambda_{1}$$ and $$\lambda_{2}$$ of an optimization problem with solution $$e=e_{o}$$. The quantile regression representation of the first-order conditions of (4) sheds additional light on the monotonicity of dual regression solutions, when there are no repeated $$X$$ values. For $$u,u'\in(0,1)$$ with $$u'>u$$, the no-crossing property of conditional quantiles requires that \[ \beta(u')^{{ \mathrm{\scriptscriptstyle T} }} x_{i}-\beta(u)^{{ \mathrm{\scriptscriptstyle T} }} x_{i}>0\quad(i=1,\ldots,n), \] which is satisfied if $$\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}$$ is strictly positive for each $$i$$, and coincides with the $$n$$ second-order conditions of program (4): \[ \frac{\partial^{2}\mathscr{L}}{\partial e^{2}_{i} }=-\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}<0\quad(i=1,\ldots,n)\text{.} \] Therefore, an optimal $$e$$ solution that violates the monotonicity property is ruled out by the requirement that for an observation with $$X$$ value $$x_{i}$$, the ordering of the $$Y$$ values $$\beta(u')^{{ \mathrm{\scriptscriptstyle T} }} x_{i}$$ and $$\beta(u)^{{ \mathrm{\scriptscriptstyle T} }} x_{i}$$ must correspond to the ordering of the $$u$$ values. Hence the correlation criterion of system (4) suffices to impose monotonicity, with optimality of a solution then being equivalent to monotonicity at the $$n$$ sample points. Dual regression thus incorporates this property in the estimation procedure, which facilitates extrapolation beyond the empirical support of $$X$$ and yields finite-sample improvements in the estimation of conditional quantile functions, as illustrated in § 5.2. 2.3. Formal duality By Lagrangian duality arguments (Boyd & Vandenberghe, 2004, Ch. 5), the objective function of the dual of problem (4) is \[ Q_{n}(\lambda)=\sup_{e\in\mathbb{R}^{n}}y^{{ \mathrm{\scriptscriptstyle T} }}e-\sum_{i=1}^{n}\big\{ C(x_{i},e_{i},\lambda)-C(x_{i},e_{oi},\lambda)\big\} , \] defined for all $$\lambda\in\Lambda_{0}$$, where $$\Lambda_{0}=\Lambda_{1}\times\Lambda_{2}$$ with $$\Lambda_{1}=\mathbb{R}^{K}$$ and $$\Lambda_{2}=\{\lambda_{2}\in\mathbb{R}^{K}:\inf_{i\leqslant n}\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}>0\}$$. Under the conditions of Theorem 1 below, $$Q_{n}(\lambda)$$ has a closed-form expression and is strictly convex over $$\Lambda_{0}$$, and minimizing $$Q_{n}(\lambda)$$ over $$\Lambda_{0}$$ is equivalent to solving (4). Given a vector $$\omega\in\mathbb{R}^{n}$$, we let $$\textrm{diag}(\omega_{i})$$ denote the $$n\times n$$ diagonal matrix with diagonal elements $$\omega_{1},\ldots,\omega_{n}$$. Condition 1. The random variable $$Y$$ is continuously distributed conditional on $$X$$, with conditional density $$f_{Y\mid X}(y\mid X)$$ bounded away from zero. Condition 2. For a specified vector $$\omega\in\mathbb{R}^{n}$$, the matrix $$\textrm{diag}(\omega_{i})$$ is nonsingular and the matrix $$\sum_{i=1}^{n}\omega_{i}^{-1}x_{i}x_{i}^{{ \mathrm{\scriptscriptstyle T} }}=M_{n}$$ is finite, positive definite, and of rank $$K$$. Condition 3. There exists $$(\lambda,e_{o})\in \Lambda_{0}\times\mathbb{R}^{n}$$ such that $$y_{i}=\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}+(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i})e_{oi}$$ with $$\inf_{i\leqslant n}\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}\geqslant\tau$$ for some constant $$\tau>0$$, and $$\sum_{i=1}^{n}x_{i}e_{oi}=0$$ and $$\sum_{i=1}^{n}x_{i}(e_{oi}^{2}-1)=0$$. Theorem 1 summarizes our finite-sample analysis of dual regression. The proofs of all our formal results are given in the Supplementary Material. Theorem 1. If Conditions 1–3 hold with $$\omega = (\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{1}, \ldots, \lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{n})$$ for all $$\lambda_{2}\in\Lambda_{2}$$, then problem (3) admits the equivalent feasible formulation (4), with solution and multipliers $$e^{*}$$ and $$\lambda^{*}$$, respectively. Moreover, for program (4) the following hold. (i) Primal problem: the dual of (4) is \begin{equation} \min_{\lambda\in\Lambda_{0}}\sum_{i=1}^{n}\frac{1}{2}\left\{\left( \frac{y_{i}-\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}{\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}\right)^{2}+1\right\} \bigl(\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}\bigr), \end{equation} (7)the primal dual regression problem, with solution $$\lambda_{n}$$. (ii) First-order conditions: program (4) admits the method-of-moments representation \begin{equation} \sum_{i=1}^{n}x_{i}\left(\frac{y_{i}-\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}{\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}} \right) = 0,\quad \frac{1}{2}\sum_{i=1}^{n}x_{i}\left\{ \left( \frac{y_{i}-\lambda_{1}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}}{\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}} \right)^{2}-1\right\} = 0,\end{equation} (8)the first-order conditions of (7). (iii) With probability $$1$$ we have: (a) uniqueness, i.e., the pair $$(\lambda_{n},e^{*})$$ is the unique optimal solution to (7) and (4), and$$\lambda_{n}=\lambda^{*}$$; (b) strong duality, i.e., the value of (4) equals the value of (7). Theorem 1 establishes formal duality of our initial assignment problem under first- and second-moment orthogonality constraints and the global $$M$$-estimation problem (7). Convexity of (7) guarantees that to a unique assignment of $$e$$ values corresponds a unique linear representation of the form (2). Uniqueness further implies that if $$e_{o}$$ satisfies independence, then the orthogonality conditions in (8) are both necessary and sufficient for the dual solution $$e^{*}$$ to satisfy independence. The primal problem (7) is a locally heteroscedastic generalization of a simultaneous location-scale estimator proposed by Huber (1981) and further analysed in Owen (2001). The linear heteroscedastic model of equation (5) has previously arisen in the quantile regression literature; see Koenker & Zhao (1994) and He (1997). The former considered the efficient estimation of (5) via $$L$$-estimation, while the latter developed a restricted quantile regression method that prevents quantile crossing. Compared with these quantile-based methods, dual regression trades local estimation and the convenient linear programming formulation of quantile regression for simultaneous estimation of location and scale parameters. 2.4. Connection with the dual formulation of quantile regression The dual problem of the linear $$0{\cdot}5$$ quantile regression of $$Y$$ on $$X$$ is (cf. Koenker, 2005, equation (3.12)) \begin{equation} \max_{u} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}u : \sum_{i=1}^{n}x_{i}\left(u_{i}-\frac{1}{2}\right)=0,\; u\in[0,1]^{n}\right\}\!\text{.} \end{equation} (9) The solution to problem (9) produces values of $$u$$ that are mostly 0 and 1, with $$K$$ sample points being assigned $$u$$ values that are neither 0 nor 1. The points that are assigned 1 fall above the median quantile regression, the points assigned 0 fall below, and the remaining points fall on the median quantile regression plane. One direction of extension of (9) is to replace the 1/2 with values $$\alpha$$ that fall between 0 and 1 to obtain the $$\alpha$$ quantile regression. Another extension is to augment problem (9) by adding $$K$$ more constraints: \begin{equation} \max_{u}\left\{y^{{ \mathrm{\scriptscriptstyle T} }}u : \sum_{i=1}^{n}x_{i}\left(u_{i}-\frac{1}{2}\right) =0, \;\sum_{i=1}^{n}x_{i}\left(u_{i}^{2}-\frac{1}{3}\right) =0, \; u\in[0,1]^{n}\right\}\!\text{.} \end{equation} (10) The solution to (9) does not satisfy (10): the variance of $$u$$ around 0 in the solution to (9) is approximately $$1/2$$, not $$1/3$$. To satisfy program (10), the $$u$$ values have to be moved off $$0$$ and $$1$$. Since $$x_{i}$$ contains an intercept, the sample moments of $$u$$ and $$u^{2}$$ will be $$1/2$$ and $$1/3$$; $$u$$ and $$u^{2}$$ will be orthogonal to the columns of the matrix $$(x_{1},\ldots,x_{n})^{{ \mathrm{\scriptscriptstyle T} }}$$, relations that are necessary but not sufficient for uniformity and independence. The systems (9) and (10) both demand monotonicity by maximally correlating $$y$$ and $$u$$. A violation of monotonicity requires there to be two observations that share the same $$X$$ values but have different $$y$$ values, with the lower of the two $$y$$ values having the weakly higher value of $$u$$. However, a solution characterized by such a violation could be improved upon by exchanging the $$u$$ assignments. Thus violation of monotonicity in program (9) arises because the set of admissible exchanges in $$u$$ assignments is overly restricted: (9) is dual to a linear program that is well known to have solutions at which $$K$$ observations are interpolated when $$K$$ parameters are being estimated, i.e., the hyperplanes obtained by regression quantiles must interpolate $$K$$ observations. By reformulating program (10) into a constrained optimization problem over $$\mathbb{R}^{n}$$, program (4) further expands the set of admissible exchanges in $$u$$ assignments, since $$u$$ is restricted to $$[0,1]^{n}$$. Upon doing this, the problem corresponding to (10) becomes the dual regression program (4), where $$e$$ can take on any real value. It is then natural to take $$u_{i}^{*}=F_{n}(e_{i}^{*}),$$ the empirical cumulative distribution function of the dual regression solution $$e^{*}$$, thereby imposing uniformity to high precision even at small $$n$$. 3. Generalization 3.1. Infeasible generalized dual regression The dual regression characterization of location-scale conditional distribution functions via the monotonicity element, the objective, and the independence element, the constraints, can be exploited to characterize more flexible representations. Similarly to the approach introduced in § 2, we first analyse the infeasible assignment problem for a general representation of the stochastic structure of $$Y$$ conditional on $$X$$: \begin{equation} Y=H(X,\varepsilon)\equiv H_{X}(\varepsilon),\quad\varepsilon \mid X\sim F, \end{equation} (11) where $$F$$ is a specified cumulative distribution function whose support is the real line, and for each value $$x$$ of $$X$$ the derivative $$H_{x}'(\varepsilon)$$ of $$H_{x}(\varepsilon)$$ is strictly positive. The representation (11) always exists with $$H_{x}$$ defined as the composition of the conditional quantile function of $$Y$$ given $$X=x$$ and the distribution function $$F$$. To each monotone function $$H_{x}$$ also corresponds a convex function $$\tilde{H}_{x}$$ defined as \[ \tilde{H}_{x}(e)\equiv\int_{0}^{e}H_{x}(s)\,{\rm d} s \quad(e\in\mathbb{R})\text{.} \] The monotonicity of $$H_{x}(\varepsilon)$$ guarantees the convexity of $$\tilde{H}_{x}(\varepsilon)$$. The slope of this function gives the value of $$Y$$ corresponding to a value $$e$$ of $$\varepsilon$$ at $$X=x$$. Thus $$F_{Y\mid X}(Y\mid X)$$ corresponds to a collection of convex functions, with one element of this collection for each value of $$X,$$ together with a single random variable whose distribution is common to all the convex functions. Equipped with $$\tilde{H}_{X}$$, suppose we are tasked with assigning a value $$e_{i}$$ to each of the $$n$$ realizations $$\{(y_{i},x_{i})\}_{i=1}^{n}$$. Then, for $$S_{n}=\sum_{i=1}^{n}\tilde{H}_{x_{i}}(\varepsilon_{i})$$, solving the infeasible problem \begin{equation} \max_{e\in\mathbb{R}^{n}} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}e : \sum_{i=1}^{n}\tilde{H}_{x_{i}}(e_{i})=S_{n}\right\} \end{equation} (12) generates the correct $$y$$-$$e$$ assignment: taking the Lagrangian \[ \mathscr{L}=y^{{ \mathrm{\scriptscriptstyle T} }}e-\Lambda\left\{ \sum_{i=1}^{n}\tilde{H}_{x_{i}}(e_{i})-S_{n}\right\}\!, \] the $$n$$ associated first-order conditions are \begin{equation} \frac{\partial\mathscr{L}}{\partial e_{i}}=y_{i}-\Lambda H_{x_{i}}(e_{i})=0 \quad(i=1,\ldots,n), \end{equation} (13) and convexity of $$\tilde{H}_{x_{i}}$$ guarantees that (13) is uniquely satisfied by $$(\Lambda,e)=(1,\varepsilon_{o})$$, with $$\varepsilon_{o}=(\varepsilon_{1},\ldots,\varepsilon_{n})^{{ \mathrm{\scriptscriptstyle T} }}$$. This demonstrates that maximizing $$y^{{ \mathrm{\scriptscriptstyle T} }}e$$ generally suffices to match the $$e$$ values to the $$y$$ values, regardless of the form of $$H_{X}$$ in (11). Theorem 2. Suppose that (11) holds with $$H_{x_{i}}:\mathbb{R}\rightarrow\mathbb{R}$$ a continuously differentiable function that satisfies $$\inf_{e\in\mathbb{R}}H'_{x_{i}}(e)\geqslant\tau$$ for each $$x_{i}$$ and some constant $$\tau>0$$. Then the infeasible generalized dual regression problem (12) with $$S_{n}=\sum_{i=1}^{n}\tilde{H}_{x_{i}}(\varepsilon_{i})$$ generates the correct $$y$$-$$e$$ assignment, i.e., the pair $$(\Lambda,e)=(1,\varepsilon_{o})$$ uniquely solves the first-order conditions (13). Theorem 2 shows that problem (12) fully characterizes the $$y$$-$$e$$ assignment problem: given $$\tilde{H}_{x_{i}}$$ and $$S_{n}$$, solving (12) assigns a value $$e_{i}$$ to each sample point $$(y_{i},x_{i})$$, and this value is the corresponding value $$F_{Y\mid X}(y_{i}\mid x_{i})$$ up to a specified transformation $$F$$. If $$F$$ is specified to be a known distribution, the $$n$$ values $$F_{Y\mid X}(y_{i}\mid x_{i})$$ are then also known. If $$F$$ is specified to be an unknown distribution, as in our application below, the empirical distribution of $$\varepsilon_{o}$$ then provides an asymptotically valid estimator for $$F$$. Knowledge of $$\tilde{H}_{x_{i}}$$ and $$S_{n}$$ can thus be incorporated into a mathematical programming problem which delivers the values of $$F_{Y\mid X}$$ at the $$n$$ sample points. 3.2. Generalized dual regression representations$$:$$ definition and characterization Problem (12) is infeasible because neither $$\tilde{H}_{x_{i}}$$ nor $$S_{n}$$ is known. However, Theorem 2 motivates a feasible approach once $$H_{X}$$ and $$F$$ are specified. Denote the components of $$X$$ without the intercept by $$\tilde{X}$$, so that $$X=(1,\tilde{X})^{{ \mathrm{\scriptscriptstyle T} }}$$. Without loss of generality, let $$\tilde{X}$$ be centred, denoted by $$\tilde{X}^{\rm c}$$, and let $$X^{\rm c}=(1,\tilde{X}^{\rm c})^{{ \mathrm{\scriptscriptstyle T} }}$$. With $$h_{1}(\varepsilon)=1$$ and $$h_{2}(\varepsilon)=\varepsilon$$, we specify $$H_{X}$$ by a linear combination of $$J$$ basis functions $$h(\varepsilon)=\{h_{1}(\varepsilon),\ldots,h_{J}(\varepsilon)\}^{{ \mathrm{\scriptscriptstyle T} }}$$, the coefficients of which depend on $$X$$, \begin{equation} H_{X}(\varepsilon)=\sum_{j=1}^{J}\beta_{j}(X)h_{j}(\varepsilon), \end{equation} (14) and we assume that $$H_{X}$$ is linear in $$X$$ and set \begin{equation} \beta_{j}(X)=\alpha_{j}+\beta_{j}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c} \quad(j=1,\ldots,J)\text{.} \end{equation} (15) Finally, we specify a zero-mean and unit-variance distribution for $$\varepsilon$$ by imposing $$E(\varepsilon)=0$$ and $$E\{(\varepsilon^{2}-1)/2\}=0$$ and setting $$\alpha_{j}=0$$ for $$j=3,\ldots,J$$ in (15). With $$\alpha_{2}+\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c}>0$$, our normalization and (14) and (15) together yield the augmented, generalized dual regression model \begin{equation} Y=\alpha_{1}+\alpha_{2}\varepsilon+\beta_{1}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c}+(\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c})\varepsilon+\sum_{j=3}^{J}(\beta_{j}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c})h_{j}(\varepsilon)\equiv H_{X}(\varepsilon;\alpha,\beta),\quad\varepsilon\mid X\sim F\text{.} \end{equation} (16) Model (16) admits the following interpretation. When $$\tilde{X}^{\rm c}=0,$$$$\:Y=\alpha_{1}+\alpha_{2}\varepsilon$$ and $$\varepsilon=(Y-\alpha_{1})/\alpha_{2}$$, so that $$\varepsilon$$ is just a rescaled version of the distribution of $$Y$$ at $$\tilde{X}^{\rm c}=0$$. Since $$\varepsilon$$ is independent of $$X$$, transformations of this shape of $$\varepsilon$$ must suffice to produce $$Y$$ at other values of $$X$$. The first two transformations, $$\beta_{1}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c}$$ and $$(\beta_{2}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c})\varepsilon$$, are translations of location and scale which do not essentially affect the shape of $$Y$$’s response to changes in $$\varepsilon$$ at all. The additional terms $$(\beta_{j}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{X}^{\rm c})h_{j}(\varepsilon)$$ achieve that end. Suppose that we observe a sample of $$n$$ identically and independently distributed realizations $$\{(y_{i},x_{i})\}_{i=1}^{n}$$ generated according to model (16). Define $$x_{ij}^{\rm c}=x_{i}^{\rm c}$$ for $$j=1,2$$ and $$x_{ij}^{\rm c}=\tilde{x}_{i}^{\rm c}$$ for $$j=3,\ldots,J$$, and let $$(\gamma,\lambda)\in\mathbb{R}^{2+J(K-1)}$$ and $$e_{o}\in\mathbb{R}^{n}$$ satisfy the system of $$n$$ equations and $$2n$$ inequality constraints \begin{equation} y_{i}=H_{x_{i}}(e_{oi};\gamma,\lambda), \quad \gamma_{2}+\lambda_{2}^{{ \mathrm{\scriptscriptstyle T} }}\tilde{x}_{i}^{\rm c}>0, \quad H'_{x_{i}}(e_{oi};\gamma,\lambda)>0 \quad (i=1,\ldots,n), \end{equation} (17) where $$e_{o}$$ further satisfies $$\sum_{i=1}^{n}x_{ij}^{\rm c}\tilde{h}_{j}(e_{oi})=0$$$$(j\,{=}\,1,\ldots,J)$$, with $$\tilde{h}_{1}(e_{oi})=e_{oi}$$, $$\tilde{h}_{2}(e_{oi})=$$$$(e_{oi}^{2}-1)/2$$ and $$\tilde{h}_{j}(e_{oi})=\int_{0}^{e_{oi}}h_{j}(s)\,{\rm d} s$$$$(j=3,\ldots,J)$$. These relations reduce to the linear heteroscedastic representation of § 2 for $$J=2$$, and require that $$e_0$$ be a zero-mean and unit-variance vector satisfying the augmented set of orthogonality conditions $$\sum_{i=1}^{n}\tilde{x}_{i}^{\rm c}\tilde{h}_{j}(e_{oi})=0$$$$(j=1,\ldots,J)$$. The sequence of vectors $$e_{o}$$ that satisfies the generalized dual regression representation (17) as well as the associated orthogonality constraints for each $$n$$ then provides an asymptotically valid characterization of the data-generating process (16). Each element of this sequence is characterized by the assignment problem \begin{equation} \max_{e\in\mathbb{R}^{n}} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}e : \sum_{i=1}^{n}\tilde{H}_{x_{i}}(e_{i};\theta)=\sum_{i=1}^{n}\tilde{H}_{x_{i}}(e_{oi};\theta)\right\}\!, \end{equation} (18) where $$\tilde{H}_{x_{i}}(e_{i};\theta)=\int_{0}^{e_{i}}H_{x_{i}}(s;\theta)\,{\rm d} s$$ and $$\theta=(\theta_{1},\ldots,\theta_{J})^{{ \mathrm{\scriptscriptstyle T} }}$$, with $$\theta_{j}=(\gamma_{j},\lambda_{j})^{{ \mathrm{\scriptscriptstyle T} }}\in\mathbb{R}^{K}$$ for $$j=1,2$$ and $$\theta_{j}=\lambda_{j}\in\mathbb{R}^{K-1}$$ for $$j=3,\ldots,J$$. Since $$e_{0}$$ and $$\theta$$ are unknown, problem (18) is infeasible. We therefore formulate an equivalent, feasible implementation of problem (12): \begin{equation} \max_{e\in\mathbb{R}^{n}} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}e : \sum_{i=1}^{n}x_{ij}^{\rm c}\tilde{h}_{j}(e_{i})=0 \;\; (j=1,\ldots,J) \right\}\!, \end{equation} (19) the generalized dual regression program; (19) uniquely characterizes representation (17). In order to state the properties of (19) formally, we define the parameter space $$\Theta_{n}$$, which specifies parameter values compatible with monotone representations: \begin{align*} \Theta_{n}=\Bigl\{ \theta\in\Theta_{0,n}: \text{there exists } e\in\mathbb{R}^{n} &\text{ such that }\, y_{i}=H_{x_{i}}(e_{i};\theta)\\[-5pt] & \text{ and }\inf_{e\in\mathbb{R}}H'_{x_{i}}(e;\theta)>0\;\,(i=1,\ldots,n)\Bigr\}, \end{align*} where $$\Theta_{0,n}=\{\theta\in\mathbb{R}^{2+J(K-1)}:\;\inf_{i\leqslant n}\theta_{2}^{{ \mathrm{\scriptscriptstyle T} }} x_{i}^{\rm c}>0\}$$. For $$\theta\in\Theta_{n}$$, let $$e(y_{i},x_{i},\theta)$$ denote the inverse function of $$H_{x_{i}}(e_{i};\theta)$$, which is well-defined for each $$x_{i}$$. We assume that the basis functions $$h$$ and the pair $$(\theta,e_{0})$$ satisfy the following conditions. Condition 4. There exists a finite constant $$C_{h}$$ such that $$\max_{j=3,\ldots,J}\sup_{e\in\mathbb{R}}\{|h_{j}(e)|+|\tilde{h}_{j}(e)|\}\leqslant C_{h}$$, and the matrix $$E[h\{e(Y,X,\theta)\}h\{e(Y,X,\theta)\}^{{ \mathrm{\scriptscriptstyle T} }} \mid X=x_{i}]$$ is finite and nonsingular for each $$x_{i}$$ and all $$\theta\in\Theta_{n}$$. Condition 5. There exists $$(\theta,e_{o})\in\Theta_{n}\times\mathbb{R}^{n}$$ such that $$y_{i}=H_{x_{i}}(e_{oi};\theta)$$ and $$\inf_{e\in\mathbb{R}}H'_{x_{i}}(e;\theta)\geqslant\tau$$ for $$i=1,\ldots,n$$ and some constant $$\tau>0$$, and $$e_{o}$$ satisfies $$\sum_{i=1}^{n}x_{ij}^{\rm c}\tilde{h}_{j}(e_{oi})=0$$ for $$j=1,\ldots,J$$. Let $$\phi(\theta)=[H_{x_{1}}'\{e(y_{1},x_{1},\theta);\theta\},\dots,H_{x_{n}}'\{e(y_{n},x_{n},\theta);\theta\}]^{{ \mathrm{\scriptscriptstyle T} }}$$. Theorem 3 summarizes our finite-sample analysis of generalized dual regression. Theorem 3. If Conditions 1, 2, 4 and 5 hold with $$\omega=\phi(\theta)$$ for all $$\theta\in\Theta_{n}$$, then problem (12) admits the equivalent feasible formulation (19), with solution and multipliers $$e^{*}$$ and $$\theta^{*}$$, respectively. Moreover, for program (19) the following hold. (i) Primal problem: the dual of (19) is \begin{equation} \min_{\theta\in\Theta_{n}} \sum_{i=1}^{n}\sum_{j=2}^{J}(\theta_{j}^{{ \mathrm{\scriptscriptstyle T} }} x_{ij}^{\rm c})\Bigl[h_{j}\{e(y_{i},x_{i},\theta)\}e(y_{i},x_{i},\theta)-\tilde{h}_{j}\{e(y_{i},x_{i},\theta)\} \Bigr], \end{equation} (20)the primal generalized dual regression problem, with solution $$\theta_{n}$$. (ii) First-order conditions: program (19) admits the method-of-moments representation \begin{equation} \sum_{i=1}^{n}x_{ij}^{\rm c}\tilde{h}_{j}\{e(y_{i},x_{i},\theta)\}=0 \quad(j=1,\ldots,J), \end{equation} (21)the first-order conditions of (20). (iii) With probability$$1$$we have: (a) uniqueness, i.e., the pair $$(\theta_{n},e^{*})$$ is the unique optimal solution to (20) and (19), and$$\theta_{n}=\theta^{*}$$; (b) strong duality, i.e., the value of (19) equals the value of (20). Problem (19) augments the set of orthogonality constraints in (4) and generates increasingly flexible representations of the form (17). For each element of this sequence, (19) provides a feasible formulation of the generalized $$y$$-$$e$$ assignment problem (12) with optimality condition $$-H'_{x_{i}}(e_{i}^{*};\theta^{*})<0$$ equivalent to monotonicity. Theorem 3 also states the form of the corresponding primal problem, whose convexity guarantees that (19) and (20) uniquely and equivalently characterize representation (16). Uniqueness further implies that if $$e_{o}$$ satisfies independence, then the orthogonality conditions in (21) are both necessary and sufficient for the dual solution $$e^{*}$$ to satisfy independence as well. Theorem 3 thus characterizes and establishes the duality between specification of orthogonality constraints on $$e$$ and specification of a globally monotone representation for $$Y$$ conditional on $$X$$. Formally, (20) is the restriction of the dual of (19) to $$\Theta_{n}$$. The existence condition, Condition 5, and the form of the optimality conditions for (19) together ensure that (19) does not admit a global maximum with associated multipliers outside $$\Theta_{n}$$. Implementing (20) thus requires the imposition of inequality constraints with $$e_{i}$$ only implicitly defined in the specification of (20) for $$J>2$$, and problem (19) therefore provides a greatly simplified dual implementation. The special case of dual regression corresponds to $$J=2$$, and imposing $$\sum_{i=1}^{n}\tilde{h}_{j}(e_{i})=0$$ for $$j=1,2$$ is a normalization. The simple basis $$\{e_{i},(e_{i}^{2}-1)/2\}$$ is obviously impoverished for the space of all convex functions, although quite practical for many applications once the flexibility in the distribution of $$e$$ is taken into account. 3.3. Connection with optimal transport formulation of quantile regression An alternative approach is to specify $$F$$ to be a known distribution and alter representation (16) and the corresponding problem accordingly. If $$F$$ is specified to be the standard uniform distribution, then (10) in § 2.4 can be generalized as \begin{equation} \max_{u\in[0,1]^{n}} \left\{ y^{{ \mathrm{\scriptscriptstyle T} }}u : \frac{1}{j}\sum_{i=1}^{n}x_{i}\left(u_{i}^{j}-\frac{1}{j+1}\right)=0, \;\; j=1,\ldots,J\right\}\text{.} \end{equation} (22) For $$u_{i}\in[0,1]$$, let $$m^{J}(u_{i})=\{m_{J1}(u_{i}),\ldots,m_{JJ}(u_{i})\}^{{ \mathrm{\scriptscriptstyle T} }}$$ with $$m_{Jj}(u_{i})=j^{-1}\{u_{i}^{j}-(\,j+1)^{-1}\}$$. Letting $$\otimes$$ denote the Kronecker product, the large-sample form of program (22) is \begin{equation} \max_{U_{J}\in(0,1)} \bigl\{ E(YU_{J}) : E\{X\otimes m^{J}(U_{J})\}=0\bigr\}\text{.} \end{equation} (23) On increasing $$J$$, both the distributional and the orthogonality constraints are strengthened. Because $$X$$ includes an intercept, the distribution of $$U_{J}$$ approaches uniformity while simultaneously satisfying an increasing sequence of orthogonality constraints. In the limit, a uniformly distributed random variable $$U$$ satisfying the full set of orthogonality constraints is thus specified. Since $$E\{X\otimes m^{J}(U)\}=0$$ for all $$J$$ is equivalent to the mean-independence property $$E(X\mid U)=E(X)$$ and the uniformity constraint $$U\sim {\rm Un}(0,1)$$, in the large-$$J$$ limit program (23) coincides with the scalar quantile regression problem proposed in independent work by Carlier et al. (2016, cf. equation (19), p. 1180), \begin{equation} \max \bigl\{ E(YU) : U\sim {\rm Un}(0,1),\;E(X\mid U)=E(X)\bigr\}, \end{equation} (24) which provides an optimal transport formulation of quantile regression; we are grateful to an anonymous referee for highlighting this connection. Program (24) is directly amenable to a linear programming implementation which maintains and exploits the full specification of the marginal distribution of $$U$$ as a known distribution, whereas (23) provides a sequential nonlinear programming characterization of $$U$$ which relaxes the uniformity constraint for finite $$n$$ and $$J$$. For $$e_{i}\in\mathbb{R}$$, let $$\tilde{h}^{J}(e_{i})=\{\tilde{h}_{1}(e_{i}),\ldots,\tilde{h}_{J}(e_{i})\}^{{ \mathrm{\scriptscriptstyle T} }}$$. The large-sample form of program (19) is \begin{equation} \max_{e_{J}\in\mathbb{R}}\big\{ E(Ye_{J}) : E(e_{J})=0, \; E(e_{J}^{2}-1) =0, \; E\{\tilde{X}^{\rm c}\otimes\tilde{h}^{J}(e_{J})\} =0\big\}\text{.} \end{equation} (25) Program (25) relaxes the support constraint in (22) and only specifies first and second moments of $$e_{J}$$, while the centring of $$X$$ ensures that this is sufficient for $$e_{J}$$ to be uniquely determined. The empirical distribution of solutions of the finite-sample analogue (19) of (25) then provides an asymptotically valid characterization of the distribution of $$e_{J}$$. Letting $$J$$ increase, the orthogonality constraints in (25) are strengthened, and $$e_{J}$$ gets closer and closer to satisfying the mean-independence property $$E(\tilde{X}^{\rm c}\mid e_{J})=0$$. It follows that for $$J$$ large enough, (25) is equivalent to \begin{equation} \max_{e\in\mathbb{R}}\big\{E(Ye) : E(e)=0, \; E(e^{2}-1) =0, \; E(\tilde{X}^{\rm c}\mid e) = 0\big\}, \end{equation} (26) the limiting generalized dual regression problem. Theorem 4 summarizes this discussion. Theorem 4. Assume that $$E(\|X\|^{2})<\infty$$. (i) Suppose that for any $$a(U)$$ with $$E\{a(U)^{2}\}<\infty$$ there are $$J\times1$$ vectors $$\psi_{J}$$ such that as $$J\rightarrow\infty$$, $$\:E[\{a(U)-m^{J}(U)^{{ \mathrm{\scriptscriptstyle T} }}\psi_{J}\}^{2}]\rightarrow0$$. Then programs (23) and (24) are equivalent in the large-$$J$$ limit. (ii) Suppose that for any $$b(e)$$ with $$E\{b(e)^{2}\}<\infty$$ there are $$J\times1$$ vectors $$\psi_{J}$$ such that as $$J\rightarrow\infty$$, $$\:E[\{b(e)-\tilde{h}^{J}(e)^{{ \mathrm{\scriptscriptstyle T} }}\psi_{J}\}^{2}]\rightarrow0$$. Then programs (25) and (26) are equivalent for $$J$$ large enough. 4. Asymptotic properties We apply our framework to the estimation of a $$J$$-term generalized dual regression model of the form (16). Denote the support of $$X$$ by $$\mathcal{X}$$, and for some finite constant $$C_{\theta}$$ define $$\Theta_{0}=\{\theta\in\mathbb{R}^{2+J(K-1)}:\|\theta\|\leqslant C_{\theta}\text{ and }\inf_{x\in\mathcal{X}}\theta_{2}^{{ \mathrm{\scriptscriptstyle T} }} x^{\rm c}>0\}$$. Letting $$\mathcal{C}^{1}(\mathbb{R})$$ denote the space of continuously differentiable functions on $$\mathbb{R}$$, define the space of strictly increasing functions indexed by $$X$$ values, $$\mathcal{M}(X)=\{e_{X}\in\mathcal{C}^{1}(\mathbb{R}):\inf_{y\in\mathbb{R}}e'_{x}(y)>0 \text{ for all }x\in\mathcal{X}\}$$. The large-sample analogue of $$\Theta_{n}$$ is then the space of vectors in $$\Theta_{0}$$ such that there exists a corresponding optimal generalized dual regression representation, \[ \Theta=\left\{ \theta\in\Theta_{0}:\,\text{there exists }e_{X}\in\mathcal{M}(X)\textrm{ with }{\rm pr}[Y=H_{X}\{e_{X}(Y);\theta\}]=1\right\}\!\text{.} \] For any $$\theta\in\Theta$$, denote $$e_{X}$$ in $$\mathcal{M}(X)$$ such that $${\rm pr}[Y=H_{X}\{e_{X}(Y);\theta\}]=1$$ by $$e(Y,X,\theta)$$. Condition 6. For some $$\theta_{0}\in\Theta$$ and some zero-mean and unit-variance cumulative distribution function $$F$$, the representation $$Y=H_{X}(\varepsilon;\theta_{0})$$ holds with probability $$1$$, with $$\varepsilon\sim F$$ and $$E\{\tilde{X}h_{j}(\varepsilon)\}=0$$ for $$j=1,\ldots,J$$, and $$\inf_{e\in\mathbb{R}}H'_{X}(e;\theta_{0})\geqslant\tau$$ for some constant $$\tau>0$$. Condition 7. The matrix $$M_{n}$$ defined in Condition 2 satisfies $$\textrm{lim }n^{-1}M_{n}=M$$, a positive-definite matrix of rank $$K$$, and for all $$\theta\in\Theta$$ the matrix $$E[h\{e(Y,X,\theta)\}h\{e(Y,X,\theta)\}^{{ \mathrm{\scriptscriptstyle T} }}\mid X]$$ is nonsingular. Condition 8. We have (i) $$E(Y^{2})<\infty$$, $$E(\Vert X\Vert ^{4})<\infty$$ and $$E(Y^{2}\Vert X\Vert ^{2})<\infty$$; (ii) $$E(Y^{4})<\infty$$, $$E(\Vert X\Vert ^{6})<\infty$$ and $$E(Y^{4}\Vert X\Vert ^{2})<\infty$$. These conditions are used to establish existence and consistency of dual regression solutions, and Condition 8(ii) is needed for asymptotic normality of estimates of $$\theta_{0}$$. In view of the uniqueness stated in Theorem 3(iii), these properties are shared by $$\theta_{n}$$ and $$\theta^{*}$$, which we write as $$\hat{\theta}$$ for notational simplicity. We also denote both $$e_{i}^{*}$$ and indirect estimates $$e(y_{i},x_{i},\theta_{n})$$, constructed after solving (20), by $$\hat{e}_{i}$$, with empirical distribution function $$F_{n}(e)=n^{-1}\sum_{i=1}^{n}1(\hat{e}_{i}\leqslant e)$$ for $$e\in\mathbb{R}$$. Furthermore, Theorem 3(ii) shows that while the solution $$e^{*}$$ is obtained directly by solving the mathematical program (19), knowledge that the solution obeys representation (17) can be exploited to write estimating equations for $$\hat{\theta}$$ in the form of system (21). The computation of the asymptotic distribution of $$\hat{\theta}$$ follows from this characterization. Theorem 5. If $$\{(y_{i},x_{i})\}_{i=1}^{n}$$ are identically and independently distributed, and Conditions 1, 2, 4 and 6–8 hold with$$\omega=\phi(\theta)$$for all$$\theta\in\Theta_{n}$$, then: (i) there exists $$\hat{\theta}$$ in $$\Theta$$ with probability approaching $$1$$; (ii) $$\hat{\theta}$$ converges in probability to $$\theta_{0}$$; (iii) $$n^{1/2}(\hat{\theta}-\theta_{0})$$ converges in distribution to $$N(0,\Sigma)$$, where $$\Sigma$$ is defined in the Supplementary Material. Knowledge of the statistical properties of $$\hat{\theta}$$ can be used to establish the limiting behaviour of the empirical distribution of $$\hat{e}$$. Define the empirical dual regression process \[ \mathbb{U}_{n}(e)=n^{1/2}\{F_{n}(e)-F(e)\}\quad(e\in\mathbb{R})\text{.} \] Theorem 6 establishes weak convergence of the empirical distribution of $$\hat{e}$$ and the limiting behaviour of $$\mathbb{U}_{n}$$, accounting for its dependence on the distribution of $$n^{1/2}(\hat{\theta}-\theta_{0})$$. Theorem 6. If the conditions of Theorem 5 hold and, uniformly in $$x$$ over $$\mathcal{X}$$, $$\:f_{Y\mid X}(y\mid x)$$ is uniformly continuous in $$y$$, is bounded and, for some finite constant $$C_{f}$$ and all $$\theta\in\Theta_{n}$$, satisfies $$\sup_{e\in\mathbb{R}}e^{2}f_{Y \mid X}\{H_{x}(e;\theta) \mid x\}\leqslant C_{f}$$, then: (i) $$\sup_{e\in\mathbb{R}}|F_{n}(e)-F(e)|$$ converges in probability to zero; (ii) $$\mathbb{U}_{n}$$ converges weakly to a zero-mean Gaussian process $$\mathbb{U}$$ with covariance function defined in the Supplementary Material. Theorems 5 and 6 together establish that the pair $$(\hat{\theta},\hat{e})$$ provides an asymptotically valid characterization of the generalized dual regression representation specified in Condition 6. When $$\varepsilon$$ is independent of $$X$$, Theorem 6 further implies that the empirical distribution of $$\hat{e}$$ provides an asymptotically valid estimator of the conditional distribution of $$Y$$ given $$X$$. For $$u\in(0,1)$$, estimates of the $$X$$ coefficients in quantile regression form can then be constructed as $$\sum_{j=1}^{J}\hat{\lambda}_{j}h_{j}\{F_{n}^{-1}(u)\}$$, exploiting the structure of the conditional quantile function of $$Y$$ given $$X$$ implied by representation (16). Theorem 6 also establishes asymptotic normality of the empirical dual regression process. The form of the covariance function of $$\mathbb{U}$$ reflects the effect of imposing the sample orthogonality constraints in (19) on the empirical distribution of $$e^{*}$$ or, equivalently, the influence of sample variability of parameter estimates $$\theta_{n}$$ on the empirical distribution of $$e(y_{i},x_{i},\theta_{n})$$, as expected from the classical result of Durbin (1973). Theorem 6 can be applied to perform pointwise inference on the conditional distribution function of $$Y$$ conditional on $$X$$. However, uniform inference over regions of the joint support of $$Y$$ and $$X$$ is typically of interest in practice. Several approaches in the presence of nonpivotal limit processes have been considered in the literature (e.g., Koenker & Xiao, 2002; Parker, 2013), including simulation methods (Chernozhukov et al., 2013). Extension of existing results to dual regression is beyond the scope of this paper but is a natural direction for future study of uniform inference on the empirical dual regression process. 5. Engel’s data revisited 5.1. Empirical analysis The classical dataset collected by Engel consists of food expenditure and income measurements for 235 households, and has been studied using quantile regression methods (Koenker, 2005). We illustrate the application of dual regression methods by estimating the relationship between food expenditure and income, with household income as a single regressor and food expenditure as the outcome of interest. We specify the vector of basis functions by means of trigonometric series. Alternative choices such as splines and shape-preserving wavelets (De Vore, 1977; Cosma et al., 2007) could also be considered. To choose $$J$$, we first implement program (19) for $$J=2$$, which we then augment sequentially by adding one pair of cosine and sine bases at a time, up to a representation of order $$J=8$$. At each step, we compute a Schwarz information criterion (Schwarz, 1978) applied to the primal generalized dual regression problem, exploiting the strong duality result of Theorem 3 to compute its value as $$y^{{ \mathrm{\scriptscriptstyle T} }}e^{*}+\{2+J(K-1)\}\log n$$. Our procedure selects the location-scale representation $$J=2$$. In the Supplementary Material, we describe the procedure and report results from the augmented specifications, which show that our results are robust with respect to the number of terms included. To test for the validity of the selected model, a complementary procedure that should be explored in future research is to test for independence of dual regression solutions and explanatory variables. The test for multivariate independence proposed by Genest et al. (2007) would provide an interesting starting point for such a development. All computational procedures can be implemented in the software R (R Development Core Team, 2018) using open source software packages for nonlinear optimization such as Ipopt or Nlopt and their R interface Ipoptr and Nloptr developed by Jelmer Ypma. Quantile regression procedures from the package quantreg have been used to carry out our comparisons. Figure 1 plots the estimated distribution of food expenditure conditional on household income. Estimates $$\{u_{i}^{*}\}_{i=1}^{n}$$, where $$u_{i}^{*}=F_{n}(e_{i}^{*})$$, are used to plot each observation in the $$xyu$$-space with predicted coordinates $$(x_{i},y_{i},u_{i}^{*})$$, and the solid lines give the $$u$$-level sets for a grid of values $$\{0.1,\ldots,0{\cdot}9\}$$. Although nonstandard, this representation relates to standard quantile regression plots since the levels of the distribution function give the conditional quantiles of food expenditure for each value of income; these are the plotted shadow solid lines corresponding, for each $$u$$, to dual regression estimates of conditional quantile functions of food expenditure given household income. Fig. 1. View largeDownload slide Dual regression estimate of the distribution of food expenditure conditional on income: level sets (solid lines) are plotted for a grid of values ranging from $$0{\cdot}1$$ to $$0{\cdot}9$$; the projected shadow level sets represent the respective conditional quantile functions appearing on the $$xy$$-plane. Fig. 1. View largeDownload slide Dual regression estimate of the distribution of food expenditure conditional on income: level sets (solid lines) are plotted for a grid of values ranging from $$0{\cdot}1$$ to $$0{\cdot}9$$; the projected shadow level sets represent the respective conditional quantile functions appearing on the $$xy$$-plane. Figure 1 shows that the predicted conditional distribution function obtained by dual regression indeed has all the desired properties. Of particular interest is the fact that the estimated function is monotone in food expenditure. Also, our estimates satisfy some basic smoothness conditions across probability levels, in the food expenditure values. This feature does not typically characterize estimates of the conditional quantile process obtained by quantile regression methods, as conditional quantile functions are then estimated sequentially and independently of each other. The decreasing slope of the distribution function across values of income provides evidence that the data indeed follow a heteroscedastic generating process. This is the distributional counterpart of quantile functions having increasing slope across probability levels, a feature characterizing the conditional quantile functions on the $$xy$$-plane and which signals increasing dispersion in food expenditure across household income values. Figure 2 displays the more familiar quantile regression plots. The panels show scatterplots of Engel’s data as well as conditional quantile functions obtained by dual and quantile regression methods. The rescaled plots in Fig. 2(b) and (d) highlight some features of the two procedures. The fitted lines obtained from dual regression are not subject to crossing in this example, whereas several of the fitted quantile regression lines cross for small values of household income. Finally, the more evenly-spread dual regression conditional quantile functions illustrate the effect of specifying a functional form for the quantile regression coefficients, while preserving asymmetry in the conditional distribution of food expenditure. Fig. 2. View largeDownload slide Scatterplots and regression estimates of the conditional $$\{0{\cdot}1,0{\cdot}15,\ldots,0{\cdot}9\}$$ quantile functions (solid lines) for Engel’s data: (a) dual regression; (c) quantile regression; and (b),(d) their rescaled counterparts. Fig. 2. View largeDownload slide Scatterplots and regression estimates of the conditional $$\{0{\cdot}1,0{\cdot}15,\ldots,0{\cdot}9\}$$ quantile functions (solid lines) for Engel’s data: (a) dual regression; (c) quantile regression; and (b),(d) their rescaled counterparts. Figure 3 compares our estimates of the intercept and income coefficients in quantile regression form with those from quantile regression. For interpretational purposes, we follow Koenker (2005) and estimate the functional coefficients after having recentred household income. This avoids having to interpret the intercept as food expenditure for households with zero income. After centring, the intercept coefficient can be interpreted as the $$u$$th quantile of food expenditure for households with mean income. Figure 3 plots the estimated quantile regression coefficients against $$u$$. It illustrates the fact that the flexible structure imposed by dual regression yields estimates that are smoother than their quantile regression counterparts, the latter showing somewhat erratic behaviour around our estimates. Fig. 3. View largeDownload slide Engel coefficient plots revisited: dual (solid) and quantile (dashed) regression estimates of the (a) intercept and (b) income coefficients as functions of the quantile index; least squares estimates are also shown (horizontal line). Fig. 3. View largeDownload slide Engel coefficient plots revisited: dual (solid) and quantile (dashed) regression estimates of the (a) intercept and (b) income coefficients as functions of the quantile index; least squares estimates are also shown (horizontal line). 5.2. Simulations We give a brief summary of a Monte Carlo simulation to assess the finite-sample properties of dual regression. The data-generating process is \begin{equation*} y_{i} = \alpha_{1}+\beta_{1}\tilde{x}_{i}+(\alpha_{2}+\beta_{2}\tilde{x}_{i})\varepsilon_{i},\quad\varepsilon_{i}\sim N(0,1), \end{equation*} with parameter values calibrated to the empirical application, from which 4999 samples are simulated. As a benchmark, we compare generalized dual regression estimates of the values $$F_{Y\mid X}(y_{i}\mid x_{i})$$$$(i=1,\ldots,n)$$ with those obtained by applying the inversion procedure of Chernozhukov et al. (2010) to the quantile regression process. For each simulation, the estimation and selection procedures are identical to those implemented in the empirical application. Table 1 shows results on the accuracy of conditional distribution function estimates. We report average estimation errors across simulations of dual regression and quantile regression estimators, along with their ratio in percentage terms. Estimation errors are measured in $$L^{p}$$-norms $$\Vert \cdot\Vert _{p}$$ for $$p=1,2$$ and $$\infty$$, where $$\:\Vert f\Vert _{p}=\{ \int_{\mathbb{R}}|f(s)|^{p}\,{\rm d} s\} ^{1/p}$$ for $$f:\mathbb{R}\to[0,1]$$, and are computed with $$e^{*}$$ being the solutions to the selected generalized dual regression program. Correct model selection ranges from 75% of the simulations for $$n=100$$ to 90% for $$n=1000$$, which provides encouraging evidence for the validity of the proposed criterion. The results show that for this set-up, our estimates systematically outperform quantile regression-based estimates, with the spread in performance increasing with sample size. Whereas for $$n=235$$ the reduction in average estimation error is between 8% and 17%, depending on the norm, the estimation error is reduced by up to 30% when $$n=1000$$. The greater reduction in average errors in the $$L^{\infty}$$-norm reflects the higher accuracy in estimation of extreme parts of the distribution. Table 1. $$L^{p}$$ estimation errors $$(\times100)$$ of generalized dual, $$L_{\rm GDR}^{p}$$, and quantile regression, $$L_{\rm QR}^{p}$$, estimates of $$\{F_{Y\mid X}(y_{i}\mid x_{i})\}^{n}_{i=1}$$, along with their ratios $$(\times 100)$$, for $$p=1,2$$ and $$\infty$$ Sample size $$L_{\rm GDR}^{1}$$ $$L_{\rm GDR}^{1}/L_{\rm QR}^{1}$$ $$L_{\rm GDR}^{2}$$ $$L_{\rm GDR}^{2}/L_{\rm QR}^{2}$$ $$L_{\rm GDR}^{\infty}$$ $$L_{\rm GDR}^{\infty}/L_{\rm QR}^{\infty}$$ $$n=100$$ $$4{\cdot}1$$ $$93{\cdot}3$$ $$5{\cdot}6$$ $$92{\cdot}6$$ $$21{\cdot}9$$ $$89{\cdot}9$$ $$n=235$$ $$2{\cdot}7$$ $$91{\cdot}6$$ $$3{\cdot}7$$ $$89{\cdot}7$$ $$17{\cdot}2$$ $$82{\cdot}3$$ $$n=500$$ $$1{\cdot}9$$ $$90{\cdot}7$$ $$2{\cdot}6$$ $$88{\cdot}0$$ $$13{\cdot}0$$ $$75{\cdot}0$$ $$n=1000$$ $$1{\cdot}3$$ $$90{\cdot}2$$ $$1{\cdot}8$$ $$87{\cdot}0$$ $$9{\cdot}9$$ $$70{\cdot}1$$ Sample size $$L_{\rm GDR}^{1}$$ $$L_{\rm GDR}^{1}/L_{\rm QR}^{1}$$ $$L_{\rm GDR}^{2}$$ $$L_{\rm GDR}^{2}/L_{\rm QR}^{2}$$ $$L_{\rm GDR}^{\infty}$$ $$L_{\rm GDR}^{\infty}/L_{\rm QR}^{\infty}$$ $$n=100$$ $$4{\cdot}1$$ $$93{\cdot}3$$ $$5{\cdot}6$$ $$92{\cdot}6$$ $$21{\cdot}9$$ $$89{\cdot}9$$ $$n=235$$ $$2{\cdot}7$$ $$91{\cdot}6$$ $$3{\cdot}7$$ $$89{\cdot}7$$ $$17{\cdot}2$$ $$82{\cdot}3$$ $$n=500$$ $$1{\cdot}9$$ $$90{\cdot}7$$ $$2{\cdot}6$$ $$88{\cdot}0$$ $$13{\cdot}0$$ $$75{\cdot}0$$ $$n=1000$$ $$1{\cdot}3$$ $$90{\cdot}2$$ $$1{\cdot}8$$ $$87{\cdot}0$$ $$9{\cdot}9$$ $$70{\cdot}1$$ In the Supplementary Material we describe the experiment in detail, and report results on estimation of quantile regression coefficients and the distribution of selected models across simulations. We also present additional simulations that illustrate the empirical performance of dual regression with multiple covariates and show that it performs well relative to the noncrossing quantile regression method proposed by Bondell et al. (2010). 6. Discussion If we regard problems such as (4) and (19) as dual, then their solutions reveal a corresponding primal problem. Typically, the Lagrange multipliers of the dual problem appear as parameters in the primal problem, which has a data-generating process interpretation. The constraints on the construction of the stochastic elements thus have shadow values that are parameters of a data-generating representation. In this way the relation between identification and estimation is clarified: a parameter of the data-generating process is the Lagrange multiplier of a specific constraint on the construction of the stochastic element, so to specify that some parameters are zero and others nonzero is to say that some constraints are binding in the large-sample limit while others are not. Another way of expressing this is to say that when a primal corresponds to the data-generating process, additional moment conditions are superfluous: they will in the limit attract Lagrange multiplier values of zero and consequently not affect the value of the program or the solution. In a sense, this is obvious: the parameters of the primal formulation can typically be identified and estimated through an $$M$$-estimation problem that will generate $$K$$ equations to be solved for the $$K$$ unknown parameters. Nonetheless, the recognition that the only moment conditions that contribute to enforcing the independence requirement are those whose imposition simultaneously reduces the objective function while providing multipliers that are coefficients in the stochastic representation of $$Y$$ suggests the futility of portmanteau approaches to imposing independence, such as those based on characteristic functions. The dual formulation reveals that to specify the binding moment conditions is to specify an approximating data-generating process representation, which can then be extrapolated to provide estimates of objects of interest beyond the $$n$$ explicitly estimated values of $$\varepsilon_{i}$$ that characterize the sample and the definition of the mathematical program. As is well understood in mathematical programming, dual solutions provide lower bounds on the values obtained by primal problems. In the generic form of the problems we have considered here, there is no gap between the primal and dual values; hence in econometrics these problems are said to display point identification. We conjecture that problems without point identification do have gaps between their dual and primal values, and that this characterization will enhance our understanding. Acknowledgement We are indebted to Andrew Chesher for many fruitful discussions and to Roger Koenker for encouragement at a key early stage. We also thank Dennis Kristensen, Lars Nesheim, David Pacini, Jelmer Ypma, Yanos Zylberberg, the editor, the associate editor, two referees, and participants of seminars and conferences for helpful comments that have considerably improved the paper. The first author is also Research Professor in the Department of Economics at Johns Hopkins University. The second author gratefully acknowledges the financial support of the U.K. Economic and Social Research Council and the Royal Economic Society. Supplementary material Supplementary material available at Biometrika online includes proofs of all the theoretical results and additional numerical simulations. References Bondell H., Reich B. & Wang H. ( 2010). Noncrossing quantile regression curve estimation. Biometrika 97, 825– 38. Google Scholar CrossRef Search ADS PubMed Boyd S. P. & Vandenberghe L. ( 2004). Convex Optimization . Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS Carlier G., Chernozhukov V. & Galichon A. ( 2016). Vector quantile regression: An optimal transport approach. Ann. Statist. 44, 1165– 92. Google Scholar CrossRef Search ADS Cosma A., Scaillet O. & von Sachs R. ( 2007). Multivariate wavelet-based shape-preserving estimation for dependent observations. Bernouilli 13, 301– 29. Google Scholar CrossRef Search ADS Chernozhukov V., Fernandez-Val I. & Galichon A. ( 2010). Quantile and probability curves without crossing. Econometrica 78, 1093– 125. Google Scholar CrossRef Search ADS Chernozhukov V., Fernandez-Val I. & Melly B. ( 2013). Inference on counterfactual distributions. Econometrica 81, 2205– 68. Google Scholar CrossRef Search ADS De Vore R. ( 1977). Monotone approximation by splines. SIAM J. Math. Anal. 8, 891– 905. Google Scholar CrossRef Search ADS Durbin J. ( 1973). Weak convergence of the sample distribution function when parameters are estimated. Ann. Statist. 1, 279– 90. Google Scholar CrossRef Search ADS Genest C., Quessy J.-F. & Remillard B. ( 2007). Asymptotic local efficiency of Cramer–von Mises tests for multivariate independence. Ann. Statist. 35, 166– 91. Google Scholar CrossRef Search ADS He X. ( 1997). Quantile curves without crossing. Am. Statistician 51, 186– 92. Huber P. ( 1981). Robust Statistics . New York: Wiley. Google Scholar CrossRef Search ADS Koenker R. ( 2005). Quantile Regression . New York: Cambridge University Press. Google Scholar CrossRef Search ADS Koenker R. & Bassett G. ( 1978). Regression quantiles. Econometrica 46, 33– 50. Google Scholar CrossRef Search ADS Koenker R. & Xiao Z. ( 2002). Inference on the quantile regression process. Econometrica 70, 1583– 612. Google Scholar CrossRef Search ADS Koenker R. & Zhao Q. ( 1994). L-estimation for linear heteroscedastic models. Nonparam. Statist. 3, 223– 35. Google Scholar CrossRef Search ADS Owen A. B. ( 2001). Empirical Likelihood . Boca Raton, Florida: Chapman&Hall/CRC. Google Scholar CrossRef Search ADS Parker T. ( 2013). A comparison of alternative approaches to supremum-norm goodness-of-fit tests with estimated parameters. Economet. Theory 29, 969– 1008. Google Scholar CrossRef Search ADS R Development Core Team ( 2018). R: A Language and Environment for Statistical Computing . Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org. Schwarz G. ( 1978). Estimating the dimension of a model. Ann. Statist. 6, 461– 4. Google Scholar CrossRef Search ADS © 2018 Biometrika Trust

Biometrika – Oxford University Press

**Published: ** Mar 1, 2018

Loading...

personal research library

It’s your single place to instantly

**discover** and **read** the research

that matters to you.

Enjoy **affordable access** to

over 12 million articles from more than

**10,000 peer-reviewed journals**.

All for just $49/month

Read as many articles as you need. **Full articles** with original layout, charts and figures. Read **online**, from anywhere.

Keep up with your field with **Personalized Recommendations** and **Follow Journals** to get automatic updates.

It’s easy to organize your research with our built-in **tools**.

Read from thousands of the leading scholarly journals from *SpringerNature*, *Elsevier*, *Wiley-Blackwell*, *Oxford University Press* and more.

All the latest content is available, no embargo periods.

## “Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”

Daniel C.

## “Whoa! It’s like Spotify but for academic articles.”

@Phil_Robichaud

## “I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”

@deepthiw

## “My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”

@JoseServera

- Read unlimited articles
- Personalized recommendations
- No expiration
- Print 20 pages per month
- 20% off on PDF purchases
- Organize your research
- Get updates on your journals and topic searches

**$49/month**

Start Free Trial

14-day Free Trial

Best Deal — 39% off
### Annual Plan

- All the features of the Professional Plan, but for
**39% off**! - Billed annually
- No expiration
- For the normal price of 10 articles elsewhere, you get one full year of unlimited access to articles.

~~$588~~

**$360/year**

Start Free Trial

14-day Free Trial

Read and print from thousands of top scholarly journals.

System error. Please try again!

or

By signing up, you agree to DeepDyve’s Terms of Service and Privacy Policy.

Already have an account? Log in

Bookmark this article. You can see your Bookmarks on your DeepDyve Library.

To save an article, **log in** first, or **sign up** for a DeepDyve account if you don’t already have one.