TY - JOUR AU - Zhang,, Zhengyu AB - Summary In this paper, we consider the semiparametric identification and estimation of a heteroscedastic binary choice model with endogenous dummy regressors and no parametric restriction on the distribution of the error term. Our approach addresses various drawbacks associated with previous estimators proposed for this model. It allows for: general multiplicative heteroscedasticity in both selection and outcome equations; a nonparametric selection mechanism; and multiple discrete endogenous regressors. The resulting three‐stage estimator is shown to be asymptotically normal, with a convergence rate that can be arbitrarily close to n−1/2 if certain smoothness assumptions are satisfied. Simulation results show that our estimator performs reasonably well in finite samples. Our approach is then used to study the intergenerational transmission of smoking habits in British households. 1. Introduction The estimation of a binary choice model with a binary endogenous regressor is of considerable importance in applied micro‐econometrics. For example, evaluating the impact of job training on subsequent employment status often involves estimating a model such as Y=1{X′β0+Dδ0>U}, (1.1) D=1{Z′γ0>V}. (1.2) In (1.1)–(1.2), Y is the binary indicator of employment status, which depends on a pX‐dimensional vector X of observable characteristics and binary indicator D of training reception.1D is further determined by a pZ‐dimensional vector of instruments Z through a single‐index Z′γ0 ⁠. The dummy variable D is generally endogenous because of possible correlation between the unobserved error terms U and V. The parameter vector (β0,δ0) is of central interest to researchers.2 Below, we provide three additional examples where models (1.1) and (1.2) are relevant. Example 1.1. The model is used by Loureiro et al. (2010) to examine the intergenerational transmission of smoking habits, where Y indicates whether a teenager is a smoker and D indicates whether the teenager's father or mother is a smoker. In their study, the parental smoking indicator D is likely to correlate with the error term U because the teenagers and their parents share unobservable preferences, such as attitudes towards risk, health consciousness and genetic traits. Example 1.2. The model is useful for studying insurance markets. Suppose we wish to estimate the probability of an accident during a policy period. The outcome of interest is a binary variable that takes the value of 1 if a driver is involved in an accident during the insurance policy period; the binary endogenous variable is an indicator for whether the driver has purchased comprehensive insurance or not. Now, a positive coefficient on the binary endogenous regressor implies a higher probability of the driver having an accident even after controlling for covariates in case the driver has comprehensive insurance, thus indicating the presence of moral hazard. Example 1.3. The model was adopted by Angrist and Evans (1998) to identify the causal effect of fertility (having a third child or not) on mothers' labour supply. The outcome of interest Y is whether the mother worked in 1999, the binary endogenous explanatory variable D is the presence of a third child, and the instrument is whether the mother's first two children are of the same gender. Traditionally, the model can be estimated by maximum likelihood (ML) after specifying the joint normal distribution of (U,V) ⁠; see, e.g. Heckman (1978) and Amemiya (1978). However, the ML method is notorious for yielding inconsistent results when the error distribution is misspecified. Because such parametric specifications in general cannot be justified in economic theory, this has motivated the recent interest in semiparametric methods that do not require the parametric specification of error distribution. Lewbel (2000) proposed an estimation method that does not require one to know the distribution of the unobservable error term and the endogenous variable that could be either binary or continuous. In this paper, we rely on a special continuous regressor model with large support. It is similar to the Lewbel (2000) model, but employs a different identification and estimation strategy. Abrevaya et al. (2010) considered the problem of testing the statistical significance of causal (or treatment) effects in a general nonlinear setting; they incorporated models (1.1) and (1.2) as a leading special case. Although they used a model that is much more general than ours, they sought to estimate the direction (or sign), rather than the magnitude, of the causal effects. An alternative approach to deal with endogeneity is the control function. For a binary choice model with a continuous endogenous variable, the control function proposed by Blundell and Powell (2004), Rothe (2009) and Hoderlein and Sherman (2015) can be used for identification and estimation. However, one substantial limitation of such approaches for binary choice models is that they generally require the endogenous regressor to be a strictly increasing function of the first‐stage error term, and so typically cannot be applied if the endogenous covariates are discrete. Despite an abundance of econometric literature addressing endogeneity in nonlinear models, to the best of our knowledge, to date there are only a few distribution‐free methods that deliver consistent estimators for a binary choice model with endogenous dummy regressors. One paper closely related to our study is Yildiz (2013), who proposed a multistage matching estimator model (1.1)–(1.2) without imposing large support conditions or relying on parametric distributional assumptions. Three major differences are found between our approach and that of Yildiz (2013). First, because Yildiz (2013) employs the density weighted average derivative estimator of Powell et al. (1989) for constructing the first‐ and second‐stage estimators, her approach heavily relies on the assumption of independence between the error term and the regressors, thus ruling out heteroscedastic error terms. While deviation from normality may have serious consequences for parametric estimators, such as the ML estimator, evidence suggests that such estimators are severely affected by unknown forms of heteroscedasticity. Furthermore, semiparametric estimators requiring homoscedasticity also behave badly in the presence of unknown forms of heteroscedasticity; see, e.g. Powell (1986), Horowitz (1992) and Donald (1995). Therefore, it is highly desirable to develop semiparametric estimators that are robust not only to non‐normality, but also to general forms of heteroscedasticity. In this paper, we propose a nonparametric multiplicative heteroscedasticity that has been adopted in various forms for many models in the econometric and statistical literature; see, e.g. Harvey (1976), Engle (1982) and Chen and Khan (2000, 2003a). Second, inconsistent inferences can be drawn not only from incorrect specification of the distribution of the error term, or the form of heteroscedasticity, but also from the misspecification of the selection mechanism's data‐generating process. For example recognizing that a single‐index selection equation might result in possible inconsistent estimates, Ahn and Powell (1993) considered the estimation of a sample selection model subject to nonparametric selection mechanism. Similarly, to lessen the possibility of misleading statistical inference, it might be meaningful to consider the current model with a general nonparametric selection mechanism. The data‐generating process for the endogenous variable D in Yildiz (2013) has a linear index threshold crossing structure. This linear structure is crucial not to the identification strategy, but to the estimation strategy. However, the model in this paper has no such problem, because we allow for the selection equation to be generally nonlinear. Third, the existing literature, including Yildiz (2013), concentrates almost exclusively on a model containing one single binary endogenous regressor, although the models used in many empirical applications contain a multivalued discrete regressor (either ordered or unordered), or more than one binary endogenous regressor. Examples of ordered multivalued discrete regressors include hours of training in training programs and levels of transfers in anti‐poverty programs; examples of unordered multivalued discrete regressors include travel modes where passengers choose a train, plane or motor vehicle. Moreover, to examine the intergenerational transmission of smoking habits (Example 1.1), the model of Loureiro et al. (2010) contains two endogenous dummies for whether the teenager's father and mother are smokers or not. In this paper, we propose a novel computationally convenient and distribution‐free estimator for a heteroscedastic binary choice model with endogenous dummy regressors. The defining feature of our model is that it reduces to handling an estimating equation that has a partially linear varying coefficient (PLVC) form, i.e. Y=X′β0+Dδ0+D1m1(P)+D2m2(P), where β0 and δ0 are the parameters of interest.3 Each of Y ⁠, X, D, D1 ⁠, D2 and P is either observed or an identified function of the observed variables, while m1(·) and m2(·) are regarded as two known (nuisance) scalar valued functions. Our approach has several advantages relative to existing estimators. First, it simultaneously accommodates a general class of binary choice models with general multiplicative heteroscedasticity, a purely nonparametric selection mechanism and one or multiple discrete endogenous regressor(s). Second, unlike models (1.1) and (1.2) proposed by Zhang and He (2013), which are robust against non‐normality and unknown heteroscedasticity, our method does not need the error terms to be jointly distributed symmetrically. The key idea of this paper is closely related to Manski (1985), and Chen and Khan (2003a). We employ the maximum score approach in Manski (1985) to convert the rank conditions and we estimate the conditional quantiles of the latent variable. Chen and Khan (2003a) considered the estimation of a heteroscedastic sample selection model. One distinctive feature of Chen and Khan (2003a) is that their multistage estimation procedures end up estimating a partially linear regression model with both the regressors and regressand transformed from raw data. In this paper, we take a similar approach and show that our model can be reduced to analysing a PLVC model with nonparametrically generated regressand and regressors. Our paper contributes to the literature in the following three ways. First, for nonlinear econometric models such as the binary choice model, our method can simultaneously handle general multiplicative heteroscedasticity, the nonlinear selection mechanism and multiple discrete endogenous regressors. Second, our procedure comes down to estimating a sophisticated semiparametric PLVC model with nonparametrically generated regressand and regressors. Thus, deriving the limiting distribution of our estimator is not a standard problem and by no means a trivial extension of Chen and Khan (2003a). Third, our model is computationally convenient, with wide application in empirical research. Because our model is robust to both heteroscedasticity and non‐normality, it gives more credible results than parametric models, and thus allows us to test the robustness of such parametric models. For more specific details, see Section 6. The paper is structured as follows. The model and identification strategy are described in Section 2. In Section 3, we define the multistage estimator, while in Section 4 we study its large sample properties. In Section 5, we discuss the extensions of our model. An empirical example is provided in Section 6. We conclude in Section 7. All details about Monte Carlo simulation are collected in online Appendix B. The proofs of all lemmata can be found in the Appendix at the end of the paper. The proof of Theorem 4.1 is provided in online Appendix C. 2. Identification This paper focuses on the following binary choice model: Y=1{Y∗>0}; (2.1) Y∗=X′β0+Dδ0−U; (2.2) D=1{m(Z)>V}. (2.3) Here, Y∗ is the latent scalar variable (mother's latent labour supply), which relies on a pX‐dimensional vector of exogenous regressors X and a binary indicator D (indicating whether the mother has a third child or not). (β0,δ0) is a vector of the coefficients to be estimated. The binary indicator D is further determined by a pZ‐dimensional vector of instruments Z through a general nonparametric function m(Z) ⁠. Instead of assuming that both U and V are homoscedastic, we represent each of them as a product of a nonparametric scale function of the regressors and a homoscedastic error that is independent of those regressors, i.e. U=σ(X2)×ε,V=ς(Z)×υ. (2.4) Here, σ(·) and ς(·) are unknown functions of the regressors, and ε and υ are independent of (X,Z) but not necessarily independent of each other. Note that the distribution of U should rely on X only through X2. A proper subset of X is X=(X1,X2) ⁠, where X1 is a continuous regressor with nonzero coefficient. Remark 2.1. General multiplicative heteroscedasticity is one of the most popular heteroscedastic forms raised in the econometric literature. Similar model specifications can be found in Newey and Stoker (1993), Donald (1995), Melenberg and Van Soest (1996), Chen and Khan (2000, 2003a) and Khan (2013). At first, the model adopted here seems to allow for heteroscedasticity on condition that the heteroscedasticity is multiplicative. However, this is not the case. The nonparametric components σ(·) and ς(·) are left unspecified, thus permitting conditional heteroscedasticity of very general forms. Moreover, from Khan (2013), any binary choice model under zero conditional median restriction is observationally equivalent to a probit/logit model with a multiplicative heteroscedastic error term; see Theorem 2.1 in Khan (2013). Thus, the multiplicative heteroscedasticity specified here is not as restrictive as it appears to be. Note that σ(·) also can be identified once the distribution of ε is known. However, it should also be emphasized that this parameter by itself is of less interest as it provides only the functional form of heteroscedasticity when ε is correctly specified. Remark 2.2. The heteroscedastic error term U=σ(X2)×ε relies on X only through X2, where X can be divided into X=(X1,X2) and X1 is a continuous regressor with a nonzero coefficient. Here, we have imposed a mild exclusion restriction that is crucial to our identification strategy introduced below. Chamberlain (1992), Donald (1995), Powell (1994) and Chen and Zhou (2010), among others, have adopted a similar heteroscedasticity exclusion restriction. This type of heteroscedasticity exclusion restriction often arises in economic applications; see Donald (1995) and Powell (1994), among others, for some detailed discussion on such a form of heteroscedasticity in semiparametric estimation contexts. Let W be the vector of distinct components in the vector (X2′,Z′,D)′ ⁠. Note that we define W as different elements, consisting of only X2, Z and D for simplicity. Define W=(Wc,Wd) ⁠, where Wc is the continuous variable in W and Wd is the discrete element in W. Next, we introduce some lemmata for model identification. First, we state a basic result that we will use later. Lemma 2.1. For any 0<τ<1 ⁠, the τth quantile function of ε conditional on X=x ⁠, Z=z and D=d relies only on P=P(z) and D, where P(z)=E[D|Z=z] denotes the propensity score. From Lemma 2.1, we can always represent the conditional quantile function of ε, signified as q(·) ⁠, through the function of (τ,d,p) as q(τ,d,p)=dq1(τ,p)+(1−d)q2(τ,p), where ql(τ,p) is the inverse function of Ll(·,Fυ−1(p)) with respect to its first argument for l=1,2 ⁠, in which L1(c1,c2)=Pr{ε≤c1|υ0)|W=w] with respect to (β1,a) ⁠, where 1(·) is the indicator function. Remark 2.4. The proof of Lemma 2.3 in the Appendix illustrates Assumption 2.4. Note that the propensity score P(z)=E[D|Z=z] is a function of Z. If Z contains element X1, then Q(τ,x,z,d) is the total nonlinear function related to (x,z,d) ⁠, implying that the identification fails; for more specific details, see Manski (1985). The reason for employing the linear part x1β10 is also applicable for model identification under the maximum score approach. Similar model specification can be found in Krief (2014) and Zhang (2016). For empirical study, the economic implication behind why Z does not contain X1 is that X1 has no influence on endogenous dummy variable D, which is easy to inspect. This case is not uncommon in economic study. We have provided an example that fits our model quite well. For more specific details, see Section 6. Remark 2.5. The support condition in this paper is different from those in Vytlacil and Yildiz (2007). The identification strategy's central feature in Vytlacil and Yildiz (2007) is the use of variation in the instruments for the endogenous regressor and other exogenous regressors to infer the variation in the exogenous regressors that would compensate for the variation in the dummy endogenous regressor, such that their paper does not need large support assumptions. The support of all regressors in Vytlacil and Yildiz (2007) can all be interval as long as it satisfies some ‘matching’ condition. Similar matching estimators can be found in Yildiz (2013). In contrast to Vytlacil and Yildiz (2007), our paper employs the maximum score approach of Manski (1985) for parameter identification, and therefore we need a continuous distributed variable X1 with a large support whose coefficient is nonzero. This paper and that of Vytlacil and Yildiz (2007) focus on different issues. In our paper, we apply all restrictions on X1, but we do not need any specific support assumption for W=(X2′,Z′,D) ⁠. 3. Estimation The identification strategy introduced in the preceding section can be transformed into the following multistage estimation procedure. In the following three subsections, we discuss each of the stages in detail. 3.1. Estimating the Propensity Score In the first stage, we estimate the probability of D=1 conditional on the selection equation regressor Z. We refer to these conditional probabilities as propensity scores. We assume that the regressor vector Zi can be partitioned into (Zic,Zid) ⁠, where Zic and Zid are pZc‐dimensional continuously and pZd‐dimensional discretely distributed components, respectively. By applying the Nadaraya–Watson kernel estimator used in Li and Racine (2007), the propensity score can be estimated by P̂(z)=(1/nh0pZc)∑i=1nDik0(Zic−zc/h0)1{Zid=zd}(1/nh0pZc)∑i=1nk0(Zic−zc/h0)1{Zid=zd}, (3.1) where k0Zic−zch0=∏l=1pZck0Zlic−zlch0. Here, Zlic and zlc are the lth components of Zic and zc ⁠, respectively, k0(·) is a kernel function and h0 is the corresponding bandwidth converging to zero as the sample size increases to infinity. The additional conditions imposed on k0(·) and h0 are discussed in Section 4. 3.2. Local Smoothed Maximum Score Estimation From Lemma 2.3, the conditional quantile of Y∗ is identified as long as the parameter (β10,ϕ(τ,w)) is the unique maximizer of Λ(τ,w,a) ⁠, where Λ(τ,w,a)=E[(Y−(1−τ))1(X1β1+a>0)|W=w], (3.2) with respect to (β1,a) ⁠. As the coefficients can be estimated only up to scale in binary choice models, our paper takes the normalization performed with β10=1 ⁠. In some circumstances, the sign of β10 may be unknown. In this case, one could perform the normalization with |β10|=1 by adding β10∈{−1,+1} as an additional parameter to estimate. Note that doing so does not change the results of this paper. Thus, we impose the coefficient of β10 to be 1 here. We could have maximized (3.2) by using the maximum score approach discussed in Manski (1985). However, to use the indicator function 1(·) ⁠, we might need a complex algorithm, which may exhibit a very slow convergence rate.4 To overcome this issue, we employ an integrated kernel function in this paper, as proposed by Horowitz (1992) for smoothed maximum score estimation, to replace the indicator function, i.e. Λn(τ,w,a)=1nh2pWc∑i=1n(Yi−(1−τ))K1Xi1+ah1k2Wic−wch21{Wid=wd}, (3.3) ϕ̂(τ,w)= argmax a∈AΛn(τ,w,a). (3.4) Here, K1(t)=∫−∞tk1(s)ds is an integrated kernel function with lim t→−∞K1(t)=0 and lim t→+∞K1(t)=1 ⁠. Note that k2Wic−wch2=∏l=1pWck2Wlic−wlch2, Wlic and wlc are the lth components of Wic and wc ⁠, respectively, k1(·) and k2(·) are both kernel functions and h1 and h2 are the corresponding bandwidths converging to zero as the sample size increases to infinity. 3.3. Estimating the PLVC Model With Generated Regressors From Lemma 2.2, we establish the following PLVC model Q¯(τ1,τ2,Xi,Zi,Di)=Xi′β0+Diδ0+ΔQ(τ1,τ2,Xi,Zi,Di)Di×m1(Pi)+ΔQ(τ1,τ2,Xi,Zi,Di)(1−Di)×m2(Pi), (3.5) where ml(p)=q¯l(τ1,τ2,p)/Δql(τ1,τ2,p) ⁠, for l=1,2 ⁠. Note that as we set β10=1 and Q(τ,Xi,Zi,Di)=X1i+ϕ(τ,Wi) ⁠, (3.5) reduces to Q¯(τ1,τ2,Wi)=X2i′β20+Diδ0+ΔQ(τ1,τ2,Wi)Di×m1(Pi)+ΔQ(τ1,τ2,Wi)(1−Di)×m2(Pi), (3.6) where Q¯(τ1,τ2,Wi)=ϕ(τ1,Wi)+ϕ(τ2,Wi)2 and ΔQ(τ1,τ2,Wi)=ϕ(τ1,Wi)−ϕ(τ2,Wi). Now, the coefficient we need to estimate is ψ0(p)=(β20′,δ0,m1(p),m2(p))′. (3.7) Following Robinson (1989) and Li et al. (2002), by the local least‐squares (LS) estimator, ψn(p)=S1n−1(p)S2n(p), (3.8) where S1n(p)=1nh3∑i=1nX2iDiΔQ(τ1,τ2,Wi)DiΔQ(τ1,τ2,Wi)(1−Di)X2iDiΔQ(τ1,τ2,Wi)DiΔQ(τ1,τ2,Wi)(1−Di)′×k3Pi−ph3, S2n(p)=1nh3∑i=1nX2iDiΔQ(τ1,τ2,Wi)DiΔQ(τ1,τ2,Wi)(1−Di)Q¯(τ1,τ2,Wi)k3Pi−ph3, k3(·) is a kernel function and h3 is the corresponding bandwidth. The estimator ψn(p) is not operational because of lack of knowledge about the values of ΔQ(τ1,τ2,Wi) ⁠, Q¯(τ1,τ2,Wi) and Pi ⁠. A feasible estimator can be constructed by replacing the unobservable variables with their nonparametric estimators. The estimator can be written as ψ̂n(p)=Ŝ1n−1(p)Ŝ2n(p), (3.9) where Ŝ1n(p)=1nh3∑i=1nX2iDiΔQ̂(τ1,τ2,Wi)DiΔQ̂(τ1,τ2,Wi)(1−Di)X2iDiΔQ̂(τ1,τ2,Wi)DiΔQ̂(τ1,τ2,Wi)(1−Di)′×k3P̂i−ph3, Ŝ2n(p)=1nh3∑i=1nX2iDiΔQ̂(τ1,τ2,Wi)DiΔQ̂(τ1,τ2,Wi)(1−Di)Q¯̂(τ1,τ2,Wi)k3P̂i−ph3 and P̂i=P̂(Zi),Q¯̂(τ1,τ2,Wi)=ϕ̂(τ1,Wi)+ϕ̂(τ2,Wi)2,ΔQ̂(τ1,τ2,Wi)=ϕ̂(τ1,Wi)−ϕ̂(τ2,Wi). Let θ(p) denote the first pX‐dimensional subvector of ψ(p) ⁠. For each p∈(0,1) ⁠, θ̂n(p) would be a consistent estimator for θ0=(β20′,δ0)′ ⁠. Thus, a more efficient estimator for θ0 would be the weighted average of these local estimators: θ̂n=∫p̲p¯ϖ(p)dp−1∫p̲p¯ϖ(p)θ̂n(p)dp. (3.10) Here, [p̲,p¯] is any compact subset of (0,1) and ϖ(·) is a pX×pX weighting function. Remark 3.1. The generic weighting matrix ϖ(·) is used for theoretical interest. In practice, a frequently used and computationally convenient weighting function is ϖ(p)=1{p̲0 ⁠; if pWc=0 ⁠, then no k2(·) is needed. (b) The integrated kernel function K1(t)=∫−∞tk1(s)ds is bounded in absolute value with lim t→−∞K1(t)=0 and lim t→+∞K1(t)=1 ⁠. Moreover, its derivative k1(t) satisfies ∫k12(t)dt<∞ ⁠. Assumption 4.7. (second‐stage bandwidth sequence) (a) If pWc>0 ⁠, the bandwidth sequences h1 and h2 used in the second stage are of the form h1=c1n−θ1 ⁠, h2=c2n−θ2 ⁠, where c1 and c2 are some constants and θ1 and θ2 are some positive numbers satisfying θ1=1/(2κ1+1) and θ2∈κ1+1κ2(2κ1+1),κ1−1pWc(2κ1+1). (b) If pWc=0 ⁠, the bandwidth sequence h1 used in the second stage is of the form h1=c1n−θ1 ⁠, where c1 is some constant and θ1=1/(2κ1+1) ⁠. No h2 is needed. Assumption 4.8. (first‐stage regressor distribution) (a) The regressor Z can be decomposed as Z=(Zc,Zd) ⁠, where the pZc‐dimensional vector Zc is continuously distributed and the pZd‐dimensional vector Zd is discretely distributed. (b) Define fZ(z) as the joint density of Z=(Zc,Zd) in z. Let fZc|Zd(·|zd) denote the conditional density function of Zc ⁠, given Zd=zd ⁠. We assume it is bounded away from 0 and ∞ and is continuous for Z∈Z ⁠. Let Pr{Zd=zd} denote the mass function of zd ⁠; we assume that the discrete variable Zd takes a finite number of points on Z. Finally, we have fZ(z)=fZc|Zd(zc|zd)Pr{Zd=zd} ⁠. (c) For each zd∈Z ⁠, the joint density function fZ(z) is bounded and continuously differentiable of order κ0 at least for zc ⁠. Assumption 4.9. (first‐stage kernel function) The kernel function k0(·) used in the first stage is of bounded variation and support, and is of order κ0≥pZc1+(1/2κ1+1)−6θ3 if pZc>0 ⁠, where θ3 is defined in Assumption 4.4. If pZc=0 ⁠, then no k0(·) is needed. Assumption 4.10. (first‐stage bandwidth sequence) (a) If pZc>0 ⁠, the bandwidth sequence h0 used in the first stage is of the form h0=c0n−θ0 ⁠, where c0 is some constant and θ0∈12κ0,1pZcκ1+12κ1+1−3θ3. (b) If pZc=0 ⁠, no h0 is needed. Remark 4.2. The above conditions used for the kernel function and bandwidth are similar to those in Ahn and Powell (1993) and Chen and Khan (2000, 2003a). Generally, the above bandwidth conditions are used to control for the residual terms of all stages so that they are sufficiently small and plugging them into the third stage will not affect the convergence rate of the final estimator. The range of bandwidths is inferred from the infinitesimals generated from the first and second stages. Note that the solution is not unique, and that our assumption provides one feasible solution; other methods to choose the kernel function and its bandwidth can be further researched. Use the definitions in Assumption 4.1. For Q1=X1+ϕ(τ1,W) and Q2=X1+ϕ(τ2,W) ⁠, as we assume β10=1 ⁠, there is a one‐to‐one relationship between (Ql,W′)′ and (X1,W′)′ for any fixed (β20′,δ0)′ for l=1,2 ⁠. By Assumption 2.2, the distribution of Ql conditional on W has positive density everywhere with respect to the Lebesgue measure for almost every W∈W ⁠. Thus, we introduce the following assumption. Assumption 4.11. (order of smoothness) Let Fu1|Q1W(·|Q1,W) and Fu2|Q2W(·|Q2,W) denote the CDF of u1 conditional on Q1,W and u2 conditional on Q2,W ⁠, respectively, where u1 and u2 are defined in Assumption 4.1. Note that fQ1|W(·|W) and fQ2|W(·|W) are as defined in Assumption 4.1; they denote the conditional density function of Q1|W and Q2|W ⁠, respectively. Now, there exists a real constant M<∞ such that, as function of Ql ⁠, Ful|QlW(Ql|Ql,W) and fQl|W(Ql|W) are all bounded by M, for l=1,2 ⁠. Moreover, all functions are at least κ1 times continuously differentiable, with all derivatives bounded by M. Remark 4.3. Assumption 4.11 gives the smoothed maximum score aiming to control for the asymptotic bias of the parameter. Similar arguments can be found in the literature, such as Assumptions 8 and 9 in Horowitz (1992), and Assumption 10(a) in Krief (2014). The asymptotic bias is of order O(h1κ1) ⁠, which converges to 0 as n→∞ ⁠. Now, let eX=[IpX,O,O] denote the pX×(pX+2)‐dimensional matrix, where IpX represents the pX×pX identity matrix and the last two columns are zero vectors. Let Xi=X2iDiΔQ(τ1,τ2,Wi)DiΔQ(τ1,τ2,Wi)(1−Di); as Pi=P(Zi) is a function of Zi and Zi is a subvector of Wi ⁠, define G1(τ1,Wi)=ϖ(Pi)eXV−1(Pi)fP(Pi)α−1(τ1,Wi)XifW(Wi)12−Dim1(Pi)−(1−Di)m2(Pi), G2(τ2,Wi)=ϖ(Pi)eXV−1(Pi)fP(Pi)α−1(τ2,Wi)XifW(Wi)12+Dim1(Pi)+(1−Di)m2(Pi), (Ful|QlW(Ql|Ql,W)−(1−τl))(j)=∂j∂jQl(Ful|QlW(Ql|Ql,W)−(1−τl)) and (fQl|W(Ql|W))(j)=∂j∂jQlfQl|W(Ql|W), for l=1,2 ⁠, j=1,…,κ1 ⁠. Let lim n→∞nh1h1κ1=λ<∞ ⁠; introduce B=∫t1κ1k1(t1)dt1E[G1(τ1,W)×∑s=0κ11s!(κ1−s)!×(Fu1|Q1W(0|0,W)−(1−τ1))(s)(fQ1|W(0|W))(κ1−s)+G2(τ2,W)×∑s=0κ11s!(κ1−s)!(Fu2|Q2W(0|0,W)−(1−τ1))(s)(fQ2|W(0|W))(κ1−s)], and Ω=∫k12(t1)dt1E[G1(τ1,W)G1(τ1,W)′(τ1−τ12)fQ1|W(0|W)+G2(τ2,W)G2(τ2,W)′(τ2−τ22)fQ2|W(0|W)]. Then, we introduce the following theorem. Theorem 4.1. Under Assumptions 2.1–2.4 and 4.1–4.11, the estimator θ̂n defined in (3.10) has limiting distribution as nh1→∞ ⁠, nh1(θ̂n−θ0)→dN∫p̲p¯ϖ(p)dp−1×λB,∫p̲p¯ϖ(p)dp−1×Ω×∫p̲p¯ϖ(p)dp−1. Remark 4.4. Theorem 4.1 is consistent with Theorem 1 in Chen and Khan (2003b), illustrating that an estimator converging at parametric rate n−1/2 cannot be found for binary choice models with unknown multiplicative heteroscedasticity. However, under Assumption 4.6, the convergence rate of our estimator can get to at least θ̂n−θ0=Op(n−2/5) ⁠, which is quite close to the root‐n convergence speed. Furthermore, the use of a higher‐order kernel speeds up convergence if the required density derivatives exist. The asymptotic bias can be removed by selecting the bandwidth h1=c1n−θ1 satisfying θ1>1/(2κ1+1) such that lim n→−∞nh1h1κ1→0 ⁠. However, this does not produce the fastest convergence speed in probability. Similar arguments can be found in Horowitz (1992) and Krief (2014). Moreover, two factors govern the selection of the quantile pair in finite samples. The first is for the full rank condition in Assumptions 4.1 and 4.2 to be satisfied in finite samples. This means that τ1 and τ2 cannot get too close (for extreme cases, V(p) is singular because ΔQ(τ1,τ2,W)=0 when τ1=τ2 ⁠). However, the estimator's precision is also sacrificed if τ1 and τ2 get too far from each other, as the density would become too small for the nonparametric estimator. From the literature, such as Chen and Khan (2000, 2003a), and Monte Carlo results, we suggest that (τ1,τ2) should be symmetric around 0.5 and keep a little bit of distance, such as (τ1,τ2)=(0.3,0.7) ⁠.5 The expression of asymptotic variance could provide some insight on pair selection for maximizing asymptotic efficiency. Expanding on this point, efficiency can be further improved by combining the various estimators of θ0 based on various quantile pairs, perhaps by extending the generalized method of moments (GMM) framework suggested by Buchinsky (1998) and Chen et al. (2005). Theoretically, the choice of (τ1,τ2) can be infinite and, correspondingly, the moment condition can also be infinite. Finding the optimal quantile pair could be quite difficult, but it could achieve the highest efficiency as it involves knowledge of the regressor distribution, parameter β20, δ0 and the density of the homoscedastic component (ε,υ) ⁠. For large sample inference, λB and Ω need to be estimated. We propose a standard plug‐in estimator to replace the unknown functions in B and Ω. We construct G1(τ1,Wi) ⁠, e.g. Ĝ1(τ1,Wi)=ϖ(Pî)eXV̂−1(P̂i)f̂P(P̂i)α̂−1(τ1,Wi)X̂if̂W(Wi)×12−Dim̂1(P̂i)−(1−Di)m̂2(P̂i). Here, P̂i=P̂(Zi) is the nonparametric propensity score estimator, V̂(P̂i) is the kernel estimator of the expectation of XîXî′ conditional on P̂i ⁠, and Xi is replaced by Xî=X2iDiΔQîDiΔQî(1−Di), where ΔQî=ϕ̂(τ1,Wi)−ϕ̂(τ2,Wi) ⁠. For α̂(τ1,Wi) ⁠, use the expression in Lemma A.2 in the proof of Theorem 4.1.6 This is α̂(τ1,Wi)=−1nh12h2pWc∑j=1n(Yj−(1−τ1))k1′Xj1+ϕ̂(τ1,Wi)h1k2Wjc−Wich2×1{Wjd=Wid}, where f̂W(Wi) is the kernel estimator of the density function of Wi and m̂1(p) and m̂2(p) can be obtained directly from the last two terms of ψ̂n(p)=(β̂20′(p),δ̂0(p),m̂1(p),m̂2(p))′=Ŝ1n−1(p)Ŝ2n(p) ⁠. For construction of λB ⁠, the proof of Theorem 4.1 shows that7 1nh1∑i=1n(g1i+g2i)→pλB. Here, g1i=G1(τ1,Wi)(Yi−(1−τ1))k1Xi1+ϕ(τ1,Wi)h1 and g2i=G2(τ2,Wi)(Yi−(1−τ2))k1Xi1+ϕ(τ2,Wi)h1. Now, let ĝ1i=Ĝ1(τ1,Wi)(Yi−(1−τ1))k1Xi1+ϕ̂(τ1,Wi)h1 and ĝ2i=Ĝ2(τ2,Wi)(Yi−(1−τ2))k1Xi1+ϕ̂(τ2,Wi)h1. We can construct λB as λB̂=1nh1∑i=1n(ĝ1i+ĝ2i). Similarly, Ω can be estimated as Ω̂=1nh1∑i=1n(ĝ1i+ĝ2i)(ĝ1i+ĝ2i)′. Following the arguments in Powell et al. (1989), we can show that λB̂→pλB and Ω̂→pΩ ⁠. 5. Extension: Heteroscedastic Binary Choice Model with Multiple Endogenous Dummy Regressors A model that is parallel to (2.1)–(2.3) but contains two endogenous dummy regressors can be written as Y=1{Y∗>0}, (5.1) Y∗=X′β0+D1δ10+D2δ20−U, (5.2) D1=1{m1(Z)>V1}, (5.3) D2=1{m2(Z)>V2}, (5.4) where U=σ(X2)×ε,V1=ς1(Z)×υ1,V2=ς2(Z)×υ2. (5.5) The endogenous binary variables D1 and D2 are determined by a pZ‐dimensional vector of instruments Z through m1(Z) and m2(Z) ⁠, respectively. The functions m1(·) ⁠, m2(·) ⁠, σ(·) ⁠, ς1(·) and ς2(·) are left as unknown identified functions. The disturbances ε, υ1 and υ2 are jointly independent of (X,Z) ⁠, but not necessarily mutually independent. We first state the result that is parallel to Lemma 2.1. Lemma 5.1. The τth quantile function of ε conditional on D1,D2,X,Z relies only on P1=P1(z)=E[D1|Z=z] ⁠, P2=P2(z)=E[D2|Z=z] ⁠, D1 and D2. Moreover, the τth quantile function of ε conditional on (D1,D2,X,Z)=(d1,d2,x,z) can be written as q(τ,d1,d2,p1,p2)=d1d2q11(τ,p1,p2)+d1(1−d2)q12(τ,p1,p2)+(1−d1)d2q21(τ,p1,p2)+(1−d1)(1−d2)q22(τ,p1,p2). Let Q(τ,x,z,d1,d2) denote the τth quantile function of Y∗ conditional on X=x ⁠, Z=z ⁠, D1=d1 and D2=d2 ⁠. The following lemma is parallel to Lemma 2.2. Lemma 5.2. For any 0<τ1<τ2<1 ⁠, Q¯(τ1,τ2,x,z,d1,d2)=x′β0+d1δ10+d2δ20+d1d2ΔQ(τ1,τ2,x,z,d1,d2)q¯11(τ1,τ2,p1,p2)Δq11(τ1,τ2,p1,p2)+d1(1−d2)ΔQ(τ1,τ2,x,z,d1,d2)q¯12(τ1,τ2,p1,p2)Δq12(τ1,τ2,p1,p2)+(1−d1)d2ΔQ(τ1,τ2,x,z,d1,d2)q¯21(τ1,τ2,p1,p2)Δq21(τ1,τ2,p1,p2)+(1−d1)(1−d2)ΔQ(τ1,τ2,x,z,d1,d2)q¯22(τ1,τ2,p1,p2)Δq22(τ1,τ2,p1,p2), where Q¯(τ1,τ2,x,z,d1,d2)=Q(τ1,x,z,d1,d2)+Q(τ2,x,z,d1,d2)2,ΔQ(τ1,τ2,x,z,d1,d2)=Q(τ1,x,z,d1,d2)−Q(τ2,x,z,d1,d2),q¯l(τ1,τ2,p1,p2)=ql(τ1,p1,p2)+ql(τ2,p1,p2)2,Δql(τ1,τ2,p1,p2)=ql(τ1,p1,p2)−ql(τ2,p1,p2). Similar to Lemma 2.2, Lemma 5.2 obtains a PLVC model‐type relationship between the pseudo‐dependent variable Q¯(τ1,τ2,X,Z,D1,D2) and pseudo‐regressor vector [X,D1,D2,D1D2ΔQ(τ1,τ2,X,Z,D1,D2),D1(1−D2)ΔQ(τ1,τ2,X,Z,D1,D2),(1−D1)D2ΔQ(τ1,τ2,X,Z,D1,D2),(1−D1)(1−D2)ΔQ(τ1,τ2,X,Z,D1,D2)]. Each coefficient in D1D2ΔQ(τ1,τ2,X,Z,D1,D2) ⁠, D1(1−D2)ΔQ(τ1,τ2,X,Z,D1,D2) ⁠, (1−D1) D2ΔQ(τ1,τ2,X,Z,D1,D2) and (1−D1)(1−D2)ΔQ(τ1,τ2,X,Z,D1,D2) is a smoothed function of the propensity scores P1 and P2. Thus, the above identification result can be directly translated into a multistage estimation procedure similar to that in Section 3. Moreover, the above identification strategy can further be extended to models with more than two endogenous dummy regressors. 6. Empirical Study Before the empirical study, we conduct several groups of Monte Carlo experiments to test our approach. Our estimator performs reasonably well in finite samples for each design. The bias, standard deviation (SD) and root mean square error (RMSE) are all acceptable. The SD and RMSE decline rapidly as sample size grows, and the magnitude of such a decline is consistent with Theorem 4.1. Our results seem quite robust with respect to the choice of quantile pairs in terms of either bias, SD or RMSE.8 In this section, we apply our methodology to the research of intergenerational transmission of smoking habits in British households. Loureiro et al. (2010) conducted similar research using a multivariate probit model to investigate whether the impact of parental smoking habits on their children's smoking decisions is a causal one. As unobservable preferences shared by children and their parents, endogeneity is a potential problem when obtaining consistent parameters. Loureiro et al. (2010) used instrumental variables to overcome this problem by selecting socio‐occupational indicators for the teenagers' grandparents. The assumption behind using instrumental variables is that the impact of parental socio‐economic status on smoking behaviour does not extend beyond one generation, after controlling for the relevant explanatory variables. This means that the socio‐occupational indicators of teenagers' grandparents are suitable instrumental variables. However, the multivariate normal distribution of disturbance terms seems to be a strong assumption, even after eliminating the possible endogeneity through the use of instrumental variables. Parametric models are criticized for possible large specification errors when the assumption deviates from reality. Because error terms cannot be justified in binary choice models, it is best to use semiparametric models to test the robustness of the results of Loureiro et al. (2010). Han and Vytlacil (2017) generalize the bivariate normality assumption on the latent error terms of a bivariate probit model through the use of copulas. Instead of requiring the joint distribution of latent error terms to be bivariate normal, they allow for the marginal distributions to be arbitrary but known, thus restricting their dependence structure by imposing their copula function based on a broad class of parametric copulas that includes the normal copula as a special case. For our model in this paper, it is robust to both heteroscedasticity and endogeneity. Moreover, our methodology enables a purely nonparametric selection mechanism that is more general than the linear single‐index form of Loureiro et al. (2010). In this regard, the model in Han and Vytlacil (2017) and our model seem to be very meaningful for checking the robustness of the model of Loureiro et al. (2010). Now we apply the three models discussed above to investigate whether the impact of maternal smoking habits on children's smoking decisions is a causal one in a single‐mother household. Our data are drawn from the British Household Panel Survey (BHPS). The sample covers the period 1991–2008, with 7,053 individual observations (see Table 1). In the analysis, the outcome variable smoker/Y is whether the teenager smokes or not. The questionnaire information is processed using the same method as applied in Loureiro et al. (2010): children aged 11–15 years are asked the question. How many cigarettes did you smoke in the last seven days? If a child reports to have smoked at least one cigarette in the last one week, the child is classified as a smoker. Children older than 15 years are asked to answer the direct question as to whether or not they categorize themselves as smokers. This is included in the BHPS. Five explanatory variables are included: msmoker/D, which is a binary endogenous dummy variable for whether a single mother smokes or not; cage/X1, which is a child's age at the interview year; mage/X2, which is a mother's age at the interview year; medu/X3, which are dummy variables for mother's highest educational degree obtained (four options are classified in the first group – higher degree, first degree, teaching qualification and other higher qualification – and the other options are classified in another group); mjob/X4, which is a mother's occupational level (four levels are classified in the first group – professional occupation, managerial and technical, skilled non‐manual and skilled manual – and the other three are classified in another group – never had a job, partly skilled and unskilled); lnmincome/X5, which is the log for monthly household income that is expressed in 1996 pounds, as in Loureiro et al. (2010). The instrumental variable is the teenagers' grandparents' socio‐economic status: gf's job/Z1, which is the social occupation level of the teenager's grandfather when the teenager's mother was 14 years old; gm's job/Z2, which is the social occupation level of the teenager's grandmother when the teenager's mother was 14 years old. The grandparents' jobs are processed in the same way as the mother's job, except for the grandfathers labelled ‘armed force’, who we categorize in the second group. Table 1. Descriptive Statistics Variable Description (7,053 observations) Mean SD Min Max smoker Whether child is a smoker 0.2712 0.4446 0 1 msmoker Whether single mother is a smoker 0.4987 0.5000 0 1 cage Child's age at interview year 15.7757 2.5706 11 20 mage Single mother's age at interview year 42.0563 5.5699 26 65 medu Whether single mother 0.4273 0.4947 0 1 has higher education mjob Whether single mother's occupation 0.6701 0.4702 0 1 is classified as high level lnmincome Natural logarithm to 7.6204 0.5484 4.1950 10.4656 monthly household income gf's job Social occupation level of grandfather 0.6696 0.4704 0 1 when single mother was 14 gm's job Social occupation level of grandmother 0.3338 0.4716 0 1 when single mother was 14 Variable Description (7,053 observations) Mean SD Min Max smoker Whether child is a smoker 0.2712 0.4446 0 1 msmoker Whether single mother is a smoker 0.4987 0.5000 0 1 cage Child's age at interview year 15.7757 2.5706 11 20 mage Single mother's age at interview year 42.0563 5.5699 26 65 medu Whether single mother 0.4273 0.4947 0 1 has higher education mjob Whether single mother's occupation 0.6701 0.4702 0 1 is classified as high level lnmincome Natural logarithm to 7.6204 0.5484 4.1950 10.4656 monthly household income gf's job Social occupation level of grandfather 0.6696 0.4704 0 1 when single mother was 14 gm's job Social occupation level of grandmother 0.3338 0.4716 0 1 when single mother was 14 View Large Table 1. Descriptive Statistics Variable Description (7,053 observations) Mean SD Min Max smoker Whether child is a smoker 0.2712 0.4446 0 1 msmoker Whether single mother is a smoker 0.4987 0.5000 0 1 cage Child's age at interview year 15.7757 2.5706 11 20 mage Single mother's age at interview year 42.0563 5.5699 26 65 medu Whether single mother 0.4273 0.4947 0 1 has higher education mjob Whether single mother's occupation 0.6701 0.4702 0 1 is classified as high level lnmincome Natural logarithm to 7.6204 0.5484 4.1950 10.4656 monthly household income gf's job Social occupation level of grandfather 0.6696 0.4704 0 1 when single mother was 14 gm's job Social occupation level of grandmother 0.3338 0.4716 0 1 when single mother was 14 Variable Description (7,053 observations) Mean SD Min Max smoker Whether child is a smoker 0.2712 0.4446 0 1 msmoker Whether single mother is a smoker 0.4987 0.5000 0 1 cage Child's age at interview year 15.7757 2.5706 11 20 mage Single mother's age at interview year 42.0563 5.5699 26 65 medu Whether single mother 0.4273 0.4947 0 1 has higher education mjob Whether single mother's occupation 0.6701 0.4702 0 1 is classified as high level lnmincome Natural logarithm to 7.6204 0.5484 4.1950 10.4656 monthly household income gf's job Social occupation level of grandfather 0.6696 0.4704 0 1 when single mother was 14 gm's job Social occupation level of grandmother 0.3338 0.4716 0 1 when single mother was 14 View Large We first conduct our study using the parametric model. The bivariate probit model is written as Y=1{Xi′β0+Diδ0>Ui}, (6.1) D=1{Zi′γ0>Vi}, (6.2) where Xi=(1,X1i,X2i,X3i,X4i,X5i)′ ⁠, Zi=(X2i,X3i,X4i,X5i,Z1i,Z2i)′ ⁠. The disturbance term (U,V) is assumed to be a bivariate normal distribution. Note that Z does not include X1 because the child's age does not influence the mother's smoking decision. Loureiro et al. (2010) processed X1 using the same method. To illustrate this point, we first run a probit model for msmoker on all the explanatory and instrumental variables. The results obtained are given in Table 2 using stata. Table 2. Mother'S Smoking Variable Mother is a smoker Coefficient SD cage 0.0023 0.0065 mage −0.0259*** 0.0030 medu −0.2476*** 0.0322 mjob −0.3369*** 0.0342 lnmincome −0.0472* 0.0286 gf's job −0.2440*** 0.0331 gm's job −0.0036 0.0327 intercept 1.9025*** 0.2331 Variable Mother is a smoker Coefficient SD cage 0.0023 0.0065 mage −0.0259*** 0.0030 medu −0.2476*** 0.0322 mjob −0.3369*** 0.0342 lnmincome −0.0472* 0.0286 gf's job −0.2440*** 0.0331 gm's job −0.0036 0.0327 intercept 1.9025*** 0.2331 Note: ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Table 2. Mother'S Smoking Variable Mother is a smoker Coefficient SD cage 0.0023 0.0065 mage −0.0259*** 0.0030 medu −0.2476*** 0.0322 mjob −0.3369*** 0.0342 lnmincome −0.0472* 0.0286 gf's job −0.2440*** 0.0331 gm's job −0.0036 0.0327 intercept 1.9025*** 0.2331 Variable Mother is a smoker Coefficient SD cage 0.0023 0.0065 mage −0.0259*** 0.0030 medu −0.2476*** 0.0322 mjob −0.3369*** 0.0342 lnmincome −0.0472* 0.0286 gf's job −0.2440*** 0.0331 gm's job −0.0036 0.0327 intercept 1.9025*** 0.2331 Note: ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large The results in Table 2 confirm the following conjecture: the age of a child is not an explanatory variable for a mother's smoking habit. We run bivariate probit models for (6.1) and (6.2), to obtain the results shown in Table 3. Table 3. Bivariate Probit Coefficients Variable Child Mother Coefficient SD Coefficient SD msmoker 0.4453*** 0.0826 – – cage 0.1974*** 0.0080 – – mage −0.0098*** 0.0034 −0.0255*** 0.0028 medu −0.0736** 0.0367 −0.2485*** 0.0321 mjob −0.1233*** 0.0388 −0.3369*** 0.0342 lnmincome −0.1110*** 0.0314 −0.0462* 0.0281 gf's job – – −0.2462*** 0.0331 gm's job – – 0.0005 0.0327 intercept −2.6555*** 0.2778 1.9191*** 0.2321 Variable Child Mother Coefficient SD Coefficient SD msmoker 0.4453*** 0.0826 – – cage 0.1974*** 0.0080 – – mage −0.0098*** 0.0034 −0.0255*** 0.0028 medu −0.0736** 0.0367 −0.2485*** 0.0321 mjob −0.1233*** 0.0388 −0.3369*** 0.0342 lnmincome −0.1110*** 0.0314 −0.0462* 0.0281 gf's job – – −0.2462*** 0.0331 gm's job – – 0.0005 0.0327 intercept −2.6555*** 0.2778 1.9191*** 0.2321 Note: ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Table 3. Bivariate Probit Coefficients Variable Child Mother Coefficient SD Coefficient SD msmoker 0.4453*** 0.0826 – – cage 0.1974*** 0.0080 – – mage −0.0098*** 0.0034 −0.0255*** 0.0028 medu −0.0736** 0.0367 −0.2485*** 0.0321 mjob −0.1233*** 0.0388 −0.3369*** 0.0342 lnmincome −0.1110*** 0.0314 −0.0462* 0.0281 gf's job – – −0.2462*** 0.0331 gm's job – – 0.0005 0.0327 intercept −2.6555*** 0.2778 1.9191*** 0.2321 Variable Child Mother Coefficient SD Coefficient SD msmoker 0.4453*** 0.0826 – – cage 0.1974*** 0.0080 – – mage −0.0098*** 0.0034 −0.0255*** 0.0028 medu −0.0736** 0.0367 −0.2485*** 0.0321 mjob −0.1233*** 0.0388 −0.3369*** 0.0342 lnmincome −0.1110*** 0.0314 −0.0462* 0.0281 gf's job – – −0.2462*** 0.0331 gm's job – – 0.0005 0.0327 intercept −2.6555*** 0.2778 1.9191*** 0.2321 Note: ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large The results shown in Table 3 are similar to those shown for Loureiro et al. (2010). Next, we relax the assumptions slightly and apply the methods used in Han and Vytlacil (2017). The model in Han and Vytlacil (2017) can be summarized as Y=1{Xi′β0+Diδ0>Ui}, (6.3) D=1{Zi′γ0>Vi}, (6.4) where (U,V)⊥(X,Z) ⁠. Marginal distributions FU and FV are assumed to be known, with E[U]=E[V]=0 and Var (U)= Var (V)=1 ⁠. The joint distribution (U,V) is assumed, satisfying some copula transformation; i.e. (U,V)∼FUV(U,V)=C(FU(U),FV(V);ρ) ⁠, where C(·,·;ρ) is a copula known up to scale parameter ρ∈Ω such that C:(0,1)2→(0,1) is twice differentiable in its argument and ρ. To apply Han and Vytlacil (2017) in this empirical study, we need to assume joint distribution of the error term (U,V) ⁠. As different copula functions give similar results, we choose the Farlie–Gumbel–Morgenstern family, i.e. C(u1,u2;ρ)=u1u2+ρu1u2(1−u1)(1−u2), for ρ∈[−1,1] ⁠. We list the results in Table 4. Table 4. Generalized Bivariate Probit Model With Copula Variable Child Mother Coefficient SD Coefficient SD msmoker 0.2751*** 0.0278 – – cage 0.1648*** 0.0064 – – mage −0.0093*** 0.0029 −0.0211*** 0.0023 medu −0.0760*** 0.0301 −0.2168*** 0.0255 mjob −0.1132*** 0.0312 −0.2815*** 0.0293 lnmincome −0.0965*** 0.0266 −0.0352 0.0229 gf's job – – −0.1969*** 0.0273 gm's job – – −0.0031 0.0271 intercept −2.0643*** 0.2200 1.5524*** 0.1950 Variable Child Mother Coefficient SD Coefficient SD msmoker 0.2751*** 0.0278 – – cage 0.1648*** 0.0064 – – mage −0.0093*** 0.0029 −0.0211*** 0.0023 medu −0.0760*** 0.0301 −0.2168*** 0.0255 mjob −0.1132*** 0.0312 −0.2815*** 0.0293 lnmincome −0.0965*** 0.0266 −0.0352 0.0229 gf's job – – −0.1969*** 0.0273 gm's job – – −0.0031 0.0271 intercept −2.0643*** 0.2200 1.5524*** 0.1950 Note: The results are obtained from copula functions of the Farlie–Gumbel–Morgenstern family. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Table 4. Generalized Bivariate Probit Model With Copula Variable Child Mother Coefficient SD Coefficient SD msmoker 0.2751*** 0.0278 – – cage 0.1648*** 0.0064 – – mage −0.0093*** 0.0029 −0.0211*** 0.0023 medu −0.0760*** 0.0301 −0.2168*** 0.0255 mjob −0.1132*** 0.0312 −0.2815*** 0.0293 lnmincome −0.0965*** 0.0266 −0.0352 0.0229 gf's job – – −0.1969*** 0.0273 gm's job – – −0.0031 0.0271 intercept −2.0643*** 0.2200 1.5524*** 0.1950 Variable Child Mother Coefficient SD Coefficient SD msmoker 0.2751*** 0.0278 – – cage 0.1648*** 0.0064 – – mage −0.0093*** 0.0029 −0.0211*** 0.0023 medu −0.0760*** 0.0301 −0.2168*** 0.0255 mjob −0.1132*** 0.0312 −0.2815*** 0.0293 lnmincome −0.0965*** 0.0266 −0.0352 0.0229 gf's job – – −0.1969*** 0.0273 gm's job – – −0.0031 0.0271 intercept −2.0643*** 0.2200 1.5524*** 0.1950 Note: The results are obtained from copula functions of the Farlie–Gumbel–Morgenstern family. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large The results in Table 4 are similar to those in Table 3 in either value or sign. The significance is almost the same for the two tables, except for monthly household income (lnmincome), which shows 10% significance for mother's smoking in Table 3, but no significance in Table 4. Note that as the different models have different assumptions, it is not very relevant to compare the coefficients across different models directly. Thus, we compare the probability of the child smoking between households where mothers are smokers and non‐smokers, testing whether the mother being a smoker significantly increases the probability of the child smoking. As we know, the probability associated with the child smoking is P(Y=1|x′β0,d=1)=P(UUi}, (6.7) Di=1{m(Zi)>Vi}, (6.8) where Xi and Zi are the same. The disturbance terms U and V are allowed to contain heteroscedasticity, while the selection function (6.8) is a purely nonparametric mechanism. Obviously, the results obtained from (6.7) and (6.8) are more credible compared to the parametric model results. From Section 4, we incorporate the sixth‐order kernel function for all stages under this condition. Specifically, the integrated kernel K1(·) is used as K1(t)=12+315204815t−1403t3+3785t5−3967t7+1439t91[|t|≤1]+1[t>1], (6.9) and for the other kernel function, we use ks(·) as ks(t)=158−54t2+18t412πexp−t22, (6.10) for s=0,1,2 ⁠, which is a Gaussian kernel of order 6. The bandwidth is chosen following the assumptions in Section 4. In this part, we consider (τ1,τ2)∈{(0.3,0.7),(0.4,0.6)} and we explore the robustness of our estimation results with respect to the selection of quantile pairs. From Table 3 and the results of Loureiro et al. (2010), because the coefficient is only identified up to scale, we choose to normalize on the child's age coefficient (X1). From the literature, we know that the impact of age on the child's smoking habit is significantly positive. Thus, for model identification, we assume that the sign is positive and the coefficient of X1 is unity. Moreover, we compute the asymptotic variance using bootstrap to test for the significance of each coefficient. The results are given in Table 6. Table 6. Estimation Coefficients of Generalized Binary Choice Model Variable (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) Coefficient SD Coefficient SD msmoker 1.3313*** 0.1485 1.0894*** 0.1460 mage −0.0962*** 0.0118 −0.1021*** 0.0137 medu 0.0280 0.1083 0.1136 0.1319 mjob −0.6051*** 0.1206 −0.6718*** 0.1476 lnmincome −0.9270*** 0.1484 −1.1202*** 0.1628 intercept −7.5659*** 1.1360 −5.4709*** 1.2671 Variable (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) Coefficient SD Coefficient SD msmoker 1.3313*** 0.1485 1.0894*** 0.1460 mage −0.0962*** 0.0118 −0.1021*** 0.0137 medu 0.0280 0.1083 0.1136 0.1319 mjob −0.6051*** 0.1206 −0.6718*** 0.1476 lnmincome −0.9270*** 0.1484 −1.1202*** 0.1628 intercept −7.5659*** 1.1360 −5.4709*** 1.2671 Note: Standard errors have been computed by bootstrap. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Table 6. Estimation Coefficients of Generalized Binary Choice Model Variable (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) Coefficient SD Coefficient SD msmoker 1.3313*** 0.1485 1.0894*** 0.1460 mage −0.0962*** 0.0118 −0.1021*** 0.0137 medu 0.0280 0.1083 0.1136 0.1319 mjob −0.6051*** 0.1206 −0.6718*** 0.1476 lnmincome −0.9270*** 0.1484 −1.1202*** 0.1628 intercept −7.5659*** 1.1360 −5.4709*** 1.2671 Variable (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) Coefficient SD Coefficient SD msmoker 1.3313*** 0.1485 1.0894*** 0.1460 mage −0.0962*** 0.0118 −0.1021*** 0.0137 medu 0.0280 0.1083 0.1136 0.1319 mjob −0.6051*** 0.1206 −0.6718*** 0.1476 lnmincome −0.9270*** 0.1484 −1.1202*** 0.1628 intercept −7.5659*** 1.1360 −5.4709*** 1.2671 Note: Standard errors have been computed by bootstrap. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Because of different assumptions on the disturbance terms in different models, it is meaningless to compare the absolute values of the coefficients directly. What we are concerned with is the sign and significance in the different models. The results in Table 6 are similar to the results in Tables 3 and 4 in either sign or significance. The key regressor, mother smoker (msmoker), is still positive and significant at the 1% level. The only exception is the mother's education, with a coefficient that is slightly positive but not significant in Table 6, negative and significant at the 5% level in Table 3 and negative and significant at the 1% level in Table 4. The logic behind this could be that the mother's education is not a significantly influential element in the child's smoking decision after controlling for other elements, such as whether the mother smokes or not, and the household income. Compared to Loureiro et al. (2010), our finding is less susceptible to misspecification of functional forms in the selection equation or any type of parametric assumption on the distribution of the error terms. The result provides strong statistical evidence that the endogenous mother's smoking indicator has a causal effect on the teenager's smoking habit. Moreover, our estimator seems quite robust against the selection of quantile pairs. Next, we predict the probability of the child smoking, given a smoking or non‐smoking mother. From the relationship between probability and the conditional mean in binary choice models, we have P(Y=1|x′β0,d=1)=E[Y|x′β0,d=1]=∑i=1nk((Xi′β0−x′β0)/h)1(Di=1)Yi∑i=1nk((Xi′β0−x′β0)/h)1(Di=1), (6.11) P(Y=1|x′β0,d=0)=E[Y|x′β0,d=0]=∑i=1nk((Xi′β0−x′β0)/h)1(Di=0)Yi∑i=1nk((Xi′β0−x′β0)/h)1(Di=0), (6.12) where (6.11) is the probability of a child smoking, given a smoking mother, and (6.12) is the probability of a child smoking, given a non‐smoking mother. Here, k(·) is a standard Gaussian kernel, and h is the corresponding bandwidth. The results are given in Table 7. Table 7. Predicted Probabilities of Child Smoking (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) msmoker=1 p̂1=0.3218*** (0.0066) p̂1=0.3212*** (0.0074) msmoker=0 p̂0=0.2204*** (0.0062) p̂0=0.2208*** (0.0065) p̂1−p̂0=0.1014*** (0.0095) p̂1−p̂0=0.1004*** (0.0096) (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) msmoker=1 p̂1=0.3218*** (0.0066) p̂1=0.3212*** (0.0074) msmoker=0 p̂0=0.2204*** (0.0062) p̂0=0.2208*** (0.0065) p̂1−p̂0=0.1014*** (0.0095) p̂1−p̂0=0.1004*** (0.0096) Note: The results are based on the model in this paper. The standard errors in parentheses have been computed by bootstrap. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large Table 7. Predicted Probabilities of Child Smoking (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) msmoker=1 p̂1=0.3218*** (0.0066) p̂1=0.3212*** (0.0074) msmoker=0 p̂0=0.2204*** (0.0062) p̂0=0.2208*** (0.0065) p̂1−p̂0=0.1014*** (0.0095) p̂1−p̂0=0.1004*** (0.0096) (τ1,τ2)=(0.3,0.7) (τ1,τ2)=(0.4,0.6) msmoker=1 p̂1=0.3218*** (0.0066) p̂1=0.3212*** (0.0074) msmoker=0 p̂0=0.2204*** (0.0062) p̂0=0.2208*** (0.0065) p̂1−p̂0=0.1014*** (0.0095) p̂1−p̂0=0.1004*** (0.0096) Note: The results are based on the model in this paper. The standard errors in parentheses have been computed by bootstrap. ***, ** and * indicate statistical significance at the 1%, 5% and 10% levels, respectively. View Large From Table 7, the predicted probability calculated using our model shows some similarity with that calculated with parametric pairs. The difference of 10 percentage points between the household with a smoking mother and the household with a non‐smoking mother, with a 1% significance, indicates that the mother's smoking behaviour has a great influence on the child smoking. This illustrates the importance of cigarette control by the government. Finally, we calculate the goodness of fit for different models. Various pseudo‐R2 measures have been proposed for binary choice models. McFadden (1974) suggests the measure 1−(Lur/L0) ⁠, where Lur is the log‐likelihood function for the estimated model and L0 is the log‐likelihood function for the model with only one intercept. For our case, this method cannot be applied because our model does not assume the distribution of the error terms, making the calculation of a log‐likelihood function impossible. The other goodness‐of‐fit measure usually used is the correctly predicted percentage. For each individual i, we compute the predicted probability for Yi=1 ⁠, given the explanatory regressors Xi and Di ⁠. If P(Yi=1|Xi′β0+Diδ0)≥0.5 ⁠, then we predict Yi to be 1; if P(Yi=1|Xi′β0+Diδ0)<0.5 ⁠, then Yi is predicted to be zero. The percentage of predicted Yi that matches the actual Yi is the correctly predicted percentage. This method applies to both the bivariate probit model (i.e. the generalized bivariate probit model using copula) and our generalized model. We use (6.5) and (6.11) to calculate P(Yi=1|Xi′β0+Diδ0) under different models. The results are given in Table 8. Table 8. Goodness of Fit: Correctly Predicted Percentage Bivariate probit model 0.7319 Generalized bivariate probit model using copula 0.6996 Our expanding model with (τ1,τ2)=(0.3,0.7) 0.7322 Our expanding model with (τ1,τ2)=(0.4,0.6) 0.7323 Bivariate probit model 0.7319 Generalized bivariate probit model using copula 0.6996 Our expanding model with (τ1,τ2)=(0.3,0.7) 0.7322 Our expanding model with (τ1,τ2)=(0.4,0.6) 0.7323 View Large Table 8. Goodness of Fit: Correctly Predicted Percentage Bivariate probit model 0.7319 Generalized bivariate probit model using copula 0.6996 Our expanding model with (τ1,τ2)=(0.3,0.7) 0.7322 Our expanding model with (τ1,τ2)=(0.4,0.6) 0.7323 Bivariate probit model 0.7319 Generalized bivariate probit model using copula 0.6996 Our expanding model with (τ1,τ2)=(0.3,0.7) 0.7322 Our expanding model with (τ1,τ2)=(0.4,0.6) 0.7323 View Large From Table 8, the goodness of fit exhibits some similarity across the three models, making the assumption of the normal distribution of the error term credible in this case. Because the outcome depends greatly on the assumption in the binary choice model, we need to weaken the model assumption and generalize the model to obtain robust results. Our method is useful to test for the robustness of parametric models. 7. Conclusion In this paper, we propose a novel method for identifying and estimating the causal effects of dummy endogenous variables in a heteroscedastic binary response model. The model allows for multiple dummy endogenous variables and a nonparametric selection mechanism, thus offering an estimation procedure for cases not previously covered in the literature. The estimator can deliver a rate of convergence in probability close to the parametric rate, which is nh1 consistent and asymptotically normal. A Monte Carlo simulation shows that the estimator works well for small samples.9 A real‐data example, which re‐analyses the causal effect of intergenerational transmission of smoking habits, is provided to illustrate our estimator. We highlight two important issues for future research. The first is the choice of quantiles. In our paper, the results seem to be robust with different quantile pairs. However, as each quantile corresponds with a generalized moment condition, there should exist an optimal quantile pair to make the most efficient estimator in theory that could obtain the smallest asymptotic variance. The second is the choice of optimal bandwidth h1. Theorem 4.1 shows that we can obtain either a fast convergence speed or a non‐biased estimator by choosing a different bandwidth h1. However, it is not clear which is better in practice and this is an issue we leave for further research. Acknowledgement The authors are grateful for the hard work and helpful comments provided by the editor Jaap Abbring and two anonymous referees. Deep gratitude also goes to Professor Hanghui Zhang for his illuminating guidance and suggestions on several key problems. Z. Zhang is also affiliated with the Key Laboratory of Mathematical Economics (SUFE), Ministry of Education. The research is supported by the National Science Foundation of China (Grant No. 71501116), the 2018 Program for Innovative Research Team of Shanghai University of Finance and Economics and the Innovation Fund of Shanghai University of Finance and Economics (Grant No.CXJJ‐2015‐354). The authors are responsible for all remaining errors in this paper. Footnotes 1 X may include a constant term. 2 Throughout the paper, any parameter with subscript zero denotes the true parameter generating the data. 3 See (2.1)–(2.3) for more details. 4 Only n1/3 for the standard maximum score approach, with a nonstandard limiting distribution; see Kim and Pollard (1990). It will be even slower here because we first condition and then take the maximum score. We are grateful to the referee for noting this point. 5 For more specific details on the Monte Carlo results, see online Appendix B. 6 For more specific details, see online Appendix C. 7 For more specific details, see online Appendix C. 8 For more specific details, see online Appendix B. 9 For more specific details, see online Appendix B. References Abrevaya , J. , J. A. Hausman and S. Khan ( 2010 ). Testing for causal effects in a generalized regression model with endogenous regressors . Econometrica 78 , 2043 – 61 . Google Scholar Crossref Search ADS Ahn , H. and J. L. Powell ( 1993 ). Semiparametric estimation of censored selection models with a nonparametric selection mechanism . Journal of Econometrics 58 , 3 – 29 . Google Scholar Crossref Search ADS Amemiya , T. ( 1978 ). The estimation of a simultaneous equation generalized probit model . Econometrica 46 , 1193 – 205 . Google Scholar Crossref Search ADS Angrist , J. D. and W. N. Evans ( 1998 ). Children and their parents labor supply: evidence from exogenous variation in family size . American Economic Review 88 ( 3 ), 450 – 77 . Blundell , R. and J. L. Powell ( 2004 ). Endogeneity in semiparametric binary response models . Review of Economic Studies 71 , 655 – 79 . Google Scholar Crossref Search ADS Buchinsky , M. ( 1998 ). Recent advances in quantile regression models: a practical guideline for empirical research . Journal of Human Resources 33 , 88 – 126 . Google Scholar Crossref Search ADS Chamberlain , G. ( 1992 ). Efficiency bounds for semiparametric regression . Econometrica 60 , 567 – 96 . Google Scholar Crossref Search ADS Chen , S. , B. D. Dahl and S. Khan ( 2005 ). Nonparametric identification and estimation of a censored location‐scale regression model . Journal of the American Statistical Association 100 , 212 – 21 . Google Scholar Crossref Search ADS Chen , S. and S. Khan ( 2000 ). Estimating censored regression models in the presence of nonparametric multiplicative heteroskedasticity . Journal of Econometrics 98 , 283 – 316 . Google Scholar Crossref Search ADS Chen , S. and S. Khan ( 2003a ). Semiparametric estimation of a heteroscedastic sample selection model . Econometric Theory 19 , 1040 – 64 . Google Scholar Crossref Search ADS Chen , S. and S. Khan ( 2003b ). Rates of convergence for estimating regression coefficients in heteroskedastic discrete response models . Journal of Econometrics 117 , 245 – 78 . Google Scholar Crossref Search ADS Chen , S. and Y. Zhou ( 2010 ). Semiparametric and nonparametric estimation of sample selection models under symmetry . Journal of Econometrics 157 , 143 – 50 . Google Scholar Crossref Search ADS Donald , S. G. ( 1995 ). Two‐step estimation of heteroskedastic sample selection models . Journal of Econometrics 65 , 347 – 80 . Google Scholar Crossref Search ADS Dong , Y. and A. Lewbel ( 2015 ). A simple estimator for binary choice models with endogenous regressors . Econometric Reviews 34 , 82 – 105 . Google Scholar Crossref Search ADS Engle , R. F. ( 1982 ). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation . Econometrica 50 , 987 – 1007 . Google Scholar Crossref Search ADS Han , S. and E. J. Vytlacil ( 2017 ). Identification in a generalization of bivariate probit models with dummy endogenous regressors . Journal of Econometrics 199 , 63 – 73 . Google Scholar Crossref Search ADS Harvey , A. C. ( 1976 ). Estimating regression models with multiplicative heteroscedasticity . Econometrica 44 , 461 – 65 . Google Scholar Crossref Search ADS Heckman , J. J. ( 1978 ). Dummy endogenous variables in a simultaneous equations system . Econometrica 46 , 931 – 60 . Google Scholar Crossref Search ADS Hoderlein , S. and R. Sherman ( 2015 ). Identification and estimation in a correlated random coefficients binary response model . Journal of Econometrics 188 , 135 – 49 . Google Scholar Crossref Search ADS Horowitz , J. L. ( 1992 ). A smoothed maximum score estimator for the binary response model . Econometrica 60 , 505 – 31 . Google Scholar Crossref Search ADS Khan , S. ( 2013 ). Distribution free estimation of heteroskedastic binary response models using Probit/Logit criterion functions . Journal of Econometrics 172 , 168 – 82 . Google Scholar Crossref Search ADS Kim , J. and D. Pollard ( 1990 ). Cube root asymptotics . Annals of Statistics 18 , 191 – 219 . Google Scholar Crossref Search ADS Krief , J. M. ( 2014 ). An integrated kernel‐weighted smoothed maximum score estimator for the partially linear response model . Econometric Theory 30 , 647 – 75 . Google Scholar Crossref Search ADS Lewbel , A. ( 2000 ). Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrument variables . Journal of Econometrics 7 , 145 – 77 . Google Scholar Crossref Search ADS Li , Q. , C. J. Huang , D. Li and T. T. Fu ( 2002 ). Semiparametric smooth coefficient models . Journal of Business and Economics Statisitics 20 , 412 – 22 . Google Scholar Crossref Search ADS Li , Q. and J. S. Racine ( 2007 ). Nonparametric Econometric: Theory and Practice . Princeton, NJ : Princeton University Press . Loureiro , M. L. , A. Sanz‐de‐galdeano and D. Vuri ( 2010 ). Smoking habits: like father, like son, like mother, like daughter . Oxford Bulletin of Economics and Statistics 72 , 717 – 43 . Google Scholar Crossref Search ADS McFadden , D. ( 1974 ). Conditional logit analysis of quatitative choice behavior. In P. Zarembka (Ed.), Frontiers in Econometrics , Chapter 4, 105 – 42 . New York, NY : Academic Press . Manski , C. F. ( 1985 ). Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator . Journal of Econometrics 27 , 313 – 33 . Google Scholar Crossref Search ADS Melenberg , B. and A. Van Soest ( 1996 ). Parametric and semiparametric modelling of vacation expenditures . Journal of Applied Econometrics 11 , 59 – 76 . Google Scholar Crossref Search ADS Newey , W. K. and T. M. Stoker ( 1993 ). Efficiency of weighted average derivative estimators and index models . Econometrica 61 , 1199 – 223 . Google Scholar Crossref Search ADS Powell , J. L. ( 1986 ). Symmetrically trimmed least squares estimation for tobit models . Econometrica 54 , 1435 – 60 . Google Scholar Crossref Search ADS Powell , J. L. ( 1994 ). Estimation of semiparametric models. In R. Engle and D. McFadden (Eds.), Handbook of Econometrics , Volume 4, 2444 – 521 . Amsterdam : North‐Holland . Powell , J. L. , J. H. Stock and T. M. Stoker ( 1989 ). Semiparametric estimation of index coefficients . Econometrica 57 , 1403 – 30 . Google Scholar Crossref Search ADS Robinson , P. M. ( 1989 ). Hypothesis testing in semiparametric and nonparametric models for econometric time series . Review of Economic Studies 56 , 511 – 34 . Google Scholar Crossref Search ADS Rothe , C. ( 2009 ). Semiparametric estimation of binary response models with endogenous regressors . Journal of Econometrics 153 , 51 – 64 . Google Scholar Crossref Search ADS Vytlacil , E. and N. Yildiz ( 2007 ). Dummy endogenous variables in weakly separable models . Econometrica 75 , 757 – 79 . Google Scholar Crossref Search ADS Yildiz , N. ( 2013 ). Estimation of binary choice models with linear index and dummy endogenous variables . Econometric Theory 29 , 354 – 92 . Google Scholar Crossref Search ADS Zhang , Z. ( 2016 ). Semiparametric estimation of partially linear transformation models under conditional quantile restriction . Econometric Theory 32 , 458 – 97 . Google Scholar Crossref Search ADS Zhang , Z. and X. He ( 2012 ). Estimation of a heteroscedastic binary choice model with an endogenous dummy regressor . Economics Letters 117 , 753 – 57 . Google Scholar Crossref Search ADS Appendix: Proofs of the Lemmata 2.1 Proof of Lemma Because a quantile function is the inverse of a distribution function, it suffices to show that the conditional distribution function of ε relies on (X,Z,D) only through (P,D) ⁠. If D=1 ⁠, then Pr{ε≤c1|D=1,X=x,Z=z}=Pr{ε≤c1|υ0⇔Pr{Y∗>0|x,z,d}>1−τ⇔E[Y|x,z,d]>1−τ. From the rank condition and by applying the standard identification argument in Manski (1985), we can show that (β10,ϕ(τ,w)) is identified up to scale for each w. (β10,ϕ(τ,w)) is the unique maximizer of E[(Y−(1−τ))1(X1β1+a>0)|W=w] with respect to (β1,a) ⁠, where 1( · ) is the indicator function. □ 5.1 Proof of Lemma Because the conditional quantile function can be obtained by inverting the conditional CDF, it suffices to show that the conditional CDF of ε relies on (D1,D2,X,Z) only through (P1,P2,D1,D2) ⁠. If D1=D2=1 ⁠, then Pr{ε≤c|D1=1,D2=1,X=x,Z=z}=Pr{ε≤c|υ1