# Frequency-domain full-waveform inversion with non-linear descent directions

Frequency-domain full-waveform inversion with non-linear descent directions Summary Full-waveform inversion (FWI) is a highly non-linear inverse problem, normally solved iteratively, with each iteration involving an update constructed through linear operations on the residuals. Incorporating a flexible degree of non-linearity within each update may have important consequences for convergence rates, determination of low model wavenumbers and discrimination of parameters. We examine one approach for doing so, wherein higher order scattering terms are included within the sensitivity kernel during the construction of the descent direction, adjusting it away from that of the standard Gauss–Newton approach. These scattering terms are naturally admitted when we construct the sensitivity kernel by varying not the current but the to-be-updated model at each iteration. Linear and/or non-linear inverse scattering methodologies allow these additional sensitivity contributions to be computed from the current data residuals within any given update. We show that in the presence of pre-critical reflection data, the error in a second-order non-linear update to a background of s0 is, in our scheme, proportional to at most (Δs/s0)3 in the actual parameter jump Δs causing the reflection. In contrast, the error in a standard Gauss–Newton FWI update is proportional to (Δs/s0)2. For numerical implementation of more complex cases, we introduce a non-linear frequency-domain scheme, with an inner and an outer loop. A perturbation is determined from the data residuals within the inner loop, and a descent direction based on the resulting non-linear sensitivity kernel is computed in the outer loop. We examine the response of this non-linear FWI using acoustic single-parameter synthetics derived from the Marmousi model. The inverted results vary depending on data frequency ranges and initial models, but we conclude that the non-linear FWI has the capability to generate high-resolution model estimates in both shallow and deep regions, and to converge rapidly, relative to a benchmark FWI approach involving the standard gradient. Inverse theory, Waveform inversion, Theoretical seismology, Wave scattering and diffraction 1 INTRODUCTION Seismic full-waveform inversion, or FWI Lailly (1983); Tarantola (1984); Virieux & Operto (2009); Métivier et al. (2017) has recently undergone a period of rapid development and refinement, and as a tool for high-resolution velocity model building appears to be reaching a relatively mature state. Multiparameter FWI Operto et al. (2013); Plessix et al. (2013); Alkhalifah & Plessix (2014); Innanen (2014); Pan et al. (2016), wherein larger and more difficult parametrization and updating issues are encountered, is less well advanced. However, the prospect of having access to stable and well-characterized 3-D multiparameter and multicomponent FWI algorithms, employable as a monitoring tool for inferring static or changing rock physics properties of reservoirs, makes FWI the subject of continuing research. The relationship between seismic data and subsurface geological property variations is intrinsically non-linear. Inverse methodologies may neglect non-linearity, through linearization, but many current state of the art seismic inversion methods seek to account for it. Non-linearity can in principle be accounted for directly, or through iterations. In a direct approach, there is a single, complex, often series-based calculation involving non-linear operations on the data/residuals. In the iterative approach, there are repeated procedure of a simpler calculation, each using the results of the previous iteration and each involving a linear operation on the data/residuals. Non-linear inverse scattering formulations are example of the direct approach Zhang & Weglein (2009); Gauss–Newton formulations of FWI are example of the fully iterative approach. Full Newton methods, in which the gradient and the Hessian operator are both functions of the residuals (e.g. Pan et al.2016), involve updates which each depends non-linearly on the residuals, and so are in some sense hybrids. It is possible, in other words, to view directly non-linear and iteratively non-linear methods as being end-members of a spectrum, upon which a general point would correspond to an inverse methodology with both direct and iterative characteristics in some balance. However, few if any well-defined procedures exist by which one can create a Newton-type FWI algorithm that sits at a desired point along this spectrum, that is, in which iterative and direct aspects are balanced to some set of specifications. General non-linear sensitivity formalisms, directly related to the full scattering series, have been discussed in the context of seismic inversion Wu & Zheng (2014). The purpose of this paper is to set out, and examine, the basic analytical and numerical features of an approach for the flexible incorporation of non-linearity within each FWI update by including high-order scattering. Within FWI, the issue of non-linearity mixes with issues of resolution and the presence or absence of low-frequency data. Theoretically, FWI can recover both long- and short-wavelength structures, given wide-aperture acquisition systems and broad-band sources. When provided with adequate data, high-resolution velocity models can be constructed in shallow regions, where long-to-intermediate wavelength structures can be recovered Virieux & Operto (2009). The spatial resolution of FWI is governed by the relationship established in diffraction tomography between the scattering angle and the local model wavenumber component Sirgue & Pratt (2004); Alkhalifah (2015); Brossier et al. (2015). During updating, large-angle scattering data, for example, diving waves and post-critical reflections, are used to determine low-to-intermediate wavenumber model components, and small-angle scattering data are used to determine high-wavenumber model components. Without long offsets and low-frequency information, FWI cannot reconstruct the low-wavenumber components, in which case it behaves more like a least-squares migration, updating primarily the high-wavenumber components of the model. This requires an accurate initial velocity model. Data with offsets sufficiently long to illuminate deeper parts of geological volumes are infrequently acquired. To better retrieve long-wavelength model structures, and additionally to increase the chances of FWI converging to a global minimum, several strategies are available. Some effort may be expended towards building a starting model that is sufficiently similar to the actual model that only small adjustments to long-wavelength structures is needed; alternatively, or additionally, schemes which work to more completely extract low-frequency information from the data can be devised. One published approach to this involves using complex-valued frequency (or time-damping) methods such as Laplace-domain and Laplace–Fourier-domain inversions Shin & Cha (2008, 2009), or by using envelope information obtained from the Hilbert transform, such as envelope inversion Wu et al. (2014). Inversion methods have also been proposed to overcome cycle skipping directly, by expanding the unknown model space, for example, with the Wiener-filter coefficients as in adaptive waveform inversion Warner & Guasch (2016), with an additional dimension as in Biondi & Almomin (2014), or by expanding the search space as in van Leeuwen & Herrmann (2013). Low-wavenumber components of the model can also be retrieved directly from reflection data. Several methods have been proposed to accomplish this in both data and image domains, such as reflection waveform inversion (RWI, Xu et al.2012), migration-based traveltime tomography Chavent et al. (1994), differential semblance optimization Symes & Carazzone (1991) and migration velocity analysis Sava & Biondi (2004), etc. In data-domain methods, the model is decomposed into a low-wavenumber background part, which is to be updated, and a high-wavenumber perturbation part, which is temporarily assumed to be known, and usually obtained by a migration process at each iteration. These migrations provide approximate virtual sources in depth, increasing the transmission-like or tomographic wave paths available to the updating. This way, wide-scattering angles are obtained and the low-wavenumber components of the model are better constrained. Refractions can be used together with reflections to perform a better reconstruction of the background model Wang et al. (2015); Zhou et al. (2015). Simultaneous inversion of the background and perturbation model have also been studied in the data domain and mixed data/image domain Sun & Symes (2012); Biondi & Almomin (2014); Wu & Alkhalifah (2015); Alkhalifah & Wu (2016a) to mitigate non-linearities of the FWI formulation. If an FWI formulation with non-linear sensitivity kernel explicitly included can be stably formulated, which is the goal in this paper, it should be expected to update low- and high-wavenumber components of the model simultaneously. This is because the high-order terms in the non-linear sensitivity kernel is expected to introduce features similar to RWI, in that transmission paths from scattering points in the subsurface to both source and receiver points on the surface are included. Qualitatively, we formulate the problem as follows. We will work in the context of a simple multidimensional scalar framework (i.e. focusing on P-wave velocity only). Using standard Newton- or Gauss–Newton-type FWI as a starting point, we enact a specific approach to generalizing the sensitivity kernel calculation such that it becomes generally residual-dependent. The sensitivity kernel at iteration n is normally determined by comparing a variation δs of the medium sn against the resulting variation in the wave G propagating through sn. In this paper, we instead consider the variation to be with respect to the wave propagating in the n + 1th iterate, that is, the medium we are in the process of determining. This medium is different from sn by some amount we will call Δsn. Sensitivity kernel formulae arising from this variation involve a range of scattering processes, each involving one instance of δs and multiple instances of Δsn. These formulae are not immediately ready for use, because Δsn is an unknown, in fact it is in theory exactly the update we would like to solve for. However, it offers a way for residuals to be incorporated non-linearly. An analytic example of a non-linear update, based on this new sensitivity kernel, and containing first- and second-order terms in the residuals, is then computed to examine the manner in which non-linear amplitude information is included. The first iteration in the reconstruction of a single scalar boundary from a shot record of data, a problem which can be explored with fully analytic background wave propagation operators and data, is considered. The improved step length towards the true model when backscattered data at high angles and/or due to high contrasts are used, is shown. Such a formulation may have particular relevance for updates constructed using pre-critical reflection data. It has been shown that Gauss–Newton-type FWI algorithms respond to small-angle backscattered data, for example, pre-critical specular reflections, in a manner consistent with linearized inverse scattering and linearized Amplitude-Variation-with-Offset (AVO) inversion Innanen (2014). This means a multiparameter reflection FWI updating scheme manages parameter cross-talk in the same way as do AVO and linear inverse scattering, but, it also means that linearization error will affect Gauss–Newton FWI in the presence of specular reflections to the same degree. Linearization error appears in AVO inversion because backscattered wave amplitudes are generally non-linear in medium property variations. In the special case of two elastic half-spaces, for instance, the Zoeppritz equations define a highly non-linear relationship between reflection coefficients and relative changes in elastic properties across a reflecting boundary. In AVO, the relationship is typically linearized Aki & Richards (2002); Castagna & Backus (1993); Foster et al. (2010). Linearization error takes the form of a decrease in accuracy with an increase of incidence angle and/or magnitude of relative changes. This is one reason in AVO why density is difficult to separate from P-wave velocity (e.g. Stolt & Weglein 1985), and why certain anisotropic parameters are difficult to determine in azimuthal AVO (e.g. Mahmoudian et al.2015). These parameters are best distinguished at high angle, where the Aki–Richards formula and standard FWI sensitivity kernel are insufficiently accurate. Thus, standard Gauss–Newton FWI must be expected to be insensitive to reflected amplitude variations at large angles and when perturbations are large. The analytic example demonstrates that updates consistent with non-linear AVO approximations are possible when non-linear sensitivity kernel is invoked. We next analyse the numerical properties of this non-linear FWI updating scheme in a more general setting. We start from the construction of the non-linear sensitivity kernel. By adding a Δsn to the nth updated model, the wavefield can be divided into a background wavefield that is obtained in the nth updated model, and a perturbation wavefield. Using both the background wavefield and perturbation wavefield, we obtain the non-linear sensitivity kernel. This is implemented in a non-linear two-loop FWI scheme in the frequency domain. At each iteration, we first invert for the perturbation using a linear inversion scheme. This is the inner loop. We then use the newly obtained perturbation to generate the non-linear sensitivity kernels which are employed in the outer loop. This outer loop determines the non-linear descent direction. Naturally, without going to the trouble of calculating the non-linear descent direction in the outer loop, the perturbation obtained within the inner loop could be used to provide a direct update of the model, in a scheme similar to that of Kwak et al. (2014). However, from the numerical examples, we observe that the use of the two-loop approach is what incorporates the transmission wave paths from the scattering points to both sources and receivers at the surface, whose wide-scattering angles are responsible for updating deep low-wavenumber model components. Applications on the Marmousi model indicate that this non-linear FWI approach converges in fewer iterations than the conventional FWI, and is robust in the absence of rich low-frequency data. 2 FORMULATION In this paper, the space-frequency-domain isotropic acoustic wave equation with constant density is used to describe wave motion:   $$[\omega ^2s(\mathbf {r})+ \nabla ^2] G \left(\mathbf {r},\mathbf {r}_s,\omega |s \right)=-\delta (\mathbf {r}-\mathbf {r}_s),$$ (1)where ω is the frequency, s(r) is the squared slowness parameter s(r) = c − 2(r) and G(r, rs, ω|s) is the Green’s function generated in the model s(r) by an impulsive point source located at rs. When the point source has spectrum W(ω), the wavefield is P(r, rs, ω|s) = W(ω)G(r, rs, ω|s). 2.1 From linear to non-linear sensitivity kernel FWI seeks to estimate subsurface properties through an iterative process by minimizing the difference between the synthetic data and the observed data. At the nth iteration, the misfit function is usually given in a least-squares norm as   $$\phi (s_n)=\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega \Vert \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \Vert ^2,$$ (2)where δd(rg, rs, ω|sn) = d(rg, rs, ω) − P(rg, rs, ω|sn) is the data residual between the observed data d(rg, rs, ω) and the synthetic data P(rg, rs, ω|sn) for each source rs and receiver rg of the seismic survey, and ‖.‖ is the L2 norm. The synthetic data are obtained by extracting the wavefield P(r, rs, ω|sn), modeled in the model sn(r) of the current iteration, at the receiver locations for each source. It does so by iteratively updating the current model sn(r) to sn + 1(r) using a perturbation Δsn(r)   $$s_{n+1}(\mathbf {r})=s_n(\mathbf {r})+\Delta s_n(\mathbf {r}),$$ (3)which can be determined by finding the minimum of the misfit function in the vicinity of the current model sn(r) Pratt et al. (1998); Virieux & Operto (2009)   $$\Delta s_n(\mathbf {r})=-\sum _{\mathbf {r}^{\prime }} H_n^{-1} (\mathbf {r},\mathbf {r}^{\prime })g_n(\mathbf {r}^{\prime }),$$ (4)where gn(r) and Hn(r, r΄) are the gradient and the Hessian calculated at the nth iteration, respectively. The gradient is the first-order derivative of the misfit function ϕ(sn) with respect to the model parameter s,   $$g_n(\mathbf {r})=\frac{\partial \phi (s_n)}{\partial s(\mathbf {r})}=- \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n )}{\partial s(\mathbf {r})} \delta d^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n ) \right),$$ (5)where * stands for the complex conjugate and ∂P(rg, rs, ω|sn)/∂s(r) is the Fréchet derivative or the sensitivity kernel. Since it is extremely expensive to explicitly calculate the sensitivity kernel, in FWI, the adjoint-state method Lailly (1983); Plessix (2006) is usually used to directly calculate the gradient. The negative of the gradient points to descent direction. The Hessian is the second-order derivative of the misfit function,   \begin{eqnarray} H\left(\mathbf {r},\mathbf {r}^{\prime }\right)&=&\frac{\partial ^2 \phi (s_n)}{\partial s(\mathbf {r}) \partial s(\mathbf {r}^{\prime })} \nonumber \\ &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \frac{\partial P^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r}^{\prime })} +\, \frac{\partial ^2 P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s^2(\mathbf {r})} \delta d^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right). \end{eqnarray} (6)The second term in the Hessian, which is related to double-scattered waves, is usually neglected, and the approximation of Hessian is used to form a Gauss–Newton update,   $$H_{{\rm GN}} (\mathbf {r},\mathbf {r}^{\prime })= \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \frac{\partial P^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r}^{\prime })} \right).$$ (7)The inverse of the Hessian can play an important role in FWI. The inverse of the approximate Hessian can help refocus the gradient from limited illumination, as well as mitigate the coupling between different parameters, and the inverse of the full Hessian can also help remove the artefacts caused by double-scattered waves, which is expected to make the full Newton method converge more rapidly Pratt et al. (1998); Operto et al. (2013). However, it is possible that adding the second term of the Hessian could destroy the positive definiteness of the approximate Hessian, in which case, adding an appropriate pre-conditioner, which is usually an approximation of the inverse Hessian, is necessary to accelerate the convergence of the full Newton method Métivier et al. (2014, 2017). When assuming the inverse of the Hessian as a scalar αn, the model can then be updated through a steepest descent (SD) method   $$s_{n+1}(\mathbf {r})=s_n(\mathbf {r})-\alpha _n g_n(\mathbf {r}),$$ (8)where αn can be determined by a line search method. The optimization problem of minimizing the misfit function (2) is highly non-linear, since the relationship between the wavefield and model is non-linear, as described by the wave equation (1). The model perturbation (4) is found in the framework of the Born approximation, which requires the mismatch of the traveltimes less than half of the period, to avoid convergence toward a local minimum caused by so-called cycle-skipping problem (e.g. Virieux & Operto 2009). One direct way to help mitigate this non-linearity is to have broad-band seismic data with wide-offset range and very good initial model before starting the FWI procedure, however, even with different types of misfit definition proposed van Leeuwen & Mulder (2008); Brossier et al. (2015); Choi & Alkhalifah (2015), the gradient (5) used to find the model perturbation can be seen as obtained with the sensitivity kernel ∂P/∂s derived by varying the field P(rg, rs, ω|sn) in the current model sn(r) under the Born approximation (as shown in Appendix eq. A6) as   $$\frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \approx -\omega ^2G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n),$$ (9)which reflects the ‘true’ sensitivity of the model only to the extent that sn(r) reflects the true model properties, so it depends only on sn(r). The other way is to incorporate the non-linearity by taking multiple scattering into account Wu & Zheng (2014); Alkhalifah & Wu (2016b); Wu & Alkhalifah (2017), which can be performed by introducing non-linear sensitivity kernels. Prior to convergence, a different sensitivity, in fact, one which implicitly or explicitly depends on the residuals, is expected if we instead vary the field in the model under construction, sn + 1(r). Based on the scattering theory, as discussed in Innanen (2015), the synthetic data calculated using model sn + 1(r) can be written as an expansion of the synthetic data calculated using model sn(r) using Lippmann–Schwinger equation as shown in Appendix eq. (A5),   \begin{eqnarray} &&{{ P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1} ) }}\nonumber \\ \quad&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n+1}) \nonumber \\ \quad&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \nonumber \\ && {}+\omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime }) \int {\,\,}\mathrm{d}\mathbf {r}^{\prime \prime }G(\mathbf {r}^{\prime },\mathbf {r}^{\prime \prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime \prime })P(\mathbf {r}^{\prime \prime },\mathbf {r}_s,\omega |s_n)\nonumber \\ &&{}+ \ldots . \end{eqnarray} (10)Varying the model by a δs localized at r in sn + 1(r), the calculated field is   \begin{eqnarray} P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}+\delta s(\mathbf {r}))&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) - \omega ^2 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta s(\mathbf {r}) P(\mathbf {r},\mathbf {r}_s,\omega |s_{n})- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r},\omega |s_n)G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_n) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) + \ldots . \end{eqnarray} (11)Comparing the synthetic data P(rg, rs, ω|sn + 1 + δs(r)) in the perturbed model sn + 1(r) + δs in eq. (11) and synthetic data P(rg, rs, ω|sn + 1) at the n + 1th iteration in eq. (10), the perturbation δP(rg, rs, ω|sn + 1) caused by this added variation δs and Δsn(r) is the difference between these two series:   \begin{eqnarray} \delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})&=&P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}+\delta s(\mathbf {r})) -P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}) = -\omega ^2 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta s(\mathbf {r}) P(\mathbf {r},\mathbf {r}_s,\omega |s_{n}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r},\omega |s_n)G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_n) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) + \ldots , \end{eqnarray} (12)where the first term can be interpreted as a scattering process involving one interaction with the variation δs only, and the second and third terms include one interaction with Δsn(r) and one interaction with δs (see Figs 1a–c). The resulting sensitivity kernel at n + 1th iteration can then be written as   \begin{eqnarray} \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} &=& \lim _{\delta s \rightarrow 0} \frac{\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\delta s(\mathbf {r})} \nonumber \\ &=& \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_0 +\left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1+ \ldots , \end{eqnarray} (13)where the index refers to the order of the term in the perturbation Δsn(r). The first term is the conventional FWI sensitivity kernel (9), which only depends on the model sn(r)   $$\left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_0 =\frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} = -\omega ^2G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n),$$ (14)and the second term is   \begin{eqnarray} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1 = -\omega ^2 \Big (\delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Big ), \end{eqnarray} (15)where δG(rg, r, ω|sn) and δP(r, rs, ω|sn) are the first-order scattered wavefields related to Δsn(r),   \begin{eqnarray} \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_{n}), \end{eqnarray} (16)  \begin{eqnarray} \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n)= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_{n})\Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n}). \end{eqnarray} (17)Starting from the second term, the sensitivity kernel depends not only on the model sn, but also on the perturbation Δsn(r), which introduce the non-linear effect to the sensitivity kernel. When Δsn(r) is zero, the higher order terms in eq. (13) vanish, and the sensitivity kernel reduces to the conventional form in eq. (9). Figure 1. View largeDownload slide Scattering processes associated with non-linear sensitivity kernel calculation: (a) zeroth order in Δsn, and (b) and (c) first order in Δsn. Propagation is indicated with a blue arrow. Figure 1. View largeDownload slide Scattering processes associated with non-linear sensitivity kernel calculation: (a) zeroth order in Δsn, and (b) and (c) first order in Δsn. Propagation is indicated with a blue arrow. To include higher order scattering terms related to Δsn(r), scattered wavefields in eqs (16) and (17) can be replaced by the Lippmann–Schwinger equation as   \begin{eqnarray} \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_{n+1})= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_{n+1}), \end{eqnarray} (18)  \begin{eqnarray} \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_{n+1})= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_{n})\Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n+1}). \end{eqnarray} (19)Substituting scattered wavefield equations (18) and (19) back to the second term (15), we can extend this second-order sensitivity kernel into a higher order sensitivity kernel   \begin{eqnarray} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1+\cdots = -\omega ^2 (\delta G(\mathbf {r}_g,\mathbf {r},\omega |s_{n+1})P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_{n+1}) ). \end{eqnarray} (20)In this higher order sensitivity kernel, the term δG(rg, r, ω|sn + 1)δP(r, rs, ω|sn + 1) is not included, and, because we make this additional approximation, it shares similar form as the one derived based on the De Wolf approximation Wu & Zheng (2014), with its first term (15) used in RWI (e.g. Xu et al.2012) to provide a better update for the background model. 2.2 Calculation of perturbation Δsn before n + 1th iteration All the Green’s functions used in the perturbation wavefield equation (12) to construct this new sensitivity kernel equation (13) for n + 1th model sn + 1(r) can be calculated in the nth model sn(r). However, to calculate the higher order terms in the new sensitivity kernel equation (13), perturbation Δsn(r) is required. This quantity is unknown, but it has a relationship with the nth data residual. According to inverse scattering theory, for each frequency, the perturbation Δsn(r) can be exchanged in to a series related to the nth data residual δd(rg, rs, ω|sn) Innanen (2015),   \begin{eqnarray} \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)&=& d(\mathbf {r}_g,\mathbf {r}_s,\omega )-P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \nonumber \\ &=&-\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n)+ \ldots \nonumber \\ &=& {\cal G} \Delta s_n(\mathbf {r})+\cdot , \end{eqnarray} (21)where Δsn(r) = s(r) − sn(r), the difference between the true model s(r) and the nth model sn(r), and the data residual δd(rg, rs, ω|sn) can be seen as the scattered data produced by this perturbation Δsn(r). Through standard inverse scattering techniques, Δsn(r) can then be inverted as a series expression in order of the data residual Weglein et al. (2003). By taking only the first-order term related to Δsn(r), eq. (21) becomes the Born forward modeling, which describes the relationship between the perturbation Δsn(r) and the data residual as   $$\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) =-\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n).$$ (22)Therefore, the perturbation Δsn(r) can be approached by the solution Δs(r) of a further, second data fitting scheme in a least-squares sense, which iteratively minimizes the misfit function   \begin{eqnarray} \phi (\Delta s)&=&\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \left\Vert \delta S(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right\Vert ^2 \nonumber \\ &=&\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \left\Vert \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)-\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right\Vert ^2, \end{eqnarray} (23)where δP(rg, rs, ω|sn) is the scattered data obtained from sampling the scattered wavefield δP(r, rs, ω|sn) at receiver rg, which can be found by solving the wave equation with a virtual secondary source − ω2Δs(r΄)P(r΄, rs, ω|sn):   $$\left[\omega ^2 s_{n}(\mathbf {r})+ \nabla ^2 \right] \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) =-\omega ^2\Delta s(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n).$$ (24)The perturbation Δs(r) can be obtained by updating the initial perturbation (e.g., starting from zero) with the gradient $$\tilde{g}(\mathbf {r})$$ of the misfit function (23)   $$\tilde{g}(\mathbf {r}) = \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \omega ^2 {\rm Re} ( G(\mathbf {r}_g,\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \delta S^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) ).$$ (25)By considering only the first-order scattering as in eq. (22), the updating of this perturbation should then be equivalent to applying a Gauss–Newton pre-conditioner to the gradient of the FWI misfit equation (5) at the current model sn(r) (Alkhalifah & Wu 2016b). Similar as discussed in Kwak et al. ’s (2014) work, this perturbation Δs(r) can, lead to a direct update of the model without line search at one frequency as   $$\tilde{s}_{n+1}(\mathbf {r})=s_n(\mathbf {r})+\Delta s(\mathbf {r}).$$ (26) 2.3 Gradient with non-linear sensitivity kernel Substituting the perturbation Δs(r) into the non-linear sensitivity kernel equation (13), we can get the gradient with non-linear sensitivity kernel up to second order to update model sn(r) as   \begin{eqnarray} g_n(\mathbf {r})&=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} \big (\omega ^2 \delta d^{*} (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n ) \big( G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \nonumber \\ && {} + \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \big )\big ), \end{eqnarray} (27)where δP(r, rs, ω|sn) is as in eq. (24) and under the reciprocity principle, δG(rg, r, ω|sn) can be obtained as the scattered wavefield δG(r, rg, ω|sn) by solving the wave equation with virtual secondary source − ω2Δs(r΄)G(r΄, rg, ω|sn) in model sn(r). Changing wavefields in the virtual secondary sources from P(r΄, rs, ω|sn) and G(r΄, rg, ω|sn) to $$P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |\tilde{s}_{n+1})$$ and $$G(\mathbf {r}^{\prime },\mathbf {r}_g,\omega |\tilde{s}_{n+1})$$ for source and receiver side scattered wavefields as in eq. (20), we can obtain the gradient contains higher order scattering processes. By exchanging the data residual δd(rg, rs, ω|sn) with $$\delta d (\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1})$$  \begin{eqnarray} \delta d \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) &=& d (\mathbf {r}_g,\mathbf {r}_s,\omega ) -P (\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \nonumber \\ &\approx & d (\mathbf {r}_g,\mathbf {r}_s,\omega) -P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) - \delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1}), \end{eqnarray} (28)which contains not only the data in model sn(r), but also the synthesized scattered data in perturbation Δs(r), we can get the gradient with higher order sensitivity kernel to update model $$\tilde{s}_{n+1}(\mathbf {r})$$ as   \begin{eqnarray} g_{n+1}(\mathbf {r}) &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} \Big (\omega ^2 \delta d^{*} \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) \big ( G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \nonumber \\ &&{} + \delta G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1}) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \big )\Big ) \nonumber \\ &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} (\omega ^2 \delta d^{*} \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) \big ( G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1})P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \nonumber \\ &&{} - \delta G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1}) \delta P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \big )). \end{eqnarray} (29) 3 ANALYTICAL AND NUMERICAL EXAMPLES In this section, we will exemplify the behaviour of this non-linear FWI scheme in two distinct ways. First, we will consider its analytic response to reflection data from a single horizontal interface representing a jump in model parameter value from s0 to s0 + Δs. We will use this analytical development to show that, while a standard Gauss–Newton FWI update is in error by order (Δs/s0)2, a non-linear FWI update truncated at second order in the residuals generates an update in error by at most order (Δs/s0)3. Second, we will examine the numerical behaviour of the non-linear FWI scheme using the Marmousi model. We will use these numerical examples as a platform upon which to discuss implementation issues. 3.1 Non-linear FWI and amplitude non-linearity Let us first examine a special case of these non-linear sensitivity kernel and assume the source spectrum is W(ω) = 1, in which case the resulting gradient (i) is second order in the residuals, and (ii) simply incorporates non-linear reflectivity information. This is achieved by truncating eq. (13) at order 1, and considering only a portion of the full difference Δsn(r) between the nth and the n + 1th iterate, as illustrated in Fig. 2. In it, the general scattering picture (see Figs 1a–c) is replaced with a Δsn(r) that is localized and collocated with the variation point. This is obtained by setting Δsn(r΄) = Δsn(r)δ(r − r΄) in eqs (18) and (19), such that second term in eq. (13) becomes   \begin{eqnarray*} \nonumber \left( \frac{\partial G(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1 = 2 \omega ^4 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) G(\mathbf {r},\mathbf {r}_s,\omega |s_n) G(\mathbf {r},\mathbf {r},\omega |s_n) {\cal G}^{*-1} \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n), \end{eqnarray*} where $$\cal G$$ is the operator containing the integral and Green’s functions in eq. (21) and the quantity $${\cal G}^{*-1}\delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)$$ reduces to   \begin{eqnarray} {\cal G}^{*-1} \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) = - \frac{1}{\omega ^2} \frac{ \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) }{G^*(\mathbf {r}_g,\mathbf {r},\omega |s_n) G^*(\mathbf {r},\mathbf {r}_s,\omega |s_n) }. \end{eqnarray} (30)Putting the lowest two orders of the sensitivity kernel together the following case of non-linear sensitivity kernel is obtained:   \begin{eqnarray} \left( \frac{\partial G(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right) \approx -\omega ^2 J_{3{\rm D}}(\mathbf {r},\omega | s_n) \left[ 1 + 2 \frac{ \delta d^*( \mathbf {r}_g,\mathbf {r}_s,\omega |s_n) }{ I_{3{\rm D}}(\mathbf {r},\omega |s_n) } \right], \end{eqnarray} (31)where   \begin{eqnarray} I_{3{\rm D}}(\mathbf {r},\omega | s_n) &=& \frac{G^*(\mathbf {r}_g,\mathbf {r},\omega |s_n) G^*(\mathbf {r},\mathbf {r}_s,\omega |s_n)}{G(\mathbf {r},\mathbf {r},\omega |s_n)},\ {\rm and}\ J_{3{\rm D}}(\mathbf {r},\omega | s_n) = G(\mathbf {r}_g,\mathbf {r},\omega |s_n) G(\mathbf {r},\mathbf {r}_s,\omega |s_n). \end{eqnarray} (32)The formula I3D involves evaluating the Green’s function G(r, r, ω|sn) with colocated source and receiver. In any numerical application this term would need to be regularized. In our 2-D analysis to follow, which is carried out in the wavenumber domain, this leads to no complicating issues. Figure 2. View largeDownload slide Special case of scattering processes illustrated in Fig. 1, in which Δsn(r) and the variation δs(r) are local and collocated. Figure 2. View largeDownload slide Special case of scattering processes illustrated in Fig. 1, in which Δsn(r) and the variation δs(r) are local and collocated. In order to verify that they meaningfully incorporate non-linear information, we consider a 2-D version of the sensitivity kernel formula in eq. (31). In this 2-D case, the model varies in depth only, and thus sensitivity kernel is defined in terms of medium variations in z, but the wave physics is 2-D (coordinates x and z). Also, for the purpose of analysis, we consider the data in the (kg, ω) domain, where kg is the Fourier conjugate to the lateral geophone position xg. Due to a source at (xs, zs), the wavefield observed at depth zg is P(kg, zg, xs, zs, ω). To simplify the problem, we use only one shot at the origin zs = xs = 0 with record coordinate system along the surface with zg = 0. Under this restriction, the data residual is δd(kg, ω|sn), and the associated second-order sensitivity kernel formula has the same essential form,   \begin{eqnarray} \left( \frac{\partial G(k_g,\omega |s_{n+1})}{\partial s(z)} \right) \approx -\omega ^2 J_{2{\rm D}} (k_g,z,\omega |s_n) \left[ 1 +2 \frac{ \delta d^*(k_g,\omega |s_n) }{ I_{2{\rm D}}(k_g,z,\omega |s_n) } \right], \end{eqnarray} (33)but some slight differences in the weights:   \begin{eqnarray*} \nonumber J_{2{\rm D}} (k_g,z,\omega |s_n) = \int {\,\,}\mathrm{d}x^{\prime } G(k_g,0,x^{\prime },z,\omega |s_n) G(x^{\prime },z,0,0,\omega |s_n), \end{eqnarray*} and   \begin{eqnarray*} \nonumber I_{{\rm 2D}} (k_g,z,\omega |s_n) = \frac{ J_{2{\rm D}}^* (k_g,z,\omega |s_n) J_{2{\rm D}} (k_g,z,\omega |s_n) }{ \int {\,\,}\mathrm{d}x^{\prime } \int {\,\,}\mathrm{d}x^{\prime \prime } G(k_g,0,x^{\prime },z,\omega |s_n) G(x^{\prime },z,x^{\prime \prime },z,\omega |s_n) G(x^{\prime \prime },z,0,0,\omega |s_n) }. \end{eqnarray*} FWI updates are then derived in the (kg, ω) domain, with the data residual δd(kg, ω|sn) for a fixed kg at each iteration, which allows us to distinguish between updating with high angle (large kg) versus low angle data. The misfit function of FWI (2) is modified to   \begin{eqnarray} \phi (s_n) = \frac{1}{2} \int {\,\,}\mathrm{d}\omega \Vert \delta d(k_g,\omega |s_n) \Vert ^2, \end{eqnarray} (34)and for the first FWI update, it is minimized with updates of Gauss–Newton form:   \begin{eqnarray} \delta s_0 (z) = - \int {\,\,}\mathrm{d}z^{\prime } H_0^{-1}(z,z^{\prime }) g_0(z^{\prime }), \end{eqnarray} (35)where the gradient is based on a version of the non-linear sensitivity kernel:   \begin{eqnarray} (g_0(z))_{{\rm nonlinear}} = - \int {\,\,}\mathrm{d}\omega \left( \frac{ \partial G(k_g,\omega |s_{1}) }{ \partial s(z) } \right) \delta d^*(k_g,\omega |s_0), \end{eqnarray} (36)and the Hessian is based on standard sensitivity kernel (e.g. Virieux & Operto 2009):   \begin{eqnarray} H_0(z,z^{\prime }) = \int {\,\,}\mathrm{d}\omega \frac{ \partial G(k_g,\omega |s_{0}) }{ \partial s(z) } \frac{ \partial G^*(k_g,\omega |s_{0}) }{ \partial s(z^{\prime }) }. \end{eqnarray} (37)If the initial medium is homogeneous, we can analyse this update using exact forms for the Green’s functions Clayton & Stolt (1981): $$G(k_g,z_g,x,z,\omega |s_0) = (i2q_g)^{-1}e^{-ik_gx + iq_g|z_g-z|}$$ and $$G(x,z,x_s,z_s,\omega |s_0) = (2 \pi )^{-1} \int \mathrm{d}k^{\prime } (i2q^{\prime })^{-1} e^{ ik^{\prime }(x-x_s) + iq^{\prime }|z-z_s|}$$, where $$q_g^2 = \omega ^2 s_0 - k_g^2$$ and q΄2 = ω2s0 − k΄2. The I and J reduce to   \begin{eqnarray} J_{{\rm 2D}} (k_g,z,\omega |s_0) = \frac{e^{i2q_gz}}{(i2q_g)^2}, I_{2{\rm D}} (k_g,z,\omega |s_0) = \frac{e^{-i2q_gz}}{i2q_g}, \end{eqnarray} (38)and the sensitivity kernel itself becomes   \begin{eqnarray} \left( \frac{\partial G(k_g,\omega |s_{1})}{\partial s(z)} \right) = -\omega ^2 \left[ \frac{e^{i2q_gz}}{(i2q_g)^2} \right] \left[ 1 +2 \delta d^*(k_g,\omega |s_0) (i2q_g) e^{i2q_gz} \right]. \end{eqnarray} (39) Reconstruction of the single-interface medium in Fig. 3(a) is considered. The goal is the determination, from a constant initial medium s0, of the profile s(z) = s0 + ΔsS(z − z1), where S is a step or Heaviside function. Δs is the amplitude of the ideal update, taking us directly from the initial model to the correct answer. The backscattered amplitude (i.e. the reflection coefficient) can be expressed as a series in orders of this Δs:   \begin{eqnarray} R(\theta ) = R_1(\theta ) + R_2(\theta ) + \cdot , {\rm where} \end{eqnarray} (40)  \begin{eqnarray*} R_1(\theta ) = - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right), R_2(\theta ) = \frac{1}{8} \left( 1 + 2 \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right)^2, \\ \end{eqnarray*} etc. In Fig. 4(a), we note by comparing exact, first-order (R ≈ R1) and second-order (R ≈ R1 + R2) coefficient calculations that basic linearization error can arise at low angles when contrasts are large, and (in Fig. 4b) at high angle even when contrasts are low. We also note that corrections out to as low as second order can lead to significant error reduction. With analysable formulae for R in hand, we can then analytically express the complex conjugate of the (kg, ω) domain residuals at the first iteration Innanen (2014):   \begin{eqnarray} \delta d^*(k_g,\omega |s_0) = -R(\theta ) \frac{e^{-i2q_gz_1}}{i2q_g}. \end{eqnarray} (41) Figure 3. View largeDownload slide Model for analysis of non-linear FWI updates. (a) Medium is a step S(z − z1) from s0 to s1 with boundary at z1 and (b) reflection coefficient R(θ) where θ is related to plane wave variables ω, $$c_0 = 1/\sqrt{s_0}$$, kg and qg. Figure 3. View largeDownload slide Model for analysis of non-linear FWI updates. (a) Medium is a step S(z − z1) from s0 to s1 with boundary at z1 and (b) reflection coefficient R(θ) where θ is related to plane wave variables ω, $$c_0 = 1/\sqrt{s_0}$$, kg and qg. Figure 4. View largeDownload slide Reflection coefficient R(θ) as a function of large contrasts or large angles. (a) R(θ) for low angles, but large contrasts and (b) R(θ) for small contrasts, but large angles. Figure 4. View largeDownload slide Reflection coefficient R(θ) as a function of large contrasts or large angles. (a) R(θ) for low angles, but large contrasts and (b) R(θ) for small contrasts, but large angles. With all the ingredients for the sensitivity kernel now available in analytic form, we may analyse the first iteration in the reconstruction of Fig. 3(a). The gradient now has two terms, one first order in δd* and the other second, that is, $$(g_0(z))_{{\rm nonlinear}} = g_0^{(1)}(z) + g_0^{(2)}(z)$$, where   \begin{eqnarray} g_0^{(1)}(z) &=& \frac{c_0^2}{4} R(\theta ) \int\mathrm{d}\omega \left( \frac{\omega ^2/c_0^2}{q_g^2} \right) \frac{e^{i2q_g(z-z_1)}}{i2q_g},\ {\rm and}\ g_0^{(2)}(z) = -\frac{c_0^2}{2} R^2(\theta ) \int \mathrm{d}\omega \left( \frac{\omega ^2/c_0^2}{q_g^2} \right) \frac{e^{i2q_g(2z-2z_1)}}{i2q_g}. \end{eqnarray} (42)Noting that Innanen (2014)  dω =  d2qg(c0/2cos θ) and qg = (ω/c0)cos θ, we can evaluate these integrals and reassemble the gradient, obtaining   \begin{eqnarray} (g_0(z))_{{\rm nonlinear}} = \frac{c_0^3 \pi }{4 \cos ^3 \theta } [ R(\theta ) - 2R^2(\theta ) ] S(z-z_1). \end{eqnarray} (43)The Hessian, which we have given a standard Gauss–Newton approximate form, evaluates in this simple environment (Innanen 2014) to   \begin{eqnarray} H_0(z,z^{\prime }) = c_0^5 \pi (16 \cos ^5 \theta )^{-1} \delta (z-z^{\prime }), \end{eqnarray} (44)and so, via eq. (35), the first update is of the form   \begin{eqnarray} (\delta s_{0} (z))_{{\rm nonlinear}} & =& - \frac{ 4 \cos ^2 \theta }{c_0^2} \left[ R(\theta ) - 2R^2(\theta ) \right] S(z-z_1). \end{eqnarray} (45) We characterized the ideal update as Δs(z) = ΔsS(z − z1) and related it to the reflection coefficient through eq. (40). To analyse the relative accuracy of the first- and second-order FWI iterations, we will substitute two truncations of the series for R(θ) into eq. (45). The standard Gauss–Newton update is recovered by neglecting R2; noting also that to leading order in sin 2θ we may replace 1/cos 2θ with (1 + sin 2θ), we obtain   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx - \left( \frac{4}{1 + \sin ^2 \theta } \right) R(\theta ) S(z-z_1). \end{eqnarray} (46)The non-linear Gauss–Newton-like update, based on second-order collocated sensitivity kernel, is   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm nonlinear}} =- \left( \frac{4}{1 + \sin ^2 \theta } \right) \left[ R(\theta ) - 2 R^2(\theta ) \right] S(z-z_1). \end{eqnarray} (47)Let us first verify that a standard Gauss–Newton update is equivalent to the ideal update to first order. If contrasts and angles are low, second-order contributions to R are negligible, and the reflection coefficient is   \begin{eqnarray} R(\theta ) \approx R_1(\theta ) = - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right); \end{eqnarray} (48)substituting this into eq. (46), we obtain   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx \left( \frac{\Delta s}{s_0} \right) S(z-z_1), \end{eqnarray} (49)demonstrating the equivalence of δs0(z) and the ideal Δs(z). However, if the angle or contrast is such that second-order contributions to R(θ) are non-negligible, referring to eq. (40) we must instead substitute   \begin{eqnarray} R(\theta ) \approx - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right) + \frac{1}{8} \left( 1 + 2 \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right)^2, \end{eqnarray} (50)and this produces a discrepancy at second order between the Gauss–Newton update δs0(z) and the ideal update Δs(z):   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx \left[ \left( \frac{\Delta s}{s_0} \right) - \frac{1}{2} (1+\sin ^2 \theta ) \left( \frac{\Delta s}{s_0} \right)^2 \right] S(z-z_1). \end{eqnarray} (51)Let us compare this with the update generated using the second-order sensitivity kernel expression. Substituting the reflection coefficient in eq. (50) into eq. (47), a corrective term is introduced at second order, exactly suppressing the second-order discrepancy corrupting eq. (51), such that the resulting update lapses back to   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm nonlinear}} \approx \left( \frac{\Delta s}{s_0} \right) S(z-z_1); \end{eqnarray} (52)here the consistency of the candidate and ideal updates extends to second order, rather than just first. In Fig. 5 the difference between second-order sensitivity kernel and (standard) first-order sensitivity kernel is illustrated. Because we consider a fixed kg, we can examine the accuracy angle by angle; a full inversion would sum over kg and thus average over these angles. In Fig. 5(a) the interface is large contrast, going from c0 = (s0)−1/2 = 1500 ms−1 in the upper half-space to c1 = (s1)−1/2 = 1800 ms−1 in the lower; especially in the range θ = 0°–30° the difference between the standard Gauss–Newton update and that based on second-order sensitivity kernel is significant. Meanwhile in Fig. 5(b), the interface represents a small contrast, with c0 = 1500 ms−1 and c1 = 1600 ms−1, but is examined over a wider range of angles. Here the second-order update ‘sticks to’ the exact update to roughly θ = 60°, in contrast to the standard update which deviates significantly at θ = 30°. Figure 5. View largeDownload slide Exact (solid), first-order (dash dot), versus second-order (dash) updates at each angle of incidence, (a) the interface corresponds to a large contrast and (b) the interface corresponds to a small contrast but examined over a larger range of angles. Figure 5. View largeDownload slide Exact (solid), first-order (dash dot), versus second-order (dash) updates at each angle of incidence, (a) the interface corresponds to a large contrast and (b) the interface corresponds to a small contrast but examined over a larger range of angles. 3.2 Numerical implementation and examples So far we have discussed a non-linear FWI scheme by dividing the calculation of sensitivity kernel into two steps: the first step is calculating a perturbation from the current data residual and the second step is using this perturbation to generate the higher order terms in the sensitivity kernel. Multiscale strategies can be used in this non-linear FWI, by performing the inversion sequentially from the low to high frequencies, and the frequency interval can be determined according to the half offset-to depth ratio Rmax = hmax/z Sirgue & Pratt (2004) as   $$\omega _{n+1}=\frac{\omega _n}{a_{{\rm min}}},$$ (53)with   $$a_{{\rm min}}=\frac{1}{\sqrt{1+R_{{\rm max}}^{2}}}.$$ (54)Although the frequency interval determined by eq. (53) is not constant as is normally used, which is determined by the maximum recorded time to prevent aliasing, theoretically it can provide enough coverage for the vertical wavenumber. An efficient non-linear FWI algorithm can be perform with certain frequencies as shown in Table 1. Table 1. Frequency-domain non-linear FWI.     View Large We will use a nine point frequency-domain finite difference with constant density Jo et al. (1996) to model the observed data and synthetic data. Hereafter, second order FWI (SOFWI) and non-linear FWI (NFWI) will be used to distinguish the gradient calculated using eq. (27) with non-linear sensitivity kernel up to second order and data residual in model sn, and eq. (29) with non-linear sensitivity kernel with all higher order terms and data residual in model $$\tilde{s}_{n+1}(\mathbf {r})$$, respectively. FWI solved with conventional updates, by SD, truncated Newton method with approximate Hessian (GN) and full Hessian (Newton) Métivier et al. (2013); Pan et al. (2017) will be used for comparison. The true model is shown in Fig. 6, we add a 500 m water layer on the top of the model to reduce the refractions in the data. Two initial models and different frequency ranges are used to test convergence and resolution characteristics of the method. Synthetic data are generated in the frequency domain with 461 receivers with spacing of 20 m and 46 sources with spacing of 200 m along the surface, where the first receiver locates at x = 0 m and the first source locates at x = 100 m. Figure 6. View largeDownload slide True velocity model of the Marmousi example. Figure 6. View largeDownload slide True velocity model of the Marmousi example. The first initial model is obtained from the true velocity using a Gaussian smoother (excluding the water layer) as shown in Fig. 7(a). With three frequencies (4, 6.6 and 14.9 Hz) starting at 4 Hz, 10 iterations per frequency are used in both conventional and non-linear FWI by updating only the velocity under the added water layer. The SD FWI result is shown in Fig. 7(b). Five inner iterations are used for truncated Newton method (with GN Hessian as in Fig. 7c and full Hessian as in Fig. 7d), SOFWI (Fig. 7e) and NFWI (Fig. 7f). For the first frequency, Fig. 8(a) shows the residual norm versus number of iterations, and Fig. 8(b) shows the model error norm versus number of iterations, with dash–dotted, dashed, dot, solid and sold lines with plus sign stands for SD FWI, truncated GN FWI, truncated Newton FWI, NFWI and SOFWI result, respectively. FWI with the non-linear gradient converges more rapidly than conventional SD FWI. Because of the lack of low frequencies, both truncated GN FWI and truncated Newton FWI perform similarly to SD FWI, but the convergence of both SOFWI and NFWI outperforms SD FWI; the NFWI result is particularly close to the true model. Figure 7. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method, (e) SOFWI and (f) NFWI with 10 iterations for three frequencies starting from 4 to 15 Hz. Figure 7. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method, (e) SOFWI and (f) NFWI with 10 iterations for three frequencies starting from 4 to 15 Hz. Figure 8. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Figure 8. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Then, we use a linear model as the initial model, which is exact at the added water layer, but changes linearly along depth direction from 1.5 to 4 km s−1, as shown in Fig. 9(a) to test NFWI. Five frequencies (2, 3.3, 5.5, 9, and 14.9 Hz) starting at 2 Hz are used for the inversion, with 10 iterations per frequency. Figs 9(b)–(d) show the inversion results from conventional FWI using SD, truncated GN and truncated Newton method, respectively, and Fig. 9(e) shows the inversion result from NFWI (five inner iterations are used for truncated GN FWI, truncated Newton FWI and NFWI). Velocity profiles along x = 4 and 6 km are shown in Fig. 10, where black lines show the true velocity, green lines show the initial velocity, and red, cyan, magenta and blue line stands for SD FWI, truncated GN FWI, truncated Newton FWI and NFWI result, respectively. For the first frequency, Fig. 11(a) shows the residual norm versus number of iterations, and Fig. 11(b) shows the model error norm versus number of iterations. We observe that, for the given number of iterations, with a minimum frequency of 2 Hz, SD FWI can only reconstruct rough model structures, truncated GN FWI can help converge much faster towards the true model, and truncated Newton FWI with full Hessian, which includes second-order scattering processes, does not provide significant improvement in comparison to truncated GN FWI. But, NFWI outperforms all these methods, especially in its reconstruction of model structures in the deeper regions. Figure 9. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method and (e) NFWI with 10 iterations for five frequencies starting from 2 to 15 Hz. Figure 9. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method and (e) NFWI with 10 iterations for five frequencies starting from 2 to 15 Hz. Figure 10. View largeDownload slide Velocity profiles along (a) x = 4 km and (b) x = 6 km. Figure 10. View largeDownload slide Velocity profiles along (a) x = 4 km and (b) x = 6 km. Figure 11. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Figure 11. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Using the same linear initial model as in Fig. 9(b), to obtain a better wavenumber coverage of the deeper regions, we use uniformly sampled 12 frequencies between 4 and 15 Hz to perform both conventional FWI and NFWI inversions. Figs 12(a)–(c) show the inversion results from conventional FWI using SD, truncated GN and truncated Newton method, respectively, and Fig. 12(d) shows the inversion result from NFWI. For a given number of iterations, in this case, with the conventional FWI method, both SD and truncated GN methods converge to a local minimum, where velocity is not correctly updated even in the shallow regions. Using the full Hessian rather than the Gauss–Newton Hessian does not improve the result, but the NFWI method provides a fairly good result. Figure 12. View largeDownload slide Inversion results with conventional updates using (a) SD, (b) truncated GN, (c) truncated Newton method and (d) NFWI with 10 iterations for 12 frequencies starting from 4 to 15 Hz. Figure 12. View largeDownload slide Inversion results with conventional updates using (a) SD, (b) truncated GN, (c) truncated Newton method and (d) NFWI with 10 iterations for 12 frequencies starting from 4 to 15 Hz. 4 DISCUSSION AND CONCLUSIONS In this study, we considered including non-linearity in the sensitivity kernel calculations during the construction of the descent direction in FWI. We showed with an analytic example that in the presence of pre-critical reflection data, the error in a non-linear FWI update truncated at second order is proportional to (Δs/s0)3. In contrast, the error in a standard Gauss–Newton FWI update is only proportional to (Δs/s0)2. We also discussed an approach to the implementation of FWI with this non-linear descent direction in the frequency domain with a two-nested-loops approach. The inner loop involves determination of a perturbation from the data residuals according to the Born approximation, and the outer loop involves updating an initial model with this non-linear descent direction. Involving multiple scattering in each FWI update could have important consequences. Higher order scattering terms can be included in the sensitivity kernel based on De Wolf approximation (Wu & Zheng 2014), it can also be included by extending the misfit function (e.g. Alkhalifah & Wu 2016b; Wu & Alkhalifah 2017). We approached the problem by computing a sensitivity kernel via a variation of the medium to be constructed, sn + Δsn, at each FWI iteration. The non-linear sensitivity kernel constructed this way can include scattering processes with one interaction with the unknown perturbation and multiple interactions with Δsn. The first term in this new sensitivity kernel is equivalent to the sensitivity kernel used in conventional FWI, which only include scattering processes with one interaction with the unknown perturbation. The additional terms contain the multiple scattering processes related to Δsn. This provides additional wave paths that help in updating between scatterers. Compared to FWI with a conventional descent direction without multiple scattering, such as SD methods and truncated Gauss–Newton methods, FWI with this non-linear descent direction can accommodate multiple scattering appearing in the data residuals. Examples based on the Marmousi model indicate that FWI with this new descent direction, with a non-linear sensitivity kernel up to the second or higher order, converges both more rapidly and with greater ultimate accuracy than FWI with conventional descent direction. Although the full Hessian contains information related to doubly scattered waves, proper choice of pre-conditioner, as well as more advanced methods to calculate the Newton descent direction, could be necessary to insure the convergence of Newton-type methods. This is because the positive definiteness of the approximate Hessian is lost, and the misfit function is not locally convex anymore (Pratt et al.1998; Métivier et al.2014, 2017). In this case, compared to the truncated Newton method without pre-conditioning, FWI with our non-linear descent direction is more robust. This is especially true in the case of missing low-frequency data. Acknowledgements We thank the editors Ludovic Métivier, Tariq Alkhalifah and Ru-Shan Wu for valuable suggestions and comments that ultimately helped to improve this paper. We thank the sponsors of CREWES for support. This work was funded by CREWES (Consortium for Research in Elastic Wave Exploration Seismology), and by NSERC (Natural Science and Engineering Research Council of Canada) through the grant CRDPJ 461179-13, and in part through the Canada First Research Excellence Fund. We also thank Junxiao Li, Jian Sun, Huaizhen Chen and Khalid Almuteri for valuable discussions. REFERENCES Aki K., Richards P., 2002. Quantitative Seismology , University Science Books. Alkhalifah T., 2015. Scattering-angle based filtering of the waveform inversion gradients, Geophys. J. Int. , 200, 363– 373. Google Scholar CrossRef Search ADS   Alkhalifah T., Plessix R.E., 2014. A recipe for practical full-waveform inversion in anisotropic media: an analytical parameter resolution study, Geophysics , 79, R91– R101. Google Scholar CrossRef Search ADS   Alkhalifah T., Wu Z.D., 2016a. The natural combination of full and image-based waveform inversion, Geophys. Prospect. , 64, 19– 30. Google Scholar CrossRef Search ADS   Alkhalifah T., Wu Z.D., 2016b. Multiscattering inversion for low-model wavenumbers, Geophysics , 81, R417– R428. Google Scholar CrossRef Search ADS   Biondi B., Almomin A., 2014. Simultaneous inversion of full data bandwidth by tomographic full-waveform inversion, Geophysics , 79, Wa129– Wa140. Google Scholar CrossRef Search ADS   Brossier R., Operto S., Virieux J., 2015. Velocity model building from seismic reflection data by full-waveform inversion, Geophys. Prospect. , 63, 354– 367. Google Scholar CrossRef Search ADS   Castagna J., Backus M., 1993. Offset-Dependent Reflectivity: Theory and Practice of AVO Analysis , SEG. Google Scholar CrossRef Search ADS   Chavent G., Cément F., Gómez S., 1994. Automatic determination of velocities via migration-based traveltime waveform inversion: a synthetic data example, in SEG Technical Program Expanded Abstracts , Vol. 328, pp. 1179– 1182, Society of Exploration Geophysicists. Choi Y., Alkhalifah T., 2015. Unwrapped phase inversion with an exponential damping, Geophysics , 80, R251– R264. Google Scholar CrossRef Search ADS   Clayton R.W., Stolt R.H., 1981. A born-wkbj inversion method for acoustic reflection data, Geophysics , 46, 1559– 1567. Google Scholar CrossRef Search ADS   Foster D.J., Keys R.G., Lane F.D., 2010. Interpretation of avo anomalies, Geophysics , 75, A3– A13. Google Scholar CrossRef Search ADS   Innanen K.A., 2014. Seismic avo and the inverse hessian in precritical reflection full waveform inversion, Geophys. J. Int. , 199, 717– 734. Google Scholar CrossRef Search ADS   Innanen K.A., 2015. Full waveform inversion updating in the presence of high angle/high contrast reflectivity, in SEG Technical Program Expanded Abstracts , pp. 1314– 1319. Jo C.H., Shin C.S., Suh J.H., 1996. An optimal 9-point, finite-difference, frequency-space, 2-d scalar wave extrapolator, Geophysics , 61, 529– 537. Google Scholar CrossRef Search ADS   Kwak S., Kim Y., Shin C., 2014. Frequency-domain direct waveform inversion based on perturbation theory, Geophys. J. Int. , 197, 987– 1001. Google Scholar CrossRef Search ADS   Lailly P., 1983. The seismic inverse problem as a sequence of before stack migrations, in Conference on Inverse Scattering, Theory and Application, Society of Industrial and Applied Mathematics, Expanded Abstracts , pp. 206– 220, eds Bednar J.B., Richard R., Enders R., Arthur W., Society for Industrial and Applied Mathematics, Philadelphia, PA. Mahmoudian F., Margrave G.F., Wong J., Henley D.C., 2015. Azimuthal amplitude variation with offset analysis of physical modeling data acquired over an azimuthally anisotropic medium, Geophysics , 80, C21– C35. Google Scholar CrossRef Search ADS   Métivier L., Brossier R., Virieux J., Operto S., 2013. Full waveform inversion and the truncated newton method, SIAM J. Sci. Comput. , 35, B401– B437. Google Scholar CrossRef Search ADS   Métivier L., Bretaudeau F., Brossier R., Operto S., Virieux J., 2014. Full waveform inversion and the truncated newton method: quantitative imaging of complex subsurface structures, Geophys. Prospect. , 62, 1353– 1375. Google Scholar CrossRef Search ADS   Métivier L., Brossier R., Operto S., Virieux J., 2017. Full waveform inversion and the truncated newton method, SIAM Rev. , 59, 153– 195. Google Scholar CrossRef Search ADS   Operto S., Gholami Y., Prieux V., Ribodetti A., Brossier R., Métivier L., Virieux J., 2013. A guided tour of multiparameter full-waveform inversion with multicomponent data: from theory to practice, Leading Edge , 32, 1040– 1054. Google Scholar CrossRef Search ADS   Pan W.Y., Innanen K.A., Margrave G.F., Fehler M.C., Fang X.D., Li J.X., 2016. Estimation of elastic constants for hti media using Gauss-Newton and full-newton multiparameter full-waveform inversion, Geophysics , 81, R275– R291. Google Scholar CrossRef Search ADS   Pan W.Y., Innanen K.A., Liao W.Y., 2017. Accelerating hessian-free Gauss-Newton full-waveform inversion via l-bfgs preconditioned conjugate-gradient algorithm, Geophysics , 82, R49– R64. Google Scholar CrossRef Search ADS   Plessix R.E., 2006. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications, Geophys. J. Int. , 167, 495– 503. Google Scholar CrossRef Search ADS   Plessix R.E., Milcik P., Rynja H., Stopin A., Matson K., Abri S., 2013. Multiparameter full-waveform inversion: marine and land examples, Leading Edge , 32, 1030– 1038. Google Scholar CrossRef Search ADS   Pratt R.G., Shin C., Hicks G.J., 1998. Gauss-newton and full newton methods in frequency-space seismic waveform inversion, Geophys. J. Int. , 133, 341– 362. Google Scholar CrossRef Search ADS   Sava P., Biondi B., 2004. Wave-equation migration velocity analysis. I. Theory, Geophys. Prospect. , 52, 593– 606. Google Scholar CrossRef Search ADS   Shin C., Cha Y.H., 2008. Waveform inversion in the laplace domain, Geophys. J. Int. , 173, 922– 931. Google Scholar CrossRef Search ADS   Shin C., Cha Y.H., 2009. Waveform inversion in the laplace-fourier domain, Geophys. J. Int. , 177, 1067– 1079. Google Scholar CrossRef Search ADS   Sirgue L., Pratt R.G., 2004. Efficient waveform inversion and imaging: a strategy for selecting temporal frequencies, Geophysics , 69, 231– 248. Google Scholar CrossRef Search ADS   Stolt R.H., Weglein A.B., 1985. Migration and inversion of seismic data, Geophysics , 50, 2458– 2472. Google Scholar CrossRef Search ADS   Sun D., Symes W.W., 2012. Waveform inversion via non-linear differential semblance optimization, in SEG Technical Program Expanded Abstracts , pp. 1– 7, Society of Exploration Geophysicists. Symes W.W., Carazzone J.J., 1991. Velocity inversion by differential semblance optimization, Geophysics , 56, 654– 663. Google Scholar CrossRef Search ADS   Tarantola A., 1984. Inversion of seismic-reflection data in the acoustic approximation, Geophysics , 49, 1259– 1266. Google Scholar CrossRef Search ADS   van Leeuwen T., Herrmann F.J., 2013. Mitigating local minima in full-waveform inversion by expanding the search space, Geophys. J. Int. , 195, 661– 667. Google Scholar CrossRef Search ADS   van Leeuwen T., Mulder W.A., 2008. Velocity analysis based on data correlation, Geophys. Prospect. , 56, 791– 803. Google Scholar CrossRef Search ADS   Virieux J., Operto S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics , 74, Wcc1– Wcc26. Google Scholar CrossRef Search ADS   Wang H.Y., Singh S.C., Audebert F., Calandra H., 2015. Inversion of seismic refraction and reflection data for building long-wavelength velocity models, Geophysics , 80, R81– R93. Google Scholar CrossRef Search ADS   Warner M., Guasch L., 2016. Adaptive waveform inversion: theory, Geophysics , 81, R429– R445. Google Scholar CrossRef Search ADS   Weglein A.B. et al.  , 2003. Inverse scattering series and seismic exploration, Inverse Probl. , 19, R27– R83. Google Scholar CrossRef Search ADS   Wu R.S., Zheng Y.C., 2014. Non-linear partial derivative and its de wolf approximation for non-linear seismic inversion, Geophys. J. Int. , 196, 1827– 1843. Google Scholar CrossRef Search ADS   Wu R.S., Luo J.R., Wu B.Y., 2014. Seismic envelope inversion and modulation signal model, Geophysics , 79, Wa13– Wa24. Google Scholar CrossRef Search ADS   Wu Z.D., Alkhalifah T., 2015. Simultaneous inversion of the background velocity and the perturbation in full-waveform inversion, Geophysics , 80, R317– R329. Google Scholar CrossRef Search ADS   Wu Z.D., Alkhalifah T., 2017. Efficient scattering-angle enrichment for a nonlinear inversion of the background and perturbations components of a velocity model, Geophys. J. Int. , 210, 1981– 1992. Google Scholar CrossRef Search ADS   Xu S., Wang D., Chen F., Lambaré G., Zhang Y., 2012. Inversion on reflected seismic wave, in SEG Technical Program Expanded Abstracts , Vol. 509, pp. 1– 7, Society of Exploration Geophysicists. Zhang H.Y., Weglein A.B., 2009. Direct nonlinear inversion of 1d acoustic media using inverse scattering subseries, Geophysics , 74, Wcd29– Wcd39. Google Scholar CrossRef Search ADS   Zhou W., Brossier R., Operto S., Virieux J., 2015. Full waveform inversion of diving & reflected waves for velocity model building with impedance inversion based on scale separation, Geophys. J. Int. , 202, 1535– 1554. Google Scholar CrossRef Search ADS   APPENDIX A: VIRTUAL SECONDARY SOURCE AND THE BORN APPROXIMATION Suppose that the model can be split into two parts   $$s(\mathbf {r})=s_0(\mathbf {r})+\delta s(\mathbf {r}),$$ (A1)where s0(r) contains the low-wavenumber component of the model or the background part, and δs(r) contains the high-wavenumber component or the perturbation part. Accordingly the Green’s functions in the actual and background model are   \begin{eqnarray} {} [\omega ^2s(\mathbf {r})+ \nabla ^2]G(\mathbf {r},\mathbf {r}_s,\omega |s) =-\delta (\mathbf {r}-\mathbf {r}_s) \end{eqnarray} (A2)  \begin{eqnarray} {} [\omega ^2s_0(\mathbf {r})+ \nabla ^2]G(\mathbf {r},\mathbf {r}_s,\omega |s_0) =-\delta (\mathbf {r}-\mathbf {r}_s). \end{eqnarray} (A3)The scattered wavefield, or to say the perturbed wavefield   $$\delta G(\mathbf {r},\mathbf {r}_s,\omega |s_0)=G(\mathbf {r},\mathbf {r}_s,\omega |s)-G(\mathbf {r},\mathbf {r}_s,\omega |s_0),$$ (A4)can be expressed exactly with the Lippmann–Schwinger equation as   \begin{eqnarray} \delta G (\mathbf {r},\mathbf {r}_s,\omega |s_0) &=& -\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0) \delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s) \nonumber \\ &=& -\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0) \nonumber \\ && {}+\omega ^4 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime }) \int\mathrm{d}\mathbf {r}^{\prime \prime }G(\mathbf {r}^{\prime },\mathbf {r}^{\prime \prime },\omega |s_0) \delta s(\mathbf {r}^{\prime \prime })G(\mathbf {r}^{\prime \prime },\mathbf {r}_s,\omega |s_0)\nonumber \\ && {}+ \cdots . \end{eqnarray} (A5)Under the single-scattering assumption, only the first-order term is considered in eq. (A5), which gives the Born approximation   $$\delta G_{{\rm Born}}(\mathbf {r},\mathbf {r}_s,\omega |s_0)=-\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0).$$ (A6)The scattered wavefield δG(r, rs, ω|s0) in the Lippmann–Schwinger equation (A5), can be generated using − ω2δs(r΄)G(r΄, rs, ω|s) as virtual secondary source,   \begin{eqnarray} {}[\omega ^2s_0(\mathbf {r})+ \nabla ^2] \delta G(\mathbf {r},\mathbf {r}_s,\omega |s_0) & =-\omega ^2\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s), \end{eqnarray} (A7)while under the Born approximation (A6), the scattered wavefield can be calculated with virtual secondary source − ω2δs(r΄)G(r΄, rs, ω|s0)   \begin{eqnarray} {}[\omega ^2s_0(\mathbf {r})+ \nabla ^2]\delta G_{Born}(\mathbf {r},\mathbf {r}_s,\omega |s_0) & =-\omega ^2\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0). \end{eqnarray} (A8) © The Author(s) 2018. Published by Oxford University Press on behalf of The Royal Astronomical Society. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Geophysical Journal International Oxford University Press

# Frequency-domain full-waveform inversion with non-linear descent directions

, Volume 213 (2) – May 1, 2018
18 pages

/lp/ou_press/frequency-domain-full-waveform-inversion-with-non-linear-descent-wt1LFMmJZ0
Publisher
The Royal Astronomical Society
ISSN
0956-540X
eISSN
1365-246X
D.O.I.
10.1093/gji/ggy002
Publisher site
See Article on Publisher Site

### Abstract

Summary Full-waveform inversion (FWI) is a highly non-linear inverse problem, normally solved iteratively, with each iteration involving an update constructed through linear operations on the residuals. Incorporating a flexible degree of non-linearity within each update may have important consequences for convergence rates, determination of low model wavenumbers and discrimination of parameters. We examine one approach for doing so, wherein higher order scattering terms are included within the sensitivity kernel during the construction of the descent direction, adjusting it away from that of the standard Gauss–Newton approach. These scattering terms are naturally admitted when we construct the sensitivity kernel by varying not the current but the to-be-updated model at each iteration. Linear and/or non-linear inverse scattering methodologies allow these additional sensitivity contributions to be computed from the current data residuals within any given update. We show that in the presence of pre-critical reflection data, the error in a second-order non-linear update to a background of s0 is, in our scheme, proportional to at most (Δs/s0)3 in the actual parameter jump Δs causing the reflection. In contrast, the error in a standard Gauss–Newton FWI update is proportional to (Δs/s0)2. For numerical implementation of more complex cases, we introduce a non-linear frequency-domain scheme, with an inner and an outer loop. A perturbation is determined from the data residuals within the inner loop, and a descent direction based on the resulting non-linear sensitivity kernel is computed in the outer loop. We examine the response of this non-linear FWI using acoustic single-parameter synthetics derived from the Marmousi model. The inverted results vary depending on data frequency ranges and initial models, but we conclude that the non-linear FWI has the capability to generate high-resolution model estimates in both shallow and deep regions, and to converge rapidly, relative to a benchmark FWI approach involving the standard gradient. Inverse theory, Waveform inversion, Theoretical seismology, Wave scattering and diffraction 1 INTRODUCTION Seismic full-waveform inversion, or FWI Lailly (1983); Tarantola (1984); Virieux & Operto (2009); Métivier et al. (2017) has recently undergone a period of rapid development and refinement, and as a tool for high-resolution velocity model building appears to be reaching a relatively mature state. Multiparameter FWI Operto et al. (2013); Plessix et al. (2013); Alkhalifah & Plessix (2014); Innanen (2014); Pan et al. (2016), wherein larger and more difficult parametrization and updating issues are encountered, is less well advanced. However, the prospect of having access to stable and well-characterized 3-D multiparameter and multicomponent FWI algorithms, employable as a monitoring tool for inferring static or changing rock physics properties of reservoirs, makes FWI the subject of continuing research. The relationship between seismic data and subsurface geological property variations is intrinsically non-linear. Inverse methodologies may neglect non-linearity, through linearization, but many current state of the art seismic inversion methods seek to account for it. Non-linearity can in principle be accounted for directly, or through iterations. In a direct approach, there is a single, complex, often series-based calculation involving non-linear operations on the data/residuals. In the iterative approach, there are repeated procedure of a simpler calculation, each using the results of the previous iteration and each involving a linear operation on the data/residuals. Non-linear inverse scattering formulations are example of the direct approach Zhang & Weglein (2009); Gauss–Newton formulations of FWI are example of the fully iterative approach. Full Newton methods, in which the gradient and the Hessian operator are both functions of the residuals (e.g. Pan et al.2016), involve updates which each depends non-linearly on the residuals, and so are in some sense hybrids. It is possible, in other words, to view directly non-linear and iteratively non-linear methods as being end-members of a spectrum, upon which a general point would correspond to an inverse methodology with both direct and iterative characteristics in some balance. However, few if any well-defined procedures exist by which one can create a Newton-type FWI algorithm that sits at a desired point along this spectrum, that is, in which iterative and direct aspects are balanced to some set of specifications. General non-linear sensitivity formalisms, directly related to the full scattering series, have been discussed in the context of seismic inversion Wu & Zheng (2014). The purpose of this paper is to set out, and examine, the basic analytical and numerical features of an approach for the flexible incorporation of non-linearity within each FWI update by including high-order scattering. Within FWI, the issue of non-linearity mixes with issues of resolution and the presence or absence of low-frequency data. Theoretically, FWI can recover both long- and short-wavelength structures, given wide-aperture acquisition systems and broad-band sources. When provided with adequate data, high-resolution velocity models can be constructed in shallow regions, where long-to-intermediate wavelength structures can be recovered Virieux & Operto (2009). The spatial resolution of FWI is governed by the relationship established in diffraction tomography between the scattering angle and the local model wavenumber component Sirgue & Pratt (2004); Alkhalifah (2015); Brossier et al. (2015). During updating, large-angle scattering data, for example, diving waves and post-critical reflections, are used to determine low-to-intermediate wavenumber model components, and small-angle scattering data are used to determine high-wavenumber model components. Without long offsets and low-frequency information, FWI cannot reconstruct the low-wavenumber components, in which case it behaves more like a least-squares migration, updating primarily the high-wavenumber components of the model. This requires an accurate initial velocity model. Data with offsets sufficiently long to illuminate deeper parts of geological volumes are infrequently acquired. To better retrieve long-wavelength model structures, and additionally to increase the chances of FWI converging to a global minimum, several strategies are available. Some effort may be expended towards building a starting model that is sufficiently similar to the actual model that only small adjustments to long-wavelength structures is needed; alternatively, or additionally, schemes which work to more completely extract low-frequency information from the data can be devised. One published approach to this involves using complex-valued frequency (or time-damping) methods such as Laplace-domain and Laplace–Fourier-domain inversions Shin & Cha (2008, 2009), or by using envelope information obtained from the Hilbert transform, such as envelope inversion Wu et al. (2014). Inversion methods have also been proposed to overcome cycle skipping directly, by expanding the unknown model space, for example, with the Wiener-filter coefficients as in adaptive waveform inversion Warner & Guasch (2016), with an additional dimension as in Biondi & Almomin (2014), or by expanding the search space as in van Leeuwen & Herrmann (2013). Low-wavenumber components of the model can also be retrieved directly from reflection data. Several methods have been proposed to accomplish this in both data and image domains, such as reflection waveform inversion (RWI, Xu et al.2012), migration-based traveltime tomography Chavent et al. (1994), differential semblance optimization Symes & Carazzone (1991) and migration velocity analysis Sava & Biondi (2004), etc. In data-domain methods, the model is decomposed into a low-wavenumber background part, which is to be updated, and a high-wavenumber perturbation part, which is temporarily assumed to be known, and usually obtained by a migration process at each iteration. These migrations provide approximate virtual sources in depth, increasing the transmission-like or tomographic wave paths available to the updating. This way, wide-scattering angles are obtained and the low-wavenumber components of the model are better constrained. Refractions can be used together with reflections to perform a better reconstruction of the background model Wang et al. (2015); Zhou et al. (2015). Simultaneous inversion of the background and perturbation model have also been studied in the data domain and mixed data/image domain Sun & Symes (2012); Biondi & Almomin (2014); Wu & Alkhalifah (2015); Alkhalifah & Wu (2016a) to mitigate non-linearities of the FWI formulation. If an FWI formulation with non-linear sensitivity kernel explicitly included can be stably formulated, which is the goal in this paper, it should be expected to update low- and high-wavenumber components of the model simultaneously. This is because the high-order terms in the non-linear sensitivity kernel is expected to introduce features similar to RWI, in that transmission paths from scattering points in the subsurface to both source and receiver points on the surface are included. Qualitatively, we formulate the problem as follows. We will work in the context of a simple multidimensional scalar framework (i.e. focusing on P-wave velocity only). Using standard Newton- or Gauss–Newton-type FWI as a starting point, we enact a specific approach to generalizing the sensitivity kernel calculation such that it becomes generally residual-dependent. The sensitivity kernel at iteration n is normally determined by comparing a variation δs of the medium sn against the resulting variation in the wave G propagating through sn. In this paper, we instead consider the variation to be with respect to the wave propagating in the n + 1th iterate, that is, the medium we are in the process of determining. This medium is different from sn by some amount we will call Δsn. Sensitivity kernel formulae arising from this variation involve a range of scattering processes, each involving one instance of δs and multiple instances of Δsn. These formulae are not immediately ready for use, because Δsn is an unknown, in fact it is in theory exactly the update we would like to solve for. However, it offers a way for residuals to be incorporated non-linearly. An analytic example of a non-linear update, based on this new sensitivity kernel, and containing first- and second-order terms in the residuals, is then computed to examine the manner in which non-linear amplitude information is included. The first iteration in the reconstruction of a single scalar boundary from a shot record of data, a problem which can be explored with fully analytic background wave propagation operators and data, is considered. The improved step length towards the true model when backscattered data at high angles and/or due to high contrasts are used, is shown. Such a formulation may have particular relevance for updates constructed using pre-critical reflection data. It has been shown that Gauss–Newton-type FWI algorithms respond to small-angle backscattered data, for example, pre-critical specular reflections, in a manner consistent with linearized inverse scattering and linearized Amplitude-Variation-with-Offset (AVO) inversion Innanen (2014). This means a multiparameter reflection FWI updating scheme manages parameter cross-talk in the same way as do AVO and linear inverse scattering, but, it also means that linearization error will affect Gauss–Newton FWI in the presence of specular reflections to the same degree. Linearization error appears in AVO inversion because backscattered wave amplitudes are generally non-linear in medium property variations. In the special case of two elastic half-spaces, for instance, the Zoeppritz equations define a highly non-linear relationship between reflection coefficients and relative changes in elastic properties across a reflecting boundary. In AVO, the relationship is typically linearized Aki & Richards (2002); Castagna & Backus (1993); Foster et al. (2010). Linearization error takes the form of a decrease in accuracy with an increase of incidence angle and/or magnitude of relative changes. This is one reason in AVO why density is difficult to separate from P-wave velocity (e.g. Stolt & Weglein 1985), and why certain anisotropic parameters are difficult to determine in azimuthal AVO (e.g. Mahmoudian et al.2015). These parameters are best distinguished at high angle, where the Aki–Richards formula and standard FWI sensitivity kernel are insufficiently accurate. Thus, standard Gauss–Newton FWI must be expected to be insensitive to reflected amplitude variations at large angles and when perturbations are large. The analytic example demonstrates that updates consistent with non-linear AVO approximations are possible when non-linear sensitivity kernel is invoked. We next analyse the numerical properties of this non-linear FWI updating scheme in a more general setting. We start from the construction of the non-linear sensitivity kernel. By adding a Δsn to the nth updated model, the wavefield can be divided into a background wavefield that is obtained in the nth updated model, and a perturbation wavefield. Using both the background wavefield and perturbation wavefield, we obtain the non-linear sensitivity kernel. This is implemented in a non-linear two-loop FWI scheme in the frequency domain. At each iteration, we first invert for the perturbation using a linear inversion scheme. This is the inner loop. We then use the newly obtained perturbation to generate the non-linear sensitivity kernels which are employed in the outer loop. This outer loop determines the non-linear descent direction. Naturally, without going to the trouble of calculating the non-linear descent direction in the outer loop, the perturbation obtained within the inner loop could be used to provide a direct update of the model, in a scheme similar to that of Kwak et al. (2014). However, from the numerical examples, we observe that the use of the two-loop approach is what incorporates the transmission wave paths from the scattering points to both sources and receivers at the surface, whose wide-scattering angles are responsible for updating deep low-wavenumber model components. Applications on the Marmousi model indicate that this non-linear FWI approach converges in fewer iterations than the conventional FWI, and is robust in the absence of rich low-frequency data. 2 FORMULATION In this paper, the space-frequency-domain isotropic acoustic wave equation with constant density is used to describe wave motion:   $$[\omega ^2s(\mathbf {r})+ \nabla ^2] G \left(\mathbf {r},\mathbf {r}_s,\omega |s \right)=-\delta (\mathbf {r}-\mathbf {r}_s),$$ (1)where ω is the frequency, s(r) is the squared slowness parameter s(r) = c − 2(r) and G(r, rs, ω|s) is the Green’s function generated in the model s(r) by an impulsive point source located at rs. When the point source has spectrum W(ω), the wavefield is P(r, rs, ω|s) = W(ω)G(r, rs, ω|s). 2.1 From linear to non-linear sensitivity kernel FWI seeks to estimate subsurface properties through an iterative process by minimizing the difference between the synthetic data and the observed data. At the nth iteration, the misfit function is usually given in a least-squares norm as   $$\phi (s_n)=\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega \Vert \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \Vert ^2,$$ (2)where δd(rg, rs, ω|sn) = d(rg, rs, ω) − P(rg, rs, ω|sn) is the data residual between the observed data d(rg, rs, ω) and the synthetic data P(rg, rs, ω|sn) for each source rs and receiver rg of the seismic survey, and ‖.‖ is the L2 norm. The synthetic data are obtained by extracting the wavefield P(r, rs, ω|sn), modeled in the model sn(r) of the current iteration, at the receiver locations for each source. It does so by iteratively updating the current model sn(r) to sn + 1(r) using a perturbation Δsn(r)   $$s_{n+1}(\mathbf {r})=s_n(\mathbf {r})+\Delta s_n(\mathbf {r}),$$ (3)which can be determined by finding the minimum of the misfit function in the vicinity of the current model sn(r) Pratt et al. (1998); Virieux & Operto (2009)   $$\Delta s_n(\mathbf {r})=-\sum _{\mathbf {r}^{\prime }} H_n^{-1} (\mathbf {r},\mathbf {r}^{\prime })g_n(\mathbf {r}^{\prime }),$$ (4)where gn(r) and Hn(r, r΄) are the gradient and the Hessian calculated at the nth iteration, respectively. The gradient is the first-order derivative of the misfit function ϕ(sn) with respect to the model parameter s,   $$g_n(\mathbf {r})=\frac{\partial \phi (s_n)}{\partial s(\mathbf {r})}=- \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n )}{\partial s(\mathbf {r})} \delta d^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n ) \right),$$ (5)where * stands for the complex conjugate and ∂P(rg, rs, ω|sn)/∂s(r) is the Fréchet derivative or the sensitivity kernel. Since it is extremely expensive to explicitly calculate the sensitivity kernel, in FWI, the adjoint-state method Lailly (1983); Plessix (2006) is usually used to directly calculate the gradient. The negative of the gradient points to descent direction. The Hessian is the second-order derivative of the misfit function,   \begin{eqnarray} H\left(\mathbf {r},\mathbf {r}^{\prime }\right)&=&\frac{\partial ^2 \phi (s_n)}{\partial s(\mathbf {r}) \partial s(\mathbf {r}^{\prime })} \nonumber \\ &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \frac{\partial P^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r}^{\prime })} +\, \frac{\partial ^2 P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s^2(\mathbf {r})} \delta d^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right). \end{eqnarray} (6)The second term in the Hessian, which is related to double-scattered waves, is usually neglected, and the approximation of Hessian is used to form a Gauss–Newton update,   $$H_{{\rm GN}} (\mathbf {r},\mathbf {r}^{\prime })= \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \int {\,\,}\mathrm{d}\omega {\rm Re} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \frac{\partial P^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r}^{\prime })} \right).$$ (7)The inverse of the Hessian can play an important role in FWI. The inverse of the approximate Hessian can help refocus the gradient from limited illumination, as well as mitigate the coupling between different parameters, and the inverse of the full Hessian can also help remove the artefacts caused by double-scattered waves, which is expected to make the full Newton method converge more rapidly Pratt et al. (1998); Operto et al. (2013). However, it is possible that adding the second term of the Hessian could destroy the positive definiteness of the approximate Hessian, in which case, adding an appropriate pre-conditioner, which is usually an approximation of the inverse Hessian, is necessary to accelerate the convergence of the full Newton method Métivier et al. (2014, 2017). When assuming the inverse of the Hessian as a scalar αn, the model can then be updated through a steepest descent (SD) method   $$s_{n+1}(\mathbf {r})=s_n(\mathbf {r})-\alpha _n g_n(\mathbf {r}),$$ (8)where αn can be determined by a line search method. The optimization problem of minimizing the misfit function (2) is highly non-linear, since the relationship between the wavefield and model is non-linear, as described by the wave equation (1). The model perturbation (4) is found in the framework of the Born approximation, which requires the mismatch of the traveltimes less than half of the period, to avoid convergence toward a local minimum caused by so-called cycle-skipping problem (e.g. Virieux & Operto 2009). One direct way to help mitigate this non-linearity is to have broad-band seismic data with wide-offset range and very good initial model before starting the FWI procedure, however, even with different types of misfit definition proposed van Leeuwen & Mulder (2008); Brossier et al. (2015); Choi & Alkhalifah (2015), the gradient (5) used to find the model perturbation can be seen as obtained with the sensitivity kernel ∂P/∂s derived by varying the field P(rg, rs, ω|sn) in the current model sn(r) under the Born approximation (as shown in Appendix eq. A6) as   $$\frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} \approx -\omega ^2G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n),$$ (9)which reflects the ‘true’ sensitivity of the model only to the extent that sn(r) reflects the true model properties, so it depends only on sn(r). The other way is to incorporate the non-linearity by taking multiple scattering into account Wu & Zheng (2014); Alkhalifah & Wu (2016b); Wu & Alkhalifah (2017), which can be performed by introducing non-linear sensitivity kernels. Prior to convergence, a different sensitivity, in fact, one which implicitly or explicitly depends on the residuals, is expected if we instead vary the field in the model under construction, sn + 1(r). Based on the scattering theory, as discussed in Innanen (2015), the synthetic data calculated using model sn + 1(r) can be written as an expansion of the synthetic data calculated using model sn(r) using Lippmann–Schwinger equation as shown in Appendix eq. (A5),   \begin{eqnarray} &&{{ P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1} ) }}\nonumber \\ \quad&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n+1}) \nonumber \\ \quad&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \nonumber \\ && {}+\omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime }) \int {\,\,}\mathrm{d}\mathbf {r}^{\prime \prime }G(\mathbf {r}^{\prime },\mathbf {r}^{\prime \prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime \prime })P(\mathbf {r}^{\prime \prime },\mathbf {r}_s,\omega |s_n)\nonumber \\ &&{}+ \ldots . \end{eqnarray} (10)Varying the model by a δs localized at r in sn + 1(r), the calculated field is   \begin{eqnarray} P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}+\delta s(\mathbf {r}))&=& P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) - \omega ^2 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta s(\mathbf {r}) P(\mathbf {r},\mathbf {r}_s,\omega |s_{n})- \omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r},\omega |s_n)G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_n) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) + \ldots . \end{eqnarray} (11)Comparing the synthetic data P(rg, rs, ω|sn + 1 + δs(r)) in the perturbed model sn + 1(r) + δs in eq. (11) and synthetic data P(rg, rs, ω|sn + 1) at the n + 1th iteration in eq. (10), the perturbation δP(rg, rs, ω|sn + 1) caused by this added variation δs and Δsn(r) is the difference between these two series:   \begin{eqnarray} \delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})&=&P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}+\delta s(\mathbf {r})) -P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1}) = -\omega ^2 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta s(\mathbf {r}) P(\mathbf {r},\mathbf {r}_s,\omega |s_{n}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r},\omega |s_n)G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_n) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) \nonumber \\ && {}+ \omega ^4 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) \delta s(\mathbf {r}) + \ldots , \end{eqnarray} (12)where the first term can be interpreted as a scattering process involving one interaction with the variation δs only, and the second and third terms include one interaction with Δsn(r) and one interaction with δs (see Figs 1a–c). The resulting sensitivity kernel at n + 1th iteration can then be written as   \begin{eqnarray} \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} &=& \lim _{\delta s \rightarrow 0} \frac{\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\delta s(\mathbf {r})} \nonumber \\ &=& \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_0 +\left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1+ \ldots , \end{eqnarray} (13)where the index refers to the order of the term in the perturbation Δsn(r). The first term is the conventional FWI sensitivity kernel (9), which only depends on the model sn(r)   $$\left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_0 =\frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)}{\partial s(\mathbf {r})} = -\omega ^2G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n),$$ (14)and the second term is   \begin{eqnarray} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1 = -\omega ^2 \Big (\delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \Big ), \end{eqnarray} (15)where δG(rg, r, ω|sn) and δP(r, rs, ω|sn) are the first-order scattered wavefields related to Δsn(r),   \begin{eqnarray} \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_{n}), \end{eqnarray} (16)  \begin{eqnarray} \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n)= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_{n})\Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n}). \end{eqnarray} (17)Starting from the second term, the sensitivity kernel depends not only on the model sn, but also on the perturbation Δsn(r), which introduce the non-linear effect to the sensitivity kernel. When Δsn(r) is zero, the higher order terms in eq. (13) vanish, and the sensitivity kernel reduces to the conventional form in eq. (9). Figure 1. View largeDownload slide Scattering processes associated with non-linear sensitivity kernel calculation: (a) zeroth order in Δsn, and (b) and (c) first order in Δsn. Propagation is indicated with a blue arrow. Figure 1. View largeDownload slide Scattering processes associated with non-linear sensitivity kernel calculation: (a) zeroth order in Δsn, and (b) and (c) first order in Δsn. Propagation is indicated with a blue arrow. To include higher order scattering terms related to Δsn(r), scattered wavefields in eqs (16) and (17) can be replaced by the Lippmann–Schwinger equation as   \begin{eqnarray} \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_{n+1})= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n)\Delta s_n(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r},\omega |s_{n+1}), \end{eqnarray} (18)  \begin{eqnarray} \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_{n+1})= -\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_{n})\Delta s_n(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_{n+1}). \end{eqnarray} (19)Substituting scattered wavefield equations (18) and (19) back to the second term (15), we can extend this second-order sensitivity kernel into a higher order sensitivity kernel   \begin{eqnarray} \left( \frac{\partial P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1+\cdots = -\omega ^2 (\delta G(\mathbf {r}_g,\mathbf {r},\omega |s_{n+1})P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_{n+1}) ). \end{eqnarray} (20)In this higher order sensitivity kernel, the term δG(rg, r, ω|sn + 1)δP(r, rs, ω|sn + 1) is not included, and, because we make this additional approximation, it shares similar form as the one derived based on the De Wolf approximation Wu & Zheng (2014), with its first term (15) used in RWI (e.g. Xu et al.2012) to provide a better update for the background model. 2.2 Calculation of perturbation Δsn before n + 1th iteration All the Green’s functions used in the perturbation wavefield equation (12) to construct this new sensitivity kernel equation (13) for n + 1th model sn + 1(r) can be calculated in the nth model sn(r). However, to calculate the higher order terms in the new sensitivity kernel equation (13), perturbation Δsn(r) is required. This quantity is unknown, but it has a relationship with the nth data residual. According to inverse scattering theory, for each frequency, the perturbation Δsn(r) can be exchanged in to a series related to the nth data residual δd(rg, rs, ω|sn) Innanen (2015),   \begin{eqnarray} \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)&=& d(\mathbf {r}_g,\mathbf {r}_s,\omega )-P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \nonumber \\ &=&-\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n)+ \ldots \nonumber \\ &=& {\cal G} \Delta s_n(\mathbf {r})+\cdot , \end{eqnarray} (21)where Δsn(r) = s(r) − sn(r), the difference between the true model s(r) and the nth model sn(r), and the data residual δd(rg, rs, ω|sn) can be seen as the scattered data produced by this perturbation Δsn(r). Through standard inverse scattering techniques, Δsn(r) can then be inverted as a series expression in order of the data residual Weglein et al. (2003). By taking only the first-order term related to Δsn(r), eq. (21) becomes the Born forward modeling, which describes the relationship between the perturbation Δsn(r) and the data residual as   $$\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) =-\omega ^2 \int {\,\,}\mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r}_g,\mathbf {r}^{\prime },\omega |s_n) \Delta s_n(\mathbf {r}^{\prime }) P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n).$$ (22)Therefore, the perturbation Δsn(r) can be approached by the solution Δs(r) of a further, second data fitting scheme in a least-squares sense, which iteratively minimizes the misfit function   \begin{eqnarray} \phi (\Delta s)&=&\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \left\Vert \delta S(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right\Vert ^2 \nonumber \\ &=&\frac{1}{2} \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \left\Vert \delta d(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)-\delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) \right\Vert ^2, \end{eqnarray} (23)where δP(rg, rs, ω|sn) is the scattered data obtained from sampling the scattered wavefield δP(r, rs, ω|sn) at receiver rg, which can be found by solving the wave equation with a virtual secondary source − ω2Δs(r΄)P(r΄, rs, ω|sn):   $$\left[\omega ^2 s_{n}(\mathbf {r})+ \nabla ^2 \right] \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) =-\omega ^2\Delta s(\mathbf {r}^{\prime })P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_n).$$ (24)The perturbation Δs(r) can be obtained by updating the initial perturbation (e.g., starting from zero) with the gradient $$\tilde{g}(\mathbf {r})$$ of the misfit function (23)   $$\tilde{g}(\mathbf {r}) = \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} \omega ^2 {\rm Re} ( G(\mathbf {r}_g,\mathbf {r},\omega |s_n) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \delta S^{*}(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) ).$$ (25)By considering only the first-order scattering as in eq. (22), the updating of this perturbation should then be equivalent to applying a Gauss–Newton pre-conditioner to the gradient of the FWI misfit equation (5) at the current model sn(r) (Alkhalifah & Wu 2016b). Similar as discussed in Kwak et al. ’s (2014) work, this perturbation Δs(r) can, lead to a direct update of the model without line search at one frequency as   $$\tilde{s}_{n+1}(\mathbf {r})=s_n(\mathbf {r})+\Delta s(\mathbf {r}).$$ (26) 2.3 Gradient with non-linear sensitivity kernel Substituting the perturbation Δs(r) into the non-linear sensitivity kernel equation (13), we can get the gradient with non-linear sensitivity kernel up to second order to update model sn(r) as   \begin{eqnarray} g_n(\mathbf {r})&=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} \big (\omega ^2 \delta d^{*} (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n ) \big( G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \nonumber \\ && {} + \delta G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \big )\big ), \end{eqnarray} (27)where δP(r, rs, ω|sn) is as in eq. (24) and under the reciprocity principle, δG(rg, r, ω|sn) can be obtained as the scattered wavefield δG(r, rg, ω|sn) by solving the wave equation with virtual secondary source − ω2Δs(r΄)G(r΄, rg, ω|sn) in model sn(r). Changing wavefields in the virtual secondary sources from P(r΄, rs, ω|sn) and G(r΄, rg, ω|sn) to $$P(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |\tilde{s}_{n+1})$$ and $$G(\mathbf {r}^{\prime },\mathbf {r}_g,\omega |\tilde{s}_{n+1})$$ for source and receiver side scattered wavefields as in eq. (20), we can obtain the gradient contains higher order scattering processes. By exchanging the data residual δd(rg, rs, ω|sn) with $$\delta d (\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1})$$  \begin{eqnarray} \delta d \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) &=& d (\mathbf {r}_g,\mathbf {r}_s,\omega ) -P (\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \nonumber \\ &\approx & d (\mathbf {r}_g,\mathbf {r}_s,\omega) -P (\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) - \delta P(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1}), \end{eqnarray} (28)which contains not only the data in model sn(r), but also the synthesized scattered data in perturbation Δs(r), we can get the gradient with higher order sensitivity kernel to update model $$\tilde{s}_{n+1}(\mathbf {r})$$ as   \begin{eqnarray} g_{n+1}(\mathbf {r}) &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} \Big (\omega ^2 \delta d^{*} \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) \big ( G(\mathbf {r}_g,\mathbf {r},\omega |s_n)P(\mathbf {r},\mathbf {r}_s,\omega |s_n) \nonumber \\ &&{} + \delta G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1}) P(\mathbf {r},\mathbf {r}_s,\omega |s_n) + G(\mathbf {r}_g,\mathbf {r},\omega |s_n) \delta P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \big )\Big ) \nonumber \\ &=& \sum _{\mathbf {r}_s} \sum _{\mathbf {r}_g} {\rm Re} (\omega ^2 \delta d^{*} \left(\mathbf {r}_g,\mathbf {r}_s,\omega |\tilde{s}_{n+1} \right) \big ( G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1})P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \nonumber \\ &&{} - \delta G(\mathbf {r}_g,\mathbf {r},\omega |\tilde{s}_{n+1}) \delta P(\mathbf {r},\mathbf {r}_s,\omega |\tilde{s}_{n+1}) \big )). \end{eqnarray} (29) 3 ANALYTICAL AND NUMERICAL EXAMPLES In this section, we will exemplify the behaviour of this non-linear FWI scheme in two distinct ways. First, we will consider its analytic response to reflection data from a single horizontal interface representing a jump in model parameter value from s0 to s0 + Δs. We will use this analytical development to show that, while a standard Gauss–Newton FWI update is in error by order (Δs/s0)2, a non-linear FWI update truncated at second order in the residuals generates an update in error by at most order (Δs/s0)3. Second, we will examine the numerical behaviour of the non-linear FWI scheme using the Marmousi model. We will use these numerical examples as a platform upon which to discuss implementation issues. 3.1 Non-linear FWI and amplitude non-linearity Let us first examine a special case of these non-linear sensitivity kernel and assume the source spectrum is W(ω) = 1, in which case the resulting gradient (i) is second order in the residuals, and (ii) simply incorporates non-linear reflectivity information. This is achieved by truncating eq. (13) at order 1, and considering only a portion of the full difference Δsn(r) between the nth and the n + 1th iterate, as illustrated in Fig. 2. In it, the general scattering picture (see Figs 1a–c) is replaced with a Δsn(r) that is localized and collocated with the variation point. This is obtained by setting Δsn(r΄) = Δsn(r)δ(r − r΄) in eqs (18) and (19), such that second term in eq. (13) becomes   \begin{eqnarray*} \nonumber \left( \frac{\partial G(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right)_1 = 2 \omega ^4 G(\mathbf {r}_g,\mathbf {r},\omega |s_n) G(\mathbf {r},\mathbf {r}_s,\omega |s_n) G(\mathbf {r},\mathbf {r},\omega |s_n) {\cal G}^{*-1} \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n), \end{eqnarray*} where $$\cal G$$ is the operator containing the integral and Green’s functions in eq. (21) and the quantity $${\cal G}^{*-1}\delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n)$$ reduces to   \begin{eqnarray} {\cal G}^{*-1} \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) = - \frac{1}{\omega ^2} \frac{ \delta d^*(\mathbf {r}_g,\mathbf {r}_s,\omega |s_n) }{G^*(\mathbf {r}_g,\mathbf {r},\omega |s_n) G^*(\mathbf {r},\mathbf {r}_s,\omega |s_n) }. \end{eqnarray} (30)Putting the lowest two orders of the sensitivity kernel together the following case of non-linear sensitivity kernel is obtained:   \begin{eqnarray} \left( \frac{\partial G(\mathbf {r}_g,\mathbf {r}_s,\omega |s_{n+1})}{\partial s(\mathbf {r})} \right) \approx -\omega ^2 J_{3{\rm D}}(\mathbf {r},\omega | s_n) \left[ 1 + 2 \frac{ \delta d^*( \mathbf {r}_g,\mathbf {r}_s,\omega |s_n) }{ I_{3{\rm D}}(\mathbf {r},\omega |s_n) } \right], \end{eqnarray} (31)where   \begin{eqnarray} I_{3{\rm D}}(\mathbf {r},\omega | s_n) &=& \frac{G^*(\mathbf {r}_g,\mathbf {r},\omega |s_n) G^*(\mathbf {r},\mathbf {r}_s,\omega |s_n)}{G(\mathbf {r},\mathbf {r},\omega |s_n)},\ {\rm and}\ J_{3{\rm D}}(\mathbf {r},\omega | s_n) = G(\mathbf {r}_g,\mathbf {r},\omega |s_n) G(\mathbf {r},\mathbf {r}_s,\omega |s_n). \end{eqnarray} (32)The formula I3D involves evaluating the Green’s function G(r, r, ω|sn) with colocated source and receiver. In any numerical application this term would need to be regularized. In our 2-D analysis to follow, which is carried out in the wavenumber domain, this leads to no complicating issues. Figure 2. View largeDownload slide Special case of scattering processes illustrated in Fig. 1, in which Δsn(r) and the variation δs(r) are local and collocated. Figure 2. View largeDownload slide Special case of scattering processes illustrated in Fig. 1, in which Δsn(r) and the variation δs(r) are local and collocated. In order to verify that they meaningfully incorporate non-linear information, we consider a 2-D version of the sensitivity kernel formula in eq. (31). In this 2-D case, the model varies in depth only, and thus sensitivity kernel is defined in terms of medium variations in z, but the wave physics is 2-D (coordinates x and z). Also, for the purpose of analysis, we consider the data in the (kg, ω) domain, where kg is the Fourier conjugate to the lateral geophone position xg. Due to a source at (xs, zs), the wavefield observed at depth zg is P(kg, zg, xs, zs, ω). To simplify the problem, we use only one shot at the origin zs = xs = 0 with record coordinate system along the surface with zg = 0. Under this restriction, the data residual is δd(kg, ω|sn), and the associated second-order sensitivity kernel formula has the same essential form,   \begin{eqnarray} \left( \frac{\partial G(k_g,\omega |s_{n+1})}{\partial s(z)} \right) \approx -\omega ^2 J_{2{\rm D}} (k_g,z,\omega |s_n) \left[ 1 +2 \frac{ \delta d^*(k_g,\omega |s_n) }{ I_{2{\rm D}}(k_g,z,\omega |s_n) } \right], \end{eqnarray} (33)but some slight differences in the weights:   \begin{eqnarray*} \nonumber J_{2{\rm D}} (k_g,z,\omega |s_n) = \int {\,\,}\mathrm{d}x^{\prime } G(k_g,0,x^{\prime },z,\omega |s_n) G(x^{\prime },z,0,0,\omega |s_n), \end{eqnarray*} and   \begin{eqnarray*} \nonumber I_{{\rm 2D}} (k_g,z,\omega |s_n) = \frac{ J_{2{\rm D}}^* (k_g,z,\omega |s_n) J_{2{\rm D}} (k_g,z,\omega |s_n) }{ \int {\,\,}\mathrm{d}x^{\prime } \int {\,\,}\mathrm{d}x^{\prime \prime } G(k_g,0,x^{\prime },z,\omega |s_n) G(x^{\prime },z,x^{\prime \prime },z,\omega |s_n) G(x^{\prime \prime },z,0,0,\omega |s_n) }. \end{eqnarray*} FWI updates are then derived in the (kg, ω) domain, with the data residual δd(kg, ω|sn) for a fixed kg at each iteration, which allows us to distinguish between updating with high angle (large kg) versus low angle data. The misfit function of FWI (2) is modified to   \begin{eqnarray} \phi (s_n) = \frac{1}{2} \int {\,\,}\mathrm{d}\omega \Vert \delta d(k_g,\omega |s_n) \Vert ^2, \end{eqnarray} (34)and for the first FWI update, it is minimized with updates of Gauss–Newton form:   \begin{eqnarray} \delta s_0 (z) = - \int {\,\,}\mathrm{d}z^{\prime } H_0^{-1}(z,z^{\prime }) g_0(z^{\prime }), \end{eqnarray} (35)where the gradient is based on a version of the non-linear sensitivity kernel:   \begin{eqnarray} (g_0(z))_{{\rm nonlinear}} = - \int {\,\,}\mathrm{d}\omega \left( \frac{ \partial G(k_g,\omega |s_{1}) }{ \partial s(z) } \right) \delta d^*(k_g,\omega |s_0), \end{eqnarray} (36)and the Hessian is based on standard sensitivity kernel (e.g. Virieux & Operto 2009):   \begin{eqnarray} H_0(z,z^{\prime }) = \int {\,\,}\mathrm{d}\omega \frac{ \partial G(k_g,\omega |s_{0}) }{ \partial s(z) } \frac{ \partial G^*(k_g,\omega |s_{0}) }{ \partial s(z^{\prime }) }. \end{eqnarray} (37)If the initial medium is homogeneous, we can analyse this update using exact forms for the Green’s functions Clayton & Stolt (1981): $$G(k_g,z_g,x,z,\omega |s_0) = (i2q_g)^{-1}e^{-ik_gx + iq_g|z_g-z|}$$ and $$G(x,z,x_s,z_s,\omega |s_0) = (2 \pi )^{-1} \int \mathrm{d}k^{\prime } (i2q^{\prime })^{-1} e^{ ik^{\prime }(x-x_s) + iq^{\prime }|z-z_s|}$$, where $$q_g^2 = \omega ^2 s_0 - k_g^2$$ and q΄2 = ω2s0 − k΄2. The I and J reduce to   \begin{eqnarray} J_{{\rm 2D}} (k_g,z,\omega |s_0) = \frac{e^{i2q_gz}}{(i2q_g)^2}, I_{2{\rm D}} (k_g,z,\omega |s_0) = \frac{e^{-i2q_gz}}{i2q_g}, \end{eqnarray} (38)and the sensitivity kernel itself becomes   \begin{eqnarray} \left( \frac{\partial G(k_g,\omega |s_{1})}{\partial s(z)} \right) = -\omega ^2 \left[ \frac{e^{i2q_gz}}{(i2q_g)^2} \right] \left[ 1 +2 \delta d^*(k_g,\omega |s_0) (i2q_g) e^{i2q_gz} \right]. \end{eqnarray} (39) Reconstruction of the single-interface medium in Fig. 3(a) is considered. The goal is the determination, from a constant initial medium s0, of the profile s(z) = s0 + ΔsS(z − z1), where S is a step or Heaviside function. Δs is the amplitude of the ideal update, taking us directly from the initial model to the correct answer. The backscattered amplitude (i.e. the reflection coefficient) can be expressed as a series in orders of this Δs:   \begin{eqnarray} R(\theta ) = R_1(\theta ) + R_2(\theta ) + \cdot , {\rm where} \end{eqnarray} (40)  \begin{eqnarray*} R_1(\theta ) = - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right), R_2(\theta ) = \frac{1}{8} \left( 1 + 2 \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right)^2, \\ \end{eqnarray*} etc. In Fig. 4(a), we note by comparing exact, first-order (R ≈ R1) and second-order (R ≈ R1 + R2) coefficient calculations that basic linearization error can arise at low angles when contrasts are large, and (in Fig. 4b) at high angle even when contrasts are low. We also note that corrections out to as low as second order can lead to significant error reduction. With analysable formulae for R in hand, we can then analytically express the complex conjugate of the (kg, ω) domain residuals at the first iteration Innanen (2014):   \begin{eqnarray} \delta d^*(k_g,\omega |s_0) = -R(\theta ) \frac{e^{-i2q_gz_1}}{i2q_g}. \end{eqnarray} (41) Figure 3. View largeDownload slide Model for analysis of non-linear FWI updates. (a) Medium is a step S(z − z1) from s0 to s1 with boundary at z1 and (b) reflection coefficient R(θ) where θ is related to plane wave variables ω, $$c_0 = 1/\sqrt{s_0}$$, kg and qg. Figure 3. View largeDownload slide Model for analysis of non-linear FWI updates. (a) Medium is a step S(z − z1) from s0 to s1 with boundary at z1 and (b) reflection coefficient R(θ) where θ is related to plane wave variables ω, $$c_0 = 1/\sqrt{s_0}$$, kg and qg. Figure 4. View largeDownload slide Reflection coefficient R(θ) as a function of large contrasts or large angles. (a) R(θ) for low angles, but large contrasts and (b) R(θ) for small contrasts, but large angles. Figure 4. View largeDownload slide Reflection coefficient R(θ) as a function of large contrasts or large angles. (a) R(θ) for low angles, but large contrasts and (b) R(θ) for small contrasts, but large angles. With all the ingredients for the sensitivity kernel now available in analytic form, we may analyse the first iteration in the reconstruction of Fig. 3(a). The gradient now has two terms, one first order in δd* and the other second, that is, $$(g_0(z))_{{\rm nonlinear}} = g_0^{(1)}(z) + g_0^{(2)}(z)$$, where   \begin{eqnarray} g_0^{(1)}(z) &=& \frac{c_0^2}{4} R(\theta ) \int\mathrm{d}\omega \left( \frac{\omega ^2/c_0^2}{q_g^2} \right) \frac{e^{i2q_g(z-z_1)}}{i2q_g},\ {\rm and}\ g_0^{(2)}(z) = -\frac{c_0^2}{2} R^2(\theta ) \int \mathrm{d}\omega \left( \frac{\omega ^2/c_0^2}{q_g^2} \right) \frac{e^{i2q_g(2z-2z_1)}}{i2q_g}. \end{eqnarray} (42)Noting that Innanen (2014)  dω =  d2qg(c0/2cos θ) and qg = (ω/c0)cos θ, we can evaluate these integrals and reassemble the gradient, obtaining   \begin{eqnarray} (g_0(z))_{{\rm nonlinear}} = \frac{c_0^3 \pi }{4 \cos ^3 \theta } [ R(\theta ) - 2R^2(\theta ) ] S(z-z_1). \end{eqnarray} (43)The Hessian, which we have given a standard Gauss–Newton approximate form, evaluates in this simple environment (Innanen 2014) to   \begin{eqnarray} H_0(z,z^{\prime }) = c_0^5 \pi (16 \cos ^5 \theta )^{-1} \delta (z-z^{\prime }), \end{eqnarray} (44)and so, via eq. (35), the first update is of the form   \begin{eqnarray} (\delta s_{0} (z))_{{\rm nonlinear}} & =& - \frac{ 4 \cos ^2 \theta }{c_0^2} \left[ R(\theta ) - 2R^2(\theta ) \right] S(z-z_1). \end{eqnarray} (45) We characterized the ideal update as Δs(z) = ΔsS(z − z1) and related it to the reflection coefficient through eq. (40). To analyse the relative accuracy of the first- and second-order FWI iterations, we will substitute two truncations of the series for R(θ) into eq. (45). The standard Gauss–Newton update is recovered by neglecting R2; noting also that to leading order in sin 2θ we may replace 1/cos 2θ with (1 + sin 2θ), we obtain   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx - \left( \frac{4}{1 + \sin ^2 \theta } \right) R(\theta ) S(z-z_1). \end{eqnarray} (46)The non-linear Gauss–Newton-like update, based on second-order collocated sensitivity kernel, is   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm nonlinear}} =- \left( \frac{4}{1 + \sin ^2 \theta } \right) \left[ R(\theta ) - 2 R^2(\theta ) \right] S(z-z_1). \end{eqnarray} (47)Let us first verify that a standard Gauss–Newton update is equivalent to the ideal update to first order. If contrasts and angles are low, second-order contributions to R are negligible, and the reflection coefficient is   \begin{eqnarray} R(\theta ) \approx R_1(\theta ) = - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right); \end{eqnarray} (48)substituting this into eq. (46), we obtain   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx \left( \frac{\Delta s}{s_0} \right) S(z-z_1), \end{eqnarray} (49)demonstrating the equivalence of δs0(z) and the ideal Δs(z). However, if the angle or contrast is such that second-order contributions to R(θ) are non-negligible, referring to eq. (40) we must instead substitute   \begin{eqnarray} R(\theta ) \approx - \frac{1}{4} \left( 1 + \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right) + \frac{1}{8} \left( 1 + 2 \sin ^2 \theta \right) \left( \frac{\Delta s}{s_0} \right)^2, \end{eqnarray} (50)and this produces a discrepancy at second order between the Gauss–Newton update δs0(z) and the ideal update Δs(z):   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm standard}} \approx \left[ \left( \frac{\Delta s}{s_0} \right) - \frac{1}{2} (1+\sin ^2 \theta ) \left( \frac{\Delta s}{s_0} \right)^2 \right] S(z-z_1). \end{eqnarray} (51)Let us compare this with the update generated using the second-order sensitivity kernel expression. Substituting the reflection coefficient in eq. (50) into eq. (47), a corrective term is introduced at second order, exactly suppressing the second-order discrepancy corrupting eq. (51), such that the resulting update lapses back to   \begin{eqnarray} \left( \frac{ \delta s_{0} (z) }{s_0} \right)_{{\rm nonlinear}} \approx \left( \frac{\Delta s}{s_0} \right) S(z-z_1); \end{eqnarray} (52)here the consistency of the candidate and ideal updates extends to second order, rather than just first. In Fig. 5 the difference between second-order sensitivity kernel and (standard) first-order sensitivity kernel is illustrated. Because we consider a fixed kg, we can examine the accuracy angle by angle; a full inversion would sum over kg and thus average over these angles. In Fig. 5(a) the interface is large contrast, going from c0 = (s0)−1/2 = 1500 ms−1 in the upper half-space to c1 = (s1)−1/2 = 1800 ms−1 in the lower; especially in the range θ = 0°–30° the difference between the standard Gauss–Newton update and that based on second-order sensitivity kernel is significant. Meanwhile in Fig. 5(b), the interface represents a small contrast, with c0 = 1500 ms−1 and c1 = 1600 ms−1, but is examined over a wider range of angles. Here the second-order update ‘sticks to’ the exact update to roughly θ = 60°, in contrast to the standard update which deviates significantly at θ = 30°. Figure 5. View largeDownload slide Exact (solid), first-order (dash dot), versus second-order (dash) updates at each angle of incidence, (a) the interface corresponds to a large contrast and (b) the interface corresponds to a small contrast but examined over a larger range of angles. Figure 5. View largeDownload slide Exact (solid), first-order (dash dot), versus second-order (dash) updates at each angle of incidence, (a) the interface corresponds to a large contrast and (b) the interface corresponds to a small contrast but examined over a larger range of angles. 3.2 Numerical implementation and examples So far we have discussed a non-linear FWI scheme by dividing the calculation of sensitivity kernel into two steps: the first step is calculating a perturbation from the current data residual and the second step is using this perturbation to generate the higher order terms in the sensitivity kernel. Multiscale strategies can be used in this non-linear FWI, by performing the inversion sequentially from the low to high frequencies, and the frequency interval can be determined according to the half offset-to depth ratio Rmax = hmax/z Sirgue & Pratt (2004) as   $$\omega _{n+1}=\frac{\omega _n}{a_{{\rm min}}},$$ (53)with   $$a_{{\rm min}}=\frac{1}{\sqrt{1+R_{{\rm max}}^{2}}}.$$ (54)Although the frequency interval determined by eq. (53) is not constant as is normally used, which is determined by the maximum recorded time to prevent aliasing, theoretically it can provide enough coverage for the vertical wavenumber. An efficient non-linear FWI algorithm can be perform with certain frequencies as shown in Table 1. Table 1. Frequency-domain non-linear FWI.     View Large We will use a nine point frequency-domain finite difference with constant density Jo et al. (1996) to model the observed data and synthetic data. Hereafter, second order FWI (SOFWI) and non-linear FWI (NFWI) will be used to distinguish the gradient calculated using eq. (27) with non-linear sensitivity kernel up to second order and data residual in model sn, and eq. (29) with non-linear sensitivity kernel with all higher order terms and data residual in model $$\tilde{s}_{n+1}(\mathbf {r})$$, respectively. FWI solved with conventional updates, by SD, truncated Newton method with approximate Hessian (GN) and full Hessian (Newton) Métivier et al. (2013); Pan et al. (2017) will be used for comparison. The true model is shown in Fig. 6, we add a 500 m water layer on the top of the model to reduce the refractions in the data. Two initial models and different frequency ranges are used to test convergence and resolution characteristics of the method. Synthetic data are generated in the frequency domain with 461 receivers with spacing of 20 m and 46 sources with spacing of 200 m along the surface, where the first receiver locates at x = 0 m and the first source locates at x = 100 m. Figure 6. View largeDownload slide True velocity model of the Marmousi example. Figure 6. View largeDownload slide True velocity model of the Marmousi example. The first initial model is obtained from the true velocity using a Gaussian smoother (excluding the water layer) as shown in Fig. 7(a). With three frequencies (4, 6.6 and 14.9 Hz) starting at 4 Hz, 10 iterations per frequency are used in both conventional and non-linear FWI by updating only the velocity under the added water layer. The SD FWI result is shown in Fig. 7(b). Five inner iterations are used for truncated Newton method (with GN Hessian as in Fig. 7c and full Hessian as in Fig. 7d), SOFWI (Fig. 7e) and NFWI (Fig. 7f). For the first frequency, Fig. 8(a) shows the residual norm versus number of iterations, and Fig. 8(b) shows the model error norm versus number of iterations, with dash–dotted, dashed, dot, solid and sold lines with plus sign stands for SD FWI, truncated GN FWI, truncated Newton FWI, NFWI and SOFWI result, respectively. FWI with the non-linear gradient converges more rapidly than conventional SD FWI. Because of the lack of low frequencies, both truncated GN FWI and truncated Newton FWI perform similarly to SD FWI, but the convergence of both SOFWI and NFWI outperforms SD FWI; the NFWI result is particularly close to the true model. Figure 7. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method, (e) SOFWI and (f) NFWI with 10 iterations for three frequencies starting from 4 to 15 Hz. Figure 7. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method, (e) SOFWI and (f) NFWI with 10 iterations for three frequencies starting from 4 to 15 Hz. Figure 8. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Figure 8. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Then, we use a linear model as the initial model, which is exact at the added water layer, but changes linearly along depth direction from 1.5 to 4 km s−1, as shown in Fig. 9(a) to test NFWI. Five frequencies (2, 3.3, 5.5, 9, and 14.9 Hz) starting at 2 Hz are used for the inversion, with 10 iterations per frequency. Figs 9(b)–(d) show the inversion results from conventional FWI using SD, truncated GN and truncated Newton method, respectively, and Fig. 9(e) shows the inversion result from NFWI (five inner iterations are used for truncated GN FWI, truncated Newton FWI and NFWI). Velocity profiles along x = 4 and 6 km are shown in Fig. 10, where black lines show the true velocity, green lines show the initial velocity, and red, cyan, magenta and blue line stands for SD FWI, truncated GN FWI, truncated Newton FWI and NFWI result, respectively. For the first frequency, Fig. 11(a) shows the residual norm versus number of iterations, and Fig. 11(b) shows the model error norm versus number of iterations. We observe that, for the given number of iterations, with a minimum frequency of 2 Hz, SD FWI can only reconstruct rough model structures, truncated GN FWI can help converge much faster towards the true model, and truncated Newton FWI with full Hessian, which includes second-order scattering processes, does not provide significant improvement in comparison to truncated GN FWI. But, NFWI outperforms all these methods, especially in its reconstruction of model structures in the deeper regions. Figure 9. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method and (e) NFWI with 10 iterations for five frequencies starting from 2 to 15 Hz. Figure 9. View largeDownload slide (a) Initial velocity. Inversion results with conventional updates using (b) SD, (c) truncated GN, (d) truncated Newton method and (e) NFWI with 10 iterations for five frequencies starting from 2 to 15 Hz. Figure 10. View largeDownload slide Velocity profiles along (a) x = 4 km and (b) x = 6 km. Figure 10. View largeDownload slide Velocity profiles along (a) x = 4 km and (b) x = 6 km. Figure 11. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Figure 11. View largeDownload slide (a) Norm of data residual vector as a function of the number of iterations and (b) relative model least-squares error versus number of iterations. Using the same linear initial model as in Fig. 9(b), to obtain a better wavenumber coverage of the deeper regions, we use uniformly sampled 12 frequencies between 4 and 15 Hz to perform both conventional FWI and NFWI inversions. Figs 12(a)–(c) show the inversion results from conventional FWI using SD, truncated GN and truncated Newton method, respectively, and Fig. 12(d) shows the inversion result from NFWI. For a given number of iterations, in this case, with the conventional FWI method, both SD and truncated GN methods converge to a local minimum, where velocity is not correctly updated even in the shallow regions. Using the full Hessian rather than the Gauss–Newton Hessian does not improve the result, but the NFWI method provides a fairly good result. Figure 12. View largeDownload slide Inversion results with conventional updates using (a) SD, (b) truncated GN, (c) truncated Newton method and (d) NFWI with 10 iterations for 12 frequencies starting from 4 to 15 Hz. Figure 12. View largeDownload slide Inversion results with conventional updates using (a) SD, (b) truncated GN, (c) truncated Newton method and (d) NFWI with 10 iterations for 12 frequencies starting from 4 to 15 Hz. 4 DISCUSSION AND CONCLUSIONS In this study, we considered including non-linearity in the sensitivity kernel calculations during the construction of the descent direction in FWI. We showed with an analytic example that in the presence of pre-critical reflection data, the error in a non-linear FWI update truncated at second order is proportional to (Δs/s0)3. In contrast, the error in a standard Gauss–Newton FWI update is only proportional to (Δs/s0)2. We also discussed an approach to the implementation of FWI with this non-linear descent direction in the frequency domain with a two-nested-loops approach. The inner loop involves determination of a perturbation from the data residuals according to the Born approximation, and the outer loop involves updating an initial model with this non-linear descent direction. Involving multiple scattering in each FWI update could have important consequences. Higher order scattering terms can be included in the sensitivity kernel based on De Wolf approximation (Wu & Zheng 2014), it can also be included by extending the misfit function (e.g. Alkhalifah & Wu 2016b; Wu & Alkhalifah 2017). We approached the problem by computing a sensitivity kernel via a variation of the medium to be constructed, sn + Δsn, at each FWI iteration. The non-linear sensitivity kernel constructed this way can include scattering processes with one interaction with the unknown perturbation and multiple interactions with Δsn. The first term in this new sensitivity kernel is equivalent to the sensitivity kernel used in conventional FWI, which only include scattering processes with one interaction with the unknown perturbation. The additional terms contain the multiple scattering processes related to Δsn. This provides additional wave paths that help in updating between scatterers. Compared to FWI with a conventional descent direction without multiple scattering, such as SD methods and truncated Gauss–Newton methods, FWI with this non-linear descent direction can accommodate multiple scattering appearing in the data residuals. Examples based on the Marmousi model indicate that FWI with this new descent direction, with a non-linear sensitivity kernel up to the second or higher order, converges both more rapidly and with greater ultimate accuracy than FWI with conventional descent direction. Although the full Hessian contains information related to doubly scattered waves, proper choice of pre-conditioner, as well as more advanced methods to calculate the Newton descent direction, could be necessary to insure the convergence of Newton-type methods. This is because the positive definiteness of the approximate Hessian is lost, and the misfit function is not locally convex anymore (Pratt et al.1998; Métivier et al.2014, 2017). In this case, compared to the truncated Newton method without pre-conditioning, FWI with our non-linear descent direction is more robust. This is especially true in the case of missing low-frequency data. Acknowledgements We thank the editors Ludovic Métivier, Tariq Alkhalifah and Ru-Shan Wu for valuable suggestions and comments that ultimately helped to improve this paper. We thank the sponsors of CREWES for support. This work was funded by CREWES (Consortium for Research in Elastic Wave Exploration Seismology), and by NSERC (Natural Science and Engineering Research Council of Canada) through the grant CRDPJ 461179-13, and in part through the Canada First Research Excellence Fund. We also thank Junxiao Li, Jian Sun, Huaizhen Chen and Khalid Almuteri for valuable discussions. REFERENCES Aki K., Richards P., 2002. Quantitative Seismology , University Science Books. Alkhalifah T., 2015. Scattering-angle based filtering of the waveform inversion gradients, Geophys. J. Int. , 200, 363– 373. Google Scholar CrossRef Search ADS   Alkhalifah T., Plessix R.E., 2014. A recipe for practical full-waveform inversion in anisotropic media: an analytical parameter resolution study, Geophysics , 79, R91– R101. Google Scholar CrossRef Search ADS   Alkhalifah T., Wu Z.D., 2016a. The natural combination of full and image-based waveform inversion, Geophys. Prospect. , 64, 19– 30. Google Scholar CrossRef Search ADS   Alkhalifah T., Wu Z.D., 2016b. Multiscattering inversion for low-model wavenumbers, Geophysics , 81, R417– R428. Google Scholar CrossRef Search ADS   Biondi B., Almomin A., 2014. Simultaneous inversion of full data bandwidth by tomographic full-waveform inversion, Geophysics , 79, Wa129– Wa140. Google Scholar CrossRef Search ADS   Brossier R., Operto S., Virieux J., 2015. Velocity model building from seismic reflection data by full-waveform inversion, Geophys. Prospect. , 63, 354– 367. Google Scholar CrossRef Search ADS   Castagna J., Backus M., 1993. Offset-Dependent Reflectivity: Theory and Practice of AVO Analysis , SEG. Google Scholar CrossRef Search ADS   Chavent G., Cément F., Gómez S., 1994. Automatic determination of velocities via migration-based traveltime waveform inversion: a synthetic data example, in SEG Technical Program Expanded Abstracts , Vol. 328, pp. 1179– 1182, Society of Exploration Geophysicists. Choi Y., Alkhalifah T., 2015. Unwrapped phase inversion with an exponential damping, Geophysics , 80, R251– R264. Google Scholar CrossRef Search ADS   Clayton R.W., Stolt R.H., 1981. A born-wkbj inversion method for acoustic reflection data, Geophysics , 46, 1559– 1567. Google Scholar CrossRef Search ADS   Foster D.J., Keys R.G., Lane F.D., 2010. Interpretation of avo anomalies, Geophysics , 75, A3– A13. Google Scholar CrossRef Search ADS   Innanen K.A., 2014. Seismic avo and the inverse hessian in precritical reflection full waveform inversion, Geophys. J. Int. , 199, 717– 734. Google Scholar CrossRef Search ADS   Innanen K.A., 2015. Full waveform inversion updating in the presence of high angle/high contrast reflectivity, in SEG Technical Program Expanded Abstracts , pp. 1314– 1319. Jo C.H., Shin C.S., Suh J.H., 1996. An optimal 9-point, finite-difference, frequency-space, 2-d scalar wave extrapolator, Geophysics , 61, 529– 537. Google Scholar CrossRef Search ADS   Kwak S., Kim Y., Shin C., 2014. Frequency-domain direct waveform inversion based on perturbation theory, Geophys. J. Int. , 197, 987– 1001. Google Scholar CrossRef Search ADS   Lailly P., 1983. The seismic inverse problem as a sequence of before stack migrations, in Conference on Inverse Scattering, Theory and Application, Society of Industrial and Applied Mathematics, Expanded Abstracts , pp. 206– 220, eds Bednar J.B., Richard R., Enders R., Arthur W., Society for Industrial and Applied Mathematics, Philadelphia, PA. Mahmoudian F., Margrave G.F., Wong J., Henley D.C., 2015. Azimuthal amplitude variation with offset analysis of physical modeling data acquired over an azimuthally anisotropic medium, Geophysics , 80, C21– C35. Google Scholar CrossRef Search ADS   Métivier L., Brossier R., Virieux J., Operto S., 2013. Full waveform inversion and the truncated newton method, SIAM J. Sci. Comput. , 35, B401– B437. Google Scholar CrossRef Search ADS   Métivier L., Bretaudeau F., Brossier R., Operto S., Virieux J., 2014. Full waveform inversion and the truncated newton method: quantitative imaging of complex subsurface structures, Geophys. Prospect. , 62, 1353– 1375. Google Scholar CrossRef Search ADS   Métivier L., Brossier R., Operto S., Virieux J., 2017. Full waveform inversion and the truncated newton method, SIAM Rev. , 59, 153– 195. Google Scholar CrossRef Search ADS   Operto S., Gholami Y., Prieux V., Ribodetti A., Brossier R., Métivier L., Virieux J., 2013. A guided tour of multiparameter full-waveform inversion with multicomponent data: from theory to practice, Leading Edge , 32, 1040– 1054. Google Scholar CrossRef Search ADS   Pan W.Y., Innanen K.A., Margrave G.F., Fehler M.C., Fang X.D., Li J.X., 2016. Estimation of elastic constants for hti media using Gauss-Newton and full-newton multiparameter full-waveform inversion, Geophysics , 81, R275– R291. Google Scholar CrossRef Search ADS   Pan W.Y., Innanen K.A., Liao W.Y., 2017. Accelerating hessian-free Gauss-Newton full-waveform inversion via l-bfgs preconditioned conjugate-gradient algorithm, Geophysics , 82, R49– R64. Google Scholar CrossRef Search ADS   Plessix R.E., 2006. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications, Geophys. J. Int. , 167, 495– 503. Google Scholar CrossRef Search ADS   Plessix R.E., Milcik P., Rynja H., Stopin A., Matson K., Abri S., 2013. Multiparameter full-waveform inversion: marine and land examples, Leading Edge , 32, 1030– 1038. Google Scholar CrossRef Search ADS   Pratt R.G., Shin C., Hicks G.J., 1998. Gauss-newton and full newton methods in frequency-space seismic waveform inversion, Geophys. J. Int. , 133, 341– 362. Google Scholar CrossRef Search ADS   Sava P., Biondi B., 2004. Wave-equation migration velocity analysis. I. Theory, Geophys. Prospect. , 52, 593– 606. Google Scholar CrossRef Search ADS   Shin C., Cha Y.H., 2008. Waveform inversion in the laplace domain, Geophys. J. Int. , 173, 922– 931. Google Scholar CrossRef Search ADS   Shin C., Cha Y.H., 2009. Waveform inversion in the laplace-fourier domain, Geophys. J. Int. , 177, 1067– 1079. Google Scholar CrossRef Search ADS   Sirgue L., Pratt R.G., 2004. Efficient waveform inversion and imaging: a strategy for selecting temporal frequencies, Geophysics , 69, 231– 248. Google Scholar CrossRef Search ADS   Stolt R.H., Weglein A.B., 1985. Migration and inversion of seismic data, Geophysics , 50, 2458– 2472. Google Scholar CrossRef Search ADS   Sun D., Symes W.W., 2012. Waveform inversion via non-linear differential semblance optimization, in SEG Technical Program Expanded Abstracts , pp. 1– 7, Society of Exploration Geophysicists. Symes W.W., Carazzone J.J., 1991. Velocity inversion by differential semblance optimization, Geophysics , 56, 654– 663. Google Scholar CrossRef Search ADS   Tarantola A., 1984. Inversion of seismic-reflection data in the acoustic approximation, Geophysics , 49, 1259– 1266. Google Scholar CrossRef Search ADS   van Leeuwen T., Herrmann F.J., 2013. Mitigating local minima in full-waveform inversion by expanding the search space, Geophys. J. Int. , 195, 661– 667. Google Scholar CrossRef Search ADS   van Leeuwen T., Mulder W.A., 2008. Velocity analysis based on data correlation, Geophys. Prospect. , 56, 791– 803. Google Scholar CrossRef Search ADS   Virieux J., Operto S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics , 74, Wcc1– Wcc26. Google Scholar CrossRef Search ADS   Wang H.Y., Singh S.C., Audebert F., Calandra H., 2015. Inversion of seismic refraction and reflection data for building long-wavelength velocity models, Geophysics , 80, R81– R93. Google Scholar CrossRef Search ADS   Warner M., Guasch L., 2016. Adaptive waveform inversion: theory, Geophysics , 81, R429– R445. Google Scholar CrossRef Search ADS   Weglein A.B. et al.  , 2003. Inverse scattering series and seismic exploration, Inverse Probl. , 19, R27– R83. Google Scholar CrossRef Search ADS   Wu R.S., Zheng Y.C., 2014. Non-linear partial derivative and its de wolf approximation for non-linear seismic inversion, Geophys. J. Int. , 196, 1827– 1843. Google Scholar CrossRef Search ADS   Wu R.S., Luo J.R., Wu B.Y., 2014. Seismic envelope inversion and modulation signal model, Geophysics , 79, Wa13– Wa24. Google Scholar CrossRef Search ADS   Wu Z.D., Alkhalifah T., 2015. Simultaneous inversion of the background velocity and the perturbation in full-waveform inversion, Geophysics , 80, R317– R329. Google Scholar CrossRef Search ADS   Wu Z.D., Alkhalifah T., 2017. Efficient scattering-angle enrichment for a nonlinear inversion of the background and perturbations components of a velocity model, Geophys. J. Int. , 210, 1981– 1992. Google Scholar CrossRef Search ADS   Xu S., Wang D., Chen F., Lambaré G., Zhang Y., 2012. Inversion on reflected seismic wave, in SEG Technical Program Expanded Abstracts , Vol. 509, pp. 1– 7, Society of Exploration Geophysicists. Zhang H.Y., Weglein A.B., 2009. Direct nonlinear inversion of 1d acoustic media using inverse scattering subseries, Geophysics , 74, Wcd29– Wcd39. Google Scholar CrossRef Search ADS   Zhou W., Brossier R., Operto S., Virieux J., 2015. Full waveform inversion of diving & reflected waves for velocity model building with impedance inversion based on scale separation, Geophys. J. Int. , 202, 1535– 1554. Google Scholar CrossRef Search ADS   APPENDIX A: VIRTUAL SECONDARY SOURCE AND THE BORN APPROXIMATION Suppose that the model can be split into two parts   $$s(\mathbf {r})=s_0(\mathbf {r})+\delta s(\mathbf {r}),$$ (A1)where s0(r) contains the low-wavenumber component of the model or the background part, and δs(r) contains the high-wavenumber component or the perturbation part. Accordingly the Green’s functions in the actual and background model are   \begin{eqnarray} {} [\omega ^2s(\mathbf {r})+ \nabla ^2]G(\mathbf {r},\mathbf {r}_s,\omega |s) =-\delta (\mathbf {r}-\mathbf {r}_s) \end{eqnarray} (A2)  \begin{eqnarray} {} [\omega ^2s_0(\mathbf {r})+ \nabla ^2]G(\mathbf {r},\mathbf {r}_s,\omega |s_0) =-\delta (\mathbf {r}-\mathbf {r}_s). \end{eqnarray} (A3)The scattered wavefield, or to say the perturbed wavefield   $$\delta G(\mathbf {r},\mathbf {r}_s,\omega |s_0)=G(\mathbf {r},\mathbf {r}_s,\omega |s)-G(\mathbf {r},\mathbf {r}_s,\omega |s_0),$$ (A4)can be expressed exactly with the Lippmann–Schwinger equation as   \begin{eqnarray} \delta G (\mathbf {r},\mathbf {r}_s,\omega |s_0) &=& -\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0) \delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s) \nonumber \\ &=& -\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0) \nonumber \\ && {}+\omega ^4 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime }) \int\mathrm{d}\mathbf {r}^{\prime \prime }G(\mathbf {r}^{\prime },\mathbf {r}^{\prime \prime },\omega |s_0) \delta s(\mathbf {r}^{\prime \prime })G(\mathbf {r}^{\prime \prime },\mathbf {r}_s,\omega |s_0)\nonumber \\ && {}+ \cdots . \end{eqnarray} (A5)Under the single-scattering assumption, only the first-order term is considered in eq. (A5), which gives the Born approximation   $$\delta G_{{\rm Born}}(\mathbf {r},\mathbf {r}_s,\omega |s_0)=-\omega ^2 \int \mathrm{d}\mathbf {r}^{\prime }G(\mathbf {r},\mathbf {r}^{\prime },\omega |s_0)\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0).$$ (A6)The scattered wavefield δG(r, rs, ω|s0) in the Lippmann–Schwinger equation (A5), can be generated using − ω2δs(r΄)G(r΄, rs, ω|s) as virtual secondary source,   \begin{eqnarray} {}[\omega ^2s_0(\mathbf {r})+ \nabla ^2] \delta G(\mathbf {r},\mathbf {r}_s,\omega |s_0) & =-\omega ^2\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s), \end{eqnarray} (A7)while under the Born approximation (A6), the scattered wavefield can be calculated with virtual secondary source − ω2δs(r΄)G(r΄, rs, ω|s0)   \begin{eqnarray} {}[\omega ^2s_0(\mathbf {r})+ \nabla ^2]\delta G_{Born}(\mathbf {r},\mathbf {r}_s,\omega |s_0) & =-\omega ^2\delta s(\mathbf {r}^{\prime })G(\mathbf {r}^{\prime },\mathbf {r}_s,\omega |s_0). \end{eqnarray} (A8) © The Author(s) 2018. Published by Oxford University Press on behalf of The Royal Astronomical Society.

### Journal

Geophysical Journal InternationalOxford University Press

Published: May 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations