# Mitigating nonlinearity in full waveform inversion using scaled-Sobolev pre-conditioning

Mitigating nonlinearity in full waveform inversion using scaled-Sobolev pre-conditioning Summary The Born approximation successfully linearizes seismic full waveform inversion if the background velocity is sufficiently accurate. When the background velocity is not known it can be estimated by using model scale separation methods. A frequently used technique is to separate the spatial scales of the model according to the scattering angles present in the data, by using either first- or second-order terms in the Born series. For example, the well-known ‘banana-donut’ and the ‘rabbit ear’ shaped kernels are, respectively, the first- and second-order Born terms in which at least one of the scattering events is associated with a large angle. Whichever term of the Born series is used, all such methods suffer from errors in the starting velocity model because all terms in the Born series assume that the background Green's function is known. An alternative approach to Born-based scale separation is to work in the model domain, for example, by Gaussian smoothing of the update vectors, or some other approach for separation by model wavenumbers. However such model domain methods are usually based on a strict separation in which only the low-wavenumber updates are retained. This implies that the scattered information in the data is not taken into account. This can lead to the inversion being trapped in a false (local) minimum when sharp features are updated incorrectly. In this study we propose a scaled-Sobolev pre-conditioning (SSP) of the updates to achieve a constrained scale separation in the model domain. The SSP is obtained by introducing a scaled Sobolev inner product (SSIP) into the measure of the gradient of the objective function with respect to the model parameters. This modified measure seeks reductions in the L2 norm of the spatial derivatives of the gradient without changing the objective function. The SSP does not rely on the Born prediction of scale based on scattering angles, and requires negligible extra computational cost per iteration. Synthetic examples from the Marmousi model show that the constrained scale separation using SSP is able to keep the background updates in the zone of attraction of the global minimum, in spite of using a poor starting model in which conventional methods fail. Image processing, Inverse theory, Tomography, Waveform inversion, Coda waves, Controlled source seismology 1 INTRODUCTION The first-order Born approximation provides a practical way to linearize full waveform inversion (FWI) and obtain a computationally feasible gradient (Lailly 1983; Pratt et al.1998; Virieux & Operto 2009). However this linearization is only useful if the background velocity is sufficiently accurate. All such attempts to linearize FWI that rely upon terms in the Born scattering series have inherent limits on the background velocity errors that can be tolerated by the inversion. This is because of the local nature of the Born series. Regardless of how many terms in the Born series are used, the nonlinear dependence of the data on the background velocity is not accounted for (Symes 1991). The error limit is approximately given by the half-cycle criterion (Pratt 2008), which limits the level of acceptable background velocity errors. If the criterion is violated this leads directly to data domain cycle-skipping. It might be hoped that having a starting model that honours the half-cycle criterion for early arrivals would solve the problem of nonlinearity in the Born approximation. However, waveform inversions are complicated by the fact that the (smooth, low-wavenumber) background velocity has to be correct before inverting for the location and strength of the (sharp, high-wavenumber) scatterers in the model. This requires a separation of low- and high-wavenumber features in the gradient. That is, we require a method for model scale separation in order to update the background correctly. In this study we propose a scaled-Sobolev inner product (SSIP) that allows controlled separation of the model spatial scales by adjusting the scale factors that we introduce. We formalize the concept first introduced in Zuberi & Pratt (2016) (also applied to real data by Consolvo et al. (2017)), and we provide a number of examples demonstrating the approach. The SSIP leads to an edge-preserving smoothing operator, which we shall apply to the model updates. We refer to this approach as scaled-Sobolev pre-conditioning (SSP). The SSP gives constrained background updates, in which the constraint is provided by the sharp features in the model updates (i.e. the preserved edges in the SSP pre-conditioning), while introducing only a negligible additional computational cost per iteration compared to the conventional approach. The conventional FWI gradient includes both low- and high-wavenumber features. It is common to invoke the Born scattering series to separate the background from the scatterers by isolating events in the data based on scattering angles: wide angle scattering (including transmissions) are sensitive to low wavenumbers, and narrow angle scattering (including reflections) are sensitive to high wavenumbers (Wu & Toksöz 1987). We refer to these strategies as ‘Born-based’ approaches. To this end, some methods use the scattering close to 180° to isolate the low wavenumbers (e.g. by using the early arrivals, corresponding to the diving waves). These low-wavenumber updates are associated with ‘banana/donut’ (Rickett 2000) shaped sensitivity kernels that link the source to the receiver. The second-order terms are also used in some methods, in which the source-to-scatterer and scatterer-to-receiver sensitivity kernels are combined. These second-order sensitivity kernels, also known as ‘rabbit ears’ (Zhou et al.2015), respond to double scatterings in which at least one scattering angle is close to 180°. Although the conventional FWI gradient contains contributions from both the first-order diving waves and the second-order scattered waves, the high-wavenumber features often dominate the gradient and may trap the inversion in a false (local) minimum (Brossier et al.2015), which is why scale separation is required. First-order Born methods for scale separation work by isolating the single scattering response associated with large scattering angles. This may be achieved, for example, by weighting the waveform data according to offset, or by applying time-damping of the data to down-weight the later, reflected energy during the early stages of the inversion (Kamei et al.2013). Such a weighting is relatively inexpensive computationally, because no wavefield calculation or imaging is required, but the large angle scattering response can be difficult to isolate. This is because both large and small angle responses can occupy the same region in the data, especially when the subsurface model is strongly heterogeneous. The scattering angles can also be used to apply weights in the model domain, which would not require a separation of the scattering responses (Wu & Alkhalifah 2015). However, the accuracy of subsurface scattering angle calculations are still dependent on the accuracy of the background velocity (a requirement for the first-order Born term to be valid), which potentially leads to errors in the scale separation. Second-order Born methods calculate the second-order sensitivity kernels explicitly in an attempt to obtain the background velocity updates. An example of such methods is Wave Equation Migration Velocity Analysis (WEMVA), proposed by Sava & Biondi (2004a). WEMVA uses a difference in (high-wavenumber) image perturbations as the objective function, and the WEMVA operator relates the image perturbation to the background velocity perturbation, which creates the second-order sensitivity kernels. In WEMVA the requirement of small background velocity errors can lead to image domain cycle-skipping. Sava & Biondi (2004a,b) suggested a linearized image perturbation to overcome this issue. They suggested manually picking the background velocity ratios that flatten the subsurface angle gathers, which requires pre-calculation of subsurface angle gathers for a range of different ratios. Although this can be done in a computationally efficient manner as the calculation is a manual step performed only once in the inversion process, this still adds to the overall cost of the method. Overall, the total computational cost increases because of (i) the cost of computing the second-order sensitivity kernels (requiring that the source and receiver perturbed wavefields have to be modelled), and (ii) because of the manual step required to mitigate the errors caused by background velocity errors. Zhou et al. (2015) proposed a method for ‘joint full waveform inversion’ (JFWI) that uses both first- and second-order terms of the Born series. JFWI separates the primary reflected and transmitted (diving) wave responses in the data, using the first-order Born approximation, and the primary reflections in the data to calculate the second-order sensitivity kernels. The source (receiver) side second-order sensitivity kernels are generated by cross-correlating the back (forward) propagated perturbed wavefield with the forward (back) propagated background wavefield. This second-order calculation also leads to a substantial increase in the computational cost of JFWI. The separation of reflected and diving waves in JFWI, while computationally inexpensive, can be difficult in practice and might also require adjustment at each iteration (Zhou et al.2015). An advantage of the second-order Born methods over the first-order methods is that they use the reflected energy in the data to constrain the initial background updates, meaning that while the background is being updated it remains consistent with the scattering events contained in the data. Furthermore, if the background errors are negligible, the use of the second-order Born term is an approximation to the inverse Hessian, which can help the convergence rate. However, calculating the second-order sensitivity kernels puts an extra computational burden on the inversion, and because the approach is a part of the Born series it still requires an accurate background velocity model. In other words, the velocity-depth ambiguity is a problem in the second-order methods as well. The background error issue is mitigated by using either the redundancy in the data (using multiple offsets to calculate angle gathers in WEMVA) or by adding additional a priori information about the model (JFWI uses an a priori reflectivity model), although these measures further increase the computational cost of the inversions. In general using higher order terms in the Born series for scale separation suffers from the same background error limitations that we are trying to mitigate, and any attempts to correct for these errors increase computational cost. In addition, when a band of frequencies is being inverted, the Born-based separations isolate the highest available wavenumber and the rest of the wavenumber spectrum must be considered to constitute the background (see Fig. 1). This is demonstrated in Fig. 2 where we illustrate an example of a JFWI update calculated by using scattering response from two known1 reflectors independently (to avoid multiple scattering noise). The regions in white circles show high wavenumber parts of the update, which is even higher than the kmax in Fig. 1 (these are present because the a priori reflectivity model consisted of a discontinuous velocity structure). Figure 1. View largeDownload slide Model scales (after Wu & Toksöz 1987). Between kb and kmax the background and scatterer scales are not clearly defined; a ‘scattering’ wavenumber for one frequency could be considered to be the ‘background’ wavenumber for another. Figure 1. View largeDownload slide Model scales (after Wu & Toksöz 1987). Between kb and kmax the background and scatterer scales are not clearly defined; a ‘scattering’ wavenumber for one frequency could be considered to be the ‘background’ wavenumber for another. Figure 2. View largeDownload slide JFWI updates calculated for each reflector independently. The white circles show that high wavenumber information is still present in the image. This remnant high wavenumber information will be included before the background velocity model has converged, therefore it can potentially move the inversion away from the true solution. Figure 2. View largeDownload slide JFWI updates calculated for each reflector independently. The white circles show that high wavenumber information is still present in the image. This remnant high wavenumber information will be included before the background velocity model has converged, therefore it can potentially move the inversion away from the true solution. The limitations of the Born-based approaches for scale separation discussed above can be avoided by separating the model scales in the updates using wavenumber filtering (Sirgue 2003; Brenders & Pratt 2007b) or Gaussian smoothing. While these methods separate the scales in the model domain, the corresponding scattering response in the observed data cannot be separated. In other words, unlike the second-order Born-based approach, the background updates will not be constrained by the sharp wavenumber features (the seismic image), as these are deliberately removed from the update. This can cause the sharper features to be inaccurate when they are eventually imaged, especially if the starting model does not have any information about the true low wavenumbers. Constrained background updates can however be obtained by smoothing the gradient while preserving the edges. For example, Guitton et al. (2012) proposed a pre-conditioning that preserves dips in the model, but this requires estimation of the dips based on local dip filtering of the image Hale (2007). Other methods of edge preserving smoothing such as anisotropic diffusion (Perona & Malik 1990) or high-order Sobolev gradient flows proposed by Calder et al. (2011) have been used in non-seismic image processing, however they have not been utilized as a scale separation tool in FWI. Furthermore, the Sobolev norm proposed by Calder et al. (2011) is not particularly well suited for scale separation, mainly due to its lack of flexibility in choosing the scale factors. The Laplacian operator is frequently used to suppress low wavenumber artefacts in reverse time migration (RTM) (Zhang & Sun 2009), however the artefacts in RTM are the desired background updates in FWI. Therefore, an inverse Laplacian-type operator would be an appropriate choice for a pre-conditioner to enhance the low wavenumber features in the FWI gradient, that is some form of (1 + ∇2)−1, where ∇2 is the Laplacian. Indeed the inverse Laplacian has been used to pre-condition the FWI gradient with different scaling factors in the Laplacian and powers of the whole operator (e.g. Symes & Kern 1994; Fu & Symes 2017). In this paper we generalize this operator to include higher order derivatives based on the SSIP, and we show that the special case of the inverse Laplacian (i.e. first-order SSP) operator is an edge-preserving smoother. We explore this edge-preserving nature of the SSP and show that it is useful in mitigating nonlinearity in FWI when used as a scale separation tool in model domain. As a result of the edge-preservation, the low wavenumber content enhanced by the SSP will also receive contribution from the rabbit ears that will automatically be generated due to scattering from the sharp features in the gradient. A more mathematically precise reason for the mitigation of nonlinearity due to the first-order SSP can be deduced from a result regarding the differentiability of the FWI objective function with respect to the model parameters: Stolk (2001) and Blazek et al. (2013) showed that the objective function is differentiable if the source function and at least two orders of its time derivatives are in L2 space, and the model perturbation is measured by an L∞ norm. The L∞ norm is dominated by the Sobolev norm of order 1 and exponent 2 if the model behaves locally as 1D (Evans 2010), as is typical for sedimentary structure. This means that the updates will be smoother and a Fréchet derivative will be more likely to exist. The existence of a Fréchet derivative would ensure that the adjoint state method (Plessix 2006) gives a more accurate representation of the action of the Jacobian (partial derivative wavefield) matrix, thereby reducing the nonlinearity in FWI. The higher orders of SSP will not be studied here, but as they lead to more smoothing they can be used to control the level of edge-preservation in the smoothing. For example, Zuberi & Pratt (2017) use higher orders of SSP in the data domain to smooth the data and define a new objective function based on SSIP. 2 INTRODUCING THE SCALED-SOBOLEV INNER PRODUCT (SSIP) A note on the use of the ∇ notation The notation   \begin{equation*} \nabla \equiv \sum _{i}\hat{x}_{i}\frac{\partial }{\partial _{i}} \end{equation*}is used here to denote the ‘del’ operator generating the gradient vector from a scalar field. Thus in inversion the gradient of the objective function yields   \begin{equation*} \nabla _{p}E\left(\mathbf {p}\right)=\left(\begin{array}{c}\partial E/\partial p_{1}\\ \partial E/\partial p_{2}\\ \vdots \\ \partial E/\partial p_{m} \end{array}\right), \end{equation*}a column vector in m-dimensional space where m is the number of components in the vector p. In the Sobolev norm the notation is also used to compute first derivatives with respect to the spatial dimensions, so that   \begin{equation*} \nabla p(x)=\hat{\mathbf {i}}\frac{\partial p}{\partial x}+\hat{\mathbf {j}}\frac{\partial p}{\partial y}+\hat{\mathbf {k}}\frac{\partial p}{\partial z}=\hat{\mathbf {e}}_{1}\frac{\partial p}{\partial x_{1}}+\hat{\mathbf {e}}_{2}\frac{\partial p}{\partial x_{2}}+\hat{\mathbf {e}}_{3}\frac{\partial p}{\partial x_{3}}. \end{equation*}We adopt the convention that all non-subscripted uses of ∇ refer to the spatial gradient, and all other uses are indicated by the use of a subscript, that is, ∇pE. An introductory analysis of the Sobolev gradient using matrix and functional representations is given in Appendix A. The Sobolev gradient using the Laplacian operator has been applied in a variety of contexts (e.g. Symes & Kern 1994; Renka 2009, 2013). The Laplacian arises from the first-order derivatives in the Sobolev norm (Appendix A). The Sobolev gradient can be generalized to contain higher powers of the Laplacian (i.e. higher order derivatives in the Sobolev norm) and each derivative can be scaled independently. Therefore, let us utilize a more general inner product   $${\left\langle f,g\right\rangle _{A}\equiv \left\langle Af,Ag\right\rangle _{{\rm L}^{2}}=\left\langle f,\left(A^{\dagger }A\right)g\right\rangle _{{\rm L}^{2}}\!,}$$ (1)where † represents the adjoint. With A defined in eq. (2) below, eq. (1) defines the SSIP. Note that A†A is a positive definite symmetric operator. If we take A to be a differential operator then,   \begin{eqnarray} A & \equiv & \sum _{i=0}^{N=ML}\hat{\mathbf {e}}_{i}\mu _{i}\frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}\quad \mathrm{and}\quad A^{\dagger }=\sum _{i=0}^{N=ML}\hat{\mathbf {e}}_{i}\left(-1\right)^{m_{i}}\mu _{i}\frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}, \end{eqnarray} (2)where M and L are the maximum order of the derivatives and the number of dimensions, respectively. The superscript $$m_{i}=\lfloor \frac{i-1}{L}\rfloor +1$$ represents the order of the derivative and $$x_{l_{i}}$$ is the $$l_{i}^{{\rm th}}$$ dimension with li = [(i − 1)mod L] + 1. The highest order of derivative is $$M=\lfloor \frac{N-1}{L}\rfloor +1$$. The factors μi are real valued scalars and the unit vectors are orthonormal, that is,   $$\hat{\mathbf {e}}_{i}\cdot \hat{\mathbf {e}}_{j}=\left\lbrace \begin{array}{@{}l@{\quad }l@{}}1 & {\rm for}\ i=j\\ 0 & {\rm for}\ i\ne j \end{array}\right..$$ (3)The $$\left(-1\right)^{m_{i}}$$ factor in eq. (2) comes from integration by parts and an assumption that the functions involved have homogeneous boundary conditions. Under these conditions   $$AA^{\dagger }=A^{\dagger }A=\sum _{i=0}^{N}\left(-1\right)^{m_{i}}\mu _{i}^{2}\frac{\partial ^{2m_{i}}}{\partial x_{l_{i}}^{2m_{i}}},$$ (4)and we may write the SSIP as   $$\left\langle f,g\right\rangle _{A}=\sum _{i=0}^{N}\mu _{i}^{2}\left\langle f^{\left(i\right)},g^{\left(i\right)}\right\rangle _{{\rm L}^{2}},$$ (5)that is, the norm includes all scaled derivatives of f and g with $$f^{\left(i\right)}\equiv \frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}f$$. The orthogonality of the derivative terms is imposed, and, as above for the sake of brevity the unit vectors are not written explicitly. When the functions are smooth (i.e. when all orders of derivatives approach zero), the SSIP is proportional to the L2 norm. The space of functions equipped with the SSIP is the scaled-Sobolev space. In the scaled-Sobolev space, depending on the highest order of derivatives in the SSIP, the functions are required to be differentiable. In contrast, functions in the Hilbert space (equipped only with an L2 inner product) are by definition potentially less smooth, as these functions do not have to be differentiable. Note that in the SSIP all orders and dimensions of the derivatives can have independently chosen scale factors, allowing an independent weighting of model scales for each dimension and each derivative order. Eqs (1), (2), (4) and (5) are the general (anisotropic) form; when μi is replaced by $$\mu _{m_{i}}$$ (meaning that the scale factors are chosen to be the same for all dimensions in a given order of the derivatives), the equations will be referred to as being ‘isotropic’. The inner product given in eq. (5) is different from the inner product suggested by Calder et al. (2011). In the inner product suggested by Calder et al. (2011) the operator A†A, defined by eq. (4), is replaced by the operator (I − λ∇2)k. While this operator also leads to an edge preserving smoothing, Calder's choice of scales for different dimensions is not completely independent. The operator (I − λ∇2)k is equal to A†A given in eq. (4) only when the weighting terms μi are the same for all dimension, when M = 1 in eq. (4) and when k = 1 in the operator (I − λ∇2)k. By way of explanation, let us look at a two dimensional isotropic example of the inner product in eq. (5). In this case the first-order derivatives N = ML = 2 × 1 and μ0, μ1 = 1 and μ2 = μ1, for which three unit vectors are required, $$\hat{\mathbf {e}}_{0}$$ for the zeroth-order derivative (the original functions), and $$(\hat{\mathbf {e}}_{1},\hat{\mathbf {e}}_{2})$$ for the two spatial components of the first derivatives in the gradient. Thus $$A=\hat{\mathbf {e}}_{0}+\hat{\mathbf {e}}_{1}\frac{\partial }{\partial x}+\hat{\mathbf {e}}_{3}\frac{\partial }{\partial y}$$ and   $$\left\langle f,g\right\rangle _{A}=\left\langle f,g\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial y},\frac{\partial g}{\partial y}\right\rangle _{{\rm L}^{2}}.$$ (6)The first term in eq. (6) is just the conventional inner product, that is ∫∂f(x)g(x)dx and the second two terms can be simplified as follows   \begin{eqnarray*} \left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}} & = & \int _{c}^{d}\left[\int _{a}^{b}\frac{\partial f}{\partial x}\frac{\partial g}{\partial x}{\rm d}x\right]{\rm d}y\\ & = & \int _{c}^{d}\left[f\frac{\partial g}{\partial x}|_{a}^{b}-\int _{a}^{b}f\frac{\partial ^{2}g}{\partial x^{2}}{\rm d}x\right]{\rm d}y\\ & = & \int _{c}^{d}\left[-\int _{a}^{b}f\frac{\partial ^{2}g}{\partial x^{2}}{\rm d}x\right]{\rm d}y\\ & = & -\left\langle f,\frac{\partial ^{2}g}{\partial x^{2}}\right\rangle _{{\rm L}^{2}}, \end{eqnarray*} where integration by parts and homogeneous boundary conditions have been used. Then, eq. (6) becomes   \begin{eqnarray*} \left\langle f,g\right\rangle _{A} & = & \left\langle f,g\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial y},\frac{\partial g}{\partial y}\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,g\right\rangle _{{\rm L}^{2}}-\left\langle f,\frac{\partial ^{2}g}{\partial x^{2}}\right\rangle _{{\rm L}^{2}}-\left\langle f,\frac{\partial ^{2}g}{\partial y^{2}}\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,\left(1-\nabla ^{2}\right)g\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,\left(A^{\dagger }A\right)g\right\rangle _{{\rm L}^{2}}, \end{eqnarray*} where $$A^{\dagger }=\hat{\mathbf {e}}_{0}-\hat{\mathbf {e}}_{1}\frac{\partial }{\partial x}-\hat{\mathbf {e}}_{3}\frac{\partial }{\partial y}$$ so that A†A = 1 − ∇2. 3 SCALED-SOBOLEV PRE-CONDITIONING In this section we will show how the SSIP can be used in the model domain as a pre-conditioner with the steepest-descent method. To see how the new inner product, the SSIP defined in eq. (1), will affect the FWI gradient let us ask the question asked by Neuberger (1997): ‘To what sort of gradient are we led’, if the inner product in eq. (A6) is replaced by the SSIP? That is, we require (as above) that   $${\rm d}E=\left\langle \delta p,g\right\rangle _{{\rm L}^{2}}=\left\langle \delta p,g_{_{S}}\right\rangle _{A}=\left\langle \delta p,\left(A^{\dagger }A\right)g_{_{S}}\right\rangle _{{\rm L}^{2}}\!,$$ (7)where gS is the gradient in the scaled-Sobolev space, g is the conventional gradient used in the Born scale-space approach, and δp is the slowness perturbation. Eq. (7) results in a mapping from the Born scale-space to the scaled-Sobolev space, that is,   $$g_{_{S}}=\left(A^{\dagger }A\right)^{-1}g,$$ (8)where A†A is the scaled-Sobolev pre-conditioner. The fact that gS is smoother than g can be seen by rewriting eq. (7) as   \begin{eqnarray} &&{\left\langle \delta p,g\right\rangle _{{\rm L}^{2}}\!=\!\mu _{0}^{2}\left\langle \delta p,g_{_{S}}\right\rangle _{{\rm L}^{2}}\!+\!\sum _{i=1}^{N=ML}\left(-1\right)^{m_{i}}\mu _{i}^{2}\left\langle \delta p,\frac{\partial ^{2m_{i}}g_{_{S}}}{\partial x_{l_{i}}^{2m_{i}}}\right\rangle _{{\rm L}^{2}}.} \end{eqnarray} (9)The Sobolev inner product ⟨·,·⟩A is as defined in equation (1). Eq. (9) shows that if μ0 ≠ 0 and μi = 0 for i greater than zero, gS would simply be a scaled version of g. However, if μ0 is set to zero then the combined derivatives of $$g_{_{S}}$$ would have to be a scaled version of g, which means that $$g_{_{S}}$$ would have to be very smooth. By choosing appropriate values for μi the smoothness of the scaled-Sobolev gradients can be controlled, and the effective scale of the updates can be controlled. This result allows a scale decomposition that (i) does not require recourse to data domain decompositions based on the Born approximation, and (ii) does not require a complete scale decomposition such as that required in some model-domain smoothing methods. The steepest-descent method applied to the Sobolev gradient requires that the model perturbation $$\delta p_{_{S}}$$ be proportional to the gradient gS, that is,   $$\delta p_{_{S}}=-\alpha g_{_{S}},$$ (10)where α is a positive step length chosen to optimize the convergence. The perturbation in the objective function to first order with the SSIP is thus   \begin{eqnarray} \left({\rm d}E\right)_{S} &=& \left\langle \delta p_{_{S}}\!,g_{S}\right\rangle _{A}=-\alpha ^{-1}\left\langle \delta p_{_{S}},\delta p_{_{S}}\right\rangle _{A}\nonumber\\ &=&-\alpha ^{-1}\left\langle \delta p_{_{S}},\left(A^{\dagger }A\right)\delta p_{_{S}}\right\rangle _{{\rm L}^{2}}. \end{eqnarray} (11)By definition the term in angular brackets in last equality of eq. (11) is positive, therefore the direction of the update represents a descent direction. This also shows that any pre-conditioner must be positive definite in order for the updates to remain in a descent direction. As a special case, if A is defined for three dimensions and first-order derivatives, that is if $$A\equiv \mu _{0}+\mu _{1}{\rm \frac{\partial }{\partial x}+\mu _{2}{\rm \frac{\partial }{\partial y}}+\mu _{3}{\rm \frac{\partial }{\partial z}}}$$, then this implies (using eq. 4) that $$A^{\dagger }A=\mu _{0}^{2}-\mu _{1}^{2}\frac{\partial ^{2}}{\partial x^{2}}-\mu _{2}^{2}\frac{\partial ^{2}}{\partial y^{2}}-\mu _{3}^{2}\frac{\partial ^{2}}{\partial z^{2}}$$. Therefore, for this special case the pre-conditioning becomes   $$g_{_{S}}=\left(\mu _{0}^{2}-\mu _{1}^{2}\frac{\partial ^{2}}{\partial x^{2}}-\mu _{2}^{2}\frac{\partial ^{2}}{\partial y^{2}}-\mu _{3}^{2}\frac{\partial ^{2}}{\partial z^{2}}\right)^{-1}g.$$ (12)The isotropic version of eq. (12) is obtained by setting μ1 = μ2 = μ3, which gives the isotropic SSP   $$g_{_{S}}=\left(\mu _{0}^{2}-\mu _{1}^{2}\nabla ^{2}\right)^{-1}g,$$ (13)in which case the relative importance of the zeroth (background) term and the derivative terms can be controlled by adjusting the parameters μ0 and μ1. The anisotropic case, that is when the scalars for all dimension are not the same (for this special case μ1 ≠ μ2 ≠ μ3), can be useful when the dimensions of the model are very different from each other. The scale factors for the anisotropic case can be chosen to enhance or suppress the desired model dimension. 4 COMPARISON WITH THE GAUSSIAN SMOOTHING OPERATOR In this section we shall compare the effects of the isotropic first-order SSP operator (eq. 13) with a Gaussian smoothing operator. Gaussian smoothing can also be used to separate the spatial scales of the model without recourse to the Born approximation. Because the Gaussian smoothing operator has a well-defined inverse (Ulmer 2010), it is positive definite, which means that it could be used to define a Sobolev norm. However, due to the edge removing nature of Gaussian smoothing, the Gaussian scale separation is complete in the sense that in the limit of strong smoothing all high-wavenumber features is lost after Gaussian smoothing. The Gaussian smoothing operator in two dimensions can be written as   $$w\left(x,z\right)=\left(\frac{\pi }{\sigma ^{2}}\right)e^{-\frac{\pi ^{2}\left(x^{2}+z^{2}\right)}{\sigma ^{2}}},$$ (14)where σ2 is a real constant variance that selects the spatial scales of the image, therefore we shall call it a scale factor here. Convolution of the Gaussian kernel w(x, z) with an image would smooth it and the level of smoothness is decided by the scale factor σ2. This convolution with an image, say m(x, z), can be written in the wavenumber domain as   $$m_{_{G}}\left(k_{x},k_{z}\right)=w\left(k_{x},k_{z}\right)m\left(k_{x},k_{z}\right)=e^{-\sigma ^{2}k^{2}}m\left(k_{x},k_{z}\right),$$ (15)where $$k^{2}\equiv k_{x}^{2}+k_{z}^{2}$$ and w(kx, kz) is the Fourier transform of w(x, z). By comparison, the application of the isotropic SSP of eq. (13) in the wavenumber domain can be written as   $$m_{_{S}}\left(k_{x},k_{z}\right)=\left(\mu _{0}^{2}+\mu _{1}^{2}k^{2}\right)^{-1}m\left(k_{x},k_{z}\right).$$ (16) An image of the Marmousi model will be used as a test image2 to see the effects of Gaussian and SSP smoothing kernels in eqs (15) and (16). The test image and its wavenumber amplitude spectrum are shown in Figs 3(a) and (b), respectively. The SSP operator (eq. 16) with $$\left(\mu _{0}^{2},\mu _{1}^{2}\right)=\left(1,100\right)$$ was applied to the test image; the result and its wavenumber amplitude spectrum are shown in Figs 4(a) and (b), respectively. Fig. 4 shows that while the low-wavenumber features of the test image have been enhanced, the high-wavenumber features have not been lost completely and the reflectors of the Marmousi model are still visible. The application of the Gaussian smoothing kernel on the test image with σ2 = 100 is shown in Fig. 5(a) and the corresponding wavenumber amplitude spectrum in Fig. 5(b). Fig. 5 shows that at this high level of Gaussian smoothing there is virtually no high wavenumber information left in the image. Reducing the Gaussian scale factor reduces the blur in the smooth image, but to retain the high wavenumbers the scale factor has to be drastically reduced. This is shown in Fig. 6 where the scale factor is progressively reduced but the reflectors in the Marmousi model remain invisible. Figure 3. View largeDownload slide (a) A test image and (b) its wavenumber amplitude spectrum. Figure 3. View largeDownload slide (a) A test image and (b) its wavenumber amplitude spectrum. Figure 4. View largeDownload slide (a) The SSP applied to the test image (Fig. 3), with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,100\right)$$ and (b) its wavenumber amplitude spectrum. Figure 4. View largeDownload slide (a) The SSP applied to the test image (Fig. 3), with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,100\right)$$ and (b) its wavenumber amplitude spectrum. Figure 5. View largeDownload slide (a) Gaussian smoothing applied to the test image (Fig. 3) with σ2 = 100 and (b) its wavenumber amplitude spectrum. The small non-zero amplitude region around the origin in (b) shows the almost complete loss of high wavenumber information after Gaussian smoothing. Figure 5. View largeDownload slide (a) Gaussian smoothing applied to the test image (Fig. 3) with σ2 = 100 and (b) its wavenumber amplitude spectrum. The small non-zero amplitude region around the origin in (b) shows the almost complete loss of high wavenumber information after Gaussian smoothing. Figure 6. View largeDownload slide Gaussian smoothing applied to the test image (Fig. 3), with (a) σ2 = 50, (b) σ2 = 10, (c) σ2 = 1 and (d) σ2 = 0.5. Decreasing the scale factor of the Gaussian smoothing reduces the weights of the low-wavenumber features, however the high wavenumbers in the image are lost. To preserve the edges, the scale factor has to approach zero in which case the low-wavenumber features are no longer enhanced. Figure 6. View largeDownload slide Gaussian smoothing applied to the test image (Fig. 3), with (a) σ2 = 50, (b) σ2 = 10, (c) σ2 = 1 and (d) σ2 = 0.5. Decreasing the scale factor of the Gaussian smoothing reduces the weights of the low-wavenumber features, however the high wavenumbers in the image are lost. To preserve the edges, the scale factor has to approach zero in which case the low-wavenumber features are no longer enhanced. The difference in smoothing between SSP and Gaussian kernels can be seen in Fig. 7. The SSP kernel (Fig. 7a) has high amplitudes close to the origin, and it remains non-zero up to a large wavenumber radius, while the Gaussian kernel Fig. 7(b) drops down to zero much faster. If σ2 is decreased in an attempt to enhance the higher wavenumbers the Gaussian bell surface becomes broader (Fig. 7c), but the ratio of low to high wavenumber amplitude remains larger in the Gaussian kernel than in the SSP. Furthermore, the tail of the Gaussian kernel is much smaller than the SSP even with small smoothing as shown in Fig. 7(d). Figure 7. View largeDownload slide Direct comparison of the smoothing kernels: (a) Sobolev pre-conditioning with $$(\mu _0^2,\mu _1^2)=(1,0.1),$$ (b) Gaussian smoothing with σ2 = 0.1 and (c) Gaussian smoothing with σ2 = 0.035. (d) Profiles extracted from (a) to (c) at zero vertical wavenumber, where the blue line is from (a), the green line is from (b) and the red line is from (c). Decreasing the Gaussian smoothing in an attempt to enhance the high wavenumbers broadens the Gaussian bell-shaped curve (note the shift from the green to red lines), but the higher wavenumbers are still not recovered to the level of SSP (blue line). Figure 7. View largeDownload slide Direct comparison of the smoothing kernels: (a) Sobolev pre-conditioning with $$(\mu _0^2,\mu _1^2)=(1,0.1),$$ (b) Gaussian smoothing with σ2 = 0.1 and (c) Gaussian smoothing with σ2 = 0.035. (d) Profiles extracted from (a) to (c) at zero vertical wavenumber, where the blue line is from (a), the green line is from (b) and the red line is from (c). Decreasing the Gaussian smoothing in an attempt to enhance the high wavenumbers broadens the Gaussian bell-shaped curve (note the shift from the green to red lines), but the higher wavenumbers are still not recovered to the level of SSP (blue line). 5 EXAMPLES The examples in this section were obtained by performing FWI in the Laplace–Fourier domain (Pratt et al.1998; Sirgue & Pratt 2004; Kamei et al.2013). In this domain the frequency is considered to be a complex variable ω = ωr + i/τ, where the real part (ωr) is the actual circular frequency and the imaginary part is the reciprocal of a time constant (τ). Kamei et al. (2013) showed that during FWI the use of finite values of the time constant τ is equivalent to time-damping the input data (down weighting the later arrivals). We employed this approach here in creating synthetic time-damped frequency domain input (‘observed’) data for our inversions, and we then used the same single frequency, τ = 1 s time constant during the inversions. The method of steepest-descent was used with Brent's method for a line search (Brent 1973; Press et al.1993). We performed two sets of FWIs: In the first set the FWIs with SSP and Gaussian pre-conditioning were compared, and in the second set an exponential offset-dependent gain factor was applied to the residuals throughout both the SSP and the Gaussian pre-conditioned FWIs (after Brenders & Pratt 2007a). This amplitude gain compensates for the loss in amplitude of the increasingly delayed first arrivals at large offsets due to time damping, and was implemented using an exponential function of offset, exp (offset/(τ · vmin)), where vmin is a parameter with units of velocity used to scale the gain. For all inversions vmin was set equal to 2000 m/s. In the second set of inversions, the exponential offset gain effectively enhances the response from the later arrivals, which include both long offset diving waves and deep reflection events. Thus, with an offset-dependent amplitude gain the inversion responds preferentially to the deeper parts of the model, even in the early stages of FWIs. This implies that synthetic reflections need to be generated during the early stages of the inversion, so that the computed low-wavenumber updates remain in a direction that is consistent with the data. Therefore the second set of inversions require that the sharper features in the gradient must be retained during the earliest iterations. We thus expect that the low-wavenumber updates will be constrained by reflected arrivals in the SSP, whereas the Gaussian pre-conditioner will fail to produce synthetic reflections due to the complete separation of the scales. We carried out all tests with our own version of the Marmousi model (after Brougois et al. (1990); Fig. 8a). We implemented the SSP tests using the zeroth- and first-order terms in the isotropic SSP. In other words, the differential operator in eq. (13) was a spatial gradient operator. The SSP inversions were compared with inversions in which we used a Gaussian pre-conditioning scheme. The starting velocity for all inversions was a 1-D linear function of depth z, shown in Fig. 8(b). The model grid spacing was 8 m and an extra layer, 160 m thick, was added on top with 1500 m/s velocity to suppress free surface multiples. The velocity was fixed at 1500 m s−1 down to a depth of 174 m throughout the inversions. There were 364 sources and receivers, both with 24 m spacing, from (x, z) = (160, 160) m to (x, z) = (8872, 160) m. A Küpper wavelet (Küpper 1958), shown in Fig. 9(a), was used as the source time function, and synthetic ‘observed’ data were generated at 4, 6 and 8 Hz using the Laplace–Fourier domain modelling in the true model (Fig. 8a) with a time-damping factor of τ = 1 s. All of the inversions were performed for the same frequencies simultaneously (i.e. no sweeping from low to high frequencies was used), the same τ = 1 s damping, and the true source wavelet was used during the inversions. The synthetic observed data at the these inversion frequencies were obtained directly from the Laplace–Fourier domain modelling. We also generated time domain data for the purposes of visualization, using 250 equally spaced frequencies from 0.25 to 62.5 Hz; a shot gather after Fourier synthesis is shown in Fig. 9(c). Figure 8. View largeDownload slide (a) The true velocity model used in the inversion tests and (b) the linear 1-D starting velocity model. Vertical lines in the figures represent the locations where velocity profiles will be compared. Figure 8. View largeDownload slide (a) The true velocity model used in the inversion tests and (b) the linear 1-D starting velocity model. Vertical lines in the figures represent the locations where velocity profiles will be compared. Figure 9. View largeDownload slide (a) The Küpper source wavelet used in the inversion tests, (b) its amplitude spectrum and (c) a representative synthetic observed shot gather at x = 2536 m. Figure 9. View largeDownload slide (a) The Küpper source wavelet used in the inversion tests, (b) its amplitude spectrum and (c) a representative synthetic observed shot gather at x = 2536 m. The ray paths covering the true model from sources at 2, 4, 6 and 8 km are shown in Figs 10(a)–(d), respectively. The ray paths demonstrate the sparse refraction coverage of the region below 1.5 km depth. This lack of refraction coverage, and the fact that we used time damping result in weaker updates within the deeper parts of the model. However, this loss of information from the deeper part of the model is partly compensated for in the second set of tests when we use an offset-dependent exponential gain. Figure 10. View largeDownload slide Representative ray paths in the true (Marmousi) model with sources located at: (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The receivers are located along the horizontal white line at z = 160 m. Figure 10. View largeDownload slide Representative ray paths in the true (Marmousi) model with sources located at: (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The receivers are located along the horizontal white line at z = 160 m. During the inversions we noted that the descent algorithm occasionally ‘stalled’, meaning that either the relative reduction in the objective function was less than 0.001 per cent for two or more iterations consecutively or a negative velocity was obtained in the updates. When this happened, we used this as a trigger to reduce the scale factors in the pre-conditioning equations, thereby reducing the spatial scales of the updates (and introducing more high wavenumbers). This strategy for switching the model spatial scales was used for both the SSP scheme and for the Gaussian pre-conditioning scheme. For SSP $$\mu _{0}^{2}=1$$ was kept constant in eq. (13) for all iterations in both sets of inversion tests, and only $$\mu _{1}^{2}$$ was varied. In order to keep the SSP dimensionless, the dimension of μ1 must length (we use km for units) and μ0 must be dimensionless. Similarly, for the Gaussian pre-conditioner the dimension of σ is length in km. The parameter μ1 was chosen to be 10 km, that is $$\mu _{1}^{2}=100\,{\rm km}^2$$ initially. Physically this means that the smoothing length is 10 km which is approximately the size of the larger (horizontal) axis, which would give low wavenumber updates with a dominant wavelength equal to the larger model dimension. Henceforth, the units of μi are kmi, and similarly units for σ2 are km2. 5.1 Inversions without offset-dependent exponential gain The full schedule of scale factors for this set of inversions is provided in Table 1. Both SSP and Gaussian pre-conditioned inversions were started with a scale factor of 100, that is $$\left(\mu _{0}^{2},\mu _{1}^{2}\right)=\left(1,100\right)$$ for SSP and σ2 = 100 for Gaussian pre-conditioning. The Gaussian inversion with σ2 = 100 stalled (producing a negative velocity value in the update) at the first iteration, so we restarted with a scale factor of 50 as shown in Table 1. Although in general higher scale factors enhance the low-wavenumber scales in the updates, since the SSP combines the low- and high-wavenumber scales in the updates the scales have to be switched less often in the SSP method than the Gaussian. Fig. 11 depicts the gradient in the SSP inversion at the 173rd iteration, before and after pre-conditioning, illustrating the emphasis on low wavenumbers when $$\mu _{1}^{2}=50$$, but also illustrating that while the low-wavenumber features are being updated, the sharper features start to appear as well. This wide band of wavenumbers helps to keep the SSP method in the zone of attraction of the global minimum as the scale factor is subsequently reduced. Fig. 12 depicts the equivalent result (for the seventh iteration) using the Gaussian smoothing method, illustrating that all information on the sharp features (i.e. the images of the scattering structure) is lost due to the nearly complete scale separation of the Gaussian smoother. Figure 11. View largeDownload slide An image of the objective function gradient (a) before and (b) after SSP (pre-conditioning) with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,50\right)$$, at the 173rd iteration in the SSP inversion. No offset-dependent exponential gain was applied to the data in this test. Figure 11. View largeDownload slide An image of the objective function gradient (a) before and (b) after SSP (pre-conditioning) with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,50\right)$$, at the 173rd iteration in the SSP inversion. No offset-dependent exponential gain was applied to the data in this test. Figure 12. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 50, shown at the 173rd iteration. No offset-dependent exponential gain was applied to the data in this test. Figure 12. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 50, shown at the 173rd iteration. No offset-dependent exponential gain was applied to the data in this test. Table 1. Iterations and smoothing parameters for SSP and Gaussian pre-conditioning FWIs when no offset-dependent exponential gain was applied. SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–72  1  100  1–7  50  73–307  1  50  8–28  10  308–650  1  10  29–107  1        108–118  0.5        119–277  0.1        278–650  0  SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–72  1  100  1–7  50  73–307  1  50  8–28  10  308–650  1  10  29–107  1        108–118  0.5        119–277  0.1        278–650  0  View Large A total of 650 iterations were performed for both SSP and Gaussian inversions and the final results are plotted in Figs 13(a) and (b), respectively. While the Gaussian inversion managed to recover the background velocity, the sharp features were not recovered well, especially in the deeper parts of the model. This is because when the low-wavenumber features were being updated, the inversion was not constrained by the small angle scattering events (present in the data at all iterations). The SSP inversion on the other hand, included the sharp features alongside the background updates, which constrained the gradient direction such that it was consistent with all scattering response in the data at every iteration. A comparison of the velocity profiles in the final inversion results at 2, 4, 6 and 8 km is shown in Fig. 14. The linear starting velocity model also presented a significant bulk shift with respect to the true velocity model in some places, for example, at 2 km (Fig. 14a). Although both the SSP and Gaussian inversions corrected for these bulk shifts, the Gaussian inversion results are much more erratic below 1.5 km depth. Figure 13. View largeDownload slide Final inversion results without offset-dependent exponential gain after 650 iterations for (a) the SSP inversion and (b) for Gaussian pre-conditioning. In the final stages minimum values of the SSP scalars $$(\mu _0^2,\mu _1^2)=(1,10),$$ and the Gaussian smoothing scalar σ2 = 0 were used. Figure 13. View largeDownload slide Final inversion results without offset-dependent exponential gain after 650 iterations for (a) the SSP inversion and (b) for Gaussian pre-conditioning. In the final stages minimum values of the SSP scalars $$(\mu _0^2,\mu _1^2)=(1,10),$$ and the Gaussian smoothing scalar σ2 = 0 were used. Figure 14. View largeDownload slide Velocity profiles from the final inversion results in Fig. 13, at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity model. Solid and dashed blue lines are SSP and Gaussian pre-conditioning results, respectively. Figure 14. View largeDownload slide Velocity profiles from the final inversion results in Fig. 13, at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity model. Solid and dashed blue lines are SSP and Gaussian pre-conditioning results, respectively. Fig. 15 shows the behaviour of the data misfit (i.e. the squared residuals) as a function of iteration, for the SSP and Gaussian inversions. The full curve is given in Fig. 15(a) and zoomed version are given in Figs 15(b)–(d). The SSP residuals remain smaller than the Gaussian residuals until iteration number 278. At iteration 278th in the Gaussian inversion the scale factor is set to zero, so that from then on no pre-conditioning is applied. While the residual is lower, the scatterers are not imaged correctly. Figure 15. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method and for the Gaussian method, for inversions without offset-dependent exponential gain, showing (a) all iterations from 1 to 650, (b) iterations 1–20, (c) iterations 20–200 and (d) iterations 200–650. Figure 15. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method and for the Gaussian method, for inversions without offset-dependent exponential gain, showing (a) all iterations from 1 to 650, (b) iterations 1–20, (c) iterations 20–200 and (d) iterations 200–650. 5.2 Inversions with offset-dependent exponential gain Following our initial inversion results, we then tested a full set of inversions in which the data were pre-conditioned through the application of an offset-dependent exponential gain. This strategy has the effect of boosting the deeper parts of the gradient, because the offset gain enhances far offset arrivals in the data that contain both the diving wave response (early arrivals) and the reflection response from deeper reflectors (later arrivals). The reflected energy at late times and far offsets arises from small scattering angles3 from deeper parts of the model. For the SSP inversions this means that the low-wavenumber updates potentially benefit from the contributions of the sharper features in the gradient due to additional low wavenumber contributions arising from second-order scattering, and as a result the SSP scales have to be modified more often than the previous set of SSP inversions carried out without offset-dependent gain. Table 2 shows the scale factors used for SSP and Gaussian inversions for these inversions. The gradients at the 217th iteration before and after SSP are shown in Fig. 16 for a scale factor of μ1 = 10. Similarly the gradient before and after Gaussian pre-conditioning is shown in Fig. 17 for a scale factor of σ = 10. Figure 16. View largeDownload slide An image of the objective function gradients (a) before, and (b) after SSP (pre-conditioning) with $$(\mu _0^2,\mu _1^2)=(1,10)$$, at the 217th iteration in the SSP inversion. Offset-dependent exponential gain was applied during this test. Figure 16. View largeDownload slide An image of the objective function gradients (a) before, and (b) after SSP (pre-conditioning) with $$(\mu _0^2,\mu _1^2)=(1,10)$$, at the 217th iteration in the SSP inversion. Offset-dependent exponential gain was applied during this test. Figure 17. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 10, shown at the 11th iteration. Figure 17. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 10, shown at the 11th iteration. Table 2. Iterations and smoothing parameters for SSP and Gaussian pre-conditioning FWIs when an offset-dependent exponential gain was applied. SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–31  1  100  1–11  10  32–95  1  50  12–41  1  96–216  1  10  42–58  0.5  217–325  1  1  59–298  0.1  326–365  1  0.1  299–365  0  SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–31  1  100  1–11  10  32–95  1  50  12–41  1  96–216  1  10  42–58  0.5  217–325  1  1  59–298  0.1  326–365  1  0.1  299–365  0  View Large The final inversion results (after 365 iterations) from the SSP inversion and Gaussian pre-conditioned inversions using offset-dependent gains are compared in Fig. 18. The 1-D profiles of the velocities are compared in Fig. 19. The offset-dependent exponential gain applied meant that events from deeper reflectors were weighted higher in the data, and we suggest that the Gaussian pre-conditioned inversion failed because the background updates were not constrained by the sharp features. Figure 18. View largeDownload slide Final inversion results without offset-dependent exponential gain, after 365 iterations (a) for SSP and (b) for Gaussian pre-conditioning inversions. In the final stages minimum values of the SSP scalars $$\left(\mu _0^2,\mu _1^2\right)=\left(1,0.1\right)$$, and the Gaussian smoothing scalar σ2 = 0.1 were used. Figure 18. View largeDownload slide Final inversion results without offset-dependent exponential gain, after 365 iterations (a) for SSP and (b) for Gaussian pre-conditioning inversions. In the final stages minimum values of the SSP scalars $$\left(\mu _0^2,\mu _1^2\right)=\left(1,0.1\right)$$, and the Gaussian smoothing scalar σ2 = 0.1 were used. Figure 19. View largeDownload slide Velocity profiles in Fig. 18 at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity. Solid and dashed blue lines are the SSP and Gaussian pre-conditioning results, respectively. Figure 19. View largeDownload slide Velocity profiles in Fig. 18 at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity. Solid and dashed blue lines are the SSP and Gaussian pre-conditioning results, respectively. Fig. 20 shows the behaviour of the residual versus iteration number for both the inversions in which offset-dependent exponential gain was applied. As before the residual curve for the Gaussian inversion drops rapidly after the scale factor has been reduced. For the SSP inversion there are dips in the residual curve, caused by the fact that the scatterers being updated simultaneously and continuously along with the background. Figure 20. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method (solid curves) and for the Gaussian method (dashed curves), for inversions with offset-dependent exponential gain. Iterations from 1 to 365 for both SSP and Gaussian pre-conditioning are shown. Figure 20. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method (solid curves) and for the Gaussian method (dashed curves), for inversions with offset-dependent exponential gain. Iterations from 1 to 365 for both SSP and Gaussian pre-conditioning are shown. 6 CONCLUSION We have defined an inner product, the SSIP based on a Sobolev norm, which includes a measure of the functions along with their derivatives. The scale factors in the definition of the SSIP allow control over how much a given order of derivative can contribute to the inner product. The SSIP leads to a pre-conditioner, the SSP that allows the low wavenumbers to be enhanced in an edge-preserving manner. The SSP maps an FWI gradient onto the scaled-Sobolev space, which requires that the functions are differentiable at least as many times as the highest order of SSP. This differentiability requirement implies that the scaled-Sobolev space consists of smoother function than those in a general Hilbert space equipped with an L2 norm. Because of the control we have over the scale factors, the SSP is suitable for separating the model spatial scales in FWI. Spatial derivatives of the velocity model have also been constrained using Tikhonov regularization in FWI to achieve a direction dependent smoothing (e.g. Sinoquet 2005; Versteeg & Symes 1993; Lailly & Sinoquet 1996). In Appendix B, we show the close relation between a Sobolev gradient and Tikhonov regularization. We have shown that the nonlinearity in the FWI objective can be can be mitigated by using SSP, which separates the spatial scales of the model updates based on the spatial derivatives of the subsurface model. The SSP updates the background such that the subsurface model is updated in a direction that is consistent with the small angle scattering responses contained in the data. These constrained updates are the result of the edge preserving nature of the SSP. The model scales can be chosen by choosing the scale factors in the SSP, which can be set differently for all dimensions and all orders of the derivatives. This makes SSP a convenient tool for scale separation. If the model scales are set equal to each other for all dimensions in a given order of derivative, the pre-conditioning is isotropic. Anisotropic form of SSP can be useful in cases where model dimensions are very different from each other. The synthetic examples with the Marmousi model show that the SSP is more robust compared to the Gaussian smoothing, which does not preserve the edges in the model. In the example with offset-dependent gain, deeper parts of the gradient were enhanced at all iterations for both SSP and Gaussian pre-conditioned inversions, however the Gaussian pre-conditioned inversion was not constrained by the sharp features in the image. Therefore the SSP inversion performed better when the data were pre-conditioned using an exponential offset-dependent gain, leading to the desired effect of improving the result in deeper parts of the model. The Gaussian pre-conditioned inversion result on the other hand, was further away from the true solution than the Gaussian result without the offset based gain. ACKNOWLEDGEMENTS We would like to thank Dr. William W. Symes and Dr. Kris Innanen for their comments and suggestions. Footnotes 1 The reflector positions were known a priori and were not part of the velocity model through which the wavefield was propagated. 2 The image we use is actually an FWI gradient, however here it is used as an arbitrary image that contains a band of wavenumbers. Any other image could also be used to test the point made in this section. 3 Small scattering angle here means less than critical angle, in which case the corresponding response in the data consists of specular reflections. 4 The unitary operator 1 would be replaced by an identity matrix I if a discrete representation is being used in which case ∇ (the spatial gradient operator) would be replaced by its matrix equivalent. REFERENCES Blazek K.D., Stolk C., Symes W.W., 2013. A mathematical framework for inverse wave problems in heterogeneous media, Inverse Problems  29( 6), 065001, 37. Google Scholar CrossRef Search ADS   Brenders A., Pratt R., 2007a. Efficient waveform tomography for lithospheric imaging: implications for realistic, two-dimensional acquisition geometries and low-frequency data, Geophys. J. Int.  168( 1), 152– 170. Google Scholar CrossRef Search ADS   Brenders A.J., Pratt R.G., 2007b. Full waveform tomography for lithospheric imaging: results from a blind test in a realistic crustal model, Geophys. J. Int.  168( 1), 133– 151. Google Scholar CrossRef Search ADS   Brent R.P., 1973. Algorithms for Minimization without Derivatives  Prentice-Hall, Includes index. Brossier R., Operto S., Virieux J., 2015. Velocity model building from seismic reflection data by full-waveform inversion, Geophys. Prospect.  63( 2), 354– 367. Google Scholar CrossRef Search ADS   Brougois A., Bourget M., Lailly P., Poulet M., Ricarte P., Versteeg R., 1990. Marmousi, model and data, in EAEG Workshop-Practical Aspects of Seismic Data Inversion . Calder J., Mansouri A., Yezzi A., 2011. New possibilities in image diffusion and sharpening via high-order Sobolev gradient flows, J. Math. Imaging Vis.  40( 3), 248– 258. Google Scholar CrossRef Search ADS   Consolvo B.P., Zuberi M.A.H., Pratt R.G., Cary P.W., 2017. FWI with scaled-Sobolev preconditioning applied to short-offset vibroseis field data, in 79th EAGE Conference and Exhibition 2017 . Evans L., 2010. Partial Differential Equations  Graduate studies in mathematics, American Mathematical Society. Google Scholar CrossRef Search ADS   Fu L., Symes W.W., 2017. An adaptive multiscale algorithm for efficient extended waveform inversion, Geophysics  82( 3), R183– R197. Google Scholar CrossRef Search ADS   Guitton A., Ayeni G., Díaz E., 2012. Constrained full-waveform inversion by model reparameterization, Geophysics  77( 2), R117– R127. Google Scholar CrossRef Search ADS   Hale D., 2007. Local dip filtering with directional Laplacians, CWP Proj. Rev.  567 91– 102. Kamei R., Pratt R.G., Tsuji T., 2013. On acoustic waveform tomography of wide-angle OBS data–strategies for pre-conditioning and inversion, Geophys. J. Int.  194( 2), 1250– 1280. Google Scholar CrossRef Search ADS   Küpper F.J., 1958. Theoretische untersuchung über die mehrfachaufstellung von geophonen., Geophys. Prospect.  6( 3), 194– 256. Google Scholar CrossRef Search ADS   Lailly P., 1983. The seismic inverse problem as a sequence of before stack migrations, in Conference on Inverse Scattering–Theory and Application  pp. 206– 220, Society for Industrial and Applied Mathematics, Expanded Abstracts. Lailly P., Sinoquet D., 1996. Smooth velocity models in reflection tomography for imaging complex geological structures, Geophys. J. Int.  124( 2), 349– 362. Google Scholar CrossRef Search ADS   Neuberger J., 1997. Sobolev Gradients and Differential Equations  no. 1670 in Lecture Notes in Computer Science, Springer. Google Scholar CrossRef Search ADS   Perona P., Malik J., 1990. Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell.  12( 7), 629– 639. Google Scholar CrossRef Search ADS   Plessix R.-E., 2006. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications, Geophys. J. Int.  167( 2), 495– 503. Google Scholar CrossRef Search ADS   Pratt R.G., 2008. Waveform tomography-successes, cautionary tales, and future directions, in 70th EAGE Conference and Exhibition-Workshops and Fieldtrips . Pratt R.G., Shin C., Hick G.J., 1998. Gauss-Newton and full Newton methods in frequency-space seismic waveform inversion, Geophys. J. Int.  133( 2), 341– 362. Google Scholar CrossRef Search ADS   Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P., 1993. Numerical Recipes in FORTRAN; The Art of Scientific Computing  2nd edn, Cambridge Univ. Press. Renka R., 2009. Image segmentation with a Sobolev gradient method, Nonlinear Anal. Theory Methods Appl.  71( 12), e774 – e780. Renka R.J., 2013. Nonlinear least squares and Sobolev gradients, Appl. Numer. Math.  65 91– 104. Google Scholar CrossRef Search ADS   Rickett J., 2000. Traveltime sensitivity kernels: Banana-doughnuts or just plain bananas?, Stanford Exploration Proj. Rep.  103 63– 71. Sava P., Biondi B., 2004a. Wave-equation migration velocity analysis. I. Theory, Geophys. Prospect.  52( 6), 593– 606. Google Scholar CrossRef Search ADS   Sava P., Biondi B., 2004b. Wave-equation migration velocity analysis. II. Subsalt imaging examples, Geophys. Prospect.  52( 6), 607– 623. Google Scholar CrossRef Search ADS   Sinoquet D., 2005. Modeling A Priori Information on the Velocity Field in Reflection Tomography  pp. 591– 594, Society of Exploration Geophysicists. Sirgue L., 2003. Inversion de la Forme D'onde Dans Le Domaine Fréquentiel de Données Sismiques Grands Offsets, PhD thesis  Université Paris and Queen's University. Sirgue L., Pratt R.G., 2004. Efficient waveform inversion and imaging: a strategy for selecting temporal frequencies, Geophysics  69( 1), 231– 248. Google Scholar CrossRef Search ADS   Stolk C.C., 2001. On the modeling and inversion of seismic data, PhD thesis  Universiteit Utrecht. Symes W.W., 1991. A differential semblance algorithm for the inverse problem of reflection seismology, Comput. Math. Appl.  22( 4), 147– 178. Google Scholar CrossRef Search ADS   Symes W.W., Kern M., 1994. Inversion of reflection seismograms by differential semblance analysis: algorithm structure and synthetic examples, Geophys. Prospect.  42( 6), 565– 614. Google Scholar CrossRef Search ADS   Ulmer W., 2010. Inverse problem of linear combinations of Gaussian convolution kernels (deconvolution) and some applications to proton/photon dosimetry and image processing, Inverse Probl.  26( 8), 085002, 26. Google Scholar CrossRef Search ADS   Versteeg R., Symes W., 1993. Geometric constraints on seismic inversion, in SEG Technical Program Expanded Abstracts 1993  pp. 595– 598, Society of Exploration Geophysicists. Virieux J., Operto S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics  74( 6), WCC1– WCC26. Google Scholar CrossRef Search ADS   Wu R., Toksöz M., 1987. Diffraction tomography and multisource holography applied to seismic imaging, Geophysics  52( 1), 11– 25. Google Scholar CrossRef Search ADS   Wu Z., Alkhalifah T., 2015. An improved full waveform inversion procedure based on scattering angle enrichment, in 77th EAGE Conference and Exhibition 2015 . Zhang Y., Sun J., 2009. Practical issues of reverse time migration: true amplitude gathers, noise removal and harmonic-source encoding, in Beijing 2009 International Geophysical Conference and Exposition  Beijing, China, 24–27 April 2009, pp. 204– 204, Society of Exploration Geophysicists. Zhou W., Brossier R., Operto S., Virieux J., 2015. Full waveform inversion of diving & reflected waves for velocity model building with impedance inversion based on scale separation, Geophys. J. Int.  202( 3), 1535– 1554. Google Scholar CrossRef Search ADS   Zuberi M.A.H., Pratt R.G., 2016. Mitigating non-linearity in full waveform inversion by scaled-Sobolev pre-conditioning, in 78th EAGE Conference and Exhibition 2016 . Zuberi M.A.H., Pratt R., 2017. Mitigating cycle skipping in full waveform inversion by using a scaled-Sobolev objective function, in 79th EAGE Conference and Exhibition 2017 . APPENDIX A: THE SOBOLEV GRADIENT IN FWI In this appendix we start by looking at how the Sobolev norm can be used in FWI by changing the definition of the gradient vector. For a conventional definition of the gradient, a change in the objective function E(p) due to a perturbation δp in the inversion parameters can be written as a discrete inner product, that is a vector dot product of a row vector (∇pE)T and a column vector of parameter perturbations δp,   \begin{eqnarray} {\rm d}E & =\left(\nabla _{p}E\right)^{T}\left(\delta \mathbf {p}\right)\!, \end{eqnarray} (A1)where ∇p is the conventional gradient with respect to the inversion parameters, that is, the first partial derivatives of the FWI objective function with respect to each of the inversion parameters (as in the box above). The FWI gradient ∇pE contains a range of spatial scales (wavenumbers) according to the frequencies and scattering angles contained in the observed data. We propose a criterion for the separation of those scales in which the sharp features of the slowness model are represented by the components of the spatial derivatives of the inversion parameters. This means that the scales are no longer fully independent, however we will introduce a modified definition of the gradient that separates out the contributions of the various scales. A1 A Sobolev gradient using matrix representation For discretely sampled distributions of model parameters, we assume a vector representation of the parameters p, and we build a modified definition of the gradient by augmenting the gradient vector with additional first-derivative components according to the number of spatial dimensions, that is, in three dimensions the augmented gradient using first-order spatial derivatives is given by   $$\nabla _{A}E=\left(\begin{array}{c}\nabla _{S}E\\ \partial \left(\nabla _{S}E\right)/\partial x\\ \partial \left(\nabla _{S}E\right)/\partial y\\ \partial \left(\nabla _{S}E\right)/\partial z \end{array}\right),$$ (A2)where ∇SE is a new Sobolev gradient (yet to be determined). This introduces additional subspaces into the search that represent the sharp features of the Sobolev gradient. This new definition of the gradient allows us to explicitly seek additional reductions in the L2 norm of the spatial derivatives of the gradient, without changing the definition of the objective functional. To proceed we now allow a perturbation with additional components in these subspaces defined in the same manner, and we calculate the resulting change in the objective function using the vector dot product   $${}{\rm d}E=\left(\begin{array}{c}\nabla _{S}E\\ \partial \left(\nabla _{S}E\right)/\partial x\\ \partial \left(\nabla _{S}E\right)/\partial y\\ \partial \left(\nabla _{S}E\right)/\partial z \end{array}\right)^{T}\left(\begin{array}{c}\delta \mathbf {p}\\ \partial \left(\mathbf {\delta p}\right)/\partial x\\ \partial \mathbf {\left(\delta p\right)}/\partial y\\ \partial \mathbf {\left(\delta p\right)}/\partial z \end{array}\right),$$ (A3)which can be expanded and simplified to yield   \begin{eqnarray} {}{\rm d}E &=& \left(\nabla _{S}E\right)^{T}\delta \mathbf {p}\!+\!\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial x}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial x}\!+\!\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial y}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial y}\nonumber\\ &&+\,\,\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial z}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial z} \end{eqnarray} (A4a)  \begin{eqnarray} \hphantom{{}{\rm d}E } &=&\left(\nabla _{S}E\right)^{T}\delta \mathbf {p}-\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial x^{2}}\nonumber\\ &&-\,\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial y^{2}}-\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial z^{2}} \end{eqnarray} (A4b)  \begin{eqnarray} \hphantom{{}{\rm d}E }=\left(\nabla _{S}E\right)^{T}\left(\mathbf {I}-\nabla ^{2}\right)\delta \mathbf {p}, \end{eqnarray} (A4c) where ∇2 is the spatial Laplacian operator. The row and column vector multiplications in eq. (A4a) represent summations over the spatial domain (i.e. discrete integration). Integration by parts has been used to transfer the spatial derivatives of the gradient onto the perturbation term (resulting in the appearance of a negative sign). Following Neuberger (1997), we seek the new gradient ∇SE such that for a given δp, the change in the objective function according to eq. (A4c) should be the same as the one given by eq. (A1). By comparing eqs (A1) and (A4c) we find an expression for the Sobolev gradient   $$\nabla _{S}E=\left(\mathbf {I}-\nabla ^{2}\right)^{-1}\nabla _{p}E.$$ (A5) To summarize, we have introduced an augmented definition of the gradient that contains a range of spatial scales, and we have augmented the perturbations to contain the corresponding scales. This changes, or pre-conditions the conventional gradient of the objective function using the inverse of the operator (I − ∇2) to yield the Sobolev gradient. We shall further interpret this result at the bottom of the next section. An alternate method for arriving at this expression is given in Appendix B. A2 A Sobolev gradient using functional representation For continuous distributions of inversion parameters p(x), the first-order perturbation in the objective function $${\rm {\rm }{d}}E$$ due to a change in parameters δp(x) is given by the L2 inner product   \begin{eqnarray} {\rm {d}}E & =\int _{\Omega }\delta p\left(x\right){\rm \left(\nabla _{p}E\right)}\left(x\right){\rm {d}}^{3}x\equiv \int _{\Omega }\delta p\left(x\right)g\left(x\right){\rm {d}}^{3}x, \end{eqnarray} (A6)where (∇pE)(x); ≡ g(x) is the gradient (the functional derivative) of the functional E with respect to p(x), x ≡ (x, y, z) and Ω represents the volume of integration. For g(x) and δp(x) with given norms $$\left(\int _{\Omega }\left(g\left(x\right)\right)^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$ and $$\left(\int _{\Omega }\left(\delta p\left(x\right)\right)^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$, dE is a maximum when g(x) and δp(x) are proportional to each other. This gives the method of steepest descent. Note that the norm is the square root of the inner product (defined in eq. A6) of a function with itself. This means that the functions g(x) and p(x) are elements of a space of functions that has an inner product defined by eq. (A6) and a norm given by $$\left(\int _{\Omega }\left|f\left(x\right)\right|^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$. By assuming completeness, this space becomes a Hilbert space. The definition of a norm of a function in this space allows functions that are not differentiable. If the definition of the inner product is modified so that the first derivative of a function is also considered, the allowed functions include only those that are at least once differentiable, that is, non-differentiable functions are excluded. Such a space is a Sobolev space, and as an example the inner product could be defined according to a continuous version of eq. (A3), in which case eq. (A6) becomes   \begin{eqnarray} {\rm d}E & =\int _{\Omega }\left[\hat{\mathbf {e}}_{0}g\left(x\right)+\nabla g\left(x\right)\right]\left[\hat{\mathbf {e}}_{0}\delta p\left(x\right)+\nabla \left(\delta p\left(x\right)\right)\right]{\rm d}^{3}x, \end{eqnarray} (A7)where ∇ (used without subscripts) is the spatial gradient, that is $$\nabla \equiv \hat{\mathbf {e}}_{1}\frac{\partial }{\partial x_{1}}+\hat{\mathbf {e}}_{2}\frac{\partial }{\partial x_{2}}+\hat{\mathbf {e}}_{3}\frac{\partial }{\partial x_{3}}$$. In eq. (A7) we have introduced a new unit vector $$\hat{\mathbf {e}}_{0}$$ to represent the original, background slownesses, now supplemented by the three Cartesian unit vectors of 3-D physical space $$(\hat{\mathbf {e}}_{1},\hat{\mathbf {e}}_{2},\hat{\mathbf {e}}_{3})$$, corresponding to each of the three subspaces containing the x, y, and z derivatives in the spatial gradient, all mutually orthogonal and defined to be orthogonal to $$\hat{\mathbf {e}}_{0}$$. We note that eq. (A7) is a special case of   $${}{\rm d}E\equiv \left\langle Dg,\ Dp\right\rangle _{{\rm L}_{2}}$$ (A8)which is the definition of a Sobolev inner product Renka (2013). The operator, D in eq. (A7) is a special case in which D ≡ 1 + ∇.4 Eq. (A7) is thus a special case of the Sobolev inner product, with the highest order of derivatives equal to 1. In other words, it is an L2 inner product between the sums of two functions and their first derivatives. With some algebra (and omitting the unit vectors for brevity), eq. (A7) becomes   \begin{eqnarray} {\rm {d}}E & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)+\nabla g_{_{S}}\left(x\right)\right]\left[\delta p\left(x\right)+\nabla \delta p\left(x\right)\right]{\rm {d}}^{3}x\nonumber \\ & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)\delta p\left(x\right)+\nabla g_{_{S}}\left(x\right)\nabla \delta p\left(x\right)\right]{\rm {d}}^{3}x\nonumber \\ & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)-\nabla ^{2}g_{_{S}}\left(x\right)\right]\delta p\left(x\right){\rm {d}}^{3}x, \end{eqnarray} (A9)where ∇2 is the Laplacian. Integration by parts has been used, together with the condition that the perturbation p(x) vanishes at the boundary of the volume Ω. In the third line of eq. (A9) we are again returning to an L2 inner product but with a changed gradient. By comparing equations (A9) and (A6), and again requiring that the change dE in both equations is the same, we get the relationship between L2 and Sobolev gradients, that is,   $$g_{_{S}}\left(x\right)=\left(1-\nabla ^{2}\right)^{-1}g\left(x\right).$$ (A10)The operator (1 − ∇2)−1 in eq. (A10) maps the conventional gradient from a Hilbert space with an L2 norm to the Sobolev space, resulting in a pre-conditioning of the gradient. The fact that $$g_{_{S}}\left(x\right)$$ is a smoother function than g(x) can be seen in the wavenumber domain application of eq. (A10), that is,   $$g_{_{S}}\left(k\right)=\left(1+k^{2}\right)^{-1}g\left(k\right)\!,$$ (A11)where k is the wavenumber magnitude, that is, k ≡ 2π/λ where λ is the wavelength. In this equation low-wavenumber features of the gradient are weighted more than the high-wavenumber features. The scales are treated quite differently: At very low wavenumbers the Sobolev gradient is identical to the conventional gradient, and at very high wavenumbers the Sobolev gradient is a double integration of the conventional gradient, leading to the reduction of high wavenumber content in the gradient but the preservation of some high wavenumber features. This leads to an ‘edge-preserving’ behaviour of the pre-conditioner. APPENDIX B: FROM TIKHONOV TO SOBOLEV The inner product (SSIP) and the resulting pre-conditioning (SSP) were derived in the main text using an argument following Neuberger (1997). In this appendix we will give an alternate way of arriving at the scaled-Sobolev pre-conditioning method. We shall start by deriving an expression for a parameter perturbation for a Tikhonov regularized objective function. We then discuss and discard the undesirable features of that update and keep the one that is desirable, allowing us to arrive again at an expression for the Sobolev gradient. After arriving at the resulting expressions, we shall give a physical interpretation. A change in any FWI objective function E following a perturbation in medium parameters δp is given by the first-order expression.   $${\rm d}E=(\nabla _{p}E)^{T}\delta \mathbf {p}.$$ (B1)The steepest descent method uses the gradient to define   $$\delta \mathbf {p}=-\alpha \nabla _{p}E,$$ (B2)where α is a non-negative, real-valued ‘step length’, while the Newton estimate of δp is related to the gradient and the Hessian H through   $$\mathbf {H}\delta \mathbf {p}=-\nabla _{p}E,$$ (B3)which is obtained by applying a parameter perturbation such that $$\nabla _{p}E\left(\mathbf {p}+\delta \mathbf {p}\right)=0\approx \nabla _{p}E\left(\mathbf {\mathbf {p}}\right)+\mathbf {H}\left(\mathbf {p}\right)\delta \mathbf {p}$$ to second order. If we specifically define a Tikhonov-type regularized objective function   $$E_{R}=\mathbf {r}^{T}\mathbf {r}+\lambda \left(\mathbf {R}\mathbf {p}\right)^{T}\left(\mathbf {R}\mathbf {p}\right)=E_{c}+\lambda \mathbf {p}^{T}\mathbf {R}^{T}\mathbf {R}\mathbf {p},$$ (B4)where r represents the conventional data residuals, Ec = rTr represents the conventional least squares objective functional, and R is a roughening operator (often the matrix representation of the first-order spatial derivatives), then this new objective function will have a different minimum in p space. The gradient and Hessian of the regularized objective function are   $$\nabla _{p}E_{R}=\nabla _{p}E_{c}+\lambda \mathbf {R}^{T}\mathbf {R}\mathbf {p}$$ (B5)where ∇pEc is the gradient of the conventional objective functional, and   $$\mathbf {H}_{R}=\mathbf {H}_{c}+\lambda \mathbf {R}^{T}\mathbf {R},$$ (B6)respectively, with Hc being the Hessian of the conventional objective functional. The Newton perturbation of eq. (B3) for the new Tikhonov objective becomes   $$\mathbf {H}_{R}\delta \mathbf {p}_{_{R}}=-\nabla _{p}E_{R}.$$ (B7)By substituting eq. (B5) and (B6), eq. (B7) becomes   $$\left(\mathbf {H}_{c}+\lambda \mathbf {R}^{T}\mathbf {R}\right)\delta \mathbf {p}_{_{R}}=-\nabla _{p}E_{c}-\lambda \mathbf {R}^{T}\mathbf {R}\mathbf {p}.$$ (B8)A desirable feature of the update given by eq. (B8) is that the weighted roughening operators RTR can be used to control how high or low we want the wavenumbers to be. This is a desirable feature because the roughening operators are constructed without reference to the Born approximation and therefore the scale-space defined by varying λ does not rely on the Born approximation. There are a couple of undesirable features in eq. (B8). If very low wavenumbers are required λ has to be very high. In that case, according to eq. (B4), the data misfit is practically ignored in the computation of the conventional gradient direction. The appearance of the conventional Hessian, which can improve convergence if the problem is quasi-linear, makes it extremely expensive to calculate and invert in FWI. The Gauss–Newton approximation is usually used, that is H is replaced by JTJ, where J is the conventional Jacobian. Even with the Gauss-Newton approximation, the calculation of this approximate Hessian may involve many extra forward propagations. B1 An intuitive step Let us keep the desired features of eq. (B8) and discard the undesired ones, which means that we shall neglect the second term on the right hand side of eq. (B8) and replace H on the left hand side by λ0I, where λ0 is a scalar and I is an identity matrix. Under these simplifications, eq. (B8) becomes   \begin{eqnarray} \left(\lambda _{0}\mathbf {I}+\lambda _{1}\mathbf {R}^{T}\mathbf {R}\right)\delta \mathbf {p}_{_{S}} & =-\nabla _{p}E_{c} \end{eqnarray} (B9a)or   \begin{eqnarray} \delta \mathbf {p}_{_{S}} & =-\left(\lambda _{0}\mathbf {I}+\lambda _{1}\mathbf {R}^{T}\mathbf {R}\right)^{-1}\nabla _{p}E_{c}. \end{eqnarray} (B9b) The change of subscript in the model update vector from R to S signifies that this update no longer corresponds to the minimization of regularized objective function (eq. B4), nor is it the conventional update either. Instead we have returned to the conventional objective function Ec, but we construct a model update that is a pre-conditioned version of the conventional gradient. The subscript S stands for Sobolev, chosen for reasons that will become clear shortly. Following the steepest descent method of eq. (B2), multiplying and dividing the right hand side of eq. (B9b) by a non-negative real-valued scalar $$\alpha _{_{S}}$$ we get   \begin{eqnarray} \delta \mathbf {p}_{_{S}} & = & -\alpha _{_{S}}\left(\mu _{0}^{2}\mathbf {I}+\mu _{1}^{2}\mathbf {R}^{T}\mathbf {R}\right)^{-1}\nabla _{p}E_{c}\nonumber \\ & \equiv & -\alpha _{_{S}}\nabla _{S}E_{c}, \end{eqnarray} (B10)where we have introduced the scaling factors   \begin{equation*} \mu _{o}^{2}\equiv \lambda _{o}\alpha _{s}\ \!\ \mathrm{and}\ \ \ \mu _{1}^{2}\equiv \lambda _{1}\alpha _{s}. \end{equation*} Eq. (B10) defines a Sobolev gradient of the conventional objective function that corresponds to a pre-conditioned version of the convention FWI gradient. Although the conventional objective function is retained in the expression, we are modifying the gradient to find a search direction that is smoother, so that we fit the long wavelength components earlier in the inversion. We also defined a scaled-Sobolev parameter update δpS that is proportional to ∇SEc in magnitude and opposite in direction. This is just the method of steepest-descent, but in a new scaled-Sobolev space of possible parameter distributions, with $$\alpha _{_{S}}$$ being simply the step length of the steepest-descent direction in this space. If we choose R to be the matrix expression of the first-order derivatives, then   $$\nabla _{S}E_{c}=\left(\mu _{0}^{2}\mathbf {I}-\mu _{1}^{2}\nabla ^{2}\right)^{-1}\nabla _{p}E_{c},$$ (B11)where ∇2 = −RTR is the matrix representation of the Laplacian operator. Equation (B11) may be compared with equations (A5), (A10) and (13) in the body of the paper. Let us confirm that the scaled-Sobolev update (eq. B10) does indeed give a reduction in the objective function. By using the Sobolev gradient from eq. (B11) and substituting the Sobolev perturbation δpS from eq. (B10) into eq. (B1) we obtain   \begin{eqnarray} \left({\rm d}E\right)_{S} =\delta \mathbf {p}_{_{S}}^{T}\left(\mu _{0}^{2}\mathbf {I}+\mu _{1}^{2}\mathbf {R}^{T}\mathbf {R}\right)\nabla _{S}E_{c} \end{eqnarray} (B12a)   \begin{eqnarray} & =\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\nabla _{S}E_{c}\\ \mu _{1}\mathbf {R}\nabla _{S}E_{c} \end{array}\right) \end{eqnarray} (B12b)  \begin{eqnarray} & =-\alpha _{_{S}}^{-1}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right). \end{eqnarray} (B12c) Eq. (B12c) gives a change in the objective function corresponding to an update in a direction $$\delta \mathbf {p}_{_{S}}$$. The change is strictly negative because the dot product   $$\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)$$ (B13)and $$\alpha _{_{S}}$$ are both non-negative, and hence the objective function must be reduced. B2 Interpretation Now we shall look at what it means to have ‘jumped’ from eq. (B8) to eq. (B9a). The conventional steepest descent gives a change in the conventional gradient of   \begin{eqnarray} {\rm d}E_{c} =\delta \mathbf {p}^{T}\nabla _{p}E_{c} \end{eqnarray} (B14a)  \begin{eqnarray} \hphantom{{\rm d}E_{c} } =-\alpha ^{-1}\left(\delta \mathbf {p}\right)^{T}\delta \mathbf {p}. \end{eqnarray} (B14b) By comparison of eqs (B14b) and (B12c) we can see that the definition of the inner product has been changed. The conventional inner product used in the steepest descent method is simply the scaled squared norm of the parameter perturbation, whereas in eq. (B12c) the squared norm of a function is given by a scaled version of the conventional inner product plus a conventional inner product of the derivatives of the function. If the scalars are equal to 1, this is the squared Sobolev norm. Since we are using a scaled version of that norm we shall term this the scaled-Sobolev norm. We shall call the corresponding inner product the scaled-Sobolev inner product (SSIP). The space of functions equipped with the inner product defined in eq. (B12c) will be termed the scaled-Sobolev space, hence the choice of the subscript S for the perturbations and the gradients. Therefore, by changing from eq. (B8) to eq. (B9a) we have changed the space of the medium perturbation functions that we are allowed to move in during the inversion process. The conventional change in the objective function is based on an L2-norm inner product, which does not require the perturbation functions to be differentiable. Therefore, the conventional updates can potentially involve rougher functions than SSIP, because the SSIP requires the function to be differentiable at least once. Eq. (B11), with the scaling factor equal to one, is the starting point in the main text, which follows the arguments given by Neuberger (1997). There is no reason why the order of the derivatives in the definition of SSIP should be restricted to one. A generalized result is given in the main text. © The Author(s) 2017. Published by Oxford University Press on behalf of The Royal Astronomical Society. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Geophysical Journal International Oxford University Press

# Mitigating nonlinearity in full waveform inversion using scaled-Sobolev pre-conditioning

, Volume 213 (1) – Apr 1, 2018
20 pages

/lp/ou_press/mitigating-nonlinearity-in-full-waveform-inversion-using-scaled-FnGfTwaQ7R
Publisher
Oxford University Press
ISSN
0956-540X
eISSN
1365-246X
D.O.I.
10.1093/gji/ggx549
Publisher site
See Article on Publisher Site

### Abstract

Summary The Born approximation successfully linearizes seismic full waveform inversion if the background velocity is sufficiently accurate. When the background velocity is not known it can be estimated by using model scale separation methods. A frequently used technique is to separate the spatial scales of the model according to the scattering angles present in the data, by using either first- or second-order terms in the Born series. For example, the well-known ‘banana-donut’ and the ‘rabbit ear’ shaped kernels are, respectively, the first- and second-order Born terms in which at least one of the scattering events is associated with a large angle. Whichever term of the Born series is used, all such methods suffer from errors in the starting velocity model because all terms in the Born series assume that the background Green's function is known. An alternative approach to Born-based scale separation is to work in the model domain, for example, by Gaussian smoothing of the update vectors, or some other approach for separation by model wavenumbers. However such model domain methods are usually based on a strict separation in which only the low-wavenumber updates are retained. This implies that the scattered information in the data is not taken into account. This can lead to the inversion being trapped in a false (local) minimum when sharp features are updated incorrectly. In this study we propose a scaled-Sobolev pre-conditioning (SSP) of the updates to achieve a constrained scale separation in the model domain. The SSP is obtained by introducing a scaled Sobolev inner product (SSIP) into the measure of the gradient of the objective function with respect to the model parameters. This modified measure seeks reductions in the L2 norm of the spatial derivatives of the gradient without changing the objective function. The SSP does not rely on the Born prediction of scale based on scattering angles, and requires negligible extra computational cost per iteration. Synthetic examples from the Marmousi model show that the constrained scale separation using SSP is able to keep the background updates in the zone of attraction of the global minimum, in spite of using a poor starting model in which conventional methods fail. Image processing, Inverse theory, Tomography, Waveform inversion, Coda waves, Controlled source seismology 1 INTRODUCTION The first-order Born approximation provides a practical way to linearize full waveform inversion (FWI) and obtain a computationally feasible gradient (Lailly 1983; Pratt et al.1998; Virieux & Operto 2009). However this linearization is only useful if the background velocity is sufficiently accurate. All such attempts to linearize FWI that rely upon terms in the Born scattering series have inherent limits on the background velocity errors that can be tolerated by the inversion. This is because of the local nature of the Born series. Regardless of how many terms in the Born series are used, the nonlinear dependence of the data on the background velocity is not accounted for (Symes 1991). The error limit is approximately given by the half-cycle criterion (Pratt 2008), which limits the level of acceptable background velocity errors. If the criterion is violated this leads directly to data domain cycle-skipping. It might be hoped that having a starting model that honours the half-cycle criterion for early arrivals would solve the problem of nonlinearity in the Born approximation. However, waveform inversions are complicated by the fact that the (smooth, low-wavenumber) background velocity has to be correct before inverting for the location and strength of the (sharp, high-wavenumber) scatterers in the model. This requires a separation of low- and high-wavenumber features in the gradient. That is, we require a method for model scale separation in order to update the background correctly. In this study we propose a scaled-Sobolev inner product (SSIP) that allows controlled separation of the model spatial scales by adjusting the scale factors that we introduce. We formalize the concept first introduced in Zuberi & Pratt (2016) (also applied to real data by Consolvo et al. (2017)), and we provide a number of examples demonstrating the approach. The SSIP leads to an edge-preserving smoothing operator, which we shall apply to the model updates. We refer to this approach as scaled-Sobolev pre-conditioning (SSP). The SSP gives constrained background updates, in which the constraint is provided by the sharp features in the model updates (i.e. the preserved edges in the SSP pre-conditioning), while introducing only a negligible additional computational cost per iteration compared to the conventional approach. The conventional FWI gradient includes both low- and high-wavenumber features. It is common to invoke the Born scattering series to separate the background from the scatterers by isolating events in the data based on scattering angles: wide angle scattering (including transmissions) are sensitive to low wavenumbers, and narrow angle scattering (including reflections) are sensitive to high wavenumbers (Wu & Toksöz 1987). We refer to these strategies as ‘Born-based’ approaches. To this end, some methods use the scattering close to 180° to isolate the low wavenumbers (e.g. by using the early arrivals, corresponding to the diving waves). These low-wavenumber updates are associated with ‘banana/donut’ (Rickett 2000) shaped sensitivity kernels that link the source to the receiver. The second-order terms are also used in some methods, in which the source-to-scatterer and scatterer-to-receiver sensitivity kernels are combined. These second-order sensitivity kernels, also known as ‘rabbit ears’ (Zhou et al.2015), respond to double scatterings in which at least one scattering angle is close to 180°. Although the conventional FWI gradient contains contributions from both the first-order diving waves and the second-order scattered waves, the high-wavenumber features often dominate the gradient and may trap the inversion in a false (local) minimum (Brossier et al.2015), which is why scale separation is required. First-order Born methods for scale separation work by isolating the single scattering response associated with large scattering angles. This may be achieved, for example, by weighting the waveform data according to offset, or by applying time-damping of the data to down-weight the later, reflected energy during the early stages of the inversion (Kamei et al.2013). Such a weighting is relatively inexpensive computationally, because no wavefield calculation or imaging is required, but the large angle scattering response can be difficult to isolate. This is because both large and small angle responses can occupy the same region in the data, especially when the subsurface model is strongly heterogeneous. The scattering angles can also be used to apply weights in the model domain, which would not require a separation of the scattering responses (Wu & Alkhalifah 2015). However, the accuracy of subsurface scattering angle calculations are still dependent on the accuracy of the background velocity (a requirement for the first-order Born term to be valid), which potentially leads to errors in the scale separation. Second-order Born methods calculate the second-order sensitivity kernels explicitly in an attempt to obtain the background velocity updates. An example of such methods is Wave Equation Migration Velocity Analysis (WEMVA), proposed by Sava & Biondi (2004a). WEMVA uses a difference in (high-wavenumber) image perturbations as the objective function, and the WEMVA operator relates the image perturbation to the background velocity perturbation, which creates the second-order sensitivity kernels. In WEMVA the requirement of small background velocity errors can lead to image domain cycle-skipping. Sava & Biondi (2004a,b) suggested a linearized image perturbation to overcome this issue. They suggested manually picking the background velocity ratios that flatten the subsurface angle gathers, which requires pre-calculation of subsurface angle gathers for a range of different ratios. Although this can be done in a computationally efficient manner as the calculation is a manual step performed only once in the inversion process, this still adds to the overall cost of the method. Overall, the total computational cost increases because of (i) the cost of computing the second-order sensitivity kernels (requiring that the source and receiver perturbed wavefields have to be modelled), and (ii) because of the manual step required to mitigate the errors caused by background velocity errors. Zhou et al. (2015) proposed a method for ‘joint full waveform inversion’ (JFWI) that uses both first- and second-order terms of the Born series. JFWI separates the primary reflected and transmitted (diving) wave responses in the data, using the first-order Born approximation, and the primary reflections in the data to calculate the second-order sensitivity kernels. The source (receiver) side second-order sensitivity kernels are generated by cross-correlating the back (forward) propagated perturbed wavefield with the forward (back) propagated background wavefield. This second-order calculation also leads to a substantial increase in the computational cost of JFWI. The separation of reflected and diving waves in JFWI, while computationally inexpensive, can be difficult in practice and might also require adjustment at each iteration (Zhou et al.2015). An advantage of the second-order Born methods over the first-order methods is that they use the reflected energy in the data to constrain the initial background updates, meaning that while the background is being updated it remains consistent with the scattering events contained in the data. Furthermore, if the background errors are negligible, the use of the second-order Born term is an approximation to the inverse Hessian, which can help the convergence rate. However, calculating the second-order sensitivity kernels puts an extra computational burden on the inversion, and because the approach is a part of the Born series it still requires an accurate background velocity model. In other words, the velocity-depth ambiguity is a problem in the second-order methods as well. The background error issue is mitigated by using either the redundancy in the data (using multiple offsets to calculate angle gathers in WEMVA) or by adding additional a priori information about the model (JFWI uses an a priori reflectivity model), although these measures further increase the computational cost of the inversions. In general using higher order terms in the Born series for scale separation suffers from the same background error limitations that we are trying to mitigate, and any attempts to correct for these errors increase computational cost. In addition, when a band of frequencies is being inverted, the Born-based separations isolate the highest available wavenumber and the rest of the wavenumber spectrum must be considered to constitute the background (see Fig. 1). This is demonstrated in Fig. 2 where we illustrate an example of a JFWI update calculated by using scattering response from two known1 reflectors independently (to avoid multiple scattering noise). The regions in white circles show high wavenumber parts of the update, which is even higher than the kmax in Fig. 1 (these are present because the a priori reflectivity model consisted of a discontinuous velocity structure). Figure 1. View largeDownload slide Model scales (after Wu & Toksöz 1987). Between kb and kmax the background and scatterer scales are not clearly defined; a ‘scattering’ wavenumber for one frequency could be considered to be the ‘background’ wavenumber for another. Figure 1. View largeDownload slide Model scales (after Wu & Toksöz 1987). Between kb and kmax the background and scatterer scales are not clearly defined; a ‘scattering’ wavenumber for one frequency could be considered to be the ‘background’ wavenumber for another. Figure 2. View largeDownload slide JFWI updates calculated for each reflector independently. The white circles show that high wavenumber information is still present in the image. This remnant high wavenumber information will be included before the background velocity model has converged, therefore it can potentially move the inversion away from the true solution. Figure 2. View largeDownload slide JFWI updates calculated for each reflector independently. The white circles show that high wavenumber information is still present in the image. This remnant high wavenumber information will be included before the background velocity model has converged, therefore it can potentially move the inversion away from the true solution. The limitations of the Born-based approaches for scale separation discussed above can be avoided by separating the model scales in the updates using wavenumber filtering (Sirgue 2003; Brenders & Pratt 2007b) or Gaussian smoothing. While these methods separate the scales in the model domain, the corresponding scattering response in the observed data cannot be separated. In other words, unlike the second-order Born-based approach, the background updates will not be constrained by the sharp wavenumber features (the seismic image), as these are deliberately removed from the update. This can cause the sharper features to be inaccurate when they are eventually imaged, especially if the starting model does not have any information about the true low wavenumbers. Constrained background updates can however be obtained by smoothing the gradient while preserving the edges. For example, Guitton et al. (2012) proposed a pre-conditioning that preserves dips in the model, but this requires estimation of the dips based on local dip filtering of the image Hale (2007). Other methods of edge preserving smoothing such as anisotropic diffusion (Perona & Malik 1990) or high-order Sobolev gradient flows proposed by Calder et al. (2011) have been used in non-seismic image processing, however they have not been utilized as a scale separation tool in FWI. Furthermore, the Sobolev norm proposed by Calder et al. (2011) is not particularly well suited for scale separation, mainly due to its lack of flexibility in choosing the scale factors. The Laplacian operator is frequently used to suppress low wavenumber artefacts in reverse time migration (RTM) (Zhang & Sun 2009), however the artefacts in RTM are the desired background updates in FWI. Therefore, an inverse Laplacian-type operator would be an appropriate choice for a pre-conditioner to enhance the low wavenumber features in the FWI gradient, that is some form of (1 + ∇2)−1, where ∇2 is the Laplacian. Indeed the inverse Laplacian has been used to pre-condition the FWI gradient with different scaling factors in the Laplacian and powers of the whole operator (e.g. Symes & Kern 1994; Fu & Symes 2017). In this paper we generalize this operator to include higher order derivatives based on the SSIP, and we show that the special case of the inverse Laplacian (i.e. first-order SSP) operator is an edge-preserving smoother. We explore this edge-preserving nature of the SSP and show that it is useful in mitigating nonlinearity in FWI when used as a scale separation tool in model domain. As a result of the edge-preservation, the low wavenumber content enhanced by the SSP will also receive contribution from the rabbit ears that will automatically be generated due to scattering from the sharp features in the gradient. A more mathematically precise reason for the mitigation of nonlinearity due to the first-order SSP can be deduced from a result regarding the differentiability of the FWI objective function with respect to the model parameters: Stolk (2001) and Blazek et al. (2013) showed that the objective function is differentiable if the source function and at least two orders of its time derivatives are in L2 space, and the model perturbation is measured by an L∞ norm. The L∞ norm is dominated by the Sobolev norm of order 1 and exponent 2 if the model behaves locally as 1D (Evans 2010), as is typical for sedimentary structure. This means that the updates will be smoother and a Fréchet derivative will be more likely to exist. The existence of a Fréchet derivative would ensure that the adjoint state method (Plessix 2006) gives a more accurate representation of the action of the Jacobian (partial derivative wavefield) matrix, thereby reducing the nonlinearity in FWI. The higher orders of SSP will not be studied here, but as they lead to more smoothing they can be used to control the level of edge-preservation in the smoothing. For example, Zuberi & Pratt (2017) use higher orders of SSP in the data domain to smooth the data and define a new objective function based on SSIP. 2 INTRODUCING THE SCALED-SOBOLEV INNER PRODUCT (SSIP) A note on the use of the ∇ notation The notation   \begin{equation*} \nabla \equiv \sum _{i}\hat{x}_{i}\frac{\partial }{\partial _{i}} \end{equation*}is used here to denote the ‘del’ operator generating the gradient vector from a scalar field. Thus in inversion the gradient of the objective function yields   \begin{equation*} \nabla _{p}E\left(\mathbf {p}\right)=\left(\begin{array}{c}\partial E/\partial p_{1}\\ \partial E/\partial p_{2}\\ \vdots \\ \partial E/\partial p_{m} \end{array}\right), \end{equation*}a column vector in m-dimensional space where m is the number of components in the vector p. In the Sobolev norm the notation is also used to compute first derivatives with respect to the spatial dimensions, so that   \begin{equation*} \nabla p(x)=\hat{\mathbf {i}}\frac{\partial p}{\partial x}+\hat{\mathbf {j}}\frac{\partial p}{\partial y}+\hat{\mathbf {k}}\frac{\partial p}{\partial z}=\hat{\mathbf {e}}_{1}\frac{\partial p}{\partial x_{1}}+\hat{\mathbf {e}}_{2}\frac{\partial p}{\partial x_{2}}+\hat{\mathbf {e}}_{3}\frac{\partial p}{\partial x_{3}}. \end{equation*}We adopt the convention that all non-subscripted uses of ∇ refer to the spatial gradient, and all other uses are indicated by the use of a subscript, that is, ∇pE. An introductory analysis of the Sobolev gradient using matrix and functional representations is given in Appendix A. The Sobolev gradient using the Laplacian operator has been applied in a variety of contexts (e.g. Symes & Kern 1994; Renka 2009, 2013). The Laplacian arises from the first-order derivatives in the Sobolev norm (Appendix A). The Sobolev gradient can be generalized to contain higher powers of the Laplacian (i.e. higher order derivatives in the Sobolev norm) and each derivative can be scaled independently. Therefore, let us utilize a more general inner product   $${\left\langle f,g\right\rangle _{A}\equiv \left\langle Af,Ag\right\rangle _{{\rm L}^{2}}=\left\langle f,\left(A^{\dagger }A\right)g\right\rangle _{{\rm L}^{2}}\!,}$$ (1)where † represents the adjoint. With A defined in eq. (2) below, eq. (1) defines the SSIP. Note that A†A is a positive definite symmetric operator. If we take A to be a differential operator then,   \begin{eqnarray} A & \equiv & \sum _{i=0}^{N=ML}\hat{\mathbf {e}}_{i}\mu _{i}\frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}\quad \mathrm{and}\quad A^{\dagger }=\sum _{i=0}^{N=ML}\hat{\mathbf {e}}_{i}\left(-1\right)^{m_{i}}\mu _{i}\frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}, \end{eqnarray} (2)where M and L are the maximum order of the derivatives and the number of dimensions, respectively. The superscript $$m_{i}=\lfloor \frac{i-1}{L}\rfloor +1$$ represents the order of the derivative and $$x_{l_{i}}$$ is the $$l_{i}^{{\rm th}}$$ dimension with li = [(i − 1)mod L] + 1. The highest order of derivative is $$M=\lfloor \frac{N-1}{L}\rfloor +1$$. The factors μi are real valued scalars and the unit vectors are orthonormal, that is,   $$\hat{\mathbf {e}}_{i}\cdot \hat{\mathbf {e}}_{j}=\left\lbrace \begin{array}{@{}l@{\quad }l@{}}1 & {\rm for}\ i=j\\ 0 & {\rm for}\ i\ne j \end{array}\right..$$ (3)The $$\left(-1\right)^{m_{i}}$$ factor in eq. (2) comes from integration by parts and an assumption that the functions involved have homogeneous boundary conditions. Under these conditions   $$AA^{\dagger }=A^{\dagger }A=\sum _{i=0}^{N}\left(-1\right)^{m_{i}}\mu _{i}^{2}\frac{\partial ^{2m_{i}}}{\partial x_{l_{i}}^{2m_{i}}},$$ (4)and we may write the SSIP as   $$\left\langle f,g\right\rangle _{A}=\sum _{i=0}^{N}\mu _{i}^{2}\left\langle f^{\left(i\right)},g^{\left(i\right)}\right\rangle _{{\rm L}^{2}},$$ (5)that is, the norm includes all scaled derivatives of f and g with $$f^{\left(i\right)}\equiv \frac{\partial ^{m_{i}}}{\partial x_{l_{i}}^{m_{i}}}f$$. The orthogonality of the derivative terms is imposed, and, as above for the sake of brevity the unit vectors are not written explicitly. When the functions are smooth (i.e. when all orders of derivatives approach zero), the SSIP is proportional to the L2 norm. The space of functions equipped with the SSIP is the scaled-Sobolev space. In the scaled-Sobolev space, depending on the highest order of derivatives in the SSIP, the functions are required to be differentiable. In contrast, functions in the Hilbert space (equipped only with an L2 inner product) are by definition potentially less smooth, as these functions do not have to be differentiable. Note that in the SSIP all orders and dimensions of the derivatives can have independently chosen scale factors, allowing an independent weighting of model scales for each dimension and each derivative order. Eqs (1), (2), (4) and (5) are the general (anisotropic) form; when μi is replaced by $$\mu _{m_{i}}$$ (meaning that the scale factors are chosen to be the same for all dimensions in a given order of the derivatives), the equations will be referred to as being ‘isotropic’. The inner product given in eq. (5) is different from the inner product suggested by Calder et al. (2011). In the inner product suggested by Calder et al. (2011) the operator A†A, defined by eq. (4), is replaced by the operator (I − λ∇2)k. While this operator also leads to an edge preserving smoothing, Calder's choice of scales for different dimensions is not completely independent. The operator (I − λ∇2)k is equal to A†A given in eq. (4) only when the weighting terms μi are the same for all dimension, when M = 1 in eq. (4) and when k = 1 in the operator (I − λ∇2)k. By way of explanation, let us look at a two dimensional isotropic example of the inner product in eq. (5). In this case the first-order derivatives N = ML = 2 × 1 and μ0, μ1 = 1 and μ2 = μ1, for which three unit vectors are required, $$\hat{\mathbf {e}}_{0}$$ for the zeroth-order derivative (the original functions), and $$(\hat{\mathbf {e}}_{1},\hat{\mathbf {e}}_{2})$$ for the two spatial components of the first derivatives in the gradient. Thus $$A=\hat{\mathbf {e}}_{0}+\hat{\mathbf {e}}_{1}\frac{\partial }{\partial x}+\hat{\mathbf {e}}_{3}\frac{\partial }{\partial y}$$ and   $$\left\langle f,g\right\rangle _{A}=\left\langle f,g\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial y},\frac{\partial g}{\partial y}\right\rangle _{{\rm L}^{2}}.$$ (6)The first term in eq. (6) is just the conventional inner product, that is ∫∂f(x)g(x)dx and the second two terms can be simplified as follows   \begin{eqnarray*} \left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}} & = & \int _{c}^{d}\left[\int _{a}^{b}\frac{\partial f}{\partial x}\frac{\partial g}{\partial x}{\rm d}x\right]{\rm d}y\\ & = & \int _{c}^{d}\left[f\frac{\partial g}{\partial x}|_{a}^{b}-\int _{a}^{b}f\frac{\partial ^{2}g}{\partial x^{2}}{\rm d}x\right]{\rm d}y\\ & = & \int _{c}^{d}\left[-\int _{a}^{b}f\frac{\partial ^{2}g}{\partial x^{2}}{\rm d}x\right]{\rm d}y\\ & = & -\left\langle f,\frac{\partial ^{2}g}{\partial x^{2}}\right\rangle _{{\rm L}^{2}}, \end{eqnarray*} where integration by parts and homogeneous boundary conditions have been used. Then, eq. (6) becomes   \begin{eqnarray*} \left\langle f,g\right\rangle _{A} & = & \left\langle f,g\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial x},\frac{\partial g}{\partial x}\right\rangle _{{\rm L}^{2}}+\left\langle \frac{\partial f}{\partial y},\frac{\partial g}{\partial y}\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,g\right\rangle _{{\rm L}^{2}}-\left\langle f,\frac{\partial ^{2}g}{\partial x^{2}}\right\rangle _{{\rm L}^{2}}-\left\langle f,\frac{\partial ^{2}g}{\partial y^{2}}\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,\left(1-\nabla ^{2}\right)g\right\rangle _{{\rm L}^{2}}\\ & = & \left\langle f,\left(A^{\dagger }A\right)g\right\rangle _{{\rm L}^{2}}, \end{eqnarray*} where $$A^{\dagger }=\hat{\mathbf {e}}_{0}-\hat{\mathbf {e}}_{1}\frac{\partial }{\partial x}-\hat{\mathbf {e}}_{3}\frac{\partial }{\partial y}$$ so that A†A = 1 − ∇2. 3 SCALED-SOBOLEV PRE-CONDITIONING In this section we will show how the SSIP can be used in the model domain as a pre-conditioner with the steepest-descent method. To see how the new inner product, the SSIP defined in eq. (1), will affect the FWI gradient let us ask the question asked by Neuberger (1997): ‘To what sort of gradient are we led’, if the inner product in eq. (A6) is replaced by the SSIP? That is, we require (as above) that   $${\rm d}E=\left\langle \delta p,g\right\rangle _{{\rm L}^{2}}=\left\langle \delta p,g_{_{S}}\right\rangle _{A}=\left\langle \delta p,\left(A^{\dagger }A\right)g_{_{S}}\right\rangle _{{\rm L}^{2}}\!,$$ (7)where gS is the gradient in the scaled-Sobolev space, g is the conventional gradient used in the Born scale-space approach, and δp is the slowness perturbation. Eq. (7) results in a mapping from the Born scale-space to the scaled-Sobolev space, that is,   $$g_{_{S}}=\left(A^{\dagger }A\right)^{-1}g,$$ (8)where A†A is the scaled-Sobolev pre-conditioner. The fact that gS is smoother than g can be seen by rewriting eq. (7) as   \begin{eqnarray} &&{\left\langle \delta p,g\right\rangle _{{\rm L}^{2}}\!=\!\mu _{0}^{2}\left\langle \delta p,g_{_{S}}\right\rangle _{{\rm L}^{2}}\!+\!\sum _{i=1}^{N=ML}\left(-1\right)^{m_{i}}\mu _{i}^{2}\left\langle \delta p,\frac{\partial ^{2m_{i}}g_{_{S}}}{\partial x_{l_{i}}^{2m_{i}}}\right\rangle _{{\rm L}^{2}}.} \end{eqnarray} (9)The Sobolev inner product ⟨·,·⟩A is as defined in equation (1). Eq. (9) shows that if μ0 ≠ 0 and μi = 0 for i greater than zero, gS would simply be a scaled version of g. However, if μ0 is set to zero then the combined derivatives of $$g_{_{S}}$$ would have to be a scaled version of g, which means that $$g_{_{S}}$$ would have to be very smooth. By choosing appropriate values for μi the smoothness of the scaled-Sobolev gradients can be controlled, and the effective scale of the updates can be controlled. This result allows a scale decomposition that (i) does not require recourse to data domain decompositions based on the Born approximation, and (ii) does not require a complete scale decomposition such as that required in some model-domain smoothing methods. The steepest-descent method applied to the Sobolev gradient requires that the model perturbation $$\delta p_{_{S}}$$ be proportional to the gradient gS, that is,   $$\delta p_{_{S}}=-\alpha g_{_{S}},$$ (10)where α is a positive step length chosen to optimize the convergence. The perturbation in the objective function to first order with the SSIP is thus   \begin{eqnarray} \left({\rm d}E\right)_{S} &=& \left\langle \delta p_{_{S}}\!,g_{S}\right\rangle _{A}=-\alpha ^{-1}\left\langle \delta p_{_{S}},\delta p_{_{S}}\right\rangle _{A}\nonumber\\ &=&-\alpha ^{-1}\left\langle \delta p_{_{S}},\left(A^{\dagger }A\right)\delta p_{_{S}}\right\rangle _{{\rm L}^{2}}. \end{eqnarray} (11)By definition the term in angular brackets in last equality of eq. (11) is positive, therefore the direction of the update represents a descent direction. This also shows that any pre-conditioner must be positive definite in order for the updates to remain in a descent direction. As a special case, if A is defined for three dimensions and first-order derivatives, that is if $$A\equiv \mu _{0}+\mu _{1}{\rm \frac{\partial }{\partial x}+\mu _{2}{\rm \frac{\partial }{\partial y}}+\mu _{3}{\rm \frac{\partial }{\partial z}}}$$, then this implies (using eq. 4) that $$A^{\dagger }A=\mu _{0}^{2}-\mu _{1}^{2}\frac{\partial ^{2}}{\partial x^{2}}-\mu _{2}^{2}\frac{\partial ^{2}}{\partial y^{2}}-\mu _{3}^{2}\frac{\partial ^{2}}{\partial z^{2}}$$. Therefore, for this special case the pre-conditioning becomes   $$g_{_{S}}=\left(\mu _{0}^{2}-\mu _{1}^{2}\frac{\partial ^{2}}{\partial x^{2}}-\mu _{2}^{2}\frac{\partial ^{2}}{\partial y^{2}}-\mu _{3}^{2}\frac{\partial ^{2}}{\partial z^{2}}\right)^{-1}g.$$ (12)The isotropic version of eq. (12) is obtained by setting μ1 = μ2 = μ3, which gives the isotropic SSP   $$g_{_{S}}=\left(\mu _{0}^{2}-\mu _{1}^{2}\nabla ^{2}\right)^{-1}g,$$ (13)in which case the relative importance of the zeroth (background) term and the derivative terms can be controlled by adjusting the parameters μ0 and μ1. The anisotropic case, that is when the scalars for all dimension are not the same (for this special case μ1 ≠ μ2 ≠ μ3), can be useful when the dimensions of the model are very different from each other. The scale factors for the anisotropic case can be chosen to enhance or suppress the desired model dimension. 4 COMPARISON WITH THE GAUSSIAN SMOOTHING OPERATOR In this section we shall compare the effects of the isotropic first-order SSP operator (eq. 13) with a Gaussian smoothing operator. Gaussian smoothing can also be used to separate the spatial scales of the model without recourse to the Born approximation. Because the Gaussian smoothing operator has a well-defined inverse (Ulmer 2010), it is positive definite, which means that it could be used to define a Sobolev norm. However, due to the edge removing nature of Gaussian smoothing, the Gaussian scale separation is complete in the sense that in the limit of strong smoothing all high-wavenumber features is lost after Gaussian smoothing. The Gaussian smoothing operator in two dimensions can be written as   $$w\left(x,z\right)=\left(\frac{\pi }{\sigma ^{2}}\right)e^{-\frac{\pi ^{2}\left(x^{2}+z^{2}\right)}{\sigma ^{2}}},$$ (14)where σ2 is a real constant variance that selects the spatial scales of the image, therefore we shall call it a scale factor here. Convolution of the Gaussian kernel w(x, z) with an image would smooth it and the level of smoothness is decided by the scale factor σ2. This convolution with an image, say m(x, z), can be written in the wavenumber domain as   $$m_{_{G}}\left(k_{x},k_{z}\right)=w\left(k_{x},k_{z}\right)m\left(k_{x},k_{z}\right)=e^{-\sigma ^{2}k^{2}}m\left(k_{x},k_{z}\right),$$ (15)where $$k^{2}\equiv k_{x}^{2}+k_{z}^{2}$$ and w(kx, kz) is the Fourier transform of w(x, z). By comparison, the application of the isotropic SSP of eq. (13) in the wavenumber domain can be written as   $$m_{_{S}}\left(k_{x},k_{z}\right)=\left(\mu _{0}^{2}+\mu _{1}^{2}k^{2}\right)^{-1}m\left(k_{x},k_{z}\right).$$ (16) An image of the Marmousi model will be used as a test image2 to see the effects of Gaussian and SSP smoothing kernels in eqs (15) and (16). The test image and its wavenumber amplitude spectrum are shown in Figs 3(a) and (b), respectively. The SSP operator (eq. 16) with $$\left(\mu _{0}^{2},\mu _{1}^{2}\right)=\left(1,100\right)$$ was applied to the test image; the result and its wavenumber amplitude spectrum are shown in Figs 4(a) and (b), respectively. Fig. 4 shows that while the low-wavenumber features of the test image have been enhanced, the high-wavenumber features have not been lost completely and the reflectors of the Marmousi model are still visible. The application of the Gaussian smoothing kernel on the test image with σ2 = 100 is shown in Fig. 5(a) and the corresponding wavenumber amplitude spectrum in Fig. 5(b). Fig. 5 shows that at this high level of Gaussian smoothing there is virtually no high wavenumber information left in the image. Reducing the Gaussian scale factor reduces the blur in the smooth image, but to retain the high wavenumbers the scale factor has to be drastically reduced. This is shown in Fig. 6 where the scale factor is progressively reduced but the reflectors in the Marmousi model remain invisible. Figure 3. View largeDownload slide (a) A test image and (b) its wavenumber amplitude spectrum. Figure 3. View largeDownload slide (a) A test image and (b) its wavenumber amplitude spectrum. Figure 4. View largeDownload slide (a) The SSP applied to the test image (Fig. 3), with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,100\right)$$ and (b) its wavenumber amplitude spectrum. Figure 4. View largeDownload slide (a) The SSP applied to the test image (Fig. 3), with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,100\right)$$ and (b) its wavenumber amplitude spectrum. Figure 5. View largeDownload slide (a) Gaussian smoothing applied to the test image (Fig. 3) with σ2 = 100 and (b) its wavenumber amplitude spectrum. The small non-zero amplitude region around the origin in (b) shows the almost complete loss of high wavenumber information after Gaussian smoothing. Figure 5. View largeDownload slide (a) Gaussian smoothing applied to the test image (Fig. 3) with σ2 = 100 and (b) its wavenumber amplitude spectrum. The small non-zero amplitude region around the origin in (b) shows the almost complete loss of high wavenumber information after Gaussian smoothing. Figure 6. View largeDownload slide Gaussian smoothing applied to the test image (Fig. 3), with (a) σ2 = 50, (b) σ2 = 10, (c) σ2 = 1 and (d) σ2 = 0.5. Decreasing the scale factor of the Gaussian smoothing reduces the weights of the low-wavenumber features, however the high wavenumbers in the image are lost. To preserve the edges, the scale factor has to approach zero in which case the low-wavenumber features are no longer enhanced. Figure 6. View largeDownload slide Gaussian smoothing applied to the test image (Fig. 3), with (a) σ2 = 50, (b) σ2 = 10, (c) σ2 = 1 and (d) σ2 = 0.5. Decreasing the scale factor of the Gaussian smoothing reduces the weights of the low-wavenumber features, however the high wavenumbers in the image are lost. To preserve the edges, the scale factor has to approach zero in which case the low-wavenumber features are no longer enhanced. The difference in smoothing between SSP and Gaussian kernels can be seen in Fig. 7. The SSP kernel (Fig. 7a) has high amplitudes close to the origin, and it remains non-zero up to a large wavenumber radius, while the Gaussian kernel Fig. 7(b) drops down to zero much faster. If σ2 is decreased in an attempt to enhance the higher wavenumbers the Gaussian bell surface becomes broader (Fig. 7c), but the ratio of low to high wavenumber amplitude remains larger in the Gaussian kernel than in the SSP. Furthermore, the tail of the Gaussian kernel is much smaller than the SSP even with small smoothing as shown in Fig. 7(d). Figure 7. View largeDownload slide Direct comparison of the smoothing kernels: (a) Sobolev pre-conditioning with $$(\mu _0^2,\mu _1^2)=(1,0.1),$$ (b) Gaussian smoothing with σ2 = 0.1 and (c) Gaussian smoothing with σ2 = 0.035. (d) Profiles extracted from (a) to (c) at zero vertical wavenumber, where the blue line is from (a), the green line is from (b) and the red line is from (c). Decreasing the Gaussian smoothing in an attempt to enhance the high wavenumbers broadens the Gaussian bell-shaped curve (note the shift from the green to red lines), but the higher wavenumbers are still not recovered to the level of SSP (blue line). Figure 7. View largeDownload slide Direct comparison of the smoothing kernels: (a) Sobolev pre-conditioning with $$(\mu _0^2,\mu _1^2)=(1,0.1),$$ (b) Gaussian smoothing with σ2 = 0.1 and (c) Gaussian smoothing with σ2 = 0.035. (d) Profiles extracted from (a) to (c) at zero vertical wavenumber, where the blue line is from (a), the green line is from (b) and the red line is from (c). Decreasing the Gaussian smoothing in an attempt to enhance the high wavenumbers broadens the Gaussian bell-shaped curve (note the shift from the green to red lines), but the higher wavenumbers are still not recovered to the level of SSP (blue line). 5 EXAMPLES The examples in this section were obtained by performing FWI in the Laplace–Fourier domain (Pratt et al.1998; Sirgue & Pratt 2004; Kamei et al.2013). In this domain the frequency is considered to be a complex variable ω = ωr + i/τ, where the real part (ωr) is the actual circular frequency and the imaginary part is the reciprocal of a time constant (τ). Kamei et al. (2013) showed that during FWI the use of finite values of the time constant τ is equivalent to time-damping the input data (down weighting the later arrivals). We employed this approach here in creating synthetic time-damped frequency domain input (‘observed’) data for our inversions, and we then used the same single frequency, τ = 1 s time constant during the inversions. The method of steepest-descent was used with Brent's method for a line search (Brent 1973; Press et al.1993). We performed two sets of FWIs: In the first set the FWIs with SSP and Gaussian pre-conditioning were compared, and in the second set an exponential offset-dependent gain factor was applied to the residuals throughout both the SSP and the Gaussian pre-conditioned FWIs (after Brenders & Pratt 2007a). This amplitude gain compensates for the loss in amplitude of the increasingly delayed first arrivals at large offsets due to time damping, and was implemented using an exponential function of offset, exp (offset/(τ · vmin)), where vmin is a parameter with units of velocity used to scale the gain. For all inversions vmin was set equal to 2000 m/s. In the second set of inversions, the exponential offset gain effectively enhances the response from the later arrivals, which include both long offset diving waves and deep reflection events. Thus, with an offset-dependent amplitude gain the inversion responds preferentially to the deeper parts of the model, even in the early stages of FWIs. This implies that synthetic reflections need to be generated during the early stages of the inversion, so that the computed low-wavenumber updates remain in a direction that is consistent with the data. Therefore the second set of inversions require that the sharper features in the gradient must be retained during the earliest iterations. We thus expect that the low-wavenumber updates will be constrained by reflected arrivals in the SSP, whereas the Gaussian pre-conditioner will fail to produce synthetic reflections due to the complete separation of the scales. We carried out all tests with our own version of the Marmousi model (after Brougois et al. (1990); Fig. 8a). We implemented the SSP tests using the zeroth- and first-order terms in the isotropic SSP. In other words, the differential operator in eq. (13) was a spatial gradient operator. The SSP inversions were compared with inversions in which we used a Gaussian pre-conditioning scheme. The starting velocity for all inversions was a 1-D linear function of depth z, shown in Fig. 8(b). The model grid spacing was 8 m and an extra layer, 160 m thick, was added on top with 1500 m/s velocity to suppress free surface multiples. The velocity was fixed at 1500 m s−1 down to a depth of 174 m throughout the inversions. There were 364 sources and receivers, both with 24 m spacing, from (x, z) = (160, 160) m to (x, z) = (8872, 160) m. A Küpper wavelet (Küpper 1958), shown in Fig. 9(a), was used as the source time function, and synthetic ‘observed’ data were generated at 4, 6 and 8 Hz using the Laplace–Fourier domain modelling in the true model (Fig. 8a) with a time-damping factor of τ = 1 s. All of the inversions were performed for the same frequencies simultaneously (i.e. no sweeping from low to high frequencies was used), the same τ = 1 s damping, and the true source wavelet was used during the inversions. The synthetic observed data at the these inversion frequencies were obtained directly from the Laplace–Fourier domain modelling. We also generated time domain data for the purposes of visualization, using 250 equally spaced frequencies from 0.25 to 62.5 Hz; a shot gather after Fourier synthesis is shown in Fig. 9(c). Figure 8. View largeDownload slide (a) The true velocity model used in the inversion tests and (b) the linear 1-D starting velocity model. Vertical lines in the figures represent the locations where velocity profiles will be compared. Figure 8. View largeDownload slide (a) The true velocity model used in the inversion tests and (b) the linear 1-D starting velocity model. Vertical lines in the figures represent the locations where velocity profiles will be compared. Figure 9. View largeDownload slide (a) The Küpper source wavelet used in the inversion tests, (b) its amplitude spectrum and (c) a representative synthetic observed shot gather at x = 2536 m. Figure 9. View largeDownload slide (a) The Küpper source wavelet used in the inversion tests, (b) its amplitude spectrum and (c) a representative synthetic observed shot gather at x = 2536 m. The ray paths covering the true model from sources at 2, 4, 6 and 8 km are shown in Figs 10(a)–(d), respectively. The ray paths demonstrate the sparse refraction coverage of the region below 1.5 km depth. This lack of refraction coverage, and the fact that we used time damping result in weaker updates within the deeper parts of the model. However, this loss of information from the deeper part of the model is partly compensated for in the second set of tests when we use an offset-dependent exponential gain. Figure 10. View largeDownload slide Representative ray paths in the true (Marmousi) model with sources located at: (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The receivers are located along the horizontal white line at z = 160 m. Figure 10. View largeDownload slide Representative ray paths in the true (Marmousi) model with sources located at: (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The receivers are located along the horizontal white line at z = 160 m. During the inversions we noted that the descent algorithm occasionally ‘stalled’, meaning that either the relative reduction in the objective function was less than 0.001 per cent for two or more iterations consecutively or a negative velocity was obtained in the updates. When this happened, we used this as a trigger to reduce the scale factors in the pre-conditioning equations, thereby reducing the spatial scales of the updates (and introducing more high wavenumbers). This strategy for switching the model spatial scales was used for both the SSP scheme and for the Gaussian pre-conditioning scheme. For SSP $$\mu _{0}^{2}=1$$ was kept constant in eq. (13) for all iterations in both sets of inversion tests, and only $$\mu _{1}^{2}$$ was varied. In order to keep the SSP dimensionless, the dimension of μ1 must length (we use km for units) and μ0 must be dimensionless. Similarly, for the Gaussian pre-conditioner the dimension of σ is length in km. The parameter μ1 was chosen to be 10 km, that is $$\mu _{1}^{2}=100\,{\rm km}^2$$ initially. Physically this means that the smoothing length is 10 km which is approximately the size of the larger (horizontal) axis, which would give low wavenumber updates with a dominant wavelength equal to the larger model dimension. Henceforth, the units of μi are kmi, and similarly units for σ2 are km2. 5.1 Inversions without offset-dependent exponential gain The full schedule of scale factors for this set of inversions is provided in Table 1. Both SSP and Gaussian pre-conditioned inversions were started with a scale factor of 100, that is $$\left(\mu _{0}^{2},\mu _{1}^{2}\right)=\left(1,100\right)$$ for SSP and σ2 = 100 for Gaussian pre-conditioning. The Gaussian inversion with σ2 = 100 stalled (producing a negative velocity value in the update) at the first iteration, so we restarted with a scale factor of 50 as shown in Table 1. Although in general higher scale factors enhance the low-wavenumber scales in the updates, since the SSP combines the low- and high-wavenumber scales in the updates the scales have to be switched less often in the SSP method than the Gaussian. Fig. 11 depicts the gradient in the SSP inversion at the 173rd iteration, before and after pre-conditioning, illustrating the emphasis on low wavenumbers when $$\mu _{1}^{2}=50$$, but also illustrating that while the low-wavenumber features are being updated, the sharper features start to appear as well. This wide band of wavenumbers helps to keep the SSP method in the zone of attraction of the global minimum as the scale factor is subsequently reduced. Fig. 12 depicts the equivalent result (for the seventh iteration) using the Gaussian smoothing method, illustrating that all information on the sharp features (i.e. the images of the scattering structure) is lost due to the nearly complete scale separation of the Gaussian smoother. Figure 11. View largeDownload slide An image of the objective function gradient (a) before and (b) after SSP (pre-conditioning) with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,50\right)$$, at the 173rd iteration in the SSP inversion. No offset-dependent exponential gain was applied to the data in this test. Figure 11. View largeDownload slide An image of the objective function gradient (a) before and (b) after SSP (pre-conditioning) with $$\left(\mu _0^2,\mu _1^2\right)=\left(1,50\right)$$, at the 173rd iteration in the SSP inversion. No offset-dependent exponential gain was applied to the data in this test. Figure 12. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 50, shown at the 173rd iteration. No offset-dependent exponential gain was applied to the data in this test. Figure 12. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 50, shown at the 173rd iteration. No offset-dependent exponential gain was applied to the data in this test. Table 1. Iterations and smoothing parameters for SSP and Gaussian pre-conditioning FWIs when no offset-dependent exponential gain was applied. SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–72  1  100  1–7  50  73–307  1  50  8–28  10  308–650  1  10  29–107  1        108–118  0.5        119–277  0.1        278–650  0  SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–72  1  100  1–7  50  73–307  1  50  8–28  10  308–650  1  10  29–107  1        108–118  0.5        119–277  0.1        278–650  0  View Large A total of 650 iterations were performed for both SSP and Gaussian inversions and the final results are plotted in Figs 13(a) and (b), respectively. While the Gaussian inversion managed to recover the background velocity, the sharp features were not recovered well, especially in the deeper parts of the model. This is because when the low-wavenumber features were being updated, the inversion was not constrained by the small angle scattering events (present in the data at all iterations). The SSP inversion on the other hand, included the sharp features alongside the background updates, which constrained the gradient direction such that it was consistent with all scattering response in the data at every iteration. A comparison of the velocity profiles in the final inversion results at 2, 4, 6 and 8 km is shown in Fig. 14. The linear starting velocity model also presented a significant bulk shift with respect to the true velocity model in some places, for example, at 2 km (Fig. 14a). Although both the SSP and Gaussian inversions corrected for these bulk shifts, the Gaussian inversion results are much more erratic below 1.5 km depth. Figure 13. View largeDownload slide Final inversion results without offset-dependent exponential gain after 650 iterations for (a) the SSP inversion and (b) for Gaussian pre-conditioning. In the final stages minimum values of the SSP scalars $$(\mu _0^2,\mu _1^2)=(1,10),$$ and the Gaussian smoothing scalar σ2 = 0 were used. Figure 13. View largeDownload slide Final inversion results without offset-dependent exponential gain after 650 iterations for (a) the SSP inversion and (b) for Gaussian pre-conditioning. In the final stages minimum values of the SSP scalars $$(\mu _0^2,\mu _1^2)=(1,10),$$ and the Gaussian smoothing scalar σ2 = 0 were used. Figure 14. View largeDownload slide Velocity profiles from the final inversion results in Fig. 13, at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity model. Solid and dashed blue lines are SSP and Gaussian pre-conditioning results, respectively. Figure 14. View largeDownload slide Velocity profiles from the final inversion results in Fig. 13, at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity model. Solid and dashed blue lines are SSP and Gaussian pre-conditioning results, respectively. Fig. 15 shows the behaviour of the data misfit (i.e. the squared residuals) as a function of iteration, for the SSP and Gaussian inversions. The full curve is given in Fig. 15(a) and zoomed version are given in Figs 15(b)–(d). The SSP residuals remain smaller than the Gaussian residuals until iteration number 278. At iteration 278th in the Gaussian inversion the scale factor is set to zero, so that from then on no pre-conditioning is applied. While the residual is lower, the scatterers are not imaged correctly. Figure 15. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method and for the Gaussian method, for inversions without offset-dependent exponential gain, showing (a) all iterations from 1 to 650, (b) iterations 1–20, (c) iterations 20–200 and (d) iterations 200–650. Figure 15. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method and for the Gaussian method, for inversions without offset-dependent exponential gain, showing (a) all iterations from 1 to 650, (b) iterations 1–20, (c) iterations 20–200 and (d) iterations 200–650. 5.2 Inversions with offset-dependent exponential gain Following our initial inversion results, we then tested a full set of inversions in which the data were pre-conditioned through the application of an offset-dependent exponential gain. This strategy has the effect of boosting the deeper parts of the gradient, because the offset gain enhances far offset arrivals in the data that contain both the diving wave response (early arrivals) and the reflection response from deeper reflectors (later arrivals). The reflected energy at late times and far offsets arises from small scattering angles3 from deeper parts of the model. For the SSP inversions this means that the low-wavenumber updates potentially benefit from the contributions of the sharper features in the gradient due to additional low wavenumber contributions arising from second-order scattering, and as a result the SSP scales have to be modified more often than the previous set of SSP inversions carried out without offset-dependent gain. Table 2 shows the scale factors used for SSP and Gaussian inversions for these inversions. The gradients at the 217th iteration before and after SSP are shown in Fig. 16 for a scale factor of μ1 = 10. Similarly the gradient before and after Gaussian pre-conditioning is shown in Fig. 17 for a scale factor of σ = 10. Figure 16. View largeDownload slide An image of the objective function gradients (a) before, and (b) after SSP (pre-conditioning) with $$(\mu _0^2,\mu _1^2)=(1,10)$$, at the 217th iteration in the SSP inversion. Offset-dependent exponential gain was applied during this test. Figure 16. View largeDownload slide An image of the objective function gradients (a) before, and (b) after SSP (pre-conditioning) with $$(\mu _0^2,\mu _1^2)=(1,10)$$, at the 217th iteration in the SSP inversion. Offset-dependent exponential gain was applied during this test. Figure 17. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 10, shown at the 11th iteration. Figure 17. View largeDownload slide The objective function gradients in the inversion with Gaussian pre-conditioning (a) before and (b) after Gaussian smoothing with σ2 = 10, shown at the 11th iteration. Table 2. Iterations and smoothing parameters for SSP and Gaussian pre-conditioning FWIs when an offset-dependent exponential gain was applied. SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–31  1  100  1–11  10  32–95  1  50  12–41  1  96–216  1  10  42–58  0.5  217–325  1  1  59–298  0.1  326–365  1  0.1  299–365  0  SSP  Gaussian  iterations  $$\mu _0^2$$  $$\mu _1^2$$ (km2)  iterations  σ2 (km2)  1–31  1  100  1–11  10  32–95  1  50  12–41  1  96–216  1  10  42–58  0.5  217–325  1  1  59–298  0.1  326–365  1  0.1  299–365  0  View Large The final inversion results (after 365 iterations) from the SSP inversion and Gaussian pre-conditioned inversions using offset-dependent gains are compared in Fig. 18. The 1-D profiles of the velocities are compared in Fig. 19. The offset-dependent exponential gain applied meant that events from deeper reflectors were weighted higher in the data, and we suggest that the Gaussian pre-conditioned inversion failed because the background updates were not constrained by the sharp features. Figure 18. View largeDownload slide Final inversion results without offset-dependent exponential gain, after 365 iterations (a) for SSP and (b) for Gaussian pre-conditioning inversions. In the final stages minimum values of the SSP scalars $$\left(\mu _0^2,\mu _1^2\right)=\left(1,0.1\right)$$, and the Gaussian smoothing scalar σ2 = 0.1 were used. Figure 18. View largeDownload slide Final inversion results without offset-dependent exponential gain, after 365 iterations (a) for SSP and (b) for Gaussian pre-conditioning inversions. In the final stages minimum values of the SSP scalars $$\left(\mu _0^2,\mu _1^2\right)=\left(1,0.1\right)$$, and the Gaussian smoothing scalar σ2 = 0.1 were used. Figure 19. View largeDownload slide Velocity profiles in Fig. 18 at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity. Solid and dashed blue lines are the SSP and Gaussian pre-conditioning results, respectively. Figure 19. View largeDownload slide Velocity profiles in Fig. 18 at (a) 2 km, (b) 4 km, (c) 6 km and (d) 8 km. The red line depicts the true velocity model and the green line depicts the starting velocity. Solid and dashed blue lines are the SSP and Gaussian pre-conditioning results, respectively. Fig. 20 shows the behaviour of the residual versus iteration number for both the inversions in which offset-dependent exponential gain was applied. As before the residual curve for the Gaussian inversion drops rapidly after the scale factor has been reduced. For the SSP inversion there are dips in the residual curve, caused by the fact that the scatterers being updated simultaneously and continuously along with the background. Figure 20. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method (solid curves) and for the Gaussian method (dashed curves), for inversions with offset-dependent exponential gain. Iterations from 1 to 365 for both SSP and Gaussian pre-conditioning are shown. Figure 20. View largeDownload slide Comparison of squared residuals versus iteration history for the SSP method (solid curves) and for the Gaussian method (dashed curves), for inversions with offset-dependent exponential gain. Iterations from 1 to 365 for both SSP and Gaussian pre-conditioning are shown. 6 CONCLUSION We have defined an inner product, the SSIP based on a Sobolev norm, which includes a measure of the functions along with their derivatives. The scale factors in the definition of the SSIP allow control over how much a given order of derivative can contribute to the inner product. The SSIP leads to a pre-conditioner, the SSP that allows the low wavenumbers to be enhanced in an edge-preserving manner. The SSP maps an FWI gradient onto the scaled-Sobolev space, which requires that the functions are differentiable at least as many times as the highest order of SSP. This differentiability requirement implies that the scaled-Sobolev space consists of smoother function than those in a general Hilbert space equipped with an L2 norm. Because of the control we have over the scale factors, the SSP is suitable for separating the model spatial scales in FWI. Spatial derivatives of the velocity model have also been constrained using Tikhonov regularization in FWI to achieve a direction dependent smoothing (e.g. Sinoquet 2005; Versteeg & Symes 1993; Lailly & Sinoquet 1996). In Appendix B, we show the close relation between a Sobolev gradient and Tikhonov regularization. We have shown that the nonlinearity in the FWI objective can be can be mitigated by using SSP, which separates the spatial scales of the model updates based on the spatial derivatives of the subsurface model. The SSP updates the background such that the subsurface model is updated in a direction that is consistent with the small angle scattering responses contained in the data. These constrained updates are the result of the edge preserving nature of the SSP. The model scales can be chosen by choosing the scale factors in the SSP, which can be set differently for all dimensions and all orders of the derivatives. This makes SSP a convenient tool for scale separation. If the model scales are set equal to each other for all dimensions in a given order of derivative, the pre-conditioning is isotropic. Anisotropic form of SSP can be useful in cases where model dimensions are very different from each other. The synthetic examples with the Marmousi model show that the SSP is more robust compared to the Gaussian smoothing, which does not preserve the edges in the model. In the example with offset-dependent gain, deeper parts of the gradient were enhanced at all iterations for both SSP and Gaussian pre-conditioned inversions, however the Gaussian pre-conditioned inversion was not constrained by the sharp features in the image. Therefore the SSP inversion performed better when the data were pre-conditioned using an exponential offset-dependent gain, leading to the desired effect of improving the result in deeper parts of the model. The Gaussian pre-conditioned inversion result on the other hand, was further away from the true solution than the Gaussian result without the offset based gain. ACKNOWLEDGEMENTS We would like to thank Dr. William W. Symes and Dr. Kris Innanen for their comments and suggestions. Footnotes 1 The reflector positions were known a priori and were not part of the velocity model through which the wavefield was propagated. 2 The image we use is actually an FWI gradient, however here it is used as an arbitrary image that contains a band of wavenumbers. Any other image could also be used to test the point made in this section. 3 Small scattering angle here means less than critical angle, in which case the corresponding response in the data consists of specular reflections. 4 The unitary operator 1 would be replaced by an identity matrix I if a discrete representation is being used in which case ∇ (the spatial gradient operator) would be replaced by its matrix equivalent. REFERENCES Blazek K.D., Stolk C., Symes W.W., 2013. A mathematical framework for inverse wave problems in heterogeneous media, Inverse Problems  29( 6), 065001, 37. Google Scholar CrossRef Search ADS   Brenders A., Pratt R., 2007a. Efficient waveform tomography for lithospheric imaging: implications for realistic, two-dimensional acquisition geometries and low-frequency data, Geophys. J. Int.  168( 1), 152– 170. Google Scholar CrossRef Search ADS   Brenders A.J., Pratt R.G., 2007b. Full waveform tomography for lithospheric imaging: results from a blind test in a realistic crustal model, Geophys. J. Int.  168( 1), 133– 151. Google Scholar CrossRef Search ADS   Brent R.P., 1973. Algorithms for Minimization without Derivatives  Prentice-Hall, Includes index. Brossier R., Operto S., Virieux J., 2015. Velocity model building from seismic reflection data by full-waveform inversion, Geophys. Prospect.  63( 2), 354– 367. Google Scholar CrossRef Search ADS   Brougois A., Bourget M., Lailly P., Poulet M., Ricarte P., Versteeg R., 1990. Marmousi, model and data, in EAEG Workshop-Practical Aspects of Seismic Data Inversion . Calder J., Mansouri A., Yezzi A., 2011. New possibilities in image diffusion and sharpening via high-order Sobolev gradient flows, J. Math. Imaging Vis.  40( 3), 248– 258. Google Scholar CrossRef Search ADS   Consolvo B.P., Zuberi M.A.H., Pratt R.G., Cary P.W., 2017. FWI with scaled-Sobolev preconditioning applied to short-offset vibroseis field data, in 79th EAGE Conference and Exhibition 2017 . Evans L., 2010. Partial Differential Equations  Graduate studies in mathematics, American Mathematical Society. Google Scholar CrossRef Search ADS   Fu L., Symes W.W., 2017. An adaptive multiscale algorithm for efficient extended waveform inversion, Geophysics  82( 3), R183– R197. Google Scholar CrossRef Search ADS   Guitton A., Ayeni G., Díaz E., 2012. Constrained full-waveform inversion by model reparameterization, Geophysics  77( 2), R117– R127. Google Scholar CrossRef Search ADS   Hale D., 2007. Local dip filtering with directional Laplacians, CWP Proj. Rev.  567 91– 102. Kamei R., Pratt R.G., Tsuji T., 2013. On acoustic waveform tomography of wide-angle OBS data–strategies for pre-conditioning and inversion, Geophys. J. Int.  194( 2), 1250– 1280. Google Scholar CrossRef Search ADS   Küpper F.J., 1958. Theoretische untersuchung über die mehrfachaufstellung von geophonen., Geophys. Prospect.  6( 3), 194– 256. Google Scholar CrossRef Search ADS   Lailly P., 1983. The seismic inverse problem as a sequence of before stack migrations, in Conference on Inverse Scattering–Theory and Application  pp. 206– 220, Society for Industrial and Applied Mathematics, Expanded Abstracts. Lailly P., Sinoquet D., 1996. Smooth velocity models in reflection tomography for imaging complex geological structures, Geophys. J. Int.  124( 2), 349– 362. Google Scholar CrossRef Search ADS   Neuberger J., 1997. Sobolev Gradients and Differential Equations  no. 1670 in Lecture Notes in Computer Science, Springer. Google Scholar CrossRef Search ADS   Perona P., Malik J., 1990. Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell.  12( 7), 629– 639. Google Scholar CrossRef Search ADS   Plessix R.-E., 2006. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications, Geophys. J. Int.  167( 2), 495– 503. Google Scholar CrossRef Search ADS   Pratt R.G., 2008. Waveform tomography-successes, cautionary tales, and future directions, in 70th EAGE Conference and Exhibition-Workshops and Fieldtrips . Pratt R.G., Shin C., Hick G.J., 1998. Gauss-Newton and full Newton methods in frequency-space seismic waveform inversion, Geophys. J. Int.  133( 2), 341– 362. Google Scholar CrossRef Search ADS   Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P., 1993. Numerical Recipes in FORTRAN; The Art of Scientific Computing  2nd edn, Cambridge Univ. Press. Renka R., 2009. Image segmentation with a Sobolev gradient method, Nonlinear Anal. Theory Methods Appl.  71( 12), e774 – e780. Renka R.J., 2013. Nonlinear least squares and Sobolev gradients, Appl. Numer. Math.  65 91– 104. Google Scholar CrossRef Search ADS   Rickett J., 2000. Traveltime sensitivity kernels: Banana-doughnuts or just plain bananas?, Stanford Exploration Proj. Rep.  103 63– 71. Sava P., Biondi B., 2004a. Wave-equation migration velocity analysis. I. Theory, Geophys. Prospect.  52( 6), 593– 606. Google Scholar CrossRef Search ADS   Sava P., Biondi B., 2004b. Wave-equation migration velocity analysis. II. Subsalt imaging examples, Geophys. Prospect.  52( 6), 607– 623. Google Scholar CrossRef Search ADS   Sinoquet D., 2005. Modeling A Priori Information on the Velocity Field in Reflection Tomography  pp. 591– 594, Society of Exploration Geophysicists. Sirgue L., 2003. Inversion de la Forme D'onde Dans Le Domaine Fréquentiel de Données Sismiques Grands Offsets, PhD thesis  Université Paris and Queen's University. Sirgue L., Pratt R.G., 2004. Efficient waveform inversion and imaging: a strategy for selecting temporal frequencies, Geophysics  69( 1), 231– 248. Google Scholar CrossRef Search ADS   Stolk C.C., 2001. On the modeling and inversion of seismic data, PhD thesis  Universiteit Utrecht. Symes W.W., 1991. A differential semblance algorithm for the inverse problem of reflection seismology, Comput. Math. Appl.  22( 4), 147– 178. Google Scholar CrossRef Search ADS   Symes W.W., Kern M., 1994. Inversion of reflection seismograms by differential semblance analysis: algorithm structure and synthetic examples, Geophys. Prospect.  42( 6), 565– 614. Google Scholar CrossRef Search ADS   Ulmer W., 2010. Inverse problem of linear combinations of Gaussian convolution kernels (deconvolution) and some applications to proton/photon dosimetry and image processing, Inverse Probl.  26( 8), 085002, 26. Google Scholar CrossRef Search ADS   Versteeg R., Symes W., 1993. Geometric constraints on seismic inversion, in SEG Technical Program Expanded Abstracts 1993  pp. 595– 598, Society of Exploration Geophysicists. Virieux J., Operto S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics  74( 6), WCC1– WCC26. Google Scholar CrossRef Search ADS   Wu R., Toksöz M., 1987. Diffraction tomography and multisource holography applied to seismic imaging, Geophysics  52( 1), 11– 25. Google Scholar CrossRef Search ADS   Wu Z., Alkhalifah T., 2015. An improved full waveform inversion procedure based on scattering angle enrichment, in 77th EAGE Conference and Exhibition 2015 . Zhang Y., Sun J., 2009. Practical issues of reverse time migration: true amplitude gathers, noise removal and harmonic-source encoding, in Beijing 2009 International Geophysical Conference and Exposition  Beijing, China, 24–27 April 2009, pp. 204– 204, Society of Exploration Geophysicists. Zhou W., Brossier R., Operto S., Virieux J., 2015. Full waveform inversion of diving & reflected waves for velocity model building with impedance inversion based on scale separation, Geophys. J. Int.  202( 3), 1535– 1554. Google Scholar CrossRef Search ADS   Zuberi M.A.H., Pratt R.G., 2016. Mitigating non-linearity in full waveform inversion by scaled-Sobolev pre-conditioning, in 78th EAGE Conference and Exhibition 2016 . Zuberi M.A.H., Pratt R., 2017. Mitigating cycle skipping in full waveform inversion by using a scaled-Sobolev objective function, in 79th EAGE Conference and Exhibition 2017 . APPENDIX A: THE SOBOLEV GRADIENT IN FWI In this appendix we start by looking at how the Sobolev norm can be used in FWI by changing the definition of the gradient vector. For a conventional definition of the gradient, a change in the objective function E(p) due to a perturbation δp in the inversion parameters can be written as a discrete inner product, that is a vector dot product of a row vector (∇pE)T and a column vector of parameter perturbations δp,   \begin{eqnarray} {\rm d}E & =\left(\nabla _{p}E\right)^{T}\left(\delta \mathbf {p}\right)\!, \end{eqnarray} (A1)where ∇p is the conventional gradient with respect to the inversion parameters, that is, the first partial derivatives of the FWI objective function with respect to each of the inversion parameters (as in the box above). The FWI gradient ∇pE contains a range of spatial scales (wavenumbers) according to the frequencies and scattering angles contained in the observed data. We propose a criterion for the separation of those scales in which the sharp features of the slowness model are represented by the components of the spatial derivatives of the inversion parameters. This means that the scales are no longer fully independent, however we will introduce a modified definition of the gradient that separates out the contributions of the various scales. A1 A Sobolev gradient using matrix representation For discretely sampled distributions of model parameters, we assume a vector representation of the parameters p, and we build a modified definition of the gradient by augmenting the gradient vector with additional first-derivative components according to the number of spatial dimensions, that is, in three dimensions the augmented gradient using first-order spatial derivatives is given by   $$\nabla _{A}E=\left(\begin{array}{c}\nabla _{S}E\\ \partial \left(\nabla _{S}E\right)/\partial x\\ \partial \left(\nabla _{S}E\right)/\partial y\\ \partial \left(\nabla _{S}E\right)/\partial z \end{array}\right),$$ (A2)where ∇SE is a new Sobolev gradient (yet to be determined). This introduces additional subspaces into the search that represent the sharp features of the Sobolev gradient. This new definition of the gradient allows us to explicitly seek additional reductions in the L2 norm of the spatial derivatives of the gradient, without changing the definition of the objective functional. To proceed we now allow a perturbation with additional components in these subspaces defined in the same manner, and we calculate the resulting change in the objective function using the vector dot product   $${}{\rm d}E=\left(\begin{array}{c}\nabla _{S}E\\ \partial \left(\nabla _{S}E\right)/\partial x\\ \partial \left(\nabla _{S}E\right)/\partial y\\ \partial \left(\nabla _{S}E\right)/\partial z \end{array}\right)^{T}\left(\begin{array}{c}\delta \mathbf {p}\\ \partial \left(\mathbf {\delta p}\right)/\partial x\\ \partial \mathbf {\left(\delta p\right)}/\partial y\\ \partial \mathbf {\left(\delta p\right)}/\partial z \end{array}\right),$$ (A3)which can be expanded and simplified to yield   \begin{eqnarray} {}{\rm d}E &=& \left(\nabla _{S}E\right)^{T}\delta \mathbf {p}\!+\!\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial x}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial x}\!+\!\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial y}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial y}\nonumber\\ &&+\,\,\frac{\partial \left(\nabla _{S}E\right)^{T}}{\partial z}\frac{\partial \left(\delta \mathbf {p}\right)}{\partial z} \end{eqnarray} (A4a)  \begin{eqnarray} \hphantom{{}{\rm d}E } &=&\left(\nabla _{S}E\right)^{T}\delta \mathbf {p}-\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial x^{2}}\nonumber\\ &&-\,\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial y^{2}}-\left(\nabla _{S}E\right)^{T}\frac{\partial ^{2}\delta \mathbf {p}}{\partial z^{2}} \end{eqnarray} (A4b)  \begin{eqnarray} \hphantom{{}{\rm d}E }=\left(\nabla _{S}E\right)^{T}\left(\mathbf {I}-\nabla ^{2}\right)\delta \mathbf {p}, \end{eqnarray} (A4c) where ∇2 is the spatial Laplacian operator. The row and column vector multiplications in eq. (A4a) represent summations over the spatial domain (i.e. discrete integration). Integration by parts has been used to transfer the spatial derivatives of the gradient onto the perturbation term (resulting in the appearance of a negative sign). Following Neuberger (1997), we seek the new gradient ∇SE such that for a given δp, the change in the objective function according to eq. (A4c) should be the same as the one given by eq. (A1). By comparing eqs (A1) and (A4c) we find an expression for the Sobolev gradient   $$\nabla _{S}E=\left(\mathbf {I}-\nabla ^{2}\right)^{-1}\nabla _{p}E.$$ (A5) To summarize, we have introduced an augmented definition of the gradient that contains a range of spatial scales, and we have augmented the perturbations to contain the corresponding scales. This changes, or pre-conditions the conventional gradient of the objective function using the inverse of the operator (I − ∇2) to yield the Sobolev gradient. We shall further interpret this result at the bottom of the next section. An alternate method for arriving at this expression is given in Appendix B. A2 A Sobolev gradient using functional representation For continuous distributions of inversion parameters p(x), the first-order perturbation in the objective function $${\rm {\rm }{d}}E$$ due to a change in parameters δp(x) is given by the L2 inner product   \begin{eqnarray} {\rm {d}}E & =\int _{\Omega }\delta p\left(x\right){\rm \left(\nabla _{p}E\right)}\left(x\right){\rm {d}}^{3}x\equiv \int _{\Omega }\delta p\left(x\right)g\left(x\right){\rm {d}}^{3}x, \end{eqnarray} (A6)where (∇pE)(x); ≡ g(x) is the gradient (the functional derivative) of the functional E with respect to p(x), x ≡ (x, y, z) and Ω represents the volume of integration. For g(x) and δp(x) with given norms $$\left(\int _{\Omega }\left(g\left(x\right)\right)^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$ and $$\left(\int _{\Omega }\left(\delta p\left(x\right)\right)^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$, dE is a maximum when g(x) and δp(x) are proportional to each other. This gives the method of steepest descent. Note that the norm is the square root of the inner product (defined in eq. A6) of a function with itself. This means that the functions g(x) and p(x) are elements of a space of functions that has an inner product defined by eq. (A6) and a norm given by $$\left(\int _{\Omega }\left|f\left(x\right)\right|^{2}{\rm {d}}^{3}x\right)^{\frac{{1}}{2}}$$. By assuming completeness, this space becomes a Hilbert space. The definition of a norm of a function in this space allows functions that are not differentiable. If the definition of the inner product is modified so that the first derivative of a function is also considered, the allowed functions include only those that are at least once differentiable, that is, non-differentiable functions are excluded. Such a space is a Sobolev space, and as an example the inner product could be defined according to a continuous version of eq. (A3), in which case eq. (A6) becomes   \begin{eqnarray} {\rm d}E & =\int _{\Omega }\left[\hat{\mathbf {e}}_{0}g\left(x\right)+\nabla g\left(x\right)\right]\left[\hat{\mathbf {e}}_{0}\delta p\left(x\right)+\nabla \left(\delta p\left(x\right)\right)\right]{\rm d}^{3}x, \end{eqnarray} (A7)where ∇ (used without subscripts) is the spatial gradient, that is $$\nabla \equiv \hat{\mathbf {e}}_{1}\frac{\partial }{\partial x_{1}}+\hat{\mathbf {e}}_{2}\frac{\partial }{\partial x_{2}}+\hat{\mathbf {e}}_{3}\frac{\partial }{\partial x_{3}}$$. In eq. (A7) we have introduced a new unit vector $$\hat{\mathbf {e}}_{0}$$ to represent the original, background slownesses, now supplemented by the three Cartesian unit vectors of 3-D physical space $$(\hat{\mathbf {e}}_{1},\hat{\mathbf {e}}_{2},\hat{\mathbf {e}}_{3})$$, corresponding to each of the three subspaces containing the x, y, and z derivatives in the spatial gradient, all mutually orthogonal and defined to be orthogonal to $$\hat{\mathbf {e}}_{0}$$. We note that eq. (A7) is a special case of   $${}{\rm d}E\equiv \left\langle Dg,\ Dp\right\rangle _{{\rm L}_{2}}$$ (A8)which is the definition of a Sobolev inner product Renka (2013). The operator, D in eq. (A7) is a special case in which D ≡ 1 + ∇.4 Eq. (A7) is thus a special case of the Sobolev inner product, with the highest order of derivatives equal to 1. In other words, it is an L2 inner product between the sums of two functions and their first derivatives. With some algebra (and omitting the unit vectors for brevity), eq. (A7) becomes   \begin{eqnarray} {\rm {d}}E & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)+\nabla g_{_{S}}\left(x\right)\right]\left[\delta p\left(x\right)+\nabla \delta p\left(x\right)\right]{\rm {d}}^{3}x\nonumber \\ & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)\delta p\left(x\right)+\nabla g_{_{S}}\left(x\right)\nabla \delta p\left(x\right)\right]{\rm {d}}^{3}x\nonumber \\ & = & \int _{\Omega }\left[g_{_{S}}\left(x\right)-\nabla ^{2}g_{_{S}}\left(x\right)\right]\delta p\left(x\right){\rm {d}}^{3}x, \end{eqnarray} (A9)where ∇2 is the Laplacian. Integration by parts has been used, together with the condition that the perturbation p(x) vanishes at the boundary of the volume Ω. In the third line of eq. (A9) we are again returning to an L2 inner product but with a changed gradient. By comparing equations (A9) and (A6), and again requiring that the change dE in both equations is the same, we get the relationship between L2 and Sobolev gradients, that is,   $$g_{_{S}}\left(x\right)=\left(1-\nabla ^{2}\right)^{-1}g\left(x\right).$$ (A10)The operator (1 − ∇2)−1 in eq. (A10) maps the conventional gradient from a Hilbert space with an L2 norm to the Sobolev space, resulting in a pre-conditioning of the gradient. The fact that $$g_{_{S}}\left(x\right)$$ is a smoother function than g(x) can be seen in the wavenumber domain application of eq. (A10), that is,   $$g_{_{S}}\left(k\right)=\left(1+k^{2}\right)^{-1}g\left(k\right)\!,$$ (A11)where k is the wavenumber magnitude, that is, k ≡ 2π/λ where λ is the wavelength. In this equation low-wavenumber features of the gradient are weighted more than the high-wavenumber features. The scales are treated quite differently: At very low wavenumbers the Sobolev gradient is identical to the conventional gradient, and at very high wavenumbers the Sobolev gradient is a double integration of the conventional gradient, leading to the reduction of high wavenumber content in the gradient but the preservation of some high wavenumber features. This leads to an ‘edge-preserving’ behaviour of the pre-conditioner. APPENDIX B: FROM TIKHONOV TO SOBOLEV The inner product (SSIP) and the resulting pre-conditioning (SSP) were derived in the main text using an argument following Neuberger (1997). In this appendix we will give an alternate way of arriving at the scaled-Sobolev pre-conditioning method. We shall start by deriving an expression for a parameter perturbation for a Tikhonov regularized objective function. We then discuss and discard the undesirable features of that update and keep the one that is desirable, allowing us to arrive again at an expression for the Sobolev gradient. After arriving at the resulting expressions, we shall give a physical interpretation. A change in any FWI objective function E following a perturbation in medium parameters δp is given by the first-order expression.   $${\rm d}E=(\nabla _{p}E)^{T}\delta \mathbf {p}.$$ (B1)The steepest descent method uses the gradient to define   $$\delta \mathbf {p}=-\alpha \nabla _{p}E,$$ (B2)where α is a non-negative, real-valued ‘step length’, while the Newton estimate of δp is related to the gradient and the Hessian H through   $$\mathbf {H}\delta \mathbf {p}=-\nabla _{p}E,$$ (B3)which is obtained by applying a parameter perturbation such that $$\nabla _{p}E\left(\mathbf {p}+\delta \mathbf {p}\right)=0\approx \nabla _{p}E\left(\mathbf {\mathbf {p}}\right)+\mathbf {H}\left(\mathbf {p}\right)\delta \mathbf {p}$$ to second order. If we specifically define a Tikhonov-type regularized objective function   $$E_{R}=\mathbf {r}^{T}\mathbf {r}+\lambda \left(\mathbf {R}\mathbf {p}\right)^{T}\left(\mathbf {R}\mathbf {p}\right)=E_{c}+\lambda \mathbf {p}^{T}\mathbf {R}^{T}\mathbf {R}\mathbf {p},$$ (B4)where r represents the conventional data residuals, Ec = rTr represents the conventional least squares objective functional, and R is a roughening operator (often the matrix representation of the first-order spatial derivatives), then this new objective function will have a different minimum in p space. The gradient and Hessian of the regularized objective function are   $$\nabla _{p}E_{R}=\nabla _{p}E_{c}+\lambda \mathbf {R}^{T}\mathbf {R}\mathbf {p}$$ (B5)where ∇pEc is the gradient of the conventional objective functional, and   $$\mathbf {H}_{R}=\mathbf {H}_{c}+\lambda \mathbf {R}^{T}\mathbf {R},$$ (B6)respectively, with Hc being the Hessian of the conventional objective functional. The Newton perturbation of eq. (B3) for the new Tikhonov objective becomes   $$\mathbf {H}_{R}\delta \mathbf {p}_{_{R}}=-\nabla _{p}E_{R}.$$ (B7)By substituting eq. (B5) and (B6), eq. (B7) becomes   $$\left(\mathbf {H}_{c}+\lambda \mathbf {R}^{T}\mathbf {R}\right)\delta \mathbf {p}_{_{R}}=-\nabla _{p}E_{c}-\lambda \mathbf {R}^{T}\mathbf {R}\mathbf {p}.$$ (B8)A desirable feature of the update given by eq. (B8) is that the weighted roughening operators RTR can be used to control how high or low we want the wavenumbers to be. This is a desirable feature because the roughening operators are constructed without reference to the Born approximation and therefore the scale-space defined by varying λ does not rely on the Born approximation. There are a couple of undesirable features in eq. (B8). If very low wavenumbers are required λ has to be very high. In that case, according to eq. (B4), the data misfit is practically ignored in the computation of the conventional gradient direction. The appearance of the conventional Hessian, which can improve convergence if the problem is quasi-linear, makes it extremely expensive to calculate and invert in FWI. The Gauss–Newton approximation is usually used, that is H is replaced by JTJ, where J is the conventional Jacobian. Even with the Gauss-Newton approximation, the calculation of this approximate Hessian may involve many extra forward propagations. B1 An intuitive step Let us keep the desired features of eq. (B8) and discard the undesired ones, which means that we shall neglect the second term on the right hand side of eq. (B8) and replace H on the left hand side by λ0I, where λ0 is a scalar and I is an identity matrix. Under these simplifications, eq. (B8) becomes   \begin{eqnarray} \left(\lambda _{0}\mathbf {I}+\lambda _{1}\mathbf {R}^{T}\mathbf {R}\right)\delta \mathbf {p}_{_{S}} & =-\nabla _{p}E_{c} \end{eqnarray} (B9a)or   \begin{eqnarray} \delta \mathbf {p}_{_{S}} & =-\left(\lambda _{0}\mathbf {I}+\lambda _{1}\mathbf {R}^{T}\mathbf {R}\right)^{-1}\nabla _{p}E_{c}. \end{eqnarray} (B9b) The change of subscript in the model update vector from R to S signifies that this update no longer corresponds to the minimization of regularized objective function (eq. B4), nor is it the conventional update either. Instead we have returned to the conventional objective function Ec, but we construct a model update that is a pre-conditioned version of the conventional gradient. The subscript S stands for Sobolev, chosen for reasons that will become clear shortly. Following the steepest descent method of eq. (B2), multiplying and dividing the right hand side of eq. (B9b) by a non-negative real-valued scalar $$\alpha _{_{S}}$$ we get   \begin{eqnarray} \delta \mathbf {p}_{_{S}} & = & -\alpha _{_{S}}\left(\mu _{0}^{2}\mathbf {I}+\mu _{1}^{2}\mathbf {R}^{T}\mathbf {R}\right)^{-1}\nabla _{p}E_{c}\nonumber \\ & \equiv & -\alpha _{_{S}}\nabla _{S}E_{c}, \end{eqnarray} (B10)where we have introduced the scaling factors   \begin{equation*} \mu _{o}^{2}\equiv \lambda _{o}\alpha _{s}\ \!\ \mathrm{and}\ \ \ \mu _{1}^{2}\equiv \lambda _{1}\alpha _{s}. \end{equation*} Eq. (B10) defines a Sobolev gradient of the conventional objective function that corresponds to a pre-conditioned version of the convention FWI gradient. Although the conventional objective function is retained in the expression, we are modifying the gradient to find a search direction that is smoother, so that we fit the long wavelength components earlier in the inversion. We also defined a scaled-Sobolev parameter update δpS that is proportional to ∇SEc in magnitude and opposite in direction. This is just the method of steepest-descent, but in a new scaled-Sobolev space of possible parameter distributions, with $$\alpha _{_{S}}$$ being simply the step length of the steepest-descent direction in this space. If we choose R to be the matrix expression of the first-order derivatives, then   $$\nabla _{S}E_{c}=\left(\mu _{0}^{2}\mathbf {I}-\mu _{1}^{2}\nabla ^{2}\right)^{-1}\nabla _{p}E_{c},$$ (B11)where ∇2 = −RTR is the matrix representation of the Laplacian operator. Equation (B11) may be compared with equations (A5), (A10) and (13) in the body of the paper. Let us confirm that the scaled-Sobolev update (eq. B10) does indeed give a reduction in the objective function. By using the Sobolev gradient from eq. (B11) and substituting the Sobolev perturbation δpS from eq. (B10) into eq. (B1) we obtain   \begin{eqnarray} \left({\rm d}E\right)_{S} =\delta \mathbf {p}_{_{S}}^{T}\left(\mu _{0}^{2}\mathbf {I}+\mu _{1}^{2}\mathbf {R}^{T}\mathbf {R}\right)\nabla _{S}E_{c} \end{eqnarray} (B12a)   \begin{eqnarray} & =\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\nabla _{S}E_{c}\\ \mu _{1}\mathbf {R}\nabla _{S}E_{c} \end{array}\right) \end{eqnarray} (B12b)  \begin{eqnarray} & =-\alpha _{_{S}}^{-1}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right). \end{eqnarray} (B12c) Eq. (B12c) gives a change in the objective function corresponding to an update in a direction $$\delta \mathbf {p}_{_{S}}$$. The change is strictly negative because the dot product   $$\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)^{T}\left(\begin{array}{c}\mu _{0}\delta \mathbf {p}_{_{S}}\\ \mu _{1}\mathbf {R}\delta \mathbf {p}_{_{S}} \end{array}\right)$$ (B13)and $$\alpha _{_{S}}$$ are both non-negative, and hence the objective function must be reduced. B2 Interpretation Now we shall look at what it means to have ‘jumped’ from eq. (B8) to eq. (B9a). The conventional steepest descent gives a change in the conventional gradient of   \begin{eqnarray} {\rm d}E_{c} =\delta \mathbf {p}^{T}\nabla _{p}E_{c} \end{eqnarray} (B14a)  \begin{eqnarray} \hphantom{{\rm d}E_{c} } =-\alpha ^{-1}\left(\delta \mathbf {p}\right)^{T}\delta \mathbf {p}. \end{eqnarray} (B14b) By comparison of eqs (B14b) and (B12c) we can see that the definition of the inner product has been changed. The conventional inner product used in the steepest descent method is simply the scaled squared norm of the parameter perturbation, whereas in eq. (B12c) the squared norm of a function is given by a scaled version of the conventional inner product plus a conventional inner product of the derivatives of the function. If the scalars are equal to 1, this is the squared Sobolev norm. Since we are using a scaled version of that norm we shall term this the scaled-Sobolev norm. We shall call the corresponding inner product the scaled-Sobolev inner product (SSIP). The space of functions equipped with the inner product defined in eq. (B12c) will be termed the scaled-Sobolev space, hence the choice of the subscript S for the perturbations and the gradients. Therefore, by changing from eq. (B8) to eq. (B9a) we have changed the space of the medium perturbation functions that we are allowed to move in during the inversion process. The conventional change in the objective function is based on an L2-norm inner product, which does not require the perturbation functions to be differentiable. Therefore, the conventional updates can potentially involve rougher functions than SSIP, because the SSIP requires the function to be differentiable at least once. Eq. (B11), with the scaling factor equal to one, is the starting point in the main text, which follows the arguments given by Neuberger (1997). There is no reason why the order of the derivatives in the definition of SSIP should be restricted to one. A generalized result is given in the main text. © The Author(s) 2017. Published by Oxford University Press on behalf of The Royal Astronomical Society.

### Journal

Geophysical Journal InternationalOxford University Press

Published: Apr 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations