Thomas G. Mayerhöfer, Isao Noda, Susanne Pahlow, Rainer Heintzmann, Jürgen Popp

Introduction

Systematic errors lead to inaccurate, biased results even if an experiment is repeated multiple times and the results are averaged to reduce the statistical random error. Systematic errors are often hard, if not impossible, to detect [1, 2]. One reason is that the down- or upshift of the mean compared to the true value does not influence the distribution of results caused by random errors. Even though it is possible to remove or reduce random errors and obtain a seemingly consistent (“precise”) result, this result may nevertheless still be far from the underlying ground truth. In particular in curve fitting, where experimental data points are fitted under the assumption that they follow a mathematical or physical model, a good agreement between measured points and fitted curve may delude the operator into thinking that systematic errors are absent [2, 3]. A quantitative and objective measure of the agreement between experimental and fitted data is the residual sum of squares (RSS). This measure belongs to the so-called loss functions and is also used in machine learning procedures, including deep learning, to quantify the learning progress and, after finalization of the training phase, the quality of the model with the help of a test set. In general, it must be kept in mind that a good agreement of a fit, indicated by a low RSS, is no sign of accuracy, neither in terms of the experimental data nor of the underlying model. That is, data analysis of quantitative experiments is based upon the assumption that the measured or calculated independent and dependent variables are not subject to systematic errors [4]. Curve fitting is often applied when nonlinear mathematical or physical inverse problems are to be solved that cannot be linearized.
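As a minimal illustration of the RSS as a loss function (a sketch not taken from the paper; the data values and function name are hypothetical), the sum of squared residuals between measured points and a model curve can be computed as:

```python
import numpy as np

def rss(y_measured, y_model):
    """Residual sum of squares between measured and modelled values."""
    residuals = np.asarray(y_measured) - np.asarray(y_model)
    return float(np.sum(residuals ** 2))

# Hypothetical data roughly following y = 2x with small random deviations
x = np.array([0.0, 1.0, 2.0, 3.0])
y_measured = np.array([0.1, 1.9, 4.2, 5.8])
y_model = 2.0 * x

print(rss(y_measured, y_model))  # small value, i.e., a seemingly good fit
```

Note that a multiplicative systematic error in y_measured would inflate this value even for a perfectly correct model, which is exactly the situation the paper addresses.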
The inverse problem has its counterpart in the direct problem, which is a forward calculation, e.g., the shadow cast by a 3D object on a white wall. The inverse problem, in contrast, involves reconstructing the 3D object from the shadow it casts, i.e., deriving the cause from the observation. Famous nonlinear inverse problems are the determination of the properties of an object based on the scattering characteristics of incoming radiation [5], recovering the diffusion coefficient of single-phase fluid flows in porous media [6], or recovering wave speeds and density distributions from seismograms [7], but there are numerous further examples. When linear inverse problems and those that can be linearized are to be solved, i.e., if the observable depends in a linear fashion on the independent variable(s), analytical solutions can be derived. Such linear models form the basis for many chemometric methods [8]. In contrast to linear problems, nonlinear problems require iteratively improving the fitted curve by minimizing a measure of disagreement until it approaches the experimental data. Potential applications of curve fitting are countless and encompass virtually all scientific disciplines. Examples include biosynthesis [9], thermoluminescence [10], solar energy [11], materials science and technology [12], agriculture [13], cancer research [14], kinetics [15], thermal engineering [16], transportation [17], soil science [18], remote sensing of ecosystems [19], epidemiology [20], power and energy engineering [21], population growth [22] and spectroscopy [23], to name just a few. The disagreement metric to be minimized during the fit depends on the properties of the noise and possibly on prior information about the parameters to fit. However, in most cases it is sensible to assume Gaussian statistics, which requires the minimization of the RSS, also called the sum of squared errors, or of its weighted version in the case of non-uniform but known variances.
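The weighted version mentioned above can be sketched as follows (a minimal illustration with hypothetical data, not the authors' code): each squared residual is divided by the known variance of the corresponding data point, so that noisy points contribute less to the loss.

```python
import numpy as np

def weighted_rss(y_measured, y_model, variances):
    """Weighted residual sum of squares for non-uniform but known variances:
    each squared residual is divided by the variance of its data point."""
    r = np.asarray(y_measured) - np.asarray(y_model)
    return float(np.sum(r ** 2 / np.asarray(variances)))

# Hypothetical measurement with point-dependent noise variances
y_meas = np.array([1.0, 2.0, 3.5])
y_mod = np.array([1.0, 2.5, 3.0])
var = np.array([0.25, 0.25, 1.0])

print(weighted_rss(y_meas, y_mod, var))
```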
The RSS traces back to Carl Friedrich Gauß. For the most part it is used in its original form, and not much work has been done lately to improve it in general, except for a few particular applications. For example, Korkmaz recently proposed a generalized RSS for multiple linear regression [24]. In addition, Zhang et al. suggested an integrated residual sum of squares for a supervised principal component regression method relating functional responses to high-dimensional predictors [25]. To fit high-dimensional data sets with more predictors than samples, Roozbeh et al. proposed penalization terms [26]. While these works were motivated by problems completely different from ours, Sumon and Majumder applied weighted versions of the RSS to account for systematic deviations, but in the form of isolated defects such as irrelevant data, missing observations, and distributional faults [27]. Recently, we derived alternative loss functions based on 2D correlation analysis or spectroscopy (2D-COS) [3, 28]. 2D correlation analysis was introduced in the 1980s by one of us as a tool for infrared spectroscopy and has since found widespread use in other spectroscopic methods such as Raman spectroscopy and mass spectrometry [29]. In principle, 2D correlation spectroscopy is based on acquiring a series of spectra under the systematic change of one parameter of the sample (the so-called perturbation), e.g., the stretch of a polymer, the temperature, or the concentration of one compound in a mixture. The perturbation can also be a parameter of an established physical model describing how its alteration induces changes in the spectra, like the thickness of films. A variant of 2D correlation spectroscopy is hybrid 2D correlation spectroscopy, which allows the comparison of two different spectral series, e.g., the same compound under two different perturbations.
In the context of curve fitting, a variant would be to let one series consist of experimental data, whereas the second comprises modelled spectra. In the original 2D correlation maps, half of the data points are redundant due to symmetry relations between points separated by the diagonal running from low to high values of the independent variable. For hybrid 2D correlation maps these relations no longer hold, but the more the two series resemble each other, the smaller the deviations from these symmetry relations become. This property of hybrid 2D correlation maps can be exploited by formulating an alternative criterion for the resemblance between experimental and modelled data which includes the correlations between the series, which we call the smart error sum (SES). In contrast to fits using the conventional RSS, the 2D correlation-based smart error sum does not force the modelled curve to agree point by point with the experimental curves, but accounts for correlations between the latter and between the individual points of a curve. As a consequence, even when the experimental data are reduced by a (frequency-dependent) factor, the data can still be analyzed in a meaningful way. In spectroscopy, such a situation is often encountered, for example, due to diffuse reflection caused by surface roughness, or when measured data become larger than predicted by models which do not account for such experimental problems. This approach can not only detect, but also remove multiplicative systematic errors, as has been demonstrated for infrared spectra of films on substrates of different thicknesses (additive systematic errors can also be treated after applying an exponential transformation). Quite often, though, series of spectra are not available and only a single data curve is to be fitted. In this case inter-spectral correlations cannot be exploited.
However, it is still possible to use correlations between the individual data points based on a recent subtype of 2D correlation analysis. The so-called two-trace 2D correlation analysis or spectroscopy (2T2D COS) [30, 31] requires only two sets of spectral data, of which only one needs to be experimental. In this case the symmetry relations cannot be used as a criterion, but if experiment and model agreed perfectly, one of the maps would vanish everywhere. The value of this idea has been proven employing the same physical system as for the original smart error sum. As we will show, the 2T2D SES approach is similar to utilizing normalized cross-correlation (NCC) and zero-mean normalized cross-correlation (ZNCC) as SES. NCC and ZNCC are related to 2D correlation analysis [29] and often used for signal analysis. Examples include comparing image quality in competition with the conventional residual sum of squares [32–35], tracking wavelength shifts in fiber Bragg gratings [36, 37], and, recently, also least-squares optimizations of images [38]. While the application to real systems and problems helped to establish the validity of the approach, it also partially obscured the mathematical basis and the principal properties of the method. To enable broader application, we therefore reduce it here to its essential properties and demonstrate it with a simple example in the following. In addition, we provide the 2T2D-based smart error sum in a form that scales, like the conventional sum of squared residuals, linearly with the number of points. Accordingly, the former can replace the latter in nonlinear curve-fitting applications that are prone to unknown systematic experimental errors. The code of the program we used to obtain the results shown in the following is made available together with this work so that they can easily be reproduced. In addition, the code can effortlessly be modified for other nonlinear models.
Contributions

The fundamental idea of this paper is to present the purely mathematical background of the methods from [3] and [28] in a form understandable to most readers independent of specialty, and to investigate under which prerequisites the new family of loss functions can be applied. With regard to members of the family of smart error sums, the theoretical considerations section will introduce two novel members, namely the normalized cross-correlation and the Pearson coefficient, which can also be used as loss functions after small but important alterations and which are derived from an altered 2T2D loss function. The latter alteration, introduced in this paper, scales linearly with the number of points (very much like the conventional residual sum of squares) and no longer with the squared number of points like its predecessor. This immensely increases its potential, in particular for complex problems. Concerning the requirements for the application of the new family of loss functions, we will show in the following that nonlinearity with respect to the external variable (“the perturbation”) is not sufficient to allow the smart error sums to be used. Instead, the correct criterion is a disproportionate change with the perturbation. This is a vital insight, because it proves that the smart error sums automatically reject approximations that do not lead to disproportionate changes. Finally, we will show how the smart error sums automatically reject unrealistic models. This could be of great interest for neural networks, as it should allow models to be built more easily and thus faster.

Theoretical considerations

Basic equations for 2D-COS

The following is based on the matrix algebra employed for 2D-COS as used by Noda and Ozaki [29]. Since the formalism was originally developed for spectroscopy, we have to slightly reformulate it.
However, to allow the reader to connect the following to the original literature, we will try to adhere to the original terminology as closely as possible. We assume that we have a function of two variables x and t, of which we call the former the shaping variable and the latter the perturbation. A number of different data points located on m different curves, which differ with regard to t, shall be represented by employing discrete values xi and tj. These curves will be called a set of dynamic spectra. The dynamic spectra are arranged in a matrix Yk in the following way: (1) (Yk)ji = yk(xi, tj), with the row index j = 1…m running over the perturbation values and the column index i over the values of the shaping variable. The index k ∈ {1, 2} indicates whether the set of dynamic spectra consists of the set of “measured” (k = 1) or the set of simulated (k = 2) spectra. One may think that it is advantageous to mean-center the dynamic spectra, i.e., to subtract the mean spectrum of the series from each individual measured spectrum. However, such mean-centering or, more generally, referencing is often not only unnecessary [39], but sometimes even detrimental. Yet, if the array of curves or dynamic spectra share a common offset, this offset needs to be removed prior to application, since otherwise not only the 2D correlation maps [40], but also the smart error sums are ill-defined. From the matrices Yk the variance-covariance matrices Φxx can be generated by: (2) Φxx = Y1ᵀ·Y2/(m − 1) If Y1 = Y2, we speak of conventional 2D-COS, whereas the case Y1 ≠ Y2 leads to a so-called hybrid correlation analysis. In the case of the conventional smart error sum, Y1 is formed from the “measured” and Y2 from the corresponding simulated curves. As pointed out in ref. [29], each element of the variance-covariance matrix expresses the similarity between a specific pair of intensity variations at different xj. If Y1 = Y2, the diagonal elements are the autocorrelation functions of the intensity variations along t at a given xj. The variance-covariance matrix is identical to the synchronous 2D correlation map/spectrum.
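The construction of the dynamic spectra matrix and of the synchronous map can be sketched in a few lines (a minimal illustration assuming the standard 2D-COS form Φxx = Y1ᵀY2/(m − 1), with rows of Y indexing the perturbation values; the series used here is hypothetical):

```python
import numpy as np

def synchronous_map(Y1, Y2):
    """Hybrid synchronous 2D correlation map: Phi = Y1^T Y2 / (m - 1),
    where each of the m rows of Y1/Y2 is one curve of the series."""
    m = Y1.shape[0]                 # number of curves in the series
    return Y1.T @ Y2 / (m - 1)

# Hypothetical series: m = 4 curves sampled at n = 3 points x_i, each curve
# being a base spectrum scaled by a perturbation-dependent factor t_j
t = np.array([1.0, 2.0, 3.0, 4.0])
base = np.array([0.5, 1.0, 0.25])
Y = np.outer(t, base)               # Y[j, i] = t_j * base[i]

Phi = synchronous_map(Y, Y)         # conventional case Y1 = Y2
print(np.allclose(Phi, Phi.T))      # the synchronous map is symmetric
```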
In order to compute the asynchronous 2D correlation map/spectrum, the Hilbert–Noda transformation matrix N must be calculated first. The elements of N are given by: (3) Njk = 0 if j = k, Njk = 1/(π(k − j)) otherwise The elements of the asynchronous 2D correlation map/spectrum can then be calculated according to (4) Ψxx = Y1ᵀ·N·Y2/(m − 1), where we again distinguish between the conventional case (I) and hybrid 2D-COS (II).

Derivation of the smart error sum

As already mentioned in the introduction, for the conventional (Y1 = Y2) 2D correlation analysis certain symmetry relationships hold for the synchronous and asynchronous spectra/maps. Accordingly, the synchronous spectra are always symmetric relative to the diagonal elements Φ(xi, xi). This condition can be expressed as (5) Φ(xi, xj) = Φ(xj, xi) Accordingly, the diagonal from small to large x values is a mirror plane that relates each point above the diagonal to its mirror image below it. From Eq (5) it follows that the sum of the differences of all variances and covariances above the diagonal and their counterparts below the diagonal is zero: (6) Σj Σk>j [Φ(xj, xk) − Φ(xk, xj)] = 0 For hybrid 2D-COS, the synchronous maps do not necessarily obey the above condition. The residual of the differences between the elements Φ(xj, xk) and Φ(xk, xj) is a measure of dissimilarity, which can generally be written as (7) DS = Σj Σk |Φ(xj, xk) − Φ(xk, xj)|^p, with DS the so-called Minkowski distance, which for p = 2 is called the Euclidean distance. Therefore, hybrid 2D correlation maps allow a derivation of these quantities simply from their symmetry relations [3]. For asynchronous 2D correlation maps, a similar relationship can be derived. Accordingly, similarly to Eq (5), we find from the condition that the conventional asynchronous 2D correlation maps are always antisymmetric with respect to the diagonal the following relation [3]: (8) Ψ(xi, xj) = −Ψ(xj, xi) This relation leads for the hybrid 2D correlation asynchronous map to: (9) DA = Σj Σk |Ψ(xj, xk) + Ψ(xk, xj)|^p Note that for both Eqs (7) and (9) we can include the diagonal, since the terms with j = k are zero.
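The Hilbert–Noda matrix and the asynchronous map can be sketched as follows (a minimal illustration assuming the standard forms Njk = 1/(π(k − j)) for j ≠ k and Ψxx = Y1ᵀNY2/(m − 1); the series is hypothetical):

```python
import numpy as np

def hilbert_noda(m):
    """Hilbert-Noda transformation matrix:
    N[j, k] = 0 if j == k, else 1 / (pi * (k - j))."""
    j, k = np.indices((m, m))
    N = np.zeros((m, m))
    off = j != k
    N[off] = 1.0 / (np.pi * (k[off] - j[off]))
    return N

def asynchronous_map(Y1, Y2):
    """Hybrid asynchronous 2D correlation map: Psi = Y1^T N Y2 / (m - 1)."""
    m = Y1.shape[0]
    return Y1.T @ hilbert_noda(m) @ Y2 / (m - 1)

# Hypothetical series of m = 4 curves at n = 3 points
Y = np.outer([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 0.25])
Psi = asynchronous_map(Y, Y)
print(np.allclose(Psi, -Psi.T))  # the conventional asynchronous map is antisymmetric
```

Antisymmetry of Psi in the conventional case follows directly from N itself being antisymmetric.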
For p = 2, DS and DA are special residual sums of squares, which we call the synchronous and the asynchronous residual sum of squares, SRSS and ARSS. SRSS and ARSS can be combined ad hoc into the smart error sum (SES) according to: (10) While only a few lines of code need to be exchanged to implement the SES, compared to the conventional RSS, which scales with the number of data points n, the SES scales with n², which is an obvious drawback in particular for larger datasets with a high number of data points. In addition, in cases where only a single measured spectrum is available for curve fitting, the smart error sum cannot be used, simply because it is not possible to generate a 2D correlation map from a single spectrum. For this case, we have introduced an alternative smart error sum based on hybrid 2T2D-COS, with one measured curve, while the other is the simulated one. Synchronous and asynchronous 2D correlation spectrum/map are then calculated by [30, 31] (11) Φ(xj, xk) = ½[m(xj)m(xk) + s(xj)s(xk)], Ψ(xj, xk) = ½[m(xj)s(xk) − s(xj)m(xk)], wherein m(xj) = Y1(xj, tl) is the measured and s(xj) = Y2(xj, tl) the simulated curve with an arbitrary tl. Based on Eq (11), the hybrid synchronous spectrum is always symmetric and the asynchronous spectrum is always antisymmetric relative to the diagonal. Therefore, the underlying principle of the smart error sum introduced for series of curves, namely to increase the symmetry of the hybrid synchronous 2D correlation map and the antisymmetry of the hybrid asynchronous 2D correlation map by varying the fit parameters, cannot be employed. As an alternative, we can exploit the fact that the asynchronous 2T2D correlation map Ψxx vanishes if the given and the modelled curve are linearly dependent. Put in concrete terms, in this case the given and the modelled curve can, as in the case of the conventional smart error sum, also differ by a simple scalar multiplication factor. Accordingly, the 2T2D smart error sum is given by [28] (12) Σj Σk |Ψ(xj, xk)|^p, where we set p = 2. Therefore, a corresponding algorithm would minimize this quantity by finding optimized values for the fit parameters.
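The 2T2D maps of Eq (11) can be sketched as follows (a minimal illustration assuming the standard 2T2D form Φjk = ½(mjmk + sjsk), Ψjk = ½(mjsk − sjmk); the curves are hypothetical):

```python
import numpy as np

def two_trace_maps(m_curve, s_curve):
    """2T2D synchronous and asynchronous maps for a measured curve m and a
    simulated curve s: Phi[j,k] = (m_j m_k + s_j s_k)/2,
    Psi[j,k] = (m_j s_k - s_j m_k)/2."""
    m = np.asarray(m_curve)
    s = np.asarray(s_curve)
    Phi = (np.outer(m, m) + np.outer(s, s)) / 2.0
    Psi = (np.outer(m, s) - np.outer(s, m)) / 2.0
    return Phi, Psi

measured = np.array([1.0, 3.0, 2.0, 0.5])
simulated = 0.7 * measured            # linearly dependent: differs by a factor only

Phi, Psi = two_trace_maps(measured, simulated)
print(np.allclose(Psi, 0.0))          # the asynchronous 2T2D map vanishes
```

For a simulated curve that is not a scalar multiple of the measured one, Psi would have nonzero off-diagonal elements, which is what the 2T2D smart error sum penalizes.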
In Eq (12), all points below the diagonal need not be considered, which follows from the asynchronous map being perfectly antisymmetric: (13) Σj Σk |Ψ(xj, xk)|^p = 2 Σj Σk>j |Ψ(xj, xk)|^p On the other hand, if we set p = 2, there is a possibility to significantly simplify Eq (12) if we let both sums run from 1 to l. In this case, (14) Σj Σk Ψ(xj, xk)² = ½[(Σj m(xj)²)·(Σj s(xj)²) − (Σj m(xj)s(xj))²], which scales with the number of data points like the conventional residual sum of squares, instead of with its square like the other smart error sums based on 2D correlation analysis, while preserving the advantage that only a few lines of code have to be exchanged. The advantage of the smart error sums in comparison with the conventional residual sum of squares as minimization criterion is that, for the former, experimental and simulated curves are not forced to agree at all costs but may differ by an individual factor, the optimum of which is determined by maximum correlation. In other words, not only the best agreement between original and simulated values determines the fit parameters, but also the correlation of the curves in a series or within a curve. From a mathematical point of view, the additional degree of freedom can be understood in terms of the phase angle: (15) Θ(x1, x2) = arctan[Ψ(x1, x2)/Φ(x1, x2)] The term phase angle is used because Φxx and Ψxx are linearly independent and can formally be described as a complex function: (16) Φ(x1, x2) + iΨ(x1, x2) Θ(x1, x2) is then derived from the polar form. For hybrid 2T2D correlation analysis, Ψ(x1, x2) becomes zero if the original curve and its best fit agree up to a multiplication factor, which means that Θ(x1, x2) = 0. This situation means that the two curves are linearly dependent, or even identical if systematic errors are absent. Accordingly, an alternative form for the 2T2D-based smart error sum is given by [28]: (17) Σj Σk |Θ(xj, xk)|^p For series-based hybrid 2D correlation analysis, the ratios Ψ1(x1, x2)/Φ1(x1, x2) for the set of the given curves and Ψ2(x1, x2)/Φ2(x1, x2) for the fitted curves are equal if the correlations within both sets of curves agree.
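Assuming that Eq (12) sums the squared elements of the 2T2D asynchronous map, the linear-scaling simplification can be checked numerically: expanding the square of Ψ(xj, xk) = ½[m(xj)s(xk) − s(xj)m(xk)] reduces the double sum to three single sums over the data (a sketch with hypothetical data, not the authors' code):

```python
import numpy as np

def ses_2t2d_quadratic(m, s):
    """2T2D smart error sum as the full double sum over Psi^2 (scales with n^2)."""
    Psi = (np.outer(m, s) - np.outer(s, m)) / 2.0
    return float(np.sum(Psi ** 2))

def ses_2t2d_linear(m, s):
    """Equivalent closed form that scales linearly with n:
    sum_jk Psi_jk^2 = ((sum m^2)(sum s^2) - (sum m*s)^2) / 2."""
    return float((np.sum(m ** 2) * np.sum(s ** 2) - np.sum(m * s) ** 2) / 2.0)

rng = np.random.default_rng(0)
m = rng.normal(size=50)
s = rng.normal(size=50)
print(np.isclose(ses_2t2d_quadratic(m, s), ses_2t2d_linear(m, s)))
# and the sum vanishes when the curves differ only by a factor:
print(np.isclose(ses_2t2d_linear(m, 3.0 * m), 0.0))
```

Because only the three sums Σm², Σs², and Σms are needed, the cost per loss evaluation grows linearly with the number of points, as stated in the text.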
An alternative form of the original smart error sum is therefore [28] (18) Σk Σj |Θex(xk, xj) − Θsim(xk, xj)|^p, where Θex(xk, xj) are the phase angles of the original data and Θsim(xk, xj) are those of the simulated curves. This form, for p = 2 and without consideration of its symmetry properties, was originally introduced by Shinzawa et al. [41, 42] and used exclusively for the method of alternating least squares (ALS). In this form, a theoretical problem of Eq (10) is avoided, which occurs if either the SRSS or the ARSS becomes zero, which in practice is unlikely to happen due to numerical errors related to the conversion of numbers to the binary system. To be on the safe side, a dummy regularizing positive constant ε can be incorporated into Eqs (7) and (9), e.g., ε = 10−10 (this value is small enough to have no effect on the actual computation and may be viewed as a predictable substitute for random bit noise). Similar considerations apply to the calculation of the phase angle defined in Eq (15), and it may be advantageous to regularize the denominator, even though the chance that the intensity of the synchronous spectrum becomes exactly zero might be slim (there are chances that this can happen near the zero-crossing area; the sign of the regularization constant has to be the same as the sign of the synchronous spectrum intensity; the primary reason for regularizing the ratio between asynchronous and synchronous intensities is to avoid the ambiguity of the zero-divided-by-zero situation where the dynamic spectrum remains zero). Note that for a typical arctangent function routine, the direction of a vector in the phase plane is confined to the first and fourth quadrants. In other words, the phase angle calculated by a computer is automatically assumed to take a value between −π/2 and +π/2. In a practically expected physical situation, a phase vector can point in a direction outside of this artificial confinement.
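The regularization of the phase-angle denominator and the quadrant shift described above can be sketched as follows (a minimal illustration; the function name and test values are assumptions, and the shift rule is implemented exactly as stated in the text):

```python
import numpy as np

EPS = 1e-10  # dummy regularizing constant, as suggested in the text

def phase_angle(Phi, Psi):
    """Phase angle Theta = arctan(Psi / Phi) with the denominator regularized
    by EPS carrying the sign of the synchronous intensity (avoiding 0/0), and
    pi added whenever the raw arctangent falls below -pi/4, so that Theta is
    confined to the range (-pi/4, +3*pi/4]."""
    denom = Phi + EPS * np.where(Phi >= 0, 1.0, -1.0)
    theta = np.arctan(Psi / denom)
    return np.where(theta < -np.pi / 4, theta + np.pi, theta)

# Zero synchronous and asynchronous intensity: regularization yields Theta = 0
print(phase_angle(np.array([0.0]), np.array([0.0])))
# Negative synchronous, larger asynchronous intensity: shifted into the second quadrant
print(phase_angle(np.array([-1.0]), np.array([2.0])))
```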
Therefore, it is generally advantageous to assume that the phase angle should be confined between −π/4 and +3π/4, and to add π whenever the calculated value lies between −π/2 and −π/4. Unfortunately, the form of Eq (18) does not permit such simplifications, so that it scales with the squared number of data points like the original SES. According to Eq (18), individual 2D maps can differ by multiplication factors even though their phase angles are equal. In the absence of systematic errors, the factor becomes unity. It might not be obvious, but the normalized cross-correlation coefficient NCC can be derived from the 2T2D SES: (19) NCC = Σj m(xj)s(xj) / [Σj m(xj)² · Σj s(xj)²]^½ As long as m = C⋅s, with an arbitrary positive factor C, NCC = 1; otherwise −1 ≤ NCC < 1.
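The invariance of the NCC under a multiplicative factor can be sketched as follows (a minimal illustration with hypothetical curves, assuming the standard NCC definition of Eq (19)):

```python
import numpy as np

def ncc(m, s):
    """Normalized cross-correlation of two curves:
    NCC = sum(m*s) / sqrt(sum(m^2) * sum(s^2))."""
    m = np.asarray(m, dtype=float)
    s = np.asarray(s, dtype=float)
    return float(np.sum(m * s) / np.sqrt(np.sum(m ** 2) * np.sum(s ** 2)))

measured = np.array([1.0, 2.0, 4.0, 3.0])
print(ncc(measured, 5.0 * measured))   # curves equal up to a positive factor: NCC = 1
print(ncc(measured, measured + 0.5))   # otherwise |NCC| < 1
```

This makes the multiplicative-error tolerance of the 2T2D-based loss explicit: scaling one curve leaves the NCC unchanged, while any other deviation reduces it.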