# Identification of atypical (rare) elements—a conditional, distribution-free approach

## Abstract

The discovery of atypical elements has become one of the most important challenges in data analysis and exploration. At the same time it is a difficult task, posed under demanding conditions and not even strictly defined. This article presents a ready-to-use procedure for identifying atypical elements in the sense of rarely occurring. The issue is considered in a conditional approach, where describing and conditioning variables can be multidimensional continuous, with the second type also potentially categorical. The application of nonparametric concepts frees the investigated procedure from distributions of describing and conditioning variables. Ease of interpretation and completeness of the presented material lend themselves to the use of the worked out method in a wide range of tasks in various applications of data analysis in science and practice.

## 1. Introduction

Atypical elements (often rashly referred to as outliers) can intuitively be considered as significantly differing from the rest of a data set (Hawkins, 1980; Barnett & Lewis, 1994; Hodge & Austin, 2004; Aggarwal, 2013). Their occurrence most commonly results from considerable (‘gross’) errors arising during measurement, collection, storage and processing. In practice, they hinder the correct utilization of available knowledge, and their elimination or correction enables the use of more convenient and more effective methods at later stages of data analysis and exploration. What is more, in marketing, atypical elements may represent cases so different from the majority of the population that any individual decision based on such a group, so distinct and insignificant, often turns out to be economically unviable. In engineering, the presence of atypical states in dynamic systems may be evidence of malfunction of a component or the entire device, and proper reaction usually enables serious consequences to be avoided.
The detection of an atypical element may also signify an attempt to hack into a computer system. On the other hand, in many social and economic problems the appearance of such an element could be a positive trait, as it may characterize completely new trends or uncommon phenomena, and their quick discovery allows the appropriate specific action to be taken in anticipation. Therefore, the identification of atypical elements constitutes a natural cognitive challenge of great scientific and practical meaning (Kulczycki et al., 2007).

The task of identifying atypical elements is posed under very difficult conditions. Above all, there is most often no definition or even criterion indicating which elements should be considered atypical. Moreover, we do not have a pattern of atypical elements, and even if we did, it would be, by its nature, small in number and strongly unbalanced with respect to the set of typical elements. In the simplest one-dimensional case, where the data distribution is unimodal, atypical elements can be considered to be those distant from the median (according to the basic meaning of the term ‘outlier’) by more than $$3/2$$ of the interquartile range (Larose, 2005; Section 2.7). However, a similar approach cannot be taken for complex multimodal distributions. For example, when particular modes are significantly distanced from each other, elements lying in the centre between them should be regarded as atypical, although they may be located very near to the median, definitely closer than $$3/2$$ of the interquartile range.

In this article, atypical elements are understood as those occurring rarely in the population. Thus, having a representative set of data, we will highlight regions of lowest distribution density, such that the total probability of elements appearing in those regions is less than or equal to the assumed value. Thanks to the methodology applied here, the above regions can be of any shape and location; they may be comprised of many separate parts.
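The multimodal example above can be made concrete with a small sketch. The data values, the candidate point and the neighbourhood radius of 1.0 are all hypothetical, chosen only to show that a point midway between two well-separated modes passes the interquartile-range rule while having no neighbours at all.

```python
import statistics

# Hypothetical bimodal sample: two tight clusters around -5 and 5.
data = [-5.2, -5.1, -5.0, -4.9, -4.8, 4.8, 4.9, 5.0, 5.1, 5.2]
candidate = 0.0  # element lying in the centre between the modes

# Classical one-dimensional rule: distance from the median vs 3/2 of the IQR.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
flagged_by_iqr_rule = abs(candidate - median) > 1.5 * iqr

# Crude rarity check (illustrative only): sample elements within radius 1.0.
neighbours = sum(1 for v in data if abs(v - candidate) <= 1.0)

print(flagged_by_iqr_rule, neighbours)
```

The candidate coincides with the median, so the distance-based rule does not flag it, yet no sample element lies near it; a density-based notion of rarity is what separates the two views.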
In numerous practical tasks, the data possessed can be significantly refined through the measurement and inclusion of the current value of quantity considerably influencing the subject of investigation. In engineering practice, such a factor may often be the current temperature. From a formal point of view the above aim can be realized by using a conditional probabilistic approach (Dawid, 1979). In this case, the basic attributes, termed describing, become dependent on conditioning factors, whose introduced specific values can make substantially more precise the information relating to an object under research. This approach is the main subject of this article. For defining characteristics of data, the nonparametric methodology of kernel estimators is used, which freed the investigated procedures from forms of distributions characterizing both the describing and conditioning quantities. Both can be continuous multidimensional, and the latter can also be categorical. The presented material is complete and ready-to-use without laborious investigations. In particular, valuable is its easy, illustrative interpretation. Thus, Section 2 outlines the mathematical preliminaries: kernel estimators in nonparametric density estimation. The investigated procedure for outlier identification in the conditioning approach is described in Section 3. Next, Sections 4 and 5 present the results of numerical and experimental—for a control engineering task—tests which confirmed the correct functioning of the method. The last section provides final comments and summary of the presented research. A broad review of cases and methods for the identification of atypical elements is found in the classic publications (Barnett & Lewis, 1994; Hodge & Austin, 2004; Aggarwal, 2013). The approach presented in this article differs from the classical techniques firstly in its conditional aspect and its freeing from the distributions of the describing and conditioning quantities. 
The preliminary version of this article was presented as Kulczycki et al. (2016).

## 2. Nonparametric density estimation

In the presented method, the characteristics of a data set will be defined using the nonparametric methodology of kernel estimators. This kind of procedure is distribution-free, i.e. preliminary assumptions concerning the types of appearing distributions are not required. A broad description can be found in the classic monographs (Silverman, 1986; Wand & Jones, 1995; Kulczycki, 2005). Exemplary applications for data analysis tasks are described in the publications (Kulczycki, 2008; Kulczycki & Charytanowicz, 2010, 2013; Kulczycki & Kowalski, 2015); see also (Kulczycki & Łukasik, 2014).

Let the $$n$$-dimensional continuous random variable $$X$$ be given, with a distribution characterized by the density $$f$$. Its kernel estimator $$\hat{{f}}:R^{n}\to [0,\infty )$$, calculated using experimentally obtained values for the $$m$$-element random sample

$$x_{i} \mbox{ for } i=1,2,\ldots ,m \tag{1}$$

in its basic form is defined as

$$\hat{{f}}(x)=\frac{1}{mh^{n}}\sum\limits_{i=1}^m {K\left( {\frac{x-x_{i} }{h}} \right)}, \tag{2}$$

where $$m\in N\backslash \{0\}$$, the coefficient $$h>0$$ is called a smoothing parameter, while the measurable function $$K:R^{n}\to [0,\infty )$$ of unit integral $$\int_{ R^{n}} {K(x)\,\mbox{d}x} =1$$, symmetrical with respect to zero and having a weak global maximum in this place, takes the name of a kernel. The choice of form of the kernel $$K$$ and the calculation of the smoothing parameter $$h$$ is made most often with the criterion of the mean integrated square error. Thus, the choice of the kernel form has, from a statistical point of view, no practical meaning, and thanks to this it becomes possible to take primarily into account properties of the estimator obtained or calculational aspects, advantageous from the point of view of the applicational problem under investigation; for broader discussion see the books (Wand & Jones, 1995, Sections 2.7 and 4.5; Kulczycki, 2005, Section 3.1.3).
In practice, for the one-dimensional case (i.e. when $$n=1$$), the function $$K$$ is assumed most often to be the density of a common probability distribution. In the multidimensional case, two natural generalizations of the above concept are used: radial and product kernels. Thanks to convenient analysis, the latter will be used in the following. The main idea here is the separation of particular variables, with the multidimensional kernel then becoming a product of $$n$$ one-dimensional kernels for particular coordinates. Thus the kernel estimator is then given as

$$\hat{{f}}(x)=\frac{1}{mh_{1} h_{2} \ldots h_{n} }\sum\limits_{i=1}^m {K_{1} \left( {\frac{x_{1} -x_{i,1} }{h_{1} }} \right)K_{2} \left( {\frac{x_{2} -x_{i,2} }{h_{2} }} \right)\ldots K_{n} \left( {\frac{x_{n} -x_{i,n} }{h_{n} }} \right)}, \tag{3}$$

where $$K_{j}$$ ($$j=1,2,\ldots ,n$$) denote one-dimensional kernels, $$h_{j}$$ ($$j=1,2,\ldots ,n$$) are smoothing parameters individualized for particular coordinates, while assigning to coordinates

$$x=\left[ {\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}} \right] \mbox{ and } x_{i} =\left[ {\begin{array}{c} x_{i,1} \\ x_{i,2} \\ \vdots \\ x_{i,n} \end{array}} \right] \mbox{ for } i=1,2,\ldots ,m. \tag{4}$$

The fixing of the smoothing parameter has significant meaning for quality of estimation. Fortunately many suitable procedures for calculating its value on the basis of random sample (1) have been worked out; for broader discussion see the books (Silverman, 1986; Wand & Jones, 1995; Kulczycki, 2005). In particular, for the one-dimensional case, the effective plug-in method (Wand & Jones, 1995, Section 3.6.1; Kulczycki, 2005, Section 3.1.5) is especially recommended. Of course this method can also be applied in the $$n$$-dimensional case when a product kernel is used, sequentially $$n$$ times for each coordinate.
One can also apply the simplified method (Silverman, 1986, Section 3.4.1; Wand & Jones, 1995, Section 3.2.1; Kulczycki, 2005, Section 3.1.5), according to which

$$h_{j} =\left( {\frac{8\sqrt \pi }{3}\frac{W(K_{j} )}{U(K_{j} )^{2}}\frac{1}{m}} \right)^{1/5}\hat{{\sigma }}_{j} \mbox{ for } j=1,2,\ldots ,n, \tag{5}$$

where $$W(K_{j} )=\int_{-\infty }^\infty {K_{j} (x)^{2}\mbox{ d}x}$$ and $$U(K_{j} )=\int_{-\infty }^\infty {x^{2}K_{j} (x)\mbox{ d}x}$$, while $$\hat{{\sigma }}_{j}$$ denotes the estimator of a standard deviation for the $$j$$-th coordinate:

$$\hat{{\sigma }}_{j} =\sqrt {\frac{1}{m-1}\sum\limits_{i=1}^m {x_{i,j}^{2} } -\frac{1}{m(m-1)}\left( {\sum\limits_{i=1}^m {x_{i,j} } } \right)^{2}} \mbox{ for } j=1,2,\ldots ,n. \tag{6}$$

The value obtained by formula (5) may be sufficiently precise for many practical applications, whereas, thanks to its simplicity, this method significantly increases calculation speed. For specific cases, the value calculated in this way can also be individually refined.

The above concept will now be generalized for the conditional case. Here, besides the basic (sometimes termed the describing) $$n_{Y}$$-dimensional random variable $$Y$$, let also be given the $$n_{W}$$-dimensional random variable $$W$$, called hereinafter the conditioning random variable. Their composition

$$X=\left[ {\begin{array}{c} Y \\ W \end{array}} \right] \tag{7}$$

is a random variable of the dimension $$n_{Y} +n_{W}$$. Assume that distributions of the variables $$X$$ and, in consequence, $$W$$ have densities, denoted below as $$f_{X} :R^{n_{Y} +n_{W} }\to [0,\infty )$$ and $$f_{W} :R^{n_{W} }\to [0,\infty )$$, respectively. Let also be given the so-called conditioning value, i.e. the fixed value of the conditioning random variable $$w^{\ast }\in R^{n_{W} }$$, such that

$$f_{W} (w^{\ast })>0. \tag{8}$$

Then the function $$f_{Y\vert W=w^{\ast }} :R^{n_{Y} }\to [0,\infty )$$ given by

$$f_{Y\vert W=w^{\ast }} (y)=\frac{f_{X} (y,w^{\ast })}{f_{W} (w^{\ast })} \mbox{ for every } y\in R^{n_{Y} } \tag{9}$$

constitutes a conditional density of probability distribution of the random variable $$Y$$ for the conditioning value $$w^{\ast }$$.
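As a brief aside on the simplified rule, formulas (5) and (6) can be sketched in a few lines for a single coordinate. The sketch reads the leading constant of (5) as $$8\sqrt \pi /3$$ (the standard normal-reference form); the function names and the toy sample are hypothetical, and the Gaussian kernel constants $$W=1/(2\sqrt \pi )$$, $$U=1$$ are used only as an illustrative assumption, since any kernel's constants can be passed in.

```python
import math

def std_estimate(values):
    """Standard deviation estimator of formula (6) for one coordinate."""
    m = len(values)
    s1 = sum(values)
    s2 = sum(v * v for v in values)
    variance = s2 / (m - 1) - s1 * s1 / (m * (m - 1))
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative round-off

def simplified_bandwidth(values, W_K, U_K):
    """Smoothing parameter of formula (5); W_K and U_K are the kernel constants."""
    m = len(values)
    factor = (8.0 * math.sqrt(math.pi) / 3.0) * W_K / (U_K ** 2) / m
    return factor ** 0.2 * std_estimate(values)

# Hypothetical one-dimensional sample and Gaussian kernel constants (assumption).
sample = [1.2, 0.7, 2.9, 1.8, 2.2, 0.4, 1.5, 2.0]
h = simplified_bandwidth(sample, W_K=1.0 / (2.0 * math.sqrt(math.pi)), U_K=1.0)
print(h)
```

For a product kernel the same function would simply be called once per coordinate, which is what keeps the cost linear in the sample size.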
The conditional density $$f_{Y\vert W=w^{\ast }}$$ can thus be treated as a ‘classic’ density, whose form has been made more accurate in practical applications with $$w^{\ast }$$, a concrete value taken by the conditioning variable $$W$$ in a given situation.

Let therefore the random sample

$$\left[ {\begin{array}{c} y_{i} \\ w_{i} \end{array}} \right] \mbox{ for } i=1,2,\ldots ,m, \tag{10}$$

obtained from variable (7) be given. The particular elements of this sample are interpreted as the values $$y_{i}$$ taken in measurements from the random variable $$Y$$, when the conditioning variable $$W$$ assumes the respective values $$w_{i}$$. On the basis of sample (10), one can calculate $$\hat{{f}}_{X}$$, i.e. the kernel estimator of density of the random variable $$X$$ probability distribution, while the sample

$$w_{i} \mbox{ for } i=1,2,\ldots ,m \tag{11}$$

enables the computation of $$\hat{{f}}_{W}$$, the kernel density estimator for the conditioning variable $$W$$. The kernel estimator of conditional density of the random variable $$Y$$ distribution for the conditioning value $$w^{\ast }$$ is defined then, in natural consequence of formula (9), as the function $$\hat{{f}}_{Y\vert W=w^{\ast }} :R^{n_{Y} }\to [0,\infty )$$ given by

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y)=\frac{\hat{{f}}_{X} (y,w^{\ast })}{\hat{{f}}_{W} (w^{\ast })}. \tag{12}$$

If for the estimator $$\hat{{f}}_{W}$$ one uses a kernel with positive values, then the inequality $$\hat{{f}}_{W} (w^{\ast })>0$$ implied by condition (8) is fulfilled for any $$w^{\ast }\in R^{n_{W} }$$. If one uses in pairs the same kernel for the estimator $$\hat{{f}}_{X}$$, for coordinates which correspond to the vector $$W$$, and for the estimator $$\hat{{f}}_{W}$$, then the expression for the kernel estimator of conditional density becomes particularly helpful for practical applications.
Namely, formula (12) can be specified to the form

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y)=\frac{\dfrac{1}{h_{1} h_{2} \ldots h_{n_{Y} } }\sum\limits_{i=1}^m {K_{1} \left( {\frac{y_{1} -y_{i,1} }{h_{1} }} \right)K_{2} \left( {\frac{y_{2} -y_{i,2} }{h_{2} }} \right)\cdots K_{n_{Y} } \left( {\frac{y_{n_{Y} } -y_{i,n_{Y} } }{h_{n_{Y} } }} \right)K_{n_{Y} +1} \left( {\frac{w_{1}^{\ast } -w_{i,1} }{h_{n_{Y} +1} }} \right)K_{n_{Y} +2} \left( {\frac{w_{2}^{\ast } -w_{i,2} }{h_{n_{Y} +2} }} \right)\cdots K_{n_{Y} +n_{W} } \left( {\frac{w_{n_{W} }^{\ast } -w_{i,n_{W} } }{h_{n_{Y} +n_{W} } }} \right)} }{\sum\limits_{i=1}^m {K_{n_{Y} +1} \left( {\frac{w_{1}^{\ast } -w_{i,1} }{h_{n_{Y} +1} }} \right)K_{n_{Y} +2} \left( {\frac{w_{2}^{\ast } -w_{i,2} }{h_{n_{Y} +2} }} \right)\cdots K_{n_{Y} +n_{W} } \left( {\frac{w_{n_{W} }^{\ast } -w_{i,n_{W} } }{h_{n_{Y} +n_{W} } }} \right)} }, \tag{13}$$

where $$K_{j}$$ ($$j=1,2,\ldots ,n_{Y} +n_{W}$$) denote one-dimensional kernels, $$h_{j}$$ ($$j=1,2,\ldots ,n_{Y} +n_{W}$$) mean smoothing parameters individualized for particular coordinates, while assigning to the coordinates

$$y=\left[ {\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n_{Y} } \end{array}} \right],\; w^{\ast }=\left[ {\begin{array}{c} w_{1}^{\ast } \\ w_{2}^{\ast } \\ \vdots \\ w_{n_{W} }^{\ast } \end{array}} \right] \mbox{ and } y_{i} =\left[ {\begin{array}{c} y_{i,1} \\ y_{i,2} \\ \vdots \\ y_{i,n_{Y} } \end{array}} \right],\; w_{i} =\left[ {\begin{array}{c} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,n_{W} } \end{array}} \right] \mbox{ for } i=1,2,\ldots ,m. \tag{14}$$

Define the so-called conditioning parameters $$d_{i}$$ for $$i=1,2,\ldots ,m$$ by the following formula:

$$d_{i} =K_{n_{Y} +1} \left( {\frac{w_{1}^{\ast } -w_{i,1} }{h_{n_{Y} +1} }} \right)K_{n_{Y} +2} \left( {\frac{w_{2}^{\ast } -w_{i,2} }{h_{n_{Y} +2} }} \right)\cdots K_{n_{Y} +n_{W} } \left( {\frac{w_{n_{W} }^{\ast } -w_{i,n_{W} } }{h_{n_{Y} +n_{W} } }} \right). \tag{15}$$

Thanks to the assumption of positive values for the kernels $$K_{n_{Y} +1}$$, $$K_{n_{Y} +2}$$, $$\ldots$$, $$K_{n_{Y} +n_{W} }$$, these parameters are also positive. So the kernel estimator of conditional density (9) can be finally presented in the form

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y)=\frac{1}{h_{1} h_{2} \ldots h_{n_{Y} } \sum\limits_{i=1}^m {d_{i} } }\sum\limits_{i=1}^m {d_{i} K_{1} \left( {\frac{y_{1} -y_{i,1} }{h_{1} }} \right)K_{2} \left( {\frac{y_{2} -y_{i,2} }{h_{2} }} \right)\cdots K_{n_{Y} } \left( {\frac{y_{n_{Y} } -y_{i,n_{Y} } }{h_{n_{Y} } }} \right)}. \tag{16}$$

The value of the parameter $$d_{i}$$ characterizes the ‘distance’ of the given conditioning value $$w^{\ast }$$ from $$w_{i}$$, i.e. from the value of the conditioning variable for which the $$i$$-th element of the random sample was obtained. Then estimator (16) can be interpreted as the linear combination of kernels mapped to particular elements of a random sample obtained for the variable $$Y$$, where the coefficients of this combination $$d_{i}$$ characterize how representative these elements are for the given value $$w^{\ast }$$.

For further investigations, the (one-dimensional) Cauchy kernel

$$K_{j} (x)=\frac{2}{\pi }\frac{1}{(1+x^{2})^{2}} \mbox{ for } j=1,2,\ldots ,n_{Y} +n_{W} \tag{17}$$

will be applied. The constants occurring in formula (5) are, for Cauchy kernel (17), equal to:

$$U(K_{j} )=1 \tag{18}$$

$$W(K_{j} )=\frac{5}{4\pi }. \tag{19}$$

To summarize: formula (16), substituting (14) and (15), with (5) and (6) and (17)–(19), constitutes comprehensive material, allowing convenient calculation of a distribution-free estimator of conditional density, based on random sample (10).

## 3. An algorithm for atypical elements identification

Drawing on the material presented in the previous section, an algorithm for the conditional identification of atypical (rare) elements will now be investigated. So, consider the data set comprised of elements $$y_{i}$$ obtained for the conditioning values $$w_{i}$$ (for $$i=1,2,\ldots ,m$$), respectively, which can be treated as representative for a population under research. Denote also a tested element as

$$\left[ {\begin{array}{c} y^{\ast } \\ w^{\ast } \end{array}} \right]\in R^{n_{Y} +n_{W} }. \tag{20}$$

It can be interpreted as the value $$y^{\ast }$$ of describing variables, obtained for the conditioning value $$w^{\ast }$$. The aim of the procedure is to ascertain whether, for the value $$w^{\ast }$$, the element $$y^{\ast }$$ should be considered as atypical in the sense of rare occurrence, or not.

For this purpose, fix first the number

$$r\in (0,1) \tag{21}$$

defining a desired proportion of atypical to typical elements, or more accurately the share of atypical elements in a population. In practice the values $$r=0.01,\;0.05,\;0.1$$ can be proposed.

In reference to the notations in the previous section, let us treat the elements $$y_{i}$$ as the realizations of the $$n_{Y}$$-dimensional random variable $$Y$$, and the elements $$w_{i}$$ as respective realizations of the conditioning random variable $$W$$, and then calculate the conditional density $$\hat{{f}}_{Y\vert W=w^{\ast }}$$. Next, let us consider the set of its values for the elements $$y_{i}$$, therefore

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y_{i} ) \mbox{ for } i=1,2,\ldots ,m. \tag{22}$$

Note that the above values are real (one-dimensional). Each specific value $$\hat{{f}}_{Y\vert W=w^{\ast }} (y_{i} )$$ refers to the probability of occurrence of the element $$y_{i}$$ when the value of the conditioning variable is $$w^{\ast }$$.
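The calculation of values (22) can be sketched as follows, assuming the Cauchy kernel (17) on every coordinate. The function names, the toy sample and the smoothing parameters are hypothetical; in practice the latter would come from formula (5) or the plug-in method rather than being typed in by hand.

```python
import math

def cauchy_kernel(x):
    """One-dimensional Cauchy kernel of formula (17)."""
    return 2.0 / (math.pi * (1.0 + x * x) ** 2)

def conditioning_weights(w_sample, w_star, h_w):
    """Conditioning parameters d_i of formula (15)."""
    weights = []
    for w_i in w_sample:
        d_i = 1.0
        for w_ij, w_star_j, h_j in zip(w_i, w_star, h_w):
            d_i *= cauchy_kernel((w_star_j - w_ij) / h_j)
        weights.append(d_i)
    return weights

def conditional_density(y, y_sample, d, h_y):
    """Kernel estimator of conditional density, formula (16)."""
    total = 0.0
    for y_i, d_i in zip(y_sample, d):
        term = d_i
        for y_j, y_ij, h_j in zip(y, y_i, h_y):
            term *= cauchy_kernel((y_j - y_ij) / h_j)
        total += term
    return total / (math.prod(h_y) * sum(d))

# Hypothetical sample: one describing and one conditioning coordinate.
y_sample = [[1.0], [1.5], [2.0], [4.0], [4.2], [5.0]]
w_sample = [[0.1], [0.0], [-0.2], [1.0], [1.1], [0.9]]
w_star, h_y, h_w = [0.0], [0.5], [0.4]  # smoothing parameters assumed, not computed

d = conditioning_weights(w_sample, w_star, h_w)
z = [conditional_density(y_i, y_sample, d, h_y) for y_i in y_sample]  # values (22)
print(z)
```

In this toy setting the first three sample elements were observed close to $$w^{\ast }=0$$, so they receive much larger weights $$d_{i}$$ and dominate the estimated conditional density.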
So, the greater the value $$\hat{{f}}_{Y\vert W=w^{\ast }} (y_{i} )$$, the more typical the element $$y_{i}$$ can be interpreted to be for the given $$w^{\ast }$$. Let us treat as typical those elements for which the density $$\hat{{f}}_{Y\vert W=w^{\ast }}$$ is greater than a given limit value, and as atypical those for which it is smaller. In accordance with the assumptions made above, a natural limit value is constituted by a conditional quantile of the order $$r$$ for the condition $$w^{\ast }$$; its estimator is denoted hereinafter as $$\hat{{q}}_{r\vert w^{\ast }}$$.

Finally, if for the fixed conditioning value $$w^{\ast }$$, the value of the density function for the tested element $$y^{\ast }$$, i.e.

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y^{\ast }) \tag{23}$$

is calculated, and the condition

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y^{\ast })\leqslant \hat{{q}}_{r\vert w^{\ast }} \tag{24}$$

is fulfilled, then the tested element should be ascertained as atypical, while in the opposite case

$$\hat{{f}}_{Y\vert W=w^{\ast }} (y^{\ast })>\hat{{q}}_{r\vert w^{\ast }} \tag{25}$$

as typical. In this way, the space $$R^{n_{Y} }$$ is divided into two regions: the first containing atypical elements, which fulfil condition (24), and the second consisting of typical ones, satisfying (25), such that, up to estimation errors, the probability of the former is $$r$$, and of the latter $$1-r$$.

There remains, however, to calculate the above mentioned value of the conditional quantile estimator $$\hat{{q}}_{r\vert w^{\ast }}$$. To this aim the kernel estimator scheme presented in the article (Kulczycki et al., 2015), fitted to the task investigated here, will be applied. Using the same technique as that applied before to construct the density $$\hat{{f}}_{Y\vert W=w^{\ast }}$$ allows for better use of procedures already executed, and of the previously gained experience of the researcher.
The role of the describing factor (previously the $$n_{Y}$$-dimensional variable $$Y$$) will now be taken by the one-dimensional random variable $$Z$$ ($$n_{Z} =1$$); therefore, in place of variable (7) consider instead the $$(1+n_{W} )$$-dimensional composition

$$X=\left[ {\begin{array}{c} Z \\ W \end{array}} \right]. \tag{26}$$

The values (22) will be treated as experimentally obtained realizations of the random variable $$Z$$, evaluated, as previously, for the realizations $$w_{i}$$ of the $$n_{W}$$-dimensional conditioning variable $$W$$, respectively. Thus, denoting

$$z_{i} =\hat{{f}}_{Y\vert W=w^{\ast }} (y_{i} ) \mbox{ for } i=1,2,\ldots ,m, \tag{27}$$

random sample (10) is now replaced by

$$\left[ {\begin{array}{c} z_{i} \\ w_{i} \end{array}} \right] \mbox{ for } i=1,2,\ldots ,m. \tag{28}$$

Then, the natural kernel estimator of a conditional quantile for the random variable $$Z$$ with conditioning variable $$W$$ assuming the fixed value $$w^{\ast }$$ is the solution of the following equation with the argument $$\hat{{q}}_{r\vert w^{\ast }}$$:

$$\int_{-\infty }^{\hat{{q}}_{r\vert w^{\ast }} } {\hat{{f}}_{Z\vert W=w^{\ast }} (z)\,\mbox{d}z} =r. \tag{29}$$

For estimation of the conditional density $$\hat{{f}}_{Z\vert W=w^{\ast }}$$ appearing in the above formula, the kernel estimator (16) will be used. Then, for current notations, equation (29) takes the following form:

$$\int_{-\infty }^{\hat{{q}}_{r\vert w^{\ast }} } {\frac{1}{h_{Z} }\sum\limits_{i=1}^m {d_{i} K_{Z} \left( {\frac{z-z_{i} }{h_{Z} }} \right)} \,\mbox{d}z} -r\sum\limits_{i=1}^m {d_{i} } =0, \tag{30}$$

where $$K_{Z}$$ and $$h_{Z}$$ mean a one-dimensional kernel (continuous with positive values), and a smoothing parameter (calculated for values (27), possibly using formula (5)) corresponding to the variable $$Z$$. Let also the kernel $$K_{Z}$$ be such that its primitive $$I_{Z} :R\to [0,1]$$ given as $$I_{Z} (w)=\int_{-\infty }^w {K_{Z} (u)\,\mbox{d}u}$$ is expressed by a relatively simple analytical formula. Equation (30) can be stated then equivalently in the following form:

$$\sum\limits_{i=1}^m {d_{i} I_{Z} \left( {\frac{\hat{{q}}_{r\vert w^{\ast }} -z_{i} }{h_{Z} }} \right)} -r\sum\limits_{i=1}^m {d_{i} } =0. \tag{31}$$

If the left side of the above equation is denoted by $$L$$, i.e.
$$L(\hat{{q}}_{r\vert w^{\ast }} )=\sum\limits_{i=1}^m {d_{i} I_{Z} \left( {\frac{\hat{{q}}_{r\vert w^{\ast }} -z_{i} }{h_{Z} }} \right)} -r\sum\limits_{i=1}^m {d_{i} }, \tag{32}$$

then $$\lim\limits_{\hat{{q}}_{r\vert w^{\ast }} \to -\infty } L(\hat{{q}}_{r\vert w^{\ast }} )<0$$, $$\lim\limits_{\hat{{q}}_{r\vert w^{\ast }} \to \infty } L(\hat{{q}}_{r\vert w^{\ast }} )>0$$, the function $$L$$ is (strictly) increasing and its derivative is simply expressed by

$${L}'(\hat{{q}}_{r\vert w^{\ast }} )=\frac{1}{h_{Z} }\sum\limits_{i=1}^m {d_{i} K_{Z} \left( {\frac{\hat{{q}}_{r\vert w^{\ast }} -z_{i} }{h_{Z} }} \right)}. \tag{33}$$

In this situation, the solution of equation (31) can be effectively calculated on the basis of Newton’s algorithm (Kincaid & Cheney, 2002) as the limit of the sequence $$\{\hat{{q}}_{r\vert w^{\ast },j} \}_{j=0}^{\infty }$$ defined by

$$\hat{{q}}_{r\vert w^{\ast },0} =\frac{\sum\limits_{i=1}^m {d_{i} z_{i} } }{\sum\limits_{i=1}^m {d_{i} } } \tag{34}$$

$$\hat{{q}}_{r\vert w^{\ast },j+1} =\hat{{q}}_{r\vert w^{\ast },j} -\frac{L(\hat{{q}}_{r\vert w^{\ast },j} )}{{L}'(\hat{{q}}_{r\vert w^{\ast },j} )} \mbox{ for } j=0,1,\ldots \tag{35}$$

with the functions $$L$$ and $${L}'$$ being given by dependencies (32) and (33), whereas a stop criterion takes on the form

$$\vert \hat{{q}}_{r\vert w^{\ast },j} -\hat{{q}}_{r\vert w^{\ast },j-1} \vert \leqslant 0.01\,\hat{{\sigma }}_{Z}, \tag{36}$$

while $$\hat{{\sigma }}_{Z}$$ is the estimator of the standard deviation of the random variable $$Z$$, calculated from formula (6) for elements (27). It is worth noting that many elements used above to estimate the conditional quantile, in particular the values of the conditioning parameters and the smoothing parameters, were already obtained earlier during the calculation of the conditional density $$\hat{{f}}_{Y\vert W=w^{\ast }}$$.

For Cauchy kernel (17) recommended here, which is continuous with positive values, its primitive is given as

$$I(x)=\frac{1}{\pi }\left[ {\arctan (x)+\frac{x}{1+x^{2}}} \right]+\frac{1}{2}. \tag{37}$$

Thanks to the use of kernel estimators with strong averaging properties, inference takes place not only for data obtained exactly for $$w^{\ast }$$ (among the values $$w_{i}$$ there may be too few of these for reliable consideration, or even none at all), but also for neighbouring values, in proportion to their ‘closeness’ with respect to $$w^{\ast }$$.
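Iteration (34)–(36) together with decision rule (24)/(25) can be sketched as below, assuming the Cauchy kernel and its primitive (37). All function names and numerical values are hypothetical; the toy values of $$z_{i}$$ and $$d_{i}$$ are chosen so that the starting point (34) coincides with the mode of the estimated density, where plain Newton iteration is well behaved. The iteration is not safeguarded, so for awkward samples a bracketing fallback might be worth adding.

```python
import math

def cauchy_kernel(x):
    """Cauchy kernel, formula (17)."""
    return 2.0 / (math.pi * (1.0 + x * x) ** 2)

def cauchy_primitive(x):
    """Primitive of the Cauchy kernel, formula (37)."""
    return (math.atan(x) + x / (1.0 + x * x)) / math.pi + 0.5

def L_value(q, z, d, h_z, r):
    """Left side of equation (31), i.e. formula (32)."""
    s = sum(d_i * cauchy_primitive((q - z_i) / h_z) for z_i, d_i in zip(z, d))
    return s - r * sum(d)

def L_derivative(q, z, d, h_z):
    """Derivative of L, formula (33)."""
    return sum(d_i * cauchy_kernel((q - z_i) / h_z) for z_i, d_i in zip(z, d)) / h_z

def conditional_quantile(z, d, h_z, r, sigma_z, max_iter=100):
    """Newton iteration (34)-(35) with stop criterion (36)."""
    q = sum(d_i * z_i for z_i, d_i in zip(z, d)) / sum(d)  # starting point (34)
    for _ in range(max_iter):
        q_next = q - L_value(q, z, d, h_z, r) / L_derivative(q, z, d, h_z)
        if abs(q_next - q) <= 0.01 * sigma_z:
            return q_next
        q = q_next
    return q  # max_iter is a hypothetical safety cap, not part of the source

# Hypothetical density values (27), conditioning parameters and settings.
z = [0.48, 0.50, 0.52]
d = [1.0, 2.0, 1.0]
q_hat = conditional_quantile(z, d, h_z=0.1, r=0.05, sigma_z=0.02)

f_star = 0.25                  # hypothetical density value (23) of a tested element
is_atypical = f_star <= q_hat  # condition (24); otherwise (25) holds
print(q_hat, is_atypical)
```

Note that the weights `d` and the kernel functions would normally be reused from the density calculation, in line with the remark that most ingredients of the quantile estimate are already available.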
Finally, if the material included in Section 2 is used to create the estimator of the conditional density $$\hat{{f}}_{Y\vert W=w^{\ast }}$$, then application of the algorithm (34)–(36) with substitutions (32), (33), (37) and (15) with (17), (5) and (6) in respect of (18) and (19), completes the procedure for identification of atypical (rare) elements in a conditional and distribution-free approach, for an assumed proportion of atypical to typical elements (21), being the subject of these investigations.

## 4. Numerical verification

The procedure presented in this article has been numerically verified in detail. The obtained results confirmed its correct functioning and the full achievement of the goals set out in the Introduction. Particularly, in the case of a positive correlation between the describing and conditioning factors, the greater (or smaller) the value of the conditioning attributes, the greater (or smaller) the values of describing elements detected to be atypical. For a negative correlation, the above relation is inverse.

To offer an illustrative example, assume that the data is three-dimensional, where the first two coordinates are describing, and the third conditioning, i.e. in accordance with the notations implemented earlier, we have $$\left[ {\begin{array}{c} Y_{1} \\ Y_{2} \\ W \end{array}} \right]$$. These are obtained from a pseudorandom number generator, the distribution of which consists of five normal components with the following expected values, covariance matrices and shares:

$$E_{1} =\left[ {\begin{array}{c} 3 \\ 3 \\ 0 \end{array}} \right],\quad Cov_{1} =\left[ {\begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array}} \right],\quad 35\% \tag{38}$$

$$E_{2} =\left[ {\begin{array}{c} -3 \\ 3 \\ 0 \end{array}} \right],\quad Cov_{2} =\left[ {\begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array}} \right],\quad 20\% \tag{39}$$

$$E_{3} =\left[ {\begin{array}{c} -3 \\ -3 \\ 0 \end{array}} \right],\quad Cov_{3} =\left[ {\begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array}} \right],\quad 20\% \tag{40}$$

$$E_{4} =\left[ {\begin{array}{c} 3 \\ -3 \\ 0 \end{array}} \right],\quad Cov_{4} =\left[ {\begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array}} \right],\quad 15\% \tag{41}$$

$$E_{5} =\left[ {\begin{array}{c} 0 \\ 0 \\ 0 \end{array}} \right],\quad Cov_{5} =\left[ {\begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array}} \right],\quad 10\%. \tag{42}$$

The distribution, therefore, is multimodal and asymmetric. The first coordinate $$Y_{1}$$ is negatively correlated with the conditioning factor $$W$$, while the second $$Y_{2}$$ positively.
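A test sample of the kind described by (38)–(42) can be generated with the short sketch below. It is only one possible way of drawing from such a mixture: the function names, the seed and the hand-written Cholesky factorization are assumptions made here for illustration, and the article does not state which generator was actually used.

```python
import math
import random

# Expected values and shares of the five components, formulas (38)-(42).
MEANS = [(3, 3, 0), (-3, 3, 0), (-3, -3, 0), (3, -3, 0), (0, 0, 0)]
SHARES = [0.35, 0.20, 0.20, 0.15, 0.10]
# Covariance matrix common to all components (coordinates Y1, Y2, W).
COV = [[1.0, 0.0, -0.4], [0.0, 1.0, 0.7], [-0.4, 0.7, 1.0]]

def cholesky(a):
    """Lower-triangular factor low such that low * low^T equals a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_mixture(m, seed=0):
    """Draw m three-dimensional elements [y1, y2, w] from the mixture."""
    rng = random.Random(seed)
    low = cholesky(COV)
    data = []
    for _ in range(m):
        mean = rng.choices(MEANS, weights=SHARES, k=1)[0]  # pick a component
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]        # independent normals
        data.append([mean[i] + sum(low[i][k] * g[k] for k in range(i + 1))
                     for i in range(3)])
    return data

data = sample_mixture(20000)
print(data[0])
```

Splitting each generated element into its first two coordinates and its third gives the describing values $$y_{i}$$ and the conditioning values $$w_{i}$$ of sample (10).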
Tests were carried out for $$m$$ sizes from 1,000 to 1,000,000. The former seems not to be an excessive requirement for three-dimensional tasks; the latter was possible thanks to the application of procedures of linear calculational complexity, in particular the use of formula (5). To improve results, corrections for particular coordinates of the smoothing parameter were occasionally made in this task; such modifications, often based on visual observations, are common in kernel estimation methods.

A graphic illustration of results can be found in Fig. 1. Demonstrative contours and the percentage share of each component of distribution (38)–(42) are shown. Detected atypical (rare) elements are denoted by the small circles. It should be stressed that they are located not only on the distribution peripheries, but also around the centre, since realizations of the random variable do not occur frequently in this area and in consequence the distribution density is low. This effect is achieved by applying a nonparametric method of kernel estimators.

Fig. 1. Illustrative locations of detected atypical elements for distribution (38)–(42); $$m=10,000$$, $$r=0.05$$, $$w^{\ast }=0$$.

The mean values of detected atypical (rare) elements for particular values of the conditioning factor are shown in Table 1. Every cell shows, one above the other, the mean values for the first coordinate $$Y_{1}$$ and the second $$Y_{2}$$ (denoted as $$\bar{{y}}_{1}$$ and $$\bar{{y}}_{2}$$, respectively) of the atypical elements being detected. Due to the symmetry of distribution (38)–(42), the results for negative conditioning values $$w^{\ast }$$ showed themselves to be symmetric.
As expected, a growth in the conditioning factor value was accompanied by a drop in the mean value of the first coordinate of detected atypical elements, and an increase in that of the second. This agrees with intuition, since, as implied by formulas (38)–(42), the coordinates $$Y_{1}$$ and $$W$$ are negatively correlated, while $$Y_{2}$$ and $$W$$ are positively correlated. In this way, knowledge of the current conditioning value $$w^{\ast }$$ enables models used for practical purposes to be designed significantly more precisely.

The mean values of atypical elements were practically (to within around 10% of standard deviation) independent of the assumed proportion of atypical to typical elements, defined by the parameter $$r$$. This parameter can also undergo correction in order to make more precise the required number of atypical elements and in consequence the assumed share of such elements in a population. Increasing the value $$r$$ results in proportionate growth in their number; similarly, its decrease implies the opposite effect.

The second coordinate $$Y_{2}$$, more than the first $$Y_{1}$$, was dependent on the conditioning variable $$W$$. This is justified by the fact that, in formulas (38)–(42), the element $$cov_{23} =cov_{32}$$ is greater (in the sense of an absolute value) than $$cov_{13} =cov_{31}$$.

Table 1. Mean values of detected atypical elements; $$m=10,000$$, $$r=0.05$$, means obtained from 100 runs.
| $$w^{\ast }$$ | $$r=0.01$$ | $$r=0.05$$ | $$r=0.1$$ |
|---|---|---|---|
| 0 | $$\bar{{y}}_{1} = 0.11$$; $$\bar{{y}}_{2} = -0.18$$ | $$\bar{{y}}_{1} = -0.09$$; $$\bar{{y}}_{2} = -0.13$$ | $$\bar{{y}}_{1} = -0.04$$; $$\bar{{y}}_{2} = -0.06$$ |
| 1 | $$\bar{{y}}_{1} = -0.87$$; $$\bar{{y}}_{2} = 1.03$$ | $$\bar{{y}}_{1} = -0.74$$; $$\bar{{y}}_{2} = 0.99$$ | $$\bar{{y}}_{1} = -0.61$$; $$\bar{{y}}_{2} = 0.95$$ |
| 2 | $$\bar{{y}}_{1} = -1.10$$; $$\bar{{y}}_{2} = 1.87$$ | $$\bar{{y}}_{1} = -1.07$$; $$\bar{{y}}_{2} = 1.89$$ | $$\bar{{y}}_{1} = -0.96$$; $$\bar{{y}}_{2} = 1.72$$ |

## 5. Empirical verification: fault detection

The procedure was also successfully verified in a real application task in control engineering. Based on the current state of the system, atypical elements were discovered, indicating potentially arising failures of a supervised device (Kulczycki, 1998; Korbicz et al., 2004).

Consider a mechanical system with dynamics modelled by the differential inclusion

$$\ddot{{y}}(t)\in H(\dot{{y}}(t),y(t))+u(t), \tag{43}$$

where $$y$$ expresses the position of the object, $$u$$ denotes a piecewise continuous control and the function $$H$$, characterizing resistance to motion, is piecewise continuous and additionally multivalued at the points of discontinuity (particularly for $$\dot{{y}}(t)=0$$ it represents phenomena connected with static friction). In the event of no resistance to motion, i.e.
when $$H\equiv 0$$, inclusion (43) can be reduced to a differential equation $$\ddot{{y}}(t)=m\,u(t)$$ expressing the mass $$m$$ submitted to the action of a force according to Newton’s second law of dynamics. The above task constitutes therefore a problem of fundamental importance in the control of manipulators and robots (Kulczycki, 2000; Kulczycki & Wisniewski, 2002). This concept was the basis for creating a complex algorithm for controlling a laboratory robot arm.

The shape of the function $$H$$ is quite complex and its interpretation multifaceted (Blau, 2009). The dependence of motion resistance on the velocity $$\dot{{y}}(t)$$ predominates; its schematic is illustrated in Fig. 2. For zero velocity this dependence is multivalued, as the static friction phenomenon takes place here. As velocity grows, resistance to motion decreases due to the phenomenon of ‘skipping’ over the roughness. At higher speeds there occurs a natural increase in motion resistance related to e.g. environmental (air or water) resistance. Due to the changing direction of the motion resistance force when displacement (velocity) changes direction, the function $$H$$ is discontinuous near zero. The motion resistance value may also depend on the position $$y(t)$$, mainly because of non-homogeneity of materials. The identification of the function $$H$$ is, therefore, very difficult, not only as a result of its dependence on other less important factors (e.g. temperature, dampness) but also because of hysteresis occurring for low velocities, as well as global metrological difficulties when measuring motion resistance. In this situation, the stochastic approach, in itself accounting for inaccuracies in the form of probability uncertainties, becomes very attractive in practical applications.

Fig. 2. Motion resistance as a function of velocity.
Therefore, let the value of motion resistance be a describing variable, while velocity and position constitute conditioning variables. Obtaining a set of representative data (10) does not present any practical difficulty due to the repetitive actions of the manipulator. Detection of an atypical element is evidence of a fault in the movement system, e.g. mechanism seizure. By appropriately changing the number $$r$$ fixed by formula (21), one can adjust the sensitivity of the basic fault detection system created in this way. Lowering it results in diminished sensitivity: a smaller number of false alarms, but also a greater probability of missing a potential defect; increasing the number $$r$$ has the opposite effects. The possibility of such an adaptation must be noted as an advantage.

The results of these experiments positively verified the concept presented in this article and confirmed the proper functioning of the resulting statistical inference system. The data set also contained elements characteristic of malfunctions. In cases where the symptoms appeared abruptly, the anomalies of the device were promptly discovered. If, on the other hand, the fault was accompanied by a slow progression of symptoms, it was forecast and later also discovered. In this case, on the basis of previous values of the function $$\hat{f}_{Y\vert W=w^{\ast}}$$, a forecast is calculated and compared to the quantile, in accordance with the scheme described in Section 3. One should underline that fault prognosis, still rare in practical applications, proved to be highly effective in the case of slowly progressing symptoms. It discovered anomalies before the object’s characteristics moved beyond the range of correct conditions for the system’s functioning, thanks to the proper recognition of a change in the trend of values of the symptom vector, which indicates an adverse direction of its evolution.

Analysing Fig. 2, it can be seen that the introduction of conditioning variables and the information contained in the current conditioning value allows for a significantly improved fault detection system. Namely, note that a doubled value of motion resistance at the velocity $$0.25\,v_{\max}$$, which clearly shows the occurrence of a fault, is correct for velocities close to zero. The introduction of the current conditioning value lowered the number of false alarms several times over with respect to the unconditional version, at the same sensitivity of the fault detection system.

### 6. Final comments and summary

The procedure presented in this article has been given in its basic form; however, its transparent nature and clear interpretation facilitate specific modifications and generalizations. Above all, this allows the inclusion of conditioning factors other than continuous ones. Similarly to kernel estimator definition (2) formulated in Section 2 for continuous random variables, one can construct kernel estimators for categorical variables, including their compositions with continuous ones; see (Li et al., 2006; Gaosheng et al., 2009). After introducing categorical factors to the algorithm worked out here, it undergoes practically no changes, apart from technical ones resulting from calculational differences. This property should be particularly underlined considering modern data analysis tasks, which more and more often take advantage of many different configurations of particular types of attributes.

Summarizing the material presented in this article, the investigated procedure can be described in the following sequence:

1. define vector (7) with separate describing and conditioning coordinates;
2. experimentally obtain random sample (10);
3. assume the kernel form; the Cauchy form (17) is proposed here;
4. for every coordinate of vector (7) calculate a smoothing parameter using formula (5) with (6) (for the Cauchy kernel one should use dependences (18) and (19)) or another method available in the literature;

after obtaining the conditioning value $$w^{\ast}$$:

5. establish the values of conditioning parameters (15);
6. construct the kernel estimator of conditional density (16);
7. compute random sample (28) using (27);
8. with Newton’s algorithm (34)–(36), substituting (32) and (33) (for the Cauchy kernel also (37)), calculate the value of the conditional quantile estimator;

once the tested value $$y^{\ast}$$ has been obtained:

9. calculate the value of the kernel estimator of conditional density for the tested element (23);
10. if condition (24) is fulfilled then the tested element should be classed as atypical, whereas if (25) holds then it is typical.

Note that if the conditioning factor changes gradually, e.g. daily changes in temperature, then the above algorithm can be decomposed into two stages. The first phase contains procedures for calculating parameter values (i.e. the actions described above in points 1–8) and may be performed earlier, in advance. Then, while working in real time, it is enough to carry out just the second, much faster phase, which consists only of calculating a value of the conditional density function and comparing it to the quantile (points 9 and 10).

Finally, this article presents an algorithm for identifying atypical (rare) elements, including in the multidimensional case, with particular coordinates being continuous and, for the conditioning factors, also categorical. The conditional approach allows in practice for refinement of the model by including the current value of the conditioning factors. Use of nonparametric concepts frees the worked out procedure from the distributions of describing and conditioning attributes. The investigated algorithm is ready for direct use without any additional laborious research or calculations.
The presented concept is universal in nature and can be applied in data analysis for a wide range of tasks in science and practice, in the fields of engineering, economics and management, environmental and social issues, biomedicine and other related fields. The results have been verified positively based on generated and real data for practical problems from control engineering.

### Acknowledgements

Our heartfelt thanks go to our colleagues Damian Kruszewski and Cyprian Prochot, with whom we collaborated on the subject presented here.

### References

- Aggarwal, C. C. (2013) Outlier Analysis. Springer, New York.
- Barnett, V. & Lewis, T. (1994) Outliers in Statistical Data. Wiley, New York.
- Blau, P. A. (2009) Friction Science and Technology: From Concepts to Applications. Taylor & Francis, Boca Raton.
- Dawid, A. P. (1979) Conditional independence in statistical theory. J. Royal Statistical Soc., Series B, 41, 1–31.
- Gaosheng, J., Rui, L. & Zhongwen, L. (2009) Nonparametric estimation of multivariate CDF with categorical and continuous data. Adv. Econometrics, 25, 291–318.
- Hawkins, D. M. (1980) Identification of Outliers. Chapman and Hall, London.
- Hodge, V. & Austin, J. (2004) A survey of outlier detection methodologies. Artif. Intell. Rev., 22, 85–126.
- Kincaid, D. & Cheney, W. (2002) Numerical Analysis. Brooks/Cole, Pacific Grove.
- Korbicz, J., Kościelny, J. M., Kowalczuk, Z. & Cholewa, W. (eds) (2004) Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer, Berlin.
- Kulczycki, P. (1998) Wykrywanie uszkodzen w systemach zautomatyzowanych metodami statystycznymi. Alfa, Warsaw.
- Kulczycki, P. (2000) Fuzzy controller for mechanical systems. IEEE Trans. Fuzzy Syst., 8, 645–652.
- Kulczycki, P. (2005) Estymatory jadrowe w analizie systemowej. WNT, Warsaw.
- Kulczycki, P. (2008) Kernel estimators in industrial applications. In Soft Computing Applications in Industry (Prasad, B. ed.). Springer, pp. 69–91.
- Kulczycki, P. & Charytanowicz, M. (2010) A complete gradient clustering algorithm formed with kernel estimators. Int. J. Appl. Math. Comput. Sci., 20, 123–134.
- Kulczycki, P. & Charytanowicz, M. (2013) Conditional parameter identification with different losses of under- and overestimation. Appl. Math. Modell., 37, 2166–2177.
- Kulczycki, P., Charytanowicz, M. & Dawidowicz, A. (2015) A convenient ready-to-use algorithm for a conditional quantile estimator. Appl. Math. Inf. Sci., 9, 841–850.
- Kulczycki, P., Charytanowicz, M., Kowalski, P. A. & Łukasik, S. (2016) Atypical (rare) elements detection – a conditional nonparametric approach. Computational Modeling of Objects Presented in Images: Fundamentals, Methods, and Applications, Niagara Falls (USA), 21–23 September 2016, LNCS, Springer, Cham, in press.
- Kulczycki, P., Hryniewicz, O. & Kacprzyk, J. (eds) (2007) Techniki informacyjne w badaniach systemowych. WNT, Warsaw.
- Kulczycki, P. & Kowalski, P. A. (2015) Bayes classification for nonstationary patterns. Int. J. Comput. Methods, 12, ID 1550008 (19 pages).
- Kulczycki, P. & Łukasik, S. (2014) An algorithm for reducing dimension and size of sample for data exploration procedures. Int. J. Appl. Math. Comput. Sci., 24, 133–149.
- Kulczycki, P. & Wisniewski, R. (2002) Fuzzy controller for a system with uncertain load. Fuzzy Sets Syst., 131, 185–195.
- Larose, D. T. (2005) Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Hoboken.
- Li, Q., Ouyang, D. & Racine, J. S. (2006) Cross-validation and the estimation of probability distributions with categorical data. J. Nonparametric Statistics, 18, 69–100.
- Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
- Wand, M. P. & Jones, M. C. (1995) Kernel Smoothing. Chapman and Hall, London.

© The authors 2017. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.

IMA Journal of Mathematical Control and Information, Oxford University Press

# Identification of atypical (rare) elements—a conditional, distribution-free approach

Advance Article, Mar 8, 2017, 15 pages

- Publisher: Oxford University Press
- ISSN: 0265-0754
- eISSN: 1471-6887
- DOI: 10.1093/imamci/dnx007

### Abstract

The discovery of atypical elements has become one of the most important challenges in data analysis and exploration. At the same time it is not an easy matter with difficult conditions, and not even strictly defined. This article presents a ready-to-use procedure for identifying atypical elements in the sense of rarely occurring. The issue is considered in a conditional approach, where describing and conditioning variables can be multidimensional continuous with the second type also potentially categorical. The application of nonparametric concepts frees the investigated procedure from distributions of describing and conditioning variables. Ease of interpretation and completeness of the presented material lend themselves to the use of the worked out method in a wide range of tasks in various applications of data analysis in science and practice.

### 1. Introduction

Atypical elements (often rashly referred to as outliers) can intuitively be considered as significantly differing from the rest of a data set (Hawkins, 1980; Barnett & Lewis, 1994; Hodge & Austin, 2004; Aggarwal, 2013). Their occurrence most commonly results from considerable (‘gross’) errors arising during measurement, collection, storage and processing. In practice, they hinder the correct utilization of knowledge available, and their elimination or correction enables the use of more convenient and more effective methods at later stages of data analysis and exploration. What is more, in marketing, atypical elements may represent cases so different from the majority of the population that any individual decision based on such a group, so distinct and insignificant, often turns out to be economically unviable. In engineering, the presence of atypical states in dynamic systems may be evidence of malfunction of a component or the entire device, and proper reaction usually enables any serious consequences to be avoided.
The detection of an atypical element may also signify an attempt to hack into a computer system. On the other hand, in many social and economic problems the appearance of such an element could be a positive trait, as it may characterize completely new trends or uncommon phenomena, and their quick discovery allows the appropriate specific action to be taken in anticipation. Therefore, the identification of atypical elements constitutes a natural cognitive challenge of great scientific and practical meaning (Kulczycki et al., 2007).

The task of identifying atypical elements is subject to very difficult conditions. Above all, most often there is no definition or even criterion indicating which elements should be considered atypical. Moreover, we do not have a pattern of atypical elements, and even if we did, it would be, by its nature, small in number and strongly unbalanced with respect to the set of typical elements. In the simplest one-dimensional case, where the data distribution is unimodal, atypical elements can be considered to be those distant (according to the basic meaning of the term ‘outlier’) from the median by more than $$3/2$$ of the interquartile range (Larose, 2005; Section 2.7). However, a similar approach cannot be taken concerning complex multimodal distributions. For example, when particular modes are significantly distanced from each other, elements lying in the centre between them should be regarded as atypical, although they may be located very near to the median, definitely closer than $$3/2$$ of the interquartile range.

In this article, elements occurring rarely in the population are understood to be atypical. Thus, having a representative set of data, we will highlight regions of lowest distribution density, such that the joint probability of elements appearing in those regions is less than or equal to the assumed value. Thanks to the methodology applied here, the above regions can be of any shape and location; they may comprise many separate parts.
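
The multimodal counterexample described above can be illustrated with a short sketch. It assumes a hypothetical two-mode sample (modes at -5 and +5, chosen only for illustration) and applies the median and interquartile-range rule quoted from Larose (2005); the function name `iqr_rule_flags` is an illustrative choice, not taken from the article.

```python
import numpy as np

def iqr_rule_flags(sample, points, factor=1.5):
    """Flag points lying farther from the median than factor * IQR."""
    q1, med, q3 = np.percentile(sample, [25, 50, 75])
    return np.abs(np.asarray(points, dtype=float) - med) > factor * (q3 - q1)

rng = np.random.default_rng(0)
# Hypothetical bimodal sample: two well-separated normal modes.
sample = np.concatenate([rng.normal(-5.0, 0.5, 500),
                         rng.normal(5.0, 0.5, 500)])

# The point 0.0 lies in the empty gap between the modes,
# yet it is close to the median of the whole sample.
flags = iqr_rule_flags(sample, [0.0])
gap_count = int(np.sum(np.abs(sample) < 1.0))  # sample elements inside (-1, 1)
print("flagged by the IQR rule:", bool(flags[0]))
print("sample elements near 0:", gap_count)
```

In this sketch the rule leaves the point unflagged, although the sample contains no elements in its neighbourhood, which is the situation a density-based notion of rarity is meant to capture.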
In numerous practical tasks, the data possessed can be significantly refined through the measurement and inclusion of the current value of a quantity considerably influencing the subject of investigation. In engineering practice, such a factor may often be the current temperature. From a formal point of view the above aim can be realized by using a conditional probabilistic approach (Dawid, 1979). In this case, the basic attributes, termed describing, become dependent on conditioning factors, whose specific values can make the information relating to an object under research substantially more precise. This approach is the main subject of this article. For defining characteristics of data, the nonparametric methodology of kernel estimators is used, which frees the investigated procedures from the forms of distributions characterizing both the describing and conditioning quantities. Both can be continuous multidimensional, and the latter can also be categorical. The presented material is complete and ready to use without laborious investigations. Its easy, illustrative interpretation is particularly valuable.

Thus, Section 2 outlines the mathematical preliminaries: kernel estimators in nonparametric density estimation. The investigated procedure for identification of atypical elements in the conditional approach is described in Section 3. Next, Sections 4 and 5 present the results of numerical and experimental (for a control engineering task) tests which confirmed the correct functioning of the method. The last section provides final comments and a summary of the presented research.

A broad review of cases and methods for the identification of atypical elements is found in the classic publications (Barnett & Lewis, 1994; Hodge & Austin, 2004; Aggarwal, 2013). The approach presented in this article differs from the classical techniques primarily in its conditional aspect and its independence from the distributions of the describing and conditioning quantities.
The preliminary version of this article was presented as Kulczycki et al. (2016).

### 2. Nonparametric density estimation

In the presented method, the characteristics of a data set will be defined using the nonparametric methodology of kernel estimators. This kind of procedure is distribution-free, i.e. preliminary assumptions concerning the types of appearing distributions are not required. A broad description can be found in the classic monographs (Silverman, 1986; Wand & Jones, 1995; Kulczycki, 2005). Exemplary applications for data analysis tasks are described in the publications (Kulczycki, 2008; Kulczycki & Charytanowicz, 2010, 2013; Kulczycki & Kowalski, 2015); see also (Kulczycki & Łukasik, 2014).

Let the $$n$$-dimensional continuous random variable $$X$$ be given, with a distribution characterized by the density $$f$$. Its kernel estimator $$\hat{f}:R^{n}\to [0,\infty )$$, calculated using experimentally obtained values for the $$m$$-element random sample

$$x_{i} \quad \text{for } i=1,2,\ldots ,m, \tag{1}$$

in its basic form is defined as

$$\hat{f}(x)=\frac{1}{mh^{n}}\sum_{i=1}^{m}K\left(\frac{x-x_{i}}{h}\right), \tag{2}$$

where $$m\in N\backslash \{0\}$$, the coefficient $$h>0$$ is called a smoothing parameter, while the measurable function $$K:R^{n}\to [0,\infty )$$ of unit integral $$\int_{R^{n}} K(x)\,\mbox{d}x=1$$, symmetrical with respect to zero and having a weak global maximum in this place, takes the name of a kernel. The choice of the form of the kernel $$K$$ and the calculation of the smoothing parameter $$h$$ are most often made with the criterion of the mean integrated square error. Thus, the choice of the kernel form has, from a statistical point of view, no practical meaning, and thanks to this it becomes possible to take primarily into account properties of the estimator obtained or calculational aspects, advantageous from the point of view of the application problem under investigation; for broader discussion see the books (Wand & Jones, 1995, Sections 2.7 and 4.5; Kulczycki, 2005, Section 3.1.3).
In practice, for the one-dimensional case (i.e. when $$n=1$$), the function $$K$$ is assumed most often to be the density of a common probability distribution. In the multidimensional case, two natural generalizations of the above concept are used: radial and product kernels. Thanks to its convenient analysis, the latter will be used in the following. The main idea here is the separation of particular variables, with the multidimensional kernel then becoming a product of $$n$$ one-dimensional kernels for particular coordinates. Thus the kernel estimator is then given as

$$\hat{f}(x)=\frac{1}{mh_{1}h_{2}\ldots h_{n}}\sum_{i=1}^{m}K_{1}\left(\frac{x_{1}-x_{i,1}}{h_{1}}\right)K_{2}\left(\frac{x_{2}-x_{i,2}}{h_{2}}\right)\ldots K_{n}\left(\frac{x_{n}-x_{i,n}}{h_{n}}\right), \tag{3}$$

where $$K_{j}$$ ($$j=1,2,\ldots ,n$$) denote one-dimensional kernels, $$h_{j}$$ ($$j=1,2,\ldots ,n$$) are smoothing parameters individualized for particular coordinates, with the coordinates denoted as

$$x=\left[\begin{array}{c}x_{1}\\x_{2}\\\vdots\\x_{n}\end{array}\right] \quad \text{and} \quad x_{i}=\left[\begin{array}{c}x_{i,1}\\x_{i,2}\\\vdots\\x_{i,n}\end{array}\right] \quad \text{for } i=1,2,\ldots ,m. \tag{4}$$

The fixing of the smoothing parameter has significant meaning for the quality of estimation. Fortunately, many suitable procedures for calculating its value on the basis of random sample (1) have been worked out; for broader discussion see the books (Silverman, 1986; Wand & Jones, 1995; Kulczycki, 2005). In particular, for the one-dimensional case, the effective plug-in method (Wand & Jones, 1995, Section 3.6.1; Kulczycki, 2005, Section 3.1.5) is especially recommended. Of course this method can also be applied in the $$n$$-dimensional case when a product kernel is used, sequentially $$n$$ times for each coordinate.
One can also apply the simplified method (Silverman, 1986, Section 3.4.1; Wand & Jones, 1995, Section 3.2.1; Kulczycki, 2005, Section 3.1.5), according to which

$$h_{j}=\left(\frac{8\sqrt{\pi}}{3}\,\frac{W(K_{j})}{U(K_{j})^{2}}\,\frac{1}{m}\right)^{1/5}\hat{\sigma}_{j} \quad \text{for } j=1,2,\ldots ,n, \tag{5}$$

where $$W(K_{j})=\int_{-\infty}^{\infty}K_{j}(x)^{2}\,\mbox{d}x$$ and $$U(K_{j})=\int_{-\infty}^{\infty}x^{2}K_{j}(x)\,\mbox{d}x$$, while $$\hat{\sigma}_{j}$$ denotes the estimator of a standard deviation for the $$j$$-th coordinate:

$$\hat{\sigma}_{j}=\sqrt{\frac{1}{m-1}\sum_{i=1}^{m}x_{i,j}^{2}-\frac{1}{m(m-1)}\left(\sum_{i=1}^{m}x_{i,j}\right)^{2}} \quad \text{for } j=1,2,\ldots ,n. \tag{6}$$

The value obtained by formula (5) may be sufficiently precise for many practical applications, whereas, thanks to its simplicity, this method significantly increases calculation speed. For specific cases, the value calculated in this way can also be individually refined.

The above concept will now be generalized for the conditional case. Here, besides the basic (sometimes termed the describing) $$n_{Y}$$-dimensional random variable $$Y$$, let also be given the $$n_{W}$$-dimensional random variable $$W$$, called hereinafter the conditioning random variable. Their composition

$$X=\left[\begin{array}{c}Y\\W\end{array}\right] \tag{7}$$

is a random variable of the dimension $$n_{Y}+n_{W}$$. Assume that distributions of the variables $$X$$ and, in consequence, $$W$$ have densities, denoted below as $$f_{X}:R^{n_{Y}+n_{W}}\to [0,\infty )$$ and $$f_{W}:R^{n_{W}}\to [0,\infty )$$, respectively. Let also be given the so-called conditioning value, i.e. the fixed value of the conditioning random variable $$w^{\ast}\in R^{n_{W}}$$, such that

$$f_{W}(w^{\ast})>0. \tag{8}$$

Then the function $$f_{Y\vert W=w^{\ast}}:R^{n_{Y}}\to [0,\infty )$$ given by

$$f_{Y\vert W=w^{\ast}}(y)=\frac{f_{X}(y,w^{\ast})}{f_{W}(w^{\ast})} \quad \text{for every } y\in R^{n_{Y}} \tag{9}$$

constitutes a conditional density of probability distribution of the random variable $$Y$$ for the conditioning value $$w^{\ast}$$.
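
Returning to the simplified rule (5) with the standard deviation estimator (6), a minimal sketch for a single coordinate may look as follows. It assumes that the leading constant in (5) is $$8\sqrt{\pi}/3$$, as in the usual normal-reference rule; the function name `smoothing_parameter` is illustrative. The usage line takes the normal kernel, for which $$W(K)=1/(2\sqrt{\pi})$$ and $$U(K)=1$$; this kernel is chosen here only as a familiar check and is not the one used later in the article.

```python
import numpy as np

def smoothing_parameter(x, W, U):
    """Simplified rule (5) for one coordinate; x holds the m sample values.

    W and U are the kernel constants: the integral of K(x)^2
    and the integral of x^2 K(x), respectively.
    """
    x = np.asarray(x, dtype=float)
    m = x.size
    # Formula (6): estimator of the standard deviation.
    variance = np.sum(x**2) / (m - 1) - np.sum(x) ** 2 / (m * (m - 1))
    sigma = np.sqrt(variance)
    # Formula (5), assuming the constant 8*sqrt(pi)/3.
    return (8.0 * np.sqrt(np.pi) / 3.0 * W / U**2 / m) ** 0.2 * sigma

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Normal kernel constants (an assumption made for this illustration).
h = smoothing_parameter(x, W=1.0 / (2.0 * np.sqrt(np.pi)), U=1.0)
print(h)
```

For the normal kernel the expression reduces to $$(4/(3m))^{1/5}\hat{\sigma}$$. Formula (6) as written subtracts two large sums, so for data with a large mean a centred computation such as `np.std(x, ddof=1)` may be numerically safer.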
The conditional density $$f_{Y\vert W=w^{\ast}}$$ can thus be treated as a ‘classic’ density, whose form has been made more accurate in practical applications with $$w^{\ast}$$, a concrete value taken by the conditioning variable $$W$$ in a given situation.

Let therefore the random sample

$$\left[\begin{array}{c}y_{i}\\w_{i}\end{array}\right] \quad \text{for } i=1,2,\ldots ,m, \tag{10}$$

obtained from variable (7) be given. The particular elements of this sample are interpreted as the values $$y_{i}$$ taken in measurements from the random variable $$Y$$, when the conditioning variable $$W$$ assumes the respective values $$w_{i}$$. On the basis of sample (10), one can calculate $$\hat{f}_{X}$$, i.e. the kernel estimator of density of the random variable $$X$$ probability distribution, while the sample

$$w_{i} \quad \text{for } i=1,2,\ldots ,m \tag{11}$$

enables the computation of $$\hat{f}_{W}$$, the kernel density estimator for the conditioning variable $$W$$. The kernel estimator of conditional density of the random variable $$Y$$ distribution for the conditioning value $$w^{\ast}$$ is then defined, in natural consequence of formula (9), as the function $$\hat{f}_{Y\vert W=w^{\ast}}:R^{n_{Y}}\to [0,\infty )$$ given by

$$\hat{f}_{Y\vert W=w^{\ast}}(y)=\frac{\hat{f}_{X}(y,w^{\ast})}{\hat{f}_{W}(w^{\ast})}. \tag{12}$$

If for the estimator $$\hat{f}_{W}$$ one uses a kernel with positive values, then the inequality $$\hat{f}_{W}(w^{\ast})>0$$ implied by condition (8) is fulfilled for any $$w^{\ast}\in R^{n_{W}}$$. If the kernels used in the estimator $$\hat{f}_{X}$$ for the coordinates corresponding to the vector $$W$$ are the same as those used in the estimator $$\hat{f}_{W}$$, then the expression for the kernel estimator of conditional density becomes particularly helpful for practical applications.
Namely, formula (12) can be specified to the form

$$\hat{f}_{Y\vert W=w^{\ast}}(y)=\frac{\dfrac{1}{h_{1}h_{2}\ldots h_{n_{Y}}}\displaystyle\sum_{i=1}^{m}K_{1}\left(\frac{y_{1}-y_{i,1}}{h_{1}}\right)K_{2}\left(\frac{y_{2}-y_{i,2}}{h_{2}}\right)\cdots K_{n_{Y}}\left(\frac{y_{n_{Y}}-y_{i,n_{Y}}}{h_{n_{Y}}}\right)K_{n_{Y}+1}\left(\frac{w_{1}^{\ast}-w_{i,1}}{h_{n_{Y}+1}}\right)K_{n_{Y}+2}\left(\frac{w_{2}^{\ast}-w_{i,2}}{h_{n_{Y}+2}}\right)\cdots K_{n_{Y}+n_{W}}\left(\frac{w_{n_{W}}^{\ast}-w_{i,n_{W}}}{h_{n_{Y}+n_{W}}}\right)}{\displaystyle\sum_{i=1}^{m}K_{n_{Y}+1}\left(\frac{w_{1}^{\ast}-w_{i,1}}{h_{n_{Y}+1}}\right)K_{n_{Y}+2}\left(\frac{w_{2}^{\ast}-w_{i,2}}{h_{n_{Y}+2}}\right)\cdots K_{n_{Y}+n_{W}}\left(\frac{w_{n_{W}}^{\ast}-w_{i,n_{W}}}{h_{n_{Y}+n_{W}}}\right)}, \tag{13}$$

where $$K_{j}$$ ($$j=1,2,\ldots ,n_{Y}+n_{W}$$) denote one-dimensional kernels, $$h_{j}$$ ($$j=1,2,\ldots ,n_{Y}+n_{W}$$) are smoothing parameters individualized for particular coordinates, with the coordinates denoted as

$$y=\left[\begin{array}{c}y_{1}\\y_{2}\\\vdots\\y_{n_{Y}}\end{array}\right],\quad w^{\ast}=\left[\begin{array}{c}w_{1}^{\ast}\\w_{2}^{\ast}\\\vdots\\w_{n_{W}}^{\ast}\end{array}\right] \quad \text{and} \quad y_{i}=\left[\begin{array}{c}y_{i,1}\\y_{i,2}\\\vdots\\y_{i,n_{Y}}\end{array}\right],\quad w_{i}=\left[\begin{array}{c}w_{i,1}\\w_{i,2}\\\vdots\\w_{i,n_{W}}\end{array}\right] \quad \text{for } i=1,2,\ldots ,m. \tag{14}$$

Define the so-called conditioning parameters $$d_{i}$$ for $$i=1,2,\ldots ,m$$ by the following formula:

$$d_{i}=K_{n_{Y}+1}\left(\frac{w_{1}^{\ast}-w_{i,1}}{h_{n_{Y}+1}}\right)K_{n_{Y}+2}\left(\frac{w_{2}^{\ast}-w_{i,2}}{h_{n_{Y}+2}}\right)\cdots K_{n_{Y}+n_{W}}\left(\frac{w_{n_{W}}^{\ast}-w_{i,n_{W}}}{h_{n_{Y}+n_{W}}}\right). \tag{15}$$

Thanks to the assumption of positive values for the kernels $$K_{n_{Y}+1}$$, $$K_{n_{Y}+2}$$, $$\ldots$$, $$K_{n_{Y}+n_{W}}$$, these parameters are also positive. So the kernel estimator of conditional density (9) can be finally presented in the form

$$\hat{f}_{Y\vert W=w^{\ast}}(y)=\frac{1}{h_{1}h_{2}\ldots h_{n_{Y}}\sum_{i=1}^{m}d_{i}}\sum_{i=1}^{m}d_{i}\,K_{1}\left(\frac{y_{1}-y_{i,1}}{h_{1}}\right)K_{2}\left(\frac{y_{2}-y_{i,2}}{h_{2}}\right)\cdots K_{n_{Y}}\left(\frac{y_{n_{Y}}-y_{i,n_{Y}}}{h_{n_{Y}}}\right). \tag{16}$$

The value of the parameter $$d_{i}$$ characterizes the ‘distance’ of the given conditioning value $$w^{\ast}$$ from $$w_{i}$$, the value of the conditioning variable for which the $$i$$-th element of the random sample was obtained. Then estimator (16) can be interpreted as a linear combination of kernels mapped to particular elements of a random sample obtained for the variable $$Y$$, where the coefficients $$d_{i}$$ of this combination characterize how representative these elements are for the given value $$w^{\ast}$$.

For further investigations, the (one-dimensional) Cauchy kernel

$$K_{j}(x)=\frac{2}{\pi}\,\frac{1}{(1+x^{2})^{2}} \quad \text{for } j=1,2,\ldots ,n_{Y}+n_{W} \tag{17}$$

will be applied. The constants occurring in formula (5) are, for Cauchy kernel (17), equal to

$$U(K_{j})=1, \tag{18}$$

$$W(K_{j})=\frac{5}{4\pi}. \tag{19}$$

To summarize: formula (16), substituting (14) and (15), with (5) and (6) and (17)–(19), constitutes comprehensive material, allowing convenient calculation of a distribution-free estimator of conditional density, based on random sample (10).

### 3. An algorithm for atypical elements identification

Drawing on the material presented in the previous section, an algorithm for the conditional identification of atypical (rare) elements will now be investigated. So, consider the data set comprised of elements $$y_{i}$$ obtained for the conditioning values $$w_{i}$$ (for $$i=1,2,\ldots ,m$$), respectively, which can be treated as representative for a population under research. Denote also a tested element as

$$\left[\begin{array}{c}y^{\ast}\\w^{\ast}\end{array}\right]\in R^{n_{Y}+n_{W}}. \tag{20}$$

It can be interpreted as the value $$y^{\ast}$$ of the describing variables, obtained for the conditioning value $$w^{\ast}$$. The aim of the procedure is to ascertain whether, for the value $$w^{\ast}$$, the element $$y^{\ast}$$ should be considered atypical in the sense of rare occurrence, or not. For this purpose, fix first the number

$$r\in (0,1) \tag{21}$$

defining a desired proportion of atypical to typical elements, or more accurately the share of atypical elements in a population. In practice the values $$r=0.01,\;0.05,\;0.1$$ can be proposed.

In reference to the notation of the previous section, let us treat the elements $$y_{i}$$ as the realizations of the $$n_{Y}$$-dimensional random variable $$Y$$, and the elements $$w_{i}$$ as the respective realizations of the conditioning random variable $$W$$, and then calculate the conditional density $$\hat{f}_{Y\vert W=w^{\ast}}$$. Next, let us consider the set of its values for the elements $$y_{i}$$, therefore

$$\hat{f}_{Y\vert W=w^{\ast}}(y_{i}) \quad \text{for } i=1,2,\ldots ,m. \tag{22}$$

Note that the above values are real (one-dimensional). The specific values $$\hat{f}_{Y\vert W=w^{\ast}}(y_{i})$$ refer to the probability of occurrence of the element $$y_{i}$$ when the value of the conditioning variable is $$w^{\ast}$$.
So, the greater the value $$\hat{f}_{Y\vert W=w^{\ast}}(y_{i})$$, the more typical the element $$y_{i}$$ can be interpreted to be for the given $$w^{\ast}$$.

Let us treat as typical those elements for which the density $$\hat{f}_{Y\vert W=w^{\ast}}$$ is bigger than a given limit value, and as atypical those for which it is smaller. In accordance with the assumptions made above, such a natural limit value constitutes a conditional quantile of the order $$r$$ for the condition $$w^{\ast}$$; its estimator is denoted hereinafter as $$\hat{q}_{r\vert w^{\ast}}$$.

Finally, if for the fixed conditioning value $$w^{\ast}$$, the value of the density function for the tested element $$y^{\ast}$$, i.e.

$$\hat{f}_{Y\vert W=w^{\ast}}(y^{\ast}), \tag{23}$$

is calculated, and the condition

$$\hat{f}_{Y\vert W=w^{\ast}}(y^{\ast})\leqslant \hat{q}_{r\vert w^{\ast}} \tag{24}$$

is fulfilled, then the tested element should be ascertained as atypical, while in the opposite case

$$\hat{f}_{Y\vert W=w^{\ast}}(y^{\ast})>\hat{q}_{r\vert w^{\ast}} \tag{25}$$

as typical. In this way, the space $$R^{n_{Y}}$$ is divided into two regions: the first containing atypical elements, which fulfil condition (24), and the second consisting of typical ones, satisfying (25), such that, with precision to estimation errors, the probability of the former is $$r$$, and of the latter $$1-r$$.

There remains, however, to calculate the above mentioned value of the conditional quantile estimator $$\hat{q}_{r\vert w^{\ast}}$$. To this aim the kernel estimator scheme presented in the article (Kulczycki et al., 2015), fitted to the task investigated here, will be applied. Using the same technique as that applied before to construct the density $$\hat{f}_{Y\vert W=w^{\ast}}$$ allows for better use of procedures already executed, and of the previously gained experience of the researcher.
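
The conditional density estimator (16) with conditioning parameters (15) and the Cauchy kernel (17), together with the set of values (22), can be sketched as below. This is a minimal illustration under stated assumptions: the data, the smoothing parameters `h_y` and `h_w`, and all function names are hypothetical, and the smoothing parameters are simply passed in rather than computed from (5).

```python
import numpy as np

def cauchy_kernel(x):
    """One-dimensional Cauchy kernel (17)."""
    return 2.0 / (np.pi * (1.0 + x**2) ** 2)

def conditioning_parameters(w, w_star, h_w):
    """Parameters d_i of formula (15); w has shape (m, n_W)."""
    return np.prod(cauchy_kernel((w_star - w) / h_w), axis=1)

def conditional_density(y_query, y, d, h_y):
    """Estimator (16) at each row of y_query; y has shape (m, n_Y)."""
    # Scaled pairwise differences, shape (n_query, m, n_Y).
    u = (y_query[:, None, :] - y[None, :, :]) / h_y
    kernels = np.prod(cauchy_kernel(u), axis=2)  # product kernel, (n_query, m)
    return kernels @ d / (np.prod(h_y) * np.sum(d))

# Hypothetical data: one describing and one conditioning coordinate.
rng = np.random.default_rng(1)
w = rng.normal(size=(200, 1))
y = 0.7 * w + rng.normal(scale=0.5, size=(200, 1))
h_y, h_w = np.array([0.4]), np.array([0.4])  # assumed smoothing parameters

d = conditioning_parameters(w, np.array([0.0]), h_w)  # w* = 0
z = conditional_density(y, y, d, h_y)                 # values (22)
print(z[:5])
```

The pairwise array grows as the number of query points times $$m$$, so for large samples the queries would need to be processed in chunks; this sketch does not attempt that.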
The role of the describing factor (previously the $$n_{Y}$$-dimensional variable $$Y$$) will now be taken by the one-dimensional random variable $$Z$$ ($$n_{Z}=1$$); therefore, in place of variable (7), consider the $$(n_{W}+1)$$-dimensional composition

$$X=\left[\begin{array}{c}Z\\W\end{array}\right]. \tag{26}$$

The values (22) will be treated as experimentally obtained realizations of the random variable $$Z$$, evaluated, as previously, for the realizations $$w_{i}$$ of the $$n_{W}$$-dimensional conditioning variable $$W$$, respectively. Thus, denoting

$$z_{i}=\hat{f}_{Y\vert W=w^{\ast}}(y_{i}) \quad \text{for } i=1,2,\ldots ,m, \tag{27}$$

random sample (10) is now replaced by

$$\left[\begin{array}{c}z_{i}\\w_{i}\end{array}\right] \quad \text{for } i=1,2,\ldots ,m. \tag{28}$$

Then, the natural kernel estimator of a conditional quantile for the random variable $$Z$$, with the conditioning variable $$W$$ assuming the fixed value $$w^{\ast}$$, is the solution of the following equation with the argument $$\hat{q}_{r\vert w^{\ast}}$$:

$$\int_{-\infty}^{\hat{q}_{r\vert w^{\ast}}}\hat{f}_{Z\vert W=w^{\ast}}(z)\,\mbox{d}z=r. \tag{29}$$

For estimation of the conditional density $$\hat{f}_{Z\vert W=w^{\ast}}$$ appearing in the above formula, the kernel estimator (16) will be used. Then, in the current notation, equation (29) takes the following form:

$$\int_{-\infty}^{\hat{q}_{r\vert w^{\ast}}}\frac{1}{h_{Z}}\sum_{i=1}^{m}d_{i}K_{Z}\left(\frac{z-z_{i}}{h_{Z}}\right)\mbox{d}z-r\sum_{i=1}^{m}d_{i}=0, \tag{30}$$

where $$K_{Z}$$ and $$h_{Z}$$ denote a one-dimensional kernel (continuous with positive values) and a smoothing parameter (calculated for values (27), possibly using formula (5)) corresponding to the variable $$Z$$. Let also the kernel $$K_{Z}$$ be such that its primitive $$I_{Z}:R\to [0,1]$$ given as $$I_{Z}(x)=\int_{-\infty}^{x}K_{Z}(u)\,\mbox{d}u$$ is expressed by a relatively simple analytical formula. Equation (30) can then be stated equivalently in the following form:

$$\sum_{i=1}^{m}d_{i}I_{Z}\left(\frac{\hat{q}_{r\vert w^{\ast}}-z_{i}}{h_{Z}}\right)-r\sum_{i=1}^{m}d_{i}=0. \tag{31}$$

If the left side of the above equation is denoted by $$L$$, i.e.
$$L(\hat{q}_{r\vert w^{\ast}})=\sum_{i=1}^{m}d_{i}I_{Z}\left(\frac{\hat{q}_{r\vert w^{\ast}}-z_{i}}{h_{Z}}\right)-r\sum_{i=1}^{m}d_{i}, \tag{32}$$

then $$\lim\limits_{\hat{q}_{r\vert w^{\ast}}\to -\infty}L(\hat{q}_{r\vert w^{\ast}})<0$$, $$\lim\limits_{\hat{q}_{r\vert w^{\ast}}\to \infty}L(\hat{q}_{r\vert w^{\ast}})>0$$, the function $$L$$ is (strictly) increasing and its derivative is simply expressed by

$$L'(\hat{q}_{r\vert w^{\ast}})=\frac{1}{h_{Z}}\sum_{i=1}^{m}d_{i}K_{Z}\left(\frac{\hat{q}_{r\vert w^{\ast}}-z_{i}}{h_{Z}}\right). \tag{33}$$

In this situation, the solution of equation (31) can be effectively calculated on the basis of Newton’s algorithm (Kincaid & Cheney, 2002) as the limit of the sequence $$\{\hat{q}_{r\vert w^{\ast},j}\}_{j=0}^{\infty}$$ defined by

$$\hat{q}_{r\vert w^{\ast},0}=\frac{\sum_{i=1}^{m}d_{i}z_{i}}{\sum_{i=1}^{m}d_{i}}, \tag{34}$$

$$\hat{q}_{r\vert w^{\ast},j+1}=\hat{q}_{r\vert w^{\ast},j}-\frac{L(\hat{q}_{r\vert w^{\ast},j})}{L'(\hat{q}_{r\vert w^{\ast},j})} \quad \text{for } j=0,1,\ldots , \tag{35}$$

with the functions $$L$$ and $$L'$$ being given by dependencies (32) and (33), whereas the stop criterion takes on the form

$$\left|\hat{q}_{r\vert w^{\ast},j}-\hat{q}_{r\vert w^{\ast},j-1}\right|\leqslant 0.01\,\hat{\sigma}_{Z}, \tag{36}$$

where $$\hat{\sigma}_{Z}$$ is the estimator of the standard deviation of the random variable $$Z$$, calculated from formula (6) for elements (27). It is worth noting that many elements used above to estimate the conditional quantile, in particular the values of the conditioning parameters and the smoothing parameters, were already obtained earlier during the calculation of the conditional density $$\hat{f}_{Y\vert W=w^{\ast}}$$.

For the Cauchy kernel (17) recommended here, which is continuous with positive values, the primitive is given as

$$I(x)=\frac{1}{\pi}\left[\arctan (x)+\frac{x}{1+x^{2}}\right]+\frac{1}{2}. \tag{37}$$

Thanks to the use of kernel estimators with strong averaging properties, inference takes place not only for data obtained exactly for $$w^{\ast}$$ (among the values $$w_{i}$$ there may be too few of these for reliable consideration, or even none at all), but also for neighbouring values, in proportion to their ‘closeness’ with respect to $$w^{\ast}$$.
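
A minimal sketch of the Newton iteration (34)–(36) for the root of $$L$$, using the Cauchy kernel (17) and its primitive (37), is given below. The inputs `z`, `d` and `h_z` stand for the values (27), the conditioning parameters (15) and the smoothing parameter of $$Z$$; the example values, the iteration cap `max_iter` and the helper `is_atypical` (the decision rule (24)/(25)) are illustrative assumptions, not part of the article.

```python
import numpy as np

def cauchy_kernel(x):
    """One-dimensional Cauchy kernel (17)."""
    return 2.0 / (np.pi * (1.0 + x**2) ** 2)

def cauchy_primitive(x):
    """Primitive (37) of the Cauchy kernel."""
    return (np.arctan(x) + x / (1.0 + x**2)) / np.pi + 0.5

def conditional_quantile(z, d, h_z, r, max_iter=100):
    """Newton iteration (34)-(36) for the root of L defined in (32)."""
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)
    sigma_z = np.std(z, ddof=1)            # same quantity as formula (6)
    q = np.sum(d * z) / np.sum(d)          # starting point (34)
    for _ in range(max_iter):              # cap added for safety (assumption)
        u = (q - z) / h_z
        L = np.sum(d * cauchy_primitive(u)) - r * np.sum(d)   # (32)
        dL = np.sum(d * cauchy_kernel(u)) / h_z               # (33)
        q_next = q - L / dL                                   # (35)
        if abs(q_next - q) <= 0.01 * sigma_z:                 # (36)
            return q_next
        q = q_next
    return q

def is_atypical(f_star, q):
    """Decision rule: condition (24) means atypical, (25) typical."""
    return f_star <= q

# Hypothetical inputs: evenly spread z_i with equal weights d_i.
z = np.linspace(0.0, 1.0, 101)
d = np.ones_like(z)
q = conditional_quantile(z, d, h_z=0.05, r=0.1)
print(q, is_atypical(0.02, q))
```

Newton's method is not guaranteed to converge from every starting point for strongly multimodal $$z_{i}$$; the weighted mean (34) works in this benign example, and a bracketing method could serve as a fallback.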
Finally, if the material included in Section 2 is used to create the estimator of the conditional density $$\hat{f}_{Y\vert W=w^{\ast }}$$, then application of the algorithm (34)–(36) with substitutions (32), (33), (37) and (15) with (17), (5) and (6) in respect of (18) and (19), completes the procedure for identifying atypical (rare) elements in a conditional and distribution-free approach, for an assumed proportion of atypical to typical elements (21), which is the subject of these investigations.

## 4. Numerical verification

The procedure presented in this article has been numerically verified in detail. The results confirmed that it functions correctly and fulfils the intentions and goals set out in the Introduction. In particular, in the case of a positive correlation between the describing and conditioning factors, the greater (or smaller) the value of the conditioning attributes, the greater (or smaller) the values of the describing elements detected as atypical. For a negative correlation, the relation is inverse.

To offer an illustrative example, assume that the data is three-dimensional, where the first two coordinates are describing and the third is conditioning, i.e. in accordance with the notation introduced earlier, we have $$\left[ \begin{array}{c} Y_{1} \\ Y_{2} \\ W \end{array} \right]$$. The data are obtained from a pseudorandom number generator, the distribution of which consists of five normal components with the following expected values, covariance matrices and shares:

$$E_{1} =\left[ \begin{array}{c} 3 \\ 3 \\ 0 \end{array} \right],\quad Cov_{1} =\left[ \begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array} \right],\quad 35\% \qquad (38)$$

$$E_{2} =\left[ \begin{array}{c} -3 \\ 3 \\ 0 \end{array} \right],\quad Cov_{2} =\left[ \begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array} \right],\quad 20\% \qquad (39)$$

$$E_{3} =\left[ \begin{array}{c} -3 \\ -3 \\ 0 \end{array} \right],\quad Cov_{3} =\left[ \begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array} \right],\quad 20\% \qquad (40)$$

$$E_{4} =\left[ \begin{array}{c} 3 \\ -3 \\ 0 \end{array} \right],\quad Cov_{4} =\left[ \begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array} \right],\quad 15\% \qquad (41)$$

$$E_{5} =\left[ \begin{array}{c} 0 \\ 0 \\ 0 \end{array} \right],\quad Cov_{5} =\left[ \begin{array}{ccc} 1 & 0 & -0.4 \\ 0 & 1 & 0.7 \\ -0.4 & 0.7 & 1 \end{array} \right],\quad 10\%. \qquad (42)$$

The distribution, therefore, is multimodal and asymmetric. The first coordinate $$Y_{1}$$ is negatively correlated with the conditioning factor $$W$$, while the second, $$Y_{2}$$, is positively correlated with it.
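For readers who wish to reproduce a comparable data set, the mixture (38)–(42) could be sampled roughly as follows. This is a sketch using NumPy; the function name `sample_mixture`, the seed, and the use of NumPy’s generator are assumptions of this example, since the article does not specify the pseudorandom generator used.

```python
import numpy as np

# Expected values E_1..E_5 from (38)-(42); columns are Y1, Y2, W.
MEANS = np.array([[3, 3, 0],
                  [-3, 3, 0],
                  [-3, -3, 0],
                  [3, -3, 0],
                  [0, 0, 0]], dtype=float)

# Covariance matrix shared by all five components.
COV = np.array([[1.0, 0.0, -0.4],
                [0.0, 1.0, 0.7],
                [-0.4, 0.7, 1.0]])

# Component shares: 35%, 20%, 20%, 15%, 10%.
SHARES = np.array([0.35, 0.20, 0.20, 0.15, 0.10])


def sample_mixture(m, seed=0):
    """Draw m three-dimensional points [Y1, Y2, W] from the normal mixture."""
    rng = np.random.default_rng(seed)
    # Pick a component for every point according to the shares.
    component = rng.choice(len(SHARES), size=m, p=SHARES)
    # All components share one covariance, so draw zero-mean noise once
    # and shift each point by the mean of its component.
    noise = rng.multivariate_normal(np.zeros(3), COV, size=m)
    return MEANS[component] + noise
```

Since every component has a zero mean in the $$W$$ coordinate, the overall covariances of $$Y_{1}$$ and $$Y_{2}$$ with $$W$$ equal the within-component values $$-0.4$$ and $$0.7$$, which is what produces the opposite trends discussed in the text.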
Tests were carried out for sample sizes $$m$$ from 1,000 to 1,000,000. The former does not seem an excessive requirement for three-dimensional tasks; the latter was possible thanks to the application of procedures of linear computational complexity, in particular the use of formula (5). To improve results, corrections to particular coordinates of the smoothing parameter were occasionally made in this task; such modifications, often based on visual observations, are common in kernel estimation methods.

A graphic illustration of the results can be found in Fig. 1. Demonstrative contours and the percentage share of each component of distribution (38)–(42) are shown. Detected atypical (rare) elements are denoted by small circles. It should be stressed that they are located not only on the distribution peripheries but also around the centre, because realizations of the random variable occur infrequently in this area and, in consequence, the distribution density there is low. This effect is achieved by applying the nonparametric method of kernel estimators.

Fig. 1. Illustrative locations of detected atypical elements for distribution (38)–(42); $$m=10,000$$, $$r=0.05$$, $$w^{\ast }=0$$.

The mean values of the detected atypical (rare) elements for particular values of the conditioning factor are shown in Table 1. Every cell shows the mean values for the first coordinate $$Y_{1}$$ and the second coordinate $$Y_{2}$$ of the detected atypical elements, denoted as $$\bar{y}_{1}$$ and $$\bar{y}_{2}$$, respectively. Due to the symmetry of distribution (38)–(42), the results for negative conditioning values $$w^{\ast }$$ proved to be symmetric.
As expected, a growth in the conditioning factor value was accompanied by a drop in the mean value of the first coordinate of detected atypical elements, and an increase in that of the second. This agrees with intuition because, as implied by formulas (38)–(42), the coordinates $$Y_{1}$$ and $$W$$ are negatively correlated, while $$Y_{2}$$ and $$W$$ are positively correlated. In this way, knowledge of the current conditioning value $$w^{\ast }$$ enables models used for practical purposes to be designed significantly more precisely. The mean values of atypical elements were practically (to within around 10% of the standard deviation) independent of the assumed proportion of atypical to typical elements, defined by the parameter $$r$$. This parameter can also be corrected in order to adjust the required number of atypical elements and, in consequence, the assumed share of such elements in a population. Increasing the value $$r$$ results in a proportionate growth in their number; similarly, decreasing it has the opposite effect. The second coordinate $$Y_{2}$$ depended on the conditioning variable $$W$$ more strongly than the first coordinate $$Y_{1}$$. This is justified by the fact that, in formulas (38)–(42), the element $$cov_{23} =cov_{32}$$ is greater in absolute value than $$cov_{13} =cov_{31}$$.

Table 1. Mean values of detected atypical elements; $$m=10,000$$, $$r=0.05$$, means obtained from 100 runs.
| $$w^{\ast }$$ | $$r=0.01$$ | $$r=0.05$$ | $$r=0.1$$ |
|---|---|---|---|
| 0 | $$\bar{y}_{1} =0.11$$, $$\bar{y}_{2} =-0.18$$ | $$\bar{y}_{1} =-0.09$$, $$\bar{y}_{2} =-0.13$$ | $$\bar{y}_{1} =-0.04$$, $$\bar{y}_{2} =-0.06$$ |
| 1 | $$\bar{y}_{1} =-0.87$$, $$\bar{y}_{2} =1.03$$ | $$\bar{y}_{1} =-0.74$$, $$\bar{y}_{2} =0.99$$ | $$\bar{y}_{1} =-0.61$$, $$\bar{y}_{2} =0.95$$ |
| 2 | $$\bar{y}_{1} =-1.10$$, $$\bar{y}_{2} =1.87$$ | $$\bar{y}_{1} =-1.07$$, $$\bar{y}_{2} =1.89$$ | $$\bar{y}_{1} =-0.96$$, $$\bar{y}_{2} =1.72$$ |
## 5. Empirical verification: fault detection

The procedure was also successfully verified in a real application task in control engineering. Based on the current state of the system, atypical elements were discovered, indicating potentially arising failures of a supervised device (Kulczycki, 1998; Korbicz et al., 2004). Consider a mechanical system with dynamics modelled by the differential inclusion

$$\ddot{y}(t)\in H(\dot{y}(t),y(t))+u(t), \qquad (43)$$

where $$y$$ expresses the position of the object, $$u$$ denotes a piecewise continuous control and the function $$H$$, characterizing resistance to motion, is piecewise continuous and additionally multivalued at the points of discontinuity (in particular, for $$\dot{y}(t)=0$$ it represents phenomena connected with static friction). In the event of no resistance to motion, i.e. when $$H\equiv 0$$, inclusion (43) can be reduced to a differential equation $$\ddot{{y}}(t)=m\,u(t)$$ describing the mass $$m$$ subjected to the action of a force, in accordance with Newton’s second law of dynamics. The above task therefore constitutes a problem of fundamental importance in the control of manipulators and robots (Kulczycki, 2000; Kulczycki & Wisniewski, 2002). This concept was the basis for creating a complex algorithm for controlling a laboratory robot arm.

The shape of the function $$H$$ is quite complex and its interpretation multifaceted (Blau, 2009). The dependence of motion resistance on the velocity $$\dot{y}(t)$$ predominates; it is shown schematically in Fig. 2. For zero velocity this dependence is multivalued, which reflects the static friction phenomenon. As velocity grows, resistance to motion decreases due to the phenomenon of ‘skipping’ over the roughness. At higher speeds there is a natural increase in motion resistance related to, e.g., environmental (air or water) resistance. Because the motion resistance force changes direction when the velocity changes direction, the function $$H$$ is discontinuous at zero. The motion resistance value may also depend on the position $$y(t)$$, mainly because of non-homogeneity of materials. The identification of the function $$H$$ is, therefore, very difficult, not only as a result of its dependence on other less important factors (e.g. temperature, dampness) but also because of hysteresis occurring at low velocities, as well as general metrological difficulties in measuring motion resistance. In this situation, the stochastic approach, which in itself accounts for inaccuracies in the form of probabilistic uncertainty, becomes very attractive in practical applications.

Fig. 2. Motion resistance as a function of velocity.
Therefore, let the value of motion resistance be a describing variable, while velocity and position constitute conditioning variables. Obtaining a set of representative data (10) does not present any practical difficulty, due to the repetitive actions of the manipulator. Detection of an atypical element is evidence of a fault in the movement system, e.g. mechanism seizure. By appropriately changing the number $$r$$ fixed by formula (21), one can tune the sensitivity of the basic fault detection system thus created. Lowering $$r$$ results in diminished sensitivity: a smaller number of false alarms, but also a greater probability of missing a potential defect; increasing the number $$r$$ has the opposite effect. The possibility of such an adaptation must be noted as an advantage.

The results of these experiments positively verified the concept presented in this article and confirmed the proper functioning of the resulting statistical inference system. The data set also contained elements characteristic of malfunctions. In cases where the symptoms appeared abruptly, the anomalies of the device were promptly discovered. If, on the other hand, the fault was accompanied by a slow progression of symptoms, it was forecast and later also discovered. In this case, on the basis of previous values of the function $$\hat{f}_{Y\vert W=w^{\ast }}$$, a forecast is calculated and compared to the quantile, in accordance with the scheme described in Section 3. It should be underlined that fault prognosis, still rare in practical applications, proved to be highly effective in the case of slowly progressing symptoms. Anomalies were discovered before the object’s characteristics moved beyond the range of correct conditions for the system’s functioning, thanks to the proper recognition of a change in the trend of values of the symptom vector, which indicates an adverse direction of its evolution.

Analysing Fig. 2, it can be seen that the introduction of conditioning variables, and the information contained in the current conditioning value, allows the fault detection system arising in this way to be significantly improved. Namely, note that a doubled value of motion resistance for the velocity $$0.25\,v_{\mbox{max}}$$, which clearly shows the occurrence of a fault, is correct for velocities approximating zero. The introduction of the current conditioning value lowered the number of false alarms several times over with respect to the unconditional version, at the same sensitivity of the fault detection system.

## 6. Final comments and summary

The procedure presented in this article has been given in its basic form; however, its transparent nature and clear interpretation facilitate specific modifications and generalizations. Above all, this allows the inclusion of conditioning factors other than continuous ones. Similarly to the kernel estimator definition (2) formulated in Section 2 for continuous random variables, one can construct kernel estimators for categorical variables, including their compositions with continuous ones; see (Li et al., 2006; Gaosheng et al., 2009). After categorical factors are introduced, the algorithm worked out here undergoes practically no changes, apart from technical ones resulting from computational differences. This property should be particularly underlined considering modern data analysis tasks, which more and more often combine many different configurations of particular types of attributes.

Summarizing the material presented in this article, the investigated procedure can be described in the following sequence:

1. define vector (7) with separate describing and conditioning coordinates;
2. experimentally obtain random sample (10);
3. assume the kernel form; the Cauchy form (17) is proposed here;
4. for every coordinate of vector (7) calculate a smoothing parameter using formulas (5) with (6) (for the Cauchy kernel one should use dependences (18) and (19)) or another method available in the literature;

after obtaining the conditioning value $$w^{\ast }$$:

5. establish the values of conditioning parameters (15);
6. construct kernel estimators of conditional density (16);
7. compute random sample (28) using (27);
8. with Newton’s algorithm (34)–(36), substituting (32) and (33) (for the Cauchy kernel also (37)), calculate the value of the conditional quantile estimator;

once the tested value $$y^{\ast }$$ has been obtained:

9. calculate the value of the kernel estimator of conditional density for the tested element (23);
10. if condition (24) is fulfilled then the tested element should be classed as atypical, whereas if (25) then it is typical.

Note that if the conditioning factor changes gradually, e.g. daily changes in temperature, then the above algorithm can be decomposed into two stages. The first phase contains the procedures for calculating parameter values (i.e. the actions described above in points 1–8) and may be performed in advance. Then, while working in real time, it is enough to carry out just the second, much faster phase, which consists only of calculating a value of the conditional density function and comparing it to the quantile (points 9 and 10).

Finally, this article presents an algorithm for identifying atypical (rare) elements, also in the multidimensional case, with particular coordinates being continuous and, for the conditioning factor, also categorical. The conditional approach allows in practice for refinement of the model by including the current value of the conditioning factors. The use of nonparametric concepts frees the procedure from the distributions of the describing and conditioning attributes. The investigated algorithm is ready for direct use without any additional laborious research or calculations.
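The two-stage decomposition of the summary above might be organized in code along the following lines. This is a minimal sketch for one-dimensional $$Y$$ and $$W$$, not the authors’ implementation. Several parts are assumptions made only for illustration, because the corresponding formulas are not reproduced in this section: the conditioning parameters (15) are taken as $$d_{i} =K((w^{\ast }-w_{i} )/h_{W} )$$, the conditional density (16) as a $$d_{i}$$-weighted kernel sum, conditions (24)/(25) as “density not greater than the quantile means atypical”, and the smoothing parameter for $$Z$$ is set by a normal-reference rule of thumb that merely stands in for formula (5). The class and method names are hypothetical.

```python
import numpy as np


def cauchy_kernel(x):
    # K(x) = 2 / (pi * (1 + x^2)^2), the derivative of primitive (37).
    return 2.0 / (np.pi * (1.0 + x**2) ** 2)


def cauchy_primitive(x):
    # Primitive (37).
    return (np.arctan(x) + x / (1.0 + x**2)) / np.pi + 0.5


class ConditionalRareDetector:
    """Hypothetical helper; bandwidths h_y, h_w are supplied by the caller."""

    def __init__(self, y, w, h_y, h_w, r):
        self.y = np.asarray(y, dtype=float)
        self.w = np.asarray(w, dtype=float)
        self.h_y, self.h_w, self.r = h_y, h_w, r

    def density(self, points):
        # ASSUMED form of (16): weighted kernel estimate of f_{Y|W=w*}.
        u = (np.atleast_1d(points)[:, None] - self.y[None, :]) / self.h_y
        return (cauchy_kernel(u) @ self.d) / (self.h_y * self.d.sum())

    def prepare(self, w_star, max_iter=100):
        """Phase 1 (points 5-8): run once per conditioning value w*."""
        # ASSUMED form of the conditioning parameters (15).
        self.d = cauchy_kernel((w_star - self.w) / self.h_w)
        z = self.density(self.y)                      # values (27)
        sigma_z = z.std()
        # Placeholder rule of thumb, NOT formula (5) of the article.
        h_z = 1.06 * sigma_z * len(z) ** (-0.2)
        sum_d = self.d.sum()
        q = (self.d @ z) / sum_d                      # starting point (34)
        for _ in range(max_iter):                     # cap added for safety
            u = (q - z) / h_z
            L = self.d @ cauchy_primitive(u) - self.r * sum_d   # (32)
            dL = (self.d @ cauchy_kernel(u)) / h_z              # (33)
            q_next = q - L / dL                                 # (35)
            done = abs(q_next - q) <= 0.01 * sigma_z            # (36)
            q = q_next
            if done:
                break
        self.threshold = q
        return q

    def is_atypical(self, y_star):
        """Phase 2 (points 9-10): fast check for a single tested value."""
        # ASSUMED reading of conditions (24)/(25).
        return bool(self.density(y_star)[0] <= self.threshold)
```

In use, `prepare(w_star)` would be called whenever the conditioning value changes, and `is_atypical(y_star)` for each incoming observation. Only the second call sits on the real-time path, which mirrors the split between points 1–8 and points 9–10.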
The presented concept is universal in nature and can be applied in data analysis for a wide range of tasks in science and practice, in the fields of engineering, economics and management, environmental and social issues, biomedicine and other related fields. The results have been verified positively based on generated and real data for practical problems from control engineering.

## Acknowledgements

Our heartfelt thanks go to our colleagues Damian Kruszewski and Cyprian Prochot, with whom we collaborated on the subject presented here.

## References

Aggarwal, C. C. (2013) Outlier Analysis. Springer, New York.

Barnett, V. & Lewis, T. (1994) Outliers in Statistical Data. Wiley, New York.

Blau, P. A. (2009) Friction Science and Technology: From Concepts to Applications. Taylor & Francis, Boca Raton.

Dawid, A. P. (1979) Conditional independence in statistical theory. J. Royal Statistical Soc., Series B, 41, 1–31.

Gaosheng, J., Rui, L. & Zhongwen, L. (2009) Nonparametric estimation of multivariate CDF with categorical and continuous data. Adv. Econometrics, 25, 291–318.

Hawkins, D. M. (1980) Identification of Outliers. Chapman and Hall, London.

Hodge, V. & Austin, J. (2004) A survey of outlier detection methodologies. Artif. Intell. Rev., 22, 85–126.

Kincaid, D. & Cheney, W. (2002) Numerical Analysis. Brooks/Cole, Pacific Grove.

Korbicz, J., Kościelny, J. M., Kowalczuk, Z. & Cholewa, W. (eds) (2004) Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer, Berlin.

Kulczycki, P. (1998) Wykrywanie uszkodzen w systemach zautomatyzowanych metodami statystycznymi. Alfa, Warsaw.

Kulczycki, P. (2005) Estymatory jadrowe w analizie systemowej. WNT, Warsaw.

Kulczycki, P. (2008) Kernel estimators in industrial applications. In Soft Computing Applications in Industry (Prasad, B. ed.). Springer, pp. 69–91.

Kulczycki, P. (2000) Fuzzy controller for mechanical systems. IEEE Trans. Fuzzy Syst., 8, 645–652.

Kulczycki, P. & Charytanowicz, M. (2010) A complete gradient clustering algorithm formed with kernel estimators. Int. J. Appl. Math. Comput. Sci., 20, 123–134.

Kulczycki, P. & Charytanowicz, M. (2013) Conditional parameter identification with different losses of under- and overestimation. Appl. Math. Modell., 37, 2166–2177.

Kulczycki, P., Charytanowicz, M. & Dawidowicz, A. (2015) A convenient ready-to-use algorithm for a conditional quantile estimator. Appl. Math. Inf. Sci., 9, 841–850.

Kulczycki, P., Charytanowicz, M., Kowalski, P. A. & Łukasik, S. (2016) Atypical (rare) elements detection – a conditional nonparametric approach. Computational Modeling of Objects Presented in Images: Fundamentals, Methods, and Applications, Niagara Falls (USA), 21–23 September 2016, LNCS, Springer, Cham, in press.

Kulczycki, P., Hryniewicz, O. & Kacprzyk, J. (eds) (2007) Techniki informacyjne w badaniach systemowych. WNT, Warsaw.

Kulczycki, P. & Kowalski, P. A. (2015) Bayes classification for nonstationary patterns. Int. J. Comput. Methods, 12, ID 1550008 (19 pages).

Kulczycki, P. & Łukasik, S. (2014) An algorithm for reducing dimension and size of sample for data exploration procedures. Int. J. Appl. Math. Comput. Sci., 24, 133–149.

Kulczycki, P. & Wisniewski, R. (2002) Fuzzy controller for a system with uncertain load. Fuzzy Sets Syst., 131, 185–195.

Larose, D. T. (2005) Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Hoboken.

Li, Q., Ouyang, D. & Racine, J. S. (2006) Cross-validation and the estimation of probability distributions with categorical data. J. Nonparametric Statistics, 18, 69–100.

Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Wand, M. P. & Jones, M. C. (1995) Kernel Smoothing. Chapman and Hall, London.

© The authors 2017. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.

### Journal

IMA Journal of Mathematical Control and Information, Oxford University Press

Published: Mar 8, 2017
