# Detection of periodicity based on independence tests – III. Phase distance correlation periodogram

Detection of periodicity based on independence tests – III. Phase distance correlation periodogram Abstract I present the Phase Distance Correlation (PDC) periodogram – a new periodicity metric, based on the Distance Correlation concept of Gábor Székely. For each trial period, PDC calculates the distance correlation between the data samples and their phases. PDC requires adaptation of the Székely’s distance correlation to circular variables (phases). The resulting periodicity metric is best suited to sparse data sets, and it performs better than other methods for sawtooth-like periodicities. These include Cepheid and RR-Lyrae light curves, as well as radial velocity curves of eccentric spectroscopic binaries. The performance of the PDC periodogram in other contexts is almost as good as that of the Generalized Lomb–Scargle periodogram. The concept of phase distance correlation can be adapted also to astrometric data, and it has the potential to be suitable also for large evenly spaced data sets, after some algorithmic perfection. methods: data analysis, methods: statistical, binaries: spectroscopic, stars: variables: Cepheids, stars: variables: RR Lyrae 1 INTRODUCTION In the two previous papers of this series (Zucker 2015, hereafter Paper I; Zucker 2016, hereafter Paper II), I have introduced a new non-parametric approach to the detection of periodicities in sparse data sets. The new approach I introduced in those two papers followed the logic of string-length techniques (e.g. Lafler & Kinman 1965; Clarke 2002) and quantified the dependence between consecutive phase-folded samples (serial dependence), for every trial period. While the classic string-length methods essentially quantified the linear (Pearson) correlation in those pairs of samples, the new approach I introduced measured the generalized dependence thereof. The improved technique I introduced in Paper II used the Blum–Kiefer–Rosenblatt (BKR) statistic (Blum, Kiefer & Rosenblatt 1961) to quantify this dependence, constituting an improvement over the use of the Hoeffding statistic (Hoeffding 1948) in Paper I. The BKR approach proved to be particularly successful for periodicities of the sawtooth type, including also radial-velocity (RV) curves of eccentric single-lined spectroscopic binaries (SB1). The most basic and widely used technique to search for periodicities in unevenly sampled data is the Lomb–Scargle periodogram (Lomb 1976; Scargle 1982). The Lomb–Scargle periodogram is explicitly based on searching for sinusoidal periodicities, usually assuming that every periodicity can be represented as a series of sinusoids (harmonics). Lomb–Scargle periodogram has been improved and generalized in several ways (e.g. Bretthorst 2001a,b), and VanderPlas (2017) presents an excellent and intuitive introduction to this staple of astronomical data analysis. Here, I will refer by the name Generalized Lomb–Scargle (GLS) periodogram to the method introduced by Zechmeister & Kürster (2009) which simply considers a constant offset term in addition to the sinusoid. The inspiration to this work is the observation that the GLS can be represented also as a correlation measure. For a specific trial period, the GLS value is essentially the correlation between the samples and their corresponding phases according to the trial period. Since the phase is a circular variable (e.g. Mardia & Jupp 2000), the correlation is not calculated in the usual way applicable to two linear variables. Mardia (1976) suggested a statistic to measure the correlation between a linear variable and a circular one. A simple algebraic manipulation shows that the statistic Mardia introduced leads exactly to the GLS formula. Mardia’s statistic measures a linear-circular correlation. The basic change I propose here is to use a measure of a linear-circular dependence, rather than correlation. The way forward is provided by a recent progress in the estimation of dependence by Székely, Rizzo & Bakirov (2007). Székely et al. developed a measure of dependence between two variables (not necessarily of the same dimension), which they dubbed distance correlation. The distance correlation is zero if and only if the two variables are statistically independent, unlike the usual Pearson correlation statistic, which can be zero for dependent variables. In its usual formulation, the calculation of the distance correlation has an appealing resemblance to Pearson correlation, and can therefore be considered quite intuitive and elegant. The only missing step is the adaptation of the distance correlation to circular variables. I have applied the derivation of Székely et al. (2007) assuming one of the two variables is circular, and arrived at a satisfactory prescription. Thus, calculating the distance correlation between the samples and their phases at each trial period, amounts to a generalization of the GLS to general dependence – the Phase Distance Correlation (PDC) Periodogram. While in Section 2 I present the final expression I obtained, I justify it a little more rigorously in the Appendix. In Section 3, I present a comparison of the performance of the new periodogram against some other alternatives. Finally, I discuss those results and future work in Section 4. 2 PHASE DISTANCE CORRELATION In what follows, I detail the recipe to calculate the phase distance correlation for a given trial period. Most of the calculation is identical to the calculation of the distance correlation in Székely et al. (2007), except for the definition of the phase distance, which is an adaptation of the linear distance to a circular variable. In the Appendix, I justify this expression based on a rigorous derivation. Let us assume our data set consists of N samples xi (i = 1, ..., N), taken at times ti, and we wish to calculate the phase distance correlation for the trial period P. First, let us compute an N by N sample distance matrix:   $$a_{ij} = |x_i-x_j|{\, }.$$ (1) In order to compute a phase distance matrix, we start by calculating a phase difference matrix:   $$\phi _{ij} = (t_i-t_j)\, {\rm mod\,} {P}$$ (2)which we then convert to a phase distance matrix:   $$b_{ij} = \phi _{ij}(P-\phi _{ij}){\, }.$$ (3) This choice of expression for phase distance is not trivial. We might have thought about other, maybe more intuitive expressions, such as min (ϕij, P − ϕij), or sin (πϕij/P) (e.g. Mardia & Jupp 2000). However, this expression is the result of applying the derivation of Székely et al. to circular variables. In fact, I experimented with several other phase distance alternatives, and none yielded better performance than the one in equation (3). The next step in the calculation is ‘double centring’ of the matrices a and b, i.e.   $$A_{ij} = a_{ij}-a_{i\cdot }-a_{\cdot j}+a_{\cdot \cdot } B_{ij} = b_{ij}-b_{i\cdot }-b_{\cdot j}+b_{\cdot \cdot },$$ (4)where ai · is the ith row mean, a· j is the jth column mean, and a· · is the grand mean of the a matrix. Similar definitions apply for the b matrix. Aij and Bij are now the centred distance matrices. We can now finally define the dependence measure by:   $$\mathrm{PDC} = \frac{\sum \limits _{ij}A_{ij}B_{ij}}{\sqrt{(\sum \limits _{ij}A^2_{ij})(\sum \limits _{ij}B^2_{ij})}}.$$ (5)Superficially, if we ignore the dependencies among the entries of the matrices A and B, the numerator in equation (5) resembles the covariance between Aij and Bij, and the whole expression resembles the Pearson correlation. This resemblance led Székely et al. to choose the names distance covariance and distance correlation. PDC is a dimensionless quantity, and it is bounded below by 0 (complete phase independence), and above by 1. Higher values mean stronger dependence on phase, and therefore indicate a periodicity is more likely. I chose to show here two examples of applying the PDC periodogram to simulated data. In both examples, I drew random times between 0 and 1000  d, from a uniform distribution, and generated a periodic signal with a period of 2 d. In both examples I calculated the PDC and the GLS for comparison, for a uniform frequency grid ranging between 10−4 and 1 d−1, with a constant spacing of 10−4 d−1. Fig. 1 presents a case of 100 samples, and a pure sinusoidal signal, with no added noise. This is an easy case to detect, especially with the GLS, which is actually based on the assumption of sinusoidal signal. As expected, both periodograms manage to detect the signal very easily – a very prominent and sharp peak at the correct frequency is apparent. Moreover, it seems that the two periodograms are very similar, except for a small peak corresponding to a subharmonic of the true frequency, which appears in the PDC plot. Figure 1. View largeDownload slide An example application of the PDC periodogram to a 100-sample data set, containing a pure sinusoid. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Figure 1. View largeDownload slide An example application of the PDC periodogram to a 100-sample data set, containing a pure sinusoid. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Fig. 2 presents a much more difficult case, with only 20 samples, a sawtooth signal, and added noise, at a signal-to-noise ratio (SNR) of 2 (see Paper I for an exact definition of SNR in the current context). The PDC plot exhibits a decisive peak at the correct frequency, while no such prominent peaks appear in the GLS plot. The highest peak in the GLS periodogram actually appears at a completely wrong frequency. In the next section, I will try to quantify how typical this case is. Figure 2. View largeDownload slide An example application of the PDC periodogram to a 20-sample data set, containing a sawtooth signal, with added Gaussian noise with an SNR of 2. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Figure 2. View largeDownload slide An example application of the PDC periodogram to a 20-sample data set, containing a sawtooth signal, with added Gaussian noise with an SNR of 2. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. 3 COMPARISON I present here a performance comparison of PDC against three other periodicity metrics. I already mentioned the first one – GLS – in the previous section. The GLS is obviously a kind of gold standard in the field of periodicity searches in unevenly sampled data sets, against which every new method should be compared. Next is the serial von-Neumann Ratio (von Neumann et al. 1941), which I also included in the tests I performed in Paper I and Paper II. I use the periodicity metric based on the von-Neumann Ratio as a representative of the various string-length techniques. The third metric I use for benchmarking is the BKR-based approach I introduced in Paper II. The performance diagnostic I use is the Receiver Operating Characteristic (ROC) curve (e.g. Fawcett 2006). The ROC curve of a detection scheme is simply a plot of the True Positive Rate (or sensitivity), against the False Positive Rate (or type I error). As we change the detection threshold we change both of those two rates, and the dependence between them is monotonic. In all the tests I ran, I used the same procedure to draw the observation times as I did in the examples I have shown above – uniform distribution between 0 and 1000 d. I also calculated the periodicity metrics on the same frequency grid with 104 frequencies between 10−4 and 1 d−1. For each number of samples I wished to test, I first drew 1000 realizations of light curves with random times and uncorrelated Gaussian random samples. I then calculated the four periodograms for each such realization of white noise, and normalized them (i.e. subtracted the means and divided by the standard deviations) to get Z scores. For each periodogram, I got an empirical distribution of the maximum Z-score values. The percentiles of that distribution served as thresholds for a given FPR. I could now use those thresholds to test the periodograms on light curves that included periodic signals. I used six kinds of periodic signals – the same ones I used in Paper I and Paper II – pure sinusoid, almost sinusoidal, sawtooth, box shape (‘pulse’), eccentric eclipsing binary (EB) and eccentric SB1 RV curve. I simulated always the same period – 2 d. For a given configuration of periodicity shape, number of samples, and SNR, I tested 1000 realizations to estimate the true positive rate (TPR). I standardized each periodogram to get the corresponding Z scores, and applied the thresholds I had previously obtained from the uncorrelated Gaussian random samples. Thus, each threshold yielded a pair of FPR and TPR values, resulting in the ROC curve. A close look at Fig. 1 of Paper II reveals that the situation which differentiates the most between the various methods is having a 20-sample data set, with an SNR of 3. Therefore, I present in Fig. 3 the ROC curves obtained for data sets with 20 samples and SNR of 3. The ROC curves of PDC are represented by red lines, those of the GLS by green lines, the von-Neumann ratio by dark blue, and the BKR periodicity metric by light blue. Figure 3. View largeDownload slide ROC curves for the four periodicity metrics and six periodicity shapes, for light curves with 20 samples and SNR of 3. Red lines represent the PDC, green lines represent the GLS, dark blue stands for the von-Neumann ratio and light blue represents the BKR. Figure 3. View largeDownload slide ROC curves for the four periodicity metrics and six periodicity shapes, for light curves with 20 samples and SNR of 3. Red lines represent the PDC, green lines represent the GLS, dark blue stands for the von-Neumann ratio and light blue represents the BKR. In the case of pure sinusoid (upper left panel), as expected, the GLS dominates over all other methods. PDC performs only slightly better than the BKR periodogram, which has the poorest performance. This is no surprise, as the GLS is indeed expected to have the best performance for sinusoidal signals. The situation with the ‘almost sinusoidal’ periodicity shape is not much different. PDC turns out to be the best performer for sawtooth-like signals, better than BKR, which was previously shown to have the best performance for those signals in Paper II. BKR performs poorly for box-shape periodicity, as I have already shown in Paper II, but PDC is much better than BKR, and almost approaches the performance of the GLS in that situation. For eccentric EB, there seems not to be a significant difference between the failure of most methods, except for the von-Neumann periodogram which performs much better. For an RV curve of eccentric SB1, PDC performs almost like the BKR, except for a small advantage for PDC in the low FPR side. 4 DISCUSSION I introduced here the PDC periodogram – a periodogram which generalizes the GLS to shapes other than sinusoidal. PDC is essentially a generalization of the GLS correlation approach to general dependence, and it has an appealing algebraic structure similar to correlation. In all situations I presented in Fig. 3, the PDC periodogram outperformed the BKR periodogram, which I presented in Paper II, and thus constitutes a clear improvement. Unlike BKR, whose main advantage is for sawtooth-like periodicities, PDC also performs almost as well as the GLS in other circumstances. Sawtooth-like periodicities are typical in photometry to Cepheids and RR-Lyrae stars, and in spectroscopy to RV curves of eccentric SB1 and exoplanets. Recently, Pinamonti et al. (2017) highlighted the importance of improving the capability to detect periodicities in RV signals, especially for eccentric orbits of binary stars and exoplanets. The PDC periodogram serves to achieve this goal. The original distance correlation which Székely et al. (2007) presented, is well defined also for multidimensional variables. Thus, PDC, which is based on the distance correlation, can be extended also to two-dimensional signals. Specifically, it can be extended to astrometric signals. Currently, the astrometric detection of Keplerian orbits (of exoplanets and binary stars) is based on a somewhat naïve extension of the GLS to two-dimensional signals (Sozzetti et al. 2003; Bartlett, Ianna & Begam 2009). This approach performs poorly for eccentric orbits. A natural extension of the PDC periodogram to two-dimensional signals (using Euclidean distance for the two-dimensional variable) will probably provide a much more natural solution. It will be interesting to examine also the performance of the PDC periodogram under non-random uneven sampling laws, such as those affected by periodic observability constraints, or when the signal comprises two different periodicities, e.g. an RV curve of a two-planet system. The main drawback of the recipe I presented in Section 2 is its heavy computational burden. It involves O(N2) calculation for each trial period. This currently limits the application of PDC to sparse data sets. However, sparse data is the relevant context for the PDC periodogram, since in large data sets periodicities are usually adequately detectable by the GLS or other conventional techniques. Nevertheless, one approach to try and improve the complexity of the computation is suggested in Huo & Székely (2016). Huo & Székely use an unbiased version of distance correlation first suggested by Székely & Rizzo (2014), and apply to it an AVL-tree computational approach (Adelson-Velskii & Landis 1962), to obtain an O(Nlog N) algorithm to calculate the distance correlation. Future work should include an attempt to apply this algorithm to PDC and increase the allowed size of the data set. Another promising research direction concerns evenly spaced time series, such as Kepler light curves (Borucki et al. 2010). In this case, the phase distance matrix bij becomes a Toeplitz matrix and may be amenable to significant computational shortcuts (e.g. Victor 2001), especially for periods that are integer multiples of the sampling interval. Thus, specific periods may result in even more degenerate matrix structures, potentially constituting a set of ‘natural’ periods. This may pave way to a Fourier-like transform, one that is based on general periodicities, and not focusing on sinusoidal ones. In summary, the phase distance correlation approach I present here, has a potential to contribute significantly to solve the challenge of finding general periodicities, both in unevenly sampled sparse data sets and in large evenly sampled ones. Acknowledgements This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 848/16). REFERENCES Abramowitz M., Stegun I. A., 1972, Handbook of Mathematical Functions , Dover Press, New York Adelson-Velskii G. Landis E. M., 1962, Sov. Math. Doklady , 3, 1259 Bartlett J. L. Ianna P. A. Begam M. C., 2009, PASP , 121, 365 CrossRef Search ADS   Blum J. R. Kiefer J. Rosenblatt M., 1961, Ann. Math. Stat. , 32, 485 CrossRef Search ADS   Borucki W. J. Koch D. Basri G.et al.  , 2010, Science , 327, 977 CrossRef Search ADS PubMed  Bretthorst G. L., 2001a, in Mohammad-Djafari A., ed., AIP Conf. Ser. Vol. 568, Bayesian Inference and Maximum Entropy Methods in Science and Engineering . Am. Inst. Phys., New York, p. 241 Bretthorst G. L., 2001b, in Mohammad-Djafari A., ed., AIP Conf. Ser. Vol. 568, Bayesian Inference and Maximum Entropy Methods in Science and Engineering . Am. Inst. Phys., New York, p. 246 Clarke D., 2002, A&A , 386, 763 CrossRef Search ADS   Fawcett T., 2006, Pattern Recognit. Lett. , 27, 861 CrossRef Search ADS   Hoeffding W., 1948, Ann. Math. Stat. , 19, 293 CrossRef Search ADS   Huo X. Székely G. J., 2016, Technometrics , 58, 435 CrossRef Search ADS   Lafler J. Kinman T. D., 1965, ApJS , 11, 216 CrossRef Search ADS   Lomb N. R., 1976, Ap&SS , 39, 447 CrossRef Search ADS   Mardia K. V., 1976, Biometrika , 63, 403 CrossRef Search ADS   Mardia K. V., Jupp P. E., 2000, Directional Statistics , Wiley, Chichester, UK Pinamonti M. Sozzetti A. Bonomo A. S. Damasso M., 2017, MNRAS , 468, 3775 CrossRef Search ADS   Scargle J. D., 1982, ApJ , 263, 835 CrossRef Search ADS   Sozzetti A. Casertano S. Brown R. A. Lattanzi M. G., 2003, PASP , 115, 1072 CrossRef Search ADS   Székely G. J. Rizzo M. L., 2014, Ann. Stat. , 42, 2382 CrossRef Search ADS   Székely G. J. Rizzo M. L. Bakirov N. K., 2005, Ann. Stat. , 35, 2769 CrossRef Search ADS   VanderPlas J. T., 2017, preprint (arXiv:1703.09824) Victor P., 2001, Structured Matrices and Polynomials, Unified Superfast Algorithms , Birkhäuser, Basel, CH von Neumann J. Kent R. H. Bellinson H. R. Hart B. I., 1941, Ann. Math. Stat. , 12, 153 CrossRef Search ADS   Zechmeister M. Kürster M., 2009, A&A , 496, 577 CrossRef Search ADS   Zucker S., 2015, MNRAS , 449, 2723 (Paper I) CrossRef Search ADS   Zucker S., 2016, MNRAS , 457, L118 (Paper II) CrossRef Search ADS   APPENDIX: PHASE DISTANCE JUSTIFICATION Székely et al. (2007) based their derivation of the distance correlation on the probability-theory concept of the characteristic function. The characteristic function φX of a linear random variable X is essentially equivalent to a Fourier transform of its probability density function fX:   $$\varphi _X(t) = \int _\mathbb {R} {\rm e}^{itx} f_X(x) \mathrm{d}x.$$ (A1)The joint characteristic function φXY(t, s) of the two random variables is similarly obtained from the joint probability density function fXY(x, y). Statistical independence of the two variables mean fXY(x, y) = fX(x)fY(y), which is known to be equivalent to φXY(t, s) = φX(t)φY(s). Essentially, Székely et al. proposed the following dependence measure between the two linear random variables X and Y:   $$\mathcal {V}^2(X,Y) = \frac{1}{\pi ^2} \int _{\mathbb {R}^2} \frac{|\varphi _{XY}(t,s) - \varphi _X(t) \varphi _Y(s)|^2}{t^2s^2} {\, } \mathrm{d}t {\, } \mathrm{d}s.$$ (A2)Székely et al. dub this measure distance covariance. Originally, they considered multidimensional variables, but for our needs here we focus only on one-dimensional variables. Given a finite bivariate sample: $$\lbrace (x_k,y_k)\rbrace _{k=1}^N$$, the empirical characteristic function of X is defined by:   $$\varphi ^N_X(t) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k}$$ (A3)and the bivariate joint empirical characteristic function is:   $$\varphi ^N_{XY}(t,s) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k+isy_k}.$$ (A4) Thus, the sample distance covariance is now given by:   $$\mathcal {V}_N^2(X,Y) = \frac{1}{\pi ^2} \int _{\mathbb {R}^2} \frac{|\varphi ^N_{XY}(t,s) - \varphi ^N_X(t) \varphi ^N_Y(s)|^2}{t^2s^2} {\, } \mathrm{d}t {\, } \mathrm{d}s.$$ (A5) Székely et al. show how the expression above leads to the final expression in equation (5). I follow here the path they drew, but apply it to the case where Y is an angular variable – a circular variable ranging between 0 and 2π. The characteristic function of angular variables is defined only for integer values m (Mardia & Jupp 2000):   $$\varphi _Y(m) = \int _0^{2\pi } {\rm e}^{imy} f_Y(y) \mathrm{d}y.$$ (A6)The empirical characteristic function is now:   $$\varphi _Y^N(m) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{imy_k}.$$ (A7)The bivariate joint empirical characteristic function for a linear variable X and a circular variable Y is a function of a real variable t and an integer m:   $$\varphi _{XY}^N(t,m) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k+imy_k}.$$ (A8) In analogy to the distance covariance for two linear variables, we can now define the linear-circular distance covariance (the phase distance covariance) in the following way:   $$\mathcal {V}_N^2(X,Y) = \frac{1}{2\pi } \int _\mathbb {R} \sum _{^{m=-\infty }_{ m \ne 0}}^\infty \frac{|\varphi ^N_{XY}(t,m) - \varphi ^N_X(t) \varphi ^N_Y(m)|^2}{m^2t^2} {\, } \mathrm{d}t.$$ (A9) The two following observations determine the results of this calculation and also the normalization constant 2π. The first is a special case of Lemma 1 in Székely et al. (2007), which states that for every real x:   $$\int _\mathbb {R} \frac{1-\cos (tx)}{t^2} \mathrm{d}t = \pi |x|.$$ (A10)The second is an adaptation to the angular case, which states that for every y ∈ [0, 2π):   $$\sum _{^{m=-\infty }_{ m \ne 0}}^\infty \frac{1-\cos (my)}{m^2} = \frac{1}{2} y (2\pi -y).$$ (A11)This last assertion can be proven easily using the properties of the standard Clausen function Sl2(θ) (Abramowitz & Stegun 1972). Expansion of the numerator in equation (A9) leads to several expressions of the form of the expressions in equations A10 and A11. The origin of the cosine terms is the complex exponent in the definitions of the empirical characteristic functions (equations A3, A7, and A8). Since sine is an odd function, the corresponding sine terms cancel out in the integral and the infinite sum. After performing the integral and the infinite sum, one is left with algebraic expressions of the kind of equations 2.15–2.18 in Székely et al. (2007). The only difference is that instead of the Y-distance |Yk − Yl|q, we have the new phase distance |yk − yl|(2π − |yk − yl|)/2 . The last algebraic step which leads to the expressions in equation (4) and the numerator of equation (5) is proven in the appendix of Székely et al. (2007). Finally, for convenience, I have converted the angular units in radians to time units by multiplying twice by P/2π, which does not change the distance correlation value because of the normalization in equation (5). © 2017 The Author(s) Published by Oxford University Press on behalf of the Royal Astronomical Society http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Monthly Notices of the Royal Astronomical Society: Letters Oxford University Press

# Detection of periodicity based on independence tests – III. Phase distance correlation periodogram

, Volume 474 (1) – Feb 1, 2018
5 pages

/lp/ou_press/detection-of-periodicity-based-on-independence-tests-iii-phase-YNbANQCSaM
Publisher
Oxford University Press
ISSN
1745-3925
eISSN
1745-3933
D.O.I.
10.1093/mnrasl/slx198
Publisher site
See Article on Publisher Site

### Abstract

Abstract I present the Phase Distance Correlation (PDC) periodogram – a new periodicity metric, based on the Distance Correlation concept of Gábor Székely. For each trial period, PDC calculates the distance correlation between the data samples and their phases. PDC requires adaptation of the Székely’s distance correlation to circular variables (phases). The resulting periodicity metric is best suited to sparse data sets, and it performs better than other methods for sawtooth-like periodicities. These include Cepheid and RR-Lyrae light curves, as well as radial velocity curves of eccentric spectroscopic binaries. The performance of the PDC periodogram in other contexts is almost as good as that of the Generalized Lomb–Scargle periodogram. The concept of phase distance correlation can be adapted also to astrometric data, and it has the potential to be suitable also for large evenly spaced data sets, after some algorithmic perfection. methods: data analysis, methods: statistical, binaries: spectroscopic, stars: variables: Cepheids, stars: variables: RR Lyrae 1 INTRODUCTION In the two previous papers of this series (Zucker 2015, hereafter Paper I; Zucker 2016, hereafter Paper II), I have introduced a new non-parametric approach to the detection of periodicities in sparse data sets. The new approach I introduced in those two papers followed the logic of string-length techniques (e.g. Lafler & Kinman 1965; Clarke 2002) and quantified the dependence between consecutive phase-folded samples (serial dependence), for every trial period. While the classic string-length methods essentially quantified the linear (Pearson) correlation in those pairs of samples, the new approach I introduced measured the generalized dependence thereof. The improved technique I introduced in Paper II used the Blum–Kiefer–Rosenblatt (BKR) statistic (Blum, Kiefer & Rosenblatt 1961) to quantify this dependence, constituting an improvement over the use of the Hoeffding statistic (Hoeffding 1948) in Paper I. The BKR approach proved to be particularly successful for periodicities of the sawtooth type, including also radial-velocity (RV) curves of eccentric single-lined spectroscopic binaries (SB1). The most basic and widely used technique to search for periodicities in unevenly sampled data is the Lomb–Scargle periodogram (Lomb 1976; Scargle 1982). The Lomb–Scargle periodogram is explicitly based on searching for sinusoidal periodicities, usually assuming that every periodicity can be represented as a series of sinusoids (harmonics). Lomb–Scargle periodogram has been improved and generalized in several ways (e.g. Bretthorst 2001a,b), and VanderPlas (2017) presents an excellent and intuitive introduction to this staple of astronomical data analysis. Here, I will refer by the name Generalized Lomb–Scargle (GLS) periodogram to the method introduced by Zechmeister & Kürster (2009) which simply considers a constant offset term in addition to the sinusoid. The inspiration to this work is the observation that the GLS can be represented also as a correlation measure. For a specific trial period, the GLS value is essentially the correlation between the samples and their corresponding phases according to the trial period. Since the phase is a circular variable (e.g. Mardia & Jupp 2000), the correlation is not calculated in the usual way applicable to two linear variables. Mardia (1976) suggested a statistic to measure the correlation between a linear variable and a circular one. A simple algebraic manipulation shows that the statistic Mardia introduced leads exactly to the GLS formula. Mardia’s statistic measures a linear-circular correlation. The basic change I propose here is to use a measure of a linear-circular dependence, rather than correlation. The way forward is provided by a recent progress in the estimation of dependence by Székely, Rizzo & Bakirov (2007). Székely et al. developed a measure of dependence between two variables (not necessarily of the same dimension), which they dubbed distance correlation. The distance correlation is zero if and only if the two variables are statistically independent, unlike the usual Pearson correlation statistic, which can be zero for dependent variables. In its usual formulation, the calculation of the distance correlation has an appealing resemblance to Pearson correlation, and can therefore be considered quite intuitive and elegant. The only missing step is the adaptation of the distance correlation to circular variables. I have applied the derivation of Székely et al. (2007) assuming one of the two variables is circular, and arrived at a satisfactory prescription. Thus, calculating the distance correlation between the samples and their phases at each trial period, amounts to a generalization of the GLS to general dependence – the Phase Distance Correlation (PDC) Periodogram. While in Section 2 I present the final expression I obtained, I justify it a little more rigorously in the Appendix. In Section 3, I present a comparison of the performance of the new periodogram against some other alternatives. Finally, I discuss those results and future work in Section 4. 2 PHASE DISTANCE CORRELATION In what follows, I detail the recipe to calculate the phase distance correlation for a given trial period. Most of the calculation is identical to the calculation of the distance correlation in Székely et al. (2007), except for the definition of the phase distance, which is an adaptation of the linear distance to a circular variable. In the Appendix, I justify this expression based on a rigorous derivation. Let us assume our data set consists of N samples xi (i = 1, ..., N), taken at times ti, and we wish to calculate the phase distance correlation for the trial period P. First, let us compute an N by N sample distance matrix:   $$a_{ij} = |x_i-x_j|{\, }.$$ (1) In order to compute a phase distance matrix, we start by calculating a phase difference matrix:   $$\phi _{ij} = (t_i-t_j)\, {\rm mod\,} {P}$$ (2)which we then convert to a phase distance matrix:   $$b_{ij} = \phi _{ij}(P-\phi _{ij}){\, }.$$ (3) This choice of expression for phase distance is not trivial. We might have thought about other, maybe more intuitive expressions, such as min (ϕij, P − ϕij), or sin (πϕij/P) (e.g. Mardia & Jupp 2000). However, this expression is the result of applying the derivation of Székely et al. to circular variables. In fact, I experimented with several other phase distance alternatives, and none yielded better performance than the one in equation (3). The next step in the calculation is ‘double centring’ of the matrices a and b, i.e.   $$A_{ij} = a_{ij}-a_{i\cdot }-a_{\cdot j}+a_{\cdot \cdot } B_{ij} = b_{ij}-b_{i\cdot }-b_{\cdot j}+b_{\cdot \cdot },$$ (4)where ai · is the ith row mean, a· j is the jth column mean, and a· · is the grand mean of the a matrix. Similar definitions apply for the b matrix. Aij and Bij are now the centred distance matrices. We can now finally define the dependence measure by:   $$\mathrm{PDC} = \frac{\sum \limits _{ij}A_{ij}B_{ij}}{\sqrt{(\sum \limits _{ij}A^2_{ij})(\sum \limits _{ij}B^2_{ij})}}.$$ (5)Superficially, if we ignore the dependencies among the entries of the matrices A and B, the numerator in equation (5) resembles the covariance between Aij and Bij, and the whole expression resembles the Pearson correlation. This resemblance led Székely et al. to choose the names distance covariance and distance correlation. PDC is a dimensionless quantity, and it is bounded below by 0 (complete phase independence), and above by 1. Higher values mean stronger dependence on phase, and therefore indicate a periodicity is more likely. I chose to show here two examples of applying the PDC periodogram to simulated data. In both examples, I drew random times between 0 and 1000  d, from a uniform distribution, and generated a periodic signal with a period of 2 d. In both examples I calculated the PDC and the GLS for comparison, for a uniform frequency grid ranging between 10−4 and 1 d−1, with a constant spacing of 10−4 d−1. Fig. 1 presents a case of 100 samples, and a pure sinusoidal signal, with no added noise. This is an easy case to detect, especially with the GLS, which is actually based on the assumption of sinusoidal signal. As expected, both periodograms manage to detect the signal very easily – a very prominent and sharp peak at the correct frequency is apparent. Moreover, it seems that the two periodograms are very similar, except for a small peak corresponding to a subharmonic of the true frequency, which appears in the PDC plot. Figure 1. View largeDownload slide An example application of the PDC periodogram to a 100-sample data set, containing a pure sinusoid. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Figure 1. View largeDownload slide An example application of the PDC periodogram to a 100-sample data set, containing a pure sinusoid. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Fig. 2 presents a much more difficult case, with only 20 samples, a sawtooth signal, and added noise, at a signal-to-noise ratio (SNR) of 2 (see Paper I for an exact definition of SNR in the current context). The PDC plot exhibits a decisive peak at the correct frequency, while no such prominent peaks appear in the GLS plot. The highest peak in the GLS periodogram actually appears at a completely wrong frequency. In the next section, I will try to quantify how typical this case is. Figure 2. View largeDownload slide An example application of the PDC periodogram to a 20-sample data set, containing a sawtooth signal, with added Gaussian noise with an SNR of 2. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. Figure 2. View largeDownload slide An example application of the PDC periodogram to a 20-sample data set, containing a sawtooth signal, with added Gaussian noise with an SNR of 2. The two upper panels show the data – raw (left-hand panel) and phase folded (right-hand panel). The middle panel shows the PDC and the lower panel shows the GLS. 3 COMPARISON I present here a performance comparison of PDC against three other periodicity metrics. I already mentioned the first one – GLS – in the previous section. The GLS is obviously a kind of gold standard in the field of periodicity searches in unevenly sampled data sets, against which every new method should be compared. Next is the serial von-Neumann Ratio (von Neumann et al. 1941), which I also included in the tests I performed in Paper I and Paper II. I use the periodicity metric based on the von-Neumann Ratio as a representative of the various string-length techniques. The third metric I use for benchmarking is the BKR-based approach I introduced in Paper II. The performance diagnostic I use is the Receiver Operating Characteristic (ROC) curve (e.g. Fawcett 2006). The ROC curve of a detection scheme is simply a plot of the True Positive Rate (or sensitivity), against the False Positive Rate (or type I error). As we change the detection threshold we change both of those two rates, and the dependence between them is monotonic. In all the tests I ran, I used the same procedure to draw the observation times as I did in the examples I have shown above – uniform distribution between 0 and 1000 d. I also calculated the periodicity metrics on the same frequency grid with 104 frequencies between 10−4 and 1 d−1. For each number of samples I wished to test, I first drew 1000 realizations of light curves with random times and uncorrelated Gaussian random samples. I then calculated the four periodograms for each such realization of white noise, and normalized them (i.e. subtracted the means and divided by the standard deviations) to get Z scores. For each periodogram, I got an empirical distribution of the maximum Z-score values. The percentiles of that distribution served as thresholds for a given FPR. I could now use those thresholds to test the periodograms on light curves that included periodic signals. I used six kinds of periodic signals – the same ones I used in Paper I and Paper II – pure sinusoid, almost sinusoidal, sawtooth, box shape (‘pulse’), eccentric eclipsing binary (EB) and eccentric SB1 RV curve. I simulated always the same period – 2 d. For a given configuration of periodicity shape, number of samples, and SNR, I tested 1000 realizations to estimate the true positive rate (TPR). I standardized each periodogram to get the corresponding Z scores, and applied the thresholds I had previously obtained from the uncorrelated Gaussian random samples. Thus, each threshold yielded a pair of FPR and TPR values, resulting in the ROC curve. A close look at Fig. 1 of Paper II reveals that the situation which differentiates the most between the various methods is having a 20-sample data set, with an SNR of 3. Therefore, I present in Fig. 3 the ROC curves obtained for data sets with 20 samples and SNR of 3. The ROC curves of PDC are represented by red lines, those of the GLS by green lines, the von-Neumann ratio by dark blue, and the BKR periodicity metric by light blue. Figure 3. View largeDownload slide ROC curves for the four periodicity metrics and six periodicity shapes, for light curves with 20 samples and SNR of 3. Red lines represent the PDC, green lines represent the GLS, dark blue stands for the von-Neumann ratio and light blue represents the BKR. Figure 3. View largeDownload slide ROC curves for the four periodicity metrics and six periodicity shapes, for light curves with 20 samples and SNR of 3. Red lines represent the PDC, green lines represent the GLS, dark blue stands for the von-Neumann ratio and light blue represents the BKR. In the case of pure sinusoid (upper left panel), as expected, the GLS dominates over all other methods. PDC performs only slightly better than the BKR periodogram, which has the poorest performance. This is no surprise, as the GLS is indeed expected to have the best performance for sinusoidal signals. The situation with the ‘almost sinusoidal’ periodicity shape is not much different. PDC turns out to be the best performer for sawtooth-like signals, better than BKR, which was previously shown to have the best performance for those signals in Paper II. BKR performs poorly for box-shape periodicity, as I have already shown in Paper II, but PDC is much better than BKR, and almost approaches the performance of the GLS in that situation. For eccentric EB, there seems not to be a significant difference between the failure of most methods, except for the von-Neumann periodogram which performs much better. For an RV curve of eccentric SB1, PDC performs almost like the BKR, except for a small advantage for PDC in the low FPR side. 4 DISCUSSION I introduced here the PDC periodogram – a periodogram which generalizes the GLS to shapes other than sinusoidal. PDC is essentially a generalization of the GLS correlation approach to general dependence, and it has an appealing algebraic structure similar to correlation. In all situations I presented in Fig. 3, the PDC periodogram outperformed the BKR periodogram, which I presented in Paper II, and thus constitutes a clear improvement. Unlike BKR, whose main advantage is for sawtooth-like periodicities, PDC also performs almost as well as the GLS in other circumstances. Sawtooth-like periodicities are typical in photometry to Cepheids and RR-Lyrae stars, and in spectroscopy to RV curves of eccentric SB1 and exoplanets. Recently, Pinamonti et al. (2017) highlighted the importance of improving the capability to detect periodicities in RV signals, especially for eccentric orbits of binary stars and exoplanets. The PDC periodogram serves to achieve this goal. The original distance correlation which Székely et al. (2007) presented, is well defined also for multidimensional variables. Thus, PDC, which is based on the distance correlation, can be extended also to two-dimensional signals. Specifically, it can be extended to astrometric signals. Currently, the astrometric detection of Keplerian orbits (of exoplanets and binary stars) is based on a somewhat naïve extension of the GLS to two-dimensional signals (Sozzetti et al. 2003; Bartlett, Ianna & Begam 2009). This approach performs poorly for eccentric orbits. A natural extension of the PDC periodogram to two-dimensional signals (using Euclidean distance for the two-dimensional variable) will probably provide a much more natural solution. It will be interesting to examine also the performance of the PDC periodogram under non-random uneven sampling laws, such as those affected by periodic observability constraints, or when the signal comprises two different periodicities, e.g. an RV curve of a two-planet system. The main drawback of the recipe I presented in Section 2 is its heavy computational burden. It involves O(N2) calculation for each trial period. This currently limits the application of PDC to sparse data sets. However, sparse data is the relevant context for the PDC periodogram, since in large data sets periodicities are usually adequately detectable by the GLS or other conventional techniques. Nevertheless, one approach to try and improve the complexity of the computation is suggested in Huo & Székely (2016). Huo & Székely use an unbiased version of distance correlation first suggested by Székely & Rizzo (2014), and apply to it an AVL-tree computational approach (Adelson-Velskii & Landis 1962), to obtain an O(Nlog N) algorithm to calculate the distance correlation. Future work should include an attempt to apply this algorithm to PDC and increase the allowed size of the data set. Another promising research direction concerns evenly spaced time series, such as Kepler light curves (Borucki et al. 2010). In this case, the phase distance matrix bij becomes a Toeplitz matrix and may be amenable to significant computational shortcuts (e.g. Victor 2001), especially for periods that are integer multiples of the sampling interval. Thus, specific periods may result in even more degenerate matrix structures, potentially constituting a set of ‘natural’ periods. This may pave way to a Fourier-like transform, one that is based on general periodicities, and not focusing on sinusoidal ones. In summary, the phase distance correlation approach I present here, has a potential to contribute significantly to solve the challenge of finding general periodicities, both in unevenly sampled sparse data sets and in large evenly sampled ones. Acknowledgements This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 848/16). REFERENCES Abramowitz M., Stegun I. A., 1972, Handbook of Mathematical Functions , Dover Press, New York Adelson-Velskii G. Landis E. M., 1962, Sov. Math. Doklady , 3, 1259 Bartlett J. L. Ianna P. A. Begam M. C., 2009, PASP , 121, 365 CrossRef Search ADS   Blum J. R. Kiefer J. Rosenblatt M., 1961, Ann. Math. Stat. , 32, 485 CrossRef Search ADS   Borucki W. J. Koch D. Basri G.et al.  , 2010, Science , 327, 977 CrossRef Search ADS PubMed  Bretthorst G. L., 2001a, in Mohammad-Djafari A., ed., AIP Conf. Ser. Vol. 568, Bayesian Inference and Maximum Entropy Methods in Science and Engineering . Am. Inst. Phys., New York, p. 241 Bretthorst G. L., 2001b, in Mohammad-Djafari A., ed., AIP Conf. Ser. Vol. 568, Bayesian Inference and Maximum Entropy Methods in Science and Engineering . Am. Inst. Phys., New York, p. 246 Clarke D., 2002, A&A , 386, 763 CrossRef Search ADS   Fawcett T., 2006, Pattern Recognit. Lett. , 27, 861 CrossRef Search ADS   Hoeffding W., 1948, Ann. Math. Stat. , 19, 293 CrossRef Search ADS   Huo X. Székely G. J., 2016, Technometrics , 58, 435 CrossRef Search ADS   Lafler J. Kinman T. D., 1965, ApJS , 11, 216 CrossRef Search ADS   Lomb N. R., 1976, Ap&SS , 39, 447 CrossRef Search ADS   Mardia K. V., 1976, Biometrika , 63, 403 CrossRef Search ADS   Mardia K. V., Jupp P. E., 2000, Directional Statistics , Wiley, Chichester, UK Pinamonti M. Sozzetti A. Bonomo A. S. Damasso M., 2017, MNRAS , 468, 3775 CrossRef Search ADS   Scargle J. D., 1982, ApJ , 263, 835 CrossRef Search ADS   Sozzetti A. Casertano S. Brown R. A. Lattanzi M. G., 2003, PASP , 115, 1072 CrossRef Search ADS   Székely G. J. Rizzo M. L., 2014, Ann. Stat. , 42, 2382 CrossRef Search ADS   Székely G. J. Rizzo M. L. Bakirov N. K., 2005, Ann. Stat. , 35, 2769 CrossRef Search ADS   VanderPlas J. T., 2017, preprint (arXiv:1703.09824) Victor P., 2001, Structured Matrices and Polynomials, Unified Superfast Algorithms , Birkhäuser, Basel, CH von Neumann J. Kent R. H. Bellinson H. R. Hart B. I., 1941, Ann. Math. Stat. , 12, 153 CrossRef Search ADS   Zechmeister M. Kürster M., 2009, A&A , 496, 577 CrossRef Search ADS   Zucker S., 2015, MNRAS , 449, 2723 (Paper I) CrossRef Search ADS   Zucker S., 2016, MNRAS , 457, L118 (Paper II) CrossRef Search ADS   APPENDIX: PHASE DISTANCE JUSTIFICATION Székely et al. (2007) based their derivation of the distance correlation on the probability-theory concept of the characteristic function. The characteristic function φX of a linear random variable X is essentially equivalent to a Fourier transform of its probability density function fX:   $$\varphi _X(t) = \int _\mathbb {R} {\rm e}^{itx} f_X(x) \mathrm{d}x.$$ (A1)The joint characteristic function φXY(t, s) of the two random variables is similarly obtained from the joint probability density function fXY(x, y). Statistical independence of the two variables mean fXY(x, y) = fX(x)fY(y), which is known to be equivalent to φXY(t, s) = φX(t)φY(s). Essentially, Székely et al. proposed the following dependence measure between the two linear random variables X and Y:   $$\mathcal {V}^2(X,Y) = \frac{1}{\pi ^2} \int _{\mathbb {R}^2} \frac{|\varphi _{XY}(t,s) - \varphi _X(t) \varphi _Y(s)|^2}{t^2s^2} {\, } \mathrm{d}t {\, } \mathrm{d}s.$$ (A2)Székely et al. dub this measure distance covariance. Originally, they considered multidimensional variables, but for our needs here we focus only on one-dimensional variables. Given a finite bivariate sample: $$\lbrace (x_k,y_k)\rbrace _{k=1}^N$$, the empirical characteristic function of X is defined by:   $$\varphi ^N_X(t) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k}$$ (A3)and the bivariate joint empirical characteristic function is:   $$\varphi ^N_{XY}(t,s) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k+isy_k}.$$ (A4) Thus, the sample distance covariance is now given by:   $$\mathcal {V}_N^2(X,Y) = \frac{1}{\pi ^2} \int _{\mathbb {R}^2} \frac{|\varphi ^N_{XY}(t,s) - \varphi ^N_X(t) \varphi ^N_Y(s)|^2}{t^2s^2} {\, } \mathrm{d}t {\, } \mathrm{d}s.$$ (A5) Székely et al. show how the expression above leads to the final expression in equation (5). I follow here the path they drew, but apply it to the case where Y is an angular variable – a circular variable ranging between 0 and 2π. The characteristic function of angular variables is defined only for integer values m (Mardia & Jupp 2000):   $$\varphi _Y(m) = \int _0^{2\pi } {\rm e}^{imy} f_Y(y) \mathrm{d}y.$$ (A6)The empirical characteristic function is now:   $$\varphi _Y^N(m) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{imy_k}.$$ (A7)The bivariate joint empirical characteristic function for a linear variable X and a circular variable Y is a function of a real variable t and an integer m:   $$\varphi _{XY}^N(t,m) = \frac{1}{N} \sum _{k=1}^N {\rm e}^{itx_k+imy_k}.$$ (A8) In analogy to the distance covariance for two linear variables, we can now define the linear-circular distance covariance (the phase distance covariance) in the following way:   $$\mathcal {V}_N^2(X,Y) = \frac{1}{2\pi } \int _\mathbb {R} \sum _{^{m=-\infty }_{ m \ne 0}}^\infty \frac{|\varphi ^N_{XY}(t,m) - \varphi ^N_X(t) \varphi ^N_Y(m)|^2}{m^2t^2} {\, } \mathrm{d}t.$$ (A9) The two following observations determine the results of this calculation and also the normalization constant 2π. The first is a special case of Lemma 1 in Székely et al. (2007), which states that for every real x:   $$\int _\mathbb {R} \frac{1-\cos (tx)}{t^2} \mathrm{d}t = \pi |x|.$$ (A10)The second is an adaptation to the angular case, which states that for every y ∈ [0, 2π):   $$\sum _{^{m=-\infty }_{ m \ne 0}}^\infty \frac{1-\cos (my)}{m^2} = \frac{1}{2} y (2\pi -y).$$ (A11)This last assertion can be proven easily using the properties of the standard Clausen function Sl2(θ) (Abramowitz & Stegun 1972). Expansion of the numerator in equation (A9) leads to several expressions of the form of the expressions in equations A10 and A11. The origin of the cosine terms is the complex exponent in the definitions of the empirical characteristic functions (equations A3, A7, and A8). Since sine is an odd function, the corresponding sine terms cancel out in the integral and the infinite sum. After performing the integral and the infinite sum, one is left with algebraic expressions of the kind of equations 2.15–2.18 in Székely et al. (2007). The only difference is that instead of the Y-distance |Yk − Yl|q, we have the new phase distance |yk − yl|(2π − |yk − yl|)/2 . The last algebraic step which leads to the expressions in equation (4) and the numerator of equation (5) is proven in the appendix of Székely et al. (2007). Finally, for convenience, I have converted the angular units in radians to time units by multiplying twice by P/2π, which does not change the distance correlation value because of the normalization in equation (5). © 2017 The Author(s) Published by Oxford University Press on behalf of the Royal Astronomical Society

### Journal

Monthly Notices of the Royal Astronomical Society: LettersOxford University Press

Published: Feb 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations