FastDTW is approximate and Generally Slower than the Algorithm it Approximates
Wu, Renjie; Keogh, Eamonn J.
doi: 10.1109/TKDE.2020.3033752, 10.1109/ICDE51399.2021.00249
pmid: N/A
Abstract: Many time series data mining problems can be solved with repeated use of a distance measure. Examples of such tasks include similarity search, clustering, classification, anomaly detection and segmentation. For over two decades it has been known that the Dynamic Time Warping (DTW) distance measure is the best measure to use for most tasks, in most domains. Because the classic DTW algorithm has quadratic time complexity, many ideas have been introduced to reduce its amortized time, or to quickly approximate it. One of the most cited approximate approaches is FastDTW. The FastDTW algorithm has well over a thousand citations and has been explicitly used in several hundred research efforts. In this work, we make a surprising claim. In any realistic data mining application, the approximate FastDTW is much slower than the exact DTW. This fact clearly has implications for the community that uses this algorithm: allowing it to address much larger datasets, get exact results, and do so in less time.
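For orientation, the quadratic-time exact DTW mentioned in the abstract can be sketched as a small dynamic program. This is a minimal illustration, not the authors' implementation: the function name `dtw_distance`, the squared-difference local cost, and the optional `window` parameter (a band limiting how far the warping path may stray from the diagonal) are assumptions made for this example.

```python
import math


def dtw_distance(x, y, window=None):
    """Exact DTW distance between two 1-D numeric sequences.

    `window` is a hypothetical parameter for this sketch: it limits
    |i - j| along the warping path. None means no constraint, which
    gives the classic O(len(x) * len(y)) algorithm.
    """
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the final cell (n, m).
    window = max(window, abs(n - m))

    inf = math.inf
    # prev[j] holds the best cumulative cost for row i - 1, column j.
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessors.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])
```

Keeping only two rows of the cost matrix holds memory linear in the length of `y`; the running time remains quadratic unless `window` is small relative to the sequence lengths.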
Multilevel Emulation for Stochastic Computer Models with Application to Large Offshore Wind farms
Kennedy, Jack C.; Henderson, Daniel A.; Wilson, Kevin J.
doi: N/A
pmid: N/A
Abstract: Renewable energy projects, such as large offshore wind farms, are critical to achieving low-emission targets set by governments. Stochastic computer models allow us to explore future scenarios to aid decision making whilst considering the most relevant uncertainties. Complex stochastic computer models can be prohibitively slow and thus an emulator may be constructed and deployed to allow for efficient computation. We present a novel heteroscedastic Gaussian Process emulator which exploits cheap approximations to a stochastic offshore wind farm simulator. We also conduct a probabilistic sensitivity analysis to understand the influence of key parameters in the wind farm model which will help us to plan a probability elicitation in the future.
The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020
Alshaabi, Thayer; Dewhurst, David R.; Minot, Joshua R.; Arnold, Michael V.; Adams, Jane L.; Danforth, Christopher M.; Dodds, Peter Sheridan
doi: N/A
pmid: N/A
Abstract: Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': the balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1, the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
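The contagion ratio described in the abstract can be illustrated with a short sketch. It assumes that per-language daily counts of retweets and organic messages are already available; the function names and the input layout are hypothetical, and the authors' exact counting and averaging procedure may differ.

```python
def contagion_ratio(retweets, organic):
    """Ratio of retweets to organic messages for one language on one day.

    A value above 1 means retweets outnumber organic messages, i.e. the
    naive contagion threshold has been crossed.
    """
    if organic <= 0:
        raise ValueError("organic message count must be positive")
    return retweets / organic


def average_daily_ratio(daily_counts):
    """Mean of the daily contagion ratios.

    `daily_counts` is an assumed input format for this sketch: an
    iterable of (retweets, organic) pairs, one pair per day.
    """
    ratios = [contagion_ratio(r, o) for r, o in daily_counts]
    if not ratios:
        raise ValueError("at least one day of counts is required")
    return sum(ratios) / len(ratios)
```

Averaging the daily ratios, rather than dividing yearly totals, is a choice made for this example to mirror the abstract's phrase "average daily ratios".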
Short-term CO2 emissions forecasting based on decomposition approaches and its impact on electricity market scheduling
Bokde, Neeraj; Tranberg, Bo; Andresen, Gorm Bruun
doi: 10.1016/j.apenergy.2020.116061
pmid: N/A
Abstract: The world is facing major challenges related to global warming, and emissions of greenhouse gases are a major contributing factor. In 2017, energy industries accounted for 46% of all CO2 emissions globally, which shows a large potential for reduction. This paper proposes a novel short-term CO2 emissions forecast to enable intelligent scheduling of flexible electricity consumption to minimize the resulting CO2 emissions. Two time series decomposition methods are developed for short-term forecasting of the CO2 emissions of electricity. These are in turn benchmarked against a set of state-of-the-art models. The result is a new forecasting method with a 48-hour horizon targeted at the day-ahead electricity market. Forecasting benchmarks for France show that the new method has a mean absolute percentage error that is 25% lower than the best performing state-of-the-art model. Further, application of the forecast for scheduling flexible electricity consumption is studied for five European countries. Scheduling a flexible block of 4 hours of electricity consumption in a 24-hour interval can on average reduce the resulting CO2 emissions by 25% in France, 17% in Germany, 69% in Norway, 20% in Denmark, and just 3% in Poland when compared to consuming at random intervals during the day.
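The scheduling application in the abstract can be sketched as a sliding-window search over an hourly emissions forecast. This is an illustrative sketch only: the function name, the assumption that the block is contiguous and does not wrap past the end of the interval, and the use of the mean over all feasible start hours as a stand-in for "consuming at random intervals" are choices made here, not details taken from the paper.

```python
def best_block_start(intensity, block_hours=4):
    """Find the contiguous block with the lowest total forecast emissions.

    `intensity` is a list of forecast CO2 intensities, one per hour.
    Returns (start_hour, block_total, relative_reduction), where the
    reduction is measured against the mean total over every feasible
    start hour (an assumed proxy for a randomly placed block).
    """
    n = len(intensity)
    if not 0 < block_hours <= n:
        raise ValueError("block_hours must be between 1 and len(intensity)")

    # Total for the block starting at hour 0, then slide one hour at a time.
    window = sum(intensity[:block_hours])
    totals = [window]
    best_start, best_total = 0, window
    for start in range(1, n - block_hours + 1):
        window += intensity[start + block_hours - 1] - intensity[start - 1]
        totals.append(window)
        if window < best_total:
            best_start, best_total = start, window

    mean_total = sum(totals) / len(totals)
    reduction = 1 - best_total / mean_total if mean_total > 0 else 0.0
    return best_start, best_total, reduction
```

For a 24-hour interval, `intensity` would hold 24 forecast values and `block_hours=4` matches the flexible block size quoted in the abstract.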
Generalized Energy Based Models
Arbel, Michael; Zhou, Liang; Gretton, Arthur
doi: N/A
pmid: N/A
Abstract: We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks, returning comparable performance to direct maximum likelihood of the same networks.
Data-efficient Domain Randomization with Bayesian Optimization
Muratore, Fabio; Eilers, Christian; Gienger, Michael; Peters, Jan
doi: N/A
pmid: N/A
Abstract: When learning policies for robot control, the required real-world data is typically prohibitively expensive to acquire, so learning in simulation is a popular strategy. Unfortunately, such policies are often not transferable to the real world due to a mismatch between the simulation and reality, called the 'reality gap'. Domain randomization methods tackle this problem by randomizing the physics simulator (source domain) during training according to a distribution over domain parameters in order to obtain more robust policies that are able to overcome the reality gap. Most domain randomization approaches sample the domain parameters from a fixed distribution. This solution is suboptimal in the context of sim-to-real transferability, since it yields policies that have been trained without explicitly optimizing for the reward on the real system (target domain). Additionally, a fixed distribution assumes there is prior knowledge about the uncertainty over the domain parameters. In this paper, we propose Bayesian Domain Randomization (BayRn), a black-box sim-to-real algorithm that solves tasks efficiently by adapting the domain parameter distribution during learning given sparse data from the real-world target domain. BayRn uses Bayesian optimization to search the space of source domain distribution parameters for those that lead to a policy which maximizes the real-world objective, allowing for adaptive distributions during policy optimization. We experimentally validate the proposed approach in sim-to-sim as well as in sim-to-real experiments, comparing against three baseline methods on two robotic tasks. Our results show that BayRn is able to perform sim-to-real transfer, while significantly reducing the required prior knowledge.
High-dimensional Multivariate Geostatistics: A Bayesian Matrix-Normal Approach
Zhang, Lu; Banerjee, Sudipto; Finley, Andrew O.
doi: 10.1002/env.2675
pmid: N/A
Abstract: Joint modeling of spatially-oriented dependent variables is commonplace in the environmental sciences, where scientists seek to estimate the relationships among a set of environmental outcomes accounting for dependence among these outcomes and the spatial dependence for each outcome. Such modeling is now sought for massive data sets with variables measured at a very large number of locations. Bayesian inference, while attractive for accommodating uncertainties through hierarchical structures, can become computationally onerous for modeling massive spatial data sets because of its reliance on iterative estimation algorithms. This manuscript develops a conjugate Bayesian framework for analyzing multivariate spatial data using analytically tractable posterior distributions that obviate iterative algorithms. We discuss differences between modeling the multivariate response itself as a spatial process and that of modeling a latent process in a hierarchical model. We illustrate the computational and inferential benefits of these models using simulation studies and analysis of a Vegetation Index data set with spatially dependent observations numbering in the millions.