Statistics

journal article

Open Access Collection

FastDTW is approximate and Generally Slower than the Algorithm it Approximates

2020 Statistics

doi: 10.1109/TKDE.2020.3033752, 10.1109/ICDE51399.2021.00249; pmid: N/A

Abstract: Many time series data mining problems can be solved with repeated use of a distance measure. Examples of such tasks include similarity search, clustering, classification, anomaly detection and segmentation. For over two decades it has been known that the Dynamic Time Warping (DTW) distance measure is the best measure to use for most tasks, in most domains. Because the classic DTW algorithm has quadratic time complexity, many ideas have been introduced to reduce its amortized time, or to quickly approximate it. One of the most cited approximate approaches is FastDTW. The FastDTW algorithm has well over a thousand citations and has been explicitly used in several hundred research efforts. In this work, we make a surprising claim. In any realistic data mining application, the approximate FastDTW is much slower than the exact DTW. This fact clearly has implications for the community that uses this algorithm: allowing it to address much larger datasets, get exact results, and do so in less time.
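To make the "quadratic time complexity" of classic DTW concrete, here is a minimal Python sketch of the textbook dynamic-programming recurrence. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference local cost, and the absence of a warping window are assumptions made for illustration only.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    Assumptions: local cost is |x - y| and no warping window is applied,
    so the table has (len(a) + 1) * (len(b) + 1) cells, hence quadratic time.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because warping lets one point align with several, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the sequences differ in length.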

journal article

Open Access Collection

Multilevel Emulation for Stochastic Computer Models with Application to Large Offshore Wind farms

Kennedy, Jack C.;Henderson, Daniel A.;Wilson, Kevin J.

2020 Statistics

doi: N/A; pmid: N/A

Abstract: Renewable energy projects, such as large offshore wind farms, are critical to achieving low-emission targets set by governments. Stochastic computer models allow us to explore future scenarios to aid decision making whilst considering the most relevant uncertainties. Complex stochastic computer models can be prohibitively slow and thus an emulator may be constructed and deployed to allow for efficient computation. We present a novel heteroscedastic Gaussian Process emulator which exploits cheap approximations to a stochastic offshore wind farm simulator. We also conduct a probabilistic sensitivity analysis to understand the influence of key parameters in the wind farm model which will help us to plan a probability elicitation in the future.

journal article

Open Access Collection

The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020

Alshaabi, Thayer;Dewhurst, David R.;Minot, Joshua R.;Arnold, Michael V.;Adams, Jane L.;Danforth, Christopher M.;Dodds, Peter Sheridan

2020 Statistics

doi: N/A; pmid: N/A

Abstract: Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1 -- the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
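The contagion ratio described above can be illustrated with a short Python sketch. It reads "the balance of retweets to organic messages" as a simple quotient, which is an interpretation of the abstract rather than the paper's stated formula; the function names and the example counts are hypothetical.

```python
def contagion_ratio(retweets, organic):
    """Ratio of retweeted to organic messages for one language and period.

    Assumption: the 'balance' in the abstract is retweets / organic.
    """
    if organic <= 0:
        raise ValueError("organic message count must be positive")
    return retweets / organic


def exceeds_naive_threshold(retweets, organic):
    """True when retweets outnumber organic messages (ratio above 1)."""
    return contagion_ratio(retweets, organic) > 1


# Hypothetical daily counts, not taken from the paper's dataset.
print(contagion_ratio(300, 200))          # 1.5
print(exceeds_naive_threshold(300, 200))  # True
```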

journal article

Open Access Collection

Short-term CO2 emissions forecasting based on decomposition approaches and its impact on electricity market scheduling

Bokde, Neeraj;Tranberg, Bo;Andresen, Gorm Bruun

2020 Statistics

doi: 10.1016/j.apenergy.2020.116061; pmid: N/A

Abstract: The world is facing major challenges related to global warming, and emissions of greenhouse gases are a major contributing factor. In 2017, energy industries accounted for 46% of all CO2 emissions globally, which shows a large potential for reduction. This paper proposes a novel short-term CO2 emissions forecast to enable intelligent scheduling of flexible electricity consumption to minimize the resulting CO2 emissions. Two proposed time series decomposition methods are developed for short-term forecasting of the CO2 emissions of electricity. These are in turn benchmarked against a set of state-of-the-art models. The result is a new forecasting method with a 48-hour horizon targeted at the day-ahead electricity market. Forecasting benchmarks for France show that the new method has a mean absolute percentage error that is 25% lower than the best performing state-of-the-art model. Further, application of the forecast for scheduling flexible electricity consumption is studied for five European countries. Scheduling a flexible block of 4 hours of electricity consumption in a 24-hour interval can on average reduce the resulting CO2 emissions by 25% in France, 17% in Germany, 69% in Norway, 20% in Denmark, and just 3% in Poland when compared to consuming at random intervals during the day.
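The scheduling idea in the last sentence can be sketched in Python as a sliding-window search over an hourly emissions forecast. This is an illustrative sketch, not the paper's method: the function names, the use of the daily mean intensity as a stand-in for "consuming at random intervals", and the example forecast values are all assumptions.

```python
def best_block(intensity, block_hours=4):
    """Return (start_hour, mean_intensity) of the cleanest contiguous block."""
    if not 0 < block_hours <= len(intensity):
        raise ValueError("block_hours must fit inside the forecast horizon")
    window = sum(intensity[:block_hours])
    best_start, best_sum = 0, window
    for start in range(1, len(intensity) - block_hours + 1):
        # Slide the window one hour: add the entering hour, drop the leaving one.
        window += intensity[start + block_hours - 1] - intensity[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / block_hours


def relative_saving(intensity, block_hours=4):
    """Emission reduction versus the daily mean (assumed random-timing baseline)."""
    _, best_mean = best_block(intensity, block_hours)
    baseline = sum(intensity) / len(intensity)
    return 1 - best_mean / baseline


# Hypothetical 24-hour forecast in gCO2/kWh with a clean window at hours 2 to 5.
forecast = [100, 100, 50, 50, 50, 50] + [100] * 18
print(best_block(forecast))  # (2, 50.0)
```

The saving depends entirely on how much the intensity varies within the day, which is consistent with the abstract reporting very different reductions across countries.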

journal article

Open Access Collection

Generalized Energy Based Models

Arbel, Michael;Zhou, Liang;Gretton, Arthur

2020 Statistics

doi: N/A; pmid: N/A

Abstract: We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks, returning comparable performance to direct maximum likelihood of the same networks.

journal article

Open Access Collection

Variable fusion for Bayesian linear regression via spike-and-slab priors

Wu, Shengyi;Shimamura, Kaito;Yoshikawa, Kohei;Murayama, Kazuaki;Kawano, Shuichi

2020 Statistics

doi: 10.1007/978-981-16-2765-1_41; pmid: N/A

Abstract: In linear regression models, fusion of coefficients is used to identify predictors having similar relationships with a response. This is called variable fusion. This paper presents a novel variable fusion method in terms of Bayesian linear regression models. We focus on hierarchical Bayesian models based on a spike-and-slab prior approach. A spike-and-slab prior is tailored to perform variable fusion. To obtain estimates of the parameters, we develop a Gibbs sampler for the parameters. Simulation studies and a real data analysis show that our proposed method achieves better performance than previous methods.

journal article

Open Access Collection

Data-efficient Domain Randomization with Bayesian Optimization

Muratore, Fabio;Eilers, Christian;Gienger, Michael;Peters, Jan

2020 Statistics

doi: N/A; pmid: N/A

Abstract: When learning policies for robot control, the required real-world data is typically prohibitively expensive to acquire, so learning in simulation is a popular strategy. Unfortunately, such policies are often not transferable to the real world due to a mismatch between the simulation and reality, called the 'reality gap'. Domain randomization methods tackle this problem by randomizing the physics simulator (source domain) during training according to a distribution over domain parameters in order to obtain more robust policies that are able to overcome the reality gap. Most domain randomization approaches sample the domain parameters from a fixed distribution. This solution is suboptimal in the context of sim-to-real transferability, since it yields policies that have been trained without explicitly optimizing for the reward on the real system (target domain). Additionally, a fixed distribution assumes there is prior knowledge about the uncertainty over the domain parameters. In this paper, we propose Bayesian Domain Randomization (BayRn), a black-box sim-to-real algorithm that solves tasks efficiently by adapting the domain parameter distribution during learning given sparse data from the real-world target domain. BayRn uses Bayesian optimization to search the space of source domain distribution parameters such that this leads to a policy which maximizes the real-world objective, allowing for adaptive distributions during policy optimization. We experimentally validate the proposed approach in sim-to-sim as well as in sim-to-real experiments, comparing against three baseline methods on two robotic tasks. Our results show that BayRn is able to perform sim-to-real transfer, while significantly reducing the required prior knowledge.

journal article

Open Access Collection

High-dimensional Multivariate Geostatistics: A Bayesian Matrix-Normal Approach

Zhang, Lu;Banerjee, Sudipto;Finley, Andrew O.

2020 Statistics

doi: 10.1002/env.2675; pmid: N/A

Abstract: Joint modeling of spatially-oriented dependent variables is commonplace in the environmental sciences, where scientists seek to estimate the relationships among a set of environmental outcomes accounting for dependence among these outcomes and the spatial dependence for each outcome. Such modeling is now sought for massive data sets with variables measured at a very large number of locations. Bayesian inference, while attractive for accommodating uncertainties through hierarchical structures, can become computationally onerous for modeling massive spatial data sets because of its reliance on iterative estimation algorithms. This manuscript develops a conjugate Bayesian framework for analyzing multivariate spatial data using analytically tractable posterior distributions that obviate iterative algorithms. We discuss differences between modeling the multivariate response itself as a spatial process and that of modeling a latent process in a hierarchical model. We illustrate the computational and inferential benefits of these models using simulation studies and analysis of a Vegetation Index data set with spatially dependent observations numbering in the millions.

journal article

Open Access Collection

New statistical model for misreported data with application to current public health challenges

Moriña, David;Fernández-Fontelo, Amanda;Cabaña, Alejandra;Puig, Pedro

2020 Statistics

doi: N/A; pmid: N/A

Abstract: The main goal of this work is to present a new model able to deal with potentially misreported continuous time series. The proposed model is able to handle the autocorrelation structure in continuous time series data, which might be partially or totally underreported or overreported. Its performance is illustrated through a comprehensive simulation study considering several autocorrelation structures and two real data applications on human papillomavirus incidence in Girona (Catalunya, Spain) and COVID-19 incidence in the Chinese region of Heilongjiang.

journal article

Open Access Collection

Unsupervised Domain Adaptation Through Transferring both the Source-Knowledge and Target-Relatedness Simultaneously

Tian, Qing;Zhu, Yanan;Ma, Chuang;Cao, Meng

2020 Statistics

doi: N/A; pmid: N/A

Abstract: Unsupervised domain adaptation (UDA) is an emerging research topic in the field of machine learning and pattern recognition, which aims to help the learning of unlabeled target domain by transferring knowledge from the source domain.
