Algorithm 1004Reizenstein, Jeremy F.; Graham, Benjamin
doi: 10.1145/3371237pmid: N/A
Iterated-integral signatures and log signatures are sequences calculated from a path that characterizes its shape. They originate from the work of K. T. Chen and have become important through Terry Lyons’s theory of differential equations driven by rough paths, which is an important developing area of stochastic analysis. They have applications in statistics and machine learning, where there can be a need to calculate finite parts of them quickly for many paths. We introduce the signature and the most basic information (displacement and signed areas) that it contains. We present algorithms for efficiently calculating these signatures. For log signatures this requires consideration of the structure of free Lie algebras. We benchmark the performance of the algorithms. The methods are implemented in C++ and released as a Python extension package, which also supports differentiation. In combination with a machine learning library (Tensorflow, PyTorch, or Theano), this allows end-to-end learning of neural networks involving signatures.
Algorithm 1003Davis, Timothy A.; Hager, William W.; Kolodziej, Scott P.; Yeralan, S. Nuri
doi: 10.1145/3337792pmid: N/A
Partitioning graphs is a common and useful operation in many areas, from parallel computing to VLSI design to sparse matrix algorithms. In this article, we introduce Mongoose, a multilevel hybrid graph partitioning algorithm and library. Building on previous work in multilevel partitioning frameworks and combinatoric approaches, we introduce novel stall-reducing and stall-free coarsening strategies, as well as an efficient hybrid algorithm leveraging (1) traditional combinatoric methods and (2) continuous quadratic programming formulations. We demonstrate how this new hybrid algorithm outperforms either strategy in isolation, and we also compare Mongoose to METIS and demonstrate its effectiveness on large and social networking (power law) graphs.
Strassen’s Algorithm Reloaded on GPUsHuang, Jianyu; Yu, Chenhan D.; Geijn, Robert A. van de
doi: 10.1145/3372419pmid: N/A
Conventional Graphics Processing Unit (GPU) implementations of Strassen’s algorithm (Strassen) rely on the existing high-performance matrix multiplication (gemm), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, “squarish” matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from gemm, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for gemm and Strassen. Overall, our 1-level Strassen can achieve up to 1.11 speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19 speedup with a crossover point at 7,680.
Algorithm 1006Abergel, Rémy; Moisan, Lionel
doi: 10.1145/3365983pmid: N/A
We present a computational procedure to evaluate the integral ∫yx sp-1 e-μs ds for 0 ≤ x < y ≤ +∞,μ = ±1, p> 0, which generalizes the lower (x=0) and upper (y=+∞) incomplete gamma functions. To allow for large values of x, y, and p while avoiding under/overflow issues in the standard double precision floating point arithmetic, we use an explicit normalization that is much more efficient than the classical ratio with the complete gamma function. The generalized incomplete gamma function is estimated with continued fractions, with integrations by parts, or, when x ≈ y, with the Romberg numerical integration algorithm. We show that the accuracy reached by our algorithm improves a recent state-of-the-art method by two orders of magnitude, and it is essentially optimal considering the limitations imposed by floating point arithmetic. Moreover, the admissible parameter range of our algorithm (0 ≤ p,x,y ≤ 1015) is much larger than competing algorithms, and its robustness is assessed through massive usage in an image processing application.
Algorithm 1005Jonasson, Kristjan; Sigurdsson, Sven; Yngvason, Hordur Freyr; Ragnarsson, Petur Orri; Melsted, Pall
doi: 10.1145/3382191pmid: N/A
A set of Fortran subroutines for reverse mode algorithmic (or automatic) differentiation of the basic linear algebra subprograms (BLAS) is presented. This is preceded by a description of the mathematical tools used to obtain the formulae of these derivatives, with emphasis on special matrices supported by the BLAS: triangular, symmetric, and band. All single and double precision BLAS derivatives have been implemented, together with the Cholesky factorization from Linear Algebra Package (LAPACK). The subroutines are written in Fortran 2003 with a Fortran 77 interface to allow use from C and C++, as well as dynamic languages such as R, Python, Matlab, and Octave. The subroutines are all implemented by calling BLAS, thereby attaining fast runtime. Timing results show derivative runtimes that are about twice those of the corresponding BLAS, in line with theory. The emphasis is on reverse mode because it is more important for the main application that we have in mind, numerical optimization. Two examples are presented, one dealing with the least squares modeling of groundwater, and the other dealing with the maximum likelihood estimation of the parameters of a vector autoregressive time series. The article contains comprehensive tables of formulae for the BLAS derivatives as well as for several non-BLAS matrix operations commonly used in optimization.
A Software Platform for Adaptive High Order Multistep MethodsArévalo, Carmen; Jonsson-Glans, Erik; Olander, Josefine; Soto, Monica Selva; Söderlind, Gustaf
doi: 10.1145/3372159pmid: N/A
We present a software package, Modes, offering h-adaptive and p-adaptive linear multistep methods for first order initial value problems in ordinary differential equations. The implementation is based on a new parametric, grid-independent representation of multistep methods [Arévalo and Söderlind 2017]. Parameters are supplied for over 60 methods. For nonstiff problems, all maximal order methods (p=k for explicit and p=k+1 for implicit methods) are supported. For stiff computation, implicit methods of order p=k are included. A collection of step-size controllers based on digital filters is provided, generating smooth step-size sequences offering improved computational stability. Controllers may be selected to match method and problem classes. A new system for automatic order control is also provided for designated families of multistep methods, offering simultaneous h- and p-adaptivity. Implemented as a Matlab toolbox, the software covers high order computations with linear multistep methods within a unified, generic framework. Computational experiments show that the new software is competitive and offers qualitative improvements. Modes is available for downloading and is primarily intended as a platform for developing a new generation of state-of-the-art multistep solvers, as well as for true ceteris paribus evaluation of algorithmic components. This also enables method comparisons within a single implementation environment.
High-order Numerical Quadratures in a Tetrahedron with an Implicitly Defined Curved InterfaceCui, Tao; Leng, Wei; Liu, Huaqing; Zhang, Linbo; Zheng, Weiying
doi: 10.1145/3372144pmid: N/A
Given a shape regular tetrahedron and a curved surface that is defined implicitly by a nonlinear level set function and divides the tetrahedron into two sub-domains, a general-purpose, robust, and high-order numerical algorithm is proposed in this article for computing both volume integrals in the sub-domains and surface integrals on their common boundary. The algorithm uses a direct approach that decomposes 3D volume integrals or 2D surface integrals into multiple 1D integrals and computes the 1D integrals with Gaussian quadratures. It only requires finding roots of univariate nonlinear functions in given intervals and evaluating the integrand, the level set function, and the gradient of the level set function at given points. It can achieve arbitrarily high accuracy by increasing the orders of Gaussian quadratures, and it does not need extra a priori knowledge about the integrand and the level set function. The code for the algorithm is freely available in the open-source finite element toolbox Parallel Hierarchical Grid (PHG) and can serve as a basic building block for implementing 3D high-order numerical algorithms involving implicit interfaces or boundaries.
Architecture and Performance of Devito, a System for Automated Stencil ComputationLuporini, Fabio; Louboutin, Mathias; Lange, Michael; Kukreja, Navjot; Witte, Philipp; Hückelheim, Jan; Yount, Charles; Kelly, Paul H. J.; Herrmann, Felix J.; Gorman, Gerard J.
doi: 10.1145/3374916pmid: N/A
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process—from mathematical equations down to C++ code—is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling, and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the backend of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
PETSc DMNetworkAbhyankar, Shrirang; Betrie, Getnet; Maldonado, Daniel Adrian; Mcinnes, Lois C.; Smith, Barry; Zhang, Hong
doi: 10.1145/3344587pmid: N/A
We present DMNetwork, a high-level package included in the PETSc library for the simulation of multiphysics phenomena over large-scale networked systems. The library aims at applications that have networked structures such as those in electrical, gas, and water distribution systems. DMNetwork provides data and topology management, parallelization for multiphysics systems over a network, and hierarchical and composable solvers to exploit the problem structure. DMNetwork eases the simulation development cycle by providing the necessary infrastructure through simple abstractions to define and query the network components. This article presents the design of DMNetwork, illustrates its user interface, and demonstrates its ability to solve multiphysics systems, such as an electric circuit, a network of power grid and water subnetworks, and transient hydraulic systems over large networks with more than 2 billion variables on extreme-scale computers using up to 30,000 processors.
Product Algebras for Galerkin Discretisations of Boundary Integral Operators and their ApplicationsBetcke, Timo; Scroggs, Matthew W.; Śmigaj, Wojciech
doi: 10.1145/3368618pmid: N/A
Operator products occur naturally in a range of regularised boundary integral equation formulations. However, while a Galerkin discretisation only depends on the domain space and the test (or dual) space of the operator, products require a notion of the range. In the boundary element software package Bempp, we have implemented a complete operator algebra that depends on knowledge of the domain, range, and test space. The aim was to develop a way of working with Galerkin operators in boundary element software that is as close to working with the strong form on paper as possible, while hiding the complexities of Galerkin discretisations. In this article, we demonstrate the implementation of this operator algebra and show, using various Laplace and Helmholtz example problems, how it significantly simplifies the definition and solution of a wide range of typical boundary integral equation problems.