PFASST-ER: combining the parallel full approximation scheme in space and time with parallelization across the method

Ruth Schöbel · Robert Speck
Institut für Numerische Mathematik, TU Dresden, Dresden, Germany
Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany
Computing and Visualization in Science (2020). https://doi.org/10.1007/s00791-020-00330-5
Communicated by Sebastian Schöps.

Abstract
To extend prevailing scaling limits when solving time-dependent partial differential equations, the parallel full approximation scheme in space and time (PFASST) has been shown to be a promising parallel-in-time integrator. Similar to space–time multigrid, PFASST is able to compute multiple time-steps simultaneously and is therefore in particular suitable for large-scale applications on high performance computing systems. In this work we couple PFASST with a parallel spectral deferred correction (SDC) method, forming an unprecedented doubly time-parallel integrator. While PFASST provides global, large-scale “parallelization across the step”, the inner parallel SDC method allows integrating each individual time-step “parallel across the method” using a diagonalized local Quasi-Newton solver. This new method, which we call “PFASST with Enhanced concuRrency” (PFASST-ER), therefore exposes even more temporal concurrency. For two challenging nonlinear reaction-diffusion problems, we show that PFASST-ER works more efficiently than the classical variants of PFASST and can use more processors than time-steps.

Keywords Parallel-in-time integration · Parallel full approximation scheme in space and time · Spectral deferred corrections · Parallelization across the method · Parallelization across the step · Quasi-Newton

1 Introduction

The efficient use of modern high performance computing systems for solving space–time-dependent differential equations has become one of the key challenges in computational science. Exploiting the exponentially growing number of processors using traditional techniques for spatial parallelism becomes problematic when, for example, for a fixed problem size communication costs start to dominate. Parallel-in-time integration methods have recently been shown to provide a promising way to extend these scaling limits, see e.g. [7,18,22] to name but a few examples.

As one example, the “Parallel Full Approximation Scheme in Space and Time” (PFASST) by Emmett and Minion [6] allows one to integrate multiple time-steps simultaneously by using inner iterations of spectral deferred corrections (SDC) on a space–time hierarchy. It works on the so-called composite collocation problem, where each time-step includes a further discretization through quadrature nodes. This “parallelization across the steps” approach [3] targets large-scale parallelization on top of saturated spatial parallelization of partial differential equations (PDEs), where parallelization in the temporal domain acts as a multiplier for standard parallelization techniques in space. In contrast, “parallelization across the method” approaches [3] try to parallelize the integration within an individual time-step. While this typically results in smaller-scale parallelization in the time-domain, parallel efficiency and applicability of these methods are often more favorable. Most notably, the “revisionist integral deferred correction method” (RIDC) by Christlieb et al. [4] makes use of integral deferred corrections (which are indeed closely related to SDC) in order to compute multiple iterations in a pipelined way.
In [19], different approaches for parallelizing SDC across the method have been discussed, allowing the simultaneous computation of updates on multiple quadrature nodes. A much more structured and complete overview of parallel-in-time integration approaches can be found in [8]. The Parallel-in-Time community website (https://parallel-in-time.org) offers a comprehensive list of references.

The key goal of parallel-in-time integrators is to expose additional parallelism in the temporal domain in the cases where classical strategies like parallelism in space are either already saturated or not even possible. In [5] the classical Parareal method [15] is used to overcome the scaling limit of a space-parallel simulation of a kinematic dynamo on up to 1600 cores. The multigrid extension of Parareal, the “multigrid reduction in time” method (MGRIT), has been shown to provide significant speedup beyond spatial parallelization [7] for a multitude of problems. Using PFASST, a space-parallel N-body solver has been extended in [22] to run on up to 262,244 cores, while in [18] it has been coupled to a space-parallel multigrid solver on up to 458,752 cores.

So far, parallel-in-time methods have been implemented and tested either without any additional parallelization techniques or in combination with spatial parallelism. The goal for this work is to couple two different parallel-in-time strategies in order to extend the overall temporal parallelism exposed by the resulting integrator. To this end, we take the diagonalization idea for SDC presented in [19] (parallel across the method) and use it within PFASST (parallel across the steps). In this way we create an algorithm that computes approximations for different time-steps simultaneously but also works in parallel on each time-step itself. Doing so we combine the advantages of both parallelization techniques and create the “Parallel Full Approximation Scheme in Space and Time with Enhanced concuRrency” (PFASST-ER), an unprecedented doubly time-parallel integrator for PDEs. In the next section we will first introduce SDC and PFASST from an algebraic point of view, following [1,2].
We particularly focus on nonlinear problems and briefly explain the application of a Newton solver within PFASST. Then, this Newton solver is modified in Sect. 3: by using a diagonalization approach, the resulting Quasi-Newton method can be computed in parallel across the quadrature nodes of each time-step. In Sect. 4, we compare different variants of this idea to the classical PFASST implementation using two nonlinear reaction-diffusion test cases. We show parallel runtimes for different setups and evaluate the impact of the various Newton and diagonalization strategies. Section 5 concludes this work with a short summary and an outlook.

2 Parallelization across the steps with PFASST

We focus on an initial value problem

    u_t = f(u), \quad u(0) = u_0,    (1)

with u(t), u_0, f(u) ∈ ℝ. In order to keep the notation simple, we do not consider systems of initial value problems for now, where u(t) ∈ ℝ^N. Necessary modifications will be mentioned where needed. In a first step, we now discretize this problem in time and review the idea of single-step, time-serial spectral deferred corrections (SDC).

2.1 Spectral deferred corrections

For one time-step on the interval [t_l, t_{l+1}] the Picard formulation of Eq. (1) is given by

    u(t) = u_{l,0} + \int_{t_l}^{t} f(u(s))\,ds, \quad t \in [t_l, t_{l+1}].    (2)

To approximate the integral we use a spectral quadrature rule. We define M quadrature nodes τ_{l,1}, …, τ_{l,M}, which are given by t_l ≤ τ_{l,1} < … < τ_{l,M} = t_{l+1}. We will in the following explicitly exploit the condition that the last node is equal to the right integral boundary. Quadrature rules like Gauß-Radau or Gauß-Lobatto quadrature satisfy this property. We can then approximate the integrals from t_l to the nodes τ_{l,m}, such that

    u_{l,m} = u_{l,0} + \Delta t \sum_{j=1}^{M} q_{m,j} f(u_{l,j}),

where u_{l,m} ≈ u(τ_{l,m}), Δt = t_{l+1} − t_l, and the q_{m,j} represent the quadrature weights for the interval [t_l, τ_{l,m}] such that

    \sum_{j=1}^{M} q_{m,j} f(u_{l,j}) \approx \int_{t_l}^{\tau_{l,m}} f(u(s))\,ds.

We combine these M equations into one system

    (I - \Delta t\, Q f)(u_l) = u_{l,0},    (3)

which we call the “collocation problem”. Here, u_l = (u_{l,1}, …, u_{l,M})^T ≈ (u(τ_{l,1}), …, u(τ_{l,M}))^T ∈ ℝ^M, u_{l,0} = (u_{l,0}, …, u_{l,0})^T ∈ ℝ^M, Q = (q_{ij})_{i,j} ∈ ℝ^{M×M} is the matrix gathering the quadrature weights, and the vector function f: ℝ^M → ℝ^M is given by

    f(u_l) = (f(u_{l,1}), \dots, f(u_{l,M}))^T.

To simplify the notation we define

    C_f^{coll}(u_l) := (I - \Delta t\, Q f)(u_l).

We note that for u(t) ∈ ℝ^N, we need to replace Q by Q ⊗ I_N, where ⊗ denotes the Kronecker product.

System (3) is dense and a direct solution is not advisable, in particular if f is a nonlinear operator. The spectral deferred correction method solves the collocation problem in an iterative way. While it has been derived originally from classical deferred or defect correction strategies, we here follow [10,17,27] to present SDC as a preconditioned Picard iteration. A standard Picard iteration is given by

    u_l^{k+1} = u_l^k + (u_{l,0} - C_f^{coll}(u_l^k))

for k = 0, …, K, and some initial guess u_l^0.

In order to increase range and speed of convergence, we now precondition this iteration. The standard approach to preconditioning is to define an operator P_f^sdc, which is easy to invert but also close to the operator of the system. We define this “SDC preconditioner” as

    P_f^{sdc}(u_l) := (I - \Delta t\, Q_\Delta f)(u_l)

so that the preconditioned Picard iteration reads

    P_f^{sdc}(u_l^{k+1}) = (P_f^{sdc} - C_f^{coll})(u_l^k) + u_{l,0}.    (4)

The key for defining P_f^sdc is the choice of the matrix Q_Δ. The idea is to choose a “simpler” quadrature rule to generate a triangular matrix Q_Δ such that solving System (4) can be done by forward substitution. Common choices include the implicit Euler method or the so-called “LU-trick”, where the LU decomposition of Q^T with

    Q_\Delta^{LU} = U^T \quad \text{for} \quad Q^T = LU    (5)

is used [27].
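As an illustration of Eqs. (3)–(5) — a minimal sketch written for this text, not taken from the paper — the following Python snippet assembles Q for M = 3 Gauß-Radau (Radau IIA) nodes, builds Q_Δ^LU via the LU-trick, and runs the preconditioned Picard iteration (4) for the scalar linear test problem u_t = λu, for which the collocation problem can also be solved directly for comparison. The helper name, node values and all parameters are illustrative choices.

```python
import numpy as np
from scipy.interpolate import lagrange
from scipy.linalg import lu

def collocation_matrix(nodes):
    """Quadrature weights q_{m,j} = int_0^{tau_m} ell_j(s) ds for nodes on [0, 1]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        yj = np.zeros(M)
        yj[j] = 1.0
        antideriv = lagrange(nodes, yj).integ()   # antiderivative of the j-th Lagrange basis
        Q[:, j] = antideriv(nodes) - antideriv(0.0)
    return Q

# M = 3 right-sided Gauss-Radau (Radau IIA) nodes on [0, 1]
nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
M = len(nodes)
Q = collocation_matrix(nodes)

# "LU-trick" of Eq. (5): Q_Delta = U^T for Q^T = LU (pivoting/permutation ignored here)
_, _, U = lu(Q.T)
QD = U.T

# Scalar Dahlquist problem u_t = lam * u, one time-step of size dt
lam, dt, u0 = -1.0, 0.5, 1.0
I = np.eye(M)
C = I - dt * lam * Q                   # collocation operator of Eq. (3), linear case
rhs = u0 * np.ones(M)

u_exact = np.linalg.solve(C, rhs)      # direct solve of the collocation problem
u = u0 * np.ones(M)                    # initial guess: spread the initial value to all nodes

# SDC as preconditioned Picard iteration, Eq. (4)
P = I - dt * lam * QD
for k in range(10):
    u = np.linalg.solve(P, dt * lam * (Q - QD) @ u + rhs)
    print(k, np.linalg.norm(u - u_exact))
```

For linear f, the term (P_f^sdc − C_f^coll)(u^k) reduces to Δt λ (Q − Q_Δ) u^k, which is what the loop implements; the printed error against the direct collocation solution should shrink from sweep to sweep.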
System (4) establishes the method of spectral deferred corrections, which can be used to approximate the solution of the collocation problem on a single time-step. In the next step, we will couple multiple collocation problems and use SDC to explain the idea of the parallel full approximation scheme in space and time.

2.2 Parallel full approximation scheme in space and time

The idea of PFASST is to solve a “composite collocation problem” for multiple time-steps at once, using multigrid techniques and SDC for each step in parallel. This composite collocation problem for L time-steps can be written as

    \begin{pmatrix} C_f^{coll} & & & \\ -H & C_f^{coll} & & \\ & \ddots & \ddots & \\ & & -H & C_f^{coll} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_L \end{pmatrix} = \begin{pmatrix} u_{0,0} \\ 0 \\ \vdots \\ 0 \end{pmatrix},

where the matrix H ∈ ℝ^{M×M} on the lower subdiagonal transfers the information from one time-step to the next one. It takes the value at the last node τ_{l,M} of an interval [t_l, t_{l+1}], which is by requirement equal to the left boundary t_{l+1} of the following interval [t_{l+1}, t_{l+2}], and provides it as a new starting value for this interval. Therefore, the matrix H contains the value 1 at every position in the last column and zeros elsewhere. To write the composite collocation problem in a more compact form we define the vector u = (u_1, …, u_L)^T ∈ ℝ^{LM}, which contains the solution at all quadrature nodes of all time-steps, and the vector b = (u_{0,0}, 0, …, 0)^T ∈ ℝ^{LM}, which contains the initial condition for all nodes of the first interval and zeros elsewhere. We define F: ℝ^{LM} → ℝ^{LM} as an extension of f so that F(u) = (f(u_1), …, f(u_L))^T. Then, the composite collocation problem can be written as

    C_F(u) = b    (6)

with

    C_F(u) = (I - \Delta t (I_L \otimes Q) F - E \otimes H)(u),

where the matrix E ∈ ℝ^{L×L} just has ones on the first subdiagonal and zeros elsewhere. If u_{l,m} ∈ ℝ^N, we need to replace H by H ⊗ I_N.

SDC can be used to solve the composite collocation problem by forward substitution in a sequential way, which means solving one time-step after another, using the previous solution as initial value of the current time-step. The parallel-in-time integrator PFASST, on the other hand, solves the composite collocation problem by calculating on all time-steps simultaneously and is therefore an attractive alternative.
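The following short sketch — again illustrative, not from the paper — assembles the composite collocation operator C_F = I − Δt (I_L ⊗ Q) F − E ⊗ H for the linear scalar problem u_t = λu and checks that solving it all at once reproduces the sequential, step-by-step collocation solves, which is exactly the coupling that E and H encode. All names and parameter values are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import lagrange

def collocation_matrix(nodes):
    """Quadrature weights q_{m,j} = int_0^{tau_m} ell_j(s) ds for nodes on [0, 1]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        yj = np.zeros(M)
        yj[j] = 1.0
        a = lagrange(nodes, yj).integ()
        Q[:, j] = a(nodes) - a(0.0)
    return Q

# Setup: L time-steps, M Radau IIA nodes, scalar linear problem u_t = lam*u
L, lam, dt, u00 = 4, -1.0, 0.5, 1.0
nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
M = len(nodes)
Q = collocation_matrix(nodes)

H = np.zeros((M, M)); H[:, -1] = 1.0          # copies the last node to all nodes of the next step
E = np.diag(np.ones(L - 1), k=-1)             # ones on the first subdiagonal
I_M = np.eye(M)

# Composite collocation operator C_F = I - dt*lam*(I_L x Q) - E x H  (linear case)
C_F = np.eye(L * M) - dt * lam * np.kron(np.eye(L), Q) - np.kron(E, H)
b = np.zeros(L * M); b[:M] = u00              # initial condition enters only the first block

u_composite = np.linalg.solve(C_F, b)

# Reference: solve the L collocation problems sequentially, forwarding the last-node value
u_seq, u_start = [], u00
for l in range(L):
    u_l = np.linalg.solve(I_M - dt * lam * Q, u_start * np.ones(M))
    u_seq.append(u_l)
    u_start = u_l[-1]
u_seq = np.concatenate(u_seq)

print(np.max(np.abs(u_composite - u_seq)))    # agrees up to round-off
```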
The first step from SDC towards PFASST is the introduction of multiple levels, which are representations of the problem with different accuracies in space and time. In order to simplify the notation we focus on a two-level scheme consisting of a fine and a coarse level. Coarsening can be achieved, for example, by reducing the resolution in space, by decreasing the number of quadrature nodes on each interval or by solving implicit systems less accurately. Coarsening through a reduction of the quadrature nodes, in particular, does not seem to be worthwhile for our idea of parallelizing the corresponding calculations, since the processors would no longer be fully employed on the coarse grid and individual processors would instead have to communicate larger amounts of data. For this work, we only consider coarsening in space, i.e., by using a restriction operator R on a vector u ∈ ℝ^N we obtain a new vector ũ ∈ ℝ^Ñ. Vice versa, the interpolation operator T is used to interpolate values from ũ to u. Operators, vectors and numbers on the coarse level will be denoted by a tilde to avoid further index cluttering. Thus, the composite collocation operator on the coarse level is given by C̃_F. While C_F is defined on ℝ^{LMN}, C̃_F acts on ℝ^{LMÑ} with Ñ ≤ N, but as before we will neglect the space dimension in the following notation. The extension of the spatial transfer operators to the full space–time domain is given by R = I_{LM} ⊗ R and T = I_{LM} ⊗ T.

The main goal of the introduction of a coarse level is to move the serial part of the computation to this hopefully cheaper level, while being able to run the expensive part in parallel. For that, we define two preconditioners: a serial one with a lower subdiagonal for the coarse level and a parallel, block-diagonal one for the fine level. The serial preconditioner for the coarse level is defined by

    \tilde{P}_F = \begin{pmatrix} \tilde{P}^{sdc} & & & \\ -\tilde{H} & \tilde{P}^{sdc} & & \\ & \ddots & \ddots & \\ & & -\tilde{H} & \tilde{P}^{sdc} \end{pmatrix}

or, in a more compact way, by

    \tilde{P}_F(\tilde{u}) = (I - \Delta t (I_L \otimes \tilde{Q}_\Delta) \tilde{F} - E \otimes \tilde{H})(\tilde{u}).

Inverting this corresponds to a single inner iteration of SDC (a “sweep”) on step 1, then sending the result forward to step 2, an SDC sweep there, and so on. The parallel preconditioner on the fine level then simply reads

    P_F(u) = (I - \Delta t (I_L \otimes Q_\Delta) F)(u).

Applying P_F on the fine level leads to L decoupled SDC sweeps, which can be run in parallel.

For PFASST, these two preconditioners and the levels they work on are coupled using a full approximation scheme (FAS) known from nonlinear multigrid theory [25]. Following [1], one iteration of PFASST can then be formulated in four steps (a minimal code sketch of one such iteration is given at the end of this subsection):

1. the computation of the FAS correction τ^k, including the restriction of the fine value to the coarse level,

    \tau^k = \tilde{C}_F(R u^k) - R\, C_F(u^k),

2. the coarse sweep on the modified composite collocation problem on the coarse level,

    \tilde{P}_F(\tilde{u}^{k+1}) = (\tilde{P}_F - \tilde{C}_F)(\tilde{u}^k) + \tilde{b} + \tau^k,    (7)

3. the coarse grid correction applied to the fine level value,

    u^{k+1/2} = u^k + T(\tilde{u}^{k+1} - R u^k),    (8)

4. the fine sweep on the composite collocation problem on the fine level,

    P_F(u^{k+1}) = (P_F - C_F)(u^{k+1/2}) + b.    (9)

Fig. 1 Schematic view of PFASST on four processors (figure created with pfasst-tikz [14])

In Fig. 1, we see a schematic representation of the described steps. The time-step-parallel procedure described here is the same for all PFASST versions that we will introduce later. It is common to use as many processors as time-steps: in the given illustration four processors work on four time-steps. Therefore the temporal domain is divided into four intervals, which are assigned to four processors P_0, …, P_3. Every processor performs SDC sweeps on its assigned interval on alternating levels. The big red blocks represent fine sweeps, given by Eq. (9), and the small blue blocks coarse sweeps, given by Eq. (7).

The coarse sweep over all intervals is a serial process: after a processor has finished its coarse sweeps, it sends its results forward to the next processor, which takes this result as an initial value for its own coarse sweeps. We see the communication in the picture represented by small arrows, which connect the coarse sweeps of each interval. In (7), the need for communication with a neighboring process is obvious, because P̃_F is not a (block-)diagonal matrix, but has entries on its lower block-diagonal. P_F, on the other hand, is block-diagonal, which means that the processors can calculate on the fine level in parallel. We see in (9) that there is only a connection to previous time-steps through the right-hand side, where we gather values from the previous time-step and iteration but not from the current iteration. The picture shows this connection by a fine communication, which forwards data from each fine sweep to the following fine sweep of the right neighbor. The fine and coarse calculations on every processor are connected through the FAS corrections, which in our formula are part of the coarse sweep.
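To connect the four steps above with the operators defined in this section, here is a small, purely illustrative Python sketch of one PFASST iteration for the linear scalar problem u_t = λu. Since there is no spatial dimension to coarsen in this toy setting, the coarse level is mimicked by a perturbed coefficient λ̃ with R = T = I, which still yields a non-trivial FAS correction; in the paper, coarsening is done in space, a predictor phase is used, and the steps run distributed over processors. All names and values below are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import lagrange
from scipy.linalg import lu

def collocation_matrix(nodes):
    """Quadrature weights q_{m,j} = int_0^{tau_m} ell_j(s) ds for nodes on [0, 1]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        yj = np.zeros(M)
        yj[j] = 1.0
        a = lagrange(nodes, yj).integ()
        Q[:, j] = a(nodes) - a(0.0)
    return Q

# L time-steps, M Radau IIA nodes, linear test problem u_t = lam*u
L, dt, u00 = 4, 0.5, 1.0
lam, lam_coarse = -1.0, -1.2           # perturbed coarse coefficient stands in for a coarser spatial operator
nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
M = len(nodes)
Q = collocation_matrix(nodes)
_, _, U = lu(Q.T)
QD = U.T                               # LU-trick preconditioning matrix, Eq. (5)

H = np.zeros((M, M)); H[:, -1] = 1.0
E = np.diag(np.ones(L - 1), k=-1)
I = np.eye(L * M)

def operators(lmbda):
    """Composite collocation operator and the serial / parallel preconditioners for f(u) = lmbda*u."""
    C = I - dt * lmbda * np.kron(np.eye(L), Q) - np.kron(E, H)
    P_serial = I - dt * lmbda * np.kron(np.eye(L), QD) - np.kron(E, H)    # coarse sweep operator
    P_parallel = I - dt * lmbda * np.kron(np.eye(L), QD)                  # block-diagonal fine sweep operator
    return C, P_serial, P_parallel

C_fine, _, P_fine = operators(lam)
C_coarse, P_coarse, _ = operators(lam_coarse)

b = np.zeros(L * M); b[:M] = u00
u = np.zeros(L * M)                    # crude initial guess (a real run would use a predictor)

for k in range(10):
    tau = C_coarse @ u - C_fine @ u                                        # 1. FAS correction (R = I here)
    u_c = np.linalg.solve(P_coarse, (P_coarse - C_coarse) @ u + b + tau)   # 2. coarse sweep, Eq. (7), serial across steps
    u_half = u + (u_c - u)                                                 # 3. coarse grid correction, Eq. (8), T = R = I
    u = np.linalg.solve(P_fine, (P_fine - C_fine) @ u_half + b)            # 4. fine sweep, Eq. (9), block-diagonal
    print(k, np.linalg.norm(b - C_fine @ u))
```

The printed quantity is the residual of the composite collocation problem (6) after each iteration.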
2.3 PFASST-Newton

For each coarse and each fine sweep within each PFASST iteration, System (7) and System (9), respectively, need to be solved. If f is a nonlinear function these systems are nonlinear as well. The obvious and traditional way to proceed in this case is to linearize the problem locally (i.e. for each time-step, at each quadrature node) using Newton’s method. This way, PFASST is the outer solver with an inner Newton iteration. For triangular Q_Δ, the mth equation of the lth time-step on the coarse level reads

    (1 - \Delta t\, \tilde{q}^{\Delta}_{m,m} f)(\tilde{u}^{k+1}_{l,m}) = \tilde{u}^{k+1}_{l,0} + \Delta t \sum_{n=1}^{m-1} \tilde{q}^{\Delta}_{m,n} f(\tilde{u}^{k+1}_{l,n}) + \tilde{c}(\tilde{u}^k)_{l,m},

where ũ^{k+1}_{0,0} = ũ^k_{0,0} and c̃(ũ^k)_{l,m} is the mth entry of the lth block of c̃(ũ^k) := (P̃_F − C̃_F)(ũ^k) + τ^k. This term gathers all values of the previous iteration. The first summand of the right-hand side of the coarse level equation corresponds to b̃ and H̃, while the following sum comes from the lower triangular structure of Q̃_Δ.

For time-step l these equations can be solved one by one using Newton iterations and forward substitution. This is inherently serial, because the solution at the mth quadrature node depends on the solution at all previous nodes through the sum. Thus, while running parallel across the steps, each solution of the local collocation problem is found in serial (the node-by-node procedure is sketched below). In the next section, we will present a novel way of applying Newton’s method, which allows one to parallelize this part across the collocation nodes, joining parallelization across the steps with parallelization across the method.
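The node-by-node character of this inner solver can be seen in the following illustrative sketch (not from the paper): it performs SDC sweeps of type (4) for a single time-step of the nonlinear test problem u_t = −u³ and resolves each node equation with a scalar Newton iteration; the loop over m is the serial forward substitution described above. Problem, nodes and parameters are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import lagrange
from scipy.linalg import lu

def collocation_matrix(nodes):
    """Quadrature weights q_{m,j} = int_0^{tau_m} ell_j(s) ds for nodes on [0, 1]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        yj = np.zeros(M)
        yj[j] = 1.0
        a = lagrange(nodes, yj).integ()
        Q[:, j] = a(nodes) - a(0.0)
    return Q

# One time-step of the nonlinear test problem u_t = f(u) = -u^3 (illustrative choice)
f = lambda u: -u**3
df = lambda u: -3 * u**2

nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
M = len(nodes)
Q = collocation_matrix(nodes)
_, _, U = lu(Q.T)
QD = U.T                      # lower-triangular LU-trick preconditioning matrix

dt, u0 = 0.1, 1.0
u = u0 * np.ones(M)           # initial guess: spread the initial value to all nodes

for k in range(5):            # outer SDC sweeps, Eq. (4)
    c = u0 + dt * (Q - QD) @ f(u)            # values gathered from the previous iterate
    u_new = np.zeros(M)
    for m in range(M):                        # serial loop over the nodes (forward substitution)
        rhs = c[m] + dt * QD[m, :m] @ f(u_new[:m])
        x = u[m]                              # previous iterate as initial guess for Newton
        for _ in range(3):                    # inner scalar Newton iteration
            g = x - dt * QD[m, m] * f(x) - rhs
            x -= g / (1.0 - dt * QD[m, m] * df(x))
        u_new[m] = x
    u = u_new
    print(k, np.linalg.norm(u0 + dt * Q @ f(u) - u))   # residual of the collocation problem (3)
```

Each node solve needs the already updated values at the previous nodes, which is precisely the dependency PFASST-ER removes in the next section.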
3 PFASST-ER

From the perspective of a single time-step [t_l, t_{l+1}] or processor P_l, Eq. (7) on the coarse level for this step reads

    \tilde{P}_f^{sdc}(\tilde{u}_l^{k+1}) - \tilde{u}_{l,0}^{k+1} = (\tilde{P}_f^{sdc} - \tilde{C}_f^{coll})(\tilde{u}_l^k) + \tau_l^k,

where τ_l^k is the lth component of τ^k, belonging to the interval [t_l, t_{l+1}]. Note that the serial dependency is given by the term ũ_{l,0}^{k+1}, so that it does not depend on the solution ũ_l^{k+1} of this equation and can thus be considered as part of a given right-hand side. On the fine level, this is even simpler, because there we have to solve

    P_f^{sdc}(u_l^{k+1}) = (P_f^{sdc} - C_f^{coll})(u_l^{k+1/2}) + u_{l,0}^{k+1/2},

where the u_{l,0}^{k+1/2}-term is independent of the current iteration (which, of course, leads to the parallelism on the fine level).

As we have seen above, the typical strategy would be to solve these systems line by line, node by node, using forward substitution and previous PFASST iterates as initial guesses. An alternative approach has been presented in [19], where each SDC iteration can be parallelized across the nodes. While this is trivial for linear problems, nonlinear ones require the linearization of the full equations, not node-wise as before. For the fine sweep, let

    G_f^{sdc}(v) := P_f^{sdc}(v) - (P_f^{sdc} - C_f^{coll})(u_l^{k+1/2}) - u_{l,0}^{k+1/2};

then a Newton step for G_f^sdc(v) = 0 is given by

    \nabla G_f^{sdc}(v^j)\, e^j = -G_f^{sdc}(v^j), \qquad v^{j+1} = v^j + e^j,

with the Jacobian matrix ∇G_f^sdc(v^j) of G_f^sdc evaluated at v^j. We have

    \nabla G_f^{sdc}(v^j) = \nabla P_f^{sdc}(v^j) = I - \Delta t\, Q_\Delta \nabla f(v^j)

with the Jacobian matrix ∇f(v^j) of f evaluated at v^j, which in turn is given by

    \nabla f(v^j) = \mathrm{diag}(f'(v_1^j), \dots, f'(v_M^j)).

There is still no parallelism to exploit, but when we replace the full Jacobian matrix ∇f(v^j) by the approximation f'(u_{l,0}) I_M, which is the derivative of f at the initial value of the current time-step, we can use

    \nabla G_f^{sdc}(v^j) \approx \nabla G_f^{\Delta\text{-QN}}(u_{l,0}) := I - f'(u_{l,0})\, \Delta t\, Q_\Delta

to establish a Quasi-Newton iteration as

    \nabla G_f^{\Delta\text{-QN}}(u_{l,0})\, e^j = -G_f^{sdc}(v^j), \qquad v^{j+1} = v^j + e^j.

This decouples the evaluation of the Jacobian matrix from the current quadrature nodes, and now Q_Δ can be diagonalized, so that the inversion of ∇G_f^{Δ-QN}(u_{l,0}) can be parallelized across the nodes. Note that there are other options for approximating the full Jacobian matrix. Most notably, in [9] the mean over all Jacobian matrices is used (there across the time-steps). We did not see any impact on the convergence when following this strategy, most likely because the number of quadrature nodes is typically rather low. The advantage of using the initial value is that it reduces the number of evaluations of the Jacobian matrix, which also includes communication time.

Provided that Q_Δ is diagonalizable, we can decompose it as Q_Δ = V_Δ Λ_Δ V_Δ^{-1}, where Λ_Δ = diag((Q_Δ)_{ii}) contains the eigenvalues (Q_Δ)_{ii} ∈ ℝ of Q_Δ and V_Δ contains its eigenvectors. Using the given diagonalization, the algorithm reads:

1. replace r^j = −G_f^sdc(v^j) by r̄^j = −V_Δ^{-1} G_f^sdc(v^j) (serial),
2. solve (I − f'(u_{l,0}) Δt Λ_Δ) ē^j = r̄^j (parallel in M),
3. replace ē^j by e^j = V_Δ ē^j (serial),
4. set v^{j+1} = v^j + e^j (parallel in M).

This can be iterated until a certain threshold is reached; then u_l^{k+1} = v^J is set to obtain the solution of the equation for the fine sweep (a code sketch of these steps follows at the end of this section). On the coarse level, the procedure is very similar, with a slightly different definition of G^sdc(ṽ). In practice, choosing only a single Newton iteration (i.e. J = 1) is sufficient, because this is only the inner solver for an outer PFASST iteration. In all cases we have studied so far, using more inner iterations does not lead to a faster overall method.

This linearization and diagonalization strategy immediately suggests a second approach: instead of using Q_Δ for the preconditioner, we can use the original quadrature matrix Q directly. The intention of using Q_Δ in the first place was to obtain a preconditioner which allows inversion by forward substitution. Now, with diagonalization in place, this is no longer necessary. Instead, we can use

    P_f^{coll} := C_f^{coll}

and thus

    G_f^{coll}(v) := C_f^{coll}(v) - u_{l,0}^{k+1/2}.

Note that this is just the lth block of the original composite collocation problem. Following the same ideas as before, we end up with

    \nabla G_f^{coll}(v^j) \approx \nabla G_f^{QN}(u_{l,0}) := I - f'(u_{l,0})\, \Delta t\, Q,

which can be diagonalized using Q = V Λ V^{-1}, where Λ is a diagonal matrix with eigenvalues λ(Q) ∈ ℂ. The same idea can be applied to the coarse level sweep, of course. As a result, the original nonlinear SDC sweeps within PFASST are now replaced by Quasi-Newton iterations which can be done parallel across the nodes. We note that using simplified or Quasi-Newton methods for solving implicit Runge-Kutta schemes is a standard approach, as e.g. [26] shows. We further refer to [19] for more details on the idea of parallel SDC sweeps with Q and Q_Δ.

The question now is how much the approximation of the Jacobians affects the convergence and runtime of the method, and how all this compares to standard PFASST iterations. It is well known that for suitable right-hand sides and initial guesses the standard, unmodified Newton method converges quadratically, while the Quasi-Newton method as well as SDC show linear convergence, see e.g. [11,13,24]. We will examine the impact of these approaches in the following section along the lines of two numerical examples. A more rigorous mathematical analysis is currently ongoing work, as it can be embedded into a larger convergence theory for PFASST with inner Newton-type solvers.
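The following illustrative Python sketch (an assumption-laden toy example, not the paper's implementation) applies the diagonalized Quasi-Newton iteration to the single-step collocation problem for u_t = −u³, using the Q-variant since Q for Radau nodes is diagonalizable with complex eigenvalues. Steps 1–4 in the loop mirror the enumeration above, with the "parallel in M" steps written as element-wise operations.

```python
import numpy as np
from scipy.interpolate import lagrange

def collocation_matrix(nodes):
    """Quadrature weights q_{m,j} = int_0^{tau_m} ell_j(s) ds for nodes on [0, 1]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        yj = np.zeros(M)
        yj[j] = 1.0
        a = lagrange(nodes, yj).integ()
        Q[:, j] = a(nodes) - a(0.0)
    return Q

# Quasi-Newton for the full collocation problem (Q-variant), single time-step of u_t = -u^3
f = lambda u: -u**3
df = lambda u: -3 * u**2

nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
M = len(nodes)
Q = collocation_matrix(nodes)
lams, V = np.linalg.eig(Q)            # Q = V diag(lams) V^{-1}, eigenvalues are complex
Vinv = np.linalg.inv(V)

dt, u0 = 0.1, 1.0
v = u0 * np.ones(M)                   # initial guess: spread the initial value

G = lambda v: v - dt * Q @ f(v) - u0  # G(v) = C_f^coll(v) - u_{l,0} for one time-step

for j in range(6):
    r = -Vinv @ G(v)                              # 1. transform the residual (serial)
    e_bar = r / (1.0 - df(u0) * dt * lams)        # 2. M decoupled solves (parallel across the nodes)
    e = V @ e_bar                                 # 3. transform back (serial)
    v = v + e.real                                # 4. update; imaginary parts cancel up to round-off
    print(j, np.linalg.norm(G(v)))
```

The printed residual norm of G should drop linearly, as expected for a Quasi-Newton iteration.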
4 Numerical results

We apply PFASST and PFASST-ER to two different, rather challenging reaction-diffusion problems, starting with a detailed analysis of the parallelization strategies for the Allen–Cahn equation and highlighting differences to these findings for the Gray-Scott equations.

4.1 Allen–Cahn equation

We study the two-dimensional Allen–Cahn equation, which is given by

    u_t = \Delta u + \frac{1}{\varepsilon^2}\, u (1 - u^2)    (10)

on the spatial domain [-0.5, 0.5]^2, with initial condition

    u(x, y, 0) = \tanh\left(\frac{R_0 - \sqrt{x^2 + y^2}}{\sqrt{2}\,\varepsilon}\right)

and periodic boundary conditions. We use simple second-order finite differences for the discretization in space and take 256 elements in each dimension on the fine level and 128 on the coarse one. We furthermore use M = 4 Gauß-Radau nodes, set ε = 0.04, Δt = 0.001 < ε, and stop the simulation after 24 time-steps at T = 0.024. The initial condition describes a circle with radius R_0 = 0.25, see e.g. [28]. Note that since our focus is on the temporal parallelization, the temporal resolution was chosen to be quite high in contrast to the spatial resolution. Errors in space and time are not balanced here (and in the following example), with the spatial error being much higher. This has been done deliberately to avoid higher computational costs. By increasing the accuracy in space, we would increase the amount of large parallelizable computations in relation to communication. This would in turn improve the overall parallel efficiency, which would in the end lead to even better scaling results. However, when using parallelization in space, all processors have very few degrees-of-freedom anyway, so our results may even reflect a “real” situation better.
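As a purely illustrative sketch of this setup — not the paper's pySDC problem class, and assuming the Allen–Cahn form of Eq. (10) as reconstructed above — the snippet below builds the tanh-shaped circular initial condition on a 256×256 periodic grid and evaluates the discretized right-hand side with a second-order finite-difference Laplacian.

```python
import numpy as np

N, eps, R0 = 256, 0.04, 0.25
x = np.linspace(-0.5, 0.5, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
h = 1.0 / N

# circular interface of radius R0, smoothed over a width proportional to eps
u_init = np.tanh((R0 - np.sqrt(X**2 + Y**2)) / (np.sqrt(2) * eps))

def laplacian(u):
    """Second-order finite differences with periodic boundaries."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
            np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u) / h**2

def rhs(u):
    return laplacian(u) + u * (1 - u**2) / eps**2

print(np.linalg.norm(rhs(u_init)))    # f(u) evaluated at the initial condition
```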
The results we present in the following were computed with pySDC [20,21] on the supercomputer JURECA [12]. We run a serial single-level simulation using SDC (“SL” in the plots), a serial multi-level simulation using multi-level SDC (“ML”, which is PFASST on one processor, see [23]), and parallel simulations with 2, 4, 8, 12 and 24 processors (“P2” to “P24”), all until a given residual tolerance of 10^{-10} is reached.

If fewer processors than time-steps are used, the time domain is split into blocks of parallel PFASST runs. These are handled sequentially, using the solution of the previous block as the initial data for the next one. For example, 6 processors work on the first 6 time-steps until convergence and the solution is used as new initial condition for the next block of 6 time-steps. This is repeated until all 24 time-steps have been completed.

In Fig. 2 we show the maximum number of linear solves which were performed by the slowest processor (i.e. the last processor in time) for different versions of the solvers, aggregated over all its time-steps and quadrature nodes, over all outer and inner iterations.

Fig. 2 Number of linear solves for the Allen–Cahn example, all methods run serial on the nodes

Here, two versions of the original PFASST algorithm are run: The first one performs exactly one inner Newton iteration in every PFASST iteration; this version is labeled as “PFASST: 1 iter”. In contrast, “PFASST: N iter” performs as many inner Newton iterations as required so that the residual of the nonlinear inner problem is less than 10^{-11}. Both PFASST versions use the quadrature matrix Q_Δ^LU from Eq. (5) inside the preconditioner. For PFASST-ER we also show two variants: The PFASST-ER algorithm which uses the original Q inside the preconditioner is labeled as “PFASST-ER: Q” and the one which uses Q_Δ^LU is labeled as “PFASST-ER: Q_Δ”. Solving the innermost linear systems is done using GMRES with a tolerance of 10^{-12} in all cases.

We can see that performing more than one inner Newton iteration (“PFASST: N iter” vs. “PFASST: 1 iter”) does not improve the convergence of the overall algorithm. Although it is possible that by increasing the number of inner Newton iterations the number of outer PFASST iterations decreases, the total effort, which can be measured by the total number of linear solves, increases due to the higher number of inner Newton iterations.

Using the Quasi-Newton approach with the same preconditioner instead of the classical Newton solver (“PFASST-ER: Q_Δ” vs. “PFASST: 1 iter”) shows little effect on the total iteration numbers, but using the original quadrature matrix Q instead of Q_Δ^LU inside the preconditioner (“PFASST-ER: Q” vs. “PFASST-ER: Q_Δ”) greatly reduces the number of iterations.

However, without parallelization, one iteration of PFASST-ER with Q is in general more expensive than one iteration of the other algorithms, because it requires the solution of a full system via diagonalization instead of stepping through a triangular system via forward substitution.

Fig. 3 Time to solution for Allen–Cahn with parallelization only across time-steps

In Fig. 3, we thus examine whether the lower number of more expensive iterations actually pays off. The plot shows results for the same setup as Fig. 2, but now we focus on the runtime instead of the iteration numbers. We only consider parallelization across the time-steps to compare the impact of the algorithmic change first. We see that despite the fact that the iterations are more expensive, PFASST-ER with Q already shows a lower runtime than the original PFASST method in this example. This is also true when using Q_Δ^LU instead of Q.

At this point, we have not yet considered the additional direction of concurrency exposed by PFASST-ER. For that, we next compare different distributions of up to 24 cores on the 4 quadrature nodes and the 24 time-steps. All divisions of 24 were tested with all possible distributions.
Fig. 4 Runtimes in seconds (first number) and parallel efficiencies (second number) for different distributions of cores, using PFASST-ER with Q for the Allen–Cahn equation

The two plots in Fig. 4 show different combinations of cores used for step-parallelization (x-axis) and for node-parallelization (y-axis) with PFASST-ER and Q. Multiplying the numbers on both axes gives the total number of cores used for this simulation. This is also the reason why there are two plots: not all combinations are actually possible or meaningful. Within each colored block the total runtime (in seconds, first number) and the parallel efficiency (second number) for this setup are given. We can see that using all available cores for parallelization across the steps is by far not the most efficient choice. In turn, more than 4 cores cannot be used for parallelization across the nodes, although 4 gives the best speedup. Indeed, the best combination for this problem is to maximize node-parallelization first and then add step-parallelization (31.3 seconds with 4 cores on the nodes and 6 on the steps, lower picture). This is about 1.8 times faster than using 24 cores for the steps alone and more than 5 times faster than the serial PFASST-ER run.

Fig. 5 Runtimes in seconds (first number) and parallel efficiencies (second number) for different distributions of cores, using PFASST-ER with Q_Δ for the Allen–Cahn equation

Although using Q instead of Q_Δ in PFASST-ER is faster for this example, it is quite revealing to repeat the simulations using Q_Δ. These results are shown in Fig. 5, and it is obvious that using as many cores as possible for the parallelization across the nodes is now not the optimal strategy. Here, using 2 cores on the nodes and 12 on the steps is the most efficient combination, albeit still significantly slower than using PFASST-ER with Q, even with the same combination. The reason for this potentially surprising result is that solving the innermost linear systems heavily depends on the structure of these systems, in particular when using an iterative solver like GMRES. Moreover, initial guesses are a crucial factor, too. For PFASST-ER, we use the current solution at node zero of the respective time-step as the initial guess. This is particularly suitable for the closest first nodes, but potentially less so for later ones. While both effects did not lead to significant variations in the time spent for solving the linear systems when using Q, they do produce a severe load imbalance when using Q_Δ. More specifically, using 4 cores for the nodes and only 1 for the time-steps, i.e. exploiting only parallelization across the nodes, the first core takes about 118.2 seconds for all linear system solves together at the first node, while the last core takes about 194.6 seconds at the last node. Therefore, using 2 cores on the nodes, which enables a better load distribution, is the ideal choice. One possibility would be that core 1 deals with nodes 1 and 4 and core 2 with nodes 2 and 3, but because nodes 3 and 4 are very close to each other and the corresponding calculations are almost equally expensive, an alternating distribution is also an ideal choice. This is precisely what has been done for Fig. 5, leading to the best speedup with 2 cores on the nodes. For other examples, an optimal distribution might be more difficult to find.

Fig. 6 Runtimes for the three best variants, Allen–Cahn example

In Fig. 6 we now summarize the best results: PFASST with one inner Newton iteration in comparison to PFASST-ER using Q_Δ and 2 cores on the nodes, and PFASST-ER using Q with 4 cores on the nodes. The plot shows the simulation time for each variant based on the total number of processors used. We see that PFASST-ER is always much more time-efficient in doing the calculations than PFASST, with another significant gain when using Q instead of Q_Δ. Now, since PFASST-ER adds another direction of parallelization compared to PFASST, we can not only increase parallel efficiency as shown, but also extend the number of usable cores to obtain a better time-to-solution. This has been done in Fig. 7: taking 48 or 96 cores in total further reduces the computing time for 24 time-steps. With PFASST-ER, the number of resources that can be used for parallel-in-time integration is no longer limited by the number of time-steps, but can be increased by the factor given by the number of quadrature nodes.

Fig. 7 Runtimes for different numbers of processors, Allen–Cahn example
4.2 Gray-Scott equations

The second example we present here is the Gray-Scott system [16], which is given by

    u_t = D_u \Delta u - u v^2 + F (1 - u),
    v_t = D_v \Delta v + u v^2 - (F + K) v,

on the spatial domain [0, 1] × [0, 1], with periodic boundary conditions. As initial condition we choose a circle with radius 0.05 centred in the spatial domain, where u = 0.5 and v = 0.25 on the inside, and u = 1.0 and v = 0 outside of this circle. We use D_u = 10^{-4}, D_v = 10^{-5} and set a feed rate of F = 0.0367 and a kill rate of K = 0.0649. This leads after some time to a process similar to cellular division, known as “mitosis”. We discretize the spatial domain with 128 points in each dimension on the fine level and with 64 on the coarse one, using standard finite differences. We discretize every time-step of size Δt = 1 with 4 quadrature nodes and run the simulation again for 24 time-steps.
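For illustration only — grid, data layout and finite-difference stencil are choices made here, not taken from the paper's pySDC setup — the following sketch sets up this initial condition and evaluates the Gray-Scott right-hand side with the parameters given above.

```python
import numpy as np

N = 128
Du, Dv, F, K = 1e-4, 1e-5, 0.0367, 0.0649
x = np.linspace(0.0, 1.0, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
h = 1.0 / N

# circle of radius 0.05 around the centre: u = 0.5, v = 0.25 inside, u = 1, v = 0 outside
inside = (X - 0.5)**2 + (Y - 0.5)**2 < 0.05**2
u_init = np.where(inside, 0.5, 1.0)
v_init = np.where(inside, 0.25, 0.0)

def laplacian(w):
    """Second-order finite differences with periodic boundaries."""
    return (np.roll(w, 1, 0) + np.roll(w, -1, 0) +
            np.roll(w, 1, 1) + np.roll(w, -1, 1) - 4 * w) / h**2

def rhs(u, v):
    uv2 = u * v**2
    return (Du * laplacian(u) - uv2 + F * (1 - u),
            Dv * laplacian(v) + uv2 - (F + K) * v)

fu, fv = rhs(u_init, v_init)
print(np.linalg.norm(fu), np.linalg.norm(fv))
```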
The results are similar to the ones for the Allen–Cahn equation in the previous section. We will omit the case of PFASST with more than one inner Newton iteration, though. We start again by looking at the total number of linear solves the different algorithms need to perform. Figure 8 shows the number of linear solves for the methods, which run until a residual tolerance of 10^{-12} is reached. The results look quite similar to the ones for the previous example, with one critical difference: The difference between the Q-variant of PFASST-ER and the other algorithms becomes smaller more rapidly the more parallel time-steps are used. There is no obvious explanation (at least, obvious to us) for this behavior, though. The more time-steps are approximated simultaneously, the less suitable u_0 works as initial value for more distant time-steps. Although the full Newton and the Quasi-Newton methods differ by an order of convergence in theory, in our scenario this seems relevant only for good initial values. One can expect that the runtime will increase when using PFASST-ER with Q, while it stayed about the same in the case of the Allen–Cahn example.

Fig. 8 Number of linear solves for the Gray-Scott example, all methods run serial on the nodes

Fig. 9 Time to solution for Gray-Scott with parallelization only across time-steps

This is precisely what we can see in Fig. 9. The more parallel time-steps are run, the less efficient PFASST-ER with Q becomes in this variant. Already at 3 parallel steps, it is as costly as the original PFASST version, at least when parallelization across the nodes is not considered.

Now, adding node-parallelization, the findings are again similar to the ones in the previous section: Figure 10 shows that PFASST-ER with Q is still more efficient than using PFASST. In particular, using more cores on the nodes is better, and the best combination is again 4 cores on the nodes and 6 on the steps. Again, this changes when considering PFASST-ER with Q_Δ as in Fig. 11, where the ideal setup uses only 2 cores on the nodes, but 12 on the steps. This is again due to load imbalances of the innermost linear solves. However, note the key difference to the previous results: The fastest run of the Q_Δ-variant is now faster than the one of the Q-variant.

Fig. 10 Runtimes in seconds (first number) and parallel efficiencies (second number) for different distributions of cores, using PFASST-ER with Q for the Gray-Scott equations

Fig. 11 Runtimes in seconds (first number) and parallel efficiencies (second number) for different distributions of cores, using PFASST-ER with Q_Δ for the Gray-Scott equations

Fig. 12 Runtimes for the three best variants, Gray-Scott example

In Fig. 12 we now give an overview of the best results: If we use parallelism across the nodes in a suitable way, both PFASST-ER versions are more efficient in terms of simulation time than the classical PFASST algorithm. Both can be used to extend the scaling capabilities beyond the number of time-steps, and both scale rather well in this regime. Note, however, that the Q_Δ-variant can here only leverage 2 × 24 cores. It is then faster than the Q-variant with twice as many cores.

5 Conclusion and outlook

Today's supercomputers are designed with an ever increasing number of processors. Therefore we need our software and the underlying numerical algorithms to handle this increasing degree of parallelism. Time-parallel integrators are one promising research direction, with quite a number of different approaches. Some approaches parallelize each individual time-step and others act on multiple time-steps simultaneously. In this paper we have introduced a solver that works in parallel across the method as well as across the steps. More precisely, we combine node-parallel spectral deferred corrections with the parallel full approximation scheme in space and time. While PFASST allows one to compute multiple time-steps simultaneously and to target large-scale parallelism in time, the new version called PFASST-ER presented here extends this idea with an efficient small-scale parallelization for every single time-step itself. The scaling studies show that a combination of both concepts seems to be the most efficient way to solve time-dependent PDEs. Here we tested two different preconditioners: one using the traditional, triangular quadrature matrix Q_Δ, generated by an LU-decomposition, and one using the original matrix Q. Both can be diagonalized and used as parallel-across-the-node preconditioners. For the Q_Δ-preconditioner, we saw load imbalances when using an inner iterative linear solver, but by grouping nodes we can still speed up the simulation beyond the number of parallel time-steps.
For the Q-preconditioner, the overall number of iterations was lower and the time-to-solution was faster. Adding node-parallelization, parallel efficiency can be increased and speedup extended when compared to PFASST. Both PFASST-ER versions lead in the end to better scaling results than the classical PFASST algorithm. PFASST-ER with Q especially offers an almost equal distribution of work for iterative linear solvers with respect to the individual quadrature nodes of a time-step. This advantage makes this algorithm particularly flexible, so that it can be used for any number of quadrature points.

PFASST-ER is particularly favorable if an increase in parallelism across the steps would lead to a severe increase in the number of iterations. This could be due to e.g. the type of the equation or the coarsening strategy. During our experiments we saw that it is not clear a priori which combination of node- and step-parallelization is the most efficient one. This could lead to many, potentially irrelevant runs to find the sweet spot. Here, a performance model and a suitable convergence theory are needed to at least narrow down the relevant options. This has to be accompanied by more numerical tests, relating e.g. model parameters with load imbalances, to identify the limits of this approach.

Acknowledgements The authors thankfully acknowledge the financial support by the German Federal Ministry of Education and Research through the ParaPhase project within the framework "IKT 2020 - Forschung für Innovationen" (Project Number 01IH15005A).

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Bolten, M., Moser, D., Speck, R.: A multigrid perspective on the parallel full approximation scheme in space and time. Numer. Linear Algebra Appl. 24(6), e2110 (2017). https://doi.org/10.1002/nla.2110
2. Bolten, M., Moser, D., Speck, R.: Asymptotic convergence of the parallel full approximation scheme in space and time for linear problems. Numer. Linear Algebra Appl. 25(6), e2208 (2018). https://doi.org/10.1002/nla.2208
3. Burrage, K.: Parallel methods for ODEs. Adv. Comput. Math. 7, 1–3 (1997)
4. Christlieb, A.J., Macdonald, C.B., Ong, B.W.: Parallel high-order integrators. SIAM J. Sci. Comput. 32(2), 818–835 (2010)
5. Clarke, A.T., Davies, C.J., Ruprecht, D., Tobias, S.M.: Parallel-in-time integration of kinematic dynamos (2019). arXiv:1902.00387 [physics.comp-ph]
6. Emmett, M., Minion, M.L.: Toward an efficient parallel in time method for partial differential equations. Commun. Appl. Math. Comput. Sci. 7, 105–132 (2012)
7. Falgout, R.D., Friedhoff, S., Kolev, T.V., MacLachlan, S.P., Schroder, J.B., Vandewalle, S.: Multigrid methods with space–time concurrency. Comput. Vis. Sci. 18(4–5), 123–143 (2017)
8. Gander, M.J.: 50 years of time parallel time integration. In: Multiple Shooting and Time Domain Decomposition. Springer (2015). https://doi.org/10.1007/978-3-319-23321-5_3
9. Gander, M.J., Halpern, L., Ryan, J., Tran, T.T.B.: A direct solver for time parallelization. In: Dickopf, T., Gander, M.J., Halpern, L., Krause, R., Pavarino, L.F. (eds.) Domain Decomposition Methods in Science and Engineering XXII, pp. 491–499. Springer (2016). https://doi.org/10.1007/978-3-319-18827-0_50
10. Huang, J., Jia, J., Minion, M.: Accelerating the convergence of spectral deferred correction methods. J. Comput. Phys. 214(2), 633–656 (2006)
11. Jackson, K.R., Kværnø, A., Nørsett, S.P.: The use of Butcher series in the analysis of Newton-like iterations in Runge–Kutta formulas. Appl. Numer. Math. 15(3), 341–356 (1994)
12. Jülich Supercomputing Centre: JURECA: General-purpose supercomputer at Jülich Supercomputing Centre. J. Large-Scale Res. Facil. 2(A62) (2016). https://doi.org/10.17815/jlsrf-2-121
13. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Number 16 in Frontiers in Applied Mathematics. SIAM (1995)
14. Koehler, F.: PFASST TikZ. https://github.com/Parallel-in-Time/pfasst-tikz (2015)
15. Lions, J.-L., Maday, Y., Turinici, G.: A "parareal" in time discretization of PDE's. Comptes Rendus de l'Académie des Sciences, Series I: Mathematics. https://doi.org/10.1016/S0764-4442(00)01793-6
16. Pearson, J.E.: Complex patterns in a simple system. Science 261(5118), 189–192 (1993)
17. Ruprecht, D., Speck, R.: Spectral deferred corrections with fast-wave slow-wave splitting. SIAM J. Sci. Comput. 38(4), A2535–A2557 (2016)
18. Ruprecht, D., Speck, R., Emmett, M., Bolten, M., Krause, R.: Poster: extreme-scale space-time parallelism. In: Proceedings of the 2013 Conference on High Performance Computing Networking, Storage and Analysis Companion, SC '13 Companion (2013). http://sc13.supercomputing.org/sites/default/files/PostersArchive/tech_posters/post148s2-file3.pdf
19. Speck, R.: Parallelizing spectral deferred corrections across the method. Comput. Vis. Sci. 19(3–4), 75–83 (2018). https://doi.org/10.1007/s00791-018-0298-x
20. Speck, R.: Algorithm 997: pySDC-prototyping spectral deferred corrections. ACM Trans. Math. Softw. (2019). https://doi.org/10.1145/3310410
21. Speck, R.: Website for pySDC (2019). https://parallel-in-time.org/pySDC/. Accessed 27 November 2019
22. Speck, R., Ruprecht, D., Krause, R., Emmett, M., Minion, M., Winkel, M., Gibbon, P.: A massively space–time parallel N-body solver. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pp. 92:1–92:11. IEEE Computer Society Press, Los Alamitos, CA, USA (2012). http://dl.acm.org/citation.cfm?id=2388996.2389121
23. Speck, R., Ruprecht, D., Emmett, M., Minion, M.L., Bolten, M., Krause, R.: A multi-level spectral deferred correction method. BIT Numer. Math. 55, 843–867 (2015). https://doi.org/10.1007/s10543-014-0517-x
24. Tang, T., Xie, H., Yin, X.: High-order convergence of spectral deferred correction methods on general quadrature nodes. J. Sci. Comput. 56(1), 1–13 (2013)
25. Trottenberg, U., Oosterlee, C., Schuller, A.: Multigrid. Academic Press, London (2000)
26. Wanner, G., Hairer, E.: Solving Ordinary Differential Equations II. Springer, Berlin (1996)
27. Weiser, M.: Faster SDC convergence on non-equidistant grids by DIRK sweeps. BIT Numer. Math. 55(4), 1219–1241 (2014)
28. Zhang, J., Du, Q.: Numerical studies of discrete approximations to the Allen–Cahn equation in the sharp interface limit. SIAM J. Sci. Comput. 31(4), 3042–3063 (2009). https://doi.org/10.1137/080738398

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

PFASST-ER: combining the parallel full approximation scheme in space and time with parallelization across the method

Loading next page...
 
/lp/springer-journals/pfasst-er-combining-the-parallel-full-approximation-scheme-in-space-YIq2sGEbW1

References (29)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2020
ISSN
1432-9360
eISSN
1433-0369
DOI
10.1007/s00791-020-00330-5
Publisher site
See Article on Publisher Site

Abstract

To extend prevailing scaling limits when solving time-dependent partial differential equations, the parallel full approximation scheme in space and time (PFASST) has been shown to be a promising parallel-in-time integrator. Similar to space–time multigrid, PFASST is able to compute multiple time-steps simultaneously and is therefore in particular suitable for large- scale applications on high performance computing systems. In this work we couple PFASST with a parallel spectral deferred correction (SDC) method, forming an unprecedented doubly time-parallel integrator. While PFASST provides global, large- scale “parallelization across the step”, the inner parallel SDC method allows integrating each individual time-step “parallel across the method” using a diagonalized local Quasi-Newton solver. This new method, which we call “PFASST with Enhanced concuRrency” (PFASST-ER), therefore exposes even more temporal concurrency. For two challenging nonlinear reaction- diffusion problems, we show that PFASST-ER works more efficiently than the classical variants of PFASST and can use more processors than time-steps. Keywords Parallel-in-time integration · Parallel full approximation scheme in space and time · Spectral deferred corrections · Parallelization across the method · Parallelization across the step · Quasi-Newton 1 Introduction As one example, the “Parallel Full Approximation Scheme in Space and Time” (PFASST) by Emmett and Minion [6] The efficient use of modern high performance computing allows one to integrate multiple time-steps simultaneously by systems for solving space–time-dependent differential equa- using inner iterations of spectral deferred corrections (SDC) tions has become one of the key challenges in computational on a space–time hierarchy. It works on the so called com- science. Exploiting the exponentially growing number of posite collocation problem, where each time-step includes a processors using traditional techniques for spatial parallelism further discretization through quadrature nodes. This “paral- becomes problematic when, for example, for a fixed problem lelization across the steps” approach [3] targets large-scale size communication costs start to dominate. Parallel-in-time parallelization on top of saturated spatial parallelization of integration methods have recently been shown to provide partial differential equations (PDEs), where parallelization a promising way to extend these scaling limits, see e.g. in the temporal domain acts as a multiplier for standard par- [7,18,22] to name but a few examples. allelization techniques in space. In contrast, “parallelization across the method” approaches [3] try to parallelize the inte- gration within an individual time-step. While this typically Communicated by Sebastian Schöps. results in smaller-scale parallelization in the time-domain, parallel efficiency and applicability of these methods are B Ruth Schöbel often more favorable. Most notably, the “revisionist inte- [email protected] gral deferred correction method” (RIDC) by Christlieb et Robert Speck al. [4] makes use of integral deferred corrections (which are [email protected] indeed closely related to SDC) in order to compute multiple Institut für Numerische Mathematik, TU Dresden, Dresden, iterations in a pipelined way. 
In [19], different approaches Germany for parallelizing SDC across the method have been dis- Jülich Supercomputing Centre, Forschungszentrum Jülich cussed, allowing the simultaneous computation of updates on GmbH, Jülich, Germany 123 12 Page 2 of 12 R. Schöbel, R. Speck multiple quadrature nodes. A much more structured and com- u = f (u), u(0) = u (1) t 0 plete overview of parallel-in-time integration approaches can be found in [8]. The Parallel-in-Time community website with u(t ), u , f (u) ∈ R. In order to keep the notation sim- (https://parallel-in-time.org) offers a comprehensive list of ple, we do not consider systems of initial value problems references. for now, where u(t ) ∈ R . Necessary modifications will be The key goal of parallel-in-time integrators is to expose mentioned where needed. In a first step, we now discretize additional parallelism in the temporal domain in the cases this problem in time and review the idea of single-step, time- where classical strategies like parallelism in space are either serial spectral deferred corrections (SDC). already saturated or not even possible. In [5] the classical Parareal method [15] is used to overcome the scaling limit of 2.1 Spectral deferred corrections a space-parallel simulation of a kinematic dynamo on up to 1600 cores. The multigrid extension of Parareal, the “multi- For one time-step on the interval [t , t ] the Picard formu- l l+1 grid reduction in time” method (MGRIT), has been shown to lation of Eq. (1) is given by provide significant speedup beyond spatial parallelization [7] for a multitude of problems. Using PFASST, a space-parallel t u(t ) = u + f (u(s))ds, t ∈[t , t ]. (2) l,0 l l+1 N-body solver has been extended in [22] to run on up to 262,244 cores, while in [18] it has been coupled to a space- parallel multigrid solver on up to 458,752 cores. To approximate the integral we use a spectral quadrature So far, parallel-in-time methods have been implemented rule. We define M quadrature nodes τ ,...,τ , which l,1 l,M and tested either without any additional parallelization tech- are given by t ≤ τ < ··· <τ = t . We will in the l l,1 l,M l+1 niques or in combination with spatial parallelism. The goal following explicitly exploit the condition that the last node for this work is to couple two different parallel-in-time strate- is equal to the right integral boundary. Quadrature rules like gies in order to extend the overall temporal parallelism Gauß-Radau or Gauß-Lobatto quadrature satisfy this prop- exposed by the resulting integrator. To this end, we take erty. We can then approximate the integrals from t to the the diagonalization idea for SDC presented in [19] (paral- nodes τ , such that l,m lel across the method) and use it within PFASST (parallel across the steps). In this way we create an algorithm that computes approximations for different time-steps simulta- u = u + Δt q f (u ), l,m l,0 m, j l, j neously but also works in parallel on each time-step itself. j =1 Doing so we combine the advantages of both paralleliza- tion techniques and create the “Parallel Full Approximation where u ≈ u(τ ), Δt = t − t and q represent the l,m l,m l+1 l m, j Scheme in Space and Time with Enhanced concuRrency” quadrature weights for the interval [t ,τ ] such that l l,m (PFASST-ER), an unprecedented doubly time-parallel inte- grator for PDEs. In the next section we will first introduce l,m SDC and PFASST from an algebraic point of view, follow- q f (u ) ≈ f (u(s))ds. m, j l, j ing [1,2]. 
We combine these M equations into one system

(I - \Delta t\, Q f)(u_l) = u_{l,0}, \qquad (3)

which we call the "collocation problem". Here, u_l = (u_{l,1}, ..., u_{l,M})^T ≈ (u(τ_{l,1}), ..., u(τ_{l,M}))^T ∈ R^M, u_{l,0} = (u_{l,0}, ..., u_{l,0})^T ∈ R^M, Q = (q_{ij})_{i,j} ∈ R^{M×M} is the matrix gathering the quadrature weights, and the vector function f: R^M → R^M is given by

f(u_l) = (f(u_{l,1}), ..., f(u_{l,M}))^T.

To simplify the notation we define

C^{\mathrm{coll}}_f(u_l) := (I - \Delta t\, Q f)(u_l).

We note that for u(t) ∈ R^N, we need to replace Q by Q ⊗ I_N, where ⊗ denotes the Kronecker product.

System (3) is dense and a direct solution is not advisable, in particular if f is a nonlinear operator. The spectral deferred correction method solves the collocation problem in an iterative way. While it has been derived originally from classical deferred or defect correction strategies, we here follow [10,17,27] to present SDC as a preconditioned Picard iteration. A standard Picard iteration is given by

u_l^{k+1} = u_l^k + (u_{l,0} - C^{\mathrm{coll}}_f(u_l^k))

for k = 0, ..., K, and some initial guess u_l^0.

In order to increase range and speed of convergence, we now precondition this iteration. The standard approach to preconditioning is to define an operator P^{sdc}, which is easy to invert but also close to the operator of the system. We define this "SDC preconditioner" as

P^{\mathrm{sdc}}_f(u_l) := (I - \Delta t\, Q_\Delta f)(u_l),

so that the preconditioned Picard iteration reads

P^{\mathrm{sdc}}_f(u_l^{k+1}) = (P^{\mathrm{sdc}}_f - C^{\mathrm{coll}}_f)(u_l^k) + u_{l,0}. \qquad (4)

The key for defining P^{sdc} is the choice of the matrix Q_Δ. The idea is to choose a "simpler" quadrature rule to generate a triangular matrix Q_Δ such that solving System (4) can be done by forward substitution. Common choices include the implicit Euler method or the so-called "LU-trick", where the LU decomposition of Q^T with

Q^{\mathrm{LU}}_\Delta = U^T \quad \text{for} \quad Q^T = LU \qquad (5)

is used [27].
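To make these definitions concrete, here is a minimal, self-contained sketch (our own illustration, not the paper's pySDC implementation): it builds Q by integrating Lagrange basis polynomials over a set of normalized nodes, forms Q_Δ^{LU} via the LU trick of Eq. (5), and runs the preconditioned Picard iteration (4) for the scalar linear test problem f(u) = λu, for which all operators reduce to plain matrices. The equidistant nodes are purely illustrative; the experiments in Sect. 4 use Gauß-Radau nodes.

```python
import numpy as np
from scipy.linalg import lu

def quadrature_matrix(nodes, t_left=0.0):
    """Q with entries q_{m,j} = int_{t_left}^{tau_m} ell_j(s) ds, ell_j = Lagrange basis."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        ell = np.polynomial.Polynomial([1.0])
        for i in range(M):
            if i != j:
                ell = ell * np.polynomial.Polynomial([-nodes[i], 1.0]) / (nodes[j] - nodes[i])
        ell_int = ell.integ()          # antiderivative of ell_j
        for m in range(M):
            Q[m, j] = ell_int(nodes[m]) - ell_int(t_left)
    return Q

# illustrative equidistant nodes on (0, 1], last node on the right boundary
nodes = np.array([0.25, 0.5, 0.75, 1.0])
Q = quadrature_matrix(nodes)
M = len(nodes)

# "LU trick", Eq. (5): Q_Delta^LU = U^T with Q^T = L U
_, _, U = lu(Q.T)
QD = U.T

# preconditioned Picard iteration, Eq. (4), for u_t = lam*u, u(0) = u0
lam, dt, u0 = -1.0, 0.1, 1.0
C = np.eye(M) - dt * lam * Q       # collocation operator C^coll_f (linear case)
P = np.eye(M) - dt * lam * QD      # SDC preconditioner P^sdc_f
u = np.full(M, u0)                 # initial guess: spread the initial value to all nodes
for k in range(10):
    u = np.linalg.solve(P, (P - C) @ u + np.full(M, u0))
print(np.linalg.norm(C @ u - np.full(M, u0)))   # residual of the collocation problem (3)
```

For nonlinear f, the solve with P^{sdc}_f is instead performed node by node via forward substitution, which is exactly where the inner Newton iterations of Sect. 2.3 enter.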
System (4) establishes the method of spectral deferred corrections, which can be used to approximate the solution of the collocation problem on a single time-step. In the next step, we will couple multiple collocation problems and use SDC to explain the idea of the parallel full approximation scheme in space and time.

2.2 Parallel full approximation scheme in space and time

The idea of PFASST is to solve a "composite collocation problem" for multiple time-steps at once using multigrid techniques and SDC for each step in parallel. This composite collocation problem for L time-steps can be written as

\begin{pmatrix} C^{\mathrm{coll}}_f & & & \\ -H & C^{\mathrm{coll}}_f & & \\ & \ddots & \ddots & \\ & & -H & C^{\mathrm{coll}}_f \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_L \end{pmatrix} = \begin{pmatrix} u_{0,0} \\ 0 \\ \vdots \\ 0 \end{pmatrix},

where the matrix H ∈ R^{M×M} on the lower subdiagonal transfers the information from one time-step to the next one. It takes the value at the last node τ_{l,M} of an interval [t_l, t_{l+1}], which is by requirement equal to the left boundary t_{l+1} of the following interval [t_{l+1}, t_{l+2}], and provides it as a new starting value for this interval. Therefore, the matrix H contains the value 1 on every position in the last column and zeros elsewhere.

To write the composite collocation problem in a more compact form we define the vector u = (u_1, ..., u_L)^T ∈ R^{LM}, which contains the solution at all quadrature nodes of all time-steps, and the vector b = (u_{0,0}, 0, ..., 0)^T ∈ R^{LM}, which contains the initial condition for all nodes of the first interval and zeros elsewhere. We define F: R^{LM} → R^{LM} as an extension of f so that F(u) = (f(u_1), ..., f(u_L))^T. Then, the composite collocation problem can be written as

C_F(u) = b \qquad (6)

with

C_F(u) := (I - \Delta t\, (I_L \otimes Q) F - E \otimes H)(u),

where the matrix E ∈ R^{L×L} just has ones on the first subdiagonal and zeros elsewhere. If u(t) ∈ R^N, we need to replace H by H ⊗ I_N.

SDC can be used to solve the composite collocation problem by forward substitution in a sequential way, which means solving one time-step after the other, using the previous solution as initial value of the current time-step. The parallel-in-time integrator PFASST, on the other hand, solves the composite collocation problem by calculating on all time-steps simultaneously and is therefore an attractive alternative. The first step from SDC towards PFASST is the introduction of multiple levels, which are representations of the problem with different accuracies in space and time. In order to simplify the notation we focus on a two-level scheme consisting of a fine and a coarse level. Coarsening can be achieved for example by reducing the resolution in space, by decreasing the number of quadrature nodes on each interval, or by solving implicit systems less accurately. Coarsening through a reduction of quadrature nodes, however, does not seem worthwhile for our idea of parallelizing the corresponding calculations, since the cores working on the coarse level would no longer be fully employed, while individual processors would still have to communicate larger amounts of data. For this work, we therefore only consider coarsening in space, i.e., by using a restriction operator R on a vector u ∈ R^N we obtain a new vector ũ ∈ R^{Ñ}. Vice versa, the interpolation operator T is used to interpolate values from ũ to u. Operators, vectors and numbers on the coarse level will be denoted by a tilde to avoid further index cluttering. Thus, the composite collocation operator on the coarse level is given by C̃_F̃. While C_F is defined on R^{LMN}, C̃_F̃ acts on R^{LMÑ} with Ñ ≤ N, but as before we will neglect the space dimension in the following notation. The extension of the spatial transfer operators to the full space-time domain is given by R = I_{LM} ⊗ R and T = I_{LM} ⊗ T.
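For a linear right-hand side, the composite collocation operator C_F of Eq. (6) can be assembled explicitly with Kronecker products, which makes the block bidiagonal structure visible. The following sketch is again only illustrative and assumes the scalar linear test problem f(u) = λu; Q is any M×M collocation matrix as above.

```python
import numpy as np

def composite_collocation_matrix(Q, lam, dt, L):
    """C_F = I - dt*lam*(I_L kron Q) - (E kron H) for the linear test problem f(u) = lam*u."""
    M = Q.shape[0]
    E = np.diag(np.ones(L - 1), -1)        # ones on the first subdiagonal
    H = np.zeros((M, M))
    H[:, -1] = 1.0                         # copy the value at the last node of the previous step
    return np.eye(L * M) - dt * lam * np.kron(np.eye(L), Q) - np.kron(E, H)

# usage (with Q from the previous sketch): solve the composite problem (6) directly
# CF = composite_collocation_matrix(Q, lam=-1.0, dt=0.1, L=4)
# b = np.zeros(CF.shape[0]); b[:Q.shape[0]] = 1.0   # initial condition on the first block
# u = np.linalg.solve(CF, b)
```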
The main goal of the introduction of a coarse level is to move the serial part of the computation to this hopefully cheaper level, while being able to run the expensive part in parallel. For that, we define two preconditioners: a serial one with a lower subdiagonal for the coarse level and a parallel, block-diagonal one for the fine level. The serial preconditioner for the coarse level is defined by

\tilde P_{\tilde F} = \begin{pmatrix} \tilde P^{\mathrm{sdc}}_{\tilde f} & & & \\ -\tilde H & \tilde P^{\mathrm{sdc}}_{\tilde f} & & \\ & \ddots & \ddots & \\ & & -\tilde H & \tilde P^{\mathrm{sdc}}_{\tilde f} \end{pmatrix},

or, in a more compact way, by

\tilde P_{\tilde F}(\tilde u) = (I - \Delta t\, (I_L \otimes \tilde Q_\Delta) \tilde F - E \otimes \tilde H)(\tilde u).

Inverting this corresponds to a single inner iteration of SDC (a "sweep") on step 1, then sending forward the result to step 2, an SDC sweep there, and so on. The parallel preconditioner on the fine level then simply reads

P_F(u) = (I - \Delta t\, (I_L \otimes Q_\Delta) F)(u).

Applying P_F on the fine level leads to L decoupled SDC sweeps, which can be run in parallel.

For PFASST, these two preconditioners and the levels they work on are coupled using a full approximation scheme (FAS) known from nonlinear multigrid theory [25]. Following [1], one iteration of PFASST can then be formulated in four steps:

1. the computation of the FAS correction τ^k, including the restriction of the fine value to the coarse level,

\tau^k = \tilde C_{\tilde F}(R u^k) - R\, C_F(u^k),

2. the coarse sweep on the modified composite collocation problem on the coarse level,

\tilde P_{\tilde F}(\tilde u^{k+1}) = (\tilde P_{\tilde F} - \tilde C_{\tilde F})(\tilde u^k) + \tilde b + \tau^k, \qquad (7)

3. the coarse grid correction applied to the fine level value,

u^{k+1/2} = u^k + T(\tilde u^{k+1} - R u^k), \qquad (8)

4. the fine sweep on the composite collocation problem on the fine level,

P_F(u^{k+1}) = (P_F - C_F)(u^{k+1/2}) + b. \qquad (9)

Fig. 1 Schematic view of PFASST on four processors. The figure was created with pfasst-tikz [14]

In Fig. 1, we see a schematic representation of the described steps. The time-step parallel procedure we describe here is the same for all PFASST versions that we will introduce later. It is common to use as many processors as time-steps: in the given illustration four processors work on four time-steps. Therefore, the temporal domain is divided into four intervals, which are assigned to four processors P_0, ..., P_3. Every processor performs SDC sweeps on its assigned interval on alternating levels. The big red blocks represent fine sweeps, given by Eq. (9), and the small blue blocks coarse sweeps, given by Eq. (7).

The coarse sweep over all intervals is a serial process: after a processor has finished its coarse sweeps, it sends its results forward to the next processor, which takes this result as an initial value for its own coarse sweeps. We see the communication in the picture represented by small arrows, which connect the coarse sweeps of each interval. In (7), the need for communication with a neighboring process is obvious, because P̃_F̃ is not a (block-) diagonal matrix, but has entries on its lower block-diagonal. P_F, on the other hand, is block-diagonal, which means that the processors can calculate on the fine level in parallel. We see in (9) that there is only a connection to previous time-steps through the right-hand side, where we gather values from the previous time-step and iteration but not from the current iteration. The picture shows this connection by a fine communication, which forwards data from each fine sweep to the following fine sweep of the right neighbor. The fine and coarse calculations on every processor are connected through the FAS corrections, which in our formula are part of the coarse sweep.
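The structural difference between the two preconditioners can be seen in a few lines of code: inverting P̃_F̃ requires a forward substitution over the time-steps, since each step needs the last-node value of its left neighbor through H̃, whereas inverting P_F decouples into L independent block solves. The sketch below again assumes the scalar linear test problem f(u) = λu and a given M×M matrix QD playing the role of Q̃_Δ or Q_Δ; it mirrors the structure of the coarse and fine sweeps (7) and (9), not the full multilevel algorithm.

```python
import numpy as np

def coarse_sweep_serial(QD, lam, dt, rhs_blocks):
    """Invert the serial coarse preconditioner: step l needs the result of step l-1 (via H)."""
    M = QD.shape[0]
    P_block = np.eye(M) - dt * lam * QD
    u_blocks = []
    for l, rhs in enumerate(rhs_blocks):
        if l > 0:
            # H applied to the previous step: its last-node value, copied to all nodes
            rhs = rhs + u_blocks[-1][-1] * np.ones(M)
        u_blocks.append(np.linalg.solve(P_block, rhs))
    return u_blocks

def fine_sweep_parallel(QD, lam, dt, rhs_blocks):
    """Invert the block-diagonal fine preconditioner: all L solves are independent."""
    M = QD.shape[0]
    P_block = np.eye(M) - dt * lam * QD
    # each list entry could be computed by a different processor
    return [np.linalg.solve(P_block, rhs) for rhs in rhs_blocks]
```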
2.3 PFASST-Newton

For each coarse and each fine sweep within each PFASST iteration, System (7) and System (9), respectively, need to be solved. If f is a nonlinear function, these systems are nonlinear as well. The obvious and traditional way to proceed in this case is to linearize the problem locally (i.e. for each time-step, at each quadrature node) using Newton's method. This way, PFASST is the outer solver with an inner Newton iteration. For triangular Q̃_Δ, the mth equation on the lth time-step on the coarse level reads

(1 - \Delta t\, \tilde q^\Delta_{m,m} f)(\tilde u^{k+1}_{l,m}) = \tilde u^{k+1}_{l,0} + \Delta t \sum_{n=1}^{m-1} \tilde q^\Delta_{m,n}\, f(\tilde u^{k+1}_{l,n}) + \tilde c(\tilde u^k)_{l,m},

where ũ^{k+1}_{0,0} = ũ_{0,0} and c̃(ũ^k)_{l,m} is the mth entry of the lth block of c̃(ũ^k) := (P̃_F̃ − C̃_F̃)(ũ^k) + τ^k. This term gathers all values of the previous iteration. The first summand of the right-hand side of the coarse level equation corresponds to b̃ and H̃, while the following sum comes from the lower triangular structure of Q̃_Δ.

For time-step l these equations can be solved one by one using Newton iterations and forward substitution. This is inherently serial, because the solution at the mth quadrature node depends on the solution at all previous nodes through the sum. Thus, while running parallel across the steps, each solution of the local collocation problem is found in serial. In the next section, we will present a novel way of applying Newton's method, which allows one to parallelize this part across the collocation nodes, joining parallelization across the steps with parallelization across the method.
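As an illustration of this node-by-node strategy, the following sketch performs one such sweep for a scalar nonlinear problem: for each node m it solves the local one-dimensional nonlinear equation with a few Newton steps, using the already updated values at the previous nodes in the forward-substitution sum. The function names, the fixed number of Newton steps and the example right-hand side are our own choices, not taken from the paper's implementation.

```python
import numpy as np

def newton_sweep_serial(QD, dt, f, df, u_step_init, c, u_old, newton_iters=3):
    """One sweep with node-wise Newton: forward substitution over the M quadrature nodes."""
    M = QD.shape[0]
    u = np.array(u_old, dtype=float)
    for m in range(M):
        # right-hand side: initial value of the step, lower-triangular sum, old-iterate term
        rhs = u_step_init + dt * sum(QD[m, n] * f(u[n]) for n in range(m)) + c[m]
        x = u[m]                                  # initial guess: previous iterate
        for _ in range(newton_iters):             # scalar Newton for x - dt*q_mm*f(x) = rhs
            g = x - dt * QD[m, m] * f(x) - rhs
            x -= g / (1.0 - dt * QD[m, m] * df(x))
        u[m] = x
    return u

# illustrative call for f(u) = u*(1 - u):
# u_new = newton_sweep_serial(QD, 0.001, lambda u: u * (1 - u), lambda u: 1 - 2 * u,
#                             u_step_init=1.0, c=np.zeros(4), u_old=np.ones(4))
```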
3 PFASST-ER

From the perspective of a single time-step [t_l, t_{l+1}] or processor P_l, Eq. (7) on the coarse level for this step reads

\tilde P^{\mathrm{sdc}}_{\tilde f}(\tilde u^{k+1}_l) - \tilde u^{k+1}_{l,0} = (\tilde P^{\mathrm{sdc}}_{\tilde f} - \tilde C^{\mathrm{coll}}_{\tilde f})(\tilde u^k_l) + \tau^k_l,

where τ^k_l is the lth component of τ^k, belonging to the interval [t_l, t_{l+1}]. Note that the serial dependency is given by the term ũ^{k+1}_{l,0}, which does not depend on the solution ũ^{k+1}_l of this equation and can thus be considered as part of a given right-hand side. On the fine level, this is even simpler, because there we have to solve

P^{\mathrm{sdc}}_f(u^{k+1}_l) = (P^{\mathrm{sdc}}_f - C^{\mathrm{coll}}_f)(u^{k+1/2}_l) + u^{k+1/2}_{l,0},

where the u^{k+1/2}_{l,0}-term is independent of the current iteration (which, of course, leads to the parallelism on the fine level).

As we have seen above, the typical strategy would be to solve these systems line by line, node by node, using forward substitution and previous PFASST iterates as initial guesses. An alternative approach has been presented in [19], where each SDC iteration can be parallelized across the nodes. While this is trivial for linear problems, nonlinear ones require the linearization of the full equations, not node-wise as before. For the fine sweep, let

G^{\mathrm{sdc}}_f(v) := P^{\mathrm{sdc}}_f(v) - (P^{\mathrm{sdc}}_f - C^{\mathrm{coll}}_f)(u^{k+1/2}_l) - u^{k+1/2}_{l,0};

then a Newton step for G^{sdc}_f(v) = 0 is given by

\nabla G^{\mathrm{sdc}}_f(v^j)\, e^j = -G^{\mathrm{sdc}}_f(v^j), \qquad v^{j+1} = v^j + e^j,

for the Jacobian matrix ∇G^{sdc}_f(v^j) of G^{sdc}_f evaluated at v^j. We have

\nabla G^{\mathrm{sdc}}_f(v^j) = \nabla P^{\mathrm{sdc}}_f(v^j) = I - \Delta t\, Q_\Delta \nabla f(v^j)

for the Jacobian matrix ∇f(v^j) of f evaluated at v^j, which in turn is given by

\nabla f(v^j) = \mathrm{diag}(f'(v^j_1), ..., f'(v^j_M)).

There is still no parallelism to exploit, but when we replace the full Jacobian matrix ∇f(v^j) by the approximation f'(u_{l,0}) I_M, which is the derivative of f at the initial value of the current time-step, we can use

\nabla G^{\mathrm{sdc}}_f(v^j) \approx \nabla G^{\Delta\text{-QN}}_f(u_{l,0}) := I - f'(u_{l,0})\, \Delta t\, Q_\Delta

to establish a Quasi-Newton iteration as

\nabla G^{\Delta\text{-QN}}_f(u_{l,0})\, e^j = -G^{\mathrm{sdc}}_f(v^j), \qquad v^{j+1} = v^j + e^j.

This decouples the evaluation of the Jacobian matrix from the current quadrature nodes, and now Q_Δ can be diagonalized, so that the inversion of ∇G^{Δ-QN}_f(u_{l,0}) can be parallelized across the nodes.

Note that there are other options for approximating the full Jacobian matrix. Most notably, in [9] the mean over all Jacobian matrices is used (there across the time-steps). We did not see any impact on the convergence when following this strategy, most likely because the number of quadrature nodes is typically rather low. The advantage of using the initial value is that it reduces the number of evaluations of the Jacobian matrix, which also includes communication time.

Provided that Q_Δ is diagonalizable, we can decompose it by Q_Δ = V_Δ Λ_Δ V_Δ^{-1}, where Λ_Δ = diag((Q_Δ)_{ii}) contains the eigenvalues (Q_Δ)_{ii} ∈ R of Q_Δ and V_Δ contains its eigenvectors. Using the given diagonalization, the algorithm reads:

1. replace r^j = -G^{sdc}_f(v^j) by \bar r^j = -V_Δ^{-1} G^{sdc}_f(v^j) (serial),
2. solve (I - f'(u_{l,0}) \Delta t\, Λ_Δ)\, \bar e^j = \bar r^j (parallel in M),
3. replace \bar e^j by e^j = V_Δ \bar e^j (serial),
4. set v^{j+1} = v^j + e^j (parallel in M).

This can be iterated until a certain threshold is reached; we then set u^{k+1}_l = v^J to obtain the solution of the equation for the fine sweep. On the coarse level, the procedure is very similar, with a slightly different definition of G̃^{sdc}_f̃(ṽ). In practice, choosing only a single Newton iteration (i.e. J = 1) is sufficient, because this is only the inner solver for an outer PFASST iteration. In all cases we have studied so far, using more inner iterations does not lead to a faster overall method.
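A compact sketch of steps 1-4 follows, again for a problem that is scalar in space so that the diagonalization acts on an M×M matrix only. The division in step 2 is where, in the actual method, each core handles its own quadrature node; here it is simply a vectorized operation. This is an illustrative rendition under the following assumptions: QD stands for either Q_Δ or Q, fprime0 for the scalar derivative f'(u_{l,0}), and G is a callable implementing G^{sdc}_f (or G^{coll}_f below).

```python
import numpy as np

def quasi_newton_diagonalized(G, QD, dt, fprime0, v, iters=1):
    """Quasi-Newton steps with the frozen Jacobian I - f'(u_{l,0})*dt*QD (Sect. 3)."""
    evals, V = np.linalg.eig(QD)                   # QD = V * diag(evals) * V^{-1}
    Vinv = np.linalg.inv(V)
    for _ in range(iters):                         # J = 1 is usually enough as inner solver
        r = -G(v)                                  # current nonlinear residual
        rbar = Vinv @ r                            # step 1 (serial)
        ebar = rbar / (1.0 - fprime0 * dt * evals) # step 2 (parallel across the M nodes)
        e = V @ ebar                               # step 3 (serial)
        v = v + e.real                             # step 4 (parallel); exact correction is real
    return v
```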
This linearization and diagonalization strategy immediately suggests a second approach: instead of using Q_Δ for the preconditioner, we can use the original quadrature matrix Q directly. The intention of using Q_Δ in the first place was to obtain a preconditioner which allows inversion by forward substitution. Now, with diagonalization in place, this is no longer necessary. Instead, we can use

P^{\mathrm{coll}}_f := C^{\mathrm{coll}}_f

and thus

G^{\mathrm{coll}}_f(v) := C^{\mathrm{coll}}_f(v) - u^{k+1/2}_{l,0}.

Note that this is just the lth block of the original composite collocation problem. Following the same ideas as before, we end up with

\nabla G^{\mathrm{coll}}_f(v^j) \approx \nabla G^{\mathrm{QN}}_f(u_{l,0}) := I - f'(u_{l,0})\, \Delta t\, Q,

which can be diagonalized using Q = V \Lambda V^{-1}, where Λ is a diagonal matrix with the eigenvalues λ_i(Q) ∈ C. The same idea can be applied to the coarse level sweep, of course. As a result, the original nonlinear SDC sweeps within PFASST are now replaced by Quasi-Newton iterations which can be done parallel across the nodes. We note that using simplified or Quasi-Newton methods for solving implicit Runge-Kutta schemes is a standard approach, as e.g. [26] shows. We further refer to [19] for more details on the idea of parallel SDC sweeps with Q and Q_Δ.

The question now is how much the approximation of the Jacobians affects the convergence and runtime of the method, and how all this compares to standard PFASST iterations. It is well known that for suitable right-hand sides and initial guesses the standard, unmodified Newton method converges quadratically, while the Quasi-Newton method as well as SDC show linear convergence, see e.g. [11,13,24]. We will examine the impact of these approaches in the following section along the lines of two numerical examples. A more rigorous mathematical analysis is currently ongoing work, as it can be embedded into a larger convergence theory for PFASST with inner Newton-type solvers.

4 Numerical results

We apply PFASST and PFASST-ER to two different, rather challenging reaction-diffusion problems, starting with a detailed analysis of the parallelization strategies for the Allen-Cahn equation and highlighting differences to these findings for the Gray-Scott equations.

4.1 Allen-Cahn equation

We study the two-dimensional Allen-Cahn equation, which is given by

u_t = \Delta u + \frac{1}{\varepsilon^2} u (1 - u^2) \qquad (10)

on the spatial domain [-0.5, 0.5]^2, with initial condition

u_0 = \tanh\left(\frac{R - \sqrt{x^2 + y^2}}{\sqrt{2}\, \varepsilon}\right)

and periodic boundary conditions. We use simple second-order finite differences for the discretization in space and take 256 elements in each dimension on the fine level and 128 on the coarse one. We furthermore use M = 4 Gauß-Radau nodes, set ε = 0.04, Δt = 0.001 < ε² and stop the simulation after 24 time-steps at T = 0.024. The initial condition describes a circle with a radius R = 0.25, see e.g. [28]. Note that since our focus is on the temporal parallelization, the temporal resolution was chosen to be quite high in contrast to the spatial resolution. Errors in space and time are not balanced here (and in the following example), with the spatial error being much higher. This has been done deliberately to avoid higher computational costs. By increasing the accuracy in space, we would increase the amount of large parallelizable computations in relation to communication. This would in turn improve the overall parallel efficiency, which would in the end lead to even better scaling results. However, when using parallelization in space, all processors have very few degrees-of-freedom anyway, so our results may even reflect a "real" situation better.
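For reference, a small sketch of how we read this spatial setup: a second-order periodic finite-difference Laplacian plus the reaction term of Eq. (10), and the circular initial condition with radius R = 0.25. Grid size and parameters follow the text above; the concrete implementation (array layout, use of np.roll for the periodic stencil) is our own choice and not taken from pySDC.

```python
import numpy as np

nvars, eps, radius = 256, 0.04, 0.25
dx = 1.0 / nvars                                    # mesh spacing on [-0.5, 0.5]^2, periodic
x = -0.5 + dx * np.arange(nvars)
X, Y = np.meshgrid(x, x, indexing="ij")

# initial condition: a circle of radius R = 0.25, smoothed on the eps scale
u0 = np.tanh((radius - np.sqrt(X**2 + Y**2)) / (np.sqrt(2.0) * eps))

def allen_cahn_rhs(u):
    """f(u) = Laplace(u) + 1/eps^2 * u*(1 - u^2), 2nd-order periodic finite differences."""
    lap = (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0) +
           np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4.0 * u) / dx**2
    return lap + 1.0 / eps**2 * u * (1.0 - u**2)

print(allen_cahn_rhs(u0).shape)     # (256, 256)
```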
The results we present in the following were computed with pySDC [20,21] on the supercomputer JURECA [12]. We run a serial single-level simulation using SDC ("SL" in the plots), a serial multi-level simulation using multi-level SDC ("ML", which is PFASST on one processor, see [23]) and parallel simulations with 2, 4, 8, 12 and 24 processors ("P2" to "P24"), all until a given residual tolerance of 10^{-10} is reached.

If fewer processors than time-steps are used, the time domain is split into blocks of parallel PFASST runs. These are handled sequentially, using the solution of the previous block as the initial data for the next one. For example, 6 processors work on the first 6 time-steps until convergence and the solution is used as new initial condition for the next block of 6 time-steps. This is repeated until all 24 time-steps have been completed.

In Fig. 2 we show the maximum number of linear solves which were performed by the slowest processor (i.e. the last processor in time) for the different versions of the solvers, aggregated over all its time-steps and quadrature nodes, over all outer and inner iterations.

Fig. 2 Number of linear solves for the Allen-Cahn example, all methods run serial on the nodes

Fig. 3 Time to solution for Allen-Cahn with parallelization only across time-steps

Here, two versions of the original PFASST algorithm are run: The first one performs exactly one inner Newton iteration in every PFASST iteration; this version is labeled as "PFASST: 1 iter". In contrast, "PFASST: N iter" performs as many inner Newton iterations as required so that the residual of the nonlinear inner problem is less than 10^{-11}. Both PFASST versions use the quadrature matrix Q_Δ^{LU} from Eq. (5) inside the preconditioner. For PFASST-ER we also show two variants: The PFASST-ER algorithm which uses the original Q inside the preconditioner is labeled as "PFASST-ER: Q", and the one which uses Q_Δ^{LU} is labeled as "PFASST-ER: Q_Δ^{LU}". Solving the innermost linear systems is done using GMRES with a tolerance of 10^{-12} in all cases.

We can see that performing more than one inner Newton iteration ("PFASST: N iter" vs. "PFASST: 1 iter") does not improve the convergence of the overall algorithm. Although it is possible that by increasing the number of inner Newton iterations the number of outer PFASST iterations decreases, the total effort, which can be measured by the total number of linear solves, increases due to the higher number of inner Newton iterations.

Using the Quasi-Newton approach with the same preconditioner instead of the classical Newton solver ("PFASST-ER: Q_Δ^{LU}" vs. "PFASST: 1 iter") shows little effect on the total iteration numbers, but using the original quadrature matrix Q instead of Q_Δ^{LU} inside the preconditioner ("PFASST-ER: Q" vs. "PFASST-ER: Q_Δ^{LU}") greatly reduces the number of iterations. However, without parallelization, one iteration of PFASST-ER with Q is in general more expensive than one iteration of the other algorithms, because it requires the solution of a full system via diagonalization instead of stepping through a triangular system via forward substitution.

In Fig. 3, we thus examine whether the lower number of more expensive iterations actually pays off. The plot shows results for the same setup as Fig. 2, but now we focus on the runtime instead of the iteration numbers. We only consider parallelization across the time-steps to compare the impact of the algorithmic change first. We see that, despite the fact that the iterations are more expensive, PFASST-ER with Q already shows a lower runtime than the original PFASST method in this example. This is also true when using Q_Δ^{LU} instead of Q.

At this point, we have not yet considered the additional direction of concurrency exposed by PFASST-ER. For that, we next compare different distributions of up to 24 cores on the 4 quadrature nodes and the 24 time-steps. All divisions of 24 were tested with all possible distributions.
Fig. 4 Runtimes in seconds (first number) and efficiencies (second number) with different distributions of cores using PFASST-ER with Q for the Allen-Cahn equation

The two plots in Fig. 4 show different combinations of cores used for step-parallelization (x-axis) and for node-parallelization (y-axis) with PFASST-ER and Q. Multiplying the numbers on both axes gives the total number of cores used for this simulation. This is also the reason why there are two plots: not all combinations are actually possible or meaningful. Within each colored block the total runtime (in seconds, first number) and the parallel efficiency (second number) for this setup are given. We can see that using all available cores for parallelization across the steps is by far not the most efficient choice. In turn, more than 4 cores cannot be used for parallelization across the nodes, although 4 gives the best speedup. Indeed, the best combination for this problem is to maximize node-parallelization first and then add step-parallelization (31.3 seconds with 4 cores on the nodes and 6 on the steps, lower picture). This is about 1.8 times faster than using 24 cores for the steps alone and more than 5 times faster than the serial PFASST-ER run.

Although using Q instead of Q_Δ^{LU} in PFASST-ER is faster for this example, it is quite revealing to repeat the simulations using Q_Δ^{LU}. These results are shown in Fig. 5, and it is obvious that using as many cores as possible for the parallelization across the nodes is now not the optimal strategy. Here, using 2 cores on the nodes and 12 on the steps is the most efficient combination, albeit still significantly slower than using PFASST-ER with Q, even with the same combination. The reason for this potentially surprising result is that solving the innermost linear systems heavily depends on the structure of these systems, in particular when using an iterative solver like GMRES. Moreover, initial guesses are a crucial factor, too. For PFASST-ER, we use the current solution at node zero of the respective time-step as the initial guess. This is particularly suitable for the first nodes, which are closest to node zero, but potentially less so for later ones. While both effects did not lead to significant variations in the time spent for solving the linear systems when using Q, they do produce a severe load imbalance when using Q_Δ^{LU}. More specifically, using 4 cores for the nodes and only 1 for the time-steps, i.e. exploiting only parallelization across the nodes, the first core takes about 118.2 seconds for all linear system solves at the first node, while the last core takes about 194.6 seconds on the last node. Therefore, using 2 cores on the nodes, which enables a better load distribution, is the ideal choice. One possibility would be that core 1 deals with nodes 1 and 4 and core 2 with nodes 2 and 3, but because nodes 3 and 4 are very close to each other and the corresponding calculations are almost equally expensive, an alternating distribution is an ideal choice as well. This is precisely what has been done for Fig. 5, leading to the best speedup with 2 cores on the nodes. For other examples, an optimal distribution might be more difficult to find.

Fig. 5 Runtimes in seconds (first number) and efficiencies (second number) with different distributions of cores using PFASST-ER with Q_Δ^{LU} for the Allen-Cahn equation
In Fig. 6 we now summarize the best results: PFASST with one inner Newton iteration in comparison to PFASST-ER using Q_Δ^{LU} with 2 cores on the nodes and PFASST-ER using Q with 4 cores on the nodes. The plot shows the simulation time for each variant based on the total number of processors used. We see that PFASST-ER is always much more time-efficient than PFASST, with another significant gain when using Q instead of Q_Δ^{LU}. Now, since PFASST-ER adds another direction of parallelization compared to PFASST, we can not only increase the parallel efficiency as shown, but also extend the number of usable cores to obtain a better time-to-solution. This has been done in Fig. 7: taking 48 or 96 cores in total further reduces the computing time for 24 time-steps. With PFASST-ER, the number of resources that can be used for parallel-in-time integration is no longer limited by the number of time-steps, but can be increased by the factor given by the number of quadrature nodes.

Fig. 6 Runtimes for the three best variants, Allen-Cahn example

Fig. 7 Runtimes for different numbers of processors, Allen-Cahn example

4.2 Gray-Scott equations

The second example we present here is the Gray-Scott system [16], which is given by

u_t = D_u \Delta u - u v^2 + F (1 - u),
v_t = D_v \Delta v + u v^2 - (F + K) v,

on the spatial domain [0, 1] × [0, 1], with periodic boundary conditions. As initial condition we choose a circle with radius 0.05 centred in the spatial domain, where u = 0.5 and v = 0.25 on the inside, and u = 1.0 and v = 0 outside of this circle. We use D_u = 10^{-4}, D_v = 10^{-5} and set a feed rate of F = 0.0367 and a kill rate of K = 0.0649. This leads after some time to a process similar to cellular division and is known as "mitosis". We discretize the spatial domain with 128 points in each dimension on the fine level and with 64 on the coarse one, using standard finite differences. We discretize every time-step of size Δt = 1 with 4 quadrature nodes and run the simulation again for 24 time-steps.
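Analogously to the Allen-Cahn sketch, the Gray-Scott right-hand side and initial condition of this setup can be written as follows. This is again our own illustrative rendition: the periodic finite-difference Laplacian and the array layout are assumptions, and the reaction term is taken to be the uv² coupling stated above.

```python
import numpy as np

nvars, Du, Dv, F, K = 128, 1e-4, 1e-5, 0.0367, 0.0649
dx = 1.0 / nvars                                    # mesh spacing on [0, 1]^2, periodic

def laplacian(w):
    return (np.roll(w, 1, axis=0) + np.roll(w, -1, axis=0) +
            np.roll(w, 1, axis=1) + np.roll(w, -1, axis=1) - 4.0 * w) / dx**2

def gray_scott_rhs(u, v):
    """Right-hand side of the Gray-Scott system of Sect. 4.2."""
    uv2 = u * v**2
    return (Du * laplacian(u) - uv2 + F * (1.0 - u),
            Dv * laplacian(v) + uv2 - (F + K) * v)

# initial condition: u = 0.5, v = 0.25 inside a circle of radius 0.05, u = 1, v = 0 outside
x = dx * np.arange(nvars)
X, Y = np.meshgrid(x, x, indexing="ij")
inside = (X - 0.5)**2 + (Y - 0.5)**2 < 0.05**2
u, v = np.where(inside, 0.5, 1.0), np.where(inside, 0.25, 0.0)
print(gray_scott_rhs(u, v)[0].shape)   # (128, 128)
```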
The results are similar to the ones for the Allen-Cahn equation in the previous section. We will omit the case of PFASST with more than one inner Newton iteration, though.

We start again by looking at the total number of linear solves the different algorithms need to perform. Figure 8 shows the number of linear solves for the methods, which run until a residual tolerance of 10^{-12} is reached. The results look quite similar to the ones for the previous example, with one critical difference: The advantage of the Q-variant of PFASST-ER over the other algorithms shrinks more rapidly the more parallel time-steps are used. There is no obvious explanation (at least, obvious to us) for this behavior. We do note, however, that the more time-steps are approximated simultaneously, the less suitable u_{l,0} is as an initial value for more distant time-steps, and although the full Newton and the Quasi-Newton methods differ by an order of convergence in theory, in our scenario this seems relevant only for good initial values. One can therefore expect the runtime to increase when using PFASST-ER with Q, while it stayed about the same in the case of the Allen-Cahn example.

Fig. 8 Number of linear solves for the Gray-Scott example, all methods run serial on the nodes

Fig. 9 Time to solution for Gray-Scott with parallelization only across time-steps

This is precisely what we can see in Fig. 9. The more parallel time-steps are run, the less efficient PFASST-ER with Q becomes in this variant. Already at 3 parallel steps, it is as costly as the original PFASST version, at least when parallelization across the nodes is not considered.

Now, adding node-parallelization, the findings are again similar to the ones in the previous section: Figure 10 shows that PFASST-ER with Q is still more efficient than using PFASST. In particular, using more cores on the nodes is better, and the best combination is again 4 cores on the nodes and 6 on the steps. Again, this changes when considering PFASST-ER with Q_Δ^{LU} as in Fig. 11, where the ideal setup uses only 2 cores on the nodes, but 12 on the steps. This is again due to load imbalances of the innermost linear solves. However, note the key difference to the previous results: The fastest run of the Q_Δ^{LU}-variant is now faster than the one of the Q-variant.

Fig. 10 Runtimes in seconds (first number) and efficiencies (second number) with different distributions of cores using PFASST-ER with Q for the Gray-Scott equations

Fig. 11 Runtimes in seconds (first number) and efficiencies (second number) with different distributions of cores using PFASST-ER with Q_Δ^{LU} for the Gray-Scott equations

In Fig. 12 we now give an overview of the best results: If we use parallelism across the nodes in a suitable way, both PFASST-ER versions are more efficient in terms of simulation time than the classical PFASST algorithm. Both can be used to extend the scaling capabilities beyond the number of time-steps, and both scale rather well in this regime. Note, however, that the Q_Δ^{LU}-variant can here only leverage 2 × 24 cores. It is then faster than the Q-variant with twice as many cores.

Fig. 12 Runtimes for the three best variants, Gray-Scott example
5 Conclusion and outlook

Today's supercomputers are designed with an ever increasing number of processors. Therefore, we need our software and the underlying numerical algorithms to handle this increasing degree of parallelism. Time-parallel integrators are one promising research direction, with quite a number of different approaches. Some approaches parallelize each individual time-step, others act on multiple time-steps simultaneously. In this paper we have introduced a solver that works in parallel across the method as well as across the steps. More precisely, we combine node-parallel spectral deferred corrections with the parallel full approximation scheme in space and time. While PFASST allows one to compute multiple time-steps simultaneously and to target large-scale parallelism in time, the new version called PFASST-ER presented here extends this idea with an efficient small-scale parallelization of every single time-step itself. The scaling studies show that a combination of both concepts seems to be the most efficient way to solve time-dependent PDEs. Here we tested two different preconditioners: one using the traditional, triangular quadrature matrix Q_Δ^{LU}, generated by an LU decomposition, and one using the original matrix Q. Both can be diagonalized and used as parallel-across-the-nodes preconditioners. For the Q_Δ^{LU}-preconditioner, we saw load imbalances when using an inner iterative linear solver, but by grouping nodes we can still speed up the simulation beyond the number of parallel time-steps. For the Q-preconditioner, the overall number of iterations was lower and the time-to-solution was faster. Adding node-parallelization, parallel efficiency can be increased and speedup extended when compared to PFASST. Both PFASST-ER versions lead in the end to better scaling results than the classical PFASST algorithm. PFASST-ER with Q especially offers an almost equal distribution of work for iterative linear solvers with respect to the individual quadrature nodes of a time-step. This advantage makes the algorithm particularly flexible, and it can be used for any number of quadrature points.

PFASST-ER is particularly favorable if an increase in parallelism across the steps would lead to a severe increase in the number of iterations. This could be due to, e.g., the type of the equation or the coarsening strategy. During our experiments we saw that it is not clear a priori which combination of node- and step-parallelization is the most efficient one. This could lead to many, potentially irrelevant runs to find the sweet spot. Here, a performance model and a suitable convergence theory are needed to at least narrow down the relevant options. This has to be accompanied by more numerical tests, relating e.g. model parameters with load imbalances, to identify the limits of this approach.

Acknowledgements The authors thankfully acknowledge the financial support by the German Federal Ministry of Education and Research through the ParaPhase project within the framework "IKT 2020 - Forschung für Innovationen" (Project Number 01IH15005A).

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Bolten, M., Moser, D., Speck, R.: A multigrid perspective on the parallel full approximation scheme in space and time. Numer. Linear Algebra Appl. 24(6), e2110 (2017). https://doi.org/10.1002/nla.2110
2. Bolten, M., Moser, D., Speck, R.: Asymptotic convergence of the parallel full approximation scheme in space and time for linear problems. Numer. Linear Algebra Appl. 25(6), e2208 (2018). https://doi.org/10.1002/nla.2208
3. Burrage, K.: Parallel methods for ODEs. Adv. Comput. Math. 7, 1-3 (1997)
4. Christlieb, A.J., Macdonald, C.B., Ong, B.W.: Parallel high-order integrators. SIAM J. Sci. Comput. 32(2), 818-835 (2010)
5. Clarke, A.T., Davies, C.J., Ruprecht, D., Tobias, S.M.: Parallel-in-time integration of kinematic dynamos (2019). arXiv:1902.00387 [physics.comp-ph]
6. Emmett, M., Minion, M.L.: Toward an efficient parallel in time method for partial differential equations. Commun. Appl. Math. Comput. Sci. 7, 105-132 (2012)
7. Falgout, R.D., Friedhoff, S., Kolev, T.V., MacLachlan, S.P., Schroder, J.B., Vandewalle, S.: Multigrid methods with space–time concurrency. Comput. Vis. Sci. 18(4-5), 123-143 (2017)
8. Gander, M.J.: 50 years of time parallel time integration. In: Multiple Shooting and Time Domain Decomposition. Springer (2015). https://doi.org/10.1007/978-3-319-23321-5_3
9. Gander, M.J., Halpern, L., Ryan, J., Tran, T.T.B.: A direct solver for time parallelization. In: Dickopf, T., Gander, M.J., Halpern, L., Krause, R., Pavarino, L.F. (eds.) Domain Decomposition Methods in Science and Engineering XXII, pp. 491-499. Springer (2016). https://doi.org/10.1007/978-3-319-18827-0_50
10. Huang, J., Jia, J., Minion, M.: Accelerating the convergence of spectral deferred correction methods. J. Comput. Phys. 214(2), 633-656 (2006)
11. Jackson, K.R., Kværnø, A., Nørsett, S.P.: The use of Butcher series in the analysis of Newton-like iterations in Runge–Kutta formulas. Appl. Numer. Math. 15(3), 341-356 (1994)
12. Jülich Supercomputing Centre: JURECA: General-purpose supercomputer at Jülich Supercomputing Centre. J. Large-Scale Res. Facil. 2(A62) (2016). https://doi.org/10.17815/jlsrf-2-121
13. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Frontiers in Applied Mathematics, No. 16. SIAM (1995)
14. Koehler, F.: PFASST TikZ. https://github.com/Parallel-in-Time/pfasst-tikz (2015)
15. Lions, J.-L., Maday, Y., Turinici, G.: A "parareal" in time discretization of PDE's. Comptes Rendus de l'Académie des Sciences, Series I: Mathematics. https://doi.org/10.1016/S0764-4442(00)01793-6
16. Pearson, J.E.: Complex patterns in a simple system. Science 261(5118), 189-192 (1993)
17. Ruprecht, D., Speck, R.: Spectral deferred corrections with fast-wave slow-wave splitting. SIAM J. Sci. Comput. 38(4), A2535-A2557 (2016)
18. Ruprecht, D., Speck, R., Emmett, M., Bolten, M., Krause, R.: Poster: extreme-scale space-time parallelism. In: Proceedings of the 2013 Conference on High Performance Computing Networking, Storage and Analysis Companion, SC '13 Companion (2013). http://sc13.supercomputing.org/sites/default/files/PostersArchive/tech_posters/post148s2-file3.pdf
19. Speck, R.: Parallelizing spectral deferred corrections across the method. Comput. Vis. Sci. 19(3-4), 75-83 (2018). https://doi.org/10.1007/s00791-018-0298-x
20. Speck, R.: Algorithm 997: pySDC - prototyping spectral deferred corrections. ACM Trans. Math. Softw. (2019). https://doi.org/10.1145/3310410
21. Speck, R.: Website for pySDC (2019). https://parallel-in-time.org/pySDC/. Accessed 27 November 2019
22. Speck, R., Ruprecht, D., Krause, R., Emmett, M., Minion, M., Winkel, M., Gibbon, P.: A massively space–time parallel N-body solver. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pp. 92:1-92:11. IEEE Computer Society Press, Los Alamitos, CA, USA (2012). ISBN 978-1-4673-0804-5. http://dl.acm.org/citation.cfm?id=2388996.2389121
23. Speck, R., Ruprecht, D., Emmett, M., Minion, M.L., Bolten, M., Krause, R.: A multi-level spectral deferred correction method. BIT Numer. Math. 55, 843-867 (2015). https://doi.org/10.1007/s10543-014-0517-x
24. Tang, T., Xie, H., Yin, X.: High-order convergence of spectral deferred correction methods on general quadrature nodes. J. Sci. Comput. 56(1), 1-13 (2013)
25. Trottenberg, U., Oosterlee, C., Schuller, A.: Multigrid. Academic Press, London (2000)
26. Wanner, G., Hairer, E.: Solving Ordinary Differential Equations II. Springer, Berlin (1996)
27. Weiser, M.: Faster SDC convergence on non-equidistant grids by DIRK sweeps. BIT Numer. Math. 55(4), 1219-1241 (2014)
28. Zhang, J., Du, Q.: Numerical studies of discrete approximations to the Allen–Cahn equation in the sharp interface limit. SIAM J. Sci. Comput. 31(4), 3042-3063 (2009). https://doi.org/10.1137/080738398

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
