ABSTRACT

We present gamer-2, a GPU-accelerated adaptive mesh refinement (AMR) code for astrophysics. It provides a rich set of features, including adaptive time-stepping, several hydrodynamic schemes, magnetohydrodynamics, self-gravity, particles, star formation, chemistry and radiative processes with grackle, data analysis with yt, and a memory pool for efficient object allocation. gamer-2 is fully bitwise reproducible. For performance optimization, it adopts hybrid OpenMP/MPI/GPU parallelization and overlaps CPU computation, GPU computation, and CPU–GPU communication. Load balancing is achieved using a Hilbert space-filling curve on a level-by-level basis without the need to duplicate the entire AMR hierarchy on each MPI process. To provide convincing demonstrations of the accuracy and performance of gamer-2, we directly compare with enzo on isolated disc galaxy simulations and with flash on galaxy cluster merger simulations. We show that the physical results obtained by different codes are in very good agreement, and gamer-2 outperforms enzo and flash by nearly one and two orders of magnitude, respectively, on the Blue Waters supercomputer using 1–256 nodes. More importantly, gamer-2 exhibits similar or even better parallel scalability compared to the other two codes. We also demonstrate good weak and strong scaling using up to 4096 GPUs and 65 536 CPU cores, and achieve a uniform resolution as high as 10 240³ cells. Furthermore, gamer-2 can be adopted as an AMR + GPUs framework and has been extensively used for wave dark matter simulations. gamer-2 is open source (available at https://github.com/gamer-project/gamer) and new contributions are welcome.

Key words: methods: numerical

1 INTRODUCTION

Many problems in computational astrophysics require resolving structures at a wide range of spatial scales. For this reason, the adaptive mesh refinement (AMR) method (Berger & Oliger 1984; Berger & Colella 1989) has played an indispensable role by enabling high dynamic range simulations of astrophysical phenomena. The fundamental principle of AMR is to allow the simulation resolution, in both space and time, to adjust adaptively and locally so as to concentrate computational resources on regions requiring higher resolution. This is achieved by first covering the entire computational domain with a uniform-resolution grid, and then adding hierarchies of nested refined grid patches (also referred to as ‘patches’, ‘grids’, or ‘blocks’) with decreasing cell spacing over subvolumes of interest. There have been many hydrodynamic AMR codes for astrophysics (e.g. Kravtsov, Klypin & Khokhlov 1997; Fryxell et al. 2000; Teyssier 2002; Cunningham et al. 2009; Schive, Tsai & Chiueh 2010; Almgren et al. 2010; Mignone et al. 2012; Almgren et al. 2013; Bryan et al. 2014; White, Stone & Gammie 2016). Among these, the AMR implementations can be broadly classified into three categories based on the basic units adopted for grid refinement. The first category uses rectangular patches of arbitrary aspect ratios as the refinement units, where different patches can have different sizes and shapes; for example, enzo (Bryan et al. 2014) utilizes this scheme. The second category performs refinement on a cell-by-cell basis, for example, art (Kravtsov et al. 1997) and ramses (Teyssier 2002). The third category performs refinement on patches of fixed size (e.g. 8³ cells); in other words, all patches are restricted to be geometrically similar to each other.
Examples include flash (Fryxell et al. 2000), gamer (Schive et al. 2010), and athena++ (Stone et al. 2008; White et al. 2016). In addition to AMR, it is also possible to achieve high resolution with Lagrangian particles (e.g. gadget-2, Springel 2005), a moving unstructured mesh (e.g. arepo, Springel 2010), or a meshless method (e.g. gizmo, Hopkins 2015). The use of graphics processing units (GPUs) has recently become a promising technique for achieving substantial speedups in simulation performance. However, compared to uniform-resolution approaches, it remains extremely challenging for AMR codes to efficiently exploit the petascale computing power of heterogeneous CPU/GPU supercomputers, mainly due to the complicated AMR data structure, load imbalance, and the great amount of work required to isolate and convert existing physics modules to run with high efficiency on GPUs. So far, only a few astrophysical AMR codes have taken advantage of GPU acceleration. For example, both enzo (Bryan et al. 2014) and ramses (Teyssier 2002) have ported the hydrodynamic and magnetohydrodynamic solvers to GPUs (Wang, Abel & Kaehler 2010; Kestener, Château & Teyssier 2010), and flash has ported an octree Poisson solver to GPUs (Lukat & Banerjee 2016). gamer (GPU-accelerated Adaptive MEsh Refinement, Schive et al. 2010, hereafter referred to as gamer-1) is the first astrophysical AMR code designed from scratch to exploit GPU acceleration. It supports both GPU hydrodynamic and Poisson solvers. However, the physical modules supported in gamer-1 are much more limited than in other widely adopted AMR codes, which restricts possible applications of the code. Here, we present gamer-2, a significant revision of gamer-1 that includes much richer functionality, including adaptive time-stepping, several hydrodynamic schemes, magnetohydrodynamics, a dual energy formalism, self-gravity, particles, star formation, chemistry and radiative processes with the grackle library, data analysis with the yt package, a memory pool, bitwise reproducibility, a test problem infrastructure, and the capability of being used as an AMR + GPUs framework. It also incorporates significant improvements in accuracy, stability, performance, and scalability. For a GPU-accelerated astrophysical AMR code, there are at least three questions to be addressed in order to make a fair performance comparison with a CPU-only AMR code. First, does it sacrifice accuracy for performance? Second, does it still outperform CPUs by a large margin when enabling a rich set of physical modules? Third, how does the performance scale with the number of GPUs, especially when running extremely large parallel simulations? To provide clear and convincing answers to these questions, we directly compare gamer-2 with two widely adopted AMR codes, flash and enzo, based on realistic astrophysical applications, namely, binary cluster merger simulations and isolated disc galaxy simulations, where we enable GPU acceleration for enzo as well. These comparison simulations show that the physical results obtained by different codes are in very good agreement, and gamer-2 outperforms enzo and flash by nearly one and two orders of magnitude, respectively, on the Blue Waters supercomputer using 1–256 nodes. We also demonstrate good weak and strong scaling in gamer-2 using up to 4096 GPUs and 65 536 CPU cores. This paper is structured as follows. We describe the numerical algorithms in Section 2 and the performance optimizations in Section 3.
Section 4 shows the code tests, especially focusing on the comparison simulations with gamer-2, enzo, and flash. Finally, we summarize our results and discuss future work in Section 5. 2 NUMERICAL ALGORITHMS We provide in this section an overview of the numerical algorithms implemented in gamer-2, including the AMR structure, hydrodynamic and gravity solvers, particle integration, and other miscellaneous features. Detailed descriptions of the performance optimizations and parallelization are given in Section 3. 2.1 Adaptive mesh refinement The AMR structure in gamer-2 is very similar to that in the original gamer-1 code, and therefore we only provide a short summary here. Note that in this paper, we use the terms ‘patch’, ‘grid’, and ‘block’ interchangeably. gamer-2 adopts a block-structured AMR where the simulation domain is covered by a hierarchy of patches with various resolutions. Patches with the same resolution are referred to as being on the same AMR level, where the root level, l = 0, has the coarsest resolution. The resolution ratio between two adjacent levels is currently fixed to 2. The levels of any two nearby patches can differ by at most 1, so the gamer-2 AMR hierarchy is always properly nested. All patches are restricted to be geometrically similar to each other, which is similar to the AMR implementation in flash. We assume a fixed patch size of 83 cells throughout this paper unless otherwise specified. The AMR hierarchy is manipulated by an octree data structure. A patch on level l can have zero or eight child patches on level l + 1 that cover the same physical domain as their parent patch, and up to 26 sibling patches (including those along the diagonal directions) on the same level. For convenience, all patches store the patch identification numbers of their parent, child, and sibling patches (if any exist). For advancing patches on different levels, gamer-2 supports both the shared time-step and adaptive time-step integrations. For the former, all patches in the simulations are restricted to have the same evolution time-step, which is generally more robust but less efficient as patches on lower levels might have unnecessarily small time-steps. This fixed time-step scheme is used by the flash code. For the latter, patches on higher levels can have smaller time-steps. Furthermore, gamer-2 does not require the time-step ratio between two adjacent levels to be a constant, which is similar to the implementation in enzo. Different levels are advanced in a way similar to the W-cycle in a multigrid scheme, where the lower levels are advanced first and then wait until they are synchronized with higher levels. This approach can improve performance notably, especially when higher refinement levels only cover small subvolumes of the simulation domain. For the adaptive time-step integration, in order to synchronize two adjacent levels, the finer level sometimes requires an extra update with an extremely small time-step, which can have a non-negligible impact on the overall performance. To alleviate this issue, we allow the time-step on level l, Δtl, to increase by a small fraction (typically |$C_{\rm inc} {\sim }10{{\ \rm per\ cent}}$|⁠) if that helps synchronize with level l − 1. Moreover, and less intuitively, we also allow Δtl to decrease by a small fraction (typically |$C_{\rm dec} {\sim }10{{\ \rm per\ cent}}$| as well) if it could remove that extra small time-step on level l + 1. See Fig. 1 for an illustration of the second optimization. 
The procedure for computing the final optimum Δtl can be described by the following pseudo-code:

    // "lv" is the abbreviation of "level"
    dt(lv) = ComputeTimeStep(lv)
    dt_inc = t(lv-1) - t(lv)
    dt_dec = 2*dt(lv+1)

    // synchronize lv and lv-1
    if (1+C_inc)*dt(lv) > dt_inc
        dt(lv) = dt_inc

    // synchronize lv and lv+1
    else if dt(lv) > dt_dec and (1-C_dec)*dt(lv) < dt_dec
        dt(lv) = dt_dec

The CFL safety factors are set to CCFL = 0.5 in gamer-2 and 0.6 in flash. For the Poisson solver, gamer-2 uses the SOR scheme and adds five additional buffer zones around each patch to make the potential smoother across the patch boundaries (see Fig. 2). In comparison, flash adopts a finite-volume multigrid scheme that aims to minimize the global residual (Ricker 2008), which, in general, should be more accurate but also more computationally expensive. Unlike flash, gamer-2 does not implement explicit grid derefinement criteria (see Section 2.1). However, we have verified that the grid distributions of the two codes are very similar in this test (the difference in the numbers of maximum-level cells is less than ∼20 per cent; e.g. see Table 2). For the floating-point accuracy, gamer-2 uses single precision while flash uses double precision, because single precision is not officially supported in flash. This discrepancy biases the performance comparison in favour of gamer-2 and could be considered unfair in this respect. However, we also demonstrate in the next section that the physical results obtained by the two codes are very consistent, suggesting that double precision may not be necessary for this test.

Table 1. Comparison of the numerical setup between gamer-2 and flash in the galaxy cluster merger simulations.

AMR implementation — gamer-2: fixed patch size of 8³ cells, no permanent allocation of patch ghost zones | flash: fixed patch size of 8³ cells, patch ghost zones are allocated permanently^a
Grid resolution — gamer-2: root grid 128³, four refinement levels, maximum resolution Δh = 7.0 kpc | flash: same as gamer-2
Particle resolution — gamer-2: number of particles Np = 5 × 10⁶ (for each cluster) and mass resolution mp = 1.4 × 10⁸ M⊙ | flash: same as gamer-2
Fluid solver — gamer-2: CTU scheme with six Riemann solvers per cell, PPM reconstruction, Roe's solver, and van Leer slope limiter | flash: same as gamer-2 except for a revised CTU scheme requiring only three Riemann solvers per cell^b
Poisson solver — gamer-2: SOR | flash: multigrid^c
Particle solver — gamer-2: CIC interpolation and KDK particle update | flash: CIC interpolation and variable time-step leapfrog particle update
Boundary conditions — gamer-2: fluid solver: outflow; Poisson solver: isolated | flash: same as gamer-2
Refinement — gamer-2: (1) Löhner's error estimator on gas density, pressure, and temperature with a refinement threshold of 0.8 and a minimum gas density threshold of 10⁻²⁸ g cm⁻³; (2) maximum number of particles in a patch: 100 | flash: same as gamer-2
Derefinement — gamer-2: no explicit derefinement criteria | flash: (1) Löhner's error estimator with a derefinement threshold of 0.2; (2) minimum number of particles in a patch: 12
Time-step — gamer-2: Cpar = 0.8, CCFL = 0.5 | flash: Cpar = 0.8, CCFL = 0.6
Parallelization — gamer-2: hybrid MPI/OpenMP/GPU | flash: MPI and CPU-only
Load balancing — gamer-2: Hilbert space-filling curve | flash: Morton space-filling curve
Time integration — gamer-2: adaptive time-step | flash: shared time-step
Floating-point format — gamer-2: single precision | flash: double precision^d

Notes. (a) flash supports the ‘NO-PERMANENT-GUARDCELLS’ (npg) mode, which however does not work well with active particles. We have therefore disabled this functionality. (b) Lee (2013). (c) Ricker (2008). (d) Single precision is not officially supported in flash, but we have demonstrated that this discrepancy does not affect the physical results here. See discussions in Sections 4.3.2 and 4.3.3.
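To make the SOR relaxation described in Section 4.3.1 more concrete, the following minimal Python sketch applies SOR to the discrete Poisson equation on a single patch. The patch size, the boundary treatment (boundary cells simply held fixed, standing in for the buffer zones that gamer-2 fills from coarser levels), the over-relaxation parameter, and the iteration count are illustrative assumptions and do not reflect the actual GPU implementation in gamer-2.

    import numpy as np

    def sor_poisson(rho, dh, omega=1.7, n_iter=50):
        # Relax nabla^2 phi = 4*pi*G*rho on one patch with SOR (illustration only).
        # The outermost cell layer of phi acts as the buffer/ghost zone; here it is
        # simply held at zero, whereas gamer-2 fills it from coarser levels.
        G = 6.674e-8                      # gravitational constant in CGS
        phi = np.zeros_like(rho)
        src = 4.0 * np.pi * G * rho * dh * dh
        for _ in range(n_iter):
            for i in range(1, rho.shape[0] - 1):
                for j in range(1, rho.shape[1] - 1):
                    for k in range(1, rho.shape[2] - 1):
                        gs = (phi[i-1, j, k] + phi[i+1, j, k] +
                              phi[i, j-1, k] + phi[i, j+1, k] +
                              phi[i, j, k-1] + phi[i, j, k+1] - src[i, j, k]) / 6.0
                        phi[i, j, k] += omega * (gs - phi[i, j, k])
        return phi

    # Example: an 8^3 patch plus five buffer zones per side is an 18^3 array,
    # e.g. phi = sor_poisson(np.ones((18, 18, 18)) * 1e-27, dh=2.2e22)

The update is the standard Gauss-Seidel value blended with the current value through the over-relaxation parameter omega; choosing omega closer to its optimal value accelerates convergence, which is one reason SOR is attractive for small, fixed-size patches.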
Table 2. Comparison of the volume-filling fractions on the refinement levels between gamer-2 and flash in the lower resolution galaxy cluster merger simulations at t = 5 Gyr.

Level | Δh (kpc) | Filling fraction (per cent): gamer-2 / flash | Number of cells: gamer-2 / flash
1 | 55.70 | 56.45 / 38.65 | 9.5 × 10⁶ / 6.5 × 10⁶
2 | 27.85 | 31.62 / 23.08 | 4.2 × 10⁷ / 3.1 × 10⁷
3 | 13.93 | 7.13 / 6.48 | 7.7 × 10⁷ / 7.0 × 10⁷
4 | 6.96 | 0.86 / 0.75 | 7.4 × 10⁷ / 6.4 × 10⁷

4.3.2 Accuracy comparison

Fig. 9 shows slices of gas temperature through the cluster centre at t = 0.0, 3.3, 6.6, and 10.0 Gyr obtained by gamer-2 and flash. The first core passage occurs at t ∼ 1.5 Gyr, after which the oscillating DM cores continue driving shocks and turbulence that heat up the intracluster medium, eventually forming a high-temperature, constant-entropy core. The results of gamer-2 (upper panels) and flash (lower panels) are verified to be very consistent with each other.

Figure 9. Slices of gas temperature through the cluster centre at four different epochs in the galaxy cluster merger simulations. Each panel is 8 Mpc on a side. The results obtained by gamer-2 (upper panels) and flash (lower panels) are found to be in very good agreement with each other. See Fig. 10 for more quantitative comparisons of the radial profiles.

Fig. 10 shows the radial profiles of the electron number density n_e, gas entropy S, gas temperature T, and DM mass density ρ_DM at t = 10 Gyr, where the gas is assumed to be fully ionized and the gas specific entropy is defined as S ≡ k_B T n_e^(−2/3), with k_B the Boltzmann constant. A constant entropy core can be clearly identified within ∼300 kpc. Most strikingly, the results obtained by gamer-2 and flash are found to be virtually indistinguishable.
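As a concrete illustration of the entropy definition above, the short Python sketch below computes S = k_B T n_e^(−2/3) per cell and averages it in radial shells. The unweighted shell average and the toy input values are simplifications for illustration; the exact weighting used for the profiles in Fig. 10 is not specified here.

    import numpy as np

    k_B = 1.380649e-16          # Boltzmann constant in erg/K

    def entropy_profile(r, n_e, T, bin_edges):
        # S = k_B * T * n_e^(-2/3) per cell, averaged in radial bins.
        # n_e in cm^-3 and T in K give S in erg cm^2.
        S = k_B * T * n_e ** (-2.0 / 3.0)
        idx = np.digitize(r, bin_edges)
        return np.array([S[idx == i].mean() for i in range(1, len(bin_edges))])

    # Toy data mimicking a constant-entropy core: flat n_e and T give a flat profile.
    r   = np.linspace(10.0, 299.0, 1000)          # kpc
    n_e = 1e-2 * np.ones_like(r)                  # cm^-3
    T   = 1e8 * np.ones_like(r)                   # K
    print(entropy_profile(r, n_e, T, np.linspace(0.0, 300.0, 7)))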
This agreement demonstrates the consistent numerical setup we adopt for this code comparison experiment, including, for example, the initial condition, boundary conditions, spatial and temporal resolution, AMR implementation, and grid refinement criteria. It also indicates that the different numerical schemes adopted by the two codes, described in Section 4.3.1, for example, the time integration, fluid and Poisson solvers, and floating-point accuracy, do not have a significant impact here.

Figure 10. Radial profiles at t = 10 Gyr in the galaxy cluster merger simulations. Different panels show the electron number density (upper left), gas entropy (upper right), gas temperature (lower left), and DM mass density (lower right). A remarkable agreement is observed between the results of gamer-2 (solid lines) and flash (dashed lines).

4.3.3 Performance comparison

Given the very consistent physical results between gamer-2 and flash, as shown in Figs 9 and 10, here we compare their strong scaling performance on Blue Waters. In order to have a fair comparison between the codes with and without GPU acceleration, we run gamer-2 on the XK nodes and flash on the XE nodes: each XK node is composed of one GPU (NVIDIA Tesla K20X) and one 16-core CPU (AMD Opteron 6276), and each XE node is composed of two 16-core CPUs. In addition, since there are two NUMA domains per XK node, each of which shares an 8 MB L3 cache, for gamer-2 we launch two MPI processes per node and eight OpenMP threads per MPI process in order to improve memory affinity by having all threads run in the same NUMA domain. The two MPI processes running on the same node share the same GPU by taking advantage of the CUDA Multi-Process Service (MPS). For flash, we launch 32 MPI processes per XE node.

Fig. 11 shows the strong scaling of the cluster merger simulations. We first conduct the lower resolution tests (Δh ∼ 7.0 kpc, the same as that adopted in Section 4.3.2) for both gamer-2 and flash using up to 256 nodes. Note that the minimum number of nodes adopted in flash is 16 instead of 1 due to its much larger memory consumption, which will be discussed at the end of this section. We measure the performance over t = 4–6 Gyr, during which there are ∼2.1 × 10⁸ cells in total and ∼7.1 × 10⁷ cells on the maximum refinement level. When using the same number of nodes, the speedup of gamer-2 over flash is measured to be 78–101 and 64–83 in terms of total wall time and cell updates per second, respectively. For example, for Nnode = 16, gamer-2 achieves 8.3 × 10⁶ cell updates per second per XK node, and flash achieves 1.0 × 10⁵ cell updates per second per XE node (corresponding to 3.1 × 10³ cell updates per second per CPU core). Moreover, this speedup ratio only drops by ∼23 per cent when increasing Nnode from 16 to 128, and the performance of both codes starts to decline when Nnode > 128. This suggests that gamer-2 and flash exhibit similar parallel scalability, despite the fact that the computational time of gamer-2 has been greatly reduced with GPUs.

Figure 11. Strong scaling of the galaxy cluster merger simulations run with gamer-2 and flash on Blue Waters. gamer-2 runs on the XK nodes composed of one GPU (NVIDIA Tesla K20X) and one 16-core CPU (AMD Opteron 6276) per node, while flash runs on the XE nodes composed of two 16-core CPUs per node. The upper panel shows the total number of cell updates per second for (i) the lower resolution tests (Δh ∼ 7.0 kpc) of gamer-2 (solid line) and flash (dashed line) using up to 256 nodes, and (ii) the higher resolution test of gamer-2 (Δh ∼ 0.87 kpc, dotted line) using 16–2048 nodes. Note that the minimum number of nodes adopted in flash is 16 instead of 1 due to its much larger memory consumption (see the text for details). The dashed–dotted line represents the ideal scaling. The lower panel shows the speedup of gamer-2 over flash in the lower resolution tests in terms of either cell updates per second (solid line) or total wall time (dashed line). Both cases reveal nearly two orders of magnitude speedup. See Fig. 12 for the detailed performance metrics of this test. Note that gamer-2 uses single precision but flash uses double-precision arithmetic.

The different speedups measured from the total wall time and from the cell updates per second are due to several factors. The cell update rate depends mostly on the performance of the individual PDE solvers, which itself is related to the CPU/GPU performance, the adopted floating-point accuracy, and the numerical schemes adopted in the PDE solvers. In comparison, the speedup in terms of total wall time is arguably more comprehensive, since it takes into account not only the performance of the PDE solvers but also many other factors, such as the time integration scheme, the evolution time-step, and the number of cells on each level. See Section 4.3.1, especially Table 1, for a summary of the differences in numerical setup between gamer-2 and flash. In short, flash uses a more efficient fluid solver and a more accurate but also more computationally expensive Poisson solver. The CFL safety factor adopted for flash (CCFL = 0.6) is larger than that of gamer-2 (CCFL = 0.5), but the average time-steps on the maximum level are found to be very similar due to the same time-step criterion for updating particles.
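The sketch below illustrates how these two time-step criteria interact, assuming the common forms Δt_CFL = C_CFL Δh / max(|v| + c_s) and Δt_par = C_par Δh / max|v_par|. The exact expressions used in gamer-2 and flash may include additional terms (e.g. particle acceleration), so this is only a schematic, hedged illustration with arbitrary units.

    import numpy as np

    def level_time_step(dh, v_gas, c_s, v_par, c_cfl=0.5, c_par=0.8):
        # dh    : cell width on this refinement level
        # v_gas : per-cell |velocity|; c_s : per-cell sound speed
        # v_par : per-particle |velocity| on this level
        dt_cfl = c_cfl * dh / np.max(np.abs(v_gas) + c_s)     # hydro (CFL) criterion
        dt_par = c_par * dh / np.max(np.abs(v_par))           # particle criterion
        return min(dt_cfl, dt_par)

    # With gas at a few hundred (plus a sound speed of ~500) and particles at ~2000
    # in the same velocity units, the particle criterion sets the smaller step.
    print(level_time_step(7.0,
                          np.array([200.0, 300.0]),
                          np.array([400.0, 500.0]),
                          np.array([1500.0, 2000.0])))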
More precisely, the minimum time-steps in both codes are set by the fastest-moving particles on the maximum refinement level, which always move faster than the characteristic hydrodynamic speed that sets the hydrodynamic time-step. gamer-2 is found to allocate ∼30–50 per cent more cells on lower levels and ∼10–20 per cent more cells on higher levels than flash (e.g. see Table 2), mainly because gamer-2 tends to pre-allocate patches earlier than flash due to the implementation of large flag buffers (see Section 2.1). This overallocation in gamer-2 is relatively more serious on lower levels, since there are fewer patches on these levels. However, it is not a serious problem since gamer-2 adopts an adaptive time-step integration that allows patches on lower levels to have larger time-steps. It is also the main reason why the speedup in terms of total wall time is ∼20 per cent higher than that in terms of cell updates per second. Last but not least, we remind the reader that we adopt single precision for gamer-2 and double precision for flash in this test. In principle, using single precision for flash could improve its performance; however, single-precision calculations are not currently supported by flash.

To test the scalability of gamer-2 further, we also conduct higher resolution runs (Δh ∼ 0.87 kpc, eight times higher than the lower resolution counterpart) using 16–2048 XK nodes. We measure the performance from t = 4.0 to 4.5 Gyr, during which there are ∼8.8 × 10⁹ cells in total and ∼7.1 × 10⁹ cells on the maximum refinement level. The total number of particles is fixed to 10⁷, the same as in the lower resolution test. The total memory consumption for Nnode = 2048 at t = 4.5 Gyr is ∼2.2 TB. Fig. 11 (upper panel, dotted line) shows the strong scaling of this higher resolution test, which exhibits much better scaling than its lower resolution counterpart. It demonstrates that gamer-2 can scale well to thousands of GPUs in a realistic astrophysical application. It also indicates that the parallel scalability may be sensitive to the load imbalance resulting from particles, which will be investigated shortly in this section.

Before giving a more detailed analysis of strong scaling, we first introduce two quantities useful for quantifying the parallel scalability. The ‘parallel efficiency’ of strong scaling is defined as

\begin{eqnarray*} P_{\rm {strong}}(N_{\rm node}) = \frac{T(N_{\rm node,ref}) N_{\rm node,ref}}{T(N_{\rm node}) N_{\rm node}}, \end{eqnarray*} (6)

where T(Nnode) is the simulation wall time using Nnode nodes. Generally speaking, the parallel efficiency also depends on the reference performance T(Nnode,ref). For a proper comparison, we adopt Nnode,ref = 16 for the lower resolution tests of both codes, even though the minimum Nnode in the gamer-2 runs is 1 instead of 16. For the higher resolution test, we also adopt Nnode,ref = 16. We also introduce another quantity to quantify the scalability, namely, the ‘doubling efficiency’, defined as

\begin{eqnarray*} D_{\rm {strong}}(N_{\rm node}) = \frac{T(N_{\rm node}/2)}{T(N_{\rm node})}-1. \end{eqnarray*} (7)

This quantity corresponds to the performance gain when doubling the computational resources from Nnode/2 to Nnode, which is arguably more intuitive than the conventional parallel efficiency and has the advantage of being independent of the minimum Nnode adopted.
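Both metrics follow directly from measured wall times; the following Python sketch implements equations (6) and (7), using made-up timings purely for illustration.

    def parallel_efficiency(wall_time, n_node, n_ref=16):
        # Equation (6): P_strong = T(N_ref) * N_ref / (T(N) * N)
        return wall_time[n_ref] * n_ref / (wall_time[n_node] * n_node)

    def doubling_efficiency(wall_time, n_node):
        # Equation (7): D_strong = T(N/2) / T(N) - 1
        return wall_time[n_node // 2] / wall_time[n_node] - 1.0

    # Made-up timings (seconds); ideal scaling would give P = 1 and D = 1.
    wall_time = {16: 1000.0, 32: 520.0, 64: 280.0}
    print(parallel_efficiency(wall_time, 64))   # ~0.89
    print(doubling_efficiency(wall_time, 64))   # ~0.86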
For example, one may have Dstrong(Nnode) = 0.6 for Nnode = 2–2048, which suggests reasonable scalability since a performance speedup of 1.6 is always obtained when doubling the number of nodes. However, the corresponding parallel efficiency is as low as Pstrong(2048) = (1.6/2)^11 ∼ 9 per cent, which could be misleading.

Fig. 12 shows the performance metrics of the cluster merger simulations, including the total wall time, maximum CPU memory consumption per MPI process, parallel efficiency, and doubling efficiency as a function of the number of XK and XE nodes for gamer-2 and flash, respectively. Most importantly, as demonstrated by both the parallel and doubling efficiencies, the two codes exhibit similar parallel scalability, especially for Nnode > 32. It is consistent with the finding of an almost constant speedup of gamer-2 over flash for Nnode > 32, as shown in the lower panel of Fig. 11.

Figure 12. Performance metrics of the strong scaling of the cluster merger simulations. This is complementary to and uses the same symbols as Fig. 11. Different panels show the total wall time (upper left), maximum CPU memory consumption per MPI process (upper right), parallel efficiency (lower left), and doubling efficiency (lower right). See equations (6) and (7) for the definitions of parallel and doubling efficiencies in strong scaling. Note that the minimum number of nodes adopted in flash is 16 instead of 1 due to its much larger memory consumption. Therefore, for a proper comparison, we adopt Nnode,ref = 16 in equation (6) for both codes when calculating the parallel efficiency. gamer-2 and flash exhibit similar parallel scalability, even though gamer-2 is about two orders of magnitude faster. We do not show the maximum per-process memory consumption of flash because it is mainly determined by the size of the pre-allocated memory pool set manually.

In addition, the higher resolution test of gamer-2 shows better scaling than its lower resolution counterpart. The scalability is reasonably good over a large range of Nnode from 16 to 1024. The parallel efficiency is measured to be 78 per cent for Nnode = 128 and 37 per cent for Nnode = 1024, and the doubling efficiency is measured to be 87 per cent for Nnode = 128 and 35 per cent for Nnode = 1024. Also, note that the CPU memory consumption per MPI process in gamer-2 deviates from the ideal scaling, especially when increasing the number of nodes.
This is most likely due to the allocation of buffer patches and MPI buffers.

To determine the performance bottleneck in gamer-2, especially for large Nnode, we show in Fig. 13 the wall time of various operations in the lower resolution runs using 8 and 64 nodes. It is found that the grid PDE solvers, namely the hydrodynamic and Poisson solvers, are the major performance bottlenecks for Nnode = 8, while for Nnode = 64 the bottlenecks shift to the particle routines and MPI communication. From Nnode = 8 to 64, the fractions of time spent on the hydrodynamic and Poisson solvers decrease from 21 per cent and 30 per cent to 8 per cent and 16 per cent, respectively; in comparison, the fractions of time spent on the particle routines and MPI communication increase from 17 per cent and 14 per cent to 24 per cent and 32 per cent, respectively.

Figure 13. Wall time of various operations measured from the lower resolution cluster merger simulations with gamer-2 using 8 and 64 nodes. For each operation, we also show the fraction of time in the two runs and the speedup gained from increasing Nnode = 8 to 64. The ‘MPI’ time includes transferring both grids and particles for the ‘Hydro’, ‘Poisson’, and ‘Particle’ routines but excludes the MPI communication during the grid refinement, which is included in the ‘Refine’ time. The ‘Poisson’ time includes depositing particle mass onto grids with the CIC interpolation. It shows that the grid PDE solvers (i.e. the hydrodynamic and Poisson solvers) are the major performance bottlenecks for Nnode = 8, while the particle routines and MPI communication become the bottlenecks for Nnode = 64 as these operations exhibit poor scalability. Note that, for clarity, the performance shown here does not consider particle weights for load balancing, and is therefore different from the optimized performance shown in Figs 11 and 12. See the text for details.

There are several things to note about these results. First of all, the average MPI message size per process for Nnode = 64 is found to be as small as ∼1 MB, which thus suffers from a relatively large latency. Second, for clarity, the performance shown in Fig. 13
is measured from a separate set of simulations that (i) adds an explicit MPI synchronization between the grid and particle routines to allow a proper timing analysis, and (ii) does not consider the particle load-balancing weights when estimating the workload of each patch (i.e. Wpar = 0.0). This partially explains the poor scalability observed in the particle routines and MPI communication, the latter of which also includes transferring particles. For Nnode = 64, it is found that removing that extra MPI synchronization between the grid and particle routines improves the overall performance by ∼37 per cent. Setting Wpar = 2.0 further improves the performance by ∼10 per cent. The performance shown in Figs 11 and 12 incorporates these optimizations. These findings reveal the importance of balancing the workload of both grids and particles simultaneously, as discussed in Section 3.3.

Finally, we compare the CPU memory consumption between gamer-2 and flash. We find that flash consumes about an order of magnitude more memory than gamer-2 when using the same number of nodes, which is why the minimum Nnode adopted for flash is 16 instead of 1 as for gamer-2. This is mainly because flash allocates the ghost zones of all patches permanently (see footnote a in Table 1) but gamer-2 does not (see Section 3.4). For a patch with 8³ interior cells and four ghost zones on each side (the number of ghost zones adopted in flash), the total memory consumption including the ghost zones is eight times larger than that without the ghost zones. In addition, flash uses double-precision arithmetic, which doubles the memory consumption. On the other hand, gamer-2 adopts the adaptive time-step integration, which requires storing all the grid data at two different physical times for the temporal interpolation and thus also roughly doubles the memory consumption. Last but not least, unlike flash, gamer-2 does not pre-allocate a memory pool for all blocks that will be used during the simulation (see Section 3.4).

4.4 AGORA isolated disc galaxy: gamer-2 versus enzo

Simulations of the gas, stars, and DM in an idealized isolated disc galaxy present a unique numerical challenge for astrophysical simulation codes. These simulations combine self-gravity, gas dynamics, particle dynamics with particles spanning a range of masses, radiative cooling, and star formation. In addition, gas temperatures may reach as low as 10 K while the gas moves at hundreds of kilometres per second relative to the simulation box, requiring the use of a dual energy formalism to avoid spurious temperature fluctuations and negative temperatures due to truncation errors. On top of the bulk circular velocity, the gas also exhibits supersonic turbulence driven by gravitational instability (Goldbaum, Krumholz & Forbes 2015). Despite these challenges, isolated disc galaxy simulations are commonly used to understand more complicated zoom-in simulations (Kim et al. 2016) and galaxy merger simulations (Robertson et al. 2006), and as a testbed for physics modules that will be used in more realistic simulations. These simulations also allow ab initio exploration of the dynamics of the interstellar medium (ISM) of a galaxy much like the Milky Way (Goldbaum, Krumholz & Forbes 2016), enabling direct comparison with observations of the ISM of our own Galaxy.
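To see why cold, fast-moving gas requires a dual energy formalism, the following back-of-the-envelope Python estimate compares the specific thermal and kinetic energies for 10 K gas moving at 200 km/s. The mean molecular weight μ = 1.2 is an assumed illustrative value, not taken from the text.

    # Thermal vs kinetic energy for cold gas with a large bulk velocity.
    k_B = 1.380649e-16     # erg/K
    m_H = 1.6726e-24       # g
    mu  = 1.2              # assumed mean molecular weight (illustrative)
    T   = 10.0             # K
    v   = 2.0e7            # 200 km/s in cm/s

    e_thermal = 1.5 * k_B * T / (mu * m_H)   # specific internal energy (erg/g)
    e_kinetic = 0.5 * v * v                  # specific kinetic energy (erg/g)
    print(e_thermal / e_kinetic)             # ~5e-6

    # Recovering e_thermal as E_total - e_kinetic therefore cancels roughly six
    # significant digits, close to the ~7-digit precision of single-precision
    # floats, which is why a separately evolved entropy or internal energy
    # equation is needed.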
In this section, we simulate a Milky Way-sized isolated disc galaxy with star formation using both gamer-2 and enzo, from which we demonstrate that not only are the physical results obtained by the two codes in good agreement, but also gamer-2 outperforms enzo by almost one order of magnitude. Similar to the previous section, we first describe the simulation setup, with particular emphasis on the similarities and differences between the two codes. We then check the consistency of the physical results, and finally compare the strong scaling performance.

4.4.1 Simulation setup

The setup of our isolated disc galaxy simulations closely follows Goldbaum et al. (2015) and the AGORA High-resolution Galaxy Simulations Comparison Project (Kim et al. 2016), so we only provide a short summary here. The initial condition is composed of three components: a DM halo, a galactic disc consisting of both gas and stars, and a stellar bulge. (i) The DM halo follows an NFW profile with a virial mass M200 = 1.1 × 10¹² M⊙, a concentration parameter c = 10, and a spin parameter λ = 0.04. (ii) The disc follows an exponential profile with a total disc mass Md = 4.3 × 10¹⁰ M⊙, a scale length rd = 3.4 kpc, a scale height zd = 0.34 kpc, and a gas mass fraction fd = Md,gas/Md = 0.2. The gas disc has an initial metal mass fraction Zd = Md,metal/Md,gas = 1.3 × 10⁻² and an initial temperature Td = 10⁴ K. The circular velocity is set such that the disc is in centrifugal equilibrium. (iii) The stellar bulge is modelled as a Hernquist profile (Hernquist 1990) with a mass Mb = 4.3 × 10⁹ M⊙. The initial conditions of both DM and stellar particles as well as the gas rotation curve can be downloaded from the AGORA Project. The simulation has a cubic domain with a side length L = 1.31 Mpc and is evolved for 500 Myr.

The grackle library is used to solve the chemistry and radiative processes in this work. We adopt the equilibrium solver in grackle using tabulated cooling and heating rates, which are functions of gas density and temperature. These tables incorporate both primordial and metal cooling as well as a UV background. We also include photoelectric heating from dust and an effective CMB temperature floor. The metal field is treated as a passive scalar advected along with the gas, and thus the metal fraction can vary in space and time. To avoid artificial fragmentation, we follow Goldbaum et al. (2015) and Kim et al. (2016) and employ a pressure floor in the hydrodynamic solver to ensure that the local Jeans length is resolved by at least four cells on the maximum refinement level. The full details of the subgrid model for stochastic star formation can be found in Section 2.4 of Goldbaum et al. (2015). Specifically, we adopt a gas density threshold nH,thres = 20 cm⁻³ and a star formation efficiency f⋆ = 1 per cent. We also impose a minimum star particle mass m⋆ = 2 × 10³ M⊙. No star formation feedback is included in this work.

Table 3 summarizes the similarities and differences of the numerical setup adopted by gamer-2 and enzo in these comparison simulations. Here, we elaborate on the major differences. gamer-2 restricts all patches to have exactly the same size (8³ cells in this work), which has several advantages.
For example, the AMR hierarchy can be manipulated efficiently with an octree data structure. The memory allocation is more predictable, which eases the issue of memory fragmentation and maximizes memory reuse (see Section 3.4). The smaller patch size conceivably leads to more efficient use of both CPU and GPU caches. In addition, it is more straightforward to optimize load balancing. In comparison, enzo allows patches to have different sizes. This reduces the number of cells that are unnecessarily refined compared to gamer-2. The relatively larger patches also reduce both the computation and communication overhead associated with the ghost zones of each patch. However, reconstructing the AMR hierarchy arguably becomes more expensive in this approach due to the more complicated grid structure, which might degrade the parallel scalability, especially in massively parallel simulations. For the Poisson solver, gamer-2 uses the SOR scheme, which is suitable for smaller grids (see Section 3.1), while enzo adopts the multigrid scheme, which is suitable for larger grids due to its higher convergence rate. In addition, gamer-2 has a fixed patch size of 8³ cells and adds five additional buffer zones around each patch to make the potential smoother across the patch boundaries (see Fig. 2). In comparison, enzo has grid patches generally larger than 8³ cells and also allocates slightly larger buffer zones of six cells around each patch. Moreover, it applies an iterative procedure to exchange the potential between sibling grids to further improve the accuracy, which has not been implemented in gamer-2. In this work, we adopt 10 such iterations in the enzo simulations, which is measured to increase the simulation time by ∼10 per cent.

Table 3. Comparison of the numerical setup between gamer-2 and enzo in the isolated disc galaxy simulations.
AMR implementation — gamer-2: fixed patch size of 8³ cells, no permanent allocation of patch ghost zones | enzo: patch size is not fixed, patch ghost zones are allocated permanently
Grid resolution — gamer-2: root grid 64³, 10 refinement levels, maximum resolution Δh = 20 pc | enzo: same as gamer-2
Particle resolution — gamer-2: halo: Np = 1 × 10⁷ and mp = 1.3 × 10⁵ M⊙; stellar disc: Np = 1 × 10⁷ and mp = 3.4 × 10³ M⊙; bulge: Np = 1.25 × 10⁶ and mp = 3.4 × 10³ M⊙ | enzo: same as gamer-2
Fluid solver — gamer-2: dimensionally unsplit MUSCL-Hancock scheme with PPM reconstruction (requiring three Riemann solvers per cell), HLLC solver, hybrid van Leer and generalized minmod slope limiter, dual energy formalism solving the entropy equation | enzo: dimensionally split direct Eulerian approach with PPM reconstruction (requiring three Riemann solvers per cell), HLLC solver, hybrid van Leer and generalized minmod slope limiter, dual energy formalism solving the internal energy equation
Poisson solver — gamer-2: SOR | enzo: multigrid
Particle solver — gamer-2: CIC interpolation and KDK particle update | enzo: CIC interpolation and DKD particle update
Boundary conditions — gamer-2: fluid solver: outflow; Poisson solver: isolated | enzo: fluid solver: periodic^a; Poisson solver: isolated
Refinement — gamer-2: (1) maximum particle mass in a cell: 1.0 × 10⁶ M⊙; (2) maximum gas mass in a cell: 3.4 × 10² M⊙; (3) resolving the Jeans length by at least 64 cells; (4) five additional levels of statically refined regions above the root grid, enclosing volumes that are successively smaller by a factor of 8 | enzo: same as gamer-2
Derefinement — gamer-2: no explicit derefinement criteria | enzo: same as gamer-2
Time-step — gamer-2: Cpar = 0.5 and CCFL = 0.5 | enzo: same as gamer-2
Parallelization — gamer-2: hybrid MPI/OpenMP/GPU | enzo: MPI and GPU acceleration^b
Load balancing — gamer-2: Hilbert space-filling curve | enzo: same as gamer-2
Time integration — gamer-2: adaptive time-step^c | enzo: same as gamer-2
Floating-point format — gamer-2: single precision | enzo: same as gamer-2

Notes. (a) The difference in the boundary conditions of the fluid solver is found to have a negligible effect in this work. (b) In enzo, currently only the fluid and MHD solvers have been ported to GPUs. (c) Neither gamer-2 nor enzo restricts the time-step ratio between two adjacent levels to be a constant.

4.4.2 Accuracy comparison

Fig. 14 shows the face-on projection of gas density in the central 30 kpc region at t = 0, 100, 300, and 500 Myr obtained by gamer-2 and enzo. The filamentary structures form quickly due to self-gravity, radiative cooling, and shear flow. These filaments then continuously collapse into gravitationally bound clouds and trigger star formation. We notice that, at later epochs of the simulations, a significant fraction of gas has collapsed and merged into large clouds and formed unrealistically massive star clusters, and there is no prominent spiral structure. These results are inconsistent with the smooth discs and prominent spiral arms observed in disc galaxies, and are mainly due to the lack of star formation feedback in this work, which leads to overcooling of gas (see fig. 2 in Goldbaum et al. 2016). Active galactic nucleus feedback is also expected to strongly change the thermodynamics and kinematics of the multiphase gas (e.g. Gaspari et al. 2018). This is, however, not a concern here since we focus on the comparison between different codes. Fig. 14 shows that the gross morphological features obtained by gamer-2 and enzo agree well with each other. Subtle differences are expected to some extent because of the stochastic star formation and the different numerical implementations (see Table 3). More quantitative comparisons are provided below.

Figure 14. Face-on projection of gas density at four different epochs in the isolated disc galaxy simulations. Each panel is 30 kpc on a side. The simulations with gamer-2 (upper panels) and enzo (lower panels) show very similar filamentary structures. Subtle differences are expected to some extent because of the stochastic star formation and the different numerical implementations (see Table 3). See Figs 15–17 for more quantitative comparisons between the two codes.
At late times, a significant fraction of gas has collapsed and merged into large gravitationally bound clouds and there are no prominent spiral arms, mainly because we do not include star formation feedback in this work (see fig. 2 in Goldbaum et al. 2016). Fig. 15 shows the azimuthally averaged profiles of various gas properties at t = 500 Myr, including the surface density, temperature, rotation velocity, and velocity dispersion. Following Kim et al. (2016), we set the galactic centre to the location of peak gas density within 1 kpc of the centre of gas mass. All profiles exhibit clear oscillations, which become more prominent in time as an increasing fraction of gas collapses into massive clouds. The temperature within ∼12 kpc drops significantly from the initial temperature of 10^4 K to below ∼100 K due to the balance between efficient metal line cooling and UV heating. Substantial turbulent velocity dispersion, primarily driven by gravitational instability, develops quickly and increases toward the galactic centre. Importantly, we find very good agreement between gamer-2 and enzo in all four profiles. A relatively large discrepancy appears in the innermost region, which is somewhat expected since the central region is sensitive to both the determination of the galactic centre and local mergers. Figure 15. Azimuthally averaged profiles at t = 500 Myr in the isolated disc galaxy simulations. Different panels show the gas surface density (upper left), gas temperature (upper right), gas rotation velocity (lower left), and gas velocity dispersion (lower right). The results of gamer-2 (red lines) and enzo (blue lines) are in good agreement with each other. A relatively large discrepancy appears in the innermost region, which is somewhat expected since the central region is sensitive to both the determination of the galactic centre and local mergers.
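As an illustration of how azimuthally averaged profiles like those in Fig. 15 can be constructed from cell data, the sketch below bins cells by cylindrical radius and computes a surface density profile. The synthetic exponential disc and all names and values are ours, purely for demonstration, and are not tied to the actual simulation data.

```cpp
// Sketch: azimuthally averaged gas surface density from cell data.
// The synthetic input disc and all names are illustrative only.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    constexpr double PI     = 3.141592653589793;
    const int        n_bins = 30;
    const double     r_max  = 15.0;              // kpc
    const double     dr     = r_max / n_bins;

    std::vector<double> mass_in_bin(n_bins, 0.0);

    // Generate a toy exponential disc of "cells" (x, y in kpc, mass in Msun).
    std::mt19937 rng(42);
    std::exponential_distribution<double>  rad(1.0 / 3.0);   // 3 kpc scale length
    std::uniform_real_distribution<double> ang(0.0, 2.0 * PI);
    for (int i = 0; i < 200000; ++i) {
        const double r   = rad(rng);
        const double phi = ang(rng);
        const double x   = r * std::cos(phi);
        const double y   = r * std::sin(phi);
        const double m   = 4.0e4;                            // Msun per cell (toy value)
        const int    bin = static_cast<int>(std::hypot(x, y) / dr);
        if (bin >= 0 && bin < n_bins) mass_in_bin[bin] += m;
    }

    // Surface density = mass in annulus / annulus area.
    for (int b = 0; b < n_bins; ++b) {
        const double r_in  = b * dr, r_out = (b + 1) * dr;
        const double area  = PI * (r_out * r_out - r_in * r_in) * 1.0e6;   // pc^2
        std::printf("r = %5.2f kpc  Sigma = %8.2f Msun/pc^2\n",
                    0.5 * (r_in + r_out), mass_in_bin[b] / area);
    }
    return 0;
}
```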
Fig. 16 shows the probability distribution function of gas in the density–temperature plane at t = 500 Myr. A clear branch toward the high-density, low-temperature corner can be easily identified, resulting from the balance between the various cooling and heating mechanisms adopted in this work (see Section 4.4.1). The low-density, high-temperature component in the upper left corner corresponds to the gaseous halo included for reasons of numerical stability (Kim et al. 2016). This figure further validates the consistency of the gas thermodynamic properties between our gamer-2 and enzo simulations, thanks to the common chemistry and radiative library grackle. A relatively large scatter is found in the gamer-2 simulation, which, however, constitutes a negligible fraction of the total gas mass. Figure 16. Probability distribution function of gas in the density–temperature plane of the isolated disc galaxy simulations at t = 500 Myr. The colour bar represents the gas mass in each bin. The results obtained by gamer-2 (left-hand panel) and enzo (right-hand panel) are in very good agreement thanks to the common library grackle adopted for calculating the chemistry and radiative processes. Fig. 17 shows the star formation rate (SFR) as a function of time, for which again gamer-2 and enzo agree very well with each other. The SFR reaches ∼10 M⊙ yr^-1 after t ≳ 100 Myr, consistent with the result of Goldbaum et al. (2015), which also does not include star formation feedback, and is about a factor of 5 higher than the SFR obtained in simulations with feedback (Kim et al. 2016; Goldbaum et al. 2016). Interestingly, the consistency between gamer-2 and enzo shown in this figure seems to be better than that found in the comparison of grid codes in the AGORA comparison project (see fig. 26 in Kim et al. 2016). It remains to be investigated whether this level of consistency could be achieved after including feedback. Figure 17. SFR as a function of time in the isolated disc galaxy simulations. We find very good agreement between gamer-2 (red line) and enzo (blue line). The agreement between the simulation results of gamer-2 and enzo, as verified in Figs 14–17, demonstrates the consistent numerical setup adopted for this code comparison experiment, including, for example, the initial condition, spatial and temporal resolution, and grid refinement criteria. It also suggests that the differences between the two codes described in Section 4.4.1, for example, the AMR implementation, fluid and Poisson solvers, and particle integration, do not have a serious impact here. These facts strengthen the results of the performance comparison shown in the next section.
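The density–temperature distribution of Fig. 16 is essentially a mass-weighted two-dimensional histogram of the cell data. A minimal sketch of such a binning is given below; the synthetic cells, bin ranges, and names are illustrative only and are not taken from the simulations compared above.

```cpp
// Sketch: mass-weighted 2D histogram in the log(density)-log(temperature) plane,
// analogous to the phase diagram of Fig. 16. Input cells are synthetic.
#include <cmath>
#include <cstdio>
#include <vector>

struct Cell { double dens; double temp; double mass; };   // g/cm^3, K, Msun

int main()
{
    // A few synthetic cells standing in for simulation data.
    std::vector<Cell> cells = {
        {1.0e-25, 1.0e6, 2.0e3},   // hot, diffuse halo gas
        {1.0e-23, 1.0e4, 5.0e3},   // warm disc gas
        {1.0e-21, 5.0e1, 8.0e3},   // cold, dense star-forming gas
    };

    const int    nx = 40, ny = 40;
    const double lx_min = -28.0, lx_max = -18.0;   // log10(density [g/cm^3])
    const double ly_min =   1.0, ly_max =   8.0;   // log10(T [K])
    std::vector<double> hist(nx * ny, 0.0);

    for (const Cell& c : cells) {
        const int ix = static_cast<int>((std::log10(c.dens) - lx_min) / (lx_max - lx_min) * nx);
        const int iy = static_cast<int>((std::log10(c.temp) - ly_min) / (ly_max - ly_min) * ny);
        if (ix >= 0 && ix < nx && iy >= 0 && iy < ny)
            hist[iy * nx + ix] += c.mass;   // accumulate gas mass, not cell counts
    }

    // Report the non-empty bins.
    for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix)
            if (hist[iy * nx + ix] > 0.0)
                std::printf("bin (%d, %d): %.1f Msun\n", ix, iy, hist[iy * nx + ix]);
    return 0;
}
```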
4.4.3 Performance comparison Here, we compare the strong scaling performance of gamer-2 and enzo measured on Blue Waters. Since enzo also supports GPU acceleration, although only for the hydrodynamic and MHD solvers, we run both codes on the XK nodes, each of which is composed of one GPU (NVIDIA Tesla K20X) and one 16-core CPU (AMD Opteron 6276). Similar to the cluster merger simulations described in Section 4.3.3, for gamer-2 we launch two MPI processes per node and seven OpenMP threads per MPI process to improve memory affinity, and use CUDA MPS to allow these two processes to share the same GPU. For enzo, we launch 16 MPI processes per node since it does not support hybrid MPI/OpenMP parallelization. In addition, for Nnode = 1–4, we disable SubgridSizeAutoAdjust10 and set MaximumSubgridSize = 8192 to avoid exhausting the GPU memory with overly large grids. Changing MaximumSubgridSize by a factor of 2 reduces the performance of the Nnode = 4 run by ∼10–20 per cent. For Nnode = 8–128, we enable SubgridSizeAutoAdjust and set OptimalSubgridsPerProcessor = 16, which are found to generally achieve the best performance in our tests. The Hilbert space-filling curve is adopted for load balancing in both gamer-2 and enzo.11 We measure the performance in a relatively short period of t = 300–305 Myr, which is representative enough since both the total number of cells and the evolution time-step are found to be quite stable after t ≳ 100 Myr. At t = 300 Myr, there are ∼2.2 × 10^8 cells in total, ∼20 per cent of which are on the maximum refinement level. Fig. 18 shows the strong scaling of the isolated disc galaxy simulations. The speedup of gamer-2 over enzo is measured to be about 4–8 in terms of both total wall time and cell updates per second. For example, for Nnode = 64, gamer-2 and enzo achieve 1.5 × 10^6 and 3.2 × 10^5 cell updates per second per node, respectively. This result is encouraging, especially because both codes take advantage of GPU acceleration. More importantly, this speedup ratio is approximately constant for Nnode ≤ 32 and increases for Nnode > 32. The overall performance of enzo starts to drop for Nnode > 64, while that of gamer-2 starts to drop for Nnode > 128. These results suggest that gamer-2 not only runs faster but also scales better than enzo. Figure 18. Strong scaling of the isolated disc galaxy simulations run with gamer-2 (solid line) and enzo (dashed line) using 1–128 nodes. Both codes support GPU acceleration and run on the Blue Waters XK nodes, each of which is composed of one GPU (NVIDIA Tesla K20X) and one 16-core CPU (AMD Opteron 6276). The upper panel shows the total number of cell updates per second, where the dashed–dotted line represents the ideal scaling. The lower panel shows the speedup of gamer-2 over enzo in terms of both cell updates per second (solid line) and total wall time (dashed line). gamer-2 is measured to be ∼4–8 times faster than enzo. More importantly, this speedup ratio is approximately constant for Nnode ≤ 32 and increases for Nnode > 32, suggesting that gamer-2 exhibits better parallel scalability than enzo. See Fig. 19 for the detailed performance metrics of this test.
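As a quick sanity check of how the quoted per-node throughputs translate into the speedup ratio, the short sketch below reproduces the arithmetic for the Nnode = 64 data point. The input numbers are taken from the text above; the variable names are ours.

```cpp
// Sketch: converting the per-node throughputs quoted for Nnode = 64 into an
// aggregate throughput and a speedup ratio. Input values are from the text above.
#include <cstdio>

int main()
{
    const double n_node         = 64.0;
    const double gamer_per_node = 1.5e6;   // cell updates per second per node (gamer-2)
    const double enzo_per_node  = 3.2e5;   // cell updates per second per node (enzo)

    const double gamer_total = gamer_per_node * n_node;   // ~9.6e7 cells/s
    const double enzo_total  = enzo_per_node  * n_node;   // ~2.0e7 cells/s
    const double speedup     = gamer_per_node / enzo_per_node;

    std::printf("gamer-2 total: %.2e cells/s\n", gamer_total);
    std::printf("enzo    total: %.2e cells/s\n", enzo_total);
    std::printf("speedup (cell updates per second): %.1f\n", speedup);   // ~4.7
    return 0;
}
```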
In Fig. 18, we show the performance speedups in terms of both total wall time and cell updates per second. The former arguably provides a more comprehensive comparison because it considers not only the performance of all PDE solvers but also many other factors such as the AMR implementation and the evolution time-step. In this test, the speedup in terms of total wall time is measured to be ∼5–20 per cent higher than that in terms of cell updates per second, partially because gamer-2 generally requires fewer extra updates for synchronizing nearby AMR levels (see Section 2.1, especially Fig. 1). On the other hand, although gamer-2 and enzo allocate roughly the same number of cells on the maximum refinement level l = 10 (the difference is less than 3 per cent), we find that gamer-2 typically allocates ∼50–150 per cent more cells than enzo on the other high levels (e.g. l = 7–9, see Table 4). This is mainly because gamer-2 restricts all patches to have the same size, which results in over-refinement, especially along the direction perpendicular to the galactic disc and on levels whose cell sizes are much larger than the disc scale height. This issue, however, does not pose a serious problem here since lower levels are updated much less frequently thanks to the adaptive time-step integration. Table 4. Comparison of the volume-filling fractions on higher refinement levels between gamer-2 and enzo in the isolated disc galaxy simulations at t = 300 Myr.
Level 7 (Δh = 160.0 pc): filling fraction 2.4 × 10^-3 per cent (gamer-2) versus 1.0 × 10^-3 per cent (enzo); number of cells 1.3 × 10^7 versus 5.7 × 10^6.
Level 8 (Δh = 80.0 pc): filling fraction 4.3 × 10^-4 per cent versus 2.0 × 10^-4 per cent; number of cells 1.9 × 10^7 versus 8.8 × 10^6.
Level 9 (Δh = 40.0 pc): filling fraction 1.2 × 10^-4 per cent versus 6.3 × 10^-5 per cent; number of cells 4.2 × 10^7 versus 2.2 × 10^7.
Level 10 (Δh = 20.0 pc): filling fraction 1.5 × 10^-5 per cent versus 1.5 × 10^-5 per cent; number of cells 4.2 × 10^7 versus 4.1 × 10^7.
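The cell counts and filling fractions in Table 4 can be related through the level spacing and the box volume. In the sketch below the box size is inferred from Table 3 (a 64^3 root grid with Δh = 20 pc at level 10, i.e. L = 64 × 20 pc × 2^10 ≈ 1.31 Mpc); this inferred size should be treated as an assumption rather than a quoted value. Under that assumption the computed values reproduce the gamer-2 column of Table 4.

```cpp
// Sketch: relating the gamer-2 cell counts in Table 4 to volume-filling fractions.
// The box size is inferred from Table 3 and is an assumption, not a quoted value.
#include <cmath>
#include <cstdio>

int main()
{
    const double dh_max   = 20.0;                                 // pc, level-10 cell size
    const double box_size = 64.0 * dh_max * std::pow(2.0, 10);    // ~1.31e6 pc
    const double box_vol  = box_size * box_size * box_size;

    const int    levels[]  = {7, 8, 9, 10};
    const double n_cells[] = {1.3e7, 1.9e7, 4.2e7, 4.2e7};        // gamer-2 counts (Table 4)

    for (int i = 0; i < 4; ++i) {
        const double dh       = dh_max * std::pow(2.0, 10 - levels[i]);   // pc
        const double fraction = n_cells[i] * dh * dh * dh / box_vol;
        std::printf("level %2d: filling fraction = %.1e per cent\n",
                    levels[i], 100.0 * fraction);
    }
    return 0;
}
```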
Fig. 19 shows the performance metrics of the isolated disc galaxy simulations, including the total wall time, maximum CPU memory consumption per MPI process, parallel efficiency, and doubling efficiency as a function of the number of XK nodes. Most importantly, both the parallel and doubling efficiencies demonstrate that the two codes exhibit very similar parallel scalability for Nnode ≤ 32 and, furthermore, that gamer-2 scales noticeably better than enzo for Nnode > 32, consistent with Fig. 18. Figure 19. Performance metrics of the strong scaling of the isolated disc galaxy simulations. This figure is complementary to Fig. 18 and uses the same symbols. Different panels show the total wall time (upper left), maximum CPU memory consumption per MPI process (upper right), parallel efficiency (lower left), and doubling efficiency (lower right). See equations (6) and (7) for the definitions of parallel and doubling efficiencies in strong scaling. gamer-2 and enzo exhibit very similar parallel scalability for Nnode ≤ 32, and gamer-2 scales noticeably better than enzo for Nnode > 32. The maximum per-process memory consumption of enzo is not shown since the data are not available. We notice that the doubling efficiency of enzo oscillates for Nnode = 4–16, likely because, as described at the beginning of this section, we enable SubgridSizeAutoAdjust for Nnode > 4 to improve performance and scalability. Also note that the CPU memory consumption per MPI process in gamer-2 deviates from the ideal scaling, especially when increasing the number of nodes, most likely due to the allocation of buffer patches and MPI buffers. To determine the performance bottleneck of gamer-2 in the isolated disc galaxy simulations, especially for large Nnode, we compare in Fig. 20 the wall time of various operations in the Nnode = 8 and 64 runs. We find that, for Nnode = 8, the chemistry and cooling library grackle is the major performance bottleneck. In contrast, for Nnode = 64, the bottlenecks shift to the particle routines and MPI communication, similar to what we find in the cluster merger simulations (see Fig. 13). This is mainly because, currently, the load balancing in gamer-2 is better optimized for grid solvers like grackle than for particle-related routines. From Nnode = 8 to 64, the fraction of time spent on grackle decreases from 34 per cent to 13 per cent, while those spent on the particle routines and MPI communication increase from 13 per cent and 15 per cent to 25 per cent and 27 per cent, respectively. Also note that a GPU-supported grackle is currently under development, which would likely lead to larger speedups.
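For readers who want to reproduce metrics like those in Fig. 19 from their own timing data, the sketch below computes strong-scaling parallel and doubling efficiencies from a table of wall times. It assumes the conventional textbook definitions, which may differ in detail from equations (6) and (7) of the paper, and the wall-time numbers are placeholders rather than measurements.

```cpp
// Sketch: strong-scaling parallel and doubling efficiencies from wall times.
// Assumed (conventional) definitions, which may differ from eqs. (6) and (7):
//   parallel efficiency eps(N) = T(N_ref) * N_ref / (T(N) * N)
//   doubling efficiency eta(N) = T(N/2) / (2 * T(N))
// The wall times below are placeholders, not measured values.
#include <cstdio>

int main()
{
    const int    n_node[]    = {1, 2, 4, 8, 16};
    const double wall_time[] = {1000.0, 520.0, 280.0, 155.0, 90.0};   // arbitrary units
    const int    n_runs      = 5;

    for (int i = 0; i < n_runs; ++i) {
        const double eps = (wall_time[0] * n_node[0]) / (wall_time[i] * n_node[i]);
        std::printf("N = %3d  parallel efficiency = %.2f", n_node[i], eps);
        if (i > 0) {
            const double eta = wall_time[i - 1] / (2.0 * wall_time[i]);
            std::printf("  doubling efficiency = %.2f", eta);
        }
        std::printf("\n");
    }
    return 0;
}
```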
Figure 20. Wall time of various operations measured from the isolated disc galaxy simulations with gamer-2 using 8 and 64 nodes. For each operation, we also show the fraction of time in the two runs and the speedup gained from increasing Nnode from 8 to 64. The 'MPI' time includes transferring both grids and particles for the 'Hydro', 'Poisson', and 'Particle' routines but excludes the MPI communication during grid refinement, which is included in the 'Refine' time. The 'Poisson' time includes depositing particle mass onto grids with the CIC interpolation. The figure shows that the chemistry and cooling library grackle is the major performance bottleneck for Nnode = 8, while the particle routines and MPI communication become the bottlenecks for Nnode = 64 because these operations exhibit poor scalability. Note that, for clarity of the timing analysis, the performance shown here does not consider particle weights for load balancing and is therefore different from the optimized performance shown in Figs 18 and 19. See the text for details. Also note that a GPU-supported grackle is currently under development, which would likely lead to larger speedups. There are several things to note about these results, which are similar to the discussions given for the cluster merger simulations. First of all, the average MPI message size per process for Nnode = 64 is found to be as small as ∼1 MB, which is smaller than that for Nnode = 8 by a factor of a few. The Nnode = 64 run thereby suffers from a relatively larger communication latency. Second, for clarity, the performance shown in Fig. 20 is measured from a separate set of simulations that (i) adds an explicit MPI synchronization between grid and particle routines to allow a proper timing analysis, and (ii) disregards particle weights in load balancing (i.e. Wpar = 0.0), both of which deteriorate the scalability. For Nnode = 64, we find that the overall performance is improved by ∼22 per cent after removing that extra MPI synchronization between grid and particle routines, and by another ∼23 per cent after adopting Wpar = 1.0. These optimizations have been incorporated into the strong scaling results shown in Figs 18 and 19. Third, note that the time fraction of the Poisson solver shown in Fig. 20 includes the time for depositing particle mass onto grids, which is why it scales worse than the other two grid solvers (i.e. the hydrodynamic and grackle solvers).
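Since the 'Poisson' timing above includes the cloud-in-cell (CIC) particle deposit, a minimal sketch of a CIC mass assignment is given below for reference. The grid setup and names are illustrative and are not taken from gamer-2.

```cpp
// Sketch: cloud-in-cell (CIC) deposition of particle mass onto a uniform grid,
// the kind of operation included in the 'Poisson' timing above.
// Grid setup and names are illustrative only.
#include <cstdio>
#include <vector>

int main()
{
    const int    N  = 16;          // cells per side
    const double dh = 1.0;         // cell width
    std::vector<double> dens(N * N * N, 0.0);

    // One particle of unit mass placed off-centre inside the box.
    const double px = 7.3, py = 8.6, pz = 5.1, mass = 1.0;

    // Index of the cell whose centre lies just below the particle position.
    const int ix = static_cast<int>(px / dh - 0.5);
    const int iy = static_cast<int>(py / dh - 0.5);
    const int iz = static_cast<int>(pz / dh - 0.5);

    // Linear weights along each direction.
    const double wx1 = px / dh - 0.5 - ix, wx0 = 1.0 - wx1;
    const double wy1 = py / dh - 0.5 - iy, wy0 = 1.0 - wy1;
    const double wz1 = pz / dh - 0.5 - iz, wz0 = 1.0 - wz1;
    const double wx[2] = {wx0, wx1}, wy[2] = {wy0, wy1}, wz[2] = {wz0, wz1};

    // Distribute the particle mass density over the 8 neighbouring cells.
    for (int k = 0; k < 2; ++k)
    for (int j = 0; j < 2; ++j)
    for (int i = 0; i < 2; ++i)
        dens[(iz + k) * N * N + (iy + j) * N + (ix + i)]
            += mass * wx[i] * wy[j] * wz[k] / (dh * dh * dh);

    // The total deposited mass should equal the particle mass.
    double total = 0.0;
    for (double d : dens) total += d * dh * dh * dh;
    std::printf("deposited mass = %.6f\n", total);
    return 0;
}
```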
5 SUMMARY AND FUTURE WORK In this paper, we have presented gamer-2, a significant revision of the GPU-accelerated AMR code gamer-1 (Schive et al. 2010). It includes much richer functionality and incorporates significant improvements in accuracy, stability, performance, and scalability. Table 5 summarizes the major differences between gamer-2 and gamer-1. Table 5. Summary of the major differences between gamer-2 and gamer-1.
Adaptive time-step. gamer-2: fully supported, without requiring the time-step ratio between two adjacent levels to be a constant. gamer-1: only supported in pure hydrodynamic simulations, with the time-step ratio between two adjacent levels fixed to 2. (Section 2.1)
Fluid solvers. gamer-2: both the dimensionally split, Riemann-solver-free scheme (RTVD) and the dimensionally unsplit, Riemann-solver-based schemes (MHM, VL, and CTU) with PLM/PPM data reconstruction and various Riemann solvers. gamer-1: RTVD only. (Section 2.2)
Dual energy formalism. gamer-2: supported. gamer-1: unsupported. (Section 2.2)
MHD. gamer-2: supported using the CTU + CT scheme. gamer-1: unsupported. (Zhang et al. 2018)
Poisson solver. gamer-2: adds five buffer zones around each refinement-level patch. gamer-1: multilevel solver that eliminates the pseudo-mass sheets on the patch interfaces. (Section 2.3)
Gravity in hydro. gamer-2: operator-unsplit approach. gamer-1: operator-split approach. (Section 2.3)
Boundary conditions. gamer-2: fluid: periodic, outflow, reflecting, inflow (i.e. user-defined); gravity: periodic, isolated. gamer-1: fluid: periodic; gravity: periodic. (Sections 2.2 and 2.3)
Particles. gamer-2: supported. gamer-1: unsupported. (Section 2.4)
grackle. gamer-2: supported. gamer-1: unsupported. (Section 2.5)
Bitwise reproducibility. gamer-2: supported. gamer-1: unsupported. (Section 2.7.1)
HDF5 output. gamer-2: supported. gamer-1: unsupported. (Section 2.7.2)
yt data analysis. gamer-2: supported. gamer-1: unsupported. (Section 2.7.2)
Test problem infrastructure. gamer-2: supported. gamer-1: unsupported. (Section 2.7.3)
AMR + GPUs framework. gamer-2: supported (e.g. ψDM simulations). gamer-1: unsupported. (Section 2.7.4)
Hybrid MPI/OpenMP. gamer-2: supported. gamer-1: unsupported. (Section 3.2)
Parallelization. gamer-2: Hilbert curve for load balancing. gamer-1: rectangular domain decomposition. (Section 3.3)
Memory pool. gamer-2: supported. gamer-1: unsupported. (Section 3.4)
To reveal the optimal performance of gamer-2, we first measure the performance of individual GPU solvers and show that both the hydrodynamic and Poisson solvers achieve a single-precision performance of ∼2 × 10^8 cells s^-1 on an NVIDIA Tesla P100 GPU. We also measure the weak scaling performance with and without AMR in a 3D KH instability test on the Blue Waters supercomputer using 1–4096 XK nodes, each of which is composed of one NVIDIA Tesla K20X GPU and one 16-core AMD Opteron 6276 CPU. By taking advantage of the hybrid MPI/OpenMP/GPU parallelization, we are able to fully exploit 4096 GPUs and 65 536 CPU cores simultaneously, and achieve peak performances of 8.3 × 10^10 and 4.6 × 10^10 cells s^-1 and parallel efficiencies of 74 per cent and 58 per cent in the uniform-grid and AMR tests, respectively. Note that the simulation reaches an overall resolution as high as 10 240^3 cells with 4096 nodes in the uniform-grid test. To further provide clear and convincing demonstrations of the accuracy and performance of gamer-2, we directly compare gamer-2 with two widely adopted AMR codes, flash and enzo, based on realistic astrophysical simulations run on Blue Waters. First, we compare gamer-2 with flash in binary cluster merger simulations, which closely follow the numerical setup of ZuHone (2011) and involve hydrodynamics, self-gravity, and DM. We show that the physical results obtained by the two codes are in excellent agreement, and gamer-2 is measured to be 78–101 times faster than flash in strong scaling tests using 1–256 nodes. More importantly, both codes exhibit similar parallel scalability, despite the fact that the computational time of gamer-2 has been greatly reduced by exploiting GPUs. We also measure the strong scaling of gamer-2 from 16 to 2048 nodes using a set of higher resolution simulations, and obtain a parallel efficiency of 78 per cent with 128 nodes and 37 per cent with 1024 nodes. Second, we compare gamer-2 with enzo in isolated disc galaxy simulations, which closely follow the numerical setup of Goldbaum et al. (2015) and the AGORA High-resolution Galaxy Simulations Comparison Project (Kim et al. 2016) but with a spatial resolution of 20 pc. These simulations involve a richer set of physical modules, including hydrodynamics, self-gravity, DM, advection of metals, radiative cooling and heating, and stochastic star formation. Again, we find very good agreement between the physical results obtained by gamer-2 and enzo. To compare the performance, we also enable GPU acceleration in enzo for the hydrodynamic solver. Even so, gamer-2 is still measured to be ∼4–8 times faster than enzo in strong scaling tests using 1–128 nodes.
This may be partially due to the fact that enzo currently does not support asynchronous GPU kernel execution and CPU–GPU communication. Further investigation will be conducted in the future. More importantly, this speedup ratio is approximately a constant of 4–5 with 1–32 nodes and increases to 5–8 when using more than 32 nodes, suggesting that gamer-2 not only runs faster but also scales noticeably better than enzo. gamer-2 supports the following features to improve parallel scalability:
Hybrid OpenMP/MPI parallelization (see Section 3.2). It reduces internode communication and therefore improves the parallel scalability, especially when using a large number of nodes.
Fixed patch size. It greatly simplifies the parallel manipulation of the AMR hierarchy and load balancing, especially in massively parallel simulations. Moreover, we do not require duplicating the entire AMR hierarchy on each MPI process (see Section 3.3).
Level-by-level load balancing with Hilbert space-filling curves (see Section 3.3 and the sketch after this list). In particular, we take into account the particle weights in load balancing (see Fig. 4), and further minimize the MPI synchronization between grid and particle routines.
Memory pools. gamer-2 allocates memory pools for both grid and particle data to alleviate the issue of memory fragmentation and to maximize memory reuse (see Section 3.4). Moreover, the code minimizes the GPU memory requirement by storing all the data in the CPU's main memory and transferring only a small and fixed amount of patch data to GPUs (typically several hundred MB to a few GB per GPU) at a time.
We have identified several performance bottlenecks from the detailed timing analysis conducted in this work (e.g. see Figs 13 and 20), including load imbalance due to particles, the grackle library, MPI communication, and CPU performance when preparing the input data for GPU solvers. To improve performance further, we are currently investigating the following optimizations:
Transferring the deposited particle mass density on grids instead of individual particles when calculating the total mass density on levels other than the maximum level. This will greatly reduce the MPI communication for particles and also improve load balancing.
Porting some of the particle routines to GPUs.
MPI non-blocking communication. It will allow overlapping MPI communication with both CPU and GPU computations.
GPU-accelerated grackle. grackle computes the chemistry and radiative processes on a cell-by-cell basis, which should be very GPU-friendly because no synchronization or data exchange between different cells is required. We have obtained an order of magnitude speedup in preliminary tests.
Optimization of CPU routines. One important optimization in gamer-2 is to allow CPUs and GPUs to work concurrently (see Section 3.2). Accordingly, depending on the CPU and GPU specifications, we find that the performance bottleneck may occur on the CPU side when invoking a GPU kernel (since we still rely on CPUs to prepare the input data for GPUs). It is therefore essential to optimize the CPU performance further by, for example, improving the OpenMP parallel efficiency, porting more operations to GPUs, and optimizing memory access.
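To illustrate the space-filling-curve load balancing referenced in the list above, the sketch below orders patches on one level by a curve key and then splits the ordered list into contiguous chunks of roughly equal weight. gamer-2 uses a Hilbert curve together with per-patch weights that account for particles; for brevity this sketch substitutes a Morton (Z-order) key and synthetic weights, so it should be read as an illustration of the concept rather than the actual gamer-2 algorithm.

```cpp
// Sketch: level-by-level load balancing with a space-filling-curve ordering.
// gamer-2 uses a Hilbert curve; a Morton (Z-order) key is used here as a
// simpler stand-in. Patch coordinates and weights are synthetic.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Interleave the bits of (x, y, z) into a 3D Morton key (10 bits per axis).
uint64_t MortonKey(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t key = 0;
    for (int b = 0; b < 10; ++b) {
        key |= (uint64_t)((x >> b) & 1u) << (3 * b + 0);
        key |= (uint64_t)((y >> b) & 1u) << (3 * b + 1);
        key |= (uint64_t)((z >> b) & 1u) << (3 * b + 2);
    }
    return key;
}

struct Patch { uint32_t x, y, z; double weight; uint64_t key; };

int main()
{
    // Synthetic patches on one AMR level: patch indices and work estimates.
    std::vector<Patch> patches;
    for (uint32_t x = 0; x < 8; ++x)
        for (uint32_t y = 0; y < 8; ++y)
            for (uint32_t z = 0; z < 8; ++z)
                patches.push_back({x, y, z, 1.0 + 0.5 * ((x + y + z) % 3), 0});

    // Order the patches along the space-filling curve.
    for (Patch& p : patches) p.key = MortonKey(p.x, p.y, p.z);
    std::sort(patches.begin(), patches.end(),
              [](const Patch& a, const Patch& b) { return a.key < b.key; });

    // Assign contiguous segments of the curve to MPI ranks so that each rank
    // receives roughly the same total weight.
    const int n_rank = 4;
    double total = 0.0;
    for (const Patch& p : patches) total += p.weight;

    double cum = 0.0;
    std::vector<double> load(n_rank, 0.0);
    for (const Patch& p : patches) {
        const int rank = std::min(n_rank - 1, (int)(cum / total * n_rank));
        load[rank] += p.weight;
        cum += p.weight;
    }
    for (int r = 0; r < n_rank; ++r)
        std::printf("rank %d: load = %.1f (ideal %.1f)\n", r, load[r], total / n_rank);
    return 0;
}
```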
Given the excellent performance reported here, it is then essential to extend the functionality of gamer-2 so that it can be applied to a broader range of applications. The following new features are being or will be investigated in the near future. In addition, since gamer-2 is an open-source code, we also look forward to contributions from the community.
New particle features, including tracer particles, comoving coordinates, and multiple species. We also plan to store 'relative' instead of absolute particle positions, which can be very helpful for simulations demanding an extremely large dynamic range.
Non-Cartesian coordinates.
Non-ideal hydrodynamics and non-ideal MHD.
Radiative transfer.
Parallel I/O.
A testing framework for ensuring the correctness of the code.
Note that gamer-2 can also run in a 'CPU-only' mode using hybrid MPI/OpenMP parallelization, for which we simply replace all GPU solvers by their CPU counterparts parallelized with OpenMP and use the same MPI implementation as in the GPU-accelerated code. Therefore, gamer-2 is also suitable for CPU-only supercomputers, especially those with a larger number of cores per node such as Intel Xeon Phi systems. Hybrid MPI/OpenMP is essential for achieving optimal performance on such systems, and it may require further optimization of, for example, thread affinity, thread load balancing, and OpenMP nested parallelism. Finally, we emphasize that the great performance and scalability of gamer-2 demonstrated here in both the binary cluster merger and isolated disc galaxy simulations allow one to study various astrophysical phenomena requiring resolutions that were not realistically attainable previously. For example, for the cluster merger simulations, we have obtained preliminary results from simulations with sub-kpc resolution, which will enable us to reduce the numerical viscosity significantly and to investigate the properties of the turbulent cascade down to a scale where the effects of a physical viscosity are expected to become relevant. It is also possible to increase the spatial resolution of the isolated disc galaxy simulations to ∼5 pc, which will produce a dynamically evolving ISM, undergoing repeated cycles of collapse, star formation, feedback, rarefaction, and recollapse that have been extremely difficult to fully resolve in a global galactic-scale simulation over galactic dynamical times.
ACKNOWLEDGEMENTS HS would like to thank Edward Seidel, Gabrielle Allen, and Tzihong Chiueh for their great support of this project, and Roland Haas and Brian O'Shea for stimulating discussions. HS would also like to thank Sandor Molnar for helping implement the galaxy cluster merger simulations and Hsiang-Chih Hwang for helping implement particles in gamer-2. HS and MG are grateful to James Stone for insightful discussions. The authors are also grateful to Britton Smith for helping incorporate grackle into gamer-2. Finally, we want to thank the referee, Michael Norman, for a constructive report that helped to improve the paper. This publication is supported in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4561 to Matthew Turk, and is based upon work supported by the National Science Foundation under Grant No. ACI-1535651. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications (NCSA). This work also used computational resources provided by the Innovative Systems Lab (ISL) at NCSA.
MG is supported by NASA through Einstein Postdoctoral Fellowship Award Number PF5-160137 issued by the Chandra X-ray Observatory Center, which is operated by the SAO for and on behalf of NASA under contract NAS8-03060. Support for this work was also provided by Chandra grant GO7-18121X.
Footnotes
1 https://bluewaters.ncsa.illinois.edu
2 The MHD extension has not been made publicly available yet, but will be released soon.
3 https://grackle.readthedocs.io
4 https://support.hdfgroup.org/HDF5
5 http://yt-project.org
6 https://bitbucket.org/data-exp-lab/libyt
7 http://princetonuniversity.github.io/athena
8 We use seven instead of eight OpenMP threads per MPI process since using eight threads somehow binds two threads to the same CPU core.
9 http://goo.gl/8JzbIJ. Note that we use the high-resolution files, while Kim et al. (2016) use the low-resolution files.
10 See https://enzo.readthedocs.io/en/latest/index.html
11 It corresponds to LoadBalancing = 4, which is not officially supported but works well in our tests.
REFERENCES
Almgren A. S. et al., 2010, ApJ, 715, 1221
Almgren A. S., Bell J. B., Lijewski M. J., Lukić Z., Van Andel E., 2013, ApJ, 765, 39
Amdahl G. M., 1967, Proceedings of the Spring Joint Computer Conference (AFIPS '67, Spring). ACM, New York, NY, USA, p. 483
Banerjee N., Sharma P., 2014, MNRAS, 443, 687
Berger M. J., Colella P., 1989, J. Comput. Phys., 82, 64
Berger M. J., Oliger J., 1984, J. Comput. Phys., 53, 484
Brunetti G., Lazarian A., 2007, MNRAS, 378, 245
Bryan G. L., Norman M. L., Stone J. M., Cen R., Ostriker J. P., 1995, Comput. Phys. Commun., 89, 149
Bryan G. L. et al., 2014, ApJS, 211, 19
Colella P., 1990, J. Comput. Phys., 87, 171
Cunningham A. J., Frank A., Varnière P., Mitran S., Jones T. W., 2009, ApJS, 182, 519
De Martino I., Broadhurst T., Tye S.-H. H., Chiueh T., Schive H.-Y., Lazkoz R., 2017, Phys. Rev. Lett., 119, 221103
Eastwood J., Brownrigg D., 1979, J. Comput. Phys., 32, 24
Eckert D., Gaspari M., Vazza F., Gastaldello F., Tramacere A., Zimmer S., Ettori S., Paltani S., 2017, ApJ, 843, L29
Eddington A. S., 1916, MNRAS, 76, 572
Einfeldt B., Munz C., Roe P., Sjögreen B., 1991, J. Comput. Phys., 92, 273
Evans C. R., Hawley J. F., 1988, ApJ, 332, 659
Falle S. A. E. G., 1991, MNRAS, 250, 581
Frigo M., Johnson S. G., 2005, Proc. IEEE, 93, 216
Fryxell B. et al., 2000, ApJS, 131, 273
Gaspari M., Churazov E., 2013, A&A, 559, A78
Gaspari M. et al., 2018, ApJ, 854, 167
Goldbaum N. J., Krumholz M. R., Forbes J. C., 2015, ApJ, 814, 131
Goldbaum N. J., Krumholz M. R., Forbes J. C., 2016, ApJ, 827, 28
Hernquist L., 1990, ApJ, 356, 359
Hockney R., Eastwood J., 1988, Computer Simulation Using Particles. Taylor & Francis, New York
Hopkins P. F., 2015, MNRAS, 450, 53
Huang J., Greengard L., 1999, SIAM J. Sci. Comput., 21, 1551
Jiang Y.-F., Belyaev M., Goodman J., Stone J. M., 2013, New Astron., 19, 48
Jin S., Xin Z., 1995, Commun. Pure Appl. Math., 48, 235
Kestener P., Château F., Teyssier R., 2010, Algorithms and Architectures for Parallel Processing (ICA3PP'10), Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, p. 281
Khatri R., Gaspari M., 2016, MNRAS, 463, 655
Kim J.-h. et al., 2016, ApJ, 833, 202
Kravtsov A. V., Klypin A. A., Khokhlov A. M., 1997, ApJS, 111, 73
Lau E. T., Gaspari M., Nagai D., Coppi P., 2017, ApJ, 849, 54
Lee D., 2013, J. Comput. Phys., 243, 269
Löhner R., Morgan K., Peraire J., Vahdati M., 1987, Int. J. Numer. Methods Fluids, 7, 1093
Lukat G., Banerjee R., 2016, New Astron., 45, 14
Martin D. F., Colella P., 2000, J. Comput. Phys., 163, 271
Mignone A., Zanni C., Tzeferacos P., van Straalen B., Colella P., Bodo G., 2012, ApJS, 198, 7
Miyoshi T., Kusano K., 2005, J. Comput. Phys., 208, 315
Müller E., Steinmetz M., 1995, Comput. Phys. Commun., 89, 45
Nagai D., Vikhlinin A., Kravtsov A. V., 2007, ApJ, 655, 98
Navarro J. F., Frenk C. S., White S. D. M., 1996, ApJ, 462, 563
NVIDIA, 2017, CUDA C Programming Guide, 8.0 edn. NVIDIA Corporation. Available at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 2007, Numerical Recipes: The Art of Scientific Computing, 3rd edn. Cambridge Univ. Press, Cambridge
Ricker P. M., 2008, ApJS, 176, 293
Robertson B., Bullock J. S., Cox T. J., Di Matteo T., Hernquist L., Springel V., Yoshida N., 2006, ApJ, 645, 986
Roe P., 1981, J. Comput. Phys., 43, 357
Ryu D., Ostriker J. P., Kang H., Cen R., 1993, ApJ, 414, 1
Schive H.-Y., Tsai Y.-C., Chiueh T., 2010, ApJS, 186, 457
Schive H.-Y., Zhang U.-H., Chiueh T., 2012, Int. J. High Perform. Comput. Appl., 26, 367
Schive H.-Y., Chiueh T., Broadhurst T., 2014a, Nat. Phys., 10, 496
Schive H.-Y., Liao M.-H., Woo T.-P., Wong S.-K., Chiueh T., Broadhurst T., Hwang W.-Y. P., 2014b, Phys. Rev. Lett., 113, 261302
Shukla H., Schive H.-Y., Woo T.-P., Chiueh T., 2011, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New York, NY, USA, p. 37:1
Smith B. D. et al., 2017, MNRAS, 466, 2217
Springel V., 2005, MNRAS, 364, 1105
Springel V., 2010, MNRAS, 401, 791
Stone J. M., Gardiner T., 2009, New Astron., 14, 139
Stone J. M., Gardiner T. A., Teuben P., Hawley J. F., Simon J. B., 2008, ApJS, 178, 137
Teyssier R., 2002, A&A, 385, 337
Toro E. F., 2009, Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction, 3rd edn. Springer-Verlag, Berlin
Trac H., Pen U.-L., 2003, PASP, 115, 303
Turk M. J., Smith B. D., Oishi J. S., Skory S., Skillman S. W., Abel T., Norman M. L., 2011, ApJS, 192, 9
van Leer B., 1979, J. Comput. Phys., 32, 101
van Leer B., 2006, Commun. Comput. Phys., 1, 192
Wang P., Abel T., Kaehler R., 2010, New Astron., 15, 581
White C. J., Stone J. M., Gammie C. F., 2016, ApJS, 225, 22
Woodward P., Colella P., 1984, J. Comput. Phys., 54, 115
Zhang U.-H., Schive H.-Y., Chiueh T., 2018, ApJS, 236, 50
ZuHone J. A., 2011, ApJ, 728, 54
APPENDIX A: POISSON SOLVER The Poisson solver of gamer-2 on the refined patches is substantially different from that of gamer-1. To smooth out the gravitational potential across patch boundaries, gamer-2 adds several buffer zones around each patch (see Fig. 2), while gamer-1 adopts a multilevel relaxation scheme to reduce the pseudo-mass sheets on the patch boundaries (Schive et al. 2010).
To compare their accuracy, we calculate the potential of a Hernquist profile (Hernquist 1990): \begin{eqnarray*} \rho (r) = \frac{\rho_0}{(r/r_0)(1+r/r_0)^3}, \end{eqnarray*} (A1) where r0 and ρ0 are the characteristic radius and density, respectively. This profile has a finite mass M = 2π r0^3 ρ0 and an analytical potential: \begin{eqnarray*} \phi_{\rm anal}(r) = -\frac{GM}{r + r_0}, \end{eqnarray*} (A2) where G is the gravitational constant. We adopt G = r0 = ρ0 = 1. The computational domain is cubic with a side length L = 100 and a 64^3 root grid. A cell on level l is flagged for refinement if its density exceeds 10^-2 × 4^l, and we enable six refinement levels to well resolve r0 with a maximum resolution of ∼2.4 × 10^-2. Isolated boundary conditions for gravity are adopted. Fig. A1 shows the gravitational potential on a central 5 r0 slice evaluated by gamer-2 using five buffer zones. The left- and right-hand panels show the numerical results ϕnume and the corresponding relative errors, ϕerr ≡ |(ϕnume − ϕanal)/ϕanal|, respectively. The relative errors within r ≲ r0 are found to be of the order of 10^-3–10^-5, although numerical artefacts introduced by the patch interfaces are still present. Figure A1. Central slice of the gravitational potential of a Hernquist profile. The left-hand panel shows the numerical results evaluated by gamer-2 using five buffer zones and six refinement levels. The right-hand panel shows the corresponding relative errors obtained by comparing with the analytical solution, overlaid with the AMR patch outlines. Fig. A2 shows the volume-weighted radial profile of the numerical errors of different schemes. We compare gamer-2 with Nbuf = 1, 3, and 5 buffer zones and gamer-1 with Niter = 50 and 100 V-cycle iterations and sibling relaxation steps (see Schive et al. 2010 for details). It shows that in these cases the numerical accuracy improves with a larger Nbuf or Niter, and that gamer-2 with Nbuf = 5 provides the most accurate solution within r ≲ r0. We also find that Nbuf > 5 does not improve the accuracy further. In addition, the gravitational potential of a patch of 8^3 cells and five ghost zones on each side consumes ∼46 KB of memory per patch in double precision, which can just fit into the small but fast shared memory of modern GPUs. Therefore, we adopt Nbuf = 5 by default. Figure A2. Volume-weighted numerical errors as a function of radius for computing the potential of a Hernquist profile. We compare the errors of different schemes: gamer-2 with 1 (dashed-double-dotted line), 3 (dotted line), and 5 (solid line) buffer zones, and gamer-1 with 50 (dashed–dotted line) and 100 (dashed line) V-cycle iterations and sibling relaxation steps (see Schive et al. 2010 for details). gamer-2 with five buffer zones is found to provide the most accurate solution within r ≲ r0.
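To make equations (A1) and (A2) concrete, the short sketch below numerically integrates the density profile to recover the finite total mass M = 2π ρ0 r0^3 and evaluates the analytical potential used as the reference solution. It is a standalone consistency check in the adopted units G = r0 = ρ0 = 1, not part of the gamer-2 test itself.

```cpp
// Sketch: consistency check of the Hernquist profile used in Appendix A,
// in units G = r0 = rho0 = 1. Standalone, not part of the gamer-2 test code.
#include <cmath>
#include <cstdio>

constexpr double PI = 3.141592653589793;

// Equation (A1): rho(r) = rho0 / [ (r/r0) (1 + r/r0)^3 ]
double Density(double r) { return 1.0 / (r * std::pow(1.0 + r, 3)); }

// Equation (A2): phi_anal(r) = -G M / (r + r0), with M = 2 pi rho0 r0^3
double PotentialAnal(double r) { return -2.0 * PI / (r + 1.0); }

int main()
{
    // Numerically integrate 4 pi r^2 rho(r) dr out to a large radius; the
    // enclosed mass should converge to M = 2 pi (~6.2832) as r_max grows.
    const double r_min = 1.0e-6, r_max = 1.0e4;
    const int    n     = 2000000;
    const double dlnr  = std::log(r_max / r_min) / n;

    double mass = 0.0;
    for (int i = 0; i < n; ++i) {
        const double r = r_min * std::exp((i + 0.5) * dlnr);   // log-spaced midpoint
        mass += 4.0 * PI * r * r * Density(r) * r * dlnr;       // dr = r dlnr
    }
    std::printf("enclosed mass within r = %.0f : %.5f (expect 2*pi = %.5f)\n",
                r_max, mass, 2.0 * PI);

    // Reference potential at a few radii, for comparison with the AMR solution.
    for (double r : {0.1, 1.0, 10.0})
        std::printf("phi_anal(%4.1f) = %.5f\n", r, PotentialAnal(r));
    return 0;
}
```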
© 2018 The Author(s) Published by Oxford University Press on behalf of the Royal Astronomical Society This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - gamer-2: a GPU-accelerated adaptive mesh refinement code – accuracy, performance, and scalability JF - Monthly Notices of the Royal Astronomical Society DO - 10.1093/mnras/sty2586 DA - 2018-12-21 UR - https://www.deepdyve.com/lp/oxford-university-press/gamer-2-a-gpu-accelerated-adaptive-mesh-refinement-code-accuracy-QO9UvY0Vpa SP - 4815 VL - 481 IS - 4 DP - DeepDyve ER -