MAPPER: Managing Application Performance via Parallel Efficiency Regulation*Srikanthan, Sharanyan; Chakraborti, Sayak; Ferro, Princeton; Dwarkadas, Sandhya
doi: 10.1145/3501767pmid: N/A
State-of-the-art systems, whether in servers or desktops, provide ample computational and storage resources to allow multiple simultaneously executing potentially parallel applications. However, performance tends to be unpredictable, being a function of algorithmic design, resource allocation choices, and hardware resource limitations.In this article, we introduce MAPPER, a manager of application performance via parallel efficiency regulation. MAPPER uses a privileged daemon to monitor (using hardware performance counters) and coordinate all participating applications by making two coupled decisions: the degree of parallelism to allow each application to improve system efficiency while guaranteeing quality of service (QoS), and which specific CPU cores to schedule applications on. The QoS metric may be chosen by the application and could be in terms of execution time, throughput, or tail latency, relative to the maximum performance achievable on the machine. We demonstrate that using a normalized parallel efficiency metric allows comparison across and cooperation among applications to guarantee their required QoS. While MAPPER may be used without application or runtime modification, use of a simple interface to communicate application-level knowledge improves MAPPER’s efficacy. Using a QoS guarantee of 85% of the IPC achieved with a fair share of resources on the machine, MAPPER achieves up to 3.3 \( \times \) speedup relative to unmodified Linux and runtime systems, with an average improvement of 17% in our test cases. At the same time, MAPPER violates QoS for only 2% of the applications (compared to 23% for Linux), while placing much tighter bounds on the worst case. MAPPER relieves hardware bottlenecks via task-to-CPU placement and allocates more CPU contexts to applications that exhibit higher parallel efficiency while guaranteeing QoS, resulting in both individual application performance predictability and overall system efficiency.
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body CorrelationWang, Qihan; Peng, Zhen; Ren, Bin; Chen, Jie; Edwards, Robert G.
doi: 10.1145/3506705pmid: N/A
The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics (QCD). This kernel is both computation and memory intensive, involving a series of tensor contractions, and thus usually runs on accelerators like GPUs. Existing optimizations on many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS libraries and others). In contrast, this work discovers a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes many-body correlation’s memory management by exploiting a set of memory allocation and communication redundancy elimination opportunities: first, GPU memory allocation redundancy: the intermediate output frequently occurs as input in the subsequent calculations; second, CPU-GPU communication redundancy: although all tensors are allocated on both CPU and GPU, many of them are used (and reused) on the GPU side only, and thus, many CPU/GPU communications (like that in existing Unified Memory designs) are unnecessary; third, GPU oversubscription: limited GPU memory size causes oversubscription issues, and existing memory management usually results in near-reuse data eviction, thus incurring extra CPU/GPU memory communications.Targeting these memory optimization opportunities, this article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions utilizing a series of new memory reduction designs. These designs involve optimizations for GPU memory allocation, CPU/GPU memory movement, and GPU memory oversubscription, respectively. More specifically, first, MemHC employs duplication-aware management and lazy release of GPU memories to corresponding host managing for better data reusability. Second, it implements data reorganization and on-demand synchronization to eliminate redundant (or unnecessary) data transfer. Third, MemHC exploits an optimized Least Recently Used (LRU) eviction policy called Pre-Protected LRU to reduce evictions and leverage memory hits. Additionally, MemHC is portable for various platforms including NVIDIA GPUs and AMD GPUs. The evaluation demonstrates that MemHC outperforms unified memory management by \( 2.18\times \) to \( 10.73\times \) . The proposed Pre-Protected LRU policy outperforms the original LRU policy by up to \( 1.36\times \) improvement.1
ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing RuntimesChen, Jing; Manivannan, Madhavan; Abduljabbar, Mustafa; Pericàs, Miquel
doi: 10.1145/3510422pmid: N/A
Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is still challenging, particularly when using asymmetric architectures with different types of CPU cores. A common approach for energy savings involves dynamic voltage and frequency scaling (DVFS) wherein throttling is carried out based on factors like task parallelism, stealing relations, and task criticality. This article makes the following observations: (i) leveraging DVFS on a per-task basis is impractical when using fine-grained tasking and in environments with cluster/chip-level DVFS; (ii) task moldability, wherein a single task can execute on multiple threads/cores via work-sharing, can help to reduce energy consumption; and (iii) mismatch between tasks and assigned resources (i.e., core type and number of cores) can detrimentally impact energy consumption. In this article, we propose EneRgy Aware SchedulEr (ERASE), an intra-application task scheduler on top of work stealing runtimes that aims to reduce the total energy consumption of parallel applications. It achieves energy savings by guiding scheduling decisions based on per-task energy consumption predictions of different resource configurations. In addition, ERASE is capable of adapting to both given static frequency settings and externally controlled DVFS. Overall, ERASE achieves up to 31% energy savings and improves performance by 44% on average, compared to the state-of-the-art DVFS-based schedulers.
A Case For Intra-rack Resource Disaggregation in HPCMichelogiannakis, George; Klenk, Benjamin; Cook, Brandon; Teh, Min Yee; Glick, Madeleine; Dennison, Larry; Bergman, Keren; Shalf, John
doi: 10.1145/3514245pmid: N/A
The expected halt of traditional technology scaling is motivating increased heterogeneity in high-performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep-learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability to find all resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.
GiantVM: A Novel Distributed Hypervisor for Resource Aggregation with DSM-aware OptimizationsJia, Xingguo; Zhang, Jin; Yu, Boshi; Qian, Xingyue; Qi, Zhengwei; Guan, Haibing
doi: 10.1145/3505251pmid: N/A
We present GiantVM,1 an open-source distributed hypervisor that provides the many-to-one virtualization to aggregate resources from multiple physical machines. We propose techniques to enable distributed CPU and I/O virtualization and distributed shared memory (DSM) to achieve memory aggregation. GiantVM is implemented based on the state-of-the-art type-II hypervisor QEMU-KVM, and it can currently host conventional OSes such as Linux. (1) We identify the performance bottleneck of GiantVM to be DSM, through a top-down performance analysis. Although GiantVM offers great opportunities for CPU-intensive applications to enjoy the aggregated CPU resources, memory-intensive applications could suffer from cross-node page sharing, which requires frequent DSM involvement and leads to performance collapse. We design the guest-level thread scheduler, DaS (DSM-aware Scheduler), to overcome the bottleneck. When benchmarking with NAS Parallel Benchmarks, the DaS could achieve a performance boost of up to 3.5×, compared to the default Linux kernel scheduler. (2) While evaluating DaS, we observe the advantage of GiantVM as a resource reallocation facility. Thanks to the SSI abstraction of GiantVM, migration could be done by guest-level scheduling. DSM allows standby pages in the migration destination, which need not be transferred through the network. The saved network bandwidth is 68% on average, compared to VM live migration. Resource reallocation with GiantVM increases the overall CPU utilization by 14.3% in a co-location experiment.
Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order CoresKumar, Rakesh; Alipour, Mehdi; Black-Schaffer, David
doi: 10.1145/3506704pmid: N/A
Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex and energy-hungry hardware. This work revisits slice-out-of-order (sOoO) cores as an energy-efficient alternative for MLP exploitation. sOoO cores achieve energy efficiency by constructing and executing slices of MLP-generating instructions out-of-order only with respect to the rest of instructions; the slices and the remaining instructions, by themselves, execute in-order. However, we observe that existing sOoO cores miss significant MLP opportunities due to their dependence-oblivious in-order slice execution, which causes dependent slices to frequently block MLP generation. To boost MLP generation, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them from blocking subsequent independent slices and MLP extraction. The proposed core incurs minimal area and power overheads, yet approaches the MLP benefits of fully OoO cores. Our evaluation shows that Freeway delivers 12% better performance than the state-of-the-art sOoO core and is within 7% of the MLP limits of full OoO execution.
Low-power Near-data Instruction Execution Leveraging Opcode-based Timing AnalysisAthanasios, Tziouvaras; Georgios, Dimitriou; Georgios, Stamoulis
doi: 10.1145/3504005pmid: N/A
Traditional processor architectures utilize an external DRAM for data storage, while they also operate under worst-case timing constraints. Such designs are heavily constrained by the delay costs of the data transfer between the core pipeline and the DRAM, and they are incapable of exploiting the timing variations of their pipeline stages. In this work, we focus on a near-data processing methodology combined with a novel timing analysis technique that enables the adaptive frequency scaling of the core clock and boosts the performance of low-power designs. We propose a near-data processing and better-than-worst-case co-design methodology to efficiently move the instruction execution to the DRAM side and, at the same time, to allow the pipeline to operate at higher clock frequencies compared to the worst-case approach. To this end, we develop a timing analysis technique, which evaluates the timing requirements of individual instructions and we dynamically scale the clock frequency, according to the instructions types that currently occupy the pipeline. We evaluate the proposed methodology on six different RISC-V post-layout implementations using an HMC DRAM to enable the processing-in-memory (PIM) process. Results indicate an average speedup factor of 1.96× with a 1.6× reduction in energy consumption compared to a standard RISC-V PIM baseline implementation.
Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile MemoryYe, Chencheng; Xu, Yuanchao; Shen, Xipeng; Jin, Hai; Liao, Xiaofei; Solihin, Yan
doi: 10.1145/3511706pmid: N/A
This article points out an important threat that application-level Garbage Collection (GC) creates to the use of non-volatile memory (NVM). Data movements incurred by GC may invalidate the pointers to objects on NVM and, hence, harm the reusability of persistent data across executions. The article proposes the concept of movement-oblivious addressing (MOA), and develops and compares three novel solutions to materialize the concept for solving the addressability problem. It evaluates the designs on five benchmarks and a real-world application. The results demonstrate the promise of the proposed solutions, especially hardware-supported Multi-Level GPointer, in addressing the problem in a space- and time-efficient manner.
Memory-Aware Functional IR for Higher-Level Synthesis of AcceleratorsSchlaak, Christof; Juang, Tzung-Han; Dubach, Christophe
doi: 10.1145/3501768pmid: N/A
Specialized accelerators deliver orders of a magnitude of higher performance than general-purpose processors. The ever-changing nature of modern workloads is pushing the adoption of Field Programmable Gate Arrays (FPGAs) as the substrate of choice. However, FPGAs are hard to program directly using Hardware Description Languages (HDLs). Even modern high-level HDLs, e.g., Spatial and Chisel, still require hardware expertise.This article adopts functional programming concepts to provide a hardware-agnostic higher-level programming abstraction. During synthesis, these abstractions are mechanically lowered into a functional Intermediate Representation (IR) that defines a specific hardware design point. This novel IR expresses different forms of parallelism and standard memory features such as asynchronous off-chip memories or synchronous on-chip buffers. Exposing such features at the IR level is essential for achieving high performance.The viability of this approach is demonstrated on two stencil computations and by exploring the optimization space of matrix-matrix multiplication. Starting from a high-level representation for these algorithms, our compiler produces low-level VHSIC Hardware Description Language (VHDL) code automatically. Several design points are evaluated on an Intel Arria 10 FPGA, demonstrating the ability of the IR to exploit different hardware features. This article also shows that the designs produced are competitive with highly tuned OpenCL implementations and outperform hardware-agnostic OpenCL code.
Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained ApplicationsNejat, Mehrzad; Manivannan, Madhavan; Pericàs, Miquel; Stenström, Per
doi: 10.1145/3505559pmid: N/A
Processor resources can be adapted at runtime according to the dynamic behavior of applications to reduce the energy consumption of multicore processors without affecting the Quality-of-Service (QoS). To achieve this, an online resource management scheme is needed to control processor configurations such as cache partitioning, dynamic voltage-frequency scaling, and dynamic adaptation of core resources.Prior State-of-the-art has shown the potential for reducing energy without any performance degradation by coordinating the control of different resources. However, in this article, we show that by allowing short-term variations in processing speed (e.g., instructions per second rate), in a controlled fashion, we can enable substantial improvements in energy savings while maintaining QoS. We keep track of such variations in the form of performance slack. Slack can be generated, at some energy cost, by processing faster than the performance target. On the other hand, it can be utilized to save energy by allowing a temporary relaxation in the performance target. Based on this insight, we present Cooperative Slack Management (CSM). During runtime, CSM finds opportunities to generate slack at low energy cost by estimating the performance and energy for different resource configurations using analytical models. This slack is used later when it enables larger energy savings. CSM performs such trade-offs across multiple applications, which means that the slack collected for one application can be used to reduce the energy consumption of another. This cooperative approach significantly increases the opportunities to reduce system energy compared with independent slack management for each application. For example, we show that CSM can potentially save up to 41% of system energy (on average, 25%) in a scenario in which both prior art and an extended version with local slack management for each core are ineffective.