Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication
Zhao, Gaoyang; Li, Qiuran; Lin, Rongzhen; Wang, Yaohua
doi: 10.1145/3719654 pmid: N/A
Multiplication plays a critical role in SRAM-based Computing-in-Memory (CIM) architectures. However, current SRAM-based CIMs face three major limitations. First, they do not fully exploit bit-level sparsity, resulting in unnecessary overhead in both latency and energy consumption. Second, the generation of numerous zero-dot products is superfluous. Third, the irregular organization of SRAM complicates the implementation. To address these issues, we propose Shift-CIM, a general-purpose approach that fully leverages bit-level sparsity within SRAM-based multiplications. Shift-CIM aligns the multipliers within the SRAM array, accumulating only the required dot products based on the non-zero bits of the multipliers. Shift-CIM achieves a regular SRAM organization by assembling two irregular SRAM arrays in a transposed manner. Our evaluations show that Shift-CIM is highly efficient, operating at a supply voltage of 0.9 V and a frequency of 833 MHz, while incurring only a 4.8% area overhead. Despite these modest requirements, Shift-CIM significantly accelerates multiplication operations, achieving up to a 3.08× performance improvement and a 60% reduction in energy consumption compared to state-of-the-art designs.
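The idea of accumulating only the partial products that correspond to non-zero multiplier bits can be illustrated in software. The following is a minimal sketch, not the Shift-CIM hardware design: the function name `sparse_shift_add_multiply` and the returned accumulation count are hypothetical, and the sketch assumes non-negative integer operands.

```python
from typing import Tuple


def sparse_shift_add_multiply(multiplicand: int, multiplier: int) -> Tuple[int, int]:
    """Shift-and-add multiplication that visits only the set bits of the multiplier.

    Returns (product, number of partial products accumulated).
    Illustrative software analogy only; operands are assumed non-negative.
    """
    if multiplicand < 0 or multiplier < 0:
        raise ValueError("this sketch handles non-negative integers only")

    product = 0
    accumulations = 0
    remaining = multiplier
    while remaining:
        lowest_set_bit = remaining & -remaining    # isolate the lowest non-zero bit
        shift = lowest_set_bit.bit_length() - 1    # position of that bit
        product += multiplicand << shift           # add the aligned partial product
        accumulations += 1
        remaining ^= lowest_set_bit                # clear the bit just handled
    return product, accumulations


if __name__ == "__main__":
    # An 8-bit multiplier with two non-zero bits needs two accumulations, not eight.
    print(sparse_shift_add_multiply(13, 0b10010000))
```

The number of accumulations equals the number of set bits in the multiplier, which is the source of the latency and energy savings the abstract attributes to bit-level sparsity.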
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems
Wang, Yiming; Zhang, Weizhe; Hao, Meng; Kong, Weizhi; Wen, Yuan
doi: 10.1145/3716872 pmid: N/A
Power management and optimization play a significant role in modern computer systems, from battery-powered devices to servers running in data centers. Existing approaches for power capping fail to meet the requirements presented by dynamic workloads, and the situation becomes even more severe, given the divergent energy efficiency of workloads on heterogeneous hardware platforms. Adaptively optimizing energy consumption for dynamic workloads presents a great challenge to heterogeneous systems. To tackle this challenge, we present a machine learning based method to improve system-level power efficiency. We employ multi-agent deep reinforcement learning (MADRL) to automatically explore the relationship between long-term performance and the power budget for workloads of different types on classic CPU-GPU heterogeneous platforms. Our framework equips each device with an agent, enabling decentralized control over its power budget while maintaining centralized coordination to maximize the running time of applications within a power cap. We evaluate our approach against state-of-the-art methods on CPU-GPU platforms. Experimental results show that our method improves performance by an average of 8.5%. Additionally, our method is significantly more stable compared to the state-of-the-art heuristic approach.
SRSparse: Generating Codes for High-Performance Sparse Matrix-Vector Semiring Computations
Du, Zhen; Liu, Ying; Sun, Ninghui; Cui, Huimin; Feng, Xiaobing; Li, Jiajia
doi: 10.1145/3722114 pmid: N/A
Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the sparse matrices. Given the diversity of sparse matrices, designing a tailored program for each matrix is challenging. To address this, we propose SRSparse, a program generator that creates tailored programs by automatically combining program design methods to fit specific input matrices. It provides two components: the problem definition configuration, which declares the computation, and the scheduling language, which can be leveraged by an auto-tuner to specify the program designs. The two are lowered to the intermediate representations of SRSparse, the Format IR and Kernel IR, which respectively generate the format conversion routine and the kernel code. We evaluate SRSparse on four representative sparse kernels and three format conversion routines. For sparse kernels, SRSparse achieves median speedups over handwritten programs: COO (3.50×), CSR-Adaptive (5.36×), CSR5 (2.06×), ELL (1.63×), Gunrock (1.57×), and GraphBLAST (1.96×); over an auto-tuner: AlphaSparse (1.16×); and over a compiler: TACO (1.71×). For format conversion routines, SRSparse achieves median speedups over handwritten implementations: Intel MKL (7.60×), SPARSKIT (2.61×), CUSP (2.77×), and Ginkgo (1.74×); and over a compiler: TACO (4.04×).
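For readers unfamiliar with the operation itself, a sparse matrix-vector product over a semiring replaces the usual multiply and add with a user-chosen pair of operators. The following is a minimal reference sketch, unrelated to the code SRSparse generates: the `Semiring` class, the `csr_spmv` function, and the choice of a plain CSR layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Semiring:
    """A pair of operators plus the identity of the additive one (hypothetical helper)."""
    add: Callable[[float, float], float]
    mul: Callable[[float, float], float]
    zero: float


PLUS_TIMES = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0)
MIN_PLUS = Semiring(add=min, mul=lambda a, b: a + b, zero=float("inf"))


def csr_spmv(row_ptr: List[int], col_idx: List[int], values: List[float],
             x: List[float], semiring: Semiring) -> List[float]:
    """Compute y = A (x) x over the given semiring, with A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = semiring.zero
        # Only the stored (non-zero) entries of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc = semiring.add(acc, semiring.mul(values[k], x[col_idx[k]]))
        y.append(acc)
    return y


if __name__ == "__main__":
    # 3x3 matrix with entries (0,0)=1, (0,2)=2, (1,1)=3, (2,0)=4, (2,2)=5.
    row_ptr, col_idx, values = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5]
    x = [1, 2, 3]
    print(csr_spmv(row_ptr, col_idx, values, x, PLUS_TIMES))  # ordinary SpMV
    print(csr_spmv(row_ptr, col_idx, values, x, MIN_PLUS))    # shortest-path style update
```

Swapping `PLUS_TIMES` for `MIN_PLUS` turns the same loop into one relaxation step of a shortest-path computation, which is why graph frameworks appear among the baselines in the abstract.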
SnsBooster: Enhancing Sampling-based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis
Han, Chenji; Zhang, Zifei; Xue, Feng; Li, Xinyu; Wu, Yuxuan; Zhang, Tingting; Liu, Tianyi; Guo, Qi; Zhang, Fuxin
doi: 10.1145/3727637 pmid: N/A
Sampling-based methods, such as SimPoint, are widely used for efficient pre-silicon μArch evaluations, where the costs are the number of simulation points multiplied by the number of evaluated μArch designs. However, these costs keep growing with an increasing number of simulation points and expanding μArch design space. Although techniques have been developed to accelerate the μArch design space exploration, less attention has been given to further reducing the simulation budget of each μArch evaluation. Common strategies like reducing simulation coverage or sampling fewer simulation points typically compromise estimation accuracy. Therefore, further reducing the simulation budget without compromising estimation accuracy remains a critical research problem. In this work, we propose SnsBooster to enhance sampling-based μArch evaluation efficiency, based on two insights: (a) large portions of simulation points’ performance changes are typically insensitive to the evaluated μArch changes, and (b) simulation points’ performance sensitivities under specific μArch change correlate with their inherent characteristics. By building a μArch-specific performance sensitivity classifier online via progressive simulation and continuous validation, SnsBooster can identify and selectively evaluate only performance-sensitive points, thus reducing the simulation budget without compromising estimation accuracy. When applied across various μArch changes, SnsBooster achieves an average simulation budget reduction of 39.04% with an accuracy loss of only 0.14%, compared to simulating all the sampled points. Under the same accuracy loss, SnsBooster’s simulation budgets are only 64.73% and 65.60% of those required by methods of reducing simulation coverage or sampling fewer points. Besides, under identical simulation budgets, the average accuracy losses of these methods are 1.41% and 1.23%, which are substantially higher than that of SnsBooster.
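The budget saving from evaluating only performance-sensitive points can be sketched as a weighted estimate in which insensitive points reuse a previously measured value. This is a simplified illustration, not SnsBooster's method: the function `estimate_cpi`, the use of CPI as the metric, the reuse of baseline values for insensitive points, and the assumption that the sensitive set is already known (the paper builds a classifier online to find it) are all hypothetical.

```python
from typing import Callable, Dict, Set, Tuple


def estimate_cpi(weights: Dict[str, float],
                 baseline_cpi: Dict[str, float],
                 sensitive_points: Set[str],
                 simulate: Callable[[str], float]) -> Tuple[float, int]:
    """Weighted CPI estimate for a new design, simulating only sensitive points.

    weights          -- per-point weights that sum to 1 (as in sampling-based evaluation)
    baseline_cpi     -- CPI of every point measured on the baseline design
    sensitive_points -- points assumed to react to the design change
    simulate         -- hypothetical callback that simulates one point on the new design
    Returns (estimated CPI, number of points actually simulated).
    """
    estimate = 0.0
    simulated = 0
    for point, weight in weights.items():
        if point in sensitive_points:
            cpi = simulate(point)        # pay the simulation cost only here
            simulated += 1
        else:
            cpi = baseline_cpi[point]    # assumption: reuse the baseline measurement
        estimate += weight * cpi
    return estimate, simulated


if __name__ == "__main__":
    weights = {"p0": 0.5, "p1": 0.3, "p2": 0.2}
    baseline = {"p0": 1.0, "p1": 2.0, "p2": 4.0}
    # Pretend the new design only helps p2, lowering its CPI from 4.0 to 3.0.
    print(estimate_cpi(weights, baseline, {"p2"}, lambda point: 3.0))
```

In this toy case one of three points is simulated, so the per-design cost drops from three simulations to one while the estimate still reflects the change.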
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures
Qin, Shantian; Fan, Zhihua; Li, Wenming; Wang, Zhen; An, Xuejun; Ye, Xiaochun; Fan, Dongrui
doi: 10.1145/3721288 pmid: N/A
Dataflow architectures are considered promising, offering a commendable balance of performance, efficiency, and flexibility. Many prior works have been proposed to improve the performance of dataflow architectures. Nevertheless, these solutions leave room for improvement because they lack efficient data prefetching and flexible task scheduling. In this article, we propose a novel dataflow architecture with adaptive prefetching and decentralized scheduling (PANDA). First, we present an application-adaptive data prefetching method and on-chip memory microarchitecture designed to overlap memory access latency. Second, we introduce a decentralized dataflow scheduling approach and processing element (PE) microarchitecture aimed at improving hardware utilization. Experimental results show that in a wide range of real-world applications, PANDA attains up to 2.53× performance improvement and 1.79× energy efficiency improvement over the state-of-the-art dataflow architectures.
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language
Tang, Yu; Yin, Lujia; Li, Qiao; Zhu, Hongyu; Li, Hengjie; Zhang, Xingcheng; Qiao, Linbo; Li, Dongsheng; Li, Jiaxin
doi: 10.1145/3722113 pmid: N/A
Pipeline parallelism is a crucial technique for large-scale model training, enabling parameter splitting and performance enhancement. However, creating effective pipeline schedules often requires significant manual effort and coding skills, leading to practical inconveniences and complex debugging. Major frameworks such as DeepSpeed and ColossalAI simplify the process by adopting predefined pipeline schedule strategies, such as GPipe and 1F1B. The use of predefined schedules offers limited flexibility and suboptimal training efficiency, as the limited number of manually set candidates cannot provide the optimal strategy for arbitrary model training. To deal with the issue, this article aims to automatically search for the optimal strategy with high efficiency. Since current frameworks only support a limited set of fixed strategies, lacking the technical capability to create a comprehensive strategy search space, we first design a novel domain-specific language (DSL) for pipeline schedule development. The DSL exhibits great understandability, agility, and reusability, supporting the development of all known pipeline schedule strategies and their variants. Second, we are the first to model the complete pipeline schedule strategy space via the DSL, enabling automated, end-to-end, globally optimal pipeline schedule search, while past work may get stuck in a local optimum. Finally, we propose to optimize pipeline performance by modeling and solving the pipeline schedule as a Binary-Tree-Traversing (BTT) optimization problem. Based on the formalization, we further adopt a Dynamic Try-Test Genetic Algorithm to search for the best pipeline schedule strategy, which outperforms a variety of predefined ones. Experimental results show that Koala improves performance by up to 1.53× over state-of-the-art approaches. Besides, the pipeline schedule strategy searched by Koala outperforms predefined pipeline schedule strategies by 1.10× to 1.55×. Moreover, Koala has superior scalability and effectiveness in combining with data parallelism and tensor parallelism.
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads
Hu, CunChen; Huang, HeYang; Xu, LiangLiang; Chen, XuSheng; Wang, Chenxi; Xu, Jiang; Chen, Shuang; Feng, Hao; Wang, Sa; Bao, Yungang; Sun, Ninghui; Shan, Yizhou
doi: 10.1145/3732941 pmid: N/A
Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.
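The first pillar, splitting prompts into fixed-size chunks, can be sketched in a few lines. This is an illustration under stated assumptions rather than ShuffleInfer's implementation: the function name `chunk_prompt` is hypothetical, and the abstract does not say how a final short chunk is handled, so the sketch simply leaves it shorter.

```python
from typing import List


def chunk_prompt(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt's token ids into consecutive chunks of at most chunk_size tokens.

    Assumption: the last chunk may be shorter than chunk_size; padding or merging
    with other requests is out of scope for this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[start:start + chunk_size]
            for start in range(0, len(token_ids), chunk_size)]


if __name__ == "__main__":
    prompt = list(range(10))           # stand-in for ten token ids
    for chunk in chunk_prompt(prompt, 4):
        print(chunk)
```

Fixing the chunk size bounds the work submitted to the accelerator per prefill step, which is what lets a scheduler keep that step near a chosen utilization level regardless of prompt length.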
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks
Wang, Yicheng; Xu, Lijie; Guo, Tian; Dou, Wensheng; Zeng, Hongbin; Wang, Wei; Wei, Jun; Huang, Tao
doi: 10.1145/3722110 pmid: N/A
Popular big data frameworks commonly run atop the Java Virtual Machine (JVM) and rely on a garbage collection (GC) mechanism to automatically allocate/reclaim in-memory objects. Existing garbage collectors are designed based on the hypothesis that most objects are short lived. However, big data frameworks usually generate many long-lived data objects, which can cause heavy GC overhead. Recent approaches have reduced GC overhead in big data frameworks but still suffer from heavy manual effort, additional runtime overhead, or suboptimal GC efficiency. This article describes the design of BridgeGC, a big-data-friendly garbage collector that significantly reduces GC overhead introduced by long-lived data objects. BridgeGC follows a cross-level co-design. At the big data framework level, BridgeGC provides two annotations for framework developers to denote the creation and release of data objects. Based on the annotations, BridgeGC tracks the lifecycles of annotated data objects and optimizes their allocation/reclamation at the GC level. At the GC level, we design a label-based allocator that stores data objects separately from other objects and balances their memory usage in the same JVM, leading to fewer GC cycles. We further design an efficient collector to eliminate unnecessary marking and copying of data objects during GC cycles, lowering the GC time. We have integrated BridgeGC into OpenJDK ZGC. The extensive evaluation, using two popular big data frameworks (Flink and Spark) and a key–value database (Cassandra), shows that BridgeGC achieves 31–82% GC time reduction compared to the baseline ZGC. BridgeGC also outperforms other traditional and academic garbage collectors in end-to-end performance.
GOLDYLOC: Global Optimizations & Lightweight Dynamic Logic for Concurrency
Pati, Suchita; Aga, Shaizeen; Jayasena, Nuwan; Sinclair, Matthew
doi: 10.1145/3730584 pmid: N/A
Modern accelerators like GPUs increasingly execute independent operations concurrently to improve the device’s compute utilization. However, effectively harnessing this concurrency on GPUs for important primitives such as general matrix multiplications (GEMMs) remains challenging. Although modern GPUs have significant hardware and software GEMM support, their kernel implementations and optimizations typically assume each kernel executes in isolation and can utilize all GPU resources. This approach is highly efficient when kernels execute in isolation, but causes significant resource contention and slowdowns when kernels execute concurrently. Moreover, current approaches often only statically expose and control parallelism within an application, without considering runtime information such as varying input size and concurrent applications, which often exacerbates contention. These issues limit performance benefits from concurrently executing independent operations. Accordingly, we propose GOLDYLOC, which considers the global resources across all concurrent operations to identify performant GEMM kernels, which we call globally optimized (GO)-Kernels. GOLDYLOC also introduces a lightweight dynamic logic which considers the dynamic execution environment for available parallelism and input sizes to execute performant combinations of concurrent GEMMs on the GPU. Overall, GOLDYLOC improves the performance of concurrent GEMMs on a real GPU by up to 2× (18% geomean per workload) versus the default concurrency approach and provides up to 2.5× (43% geomean per workload) speedup over sequential execution.
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework
Shi, Changqing; Sun, Yufei; Chen, Rui; Wang, Jiahao; Guo, Qiang; Gong, Chunye; Sui, Yicheng; Jin, Yutong; Zhang, Yuzhi
doi: 10.1145/3718987 pmid: N/A
With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse hardware platforms. However, manual methods, typically designed for specific applications, lack flexibility. Current automated conversion techniques also face considerable challenges, particularly in handling diverse programming interfaces, memory management, and so on, and are insufficient for converting large-scale, complex CUDA projects. In this article, we propose a novel source-to-source program transformation framework, TransCL, which automates the conversion of CUDA programs in four key aspects: source code, execution model, programming model, and memory model. To achieve this, we abstract a set of conversion rules aligned with the latest CUDA standards, develop a transcoder, implement an OpenCL-compatible programming interface library, and establish a memory mapping mechanism between CUDA and OpenCL. Experiments demonstrate that TransCL provides a high level of automation in converting CUDA-based applications and is effective in handling large, complex projects such as TensorFlow. Moreover, the converted AI framework successfully conducted model training for the first time. The experiment also validates that the converted program can execute correctly across multiple platforms and demonstrates good performance.
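A flavor of what a source-level conversion rule looks like can be given with a toy rewriter. This is a minimal sketch and not TransCL's transcoder: the rule table `CUDA_TO_OPENCL` and the function `translate_kernel_source` are hypothetical, the table covers only a handful of well-known one-to-one correspondences along the x dimension, and plain textual substitution ignores everything a real tool must handle (address-space qualifiers on pointer arguments, host API calls, memory management).

```python
import re
from typing import Dict

# A few well-known CUDA-to-OpenCL correspondences (x dimension only).
CUDA_TO_OPENCL: Dict[str, str] = {
    r"\bthreadIdx\.x\b": "get_local_id(0)",
    r"\bblockIdx\.x\b": "get_group_id(0)",
    r"\bblockDim\.x\b": "get_local_size(0)",
    r"\bgridDim\.x\b": "get_num_groups(0)",
    r"__syncthreads\(\)": "barrier(CLK_LOCAL_MEM_FENCE)",
    r"\b__global__\b": "__kernel",
    r"\b__shared__\b": "__local",
}


def translate_kernel_source(cuda_source: str) -> str:
    """Apply each textual rule in turn; a toy illustration of rule-based conversion."""
    translated = cuda_source
    for pattern, replacement in CUDA_TO_OPENCL.items():
        translated = re.sub(pattern, replacement, translated)
    return translated


if __name__ == "__main__":
    cuda_kernel = (
        "__global__ void scale(float *v) { "
        "int i = blockIdx.x * blockDim.x + threadIdx.x; "
        "v[i] *= 2.0f; __syncthreads(); }"
    )
    print(translate_kernel_source(cuda_kernel))
```

The output of this sketch is not yet valid OpenCL (the pointer argument still lacks a `__global` address-space qualifier, for example), which hints at why the abstract lists the execution, programming, and memory models as separate conversion aspects beyond source text.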