HotLD: A Workload-Aware Method for Global Code-Layout Optimization of Shared Libraries
Ning, Xueqin; Ma, Jun; Jia, Zhouyang; Tan, Yusong; Yu, Jie; Dong, Pan; Wang, Jing; Shen, Lianghao
doi: 10.1145/3769310
Dynamic linking is an important technique in software development. While dynamic linking can save memory and enhance maintainability, it also incurs performance overhead and hinders the application of profile-guided code layout optimization techniques to third-party libraries. Existing works (such as BOLT) can solve the problem by generating optimized versions of shared libraries for a given workload. Online PGO methods (such as OCOLOS) can further support on-the-fly code replacement to adapt to workload changes. These methods, however, are limited in several aspects: (1) difficulty performing global optimization, (2) high memory consumption and performance overhead, and (3) limited usage scenarios. To address these issues, we propose HotLD, a workload-aware method for global code-layout optimization of shared libraries. HotLD can improve the performance of shared libraries while introducing limited performance and memory overhead. The core idea behind HotLD is to create dedicated code copies for each typical application workload during the offline phase, and perform global optimization according to workload characteristics. At runtime, HotLD monitors the running workload of the target program, then dynamically selects and links an appropriate code copy. We conducted experiments using real-world applications to evaluate the effectiveness and efficiency of HotLD. The results demonstrate that HotLD can improve their performance (up to 22.37%) with limited performance overhead (at the millisecond level). Compared with existing works, HotLD uses 1%–46% of the memory to support 2×–9× as many workloads.
Corrigendum: gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography
Xiong, Qian; Ma, Weiliang; Shi, Xuanhua; Zhou, Yongluan; Jin, Hai; Huang, Kaiyi; Wang, Haozhou; Wang, Zhengru
doi: 10.1145/3776758
This is a corrigendum for the article “gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography” published in ACM Trans. Arch. Code Optim. 22, 3, Article 84 (September 2025), 27 pages.
Advancing Matrix Operations for High-Performance and Memory-Efficient Automata Processing on GPUs
Wu, Zhenlin; Ge, Tianao; Li, Jiajia; Chen, Xinyu; Liu, Hongyuan
doi: 10.1145/3774656
Finite state automata are essential in various domains, such as pattern matching and data analytics, where high throughput is critical. Recent work has explored representing automata execution as matrix algebra and leveraging CPU Basic Linear Algebra Subprograms (BLAS) libraries. While promising, this approach faces bottlenecks in memory usage, data locality, and redundant computation. This work systematically identifies these bottlenecks and develops automata-specific optimizations to address them. We focus on GPUs due to their high compute capabilities and widespread availability. To overcome these challenges, we propose three key techniques to enhance computational and memory efficiency: (1) transition matrix deduplication to reduce memory usage, (2) interleaved state renumbering to improve GPU thread utilization, and (3) state vector caching to eliminate redundant computations. Detailed evaluation shows that the proposed solution matches GPU-based automata engines in performance (up to 6.54× speedup) while using less than 2% of their memory footprint, and outperforms state-of-the-art domain-specific accelerators (up to 965×) for automata workloads.
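To make the phrase "automata execution as matrix algebra" concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' GPU implementation. It assumes one Boolean transition matrix per input symbol and advances a state vector with a Boolean matrix-vector product. The function names (`build_transition_matrices`, `step`, `run`) and the example automaton are hypothetical and chosen only for illustration.

```python
def build_transition_matrices(num_states, transitions):
    """Build one Boolean matrix per symbol from (src, symbol, dst) triples.

    matrix[dst][src] == 1 means the automaton can move src -> dst on that symbol.
    """
    matrices = {}
    for src, symbol, dst in transitions:
        matrix = matrices.setdefault(
            symbol, [[0] * num_states for _ in range(num_states)]
        )
        matrix[dst][src] = 1
    return matrices


def step(matrix, state):
    """Boolean matrix-vector product: next[d] = OR over s of (matrix[d][s] AND state[s])."""
    return [int(any(m and s for m, s in zip(row, state))) for row in matrix]


def run(num_states, transitions, start, accepting, text):
    """Return True if the NFA accepts `text` (no epsilon transitions assumed)."""
    matrices = build_transition_matrices(num_states, transitions)
    zero = [[0] * num_states for _ in range(num_states)]
    state = [0] * num_states
    state[start] = 1
    for symbol in text:
        # A symbol with no transitions maps every state to the empty set.
        state = step(matrices.get(symbol, zero), state)
    return any(state[s] for s in accepting)


# Hypothetical NFA over {a, b} that accepts strings ending in "ab".
ENDS_IN_AB = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)]
print(run(3, ENDS_IN_AB, start=0, accepting={2}, text="aab"))
```

In this formulation, identical per-symbol matrices could be stored once and shared, which is the general idea that a deduplication step would exploit; how the paper implements it is not shown here.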
HAVIT: An Efficient Hardware-Accelerator for Vision Transformer with Informative Patch Selection Techniques
Goyal, Anadi; Sangwan, Gaurav; Patel, Aradhya; Das, Palash
doi: 10.1145/3764865
Vision Transformers (ViTs) are widely utilized in the domain of computer vision. However, they encounter challenges related to high computational expenses, particularly for real-time inference on devices with limited resources. To tackle this challenge, we design HAVIT, an efficient ViT accelerator that integrates lightweight informative patch selection techniques with the core ViT acceleration module. The informative patch selection techniques introduce minimal computation, comprising only about 0.03% on average compared with the savings achieved, as they rely on edge detection methods like Sobel and Canny, in contrast to the computationally intensive AI models. After the edge detection, we use three proposed algorithms—Density-based Patch selector (DSP), Row Column Intersection patch selector (RCI), and Bounded Envelop Patch selector (BEP)—each offering unique paths for selecting informative patches with distinct accuracy-performance tradeoffs. Upon receiving the informative patches, the ViT acceleration portion of the proposed HAVIT employs parallelism by identifying independent data to enhance the performance further while being energy efficient. Experimental results show that our policies can significantly reduce computational workload by around 27%–57% compared with the baseline (no_patch_selection). The proposed HAVIT also outperforms state-of-the-art accelerators in terms of overall system performance.
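As a rough illustration of edge-driven patch selection, the sketch below keeps only the patches whose edge density exceeds a threshold. It is a generic example under stated assumptions, not the paper's DSP, RCI, or BEP algorithm: the input is assumed to be a binary edge map already produced by a detector such as Sobel or Canny, and `select_patches`, `patch`, and `threshold` are hypothetical names.

```python
def select_patches(edge_map, patch, threshold):
    """Return (patch_row, patch_col) indices whose edge density >= threshold.

    edge_map: 2D list of 0/1 values (1 marks an edge pixel).
    patch: side length of each square patch, in pixels.
    threshold: minimum fraction of edge pixels for a patch to be kept.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    selected = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            edges = sum(
                edge_map[r + i][c + j]
                for i in range(patch)
                for j in range(patch)
            )
            if edges / (patch * patch) >= threshold:
                selected.append((r // patch, c // patch))
    return selected


# Hypothetical 4x4 edge map split into four 2x2 patches.
EDGE_MAP = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(select_patches(EDGE_MAP, patch=2, threshold=0.5))
```

Only the selected patches would then be passed to the transformer, which is where the reduction in computation comes from.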
Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method
Liu, Shifang; Li, Huiyuan; Sheng, Hongjiao; Gui, Haoyuan; Zhang, Xiaoyu
doi: 10.1145/3764932
Singular Value Decomposition (SVD) is a fundamental matrix factorization technique in linear algebra, widely applied in numerous matrix-related problems. However, traditional SVD approaches are hindered by slow panel factorization and frequent CPU-GPU data transfers in heterogeneous systems, despite advancements in GPU computational capabilities. In this article, we introduce a GPU-centered SVD algorithm, incorporating a novel GPU-based bidiagonal divide-and-conquer (BDC) method. We reformulate the algorithm and data layout of different steps for SVD computation, performing all panel-level computations and trailing matrix updates entirely on GPU to eliminate CPU-GPU data transfers. Furthermore, we integrate related computations to optimize BLAS utilization, thereby increasing arithmetic intensity and fully leveraging the computational capabilities of GPUs. Additionally, we introduce a newly developed GPU-based BDC algorithm that restructures the workflow to eliminate matrix-level CPU-GPU data transfers and enable asynchronous execution between the CPU and GPU. Experimental results on AMD MI210 and NVIDIA V100 GPUs demonstrate that our proposed method achieves speedups of up to 1293.64x/7.47x and 14.10x/12.38x compared with rocSOLVER/cuSOLVER and MAGMA, respectively.
DFGAS: Exploring the Balance of HW-SW Scheduling through the DFG-Aware Scheme
Liu, Tianyu; Fan, Zhihua; Li, Wenming; Wang, Zhen; Qiu, Yuhang; Tang, Shengzhong; Wu, Haibin; Liu, Yanhuan; Ye, Xiaochun; Fan, Dongrui
doi: 10.1145/3773768
Coarse-Grained Reconfigurable Architectures (CGRAs) have been regarded as a promising spatial computing fabric for the ever-evolving algorithms in multiple domains. However, pure software scheduling cannot compensate for the over-serialization and load imbalance of purely static CGRA designs. To address the issues caused by limited hardware flexibility, an in-depth study on the balance between the software and hardware scheduling design of CGRAs is needed to achieve more precise, accurate, and adaptive scheduling of dataflow. In this article, we propose DFGAS (DFG-Aware Scheduling), a dataflow-driven CGRA that provides a comprehensive scheduling approach encompassing software prediction, runtime adaptive execution, and post-execution refinement. Prior to execution, the TimeStamp prediction algorithm, coupled with the inherent dataflow execution model, enables coarse-grained (block-level) prediction for prioritized transfer and computation on the NoC and PEs. During execution, key dataflow graph (DFG) blocks and edges are accelerated by a dynamic and adaptive dataflow mechanism. It leverages hardware-software co-design to obtain a holistic view of the entire DFG and continuously and adaptively optimizes the scheduling process. Furthermore, a complete workflow is implemented that supports refinement of the software DFG mapping results. DFGAS represents a scheduling scheme for CGRAs that is worth exploring, achieving hardware-software co-design that balances energy efficiency and flexibility. Experiments show that DFGAS achieves 1.35× energy efficiency improvement over a dataflow-driven CGRA and 1.9× energy efficiency improvement over a state-of-the-art pure static CGRA.
GTSM: A multi-edge-centric temporal subgraph matching framework on GPUs
He, Jiezhong; Jia, Menghan; Chen, Yixin; Liu, Zhouyang; Li, Dongsheng
doi: 10.1145/3771286
Temporal subgraph matching aims to identify subgraphs in temporal networks that satisfy both structural and temporal constraints, with applications ranging from social network analysis to fraud detection. As this NP-hard problem involves massive computation on large graphs, GPU acceleration becomes critical. However, existing edge-centric approaches suffer from computational redundancy, inefficient memory management, and limited scalability on large graphs, hindering efficient GPU acceleration. To address these challenges, we propose GTSM, a GPU-optimized temporal subgraph matching system featuring three innovations: (1) a multi-edge-centric paradigm that reduces redundant search space through multi-edge compression along with an efficient decompression algorithm; (2) a memory-bound optimization that maximizes GPU resource utilization; (3) a heterogeneous BFS-DFS execution model where the CPU performs Breadth-First Search (BFS) to ensure load balancing across GPUs. Experiments demonstrate that GTSM achieves a 5.5×–93.2× speedup over state-of-the-art GPU systems, while solving 10%–40% more queries. With our heterogeneous execution model, our system achieves near-linear scaling in multi-GPU configurations.
Minimizing overhead of out-of-channel data exchanges to balance wear-outs and I/Os in RAID-enabled SSDs
Yang, Fan; Wu, Jiaojiao; Xiao, Chenqi; Li, Jun; Sha, Zhibing; Cai, Zhigang; Shi, Yuanquan; Tan, Kanlun; Liao, Jianwei
doi: 10.1145/3776584
Channel-level RAID has been introduced to NAND flash-based solid-state drives (SSDs) to protect against channel failures. However, it suffers from unbalanced wear-out and I/O workloads across channels, due to the nature of in-channel updates on data/parity chunks of data stripes, leading to a decline in I/O performance and a negative impact on the lifespan of RAID-enabled SSDs. This article introduces an approach that balances wear-out and I/O with minimal overhead from location exchanges of data/parity chunks belonging to the same stripe, when fulfilling write requests or garbage collections (GCs). To this end, we build an assessment model for measuring the balance level of all SSD channels, and then trigger a location exchange of data/parity chunks in the same data stripe if the exchange operation benefits wear-out and I/O balance. In addition, we introduce a scheme called pairGC to further minimize the overhead of data location exchanges. Specifically, it pairs the GC operations on different channels and conducts out-of-channel page moves, to achieve the goal of data location exchanges without additional cost. Through a series of emulation experiments on eight disk traces of real-world applications, we show that our proposal improves I/O performance by 40.0% on average, noticeably balances I/O workloads over SSD channels, and prolongs the endurance of SSDs, in contrast to the state-of-the-art RAID optimization schemes inside SSDs.
Towards high scalability and fine-grained parallelism on distributed HPC platforms
de Haro Ruiz, Juan Miguel; Álvarez Martínez, Carlos; Jiménez-González, Daniel; Morais, Lucas; Martorell Bofill, Xavier
doi: 10.1145/3774815
Current High-Performance Computing systems rely on massive parallelism to achieve exascale performance. They use task scheduling and message-passing programming models to explore complementary sources of parallelism. Combining the two holds the promise of allowing seamless exploitation of both intra- and inter-node concurrency while leveraging widely known programming abstractions. Still, the interaction between the two raises coordination problems that could make work distribution excessively costly, limiting performance. This work is the first to evaluate comprehensive hardware acceleration of their combined use, integrating them in a programming model that further exploits their synergies. The hardware/software co-design approach proposed for this purpose is prototyped on a cluster of 64 FPGA nodes, where each holds a RISC-V Rocket Chip CPU with 8 cores. On the one hand, this article combines OMPIF and Picos, which are hardware accelerators for message passing and task scheduling, respectively. They interface with the CPU through RoCC-based custom RISC-V instructions. On the other hand, we present the Implicit Message Passing (IMP) programming model, which extends task scheduling abstractions to leverage MPI-mediated inter-node parallelism without requiring explicit MPI calls. Thus, IMP transparently allows the dataflow-style execution induced by task scheduling to span multiple nodes. We implement three benchmarks, N-body, Heat, and Cholesky, each with two different strategies, IMP and explicit MPI, and evaluate them on the multi-core FPGA-based cluster. We demonstrate that our hardware-software co-design approach achieves near-linear scalability with IMP and the OMPIF/Picos accelerators, and reduces task management overhead from 2200 to 300 cycles per task.
Furthermore, when leveraging all 512 cores (split among the 64 nodes), we measure speedups of 2.04x (40x in communication), 1.25x (7x in communication), and 7.29x (25x in communication) compared with unaccelerated MPI for N-body, Heat, and Cholesky respectively. Finally, at 64 nodes, we respectively achieve 99%, 83%, and 79% of weak scaling efficiency.
Optimizing General Sparse Matrix-Matrix Multiplication on the GPU
Wang, Yizhuo; Lin, Hongpeng; Wei, Bingxin; Gao, Jianhua; Ji, Weixing
doi: 10.1145/3774654
General Sparse Matrix-Matrix Multiplication (SpGEMM) is a crucial computational kernel in the field of scientific and engineering computing. Due to the irregular distribution of nonzero elements in sparse matrices, SpGEMM computation faces challenges such as non-contiguous memory access and workload imbalance. This article focuses on optimizing SpGEMM for GPU platforms. First, a lightweight machine learning model is trained to predict the optimal method for estimating the size of the result matrix. Next, different kernels are launched in groups to maximize GPU shared memory utilization and achieve load balancing. For the hash-based sparse accumulator, heuristic methods are used to select the optimal hash load factors and hash multiplier factors, thereby reducing the number of hash collisions. In addition, thread reduction is applied in the symbolic phase to enhance intra-block parallelism. Combining these optimization strategies, we implemented an adaptive SpGEMM algorithm for GPUs and compared its performance with current state-of-the-art algorithms. The results show that our algorithm achieves significant performance improvements.
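The role of the hash load factor and hash multiplier can be seen in a small sequential sketch of a hash-based sparse accumulator for one output row. This is an illustrative CPU version in plain Python, not the authors' GPU kernel. It assumes B is stored in CSR form, uses a multiplicative hash with linear probing, and sizes the table as a power of two from an upper bound on the row's nonzeros. The function name `spgemm_row` and the default values for `load_factor` and `multiplier` are hypothetical.

```python
def spgemm_row(a_cols, a_vals, b_indptr, b_indices, b_data,
               load_factor=0.5, multiplier=107):
    """Compute one row of C = A * B with a hash-based accumulator.

    a_cols, a_vals: column indices and values of the nonzeros in one row of A.
    b_indptr, b_indices, b_data: matrix B in CSR format.
    load_factor: target fill ratio of the hash table, in (0, 1].
    multiplier: odd constant used by the multiplicative hash.
    Returns a sorted list of (column, value) pairs for the output row.
    """
    # Upper bound on output nonzeros: total length of the B rows touched.
    upper = sum(b_indptr[k + 1] - b_indptr[k] for k in a_cols)
    size = 1
    while size * load_factor < max(upper, 1):
        size *= 2  # power-of-two size lets us mask instead of using modulo

    keys = [-1] * size   # -1 marks an empty slot
    vals = [0.0] * size
    for k, a in zip(a_cols, a_vals):
        for p in range(b_indptr[k], b_indptr[k + 1]):
            col = b_indices[p]
            slot = (col * multiplier) & (size - 1)
            # Linear probing until we find this column or an empty slot.
            while keys[slot] != -1 and keys[slot] != col:
                slot = (slot + 1) & (size - 1)
            keys[slot] = col
            vals[slot] += a * b_data[p]
    return sorted((k, v) for k, v in zip(keys, vals) if k != -1)


# B = [[1, 0, 4], [0, 5, 6]] in CSR form; the A row is [2, 3].
B_INDPTR, B_INDICES, B_DATA = [0, 2, 4], [0, 2, 1, 2], [1.0, 4.0, 5.0, 6.0]
print(spgemm_row([0, 1], [2.0, 3.0], B_INDPTR, B_INDICES, B_DATA))
```

A lower load factor gives a larger table and fewer probes at the cost of memory, and the multiplier changes how column indices spread across slots. These are the two knobs the abstract says are chosen heuristically.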