Bubble-Swap Flow Control
Dai, Yi; Lu, Kai; Ma, Sheng; Su, Jinshu; Li, Dongsheng
doi: 10.1145/3705316; pmid: N/A
Deadlock-free adaptive routing is extensively adopted in both on-chip and off-chip interconnection networks to improve communication bandwidth and reduce latency. Introducing virtual channels (VCs), also known as virtual lanes (VLs), is the mainstream technique to handle deadlocks incurred by adaptive routing, and it also provides VC preemption for higher-priority traffic. However, existing deadlock-free flow control schemes either underutilize memory resources because they manage buffers inefficiently to simplify hardware implementation, or rely on complicated global coordination and synchronization with very high hardware complexity. Most hardware-friendly schemes use more VCs and memory resources to ease the implementation of deadlock-free flow control. In contrast, sophisticated schemes achieve deadlock freedom with minimum VC cost, even eliminating the additional buffer requirement, through complicated control mechanisms. In this work, we rethink the root cause of the deadlock problem from a different perspective by considering it as a lack of credit, which leads us to an efficient solution. With a minor modification of credit accumulation and return, our proposed bubble-swap flow control (BSFC) ensures atomic buffer swap between two adjacent routers based only on local credit status while making full use of the buffer space. BSFC achieves a better tradeoff between implementation complexity and memory overhead and can be easily integrated into an industrial router with no modification to buffer allocation or port arbitration. Simulation results demonstrate that BSFC outperforms existing bubble-based deadlock-free methods with 64% higher throughput on average. We further propose a credit reservation strategy to eliminate the escape VC cost for fully adaptive routing implementation. Synthesis results demonstrate that BSFC along with credit reservation (BSFC-CR) can reduce area and power consumption by 29% and 26%, respectively, compared with the traditional critical bubble scheme (CBS).
Taming Flexible Job Packing in Deep Learning Training Clusters
Yang, Pengyu; Cui, Weihao; Xue, Chunyu; Zhao, Han; Chen, Chen; Chen, Quan; Yang, Jing; Guo, Minyi
doi: 10.1145/3711927; pmid: N/A
Job packing is an effective technique to harvest idle resources that are allocated to deep learning (DL) training jobs but not fully utilized, especially when clusters experience low utilization and users overestimate their resource needs. However, existing job packing techniques tend to be conservative due to the mismatch in scope and granularity between job packing and cluster scheduling. In particular, tapping the potential of job packing in the training cluster requires a local and fine-grained coordination mechanism. To this end, we propose a novel job-packing middleware named Gimbal, which operates between the cluster scheduler and the hardware resources. As middleware, Gimbal must not only facilitate coordination among the packed jobs but also support the various scheduling objectives of different schedulers. Gimbal achieves this dual functionality by introducing a set of worker calibration primitives designed to calibrate workers’ execution status in a fine-grained manner. The primitives hide the complexity of the underlying job and resource management mechanisms, thus offering the generality and extensibility needed to craft coordination policies tailored to various scheduling objectives. We implement Gimbal on a real-world GPU cluster and evaluate it with a set of representative DL training jobs. The results show that Gimbal improves different scheduling objectives by up to 1.32× compared with state-of-the-art job packing techniques.
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning
Juang, Tzung-Han; Dubach, Christophe
doi: 10.1145/3711926; pmid: N/A
While traditional High-Level Synthesis (HLS) converts “high-level” C-like programs into hardware automatically, producing high-performance designs still requires hardware expertise. Optimizations such as data partitioning can have a large impact on performance since they directly affect data reuse patterns and the ability to reuse hardware. However, optimizing partitioning is a difficult process since minor changes in the parameter choices can lead to totally unpredictable performance. Functional array-based languages have been proposed instead of C-based approaches, as they offer stronger performance guarantees. This article proposes to follow a similar approach and exposes a divide-and-conquer primitive at the algorithmic level to let users partition any arbitrary computation. The compiler is then free to explore different partition shapes to maximize both data and hardware reuse automatically. The main challenge remains that the impact of partitioning is only known much later in the compilation flow. This is due to the hard-to-predict effects of the many optimizations applied during compilation. To solve this problem, the partitioning is expressed using a set of symbolic tunable parameters, introduced early in the compilation pipeline. A symbolic performance model is then used in the last compilation stage to predict performance based on the possible values of the tunable parameters. Using this approach, a design space exploration is conducted on an Intel Arria 10 Field Programmable Gate Array (FPGA), and competitive performance is achieved on the classical VGG and TinyYolo neural networks.
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion
Yan, Jianrong; Jiang, Wenbin; He, Dongao; Wen, Suyang; Li, Yang; Jin, Hai; Shao, Zhiyuan
doi: 10.1145/3702001; pmid: N/A
Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs) and CUDA Cores (CDs) in GPUs. These challenges are further exacerbated by repeated, inefficient, and redundant aggregations in GNNs that result from the high sparsity and irregular non-zero distribution of real-world graphs. We propose RT-GNN, a GNN framework based on the fusion of advanced TC and CD units, to eliminate the aforementioned redundancies by exploiting the properties of an adjacency matrix. First, a novel GNN representation technique, the hierarchical embedding graph (HEG), is proposed to manage the intermediate aggregation results hierarchically, which further avoids redundancy in intermediate aggregations. Next, to address the inherent sparsity of graphs, RT-GNN places the blocks (a.k.a. tiles) in HEG onto TCs and CDs according to their sparsity using a new block-based row-wise multiplication approach, which lets TCs and CDs work concurrently. Experimental results demonstrate that HEG outperforms HAG in redundancy elimination by an average speedup of 19.3×, and by up to 72× on the ARXIV dataset. Moreover, for overall performance, RT-GNN outperforms state-of-the-art GNN frameworks (including DGL, HAG, GNNAdvisor, and TC-GNN) by an average factor of 3.1× while maintaining or even improving task accuracy.
exZNS: Extending Zoned Namespace to Support Byte-loggable Zones
Qi, Wenjie; Tan, Zhipeng; Zhang, Ziyue; Yuan, Ying; Feng, Dan
doi: 10.1145/3705318; pmid: N/A
The emerging Zoned Namespace (ZNS) interface provides hosts with fine-grained, performance-predictable storage management. ZNS organizes the address space into zones composed of fixed-size, sequentially written, non-overwritable blocks, making it suitable for log-structured file systems. However, our experimental analysis reveals that ZNS’s write restrictions introduce notable persistence overhead. First, out-of-place updates of data blocks require frequent small modifications to file metadata blocks, each typically much smaller than a block, to record the latest logical block address. Second, some files, such as databases’ Write-Ahead Logging files, frequently execute synchronous small writes, with I/O sizes typically smaller than a logical block. Persisting this file metadata and file data requires writing back the entire block even if it is only partially updated. This significantly increases I/O latency and potentially reduces device lifespan. This article proposes exZNS, an extension of ZNS designed to provide both regular zones and byte-loggable zones. By exposing the persistent write buffer of the opened zones on the device to the application, the byte-loggable zone allows appending at byte granularity through a new set of APIs. To reduce the persistence overhead described above, we built exBlzFS, a novel high-performance file system for exZNS. exBlzFS selectively records the partial updates of metadata blocks to the byte-loggable zone to ensure metadata persistence, and persists file data to the byte-loggable zone at byte granularity to absorb the frequent small writes. Evaluations show that exBlzFS increases the IOPS of RocksDB by 42.7% and 76.3%, and reduces the device’s write traffic by 86% and 94%, compared with BlzFS and F2FS, respectively.
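The cost of block-granular persistence for small writes can be illustrated with a toy model. The sketch below is not the exZNS API or the authors' implementation; the `Zone` class, the 4096-byte block size, and the workload of 128-byte records are all assumptions made up for illustration. It only shows why rounding each small synchronous write up to a whole block inflates device write traffic compared with appending at byte granularity.

```python
BLOCK_SIZE = 4096  # assumed logical block size in bytes; not taken from the article


class Zone:
    """Toy model of a sequentially written zone (hypothetical, for illustration)."""

    def __init__(self, capacity_bytes, granularity):
        self.capacity_bytes = capacity_bytes
        self.granularity = granularity  # smallest unit the device persists
        self.write_pointer = 0          # bytes consumed so far; only moves forward

    def append(self, payload_len):
        """Append a payload at the write pointer and return the bytes persisted."""
        units = -(-payload_len // self.granularity)  # ceiling division
        persisted = units * self.granularity
        if self.write_pointer + persisted > self.capacity_bytes:
            raise ValueError("zone is full; a real host would open another zone")
        self.write_pointer += persisted
        return persisted


def bytes_persisted(payload_sizes, granularity, capacity_bytes=1 << 30):
    """Total bytes written to a fresh zone for a sequence of synchronous writes."""
    zone = Zone(capacity_bytes, granularity)
    return sum(zone.append(n) for n in payload_sizes)


# Made-up workload: 100 synchronous 128-byte log records.
small_writes = [128] * 100
block_bytes = bytes_persisted(small_writes, BLOCK_SIZE)  # block-granular zone
byte_bytes = bytes_persisted(small_writes, 1)            # byte-granular zone
print(block_bytes, byte_bytes, block_bytes / byte_bytes)
```

Under these assumed numbers, each 128-byte record costs a full 4096-byte block in the block-granular case, so the ratio printed is simply the block size divided by the record size. Real savings depend on the workload mix and are reported in the article's own evaluation.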
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel
Yao, Dezhong; Zhao, Sifan; Liu, Tongtong; Wu, Gang; Jin, Hai
doi: 10.1145/3703352; pmid: N/A
Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component of many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the high latency of computing high-dimensional, large-scale sparse matrix multiplication on GPUs hinders the development of these applications. An effective approach is collaborative computing on heterogeneous cores, but this method must address three issues: (1) irregular non-zero elements lead to load imbalance and irregular memory access, (2) differences in computing latency across cores reduce computational parallelism, and (3) temporary data transfer between different cores introduces additional latency overhead. In this work, we propose a framework for collaborative large-scale sparse matrix multiplication on CPU-GPU heterogeneous cores, named ApSpGEMM. ApSpGEMM is based on sparsity rules and proposes reordering and splitting algorithms to eliminate the impact of non-zero element distribution features on load and memory access. Then, adaptive panel allocation with affinity constraints among cores improves computational parallelism. Finally, carefully arranged asynchronous data transmission and computation balance communication overhead. Compared with state-of-the-art SpGEMM methods, our approach provides excellent absolute performance on matrices with different sparse structures. On heterogeneous cores, the GFlops of large-scale sparse matrix multiplication is improved by 2.25 to 7.21 times.
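For readers unfamiliar with SpGEMM, the following minimal sketch shows the textbook row-wise (Gustavson-style) formulation and why irregular non-zeros cause load imbalance. It is not ApSpGEMM: the dict-of-dicts storage and the function names `spgemm_rowwise` and `row_work` are assumptions chosen for brevity, and no reordering, splitting, or CPU-GPU collaboration is modeled.

```python
def spgemm_rowwise(a, b):
    """Compute C = A @ B for sparse matrices stored as {row: {col: value}}.

    Row-wise formulation: row i of C is the sum of the rows of B selected
    by the non-zeros of row i of A, each scaled by the matching A value.
    """
    c = {}
    for i, a_row in a.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c[i] = acc
    return c


def row_work(a, b):
    """Multiply-add count per row of A; uneven counts indicate load imbalance."""
    return {i: sum(len(b.get(k, {})) for k in row) for i, row in a.items()}


# Small made-up example matrices.
A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5, 1: 1, 2: 6}, 2: {1: 7}}

C = spgemm_rowwise(A, B)
print(C)               # {0: {1: 18}, 1: {0: 15, 1: 3, 2: 18}}
print(row_work(A, B))  # {0: 2, 1: 3}
```

The per-row work depends on which rows of B each row of A touches, not just on the number of non-zeros in A, which is one reason static partitioning of rows across cores tends to be uneven.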
A High Scalability Memory NoC with Shared-Inside Hierarchical-Groupings for Triplet-Based Many-Core Architecture
Li, Chunfeng; Shi, Feng; Yin, Fei; Soliman, Karim; Wei, Jin
doi: 10.1145/3688610; pmid: N/A
Innovative processor architecture designs are shifting towards Many-Core Architectures (MCAs) to meet the future demands of high-performance computing as the limits of Moore’s Law have almost been reached. Many-core processors utilize shared memory hierarchies to achieve high-speed memory systems, improving memory access efficiency. However, as the number of cores multiplies, the scalability of this system is significantly constrained by the increased proportion of long-distance and Non-Uniform Memory Access (NUMA). Improving the scalability of MCAs is crucial for achieving large/super-scale general-purpose many-core processors. This work proposes a high-scalability memory Network-on-Chip (NoC) for Triplet-Based Many-Core Architecture (TriBA), named TriBA-mNoC. TriBA-mNoC maintains a consistent core-to-core spacing as the network scale increases, effectively preventing increased long-distance memory access latency. Moreover, it leverages an inherent advantage of shared-inside hierarchical-groupings, alleviating common NUMA issues in the NoC design. Evaluations of static network characteristics show that TriBA-mNoC outperforms most classical NoCs in network diameter, average distance, and cost. TriBA-mNoC can be integrated with TriBA in the same silicon die with a tile-like floorplan, forming a novel NoC called TriBA-NoC, which can combine the strengths of both networks to maximize architecture performance. We evaluated the memory access performance and scalability of TriBA-NoC using mathematical evaluation models and simulations with real traffic (PARSEC 3.0 and SPLASH-2) at different network scales. The mathematical evaluation results indicate that TriBA-NoC achieves an aggregate speedup of approximately 3× compared with 2D-Mesh for a similar number of cores. Furthermore, TriBA-NoC’s single-core speedup efficiency remains stable as the number of cores increases under the same cache hit ratio, whereas 2D-Mesh experiences a rapid decline, highlighting TriBA-NoC’s scalability. Finally, the real-traffic simulation results show that TriBA-NoC achieves reductions in average memory access latency and time of 25.90% to 40.50% and 5.61% to 31.69%, respectively, compared with 2D-Mesh.
MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage
Chen, Xinqi; Xu, Erci; Mo, Dengyao; Lu, Ruiming; Wu, Haonan; Ding, Dian; Xue, Guangtao
doi: 10.1145/3708542; pmid: N/A
With the sheer volume of data in today’s world, archive storage systems play a significant role in persisting cold data. Due to stringent cost concerns, one popular design is to organize disks into groups and periodically switch which groups are powered on to serve user requests. Scheduling thus becomes critical for both CapEx and performance. Unfortunately, field results indicate that existing schedulers can often be suboptimal. Our further analysis suggests that the main reason is the mismatch between the ever-changing workloads and the fixed set of coarsely configured parameters in current heuristic-based schedulers. In this article, we propose MasterPlan, a reinforcement learning (RL) based scheduler for archive storage systems. By identifying the unique characteristics of archive storage service, we design a state space and reward function for the RL agent. MasterPlan includes a continuous action encoding approach to guarantee efficient exploration, and a meta adaptation module to extract features of workload series. Experiments show that MasterPlan can achieve improvements of 1.25× in throughput, 2.16× in 99th-percentile latency, and 1.47× in power draw compared with existing solutions.
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs
Chang, Qiong; Wang, Weimin; Miyazaki, Jun
doi: 10.1145/3716875; pmid: N/A
The Iterative Closest Points (ICP) algorithm is the most widely used method for estimating rigid transformation in 3D point cloud registration. However, ICP relies on repeatedly performing computationally intensive nearest neighbor searches (NNS) within 3D space. This dependency becomes a significant bottleneck when processing large datasets, thereby hindering the practical deployment of point cloud technologies in real-world applications. To address this issue, we propose two approximate nearest neighbor search (ANNS) acceleration strategies that efficiently improve the processing speed of NNS. Our strategies first voxelize the target point cloud and then fill voxels in the 3D coordinate space around the source point cloud in two different ways, which converts the global nearest neighbor search into a local search. Both proposed methods are well suited to parallelization on GPUs with a low computational load. Extensive experiments show that our methods significantly accelerate NNS processing while maintaining high accuracy, outperforming most of the currently known approaches.
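The idea of turning a global nearest neighbor search into a local one via voxelization can be sketched with a generic voxel hash grid. This is a common textbook technique, not the two voxel-filling strategies proposed in the article, and it runs on the CPU rather than a GPU. The function names `build_voxel_grid` and `approx_nearest`, the voxel size, and the 27-voxel search neighborhood are assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def voxel_key(point, voxel_size):
    """Integer voxel coordinates containing a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def build_voxel_grid(points, voxel_size):
    """Bucket target points by the voxel they fall into."""
    grid = defaultdict(list)
    for p in points:
        grid[voxel_key(p, voxel_size)].append(p)
    return grid


def approx_nearest(query, grid, voxel_size):
    """Search only the query's voxel and its 26 neighbors.

    Returns the closest target point found there, or None if the local
    neighborhood is empty. The result is approximate: a true nearest
    neighbor lying outside the 3x3x3 neighborhood is missed.
    """
    base = voxel_key(query, voxel_size)
    best, best_d2 = None, math.inf
    for offset in product((-1, 0, 1), repeat=3):
        key = tuple(b + o for b, o in zip(base, offset))
        for p in grid.get(key, ()):
            d2 = sum((q - t) ** 2 for q, t in zip(query, p))
            if d2 < best_d2:
                best, best_d2 = p, d2
    return best


# Made-up target cloud and queries.
target = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (5.0, 5.0, 5.0)]
grid = build_voxel_grid(target, voxel_size=1.0)
print(approx_nearest((0.8, 0.8, 0.8), grid, 1.0))     # (0.9, 0.9, 0.9)
print(approx_nearest((10.0, 10.0, 10.0), grid, 1.0))  # None
```

Each query touches a bounded number of voxels regardless of the total cloud size, and queries are independent of one another, which is the property that makes this family of methods attractive for parallel hardware.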
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer
Xu, Jingle; Fu, Jiayu; Gan, Lin; Chen, Yaojian; Sun, Zhaoqi; Huang, Zhenchun; Yang, Guangwen
doi: 10.1145/3701990; pmid: N/A
The fast development of biomolecular structure determination has enabled the fine-grained study of objects in the micro-world, such as proteins and RNAs, to broad benefit. However, as the computational algorithms are constantly developed, the enrichment of features increases the algorithmic complexity and brings more computationally unfriendly modules. This calls for efficient solutions that leverage the rich and varied hardware resources of the world’s most advanced supercomputing systems to fully accelerate application performance. In this article, we present our efforts on porting and optimizing the 3D reconstruction of RELION, one of the most popular cryo-EM software packages for biomolecular structure determination, by leveraging different resources of the latest generation of the Sunway heterogeneous supercomputer. Several novel approaches are proposed to resolve the different challenges posed by the complex algorithm, including a multi-level parallel scheme and operator optimizations to map and scale RELION, efficient strategies to address the memory bottlenecks and improve data locality, lock-free writing solutions to minimize write-write conflicts, and pipelining approaches to overlap computation and communication. Combining all proposed optimizations, the computation time is reduced to under 2 hours, achieving 11.9× and 8.9× speedups on two different datasets. The overall design scales to 131,072 cores, increasing parallel efficiency from 33% to 61% and from 46% to 70%, respectively. To the best of our knowledge, this is the first work that fully optimizes and scales the 3D reconstruction of RELION using the latest Sunway system.