ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures
Mummidi, Chandra Sekhar; Kundu, Sandip
doi: 10.1145/3572911
pmid: N/A
Chip multiprocessors (CMPs) with more cores generate more traffic to the last-level cache (LLC). Without a corresponding increase in LLC bandwidth, such traffic cannot be sustained, resulting in performance degradation. Previous research focused on data placement techniques to improve access latency in Non-Uniform Cache Architectures (NUCA). Placing data closer to the referencing core reduces traffic in the cache interconnect. However, earlier data placement work did not account for the frequency with which specific memory references are accessed. The difficulty of tracking access frequency for all memory references is one of the main reasons why it was not considered in NUCA data placement. In this research, we present a hardware-assisted solution called ACTION (Adaptive Cache Block Migration) to track the access frequency of individual memory references and prioritize placement of frequently referenced data closer to the affine core. ACTION migrates cache blocks when there is a detectable change in access frequencies due to a shift in the program phase. ACTION counts access references in the LLC stream using a simple and approximate method and uses a straightforward placement and migration solution to keep the hardware overhead low. We evaluate ACTION on a 4-core CMP with a 5x5 mesh LLC network implementing a partitioned D-NUCA against workloads exhibiting distinct asymmetry in cache block access frequency. Our simulation results indicate that ACTION can improve CMP performance by up to 7.5% over state-of-the-art (SOTA) D-NUCA solutions.
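The abstract does not say how ACTION counts accesses, only that the method is simple and approximate. As a rough illustration of the general idea, the sketch below uses small saturating counters per block and per core, with periodic halving so that counts from an old program phase fade. The class name `AccessTracker`, the 4-bit counter width, and the halving policy are all assumptions made for this example, not details from the paper.

```python
from collections import defaultdict

COUNTER_MAX = 15  # assumption: 4-bit saturating counters


class AccessTracker:
    """Hypothetical approximate per-core access counters for cache blocks."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # block address -> one saturating counter per core
        self.counts = defaultdict(lambda: [0] * num_cores)

    def record(self, block, core):
        """Count one LLC access to `block` from `core`, saturating at COUNTER_MAX."""
        counters = self.counts[block]
        counters[core] = min(counters[core] + 1, COUNTER_MAX)

    def age(self):
        """Halve every counter so that accesses from earlier phases lose weight."""
        for counters in self.counts.values():
            for i in range(self.num_cores):
                counters[i] //= 2

    def affine_core(self, block):
        """Return the core with the highest count (lowest index wins ties)."""
        counters = self.counts[block]
        return max(range(self.num_cores), key=lambda i: counters[i])
```

In such a scheme, a block would be a migration candidate when `affine_core` changes after an aging step, which loosely mirrors the phase-change trigger the abstract describes.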
Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators
Liu, Qiaoyi; Setter, Jeff; Huff, Dillon; Strange, Maxwell; Feng, Kathleen; Horowitz, Mark; Raina, Priyanka; Kjolstad, Fredrik
doi: 10.1145/3572908
pmid: N/A
Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use push memories: memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy and store the resulting output data streams. To address the compilation challenge caused by push memories, we propose representing these memories directly in the compiler by combining storage with address generation and control logic in a single structure: a unified buffer. The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer, uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports. Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7× better runtime and 3.5× better energy-efficiency compared to an FPGA.
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory
Du, Jingwen; Wang, Fang; Feng, Dan; Gan, Changchen; Cao, Yuchao; Zou, Xiaomin; Li, Fan
doi: 10.1145/3587096
pmid: N/A
Disaggregated memory architecture has risen in popularity for large datacenters with the advantage of improved resource utilization, failure isolation, and elasticity. Replicated state machines (RSMs) have been extensively used for reliability and consistency. In traditional RSM protocols, each replica stores replicated data and has the computing power to participate in some part of the protocols. However, traditional RSM protocols fail to work in the disaggregated memory architecture due to asymmetric resources on CPU nodes and memory nodes. This article proposes ECHO, a fast one-sided RDMA-based RSM protocol with lightweight log replication and remote applying, efficient linearizability guarantee, and fast coordinator failure recovery. ECHO enables all operations in the protocol to be efficiently executed using only one-sided RDMA, without the participation of any computing resource in the memory pool. To provide lightweight log replication and remote applying, ECHO couples the replicated log and the state machine to avoid dual-copy and performs remote applying by updating pointers. To enable efficient remote log state management, ECHO leverages a hitchhiked log state updating scheme to eliminate extra network round trips. To provide efficient linearizability guarantee, ECHO performs immediate remote applying after log replication and leverages the local locks at the coordinator to ensure linear consistency. Moreover, ECHO adopts a commit-aware log cache to make data visible immediately after being committed. To achieve fast failure recovery, ECHO leverages a commit point identification scheme to reduce the overhead of log consistency recovery. Experimental results demonstrate that ECHO outperforms the state-of-the-art RSM protocol (namely Sift) in multiple scenarios. For example, ECHO achieves 27%–52% higher throughput on typical write-intensive workloads. Moreover, ECHO reduces the consistency recovery time by three orders of magnitude for coordinator failure.
User-driven Online Kernel Fusion for SYCL
Pérez, Víctor; Sommer, Lukas; Lomüller, Victor; Narasimhan, Kumudha; Goli, Mehdi
doi: 10.1145/3571284
pmid: N/A
Heterogeneous programming models are becoming increasingly popular to support ever-evolving hardware architectures, especially new and emerging specialized accelerators optimized for specific tasks. While such models provide some degree of performance portability for existing applications across heterogeneous architectures, short-running device kernels can hurt an application's performance due to the overheads of data transfer, synchronization, and kernel launch. In applications with one or two short-running kernels the overhead can be negligible, but it becomes noticeable when short-running kernels dominate the overall number of kernels in an application, as is the case in graph-based neural network models, where there are several small memory-bound nodes alongside a few large compute-bound nodes. To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a tradeoff between (a) task-specific kernels with low overhead that are hard to maintain and (b) smaller modular kernels with higher overhead that are easier to maintain. DSL-based approaches, such as those provided for machine learning frameworks, offer the possibility of such a fusion, but they are limited to a particular domain, exploit specific knowledge of that domain, and, as a consequence, are hard to port elsewhere. This study explores the feasibility of user-driven kernel fusion through an extension to the SYCL API to address the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.
An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform
Ma, Wenjing; Liu, Fangfang; Chen, Daokun; Lu, Qinglin; Hu, Yi; Wang, Hongsen; Yuan, Xinhui
doi: 10.1145/3571856
pmid: N/A
Matrix factorization functions are used in many areas and often play an important role in the overall performance of the applications. In the LAPACK library, matrix factorization functions are implemented with a blocked factorization algorithm, shifting most of the workload to the high-performance Level-3 BLAS functions. But the non-blocked part, the panel factorization, becomes the performance bottleneck, especially for small- and medium-size matrices, which are the common case in many real applications. On the new Sunway many-core platform, the performance bottleneck of panel factorization can be alleviated by keeping the panel in the LDM for the panel factorization. Therefore, we propose a new framework for implementing matrix factorization functions on the new Sunway many-core platform, facilitating the in-LDM panel factorization. The framework provides a template class with wrapper functions, which integrates inter-CPE communication for the Level-1 and Level-2 BLAS functions with flexible interfaces and can accommodate different partitioning schemes. With the framework, writing panel factorization code with data residing in the LDM space can be done with much higher productivity. We implemented three functions (dgetrf, dgeqrf, and dpotrf) based on the framework and compared our work with a CPE_BLAS version, which uses the original LAPACK implementation linked with an optimized BLAS library that runs on the CPE mesh. Using the most favorable partitioning, the panel factorization part achieves speedups of up to 26.3, 19.1, and 18.2 for the three matrix factorization functions. For the whole function, our implementation is based on a carefully tuned recursion framework, and we added specific optimizations to some subroutines used in the factorization functions. Overall, we obtained average speedups of 9.76 on dgetrf, 10.12 on dgeqrf, and 4.16 on dpotrf, compared to the CPE_BLAS version. Based on the current template class, our work can be extended to support more categories of linear algebra functions.
Scale-out Systolic Arrays
Yüzügüler, Ahmet Caner; Sönmez, Canberk; Drumond, Mario; Oh, Yunho; Falsafi, Babak; Frossard, Pascal
doi: 10.1145/3572917
pmid: N/A
Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted for array utilization) poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We then evaluate the bandwidth/latency trade-offs in interconnects and show that Butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose Scale-out Systolic Arrays (SOSA), a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. We show that SOSA exhibits scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN inference workloads, and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5×.
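To make the notion of utilization-adjusted throughput concrete, here is a deliberately simplified model, not the one used in the paper. It assumes an operand matrix is cut into tiles the size of the array and that partially filled edge tiles still occupy a whole array; pipeline fill, interconnect, and memory effects are ignored. The function names and the example dimensions are hypothetical.

```python
import math


def array_utilization(rows, cols, array_rows, array_cols):
    """Fraction of processing elements doing useful work when a rows x cols
    operand is tiled onto array_rows x array_cols arrays (simplified model)."""
    tiles_r = math.ceil(rows / array_rows)
    tiles_c = math.ceil(cols / array_cols)
    # Every tile occupies a full array, even if it is only partly filled.
    occupied = tiles_r * array_rows * tiles_c * array_cols
    return (rows * cols) / occupied


def effective_throughput_per_watt(peak_tops, watts, utilization):
    """Peak throughput per Watt scaled by the achieved utilization."""
    return peak_tops * utilization / watts
```

Under this model a 96×96 operand uses 56.25% of a single 128×128 array but 100% of nine 32×32 arrays, which illustrates why array granularity matters for effective throughput.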
Source Matching and Rewriting for MLIR Using String-Based Automata
Espindola, Vinicius; Zago, Luciano; Yviquel, Hervé; Araujo, Guido
doi: 10.1145/3571283
pmid: N/A
A typical compiler flow relies on a uni-directional sequence of translation/optimization steps that lower the program abstract representation, making it hard to preserve higher-level program information across each transformation step. On the other hand, modern ISA extensions and hardware accelerators can benefit from the compiler’s ability to detect and raise program idioms to acceleration instructions or optimized library calls. Although recent works based on Multi-Level IR (MLIR) have been proposed for code raising, they rely on specialized languages, compiler recompilation, or in-depth dialect knowledge. This article presents Source Matching and Rewriting (SMR), a user-oriented source-code-based approach for MLIR idiom matching and rewriting that does not require a compiler expert’s intervention. SMR uses a two-phase automaton-based DAG-matching algorithm inspired by early work on tree-pattern matching. First, the idiom Control-Dependency Graph (CDG) is matched against the program’s CDG to rule out code fragments that do not have a control-flow structure similar to the desired idiom. Second, candidate code fragments from the previous phase have their Data-Dependency Graphs (DDGs) constructed and matched against the idiom DDG. Experimental results show that SMR can effectively match idioms from Fortran (FIR) and C (CIL) programs while raising them as BLAS calls to improve performance. Additional experiments also show performance improvements when using SMR to enable code replacement in areas like approximate computing and hardware acceleration.
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers
Chen, Dongwei; Tong, Dong; Yang, Chun; Yi, Jiangfang; Cheng, Xu
doi: 10.1145/3579854
pmid: N/A
Page-based virtual memory relies on TLBs to accelerate address translation. The gap between application working sets and TLB capacity continues to grow, causing many costly TLB misses and making the TLB a performance bottleneck. Previous studies seek to narrow the gap by exploiting the contiguity of physical pages. One promising solution is to group pages that are both virtually and physically contiguous into a memory range. Recording range translations can greatly increase the TLB reach, but ranges are also hard to index because they have arbitrary bounds. The processor has to compare against all the boundaries to determine which range an address falls in, which restricts the usage of memory ranges. In this article, we propose a tagged-pointer-based scheme, FlexPointer, to solve the range indexing problem. The core insight of FlexPointer is that large memory objects are rare, so we can create memory ranges based on such objects and assign each of them a unique ID. With the range ID integrated into pointers, we can index the range TLB with IDs and greatly simplify its structure. Moreover, because the ID is stored in the unused bits of a pointer and is not manipulated by the address generation, we can shift the range lookup to an earlier stage, working in parallel with the address generation. According to our trace-based simulation results, FlexPointer can eliminate nearly all L1 TLB misses and page walks for a variety of memory-intensive workloads. Compared with a 4K-page baseline system, FlexPointer shows a 14% performance improvement on average and up to 2.8x speedup in the best case. For other workloads, FlexPointer shows no performance degradation.
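The indexing idea can be sketched in software as follows. This is only an illustration of looking up a range by an ID carried in the pointer instead of comparing against every range boundary; the 48-bit address width, the 16-bit ID field, and all names (`tag_pointer`, `split_pointer`, `RangeTLB`) are assumptions for the example rather than the paper's hardware design.

```python
ADDR_BITS = 48                    # assumption: low 48 bits hold the virtual address
ADDR_MASK = (1 << ADDR_BITS) - 1
ID_MASK = (1 << 16) - 1           # assumption: upper 16 unused bits hold the range ID


def tag_pointer(vaddr, range_id):
    """Embed a range ID in the unused upper bits of a pointer."""
    return ((range_id & ID_MASK) << ADDR_BITS) | (vaddr & ADDR_MASK)


def split_pointer(ptr):
    """Return (range_id, virtual_address) from a tagged pointer."""
    return (ptr >> ADDR_BITS) & ID_MASK, ptr & ADDR_MASK


class RangeTLB:
    """Hypothetical range TLB indexed directly by range ID."""

    def __init__(self):
        self.entries = {}  # range_id -> (virt_base, virt_limit, phys_base)

    def insert(self, range_id, virt_base, size, phys_base):
        self.entries[range_id] = (virt_base, virt_base + size, phys_base)

    def translate(self, ptr):
        """Translate a tagged pointer, or return None to signal a fallback
        to the ordinary page-based translation path."""
        range_id, vaddr = split_pointer(ptr)
        entry = self.entries.get(range_id)
        if entry is None:
            return None
        virt_base, virt_limit, phys_base = entry
        if not (virt_base <= vaddr < virt_limit):
            return None
        # Ranges are virtually and physically contiguous, so one offset suffices.
        return phys_base + (vaddr - virt_base)
```

Because the ID sits outside the address bits, the lookup key is available before the effective address is computed, which is what allows the range lookup to overlap with address generation.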
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models
Benmeziane, Hadjer; Ouarnoughi, Hamza; El Maghraoui, Kaoutar; Niar, Smail
doi: 10.1145/3579853
pmid: N/A
Deep learning (DL) models such as convolutional neural networks (ConvNets) are being deployed to solve various computer vision and natural language processing tasks at the edge. It is a challenge to find the right DL architecture that simultaneously meets the accuracy, power, and performance budgets of such resource-constrained devices. Hardware-aware Neural Architecture Search (HW-NAS) has recently gained steam by automating the design of efficient DL models for a variety of target hardware platforms. However, such algorithms require excessive computational resources. Thousands of GPU days are required to evaluate and explore an architecture search space such as FBNet [45]. State-of-the-art approaches propose using surrogate models to predict architecture accuracy and hardware performance to speed up HW-NAS. Existing approaches use independent surrogate models to estimate each objective, resulting in non-optimal Pareto fronts. In this article, HW-PR-NAS, a novel Pareto rank-preserving surrogate model for edge computing platforms, is presented. Our model integrates a new loss function that ranks the architectures according to their Pareto rank, regardless of the actual values of the various objectives. We employ a simple yet effective surrogate model architecture that can be generalized to any standard DL model. We then present an optimized evolutionary algorithm that uses and validates our surrogate model. Our approach has been evaluated on seven edge hardware platforms from various classes, including ASIC, FPGA, GPU, and multi-core CPU. The evaluation results show that HW-PR-NAS achieves up to 2.5× speedup compared to state-of-the-art methods while achieving 98% near the actual Pareto front.
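For readers unfamiliar with the term, the Pareto rank of a candidate can be computed with standard non-dominated sorting, sketched below. This shows only what the ranking target means; it is not the paper's loss function or surrogate model. The sketch assumes every objective is minimized, and the function names are chosen for this example.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are assumed to be minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_ranks(points):
    """Assign rank 1 to the non-dominated points, rank 2 to the points that
    become non-dominated once rank 1 is removed, and so on."""
    ranks = [0] * len(points)
    remaining = set(range(len(points)))
    rank = 1
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i])
                       for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks
```

For example, with hypothetical (error, latency) pairs `[(0.10, 5.0), (0.08, 9.0), (0.12, 6.0), (0.09, 10.0), (0.20, 20.0)]`, the ranks are `[1, 1, 2, 2, 3]`. A rank-preserving surrogate only needs to reproduce this ordering, not the objective values themselves.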