TY - JOUR
AU1 - Wang, Jialun
AU2 - Pang, Wenhao
AU3 - Weng, Chuliang
AU4 - Zhou, Aoying
AB - In analytical queries, a number of important operators such as JOIN and GROUP BY are suitable for parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient because of the small capacity of global memory and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked with high-bandwidth connectors, but this greatly increases the cost. We use unified memory (UM), provided by NVIDIA CUDA (Compute Unified Device Architecture), to make it possible to accelerate large-scale queries on just one GPU, but we observe that the transfer throughput between host and UM, a transfer that happens before kernel execution, is often significantly lower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread or a fixed number of threads for data copy, which leads to inefficient transfer and heavily slows down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, taking data transfer into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x on average and up to 2.09x the performance of the baseline system.
TI - D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems
JF - Frontiers of Computer Science
DO - 10.1007/s11704-022-2160-z
DA - 2023-08-01
UR - https://www.deepdyve.com/lp/springer-journals/d-cubicle-boosting-data-transfer-dynamically-for-large-scale-GL9nIr6Qqe
VL - 17
IS - 4
DP - DeepDyve
ER -