GPU Sparse Matrix Kernel Optimization
Sparsity is pervasive in deep learning: the sparse weights of pruned neural networks, the highly sparse graph structure in graph neural networks, the input-dependent sparsity in self-attention, and more. Exploiting this sparsity requires efficient sparse matrix kernels on GPUs.
Our research on optimizing sparse matrix kernels includes optimizing the memory access pattern (SC’22), balancing the workload (SRC’21), and using auto-tuning techniques to handle the input-dependent performance of the SpMM kernel (DAC’22).
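To make the memory-access side concrete, here is a minimal CUDA sketch of a row-parallel CSR SpMM (C = A × B, with A sparse and B dense). It is not the published kernel; it only shows how assigning consecutive threads to consecutive dense columns keeps loads from B and stores to C coalesced, which is the baseline the work above improves on. The kernel name and launch configuration are illustrative assumptions.

```cuda
// Minimal CSR SpMM sketch: C = A_sparse (M x K, CSR) * B_dense (K x N).
// One thread block per sparse row; consecutive threads handle consecutive
// columns of B and C, so dense-matrix accesses are coalesced.
// Launch example: grid = dim3(M, (N + 127) / 128), block = dim3(128).
__global__ void csr_spmm_row_kernel(int M, int N,
                                    const int* __restrict__ row_ptr,
                                    const int* __restrict__ col_idx,
                                    const float* __restrict__ values,
                                    const float* __restrict__ B,
                                    float* __restrict__ C) {
    int row = blockIdx.x;                               // sparse row for this block
    int col = blockIdx.y * blockDim.x + threadIdx.x;    // dense column for this thread
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj) {
        // values[jj] and col_idx[jj] are shared by the whole block;
        // B[col_idx[jj] * N + col] is a coalesced load across the warp.
        acc += values[jj] * B[col_idx[jj] * N + col];
    }
    C[row * N + col] = acc;                             // coalesced store
}
```

A fixed one-block-per-row mapping like this degrades on skewed nonzero distributions, which is exactly what the adaptive workload balancing (SRC’21) and input-aware auto-tuning (DAC’22) target.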
Reference Work
GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks.
Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction. ACM SRC’21 [paper]
Heuristic Adaptability to Input Dynamics for SpMM on GPUs. DAC’22 (Best Paper Nominee) [paper][code]
Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
Graph Neural Networks (GNNs), the emerging trend in graph-based deep learning, excel at generating high-quality node feature vectors (embeddings). However, existing one-size-fits-all GNN implementations cannot keep up with evolving GNN architectures, ever-increasing graph sizes, and diverse node embedding dimensionalities.
We introduce 1) GNNAdvisor, an adaptive and efficient runtime system that accelerates various GNN workloads on single-GPU platforms; 2) TC-GNN, the first GPU Tensor Core Unit (TCU)-based GNN acceleration framework, which reconciles “sparse” GNN computation with high-performance “dense” TCUs; and 3) a study of GNN training optimization at the computational graph level, together with a compiler that saves computation, IO, and memory.
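To give a flavor of how “sparse” aggregation can be fed to “dense” Tensor Cores, the sketch below performs one 16×16×16 WMMA multiply on tiles that are assumed to have already been condensed from the sparse adjacency and the neighbor features (that condensation step is where the framework's actual work lies and is omitted here). It assumes an sm_70+ GPU, a launch with exactly one warp per tile, and illustrative names throughout.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative TCU tile aggregation: multiply a pre-condensed 16x16 adjacency
// tile by a 16x16 neighbor-feature tile with one Tensor Core MMA.
// Launch with a single warp (32 threads) per tile, e.g. <<<1, 32>>>.
__global__ void tcu_tile_aggregate(const half* __restrict__ A_tile,  // 16x16, row-major
                                   const half* __restrict__ B_tile,  // 16x16, row-major
                                   float* __restrict__ C_tile) {     // 16x16 output
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                 // zero partial sums
    wmma::load_matrix_sync(a_frag, A_tile, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B_tile, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // one 16x16x16 TCU MMA
    wmma::store_matrix_sync(C_tile, c_frag, 16, wmma::mem_row_major);
}
```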
Reference Work
GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. OSDI’21. [paper][code]
TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. OSDI’22 (Poster) [Poster][code]
Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective. MLSys’22. [paper][code]
Quantized Graph Neural Network and its Acceleration on GPUs
With the increasing popularity of graph-based learning, Graph Neural Networks (GNNs) have attracted wide attention from research and industry thanks to their high accuracy. However, existing GNNs suffer from large memory footprints (e.g., node embedding features), which hinders their deployment on memory-constrained devices such as widely deployed IoT devices.
To this end, we propose 1) a specialized GNN quantization scheme, SGQuant, to systematically reduce GNN memory consumption; and 2) the first Tensor Core (TC)-based computing framework, QGTC, to support any-bitwidth computation for quantized GNNs (QGNNs) on GPUs.
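As a rough illustration of the quantization side, the sketch below uniformly quantizes node embeddings to a configurable bitwidth with a per-node scale. The component-wise bitwidth selection in SGQuant and the bit-packed any-bitwidth arithmetic in QGTC are not shown; all names and the storage layout are assumptions.

```cuda
#include <cstdint>

// Uniformly quantize FP32 node embeddings to `bits`-bit integers (bits <= 8),
// one thread per embedding element, with a per-node scale factor.
// Quantized values are stored one per byte; real systems pack them tighter.
__global__ void quantize_embeddings(int num_nodes, int dim, int bits,
                                    const float* __restrict__ x,      // num_nodes x dim
                                    const float* __restrict__ scale,  // one scale per node
                                    uint8_t* __restrict__ q) {        // quantized output
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_nodes * dim) return;

    int node = idx / dim;
    int qmax = (1 << bits) - 1;                         // e.g. 15 for 4-bit quantization
    int qi   = __float2int_rn(x[idx] / scale[node]);    // scale and round to nearest
    q[idx]   = (uint8_t)max(0, min(qmax, qi));          // clamp to [0, 2^bits - 1]
}
```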
Reference Work
Efficient Graph Neural Network Accelerator
Graph convolutional networks (GCNs) have emerged as a promising way to learn inductive representations of graph data, which is common in widespread applications such as e-commerce, social networks, and knowledge graphs. However, learning from graphs is non-trivial because of its mixed computation model, which involves both graph analytics and neural network computing. To this end, we decompose GCN learning into two hierarchical paradigms: graph-level and node-level computing. We propose a lightweight graph reordering methodology, incorporated into a GCN accelerator architecture with a customized cache design, to fully exploit graph-level data reuse. We also propose a mapping methodology aware of data reuse and task-level parallelism to handle various graph inputs effectively. Results show that the Rubik accelerator design improves energy efficiency by 26.3x to 1375.2x over GPU platforms across different datasets and GCN models.
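As a simple software stand-in for the graph-level reordering idea, the host-side sketch below relabels vertices in descending-degree order from a CSR row pointer, so that high-reuse (high-degree) vertices become adjacent in memory. Rubik's actual lightweight reordering and cache mapping are more involved; this function is illustrative only.

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Host-side degree-based reordering: returns a mapping old vertex ID -> new ID,
// ranking vertices by descending degree computed from the CSR row_ptr.
std::vector<int> degree_reorder(const std::vector<int>& row_ptr) {
    int n = (int)row_ptr.size() - 1;
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);            // old vertex IDs 0..n-1
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) > (row_ptr[b + 1] - row_ptr[b]);
    });
    std::vector<int> new_id(n);
    for (int rank = 0; rank < n; ++rank) new_id[order[rank]] = rank;
    return new_id;
}
```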
Reference Work
Rubik: A Hierarchical Architecture for Efficient Graph Learning.
TCAD’21. [paper]
Scaling Graph Neural Networks to Multi-GPU Platforms
The ever-increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN solutions suffer from inferior performance due to imbalanced computation and inefficient communication. To this end, we propose MGG, a novel system design that accelerates GNNs on multi-GPU platforms via a GPU-centric software pipeline. MGG explores the potential of hiding remote memory access latency in GNN workloads through fine-grained computation-communication pipelining. Extensive experiments show that MGG outperforms state-of-the-art multi-GPU systems across various settings: (i) on average 5.01x and 10.65x faster than the UVM-based GNN design and the ROC system, respectively, on full-graph GNN training; (ii) on average 1.57x and 1.35x faster than DGL and GNNAdvisor, respectively, on batched GNN training.
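The sketch below illustrates the overlap idea at a coarse, stream-level granularity using plain CUDA APIs: a peer-to-peer copy of remote neighbor features proceeds on one stream while local aggregation runs on another. MGG itself pipelines at a much finer granularity inside the kernel; the buffer names, the placeholder kernel, and the single-chunk structure are assumptions for illustration (peer access between the two GPUs is assumed to be enabled).

```cuda
#include <cuda_runtime.h>

// Placeholder local-aggregation kernel; a stand-in for real neighbor aggregation.
__global__ void aggregate_local(const float* __restrict__ feats,
                                float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * feats[i];
}

// One pipelined step: pull a chunk of remote features while computing locally.
void pipelined_step(const float* local_feats, float* out, int n_local,
                    float* recv_buf, const float* remote_feats,
                    size_t remote_bytes, int this_dev, int peer_dev) {
    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Start fetching the next chunk of remote neighbor features (communication).
    cudaMemcpyPeerAsync(recv_buf, this_dev, remote_feats, peer_dev,
                        remote_bytes, comm);

    // Aggregate local neighbors while the copy is in flight (computation).
    aggregate_local<<<(n_local + 255) / 256, 256, 0, compute>>>(local_feats, out, n_local);

    // Wait for both stages; the fetched remote chunk is consumed in the next step.
    cudaStreamSynchronize(comm);
    cudaStreamSynchronize(compute);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(comm);
}
```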
Reference Work
MGG: Accelerating Multi-GPU Graph Neural Networks via Fine-grained Communication-Computation Pipelining. OSDI’23