Sparse Neural Network Acceleration on GPUs

Exploiting tensor cores for sparse DNNs is challenging. Existing pruning approaches fail to balance accuracy and efficiency: random sparsity preserves model quality well but cannot be accelerated with tensor cores, while highly structured block-wise sparsity maps well to tensor cores but suffers severe accuracy loss.

In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. Our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. At 75% sparsity, we accelerate the computation-intensive layers of Transformers by 1.81x, 4.18x, and 1.90x on NVIDIA V100, T4, and A100 GPUs, respectively.
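To make the pattern concrete, below is a minimal NumPy sketch of shuffled block-wise pruning, written for illustration only. The function name shfl_bw_prune, the column-sorting heuristic, and the block/sparsity settings are assumptions made here, not the permutation search or the GPU kernels used in Shfl-BW.

```python
import numpy as np

def shfl_bw_prune(W, block=(16, 16), sparsity=0.75):
    """Toy shuffled block-wise pruning: permute columns, then prune blocks.

    The column ordering below (sort by L1 norm) is a simple stand-in for the
    permutation search in the paper; the point is only that a permutation can
    cluster important weights into the blocks that survive pruning.
    """
    rows, cols = W.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0

    # 1) Column permutation (heuristic stand-in).
    perm = np.argsort(np.abs(W).sum(axis=0))
    Wp = W[:, perm]

    # 2) Score each (br x bc) block by its L1 norm and keep the top fraction.
    blocks = Wp.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))
    k = int(round(scores.size * (1.0 - sparsity)))
    keep = np.zeros(scores.shape, dtype=bool)
    keep.flat[np.argsort(scores, axis=None)[-k:]] = True

    # 3) Zero out the pruned blocks.
    mask = np.repeat(np.repeat(keep, br, axis=0), bc, axis=1)
    return Wp * mask, perm

# At inference time the permutation can be folded into an activation gather,
# so the sparse GEMM still sees a regular block pattern suited to tensor cores.
W = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(64, 8).astype(np.float32)
Wp, perm = shfl_bw_prune(W)
y = Wp @ x[perm]
```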

Reference Work.

  • Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning. DAC’22. [paper][code]

Neural Network Architecture and Kernel Optimization

With the growing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in domains such as image classification and object detection, achieving far higher accuracy than traditional statistical methods. To exploit the potential of CNN models, substantial research and industry effort has been devoted to optimizing CNNs.

We propose 1) the 3D-Receptive Field (3DRF), an explainable and easy-to-compute metric that estimates the quality of a CNN architecture and guides the architecture search process; 2) DSXplore, the first optimized design for exploring DSCs on CNNs; and 3) a big-little dual-module inference scheme that dynamically skips unnecessary memory accesses and computations to accelerate DNN inference.
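For context on what a receptive-field-style metric measures, here is a minimal sketch of the classical spatial receptive-field recursion; 3DRF extends this notion, and the channel-aware formulation and scoring used in the paper are not reproduced here.

```python
def receptive_field(layers):
    """Classical spatial receptive-field recursion.

    `layers` is a list of (kernel_size, stride) pairs, one per conv/pool
    layer from input to output.  Returns the receptive field of one output
    activation and the cumulative stride ("jump") of the stack.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the window by (k-1)*jump
        jump *= s              # strides compound multiplicatively
    return rf, jump

# Example: three 3x3 convolutions, the second with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # -> (9, 2)
```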

Reference Work.

  • An Efficient Quantitative Approach for Optimizing Convolutional Neural Networks. CIKM’21. [paper][bibtex]

  • DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions. IPDPS’21. [paper][bibtex][code]

  • Boosting Deep Neural Network Efficiency with Dual-Module Inference. ICML’20. [paper][bibtex]

Quantized Neural Network Acceleration on GPU Tensor Cores

Accelerating neural networks with quantization has been widely studied over the years. Unfortunately, prior efforts that use diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually constrained by the limited set of precisions natively supported on GPUs (e.g., int1 and int4).

To break these restrictions, we introduce 1) APNN-TC, the first Arbitrary Precision Neural Network framework, to fully exploit quantization benefits on Ampere GPU Tensor Cores; and 2) EGEMM-TC, the first Emulated GEMM on Tensor Cores, which extends Tensor Cores to scientific computing applications without compromising their precision requirements.
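To illustrate the kind of decomposition arbitrary-precision GEMM relies on, the sketch below assembles an integer GEMM from 1-bit partial products; the function names are illustrative, and the actual APNN-TC kernels map each 1-bit partial GEMM onto Tensor Core instructions with many further optimizations. EGEMM-TC emulates higher precision through a related decomposition of floating-point operands, which is not shown here.

```python
import numpy as np

def bit_planes(M, bits):
    """Decompose a matrix of unsigned integers into its binary bit-planes."""
    return [(M >> b) & 1 for b in range(bits)]

def apn_gemm(W, X, w_bits, x_bits):
    """Arbitrary-precision integer GEMM assembled from 1-bit partial GEMMs.

    Each partial product below corresponds to one 1-bit tensor-core GEMM in
    a real kernel; plain NumPy matmuls stand in for them here.
    """
    acc = np.zeros((W.shape[0], X.shape[1]), dtype=np.int64)
    for i, Wi in enumerate(bit_planes(W, w_bits)):
        for j, Xj in enumerate(bit_planes(X, x_bits)):
            acc += (Wi @ Xj) << (i + j)   # weight the partial product by 2^(i+j)
    return acc

# 1-bit weights and 2-bit activations, as in the example above.
W = np.random.randint(0, 2, size=(8, 16))
X = np.random.randint(0, 4, size=(16, 4))
assert np.array_equal(apn_gemm(W, X, w_bits=1, x_bits=2), W @ X)
```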

Reference Work.

  • APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores. SC’21. [paper][bibtex][code]

  • EGEMM-TC: Accelerating Scientific Computing on Tensor Cores with Extended Precision. PPoPP’21. [paper][bibtex]