Deep Learning Recommendation Model
Image Source: https://www.xenonstack.com/use-cases/recommendation-system/
With the sharply increasing volume of user data, Deep Learning Recommendation Models (DLRMs) have become indispensable infrastructure at large technology companies. It is reported that DLRMs account for more than 50% of training demand and 80% of inference demand in Meta's data centers.
Different from traditional compute-intensive neural network architectures, DLRMs consist not only of compute-intensive multi-layer perceptrons (MLPs) but also of memory-intensive embedding tables. In industry-scale DLRM workloads, the footprint of the embedding tables is on the order of terabytes, which easily surpasses the limited High Bandwidth Memory (HBM) of a single GPU device and makes it challenging to train a large-scale DLRM efficiently.
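To see why the embedding tables dominate memory, a back-of-the-envelope calculation helps. The table counts, vocabulary sizes, and embedding dimension below are illustrative assumptions, not figures from the text:

```python
# Rough footprint of a hypothetical industry-scale set of embedding tables.
num_tables = 100              # assumed number of sparse features
rows_per_table = 100_000_000  # assumed vocabulary size per table
embedding_dim = 128           # assumed embedding dimension
bytes_per_value = 4           # fp32

footprint_bytes = num_tables * rows_per_table * embedding_dim * bytes_per_value
footprint_tb = footprint_bytes / 1024**4
print(f"{footprint_tb:.1f} TB")  # terabyte scale, vs. tens of GB of HBM per GPU
```

Even with these modest assumptions the tables alone reach several terabytes, while a single GPU typically offers well under 100 GB of HBM.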
Direction 1: Resource Efficient DLRM Training
[SC’22] EL-Rec: Efficient Large-scale Recommendation Model Training via Tensor-train Embedding
Existing DLRM training systems require a large number of GPUs due to the memory-intensive embedding tables. We propose EL-Rec, an efficient computing framework harnessing the Tensor-train (TT) technique to democratize the training of large-scale DLRMs with limited GPU resources. Specifically, EL-Rec optimizes TT decomposition based on the key computation primitives of embedding tables and implements a high-performance compressed embedding table that serves as a drop-in replacement for the PyTorch API. We also introduce an index reordering technique to harvest performance gains from both local and global information of the training inputs. A pipelined training paradigm is implemented to eliminate the communication overhead between host memory and the training worker. Comprehensive experiments demonstrate that EL-Rec can handle the largest publicly available DLRM dataset with a single GPU and achieves 3x speedup over state-of-the-art DLRM frameworks.
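The core idea of TT compression is to replace the dense N x D table with a chain of small 3-D cores, materializing rows on demand. The sketch below illustrates this in NumPy with assumed shapes and ranks; EL-Rec's actual factorization and kernels differ:

```python
import numpy as np

# Illustrative TT factorization of a 1M x 128 embedding table (assumed shapes).
N, D = 1_000_000, 128
n = (100, 100, 100)   # factor the row count: 100 * 100 * 100 = 10^6
d = (4, 4, 8)         # factor the embedding dim: 4 * 4 * 8 = 128
r = (1, 16, 16, 1)    # assumed TT ranks (boundary ranks are 1)

# One core per factor pair, shape (r[k], n[k] * d[k], r[k+1])
cores = [np.random.randn(r[k], n[k] * d[k], r[k + 1]) * 0.01 for k in range(3)]

def tt_lookup(row):
    """Materialize one embedding row by contracting the TT cores."""
    # Mixed-radix digits of the row index, most significant first
    i1, rem = divmod(row, n[1] * n[2])
    i2, i3 = divmod(rem, n[2])
    out = np.ones((1, 1))
    for k, idx in enumerate((i1, i2, i3)):
        # Slice this digit's slab: shape (r[k], d[k], r[k+1]), then contract
        slab = cores[k].reshape(r[k], n[k], d[k], r[k + 1])[:, idx]
        out = np.einsum('ab,bcd->acd', out, slab).reshape(-1, r[k + 1])
    return out.reshape(D)

dense_params = N * D                       # 128M values for the dense table
tt_params = sum(c.size for c in cores)     # ~122K values for the TT cores
vec = tt_lookup(123_456)                   # one reconstructed embedding row
```

The trade-off is explicit: storage drops by roughly three orders of magnitude under these ranks, at the cost of a small chain of matrix contractions per lookup, which is why optimizing these contraction primitives matters.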
Direction 2: Finding the Optimal Embedding Table Placement
[In Submission]
Current solutions for large-scale DLRM training on multi-GPU platforms are still inefficient due to unbalanced workload partitioning and intensive inter-GPU communication. This is because the embedding tables in a DLRM usually exhibit significant heterogeneity, not only in table length but also in the access frequency of different rows. Such heterogeneity makes it challenging to obtain an embedding table partition that is balanced in terms of both memory and computation.
In this work, we propose OPER, an optimality-guided heuristic algorithm that efficiently generates a near-optimal embedding table (EMT) partition, placement, and duplication strategy. We also implement a SHMEM-based embedding table training system that supports fine-grained EMT partition and hybrid-parallelism EMT training. Comprehensive experiments reveal that OPER achieves on average 3.4x and 5.1x speedup on training and inference, respectively, over state-of-the-art DLRM frameworks.
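To make the balancing problem concrete, the toy sketch below places heterogeneous tables on GPUs with a generic greedy longest-processing-time heuristic: assign each table, heaviest first, to the currently least-loaded device. This is a textbook baseline, not OPER's actual algorithm, and the per-table costs are made up:

```python
import heapq

def place_tables(costs, num_gpus):
    """Greedily assign each table (heaviest first) to the least-loaded GPU.

    costs: per-table cost estimate (e.g. rows weighted by access frequency).
    Returns a dict mapping table index -> GPU id.
    """
    heap = [(0.0, g) for g in range(num_gpus)]  # (current load, gpu_id)
    heapq.heapify(heap)
    placement = {}
    for t in sorted(range(len(costs)), key=lambda t: -costs[t]):
        load, g = heapq.heappop(heap)           # least-loaded GPU so far
        placement[t] = g
        heapq.heappush(heap, (load + costs[t], g))
    return placement

# Skewed per-table costs, mimicking the heterogeneity described above
costs = [100, 90, 10, 8, 5, 4, 2, 1]
print(place_tables(costs, 2))
```

With these costs the greedy heuristic happens to balance both GPUs exactly (110 each); real workloads also need the duplication of hot tables and communication cost that a simple load heuristic ignores.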