Efficient Inference and Training of Large Neural Network Models

The memory consumption and computational cost of state-of-the-art deep neural network models are increasing dramatically, so applying efficient deep learning techniques to both inference and training is increasingly important. This talk presents recent progress on this work.

First, get an introduction to learned threshold pruning (LTP) for accelerated inference.
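The core idea behind threshold-based pruning is to learn, per layer, a cutoff below which weights are driven to zero, using a soft (differentiable) mask so the threshold itself can be trained along with the weights. The following Python sketch illustrates that idea under assumed names and hyperparameters (SoftThresholdLinear, temperature); it is a minimal illustration, not the LTP implementation presented in the talk.

```python
# Minimal sketch of threshold-based pruning with a learnable, per-layer
# threshold, in the spirit of LTP. Names and hyperparameters here are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class SoftThresholdLinear(nn.Module):
    def __init__(self, in_features, out_features, init_threshold=1e-3, temperature=1e-4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Learnable pruning threshold shared by all weights of the layer.
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, x):
        w = self.linear.weight
        # Soft mask: ~0 for weights whose squared magnitude is below the
        # threshold, ~1 above it; differentiable so the threshold can train.
        mask = torch.sigmoid((w * w - self.threshold) / self.temperature)
        return nn.functional.linear(x, w * mask, self.linear.bias)

layer = SoftThresholdLinear(128, 64)
out = layer(torch.randn(8, 128))  # forward pass with soft pruning applied
# Fraction of weights currently below the learned magnitude cutoff.
sparsity = (layer.linear.weight.abs() < layer.threshold.sqrt()).float().mean()
```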

Then learn about staged training for transformers and topology-aware structured communications (TASC), which are designed to accelerate training.
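As a rough illustration of what staged training can look like, the sketch below starts from a shallow transformer and grows its depth partway through training by duplicating existing layers, so the larger model is warm-started rather than trained from scratch. The growth rule and schedule are assumptions made for illustration, not the specific staged-training or TASC methods covered in the talk.

```python
# Minimal sketch of one staged-training idea: train a shallow transformer
# first, then double its depth by stacking a copy of each existing layer
# and continue training. The duplication rule is an illustrative assumption.
import copy
import torch.nn as nn

def grow_depth(encoder: nn.TransformerEncoder) -> nn.TransformerEncoder:
    """Double the number of layers by inserting a copy after each existing layer."""
    new_layers = []
    for layer in encoder.layers:
        new_layers.append(layer)
        new_layers.append(copy.deepcopy(layer))  # warm-start the new layer
    encoder.layers = nn.ModuleList(new_layers)
    encoder.num_layers = len(new_layers)
    return encoder

base_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
model = nn.TransformerEncoder(base_layer, num_layers=3)
# ... train the 3-layer model in the first stage, then:
model = grow_depth(model)  # continue training with 6 layers
```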

Finally, see how these solutions enable efficient large recommendation models: deep quantized recommendation models (DQRM) apply quantization systematically, and the sparsity of a deep learning recommendation model (DLRM) is exploited to better support hot embeddings.
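To make the quantization idea concrete, here is a minimal sketch of per-row int8 quantization for a DLRM-style embedding table, where each batch touches only a sparse set of (possibly hot) rows. The bit width, per-row scaling, and helper names are assumptions for illustration rather than the DQRM implementation.

```python
# Minimal sketch of per-row int8 quantization for a DLRM-style embedding
# table. Bit width and scaling choices are illustrative assumptions.
import torch

def quantize_embedding_rows(table: torch.Tensor):
    """Quantize each embedding row to int8 with its own scale factor."""
    scales = table.abs().amax(dim=1, keepdim=True) / 127.0
    scales = scales.clamp(min=1e-8)  # avoid division by zero for all-zero rows
    q = torch.clamp((table / scales).round(), -128, 127).to(torch.int8)
    return q, scales

def dequantize_rows(q: torch.Tensor, scales: torch.Tensor, indices: torch.Tensor):
    """Look up and dequantize only the rows a batch actually touches."""
    return q[indices].float() * scales[indices]

table = torch.randn(10_000, 64)                  # 10k items, 64-dim embeddings
q, scales = quantize_embedding_rows(table)
batch_indices = torch.tensor([3, 42, 42, 9981])  # sparse, possibly "hot" rows
vectors = dequantize_rows(q, scales, batch_indices)
```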

These methods achieve strong performance and generalize well.

Speakers

Zhen Dong received a bachelor of science degree from Peking University in 2018 and a PhD from the University of California, Berkeley in 2022. He is a postdoctoral researcher at UC Berkeley working with Professor Kurt Keutzer. Zhen received the Outstanding Graduate Award at Peking University and the distinguished Berkeley University Fellowship. His research interests include efficient deep learning, quantization, model compression, and hardware-software codesign.

Kurt Keutzer is a professor of electrical engineering and computer science (EECS) at the University of California, Berkeley, where he is a member of the BAIR Lab and codirector of the Berkeley DeepDrive research consortium. His research covers all aspects of deep learning. Kurt's collaboration on the LARS and LAMB algorithms reduced the training time of ImageNet and BERT to minutes. His Squeeze family of deep neural networks (DNNs) was among the first DNNs suitable for mobile applications. As an entrepreneur, Kurt has been an investor in and advisor to over 30 startups.