Episode 10 of the Stanford MLSys Seminar Series: Horovod and the Evolution of Deep Learning at Scale
Speaker: Travis Addair
Abstract: Deep neural networks are pushing the state of the art in numerous machine learning research domains, from computer vision to natural language processing and even tabular business data. However, scaling such models to train efficiently on large datasets imposes a unique set of challenges that traditional batch data processing systems were not designed to solve. Horovod is an open-source framework that scales models written in TensorFlow, PyTorch, and MXNet to train seamlessly on hundreds of GPUs in parallel. In this talk, we'll explain the concepts and unique constraints that led to the development of Horovod at Uber, and discuss how the latest trends in deep learning research are informing the future direction of the project within the Linux Foundation. We'll explore how Horovod fits into production ML workflows in industry, and how tools like Spark and Ray can combine with Horovod to make productionizing deep learning at scale in remote data centers as simple as running locally on your laptop. Finally, we'll share some thoughts on what's next for large-scale deep learning, including new distributed training architectures and how the larger ecosystem of production ML tooling is evolving.
Speaker bio: Travis Addair is a software engineer at Uber leading the Deep Learning Training team as part of the Michelangelo machine learning platform. He is the lead maintainer for the Horovod open source project and chairs its Technical Steering Committee within the Linux Foundation. In the past, he’s worked on scaling machine learning systems at Google and Lawrence Livermore National Lab.
1:25 Horovod and the Evolution of Deep Learning at Scale
3:19 Distributed Deep Learning
10:44 Introducing Horovod