Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

Swanand Kadhe, O. Ozan Koyluoglu, Kannan Ramchandran

Research output: Chapter in Book/Report/Conference proceedingConference contribution

3 Scopus citations

Abstract

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called stragglers, and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block-length determines the computation load, dimension determines the communication overhead, and minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade-off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability as compared to the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16% as compared to prior gradient coding schemes.

Original languageEnglish (US)
Title of host publication2020 IEEE International Symposium on Information Theory, ISIT 2020 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages2634-2639
Number of pages6
ISBN (Electronic)9781728164328
DOIs
StatePublished - Jun 2020
Externally publishedYes
Event2020 IEEE International Symposium on Information Theory, ISIT 2020 - Los Angeles, United States
Duration: Jul 21 2020Jul 26 2020

Publication series

NameIEEE International Symposium on Information Theory - Proceedings
Volume2020-June
ISSN (Print)2157-8095

Conference

Conference2020 IEEE International Symposium on Information Theory, ISIT 2020
Country/TerritoryUnited States
CityLos Angeles
Period7/21/207/26/20

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Modeling and Simulation
  • Applied Mathematics

Fingerprint

Dive into the research topics of 'Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning'. Together they form a unique fingerprint.

Cite this