ACM SIGMETRICS 2021
June 14-18, 2021
ATP: In-network Aggregation for Multi-tenant Learning
Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes and aggregates gradients. Recent advances in hardware accelerators have shifted the performance bottleneck of training from computation to communication. To speed up DT jobs' communication, we propose ATP, a service for in-network aggregation aimed at modern multi-rack, multi-job DT settings. ATP uses emerging programmable switch hardware to support in-network aggregation at multiple rack switches in a cluster to speed up DT jobs. ATP performs decentralized, dynamic, best-effort aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources. ATP outperforms existing systems, accelerating training throughput by 38%-66% in a cluster shared by multiple DT jobs.
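The best-effort idea above can be sketched in a few lines: aggregate a gradient fragment at the switch when one of its scarce aggregator slots is free, and simply forward the packet for end-host aggregation otherwise. This is an illustrative toy under assumed names (`Switch`, `on_packet`, fragment IDs), not ATP's actual data plane, which runs on programmable switch hardware.

```python
# Toy model of best-effort in-network aggregation with a fixed pool of
# switch aggregator slots shared across jobs (hypothetical structure).

NUM_WORKERS = 4  # workers whose gradients must be summed per fragment

class Switch:
    def __init__(self, num_slots):
        self.slots = {}          # fragment_id -> (partial_sum, count)
        self.capacity = num_slots

    def on_packet(self, fragment_id, values):
        """Aggregate if a slot is available; otherwise forward unchanged
        (best-effort: the end host aggregates what the switch cannot)."""
        if fragment_id not in self.slots and len(self.slots) >= self.capacity:
            return ("forward", values)          # no slot free: pass through
        acc, cnt = self.slots.get(fragment_id, ([0.0] * len(values), 0))
        acc = [a + v for a, v in zip(acc, values)]
        cnt += 1
        if cnt == NUM_WORKERS:                  # all workers seen: emit sum
            del self.slots[fragment_id]
            return ("aggregated", acc)
        self.slots[fragment_id] = (acc, cnt)
        return ("held", None)                   # wait for remaining workers

switch = Switch(num_slots=2)
grads = [[1.0, 2.0]] * NUM_WORKERS             # one fragment from each worker
results = [switch.on_packet("job0-frag0", g) for g in grads]
print(results[-1])  # ('aggregated', [4.0, 8.0])
```

The fallback path is what makes the scheme degrade gracefully under contention: when all slots are busy, traffic still flows, only without the in-network speedup.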
ChonLam Lao is a master's student in Computer Science at IIIS, Tsinghua University, advised by Professor Wenfei Wu. Next year, he will begin his doctoral studies at Harvard University, co-advised by Professor Minlan Yu and Professor Aditya Akella. His research primarily focuses on programmable networks and machine learning systems. His paper "ATP: In-network Aggregation for Multi-tenant Learning" was recently accepted to NSDI 2021, where it received a Best Paper Award.
Michael Lingzhi Li
Forecasting Covid-19 With Application To Vaccine Trial Design and Distribution
To help combat the COVID-19 pandemic and understand the impact of government interventions, we develop DELPHI, a novel epidemiological model. We have applied DELPHI to over 200 regions since early April 2020 with consistently high predictive power, and it is a key contributor to the core CDC ensemble forecast. DELPHI compares favorably with other models and predicted large-scale epidemics in areas such as South Africa and Russia weeks before they materialized. Furthermore, using DELPHI, we can quantify the impact of interventions and provide insights on future virus incidence under different policies. We illustrate how Janssen Pharmaceuticals (J&J) effectively utilized such analysis from DELPHI to optimally select the Phase III trial locations of the first single-dose vaccine, Ad26.Cov2.S, accelerating the trial by 8 weeks while reducing the number of participants needed by 25%. We also demonstrate how DELPHI informed FEMA on optimizing vaccine distribution under constrained supply to minimize the number of pandemic deaths.
Michael Lingzhi Li is a doctoral candidate at the MIT Operations Research Center, advised by Prof. Dimitris Bertsimas. His research interests primarily focus on scalable algorithms that combine machine learning and optimization, with emphasis on real-world applications in both healthcare and supply chain management. He has worked on problems in interpretable machine learning, personalized risk predictions, medical therapy prescription, infectious disease epidemiology, warehouse optimization and labor scheduling. He is the recipient of awards including the 2021 Innovative Applications in Analytics Award, the 2020 INFORMS Pierskalla Award and the 2019 MSOM Best Student Paper Finalist Award.
Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations
We design an algorithm which finds an ϵ-approximate stationary point (with ‖∇F(x)‖ ≤ ϵ) using O(ϵ⁻³) stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under the stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and, surprisingly, that it cannot be improved using stochastic pth-order methods for any p ≥ 2, even when the first p derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding (ϵ,γ)-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.
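For concreteness, the two stationarity notions in the abstract can be written out as follows. These are the standard definitions; the paper's precise conditions may carry additional smoothness constants.

```latex
% F : R^d -> R is the smooth non-convex objective, accessed through a
% stochastic oracle returning unbiased gradient / Hessian-vector estimates.

% x is an epsilon-approximate (first-order) stationary point if
\[
  \|\nabla F(x)\| \le \epsilon .
\]

% x is an (epsilon, gamma)-approximate second-order stationary point if,
% in addition, the Hessian has no strongly negative curvature:
\[
  \|\nabla F(x)\| \le \epsilon
  \quad\text{and}\quad
  \lambda_{\min}\bigl(\nabla^2 F(x)\bigr) \ge -\gamma .
\]
```

The upper bound says O(ϵ⁻³) stochastic gradient and Hessian-vector queries suffice for the first condition; the lower bound shows no stochastic higher-order method can do better.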
This is joint work with Yossi Arjevani, Yair Carmon, John Duchi, Dylan J. Foster and Karthik Sridharan.
Ayush is a PhD student in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span optimization, online learning, reinforcement learning and control, and the interplay between them. Before coming to Cornell, he spent a year at Google as part of the Brain Residency program. Before Google, he completed his undergraduate studies in computer science at IIT Kanpur in India, where he was awarded the President's Gold Medal.