ACM SIGMETRICS 2026
Ann Arbor, Michigan, USA
June 8-12, 2026
SIGMETRICS 2026 will feature five tutorials organized across two parallel tracks.
| Tutorial | Duration |
| --- | --- |
| Causal Survival Analysis | 3 hrs |
| Efficient Large Language Model Inference | 3 hrs |
Track: 1
Duration: 3 hours
Speakers: Jessy Han (MIT), George Chen (Carnegie Mellon), Devavrat Shah (MIT)
Many applications involve reasoning about the amount of time that will elapse before a critical event happens. When will a hard drive fail, a customer cancel a subscription, a patient get discharged from the hospital, or a convicted criminal reoffend? These time durations are referred to as time-to-event outcomes. Modeling time-to-event outcomes under partial observation, i.e., censoring, has been extensively studied within the fields of survival analysis and reliability engineering for decades. This tutorial aims to bring the audience up to speed on both the basics and the modern developments of survival analysis, with a heavy emphasis on the active and growing research area of causal survival analysis, where time-to-event outcomes arise under interventions whose effects must be inferred from nonrandomized observational data using causal inference. We survey the major categories of causal survival analysis methods available today, discuss how to benchmark them, and highlight open questions.
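The censoring mechanism described above is handled by classic estimators such as Kaplan-Meier, which the basics portion of such a tutorial typically covers. As a warm-up, here is a minimal Python sketch of the Kaplan-Meier survival-curve estimate (the function name and toy data are illustrative, not taken from the tutorial materials):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function S(t).

    times:  observed durations (time to event, or time to censoring)
    events: 1 if the event was observed, 0 if the duration was right-censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    # Distinct times at which at least one event actually occurred
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                   # still under observation just before t
        d = np.sum((times == t) & (events == 1))       # events occurring exactly at t
        s *= 1.0 - d / at_risk                         # multiply in the per-step survival factor
        surv.append((t, s))
    return surv

# Toy data: 5 subjects, two of them censored (event = 0)
print(kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0]))
# → [(2.0, 0.8), (3.0, 0.6), (5.0, 0.3)]
```

Note how the censored subjects still contribute to the at-risk counts before their censoring times, which is exactly what distinguishes survival analysis from naively dropping incomplete observations.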
Jessy (Xinyi) Han. Jessy (Xinyi) Han is a Ph.D. candidate at the Massachusetts Institute of Technology and an incoming Assistant Professor (institution to be decided). Her research develops methods in causal inference and survival analysis to enable when-if decision-making, a framework for understanding how interventions affect not only what happens but when it happens. She works closely with practitioners to bring these methods into high-impact domains, including healthcare, policy evaluation, and business strategy. She is a recipient of the Google-MIT Schwarzman College of Computing Fellowship.
George H. Chen. George H. Chen is an Associate Professor in Carnegie Mellon University's Heinz College of Information Systems and Public Policy. He studies trustworthy machine learning methods for reasoning about time, often in the context of health applications. Much of his work is on predicting time durations before critical events happen (also called time-to-event prediction or survival analysis), or on analyzing time series such as electronic health records and EEG data. He is interested in developing new methods for these time-related problems as well as understanding when and why these methods work in terms of statistical guarantees. George completed his Ph.D. in Electrical Engineering and Computer Science at MIT, where he won the George Sprowls Award for outstanding Ph.D. thesis in computer science and the Goodwin Medal, the top teaching award given to graduate students. He is a recipient of an NSF CAREER Award and has also co-founded CoolCrop, a startup that provides cold storage and marketing analytics to rural farmers in India.
Devavrat Shah. Devavrat Shah is the Andrew (1956) and Erna Viterbi Professor in MIT's Department of Electrical Engineering and Computer Science and is a member of the Institute for Data, Systems and Society, Laboratory for Information and Decision Systems, and the Statistics and Data Science Center. His research focuses on statistical inference and stochastic networks, and his contributions span a variety of areas including resource allocation in communications networks, inference and learning on graphical models, and algorithms for social data processing, including ranking, recommendations, and crowdsourcing. Within networks, his work spans a range of areas across electrical engineering, computer science, and operations research. He earned a BS in computer science and engineering from the Indian Institute of Technology and a Ph.D. in computer science from Stanford University. His work has been recognized through prize paper awards in machine learning, operations research, and computer science, as well as career prizes including the 2025 ACM SIGMETRICS Achievement Award, the 2010 Erlang Prize from the INFORMS Applied Probability Society, and the 2008 ACM SIGMETRICS Rising Star Award.
Track: 1
Duration: 3 hours
Speakers: Ankur Mallick (Microsoft), Srikant Bharadwaj (Microsoft)
Large Language Models (LLMs) have become a foundational component of modern interactive systems, yet deploying them at scale remains extraordinarily expensive and complex. While recent advances in model architecture and training have received significant attention, inference efficiency, especially under latency-sensitive and highly variable workloads, has emerged as a critical bottleneck. For the SIGMETRICS community, LLM inference presents a rare convergence of hardware-aware performance modeling, online scheduling, and queueing-theoretic trade-offs at unprecedented scale. This tutorial will provide a systematic, end-to-end view of efficient LLM inference, spanning GPU execution models, batching and scheduling algorithms, and analytical and statistical simulation techniques for capacity planning and performance optimization.
We will begin by demystifying how modern GPUs execute transformer workloads, highlighting the distinct computational and memory characteristics of the prefill and decode phases, and explaining why traditional throughput-centric optimization strategies often fail under realistic service-level objectives (SLOs). We will show how GPU micro-architecture, kernel behavior, batching granularity, and memory hierarchies jointly shape latency and efficiency envelopes.
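The prefill/decode distinction above can be made concrete with a back-of-the-envelope arithmetic-intensity calculation. The sketch below assumes fp16 weights and an illustrative 4096-dimensional projection (these numbers are mine, not from the tutorial); it shows why decoding one token at a time tends to be memory-bound while prefill over a long prompt is compute-bound:

```python
def arithmetic_intensity(tokens, d_model):
    """FLOPs per byte for a (tokens x d_model) @ (d_model x d_model) matmul
    with fp16 (2-byte) operands. High intensity -> compute-bound (prefill);
    intensity near 1 -> memory-bound (single-token decode)."""
    flops = 2 * tokens * d_model * d_model             # multiply-accumulate count
    bytes_moved = 2 * (tokens * d_model               # read activations
                       + d_model * d_model            # read weights
                       + tokens * d_model)            # write output
    return flops / bytes_moved

print(arithmetic_intensity(1, 4096))      # decode: one token per step, ~1 FLOP/byte
print(arithmetic_intensity(2048, 4096))   # prefill: whole prompt at once, ~1000 FLOP/byte
```

Since modern GPUs sustain hundreds of FLOPs per byte of memory bandwidth, the decode phase leaves most of the compute idle unless requests are batched, which is precisely the motivation for the scheduling techniques discussed next.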
Next, we will dive into batching and scheduling algorithms for LLM serving, including continuous batching, paged attention, prefill chunking, and multi-stage (prefill/decode) deployments. We will frame these techniques through a metrics-driven lens, emphasizing the inherent throughput-latency-fairness trade-offs that arise under heterogeneous request sizes and bursty arrivals. Drawing on recent research and large-scale production experience, we will highlight open challenges in operating LLM services near their optimal efficiency frontier without triggering catastrophic tail-latency degradation.
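The throughput benefit of continuous batching can be illustrated with a toy step-count model. The sketch below is a deliberate simplification (it ignores prefill cost and assumes every decode step takes unit time regardless of batch size); it contrasts continuous batching, where finished requests are replaced immediately, with static batching, where each batch runs until its longest request completes:

```python
from collections import deque

def continuous_batching_steps(output_lens, max_batch):
    """Decode steps needed when finished requests are replaced immediately."""
    queue, running, steps = deque(output_lens), [], 0
    while queue or running:
        # Admit new requests into any free batch slots (the "continuous" part)
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1                                   # one decode step advances every request
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps

def static_batching_steps(output_lens, max_batch):
    """Classic static batching: each batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + max_batch])
               for i in range(0, len(output_lens), max_batch))

lens = [8, 1, 1, 1]   # one long request and three short ones
print(continuous_batching_steps(lens, max_batch=2))  # → 8
print(static_batching_steps(lens, max_batch=2))      # → 9
```

Even in this tiny example the short requests no longer wait for the long one to drain; with heterogeneous, bursty workloads the gap between the two policies grows much larger.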
Finally, we will introduce an analytical and statistical LLM inference simulator developed by the speakers' team that combines hardware-level performance models with queueing and workload distributions to predict end-to-end latency, batching behavior, and utilization. Unlike trace-driven or full discrete-event simulators, this approach enables fast what-if analysis, principled identification of optimal operating points, and quantitative evaluation of scheduling, caching, and hardware trade-offs. We will present a live demo of the simulator, discuss its validation against real deployments, and outline how researchers and practitioners can use it to reason about LLM inference systems using the language of SIGMETRICS: arrival processes, service rates, queueing delay, and tail latencies.
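Analytical models of the kind mentioned above can be illustrated with the textbook M/M/1 queue, where both the mean and the tail of the sojourn time have closed forms (this is a generic warm-up example, not the speakers' simulator):

```python
import math

def mm1_mean_latency(arrival_rate, service_rate):
    """Mean sojourn time (queueing delay + service) of an M/M/1 queue:
    W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable system: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

def mm1_latency_percentile(arrival_rate, service_rate, p):
    """p-th percentile of sojourn time; in an M/M/1 queue the sojourn time
    is exponentially distributed with rate (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable system: utilization >= 1")
    return -math.log(1.0 - p) / (service_rate - arrival_rate)

# What-if sweep: mean and P99 latency explode as utilization approaches 1
for lam in (5.0, 8.0, 9.5):
    print(f"rho={lam / 10.0:.2f}  "
          f"mean={mm1_mean_latency(lam, 10.0):.3f}  "
          f"p99={mm1_latency_percentile(lam, 10.0, 0.99):.3f}")
```

The same what-if style of analysis, with hardware-calibrated service-time models in place of the exponential assumption, is what makes analytical simulators fast enough to search for optimal operating points.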
Overall, this tutorial aims to bridge the gap between LLM systems research and performance modeling, equipping the SIGMETRICS audience with the conceptual tools, models, and abstractions needed to analyze and design efficient, reliable LLM inference services at scale.
Track: 2
Duration: 1.5 hours
Speaker: Guannan Qu (Carnegie Mellon)
Networked systems, such as energy and traffic networks, are ubiquitous and play an indispensable role in modern society. The decision-making and operation of such systems have long posed a tremendous challenge. Meanwhile, recent advances in machine learning, particularly reinforcement learning, have achieved remarkable success across domains, exhibiting an impressive capability to learn to control complex and unknown systems. Reinforcement learning has therefore been recognized as holding great potential for revolutionizing the way we operate large-scale networked systems. However, despite a rich literature on reinforcement learning and multi-agent reinforcement learning, these algorithms are widely recognized to suffer from scalability, stability, and safety issues when applied to large-scale networked systems. To address these challenges, recent lines of work exploit structural properties of networked systems to design more scalable multi-agent reinforcement learning algorithms. Examples of such structural properties include sparse network topology, locality in multi-robot systems, and homogeneity in queueing systems. This tutorial provides a holistic overview of these results, covering the various types of structural properties and how to integrate them into multi-agent reinforcement learning.
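One way to picture the sparse-network-topology structure is truncation: each agent observes and acts on its kappa-hop neighborhood rather than the global state, which is the key idea behind several scalable multi-agent algorithms. A minimal illustrative sketch (the function names and the line-graph example are hypothetical, not from the tutorial):

```python
def kappa_hop_neighborhood(adjacency, agent, kappa):
    """All agents within kappa hops of `agent` (breadth-first expansion)."""
    frontier, seen = {agent}, {agent}
    for _ in range(kappa):
        frontier = {j for i in frontier for j in adjacency[i]} - seen
        seen |= frontier
    return sorted(seen)

def localized_observation(global_state, adjacency, agent, kappa):
    """Truncate the global state to the agent's kappa-hop neighborhood,
    so the policy's input size depends on local degree, not network size."""
    return {i: global_state[i]
            for i in kappa_hop_neighborhood(adjacency, agent, kappa)}

# Line graph 0-1-2-3-4; agent 2 with kappa=1 sees only agents 1, 2, 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
state = {i: i * 10 for i in range(5)}
print(localized_observation(state, adj, agent=2, kappa=1))
# → {1: 10, 2: 20, 3: 30}
```

When interactions decay with graph distance, such truncated policies can provably approximate the globally optimal ones, which is what makes this structural property exploitable.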
Guannan Qu is an Assistant Professor in the Electrical and Computer Engineering Department at Carnegie Mellon University. He joined the department in September 2021. He received his Ph.D. in applied mathematics from Harvard University in 2019 and was a postdoctoral scholar in the Department of Computing and Mathematical Sciences at the California Institute of Technology from 2019 to 2021. He is the recipient of an NSF CAREER Award, an ICRA Best Paper Finalist recognition, the Best Paper Award from the AAAI 2025 Workshop on Multi-Agent AI in the Real World, the Caltech Simoudis Discovery Award, a PIMCO Fellowship, and the IEEE SmartGridComm Best Student Paper Award. His research interests lie in control, optimization, and machine and reinforcement learning.
Track: 2
Duration: 1.5 hours
Speaker: Mengdi Wang (Princeton University)
Recent advances in large foundation models, such as large language models and diffusion models, have demonstrated impressive capabilities. However, to truly align these models with user feedback or maximize real-world objectives, it is crucial to exert control over the decoding processes in order to steer the distribution of generated output. In this tutorial, we will explore methods and theory for controlled generation within large language models and diffusion models. We will discuss various modalities for achieving this control, focusing on applications such as LLM alignment, accelerated inference, transfer learning, and diffusion-based optimization.
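A simple instance of controlled generation is best-of-n decoding: sample several candidates and keep the one with the highest reward, which steers the output distribution toward high-reward text without retraining the model. The sketch below uses toy stand-ins for the model and the reward function (both hypothetical; a real deployment would call an LLM and a learned reward model):

```python
import random

def best_of_n(generate, reward, prompt, n=8, seed=0):
    """Best-of-n decoding: draw n candidate generations and return the
    candidate with the highest reward."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins for a language model and a reward model
def toy_generate(prompt, rng):
    return prompt + " " + " ".join(rng.choice(["good", "ok", "bad"]) for _ in range(3))

def toy_reward(text):
    return text.count("good") - text.count("bad")

print(best_of_n(toy_generate, toy_reward, "answer:", n=16))
```

More sophisticated control modalities, such as reward-guided token-level decoding or guidance in diffusion sampling, refine this same idea of reshaping the sampling distribution at inference time.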
Mengdi Wang is a Professor of Electrical and Computer Engineering and the Center for Statistics and Machine Learning, and by courtesy, Computer Science and Bioengineering, at Princeton University. She founded and co-directs Princeton AI^2, Princeton AI for Accelerating Invention. Her research spans generative AI, reinforcement learning, and large language models, with a focus on efficient fine-tuning, LLM reasoning, AI agents, AI for biotech, and automated science. She works closely with scientists and practitioners to implement AI algorithms, handle real-world data and systems, and apply AI technology to improve decision making and accelerate scientific discovery. She has received multiple honors, including the NSF CAREER Award, a Google Award, MIT Tech35, and the Donald Eckman Award for extraordinary contributions to the intersection of control, dynamical systems, machine learning, and information theory.
Track: 2
Duration: 2 hours
Speaker: Neil Walton (Durham University)
It is useful to view quantum computing and networking through the lens of the classical OSI model, which separates systems into layers: physical, data link, network, transport, session, presentation, and application. In quantum technologies, the physics in the physical layer and the information theory in the data link layer are becoming relatively well established. However, the higher layers, particularly control and transport, remain largely undeveloped, especially for quantum computing applications. The first logical and distributed logical computations have only just been demonstrated, and we are still learning what reliable, scalable coordination across devices should look like.
At the same time, networking will be essential. Logical qubit counts remain modest in current quantum processing units, so scaling is likely to require distributed architectures connected by quantum links. This raises new performance questions that classical methods do not directly answer: how to manage entanglement as a consumable resource, how to model latency and reliability when information cannot be copied, and how to coordinate hybrid classical-quantum control. In this tutorial, we will draw analogies with past classical systems and explore how performance evaluation techniques may need to evolve to support emerging quantum networked computing applications.
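The idea of entanglement as a consumable, perishable resource can be sketched as a discrete-time buffer model: Bell pairs are generated stochastically, consumed by requests, and decohere after a fixed lifetime. All names and parameters below are illustrative, not drawn from the tutorial:

```python
import random

def simulate_entanglement_buffer(steps, p_gen, p_req, lifetime, seed=0):
    """Toy discrete-time model of an entanglement buffer.

    Each step: a Bell pair is generated with probability p_gen, a request
    arrives with probability p_req and consumes one stored pair, and any
    pair older than `lifetime` steps is discarded (decoherence)."""
    rng = random.Random(seed)
    buffer = []                 # creation times of stored pairs
    served = dropped = 0
    for t in range(steps):
        buffer = [b for b in buffer if t - b < lifetime]   # decoherence
        if rng.random() < p_gen:
            buffer.append(t)
        if rng.random() < p_req:
            if buffer:
                buffer.pop(0)   # consume the oldest (most at-risk) pair
                served += 1
            else:
                dropped += 1    # request finds no entanglement available
    return served, dropped

served, dropped = simulate_entanglement_buffer(10_000, p_gen=0.3, p_req=0.2, lifetime=5)
print(served, dropped)
```

Unlike a classical packet buffer, the stored resource here expires on its own and cannot be copied to hedge against loss, which is why standard queueing results need rethinking in this setting.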
Neil Walton is a Professor in Operations Management at Durham University Business School. He received his undergraduate degree (2005), Master's degree (2006), and Ph.D. (2010), all in mathematics, from the University of Cambridge. His research is in applied probability and principally concerns the decentralized minimization of congestion in networks. He was a lecturer at the University of Amsterdam, where he held an NWO Veni Fellowship, and then moved to the University of Manchester, where he was a Reader in Mathematics. Neil has conducted research visits at Microsoft Research Cambridge, the Basque Centre for Mathematics, and the Automatic Control Laboratory at ETH Zurich. From 2017 to 2019, he was head of the probability and statistics group at the University of Manchester. He was a Fellow of the Alan Turing Institute and a guest lecturer at London Business School. He is an associate editor of the journal Stochastic Systems and an area editor for stochastic models at Operations Research. He has won the Best Paper Award at ACM SIGMETRICS and the 2018 Erlang Prize from the INFORMS Applied Probability Society. He currently chairs the prize committee of the INFORMS Applied Probability Society and is organizing the society's conference, which will be held in Durham.