SIGMETRICS '25: Abstracts of the 2025 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

SESSION: Session 1A: Queueing Theory

Steady-State Convergence of the Continuous-Time Routing System with General Distributions in Heavy Traffic

This paper examines a continuous-time routing system with general interarrival and service time distributions, operating under either the join-the-shortest-queue policy or the power-of-two-choices policy. Under a weaker set of assumptions than those commonly established in the literature, we prove that the scaled steady-state queue length at each station converges weakly to a common exponential random variable in heavy traffic. Specifically, our results hold under the assumption of finite (2 + ε)th moments for the interarrival and service time distributions, for some ε > 0. The proof leverages the Palm version of the basic adjoint relationship (BAR) as a key technique.

Finite-Time Behavior of Erlang-C Model: Mixing Time, Mean Queue Length and Tail Bounds

Service systems like data centers and ride-hailing are popularly modeled as queueing systems in the literature. Such systems are primarily studied in the steady state due to their analytical tractability. However, almost all applications in real life do not operate in a steady state, so there is a clear discrepancy in translating theoretical queueing results to practical applications. To this end, we provide a finite-time convergence analysis for Erlang-C systems (also known as M/M/n queues), providing a stepping stone towards understanding the transient behavior of more general queueing systems. We obtain a bound on the chi-square distance between the finite-time queue length distribution and the stationary distribution for a finite number of servers. We then use these bounds to study the behavior in the many-server heavy-traffic asymptotic regimes. The Erlang-C model exhibits a phase transition at the so-called Halfin-Whitt regime. We show that our mixing rate matches the limiting behavior in the Super-Halfin-Whitt regime and matches up to a constant factor in the Sub-Halfin-Whitt regime.
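For reference, the distance referred to here is presumably the standard chi-squared divergence between the time-t queue-length distribution and the stationary distribution; a minimal statement of that definition (the exact metric used is defined in the paper itself):

```latex
% Chi-squared divergence between the time-t queue-length law \mu_t and the
% stationary law \pi of the M/M/n queue (assumed to be the metric meant above).
\[
  \chi^2\!\left(\mu_t \,\middle\|\, \pi\right)
  \;=\; \sum_{x \ge 0} \frac{\bigl(\mu_t(x) - \pi(x)\bigr)^2}{\pi(x)} .
\]
```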

To prove such a result, we employ the Lyapunov-Poincaré approach, where we first carefully design a Lyapunov function to obtain a negative drift outside a finite set. Within the finite set, we develop different strategies depending on the properties of the finite set to get a handle on the mixing behavior via a local Poincaré inequality. A key aspect of our methodological contribution is in obtaining tight guarantees in these two regions, which when combined, give us tight mixing time bounds. We believe that this approach is of independent interest for studying mixing in reversible countable-state Markov chains more generally.

Improving Multiresource Job Scheduling with Markovian Service Rate Policies

Modern cloud computing workloads are composed of multiresource jobs that require a variety of computational resources in order to run, such as CPU cores, memory, disk space, or hardware accelerators. A server can run a set of multiresource jobs in parallel if the total resource demands of those jobs do not exceed the resource capacity of the server. Given a stream of arriving multiresource jobs, a scheduling policy must decide which jobs to run in parallel at every moment in time in order to minimize the mean response time across jobs --- the average time from when a job arrives at the system until it is completed.

We develop and analyze a new class of policies for scheduling multiresource jobs called Markovian Service Rate (MSR) policies. An MSR policy precomputes several candidate schedules, sets of jobs that can run in parallel, and then switches between these candidate schedules randomly according to a continuous-time Markov chain. While prior scheduling policies are either computationally complex, analytically intractable, or result in high mean response time, we develop an analysis of MSR policies to prove that they are both computationally simple and can guarantee low mean response time. We also show that these policies perform well in practice using simulations based on traces from the Google Borg cluster scheduler. This abstract summarizes our full paper.
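As a rough illustration of the mechanism only (not the paper's analysis), the sketch below simulates an MSR-style scheduler that switches among a few hand-picked candidate schedules according to a continuous-time Markov chain; the schedules, job classes, and rates are made-up placeholders.

```python
import random

# Hypothetical candidate schedules: each is a set of job classes that fits
# within the server's resource capacity (placeholder values).
CANDIDATE_SCHEDULES = [
    {"A", "B"},        # e.g., two small jobs in parallel
    {"C"},             # one large job alone
    {"A", "D"},
]

# Transition rates of the continuous-time Markov chain over schedules
# (made-up numbers; the paper chooses these to control mean response time).
RATES = {
    0: {1: 1.0, 2: 0.5},
    1: {0: 2.0, 2: 1.0},
    2: {0: 1.5, 1: 0.5},
}

def simulate_msr(horizon=100.0, seed=0):
    """Sample the schedule process: which candidate schedule is active when."""
    rng = random.Random(seed)
    t, state = 0.0, 0
    timeline = []
    while t < horizon:
        out = RATES[state]
        total_rate = sum(out.values())
        # Exponential holding time in the current schedule.
        hold = rng.expovariate(total_rate)
        timeline.append((t, min(t + hold, horizon), CANDIDATE_SCHEDULES[state]))
        t += hold
        # Jump to the next schedule proportionally to the outgoing rates.
        r, acc = rng.random() * total_rate, 0.0
        for nxt, rate in out.items():
            acc += rate
            if r <= acc:
                state = nxt
                break
    return timeline

if __name__ == "__main__":
    for start, end, sched in simulate_msr(horizon=5.0)[:10]:
        print(f"[{start:6.2f}, {end:6.2f}) run jobs of classes {sorted(sched)}")
```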

On the Distribution of Sojourn Times in Tandem Queues

The paper studies the (end-to-end) waiting and sojourn times in tandem queues with general arrivals and light-tailed service times. It is shown that the tails of the corresponding distributions are subject to polynomial-exponential upper bounds, whereby the degrees of the polynomials depend on both the number of bottleneck queues and the 'light-tailedness' of the service times. Closed-form bounds constructed for a two-queue tandem with exponential service times are shown to be numerically sharp, improve upon alternative large-deviations bounds by many orders of magnitude, and recover the exact results in the case of Poisson arrivals.

SESSION: Session 1B: Game Theory

Online Allocation with Multi-Class Arrivals: Group Fairness vs Individual Welfare

We introduce and study a multi-class online resource allocation problem with group fairness guarantees. The problem involves allocating a fixed amount of resources to a sequence of agents, each belonging to a specific group. The primary objective is to ensure fairness across different groups in an online setting. We focus on three fairness notions: one based on quantity and two based on utility. To achieve fair allocations, we develop two threshold-based online algorithms, proving their optimality under two fairness notions and near-optimality for the more challenging one. Additionally, we demonstrate a fundamental trade-off between group fairness and individual welfare using a novel representative function-based approach. To address this trade-off, we propose a set-aside multi-threshold algorithm that reserves a portion of the resource to ensure fairness across groups while utilizing the remaining resource to optimize efficiency under utility-based fairness notions. This algorithm is proven to achieve the Pareto-optimal trade-off. We also demonstrate that our problem can model a wide range of real-world applications, including network caching and cloud computing, and empirically evaluate our proposed algorithms in the network caching problem using real datasets.
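For intuition, here is a generic, heavily simplified single-threshold allocation sketch in the spirit of threshold-based online algorithms; the exponential threshold and the absence of per-group fairness accounting are textbook simplifications, not the paper's actual group-fair algorithms.

```python
import math

def make_threshold(price_low, price_high):
    """Classic exponential threshold on the fraction of resource already used
    (a textbook construction; the paper's thresholds are group-aware)."""
    def phi(z):  # z in [0, 1]: fraction of capacity already allocated
        return price_low * math.exp(z * math.log(price_high / price_low))
    return phi

def allocate_online(requests, capacity, price_low=1.0, price_high=8.0):
    """Serve each arriving (group, unit utility, demand) request while its
    unit utility clears the current threshold; allocate greedily up to demand."""
    phi = make_threshold(price_low, price_high)
    used = 0.0
    allocation = []  # (group, amount) decisions, in arrival order
    for group, unit_utility, demand in requests:
        grant = 0.0
        while grant < demand and used < capacity and unit_utility >= phi(used / capacity):
            step = min(0.01 * capacity, demand - grant, capacity - used)
            grant += step
            used += step
        allocation.append((group, grant))
    return allocation

if __name__ == "__main__":
    reqs = [("g1", 2.0, 3.0), ("g2", 5.0, 4.0), ("g1", 1.5, 2.0), ("g2", 7.0, 5.0)]
    for group, amount in allocate_online(reqs, capacity=10.0):
        print(group, round(amount, 2))
```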

Game Theoretic Liquidity Provisioning in Concentrated Liquidity Market Makers

Automated market makers (AMMs) are decentralized exchanges that enable the automated trading of digital assets. Liquidity providers (LPs) deposit digital tokens, which can be used by traders to execute trades, which generate fees for the investing LPs. In AMMs, trade prices are determined algorithmically, unlike classical limit order books. Concentrated liquidity market makers (CLMMs) are a major class of AMMs that offer liquidity providers flexibility to decide not only how much liquidity to provide, but in what ranges of prices they want the liquidity to be used. This flexibility can complicate strategic planning, since fee rewards are shared among LPs. We formulate and analyze a game theoretic model to study the incentives of LPs in CLMMs. Our main results show that while our original formulation admits multiple Nash equilibria and has complexity quadratic in the number of price ticks in the contract, it can be reduced to a game with a unique Nash equilibrium whose complexity is only linear. We further show that the Nash equilibrium of this simplified game follows a waterfilling strategy, in which low-budget LPs use up their full budget, but rich LPs do not. Finally, by fitting our game model to real-world CLMMs, we observe that in liquidity pools with risky assets, LPs adopt investment strategies far from the Nash equilibrium. Under price uncertainty, they generally invest in fewer and wider price ranges than our analysis suggests, with lower-frequency liquidity updates. In such risky pools, by updating their strategy to more closely match the Nash equilibrium of our game, LPs can improve their median daily returns by $116, which corresponds to an increase of 0.009% in median daily return on investment (ROI). At maximum, LPs can improve daily ROI by 0.855% when they reach Nash equilibrium. In contrast, in stable pools (e.g., of only stablecoins), LPs already adopt strategies that more closely resemble the Nash equilibrium of our game.
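The waterfilling structure of the equilibrium can be illustrated with a generic routine: small LPs exhaust their budgets while large LPs are capped at a common level. The budgets and the aggregate investment target below are illustrative; the paper's equilibrium is defined over CLMM price ticks and fee shares rather than a single scalar.

```python
def waterfill(budgets, total):
    """Find a common cap L so that sum_i min(budget_i, L) == total.
    Illustrates the qualitative equilibrium shape: low-budget LPs invest their
    whole budget, rich LPs are all capped at the same level L.
    (Generic waterfilling; the paper's game is over CLMM price ranges.)"""
    assert 0 <= total <= sum(budgets)
    lo, hi = 0.0, max(budgets)
    for _ in range(100):  # bisection on the water level
        mid = (lo + hi) / 2
        if sum(min(b, mid) for b in budgets) < total:
            lo = mid
        else:
            hi = mid
    level = (lo + hi) / 2
    return [min(b, level) for b in budgets], level

if __name__ == "__main__":
    invest, level = waterfill(budgets=[1.0, 2.0, 10.0, 50.0], total=20.0)
    print("cap level:", round(level, 3))
    print("investments:", [round(x, 3) for x in invest])
```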

Two Choice Behavioral Game Dynamics with Myopic-Rational and Herding Players

In classical game theory, the players are assumed to be rational and intelligent, which is often at odds with reality. We consider more realistic behavioral game dynamics where the players choose actions in a turn-by-turn manner and exhibit two prominent behavioral traits --- an α-fraction of them are myopic and strategically choose optimal actions against the empirical distribution of the previous plays, while the others herd towards the most popular action up to that point. Our analysis focuses on scenarios where players encounter two possible choices and get to play only once. We constructively derive the almost sure mean-field limits of such dynamics. Further, we provide comparisons across various levels and types of rationality. Our framework extends naturally to model avoid-the-crowd behavior. Finally, we illustrate our results with two practical examples: the participation game and the routing game.
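A rough simulation of these dynamics under stated assumptions: two actions, players arrive turn by turn and play once, an α-fraction best-respond to the empirical distribution of earlier plays, and the rest copy the most popular action so far. The 2x2 payoff matrix is a made-up placeholder.

```python
import random

# Placeholder 2x2 payoff: PAYOFF[a][b] is the payoff of playing action a against
# an earlier play of action b (illustrative, congestion-like numbers).
PAYOFF = [[1.0, 3.0],
          [2.0, 0.5]]

def simulate(n_players=10_000, alpha=0.4, seed=1):
    """alpha-fraction of players are myopic best-responders against the empirical
    distribution of earlier plays; the rest herd to the most popular action so far."""
    rng = random.Random(seed)
    counts = [1, 1]  # small prior so the first players have something to react to
    for _ in range(n_players):
        total = sum(counts)
        emp = [c / total for c in counts]
        if rng.random() < alpha:
            # Myopic: pick the action with higher expected payoff vs. empirical play.
            exp_pay = [sum(PAYOFF[a][b] * emp[b] for b in (0, 1)) for a in (0, 1)]
            action = 0 if exp_pay[0] >= exp_pay[1] else 1
        else:
            # Herding: copy the most popular action so far.
            action = 0 if counts[0] >= counts[1] else 1
        counts[action] += 1
    total = sum(counts)
    return [c / total for c in counts]

if __name__ == "__main__":
    for a in (0.2, 0.5, 0.8):
        print(f"alpha={a}: empirical action shares ~ {simulate(alpha=a)}")
```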

Allocating Public Goods via Dynamic Max-Min Fairness: Long-Run Behavior and Competitive Equilibria

Dynamic max-min fair allocation (DMMF) is a simple and popular mechanism for the repeated allocation of a shared resource among competing agents: in each round, each agent can choose whether or not to request the resource, which is then allocated to the requesting agent with the least number of allocations received so far. Recent work has shown that under DMMF, a simple threshold-based request policy enjoys surprisingly strong robustness properties, wherein each agent can realize a significant fraction of her optimal utility irrespective of how other agents behave. While this goes some way in mitigating the possibility of a 'tragedy of the commons' outcome, the robust policies require that an agent defend against arbitrary (possibly adversarial) behavior by other agents. This, however, may be far from optimal in real-world settings, where other agents are selfish optimizers rather than adversaries. Therefore, robust guarantees give no insight into how agents behave in an equilibrium, and whether outcomes are improved under one.

Our work aims to bridge this gap by studying the existence and properties of equilibria under DMMF. To this end, we first show that despite the strong robustness guarantees of threshold-based strategies, no Nash equilibrium exists when agents participate in DMMF using such strategies. On the positive side, however, we show that for the symmetric case, a simple data-driven request policy guarantees that no agent benefits from deviating to a different fixed threshold policy. In our proposed policy, agents aim to match the historical allocation rate, with a vanishing drift towards the rate that optimizes overall welfare for all users. Furthermore, the resulting equilibrium outcome can be significantly better than what follows from the robustness guarantees.

Our results are built on a complete characterization of the steady-state distribution under DMMF, as well as new techniques for analyzing strategic agent outcomes under dynamic allocation mechanisms; we hope these may prove of independent interest in related problems.
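For concreteness, the DMMF allocation rule and a fixed-threshold request policy can be written in a few lines; the sketch below uses i.i.d. Uniform(0,1) values and illustrative thresholds, not the data-driven policy analyzed in the paper.

```python
import random

def dmmf_round(requesters, allocations, rng):
    """Allocate the round's resource to the requesting agent with the fewest
    allocations received so far (ties broken uniformly at random)."""
    if not requesters:
        return None
    fewest = min(allocations[a] for a in requesters)
    winner = rng.choice([a for a in requesters if allocations[a] == fewest])
    allocations[winner] += 1
    return winner

def simulate(thresholds, rounds=10_000, seed=0):
    """Each agent follows a fixed-threshold request policy: request the resource
    in a round iff its realized value for that round exceeds its threshold.
    (Per-round values are i.i.d. Uniform(0,1) placeholders.)"""
    rng = random.Random(seed)
    allocations = {agent: 0 for agent in thresholds}
    for _ in range(rounds):
        requesters = [a for a, thr in thresholds.items() if rng.random() >= thr]
        dmmf_round(requesters, allocations, rng)
    return allocations

if __name__ == "__main__":
    result = simulate({"alice": 0.3, "bob": 0.6, "carol": 0.9})
    print(result)  # low thresholds request more often, but DMMF evens out allocations
```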

SESSION: Session 1C: Security

MUDGUARD: Taming Malicious Majorities in Federated Learning using Privacy-preserving Byzantine-robust Clustering

Byzantine-robust Federated Learning (FL) aims to counter malicious clients and train an accurate global model while maintaining an extremely low attack success rate. Most existing systems, however, are only robust when most of the clients are honest. FLTrust (NDSS '21) and Zeno++ (ICML '20) do not make such an honest majority assumption but can only be applied to scenarios where the server is provided with an auxiliary dataset used to filter malicious updates. FLAME (USENIX '22) and EIFFeL (CCS '22) maintain the semi-honest majority assumption to guarantee robustness and the confidentiality of updates. It is, therefore, currently impossible to ensure Byzantine robustness and confidentiality of updates without assuming a semi-honest majority. To tackle this problem, we propose a novel Byzantine-robust and privacy-preserving FL system, called MUDGUARD, which handles a malicious minority on the server side and a malicious majority on the client side. Our experimental results demonstrate that the accuracy of MUDGUARD is practically close to the FL baseline using FedAvg without attacks (≈0.8% gap on average). Meanwhile, the attack success rate is around 0%-5% even under an adaptive attack tailored to MUDGUARD. We further optimize our design by using binary secret sharing and polynomial transformation, leading to communication overhead and runtime decreases of 67%-89.17% and 66.05%-68.75%, respectively.

Application-driven Reexamination of Datacenter Microbursts

Microbursts are microsecond-scale congestion events that have a severe negative impact on the performance of datacenter networks. Despite various methods proposed to mitigate microbursts, recent studies show that microbursts and their negative effects still persist in datacenters. We show that the primary reason for observing low performance from microburst mitigation methods is that these methods are used in situations or scenarios where they are not suitable or effective. We dissect various microburst mitigation approaches, demonstrating that each method's effectiveness is limited to microbursts with particular characteristics. Then, we carry out a comprehensive measurement and analysis of the diverse characteristics of microbursts, with particular attention to various applications' influences. Our findings reveal that microbursts generated in different applications exhibit different characteristics, which can significantly affect the performance of various mitigation methods. This indicates that optimal microburst mitigation in datacenters requires careful deployment of techniques based on the characteristics of hosted applications.

VESTA: A Secure and Efficient FHE-based Three-Party Vectorized Evaluation System for Tree Aggregation Models

Machine Learning as a Service (MLaaS) platforms facilitate collaborative machine learning, but trust issues necessitate privacy-preserving methods. As for tree ensembles, prior Fully Homomorphic Encryption (FHE)-based approaches suffer from slow inference speed and high memory usage. This work proposes the VESTA system that leverages compile-time precomputation and parallelizable model partitioning to enhance performance. VESTA achieves a 2.1x speedup and reduces memory consumption by 59.4% compared to state-of-the-art solutions.

Confidential VMs Explained: An Empirical Analysis of AMD SEV-SNP and Intel TDX

Confidential Virtual Machines (CVMs) strive to alleviate the programmability and usability challenges of the previously proposed enclave-based trusted computing technologies, promoting easier deployment in cloud infrastructures. However, differing microarchitectural features, interfaces, and security properties among vendors complicate the evaluation of CVMs for different use cases. Understanding the performance implications, functional limitations, and security guarantees of CVMs is a crucial step toward their adoption.

This work presents a detailed empirical analysis of two leading CVM technologies: AMD Secure Encrypted Virtualization--Secure Nested Paging (SEV-SNP) and Intel Trust Domain Extensions (TDX). We review their microarchitectural components and conduct a thorough performance evaluation across various aspects, including memory management, computational and I/O performance, and attestation primitives. We further present a security analysis through a trusted computing base (TCB) evaluation and Common Vulnerabilities and Exposures (CVE) analysis. Our key findings demonstrate, among others, the effect of CVMs on boot time, memory management and I/O, and identify inefficiencies in their context switch mechanisms. We further provide insights into the performance implications of CVMs and highlight potential room for improvement.

SESSION: Session 2A: Deep Learning

DiskAdapt: Hard Disk Failure Prediction based on Pre-Training and Fine-Tuning

With the rapid development of information technology, data centers have become increasingly reliant on storage devices. However, the risk of physical damage to these devices poses significant challenges to data security. As a result, fault prediction technology has become crucial for improving the reliability of storage systems. Current research primarily focuses on analyzing hard drive operational data using machine learning or deep learning. However, most methods are trained and tested on a single model of hard drive, leading to insufficient generalization in fault detection. Large data centers typically use hard drives from multiple manufacturers and different models, with frequent replacements, making it difficult for existing methods to fully meet enterprise needs. To address these challenges, this paper proposes a hard drive fault prediction model based on pretraining and fine-tuning, named DiskAdapt. The model adopts the AdaptFormer architecture and leverages a pretraining-fine-tuning framework to extract multivariate time series features that precisely reflect the operational state of hard drives. By integrating these multivariate features and performing feature alignment operations, the model achieves accurate fault prediction. Extensive experiments conducted on multiple public datasets demonstrate that our method outperforms other baseline models.

PROPHET: PRediction Of 5G bandwidtH using Event-driven causal Transformer

Accurately predicting 5G bandwidth in vehicular mobility scenarios is essential for optimizing real-time communication (RTC) systems in a range of emerging vehicular applications, such as autonomous driving, in-car video conferencing, vehicular augmented reality (AR), and remote driving in dynamic urban environments. In this paper, we propose Prophet, a causality-aware Transformer model designed to forecast 5G throughput by capturing the complex causal relationships between control-plane 5G network events (e.g., handovers) and environmental factors (e.g., signal strength). Our key contributions include: (1) a wavelet-based encoder that efficiently captures long-term trends through multi-scale signal decomposition; (2) a control-plane causality attention mechanism in the decoder that focuses on localized, short-term throughput variations triggered by network events; and (3) a Granger causality attention mechanism that enhances prediction accuracy by emphasizing historically significant data patterns. Prophet surpasses State-Of-The-Art (SOTA) models like random forest, Long Short-Term Memory (LSTM), and other SOTA transformers in both accuracy and scalability.

Diffusion-Based Generative System Surrogates for Scalable Learning-Driven Optimization in Virtual Playgrounds

In this paper, we present DiffNEST, a diffusion-based surrogate framework that enables scalable, learning-driven optimization in complex computing environments. As modern systems become increasingly intricate, traditional optimization methods often fall short, and RL-based approaches are hindered by high data collection costs and hardware limitations. DiffNEST addresses these challenges by training a diffusion model to generate realistic and temporally coherent system traces from high-level workload and configuration inputs, eliminating the need for physical hardware during optimization. We show that DiffNEST produces high-fidelity traces that accurately reflect both spatial and temporal system behaviors. Using these synthetic traces, we optimize DVFS and multi-core cache allocation, demonstrating effective policy learning without relying on actual hardware. In addition, DiffNEST can be fine-tuned across different optimization tasks and workload domains, making it a versatile surrogate modeling framework for system-level optimization.

FastFlow: Early Yet Robust Network Flow Classification using the Minimal Number of Time-Series Packets

Internet service providers (ISPs) that operate high-speed links expect network flow classifiers to accurately classify flows as early as possible, be robust to packet sequence disorders, and be capable of detecting the existence of unseen flow types. We develop FastFlow, a time-series flow classification method that achieves the above objectives, which dynamically selects the minimal number of packets at runtime to balance accuracy and efficiency. Our method first represents the packet streams of a flow at both per-packet and per-slot granularity for precise packet statistics with robustness to packet sequence disorders. A sequential decision-based classification model that leverages the LSTM architecture is developed. The model makes dynamic decisions on the minimal number of time-series data points to classify each flow. We evaluated our method on three public datasets and demonstrated its superior performance in early and accurate flow classification. Deployment insights on the classification of over 22.9 million flows across seven application types and 33 content providers in a campus network over one week are discussed.

SESSION: Session 2B: Theory I

Using Lock-Free Design for Throughput-Optimized Cache Eviction

This paper presents a practical approach to cache eviction algorithm design, called Mobius, that optimizes the concurrent throughput of caches and reduces cache operation latency by utilizing lock-free data structures, while maintaining high cache hit ratios. Mobius includes two key designs. First, Mobius employs two lock-free FIFO queues to manage cache items, ensuring that all cache operations are executed efficiently in concurrency. Second, Mobius integrates a consecutive detection mechanism that merges multiple modifications during eviction into a single operation, thereby reducing data races. The implementation of Mobius in CacheLib and RocksDB highlights its high concurrency in both synthetic and real-world workloads.

Optimal SSD Management with Predictions

Flash-based solid state drives (SSDs) have emerged as a leading storage technology, but their unique constraints pose fundamental management challenges. Unlike hard-disk drives, SSDs do not allow in-place updates: data must be written to a clean location within a block, and only entire blocks can be cleaned. Moreover, when a block is cleaned, all valid data it stores must first be rewritten elsewhere, resulting in additional writes that harm both performance and device longevity. In this work, we present one of the few worst-case analyses of algorithms for minimizing rewrites in SSDs. Our model falls within the framework of learning-augmented algorithms: algorithms that receive advice from a prediction oracle and their performance is analyzed as a function of the oracle's error. We introduce a novel algorithm for this model, and prove that it achieves optimal worst-case guarantees. Experiments on a range of SSD workloads confirm the practical advantages of the proposed algorithm over existing SSD management schemes.

Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks

Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary w.r.t. time, which fails to capture the non-stationary components in increasingly many real-world scenarios. Moreover, most existing algorithms in network optimization assume perfect knowledge of network conditions before decision-making, which again rules out applications where unpredictability in network conditions is present.

Motivated by these issues, this paper considers Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated with scheduler's actions, where ii) the underlying network topology is a non-stationary multi-hop network whose conditions change arbitrarily with time, and iii) only bandit feedback (the effect of actually deployed actions) is revealed after decision-making. We propose the UMO2 algorithm, which does not require any pre-decision knowledge or counterfactual feedback, ensures network stability, and also matches the utility maximization performance of any ''mildly varying'' reference policy up to a polynomially decaying gap. To our knowledge, no previous algorithm can handle multi-hop networks or achieve utility maximization guarantees in ANO problems with bandit feedback, whereas ours is able to do both.

Technically, our method builds upon a novel integration of online learning techniques into the Lyapunov drift-plus-penalty method. Specifically, we propose meticulous analytical techniques to jointly balance online learning and Lyapunov arguments, which we use to handle the complex inter-dependency among queues in multi-hop networks. To tackle the learning obstacles due to potentially unbounded queue sizes and negative queue differences, we design a new online linear optimization algorithm that automatically adapts to the unknown (potentially negative) loss magnitudes. Finally, we also propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suits drastically varying queue lengths in utility maximization. Our new insights and techniques in online learning can also be of independent interest.

Reducing Sensor Requirements by Relaxing the Network Metric Dimension

Source localization in graphs aims to identify the origin of events like epidemics or misinformation by leveraging structural graph properties. A key tool is the metric dimension, which measures the minimum number of sensors needed to uniquely identify all vertices based on their distances. However, this often involves placing a large number of sensors. Hence, we propose a relaxed version of the metric dimension that allows vertices at a distance less than k from each other to have identical distances to the sensors. This relaxation lowers sensor requirements while preserving practical resolution. Our contributions include an analysis of this k-relaxed metric dimension in both deterministic and random trees, highlighting the role of graph structure in sensor placement. Numerical experiments on various graphs show that the relaxed metric dimension is much smaller than the traditional one.
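For concreteness, the relaxed condition can be checked directly: a sensor set is admissible if any two vertices sharing the same distance vector to the sensors lie within graph distance k of each other. Below is a brute-force sketch (plain BFS plus a naive greedy placement, neither of which is the paper's analytical approach):

```python
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Hop distances from source in an unweighted graph given as adjacency lists."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_k_relaxed_resolving(graph, sensors, k):
    """Sensors resolve the graph in the k-relaxed sense if any two vertices sharing
    the same distance vector to the sensors are at distance less than k from each other."""
    dist_from = {s: bfs_distances(graph, s) for s in sensors}
    signature = {v: tuple(dist_from[s].get(v) for s in sensors) for v in graph}
    for u, v in combinations(graph, 2):
        if signature[u] == signature[v] and bfs_distances(graph, u).get(v, float("inf")) >= k:
            return False
    return True

def greedy_relaxed_sensors(graph, k):
    """Naive greedy placement: add vertices as sensors until the relaxed condition holds.
    (Illustrative only; the paper analyzes trees and random trees analytically.)"""
    sensors = []
    for v in graph:
        if is_k_relaxed_resolving(graph, sensors, k):
            return sensors
        sensors.append(v)
    return sensors

if __name__ == "__main__":
    # Star graph: center 0 with leaves 1..5.
    star = {0: [1, 2, 3, 4, 5], **{i: [0] for i in range(1, 6)}}
    print("k=1 sensors:", greedy_relaxed_sensors(star, k=1))  # many sensors needed for k=1
    print("k=3 sensors:", greedy_relaxed_sensors(star, k=3))  # leaves are within distance 2, so none needed
```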

SESSION: Session 2C: Reliable Systems

The Tale of Errors in Microservices: Extended Abstract

Microservice architectures have become the de facto paradigm for building scalable, service-oriented systems. Although their decentralized design promotes resilience and rapid development, the inherent complexity leads to subtle performance challenges. In particular, non-fatal errors - internal failures of remote procedure calls that do not cause top-level request failures - can accumulate along the critical path, inflating latency and wasting resources.

In this work, we analyze over 11 billion RPCs across more than 6,000 microservices at Uber. Our study shows that nearly 29% of successful requests experience non-fatal errors that remain hidden in traditional monitoring. We propose a novel latency-reduction estimator (LR estimator) to quantify the potential benefit of eliminating these errors. Our contributions include a systematic study of RPC error patterns, a methodology to estimate latency reductions, and case studies demonstrating up to a 30% reduction in tail latency.

Quantum Computing in the RAN with Qu4Fec: Closing Gaps Towards Quantum-based FEC Processors

Quantum computers could offer substantial energy-saving advantages over conventional compute servers, particularly in the context of 5G signal processing. This paper explores the potential of Quantum Annealing (QA) to implement Forward Error Correction (FEC) and introduces Qu4Fec, an innovative solution for decoding Low-Density Parity Check (LDPC) codes. Our findings demonstrate that Qu4Fec improves on state-of-the-art approaches in terms of Block Error Rate (BLER). However, our experiments reveal that current leading Quantum Processing Unit (QPU) designs are suboptimal for the FEC problem formulation. We identify key challenges, including qubit chain length, scaling, and quantization, and propose quantum architectures to address these limitations.

Design and Modeling of a New File Transfer Architecture to Reduce Undetected Errors Evaluated in the FABRIC Testbed

Ensuring data integrity for petabyte-scale file transfers is critical for scientific applications. As packet sizes increase, so does the likelihood of undetected errors. Multi-Level Error Detection (MLED) is a recursive architecture that leverages in-network resources to reduce undetected error probability (UEP) in file transfer. MLED organizes communication in layers at different levels, with each layer implementing configurable policies. Through experimentation on the FABRIC testbed, we show that while the traditional---transport and data link layer---error detection results in corrupt file transfers requiring retransmission, MLED detects and corrects these errors at intermediate levels, when an adversarial error model is used. MLED thus achieves a 100% gain in goodput reaching over 800 Mbps on a single connection with no appreciable increase in delay.
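As a toy illustration of the layered idea only: per-chunk checksums verified at an intermediate level let corrupt chunks be caught and re-fetched individually instead of retransmitting the whole file after the end-to-end check fails. The chunk size, hash choice, and two-level structure below are placeholders, not the MLED architecture or its error model.

```python
import hashlib
import random

CHUNK = 64 * 1024  # illustrative chunk size

def digests(data):
    """Per-chunk digests plus a whole-file digest (stand-ins for per-level checks)."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return [hashlib.sha256(c).hexdigest() for c in chunks], hashlib.sha256(data).hexdigest()

def transfer_with_intermediate_check(data, corrupt_prob=0.01, seed=0):
    """Simulate a transfer where an intermediate level re-fetches only corrupted chunks,
    instead of retransmitting the whole file after an end-to-end failure."""
    rng = random.Random(seed)
    chunk_sums, file_sum = digests(data)
    received = []
    refetched = 0
    for i in range(0, len(data), CHUNK):
        chunk = bytearray(data[i:i + CHUNK])
        if rng.random() < corrupt_prob:           # corruption missed by lower layers
            chunk[0] ^= 0xFF
        if hashlib.sha256(chunk).hexdigest() != chunk_sums[len(received)]:
            chunk = bytearray(data[i:i + CHUNK])  # intermediate level re-fetches the chunk
            refetched += 1
        received.append(bytes(chunk))
    ok = hashlib.sha256(b"".join(received)).hexdigest() == file_sum
    return ok, refetched

if __name__ == "__main__":
    payload = random.Random(42).randbytes(4 * 1024 * 1024)
    print(transfer_with_intermediate_check(payload))  # (True, number of chunks repaired)
```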

Beaver: A High-Performance and Crash-Consistent File System Cache via PM-DRAM Collaborative Memory Tiering

The in-memory cache layer is crucial in building a file system. Crash-consistency is highly desirable for applications running on the file system, ensuring that data is written in an all-or-none fashion during unexpected system failures or crashes. However, existing works fail to achieve both high read and write performance in constructing a crash-consistent cache layer. In this paper, we propose Beaver, a new in-memory file system cache that achieves both crash-consistency and high read/write performance. Beaver exploits a read/write-distinguishable memory hierarchy involving both PM and DRAM. Beaver uses the PM to serve write operations for crash-consistency and high write performance, and uses the DRAM to serve read operations for high read performance. We also introduce a kprefetcher to asynchronously move data from PM to DRAM off the critical path of write operations. We extensively evaluated Beaver against the latest file system caches using real-world applications and popular benchmarks. Application evaluation results show that Beaver outperformed other file system caches by up to 16.1% and 11.0% on RocksDB and MinIO, respectively. Beaver's source code is available at https://github.com/PanQL/Beaver.

SESSION: Session 3A: Data Centers

Tiered Cloud Routing: Methodology, Latency, and Improvement

Large cloud providers including AWS, Azure, and Google offer two tiers of network services to their customers: WAN-transit service and inet-transit service. Little is known about how each cloud provider offers different transit services, how well these services work, and whether the quality of those services can be further improved. In this work, we conduct a large-scale study to answer these questions. Using RIPE Atlas probes as vantage points, we explore how traffic enters and leaves the WAN of each of the three clouds. In addition, we measure the access latencies of these two network services of each cloud and compare them with emulated alternative routing strategies.

Exploring Function Granularity for Serverless Machine Learning Application with GPU Sharing

Recent years have witnessed increasing interest in machine learning (ML) inferences on serverless computing due to its auto-scaling and cost-effective properties. However, one critical aspect, function granularity, has been largely overlooked, limiting the potential of serverless ML. This paper explores the impact of function granularity on serverless ML, revealing its important effects on the SLO hit rates and resource costs of serverless applications. It further proposes adaptive granularity as an approach to addressing the phenomenon that no single granularity fits all applications and situations. It explores three predictive models and presents programming tools and runtime extensions to facilitate the integration of adaptive granularity into existing serverless platforms. Experiments show adaptive granularity produces up to a 29.2% improvement in SLO hit rates and up to a 24.6% reduction in resource costs over the state-of-the-art serverless ML which uses fixed granularity.

UniContainer: Unlocking the Potential of Unikernel for Secure and Efficient Containerization

Container technology, characterized by its convenience in deployment and exceptional performance, has emerged as a dominant force in the realm of cloud computing. However, the shared kernel among different containers and the common practice of running multiple applications within a single container pose threats to user data from malicious co-resident containers or other malicious programs within the same container. To address this problem, we introduce a novel container architecture design - UniContainer. UniContainer partitions original containers, allowing each component to run within an independent customized Unikernel, thereby enhancing isolation both between containers and within containers. We implement a prototype of UniContainer based on Unikraft, enabling automated analysis of target container system call logs and configuration of optimal Unikraft images, while preserving the convenience of container technology deployment. Through experimental evaluation, we validate the effectiveness and performance of UniContainer, maximizing the outstanding performance benefits of container technology.

Microns: Connection Subsetting for Microservices in Shared Clusters

Microservice applications typically employ a technique known as connection subsetting to ensure resource-efficient and stable communication with persistent connections. However, the interdependency in microservice applications and complex runtime environments pose significant challenges for effective connection subsetting, rendering traditional strategies notably inefficient.

In this paper, we present Microns, a connection subsetting framework designed for microservices in shared clusters. At the application level, Microns effectively handles the complex call dependencies in applications and meticulously determines the number of connections maintained by each pair of dependent microservices. At the microservice level, Microns manages the connection relationships between dependent containers according to their respective contributions to end-to-end latency. Experiments across microservice benchmarks demonstrate that Microns reduces end-to-end latency by over 74.4%.

SESSION: Session 3B: Theory II

A Piecewise Lyapunov Analysis of Sub-quadratic SGD: Applications to Robust and Quantile Regression

Motivated by robust and quantile regression problems, we investigate the stochastic gradient descent (SGD) algorithm for minimizing an objective function f that is locally strongly convex with a sub-quadratic tail. This setting covers many widely used online statistical methods. We introduce a novel piecewise Lyapunov function that enables us to handle functions f with only first-order differentiability, which includes a wide range of popular loss functions such as the Huber loss. Leveraging our proposed Lyapunov function, we derive finite-time moment bounds under general diminishing stepsizes, as well as constant stepsizes. We further establish the weak convergence, central limit theorem, and bias characterization under constant stepsize, providing the first geometric convergence result for sub-quadratic SGD. Our results have wide applications, especially in online statistical methods. In particular, we discuss two applications of our results. 1) Online robust regression: We consider a corrupted linear model with sub-exponential covariates and heavy-tailed noise. Our analysis provides convergence rates comparable to those for corrupted models with Gaussian covariates and noise. 2) Online quantile regression: Importantly, our results relax the common assumption in prior work that the conditional density is continuous and provide a more fine-grained analysis for the moment bounds.
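As one concrete instance of this setting (a first-order-differentiable loss with a sub-quadratic tail), the sketch below runs constant-stepsize SGD for streaming robust linear regression with the Huber loss; the data model, stepsize, and Huber parameter are illustrative placeholders.

```python
import numpy as np

def huber_grad(residual, delta=1.0):
    """Gradient of the Huber loss w.r.t. the residual: quadratic near 0, linear in the tail
    (only first-order differentiable at |r| = delta, the regime handled by the paper)."""
    return residual if abs(residual) <= delta else delta * np.sign(residual)

def sgd_huber(stream, dim, stepsize=0.05, delta=1.0):
    """Constant-stepsize SGD iterates theta_{t+1} = theta_t - alpha * g_t on streaming (x, y)."""
    theta = np.zeros(dim)
    for x, y in stream:
        residual = float(x @ theta - y)
        theta -= stepsize * huber_grad(residual, delta) * x
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_theta = np.array([1.0, -2.0, 0.5])

    def stream(n=20_000):
        for _ in range(n):
            x = rng.normal(size=3)
            noise = rng.standard_t(df=2)  # heavy-tailed noise
            yield x, float(x @ true_theta) + noise

    print(sgd_huber(stream(), dim=3))  # roughly recovers [1.0, -2.0, 0.5] up to stepsize-induced noise
```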

Optimal Aggregation via Overlay Trees: Delay-MSE Tradeoffs under Failures

Many applications, e.g., federated learning, require the aggregation of information across a large number of distributed nodes. In this paper, we explore efficient schemes to do this at scale, leveraging aggregation at intermediate nodes across overlay trees. Files/updates are split into chunks which are in turn simultaneously aggregated over different trees. For a synchronous setting with homogeneous communication capabilities and deterministic link delays, we develop a delay-optimal aggregation schedule. In the asynchronous setting, where delays are stochastic but i.i.d. across links, we show that for an asynchronous implementation of the above schedule, the expected aggregation delay is near-optimal. We then consider the impact that failures in the network have on the resulting Mean Square Error (MSE) for the estimated aggregates and how it can be controlled through the addition of redundancy, reattempts, and optimal failure-aware estimates for the desired aggregate. Based on the analysis of a natural model of failures, we show how to choose parameters to optimize the trade-off between aggregation delay and MSE. We present simulation results exhibiting the above-mentioned tradeoffs. We also consider a more general class of correlated failures and demonstrate via simulation the applicability of our techniques in those settings as well.

The Power of Migrations in Dynamic Bin Packing

In the Dynamic Bin Packing problem, n items arrive and depart the system in an online manner, and the goal is to maintain a good packing throughout. We consider the objective of minimizing the total active time, i.e., the sum of the number of open bins over all times. An important tool for maintaining an efficient packing in many applications is the use of migrations; e.g., transferring computing jobs across different machines. However, there are large gaps in our understanding of the approximability of dynamic bin packing with migrations. Prior work has covered the power of no migrations and > n migrations, but we ask the question: What is the power of limited (≤ n) migrations?

Our first result is a dichotomy between no migrations and linear migrations: Using a sublinear number of migrations is asymptotically equivalent to doing zero migrations, where the competitive ratio grows with μ, the ratio of the largest to smallest item duration. On the other hand, we prove that for every α ∈ (0,1], there is an algorithm that does ≈ αn migrations and achieves competitive ratio ≈ 1/α (in particular, independent of μ); we also show that this tradeoff is essentially best possible. This fills in the gap between zero migrations and > n migrations in Dynamic Bin Packing.

Finally, in light of the above impossibility results, we introduce a new model that more directly captures the impact of migrations. Instead of limiting the number of migrations, each migration adds a delay of C time units to the item's duration; this commonly appears in settings where a blackout or set-up time is required before the item can restart its execution in the new bin. In this new model, we prove a O(min(√C, μ))-approximation, and an almost matching lower bound. We also present preliminary experiments that indicate that our theoretical results are predictive of the practical performance of our algorithms.

Tight Bounds for Dynamic Bin Packing with Predictions

A variant of the Dynamic Bin Packing (DBP) problem known as MinUsageTime DBP seeks to minimize the accumulated length of time for which bins are used in packing a sequence of items. This paper studies MinUsageTime DBP with predictions about item durations and establishes tight competitiveness bounds over the entire spectrum of prediction errors.

SESSION: Session 3C: Measurement I

Uncovering BGP Action Communities and Community Squatters in the Wild

The Border Gateway Protocol (BGP) offers several ''knobs'' to control routing decisions, but they are coarse-grained and only affect routes received from neighboring Autonomous Systems (ASes). To enhance policy expressiveness, BGP was extended with the communities attribute, allowing an AS to attach metadata to routes and influence the routing decisions of a remote AS. The metadata can carry information to a remote AS (e.g., where a route was received) or request an action from it (e.g., not to export a route to one of its neighbors). Unfortunately, the semantics of BGP communities are not standardized, lack universal rules, and are poorly documented. In this work, we design and evaluate algorithms to automatically uncover BGP action communities and ASes that violate standard practices by consistently using the information communities of other ASes, revealing undocumented relationships between them (e.g., siblings). Our experimental evaluation with billions of route announcements from public BGP route collectors from 2018 to 2023 uncovers previously unknown AS relationships and shows that our algorithm for identifying action communities achieves average precision and recall of 92.5% and 86.5%, respectively.

Beyond App Markets: Demystifying Underground Mobile App Distribution Via Telegram

The thriving mobile app ecosystem encompasses a wide range of functionalities. However, within this ecosystem, a subset of apps provides illicit services such as gambling and pornography to pursue economic gains, collectively referred to as ''underground economy apps''. While previous studies have examined these apps' characteristics and identification methods, investigations into their distribution via platforms beyond app markets, such as Telegram, remain scarce, even though Telegram has emerged as a crucial channel for underground activities and cybercrime due to its robust encryption and user anonymity.

This study provides the first comprehensive exploration of the underground mobile app ecosystem on Telegram. Overcoming the complexities of the Telegram environment, we build a novel dataset and analyze the prevalence, promotional strategies, and characteristics of these apps. Our findings reveal the significant prevalence of these apps on Telegram, with the total sum of subscription user numbers across channels promoting these apps equivalent to 1% of Telegram's user base. We find these apps primarily cater to gambling and pornography services. We uncover sophisticated promotional strategies involving complex networks of apps, websites, users, and channels, and identify significant gaps in Telegram's content moderation capabilities. Our analysis also exposes the misuse of iOS features for app distribution and the prevalence of malicious behaviors in these apps. This research not only enhances our understanding of the underground app ecosystem but also provides valuable insights for developing effective regulatory measures and protecting users from potential risks associated with these covert operations. Our findings provide implications for platform regulators, app market operators, law enforcement agencies, and cybersecurity professionals in combating the proliferation of underground apps on encrypted messaging platforms. The full version of this paper is available at [7].

INT-MC: Low-Overhead In-Band Network-Wide Telemetry Based on Matrix Completion

In-band Network Telemetry (INT) enhances real-time, high-resolution network monitoring capabilities by incorporating fine-grained internal state information into packets. Utilizing INT for network-wide visualization can significantly bolster network management and operation. Although existing studies have made significant contributions, they have not been able to simultaneously meet the following two objectives: 1) Comprehensive visibility of the network's internal state, which refers to obtaining information on all switch ports and their corresponding link states; 2) Low measurement overhead, which involves reducing measurement bandwidth overhead and minimizing the impact of the measurement process on the network. We designed INT-MC to meet both of these objectives. INT-MC is an efficient and cost-effective network-wide telemetry solution based on matrix completion. By modeling link metadata as a matrix and leveraging its low-rank property, INT-MC selectively measures certain links. It then employs matrix completion algorithms to infer information about unmeasured links, thereby achieving low-overhead network-wide telemetry. Designing paths to cover the target links involves an NP-complete problem, and the high computational complexity may delay the measurement tasks. To circumvent this, we propose an improved scheme based on Eulerian digraph decomposition, transforming the path calculation problem into a high-information path selection problem, significantly reducing computational costs.
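The inference step can be illustrated generically: arrange per-link metadata into a partially observed matrix and fill the unmeasured entries with a low-rank completion. The iterative SVD-projection routine below is a standard hard-impute-style sketch on synthetic data, not INT-MC's actual completion algorithm or path-selection logic.

```python
import numpy as np

def lowrank_impute(observed, mask, rank=3, iters=200):
    """Fill missing entries of an (approximately) low-rank matrix by iterating
    truncated-SVD projection on the currently imputed matrix.
    `observed` holds measured values, `mask` is True where a link was measured."""
    filled = np.where(mask, observed, 0.0)
    for _ in range(iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # Keep the measured entries fixed, update only the unmeasured ones.
        filled = np.where(mask, observed, low_rank)
    return filled

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic "link metadata" matrix with rank 3 (e.g., per-link metrics over time).
    truth = rng.random((20, 3)) @ rng.random((3, 30))
    mask = rng.random(truth.shape) < 0.4          # measure only ~40% of the links
    estimate = lowrank_impute(np.where(mask, truth, np.nan), mask)
    err = np.abs(estimate - truth)[~mask].mean()
    print(f"mean abs error on unmeasured entries: {err:.4f}")
```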

We have implemented an INT-MC prototype within the NSFNet topology, consisting of 14 Tofino switches and 10 end-hosts, and conducted extensive experiments and evaluations. The results indicate that INT-MC incurs only 16% of the measurement overhead compared to existing network-wide telemetry solutions, while achieving nearly identical accuracy. Even under high-frequency measurements of 20 times per second, the bandwidth overhead of INT-MC is approximately 0.075% of the total bandwidth, exerting minimal impact on the network.

Beyond Data Points: Regionalizing Crowdsourced Latency Measurements

Despite significant investments in access network infrastructure, universal access to high-quality Internet connectivity remains a challenge. Policymakers often rely on crowdsourced datasets to assess Internet performance across geographic regions. However, such datasets exhibit non-uniform sampling densities and misalignment with infrastructure boundaries. This work presents a spatial analysis framework to construct stable sampling boundaries for Internet performance metrics using crowdsourced latency data. We apply statistical techniques to aggregate performance over geographic regions, interpolate unsampled locations, and cluster spatial units into contiguous regions with similar characteristics. Our evaluation demonstrates that these methods achieve higher temporal stability compared to traditional approaches that aggregate directly over administrative boundaries. The findings highlight the importance of spatial modeling in optimizing Internet performance distribution for policy and planning.

SESSION: Session 4A: Online Learning I

Asynchronous Multi-Agent Bandits: Fully Distributed vs. Leader-Coordinated Algorithms

We study the cooperative asynchronous multi-agent multi-armed bandits problem, where each agent's active (arm pulling) decision rounds are asynchronous. That is, in each round, only a subset of agents is active to pull arms, and this subset is unknown and time-varying. We consider two models of multi-agent cooperation, fully distributed and leader-coordinated, and propose algorithms for both models that attain near-optimal regret and communication bounds, both of which are almost as good as their synchronous counterparts. The fully distributed algorithm relies on a novel communication policy consisting of accuracy-adaptive and on-demand components, and successive arm elimination for decision-making. For leader-coordinated algorithms, a single leader explores arms and recommends them to other agents (followers) to exploit. As agents' active rounds are unknown, a competent leader must be chosen dynamically. We propose a variant of the Tsallis-INF algorithm with low switches to choose such a leader sequence. Lastly, we report numerical simulations comparing our new asynchronous algorithms with other known baselines.

Combinatorial Logistic Bandits

Combinatorial multi-armed bandit (CMAB) is a fundamental online learning framework that can optimize cumulative rewards in networked systems under uncertainty. Real-world applications like content delivery and channel allocation often feature binary base arm rewards and nonlinear total reward functions. This paper introduces combinatorial logistic bandits (CLogB), a contextual CMAB framework with the base arm reward modeled as a nonlinear logistic function of the context, and the feedback is governed by a general arm-triggering process. We study CLogB with smooth reward functions, covering applications such as online content delivery, online multi-LLM selection, and dynamic channel allocation. Our first algorithm, CLogUCB, uses a variance-agnostic exploration bonus and achieves a regret bound of Õ(d√(κ KT)), where d is the feature dimension, κ reflects logistic model nonlinearity, K is the maximum number of triggered arms, and Õ ignores logarithmic factors. This improves on prior results by Õ(√κ ). We further propose VA-CLogUCB, a variance-adaptive enhancement achieving regret bounds of Õ(d√(KT) ) under standard smoothness conditions and Õ(d√T ) under stronger variance conditions, removing dependence on K. For time-invariant feature maps, we enhance computational efficiency by avoiding nonconvex optimization while maintaining Õ(d√T) regret. Experiments on synthetic and real-world datasets validate the superior performance of our algorithms, demonstrating their effectiveness and scalability for real-world networked systems.

Online Fair Allocation of Reusable Resources

Motivated by the emerging paradigm of resource allocation that integrates classical objectives, such as cost minimization, with societal objectives, such as carbon awareness, this paper proposes a general framework for the online fair allocation of reusable resources. Within this framework, an online decision-maker seeks to allocate a finite resource with capacity C to a sequence of requests arriving with unknown distributions of types, utilities, and resource usage durations. To accommodate diverse objectives, the framework supports multiple actions and utility types, and the goal is to achieve max-min fairness among utilities, i.e., maximize the minimum time-averaged utility across all utility types. Our performance metric is an (α,β)-competitive guarantee of the form: ALG ≥ α • OPT* - O(T^{β-1}), α, β ∈ (0, 1], where OPT* and ALG are the time-averaged optimum and objective value achieved by the decision maker, respectively. We propose a novel algorithm that achieves a competitive guarantee of (1-O(√(log C/C)), 2/3) under bandit feedback. As resource capacity increases, the multiplicative competitive ratio term 1-O(√(log C/C)) asymptotically approaches optimality. Notably, when the resource capacity exceeds a certain threshold, our algorithm achieves an improved competitive guarantee of (1, 2/3). Our algorithm employs an optimistic penalty-weight mechanism coupled with a dual exploration-discarding strategy to balance resource feasibility, exploration, and fairness among utilities.

Universal and Tight Bounds on Counting Errors of Count-Min Sketch with Conservative Updates

Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based algorithm used to estimate the occurrences of items within a data stream. In this paper, we study a similar algorithm, which we refer to as CU-S, where d hash functions map each item to d counters in a single array of size m. Specifically, we present an analytical method to quantify the trade-off between memory usage and the Counting Error (CE) of an item in CU-S, defined as the discrepancy between the estimated and actual number of its occurrences. The first result of this paper shows that items absent from the stream experience the highest CE. We refer to their error as the Unseen items Error (UE). The second result is an upper bound on UE of any stream, expressed in terms of the Reference Error (RE), which is UE of a reference stream where each item appears at most once. Finally, we construct two Markov processes parametrized by g which yield lower and upper bounds on RE, with g balancing accuracy and computation time of these bounds.
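For reference, the update rule under study is brief: each arriving item is hashed to d counters in a single array of size m, only the counters equal to the current minimum are incremented (the conservative update), and the count estimate is the minimum over the item's d counters. A minimal sketch, with salted built-in hashing standing in for the paper's hash functions:

```python
import random

class ConservativeUpdateSketch:
    """Single counter array of size m with d hash functions (the CU-S structure);
    conservative update increments only the currently-minimal counters."""

    def __init__(self, m, d, seed=0):
        rng = random.Random(seed)
        self.counters = [0] * m
        # Simple salted hashes as stand-ins for the paper's hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.m = m

    def _indices(self, item):
        return [hash((salt, item)) % self.m for salt in self.salts]

    def update(self, item):
        idx = self._indices(item)
        current_min = min(self.counters[i] for i in idx)
        for i in idx:
            if self.counters[i] == current_min:   # conservative update
                self.counters[i] += 1

    def estimate(self, item):
        return min(self.counters[i] for i in self._indices(item))

if __name__ == "__main__":
    sketch = ConservativeUpdateSketch(m=256, d=3)
    stream = ["a"] * 100 + ["b"] * 10 + [f"x{i}" for i in range(500)]
    for item in stream:
        sketch.update(item)
    # Estimates never undercount; querying an absent key shows the Unseen-items Error (UE).
    print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("never-seen"))
```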

SESSION: Session 4B: Performance

Internet Service Usage and Delivery As Seen From a Residential Network

Given the increasing residential Internet use, a thorough understanding of what services are used and how they are delivered to residential networks is crucial. However, access to residential traces is limited due to their proprietary nature. Most prior work used campus datasets from academic buildings and undergraduate dorms, and the few studies with residential traces are often outdated or use data unavailable to other researchers. We provide access to a new residential dataset: we have been collecting traffic from ~1000 off-campus residences that house faculty, postdocs, graduate students, and their families. Although our residents are university affiliates, our dataset captures their activity at home, and we show that this dataset offers a distinct perspective from the campus and dorm traffic. We investigate the serving infrastructures and services accessed by the residences, revealing several interesting findings: peer-to-peer activity is notable, comprising 47% of the total flow duration; third-party CDNs host many services but serve much less traffic (e.g., Cloudflare hosts 19% of domains but only 2% of traffic); and 11 of the top 100 services that have nearby servers often serve users from at least 1,000 km farther away. This broad analysis, as well as our data sharing, pushes toward a more thorough understanding of Internet service usage and delivery, motivating and supporting future research. This extended abstract is an abridged version of our full paper.

CHash: A High Cost-Performance Hash Design for CXL-based Disaggregated Memory System

The exponential growth in demand for high-performance computing systems has created an urgent requirement for innovative memory technologies that can provide higher bandwidth and enhanced capacity scalability. In particular, Compute Express Link (CXL) has emerged as a promising solution for memory expansion and system acceleration. Hash-based index structures are widely recognized as fundamental components of in-memory database systems, and they are commonly used for indexing in-memory key-value stores due to their capability for rapid lookup performance. How to design and maintain a hash index structure in a CXL-based disaggregated memory system, with comprehensive consolidation, is a challenging topic that calls for in-depth research.

Specifically, the architecture and characteristics of the CXL memory device are significantly distinct from those of DRAM and PMEM. Simply applying prior hashing approaches without considering the differences would not fully exploit the characteristics presented by CXL memory devices. This leads to the performance inefficiencies of conventional indexing designs and operation processes that overlook the characteristics of the CXL memory device.

In this paper, we conduct extensive experiments on two real CXL memory devices based on the 4th-generation Intel® Xeon® Scalable Processor with CXL 1.0 support. Specifically, we run various microbenchmarks not only to evaluate the performance characteristics of CXL memory devices, but also to measure the performance impact of different memory allocation methods. We observe significant performance disparities and allocation inefficiencies, motivating us to innovate the hash design in the CXL-based disaggregated memory system. Upon in-depth analysis, we propose CHash, as shown in Figure 1, which consists of the following key designs:

Stage buckets for KV batching. The experimental results show that real CXL memory devices from different vendors exhibit diverse performance characteristics. Moreover, there is still a significant performance gap between CPU-attached DRAM and CXL memory. To address this difference, CHash first stores the KV records in a DRAM buffer, called the stage bucket in our design, to batch the KVs, avoiding frequent and small writes to CXL memory devices.

Acceleration techniques for fast lookups. Although memory access latencies for read and write operations are comparable, the end-to-end performance of reads is lower than that of writes because multiple caching layers, including the CPU cache and the write buffer in the device's memory controller, shorten the write latency. In response, CHash implements several acceleration techniques designed to minimize costly key-value comparisons, thereby reducing the number of reads to CXL memory. Furthermore, CHash stores the metadata, the most frequently accessed component of the hash table, in DRAM to enhance overall performance.

Additional optimizations for effective CXL memory management, including two-level filtering and hash multiplexing, are also adopted to further improve overall performance. Extensive experimental results validate the efficiency of CHash on different CXL memory devices, demonstrating that it achieves a 2.2× to 9.4× speedup for insertions and a 1.2× to 4.5× speedup for lookups, all while maintaining a high load factor (~90%), compared to state-of-the-art hashing schemes.
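To make the stage-bucket idea concrete, the following is a minimal sketch of DRAM-side write batching under our own assumptions (the StageBucket class and the flush_to_cxl callback are hypothetical names, not CHash's actual interface):

```python
# Minimal sketch of the stage-bucket idea (hypothetical names, not CHash's code):
# buffer KV writes in fast DRAM and flush them to CXL memory in batches,
# trading a small amount of staging memory for fewer small CXL writes.
class StageBucket:
    def __init__(self, capacity, flush_to_cxl):
        self.capacity = capacity          # max KVs to stage in DRAM
        self.flush_to_cxl = flush_to_cxl  # callback that writes a batch to CXL-backed buckets
        self.staged = {}                  # DRAM-resident staging area

    def insert(self, key, value):
        self.staged[key] = value
        if len(self.staged) >= self.capacity:
            self.flush()

    def lookup(self, key):
        # Check the DRAM stage first; a miss would fall through to CXL memory (not shown).
        return self.staged.get(key)

    def flush(self):
        if self.staged:
            self.flush_to_cxl(list(self.staged.items()))  # one large batched write
            self.staged.clear()

# Usage: eight small inserts become two batched writes to the CXL tier.
writes = []
bucket = StageBucket(capacity=4, flush_to_cxl=writes.append)
for i in range(8):
    bucket.insert(f"key{i}", i)
print(len(writes))   # -> 2
```

The point of the design, as described above, is that many small writes are coalesced into a few large batched writes to the slower CXL tier.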

Understanding Intel User Interrupts

User interrupts (UINTR) is a new hardware feature available in recent Intel Xeon processors. It enables a CPU to send, receive, and handle hardware inter-processor interrupts directly in user mode (ring 3), i.e., without the intervention of the OS kernel. In this paper, we shed light on the inner workings of this facility and investigate, through detailed performance and overhead measurements on physical machines and virtual machines with Linux and KVM, how it can be leveraged for more efficient inter-process communication as an alternative to OS (Unix) signals. With UINTR, preemptive user-level thread schedulers can achieve quantum lengths on the microsecond scale.

A Case Study for Ray Tracing Cores: Performance Insights with Breadth-First Search and Triangle Counting in Graphs

The emerging ray-tracing (RT) cores on GPUs have recently been repurposed by researchers for non-ray-tracing tasks. In this paper, we explore the benefits and effectiveness of executing graph algorithms on RT cores. We redesign breadth-first search and triangle counting on the new hardware as representative graph algorithms. Our implementations focus on how to convert graph operations into bounding volume hierarchy (BVH) construction and ray generation, computational paradigms specific to ray tracing. We evaluate our RT-based methods on a wide range of real-world datasets. The results do not show an advantage of the RT-based methods over CUDA-based methods. We extend the experiments to the set intersection workload on synthesized datasets, where the RT-based method shows superior performance when the skew ratio is high. By carefully comparing the RT-based and CUDA-based binary search, we discover that RT cores are more efficient at searching for elements, but this comes with a constant and non-trivial overhead in the execution pipeline. Furthermore, the overhead of BVH construction is substantially higher than sorting on CUDA cores for large datasets. Our case studies unveil several rules for adapting graph algorithms to ray-tracing cores that might benefit the future evolution of this emerging hardware toward general-computing tasks.

SESSION: Session 4C: Quantum

Peer-to-Peer Distribution of Graph States Across Spacetime Quantum Networks of Arbitrary Topology

Graph states are an important class of multiparty entangled quantum states, of which Bell pairs are a special case. Realizing robust and fast distribution of arbitrary graph states in the downstream layer of a quantum network is essential for enabling large-scale quantum networks. To address this, we propose a novel quantum network protocol, called P2PGSD, inspired by classical peer-to-peer networks. The protocol efficiently implements general graph state distribution in the network layer and demonstrates significant advantages in resource efficiency and scalability, particularly for sparse graph states. We also construct an explicit mathematical model for the general graph state distribution problem, on which we prove the intractability of a wide class of resource minimization problems and discuss the optimality of existing algorithms. Moreover, drawing inspiration from special relativity, we leverage the spacetime quantum network to manage memory in the face of network challenges. We suggest a universal quantum distributed computation framework to exploit the strengths of our protocols, as confirmed by numerical simulations that reveal up to a 50% improvement for general sparse graph states. This work marks a significant step toward resource-efficient multiparty entanglement distribution for diverse network topologies.

Modeling and Simulating Rydberg Atom Quantum Computers for Hardware-Software Co-design with PachinQo

Quantum computing has the potential to accelerate various domains: scientific computation, machine learning, and optimization. Recently, Rydberg atom quantum computing has emerged as a promising quantum computing technology, especially with the demonstration of the zonal addressing architecture. However, this demonstration is only compatible with one type of quantum algorithm, and extending it to compile and execute general quantum algorithms is a challenge. To address this challenge, we propose PachinQo, a framework for co-designing the architecture and compilation of zonal addressing systems for any given quantum algorithm. PachinQo's evaluation demonstrates its ability to improve a quantum algorithm's estimated probability of success by 45% on average in error-prone quantum environments.

Optimal Scheduling in a Quantum Switch: Capacity and Throughput Optimality

With a growing number of quantum networks in operation, there is a pressing need for performance analysis of quantum switching technologies. A quantum switch establishes, distributes, and maintains entanglements across a network. In contrast to a classical switching fabric, a quantum switch is a two-sided queueing network. The switch generates Link-Level Entanglements (LLEs) across links that connect users to the switch, which are then fused to serve the network's entanglement requests. First, we characterize the capacity region, defined as the set of entanglement request rates for which there exists a scheduling policy stabilizing the system. We then propose a sequence of Max-Weight policies and show that it achieves throughput optimality in the asymptotic sense. Our proof techniques analyze a two-time-scale separation phenomenon at the fluid scale for a general switch topology. This allows us to demonstrate that the optimal fluid dynamics are given by a scheduling algorithm that solves a certain average-reward Markov Decision Process.
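For background, the Max-Weight principle that this policy family builds on can be sketched as follows; this is a generic illustration under simplified assumptions, not the paper's two-sided quantum-switch formulation:

```python
# Generic Max-Weight scheduling sketch (illustrative only; the paper's policies
# operate on a two-sided quantum switch and differ in their details).
# Each candidate schedule specifies how many requests of each type it serves.
def max_weight_schedule(queue_lengths, candidate_schedules):
    """Pick the schedule maximizing sum_i q_i * s_i."""
    return max(candidate_schedules,
               key=lambda s: sum(q * si for q, si in zip(queue_lengths, s)))

# Example: three request types, two feasible schedules.
q = [5, 1, 3]
schedules = [(1, 0, 1), (0, 2, 0)]
print(max_weight_schedule(q, schedules))  # -> (1, 0, 1), weight 8 vs 2
```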

Quantum Network Optimization: From Optimal Routing to Fair Resource Allocation

Quantum networks are essential infrastructure for enabling large-scale and long-distance quantum communications but face significant challenges in routing optimization and resource allocation due to their probabilistic nature and quantum resource limitations. Existing approaches typically tackle these problems in isolation, often by simply applying classical routing algorithms, maximizing the overall profit to allocate resources without considering fairness, or improving fairness in an ad-hoc way without a rigorous model. This paper proposes a general framework to systematically address these challenges. First, we conduct a thorough analysis of quantum network metrics using routing algebra as the mathematical foundation, and design provably optimal routing algorithms to tackle the unique challenges arising from their probabilistic characteristics. Second, we formulate an optimization model that simultaneously considers fairness among concurrent requests while respecting various quantum resource constraints, and design efficient near-optimal heuristics to solve it. The proposed framework provides both theoretical insights and practical solutions for the design and management of future quantum networks.

SESSION: Session 5A: Systems I

NetJIT: Bridging the Gap from Traffic Prediction to Preknowledge for Distributed Machine Learning

Today's distributed machine learning (DML) introduces heavy traffic load, making the network one of the primary bottlenecks. To mitigate this bottleneck, existing state-of-the-art network optimization methods, such as traffic or topology engineering, are proposed to adapt to real-time traffic. However, current traffic measurement and prediction methods struggle to collect sufficiently fine-grained and accurate traffic patterns. This limitation impedes the ability of cutting-edge network optimization techniques to react agilely to the ever-changing traffic demands of DML jobs.

This paper proposes NetJIT, a novel program-behavior-aware toolkit for foreseeing the traffic pattern of DML. To the best of our knowledge, this is the first work proposing the use of just-in-time (JIT) program analysis for real-time traffic measurement. In DML applications, communication behavior is primarily determined by the previously computed results. NetJIT leverages this characteristic to anticipate communication details by tracing and analyzing the data relations in the computation process. This capability enables the deployment of optimization strategies in advance.

We deploy NetJIT in real-world network optimization for traffic preknowledge. Evaluation on our self-built testbed demonstrates that NetJIT achieves up to about 97% lower error in detecting communication events compared with other methods. Simulations with real-world DML workloads further illustrate that NetJIT enables more precise network optimization, leading to approximately 50% better network performance with respect to metrics including average iteration time, throughput, and average packet delay.

ScaleOPT: A Scalable Optimal Page Replacement Policy Simulator

This paper proposes ScaleOPT, a scalable simulator for the optimal page replacement policy (OPT) that leverages multi-core parallelism. Specifically, we first propose AccessMap, which enables calculating the next reference time of an accessed page in constant time by collecting the future references of pages in parallel before the OPT simulation. Second, we introduce Pipelined-Tree, consisting of multiple trees that organize caches based on min-max reference times. It enables traversal and update operations of each cache to be performed in a partially parallel (i.e., pipelined) manner, thereby scaling out the OPT simulation on multi-cores. The experimental results demonstrate that ScaleOPT improves the simulation time on a 72-core machine by 6.3×, 20.5×, and 13.9× compared with the long-established standard algorithm-based simulator augmented with our AccessMap and with two widely used cache simulators, respectively.
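As background on the idea of precomputing next reference times, the following sketch shows how such a map enables constant-time updates in an OPT (Belady) simulation. It is our simplified, sequential illustration: ScaleOPT additionally builds the map in parallel and replaces the linear victim scan below with its Pipelined-Tree.

```python
# Sketch: precompute, for every position in the trace, when the accessed page
# is referenced next, so the OPT (Belady) simulation can look up the next
# reference time of the current access in O(1).
def build_access_map(trace):
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):   # scan the trace backwards
        page = trace[i]
        next_use[i] = last_seen.get(page, float("inf"))
        last_seen[page] = i
    return next_use

def simulate_opt(trace, cache_size):
    next_use = build_access_map(trace)
    cache = {}            # page -> next reference time
    misses = 0
    for i, page in enumerate(trace):
        if page in cache:
            cache[page] = next_use[i]                # constant-time update
        else:
            misses += 1
            if len(cache) >= cache_size:
                victim = max(cache, key=cache.get)   # farthest next use (linear scan here)
                del cache[victim]
            cache[page] = next_use[i]
    return misses

print(simulate_opt([1, 2, 3, 1, 2, 4, 1], cache_size=2))  # -> 5 misses
```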

SESSION: Session 5B: Systems II

PipeCo: Pipelining Cold Start of Deep Learning Inference Services on Serverless Platforms

The fusion of serverless computing and deep learning (DL) has led to serverless inference, offering a promising approach for developing and deploying scalable and cost-efficient deep learning inference services (DLISs). However, the challenge of cold start presents a significant obstacle for DLISs, where DL model size greatly impacts latency. Existing studies mitigate cold starts by extending keep-alive times, which unfortunately leads to decreased resource utilization efficiency. To address this issue, we introduce PipeCo, a system designed to alleviate DLIS cold start. The core concept of PipeCo is to achieve the miniaturization and pipelining of DLIS cold start. Firstly, PipeCo utilizes a vertical partitioning approach to divide each DLIS into multiple slices, prewarming slices in a sequential and overlapping manner to decrease the overall cold-start latency. Secondly, PipeCo employs an attention-based prediction mechanism to estimate periodic patterns in requests and idle containers for scheduling slices. Thirdly, PipeCo incorporates a similarity-based container matcher for the reuse of idle containers. We implemented a prototype of PipeCo on the OpenFaaS platform and conducted extensive experiments using three real-world DLIS repositories. The results demonstrate that PipeCo effectively decreases end-to-end (E2E) latency by up to 62.67% on CPU and 58.81% on GPU clusters and reduces the overall resource usage by 65.31% compared to five state-of-the-art baselines.

PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures

Graph Neural Networks (GNNs) are emerging models for analyzing graph-structured data. GNN execution involves both compute-intensive and memory-intensive kernels. The memory-intensive kernels dominate execution time because they are significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside memory arrays. To this end, we investigate the potential of PIM systems to alleviate the data movement bottleneck in GNNs, and introduce PyGim, an efficient and easy-to-use GNN library for real PIM systems. We propose intelligent parallelization techniques for the memory-intensive kernels of GNNs tailored for real PIM systems, and develop an easy-to-use Python API for them. PyGim employs cooperative GNN execution, in which the compute- and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively, to fully exploit the hardware capabilities. PyGim integrates a lightweight tuner that configures the parallelization strategy of the memory-intensive kernels of GNNs to provide high system performance while also enabling high programming ease. We extensively evaluate PyGim on a real-world PIM system that has 16 PIM DIMMs with 1992 PIM cores connected to a host CPU. In GNN inference, we demonstrate that it outperforms prior state-of-the-art PIM works by 4.38× on average (up to 7.20×), and the state-of-the-art PyTorch implementation running on the host (an Intel Xeon CPU) by 3.04× on average (up to 3.44×). PyGim improves energy efficiency by 2.86× (up to 3.68×) and 1.55× (up to 1.75×) over the prior PIM and PyTorch host schemes, respectively. In the memory-intensive kernels of GNNs, PyGim provides 11.6× higher resource utilization on the PIM system than the PyTorch library (an optimized CUDA implementation) achieves on GPU systems. Our work provides useful recommendations for software, system, and hardware designers. PyGim is publicly and freely available at https://github.com/CMU-SAFARI/PyGim to facilitate the widespread use of PIM systems in GNNs.

SESSION: Session 5C: Blockchains I

CertainSync: Rateless Set Reconciliation with Certainty

Set reconciliation is a fundamental task in distributed systems, particularly in blockchain networks, where it enables the synchronization of transaction pools among peers and facilitates block dissemination. Existing set reconciliation schemes are either statistical, providing a success probability as a function of the communication overhead and the size of the symmetric difference, or require parametrization and estimation of the size of the symmetric difference, which can be prone to error. In this paper, we present CertainSync, a novel reconciliation framework that, to the best of our knowledge, is the first to guarantee successful set reconciliation without any parametrization or estimators. The framework is rateless and adapts to the unknown symmetric difference size. The set reconciliation is guaranteed to complete successfully whenever the communication overhead reaches a lower bound derived from the symmetric difference size and the universe size. Our framework is based on recent constructions of Invertible Bloom Lookup Tables (IBLTs) that ensure successful element listing as long as the number of elements is bounded. We provide a theoretical analysis to prove the certainty of set reconciliation for multiple constructions. The approach is also validated by simulations, showing the ability to synchronize sets with efficient communication costs while maintaining reconciliation guarantees, compared to other baseline schemes for set reconciliation. To further reduce communication overhead for large universes such as blockchain networks, CertainSync is extended with a universe reduction technique. We compare and validate the extended framework, UniverseReduceSync, against the basic CertainSync framework through simulations using real blockchain transaction hash data from the Ethereum network. The results illustrate a trade-off between improved communication costs and maintaining reconciliation guarantees without relying on parametrization or estimators, offering a comprehensive solution for set reconciliation in diverse scenarios.
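For background, the IBLT primitive that such frameworks build on can be sketched as follows; this is a minimal textbook-style illustration under our own naming, not the paper's construction or its rateless cell schedule:

```python
# Minimal Invertible Bloom Lookup Table (IBLT) sketch: each peer inserts its
# set, one table is subtracted from the other, and peeling lists the keys
# unique to each side (with high probability for this toy sizing).
import hashlib

def h(key, salt, m):
    return int(hashlib.sha256(f"{salt}:{key}".encode()).hexdigest(), 16) % m

def key_check(key):
    return int(hashlib.sha256(f"chk:{key}".encode()).hexdigest(), 16)

class IBLT:
    def __init__(self, m, k=3):
        self.m, self.k = m, k
        self.count = [0] * m
        self.key_sum = [0] * m
        self.chk_sum = [0] * m

    def _cells(self, key):
        return {h(key, salt, self.m) for salt in range(self.k)}

    def insert(self, key, sign=+1):
        for c in self._cells(key):
            self.count[c] += sign
            self.key_sum[c] ^= key
            self.chk_sum[c] ^= key_check(key)

    def subtract(self, other):
        diff = IBLT(self.m, self.k)
        for c in range(self.m):
            diff.count[c] = self.count[c] - other.count[c]
            diff.key_sum[c] = self.key_sum[c] ^ other.key_sum[c]
            diff.chk_sum[c] = self.chk_sum[c] ^ other.chk_sum[c]
        return diff

    def peel(self):
        """List keys unique to each side of a subtracted table."""
        only_mine, only_theirs = set(), set()
        changed = True
        while changed:
            changed = False
            for c in range(self.m):
                if abs(self.count[c]) == 1 and self.chk_sum[c] == key_check(self.key_sum[c]):
                    key, sign = self.key_sum[c], self.count[c]
                    (only_mine if sign == 1 else only_theirs).add(key)
                    self.insert(key, sign=-sign)   # remove the recovered key everywhere
                    changed = True
        return only_mine, only_theirs

# Example: peers A and B reconcile their sets.
a, b = IBLT(m=32), IBLT(m=32)
for x in {1, 2, 3, 10}: a.insert(x)
for x in {2, 3, 4}:     b.insert(x)
print(a.subtract(b).peel())   # expected: ({1, 10}, {4})
```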

The Last Survivor of PoS Pools: Staker's Dilemma

In PoW blockchains, the pooling approach has been criticized as vulnerable to the block withholding (BWH) attack. BWH attackers can unfairly claim rewards from victim pools by submitting proofs of work while secretly withholding any valid blocks they discover. It is well known that BWH attackers against PoW face the miner's dilemma. Interestingly, our findings reveal that this vulnerability manifests more severely in Proof-of-Stake (PoS) systems. For a network consisting only of one attacker pool and one victim pool, the attacker will eventually manipulate the network while the victim gradually loses its stake ratio and vanishes. Moreover, in a more realistic scenario with multiple BWH attacker pools and one solo staker who does not join any pool, we show that only one lucky attacker and the solo staker survive, whereas all the other pools gradually vanish, revealing the staker's dilemma. Our analysis is supported by experiments on large-scale real blockchain systems and by numerical simulations.

SESSION: Session 6A: Online Learning II

Learning-Augmented Decentralized Online Convex Optimization in Networks

This paper studies learning-augmented decentralized online convex optimization in a networked multi-agent system, a challenging setting that has remained under-explored. We first consider a linear learning-augmented decentralized online algorithm (LADO-Lin) that combines a machine learning (ML) policy with a baseline expert policy in a linear manner. We show that, while LADO-Lin can exploit the potential of ML predictions to improve average cost performance, it cannot guarantee worst-case performance. To address this limitation, we propose a novel online algorithm (LADO) that adaptively combines the ML policy and the expert policy, safeguarding the ML predictions while achieving strong competitiveness guarantees. We also prove an average cost bound for LADO, revealing the tradeoff between average performance and worst-case robustness and demonstrating the advantage of training the ML policy with the robustness requirement explicitly taken into account. Finally, we run an experiment on decentralized battery management. Our results highlight the potential of ML augmentation to improve average performance as well as the guaranteed worst-case performance of LADO.
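For intuition, the linear combination underlying LADO-Lin can be sketched as below; this is our simplified illustration, and LADO's adaptive safeguarding of the ML policy is more involved:

```python
# Sketch of a linear ML/expert combination in the spirit of LADO-Lin
# (illustrative only; the actual algorithms differ in their details).
def lado_lin_action(ml_action, expert_action, lam):
    """Convex combination of the ML and expert actions, lam in [0, 1]."""
    return [lam * a + (1 - lam) * b for a, b in zip(ml_action, expert_action)]

# lam = 1 trusts the ML policy fully; lam = 0 falls back to the expert.
print(lado_lin_action([1.0, 0.0], [0.4, 0.6], lam=0.5))  # -> [0.7, 0.3]
```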

Robust Gittins for Stochastic Scheduling

A common theme in stochastic optimization problems is that, theoretically, stochastic algorithms need to "know" relatively rich information about the underlying distributions. This is at odds with most applications, where distributions are rough predictions based on historical data. Thus, commonly, stochastic algorithms are making decisions using imperfect predicted distributions, while trying to optimize over some unknown true distributions.

We consider the fundamental problem of scheduling stochastic jobs preemptively on a single machine to minimize expected mean completion time in the setting where the scheduler is only given imperfect predicted job size distributions. If the predicted distributions are perfect, then it is known that this problem can be solved optimally by the Gittins index policy.

The goal of our work is to design a scheduling policy that is robust in the sense that it produces nearly optimal schedules even if there are modest discrepancies between the predicted distributions and the underlying real distributions. Our main contributions are:

- We show that the standard Gittins index policy is not robust in this sense. If the true distributions are perturbed by even an arbitrarily small amount, then running the Gittins index policy using the perturbed distributions can lead to an unbounded increase in mean completion time.

- We explain how to modify the Gittins index policy to make it robust, that is, to produce nearly optimal schedules, where the approximation depends on a new measure of error between the true and predicted distributions that we define.

Looking forward, the approach we develop here can be applied more broadly to many other stochastic optimization problems to better understand the impact of mispredictions, and can lead to the development of new algorithms that are robust against such mispredictions.
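For context, the classical Gittins rank that the policy computes for a job with attained service (age) a, under a job size distribution S, is commonly written as

\[
G(a) \;=\; \sup_{b > a} \frac{\Pr[S \le b \mid S > a]}{\mathbb{E}\left[\min(S, b) - a \mid S > a\right]},
\]

with the policy always serving a job of highest rank at its current age; the robust modification developed in the paper alters this computation and is not reproduced here.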

Learning-Augmented Competitive Algorithms for Spatiotemporal Online Allocation with Deadline Constraints

We introduce and study spatiotemporal online allocation with deadline constraints (SOAD), a new online problem motivated by emerging challenges in sustainability and energy. In SOAD, an online player completes a workload by allocating and scheduling it on the points of a metric space (X, d) while subject to a deadline T. At each time step, a service cost function is revealed that represents the cost of servicing the workload at each point, and the player must irrevocably decide the current allocation of work to points. Whenever the player moves this allocation, they incur a movement cost defined by the distance metric d(·, ·) that captures, e.g., an overhead cost. SOAD formalizes the open problem of combining general metrics and deadline constraints in the online algorithms literature, unifying problems such as metrical task systems and online search. We propose a competitive algorithm for SOAD along with a matching lower bound establishing its optimality. Our main algorithm, ST-CLIP, is a learning-augmented algorithm that takes advantage of predictions (e.g., forecasts of relevant costs) and achieves an optimal consistency-robustness trade-off. We evaluate our proposed algorithms in a simulated case study of carbon-aware spatiotemporal workload management, an application in sustainable computing that schedules a delay-tolerant batch compute job on a distributed network of data centers. In these experiments, we show that ST-CLIP substantially improves on heuristic baseline methods.
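As a rough illustration (our own simplification, not the paper's exact formulation, which allows fractional allocations across multiple points of the metric space), the offline version of SOAD can be written as

\[
\min_{\{x_t \in X,\; z_t \ge 0\}} \;\sum_{t=1}^{T} z_t\, f_t(x_t) \;+\; \sum_{t=1}^{T} d(x_{t-1}, x_t)
\qquad \text{s.t.} \qquad \sum_{t=1}^{T} z_t \;\ge\; 1,
\]

where x_t is the point chosen to serve the workload at time t, z_t is the fraction of the workload completed at that step, f_t is the service cost function revealed online, and the second sum is the movement cost incurred whenever the allocation changes.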

A Gittins Policy for Optimizing Tail Latency

We consider the problem of scheduling to minimize asymptotic tail latency in an M/G/1 queue with unknown job sizes. When the job size distribution is heavy-tailed, numerous policies that do not require job size information (e.g. Processor Sharing, Least Attained Service) are known to be strongly tail optimal, meaning that their response time tail has the fastest possible asymptotic decay. In contrast, for light-tailed size distributions, only in the last few years have policies been developed that outperform simple First-Come First-Served (FCFS). The most recent of these is γ-Boost, which achieves strong tail optimality in the light-tailed setting. But thus far, all policies that outperform FCFS in the light-tailed setting, including γ-Boost, require known job sizes.

In this paper, we design a new scheduling policy that achieves strong tail optimality in the light-tailed M/G/1 with unknown job sizes. Surprisingly, the optimal policy turns out to be a variant of the Gittins policy, but with a novel and unusual feature: it uses a negative discount rate. Our work also applies to systems with partial information about job sizes, covering γ-Boost as an extreme case when job sizes are in fact fully known. This abstract summarizes our full paper [7].

SESSION: Session 6B: Measurement II

Revisiting Traffic Splitting for Software Switch in Datacenter

Datacenter network topology contains multiple paths between server machines, with each path assigned a weight. Software switches perform traffic splitting, an essential networking operation in datacenters. Previous studies leveraged software switches to distribute network connections across paths, under the assumption that the software switches accurately divide connections according to path weights. However, we reveal that current traffic splitting techniques exhibit significant inaccuracy and resource inefficiency. Consequently, real-world datacenter services (e.g., deep learning) experience ∼2.7× longer communication completion times than the ideal scenario. To address these problems, we propose VALO, a new traffic splitting technique for software switches that targets 1) high accuracy and 2) resource efficiency. To achieve these goals, we introduce two new concepts: the score graph and VALO gravity. We implement VALO in the de facto standard software switch and evaluate it thoroughly. On average, VALO achieves ∼34.8× better accuracy and ∼67.6× better resource efficiency than existing techniques. As a result, VALO demonstrates 1.3×--2.5× faster average communication completion times for real-world services than existing techniques.
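For reference, the weight-proportional, per-connection splitting that such techniques aim to realize can be sketched as follows; this is a generic hash-based baseline of the kind the paper improves upon, not VALO itself (whose score graph and gravity mechanisms are not detailed here):

```python
# Generic hash-based weighted traffic splitting sketch (illustrative baseline).
import bisect, zlib

def build_splitter(paths_with_weights):
    """paths_with_weights: list of (path_id, weight) pairs."""
    cum, total = [], 0
    for _, w in paths_with_weights:
        total += w
        cum.append(total)
    def pick_path(five_tuple):
        # Stable per-connection hashing keeps a flow on one path while
        # splitting flows across paths roughly in proportion to the weights.
        h = zlib.crc32(repr(five_tuple).encode()) % total
        return paths_with_weights[bisect.bisect_right(cum, h)][0]
    return pick_path

pick = build_splitter([("path-A", 3), ("path-B", 1)])   # 3:1 split
print(pick(("10.0.0.1", "10.0.0.2", 5000, 443, "tcp")))  # prints path-A or path-B
```

Hashing the 5-tuple pins each connection to one path (avoiding packet reordering) while spreading connections across paths in proportion to the configured weights; the paper's measurements show that achieving this accurately and efficiently in software switches is harder than it appears.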

Exploiting Kubernetes Autoscaling for Economic Denial of Sustainability

The complexity required to support the flexibility and scale of networks achievable by Kubernetes (K8s)-based applications leaves them vulnerable to Economic Denial of Sustainability (EDoS) attacks, which attempt to deprive targets of the financial means to sustain themselves. We develop Markov Decision Process (MDP) based models to reason about autoscaling and the components influencing scaling thresholds, and build atop the MDP a Stackelberg game structure to reason about EDoS attacks on K8s autoscaling and the circumstances under which adversaries inject traffic. Through numerical evaluations, we show examples where hourly-based charges eliminate incentives for resource scale-down, increasing the incentive to attack. Through experiments on a realistic K8s cluster, we further show that increasing the attack intensity may render it less effective in generating surplus K8s resources compared to the increased traffic generation cost.

A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink

This study presents the first global analysis of on-demand video streaming over Low Earth Orbit (LEO) satellite networks, using data from over one million households across 85 countries. We highlight Starlink's role as a major LEO provider, enhancing connectivity in underserved regions. Our findings reveal that while overall video quality on Starlink matches that of traditional networks, the inherent variability in LEO conditions---such as throughput fluctuations and packet loss---leads to an increase in bitrate switches and rebuffers. To further improve the quality of experience for the LEO community, we manipulate existing congestion control and adaptive bitrate streaming algorithms using simulation and real A/B tests deployed to over one million households. Our results underscore the need for video streaming and congestion control algorithms to adapt to rapidly evolving network landscapes, ensuring high-quality service across diverse and dynamic network types.

ForgetMeNot: Understanding and Modeling the Impact of Forever Chemicals Toward Sustainable Large-Scale Computing

Fluorinated compounds, commonly known as forever chemicals, play a critical role in semiconductor fabrication steps such as lithography, etching, and chamber cleaning. They exhibit global warming potentials thousands of times higher than carbon dioxide and persist in the atmosphere for millennia. Yet, sustainability efforts in computer systems have primarily focused on carbon emissions alone. We introduce ForgetMeNot, a modeling tool that quantifies fluorinated compound emissions across different steps of the semiconductor manufacturing pipeline. We validate its accuracy using real-world fab data and demonstrate its utility for emission-aware hardware design. ForgetMeNot reveals how different chip characteristics affect fluorinated compound emissions, and helps datacenter operators select low-emission hardware configurations. By integrating fluorinated compound emissions into hardware manufacturing decisions, ForgetMeNot presents a more complete sustainability framework for computing systems.

SESSION: Session 6C: Blockchains II

Phishing Tactics Are Evolving: An Empirical Study of Phishing Contracts on Ethereum

The prosperity of Ethereum has led to a rise in phishing scams. Initially, scammers lured users into transferring or granting tokens to Externally Owned Accounts (EOAs). Now, they have shifted to deploying phishing contracts to deceive users. Specifically, scammers trick victims into either directly transferring tokens to phishing contracts or granting these contracts control over their tokens. Our research reveals that phishing contracts have resulted in significant financial losses for users. While several studies have explored cybercrime on Ethereum, to the best of our knowledge, the understanding of phishing contracts is still limited.

In this paper, we present the first empirical study of phishing contracts on Ethereum. We first build a sample dataset including 790 reported phishing contracts, based on which we uncover the key features of phishing contracts. Then, we propose to collect phishing contracts by identifying suspicious functions from the bytecode and simulating transactions. With this method, we have built the first large-scale phishing contract dataset on Ethereum, comprising 37,654 phishing contracts deployed between December 29, 2022 and January 1, 2025. Based on the above dataset, we collect phishing transactions and then conduct the measurement from the perspectives of victim accounts, phishing contracts, and deployer accounts. Alarmingly, these phishing contracts have launched 211,319 phishing transactions, leading to 190.7 million in losses for 171,984 victim accounts. Moreover, we identify a large-scale phishing group deploying 85.7% of all phishing contracts, and it remains active at present. Our work aims to serve as a valuable reference in combating phishing contracts and protecting users' assets.

Blockchain Amplification Attack

Strategies related to the blockchain concepts of arbitrage and front/back-running create strong economic incentives for network nodes to reduce latency. Modified nodes, which minimize transaction validation and neglect to filter invalid transactions in the Ethereum peer-to-peer (P2P) network, introduce a novel attack vector: a Blockchain Amplification Attack. An attacker can exploit those modified nodes to amplify invalid transactions thousands of times, posing a security threat to the entire network. To illustrate the attack's practicality in the current Ethereum main network, we 1) identify thousands of similar attacks in the wild, 2) mathematically model the propagation mechanism, 3) empirically measure model parameters from our monitoring nodes, and 4) compare the performance with other existing Denial-of-Service (DoS) attacks through local simulation. We show that an attacker can amplify network traffic at modified nodes by a factor of 3,600, and cause economic damage 13,800 times the amount needed to carry out the attack. Despite these risks, aggressive latency reduction may still be profitable enough to justify the existence of modified nodes. To assess this trade-off, we 1) simulate the transaction validation process in a local network and 2) empirically measure the latency reduction by deploying our modified node in the Ethereum test network. We finally provide mitigation strategies against the blockchain amplification attack.

Piecing Together the Jigsaw Puzzle of Transactions on Heterogeneous Blockchain Networks

The Web3 ecosystem is increasingly evolving to include multiple chains, with decentralized applications (dApps) distributed across different blockchains. This has driven the need for cross-chain bridges to enable blockchain interoperability. However, this opens up a novel attack surface, and media outlets have reported serious attacks related to cross-chain bridges. Nevertheless, few prior works have studied cross-chain bridges and their related transactions, especially from a security perspective. To fill the void, this paper presents the first comprehensive analysis of cross-chain transactions. We first build a cross-chain transaction dataset based on semantic analysis of popular cross-chain bridges, covering 13 decentralized bridges and 7 representative blockchains, with over 80 million transactions in total. Based on this comprehensive dataset, we study the landscape of cross-chain transactions from several angles, including token usage, user profiles, and the purposes of transactions. We further observe that cross-chain bridges can be abused for malicious or aggressive purposes. We therefore design an automated detector and deploy it in the wild to flag misbehavior among millions of cross-chain transactions. We identify hundreds of abnormal transactions related to exploits, arbitrage, and other activities. Our research underscores the importance of the cross-chain ecosystem and offers an effective detector for pinpointing security threats.

Towards Understanding and Analyzing Instant Cryptocurrency Exchanges

In this paper, we examine a novel category of services in the blockchain ecosystem termed Instant Cryptocurrency Exchange (ICE) services. Originally conceived to facilitate cross-chain asset transfers, ICE services have, unfortunately, been abused for money laundering activities due to two key features: the absence of a strict Know Your Customer (KYC) policy and incomplete on-chain data of user requests. As centralized and non-transparent services, ICE services pose considerable challenges in the tracing of illicit fund flows laundered through them.

Our comprehensive study of ICE services begins with an analysis of their features and workflow. We classify ICE services into two distinct types: Standalone and Delegated. We then perform a measurement analysis of ICE services, paying particular attention to their usage in illicit activities. Our findings indicate that a total of $12,473,290 in illegal funds has been laundered through ICE services, and 432 malicious addresses were initially funded by ICE services. Based on the insights from the measurement analysis, we propose a matching algorithm designed to evaluate the effectiveness of ICE services in terms of efficiency and prevention of traceability. Our evaluation reveals that 92% of the user requests analyzed were completed in less than three minutes, underscoring the efficiency of ICE services. In addition, we demonstrate that the algorithm is effective in tracing illicit funds in situations where ICE services are used in malicious activities. To engage the community, the entire dataset used in this study is open source.