SIGMETRICS/PERFORMANCE '24: Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

SESSION: Session 1A: Queueing

Approximations to Study the Impact of the Service Discipline in Systems with Redundancy

In this paper we develop the first methods to approximate the queue length distribution in a queueing system with redundancy under various service disciplines. We focus on a system with exponential job sizes, i.i.d. copies, and a large number of servers. In the case of Processor Sharing, we provide a pair and a triplet approximation. We also develop a pair approximation for First-Come-First-Served, Limited Processor Sharing and Last-Come-First-Served servers. We present numerical evidence that shows that all the approximations presented in the paper are highly accurate, but that none of them are asymptotically exact (as the number of servers goes to infinity).

Multi-dimensional State Space Collapse in Non-complete Resource Pooling Scenarios

We establish an explicit multi-dimensional state space collapse (SSC) for parallel-processing systems with arbitrary compatibility constraints between servers and job types. This breaks major new ground beyond the SSC results and queue length asymptotics in the literature which are largely restricted to complete resource pooling (CRP) scenarios where the steady-state queue length vector concentrates around a line in heavy traffic. The multi-dimensional SSC that we establish reveals heavy-traffic behavior which is also far more tractable than the pre-limit queue length distribution, yet exhibits a fundamentally more intricate structure than in the one-dimensional case.

Strongly Tail-Optimal Scheduling in the Light-Tailed M/G/1

We study the problem of scheduling jobs in a queueing system, specifically an M/G/1 with light-tailed job sizes, to asymptotically optimize the response time tail. This means scheduling to make P[T > t], the chance a job's response time exceeds t, decay as quickly as possible in the t → ∞ limit. For some time, the best known policy was First-Come First-Served (FCFS), which has an asymptotically exponential tail: P[T > t] ∼ C e^{-γt}. FCFS achieves the optimal decay rate γ, but its tail constant C is suboptimal. We derive a closed-form expression for the optimal tail constant, and we introduce γ-Boost, a new policy that achieves this optimal tail constant. We also show via simulation that γ-Boost has excellent practical performance. This abstract summarizes our full paper.
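
As a minimal illustration of such an exponential response-time tail (and not of the paper's γ-Boost policy), the sketch below simulates an M/M/1 FCFS queue, for which the steady-state response time is known to be exponential with rate μ − λ; all parameters are illustrative.

```python
import random, math

# Monte Carlo sketch (illustrative, not gamma-Boost): in an M/M/1 FCFS queue,
# response times are Exp(mu - lambda), so P[T > t] = exp(-(mu - lambda) * t).
lam, mu, n_jobs = 0.7, 1.0, 200_000
random.seed(0)

wait = 0.0
resp = []
for _ in range(n_jobs):
    service = random.expovariate(mu)
    resp.append(wait + service)                      # response time = wait + service
    interarrival = random.expovariate(lam)
    wait = max(0.0, wait + service - interarrival)   # Lindley recursion for FCFS waiting time

for t in (2.0, 5.0, 10.0):
    empirical = sum(r > t for r in resp) / n_jobs
    print(f"t={t:4.1f}  empirical P[T>t] = {empirical:.4f}  vs  exp(-(mu-lam)t) = {math.exp(-(mu - lam) * t):.4f}")
```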

Heavy-Traffic Optimal Size- and State-Aware Dispatching

We study the problem of dispatching jobs to multiple FCFS (First-Come, First-Served) queues. We consider the case where the dispatcher is size-aware, meaning it learns the size (i.e. service time) of each job as it arrives; and state-aware, meaning it always knows the amount of work (i.e. total remaining service time) at each queue. While size- and state-aware dispatching to FCFS queues has been extensively studied, little is known about optimal dispatching for the objective of minimizing mean delay. In this work, we propose the first size- and state-aware dispatching policy, called CARD (Controlled Asymmetry Reduces Delay), that provably minimizes mean delay in heavy traffic. This work summarizes our full paper.
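
For readers unfamiliar with the setting, the sketch below illustrates size- and state-aware dispatching with a plain Least-Work-Left rule: each arriving job is sent to the queue with the smallest remaining work. It is only a baseline for this setting, not the CARD policy from the paper, and all names and parameters are illustrative.

```python
import random

# Size- and state-aware dispatch baseline (Least-Work-Left), not CARD:
# the dispatcher sees each job's size on arrival and each queue's backlog.
random.seed(1)
n_queues, lam, mu = 2, 1.4, 1.0              # illustrative parameters (per-server load 0.7)
work = [0.0] * n_queues                      # remaining work at each FCFS queue
delays = []

for _ in range(100_000):
    dt = random.expovariate(lam)             # time until next arrival
    work = [max(0.0, w - dt) for w in work]  # each server drains work at rate 1
    size = random.expovariate(mu)            # size-aware: job size known on arrival
    q = min(range(n_queues), key=lambda i: work[i])   # state-aware: least work left
    delays.append(work[q] + size)            # FCFS response time = backlog ahead + own size
    work[q] += size

print(f"mean delay under Least-Work-Left: {sum(delays) / len(delays):.3f}")
```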

SESSION: Session 1B: Measurements

Deep Dive into NTP Pool's Popularity and Mapping

Time synchronization is of paramount importance on the Internet, with the Network Time Protocol (NTP) serving as the primary synchronization protocol. The NTP Pool, a volunteer-driven initiative launched two decades ago, facilitates connections between clients and NTP servers. Our analysis of root DNS queries reveals that the NTP Pool has consistently been the most popular time service. We further investigate the DNS component (GeoDNS) of the NTP Pool, which is responsible for mapping clients to servers. Our findings indicate that the current algorithm is heavily skewed, leading to the emergence of time monopolies for entire countries. For instance, clients in the US are served by 551 NTP servers, while clients in Cameroon and Nigeria are served by only one and two servers, respectively, out of the 4k+ servers available in the NTP Pool. We examine the underlying assumption behind GeoDNS for these mappings and discover that time servers located far away can still provide accurate clock time information to clients. We have shared our findings with the NTP Pool operators, who acknowledge them and plan to revise their algorithm to enhance security.

MetaVRadar: Measuring Metaverse Virtual Reality Network Activity

The "metaverse", wherein users can immerse themselves in virtual worlds through their VR headsets to work, study, play, shop, socialize, and entertain, is fast becoming a reality. However, little is known about the network dynamics of metaverse VR applications, which are needed to make telecommunications network infrastructure "metaverse ready" to support superlative user experience. This work is an empirical measurement study of metaverse VR network behavior. By analyzing metaverse sessions on the Oculus VR headset, we first develop a categorization of user activity into distinct states ranging from login home to streetwalking and event attendance to asset trading, characterizing network traffic per state, thereby highlighting the vastly more complex nature of a metaverse session compared to streaming video or gaming. Our second contribution develops a real-time method, MetaVRadar, to detect metaverse sessions and classify the user activity state leveraging formalized flow signatures and volumetric attributes. Our third contribution practically implements MetaVRadar in a large university campus network to demonstrate its usability.

SCADA World: An Exploration of the Diversity in Power Grid Networks

Despite a growing interest in understanding the industrial control networks that monitor and control our critical infrastructures (such as the power grid), to date, SCADA networks have been analyzed in isolation from each other. They have been treated as monolithic networks without taking into consideration their differences. In this paper, we analyze real-world data from different parts of a power grid (generation, transmission, distribution, and end-consumer) and show that these industrial networks exhibit a variety of unique behaviors and configurations that have not been documented before. To the best of our knowledge, our study is the first to tackle the analysis of power grid networks at this level. Our results help us dispel several misconceptions proposed by previous work, and we also provide new insights into the differences and types of SCADA networks.

Democratizing LEO Satellite Network Measurement

Low Earth Orbit (LEO) satellite networks are quickly gaining traction with promises of impressively low latency, high bandwidth, and global reach. However, the research community knows relatively little about their operation and performance in practice. The obscurity is largely due to the high barrier of entry for measuring LEO networks, which requires deploying specialized hardware or recruiting large numbers of satellite Internet customers. In this paper, we introduce HitchHiking, a methodology that democratizes global visibility into LEO satellite networks. HitchHiking builds on the observation that Internet-exposed services that use LEO Internet can reveal satellite network architecture and performance, bypassing the need for specialized hardware. We evaluate HitchHiking against ground truth measurements and prior methods, showing that it provides more coverage and accuracy. With HitchHiking, we complete the largest study to date of Starlink network latency, measuring over 2,400 users across 27 countries. We uncover unexpected patterns in latency that surface how LEO routing is more complex than previously understood. Finally, we conclude with recommendations for future research on LEO networks.

SESSION: Session 2A: Bloom Filters

Analysis of False Negative Rates for Recycling Bloom Filters (Yes, They Happen!)

Bloom Filters are a desirable data structure for distinguishing new values in sequences of data (i.e., messages), due to their space efficiency, their low false positive rates (incorrectly classifying a new value as a repeat), and never producing false negatives (classifying a repeat value as new). However, as the Bloom Filter's bits are filled, false positive rates creep upward. To keep false positive rates below a reasonable threshold, applications periodically "recycle" the Bloom Filter, clearing the memory and then resuming the tracking of data. After a recycle point, subsequent arrivals of recycled messages are likely to be misclassified as new; recycling induces false negatives. Despite numerous applications of recycling, the corresponding false negative rates have never been analyzed. In this paper, we derive approximations, upper bounds, and lower bounds of false negative rates for several variants of recycling Bloom Filters. These approximations and bounds are functions of the size of the memory used to store the Bloom Filter and of the distributions of new arrivals and repeat messages. We show, via comparison to simulation, that our upper bounds and approximations are extremely tight, and that they can be efficiently computed for megabyte-sized Bloom Filters on conventional hardware.
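
To make the recycling-induced false negative concrete, here is a minimal sketch (illustrative parameters and class names, not the paper's analysis): a standard Bloom filter is cleared once its fill ratio exceeds a threshold, after which a previously seen message is typically reported as new.

```python
import hashlib

class RecyclingBloomFilter:
    """Minimal Bloom filter that clears ("recycles") itself when too full."""
    def __init__(self, m_bits=1024, k_hashes=4, recycle_fill=0.5):
        self.m, self.k, self.recycle_fill = m_bits, k_hashes, recycle_fill
        self.bits = [False] * m_bits

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def seen_before(self, item):
        return all(self.bits[p] for p in self._positions(item))

    def add(self, item):
        if sum(self.bits) / self.m >= self.recycle_fill:
            self.bits = [False] * self.m          # recycle: clear all memory
        for p in self._positions(item):
            self.bits[p] = True

bf = RecyclingBloomFilter()
bf.add("msg-0")
print(bf.seen_before("msg-0"))        # True: correctly classified as a repeat
for n in range(1, 500):               # enough inserts to trigger at least one recycle
    bf.add(f"msg-{n}")
print(bf.seen_before("msg-0"))        # likely False: a recycling-induced false negative
```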

Invertible Bloom Lookup Tables with Listing Guarantees

The Invertible Bloom Lookup Table (IBLT) is a probabilistic, concise data structure for set representation that supports a listing operation, i.e., recovery of the elements in the represented set. Its applications can be found in network synchronization and traffic monitoring as well as in error-correction codes. An IBLT's listing succeeds with a probability that depends on the size of the allocated memory and the size of the represented set, and it can fail with small probability even for relatively small sets. While previous works only studied the failure probability of IBLT, this work initiates the worst-case analysis of IBLT, guaranteeing successful listing for all sets of a certain size. The worst-case study is important since a listing failure of the IBLT imposes high overhead. We describe a novel approach that guarantees successful listing when the set satisfies a tunable upper bound on its size. To this end, we develop multiple constructions based on various coding techniques, such as stopping sets, the stopping redundancy of error-correcting codes, and Steiner systems, as well as on new methodologies we develop. We analyze the sizes of IBLTs with listing guarantees obtained by the various methods, as well as their mapping memory and runtime overheads. Lastly, we study lower bounds on the achievable sizes of IBLTs with listing guarantees and verify the results in the paper by simulations.
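
As background, the sketch below shows a toy insertion-only IBLT with the standard count/key-XOR cells and peeling-based listing; it illustrates why listing is probabilistic and can fail when the table is small relative to the set, and it is not one of the paper's constructions. All names and sizes are illustrative.

```python
import hashlib

class ToyIBLT:
    """Toy insertion-only IBLT: each cell keeps a count and an XOR of keys."""
    def __init__(self, m_cells=32, k_hashes=3):
        self.m, self.k = m_cells, k_hashes
        self.count = [0] * m_cells
        self.key_xor = [0] * m_cells

    def _cells(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, key):
        for c in self._cells(key):
            self.count[c] += 1
            self.key_xor[c] ^= key

    def list_entries(self):
        """Peel pure cells (count == 1); returns (recovered_keys, fully_listed?)."""
        recovered = []
        while True:
            pure = next((c for c in range(self.m) if self.count[c] == 1), None)
            if pure is None:
                break
            key = self.key_xor[pure]
            recovered.append(key)
            for c in self._cells(key):            # remove the recovered key everywhere
                self.count[c] -= 1
                self.key_xor[c] ^= key
        return recovered, all(c == 0 for c in self.count)

iblt = ToyIBLT(m_cells=32, k_hashes=3)
keys = list(range(1, 13))                         # 12 distinct integer keys
for k in keys:
    iblt.insert(k)
out, ok = iblt.list_entries()
print(sorted(out) == keys, ok)                    # usually True True; peeling can fail for larger sets
```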

Lightweight Acquisition and Ranging of Flows in the Data Plane

As networks get more complex, the ability to track almost all the flows is becoming of paramount importance. This is because we can then detect transient events impacting only a subset of the traffic. Solutions for flow monitoring exist, but it is getting very difficult to produce accurate estimations for every <flowID,counter> tuple given the memory constraints of commodity programmable switches. Indeed, as networks grow in size, more flows have to be tracked, increasing the number of tuples to be recorded. At the same time, end-host virtualization requires more specific flowIDs, enlarging the memory cost for every single entry. Finally, the available memory resources have to be shared with other important functions as well (e.g., load balancing, forwarding, ACL).

To address those issues, we present FlowLiDAR (Flow Lightweight Detection and Ranging), a new solution that is capable of tracking almost all the flows in the network while requiring only a modest amount of data plane memory, which is not dependent on the size of flowIDs. We implemented the scheme in P4, tested it using real traffic from ISPs, and compared it against four state-of-the-art solutions: FlowRadar, NZE, PR-sketch, and Elastic Sketch.

SESSION: Session 2B: Architecture

TAO: Re-Thinking DL-based Microarchitecture Simulation

Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, and optimize new hardware that meets specific design requirements. While the quest for a fast, accurate and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short in different respects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses the execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy, but introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture.

Re-thinking the advantages and limitations of the aforementioned three mainstream simulation paradigms, this paper introduces TAO, which redesigns DL-based simulation with three primary contributions: First, we propose a new training dataset design such that the subsequent simulation (i.e., inference) only needs functional traces as inputs, which can be rapidly generated and reused across microarchitectures. Second, to increase the detail of the simulation, we redesign the input features and the DL model using self-attention to support predicting various performance metrics of interest. Third, we propose techniques to train a microarchitecture-agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and effectively reduces the re-training overhead of conventional DL-based simulators. TAO can predict various performance metrics of interest, significantly reduce the simulation time, and maintain simulation accuracy similar to state-of-the-art DL-based endeavors.

Agents of Autonomy: A Systematic Study of Robotics on Modern Hardware

As robots increasingly permeate modern society, it is crucial for the system and hardware research community to bridge its long-standing gap with robotics. This divide has persisted due to the lack of (i) a systematic performance evaluation of robotics on different computing platforms and (ii) a comprehensive, open-source, cross-platform benchmark suite.

To address these gaps, we present a systematic performance study of robotics on modern hardware and introduce RoWild, an open-source benchmark suite for robotics that is comprehensive and cross-platform. Our workloads encompass a broad range of robots, including driverless vehicles, pilotless drones, and stationary robotic arms, and we evaluate their performance on a spectrum of modern computing platforms, from low-end embedded CPUs to high-end server-grade GPUs.

Our findings reveal that current architectures experience significant inefficiencies when executing robotic workloads, highlighting the need for architectural advancements. We discuss approaches for meeting these requirements, offering insights for improving the performance of robotics.

The full version of the paper is available in [11], and the source code of the benchmark suite is available in [2].

Automated Backend Allocation for Multi-Model, On-Device AI Inference

On-Device Artificial Intelligence (AI) services such as face recognition, object tracking and voice recognition are rapidly scaling up deployments on embedded, memory-constrained hardware devices. These services typically delegate AI inference models for execution on CPU and GPU computing backends. While GPU delegation is a common practice to achieve high speed computation, the approach suffers from degraded throughput and completion times under multi-model scenarios, i.e., concurrently executing services. This paper introduces a solution to sustain performance in multi-model, on-device AI contexts by dynamically allocating a combination of CPU and GPU backends per model. The allocation is feedback-driven, and guided by knowledge of model-specific, multi-objective Pareto fronts comprising inference latency and memory consumption. Our backend allocation algorithm runs online per model and achieves a 25-100% improvement in throughput over static allocations as well as over load-balancing scheduler solutions targeting multi-model scenarios.

SESSION: Session 3A: Optimal Control

Learning the Optimal Control for Evolving Systems with Converging Dynamics

We consider a principal, or controller, that can pick actions from a fixed action set to control an evolving system with converging dynamics. Converging dynamics means that, if the principal holds the same action, the system will asymptotically converge to a unique stable state determined by this action. In our model, the dynamics of the system are unknown to the principal, and the principal can only receive (possibly noisy) bandit feedback on the impact of its actions. The principal aims to learn which stable state yields the highest reward while adhering to specific constraints and to bring the system to this state as quickly as possible. We measure the principal's performance in terms of regret and constraint violation. In cases where the action set is finite, we propose an algorithm, Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that ensures sublinear regret and constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to the case of an infinite action set and demonstrate its ability to maintain sublinear regret and constraint violation.

Sampling for Remote Estimation of the Wiener Process over an Unreliable Channel

In this paper, we study a sampling problem where a source takes samples from a Wiener process and transmits them through a wireless channel to a remote estimator. Due to channel fading, interference, and potential collisions, the packet transmissions are unreliable and can take random time durations. Our objective is to devise an optimal causal sampling policy that minimizes the long-term average mean square estimation error. This optimal sampling problem is a recursive optimal stopping problem, which is generally quite difficult to solve. However, we prove that the optimal sampling strategy is, in fact, a simple threshold policy in which a new sample is taken whenever the instantaneous estimation error exceeds a threshold; this threshold is a constant that does not vary over time. By exploiting the structural properties of the recursive optimal stopping problem, we develop a low-complexity iterative algorithm to compute the optimal threshold. This work generalizes previous research by incorporating both transmission errors and random transmission times into remote estimation. Numerical simulations are provided to compare our optimal policy with the zero-wait and age-optimal policies.
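
The constant-threshold structure can be illustrated with a small discrete-time simulation. This is a sketch under simplifying assumptions (the unreliable channel is modeled only by a random delivery delay, the threshold is an arbitrary constant), not the paper's algorithm or its optimal threshold; all parameters are illustrative.

```python
import random

# Discrete-time sketch of threshold sampling for a Wiener process (illustrative).
random.seed(2)
dt, horizon, beta = 0.01, 200.0, 1.0     # step size, simulation length, sampling threshold
w, estimate = 0.0, 0.0                   # true state and remote estimate (last delivered sample)
in_flight, deliver_at, sample_val = False, 0.0, 0.0
t, sq_err, n_steps, n_samples = 0.0, 0.0, 0, 0

while t < horizon:
    w += random.gauss(0.0, dt ** 0.5)    # Wiener increment
    if in_flight and t >= deliver_at:    # packet arrives: estimator updates
        estimate, in_flight = sample_val, False
    if not in_flight and abs(w - estimate) >= beta:
        sample_val, in_flight = w, True  # sample when the error crosses the threshold
        deliver_at = t + random.expovariate(2.0)   # random transmission time
        n_samples += 1
    sq_err += (w - estimate) ** 2
    n_steps += 1
    t += dt

print(f"samples taken: {n_samples}, time-average MSE: {sq_err / n_steps:.3f}")
```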

Near-Optimal Packet Scheduling in Multihop Networks with End-to-End Deadline Constraints

Scheduling packets with end-to-end deadline constraints in multihop networks is an important problem that has been notoriously difficult to tackle. Recently, there has been progress on this problem in the worst-case traffic setting, with the objective of maximizing the number of packets delivered within their deadlines. Specifically, the proposed algorithms were shown to achieve an Ω(1/log(L)) fraction of the optimal objective value if the minimum link capacity in the network is Cmin = Ω(log(L)), where L is the maximum length of a packet's route in the network (which is bounded by the packet's maximum deadline). However, such guarantees can be quite pessimistic due to the strict worst-case traffic assumption and may not accurately reflect real-world settings. In this work, we aim to address this limitation by exploring whether it is possible to design algorithms that achieve a constant fraction of the optimal value while relaxing the worst-case traffic assumption. We provide a positive answer by demonstrating that in stochastic traffic settings, such as i.i.d. packet arrivals, near-optimal, (1-ε)-approximation algorithms can be designed if Cmin = Ω(log(L/ε)/ε²). To the best of our knowledge, this is the first result that shows this problem can be solved near-optimally under nontrivial assumptions on traffic and link capacity.

Prelimit Coupling and Steady-State Convergence of Constant-stepsize Nonsmooth Contractive SA

Motivated by Q-learning, we study nonsmooth contractive stochastic approximation (SA) with constant stepsize. We focus on two important classes of dynamics: 1) nonsmooth contractive SA with additive noise, and 2) synchronous and asynchronous Q-learning, which features both additive and multiplicative noise. For both dynamics, we establish weak convergence of the iterates to a stationary limit distribution in Wasserstein distance. Furthermore, we propose a prelimit coupling technique for establishing steady-state convergence and characterize the limit of the stationary distribution as the stepsize goes to zero. Using this result, we derive that the asymptotic bias of nonsmooth SA is proportional to the square root of the stepsize, which stands in sharp contrast to smooth SA. This bias characterization allows for the use of Richardson-Romberg extrapolation for bias reduction in nonsmooth SA.
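
To spell out the extrapolation step the abstract alludes to, here is a sketch of the standard Richardson-Romberg argument under the stated √(stepsize) bias expansion; the constant c is purely illustrative.

```latex
% Sketch of Richardson-Romberg extrapolation for a sqrt(alpha) bias (illustrative).
% Suppose the stationary mean of the constant-stepsize iterates satisfies
%   E[\bar{x}_\alpha] = x^* + c\sqrt{\alpha} + o(\sqrt{\alpha}).
% Running the same recursion with stepsizes \alpha and \alpha/4 and combining gives
\[
  2\,\mathbb{E}[\bar{x}_{\alpha/4}] - \mathbb{E}[\bar{x}_{\alpha}]
  = 2\Bigl(x^* + \tfrac{c\sqrt{\alpha}}{2}\Bigr) - \bigl(x^* + c\sqrt{\alpha}\bigr) + o(\sqrt{\alpha})
  = x^* + o(\sqrt{\alpha}),
\]
% so the leading \sqrt{\alpha} bias term cancels, which is the sense in which
% Richardson-Romberg extrapolation reduces bias for nonsmooth SA.
```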

SESSION: Session 3B: Software Performance

Machine Learning Systems are Bloated and Vulnerable

Today's software is bloated with both code and features that are not used by most users. This bloat is prevalent across the entire software stack, from operating systems and applications to containers. Containers are lightweight virtualization technologies used to package code and dependencies, providing portable, reproducible and isolated environments. For ease of use, data scientists often utilize machine learning containers to simplify their workflow. However, this convenience comes at a cost: containers are often bloated with unnecessary code and dependencies, resulting in very large sizes. In this paper, we analyze and quantify bloat in machine learning containers. We develop MMLB, a framework for analyzing bloat in software systems, focusing on machine learning containers. MMLB measures the amount of bloat at both the container and package levels, quantifying the sources of bloat. In addition, MMLB integrates with vulnerability analysis tools and performs package dependency analysis to evaluate the impact of bloat on container vulnerabilities. Through experimentation with 15 machine learning containers from TensorFlow, PyTorch, and Nvidia, we show that bloat accounts for up to 80% of machine learning container sizes, increasing container provisioning times by up to 370% and exacerbating vulnerabilities by up to 99%. For more detail, see the full paper [zhang2024machine].

Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study.

In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.

HEAL: Performance Troubleshooting Deep inside Data Center Hosts

This study demonstrates the salient facts and challenges of host failure operations in hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics. The faulting mechanism inside the host connects these heterogeneous metrics through direct and indirect correlation, making it extremely difficult to sort out the propagation procedures and the root cause from these intertwined indicators. To deeply understand the failure mechanism inside the host, we develop HEAL -- a novel host metrics analysis toolkit. HEAL discovers dynamic causality in sparse heterogeneous host metrics by combining the strengths of both time series and random variable analysis. It also extracts causal directional hints from causality's asymmetry and historical knowledge, which ultimately helps HEAL produce accurate results given imperfect inputs. Evaluations in our production environment verify that HEAL provides significantly better result accuracy and full-process interpretability than the SOTA baselines. With these advantages, HEAL successfully serves our data center and worldwide product operations and contributes to many other workflows.

Kernel vs. User-Level Networking: Don't Throw Out the Stack with the Interrupts

This paper reviews the performance characteristics of network stack processing for communication-heavy server applications. Recent literature often describes kernel-bypass and user-level networking as a silver bullet to attain substantial performance improvements, but without providing a comprehensive understanding of how exactly these improvements come about. We identify and quantify the direct and indirect costs of asynchronous hardware interrupt requests (IRQ) as a major source of overhead. While IRQs and their handling have a substantial impact on the effectiveness of the processor pipeline and thereby the overall processing efficiency, their overhead is difficult to measure directly when serving demanding workloads. This paper presents an indirect methodology to assess IRQ overhead by constructing preliminary approaches to reduce the impact of IRQs. While these approaches are not suitable for general deployment, their corresponding performance observations indirectly confirm the conjecture. Based on these findings, a small modification of a vanilla Linux system is devised that improves the efficiency and performance of traditional kernel-based networking significantly, resulting in up to 45% increased throughput without compromising tail latency. In the case of server applications, such as web servers or Memcached, the resulting performance is comparable to using kernel-bypass and user-level networking when using stacks with similar functionality and flexibility.

SESSION: Session 4A: Sustainable Computing

Online Conversion with Switching Costs: Robust and Learning-Augmented Algorithms

We introduce and study online conversion with switching costs, a family of online problems that capture emerging problems at the intersection of energy and sustainability. In this problem, an online player attempts to purchase (alternatively, sell) fractional shares of an asset during a fixed time horizon with length T. At each time step, a cost function (alternatively, price function) is revealed, and the player must irrevocably decide an amount of asset to convert. The player also incurs a switching cost whenever their decision changes in consecutive time steps, i.e., when they increase or decrease their purchasing amount. We introduce competitive (robust) threshold-based algorithms for both the minimization and maximization variants of this problem, and show they are optimal among deterministic online algorithms. We then propose learning-augmented algorithms that take advantage of untrusted black-box advice (such as predictions from a machine learning model) to achieve significantly better average-case performance without sacrificing worst-case competitive guarantees. Finally, we empirically evaluate our proposed algorithms using a carbon-aware EV charging case study, showing that our algorithms substantially improve on baseline methods for this problem.

The Online Pause and Resume Problem: Optimal Algorithms and An Application to Carbon-Aware Load Shifting

We introduce and study the online pause and resume problem. In this problem, a player attempts to find the k lowest (alternatively, highest) prices in a sequence of fixed length T, which is revealed sequentially. At each time step, the player is presented with a price and decides whether to accept or reject it. The player incurs a switching cost whenever their decision changes in consecutive time steps, i.e., whenever they pause or resume purchasing. This online problem is motivated by the goal of carbon-aware load shifting, where a workload may be paused during periods of high carbon intensity and resumed during periods of low carbon intensity, incurring a cost when saving or restoring its state. It has strong connections to existing problems studied in the literature on online optimization, though it introduces unique technical challenges that prevent the direct application of existing algorithms. Extending prior work on threshold-based algorithms, we introduce double-threshold algorithms for both variants of this problem. We further show that the competitive ratios achieved by these algorithms are the best achievable by any deterministic online algorithm. Finally, we empirically validate our proposed algorithm through case studies on the application of carbon-aware load shifting using real carbon trace data and existing baseline algorithms.
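
The flavor of a double-threshold (hysteresis) rule can be sketched as follows. The thresholds here are arbitrary illustrative constants, not the paper's competitively optimal thresholds, and the trace is made up.

```python
def pause_resume_schedule(prices, k, low_thr, high_thr):
    """Illustrative double-threshold (hysteresis) rule, not the paper's algorithm:
    resume purchasing when the price drops to low_thr or below, pause when it
    rises to high_thr or above; stop after k accepted prices. Returns accepted indices."""
    running, accepted = False, []
    for t, p in enumerate(prices):
        if not running and p <= low_thr:
            running = True               # resume (incurs a switching cost)
        elif running and p >= high_thr:
            running = False              # pause (incurs a switching cost)
        if running and len(accepted) < k:
            accepted.append(t)
        if len(accepted) == k:
            break
    return accepted

# Toy carbon-intensity trace: run the workload in k low-intensity slots.
trace = [60, 55, 30, 28, 35, 70, 75, 40, 25, 22, 65]
print(pause_resume_schedule(trace, k=4, low_thr=35, high_thr=60))   # [2, 3, 4, 8]
```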

CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency

Due to inherent variations in energy's carbon intensity, temporal shifting has become a key method in reducing the carbon footprint of batch workloads. However, temporally shifting workloads involves searching for periods with lower carbon intensity, which increases the workload's completion time. In this paper, we present CarbonScaler, a new approach that reduces carbon emissions of batch workloads without extending their completion time. Our approach relies on applications' ability to change their compute demand by scaling the workload based on fluctuations in energy's carbon intensity. We present a carbon-aware scheduling algorithm, a Kubernetes-based prototype, and an analytic tool to guide the carbon-efficient deployment of batch applications in the cloud. We evaluate CarbonScaler using real-world applications and show that it can yield 33% carbon savings without extending the completion time and up to 32% extra carbon savings over state-of-the-art suspend-resume policies when completion time is flexible.

SESSION: Session 4B: Security Attacks

Optimized Cross-Path Attacks via Adversarial Reconnaissance

While softwarization and virtualization technologies make modern communication networks appear easier to manage, they also introduce highly complex interactions within the networks that can cause unexpected security threats. In this work, we study a particular security threat due to the sharing of links between high-security paths and low-security paths, which enables a new type of DoS attacks, called cross-path attacks, that indirectly attack a set of targeted high-security paths (target paths) by congesting the shared links through a set of attacker-controlled low-security paths (attack paths). While the feasibility of such attacks has been recently demonstrated in the context of SDN, their potential performance impact has not been characterized. To this end, we develop an approach for designing an optimized cross-path attack under a constrained total attack rate, consisting of (i) novel reconnaissance algorithms that can provide consistent estimates of the locations and parameters of the shared links via network tomography, and (ii) efficient optimization methods to design the optimal allocation of attack rate over the attack paths to maximally degrade the performance of the target paths. The proposed attack has achieved a significantly larger performance impact than its non-optimized counterparts in extensive evaluations based on multiple network settings, signaling the importance of addressing such intelligent attacks in network design. For more detail, see the full paper [4].

Who's Got My Back? Measuring the Adoption of an Internet-wide BGP RTBH Service

Distributed Denial-of-Service (DDoS) attacks continue to threaten the availability of Internet-based services. While countermeasures exist to decrease the impact of these attacks, not all operators have the resources or knowledge to deploy them. The Unwanted Traffic Removal Service (UTRS), one of the oldest community-based anti-DDoS services, aims at mitigating major DDoS attacks through the Border Gateway Protocol (BGP).

In this paper we develop and evaluate a methodology to automatically detect UTRS participation in the wild. To this end, we deploy a measurement infrastructure and devise a methodology to detect UTRS-based traffic blocking. Using this methodology, we conducted a longitudinal analysis of UTRS participants over ten weeks. Our results show that at any point in time, there were 562 participants, including multihomed, stub, transit, and IXP ASes. Moreover, we surveyed 245 network operators to understand why they would (not) join UTRS. Results show that threat and coping appraisal significantly influence the intention to participate in UTRS.

A Large Scale Study and Classification of VirusTotal Reports on Phishing and Malware URLs

VirusTotal (VT) is a widely used scanning service for researchers and practitioners to label malicious entities and predict new security threats. Unfortunately, it is little known to end-users how VT URL scanners decide on the maliciousness of entities and the attack types they are involved in (e.g., phishing or malware-hosting websites). In this paper, we conduct a systematic comparative study of VT URL scanners' behavior for different attack types of malicious URLs, in terms of 1) detection specialties, 2) stability, 3) correlations between scanners, and 4) lead/lag behaviors. Our findings highlight that the VT scanners commonly disagree with each other on their detection and attack type classification, leading to challenges in ascertaining the maliciousness of a URL and taking prompt mitigation actions according to different attack types. This motivates us to present a new, highly accurate classifier that helps correctly identify the attack types of malicious URLs at an early stage. This in turn assists practitioners in performing better threat aggregation and choosing proper mitigation actions for different attack types.

SESSION: Session 5A: Online Optimization

Online Allocation with Replenishable Budgets: Worst Case and Beyond

This paper studies online resource allocation with replenishable budgets, where budgets can be replenished on top of the initial budget and an agent sequentially chooses online allocation decisions without violating the available budget constraint at each round. We propose a novel online algorithm, called OACP (Opportunistic Allocation with Conservative Pricing), that conservatively adjusts dual variables while opportunistically utilizing available resources. OACP achieves a bounded asymptotic competitive ratio in adversarial settings as the number of decision rounds T gets large. Importantly, the asymptotic competitive ratio of OACP is optimal in the absence of additional assumptions on budget replenishment. To further improve the competitive ratio, we make a mild assumption that there is budget replenishment every T* ≥ 1 decision rounds and propose OACP+ to dynamically adjust the total budget assignment for online allocation. Next, we move beyond the worst-case and propose LA-OACP (Learning-Augmented OACP/OACP+), a novel learning-augmented algorithm for online allocation with replenishable budgets. We prove that LA-OACP can improve the average utility compared to OACP/OACP+ when the ML predictor is properly trained, while still offering worst-case utility guarantees when the ML predictions are arbitrarily wrong. Finally, we run simulation studies of sustainable AI inference powered by renewables, validating our analysis and demonstrating the empirical benefits of LA-OACP.

Fair Resource Allocation in Virtualized O-RAN Platforms

O-RAN systems in virtualized platforms (O-Cloud) offer performance boosts but also raise energy concerns. This paper assesses O-Cloud's energy costs and proposes energy-efficient policies for base station (BS) data loads and transport block (TB) sizes. These policies balance energy savings and performance fairly across servers and users. To handle the unknown and time-varying parameters affecting the policies, we develop a novel online learning framework with fairness guarantees that apply to the entire operation horizon of the system (long-term fairness).

BONES: Near-Optimal Neural-Enhanced Video Streaming

Accessing high-quality video content can be challenging due to insufficient and unstable network bandwidth. Recent advances in neural enhancement have shown promising results in improving the quality of degraded videos through deep learning. Neural-Enhanced Streaming (NES) incorporates this new approach into video streaming, allowing users to download low-quality video segments and then enhance them to obtain high-quality content without disrupting the playback of the video stream. We introduce BONES, an NES control algorithm that jointly manages the network and computational resources to maximize the quality of experience (QoE) of the user. BONES formulates NES as a Lyapunov optimization problem and solves it in an online manner with near-optimal performance, making it the first NES algorithm to provide a theoretical performance guarantee. Comprehensive experimental results indicate that BONES increases QoE by 5% to 20% over state-of-the-art algorithms with minimal overhead. Our code is available at https://github.com/UMass-LIDS/bones.

SESSION: Session 5B: Memory Management

Scalability Limitations of Processing-in-Memory using Real System Evaluations

Processing-in-memory (PIM) has been widely explored in academia and industry to accelerate numerous workloads. By reducing data movement and increasing parallelism, PIM offers great performance and energy efficiency. The large number of cores or nodes in PIM provides massive parallelism and compute throughput; however, this also poses challenges and limitations for some workloads. In this work, we provide an extensive evaluation and analysis of a real PIM system from UPMEM. We specifically target emerging workloads featuring collective communication, demonstrating its role as the primary limitation within the current PIM architecture.

GuaNary: Efficient Buffer Overflow Detection In Virtualized Clouds Using Intel EPT-based Sub-Page Write Protection Support

Buffer overflow is a widespread memory safety violation in C/C++, reported as the top vulnerability in 2022. Secure memory allocators are generally used to protect systems against attacks that may exploit buffer overflows. Existing allocators mainly rely on two types of countermeasures to prevent or detect overflows: canaries and guard pages, each with pros and cons in terms of detection latency and memory footprint.

This paper follows the Out of Hypervisor (OoH) trend for virtualized cloud applications. It introduces GuaNary, a novel safety guard against overflows that allows synchronous detection at a low memory footprint cost. OoH is a new virtualization research axis introduced in 2022 advocating the exposure of hardware virtualization features to the guest OS so that its processes can take advantage of them. Based on the OoH principle, GuaNary leverages Intel Sub-Page write Permission (SPP), a recent hardware virtualization feature that allows write-protecting guest memory at a granularity of 128B (namely, a sub-page) instead of 4KB. We implement a software stack, LeanGuard, which promotes the utilization of SPP from inside virtual machines by new secure allocators that use GuaNary. Our evaluation shows that for the same number of protected buffers, LeanGuard consumes 8.3x less memory than SlimGuard, a state-of-the-art secure allocator. Furthermore, for the same memory consumption, LeanGuard protects 25x more buffers than SlimGuard.

A High-bandwidth High-capacity Hybrid 3D Memory for GPUs

GPUs execute thousands of active threads simultaneously, requiring high memory bandwidth to handle multiple memory requests efficiently. GPU memory bandwidth has been increasing steadily, but it is still insufficient for the demands of fine-grained threads. Important workloads like deep learning and data analytics demand terabytes of data processing, necessitating high memory capacity and bandwidth to avoid performance overheads. True-3D stacking of non-volatile memory layers on GPUs can provide the required higher bandwidth and capacity, enhancing performance and energy efficiency. We propose a high-bandwidth, high-capacity hybrid 3D memory (H3DM) that doubles bandwidth through true-3D integration compared to the baseline GPU architecture and affords 272 GB of total memory capacity by stacking 8 PCM layers (each of 32 GB) and two DRAM layers (each of 8 GB).
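
For reference, the quoted total capacity follows directly from the stated layer configuration:

```latex
\[
  \underbrace{8 \times 32\,\mathrm{GB}}_{\text{PCM layers}}
  + \underbrace{2 \times 8\,\mathrm{GB}}_{\text{DRAM layers}}
  = 256\,\mathrm{GB} + 16\,\mathrm{GB}
  = 272\,\mathrm{GB}.
\]
```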

SESSION: Session 6A: Strategies and Pricing

Strategyproof Decision-Making in Panel Data Settings and Beyond

We consider the problem of decision-making using panel data, in which a decision-maker gets noisy, repeated measurements of multiple units (or agents). We consider the setup used in synthetic control methods, where there is a pre-intervention period when the principal observes the outcomes of each unit, after which the principal uses these observations to assign a treatment to each unit. Unlike this classical setting, we permit the units generating the panel data to be strategic, i.e. units may modify their pre-intervention outcomes in order to receive a more desirable intervention. The principal's goal is to design a strategyproof intervention policy, i.e. a policy that assigns units their utility-maximizing interventions despite their potential strategizing. We first identify a necessary and sufficient condition under which a strategyproof intervention policy exists, and provide a strategyproof mechanism with a simple closed form when one does exist. Along the way, we prove impossibility results for strategic multiclass classification, a natural extension of the well-studied (binary) strategic classification problem to the multiclass setting, which may be of independent interest. When there are two interventions, we establish that there always exists a strategyproof mechanism, and provide an algorithm for learning such a mechanism. For three or more interventions, we provide an algorithm for learning a strategyproof mechanism if there exists a sufficiently large gap in the principal's rewards between different interventions. Finally, we empirically evaluate our model using real-world panel data collected from product sales over 18 months. We find that our methods compare favorably to baselines which do not take strategic interactions into consideration, even in the presence of model misspecification.

When Should Prices Stay Fixed? On the Chances and Limitations of Spot Pricing in Larger Markets

Selling resources via auctions often seems profit-optimal in theory. Yet in practice, providers most often choose to sell homogeneous resources such as cloud computing instances at fixed prices. While it has been argued that this is explained by relatively non-volatile demand distributions and highly competitive market environments, these arguments only paint a partial picture. Through a formal game-theoretic analysis, we show that the relative profit increase of offering resources through a spot pricing mechanism instead of at a fixed price is unbounded as long as a sufficiently competitive outside option exists, even if demand is very non-volatile. To explain the lack of spot pricing in practice, we consider that users might be biased against more complex auction-based mechanisms. We find that even a very small user bias makes fixed pricing profit-optimal if demand is sufficiently large and non-volatile. We derive a sufficient condition under which fixed prices are profit-optimal and illustrate the observed effects through a numerical example.

Continuous Query-based Data Trading

In the era of big data, traditional data trading methods designed for one-time queries on static databases fail to meet the demands of continuous query-based trading on streaming data, often resulting in repeated and inaccurate charges due to neglecting computation sharing in continuous query execution. To address this, we introduce CQTrade, the first trading framework tailored for continuous queries, which incorporates computation sharing across different time windows and queries, enhancing integration with existing trading systems. Our contributions include a theoretical analysis of computation-sharing techniques, the development of a general optimization problem to maximize seller revenue adaptable to various techniques, and the proposal of a branch-and-price algorithm to handle the problem's complexity. Our evaluation shows that the proposed framework improves the success rate of data trading by 12.8% and boosts the seller's revenue by an average of 28.7%, compared to the one-time query-based data trading methods used in the current data market.

SESSION: Session 6B: Storage

Shrinking VOD Traffic via Rényi-Entropic Optimal Transport

In response to the exponential surge in Video on Demand (VOD) traffic, numerous research endeavors have concentrated on optimizing and enhancing infrastructure efficiency. In contrast, this paper explores whether users' demand patterns can be shaped to reduce the pressure on infrastructure. Our main idea is to design a mechanism that alters the distribution of user requests to another distribution which is much more cache-efficient, but still remains 'close enough' (in terms of cost) to fulfil individual users' preferences.

A Closer Look into IPFS: Accessibility, Content, and Performance

The InterPlanetary File System (IPFS) has recently gained considerable attention. While prior research has focused on understanding its performance characterization and application support, it remains unclear: (1) what kind of files/content are stored in IPFS, (2) who are providing these files, (3) are these files always accessible, and (4) what affects the file access performance.

To answer these questions, in this paper, we perform measurement and analysis on over 4 million files associated with CIDs (content IDs) that appeared in publicly available IPFS datasets. Our results reveal the following key findings: (1) Mixed file accessibility: while IPFS is not designed for permanent storage, accessing a non-trivial portion of files, such as those of NFTs and video streams, often requires multiple retrieval attempts, potentially blocking NFT transactions and negatively affecting the user experience. (2) Dominance of NFT (non-fungible token) and video files: about 50% of stored files are NFT-related, followed by a large portion of video files, among which about half are pirated movies and adult content. (3) Centralization of content providers: a small number of peers (top-50), mostly cloud nodes hosted by tech companies, serve a large portion (95%) of files, deviating from IPFS's intended design goal. (4) High variation of downloading throughput and lookup time: large file retrievals experience lower average throughput due to more overhead for resolving file chunk CIDs, and looking up files hosted by non-cloud nodes takes longer. We hope that our findings can offer valuable insights for (1) IPFS application developers to take these characteristics into consideration when building applications on top of IPFS, and (2) IPFS system developers to improve IPFS and similar systems to be developed for Web3.

StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows

This work highlights the significance of I/O bottlenecks that data-intensive HPC workflows face in serverless environments - an issue that has been largely overlooked by prior works. We propose StarShip, a framework that leverages different storage options and multi-tier functions to reduce I/O overhead by co-optimizing for service time and service cost. StarShip leverages the Levenberg-Marquardt optimization to find an effective solution in a large, complex search space. It outperforms existing methods with a 45% improvement in service time and a 37.6% reduction in service cost.

SESSION: Session 7A: Networks

Network Fairness Ambivalence: When Does Social Network Capital Mitigate or Amplify Unfairness?

What are the necessary and sufficient conditions under which multi-hop dissemination strategies decrease rather than increase inequity within social networks? Our analysis of various strategies suggests that this largely depends on a limit related to the degree of homophily in the network.

Change Point Detection with Adaptive Measurement Schedules for Network Performance Verification

When verifying that a communications network fulfills its specified performance, it is critical to detect sudden shifts in network behavior as quickly as possible. Change point detection methods can be useful in this endeavor, but classical methods rely on measuring with a fixed measurement period, which can often be suboptimal in terms of measurement costs. In this paper, we extend the existing framework of change point detection with a notion of physical time. Instead of merely deciding when to stop, agents must now also decide at which future time to take the next measurement. Agents must minimize the necessary number of measurements pre- and post-change, while maintaining a trade-off between post-change delay and false alarm rate. We establish, through this framework, the suboptimality of typical periodic measurements and propose a simple alternative, called crisis mode agents. We show analytically that crisis mode agents significantly outperform periodic measurement schemes. We further verify this in numerical evaluation, both on an array of synthetic change point detection problems as well as on the problem of detecting traffic load changes in a 5G test bed through end-to-end RTT measurements.
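
The sketch below conveys one plausible reading of an adaptive measurement schedule (it is purely illustrative and not the paper's crisis mode agents): measure sparsely while a CUSUM statistic looks nominal, and temporarily measure every step once the statistic exceeds a pre-alarm level. All thresholds, gaps, and names are assumptions made for illustration.

```python
import random

def adaptive_cusum_schedule(observe, mu0, mu1, pre_alarm, alarm, slow_gap, horizon):
    """Illustrative adaptive-schedule CUSUM (not the paper's crisis mode agents):
    measure every `slow_gap` time steps normally; once the statistic exceeds
    `pre_alarm`, measure every step until it either exceeds `alarm` (declare a
    change) or falls back below `pre_alarm`. `observe(t)` returns one sample."""
    stat, t, n_measurements = 0.0, 0, 0
    while t < horizon:
        x = observe(t)
        n_measurements += 1
        # CUSUM increment for a mean shift mu0 -> mu1 with unit variance.
        stat = max(0.0, stat + (mu1 - mu0) * (x - (mu0 + mu1) / 2.0))
        if stat >= alarm:
            return t, n_measurements            # change declared at time t
        t += 1 if stat >= pre_alarm else slow_gap
    return None, n_measurements

random.seed(3)
change_point = 400
samples = lambda t: random.gauss(1.0 if t >= change_point else 0.0, 1.0)
print(adaptive_cusum_schedule(samples, mu0=0.0, mu1=1.0,
                              pre_alarm=1.0, alarm=8.0, slow_gap=20, horizon=2000))
```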

NetDiffusion: Network Data Augmentation Through Protocol-Constrained Traffic Generation

Datasets of labeled network traces are essential for a multitude of machine learning (ML) tasks in networking, yet their availability is hindered by privacy and maintenance concerns, such as data staleness. To overcome this limitation, synthetic network traces can often augment existing datasets. Unfortunately, current synthetic trace generation methods, which typically produce only aggregated flow statistics or a few selected packet attributes, do not always suffice, especially when model training relies on having features that are only available from packet traces. This shortfall manifests in both insufficient statistical resemblance to real traces and suboptimal performance on ML tasks when employed for data augmentation. In this paper, we apply diffusion models to generate high-resolution synthetic network traffic traces. We present NetDiffusion, a tool that uses a fine-tuned, controlled variant of a Stable Diffusion model to generate synthetic network traffic that is high fidelity and conforms to protocol specifications. Our evaluation demonstrates that packet captures generated from NetDiffusion can achieve higher statistical similarity to real data and improved ML model performance than current state-of-the-art approaches (e.g., GAN-based approaches). Furthermore, our synthetic traces are compatible with common network analysis tools and support a myriad of network tasks, suggesting that NetDiffusion can serve a broader spectrum of network analysis and testing tasks, extending beyond ML-centric applications.

SESSION: Session 7B: Security of Distributed Systems

Miracle or Mirage? A Measurement Study of NFT Rug Pulls

An NFT rug pull is one of the most prominent types of NFT scam, in which the developers of an NFT project abandon it and run away with investors' funds. Although they have drawn attention from our community, to the best of our knowledge, NFT rug pulls have not been systematically measured. To fill the void, this paper presents the first in-depth measurement study of NFT rug pulls. Specifically, we first compile a list of 253 known NFT rug pulls as our initial confirmed rug pulls (i.e., ground truth), based on which we perform a pilot study highlighting the key symptoms of NFT rug pulls. Then, we design an effective rule-based detector to measure the prevalence of NFT rug pulls in the ecosystem. We have labelled 7,487 NFT rug pull projects that had not been revealed by our community. To eliminate the potential damage brought by rug pull scams, we take a step further towards designing a real-time prediction model to proactively identify potential rug pull projects at an early stage, before the scam happens. We have implemented a prototype system and deployed it in a real-world setting for over 6 months. Our system has raised alarms for 5,557 new NFT projects, acting as a whistleblower that pinpoints rug pull scams in a timely manner, thus mitigating their impact.

Towards Understanding and Characterizing the Arbitrage Bot Scam In the Wild

This paper presents the first comprehensive analysis of an emerging cryptocurrency scam named "arbitrage bot" disseminated on online social networks. The scam revolves around Decentralized Exchange (DEX) arbitrage and aims to lure victims into executing a so-called "bot contract" to steal funds from them. We found that, to entice victims and convince them of this scheme, scammers have flocked to publishing YouTube videos that demonstrate plausible profits and provide detailed instructions and links to the bot contract.

To collect such scams at a large scale, we developed a fully automated scam detection system named CryptoScamHunter, which continuously collects YouTube videos and automatically detects scams. Meanwhile, CryptoScamHunter can download the source code of the bot contract from the provided links and extract the associated scam cryptocurrency address. Through deploying CryptoScamHunter from Jun. 2022 to Jun. 2023, we have detected 10,442 arbitrage bot scam videos published by thousands of YouTube accounts. Our analysis reveals that different strategies have been utilized in spreading the scam, including crafting popular accounts, registering spam accounts, and using obfuscation tricks to hide the real scam address in the bot contracts. Moreover, from the scam videos, we have collected over 800 malicious bot contracts with source code and extracted 354 scam addresses. By further expanding the scam addresses with a similar contract matching technique, we have obtained a total of 1,697 scam addresses. Through tracing the transactions of all scam addresses on the Ethereum mainnet and Binance Smart Chain, we reveal that over 25,000 victims have fallen prey to this scam, resulting in a financial loss of up to 15 million USD.

Our work sheds light on the dissemination tactics and censorship evasion strategies adopted in the arbitrage bot scam, as well as on the scale and impact of such a scam on online social networks and blockchain platforms, emphasizing the urgent need for effective detection and prevention mechanisms against such fraudulent activity.

FedQV: Leveraging Quadratic Voting in Federated Learning

Federated Learning (FL) permits different parties to collaboratively train a global model without disclosing their respective local labels. A crucial step of FL, that of aggregating local models to produce the global one, shares many similarities with public decision-making, and elections in particular. In that context, a major weakness of FL, namely its vulnerability to poisoning attacks, can be interpreted as a consequence of the one person one vote (henceforth 1p1v) principle that underpins most contemporary aggregation rules.

In this paper, we introduce FedQV, a novel aggregation algorithm built upon quadratic voting, which was recently proposed as a better alternative to 1p1v-based elections. Our theoretical analysis establishes that FedQV is a truthful mechanism, in which bidding according to one's true valuation is a dominant strategy, and that it achieves a convergence rate matching that of state-of-the-art methods. Furthermore, our empirical analysis using multiple real-world datasets validates the superior performance of FedQV against poisoning attacks. It also shows that combining FedQV with unequal voting "budgets" assigned according to a reputation score increases its performance benefits even further. Finally, we show that FedQV can easily be combined with Byzantine-robust privacy-preserving mechanisms to enhance its robustness against both poisoning and privacy attacks. This extended abstract is an abridged version of [3].
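To give a flavour of how a quadratic-voting-style rule differs from plain 1p1v averaging, the sketch below weights each client's model update by the square root of the credits it spends; this is an illustrative simplification under assumed data structures, not the FedQV mechanism itself, which additionally involves a truthful bidding scheme and voting budgets.

    import numpy as np

    # Illustrative sketch only: aggregate client updates with square-root
    # ("quadratic voting" style) weights instead of the equal weights of
    # one-person-one-vote averaging.
    def qv_style_aggregate(updates, credits):
        weights = np.sqrt(np.asarray(credits, dtype=float))
        weights /= weights.sum()
        return sum(w * u for w, u in zip(weights, updates))

    updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([10.0, 10.0])]
    credits = [4.0, 4.0, 100.0]                  # the third client spends far more credits
    print(qv_style_aggregate(updates, credits))  # square-root weighting damps its influence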

SESSION: Session 8A: Load Balancing

Near-Optimal Stochastic Bin-Packing in Large Service Systems with Time-Varying Item Sizes

In modern computing systems, jobs' resource requirements often vary over time. Accounting for this temporal variability during job scheduling is essential for meeting performance goals. However, theoretical understanding of how to schedule jobs with time-varying resource requirements is limited. Motivated by this gap, we propose a new setting of the stochastic bin-packing problem in service systems that allows for time-varying job resource requirements, also referred to as 'item sizes' in traditional bin-packing terms. In this setting, a job or 'item' must be dispatched to a server or 'bin' upon arrival. Its resource requirement varies over time as a Markov chain while in service. Our goal is to minimize the expected number of active servers, or 'non-empty bins', in the steady state.

Under our problem formulation, we develop a job dispatch policy, named Join-Requesting-Server (JRS). We show that in the asymptotic regime where the job arrival rate grows linearly with a scaling factor r, JRS achieves an additive optimality gap of O(√r) in the objective value, where the optimal objective value is Θ(r). Our approach highlights a novel policy conversion framework that utilizes the solution of a single-server problem.
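As a toy illustration of the setting (not of the JRS policy), the sketch below lets a single job's resource requirement evolve as a small discrete-time Markov chain while the job is in service; the states and transition probabilities are made up for the example.

    import random

    SIZES = [1, 2, 4]                    # possible resource requirements ("item sizes")
    TRANSITIONS = {1: [0.8, 0.2, 0.0],   # made-up transition probabilities from each size
                   2: [0.1, 0.7, 0.2],
                   4: [0.0, 0.3, 0.7]}

    def sample_size_path(initial=1, steps=10, seed=0):
        """Sample one job's time-varying size while in service."""
        rng = random.Random(seed)
        size, path = initial, [initial]
        for _ in range(steps):
            size = rng.choices(SIZES, weights=TRANSITIONS[size])[0]
            path.append(size)
        return path

    print(sample_size_path())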

Distributed Speed Scaling in Large-Scale Service Systems

We consider a large-scale parallel-server loss system with an unknown arrival rate, where each server is able to adjust its processing speed. The objective is to minimize the system cost, which consists of, among other components, a power cost for maintaining the servers' processing speeds and a quality-of-service cost depending on the tasks' processing times. We draw on ideas from stochastic approximation to design a novel speed scaling algorithm and prove that the servers' processing speeds converge to the globally asymptotically optimal value. Curiously, the algorithm is fully distributed and does not require any communication between servers. Apart from the algorithm design, a key contribution of our approach lies in demonstrating how concepts from the stochastic approximation literature can be leveraged to effectively tackle learning problems in large-scale, distributed systems. En route, we also analyze the performance of a fully heterogeneous parallel-server loss system, where each server has a distinct processing speed, which may be of independent interest.
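The following is a minimal, hypothetical illustration of the flavour of a stochastic-approximation speed update: a single server nudges its speed against a noisy estimate of its cost gradient using diminishing step sizes. The cost model, noise, and step-size schedule are assumptions made for the example; they are not the paper's algorithm.

    import random

    # Toy cost: power cost s**2 plus delay cost 1/s, minimised near s ~ 0.79.
    def noisy_cost_gradient(speed, rng):
        power_grad = 2.0 * speed             # derivative of the power cost
        delay_grad = -1.0 / speed ** 2       # derivative of the delay cost
        return power_grad + delay_grad + rng.gauss(0.0, 0.1)

    def speed_scaling(initial_speed=1.0, iterations=5000, seed=1):
        rng = random.Random(seed)
        speed = initial_speed
        for n in range(1, iterations + 1):
            step = 0.1 / n                   # diminishing step size
            speed -= step * noisy_cost_gradient(speed, rng)
            speed = max(speed, 0.05)         # keep the speed strictly positive
        return speed

    print(speed_scaling())   # approaches the minimiser of s**2 + 1/s (~0.79)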

Server Saturation in Skewed Networks

We use bipartite graphs to model compatibility constraints that arise between tasks and servers in data centers, cloud computing systems and content delivery networks. We prove that servers with skewed graph neighborhoods saturate with tasks in a limiting regime. The neighborhood of a server is skewed if it contains a diverging number of dispatchers with uniformly bounded degrees.

SESSION: Session 8B: Network Infrastructure

Xaminer: An Internet Cross-Layer Resilience Analysis Tool

A resilient Internet infrastructure is critical in our highly interconnected society. However, the Internet faces several vulnerabilities, ranging from natural disasters to human activities, that can impact the physical layer and, in turn, the higher network layers, such as IP links. In this paper, we introduce Xaminer, the first Internet cross-layer resilience analysis tool, to evaluate the interplay between physical- and network-layer failures. Using a cross-layer Internet map and a failure event model, Xaminer generates a risk profile encompassing a cross-layer impact report, critical infrastructure identification at each layer, and the discovery of trends and patterns under different failure event settings. Xaminer's key strengths lie in its adaptability to diverse disaster scenarios, the ability to assess risks at various granularities, and the capability to generate joint risk profiles for multiple events. We demonstrate Xaminer's capabilities in cross-layer analysis across a spectrum of disaster event models and regions, showcasing its potential role in facilitating well-informed decision-making for resilience planning and deployments. The Xaminer tool is available as open-source software.
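The sketch below illustrates the basic cross-layer idea with a toy mapping from physical conduits to the IP links routed over them: failing a conduit takes down every IP link that traverses it. The data structures are invented for the example; Xaminer's cross-layer map, event models, and risk profiles are far richer.

    # Toy cross-layer failure propagation: each IP link is annotated with the
    # physical conduits it traverses; a link fails if any of its conduits fails.
    IP_LINKS = {
        ("r1", "r2"): {"conduit-A", "conduit-B"},
        ("r2", "r3"): {"conduit-B"},
        ("r1", "r3"): {"conduit-C"},
    }

    def failed_ip_links(failed_conduits):
        failed = set(failed_conduits)
        return [link for link, conduits in IP_LINKS.items() if conduits & failed]

    print(failed_ip_links({"conduit-B"}))   # [('r1', 'r2'), ('r2', 'r3')]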

Nautilus: A Framework for Cross-Layer Cartography of Submarine Cables and IP Links

Submarine cables constitute the backbone of the Internet. However, these critical infrastructure components are vulnerable to several natural and man-made threats and, during failures, are difficult to repair in remote oceans. In spite of their crucial role, our understanding of the impact of submarine cable failures on global connectivity, particularly on the higher layers of the Internet, remains limited.

In this paper, we present Nautilus, a framework for cross-layer cartography of submarine cables and IP links. Using a corpus of public datasets and Internet cartographic techniques, Nautilus identifies IP links that are likely to traverse submarine cables and maps them to one or more candidate cables. Nautilus also assigns each IP-link-to-cable mapping a prediction score that reflects the confidence in the mapping. Nautilus generates a mapping for 3.05 million IPv4 links and 1.43 million IPv6 links, spanning 91% of all active cables. In the absence of ground-truth data, we validate the Nautilus mapping using three techniques: analyzing past cable failures, running targeted traceroute measurements, and comparing with the public network maps of two operators.
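As a rough illustration of the kind of geographic plausibility check that lies behind such a mapping, the sketch below scores candidate cables for an IP link by how close the link's endpoint geolocations are to each cable's landing points; the scoring rule and the toy coordinates are assumptions, not Nautilus's actual technique.

    import math

    def haversine_km(a, b):
        """Great-circle distance between two (lat, lon) points in kilometres."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))

    def cable_score(link_endpoints, cable_landings):
        # Lower is better: each link endpoint should lie near some landing point.
        return sum(min(haversine_km(ep, lp) for lp in cable_landings)
                   for ep in link_endpoints)

    link = [(40.7, -74.0), (51.5, -0.1)]                  # New York, London
    cables = {"cable-X": [(40.5, -74.2), (50.1, -5.5)],   # plausible transatlantic landings
              "cable-Y": [(1.3, 103.8), (35.6, 139.7)]}   # Singapore, Tokyo
    print(min(cables, key=lambda c: cable_score(link, cables[c])))   # cable-X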

A Hop Away from Everywhere: A View of the Intercontinental Long-haul Infrastructure

We present a longitudinal study of intercontinental long-haul links (LHLs) - links with latencies significantly higher than those of all other links in a traceroute path. Our study is motivated by the recognition of these LHLs as a network-layer manifestation of transoceanic undersea cables. We present a methodology and an associated processing system for identifying long-haul links in traceroute measurements. We apply this system to a large corpus of traceroute data and report on multiple aspects of long-haul connectivity, including country-level prevalence, routers acting as international gateways, preferred long-haul destinations, and the evolution of these characteristics over a 7-year period.
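To make the definition above concrete, the sketch below flags, within a single traceroute path, the hop whose latency increase dwarfs the path's other per-hop increases; the factor-of-five threshold is an assumption for illustration, not the paper's calibrated criterion.

    # Illustrative long-haul-link check on one traceroute path: compute per-hop
    # RTT increases and flag hops whose increase is much larger than the rest.
    def long_haul_hops(hop_rtts_ms, factor=5.0):
        deltas = [max(b - a, 0.0) for a, b in zip(hop_rtts_ms, hop_rtts_ms[1:])]
        flagged = []
        for i, d in enumerate(deltas):
            others = [x for j, x in enumerate(deltas) if j != i]
            if others and d > factor * max(others):
                flagged.append(i + 1)        # link between hop i and hop i + 1
        return flagged

    # RTTs (ms) along a path; the jump between hops 3 and 4 suggests a
    # transoceanic segment.
    print(long_haul_hops([1.2, 2.0, 3.1, 78.4, 80.0, 81.5]))   # [3]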