ICPE '25: Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering

Full Citation in the ACM Digital Library

SESSION: Keynote Talk 1

AI for Performance Engineering and Performance Engineering for AI

Artificial Intelligence (AI) and Machine Learning (ML) techniques are revolutionizing various domains, including performance engineering. Performance engineering, which involves the evaluation, modeling, and optimization of system performance, has traditionally relied on established methodologies that have proven effective over the years. However, the growing complexity and heterogeneity of modern computing systems, particularly with the emergence of AI accelerators, have prompted a shift in approach. AI and ML techniques are now being leveraged to achieve unprecedented levels of efficiency and scalability in performance engineering. Conversely, performance engineering is adapting its metrics and methodologies for ML benchmarking. This talk will describe some opportunities and challenges when AI meets Performance Engineering.

AI/ML can address challenges in performance engineering by learning complex system behaviors from vast amounts of data, enabling adaptive and predictive performance models. One of the key advantages of using AI/ML in performance engineering is its ability to identify performance bottlenecks and predict system behavior under varying workloads. Machine learning models can analyze performance metrics in real time, allowing for automated tuning and optimization. This capability is particularly useful in cloud computing environments, where dynamic resource allocation is crucial for maintaining efficiency and cost-effectiveness. Moreover, AI-driven approaches can facilitate workload characterization and anomaly detection. By training models on historical data, AI systems can detect deviations from normal performance patterns, identifying potential issues before they impact system stability. This proactive approach to performance engineering reduces downtime and enhances overall system reliability.

SESSION: Session 1 - Profiling, Bottleneck Analysis, and Software Development

PARAGRAPH: Phase-Aware Resource Demand Profiling for HPDA/HPC Jobs

The processing of large amounts of data in central high-performance data analytics (HPDA) systems is playing an increasingly important role in science and business. However, many HPDA systems exhibit low utilization of their available resources during normal operation. An important reason for this underutilization is that too many resources are reserved for individual jobs. This is often a consequence of the common practice of reserving a uniform amount of resources, such as CPU or memory, for the entire execution time of a job. Given that many data-intensive (DI) jobs consist of different phases with different resource demands, resources are normally reserved according to the demand of the most resource-intensive phase. This results in more resources being reserved over a long period of time than are actually needed.

Flexible resource allocation techniques require detailed information about the resource demands of individual jobs to be applied effectively. In this work, we present PARAGRAPH, an approach to create phase models and resulting phase-aware resource demand profiles for individual job types from training datasets of resource consumption time series. PARAGRAPH considers the individual jobs as black boxes and fully relies on recorded system-level metrics. To do this, we first extract the different phases from the resource consumption time series using a BinSeg-based algorithm. We then apply the C-DBSCAN clustering algorithm to assign labels to the individual segments. Based on this information, a phase model and a resource demand profile can be extracted. These phase-aware resource demand profiles can then be used for scheduling decisions. We evaluate PARAGRAPH in an experimental scenario that allows flexible resource reservation on an HPDA platform. Here, we show that a given set of job instances can be executed up to 29% faster for a given resource limit due to better resource utilization.
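As a rough illustration of this pipeline (not the authors' implementation), the sketch below segments a resource-consumption time series with binary segmentation and then clusters the segments to assign phase labels; the `ruptures` and scikit-learn calls, the thresholds, and the use of plain DBSCAN in place of the constrained C-DBSCAN variant are all assumptions for demonstration.

```python
# Hypothetical sketch of phase extraction and segment labeling, loosely following
# the PARAGRAPH pipeline described above. BinSeg comes from the `ruptures` package;
# plain DBSCAN stands in for C-DBSCAN. All thresholds are assumptions.
import numpy as np
import ruptures as rpt
from sklearn.cluster import DBSCAN

def extract_phases(series: np.ndarray, n_bkps: int = 4):
    """Split a (T, n_metrics) resource time series into segments via binary segmentation."""
    algo = rpt.Binseg(model="l2").fit(series)
    breakpoints = algo.predict(n_bkps=n_bkps)   # segment end indices; last one equals T
    starts = [0] + breakpoints[:-1]
    return list(zip(starts, breakpoints))

def label_segments(series: np.ndarray, segments, eps=0.5, min_samples=2):
    """Cluster segments by the mean/std of their resource usage to assign phase labels."""
    feats = np.array([
        np.concatenate([series[s:e].mean(axis=0), series[s:e].std(axis=0)])
        for s, e in segments
    ])
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

# Usage (hypothetical input file with one column per metric):
# series = np.loadtxt("job_cpu_mem.csv", delimiter=",")
# segments = extract_phases(series); labels = label_segments(series, segments)
```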

BottleMod: Modeling Data Flows and Tasks for Fast Bottleneck Analysis

In recent years, scientific workflows have become increasingly popular. However, their tasks are often seen as black boxes, making it difficult to optimize them or identify bottlenecks due to the complex relationships between tasks. Several factors impact task progress, including input data availability, computing power, data transfer speed and network connectivity. During task execution, resource requirements may change significantly. We propose a new method to model task requirements over their lifetime. Using these models, we predict resource consumption over time and execution duration based on a given allocation strategy with low overhead. This method enables computationally simple and fast performance predictions, including bottleneck analysis during workflow runtime. We derive a piecewise-defined bottleneck function from the discrete intersections of the task models' limiting functions. This allows us to predict potential performance gains when mitigating bottlenecks and aids in better resource allocation and workflow execution.
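A minimal sketch of the underlying idea, assuming simple illustrative limiting functions rather than the paper's task models: the effective progress rate at each point in time is the pointwise minimum of the per-resource limits, and the bottleneck switches wherever the minimizing function changes.

```python
# Minimal illustration (not the authors' BottleMod implementation): the effective
# progress rate of a task is the minimum of its per-resource limiting functions,
# and bottleneck switches occur at the intersections of those functions.
import numpy as np

t = np.linspace(0, 100, 1001)                   # seconds
limits = {
    "cpu":     np.full_like(t, 80.0),           # progress the CPU allows (MB/s, illustrative)
    "disk_io": 120.0 - 0.8 * t,                 # I/O bandwidth degrading over time
    "network": np.where(t < 40, 50.0, 150.0),   # link upgraded at t = 40 s
}
names = list(limits)
stacked = np.vstack([limits[n] for n in names])

effective_rate = stacked.min(axis=0)            # piecewise-defined bottleneck function
bottleneck = np.array(names)[stacked.argmin(axis=0)]
switch_points = t[1:][bottleneck[1:] != bottleneck[:-1]]
print(dict(zip(*np.unique(bottleneck, return_counts=True))), switch_points)
```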

SESSION: Session 2 - Cloud Architectures

Utilizing Graph Neural Networks for Effective Link Prediction in Microservice Architectures

Managing microservice architectures in distributed systems is complex and resource-intensive due to the high frequency and dynamic nature of inter-service interactions. Accurate prediction of these future interactions can enhance adaptive monitoring, enabling proactive maintenance and resolution of potential performance issues before they escalate. This study introduces a Graph Neural Network (GNN)-based approach, specifically using a Graph Attention Network (GAT), for link prediction in microservice Call Graphs. Unlike social networks, where interactions tend to occur sporadically and are often less frequent, microservice Call Graphs involve highly frequent and time-sensitive interactions that are essential to operational performance.

Our approach leverages temporal segmentation, advanced negative sampling, and GAT's attention mechanisms to model these complex interactions accurately. Using real-world data, we evaluate our model across performance metrics such as AUC, Precision, Recall, and F1 Score, demonstrating its high accuracy and robustness in predicting microservice interactions. Our findings support the potential of GNNs for proactive monitoring in distributed systems, paving the way for applications in adaptive resource management and performance optimization.
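A hedged sketch of such a GAT-based link predictor using PyTorch Geometric follows; the two-layer encoder, dot-product decoder, and 1:1 negative-sampling ratio are illustrative choices, not the authors' exact configuration.

```python
# Sketch of GAT-based link prediction in the spirit of the paper, using PyTorch Geometric.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv
from torch_geometric.utils import negative_sampling

class GATLinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hidden=64, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads)
        self.conv2 = GATConv(hidden * heads, hidden, heads=1)

    def encode(self, x, edge_index):
        h = F.elu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

    def decode(self, z, edge_pairs):
        # Link score = dot product of the two endpoint embeddings.
        return (z[edge_pairs[0]] * z[edge_pairs[1]]).sum(dim=-1)

def train_step(model, opt, x, edge_index):
    model.train()
    opt.zero_grad()
    z = model.encode(x, edge_index)
    neg = negative_sampling(edge_index, num_nodes=x.size(0),
                            num_neg_samples=edge_index.size(1))
    logits = torch.cat([model.decode(z, edge_index), model.decode(z, neg)])
    labels = torch.cat([torch.ones(edge_index.size(1)), torch.zeros(neg.size(1))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    opt.step()
    return loss.item()
```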

Generating Executable Microservice Applications for Performance Benchmarking

Microservice applications are the building blocks of modern cloud applications. As such, their performance aspects have been receiving increasing attention in the software engineering community. However, many microservice performance studies use only a small set of popular microservice test applications for experiments, questioning the applicability of their approaches in practice. Researchers currently lack the opportunity to collect large and diverse datasets containing performance metrics of microservices. This is because popular test applications only represent specific technology stacks and often come with custom benchmark tooling (e.g., load generation and monitoring). In this paper, we present Creo, a framework for generating microservice applications that (1) are fully executable, (2) have configurable properties and resource usage profiles, and (3) have built-in support for standardized monitoring, load generation, and deployment. Our approach enables researchers to run experiments with diverse microservice applications with minimal effort. We demonstrate the value of our approach in the context of two use cases. First, we show that using generated applications when training machine learning models for predicting performance degradation can improve the prediction accuracy. Second, we evaluate a recent approach for performance anomaly classification on a set of generated applications, highlighting strengths and weaknesses not discussed in the original work.

Columbo: A Reasoning Framework for Kubernetes' Configuration Space

Resource managers such as Kubernetes are rapidly evolving to support low-latency and scalable computing paradigms such as serverless and granular computing. As a result, Kubernetes supports dozens of workload deployment models and exposes roughly 1,600 configuration parameters. Previous work has shown that parameter tuning can significantly improve Kubernetes' performance, but identifying which parameters impact performance and should be tuned remains challenging. To help users optimize their Kubernetes deployments, we present Columbo, an offline reasoning framework to detect and resolve performance bottlenecks using configuration parameters. We study Kubernetes and define its workload deployment pipeline of 6 stages and 26 steps. To detect bottlenecks, Columbo uses an analytical model to predict the best-case deployment time of a workload per pipeline stage and compares it to empirical data from a novel benchmark suite. Columbo then uses a rule-based methodology to recommend parameter updates based on the detected bottleneck, deployed workload, and mapping of configurations to pipeline stages. We demonstrate that Columbo reduces workload deployment time across its benchmark suite by 28% on average and 79% at most. We report a total execution time decrease of 17% for data processing with Spark and up to 20% for serverless workflows with OpenWhisk. Columbo is open-source and available at https://github.com/atlarge-research/continuum/tree/columbo.
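The rule-based step could look roughly like the following sketch; the stage names, best-case times, slack factor, and suggested parameters are illustrative assumptions, not Columbo's actual model or rule set.

```python
# Illustrative sketch only: compare measured per-stage deployment times against an
# analytical best case and map the worst offenders to candidate parameters.
# Stage names, thresholds, and the parameter suggestions below are assumptions.
BEST_CASE_S = {"scheduling": 0.2, "image_pull": 3.0, "container_create": 0.5, "startup": 1.0}
RULES = {
    "scheduling":       ["kube-scheduler --kube-api-qps / --kube-api-burst"],
    "image_pull":       ["kubelet --serialize-image-pulls=false", "registry mirror/cache"],
    "container_create": ["kubelet --max-pods", "container runtime settings"],
    "startup":          ["readiness probe initialDelaySeconds / periodSeconds"],
}

def recommend(measured_s: dict, slack: float = 1.5):
    """Return (stage, slowdown, suggestions) for stages exceeding `slack` x the best case."""
    findings = []
    for stage, best in BEST_CASE_S.items():
        if measured_s.get(stage, 0.0) > slack * best:
            findings.append((stage, measured_s[stage] / best, RULES[stage]))
    return sorted(findings, key=lambda f: f[1], reverse=True)

print(recommend({"scheduling": 0.3, "image_pull": 14.2, "container_create": 0.6, "startup": 4.8}))
```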

Proportional Fairness and Isolation for Serverless Applications over FaaS Platforms

Effectively supporting multi-tenant application deployments in the emerging Function-as-a-Service (FaaS) (or serverless computing) model requires extending it with fairness and isolation mechanisms. Quality-of-service (QoS) concepts developed over time in the networking, storage, and virtualized infrastructure domains are currently being investigated in the space of serverless platforms. In this paper, we propose a two-level serverless QoS architecture that combines state-of-the-art scheduling algorithms and mechanisms with the unique characteristics of distributed serverless platforms, resulting in a system that provides proportional fairness for serverless applications with shared access to distributed and load-balanced FaaS platforms. The primary advantage of our approach is the use of higher-level scheduling mechanisms only, avoiding the need to manage low-level resources within underlying FaaS platforms (thus not requiring changes to them) for achieving fairness. We demonstrate the concrete benefits of our architecture using state-of-the-art benchmarks in experiments over AWS EC2.

SESSION: Session 3 - LLM Performance

An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models

People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this problem, in this work we conduct an empirical characterization of outages and failure-recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of failure-recovery statistical properties, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, including: (1) Failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure-isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available on https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
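As a small illustration of the kind of failure-recovery statistics involved (not the paper's analysis code), the sketch below computes per-service MTTR and MTBF from a hypothetical incident table; the column names are assumptions.

```python
# Sketch: failure-recovery statistics from an incident table with hypothetical
# "service", "start", and "end" columns.
import pandas as pd

incidents = pd.DataFrame({
    "service": ["chatgpt", "chatgpt", "claude"],
    "start": pd.to_datetime(["2024-03-01 10:00", "2024-03-12 02:30", "2024-03-05 08:15"]),
    "end":   pd.to_datetime(["2024-03-01 11:20", "2024-03-12 03:00", "2024-03-05 09:45"]),
})

incidents["ttr_min"] = (incidents["end"] - incidents["start"]).dt.total_seconds() / 60
stats = incidents.groupby("service").agg(
    outages=("ttr_min", "size"),
    mttr_min=("ttr_min", "mean"),
)
# Mean time between failures: gap between consecutive incident starts per service.
stats["mtbf_h"] = incidents.sort_values("start").groupby("service")["start"] \
    .apply(lambda s: s.diff().dt.total_seconds().mean() / 3600)
print(stats)
```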

PreNeT: Leveraging Computational Features to Predict Deep Neural Network Training Time

Training deep learning models, particularly Transformer-based architectures such as Large Language Models (LLMs), demands substantial computational resources and extended training periods. While optimal configuration and infrastructure selection can significantly reduce associated costs, this optimization requires preliminary analysis tools. This paper introduces PreNeT, a novel predictive framework designed to address this optimization challenge. PreNeT facilitates training optimization by integrating comprehensive computational metrics, including layer-specific parameters, arithmetic operations and memory utilization. A key feature of PreNeT is its capacity to accurately predict training duration on previously unexamined hardware infrastructures, including novel accelerator architectures. This framework employs a sophisticated approach to capture and analyze the distinct characteristics of various neural network layers, thereby enhancing existing prediction methodologies. Through proactive implementation of PreNeT, researchers and practitioners can determine optimal configurations, parameter settings, and hardware specifications to maximize cost-efficiency and minimize training duration. Experimental results demonstrate that PreNeT achieves up to 72% improvement in prediction accuracy compared to contemporary state-of-the-art frameworks.
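A rough sketch of the prediction idea under our own assumptions (hypothetical feature names and toy data, not PreNeT itself): describe each layer by simple computational features, learn a per-layer time model, and sum the per-layer predictions for an unseen device.

```python
# Sketch: per-layer training-time regression from computational and device features.
# All feature names and numbers are illustrative placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: one row per (layer, device) with measured time in ms.
X = np.array([
    # params(M), GFLOPs, activations(MB), device_mem_bw(GB/s), device_tflops
    [85.0, 12.4, 310.0, 2039.0, 312.0],
    [85.0, 12.4, 310.0,  864.0, 181.0],
    [12.6,  3.1,  95.0, 2039.0, 312.0],
    [12.6,  3.1,  95.0,  864.0, 181.0],
])
y = np.array([3.1, 6.8, 0.9, 1.9])        # measured layer times (ms), illustrative

model = GradientBoostingRegressor().fit(X, y)

def predict_step_time(layers, device):
    """Sum per-layer predictions for an unseen (model, device) combination."""
    rows = [layer + device for layer in layers]   # concatenate layer and device features
    return float(model.predict(np.array(rows)).sum())

# print(predict_step_time([[85.0, 12.4, 310.0], [12.6, 3.1, 95.0]], [1555.0, 125.0]))
```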

Large Language Model Fine-tuning with Low-Rank Adaptation: A Performance Exploration

Fine-tuning pre-trained models is the preferred method for adapting large language models (LLMs) to specific downstream tasks, since it is significantly more efficient in terms of computational cost and energy than training the models from scratch. However, with LLMs experiencing exponential growth, fine-tuning the models becomes more challenging and expensive as they demand more computational resources. Many approaches have been proposed to fine-tune state-of-the-art models efficiently, reducing the infrastructure needed and thus making them accessible to the public.

In this paper, we investigate a technique called Low-Rank Adaptation (LoRA), an approach for efficiently fine-tuning LLMs by leveraging the low intrinsic dimensionality the models exhibit during fine-tuning. Specifically, we explore different data formats that can be used during LoRA fine-tuning and compare them regarding workload performance and model accuracy. The experiments compare LoRA and its quantized counterpart (QLoRA) with regular methods for fine-tuning state-of-the-art LLMs. The analysis includes estimating memory usage, measuring resource utilization, and evaluating the model quality after fine-tuning. Three state-of-the-art Graphics Processing Units (GPUs) are used for the experiments, namely the NVIDIA H100, NVIDIA A100, and NVIDIA L40. We also use the newest AMD MI300X GPU for a preliminary exploration.

The experiments show that although LoRA with a 16-bit floating-point format can significantly reduce the computational resource demand, it still requires data-center-class GPUs with ample memory to fine-tune LLMs with 70 billion parameters. Using QLoRA with a 4-bit floating-point format lowers the memory requirements by as much as 75% compared to LoRA, allowing single GPUs with 48 GB or 80 GB of memory to fine-tune 70-billion-parameter models. In addition, QLoRA delivers model quality that is on par with or exceeds the quality of the model obtained from conventional fine-tuning.
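For orientation, a hedged sketch of the two configurations compared above using Hugging Face transformers, peft, and bitsandbytes; the model name, LoRA rank, and target modules are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: LoRA on a bf16 base model vs. QLoRA on a 4-bit NF4-quantized base.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder; the paper targets up to 70B models

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# LoRA: full-precision (bf16) base weights, trainable low-rank adapters only.
base_bf16 = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
lora_model = get_peft_model(base_bf16, lora_cfg)

# QLoRA: the frozen base is loaded in 4-bit NF4, cutting base-weight memory roughly 4x,
# while the adapters (and compute) stay in bf16.
qlora_quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                 bnb_4bit_use_double_quant=True,
                                 bnb_4bit_compute_dtype=torch.bfloat16)
base_4bit = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=qlora_quant)
qlora_model = get_peft_model(prepare_model_for_kbit_training(base_4bit), lora_cfg)

lora_model.print_trainable_parameters()    # adapters are typically <1% of total parameters
qlora_model.print_trainable_parameters()
```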

Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models

Advancements in Natural Language Processing are heavily reliant on Transformer architectures, whose improvements come at substantial resource costs due to ever-growing model sizes. This study explores optimization techniques, including quantization, knowledge distillation, and pruning, focusing on energy and computational efficiency while retaining performance. Among standalone methods, 4-bit quantization significantly reduces energy use with minimal accuracy loss. Hybrid approaches, like NVIDIA's Minitron approach combining knowledge distillation and structured pruning, further demonstrate promising trade-offs between size reduction and accuracy retention. A novel optimization framework is introduced, offering a flexible way to compare the various methods. Through the investigation of these compression methods, we provide valuable insights for developing more sustainable and efficient LLMs, shining a light on the often-ignored concern of energy efficiency.

SESSION: Session 4 - Cloud in Industry

Bridging Clusters: A Comparative Look at Multi-Cluster Networking Performance in Kubernetes

Microservices and containers have transformed the way applications are developed, tested, deployed, scaled, and managed. Several container orchestration platforms, like Kubernetes, have emerged, streamlining container management at scale and providing enterprise-grade support for application modernization. Driven by application, compliance, and end-user requirements, companies opt to deploy multiple Kubernetes clusters across public and private clouds. However, deploying applications in multi-cluster environments presents distinct challenges, especially managing the communication between the microservices spread across clusters. Traditionally, custom configurations, like VPNs or firewall rules, were required to connect such complex setups of clusters spanning the public cloud and on-premise infrastructure. This industry paper presents a comprehensive analysis of network performance characteristics for three popular open-source multi-cluster networking solutions (namely, Skupper, Submariner, and Istio), addressing the challenges of microservices connectivity across clusters. We evaluate key factors such as latency, throughput, and resource utilization using established tools and benchmarks, offering valuable insights for organizations aiming to optimize the network performance of their multi-cluster deployments. Our experiments revealed that each solution involves unique trade-offs in performance and resource efficiency: Submariner offers low latency and consistency, Istio excels in throughput with moderate resource consumption, and Skupper stands out for its ease of configuration while maintaining balanced performance.

Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads

Cloud providers have introduced pricing models to incentivize long-term commitments of compute capacity. These long-term commitments allow the cloud providers to get guaranteed revenue for their investments in data centers and computing infrastructure. However, these commitments expose cloud customers to demand risk if expected future demand does not materialize. While there are existing studies of theoretical techniques for optimizing performance, latency, and cost, relatively little has been reported so far on the trade-offs between cost savings and demand risk for compute commitments for large-scale cloud services.

We characterize cloud compute demand based on an extensive three-year study of the Snowflake Data Cloud, which includes data warehousing, data lakes, data science, data engineering, and other workloads across multiple clouds. We quantify capacity demand drivers from user workloads, hardware generational improvements, and software performance improvements. Using this data, we formulate a series of practical optimizations that maximize capacity availability and minimize costs for the cloud customer.

Towards Workload-aware Cloud Efficiency: A Large-scale Empirical Study of Cloud Workload Characteristics

Cloud providers introduce features and optimizations to improve efficiency and reliability, such as Spot VMs, Harvest VMs, oversubscription, and auto-scaling. To use these effectively, it is important to understand workload characteristics. However, workload characterization can be complex and difficult to scale manually due to the multiple signals involved. In this study, we conduct the first large-scale empirical study of first-party workloads at Microsoft to understand their characteristics. Through this empirical study, we aim to answer the following questions: (1) What are the critical workload characteristics that impact efficiency and reliability on cloud platforms? (2) How do these characteristics vary across different workloads? (3) How can cloud platforms leverage these insights to efficiently characterize all workloads at scale? This study provides a deeper understanding of workload characteristics and their impact on cloud performance, which can aid in optimizing cloud services, and identifies potential areas for future research.

Cost Optimization and Performance Control in the Hybrid Multi-cloud Environment

Escalating cloud costs and unpredictable performance are major challenges for organizations, particularly when deploying Generative AI (GenAI) applications in hybrid multi-cloud environments, a market projected to reach nearly 20 trillion by 2030.

This paper introduces a novel systems approach to cost optimization, performance control, and FinOps decision-making through automated observability, advanced queueing network modeling, and gradient optimization.

Observability automation narrows the scope of the tuning effort by focusing on the applications that consume the most resources and credits, exhibit the highest rates of performance and cost anomalies, have the highest frequency of failed queries, and spill the highest volumes of data to local and remote storage.

Modeling and optimization determine the minimal configurations, resource allocation, workload management, and budgets needed to meet Service Level Goals (SLGs) for all business applications running on different cloud data platforms in the Hybrid Multi-Cloud environment. Modeling and optimization evaluate options and set cost and performance expectations for proposed changes.

By comparing actual performance and costs with expectations, our approach enables closed-loop performance and cost control, mitigating the risks of unexpected financial and operational outcomes. The presented case studies highlight the value of our technology in optimizing application costs and controlling performance across diverse projects, including sizing new applications before cloud deployment, selecting appropriate cloud platforms, optimizing cloud migration strategies, and managing dynamic capacity in hybrid multi-cloud environments.

Our predictions demonstrated high accuracy, with the difference between measured and predicted costs within 10%.
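As a generic illustration of the kind of queueing calculation behind such sizing decisions (not the authors' models), the sketch below uses an M/M/c approximation to find the smallest server count that keeps mean response time within a service level goal; all rates and thresholds are illustrative.

```python
# Sketch: M/M/c (Erlang C) sizing against a mean-response-time service level goal.
from math import factorial

def mmc_response_time(lam: float, mu: float, c: int) -> float:
    """Mean response time of an M/M/c queue (arrival rate lam, service rate mu per server)."""
    rho = lam / (c * mu)
    if rho >= 1.0:
        return float("inf")
    a = lam / mu
    p_wait = (a**c / factorial(c)) / ((1 - rho) * sum(a**k / factorial(k) for k in range(c))
                                      + a**c / factorial(c))
    return p_wait / (c * mu - lam) + 1.0 / mu   # Erlang C waiting time + service time

def min_servers(lam: float, mu: float, slg_seconds: float) -> int:
    c = 1
    while mmc_response_time(lam, mu, c) > slg_seconds:
        c += 1
    return c

# e.g. 120 queries/s, each needing 0.05 s of work, SLG of 0.08 s mean response time
print(min_servers(lam=120.0, mu=1 / 0.05, slg_seconds=0.08))
```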

SESSION: Keynote Talk 2

Software Performance Engineering for Foundation Model-Powered Software (FMware)

This keynote examines the transformative impact of Foundation Models (FMs), particularly Large Language Models (LLMs), on software development, emphasizing the critical role of Software Performance Engineering (SPE) in ensuring FM-powered software (FMware) achieves essential performance goals such as throughput and latency. With the LLM market projected to reach 36.1 billion by 2030 [3], addressing SPE challenges has become increasingly urgent. Drawing extensively from comprehensive literature surveys, industry-academia interactions, customer feedback, and practical experience detailed in [6], this keynote identifies four critical SPE challenges throughout the FMware lifecycle, discusses current state-of-practice solutions, proposes future research directions, and introduces a vision for an innovative SLA-aware runtime system designed to enhance the performance and efficiency of FMware.

SESSION: Session 5 - Benchmarking in Industry

Accelerating Model Optimization on the Edge Through Automated Performance Benchmarking and End-to-End Profiling

The resource-constrained nature of edge devices poses unique challenges in meeting strict performance requirements. However, performance benchmarks for deployed models are often run manually and infrequently, and other phases of the development workflow, such as the conversion from high-level languages to C/C++ code, might not be evaluated for performance. While these traditional approaches to performance evaluation give important insights for improvements in the final product, the integration of performance testing throughout the product development process enables early detection and mitigation of performance issues. In this work, we propose an automated workflow that streamlines the performance evaluation and optimization of deployed deep learning models on edge devices.

Platform Performance Suite (PPS): A Framework for Performance Analysis & Diagnosis of Complex Cyber-physical Systems

The performance of cyber-physical systems (CPS) is a determining factor for their success and often needs to be guaranteed. When performance issues occur, their analysis and the identification of the root cause should be fast. Typically, the analysis requires the association of system observations (i.e., tracing) with design and implementation artifacts. For this association, multidisciplinary domain knowledge is required, which typically resides in the workforce but is becoming unmanageable due to the immense and continuously increasing system complexity. This paper proposes a model-based method, the Platform Performance Suite (PPS), supported by corresponding tooling. PPS enables automated analysis for the identification of the root cause when performance issues, like system throughput loss, occur. The cornerstone of the approach is the Lifecycle Software Architecture model, which captures the domain knowledge by modeling all the relationships among the system artifacts from the specification up to the runtime phase. The performance analysis at the runtime phase is carried out on TMSC models, guaranteeing the generic applicability of the method, while the connection to the other lifecycle phases enables the automated identification of the root cause by pinpointing specific artifacts. The evaluation of the method has been carried out on a world-leading complex and high-performing CPS, the ASML TWINSCAN system, where a throughput loss needs to be addressed promptly. PPS has shown a remarkable speed-up in the identification of the root cause compared to the state-of-practice method.

SESSION: Session 6 - Architecture

Introducing GPU Persistent Graphs for Time-sensitive Workflows

The emphasis on throughput in GPU design poses challenges when integrating them into time-sensitive applications. Recent advancements in GPU architectures and software have enabled the reduction of overhead and interference along the critical path through the use of advanced GPU mechanisms, such as persistent kernels. Despite these advancements, these methods often involve trade-offs in performance, flexibility, or portability. In response, we introduce our proposal of persistent graphs to enable a fully host-independent GPU pipeline scheduler. This concept is entirely modular, simple to implement, and allows for online changes of workloads without impacting performance. We demonstrate the effectiveness of our approach through extensive benchmarking including an example of a highly constrained cyber-physical system: adaptive optics for astronomy. Our evaluation spans various state-of-the-art platforms, including A100 GPUs and embedded Jetson Orin SoCs. Notably, with our proposed implementation, response time variability never exceeds 10%, even on platforms with significant CPU interference.

Component-Based Analytical Modeling of GPU Runtime Performance: a Case-Study in Scientific Computing

Analytical performance models are excellent tools for fast performance prediction and can be used effectively for designing and tuning parallel algorithms. However, such models are non-trivial to build, especially when both the application and the system are very complex.

In this context, we study the applicability and limitations of a component-based analytical approach to model (and predict) the performance of GPU operations. Using microbenchmarks, we incorporate dynamic runtime behavior and architecture-dependent factors in the predictions.

Our model validation and evaluation focus on a specific case-study: ROOT histogramming -- a high-energy physics (HEP) application whose performance is critical in most experiments' data analysis pipeline (i.e., histogramming is run millions of times per analysis). We show our approach in action by constructing the model and showing how it can be useful for scenario analysis, where it can accurately predict trends and performance rankings. In addition, the design process of the model itself can lead to insights into the source of performance bottlenecks. We conclude that component-based modeling is feasible and practical for GPU applications. It is a modeling approach with a reasonable trade-off between accuracy, prediction speed, and interoperability.

Multi-Strided Access Patterns to Boost Hardware Prefetching

Important memory-bound kernels, such as linear algebra, convolutions, and stencils, rely on SIMD instructions as well as optimizations targeting improved vectorized data traversal and data re-use to attain satisfactory performance. On contemporary CPU architectures, the hardware prefetcher is of key importance for efficient utilization of the memory hierarchy. In this paper, we demonstrate that transforming a memory access pattern consisting of a single stride to one that concurrently accesses multiple strides can boost the utilization of the hardware prefetcher and, in turn, improve the performance of memory-bound kernels significantly. Using a set of micro-benchmarks, we establish that accessing memory in a multi-strided manner enables more cache lines to be concurrently brought into the cache, resulting in improved cache hit ratios and higher effective memory bandwidth without the introduction of costly software prefetch instructions. Subsequently, we show that multi-strided variants of a collection of six memory-bound dense compute kernels outperform state-of-the-art counterparts on three different micro-architectures. More specifically, for kernels including Matrix-Vector Multiplication, a Convolution Stencil, and kernels from PolyBench, we achieve significant speedups of up to 12.55x over Polly, 2.99x over MKL, 1.98x over OpenBLAS, 1.08x over Halide and 1.87x over OpenCV. The code transformation to take advantage of multi-strided memory access is a natural extension of the loop unroll and loop interchange techniques, allowing this method to be incorporated into compiler pipelines in the future.

Parallel GPU-Enabled Algorithms for SpGEMM on Arbitrary Semirings with Hybrid Communication

Sparse General Matrix Multiply (SpGEMM) is key for various High-Performance Computing (HPC) applications such as genomics and graph analytics. Using the semiring abstraction, many algorithms can be formulated as SpGEMM, allowing redefinition of addition, multiplication, and numeric types. Today large input matrices require distributed memory parallelism to avoid disk I/O, and modern HPC machines with GPUs can greatly accelerate linear algebra computation.

In this paper, we implement a GPU-based distributed-memory SpGEMM routine on top of the CombBLAS library. Our implementation achieves a speedup of over 2× compared to the CPU-only CombBLAS implementation and up to 3× compared to PETSc for large input matrices.

Furthermore, we note that communication between processes can be optimized by either direct host-to-host or device-to-device communication, depending on the message size. To exploit this, we introduce a hybrid communication scheme that dynamically switches data paths depending on the message size, thus improving runtimes in communication-bound scenarios.

Detecting Noisy Neighbors in CPU-Isolated Cgroups Environments

Control groups (cgroups) are a crucial isolation mechanism in containerized environments, but they do not fully prevent performance interference (noisy neighbors). This paper presents a novel, workload-agnostic approach for detecting noisy neighbors within CPU-isolated cgroups. Using in-kernel profiling with the Extended Berkeley Packet Filter (eBPF), we instrument the Linux process scheduler to capture scheduling latencies and preemption frequencies. We introduce a detection method based on these metrics to identify noisy neighbors online without requiring workload profiles or offline analysis. Evaluations across various workload scenarios demonstrate the effectiveness of our approach in accurately identifying performance degradation caused by noisy neighbors.
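A simplified, hypothetical version of such a detection rule might look as follows; the window size, thresholds, and metric names are assumptions, and the real approach operates online on eBPF-collected scheduler data.

```python
# Sketch: flag a cgroup as suffering from noisy neighbors when its scheduling latency
# and preemption rate jump well above their recent baseline. Thresholds are illustrative.
from collections import deque
from statistics import mean

class NoisyNeighborDetector:
    def __init__(self, window=60, latency_factor=3.0, preempt_factor=3.0):
        self.lat = deque(maxlen=window)      # per-interval p99 scheduling latency (us)
        self.pre = deque(maxlen=window)      # per-interval preemptions per second
        self.latency_factor = latency_factor
        self.preempt_factor = preempt_factor

    def observe(self, p99_sched_latency_us: float, preemptions_per_s: float) -> bool:
        noisy = (
            len(self.lat) >= 10
            and p99_sched_latency_us > self.latency_factor * mean(self.lat)
            and preemptions_per_s > self.preempt_factor * mean(self.pre)
        )
        self.lat.append(p99_sched_latency_us)
        self.pre.append(preemptions_per_s)
        return noisy
```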

An Analysis of User-space Idle State Instructions on x86 Processors

Power consumption has become a limiting factor in all areas of computing. Hence, making the most of the available power budget is paramount. To use the available budget most efficiently, techniques like dynamic voltage and frequency scaling and idle states can be used. This work analyzes the instructions UMWAIT, TPAUSE, and MWAITX on three different systems. We analyze their instruction latencies, power consumptions, and dependencies on core frequencies. To do so, we introduce benchmarks to gather performance and power parameters, which can be used for future software optimizations. Key findings include: The expected sleep duration passed to UMWAIT and TPAUSE can influence the depth of the user idle state. The actual sleep duration of TPAUSE increases stepwise with an increasing expected sleep duration. Requesting a deeper idle state leads to an additional sleep duration, which increases with a lower core frequency. The core frequency influences the instruction latency of TPAUSE, where a low frequency can lead to an irregular performance pattern. The latency of TPAUSE, UMWAIT, and MWAITX is most often higher than requested on the evaluated systems. Core power consumption can be reduced by ~20% to ~70% compared to the usage of PAUSE. The latency for waking a core in user idle reflects the underlying hardware architecture with tens (desktop architecture with shallow idle states) to hundreds (server architecture with deep idle states) of nanoseconds at nominal frequencies.

SESSION: Session 7 - Memory and Network

On-demand Memory Compression of Stream Aggregates through Reinforcement Learning

Stream Aggregates are crucial in digital infrastructures for transforming continuous data streams into actionable insights. However, state-of-the-art Stream Processing Engines lack mechanisms to effectively balance performance with memory consumption, a capability that is especially crucial in environments with fluctuating computational resources and data-intensive workloads.

This paper tackles this gap by introducing a novel on-demand adaptive memory compression scheme for stream Aggregates. Our approach uses Reinforcement Learning (RL) to dynamically adapt how a stream Aggregate compresses its state, balancing performance and memory utilization under a given processing latency threshold. We develop a model that incorporates the application- and data-specific nuances of stream Aggregates and create a framework to train RL Agents to adjust memory compression levels in real time. Additionally, we shed light on a trade-off between the timeliness of an RL Agent's training and its resulting behavior, defining several policies to account for this trade-off.

Through extensive evaluation, we show that the proposed RL Agent supports on-demand memory compression well. We also study the effects of our policies, providing guidance on their role in RL applied to stream Aggregates, and show that our framework supports lean execution of such RL jobs.
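A toy sketch of the idea, assuming a tabular Q-learning agent and illustrative reward shaping rather than the authors' RL formulation: the agent picks a compression level for the Aggregate's state and is rewarded for memory savings only while processing latency stays under the threshold.

```python
# Toy tabular Q-learning sketch for on-demand compression-level selection.
import random

LEVELS = [0, 1, 2, 3]                 # 0 = no compression ... 3 = heaviest compression
ACTIONS = [-1, 0, +1]                 # decrease / keep / increase the level

class CompressionAgent:
    def __init__(self, latency_threshold_ms, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = {(lvl, a): 0.0 for lvl in LEVELS for a in ACTIONS}
        self.thr, self.alpha, self.gamma, self.eps = latency_threshold_ms, alpha, gamma, eps

    def act(self, level):
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(level, a)])

    def learn(self, level, action, next_level, memory_mb, latency_ms):
        # Reward memory savings, but penalize heavily when the latency threshold is violated.
        reward = -memory_mb if latency_ms <= self.thr else -10_000.0
        best_next = max(self.q[(next_level, a)] for a in ACTIONS)
        key = (level, action)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])

def step_level(level, action):
    """Clamp the chosen action to the valid range of compression levels."""
    return min(max(level + action, LEVELS[0]), LEVELS[-1])
```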

Better Memory Tiering, Right from the First Placement

Heterogeneous memory (HMem) architectures have recently emerged and revolutionized the traditional memory hierarchy. This trend is likely to increase with the rise of the Compute Express Link (CXL) standard. A fundamental problem that arises when working with HMem architectures is data placement: at which memory tier should the data objects of an application be placed to optimize its performance? Existing proposals that tackle this problem at the system-level operate transparently to the application, hence without explicit placement hints from it.

Such lack of knowledge is a key challenge to the first placement of new objects. Today's state-of-the-art systems for memory tiering solve the first-placement problem by blindly guessing that new pages (holding new objects) should be better placed in the fast tier. However, for large working sets, this blind guess fails for a large fraction of pages, which results in important performance shortcomings.

This paper aims to mitigate such shortcomings by replacing blind guess in the first placement with an educated guess, which takes advantage of past object-level access patterns. We propose a novel memory tiering system that addresses the first-placement problem by combining hmalloc, an HMem-aware memory allocation library, with Ambix, a page-based memory tiering system, and exploiting their object and page-level synergies. Our experimental evaluation when running realistic HPC benchmarks on a real HMem system demonstrates that our synergistic approach is effectively able to address the first-placement limitation of previous systems. Our approach achieves up to 2.03x speedup over traditional memory management through intelligent first placement alone. When combined with the state-of-the-art support for tiered page placement provided in the latest versions of Linux, performance further improves, reaching up to 2.28x speedup over modern memory tiering systems in certain HPC workloads.

Non-linear Programming for the Network Calculus Analysis of FIFO Feedforward Networks

System designs for bounded communication latencies often employ a rather basic concept at their core: First-In First-Out (FIFO) queueing. Network Calculus (NC) can compute delay bounds for the end-to-end communication of data flows crossing potentially large feedforward networks of such FIFO systems. Analysis complexity stems from the need to keep track of the interactions between flows when they compete for resources, i.e., multiplex in shared queues.

NC has an elegant solution to this: an open, so-called FIFO parameter is introduced every time a (worst-case) FIFO interaction occurs in the analysis. At the end of the analysis stands a (min,plus)-algebraic term with interdependent FIFO parameters. We aim at finding a near-optimal setting for all open parameters. When employing standard optimization techniques, we cannot work with a parameterized (min,plus)-algebraic term directly. Thus, we show how to derive a minimum-size (plus,times)-algebraic term that we can use efficiently in a Non-Linear Program (NLP). Additionally, we show how to differentiate this term to open our approach to gradient-based NLP algorithms.

In numerical evaluations, we show that our approach outperforms the complexity/accuracy tradeoff of existing heuristics for setting the FIFO parameters. With a slight increase in analysis runtime, we reduce the gap to the optimal setting by a factor of 4.4, to 0.15% on average.
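For context, the open FIFO parameter discussed above is the θ appearing in the classical FIFO left-over service curve result from the Network Calculus literature (restated here for orientation, not taken from the paper):

```latex
% Classical FIFO left-over service curve (see, e.g., Le Boudec & Thiran): if a FIFO
% server offers service curve $\beta$ to the aggregate of the flow of interest and
% cross-traffic with arrival curve $\alpha$, then for every $\theta \ge 0$ the flow
% of interest is guaranteed the service curve
\[
  \beta_{\theta}(t) \;=\; \bigl[\beta(t) - \alpha(t - \theta)\bigr]^{+}
  \cdot \mathbf{1}_{\{t > \theta\}} .
\]
% One such $\theta$ must be chosen per FIFO interaction; these are the
% interdependent parameters that the non-linear program optimizes.
```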

Uplink End-to-End Latency Characterization of a 5G NSA Access Network

5G networks offer significant advancements over their predecessor, 4G Long-Term Evolution (LTE). Low-latency network access, a key requirement enabling the near real-time responsiveness required by applications such as autonomous driving, factory automation, and virtual reality, is one of 5G's key features. In this paper, we present the results of a long-term measurement campaign of the uplink end-to-end (e2e) latency experienced by a 5G-capable device using a commercial sub-6 GHz 5G non-standalone (NSA) network. Our results show an average uplink e2e latency of 12ms, with a 95th percentile of 21ms. This compares favorably with an average uplink e2e latency of 35ms and a 95th percentile of 53ms using 4G LTE to reach the same destination. We also characterize and define, through real-world network parameters in the uplink data transmission process, an unexpected latency pattern that impacts the performance of latency-sensitive applications such as edge computing or ultra-reliable low-latency communication (URLLC), a new class of applications targeted by 5G networks, even in 5G standalone (SA) deployments.

SESSION: Keynote Talk 3

Great Performance for Bad Days

Most traditional approaches to performance measurement and optimization focus on performance under good conditions.

Performance during bad times (during and after overload, during and after failures, sudden workload changes, etc.) is equally important to customers and operators of systems of all sizes. In this talk, I'll look at what it takes to keep performance high during adverse conditions, including avoiding and reacting to metastable failure modes, and explore some of the gaps in current benchmarking techniques, which tend to hide metastable behavior.

SESSION: Session 8 - Deployments and Capacity Planning

Beyond Maximum Throughput: Explore Full Operational Envelope for Capacity Planning

Throughput is a common metric in performance tests, particularly for capacity planning. Traditionally, systems are tested until failure, using the maximum throughput achieved as the primary performance measurement. However, this method does not accurately reflect system capacity, especially when response time is a major factor in user experience. We propose a comprehensive performance testing strategy that explores the full operational envelope of the system. By introducing the concept of maximum recommended throughput, we demonstrate that this new benchmark provides more reliable information for performance testing, benchmarking, and capacity planning. In some cases, the concept can be extended to resource utilization. Finally, we show that exploring the full operational envelope is also powerful for identifying connection throttles.
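A minimal sketch of how the concept could be applied to load-test results, under the assumption that latency degrades monotonically with load; the function name and threshold are ours, not the paper's definition.

```python
# Sketch: instead of reporting the load at which the system collapses, report the
# highest tested load whose p95 response time still meets the target.
import numpy as np

def max_recommended_throughput(load_levels, latencies_ms, p95_target_ms):
    """load_levels: requests/s tested; latencies_ms: per-level arrays of latency samples."""
    recommended = None
    for load, samples in zip(load_levels, latencies_ms):
        if np.percentile(samples, 95) <= p95_target_ms:
            recommended = load            # keep the highest compliant load level
        else:
            break                         # assume latency only degrades beyond this point
    return recommended

# Usage with hypothetical sweep data:
# levels = [100, 200, 300, 400]
# samples = [lat_100, lat_200, lat_300, lat_400]   # arrays of measured latencies per level
# print(max_recommended_throughput(levels, samples, p95_target_ms=250))
```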

CADAEC: Content-Aware Deployment of AI Workloads in Edge-Cloud Ecosystem

The rapid growth of edge devices has revolutionized industrial AI applications, including robotics, autonomous systems, and IoT, where real-time processing is essential. These systems face the challenge of managing concurrent, high-volume workloads across resource-constrained edge devices and cloud infrastructure. A major hurdle is optimizing deep learning model deployment across edge-cloud environments in dynamic conditions, particularly where input quality and noise fluctuate under concurrent demands. This paper introduces a novel optimization framework that addresses these challenges by dynamically selecting the most suitable models from a diverse model zoo and determining optimal deployment locations (edge or cloud). The proposed framework leverages a content-aware approach to minimize both communication and computation latency while considering hardware limitations and environmental factors. Using a binary linear programming (BILP) approach, our method efficiently balances the model distribution of an AI pipeline, maximizing end-to-end performance. We validate this framework on a robotic AI pipeline in real-world, noise-variant environments, comparing content-aware and content-agnostic deployment strategies. Our results demonstrate significant optimization of deployment latency and system performance under high-concurrency conditions, using both content-agnostic and content-aware approaches, highlighting the framework's robustness and scalability. Additionally, we show the effectiveness of the content-aware approach over the content-agnostic method in optimizing deployment choices and reducing latency, while maintaining the desired qualitative outcomes of the AI pipeline under different communication setups. This makes the content-aware strategy more suitable for complex, real-world environments where input quality and noise vary significantly. Overall, the proposed method presents a compelling solution for optimizing AI pipelines in edge-cloud ecosystems, offering potential for broader application domains.
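A hedged sketch of such a binary program using PuLP; the stage names, latency and memory figures, and constraints are illustrative assumptions rather than the paper's formulation.

```python
# Sketch: place each pipeline stage's chosen model on the edge or in the cloud to
# minimize total compute + transfer latency under an edge-memory budget.
import pulp

stages = ["detector", "tracker", "planner"]
sites = ["edge", "cloud"]
compute_ms = {("detector", "edge"): 45, ("detector", "cloud"): 12,
              ("tracker", "edge"): 10,  ("tracker", "cloud"): 4,
              ("planner", "edge"): 8,   ("planner", "cloud"): 3}
transfer_ms = {"edge": 0, "cloud": 30}          # per-stage round trip to the cloud
mem_mb = {"detector": 900, "tracker": 200, "planner": 100}
EDGE_MEM_MB = 1000

x = pulp.LpVariable.dicts("x", (stages, sites), cat="Binary")
prob = pulp.LpProblem("placement_sketch", pulp.LpMinimize)
prob += pulp.lpSum(x[s][k] * (compute_ms[(s, k)] + transfer_ms[k])
                   for s in stages for k in sites)
for s in stages:                                 # each stage runs in exactly one place
    prob += pulp.lpSum(x[s][k] for k in sites) == 1
prob += pulp.lpSum(x[s]["edge"] * mem_mb[s] for s in stages) <= EDGE_MEM_MB

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print({s: next(k for k in sites if x[s][k].value() == 1) for s in stages})
```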

SESSION: Session 9 - Energy and Modeling

Understanding the Energy Consumption of Cloud-native Software Systems

As the dependence on software systems running on cloud data centers grows on a daily basis, there is an increasingly stronger motivation to reduce their energy consumption. A necessary but not trivial step in this direction is understanding how energy is consumed in virtualized, multi-tenant environments such as those provisioned in the cloud. Prior work focuses on isolated, non-virtualized systems and is difficult to transfer to this context. A number of industry-led approaches have appeared in the meantime in terms of tools and technological stacks building on the concept of observability as the means to achieve this goal. This paper discusses our approach in adopting one such stack and consequently assessing it for fitness for purpose through an experimental procedure. To this effect, we deploy a cloud-native application on a private cloud infrastructure instrumented for measuring energy consumption through a combination of hardware and software means. We combine the information from these instrumentation points into a mapping model to deal with the different virtualization layers and compare the model against the values reported by the observability stack. Furthermore, we use our model to attribute energy consumption across the virtualization layers and understand how energy is consumed at each one.

HeteroBench: Multi-kernel Benchmarks for Heterogeneous Systems

The end of Moore's Law and Dennard scaling has driven the proliferation of heterogeneous systems with accelerators, including CPUs, GPUs, and FPGAs, each with distinct architectures, compilers, and programming environments. GPUs excel at massively parallel processing for tasks like deep learning training and graphics rendering, while FPGAs offer hardware-level flexibility and energy efficiency for low-latency, high-throughput applications. In contrast, CPUs, while general-purpose, often fall short in high-parallelism or power-constrained applications. This architectural diversity makes it challenging to compare these accelerators effectively, leading to uncertainty in selecting optimal hardware and software tools for specific applications.

To address this challenge, we introduce HeteroBench, a versatile benchmark suite for heterogeneous systems. HeteroBench allows users to evaluate multi-compute kernel applications across various accelerators, including CPUs, GPUs (from NVIDIA, AMD, Intel), and FPGAs (AMD), supporting programming environments of Python, Numba-accelerated Python, serial C++, OpenMP (both CPUs and GPUs), OpenACC and CUDA for GPUs, and Vitis HLS for FPGAs. This setup enables users to assign kernels to suitable hardware platforms, ensuring comprehensive device comparisons.

What makes HeteroBench unique is its vendor-agnostic, cross-platform approach, spanning diverse domains such as image processing, machine learning, numerical computation, and physical simulation, ensuring deeper insights for HPC optimization. Extensive testing across multiple systems provides practical reference points for HPC practitioners, simplifying hardware selection and performance tuning for developers and end-users alike. This suite may assist in making more informed decisions on AI/ML deployment and HPC development, making it an invaluable resource for advancing academic research and industrial applications.

Quantifying Data Leakage in Failure Prediction Tasks

With the ever-increasing importance of cloud computing and a strong focus on reliable data centers, a large amount of research has been done on failure prediction for hard disk drives. The collection of monitoring data, such as SMART statistics (Self-Monitoring, Analysis, and Reporting Technology) from operational HDDs, enables operators to obtain predictions about the expected remaining useful life. Numerous methods for HDD failure prediction have been published in recent years, and their evaluation has shown decent results. However, a naive splitting into training and test sets can lead to data leakage and, thus, over-optimistic results that cannot be achieved on the data of scientific interest. In this paper, we propose a novel data leakage measure for quantifying the amount of data leakage in training and test datasets. Further, we define four splitting techniques and evaluate our measure in terms of the performance optimism of classification models with respect to these different splitting strategies. Our results consistently show that splitting techniques prone to data leakage induce an overestimation of predictive performance. Overall, we were able to show the usefulness of the defined data leakage measure, as well as its connection with different splitting techniques and the performance optimism of prediction models.
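As a concrete illustration of one leakage-prone and one leakage-free splitting technique (the leakage measure itself is the paper's contribution and is not reproduced here), the sketch below contrasts a random row-level split with a drive-level grouped split of hypothetical SMART data.

```python
# Sketch: a random row-level split may put samples of the same drive in both sets,
# while grouping by serial number keeps each drive entirely in one of the two sets.
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit

df = pd.DataFrame({                      # hypothetical daily SMART snapshots
    "serial": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "smart_5": [0, 0, 4, 0, 0, 12, 16, 20],
    "failed":  [0, 0, 1, 0, 0, 0, 0, 1],
})

# Leakage-prone: rows of the same drive may land in both train and test.
naive_train, naive_test = train_test_split(df, test_size=0.25, random_state=0)

# Leakage-free: every drive (group) appears in exactly one of the two sets.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["serial"]))
grouped_train, grouped_test = df.iloc[train_idx], df.iloc[test_idx]

print("drives shared across naive split:",
      set(naive_train["serial"]) & set(naive_test["serial"]))
print("drives shared across grouped split:",
      set(grouped_train["serial"]) & set(grouped_test["serial"]))
```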

Energy Metrics for Edge Microservice Request Placement Strategies

Microservices are a way of splitting the logic of an application into small blocks that can be run on different computing units and used by other applications. This architecture has been successful for cloud applications and is now increasingly used for edge applications. It brings many benefits, but it makes deciding where a given service request should be executed (i.e., its placement) more complex, as every small block needed for the request has to be placed.

In this paper, we investigate energy-centric request placement for services that use the microservice architecture, and specifically whether using different energy metrics for optimization leads to different placement strategies. We consider the problem as an instance of a traveling purchaser problem and propose an integer linear programming formulation. This formulation aims at minimizing energy consumption while respecting latency requirements. We consider two different energy consumption metrics, namely overall or marginal energy, when applied as a measure to determine a placement. Our simulations show that using different energy metrics indeed results in different request placements.
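An illustrative formulation of such a placement program, written under our own assumptions rather than reproducing the paper's traveling-purchaser model:

```latex
% Illustrative sketch: choose a node $n$ for each microservice $m$ of the request,
% minimizing the selected energy metric $e_{mn}$ (overall or marginal energy) while
% meeting the latency requirement. Here $\ell_{mn}$ aggregates processing and
% data-access latency; pairwise communication latency between nodes would require
% standard linearization to remain within an integer linear program.
\begin{align*}
  \min_{x}\ & \sum_{m \in M}\sum_{n \in N} e_{mn}\, x_{mn} \\
  \text{s.t.}\ & \sum_{n \in N} x_{mn} = 1 \quad \forall m \in M, \\
  & \sum_{m \in M}\sum_{n \in N} \ell_{mn}\, x_{mn} \le L_{\max}, \\
  & x_{mn} \in \{0,1\} \quad \forall m \in M,\ n \in N.
\end{align*}
```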