ICPE '25: Companion of the 16th ACM/SPEC International Conference on Performance Engineering

Full Citation in the ACM Digital Library

SESSION: Journal First Track

RPerf: Mining User Reviews Using Topic Modeling to Assist Performance Testing: An Industrial Experience Report

Zehao Wang
Wei Liu
Jinfu Chen
Tse-Hsun (Peter) Chen

Software performance affects the user-perceived quality of software. Therefore, it is important to analyze the performance issues that users are concerned with. In this paper, we document our experience working with our industry partner on analyzing user reviews to identify and analyze performance issues users are concerned with. In particular, we designed an approach, RPerf, which automatically analyzes unstructured user reviews and generates a performance analysis report that can assist performance engineers with performance testing. In particular, RPerf uses BERTopic to uncover performance-related topics in user reviews. RPerf then maps the derived topics to performance KPIs (key performance indicators) such as response time. Such performance KPIs better help performance test design and allocate performance testing resources. Finally, RPerf extracts user usage scenarios from user reviews to help identify the causes. Through a manual evaluation, we find that RPerf achieves a high accuracy (over 93%) in identifying the performance-related topics and performance KPIs from user reviews. RPerf can also accurately extract usage scenarios in over 80% of user reviews. We discuss the performance analysis report that is generated based on RPerf. We believe that our findings can assist practitioners with analyzing performance-related user reviews and inspire future research on user review analysis.

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Saeid Ghafouri
Kamran Razavi
Mehran Salmani
Alireza Sanaee
Tania Lorido Botran
Lin Wang
Joseph Doyle
Pooyan Jamshidi

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a critical challenge in machine learning production systems. Our work introduces IPA, an innovative online deep learning inference pipeline adaptation system. IPA dynamically configures batch sizes, replication, and model variants to simultaneously optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs). Unlike existing approaches, IPA captures the correlation between configuration changes across multiple pipeline stages, enabling holistic end-to-end optimization of inference pipelines rather than optimizing each stage individually.

Key to IPA's innovation is its joint optimization of accuracy and cost in a multi-stage pipeline setting. Although previous work has focused predominantly on cost-aware optimizations, IPA leverages accuracy-aware adaptations, combining autoscaling and model-switching techniques to achieve precise trade-offs between accuracy, cost, and latency. This approach is particularly advantageous in scenarios where heterogeneity in model variants can be exploited to enhance overall pipeline performance dynamically.

Through extensive evaluations in a Kubernetes implementation with five real-world inference pipelines (Video Monitoring, Audio Question Answering, Audio Sentiment Analysis, Summarisation Question Answering, and Natural Language Processing), we demonstrate that IPA improves normalized accuracy by up to 35% with a negligible cost increase of less than 5%. The system's effectiveness in achieving granular trade-offs makes it a valuable contribution to the optimization of complex, multi-stage ML inference pipelines. The replication package is available at https://github.com/reconfigurable-ml-pipeline/ipa.

Multi-Criteria Optimization of Real-Time DAGs on Heterogeneous Platforms under P-EDF

Tommaso Cucinotta
Alexandre Amory
Gabriele Ara
Francesco Paladino
Marco Di Natale

This paper [3] tackles the problem of optimal placement of complex real-time embedded applications on heterogeneous platforms [7]. Applications are composed of directed acyclic graphs (DAGs) of tasks, with each DAG having a minimum inter-arrival period for its activation requests, and an end-to-end deadline within which all of the computations need to terminate since each activation. The platforms of interest are heterogeneous power-aware multicore platforms with DVFS capabilities, including big.LITTLE Arm architectures, and platforms with GPU or FPGA hardware accelerators with Dynamic Partial Reconfiguration capabilities. Tasks can be deployed on CPUs using partitioned EDF-based scheduling. Additionally, some of the tasks may have an alternate implementation available for one of the accelerators on the target platform, which are assumed to serve requests in non-preemptive FIFO order. The system can be optimized by: minimizing the average power consumption, respecting precise timing constraints; maximizing the minimum relative slack among all deployed DAG applications, respecting given power consumption constraints; or even a combination of these, in a multi-objective formulation, obtaining a minimum-power and robust deployment, as demonstrated by the obtained experimental results.

SESSION: Artifact Evaluation Track

SplitTracr: A Flexible Performance Evaluation Tool for Cooperative Inference and Split Computing

Nicholas Bovee
Izhar Ali
Suraj Bitla
Gopi Krishna Patapanchala
Shen-Shyang Ho

We present SplitTracr, a performance evaluation tool used on deep neural networks (DNNs) to simplify the development overhead of split computation-based DNN models. Usage of such a tool allows the neural network to be dynamically split into component parts at any point. Our tool handles the management of configuration of hardware edge devices, export & injection of forward pass tensors during live tests, dataset handling, metrics & logging in the testing process. We examine the structure, methodology, and design choices we made to create SplitTracr, and review the results of our pre-provided base cases.

The Kieker Observability Framework Version 2

Shinhyung Yang
David Georg Reichelt
Reiner Jung
Marcel Hansson
Wilhelm Hasselbring

Observability of a software system aims at allowing its engineers and operators to keep the system robust and highly available. With this paper, we present the Kieker Observability Framework Version 2, the successor of the Kieker Monitoring Framework.

In this tool artifact paper, we do not just present the Kieker framework, but also a demonstration of its application to the TeaStore benchmark, integrated with the visual analytics tool ExplorViz. This demo is provided both as an online service and as an artifact to deploy it yourself.

A Dataset of Performance Measurements and Alerts from Mozilla (Data Artifact)

Mohamed Bilel Besbes
Diego Elias Costa
Suhaib Mujahid
Gregory Mierzwinski
Marco Castelluccio

Performance regressions in software systems can lead to significant financial losses and degraded user satisfaction, making their early detection and mitigation critical. Despite the importance of practices that capture performance regressions early, there is a lack of publicly available datasets that comprehensively capture real-world performance measurements, expert-validated alerts, and associated metadata such as bugs and testing conditions.

To address this gap, we introduce a unique dataset to support various research studies in performance engineering, anomaly detection, and machine learning. This dataset was collected from Mozilla Firefox's performance testing infrastructure and comprises 5,655 performance time series, 17,989 performance alerts, and detailed annotations of resulting bugs collected from May 2023 to May 2024. By publishing this dataset, we provide researchers with an invaluable resource for studying performance trends, developing novel change point detection methods, and advancing performance regression analysis across diverse platforms and testing environments. The dataset is available at https://doi.org/10.5281/zenodo.14642238.

SESSION: Data Challenge Track

TraceLens: Early Detection of Software Anomalies Using Critical Path Analysis

Masoumeh Nourollahi
Amir Haghshenas
Michel Dagenais

Runtime smell detection in software systems, particularly through system call analysis, has garnered significant attention in recent years. Although various machine learning techniques have been employed to enhance detection accuracy and reduce false positives, limited focus has been given to their practical application in early real-time anomaly detection. To address this gap, we propose a deep learning-based approach, called TraceLens, designed for the early detection of performance-related issues in software systems. Unlike traditional methods that rely on system call data, our approach leverages critical path analysis, enabling more efficient and targeted anomaly detection. Experimental results demonstrate that this approach achieves detection performance comparable to methods that use system calls, while significantly improving data collection efficiency. In addition, the critical path dataset highlights software dependencies, both internal and external, providing deeper insight into the dynamic behavior of software systems.

Kernel-Level Event-Based Performance Anomaly Detection in Software Systems under Varying Load Conditions

Anthonia Njoku
Heng Li
Foutse Khomh

Performance anomalies in software systems can lead to significant disruptions and reduced user satisfaction. Traditional methods of anomaly detection rely on log events that capture higher-level system activities but may lack the details to effectively pinpoint root causes. This study investigates the detection of performance anomalies in software systems using kernel-level event data. By leveraging both classical and deep learning approaches, we developed models capable of identifying anomalous patterns in system behavior. The experimental dataset, consisting of over 24 million events collected under various noise and workload conditions, provided a comprehensive basis for analysis. Our results show the robustness of ensemble techniques in predicting performance anomalies with the random forest (accuracy = 89%) and ensemble stacking (F1 score= 0.76, AUC= 0.84) models outperforming other classifiers. Feature importance analysis revealed that CPU-bound events, such as sched_switch and sched_wakeup, are key indicators of performance anomalies. Additionally, a significant relationship was identified between system workload conditions and the likelihood of anomalies, as confirmed by statistical testing. These findings highlight the potential of kernel-level data for precise anomaly detection and provide insights for optimizing system monitoring and performance management.

SESSION: Emerging Research Track

Improving Runtime Performance in Java: A Systematic Detection and Refactoring Approach for Lock Contention Code Smells (Work In progress paper)

Ankita Mukherjee
Ashadullah Shawon
Md Asif Khan
Joseph Robertson
Ramiro Liscano
Akramul Azim
Vijay Sundaresan
Yee-Kang Chang

Concurrent programming offers significant performance benefits but introduces complexities in development and testing, particularly in multi-threaded environments. As software scales and complexity increases, fault localization—the process of identifying fault locations—becomes increasingly tedious, costly, and equally critical. To address this, we conducted a comprehensive study, defining a set of performance-related code smells that capture various lock contention scenarios within Java intrinsic locks, along with recommendations and refactored code. To verify these code smells, we developed a code smell detection tool that analyzes source code, identifies critical sections, and applies predefined patterns to detect contention. This tool was validated on several open-source projects, successfully detecting over 350 lock contention instances across Apache HBase, Glide, and EventBus. Performance validation using the Java Lock Monitor (JLM) demonstrated that implementing our recommendations led to significant improvements. On average, the lock hold time decreased over 10%, the spin count was reduced over 2%, and the GET operations increased by 18%, highlighting the efficacy of the proposed framework in mitigating lock contention and enhancing runtime performance. As a work in progress, this research highlights the importance of systematically addressing performance bottlenecks in Java applications and lays the foundation for scalable and maintainable multi-threaded systems.

Modeling And Optimizing Runtime Adaptation Strategies At Design-Time Using Evolutionary Algorithms (Idea Paper)

Martina Rapp
Max Scheerer
Ralf Sieger
Ralf Reussner
Raffaela Mirandola

Self-adaptive (software) systems (SASs) dynamically adjust themselves to environmental changes during runtime (RT) to uphold Quality of Service (QoS) objectives. Designing and optimizing the adaptation strategies for SASs, particularly in relation to their impact on quality attributes, presents a significant challenge. The extensive design space of adaptation strategies typically requires automated exploration, as manual exploration is usually infeasible. While most existing approaches focus on RT optimization, which requires the implementation of the system, we examine the optimization of runtime adaptation strategies during design-time (DT) which we consider more effective in achieving QoS goals compared to purely RT-optimized strategies. Furthermore, DT analysis offers heuristically optimized strategies prior to implementation, enhancing quality properties such as performability. We aim to complement RT optimization by proposing a Model-based quality analysis (MBQA) approach at design-time that optimizes MAPE-K based adaptation strategies across all phases. In contrast, current approaches typically focus on optimizing specific phases, such as the analysis or planning phase, rather than the strategy as a whole. In this paper, we present a comprehensive DT approach for the optimization of adaptation strategies using evolutionary algorithms.

Optimizing Memory Access Patterns through Automatic Data Layout Transformation (Work in Progress Paper)

Jolly Chen
Ana Lucia Varbanescu
Axel Naumann

In many programming languages, memory access patterns exhibited by an application are dictated by the data structures defined by the programmer, which, in turn, dictate how the data are ordered in memory. Exploring access pattern optimizations is essential for performance: we demonstrate, through several benchmarks, the effects of Array of Structures (AoS) and Structure of Arrays (SoA) layouts on cache utilization, auto-vectorization, and false sharing. Despite these benefits, exploration remains a time-consuming task because it requires rewriting data structure definitions and, very often, computing kernel code to accommodate these changes.

We argue that such changes could and should be automated. In this work, we propose the design of a C++ framework for automatically redefining data structures to modify the data layout and access patterns. Leveraging experimental C++26 reflection and token injection features, we can modify the structure while preserving the original C++ syntax for accessing data. Our framework enables rapid prototyping of access pattern optimizations, potentially unlocking significant performance gains.

SESSION: Posters and Demostrations Track

LogAn: An LLM-Based Log Analytics Tool with Causal Inferencing

Pranjal Gupta
Karan Bhukar
Harshit Kumar
Seema Nagar
Prateeti Mohapatra
Debanjana Kar

This paper demonstrates ''LogAn'', a first-of-a-kind LLM-powered intelligent log analytics tool that is designed for analyzing the behaviour of applications under faulty conditions. The tool features an intuitive GUI that displays key information, such as error cues or signals from log data, along with various analytics and summaries to assist IT engineers in identifying and understanding critical issues. LogAn has been successfully deployed in production and is now actively utilized by IBM Software Support. Since May 2024, it has processed 1, 376 cases, saving 8094 minutes of support engineers' time. Demo Video Link: https://tinyurl.com/demo-logan

COQO: Cost-Optimal Query Orchestration Tool

Kuldeep Singh
Ravi Kumar Singh
Shruti Kunde
Mayank Mishra
Rekha Singhal
Manoj Nambiar

A key application of Large Language Models (LLMs) is chatbot systems, where customer queries may span heterogeneous data(structured databases, unstructured data, and web resources) and require efficient reasoning and accurate information retrieval. Often these systems struggle to deliver real-time, accurate responses for multi-source queries.

This paper introduces COQO (Cost Optimal, Query Orchestration), a GUI-based tool employing a two-pipeline approach: the context-latent pipeline captures query structure, and the content-acute pipeline refines responses using specific data sources. Leveraging graph-based retrieval-augmented generation (RAG) for vector search, COQO minimizes reliance on LLMs for reasoning, enhancing cost-efficiency. We demonstrate the efficacy of the tool in terms of its cost-effectiveness handling intricate queries of a retail banking chatbot application.

SESSION: Tutorial Track

Serverless Orchestration on the Edge-Cloud Continuum: Challenges and Solutions

Reza Farahani
Radu Prodan

Serverless computing simplifies application development by abstracting infrastructure management, allowing developers to focus on building application functionality while infrastructure providers handle tasks such as resource scaling and provisioning. However, orchestrating serverless applications across the edge-cloud continuum introduces significant challenges, including effectively managing heterogeneous resources with diverse computational capabilities and energy limitations, guaranteeing low-latency execution, dynamically allocating workloads according to real-time performance metrics, and ensuring fault tolerance and seamless scalability across distributed edge-cloud resources. This tutorial first explores foundational serverless computing concepts, including Function-as-a-Service (FaaS), Backend-as-a-Service (BaaS), and their integration into distributed edge-cloud systems. It then introduces advancements in multi-cloud orchestration, edge-cloud integration strategies, and resource allocation techniques, focusing on their applicability in real-world scenarios. It addresses the challenges of orchestrating serverless applications across edge-cloud environments, mainly using dynamic workload distribution models, multi-objective scheduling algorithms, and energy-optimized orchestration. Practical demonstrations employ Kubernetes, serverless platforms such as Google Cloud Functions, AWS Lambda, AWS Step Functions, OpenFaaS, and OpenWhisk, along with monitoring tools like Prometheus and Grafana, to deploy and execute real-world application workflows, providing participants with hands-on experience and insights into evaluating and refining energy- and performance-aware serverless orchestration strategies.

Next Generation Energy Efficiency Benchmarking: Reliable and Reproducible Efficiency Measurements in a Diverse IT-Landscape

Maximilian Meissner
Klaus-Dieter Lange
Aaron Cragin
Pushpa H J
Robert Proulx
Samuel Kounev

Energy efficiency has become a critical quality criterion of IT-systems. While hardware designers have long worked on improving efficiency, researchers in other fields, such as software design and Artificial Intelligence, are increasingly focusing on this issue as well. Such combined efforts are essential to mitigate the significant energy demands of current and future hardware and software.

In comparison to performance benchmarking, energy efficiency benchmarking involves a variety of additional factors that must be carefully considered and controlled. Without proper setup and oversight, measurements can easily lead to misleading, inconclusive, or unreliable results. Because reproducibility of experiment results, or lack thereof, is a critical concern in the scientific community, knowing the intricacies of power and efficiency measurements is a crucial pre-requisite for carrying out meaningful experiments.

This paper offers insight into aspects important for conducting energy efficiency experiments based on best practices. It explains the effects of subtle differences in experiment setup and discusses current trends and challenges in efficiency benchmarking.

SESSION: Fifth Workshop on Education and Practice of Performance Engineering (WEPPE25)

WEPPE25: 5th Workshop on Education and Practice of Performance Engineering

Alberto Avritzer
Matteo Camilli
James J. Cusick
Andrea Janes

The Workshop on Education and Practice of Performance Engineering, in its 5th edition, brings together University and Industry Performance Engineers to share education and practice experiences. Specifically, the goal of the workshop is to discuss the gap between university performance engineering programs and the skills required to implement performance engineering in industrial settings. In this edition, we have built a workshop program consisting of three full papers and four invited talks.

The first part of the workshop is made up of three papers describing industrial experiences with performance engineering. Each paper presents techniques for robust performance engineering, experience with the application of performance engineering, and new trends in performance engineering driven by current innovation in technology and processes.

From the Deep End to Coaching Others in the Performance Pool

James J. Cusick

This paper presents a set of experiences with applied computer performance engineering allowing for abstraction to generalized use. This anecdotal personal journey defines three broad stages of industry methodological adoption from the initial introduction to performance engineering, to accomplishing a broader understanding, and finally to coaching engineering staff to define and realize performant systems. Initial challenges in solving performance requirements lead to scaling up to broader deployments from within a corporate leadership position. Several real-world examples are provided along this arc of learning and practice. Readers will obtain an understanding of the hurdles encountered in adopting performance engineering within the wider development community and specific means to overcome them.

cfdSCOPE: A Fluid-Dynamics Proxy App for Teaching Performance Engineering

Peter Arzt
Sebastian Kreutzer
Tim Jammer
Christian Bischof

Teaching performance engineering in high-performance computing (HPC) requires example codes that demonstrate bottlenecks and enable hands-on optimization. However, existing HPC applications and proxy apps often lack the balance of simplicity, transparency, and optimization potential needed for effective teaching. To address this, we developed cfdSCOPE, a compact, open-source computational fluid dynamics (CFD) proxy app specifically designed for educational purposes. cfdSCOPE simulates flow in a 3D volume using sparse linear algebra, a common HPC workload, and comprises fewer than 1,100 lines of code. Its minimal dependencies and transparent design ensure students can fully control and optimize performance-critical aspects, while its naive OpenMP parallelization provides significant optimization opportunities, thus making it an ideal tool for teaching performance engineering.

Overcoming Challenges in Teaching Performance Related Tools

Catalina M Lladó

Teaching performance evaluation courses presents unique challenges for educators. Professors must navigate the complexities of integrating practical work with performance-related tools while dealing with constraints such as limited industry support, resource availability, and varying student skill levels. This article explores three primary approaches using industry tools, building on tools created by other students, and having students develop tools from scratch. Each option comes with its own benefits and drawbacks. By analysing these approaches, this article aims to provide strategies to enhance student engagement, foster learning, help professors, and open up a discussion on this challenging topic.

Cultivating Performance Awareness in a Testing Project: A Focus on Machine-Readable Travel Documents

Lu Xiao
Andre B. Bondi
Eman Abdullah AlOmar
Yu Tao

This paper presents a course project to integrate performance engineering concepts into a software testing and quality assurance curriculum. It uses the real-world context of validating and testing Machine-Readable Travel Documents (MRTDs) to integrate multiple testing techniques, including unit testing, mocking, mutation testing, and performance measurement. This integration allows students to ''connect the dots'' between different testing methodologies, enhancing their ability to apply them holistically in software testing projects. A key goal of the project is to help students understand how performance testing naturally fits into the overall testing process—just as it would in real-world practice—alongside functional testing. Students engage in hands-on exercises that require evaluating both functional correctness (e.g., conformance to MRTD standards) and performance attributes, such as execution time and the cost of encoding and decoding large sets of input records. The preliminary results suggest that this approach not only deepens students' understanding of performance engineering but also encourages them to view testing as a multifaceted process. We share this project with other educators as a framework for incorporating performance testing into software testing curricula, ensuring that students can practice critical testing skills in a real-world context.

Guerrilla Techniques for Robust Performance Engineering

Neil Gunther

Making performance engineering more robust requires complementing data collection with performance models. In this invited talk, Guerrilla techniques are presented that help simplify the performance modeling procedure.

Performance Engineering: New and Conflicting Trends

Alexander Podelko

Performance engineering is adjusting to major industry trends - such as cloud computing, agile development, and DevOps. As systems scale and sophistication skyrocket, performance definitely gets more attention. However, such adjusting happens in different, sometimes conflicting, ways and the future of performance as a separate discipline is not clear. We may observe integration with development (''Shift Left'') and operations (''Shift Right''), as well as appearance of new disciplines that includes parts of performance engineering (such as SRE and FinOps). While some trends are clear (such as continuous performance testing or observability) - others are still being formed. It is not trivial to define the performance engineering body of knowledge at the moment.

How can we Teach Workload Modeling in CS Systems Classes? [Invited Talk]

Cristina L. Abad

In this invited talk at WEPPE 2025, I discuss how systems courses are a good place to teach workload modeling, including suggestions on how to do so that are rooted in personal experience, existing literature, and examples from course programs found online. The ideas presented in this talk were first presented at the TeaPACS 2024 workshop.

SESSION: Third International Workshop on AI Performance and Optimization in the LLM World (AIPerfLLM 2025)

AIPerfLLM: 3rd International Workshop on Performance Optimization in the LLM world

Kingsum Chow
Emilio Incerto
Marin Litoiu
Zhihao Chang
Anil Rajput
Khun Ban
Daniele Masti
Zhiheng Lyu

Artificial Intelligence (AI) has been widely adopted in various domains (e.g., computer vision, natural language processing, and reliability analysis). However, its use for performance modeling and evaluation remains limited, and its benefits to the performance engineering field are still unclear. Researchers and practitioners have recently started focusing on methods such as explainable or white-box AI-based solutions in performance engineering, but the tools, methodologies, and datasets that enable wider adoption are still lacking. Meanwhile, the rapid rise of large language models (LLMs) such as ChatGPT poses new challenges in performance optimization and cost containment. LLM pre-training is expensive, and the necessary infrastructure also incurs significant carbon footprint. This workshop aims to bridge research and practice by bringing together academia and industry to share experiences and insights in performance engineering for LLM-based services and AI applications. We target techniques and methodologies to optimize performance while reducing energy consumption and cost.

Retrieval Augmented Generation Fine-Tuned LLM Model for Code Recommendations to Mitigate Lock Contention

Ashadullah Shawon
Ramiro Liscano
Akramul Azim
Vijay Sundaresan
Yee-Kang Chang

Lock contention performance faults can lead to degradation in the performance of software applications. Unlike software bugs, per- formance faults do not lead to failures and application crashes but surface as a degradation in the response and execution of an ap- plication and can surface fairly late in the deployment life of an application. Tools exist for the identification and detection of lock performance faults but there is a lack of effective code refactor- ing recommendations for a developer to mitigate the performance degradation caused by lock-contention. Recent advances in Large Language Models (LLMs) have demonstrated positive results in code refactoring for fixing software bugs and mitigating run time faults. However, traditional LLM-based approaches often suffer from hal- lucination errors, where the generated code may not accurately reflect the context of the project or existing codebase. This thesis presents a novel approach that combines Retrieval Augmented Gen- eration (RAG) with a fine-tuned LLM model for refactored code recommendation aimed at reducing lock-contention performance faults in Java applications. The RAG fine-tuned model combines the strengths of contextual understanding from LLMs with the preci- sion of retrieval-based systems, thereby ensuring that the generated recommendations are relevant, accurate, and hallucination-free. Se- mantic and syntactic metrics of the recommendations generated by the combined RAG and LLM model show an accuracy of approxi- mately 90% compared to an accuracy of approximately 25% when a baseline LLM model is used.

ConsciousLLM: The Future of Intelligent Deployment of Self-aware Models

Ravi Kumar Singh
Shruti Kunde
Mayank Mishra
Rekha Singhal
Manoj Nambiar

As the complexity of large language models (LLMs) increases, so does their parameter count and size. While LLMs with a substantial number of parameters yield highly accurate results, their deployment presents significant challenges even for enterprises. Existing methods for distributing transformer blocks across multiple nodes for inference are well-known; however, the responsibility for distribution typically rests with the user, often resulting in sub-optimal resource utilization.

In this effort, we introduce a novel framework called ConsciousLLM, designed to self-consciously re-deploy LLMs across multiple enterprise-wide machines, by leveraging residual resources on the machines. The framework incorporates a ''Self-Awareness Agent'' that continuously monitors resource utilization and recalculates the optimal placement of the LLM blocks over time, thus ensuring efficient utilization of memory and compute. By dynamically redistributing transformer blocks based on real-time resource availability, the framework lowers operational costs, and improves overall system performance.

We validate the efficacy of ConsciousLLM by conducting experiments with well-known open source models such as Mixtral 8x7B and LLaMA-3 (70B). Our results illustrate that capability of these models to autonomously enhance their deployment strategies, leading to optimized performance on inference tasks.

RAGuru: A Tool to Create and Automatically Deploy Workload Optimized RAG

Archisman Bhowmick
Rishikesh S
Ashay Taksande
Kuldeep Singh
Mayank Mishra
Rekha Singhal

Retrieval Augmented Generation (RAG) architectures have emerged as a powerful solution to enhance the accuracy and relevance of large language models (LLMs) by integrating retrieval mechanisms with generative capabilities. However, the design of an effective RAG pipeline is inherently complex, involving multiple components such as chunking strategies, embedding models, retrieval systems, and choice of LLMs. Each of these components offers numerous configuration options and the selection of the optimal combination is often a daunting task. The challenge is compounded by the need to consider trade-offs between performance, accuracy, and cost, which are not always straightforward and can vary significantly depending on the workload.

In this context, we present RAGuru, an innovative tool designed to automate the design and creation of cost and latency-optimized RAG architectures. RAGuru addresses the complexities of RAG design by intelligently selecting and configuring the optimal components based on the user's specific workload requirements and also ensuring higher quality responses from RAG. By using an inhouse dataset of cost and performance metrics, RAGuru ensures that the resulting architecture is cost and latency wise optimal for an use-case, while achieving high accuracy. The architecture design space choices can be fed to terraform[2] as a configuration file for automatic deployment of the cost-performance optimal RAG. We have tested RAGuru in real-world scenarios. In a particular case, the RAG generated by RAGuru, demonstrated comparable performance at approximately half the cost of a conventional RAG system, with only minimal accuracy¹ loss.

RAGs to Riches: Cost-efficient Complex Query Orchestration

Ravi Kumar Singh
Kuldeep Singh
Shruti Kunde
Mayank Mishra
Rekha Singhal
Manoj Nambiar

Large Language Models (LLMs) have become integral to modern business operations, especially for tasks involving reasoning over large datasets. One prominent application of LLMs is in chatbot systems, where customers provide natural language queries, often complex in nature, requiring decomposition to retrieve relevant information from various data sources. These queries may span structured databases, unstructured data, or public information from the internet, making efficient data retrieval and reasoning vital for real-time, accurate responses. In this paper, we propose two cost-efficient ''Query Orchestration'' approaches (Context Latent and Context Acute) to address these challenges. By leveraging graph-based retrieval-augmented generation (RAG) techniques for vector search, we optimize data retrieval while minimizing reliance on LLMs for reasoning to reduce costs. Our approach is validated through experiments on a banking use case, where we demonstrate its effectiveness in providing high-quality

SESSION: Thirteenth International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2025)

LTB25: 13th International Workshop on Load Testing and Benchmarking of Software Systems

Stephen Fan
Lizhi Liao
Zhenhao Li

It is our great pleasure to welcome you to the 13th edition of the International Workshop on Load Testing and Benchmarking of Software Systems - LTB 2025, (https://ltb2025.github.io). This workshop brings together software testing and software performance researchers, practitioners, and tool developers to discuss the challenges and opportunities of conducting research on load testing and benchmarking software systems, including theory, applications, and experiences. LTB 2025 includes 1 keynote talk and 4 research papers. The topics cover AIOps, performance and load testing, workload tracing, benchmarking, and performance verification.

Towards eBPF Overhead Quantification: An Exemplary Comparison of eBPF and SystemTap

Simon Volpert
Sascha Winkelhofer
Jörg Domaschka
Stefan Wesner

Quantifying the performance overhead of instrumentation tools is crucial to maximize their effectiveness in performance analysis and monitoring. This paper presents a methodology for precisely measuring the overhead introduced by eBPF and SystemTap, two prominent tools for dynamic instrumentation. We employ a fixed-time benchmark approach, that targets common system call functions (open and mmap) with various probe types, to isolate and quantify the overhead under controlled conditions. Our results demonstrate that both eBPF and SystemTap impose measurable overhead, with user-space probes being the most expensive. Furthermore, we find that data transfer significantly impacts SystemTap's performance, while eBPF remains relatively unaffected. Our findings, validated by comparison with ftrace and internal eBPF statistics, provide valuable insights for informed decision-making when applying instrumentation. The presented methodology offers a robust, repeatable and reproducible approach for quantifying instrumentation overhead that can be applied to any target system.

EMU-LLM: Emulators for Performance Evaluation of LLM-based Applications

Deeksha Deeksha
Ashwin Krishnan
Manoj Nambiar

With the advent of Large Language Models (LLMs) in modern web applications, rigorous performance testing has become essential to assess application's speed, stability, and resource usage under diverse workload conditions. While third-party APIs enable rapid integration and scaling in LLM-based applications, scalability testing with these APIs often incurs high costs and adds complexity throughout the development cycle. To address these challenges, we propose EMU-LLM, a framework that automatically selects and integrates emulators to replace third-party APIs while preserving LLM behavior. This emulator-centric approach offers a cost-effective solution for performance evaluation and system testing of LLM-based web applications. Our framework enhances the robustness and efficiency of applications and highlights promising directions for future research. This paper aims to guide researchers, developers, and testers on the significance of emulators in optimizing LLM performance and fostering growth.

DMML: A Machine-learning Performance Model for Data Migration

Hasti Ghaneshirazi
Fares Hamouda
Marios Fokaefs
Wejdene Haouari
Dariusz Jania

Data migration at scale can be a daunting task. It may require significant resources and time, which must be taken from value-adding activities of an enterprise. Besides errors may occur, which can jeopardize the integrity of the data and waste resources. Accurately estimating data migration time and resource performance is critical for optimizing time, cost, and risk in large-scale data transfers. In this paper, we propose the use of machine learning to create performance models for data migration. We utilize DMBench, a benchmarking and load testing tool specifically tailored for data migrations, to generate data, simulating various data migration scenarios with different data sizes, vCPUs, RAM size, and data compression types. We experimented with multiple ML algorithms and showed the effect of hyperparameter tuning in the model's accuracy. Our results show that the XGBoost is the most accurate and consistent across the different scenarios. We demonstrate the model building process and its evaluation on an industrial case study.

Scientific Computing Energy Footprinting Across CPU Generations

Jacob D. Hauenstein
Timothy S. Newman

A multi-dimensional study of the energy footprint for scientific computing across CPU generations is presented. The study utilizes a version of the popular Linpack benchmark, which is analogous to the HPL (High Performance Linpack) used today for TOP500 Supercomputer ranking, as a proxy for the computational performance of scientific computing. Here, this Linpack benchmark is applied to three generations of (the same level of) Intel x86 CPUs. Energy consumption of the benchmark's computation is considered via exploitation of onboard monitors. By considering metrics for power usage, time, and computational productivity in conjunction with computational strategies and runtime conditions, this study delivers guidance for achievement of computation goals related to computation time and energy consumption. Tradeoffs between these two, including marginal energy costs and time benefits for computational performance improvement, are also explored. Results for a wide array of experiments are reported.

SESSION: Eighth Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2025)

HotCloudPerf'25 Workshop Chairs' Welcome

Klervie Toczé
André Bauer
Dragi Kimovski
Daniele Bonetta

The HotCloudPerf workshop proposes a meeting venue for academics and practitioners---from experts to trainees---in the field of Cloud computing performance. The new understanding of Cloud computing covers the full computing continuum from data centers to edge to IoT sensors and devices. The workshop aims to engage this community and lead to the development of new methodological aspects for gaining a deeper understanding not only of cloud performance, but also of cloud operation and behavior, through diverse quantitative evaluation tools, including benchmarks, metrics, and workload generators. The workshop focuses on novel cloud properties such as elasticity, performance isolation, dependability, and other non-functional system properties, in addition to classical performance-related metrics such as response time, throughput, scalability, and efficiency. HotCloudPerf 2025, co-located with the 16th ACM/SPEC International Conference on Performance Engineering (ICPE 2025), is held on May 6th, 2025.

GenAI for bottleneck Detection in Cloud Architecture

Sharod Roy Choudhury
Dheeraj Chahal
Chetan Phalak
Manju Ramesh
Rekha Singhal

The rapid adoption of cloud computing, accelerated by the global pandemic, has increased the need for efficient cloud architecture that balances cost and performance. As organizations migrate applications to the cloud, cloud architects face challenges in managing an overwhelming number of services—often exceeding a thousand. This paper presents a novel tool designed for editable cloud architecture management that automates the optimization process.

Our solution enables cloud architects to visually design and edit cloud architectures while utilizing a backend represented as a directed acyclic graph in an adjacency matrix. This structure allows for dynamic adjustments based on real-time workload predictions, moving from reactive to proactive resource management. Leveraging advanced Generative AI models, specifically Azure's GPT-4o [11], our tool identifies alternative services that can effectively replace or supplement existing ones based on functionality. By extracting relevant data from AWS documentation, we provide actionable insights on service performance and cost.

We validate our approach through use cases, demonstrating the tool's effectiveness in detecting potential bottlenecks and recommending service adjustments to eliminate Service Level Agreement (SLA) violations. Our findings indicate that the tool enhances performance and reduces operational costs, empowering cloud architects to make informed, data-driven decisions. This innovative approach significantly streamlines cloud resource management, ensuring organizations can effectively navigate the complexities of their cloud environments and achieve sustained operational excellence.

Investigating Performance Overhead of Distributed Tracing in Microservices and Serverless Systems

Anders Nõu
Sacheendra Talluri
Alexandru Iosup
Daniele Bonetta

Distributed tracing is crucial to achieve observability in modern distributed systems. However, its adoption introduces performance trade-offs, impacting throughput and latency. This paper investigates the overhead of distributed tracing in microservices and serverless applications. We provide an analysis of the popular OpenTelemetry and Elastic APM distributed tracing frameworks, evaluating their performance impact on microservices and serverless workloads. We highlight and categorize the primary sources of overhead and measure their contribution to performance degradation. The results reveal significant throughput reductions (19-80%) and latency increases (up to 175%) depending on application configurations and execution environments. Our findings reveal that serializing trace data for export is the largest cause of overhead.

IrisBench: An Open-Source Benchmark Suite for Video Processing Systems in Cloud

Zhiqi Li
Ruiqi Yu
Jianshu Liu

Recent advances in generative text-to-video AI models (e.g., VideoPoet and Sora) have spurred a surge in video production, leading to an increased demand for video processing pipelines among various video service providers such as YouTube and TikTok. With the improvement of cloud computing, video processing systems are frequently updated and present both opportunities and challenges while optimizing the quality of service (QoS) and cloud resource utilization. However, research on evaluating the performance of video processing systems is limited. Besides the availability of video datasets and realistic workloads, the lack of an open-source benchmark system reflecting the characteristics of industrial video processing systems is a significant gap. To fill this gap, we develop IrisBench, an open-source benchmark suite for cloud video processing systems to facilitate research on performance analysis. Our benchmark suite includes three video services: video transcoding, video partitioning, and video object detection services. Our future work relies on using IrisBench to study the architectural implications of various cloud video processing systems in the cloud.

Remote Memory Prefetching: Is Coarse-grained Fine?

James McMahon
Vinita Pawar
Ryan Stutsman

CXL raises new questions about tiering, pooling, and remote memory access. In most disaggregated memory approaches, compute nodes access remote memory pools at page (4~KB) granularity. This is the case for two reasons: to help amortize high remote access costs and because the host CPUs' address translation hardware is set up for 4~KB pages. While fetching and caching whole remote pages helps with applications that have high spatial locality, for some applications it can introduce contention for cache capacity since potentially cold or unrelated adjacent data is also cached. Additionally, this can increase network bandwidth utilization.

Operating at a smaller cache block granularity (e.g.\ 64~B) could reduce remote access amplification and make more efficient use of local caches and the network. Further, operating at a cacheline granularity would allow future work to build upon the many decades of work done in CPU hardware data prefetching, since prefetchers at that level primarily operate at small cacheline granularities. Because of the higher latencies associated with remote memory access, more time and hardware resources can be used to generate prefetch predictions. This could allow previous models that were too complex for real-world CPU data prefetching to find practical application and for practical models to be scaled up.

In this work, we explore whether cacheline-granular prefetching could be beneficial for memory pooling and far-memory systems. Specifically, we investigate whether it is possible to achieve performance comparable to today's page-granular remote accesses by using small, cacheline-granular remote accesses with aggressive prefetching. This paper shows that while cacheline-granular prefetching seems like a natural next step for remote CXL devices, beating page granular accesses appears to be difficult for the workloads we explored.

ObjecTier: Non-Invasively Boosting Memory Tiering Performance

Vinita Pawar
Ankit Bhardwaj
Ryan Stutsman

Recent research has developed page-based memory-tiering systems that place hot pages in fast tiers and cold pages in slower, more capacious tiers. However, applications place many objects together within pages, and most pages contain some objects that are hot and some that are cold. Our simulations of a key-value workload confirm this; even the hottest pages in the fast tier can contain 50% cold data.

To improve fast tier utilization, we describe the design of a new framework, ObjecTier, that uses application knowledge to efficiently consolidate hot data and cold data. This allows ObjecTier-enabled applications to boost fast tier hit rates and improve performance regardless of which underlying memory tiering system they use underneath, even if that system is page based.

With simulations, we show that ObjecTier may improve average memory access time (AMAT) by 2× without adding any memory space overhead for our simulated key-value store workload. We conclude by outlining the next steps to make the ObjecTier framework a reality for easy adaptation of applications like key-value stores and other indexed databases.

FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents

Sándor Battaglini-Fischer
Nishanthi Srinivasan
Bálint László Szarvas
Xiaoyu Chu
Alexandru Iosup

Large Language Model (LLM) services such as ChatGPT, DALL·E, and Cursor have quickly become essential for society, businesses, and individuals, empowering applications such as chatbots, image generation, and code assistance. The complexity of LLM systems makes them prone to failures and affects their reliability and availability, yet their failure patterns are not fully understood, making it an emerging problem. However, there are limited datasets and studies in this area, particularly lacking an open-access tool for analyzing LLM service failures based on incident reports. Addressing these problems, in this work we propose FAILS, the first open-sourced framework for incident reports collection and analysis on different LLM services and providers. FAILS provides comprehensive data collection, analysis, and visualization capabilities, including: (1) It can automatically collect, clean, and update incident data through its data scraper and processing components;(2) It provides 17 types of failure analysis, allowing users to explore temporal trends of incidents, analyze service reliability metrics, such as Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF);(3) It leverages advanced LLM tools to assist in data analysis and interpretation, enabling users to gain observations and insights efficiently. All functions are integrated in the backend, allowing users to easily access them through a web-based frontend interface. FAILS supports researchers, engineers, and general users to understand failure patterns and further mitigate operational incidents and outages in LLM services. The framework is publicly available on https://github.com/atlarge-research/FAILS.

Introducing Resource Awareness Levels in Edge Computing Resource Management

Klervie Toczé

Resources are key components in making edge computing a successful paradigm. Indeed, this paradigm requires computational, networking, and storage resources located in devices that are geographically spread. At the same time, the footprint of ICT as a whole is growing and is straining for the planet. Researchers and practitioners involved in edge computing therefore need to be resource-aware and carefully think about how to perform resource management in edge computing.

In this work, three resource awareness levels are introduced as a framework. The proposed framework enables one to reflect on edge computing resource management from a resource perspective in a novel way. These levels help to think about where to focus future research efforts based on where current ones are. The proposed levels are derived from the levels of learning of Bateson. Concrete examples of works performed at the different levels are provided, and using these levels in practice is discussed.

Microservice Applications and Their Workloads on GitHub

Yannik Lubas
Martin Straesser
Ivo Rohwer
Samuel Kounev
André Bauer

Many cloud applications employ a microservice architecture to reap its various benefits, such as independent scalability and development. However, microservices' distributed nature comes with its own challenges, which the literature tries to combat with various approaches in domains, such as autoscaling or service placement. Researchers, especially performance engineers, need representative microservice applications as a basis to develop and evaluate their approaches. While the literature proposes several reference and benchmark applications, the set of commonly used applications is rather limited. Further, the representativeness of reference applications for industry practices has been challenged, raising questions about the applicability of existing approaches. To investigate possible alternatives to and inspirations for microservice testbeds, we conduct an exploratory search of microservice applications and their workloads by mining GitHub repositories. Our work provides two datasets containing 553 applications and 8 workload dataset repositories, respectively. In addition, we provide a first analysis of the collected data that future work can build upon.

SESSION: Sixth Workshop on Benchmarking in the Data Center: Expanding to the Cloud (BID 2025)

Benchmarking in the Data Center: Expanding to the Cloud

Kaushik Velusamy
Awais Khan

Welcome to the 2025 6th International Workshop on Benchmarking in the Data Centre: Expanding to the Cloud (BID 2025), hosted at York University Canada as a workshop track of the International Conference on Performance Engineering (ICPE'25). High Performance Computing (HPC) is no longer confined to universities and national research laboratories. It is increasingly used in industry and in the cloud. Education of users also needs to take this into account. Users need to be able to evaluate what benefits HPC can bring to their companies, what type of computational resources (e.g. multi-, many-core CPUs, GPUs, hybrid systems) would be best for their workloads and how they can evaluate what they should pay for these resources. Another issue that arises in shared computing environments is privacy: in commercial HPC environments, data produced, and software used typically has commercial value, and so needs to be protected. Recent general adoption of AI and machine learning has motivated migration of HPC workloads to cloud data centers, and there is a growing interest by the community on performance evaluation in this area, especially for end-to-end workflows. In addition to traditional performance benchmarking and high-performance system evaluation (including absolute performance, energy efficiency), as well as configuration optimizations, this workshop will discuss issues that are of particular importance in commercial HPC. Benchmarking has typically involved running specific workloads that are reflective of typical HPC workloads, yet with growing diversity of workloads, theoretical performance modeling is also of interest to allow for performance prediction given a minimal set of measurements.

AutoBench: A Holistic Platform for Automated and Reproducible Benchmarking in HPC Testbeds

Arjun Parab
Amir Raoofy
Leon Spörl
Stefan Dimitrov
Matthew Tovey
Josef Weidendorfer

Benchmarking is indispensable for evaluating HPC systems and architectures, providing critical insights into their performance, efficiency, and operational characteristics. However, the increasing heterogeneity and complexity of modern HPC architectures present significant challenges for benchmarking to achieve consistent and comprehensive insights. Likewise, commercial HPC environments encounter similar challenges due to their dynamic and diverse nature. Therefore, it is crucial to have automatic benchmarking of platforms, which consider holistic configuration options across various layers including the operating system layer, the software stack layer, among others.

This paper presents AutoBench, an automated benchmarking platform designed to target benchmarking on testbed systems at HPC and Cloud data centers to address the above challenges. With its multi-layered, customizable configuration options, AutoBench assists benchmarking across diverse systems. In addition, AutoBench enables automation, exploration of optimal configurations in multiple layers, and reproducibility.

We demonstrate how we use this benchmarking tool in the BEAST system at Leibniz Supercomputing Centre (LRZ) to provide comparisons between various architectures and their benefits. We also demonstrate that AutoBench can reproduce benchmarks with an acceptable variance of ~5%.

Performance Tools for the NEC SX-Aurora Tsubasa

Christian von Elm
Robert Schöne

83% of the TOP500s performance share is contributed by systems that utilize accelerators. While the overwhelming majority of accelerators in TOP500 systems are Graphics Processing Units (GPUs), other accelerators, that excel in specialized applications, are also present. The NEC SX-Aurora Tsubasa is the most common non-GPU accelerator for November 2024 TOP500 systems and has proven to be efficient for certain codes like the weather and climate model ICON. However, applications have to be carefully optimized to achieve this efficiency. Performance analysis tools, which support developers in understanding the runtime behavior of their programs can help in that regard. With profiling tools programmers can see which functions contribute to the overall runtime, but also which of them are causing high amounts of cache misses. However, the basic principle of profiling -- the summarization of metrics over all invocations of a function -- can hide differences between subsequent invocations of the same function. Due to its implementation a function could perhaps only exhibit high-amounts of cache misses, increasing its runtime, every fourth call. Hence, tools that allow developers to trace their applications by recording non-aggregated events in the time-domain can further provide insight into the applications performance properties. However, as of today only profiling tools exist for the NEC SX-Aurora Tsubasa. In this paper, we use its performance measurement capabilities to extend the Score-P measurement infrastructure and the lo2s system monitoring tool to enable developers to record the exact execution of programs using tracing. We further demonstrate how developers can find performance issues that cannot be detected with profiling-based tools in the ICON weather and climate model. We also describe the overhead and perturbation that our tools introduce.