Cloud spending is growing! Gartner predicts a 20% surge to $678.8 billion in 2024, making it a top expense after personnel for many organisations. In fact, 78% of US businesses and 54% in EMEA already leverage the cloud for diverse needs, from infrastructure and storage to development. However, a crucial question lingers: are we maximising our cloud investments? This year, boards want an answer to this question as investments continue to increase due to new technologies such as large language models (LLMs). Thankfully, by delving deeper into cloud performance characteristics, we can unlock valuable insights. This keynote will explore how understanding performance empowers organisations to extract the full potential of their cloud, transforming cost and performance data into strategic advantage.
Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is treated as a black-box software component. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and adopts benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as ready-to-use open-source software utilizing existing Kubernetes tooling and providing implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
Several methods of the Java Class Library (JCL) rely on vectorized intrinsics. While these intrinsics undoubtedly lead to better performance, implementing them is extremely challenging, tedious, and error-prone, and it significantly increases the effort of understanding and maintaining the code. Moreover, their implementation is platform-dependent. An unexplored, easier-to-implement alternative is to replace vectorized intrinsics with portable Java code using the Java Vector API. However, this is attractive only if the Java code achieves similar steady-state performance as the intrinsics. This paper shows that this is the case. We focus on the hashCode and equals computations for byte arrays. We replace the platform-dependent vectorized intrinsics with pure-Java code employing the Java Vector API, resulting in similar steady-state performance. We show that our Java implementations are easy to fine-tune by exploiting characteristics of the input (i.e., the array length), while such tuning would be much more difficult and cumbersome in a vectorized intrinsic. Additionally, we propose a new vectorized hashCode computation for long arrays, for which a corresponding intrinsic is currently missing. We evaluate the tuned implementations on four popular benchmark suites, showing that their performance is in line with that of the original OpenJDK 21 with intrinsics. Finally, we describe a general approach to integrate code using the Java Vector API into the core classes of the JCL, which is challenging because premature use of the Java Vector API would crash the JVM during its fragile initialization phase. Our approach can be adopted by developers to modify JCL classes without any changes to the native codebase.
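The abstract above does not include code; the following is a minimal, illustrative sketch (not the authors' implementation) of how the polynomial byte-array hash used by java.util.Arrays.hashCode can be expressed with the Java Vector API. It assumes an 8-lane int species and processes eight elements per step using precomputed powers of 31; the class and constant names are hypothetical.

```java
// Requires: --add-modules jdk.incubator.vector (incubator API since JDK 16)
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorizedByteArrayHash {
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes
    private static final IntVector COEFFS;        // [31^7, 31^6, ..., 31, 1]
    private static final int BLOCK_MULTIPLIER;    // 31^8

    static {
        int[] powers = new int[SPECIES.length()];
        int p = 1;
        for (int i = SPECIES.length() - 1; i >= 0; i--) {
            powers[i] = p;
            p *= 31;
        }
        COEFFS = IntVector.fromArray(SPECIES, powers, 0);
        BLOCK_MULTIPLIER = p;
    }

    /** Produces the same result as java.util.Arrays.hashCode(byte[]), 8 elements per step. */
    public static int hashCode(byte[] a) {
        int h = 1;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        int[] widened = new int[SPECIES.length()];
        for (; i < upper; i += SPECIES.length()) {
            for (int j = 0; j < widened.length; j++) {
                widened[j] = a[i + j];                 // sign-extend bytes to ints
            }
            IntVector v = IntVector.fromArray(SPECIES, widened, 0);
            // h' = 31^8 * h + 31^7*e0 + ... + 31^0*e7
            h = h * BLOCK_MULTIPLIER + v.mul(COEFFS).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) {                    // scalar tail
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

Because int addition and multiplication wrap consistently, the blocked form produces exactly the same hash value as the scalar loop.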
Maximizing cart value by increasing the number of items in electronic carts is one of the key strategies adopted by e-commerce platforms for optimal conversion of positive user intent during an online shopping session. Recommender systems play a key role in suggesting personalized candidate items that the user can add to the cart. However, it is important to serve a diverse set of personalized recommendations that 'complement' the user's cart content, to practically increase the item count in the cart and also contribute towards product discovery. Borrowed from quantum physics, Determinantal Point Processes (DPPs) are widely used in recommender systems to diversify personalized product recommendations for improved user engagement. However, vertically scaling DPP for recommendation sets, personalized with a vector similarity metric such as cosine similarity, to serve large-scale real-time concurrent user requests is non-trivial. We propose a vectorized reformulation of cosine similarity and a conditional DPP implementation to best utilize the greatly improved vector computation capabilities (SIMD) of modern processors. Experimental evidence on real-world traffic shows that the proposed method can handle up to 15x more concurrent traffic while improving latency. The proposed method also uses portable SIMD constructs from Python libraries, which can be easily adopted on most SIMD-capable CPUs with minimal code changes.
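The proposed method itself relies on portable SIMD constructs from Python libraries; purely as an illustration of the underlying idea, the hedged sketch below computes a lane-parallel cosine similarity between two embedding vectors with the same Java Vector API as the previous sketch. It is not the authors' implementation, and the class and method names are hypothetical.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class SimdCosineSimilarity {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    /** Cosine similarity between two equal-length embeddings, accumulated across SIMD lanes. */
    public static float cosine(float[] a, float[] b) {
        FloatVector dot = FloatVector.zero(SPECIES);
        FloatVector na = FloatVector.zero(SPECIES);
        FloatVector nb = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            dot = va.fma(vb, dot);   // dot += a[i] * b[i]
            na = va.fma(va, na);     // na  += a[i] * a[i]
            nb = vb.fma(vb, nb);     // nb  += b[i] * b[i]
        }
        float d = dot.reduceLanes(VectorOperators.ADD);
        float sa = na.reduceLanes(VectorOperators.ADD);
        float sb = nb.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {  // scalar tail
            d += a[i] * b[i];
            sa += a[i] * a[i];
            sb += b[i] * b[i];
        }
        return d / (float) Math.sqrt((double) sa * sb);
    }
}
```

In a DPP-based re-ranker, a similarity kernel like this would typically be evaluated for many candidate pairs per request, which is why amortizing the work across SIMD lanes pays off under concurrent traffic.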
The efficient management of software logs is crucial in software performance evaluation, enabling detailed examination of runtime information for postmortem analysis. Recognizing the importance of logs and the challenges developers face in making informed log-placement decisions, there is a clear need for a robust log-placement framework that supports developers. Existing frameworks, however, are limited by their inability to adapt to customized logging objectives, a concern highlighted by our industrial partner, Ciena, who required a system for their specific logging goals in resource-limited environments like routers. Moreover, these frameworks often show poor cross-project consistency. This study introduces a novel performance logging objective designed to uncover potential performance bugs, categorized into three classes (Loops, Synchronization, and API Misuses), and defines 12 source code features for their detection. We present an Adaptive Logging System (ALS), based on reinforcement learning, which adjusts to specified logging objectives, particularly for identifying performance bugs. This framework is not restricted to specific projects and demonstrates stable cross-project performance. We trained and evaluated ALS on Python source code from 17 diverse open-source projects within the Apache and Django ecosystems. Our findings suggest that ALS has the potential to significantly enhance current logging practices by providing a more targeted, efficient, and context-aware logging approach, which is particularly beneficial for our industry partner, who requires a flexible system that adapts to varied performance objectives and logging needs in their unique operational environments.
Software applications can produce a wide range of runtime software metrics (e.g., number of crashes, response times), which can be closely monitored to ensure operational efficiency and prevent significant software failures. These metrics are typically recorded as time series data. However, runtime software monitoring has become a high-effort task due to the growing complexity of today's software systems. In this context, time series forecasting (TSF) offers unique opportunities to enhance software monitoring and facilitate proactive issue resolution. While TSF methods have been widely studied in areas like economics and weather forecasting, our understanding of their effectiveness for software runtime metrics remains somewhat limited. In this paper, we investigate the effectiveness of four TSF methods on 25 real-world runtime software metrics recorded over a period of one and a half years. These methods comprise three recurrent neural network (RNN) models and one traditional time series analysis technique (i.e., SARIMA). The metrics are gathered from a large-scale IT infrastructure involving tens of thousands of digital devices. Our results indicate that, in general, RNN models are very effective at predicting runtime software metrics, although in some scenarios and for certain specific metrics (e.g., waiting times) SARIMA outperforms the RNN models. Additionally, our findings suggest that the advantages of using RNN models vanish when the prediction horizon becomes too wide, in our case when it exceeds one week.
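For readers unfamiliar with the traditional baseline mentioned above, the seasonal ARIMA model can be written in standard textbook notation (general background, not a formula taken from the paper):

```latex
% SARIMA(p,d,q)(P,D,Q)_s with backshift operator B (B y_t = y_{t-1}) and seasonal period s
\phi_p(B)\,\Phi_P(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D}\, y_t \;=\; \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t
```

Here, phi_p and Phi_P are the non-seasonal and seasonal autoregressive polynomials, theta_q and Theta_Q the corresponding moving-average polynomials, d and D the differencing orders, and epsilon_t white noise; fitting such a model and extrapolating it forward is, in general, how a SARIMA forecaster produces its predictions.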
Industry and academia have strong incentives to adopt virtualization technologies. Such technologies can reduce the total cost of ownership or facilitate business models like cloud computing. These options have recently grown significantly with the rise of Kubernetes and the OCI runtime specification. Both enabled virtualization technology vendors to easily integrate their solutions into existing infrastructures, leading to increased adoption. Making a detailed technology selection based on objective characteristics is a complex task. This specifically includes the instrumentation of performance characteristics, which are an important aspect of a fair comparison. Moreover, a subsequent quantification of the isolation capability based on performance metrics is not readily available.
In this paper, we instrument and determine the isolation capability of OCI runtimes by measuring virtualized system resources. We build on two previous contributions: a proven isolation measurement workflow engine and meaningful isolation metrics. The existing workflow engine is extended to integrate OCI runtime instrumentation as well as the novel isolation metrics.
We identify a quantifiable distinction between the isolation capabilities of these technologies. Researchers and industry alike can use the results to make adoption decisions for virtualization technologies based on their isolation characteristics. Furthermore, our extended measurement workflow engine can be leveraged to conduct further experiments with new technologies, metrics, and scenarios.
Power consumption of the main memory in modern heterogeneous high-performance computing (HPC) systems constitutes a significant part of the total power consumption of a node. This motivates energy-efficient solutions targeting the memory domain as well. Practitioners need reliable energy measurement techniques for analyzing the energy and power consumption of applications and performance optimizations. Running Average Power Limit (RAPL) is a common choice, as it provides uncomplicated access to energy measurements. While RAPL's accuracy has been studied and validated on homogeneous memory platforms, no work we are aware of has investigated its accuracy on heterogeneous memory platforms, specifically with high-capacity memory (HCM). This paper describes in detail the process of measuring memory power consumption externally using riser cards. We validate RAPL's accuracy by comparing its readings against these external reference measurements on an Intel Ice Lake-SP system equipped with DDR4 DRAM and Intel Optane Persistent Memory Modules (PMM). In addition, we verify the accuracy of our instrumentation setup by comparing the results from an older Broadwell system with results in the literature. We show that the RAPL values on a heterogeneous memory system exhibit a higher offset from the reference measurements. The difference is more pronounced at lower memory load for all memory types. Also, we find that RAPL readings are inconsistent between multiple sockets and over time. Based on the evaluated scenarios, we conclude that RAPL overestimates the actual power consumption on heterogeneous memory systems, and we discuss the possible causes of this effect.
The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration within a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but also effect significant performance gains, up to a factor of ten in extreme cases, on real hardware.
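As a concrete reference point for the canonical layout that the paper generalizes, the sketch below (illustrative only; names are hypothetical) computes a two-dimensional Morton (Z-order) index by interleaving the bits of the row and column indices. The generalized layouts described in the abstract instead allow arbitrary assignments of index bits to offset bits, which is what the evolutionary search explores.

```java
/** Canonical 2D Morton (Z-order) index: interleaves the bits of row and col. */
public final class MortonLayout {

    /** Bit i of col lands at offset bit 2i, bit i of row at offset bit 2i+1. */
    public static long mortonIndex(int row, int col) {
        return (spreadBits(row) << 1) | spreadBits(col);
    }

    /** Inserts a zero bit between consecutive bits of x (classic 32-to-64-bit bit spreading). */
    private static long spreadBits(int x) {
        long v = x & 0xFFFFFFFFL;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }
}
```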
The continuous evolution of processors requires vendors to translate ever-growing transistor budgets into performance improvements, e.g., by including more functional units, memory controllers, input/output (I/O) interfaces, graphics processing units (GPUs), and caches. This trend also increases complexity, which cannot be fully hidden from the operating system (OS) or application domains. Issues like where to place threads if cores have different frequency ranges or architectures, or where to perform a task that might be hardware-accelerated, cannot be decided at the hardware level. Moreover, performance improvements need to be achieved within a limited power envelope, with energy efficiency as a first-order design goal. The power-saving techniques introduced for this purpose, however, can contradict OS and application performance assumptions. Several processor vendors offer heterogeneous processor architectures, such as ARM's big.LITTLE or the Apple M1, combining high-performance and power-efficient cores. Intel's first such architecture, Alder Lake, integrates different core architectures and various accelerating components. This work presents an architecture overview of Alder Lake and an in-depth analysis of its power efficiency properties and techniques. For example, this includes frequency scaling of different components, idle states and their latencies, integrated energy measurement capabilities, and recently introduced processor feedback interfaces and OS integration.
The emergence of persistent memory (PMem) is greatly impacting the design of commonly used data structures to obtain the full benefit from the new technology. Compared to DRAM, PMem's larger capacity and lower cost make it an attractive alternative for hosting large data structures, such as indexes of in-memory databases, especially those that require data persistency. However, simply using existing index structures in PMem can be unexpectedly inefficient for three reasons: (1) index accesses are composed of small writes and reads; (2) each small write must be accompanied by expensive fence and flush operations; and (3) PMem usually prefers large accesses for high performance due to its internal block-like access design, despite being byte-addressable. For example, Intel Optane DC PMem has a 256-byte access unit (XPLine), leading to significant read/write amplification for small accesses. In this work we systematically study a series of techniques, including application-managed write-buffering, read-caching, and out-of-place updates, and their synergistic effect on the performance of representative indexes (hash table, B+ tree, and skip list) designed for PMem. We then apply the knowledge obtained from this investigation to the design of a high-performance PMem index, named Spot-on tree (SPTree), which allows applications to selectively cache read-intensive components of an index and to buffer data written to the index structure, while providing crash consistency and quick recovery after a crash. Compared to state-of-the-art indexes, SPTree provides up to 2X and 4X higher write and read throughput, respectively.
In the last decade there has been a significant leap in the capability of foundation AI models, largely driven by the introduction and refinement of transformer-based machine learning architectures. The most visible consequence of this has been the explosion of interest in, and application of, large language models such as ChatGPT. This is one exemplar of how a foundation model trained on a huge amount of data can be specialised for a particular task, often by a phase of reinforcement learning with human feedback.
Within the AI community, the "performance" of such systems is generally taken to mean how well they respond to their users on characteristics such as accuracy, verifiability, and bias. Performance analysis usually considers both the responsiveness of a system to its user and the efficiency and equity of resource use. These foundation models rely on massive amounts of resources, but there appears to have been little work on understanding their resource use or the trade-offs that exist between how the system responds to users and the amount of resources used.
In this talk I will present initial ideas of what it could mean to develop a framework of performance evaluation for foundation models such as large language models. Such a framework would need to take into consideration the distinct phases of operation for these models, which broadly speaking can be categorised as training, generating, and fine-tuning. Evaluating the trade-off between user interests and resource management will require the identification of suitable metrics. Resources in such systems are more than simply compute, storage, and bandwidth; data and even human resources also play crucial roles in training and fine-tuning. I will discuss all these topics.
As microservice and cloud computing operations increasingly adopt automation, the importance of models for fostering resilient and efficient adaptive architectures becomes paramount. This paper presents InstantOps, a novel approach to system failure prediction and root cause analysis leveraging a three-fold modality of IT observability data: logs, metrics, and traces. The proposed methodology integrates Graph Neural Networks (GNN) to capture spatial information and Gated Recurrent Units (GRU) to encapsulate the temporal aspects within the data. A key emphasis lies in utilizing a stitched representation derived from logs, microservice events (e.g., Image Pull Back Off, PVC Pending), and resource metrics to predict system failures proactively. The traces are aggregated to construct a comprehensive service call flow graph and represented as a dynamic graph. Furthermore, permutation testing is applied to harness node scores, aiding in the identification of root causes behind these failures. To evaluate the efficiency of InstantOps, we utilized in-house data from the open-source application Quote of the Day (QoTD) as well as two publicly available datasets, MicroSS and Train Ticket. The F1 scores obtained in predicting system failures from these datasets were 0.96, 0.98, and 0.97, respectively, beating the state-of-the-art. Additionally, we further evaluated the efficiency of root cause analysis using MAR and MFR. These results also outperform the state of the art.
To maintain a stable Quality of Service (QoS), Distributed Stream Processing (DSP) systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling methods have been proposed that aim to efficiently match the resource allocation with the incoming workload. However, determining when and by how much to scale remains a significant challenge. Given the long-running nature of DSP jobs, scaling actions need to be executed at runtime, and to maintain a good QoS, they should be both accurate and infrequent. To address the challenges of autoscaling, the concept of self-adaptive systems is particularly fitting. These systems monitor themselves and their environment, adapting to changes with minimal need for expert involvement.
This paper introduces Daedalus, a self-adaptive manager for autoscaling in DSP systems, which draws on the principles of self-adaptation to address the challenge of efficient autoscaling. Daedalus monitors a running DSP job and builds performance models, aiming to predict the maximum processing capacity at different scale-outs. Combined with time series forecasting to predict future workloads, Daedalus proactively scales DSP jobs, optimizing for maximum throughput while minimizing both latency and resource usage. We conducted experiments using Apache Flink and Kafka Streams to evaluate the performance of Daedalus against two state-of-the-art approaches. Daedalus was able to achieve comparable latencies while reducing resource usage by up to 71%.
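The abstract does not detail the scaling decision itself; the following hedged Java sketch shows one plausible shape of such a proactive decision, assuming a fitted capacity model (predicted maximum sustainable throughput per scale-out) and a forecast arrival rate: pick the smallest scale-out whose predicted capacity covers the forecast plus a headroom margin. All names and the headroom parameter are hypothetical, not Daedalus internals.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical proactive scale-out chooser: smallest scale-out whose modeled capacity covers the forecast. */
public final class ScaleOutPlanner {
    private final TreeMap<Integer, Double> capacityModel; // scale-out -> predicted max records/s
    private final double headroom;                        // e.g., 1.2 keeps 20% spare capacity

    public ScaleOutPlanner(Map<Integer, Double> capacityModel, double headroom) {
        this.capacityModel = new TreeMap<>(capacityModel);
        this.headroom = headroom;
    }

    /** Returns the smallest scale-out that can sustain the forecast rate, or the largest known one. */
    public int chooseScaleOut(double forecastRecordsPerSecond) {
        double required = forecastRecordsPerSecond * headroom;
        for (Map.Entry<Integer, Double> e : capacityModel.entrySet()) {
            if (e.getValue() >= required) {
                return e.getKey();
            }
        }
        return capacityModel.lastKey(); // forecast exceeds all modeled capacities: fall back to the maximum
    }
}
```

For example, with modeled capacities of 50,000 records/s at scale-out 2, 95,000 at 4, and 180,000 at 8, a headroom of 1.2 and a forecast of 70,000 records/s would yield a scale-out of 4.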
Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scale-out configurations which maximize resource utilization remains a challenge. This is especially true in environments where workloads change over time and node failures are all but inevitable. Configuration parameters such as memory allocation and checkpointing intervals also impact performance and resource usage. Sub-optimal configurations easily lead to high operational costs, poor performance, or unacceptable loss of service.
In this paper, we present Demeter, a method for dynamically optimizing key DSP system configuration parameters for resource efficiency. Demeter uses Time Series Forecasting to predict future workloads and Multi-Objective Bayesian Optimization to model runtime behaviors in relation to parameter settings and workload rates. Together, these techniques allow us to determine whether enough is known about the predicted workload rate or whether short-lived parallel profiling runs should be proactively initiated for data gathering. Once trained, the models guide the adjustment of multiple, potentially dependent system configuration parameters, ensuring optimized performance and resource usage in response to changing workload rates. Our experiments on a commodity cluster using Apache Flink demonstrate that Demeter significantly improves the operational efficiency of long-running benchmark jobs.
Flash SSDs have become the de-facto choice to deliver high I/O performance to modern data-intensive workloads. These workloads are often deployed in the cloud, where multiple tenants share access to flash-based SSDs. Cloud providers use various techniques, including the I/O schedulers available in the Linux kernel, such as BFQ, MQ-Deadline, and Kyber, to ensure certain performance qualities (i.e., service-level agreements, SLAs). Though now deployed with fast NVMe SSDs, these schedulers have not been systematically studied on modern, high-performance SSDs and their unique challenges. In this paper, we systematically characterize the performance, overheads, and scalability properties of Linux storage schedulers on NVMe SSDs capable of millions of I/O operations per second. We report 23 observations and 5 key findings that indicate that (i) CPU performance is the primary bottleneck in the Linux storage stack with high-performance NVMe SSDs; (ii) Linux I/O schedulers can introduce 63.4% performance overheads with NVMe SSDs; and (iii) Kyber and BFQ can deliver 99.3% lower P99 latency than the None or MQ-Deadline schedulers in the presence of multiple interfering workloads. We open-source the scripts and datasets of this work at: https://zenodo.org/records/10599514.
Schedulers are a crucial component in datacenter resource management. Each scheduler offers different capabilities, and users access them through their APIs. However, there is no clear understanding of what programming abstractions they offer, nor why they offer some and not others. Consequently, it is difficult to understand their differences and the performance costs imposed by their APIs. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and their related performance costs. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, understand the differences in their APIs, and identify the missing abstractions. Finally, we carry out exemplary experiments using trace-driven simulation demonstrating that an API extension, such as container migration, can improve total execution time per task by 81%, highlighting how schedulers sacrifice performance by implementing simpler programming abstractions. All the relevant software and data artifacts are publicly available at https://github.com/atlarge-research/quantifying-api-design.
Machine Learning (ML) workloads generally contain a significant amount of matrix computations; hence, hardware accelerators for ML have been incorporating support for matrix acceleration. With the popularity of GPUs as hardware accelerators for ML, specialized matrix accelerators are embedded into GPUs (e.g., Tensor Cores on NVIDIA GPUs) to significantly improve the performance and energy efficiency of ML workloads. NVIDIA Tensor Cores and other matrix accelerators have been designed to support General Matrix-Matrix Multiplication (GEMM) for many data types. While previous research has demonstrated impressive performance gains with Tensor Cores, it has primarily focused on Convolutional Neural Networks (CNNs).
This paper explores Tensor Cores' performance on various workloads, including Graph Convolutional Networks (GCNs), on NVIDIA H100 and A100 GPUs. In our experiments with NVIDIA GPUs, CNNs can achieve 1.91x (TF32) and 2.42x (FP16) end-to-end performance improvements with the use of Tensor Cores, whereas GCNs struggle to surpass a 1.03x (FP16) boost. Some implementations even experience slowdowns despite software transformation. Additionally, we explore the potential of Tensor Cores in non-GEMM-like kernels, providing insights into how software techniques can map diverse computation patterns onto Tensor Cores. Our investigation encompasses several kernels and end-to-end applications, aiming to comprehend the nuanced performance impact of Tensor Cores. Furthermore, we are among the first to present third-party evaluations of H100 GPU performance over the prior A100 GPU.
First-come first-serve scheduling can result in a substantial fraction (up to 10%) of transiently idle nodes on supercomputers. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that re-scaling DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and furthermore generalizes it by allowing it to be used even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs, which it then employs to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster and several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, increasing training throughput by up to 22.3% without requiring users to provide job scalability information.
In recent years, large language models (LLMs) have become pervasive in our day-to-day lives, with enterprises utilizing their services for a wide range of NLP-based applications. The exponential growth in the size of LLMs poses a significant challenge for efficiently utilizing these models for inference tasks, which require a substantial amount of memory and compute. Enterprises often possess multiple resources (workers, nodes, servers) with unused (leftover) capacity, providing an opportunity to address this challenge by distributing large models across these resources. Recent work, such as Petals, provides a platform for distributing LLM models across a cluster of resources. Petals, however, requires that users use their own discretion to distribute blocks on a given cluster, which can lead to non-optimal placement of blocks. In this paper, we propose LLaMPS, a large language model placement system that aims to optimize the placement of transformer blocks on the available enterprise resources by utilizing the leftover capacity of the worker nodes. Our approach considers leftover memory capacity along with available CPU cores when distributing transformer blocks optimally across worker nodes. Furthermore, we enhance the scalability of the system by maximizing the number of clients that can be served concurrently. We validate the efficacy of our approach by conducting extensive experiments using open-source large language models: BLOOM (1b, 3b, and 7b parameters), Falcon, and LLaMA. Our experiments demonstrate that LLaMPS facilitates optimal placement of transformer blocks by utilizing leftover resources, thus enabling enterprise-level deployment of large language models.
Voice-controlled systems are becoming ubiquitous in many IoT-specific applications such as home/industrial automation, automotive infotainment, and healthcare. While cloud-based voice services (e.g., Alexa, Siri) can leverage high-performance computing servers, some use cases (e.g., robotics, automotive infotainment) may require executing the natural language processing (NLP) tasks offline, often on resource-constrained embedded devices. Transformer-based language models such as BERT and its variants are primarily developed with compute-heavy servers in mind. Despite the great performance of BERT models across various NLP tasks, their large size and numerous parameters pose substantial obstacles to offline computation on embedded systems. Lighter replacements of such language models (e.g., DistilBERT and TinyBERT) often sacrifice accuracy, particularly for complex NLP tasks. Until now, it has remained unclear (a) whether the state-of-the-art language models, viz. BERT and its variants, are deployable on embedded systems with limited processor, memory, and battery power and (b) if they are, what the "right" set of configurations and parameters to choose for a given NLP task is. This paper presents a performance study of transformer language models under different hardware configurations and accuracy requirements and derives empirical observations about these resource/accuracy trade-offs. In particular, we study how the most commonly used BERT-based language models (viz. BERT, RoBERTa, DistilBERT, and TinyBERT) perform on embedded systems. We tested them on four off-the-shelf embedded platforms with 2 GB and 4 GB memory (i.e., a total of eight hardware configurations) and four datasets (i.e., HuRIC, GoEmotion, CoNLL, WNUT17) running various NLP tasks. Our study finds that executing complex NLP tasks (such as "sentiment" classification) on embedded systems is feasible even without any GPUs (e.g., a Raspberry Pi with 2 GB of RAM). We release our implementations for community use. Our findings can help designers understand the deployability and performance of transformer language models, especially those based on BERT architectures.
Recent years have witnessed the growth of Edge AI, a transformative paradigm that integrates neural networks with edge computing, bringing computational intelligence closer to end users. However, this innovation is not without its challenges, especially in environments with tight computing, network, and memory constraints, where resource-hungry AI models often need to be partitioned for distributed execution. This issue becomes even more acute in scenarios where post-deployment updates are infeasible or costly, posing a need to accurately reason about the interplay between resource constraints and Quality-of-Service (QoS) in Edge AI systems, so as to optimally design and operate them.
In this keynote talk, I will focus on these challenges, discussing QoS management and deployment problems arising in Edge AI systems. I will review mechanisms such as early exits and DNN partitioning that are distinctive of this problem space, explaining how they could be accounted for and leveraged in system performance and reliability tuning. I will then illustrate how design decisions and the definition of novel runtime control algorithms can be guided by approaches based on both traditional analytical models and emerging data-driven methods based on machine learning models.
This paper aims to solve the challenge of quantifying the performance of Hardware-in-the-Loop (HIL) computer systems used for data re-injection. The system can be represented as a multiple-queue and multiple-server system that operates on a First-In, First-Out (FIFO) basis. The task at hand involves establishing tight bounds on end-to-end delay and system backlog. This is necessary to optimise buffer and pre-buffer time configurations. Network Calculus (NC) is chosen as the basic analytical framework to achieve this. In the literature, there are different techniques for estimating arrival and service curves from measurement data which can be used for NC calculations. We have selected four of these methods to be applied to datasets of industrial Timestamp Logging (TL). The problem arises because these conventional methods often produce bounds that are much larger (by a factor of 1000 or more) than the measured maximum values, resulting in inefficient design of HIL system parameters and inefficient resource usage. The proposed approach, called TBASCEM, introduces a reverse-engineering approach based on linear NC equations for estimating the parameters of arrival and service curves. By imposing constraints on the equation variables and employing non-linear optimization, TBASCEM searches for a burst parameter estimation that derives tight global delay bounds. In addition, TBASCEM simplifies the run-time measurement process, supporting real-time data acquisition to evaluate and optimise HIL system performance, and enhancing observability to adapt the HIL configuration to new sensor data. The benefits of TBASCEM are that it enables efficient performance logging of arrival and service curve parameters and derives tighter bounds in HIL systems than the evaluated state-of-the-art methods, making it an invaluable tool for optimising and monitoring streaming applications in non-hard-real-time environments.
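For context on the Network Calculus bounds the abstract refers to: if a flow is constrained by a token-bucket arrival curve and the HIL stage offers a rate-latency service curve, the classic textbook delay and backlog bounds take the following form. TBASCEM's contribution lies in estimating the curve parameters so that such bounds become tight, not in the bound formulas themselves.

```latex
% Token-bucket arrival curve (burst b, rate r) and rate-latency service curve (rate R, latency T)
\alpha(t) = b + r\,t, \qquad \beta(t) = R\,(t - T)^{+}
% Classic Network Calculus bounds, valid for r \le R:
D_{\max} \;\le\; T + \frac{b}{R}, \qquad B_{\max} \;\le\; b + r\,T
```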
With the advent of 5G networks and the rise of the Internet of Things (IoT), Content Delivery Networks (CDNs) are increasingly extending into the network edge. This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge. These edge environments can host traffic classes characterized by varied object-size distributions and object-access patterns. Such complexity makes it difficult for traditional caching strategies, which often rely on metrics like request frequency or time intervals, to be effective. Despite these complexities, the optimization of edge caching is crucial. Improved byte hit rates at the edge not only alleviate the load on the network backbone but also minimize operational costs and expedite content delivery to end-users. In this paper, we introduce HR-Cache, a comprehensive learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering, a rule originally formulated to compute an upper bound on cache performance. HR-Cache leverages this rule to guide future object eviction decisions. It employs a lightweight machine learning model to learn from caching decisions made based on HR ordering, subsequently predicting the "cache-friendliness" of incoming requests. Objects deemed "cache-averse" are placed into the cache as priority candidates for eviction. Through extensive experimentation, we demonstrate that HR-Cache not only consistently enhances byte hit rates compared to existing state-of-the-art methods but also achieves this with minimal prediction overhead. Our experimental results, using three real-world traces and one synthetic trace, indicate that HR-Cache consistently achieves 2.2-14.6% greater WAN traffic savings than LRU. It outperforms not only heuristic caching strategies but also the state-of-the-art learning-based algorithm.
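The hazard-rate quantity underlying HR ordering is standard survival-analysis notation rather than anything specific to HR-Cache: for an object whose inter-request times have probability density f and cumulative distribution F, the hazard rate at time t since the last request is

```latex
h(t) \;=\; \frac{f(t)}{1 - F(t)}
```

and the HR ordering rule keeps in cache, at each instant, the objects with the highest current hazard rate of being re-requested; this is the rule the abstract describes as originally formulated to compute an upper bound on cache performance.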
Serverless computing and, in particular, Function-as-a-Service (FaaS) have emerged as valuable paradigms to deploy applications without the burden of managing the computing infrastructure. While initially limited to the execution of stateless functions in the cloud, serverless computing is steadily evolving. The paradigm has been increasingly adopted at the edge of the network to support latency-sensitive services. Moreover, it is not limited to stateless applications, with functions often relying on external data stores to exchange partial computation outcomes or to persist their internal state. To the best of our knowledge, several policies to schedule function instances on distributed hosts have been proposed, but they do not explicitly model the data dependency of functions and its impact on performance.
In this paper, we study the allocation of functions and associated key-value state in geographically distributed environments. Our contribution is twofold. First, we design a heuristic for function offloading that satisfies performance requirements. Then, we formulate the state migration problem via Integer Linear Programming, taking into account the heterogeneity of data, its access patterns by functions, and the network resources. Extensive simulations demonstrate that our policies allow FaaS providers to effectively support stateful functions and also lead to improved response times.
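To give a flavor of what an Integer Linear Programming formulation for state placement and migration can look like, here is a generic, heavily simplified sketch (hypothetical symbols; it is not the paper's formulation, which additionally models data heterogeneity, per-function access patterns, and network resources). Binary variables x_{k,h} place state item k on host h:

```latex
\min \sum_{k} \sum_{h} \left( m_{k,h} + a_{k,h} \right) x_{k,h}
\quad \text{s.t.} \quad
\sum_{h} x_{k,h} = 1 \;\; \forall k, \qquad
\sum_{k} s_{k}\, x_{k,h} \le C_{h} \;\; \forall h, \qquad
x_{k,h} \in \{0,1\}
```

Here, m_{k,h} is the cost of migrating item k to host h, a_{k,h} the expected access cost incurred by functions reading or writing k from h, s_k the item size, and C_h the storage capacity of host h.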
Caching is a classic technique for improving system performance by reducing client-perceived latency and server load. However, cache management still needs to be improved and is even more difficult in multi-tenant systems. To shed light on these problems and discuss possible solutions, we performed a workload characterization of a multi-tenant cache operated by a large e-commerce platform. In this platform, each one of thousands of tenants operates independently. We found that the workload patterns of the tenants can be very different. Also, the characteristics of the tenants change over time. Based on these findings, we highlight strategies to improve the management of multi-tenant cache systems.
The availability of diverse applications (apps) and the need to use many apps simultaneously have propelled users to constantly switch between apps on smartphones. For an instantaneous switch, these apps are often expected to stay in memory. However, when a user opens more apps and memory pressure increases, Android kills background apps to relieve the memory pressure. When the user switches a killed app back to the foreground, the user experiences a laggy response that compromises their experience. To delay this killing under memory pressure for a smoother user experience, we propose MemSaver, a low-cost approach for preemptively swapping selected pages of background apps out of memory to avoid or postpone the killing of apps while ensuring their near-ideal switch time. MemSaver uses pages accessed during events similar to the switch and in about the same app context to predict the pages to be accessed in the next switch. Evaluations on a OnePlus 9 Pro using representative apps show that up to 60% of an app's memory (RSS) can be saved while maintaining the switch time within the acceptable range.
Alibaba's 2021 and 2022 microservice datasets are the only publicly available sources of request-workflow traces from a large-scale microservice deployment. They have the potential to strongly influence future research as they provide much-needed visibility into industrial microservices' characteristics. We conduct the first systematic analyses of both datasets to help facilitate their use by the community. We find that the 2021 dataset contains numerous inconsistencies preventing accurate reconstruction of full trace topologies. The 2022 dataset also suffers from inconsistencies, but at a much lower rate. Tools that strictly follow Alibaba's specs for constructing traces from these datasets will silently ignore these inconsistencies, misinforming researchers by creating traces of the wrong sizes and shapes. Tools that discard traces with inconsistencies will discard many traces. We present Casper, a construction method that uses redundancies in the datasets to sidestep the inconsistencies. Compared to an approach that discards traces with inconsistencies, Casper accurately reconstructs an additional 25.5% of traces in the 2021 dataset (going from 58.32% to 83.82%) and an additional 12.18% in the 2022 dataset (going from 86.42% to 98.6%).
Modern cloud-native applications are adopting the microservice architecture, in which applications are deployed in lightweight containers that run inside a virtual machine (VM). Containers running different services are often co-located inside the same virtual machine. While this enables better resource optimization, it can cause interference among applications, which can lead to performance degradation. Detecting the cause of performance degradation at runtime is crucial to decide the correct remediation action, such as, but not limited to, scaling or migrating. We propose a non-intrusive detection technique that differentiates between degradation caused by load and by interference. First, we define an operational zone for the application. Then we define a disambiguation method that uses models to classify interference and normal load. In contrast to previous work, our proposed detection technique does not require intrusive application instrumentation and incurs minimal performance overhead. We demonstrate how we can design effective Machine Learning models that can be generalized to detect interference from different types of applications. We evaluate our technique using realistic microservice benchmarks on AWS EC2. The results show that our approach outperforms existing interference detection techniques in F1 score by at least 2.75% and at most 53.86%.