The complexity and distributed nature of microservices necessitate constant monitoring and tracing of their execution to identify performance problems and their underlying root causes. However, the large volume of collected data and the complexity of distributed communications pose challenges in identifying and locating abnormal services. In this paper, we propose a novel approach that leverages execution contexts to propagate and localize performance root causes, by integrating social network analysis techniques with spectrum analysis. To evaluate the proposed approach, we conducted an experiment on a real-world benchmark and observed promising preliminary results, with a success rate of 91.3% in correctly identifying the primary root cause (top-1) and a 100% success rate in finding the root cause within the top three candidates (top-3).
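To illustrate the general flavor of combining spectrum analysis with a social-network-style centrality measure for root-cause ranking (a hedged sketch, not the paper's actual algorithm; the service names, traces, and blending weight are illustrative assumptions):

```python
# Minimal sketch (not the paper's algorithm): spectrum-style suspiciousness
# scores per service, blended with a call-graph centrality measure.
import math
import networkx as nx

# Each trace: (set of services it touched, whether it violated its SLO)
traces = [
    ({"frontend", "cart", "payment"}, True),
    ({"frontend", "cart"}, False),
    ({"frontend", "catalog", "payment"}, True),
    ({"frontend", "catalog"}, False),
]

services = set().union(*(s for s, _ in traces))

def ochiai(service):
    """Spectrum-based suspiciousness: how strongly a service co-occurs with failures."""
    ef = sum(1 for s, fail in traces if fail and service in s)      # failing, covered
    nf = sum(1 for s, fail in traces if fail and service not in s)  # failing, not covered
    ep = sum(1 for s, fail in traces if not fail and service in s)  # passing, covered
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Call graph of the example application (edges follow request flow).
g = nx.DiGraph([("frontend", "cart"), ("frontend", "catalog"),
                ("cart", "payment"), ("catalog", "payment")])
centrality = nx.pagerank(g)  # services many others depend on get more weight

alpha = 0.7  # illustrative blend between spectrum score and centrality
ranking = sorted(services,
                 key=lambda s: alpha * ochiai(s) + (1 - alpha) * centrality[s],
                 reverse=True)
print(ranking)  # candidate root causes, most suspicious first
```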
The need for accelerated object detection is paramount for safety-critical applications such as autonomous vehicles. This paper focuses on leveraging parallel processing techniques to enhance the performance of object detection. Specifically, this research engineers system performance for the timely detection of common objects encountered by vehicles, such as other automobiles, pedestrians, and bicycles. Deploying popular pretrained deep learning models like the You Only Look Once (YOLO) model within the Apache Spark framework, we investigate the potential enhancements in detection speed achieved through parallel processing. The capability of the system to efficiently handle large datasets and distribute time-critical applications across multiple nodes is explored to improve both latency and scalability. The one-factor-at-a-time method is used to assess the impact of different system and workload parameters on performance. Of particular interest is the impact of Spark data partitioning on performance, especially for driving scenarios where the number of objects changes rapidly. A novel data partitioning technique based on the principles of entropy is utilized. The overall performance objective of this research is to improve object-detection speed in vehicles, which can improve safety in time-critical events such as sudden braking or turning.
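As a hedged illustration of how entropy could drive partitioning decisions in such a pipeline (not the paper's actual partitioner; the thresholds and the mapping from entropy to partition count are assumptions), consider the following sketch:

```python
# Illustrative sketch: choose the number of Spark partitions for a batch of
# frames from the Shannon entropy of the detected object-class distribution.
import math
from collections import Counter

def shannon_entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def partitions_for_batch(detected_labels, min_parts=4, max_parts=64):
    """More class diversity (higher entropy) -> more partitions for the batch."""
    if not detected_labels:
        return min_parts
    h = shannon_entropy(detected_labels)
    h_max = math.log2(len(set(detected_labels))) or 1.0
    scale = h / h_max  # 0 (single class) .. 1 (uniform mix)
    return int(min_parts + scale * (max_parts - min_parts))

# Example: a busy urban scene with cars, pedestrians, and bicycles
labels = ["car"] * 20 + ["pedestrian"] * 15 + ["bicycle"] * 5
n = partitions_for_batch(labels)
# frames_rdd = frames_rdd.repartition(n)  # applied before running YOLO inference
print(n)
```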
The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), SambaNova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks.
This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.
Autotuning is an automated process that selects the best computer program implementation from a set of candidates to improve performance, such as execution time, when run under new circumstances, such as new hardware. The process of autotuning generates a large amount of performance data with multiple potential use cases, including reproducing results, comparing included methods, and understanding the impact of individual tuning parameters.
We propose the adoption of the FAIR Principles (Findable, Accessible, Interoperable, and Reusable) to organize the guidelines for data sharing in autotuning research. The guidelines aim to lessen the burden of sharing data and provide a comprehensive checklist of recommendations for shared data. We illustrate three examples that could greatly benefit from shared autotuning data to advance research without time- and resource-demanding data collection.
To facilitate data sharing, we have taken a community-driven approach to define a common format for the data using a JSON schema and provide scripts for their collection.
The proposed comprehensive guide for collecting and sharing performance data in autotuning research can promote further advances in the field and encourage research collaboration.
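To illustrate the kind of schema-based validation such a common format enables (a minimal sketch; the field names below are hypothetical and not the community-defined schema referenced above):

```python
# Hedged illustration: validating a shared autotuning record against a JSON
# schema. The fields are hypothetical, not the schema defined by the community.
import json
from jsonschema import validate  # pip install jsonschema

AUTOTUNING_RECORD_SCHEMA = {
    "type": "object",
    "required": ["application", "hardware", "parameters", "runtime_seconds"],
    "properties": {
        "application": {"type": "string"},
        "hardware": {"type": "string"},
        "parameters": {"type": "object"},          # tuning knobs and their values
        "runtime_seconds": {"type": "number", "exclusiveMinimum": 0},
    },
}

record = {
    "application": "matmul",
    "hardware": "NVIDIA A100",
    "parameters": {"block_size_x": 32, "block_size_y": 8, "unroll": 4},
    "runtime_seconds": 0.0123,
}

validate(instance=record, schema=AUTOTUNING_RECORD_SCHEMA)  # raises on mismatch
print(json.dumps(record, indent=2))
```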
In this paper, we present a comprehensive empirical study to evaluate four prominent Computer Vision inference frameworks. Our goal is to shed light on their strengths and weaknesses and provide valuable insights into the challenges of selecting the right inference framework for diverse situations. Additionally, we discuss the potential room for improvement to accelerate inference computing efficiency.
This paper proposes a new traffic decomposition method, called MNA, to solve multi-class queueing networks with first-come first-served stations having phase-type (PH) service, generalizing the classic QNA method by Whitt. MNA supports not only open queueing networks but also closed networks, which are useful for modeling concurrency limits in software systems. Using validation models, we show that under a low squared coefficient of variation (SCV) of service time and inter-arrival time, the new method is on average more accurate than QNA. Therefore, MNA can provide better software performance prediction for quality-of-service management tasks.
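For background, QNA-style methods propagate squared coefficients of variation through the network; a commonly cited single-server approximation for the departure-stream SCV is shown below (a simplified textbook form, not the MNA equations proposed in the paper):

```latex
% Simplified QNA-style approximation (background only): SCV of the departure
% process at a single-server station with utilization \rho, inter-arrival SCV
% c_a^2, and service-time SCV c_s^2.
\[
  c_d^2 \;\approx\; (1 - \rho^2)\, c_a^2 \;+\; \rho^2\, c_s^2
\]
```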
Microservices have been a cornerstone for building scalable, flexible, and robust applications, thereby enabling service providers to enhance their systems' resilience and fault tolerance. However, adopting this architecture has often led to many challenges, particularly when pinpointing performance bottlenecks and diagnosing their underlying causes. Various tools have been developed to bridge this gap and facilitate comprehensive observability in microservice ecosystems. While these tools are effective at detecting latency-related anomalies, they often fall short of isolating the root causes of these problems. In this paper, we present a novel method for identifying and analyzing performance anomalies in microservice-based applications by leveraging cross-layer tracing techniques. Our method uniquely integrates system resource metrics, such as CPU, disk, and network consumption, with each user request, providing a multi-dimensional view for diagnosing performance issues. Through the use of sequential pattern mining, this method effectively isolates aberrant execution behaviors and helps identify their root causes. Our experimental evaluations demonstrate its efficiency in diagnosing a wide range of performance anomalies.
Data migration refers to the set of tasks around transferring data over a network between two systems, either homogeneous or heterogeneous, and the potential reformatting of this data. Combined with large volumes of data, resource constraints, and variety in data models and formats, data migration can be critical for enterprises, as it can consume a significant amount of time, incur high costs, and pose a significant risk if not executed correctly. The ability to accurately and effectively predict these challenges and plan for proper resource, time, and budget allocation is vital for the proper execution of data migration. In this work, we introduce the concept of load testing and benchmarking for data migration, enabling decision-makers to plan such tasks with higher efficiency and effectiveness. Our framework aims for extensibility and customizability to enable the execution of a greater variety of tests. Here, we present a prototype architecture, a roadmap for how the development of such a platform should proceed, and a simple case study of how it can be used in practice.
The finer granularity of microservices facilitates their evolution and deployment on shared resources. However, resource concurrency creates elusive interdependencies, which can cause complex interference patterns to propagate as performance anomalies across distinct applications. Meanwhile, existing methods for Anomaly Detection (AD) and Root-Cause Analysis (RCA) are confounded by this phenomenon of interference because they operate within single call-graphs. To bridge this gap, we develop a graph formalism (Spatio-Temporal Interference Graph, STIG) to express interference patterns and an artifact to simulate their dynamics. Our simulator contributes to the study and mitigation of interference patterns as a performance phenomenon that emerges from regular resource consumption anomalies.
In the evolving landscape of software development and system operations, the demand for automating traditionally manual tasks has surged. Continuous operation and minimal downtimes highlight the need for automated detection and remediation of runtime anomalies. Ansible, known for its scalable features, including high-level abstraction and modularity, stands out as a reliable solution for managing complex systems securely. The challenge lies in creating an on-the-spot Ansible solution for dynamic auto-remediation, requiring a substantial dataset for in-context tuning of large language models (LLMs). Our research introduces KubePlaybook, a curated dataset with 130 natural language prompts for generating automation-focused remediation code scripts. After rigorous manual testing, the generated code achieved an impressive 98.86% accuracy rate, affirming the solution's reliability and performance in addressing dynamic auto-remediation complexities.
Microservices offer the benefits of scalable flexibility and rapid deployment, making them a preferred architecture in today's IT industry. However, their dynamic nature increases their susceptibility to failures, highlighting the need for effective troubleshooting strategies. Current methods for pinpointing issues in microservices often depend on impractical supervision or rest on unrealistic assumptions. We propose a novel approach using graph unsupervised neural networks and critical path analysis to address these limitations. Our experiments on four open-source microservice benchmarks show significant results, with top-1 accuracy ranging from 86.4% to 96%, an improvement of over 6% compared to existing methods. Moreover, our approach reduces training time by a factor of 5.6 compared to similar works on the same datasets.
Understanding the complexity of microservice connections within a service is essential for system management and optimization. In this study, we analyzed a dataset of microservice traces from Alibaba's production clusters, segmenting call graphs based on services. Using a community detection model, we uncovered the connections between microservices within each service by finding collaborative patterns and dependencies. Expanding our analysis, we identified similarities among service graphs using clustering techniques. These findings provide detailed insights for system optimization and decision-making, offering a roadmap for using the constructed runtime microservices network behavior to improve overall system efficiency and performance.
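A minimal sketch of community detection on a service call graph (assuming a synthetic graph and a greedy modularity method, which is not necessarily the model used in the study):

```python
# Sketch: detect microservice communities within one service's call graph.
# Graph and edge weights (call frequencies) are made up for illustration.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
g.add_weighted_edges_from([
    ("gateway", "auth", 120), ("gateway", "catalog", 300),
    ("catalog", "search", 250), ("catalog", "inventory", 90),
    ("auth", "user-db", 110), ("search", "ranking", 200),
])

communities = greedy_modularity_communities(g, weight="weight")
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```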
As the market for cloud computing continues to grow, an increasing number of users are deploying applications as microservices. This shift introduces unique challenges in identifying and addressing performance issues, particularly within large and complex infrastructures. To address this challenge, we propose a methodology that unveils temporal performance deviations in microservices by clustering containers based on their performance characteristics at different time intervals. Showcasing our methodology on the Alibaba dataset, we found both stable and dynamic performance patterns, providing a valuable tool for enhancing overall performance and reliability in modern application landscapes.
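A hedged sketch of the underlying idea, clustering containers per time interval and comparing the resulting groupings across intervals (synthetic data; the feature set, cluster count, and similarity measure are illustrative assumptions, not the paper's exact pipeline):

```python
# Illustrative sketch: cluster containers per time interval on CPU/memory
# utilization features, then compare clusterings between intervals.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_containers, n_intervals = 50, 4

# features per container per interval: [mean CPU %, mean memory %] (synthetic)
features = rng.uniform(0, 100, size=(n_intervals, n_containers, 2))

memberships = []
for t in range(n_intervals):
    x = StandardScaler().fit_transform(features[t])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)
    memberships.append(labels)

# A low adjusted Rand index between consecutive intervals indicates that
# containers regroup over time, i.e. dynamic (temporally deviating) behavior.
ari = adjusted_rand_score(memberships[0], memberships[1])
print(f"clustering similarity between intervals 0 and 1 (ARI): {ari:.2f}")
```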
Microservice architectures are a widely adopted architectural pattern for large-scale applications. Given the large adoption of these systems, several works have been proposed to detect performance anomalies by analysing their execution traces. However, most of the proposed approaches rely on machine learning (ML) algorithms to detect anomalies. While ML methods may be effective in detecting anomalies, the training and deployment of these systems has been shown to be less efficient in terms of time, computational resources, and energy required.
In this paper, we propose a novel approach based on context-free grammars for anomaly detection in microservice execution traces. We employ SAX encoding to transform execution traces into strings. Then, we select the strings encoding anomalies and, for each possible anomaly, build a context-free grammar using the Sequitur grammar induction algorithm. We test our approach on two real-world datasets and compare it with a Logistic Regression classifier. We show that our approach is more efficient, with a training time of ~15 seconds and a minimal loss in effectiveness of ~5% compared to the Logistic Regression baseline.
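A minimal sketch of the SAX encoding step applied to a series of span durations (the alphabet size and segment count are illustrative choices, and the Sequitur grammar-induction step is omitted):

```python
# SAX (Symbolic Aggregate approXimation) sketch: turn a latency series into a
# string suitable for grammar induction.
import numpy as np
from scipy.stats import norm

def sax_encode(series, n_segments=8, alphabet="abcd"):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)          # z-normalize
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])  # PAA
    # Breakpoints splitting the standard normal into equiprobable regions.
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[i] for i in symbols)

# Example: span durations (ms) of consecutive requests, with a latency spike.
durations = [12, 13, 11, 12, 14, 95, 97, 13, 12, 11, 13, 12]
print(sax_encode(durations))  # an 8-character string that could be fed to Sequitur
```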
In large-scale microservice architectures, such as those utilized by Alibaba, identifying and addressing performance bottlenecks is a significant challenge due to the complicated interactions between thousands of services. To navigate this challenge, we have developed a critical-path-based technique aimed at analyzing microservice interactions within these complex systems. This technique facilitates the identification of critical nodes where service requests experience the longest delays. Our contribution is the discovery of performance variability in the response times of service interactions within these critical paths, and the pinpointing of specific interactions that show a high degree of performance variability. This improves the ability to detect service performance issues and their root causes, allows for dynamic adjustment of data-collection detail, and targets critical interactions for adaptive monitoring.
The rapid expansion of Large Language Models (LLMs) presents significant challenges in efficient deployment for inference tasks, primarily due to their substantial memory and computational resource requirements. Many enterprises possess a variety of computing resources (servers, VMs, PCs, laptops) that cannot individually host a complete LLM. Collectively, however, these resources may be adequate for even the most demanding LLMs. We introduce LLaMPS, a novel tool designed to optimally distribute blocks of LLMs across available computing resources within an enterprise. LLaMPS leverages the unused capacities of these machines, allowing for the decentralized hosting of LLMs. This tool enables users to contribute their machine's resources to a shared pool, facilitating others within the network to access and utilize these resources for inference tasks. At its core, LLaMPS employs a sophisticated distributed framework to allocate transformer blocks of LLMs across various servers. In cases where a model is pre-deployed, users can directly access inference results via a GUI and an API. Our tool has undergone extensive testing with several open-source LLMs, including BLOOM-560m, BLOOM-3b, BLOOM-7b1, Falcon 40b, and LLaMA-70b. It is currently deployed in a real-world enterprise network setting, demonstrating its practical applicability and effectiveness.
This paper introduces Kube-burner (https://github.com/kube-burner/kube-burner), an open-source tool for orchestrating performance and scalability testing of Kubernetes (https://kubernetes.io), with the ability to operate seamlessly across different distributions. We discuss its importance in the cloud-native landscape and its features and capabilities, and delve into its architecture and usage. Additionally, we present a case study on performance benchmarking using Kube-burner and the subsequent analysis to demonstrate its value.
The success of application migration to the cloud depends on multiple factors, such as achieving the expected performance, optimal deployment cost, and data security. The application migration process starts with the architecture design, mapping technical and business specifications to the appropriate services in the cloud. However, cloud vendors offer numerous services for each service type and requirement. The onus of selecting the optimal service from the pool lies with the user. Identifying an optimal service for a specific component or application requirement is a daunting task and necessitates a deep understanding of each cloud service offered.
This paper introduces SuperArch, a supervised architecture design tool designed to facilitate the optimal selection and configuration of cloud services. We propose using Large Language Models (LLMs) to extract information from user requirements and specifications, aiding in the optimal selection of cloud services. Additionally, SuperArch maps workloads to cloud services to generate optimal configurations of each cloud service and to estimate the performance and cost of the entire architecture.
We are pleased to welcome you to the 2024 ACM Workshop on Artificial Intelligence for Performance Modeling, Prediction, and Control - AIPerf'24.
Welcome to the 2024 5th International Workshop on Benchmarking in the Data Centre: Expanding to the Cloud (BID '24), hosted at Imperial College London as a workshop track of the International Conference on Performance Engineering (ICPE'24). The past few years have been remarkably exciting for the cloud computing domain. We are witnessing groundbreaking developments in AI architectures, new AI/ML methodologies, and the significant expansion of newer CPU architectures such as AArch64 and RISC-V. These innovations not only redefine the capabilities and efficiency of cloud-based services but also open new avenues of research on how we can attain the best possible performance in the cloud.
BID '24 is dedicated to advancing the field of high-performance computing (HPC) benchmarking, extending its application from traditional academic settings to industry and the cloud. This evolution prompts a reassessment of user education concerning HPC's advantages, optimal selection of computational resources for specific workloads, and the considerations surrounding cost and environmental impact. Our discussions will encompass several key areas: privacy issues in commercial HPC environments, emerging cloud architectures, comprehensive workflows for effective benchmarking, and theoretical approaches to performance analysis. Additionally, this year's workshop will delve into the unique challenges presented by AI/ML workloads in cloud settings.
The success of BID '24 is made possible by the contributions of numerous individuals and organizations. We extend our gratitude to all authors and presenters who have submitted their work for discussion. A special thank you goes to the members of the Technical Committee for their invaluable support and diligent reviews. We also appreciate the hospitality and support from our hosts in London, UK, who have provided an excellent venue for our workshop.
Ultimately, the essence of BID '24 is shaped by its participants. We thank all authors, speakers, and attendees for enriching this workshop with their insights and presence. We hope you find the discussions stimulating, the networking fruitful, and your time in London memorable.
It is our great pleasure to welcome you to GraphSys'24, the 2nd edition of the ACM/SPEC Workshop on Serverless, Extreme-Scale, and Sustainable Graph Processing Systems. This is a returning workshop, where we continue to facilitate the exchange of ideas and expertise in the broad field of high-performance large-scale graph processing.
The Linked Data Benchmark Council (LDBC) was originally created in 2012 as part of a European Union-funded project of the same name. Its original goal was to design standard benchmarks for graph processing and to facilitate competition among vendors to drive innovation in the field. Twelve years later, the LDBC organization has more than 20 member organizations (including database, hardware, and cloud vendors) and five standard benchmark workloads, with frequent audit requests from vendors. Moreover, LDBC has extended its scope beyond benchmarking to cover graph schemas and graph query languages, and has a liaison arrangement with the ISO SQL/GQL standards committee. In this talk, I will reflect on the LDBC's organizational history, goals, and main technical achievements.
This paper presents GraphMa, a framework aimed at enhancing pipeline-oriented computation for graph processing. GraphMa integrates the principles of pipeline computation with graph processing methodologies to provide a structured approach for analyzing and processing graph data. The framework defines a series of computational abstractions, including computation as type, higher-order traversal, and directed data-transfer, which collectively facilitate the decomposition of graph operations into modular functions. These functions can be composed into pipelines, supporting the systematic development of graph algorithms. For this paper, our focus lies in particular on the capability to implement the well-established computational models for graph processing within the proposed framework. In addition, the paper discusses the design of GraphMa, its computational models, and the implementation details that illustrate the framework's application to graph processing tasks.
This work critically examines several approaches to temperature prediction for High-Performance Computing (HPC) systems, focusing on component-level and holistic models. In particular, we use publicly available data from the Tier-0 Marconi100 supercomputer and propose models ranging from a room-level Graph Neural Network (GNN) spatial model to node-level models. Our results highlight the importance of correct graph structures and suggest that while graph-based models can enhance predictions in certain scenarios, node-level models remain optimal when data is abundant. These findings contribute to understanding the effectiveness of different modeling approaches in HPC thermal prediction tasks, enabling proactive management of the modeled system.
Extreme conditions and the integrity of LiDAR sensors influence AI perception models in autonomous vehicles. Lens contamination caused by external particles can compromise LiDAR object detection performance. Automatic contaminant detection is important to improve the reliability of sensor information propagated to the user or to object detection algorithms. However, dynamic conditions such as variations in location, distance, and types of objects around the autonomous vehicle make robust and fast contaminant detection significantly challenging. We propose a method for contaminant detection using voxel-based graph transformation to address the challenge of sparse LiDAR data. This method considers LiDAR points as graph nodes and employs a graph attention layer to enhance the accuracy of contaminant detection. Additionally, we introduce cross-environment training and testing on real-world contaminant LiDAR data to ensure high generalization across different environments. Compared with the current state-of-the-art approaches in contaminant detection, our proposed method significantly improves the performance by as much as 0.1575 in F1-score. Consistently achieving F1 scores of 0.936, 0.902, and 0.920 across various testing scenarios, our method demonstrates robustness and adaptability. Requiring 128 milliseconds on an AMD EPYC 74F3 CPU for the end-to-end process, our method is well-suited for an early warning system, outperforming human reaction times, which require at least 390 milliseconds to detect hazards. This significantly contributes to enhancing safety and reliability in the operations of autonomous vehicles.
Datacenters are key components in the ICT infrastructure supporting our digital society. Datacenter operations are hampered by operational complexity and dynamics, which risk reducing or even offsetting the performance, energy efficiency, and other benefits of datacenters. An emerging technology, Operational Data Analytics (ODA), promises to collect and use monitoring data to improve datacenter operations. However, it is challenging to organize, share, and leverage the massive and heterogeneous data resulting from monitoring datacenters. Addressing this combined challenge, and starting from the idea that graphs could provide a good abstraction, in this work we present our early work on designing and implementing a graph-based approach for datacenter ODA. We focus on two main components of datacenter ODA. First, we design, implement, and validate a graph-based ontology for datacenters that captures both high-level metadata and low-level metrics of operational data collected from real-world datacenters, and maps them to a graph structure for better organization and further use. Second, we design and implement ODAbler, a software framework for datacenter ODA, which combines ODA data with an online simulator to make predictions about current operational decisions and other what-if scenarios. We take the first steps to illustrate the practical use of ODAbler and explore its potential to support datacenter ODA through graph-based analysis. Our work helps construct the case that graph-based ontologies have great value for datacenter ODA and, further, for improving datacenter operations.
High-performance computing (HPC) is the cornerstone of technological advancements in our digital age, but its management is becoming increasingly challenging, particularly as systems approach exascale. Operational data analytics (ODA) and holistic monitoring frameworks aim to alleviate this burden by collecting live telemetry from HPC systems. ODA frameworks rely on NoSQL databases for scalability, with implicit data structures embedded in metric names, necessitating domain knowledge for navigating the relations in telemetry data. To address the imperative need for an explicit representation of relations in telemetry data, we propose a novel ontology for ODA, which we apply to a real HPC installation. The proposed ontology captures relationships between topological components and links hardware components (compute nodes, racks, systems) with the telemetry collected about job execution and allocations. This ontology forms the basis for constructing a knowledge graph, enabling graph queries for ODA. Moreover, we present a comparative analysis of the complexity (expressed in lines of code) and the domain knowledge required (qualitatively assessed by informed end-users) to implement complex queries with the proposed method and with the NoSQL methods commonly employed in today's ODA frameworks. We focus on six queries informed by facility managers' daily operations, aiming to benefit not only facility managers but also system administrators and user support. Our comparative analysis demonstrates that the proposed ontology facilitates the implementation of complex queries with significantly fewer lines of code and less domain knowledge than the NoSQL methods.
While graph sampling is key to scalable processing, little research has thoroughly compared and understood how well it preserves features such as degree, clustering, and distances, depending on graph size and structural properties. This research evaluates twelve widely adopted sampling algorithms across synthetic and real datasets to assess their quality on three metrics: degree, clustering coefficient (CC), and hop plots. We find the random jump algorithm to be an appropriate choice for the degree and hop-plot metrics, and the random node algorithm for the CC metric. In addition, we interpret the algorithms' sample quality by conducting a correlation analysis with diverse graph properties. We discover that eigenvector centrality and path-related features are essential for estimating these algorithms' degree quality; that node counts (or the size of the largest connected component) are informative for CC quality estimation; and that degree entropy, edge betweenness, and path-related features are meaningful for the hop-plot metric. Furthermore, with increasing graph size, most sampling algorithms produce better-quality samples under the degree and hop-plot metrics.
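A brief sketch of the random jump sampling algorithm highlighted above (the jump probability and target sample size are illustrative parameters, not the values used in the evaluation):

```python
# Random jump sampling: walk to a random neighbor, occasionally jump to a
# uniformly random node, until the target sample size is reached.
import random
import networkx as nx

def random_jump_sample(g, sample_fraction=0.2, jump_prob=0.15, seed=42):
    rng = random.Random(seed)
    nodes = list(g.nodes)
    target = max(1, int(sample_fraction * len(nodes)))
    current = rng.choice(nodes)
    sampled = {current}
    while len(sampled) < target:
        neighbors = list(g.neighbors(current))
        if rng.random() < jump_prob or not neighbors:
            current = rng.choice(nodes)      # jump to a uniformly random node
        else:
            current = rng.choice(neighbors)  # walk to a random neighbor
        sampled.add(current)
    return g.subgraph(sampled).copy()

g = nx.erdos_renyi_graph(n=1000, p=0.01, seed=1)
sample = random_jump_sample(g)
print(sample.number_of_nodes(), sample.number_of_edges())
```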
Knowledge graphs are extremely versatile semantic tools, but there are currently bottlenecks in expanding them to a massive scale. This concern is a focus of the Graph-Massivizer project, where solutions for scalable massive graph processing are investigated. In this paper, we describe how to build a massive knowledge graph from existing information or external sources in a repeatable and scalable manner. We go through the process step by step and discuss how the Graph-Massivizer project supports the development of large knowledge graphs and the considerations necessary for replication.
The growing desire among application providers for a cost model based on pay-per-use, combined with the need for a seamlessly integrated platform to manage the complex workflows of their applications, has spurred the emergence of a promising computing paradigm known as serverless computing. Although serverless computing was initially considered for cloud environments, it has recently been extended to other layers of the computing continuum, i.e., edge and fog. This extension emphasizes that the proximity of computational resources to data sources can further reduce costs and improve performance and energy efficiency. However, orchestrating the computing continuum in complex application workflows, including a set of serverless functions, introduces new challenges. This paper investigates the opportunities and challenges introduced by serverless computing for workflow management systems (WMS) on the computing continuum. In addition, the paper provides a taxonomy of state-of-the-art WMSs and reviews their capabilities.
Go-Network is a Go language package for network generation and sampling. The core package provides basic data structures representing undirected graphs. Go-Network currently supports only integer values on graph nodes and edges. The library implements (a) data loading utilities supporting common graph formats, (b) algorithms for synthetic graph generation (e.g., Erdős-Rényi graphs), and (c) thirty implementations of graph sampling algorithms. Among the many benefits the library inherits from Go (designed as a replacement for C++) are its compilation and execution speed (it compiles directly to machine code) and its strong support for concurrency while remaining memory-efficient. These factors make the library a powerful tool for scientific purposes. We briefly describe the existing functionality, compare it against another graph sampling library (Little Ball of Fur), describe our design decisions, and draw attention to future work. Go-Network is publicly available and can be imported from https://github.com/graph-massivizer/go-network.
The popularity and adoption of large language models (LLMs) like ChatGPT have evolved rapidly. LLM pre-training is expensive. ChatGPT is estimated to cost over $700,000 per day to operate, and using GPT-4 to support customer service can cost a small business over $21,000 a month. The high infrastructure and financial costs, coupled with the specialized talent required, make LLM technology inaccessible to most organizations. For instance, the up-front costs include the emissions generated to manufacture the relevant hardware and the cost to run that hardware during the training procedure, both while the machines are operating at full capacity and while they are not. The best estimate of the dynamic computing cost in the case of GPT-3, the model behind the original ChatGPT, is approximately 1,287,000 kWh, or 552 tons of carbon dioxide. The goal of this workshop is to address the urgency of reducing the energy consumption of LLM applications by bringing together researchers from academia and industry to share their experience and insights on performance engineering in the LLM world.
Large Language Models (LLMs) are advanced natural language processing models trained on vast amounts of text data to understand and generate human-like language. These models are designed to understand context, generate coherent and contextually relevant text, and demonstrate advanced language capabilities. In the dynamic landscape of LLMs, the demand for efficient inference benchmarking is crucial. Organizations such as TPC and SPEC have introduced several industry-standard benchmarks [1][2][3][4]. This publication introduces EchoSwift [11], a comprehensive benchmarking framework designed to evaluate the real-time performance of LLMs in deployment scenarios. As LLMs ascend to the forefront of technological innovation, their seamless integration into real-world applications demands a nuanced understanding of their efficiency, throughput, latency, and scalability. It is within this dynamic landscape that our publication unveils EchoSwift, a novel benchmarking framework meticulously crafted to address the pressing need for comprehensive inference benchmarking, as well as the discovery of the right configuration for specific LLM requirements. For instance, certain deployments might have 32 tokens as input and 256 tokens as output, while others might have 256 tokens as input and 64 tokens as output. It is crucial to acknowledge that the configurations for these two requirements need not be the same for optimal performance, scale, and total cost of ownership (TCO). EchoSwift not only aids in comprehensive configuration discovery but also facilitates robust performance and scale testing, ensuring that LLM deployments are not only efficient but also finely tuned to their specific operational demands.
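To make the input/output token configurations concrete, the following hedged sketch shows the kind of measurement loop such a benchmark performs; the endpoint URL and payload fields are hypothetical and do not reflect EchoSwift's actual API:

```python
# Sketch of an LLM inference measurement loop: fixed input/output token
# configurations, per-request latency, and aggregate throughput.
import statistics
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"   # hypothetical serving endpoint
CONFIGS = [(32, 256), (256, 64)]                     # (input tokens, output tokens)

def run_config(prompt_tokens, output_tokens, n_requests=20):
    latencies = []
    prompt = "benchmark " * prompt_tokens            # crude fixed-length prompt
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(ENDPOINT, json={"prompt": prompt,
                                      "max_tokens": output_tokens}, timeout=120)
        latencies.append(time.perf_counter() - start)
    total_time = sum(latencies)
    return {
        "p50_latency_s": statistics.median(latencies),
        "throughput_tokens_per_s": n_requests * output_tokens / total_time,
    }

for cfg in CONFIGS:
    print(cfg, run_config(*cfg))
```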
It gives us immense pleasure to extend a warm welcome to you for the 2024 edition of the Workshop on Hot Topics in Cloud Computing Performance - HotCloudPerf 2024. Cloud computing represents one of the most significant transformations in the realm of IT infrastructure and usage. The adoption of global services within public clouds is on the rise, and the immensely lucrative global cloud market already sustains over 1 million IT-related jobs. However, optimizing the performance and efficiency of the IT services provided by both public and private clouds remains a considerable challenge. Emerging architectures, techniques, and real-world systems entail interactions with the computing continuum, serverless operation, everything as a service, complex workflows, auto-scaling and -tiering, etc. The extent to which traditional performance engineering, software engineering, and system design and analysis tools can contribute to understanding and engineering these emerging technologies is uncertain. The community requires practical tools and robust methodologies to address the hot topics in cloud computing performance effectively.
Large-scale, production-grade cloud applications are no longer black boxes for academic researchers. They are observable subjects under test in an increasing number of projects, with the aim to quantify and improve their runtime characteristics, including performance. With more meaningful measurements available, data-driven approaches have matured and advanced the knowledge in particular around conventional stateless workloads such as functions and containers. A few less explored areas still exist. They are fueled by the increasing number of atypical function deployments for instance in message brokers, in intelligent switches and in blockchains. This talk summarises reference architectures for large-scale applications, sometimes resulting in nation-scale deployments, discusses performance numbers in this context, and elaborates on whether more focus on performance is needed.
The top cloud providers offer more than a hundred serverless services, such as Function-as-a-Service and various ML-based services for speech-to-text, text-to-speech, or translation. Unfortunately, while the cloud provider SDKs simplify the usage of serverless services, they also lock users into the services of the respective provider. Moreover, the dynamic and heterogeneous nature of the underlying serverless infrastructure introduces other deficiencies for agile development, automated deployment, and efficient and effective execution of serverless workflow applications.
This work addresses these deficiencies by supporting the development, modeling, and running of serverless workflow applications that use various serverless managed services in federated serverless infrastructures. The main goal is to follow a "Code Once, Run Everywhere" approach, where developers code their "intents" and the runtime system then selects the specific deployment of end-point managed cloud services.
NVMe SSDs have become the de-facto storage choice for high-performance I/O-intensive workloads. Often, these workloads are run in a shared setting, such as multi-tenant clouds, where they share access to fast NVMe storage. In such a shared setting, ensuring quality of service among competing workloads can be challenging. To offer performance differentiation to I/O requests, various SSD-optimized I/O schedulers have been designed. However, many of them are either not publicly available or are yet to be proven in a production setting. Among the widely tested I/O schedulers available in the Linux kernel, Kyber has been shown to be one of the best-fit schedulers for SSDs due to its low CPU overheads and high scalability. However, Kyber has various configuration options, and there is limited knowledge on how to configure Kyber to improve applications' performance. In this paper, we systematically characterize how Kyber's configurations affect the performance of I/O workloads and how this effect differs with different file systems and storage devices. We report 11 observations and provide 5 guidelines, which indicate that (i) Kyber can deliver up to 26.3% lower read latency than the none scheduler with interfering write workloads; (ii) with a file system, Kyber can be configured to deliver up to 35.9% lower read latency at the cost of 34.5%-50.3% lower write throughput, allowing users to make a trade-off between read latency and write throughput; and (iii) Kyber leads to performance losses when used with multiple throughput-bound workloads and the SSD is not the bottleneck. Our benchmarking scripts and results are open-sourced and available at: https://github.com/stonet-research/hotcloudperf24-kyber-artifact-public.
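For context, Kyber's main tunables are its per-device target latencies exposed through sysfs; a minimal sketch of selecting the scheduler and setting these targets might look as follows (requires root; the device name and latency values are examples, not the paper's recommended settings):

```python
# Sketch: switch a block device to the Kyber scheduler and set its target
# latencies via sysfs (the knobs characterized in the paper).
from pathlib import Path

DEV = "nvme0n1"  # example device
queue = Path(f"/sys/block/{DEV}/queue")

(queue / "scheduler").write_text("kyber")                     # select Kyber
(queue / "iosched" / "read_lat_nsec").write_text("2000000")   # 2 ms read target
(queue / "iosched" / "write_lat_nsec").write_text("10000000") # 10 ms write target

print((queue / "scheduler").read_text().strip())
```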
Multicluster disaster recovery on cloud-native platforms such as Kubernetes usually replicates application data and Kubernetes resources to a safe recovery cluster. In the event of a disaster, Kubernetes resources are restored to the recovery cluster to recover the affected applications. We tested 10 popular Kubernetes applications using this naive approach, and 60% failed. Problems include data being restored in the wrong order, cluster-specific data being restored instead of being generated by the cluster, and others. These problems led us to design a recipe that enables disaster recovery for all Kubernetes applications. In this paper, we analyze the problems we encountered during the disaster recovery of Kubernetes applications and categorize applications based on their disaster-recovery behaviors. We present a recipe that groups, orders, and filters Kubernetes resources to enable disaster recovery. Finally, we evaluate the reliability and efficiency of the recipe. Our evaluation shows that the recipe achieves a 100% success rate of disaster recovery while adding mere seconds of overhead to the recovery time.
Sustainability has become a critical focus area across the technology industry, most notably in cloud data centers. In such shared-use computing environments, there is a need to account for the power consumption of individual users. Prior work on power prediction of individual user jobs in shared environments has often focused on workloads that stress a single resource, such as CPU or DRAM. These works typically employ a specific machine learning (ML) model to train and test on the target workload for high accuracy. However, modern workloads in data centers can stress multiple resources simultaneously, and cannot be assumed to always be available for training. This paper empirically evaluates the performance of various ML models under different model settings and training data assumptions for the per-job power prediction problem using a range of workloads. Our evaluation results provide key insights into the efficacy of different ML models. For example, we find that linear ML models suffer from poor prediction accuracy (as much as 25% prediction error), especially for unseen workloads. Conversely, non-linear models, specifically XGBoost and Random Forest, provide reasonable accuracy (7--9% error). We also find that data-normalization and the power-prediction model formulation affect the accuracy of individual ML models in different ways.
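A hedged sketch of the per-job power prediction setup with a non-linear model (synthetic data and an assumed feature set, not the paper's workloads or exact models):

```python
# Illustrative setup: a Random Forest trained on normalized resource-utilization
# features to predict per-job power, evaluated on a held-out split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# features: CPU util, DRAM bandwidth, disk I/O, network I/O (all synthetic)
X = rng.uniform(0, 1, size=(n, 4))
power = 80 + 120 * X[:, 0] + 30 * X[:, 1] ** 2 + 10 * X[:, 2] + rng.normal(0, 3, n)

X_train, X_test, y_train, y_test = train_test_split(X, power, random_state=0)
model = make_pipeline(StandardScaler(),
                      RandomForestRegressor(n_estimators=200, random_state=0))
model.fit(X_train, y_train)
err = mean_absolute_percentage_error(y_test, model.predict(X_test))
print(f"prediction error: {100 * err:.1f}%")
```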
Data centers have become an increasingly significant contributor to the global carbon footprint. In 2021, the global data center industry was responsible for around 1% of the worldwide greenhouse gas emissions. With more resource-intensive workloads, such as Large Language Models, gaining popularity, this percentage is expected to increase further. Therefore, it is crucial for data center service providers to become aware of and accountable for the sustainability impact of their design and operational choices. However, reducing the carbon footprint of data centers has been a challenging process due to the lack of comprehensive metrics, carbon-aware design tools, and guidelines for carbon-aware optimization. In this work, we propose FootPrinter, a first-of-its-kind tool that supports data center designers and operators in assessing the environmental impact of their data center. FootPrinter uses coarse-grained operational data, grid energy mix information, and discrete event simulation to determine the data center's operational carbon footprint and evaluate the impact of infrastructural or operational changes. FootPrinter can simulate days of operations of a regional data center on a commodity laptop in a few seconds, returning the estimated footprint with marginal error. By making this project open source, we hope to engage the community in the development of methodologies and tools for systematically assessing and exploring the sustainability of data centers.
The development of serverless scientific workflows is a complex and tedious procedure and raises several challenges, including how to compose workflow processing steps as serverless functions and how much memory to assign to each serverless function, which affects not only the computing resources but also the network communication to cloud storage. Merging multiple processing steps into a single serverless function (fusion) reduces the number of invocations, but restricts the developer to assigning the maximum required memory of all fused processing steps, which may increase the overall costs.
In this paper, we address the aforementioned challenges for the widely used Montage workflow. We created three different workflow implementations (fine, medium, and coarse) for two cloud providers, AWS and GCP, and deployed workflow functions with different memory assignments of 135 MB, 512 MB, and 1 GB (function deployments). Our experiments show that many Montage functions run cheaper and faster with more memory on both providers. Consequently, selecting the most cost-effective memory configuration, as opposed to the minimal memory, resulted in a reduction of the makespan by 67.27% on AWS and 10.93% on GCP. Applying the same to workflow implementations with fewer functions (coarse) led to a further reduction in the makespan by 24.98% on AWS and 12.96% on GCP, while simultaneously reducing the total cost by 5.33% and 1.99%, respectively. Surprisingly, the fastest implementation was the medium implementation executed on AWS.
Recent years have seen a resurgence of societal interest in Metaverses and virtual reality (VR), with large companies such as Meta and Apple investing multi-billion dollars into its future. With the recent developments in VR hardware and software, understanding how to operate these systems efficiently and with good performance becomes increasingly important. However, studying Metaverse and VR systems is challenging because publicly available data detailing the performance of these systems is rare. Moreover, collecting this data is labor-intensive because VR devices are end-user devices that are driven by human input. In this work, we address this challenge and work towards a workload trace archive for Metaverse systems. To this end, we design, implement, and validate librnr, a system to record and replay human input on VR devices, automating large parts of the process of collecting VR traces. We use librnr to collect 106 traces with a combined runtime of 7 hours from state-of-the-art VR hardware under a variety of representative scenarios. Through analysis of our initial results, we find that power use of VR devices can increase by up to 29% depending on the location of the VR device relative to the user-defined play area, and show that noticeable performance degradation can occur when network bandwidth drops below 100Mbps. Encouraging community adoption of both librnr and the emerging trace archive, we publish both according to FAIR data principles at https://github.com/atlarge-research/librnr.
Geo-distributed (GD) training is a machine-learning technique that uses geographically distributed data for model training. Like Federated Learning, geo-distributed machine learning can provide data privacy and also benefit from the cloud infrastructure provided by many vendors in multiple geographies. However, GD training suffers from multiple challenges such as performance degradation due to cross-geography low network bandwidth and high cost of deployment. Additionally, all major cloud vendors such as Amazon AWS, Microsoft Azure, and Google Cloud Platform provide services in several geographies. Hence, finding a high-performance as well as cost-effective cloud service provider and service for GD training is a challenge. In this paper, we present our evaluation of the performance and cost associated with training models in multi-cloud and multi-geography. We evaluate multiple deployment architectures using computing and storage services from multiple cloud vendors. The use of serverless instances in conjunction with virtual machines for model training is evaluated in this study. Additionally, we build and evaluate cost models for estimating the cost of distributed training of models in a multi-cloud environment. Our study shows that the judicious selection of cloud services and architecture might result in cost and performance gains.
As contemporary computing infrastructures evolve to include diverse architectures beyond traditional von Neumann models, the limitations of classical graph-based infrastructure and application modelling become apparent, particularly in the context of the computing continuum and its interactions with Internet of Things (IoT) applications. Hypergraphs prove instrumental in overcoming this obstacle by enabling the representation of computing resources and data sources irrespective of scale. This allows the identification of new relationships and hidden properties, supporting the creation of a federated, sustainable, cognitive computing continuum with shared intelligence. The paper introduces the HyperContinuum conceptual platform, which provides resource and applications management algorithms for distributed applications in conjunction with next-generation computing continuum infrastructures based on novel von Neumann computer architectures. The HyperContinuum platform outlines high-order hypergraph applications representation, sustainability optimization for von Neumann architectures, automated cognition through federated learning for IoT application execution, and adaptive computing continuum resources provisioning.
We propose a novel approach for resource demand profiling of resource-intensive monolithic workflows that consist of different phases. Workflow profiling aims to estimate the resource demands of workflows. Such estimates are important for workflow scheduling in data centers and enable the efficient use of available resources. Our approach considers the workflows as black boxes, in other words, our approach can fully rely on recorded system-level metrics, which is the standard scenario from the perspective of data center operators. Our approach first performs an offline analysis of a dataset of resource consumption values of different runs of a considered workflow. For this analysis, we apply the time series segmentation algorithm PELT and the clustering algorithm DBSCAN. This analysis extracts individual phases and the respective resource demands. We then use the results of this analysis to train a Hidden Markov Model in a supervised manner for online phase detection. Furthermore, we provide a method to update the resource demand profiles at run-time of the workflows based on this phase detection. We test our approach on Earth Observation workflows that process satellite data. The results imply that our approach already works in some common scenarios. On the other hand, for cases where the behavior of individual phases is changed too much by contention, we identify room and next steps for improvements.
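A minimal sketch of the offline analysis step, segmenting a resource-usage signal with PELT and grouping segments into phases with DBSCAN (the signal, penalty, and clustering parameters are illustrative assumptions, not the values used for the Earth Observation workflows):

```python
# Sketch: segment a synthetic CPU-usage trace with PELT, then cluster the
# per-segment means with DBSCAN to group segments into phases.
import numpy as np
import ruptures as rpt                 # pip install ruptures
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic trace with three phases: idle, compute-heavy, I/O-bound.
signal = np.concatenate([rng.normal(10, 2, 200),
                         rng.normal(85, 5, 300),
                         rng.normal(40, 3, 200)])

# PELT change-point detection on the 1-D signal.
breakpoints = rpt.Pelt(model="rbf").fit(signal).predict(pen=10)

# Represent each segment by its mean usage and cluster segments into phases.
bounds = [0] + breakpoints
segment_means = np.array([[signal[a:b].mean()] for a, b in zip(bounds[:-1], bounds[1:])])
phase_labels = DBSCAN(eps=5.0, min_samples=1).fit_predict(segment_means)

for (a, b), label in zip(zip(bounds[:-1], bounds[1:]), phase_labels):
    print(f"samples {a}-{b}: phase {label}, mean usage {signal[a:b].mean():.1f}")
```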
It is our great pleasure to welcome you to the twelfth edition of the International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2024, https://ltb2024.github.io/). This one-day workshop brings together software testing and software performance researchers, practitioners, and tool developers to discuss the challenges and opportunities of conducting research on load testing and benchmarking software systems, including theory, applications, and experiences. LTB 2024 includes 2 keynote talks, 4 research papers, and 2 industry presentations. The topics cover performance of serverless computing, performance and load testing, performance-driven culture, workload generation, workload tracing, benchmarking, and performance verification.
In the rapidly evolving fields of encryption and blockchain technologies, the efficiency and security of cryptographic schemes significantly impact performance. This paper introduces a comprehensive framework for continuous benchmarking in one of the most popular Rust cryptography libraries, fastcrypto. What makes our analysis unique is the realization that automated benchmarking is not just a performance monitoring and optimization tool, but can also be used for cryptanalysis and innovation discovery. Surprisingly, benchmarks can uncover spectacular security flaws and inconsistencies in various cryptographic implementations and standards, while at the same time identifying unique opportunities for innovation not previously known to science, such as providing (a) hints for novel algorithms, (b) indications for mix-and-match library functions that result in world-record speeds, and (c) evidence of biased or untested real-world algorithm comparisons in the literature.
Our approach transcends traditional benchmarking methods by identifying inconsistencies in multi-threaded code, which previously resulted in unfair comparisons. We demonstrate the effectiveness of our methodology in identifying the fastest algorithms for specific cryptographic operations, such as signing, while revealing hidden performance characteristics and security flaws. The process of continuous benchmarking allowed fastcrypto to break many speed records for crypto operations in the Rust language ecosystem. A notable discovery in our research is the identification of vulnerabilities and unfair speed claims due to missing padding checks in high-performance Base64 encoding libraries. We also uncover insights into algorithmic implementations, such as multi-scalar elliptic curve multiplications, which exhibit different performance gains when applied in different schemes and libraries; this was not evident in conventional benchmarking practices. Further, our analysis highlights bottlenecks in cryptographic algorithms where pre-computed tables can be strategically applied, accounting for L1 and L2 CPU cache limitations.
Our benchmarking framework also reveals that certain algorithmic implementations incur additional overheads due to serialization processes, necessitating a refined 'apples to apples' comparison approach. We identified unique performance patterns in some schemes, where efficiency scales with input size, aiding blockchain technologies in optimal parameter selection and data compression.
Crucially, continuous benchmarking serves as a tool for ongoing audit and security assurance. Variations in performance can signal potential security issues during upgrades, such as kleptography, hardware manipulation, or supply chain attacks. This was evidenced by critical private-key leakage vulnerabilities we found in one of the most popular EdDSA Rust libraries. By providing a dynamic and thorough benchmarking approach, our framework empowers stakeholders to make informed decisions, enhance security measures, and optimize cryptographic operations in an ever-changing digital landscape.
Effective isolation among workloads within a shared and possibly contended compute environment is crucial for industry and academia alike to ensure optimal performance and resource utilization. Modern ecosystems offer a wide range of approaches and solutions to ensure isolation for a multitude of different compute resources. Past experiments have verified the effectiveness of this resource isolation with micro benchmarks. The effectiveness of QoS isolation for intricate workloads beyond micro benchmarks, however, remains an open question.
This paper addresses this gap by introducing a specific example involving a database workload isolated using Cgroups from a disruptor contending for CPU resources. Despite the even distribution of CPU isolation limits among the workloads, our findings reveal a significant impact of the disruptor on the QoS of the database workload. To illustrate this, we present a methodology for quantifying this isolation, accompanied by an implementation incorporating essential instrumentation through eBPF.
This not only highlights the practical challenges in achieving robust QoS isolation but also emphasizes the need for additional instrumentation and realistic scenarios to comprehensively evaluate and address these challenges.
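For context, a minimal sketch of the kind of even CPU limits one might apply with cgroup v2 to a database workload and a disruptor (requires root; group names, PIDs, and quota values are illustrative, and the eBPF instrumentation described above is not shown):

```python
# Sketch (assumptions: cgroup v2 mounted at /sys/fs/cgroup, run as root):
# give the database workload and the disruptor equal CPU quotas.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")
QUOTA_US, PERIOD_US = 200_000, 100_000   # 2 CPUs' worth of time per period each

for name, pid in [("database", 1234), ("disruptor", 5678)]:  # example PIDs
    group = CGROUP_ROOT / name
    group.mkdir(exist_ok=True)
    (group / "cpu.max").write_text(f"{QUOTA_US} {PERIOD_US}")  # equal CPU limits
    (group / "cgroup.procs").write_text(str(pid))              # move process in

print("CPU limits applied; database QoS can now be measured under contention.")
```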
In modern, fast-paced, and highly autonomous software development teams, it is crucial to maintain a sustainable approach to all performance engineering activities, including performance testing. The high degree of autonomy often results in teams building their own frameworks that are not used consistently and may be abandoned due to a lack of support or integration with existing infrastructure, processes, and tools.
To address these challenges, we present a self-service performance testing platform based on open-source software that supports distributed load generation, historical results storage, and a notification system that triggers alerts in the Slack messenger. In addition, it integrates with GitHub Actions to enable developers to run load tests as part of their CI/CD pipelines.
We'd like to share some of the technical solutions and the details of the decision-making process behind the performance testing platform in a scale-up environment, our experience in building this platform and, most importantly, rolling it out to autonomous development teams and onboarding them into the continuous performance improvement process.
Application Performance Monitoring (APM) tools are used in the industry to gain insights, identify bottlenecks, and alert to issues related to software performance. The available APM tools generally differ in terms of functionality and licensing, but also in monitoring overhead, which should be minimized due to use in production deployments. One notable source of monitoring overhead is the instrumentation technology, which adds code to the system under test to obtain monitoring data. Because there are many ways how to instrument applications, we study the overhead of five different instrumentation technologies (AspectJ, ByteBuddy, DiSL, Javassist, and pure source code instrumentation) in the context of the Kieker open-source monitoring framework, using the MooBench benchmark as the system under test. Our experiments reveal that ByteBuddy, DiSL, Javassist, and source instrumentation achieve low monitoring overhead, and are therefore most suitable for achieving generally low overhead in the monitoring of production systems. However, the lowest overhead may be achieved by different technologies, depending on the configuration and the execution environment (e.g., the JVM implementation or the processor architecture). The overhead may also change due to modifications of the instrumentation technology. Consequently, if having the lowest possible overhead is crucial, it is best to analyze the overhead in concrete scenarios, with specific fractions of monitored methods and in the execution environment that accurately reflects the deployment environment. To this end, our extensions of the Kieker framework and the MooBench benchmark enable repeated assessment of monitoring overhead in different scenarios.
Cloud computing and cloud-native platforms have rendered runtime environments more malleable. Simultaneously, the growing demand for flexible and agile software applications and services has driven the emergence of self-adaptive architectures. These architectures, in turn, facilitate software performance modeling, tuning, optimization, and scaling in a continuous manner, blurring the boundary between development-time and run-time. Self-adaptive software employs feedback loop controllers inspired by control theory or variations of the Monitoring-Analysis-Planning-Acting (MAPE) architecture. Whether implemented in a centralized or decentralized manner, most controllers utilize performance models that are learned or tuned at run-time. This shift implies that software is designed to be observable and controllable during execution, presupposing the co-design of software applications and their runtime controllers.
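As a hedged illustration of such a feedback loop (the monitored metric, thresholds, and scaling actuator below are hypothetical, not part of the talk), a minimal MAPE-style controller can be sketched as follows:

    # Minimal MAPE-style feedback loop sketch; the metric source, thresholds,
    # and the scaling actuator are hypothetical placeholders.
    import random
    import time

    def monitor():
        # A real controller would query the runtime (e.g., a metrics API).
        return {"cpu_utilization": random.uniform(0.2, 0.95)}

    def analyze(metrics):
        cpu = metrics["cpu_utilization"]
        if cpu > 0.8:
            return "overloaded"
        if cpu < 0.3:
            return "underloaded"
        return "ok"

    def plan(symptom, replicas):
        if symptom == "overloaded":
            return replicas + 1
        if symptom == "underloaded" and replicas > 1:
            return replicas - 1
        return replicas

    def execute(replicas):
        # A real implementation would call the platform's scaling API here.
        print(f"target replicas: {replicas}")

    replicas = 2
    for _ in range(3):  # a few control iterations for illustration
        symptom = analyze(monitor())
        replicas = plan(symptom, replicas)
        execute(replicas)
        time.sleep(0.1)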
This talk commences with a succinct overview of the evolution of self-adaptive software, accentuating key milestones along the journey. Subsequently, recent advancements in software performance modeling at runtime and the role of learning-enabled performance management during software operation are presented.
Two recent works are highlighted: one focusing on constructing robust performance models to sustain continuous operation and deployment of cloud-native software, and the other on utilizing multimodal models for performance anomaly detection. The former supports cloud operations like continuous deployment of co-located applications, migration, consolidation of services, or scaling in response to workloads or interferences. The latter is tailored to support performance anomaly detection, localization, and identification of root causes, facilitating swift remediation of faults using generative AI. The final segment of the talk delves into current challenges in developing self-adaptive systems, presenting insights from a recent survey on the state of self-adaptive software in the industry and the challenges perceived by practitioners.
A new era opened at the end of the last century in the performance analysis research area, when an explicit and independent role started to be given to software in the performance analysis of computing systems. Indeed, software has moved from being a monolithic element, strictly dependent on the platform where it is deployed and exclusively aimed at producing values to parameterize a platform model, to becoming an independent model itself, with its own components and interactions. This change has impacted all fields of this research area, such as modeling languages, processes for the analysis and synthesis of software models, platform model parameterization, performance model solution techniques, interpretation of results, benchmarking, and performance testing. It also represented one of the triggers that led to the birth of a research community around the computing system performance issues strictly related to software aspects. Indeed, in 1998 the first ACM Workshop on Software and Performance (WOSP) took place, with the aim of bringing together researchers and practitioners from the software area with those from the performance area, so as to offer a playground where different skills and expertise could join and originate a new vision of the role of software in performance assessment. This talk attempts to reconstruct the road of software performance research from the first WOSP event in 1998 down to today's events (i.e., the ICPE conference, WOSP-C, and other workshops). The spirit of the talk is to observe the evolution of this research area, including successful and (apparently) unsuccessful directions. Some promising directions will be tentatively sketched by "standing on the shoulders of giants".
Server hardware is becoming more and more heterogeneous, with an increasingly diverse landscape of accelerators such as GPUs, FPGAs, or novel processing-in-memory (PIM) technologies. Designing and evaluating scheduling algorithms for these platforms is far from trivial due to accelerator-specific setup costs, compute capabilities, and other characteristics. In fact, many existing scheduling simulators only consider some of these characteristics, or only support a specific subset of accelerators. To overcome these challenges, we present HetSim, a modular simulator for task-based scheduling on heterogeneous hardware. HetSim enables research on online and offline scheduling and placement strategies for modern compute platforms that combine CPU cores with multiple GPU, FPGA, and PIM accelerators. It is efficient, fair, and compatible with a variety of common workload descriptions, output metrics, and visualization tools. We use HetSim to reproduce results from Alebrahim et al., and examine how accelerator characteristics affect the performance of various scheduling strategies. Our results indicate that ignoring accelerator characteristics during simulation is often detrimental, and that the ideal scheduling algorithm for a given workload may depend on the available accelerators and their characteristics. HetSim is available as open-source software.
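As a hedged illustration of why accelerator characteristics matter to placement decisions (this is a toy greedy policy, not HetSim's API or one of its scheduling strategies), consider a scheduler that accounts for per-accelerator setup cost and speed:

    # Illustrative sketch (not HetSim's API): a greedy scheduler that accounts for
    # accelerator-specific setup cost and compute speed when placing tasks.
    from dataclasses import dataclass

    @dataclass
    class Accelerator:
        name: str
        setup_cost: float   # one-off cost before the first task runs (arbitrary units)
        speedup: float      # relative compute speed vs. a baseline CPU core
        ready_at: float = 0.0
        used: bool = False

    def place(task_work: float, accelerators: list[Accelerator]) -> tuple[str, float]:
        """Pick the accelerator on which the task finishes earliest."""
        best = None
        for acc in accelerators:
            start = acc.ready_at + (0.0 if acc.used else acc.setup_cost)
            finish = start + task_work / acc.speedup
            if best is None or finish < best[1]:
                best = (acc, finish)
        acc, finish = best
        acc.ready_at, acc.used = finish, True
        return acc.name, finish

    accs = [Accelerator("cpu", 0.0, 1.0), Accelerator("gpu", 5.0, 8.0), Accelerator("pim", 1.0, 2.0)]
    for work in [4.0, 4.0, 16.0]:
        print(place(work, accs))

In this toy example the high setup cost makes the GPU unattractive for small tasks but the best choice once enough work arrives, which is exactly the kind of sensitivity a simulator should be able to capture.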
Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations. Yet, not all organizations may be inclined to publicly disclose such metadata.
We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when there is minimal original data available. With 30 or fewer available original data samples, the use of synthetic training data resulted in only a one percent reduction in performance model accuracy on average.
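As a minimal sketch of one ingredient of such an approach, the Laplace mechanism can be used to release a clipped runtime metric under differential privacy (the epsilon value and clipping bound below are assumptions, and the paper's data-synthesis step is not shown):

    # Illustrative sketch of the Laplace mechanism for releasing a runtime metric
    # under differential privacy; epsilon and the clipping bound are assumptions,
    # not the paper's pipeline.
    import numpy as np

    def privatize_runtime(runtime_s: float, epsilon: float = 1.0, clip_s: float = 3600.0) -> float:
        """Clip the metric to bound its sensitivity, then add Laplace noise."""
        clipped = min(max(runtime_s, 0.0), clip_s)
        sensitivity = clip_s  # one record can change the released value by at most clip_s
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return clipped + noise

    example_runtimes = [412.0, 1280.5, 95.3]  # hypothetical Spark job runtimes in seconds
    print([round(privatize_runtime(r), 1) for r in example_runtimes])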
While product-form queueing networks are effective in analyzing system performance, they encounter difficulties in scenarios involving internal concurrency. Moreover, the complexity introduced by synchronization delays challenges the accuracy of analytic methods. This paper proposes a novel approximation technique for closed fork-join systems, called MMT, which relies on a transformation into a mixed queueing network model for computational analysis. The approach replaces the fork and join with a probabilistic router and a delay station, introducing auxiliary open job classes to capture the influence of parallel computation and synchronization delay on the performance of the original job classes. Evaluation experiments show that the proposed method forecasts performance metrics more accurately than a classic method, the Heidelberger-Trivedi transformation. This suggests that our method could serve as a promising alternative for evaluating queueing networks that contain fork-join systems.
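As a hedged illustration of the synchronization delay that such a delay station is meant to capture (a textbook special case, not the MMT approximation itself): if a job forks into k independent branches, each with exponentially distributed service time of rate \mu and no queueing, the join must wait for the slowest branch, so the expected fork-join response time is the expected maximum of k exponentials,

    \mathbb{E}[T_{\mathrm{fork\text{-}join}}] \;=\; \frac{1}{\mu}\sum_{i=1}^{k}\frac{1}{i} \;=\; \frac{H_k}{\mu},

which exceeds the single-branch mean 1/\mu by the harmonic factor H_k; this growing gap is the kind of extra delay that the auxiliary delay station in such transformations is intended to approximate.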
Performance Engineering still needs to be adopted in many organizations. At the same time, user expectations for fast and reliable applications are increasing. This paper discusses different approaches to establishing a performance engineering culture. After highlighting some of the challenges that hold businesses back from making performance a shared responsibility, we convey a success story about how performance became a matter for everyone in a large European bank.
Efficiency has always been at the core of software performance engineering research. Many aspects that have been addressed in performance engineering for decades are gaining popularity under the umbrella of Green IT and Green Software Engineering. Engineers and marketers in the industry are looking for ways to measure how green (in terms of carbon dioxide emissions) their software products are. Proxy measures have been proposed, such as hosting cost or the power consumption of the hardware environment on which the software is running. In environments where a software system runs on a dedicated server instance, this may make sense, but in virtualised, containerised or serverless environments, it is necessary to find ways of allocating the energy consumption of the entire server to the software components that share the same infrastructure. This paper proposes the use of resource demand measurements as a basis for measuring how green a given software system actually is.
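As a hedged sketch of what such an allocation could look like (the proportional rule and the numbers below are illustrative assumptions, not the paper's method), server energy can be attributed to co-located components in proportion to their measured resource demands:

    # Illustrative sketch: attribute measured server energy to co-located software
    # components in proportion to their measured resource demands (here CPU time).
    # The attribution rule and the example numbers are assumptions for illustration.
    def attribute_energy(server_energy_wh: float, cpu_seconds_by_component: dict[str, float]) -> dict[str, float]:
        total = sum(cpu_seconds_by_component.values())
        return {name: server_energy_wh * demand / total
                for name, demand in cpu_seconds_by_component.items()}

    print(attribute_energy(120.0, {"service-a": 5400.0, "service-b": 1800.0, "idle/other": 1800.0}))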