ICPE '22: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering


SESSION: Keynote Talks

Data Science Driven Methods for Sustainable and Failure Tolerant Edge Systems

Nowadays we are experiencing a paradigm shift in our society, where every item around us is becoming a computer, facilitating life-changing applications like self-driving cars, telemedicine, precision agriculture, or virtual reality. On the one hand, the execution of such resource-demanding applications requires powerful IT facilities. On the other hand, the requirements often include latencies below 100 ms or even below 10 ms, the so-called "tactile internet". To achieve such low latency, computation has to be placed in the vicinity of the end users by utilizing the concept of Edge Computing. In this talk we explain the challenges of Edge systems in combination with the tactile internet. We discuss recent problems of geographically distributed machine learning applications and novel approaches to balance competing priorities such as energy efficiency and the staleness of the machine learning models. Available failure resilience mechanisms designed for Cloud computing or generic distributed systems cannot be applied to Edge systems due to timeliness requirements, hyper-heterogeneity, and resource scarcity. Therefore, we discuss a novel machine-learning-based mechanism that evaluates the failure resilience of a service deployed redundantly on the edge infrastructure. Our approach learns the spatiotemporal dependencies between edge server failures and combines them with topological information to incorporate link failures, utilizing the concept of Dynamic Bayesian Networks (DBNs). Eventually, we infer the probability that a certain set of servers fails or disconnects concurrently during service runtime.
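
As a toy illustration (not the speakers' DBN-based mechanism), the sketch below estimates by Monte-Carlo simulation the probability that a given set of edge servers is down concurrently, under an invented model in which a failed neighbour raises a server's failure probability in the next step; the topology, rates, and recovery probability are all assumptions.

```python
# Toy Monte-Carlo estimate (not the DBN mechanism itself): servers fail with a base
# rate, a failed neighbour in the previous step raises that rate, and failed servers
# recover with a fixed probability. Topology and all rates are invented.
import random

NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}        # hypothetical edge-site topology
BASE_FAIL, NEIGHBOUR_BOOST, RECOVER = 0.01, 0.10, 0.5
STEPS, RUNS = 100, 5_000

def concurrent_failure_prob(target=(0, 1)):
    hits = 0
    for _ in range(RUNS):
        down = {s: False for s in NEIGHBOURS}
        seen = False
        for _ in range(STEPS):
            prev = dict(down)
            for s in down:
                if prev[s]:
                    down[s] = random.random() > RECOVER      # fails to recover
                else:
                    p = BASE_FAIL + NEIGHBOUR_BOOST * sum(prev[n] for n in NEIGHBOURS[s])
                    down[s] = random.random() < p
            seen = seen or all(down[s] for s in target)
        hits += seen
    return hits / RUNS

print(concurrent_failure_prob())   # P(servers 0 and 1 ever down concurrently in a run)
```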

Performance Optimization of HPC Applications in Large-Scale Cluster Systems

In modern HPC clusters, the performance of an application is a combination of several aspects. To successfully improve application performance, all of these aspects should be analyzed and optimized. In particular, as modern CPUs contain more and more cores, the speed of floating-point computation has increased rapidly, making data access one of the main bottlenecks in most HPC applications. Furthermore, performance diagnosis for HPC applications can be extremely complex, and the performance bottlenecks may vary with the scale of parallelism. In this presentation, a multi-layered data access (MLDA) optimization methodology is introduced, which developers can follow to optimize their HPC applications. We provide several examples of applying the MLDA method to real-world HPC applications from the weather, ocean, materials science, CFD, and MHD domains.

SESSION: Session 1: Service and Cloud Computing

LongTale: Toward Automatic Performance Anomaly Explanation in Microservices

Performance troubleshooting is notoriously difficult for distributed microservices-based applications. A typical root-cause diagnosis for a performance anomaly starts by narrowing down the scope to the slow services, then investigates high-level performance metrics or available logs in the slow components, and finally drills down to an actual cause. This process can be long, tedious, and sometimes aimless due to the lack of domain knowledge and the sheer number of possible culprits. This paper introduces a new machine-learning-driven performance analysis system called LongTale that automates the troubleshooting process for latency-related performance anomalies to facilitate root-cause diagnosis and explanation. LongTale builds on existing application-layer tracing in two significant ways. First, it stitches application-layer traces with corresponding system stack traces, which enables more informative root-cause analysis. Second, it utilizes a novel machine-learning-driven analysis that feeds on the combined data to automatically uncover the most likely contributing factor(s) for a given performance slowdown. We demonstrate how LongTale can be utilized in different scenarios, including abnormal long-tail latency explanation and performance interference analysis.

An Empirical Study of Service Mesh Traffic Management Policies for Microservices

A microservice architecture features hundreds or even thousands of small, loosely coupled services with multiple instances. Because microservice performance depends on many factors, including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies, such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and to increase the robustness of communication between them. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of the various tuning parameters for circuit breaking and retries is poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically, and they also open up new areas of research for academics working on service meshes for (autonomic) microservice resource management.
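
For readers unfamiliar with the knobs under study, the sketch below mirrors, as Python dictionaries, the two Istio resources involved: a DestinationRule with connection-pool limits and outlier detection (circuit breaking), and a VirtualService with bounded retries. The field names follow Istio's documented schema; the host name and all values are illustrative placeholders, not the configurations evaluated in the paper.

```python
# Circuit breaking in Istio: a DestinationRule with connection-pool limits and
# outlier detection. Values below are placeholders, not the paper's settings.
circuit_breaker = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "DestinationRule",
    "metadata": {"name": "reviews-cb"},
    "spec": {
        "host": "reviews",                           # hypothetical service name
        "trafficPolicy": {
            "connectionPool": {"http": {"http1MaxPendingRequests": 32,
                                        "http2MaxRequests": 64}},
            "outlierDetection": {"consecutive5xxErrors": 5,
                                 "interval": "10s",
                                 "baseEjectionTime": "30s",
                                 "maxEjectionPercent": 50},
        },
    },
}

# Retries in Istio: a VirtualService with a bounded number of attempts and a
# per-try timeout.
retries = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "reviews-retry"},
    "spec": {
        "hosts": ["reviews"],
        "http": [{"route": [{"destination": {"host": "reviews"}}],
                  "retries": {"attempts": 3,
                              "perTryTimeout": "2s",
                              "retryOn": "5xx,connect-failure"}}],
    },
}
```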

Best Practices for HPC Workloads on Public Cloud Platforms: A Guide for Computational Scientists to Use Public Cloud for HPC Workloads

HPC (high performance computing) applications come with a variety of requirements for computation, communication, and storage, and many of these requirements can be met with commodity technology available in public clouds. In this article, we report results for several well-known HPC applications on the IBM public cloud, and we describe best practices for running such applications on cloud systems in general. Our results show that public clouds are not only ready for HPC workloads, but they can provide performance comparable to, and in some cases better than, current supercomputers.

NVMe Virtualization for Cloud Virtual Machines

Public clouds are rapidly moving to support Non-Volatile Memory Express (NVMe) based storage to meet the ever-increasing I/O throughput and latency demands of modern workloads. They provide NVMe storage through virtual machines (VMs), where multiple VMs running on a host may share a physical NVMe device. The virtualization method used to share the NVMe capability has important performance, usability, and security implications. In this paper, we propose three NVMe storage virtualization methods: PCI device passthrough, a virtual block device method, and a Storage Performance Development Kit (SPDK) virtual host target method. We evaluate these virtualization methods in terms of performance, scalability, CPU overhead, technology maturity, security, and availability, with the goal of using one or more of them in the IBM public cloud.
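
A generic way to compare such methods from inside the guest is to run an identical fio job against whichever block device each method exposes. The sketch below collects IOPS and mean latency; the device paths, job parameters, and the fio 3.x JSON field layout are assumptions, not the paper's setup.

```python
# Run the same fio random-read job against the block devices exposed by different
# virtualization methods and compare the resulting IOPS and mean latency.
import json, subprocess

def run_fio(device: str) -> dict:
    cmd = [
        "fio", "--name=randread", "--filename=" + device,
        "--ioengine=libaio", "--direct=1", "--rw=randread",
        "--bs=4k", "--iodepth=32", "--numjobs=4",
        "--time_based", "--runtime=60", "--group_reporting",
        "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    read = json.loads(out)["jobs"][0]["read"]          # fio 3.x JSON layout assumed
    return {"iops": read["iops"], "lat_us_mean": read["lat_ns"]["mean"] / 1000}

# e.g. a passthrough NVMe namespace vs. a paravirtual block device (placeholders)
for dev in ["/dev/nvme1n1", "/dev/vdb"]:
    print(dev, run_fio(dev))
```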

SESSION: Session 2: GPUs and Heterogeneous Platforms

Exploring the Use of Novel Spatial Accelerators in Scientific Applications

Driven by the need to find alternative accelerators that can viably replace graphics processing units (GPUs) in next-generation supercomputing systems, this paper proposes a methodology to enable agile application/hardware co-design. The application-first methodology provides the ability to arrive at accelerator designs while working with real-world workloads, available accelerators, and system software. The iterative design process targets a set of kernels in a workload for performance estimates that can prune the design space for later phases of detailed architectural evaluation. To this effect, a novel data-parallel device model is introduced that simulates the latency of performance-sensitive operations in an accelerator, including data transfers and kernel computation, using multi-core CPUs. The use of off-the-shelf simulators, such as the pre-RTL simulator Aladdin or the tools available for exploring the design of deep neural network accelerators (e.g., Timeloop), is demonstrated for the evaluation of various accelerator designs using applications with realistic inputs. Examples of multiple device configurations that are instantiable in a system are explored to evaluate the performance benefit of deploying novel accelerators. The proposed device is integrated with a programming model and system software to potentially explore the impact of high-level programming languages/compilers and low-level effects such as task scheduling on multiple accelerators. We analyze our methodology for a set of applications that represent high-performance computing (HPC) and graph analytics. The applications include a computational chemistry kernel realized using tensor contractions, triangle counting, GraphSAGE, and breadth-first search. These applications include kernels such as dense matrix-dense matrix multiplication, sparse matrix-sparse matrix multiplication, and sparse matrix-dense vector multiplication. Our results indicate potential performance benefits and insights for system design when accelerators that realize these kernels are included alongside general-purpose accelerators.

Extending SYCL's Programming Paradigm with Tensor-based SIMD Abstractions

Heterogeneous computing has emerged as an important method for supporting more than one kind of processor or accelerator in a program. There is generally a trade-off between source-code portability and device performance in heterogeneous programming. Thus, new programming abstractions that help programmers reduce their development effort while minimizing performance penalties are extremely valuable.

The Khronos SYCL standard defines an abstract single-program-multiple-data (SPMD) programming model for heterogeneous computing. This paper presents a language extension on top of the SYCL standard to enable greater flexibility for programmers. We introduce a set of single-instruction-multiple-data (SIMD) abstractions based on multi-dimensional arrays (Tensors) in conjunction with the existing SPMD programming paradigm.

Our work is based on the C++ language and a new set of LLVM intermediate representation (IR) constructs for representing SIMD programs. It also includes a set of custom optimization passes that perform instruction lowering, automatic address allocation, and synchronization insertion. We show how our work can be used in conjunction with conventional SYCL SPMD programming for benchmarks such as general matrix multiplication (GEMM) and lower-upper (LU) inverse, and we evaluate its hardware utilization and performance.

Oversubscribing GPU Unified Virtual Memory: Implications and Suggestions

Recent GPU architectures support unified virtual memory (UVM), which offers great opportunities to solve larger problems through memory oversubscription. Although some studies have raised concerns about performance degradation under UVM oversubscription, the reasons behind workloads' diverse sensitivities to oversubscription are still unclear. In this work, we take the first step of selecting various benchmark applications and conducting rigorous experiments on their performance under different oversubscription ratios. Specifically, we take into account the variety of memory access patterns and explain applications' diverse sensitivities to oversubscription. We also consider prefetching and UVM hints, and discover their complex impact under different oversubscription ratios. Moreover, the strengths and pitfalls of UVM's multi-GPU support are discussed. We expect that this paper will provide useful experiences and insights for UVM system design.

Isolating GPU Architectural Features Using Parallelism-Aware Microbenchmarks

GPUs develop at a rapid pace, with new architectures emerging every 12 to 18 months. Every new GPU architecture introduces new features that are expected to improve on previous generations. However, the impact of these changes on the performance of GPGPU applications may not be directly apparent; it is often unclear to developers how exactly these features will affect the performance of their code. In this paper we propose a suite of microbenchmarks to uncover the performance of novel GPU hardware features in isolation. We target features in both the memory system and the arithmetic cores. We further ensure, by design, that our microbenchmarks capture the massively parallel nature of GPUs, while providing fine-grained timing information at the level of individual compute units. Using this benchmarking suite, we study the differences between three of the most recent NVIDIA architectures: Pascal, Turing, and Ampere. We find that the architectural differences can have a meaningful impact on both synthetic and more realistic applications. This impact is visible not only in outright performance, but also in the choice of execution parameters for realistic applications. We conclude that microbenchmarking, adapted to massive GPU parallelism, can expose differences between GPU generations, and we discuss how it can be adapted for future architectures.

SESSION: Session 3: Empirical Studies of Performance

Same, Same, but Dissimilar: Exploring Measurements for Workload Time-series Similarity

Benchmarking is a core element in the toolbox of most systems researchers and is used for analyzing, comparing, and validating complex systems. In the quest for reliable benchmark results, a consensus has formed that a significant experiment must be based on multiple runs. To interpret these runs, the mean and standard deviation are often used. For experiments where each run produces a time series, however, comparing runs by their means is not straightforward and not necessarily statistically sound, as it ignores the possibility of significant differences between runs with a similar average. To verify this hypothesis, we conducted a survey of 1,112 publications of selected performance engineering and systems conferences, canvassing open data sets from performance experiments. The three data sets we identified rely purely on average and standard deviation. Therefore, we propose a novel analysis approach based on similarity analysis to enhance the reliability of performance evaluations. Our approach evaluates 12 (dis-)similarity measures with respect to their applicability to analysing performance measurements and identifies four suitable similarity measures. We validate our approach by demonstrating the increase in reliability for the data sets found in the survey.
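
To make the idea concrete, the sketch below compares two synthetic benchmark runs that have nearly the same mean but are shifted in time, using three common (dis-)similarity measures: Euclidean distance, Pearson correlation, and a plain dynamic-time-warping distance. The choice of measures here is illustrative; the four measures the paper ultimately recommends are not reproduced.

```python
# Comparing two runs as time series rather than by their means: Euclidean distance,
# Pearson correlation, and a plain O(n*m) DTW. The synthetic runs have similar
# averages but a phase shift that the measures expose.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def dtw(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

run1 = np.sin(np.linspace(0, 6, 200)) + np.random.default_rng(0).normal(0, 0.05, 200)
run2 = np.sin(np.linspace(0, 6, 200) - 0.3) + np.random.default_rng(1).normal(0, 0.05, 200)
print(round(run1.mean(), 3), round(run2.mean(), 3))      # similar averages...
print(euclidean(run1, run2), pearson(run1, run2), dtw(run1, run2))   # ...different runs
```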

Studying the Performance Risks of Upgrading Docker Hub Images: A Case Study of WordPress

The Docker Hub repository contains Docker images of applications, which allow users to do in-place upgrades to benefit from the latest released features and security patches. However, prior work showed that upgrading a Docker image not only changes the main application, but can also change many dependencies. In this paper, we present a methodology to study the performance impact of upgrading the Docker Hub image of an application, focusing on changes to dependencies. We demonstrate our methodology through a case study of 90 official images of the WordPress application. Our study shows that, in most cases, Docker image users should be cautious and conduct a performance test before upgrading to a newer Docker image. Our methodology can assist them in better understanding the performance risks of such upgrades, and helps them decide how thorough such a performance test should be.
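
A minimal before/after harness of the kind such a performance test might use is sketched below: it starts each image tag, replays requests against an endpoint, and reports latency percentiles. The image tags, port mapping, and endpoint are placeholders, and a real WordPress deployment additionally needs a database container, which is omitted here.

```python
# Start each image tag, measure request latencies against an assumed HTTP endpoint,
# and report median/p95. All names and parameters are placeholders.
import statistics, subprocess, time, urllib.error, urllib.request

def measure(image, port=8080, requests=200):
    cid = subprocess.run(["docker", "run", "-d", "-p", f"{port}:80", image],
                         capture_output=True, text=True, check=True).stdout.strip()
    time.sleep(10)                                   # crude wait for startup
    latencies = []
    try:
        for _ in range(requests):
            t0 = time.perf_counter()
            try:
                urllib.request.urlopen(f"http://localhost:{port}/", timeout=5).read()
                latencies.append(time.perf_counter() - t0)
            except urllib.error.URLError:
                pass                                 # count only successful requests
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)
    if len(latencies) < 2:
        return None, None
    return statistics.median(latencies), statistics.quantiles(latencies, n=100)[94]

for tag in ["wordpress:5.8", "wordpress:5.9"]:       # example old/new tags
    med, p95 = measure(tag)
    print(tag, med, p95)
```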

Why Is It Not Solved Yet?: Challenges for Production-Ready Autoscaling

Autoscaling is a task of major importance in the cloud computing domain, as it directly affects both operating costs and customer experience. Although there has been active research in this area for over ten years, there is still a significant gap between the methods proposed in the literature and the autoscalers deployed in practice. Hence, many research autoscalers do not find their way into production deployments. This paper describes six core challenges that arise in production systems and that are still not solved by most research autoscalers. We illustrate these problems through experiments in a realistic cloud environment with a real-world multi-service business application and show that commonly used autoscalers have various shortcomings. In addition, we analyze the behavior of overloaded services and show that they can be problematic for existing autoscalers. Overall, we find that these challenges are insufficiently addressed in the literature and conclude that future scaling approaches should focus on the needs of production systems.
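
To make the baseline concrete, the sketch below shows a bare-bones reactive autoscaler of the kind commonly deployed: a target-utilization rule (the same proportional formula used by Kubernetes' Horizontal Pod Autoscaler) plus a cooldown. The metric and replica accessors are hypothetical stubs, and the thresholds are assumptions.

```python
# Bare-bones reactive autoscaler: proportional target-utilization rule plus a
# cooldown. get_avg_cpu / get_replicas / set_replicas are hypothetical stubs for a
# monitoring system and an orchestrator API.
import math, time

TARGET_CPU, MIN_REPLICAS, MAX_REPLICAS, COOLDOWN_S = 0.6, 1, 20, 60

def desired_replicas(current: int, avg_cpu: float) -> int:
    want = math.ceil(current * avg_cpu / TARGET_CPU)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, want))

def control_loop(get_avg_cpu, get_replicas, set_replicas, interval_s=15):
    last_change = 0.0
    while True:
        current = get_replicas()
        want = desired_replicas(current, get_avg_cpu())
        if want != current and time.time() - last_change > COOLDOWN_S:
            set_replicas(want)
            last_change = time.time()
        time.sleep(interval_s)

print(desired_replicas(current=4, avg_cpu=0.9))   # -> 6 replicas
```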

Evaluating the Scalability and Elasticity of Function as a Service Platform

Function as a Service (FaaS) is a new software technology with promising features such as automated resource management and auto-scaling. Since these operational aspects are transparent, software engineers may not fully understand the scaling characteristics and limitations of this technology, and this lack of information can lead to undesired performance results. To address these concerns, we perform a study to characterize FaaS scalability under intensive workloads on three popular FaaS cloud platforms, namely Amazon AWS Lambda, IBM Cloud Functions, and Azure Functions. We also study a workload-smoother design pattern to examine whether it enhances overall FaaS performance. The results show that different FaaS platforms adopt distinct scaling strategies and that, by applying a workload smoother, software engineers can achieve 99-100% success rates, compared to 60-80% when the FaaS system is saturated.
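
One plausible realization of such a workload smoother is sketched below: bursty submissions are buffered in a queue and dispatched to the platform at a bounded rate. The invoke_function callable stands in for a provider SDK call, and the rate limit is an assumption rather than a value from the paper.

```python
# Buffer bursty arrivals and pace invocations so the FaaS platform's scaling
# limits are not overrun. `invoke_function` is a placeholder for the provider SDK.
import queue, threading, time

class WorkloadSmoother:
    def __init__(self, invoke_function, max_per_second: float):
        self.invoke = invoke_function
        self.interval = 1.0 / max_per_second
        self.buffer = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, payload):
        self.buffer.put(payload)          # accept bursts immediately

    def _drain(self):
        while True:
            payload = self.buffer.get()   # blocks until work is available
            self.invoke(payload)
            time.sleep(self.interval)     # pace invocations

smoother = WorkloadSmoother(lambda p: print("invoke", p), max_per_second=50)
for i in range(10):
    smoother.submit({"job": i})
time.sleep(1)
```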

SESSION: Session 4: Machine Learning and Performance

Machine Learning based Interference Modelling in Cloud-Native Applications

Cloud-native applications are often composed of lightweight containers and conform to the microservice architecture. Cloud providers offer platforms for container hosting and orchestration. These platforms reduce the level of support required from the application owner as operational tasks are delegated to the platform. Furthermore, containers belonging to different applications can be co-located on the same virtual machine to utilize resources more efficiently. Given that there are underlying shared resources and consequently potential performance interference, predicting the level of interference before deciding to share virtual machines can avoid undesirable performance deterioration. We propose a lightweight performance interference modelling technique for cloud-native microservices. The technique constructs ML models for response time prediction and can dynamically account for changing runtime conditions through the use of a sliding window method. We evaluate our technique against realistic microservices on AWS EC2. Our technique outperforms baseline and competing techniques in MAPE by at least 1.45% and at most 92.04%.
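
A minimal sketch in the spirit of this description is shown below: a linear model predicting response time from co-located containers' resource metrics, refitted over a sliding window so that it tracks changing runtime conditions. The features, window size, and linear form are assumptions, not the paper's model.

```python
# Sliding-window response-time regressor: keep only the most recent observations
# and refit, so predictions adapt to changing interference conditions.
import numpy as np
from collections import deque

class SlidingWindowModel:
    def __init__(self, window=200):
        self.X, self.y = deque(maxlen=window), deque(maxlen=window)
        self.coef = None

    def observe(self, features, response_time):
        self.X.append(np.append(features, 1.0))     # bias term
        self.y.append(response_time)
        A, b = np.array(self.X), np.array(self.y)
        self.coef, *_ = np.linalg.lstsq(A, b, rcond=None)

    def predict(self, features):
        return float(np.append(features, 1.0) @ self.coef)

m = SlidingWindowModel()
rng = np.random.default_rng(0)
for _ in range(300):
    f = rng.uniform(0, 1, 3)                        # e.g. neighbours' CPU, mem, net
    m.observe(f, 10 + 30 * f[0] + 5 * f[2] + rng.normal(0, 1))
print(m.predict(np.array([0.8, 0.2, 0.5])))         # roughly 10 + 24 + 2.5
```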

Performance Model and Profile Guided Design of a High-Performance Session Based Recommendation Engine

Session-based recommendation (SBR) systems are widely used in transactional systems to make personalized recommendations to the end user. In online retail systems, recommendation-based decisions need to be made at a very high rate, especially during peak hours, and the required computational workload is very high, especially when a large number of products is involved. SBR models learn product buying patterns from various user interaction sessions and try to recommend the top-K products the user is likely to purchase. These models comprise several functional layers that vary widely in their compute and data access patterns. To support high recommendation rates, all of these layers need a performance-optimal implementation, which can be a challenge given the diverse nature of the computations involved. For this reason, a single compute platform, whether CPU, GPU, or a Field Programmable Gate Array (FPGA), may not be able to provide an optimal implementation for all the layers. In this paper, we describe a performance-modeling and profile-based design approach to arrive at an optimal implementation, comprising hybrid CPU, GPU, and FPGA platforms, for NISER, a session-based recommendation model that avoids popularity bias in recommendations. In addition, the design for the CPU-FPGA hybrid platform is implemented for NISER, and we observe that the experimental results closely follow the results predicted by the performance model for the implemented deployment option.
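
As a toy version of such a profile-guided mapping decision, the sketch below takes per-layer latency estimates for each device plus a fixed device-crossing cost and picks the cheapest device sequence with dynamic programming. The layer names, numbers, and transfer cost are invented; the paper models NISER's actual layers and measured CPU/GPU/FPGA profiles.

```python
# Choose a device per layer to minimize total latency, given per-layer device
# latencies and a fixed cost for handing data between devices. All numbers are
# hypothetical.
LAYERS = ["embedding", "gnn", "attention", "scoring"]
DEVICES = ["cpu", "gpu", "fpga"]
TIME_MS = {
    "embedding": {"cpu": 2.0, "gpu": 1.5, "fpga": 0.8},
    "gnn":       {"cpu": 9.0, "gpu": 2.0, "fpga": 3.5},
    "attention": {"cpu": 4.0, "gpu": 1.2, "fpga": 2.5},
    "scoring":   {"cpu": 1.0, "gpu": 0.9, "fpga": 0.4},
}
TRANSFER_MS = 0.6                 # cost of moving activations between devices

def best_mapping():
    # dp[d] = (total time, device sequence) for the prefix ending on device d
    dp = {d: (TIME_MS[LAYERS[0]][d], [d]) for d in DEVICES}
    for layer in LAYERS[1:]:
        new = {}
        for d in DEVICES:
            prev = min((t + (TRANSFER_MS if p != d else 0.0), path)
                       for p, (t, path) in dp.items())
            new[d] = (prev[0] + TIME_MS[layer][d], prev[1] + [d])
        dp = new
    return min(dp.values())

total, mapping = best_mapping()
print(dict(zip(LAYERS, mapping)), round(total, 1), "ms")
```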

The Cost of Reinforcement Learning for Game Engines: The AZ-Hive Case-study

Although utilising computers to play board games has been a topic of research for many decades, recent rapid developments in reinforcement learning, like AlphaZero and its variants, have brought unprecedented progress in games such as chess and Go. However, the efficiency of this process remains unknown. In this work, we analyse the cost and efficiency of the AlphaZero approach when building a new game engine. Specifically, we present our experience building AZ-Hive, an AlphaZero-based playing engine for the game of Hive. Using only the rules of the game and a quality-of-play assessment, AZ-Hive learns to play the game from scratch. Getting AZ-Hive up and running requires encoding the game in AlphaZero, i.e., capturing the board, the game state, the rules, and the assessment of play quality. Different encodings lead to significantly different AZ-Hive engines, with very different performance results. We therefore propose a design space for configuring AZ-Hive and demonstrate the costs and benefits of different configurations in this space. We find that different configurations lead to more or less competitive playing engines, but training and evaluating such engines is prohibitively expensive. Moreover, no systematic, efficient exploration or pruning of the space is possible. In turn, an exhaustive exploration can easily take tens of training years.

SESSION: Session 5: Hardware Performance

Alternating Blind Identification of Power Sources for Mobile SoCs

The need for faster Systems on Chip (SoCs) has accelerated scaling trends, leading to a considerable increase in power density and raising critical power and thermal challenges. The ability to measure the power consumption of different hardware units is essential for the operation and improvement of mobile SoCs, as well as for enhancing the power efficiency of the software that runs on them. SoCs are usually equipped with embedded thermal sensors to measure temperature at the hardware-unit level; however, they lack the ability to sense power. In this paper we introduce Alternating Blind Identification of Power sources (Alternating-BPI), a technique that accurately estimates the power consumption of individual SoC units without the use of any design-based models. The proposed technique uses a novel approach to blindly identify the sources of power consumption, relying only on the measurements from the embedded thermal sensors and the total power consumption. The accuracy and applicability of the proposed technique were verified using simulation and experimental data. Alternating-BPI is able to estimate power at the SoC hardware-unit level with up to 98.1% accuracy. Furthermore, we demonstrate the applicability of the proposed technique on a commercial SoC and provide a fine-grained analysis of the power profiles of CPU and GPU apps, as well as Artificial Intelligence (AI), Virtual Reality (VR), and Augmented Reality (AR) apps. Additionally, we demonstrate that the proposed technique can be used to estimate per-process power consumption by relying on the estimated per-unit power numbers and per-unit hardware utilization numbers. The analysis provided by the proposed technique gives useful insights into the power efficiency of the different hardware units on a state-of-the-art commercial SoC.
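
The last step mentioned above can be illustrated with a small sketch: once per-unit power estimates exist, attribute each unit's power to processes in proportion to their share of that unit's utilization. The numbers and the proportional-attribution rule are illustrative assumptions, not Alternating-BPI itself.

```python
# Attribute estimated per-unit power to processes proportionally to each process's
# share of the unit's utilization. Units, powers, and utilizations are invented.
UNIT_POWER_W = {"big_cpu": 1.8, "little_cpu": 0.6, "gpu": 2.4}   # estimated per unit

UTIL = {                                  # per-unit utilization per process
    "camera_app": {"big_cpu": 0.30, "little_cpu": 0.10, "gpu": 0.50},
    "background": {"big_cpu": 0.10, "little_cpu": 0.40, "gpu": 0.05},
}

def per_process_power(unit_power, util):
    totals = {u: sum(proc[u] for proc in util.values()) for u in unit_power}
    return {
        name: sum((unit_power[u] * proc[u] / totals[u]) if totals[u] else 0.0
                  for u in unit_power)
        for name, proc in util.items()
    }

print(per_process_power(UNIT_POWER_W, UTIL))
```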

Memory Performance of AMD EPYC Rome and Intel Cascade Lake SP Server Processors

Modern processors, in particular within the server segment, integrate more cores with each generation. This increases their complexity in general, and that of the memory hierarchy in particular. Software executed on such processors can suffer from performance degradation when data is distributed disadvantageously over the available resources. To optimize data placement and access patterns, an in-depth analysis of the processor design and its implications for performance is necessary. This paper describes and experimentally evaluates the memory hierarchy of AMD EPYC Rome and Intel Xeon Cascade Lake SP server processors in detail. Their distinct microarchitectures cause different performance patterns for memory latencies, in particular for remote cache accesses. Our findings illustrate the complex NUMA properties and how data placement and cache coherence states impact access latencies to local and remote locations. This paper also compares theoretical and effective bandwidths for accessing data at the different memory levels and main memory bandwidth saturation at reduced core counts. The presented insight is a foundation for modeling performance of the given microarchitectures, which enables practical performance engineering of complex applications. Moreover, security research on side-channel attacks can also leverage the presented findings.

Near-Storage Processing for Solid State Drive Based Recommendation Inference with SmartSSDs®

Deep learning-based recommendation systems are extensively deployed in numerous internet services, including social media, entertainment services, and search engines, to provide users with the most relevant and personalized content. Production-scale deep learning models consist of large embedding tables with billions of parameters. DRAM-based recommendation systems incur a high infrastructure cost and limit the size of the deployed models. Recommendation systems based on solid-state drives (SSDs) are a promising alternative to DRAM-based systems, as they can offer the ample storage required for deep learning models with large embedding tables. This paper proposes SmartRec, an inference engine for deep learning-based recommendation systems that utilizes Samsung SmartSSD, an SSD with an on-board FPGA that can process data in situ. We evaluate SmartRec with state-of-the-art recommendation models from Facebook and compare its performance and energy efficiency to a DRAM-based system on a CPU. We show that SmartRec improves the energy efficiency of the recommendation inference task by up to 10x compared to the baseline CPU implementation. In addition, we propose a novel application-specific caching system for SmartSSDs that allows the kernel on the FPGA to use its DRAM as a cache to minimize high-latency SSD accesses. Finally, we demonstrate the scalability of our design by offloading the computation to multiple SmartSSDs to further improve performance.
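
The caching idea can be illustrated with a minimal sketch: an LRU cache of hot embedding rows placed in front of slow storage reads. In the paper the cache lives in the SmartSSD's on-card DRAM in front of the flash; here both tiers are stand-ins in host memory, and the table size, cache capacity, and Zipf-skewed access pattern are assumptions.

```python
# LRU cache of embedding rows in front of a slow storage read. Both tiers are
# stand-ins in host memory; sizes and the skewed access pattern are assumptions.
from collections import OrderedDict
import numpy as np

EMB_DIM, TABLE_ROWS = 64, 100_000
table = np.random.default_rng(0).standard_normal((TABLE_ROWS, EMB_DIM)).astype(np.float32)

def slow_storage_read(row: int) -> np.ndarray:
    return table[row]                       # stand-in for an SSD page read

class EmbeddingCache:
    def __init__(self, capacity: int):
        self.capacity, self.data = capacity, OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, row: int) -> np.ndarray:
        if row in self.data:
            self.hits += 1
            self.data.move_to_end(row)      # mark as most recently used
            return self.data[row]
        self.misses += 1
        vec = slow_storage_read(row)
        self.data[row] = vec
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used
        return vec

cache = EmbeddingCache(capacity=10_000)
ids = np.random.default_rng(1).zipf(1.2, 4096) % TABLE_ROWS   # skewed ids
batch = np.stack([cache.lookup(int(i)) for i in ids])
print(batch.shape, cache.hits, cache.misses)
```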

HLS_Profiler: Non-Intrusive Profiling Tool for HLS based Applications

High-Level Synthesis (HLS) tools aid in simplified and faster design development without familiarity with a Hardware Description Language (HDL) or the Register Transfer Logic (RTL) design flow. However, it is not straightforward to associate every line of source code with a clock cycle of the synthesized hardware design. The traditional RTL-based design development flow, on the other hand, provides a fine-grained performance profile through waveforms. With the same level of visibility into HLS designs, designers could identify performance bottlenecks and reach the target performance by iteratively fine-tuning the source code. Although the HLS development tools provide low-level waveforms, interpreting them in terms of source-code variables is a challenging and tedious task. Addressing this gap, we propose an automated profiler tool, HLS_Profiler, that provides a cycle-accurate performance profile of the source code. The HLS_Profiler tool is non-intrusive and collectively uses the ⟨static analysis, dynamic trace⟩ of the source code to produce a performance profile report that attributes latent clock cycles to each line of source code. Additionally, we developed a set of associative rules to maintain correctness in the performance profile of the HLS design. To verify correctness, we demonstrate the HLS_Profiler tool on the MachSuite benchmarks and an industry-grade recommendation application. The proposed HLS_Profiler framework provides visibility into the cycle-by-cycle hardware execution of the source code and aids the designer in making performance-centric decisions.

SESSION: Session 6: Theory of Performance

A Mixed PS-FCFS Policy for CPU Intensive Workloads

Round robin (RR) is a widely adopted scheduling policy in modern computer systems. The scheduler handles concurrency by alternating the running processes so that each can use the processor continuously for at most a quantum of time. When the processor is assigned to another process, a context switch occurs. Although modern architectures handle context switches quite efficiently, processes may incur some indirect costs, mainly due to cache overwriting.

RR is widely appreciated for both interactive and CPU-intensive processes. In the latter case, compared with the First-Come-First-Served (FCFS) approach, RR does not penalise small jobs.

In this paper, we study a scheduling policy, namely PS-FCFS, that fixes a maximum level of parallelism N and leaves the remaining jobs in a FCFS queue. The idea is to exploit the advantages of RR without incurring heavy slowdowns caused by context switches.

We propose a queueing model for PS-FCFS that allows us to (i) find the optimal level of multiprogramming and (ii) study important properties of this policy, such as its mean performance measures and its sensitivity to the moments of the jobs' service demands.
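
A small discrete-event simulation of the policy as described, shown below, makes the trade-off tangible: at most N jobs share the processor (each progressing at rate 1/k when k are active) while the rest wait in FCFS order. Poisson arrivals and a high-variability hyperexponential demand distribution are assumptions made for the example, not part of the paper's model.

```python
# Discrete-event simulation of PS-FCFS: at most N jobs share the processor,
# the rest wait in FCFS order. Arrival and demand distributions are assumptions.
import random

def simulate_ps_fcfs(lam=0.8, N=4, jobs=100_000, seed=1):
    rng = random.Random(seed)

    def service():
        # high-variability demand (hyperexponential, mean 1) so the choice of N matters
        return rng.expovariate(10.0) if rng.random() < 0.9 else rng.expovariate(1 / 9.1)

    t, next_arrival = 0.0, rng.expovariate(lam)
    active, fcfs = {}, []                 # job id -> remaining demand; waiting ids
    job_start, response, arrived = {}, [], 0
    while len(response) < jobs:
        k = len(active)
        if k:
            jid_min = min(active, key=active.get)
            t_done = t + active[jid_min] * k        # that job gets 1/k of the CPU
        else:
            t_done = float("inf")
        if next_arrival <= t_done:                  # next event: an arrival
            dt = next_arrival - t
            for j in active:
                active[j] -= dt / k
            t = next_arrival
            job_start[arrived] = t
            if k < N:
                active[arrived] = service()
            else:
                fcfs.append(arrived)
            arrived += 1
            next_arrival = t + rng.expovariate(lam)
        else:                                       # next event: a departure
            dt = t_done - t
            for j in active:
                active[j] -= dt / k
            t = t_done
            del active[jid_min]
            response.append(t - job_start.pop(jid_min))
            if fcfs:
                active[fcfs.pop(0)] = service()
    return sum(response) / len(response)

for N in (1, 2, 4, 16):
    print(N, round(simulate_ps_fcfs(N=N), 2))       # mean response time per level N
```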

A Stochastic Extension of Stateflow

Although commonly used in industry, Stateflow has a major drawback: it lacks support for stochastic properties, which are often needed to build accurate models of real-world systems. To address this problem, as the first contribution, Stochastic Stateflow (SSF) is presented as a stochastic extension of a subset of Stateflow models. As the second contribution, SMP-tool is updated with support for SSF models specified in Stateflow. Finally, as the third contribution, an industrial case study is presented.

Sampling-based Label Propagation for Balanced Graph Partitioning

In this experience paper, we present new sampling-based algorithms for balanced graph partitioning based on the Label Propagation (LP) approach. The purpose is to define new heuristics that extend the multi-objective and scalable Balanced GRAph Partitioning algorithm B-GRAP proposed in [ELMOUSSAWI, DEXA 2020]. The main challenge is how to build a graph sample that ensures stability and improves the convergence and the partitioning quality, which depend strongly on the structure of the graph. We define two sampling-based heuristics, named RD-B-GRAP and HD-B-GRAP, in order to study the behavior of the propagation according to different quality measures related to vertex and edge balance, edge cut, and propagation time. The results obtained on different graphs show that the sampling-based algorithms improve the propagation time without affecting the balance between partitions. Moreover, the edge cut is slightly improved on some graphs.
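
A compact, generic label-propagation partitioner with the two ingredients discussed above, neighbour sampling and a balance penalty, is sketched below; it is not B-GRAP or the RD/HD heuristics themselves, and the penalty weight, sample size, and example graph are assumptions.

```python
# Each vertex adopts the most frequent label among a *sample* of its neighbours,
# discounted by a penalty when a partition grows past its target size.
import random
from collections import Counter

def lp_partition(adj, k=4, sample_size=10, rounds=20, balance_penalty=0.5, seed=0):
    rng = random.Random(seed)
    label = {v: rng.randrange(k) for v in adj}
    target = len(adj) / k
    for _ in range(rounds):
        changed = 0
        for v in rng.sample(list(adj), len(adj)):          # random visit order
            neigh = adj[v]
            sample = neigh if len(neigh) <= sample_size else rng.sample(neigh, sample_size)
            votes = Counter(label[u] for u in sample)
            sizes = Counter(label.values())
            def score(lbl):
                return votes.get(lbl, 0) - balance_penalty * max(0, sizes[lbl] - target) / target
            best = max(range(k), key=score)
            if best != label[v]:
                label[v] = best
                changed += 1
        if not changed:
            break
    return label

# tiny example graph: two 4-cliques joined by one edge
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6]}
print(lp_partition(adj, k=2))
```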