IEEE Cluster 2022 Program

Conference - Tuesday, September 6

8:30-9:15

Registration

9:15 - 10:30

HS6

REX-IO

HS7

HPCMASPA

HS8

EAHPC

HS9

Tutorial: Introduction to Research Data Management (RDM) with Hands-On for HPC Use Cases

Marcel Nellesen, RWTH Aachen University
Christian Loeschen, TU Dresden

Abstract

The amount of data generated by research projects is constantly growing. Research institutions and funders increasingly demand measures to ensure data safety and data quality sustainably. This workshop will give an introduction to the motivation, basics and concepts of RDM, as well as hints for practical work. Participants will learn guidelines on structuring and documenting data, utilizing metadata and working collaboratively on data. They will also learn how to tackle data publications and what to consider when creating data management plans. A hands-on exercise using the Coscine research data platform will show how to integrate RDM into HPC environments.

The workshop is split into lectures, interactive exercises and practical hands-on sessions, with the opportunity to discuss participants' questions.

10:30 - 11:00

Coffee Break

11:00 - 12:30

HS6

REX-IO

HS7

HPCMASPA

HS8

EAHPC

HS9

Tutorial: Introduction to Research Data Management (RDM) with Hands-On for HPC Use Cases

12:30 - 14:00

Lunch

14:00 - 15:30

HS6

REX-IO

HS7

HPCMASPA

HS8

HPCEuropeLatAm

HS9

Tutorial: Heterogeneous Programming in Modern C++ with SYCL

Aksel Alpay, Heidelberg University
Igor Baratta, University of Cambridge
Tom Deakin, University of Bristol
Peter Žužek, Codeplay

Abstract

Parallel programming can be used to take advantage of heterogeneous architectures including GPUs, FPGAs, XPUs, IPUs, TPUs or special units on CPUs to significantly increase the performance of applications. SYCL is an open standard programming model that is defined by the industry and lets developers support many of these processors from different vendors using a single code base and only modern standard C++ code. This tutorial will give software developers the knowledge they need to begin developing parallel applications using C++ and the SYCL programming model. Our goal is to equip attendees with the skills they need to build highly performant applications that can be used in the fields of HPC and AI and deployed to multiple hardware platforms. We will cover the fundamentals of the SYCL programming model before moving to more advanced topics. We will explore how SYCL can be used to write serious applications, covering intermediate to advanced features of SYCL as well as some of the tools and libraries that support SYCL application development. This is a hands-on tutorial: attendees will work through exercises that represent key design patterns encountered by people who program heterogeneous systems, and will deploy this code to multiple processors from different vendors.

15:30 - 16:00

Coffee Break

16:00 - 17:30

HS6

REX-IO

HS7

HPCMASPA

HS8

HPCEuropeLatAm

HS9

Tutorial: Heterogeneous Programming in Modern C++ with SYCL

19:00

Networking Dinner — Kulturbrauerei Heidelberg

Registration required - limited seats.

Conference - Wednesday, September 7

8:15-9:00

Registration

9:00 - 9:30

Cluster 2022 Opening

HS13

9:30 - 10:30

Keynote: Luca Benini, ETH Zurich

HS13

Mempools: The Rise of Tightly Coupled Processor Clusters

Chair: Trilce Estrada

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

Networking & Security

HS13

Chair: Alexandru Calotoiu

Bring the BitCODE - Moving Compute and Data in Distributed Heterogeneous Systems

Wenbin Lu, Luis E. Pena, Pavel Shamis, Valentin Churavy

Abstract

In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX communication framework. The framework can generate binary machine code or LLVM bitcode for multiple CPU architectures and move the code to remote machines while dynamically optimizing and linking the code on the target platform. The remotely injected code can recursively propagate itself to other remote machines or generate new code. The goal of this paper is threefold: (a) to present an architecture and implementation of the framework that provides essential infrastructure to program a new class of disaggregated systems wherein heterogeneous programming elements such as compute nodes and data processing units (DPUs) are distributed across the system, (b) to demonstrate how the framework can be integrated with modern, high-level programming languages such as Julia, and (c) to demonstrate and evaluate a new class of eXtended Remote Direct Memory Access (X-RDMA) communication operations that are enabled by this framework. To evaluate the capabilities of the framework, we used a cluster with Fujitsu CPUs as well as a heterogeneous cluster with Intel CPUs and BlueField-2 DPUs interconnected using a high-performance RDMA fabric. We demonstrated an X-RDMA pointer chase application that outperforms an RDMA GET-based implementation by 70% and is as fast as Active Messages, but does not require function predeployment on remote platforms.

Exploring Light-weight Cryptography for Efficient and Secure Lossy Data Compression

Ruiwen Shan

Abstract

The enormous volume of data generated by large-scale instruments and simulations poses significant challenges in archiving, transferring, sharing and analyzing data for various scientific groups. Lossy reduction techniques are vital to reduce dataset size to acceptable levels. However, putting more information content per bit increases the severity of loss if the data is perturbed by malicious users or hardware failures. In the worst case, the entire dataset is compromised. Malevolent alteration or destruction of datasets containing crucial discoveries can completely invalidate research outcomes in scientific studies. Therefore, it is critical to integrate compression and encryption to handle data securely and efficiently. The current state-of-the-art combination technique Cmpr-Encr handles compression and encryption as two distinct processes. This reduces the compression ratio and bandwidth, especially for hard-to-compress datasets. In this paper, we propose two data protection strategies that work in conjunction with the lossy compressor SZ: Quantization-only and Huffman-only, and carefully evaluate the overhead they introduce on compression bandwidth and ratio. Based on the results of testing with real-world scientific datasets, we find that the cost of Quantization-only varies with the dataset's properties and requires cautious selection. Huffman-only is able to maintain more than 99% of the original compression ratio while saving 6.5% in compression time compared to SZ. Applying Cmpr-Encr leads to a reduction in compression bandwidth, whereas Huffman-only increases bandwidth by 3.1% over SZ, on average.

SKV: A SmartNIC-Offloaded Distributed Key-value Store

Shangyi Sun

Abstract

In data center networks, applications such as distributed key-value stores consume a lot of CPU resources. The performance of the entire system drops significantly under heavy load conditions. In order to improve the performance of key-value stores, many existing studies use RDMA (Remote Direct Memory Access) to reduce the communication overhead. However, RDMA primitives can only offload simple operations to the NIC, such as reading and writing remote memory. With the emergence of new hardware like SmartNICs, we consider whether we can offload more complex operations in distributed key-value stores to SmartNICs to reduce the load on the CPU. In this paper we present SKV, a SmartNIC-offloaded distributed key-value store. In order to make full use of the offload ability of the SmartNIC, we make a detailed analysis of the characteristics and architecture of SmartNICs and distributed key-value stores. SKV offloads operations such as data replication to the SmartNIC. We design a new replication mechanism, which enables the server to separate background processing from the interaction with clients in the front. We implement SKV with a Mellanox BlueField SmartNIC. Our evaluations show that SKV improves the overall throughput by 14% and reduces latency by 21% compared with the baseline.

Scheduling and Multi-Tenancy

HS10

Chair: Dirk Pleiter

What does Inter-Cluster Job Submission and Execution Behavior Reveal to Us?

Tirthak Patel

Abstract

Modern High Performance Computing (HPC) facilities have multiple computing clusters that serve different purposes. These include large-scale computing clusters and smaller data visualization and analysis clusters, which are meant to shift the load of data analytics jobs from the large-scale systems. We perform the first in-depth characterization of cross-cluster behavior of users and jobs and provide an analysis of three inter-related systems at the Argonne Leadership Computing Facility (ALCF). We will make our novel dataset, prediction methodology and models open-source to the community upon acceptance.

Matching-based Scheduling of Asynchronous Data Processing Workflows on the Computing Continuum

Narges Mehran, Zahra Najafabadi Samani, Dragi Kimovski, Radu Prodan

Abstract

Today’s distributed computing infrastructures encompass complex workflows for real-time data gathering, transferring, storage, and processing, quickly overwhelming centralized cloud centers. Recently, the computing continuum that federates the Cloud services with emerging Fog and Edge devices represents a relevant alternative for supporting the next-generation data processing workflows. However, eminent challenges in automating data processing across the computing continuum still exist, such as scheduling heterogeneous devices across the Cloud, Fog, and Edge layers. We propose a new scheduling algorithm called C3-MATCH, based on matching theory principles and involving two sets of players negotiating different utility functions: 1) workflow microservices that prefer computing devices with lower data processing and queuing times; 2) computing continuum devices that prefer microservices with corresponding resource requirements and less data transmission time. We evaluate C3-MATCH using real-world road sign inspection and sentiment analysis workflows on a federated computing continuum across four Cloud, Fog, and Edge providers. Our combined simulation and real execution results reveal that C3-MATCH achieves up to 67% lower completion time compared to three state-of-the-art methods.
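
The abstract does not spell out how C3-MATCH negotiates between its two sets of players, so the following is only a generic illustration of matching theory, not the authors' algorithm. It sketches textbook one-to-one deferred acceptance (Gale-Shapley) in Python, assuming each microservice ranks devices (for example by processing and queuing time) and each device ranks microservices (for example by transmission time). All names and preference lists are hypothetical.

```python
def deferred_acceptance(proposer_prefs, acceptor_prefs):
    """Textbook one-to-one deferred acceptance; returns {proposer: acceptor}."""
    # rank[a][p] is the position of proposer p in acceptor a's list (lower is better).
    rank = {a: {p: r for r, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index each proposer tries
    engaged = {}                                  # acceptor -> tentatively held proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p exhausted its list and stays unmatched
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(a)
        if current is None:
            engaged[a] = p                # a was free and tentatively accepts p
        elif rank[a][p] < rank[a][current]:
            engaged[a] = p                # a prefers p and releases its current partner
            free.append(current)
        else:
            free.append(p)                # a rejects p, who will try its next choice
    return {p: a for a, p in engaged.items()}


# Hypothetical preferences: microservices rank devices, devices rank microservices.
microservice_prefs = {
    "decode": ["edge", "fog", "cloud"],
    "detect": ["edge", "cloud", "fog"],
    "store": ["cloud", "fog", "edge"],
}
device_prefs = {
    "edge": ["detect", "decode", "store"],
    "fog": ["decode", "store", "detect"],
    "cloud": ["store", "detect", "decode"],
}
matching = deferred_acceptance(microservice_prefs, device_prefs)
```

In this toy instance both "decode" and "detect" want the edge device first; the edge device keeps the one it ranks higher and the other falls back to its second choice. A real scheduler would derive the preference lists from measured utilities and would have to handle devices that host several microservices.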

MRSch: Multi-Resource Scheduling for HPC

Matt Dearing

Abstract

Emerging workloads in high-performance computing (HPC) are embracing significant changes, such as having diverse resource requirements instead of being CPU-centric. This advancement forces cluster schedulers to consider multiple schedulable resources during decision-making. Existing scheduling studies rely on heuristic or optimization methods, which are limited by an inability to adapt to new scenarios for ensuring long-term scheduling performance. We present an intelligent scheduling agent named MRSch for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm. While DFP demonstrated outstanding performance in a gaming competition, it has not been previously explored in the context of HPC scheduling. Several key techniques are developed in this study to tackle the challenges involved in multi-resource scheduling. These techniques enable MRSch to learn an appropriate scheduling policy automatically and dynamically adapt its policy in response to workload changes via dynamic resource prioritizing. We compare MRSch with existing scheduling methods through extensive trace-based simulations. Our results demonstrate that MRSch improves scheduling performance by up to 48% compared to the existing scheduling methods.

12:30 - 14:00

Poster Session and Lunch

Posters

Empirical Study on the GPU-accelerated HPL Performance: Effects of PCIe Communication
Authors: Jieun Choi, Yosang Jeong, Ji-Hoon Kang, Gibeom Gu, Hoon Ryu
H2M: Towards Heuristics for Heterogeneous Memory
Authors: Jannis Klinkenberg, Clément Foyer, Brice Goglin, Emmanuel Jeannot, Anara Kozhokanova, Christian Terboven
An Efficient Sparse CNNs Accelerator on FPGA
Authors: Yonghua Zhang, Hongxu Jiang, Xiaobin Li, Haojie Wang, Dong Dong, Yongxiang Cao
An Analysis of Performance Variability on Dragonfly+ topology
Authors: Majid Salimi Beni, Biagio Cosenza
An Asynchronous Parallel Algorithm to Improve the Scalability of Finite Element Solvers
Authors: Zhuo Tian, Changyou Zhang

14:00 - 15:30 -- Parallel Sessions

MPI

HS13

Chair: Sascha Hunold

A framework for hierarchical single-copy MPI collectives on multicore nodes

George Katevenis

Abstract

Collective operations are widely used by MPI applications to realize their communication patterns. Their efficiency is crucial for both performance and scalability of parallel applications. For deriving efficient MPI implementations, significant effort is put into keeping pace with advances and capabilities of the underlying hardware and interconnect. Recent processor advances have led to nodes with higher core counts and complex internal structures and memory hierarchies. Such nodes are able to host tens to hundreds of processes and thus, performance of MPI collectives at the intra-node level becomes critical. In this work, we propose a framework for collective operations at the intra-node level that aims to lower latency and increase bandwidth. Our approach utilizes knowledge of internal node structure to construct hierarchical algorithms, and XPMEM to achieve single-copy transfers. Pipelining is used to overlap communication at different levels of the hierarchy. We evaluate the proposed approach through several microbenchmarks and real-world MPI applications. For evaluation purposes, we compare the proposed approach with implementations of similar schemes from two recent studies. Our evaluation with microbenchmarks for Broadcast and Allreduce shows speedup up to 2.5x and 3x, respectively, over UCC and OpenMPI's default collectives implementation. Compared to recent research studies, we improve Broadcast by up to 5x, and Allreduce by up to 7x. We reduce the time of three applications -- PiSvM, miniAMR and CNTK, by up to 12%, 52% and 12%, respectively, over the next best-performing alternative.

Deadlock Detection of MPI Program Based on Refined Match-sets

ShuShan Li

Abstract

Deadlock is one of the common problems in the message passing interface. At present, most methods for detecting MPI deadlocks rely on exhausting all execution paths of a program, which is inefficient. In addition, the number of execution paths increases exponentially with the number of wildcard receive events and processes, further worsening the situation. To alleviate the problem, we propose a deadlock detection approach based on match-sets to avoid exploring execution paths. In this approach, a match detection rule based on the Lazy Lamport Clocks Protocol first forms rough match-sets. Three refining algorithms, derived from the non-overtaking rule and the MPI communication mechanism, then refine the match-sets. Finally, deadlocks are detected by analyzing the refined match-sets. We have implemented a tool called SAMPI and performed an experimental evaluation on 27 programs. The experimental results show that SAMPI is efficient at detecting deadlocks in MPI programs, especially in programs with many interleavings.

Runtimes

HS10

Chair: Bernd Mohr

Pythia: an oracle to guide runtime system decisions

Alexis Colin

Abstract

Runtime systems are commonly used by parallel applications in order to efficiently exploit the underlying hardware resources. A runtime system hides the complexity of the management of the hardware and exposes a high-level interface to application developers. To this end, it makes decisions by relying on heuristics that estimate the future behavior of the application. In this paper, we propose Pythia, a library that serves as an oracle capable of predicting the future behavior of an application, so that the runtime system can make more informed decisions. Pythia builds on the deterministic nature of many HPC applications: by recording an execution trace, Pythia captures the application's main behavior. The trace can be provided for future executions of the application, and a runtime system can ask for predictions of future program behavior. We evaluate Pythia on 13 MPI applications and show that Pythia can accurately predict the future of most of these applications, even when varying the problem size. We demonstrate how Pythia's predictions can guide a runtime system optimization by implementing an adaptive thread parallelism strategy in the GNU OpenMP runtime system. The evaluation shows that, thanks to Pythia's predictions, the adaptive strategy reduces the execution time of an application by up to 38%.

Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG

Joseph Schuchart

Abstract

Shared memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. The general consensus is that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we will discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized implementation of a data-flow programming model that seamlessly scales from a single node to distributed execution, competing with shared-memory-only programming models such as OpenMP.

Distributed Continuation Stealing is More Scalable than You Might Think

Shumpei Shiina

Abstract

The need for load balancing in applications with irregular parallelism has motivated research on work stealing. An important choice in work-stealing schedulers is between child stealing and continuation stealing. In child stealing, a newly created task is made stealable by other processors, whereas in continuation stealing, the caller's continuation is made stealable by executing the newly created task first, which preserves the serial execution order. Although the benefits of continuation stealing have been demonstrated on shared memory by Cilk and other runtime systems, it is rarely employed on distributed memory, presumably because it has been thought to be difficult to implement and inefficient as it involves migration of call stacks across nodes. Akiyama and Taura recently introduced efficient RDMA-based continuation stealing, but the practicality of distributed continuation stealing is still unclear because a comparison of its performance with that of child stealing has not previously been performed. This paper presents the results of a comparative performance analysis of continuation stealing and child stealing on distributed memory. To clarify the full potential of continuation stealing, we first investigated various RDMA-based synchronization (task join) implementations, which had not previously been fully investigated. The results revealed that, when the task synchronization pattern was complicated, continuation stealing performed better than child stealing despite its relatively long steal latency due to stack migration. Notably, our runtime system achieved almost perfect scaling on 110,592 cores in an unbalanced tree search (UTS) benchmark. This scalability is comparable to or even better than that of state-of-the-art bag-of-tasks counterparts.

15:30 - 16:00

Coffee Break

16:00 - 17:30 -- Parallel Sessions

MPI Collectives

HS13

Chair: Sascha Hunold

Fast(er) Construction of Round-optimal n-Block Broadcast Schedules

Jesper Larsson Träff

Abstract

We give a fast(er), communication-free, parallel construction of optimal communication schedules that allow broadcasting of $n$ distinct blocks of data from a root processor to all other processors in $1$-ported, $p$-processor networks with fully bidirectional communication. For any $p$ and $n$, broadcasting in this model requires $n-1+\lceil \log_2 p \rceil$ communication rounds. In contrast to other constructions, all processors follow the same, circulant graph communication pattern, which makes it possible to use the schedules for the allgather (all-to-all-broadcast) operation as well. The new construction takes $O(\log^3 p)$ time steps per processor, each of which can compute its part of the schedule independently of the other processors in $O(\log p)$ space. The result is a significant improvement over the sequential $O(p \log^2 p)$ time and $O(p \log p)$ space construction of Träff and Ripke (2009) with considerable practical import. The round-optimal schedule construction is then used to implement communication-optimal algorithms for the broadcast and (irregular) allgather collective operations as found in MPI (the Message-Passing Interface), and significantly and practically improve over the implementations in standard MPI libraries (mpich, OpenMPI, Intel MPI) for certain problem ranges. The application to the irregular allgather operation is entirely new.
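
As a small worked illustration of the round count stated in the abstract, the Python sketch below evaluates n - 1 + ceil(log2 p) using integer arithmetic only. The function name and the input checks are assumptions made for this example; it computes the number of rounds, not the schedule itself.

```python
def broadcast_rounds(p: int, n: int) -> int:
    """Rounds to broadcast n blocks among p processors: n - 1 + ceil(log2 p)."""
    if p < 2 or n < 1:
        raise ValueError("expects p >= 2 processors and n >= 1 blocks")
    # (p - 1).bit_length() equals ceil(log2 p) exactly for every integer p >= 1,
    # which avoids floating-point rounding for large p.
    ceil_log2_p = (p - 1).bit_length()
    return n - 1 + ceil_log2_p


rounds_single_block = broadcast_rounds(p=8, n=1)
rounds_many_blocks = broadcast_rounds(p=1000, n=100)
```

For a single block the count reduces to the familiar ceil(log2 p) rounds of a binomial-tree broadcast; each additional block adds exactly one round.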

Lossy all-to-all exchange for accelerating parallel 3-D FFTs on hybrid architectures with GPUs

Sebastien Cayrols, Jiali Li, George Bosilca, Stanimire Tomov, Alan Ayala, Jack Dongarra

Abstract

In the context of parallel applications, communication is a critical part of the infrastructure and a potential bottleneck. The traditional approach to tackle communication challenges consists of redesigning algorithms so that the complexity or the communication volume is reduced. However, there are algorithms like the Fast Fourier Transform (FFT) where reducing the volume of communication is very challenging but holds promise because it can lead to a large benefit in terms of time-to-completion. In this paper, we revisit the implementation of the MPI all-to-all routine at the core of 3D FFTs by using advanced MPI features, such as One-Sided Communication, and integrate data compression during communication to reduce the volume of data exchanged. Since some compression techniques are 'lossy' in the sense that they involve a loss of accuracy, we study the impact of lossy compression in heFFTe, the state-of-the-art FFT library for large scale 3D FFTs on hybrid architectures with GPUs. Consequently, we design an approximate FFT algorithm that trades off user-controlled accuracy for speed. We show that we speed up the 3D FFTs proportionally to the compression rate. In terms of accuracy, comparing our approach with a reduced precision execution, where both the data and the computation are in reduced precision, we show that when the volume of communication is compressed to the size of the reduced precision data, the approximate FFT algorithm is as fast as the one in reduced precision while the accuracy is one order of magnitude better.

ACCLAiM: Advancing the Practicality of MPI Collective Communication Autotuning Using Machine Learning

Michael Wilkins

Abstract

MPI collective communication is an omnipresent communication model for High-Performance Computing (HPC) systems. The performance of a collective operation depends strongly on the algorithm used to implement it. MPI libraries use inaccurate heuristics to select these algorithms, causing applications to suffer unnecessary slowdowns. Machine learning (ML)-based autotuners are a promising alternative. ML autotuners can intelligently select algorithms for individual jobs, resulting in near-optimal performance. However, these approaches currently spend more time training than they save by accelerating applications, rendering them impractical. We make the case that ML-based collective algorithm selection autotuners can be made practical and accelerate production applications on large-scale supercomputers. We identify multiple impracticalities in the existing work, such as slow training point selection and ignoring non-power-of-two feature values. We address these challenges through variance-based point selection and model testing alongside topology-aware benchmark parallelization. Our approach minimizes training time by eliminating unnecessary training points and maximizing machine utilization. We incorporate our improvements into a prototype active learning system, ACCLAiM (Advancing Collective Communication (L) Autotuning using Machine Learning). We show that each of ACCLAiM’s advancements significantly reduces training time compared to the best existing machine learning approach. Then, we apply ACCLAiM on a leadership-class supercomputer and demonstrate the conditions where ACCLAiM can accelerate HPC applications, providing the advantage of ML autotuners in a production setting for the first time.

Serverless & Virtual Networks

HS10

Chair: Jay Lofstead

Call Scheduling to Reduce Response Time of a FaaS System

Paweł Żuk

Abstract

In an overloaded FaaS cluster, individual worker nodes strain under lengthening queues of requests. Although the cluster might eventually be horizontally scaled, adding a new node takes dozens of seconds. As serving applications are tuned for tail serving latencies, and these greatly increase under heavier loads, the current workaround is resource over-provisioning. In fact, even though a service can withstand a steady load of e.g. 70% CPU utilization, the autoscaler is triggered at e.g. 30-40% (thus the service uses twice as many nodes as would be needed). We propose an alternative: a worker-level method handling heavy load without increasing the number of nodes. Unlike, e.g., text editors, FaaS executions are not interactive: end-users gain nothing when the CPU is assigned to processes often, yet for short periods. Inspired by scheduling methods for High Performance Computing, we take the radical step of replacing the classic OS preemption by (1) queuing requests based on their historical characteristics; (2) once a request is being processed, setting its CPU limit to exactly one core (with no CPU oversubscription). We extend OpenWhisk and measure the efficiency of the proposed solutions, based on the SeBS benchmark. In a loaded system, our method decreases the average response time by a factor of 4. The improvement is even higher for shorter requests, as the average stretch is decreased by a factor of 18. This leads us to show that we can provide better response-time statistics with 3 machines compared to a 4-machine baseline.
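
To make the queuing idea concrete, the Python sketch below runs requests to completion one at a time on a single core and compares arrival order with an order derived from historical durations. The abstract does not say which ordering policy the authors use; shortest expected duration first is an assumption made here for illustration, as are the function names, the request mix, and the definition of stretch as response time divided by processing time.

```python
from statistics import mean


def order_by_history(requests, history):
    """Indices of requests sorted by the mean past duration of their function."""
    return sorted(range(len(requests)),
                  key=lambda i: mean(history[requests[i][0]]))


def run_on_one_core(requests, order):
    """Run requests back to back without preemption (all arrive at time 0).

    Returns (average response time, average stretch), where stretch is
    response time divided by the request's own processing time.
    """
    clock, responses, stretches = 0.0, [], []
    for i in order:
        _, duration = requests[i]
        clock += duration                  # request i holds the core until it finishes
        responses.append(clock)            # response time equals completion time here
        stretches.append(clock / duration)
    return mean(responses), mean(stretches)


# Hypothetical workload: (function name, actual duration in seconds).
requests = [("video", 10.0), ("thumb", 1.0), ("thumb", 1.0)]
# Hypothetical durations observed in earlier invocations of each function.
history = {"video": [9.0, 11.0], "thumb": [1.0, 1.2]}

fifo = run_on_one_core(requests, [0, 1, 2])
by_history = run_on_one_core(requests, order_by_history(requests, history))
```

In this toy workload the two short requests no longer wait behind the long one, so both the average response time and the average stretch drop, with stretch improving more because it weights short requests heavily.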

FaaSt: Optimize Makespan of Serverless Workflows in Federated Commercial FaaS

Sashko Ristov

Abstract

Nowadays, scientists migrate workflow applications to serverless Function-as-a-Service (FaaS) platforms in the form of so-called function choreographies (FCs) to benefit from FaaS high elasticity and instantly spawning numerous functions. However, the heterogeneous nature of federated FaaS complicates decisions on the most appropriate configuration setup. Unfortunately, related work mainly supports either (i) scheduling algorithms for serverful workflow applications that run on virtual machines or (ii) container-based algorithms to schedule individual serverless functions on a specific container (executor). Either approach is hard to implement for FCs in federated FaaS; the former due to specifics of the FaaS resource model, the latter because such algorithms primarily focus on bags of functions and on reducing startup latency down to microseconds. Such optimization is negligible for scientific FCs whose functions may run hundreds of seconds due to enormous compute and I/O operations to distributed cloud storage. Instead, scientific FCs would benefit from schedulers that select the appropriate FaaS provider, cloud region, and memory settings. To bridge this gap in scheduling scientific FCs in federated FaaS, this paper introduces FaaSt, a novel list-based FC scheduler that optimizes the makespan of an FC that runs functions across multiple FaaS providers. The evaluation with three other schedulers showed that FaaSt overcomes limitations of a single FaaS region and achieves a speedup of up to 2.82× when running FCs across four cloud regions compared to a single region. Moreover, the FaaSt FC scheduler achieves a speedup of up to 1.74× compared to the other state-of-the-art FC schedulers across the same four regions.

Last-mile Matters: Mitigating the Tail Latency of Virtualized Networks with Multipath Data Plane

Dian Shen

Abstract

Virtualized networks have become the cornerstone of today's large-scale cloud data centers. In particular, the data plane of a virtualized network, consisting of virtual switches, virtual routers and other software network functionalities, performs all network packet processing of virtual machines (VMs). However, current virtualized data plane solutions incur drastic performance interference with co-resident VMs, and thus suffer from unpredictable network performance, especially in terms of tail latency. In this work, we show that the performance issue stems from the fact that the CPU plays a dual role of both communication and computation in virtualized networks. A number of virtual network components and their complex packet processing create an undue burden on the hosts' CPUs and in turn cause mutual performance interference among VMs and networks. To address this issue, we present a multipath data plane solution, where the traffic of VMs can be adaptively and seamlessly offloaded to the adjacent hosts. The core of this design is optimizing the VM traffic allocation among multiple paths. We formulate the VM multipath traffic allocation problem with coupled variables of computing and network resources, which were only considered as mutually independent in prior research. Then we present a distributed algorithm to efficiently solve the large-scale, inter-dependent global optimization problem, with convergence and optimality guarantees. Through extensive simulations and real-world testbed experiments, we show that our solution delivers consistent performance improvement (up to 6.7 times improvement in aggregate throughput and 21.4 times reduction in tail latency, respectively) in the dynamic cloud system.

17:30 - 19:30

Welcome Reception

Reception

Conference - Thursday, September 8

8:45-9:15

Registration

9:15 - 9:30

Announcements

HS13

9:30 - 10:30

Keynote: Kristal Michielsen, Jülich Supercomputing Centre

HS13

Integrating Quantum Computers in HPC Infrastructures

Kristal Michielsen, Jülich Supercomputing Centre

Abstract

For practical quantum computing, HPC infrastructures shall integrate quantum computers and simulators (QCS) in addition to cloud access to stand-alone QCS.

As long-term experience in conventional supercomputing demonstrates, the successful integration of QCS into HPC systems requires a focus on all three fundamental components of the HPC ecosystem: users and their applications, software, and hardware.

A broad user community will need to invest time and effort in developing new kinds of algorithms and software for real-world applications that take full advantage of the QCS as accelerators that speed up existing classical algorithms and software. In addition, a full QCS software stack will have to be developed that takes into account the various kinds of QCS hardware that are implemented on a variety of qubit platforms.

Finally, the development of use cases in a co-design approach with "hybrid" computing architectures in mind will make it possible to address research challenges that cannot be met with current HPC architectures.

The High-Performance Computer and Quantum Simulator hybrid (HPCQS) infrastructure, the pan-European hybrid HPC/quantum infrastructure supported by the European High-Performance Computing Joint Undertaking (EuroHPC JU) and six European countries (Austria, France, Germany, Ireland, Italy and Spain), realizes, after the Jülich UNified Infrastructure for Quantum computing (JUNIQ), a second step towards a European Quantum Computing and Simulation Infrastructure (EuroQCS), as advocated for in the Strategic Research Agenda of the European Quantum Flagship.

Chair: Felix Wolf

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

Applications

HS13

Chair: Tom Deakin

Towards Virtual Certification of Gas Turbine Engines With Performance-Portable Simulations

Gihan Mudalige, Istvan Reguly

Abstract

We present the large-scale computational fluid dynamics (CFD) simulation of a full gas-turbine engine compressor, demonstrating capability towards overcoming current limitations for virtual certification of aero-engine design. The simulation is carried out through a performance-portable code-base on multi-core/many-core HPC clusters with a CFD-to-CFD coupled execution, combining instances of an industrial CFD solver linked using custom coupler software. The application innovates in its design for performance portability through the OP2 domain-specific library for the CFD components, allowing the automatic generation of highly optimized platform-specific parallelizations for both multi-core (CPU) and many-core (GPU) clusters via a single high-level source. The code is used for the simulation of a 4.58B node, full-annulus 10-row production-grade test compressor (DLR’s Rig250), using a coupled sliding-plane setup on the ARCHER2 and Cirrus supercomputers at EPCC. The multiple OP2-generated parallelizations, together with optimized coupler configurations on heterogeneous/hybrid settings, achieve, for the first time, execution of 1 revolution in less than 6 hours on 512 nodes of ARCHER2 (65k cores), with a parallel scaling efficiency of over 80% compared to a 107-node run. Results indicate a speedup of the CFD suite by an order of magnitude (~30x) relative to current production capability. Benchmarking and performance modelling project a time-to-solution of less than 5 hours on a cluster of 488 NVIDIA V100 GPUs, about a 3x-4x speedup over CPU clusters. The work demonstrates a step-change towards achieving virtual certification of aircraft engines with the requisite fidelity and tractable time-to-solution that was previously out of reach under production settings.

Hybrid Analysis of Fusion Data for Online Understanding of Complex Science on Extreme Scale Computers

Eric Suchyta

Abstract

The current practice for fusion scientists running first-principles simulations on high performance computing platforms is to either run their simulations and output their data for post-hoc analysis, or to place in situ analytics into their code. In this paper we examine a complex workflow using an XGC fusion simulation run on the Oak Ridge Leadership Computing Facility's supercomputer Summit, which also involves three analyses as part of the results necessary for scientific discovery. We discuss the challenges faced when implementing these algorithms and present an original hybrid staging technique to help physicists make discoveries during the execution of the simulation. This infrastructure lets us examine complicated physics results that might otherwise not have been accessible. For example, our work enables the online visualization of the turbulent homoclinic tangle around the magnetic X-point, breaking the last confinement surface. This visualization could help fusion scientists to better understand and improve the turbulence spread of plasma exhaust heat, which is crucial toward realizing plasmas beyond the currently accessible physics regimes of present-day tokamak reactors. The physics of the turbulent homoclinic tangle will be reported in a future physics publication, utilizing the original online analysis/visualization framework presented in this paper.

High Performance Adaptive Physics Refinement to Enable Large-Scale Tracking of Cancer Cell Trajectory

Daniel F. Puleri

Abstract

The ability to track simulated cancer cells through the circulatory system, important for developing a mechanistic understanding of metastatic spread, pushes the limits of today’s supercomputers by requiring the simulation of large fluid volumes at cellular-scale resolution. To overcome this challenge, we introduce a new adaptive physics refinement (APR) method that captures cellular-scale interaction across large domains and leverages a hybrid CPU-GPU approach to maximize performance. Through algorithmic advances that integrate multi-physics and multi-resolution models, we establish a finely resolved window with explicitly modeled cells coupled to a coarsely resolved bulk fluid domain. In this work we present multiple validations of the APR framework by comparing against fully resolved fluid-structure interaction methods and employ techniques, such as latency hiding and maximizing memory bandwidth, to effectively utilize heterogeneous node architectures. Collectively, these computational developments and performance optimizations provide a robust and scalable framework to enable system-level simulations of cancer cell transport.

I/O

HS10

Chair: Jay Lofstead

Be SMART, Save I/O: A Probabilistic Approach to Avoid Uncorrectable Errors in Storage Systems

Md Arifuzzaman

Abstract

Silent data corruption poses a significant risk to the integrity of data in storage systems. Although error correction codes (ECC) can recover the majority of such errors, a non-negligible portion of them escape ECC; these are referred to as uncorrectable errors (UEs). Although UEs are rare in nature, the increasing scale of storage systems and fast-growing I/O rates have decreased the mean time between UEs from months to hours. Yet, unlike disk failures, UEs are hard to predict with high precision, making it difficult to use proactive measures due to the considerable overhead on system and application performance. In this paper, we introduce a probabilistic approach to deploy UE mitigation strategies (e.g., write verification, data scrubbing, etc.) while keeping the system overhead within a tolerable range. To achieve this, we first estimate the probability of I/O operations being exposed to UEs and determine a subset of disks for which employing UE avoidance strategies can lower the UE probability to a desired range. Through extensive simulations, we demonstrate that when the proposed probabilistic model is used to implement a write verification strategy to detect and recover from UEs, more than 50% of all write-triggered UEs can be avoided with 1% read overhead, and more than 70% of UEs can be mitigated with less than 3.5% read overhead. We further measure the impact of read overhead on write performance in production Lustre and Spectrum Scale file systems and validate our finding that more than 50% of UEs can be avoided with a 0.2-0.9% decrease in write throughput.

The role of storage target allocation in applications' I/O performance with BeeGFS

Francieli Boito, Guillaume Pallez, Luan Teylo

Abstract

Parallel file systems are at the core of HPC I/O infrastructures. Those systems minimize the I/O time of applications by separating files into fixed-size chunks and distributing them across multiple storage targets. Therefore, the I/O performance experienced with a PFS is directly linked to the capacity of retrieving these chunks in parallel. In this work, we conduct an in-depth evaluation of the impact of the stripe count (the number of targets used for striping) on the write performance of BeeGFS, one of the most popular parallel file systems today. We consider different network configurations and show the fundamental role played by this parameter, in addition to the number of compute nodes, processes and storage targets. Through a rigorous experimental evaluation, we directly contradict conclusions from related work. Notably, we show that sharing I/O targets does not lead to performance degradation and that applications should use as many storage targets as possible. Our recommendations have the potential to significantly improve the overall write performance of BeeGFS deployments, and also provide valuable information for future work on storage target allocation and stripe count tuning.
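The striping idea in this abstract can be illustrated with a small sketch. It assumes a plain round-robin assignment of fixed-size chunks to targets; the function name `stripe_layout` and its parameters are hypothetical, and the sketch does not model how BeeGFS itself selects targets.

```python
def stripe_layout(file_size, chunk_size, stripe_count):
    """Assign each fixed-size chunk of a file to a storage target.

    Assumption for illustration only: chunks are placed round-robin
    over `stripe_count` targets numbered 0..stripe_count-1.
    """
    if file_size < 0 or chunk_size <= 0 or stripe_count <= 0:
        raise ValueError("sizes must be non-negative and counts positive")
    num_chunks = -(-file_size // chunk_size)  # ceiling division
    layout = {target: [] for target in range(stripe_count)}
    for chunk in range(num_chunks):
        layout[chunk % stripe_count].append(chunk)
    return layout


# A 10-byte file with 3-byte chunks has 4 chunks; with a stripe count
# of 2, each target holds two chunks that can be accessed in parallel.
example = stripe_layout(file_size=10, chunk_size=3, stripe_count=2)
```

With a larger stripe count the same chunks spread over more targets, which is the parameter whose effect on write performance the talk evaluates.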

Extracting and characterizing I/O behavior of HPC workloads

Hariharan Devarajan

Abstract

System administrators set default storage system configuration parameters with the goal of providing high performance for their system's I/O workloads. However, this generalized configuration can lead to sub-optimal I/O performance for individual workloads. Users can provide parameter settings to the storage system to obtain better performance for individual applications, but it can be very challenging to determine which parameters to set and to what values. This problem is even further exacerbated by the increased complexity of modern storage systems. In this work, we move towards a solution to this problem by providing a systematic categorization of workload-related information that users or middleware libraries can pass to the storage system to optimize I/O performance for the specific workloads. We study applications and workflows from different scientific domains to cover a broad range of HPC use-cases. Through our categorization, we find that a) workload features differ based on hardware, software, and data components involved in the execution of the workloads and b) multiple workload features together drive I/O optimizations. Additionally, the methodology proposed in this work optimizes complex scientific workloads by 2.2x-8x using workload-aware I/O optimizations. Using the proposed methodology, users can pragmatically characterize their workload, and this characterization can assist the storage system in configuring itself to optimize I/O performance for individual workloads in HPC systems.

12:30 - 14:00

Lunch

14:00 - 14:30

Vendor Presentation: Min Li, Huawei

HS13

Compute 2030

Min Li, Huawei

Abstract

By 2030, we will be producing yottabytes of data every year. The amount of general computing power in use will increase tenfold, and AI computing power will increase by a factor of 500. The digital and physical worlds will be seamlessly converged, allowing people and machines to interact perceptually and emotionally. In this talk, from an industrial viewpoint, we'll present 8 scenarios that will be prominent in 2030, from which 18 innovative directions are identified. We believe that in the next decade, computing will help us move into an intelligent world, a process of the same epochal significance as the age of discovery, the industrial revolution, and the space age.

Chair: Felix Wolf

14:30 - 15:30

Panel: Novel Hardware is Good, Programmable Hardware is Better

HS13

Novel Hardware is Good, Programmable Hardware is Better

Chair: Michèle Weiland

15:30 - 16:00

Coffee Break

16:00 - 17:00

Best Paper Nominees

HS13

Chair: Trilce Estrada

Improving Object Placement Methodology for Hybrid Memory Systems in HPC

Marc Jordà, Antonio J. Peña

Abstract

Hybrid memory systems are an emerging trend to provide larger RAM sizes at reasonable cost and energy consumption. Recent byte-addressable persistent memory (PMEM) technology offers capacities comparable to storage devices and access times much closer to DRAM than other non-volatile memory technologies. To mitigate the large gap with DRAM performance, DRAM and PMEM are usually combined. Users have the choice to either manage allocations to different memory spaces manually or leverage the DRAM as a cache for the virtual address space of the PMEM. In this paper, we present a novel methodology for automatic object-level placement, which addresses the performance shortcomings of a previous solution, yielding a framework that is competitive in performance with the state of the art while enabling a much simpler workflow. Our experiments leveraging Intel Optane Persistent Memory show performance ranging from matching to greatly improved with respect to state-of-the-art software and hardware solutions, attaining over 2x runtime improvement in miniapplications and over 6% in OpenFOAM, a complex production application.

Efficient Hierarchical State Vector Simulation of Quantum Circuits via Acyclic Graph Partitioning

Yusuf Ozkaya

Abstract

Early but promising results in quantum computing have been enabled by the concurrent development of quantum algorithms, devices, and materials. Classical simulation of quantum programs has enabled the design and analysis of algorithms and implementation strategies targeting current and anticipated quantum device architectures. In this paper, we present a graph-based approach to achieving efficient quantum circuit simulation. Our approach involves partitioning the graph representation of a given quantum circuit into acyclic subgraphs/circuits that exhibit better data locality. Simulation of each sub-circuit is organized hierarchically, with the iterative construction and simulation of smaller state vectors, improving overall performance. Also, this partitioning reduces the number of passes through data, improving the total computation time. We present three partitioning strategies and observe that acyclic graph partitioning typically results in the best time-to-solution. In contrast, other strategies reduce the partitioning time at the expense of potentially increased simulation times. Experimental evaluation demonstrates the effectiveness of our approach.

18:00 - 23:00

Banquet

Heidelberg Castle

Guided tour (18:00 - 19:00)

Reception on castle terrace (19:00 - 20:00)

Dinner (20:00 - 23:00)

Conference - Friday, September 9

8:45-9:15

Registration

9:15 - 9:30

Award Ceremony and Cluster 2023 Presentation

HS13

9:30 - 10:30

Keynote: Rio Yokota, Tokyo Institute of Technology

HS13

Matrices in Deep Neural Networks and How to Compute Them in Parallel

Chair: Abhinav Bhatele

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

Operations and ML Training Strategies

HS13

Chair: Shadi Ibrahim

fairDMS: Rapid Model Training by Data and Model Reuse

Zhengchun Liu

Abstract

Extracting actionable information from data produced by instruments such as the Linac Coherent Light Source (LCLS-II) and Advanced Photon Source Upgrade (APS-U) is becoming more challenging due to the fast-growing data generation rate. The rapid analysis possible with ML methods can enable fast feedback loops that can be used to adjust experimental setups in real time, for example when errors occur or interesting events are detected. However, to avoid degradation in ML performance over time due to changes in an instrument or sample, we need a way to update ML models rapidly while an experiment is running. We present here a data service and model service to accelerate deep neural network training with a focus on ML-based scientific applications. Our proposed data service achieves a 100x speedup in terms of data labelling compared to the current state of the art. Further, our model service achieves up to a 200x improvement in training speed. Overall, fairDMS achieves up to a 92x speedup in terms of end-to-end model updating time.

ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems

Burak Aksar

Abstract

Diagnosing causes of performance variations in High-Performance Computing (HPC) systems is a daunting challenge due to the systems' scale and complexity. Variations in application performance result in premature job termination, lower energy efficiency, or wasted computing resources. One potential solution is manual root-cause analysis based on system telemetry data. However, this approach has become an increasingly time-consuming procedure as the process relies on human expertise and the size of telemetry data is voluminous. Recent research employs supervised machine learning (ML) models to diagnose previously encountered performance anomalies in compute nodes automatically. However, these models generally necessitate vast amounts of labeled samples that represent anomalous and healthy states of an application during training. This demand for labeled samples is constraining because gathering labeled samples is difficult and costly, especially considering anomalies that occur infrequently. This paper proposes a novel active learning-based framework that diagnoses previously encountered performance anomalies in HPC systems using significantly fewer labeled samples compared to state-of-the-art ML-based frameworks. Our framework combines an active learning-based query strategy and a supervised classifier to minimize the number of labeled samples required to achieve a target performance score. We evaluate our framework on a production HPC system and a testbed HPC cluster using real and proxy applications. We show that our framework, ALBADross, achieves a 0.95 F1-score using 28x fewer labeled samples compared to a supervised approach with equal F1-score, even when there are previously unseen applications and application inputs in the test dataset.

HPC Storage Service Autotuning Using Variational-Autoencoder-Guided Asynchronous Bayesian Optimization

Matthieu Dorier, Romain Egele

Abstract

Distributed data storage services tailored to specific applications have grown popular in the high-performance computing (HPC) community as a way to address I/O and storage challenges. These services offer a variety of specific interfaces, semantics, and data representations. They also expose many tuning parameters, making it difficult for their users to find the best configuration for a given workload and platform. To address this issue, we develop a novel variational-autoencoder-guided asynchronous Bayesian optimization method to tune HPC storage service parameters. Our approach uses transfer learning to leverage prior tuning results and uses a dynamically updated surrogate model to explore the large parameter search space in a systematic way. We implement our approach within the DeepHyper open-source framework, and apply it to the autotuning of a high-energy physics workflow on Argonne's Theta supercomputer. We show that our transfer-learning approach enables a more than 40x search speedup over random search, compared with a 2.5x to 10x speedup when not using transfer learning. Additionally, we show that our approach is on par with state-of-the-art autotuning frameworks in speed and outperforms them in resource utilization and parallelization capabilities.

Node Technologies

HS10

Chair: Mohamed Hassan

Enabling Dynamic Virtual Frequency Scaling for Virtual Machines in the Cloud

Emile Cadorel

Abstract

With the democratization of the Cloud paradigm, many applications are developed to be executed inside virtual machines hosted by remote data centers providing Infrastructure-as-a-Service (IaaS). These applications, developed by different users with different goals, tend to have different behaviors, hence treating them all alike on the Cloud provider side seems to be sub-optimal. Indeed, VMs are black boxes with attached vCPUs whose frequencies are all the same and mainly indicative. In our opinion, an important limitation can be noted here. Because the Cloud provider is unaware of the applications that are executed inside the VMs, it has little insight into the behavior of the applications and how to manage the VMs. For these reasons, the Cloud provider can assign too many or too few resources to a VM, and might rely on migration mechanisms to cope with that problem. In this paper, we propose to attach a virtual frequency to the VM template, which can be configured by the customer to better describe her expected application requirements and the associated quality of service. Then, to enforce this virtual frequency, we designed a controller that leverages the Linux cgroup system to dynamically adjust the configuration on the host machine. We evaluate our new controller on a real infrastructure with real CPU-intensive applications executed by VMs with different frequencies. We also discuss the benefits of our virtual frequency capping for VM placement.

SVAGC: Garbage Collection with a Scalable Virtual Address Swapping Technique

Ismail Ataie, Weikuan Yu

Abstract

Managed programming languages including Java and Scala are very popular for data analytics and mobile applications. However, they often face challenging issues due to the overhead caused by the automatic memory management to detect and reclaim free available memory. It has been observed that during their Garbage Collection (GC), excessively long pauses can account for up to 40% of the total execution time. Therefore, mitigating the GC overhead has been an active research topic to satisfy today's application requirements. This paper proposes a new technique called SwapVA to improve data copying in the copying/moving phases of GCs and reduce the GC pause time, thereby mitigating the issue of GC overhead. Our contribution is twofold. First, a SwapVA system call is introduced as a zero-copy technique to accelerate the GC copying/moving phase. Second, for the demonstration of its effectiveness, we have integrated SwapVA into SVAGC as an implementation of scalable Full GC on multi-core systems. Based on our results, the proposed solutions can dramatically reduce the GC pause in applications with large objects by as much as 70.9% and 97%, respectively, in the Sparse.large/4 (one quarter of the default input size) and Sigverify benchmarks.

The Cost of Flexibility: Embedded versus Discrete Routers in CGRAs for HPC

Boma Adhi

Abstract

Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance and usability of Central Processing Units (CPUs) and the reconfigurability of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been used successfully to accelerate embedded applications and are today also being considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two different domains with different applications and constraints, and it is today not fully understood which CGRA design decisions adequately cater to the HPC market. One such unknown parameter is the interconnect that facilitates intra-CGRA communication. Today, intra-CGRA communication comes in two flavors: using routers closely embedded into the compute units or using discrete routers outside the compute units. The former trades flexibility to reduce hardware cost, while the latter has greater flexibility with more resource usage. In this paper, we investigate which of the two designs suits the CGRA HPC segment. We extend our previous methodology, which consists of both a parameterized CGRA design and an OpenMP-capable compiler, to accommodate both types of routing designs, including verification tests using RTL simulation. Our results show that the discrete router design can facilitate better use of processing elements (PEs) compared to embedded routers and achieve between 26.39% and 81.25% reduction in unnecessary PE occupancy for an aggressively unrolled stencil kernel on an 18 x 16 CGRA at an (estimated) interconnect resource overhead of 6.3x. This reduction in PE occupancy can be used to exploit instruction-level parallelism (ILP) through even more aggressive unrolling.

12:30 - 14:00

Lunch

14:00 - 15:30 -- Parallel Sessions

Tensors & Linear Algebra

HS13

Chair: Trilce Estrada

BALA-CPD: BALanced and Asynchronous Distributed Tensor Decomposition

Zheng Miao

Abstract

Tensor decomposition is widely used in machine learning, recommendation systems, and social networks. Parallel algorithms running on distributed memory systems are required to decompose large real-world tensors. Parallel algorithms suffer from two major performance bottlenecks: load imbalance and communication cost, which are difficult to overcome due to the inherent tradeoff among the multiple types of computations and communications, especially for irregular sparse tensors. Previous work predominantly focuses on balancing the load within the tensor-related computation, resulting in imbalance for multiple matrix-only computations and increased communication costs. It also extensively uses collective communication operations and bulk-synchronous computations by interleaving stages of global communication and stages of local computation, failing to hide the communication cost. In this paper, we present a novel algorithm, BALA-CPD, which achieves the best overall workload balance and efficiently overlaps communication and computation for the popular distributed Canonical Polyadic Decomposition (CPD) algorithms. BALA-CPD uses a workload and data partition scheme that prioritizes the load balance for all the matrix-only computations and all the communications. When necessary, BALA-CPD adjusts to mitigate the load imbalance for the tensor-related computation. Departing from the bulk-synchronous approaches, BALA-CPD breaks down computation and communication into consecutive stages, and masks the communication costs by a combination of one-sided asynchronous communication and a fine-grained interleaving of communication and computation. We implement BALA-CPD and evaluate it on a 64-node cluster with 1280 processors. Experimental results show BALA-CPD is scalable and outperforms the state-of-the-art distributed implementations by up to 1.8X on 1280 processors.

Optimizations of H-matrix-vector Multiplication for Modern Multi-core Processors

Tetsuya Hoshino

Abstract

Hierarchical matrices (H-matrices) are a robust method for approximating the dense matrices that appear in the Boundary Element Method (BEM). In order to accelerate solving linear systems using an iterative method in the BEM, it is essential to speed up the matrix-vector multiplication in the iterative linear solver. However, compared with the case of dense or sparse matrices, there are not enough studies on speeding up the Hierarchical Matrix-Vector multiplication (HiMV). The HiMV algorithm consists of the multiplication of a number of sub-matrices and vectors, but it is not clear how these should be implemented for acceleration. This paper discusses optimization methodologies of HiMV for modern multi-core CPUs, such as the H-matrix storage method for efficient memory access, a method for avoiding write contention during reduction operations on the solution vector, inter-thread load balancing, and blocking and sub-matrix sorting methods for cache efficiency. As a result, we demonstrate that these optimizations significantly impact the performance on modern CPU-based supercomputers. For instance, we assumed that the best performance to aim for is the performance of dense matrix-vector multiplication (GEMV) and obtained 84.8%, 100.7%, and 98.7% of the GEMV Flops in single-socket runs on A64FX, AMD EPYC, and Intel Xeon Cascade Lake, respectively. We found that optimization for better memory performance and cache efficiency is especially important for the A64FX with high-speed HBM2 memory.

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

Shangfei Yin

Abstract

General Matrix Multiplication (GEMM) has a wide range of applications in scientific simulation and artificial intelligence. Although traditional libraries can achieve high performance on large regular-shaped GEMMs, they often perform poorly on irregular-shaped GEMMs, which are often found in new algorithms and applications of high-performance computing (HPC). Due to energy efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. Targeting multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC, an efficient implementation - ftIMM - for three types of irregular-shaped GEMMs is proposed. FtIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. The experiments show that ftIMM achieves better performance than the traditional GEMM implementations on multi-core DSPs in FT-m7032, yielding up to a 7.2x performance improvement on irregular-shaped GEMMs. Moreover, ftIMM on multi-core DSPs can far outperform the open source library on multi-core CPUs in FT-m7032, delivering up to 30.7x higher performance.

Distributed Memory Applications

HS10

Chair: Dirk Pleiter

Painless Transposition of Reproducible Distributed Environments with NixOS Compose

Quentin Guilloteau

Abstract

Development of environments for distributed systems is a tedious and time-consuming iterative process. The reproducibility of such environments is a crucial factor for rigorous scientific contributions. We think that being able to smoothly test environments both locally and on a target distributed platform makes development cycles faster and reduces the friction to adopt better experimental practices. To address this issue, this paper introduces the notion of environment transposition and implements it in NixOS Compose, a tool that generates reproducible distributed environments. It enables users to deploy their environments on virtualized (Docker, QEMU) or physical (Grid'5000) platforms with the same unique description of the environment. We show that NixOS Compose makes it possible to build reproducible environments without overhead by comparing it to state-of-the-art solutions for the generation of distributed environments (EnOSlib and Kameleon). NixOS Compose actually enables substantial performance improvements on image building time over Kameleon (up to 11x faster for initial builds and up to 19x faster when building a variation of an existing environment).

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

Matthew Whitlock

Abstract

Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty of integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the solution does not rest in building integrated solutions for users, but in runtimes designed to integrate into a larger comprehensive resilience system, thereby enabling the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Resilient Kokkos and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.

Fast Dynamic Updates and Dynamic SpGEMM on MPI-Distributed Graphs

Alexander van der Grinten

Abstract

Sparse matrix multiplication (SpGEMM) is a fundamental kernel used in many diverse application areas, both numerical and discrete. For example, many algebraic graph algorithms rely on SpGEMM in the tropical semiring to compute shortest paths in graphs. Recently, SpGEMM has received growing attention regarding implementations for specific (parallel) architectures. Yet, this concerns only the static problem, in which neither input matrix changes. In many applications, however, matrices (or their corresponding graphs) change over time. Although recomputing from scratch is very expensive, we are not aware of any dynamic SpGEMM algorithms in the literature. In this paper, we thus propose a batch-dynamic algorithm for MPI-based parallel computing. Building on top of a distributed graph/matrix data structure that allows for fast updates, our dynamic SpGEMM reduces the communication volume significantly. It does so by exploiting the fact that updates change far fewer matrix entries than there are non-zeros in the input operands. Our experiments with popular benchmark graphs show that our approach pays off. For batches of insertions or removals of matrix entries, our dynamic SpGEMM is substantially faster than the static algorithms in the state-of-the-art competitors CombBLAS, CTF and PETSc.
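
As an illustration of the shortest-path use case mentioned in the abstract, the following minimal sketch multiplies sparse matrices in the tropical (min, +) semiring and squares the distance matrix repeatedly. It is a sequential, static toy example and not the batch-dynamic MPI algorithm of the paper; the dict-of-dicts storage and the names `min_plus_spgemm` and `all_pairs_shortest_paths` are assumptions made for this example only.

```python
import math

def min_plus_spgemm(A, B):
    """Multiply two sparse matrices in the tropical (min, +) semiring.

    A matrix is a dict mapping row -> {col: weight}; absent entries
    stand for +infinity (the semiring's additive identity).
    """
    C = {}
    for i, row in A.items():
        out = {}
        for k, a_ik in row.items():
            # Only rows of B that match a stored column of A contribute.
            for j, b_kj in B.get(k, {}).items():
                cand = a_ik + b_kj          # semiring "multiplication"
                if cand < out.get(j, math.inf):
                    out[j] = cand           # semiring "addition" is min
        if out:
            C[i] = out
    return C

def all_pairs_shortest_paths(edges, n):
    """Shortest path lengths by repeated min-plus squaring.

    edges maps (u, v) -> weight for a directed graph on vertices 0..n-1.
    Assumes the graph has no negative cycles.
    """
    # Distance matrix for paths of at most one edge (0 on the diagonal).
    D = {v: {v: 0.0} for v in range(n)}
    for (u, v), w in edges.items():
        D[u][v] = min(w, D[u].get(v, math.inf))
    hops = 1
    while hops < n - 1:
        D = min_plus_spgemm(D, D)  # doubles the number of edges covered
        hops *= 2
    return D

edges = {(0, 1): 4.0, (0, 2): 1.0, (2, 1): 2.0, (1, 3): 5.0}
dist = all_pairs_shortest_paths(edges, 4)
```

Here `dist[0][1]` is the length of the path 0 -> 2 -> 1 rather than the heavier direct edge, and unreachable pairs simply have no stored entry.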

15:30 - 16:00

Coffee Break

16:00 - 17:30 -- Parallel Sessions

Deep Learning

HS13

Chair: Rio Yokota

HVAC: Removing I/O Bottleneck for Large-Scale Deep Learning Applications

Awais Khan

Abstract

Scientific communities are increasingly adopting deep learning (DL) models in their applications to accelerate scientific discovery processes. However, with rapid growth in the computing capabilities of HPC supercomputers, large-scale DL applications have to spend a significant portion of training time performing I/O to a parallel storage system. Previous research has investigated optimization techniques such as prefetching and caching. Unfortunately, there exist non-trivial challenges to adopting the existing solutions on HPC supercomputers for large-scale DL training applications, which include nonperformance and/or failures at extreme scale, lack of portability and generality in design, complex deployment methodology, and being limited to a specific application or dataset. To address these challenges, we propose High-Velocity AI Cache (HVAC), a distributed read-cache layer that targets and fully exploits the node-local storage or near node-local storage technology. HVAC seamlessly accelerates read I/O by aggregating node-local or near node-local storage, avoiding metadata lookups and file locking while preserving portability in the application code. We deploy and evaluate HVAC on 1024 nodes (with over 6000 NVIDIA V100 GPUs) of the Summit supercomputer. In particular, we evaluate the scalability, efficiency, accuracy, and load distribution of HVAC compared to GPFS and XFS-on-NVMe. With four different DL applications, we observe an average 25% performance improvement over GPFS and a 9% drop against XFS-on-NVMe, which scales linearly and is considered the performance upper bound. We envision HVAC as an important caching library for upcoming HPC supercomputers such as Frontier.

AutoPipe: A Fast Pipeline Parallelism Approach with Balanced Partitioning and Micro-batch Slicing

Weijie Liu

Abstract

Recently, pipeline parallelism has been widely used in the training of large DNN models. However, there are still two main challenges for efficient pipeline parallelism: i) a balanced model partition is crucial for pipeline efficiency, whereas prior works lack a sound solution to automatically generate a balanced partition; ii) the startup overhead is inevitable and especially significant for deep pipelines, which makes it an important source of pipeline bubbles and severely affects pipeline scalability. We propose AutoPipe to solve these two problems. AutoPipe contains i) a planner for automatically and quickly generating a balanced pipeline partition scheme with a fine-grained partitioner, which groups the DNN at sub-layer granularity and finds the balanced scheme with a heuristic search algorithm; and ii) a micro-batch slicer that reduces pipeline startup overhead according to the planner results by splitting the micro-batch evenly. This slicer automatically determines an appropriate number of slices into which to split the micro-batch. The experimental results show that AutoPipe can accelerate training by up to 1.30x over the state-of-the-art distributed training framework Megatron-LM, with a 50% reduction in startup overhead and an order-of-magnitude reduction in pipeline planning time. Furthermore, the AutoPipe planner improves the partition balance by 2.73x-12.7x compared to the DAPPLE Planner and Piper.

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Yabo Duan

Abstract

As deep learning models grow larger, training a model with a single computational resource becomes impractical. To solve this, hybrid parallelism, which combines data and pipeline parallelism, has emerged to train large models with multiple GPUs. In practice, using heterogeneous GPU clusters to train large models is a common need because often only part of the hardware is upgraded. However, existing hybrid parallelism approaches do not work well in heterogeneous environments with respect to communication efficiency, workload balance among GPUs, and utilization of memory-constrained GPUs. To address these problems, we present a parallel DNN training approach, Hybrid Parallelism on Heterogeneous clusters (HPH). In HPH, we propose a topology designer that minimizes the communication time cost. Furthermore, HPH uses a partition algorithm that automatically partitions DNN layers among workers to maximize throughput. In addition, HPH adopts recomputation-aware scheduling to reduce memory consumption and further reschedules the pipeline to eliminate the extra time overhead of recomputation. Our experimental results on a 32-GPU heterogeneous cluster show that HPH achieves up to 141% training speed-up compared with the state-of-the-art approach.

Shared Memory

HS10

Chair: Jay Lofstead

Recursive Multi-Section on the Fly: Shared-Memory Streaming Algorithms for Hierarchical Graph Partitioning and Process Mapping

Marcelo Fonseca Faraj

Abstract

Partitioning a graph into balanced blocks such that few edges run between blocks is a key problem for large-scale distributed processing. In this work, we present a shared-memory streaming multi-recursive partitioning scheme that performs recursive multi-sections on the fly without knowing the overall input graph to compute hierarchical partitionings. If a hierarchy is not specified as an input, our approach can also be used as a tool to solve the standard graph partitioning problem. Our approach has a considerably lower running time complexity than state-of-the-art non-buffered one-pass partitioning algorithms designed for the non-hierarchical graph partitioning case. Moreover, if the topology of a distributed system is known, it is possible to further optimize the communication costs by mapping partitions onto processing elements. Our experiments indicate that our algorithm is faster and produces better process mappings than competing tools. In the case of graph partitioning, our framework is up to two orders of magnitude faster than Fennel at the cost of 5% more cut edges.

MemGaze: Rapid & Effective Load-Level Memory Analysis

Ozgur Ozan Kilic

Abstract

A major challenge of memory analysis tools is combining high-resolution analysis and low overhead measurement. Currently, hardware/software-based analysis of load-level sequences incurs time slowdowns of O(100×). We present MemGaze, a tool for low-overhead, high-resolution memory analysis. MemGaze uses Intel’s Processor Tracing (PT) instruction ptwrite to collect sampled and compressed memory address traces for load-level, sequence-aware analysis of data reuse. We describe multi-resolution analysis for locations vs. operations, accesses vs. spatio-temporal reuse, and reuse (distance, rate, volume) vs. access patterns. Both trace size and resolution are controllable. We use MemGaze to elucidate the memory effects of different data structures and algorithms. For sampled traces that are ≈1% of a full one, analysis metrics have 1-25% MAPE for histograms of varying dynamic sequence lengths. With current suboptimal kernel support (PT runs continuously), MemGaze’s time overhead is typically 10–95%; 7× at worst. However, when PT runs only during samples, overhead is 10–35% on memory intensive regions and correlates with executed ptwrites.

Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI

Kinan Alattar

Abstract

There are several popular Big Data processing frameworks including Apache Spark, Dask, and Ray. The Apache Spark software provides an easy-to-use programming API in different languages including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads, with communication handled by an event-driven framework called Netty. In this context, efforts including Spark-RDMA and SparkUCX were made in the past to optimize Apache Spark on High-Performance Computing (HPC) systems equipped with low-latency and high-performance interconnects like InfiniBand. In the HPC community, Message Passing Interface (MPI) libraries are widely adopted for parallelizing science and engineering applications. This paper designs and implements MPI4Spark, which uses MPI for communication in a parallel and distributed setting on HPC systems. This approach realizes the vision of a "Converged Communication Stack" for Big Data, Deep Learning, and HPC workloads. It also provides portability and performance benefits, since MPI4Spark is capable of utilizing popular HPC interconnects including InfiniBand, Omni-Path, Slingshot, and others. MPI4Spark relies on several features offered by MPI including point-to-point communication primitives, intra-communicators, inter-communicators, and non-blocking probe functions. A unique feature of MPI4Spark is that it utilizes Dynamic Process Management (DPM) for maintaining the distributed execution model of the Spark ecosystem. The performance of MPI4Spark is evaluated against Spark-RDMA and Vanilla Spark using the OSU HiBD Benchmarks (OHB) and the Intel HiBench suite, which contains a variety of Resilient Distributed Dataset (RDD), Graph Processing, and Machine Learning workloads. This evaluation is done on three HPC systems: TACC Frontera, TACC Stampede2, and an internal cluster.