IEEE Cluster 2023 Program

Conference - Tuesday, Oct 31

8:30 - 9:15

Registration

9:15 - 10:45

Mesa Ballroom

HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications

Canyon Room

Tutorial: Performance Analysis, Tools and Best-Known Methods on Multi-Chip Module Chiplet based High Performance Computing AMD EPYC Zen4 Architecture

10:45 - 11:15

Coffee Break

11:15 - 12:45

Mesa Ballroom

HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications

Canyon Room

Tutorial: Zen4

12:45 - 14:00

Lunch (provided)

14:00 - 15:30

Mesa Ballroom

HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications

Canyon Room

REX-IO: Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads

15:30 - 16:00

Coffee Break

16:00 - 17:30

Mesa Ballroom

HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications

Canyon Room

REX-IO: Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads

Conference - Wednesday, Nov 1

8:15 - 9:00

Registration

9:00 - 9:30

Cluster 2023 Opening

Mesa Ballroom

9:30 - 10:30

Keynote: Bill Magro (Google)

Mesa Ballroom

AI, Cloud, and the Future of HPC.

Chair: Scott Pakin, Los Alamos National Laboratory

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

Distributed Machine Learning (Session I)

Mesa Ballroom

Chair: Olamide Timothy Tawose, Lincoln University, Pennsylvania

Accelerating Distributed ML Training via Selective Synchronization

Sahil Tyagi (Indiana University Bloomington) and Martin Swany (Indiana University)

Abstract

In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present SelSync, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of SelSync to improve convergence in the context of semi-synchronous training. In our evaluation, SelSync converges to the same or better accuracy than BSP while reducing training time by up to 14x.
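
The per-step choice between aggregating and applying local updates can be illustrated with a toy, single-process sketch. The significance test used here (relative change in gradient norm against a threshold `delta`) and all function names are assumptions made for illustration; the abstract does not specify the rule SelSync actually uses.

```python
import math

def grad_norm(grad):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(g * g for g in grad))

def is_significant(prev_norm, curr_norm, delta=0.25):
    """Hypothetical rule: a step is significant if the gradient norm
    changed by at least `delta` relative to the previous step."""
    if prev_norm == 0.0:
        return True
    return abs(curr_norm - prev_norm) / prev_norm >= delta

def training_step(local_grads, prev_norms, delta=0.25):
    """Return (updates, synced, norms) for one step over all workers.

    local_grads: one gradient (list of floats) per simulated worker.
    prev_norms:  gradient norm of each worker at the previous step.
    """
    norms = [grad_norm(g) for g in local_grads]
    # If any worker sees a significant change, every worker aggregates.
    synced = any(is_significant(p, n, delta) for p, n in zip(prev_norms, norms))
    if synced:
        dim = len(local_grads[0])
        avg = [sum(g[i] for g in local_grads) / len(local_grads) for i in range(dim)]
        updates = [list(avg) for _ in local_grads]   # same averaged update everywhere
    else:
        updates = [list(g) for g in local_grads]     # purely local updates, no communication
    return updates, synced, norms
```

In a real deployment the averaging branch would be a collective call (for example an all-reduce) and the decision itself would have to be agreed on by all workers; the sketch only shows the control flow.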

PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning

Kevin Assogba (Rochester Institute of Technology), Eduardo Lima (Rochester Institute of Technology), M. Mustafa Rafique (Rochester Institute of Technology), and Minseok Kwon (Rochester Institute of Technology)

Abstract

Predicting Deep Learning (DL) training workload runtime allows for optimized usage of both on-premises and public data centers, e.g., allocating resources for task completion before a deadline. The state-of-the-art prediction models, e.g., Ernest and Cherrypick, treat workloads as black boxes, and require running the workload on a fraction of the dataset every time a change occurs followed by retraining the prediction model. This significantly limits the reusability of prediction models across workloads with different DL architectures. In this paper, we propose a different approach where the prediction model is trained only once for a particular dataset type, e.g., ImageNet, thus completely avoiding tedious and costly retraining tasks for new DL workloads. Our proposed approach, called PredictDDL, provides an end-to-end performance prediction system for distributed DL training workloads. PredictDDL leverages Graph HyperNetworks, a class of neural networks that takes computational graphs as input and produces vector representations of neural networks. PredictDDL is the first prediction model that eliminates the need to retrain a performance prediction model for each new DL workload and maximizes the reuse of the prediction model by requiring the workload to be run only once to collect time measurements for training the prediction model. Our extensive evaluation using representative workloads shows that PredictDDL achieves up to 9.8x lower average prediction error and 10.3x lower inference duration compared to a state-of-the-art system, Ernest, on workloads with multiple DNN architectures.

Exact Distributed Stochastic Block Partitioning

Frank Wanye (Virginia Tech), Vitaliy Gleyzer (MIT Lincoln Lab), Edward Kao (MIT Lincoln Lab), and Wu-chun Feng (Virginia Tech)

Abstract

Stochastic block partitioning (SBP) is a community detection algorithm that is highly accurate even on graphs with a complex community structure, but its inherently serial nature hinders its widespread adoption by the wider scientific community. To make it practical to analyze large real-world graphs with SBP, there is a growing need to parallelize and distribute the algorithm. The current state-of-the-art distributed SBP algorithm is a divide-and-conquer approach that limits communication between compute nodes until the end of inference. This leads to the breaking of computational dependencies, which causes convergence issues as the number of compute nodes increases, and when the graph is sufficiently sparse. In this paper, we introduce EDiSt - an exact distributed stochastic block partitioning algorithm. Under EDiSt, compute nodes periodically share community assignments during inference. Due to this additional communication, EDiSt improves upon the divide-and-conquer algorithm by allowing it to scale out to a larger number of compute nodes without suffering from convergence issues, even on sparse graphs. We show that EDiSt provides speedups of up to 23.8× over the divide-and-conquer approach, and speedups up to 38.0× over shared memory parallel SBP when scaled out to 64 compute nodes.

Resource Management (Session II)

Canyon Room

Chair: Jesper Larsson Träff, TU Wien, Faculty of Informatics

DEHype: Retrofitting Hypervisors for a Resource-Disaggregated Environment

Taehoon Kim (ETRI), Kwangwon Koh (ETRI), Changdae Kim (ETRI), Eunji Pak (ETRI), Yeonjeong Jeong (ETRI), and Sang-Hoon Kim (Ajou University)

Abstract

Resource disaggregation has been proposed as a solution to resource under-utilization in data centers. However, host virtualization technologies, which are the basic building blocks for constructing data centers, are implemented without considering disaggregated resources. In addition, we find that the RDMA I/O unit plays a significant role in the performance of a resource-disaggregated environment. In this study, we propose DEHype, which alleviates the inefficiency of hypervisors adapted to a disaggregated environment by investigating host virtualization technologies that are suitable for disaggregated memory systems. Specifically, DEHype aims to identify and resolve the performance issues of virtual machines running on KVM/QEMU in a disaggregated resource environment. The results demonstrate the effectiveness of the proposed optimizations in improving the performance of disaggregated memory systems. DEHype achieves up to a 351% improvement over the state-of-the-art disaggregated memory system.

SciLance: Mitigate Load Imbalance for Parallel Scientific Applications in Cloud Environments

Xinying Wang (University of Nevada), Lipeng Wan (Georgia State University), Scott Klasky (Oak Ridge National Laboratory), Dongfang Zhao (University of Washington), and Feng Yan (University of Houston)

Abstract

Elastic cloud computing provides new opportunities for accelerating the process of scientific discovery. However, unlike high-performance computing (HPC) systems that are built and optimized for workloads with intensive inter-node communication demands, low-latency, high-bandwidth communication is available only on a few very expensive high-end instance types in the cloud, which leads to poor cost-effectiveness. In addition, re-balancing the workload through extra data movement among compute nodes is a common way to mitigate the load imbalance issue in many scientific simulations, which further amplifies the communication pressure and makes it challenging to efficiently use cloud resources. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. Our key insight is that instead of moving data among compute nodes to balance the workload, we create a heterogeneous resource pool to dynamically adapt resource allocation to compensate for the profiled runtime imbalance. We prototype SciLance and perform extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for WarpX simulation codes.

Generalized Collectives for the Exascale Era

Michael Wilkins (Northwestern University), Hanming Wang (Northwestern University), Peizhi Liu (Northwestern University), Bangyen Pham (Northwestern University), Yanfei Guo (Argonne National Laboratory), Rajeev Thakur (Argonne National Laboratory), Peter Dinda (Northwestern University), and Nikos Hardavellas (Northwestern University)

Abstract

Exascale supercomputers have renewed the exigence of improving distributed communication, specifically MPI collectives. Previous works accelerated collectives for specific scenarios by changing the radix of the collective algorithms. However, these approaches fail to explore the interplay between modern hardware features, such as multi-port networks, and software features, such as message size. In this paper, we present a novel approach that uses system-agnostic, generalized (i.e., variable-radix) algorithms to capture all relevant features and provide broad speedups for upcoming exascale-class systems.

We identify hardware commonalities found on announced exascale systems and generalize three common communication kernels (binomial tree, ring, and recursive doubling) to better leverage these features, creating 10 implementations. For each kernel, we develop analytical models to intuit algorithm performance with varying radix values.

Experiments on the world’s first exascale supercomputer (Frontier at ORNL) and an exascale test system (Polaris at ANL) show that our generalized algorithms outperform the baseline open-source and proprietary vendor MPI implementations by a significant margin, up to over 4.5x. We empirically determine optimal algorithms and parameter values, identifying where the analytical models are accurate and where hardware features directly determine performance. Most notably, we show how a single, system-agnostic implementation of a generalized algorithm can optimize for multiple hardware/software features across multiple systems.
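
To make the idea of a variable-radix collective concrete, the sketch below builds the textbook k-nomial broadcast tree, which reduces to the binomial tree when the radix is 2. It is a generic illustration rather than the authors' implementation; the function names are hypothetical and the root is assumed to be rank 0.

```python
def knomial_children(rank, size, radix):
    """Children of `rank` in a k-nomial broadcast tree rooted at rank 0.

    A child differs from its parent in exactly one base-`radix` digit,
    located strictly below the parent's lowest nonzero digit.
    """
    children = []
    mask = 1
    while mask < size:
        if (rank // mask) % radix != 0:
            break  # reached this rank's lowest nonzero digit
        for j in range(1, radix):
            child = rank + j * mask
            if child < size:
                children.append(child)
        mask *= radix
    return children

def knomial_rounds(size, radix):
    """Number of communication rounds: the smallest r with radix**r >= size."""
    rounds, reach = 0, 1
    while reach < size:
        reach *= radix
        rounds += 1
    return rounds
```

For 16 processes, radix 2 needs 4 rounds with at most one send per process per round, while radix 4 needs 2 rounds with up to three sends per process per round. Which choice wins depends on factors such as message size and the number of network ports, which is the trade-off the abstract describes.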

12:30 - 14:00

Lunch (provided)

14:00 - 15:30 -- Parallel Sessions

Software Systems for ML (Session III)

Mesa Ballroom

Chair: Jim Brandt, Sandia National Laboratories

FedGuard: Selective Parameter Aggregation for Poisoning Attack Mitigation in Federated Learning

Melvin Chelli (DFKI), Cédric Prigent (INRIA), René Schubotz (DFKI), Alexandru Costan (IRISA/INSA Rennes, INSA Rennes), Gabriel Antoniu (Inria), Loïc Cudennec (DGA), and Philipp Slusallek (DFKI)

Abstract

Minimizing the attack surface of Federated Learning (FL) systems is a field of active research. FL turns out to be highly vulnerable to various threats coming from the edge of the network. Current approaches rely on robust aggregation, anomaly detection and generative models for defending against poisoning attacks. Yet, they either have limited defensive capabilities due to their underlying design or are impractical to use as they rely on constraining building blocks.

We introduce FedGuard, a novel FL framework that utilizes the generative capabilities of Conditional Variational AutoEncoders (CVAE) to effectively defend against poisoning attacks with tuneable overhead in communication and computation.

Whilst the idea of hardening a FL system using generative models is not entirely new, FedGuard’s original contribution is in its selective parameter aggregation operator with parameter selection being driven by synthetic validation data sampled from the CVAEs trained locally by each participating party.

Experimental evaluations in a 100-client setup demonstrate that FedGuard is more effective against label and sign flipping attacks, as well as additive noise and same value attacks, than previous works. FedGuard successfully defends in scenarios with up to 50% malicious peers where other strategies fail. In addition, FedGuard does not require auxiliary datasets or centralized (pre-) training, and provides resilience against poisoning attacks from the very first round of federated training.

Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

Wei Wang (National University of Defense Technology), Zhiquan Lai (NUDT, Computer College), Shengwei Li (National University of Defense Technology), Weijie Lu (National University of Defense Technology), Keshi Ge (National University of Defense Technology), Yujie Liu (National University of Defense Technology), Ao Shen (National University of Defense Technology), and Dongsheng Li (National University of Defense Technology, Computer College)

Abstract

Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extra-large size with negligible increases in computation. However, a significant load imbalance occurs in the device during the training of an MoE model, resulting in significantly reduced throughput. Previous works on load balancing introduce additional runtime overhead and suffer from high execution overhead. To address these issues, we present Prophet: a fine-grained load balancing method for parallel training of large-scale MoE models, which consists of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the possible scenarios for the expert placement in a fine-grained manner, and then efficiently searches for a well-balanced expert placement to avoid additional overhead. The Prophet scheduler uses the locality of the token distribution to schedule the resource allocation operations using a layer-wise fine-grained schedule strategy to hide their overhead. We conduct extensive experiments in four clusters and five representative models. The results indicate that Prophet gains up to 2.3x speedup compared to the state-of-the-art MoE frameworks including DeepSpeed-MoE and FasterMoE. Additionally, Prophet achieves a load balancing enhancement of up to 12.06x when compared to FasterMoE.

HIOS: Hierarchical Inter-Operator Scheduler for Real-Time Inference of DAG-Structured Deep Learning Models on Multiple GPUs

Turja Kundu (University of North Texas) and Tong Shu (University of North Texas)

Abstract

Neural-network-enabled data analysis in real-time scientific applications imposes stringent requirements on inference latency. Meanwhile, recent deep learning (DL) model design tends to replace a single branch with multiple branches for high prediction accuracy and robustness, which makes inter-operator parallelization an effective approach to improving inference latency. However, existing inter-operator parallelization techniques for inference acceleration are mainly focused on utilization optimization in a single GPU. With the data size of an input sample and the scale of a DL model ever-growing, the limited resources of a single GPU are insufficient to support parallel execution of large operations. In order to break this limitation, we study hybrid inter-operator parallelism both among multiple GPUs and within each GPU. In this paper, we propose a hierarchical inter-operator scheduler (HIOS) to automatically distribute large operators onto different GPUs and group small operators in the same GPU for parallel execution. In particular, we design a novel scheduling algorithm, named HIOS-LP, which consists of inter-GPU operator parallelization through iterative critical-path mapping and intra-GPU operator parallelization based on a sliding window. Experiments with modern convolutional neural network benchmarks on different GPU platforms demonstrate that our HIOS-LP outperforms the state-of-the-art inter-operator scheduling algorithm IOS by up to 28%.

Storage Systems and Data Management (Session IV)

Canyon Room

Chair: Sarah Neuwirth, Johannes Gutenberg University Mainz, Juelich Supercomputing Centre

FullRepair: Towards Optimal Repair Pipelining in Erasure-Coded Clustered Storage Systems

Yuzuo Zhang (Huazhong University of Science and Technology), Xinyuan Tu (Huazhong University of Science and Technology), Lin Wang (Huazhong University of Science and Technology), Yuchong Hu (Huazhong University of Science and Technology), Fang Wang (Huazhong University of Science and Technology), and Ye Wang (Huazhong University of Science and Technology)

Abstract

Clustered storage systems often deploy erasure coding that encodes data into coded chunks and distributes them across nodes to tolerate node failures. It is a storage-efficient redundancy scheme but incurs a high repair penalty; thus there are many studies focusing on the erasure-coded repair of failed blocks, and some state-of-the-art approaches pipeline the repair of failed data to improve the repair performance. However, we observe that all existing repair pipelining methods only use a single pipeline, leaving the network bandwidth resources of storage nodes underutilized. In this paper, we propose FullRepair, a new repair pipelining mechanism based on multiple pipelines with the aim of fully exploiting all available bandwidth resources during repair. We construct four constraints to model the repair pipelining problem such that FullRepair can obtain the optimal pipelined repair throughput under full bandwidth utilization. We design a multi-pipeline scheduling scheme for FullRepair so as to achieve the above optimality. Experiments on Amazon EC2 show that compared with the state-of-the-art repair pipelining methods RP and PivotRepair, FullRepair reduces the repair time of a single chunk by up to 45.4% and 33.19%, respectively.

Performance Characterization of NVMe Flash Devices with Zoned Namespaces (ZNS)

Krijn Doekemeijer (Vrije Universiteit Amsterdam), Nick Tehrany (TU Delft), Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam), Matias Bjorling (Western Digital), and Animesh Trivedi (Vrije Universiteit Amsterdam)

Abstract

The recent emergence of NVMe flash devices with Zoned Namespace support, ZNS SSDs, represents a significant new advancement in flash storage. ZNS SSDs introduce a new storage abstraction of append-only zones with a set of new I/O (i.e., append) and management (zone state machine transition) commands. With the new abstraction and commands, ZNS SSDs offer more control to the host software stack than a (non-zoned) SSD for flash management, which is known to be complex (because of garbage collection, scheduling, block allocation, parallelism management, over-provisioning). ZNS SSDs are, consequently, gaining adoption in a variety of applications (e.g., file systems, key-value stores, and databases), particularly latency-sensitive big data applications. Despite this enthusiasm, there has yet to be a systematic characterization of ZNS SSD performance with its zoned storage model abstractions and I/O operations. This work addresses this crucial shortcoming. We report on the performance features of a commercially available ZNS SSD (13 key observations), explain how these features can be incorporated into publicly available state-of-the-art ZNS emulators, and recommend guidelines for ZNS SSD application developers. All artifacts (code and data sets) of this study are publicly available at: https://anonymous.4open.science/r/NVMeBenchmarks-0551.

KV-CSD: A Hardware-Accelerated Key-Value Store for Data-Intensive Applications

Inhyuk Park (SK hynix), Qing Zheng (Carnegie Mellon University, New Mexico Consortium), Dominic Manno (Los Alamos National Laboratory), Soonyeal Yang (SK hynix), Jason Lee (Los Alamos National Laboratory), David Bonnie (Los Alamos National Laboratory), Bradley Settlemyer (NVIDIA), Youngjae Kim (Sogang University), Woosuk Chung (SK hynix), and Gary Grider (Los Alamos National Lab)

Abstract

Popular write-optimized software key-value stores such as LevelDB and RocksDB are often good at reads. While data is initially stored in a write-optimized format, in the background it is asynchronously transformed into a read-optimized format for efficient followup queries. Write-optimized key-value stores can still block writes. This happens when those background workers cannot keep up with the foreground insertion workload.

This paper makes a case for a hardware-accelerated key-value store that allows for running performance-critical operations, such as background data reorganization and queries, on storage rather than on a host. This better hides background work latency, prevents it from blocking foreground writes, and improves overall I/O efficiency. Our prototype, called TomDB, is a key-value based computational storage device consisting of an NVMe SSD and a System-on-a-Chip (SoC) that implements an ordered key-value store atop the SSD. Through offloaded processing, TomDB streamlines data insertion, reduces host-device data movement for both background data reorganization and query processing, and shows up to 10.6x lower write times and up to 7.4x faster queries compared to the current state-of-the-art on a real-world scientific dataset.

15:30 - 16:00

Coffee Break

16:00 - 17:00 -- Parallel Sessions

Disaggregated Architectures (Session V)

Canyon Room

Chair: Hariharan Devarajan, Lawrence Livermore National Laboratory

Rethinking Virtual Machines Live Migration for Memory Disaggregation

Xingguo Jia (Shanghai Jiao Tong University), Xingzi Yu (Shanghai Jiao Tong University), Yun Wang (Shanghai Jiao Tong University), Senhao Yu (Beijing Institute of Technology), and Zhengwei Qi (Shanghai Jiao Tong University)

Abstract

Resource underutilization has troubled data centers for several decades. Memory disaggregation provides an efficient way to improve memory utilization while leaving a missing puzzle piece on CPU underutilization. Live migration is an essential method for CPU resource reallocation. However, state-of-the-art Virtual Machine (VM) live migration suffers from significant resource consumption. We discover substantial potential for optimizing live migration in disaggregated memory systems. We propose Anemoi, a resource management system incorporating VM live migration into memory disaggregation to fill in the missing piece. Disaggregated memory enables the read-only replica to be accessible from destination nodes, eliminating considerable network transmission for memory pages and saving migration time. To address the potentially overwhelming memory consumption of duplicate read-only replicas, we design a dedicated compression algorithm. The evaluation shows that Anemoi reduces the network bandwidth use and the migration time by 69% and 83%, respectively, compared to VM live migration. The compression can achieve a space-saving of 83.6%.

Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

George Michelogiannakis (Lawrence Berkeley National Laboratory, Stanford University), Yehia Arafa (New Mexico State University, Qualcomm Inc), Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Liang Yuan Dai (Columbia University), Abdel-Hameed Badawy (New Mexico State University), Madeleine Glick (Columbia University), Yuyang Wang (Columbia University), Keren Bergman (Columbia University), and John Shalf (Lawrence Berkeley National Laboratory)

Abstract

The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to realize these gains and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonic components can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks with negligible power overhead. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4× fewer memory modules and 2× fewer NICs than a non-disaggregated baseline.

ML for Scheduling and Management (Session VI)

Mesa Ballroom

Chair: Haiying Xu, University Corporation for Atmospheric Research

ExplSched: Maximizing Deep Learning Cluster Efficiency for Exploratory Jobs

Zixuan Chen.
Authors: Hongliang Li (Jilin University), Hairui Zhao (Jilin University), Zhewen Xu (Jilin University), Xiang Li (Jilin University), and Haixiao Xu (Jilin University)

Abstract

Resource management for Deep Learning (DL) clusters is essential for system efficiency and model training quality. Existing schedulers provided by DL frameworks are mostly adaptations from traditional HPC clusters and usually work on jobs' makespan, assuming that DL training jobs finish completely. Unfortunately, it is reported that a fair amount of training jobs are exploratory jobs and often finish unsuccessfully in production clusters. This is due to the distinct characteristic of Deep Neural Network (DNN) training that it is an exploratory process of frequent user interventions, such as adjusting model structures, tuning hyperparameters, and exploring feature validity. Existing DL cluster schedulers using offline algorithms are not suitable for exploratory jobs when unexpected early terminations can cause noticeable resource waste. Moreover, DL training jobs are iterative and usually yield diminishing returns. Equally allocating resources among training iterations is not efficient, especially when dealing with exploratory jobs where it can worsen the degradation of system efficiency. The fundamental goal of a DL training job is to gain model quality improvement, usually indicated by the loss reduction (job profit) of a DNN model. This paper introduces a novel scheduling problem for exploratory jobs that seeks to maximize the overall training profit of a DL cluster. We propose ExplSched, an online scheduling solution based on the primal-dual framework. It emphasizes the importance of job profit to resource consumption ratio to make quick resource allocation decisions. Experimental results show that ExplSched achieved an average system utility improvement of 87.28% compared with other related works.

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya (Technical University of Munich), Eishi Arima (Technical University of Munich), Dai Liu (Technical University of Munich), and Martin Schulz (Technical University of Munich)

Abstract

GPU-based heterogeneous architectures are now commonly used in HPC systems. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilisation by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS (Multi-Process Service), a finer-grained logical partitioning; and MIG (Multi-Instance GPU), a coarse-grained physical partitioning. We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles. Our thorough experimental results demonstrate that our approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This results in a maximum throughput improvement by a factor of 1.87 compared to the time-sharing scheduling.

18:00 - 22:00

Social event: Meow Wolf

Meow Wolf: House of Eternal Return

Conference - Thursday, Nov 2

8:15 - 10:00

Registration

10:00 - 10:15

Announcements

Mesa Ballroom

10:15 - 10:30

Special address: Tim Randles

Mesa Ballroom

25 Years of Cluster Computing at Los Alamos National Laboratory.

Chair: Frank Mueller, North Carolina State University

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

Communication (Session VII)

Canyon Room

Chair: Kengo Nakajima, University of Tokyo/RIKEN R-CCS

Communication-Avoiding Recursive Aggregation

Yihao Sun (Syracuse University), Sidharth Kumar (University of Illinois at Chicago), Thomas Gilray (University of Alabama at Birmingham), and Kristopher Micinski (Syracuse University)

Abstract

Recursive aggregation has been of considerable interest due to its unifying a wide range of deductive-analytic workloads, including social-media mining and graph analytics. For example, Single-Source Shortest Paths (SSSP), Connected Components (CC), and PageRank may all be expressed via recursive aggregates. Implementing recursive aggregation has posed a serious algorithmic challenge, with state-of-the-art work identifying sufficient conditions (e.g., pre-mappability) under which implementations may push aggregation within recursion, avoiding the serious materialization overhead inherent to traditional reachability-based methods (e.g., Datalog).

State-of-the-art implementations of engines supporting recursive aggregates focus on large unified machines, due to the challenges posed by mixing semi-naïve evaluation with distribution. In this work, we present an approach to implementing recursive aggregates on high-performance clusters which avoids the communication overhead that prevents current-generation distributed systems from scaling recursive aggregates to extremely high process counts. Our approach leverages the observation that aggregators form functional dependencies, allowing us to implement recursive aggregates via a highly parallel local aggregation to ensure maximal throughput. Additionally, we present a dynamic join planning mechanism, which customizes join order per-iteration based on dynamic relation sizes. We implemented our approach in PARALAGG, a library which allows the declarative implementation of queries which utilize recursive aggregates and executes them using our MPI-based runtime. We evaluate PARALAGG on large unified nodes and leadership-class supercomputers, demonstrating scalability up to 16,384 processes.
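
As a small illustration of what a recursive aggregate is, the sketch below computes Single-Source Shortest Paths with a `min` aggregate applied inside a semi-naïve fixpoint loop, so that only improved distances are propagated in each iteration. It is a single-process toy under the assumption of non-negative edge weights; it does not use PARALAGG or MPI, and the function name is hypothetical.

```python
def sssp_seminaive(edges, source):
    """Shortest distances from `source`.

    edges: iterable of (u, v, weight) tuples with non-negative weights.
    Returns a dict mapping each reachable node to its shortest distance.
    """
    out = {}
    for u, v, w in edges:
        out.setdefault(u, []).append((v, w))

    inf = float("inf")
    dist = {source: 0}    # current aggregated (minimum) distances
    delta = {source: 0}   # distances that improved in the last iteration
    while delta:
        new_delta = {}
        # Semi-naive step: join only the newly improved facts with the edges.
        for u, du in delta.items():
            for v, w in out.get(u, []):
                cand = du + w
                # The min aggregate is applied inside the recursion, so
                # non-improving paths are never materialized.
                if cand < dist.get(v, inf) and cand < new_delta.get(v, inf):
                    new_delta[v] = cand
        dist.update(new_delta)
        delta = new_delta
    return dist
```

For example, `sssp_seminaive([("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)], "a")` yields distances 0, 3, 1 and 8 for `a`, `b`, `c` and `d`.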

HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors

Wenxuan Li (China University of Petroleum-Beijing, Super Scientific Software Laboratory), Helin Cheng (China University of Petroleum-Beijing), Zhengyang Lu (China University of Petroleum-Beijing, Super Scientific Software Laboratory), Yuechen Lu (China University of Petroleum-Beijing, Super Scientific Software Laboratory), and Weifeng Liu (China University of Petroleum-Beijing, Super Scientific Software Laboratory)

Abstract

Sparse matrix-vector multiplication (SpMV) is a fundamental routine in computational science and engineering. Its optimization on various homogeneous parallel processors, such as CPUs and GPUs, has received much attention. Recently, asymmetric multicore processors (AMPs), which have heterogeneous performance and efficiency cores (e.g., P- and E-cores from Intel and Apple, or Big.LITTLE cores from ARM) or cores with different cache structures (e.g., cores with/without 3D V-Cache from AMD), are becoming mainstream in desktop and workstation computers. However, heterogeneity-aware research on accelerating SpMV on AMPs is lacking. In this paper, we propose a parallel algorithm called heterogeneity-aware SpMV (HASpMV) for improving the performance of SpMV on the latest 12th- and 13th-Gen AMPs from Intel and Ryzen 9 AMPs from AMD. We first micro-benchmark bandwidth and multi-/single-core SpMV to collect performance characteristics and to motivate our algorithm design, and then develop several optimization techniques to assign workloads between the two types of cores for achieving significantly better cache locality and load balancing. The experimental results show that our HASpMV brings on average 2.61x/3.17x, 2.31x/1.52x and 3.73x/2.23x (up to 5.23x/9.46x, 4.46x/5.31x and 8.23x/4.49x) speedups over the newest version of the Intel oneMKL library and the open-source work CSR5 and merge-SpMV on Intel Core i9-12900KF/13900KF, respectively. Also, HASpMV brings on average 1.43x, 1.3x and 1.29x (up to 6.28x, 7.8x and 10.8x) speedups over AMD Optimizing CPU Libraries (AOCL), CSR5 and merge-SpMV when comparing AMD Ryzen 9 7950X3D and 7950X AMPs, respectively.

TopoCommit: A Topological Commit Protocol for Cross-Ledger Transactions in Scientific Computing

Olamide Timothy Tawose (University of Nevada, Reno), Lei Yang (University of Nevada, Reno), and Dongfang Zhao (University of Washington; University of Nevada, Reno)

Abstract

While increasingly more applications are tempted to manage their data in decentralized systems, such as blockchains or distributed ledgers, the data exchange across multiple, potentially heterogeneous, decentralized systems remains an open problem: State-of-the-art protocols cannot meet one or more of the core requirements, such as atomicity, liveness, and scalability. Specifically, in the field of scientific computing, although a blockchain service was recently developed for scientific computing environments, the data exchanges and transactions among distinct ledgers are not supported. Observing that many modern scientific applications are collaborated on by multiple teams and the increasingly complicated (in-situ) workflows thereof, we argue that there is a pressing need to realize an efficient and scalable protocol for distinct ledgers to exchange data in scientific computing. This paper proposes a topological approach to enabling atomic, nonblocking, and scalable data exchanges among an arbitrary number of scientific ledgers in the context of collaborative scientific computing. Specifically, we construct a topological space formed by these ledgers—abstracting those nodes in a cross-ledger transaction as topological objects such as abstract simplex and simplicial complex. These topological objects, in turn, serve as the building blocks of a topological protocol, namely TopoCommit, under practical assumptions. We implement TopoCommit and integrate it into SciChain, a recently published distributed ledger for tracking scientific data provenance. The extensive evaluation of up to 1,008 nodes and 144 distinct ledgers on CloudLab shows that TopoCommit outperforms state-of-the-art protocols by up to 70×.

Workflow and Data Processing (Session VIII)

Mesa Ballroom

Chair: Qing Zheng, Los Alamos National Laboratory

ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum

Daniel Rosendo (Inria), Marta Mattoso (UFRJ), Alexandru Costan * (IRISA/INSA Rennes, INSA Rennes), Renan Souza (Oak Ridge National Laboratory), Debora Pina (Federal University of Rio de Janeiro), Patrick Valduriez (Inria), and Gabriel Antoniu (Inria)

Abstract

Modern scientific workflows require hybrid infrastructures combining numerous decentralized resources on the IoT/Edge interconnected to Cloud/HPC systems (aka the Computing Continuum) to enable their optimized execution. Understanding and optimizing the performance of such complex Edge-to-Cloud workflows is challenging. Capturing the provenance of key performance indicators, with their related data and processes, may assist in understanding and optimizing workflow executions. However, the capture overhead can be prohibitive, particularly in resource-constrained devices, such as the ones on the IoT/Edge.

To address this challenge, based on a performance analysis of existing systems, we propose ProvLight, a tool to enable efficient provenance capture on the IoT/Edge. We leverage simplified data models, data compression and grouping, and lightweight transmission protocols to reduce overheads. We further integrate ProvLight into the E2Clab framework to enable workflow provenance capture across the Edge-to-Cloud Continuum. This integration makes E2Clab a promising platform for the performance optimization of applications through reproducible experiments.

We validate ProvLight at a large scale with synthetic workloads on 64 real-life IoT/Edge devices in the FIT IoT LAB testbed. Evaluations show that ProvLight outperforms state-of-the-art systems like ProvLake and DfAnalyzer in resource-constrained devices. ProvLight is 26-37x faster to capture and transmit provenance data; uses 5-7x less CPU; 2x less memory; transmits 2x less data; and consumes 2-2.5x less energy. ProvLight and E2Clab are available as open-source tools.

Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning

Yuzuo Zhang.
Authors: Zhangyu Liu (Northwest University, China), Cheng Zhang (Northwest University), Huijun Wu (College of Computer, National University of Defense Technology, China), Jianbin Fang (National University of Defense Technology), Lin Peng (College of Computer, National University of Defense Technology, China), Guixin Ye (Northwest University, China), and Zhanyong Tang (Northwest University, China)

Abstract

To improve parallel I/O performance, it is imperative to optimize the adjustable parameters across the different layers of the I/O software stack. Finding an optimal configuration for different scenarios is hampered by the complex interaction dynamics between these parameters and the large parameter space. Previous research efforts have focused on tuning these parameters using independent algorithms; however, these approaches exhibit certain shortcomings such as unstable performance results and delayed convergence rates.

Our research introduces a novel approach called OPRAEL, which is based on ensembles and performance modeling using regression analysis. To test its effectiveness, we applied this approach on the Tianhe-II supercomputer using three well-known benchmark datasets (S3D-I/O, BT-I/O, and IOR). Leveraging our experience in predictive modeling, we optimized the tuning of the I/O stack parameters. Our experimental results show a 10.2x write performance speedup for the optimization task with BT-I/O and a 500x500x500 input. We also compared the potential of using a single search algorithm versus using reinforcement learning search in the I/O parameter auto-optimization task. Our results show that OPRAEL outperforms the traditional approach, resulting in a maximum 8.4x improvement in write performance for the 128-process IOR optimization.

A Lightweight, Effective Compressibility Estimation Method for Error-bounded Lossy Compression

Arkaprabha Ganguli (Michigan State University), Robert Underwood (Argonne National Laboratory), Julie Bessac (National Renewable Energy Laboratory, Virginia Polytechnic Institute and State University), David Krasowska (Northwestern University), Jon Calhoun (Clemson University), Sheng Di (Argonne National Laboratory), and Franck Cappello (Argonne National Laboratory, University of Illinois at Urbana-Champaign)

Abstract

Error-bounded lossy compression is becoming increasingly important for data-movement-intensive applications that must handle big datasets efficiently in HPC environments, which often requires knowing the compressibility of the datasets before performing the compression. However, the off-the-shelf state-of-the-art lossy compressors are often driven by error bounds, so the compression ratios cannot be forecasted until the completion of the compression operation. In this paper, we propose a lightweight, robust, easy-to-train model that estimates the compressibility of datasets for different lossy compressors accurately. Our approach combines novel predictors that measure various notions of spatial correlation and smoothness exploited by lossy compressors, implemented efficiently on the GPU in a framework that uses mixture model regression to improve robustness and conformal prediction to provide bounds on the estimates. We then use these models with a detailed analysis of speedup to understand the tradeoffs between high speed, consistent speed, and accuracy of the methods on real applications. We evaluate our approach in the context of 3 key applications where compression ratio estimation is highly required.

12:30 - 13:00

Lunch (pickup)

13:00 - 14:00

Lunch Keynote: Jesús Labarta (BSC)

Mesa Ballroom

Pushing RISC-V into HPC.

Jesús Labarta (BSC)

Abstract

The talk will present the philosophy and results of the activity within the European Processor Initiative (EPI) to design a RISC-V vector accelerator. I will briefly present the overall project structure but then focus on the vision of how long vector architectures address fundamental issues in HPC computing such as expressing concurrency and dealing with latency. I will also discuss how the Open Standard RISC-V ISA provides a foundation on which that vision can be deployed while at the same time leveraging contributions of a growing community.

I will describe the architecture of the RISC-V processor designed in the project and its software environment. I will present performance analysis results obtained on an FPGA emulator implementing the same RTL as the taped-out test chip now in the bring-up process. The FPGA emulator constitutes a Software Development Vehicle (SDV) where a standard Linux environment is available, as well as an LLVM compiler supporting both intrinsics and automatic vectorization. A powerful performance analysis framework is available to understand the behavior of real applications. This environment seamlessly covers a very wide range of levels of detail, from full application coarse grain to microscopic micro-architectural behavior.

Chair: Frank Mueller, North Carolina State University

14:00 - 16:00

Best Paper presentations

Mesa Ballroom

Chair: Sunita Chandrasekaran, University of Delaware

A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs

Yiltan Temucin (Queen's University), Scott Levy (Sandia National Laboratories), Whit Schonbein (Sandia National Laboratories), Ryan Grant (Queen's University, University of New Mexico), and Ahmad Afsahi (Queen's University)

Abstract

Modern HPC systems require efficient hybrid programming models to utilize their hardware resources effectively. The Message Passing Interface (MPI) has accommodated next-generation hardware by providing new APIs such as the MPI Partitioned interface. This API provides a user with fine-grain communication without the overhead of traditional MPI point-to-point communication in multi-threaded workloads. To the best of our knowledge, we present the first work on detailed hardware-level design for an MPI Partitioned implementation. We guide readers through a method to map the MPI Partitioned interface to the InfiniBand Verbs API. Alongside implementation details, we also study the aggregation of user partitions and how we can efficiently send them over the network. We study a brute-force approach and an approach that uses the Partitioned LogGP (PLogGP) model to predict the ideal aggregation. We observe that using the PLogGP model provides comparable performance without exhausting computing resources to search the entire solution space. The PLogGP design was further optimized by considering how the partition arrival pattern can be used to dynamically modify our aggregation scheme. We profiled our micro-benchmarks to provide analysis on how and why this additional optimization is beneficial to our results and how we can fine-tune this mechanism. Finally, we evaluated our PLogGP and Timer-based PLogGP designs with a commonly used communication pattern in HPC (communication sweep) to observe the impact when communicating with multiple processes in an application-like scenario at 1024 cores.

DoW-KV: A DPU-offloaded and Write-optimized Key-Value Store on Disaggregated Persistent Memory

Yiwen Zhang (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics), Guokuan Li (the Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; Huazhong University of Science and Technology), Jiguang Wan (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics), Junyue Wang (the School of Computer Science and Technology, Huazhong University of Science and Technology; the School of Computer Science and Technology), Ting Yao (Cloud Storage Service Product Dept, Huawei Technologies Co., Ltd; Huawei Technologies Co., Ltd), Jun Li (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; Wuhan National Laboratory for Optoelectronics), Huatao Wu (Huawei Cloud; Huawei Technologies Co., Ltd), and Daohui Wang (Huawei Cloud; Huawei Technologies Co., Ltd)

Abstract

Disaggregated Persistent Memory (DPM) is a promising technology offering elasticity, high resource utilization, persistent data storage, and lower power consumption. While building KV stores on the DPM benefits from these merits, it is also challenging to achieve efficient writes due to the bottleneck of low bandwidth caused by small random writes to PM, and intensive CPU consumption on the persistent memory server (PMS) is inevitable. Integrating a SmartNIC such as the Data Processing Unit (DPU) into the DPM gives developers the chance to optimize writing to KV stores by utilizing both the DPU memory and the DPU processor. However, simple offloading cannot make full use of the DPU’s potential capacity. To address these challenges, we propose DoW-KV, a persistent hash KV store on DPM. DoW-KV employs a two-tier hash index consisting of a DPU cache table in DPU memory and multiple PM persistent tables on the PM. It relocates small random writes to the DPU memory and consolidates them to the PM at a coarse granularity. Furthermore, DoW-KV uses DPU-offloaded step merge and a coroutine-based asynchronous processing framework to efficiently manage the PM persistent tables. DoW-KV also introduces a client-mixed read strategy to boost key searching on the two-tier hash index. Experiment results show that DoW-KV outperforms the state-of-the-art DINOMO by 2.1× and 1.3× in the Put and Get operations, respectively.

Uniform Algorithms for Reduce-scatter and (most) other Collectives for MPI

Jesper Larsson Traff (TU Wien), Sascha Hunold (Vienna University of Technology), Ioannis Vardas (Vienna University of Technology), and Nikolaus Manes Funk (TU Wien)

Abstract

We explore the use of a regular, circulant graph communication pattern for the implementation of the reduction-to-all and, by specialization, the reduction-to-root, the reduce-scatter, the all-to-all-broadcast and the rooted gather and scatter collective operations, all as found in MPI, for commutative operators and for any number of processes. The reduction-to-all algorithm reconstructs the little known algorithm by Bar-Noy, Kipnis and Schieber (1993), which the paper considerably extends.

We experiment with extensions and combinations of the algorithms for these operations, and examine their performance from the perspective of performance guidelines, and in direct comparison to the implementations in common MPI libraries. On a small cluster with 36 x 32 cores and two larger HPC production systems, we show that we can, especially for MPI_Reduce_scatter_block, achieve considerably better performance than standard MPI library implementations. Our algorithms perform consistently, which the implementations in standard MPI libraries sometimes do not.

In a homogeneous, one-ported communication system with linear transmission costs, reduction-to-all, reduce-scatter and all-to-all-broadcast can all be implemented in O(log p+m) time steps for problems of size m with small constants which we analyze and discuss.

JACO: Java Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services

Wenhai Lin (Zhejiang University), Jingchang Qin (Zhejiang University), Yiquan Chen (Zhejiang University, Alibaba Group), Zhen Jin (Zhejiang University), Jiexiong Xu (Zhejiang University), Yuzhong Zhang (Alibaba Group), Shishun Cai (Alibaba Group), Lirong Fu (Zhejiang University), Yi Chen (University of Michigan), and Wenzhi Chen (Zhejiang University)

Abstract

Many Java applications in data centers suffer from severe processor pipeline frontend bottlenecks, which can be mitigated by profile-guided code layout optimizations (PGCLO). To maximize optimization opportunities, state-of-the-art PGCLO solutions adopt continuous optimization to ensure that the code layout consistently matches ever-changing application control flow characteristics. However, existing continuous optimizations inevitably pause the application to execute the new code completely, which leads to high response latency and significantly deteriorates user experience.

In this paper, we propose JACO, a novel profile-guided Java code layout optimizer, enabling continuous optimization without pausing application services. The key idea of JACO is to enable the execution of both the old and new code simultaneously rather than completely switching to the new code. In particular, JACO is composed of three components: (1) A lightweight profiler captures the control flow information of the application and then generates an optimized function order. (2) A control flow switcher generates new code based on optimized function order and switches the application to execute the new code without pausing the application services. (3) A selective code reclaimer only frees the memory occupied by the inactive old code. We evaluated JACO on both open-source applications and real-world applications from a world-leading company. JACO achieved up to a 16.36% performance improvement for real-world applications. The state-of-the-art approach introduces up to 37.93x latency overhead that will interrupt application services, while JACO only introduces a negligible 7% latency overhead.

16:00 - 16:30

Coffee Break

16:30 - 16:45

Best Paper Awards

Mesa Ballroom

16:45 - 17:45

Poster lightning talks

Mesa Ballroom

18:00 - 20:00

Poster reception

Posters

Efficient Particle Tracing for Scalable Kinetic Plasma Simulation Analysis
Authors: Nigel Tan (University of Tennessee Knoxville), Scott Luedtke (Los Alamos National Laboratory), Michela Taufer (University of Tennessee Knoxville), and Brian Albright (Los Alamos National Laboratory)
Visual Analytics Interactive Tool for Neural Network Archaeology
Authors: Seoyoung An (University of Tennessee Knoxville), Georgia Channing (University of Tennessee Knoxville), Catherine Schuman (University of Tennessee Knoxville), and Michela Taufer (University of Tennessee Knoxville)
Mappings and Patterns to Improve the Triangular Matrix Product on Distributed Systems
Authors: Inmaculada Santamaria-Valenzuela (Universidad de Valladolid), Rocío Carratalá-Sáez (Universidad de Valladolid), Yuri Torres (Universidad de Valladolid), Diego R. Llanos (Universidad de Valladolid), and Arturo Gonzalez-Escribano (Universidad de Valladolid)
Accelerating Distributed ML Training via Selective Synchronization
Authors: Sahil Tyagi (Indiana University Bloomington) and Martin Swany (Indiana University Bloomington)
On the Multi-Dimensional Parallelization and Optimization of Stochastic Block Partitioning for Community Detection
Authors: Frank Wanye (Virginia Tech) and Wu-chun Feng (Virginia Tech)
Few-shot HPC application Runtime Prediction
Authors: Si Chen (Emory University), Simon Garcia De Gonzalo (Sandia National Laboratories), and Avani Wildani (Emory University)
An Efficient and Accurate Compression Ratio Estimation Model for SZx
Authors: Arham Khan (University of Chicago), Sheng Di (Argonne National Laboratory), Kai Zhao (University of Alabama), Jinyang Liu (University of California Riverside), Kyle Chard (University of Chicago), Ian Foster (University of Chicago), and Franck Cappello (Argonne National Laboratory)
Performance Insights into Device-initiated RMA using Kokkos Remote Spaces
Authors: Daniel Mishler (University of Tennessee, Sandia National Laboratories), Jan Ciesko (Sandia National Laboratories), Stephen Olivier (Sandia National Laboratories), and George Bosilca (University of Tennessee, Knoxville)
Performance Improvement by Enhancing Spatial Parallelism on FPGA for HPC Applications
Authors: Yuka Sano (University of Tsukuba), Taisuke Boku (University of Tsukuba), Mitsuhisa Sato (RIKEN), Miwako Tsuji (RIKEN), Norihisa Fujita (University of Tsukuba), and Ryohei Kobayashi (University of Tsukuba)
OpenMP Offloading to DPU
Authors: Muhammad Usman (Barcelona Supercomputing Center), Sergio Iserte (Barcelona Supercomputing Center), Roger Ferrer (Barcelona Supercomputing Center), and Antonio J. Peña (Barcelona Supercomputing Center)
Latency and Bandwidth Microbenchmarks of Six US Department of Energy Systems in the Top500
Authors: Carl Pearson (Sandia National Laboratories), Christopher Siefert (Sandia National Laboratories), Stephen Olivier (Sandia National Laboratories), Andrey Prokopenko (Oak Ridge National Laboratory), Timothy Fuller (Sandia National Laboratories), and Jonathan Hu (Sandia National Laboratories)
A Lightweight Network Traffic Prediction Method for SmartNICs
Authors: Tinotenda Matsika (Queen's University), Whit Schonbein (Sandia National Laboratories), and Ryan Grant (Queen's University)
I/O Characterization and Performance Evaluation of Large-scale Storage Architectures for Heterogeneous Workloads
Authors: Olga Kogiou (Florida State University), Hariharan Devarajan (Lawrence Livermore National Laboratory), Chen Wang (Lawrence Livermore National Laboratory), Weikuan Yu (Florida State University), and Kathryn Mohror (Lawrence Livermore National Laboratory)
Accelerating Hyperdimensional Classifier with SYCL
Authors: Zheming Jin (Oak Ridge National Laboratory) and Jeffrey Vetter (Oak Ridge National Laboratory)

Pecos room

Conference - Friday, Nov 3

8:15 - 9:00

Registration

9:00 - 9:30

Cluster 2024 Presentation

Mesa Ballroom

Chairs: Taisuke Boku, University of Tsukuba
Kengo Nakajima, RIKEN Center for Computational Science

9:30 - 10:30

Keynote: Susan Coghlan (ANL)

Mesa Ballroom

Update on the Aurora Supercomputer.

Chair: Antonio J. Peña, Barcelona Supercomputing Center

10:30 - 11:00

Coffee Break

11:00 - 12:30 -- Parallel Sessions

GPU and FPGA Applications (Session IX)

Mesa Ballroom

Chair: Florina Ciorba, University of Basel

A Finite-Difference Time-Domain (FDTD) solver with linearly scalable performance in an FPGA cluster

Zhenyu Xu (University of Rhode Island, Clemson University), Miaoxiang Yu (University of Rhode Island, Clemson University), Jillian Cai (University of Rhode Island), Qing Yang (University of Rhode Island), and Tao Wei (University of Rhode Island, Clemson University)

Abstract

This paper presents an FPGA-cluster-based Finite-Difference Time-Domain (FDTD) accelerator that offers a linear speedup with the number of FPGAs participating in computation within the cluster. FDTD is a numeric method for simulating electromagnetic wave propagation and interactions with diverse materials and structures. Recent advancements in machine learning-based design and optimization techniques for photonic integrated circuits and microwave circuits, known as inverse design, have demonstrated remarkable success. Inverse design necessitates numerous FDTD simulations, and the high-performance FDTD accelerator enables rapid design automation, which is crucial for accelerating innovation. Our proposed accelerator comprises deeply pipelined FDTD cell update kernels that can traverse multiple FPGAs via high-speed optical links, effectively utilizing available resources across all FPGAs in a cluster. The architecture includes a head node and a flexible number of cascaded server nodes, together with custom cross-FPGA data routing kernels integrated into the "Open Cloud Testbed" (OCT) FPGA infrastructure to facilitate seamless data transfer. The proposed accelerator is developed on an existing platform, OCT FPGA. Our experiments reveal that, for a 4096x4096 2.5D FDTD simulation, each server node (Xilinx Alveo U280) can achieve 86.4 Giga-cell updates per second (GCUPS), and the head node can achieve 38.4 GCUPS. The overall speed with 4 server nodes is 38.4 + 4 x 86.4 = 384 GCUPS.
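The throughput arithmetic in the abstract can be checked with a short sketch. The helper name and its linear model (a fixed head-node rate plus a fixed rate per server node) are assumptions made here for illustration; only the three numbers come from the abstract.

```python
def cluster_gcups(head_gcups, server_gcups, num_servers):
    """Aggregate rate under an assumed linear model: the head node
    contributes a fixed rate and each server node adds the same rate."""
    if num_servers < 0:
        raise ValueError("num_servers must be non-negative")
    return head_gcups + num_servers * server_gcups

# Figures quoted in the abstract: 38.4 GCUPS head node, 86.4 GCUPS per server.
total = cluster_gcups(38.4, 86.4, 4)
print(round(total, 1))  # 384.0
```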

GPU Occupancy Prediction of Deep Learning Models Using Graph Neural Network

Hengquan Mei (University of Science and Technology of China), Huaizhi Qu (University of Science and Technology of China), Jingwei Sun (University of Science and Technology of China), Yanjie Gao (Microsoft Research), Haoxiang Lin (Microsoft Research), and Guangzhong Sun (University of Science and Technology of China)

Abstract

GPUs are the mainstream infrastructure for executing deep learning (DL) workloads. To conduct resource-efficient scheduling of DL workloads, GPU occupancy plays an important role in understanding whether GPUs are fully utilized. GPU occupancy is the ratio of the number of active warps on a streaming multiprocessor (SM) to the maximum number of active warps supported by the SM. By predicting the GPU occupancy of a DL model before its execution, we can estimate the percentage of the hardware’s ability to process warps that is actively in use. However, general performance prediction for DL models is challenging due to the diverse DL model architectures. In this paper, we propose DNN-occu to predict GPU occupancy of DL models. DNN-occu precisely captures the relations between structural factors of computation graphs of DL models and corresponding GPU occupancy. We also propose a novel graph neural network model to better represent these relations and make generalizable predictions. Empirical evaluations on a variety of DL models as well as configurations show that DNN-occu achieves high accuracy for occupancy prediction and has zero-shot ability for predicting the occupancy of unseen DL models. Our experiments show that DNN-occu achieves an overall prediction error of 9.271%. Besides, we conduct a trace-driven simulation of DL workload scheduling, where DNN-occu achieves up to 31.45% improvement in GPU utilization and 19.71% reduction in makespan.
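The occupancy definition given in the abstract is a simple ratio, sketched below. The function name and the example warp counts (32 active warps, a limit of 64 per SM) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def gpu_occupancy(active_warps, max_warps_per_sm):
    """Occupancy as defined in the abstract: active warps on an SM divided
    by the maximum number of active warps the SM supports."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    if not 0 <= active_warps <= max_warps_per_sm:
        raise ValueError("active_warps must lie in [0, max_warps_per_sm]")
    return active_warps / max_warps_per_sm

# Hypothetical example: 32 active warps on an SM that supports 64.
print(gpu_occupancy(32, 64))  # 0.5
```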

Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion

Qinglei Cao (University of Tennessee), Sameh Abdulah (King Abdullah University of Science & Technology), Hatem Ltaief (King Abdullah University of Science & Technology), Marc Genton (King Abdullah University of Science & Technology), David Keyes (King Abdullah University of Science & Technology), and George Bosilca (University of Tennessee)

Abstract

The burgeoning interest in large-scale geospatial modeling, particularly within the domains of climate and weather prediction, underscores the concomitant critical importance of accuracy, scalability, and computational speed. Harnessing these complex simulations' potential, however, necessitates innovative computational strategies, especially considering the increasing volume of data involved. Recent advancements in Graphics Processing Units (GPUs) have opened up new avenues for accelerating these modeling processes. In particular, their efficient utilization necessitates new strategies, such as mixed-precision arithmetic, that can balance the trade-off between computational speed and model accuracy. This paper leverages the PaRSEC runtime system and delves into the opportunities provided by mixed-precision arithmetic to expedite large-scale geospatial modeling in heterogeneous environments. By using an automated conversion strategy, our mixed-precision approach significantly improves computational performance (up to 3X) on the Summit supercomputer and reduces the associated energy consumption on various Nvidia GPU generations. Importantly, this implementation ensures the requisite accuracy in environmental applications, a critical factor in their operational viability. The findings of this study bear significant implications for future research and development in high-performance computing, underscoring the transformative potential of mixed-precision arithmetic on GPUs in addressing the computational demands of large-scale geospatial modeling and making a stride toward a more sustainable, efficient, and accurate future in large-scale environmental applications.

MPI & Networking (Session X)

Canyon Room

Chair: George Michelogiannakis, Lawrence Berkeley National Laboratory

SDT: A Low-cost and Topology-reconfigurable Testbed for Network Research

Zixuan Chen (Fudan University), Zhigao Zhao (Fudan University), Zijian Li (Fudan University), Jiang Shao (Fudan University), Sen Liu (Fudan University), and Yang Xu (Fudan University)

Abstract

Network experiments are essential to network-related scientific research (e.g., congestion control, QoS, network topology design, and traffic engineering). However, (re)configuring various topologies on a real testbed is expensive, time-consuming, and error-prone. In this paper, we propose Software Defined Topology Testbed (SDT), a method for constructing a user-defined network topology using a few commodity switches. SDT is low-cost, deployment-friendly, and reconfigurable, and can run multiple sets of experiments under different topologies by simply using different topology configuration files at the controller we designed. We implement a prototype of SDT and conduct numerous experiments. Evaluations show that SDT introduces at most 2% more overhead than full testbeds on multi-hop latency and is far more efficient than software simulators (reducing the evaluation time by up to 2899x). SDT is more cost-effective and scalable than existing Topology Projection (TP) solutions. Further experiments show that SDT can support various network research experiments at a low cost on topics including but not limited to topology design, congestion control, and traffic engineering.

PiP-MColl: Process-in-Process-based Multi-object MPI Collectives

Jiajun Huang (University of California, Riverside), Kaiming Ouyang (NVIDIA Corporation), Yujia Zhai (University of California, Riverside), Jinyang Liu (University of California, Riverside), Min Si (Meta Platforms, Inc.), Ken Raffenetti (Argonne National Laboratory), Hui Zhou (Argonne National Laboratory), Atsushi Hori (National Institute of Informatics), Zizhong Chen (University of California, Riverside), Yanfei Guo (Argonne National Laboratory), and Rajeev Thakur (Argonne National Laboratory)

Abstract

In the era of exascale computing, the adoption of a large number of CPU cores and nodes by high-performance computing (HPC) applications has made MPI collective performance increasingly crucial. As the number of cores and nodes increases, the importance of optimizing MPI collective performance becomes more evident. Current collective algorithms, including kernel-assisted inter-process data exchange techniques and data-sharing-based shared-memory approaches, are prone to significant performance degradation due to the overhead of system calls and page faults or the cost of extra data-copy latency. These issues can negatively impact the efficiency and scalability of HPC applications. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object Inter-process MPI Collective design that maximizes small-message MPI collective performance at scale. We also present specific designs to boost the performance for larger messages, such that we observe a comprehensive improvement for a series of message sizes beyond small messages. PiP-MColl features efficient multiple-sender and multiple-receiver collective algorithms and leverages Process-in-Process shared-memory techniques to eliminate unnecessary system calls, page-fault overhead, and extra data copies, which results in improved intra- and inter-node message rate and throughput. Experimental results demonstrate that PiP-MColl significantly outperforms popular MPI libraries, including Open MPI, MVAPICH2, and Intel MPI, by up to 4.6X for the MPI collectives MPI_Scatter, MPI_Allgather, and MPI_Allreduce.
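
For readers less familiar with the three collectives named above, the following sketch models only the result each rank observes after MPI_Scatter, MPI_Allgather, and MPI_Allreduce (with a sum). It is a plain-Python simulation over lists, not PiP-MColl's algorithm or any MPI library's implementation; the function names and the list-of-lists representation of ranks are assumptions made for illustration.

```python
from typing import List, Sequence


def scatter(root_buffer: Sequence[int], num_ranks: int) -> List[List[int]]:
    """Scatter semantics: the root's buffer is split into equal chunks, one per rank."""
    if len(root_buffer) % num_ranks != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = len(root_buffer) // num_ranks
    return [list(root_buffer[r * chunk:(r + 1) * chunk]) for r in range(num_ranks)]


def allgather(local: Sequence[Sequence[int]]) -> List[List[int]]:
    """Allgather semantics: every rank receives all contributions, in rank order."""
    gathered = [value for buf in local for value in buf]
    return [list(gathered) for _ in local]


def allreduce_sum(local: Sequence[Sequence[int]]) -> List[List[int]]:
    """Allreduce (sum) semantics: every rank receives the element-wise sum."""
    reduced = [sum(values) for values in zip(*local)]
    return [list(reduced) for _ in local]


if __name__ == "__main__":
    per_rank = scatter(list(range(8)), num_ranks=4)
    print("after scatter:  ", per_rank)
    print("after allgather:", allgather(per_rank)[0])
    print("after allreduce:", allreduce_sum(per_rank)[0])
```

A real implementation distributes this work across processes and nodes; the cost of moving the data between them is what collective designs such as the one described above aim to reduce.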

12:30 - 14:00

Lunch (provided)

Conference ends

Conference Room Layout

Conference - Tuesday, Oct 31 8:30 - 9:15 Registration 9:15 - 10:45 Mesa Ballroom HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications Canyon room Tutorial: Performance Analysis, Tools and Best-Known Methods on Multi-Chip Module Chiplet based High Performance Computing AMD EPYC Zen4 Architecture Tutorial: Performance Analysis, Tools and Best-Known Methods on Multi-Chip Module Chiplet based High Performance Computing AMD EPYC Zen4 Architecture TBD Abstract TBD 10:45 - 11:15 Coffee Break 11:15 - 12:45 Mesa Ballroom HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications Canyon room Tutorial: Zen4 12:45 - 14:00 Lunch (provided) 14:00 - 15:30 Mesa Ballroom HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications Canyon room REX-IO: Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads 15:30 - 16:00 Coffee Break 16:00 - 17:30 Mesa Ballroom HPCMASPA: Monitoring and Analysis for HPC Systems Plus Applications Canyon room REX-IO: Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads Conference - Wednesday, Nov 1 8:15 - 9:00 Registration 9:00 - 9:30 Cluster 2023 Opening Mesa Ballroom 9:30 - 10:30 Keynote: Bill Magro (Google) Mesa Ballroom AI, Cloud, and the Future of HPC. AI, Cloud, and the Future of HPC. Bill Magro (Google) Abstract The slowing of Moore’s Law had a profound impact on the trajectory of HPC and spurred the worldwide race to exascale. This, in turn, fostered new directions in system architecture that have spurred a renaissance in AI. Cloud computing has also steadily grown and is feeling the influence of HPC. We have now reached a point where AI and cloud are directly shaping the future of HPC. In this talk we will discuss how we got here, what the future of HPC looks like, and some important implications for HPC practitioners. 
Chair: Scott Pakin, Los Alamos National Laboratory 10:30 - 11:00 Coffee Break 11:00 - 12:30 -- Parallel Sessions Distributed Machine Learning (Session I) Mesa Ballroom Chair: Olamide Timothy Tawose, Lincoln University, Pennsylvania Accelerating Distributed ML Training via Selective Synchronization Accelerating Distributed ML Training via Selective Synchronization Sahil Tyagi (Indiana University Bloomington), and Martin Swany (Indiana University) Abstract In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present SelSync, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of SelSync to improve convergence in the context of semi-synchronous training. In our evaluation, SelSync converges to the same or better accuracy than BSP while reducing training time by up to 14x. PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning Kevin Assogba (Rochester Institute of Technology), Eduardo Lima (Rochester Institute of Technology), M. 
Mustafa Rafique (Rochester Institute of Technology), and Minseok Kwon (Rochester Institute of Technology) Abstract Predicting Deep Learning (DL) training workload runtime allows for optimized usage of both on-premises and public data centers, e.g., allocating resources for task completion before a deadline. The state-of-the-art prediction models, e.g., Ernest and Cherrypick, treat workloads as black boxes, and require running the workload on a fraction of the dataset every time a change occurs followed by retraining the prediction model. This significantly limits the reusability of prediction models across workloads with different DL architectures. In this paper, we propose a different approach where the prediction model is trained only once for a particular dataset type, e.g., ImageNet, thus completely avoiding tedious and costly retraining tasks for new DL workloads. Our proposed approach, called PredictDDL, provides an end-to-end performance prediction system for distributed DL training workloads. PredictDDL leverages Graph HyperNetworks, a class of neural networks that takes computational graphs as input and produces vector representations of neural networks. PredictDDL is the first prediction model that eliminates the need to retrain a performance prediction model for each new DL workload and maximizes the reuse of the prediction model by requiring only a single run of the workload to collect time measurements for training the prediction model. Our extensive evaluation using representative workloads shows that PredictDDL achieves up to 9.8x lower average prediction error and 10.3x lower inference duration compared to a state-of-the-art system, Ernest, on workloads with multiple DNN architectures. 
Exact Distributed Stochastic Block Partitioning Exact Distributed Stochastic Block Partitioning Frank Wanye (Virginia Tech), Vitaliy Gleyzer (MIT Lincoln Lab), Edward Kao (MIT Lincoln Lab), and Wu-chen Feng (Virginia Tech) Abstract Stochastic block partitioning (SBP) is a community detection algorithm that is highly accurate even on graphs with a complex community structure, but its inherently serial nature hinders its widespread adoption by the wider scientific community. To make it practical to analyze large real-world graphs with SBP, there is a growing need to parallelize and distribute the algorithm. The current state-of-the-art distributed SBP algorithm is a divide-and-conquer approach that limits communication between compute nodes until the end of inference. This leads to the breaking of computational dependencies, which causes convergence issues as the number of compute nodes increases, and when the graph is sufficiently sparse. In this paper, we introduce EDiSt - an exact distributed stochastic block partitioning algorithm. Under EDiSt, compute nodes periodically share community assignments during inference. Due to this additional communication, EDiSt improves upon the divide-and-conquer algorithm by allowing it to scale out to a larger number of compute nodes without suffering from convergence issues, even on sparse graphs. We show that EDiSt provides speedups of up to 23.8× over the divide-and-conquer approach, and speedups up to 38.0× over shared memory parallel SBP when scaled out to 64 compute nodes. 
Resource Management (Session II) Canyon Room Chair: Jesper Larsson Träff, TU Wien - Faculty of Informatics DEHype: Retrofitting Hypervisors for a Resource-Disaggregated Environment DEHype: Retrofitting Hypervisors for a Resource-Disaggregated Environment Taehoon Kim (ETRI), Kwangwon Koh (ETRI), Changdae Kim (ETRI), Eunji Pak (ETRI), Yeonjeong Jeong (ETRI), and Sang-Hoon Kim (Ajou University) Abstract Resource disaggregation has been proposed as a solution for resource under-utilization in data centers. However, the host virtualization technologies, which are the basic building blocks for constructing data centers, are implemented without considering disaggregated resources. In addition, we discover that an RDMA I/O unit plays a significant role in the performance of a resource-disaggregated environment. In this study, we propose DEHype, which alleviates the inefficiency of hypervisors adapted to a disaggregated environment by investigating host virtualization technologies that are suitable for disaggregated memory systems. Specifically, DEHype aims to identify and address the performance issues of virtual machines running through KVM/QEMU in a disaggregated resource environment. The results demonstrate the effectiveness of the proposed optimizations in improving the performance of disaggregated memory systems. DEHype achieves up to a 351% improvement over the state-of-the-art disaggregated memory system. SciLance: Mitigate Load Imbalance for Parallel Scientific Applications in Cloud Environments SciLance: Mitigate Load Imbalance for Parallel Scientific Applications in Cloud Environments Xinying Wang (University of Nevada), Lipeng Wan (Georgia State University), Scott Klasky (Oak Ridge National Laboratory), Dongfang Zhao (University of Washington), and Feng Yan (University of Houston) Abstract Elastic cloud computing provides new opportunities for accelerating the process of scientific discovery. 
However, unlike high-performance computing (HPC) systems that are built and optimized for workloads with intensive inter-node communication demands, low-latency, high-bandwidth communication capability is only enabled on a few very expensive high-end instance types in the cloud, which leads to poor cost-effectiveness. In addition, re-balancing the workload through extra data movement among compute nodes is a common way to mitigate the load imbalance issue in many scientific simulations, which further amplifies the communication pressure and makes it challenging to efficiently use cloud resources. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. Our key insight is that instead of moving data among compute nodes to balance the workload, we create a heterogeneous resource pool to dynamically adapt resource allocation to compensate for the profiled runtime imbalance. We prototype SciLance and perform extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for WarpX simulation codes. Generalized Collectives for the Exascale Era Generalized Collectives for the Exascale Era Michael Wilkins (Northwestern University), Hanming Wang (Northwestern University), Peizhi Liu (Northwestern University), Bangyen Pham (Northwestern University), Yanfei Guo (Argonne National Laboratory), Rajeev Thakur (Argonne National Laboratory), Peter Dinda (Northwestern University), and Nikos Hardavellas (Northwestern University) Abstract Exascale supercomputers have renewed the exigence of improving distributed communication, specifically MPI collectives. Previous works accelerated collectives for specific scenarios by changing the radix of the collective algorithms. 
However, these approaches fail to explore the interplay between modern hardware features, such as multi-port networks, and software features, such as message size. In this paper, we present a novel approach that uses system-agnostic, generalized (i.e., variable-radix) algorithms to capture all relevant features and provide broad speedups for upcoming exascale-class systems. We identify hardware commonalities found on announced exascale systems and generalize three common communication kernels (binomial tree, ring, and recursive doubling) to better leverage these features, creating 10 implementations. For each kernel, we develop analytical models to intuit algorithm performance with varying radix values. Experiments on the world’s first exascale supercomputer (Frontier at ORNL) and an exascale test system (Polaris at ANL) show that our generalized algorithms outperform the baseline open-source and proprietary vendor MPI implementations by a significant margin, up to over 4.5x. We empirically determine optimal algorithms and parameter values, identifying where the analytical models are accurate and where hardware features directly determine performance. Most notably, we show how a single, system-agnostic implementation of a generalized algorithm can optimize for multiple hardware/software features across multiple systems. 12:30 - 14:00 Lunch (provided) 14:00 - 15:30 -- Parallel Sessions Software Systems for ML (Session III) Mesa Ballroom Chair: Jim Brandt, Sandia National Laboratories FedGuard: Selective Parameter Aggregation for Poisoning Attack Mitigation in Federated Learning FedGuard: Selective Parameter Aggregation for Poisoning Attack Mitigation in Federated Learning Melvin Chelli (DFKI), Cédric Prigent (INRIA), René Schubotz (DFKI), Alexandru Costan (IRISA/INSA Rennes, INSA Rennes), Gabriel Antoniu (Inria), Loïc Cudennec (DGA), and Philipp Slusallek (DFKI) Abstract Minimizing the attack surface of Federated Learning (FL) systems is a field of active research. 
FL turns out to be highly vulnerable to various threats coming from the edge of the network. Current approaches rely on robust aggregation, anomaly detection and generative models for defending against poisoning attacks. Yet, they either have limited defensive capabilities due to their underlying design or are impractical to use as they rely on constraining building blocks. We introduce FedGuard, a novel FL framework that utilizes the generative capabilities of Conditional Variational AutoEncoders (CVAE) to effectively defend against poisoning attacks with tuneable overhead in communication and computation. Whilst the idea of hardening an FL system using generative models is not entirely new, FedGuard’s original contribution is in its selective parameter aggregation operator with parameter selection being driven by synthetic validation data sampled from the CVAEs trained locally by each participating party. Experimental evaluations in a 100-client setup demonstrate FedGuard to be more effective against label and sign flipping attacks as well as additive noise and same value attacks than previous works. FedGuard successfully defends in scenarios with up to 50% malicious peers where other strategies fail. In addition, FedGuard does not require auxiliary datasets or centralized (pre-) training, and provides resilience against poisoning attacks from the very first round of federated training. 
Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models Wei Wang (National University of Defense Technology), Zhiquan Lai (NUDT, Computer College), Shengwei Li (National University of Defense Technology), Weijie Lu (National University of Defense Technology), Keshi Ge (National University of Defense Technology), Yujie Liu (National University of Defense Technology), Ao Shen (National University of Defense Technology), and Dongsheng Li (National University of Defense Technology, Computer College) Abstract Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extra-large size with negligible increases in computation. However, a significant load imbalance occurs in the device during the training of an MoE model, resulting in significantly reduced throughput. Previous works on load balancing introduce additional runtime overhead and suffer from high execution overhead. To address these issues, we present Prophet: a fine-grained load balancing method for parallel training of large-scale MoE models, which consists of a planner and a scheduler. The Prophet planner first employs a fine-grained resource allocation method to determine the possible scenarios for the expert placement in a fine-grained manner, and then efficiently searches for a well-balanced expert placement to avoid additional overhead. The Prophet scheduler uses the locality of the token distribution to schedule the resource allocation operations using a layer-wise fine-grained schedule strategy to hide their overhead. We conduct extensive experiments in four clusters and five representative models. The results indicate that Prophet gains up to 2.3x speedup compared to the state-of-the-art MoE frameworks including DeepSpeed-MoE and FasterMoE. Additionally, Prophet achieves a load balancing enhancement of up to 12.06x when compared to FasterMoE. 
HIOS: Hierarchical Inter-Operator Scheduler for Real-Time Inference of DAG-Structured Deep Learning Models on Multiple GPUs HIOS: Hierarchical Inter-Operator Scheduler for Real-Time Inference of DAG-Structured Deep Learning Models on Multiple GPUs Turja Kundu (University of North Texas), and Tong Shu (University of North Texas) Abstract Neural-network-enabled data analysis in real-time scientific applications imposes stringent requirements on inference latency. Meanwhile, recent deep learning (DL) model design tends to replace a single branch with multiple branches for high prediction accuracy and robustness, which makes inter-operator parallelization an effective approach to improve inference latency. However, existing inter-operator parallelization techniques for inference acceleration are mainly focused on utilization optimization in a single GPU. With the data size of an input sample and the scale of a DL model ever-growing, the limited resource of a single GPU is insufficient to support parallel execution of large operations. In order to break this limitation, we study hybrid inter-operator parallelism both among multiple GPUs and in each GPU. In this paper, we propose a hierarchical inter-operator scheduler (HIOS) to automatically distribute large operators onto different GPUs and group small operators in the same GPU for parallel execution. Particularly, we design a novel scheduling algorithm, named HIOS-LP, which consists of inter-GPU operator parallelization through iterative critical-path mapping and intra-GPU operator parallelization based on a sliding window. Experiments with modern convolutional neural network benchmarks on different GPU platforms demonstrate that our HIOS-LP outperforms the state-of-the-art inter-operator scheduling algorithm IOS by up to 28%. 
Storage Systems and Data Management (Session IV) Canyon Room Chair: Sarah Neuwirth, Johannes Gutenberg University Mainz, Juelich Supercomputing Centre FullRepair: Towards Optimal Repair Pipelining in Erasure-Coded Clustered Storage Systems FullRepair: Towards Optimal Repair Pipelining in Erasure-Coded Clustered Storage Systems Yuzuo Zhang (Huazhong University of Science and Technology), Xinyuan Tu (Huazhong University of Science and Technology), Lin Wang (Huazhong University of Science and Technology), Yuchong Hu (Huazhong University of Science and Technology), Fang Wang (Huazhong University of Science and Technology), and Ye Wang (Huazhong University of Science and Technology) Abstract Clustered storage systems often deploy erasure coding that encodes data into coded chunks and distributes them across nodes to tolerate node failures. It is a storage-efficient redundancy scheme but incurs a high repair penalty; thus there are many studies focusing on the erasure-coded repair of failed blocks, and some state-of-the-art methods pipeline the repair of failed data to improve the repair performance. However, we observe that all existing repair pipelining methods only use a single pipeline, leaving the network bandwidth resources of storage nodes underutilized. In this paper, we propose FullRepair, a new repair pipelining mechanism based on multiple pipelines with the aim of fully exploiting all available bandwidth resources during repair. We construct four constraints to model the repair pipelining problem such that FullRepair can obtain the optimal pipelined repair throughput under full bandwidth utilization. We design a multi-pipeline scheduling scheme for FullRepair so as to achieve the above optimality. Experiments on Amazon EC2 show that compared with the state-of-the-art repair pipelining methods RP and PivotRepair, FullRepair reduces the repair time of a single chunk by up to 45.4% and 33.19%, respectively. 
Performance Characterization of NVMe Flash Devices with Zoned Namespaces (ZNS) Performance Characterization of NVMe Flash Devices with Zoned Namespaces (ZNS) Krijn Doekemeijer (Vrije Universiteit Amsterdam), Nick Tehrany (TU Delft), Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam), Matias Bjorling (Western Digital), and Animesh Trivedi (Vrije Universiteit Amsterdam) Abstract The recent emergence of NVMe flash devices with Zoned Namespace support, ZNS SSDs, represents a significant new advancement in flash storage. ZNS SSDs introduce a new storage abstraction of append-only zones with a set of new I/O (i.e., append) and management (zone state machine transition) commands. With the new abstraction and commands, ZNS SSDs offer more control to the host software stack than a (non-zoned) SSD for flash management, which is known to be complex (because of garbage collection, scheduling, block allocation, parallelism management, over-provisioning). ZNS SSDs are, consequently, gaining adoption in a variety of applications (e.g., file systems, key-value stores, and databases), particularly latency-sensitive big data applications. Despite this enthusiasm, there has yet to be a systematic characterization of ZNS SSD performance with its zoned storage model abstractions and I/O operations. This work addresses this crucial shortcoming. We report on the performance features of a commercially available ZNS SSD (13 key observations), explain how these features can be incorporated into publicly available state-of-the-art ZNS emulators, and recommend guidelines for ZNS SSD application developers. All artifacts (code and data sets) of this study are publicly available at: https://anonymous.4open.science/r/NVMeBenchmarks-0551. 
KV-CSD: A Hardware-Accelerated Key-Value Store for Data-Intensive Applications KV-CSD: A Hardware-Accelerated Key-Value Store for Data-Intensive Applications Inhyuk Park (SK hynix), Qing Zheng (Carnegie Mellon University, New Mexico Consortium), Dominic Manno (Los Alamos National Laboratory), Soonyeal Yang (SK hynix), Jason Lee (Los Alamos National Laboratory), David Bonnie (Los Alamos National Laboratory), Bradley Settlemyer (NVIDIA), Youngjae Kim (Sogang University), Woosuk Chung (SK hynix), and Gary Grider (Los Alamos National Lab) Abstract Popular write-optimized software key-value stores such as LevelDB and RocksDB are often good at reads. While data is initially stored in a write-optimized format, in the background it is asynchronously transformed into a read-optimized format for efficient followup queries. Write-optimized key-value stores can still block writes. This happens when those background workers cannot keep up with the foreground insertion workload. This paper makes a case for a hardware-accelerated key-value store that allows for running performance-critical operations --- such as background data reorganization and queries --- on storage rather than on a host. This better hides background work latency, prevents it from blocking foreground writes, and improves overall I/O efficiency. Our prototype, called TomDB, is a key-value based computational storage device consisting of an NVMe SSD and a System-on-a-Chip (SoC) that implements an ordered key-value store atop the SSD. Through offloaded processing, TomDB streamlines data insertion, reduces host-device data movement for both background data reorganization and query processing, and shows up to 10.6x lower write times and up to 7.4x faster queries compared to the current state-of-the-art on a real-world scientific dataset. 
15:30 - 16:00 Coffee Break 16:00 - 17:00 -- Parallel Sessions Disaggregated Architectures (Session V) Canyon Room Chair: Hariharan Devarajan, Lawrence Livermore National Laboratory Rethinking Virtual Machines Live Migration for Memory Disaggregation Rethinking Virtual Machines Live Migration for Memory Disaggregation Xingguo Jia (Shanghai Jiao Tong University), Xingzi Yu (Shanghai Jiao Tong University), Yun Wang (Shanghai Jiao Tong University), Senhao Yu (Beijing Institute of Technology), and Zhengwei Qi (Shanghai Jiao Tong University) Abstract Resource underutilization has troubled data centers for several decades. Memory disaggregation provides an efficient way to improve memory utilization while leaving a missing puzzle piece on CPU underutilization. Live migration is an essential method for CPU resource reallocation. However, the state-of-the-art Virtual Machines (VM) live migration suffers from significant resource consumption. We discover the substantial potential for optimizing live migration in disaggregated memory systems. We propose Anemoi, a source management system incorporating VM live migration into memory disaggregation to fill in the missing piece. Disaggregated memory enables the read-only replica to be accessible from destination nodes, eliminating considerable network transmission for memory pages and saving migration time. Regarding the potential overwhelming memory consumption of duplicate read-only replicas, we design a dedicated compression algorithm. The evaluation shows that Anemoi reduces the network bandwidth use and the migration time by 69% and 83%, respectively, compared to VM live migration. The compression can achieve a space-saving of 83.6%. 
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics George Michelogiannakis (Lawrence Berkeley National Laboratory, Stanford University), Yehia Arafa (New Mexico State University, Qualcomm Inc), Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Liang Yuan Dai (Columbia University), Abdel-Hameed Badawy (New Mexico State University), Madeleine Glick (Columbia University), Yuyang Wang (Columbia University), Keren Bergman (Columbia University), and John Shalf (Lawrence Berkeley National Laboratory) Abstract The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to realize these gains and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonic components can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks with negligible power overhead. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4× fewer memory modules and 2× fewer NICs than a non-disaggregated baseline. 
ML for Scheduling and Management (Session VI) Mesa Ballroom Chair: Haiying Xu, University Corporation for Atmospheric Research ExplSched: Maximizing Deep Learning Cluster Efficiency for Exploratory Jobs ExplSched: Maximizing Deep Learning Cluster Efficiency for Exploratory Jobs Zixuan Chen. Authors: Hongliang Li (Jilin University), Hairui Zhao (Jilin University), Zhewen Xu (Jilin University), Xiang Li (Jilin University), and Haixiao Xu (Jilin University) Abstract Resource management for Deep Learning (DL) clusters is essential for system efficiency and model training quality. Existing schedulers provided by DL frameworks are mostly adaptations from traditional HPC clusters and usually work on jobs' makespan, assuming that DL training jobs finish completely. Unfortunately, it is reported that a fair amount of training jobs are exploratory jobs and often finish unsuccessfully in production clusters. This is due to the distinct characteristic of Deep Neural Network (DNN) training that it is an exploratory process of frequent user interventions, such as adjusting model structures, tuning hyperparameters, and exploring feature validity. Existing DL cluster schedulers using offline algorithms are not suitable for exploratory jobs when unexpected early terminations can cause noticeable resource waste. Moreover, DL training jobs are iterative and usually yield diminishing returns. Equally allocating resources among training iterations is not efficient, especially when dealing with exploratory jobs where it can worsen the degradation of system efficiency. The fundamental goal of a DL training job is to gain model quality improvement, usually indicated by the loss reduction (job profit) of a DNN model. This paper introduces a novel scheduling problem for exploratory jobs that seeks to maximize the overall training profit of a DL cluster. We propose ExplSched, an online scheduling solution based on the primal-dual framework. 
It emphasizes the importance of job profit to resource consumption ratio to make quick resource allocation decisions. Experimental results show that ExplSched achieved an average system utility improvement of 87.28% compared with other related works. Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach Urvij Saroliya (Technical University of Munich), Eishi Arima (Technical University of Munich), Dai Liu (Technical University of Munich), and Martin Schulz (Technical University of Munich) Abstract GPU-based heterogeneous architectures are now commonly used in HPC systems. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilisation by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS (Multi-Process Service), a finer-grained logical partitioning; and MIG (Multi-Instance GPU), a coarse-grained physical partitioning. We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles. 
Our thorough experimental results demonstrate that our approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This results in a maximum throughput improvement by a factor of 1.87 compared to time-sharing scheduling. 18:00 - 22:00 Social event: Meow Wolf Meow Wolf: House of Eternal Return Meow Wolf is a psychedelic interactive art installation that's fun for all ages. Buses will run continually between the Hilton and Meow Wolf from 5:30pm to 10:30pm. Food and drink will be provided.

Conference - Thursday, Nov 2 8:15 - 10:00 Registration 10:00 - 10:15 Announcements Mesa Ballroom 10:15 - 10:30 Special address: Tim Randles Mesa Ballroom 25 Years of Cluster Computing at Los Alamos National Laboratory. Chair: Frank Mueller, North Carolina State University 10:30 - 11:00 Coffee Break 11:00 - 12:30 -- Parallel Sessions Communication (Session VII) Canyon Room Chair: Kengo Nakajima, University of Tokyo/RIKEN R-CCS Communication-Avoiding Recursive Aggregation Communication-Avoiding Recursive Aggregation Yihao Sun (Syracuse University), Sidharth Kumar (University of Illinois at Chicago), Thomas Gilray (University of Alabama at Birmingham), and Kristopher Micinski (Syracuse University) Abstract Recursive aggregation has been of considerable interest because it unifies a wide range of deductive-analytic workloads, including social-media mining and graph analytics. For example, Single-Source Shortest Paths (SSSP), Connected Components (CC), and PageRank may all be expressed via recursive aggregates. Implementing recursive aggregation has posed a serious algorithmic challenge, with state-of-the-art work identifying sufficient conditions (e.g., pre-mappability) under which implementations may push aggregation within recursion, avoiding the serious materialization overhead inherent to traditional reachability-based methods (e.g., Datalog). 
State-of-the-art implementations of engines supporting recursive aggregates focus on large unified machines, due to the challenges posed by mixing semi-naïve evaluation with distribution. In this work, we present an approach to implementing recursive aggregates on high-performance clusters which avoids the communication overhead that prevents current-generation distributed systems from scaling recursive aggregates to extremely high process counts. Our approach leverages the observation that aggregators form functional dependencies, allowing us to implement recursive aggregates via a highly parallel local aggregation to ensure maximal throughput. Additionally, we present a dynamic join planning mechanism, which customizes join order per iteration based on dynamic relation sizes. We implemented our approach in PARALAGG, a library which allows the declarative implementation of queries which utilize recursive aggregates and executes them using our MPI-based runtime. We evaluate PARALAGG on large unified nodes and leadership-class supercomputers, demonstrating scalability up to 16,384 processes. HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors Wenxuan Li (China University of Petroleum-Beijing, Super Scientific Software Laboratory), Helin Cheng (China University of Petroleum-Beijing), Zhengyang Lu (China University of Petroleum-Beijing, Super Scientific Software Laboratory), Yuechen Lu (China University of Petroleum-Beijing, Super Scientific Software Laboratory), and Weifeng Liu (China University of Petroleum-Beijing, Super Scientific Software Laboratory) Abstract Sparse matrix-vector multiplication (SpMV) is a fundamental routine in computational science and engineering. Its optimization methods on various homogeneous parallel processors, such as CPUs and GPUs, have received much attention. 
Recently, asymmetric multicore processors (AMPs) that have heterogeneous performance and efficiency cores (e.g., P- and E-cores from Intel and Apple, or Big.LITTLE cores from ARM), or cores with different cache structures (e.g., cores with/without 3D V-Cache from AMD), are becoming mainstream in more desktop and workstation computers. However, heterogeneity-aware research on accelerating SpMV on AMPs is lacking. In this paper, we propose a parallel algorithm called heterogeneity-aware SpMV (HASpMV) for improving the performance of SpMV on the latest 12th- and 13th-Gen AMPs from Intel and Ryzen 9 AMPs from AMD. We first micro-benchmark bandwidth and multi-/single-core SpMV to collect performance characteristics and to motivate our algorithm design, and then develop several optimization techniques to assign workloads between the two types of cores for achieving significantly better cache locality and load balancing. The experimental results show that our HASpMV brings on average 2.61x/3.17x, 2.31x/1.52x and 3.73x/2.23x (up to 5.23x/9.46x, 4.46x/5.31x and 8.23x/4.49x) speedups over the newest version of the Intel oneMKL library and the open-source work CSR5 and merge-SpMV on Intel Core i9-12900KF/13900KF, respectively. Also, HASpMV brings on average 1.43x, 1.3x and 1.29x (up to 6.28x, 7.8x and 10.8x) speedups over AMD Optimizing CPU Libraries (AOCL), CSR5 and merge-SpMV when comparing AMD Ryzen 9 7950X3D and 7950X AMPs, respectively. 
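As a rough illustration of the workload-assignment idea in the HASpMV abstract above, the sketch below splits the rows of a CSR matrix between two core groups so that the faster group receives a larger share of the nonzeros. It is not the authors' algorithm: the function names, the `p_share` parameter, and the sequential execution of the two partitions (no threads, no core pinning) are assumptions made purely for illustration.

```python
from bisect import bisect_left


def spmv_csr_rows(row_ptr, col_idx, vals, x, row_begin, row_end):
    """Compute rows [row_begin, row_end) of y = A @ x for A stored in CSR."""
    out = []
    for r in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        out.append(acc)
    return out


def split_rows_by_nnz(row_ptr, p_share):
    """Return the row index at which the faster group's row range ends.

    p_share is the (assumed) fraction of nonzeros the faster group should get.
    The split is the first row boundary whose cumulative nonzero count
    reaches p_share * nnz, so rows are never cut in the middle.
    """
    target = p_share * row_ptr[-1]
    return bisect_left(row_ptr, target)


def spmv_split(row_ptr, col_idx, vals, x, p_share):
    """Run SpMV as two row partitions (executed one after the other here)."""
    n_rows = len(row_ptr) - 1
    split = split_rows_by_nnz(row_ptr, p_share)
    fast_part = spmv_csr_rows(row_ptr, col_idx, vals, x, 0, split)
    slow_part = spmv_csr_rows(row_ptr, col_idx, vals, x, split, n_rows)
    return fast_part + slow_part


# Hypothetical 4x4 example matrix with 7 nonzeros.
row_ptr = [0, 2, 3, 6, 7]
col_idx = [0, 2, 1, 0, 2, 3, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 1.0, 1.0, 1.0]
y = spmv_split(row_ptr, col_idx, vals, x, p_share=0.6)
```

In a real setting the two partitions would run concurrently on the two core types, and the share would come from measured per-core throughput rather than a fixed constant.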
TopoCommit: A Topological Commit Protocol for Cross-Ledger Transactions in Scientific Computing TopoCommit: A Topological Commit Protocol for Cross-Ledger Transactions in Scientific Computing Olamide Timothy Tawose (University of Nevada, Reno), Lei Yang (University of Nevada, Reno), and Dongfang Zhao (University of Washington; University of Nevada, Reno) Abstract While increasingly more applications are tempted to manage their data in decentralized systems, such as blockchains or distributed ledgers, the data exchange across multiple, potentially heterogeneous, decentralized systems remains an open problem: State-of-the-art protocols cannot meet one or more of the core requirements, such as atomicity, liveness, and scalability. Specifically, in the field of scientific computing, although a blockchain service was recently developed for scientific computing environments, the data exchanges and transactions among distinct ledgers are not supported. Observing that many modern scientific applications are collaborated on by multiple teams and the increasingly complicated (in-situ) workflows thereof, we argue that there is a pressing need to realize an efficient and scalable protocol for distinct ledgers to exchange data in scientific computing. This paper proposes a topological approach to enabling atomic, nonblocking, and scalable data exchanges among an arbitrary number of scientific ledgers in the context of collaborative scientific computing. Specifically, we construct a topological space formed by these ledgers—abstracting those nodes in a cross-ledger transaction as topological objects such as abstract simplex and simplicial complex. These topological objects, in turn, serve as the building blocks of a topological protocol, namely TopoCommit, under practical assumptions. We implement TopoCommit and integrate it into SciChain, a recently published distributed ledger for tracking scientific data provenance. 
The extensive evaluation of up to 1,008 nodes and 144 distinct ledgers on CloudLab shows that TopoCommit outperforms state-of-the-art protocols by up to 70×. Workflow and Data Processing (Session VIII) Mesa Ballroom Chair: Qing Zheng, Los Alamos National Laboratory ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum Daniel Rosendo (Inria), Marta Mattoso (UFRJ), Alexandru Costan * (IRISA/INSA Rennes, INSA Rennes), Renan Souza (Oak Ridge National Laboratory), Debora Pina (Federal University of Rio de Janeiro), Patrick Valduriez (Inria), and Gabriel Antoniu (Inria) Abstract Modern scientific workflows require hybrid infrastructures combining numerous decentralized resources on the IoT/Edge interconnected to Cloud/HPC systems (aka the Computing Continuum) to enable their optimized execution. Understanding and optimizing the performance of such complex Edge-to-Cloud workflows is challenging. Capturing the provenance of key performance indicators, with their related data and processes, may assist in understanding and optimizing workflow executions. However, the capture overhead can be prohibitive, particularly in resource-constrained devices, such as the ones on the IoT/Edge. To address this challenge, based on a performance analysis of existing systems, we propose ProvLight, a tool to enable efficient provenance capture on the IoT/Edge. We leverage simplified data models, data compression and grouping, and lightweight transmission protocols to reduce overheads. We further integrate ProvLight into the E2Clab framework to enable workflow provenance capture across the Edge-to-Cloud Continuum. This integration makes E2Clab a promising platform for the performance optimization of applications through reproducible experiments. We validate ProvLight at a large scale with synthetic workloads on 64 real-life IoT/Edge devices in the FIT IoT LAB testbed. 
Evaluations show that ProvLight outperforms state-of-the-art systems like ProvLake and DfAnalyzer in resource-constrained devices. ProvLight is 26-37x faster to capture and transmit provenance data; uses 5-7x less CPU; 2x less memory; transmits 2x less data; and consumes 2-2.5x less energy. ProvLight and E2Clab are available as open-source tools. Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning Yuzuo Zhang. Authors: Zhangyu Liu (Northwest University, China), Cheng Zhang (Northwest University), Huijun Wu (College of Computer, National University of Defense Technology, China), Jianbin Fang (National University of Defense Technology), Lin Peng (College of Computer, National University of Defense Technology, China), Guixin Ye (Northwest University, China), and Zhanyong Tang (Northwest University, China) Abstract To improve parallel I/O performance, it is imperative to optimize the adjustable parameters across the different layers of the I/O software stack. Finding an optimal configuration for different scenarios is hampered by the complex interaction dynamics between these parameters and the large parameter space. Previous research efforts have focused on tuning these parameters using independent algorithms; however, these approaches exhibit certain shortcomings such as unstable performance results and delayed convergence rates. Our research introduces a novel approach called OPRAEL, which is based on ensembles and performance modeling using regression analysis. To test its effectiveness, we applied this approach on the Tianhe-II supercomputer using three well-known benchmark datasets: S3D-I/O, BT-I/O, and IOR. Leveraging our experience in predictive modeling, we optimized the tuning of the I/O stack parameters. 
Our experimental results show a remarkable 10.2x improvement in write performance for the optimization task with BT-I/O and a 500x500x500 input. We also compared the potential of using a single search algorithm versus using reinforcement learning search in the I/O parameter auto-optimization task. Our results show that OPRAEL outperforms the traditional approach, resulting in a maximum 8.4x improvement in write performance for the 128-process IOR optimization. A Lightweight, Effective Compressibility Estimation Method for Error-bounded Lossy Compression A Lightweight, Effective Compressibility Estimation Method for Error-bounded Lossy Compression Arkaprabha Ganguli (Michigan State University), Robert Underwood (Argonne National Laboratory), Julie Bessac (National Renewable Energy Laboratory, Virginia Polytechnic Institute and State University), David Krasowska (Northwestern University), Jon Calhoun (Clemson University), Sheng Di (Argonne National Laboratory), and Franck Cappello (Argonne National Laboratory, University of Illinois at Urbana-Champaign) Abstract Error-bounded lossy compression is becoming increasingly important for data-movement-intensive applications that must deal with big datasets efficiently in HPC environments, which often requires knowing the compressibility of the datasets before performing the compression. However, the off-the-shelf state-of-the-art lossy compressors are often driven by error bounds, so the compression ratios cannot be forecasted until the completion of the compression operation. In this paper, we propose a lightweight, robust, easy-to-train model that estimates the compressibility of datasets for different lossy compressors accurately. 
Our approach combines novel predictors that measure various notions of spatial correlation and smoothness exploited by lossy compressors that are implemented efficiently on the GPU in a framework and that uses mixture model regression to improve robustness with conformal prediction to provide bounds on the estimates. We then use these models with a detailed analysis of speedup to understand the tradeoffs between high speed, consistent speed, and accuracy of the methods on real applications. We evaluate our approach in the context of 3 key applications where compression ratio estimation is highly required. 12:30 - 13:00 Lunch (pickup) 13:00 - 14:00 Lunch Keynote: Jesús Labarta (BSC) Mesa Ballroom Pushing RISC-V into HPC. Pushing RISC-V into HPC. Jesús Labarta (BSC) Abstract The talk will present the philosophy and results of the activity within the European Processor Initiative (EPI) to design a RISC-V vector accelerator. I will briefly present the overall project structure but then focus on the vision of how long vector architectures address fundamental issues in HPC computing such as expressing concurrency and dealing with latency. I will also discuss how the Open Standard RISC-V ISA provides a foundation on which that vision can be deployed while at the same time leveraging contributions of a growing community. I will describe the architecture of the RISC-V processor designed in the project and its software environment. I will present performance analysis results obtained on an FPGA emulator implementing the same RTL of the taped-out test chip now in the bring-up process. The FPGA emulator constitutes a Software Development Vehicle (SDV) where a standard Linux environment is available, as well as an LLVM compiler supporting both intrinsics and automatic vectorization. A powerful performance analysis framework is available to understand the behavior of real applications. 
This environment seamlessly covers a very wide range of levels of detail, from full application coarse grain to microscopic micro-architectural behavior. Chair: Frank Mueller, North Carolina State University 14:00 - 16:00 Best Paper presentations Mesa Ballroom Chair: Sunita Chandrasekaran, University of Delaware A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs A Dynamic Network-Native MPI Partitioned Aggregation Over InfiniBand Verbs Yiltan Temucin (Queen's University), Scott Levy (Sandia National Laboratories), Whit Schonbein (Sandia National Laboratories), Ryan Grant (Queen's University, University of New Mexico), and Ahmad Afsahi (Queen’s University) Abstract Modern HPC systems require efficient hybrid programming models to utilize their hardware resources effectively. The Message Passing Interface (MPI) has accommodated next generation hardware by providing new APIs such as the MPI Partitioned interface. This API provides a user with fine-grain communication without the overhead of traditional MPI point-to-point communication in multi-threaded workloads. To the best of our knowledge, we present the first work on detailed hardware-level design for an MPI Partitioned implementation. We guide readers through a method to map the MPI Partitioned interface to the InfiniBand Verbs API. Alongside implementation details, we also study the aggregation of user partitions and how we can efficiently send them over the network. We study a brute-force approach and an approach that uses the Partitioned LogGP (PLogGP) model to predict ideal aggregation. We observe that using the PLogGP model provides comparable performance without exhausting computing resources to search the entire solution space. The PLogGP design was further optimized by considering how the partition arrival pattern can be used to dynamically modify our aggregation scheme. 
We profiled our micro-benchmarks to provide analysis on how and why this additional optimization is beneficial to our results and how we can fine-tune this mechanism. Finally, we evaluated our PLogGP and Timer-based PLogGP designs with a commonly used communication pattern in HPC (communication sweep) to observe the impact when communicating with multiple processes in an application-like scenario at 1024 cores. DoW-KV: A DPU-offloaded and Write-optimized Key-Value Store on Disaggregated Persistent Memory DoW-KV: A DPU-offloaded and Write-optimized Key-Value Store on Disaggregated Persistent Memory Yiwen Zhang (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics), Guokuan Li (the Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; Huazhong University of Science and Technology), Jiguang Wan (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics), Junyue Wang (the School of Computer Science and Technology, Huazhong University of Science and Technology; the School of Computer Science and Technology), Ting Yao (Cloud Storage Service Product Dept, Huawei Technologies Co., Ltd; Huawei Technologies Co., Ltd), Jun Li (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; Wuhan National Laboratory for Optoelectronics), Huatao Wu (Huawei Cloud; Huawei Technologies Co., Ltd), and Daohui Wang (Huawei Cloud; Huawei Technologies Co., Ltd) Abstract Disaggregated Persistent Memory (DPM) is a promising technology offering elasticity, high resource utilization, persistent data storage, and lower power consumption. While KV stores built on DPM benefit from these merits, it is challenging to achieve efficient writes because small random writes to PM create a low-bandwidth bottleneck and intensive CPU consumption on the persistent memory server (PMS) is inevitable. 
Integrating a SmartNIC such as the Data Processing Unit (DPU) into the DPM gives developers the chance to optimize writing to KV stores by utilizing both the DPU memory and the DPU processor. However, simple offloading cannot make full use of the DPU’s potential capacity. To address these challenges, we propose DoW-KV, a persistent hash KV store on DPM. DoW-KV employs a two-tier hash index consisting of a DPU cache table in DPU memory and multiple PM persistent tables on the PM. It relocates small random writes to the DPU memory and consolidates them to the PM at a coarse granularity. Furthermore, DoW-KV uses DPU-offloaded step merge and a coroutine-based asynchronous processing framework to efficiently manage the PM persistent tables. DoW-KV also introduces a client-mixed read strategy to boost key searching on the two-tier hash index. Experiment results show that DoW-KV outperforms the state-of-the-art DINOMO by 2.1× and 1.3× in the Put and Get operations, respectively. Uniform Algorithms for Reduce-scatter and (most) other Collectives for MPI Uniform Algorithms for Reduce-scatter and (most) other Collectives for MPI Jesper Larsson Traff (TU Wien), Sascha Hunold (Vienna University of Technology), Ioannis Vardas (Vienna University of Technology), and Nikolaus Manes Funk (TU Wien) Abstract We explore the use of a regular, circulant graph communication pattern for the implementation of the reduction-to-all and, by specialization, the reduction-to-root, the reduce-scatter, the all-to-all-broadcast and the rooted gather and scatter collective operations, all as found in MPI, for commutative operators and for any number of processes. The reduction-to-all algorithm reconstructs the little-known algorithm by Bar-Noy, Kipnis and Schieber (1993), which the paper considerably extends. 
We experiment with extensions and combinations of the algorithms for these operations, and examine their performance from the perspective of performance guidelines, and in direct comparison to the implementations in common MPI libraries. On a small cluster with 36 x 32 cores, and two larger HPC production systems, we show that, especially for MPI_Reduce_scatter_block, we can achieve considerably better performance than standard MPI library implementations. Our algorithms perform consistently, which the implementations in standard MPI libraries sometimes do not. In a homogeneous, one-ported communication system with linear transmission costs, reduction-to-all, reduce-scatter and all-to-all-broadcast can all be implemented in O(log p+m) time steps for problems of size m with small constants which we analyze and discuss. JACO: Java Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services JACO: Java Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services Wenhai Lin (Zhejiang University), Jingchang Qin (Zhejiang University), Yiquan Chen (Zhejiang University, Alibaba Group), Zhen Jin (Zhejiang University), Jiexiong Xu (Zhejiang University), Yuzhong Zhang (Alibaba Group), Shishun Cai (Alibaba Group), Lirong Fu (Zhejiang University), Yi Chen (University of Michigan), and Wenzhi Chen (Zhejiang University) Abstract Many Java applications in data centers suffer from severe processor pipeline frontend bottlenecks, which can be mitigated by profile-guided code layout optimizations (PGCLO). To maximize optimization opportunities, state-of-the-art PGCLO solutions adopt continuous optimization to ensure that the code layout consistently matches ever-changing application control flow characteristics. However, existing continuous optimizations inevitably pause the application to execute the new code completely, which leads to high response latency and significantly deteriorates user experience. 
In this paper, we propose JACO, a novel profile-guided Java code layout optimizer, enabling continuous optimization without pausing application services. The key idea of JACO is to enable the execution of both the old and new code simultaneously rather than completely switching to the new code. In particular, JACO is composed of three components: (1) A lightweight profiler captures the control flow information of the application and then generates an optimized function order. (2) A control flow switcher generates new code based on optimized function order and switches the application to execute the new code without pausing the application services. (3) A selective code reclaimer only frees the memory occupied by the inactive old code. We evaluated JACO on both open-source applications and real-world applications from a world-leading company. JACO achieved up to a 16.36% performance improvement for real-world applications. The state-of-the-art approach introduces up to 37.93x latency overhead that will interrupt application services, while JACO only introduces a negligible 7% latency overhead. 
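To make the notion of an "optimized function order" in the JACO abstract above more concrete, here is a minimal sketch of one classic way to derive a layout from a call profile: visit call edges from hottest to coldest and place the caller's and callee's clusters next to each other. This is a simplified greedy clustering in the spirit of Pettis-Hansen function ordering, swapped in for illustration; the abstract does not say which ordering algorithm JACO uses, and the function name and the profile counts below are hypothetical.

```python
def order_functions(call_edges):
    """Greedy hot-edge clustering for function layout.

    call_edges maps (caller, callee) pairs to a profiled call count
    (hypothetical data). Returns function names in layout order.
    """
    cluster_of = {}  # function name -> the list (cluster) that contains it
    for caller, callee in call_edges:
        for fn in (caller, callee):
            cluster_of.setdefault(fn, [fn])

    # Process the hottest edges first and concatenate the two clusters,
    # so functions that call each other often end up close together.
    hottest_first = sorted(call_edges.items(), key=lambda kv: -kv[1])
    for (caller, callee), _count in hottest_first:
        a, b = cluster_of[caller], cluster_of[callee]
        if a is b:
            continue  # already placed in the same cluster
        a.extend(b)
        for fn in b:
            cluster_of[fn] = a

    # Emit every distinct cluster once, in first-seen order.
    seen, layout = set(), []
    for cluster in cluster_of.values():
        if id(cluster) not in seen:
            seen.add(id(cluster))
            layout.extend(cluster)
    return layout


# Hypothetical call profile: main -> parse -> lex is the hot path.
profile = {
    ("main", "parse"): 90,
    ("main", "log"): 5,
    ("parse", "lex"): 80,
    ("log", "fmt"): 3,
}
layout = order_functions(profile)
```

With this profile the hot path `main`, `parse`, `lex` is laid out contiguously and the rarely called `log` and `fmt` follow it.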
16:00 - 16:30 Coffee Break
16:30 - 16:45 Best Paper Awards Mesa Ballroom
16:45 - 17:45 Poster lightning talks Mesa Ballroom
18:00 - 20:00 Poster reception

Posters

Efficient Particle Tracing for Scalable Kinetic Plasma Simulation Analysis
Authors: Nigel Tan (University of Tennessee Knoxville), Scott Luedtke (Los Alamos National Laboratory), Michela Taufer (University of Tennessee Knoxville), and Brian Albright (Los Alamos National Laboratory)

Visual Analytics Interactive Tool for Neural Network Archaeology
Authors: Seoyoung An (University of Tennessee Knoxville), Georgia Channing (University of Tennessee Knoxville), Catherine Schuman (University of Tennessee Knoxville), and Michela Taufer (University of Tennessee Knoxville)

Mappings and Patterns to Improve the Triangular Matrix Product on Distributed Systems
Authors: Inmaculada Santamaria-Valenzuela (Universidad de Valladolid), Rocío Carratalá-Sáez (Universidad de Valladolid), Yuri Torres (Universidad de Valladolid), Diego R. Llanos (Universidad de Valladolid), and Arturo Gonzalez-Escribano (Universidad de Valladolid)

Accelerating Distributed ML Training via Selective Synchronization
Authors: Sahil Tyagi (Indiana University Bloomington) and Martin Swany (Indiana University Bloomington)

On the Multi-Dimensional Parallelization and Optimization of Stochastic Block Partitioning for Community Detection
Authors: Frank Wanye (Virginia Tech) and Wu-chun Feng (Virginia Tech)

Few-shot HPC application Runtime Prediction
Authors: Si Chen (Emory University), Simon Garcia De Gonzalo (Sandia National Laboratories), and Avani Wildani (Emory University)

An Efficient and Accurate Compression Ratio Estimation Model for SZx
Authors: Arham Khan (University of Chicago), Sheng Di (Argonne National Laboratory), Kai Zhao (University of Alabama), Jinyang Liu (University of California Riverside), Kyle Chard (University of Chicago), Ian Foster (University of Chicago), and Franck Cappello (Argonne National Laboratory)

Performance Insights into Device-initiated RMA using Kokkos Remote Spaces
Authors: Daniel Mishler (University of Tennessee, Sandia National Laboratories), Jan Ciesko (Sandia National Laboratories), Stephen Olivier (Sandia National Laboratories), and George Bosilca (University of Tennessee, Knoxville)

Performance Improvement by Enhancing Spatial Parallelism on FPGA for HPC Applications
Authors: Yuka Sano (University of Tsukuba), Taisuke Boku (University of Tsukuba), Mitsuhisa Sato (RIKEN), Miwako Tsuji (RIKEN), Norihisa Fujita (University of Tsukuba), and Ryohei Kobayashi (University of Tsukuba)

OpenMP Offloading to DPU
Authors: Muhammad Usman (Barcelona Supercomputing Center), Sergio Iserte (Barcelona Supercomputing Center), Roger Ferrer (Barcelona Supercomputing Center), and Antonio J. Peña (Barcelona Supercomputing Center)

Latency and Bandwidth Microbenchmarks of Six US Department of Energy Systems in the Top500
Authors: Carl Pearson (Sandia National Laboratories), Christopher Siefert (Sandia National Laboratories), Stephen Olivier (Sandia National Laboratories), Andrey Prokopenko (Oak Ridge National Laboratories), Timothy Fuller (Sandia National Laboratories), and Jonathan Hu (Sandia National Laboratories)

A Lightweight Network Traffic Prediction Method for SmartNICs
Authors: Tinotenda Matsika (Queen's University), Whit Schonbein (Sandia National Laboratories), and Ryan Grant (Queen's University)

I/O Characterization and Performance Evaluation of Large-scale Storage Architectures for Heterogeneous Workloads
Authors: Olga Kogiou (Florida State University), Hariharan Devarajan (Lawrence Livermore National Laboratory), Chen Wang (Lawrence Livermore National Laboratory), Weikuan Yu (Florida State University), and Kathryn Mohror (Lawrence Livermore National Laboratory)

Accelerating Hyperdimensional Classifier with SYCL
Authors: Zheming Jin (Oak Ridge National Laboratories) and Jeffrey Vetter (Oak Ridge National Laboratories)

Pecos room

Conference - Friday, Nov 3 8:15 - 9:00 Registration 9:00 
- 9:30 Cluster 2024 Presentation Mesa Ballroom Chairs: Taisuke Boku, University of Tsukuba Kengo Nakajima, RIKEN Center for Computational Science 9:30 - 10:30 Keynote: Susan Coghlan (ANL) Mesa Ballroom Update on the Aurora Supercomputer. Update on the Aurora Supercomputer. Susan Coghlan (ANL) Abstract This presentation will provide an overview of the Aurora supercomputer, touching on the architecture, early science, and Aurora’s current status. The talk will also discuss some of the challenges seen along the way, how this deployment has differed from past deployments, and offer up a few lessons learned. Chair: Antonio J. Peña, Barcelona Supercomputing Center 10:30 - 11:00 Coffee Break 11:00 - 12:30 -- Parallel Sessions GPU and FPGA Applications (Session IX) Mesa Ballroom Chair: Florina Ciorba, University of Basel A Finite-Difference Time-Domain (FDTD) solver with linearly scalable performance in an FPGA cluster A Finite-Difference Time-Domain (FDTD) solver with linearly scalable performance in an FPGA cluster Zhenyu Xu (University of Rhode Island, Clemson University), Miaoxiang Yu (University of Rhode Island, Clemson University), Jillian Cai (University of Rhode Island), Qing Yang (University of Rhode Island), and Tao Wei (University of Rhode Island, Clemson University) Abstract This paper presents an FPGA cluster based Finite-Difference Time-Domain (FDTD) accelerator that offers a linear speedup with the number of FPGAs participating in computation within the cluster. FDTD is a numeric method for simulating electromagnetic wave propagation and interactions with diverse materials and structures. Recent advancements in machine learning-based design and optimization techniques for photonic integrated circuits and microwave circuits, known as inverse design, have demonstrated remarkable success. Inverse design necessitates numerous FDTD simulations, and the high-performance FDTD accelerator enables rapid design automation, which is crucial for accelerating innovation. 
Our proposed accelerator comprises deeply pipelined FDTD cell update kernels that can traverse multiple FPGAs via high-speed optical links, effectively utilizing available resources across all FPGAs in a cluster. The architecture includes a head node and a flexible number of cascaded server nodes, together with custom cross-FPGA data routing kernels integrated into the "Open Cloud Testbed" (OCT) FPGA infrastructure to facilitate seamless data transfer. The proposed accelerator is developed on an existing platform, OCT FPGA. Our experiments reveal that, for a 4096x4096 2.5D FDTD simulation, each server node (Xilinx Alveo U280) can achieve 86.4 Giga-cells updates per second (GCUPS), and the head node can achieve 38.4 GCUPS. The overall speed with 4 server nodes is 38.4 + 4 x 86.4 = 384 GCUPS. GPU Occupancy Prediction of Deep Learning Models Using Graph Neural Network GPU Occupancy Prediction of Deep Learning Models Using Graph Neural Network Hengquan Mei (University of Science and Technology of China), Huaizhi Qu (University of Science and Technology of China), Jingwei Sun (University of Science and Technology of China), Yanjie Gao (Microsoft Research), Haoxiang Lin (Microsoft Research), and Guangzhong Sun (University of Science and Technology of China) Abstract GPU is the mainstream infrastructure for executing deep learning (DL) workloads. To conduct resource-efficiency scheduling of DL workloads, GPU occupancy plays an important role for understanding whether GPUs are fully utilized. GPU occupancy is the ratio of the number of active warps on a streaming multiprocessor (SM) to the maximum number of active warps supported by the SM. By predicting the GPU occupancy of a DL model before its execution, we can estimate the percentage of the hardware’s ability to process warps that are actively in use. However, general performance prediction for DL models is challenging due to the diverse DL model architectures. 
In this paper, we propose DNN-occu to predict GPU occupancy of DL models. DNN-occu precisely captures the relations between structural factors of computation graphs of DL models and corresponding GPU occupancy. We also propose a novel graph neural network model to better represent these relations and make generalizable predictions. Empirical evaluations on a variety of DL models as well as configurations show that DNN-occu achieves high accuracy for occupancy prediction and has zero-shot ability for predicting the occupancy of unseen DL models. Our experiments show that DNN-occu achieves an overall prediction error of 9.271%. Besides, we conduct a trace driven simulation of DL workload scheduling, where DNN-occu achieves up to 31.45% improvement to GPU utilization and 19.71% reduction in makespan. Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion Qinglei Cao (University of Tennessee), Sameh Abdulah (King Abdullah University of Science & Technology), Hatem Ltaief (King Abdullah University of Science & Technology), Marc Genton (King Abdullah University of Science & Technology), David Keyes (King Abdullah University of Science & Technology), and George Bosilca (University of Tennessee) Abstract The burgeoning interest in large-scale geospatial modeling, particularly within the domains of climate and weather prediction, underscores the concomitant critical importance of accuracy, scalability, and computational speed. Harnessing these complex simulations' potential, however, necessitates innovative computational strategies, especially considering the increasing volume of data involved. Recent advancements in Graphics Processing Units (GPUs) have opened up new avenues for accelerating these modeling processes. 
In particular, their efficient utilization necessitates new strategies, such as mixed-precision arithmetic, that can balance the trade-off between computational speed and model accuracy. This paper leverages the PaRSEC runtime system and delves into the opportunities provided by mixed-precision arithmetic to expedite large-scale geospatial modeling in heterogeneous environments. By using an automated conversion strategy, our mixed-precision approach significantly improves computational performance (up to 3X) on the Summit supercomputer and reduces the associated energy consumption on various Nvidia GPU generations. Importantly, this implementation ensures the requisite accuracy in environmental applications, a critical factor in their operational viability. The findings of this study bear significant implications for future research and development in high-performance computing, underscoring the transformative potential of mixed-precision arithmetic on GPUs in addressing the computational demands of large-scale geospatial modeling and making a stride toward a more sustainable, efficient, and accurate future in large-scale environmental applications.

MPI & Networking (Session X)

Canyon Room

Chair: George Michelogiannakis, Lawrence Berkeley National Laboratory

SDT: A Low-cost and Topology-reconfigurable Testbed for Network Research

Zixuan Chen (Fudan University), Zhigao Zhao (Fudan University), Zijian Li (Fudan University), Jiang Shao (Fudan University), Sen Liu (Fudan University), and Yang Xu (Fudan University)

Abstract

Network experiments are essential to network-related scientific research (e.g., congestion control, QoS, network topology design, and traffic engineering). However, (re)configuring various topologies on a real testbed is expensive, time-consuming, and error-prone.
In this paper, we propose Software Defined Topology Testbed (SDT), a method for constructing a user-defined network topology using a few commodity switches. SDT is low-cost, deployment-friendly, and reconfigurable, and can run multiple sets of experiments under different topologies by simply using different topology configuration files at the controller we designed. We implement a prototype of SDT and conduct numerous experiments. Evaluations show that SDT introduces at most 2% extra overhead compared with full testbeds on multi-hop latency and is far more efficient than software simulators (reducing the evaluation time by up to 2899x). SDT is more cost-effective and scalable than existing Topology Projection (TP) solutions. Further experiments show that SDT can support various network research experiments at a low cost on topics including but not limited to topology design, congestion control, and traffic engineering.

PiP-MColl: Process-in-Process-based Multi-object MPI Collectives

Jiajun Huang (University of California, Riverside), Kaiming Ouyang (NVIDIA Corporation), Yujia Zhai (University of California, Riverside), Jinyang Liu (University of California, Riverside), Min Si (Meta Platforms, Inc.), Ken Raffenetti (Argonne National Laboratory), Hui Zhou (Argonne National Laboratory), Atsushi Hori (National Institute of Informatics), Zizhong Chen (University of California, Riverside), Yanfei Guo (Argonne National Laboratory), and Rajeev Thakur (Argonne National Laboratory)

Abstract

In the era of exascale computing, the adoption of a large number of CPU cores and nodes by high-performance computing (HPC) applications has made MPI collective performance increasingly crucial. As the number of cores and nodes increases, the importance of optimizing MPI collective performance becomes more evident.
Current collective algorithms, including kernel-assisted inter-process data exchange techniques and data-sharing-based shared-memory approaches, are prone to significant performance degradation due to the overhead of system calls and page faults or the cost of extra data-copy latency. These issues can negatively impact the efficiency and scalability of HPC applications. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object Inter-process MPI Collective design that maximizes small message MPI collective performance at scale. We also present specific designs to boost the performance for larger messages, such that we observe a comprehensive improvement for a series of message sizes beyond small messages. PiP-MColl features efficient multiple sender and receiver collective algorithms and leverages Process-in-Process shared memory techniques to eliminate unnecessary system call and page fault overhead as well as extra data copies, which results in improved intra- and inter-node message rate and throughput. Experimental results demonstrate that PiP-MColl significantly outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6X for the MPI collectives MPI_Scatter, MPI_Allgather, and MPI_Allreduce.

12:30 - 14:00

Lunch (provided)

Conference ends
	Conference Room Layout