Tuesday 08:20-09:10, Zoom
Keynote K1: Keynote-1
Chair: Pete Beckman (Argonne National Laboratory)

Fugaku: the First 'Exascale' Supercomputer
Satoshi Matsuoka (RIKEN Center for Computational Science (R-CCS))

Biography: Satoshi Matsuoka has been the director of RIKEN R-CCS since April 2018, the top-tier HPC center that represents HPC in Japan, developing and hosting Japan's tier-one 'Fugaku' supercomputer, which has become the fastest supercomputer in the world in all four major supercomputer rankings, alongside a multitude of ongoing cutting-edge HPC research, including investigations into post-Moore-era computing. He previously led the TSUBAME series of supercomputers at Tokyo Institute of Technology, where he still holds a professor position and continues his research activities in HPC as well as scalable big data and AI. His commendations include the ACM Gordon Bell Prize in 2011 and the IEEE Sidney Fernbach Award in 2014, both among the highest awards in the field of HPC. He also served as the Program Chair for ACM/IEEE Supercomputing 2013 (SC13).

Abstract: Fugaku is the world's first 'exascale' supercomputer, not because of its peak double-precision flops, but rather because of its demonstrated performance in real applications that were expected of exascale machines at their conception 10 years ago, as well as its reaching actual exaflops in a new breed of benchmarks such as HPL-AI. But the importance of Fugaku lies in the "applications first" philosophy under which it was developed, and in its resulting mission to be the centerpiece for the rapid realization of the so-called Japanese 'Society 5.0' as defined by the Japanese S&T national policy. As such, Fugaku's immense power is directly applicable not only to traditional scientific simulation applications, but also to Society 5.0 applications that encompass the convergence of HPC, AI, and Big Data, as well as of the Cyber (IDC & network) and Physical (IoT) spaces, with immediate societal impact. In fact, Fugaku is already in partial operation a year ahead of schedule, primarily to obtain early Society 5.0 results, including combatting COVID-19 as well as resolving other important societal issues.

Tuesday 09:20-10:10, Zoom
Paper T1: Best Papers
Chair: Masaaki Kondo (University of Tokyo, RIKEN R-CCS)

Best Paper: Konrad von Kirchbach (TU Wien/Faculty of Informatics), Markus Lehr and Sascha Hunold (TU Wien/Faculty of Informatics), Christian Schulz (University of Vienna/Faculty of Computer Science), and Jesper Larsson Träff (TU Wien/Faculty of Informatics)
Abstract: Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the structure inherently present in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance.
Furthermore, our algorithms also achieve significantly better mapping quality than previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.

Best Paper: Minseok Kwon and Krishna Prasad Neupane (Rochester Institute of Technology); John Marshall (Cisco Systems, Inc.); and M. Mustafa Rafique (Rochester Institute of Technology)
Abstract: Programmability in the data plane has become increasingly important as virtualization is introduced into networking and software-defined networking becomes more prevalent. Yet, the performance of programmable data planes, often on commodity hardware, is a major concern, especially in light of ever-increasing network link speeds and routing table sizes. This paper focuses on IP lookup, specifically the longest prefix matching for IPv6 addresses, which is a major performance bottleneck in programmable switches. As a solution, the paper presents CuVPP, a programmable switch that uses packet batch processing and cache locality for both instructions and data by leveraging Vector Packet Processing (VPP). We thoroughly evaluate CuVPP with both real network traffic and file-based lookup on a commodity hardware server connected via 80 Gbps network links and compare its performance with other popular approaches. Our evaluation shows that CuVPP can achieve up to 4.5 million lookups per second with real traffic, higher than other trie- or filter-based lookup approaches, and scales well even when the routing table size grows to 2 million prefixes.

Best Paper: Xi Luo (University of Tennessee, Knoxville); Wei Wu (Los Alamos National Laboratory); George Bosilca, Yu Pei, and Qinglei Cao (University of Tennessee, Knoxville); Thananon Patinyasakdikul (Cray); and Dong Zhong and Jack Dongarra (University of Tennessee, Knoxville)
Abstract: High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy increasing computational needs, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations.

Best Paper: Marc-André Vef, Rebecca Steiner, Reza Salkhordeh, and Jörg Steinkamp (Johannes Gutenberg University Mainz); Florent Vennetier and Jean-François Smigielski (OpenIO); and André Brinkmann (Johannes Gutenberg University Mainz)
Abstract: Data-driven applications are becoming increasingly important in numerous industrial and scientific fields, growing the need for scalable data storage, such as object storage. Yet, many data-driven applications cannot use object interfaces directly and often have to rely on third-party file system connectors that support only a basic representation of objects as files in a flat namespace. With sometimes millions of objects per bucket, this simple organization is insufficient for users and applications who are usually interested in only a small subset of objects. These huge buckets are not only lacking basic semantic properties and structure, but they are also challenging to manage from a technical perspective, as object store file systems cannot cope with such directory sizes.
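As background for the first best paper above, which accelerates the mapping behind MPI_Cart_create and thereby speeds up MPI_Neighbor_alltoall exchanges, the following hedged C sketch shows the standard Cartesian-communicator pattern such algorithms plug into. It is not the authors' code; the 2D periodic grid and single-integer payload are illustrative assumptions.

```c
/* Minimal sketch of the MPI pattern the mapping work above targets:
 * create a 2D periodic Cartesian communicator (with reorder enabled,
 * the library may remap ranks to improve the process-to-node mapping),
 * then exchange one integer with each of the four grid neighbors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);          /* factor size into a 2D grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart);

    int rank, coords[2];
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* One integer per neighbor, ordered per dimension: -dim0, +dim0, -dim1, +dim1. */
    int sendbuf[4] = {rank, rank, rank, rank}, recvbuf[4];
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);

    printf("rank %d at (%d,%d) received %d %d %d %d\n",
           rank, coords[0], coords[1],
           recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

Setting the reorder flag to 1 is what permits an MPI library to apply the kind of process-to-node remapping studied in that paper.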
Tuesday 14:00-14:50, Zoom
Paper T2: Performance Characterization and Scheduling
Chair: Johannes Langguth (Simula)

Yi-Chao Wang, Jie Wang, Jin-Kun Chen, Si-Cheng Zuo, Xiao-Ming Su, and James Lin (Shanghai Jiao Tong University)
Abstract: Modern processors provide hundreds of low-level hardware events (such as cache miss rate), but offer only a small number (usually 6-12) of hardware counters to collect these events, due to limited register resources. Multiplexing (MPX) is an estimation-based technique designed to collect hardware events simultaneously with few hardware counters. However, the low accuracy of existing MPX methods prevents this technique from wide usage in real conditions. To obtain accurate and reliable hardware counter values, we conducted this work in three steps: 1) to explore the root cause of inaccuracy, we characterized the estimation errors of MPX and found that they arise from outliers in PAPI; 2) to eliminate these outliers and improve MPX accuracy, we proposed two non-linear growth-rate gradient estimation methods: the divided curved-area method (DCAM) and the curved-area method (CAM); 3) based on these two methods, we developed a new MPX library for PAPI, NeoMPX. We evaluated NeoMPX with six Rodinia benchmarks on four mainstream x86 and ARM server processors, and compared the results with PAPI's default MPX and two other state-of-the-art MPX methods, DIRA and TAM. Evaluations show that, for collecting 16 evaluated hardware events, our methods improve accuracy by up to 59% over PAPI's default MPX, and achieve 36% and 5% higher accuracy than DIRA and TAM, respectively. We have open-sourced NeoMPX and expect it to enable PAPI MPX for practical usage.

Tim Hegeman, Animesh Trivedi, and Alexandru Iosup (Vrije Universiteit Amsterdam)
Abstract: Graph processing is one of the most important and ubiquitous classes of analytical workloads. To process large graph datasets with diverse algorithms, tens of distributed graph processing frameworks have emerged. Their users increasingly expect high performance for diversifying workloads. Meeting this expectation depends on understanding the performance of each framework. However, performance analysis and characterization of a distributed graph processing framework is challenging. Contributing factors are the irregular nature of graph computation across datasets and algorithms, the semantic gap between workload-level and system-level monitoring, and the lack of lightweight mechanisms for collecting fine-grained performance data. Addressing this challenge, in this work we present Grade10, an experimental framework for fine-grained performance characterization of distributed graph processing workloads. Grade10 captures the graph workload execution as a performance graph from logs and application traces, and builds a fine-grained, unified workload-level and system-level view of performance. Grade10 samples sparsely for lightweight monitoring and addresses the resulting accuracy problem through a novel approach for resource attribution. Last, it can automatically identify resource bottlenecks and common classes of performance issues. Our real-world experimental evaluation with Giraph and PowerGraph, two state-of-the-art distributed graph processing systems, shows that Grade10 can reveal large differences in the nature and severity of bottlenecks across systems and workloads.
We also show that Grade10 can be used in debugging processes, by exemplifying how we used it to find a synchronization bug in PowerGraph that slows down affected phases by 1.10-2.50x. Grade10 is an open-source project available at https://github.com/atlarge-research/grade10.

Marcos Maroñas and Xavier Teruel (Barcelona Supercomputing Center); Mark Bull (University of Edinburgh); Eduard Ayguadé (Barcelona Supercomputing Center, Universitat Politècnica de Catalunya); and Vicenç Beltran (Barcelona Supercomputing Center)
Abstract: Hybrid programming is a promising approach to exploit clusters of multicore systems. Our focus is on the combination of MPI and tasking. This hybrid approach combines the low latency and high throughput of MPI with the flexibility of tasking models and their inherent ability to handle load imbalance. However, combining tasking with standard MPI implementations can be a challenge. The Task-Aware MPI library (TAMPI) eases the development of applications combining tasking with MPI. TAMPI enables developers to overlap computation and communication phases by relying on the tasking data-flow execution model. Using this approach, the original computation that was distributed over many different MPI ranks is grouped into fewer MPI ranks and split into several tasks per rank. Nevertheless, programmers must be careful with task granularity. Too fine-grained tasks introduce too much overhead, while too coarse-grained tasks lead to a lack of parallelism. An adequate granularity may not always exist, especially in distributed environments where the same amount of work is distributed among many more cores. Worksharing tasks are a special kind of task, recently proposed, that internally leverages worksharing techniques. By doing so, a single worksharing task may run on several cores concurrently. Nonetheless, the task management costs remain the same as for a regular task. In this work, we study the combination of worksharing tasks and TAMPI in distributed environments using two well-known mini-apps: HPCCG and LULESH. Our results show significant improvements using worksharing tasks compared to regular tasks, and to other state-of-the-art alternatives such as OpenMP worksharing.

Anne Benoit, Valentin Le Fèvre, and Lucas Perotin (ENS Lyon); Padma Raghavan (Vanderbilt University); Yves Robert (ENS Lyon, University of Tennessee Knoxville); and Hongyang Sun (Vanderbilt University)
Abstract: This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow the processor allocation to be chosen before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios and hence need to be re-executed each time they fail until successful completion. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm, and prove new complexity results and approximation ratios for three prominent speedup models (roofline, communication, Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform some baseline heuristics.
In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm performs better for Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.

Loic Pottier and Rafael Ferreira Da Silva (USC Information Sciences Institute), Henri Casanova (University of Hawai'i at Manoa), and Ewa Deelman (USC Information Sciences Institute)
Abstract: Scientific domains ranging from bioinformatics to astronomy and earth science rely on traditional high-performance computing (HPC) codes, often encapsulated in scientific workflows. In contrast to traditional HPC codes, which employ a few programming and runtime approaches that are highly optimized for HPC platforms, scientific workflows are not necessarily optimized for these platforms. As an effort to reduce the gap between compute and I/O performance, HPC platforms have adopted intermediate storage layers known as burst buffers. A burst buffer (BB) is a fast storage layer positioned between the global parallel file system and the compute nodes. Two designs currently exist: (i) shared, where the BBs are located on dedicated nodes; and (ii) on-node, in which each compute node embeds a private BB. In this paper, using accurate simulations and real-world experiments, we study how to best use these new storage layers when executing scientific workflows. These applications are not necessarily optimized to run on HPC systems, and thus can exhibit I/O patterns that differ from those of HPC codes. We therefore first characterize the I/O behaviors of a real-world workflow under different configuration scenarios on two leadership-class HPC systems (Cori at NERSC and Summit at ORNL). Then, we use these characterizations to calibrate a simulator for workflow executions on HPC systems featuring shared and private BBs. Last, we evaluate our approach against a large I/O-intensive workflow, and we provide insights on the performance levels and the potential limitations of these two BB architectures.

Co-scheML: Interference-aware Container Co-scheduling Scheme using Machine Learning Application Profiles for GPU Clusters
Sejin Kim and Yoonhee Kim (Sookmyung Women's University)
Abstract: Recently, the efficient execution of applications on Graphics Processing Units (GPUs) has emerged as a research topic for increasing overall system throughput in cluster environments. As current cluster orchestration platforms using GPUs only support exclusive execution of an application on a GPU, they may not fully utilize GPU resources, depending on application characteristics. However, co-execution of GPU applications leads to interference caused by resource contention among applications. If the diverse resource usage characteristics of GPU applications are not taken into account, unbalanced use of computing resources and performance degradation can result in a GPU cluster. This study introduces Co-scheML for the co-execution of various GPU applications such as High Performance Computing (HPC), Deep Learning (DL) training, and DL inference. Since interference is difficult to predict, an interference model is constructed by applying a Machine Learning (ML) model to GPU metrics. The predicted interference is then used by the Co-scheML scheduler to determine the deployment of an application.
Experimental results of the Co-scheML strategy show that average job completion time is improved by 23%, and the makespan is shortened by 22% on average, compared to baseline schedulers.

Wednesday 08:00-08:50, Zoom
Paper T3: Architecture and Network Support for HPC Workloads
Chair: Philippe Olivier Alexandre Navaux (Federal University of Rio Grande do Sul, UFRGS)

Opportunities and limitations of Quality-of-Service in Message Passing applications on adaptively routed Dragonfly and Fat Tree networks
Jeremiah Wilke and Joseph Kenny (Sandia National Labs)
Abstract: Avoiding communication bottlenecks remains a critical challenge in high-performance computing (HPC) as systems grow to exascale. Numerous design possibilities exist for avoiding network congestion, including topology, adaptive routing, congestion control, and quality-of-service (QoS). While network design often focuses on topological features like diameter, bisection bandwidth, and routing, efficient QoS implementations will be critical for next-generation interconnects. HPC workloads are dominated by tightly coupled mathematics, making delays in a single message manifest as delays across an entire parallel job. QoS can spread traffic onto different virtual lanes (VLs), lowering the impact of network hotspots by providing priorities or bandwidth guarantees that prevent starvation of critical traffic. Two leading topology candidates, Dragonfly and Fat Tree, are often discussed in terms of routing properties and cost, but the topology can have a major impact on QoS. While Dragonfly has attractive routing flexibility and cost relative to Fat Tree, the extra routing complexity requires several VLs to avoid deadlock. Here we discuss the special challenges of Dragonfly, proposing configurations that use different routing algorithms for different service levels (SLs) to limit VL requirements. We provide simulated results showing how each QoS strategy performs on different classes of application and different workload mixes. Despite Dragonfly's desirable characteristics for adaptive routing, Fat Tree is shown to be an attractive option when QoS is considered.

Jie Li, Ghanzanfar Ali, and Ngan Nguyen (Texas Tech University); Jon Hass (Dell EMC Inc.); and Alan Sill, Tommy Dang, and Yong Chen (Texas Tech University)
Abstract: Understanding the status of high-performance computing platforms and correlating applications to resource usage provide insight into the interactions among platform components. A lot of effort has been devoted to developing monitoring solutions; however, a large-scale HPC system usually requires a combination of methods and tools to successfully monitor all metrics, which leads to a huge effort in configuration and monitoring. In addition, monitoring tools are often left behind in the procurement of large-scale HPC systems. These challenges have motivated the development of a next-generation out-of-the-box monitoring tool that can be easily deployed without losing informative metrics.

Ching-Hsiang Chu, Kawthar Shafie Khorassani, Qinghua Zhou, Hari Subramoni, and Dhabaleswar K. Panda (The Ohio State University)
Abstract: In the last decade, many scientific applications have been significantly accelerated by large-scale GPU systems. However, the movement of non-contiguous GPU-resident data is one of the most challenging parts of scaling these applications using communication middleware like MPI.
Although plenty of research has discussed improving non-contiguous data movement within communication middleware, the packing/unpacking operations on GPUs are still expensive. They cannot be hidden due to the limitations of the MPI standard and the poorly optimized designs for GPU-resident data in existing MPI implementations. Consequently, application developers tend to implement customized packing/unpacking kernels to improve GPU utilization by avoiding unnecessary synchronizations in MPI routines. However, this reduces productivity as well as performance, since the packing/unpacking operations cannot be overlapped with communication. In this paper, we propose a novel approach to achieve low latency and high bandwidth by dynamically fusing the packing/unpacking GPU kernels to reduce the expensive kernel launch overhead. The evaluation of the proposed designs shows up to 8X and 5X performance improvement for sparse and dense non-contiguous layouts, respectively, compared to the state-of-the-art approaches on the Lassen system. Similarly, we observe up to 19X improvement over existing approaches on the ABCI system. Furthermore, the proposed design also outperforms production libraries, such as SpectrumMPI, OpenMPI, and MVAPICH2, by many orders of magnitude.

Chao Zheng, Nathaniel Kremer-Herman, Tim Shaffer, and Douglas Thain (University of Notre Dame)
Abstract: High-throughput computing (HTC) workloads seek to complete as many jobs as possible over a long period of time. Such workloads require efficient execution of many parallel jobs and can occupy a large number of resources for a long time, such that full utilization is the normal state of an HTC facility. The widespread use of container orchestrators eases the deployment of HTC frameworks across different platforms, which also provides an opportunity to scale up HTC workloads with almost infinite resources on the public cloud. However, the autoscaling mechanisms of container orchestrators are primarily designed to support latency-sensitive microservices, and they result in unexpected behavior when presented with HTC workloads. In this paper, we design a feedback autoscaler, the High Throughput Autoscaler (HTA), that leverages the unique characteristics of HTC workloads to autoscale the resource pools used by HTC workloads on container orchestrators. HTA takes into account a reference input, the real-time status of the job queue, as well as two feedback inputs: the resource consumption of jobs and the resource initialization time of the container orchestrator. We implement HTA using the Makeflow workload manager, the Work Queue job scheduler, and the Kubernetes cluster manager. We evaluate its performance on both CPU-bound and IO-bound workloads. The evaluation results show that, by using HTA, we improve resource utilization by 5.6× with a slight increase in execution time (about 15%) for the CPU-bound workload, and shorten the workload execution time by up to 3.65× for the IO-bound workload.

Yang Bai, Dezun Dong, Shan Huan, Zejia Zhou, and Xiangke Liao (National University of Defense Technology)
Abstract: Proactive transport has recently drawn much attention because of its fast convergence, near-zero queueing, and low latency. Proactive protocols, however, need an extra RTT to allocate an ideal sending rate to new flows. To address this, some designs, such as pHost and Homa, send unscheduled packets at line rate in the first RTT, which can cause severe network congestion.
To avoid queue buildup, Aeolus directly drops unscheduled packets when congestion occurs. Nevertheless, based on our experiments, a considerable fraction of small flows (0-100 KB) complete within the first RTT on a 100 Gbps network, so dropping unscheduled packets severely affects the performance of small flows.

Yijia Zhang (Boston University); Taylor Groves, Brandon Cook, and Nicholas Wright (Lawrence Berkeley National Laboratory); and Ayse K. Coskun (Boston University)
Abstract: In modern high-performance computing (HPC) systems, network congestion is an important factor that contributes to performance degradation. However, how network congestion impacts application performance is not fully understood. As the Aries network, a recent HPC network architecture featuring a dragonfly topology, is equipped with network counters measuring packet transmission statistics on each router, these network metrics can potentially be utilized to understand network performance. In this work, through experiments on a large HPC system, we quantify the impact of network congestion on the performance of various applications in terms of execution time, and we correlate application performance with network metrics. Our results demonstrate diverse impacts of network congestion: while applications with intensive MPI operations (such as HACC and MILC) suffer more than a 40% increase in their execution times under network congestion, applications with less intensive MPI operations (such as Graph500 and HPCG) are mostly unaffected. We also demonstrate that a stall-to-flit ratio metric derived from Aries network counters is positively correlated with performance degradation and, thus, this metric can serve as an indicator of network congestion in HPC systems.

Jorji Nonaka (RIKEN R-CCS); Toshihiro Hanawa (The University of Tokyo, JCAHPC); and Fumiyoshi Shoji (RIKEN R-CCS)
Abstract: The hot water cooling technique has been widely accepted as one of the standard techniques to improve the energy efficiency of HPC and data centers. However, the higher operating temperature may increase CPU power consumption due to leakage current. Moreover, it may degrade computational performance due to the DVFS mechanism activated to keep power and temperature within the TDP limit. In that sense, to fairly evaluate the efficiency of the hot water cooling technique, it becomes important to take into consideration not only the energy reduction on the HPC facility side (cooling system) but also the impact on power consumption and performance degradation on the HPC system side. In this paper, we utilized the Oakforest-PACS system and its facility, jointly administered by the University of Tsukuba and The University of Tokyo, to execute a quantitative and systematic analysis of the impact of the cooling water temperature on the HPC system and its facility. For this purpose, we utilized lower (9°C) and higher (18°C) cooling water temperatures in addition to the regular operational temperature (12°C). Contrary to the energy gains on the HPC facility side when using a higher cooling water temperature, we observed an increase in the number of nodes suffering from performance degradation on the HPC system side. This can directly increase the probability of including low-performance nodes in multi-node jobs, thus affecting their overall performance, especially during barrier synchronizations.
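For orientation only: the non-contiguous GPU-resident data movement addressed by Chu et al. earlier in this session typically arises from MPI derived datatypes such as the strided "vector" below, whose scattered elements must be packed before they can be sent. This is a hedged, host-side C sketch of that access pattern, not the authors' design; the matrix shape and the use of MPI_Type_vector are illustrative assumptions.

```c
/* Hedged illustration of a non-contiguous (strided) MPI transfer.
 * A derived "vector" datatype describes one column of a row-major
 * NROWS x NCOLS matrix; sending it forces the MPI library (or, on
 * GPUs, a packing kernel) to gather the scattered elements. */
#include <mpi.h>

#define NROWS 1024   /* illustrative sizes, not taken from the paper */
#define NCOLS 512

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double matrix[NROWS][NCOLS];  /* row-major; a column is strided */

    /* NROWS blocks of 1 double each, separated by a stride of NCOLS doubles:
     * a single non-contiguous column of the matrix. */
    MPI_Datatype column;
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (size >= 2) {            /* needs at least two ranks */
        if (rank == 0)
            MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&matrix[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```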
Wednesday 14:00-14:50, Zoom
Keynote K2: Keynote-2
Chair: Taisuke Boku (Center for Computational Sciences, University of Tsukuba/Graduate School of Systems and Information Engineering, University of Tsukuba)

The Price Performance of Performance Models
Felix Wolf (Technical University of Darmstadt)

Biography: Felix Wolf is a full professor at the Department of Computer Science of the Technical University of Darmstadt in Germany, where he leads the Laboratory for Parallel Programming. He works on methods, tools, and algorithms that support the development and deployment of parallel software systems in various stages of their life cycle. Wolf received his Ph.D. degree from RWTH Aachen University in 2003. After working for more than two years as a postdoc at the Innovative Computing Laboratory of the University of Tennessee, he was appointed research group leader at Jülich Supercomputing Centre. Between 2009 and 2015, he was head of the Laboratory for Parallel Programming at the German Research School for Simulation Sciences in Aachen and a full professor at RWTH Aachen University. Wolf has made major contributions to several open-source performance tools for parallel programs, including Scalasca, Score-P, and Extra-P. Moreover, he initiated the Virtual Institute High Productivity Supercomputing, an international initiative of HPC programming-tool builders aimed at the enhancement, integration, and deployment of their products. He has published more than a hundred refereed articles on parallel computing, several of which have received awards.

Abstract: To understand the scaling behavior of HPC applications, developers often use performance models. A performance model is a formula that expresses a key performance metric, such as runtime, as a function of one or more execution parameters, such as core count and input size. Performance models offer quick insights on a very high level of abstraction, including predictions of future behavior. In view of the complexity of today's applications, which often combine several sophisticated algorithms, creating performance models manually is extremely laborious. Empirical performance modeling, the process of learning such models from performance data, offers a convenient alternative, but comes with its own set of challenges. The two most prominent ones are noise and the cost of the experiments needed to generate the underlying data. In this talk, we will review the state of the art in empirical performance modeling and investigate how we can employ machine learning and other strategies to improve the quality and lower the cost of the resulting models.

Wednesday 15:10-16:00, Zoom
Paper T4: Framework for Data and Storage
Chair: Suren Byna (Lawrence Berkeley Lab)

E2Clab: Exploring the Computing Continuum through Repeatable, Replicable and Reproducible Edge-to-Cloud Experiments
Daniel Rosendo (Inria); Pedro Silva (Hasso-Plattner Institut); Matthieu Simonin (Inria); Alexandru Costan (IRISA, INSA Rennes); and Gabriel Antoniu (Inria)
Abstract: Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging.
This breaks down to reconciling many, typically conflicting, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. In this paper we introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete analysis cycle of an application on the Computing Continuum: (i) the configuration of the experimental environment, libraries and frameworks; (ii) the mapping between the application parts and machines on the Edge, Fog and Cloud; (iii) the deployment of the application on the infrastructure; (iv) the automated execution; and (v) the gathering of experiment metrics. We illustrate its usage with a real-life application deployed on the Grid'5000 testbed, showing that our framework allows one to understand and improve performance by correlating it to the parameter settings, the resource usage and the specifics of the underlying infrastructure.

Davut Ucar and Engin Arslan (University of Nevada, Reno)
Abstract: Driven by advancements in computing and sensing technology, scientific applications have started to generate huge volumes of data that need to be streamed to high-performance computing clusters in a timely manner for real-time (or near-real-time) processing, necessitating reliable network performance to operate seamlessly. However, existing data transfer applications are predominantly designed for batch workloads, such that transfer configurations cannot be altered once they are set. This, in turn, severely limits the ability of streaming applications to adapt to changing dataset and network conditions and therefore to meet stringent performance requirements. In this paper, we propose FStream to offer performance guarantees to time-sensitive streaming applications by dynamically adjusting transfer settings when system conditions deviate from initial assumptions, in order to sustain high network performance throughout the runtime. We evaluate the performance of FStream by transferring several synthetic and real-world workloads over high-performance production networks and show that it offers up to 9x performance improvement over state-of-the-art data transfer solutions.

Haoliang Tan, Zhiyuan Zhang, Xiangyu Zou, Qing Liao, and Wen Xia (HITSZ)
Abstract: Delta compression (also called delta encoding) is a data reduction technique capable of calculating the differences (i.e., the delta) among very similar files and chunks, and it is thus widely used for optimizing synchronization replication, backup/archival storage, cache compression, etc. However, delta compression is costly because of its time-consuming word-matching operations for delta calculation. Existing delta encoding approaches either have a slow encoding speed, such as Xdelta and Zdelta, or a low compression ratio, such as Ddelta and Edelta. In this paper, we propose Gdelta, a fast delta encoding approach with a high compression ratio. Gdelta improves the delta encoding speed by employing an improved fast Gear-based rolling hash for scanning fine-grained words and a quick array-based indexing scheme for word-matching, and then, after word-matching, batch-compresses the remainder to improve the compression ratio.
Our evaluation results driven by six real-world datasets suggest that Gdelta achieves encoding/decoding speedups of 2X~4X over the classic Xdelta and Zdelta approaches while increasing the compression ratio by about 10%~120%.

Zhe Wang and Pradeep Subedi (Rutgers University), Matthieu Dorier (Argonne National Laboratory), and Philip E. Davis and Manish Parashar (Rutgers University)
Abstract: As scientific workflows increasingly use extreme-scale resources, the imbalance between higher computational capabilities, generated data volumes, and available I/O bandwidth is limiting the ability to translate these scales into insights. In-situ workflows (and the in-situ approach) leverage storage levels close to the computation in novel ways in order to reduce the required I/O. However, to be effective, it is important that the mapping and execution of such in-situ workflows adopt a data-driven approach, enabling in-situ tasks to be executed flexibly based upon data content. This paper first explores the design space for data-driven in-situ workflows. Specifically, it presents a model that captures different factors that influence the mapping, execution, and performance of data-driven in-situ workflows, and it experimentally studies the impact of different mapping decisions and execution patterns. The paper then presents the design, implementation, and experimental evaluation of a data-driven in-situ workflow execution framework that leverages in-memory distributed data management and user-defined task triggers to enable efficient and scalable in-situ workflow execution.

Qinglei Cao and George Bosilca (University of Tennessee); Wei Wu (Los Alamos National Laboratory); and Dong Zhong, Aurelien Bouteiller, and Jack Dongarra (University of Tennessee)
Abstract: Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore decreasing the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a potential candidate to address the programming complexity on the way to exascale. In addition to increased portability across complex hardware and software systems, task-based runtime systems have the potential to more easily cope with less regular data distributions, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems that supports all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints on data distribution and data size, and implement it in a task-based runtime system, PaRSEC. Performance results show great capability compared to ScaLAPACK, and applications highlight an increased efficiency with little overhead regardless of data distribution and data size.

Thursday 08:00-08:50, Zoom
Keynote K3: Keynote-3
Chair: Franck Cappello (Argonne National Laboratory)

AI for Science
Alok N. Choudhary (Northwestern University)
Biography: Dr. Alok Choudhary is the Dever Professor of Electrical Engineering and Computer Science at Northwestern University. He also teaches at the Kellogg School of Management. He is the founder, chairman and chief scientist of 4C Insights, a big data analytics and marketing technology software company (4C was recently acquired by MediaOcean). He received the National Science Foundation's Young Investigator Award in 1993. He is a fellow of IEEE, ACM and AAAS. He was listed by Adweek among "trailblazers and pioneers in marketing technologies", and he was also awarded "Technology Manager of the Year in Chicago" in 2018. Prof. Choudhary received the first award for "Excellence in Research, Teaching and Service" from the McCormick School of Engineering. His research interests are in high-performance computing, ML/AI and scalable data mining (and their applications in science, medicine and business), and high-performance I/O systems. Alok Choudhary has published more than 400 papers in various journals and conferences and has graduated 45+ PhD students, including more than 10 women PhDs (one of the highest numbers in the world). He serves on the boards of several companies. He serves on a National Academies of Science roundtable to develop and recommend post-secondary data science education strategies. Dr. Choudhary currently serves on the U.S. Secretary of Energy's Advisory Board on Artificial Intelligence.

Abstract: "AI for Science" seeks to understand, explore and develop machine learning and data mining approaches for accelerating scientific discoveries as well as designs. An example of this is learning from data to build predictive models that can enable the exploration of scientific questions without relying upon underlying theory. Given that modern instruments, supercomputing simulations, experiments, sensors and IoT are creating massive amounts of data at an astonishing speed and diversity, AI for Science has the potential to accelerate science discoveries by orders of magnitude. Another example is the acceleration of so-called "inverse problems", which explore the design space based on desired properties. This talk presents examples, possibilities and limitations of "AI for Science".

Thursday 09:10-10:00, Zoom
Paper T5: Programming, System Software and Container
Chair: Alfredo Goldman (USP)

Bogdan Nicolae, Matthieu Dorier, Justin Wozniak, and Franck Cappello (Argonne National Laboratory)
Abstract: Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e., to "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in the quest to improve training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches is making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration.
Compared with state-of-the-art approaches, DeepClone shows orders-of-magnitude improvements for several classes of DNN models.

Jie Ren, Kai Wu, and Dong Li (University of California, Merced)
Abstract: Hardware failures and faults often result in application crashes in HPC. The emergence of non-volatile memory (NVM) provides a solution to address this problem. Leveraging the non-volatility of NVM, one can build in-memory checkpoints or enable crash-consistent data objects. However, these solutions cause large memory consumption, extra writes to NVM, or disruptive changes to applications. We introduce a fundamentally new methodology to handle HPC failures based on NVM. In particular, we attempt to use the remaining data objects in NVM (possibly stale ones, because of data updates lost in caches) to restart crashed applications. To address the challenge of possibly unsuccessful recomputation after the application restarts, we introduce a framework, EasyCrash, that uses a systematic approach to automatically decide how to selectively persist application data objects in order to significantly increase the possibility of successful recomputation. EasyCrash enables up to 30% improvement (20% on average) in system efficiency at various system scales.

Hariharan Devarajan, Anthony Kougkas, Keith Bateman, and Xian-He Sun (Illinois Institute of Technology Chicago)
Abstract: Most parallel programs use irregular control flow and data structures, which are a perfect fit for one-sided communication paradigms such as MPI or PGAS programming languages. However, these environments lack efficient function-based application libraries that can utilize popular communication fabrics such as TCP, IB, and RoCE. Additionally, there is a lack of high-performance data structure interfaces. We present the Hermes Container Library (HCL), a high-performance distributed data structures library that offers high-level abstractions including hash maps, sets, and queues. HCL uses RPC over RDMA technology that implements a novel functional programming paradigm. In this paper, we present the HCL DataBox abstraction, which enables fast data serialization of complex data types, data persistence on flash storage, and inter- and intra-node data access optimization via a hybrid data access model. Evaluation results from testing the performance of a set of HCL data structures show that HCL programs are 2x to 12x faster than BCL, a state-of-the-art distributed data structure library.

Sascha Hunold (TU Wien), Abhinav Bhatele (University of Maryland), George Bosilca (University of Tennessee), and Peter Knees (TU Wien)
Abstract: The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to automatically select the best possible algorithm for a specific case. In this paper, we address the algorithm selection problem for MPI collective communication operations.
To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.

Jesper Larsson Träff and Sascha Hunold (TU Wien)
Abstract: Many modern, high-performance systems increase the cumulated node bandwidth by offering more than a single communication network and/or by having multiple connections to the network, such that a single processor core cannot by itself saturate the off-node bandwidth. Efficient algorithms and implementations for collective operations, as found in, e.g., MPI, must be explicitly designed to exploit such multi-lane capabilities. We are interested in gauging to which extent this might be the case. We systematically decompose the MPI collectives into similar operations that can execute concurrently on, and exploit, multiple network lanes. Our decomposition is applicable to all standard MPI collectives (broadcast, gather, scatter, allgather, reduce, allreduce, reduce-scatter, scan, alltoall), and our implementations' performance can be readily compared to the native collectives of any given MPI library. Contrary to expectation, our full-lane, performance-guideline implementations in many cases show surprising performance improvements with different MPI libraries on a dual-socket, dual-network Intel OmniPath cluster, indicating a large potential for improving the performance of native MPI library implementations. Our full-lane implementations are in many cases large factors faster than the corresponding MPI collectives. We see similar results on a larger, dual-rail Intel InfiniBand cluster. The results indicate considerable room for improvement of the MPI collectives in current MPI libraries, including a more efficient use of multi-lane capabilities.

Jonatan Enes (Universidade da Coruña (UDC), CITIC); Guillaume Fieni (University of Lille, INRIA); Roberto Rey Expósito (Universidade da Coruña (UDC), CITIC); Romain Rouvoy (University of Lille, INRIA); and Juan Touriño (Universidade da Coruña (UDC), CITIC)
Abstract: Energy consumption is currently a major concern for computing systems for many reasons, such as improving the environmental impact and reducing operational costs considering the rising price of energy. Previous works have analyzed how to improve energy efficiency from the entire infrastructure down to individual computing instances (e.g., virtual machines). However, the research is scarcer when it comes to controlling energy consumption, especially in real time and at the software level. This paper presents a platform that manages a power budget to cap energy along several hierarchies, from users to applications and down to individual computing instances. Using software containers as the underlying virtualization technology, the energy limitation is implemented thanks to the platform's ability to monitor container energy consumption and dynamically adjust its CPU resources via vertical scaling as required.
Several representative Big Data applications have been deployed on the proposed platform to prove the feasibility of this power budgeting approach for energy control, showing that it is possible to effectively distribute and enforce a power budget among several users and applications.

Xusheng Zhang, Ziyu Shen, Bin Xia, Zheng Liu, and Yun Li (Nanjing University of Posts and Telecommunications)
Abstract: Virtualization technologies provide solutions for cloud computing. Virtual resource scheduling is a crucial task in data centers, and the power consumption of virtual resources is a critical foundation of virtualization scheduling. Containers are the smallest unit of virtual resource scheduling and migration. Although many effective models for estimating the power consumption of virtual machines (VMs) have been proposed, few power estimation models for containers have been put forth. In this paper, we offer a fast-training piecewise regression model based on decision trees to build a VM power estimation model, and we estimate the power of containers by treating a container as a group of processes on the VM. In our model, we characterize the nonlinear relationship between power and features and realize effective estimation for the containers on the VM. We evaluate the proposed model on 13 workloads in PARSEC and compare it with several other models. The experimental results prove the effectiveness of our proposed model on most workloads. Moreover, the estimated power of the containers is in line with expectations.

Thursday 14:00-14:50, Zoom
Paper T6: HPC Applications
Chair: Hatem Ltaief (KAUST)

Renga Bashyam K G and Sathish Vadhiyar (Indian Institute of Science)
Abstract: K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms to billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on ~8000 cores on a dataset with a billion points in a 128-dimensional space. We also show a 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating the better suitability of our solution for high-dimensional datasets. Our solution shows almost linear strong scaling.

Niclas Jansson (KTH Royal Institute of Technology)
Abstract: Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines.
Key bottlenecks for these methods are sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases, and linear solvers, where efficient overlapping is necessary to amortize the communication and synchronization costs of sparse matrix-vector multiplication and dot products. We present our work on improving the strong scalability limits of message-passing-based general low-order finite element solvers. Using the lightweight one-sided communication offered by partitioned global address space (PGAS) languages, we demonstrate that performance-critical, latency-sensitive sparse matrix assembly can achieve almost an order of magnitude better scalability. Linear solvers are also addressed via a signaling put algorithm for low-cost point-to-point synchronization, achieving performance similar to message-passing-based linear solvers. We introduce a new hybrid MPI+PGAS implementation of the open-source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in Unified Parallel C (UPC). A detailed description of the implementation and the hybrid interface to FEniCS is given, and the feasibility of the approach is demonstrated via a performance study of the hybrid implementation on Cray XC40 machines.

Kevin Sala (Barcelona Supercomputing Center (BSC)), Alejandro Rico (Arm Research), and Vicenç Beltran (Barcelona Supercomputing Center (BSC))
Abstract: Adaptive Mesh Refinement (AMR) is a prevalent method used by distributed-memory simulation applications to adapt the accuracy of their solutions depending on the turbulent conditions in each of their domain regions. These applications are usually dynamic, since their domain areas are refined or coarsened in various refinement stages during their execution. Thus, they periodically redistribute their workloads among processes to avoid load imbalance. Although the de facto standard for scientific computing in distributed environments is MPI, in recent years pure MPI applications are being ported to hybrid ones, attempting to cope with modern multi-core systems. Recently, the Task-Aware MPI library was proposed to efficiently integrate MPI communications and tasking models, also providing transparent management of the communications issued by tasks.

Sihuan Li (UC Riverside); Sheng Di (Argonne National Laboratory); Kai Zhao (UC Riverside); Xin Liang (Oak Ridge National Laboratory); Zizhong Chen (UC Riverside); and Franck Cappello (Argonne National Laboratory)
Abstract: Data reduction techniques have been widely demanded and used by large-scale high-performance computing (HPC) applications because of the vast volumes of data to be produced and stored for post-analysis. Due to the very limited compression ratios of lossless compressors, error-bounded lossy compression has become an indispensable part of many HPC applications nowadays, because it can significantly reduce the science data volume with user-acceptable data distortion. Since large-scale HPC applications equipped with lossy compression techniques always need to deal with vast volumes of data, soft errors or silent data corruptions (SDC) are non-negligible. Although SDC detection techniques have been studied for years, no studies have been performed for HPC applications with lossy compression, leaving a significant gap between these applications and confidence in their execution results.
To fill this gap, this paper proposes a couple of SDC detection strategies for scientific simulations with lossy compression. Experimental results on 4 widely used scientific simulation datasets show that promising detection ability can still be obtained with two popular lossy compressors. Our parallel experiments with up to 1,024 cores confirm that the time overheads can be limited to within 7.9%.

Mohammad Mahdi Javanmard and Zafar Ahmad (Stony Brook University), Jaroslaw Zola (University at Buffalo), Louis-Noël Pouchet (Colorado State University), and Rezaul Chowdhury and Robert Harrison (Stony Brook University)
Abstract: One of the most important properties of distributed computing systems (e.g., Apache Spark, Apache Hadoop, etc.) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster. This important feature can lead to performance gains provided the computation (or the algorithm) itself can scale out. In other words, the computation (or the algorithm) should be easily decomposable into smaller units of work to be distributed among the workers based on the hardware/software configuration of the cluster or the cloud. Additionally, on such clusters, there is an important trade-off between communication cost, parallelism, and memory requirement. Due to the need for scalability as well as this trade-off, it is crucial to have a well-decomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the trade-off spectrum to execute the program efficiently on a specific cluster. We design and implement well-decomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm (GEP), such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multi-way recursive divide-&-conquer algorithms. We explain how to map implementations of those grid-based parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a 2-5x speedup of the DP benchmarks.

Thursday 14:50-15:40, Zoom
Paper T7: IO, Visualization, and Machine Learning
Chair: Kento Sato (RIKEN)

Chan-Jung Chang and Jerry Chou (National Tsing Hua University), Yu-Ching Chou (H3 Platform Inc.), and I-Hsin Chung (IBM T. J. Watson)
Abstract: As data volumes keep increasing at a rapid rate, there is an urgent need for large, reliable and cost-effective storage systems. Erasure coding has drawn increasing attention because of its ability to ensure data reliability with higher storage efficiency, and it has been widely adopted in many distributed and large-scale storage systems, such as Azure cloud storage and HDFS. However, the storage efficiency of erasure codes comes at the price of higher computational complexity. While many studies have shown that the coding computations can be significantly accelerated using GPUs, the overhead of data transfer between storage devices and GPUs becomes a new performance bottleneck. In this work, we designed and implemented ECS2, a fast erasure coding library on GPU-accelerated storage that lets users enhance their data protection with transparent I/O performance and a file-system-like programming interface.
By taking advantage of the latest GPUDirect technology supported on NVIDIA GPUs, our library is able to bypass the CPU and host-memory copies in the IO path, so that both the computing and IO overheads of coding can be minimized. Using synthetic IO workloads based on a real storage-system trace, we show that IO latency can be reduced by 10%-20% with GPUDirect technology, and the overall IO throughput of a storage system can be improved by up to 70%.

Steven W. D. Chien and Artur Podobas (KTH Royal Institute of Technology), Ivy B. Peng (Lawrence Livermore National Laboratory), and Stefano Markidis (KTH Royal Institute of Technology)
Abstract Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large-scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend the TensorFlow Profiler and introduce tf-Darshan, both a profiler and a tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan’s existing implementation. We illustrate tf-Darshan through two case studies: ImageNet image classification and malware classification. We show that, by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast-tier storage. We also show that Darshan has the potential to be used as a runtime library for profiling and for providing information for future optimization.

Jens Huthmann (RIKEN R-CCS), Artur Podobas (Royal Institute of Technology), Lukas Sommer (TU Darmstadt), Andreas Koch (TU Darmstadt), and Kentaro Sano (RIKEN R-CCS)
Abstract The recent maturity of High-Level Synthesis (HLS) has renewed interest in using Field-Programmable Gate Arrays (FPGAs) to accelerate High-Performance Computing (HPC) applications. Today, several studies have shown the performance and power benefits of using FPGAs compared to existing approaches for several HPC applications, with ample room for improvement. Unfortunately, support for tracing and visualizing the performance of applications running on FPGAs is nearly non-existent, and understanding FPGA performance is often left to intuition or expert guesses.

Roba Binyahib (University of Oregon), David Pugmire (Oak Ridge National Laboratory), and Abhishek Yenpure and Hank Childs (University of Oregon)
Abstract There are multiple algorithms for parallelizing particle advection for scientific visualization workloads. While many previous studies have contributed to the understanding of individual algorithms, our study aims to provide a holistic understanding of how the algorithms perform relative to each other on various workloads.
To accomplish this, we consider four popular parallelization algorithms and run a “bake-off” study (i.e., an empirical study) to identify the best matches for each. The study includes 216 tests, going to a concurrency of up to 8192 cores and considering data sets as large as 34 billion cells with 300 million particles. Overall, our study informs three important research questions: (1) which parallelization algorithms perform best for a given workload, (2) why, and (3) what are the unsolved problems in parallel particle advection? In terms of findings, we find that the seeding box is the most important factor in choosing the best algorithm, and also that there is a significant opportunity for improvement in execution time, scalability, and efficiency.

Wei Rang, Donglin Yang, and Dazhao Cheng (UNC Charlotte); Kun Suo (Kennesaw State University); and Wei Chen (Nvidia Corporation)
Abstract Many deep learning applications deployed in dynamic environments change over time, so their training models are supposed to be continuously updated with streaming data in order to guarantee better descriptions of data trends. However, most state-of-the-art learning frameworks support offline training well while omitting online model-updating strategies. In this work, we propose and implement iDlaLayer, a thin middleware layer on top of existing training frameworks that streamlines the support and implementation of online deep learning applications. In pursuit of good model quality as well as fast data incorporation, we design a Data Life Aware model updating strategy (DLA), which builds training data samples according to the contributions of data from different life stages and considers the training cost consumed in model updating. We evaluate iDlaLayer’s performance through both simulations and experiments based on TensorflowOnSpark with three representative online learning workloads. Our experimental results demonstrate that iDlaLayer reduces the overall elapsed time of MNIST, Criteo, and PageRank by 11.3%, 28.2%, and 15.2%, respectively, compared to the periodic update strategy. It further achieves an average 20% decrease in training cost and brings about a 5% improvement in model quality compared to the traditional continuous training method.

Gangzhao Lu and Weizhe Zhang (Harbin Institute of Technology) and Zheng Wang (University of Leeds)
Abstract Convolution is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inference. Existing approaches for accelerating convolution operations aim to reduce computational complexity. However, these strategies often increase the memory footprint with extra memory accesses, thereby leaving much room for performance improvement. This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution. Our approach leverages two optimization techniques to reduce the number of memory operations for convolutions performed on the width and height dimensions. For convolution computations on the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input, reducing the number of memory transactions. For convolution operations on the height dimension, we multiply each overlapped row of the input with multiple rows of a filter to compute multiple output elements, improving the data locality of row elements.
We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over 2x faster performance than state-of-the-art image processing libraries. For multi-channel 2D convolutions, we obtain up to 2x speedups over the fastest cuDNN algorithm.
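To make the width-dimension technique described in this abstract concrete, the following is a minimal, hypothetical CUDA sketch, not the authors' implementation: the kernel name, the filter width K, and the simplified single-channel 1D setting are assumptions for illustration only. Each lane of a warp loads one input element exactly once, and the overlapped columns needed by neighbouring output positions are then obtained from other lanes via `__shfl_down_sync` instead of additional global-memory transactions.

```cuda
// Sketch: warp-level 1D convolution along the width dimension.
// Overlapped input columns are exchanged through register shuffles.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int K    = 5;    // filter width (assumption for this sketch)
constexpr int WARP = 32;   // warp size

__global__ void conv1d_shfl(const float* __restrict__ in,
                            const float* __restrict__ filt,
                            float* __restrict__ out, int width)
{
    int lane   = threadIdx.x % WARP;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    // Each warp reads WARP inputs and produces WARP - (K - 1) outputs.
    int base = warpId * (WARP - (K - 1));
    int gidx = base + lane;

    float x = (gidx < width) ? in[gidx] : 0.0f;   // one global load per lane

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Fetch the element held by lane (lane + k) instead of re-reading memory.
        float neighbour = __shfl_down_sync(0xffffffffu, x, k);
        acc += filt[k] * neighbour;
    }

    // Only lanes holding a full window write an output.
    if (lane < WARP - (K - 1) && gidx + K - 1 < width)
        out[gidx] = acc;
}

int main()
{
    const int width = 1 << 20;
    const int outputsPerWarp = WARP - (K - 1);
    float *in, *filt, *out;
    cudaMallocManaged(&in,   width * sizeof(float));
    cudaMallocManaged(&filt, K * sizeof(float));
    cudaMallocManaged(&out,  width * sizeof(float));
    for (int i = 0; i < width; ++i) in[i] = 1.0f;
    for (int k = 0; k < K; ++k)     filt[k] = 1.0f / K;

    int warpsNeeded = (width + outputsPerWarp - 1) / outputsPerWarp;
    int threads     = 256;   // multiple of the warp size
    int blocks      = (warpsNeeded * WARP + threads - 1) / threads;
    conv1d_shfl<<<blocks, threads>>>(in, filt, out, width);
    cudaDeviceSynchronize();

    printf("out[0] = %f (expected 1.0)\n", out[0]);
    cudaFree(in); cudaFree(filt); cudaFree(out);
    return 0;
}
```

In this simplified form, each warp tile needs only its 32 one-per-lane input loads plus the small filter; the K-1 overlapped columns per output come from shuffles, which is the memory-transaction saving the abstract refers to. The paper's actual kernels target 2D and multi-channel 2D convolutions and add a separate height-dimension optimization not shown here.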
Wednesday 08:00-08:50 Zoom Paper T3: Architecture and Network Support for HPC Workloads Chair: Philippe Olivier Alexandre Navaux (Federal University of Rio Grande do Sul, UFRGS)