Monday 9:00am-6:00pm Ambassador/Registry Workshop First Extreme-scale Scientific Software Stack Forum (E4S Forum) Tuesday 8:45am-9:00am Ambassador/Registry Plenary Cluster 2019 Opening Chair: Ron Brightwell (Sandia National Laboratories); Patrick G. Bridges (University of New Mexico); Martin Schulz (Technical University of Munich); Patrick McCormick (Los Alamos National Laboratory) Tuesday 9:00am-10:00am Ambassador/Registry Keynote Keynote 1 Chair: Martin Schulz (Technical University of Munich) RAPIDS: Open Source Python Data Science with GPU Acceleration and Dask (Joe Eaton) Joe Eaton (Nvidia) Biography Biography Joe Eaton (Nvidia) Joe Eaton is the Principal System Engineer for Data and Graph Analytics at NVIDIA. He has spent the last 6 years at NVIDIA working on applications of sparse linear algebra, including the CUDA libraries cuSOLVER, cuSPARSE, nvGRAPH, and AmgX. He now works entirely on RAPIDS, developing Python APIs, cuML, and cuGRAPH. RAPIDS is an end-to-end platform for data science, including IO, ETL, model training, inference, and visualization. Previously he spent 18 years in Oil & Gas reservoir simulation. He is a frequent speaker at SC and GTC, and directly interfaces with engineers and mathematicians across industries. Abstract Abstract Getting large-scale data science done is a tricky business. There are multiple platforms, the data science algorithms are developed in Python but sometimes deployed in Java, and managing the whole workflow is complicated. RAPIDS is a recent open-source package that tries to bring Python's ease of use to large-scale data science. This talk will present RAPIDS, how it can be used to accelerate workflows in Jupyter notebooks based on Pandas and Scikit-Learn, and how a single notebook can scale to cluster deployment with minimal changes. Tuesday 10:00am-10:30am Ambassador/Registry Paper Top 3 Papers of Cluster 2019 (1/3) Chair: Martin Schulz (Technical University of Munich) Evaluating Burst Buffer Placement in HPC Systems Best Paper Harsh Khetawat (NCSU), Christopher Zimmer (ORNL), Frank Mueller (NCSU), Scott Atchley and Sudharshan Vazhkudai (ORNL), and Misbah Mubarak (ANL) Abstract Abstract Burst buffers are increasingly exploited in contemporary supercomputers to bridge the performance gap between compute and storage systems. The design of burst buffers, particularly the placement of these devices and the underlying network topology, impacts both performance and cost. As the cost of other components such as memory and accelerators is increasing, it is becoming more important that HPC centers provision burst buffers tailored to their workloads. This work contributes a provisioning system to provide accurate, multi-tenant simulations that model realistic application and storage workloads from HPC systems. The framework aids HPC centers in rapidly modeling their workloads against multiple network and burst buffer configurations. In experiments with our framework, we provide a comparison of representative OLCF I/O workloads against multiple burst buffer designs. We analyze the impact of these designs on latency, I/O phase lengths, contention for network and storage devices, and choice of network topology. 
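As a rough illustration of why burst buffer placement affects I/O phase length, the following back-of-envelope model compares a node-local placement with a shared pool; it is not the paper's simulation framework, and the bandwidths, data sizes, and the two placements modeled are illustrative assumptions only.

```python
# Back-of-envelope model of one checkpoint I/O phase under two burst buffer
# placements. Illustrative only: bandwidths and data sizes are assumptions,
# not measurements or designs from the paper.

def io_phase_seconds(gib_per_node, writers, local_bw_gibs, pool_bw_gibs):
    """Return (node_local_time, shared_pool_time) in seconds."""
    # Node-local placement: each writer streams to its own device in parallel.
    node_local = gib_per_node / local_bw_gibs
    # Shared pool placement: all writers contend for one aggregate pool.
    shared_pool = (gib_per_node * writers) / pool_bw_gibs
    return node_local, shared_pool

if __name__ == "__main__":
    local_t, pool_t = io_phase_seconds(gib_per_node=32, writers=1024,
                                       local_bw_gibs=2.0, pool_bw_gibs=1600.0)
    print(f"node-local: {local_t:.1f}s, shared pool: {pool_t:.1f}s")
```

A real provisioning study of the kind described above would, of course, also account for network topology, contention, and multi-tenant interference, which this toy model ignores.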
Tuesday 11:00am-12:30pm Ambassador/Registry Paper Deep Learning 1 Chair: Judy Qiu (Indiana University) Yue Zhu, Weikuan Yu, and Bing Jiao (Florida State University); Kathryn Mohror and Adam Moody (Lawrence Livermore National Laboratory); and Fahim Chowdhury (Florida State University) Abstract Abstract On large-scale High-Performance Computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary considerably during execution. This problem is particularly pronounced for deep learning applications that are running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a special Deep Learning File System (DLFS) that provides a thin set of APIs. Particularly, we design the metadata management of DLFS through an in-memory tree-based sample directory and its file services through the user-level SPDK protocol that can disaggregate the capabilities of NVM Express (NVMe) devices to parallel training tasks. Our experimental results show that DLFS can dramatically improve the throughput of training for deep neural networks on NVMe over Fabric, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization. FluentPS: A Parameter Server Design with Low-frequency Synchronization for Distributed Deep Learning Xin Yao, Xueyu Wu, and Cho-Li Wang (HKU) Abstract Abstract In pursuit of high accuracy on big datasets, current research prefers designing complex neural networks, which need to maximize data parallelism for short training time. Many distributed deep learning systems, such as MXNet and Petuum, widely use the parameter server framework with relaxed synchronization models. Although these models reduce the cost of each synchronization, the synchronization frequency is still high among many workers, e.g., due to the soft barrier introduced by the Stale Synchronous Parallel (SSP) model. In this paper, we introduce our parameter server design, namely FluentPS, which can reduce frequent synchronization and optimize communication overhead in a large-scale cluster. Different from using a single scheduler to manage all parameters' synchronization in some previous designs, our system allows each server to independently adjust schemes for synchronizing its own parameter shard and overlaps the push and pull processes of different servers. We also explore two methods to improve the SSP model: (1) lazy execution of buffered pull requests to reduce the synchronization frequency and (2) a probability-based strategy to pause fast workers under the SSP condition, which avoids unnecessary waiting by fast workers. We evaluate ResNet-56 with the same large batch size at different cluster scales. While guaranteeing robust convergence, FluentPS gains up to a 6x speedup and reduces communication costs by 93.7% compared to PS-Lite. The raw SSP model causes up to 131x more delayed pull requests than our improved synchronization model, which can provide fine-tuned staleness control and achieve higher accuracy. Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. 
Panda (The Ohio State University) Abstract Abstract The recent surge of Deep Learning (DL) models and applications can be attributed to the rise in computational resources, availability of large-scale datasets, and accessible DL frameworks such as TensorFlow and PyTorch. Because these frameworks have been heavily optimized for NVIDIA GPUs, several performance characterization studies exist for GPU-based Deep Neural Network (DNN) training. However, there exist very few research studies that focus on CPU-based DNN training. In this paper, we provide an in-depth performance characterization of state-of-the-art DNNs such as ResNet(s) and Inception-v3/v4 on multiple CPU architectures including Intel Xeon Broadwell, three variants of the Intel Xeon Skylake, AMD EPYC, and NVIDIA GPUs like K80, P100, and V100. We provide three key insights: 1) Multi-process (MP) training should be used even for a single-node, because the single-process (SP) approach cannot fully exploit all the cores, 2) Performance of both SP and MP depend on various features such as the number of cores, the processes per node (ppn), and DNN architecture, and 3) There is a non-linear and complex relationship between CPU/system characteristics (core-count, ppn, hyper-threading, etc) and DNN specifications such as inherent parallelism between layers. We further provide a comparative analysis for CPU and GPU-based training and profiling analysis for Horovod. The fastest Skylake we had access to is up to 2.35x better than a K80 GPU but up to 3.32x slower than a V100 GPU. For ResNet-152 training, we observed that MP is up to 1.47x faster than SP and achieves 125x speedup on 128 Skylake nodes. Tuesday 2:00pm-3:30pm Ambassador/Registry Paper Deep Learning 2 Chair: Hatem Ltaief (KAUST) Jingoo Han (Virginia Tech), Luna Xu (IBM Research), M. Mustafa Rafique (Rochester Institute of Technology), Ali R. Butt (Virginia Tech), and Seung-Hwan Lim (Oak Ridge National Laboratory) Abstract Abstract Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed for meeting the exponentially growing demand for DL. Multiple GPUs and high-speed interconnect network are needed for supporting DL on HPC systems. However, the excessive use of GPUs without considering effective benefits leads to inefficient resource utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and identify viability of next-generation DL-optimized heterogeneous supercomputers for enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resource, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows varying performance trend as compared to the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs. 
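The scaling figures quoted in the two abstracts above (for example, a 125x speedup on 128 Skylake nodes) reduce to the usual speedup and parallel-efficiency arithmetic; a minimal worked example follows. The helper functions are generic illustrations, not the authors' code; only the 125x-on-128-nodes figure is taken from the abstract above.

```python
# Speedup and parallel efficiency. The 125x-on-128-nodes figure comes from the
# CPU characterization abstract above; the helpers themselves are generic.

def speedup(t_baseline, t_scaled):
    """Ratio of baseline time to scaled-out time."""
    return t_baseline / t_scaled

def parallel_efficiency(observed_speedup, scale):
    """scale = ratio of resources used to the baseline's resources."""
    return observed_speedup / scale

if __name__ == "__main__":
    print(f"{parallel_efficiency(125.0, 128):.1%}")   # ~97.7% efficiency
```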
Sam Ade Jacobs, Brian Van Essen, Tim Moon, Jae Seung Yeom, David Hysom, Brian Spears, Rushil Anirudh, Jayaraman Thiagaranjan, Shusen Liu, Jim Gaffney, Peer-Timo Bremer, Tom Benson, Peter Robinson, and Luc Peterson (Lawrence Livermore National Lab) Abstract Abstract Training deep neural networks on large scientific data is a challenging task that requires enormous compute power, especially if no pre-trained models exist to initialize the process. We present a novel tournament method to train traditional as well as generative adversarial networks built on LBANN, a scalable deep learning framework optimized for HPC systems. LBANN combines multiple levels of parallelism and exploits some of the world's largest supercomputers. We demonstrate our framework by creating a complex predictive model based on multi-variate data from high-energy-density physics containing hundreds of millions of images and hundreds of millions of scalar values derived from tens of millions of simulations of inertial confinement fusion. Our approach combines an HPC workflow and extends LBANN with optimized data ingestion and the new tournament-style training algorithm to produce a scalable neural network architecture using a CORAL-class supercomputer. Experimental results show that 64 trainers (1024 GPUs) achieve a speedup of 70.2× over a single trainer (16 GPUs) baseline, and an effective 109% parallel efficiency. Zhao Zhang, Lei Huang, Ruizhu Huang, and Weijia Xu (TACC) and Daniel S. Katz (University of Illinois) Abstract Abstract The use of deep learning (DL) on HPC resources has become common as scientists explore and exploit DL methods to solve domain problems. On the other hand, in the coming exascale computing era, a high error rate is expected to be problematic for most HPC applications. The impact of errors on DL applications, especially DL training, remains unclear given their stochastic nature. In this paper, we focus on understanding DL training applications on HPC systems in the presence of silent data corruption. Specifically, we design and perform a quantification study with three representative applications by manually injecting silent data corruption errors (SDCs) across the design space and compare training results with the error-free baseline. The results show that only 0.61-1.76% of SDCs cause training failures, and taking the SDC rate in modern hardware into account, the actual chance of a failure is one in thousands to millions of executions. With this quantitatively measured impact, computing centers can make rational design decisions based on their application portfolio, the acceptable failure rate, and financial constraints; for example, they might determine their confidence in the correctness of training runs performed on processors without error correction code (ECC) RAM. We also discover that 75-90% of the SDCs that cause catastrophic errors can be easily detected from the training loss in the next iteration. Thus we propose an error-aware software solution to correct catastrophic errors, as it has significantly lower time and space overhead compared to algorithm-based fault tolerance (ABFT) and ECC. Tuesday 4:00pm-5:00pm Ambassador/Registry Paper Machine Learning Chair: Ann Gentile (Sandia National Laboratories) Bo Peng and Judy Qiu (Indiana University) Abstract Abstract Gradient Boosting Decision Tree (GBDT) is a widely used machine learning algorithm, whose training involves both irregular computation and random memory access and is challenging for system optimizations. 
In this paper, we conduct a comprehensive performance analysis of two state-of-the-art systems, XGBoost and LightGBM. They represent two typical parallel implementations for GBDT; one is data parallel and the other one is parallel over features. Substantial thread synchronization overhead, as well as the inefficiency of random memory access, is identified. We propose HarpGBDT, a new GBDT system designed from the perspective of parallel efficiency optimization. Firstly, we adopt a new tree growth method that selects the top K candidates of tree nodes to enable the use of more levels of parallelism without sacrificing the algorithm's accuracy. Secondly, we organize the training data and model data in blocks and propose a block-wise approach as a general model that enables the exploration of various parallelism options. Thirdly, we propose a mixed mode to utilize the advantages of a different mode of parallelism in different phases of training. By changing the configuration of the block size and parallel mode, HarpGBDT is able to attain better parallel efficiency. By extensive experiments on four datasets with different statistical characteristics on the Intel(R) Xeon(R) E5-2699 server, HarpGBDT on average performs 8x faster than XGBoost and 2.6x faster than LightGBM. Dhiraj Kalamkar, Kunal Banerjee, Sudarshan Srinivasan, Srinivas Sridharan, Evangelos Georganas, Mikhail Smorkalov, Cong Xu, and Alexander Heinecke (Intel) Abstract Abstract Google's neural machine translation (GNMT) is state-of-the-art recurrent neural network (RNN/LSTM) based language translation application. It is computationally more demanding than well-studied convolutional neural networks (CNNs). Also, in contrast to CNNs, RNNs heavily mix compute and memory bound layers which requires careful tuning on a latency machine to optimally use fast on-die memories for best single processor performance. Additionally, due to massive compute demand, it is essential to distribute the entire workload among several processors and even compute nodes. To the best of our knowledge, this is the first work which attempts to scale this application on an Intel CPU cluster. Our CPU-based GNMT optimization, the first of its kind, achieves this by the following steps: (i) we choose a monolithic long short-term memory (LSTM) cell implementation from LIBXSMM library (specifically tuned for CPUs) and integrate it into TensorFlow, (ii) we modify GNMT code to use fused time step LSTM op for the encoding stage, (iii) we combine Horovod and Intel MLSL scaling libraries for improved performance on multiple nodes, and (iv) we extend the bucketing logic for grouping similar length sentences together to multiple nodes for achieving load balance across multiple ranks. In summary, we demonstrate that due to these changes we are able to outperform Google's stock CPU-based GNMT implementation by ~2x on single node and potentially enable more than 25x speedup using 16 node CPU cluster. Wednesday 8:45am-9:00am Ambassador/Registry Plenary Announcements Chair: Ron Brightwell (Sandia National Laboratories); Patrick G. Bridges (University of New Mexico) Wednesday 9:00am-10:00am Ambassador/Registry Keynote Keynote 2 Chair: Patrick McCormick (Los Alamos National Laboratory) AI@Edge (Pete Beckman) Pete Beckman (Argonne National Laboratory) Biography Biography Pete Beckman (Argonne National Laboratory) Pete Beckman is the co-director of the Northwestern University/Argonne Institute for Science and Engineering. 
During the past 25 years, his research has been focused on software and architectures for large-scale parallel and distributed computing systems. For the DOE's Exascale Computing Project, Beckman leads the Argo project focused on low-level resource management for the operating system and runtime. He is the founder and leader of the Waggle project for smart sensors and edge computing that is used by the Array of Things project. Beckman also coordinates the collaborative technical research activities in extreme-scale computing between the US Department of Energy and Japan's Ministry of Education, Science, and Technology and helps lead the BDEC (Big Data and Extreme Computing) series of international workshops. Beckman received his PhD in computer science from Indiana University. Abstract Abstract ----- Wednesday 10:00am-10:30am Ambassador/Registry Paper Top 3 Papers of Cluster 2019 (2/3) Chair: Patrick McCormick (Los Alamos National Laboratory) Algorithm-Based Fault Tolerance for Parallel Stencil Computations Best Paper Aurélien Cavelan and Florina M. Ciorba (University of Basel, Switzerland) Abstract Abstract The increase in HPC system size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, renders modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are typically caused by such soft errors in the form of bit-flips in the memory subsystem and hinder the correctness of scientific applications. This work addresses the problem of protecting a class of iterative computational kernels, called stencils, against SDCs when executing on parallel HPC systems. Existing SDC detection and correction methods are in general either inaccurate, inefficient, or target specific application classes that do not include stencils. This work proposes a novel algorithm-based fault tolerance (ABFT) method to protect scientific applications that contain arbitrary stencil computations against SDCs. The ABFT method can be applied both online and offline to accurately detect and correct SDCs in 2D and 3D parallel stencil computations. We present a formal model for the proposed method including theorems and proofs for the computation of the associated checksums as well as error detection and correction. We experimentally evaluate the use of the proposed ABFT method on a real 3D stencil-based application (HotSpot3D) via a fault-injection, detection, and correction campaign. Results show that the proposed ABFT method achieves less than 8% overhead compared to the performance of the unprotected stencil application. Moreover, it accurately detects and corrects SDCs. While the offline ABFT version corrects errors more accurately, it may incur a small additional overhead compared to its online counterpart. Wednesday 11:00am-12:30pm Ambassador/Registry Paper Message Passing Chair: Felix Wolf (TU Darmstadt) Nathan Hjelm (Google Inc.); Howard Pritchard and Samuel K. Gutiérrez (Los Alamos National Laboratory); Daniel J. Holmes (EPCC, The University of Edinburgh); Ralph Castain (Intel); and Anthony Skjellum (University of Tennessee at Chattanooga) Abstract Abstract The recently proposed MPI Sessions extensions to the MPI standard present a new paradigm for applications to use with MPI. 
MPI Sessions has the potential to address several limitations of MPI's current specification: MPI cannot be initialized within an MPI process from different application components without a priori knowledge or coordination; MPI cannot be initialized more than once; and MPI cannot be reinitialized after MPI finalization. MPI Sessions also offers the possibility for more flexible ways for individual components of an application to express the capabilities they require from MPI at a finer granularity than is presently possible. Thananon Patinyasakdikul, David Eberius, and George Bosilca (University of Tennessee) and Nathan Hjelm (University of New Mexico) Abstract Abstract The Message Passing Interface (MPI) has been one of the most prominent programming paradigms in high-performance computing (HPC) for the past decade. Lately, with changes in modern hardware leading to a drastic increase in the number of processor cores, developers of parallel applications are moving toward more integrated parallel programming paradigms, where MPI is used along with other, possibly node-level, programming paradigms, or MPI+X. MPI+threads has emerged as one of the favorite choices, according to a survey of the HPC community. However, threading support in MPI comes with many compromises to the overall performance delivered, and, therefore, its adoption suffers. Tom Cornebize (Université Grenoble Alpes, French Institute for Research in Computer Science and Automation (INRIA)); Arnaud Legrand (National Center for Scientific Research (CNRS), French Institute for Research in Computer Science and Automation (INRIA)); and Franz Christian Heinrich (French Institute for Research in Computer Science and Automation (INRIA)) Abstract Abstract Finely tuning MPI applications (number of processes, granularity, collective operation algorithms, topology and process placement) is critical to obtain good performance on supercomputers. With the rising cost of modern supercomputers, running parallel applications at scale solely to optimize their performance is extremely expensive. Having inexpensive but faithful predictions of expected performance could be a great help for researchers and system administrators. The methodology we propose captures the complexity of adaptive applications by emulating the MPI code while skipping insignificant parts. We demonstrate its capability with High Performance Linpack (HPL), the benchmark used to rank supercomputers in the TOP500 and which requires careful tuning. We explain (1) how we both extended SimGrid's SMPI simulator and slightly modified the open-source version of HPL to allow a fast emulation on a single commodity server at the scale of a supercomputer and (2) how to model the different components (network, BLAS, ...) of the system. We show that a careful modeling of both spatial and temporal node variability allows us to obtain predictions within a few percent of real experiments. Wednesday 2:00pm-3:30pm Ambassador/Registry Paper Cluster Communication Chair: Scott Levy (Sandia National Laboratories) Teng Ma (Tsinghua University, Alibaba); Tao Ma, Zhuo Song, Jingxuan Li, and Huaixin Chang (Alibaba); Kang Chen (Tsinghua University); Hai Jiang (Arkansas State University); and Yongwei Wu (Tsinghua University) Abstract Abstract X-RDMA is a communication middleware deployed and heavily used in Alibaba’s large-scale cluster hosting cloud storage and database systems. 
Unlike recent research projects which purely focus on squeezing out the raw hardware performance, it puts emphasis on robustness, scalability, and maintainability of large-scale production clusters. X-RDMA integrates necessary features, not available in current RDMA ecosystem, to release the developers from complex and imperfect details. X-RDMA simplifies the programming model, extends RDMA protocols for application awareness, and proposes mechanisms for resource management with thousands of connections per machine. It also reduces the work for administration and performance tuning with built-in tracing, tuning and monitoring tools. Ayesha Afzal, Georg Hager, and Gerhard Wellein (Friedrich-Alexander University Erlangen-Nürnberg) Abstract Abstract Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") run contrary to the assumptions about regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long, one-off delays of execution periods that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of "idle waves," i.e., propagating phases of inactivity, emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications. Abu Naser, Mohsen Gavahi, Cong Wu, Viet Tung Hoang, Zhi Wang, and Xin Yuan (Florida State University) Abstract Abstract As High Performance Computing (HPC) applications with data security requirements are increasingly moving to execute in the public cloud, there is a demand that the cloud infrastructure for HPC should support privacy and integrity. Incorporating privacy and integrity mechanisms in the communication infrastructure of today’s public cloud is challenging because recent advances in the networking infrastructure in data centers have shifted the communication bottleneck from the network links to the network end points and because encryption is computationally intensive. Wednesday 4:00pm-5:30pm Ambassador/Registry Paper Resource Allocation Chair: Eishi Arima (The University of Tokyo) Qingxiao Sun, Yi Liu, Hailong Yang, Zhongzhi Luan, and Depei Qian (Beihang University) Abstract Abstract Meeting the Quality of Service (QoS) requirement under task consolidation on the GPU is extremely challenging. Previous work mostly relies on static task or resource scheduling and cannot handle the QoS violation during runtime. In addition, the existing work fails to exploit the computing characteristics of batch tasks, and thus wastes the opportunities to reduce power consumption while improving GPU utilization. 
To address the above problems, we propose a new runtime mechanism SMQoS that can dynamically adjust the resource allocation during runtime to satisfy the QoS of latency-sensitive tasks and determine the optimal resource allocation for batch tasks to improve GPU utilization and power efficiency. The experimental results show that with SMQoS, 2.27% and 7.58% more task co-runnings reach the 95% QoS target than Spart and Rollover respectively. In addition, SMQoS achieves 23.9% and 32.3% higher throughput, and reduces the power consumption by 25.7% and 10.1%, compared to Spart and Rollover respectively. Lee Savoie and David Lowenthal (University of Arizona), Bronis de Supinski and Kathryn Mohror (Lawrence Livermore National Laboratory), and Nikhil Jain (Nvidia) Abstract Abstract Jobs on most high-performance computing (HPC) systems share the network with other concurrently executing jobs. This sharing creates contention that can severely degrade performance. We investigate the use of Quality of Service (QoS) mechanisms to reduce the negative impacts of network contention. Our results show that careful use of QoS reduces the impact of contention for specific jobs, resulting in up to a 27% performance improvement. In some cases the impact of contention is completely eliminated. These improvements are achieved with limited negative impact to other jobs; any job that experiences performance loss typically degrades less than 5%, often much less. Our approach can help ensure that HPC machines maintain high throughput as per-node compute power continues to increase faster than network bandwidth. Prashanth Thinakaran and Jashwant Raj Gunasekaran (Penn State), Bikash Sharma (Facebook), and Mahmut Kandemir and Chita Das (Penn State) Abstract Abstract Compute heterogeneity is increasingly gaining prominence in modern datacenters due to the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers are agnostic of these emerging accelerators, especially their resource utilization footprints, and thus, not well equipped to dynamically provision them based on the application needs. We observe that the state-of-the-art datacenter schedulers fail to provide fine-grained resource guarantees for latency-sensitive tasks that are GPU-bound. Specifically for GPUs, this results in resource fragmentation and interference leading to poor utilization of allocated resources. Furthermore, GPUs exhibit highly linear energy efficiency with respect to utilization and hence proactive management of these resources is essential to keep the operational costs low while ensuring the end-to-end Quality of Service (QoS). Yiqin Gao (ENS Lyon); Louis-Claude Canon (Univ. Franche Comté); Yves Robert (ENS Lyon, Univ. Tenn. Knoxville); and Frédéric Vivien (Inria) Abstract Abstract This work introduces scheduling strategies to maximize the expected number of independent tasks that can be executed on a cloud platform with budget and deadline constraints. The cloud platform is composed of several types of virtual machines (VMs), where each type has a unit execution cost that depends upon its characteristics. The amount of budget spent during the execution of a task on a given VM is the product of its execution length by the unit execution cost of that VM. The execution lengths of tasks follow a variety of standard probability distributions (exponential, uniform, half-normal, lognormal, gamma, inverse-gamma and Weibull) whose mean and standard deviation both depend upon the VM type. 
Finally, there is a globally available budget and a deadline constraint, and the goal is to successfully execute as many tasks as possible before the deadline is reached or the budget is exhausted (whichever comes first). On each VM, the scheduler can decide at any instant to interrupt the execution of a (long) running task and to launch a new one, but the budget already spent for the interrupted task is lost. The main questions are which VMs to enroll, and whether and when to interrupt tasks that have been executing for some time. We assess the complexity of the problem by showing its NP-completeness and providing a 2-approximation for the asymptotic case where budget and deadline both tend to infinity. Then we introduce several heuristics and compare their performance by running an extensive set of simulations. Thursday 9:00am-10:00am Ambassador/Registry Keynote Keynote 3 Chair: Patrick G. Bridges (University of New Mexico) Big Data Spatiotemporal Analytics - Trends, Characteristics and Applications (Sangmi Lee Pallickara) Sangmi Pallickara (Colorado State University) Biography Biography Sangmi Pallickara (Colorado State University) Sangmi Lee Pallickara is an Associate Professor of Computer Science and a Cochran Family Professor at Colorado State University. Her research interests are in the area of Big Data for the sciences with an emphasis on issues related to predictive analytics, storage, retrievals, and metadata management. She has published over 60 peer-reviewed articles. She serves on the editorial board of the Journal of Big Data. Her research has been funded through grants from the National Science Foundation, the Advanced Research Projects Agency-Energy (Department of Energy), the Department of Homeland Security, the Environmental Defense Fund, Google, Amazon and Hewlett Packard. She is a recipient of the CAREER award from the U.S. National Science Foundation and the IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher). Abstract Abstract We are living in the big data era. Scientists and businesses analyze large-scale datasets for insights leading to better decisions and strategic moves. With the emergence of geo-sensors and the IoT, we are faced with voluminous geotagged data. Indeed, more than 80% of large datasets are estimated to be geospatial. In this talk, I discuss the unique characteristics and challenges in large-scale geospatial analytics while focusing on the distinctive patterns in data collection, access, and computation. I will also discuss our methods for data retrieval and analytics developed to cope with these challenges at scale. Finally, I will discuss a couple of scientific applications leveraging our methodology in geo-science domains including ecology and agriculture. Thursday 10:00am-10:30am Ambassador/Registry Paper Top 3 Papers of Cluster 2019 (3/3) Chair: Patrick G. Bridges (University of New Mexico) STASH: Fast Hierarchical Aggregation Queries for Effective Visual Spatiotemporal Explorations Best Paper Saptashwa Mitra, Paahuni Khandelwal, Shrideep Pallickara, and Sangmi Lee Pallickara (Colorado State University) Abstract Abstract The proliferation of spatiotemporal datasets has led to a demand for scalable real-time analytics over these datasets to help scientists make inferences and inform decision-making. However, the data is voluminous and, combined with incessant user queries, adversely impacts the latency of visualization queries over such datasets stored on large clusters. 
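The STASH abstract above motivates hierarchical aggregation: serving wide-area visualization queries from precomputed coarse-level aggregates rather than raw points. The sketch below illustrates that general idea with a simple multi-level grid of (sum, count) aggregates; the tiling and aggregation scheme is a generic assumption for illustration, not STASH's actual design.

```python
# Hierarchical spatial aggregation: precompute sums/counts at coarser zoom
# levels so that wide-area visualization queries touch few cells. The tiling
# and aggregation scheme here is a generic illustration, not STASH's design.
from collections import defaultdict

def build_pyramid(points, levels, cell0=1.0):
    """points: list of (x, y, value); returns one aggregate grid per level."""
    pyramid = []
    for lvl in range(levels):
        cell = cell0 * (2 ** lvl)             # cells double in size per level
        grid = defaultdict(lambda: [0.0, 0])  # cell -> [sum, count]
        for x, y, v in points:
            key = (int(x // cell), int(y // cell))
            agg = grid[key]
            agg[0] += v
            agg[1] += 1
        pyramid.append(grid)
    return pyramid

def query_mean(pyramid, level, cell_key):
    """Mean value inside one cell at the given level, or None if empty."""
    s, n = pyramid[level].get(cell_key, (0.0, 0))
    return s / n if n else None

if __name__ == "__main__":
    pts = [(0.2, 0.7, 3.0), (1.4, 0.1, 5.0), (3.9, 3.2, 1.0)]
    pyr = build_pyramid(pts, levels=3)
    print(query_mean(pyr, level=2, cell_key=(0, 0)))  # mean over one 4x4 cell
```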
Thursday 11:00am-12:30pm Ambassador/Registry Paper Tools and Optimization Chair: Bronis R. de Supinski (Lawrence Livermore National Laboratory) Saeed Taheri and Ian Briggs (University of Utah), Martin Burtscher (Texas State University), and Ganesh Gopalakrishnan (University of Utah) Abstract Abstract We present a tool called DiffTrace that approaches debugging via whole program tracing and diffing of typical and erroneous traces. After collecting these traces, a user-configurable front-end filters out irrelevant function calls and then summarizes loops in the retained function calls based on state-of-the-art loop extraction algorithms. Information about these loops is inserted into concept lattices, which we use to compute salient dissimilarities to narrow down bugs. DiffTrace is a clean start that addresses debugging features missing in existing approaches. Our experiments on an MPI/OpenMP program called ILCS and initial measurements on LULESH, a DOE miniapp, demonstrate the advantages of the proposed debugging approach. Arnab K. Paul (Virginia Tech); Ryan Chard (Argonne National Laboratory); Kyle Chard and Steven Tuecke (University of Chicago); Ali R. Butt (Virginia Tech); and Ian Foster (Argonne National Laboratory, University of Chicago) Abstract Abstract Data automation, monitoring, and management tools are reliant on being able to detect, report, and respond to file system events. Various data event reporting tools exist for specific operating systems and storage devices, such as inotify for Linux, kqueue for BSD, and FSEvents for macOS. However, these tools are not designed to monitor distributed file systems. Indeed, many cannot scale to monitor many thousands of directories, or simply cannot be applied to distributed file systems. Moreover, each tool implements a custom API and event representation, making the development of generalized and portable event-based applications challenging. As file systems grow in size and become increasingly diverse, there is a need for scalable monitoring solutions that can be applied to a wide range of both distributed and local systems. We present here a generic and scalable file system monitor and event reporting tool, FSMonitor, that provides a file-system-independent event representation and event capture interface. FSMonitor uses a modular Data Storage Interface (DSI) architecture to enable the selection and application of appropriate event monitoring tools to detect and report events from a target file system, and implements efficient and fault-tolerant mechanisms that can detect and report events even on large file systems. We describe and evaluate DSIs for common UNIX, macOS, and Windows storage systems, and for the Lustre distributed file system. Our experiments on an 897 TB Lustre file system show that FSMonitor can capture and process almost 38,000 events per second. On the Benefits of Anticipating Load Imbalance for Performance Optimization of Parallel Applications Anthony Boulmier (University of Geneva), Franck Raynaud (University of Geneva), Nabil Abdennadher (HES-SO), and Bastien Chopard (University of Geneva) Abstract Abstract In parallel iterative applications, computational efficiency is essential for addressing large problems. Load imbalance is one of the major performance degradation factors of parallel applications. Therefore, cleverly distributing the workload as evenly as possible among processing elements (PEs) maximizes application performance. 
So far, the standard load balancing method consists of distributing the workload evenly among PEs and, when load imbalance appears, redistributing the extra load from overloaded PEs to underloaded PEs. However, this does not anticipate the load imbalance growth that may continue during the next iterations. In this paper, we present a first step toward a novel philosophy of load balancing that unloads the PEs that will be overloaded in the near future to let the application rebalance itself via its own dynamics. Herein, we present a formal definition of our new approach using a simple mathematical model and discuss its advantages compared to the standard load balancing method. In addition to the theoretical study, we apply our method to an application that reproduces the computation of a fluid model with non-uniform erosion. The performance results validate the benefit of anticipating load imbalance. We observed up to a 16% performance improvement compared to the standard load balancing method. Monday 9:00am-3:30pm Regal Workshop Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications Tuesday 11:00am-12:30pm Regal Paper Parallel Applications Using Alternate Models Chair: David Boehme (Lawrence Livermore National Laboratory) Dalal Sukkari (KAUST), Mathieu Faverge (INRIA), and Hatem Ltaief and David Keyes (KAUST) Abstract Abstract This paper describes how to leverage a task-based implementation of the polar decomposition on massively parallel systems using the PARSEC dynamic runtime system. Based on a formulation of the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, our novel implementation reduces data traffic while exploiting high concurrency from the underlying hardware architecture. First, we replace the most time-consuming classical QR factorization phase with a new hierarchical variant, customized for the specific structure of the matrix during the QDWH iterations. The newly developed hierarchical QR for QDWH not only exploits the matrix structure, but also shortens the length of the critical path to maximize hardware occupancy. We then deploy PARSEC to seamlessly orchestrate, pipeline, and track the data dependencies of the various linear algebra building blocks involved during the iterative QDWH algorithm. PARSEC enables overlapping communications with computations thanks to its asynchronous scheduling of fine-grained computational tasks. It employs look-ahead techniques to further expose parallelism, while actively pursuing the critical path. In addition, we identify synergistic opportunities between the task-based QDWH algorithm and the PARSEC framework. We exploit them during the hierarchical QR factorization to enforce locality-aware task execution. The latter feature minimizes the expensive inter-node communication, which represents one of the main bottlenecks for scaling up applications on challenging distributed-memory systems. We report numerical accuracy and performance results using well and ill-conditioned matrices. The benchmarking campaign reveals up to a 2X performance speedup over the existing state-of-the-art implementation for the polar decomposition on 36,864 cores. Roger Kowalewski, Pascal Jungblut, and Karl Fuerlinger (LMU Munich) Abstract Abstract Sorting is one of the most critical non-numerical algorithms and covers use cases in a wide spectrum of scientific applications. 
Although we can build upon excellent research over the last decades, scaling to thousands of processing units on modern many-core architectures reveals a gap between theory and practice. We adopt ideas from the well-known quickselect and sample sort algorithms to minimize data movement. Our evaluation demonstrates that we can keep up with recently proposed distribution sort algorithms in large-scale experiments, without any assumptions on the input keys. Additionally, our implementation outperforms an efficient multi-threaded merge sort on a single node. Our implementation is based on a C++ PGAS approach with an STL-like interface and can easily be integrated into many application codes. As part of the presented experiments, we further reveal challenges with multi-threaded MPI and one-sided communication. Amani Alonazi and Hatem Ltaief (KAUST); Issam Said (NVIDIA); Samuel Thibault (University of Bordeaux, LaBRI – INRIA Bordeaux Sud-Ouest); and David Keyes (KAUST) Abstract Abstract We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software infrastructure engine, TB-RTM, relies on the StarPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with the challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy, with compute kernels running at the highest level of the memory system, possibly in GPU main memory, while I/O kernels are saving solution data to fast storage. We consider how to span the wide performance gap between the two extreme ends of the memory system, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework presents the new out-of-core (OOC) feature from StarPU to prefetch data solutions in and out, not only from/to the GPU/CPU main memory but also from/to the fast storage system. The OOC technique may trigger opportunities for overlapping expensive data movement with computations. The TB-RTM framework addresses this challenging problem of heterogeneity with a systematic approach that is oblivious to the targeted hardware architectures. Our resulting RTM framework can effectively be deployed on massively parallel GPU-based systems, while delivering performance scalability up to 500 GPUs. Tuesday 2:00pm-3:30pm Regal Paper Workflows Chair: Sameer Shende (University of Oregon; ParaTools, Inc.) Alberto Miranda (Barcelona Supercomputing Center); Adrian Jackson (EPCC, The University of Edinburgh); Tommaso Tocci (Barcelona Supercomputing Center); Iakovos Panourgias (EPCC, The University of Edinburgh); and Ramon Nou (Barcelona Supercomputing Center) Abstract Abstract As HPC systems move into the Exascale era, parallel file systems are struggling to keep up with the I/O requirements from data-intensive problems. While the inclusion of burst buffers has helped to alleviate this by improving I/O performance, it has also increased the complexity of the I/O hierarchy by adding additional storage layers, each with its own semantics. 
This forces users to explicitly manage data movement between the different storage layers, which, coupled with the lack of interfaces to communicate data dependencies between jobs in a data-driven workflow, prevents resource schedulers from optimizing these transfers to benefit the cluster's overall performance. This paper proposes several extensions to job schedulers, prototyped using the Slurm scheduling system, to enable users to appropriately express the data dependencies between the different phases in their processing workflows. It also introduces a new service for asynchronous data staging called NORNS that coordinates with the job scheduler to orchestrate data transfers to achieve better resource utilization. Our evaluation shows that a workflow-aware Slurm exploits node-local storage more effectively, reducing the filesystem I/O contention and improving job running times. Pradeep Subedi, Philip E. Davis, and Manish Parashar (Rutgers University) Abstract Abstract Extreme scale scientific workflows are composed of multiple applications that exchange data at runtime. Several data-related challenges are limiting the potential impact of such workflows. While data staging and in-situ models of execution have emerged as approaches to address data-related costs at extreme scales, increasing data volumes and complex data exchange patterns impact the effectiveness of such approaches. In this paper, we design and implement DESTINY, which is an autonomic data delivery mechanism for staging-based in-situ workflows. DESTINY dynamically learns the data access patterns of scientific workflow applications and leverages these patterns to decrease data access costs. Specifically, DESTINY uses machine learning techniques to anticipate future data accesses, proactively packages and delivers the data necessary to satisfy these requests as close to the consumer as possible and, when data staging processes and consumer processes are colocated, removes the need for inter-process communication by making these data available to the consumer as shared-memory objects. When consumer processes reside on nodes other than staging nodes, the data is packaged and stored in a format the client will likely access in future. This amortizes expensive data discovery and assembly operations typically associated with data staging. We experimentally evaluate the performance and scalability of DESTINY on leadership class platforms using synthetic applications and the S3D combustion workflow. We demonstrate that DESTINY is scalable and can achieve a reduction of up to 75% in read response time as compared to in-memory staging service for production scientific workflows. Han Zhang (National University of Singapore), Lavanya Ramapantulu (International Institute of Information Technology), and Yong Meng Teo (National University of Singapore) Abstract Abstract Big-data application processing is increasingly geo-distributed, a paradigm shift from the traditional cluster-based processing frameworks. As the communication time for data movement across geo-distributed data centers is not a design criterion for traditional cluster-based processing approaches, there are research gaps in the algorithms used for staging and scheduling big-data applications for geo-distributed clusters. We address these gaps by proposing Harmony, an approach consisting of both staging and scheduling strategies to minimize an application's total execution time. 
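The DESTINY abstract earlier in this session rests on learning a workflow's data access pattern so that future requests can be anticipated and staged close to the consumer. A minimal sketch of that general idea follows: a first-order (Markov) next-region predictor driving a hypothetical prefetch hook. None of the names or calls below come from DESTINY; they are assumptions for illustration.

```python
# First-order Markov predictor over data-region IDs, used to drive prefetching.
# Purely illustrative of "learn access patterns, anticipate the next request";
# the prefetch() hook and region IDs are hypothetical, not DESTINY's API.
from collections import Counter, defaultdict

class AccessPredictor:
    def __init__(self):
        self.transitions = defaultdict(Counter)  # region -> Counter of successors
        self.last_region = None

    def observe(self, region_id):
        """Record one observed access and remember it for the next transition."""
        if self.last_region is not None:
            self.transitions[self.last_region][region_id] += 1
        self.last_region = region_id

    def predict_next(self):
        """Most frequent successor of the last observed region, or None."""
        if self.last_region is None or not self.transitions[self.last_region]:
            return None
        return self.transitions[self.last_region].most_common(1)[0][0]

def prefetch(region_id):
    # Placeholder for staging the region into node-local or shared memory.
    print(f"prefetching region {region_id}")

if __name__ == "__main__":
    predictor = AccessPredictor()
    for region in ["A", "B", "C", "A", "B"]:
        predictor.observe(region)
        nxt = predictor.predict_next()
        if nxt is not None:
            prefetch(nxt)
```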
Tuesday 4:00pm-5:00pm Regal Paper Clustering Chair: Ron Brightwell (Sandia National Laboratories) MuDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality Aditya Sarma, Poonam Goyal, Sonal Kumari, Anand Wani, Jagat Sesh Challa, Saiyedul Islam, and Navneet Goyal (Birla Institute of Technology & Science, Pilani) Abstract Abstract DBSCAN is one of the most popular and effective clustering algorithms, capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, MuDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of MuDBSCAN, MuDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in the performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. MuDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is the same as that of traditional DBSCAN. Wednesday 11:00am-12:30pm Regal Paper Data Centers and Clouds Chair: Kevin Pedretti (Sandia National Laboratories) Kexi Kang, Jinghui Zhang, Jiahui Jin, Dian Shen, and Junzhou Luo (Southeast University); Wenxin Li (Hong Kong University of Science and Technology); and Zhiang Wu (Nanjing University of Finance and Economics) Abstract Abstract Modern multi-queue data centers often use the standard Explicit Congestion Notification (ECN) scheme to achieve high network performance. However, one substantial drawback of this approach is that micro-burst traffic can cause the instantaneous queue length to exceed the ECN’s threshold, resulting in numerous mismarkings. After enduring too many mismarkings, senders may overreact, leading to severe throughput loss. As a solution to this dilemma, we propose our own adaptation, the Micro-burst ECN (MBECN) scheme, to mitigate mismarking. MBECN finds a more appropriate threshold baseline for each queue to absorb micro-bursts, based on steady-state analysis and an ideal generalized processor sharing (GPS) model. By adopting a queue-occupancy-based dynamic adjustment algorithm, MBECN effectively handles packet backlog without hurting latency. Through testbed experiments, we find that MBECN improves throughput by ~20% and reduces flow completion time (FCT) by ~40%. Using large-scale simulations, we find that throughput can be improved by 1.5-2.4x with DCTCP and 1.26-1.35x with ECN*. We also measure network delay and find that latency only increases by 7.36%. Nannan Zhao (Virginia Tech); Vasily Tarasov (IBM Research—Almaden); Hadeel Albahar (Virginia Tech); Ali Anwar, Lukas Rupprecht, Dimitrios Skourtis, and Amit S. Warke (IBM Research—Almaden); Mohamed Mohamed (Apple); and Ali R. 
Butt (Virginia Tech) Abstract Abstract Docker containers have become a prominent solution for supporting modern enterprise applications due to the highly desirable features of isolation, low overhead, and efficient packaging of the execution environment. Containers are created from images which are shared between users via a Docker registry. The amount of data Docker registries store is massive; for example, Docker Hub, a popular public registry, stores at least half a million public images. In this paper, we analyze over 167 TB of uncompressed Docker Hub images, characterize them using multiple metrics and evaluate the potential of file-level deduplication in Docker Hub. Our analysis helps to make informed decisions when designing storage for containers in general and Docker registries in particular. For example, only 3% of the files in images are unique, which means file-level deduplication has a great potential to save storage space for the registry. Our findings can motivate and help improve the design of data reduction, caching, and pulling optimizations for registries. Dong Huang, Xiaopeng Fan, and Yang Wang (Shenzhen Institutes of Advanced Technology); Shuibing He (Zhejiang University); and Chengzhong Xu (University of Macau) Abstract Abstract In this paper, we study the data caching problem in a mobile cloud environment where multiple correlated data items could be packed and migrated to serve a predefined sequence of requests. By leveraging the spatial and temporal trajectory of requests, we propose a two-phase caching algorithm. We first investigate the correlation between data items to determine whether or not two data items could be packed to transfer, and then combine the algorithm proposed by Wang et al. (2017) with a greedy strategy to design a two-phase algorithm, named DP_Greedy, for effectively caching these shared data items to serve a predefined sequence of requests. Under a homogeneous cost model, we prove the proposed algorithm is at most 2/α times worse than the optimal one in terms of the total service cost, where α is the discount factor we defined, and also show that the algorithm can achieve this result within O(mn^2) time and O(mn) space complexity for m caches serving an n-length sequence. We evaluate our algorithm by implementing it and comparing it with the non-packing case; the results show the proposed DP_Greedy algorithm not only delivers excellent performance but is also better aligned with practical scenarios. Wednesday 2:00pm-3:30pm Regal Paper Efficient Storage Chair: Kathryn Mohror (LLNL) Yuzhe Li, Jiang Zhou, and Weiping Wang (Institute of Information Engineering, Chinese Academy of Sciences) and Yong Chen (Texas Tech University) Abstract Abstract The in-memory key/value store (KV-store) is a key building block for numerous applications running on a cluster. With the increase of cluster scale, efficiency and availability have become two critical and highly demanded features. Traditional replication provides redundancy but is inefficient due to its high storage cost. Erasure coding can provide data reliability with significantly lower storage requirements but is primarily used for long-term archival data due to its limited write performance. Recent studies attempt to combine these two techniques, e.g., using replication for frequently updated metadata and erasure coding for large, read-only data. 
In this study, we propose RE-Store, an in-memory key/value system with a novel, hybrid scheme of replication and erasure coding to achieve both efficiency and reliability. RE-Store introduces replication into erasure coding by making one copy of each encoded data block and replacing partial parity with replicas for storage efficiency. When failures occur, it uses replicas to ensure data availability, avoiding the inefficiency of erasure coding during repair. A fast online recovery is achieved for fault tolerance under different failure scenarios, with little performance degradation. We have implemented RE-Store on a real key/value system and conducted extensive evaluations to validate its design and to study its performance, efficiency, and reliability. Experimental results show that RE-Store has similar performance to erasure coding and replication under normal operations, yet saves 18% to 34% of memory compared to replication when tolerating 2 to 4 failures. Qing Zheng, Charles Cranor, Ankush Jain, Gregory Ganger, Garth Gibson, and George Amvrosiadis (Carnegie Mellon University) and Bradley Settlemyer and Gary Grider (Los Alamos National Lab) Abstract Abstract We are approaching a point in time when it will be infeasible to catalog and query data after it has been generated. This trend has fueled research on in-situ data processing (i.e., operating on data as it is streamed to storage). One important example of this approach is in-situ data indexing. Prior work has shown the feasibility of indexing at scale as a two-step process. First, one partitions data by key across the CPU cores of a parallel job. Then each core indexes its subset as data is persisted. Online partitioning requires transferring data over the network so that it can be indexed and stored by the core responsible for the data. This approach is becoming increasingly costly as new computing platforms emphasize parallelism instead of the individual core performance that is crucial for communication libraries and systems software in general. In addition to indexing, scalable online data partitioning is also useful in other contexts such as load balancing and efficient compression. Zhi Qiao (University of North Texas; USRC, LANL); Song Fu (University of North Texas); and Hsing-Bung Chen and Bradley Settlemyer (Los Alamos National Laboratory) Abstract Abstract Due to the vast storage needs of high performance computing (HPC), the scale and complexity of storage systems in HPC data centers continue to grow. Disk failures have become the norm. With the ever-increasing disk capacity, RAID recovery based on disk rebuild becomes more and more expensive, which causes significant performance degradation and even unavailability of storage systems. Declustered redundant arrays of independent disks shuffle data and parity blocks among all drives in a RAID group, aiming to accelerate RAID reconstruction and improve performance. With the popularity of the ZFS file system and software RAID in production systems, in this paper we extensively evaluate and analyze declustered RAID with regard to RAID I/O performance and recovery time on a high-performance storage platform at Los Alamos National Laboratory. Our empirical study reveals that the speedup of declustered RAID over traditional RAID is sub-linear in the parallelism of recovery I/O. 
Qing Zheng, Charles Cranor, Ankush Jain, Gregory Ganger, Garth Gibson, and George Amvrosiadis (Carnegie Mellon University) and Bradley Settlemyer and Gary Grider (Los Alamos National Lab) Abstract We are approaching a point in time when it will be infeasible to catalog and query data after it has been generated. This trend has fueled research on in-situ data processing (i.e., operating on data as it is streamed to storage). One important example of this approach is in-situ data indexing. Prior work has shown the feasibility of indexing at scale as a two-step process: first, data is partitioned by key across the CPU cores of a parallel job; then each core indexes its subset as the data is persisted. Online partitioning requires transferring data over the network so that it can be indexed and stored by the core responsible for it. This approach is becoming increasingly costly as new computing platforms emphasize parallelism over the individual-core performance that is crucial for communication libraries and systems software in general. Beyond indexing, scalable online data partitioning is also useful in other contexts such as load balancing and efficient compression.

Zhi Qiao (University of North Texas; USRC, LANL); Song Fu (University of North Texas); and Hsing-Bung Chen and Bradley Settlemyer (Los Alamos National Laboratory) Abstract Due to the vast storage needs of high-performance computing (HPC), the scale and complexity of storage systems in HPC data centers continue to grow, and disk failures have become the norm. With ever-increasing disk capacity, RAID recovery based on disk rebuild becomes more and more expensive, causing significant performance degradation and even unavailability of storage systems. Declustered RAID shuffles data and parity blocks among all drives in a RAID group, aiming to accelerate RAID reconstruction and improve performance. Given the popularity of the ZFS file system and software RAID in production systems, in this paper we extensively evaluate and analyze declustered RAID with regard to RAID I/O performance and recovery time on a high-performance storage platform at Los Alamos National Laboratory. Our empirical study reveals that the speedup of declustered RAID over traditional RAID is sub-linear in the parallelism of recovery I/O. Furthermore, we formally model and analyze the reliability of declustered RAID using mean time to data loss (MTTDL) and find that the improved recovery performance leads to higher storage reliability than traditional RAID.

Wednesday 4:00pm-5:15pm Regal Paper Compression Chair: Christian Engelmann (Oak Ridge National Laboratory)

Analyzing the Impact of Lossy Compressor Variability on Checkpointing Scientific Simulations (short paper) Pavlo D. Triantafyllides, Tasmia Reza, and Jon C. Calhoun (Clemson University) Abstract Lossy compression algorithms are effective tools for reducing the size of high-performance computing data sets. As established lossy compressors such as SZ and ZFP evolve, they seek to improve compression/decompression bandwidth and compression ratio. Algorithm improvements may alter the spatial distribution of errors in the compressed data even when the same error bound and error bound type are used. If HPC applications are to compute on lossy compressed data, application users require an understanding of how the performance and the spatial distribution of error change. We explore how the spatial distribution of error, compression/decompression bandwidth, and compression ratio change for HPC data sets from the applications PlasComCM and Nek5000 across various versions of SZ and ZFP. In addition, we explore how the spatial distribution of error impacts application correctness when restarting from lossy compressed checkpoints. We verify that known approaches for selecting error tolerances for lossy compressed checkpointing are robust to compressor selection and to changes in the distribution of error.

Xin Liang (UC Riverside); Sheng Di (Argonne National Laboratory); Dingwen Tao (University of Alabama); Sihuan Li (UC Riverside); Bogdan Nicolae (Argonne National Laboratory); Zizhong Chen (UC Riverside); and Franck Cappello (Argonne National Laboratory) Abstract Because of the ever-increasing volume of data produced by today's high-performance computing (HPC) scientific simulations, I/O performance is becoming a significant bottleneck for their execution. An efficient error-controlled lossy compressor is a promising way to significantly reduce data-writing time for scientific simulations running on supercomputers. In this paper, we explore how to optimize data dumping performance for scientific simulations by leveraging error-bounded lossy compression techniques. The contributions of the paper are threefold. (1) We propose a novel I/O performance profiling model that can effectively represent I/O performance across different execution scales and data sizes, and we optimize the estimation accuracy of data dumping performance using the least-squares method. (2) We develop an adaptive lossy compression framework that selects the best-fit compressor (between the two leading lossy compressors, SZ and ZFP) with optimized parameter settings with respect to overall data dumping performance. (3) We evaluate our adaptive lossy compression framework with up to 32k cores on a supercomputer equipped with a fast I/O system, using real-world scientific simulation datasets. Experiments show that our solution almost always brings data dumping performance to the optimal level through accurate selection of the best-fit lossy compressor and its settings, improving data dumping performance by up to 27% at different scales.
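The best-fit selection idea can be sketched as picking whichever compressor minimizes estimated dumping time, i.e., compression time plus compressed size divided by write bandwidth. In the toy Python below, zlib and bz2 stand in for SZ and ZFP only so that the sketch runs end to end, and the bandwidth value is a placeholder; the real framework works with error-bounded lossy compressors and a profiled I/O model.

    # Pick the codec with the lowest estimated dumping time.
    import bz2
    import time
    import zlib

    def dump_cost(data, compress, write_bw):
        # Estimated dumping time = compression time + compressed size / bandwidth.
        t0 = time.perf_counter()
        out = compress(data)
        return (time.perf_counter() - t0) + len(out) / write_bw

    def pick_codec(data, write_bw=1e9):   # assumed ~1 GB/s write bandwidth
        candidates = {"zlib": zlib.compress, "bz2": bz2.compress}
        return min(candidates, key=lambda name: dump_cost(data, candidates[name], write_bw))

    if __name__ == "__main__":
        sample = bytes(range(256)) * 4096            # stand-in for a checkpoint buffer
        print(pick_codec(sample))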
Mohammad Hasanzadeh Mofrad and Rami Melhem (University of Pittsburgh) and Yousuf Ahmad and Mohammad Hammoud (Carnegie Mellon University in Qatar) Abstract This paper presents Triply Compressed Sparse Column (TCSC), a novel compression technique designed specifically for matrix-vector operations in which the matrix as well as the input and output vectors are sparse. We refer to these operations as SpMSpV$^2$. TCSC compresses the nonzero columns and rows of a highly sparse matrix representing a large real-world graph. During this compression, it encodes the sparsity patterns of the input and output vectors within the compressed representation of the sparse matrix itself. Consequently, it aligns the compressed indices of the input and output vectors with those of the compressed matrix columns and rows, eliminating the need for extra indirection when SpMSpV$^2$ operations access the vectors. This results in fewer cache misses, greater space efficiency, and faster execution times. We evaluate TCSC's performance and show that it is more space- and time-efficient than CSC and DCSC, with up to $11\times$ speedup. We integrate TCSC into GraphTap, our linear algebra-based distributed graph analytics system. We compare GraphTap against GraphPad and LA3, two state-of-the-art linear algebra-based distributed graph analytics systems, using different dataset scales and numbers of processes. GraphTap is up to $7\times$ faster than these systems due to TCSC and the resulting communication efficiency.
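One simplified way to see the indirection that index alignment removes: if the matrix columns are re-indexed over only the columns where the input vector can be nonzero, the sparse vector becomes a small dense array and the kernel avoids per-entry lookups. The Python sketch below uses SciPy's standard CSC format for illustration; it is not the TCSC layout itself.

    import numpy as np
    from scipy.sparse import csc_matrix

    def spmspv_with_lookup(A, x_sparse):
        # Baseline: the sparse input vector is a {column index: value} map, so every
        # accumulation goes through dictionary lookups (the extra indirection).
        y = {}
        for j, xj in x_sparse.items():
            for k in range(A.indptr[j], A.indptr[j + 1]):
                i = A.indices[k]
                y[i] = y.get(i, 0.0) + A.data[k] * xj
        return y

    def spmspv_compact(A, nonzero_cols, x_compact):
        # Compact form: keep only the columns that can contribute, so the input
        # vector is a plain dense array aligned with those columns.
        return A[:, nonzero_cols] @ x_compact

    if __name__ == "__main__":
        A = csc_matrix(np.array([[0., 2., 0.],
                                 [1., 0., 0.],
                                 [0., 3., 4.]]))
        print(spmspv_with_lookup(A, {1: 5.0, 2: 1.0}))                    # {0: 10.0, 2: 19.0}
        print(spmspv_compact(A, np.array([1, 2]), np.array([5.0, 1.0])))  # [10.  0. 19.]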
Thursday 11:00am-12:30pm Regal Paper Applications Chair: Ryan Grant (Sandia National Labs, University of New Mexico)

Multi-physics simulations of particle tracking in arterial geometries with a scalable moving window algorithm Gregory J. Herschlag (Duke University), John Gounley (Oak Ridge National Laboratory), Sayan Roychowdhury (Duke University), Erik W. Draeger (Lawrence Livermore National Laboratory), and Amanda Randles (Duke University) Abstract In arterial systems, cancer cell trajectories determine metastatic cancer locations; similarly, particle trajectories determine drug delivery distribution. Predicting trajectories is challenging, as the dynamics are affected by local interactions with red blood cells, complex hemodynamic flow structure, and downstream factors such as stenoses or blockages. Direct simulation is not possible, as even a single simulation of a large arterial domain with explicit red blood cells is currently intractable on even the largest supercomputers. To overcome this limitation, we present a multi-physics adaptive window algorithm, in which individual red blood cells are explicitly modeled in a small region of interest that moves through a coupled arterial fluid domain. We describe the coupling between the window and fluid domains, including automatic insertion and deletion of explicit cells and dynamic tracking of cells of interest by the window. We show that this algorithm scales efficiently on heterogeneous architectures and enables large, highly resolved particle-tracking simulations that would otherwise be intractable.

Marco Minutoli (Pacific Northwest National Laboratory, Washington State University); Mahantesh Halappanavar (Pacific Northwest National Laboratory); Ananth Kalyanaraman (Washington State University); and Arun Sathanur, Ryan Mcclure, and Jason McDermott (Pacific Northwest National Laboratory) Abstract The Influence Maximization problem has been extensively studied over the past decade because of its practical applications in finding the key influencers in social networks. Due to the hardness of the underlying problem, existing algorithms trade off practical efficiency against approximation guarantees, and approximate solutions still take several hours of compute time on modestly sized real-world inputs. At the same time, there is a lack of effective parallel and distributed algorithms for this problem. In this paper, we present efficient parallel algorithms for multithreaded and distributed systems that solve influence maximization with an approximation guarantee. Our algorithms extend the state-of-the-art sequential approach, which is based on computing reverse reachability sets. We present a detailed experimental evaluation and analyze performance and sensitivity to input parameters using real-world inputs. Our experimental results demonstrate significant speedup on parallel architectures. We further show a speedup of up to 586x relative to the state-of-the-art sequential baseline using 1024 nodes of a supercomputer, at far greater accuracy and twice the seed set size. To the best of our knowledge, this is the first effort to parallelize influence maximization at this scale.

Scalable, High-Order Continuity Across Block Boundaries of Functional Approximations Computed in Parallel Iulian Grindeanu, Tom Peterka, Vijay Mahadevan, and Youssef S. Nashed (Argonne National Laboratory) Abstract We investigate the representation of discrete scientific data with a $C^k$-continuous functional model, where $C^k$ denotes $k$-th-order continuity, in a distributed-memory parallel setting. The MFA (Multivariate Functional Approximation) model is a piecewise-continuous functional approximation based on multivariate high-dimensional B-splines. When an MFA approximation is computed in parallel over multiple blocks of a spatial domain decomposition, the interior of each block is $C^k$-continuous, with $k$ being the B-spline polynomial degree, but discontinuities exist across neighboring block boundaries. We present an efficient and scalable solution that ensures $C^k$ continuity across blocks by blending neighboring approximations. We show that after decomposing the domain into structured, overlapping blocks and approximating each block independently to a high degree of accuracy, we can extend the local solutions to the global domain using compact, multidimensional smooth-step functions. We prove that this approach, which can be viewed as an extended partition-of-unity approximation method, is highly scalable on modern architectures.
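The blending step can be illustrated in one dimension: two local fits over overlapping blocks are combined with a smooth-step weight so the global function transitions smoothly across the shared overlap. The sketch below uses toy polynomial fits in place of the MFA's B-splines, and the overlap interval [0.4, 0.6] is chosen arbitrarily.

    import numpy as np

    def smoothstep(t):
        # C^1 smooth-step on [0, 1]: zero slope at both ends of the overlap.
        t = np.clip(t, 0.0, 1.0)
        return t * t * (3.0 - 2.0 * t)

    def blended_eval(x, f_left, f_right, lo, hi):
        # Partition-of-unity style blend of two local approximations.
        w = smoothstep((x - lo) / (hi - lo))
        return (1.0 - w) * f_left(x) + w * f_right(x)

    if __name__ == "__main__":
        # Two independent degree-5 fits of sin(x) on overlapping blocks [0, 0.6] and [0.4, 1].
        xl = np.linspace(0.0, 0.6, 50)
        xr = np.linspace(0.4, 1.0, 50)
        f_left = np.poly1d(np.polyfit(xl, np.sin(xl), 5))
        f_right = np.poly1d(np.polyfit(xr, np.sin(xr), 5))
        xs = np.linspace(0.0, 1.0, 201)
        blend = blended_eval(xs, f_left, f_right, 0.4, 0.6)
        print(float(np.max(np.abs(blend - np.sin(xs)))))   # small residual everywhere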