Tuesday 10:45am-12:15pm, Assembly Hall
Plenary, Paper Session I: Best Papers (Areas 1 and 2)
Chair: Harald Koestler (University of Erlangen-Nuremberg)

Best Paper
Ali Murat Gok (Northwestern University); Sheng Di, Yuri Alexeev, and Dingwen Tao (Argonne National Laboratory); Vladimir Mironov (Lomonosov Moscow State University); and Xin Liang and Franck Cappello (Argonne National Laboratory)
Abstract: Computation of two-electron repulsion integrals is the critical and most time-consuming step in a typical quantum chemistry simulation. Such calculations have massive computing and storage requirements, which scale as O(n^4) with the size of a chemical system. Compressing the integral data and storing it on disk can avoid costly recalculation, significantly speeding up the overall quantum chemistry calculations; but it requires a fast compression algorithm. To this end, we developed PaSTRI (Pattern Scaling for Two-electron Repulsion Integrals) and implemented the algorithm in the data compression package SZ. PaSTRI leverages the latent pattern features in the integral dataset and optimizes the calculation of the appropriate number of bits required for the storage of the integral. We have evaluated PaSTRI using the integral datasets generated by the quantum chemistry program GAMESS. The results show an excellent compression ratio of 16.8 with low overhead, while maintaining 10^-10 absolute precision based on the user's requirement.

Best Paper
Mohammadreza Bayatpour, Jahanzeb Maqbool Hashmi, Sourav Chakraborty, Pouya Kousha, Hari Subramoni, and Dhabaleswar K. Panda (Ohio State University)
Abstract: The Message Passing Interface (MPI) has, thus far, remained a dominant programming model for large-scale scientific applications. Collective communication operations in MPI are of significant importance due to their communication-intensive nature and use in scientific applications. With the emergence of multi-/many-core systems and the rise of deep learning applications, it is important to revisit MPI collectives, particularly MPI Allreduce, to exploit the vast parallelism offered by modern architectures. In this paper, we take up this challenge and propose Scalable and Adaptive designs for Large message Reduction collectives (SALaR). We focus on MPI Allreduce due to its use in deep learning frameworks and propose new designs that can significantly improve its performance by exploiting architectural features of modern multi-/many-cores in tandem with high-throughput networks such as InfiniBand. We also propose a theoretical model to analyze communication and computation cost and use these insights to guide our designs. The evaluation of the proposed SALaR-based designs shows significant performance gains over state-of-the-art designs on a wide variety of micro-benchmarks and applications.

Tuesday 1:30pm-3:00pm, Assembly Hall
Paper Session II: Matrix Algorithms
Chair: Dheeraj Sreedhar (IBM Research - India)

Zheng Miao, Jon Calhoun, and Rong Ge (Clemson University)
Abstract: Exascale computing must simultaneously address both energy efficiency and resilience, as power limits impact scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability.
Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, and Kengo Nakajima (The University of Tokyo)
Abstract: Hierarchical matrices (H-matrices) are an approximation technique for dense matrices, such as the coefficient matrix of the Boundary Element Method (BEM). An H-matrix is expressed by a set of low-rank approximated and small dense sub-matrices of the dense matrix, each of which has its own rank. The use of H-matrices reduces the required memory footprint of the dense matrices from O(N^2) to O(NlogN) and is suitable for many-core processors, which have relatively small memory capacity compared to traditional CPUs. However, the existing parallel Adaptive Cross Approximation (ACA) algorithms, which are low-rank approximation algorithms used to construct H-matrices, are not designed to exploit many-core processors in terms of load balancing. In the existing parallel algorithm, the ACA process is applied independently to each sub-matrix. The computational load of the ACA process for each sub-matrix depends on the sub-matrix's rank, but the rank is known only after ACA has been applied, which makes it difficult to balance the load. We propose a load-balancing-aware parallel ACA algorithm for H-matrices that focuses on many-core processors. We implemented the proposed algorithm in HACApK, an open-source H-matrices library originally developed for CPU-based clusters. The proposed algorithm was evaluated using BEM problems on an NVIDIA Tesla P100 GPU (P100) and an Intel Xeon Broadwell processor (BDW). The evaluation results demonstrate the improved performance of the proposed algorithm in all GPU cases. For example, in a case wherein it is difficult for existing parallel algorithms to balance the load, the proposed algorithm achieved a 12.9 times performance improvement on the P100.

Juan Carlos Pichel and Beatriz Pateiro-López (Universidade de Santiago de Compostela)
Abstract: In this paper, a new methodology to select the best storage format for sparse matrices based on deep learning techniques is introduced. We focus on the selection of the proper format for sparse matrix-vector multiplication (SpMV), which is one of the most important computational kernels in many scientific and engineering applications. Our approach considers the sparsity pattern of the matrices as an image, using the RGB channels to code several of the matrix properties. As a consequence, we generate image datasets that include enough information to successfully train a Convolutional Neural Network (CNN). Considering GPUs as target platforms, the trained CNN selects the best storage format 90.1% of the time, obtaining 99.4% of the highest SpMV performance among the tested formats.

Tuesday 1:30pm-3:00pm, Minor Hall
Paper Session III: Architecture and Interconnect
Chair: Ann Gentile (Sandia National Laboratories)

Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu, Shinya Hiramoto, Yoshiro Ikeda, Takahide Yoshikawa, Kenji Uchida, and Tomohiro Inoue (Fujitsu Limited)
Abstract: In this paper, we introduce a new and highly scalable interconnect called Tofu interconnect D that will be used in the post-K machine. This machine will officially be operational around 2021. The letter D represents high “density” nodes and “dynamic” packet slicing for “dual-rail” transfer. Herein we describe the design and the evaluation results of TofuD. Due to the high-density packaging, the optical link ratio of TofuD has decreased to 25% from the 66% optical link ratio of Tofu2.
TofuD applies a new technique called dynamic packet slicing to reduce latency and to improve fault resilience. The evaluation results show that the one-way 8-byte Put latency is 0.49 μs, which is 31% lower than the latency of Tofu2. The injection rate per node is 38.1 GB/s, which is approximately 83% of the injection rate of Tofu2. The link efficiency is as high as approximately 93%.

Qingye Jiang (The University of Sydney), Young Choon Lee (Macquarie University), and Albert Y. Zomaya (The University of Sydney)
Abstract: The move from the traditional Software-as-a-Product (SaaP) model to the Software-as-a-Service (SaaS) model is apparent with the adoption of cloud computing. Unlike the SaaP model, the SaaS model delivers a diverse set of software features directly from public clouds to a large number of users with varying quality of service (QoS) requirements. There are two outstanding issues with traditional QoS systems: (1) they are usually designed and developed for a special purpose, making them difficult to reuse for other use cases; and (2) they have limited scalability due to the write-intensive nature of the admission control workload. In this paper, we present Janus, a generic and scalable QoS framework for SaaS applications that takes full advantage of the cloud's inherent horizontal scalability. Janus uses a multi-layer architecture to eliminate communication between nodes in the same layer, achieving horizontal scalability without sacrificing vertical scalability. Janus ensures accurate admission control using a distributed set of leaky buckets with a refill mechanism. Janus adopts a key-value request-response mechanism for easy integration with the actual application. We extensively evaluate Janus on AWS with both the Apache HTTP server benchmarking tool and a photo sharing web application. Our experimental results demonstrate that (a) Janus achieves linear scalability both vertically and horizontally, and (b) Janus can be integrated with existing applications with a minimal amount of code change. Janus achieves more than 100,000 requests per second with only 10 nodes in the QoS server layer, and 90% of the admission control decisions were made within 3 milliseconds.

Alexandru Uta (VU University Amsterdam), Ana Lucia Varbanescu (University of Amsterdam), Ahmed Musaafir (Vrije Universiteit Amsterdam), Chris Lemaire (TU Delft), and Alexandru Iosup (Vrije Universiteit Amsterdam)
Abstract: The question "Can big data and HPC infrastructure converge?" has important implications for many operators and clients of modern computing. However, answering it is challenging. The hardware is currently different, and fast evolving: big data uses machines with modest numbers of fat cores per socket, large caches, and much memory, whereas HPC uses machines with larger numbers of (thinner) cores, non-trivial NUMA architectures, and fast interconnects. In this work, we investigate the convergence of big data and HPC infrastructure for one of the most challenging application domains, highly irregular graph processing. Through a systematic, experimental study of over 300,000 core-hours, we contrast the performance of a modern multicore processor, the Intel Knights Landing (KNL), and of traditional big data hardware in processing representative graph workloads using state-of-the-art graph analytics platforms. The experimental results indicate that KNL is convergence-ready, performance-wise, but only after extensive and expert-level tuning of software and hardware parameters.
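The Janus entry above builds its admission control from a distributed set of leaky buckets with a refill mechanism. Below is a minimal single-node sketch of that building block only (class name, rates, and capacities are illustrative; the paper's distributed, multi-layer design is not reproduced here).

```python
import time

class RefillBucket:
    """Token/leaky bucket with continuous refill: admit a request only if enough tokens remain."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens (requests) replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def admit(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True               # admit the request
        return False                  # throttle or reject

bucket = RefillBucket(rate=100.0, capacity=20.0)
print(bucket.admit())                 # True for the first request of a burst
```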
Tuesday 3:30pm-5:15pm, Assembly Hall
Paper Session IV: Generating and Optimizing Applications
Chair: Nathan Tallent (Pacific Northwest National Laboratory)

Sebastian Kuckuk and Harald Köstler (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Abstract: The study of ocean currents has been an active area of research for decades. As a model close to the water surface, the shallow water equations (SWE) can be used. For realistic simulations, efficient numerical solvers are necessary that exhibit good node-level performance while still maintaining scalability. When comparing the discretized model and the actual implementation, one often finds that they differ vastly. This gap makes it hard for domain experts to implement their models, and high performance computing (HPC) experts are required to ensure an optimal implementation. Using domain-specific languages (DSLs) and code generation techniques can be a useful approach to bridge this gap. In recent years, ExaStencils and its DSL ExaSlang have proven to provide a suitable platform for this. In this work, we present an extension from the previously supported elliptic partial differential equations (PDEs) to hyperbolic ones, namely the SWE. After setting up a suitable discretization, we demonstrate how it can be mapped to ExaSlang code. This code is still quite similar to the original, mathematically motivated specification and can be easily written by domain experts. Still, solvers generated from this abstract representation can be run on large-scale clusters. We demonstrate this by giving performance and scalability results on the state-of-the-art GPU cluster Piz Daint, where we solve for close to a trillion unknowns on 2048 GPUs. From there, we discuss the performance impact of different optimizations, such as overlapping computation and communication, or switching to a hybrid CPU-GPU parallelization scheme.

Linjin Cai, Yichao Wang, and James Lin (Shanghai Jiao Tong University)
Abstract: Sunway TaihuLight is China's recently top-ranked supercomputer worldwide and the first to be built entirely with home-grown processors. This supercomputer can be programmed with two approaches: directive-based OpenACC and native programming. These approaches are studied here using GTC-P, a particle-in-cell code for investigating micro-turbulence in magnetic fusion plasmas. We have compared the performance and programming effort of the OpenACC and native versions of GTC-P. Associated results show that in the OpenACC version, the kernel with irregular memory access becomes the main performance bottleneck due to poor data locality. To address this issue, we have applied two optimizations to the native version: (1) register level communication (RLC); and (2) an "asynchronization" strategy. With these two optimizations, the native version can achieve up to a 2.5X speedup for the memory-bound kernel compared with the OpenACC version. In addition, we have now scaled GTC-P on 4,259,840 cores of TaihuLight and demonstrate performance comparisons with several world-leading supercomputers.
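The GTC-P entry above attributes its main bottleneck to a kernel with irregular memory access and poor data locality. The following sketch (hypothetical, not GTC-P code) shows the kind of data-dependent scatter a particle-in-cell charge deposition performs: write targets follow particle order, not memory order, so caches are used poorly.

```python
import numpy as np

def deposit_charge(positions, weights, grid_size):
    """Accumulate particle weights onto a 1-D grid (irregular scatter)."""
    grid = np.zeros(grid_size)
    cells = (positions * grid_size).astype(int) % grid_size  # data-dependent indices
    for c, w in zip(cells, weights):
        grid[c] += w   # successive writes hit scattered grid locations
    return grid

rng = np.random.default_rng(0)
pos = rng.random(100_000)                           # random particle positions in [0, 1)
grid = deposit_charge(pos, np.ones_like(pos), 4096)
print(grid.sum())                                   # 100000.0: all weight deposited
```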
Parallel Approximation of the Maximum Likelihood Estimation for the Prediction of Large-Scale Geostatistics Simulations
Sameh Abdulah, Hatem Ltaief, Ying Sun, Marc Genton, and David Keyes (King Abdullah University of Science and Technology)
Abstract: Maximum likelihood estimation is an important statistical technique for estimating missing data, for example in climate and environmental applications, which are usually large and feature data points that are irregularly spaced. In particular, the Gaussian log-likelihood function is the de facto model, which operates on the resulting sizable dense covariance matrix. The advent of high-performance systems with advanced computing power and memory capacity has enabled full simulations only for rather small dimensional climate problems, solved at machine precision accuracy. The challenge for high dimensional problems lies in the computational requirements of the log-likelihood function, which necessitates O(n^2) storage and O(n^3) operations, where n represents the number of given spatial locations. This prohibitive computational cost may be reduced by using approximation techniques that not only enable large-scale simulations otherwise intractable but also maintain the accuracy and the fidelity of the spatial statistics model. In this paper, we extend the Exascale GeoStatistics software (i.e., ExaGeoStat) to support the Tile Low-Rank (TLR) approximation, which exploits the data sparsity of the dense covariance matrix by compressing the off-diagonal tiles up to a user-defined accuracy threshold. The underlying linear algebra operations may then be carried out on this data compression format, which may ultimately reduce the arithmetic complexity of the maximum likelihood estimation and the corresponding memory footprint. Performance results of TLR-based computations on shared- and distributed-memory systems attain up to 13X and 5X speedups, respectively, compared to full accuracy simulations using synthetic and real datasets (up to 2M), while ensuring adequate prediction accuracy.

Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert Latham, Robert Ross, Shane Snyder, and Stefan M. Wild (Argonne National Laboratory)
Abstract: Storage system performance modeling is crucial for efficient use of heterogeneous shared resources on leadership-class computers. Variability in application performance, particularly variability arising from concurrent applications sharing I/O resources, is a major hurdle in the development of accurate performance models. We adopt a deep learning approach based on conditional variational autoencoders (CVAE) for I/O performance modeling and use it to quantify performance variability. We illustrate our approach using data collected on Edison, a production supercomputing system at the National Energy Research Scientific Computing Center (NERSC). The CVAE approach is investigated by comparing it to a previously proposed sensitivity-based Gaussian process (GP) model. We find that the CVAE model performs slightly better than the GP model in cases where training and testing data come from different applications, since CVAE can inherently leverage the whole data from multiple applications whereas GP partitions the data and builds separate models for each partition. Hence, the CVAE offers an alternative modeling approach that does not need pre-processing; it has enough flexibility to handle data from a wide variety of applications without changing the inference approach.
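For context on the costs quoted in the geostatistics entry above: the Gaussian log-likelihood has the standard textbook form below (a general statement, not reproduced from the paper). Evaluating it requires storing the n-by-n covariance matrix, hence O(n^2) memory, and factorizing it to obtain the determinant and the solve, hence O(n^3) operations.

```latex
\ell(\boldsymbol{\theta})
  = -\frac{n}{2}\log(2\pi)
    -\frac{1}{2}\log\left|\Sigma(\boldsymbol{\theta})\right|
    -\frac{1}{2}\,\mathbf{z}^{\top}\Sigma(\boldsymbol{\theta})^{-1}\mathbf{z},
\qquad \Sigma(\boldsymbol{\theta}) \in \mathbb{R}^{n \times n}.
```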
Tuesday 3:30pm-5:15pm, Minor Hall
Paper Session V: Filesystems and Applications
Chair: Hans Vandierendonck (Queen's University Belfast)

Kun Feng and Xian-He Sun (Illinois Institute of Technology), Xi Yang (Teradata Inc), and Shujia Zhou (Northrop Grumman Information Technology)
Abstract: Modern High Performance Computing (HPC) applications, such as Earth science simulations, produce large amounts of data due to the surge in computing power, while big data applications have become more compute-intensive due to increasingly sophisticated analysis algorithms. The needs of both HPC and big data technologies for advanced HPC and big data applications create a demand for integrated system support. In this study, we introduce Scientific Data Processing (SciDP) to support both HPC and big data applications via integrated scientific data processing. SciDP can directly process scientific data stored on a Parallel File System (PFS), which is typically deployed in an HPC environment, in a big data programming environment running atop the Hadoop Distributed File System (HDFS). SciDP seamlessly integrates PFS, HDFS, and the widely used R data analysis system to support highly efficient processing of scientific data. It utilizes the merits of both PFS and HDFS for fast data transfer, overlaps computing with data accessing, and integrates R into the data transfer process. Experimental results show that SciDP accelerates analysis and visualization of a production NASA Center for Climate Simulation (NCCS) climate and weather application by 6x to 8x when compared to existing solutions.

Masahiro Tanaka (Graduate School of Media and Governance, Keio University); Osamu Tatebe (Center for Computational Sciences, University of Tsukuba); and Hideyuki Kawashima (Faculty of Environment and Information Studies, Keio University)
Abstract: In this paper, we describe a use case applying a scientific workflow system and a distributed file system to improve the performance of telescope data processing. The application is pipeline processing of data generated by Hyper Suprime-Cam (HSC), a focal plane camera mounted on the Subaru telescope. In this paper, we focus on the scalability of parallel I/O and core utilization. The IBM Spectrum Scale (GPFS) system used for actual operation has limited scalability due to its configuration using storage servers. Therefore, we introduce the Gfarm file system, which uses worker-node local storage for parallel I/O performance. To improve core utilization, we introduce the Pwrake workflow system instead of the parallel processing framework developed for the HSC pipeline. Descriptions of task dependencies are necessary to further improve core utilization by overlapping different types of tasks. We discuss the usefulness of a workflow description language with the capabilities of a scripting language for defining complex task dependencies. In the experiment, the performance of the pipeline is evaluated using a quarter of the observation data per night (input files: 80 GB, output files: 1.2 TB). Measurements on strong scaling from 48 to 576 cores show that processing with the Gfarm file system is more scalable than with GPFS. Measurement using 576 cores shows that our method improves the processing speed of the pipeline by 2.2 times compared with the method used in actual operation.
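The pipeline entry above argues that explicit task dependencies let different task types overlap and improve core utilization. The toy sketch below (invented task names, not the Pwrake workflow language) shows how a dependency map yields a ready set that naturally mixes task types, so I/O-heavy and compute-heavy tasks can run concurrently.

```python
def ready_tasks(deps, done):
    """deps: task -> set of prerequisite tasks; return tasks whose prerequisites are all done."""
    return {t for t, pre in deps.items() if t not in done and pre <= done}

deps = {
    "ingest_1": set(),           "ingest_2": set(),
    "calibrate_1": {"ingest_1"}, "calibrate_2": {"ingest_2"},
    "coadd": {"calibrate_1", "calibrate_2"},
}
# After ingest_1 finishes, a compute task and an I/O task are runnable at the same time.
print(ready_tasks(deps, done={"ingest_1"}))   # {'calibrate_1', 'ingest_2'}
```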
Teng Wang, Suren Byna, Bin Dong, and Houjun Tang (Lawrence Berkeley National Laboratory)
Abstract: High performance computing (HPC) architectures have been adding new layers of storage, such as burst buffers, to tolerate the latency gap between memory and disk-based file systems. However, existing file system and burst buffer solutions typically manage each storage layer individually. Consequently, the burden of managing data across multiple layers falls upon HPC system users.

Yida Wang (Beihang University); Changhai Zhao and Zengbo Wang (Research & Development Center, BGP Inc., CNPC); Chao Liu, Chao Li, and Haihua Yan (Beihang University); and Jiamin Wen (Research & Development Center, BGP Inc., CNPC)
Abstract: Seismic processing is an important technology in the petroleum industry. During the execution of seismic processing applications, large amounts of intermediate data are generated and accessed. Providing high-performance services for intermediate data in the traditional storage architecture is expensive. In addition, because of the existence of new storage devices, the heterogeneity of the storage environment has brought much inconvenience to application developers and petroleum scientists. In this paper, we present a hierarchical intermediate data storage system called HIDStore. HIDStore employs a distributed storage system based on local storage devices and idle network resources to accelerate intermediate data access. Our experiments show that using HIDStore can improve the performance of various seismic processing applications and the resource utilization in a compute cluster. HIDStore also abstracts different kinds of storage devices into hierarchical logical volumes and provides an easy-to-use API to access data. Developers can deal with intermediate data at a high level of abstraction. Applications based on HIDStore can fit into different storage environments and gain optimal performance automatically. Intermediate data in HIDStore is automatically evicted once it expires.

Wednesday 10:45am-12:15pm, Assembly Hall
Plenary, Paper Session VI: Best Papers (Areas 3 and 4)
Chair: Ron Brightwell (Sandia National Laboratories)

Best Paper
Chen Wang and Nikoli Dryden (University of Illinois at Urbana-Champaign), Franck Cappello (Argonne National Laboratory), and Marc Snir (University of Illinois at Urbana-Champaign)
Abstract: As we move toward exascale platforms, silent data corruptions (SDC) are likely to occur more frequently. Such errors can lead to incorrect results. Attempts have been made to use generic algorithms to detect such errors. Such detectors have demonstrated high precision and recall for detecting errors, but only if they run immediately after an error has been injected. In this paper, we propose a neural network based detector that can detect SDCs even multiple iterations after they were injected. We have evaluated our detector with 6 FLASH applications and 2 Mantevo mini-apps. Experiments show that our detector can detect more than 89% of SDCs with a false positive rate of less than 2%.

Best Paper
Xin Liang (UC Riverside), Sheng Di (Argonne National Laboratory), Dingwen Tao and Zizhong Chen (UC Riverside), and Franck Cappello (Argonne National Laboratory)
Abstract: Because of the ever-increasing execution scale of scientific applications, how to store the extremely large volume of data efficiently is becoming a serious issue.
A significant reduction of the scientific data size can effectively mitigate the I/O burden and save considerable storage space. Since lossless compressors suffer from limited compression ratios, error-controlled lossy compressors have been studied for years. Existing error-controlled lossy compressors, however, focus mainly on absolute error bounds, which cannot meet users' diverse demands such as pointwise relative error bounds. Although some of the state-of-the-art lossy compressors support pointwise relative error bounds, the compression ratios are generally low because of limitations in their designs and possible spiky data changes in local data regions. In this work, we propose a novel, efficient approach to perform compression based on the pointwise relative error bound with higher compression ratios than existing solutions provide. Our contribution is threefold. (1) We propose a novel transformation scheme that can transfer the pointwise relative-error-bounded compression problem into an absolute-error-bounded compression problem. We also analyze the practical properties of our transformation scheme both theoretically and experimentally. (2) We implement the proposed technique in two of the most popular absolute-error-bounded lossy compressors, SZ and ZFP. (3) We evaluate our solution using multiple real-world application datasets across different scientific domains on a supercomputer with up to 4,096 cores and 12 TB of data. Experiments show that our solution achieves over 1.38X dumping and 1.31X loading performance, respectively, compared with the second-best lossy compressor.

Wednesday 1:30pm-3:00pm, Assembly Hall
Paper Session VII: Benchmarking and Modeling
Chair: Tal Ben-Nun (ETH Zurich)

Omar Aaziz and Jeanine Cook (Sandia National Labs), Jonathan Cook (New Mexico State University), Tanner Juedeman (Sandia National Labs), David Richards (Lawrence Livermore National Laboratory), and Courtenay Vaughan (Sandia National Labs)
Abstract: Proxy applications are a simplified means for stakeholders to evaluate how both hardware and software stacks might perform on the class of real applications that they are meant to model. However, characterizing the relationship between them and their behavior is not an easy task. We present a data-driven methodology for characterizing the relationship between real and proxy applications based on collecting runtime data from both and then using data analytics to find their correspondence and divergence. We use new capabilities for application-level monitoring within LDMS (Lightweight Distributed Monitoring System) to capture hardware performance counter and MPI-related data. To demonstrate the utility of this methodology, we present experimental evidence from two system platforms, using four proxy applications from the current ECP Proxy Application Suite and their corresponding parent applications (in the ECP application portfolio). Results show that each proxy analyzed is representative of its parent with respect to computation and memory behavior. We also analyze communication patterns separately using mpiP data and show that communication for these four proxy/parent pairs is also similar.

Alexandru Calotoiu and Alexander Graf (TU Darmstadt), Torsten Hoefler (ETH Zurich), and Daniel Lorenz and Felix Wolf (TU Darmstadt)
Abstract: Given the tremendous cost of an exascale system, its architecture must match the requirements of the applications it is supposed to run as precisely as possible.
Conversely, applications must be designed such that building an appropriate system becomes feasible, motivating the idea of co-design. In this process, a fundamental aspect of the application requirements is the rate at which the demand for different resources grows as a code is scaled to a larger machine. However, if the anticipated scale exceeds the size of available platforms, this demand can no longer be measured. This is clearly the case when designing an exascale system. Moreover, creating analytical models to predict these requirements is often too laborious - especially when the number and complexity of target applications is high. In this paper, we show how automated performance modeling can be used to quickly predict application requirements for varying scales and problem sizes. Following this approach, we determine the exascale requirements of five scientific codes and use them to illustrate system design tradeoffs.

Next Stop "NoOps": Enabling Cross-System Diagnostics Through Graph-based Composition of Logs and Metrics
Michał Zasadziński and Marc Solé (CA Technologies), Alvaro Brandon (Universitat Politecnica de Madrid), Victor Muntés-Mulero (CA Technologies), and David Carrera (Universitat Politecnica de Catalunya)
Abstract: Performing diagnostics in IT systems is an increasingly complicated task, and it is not doable in satisfactory time by even the most skillful operators. Systems and their architecture change very rapidly in response to business and user demand. Many organizations see value in the maintenance and management model of NoOps, which stands for No Operations. One of the implementations of this model is a system that is maintained automatically without any human intervention. The path to NoOps involves not only precise and fast diagnostics but also reusing as much knowledge as possible after the system is reconfigured or changed. The biggest challenge is to leverage knowledge of one IT system and reuse it for diagnostics of another, different system. We propose a framework of weighted graphs which can transfer knowledge and perform high-quality diagnostics of IT systems. We encode all possible data in a graph representation of a system state and automatically calculate weights of these graphs. Then, thanks to the evaluation of similarity between graphs, we transfer knowledge about failures from one system to another and use it for diagnostics. We successfully evaluate the proposed approach on Spark, Hadoop, Kafka and Cassandra systems.

Wednesday 1:30pm-3:00pm, Minor Hall
Paper Session VIII: Managing Heterogeneity and Imbalance
Chair: Rong Ge (Clemson University)

Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing
M. Haseeb Javed, Xiaoyi Lu, and Dhabaleswar K. Panda (The Ohio State University)
Abstract: Over the last decade, organizations have become heavily reliant on providing near-instantaneous insights to the end user based on vast amounts of data collected from various sources in real time. To accomplish this task, a stream processing pipeline is constructed which consists of a Stream Processing Engine (SPE) and a Message Broker (MB). The SPE is responsible for performing computations on the data and providing insights from it. The MB acts as an intermediate queue to which data is written by ephemeral sources and then fetched by the SPE to perform computations on.
Due to the inherently real-time nature of such a pipeline, low latency is a highly desirable feature. Thus, many existing research works have focused on improving latency and throughput for the streaming pipeline. However, there is a dearth of studies optimizing the tail latencies of such pipelines. Moreover, the root cause of this high tail latency is still vague. In this paper, we propose a model-based approach to analyze in depth the reasons behind high tail latency in streaming systems such as Apache Kafka. Having found the MB to be a major contributor of messages with high tail latencies in a streaming pipeline, we design and implement a high-performance MB, called Frieda, with the higher goal of accelerating any arbitrary stream processing pipeline regardless of the SPE used. Our experiments show a reduction of up to 89% in 99.9th percentile latency for microbenchmarks and up to 31% for a full-fledged stream processing pipeline constructed using the Yahoo! Streaming Benchmark.

Young Ki Kim and M. Reza HoseinyFarahabady (The University of Sydney), Young Choon Lee (Macquarie University), Albert Y. Zomaya (The University of Sydney), and Raja Jurdak (CSIRO)
Abstract: The Lambda platform is a new concept based on event-driven server-less computation that empowers application developers to build scalable enterprise software in a virtualized environment without provisioning or managing any physical servers (a server-less solution). In reality, however, devising an effective consolidation method to host multiple Lambda functions on a single machine is challenging. Existing simple resource allocation algorithms, such as the round-robin policy used in many commercial server-less systems, suffer from a lack of responsiveness to a sudden surge in the incoming workload. This results in unsatisfactory performance degradation that is directly experienced by the end user of a Lambda application. In this paper, we address the problem of CPU cap management in a Lambda platform for ensuring different QoS enforcement levels in a platform with shared resources, in the case of fluctuations and sudden surges in the incoming workload requests. To this end, we present a closed-loop (feedback-based) CPU cap controller, which fulfills the QoS levels enforced by the application owners. The controller adjusts the number of working threads per QoS class and dispatches the outstanding Lambda functions along with the associated events to the most appropriate working thread. The proposed solution reduces QoS violations by an average of 6.36 times compared to the round-robin policy. It can also maintain the end-to-end response time of applications belonging to the highest-priority QoS class close to the target set-point while decreasing the overall response time by up to 52%.

Luna Xu and Ali R. Butt (Virginia Tech) and Seung-Hwan Lim and Ramakrishnan Kannan (Oak Ridge National Lab)
Abstract: Big data processing systems such as Spark are employed in an increasing number of diverse applications, such as machine learning, graph computation, and scientific computing, each with dynamic and different resource needs. These applications increasingly run on heterogeneous hardware, e.g., with out-of-core accelerators. However, big data platforms do not factor in this multi-dimensional heterogeneity of applications and hardware, leading to a fundamental mismatch between the application and hardware characteristics and the resource scheduling adopted in big data platforms.
For example, Hadoop and Spark consider only data locality when assigning tasks to nodes and typically disregard the hardware capabilities and their suitability to specific application requirements.

Wednesday 3:30pm-5:00pm, Assembly Hall
Paper Session IX: Graphs and Big Data Analytics
Chair: Ali R. Butt (Virginia Tech)

Keita Iwabuchi, Geoff Sanders, Keith Henderson, and Roger Pearce (Lawrence Livermore National Laboratory)
Abstract: The eccentricity of a vertex is defined as the length of the longest shortest path to any other vertex. While eccentricity is an important measure of vertex centrality, directly computing exact eccentricity for all vertices on large-scale graphs is prohibitively costly. Takes and Kosters proposed an iterative algorithm that uses multiple runs of single-source shortest path (SSSP) to compute lower and upper bounds on eccentricity at every vertex. Their technique converges to exact eccentricity by performing SSSP from only a small percentage of vertices, when sources are efficiently selected. However, their source selection strategies do not always yield rapid convergence.

Jianping Zeng and Hongfeng Yu (University of Nebraska-Lincoln)
Abstract: We present a new distributed community detection algorithm for large graphs based on the Louvain method. We exploit a distributed delegate partitioning to ensure workload and communication balance among processors. In addition, we design a new heuristic strategy to carefully coordinate the community constitution in a distributed environment and ensure the convergence of the distributed clustering algorithm. Our intensive experimental study has demonstrated the scalability and correctness of our algorithm with various large-scale real-world and synthetic graph datasets using up to 32,768 processors.

Ryan D. Friese, Nathan Tallent, Malachi Schram, Mahantesh Halappanavar, and Kevin J. Barker (Pacific Northwest National Laboratory)
Abstract: We present techniques for optimizing the performance of data-intensive workflows that execute on geographically distributed and heterogeneous resources. We optimize for both throughput and response time. Optimizing for throughput, we alleviate data-transfer bottlenecks. To hide the access times of remote data, we transparently introduce prefetching (overlapping data transfer and computation), without changing workflow source code. Optimizing for response time, we introduce intelligent scheduling for a set of high-priority tasks. We replace a greedy scheduler that assigns tasks without accounting for differing performance on heterogeneous resources, leading to long latencies. Intelligent scheduling rapidly selects a near-optimal solution for a bi-objective optimization problem. One objective is a good task assignment; the other objective is to minimize I/O contention by distributing load across resources and time. To reason about task completion times, we use modeling tools to generate accurate predictions of execution times. We show performance results for the Belle II workflow for high energy physics. The combination of these techniques can improve throughput over production Belle II configurations by 20-40%. Our work is general and adaptable to other distributed workflows.
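The eccentricity entry above refines the Takes and Kosters bounds with repeated SSSP runs. A minimal sketch of that bound update (assuming an unweighted, connected, undirected graph; illustrative code, not the paper's implementation): after an SSSP from source s with eccentricity ecc(s), each vertex v can tighten lower[v] to max(lower[v], d(s,v), ecc(s) - d(s,v)) and upper[v] to min(upper[v], ecc(s) + d(s,v)), and is resolved once the two bounds meet.

```python
from collections import deque

def bfs_distances(adj, s):
    """Single-source shortest paths by BFS for an unweighted graph."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def refine_bounds(adj, source, lower, upper):
    dist = bfs_distances(adj, source)
    ecc_s = max(dist.values())
    for v, d in dist.items():
        lower[v] = max(lower[v], d, ecc_s - d)
        upper[v] = min(upper[v], ecc_s + d)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}          # a path graph 0-1-2-3
lower = {v: 0 for v in adj}
upper = {v: len(adj) - 1 for v in adj}
refine_bounds(adj, 0, lower, upper)
print([v for v in adj if lower[v] == upper[v]])       # [0, 3] are resolved after one SSSP
```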
Wednesday 3:30pm-5:00pm, Minor Hall
Paper Session X: Performance Engineering in Filesystems
Chair: Anthony Skjellum (University of Tennessee at Chattanooga)

Anthony Kougkas (Illinois Institute of Technology Chicago), Hariharan Devarajan and Xian-He Sun (Illinois Institute of Technology), and Jay Lofstead (Sandia National Laboratories)
Abstract: Modern HPC systems employ burst buffer installations to reduce the peak I/O requirements for external storage and deal with the burstiness of I/O in modern scientific applications. These I/O buffering resources are shared between multiple applications that run concurrently. This leads to severe performance degradation due to contention, a phenomenon called cross-application I/O interference. In this paper, we first explore the negative effects of interference at the burst buffer layer and present two new metrics that can quantitatively describe the slowdown applications experience due to interference. We introduce Harmonia, a new dynamic I/O scheduler that is aware of interference, adapts to the underlying system, implements a new 2-way decision-making process and employs several scheduling policies to maximize system efficiency and applications' performance. Our evaluation shows that Harmonia, through better I/O scheduling, can outperform existing state-of-the-art buffering management solutions by 3x and can lead to better resource utilization.

Rohan Garg and Apoorve Mohan (Northeastern University), Michael Sullivan (Nvidia Research), and Gene Cooperman (Northeastern University)
Abstract: Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM has become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year.

Dingwen Tao (The University of Alabama); Sheng Di (Argonne National Laboratory); Xin Liang and Zizhong Chen (University of California, Riverside); and Franck Cappello (Argonne National Laboratory)
Abstract: Error-controlled lossy compression has been studied for years because of the extremely large volumes of data being produced by today's scientific simulations. None of the existing lossy compressors, however, allows users to fix the peak signal-to-noise ratio (PSNR) during compression, although PSNR has been considered one of the most significant indicators for assessing compression quality. In this paper, we propose a novel technique providing fixed-PSNR lossy compression for scientific data sets. We implement our proposed method based on the SZ lossy compression framework and release the code as an open-source toolkit. We evaluate our fixed-PSNR compressor on three real-world high-performance computing data sets. Experiments show that our solution has high accuracy in controlling PSNR, with an average deviation of 0.1 ∼ 5.0 dB on the tested data sets.
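For reference on the fixed-PSNR entry above: for a dataset with value range R (maximum minus minimum) and mean squared compression error MSE, PSNR is conventionally defined as below (a textbook definition, not a formula taken from the paper), so fixing the PSNR amounts to controlling the admissible MSE for a given R.

```latex
\mathrm{PSNR} \;=\; 20\log_{10}\!\left(\frac{R}{\sqrt{\mathrm{MSE}}}\right)
            \;=\; 20\log_{10} R \;-\; 10\log_{10}\mathrm{MSE}.
```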
Marc-André Vef, Nafiseh Moti, and Tim Süß (Johannes Gutenberg University Mainz); Tommaso Tocci, Ramon Nou, and Alberto Miranda (Barcelona Supercomputing Center); Toni Cortes (Barcelona Supercomputing Center, Universitat Politecnica de Catalunya); and André Brinkmann (Johannes Gutenberg University Mainz)
Abstract: We present GekkoFS, a temporary, highly scalable burst buffer file system which has been specifically optimized for new access patterns of data-intensive High-Performance Computing (HPC) applications. The file system provides relaxed POSIX semantics, only offering features which are actually required by most (not all) applications. It is able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of general-purpose parallel file systems.

Thursday 10:45am-12:15pm, Assembly Hall
Paper Session XI: Hierarchy and Sharing in System Architecture
Chair: Giorgis Georgakoudis (Queen's University Belfast)

Sascha Hunold (TU Wien) and Alexandra Carpen-Amarie (Fraunhofer ITWM)
Abstract: MPI benchmarks are used for analyzing or tuning the performance of MPI libraries. Generally, every MPI library should be adjusted to the given parallel machine, especially on supercomputers. System operators can define which algorithm should be selected for a specific MPI operation, and this decision is usually made after analyzing benchmark results. The problem is that the latency of communication operations in MPI is very sensitive to the chosen data acquisition and data processing method. For that reason, depending on how the performance is measured, system operators may end up with a completely different MPI library setup. In the present work, we focus on the problem of precisely measuring the latency of collective operations, in particular for small payloads, where external experimental factors play a significant role. We present a novel clock synchronization algorithm, which exploits the hierarchical architecture of compute clusters, and we show that it outperforms previous approaches, both in run-time and in precision. We also propose a different scheme to obtain precise MPI run-time measurements (called Round-Time), which is based on given, fixed time slices, as opposed to the traditional way of measuring for a predefined number of repetitions. We also highlight that the use of MPI_Barrier has a significant effect on experimentally determined latency values of MPI collectives. We argue that MPI_Barrier should be avoided if the average run-time of the barrier function is in the same order of magnitude as the run-time of the MPI function to be measured.

Philipp Samfass (Technical University of Munich), Jannis Klinkenberg (RWTH Aachen University), and Michael Bader (Technical University of Munich)
Abstract: "Equal work results in equal execution time" is an assumption that has fundamentally driven the design and implementation of parallel applications for decades. However, increasing hardware variability on current architectures (e.g., through Turbo Boost, dynamic voltage and frequency scaling, or thermal effects) necessitates a revision of this assumption. Expecting an increase of these effects on future (exascale) systems, we develop a novel MPI+OpenMP-only distributed work stealing concept that, based on on-line performance monitoring, selectively steals and remotely executes tasks across MPI boundaries.
This concept has been implemented in the parallel adaptive mesh refinement (AMR) framework sam(oa)^2 for OpenMP tasks that traverse a grid section. Corresponding performance measurements in the presence of enforced CPU clock frequency imbalances demonstrate that a state-of-the-art cost-based (chains-on-chains partitioning) load balancing mechanism is insufficient and can even degrade performance, whereas additional distributed work stealing successfully mitigates the frequency-induced imbalances. Furthermore, our results indicate that our approach is also suitable for balancing work-induced imbalances in a realistic AMR test case.

Guillaume Aupy (Inria); Anne Benoit (ENS de Lyon, Georgia Institute of Technology); Brice Goglin (Inria); Loïc Pottier (ENS de Lyon); and Yves Robert (ENS de Lyon, University of Tennessee)
Abstract: Co-scheduling techniques are used to improve the throughput of applications on chip multiprocessors (CMP), but sharing resources often generates critical interferences. We focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) how to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize the platform efficiency? Here, platform efficiency is defined as maximizing the performance either globally, or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.

Thursday 10:45am-12:15pm, Minor Hall
Paper Session XII: Scheduling, Elasticity and Energy
Chair: Felix Wolf (TU Darmstadt)

Jinwei Liu (Clemson University) and Haiying Shen and Ankur Sarker (University of Virginia)
Abstract: Task scheduling and preemption are two important functions in data-parallel clusters. Though directed acyclic graph task dependencies are common in data-parallel clusters, previous task scheduling and preemption methods do not fully utilize such task dependency to increase throughput, since they simply schedule precedent tasks prior to their dependent tasks or neglect the dependency. We notice that in both scheduling and preemption, choosing a task with more dependent tasks to run allows more tasks to be runnable next, which facilitates selecting a task that can further increase throughput. Accordingly, in this paper, we propose a Dependency-aware Scheduling and Preemption system (DSP) to achieve high throughput. First, we build an integer linear programming model to minimize the makespan (i.e., the time when all jobs finish execution) with consideration of task dependencies and deadlines, and derive the target server and start time for each task, which can minimize the makespan. Second, we utilize task dependency to determine tasks' priorities for preemption. Finally, we propose a method to reduce the number of unnecessary preemptions that cause more overhead than the throughput gain. Extensive experimental results based on a real cluster and the Amazon EC2 cloud service show that DSP achieves much higher throughput compared to existing strategies.
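The DSP entry above prefers tasks with more dependent tasks because finishing them makes more tasks runnable next. A toy sketch of that priority rule only (hypothetical task graph; not the paper's integer linear programming formulation):

```python
def direct_dependents(deps):
    """deps: task -> set of prerequisite tasks; count how many tasks directly depend on each task."""
    counts = {t: 0 for t in deps}
    for prereqs in deps.values():
        for p in prereqs:
            counts[p] += 1
    return counts

def pick_next(ready, deps):
    """Among runnable tasks, prefer the one that unblocks the most other tasks."""
    counts = direct_dependents(deps)
    return max(ready, key=lambda t: counts[t])

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": set()}
print(pick_next({"a", "e"}, deps))   # "a" (two direct dependents) is chosen over "e" (none)
```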
Benjamin Camus and Anne Blavette (CNRS), Fanny Dufossé (Inria), and Anne-Cécile Orgerie (CNRS)
Abstract: The growing appetite of new technologies, such as the Internet of Things, for Cloud resources leads to an unprecedented energy consumption for these infrastructures. In order to make these energy-hungry distributed systems more sustainable, Cloud providers resort more and more to on-site renewable energy production facilities like photovoltaic panels. Yet, this intermittent and variable electricity production is often uncorrelated with the Cloud consumption induced by its workload. Geographical load balancing, virtual machine (VM) migration and consolidation can be used to exploit multiple Cloud data centers' locations and their associated photovoltaic panels to increase their renewable energy consumption. However, these techniques cost energy and network bandwidth, and this limits their utilization. In this paper, we propose to rely on the flexibility brought by Smart Grids to exchange renewable energy between distributed sites and thus to further increase the overall Cloud's self-consumption of the locally-produced renewable energy. Our solution is named SCORPIUS: Self-Consumption Optimization of Renewable energy Production In distribUted cloudS. It takes into account telecommunication network constraints and electrical grid requirements to optimize the Cloud's self-consumption by trading off between VM migration and renewable energy exchange. Our simulation-based results show that SCORPIUS outperforms existing solutions on various workload traces of production Clouds in terms of both renewable self-consumption and overall energy consumption.

Alexandru Uta (Vrije Universiteit Amsterdam), Sietse Au and Alexey Ilyushkin (TU Delft), and Alexandru Iosup (Vrije Universiteit Amsterdam)
Abstract: Graphs are a natural fit for modeling concepts used in solving diverse problems in science, commerce, engineering, and governance. Responding to the diversity of graph data and algorithms, many parallel and distributed graph-processing systems exist. However, until now these platforms use a static model of deployment: they only run on a pre-defined set of machines. This raises many conceptual and pragmatic issues, including a misfit with the highly dynamic nature of graph processing, and could lead to resource waste and high operational costs. In contrast, in this work we explore the benefits and drawbacks of the dynamic model of deployment. Building a three-layer benchmarking framework for assessing elasticity in graph analytics, we conduct an in-depth elasticity study of distributed graph processing. Our framework is composed of state-of-the-art workloads, autoscalers, and metrics, derived from the LDBC Graphalytics benchmark and the SPEC RG Cloud Group's elasticity metrics. We uncover the benefits and cost of elasticity in graph processing: while elasticity allows for fine-grained resource management and does not degrade application performance, we find that graph workloads are sensitive to data migration while leasing or releasing resources. Moreover, we identify non-trivial interactions between scaling policies and graph workloads, which add an extra level of complexity to resource management and scheduling for graph processing.

Thursday 1:30pm-3:00pm, Assembly Hall
Paper Session XIII: Deep Learning
Chair: Michela Taufer (University of Delaware)

Dheeraj Sreedhar, Vaibhav Saxena, Yogish Sabharwal, and Ashish Verma (IBM Research - India) and Sameer Kumar (Google Inc.)
Abstract: Deep Neural Networks (DNNs) have achieved impressive accuracy in many application domains including image classification. Training of DNNs is an extremely compute-intensive process and is solved using variants of the stochastic gradient descent (SGD) algorithm. A lot of recent research has focused on improving the performance of DNN training. In this paper, we present optimization techniques to improve the performance of the data parallel synchronous SGD algorithm using the Torch framework: (i) we maintain data in memory to avoid file I/O overheads, (ii) we propose optimizations to the Torch data parallel table framework that handles multithreading, and (iii) we present MPI optimizations to minimize communication overheads. We evaluate the performance of our optimizations on a Power 8 Minsky cluster with 64 nodes and 256 NVIDIA Pascal P100 GPUs. With our optimizations, we are able to train 90 epochs of the ResNet-50 model on the ImageNet-1k dataset using 256 GPUs in just 48 minutes. This significantly improves on the previously best known performance of training 90 epochs of the ResNet-50 model on the same dataset using the same number of GPUs in 65 minutes. To the best of our knowledge, this is the best known training performance demonstrated for the ImageNet-1k dataset using 256 GPUs.

Yosuke Oyama (Tokyo Institute of Technology); Tal Ben-Nun and Torsten Hoefler (ETH Zurich); and Satoshi Matsuoka (RIKEN Center for Computational Science, Tokyo Institute of Technology)
Abstract: cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably, depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often resorts to slower algorithms that fit the workspace size constraints. We present u-cuDNN, a thin wrapper library for cuDNN that transparently divides layers' mini-batch computation into multiple micro-batches, both on a single GPU and on a heterogeneous set of GPUs. Based on Dynamic Programming and Integer Linear Programming (ILP), u-cuDNN enables faster algorithms by decreasing the workspace requirements. At the same time, u-cuDNN does not decrease the accuracy of the results, effectively decoupling statistical efficiency from hardware efficiency. We demonstrate the effectiveness of u-cuDNN for the Caffe and TensorFlow frameworks, achieving speedups of 1.63x for AlexNet and 1.30x for ResNet-18 on a V100-SXM2 GPU. We also show that u-cuDNN achieves speedups of up to 4.54x, and 1.60x on average, for DeepBench's convolutional layers. In a distributed setting, u-cuDNN attains a speedup of 2.20x when training ResNet-18 on a heterogeneous GPU cluster over a single GPU. These results indicate that using micro-batches can seamlessly increase the performance of deep learning, while maintaining the same overall memory footprint.
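The u-cuDNN entry above chooses per-layer micro-batch sizes so that faster convolution algorithms fit within the workspace limit. The sketch below uses an assumed linear workspace model and made-up numbers purely to illustrate the selection logic; it is not the u-cuDNN DP/ILP formulation and does not call cuDNN.

```python
def choose_micro_batch(mini_batch, ws_limit, algos, launch_overhead=0.05):
    """algos: name -> (time_per_sample, workspace_per_sample). Return (time, micro_batch, algo)."""
    best = None
    for micro in (m for m in range(1, mini_batch + 1) if mini_batch % m == 0):
        for name, (t, ws) in algos.items():
            if ws * micro > ws_limit:
                continue  # this algorithm does not fit at this micro-batch size
            total = t * mini_batch + launch_overhead * (mini_batch // micro)
            if best is None or total < best[0]:
                best = (total, micro, name)
    return best

# A fast algorithm with a large workspace becomes usable once the mini-batch is split.
algos = {"implicit_gemm": (1.0, 1), "winograd": (0.4, 64)}
print(choose_micro_batch(mini_batch=256, ws_limit=4096, algos=algos))  # picks winograd at micro=64
```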
Liandeng Li, Jiarui Fang, and Haohuan Fu (Tsinghua Univ., National Supercomputing Center in Wuxi); Jinlei Jiang (Tsinghua Univ.); Wenlai Zhao and Conghui He (Tsinghua Univ., National Supercomputing Center in Wuxi); Xin You (Beihang Univ.); and Guangwen Yang (Tsinghua Univ., National Supercomputing Center in Wuxi)
Abstract: This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, one of the fastest supercomputers in the world, which adopts a unique heterogeneous many-core architecture. First, we point out some insightful principles to fully exploit the performance of the innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe. Third, we put forward a topology-aware parameter synchronization scheme to scale the synchronous Stochastic Gradient Descent (SGD) method to multiple processors efficiently. We evaluate our framework by training a variety of widely used neural networks with the ImageNet dataset. On a single node, swCaffe achieves 23%~119% of the overall performance of Caffe running on a K40m GPU. Compared with Caffe on CPU, swCaffe runs 3.04~7.84× faster on all networks. When training ResNet-50 and AlexNet with 1024 nodes, swCaffe achieves up to 715.45× and 928.15× speedups, respectively.

Thursday 1:30pm-3:00pm, Minor Hall
Paper Session XIV: Languages and Programming Models
Chair: Sebastian Kuckuk (Friedrich-Alexander-Universität Erlangen-Nürnberg)

Juan J. Galvez, Karthik Senthil, and Laxmikant V. Kale (University of Illinois at Urbana-Champaign)
Abstract: Parallel programming can be extremely challenging. Programming models have been proposed to simplify this task, but wide acceptance of these remains elusive for many reasons, including the demand for greater accessibility and productivity.

Tim Suess (Johannes Gutenberg University Mainz, Institute of Computer Science); Nils Doering and André Brinkmann (Johannes Gutenberg University Mainz, ZDV); and Lars Nagel (Loughborough University, Department of Computer Science)
Abstract: The internal parallelism of compute resources continues to increase, and graphics processing units (GPUs) and other accelerators have been gaining importance in many domains. Researchers from life science, bioinformatics or artificial intelligence, for example, use GPUs to accelerate their computations. However, languages typically used in some of these disciplines often do not benefit from the technical developments because they cannot be executed natively on GPUs. Instead, existing programs must be rewritten in other, less dynamic programming languages. On the other hand, the gap in programming features between accelerators and common CPUs continues to shrink. Since accelerators are becoming more competitive with regard to general computations, they will not be mere special-purpose processors in the future. It is a valid assumption that future GPU generations can be used in a similar or even the same way as CPUs and that compilers or interpreters will be needed for a wider range of computer languages. We present CuLi, an interactive Lisp interpreter that performs all computations on a CUDA-capable GPU. The host system is needed only for the input and the output. At the moment, Lisp programs running on CPUs outperform Lisp programs on GPUs, but we present trends indicating that this might change in the future.
Our study gives an outlook on the possibility of running Lisp programs or other dynamic programming languages on next-generation accelerators. Herbert Jordan (University of Innsbruck); Thomas Heller (University of Erlangen-Nuremberg); Philipp Gschwandtner, Peter Zangerl, and Peter Thoman (University of Innsbruck); Dietmar Fey (University of Erlangen-Nuremberg); and Thomas Fahringer (University of Innsbruck) Abstract Abstract Contemporary state-of-the-art runtime systems underlying widely utilized general-purpose parallel programming languages and libraries like OpenMP, MPI, or OpenCL provide the foundation for accessing the parallel capabilities of modern computing architectures. In the tradition of their respective imperative host languages, those runtime systems' main focus is on providing means for the distribution and synchronization of operations, while the organization and management of the manipulated data is left to application developers. Consequently, the distribution of data remains inaccessible to those runtime systems. However, many desirable system-level features depend on a runtime system's ability to exercise control over the distribution of data. Thus, the programming models underlying traditional systems lack the potential to support those features. Thursday 3:30pm-5:00pm Assembly Hall Paper Paper session XV: Scalable Filesystems Chair: Jay Lofstead (Sandia National Laboratories) Lihui Wu, Weigang Wu, and Ning Huang (Sun Yat-sen University) and Zhiguang Chen (Sun Yat-sen University) Abstract Abstract State machine replication (SMR) is a fundamental fault-tolerant technique for distributed systems to guarantee consistency among replicas via sequential execution of commands. With the development of cloud computing, parallel SMR has recently been proposed for large-scale cloud datacenters. In this paper, we propose PDFE, a novel parallel SMR scheme, which realizes flexible dispatch of parallel ordered commands for parallel execution. In PDFE, the mapping/binding between ordering threads and work threads becomes dynamic, and commands can be dispatched according to the workload level of different threads. Such flexibility can help achieve two levels of load balancing: load balancing between ordering threads and work threads, and load balancing among work threads. The major challenge in our work lies in the inconsistency problem caused by dynamic changes in command dispatch, and it is addressed by a specially designed mechanism. Compared with existing parallel SMR schemes, PDFE can achieve better load balancing and higher system efficiency. Such advantages are validated by experimental performance evaluation. Teng Wang, Suren Byna, Glenn Lockwood, and Nicholas Wright (Lawrence Berkeley National Laboratory) and Phil Carns and Shane Snyder (Argonne National Laboratory) Abstract Abstract Modern HPC systems are collecting abundant I/O performance instrumentation. The massive volume and heterogeneity of this instrumentation have made it difficult to perform in-depth integrated analysis in a timely manner, however. To overcome this gap and to allow users to identify the root causes of poor application I/O performance, we present IOMiner, an I/O log analytics framework. IOMiner provides easy-to-use interfaces for analyzing instrumentation data, a unified storage schema that hides the heterogeneity of the raw instrumentation data, and a sweep-line-based algorithm for root cause analysis of poor application I/O performance.
IOMiner is implemented atop Spark to facilitate efficient, interactive, parallel analytics. We demonstrate the capabilities of IOMiner by using it to analyze logs collected on a large-scale production HPC system. Our analysis techniques not only uncover the root cause of poor I/O performance in key application case studies but also provide new insight into HPC I/O workload characterization. Jingqing Mu and Jerome Soumagne (The HDF Group); Houjun Tang, Suren Byna, and Quincey Koziol (Lawrence Berkeley National Laboratory); and Richard Warren (The HDF Group) Abstract Abstract On the road to exascale, the high-performance computing (HPC) community is seeing the emergence of multi-tier storage systems that take aim at the unprecedented amount of data being produced by scientific applications. Multi-tier storage in HPC systems takes advantage of new technologies such as non-volatile memory, burst buffers, etc., and brings the benefit of increased I/O bandwidth. However, existing data management solutions for HPC applications, which were mainly developed for single-tier storage, are no longer suitable for handling the increased level of complexity and currently delegate the task of handling deep memory hierarchies back to the user. This presents programming challenges for application developers and is an opportunity for defining new storage and I/O paradigms. We describe a novel object-based data abstraction that takes advantage of deep memory hierarchies by providing a simplified programming interface that enables autonomous, asynchronous, and transparent data movement with a server-driven architecture. Users can define a mapping between application memory and these abstract storage objects, creating a linkage to either all or part of an object’s content without making data copy or transfer calls and avoiding explicit management of complex data movement across multiple storage hierarchies. We evaluate our system by storing plasma physics simulation data with different storage layouts. Thursday 3:30pm-5:00pm Minor Hall Paper Paper session XVI: Algorithms, Applications and Performance Chair: Kenneth B. Kent (University of New Brunswick) Peng Chen (Tokyo Institute of Technology; AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology); Mohamed Wahib (National Institute of Advanced Industrial Science and Technology, Tokyo, Japan); Shinichiro Takizawa (AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology); Ryousei Takano (National Institute of Advanced Industrial Science and Technology, Tokyo, Japan); and Satoshi Matsuoka (Tokyo Institute of Technology; RIKEN Center for Computational Science, Hyogo, Japan) Abstract Abstract Two-dimensional Summed Area Tables (SATs) are a fundamental primitive used in image processing and machine learning applications. We present a collection of optimization methods for computing SATs on CUDA-enabled GPUs. Conventional approaches rely on computing the prefix sum in one dimension in parallel, transposing the matrix, then computing the prefix sum for the other dimension in parallel. Additionally, conventional methods use the scratchpad memory as a cache. We propose a collection of algorithms that are scalable with respect to problem size.
We use the register cache technique instead of the scratchpad memory and also employ a naive serial scan at the thread level for computing the prefix sum for one of the dimensions. Using a novel transpose-in-registers method, we increase the inter-thread parallelism and outperform conventional SAT implementations. In addition, we significantly reduce both the communication between threads and the number of arithmetic instructions. On Nvidia Pascal P100 and Volta V100 GPUs, our evaluations demonstrate that our implementations outperform state-of-the-art libraries, yielding up to 2.3x and 3.2x speedups over the OpenCV and Nvidia NPP libraries, respectively. Balazs Nemeth, Tom Haber, Jori Liesenborgs, and Wim Lamotte (Universiteit Hasselt) Abstract Abstract Sequential Monte Carlo methods are a useful tool to tackle non-linear problems in a Bayesian setting. A target posterior distribution is approximated by moving a set of weighted particles through a sequence of distributions. To counteract degeneracy caused by sequentially changing the underlying distribution, particles occasionally need to be resampled. Deciding if this is necessary requires a reduction operation on the weights after each update. Hence, scalability on a cluster is not only determined by the number of particles used, but also by how well load is balanced. This paper shows how speculative execution in Sequential Monte Carlo with Markov Chain Monte Carlo steps can improve parallel scalability. The key insight is that decisions taken based on the reduction result in each step can be accurately predicted. Consequently, synchronization inherent in the reduction can, in most cases, be avoided, relaxing the limit imposed by load imbalance. Particles are renumbered during resampling to further improve accuracy. Multiple test scenarios, each with different load balance characteristics, are studied empirically on a compute cluster. Tests show that when decisions are predicted correctly, execution time is reduced drastically for use cases with high load imbalance. Furthermore, the maximum theoretical gain, derived from execution characteristics, is compared with the measured improvement to verify that most speculative evaluations are actually useful. If predictions are incorrect, or load is balanced, speculation has no measurable negative impact. Performance is also evaluated in a weak-scaling setting on a cluster with 36 cores in each system. Shweta Salaria (Tokyo Institute of Technology, AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory); Aleksandr Drozd and Artur Podobas (Tokyo Institute of Technology); and Satoshi Matsuoka (RIKEN Center for Computational Science, Tokyo Institute of Technology) Abstract Abstract Performance prediction of parallel applications across systems is becoming increasingly important in today's diverse computing environments. A wide range of choices in execution platforms poses new challenges to researchers in choosing a system that best fits their workloads and to administrators in scheduling applications to the best-performing systems. While previous studies have employed simulation- or profile-based prediction approaches, such solutions are too time-consuming to deploy on multiple platforms. To address this problem, we use two collaborative filtering techniques to build analytical models that can quickly and accurately predict the performance of workloads across different multicore systems.
The first technique leverages information gained from performance observed for certain applications on a subset of systems and uses it to discover similarities among applications as well as systems. The second collaborative-filtering-based model learns latent features of systems and workloads automatically and uses these features to characterize the performance of applications on different platforms. We evaluated both methods using 30 workloads chosen from the NAS Parallel Benchmarks, BOTS, and Rodinia benchmarking suites on ten different systems. Our results show that such collaborative filtering methods can make predictions with an RMSE as low as 0.6 and an average RMSE of 1.6. Tuesday 9:00am-10:15am Assembly Hall Plenary Welcome and Keynote I Chair: Paul H. J. Kelly (Imperial College London) Crossing the Chasm: How to develop weather and climate models for next generation computers Chris Maynard (University of Reading, Met Office) Biography Biography Chris Maynard (University of Reading, Met Office) Chris Maynard has more than 20 years' experience developing scientific software for supercomputers in diverse fields such as Quantum Field Theory, Magnetic Materials, Group Theory and Computational Fluid Dynamics. He is leading the software development of the Met Office’s new weather and climate model capable of running on so-called Exascale computing architectures. In January 2018 he joined the University of Reading Computer Science department part-time as an Associate Professor, whilst retaining his former role with the Met Office. His research interests include scientific software development, Programming Models, Performance Modelling and Optimisation, and Parallel Computing, as well as new and novel processors necessary to build an Exascale Computer. Abstract Abstract Energy constraints in processor design have led to radically different computer architectures than, for example, the ubiquitous x86 architecture. The newer generations of processors are not faster but inherently more parallel. For scientific and numerical computing in general, this poses a number of challenges. Rapidly evolving hardware and the absence of a programming model in which to express this explosion of parallelism further compound these problems. This is particularly acute for weather and climate codes, which are large and complex, with many hundreds of thousands of lines of code, represent a significant investment in scientific development, and have long development cycles of a decade or more. In this talk I will outline some approaches to this software challenge, such as developing domain-specific languages, being taken by several European groups to cross the chasm between our scientific aspiration to exploit Exascale computing and our ability to port, develop and adapt existing code to new architectures.
Wednesday 9:00am-10:15am Assembly Hall Plenary Announcements and keynote II Chair: Dimitrios Nikolopoulos (Queen's University Belfast) Unconventional Computing with Reconfigurable Devices in the Cloud Michaela Blott (Xilinx Research) Biography Biography Michaela Blott (Xilinx Research) Michaela Blott is a Principal Engineer at Xilinx Research, where she heads a team of international scientists driving research into new application domains for FPGAs, such as machine learning and hyperscale deployments. She graduated from the University of Kaiserslautern in Germany and brings over 25 years of experience in computer architecture, FPGA and board design, working in both research institutions (ETH and Bell Labs) and development organizations. She is strongly involved with the international research community as technical co-chair of FPL 2018 and as an industry advisor on numerous projects, and she serves on numerous technical program committees (DATE, FPGA, FPL, GLOBALSIP, HiPEAC). Abstract Abstract Conventional von Neumann architectures are suffering from rising power densities, and performance scaling is slowing down with next-generation technology nodes, while at the same time we are faced with an explosion of data and sky-high compute requirements associated with the roll-out of machine learning algorithms.
Reconfigurable logic with FPGAs can tailor the hardware to the application through customized datapaths and memory architectures. FPGAs can thereby achieve much higher energy efficiency than conventional CPU- and GPU-based solutions. This has stimulated interest in their exploitation within power-hungry data centers, with recent benchmarks showing that FPGA-based application acceleration can bring orders-of-magnitude improvements in performance and performance per watt compared to CPU- and GPU-based counterparts. During this talk, we broadly characterize a range of applications and explore how, through these unconventional customized compute architectures, new levels of performance scalability and compute efficiency can be unleashed. Wednesday 10:45am-12:15pm Assembly Hall Plenary, Paper Session VI: Best papers - Areas 3 and 4 Chair: Ron Brightwell (Sandia National Laboratories) Best Paper Chen Wang and Nikoli Dryden (University of Illinois at Urbana-Champaign), Franck Cappello (Argonne National Laboratory), and Marc Snir (University of Illinois at Urbana-Champaign) Abstract Abstract As we move toward exascale platforms, silent data corruptions (SDCs) are likely to occur more frequently. Such errors can lead to incorrect results. Attempts have been made to use generic algorithms to detect such errors. Such detectors have demonstrated high precision and recall for detecting errors, but only if they run immediately after an error has been injected. In this paper, we propose a neural-network-based detector that can detect SDCs even multiple iterations after they were injected. We have evaluated our detector with 6 FLASH applications and 2 Mantevo mini-apps. Experiments show that our detector can detect more than 89% of SDCs with a false-positive rate of less than 2%. Best Paper Xin Liang (UC Riverside), Sheng Di (Argonne National Laboratory), Dingwen Tao and Zizhong Chen (UC Riverside), and Franck Cappello (Argonne National Laboratory) Abstract Abstract Because of the ever-increasing execution scale of scientific applications, how to store the extremely large volume of data efficiently is becoming a serious issue. A significant reduction of the scientific data size can effectively mitigate the I/O burden and save considerable storage space. Since lossless compressors suffer from limited compression ratios, error-controlled lossy compressors have been studied for years. Existing error-controlled lossy compressors, however, focus mainly on absolute error bounds, which cannot meet users' diverse demands such as pointwise relative error bounds. Although some of the state-of-the-art lossy compressors support pointwise relative error bounds, the compression ratios are generally low because of limitations in their designs and possible spiky data changes in local data regions. In this work, we propose a novel, efficient approach to perform compression based on the pointwise relative error bound with higher compression ratios than existing solutions provide. Our contribution is threefold. (1) We propose a novel transformation scheme that can transform the pointwise relative-error-bounded compression problem into an absolute-error-bounded compression problem. We also analyze the practical properties of our transformation scheme both theoretically and experimentally. (2) We implement the proposed technique in two of the most popular absolute-error-bounded lossy compressors, SZ and ZFP.
(3) We evaluate our solution using multiple real-world application datasets across different scientific domains on a supercomputer with up to 4,096 cores and 12 TB of data. Experiments show that our solution achieves more than 1.38X dumping and 1.31X loading performance, respectively, over the second-best lossy compressor. Thursday 9:00am-10:15am Assembly Hall Plenary Cluster 2019 presentation and Keynote III Chair: Bronis R. de Supinski (Lawrence Livermore National Laboratory) Programmability, Portability and Performance: Challenges and Opportunities in Enabling Usable Systems in the Exascale Era Kathryn O’Brien (IBM) Biography Biography Kathryn O’Brien (IBM) Kathryn O’Brien is a Principal Research Staff Member at IBM T.J. Watson Research Center, where she has worked for over 25 years, holding various technical, managerial and staff positions. Early on, as part of the Research compiler team, she worked on the first automatic parallelizing and vectorizing compilers, and later she managed the compiler team that implemented OpenMP on the CELL heterogeneous architecture. Since that time, she has been heavily engaged in the adoption of OpenMP across a range of product and research compiler efforts. Over the last 8 years she has been part of the leadership team driving IBM Research’s Exascale program, where her focus has been on the evolution and development of the broader software programming and tools environment. She is currently the IBM Research Compiler Strategist and Programming Models technical lead, as well as the PI for the CORAL NRE program. As part of this latter role she has been responsible for driving the Programming Model strategy for the IBM CORAL and future HPC and large-scale systems. Abstract Abstract IBM recently assumed the 1st and 3rd positions on the Top500 list with the DOE CORAL Summit and Sierra systems, respectively. This achievement represents the culmination of a strong technical collaboration between various groups within IBM and the CORAL partners. In this talk, I will discuss the challenges and opportunities encountered in the course of building a CORAL programming environment to address the critical requirements of Programmability, Performance and Portability. I will also briefly look forward to the ongoing challenges in evolving these programming environments beyond purely HPC to address the broader and ubiquitous requirements of machine learning applications.