P01: Jonas Posner (University of Kassel, Germany), System-Level vs. Application-Level Checkpointing
- “Fault tolerance is becoming increasingly important since the probability of permanent hardware failures increases with machine size. A typical resilience approach to fail/stop failures today is checkpointing, which can be performed on system- or application-level. Both levels come in many variants, but they fundamentally differ. On system-level, no code changes are required, full program states are saved, and after a failure the program must be restarted from the last checkpoint. In contrast, on application-level, only user-defined data are checkpointed, which requires some programming effort. Thereby, the running time overhead may be reduced significantly, and programs may continue execution after failures.
- Typical representatives include DMTCP (Distributed MultiThreaded Checkpointing) for system-level, and FTGLB (Fault Tolerant Global Load Balancing) for application-level. DMTCP is a user-space library, which checkpoints parallel programs transparently and restarts them from a checkpoint. DMTCP supports many programming languages and HPC environments.
- FTGLB is based on a distributed task-pool pattern and writes uncoordinated in-memory checkpoints. Checkpoints include only task descriptors and interim results, and are written at regular time intervals and at certain events, e.g., work stealing.
- In this work, we experimentally compare DMTCP and FTGLB with up to 320 processes. Moreover, we derive formulas for predicting running times, including failure handling. With these formulas, we compare DMTCP and FTGLB in failure-prone and larger settings. Overall, the results clearly show that the application-level optimizations of FTGLB are worthwhile since the running time overhead is significantly lower than that of DMTCP.”
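The application-level idea described above (checkpoint only task descriptors and interim results, then continue after a failure) can be illustrated with a toy, single-process sketch. It is not FTGLB or DMTCP code: the `TaskPoolWorker` class, the `run_with_failure` helper, the checkpoint interval, and the simulated failure are all hypothetical, and a real system would keep the checkpoint in another process's memory rather than in the same object.

```python
import copy

class TaskPoolWorker:
    """Toy worker that sums integer tasks and checkpoints its own state in memory."""

    def __init__(self, tasks, checkpoint_every=3):
        self.tasks = list(tasks)          # task descriptors (here: plain integers)
        self.partial_result = 0           # interim result
        self.checkpoint_every = checkpoint_every
        self.processed = 0
        self.checkpoint = None
        self.save_checkpoint()            # initial checkpoint before any work

    def save_checkpoint(self):
        # Only user-defined data are saved: remaining tasks and the interim result.
        self.checkpoint = (copy.deepcopy(self.tasks), self.partial_result)

    def restore(self):
        # Roll back to the last checkpoint; work done since then is repeated.
        tasks, result = self.checkpoint
        self.tasks = copy.deepcopy(tasks)
        self.partial_result = result

    def step(self):
        task = self.tasks.pop()
        self.partial_result += task
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.save_checkpoint()

def run_with_failure(tasks, fail_after):
    """Process all tasks, simulating one state loss after `fail_after` steps."""
    worker = TaskPoolWorker(tasks)
    steps = 0
    failed = False
    while worker.tasks:
        if not failed and steps == fail_after:
            worker.restore()              # simulated failure: fall back to checkpoint
            failed = True
            continue
        worker.step()
        steps += 1
    return worker.partial_result
```

In this sketch a failure costs only the steps taken since the last checkpoint, and the final result matches a failure-free run.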
P02: Hoon Ryu (Korea Institute of Science and Technology Information), Ji-Hoon Kang, A HPC-based Prediction on the Practicality of Long-distance Quantum Key Distributions
- The practicality of long-distance quantum communication based on the BB84 quantum key distribution (QKD) protocol is examined with Monte Carlo simulations coupled with parallel computing. Given a quantum channel that is not free from noise, optimal sizes of the shared key information and the corresponding chances of detecting an eavesdropper are calculated to provide clues for assessing the utility of the protocol. By presenting simple but sound principles that have so far received little attention, this work serves as a useful case study that shows the need for high-performance computing in QKD modeling, and it can trigger further modeling studies that involve more realistic and complicated factors.
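As a rough illustration of the kind of Monte Carlo estimate described above, the following sketch simulates BB84 key sifting under an intercept-resend eavesdropper. It is not the authors' model: it assumes a noiseless channel and a specific attack, and the function names `sifted_error_rate` and `detection_probability` are hypothetical. Under these assumptions the eavesdropper introduces errors in about 25% of the sifted bits, so comparing n sifted bits detects her with probability 1 - 0.75^n.

```python
import random

def sifted_error_rate(n_qubits, eavesdrop, rng):
    """Estimate the error rate in the sifted key over n_qubits transmissions."""
    sifted = 0
    errors = 0
    for _ in range(n_qubits):
        bit = rng.randint(0, 1)          # Alice's raw key bit
        a_basis = rng.randint(0, 1)      # Alice's basis (0 or 1)
        state_bit, state_basis = bit, a_basis
        if eavesdrop:
            # Intercept-resend: Eve measures in a random basis and resends.
            e_basis = rng.randint(0, 1)
            e_bit = state_bit if e_basis == state_basis else rng.randint(0, 1)
            state_bit, state_basis = e_bit, e_basis
        b_basis = rng.randint(0, 1)      # Bob's basis
        # A measurement in the wrong basis yields a uniformly random bit.
        b_bit = state_bit if b_basis == state_basis else rng.randint(0, 1)
        if a_basis == b_basis:           # sifting: keep matching-basis rounds only
            sifted += 1
            errors += int(b_bit != bit)
    return errors / sifted if sifted else 0.0

def detection_probability(n_check_bits, qber=0.25):
    """Chance that at least one of n compared sifted bits reveals an error."""
    return 1.0 - (1.0 - qber) ** n_check_bits
```

Because each transmission is independent, the loop is straightforward to split across parallel workers, which is where high-performance computing would come in for larger or noisier models.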
P03: Masahiro Nakao (RIKEN R-CCS), Yuetsu Kodama, Katsuki Fujisawa, Mitsuhisa Sato, Koji Ueno, Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500
- There is increasing demand for the high-speed processing of large-scale graphs in various fields. However, such graph processing requires irregular calculations, making it difficult to scale performance on large-scale distributed memory systems. Against this background, Graph500, a competition for evaluating the performance of large-scale graph processing, has been held. We developed breadth-first search (BFS), which is one of the benchmark kernels used in Graph500, and took the top spot a total of 10 times using the K computer. In this paper, we tune BFS performance and evaluate it using the supercomputer Fugaku, which is the successor to the K computer. The results of evaluating BFS for a large-scale graph composed of about 1.1 trillion vertices and 17.6 trillion edges using 92,160 nodes of Fugaku indicate that Fugaku has 2.27 times the performance of the K computer. Fugaku took the top spot on Graph500 in June 2020.
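For readers unfamiliar with the kernel, a minimal single-node sketch of level-synchronous BFS is shown below. It only illustrates the traversal that produces a parent map; it is not the tuned, distributed implementation evaluated on Fugaku, and the function name `bfs_parents` and the dictionary-based adjacency format are assumptions made for this example.

```python
def bfs_parents(adjacency, root):
    """Level-synchronous BFS returning a {vertex: parent} map for reachable vertices.

    `adjacency` maps each vertex to an iterable of its neighbours.
    In this sketch the root is recorded as its own parent.
    """
    parent = {root: root}
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in parent:      # first visit fixes the parent
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier         # advance one BFS level
    return parent
```

The data-dependent `v not in parent` check and the scattered neighbour accesses are the irregular operations that make this kernel hard to scale on distributed-memory systems.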
P04: Wenqi Lou (University of Science and Technology of China), Chao Wang, Lei Gong, Xuehai Zhou, OctCNN: An Energy-Efficient FPGA Accelerator for CNNs using Octave Convolution Algorithm
- Recently, embedded FPGAs have been explored as a potential platform for deploying machine learning on edge devices due to their high energy efficiency and low cost. However, their limited resources also make the deployment of CNNs on FPGAs more challenging. In this paper, we present OctCNN, which utilizes the octave convolution (OctConv) algorithm to optimize the FPGA-based CNN accelerator. We first propose a novel architecture for deploying OctConv on FPGAs and then present a resource and performance analysis model to guide fast design space exploration. As a case study, we implement a classic CNN model, VGG16, on a Xilinx ZC702. Results show that, compared to a mobile-class CPU and GPU, OctCNN achieves 16.88 times and 2.43 times the energy efficiency, respectively. It also has promising energy efficiency compared to previous FPGA accelerators.
P05: Tomohiro Kawanabe (RIKEN), Kazuma Hatta, Kenji Ono, ChOWDER: A New Approach for Viewing 3D Web GIS on Ultra-High-Resolution Scalable Display
- ChOWDER is an open-source, web-based scalable display system that consists of multiple display devices, on each of which a web browser operates in cooperation with the others to construct a single large pixel space. Newly introduced functionality for displaying 3D geographic information systems allows us to show large 3D geographic information on an ultra-high-resolution tiled display system. This paper describes the implementation method, use cases, and related work for this functionality.
P06: Yiyu Tan (RIKEN Center for Computational Science), Toshiyuki Imamura, An FPGA-based Sound Field Rendering System
- This research investigates the development of an FPGA-based sound field rendering system with the FDTD method, in which wave equations are directly implemented by the reconfigurable hardware. Spatial blocking and temporal parallelism are applied to alleviate the external memory bandwidth bottleneck and speed up computation. Compared with the software simulation carried out on a desktop machine with a Xeon Gold 6212U processor and 512 GB of DRAM, the FPGA system achieves about a 13-fold speedup in computation performance.
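To make the FDTD update concrete, here is a minimal one-dimensional software sketch of the leapfrog scheme for the scalar wave equation. It is only an illustration: the poster's system presumably targets full sound fields in hardware, and the function name `fdtd_1d`, the impulse initial condition, the fixed zero-pressure boundaries, and the default Courant number of 0.5 are all assumptions of this example.

```python
def fdtd_1d(n_cells, n_steps, courant=0.5, source_index=None):
    """Advance a 1D pressure field with the standard second-order FDTD update.

    Update rule: p_next[i] = 2*p[i] - p_prev[i] + C^2 * (p[i+1] - 2*p[i] + p[i-1]),
    where C is the Courant number (must be <= 1 for stability in 1D).
    """
    if source_index is None:
        source_index = n_cells // 2
    c2 = courant * courant
    prev = [0.0] * n_cells
    prev[source_index] = 1.0             # unit impulse as the initial pressure
    curr = list(prev)                    # same field twice: zero initial velocity
    for _ in range(n_steps):
        nxt = [0.0] * n_cells            # boundary cells stay at zero pressure
        for i in range(1, n_cells - 1):
            laplacian = curr[i + 1] - 2.0 * curr[i] + curr[i - 1]
            nxt[i] = 2.0 * curr[i] - prev[i] + c2 * laplacian
        prev, curr = curr, nxt
    return curr
```

Each cell reads only its immediate neighbours from the two previous time steps, which is why the computation is memory-bandwidth bound and why blocking in space and pipelining across time steps help.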
P07: Antonis Papaioannou (ICS - FORTH), Chrysostomos Zeginis, Kostas Magoutis, The Case for Better Integrating Scalable Data Stores and Stream-Processing Systems
- Scalable stream-processing systems require external storage systems for long-term storage of non-ephemeral state. Recent research has pointed to scalable in-memory key-value stores, such as Redis, as an efficient solution for external state management. While such data stores have been interconnected with scalable streaming systems, they are currently managed independently, missing opportunities for optimizations such as exploiting locality between stream partitions and table shards and coordinating elasticity actions.
P08: Shuhei Kudo (RIKEN R-CCS), Keigo Nitadori, Takuya Ina, Toshiyuki Imamura, Prompt report on Exa-scale HPL-AI benchmark
- Our performance benchmark of HPL-AI on the supercomputer Fugaku was awarded at the 55th TOP500 at ISC20. The effective performance was 1.42 EFlop/s, the world’s first result to break the exascale barrier in a floating-point arithmetic benchmark. Due to the novelty of HPL-AI, there are few guidelines for large systems and several drawbacks to the large-scale benchmark. It is not enough to simply replace FP64 operations with FP32 or FP16 ones. At the least, we need thoughtful numerical analysis of lower-precision arithmetic and the introduction of optimization techniques for extensive computing such as on Fugaku. In the poster, we comment on accuracy, implementation, and performance improvement, and report on the exascale benchmark on Fugaku.
P09: Hao Zhang (RIKEN BDR), Itta Ohmura, Makoto Taiji, Implementing a Comprehensive Networks-on-Chip Generator with Optimal Configurations
- Networks-on-Chip has become the de facto design paradigm for many-core systems. This poster presents a general-purpose Networks-on-Chip (NoC) generator with optimal configurations, named EAGEN. The NoC producer module of EAGEN is written in Chisel3, which generates both Verilog and a C++ model for hardware implementation and software simulation, respectively. Once the application specification and design constraints are fed to EAGEN, synthesizable RTL code with the optimal configuration is output.
P10: Norihisa Fujita (University of Tsukuba), Ryohei Kobayashi, Yoshiki Yamaguchi, Kohji Yoshikawa, Makito Abe, Masayuki Umemura, Toward OpenACC-enabled GPU-FPGA Accelerated Computing
- Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore’s Law. These improvements reveal the possibility of implementing a concept that offloads, on the fly, computations at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose a GPU–FPGA-accelerated simulation based on this concept and show preliminary results.