IEEE Cluster 2024

Keynotes

Software Advancements Meeting Hardware Innovations: Challenges and Solutions.
Sunita Chandrasekaran
University of Delaware

Abstract

With rapid advancements in the computer architecture space, the migration of legacy applications to these novel architectures is in a perpetual state of catch-up. To navigate this dynamically evolving hardware landscape, software and toolchains must stay ahead of the architectural explosion curve. As challenging as this sounds, this synchronization between hardware and software is crucial for maximizing the benefits of advanced hardware platforms. To tackle this, the integration of AI with HPC is gaining prominence. Deep learning methodologies, in particular, capitalize on vast amounts of domain-specific data to discern patterns critical for scientific comprehension. By orchestrating the HPC+AI integration effectively, we can unlock not only advancements in science but also computational efficiency. The overall workflow, especially on large-scale systems, can be further enhanced by streaming simulation data directly to a machine-learning (ML) framework. This strategy bypasses conventional file-system bottlenecks, enabling transformation of data in transit, asynchronously with both the simulation process and model training. This talk will delve into these approaches, demonstrating the synergy between hardware innovation and software adaptation using legacy real-world scientific applications as case studies, at scale.
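The in-transit coupling described above can be pictured with a minimal, single-node Python sketch (an illustration only, not the speaker's actual tooling): a simulation thread pushes snapshots into a bounded in-memory queue, the data is transformed while in flight, and a trainer thread consumes it, so nothing is staged through the file system. Production frameworks such as ADIOS2 or SmartSim target this pattern across nodes and at scale.

    import queue
    import random
    import threading

    # A bounded in-memory queue stands in for the in-transit transport layer;
    # simulation output never touches the file system.
    channel = queue.Queue(maxsize=8)
    STOP = object()  # sentinel marking the end of the simulation stream

    def simulation(steps=32):
        """Producer: emits one 'snapshot' of field data per simulated time step."""
        for _ in range(steps):
            snapshot = [random.gauss(0.0, 1.0) for _ in range(1024)]
            channel.put(snapshot)      # blocks if the trainer falls behind
        channel.put(STOP)

    def transform(snapshot):
        """In-transit transformation, e.g. normalization or feature extraction."""
        mean = sum(snapshot) / len(snapshot)
        return [x - mean for x in snapshot]

    def trainer():
        """Consumer: turns streamed snapshots into training samples."""
        seen = 0
        while True:
            item = channel.get()
            if item is STOP:
                break
            sample = transform(item)   # data is reshaped while in flight
            # model.update(sample) would go here with a real ML framework
            seen += 1
        print(f"trained on {seen} streamed snapshots")

    sim = threading.Thread(target=simulation)
    ml = threading.Thread(target=trainer)
    sim.start(); ml.start()
    sim.join(); ml.join()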

Biography

Sunita Chandrasekaran is an Associate Professor in the Department of Computer and Information Sciences at the University of Delaware (UDEL), where she also co-directs the AI Center of Excellence. Her research focuses on high-performance and exascale computing as well as machine learning and AI. Her projects involve leveraging high-level programming models and abstractions to enhance real-world scientific legacy applications, often porting them to hundreds or thousands of GPUs, including the #1 system, Frontier. Additionally, her group probes the usability of large language models (LLMs) for software testing and develops machine learning-based frameworks to gain deeper insights into drug responses in cancer research.

Modular Supercomputing at (and beyond) Exascale.
Estela Suarez
Juelich Supercomputing Centre
University of Bonn

Abstract

The variety of applications using or striving to use high performance computing (HPC) resources is growing, as are their needs for compute and data management. Simultaneously, the physical limits facing Moore’s law demand disruptive compute solutions, which industry is delivering in the form of a wide variety of custom-designed processing units, acceleration devices and, going even beyond, quantum and neuromorphic computers. The challenge for hosting sites consists in finding an HPC system design able to fulfil the needs of all kinds of user applications, integrating a diversity of processing solutions, and keeping energy consumption and operational costs at bay.

The Modular Supercomputing Architecture (MSA) addresses these challenges by orchestrating diverse resources like CPUs, GPUs, and accelerators at the system level, organizing them into compute modules that are interconnected via a high-speed network and run using a common software stack. Each module is configured to cater to specific application classes and user requirements. Exotic technologies such as quantum or neuromorphic devices are attached as additional modules to which part of the computational load of a given application can be offloaded. Users can individually determine the modules and the number of nodes per module that their jobs will use, according to their application's needs. System administrators employ advanced scheduling mechanisms in conjunction with a comprehensive monitoring and analysis framework to maximize system utilization.
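As a purely illustrative Python sketch (the module names and numbers below are hypothetical, not JSC's actual interface), a modular job can be thought of as a list of per-module resource requests; on production systems such requests are typically expressed through the batch system, e.g. as Slurm heterogeneous jobs.

    from dataclasses import dataclass

    @dataclass
    class ModuleRequest:
        """One component of a modular job: which module, and how much of it."""
        module: str          # e.g. "cluster" (CPU), "booster" (GPU), "quantum"
        nodes: int           # number of nodes requested on that module
        tasks_per_node: int  # MPI ranks (or workers) per node

    # Hypothetical workflow: CPU nodes run the simulation, GPU nodes train a
    # surrogate model, and a small quantum module takes one offloaded kernel.
    job = [
        ModuleRequest(module="cluster", nodes=64, tasks_per_node=48),
        ModuleRequest(module="booster", nodes=8, tasks_per_node=4),
        ModuleRequest(module="quantum", nodes=1, tasks_per_node=1),
    ]

    for part in job:
        print(f"{part.module}: {part.nodes} nodes x {part.tasks_per_node} tasks/node")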

This talk will detail MSA's key features, hardware, and software components, and discuss MSA's prospects in the context of Exascale and Post-Exascale computing.

Biography

Prof. Dr. Estela Suarez is Joint Lead of the department “Novel System Architecture Design” at the Jülich Supercomputing Centre, which she joined in 2010. Since 2022 she has also been Associate Professor of High Performance Computing at the University of Bonn. Her research focuses on HPC system architecture and codesign. As leader of the DEEP project series she has driven the development of the Modular Supercomputing Architecture, including hardware, software and application implementation and validation. Additionally, since 2018 she has led the codesign efforts within the European Processor Initiative. She holds a PhD in Physics from the University of Geneva (Switzerland) and a Master's degree in Astrophysics from the Complutense University of Madrid (Spain).

Similarities and Differences of Scaling Laws in AI and HPC.
Rio Yokota
Tokyo Institute of Technology

Abstract

When computers were less powerful, and data was less abundant, many sophisticated models were invented to simulate (in HPC) or learn (in AI) the complex world around us. As the capability of computers has improved at an exponential rate for the past fifty years, brute-force computing of simpler models has made sophisticated models obsolete in some areas. There are similarities and differences between how this has happened in the fields of HPC and AI. For example, in the field of turbulence modeling, sophisticated RANS models have been replaced by simpler LES and DNS models. Scaling laws in turbulence have enabled the prediction of the capability of these simpler models at scale, which made it easier to justify their enormous cost. Similarly, scaling laws in deep neural networks have made it possible to predict their capability at scale, which has driven the recent explosion in investment to train larger and larger models. However, while the transition from RANS to LES happened nearly two decades ago, scaling laws in AI come at a time when Moore’s law is approaching its end. This difference has many implications, ranging from the design of computer architectures to the balance between sophisticated modeling and brute-force computing. Understanding the similarities and differences between these scaling laws is the key to predicting the dynamics between HPC and AI in the upcoming years.
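For concreteness, here is one common textbook form of each kind of scaling law mentioned above (illustrative only, not necessarily the formulations used in the talk). In turbulence, Kolmogorov-type estimates put the grid-point count of a DNS at roughly Re^{9/4} and the number of time steps at roughly Re^{3/4}; for deep neural networks, a widely cited parametric fit (Hoffmann et al., 2022) expresses the loss in terms of model size and data:

    \mathrm{cost}_{\mathrm{DNS}} \;\sim\; Re^{9/4} \cdot Re^{3/4} \;=\; Re^{3},
    \qquad
    L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},

where Re is the Reynolds number, N the number of model parameters, D the number of training tokens, and E, A, B, \alpha, \beta are empirically fitted constants. In both cases the law predicts, ahead of time, what a larger brute-force run will buy.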

Biography

Rio Yokota is a Professor at the Global Scientific Information and Computing Center, Tokyo Institute of Technology. His research interests lie at the intersection of high performance computing and machine learning. He is the developer of numerous libraries for fast multipole methods (ExaFMM) and hierarchical low-rank algorithms (Hatrix) that scale to the full system on the largest supercomputers today. He has also led efforts to train ImageNet in two minutes and, more recently, to pre-train large language models using thousands of GPUs. He has been optimizing algorithms on GPUs since 2006 and was part of a team that received the Gordon Bell Prize in 2009.