IEEE Cluster 2025

Workshops at IEEE Cluster 2025

Full Day

  • REX-IO 2025: 5th Workshop on Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads, Arnab K. Paul (BITS Pilani, K K Birla Goa Campus, India), Sarah M. Neuwirth (Johannes Gutenberg University Mainz), Jay Lofstead (Sandia National Laboratories).
Workshop description:

    High Performance Computing (HPC) applications are evolving to include not only traditional bulk-synchronous, scale-up modeling and simulation workloads but also scale-out workloads, including artificial intelligence (AI), big data analytics, deep learning, and complex multi-step workflows. With the advent of exascale systems such as Frontier, workflows combine components from both the scale-up and scale-out communities that operate together to drive scientific discovery and innovation. Given the often conflicting design choices between optimizing for write-intensive versus read-intensive workloads, flexible I/O systems are crucial to supporting hybrid workloads. A further performance concern is the growing complexity of parallel file and storage systems in large-scale cluster environments. Storage system designs are advancing beyond the traditional two-tiered file system and archive model by introducing new tiers of temporary, fast storage close to the computing resources with distinctly different performance characteristics.

    The changing landscape of emerging HPC workloads, along with the ever-increasing gap between compute and storage performance, reinforces the need for an in-depth understanding of extreme-scale I/O and for rethinking existing data storage and management techniques. Traditional approaches might fail to address the challenges of extreme-scale hybrid workloads. Novel I/O optimization and management techniques that integrate machine learning and AI algorithms, such as intelligent load balancing and I/O pattern prediction, are needed to cope with the exponential growth of data and with increasingly complex storage and file system hierarchies. Furthermore, user-friendly, transparent, and innovative approaches are essential to adapt to the needs of different HPC I/O workloads while easing scientific and commercial code development and efficiently utilizing extreme-scale parallel I/O and storage resources.

    Established at IEEE Cluster 2021, the Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads (REX-IO) workshop has created a forum for experts, researchers, and engineers in the parallel I/O and storage, compute facility operation, and HPC application domains. REX-IO solicits novel work that characterizes I/O behavior and identifies the challenges in scientific data and storage management for emerging HPC workloads, introduces potential solutions to alleviate some of these challenges, and demonstrates the effectiveness of the proposed solutions to improve I/O performance for the exascale supercomputing era and beyond. We envision that this workshop will continue contributing to the community and will further drive discussions between storage and I/O researchers, HPC application users, and the data analytics community, fostering a deeper understanding of the impact that emerging HPC applications have on storage and file systems.

  • LLMxHPC 2025: The 2nd International Workshop on Large Language Models (LLMs) and HPC, Kevin A. Brown (Argonne National Laboratory), Tanwi Mallick (Argonne National Laboratory), Aleksandr Drozd (RIKEN).
Workshop description:

    High-Performance Computing (HPC) systems have become critical for meeting the computational and data-intensive needs of training Large Language Models (LLMs). Simultaneously, in the domain of HPC research, LLMs are emerging as transformative tools to understand and improve HPC system productivity and efficiency. There are clear synergies between these areas, and meaningful coordination of efforts holds great promise. This workshop brings together researchers and developers to explore the intersection of HPC and LLMs, offering a comprehensive look at how these two domains can mutually benefit and drive each other's advancement. The workshop has two key focus areas: (i) co-design and deployment of HPC systems to support LLM training and (ii) using LLMs to understand and optimize/tune HPC systems. A combination of paper presentations, a panel discussion, and a keynote will be included in the program to highlight salient research and development activities, promote diverse perspectives and visions, and stimulate discussion in the community.

    Topics to be covered in this workshop include, but are not limited to, the computational and data needs of LLM training; emerging HPC architectural advancements relevant to these needs, such as GPU-accelerated computing, high-bandwidth memory systems, and advanced networking capabilities; and LLM-HPC co-design efforts. Topics will also include utilizing LLMs to improve HPC deployment and operations, such as analyzing extensive system logs for performance, energy efficiency, and reliability; fine-tuning complex HPC hardware and software stacks; and design space exploration.

Conference registration includes workshops and tutorials (with no additional fee). Workshop-only registration is NOT available.