IEEE Cluster 2016

Workshops

The following workshops are organized in conjunction with Cluster 2016.

FTS 2016 – The Second International Workshop on Fault Tolerant Systems
Research and historical trends have shown an increase in the frequency and scope of failures for many years, leading to a surge of interest in new algorithms, system software, and hardware that can tolerate such errors. Past advancements such as ECC memory, checkpoint/restart systems, and Algorithm-Based Fault Tolerance have somewhat mitigated the impact of increased failures. As machine scales continue to increase, however, the sophistication of our resilience solutions must keep pace. Some current predictions for exascale machines consider a node failure rate as high as one per hour. At this rate, current solutions will be incomplete, and novel solutions to handle failures at both the hardware and software levels must be conceived.
Website: http://www.mcs.anl.gov/events/workshops/fts/2016/
Co-Organizers:
Sheng Di, Argonne National Laboratory, USA
Vaidy Sunderam, Emory University, USA

Keynote - It's not my fault! Finding errors in parallel codes

David Abramson
Time: 14:00-15:00, September 15, 2016
Room: Grand Hall B, 5F, Palais de Chine Hotel

Abstract
Debugging software has always been difficult, with little tool support available. Finding faults in parallel programs is even harder because the machines and problems are so large, and the amount of state to be examined becomes prohibitive. Faults are often introduced when codes are modified, when the software or hardware environment changes, or when the codes are scaled up to solve larger problems. All too often we hear the programmers scream "It's not my fault!"
Over the years we have developed a technique called "Relative Debugging", in which a code is debugged against another, reference, version. This makes the process simpler because programmers can compare the state of computation between a faulty version and a previous code that is correct, and the programmer doesn't need to have a mental model of what the program state should be. However, relative debugging can also be expensive because it needs to compare large data structures across the machine. Parallel computers offer a way of accelerating the comparisons using parallel algorithms, making the technique practical.

In this talk I will introduce relative debugging, show how it assists testing and debugging, and discuss the various techniques used to scale it up to very large problems and machines.

Biography
David Abramson (University of Queensland)
Professor David Abramson has been involved in computer architecture and high performance computing research since 1979. He has held appointments at Griffith University, CSIRO, RMIT and Monash University. At CSIRO he was the program leader of the Division of Information Technology High Performance Computing Program, and was also an adjunct Associate Professor at RMIT in Melbourne. He served as a program manager and chief investigator in the Co-operative Research Centre for Intelligent Decision Systems and the Co-operative Research Centre for Enterprise Distributed Systems. He was the Director of the Monash e-Education Centre and a Professor of Computer Science in the Faculty of Information Technology at Monash University. Abramson is currently the Director of the Research Computing Centre at the University of Queensland. He is a Fellow of the Association for Computing Machinery (ACM), the Australian Academy of Technological Sciences and Engineering (ATSE) and the Australian Computer Society (ACS), and a Senior Member of the IEEE.

IEEE Cluster 2016 Workshop
FTS 2016 – The Second International Workshop on Fault Tolerant Systems

September 15, 2016 (Thu.)
14:00-15:00	Keynote Speech Chair: Vaidy Sunderam
14:00-15:00	"It's not my fault! Finding errors in parallel codes" David Abramson
15:00-15:30	Coffee Break
15:30-18:00	FTS 2016 – The Second International Workshop on Fault Tolerant Systems Chair: Vaidy Sunderam
15:30-16:00	"Selective Replication for Fault-tolerant Task-Parallel HPC Applications" Omer Subasi, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal and Jesus Labarta
16:00-16:30	"TwinPCG: Dual Thread Redundancy with Forward Recovery for Preconditioned Conjugate Gradient Methods" Kiril Dichev and Dimitrios Nikolopoulos
16:30-17:00	"An ABFT Scheme Based on Communication Characteristics" Upama Kabir and Dhrubajyoti Goswami
17:00-17:30	"Separation Kernel Robustness Testing: The XtratuM Case Study" Stephen Grixti, Nicholas Sammut, Maria Hernek, Elena Carrascosa, Miguel Masmano and Alfons Crespo
17:30-17:55	"Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications" Sheng Di
17:55-18:00	Wrap Up