Workshops
The following workshops are organized in conjunction with Cluster 2016.
FTS 2016 – The Second International Workshop on Fault Tolerant Systems
Research and historical trends have shown an increase in the frequency and scope of failures for many years, leading to a surge of interest in new algorithms, system software, and hardware that can tolerate such errors. Past advancements such as ECC memory, checkpoint/restart systems, and Algorithm-Based Fault Tolerance (ABFT) have mitigated the impact of increased failures somewhat. As machine scales continue to increase, however, the sophistication of our resilience solutions must keep pace. Some current predictions for exascale machines consider a node failure rate as high as one per hour. At this rate, current solutions will be insufficient, and novel solutions to handle failures at both the hardware and software levels must be conceived.
Website: http://www.mcs.anl.gov/events/workshops/fts/2016/
Co-Organizers:
Sheng Di, Argonne National Laboratory, USA
Vaidy Sunderam, Emory University, USA
Keynote - It's not my fault! Finding errors in parallel codes
David Abramson
Time: 14:00-15:00, September 15, 2016
Room: Grand Hall B, 5F, Palais de Chine Hotel
Abstract
Debugging software has always been difficult, with little tool support available. Finding faults in parallel programs is even harder because the machines and problems are so large, and the amount of state to be examined becomes prohibitive. Faults are often introduced when codes are modified, when the software or hardware environment changes, or when codes are scaled up to solve larger problems. All too often we hear the programmers scream "It's not my fault!"
Over the years we have developed a technique called "Relative Debugging", in which a code is debugged against another, reference, version. This makes the process simpler because programmers can compare the state of computation between a faulty version and a previous code that is correct, and the programmer doesn't need to have a mental model of what the program state should be. However, relative debugging can also be expensive because it needs to compare large data structures across the machine. Parallel computers offer a way of accelerating the comparisons using parallel algorithms, making the technique practical.
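As a rough illustration of the idea in the abstract, the sketch below compares the state produced by a reference version and a suspect version of the same step, splitting the comparison into chunks that are checked concurrently. It is not the speaker's tool: the function names (`relative_compare`, `reference_step`, `suspect_step`), the chunk size, and the tolerance are all assumptions made for this example, and Python threads stand in for the parallel comparison a real system would run across the machine.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def compare_chunk(task):
    """Return the global indices in one chunk where the two versions disagree."""
    start, ref_chunk, sus_chunk, tol = task
    return [
        start + i
        for i, (r, s) in enumerate(zip(ref_chunk, sus_chunk))
        if not math.isclose(r, s, rel_tol=tol, abs_tol=tol)
    ]


def relative_compare(reference, suspect, chunk_size=4, tol=1e-9, workers=4):
    """Compare two state arrays chunk by chunk and list the mismatching indices."""
    if len(reference) != len(suspect):
        raise ValueError("state arrays must have the same length")
    tasks = [
        (i, reference[i:i + chunk_size], suspect[i:i + chunk_size], tol)
        for i in range(0, len(reference), chunk_size)
    ]
    # pool.map preserves task order, so the indices come back sorted.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compare_chunk, tasks)
    return [idx for chunk in results for idx in chunk]


def reference_step(x):
    """Hypothetical known-good version of a computation step."""
    return [2.0 * v + 1.0 for v in x]


def suspect_step(x):
    """Hypothetical modified version with a fault injected at the last element."""
    out = [2.0 * v + 1.0 for v in x]
    out[-1] = 2.0 * x[-1]  # injected fault: the "+ 1.0" is missing here
    return out


state = [float(i) for i in range(10)]
mismatches = relative_compare(reference_step(state), suspect_step(state))
print(mismatches)  # [9]
```

The programmer never states what the correct values should be; the reference version supplies them, and the comparison reports only where the two versions diverge.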
Biography
David Abramson (University of Queensland)
Professor David Abramson has been involved in computer architecture and high performance computing research since 1979. He has held appointments at Griffith University, CSIRO, RMIT and Monash University. At CSIRO he was the program leader of the Division of Information Technology High Performance Computing Program, and was also an adjunct Associate Professor at RMIT in Melbourne. He served as a program manager and chief investigator in the Co-operative Research Centre for Intelligent Decisions Systems and the Co-operative Research Centre for Enterprise Distributed Systems. He was the Director of the Monash e-Education Centre and a Professor of Computer Science in the Faculty of Information Technology at Monash University. Abramson is currently the Director of the Research Computing Centre at the University of Queensland. He is a fellow of the Association for Computing Machinery (ACM), the Australian Academy of Technological Sciences and Engineering (ATSE) and the Australian Computer Society (ACS), and a Senior Member of the IEEE.
FTS 2016 – The Second International Workshop on Fault Tolerant Systems
September 15, 2016 (Thu.) | |
14:00-15:00 | Keynote Speech Chair: Vaidy Sunderam |
14:00-15:00 | "It's not my fault! Finding errors in parallel codes" David Abramson |
15:00-15:30 | Coffee Break |
15:30-18:00 | FTS 2016 – The Second International Workshop on Fault Tolerant Systems Chair: Vaidy Sunderam |
15:30-16:00 | Selective Replication for Fault-tolerant Task-Parallel HPC Applications Omer Subasi, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal and Jesus Labarta |
16:00-16:30 | TwinPCG: Dual Thread Redundancy with Forward Recovery for Preconditioned Conjugate Gradient Methods Kiril Dichev and Dimitrios Nikolopoulos |
16:30-17:00 | An ABFT Scheme Based on Communication Characteristics Upama Kabir and Dhrubajyoti Goswami |
17:00-17:30 | Separation Kernel Robustness Testing: The XtratuM Case Study Stephen Grixti, Nicholas Sammut, Maria Hernek, Elena Carrascosa, Miguel Masmano and Alfons Crespo |
17:30-17:55 | Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications Sheng Di |
17:55-18:00 | Wrap Up |