Workshops
The following workshops are organized in conjunction with Cluster 2016.
        
        FTS 2016 – The Second International Workshop on Fault Tolerant Systems
        Research and historical trends have shown an increase in the frequency and scope of failures for many years, leading to a surge of interest in new algorithms, system software, and hardware that can tolerate such errors. Past advances such as ECC memory, checkpoint/restart systems, and Algorithm-Based Fault Tolerance (ABFT) have mitigated the impact of increased failures somewhat. As machine scales continue to increase, however, the sophistication of our resilience solutions must keep pace. Some current predictions for exascale machines consider a node failure rate as high as one per hour. At this rate, current solutions will be incomplete, and novel solutions to handle failures at both the hardware and software levels must be conceived.
        Website: http://www.mcs.anl.gov/events/workshops/fts/2016/
        Co-Organizers: 
Sheng Di, Argonne National Laboratory, USA 
Vaidy Sunderam, Emory University, USA      
Keynote - It's not my fault! Finding errors in parallel codes
David Abramson
        Time: 14:00-15:00,  September 15, 2016
      Room: Grand Hall B,  5F, Palais de Chine Hotel
Abstract 
Debugging software has always been difficult, with little tool support available. Finding faults in parallel programs is even harder because the machines and problems are so large, and the amount of state to be examined becomes prohibitive. Faults are often introduced when codes are modified, when the software or hardware environment changes, or when codes are scaled up to solve larger problems. All too often we hear programmers scream "It's not my fault!"
Over the years we have developed a technique called "Relative Debugging", in which a code is debugged against another, reference version. This makes the process simpler because programmers can compare the state of computation between a faulty version and a previous version that is known to be correct, so the programmer does not need a mental model of what the program state should be. However, relative debugging can also be expensive because it needs to compare large data structures across the machine. Parallel computers offer a way of accelerating the comparisons using parallel algorithms, making the technique practical.
Biography 
      David Abramson (University of Queensland)
      Professor David Abramson has been involved in computer architecture and high performance computing research since 1979. He has held appointments at Griffith University, CSIRO, RMIT and Monash University. At CSIRO he was the program leader of the Division of Information Technology High Performance Computing Program, and was also an adjunct Associate Professor at RMIT in Melbourne. He served as a program manager and chief investigator in the Co-operative Research Centre for Intelligent Decisions Systems and the Co-operative Research Centre for Enterprise Distributed Systems. He was the Director of the Monash e-Education Centre and a Professor of Computer Science in the Faculty of Information Technology at Monash University. Abramson is currently the Director of the Research Computing Centre at the University of Queensland. He is a Fellow of the Association for Computing Machinery (ACM), the Australian Academy of Technological Sciences and Engineering (ATSE) and the Australian Computer Society (ACS), and a Senior Member of the IEEE.
      
FTS 2016 – The Second International Workshop on Fault Tolerant Systems
| September 15, 2016 (Thu.) | |
| 14:00-15:00 | Keynote Speech. Chair: Vaidy Sunderam |
| 14:00-15:00 | "It's not my fault! Finding errors in parallel codes" David Abramson | 
| 15:00-15:30 | Coffee Break |
| 15:30-18:00 | FTS 2016 – The Second International Workshop on Fault Tolerant Systems. Chair: Vaidy Sunderam |
| 15:30-16:00 | "Selective Replication for Fault-tolerant Task-Parallel HPC Applications" Omer Subasi, Gulay Yalcin, Ferad Zyulkyarov, Osman Unsal and Jesus Labarta |
| 16:00-16:30 | "TwinPCG: Dual Thread Redundancy with Forward Recovery for Preconditioned Conjugate Gradient Methods" Kiril Dichev and Dimitrios Nikolopoulos |
| 16:30-17:00 | "An ABFT Scheme Based on Communication Characteristics" Upama Kabir and Dhrubajyoti Goswami |
| 17:00-17:30 | "Separation Kernel Robustness Testing: The XtratuM Case Study" Stephen Grixti, Nicholas Sammut, Maria Hernek, Elena Carrascosa, Miguel Masmano and Alfons Crespo |
| 17:30-17:55 | "Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications" Sheng Di |
| 17:55-18:00 | Wrap Up | 


