Past and Upcoming Talks and Demonstrations by the CIFTS team
List of talks, posters, and demonstrations under the CIFTS umbrella. These items relate either to CIFTS and the FTB, or to fault tolerance research on other HPC software that is intended for eventual seamless integration with CIFTS.
- Poster on Scalable Distributed Consensus to Support MPI Fault Tolerance, D. Buntinas, at the 18th EuroMPI Conference, Sept 2011
- Poster on Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance, J. Hursey, R. Graham, G. Bronevetsky, D. Buntinas, H. Pritchard and D. Solt, at the 18th EuroMPI Conference, Sept 2011
- Talk on Realization of User-Level Fault Tolerance Policy Management through a Holistic Approach for Fault Correlation, H. Park, at the IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY), June 2011
- Talk on Berkeley Lab's Checkpoint/Restart (BLCR), E. Roman, at the Discovery 2011: HPC and Cloud Computing Workshop, June 2011
- Poster on Strategies for Fault Tolerance in Multicomponent Applications, David Bernholdt, at the International Conference on Computational Science (ICCS 2011), June 2011
- Technical session on User Application Monitoring through Assessment of Abnormal Behavior Recorded in RAS Logs, Byung H. Park, Thomas J. Naughton, Al Geist, Raghul Gunasekaran, David Dillow and Galen Shipman, at the Cray Users Group Conference, May 2011
- Technical session on Real-Time System Log Monitoring/Analytics Framework, Raghul Gunasekaran, Byung H. Park, David Dillow, Galen Shipman and Al Geist at the Cray Users Group Conference, May 2011
showcased new capabilities and features of fault-tolerant software and how coordination frameworks like CIFTS can help in end-to-end fault management. We showcased all popular implementations of MPI (which included FTB-enabled MPICH2, MVAPICH2 with migration and checkpointing support, and Open MPI), the FTB-enabled FT library, and several applications.
Download the M4V and MOV versions of the application demonstration. The demo showcases an FTB-enabled molecular dynamics application in a user-specified fault tolerance policy framework.
- Keynote Talk on Networking Technologies for Clusters: Where do We Stand and What Lies Ahead?, D. K. Panda, at the International Conference on Parallel and Distributed Systems (ICPADS '10), Dec 2010
- Demo on Fault Tolerance in Open MPI using the FTB, Abhishek Kulkarni, in the Argonne National Laboratory booth at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Presentation on Preempting Torque Jobs with BLCR, P. Hargrove, at the TORQUE Open Source Resource Manager Road Map and three key topic workshop in conjunction with The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Round Table Discussion on Berkeley Lab's Checkpoint/Restart (BLCR), P. Hargrove, at the Lawrence Berkeley National Laboratory (LBNL) Booth in conjunction with The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Demo on Proactive Fault-Resilience with Process Migration in MVAPICH2: A demonstration with Tachyon, X. Ouyang, in the Argonne National Laboratory booth at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Presentation on Designing High-End Computing Systems with InfiniBand and High-Speed Ethernet, D. K. Panda, P. Balaji, and S. Sur, at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Birds-of-feather talk on CIFTS: Coordinated Fault Tolerance for High Performance Computing, P. Beckman, D. Bernholdt, D. Buntinas, A. Bouteiller, A. Kulkarni, P. Hargrove and D.K. Panda, at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Demo on FTB-enabled Fault Tolerant Linear Algebra Library - Using coordination to improve fault tolerance, A. Bouteiller, in the Argonne National Laboratory booth at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Invited Talk on Hardware and Software Considerations for VV & UQ, D. Bernholdt, at the 2010 Workshop on Verification, Validation, and Uncertainty Analysis in High-Performance Computing (VVUHPC 2010), in conjunction with Supercomputing Conference (SC'10), Nov 2010
- Talk on Preempting Torque Jobs with BLCR, P. Hargrove, at the TORQUE Open Source Resource Manager Road Map and three key topic workshop in conjunction with The ACM/IEEE Supercomputing Conference, Nov 2010
- Presentation on Designing High Performance, Scalable and Fault-Resilient MPI Library for Modern Clusters, D. K. Panda and S. Sur, at the Ohio Supercomputer Center Booth, at The ACM/IEEE Supercomputing Conference (SC '10), Nov 2010
- Presentation on Designing High Performance, Scalable and Fault-Resilient MPI Library for Modern Clusters, D. K. Panda, at the Institute of Computing Technology (ICT), Chinese Academy of Sciences, Oct 2010
- Keynote Talk on Networking Technologies for Exascale Computing Systems: Opportunities and Challenges, D. K. Panda, at the HPC China Conference, Oct 2010
- Technical session on RAVEN: RAS data Analysis through Visually Enhanced Navigation, Byung H. Park, Junseong Heo, Kora Guruprasad and Al Geist, at the Cray Users Group Conference, May 2010
- Technical session on Correlating Log Messages for System Diagnostics, Raghul Gunasekaran, Byung H. Park, Galen Shipman and Al Geist, at the Cray Users Group Conference, May 2010
- Presentation on InfiniBand Software Networking Technologies, D. K. Panda, at the Discovery 2015 Workshop, Oak Ridge National Laboratory, July 2010
- Presentation on Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet, D. K. Panda and P. Balaji, at The International Supercomputing Conference (ISC), May 2010
- Invited Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, at the SIAM Conference on Parallel Processing for Scientific Computing, Feb 2010
- Presentation on Designing High Performance, Scalable and Fault-Resilient MPI Library for Modern Clusters, D. K. Panda, at the Pacific Northwest National Laboratory (PNNL), Feb 2010
- DEMO VIDEO: Several demonstrations of FTB and FTB-enabled software, such as MVAPICH, MPICH2, Open MPI, FT-LA, the log monitoring tools for the Cray system, and an MD (molecular dynamics) application, were presented at the Supercomputing Conference, Portland, Nov 2009. If you missed the demonstrations, you can download the POV-Ray demonstration now. This video demonstrates process resilience in Open MPI using CIFTS.
- Birds-of-feather talk on CIFTS: Coordinated Fault Tolerance for High Performance Computing, P. Beckman, D. Bernholdt, A. Bouteiller, A. Lumsdaine, P. Hargrove and D.K. Panda, at the Supercomputing Conference, Nov 2009
- Round Table Discussion on Berkeley Lab's Checkpoint/Restart (BLCR), P. Hargrove, at the Lawrence Berkeley National Laboratory (LBNL) Booth in conjunction with the ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Presentation on Process Resilience in Open MPI using the CIFTS Fault Tolerance Backplane: A POV-Ray Demonstration, Abhishek Kulkarni, in the Argonne National Laboratory booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Presentation on Open MPI Tutorial, Joshua Hursey, Jeffrey M. Squyres, Abhishek Kulkarni and Andrew Lumsdaine, in the Indiana University booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Presentation on A Transparent Process Migration Framework for Open MPI, Joshua Hursey, in the Cisco booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Demo on Coordinated Checkpoint/Restart support in MVAPICH2: A demonstration with the NASA Parallel Benchmark Suite, X. Ouyang, in the Argonne National Laboratory booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Presentation on Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet, D. K. Panda, P. Balaji, and M. Koop, at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Demo on FTB-enabled Fault Tolerant Linear Algebra Library, A. Bouteiller, in the Argonne National Laboratory booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'09), Nov 2009
- Presentation on High Performance, Scalable and Fault-Tolerant MPI over InfiniBand: An Overview of MVAPICH/MVAPICH2 Project, D. K. Panda, at Tsukuba University, Oct 2009
- Seminar on Supporting Fault-Tolerance in Modern High-End Computing Systems with InfiniBand, D. K. Panda, at the Fault-Tolerance in High Performance Computing, May 2009
- Presentation on Designing Fault Resilient and Fault Tolerant Systems with InfiniBand, D. K. Panda, in the HPC Resiliency Workshop, Oct 2009
- Invited Poster Presentation on Designing Fault Resilient and Fault Tolerant Systems with InfiniBand, D. K. Panda, A. Vishnu, and K. Gopalkrishnan, in the National Workshop on Resiliency, Aug 2009
- Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, at the Argonne Leadership Computing Facility, May 2009
- Talk on Berkeley Lab Checkpoint/Restart (BLCR): Status and Future Plans, P. Hargrove, at Dagstuhl Seminar "Fault Tolerance in High-Performance Computing and Grids", May 2009
- Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, at the Fermi National Accelerator Laboratory, May 2009
- Invited Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, at the University of Chicago, April 2009
- Talk on System-level Checkpoint/Restart with BLCR, P. Hargrove, at the TeraGrid 2009 Fault Tolerance Workshop, Mar 2009
- Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, in the Argonne National Laboratory booth at the IEEE/ACM International Conference for High-Performance Computing, Networking, Storage and Analysis (SC), 2008
- Birds-of-feather talk on CIFTS: Coordinated Fault Tolerance for High Performance Computing, P. Beckman, D. Bernholdt, A. Bouteiller, A. Lumsdaine, P. Hargrove and D.K. Panda, at the Supercomputing Conference, Austin, Nov 2008
- Poster on Analyzing Failure Events on ORNL's Cray XT4, H. Park, Z. Zheng, Z. Lan and A. Geist, at the ACM/IEEE Supercomputing Conference (SC'2008), Austin, 2008
- Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, at the Workshop on Fault Tolerance and Resiliency, in conjunction with Los Alamos Computer Science Symposium (LACSS), 2008
- Talk on System-level Checkpoint/Restart with BLCR, P. Hargrove, at the Los Alamos Computer Science Symposium (LACSS08)
- Talk on Advanced Checkpoint Fault Tolerance Solution, P. Hargrove, at the HPC Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing, June 2008
- Round Table Discussion on Berkeley Lab's Checkpoint/Restart (BLCR), P. Hargrove, at the Lawrence Berkeley National Laboratory (LBNL) Booth in conjunction with the ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'08), Nov 2008
- Presentation on Research at Indiana University for Reliable Petascale Performance, Timothy I. Mattox, in the Indiana University booth at the ACM/IEEE SC'08 Conference, Austin, Texas, Nov 2008
- Presentation on Fault Tolerance in High Performance Computing: MPI and Checkpoint/Restart, Joshua Hursey, in the Indiana University booth at the ACM/IEEE SC'08 Conference, Austin, Texas, Nov 2008
- Presentation on Checkpoint/Restart Support in Open MPI, Joshua Hursey, in Sun Microsystems, Inc. Tech Talk Series, May 2008
- Keynote Talk on Fault-Tolerant/System-Management Issues in InfiniBand, D. K. Panda, in the International Workshop on System Management Techniques, Processes and Services (SMPTS), April 2008
- Demo on FTB-enabled Fault Tolerant Linear Algebra Library, A. Bouteiller, in the Argonne National Laboratory booth at the ACM/IEEE SC'08 Conference, Nov 2008
- Talk on CIFTS: Coordinated Infrastructure for Fault Tolerant Systems, R. Gupta, in the Argonne National Laboratory booth at The IEEE/ACM International Conference for High-Performance Computing, Networking, Storage and Analysis (SC), 2007
- Birds-of-feather talk on CIFTS: Coordinated Fault Tolerance for High Performance Computing, P. Beckman, D. Bernholdt, A. Bouteiller, A. Lumsdaine, P. Hargrove and D.K. Panda, at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'07), Nov 2007
- Round Table Discussion on Berkeley Lab's Checkpoint/Restart (BLCR), P. Hargrove, at the Lawrence Berkeley National Laboratory (LBNL) Booth in conjunction with The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'07), Nov 2007
- Presentation on MPI Is Dead? Long Live MPI! Evolving MPI for the Next Generation of Supercomputing, Timothy I. Mattox, in the Cisco and Indiana University booths at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'07), Nov 2007
- Birds-of-feather talk on Handling Reliability at the MPI layer, D. K. Panda, in Reliability of High-Speed Networks, in conjunction with The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'07), Nov 2007
- Demo on Introducing the FTB-enabled Fault Tolerant Linear Algebra Library, A. Bouteiller, in the Argonne National Laboratory booth at The ACM/IEEE International Conference for High Performance Computing (HPC), Networking, Storage and Analysis (SC'07), Nov 2007
- Talk on Introduction to CIFTS, R. Gupta, at the CCA Forum meeting, April 2007
- Presentation on Process Fault Tolerance in Open MPI, Joshua Hursey, in the Innovative Computing Laboratory (ICL) Friday Lunch Speaker Series, University of Tennessee, Knoxville, Feb 2007