Publications and Awards

List of publications under the CIFTS umbrella. These publications relate either to CIFTS and FTB directly, or to fault tolerance research on various HPC software packages that is intended for eventual seamless integration into CIFTS.

2012

  • "Application Self-health Monitoring for Extreme-scale Resiliency using Cooperative Fault Management", Pratul Agarwal, Thomas Naughton, S. Alam, B-H Park, David Bernholdt, Josh Hursey and Al Geist, Concurrency and Computation: Practice and Experience (2012). Submitted.
  • "Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI", R. Rajachandrasekar, X. Besseron and D. K. Panda, Proceedings of the International Workshop on System Management Techniques, Processes, and Services (SMTPS), in conjunction with International Parallel and Distributed Processing Symposium (IPDPS), 2012
  • "Correlated Set Coordination in Fault Tolerant Message Logging Protocols for Many-core Clusters", A. Bouteiller, T. Herault, G. Bosilca and J. Dongarra, Journal of Concurrency and Computation: Practice and Experience, 2012

2011

  • "High Performance Pipelined Process Migration with RDMA", X. Ouyang, R. Rajachandrasekar, X. Besseron and D. K. Panda, Proceedings of CCGrid 2011
  • "Correlated Set Coordination in Fault Tolerant Message Logging Protocols", A. Bouteiller, T. Herault, G. Bosilca and J. Dongarra, Lecture Notes in Computer Science, Proceedings of the 2011 Euro-Par conference, 2011
  • "Realization of User-Level Fault Tolerance Policy Management through a Holistic Approach for Fault Correlation", B-H. Park, T. Naughton, P. Agarwal, D. Bernholdt, A. Geist and J. Tippens, IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY), June 2011
  • "Strategies for Fault Tolerance in Multicomponent Applications", A. Shet, W. Elwasif, S. Foley, B-H. Park, D. Bernholdt and R. Bramley, Proceedings of the International Conference on Computational Science (ICCS 2011), June 2011
  • "Scalable Distributed Consensus to Support MPI Fault Tolerance", D. Buntinas, Technical Report ANL/MCS-TM-314, Argonne National Laboratory, June 2011
  • "Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System", R. Gupta, H. Naik and P. Beckman, International Journal of High Performance Computing Applications, May 2011
  • "Co-Analysis of RAS Log and Job Log on Blue Gene/P", Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS'11), May 2011

2010

  • "A Practical Failure Prediction with Location and Lead Time for Blue Gene/P", Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, Proceedings of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), in conjunction with DSN'10, 2010.
  • "Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols", G. Bosilca, A. Bouteiller, T. Herault, P. Lemarinier and J. Dongarra, Proceedings of EuroMPI, 2010
  • "Checkpoint/Restart-Enabled Parallel Debugging", J. Hursey, C. January, M. O'Connor, P. Hargrove, D. Lecomber, J. Squyres and A. Lumsdaine, Proceedings of EuroMPI, 2010
  • "Disaster Survival Guide in Petascale Computing: An Algorithmic Approach", J. Dongarra, Z. Chen, G. Bosilca, and J. Langou, to appear in Petascale Computing: Algorithms and Applications, Chapman and Hall / CRC Press.
  • "RDMA-Based Job Migration Framework for MPI over InfiniBand", X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, IEEE International Conference on Cluster Computing 2010 (Cluster '10), Sept. 2010 [Download PDF]
  • "Enhancing Checkpoint Performance with Staging IO and SSD", X. Ouyang, S. Marcarelli and D. K. Panda, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010 [Download PDF]
  • "Redesigning the message logging model for high performance", A. Bouteiller, G. Bosilca and J. Dongarra, Concurrency and Computation: Practice and Experience, 2010

2009

  • "Reasons to be Pessimist or Optimist for Failure Recovery in High Performance Clusters", A. Bouteiller, T. Ropars, G. Bosilca, C. Morin and J. Dongarra, Proceedings of the 2009 IEEE Cluster Conference, 2009.
  • "CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems", R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine and J. Dongarra, Proceedings of the International Conference on Parallel Processing (ICPP), 2009. [Download PDF]
  • "Analyzing Checkpointing Trends for Applications on Petascale Systems", H. Naik, R. Gupta and P. Beckman, Second International Workshop on Parallel Programming Models and Systems Software (P2S2) for High-End Computing in conjunction with International Conference on Parallel Processing (ICPP), 2009
  • "System log Pre-processing to Improve Failure Prediction", Z. Zheng, Z. Lan, B-H Park, and A. Geist, Proceedings of the 39th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2009.
  • "Interconnect Agnostic Checkpoint/Restart in Open MPI", J. Hursey, T. Mattox, and A. Lumsdaine, Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC), 2009.
  • "Fast checkpointing by write aggregation with dynamic buffer and interleaving on multicore architecture", X. Ouyang, K. Gopalakrishnan, T. Gangadharappa, and D. K. Panda, Proceedings of the International Symposium on High Performance Computing (HiPC) 2009. [Download PDF]
  • "Accelerating checkpoint operation by node-level write aggregation in multicore systems", X. Ouyang, K. Gopalakrishnan and D. K. Panda, Proceedings of the International Conference on Parallel Processing (ICPP), 2009. [Download PDF]

2008

  • The ORNL CIFTS team, in collaboration with Ziming Zheng of Zhiling Lan's team at IIT, won the "Cray Log Analysis Contest" at the First USENIX Workshop on the Analysis of System Logs (WASL'08), which was co-located with the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI'08).
  • "Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study", J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks and B-H Park, Proceedings of the International Conference on Parallel Processing (ICPP), 2008.
  • "Efficient One-Copy MPI Shared Memory Communication in Virtual Machines", W. Huang, M. Koop, D. K. Panda, Proceedings of the International Conference on Cluster Computing (Cluster), 2008. [Download PDF]
  • "Fault Tolerance Management for a Hierarchical GridRPC Middleware", A. Bouteiller and F. Desprez, Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008.

2007

  • "Automatic Path Migration over InfiniBand: Early Experiences", A. Vishnu, A. Mamidala, S. Narravula, and D. K. Panda, Proceedings of Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
  • "The design and implementation of checkpoint/restart process fault tolerance for Open MPI", J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
  • "Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand", Q. Gao, W. Huang, M. Koop, and D. K. Panda, Proceedings of International Conference on Parallel Processing (ICPP), 2007. [Download PDF]
  • "Self-Healing in Binomial Graph Networks", T. Angskun, G. Bosilca and J. Dongarra, Proceedings of the 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS), 2007.
  • "Optimal Routing in Binomial Graph Networks", T. Angskun, G. Bosilca, B. V. Zanden and J. Dongarra, Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2007.
  • "Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology", T. Angskun, G. Bosilca, J. Dongarra, Proceedings of the 5th International Symposium on Parallel and Distributed Processing and Applications (ISPA), 2007.
  • "Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing", Z. Chen, M. Yang, G. Francia, and J. Dongarra, Proceedings of the workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing, held in conjunction with IPDPS, 2007.
  • "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment", G. Bosilca, Z. Chen, J. Dongarra, and J. Langou, SIAM SISC, 2007.