Design and evaluation of a self-healing Kepler for scientific workflows

Arjun Hary, Ali Akoglu, Youssif AlNashif, Salim Hariri, Darrel Jenerette

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

Kepler is a popular open source scientific workflow (SWF) tool because it simplifies the construction of complex data flow models through a visual interface. As the complexity of the workflow applications that run on heterogeneous distributed systems increases, fault management becomes a critical design issue for large scale scientific and engineering applications. Due to the long execution times of these applications, it is important that they are fault tolerant; i.e., the workflow application can recover gracefully from faults without needing to restart from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend the Kepler capabilities to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly, in an autonomic manner, whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring the performance of the workflow under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.
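The checkpoint-and-recover idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' FT-Kepler implementation (Kepler itself is a Java-based tool, and the abstract does not describe its internals); it is a generic Python illustration under stated assumptions. The names `run_with_checkpoints`, `demo`, the JSON checkpoint layout, and the `checkpoint_every` parameter are all hypothetical and chosen only for this example.

```python
import json
import os
import tempfile


def run_with_checkpoints(stages, checkpoint_path, checkpoint_every=1):
    """Run stages in order, resuming from a saved checkpoint if one exists.

    Hypothetical helper for illustration; not part of Kepler.
    """
    # Restore progress left behind by a previous (possibly failed) run.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        next_stage, data = saved["next_stage"], saved["data"]
    else:
        next_stage, data = 0, 0

    for i in range(next_stage, len(stages)):
        data = stages[i](data)  # a stage raises an exception to signal a fault
        done = i + 1
        if done % checkpoint_every == 0 or done == len(stages):
            # Write to a temporary file, then rename, so a fault during the
            # write never leaves a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"next_stage": done, "data": data}, f)
            os.replace(tmp_path, checkpoint_path)
    return data


def demo():
    """Inject one simulated fault in stage 'c' and recover from the checkpoint."""
    calls = []                  # records which stages actually completed
    failed = {"done": False}    # ensures the fault fires only once

    def make_stage(name, fail_once=False):
        def stage(x):
            if fail_once and not failed["done"]:
                failed["done"] = True
                raise RuntimeError(f"simulated fault in stage {name}")
            calls.append(name)
            return x + 1
        return stage

    stages = [
        make_stage("a"),
        make_stage("b"),
        make_stage("c", fail_once=True),
        make_stage("d"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        try:
            run_with_checkpoints(stages, path)
        except RuntimeError:
            pass  # fault detected; fall through to the recovery run
        result = run_with_checkpoints(stages, path)
    return result, calls
```

In this sketch, `demo()` returns `(4, ['a', 'b', 'c', 'd'])`: stages `a` and `b` run only once because the recovery run resumes at stage `c`. A larger `checkpoint_every` writes fewer checkpoints but repeats more work after a fault, which is the trade-off the abstract refers to when it recommends more checkpoints at higher fault rates.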

Original language: English (US)
Title of host publication: HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Pages: 340-343
Number of pages: 4
State: Published - 2010
Event: 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010 - Chicago, IL, United States
Duration: Jun 21 2010 to Jun 25 2010

Publication series

Name: HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

Other

Other: 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010
Country/Territory: United States
City: Chicago, IL
Period: 6/21/10 to 6/25/10

Keywords

  • Autonomic
  • Fault tolerant
  • Kepler
  • Scientific workflow

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
