TY - GEN
T1 - Design and evaluation of a self-healing Kepler for scientific workflows
AU - Hary, Arjun
AU - Akoglu, Ali
AU - AlNashif, Youssif
AU - Hariri, Salim
AU - Jenerette, Darrel
PY - 2010
Y1 - 2010
AB - Kepler is a popular open source scientific workflow (SWF) tool because it simplifies the effort required to construct complex data flow models through a visual interface. As the complexity of workflow applications that run on heterogeneous distributed systems increases, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without restarting from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler's capabilities to support fault tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring workflow performance under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.
KW - Autonomic
KW - Fault tolerant
KW - Kepler
KW - Scientific workflow
UR - http://www.scopus.com/inward/record.url?scp=78649998209&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78649998209&partnerID=8YFLogxK
U2 - 10.1145/1851476.1851525
DO - 10.1145/1851476.1851525
M3 - Conference contribution
AN - SCOPUS:78649998209
SN - 9781605589428
T3 - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
SP - 340
EP - 343
BT - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
T2 - 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010
Y2 - 21 June 2010 through 25 June 2010
ER -