Online Self-Healing Control Loop to Prevent and Mitigate Faults in Scientific Workflows

Rafael Ferreira da Silva, Rosa Filgueira, Ewa Deelman, Erola Pairo-Castineira, Ian Michael Overton, Malcolm Atkinson

Research output: Contribution to conference > Paper > peer-review

Abstract

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we present a novel online self-healing framework, where failures are predicted before they happen and are mitigated when possible. The proposed approach uses control theory developed as part of autonomic computing and, in particular, applies the proportional-integral-derivative (PID) controller control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the mechanism. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era: data footprint and memory usage. We define, implement, and evaluate PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed control loop, and faults are detected and mitigated far in advance.
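As an illustration of the control loop mechanism named in the abstract, the following is a minimal sketch of a textbook discrete PID controller. It is not the authors' implementation: the class name `PIDController`, the gains, the setpoint, and the actuator `next_concurrency` (which maps the control signal to a number of concurrent tasks) are all hypothetical, chosen only to show how the proportional, integral, and derivative terms combine into one signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller; gains and setpoint are assumptions."""
    kp: float        # proportional gain
    ki: float        # integral gain
    kd: float        # derivative gain
    setpoint: float  # desired value of the monitored metric (e.g. GB of disk used)
    _integral: float = 0.0
    _prev_error: Optional[float] = None

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the control signal for one new measurement."""
        error = self.setpoint - measured
        self._integral += error * dt
        # No derivative term on the first sample, since there is no history yet.
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def next_concurrency(current: int, signal: float, max_tasks: int) -> int:
    """Hypothetical actuator: a negative signal (metric above setpoint)
    throttles concurrent tasks, a positive one allows more, within [1, max_tasks]."""
    return max(1, min(max_tasks, current + int(signal)))


if __name__ == "__main__":
    pid = PIDController(kp=0.5, ki=0.1, kd=0.2, setpoint=100.0)
    tasks = 8
    for usage in (90.0, 95.0, 110.0):  # made-up usage readings
        signal = pid.update(usage)
        tasks = next_concurrency(tasks, signal, max_tasks=16)
        print(f"usage={usage} signal={signal:.2f} tasks={tasks}")
```

In this sketch the derivative term reacts to how quickly the metric approaches the setpoint, which is what lets a controller of this kind respond before a threshold is actually crossed. The actions a real controller would take are specific to the workflow system and are not specified here.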
Original language: English
Publication status: Published - 18 Nov 2016
Event: Supercomputing 2016 - Salt Lake City, United States
Duration: 13 Nov 2016 to 18 Nov 2016
http://sc16.supercomputing.org/

Conference

Conference: Supercomputing 2016
Country: United States
City: Salt Lake City
Period: 13/11/16 to 18/11/16