Abstract
Scientific workflows have become mainstream for conducting
large-scale scientific research. As a result, many workflow
applications and Workflow Management Systems (WMSs)
have been developed as part of the cyberinfrastructure to
allow scientists to execute their applications seamlessly on
a range of distributed platforms. In spite of many success
stories, a key challenge for running workflows in distributed
systems is failure prediction, detection, and recovery. In
this paper, we propose an approach that uses control theory, developed as part of autonomic computing, to predict failures before they happen and to mitigate them when possible. The proposed approach applies the proportional-integral-derivative (PID) controller control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the controller. The
PID controller aims at detecting the possibility of a fault far
enough in advance so that an action can be performed to
prevent it from happening. To demonstrate the feasibility of
the approach, we tackle two common execution faults of the
Big Data era—data storage overload and memory overflow.
We define, implement, and evaluate simple PID controllers
to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of
data, and requires over 24TB of memory to run all tasks
concurrently. Experimental results indicate that workflow
executions may significantly benefit from PID controllers,
in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown
of 1.01) can be attained when using our proposed method,
and faults are detected and mitigated far in advance of their
occurrence.
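The abstract describes the control loop only in prose. For readers unfamiliar with PID control, the following is a minimal sketch of a discrete PID controller in Python; it is not the paper's implementation, and all names (PIDController, update, the gains kp/ki/kd, the disk-usage setpoint) are illustrative assumptions.

```python
class PIDController:
    """Minimal discrete PID control loop (illustrative sketch, not the paper's code)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target value, e.g. a disk-usage threshold
        self.integral = 0.0           # accumulated error (I term)
        self.prev_error = 0.0         # previous error, used by the D term

    def update(self, measurement, dt):
        """Return a control signal for the current measurement taken dt seconds apart."""
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


# Hypothetical usage: keep disk usage near 80% of capacity. A negative
# signal could, for instance, throttle how many workflow tasks the WMS
# releases; a positive one could allow more concurrency.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=0.8)
signal = pid.update(measurement=0.92, dt=1.0)  # usage above setpoint -> negative signal
```

The derivative term is what gives such a controller its anticipatory character: it reacts to the rate at which usage approaches the threshold, which matches the abstract's goal of detecting a fault far enough in advance to act.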
Original language | English
---|---
Pages (from-to) | 15-24
Number of pages | 10
Journal | CEUR Workshop Proceedings
Volume | 1800
Publication status | Published - 28 Feb 2017
Event | 11th Workshop on Workflows in Support of Large-Scale Science 2016, Salt Lake City, United States, 14 Nov 2016