TY - GEN
T1 - Supervised workpools for reliable massively parallel computing
AU - Stewart, Robert
AU - Trinder, Philip William
AU - Maier, Patrick
N1 - To appear
PY - 2013
Y1 - 2013
N2 - The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations in architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known for advantages both for parallelism and for reliability, e.g. stateless computations can be scheduled and replicated freely. This paper presents a software level reliability mechanism, namely supervised fault tolerant workpools implemented in a Haskell DSL for parallel programming on distributed memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
AB - The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations in architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known for advantages both for parallelism and for reliability, e.g. stateless computations can be scheduled and replicated freely. This paper presents a software level reliability mechanism, namely supervised fault tolerant workpools implemented in a Haskell DSL for parallel programming on distributed memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
KW - Fault tolerance
KW - Haskell
KW - parallel computing
KW - workpools
U2 - 10.1007/978-3-642-40447-4_16
DO - 10.1007/978-3-642-40447-4_16
M3 - Conference contribution
AN - SCOPUS:84883172435
SN - 9783642404467
VL - 7829
T3 - Lecture Notes in Computer Science
SP - 247
EP - 262
BT - Trends in Functional Programming
PB - Springer
T2 - 13th Symposium on Trends in Functional Programming
Y2 - 12 June 2012 through 12 June 2012
ER -