Transparent Fault Tolerance for Scalable Functional Computation

Robert Stewart, Phil Trinder, Patrick Maier

Research output: Contribution to journalArticlepeer-review

6 Citations (Scopus)
56 Downloads (Pure)

Abstract

Reliability is set to become a major concern on emergent large-scalearchitectures. While there are many parallel languages, and indeedmany parallel functional languages, very few address reliability. Thenotable exception is the widely emulated Erlang distributed actormodel that provides explicit supervision and recovery of actors withisolated state.We investigate scalable transparent fault tolerant functionalcomputation with automatic supervision and recovery of tasks. We do soby developing HdpH-RS, a variant of the Haskell distributed parallelHaskell (HdpH) DSL with Reliable Scheduling. Extending the distributedwork stealing protocol of HdpH for task supervision and recovery ischallenging. To eliminate elusive concurrency bugs, we validate theHdpH-RS work stealing protocol using the SPIN model checker.HdpH-RS differs from the actor model in that its principal entitiesare tasks, i.e. independent stateless computations, rather thanisolated stateful actors. Thanks to statelessness, fault recovery canbe performed automatically and entirely hidden in the HdpH-RS runtimesystem. Statelessness is also key for proving a crucial property ofthe semantics of HdpH-RS: fault recovery does not change the result ofthe program, akin to deterministic parallelism.HdpH-RS provides a simple distributed fork/join-style programmingmodel, with minimal exposure of fault tolerance at the language level,and a library of higher level abstractions such as algorithmicskeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpHDSL, hence users can opt in or out of fault tolerant execution withoutany refactoring.Computations in HdpH-RS are always as reliable as the root node, nomatter how many nodes and cores are actually used. We benchmarkHdpH-RS on conventional clusters and an HPC platform: all benchmarkssurvive Chaos Monkey random fault injection; the system scales welle.g. up to 1400 cores on the HPC; reliability and recovery overheadsare consistently low even at scale.
Original languageEnglish
Article numbere5
Number of pages42
JournalJournal of Functional Programming
Volume26
Early online date17 Mar 2016
DOIs
Publication statusE-pub ahead of print - 17 Mar 2016

Fingerprint

Dive into the research topics of 'Transparent Fault Tolerance for Scalable Functional Computation'. Together they form a unique fingerprint.

Cite this