Supervised workpools for reliable massively parallel computing

Robert Stewart, Philip William Trinder, Patrick Maier

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)
25 Downloads (Pure)

Abstract

The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations on architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known for their advantages in both parallelism and reliability, e.g. stateless computations can be scheduled and replicated freely. This paper presents a software-level reliability mechanism, namely supervised fault-tolerant workpools implemented in a Haskell DSL for parallel programming on distributed-memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault-tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
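The idea of a supervised workpool can be illustrated with a minimal, single-process sketch. It is written in Python purely for illustration and is not the paper's Haskell DSL or its API: the names `supervised_workpool`, `WorkerFailure`, and the `dead` set used to simulate failed workers are all hypothetical. The sketch assumes tasks are stateless, which is what makes re-running a lost task on another worker safe.

```python
from collections import deque
from typing import Callable, Dict, List, Set, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class WorkerFailure(Exception):
    """Hypothetical signal that a (simulated) worker has died."""


def supervised_workpool(
    task_fn: Callable[[T], R],
    tasks: List[T],
    workers: List[str],
    dead: Set[str],
) -> Dict[int, R]:
    """Run every task to completion, re-issuing tasks lost to failed workers.

    `dead` is an assumption of this sketch: it names the workers that fail
    as soon as they are handed a task.
    """
    pending = deque(enumerate(tasks))  # (task id, input) pairs awaiting execution
    alive = deque(workers)             # round-robin queue of workers believed alive
    results: Dict[int, R] = {}

    while pending:
        if not alive:
            raise RuntimeError("all workers failed; tasks cannot complete")
        task_id, arg = pending.popleft()
        worker = alive.popleft()
        try:
            if worker in dead:         # simulated failure detection
                raise WorkerFailure(worker)
            results[task_id] = task_fn(arg)
            alive.append(worker)       # healthy worker rejoins the rotation
        except WorkerFailure:
            # Replicate the lost task; the failed worker is not re-queued.
            pending.append((task_id, arg))
    return results


if __name__ == "__main__":
    squares = supervised_workpool(
        lambda x: x * x, [1, 2, 3, 4], ["w0", "w1", "w2"], dead={"w1"}
    )
    print(sorted(squares.items()))
```

The caller only supplies the task function and its inputs. Scheduling, failure detection, and re-execution stay inside the pool, which mirrors the separation of concerns the abstract describes.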

Original language: English
Title of host publication: Trends in Functional Programming
Subtitle of host publication: 13th International Symposium, TFP 2012, St. Andrews, UK, June 12-14, 2012, Revised Selected Papers
Publisher: Springer
Pages: 247-262
Number of pages: 16
Volume: 7829
ISBN (Electronic): 978-3-642-40447-4
ISBN (Print): 978-3-642-40446-7
Publication status: Published - 2013
Event: 13th Symposium on Trends in Functional Programming - St. Andrews, United Kingdom
Duration: 12 Jun 2012 to 12 Jun 2012

Publication series

Name: Lecture Notes in Computer Science
Volume: 7829
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 13th Symposium on Trends in Functional Programming
Abbreviated title: TFP 2012
Country/Territory: United Kingdom
City: St. Andrews
Period: 12/06/12 to 12/06/12

Keywords

  • Fault tolerance
  • Haskell
  • parallel computing
  • workpools

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science
