Transparently resilient task parallelism for Chapel

Konstantina Panagiotopoulou, Hans-Wolfgang Loidl

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Hardware failure in High-Performance Computing systems is the norm. Failure data collected over a nine-year period across 22 large-scale systems of up to a few thousand NUMA or SMP nodes at Los Alamos National Laboratory show averages of 20-1000 failures per year. This paper describes the design and outlines the implementation of transparent resilience for task parallelism in Chapel, a high-performance language developed for productive parallel programming. We detail the design directions and implement a transparent resilience mechanism within Chapel's runtime system. Our primary goal is to ensure program termination in the presence of hardware failure of one or multiple nodes in the system. We evaluate our implementation using a set of five synthetic microbenchmarks covering Chapel's task-parallel constructs, and we quantify and discuss the small overheads and speedups noted for the resilient implementation compared to the latest non-resilient Chapel release.

Original language: English
Title of host publication: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops
Publisher: IEEE
Pages: 1586-1595
Number of pages: 10
ISBN (Electronic): 9781509036820
ISBN (Print): 9781509036837
Publication status: Published - 4 Aug 2016
Event: 30th IEEE International Parallel and Distributed Processing Symposium Workshops 2016 - Chicago, United States
Duration: 23 May 2016 → 27 May 2016

Conference

Conference: 30th IEEE International Parallel and Distributed Processing Symposium Workshops 2016
Abbreviated title: IPDPSW 2016
Country/Territory: United States
City: Chicago
Period: 23/05/16 → 27/05/16

Keywords

  • Automatic task adoption
  • Fault detection
  • Fault recovery
  • Fault tolerance
  • Parallelism
  • PGAS
  • Resilience
  • Runtime system
  • Transparency

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
