Abstract
Hardware failure in High-Performance Computing systems is the norm. Failure data collected over a nine-year period across 22 large-scale systems of up to a few thousand NUMA or SMP nodes at Los Alamos National Laboratory show averages of 20-1000 failures per year. This paper describes the design and outlines the implementation of transparent resilience for task parallelism in Chapel, a high-performance language developed for productive parallel programming. We detail the design directions and implement a transparent resilience mechanism within Chapel's runtime system. Our primary goal is to ensure program termination in the presence of hardware failure of one or multiple nodes in the system. We evaluate our implementation using a set of five synthetic microbenchmarks covering Chapel's task-parallel constructs, and we quantify and discuss the small overheads and speedups noted for the resilient implementation compared to the latest non-resilient Chapel release.
Original language | English |
---|---|
Title of host publication | 2016 IEEE International Parallel and Distributed Processing Symposium Workshops |
Publisher | IEEE |
Pages | 1586-1595 |
Number of pages | 10 |
ISBN (Electronic) | 9781509036820 |
ISBN (Print) | 9781509036837 |
Publication status | Published - 4 Aug 2016 |
Event | 30th IEEE International Parallel and Distributed Processing Symposium Workshops 2016 - Chicago, United States; Duration: 23 May 2016 → 27 May 2016 |
Conference
Conference | 30th IEEE International Parallel and Distributed Processing Symposium Workshops 2016 |
---|---|
Abbreviated title | IPDPSW 2016 |
Country/Territory | United States |
City | Chicago |
Period | 23/05/16 → 27/05/16 |
Keywords
- Automatic task adoption
- Fault detection
- Fault recovery
- Fault tolerance
- Parallelism
- PGAS
- Resilience
- Runtime system
- Transparency
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Networks and Communications
- Hardware and Architecture
- Software