A lightweight approach to GPU resilience

Max Baird, Christian Fensch, Sven-Bodo Scholz, Artjoms Šinkarovs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Resilience for HPC applications is typically implemented as a CPU-based rollback-recovery technique. In this context, long-running accelerator computations on GPUs pose a major challenge, as these devices usually do not offer any means of interruption. This paper proposes a solution to this problem: a novel approach that rewrites GPU kernels so that a soft interrupt of their execution becomes possible. Our approach is based on the Compute Unified Device Architecture (CUDA) by Nvidia and takes advantage of CUDA’s execution model, which partitions threads into blocks. In essence, we rewrite the kernel so that each block determines whether it should continue execution or return control to the CPU. By doing so, we are able to interrupt kernels prematurely.
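The block-level check described in the abstract can be illustrated with a minimal host-only sketch in C++. This is not the authors' code: it emulates each CUDA block as one loop iteration so that it runs without a GPU, and every name in it (`KernelState`, `launch`, `interrupt_after`) is hypothetical. It also assumes a per-block completion record so that a later launch can skip finished blocks; the abstract does not say how the paper tracks progress.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of a rewritten kernel (hypothetical names throughout).
// Each loop iteration stands in for one CUDA thread block.
struct KernelState {
    std::vector<bool> done;  // assumed per-block completion record
    std::vector<int> out;    // per-block result
};

// Emulates one kernel launch. `interrupt_after` stands in for the host raising
// a stop flag asynchronously: the flag is set once that many blocks have done
// their work in this launch. Returns true when every block has completed.
bool launch(KernelState& s, std::size_t interrupt_after) {
    assert(s.done.size() == s.out.size());
    bool stop = false;         // the soft-interrupt flag
    std::size_t executed = 0;  // blocks that did real work in this launch
    for (std::size_t b = 0; b < s.done.size(); ++b) {
        if (s.done[b]) continue;  // finished in an earlier launch
        if (stop) continue;       // soft interrupt: block returns at once
        s.out[b] = static_cast<int>(b * b);  // stand-in for the block's work
        s.done[b] = true;
        if (++executed == interrupt_after) stop = true;
    }
    for (bool d : s.done) {
        if (!d) return false;  // some block still has to run
    }
    return true;
}
```

Calling `launch` on eight blocks with `interrupt_after` set to 3 completes blocks 0 to 2 and returns `false`; a second call with a limit larger than the block count finishes the remaining blocks and returns `true`. In a real CUDA kernel the flag would presumably live in memory visible to both host and device, and one thread per block would read it before the block starts its work; those details are assumptions here, not taken from the paper.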

Original language: English
Title of host publication: Euro-Par 2018
Subtitle of host publication: Parallel Processing Workshops
Editors: Gabriele Mencagli, Dora B. Heras
Publisher: Springer
Pages: 826-838
Number of pages: 13
ISBN (Electronic): 9783030105495
ISBN (Print): 9783030105488
Publication status: E-pub ahead of print - 31 Dec 2018
Event: 24th International Conference on Parallel and Distributed Computing 2018 - Turin, Italy
Duration: 27 Aug 2018 to 28 Aug 2018

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 11339
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Parallel and Distributed Computing 2018
Abbreviated title: Euro-Par 2018
Country: Italy
City: Turin
Period: 27/08/18 to 28/08/18

Keywords

  • GPU
  • HPC
  • Resilience

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
