Checkpointing Kernel Executions of MPI+CUDA Applications

Max Baird, Sven-Bodo Scholz, Artjoms Šinkarovs, Leonardo Bautista-Gomez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a new approach to checkpointing MPI applications that use long-running CUDA kernels. The approach makes it possible to take snapshots of data residing on the GPUs without waiting for kernels to complete. The proposed technique is implemented in the context of FTI, a state-of-the-art high-performance fault tolerance library. The result is an elegant solution to the problem of developing resilient MPI applications in which GPU kernels run longer than the mean time between hardware failures. We describe in detail how we checkpoint and restart collaborative MPI-CUDA applications, and we provide an initial evaluation of the proposed approach using the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) application as a case study.

Original language: English
Title of host publication: Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019
Publisher: Springer
Pages: 694-706
Number of pages: 13
ISBN (Electronic): 9783030483401
ISBN (Print): 9783030483395
Publication status: Published - 2020
Event: 25th International European Conference on Parallel and Distributed Computing 2019 - Göttingen, Germany
Duration: 26 Aug 2019 to 30 Aug 2019

Publication series

Name: Lecture Notes in Computer Science
Volume: 11997
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th International European Conference on Parallel and Distributed Computing 2019
Abbreviated title: xiWAT 2020
Country: Germany
City: Göttingen
Period: 26/08/19 to 30/08/19

Keywords

  • Checkpoints
  • GPU
  • HPC
  • MPI
  • Resilience
  • Snapshots

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)
