Checkpointing Kernel Executions of MPI+CUDA Applications

Max Baird*, Sven-Bodo Scholz, Artjoms Šinkarovs, Leonardo Bautista-Gomez

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)


This paper proposes a new approach to checkpointing MPI applications that use long-running CUDA kernels. With this approach, snapshots of data residing on the GPUs can be taken without waiting for kernels to complete. The proposed technique is implemented in the state-of-the-art high-performance fault tolerance library FTI. The result is an elegant solution to the problem of developing resilient MPI applications whose GPU kernels run longer than the mean time between hardware failures. We describe in detail how we checkpoint and restart collaborative MPI-CUDA applications, and we provide an initial evaluation of the proposed approach using the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) application as a case study.
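
The abstract does not spell out the mechanism, so the following is only a rough, CPU-only illustration of the general idea of snapshotting a long-running computation before it completes. It is not the FTI API and not the authors' implementation. It assumes the kernel decomposes into independent blocks and that a per-block completion record is saved with the data; the names `run_kernel`, `checkpoint`, `restore`, and `demo` are hypothetical.

```python
import pickle


def run_kernel(data, done, should_stop):
    """Process every block not yet marked done.

    `should_stop` is polled before each block; if it returns True the
    kernel returns early so that a snapshot can be taken.
    Returns True once all blocks have finished, False if interrupted.
    """
    for b in range(len(done)):
        if done[b]:
            continue                    # finished before the last snapshot
        if should_stop():
            return False                # interrupted between blocks
        data[b] = data[b] * 2 + 1       # stand-in for the real per-block work
        done[b] = True
    return True


def checkpoint(data, done):
    """Serialize the data together with the block-completion record."""
    return pickle.dumps((data, done))


def restore(blob):
    """Recover the data and the completion record from a snapshot."""
    return pickle.loads(blob)


def demo():
    n = 8
    reference = [x * 2 + 1 for x in range(n)]   # result of an uninterrupted run

    data, done = list(range(n)), [False] * n
    requests = iter([False, False, False, True])  # snapshot requested at 4th poll
    run_kernel(data, done, lambda: next(requests))
    snapshot = checkpoint(data, done)

    # Simulated failure: live state is lost, only the snapshot survives.
    data, done = restore(snapshot)
    run_kernel(data, done, lambda: False)         # resume the remaining blocks
    return data == reference


if __name__ == "__main__":
    print(demo())
```

Because the completion record travels with the data, a restart redoes only the blocks that had not finished when the snapshot was taken. A real GPU version would additionally have to handle device memory and in-flight blocks, which this sketch ignores.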

Original language: English
Title of host publication: Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019
Number of pages: 13
ISBN (Electronic): 9783030483401
ISBN (Print): 9783030483395
Publication status: Published - 2020
Event: 25th International European Conference on Parallel and Distributed Computing 2019 - Göttingen, Germany
Duration: 26 Aug 2019 → 30 Aug 2019

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 25th International European Conference on Parallel and Distributed Computing 2019
Abbreviated title: Euro-Par 2019


Keywords

  • Checkpoints
  • GPU
  • HPC
  • MPI
  • Resilience
  • Snapshots

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

