https://www.deepspeed.ai/tutorials/universal-checkpointing/
Universal Checkpointing with DeepSpeed: A Practical Guide - DeepSpeed
The DeepSpeed Universal Checkpointing feature is a powerful tool for saving and loading model checkpoints in a way that is both efficient and flexible, enabling...
universal, checkpointing, deepspeed, practical, guide
https://scholarsmine.mst.edu/comsci_facwork/1108/
"Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-I" by Tasmia Reza, Kristopher...
Future exascale systems are expected to be characterized by more frequent failures than current petascale systems. This places increased importance on the...
https://arxiv.org/html/2406.18820v1
Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
large scale, universal, checkpointing, efficient, flexible
https://talend.github.io/component-runtime/main/latest/component-checkpoint.html
Checkpointing :: Talend Component Kit Developer Reference Guide
How to develop a checkpointing producer with Talend Component Kit
developer reference, checkpointing, talend, component, kit
https://repository.gatech.edu/entities/publication/ce5d2c64-96c6-4a3f-9d05-116787ff916d
KIMA: Hybrid Checkpointing for Recovery from a Wide Range of Errors and Detection Latencies
Full system reliability is a problem that spans multiple levels of the software/hardware stack. The normal execution of a program in a system can be disrupted...
https://chaste.github.io/releases/2026.1/user-tutorials/cardiaccheckpointingandrestarting/
Cardiac Checkpointing And Restarting Chaste
Chaste (Cancer, Heart and Soft Tissue Environment) is a general purpose simulation package aimed at multi-scale, computationally demanding problems arising in...
cardiac, checkpointing, chaste