Robuta

https://arxiv.org/abs/1301.5007
Abstract page for arXiv paper 1301.5007: Ergodicity and scaling limit of a constrained multivariate Hawkes process
scaling limit, ergodicity, constrained, multivariate
https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling
Nov 22, 2024 - Hardware failures won’t limit AI training scale - GPU memory checkpointing enables training with millions of GPUs despite failures.
hardware, failures, limit, scaling, epoch