Reducing Checkpoint Costs with Python and NVIDIA nvCOMP


Training large language models requires periodic checkpoints. These full snapshots of model weights, optimizer states, and gradients are saved to storage so training can resume after interruptions. At scale, checkpoints become massive (about 782 GB for a 70B parameter model) and frequent (every 15-30 minutes), making them one of the largest line items in a training budget. Most AI teams focus on GPU utilization, training throughput, and model quality; few pay attention to checkpointing costs, and that oversight is expensive.
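The ~782 GB figure can be reproduced with simple arithmetic. The byte breakdown below is an assumption on our part (bf16 weights and gradients plus two fp32 Adam moments, i.e. 12 bytes per parameter); it is one common configuration that lands close to the article's number:

```python
# Back-of-the-envelope checkpoint size.
# Assumed per-parameter breakdown (not stated in the article):
# bf16 weights (2 B) + bf16 gradients (2 B) + fp32 Adam moments (4 B + 4 B).
BYTES_PER_PARAM = 2 + 2 + 4 + 4  # 12 bytes per parameter

def checkpoint_size_gib(num_params: float) -> float:
    """Approximate full-checkpoint size in GiB."""
    return num_params * BYTES_PER_PARAM / 2**30

size = checkpoint_size_gib(70e9)  # ~782 GiB for a 70B-parameter model
```

Under these assumptions, 70e9 parameters × 12 bytes ≈ 782 GiB, matching the article's figure.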

The synchronous checkpoint overhead for a 405B model on 128 NVIDIA DGX B200 GPUs can reach $200,000 a month. By introducing a lossless compression step implemented in about 30 lines of Python, we can reduce storage costs by $56,000 each month. Mixture of experts (MoE) models save even more. This blog post breaks down how we arrived at these figures and how NVIDIA nvCOMP can make checkpointing more efficient.

Hardware interruptions at a scale of 1,000+ GPUs are not uncommon. Meta reported 419 unexpected interruptions during 54 days of Llama 3 training on 16,384 NVIDIA H100 GPUs. This is why most teams checkpoint every 15-30 minutes; it is essential infrastructure rather than optional overhead. At a 30-minute interval, standard practice produces 48 checkpoints per day, which for the 70B example works out to roughly 1.13 PB written to storage over a month of continuous training.
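The 1.13 PB figure follows directly from the checkpoint size and cadence stated above:

```python
# Monthly checkpoint write volume for the 70B example.
CHECKPOINT_GB = 782        # per-checkpoint size from the 70B example
CHECKPOINTS_PER_DAY = 48   # one checkpoint every 30 minutes
DAYS = 30                  # a month of continuous training

monthly_tb = CHECKPOINT_GB * CHECKPOINTS_PER_DAY * DAYS / 1000
monthly_pb = monthly_tb / 1000  # ~1.13 PB written to storage
```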

Moreover, during each synchronous checkpoint write, all 8 GPUs in a node sit idle. At roughly 156.4 seconds per write, idle GPU time alone costs over $2,200 a month, not including storage fees. Scaling to a 64-GPU cluster raises the monthly cost to over $17,500, and at 128 GPUs, in the 405B configuration above, idle costs exceed $200,000.
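The idle-cost arithmetic can be sketched as follows. The per-GPU-hour rate below is our assumption, chosen so the result reproduces the article's ~$2,200/month figure; substitute your own pricing:

```python
# Idle-GPU cost of synchronous checkpointing on one 8-GPU node.
WRITE_SECONDS = 156.4      # wall-clock time per synchronous checkpoint write
CHECKPOINTS_PER_DAY = 48   # one every 30 minutes
DAYS = 30
GPUS = 8                   # all GPUs in the node stall during the write
USD_PER_GPU_HOUR = 4.40    # assumed rate, not stated in the article

idle_gpu_hours = WRITE_SECONDS * CHECKPOINTS_PER_DAY * DAYS * GPUS / 3600
monthly_idle_cost = idle_gpu_hours * USD_PER_GPU_HOUR  # ~ $2,200 per month
```

Idle cost scales linearly with GPU count at a fixed write time, which is how the 64-GPU figure lands near 8× the single-node cost.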

Asynchronous checkpointing alleviates part of the issue, but framework support is still maturing. A complementary technique that can be easily implemented is checkpoint compression, which also reduces cold start time when restoring state. NVIDIA nvCOMP introduces GPU-accelerated compression, allowing for compression directly in GPU memory without additional data transfer costs.
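The shape of the compression step can be sketched in a few lines. The version below uses zlib on the CPU purely as a portable stand-in so it runs anywhere; in the real pipeline you would invoke nvCOMP's GPU codec on device buffers instead, avoiding the host round-trip. The helper names are illustrative, not from the article:

```python
# Minimal sketch of a lossless checkpoint compression step.
# CPU stand-in (zlib); the production path would use nvCOMP's GPU codec
# so tensors are compressed directly in GPU memory before the write.
import zlib

def compress_checkpoint(state: dict) -> dict:
    """Losslessly compress each serialized buffer in a checkpoint."""
    return {name: zlib.compress(blob, level=3) for name, blob in state.items()}

def decompress_checkpoint(state: dict) -> dict:
    """Invert compress_checkpoint, restoring the original bytes exactly."""
    return {name: zlib.decompress(blob) for name, blob in state.items()}

# Round-trip check on a fake, highly compressible "weight" buffer.
fake_state = {"layer0.weight": bytes(4096)}
packed = compress_checkpoint(fake_state)
assert decompress_checkpoint(packed) == fake_state
```

Because the compression is lossless, restored weights are bit-identical to the originals, and smaller checkpoint files also shorten cold-start restore time.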

We fine-tuned two model architectures and compressed each checkpoint component using nvCOMP on NVIDIA H200 and B200 GPUs. The compression ratios depend on the data rather than the hardware. Both ZSTD and ANS compress quickly and integrate easily into Python workflows; the choice between them depends on your storage speed.
