Optimize Model Training with Amazon SageMaker HyperPod
TGS, a geoscience data provider for the energy sector, develops advanced seismic foundation models (SFMs) that analyze complex 3D seismic data to identify geological structures essential for energy exploration. As part of their AWS infrastructure modernization, TGS partnered with the AWS Generative AI Innovation Center (GenAIIC) to optimize their SFM training infrastructure. This article describes how TGS achieved near-linear scaling for distributed training and expanded context windows for their Vision Transformer-based SFM using Amazon SageMaker HyperPod. This joint solution reduced training time from 6 months to just 5 days while enabling the analysis of seismic volumes larger than previously possible.
The SFM employs a Vision Transformer (ViT) architecture with Masked AutoEncoder (MAE) training designed by the TGS team. Scaling such models presents several challenges: handling data at scale and complexity, maintaining training efficiency, and expanding analytical capabilities. TGS works with large volumes of proprietary 3D seismic data stored in domain-specific formats. The sheer volume and structure of this data require efficient streaming strategies to maintain high throughput and prevent GPU idle time during training.
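To make the ViT/MAE setup concrete, the sketch below shows the two core operations on a 3D volume: splitting it into non-overlapping cubic patches (the ViT tokens) and randomly masking most of them, as in MAE pretraining. This is a minimal illustration, not TGS's implementation; the volume size, patch size, and mask ratio are assumed for the example.

```python
import numpy as np

def patchify_3d(volume: np.ndarray, patch: int) -> np.ndarray:
    """Split a (D, H, W) seismic volume into non-overlapping cubic
    patches, returning an array of shape (num_patches, patch**3)."""
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    x = volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # gather the three patch axes together
    return x.reshape(-1, patch ** 3)

def mae_mask(tokens: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Keep a random (1 - mask_ratio) subset of tokens; the encoder only
    sees the visible tokens, which is what makes MAE pretraining cheap."""
    n = tokens.shape[0]
    keep = max(1, int(n * (1 - mask_ratio)))
    idx = rng.permutation(n)[:keep]
    return tokens[idx], idx

# Example: a 64^3 volume with 16^3 patches yields 4*4*4 = 64 tokens.
vol = np.random.default_rng(0).standard_normal((64, 64, 64)).astype(np.float32)
tokens = patchify_3d(vol, patch=16)
visible, idx = mae_mask(tokens, mask_ratio=0.75, rng=np.random.default_rng(1))
```

With a 75% mask ratio, only 16 of the 64 tokens are passed to the encoder, which is the main source of MAE's training efficiency.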
Training large foundation models on 3D volumetric data is computationally intensive. Accelerating training cycles would enable TGS to incorporate new data more frequently and iterate on model improvements faster, delivering more value to their clients. The geological context a model can analyze depends on how much 3D volume it can process at once. Expanding this capability would allow the models to capture both local details and broader geological patterns simultaneously.
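The cost of expanding the context window follows directly from the geometry: with cubic patches, the token count grows with the cube of the volume's side length, and self-attention cost grows roughly quadratically in the token count. A quick worked example (patch and volume sizes assumed for illustration):

```python
def num_tokens(side: int, patch: int) -> int:
    """Token count for a cubic volume of the given side length,
    split into non-overlapping cubic patches."""
    assert side % patch == 0
    return (side // patch) ** 3

# Doubling the side length of the analyzed cube gives 8x the tokens,
# and therefore on the order of 64x the self-attention compute.
small = num_tokens(128, patch=16)  # 8**3  = 512 tokens
large = num_tokens(256, patch=16)  # 16**3 = 4096 tokens
```

This cubic growth is why expanding the context window requires infrastructure-level optimization rather than simply feeding the model larger inputs.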
Addressing these challenges required a comprehensive approach to distributed training and infrastructure optimization. AWS GenAIIC partnered with TGS to develop a solution spanning three key areas: establishing an efficient data pipeline, optimizing distributed training across multiple nodes, and expanding the model's context window to analyze larger geological volumes.
The solution utilizes SageMaker HyperPod to provide a resilient, scalable training infrastructure with automatic health monitoring and checkpoint management. The SageMaker HyperPod cluster is configured with AWS Identity and Access Management (IAM) execution roles scoped to the minimum permissions required for training operations, deployed within a virtual private cloud (VPC) with network isolation and security groups restricting communication to authorized training nodes. Terabytes of training data stream directly from Amazon Simple Storage Service (Amazon S3), alleviating the need for intermediate storage layers while maintaining high throughput.
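The streaming pattern above can be sketched as a reader that consumes fixed-length traces directly from a sequential byte stream, so training never stages data on local disk. In production the stream could be the `Body` of a `boto3` S3 `GetObject` response (a `StreamingBody`); the function name, trace layout, and sample format below are illustrative assumptions, and the example uses an in-memory buffer in place of the S3 stream.

```python
import io
import struct
from typing import BinaryIO, Iterator

def stream_traces(stream: BinaryIO, samples_per_trace: int) -> Iterator[list]:
    """Yield fixed-length float32 traces from a sequential byte stream.
    In production, `stream` could be the Body of an S3 GetObject response
    (boto3's StreamingBody); here any file-like object works."""
    trace_bytes = samples_per_trace * 4  # 4 bytes per float32 sample
    while True:
        buf = stream.read(trace_bytes)
        if len(buf) < trace_bytes:
            break  # end of object (or a trailing partial trace)
        yield list(struct.unpack(f"<{samples_per_trace}f", buf))

# Usage with an in-memory buffer standing in for the S3 stream:
data = struct.pack("<8f", *range(8))
traces = list(stream_traces(io.BytesIO(data), samples_per_trace=4))
```

Because the reader only ever holds one trace in memory, the same loop scales to terabyte-sized objects without any intermediate storage layer.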