Optimized Deployments in SageMaker JumpStart
Amazon SageMaker JumpStart provides pretrained models for a variety of tasks, making it easier to get started with AI workloads. JumpStart offers solutions for common use cases that can be deployed to SageMaker AI Managed Inference endpoints or SageMaker HyperPod clusters. With preset deployment options, customers can move quickly from model selection to deployment.
Model deployments through SageMaker JumpStart are fast and straightforward. Customers can select options based on the expected number of concurrent users, considering P50 latency, time-to-first-token (TTFT), and throughput (tokens/second/user). While concurrent-user configuration options are helpful for general scenarios, they are not task-aware, and we recognize that customers use SageMaker JumpStart for diverse use cases such as content generation, summarization, and Q&A. Each use case may require specific settings to achieve the best performance.
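As a concrete illustration of these metrics, the sketch below (plain Python with made-up sample numbers, not real benchmark data) computes P50 latency and per-user throughput from a set of hypothetical request measurements:

```python
import statistics

# Hypothetical per-request measurements from a load test
# (latency in seconds; each request generates the same token count).
latencies_s = [0.82, 0.95, 1.10, 0.88, 1.40, 0.79, 1.02, 0.91]
tokens_per_request = 256
concurrent_users = 8

# P50 latency: the median request latency.
p50_latency = statistics.median(latencies_s)

# Aggregate throughput: total tokens divided by wall-clock time, here
# approximated by the summed latencies spread across concurrent users.
total_tokens = tokens_per_request * len(latencies_s)
wall_clock_s = sum(latencies_s) / concurrent_users  # requests overlap
aggregate_tps = total_tokens / wall_clock_s

# Per-user throughput, the figure JumpStart reports (tokens/second/user).
per_user_tps = aggregate_tps / concurrent_users
```

Note that per-user throughput, not aggregate throughput, is what an individual user experiences, which is why JumpStart reports it per concurrent user.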
The definition of performance is not limited to latency; some customers measure it in terms of throughput or the lowest cost per token. With this in mind, we are excited to announce the launch of SageMaker JumpStart optimized deployments. Optimized deployments address the need for rich yet straightforward deployment customization by offering predefined configurations designed for specific use cases. Customers retain the same visibility into the details of their proposed deployments, but those deployments are now tuned to their specific use case and performance constraints.
To start using SageMaker JumpStart optimized deployments, customers need at least the following: an AWS account, a SageMaker Studio domain, and an AWS Identity and Access Management (IAM) role that can create a model and an endpoint. Once these prerequisites are in place, customers can begin using optimized deployments right away.
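As a rough sketch of the permissions such a role might need, the minimal, illustrative policy below covers model and endpoint creation; a production role will typically need additional permissions (for example, Amazon ECR image pulls and CloudWatch logging):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "sagemaker:DescribeEndpoint",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```

In practice, scoping `Resource` to specific ARNs rather than `*` is recommended.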
To get started, open SageMaker Studio and choose Models. Select any model that supports optimized deployments and choose Deploy in the top-right corner. The resulting screen now features a collapsible window labeled “Performance,” which presents the selection options for optimized deployments. Customers first select a use case. For text-based models, use cases range from generative writing to chat-style interactions; image and video models will feature different use cases once support is added for those input types.
After selecting a use case, customers choose one of three constraint optimizations: Cost optimized, Throughput optimized, or Latency optimized. There is also a Balanced option for customers looking for the best average performance across all logged metrics. Based on this selection, a preset deployment configuration is defined for the endpoint. Customers can then review and adjust additional configuration values such as timeouts, endpoint naming, and security settings. Once configuration is complete, customers choose Deploy in the bottom-right corner.
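Conceptually, the constraint choice maps to a preset deployment configuration, which can be pictured as a simple lookup. The sketch below is purely illustrative: the instance types and counts are hypothetical assumptions, not the actual presets that SageMaker JumpStart ships.

```python
# Hypothetical preset table: constraint choice -> deployment configuration.
# Instance types and counts below are illustrative assumptions only.
PRESETS = {
    "cost": {"instance_type": "ml.g5.2xlarge", "instance_count": 1},
    "throughput": {"instance_type": "ml.g5.12xlarge", "instance_count": 2},
    "latency": {"instance_type": "ml.p4d.24xlarge", "instance_count": 1},
    "balanced": {"instance_type": "ml.g5.12xlarge", "instance_count": 1},
}

def select_preset(constraint: str) -> dict:
    """Return the preset deployment configuration for a constraint choice."""
    try:
        return PRESETS[constraint]
    except KeyError:
        raise ValueError(f"Unknown constraint: {constraint!r}") from None

config = select_preset("latency")
```

The real presets are benchmarked per model, so the same constraint can resolve to different hardware for different models.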
SageMaker JumpStart optimized deployments are available for the following models: Meta Llama-3.1-8B-Instruct, Llama-2-7b-hf, Llama-3.2-3B, Meta-Llama-3-8B, Llama-3.2-1B-Instruct, and others. These are the launch models for optimized deployments, and we are actively expanding support to include additional models.