Use the latest Amazon SageMaker HyperPod recipes to speed up foundation model training and fine-tuning.
Amazon SageMaker HyperPod recipes are now generally available, helping developers and data scientists of all skill levels start training and fine-tuning foundation models (FMs) with state-of-the-art performance in just a few minutes. They provide optimized training and tuning recipes for popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, and Mixtral 8x22B.
AWS introduced SageMaker HyperPod at AWS re:Invent 2023. It uses preconfigured distributed training libraries to split training workloads across more than a thousand compute resources in parallel, reducing FM training time by up to 40%. With SageMaker HyperPod, you can identify the accelerated compute resources needed for training, create optimal training plans, and run training workloads across different blocks of capacity based on compute resource availability.
SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative testing and evaluation. The recipes automate several critical steps, including loading training datasets, applying distributed training techniques, automating checkpoints for faster failure recovery, and managing the end-to-end training loop.
To further optimize training performance and reduce costs, you can easily switch between GPU-based and AWS Trainium-based instances with a simple recipe change. You can run workloads in production on either SageMaker HyperPod or SageMaker training jobs.
SageMaker HyperPod recipes in action
To get started, browse the training recipes for popular publicly available FMs in the SageMaker HyperPod recipes GitHub repository.
You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.
After cloning the repository, you must modify the recipe config.yaml file to define the model and cluster type.
The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set the cluster type (Slurm orchestrator), the model name (Meta Llama 3.1 405B language model), the instance type (ml.p5.48xlarge), and the data locations for training data, results, and logs.
This YAML file defines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding strategies, the optimizer, and logging to track experiments with TensorBoard. You can optionally adjust model-specific training parameters in this file.
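To illustrate the kind of edit involved, here is a minimal sketch that loads a recipe configuration with PyYAML and updates a few cluster and data settings programmatically. The file path and key names are placeholders for illustration rather than the exact recipe schema; check the repository documentation for the fields your chosen recipe expects.

```python
import yaml  # pip install pyyaml

# Placeholder path to the cloned recipe's cluster configuration file.
config_path = "recipes_collection/config.yaml"

with open(config_path) as f:
    config = yaml.safe_load(f)

# Illustrative edits -- the real keys depend on the recipe you choose.
config["cluster_type"] = "slurm"              # e.g., slurm, k8s, or SageMaker training jobs
config["instance_type"] = "ml.p5.48xlarge"    # accelerated instance type for training
config["base_results_dir"] = "/fsx/results"   # where results, logs, and checkpoints are written

with open(config_path, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```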
To run this recipe on SageMaker HyperPod with Slurm, you first need to set up a SageMaker HyperPod cluster by following the cluster setup instructions.
Next, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Then run a helper file to generate a Slurm submission script for the job, which you can use as a dry run to inspect the contents before starting the training job.
When training is complete, the trained model is automatically saved to the data location you specified.
To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the prerequisites, and edit the recipe (cluster: k8s) on your laptop. Then connect your laptop to the running EKS cluster and use the HyperPod Command Line Interface (CLI) to run the recipe.
You can also run recipes on SageMaker training jobs using the SageMaker Python SDK. The example that follows runs a PyTorch training script on a SageMaker training job while overriding training recipe parameters.
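The following is a minimal sketch of that approach, assuming the SageMaker Python SDK's PyTorch estimator with its training recipe support. The IAM role, S3 paths, recipe name, and override keys are placeholders; the exact required parameters and available recipes may differ, so treat this as a starting point rather than a complete example.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

# Placeholder values -- replace with your own role, bucket, and recipe choice.
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
output_path = "s3://amzn-s3-demo-bucket/hyperpod-recipes/output/"

# Override selected recipe parameters instead of editing the YAML file directly.
recipe_overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "exp_manager": {"explicit_log_dir": "/opt/ml/output/tensorboard"},
}

estimator = PyTorch(
    base_job_name="llama3-1-405b-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    instance_count=2,  # placeholder; size the cluster to match the recipe
    # Recipe identifier from the SageMaker HyperPod recipes repository (illustrative).
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
)

# Start the training job; the train channel points at your dataset in Amazon S3.
estimator.fit(inputs={"train": "s3://amzn-s3-demo-bucket/datasets/train/"}, wait=True)
```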
With the fully automated checkpointing feature, model checkpoints are saved to Amazon Simple Storage Service (Amazon S3) throughout training, enabling faster recovery from training faults and instance restarts.
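If you want to confirm that checkpoints are accumulating in Amazon S3 during a long run, a quick listing like the following can help; the bucket name and prefix are placeholders for the checkpoint location you configured.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix -- use the checkpoint location configured in your recipe.
bucket = "amzn-s3-demo-bucket"
prefix = "hyperpod-recipes/checkpoints/"

response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(f"{obj['LastModified']}  {obj['Size']:>12}  {obj['Key']}")
```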
Now available
Amazon SageMaker HyperPod recipes are available today in the SageMaker HyperPod recipes GitHub repository.