Thursday, December 12, 2024

Introducing Amazon Trn2 Instances And UltraServers For AI/ML


Amazon EC2 Trn2 instances and Trn2 UltraServers are now available for AI/ML training and inference.

The newly released Amazon Elastic Compute Cloud (Amazon EC2) Trn2 instances and Trn2 UltraServers are the most powerful EC2 compute options for ML training and inference. Powered by second-generation AWS Trainium chips (AWS Trainium2), Trn2 instances are 4x faster, offer 4x more memory bandwidth, and provide 3x more memory capacity than first-generation Trn1 instances.


Compared to the current generation of GPU-based EC2 P5e and P5en instances, Trn2 instances deliver 30-40% better price performance.

Each Trn2 instance includes 16 Trainium2 chips along with 192 vCPUs, 2 TiB of memory, and 3.2 Tbps of Elastic Fabric Adapter (EFA) v3 network bandwidth, which delivers up to 35% lower latency than its predecessor.

The Trn2 UltraServers are a brand-new compute offering: 64 Trainium2 chips connected by a high-bandwidth, low-latency NeuronLink interconnect, for peak training and inference performance on frontier foundation models.

Tens of thousands of Trainium chips already power Amazon and AWS services. For example, more than 80,000 AWS Inferentia and Trainium1 chips backed the Rufus shopping assistant on the most recent Prime Day, and Trainium2 chips already power latency-optimized versions of the Llama 3.1 405B and Claude 3.5 Haiku models on Amazon Bedrock.


Scaling Up and Scaling Out

Innovative forms of compute, combined with equally innovative architectures, have allowed the size and complexity of frontier models to grow steadily. In simpler times there were two ways to talk about scaling: scaling out (using more computers) and scaling up (using a bigger computer).

Let's review the Trn2 building blocks, beginning with the NeuronCore and working up to an UltraCluster:

Trn2 building blocks
Image Credit To AWS

NeuronCores are the heart of the Trainium2 chip. Each third-generation NeuronCore contains a scalar engine (one input to one output), a vector engine (many inputs to many outputs), a tensor engine (systolic-array multiplication, convolution, and transposition), and a GPSIMD (general-purpose single instruction, multiple data) core.
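A rough way to picture the division of labor among those engines is to map each one to the kind of operation it accelerates. The sketch below is a plain-Python analogy for illustration only, not the Neuron SDK API:

```python
# Conceptual analogy for the three main NeuronCore compute engines.
# Illustrative only; this is not how the Neuron SDK exposes the hardware.

def scalar_engine(x, f):
    """One input -> one output: an elementwise function on a single value."""
    return f(x)

def vector_engine(xs, ys):
    """Many inputs -> many outputs: e.g. an elementwise vector add."""
    return [a + b for a, b in zip(xs, ys)]

def tensor_engine(A, B):
    """Systolic-array-style work: dense matrix multiplication."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

print(scalar_engine(3.0, lambda v: v * v))                 # 9.0
print(vector_engine([1, 2], [3, 4]))                       # [4, 6]
print(tensor_engine([[1, 0], [0, 1]], [[5, 6], [7, 8]]))   # [[5, 6], [7, 8]]
```

The GPSIMD core handles the general-purpose operations that do not fit any of these three patterns.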

Each Trainium2 chip contains eight NeuronCores and 96 GiB of High Bandwidth Memory (HBM), with 2.9 TB/second of HBM bandwidth. The physical cores can be addressed and used individually, or pairs of physical cores can be combined into a single logical core. A single Trainium2 chip delivers up to 1.3 petaflops of dense FP8 compute and up to 5.2 petaflops of sparse FP8 compute, and automatic reordering of the HBM queue enables 95% memory bandwidth utilization.

In turn, each Trn2 instance contains 16 Trainium2 chips, for a total of 128 NeuronCores, 1.5 TiB of HBM, and 46 TB/second of HBM bandwidth. That adds up to 20.8 petaflops of dense FP8 compute and up to 83.2 petaflops of sparse FP8 compute. Within an instance, the Trainium2 chips are connected in a 2D torus over NeuronLink for high-bandwidth, low-latency chip-to-chip communication at 1 TB/second.
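In a 2D torus, every chip has four neighbors, with links that wrap around the edges of the grid. The sketch below assumes the 16 chips form a 4x4 grid; AWS has not published the exact chip coordinates, so the layout is an assumption for illustration:

```python
# Neighbors of each chip in a 2D torus (wrap-around grid).
# The 4x4 layout of the 16 Trainium2 chips is an assumption for illustration.

SIDE = 4  # 4 x 4 = 16 chips per Trn2 instance

def torus_neighbors(x, y, side=SIDE):
    """Return the four wrap-around neighbors of chip (x, y)."""
    return [
        ((x - 1) % side, y),  # left
        ((x + 1) % side, y),  # right
        (x, (y - 1) % side),  # up
        (x, (y + 1) % side),  # down
    ]

# Even a corner chip has four neighbors, thanks to the wrap-around links.
print(torus_neighbors(0, 0))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

The wrap-around links are what distinguish a torus from a plain mesh: no chip is ever more than a few hops from any other, which keeps chip-to-chip latency uniform.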

A Trn2 UltraServer links four Trn2 instances over low-latency, high-bandwidth NeuronLink. That brings the total to 64 Trainium2 chips, 512 NeuronCores, 6 TiB of HBM, and 185 TB/second of HBM bandwidth, which works out to up to 83 petaflops of dense FP8 compute and up to 332 petaflops of sparse FP8 compute.
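The instance and UltraServer figures follow directly from multiplying the per-chip numbers quoted above, as a quick sanity check shows:

```python
# Scale the per-chip Trainium2 specs quoted in the text up to a Trn2
# instance (16 chips) and an UltraServer (64 chips).

CHIP = {
    "neuron_cores": 8,
    "hbm_gib": 96,
    "hbm_tbps": 2.9,
    "dense_fp8_pflops": 1.3,
    "sparse_fp8_pflops": 5.2,
}

def aggregate(chips):
    """Multiply every per-chip spec by the number of chips."""
    return {k: v * chips for k, v in CHIP.items()}

instance = aggregate(16)
ultraserver = aggregate(64)

print(instance["neuron_cores"])          # 128
print(instance["dense_fp8_pflops"])      # 20.8
print(ultraserver["hbm_gib"] / 1024)     # 6.0 TiB
print(ultraserver["sparse_fp8_pflops"])  # 332.8 (quoted as "up to 332")
```

Note that some quoted figures are rounded down: 2.9 TB/s times 16 chips is 46.4 TB/s, reported as 46.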

In addition to the 2D torus that links the NeuronCores inside each instance, cores at corresponding X-Y coordinates in each of the four instances are connected in a ring. For inference, UltraServers deliver industry-leading response times for the best real-time experiences.

For training, UltraServers improve model training speed and efficiency compared to standalone instances, thanks to faster collective communication for model parallelism. UltraServers are designed to support training and inference at the trillion-parameter scale and beyond; they are available in preview, and you can contact AWS to join.
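A typical example of the collective communication involved is the all-reduce that synchronizes gradients or activations across workers. The pure-Python simulation below walks through a ring all-reduce, a common algorithm for this collective; it is illustrative only, and is not a claim about which algorithm the Neuron runtime actually uses:

```python
# Simulation of a ring all-reduce: each worker starts with its own chunks
# and ends holding the elementwise sum across all workers. Illustrative
# only; real workloads run collectives through the Neuron runtime.

def ring_all_reduce(worker_chunks):
    """worker_chunks[i][c] is chunk c held by worker i (scalars for simplicity)."""
    n = len(worker_chunks)
    data = [list(w) for w in worker_chunks]

    # Phase 1: reduce-scatter. After n-1 steps around the ring, worker i
    # holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val

    # Phase 2: all-gather. Each worker circulates its finished chunk so
    # that every worker ends with every reduced chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val

    return data

print(ring_all_reduce([[1, 2], [10, 20]]))  # [[11, 22], [11, 22]]
```

The appeal of the ring algorithm is that each worker only ever talks to its neighbors, so the high-bandwidth NeuronLink rings between an UltraServer's four instances map naturally onto it.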

Trn2 instances and UltraServers are deployed in EC2 UltraClusters in order to provide scale-out distributed training across tens of thousands of Trainium chips on a single petabit-scale, non-blocking network, with access to Amazon FSx for Lustre high-performance storage.

Making Use of Trn2 Instances

Trn2 instances are now available for production use in the US East (Ohio) AWS Region and can be reserved using Amazon EC2 Capacity Blocks for ML. You can reserve up to 64 instances for up to six months, book up to eight weeks in advance, start on short notice, and extend the reservation if needed. See Announcing Amazon EC2 Capacity Blocks for ML to reserve capacity for your machine learning workloads.
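Searching for an available Capacity Block is done through the EC2 API. The boto3 sketch below is untested against a live account: the region, credentials, and the `trn2.48xlarge` instance type string are assumptions that should be checked against the EC2 documentation.

```python
# Sketch: build the parameters for ec2.describe_capacity_block_offerings()
# to find a Trn2 Capacity Block. Assumed, not verified against a live
# account; check the instance type name against the EC2 docs.
from datetime import datetime, timedelta, timezone

def capacity_block_query(instance_count, duration_hours, weeks_ahead=2):
    """Parameters for searching Capacity Block offerings over the next N weeks."""
    start = datetime.now(timezone.utc)
    return {
        "InstanceType": "trn2.48xlarge",      # assumed Trn2 instance type name
        "InstanceCount": instance_count,       # up to 64 instances per block
        "CapacityDurationHours": duration_hours,
        "StartDateRange": start,
        "EndDateRange": start + timedelta(weeks=weeks_ahead),
    }

params = capacity_block_query(instance_count=4, duration_hours=24 * 7)
# ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio)
# offerings = ec2.describe_capacity_block_offerings(**params)
# ec2.purchase_capacity_block(CapacityBlockOfferingId=..., InstancePlatform="Linux/UNIX")
print(params["InstanceType"], params["InstanceCount"])
```

Once an offering is purchased, the reserved capacity appears as a capacity reservation that instances can be launched into when the block starts.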

On the software side, AWS Deep Learning AMIs are a good place to start. These images come preconfigured with PyTorch, JAX, and many other frameworks and tools you are probably already using.

If you built your applications with the AWS Neuron SDK, you can bring them across and recompile them for use on Trn2 instances. The SDK integrates natively with JAX, PyTorch, and essential libraries such as Hugging Face, PyTorch Lightning, and NeMo. Neuron includes out-of-the-box optimizations for distributed training and inference via the open source PyTorch libraries NxD Training and NxD Inference, and also offers deep insights for profiling and debugging. In addition, Neuron supports OpenXLA, including stable HLO and GSPMD, so PyTorch/XLA and JAX developers can take advantage of Neuron's Trainium2 compiler optimizations.

Thota Nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.