Amazon Inferentia
Amazon Inferentia chips deliver high performance at the lowest cost in Amazon EC2 for deep learning and generative AI inference.
What is Amazon Inferentia?
AWS developed the Amazon Inferentia chip line to accelerate deep learning inference on EC2. The Amazon Inferentia and Inferentia2 chips power EC2 instances that aim for high throughput, low latency, and low cost. The AWS Neuron SDK lets developers deploy PyTorch and TensorFlow models on these chips. Compared with the original version, Inferentia2 delivers notable gains in memory, performance, and data type support. Numerous businesses use these chips, which are also tuned for sustainability, for tasks including fraud detection, image generation, and natural language processing.
Why Inferentia?
Amazon Inferentia chips offer the best performance and lowest pricing in Amazon EC2 for deep learning (DL) and generative AI inference applications.
The first-generation Amazon Inferentia chip powers Amazon EC2 Inf1 instances, which offer up to 70% lower cost per inference and 2.3x higher throughput than comparable instances. Numerous clients, including Amazon Alexa, Sprinklr, Money Forward, and Finch AI, have adopted Inf1 instances and recognised their cost and performance advantages.
Compared with Inferentia, the Amazon Inferentia2 chip offers up to 4x higher throughput and up to 10x lower latency. Amazon EC2 Inf2 instances, which are based on Inferentia2, can deploy more complex models such as large language models (LLMs) and latent diffusion models at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed communication between chips. Numerous clients, including Leonardo.ai, Deutsche Telekom, and Qualtrics, have used Inf2 instances for their deep learning and generative AI applications.
The AWS Neuron SDK helps developers train models on AWS Trainium chips and deploy them on Amazon Inferentia chips. Because it integrates directly with well-known frameworks such as PyTorch and TensorFlow, you can run on Amazon Inferentia chips while keeping your existing code and workflows.
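As a rough illustration, here is a minimal sketch of what compiling a PyTorch model with the Neuron SDK can look like. It assumes the torch-neuronx package from the AWS Neuron SDK is installed on an Inf2 instance and uses a standard torchvision ResNet-50 purely as an example; check the current Neuron documentation for the exact API and supported versions.

```python
# Minimal sketch: compiling a PyTorch model for an Inferentia2 (Inf2) instance
# with the AWS Neuron SDK. Assumes the torch-neuronx package is installed.
import torch
import torch_neuronx
from torchvision import models

# Load a standard pretrained model; any traceable PyTorch model works similarly.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Example input with the batch size and shape you plan to serve.
example_input = torch.rand(1, 3, 224, 224)

# trace() compiles the model ahead of time for the NeuronCores; the returned
# module executes on the Inferentia2 chips instead of the host CPU.
neuron_model = torch_neuronx.trace(model, example_input)

# Save the compiled artifact so inference code can load it without recompiling.
torch.jit.save(neuron_model, "resnet50_neuron.pt")
```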
Benefits of Amazon Inferentia
Optimized for high throughput and low latency
Each EC2 Inf1 instance has up to 16 Inferentia chips, and each first-generation Inferentia chip has four first-generation NeuronCores. Each EC2 Inf2 instance has up to 12 Inferentia2 chips, and each Inferentia2 chip has two second-generation NeuronCores. Each Inferentia2 chip supports up to 190 tera floating-point operations per second (TFLOPS) of FP16 performance. The first-generation Inferentia has a significant amount of on-chip memory plus 8 GB of DDR4 memory per chip. Inferentia2 provides 32 GB of HBM per chip, substantially increasing both total memory and memory bandwidth over Inferentia.
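As a back-of-the-envelope illustration, the per-chip figures above can be aggregated for the largest Inf2 configuration (12 chips). The totals below are simple multiplications of the numbers quoted in this section, not official instance specifications.

```python
# Back-of-the-envelope aggregates for a 12-chip Inf2 instance, using the
# per-chip figures quoted above; consult AWS documentation for exact specs.
chips_per_instance = 12        # up to 12 Inferentia2 chips per Inf2 instance
neuroncores_per_chip = 2       # second-generation NeuronCores per chip
fp16_tflops_per_chip = 190     # up to 190 TFLOPS of FP16 per chip
hbm_gb_per_chip = 32           # 32 GB of HBM per chip

print(chips_per_instance * neuroncores_per_chip)   # 24 NeuronCores
print(chips_per_instance * fp16_tflops_per_chip)   # 2280 TFLOPS FP16 (peak aggregate)
print(chips_per_instance * hbm_gb_per_chip)        # 384 GB of HBM (aggregate)
```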
Native support for ML frameworks
The AWS Neuron SDK integrates natively with popular machine learning frameworks such as PyTorch and TensorFlow. With AWS Neuron, you can deploy DL models built in these frameworks on both generations of Amazon Inferentia chips with minimal code changes and without lock-in to vendor-specific solutions. On Inferentia chips, Neuron runs inference applications for speech recognition, video and image generation, language translation, text summarisation, natural language processing (NLP) and understanding, fraud detection, personalisation, and more.
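Once a model has been compiled with Neuron (as in the earlier sketch), serving it looks like ordinary PyTorch. The snippet below is a sketch that assumes the compiled artifact file name from the previous example.

```python
# Sketch of the inference side: load the Neuron-compiled TorchScript artifact
# (file name from the compilation sketch above, an assumption) and call it
# like a normal PyTorch module.
import torch
import torch_neuronx  # importing registers the Neuron runtime with TorchScript

model = torch.jit.load("resnet50_neuron.pt")

batch = torch.rand(1, 3, 224, 224)   # same input shape used at compile time
with torch.no_grad():
    logits = model(batch)            # executes on the Inferentia NeuronCores
print(logits.shape)
```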
Wide range of data types with automatic casting
The first-generation Inferentia supports the FP16, BF16, and INT8 data types. Inferentia2 adds support for FP32, TF32, and the new configurable FP8 (cFP8) data type, giving developers more options to balance accuracy and performance. AWS Neuron can automatically cast high-precision FP32 models to lower-precision data types, optimising for performance while preserving accuracy. Because autocasting removes the need to retrain at lower precision, it shortens time to market.
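As a sketch of how autocasting might be steered explicitly, the Neuron compiler exposes auto-cast options that can be passed through torch_neuronx.trace. The flag names below (--auto-cast, --auto-cast-type) are assumptions based on the Neuron compiler's documented options; verify them against the SDK version you are using.

```python
# Sketch: steering automatic casting when compiling for Inferentia2. The
# --auto-cast / --auto-cast-type flags are Neuron compiler (neuronx-cc)
# options; confirm exact names and values in the Neuron SDK documentation.
import torch
import torch_neuronx
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
example_input = torch.rand(1, 3, 224, 224)

# Ask the compiler to cast FP32 operations down to BF16 automatically,
# trading a small amount of precision for higher throughput.
neuron_model = torch_neuronx.trace(
    model,
    example_input,
    compiler_args=["--auto-cast", "all", "--auto-cast-type", "bf16"],
)
```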
State-of-the-art DL capabilities
Inferentia2 supports custom operators written in C++ and hardware optimisations for dynamic input sizes. It also features stochastic rounding, a probabilistic rounding method that delivers better accuracy and performance than legacy round-to-nearest modes.
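Stochastic rounding is easiest to see with a tiny numeric illustration. The snippet below is a plain-Python model of the idea, not Inferentia2's hardware implementation: a value is rounded up or down with probability proportional to its distance from each neighbour, so the result is unbiased in expectation.

```python
# Conceptual illustration of stochastic rounding (not the hardware implementation).
import random

def stochastic_round(x: float, step: float = 1.0) -> float:
    lower = (x // step) * step
    upper = lower + step
    # Probability of rounding up equals the fractional distance above the lower value.
    p_up = (x - lower) / step
    return upper if random.random() < p_up else lower

# Averaging many stochastic roundings of 0.3 approaches 0.3, whereas
# round-to-nearest would always give 0.0 and accumulate bias.
samples = [stochastic_round(0.3) for _ in range(100_000)]
print(sum(samples) / len(samples))   # ~0.3
```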
Built for sustainability
Because Inf2 instances and the underlying Inferentia2 chips are purpose-built to run DL models at scale, they deliver up to 50% better performance per watt than comparable Amazon EC2 instances. Inf2 instances help you meet your sustainability objectives when deploying ultra-large models.