Amazon EC2 Inf1 Instances: Low-cost and high-performance machine learning inference.
What are Amazon EC2 Inf1 Instances?
Companies across a wide range of sectors are pursuing transformation driven by artificial intelligence (AI) to spur corporate innovation, enhance customer satisfaction, and streamline operations. The machine learning (ML) models that underpin AI applications are growing more complex, which raises the cost of the underlying compute infrastructure. Inference frequently accounts for up to 90% of the infrastructure spend on building and running ML applications, so customers are searching for cost-effective infrastructure on which to put their ML applications into production.
Amazon EC2 Inf1 instances deliver high-performance machine learning inference at low cost. Compared to comparable Amazon EC2 instances, they offer up to 70% lower cost per inference and up to 2.3x higher throughput. EC2 Inf1 instances are built from the ground up to support ML inference applications. They feature up to 16 AWS Inferentia chips, high-performance ML inference chips designed and built by AWS. For high-throughput inference, EC2 Inf1 instances also provide 2nd Generation Intel Xeon Scalable processors and networking speeds of up to 100 Gbps.
Users can run large-scale machine learning (ML) inference applications on Inf1 instances, including search, recommendation engines, computer vision, speech recognition, natural language processing (NLP), personalisation, and fraud detection.
Developers can deploy ML models to Inf1 instances using the AWS Neuron SDK, which is integrated with TensorFlow, PyTorch, and Apache MXNet. With only minor code changes and no reliance on vendor-specific solutions, they can seamlessly migrate applications onto Inf1 instances while keeping the same ML workflows.
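As a rough illustration of what "minor code changes" means in practice, the sketch below compiles a standard PyTorch model for Inferentia with the torch-neuron package. The model choice, input shape, and file name are placeholders, and the exact API may vary between Neuron SDK releases.

    # A minimal sketch, assuming torch, torchvision, and torch-neuron are installed
    # (for example on an Inf1 instance launched from a Deep Learning AMI).
    import torch
    import torch_neuron  # importing this registers the torch.neuron namespace
    from torchvision import models

    # Load an ordinary pretrained model; ResNet-50 is used here purely as an example.
    model = models.resnet50(pretrained=True)
    model.eval()

    # Compile (trace) the model for AWS Inferentia using an example input.
    example_input = torch.zeros([1, 3, 224, 224])
    model_neuron = torch.neuron.trace(model, example_inputs=[example_input])

    # Save the compiled artifact; it can later be reloaded with torch.jit.load
    # and called like a normal PyTorch module for inference.
    model_neuron.save("resnet50_neuron.pt")

The rest of the inference code (pre- and post-processing, serving logic) stays as it was; only the compilation step is Neuron-specific.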
You can get started quickly with Inf1 instances using Amazon SageMaker, AWS Deep Learning AMIs (DLAMI) that come preinstalled with the Neuron SDK, or Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) for containerised machine learning applications.
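For example, launching an Inf1 instance from a Deep Learning AMI can be scripted with boto3 roughly as sketched below. The AMI ID, key pair, and region are placeholders you would replace with your own values.

    # A minimal sketch using boto3; ImageId and KeyName below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # a Deep Learning AMI with the Neuron SDK preinstalled
        InstanceType="inf1.xlarge",       # other sizes: inf1.2xlarge, inf1.6xlarge, inf1.24xlarge
        KeyName="my-key-pair",            # placeholder key pair name
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])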
Advantages
Up to 70% lower cost per inference
By using Inf1, developers can significantly lower the cost of their ML production deployments. The low instance price and high throughput of EC2 Inf1 instances deliver a cost per inference that is up to 70% lower than comparable Amazon EC2 instances.
Ease of use and code portability
The Neuron SDK is integrated with common machine learning frameworks such as TensorFlow, PyTorch, and MXNet. With only minor code changes, developers can seamlessly migrate their applications to Inf1 instances while keeping the same ML workflows. This frees customers from vendor-specific solutions and lets them use the ML framework of their choice, the compute platform that best suits their needs, and the latest technologies.
Up to 2.3x higher throughput
Inf1 instances deliver up to 2.3x higher throughput than comparable Amazon EC2 instances. The AWS Inferentia chips that power EC2 Inf1 instances are optimised for inference performance at small batch sizes, which allows real-time applications to maximise throughput while meeting latency constraints.
Extremely low latency
AWS Inferentia chips include large amounts of on-chip memory, which allows ML models to be cached on the chip itself. You can deploy your models using features such as NeuronCore Pipeline, which eliminates the need to access external memory. With Inf1 instances, you can deploy real-time inference applications at near-real-time latencies without sacrificing bandwidth.
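As a sketch of how this feature is commonly requested, NeuronCore Pipeline is enabled at compile time via a neuron-cc option. The specific flag name and the compiler_args parameter shown here are assumptions based on typical Neuron SDK usage and may differ in your SDK release.

    # A minimal sketch; the "--neuroncore-pipeline-cores" flag and compiler_args
    # keyword are assumed to be available in the installed Neuron SDK version.
    import torch
    import torch_neuron
    from torchvision import models

    model = models.resnet50(pretrained=True).eval()  # example model only
    example_input = torch.zeros([1, 3, 224, 224])

    # Ask the compiler to pipeline the model across 4 NeuronCores so that model
    # weights stay cached on-chip and requests stream through the cores.
    model_neuron = torch.neuron.trace(
        model,
        example_inputs=[example_input],
        compiler_args=["--neuroncore-pipeline-cores", "4"],
    )
    model_neuron.save("resnet50_pipeline_neuron.pt")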
Support for various ML models and data types
EC2 Inf1 instances support many popular ML model architectures, including Transformer and BERT for natural language processing and SSD, VGG, and ResNeXt for image recognition and classification. In addition, Neuron's support for the Hugging Face model repository lets users compile and run inference with pretrained or fine-tuned models by changing a single line of code. Multiple data types, including BF16 and FP16 with mixed precision, are also supported to cover a range of models and performance requirements.
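To illustrate the Hugging Face workflow, the sketch below compiles a pretrained sentiment-classification model for Inferentia; the checkpoint name, sequence length, and output file are illustrative choices, not requirements.

    # A minimal sketch, assuming the transformers and torch-neuron packages are installed.
    import torch
    import torch_neuron
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, return_dict=False)
    model.eval()

    # Tokenize to a fixed length so the traced graph has a static input shape.
    inputs = tokenizer("Inf1 makes inference affordable.", padding="max_length",
                       max_length=128, truncation=True, return_tensors="pt")

    # The Neuron-specific change: trace with torch.neuron instead of torch.jit.
    model_neuron = torch.neuron.trace(
        model, example_inputs=(inputs["input_ids"], inputs["attention_mask"]))
    model_neuron.save("distilbert_sst2_neuron.pt")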
Features
Powered by AWS Inferentia
AWS Inferentia is an ML chip designed to provide high-performance inference at low cost. Each AWS Inferentia chip has four first-generation NeuronCores, supports FP16, BF16, and INT8 data types, and delivers up to 128 tera operations per second (TOPS). The large amount of on-chip memory on AWS Inferentia chips can also be used to cache large models, which is particularly beneficial for models that require frequent memory access.
Deploy with popular ML frameworks using AWS Neuron
The AWS Neuron SDK consists of a compiler, a runtime driver, and profiling tools. It enables sophisticated neural network models, created and trained in well-known frameworks such as TensorFlow, PyTorch, and MXNet, to run on Inf1 instances. NeuronCore Pipeline delivers high inference throughput and lower inference cost by using a high-speed physical chip-to-chip interconnect to partition large models for execution across multiple Inferentia chips.
High-performance networking and storage
EC2 Inf1 instances offer up to 100 Gbps of networking throughput for applications that need high-speed networking. Next-generation Elastic Network Adapter (ENA) and NVM Express (NVMe) technologies provide Inf1 instances with high-throughput, low-latency interfaces for networking and Amazon Elastic Block Store (Amazon EBS).
Built on AWS Nitro System
The AWS Nitro System is a comprehensive set of building blocks that reduces virtualisation overhead while delivering high speed, high availability, and high security by shifting many of the conventional virtualisation tasks to specialised hardware and software.