Run ML and HPC applications at scale with the Elastic Fabric Adapter.
What is Elastic Fabric Adapter?
Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that lets customers run applications requiring high levels of inter-node communication at scale on AWS. Its custom-built operating system (OS) bypass hardware interface improves the performance of inter-instance communication, which is essential for scaling these applications. With EFA, high performance computing (HPC) applications that use the Message Passing Interface (MPI) and machine learning (ML) applications that use the NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs. This gives you the application performance of on-premises HPC clusters together with the on-demand elasticity and flexibility of the AWS cloud.
EFA is an optional EC2 networking feature that you can enable on any supported EC2 instance at no extra cost. It also works with the most widely used interfaces, APIs, and libraries for inter-node communication, so you can move your HPC applications to AWS with minimal changes.
EFA integrates with Libfabric 1.7.0 and later. It supports Open MPI 4 and later and Intel MPI 2019 Update 5 and later for HPC applications, and the NVIDIA Collective Communications Library (NCCL) for AI and ML applications.
EFA fundamentals
An EFA device can be attached to an EC2 instance in two ways:
- Using a traditional EFA interface, also called EFA with ENA, which creates both an EFA device and an ENA device.
- Using an EFA-only interface, which creates only the EFA device.
The EFA device provides capabilities such as built-in OS bypass and congestion control through the Scalable Reliable Datagram (SRD) protocol. These features give the EFA interface low-latency, reliable transport, which improves application performance for HPC and ML workloads on Amazon EC2. The ENA device, by contrast, provides conventional IP networking.
Traditionally, AI/ML applications use NCCL and HPC applications use the Message Passing Interface (MPI) to interface with the system’s network transport. In the AWS cloud, this has meant that applications interface with NCCL or MPI, which in turn uses the operating system’s TCP/IP stack and the ENA device driver to enable network communication between instances.
With a traditional EFA (EFA with ENA) or EFA-only interface, AI/ML applications use NCCL and HPC applications use MPI to interface directly with the Libfabric API. The Libfabric API bypasses the operating system kernel and communicates directly with the EFA device to put packets on the network. This lowers overhead and improves the performance of HPC and AI/ML applications.
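Because applications reach the EFA device through Libfabric, one quick sanity check is to ask Libfabric whether it can see an EFA provider. The sketch below wraps the Libfabric `fi_info` utility and its `-p` provider filter. The function name `check_efa_provider`, the status strings, and the return codes are illustrative choices, and the sketch assumes that `fi_info` is on the PATH once the EFA software is installed and that it exits non-zero when no matching provider is found.

```shell
#!/bin/sh
# Report whether Libfabric can see an EFA provider on this machine.
# Usage: check_efa_provider [TOOL]
# TOOL defaults to fi_info; the override exists only to make testing easy.
check_efa_provider() {
    tool="${1:-fi_info}"
    if ! command -v "$tool" >/dev/null 2>&1; then
        # Libfabric utilities are not installed (or not on PATH).
        echo "libfabric-tool-missing"
        return 2
    fi
    # -p filters the results by provider name.
    # Assumption: a non-zero exit status means no EFA provider was found.
    if "$tool" -p efa >/dev/null 2>&1; then
        echo "efa-available"
        return 0
    fi
    echo "efa-not-found"
    return 1
}
```

Running `check_efa_provider` with no arguments on an instance would print one of the three status strings. On an instance without an EFA interface or without the EFA software, something other than `efa-available` is the expected answer.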
ENA, EFA, and EFA-only network interface differences
Amazon EC2 offers three types of network interfaces:
- ENA interfaces provide all of the conventional IP networking and routing features needed to support IP networking for a VPC.
- EFA (EFA with ENA) interfaces provide both the ENA device for IP networking and the EFA device for low-latency, high-throughput communication.
- EFA-only interfaces provide only the EFA device’s features, without the ENA device for conventional IP networking.
The following table compares ENA, EFA (EFA with ENA), and EFA-only network interfaces.
Feature | ENA | EFA (EFA with ENA) | EFA-only |
---|---|---|---|
Supports IP networking functionality | Yes | Yes | No |
Can be assigned IPv4 or IPv6 addresses | Yes | Yes | No |
Can be used as primary network interface for instance | Yes | Yes | No |
Counts towards ENI attachment limit for instance | Yes | Yes | Yes |
Instance type support | Supported on all Nitro-based instance types | Supported instance types | Supported instance types |
Parameter naming in EC2 APIs | interface | efa | efa-only |
Field naming in EC2 console | No selection | EFA with ENA | EFA-only |
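The parameter names in the table are the values that `InterfaceType` takes when you describe a network interface to the EC2 API. As a minimal sketch, the helper below builds one entry for the `--network-interfaces` option of `aws ec2 run-instances` in the AWS CLI shorthand syntax. The function name `efa_interface_spec` and the subnet and security group IDs are hypothetical placeholders. The sketch also assumes that the primary network interface is the one at device index 0, and uses that to reject an EFA-only interface in the primary position.

```shell
#!/bin/sh
# Build one entry for the --network-interfaces option of `aws ec2 run-instances`.
# Usage: efa_interface_spec TYPE DEVICE_INDEX SUBNET_ID SECURITY_GROUP_ID
efa_interface_spec() {
    itype="$1"; device="$2"; subnet="$3"; group="$4"
    case "$itype" in
        interface|efa|efa-only) ;;   # the three API names from the table
        *) echo "unknown interface type: $itype" >&2; return 1 ;;
    esac
    # Per the table, an EFA-only interface cannot be the primary interface.
    # Assumption: the primary interface is the one at device index 0.
    if [ "$itype" = "efa-only" ] && [ "$device" -eq 0 ]; then
        echo "efa-only cannot be used at device index 0" >&2
        return 1
    fi
    printf 'DeviceIndex=%s,InterfaceType=%s,SubnetId=%s,Groups=%s\n' \
        "$device" "$itype" "$subnet" "$group"
}

# Example with placeholder IDs (add the other options your launch needs):
# aws ec2 run-instances --network-interfaces \
#     "$(efa_interface_spec efa 0 subnet-EXAMPLE sg-EXAMPLE)"
```

Keeping the validation in a helper makes the "cannot be primary" rule visible before the request ever reaches EC2.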
Interfaces and libraries that are supported
EFAs support the following interfaces and libraries:
- Open MPI 4 and later (Open MPI 4.0 or later is recommended for Graviton-based instances)
- Intel MPI 2019 Update 5 and later
- NVIDIA Collective Communications Library (NCCL) 2.4.2 and later
- AWS Neuron SDK version 2.3 and later
Types of instances that are supported
To see the available instance types that support EFAs in a specific Region
The available instance types vary by Region. To see the available instance types that support EFAs in a Region, use the describe-instance-types command with the --region parameter. Include the --filters parameter to scope the results to the instance types that support EFA, and the --query parameter to scope the output to the value of InstanceType.
aws ec2 describe-instance-types --region us-east-1 --filters Name=network-info.efa-supported,Values=true --query "InstanceTypes[*].[InstanceType]" --output text | sort
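The command above prints one instance type per line, so its output is easy to test against a specific instance type. The helper below is a small sketch of that check. The name `type_supports_efa` is made up for illustration, and it relies only on standard `grep` options.

```shell
#!/bin/sh
# Read instance types (one per line) on stdin and succeed only if the
# instance type given as $1 appears as a complete line.
# Usage: <list of types> | type_supports_efa INSTANCE_TYPE
type_supports_efa() {
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$1"
}

# Example: pipe the describe-instance-types output into the helper.
# aws ec2 describe-instance-types --region us-east-1 \
#     --filters Name=network-info.efa-supported,Values=true \
#     --query "InstanceTypes[*].[InstanceType]" --output text \
#     | type_supports_efa c5n.18xlarge && echo "EFA supported"
```

Matching whole lines matters here: a plain substring match on a partial name would wrongly succeed for any instance type that merely contains it.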
Operating systems that are supported
Depending on the CPU type, different operating systems are supported. The supported operating systems are displayed in the following table.
Operating system | Intel/AMD (x86_64) instance types | AWS Graviton (arm64) instance types |
---|---|---|
Amazon Linux 2023 | ✓ | ✓ |
Amazon Linux 2 | ✓ | ✓ |
RHEL 8 and 9 | ✓ | ✓ |
Debian 10, 11, and 12 | ✓ | ✓ |
Rocky Linux 8 and 9 | ✓ | ✓ |
Ubuntu 20.04, 22.04, and 24.04 | ✓ | ✓ |
SUSE Linux Enterprise 15 SP2 and later | ✓ | ✓ |
OpenSUSE Leap 15.5 and later | ✓ | |
Note: Ubuntu 20.04 supports peer direct when used with dl1.24xlarge instances.
Limitations of Elastic Fabric Adapter
EFAs have the following limitations:
Note: EFA traffic refers to traffic sent through the EFA device of either an EFA (EFA with ENA) or EFA-only interface.
- EFA traffic between P4d/P4de/DL1 instances and other instance types is currently not supported.
- Instance types that support multiple network cards can be configured with one EFA per network card. All other supported instance types support only one EFA per instance.
- For c7g.16xlarge, m7g.16xlarge, and r7g.16xlarge, Dedicated Instances and Dedicated Hosts are not supported when an EFA is attached.
- EFA traffic can’t cross Availability Zones or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface.
- EFA traffic is not routable. Normal IP traffic from the ENA device of an EFA interface remains routable.
- EFA is not supported on AWS Outposts.
- On Windows instances, the EFA device of an EFA (EFA with ENA) interface is supported only for applications based on the AWS Cloud Digital Interface Software Development Kit (AWS CDI SDK). For applications that are not based on the CDI SDK, an EFA (EFA with ENA) interface attached to a Windows instance works as an ENA interface, without the additional EFA device capabilities. The EFA-only interface is not supported by AWS CDI based applications on Windows or Linux.
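To make the one-EFA-per-network-card rule concrete, the sketch below prints one `--network-interfaces` entry per network card for an instance type that has several cards. The function name `efa_specs_per_card` and the IDs are hypothetical. The device index layout is also an assumption (card 0 carries the primary interface at device index 0, and every other card uses device index 1), so check the EC2 documentation for your instance type before relying on it. All entries use the same subnet, which keeps the EFA traffic inside one Availability Zone and one VPC as the limitations require.

```shell
#!/bin/sh
# Print one --network-interfaces entry per network card, one EFA per card.
# Usage: efa_specs_per_card CARD_COUNT SUBNET_ID SECURITY_GROUP_ID
efa_specs_per_card() {
    cards="$1"; subnet="$2"; group="$3"
    card=0
    while [ "$card" -lt "$cards" ]; do
        # Assumption: card 0 holds the primary interface (device index 0);
        # the other cards use device index 1.
        if [ "$card" -eq 0 ]; then device=0; else device=1; fi
        printf 'NetworkCardIndex=%s,DeviceIndex=%s,InterfaceType=efa,SubnetId=%s,Groups=%s\n' \
            "$card" "$device" "$subnet" "$group"
        card=$((card + 1))
    done
}

# Example with placeholder IDs: entries for an instance type with 4 network cards.
# efa_specs_per_card 4 subnet-EXAMPLE sg-EXAMPLE
```

Each printed line can be passed as a separate argument to `--network-interfaces` when launching the instance.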
Advantages
Quicker outcomes
EFA’s unique OS bypass networking mechanism provides a low-latency, low-jitter channel for inter-instance communications. This lets your tightly coupled HPC or distributed machine learning applications scale to thousands of cores, so they run faster.
Adaptable setup
You can enable EFA support on a growing list of EC2 instances, which gives you the flexibility to choose the right compute configuration for your workload. Just enable EFA support on your new compute instances and adjust your cluster configurations as your needs change. No advance planning or reservations are required.
Smooth migration
EFA communicates through the Libfabric interface and Libfabric APIs. Nearly all HPC programming models support this interface, so you can move your existing HPC applications to the cloud with minimal changes.
Use cases
Computational fluid dynamics
Advances in Computational Fluid Dynamics (CFD) algorithms let engineers model increasingly complex flow phenomena, while HPC shortens turnaround times. By scaling out their simulation jobs with EFA, design engineers can experiment with more tunable parameters, which leads to faster and more accurate results.
Modeling the weather
To produce accurate results, complex weather models need fast interconnects, high memory bandwidth, and robust parallel file systems. The finer the model’s grid spacing, the more accurate the results, but the more computing capacity it requires. With EFA’s fast interconnect, weather modeling applications can take advantage of the scaling capabilities of the AWS cloud and produce more accurate forecasts in less time.
Machine learning
Distributed computing on GPUs can greatly speed up the training of deep learning models. Leading deep learning frameworks such as Caffe, Caffe2, Chainer, MXNet, TensorFlow, and PyTorch have already integrated NCCL to use its multi-GPU collectives for communication between nodes. Because EFA is optimized for NCCL on AWS, these training workloads gain throughput and scalability, which produces faster results.
Elastic Fabric Adapter pricing
EFA is a free optional networking capability offered by Amazon EC2 that you can activate on any supported instance.