Wednesday, December 11, 2024

How AWS FSx for Lustre Boosts GPU Performance By 12x


Amazon FSx for Lustre now delivers up to 12x higher throughput to GPU instances.

AWS has announced that Amazon FSx for Lustre now supports Elastic Fabric Adapter (EFA) and NVIDIA GPUDirect Storage (GDS). EFA is a network interface for Amazon EC2 instances that lets applications requiring high levels of inter-node communication run at scale. GDS creates a direct data path between local or remote storage and GPU memory. With these improvements, FSx for Lustre with EFA/GDS support delivers up to 12 times higher per-client throughput (up to 1,200 Gbps) than the previous FSx for Lustre version.


FSx for Lustre is used to build and run the most performance-demanding applications, including financial modeling, drug discovery, deep learning training, and autonomous vehicle development. As datasets grow and new technologies emerge, you can adopt more powerful GPU and HPC instances such as Amazon EC2 P5, Trn1, and Hpc7a. This adoption is driving the need for FSx for Lustre file systems to deliver enough performance to make full use of the growing network bandwidth of these instances when accessing large datasets. Until now, accessing FSx for Lustre file systems over traditional TCP networking limited throughput to 100 Gbps per client instance.

With FSx for Lustre’s support for EFA and GDS, applications that use P5 GPU instances and NVIDIA CUDA can now reach up to 1,200 Gbps of throughput per client instance, twelve times more than before.
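
To put those figures in storage terms, a network rate in gigabits per second divides by eight to give gigabytes per second. A minimal shell sketch of the conversion follows; the function name is illustrative and not part of any AWS tool.

```shell
#!/bin/sh
# Convert a network rate in Gbps to GB/s (8 bits per byte).
# gbps_to_gbytes is an illustrative helper, not an AWS command.
gbps_to_gbytes() {
  LC_ALL=C awk -v gbps="$1" 'BEGIN { printf "%.1f\n", gbps / 8 }'
}

echo "Previous TCP limit: $(gbps_to_gbytes 100) GB/s per client"
echo "EFA/GDS maximum:    $(gbps_to_gbytes 1200) GB/s per client"
```

In other words, the per-client ceiling moves from 12.5 GB/s to 150 GB/s.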

With this new feature, you can make full use of the network bandwidth of the most powerful compute instances and speed up your machine learning (ML) and HPC workloads. EFA improves performance by bypassing the operating system and using the AWS Scalable Reliable Datagram (SRD) protocol to optimize data transfer. GDS further improves performance by enabling direct data transfer between the file system and GPU memory, which eliminates unnecessary memory copies.

Let’s see how this works in practice.


Setting up Amazon FSx for Lustre with EFA turned on

To start, open the Amazon FSx console, choose Create file system, and then choose Amazon FSx for Lustre.

Enter a name for the file system. In the Deployment and storage type section, choose Persistent, SSD, and the new option with EFA enabled. In the Throughput per unit of storage section, choose 1000 MB/s/TiB. The smallest storage capacity these settings allow is 4.8 TiB, so enter that value.
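
The 4.8 TiB figure also determines the file system's baseline throughput, because FSx for Lustre provisions throughput in proportion to storage capacity. The arithmetic can be sketched as below; the helper name is hypothetical, and the calculation ignores any burst or caching effects.

```shell
#!/bin/sh
# Baseline file system throughput (MB/s) =
#   storage capacity (TiB) x per-unit throughput (MB/s/TiB).
# fsx_baseline_mbps is an illustrative helper, not an AWS command.
fsx_baseline_mbps() {
  LC_ALL=C awk -v tib="$1" -v per_tib="$2" 'BEGIN { printf "%.0f\n", tib * per_tib }'
}

# Minimum EFA-enabled capacity at the 1000 MB/s/TiB tier
fsx_baseline_mbps 4.8 1000   # prints 4800
```

At the minimum size this comes to 4,800 MB/s, or roughly 38 Gbps, for the whole file system, so approaching the per-client maximum calls for a considerably larger file system.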

For networking, use the default virtual private cloud (VPC) and an EFA-enabled security group. Leave all other options at their default values.

Review the options, then create the file system. It is ready to use after a few minutes.

Mounting an EFA-enabled Amazon FSx for Lustre file system from an Amazon EC2 instance

In the Amazon EC2 console, choose Launch instance, enter a name for the instance, and select the Ubuntu Amazon Machine Image (AMI). For the instance type, choose trn1.32xlarge.

In Network settings, edit the default settings and select the same subnet that the FSx for Lustre file system uses. Under Firewall (security groups), select three existing security groups: the default security group, the security group that grants Secure Shell (SSH) access, and the EFA-enabled security group used by the FSx for Lustre file system.

Then, in Advanced network configuration, select ENA and EFA as the interface type. Without this setting, the instance uses traditional TCP networking, and the connection to the FSx for Lustre file system remains limited to 100 Gbps.

Depending on the instance type, you can add more EFA network interfaces to increase throughput.

When the instance is ready, connect to it using EC2 Instance Connect, then follow the instructions in the FSx for Lustre User Guide to install the Lustre client and configure EFA clients.

Next, follow the instructions for mounting an FSx for Lustre file system from an EC2 instance.

Create a directory to use as the mount point:

sudo mkdir -p /fsx

In the FSx console, select the file system and look up its DNS name and mount name. Use these values to mount the file system:

sudo mount -t lustre -o relatime,flock file_system_dns_name@tcp:/mountname /fsx
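
If you script the mount, the two values from the console can be passed in as variables. Here is a minimal sketch that only prints the command for review. The DNS name and mount name below are placeholders, not a real file system, and the function name is illustrative.

```shell
#!/bin/sh
# Build the Lustre mount command from the file system's DNS name and mount name.
# The values below are placeholders; substitute the ones shown in the FSx console.
FSX_DNS_NAME="fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"
FSX_MOUNT_NAME="abcdefgh"
MOUNT_POINT="/fsx"

lustre_mount_cmd() {
  # $1 = DNS name, $2 = mount name, $3 = local mount point
  printf 'sudo mount -t lustre -o relatime,flock %s@tcp:/%s %s\n' "$1" "$2" "$3"
}

# Print the command for review; run it yourself once it looks right.
lustre_mount_cmd "$FSX_DNS_NAME" "$FSX_MOUNT_NAME" "$MOUNT_POINT"
```

After mounting, `df -h /fsx` should list the Lustre file system at the mount point.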

EFA is used automatically when you access an EFA-enabled file system from client instances that support EFA and run Lustre client version 2.15 or later.

Things to be aware of

EFA and GDS support is available at no additional cost on new Amazon FSx for Lustre file systems in all AWS Regions where the Persistent 2 deployment type is offered. FSx for Lustre uses EFA automatically, with no further setup, when you access an EFA-enabled file system from client instances that support EFA. Network bandwidth and EFA support for instance types in the accelerated computing category are listed in the network specifications table of the Amazon EC2 documentation.

To use EFA-enabled instances with FSx for Lustre file systems, clients must run the Lustre 2.15 client on Ubuntu 22.04 with kernel 6.8 or later.
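
The kernel requirement is easy to check from a script before installing the Lustre client. The sketch below compares only the major and minor numbers of a kernel release string; the function name is illustrative, and it assumes the release string starts with `major.minor`, as `uname -r` output does on Ubuntu.

```shell
#!/bin/sh
# Succeed if a kernel release (e.g. "6.8.0-1015-aws") is at least the
# required major.minor version (e.g. "6.8"). Illustrative helper only.
kernel_at_least() (
  actual="$1"; required="$2"
  a_major="${actual%%.*}";   a_rest="${actual#*.}";   a_minor="${a_rest%%[!0-9]*}"
  r_major="${required%%.*}"; r_rest="${required#*.}"; r_minor="${r_rest%%[!0-9]*}"
  [ "$a_major" -gt "$r_major" ] ||
    { [ "$a_major" -eq "$r_major" ] && [ "$a_minor" -ge "$r_minor" ]; }
)

if kernel_at_least "$(uname -r)" 6.8; then
  echo "Kernel $(uname -r) meets the 6.8 minimum"
else
  echo "Kernel $(uname -r) is older than 6.8"
fi
```

Once the Lustre client is installed, `lfs --version` reports the client version, which can be checked against 2.15 in the same way.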

Note that your client instances and file systems must be located in the same subnet within your Amazon Virtual Private Cloud (Amazon VPC).

GDS is supported automatically on EFA-enabled file systems. To use GDS with your FSx for Lustre file systems, the client instance needs the NVIDIA Compute Unified Device Architecture (CUDA) package, the open source NVIDIA driver, and the NVIDIA GPUDirect Storage driver. These packages come preinstalled on the AWS Deep Learning AMI. Your CUDA-enabled application can then use GPUDirect Storage to transfer data between the file system and your GPUs.

When planning your deployment, keep in mind that EFA-enabled file systems have larger minimum storage capacity increments than file systems without EFA. For example, at the 1,000 MB/s/TiB throughput tier, the minimum storage capacity for an EFA-enabled file system is 4.8 TiB, compared with 1.2 TiB for an FSx for Lustre file system without EFA. To migrate existing workloads, you can use AWS DataSync to move data from an existing file system to a new one that supports EFA and GDS.

For flexibility, FSx for Lustre remains compatible with both EFA and non-EFA workloads. Traffic from non-EFA client instances is routed automatically over traditional TCP/IP networking through the Elastic Network Adapter (ENA), so all workloads can access an EFA-enabled file system without extra setup.

Thota nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.