Graph Neural Network (GNN) Training Accelerated on Intel CPUs with Hybrid Partitioning and Fused Sampling
Highlights
- A novel graph sampling technique dubbed “fused sampling,” created by Intel Labs and AIA, can speed up the training of Graph Neural Networks (GNNs) on CPUs by up to 2x. The updated sampling procedure is now included in the Deep Graph Library (DGL), one of the most widely used libraries for GNN training.
- With a novel graph partitioning technique called “hybrid partitioning,” Intel Labs has significantly accelerated distributed training of GNNs on huge graphs, reducing epoch times on popular graph benchmarks by as much as 30%.
- Using 16 2-socket machines, each with two 4th Gen Intel Xeon Scalable processors (Sapphire Rapids), the combination of fused sampling and hybrid partitioning set a new CPU record for training GNNs on the well-known ogbn-papers100M benchmark, achieving a total FP32 training time of just 1.6 minutes.
Graph Neural Networks (GNNs) have achieved state-of-the-art performance on several graph-related tasks, including link prediction in recommendation graphs, physical property prediction for molecular graphs, and high-level semantic feature prediction for nodes in citation graphs and social networks. In many domains, graphs may contain millions of nodes and billions of edges.
Training over the whole graph at once can quickly exhaust memory. Sampling-based training is one of the most widely used techniques for training GNNs on big graphs: in each training iteration, we randomly select a small portion of the graph (small enough to fit in available memory) and train the GNN on this graph sample. However, the time required for graph sampling in each iteration can easily eclipse the time for the forward and backward passes of the GNN.
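For readers unfamiliar with this setup, here is a minimal sketch of what a sampling-based training loop looks like with DGL-style dataloading. The two-layer GraphSAGE model, fanouts, batch size, and feature/label keys are illustrative placeholders rather than the configuration behind the results above, and the dataloading API differs slightly across DGL versions.

```python
# Minimal sketch of sampling-based GNN training with DGL-style dataloading.
# Model, fanouts, and feature/label keys are placeholders, not the exact
# setup behind the reported results.
import torch
import torch.nn.functional as F
import dgl
import dgl.nn as dglnn

class SAGE(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            dglnn.SAGEConv(in_feats, hidden, "mean"),
            dglnn.SAGEConv(hidden, n_classes, "mean"),
        ])

    def forward(self, blocks, x):
        # Each "block" is the bipartite graph sampled for one GNN layer.
        for layer, block in zip(self.layers, blocks):
            x = layer(block, x)
            if layer is not self.layers[-1]:
                x = F.relu(x)
        return x

def train(graph, train_nids, model, opt, fanouts=(15, 10), epochs=1):
    sampler = dgl.dataloading.NeighborSampler(list(fanouts))
    loader = dgl.dataloading.DataLoader(
        graph, train_nids, sampler, batch_size=1024, shuffle=True)
    for _ in range(epochs):
        for input_nodes, output_nodes, blocks in loader:
            x = blocks[0].srcdata["feat"]     # features of all sampled nodes
            y = blocks[-1].dstdata["label"]   # labels of the seed nodes
            loss = F.cross_entropy(model(blocks, x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```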
To speed up sampling-based training, the graph is often divided among many machines, each of which is responsible for producing its own graph samples and using them to train the GNN model. Because the graph topology is split across machines, each machine must communicate with the others to create a graph sample. This communication cost increases as we generate bigger graph samples, and the sample size usually grows when the GNN model includes additional layers, as the rough calculation below illustrates.
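As a purely illustrative calculation (the fanouts and batch size below are made-up values, not benchmark settings), the number of nodes a mini-batch can touch, and hence the number of feature vectors that may have to be fetched from other machines, grows multiplicatively with the number of layers:

```python
# Illustrative only: rough upper bound on how many nodes (and hence how many
# remote feature fetches) a single mini-batch can touch. Fanouts and batch
# size are hypothetical example values.
batch_size = 1000
fanouts = [15, 10, 5]          # neighbors sampled per layer (hypothetical)

nodes_per_seed = 1
total_per_seed = 0
for f in fanouts:
    nodes_per_seed *= f
    total_per_seed += nodes_per_seed
print(f"up to ~{batch_size * total_per_seed:,} sampled nodes per mini-batch")
# -> up to ~915,000 nodes whose features may live on other machines
```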
In the following, we outline two complementary methods that tackle the significant communication cost of distributed sampling-based training and the high CPU sampling overhead currently incurred by popular machine learning libraries.
1. Fused Sampling
Graph sampling happens in every training iteration, so it is essential to sample graphs as quickly as possible. Popular GNN libraries such as DGL implement a typical sampling pipeline that consists of several stages, each of which produces intermediate tensors that must be written to and subsequently read back from memory. Fused sampling merges these stages into a single pass, avoiding those round trips to memory, as sketched below.
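To make the fusion idea concrete, here is a small self-contained sketch contrasting a staged pipeline, which materializes intermediate tensors between steps, with a fused pass that samples neighbors and compacts node IDs in one go. This is not DGL's actual implementation; the CSR layout and helper names are assumptions for illustration.

```python
# Conceptual sketch of "fused" neighbor sampling versus a staged pipeline.
# NOT DGL's implementation; the CSR layout and helpers are illustrative.
import numpy as np

def staged_sample(indptr, indices, seeds, fanout, rng):
    # Stage 1: sample up to `fanout` in-neighbors per seed node.
    src, dst = [], []
    for s in seeds:
        nbrs = indices[indptr[s]:indptr[s + 1]]
        if len(nbrs) > fanout:
            nbrs = rng.choice(nbrs, size=fanout, replace=False)
        src.extend(int(v) for v in nbrs)
        dst.extend([int(s)] * len(nbrs))
    edges = np.array([dst, src])                            # intermediate #1
    # Stage 2: find the unique node IDs appearing in the sample.
    uniq, inverse = np.unique(edges, return_inverse=True)   # intermediate #2
    # Stage 3: relabel edge endpoints into the compact ID space.
    compact_edges = inverse.reshape(edges.shape)            # intermediate #3
    return uniq, compact_edges

def fused_sample(indptr, indices, seeds, fanout, rng):
    # Sampling and ID compaction happen together; a compact ID is assigned
    # the first time a node is seen, so no full-size intermediates are kept.
    local_id = {int(s): i for i, s in enumerate(seeds)}     # seeds come first
    dst_c, src_c = [], []
    for s in seeds:
        nbrs = indices[indptr[s]:indptr[s + 1]]
        if len(nbrs) > fanout:
            nbrs = rng.choice(nbrs, size=fanout, replace=False)
        for v in nbrs:
            src_c.append(local_id.setdefault(int(v), len(local_id)))
            dst_c.append(local_id[int(s)])
    uniq = np.fromiter(local_id.keys(), dtype=np.int64)
    return uniq, np.array([dst_c, src_c])
```

Because the fused version assigns compact IDs as it samples, it never writes full-size edge and ID tensors to memory between stages, which is where the staged pipeline loses time.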
2. Hybrid Partitioning
When a graph becomes too large to fit in the memory of one training machine, it is often divided among many machines, and the graph data each machine needs to train the GNN model is requested and delivered via inter-machine communication. We have observed that the features associated with the graph nodes typically account for the majority of the graph representation's size: node features often occupy more than 90% of the memory required to represent the graph, as the back-of-the-envelope example below illustrates.
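As a rough illustration (the sizes below are made up for the example, not measurements from the benchmarks discussed here), compare the footprint of a node-feature matrix with that of a CSR-encoded topology:

```python
# Hypothetical sizes chosen for illustration only.
num_nodes = 100_000_000          # 100M nodes
num_edges = 1_000_000_000        # 1B edges
feat_dim  = 256                  # float32 features per node

feature_bytes  = num_nodes * feat_dim * 4             # node feature matrix
topology_bytes = num_edges * 4 + (num_nodes + 1) * 8  # CSR: int32 indices + int64 indptr

total = feature_bytes + topology_bytes
print(f"features : {feature_bytes / 2**30:7.1f} GiB "
      f"({100 * feature_bytes / total:.0f}% of total)")
print(f"topology : {topology_bytes / 2**30:7.1f} GiB")
```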
Inspired by this observation, we created a novel partitioning technique called hybrid partitioning, which splits only the node features across machines while replicating the relatively small graph topology information (the graph's adjacency matrix) on every training machine. Because the machines only need to communicate node features, this results in a significant decrease in communication in distributed sampling-based GNN training.
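The sketch below shows the data layout that hybrid partitioning implies, under our own naming (this is not the actual Intel/DGL implementation): every machine keeps the full CSR topology, while the node-feature matrix is sharded into contiguous row ranges, so the only remote traffic is gathering features for sampled nodes.

```python
# Illustrative data layout under hybrid partitioning (names are ours):
# topology is replicated on every machine, features are sharded by node ID.
import numpy as np

class HybridPartition:
    def __init__(self, indptr, indices, features, num_parts, rank):
        # Full topology is kept locally -> sampling needs no communication.
        self.indptr, self.indices = indptr, indices
        self.num_parts, self.rank = num_parts, rank
        # Shard features by contiguous node-ID range; keep only our rows.
        num_nodes = len(indptr) - 1
        self.bounds = np.linspace(0, num_nodes, num_parts + 1, dtype=np.int64)
        lo, hi = self.bounds[rank], self.bounds[rank + 1]
        self.local_feats = features[lo:hi]

    def owner(self, node_ids):
        # Which machine stores each node's feature row.
        return np.searchsorted(self.bounds, node_ids, side="right") - 1

    def gather_features(self, node_ids, fetch_remote):
        # `fetch_remote(rank, ids)` stands in for an RPC/all-to-all call;
        # a real system would batch one request per remote machine.
        out = np.empty((len(node_ids), self.local_feats.shape[1]),
                       dtype=self.local_feats.dtype)
        owners = self.owner(node_ids)
        for r in np.unique(owners):
            mask = owners == r
            ids = node_ids[mask]
            if r == self.rank:
                out[mask] = self.local_feats[ids - self.bounds[r]]
            else:
                out[mask] = fetch_remote(int(r), ids)
        return out
```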
Fused sampling in conjunction with hybrid partitioning resulted in a significant decrease in epoch durations for distributed sampling-based GNN training. Even on its own, hybrid partitioning improves performance; when combined with fused sampling, it improves epoch times by more than 2x. Using hybrid partitioning and fused sampling on 16 2-socket machines, we achieved a record-breaking total FP32 training time of under 1.6 minutes.