Among the Open MatSci ML Toolkit highlights:
Researchers from Intel Labs and Intel DCAI demonstrated how advanced artificial intelligence models can be trained on 4th Generation Intel Xeon Scalable Processors to achieve competitive modeling performance on materials property prediction tasks.
The researchers ran several materials property prediction tasks on two datasets supported by the Open MatSci ML Toolkit and demonstrated that pre-training on a synthetic task can improve modeling performance for these applications.
After Intel Labs released the Open MatSci ML Toolkit 1.0 in 2023, Intel Labs and Intel’s Data Center and AI (DCAI) group presented a paper demonstrating its capabilities. The paper was accepted at the AI4S workshop, held in conjunction with SC23, the International Conference for High Performance Computing, Networking, Storage, and Analysis. It showed how advanced artificial intelligence (AI) models can be trained on 4th Generation Intel Xeon Scalable Processors to achieve competitive modeling performance for a wide range of materials property prediction tasks.
The workshop was selective. In the paper, Intel proposed a new pre-training task based on classifying point structures generated from crystallographic symmetry groups. Intel also examined the pros and cons of this method for downstream modeling performance, work that can be done on CPUs.
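To make the idea of symmetry-generated point structures concrete, here is a minimal sketch in Python with NumPy. It is not the method from the paper: it substitutes the simple 2D cyclic rotation groups C_n for crystallographic symmetry groups, and every name in it (rotation_matrix, cyclic_group, make_sample, make_dataset) is hypothetical. Each sample is built by applying all operations of one group to a few random seed points, and the classification label records which group generated the sample.

```python
import numpy as np


def rotation_matrix(angle):
    """Return the 2x2 matrix for a counter-clockwise rotation by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])


def cyclic_group(order):
    """Return the `order` rotation matrices of the 2D cyclic group C_order.

    Assumption: C_n stands in for the crystallographic groups in the article.
    """
    return [rotation_matrix(2 * np.pi * k / order) for k in range(order)]


def make_sample(order, num_seeds, rng):
    """Apply every operation of C_order to random seed points.

    The result is a point set that is invariant under the group.
    """
    seeds = rng.normal(size=(num_seeds, 2))
    return np.concatenate([seeds @ op.T for op in cyclic_group(order)], axis=0)


def make_dataset(orders, samples_per_class, num_seeds, seed=0):
    """Build (point sets, labels); the label is the index of the generating group."""
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for label, order in enumerate(orders):
        for _ in range(samples_per_class):
            data.append(make_sample(order, num_seeds, rng))
            labels.append(label)
    return data, np.array(labels)


data, labels = make_dataset(orders=(2, 3, 4, 6), samples_per_class=5, num_seeds=3)
print(len(data), data[0].shape, labels[:6])
```

A classifier trained on such data has to recognize the symmetry of a point set rather than any material property, which is what makes it a synthetic auxiliary task.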
The Open MatSci ML Toolkit provides a collection of software engineering utilities for training advanced AI models on materials science tasks. One of its strengths is the ability to integrate multiple types of data for a variety of tasks, which is essential for successfully training AI models for materials science. The toolkit also includes built-in support for a standard collection of geometric deep learning architectures.
This support is updated continuously as new state-of-the-art models are released. The Open MatSci ML Toolkit 1.0 provides a variety of general utilities and abstractions, including datasets, tasks, and models. It also incorporates several datasets that are widely used in materials science, which supply the basic building blocks for running experiments.
Training Models at Scale Using 4th Generation Intel Xeon Scalable Processors
The Open MatSci ML Toolkit is designed to support model training at different scales of computing (for instance, a laptop, a workstation, or a data center) and to make the transition between these scales seamless. To demonstrate that capability, Intel trained an equivariant graph neural network, also known as an E(n)-GNN.
Using multiple 4th Generation Intel Xeon Scalable Processors, Intel investigated how adding parallel CPU processes affects the training of the neural network. In some cases, Intel achieved more stable training, as demonstrated by a lower validation error.
Training stability deteriorates, however, when training with the greatest number of ranks. This highlights the need for a principled approach to highly distributed AI model training in order to manage the trade-off between training speed and stability. Research by Meta AI on training large language models has shown similar effects and attributed the instability to divergence in the well-known Adam family of optimizers. The findings presented here indicate that similar phenomena may occur when training distributed graph neural networks, so additional research is needed to find solutions to these problems.
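One reason the number of ranks can affect training behavior is that, in data-parallel training, each rank computes a gradient on its own shard of the batch and the gradients are then averaged. The NumPy sketch below illustrates that averaging for a linear model with a mean squared error loss. It assumes a fixed per-rank batch size, so the global batch grows with the number of ranks; that setup, the toy model, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def mse_gradient(weights, features, targets):
    """Gradient of the mean squared error of a linear model with respect to weights."""
    residual = features @ weights - targets
    return 2.0 * features.T @ residual / len(targets)


def data_parallel_gradient(weights, features, targets, num_ranks):
    """Average per-rank gradients, mimicking what an all-reduce would compute.

    Assumption: the batch divides evenly across ranks, so every shard has the same size.
    """
    feature_shards = np.array_split(features, num_ranks)
    target_shards = np.array_split(targets, num_ranks)
    grads = [
        mse_gradient(weights, f, t)
        for f, t in zip(feature_shards, target_shards)
    ]
    return np.mean(grads, axis=0)


rng = np.random.default_rng(0)
per_rank_batch = 8  # hypothetical per-process batch size
weights = rng.normal(size=4)

for num_ranks in (1, 4, 16):
    global_batch = per_rank_batch * num_ranks
    features = rng.normal(size=(global_batch, 4))
    targets = rng.normal(size=global_batch)
    sharded = data_parallel_gradient(weights, features, targets, num_ranks)
    full = mse_gradient(weights, features, targets)
    # With equal shards, the averaged gradient matches the full-batch gradient.
    print(num_ranks, global_batch, np.allclose(sharded, full))
```

Because the averaged gradient is equivalent to one step on the larger global batch, adding ranks changes the effective batch size seen by the optimizer. That is one plausible factor behind a speed versus stability trade-off, although the article does not identify a specific cause.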
Understanding the Performance of Synthetic Pre-Training Through Modeling Analysis
In addition to demonstrating training on advanced CPUs, Intel investigated the usefulness of pre-training AI models on an auxiliary symmetry prediction task. This strategy has proven effective in computer vision, where models are often given general knowledge about the data through synthetic data and are then fine-tuned for a particular task.
As shown in the figure above, pre-training on the proposed synthetic task can improve modeling performance for a wide range of materials property prediction tasks. These tasks were performed on two datasets supported by the Open MatSci ML Toolkit. Although the improvement is not uniform, it demonstrates that auxiliary tasks can give the E(n)-GNN model general knowledge that helps it solve prediction tasks for multiple properties across multiple datasets.
Additional analysis of the model revealed that pre-training can cluster the representations of different datasets. The embeddings for three datasets, namely OpenCatalyst-S2EF, OpenCatalyst-IS2RE, and Materials Project, are grouped together. The embeddings for two additional datasets, namely LiPS and Carolina DB, on the other hand, separate into distinct clusters. This analysis can help in understanding how various datasets may represent different physical and chemical aspects of the materials they describe, and how to design future tasks and datasets for training more powerful geometric deep learning models for materials science.
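An embedding analysis of this kind can be sketched as follows. The example projects embedding vectors to two dimensions with PCA (a swapped-in, simple projection; the article does not say which method was used) and then compares the distances between per-dataset centroids. The embeddings are random placeholders rather than model outputs, and the dataset labels and function names are hypothetical.

```python
import numpy as np


def pca_project(embeddings, num_components=2):
    """Project embeddings onto their top principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


def centroid_distances(projected, labels):
    """Return dataset names and the matrix of distances between their centroids."""
    names = sorted(set(labels))
    label_array = np.array(labels)
    centroids = np.stack(
        [projected[label_array == name].mean(axis=0) for name in names]
    )
    diff = centroids[:, None, :] - centroids[None, :, :]
    return names, np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
# Placeholder embeddings only: two groups drawn close together, one far away.
embeddings = np.concatenate([
    rng.normal(loc=0.0, size=(50, 16)),
    rng.normal(loc=0.2, size=(50, 16)),
    rng.normal(loc=5.0, size=(50, 16)),
])
labels = ["dataset_a"] * 50 + ["dataset_b"] * 50 + ["dataset_c"] * 50

projected = pca_project(embeddings)
names, distances = centroid_distances(projected, labels)
print(names)
print(np.round(distances, 2))
```

Small centroid distances suggest that the model represents two datasets similarly, while large distances correspond to the distinct clusters described above.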