Boost Multi-Node Data Sharing for GPUs and SYCL Devices with Intel SHMEM

With the release of the Intel oneAPI HPC Toolkit 2025.1, Intel SHMEM moves from an open-source GitHub project to a fully validated product release. It implements a Partitioned Global Address Space (PGAS) programming model compatible with the OpenSHMEM 1.5 standard for remote data access and memory sharing in distributed multi-node computing environments.
What sets Intel SHMEM apart is that it exposes the OpenSHMEM communication API to SYCL-based computational device kernels on Intel GPUs, in addition to distributed Intel Xeon Scalable CPU setups. This makes it an ideal companion library and API for high-performance, large-dataset applications running on supercomputers such as Dawn at the Cambridge Open Zettascale Lab and Aurora at the Argonne Leadership Computing Facility (ALCF), or on Intel Xeon 6 compute clusters with SYCL GPU devices.
Unlike NVSHMEM and ROC SHMEM, Intel SHMEM's SYCL-based approach does not impose vendor-specific programming environment restrictions on users. We think SYCL's inherent multiarchitecture, multi-vendor readiness is a better fit for an open API such as OpenSHMEM.
We also believe that the adoption of SYCL, which is designed as a portable programming environment, contributes positively to the anticipated standardization work on GPU additions to OpenSHMEM.
Intel SHMEM Software Architecture
Intel SHMEM uses a Partitioned Global Address Space (PGAS) programming model and includes most host-initiated operations of the current OpenSHMEM standard, along with new device-initiated operations that can be called directly from GPU kernels. Its main features are listed below, followed by a minimal usage sketch:
- Device and host API support for OpenSHMEM 1.5-compliant point-to-point Remote Memory Access (RMA), Atomic Memory Operations (AMO), signaling, memory ordering, and synchronization operations.
- Device and host API support for collective operations aligned with the OpenSHMEM 1.5 teams API.
- Device API support for SYCL work-group and sub-group level extensions of RMA, signaling, collective, memory ordering, and synchronization operations.
- A comprehensive set of C++ function templates that supersede the C11 generic routines of the OpenSHMEM specification.
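To make the programming model concrete, here is a minimal sketch of a program that combines host-initiated setup with a device-initiated put from inside a SYCL kernel. It assumes the ishmem.h header and OpenSHMEM-style signatures described in the Intel SHMEM specification; treat the exact prototypes, build flags, and launcher as details to verify against the release documentation.

```cpp
// Minimal sketch: host-initiated setup, device-initiated RMA (assumed prototypes).
#include <ishmem.h>
#include <sycl/sycl.hpp>

int main() {
    ishmem_init();                       // host-only: initialize the runtime

    int my_pe = ishmem_my_pe();
    int n_pes = ishmem_n_pes();
    sycl::queue q;

    // Symmetric allocations land on the GPU-resident symmetric heap (host-only calls)
    int *src = (int *) ishmem_malloc(sizeof(int));
    int *dst = (int *) ishmem_malloc(sizeof(int));

    // Device-initiated put: each PE writes its rank into dst on the next PE
    q.single_task([=]() {
        *src = my_pe;
        ishmem_int_put(dst, src, 1, (my_pe + 1) % n_pes);
    }).wait();

    ishmem_barrier_all();                // host-initiated synchronization

    ishmem_free(dst);
    ishmem_free(src);
    ishmem_finalize();
    return 0;
}
```

A real application would typically launch one PE per GPU tile with an OpenSHMEM or MPI launcher, matching the 1:1 PE-to-device mapping discussed later in this article.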
Let's look at how some of these features are implemented and how they improve runtime performance.
The current OpenSHMEM memory model is defined from the point of view of a C/C++ application's host memory. Without changes to the specification, this restricts the use of GPU or other accelerator memories as symmetric segments. The execution model must also be extended so that all processing elements (PEs) execute concurrently, with each PE using a functional mapping of GPU memory as its symmetric heap. In the SYCL programming model, Unified Shared Memory (USM) permits allocations from host, device, and shared address spaces; depending on the memory type, the runtime may require distinct communication and completion semantics.
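For reference, the three USM allocation kinds look like this in standard SYCL 2020; Intel SHMEM's device-resident symmetric heap corresponds to the device USM case:

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    int *h = sycl::malloc_host<int>(1024, q);    // host USM: host-resident, device-accessible
    int *d = sycl::malloc_device<int>(1024, q);  // device USM: GPU-resident (symmetric heap lives here)
    int *s = sycl::malloc_shared<int>(1024, q);  // shared USM: migrates between host and device

    sycl::free(h, q);
    sycl::free(d, q);
    sycl::free(s, q);
    return 0;
}
```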
Intel SHMEM is engineered to provide low-latency, high-throughput communication for multi-node systems built from tightly coupled GPU devices that implement a SYCL API.
A key challenge in distributed programming of such a tightly coupled GPU system is exploiting the high-speed fabric among the GPUs while also enabling off-node communication. Because SHMEM applications transfer messages of widely varying sizes, the library must also decide quickly and effectively which path (the network, shared host memory, or the GPU-GPU fabric) extracts the maximum performance for a given message size and system scale.
When the data to be communicated resides in GPU memory, a zero-copy design is preferable to copying the data to host memory and synchronizing GPU and host memory when it arrives. GPU remote direct memory access (RDMA), which lets the NIC register GPU device memory for zero-copy transfers, makes GPU-initiated communication even more valuable, though it poses a software engineering challenge when that capability is unavailable.

In light of these factors, let's look at the Intel SHMEM software design. The application layer, depicted at the top of Figure 1, consists of Intel SHMEM calls inside an OpenSHMEM program (or an MPI program if the MPI-compatible host proxy backend is enabled).
The Intel SHMEM APIs fall into two main groups:
[1] device-initiated
[2] host-initiated
As shown at callout (3) in Figure 1, a host-side proxy thread uses a standard OpenSHMEM library to delegate certain GPU-initiated operations for inter-node (scale-out) communication. This proxy backend does not have to be a pure OpenSHMEM implementation; an OSHMPI solution provides MPI compatibility with competitive performance.
Intel SHMEM supports all of the host-initiated OpenSHMEM functions, except that the routines are prefixed with ishmem rather than shmem. While not strictly required, this naming scheme has the benefit of distinguishing among the several OpenSHMEM-based runtimes that support GPU abstractions.
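For example, a host-initiated put keeps the familiar OpenSHMEM argument list and only changes the prefix (the helper function here is hypothetical, shown just to contrast the two spellings):

```cpp
#include <ishmem.h>
#include <cstddef>

// Hypothetical helper: the ishmem-prefixed equivalent of OpenSHMEM's shmem_int_put
// takes the same arguments; only the prefix differs.
void put_rank(int *dest, const int *source, std::size_t nelems, int target_pe) {
    ishmem_int_put(dest, source, nelems, target_pe);   // cf. shmem_int_put(...)
}
```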
Xe-Link connections allow individual GPU threads to issue loads, stores, and atomic operations to memory on other GPUs within a local group of GPUs. Individual loads and stores provide very low latency, and bandwidth can be increased significantly by having many threads issue loads and stores concurrently, at the cost of spending compute resources (threads) on communication.
Alternatively, the available hardware copy engines can be leveraged to overlap communication with computation, letting Xe-Link run at full speed while the GPU compute cores stay busy, albeit at the expense of some startup latency.
Intel SHMEM uses all of these methods in different operating regimes to maximize performance: direct loads and stores, spreading loads and stores across GPU threads, and a cutover strategy that hands non-blocking operations and large transfers to the hardware copy engines.
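The following is a purely illustrative sketch of what such a cutover decision might look like; the thresholds and the three-way split are assumptions for exposition, not Intel SHMEM's actual internal logic.

```cpp
#include <cstddef>

// Hypothetical transfer paths corresponding to the mechanisms described above.
enum class TransferPath { DirectLoadStore, WorkGroupLoadStore, CopyEngine };

// Illustrative cutover heuristic; the thresholds are made up for this example.
TransferPath choose_path(std::size_t bytes, bool non_blocking) {
    constexpr std::size_t kSpreadThreshold = 8 * 1024;        // hypothetical
    constexpr std::size_t kCopyEngineThreshold = 256 * 1024;  // hypothetical

    if (non_blocking || bytes >= kCopyEngineThreshold)
        return TransferPath::CopyEngine;         // overlaps with compute; startup latency
    if (bytes >= kSpreadThreshold)
        return TransferPath::WorkGroupLoadStore; // more bandwidth, costs GPU threads
    return TransferPath::DirectLoadStore;        // lowest latency for small messages
}
```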
Device-Side Implementation
Intel SHMEM requires a one-to-one mapping of PEs to SYCL devices. This 1:1 mapping of a PE to a GPU tile guarantees a distinct GPU memory space as the symmetric heap for that PE. For the proxy thread to perform host-sided OpenSHMEM operations on a symmetric heap located in GPU memory, the buffer must be registered with the FI_MR_HMEM mode flag set.
The interfaces are:
- shmemx_heap_create(base_ptr, size, …)
- shmemx_heap_preinit()
- shmemx_heap_preinit_thread(requested, &provided)
- shmemx_heap_postinit()
These interfaces support an external symmetric heap, in addition to the conventional OpenSHMEM symmetric heap located in host memory, so that Intel SHMEM can specify which region to register; the GPU memory is registered as a device memory region.
Intel SHMEM aims to support every OpenSHMEM API. However, some interfaces, such as the initialization, finalization, and memory management APIs, must be called from the host alone, because only the host can configure the data structures that manage CPU/GPU proxy interactions and dynamically allocate device memory.
Most OpenSHMEM communication interfaces, including RMA, AMO, signaling, collective, memory ordering, and point-to-point synchronization operations, can be called from both the host and the device with the same semantics.
New Features
The latest Intel SHMEM release includes the complete Intel SHMEM specification, with sample programs, build and run instructions, and details of the programming model and the available API calls.
You can target both the host and the device with OpenSHMEM 1.5 and 1.6 features such as point-to-point Remote Memory Access (RMA), Atomic Memory Operations (AMO), signaling, memory ordering, teams, collectives, synchronization operations, and strided RMA operations.
Intel SHMEM provides device API support for SYCL work-group and sub-group level extensions of RMA, signaling, collective, memory ordering, and synchronization operations, as well as host API support for SYCL queue-ordered RMA, collective, signaling, and synchronization operations.
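As an example of the work-group level device API, the sketch below has all work-items of a SYCL work-group cooperate on a single put. It assumes the ishmemx_*_work_group naming pattern with a trailing SYCL group argument, as described in the specification; verify the exact prototypes before use.

```cpp
#include <ishmem.h>
#include <ishmemx.h>
#include <sycl/sycl.hpp>
#include <cstddef>

// Assumed prototype pattern: ishmemx_<type>_put_work_group(dest, src, nelems, pe, group)
void block_put(sycl::queue &q, float *dest, const float *src,
               std::size_t nelems, int target_pe) {
    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{256}},
                   [=](sycl::nd_item<1> it) {
        // Every work-item in the work-group participates in the same RMA operation
        ishmemx_float_put_work_group(dest, src, nelems, target_pe, it.get_group());
    }).wait();
}
```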
List of Key Recently Added Features
- Support for new on_queue APIs lets the host queue OpenSHMEM operations on SYCL devices. These APIs also let users supply a list of SYCL events as a dependency vector (see the sketch after this list).
- OSHMPI support. Intel SHMEM can now be configured to run over OSHMPI with a suitable MPI back end. See Building Intel SHMEM for details.
- Support for Intel SHMEM on Intel Tiber AI Cloud; follow the accompanying setup instructions.
- Limited support for OpenSHMEM thread models: thread initialization and query routines are supported through the host API.
- Host and device API support for vector point-to-point synchronization routines.
- Support for networks enabled by the OFI Libfabric MLX provider, using the Intel MPI Library.
- New feature descriptions and APIs have been added to the updated specification.
- An expanded collection of unit tests covering the capabilities of the new APIs.
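Referring back to the on_queue item above, a host-queued put might look roughly like the sketch below. The exact prototype (argument order, the returned sycl::event, and the dependency-vector parameter) is an assumption here; consult the Intel SHMEM specification for the real signatures.

```cpp
#include <ishmemx.h>
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

// Assumed shape of an on_queue API: the operation is enqueued on a SYCL queue and
// starts only after the supplied dependency events complete.
sycl::event queued_put(sycl::queue &q, int *dest, const int *src,
                       std::size_t nelems, int target_pe,
                       const std::vector<sycl::event> &deps) {
    return ishmemx_int_put_on_queue(dest, src, nelems, target_pe, q, deps);
}
```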
Get it now
Intel SHMEM is available in source form from its GitHub repository, as a standalone download, and as a component of the Intel oneAPI HPC Toolkit. Try it out and see how it can enhance multiarchitecture, GPU-initiated memory sharing and remote data access across distributed multi-node systems!