Quick Sub-Stream Parallelization for the Random Number Generator oneMKL MRG32k3a
Generation of Random Numbers and Parallel Computing
Random number generation is one of the many computational workloads GPUs are employed for, and GPU capability has grown over time. Numerous applications, including scientific computing, simulation, and cryptography, depend on random numbers. They are essential in supplying the randomness for a wide range of prediction scenarios, such as emergency response planning for earthquakes and tsunamis, predictive maintenance, and quantitative finance risk assessment.
Because of their capacity for parallel processing, GPUs provide a notable edge when it comes to producing random numbers: many random numbers can be generated simultaneously, increasing speed and efficiency. In the digital age, where data privacy and accuracy are crucial, generating random numbers on a GPU is essential for scaling and speeding up highly accurate computations, simulations, and prediction engines. It enables highly secure cryptography in corporate security, banking, and the health sciences without affecting the end-user experience.
The Intel oneAPI Math Kernel Library (oneMKL) is a high-performance math library that offers optimized functions for mathematical computations. It is intended to speed up intricate mathematical operations including random number generation, fast Fourier transforms, and linear algebra. oneMKL is included in the Intel oneAPI Base Toolkit, a full suite of development tools and libraries for creating high-performance, optimized programs for accelerated computing systems.
For these more demanding use cases, a popular choice for producing pseudo-random numbers is the MRG32k3a random number generator. It is a combined multiple recursive random number generator with a long period, so it can generate a sizable collection of distinct random numbers before repeating. Because of its well-known statistical qualities, it is suitable for a wide range of applications.
What if we could further improve its performance by dividing the work into multiple smaller tasks that can be handled concurrently, that is, by raising the level of parallelism in the algorithm?
This article presents the addition of a sub-sequence parallelization technique for the MRG32k3a random number generator to oneMKL. Although the library already offers other ways of creating random numbers, this article emphasizes the advantages of the sub-sequence parallelization strategy.
Methods of Parallelization
Random numbers can be generated in parallel in a number of ways. However, any such scheme must take into account the generator's period and the statistical quality of the resulting sequence. Neither statistical randomness nor the length of time before a number repeats can be compromised. The section that follows examines the main methods for implementing such parallelization.
The skip-ahead technique is used to divide the generation process evenly among threads in the conventional implementation of MRG32k3a random number parallelization. For example, if ten threads are used to create 100 random numbers, the first thread produces the first ten elements, the second thread produces the following ten, and so on. Although this method guarantees consistent results, each thread must perform a computationally expensive skip-ahead procedure.
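To make the idea concrete, here is a minimal sketch using a toy 64-bit LCG in place of MRG32k3a. The generator, its constants, and the naive step-by-step skip are illustrative assumptions only; real MRG32k3a skip-ahead jumps ahead in logarithmic time using precomputed matrices. The sketch shows that block partitioning via skip-ahead reproduces exactly the sequence a single sequential generator would produce:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy LCG standing in for MRG32k3a: x_{n+1} = a*x_n + c (mod 2^64).
struct ToyRng {
    std::uint64_t state;
    std::uint64_t next() {
        state = 6364136223846793005ULL * state + 1442695040888963407ULL;
        return state;
    }
    // Naive skip-ahead: advance the state by n steps one at a time.
    void skip(std::uint64_t n) {
        for (std::uint64_t i = 0; i < n; ++i) next();
    }
};

// Block partitioning: "thread" t produces the contiguous chunk
// [t*chunk, (t+1)*chunk) after skipping ahead to its starting point.
std::vector<std::uint64_t> generate_blocked(std::uint64_t seed, std::size_t n,
                                            std::size_t num_threads) {
    std::vector<std::uint64_t> out(n);
    std::size_t chunk = n / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) { // each iteration = one thread
        ToyRng rng{seed};
        rng.skip(t * chunk);                        // expensive per-thread skip
        for (std::size_t i = 0; i < chunk; ++i)
            out[t * chunk + i] = rng.next();
    }
    return out;
}

// Reference: one generator producing the whole sequence serially.
std::vector<std::uint64_t> generate_serial(std::uint64_t seed, std::size_t n) {
    std::vector<std::uint64_t> out(n);
    ToyRng rng{seed};
    for (auto &x : out) x = rng.next();
    return out;
}
```

Both paths yield the same output, which is why skip-ahead guarantees consistent results; the cost is the per-thread skip itself.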
Pierre L'Ecuyer et al.'s research "An Object-Oriented Random-Number Package with Many Long Streams and Substreams" shows that the sequence generated by the MRG32k3a generator can be divided into sub-sequences with a displacement of 2^76 elements, which can then be interleaved without introducing any correlation into the final sequence.
This is possible because the MRG32k3a algorithm has a huge period of 2^191 elements. Each sub-stream generates random numbers that are stored in memory with a stride equal to the number of sub-streams. For instance, to generate 100 random numbers using 10 sub-streams, the first sub-stream produces the first, eleventh, …, and 91st elements, the second sub-stream produces the second, twelfth, …, and 92nd elements, and so on.
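In zero-based terms, this interleaved storage pattern can be sketched as a small helper function (hypothetical, for illustration only): sub-stream id of num_streams writes output positions id, id + num_streams, id + 2*num_streams, and so on.

```cpp
#include <cstddef>
#include <vector>

// Output positions filled by one sub-stream under interleaved storage.
// Zero-based: sub-stream 0 of 10 writes positions 0, 10, ..., 90, which are
// the "first, eleventh, ..., 91st" elements in one-based counting.
std::vector<std::size_t> substream_indices(std::size_t id,
                                           std::size_t num_streams,
                                           std::size_t n) {
    std::vector<std::size_t> positions;
    for (std::size_t k = id; k < n; k += num_streams)
        positions.push_back(k);
    return positions;
}
```

No two sub-streams touch the same position, so all of them can write their outputs concurrently.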
Assigning a distinct seed to every thread is another popular technique for producing random numbers in parallel. While this method scales well, it does not guarantee statistical independence of the sequences produced from different seeds. Sequences generated concurrently from different seeds are more likely to be similar or correlated.
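The pitfall is easy to demonstrate with a toy LCG (an illustrative stand-in, not MRG32k3a): two streams whose seeds differ by d satisfy the exact relationship y_n = x_n + a^n * d (mod 2^64) for every n, so they remain deterministically related no matter how many numbers are generated.

```cpp
#include <cstdint>

// Toy LCG: x_{n+1} = A*x_n + C (mod 2^64, via unsigned wraparound).
constexpr std::uint64_t A = 6364136223846793005ULL;
constexpr std::uint64_t C = 1442695040888963407ULL;

// n-th state of the stream started from `seed`.
std::uint64_t lcg_at(std::uint64_t seed, unsigned n) {
    std::uint64_t x = seed;
    for (unsigned i = 0; i < n; ++i)
        x = A * x + C;
    return x;
}

// A^n (mod 2^64).
std::uint64_t pow_a(unsigned n) {
    std::uint64_t p = 1;
    for (unsigned i = 0; i < n; ++i)
        p *= A;
    return p;
}

// Streams seeded s and s + d satisfy
//   lcg_at(s + d, n) - lcg_at(s, n) == pow_a(n) * d  (mod 2^64)
// for every n: deterministically coupled, never independent.
```

Any recursive generator seeded naively is subject to this kind of hidden coupling, which is exactly what the single-seed sub-sequence approach avoids.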
The sub-sequence parallelization method uses a single seed for generation, which ensures a high-quality output sequence while delivering solid performance.
Sub-Sequence Utilization with oneMKL
You can employ sub-sequence parallelization with the oneMKL Host API by specifying a mode at the random number generator setup stage, for example when updating the generation method for uniformly distributed numbers with MRG32k3a.
Pass optimal{} to let the library set the default number of sub-streams according to your hardware, or use custom{num_streams} to manually choose the number of sub-streams for a given task.
With sub-stream parallelization, the number of sub-streams dictates the order in which numbers land in the output sequence. The custom option therefore makes it possible to obtain the same output when using different hardware and generator implementations.
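As a rough sketch, Host API setup might look like the following. The mode spelling (mrg32k3a_mode::optimal / mrg32k3a_mode::custom) is an assumption inferred from the options named above; check the oneMKL documentation for the exact API.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/mkl/rng.hpp>

namespace rng = oneapi::mkl::rng;

sycl::queue q;
std::vector<float> r(n);

// Sub-sequence mode is selected at engine construction time
// (mode type name below is an assumption; consult the oneMKL docs):
rng::mrg32k3a engine(q, seed, rng::mrg32k3a_mode::optimal{});       // hardware default
// rng::mrg32k3a engine(q, seed, rng::mrg32k3a_mode::custom{1024}); // explicit count

rng::uniform<float> distr;             // uniform numbers on [0, 1)
sycl::buffer<float, 1> buf(r.data(), n);
rng::generate(distr, engine, n, buf);  // enqueue generation of n numbers
```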
The oneMKL Device API can also be used for sub-sequence parallelization with the MRG32k3a generator.
// …
namespace rng_device = oneapi::mkl::rng::device;
// engine_ptr: device array of num_streams engines; result_ptr: device array
// of n floats; seed, n, num_streams are defined in the elided code above.
// Kernel with initialization of generators: each sub-stream starts
// id * 2^76 elements into the sequence. The offset is given as base-2^64
// digits, and 2^76 = 4096 * 2^64.
queue.parallel_for(sycl::range<1>(num_streams), [=](std::size_t id) {
    rng_device::mrg32k3a<1> engine(seed, {0, 4096 * id});
    engine_ptr[id] = engine;
}).wait();
// Generate random numbers: sub-stream id fills positions id,
// id + num_streams, id + 2 * num_streams, ...
queue.parallel_for(sycl::range<1>(num_streams), [=](std::size_t id) {
    auto engine = engine_ptr[id];
    rng_device::uniform<float> distr;
    std::uint32_t count = 0;
    while (id + count * num_streams < n) {
        result_ptr[id + count * num_streams] = rng_device::generate(distr, engine);
        count++;
    }
    engine_ptr[id] = engine; // save state in case more numbers are needed later
}).wait();
The code above shows an example of implementing sub-stream parallelization with the oneMKL Device API.
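Note that the offset list in the initialization kernel encodes the displacement in base-2^64 digits: {lo, hi} means lo + hi * 2^64 skipped elements, so {0, 4096 * id} skips id * 2^76 elements, the sub-stream spacing in L'Ecuyer's partitioning. A quick arithmetic check (using the GCC/Clang unsigned __int128 extension):

```cpp
#include <cstdint>

// Skip-ahead offsets in the Device API are given as base-2^64 digits:
// {lo, hi} encodes lo + hi * 2^64 skipped elements. unsigned __int128 is a
// GCC/Clang extension used here only to verify the arithmetic.
unsigned __int128 offset_from_digits(std::uint64_t lo, std::uint64_t hi) {
    return (unsigned __int128)lo + ((unsigned __int128)hi << 64);
}
```

Since 4096 = 2^12, a second digit of 4096 * id corresponds to id * 2^12 * 2^64 = id * 2^76 elements.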
Interchangeability Between oneMKL RNG and cuRAND
Maintaining the output sequence is essential when porting an application from the cuRAND (CUDA Random Number Generation) library to oneMKL; otherwise, the functionality may not match exactly. To address this, oneMKL's MRG32k3a random number generator can be used in sub-sequence parallelization mode.
Within the cuRAND Host API, the MRG32k3a random number generator works only in sub-sequence parallelization mode, whereas the conventional oneMKL implementation takes a sequential approach. Consequently, these two implementations of the same random number generator produce distinct sequences.
If both random number generators are initialized with the same seed, switching oneMKL to sub-sequence mode makes it possible to obtain identical sequences.
The Intel DPC++ Compatibility Tool can help streamline the switch to oneMKL. It provides automated capabilities for code analysis, optimization, and transformation. Additionally, it preserves the output sequence by enabling an automated transition from cuRAND to oneMKL’s RNG domain.
Scalability while Maintaining Unpredictability
The Intel oneMKL library’s sub-sequence parallelization technique for the MRG32k3a random number generator effectively produces high-quality pseudo-random numbers on GPUs. This can be accomplished without sacrificing statistical quality by splitting the output sequence into several sub-streams that can be processed concurrently.
For the MRG32k3a random number generator, the sub-sequence approach strikes the ideal compromise between scalability and randomness. With the help of the given usage examples, developers can quickly incorporate this technique into their current oneMKL processes and utilize GPU parallel computation for random number generation. By allowing programs to switch from other GPU math libraries, such as cuRAND, to oneMKL, this technique enhances usability and compatibility. Regardless of whether an application is best suited to run on discrete Intel Arc Graphics GPUs, integrated GPUs based on Intel Xe2 Graphics, Intel Data Center GPUs, or even third-party compute accelerators, this ultimately speeds up simulations, numerical analysis, and other applications that depend on large volumes of high-quality random data.
Obtain the software
Get the Intel oneAPI Math Kernel Library (oneMKL) alone or as a component of the Intel oneAPI Base Toolkit, a collection of essential tools and libraries for creating data-driven, high-performance applications for a variety of architectures.