Random Number Generation with cuRAND* and oneMKL

Intel® oneAPI Math Kernel Library Has You Covered on CPU and GPU

Get the Latest on All Things CODE

author-image

By

The Intel® oneAPI Math Kernel Library (oneMKL) binary distribution is the fastest and most-used math library for Intel®-based systems1, just as the binary distribution only math libraries included in the CUDA* Toolkit are the most used set of libraries for NVIDIA* GPUs.

The oneAPI Math Kernel Library (oneMKL) Interfaces and its SYCL* API, on the other hand, are defined by the oneAPI Specification for oneMKL, oneAPI being an open, cross-industry, standards-based, unified, multiarchitecture, multi-vendor programming model.

Thus, the oneAPI Math Kernel Library (oneMKL) Interfaces project is an open source specification implementation. The project illustrates how the SYCL interfaces documented in the oneMKL specification can be implemented for any math library and work for any target hardware. While the implementation provided may not yet be the full implementation of the specification, the goal is to build it out over time.

This allows oneMKL to share in the vision of oneAPI and provide freedom of choice for targeted hardware platform configurations. Using a common codebase for different accelerated compute platforms through source code portability and performance portability.

It is not sufficient to simply translate CUDA calling conventions into SYCL. Commonly used libraries and their APIs also need to be matched and migrated. To ease the migration of an existing codebase from CUDA to SYCL so that it can run on not only NVIDIA GPUs but a wide range of possible GPUs and accelerators, functionality and capabilities need to be mapped and well-understood for CUDA-based as well as SYCL-based library implementations.

This article will provide a snapshot highlighting the overlap in the current random number generation library function implementations of cuRAND* and oneMKL.


Figure 1.  Assigning Sets of Random Numbers using cuRAND and oneMKL Device API


The code comparison in Figure 1 shows the differences in the header files required and the use of curand_normal() versus generate() for assigning a random number in a loop.

Similar function call mapping will have to be used across the board. Here, we will not go into the details of the calling conventions but only provide a brief summary for context.

Detailed documentation can be found at these locations:

Let us consider a simple, complete, and reusable code sample for random number generation using Host API as well as Device API implemented in CUDA and oneMKL, providing us with a good overview of the similarities and differences in the usage model:

oneMKL cuRAND
#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>
#include <oneapi/mkl/rng.hpp>

int main() {
    const size_t n = 4096;
    sycl::queue queue;
    using allocator_t = sycl::usm_allocator<double, sycl::usm::alloc::shared>;

    // allocate memory
    std::vector<double, allocator_t> r_vec(n, allocator_t(queue));
    double* r = r_vec.data();

    // create basic random number generator object
    oneapi::mkl::rng::philox4x32x10 engine(queue, /*SEED*/ 777);
    // create distribution object
    oneapi::mkl::rng::gaussian<double> distr;
    // perform generation
    oneapi::mkl::rng::generate(distr, engine, n, r).wait();

    double s = 0.0;
    for(int i = 0; i < n; i++) {
       s += r[i];
    }
    s /= n;
    std::cout << "Average = " << s << std::endl;
    return 0;
}
#include <iostream>
#include <stdio.h>

#include <cuda.h>
#include <curand_kernel.h>

int main() {
    const size_t n = 4096;
    double* r;
    // allocate memory
    cudaMallocManaged(&r, n * sizeof(double));

    // create and initialize generator
    curandGenerator_t rng;
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandSetPseudoRandomGeneratorSeed(rng, 777ull);

    // perform generation
    curandGenerateNormalDouble(rng, r, n, 0.0, 1.0);
    cudaDeviceSynchronize();

    curandDestroyGenerator(rng);

    double s = 0.0;
    for(int i = 0; i < n; i++) {
       s += r[i];
    }
    s /= n;
    std::cout << "Average = " << s << std::endl;

    cudaFree(r);
    return 0;
}

Figure 2. Side-by-side code comparison using Host API

oneMKL cuRAND
#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>
#include <oneapi/mkl/rng/device.hpp>

int main() {
    sycl::queue queue;

    const size_t n = 4096;
    using allocator_t = sycl::usm_allocator<double, sycl::usm::alloc::shared>;

    std::vector<double, allocator_t> r_vec(n, allocator_t(queue));
    double* r = r_vec.data();

    queue.parallel_for(n, [=](size_t i){
        // create basic random number generator object
        oneapi::mkl::rng::device::philox4x32x10 engine(/*SEED*/ 777, i);
        // create distribution object
        oneapi::mkl::rng::device::gaussian<double> distr;
        // perform generation
        r[i] = oneapi::mkl::rng::device::generate(distr, engine);
    }).wait();

    double s = 0.0;
    for(int i = 0; i < n; i++) {
       s += r[i];
    }
    s /= n;
    std::cout << "Average = " << s << std::endl;
    return 0;
}
#include <iostream>
#include <stdio.h>

#include <cuda.h>
#include <curand_kernel.h>

// CUDA kernel <<<Num blocks, block size>>>
__global__ void generate_philox4x32x10(double* result) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // create basic random number generator object
    curandStatePhilox4_32_10_t localState;
    curand_init(777, 0, id, &localState);
    // perform generation
    result[id] = curand_normal_double(&localState);
}

int main() {
    const size_t n = 4096;
    double* r;

    cudaMallocManaged(&r, n * sizeof(double));

    generate_philox4x32x10<<<n / 512, 512>>>(r);
    cudaDeviceSynchronize();

    double s = 0.0;
    for(int i = 0; i < n; i++) {
       s += r[i];
    }
    s /= n;
    std::cout << "Average = " << s << std::endl;

    cudaFree(r);
    return 0;
}

Figure 3. Side-by-side code comparison using Device API

oneMKL RNG SYCL API for Host and Target

oneMKL provides SYCL interfaces for the Vector Statistics Random Number Generators (RNG) routines implementing commonly used pseudorandom, quasi-random, and non-deterministic generators with continuous and discrete distributions. oneMKs RNG SYCL API follows the C++ standard, an additional advantage for C++ users.

The SYCL API consists of a host-side and device-side API. In addition, there are two APIs for storing data on a device and sharing data between devices and the host: the buffer API and the unified shared memory (USM) API.

Most of oneMKL RNG functions can be executed on a CPU or a GPU, often using the same device API.

The device-side communication is handled via a oneAPI compliant backend runtime library for the targeted device, which could be a CPU or a GPU from vendors like Intel, AMD, or NVIDIA. The compute device could even be other compute accelerators that such a runtime library had been developed for using OpenCL or oneAPI Level Zero backend framework.

RNG API calls support explicit manual offload (Host API) as well as the ability to call a set of functions directly from SYCL kernels (Device API) with Random Number Generators Device Routines

The oneMKL RNG device API represents generators as classes, which may be called directly in the kernel or allocated on the host.

A typical sequence of operations using Host API looks as follows:

  1. Create and initialize the object for a basic random number generator.
  2. Create and initialize the object for the distribution generator.
  3. Call the generate() routine to get random numbers with the appropriate statistical distribution.
  4. Cleanup is handled automatically by the RNG object’s destructor.

The example in Figure 2 and Figure 3 demonstrates the generation of random numbers by the basic generator (engine) PHILOX4X32X10. The seed is equal to 777. The generator is used to generate 4096 normally distributed random numbers with default parameters a = 0 and sigma= 1. The purpose of examples is to calculate the sample mean for normal distribution with the given parameters.

cuRAND API for Host and Target

Although cuRAND’s focus is on NVIDIA’s GPUs, there are a few software architectural similarities. It also has a library on the host (CPU) side and a device (GPU) header file. Random numbers can be generated on the device or the host CPU. It includes support for commonly used pseudorandom and quasi-random generators.

If random number generation happens on the GPU, the calls to the GPU are initiated from the host CPU, and the created random numbers are stored in global device memory. The resulting number can be reused on the GPU using CUDA kernels or copied back to the host memory for additional processing.

cuRAND represents different device API generators as pointers to a structure.

A typical sequence of operations looks as follows:

  1. Create a new generator of the desired type.
  2. Allocate memory on the device.
  3. Generate random numbers with curandGenerate().
  4. Clean up with curandDestroyGenerator().

For a detailed overview of the cuRAND usage model, please consult its online CUDA Toolkit Documentation.

Functional Coverage

The tables below provide a functional map of the current oneMKL RNG host API and device API to its CUDA equivalent and help you identify current gaps in respective coverage.

Development is ongoing. Thus, the data presented in these tables currently will change over time. Please check the oneMKL RNG SYCL API Implementation on GitHub frequently or ensure to always use the latest Intel oneAPI Math Kernel Library product release for your software development.

The columns show the cuRAND API name (if available) followed by the equivalent oneMKL construct. The “not supported” entry indicates that cuRAND and oneMKL, respectively, don’t support a specific functionality.

Both tables cover RNG generators and distributions.

NOTE: All oneMKL APIs in tables 5 and 6 support CPU and GPU both, except the following generators that support only CPU: ars5, niederreiter, wichman_hill, sfmt19937, r250 and nondeterministic.

RNG Host API Functionality Map

The following table, namespace oneapi::mkl::rng is used for oneMKL RNG Host API.

cuRAND API support

oneMKL API support

Generators

CURAND_RNG_PSEUDO_DEFAULT

philox4x32x10

CURAND_RNG_PSEUDO_XORWOW

not supported

CURAND_RNG_PSEUDO_MRG32K3A

mrg32k3a

CURAND_RNG_PSEUDO_MTGP32

Intel oneMKL provides mt2203 counterpart

CURAND_RNG_PSEUDO_MT19937

mt19937

CURAND_RNG_PSEUDO_PHILOX4_32_10

philox4x32x10

CURAND_RNG_QUASI_SOBOL32

sobol

CURAND_RNG_QUASI_SCRAMBLED_SOBOL32

can be supported by sobol

CURAND_RNG_QUASI_SOBOL64

not supported

CURAND_RNG_QUASI_SCRAMBLED_SOBOL64

not supported

not supported

ars5

not supported

mcg59

not supported

niederreiter

not supported

mcg31m1

not supported

wichmann_hill

not supported

sfmt19937

not supported

r250

not supported

nondeterministic

Distributions

curandGenerate
curandGenerateLongLong

generate(uniform_bits<>{}, …);

curandGenerateLogNormal curandGenerateLogNormalDouble

generate(lognormal<>{}, …);

curandGenerateNormal curandGenerateNormalDouble

generate(gaussian<>{}, …);

curandGeneratePoisson

generate(poisson<>{}, …);

curandGenerateUniform
curandGenerateUniformDouble

generate(uniform<>{}, …);

not supported

bits

not supported

bernoulli

not supported

beta

not supported

binomial

not supported

chi_square

not supported

exponential

not supported

gamma

not supported

geometric

not supported

gumbel

not supported

hypergeometric

not supported

laplace

not supported

multinomial

not supported

negative_binomial

Figure 4. Random Number Generation Host API Functionality Map

RNG Device API Functionality Map

In the following table namespace oneapi::mkl::rng::device is used for oneMKL RNG Device API.

cuRAND API support

oneMKL API support

Generators

curandStateMtgp32_t

not supported

curandStateScrambledSobol64_t
curandStateScrambledSobol32_t

not supported

curandStateSobol64_t
curandStateSobol32_t

not supported

curandStateMRG32k3a_t

mrg32k3a

curandStatePhilox4_32_10_t

philox4x32x10

__device__ void curand_init (unsigned long long seed,
unsigned long long subsequence, unsigned long long
offset, curandStateMRG32k3a_t *state)

mrg32k3a<vec_size>(uint32_t seed, uint64_t offset)
 

__device__ void curand_init (unsigned long long seed,
unsigned long long subsequence, unsigned long long
offset, curandStatePhilox4_32_10_t *state)

philox4x32x10<vec_size>(uint64_t seed, uint64_t offset)

QUALIFIERS_MTGP32 float curand_mtgp32_single
(curandStateMtgp32_t *state)

not supported

QUALIFIERS_MTGP32 unsigned int
curand_mtgp32_specific (curandStateMtgp32_t *state,
unsigned char index, unsigned char n)

not supported

__device__ unsigned int curand (generator_state_t* generator)

uint32_t generate(bits<>& distr, generator& gen)

__device__ uint4 curand4 (generator_state_t* generator)

sycl::vec<uint32_t, 4> generate(bits<>{}, generator<4>& gen)

Distributions

__device__ float curand_normal
__device__ float2 curand_normal2
__device__ float4 curand_normal4


__device__ double curand_normal_double
__device__ double2 curand_normal2_double
__device__ double4 curand_normal4_double

sycl::vec<float, vec_size> generate(gaussian<float>{}, generator<vec_size>);

sycl::vec<double, vec_size> generate(gaussian<double>{}, generator<vec_size>);

__device__ float curand_uniform
__device__ float2 curand_uniform2
__device__ float4 curand_uniform4


__device__ double curand_uniform_double
__device__ double2 curand_uniform2_double
__device__ double4 curand_uniform4_double

sycl::vec<float, vec_size> generate(uniform<float>{}, generator<vec_size>);

sycl::vec<double, vec_size> generate(uniform<double>{}, generator<vec_size>);

__device__ float curand_log_normal(generator_state_t *state, float mean, float stddev)
__device__ float2 curand_log_normal2(generator_state_t *state, float mean, float stddev)
__device__ float4 curand_log_normal4(generator_state_t *state, float mean, float stddev)


__device__ double curand_log_normal_double(generator_state_t *state, double mean, double stddev)
__device__ double2 curand_log_normal2_double(generator_state_t *state, double mean, double stddev)
__device__ double4 curand_log_normal4_double(generator_state_t *state, double mean, double stddev)

sycl::vec<float, vec_size> generate(lognormal<float>(mean, stddev), generator<vec_size>);

sycl::vec<double, vec_size> generate(lognormal<double>(mean, stddev), generator<vec_size>);

__device__ unsigned int curand_poisson
(generator_state_t *state, double lambda) __device__ uint4
curand_poisson4 (curandStatePhilox4_32_10_t *state, double lambda)

sycl::vec<int32_t, vec_size> generate(poisson<int32_t>(lambda), generator<vec_size>);

sycl::vec<uint32_t, vec_size> generate(poisson<uint32_t>(lambda), generator<vec_size>);

__device__ unsigned int
curand_discrete (
    curandState_t *state,
    curandDiscreteDistribution_t discrete_distribution)

__device__ uint4
curand_discrete4 (curandStatePhilox4_32_10_t *state, curandDiscreteDistribution_t discrete_distribution)

not supported

Not supported

bits

Not supported

exponential

Figure 5. Random Number Generation Device API Functionality Map

As you might expect, oneMKL RNG has considerably more extensive coverage using its host API and device API, whereas cuRAND offers slightly more coverage with its device API.

Portability Across Compute Devices

SYCLomatic and the Intel DPC++ Compatibility Tool will assist in automatically migrating cuRAND function calls into oneMKL RNG calls for increased code portability and flexibility in targeted offload compute accelerators. The migrated code can be run not only on Intel HW but also on NVIDIA GPUs using oneMKL Interfaces Github project.

Many implementations have a direct match between the two APIs, whether you use CUDA or SYCL.

Some, however, do currently not have this direct match in either direction. In some areas, oneMKL provides more functionality; in others, cuRAND does. We are actively working on increasing the functional overlap where feasible, prioritized by frequency of use in developers’ applications. This implies that functional portability and performance portability is easily accessible in most cases.

If there is a scenario where the oneMKL implementation details are not obvious, please contact our community forum or our Priority Support team of consulting engineers.

Join Us and Get Started with oneMKL RNG SYCL Interfaces

Become part of the effort to make high-performance cross-architecture compute transparent, portable, and flexible. Include SYCL as the accelerator and GPU offload solution in your code path. Adopt oneAPI as the means to implementations free from vendor lock.

You can install oneMKL as a part of the Intel® oneAPI Base Toolkit or download its stand-alone version, both for free! 

Additional Resources