Intel® oneAPI Math Kernel Library Cookbook

ID 758503
Date 6/24/2024
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Multiple simple random sampling without replacement

Goal

Generate K>>1 simple random length-M samples without replacement from a population of size N (1 MN).

Solution

For exact definitions and more details of the problem, see [SRSWOR].

Use the following implementation of a partial Fisher-Yates Shuffle algorithm [KnuthV2] and oneMKL random number generators (RNG) to generate each sample:

Partial Fisher-Yates Shuffle algorithm

 A2.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

 A2.2: for i from 1 to M do:

     A2.3: generate random integer X uniform on {i,...,N}

     A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X]

 A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i]=PERMUT_BUF[i]

 End.

The program that implements the algorithm conducts 11 969 664 experiments. Each experiment, which generates a sequence of M unique random natural numbers from 1 to N, is actually a partial length-M random shuffle of the whole population of N elements. Because the main loop of the algorithm works as a real lottery, each experiment is called "lottery M of N" in the program.

The program uses M=6 and N=49, stores result samples (sequences of length M) in a single array RESULTS_ARRAY, and uses all available parallel threads.

Source code: see the lottery6of49 folder in the samples archive available at https://www.intel.com/content/dam/develop/external/us/en/documents/mkl-cookbook-samples-120115.zip.

Parallelization

#pragma omp parallel
{
    thr = omp_get_thread_num(); /* the thread index */
    VSLStreamStatePtr stream;
    /* RNG stream initialization in this thread */
    vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );
    ... /* Generation of experiment samples (in thread number thr) */
    vslDeleteStream( &stream );
}

The code exploits all CPUs with all available processor cores by using the OpenMP* #pragma parallel directive. The array of experiment results RESULTS_ARRAY is broken down into THREADS_NUM portions, where THREADS_NUM is the number of available CPU threads, and each thread (parallel region) processes its own portion of the array.

oneMKL basic random number generators with the VSL_BRNG_MT2203 parameter easily support a parallel independent stream in each thread.

Generation of experiment samples

/* A2.1: Initialization step */
/*  Let PERMUT_BUF contain natural numbers 1, 2, ..., N */
 for( i=0; i<N; i++ ) PERMUT_BUF[i]=i+1; /* using the set {1,...,N} */
 for( sample_num=0; sample_num<EXPERIM_NUM/THREADS_NUM; sample_num++ ){
     /* Generate next lottery sample (steps A2.2, A2.3, and A2.4): */
     Fisher_Yates_shuffle(...);
     /*  A2.5: Copy step */   
     for(i=0; i<M; i++)
         RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];
 }

This code implements the partial Fisher-Yates Shuffle algorithm in each thread.

In the case of simulating many experiments, the Initialization step is only needed once because at the beginning of each experiment, the order of natural numbers 1...N in the PERMUT_BUF array does not matter (like in a real lottery).

Fisher_Yates_shuffle function

/* A2.2: for i from 0 to M-1 do */
Fisher_Yates_shuffle (...)
{  
    for(i=0; i<M; i++) {
        /* A2.3: generate random natural number X from {i,...,N-1} */
        j = Next_Uniform_Int(...);
        /* A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X] */
        tmp = PERMUT_BUF[i];
        PERMUT_BUF[i] = PERMUT_BUF[j];
        PERMUT_BUF[j] = tmp;
    }
}

Each iteration of the loop A2.2 works as a real lottery step: it extracts a random item X from the bin with remaining items PERMUT_BUF[i], ..., PERMUT_BUF[N] and puts the item X at the end of the results row PERMUT_BUF[1],...,PERMUT_BUF[i]. The algorithm is partial because it does not generate the full permutation of length N, but only a part of length M.

NOTE:

Unlike the pseudocode that describes the algorithm, the program uses zero-based arrays.

Discussion

In step A2.3, the program calls the Next_Uniform_Int function to generate the next random integer X, uniform on {i, ..., N-1} (see the source code for details). To exploit the full power of vectorized RNGs from oneMKL, but to minimize vectorization overheads, the generator must generate a sufficiently large vector D_UNIFORM01_BUF of size RNGBUFSIZE that fits the L1 cache. Each thread uses its own buffer D_UNIFORM01_BUF and the index D_UNIFORM01_IDX pointing to after the last used random number from that buffer. In the first call to Next_Uniform_Int function (or in the case all random numbers from the buffer have been used), the full buffer of random numbers is regenerated again by calling the vdRngUniform function with the length RNGBUFSIZE and the index D_UNIFORM01_IDX set to zero (earlier in the program):

vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );
      

Because oneMKL only provides generators of random values with the same distribution, but step A2.3 requires random integers on different intervals, the buffer is filled with double-precision random numbers uniformly distributed on [0;1) and then, in the Integer scaling step, these double-precision values are converted to fit the needed integer intervals:

number 0   distributed on {0,...,N-1}   = 0   + {0,...,N-1}
number 1   distributed on {1,...,N-1}   = 1   + {0,...,N-2}

...

number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}

(then repeat previous M steps)

number M     distributed on: see (0)
number M+1   distributed on: see (1)

...

number 2*M-1 distributed on: see (M-1)
(then again repeat previous M steps)
...

and so on

Integer scaling

/* Integer scaling step */
for(i=0;i<RNGBUFSIZE/M;i++)
    for(k=0;k<M;k++)
        I_RNG_BUF[i*M+k] =
            k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-k));

Here RNGBUFSIZE is a multiple of M.

See [SRSWOR] for performance notes related to this code.

Routines Used

Task

Routine

Creates and initializes an RNG stream.

vslNewStream

Generates double-precision numbers uniformly distributed over the interval [0;1).

vdRngUniform

Deletes an RNG stream.

vslDeleteStream

Allocates memory buffers aligned on 64-byte boundaries for the results and population.

mkl_malloc

Frees memory allocated by mkl_malloc.

mkl_free

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201