Computing Estimations for Large Datasets
Summary Statistics algorithms can compute estimates for large datasets, including datasets in blocks. Different Summary Statistics computation methods may require additional memory for correct processing of the blocks. The VSL_SS_METHOD_FAST method uses raw moments of different orders:
Computation Type | Raw Moment Orders |
---|---|
The central moment of order i>1 | 1,...,i |
The central sum of order i>1 | 1,...,i |
Kurtosis coefficient (a function of central moments of the second and fourth order) | 1,2,3,4 |
Skewness coefficient | 1,2,3 |
Variation coefficient | 1,2 |
Variance-covariance/correlation matrix | The mean (the first raw moment) |
Cross-product matrix | The mean (the first raw moment) |
To support data arrays in blocks and compute their statistical estimates, the library requires buffers to store intermediate results. For this purpose, you need to allocate arrays and make them available to the library via task editors. These arrays store the value of the requested parameter and intermediate estimates of raw moments. The number of buffers necessary for intermediate results corresponds to the maximal order of raw moments required to calculate an estimate. The size of each buffer should store at least p elements, where p is the dimension of the task.
For example, to compute the skewness coefficient, you should allocate a buffer for the requested estimate and three one-dimensional arrays of size p to store the values of raw moments up to the third order. These intermediate results are required to process the next data portion and get the skewness estimate for the whole dataset.
Before the first call to the Compute routine, you should allocate and initialize necessary arrays and pass pointers into the library using one of the available editors. In most cases, elements of the arrays are initialized to zero. If you already have the estimates for the previous data portion, you can use these estimates to initialize the elements and continue the computation.
If there is no available memory for storing raw moments, the computation terminates with a corresponding error code.
The VSL_SS_METHOD_1PASS method relies on the mean estimate. Before using the method for processing data in blocks, you should allocate a buffer for mean and make it available to the library via task editors.
The VSL_SS_METHOD_FAST_USER_MEAN method relies on the mean estimate that you provide to the library before computation. You do not need additional buffers for this method.
You can obtain a correlation matrix from a variance-covariance matrix using a proper standardization. This scaling does not impact the main diagonal of the correlation matrix, that is, the matrix stores the variances of the random vector components. For details, see the Mathematical Notation and Definitions chapter in the Summary Statistics section of [MKLMan].When you compute matrices for the next block in the dataset, the library applies the following steps:
Converts the variance-covariance and/or correlation matrix into a cross-product matrix
Updates the cross-product matrix using the requested computation method
Updates a variance-covariance and/or correlation matrix from the cross-product matrix computed on step two.
If you compute a variance-covariance and/or correlation matrix, the library applies all the three steps. If you compute cross-product, variance-covariance and/or correlation matrices at the same time, the library applies only step two and step three. The library applies the same computation rule to raw/central sums and statistical moments up to the fourth order.
For the best results, apply the following steps, before you compute the matrices:
Computation | Preparation |
---|---|
Variance-covariance/correlation matrix | Provide the buffer for the cross-product matrix of the full storage format |
Statistical moments | Allocate and register buffers for statistical sums of the corresponding order |
With oneMKL Summary Statistics algorithms, you can compute variance-covariance and/or correlation matrices for the dataset available as n blocks using the following technique:
For blocks 1,2,…, n of the dataset, call Summary Statistics algorithm for computation of a cross-product matrix using one of supported methods
Convert the final cross-product matrix into a variance-covariance and/or correlation matrix by applying VSL_SS_METHOD_CP_TO_COVCOR
The example below demonstrates these steps:
#include "mkl_vsl.h" #define NBLOCKS 10 /* number of blocks in the dataset */ #define DIM 3 /* dimension of the task */ #define N 1000 /* number of observations in each block */ int main() { int i; VSLSSTaskPtr task; double x[DIM][N]; /* matrix of data block */ double cp[DIM*DIM], cor[DIM*DIM], mean[DIM]; double w[2]; MKL_INT p, n, xstorage, corstorage, cpstorage; int status; /* Parameters of the task and initialization */ p = DIM; n = N; xstorage = VSL_SS_MATRIX_STORAGE_ROWS; corstorage = VSL_SS_MATRIX_STORAGE_FULL; cp = VSL_SS_MATRIX_STORAGE_FULL; w[0] = 0.0; /* sum of weights */ w[1] = 0.0; /* sum of squares of weights */ for ( i = 0; i < p ; i++ ) mean[i] = 0.0; for ( i = 0; i < p*p; i++ ) cp[i] = 0.0; /* Create a task */ status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 ); /* Initialize the task parameters */ status = vsldSSEditCP ( task, mean, NULL, cp, &cpstorage ); status = vsldSSEditTask( task, VSL_SS_ED_COR, cor ); status = vsldSSEditTask( task, VSL_SS_ED_COR_STORAGE, corstorage ); status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, w ); /* Compute a cross-product matrix for NBLOCKS-1 */ for( i = 0; i < NBLOCKS; i++ ) { /* Get i-th data block to array x */ status = GetBlock( i, x, p, n ); /* Update cross-product matrix using latest block */ status = vsldSSCompute( task, VSL_SS_CP, VSL_SS_METHOD_1PASS ); } /* Convert cross-product matrix into correlation matrix */ status = vsldSSCompute( task, VSL_SS_COR, VSL_SS_METHOD_CP_TO_COVCOR); /* Deallocate the task resources */ status = vslSSDeleteTask( &task ); return 0; }
When converting the cross-product matrix computed for a given data array into a variance-covariance/correlation matrix, the library applies the following rule:
If the array of accumulated weights is available, the library computes the standardization coefficient using values in this array
Otherwise, the library computes the standardization coefficient using the number of observations passed as the input parameter to the Summary Statistics task constructor.
If you need to convert a computed cross-product matrix into a variance-covariance and/or correlation matrix, use the task editors to provide pointers to the matrices, their storage formats, and the array of accumulated weights to the library. Afterwards, call the SSCompute routine function to convert a cross-product matrix into a variance-covariance/correlation matrix as shown below:
status = vsldSSEditTask( task, VSL_SS_ED_COR, cor ); status = vsldSSEditTask( task, VSL_SS_ED_COR_STORAGE, corstorage ); status = vsldSSEditTask( task, VSL_SS_ED_CP, cp ); status = vsldSSEditTask( task, VSL_SS_ED_CP_STORAGE, cpstorage ); status = vsldSSCompute( task, VSL_SS_COR, VSL_SS_METHOD_CP_TO_COVCOR );
You can convert a cross-product matrix into both variance-covariance and correlation matrices by specifying two estimates, the conversion method and one call to the Compute function:
status = vsldSSCompute( task, VSL_SS_COV|VSL_SS_COR, VSL_SS_METHOD_CP_TO_COVCOR );
If you provide a combination of the conversion method and another supported method (for example, fast) to compute a variance-covariance/correlation matrix, the library discards the conversion method and applies another (for example, fast) method to produce the estimate.
You can apply similar approaches to compute raw/central statistical moments from raw/central sums of the corresponding order:
status = vsldSSCompute( task, VSL_SS_3R_MOM| VSL_SS_4C_MOM, VSL_SS_METHOD_SUM_TO_MOM );
By applying the conversion method VSL_SS_METHOD_SUM_TO_MOM, you can also get estimates of kurtosis/skewness/variation coefficient, given available buffers for the moments described in the table above.