Using Robust Methods
Robust methods of Summary Statistics provide two algorithms for outlier detection: Maronna [Marrona2002] and TBS [Rocke96].
The Maronna algorithm computes the mean and variance-covariance matrix that serve as the starting point for the TBS algorithm. The TBS algorithm iterates until the required accuracy is achieved or the maximal number of iterations is reached. In addition to these two parameters, you can pass to the library the maximal breakdown point (the proportion of outliers the algorithm can tolerate) and an asymptotic rejection probability (ARP) [Rocke96].
To skip the TBS iterations and compute robust estimates of the mean and variance-covariance matrix using the Maronna algorithm only, set the number of iterations to zero.
Consider the typical usage scenario below for the robust methods editor and the Compute routine. The algorithm parameters (breakdown point, ARP, accuracy, and the maximal number of TBS iterations) are passed as an array:
breakdown_point = 0.2;
arp = 0.001;
method_accuracy = 0.001;
iter_num = 5;
params[0] = breakdown_point;
params[1] = arp;
params[2] = method_accuracy;
params[3] = iter_num;
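The packing order above can be wrapped in a small helper so that the TBS and Maronna-only configurations differ in a single argument. This is a minimal sketch: fill_robust_params and ROBUST_PARAMS_N are hypothetical names introduced for illustration, not library identifiers, and the array is assumed to hold double-precision values, as in the double-precision call used in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: number of entries in the parameter array. */
#define ROBUST_PARAMS_N 4

/*
 * Hypothetical helper (not a library routine): packs the settings in the
 * order used in the text: breakdown point, ARP, accuracy, TBS iterations.
 * Passing iter_num = 0 requests the Maronna estimate only.
 */
static void fill_robust_params(double *params,
                               double breakdown_point, double arp,
                               double method_accuracy, int iter_num)
{
    assert(params != NULL);
    assert(iter_num >= 0);

    params[0] = breakdown_point;
    params[1] = arp;
    params[2] = method_accuracy;
    params[3] = (double)iter_num;  /* the count travels in the same array */
}
```

Calling fill_robust_params(params, 0.2, 0.001, 0.001, 5) reproduces the assignments above, and replacing the last argument with 0 gives the Maronna-only setup.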
Robust estimates are stored in memory as rmean and rcov. In the example below, the variance-covariance matrix is stored in the full format specified in the rcov_storage variable.
errcode = vsldSSEditRobustCovariance( task, &rcov_storage, &nparams, params, rmean, rcov );
The Compute routine computes the estimates:
errcode = vsldSSCompute( task, VSL_SS_ROBUST_COV, VSL_SS_METHOD_TBS );
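After the computation, rcov holds the estimate in full format. The sketch below shows one way to read such a buffer, assuming that full format means p*p contiguous values of a symmetric matrix, so that element (i, j) sits at index i*p + j regardless of row or column ordering. print_lower_triangle is a hypothetical helper, not a library routine. It prints only the lower triangle, which is the part reported in the example results of this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: prints the lower triangle of a symmetric p-by-p
 * matrix stored as p*p contiguous doubles (assumed full format).
 * Returns the number of entries printed, which is p*(p+1)/2.
 */
static size_t print_lower_triangle(FILE *out, const double *cov, size_t p)
{
    size_t i, j, count = 0;

    assert(out != NULL && cov != NULL);

    for (i = 0; i < p; ++i) {
        for (j = 0; j <= i; ++j) {
            /* Separate entries in a row with single spaces. */
            fprintf(out, "%s%.4f", (j == 0) ? "" : " ", cov[i * p + j]);
            ++count;
        }
        fputc('\n', out);
    }
    return count;
}
```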
Example:
Consider a task with the dimension p = 10 and the number of observations n = 10,000. The dataset is generated from a multivariate Gaussian distribution with zero mean and a variance-covariance matrix holding 1 on the main diagonal and 0.05 in other entries. The dataset is contaminated with shift outliers that have a multivariate Gaussian distribution with the same variance-covariance matrix and a vector of means with all entries equal to 5.
Using a non-robust algorithm to estimate the mean and variance-covariance matrix for this dataset results in biased estimates, and zero p-values are returned for these estimates:
Means:
0.2566, 0.2583, 0.2576, 0.2633, 0.2439, 0.2556, 0.2530, 0.2716, 0.2535, 0.2519

Variance-Covariance:
2.2540
1.2715 2.1819
1.2852 1.2462 2.2046
1.2885 1.2684 1.2553 2.2310
1.2850 1.2581 1.2571 1.2526 2.2112
1.2650 1.2284 1.2419 1.2820 1.2430 2.1929
1.2789 1.2435 1.2550 1.2555 1.2574 1.2478 2.2113
1.2773 1.2692 1.2676 1.2751 1.2725 1.2733 1.2739 2.2448
1.2813 1.2579 1.2688 1.2723 1.2670 1.2713 1.2839 1.3061 2.2246
1.2696 1.2631 1.2515 1.2701 1.2597 1.2686 1.2554 1.2638 1.2780 2.1893
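The size of this bias follows from the moments of a two-component mixture. If a fraction eps of the observations is shifted by the same amount in every component, the population mean of each component becomes eps*shift, and every entry of the variance-covariance matrix grows by eps*(1 - eps)*shift^2. The sketch below evaluates these expressions. The fraction eps is an assumption, since the text does not state it, and contaminated_moments is a hypothetical helper introduced for illustration.

```c
#include <assert.h>

/* Population moments of one component (and one pair) of the mixture. */
typedef struct {
    double mean;        /* mean of each component */
    double variance;    /* diagonal entry of the matrix */
    double covariance;  /* off-diagonal entry of the matrix */
} mixture_moments;

/*
 * Hypothetical helper: moments of (1 - eps)*N(0, S) + eps*N(shift*1, S),
 * where S has 1 on the diagonal and rho elsewhere.
 */
static mixture_moments contaminated_moments(double rho, double eps,
                                            double shift)
{
    mixture_moments m;
    double inflation;

    assert(eps >= 0.0 && eps <= 1.0);

    /* Between-group spread added to every entry of the matrix. */
    inflation = eps * (1.0 - eps) * shift * shift;

    m.mean = eps * shift;
    m.variance = 1.0 + inflation;
    m.covariance = rho + inflation;
    return m;
}
```

With rho = 0.05, shift = 5, and an assumed eps = 0.05, this gives a mean of 0.25, a variance of 2.1875, and a covariance of 1.2375, which is close to the non-robust estimates above (means near 0.25, diagonal near 2.2, off-diagonal near 1.26). A contamination level of roughly 5% is therefore consistent with those numbers, although it remains an inference.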
Use of the Maronna algorithm (that is, iter_num = 0) results in the following estimates:
Means:
-0.0022, 0.0081, -0.0075, 0.0049, -0.0054, 0.0012, -0.0087, 0.0194, -0.0073, 0.0022

p-values for means:
0.1792 0.6077 0.5640 0.3869 0.4281 0.1014 0.6375 0.9570 0.5602 0.1846

Variance-Covariance:
0.9164
0.0605 0.8945
0.0617 0.0374 0.9269
0.0602 0.0570 0.0472 0.9294
0.0584 0.0469 0.0599 0.0443 0.9183
0.0552 0.0394 0.0395 0.0655 0.0484 0.9049
0.0487 0.0449 0.0471 0.0451 0.0564 0.0461 0.9186
0.0293 0.0555 0.0539 0.0456 0.0450 0.0574 0.0501 0.9149
0.0507 0.0339 0.0433 0.0504 0.0429 0.0603 0.0597 0.0696 0.8962
0.0375 0.0573 0.0470 0.0472 0.0502 0.0607 0.0420 0.0381 0.0484 0.8848

p-values for variance-covariance:
0.0000
0.2989 0.0000
0.2966 0.5842 0.0000
0.3471 0.4395 0.9592 0.0000
0.3994 0.9148 0.3590 0.8993 0.0000
0.5128 0.7023 0.6708 0.1869 0.8510 0.0000
0.8508 0.9752 0.9515 0.9411 0.4812 0.9714 0.0000
0.2669 0.4841 0.6001 0.9729 0.9530 0.4207 0.7751 0.0000
0.7151 0.4529 0.8765 0.7468 0.8689 0.2968 0.3317 0.0984 0.0000
0.6082 0.3734 0.9088 0.8997 0.7250 0.2720 0.8321 0.6358 0.7895 0.0000
These estimates are much better. However, the entries on the main diagonal of the matrix still get zero p-values. To improve the estimate, run five iterations of the TBS algorithm. Quick experiments show that a further increase in the number of iterations does not change the estimates significantly:
Means:
-0.0018, 0.0034, 0.0026, 0.0067, -0.0108, 0.0012, -0.0024, 0.0122, -0.0057, -0.0044

p-values for means:
0.1412 0.2612 0.2025 0.4860 0.7098 0.0943 0.1882 0.7693 0.4263 0.3381

Variance-Covariance:
1.0524
0.0583 1.0172
0.0757 0.0426 1.0403
0.0653 0.0630 0.0490 1.0538
0.0672 0.0604 0.0559 0.0462 1.0367
0.0493 0.0295 0.0434 0.0784 0.0442 1.0261
0.0620 0.0429 0.0509 0.0453 0.0491 0.0488 1.0397
0.0410 0.0503 0.0476 0.0507 0.0497 0.0514 0.0497 1.0367
0.0450 0.0370 0.0486 0.0464 0.0430 0.0526 0.0622 0.0719 1.0179
0.0477 0.0587 0.0461 0.0562 0.0514 0.0645 0.0443 0.0346 0.0485 1.0070

p-values for variance-covariance:
0.0002
0.6951 0.2249
0.1676 0.5972 0.0044
0.4613 0.5057 0.8450 0.0001
0.3761 0.5862 0.8152 0.7231 0.0095
0.8726 0.1942 0.6233 0.1170 0.6604 0.0646
0.5690 0.6118 0.9464 0.6795 0.8671 0.8653 0.0050
0.5092 0.9507 0.7992 0.9266 0.9002 0.9932 0.8944 0.0094
0.6867 0.4013 0.8656 0.7504 0.6147 0.9305 0.5185 0.2177 0.2065
0.8205 0.6243 0.7594 0.7800 0.9869 0.4071 0.6776 0.3207 0.8961 0.6185
For more details on robust methods, see the Robust Estimation of Variance-Covariance Matrix chapter of this document and the Summary Statistics section of [MKLMan].