K-Means initialization

Intel® oneAPI Data Analytics Library Developer Guide and Reference

ID 772611

Date 3/22/2024

The K-Means initialization algorithm receives n feature vectors as input and chooses k initial centroids. After initialization, the K-Means algorithm uses the initialization result to partition the input data into k clusters.

Operation	Computational methods	Programming Interface
Computing	Dense, Random dense, K-Means++, K-Means++ parallel	compute(…), compute_input(…), compute_result(…)

Mathematical formulation

Computing

Given the training set $X = \{x_1, \ldots, x_n\}$ of $p$-dimensional feature vectors and a positive integer $k$, the problem is to find a set $C = \{c_1, \ldots, c_k\}$ of $p$-dimensional initial centroids.

Computing method: dense

The method chooses the first k feature vectors from the training set X.
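
The selection rule above can be illustrated with a minimal standalone sketch. It does not use oneDAL; the `Matrix` alias and the `first_k_centroids` helper are hypothetical names introduced only for illustration, and the data set is assumed to be stored as one row per feature vector.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Row-major data set: each inner vector is one p-dimensional feature vector.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not a oneDAL function): returns the first k rows of x
// as the initial centroids.
Matrix first_k_centroids(const Matrix& x, std::size_t k) {
    if (k == 0 || k > x.size()) {
        throw std::invalid_argument("k must be in [1, n]");
    }
    return Matrix(x.begin(), x.begin() + static_cast<std::ptrdiff_t>(k));
}
```

Because the result depends only on row order, this rule is deterministic: the same input table always yields the same centroids.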

Computing method: random_dense

The method chooses k random feature vectors from the training set X.
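
A minimal standalone sketch of random selection follows. It does not use oneDAL; `random_k_centroids` is a hypothetical helper, and the sketch assumes that the k rows are drawn without replacement and that the caller supplies the seed. The library's actual sampling scheme and random engine may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Row-major data set: each inner vector is one p-dimensional feature vector.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not a oneDAL function): picks k distinct rows of x
// uniformly at random. Assumption: sampling is done without replacement.
Matrix random_k_centroids(const Matrix& x, std::size_t k, unsigned seed) {
    if (k == 0 || k > x.size()) {
        throw std::invalid_argument("k must be in [1, n]");
    }
    // Shuffle the row indices and keep the first k of them.
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);

    Matrix centroids;
    centroids.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        centroids.push_back(x[idx[i]]);
    }
    return centroids;
}
```

Fixing the seed makes repeated runs select the same rows, which helps when comparing clustering results.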

Computing method: plus_plus_dense (only on CPU)

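The method is designed as follows: the first centroid $c_1$ is selected randomly from $X$ and $C = \{c_1\}$. Then the following step is repeated until $C$ reaches the necessary size $k$: a new centroid is chosen from $X$ and added to $C$. In the standard K-Means++ scheme, this choice is random, with probability proportional to the squared distance from each feature vector to its nearest centroid already in $C$.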
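
The following standalone sketch shows the standard K-Means++ seeding scheme, in which each new centroid is drawn with probability proportional to its squared Euclidean distance to the nearest centroid chosen so far. It is an illustration under that assumption, not the oneDAL implementation; `squared_distance` and `plus_plus_centroids` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Row-major data set: each inner vector is one p-dimensional feature vector.
using Matrix = std::vector<std::vector<double>>;

// Squared Euclidean distance between two vectors of equal length.
double squared_distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double d = a[j] - b[j];
        sum += d * d;
    }
    return sum;
}

// Hypothetical helper (not a oneDAL function): standard K-Means++ seeding.
Matrix plus_plus_centroids(const Matrix& x, std::size_t k, unsigned seed) {
    if (k == 0 || k > x.size()) {
        throw std::invalid_argument("k must be in [1, n]");
    }
    std::mt19937 gen(seed);

    // The first centroid is selected uniformly at random from x.
    std::uniform_int_distribution<std::size_t> first(0, x.size() - 1);
    Matrix centroids;
    centroids.push_back(x[first(gen)]);

    // d2[i] holds the squared distance from x[i] to its nearest chosen centroid.
    std::vector<double> d2(x.size(), std::numeric_limits<double>::max());
    while (centroids.size() < k) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            d2[i] = std::min(d2[i], squared_distance(x[i], centroids.back()));
        }
        // All-zero weights mean every row coincides with a chosen centroid.
        if (std::accumulate(d2.begin(), d2.end(), 0.0) <= 0.0) {
            throw std::runtime_error("fewer than k distinct feature vectors");
        }
        // Draw the next centroid with probability proportional to d2[i].
        std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
        centroids.push_back(x[pick(gen)]);
    }
    return centroids;
}
```

Rows that are already centroids have weight zero, so the sketch tends to spread the initial centroids across the data instead of clustering them together.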

Computing method: parallel_plus_dense (only on CPU)

The method is the same as K-Means++, but the data is divided into equal parts and the algorithm runs on each of them.
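
The data split mentioned above can be sketched as follows. This standalone example only shows one way to divide n rows into contiguous parts of near-equal size; `RowRange` and `split_rows` are hypothetical names, and the way per-part results are combined into the final k centroids is not described in this section and is not modeled here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Half-open row range [first, second).
using RowRange = std::pair<std::size_t, std::size_t>;

// Hypothetical helper (not a oneDAL function): splits n rows into `parts`
// contiguous ranges whose sizes differ by at most one row.
std::vector<RowRange> split_rows(std::size_t n, std::size_t parts) {
    if (parts == 0) {
        throw std::invalid_argument("parts must be positive");
    }
    std::vector<RowRange> ranges;
    ranges.reserve(parts);
    const std::size_t base = n / parts;   // minimum rows per part
    const std::size_t extra = n % parts;  // the first `extra` parts get one more row
    std::size_t begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

For example, 10 rows split into 3 parts give the ranges [0, 4), [4, 7), and [7, 10).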

Programming Interface

Refer to API Reference: K-Means initialization.

Usage Example

Computing

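// Computes 10 initial centroids with the dense method and returns them as a table.
table run_compute(const table& data) {
   const auto kmeans_desc = kmeans_init::descriptor<float,
                                                   kmeans_init::method::dense>{}
      .set_cluster_count(10);

   const auto result = compute(kmeans_desc, data);

   // print_table is assumed to be a printing helper such as the one used in the oneDAL examples.
   print_table("centroids", result.get_centroids());

   return result.get_centroids();
}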

Examples

oneAPI DPC++

Batch Processing:

kmeans_init_dense.cpp

oneAPI C++

Batch Processing:

kmeans_init_dense.cpp
