Distributed Processing

Intel® oneAPI Data Analytics Library Developer Guide and Reference

ID 772611

Date 7/13/2023

This mode assumes that the data set is split into nblocks blocks across computation nodes.
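As a minimal illustration of this assumption, the sketch below splits a small data set into `nblocks` contiguous blocks, one per computation node. The helper name `split_into_blocks` and the near-equal block sizes are assumptions made for this example; the library does not prescribe how the data is partitioned.

```python
def split_into_blocks(rows, nblocks):
    """Split a list of observations into nblocks contiguous, near-equal blocks."""
    base, extra = divmod(len(rows), nblocks)
    blocks, start = [], 0
    for i in range(nblocks):
        # The first `extra` blocks receive one additional observation.
        size = base + (1 if i < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
blocks = split_into_blocks(data, 2)  # block i plays the role of the i-th data block
```

Each block then corresponds to the $n_{i} \times p$ table that a local node passes as `data` in Step 1.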

Algorithm Parameters

The K-Means clustering algorithm in the distributed processing mode has the following parameters:

Algorithm Parameters for K-Means Computation (Distributed Processing)
Parameter	Default Value	Description
`computeStep`	Not applicable	The parameter required to initialize the algorithm. Can be: `step1Local` - the first step, performed on local nodes; `step2Master` - the second step, performed on a master node.
`algorithmFPType`	`float`	The floating-point type that the algorithm uses for intermediate computations. Can be `float` or `double`.
`method`	`defaultDense`	Available computation methods for K-Means clustering: `defaultDense` - implementation of Lloyd’s algorithm; `lloydCSR` - implementation of Lloyd’s algorithm for CSR numeric tables.
`nClusters`	Not applicable	The number of clusters. Required to initialize the algorithm.
`gamma`	1.0	The weight to be used in distance calculation for binary categorical features.
`distanceType`	`euclidean`	The measure of closeness between points (observations) being clustered. The only distance type supported so far is the Euclidean distance.
`assignFlag`	`false`	A flag that enables computation of assignments, that is, assigning cluster indices to respective observations.

To compute K-Means clustering in the distributed processing mode, use the general schema described in Algorithms as follows:

Step 1 - on Local Nodes

K-Means Computation: Distributed Processing, Step 1 - on Local Nodes

In this step, the K-Means clustering algorithm accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Computation (Distributed Processing, Step 1)
Input ID	Input
`data`	Pointer to the $n_{i} \times p$ numeric table that represents the i-th data block on the local node. The input can be an object of any class derived from `NumericTable`.
`inputCentroids`	Pointer to the $nClusters \times p$ numeric table with the initial cluster centroids. This input can be an object of any class derived from NumericTable.

In this step, the K-Means clustering algorithm calculates the partial results and results described below. Pass the Partial Result ID or Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Partial Results for K-Means Computation (Distributed Processing, Step 1)
Partial Result ID	Result
`nObservations`	Pointer to the $nClusters \times 1$ numeric table that contains the number of observations assigned to the clusters on the local node. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `CSRNumericTable`.
`partialSums`	Pointer to the $nClusters \times p$ numeric table with partial sums of observations assigned to the clusters on the local node. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define the result as an object of any class derived from `NumericTable` except `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
`partialObjectiveFunction`	Pointer to the $1 \times 1$ numeric table that contains the value of the partial objective function for observations processed on the local node. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `CSRNumericTable`.
`partialCandidatesDistances`	Pointer to the $nClusters \times 1$ numeric table that contains the `nClusters` largest objective function values for the observations processed on the local node, stored in descending order. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
`partialCandidatesCentroids`	Pointer to the $nClusters \times 1$ numeric table that contains the `nClusters` observations processed on the local node that have the largest objective function values, stored in descending order of the objective function. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
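The first three partial results can be illustrated with a plain-Python sketch. This is not the oneDAL API: the function `step1_local` is hypothetical, lists stand in for numeric tables, and the objective function is assumed to be the sum of squared Euclidean distances from each observation to its nearest input centroid. The two candidate results are omitted.

```python
def step1_local(block, centroids):
    """Compute nObservations, partialSums and partialObjectiveFunction for one block."""
    k, p = len(centroids), len(centroids[0])
    n_observations = [0] * k
    partial_sums = [[0.0] * p for _ in range(k)]
    partial_objective = 0.0
    for x in block:
        # Squared Euclidean distance from x to every input centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        j = min(range(k), key=dists.__getitem__)  # index of the nearest centroid
        n_observations[j] += 1
        for d in range(p):
            partial_sums[j][d] += x[d]
        partial_objective += dists[j]
    return n_observations, partial_sums, partial_objective


block = [[0.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
input_centroids = [[0.0, 1.0], [5.0, 5.0]]
n_obs, sums, objective = step1_local(block, input_centroids)
```

Here the first two observations fall into cluster 0 and the third into cluster 1, so the counts are `[2, 1]` and the per-cluster sums are `[[0.0, 3.0], [5.0, 5.0]]`.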

Output for K-Means Computation (Distributed Processing, Step 1)
Result ID	Result
`assignments`	Use when `assignFlag` = `true`. Pointer to the $n_{i} \times 1$ numeric table with 32-bit integer assignments of cluster indices to feature vectors in the input data on the local node. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
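Conceptually, the `assignments` result holds, for each observation in the local block, the index of the nearest input centroid. The sketch below shows that computation in plain Python; `assign_clusters` is a hypothetical helper, not a library call, and it uses the squared Euclidean distance.

```python
def assign_clusters(block, centroids):
    """Return the index of the nearest centroid for each observation in the block."""
    assignments = []
    for x in block:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        assignments.append(min(range(len(centroids)), key=dists.__getitem__))
    return assignments


block = [[0.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
input_centroids = [[0.0, 1.0], [5.0, 5.0]]
assignments = assign_clusters(block, input_centroids)
```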

Step 2 - on Master Node

K-Means Computation: Distributed Processing, Step 2 - on Master Node

In this step, the K-Means clustering algorithm accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Computation (Distributed Processing, Step 2)
Input ID	Input
`partialResults`	A collection that contains results computed in Step 1 on local nodes.

In this step, the K-Means clustering algorithm calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Computation (Distributed Processing, Step 2)
Result ID	Result
`centroids`	Pointer to the $nClusters \times p$ numeric table with centroids. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define the result as an object of any class derived from `NumericTable` except `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
`objectiveFunction`	Pointer to the $1 \times 1$ numeric table that contains the value of the objective function. NOTE: By default, this result is an object of the `HomogenNumericTable` class, but you can define this result as an object of any class derived from `NumericTable` except `CSRNumericTable`.
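A minimal sketch of how the master node could combine the Step 1 partial results: each centroid is the sum of the per-node partial sums divided by the total number of observations in that cluster, and the objective function is the sum of the partial objective values. The function `step2_master` and the tuple layout are assumptions for illustration only. The sketch raises an error for an empty cluster instead of reproducing the library's handling of that case.

```python
def step2_master(partial_results):
    """Merge (n_observations, partial_sums, partial_objective) tuples from all nodes."""
    k = len(partial_results[0][0])
    p = len(partial_results[0][1][0])
    total_n = [0] * k
    total_sums = [[0.0] * p for _ in range(k)]
    objective = 0.0
    for n_obs, sums, partial_objective in partial_results:
        for j in range(k):
            total_n[j] += n_obs[j]
            for d in range(p):
                total_sums[j][d] += sums[j][d]
        objective += partial_objective
    if 0 in total_n:
        # Simplification: empty clusters are not handled in this sketch.
        raise ValueError("empty cluster")
    centroids = [[s / total_n[j] for s in total_sums[j]] for j in range(k)]
    return centroids, objective


node_a = ([2, 1], [[0.0, 3.0], [5.0, 5.0]], 5.0)
node_b = ([2, 1], [[2.0, 1.0], [7.0, 5.0]], 3.0)
centroids, objective = step2_master([node_a, node_b])
```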

IMPORTANT:

The algorithm computes assignments using the input centroids. Therefore, to compute assignments using the final computed centroids, after the last call to the Step2compute() method on the master node, set assignFlag to true on each local node and make one additional call to the Step1compute() and finalizeCompute() methods. To obtain assignments in each step, always set assignFlag to true and call finalizeCompute().
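The control flow described here can be sketched in plain Python as follows. All function names are hypothetical stand-ins for the local and master steps, not oneDAL calls, and keeping the previous centroid for an empty cluster is a simplification made for this example. The point is the extra Step 1 pass after the last master update, which yields assignments against the final centroids.

```python
def nearest(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return min(range(len(centroids)), key=dists.__getitem__)


def step1(block, centroids, assign_flag=False):
    """Local step: per-cluster counts and sums, plus assignments if requested."""
    k, p = len(centroids), len(centroids[0])
    counts, sums, assignments = [0] * k, [[0.0] * p for _ in range(k)], []
    for x in block:
        j = nearest(x, centroids)
        counts[j] += 1
        sums[j] = [s + v for s, v in zip(sums[j], x)]
        if assign_flag:
            assignments.append(j)
    return counts, sums, assignments


def step2(partials, previous):
    """Master step: merge partial results into new centroids."""
    new_centroids = []
    for j, old in enumerate(previous):
        n = sum(part[0][j] for part in partials)
        if n == 0:
            new_centroids.append(list(old))  # simplification for empty clusters
            continue
        new_centroids.append(
            [sum(part[1][j][d] for part in partials) / n for d in range(len(old))]
        )
    return new_centroids


def distributed_kmeans(blocks, init_centroids, n_iterations):
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iterations):
        partials = [step1(b, centroids) for b in blocks]  # on each local node
        centroids = step2(partials, centroids)            # on the master node
    # One additional local pass with the flag on, using the final centroids.
    assignments = [step1(b, centroids, assign_flag=True)[2] for b in blocks]
    return centroids, assignments


blocks = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
final_centroids, final_assignments = distributed_kmeans(
    blocks, [[0.0, 1.0], [9.0, 9.0]], n_iterations=2
)
```

Without the final pass, any assignments produced during the last iteration would refer to the centroids that were passed in to that iteration, not to the centroids it produced.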

NOTE:

To compute assignments using the original inputCentroids on a given node, you can use the K-Means clustering algorithm in the batch processing mode with the subset of the data available on that node. See Batch Processing for more details.
