Getting Started
Prerequisites
Intel® Cluster Checker must be accessible by the same path on all nodes.
A readable, writable shared directory must be available from the same path on all nodes for temporary file creation.
By default, $HOME is used as the shared directory; you can change this by setting the environment variable $CLCK_SHARED_TEMP_DIR to the desired shared directory.
For privileged users, such as root, the environment variable $CLCK_SHARED_TEMP_DIR must be explicitly set.
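For example (the path /shared/clck_tmp is only an illustration; substitute a directory that is readable and writable from the same path on all nodes):
export CLCK_SHARED_TEMP_DIR=/shared/clck_tmp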
Determine whether passwordless ssh access to all nodes is set up, e.g., test whether the command shown below returns a valid hostname without prompting for 'Password:'. If passwordless ssh to all nodes is available, go ahead with Environment Setup and Running using Slurm below.
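Here node1 stands for any node in your cluster:
ssh node1 hostname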
By default, Intel® Cluster Checker is configured to use passwordless ssh (through the pdsh command) to launch remotely on the nodes of the cluster. Note: you may need to enable passwordless access in your local ssh configuration.
Intel® Cluster Checker can communicate using MPI rather than pdsh. This feature requires the Intel® MPI Library to be set up and an edit to an XML configuration file.
Locate and copy clck.xml, found at <installdir>/clck/<version>/etc/clck.xml.
In the <collector> section, uncomment <extension>mpi.so</extension> by removing the comment markers: the <!-- on the line above and the --> on the line after.
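For example, the relevant lines in your copy of clck.xml change as sketched below (surrounding elements of the <collector> section omitted). Before:
<!--
<extension>mpi.so</extension>
-->
After:
<extension>mpi.so</extension>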
When launching clck, use the -c flag to point to your new copy of clck.xml, and it will now communicate via MPI rather than pdsh.
Note: In some scenarios you may need to set the Intel® MPI environment variable I_MPI_HYDRA_BOOTSTRAP=<arg> with the appropriate bootstrap agent. Refer to the Intel® MPI Library documentation for details on the options for this variable.
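For example, to select ssh as the bootstrap agent (ssh is one of several agents the Intel® MPI Library supports; see its documentation for the full list):
export I_MPI_HYDRA_BOOTSTRAP=ssh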
To revert to pdsh, either omit the -c flag so the default clck.xml is used, or comment out <extension>mpi.so</extension> again.
clck -c <path/to/local/copy/of/clck.xml> -F health_base -f ./nodefile
Environment Setup
Before you start using Intel® Cluster Checker, you need to establish the proper runtime environment. If you are new to Linux, this means making sure the command line is set up to find the applications you just installed. Helper scripts are provided to accomplish this. For full functionality, Intel® Cluster Checker expects the following items to be loaded in the environment correctly: Intel® Cluster Checker, Intel® MPI Library, Intel® Math Kernel Library, and Intel® Distribution for Python.
If using the Intel® oneAPI HPC Toolkit
The oneAPI toolkit includes a setvars.sh|csh script in the installation folder that detects all installed oneAPI software and adds it to your path.
Each Intel® oneAPI tool also includes an individual environment variable setup script in its env folder: oneapi/<tool>/<version>/env/vars.(c)sh
source /opt/intel/oneapi/setvars.sh
or, if you would rather source individual packages directly, there are vars.sh|vars.csh scripts in /opt/intel/oneapi/<tool>/<version>/env/. Please note that sourcing MPI last is important, because Intel® Distribution for Python also includes an mpirun, and we want to ensure the Intel® MPI Library is the one used for MPI. You can validate which MPI is in use with the command which mpirun, as shown below, looking for the path to oneapi/mpi/latest/bin/mpirun.
source /opt/intel/oneapi/mkl/latest/env/vars.sh
source /opt/intel/oneapi/intelpython/latest/vars.sh
source /opt/intel/oneapi/clck/latest/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
or, from Intel® Parallel Studio XE Cluster Edition, which includes all of the above components:
source psxevars.[sh | csh]
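Whichever setup script you use, you can verify that the Intel® MPI Library's mpirun is the one found; assuming a default /opt/intel/oneapi installation, the following command should print /opt/intel/oneapi/mpi/latest/bin/mpirun:
which mpirun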
An alternative to these scripts is using ‘modulefiles’ to set up your runtime environment.
Versioned modulefiles for all above components can be installed and loaded with Intel® oneAPI.
Additionally, the Intel® Cluster Checker modulefile is available via the module commands:
module use <install_directory>/clck/<version>/modulefiles
module load clck
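You can confirm the module loaded with the standard environment-modules listing command:
module list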
Running using an Individual Nodefile
The command line for Intel® Cluster Checker is clck. If you type clck at the Linux command line, press Enter, and it returns 'command not found', then the environment setup is not correct.
A nodefile specifies which nodes to include and, if applicable, their roles. Intel® Cluster Checker contains a set of pre-defined roles. A separate hostname appears on each line. If no role is specified for a node, that node is considered a compute node. The following example includes four compute nodes.
[user]# cat nodefile
node1
node2
node3
node4
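If a node serves a non-default role, it can be annotated in the nodefile. A sketch, assuming the '# role:' comment annotation (consult the Reference section for the set of pre-defined roles and the exact syntax):
node1 # role: head
node2
node3
node4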
A cluster with a single node would include only one hostname in the nodefile. Using localhost as a hostname is not recommended; instead, use the value returned by the hostname command on each server, and make sure those names are network resolvable.
You can then do your first run of Intel® Cluster Checker by running
clck -f <nodefile>
Running using Slurm
Regardless of whether you are using a batch script (sbatch) or allocating nodes interactively (salloc), Intel® Cluster Checker automatically uses the list of nodes allocated through Slurm, unless you override it with the individual nodefile option -f.
Do not use the srun command to start Intel® Cluster Checker. Use only the clck command (or clck-collect, clck-analyze, etc.), as parallel launching for remote data collection is already built in.
If running on the command line with a salloc Slurm resource allocation, remember to set up the environment first. You can then launch Intel® Cluster Checker by running the command:
clck
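For example, an interactive check of four nodes might look like this sketch (the node count is illustrative):
salloc --nodes=4
source /opt/intel/oneapi/setvars.sh
clck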
If running with sbatch, you can run Intel® Cluster Checker using a Slurm script that includes the environment setup above, through your choice of environment setup script(s) or module commands:
source /opt/intel/oneapi/setvars.sh
clck
or for specific components:
source /opt/intel/oneapi/intelpython/latest/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
source /opt/intel/oneapi/clck/latest/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# alternatively use psxevars.[sh | csh] or setvars.sh (Intel oneAPI), or modulefiles to set up the environment
clck
You can then run
sbatch <script_name>
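Putting it together, a minimal batch script might look like the following sketch (the job name and node count are illustrative; clck picks up the allocated node list from Slurm automatically):
#!/bin/bash
#SBATCH --job-name=clck_health
#SBATCH --nodes=4
source /opt/intel/oneapi/setvars.sh
clck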
In both of the above cases, Intel® Cluster Checker will generate a summary output, an in-depth clck_results.log, and a separate clck_execution_warnings.log file.
User-Specific Workflows
Intel® Cluster Checker uses what we call a ‘Framework Definition’ to specify what data is collected, how data is analyzed, and how that information is displayed. By default, Intel® Cluster Checker runs the ‘health_base’ Framework Definition, which provides a quick overall examination of the health of the cluster. Intel® Cluster Checker provides a wide variety of Framework Definitions. We describe here the highest level Framework Definitions for particular types of users; however, you can get a full list of available Framework Definitions by running
clck -X list
You can get further details about a Framework Definition with the -X option and the name of the specific Framework Definition, e.g.
clck -X cpu_base
clck -X select_solutions_sim_mod_user_plus_2021.0 | more
clck -X health_base | more
The rest of this page includes some of the more commonly used Framework Definitions that can be helpful depending on your role. You can also find a full list of Framework Definitions in the Reference section.
Admin:
For the privileged user, there are four different common-use Framework Definitions for cluster analysis. When first running as an administrator, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. These are preliminary checks that would work for either user or administrator. For a more comprehensive, administrator-specific run, next run
clck <options> -F health_admin
If you want to extend to further in-depth checking of the intricacies of your cluster's uniformity, you can also include the Framework Definitions ‘lshw_hardware_uniformity’, which finds discrepancies in hardware or firmware between nodes, and ‘kernel_parameter_uniformity’, which analyzes the uniformity of the kernel setup, by using
clck <options> -F health_extended_admin
If the optional ‘syscfg’ system configuration utility has been installed, you can check that the system is configured uniformly across nodes by running
clck <options> -F syscfg_settings_uniformity
You can run all of the above in a single invocation by specifying multiple Framework Definitions at once.
clck <options> -F health_extended_admin -F syscfg_settings_uniformity
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that some of the user-level Framework Definitions may not run well as root, since they include running an MPI parallel application.
Here is an overview of all the embedded tests the health_extended_admin Framework Definition contains. As you can see, health_extended_admin is a superset of health_admin, kernel_parameter_uniformity, and lshw_hardware_uniformity; these Framework Definitions may in turn have additional tests they perform:
health_extended_admin
|-- health_admin
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_shells
|   |-- cpu_admin
|   |-- dgemm_cpu_performance
|   |-- mpi_bios
|   |-- infiniband_admin
|   |-- kernel_version_uniformity
|   |-- local_disk_storage
|   |-- memory_uniformity_admin
|   |-- mpi_libfabric
|   |-- opa_admin
|   |-- perl_functionality
|   |-- privileged_user
|   |-- python_functionality
|   |-- rpm_uniformity
|   |-- services_status
|   `-- stream_memory_bandwidth_performance
|-- kernel_parameter_uniformity
`-- lshw_hardware_uniformity
Note: Administrators and privileged users must be aware that the data they collect with privileges may contain information about the servers that should be protected, such as system MSR settings. It is highly recommended that the database a privileged user creates be protected and not shared with users who should not have access to that type of information.
User:
For the non-privileged cluster user, there are three common-use Framework Definitions for cluster analysis. When first running, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. If you want more extended checking, including several lightweight performance checks (IMB, SGEMM, STREAM), you can next run
clck <options> -F health_user
To add more extensive performance checking (DGEMM, HPL) to the above, you can next run
clck <options> -F health_extended_user
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that not all tools are user-accessible, so some checks may report missing data.
Here is an overview showing how the health_extended_user Framework Definition packages many different sets of tests, including other Framework Definitions that contain further checks, such as health_user and health_base:
health_extended_user
|-- health_user
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_internode_connectivity
|   |-- basic_shells
|   |-- file_system_uniformity
|   |-- imb_pingpong_fabric_performance
|   |-- kernel_version_uniformity
|   |-- memory_uniformity_user
|   |-- mpi_local_functionality
|   |-- mpi_multinode_functionality
|   |-- perl_functionality
|   |-- python_functionality
|   |-- sgemm_cpu_performance
|   `-- stream_memory_bandwidth_performance
|-- dgemm_cpu_performance
`-- hpl_cluster_performance
Intel® MPI Library Troubleshooting
Admin:
For the privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_admin
This Framework Definition helps debug BIOS, software, environment, and hardware issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.
User:
For the non-privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_user
This Framework Definition helps debug environment and software issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.