Analysis
Given a populated database (see the Data Collection chapter), Intel® Cluster Checker analyzes the data to identify issues, diagnose problems, and in some cases, provide recommendations on how to repair the cluster. Invoke the clck-analyze program to perform analysis or clck to perform both collection and analysis. The analysis evaluates the collected data using an embedded expert system.
Running Analysis
There are two ways to analyze data using Intel® Cluster Checker. The clck command first collects data and then analyzes it, while clck-analyze analyzes data in an existing database without collecting new data. If no command line options are given, Intel® Cluster Checker analyzes all nodes in the database. To analyze a subset of nodes or to assign node roles, provide a nodefile using the -f command line option. For more information about writing a nodefile, see the Selecting Nodes section in the Data Collection chapter. For details about the available command line options, see Configuring Intel® Cluster Checker. A typical use of the analysis command is:
clck-analyze -f nodefile
The output to the screen will provide a brief summary of any issues found. Further details will be written to the log file.
Each issue has a category, a message id, a severity, and a list of relevant nodes. It may also have a database row id and a remedy. The message id is a unique identifier for the issue type. All issues have a primary id; some issues also have an optional sub-id appended after a colon, in the form id:sub-id. The message id can be used to suppress the issue (see the Suppressions section).
A list of nodes will be displayed with the issue, indicating the nodes in the system to which the issue applies. Node names displayed in parentheses indicate that the issue applies to a pair of nodes, such as MPI latency between a pair of nodes.
The database row id is a list of database entries containing the raw data that led to the issue. Database row ids are only included when debug output is enabled (see Configuring Intel® Cluster Checker).
Some issues recommend a suggested remedy to resolve the issue. Some remedies may require privileged cluster access.
Issues fall into one of two categories:
Diagnoses
Diagnoses describe the root cause of an issue. For example, MPI performance is substandard because some network setting is misconfigured. A diagnosis is typically reached by combining one or more observations. In this example, the diagnosis combines an observation of substandard MPI performance with an observation of a misconfigured network setting.
Observations
An observation is an objective fact about the cluster based on collected data. For example, a cluster’s memory may not be uniform.
Each reported issue should be investigated and either resolved or suppressed (see the Suppressions section). Once the issue is resolved, fresh data should be collected and the analysis repeated. When no issues are reported, the cluster has been successfully verified with Intel® Cluster Checker.
Selecting Nodes
By default, clck-analyze analyzes all nodes in the database, while clck uses either Slurm to auto-detect nodes or a nodefile. If a nodefile is supplied, the nodes listed in the nodefile are used instead of all available nodes in the database. Optional nodefile annotations can also be specified and may alter the analysis output (see the Selecting Nodes section in the “Data Collection” chapter for more details). For example, some rules may apply only to compute nodes and ignore non-compute nodes.
Framework Definition (FWD) Selection
Framework Definitions, further detailed in the Framework Definitions chapter, can be used to select which group of providers will run during data collection and which analyzer extensions and knowledge base modules will run during analysis.
Framework Definitions can be specified through the command line by using the -F / --framework-definition command line option.
-F FWD / --framework-definition FWD
For instance, the following command would run myFramework.xml:
clck -F /path/to/myFramework.xml
Custom FWDs can also be specified in the configuration file /opt/intel/oneapi/clck/latest/etc/clck.xml. The following example shows how to declare the use of two custom definitions:
<configuration>
  <analyzer>
    <framework_definitions>
      <framework_definition>/path/to/CustomFWD1.xml</framework_definition>
      <framework_definition>/path/to/CustomFWD2.xml</framework_definition>
    </framework_definitions>
    ...
    ...
  </analyzer>
  ...
</configuration>
For more information about Framework Definitions, see the Framework Definitions section in the Reference.
Suppressions
In some cases, while the issue may be correct, the behavior is actually intended and should not be flagged. Such issues can be suppressed by adding an entry to the configuration file.
The base suppression format is:
<configuration>
  <analyzer>
    ...
    <suppressions>
      <suppress>
        <id>string</id>
        <node_id>hostname</node_id>
        <severity>num</severity>
      </suppress>
      ...
    </suppressions>
    ...
  </analyzer>
  ...
</configuration>
Multiple suppressions may be specified.
<id>string</id>
Suppress all issues matching the specified message id string. The default is empty, meaning suppress all message ids that match the other tags. If the message id includes a sub-id and only the primary id is used, then all messages with the same primary id will be suppressed regardless of the sub-id.
<node_id>hostname</node_id>
Suppress all issues corresponding to the specified node. The default is empty, meaning suppress all nodes that match the other tags.
If a tag is omitted, then the default value is used. There is implicit AND logic among tags within a suppression.
The following example will suppress all issues from node4, any issues with message id example-id and with a confidence level of less than 50% on any node, as well as any issues with message id network:eth1 or network:eth2 but not other sub-id values.
<configuration>
  <analyzer>
    ...
    <suppressions>
      <suppress>
        <node_id>node4</node_id>
      </suppress>
      <suppress>
        <confidence>50</confidence>
        <id>example-id</id>
      </suppress>
      <suppress>
        <id>network:eth1</id>
      </suppress>
      <suppress>
        <id>network:eth2</id>
      </suppress>
    </suppressions>
    ...
  </analyzer>
  ...
</configuration>
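The matching rules described above (implicit AND among tags within one suppression, and a primary id covering all of its sub-ids) can be illustrated with a minimal Python sketch. This is not the tool's implementation: the Suppression class and the function names are hypothetical, only the id and node_id tags are modeled, and the numeric tags are left out because their exact comparison semantics are not described in this section.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Suppression:
    """Hypothetical model of one <suppress> entry; None means the tag is omitted."""
    msg_id: Optional[str] = None
    node_id: Optional[str] = None


def id_matches(rule_id: Optional[str], issue_id: str) -> bool:
    """Return True when a suppression id covers an issue's message id."""
    if rule_id is None:
        return True  # omitted tag: matches every message id
    if ":" in rule_id:
        # A rule that names a sub-id matches only that exact id:sub-id pair.
        return rule_id == issue_id
    # A primary-only rule matches the primary id regardless of the sub-id.
    return issue_id.split(":", 1)[0] == rule_id


def is_suppressed(issue_id: str, node: str, rules: list[Suppression]) -> bool:
    """Tags inside one rule are ANDed; separate rules are alternatives."""
    return any(
        id_matches(r.msg_id, issue_id) and (r.node_id is None or r.node_id == node)
        for r in rules
    )


rules = [
    Suppression(node_id="node4"),
    Suppression(msg_id="network:eth1"),
    Suppression(msg_id="network:eth2"),
]
print(is_suppressed("network:eth1", "node1", rules))  # matches the eth1 rule
print(is_suppressed("network:eth3", "node1", rules))  # other sub-ids are not covered
```

Under these assumptions, every issue on node4 is suppressed, while on other nodes only the two listed network sub-ids are.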
Configuration Options
Intel® Cluster Checker contains both command line options and a configuration file to allow for configuration of the tool. The chapter Configuring Intel® Cluster Checker contains a complete list of command line options and an explanation of the config file.
The config file is in an XML format, and a variety of XML tags are available to configure the behavior of Intel® Cluster Checker. Below is a list of configuration tags that affect analysis.
cluster-mode-uniformity-threshold
Specify the threshold ratio for checking the uniformity of cluster mode entries across the cluster.
XML syntax:
<config> <cluster-mode-uniformity-threshold>NUMBER </cluster-mode-uniformity-threshold> </config>
If the percentage of nodes that share the same cluster mode entry value is above the value specified for the cluster-mode-uniformity-threshold tag, then that value is considered uniform in that cluster. If the percentage of nodes that share the same cluster mode entry value is below the uniformity threshold, then a sign is generated.
data-age-threshold
Specify the maximum age of data points, in seconds, before a data point is considered too old for relevant analysis.
XML syntax:
<config> <data-age-threshold>NUMBER </data-age-threshold> </config>
The value should be an integer value greater than 0. The default value is 604800 seconds (1 week).
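As a rough illustration of how such an age limit behaves, the following Python sketch flags a data point as stale once its age exceeds the threshold. The function name and the use of Unix timestamps are assumptions made for the example; whether the tool treats an age exactly equal to the threshold as too old is not stated here.

```python
import time
from typing import Optional

# Documented default: 604800 seconds (1 week).
DEFAULT_DATA_AGE_THRESHOLD = 604800


def is_too_old(timestamp: float,
               now: Optional[float] = None,
               threshold: int = DEFAULT_DATA_AGE_THRESHOLD) -> bool:
    """Return True when a data point is older than the threshold (in seconds)."""
    if threshold <= 0:
        raise ValueError("threshold must be an integer greater than 0")
    if now is None:
        now = time.time()
    # Assumption: strictly older than the threshold counts as too old.
    return (now - timestamp) > threshold


now = 1_700_000_000
print(is_too_old(now - 3600, now=now))            # one hour old
print(is_too_old(now - 8 * 24 * 3600, now=now))   # eight days old
```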
data-source-time-difference
Specify the maximum time difference allowed between timestamps for two data sources that contribute to the same analysis sign.
XML syntax:
<config> <data-source-time-difference>NUMBER </data-source-time-difference> </config>
Currently this is only enabled for the dgemm sign substandard-dgemm-due-to-offline-cores.
The value should be an integer value greater than 0. The default value is 900 seconds (15 minutes) for dgemm.
dgemm-number-of-mad
Specify the number of median absolute deviations (MADs) allowed before a dgemm value is considered an outlier.
XML syntax:
<config> <dgemm-number-of-mad>NUMBER </dgemm-number-of-mad> </config>
The value should be an integer value greater than 0.
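The role of a "number of MADs" setting can be sketched in Python as follows. This is a generic median-absolute-deviation outlier test, not necessarily the exact computation the analyzer performs; the function names are hypothetical.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: the median of |value - median(values)|."""
    values = list(values)
    m = median(values)
    return median([abs(v - m) for v in values])


def mad_outliers(values, number_of_mad):
    """Return the values lying more than number_of_mad MADs from the median."""
    values = list(values)
    m = median(values)
    limit = number_of_mad * mad(values)
    return [v for v in values if abs(v - m) > limit]


# Hypothetical per-node benchmark results; one node is clearly slower.
results = [100, 101, 99, 100, 102, 98, 60]
print(mad_outliers(results, number_of_mad=3))
```

A larger number of MADs widens the accepted band around the median, so fewer values are reported as outliers.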
dgemm-peak-fraction
Specify the minimum value of the ratio between the measured dgemm performance and theoretical peak performance value.
XML syntax:
<config> <dgemm-peak-fraction>NUMBER </dgemm-peak-fraction> </config>
Any value below this will generate a sign.
The value should be a floating point value between 0 and 1.
environment-denylist
Specify the environment variable patterns that will be ignored for uniformity comparison across the cluster.
XML syntax:
<config> <environment-denylist> <entry>PATTERN</entry> <entry>PATTERN</entry> </environment-denylist> </config>
The value within each entry tag is interpreted as a POSIX matching regular expression. If this value is not a valid POSIX regular expression, then no filtering will be done.
The entry tag can be repeated multiple times.
Note that meta characters such as ^ [ . * $ { ( ) + | ? < > must be escaped in order to be matched literally.
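The effect of a denylist can be approximated with the Python sketch below. Several assumptions apply: Python's re module is used as a stand-in for POSIX regular expressions (the two dialects differ in places), a pattern is taken to match anywhere in the name, and an invalid entry is simply skipped so that it filters nothing. The function name is hypothetical.

```python
import re


def filter_denylisted(names, patterns):
    """Drop names matching any valid pattern; invalid patterns filter nothing."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            # Assumption: an invalid entry is ignored rather than aborting.
            continue
    return [n for n in names if not any(c.search(n) for c in compiled)]


env_names = ["PATH", "SSH_CLIENT", "SSH_TTY", "HOME"]
# "^SSH_" is a valid pattern; "[" is invalid and therefore has no effect.
print(filter_denylisted(env_names, ["^SSH_", "["]))

# Escaping the dot makes it match a literal "." instead of any character.
print(filter_denylisted(["A.B", "AxB"], [r"^A\.B$"]))
```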
hpl-number-of-mad
Specify the number of median absolute deviations (MADs) allowed before an HPL value is considered an outlier.
XML syntax:
<config> <hpl-number-of-mad>NUMBER </hpl-number-of-mad> </config>
The value should be an integer value greater than 0.
imb-pingpong-number-of-mad
Specify the number of median absolute deviations (MADs) allowed before a PingPong latency or bandwidth value is considered an outlier.
XML syntax:
<config> <imb-pingpong-number-of-mad>NUMBER </imb-pingpong-number-of-mad> </config>
The value should be an integer value greater than 0.
iozone-number-of-mad
Specify the number of median absolute deviations (MADs) allowed before an iozone value is considered an outlier.
XML syntax:
<config> <iozone-number-of-mad>NUMBER </iozone-number-of-mad> </config>
The value should be an integer value greater than 0.
kernel-denylist
Specify the kernel parameter patterns that will be ignored for uniformity comparisons across the cluster.
XML syntax:
<config> <kernel-denylist> <entry>PATTERN</entry> <entry>PATTERN</entry> </kernel-denylist> </config>
The value within each entry tag is interpreted as a POSIX matching regular expression. If this value is not a valid POSIX regular expression, then no filtering will be done.
The entry tag can be repeated multiple times.
Note that meta characters such as ^ [ . * $ { ( ) + | ? < > must be escaped in order to be matched literally.
kernel-param-uniformity-threshold
Specify the threshold ratio for checking the uniformity of kernel parameters across the cluster, that is, sysctl entries.
XML syntax:
<config> <kernel-param-uniformity-threshold>NUMBER </kernel-param-uniformity-threshold> </config>
If the percentage of nodes that share the same kernel parameter entry value is above the value specified for the kernel-param-uniformity-threshold tag, then that value is considered uniform in that cluster. If the percentage of nodes that share the same kernel parameter entry value is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1.
logical-cores-uniformity-threshold
Specify the threshold ratio for checking the uniformity of logical cores across the cluster.
XML syntax:
<config> <logical-cores-uniformity-threshold>NUMBER </logical-cores-uniformity-threshold> </config>
If the percentage of nodes that share the same setting is above the value specified for the logical-cores-uniformity-threshold tag, then it is considered uniform on the cluster. If the percentage of nodes that share the same number of logical cores is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1. The default value is 0.9.
lshw-denylist
Specify the lshw output patterns that will be ignored for uniformity comparison across the cluster.
XML syntax:
<config> <lshw-denylist> <entry>PATTERN</entry> <entry>PATTERN</entry> </lshw-denylist> </config>
The value within each entry tag is interpreted as a POSIX matching regular expression. If this value is not a valid POSIX regular expression, then no filtering will be done.
The entry tag can be repeated multiple times.
Note that meta characters such as ^ [ . * $ { ( ) + | ? < > must be escaped in order to be matched literally.
lshw-uniformity-threshold
Specify the threshold ratio for checking the uniformity of lshw entries across the cluster.
XML syntax:
<config> <lshw-uniformity-threshold>NUMBER </lshw-uniformity-threshold> </config>
If the percentage of nodes that share the same lshw entry value is above the value specified for the lshw-uniformity-threshold tag, then that value is considered uniform in that cluster. If the percentage of nodes that share the same lshw entry value is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1.
memory-mode-uniformity-threshold
Specify the threshold ratio for checking the uniformity of memory mode entries across the cluster.
XML syntax:
<config> <memory-mode-uniformity-threshold>NUMBER </memory-mode-uniformity-threshold> </config>
If the percentage of nodes that share the same memory mode entry value is above the value specified for the memory-mode-uniformity-threshold tag, then that value is considered uniform in that cluster. If the percentage of nodes that share the same memory mode entry value is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1.
memory-uniformity-threshold
Specify the maximum allowable deviation, in bytes, from the median memory size before a memory size is considered non-uniform.
XML syntax:
<config> <memory-uniformity-threshold>NUMBER </memory-uniformity-threshold> </config>
Any value greater than 0 can be used for this tag. The default value is 268435456 bytes (256 MB).
ntp-offset-threshold
Specify the maximum offset value an NTP peer can have before a sign is generated.
XML syntax:
<config> <ntp-offset-threshold>NUMBER </ntp-offset-threshold> </config>
Any floating point value can be used for this tag.
outlier-max-median-mad-dist
Specify the maximum distance, in orders of magnitude, between the median and median absolute deviation (MAD) for the MAD outlier algorithm to be used.
XML syntax:
<config> <outlier-max-median-mad-dist>NUMBER </outlier-max-median-mad-dist> </config>
If the allowable distance is exceeded, then the MAD outlier algorithm is disabled and a fallback algorithm (controlled by the outlier-median-pct tag) is used for outlier rules.
The following describes the test controlled by the outlier-max-median-mad-dist tag:
if ( |median - MAD| < 10^outlier-max-median-mad-dist ) then <use MAD outlier algorithm> else <use fallback outlier algorithm>
Any value greater than 0 can be used for this tag. The default value is 2.5.
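The selection test above translates directly into Python. The function name is hypothetical; the comparison itself follows the pseudocode given for the outlier-max-median-mad-dist tag.

```python
def use_mad_algorithm(median_value: float, mad_value: float,
                      max_dist: float = 2.5) -> bool:
    """Return True when |median - MAD| < 10 ** max_dist, so the MAD algorithm applies."""
    if max_dist <= 0:
        raise ValueError("outlier-max-median-mad-dist must be greater than 0")
    return abs(median_value - mad_value) < 10 ** max_dist


# With the default of 2.5, the limit is 10 ** 2.5, roughly 316.2.
print(use_mad_algorithm(100.0, 2.0))    # distance 98: MAD algorithm is used
print(use_mad_algorithm(5000.0, 3.0))   # distance 4997: fallback algorithm is used
```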
outlier-median-pct
Percentage of the median used to calculate outliers by the fallback algorithm.
XML syntax:
<config> <outlier-median-pct>NUMBER </outlier-median-pct> </config>
The outlier-median-pct determines the distance from the median that a sample value is allowed to be before it is considered an outlier in the fallback outlier algorithm. The outlier-median-pct value is divided by 100 and multiplied by the median to get an allowable distance. If the sample value is further away from the median than the allowable distance, the sample value is considered an outlier.
The following describes the fallback outlier algorithm controlled by the outlier-median-pct tag:
if ( |median - sample_value| > (median * (outlier-median-pct / 100)) ) then <the sample_value is an outlier> else <the sample_value is not an outlier>
Any value between 0 and 100 can be used. The default value is 5.
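The fallback test can be written as a short Python function. The function name is hypothetical; the arithmetic follows the description of the outlier-median-pct tag.

```python
def is_fallback_outlier(sample_value: float, median_value: float,
                        outlier_median_pct: float = 5.0) -> bool:
    """Return True when the sample is further from the median than the allowed distance."""
    if not 0 <= outlier_median_pct <= 100:
        raise ValueError("outlier-median-pct must be between 0 and 100")
    allowed_distance = median_value * (outlier_median_pct / 100)
    return abs(median_value - sample_value) > allowed_distance


# With a median of 200 and the default of 5, the allowed distance is 10.
print(is_fallback_outlier(189.0, 200.0))  # distance 11: outlier
print(is_fallback_outlier(191.0, 200.0))  # distance 9: not an outlier
```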
preferred-cluster-mode
Specify the preferred cluster mode for Intel® Xeon Phi™ processor.
XML syntax:
<config> <preferred-cluster-mode>MODE </preferred-cluster-mode> </config>
Valid values for MODE are All2All, SNC2, SNC4, Hemisphere and Quadrant.
preferred-memory-mode
Specify the preferred memory mode for Intel® Xeon Phi™ processor.
XML syntax:
<config> <preferred-memory-mode>MODE </preferred-memory-mode> </config>
Valid values for MODE are Flat, Cache, Hybrid25 and Hybrid50.
preferred-tickless-cores
Specify the list of cores for the nohz_full kernel parameter for the Intel® Xeon Phi™ processor.
XML syntax:
<config> <preferred-tickless-cores>core list </preferred-tickless-cores> </config>
128-255, 1,2,7-9, 1,6,9 are examples of valid values.
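To make the core list format concrete, the sketch below expands such a list into individual core numbers. It is an illustrative parser written for this guide, with a hypothetical function name; it is not part of Intel® Cluster Checker.

```python
def parse_core_list(text: str) -> list:
    """Expand a core list such as "1,2,7-9" into a sorted list of core numbers."""
    cores = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # A range "a-b" includes both end points.
            first, last = part.split("-", 1)
            cores.update(range(int(first), int(last) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


print(parse_core_list("1,2,7-9"))
print(len(parse_core_list("128-255")))
```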
preferred-turbo-status
Specify the preferred Intel® Turbo Boost Technology status for the processor.
XML syntax:
<config> <preferred-turbo-status>STATUS </preferred-turbo-status> </config>
The valid values are enabled and disabled.
rpm-uniformity-threshold
Specify the threshold ratio for checking whether each rpm file installed on a node is uniform across the cluster.
XML syntax:
<config> <rpm-uniformity-threshold>NUMBER </rpm-uniformity-threshold> </config>
If the percentage of nodes that share the same rpm file is above the value specified for the rpm-uniformity-threshold tag, then that rpm file is considered uniform on the cluster. If the percentage of nodes that share the same rpm file is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1.
storage-max-used-pct
Specify the maximum percentage of space that can be used on a disk partition.
XML syntax:
<config> <storage-max-used-pct>NUMBER </storage-max-used-pct> </config>
If the percentage is exceeded on a disk partition, then a sign is emitted.
Any value between 0 and 100 can be used for this tag. The default value is 85.
stream-number-of-mad
Specify the number of median absolute deviations (MADs) allowed before a stream value is considered an outlier.
XML syntax:
<config> <stream-number-of-mad>NUMBER </stream-number-of-mad> </config>
The value should be an integer value greater than 0.
threads-per-core-uniformity-threshold
Specify the threshold ratio for checking the uniformity of threads available per core across the cluster.
XML syntax:
<config> <threads-per-core-uniformity-threshold>NUMBER </threads-per-core-uniformity-threshold> </config>
If the percentage of nodes that share the same setting is above the value specified for the threads-per-core-uniformity-threshold tag, then it is considered uniform on the cluster. If the percentage of nodes that share the same threads available per core is below the uniformity threshold, then a sign is generated.
The value should be a floating point value between 0 and 1. The default value is 0.9.
turbo-status-uniformity-threshold
Specify the threshold ratio for checking the uniformity of Intel® Turbo Boost Technology status (enabled or disabled) on a set of nodes within the cluster.
XML syntax:
<config> <turbo-status-uniformity-threshold>NUMBER </turbo-status-uniformity-threshold> </config>
If the percentage of nodes that share the same Intel® Turbo Boost Technology status is above the value specified for the turbo-status-uniformity-threshold tag, then Intel® Turbo Boost Technology status is considered uniform on the cluster; otherwise a sign is generated.
The value should be a floating point value between 0 and 1. The default value is set to 0.9.
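The uniformity thresholds in this chapter all follow the same pattern, which the following Python sketch illustrates for a single setting. The function name and the sample data are hypothetical, and treating a share exactly equal to the threshold as non-uniform is an assumption based on the "above the value" wording.

```python
from collections import Counter


def non_uniform_values(value_by_node, threshold=0.9):
    """Return {value: share of nodes} for every value whose share is not above the threshold."""
    total = len(value_by_node)
    counts = Counter(value_by_node.values())
    return {value: count / total
            for value, count in counts.items()
            if not (count / total) > threshold}


# Eight nodes, one of which has a different status.
status = {f"c0{i}": "enabled" for i in range(1, 9)}
status["c03"] = "disabled"
print(non_uniform_values(status))
```

With eight nodes and the default threshold of 0.9, a 7-to-1 split leaves both values below the threshold (0.875 and 0.125), so both would be flagged.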
Post Processor JSON Output
Intel® Cluster Checker can write reporting and analysis logs to a JSON-formatted file, which can be useful when centralizing logs. To enable this, update the Intel® Cluster Checker configuration XML file, clck.xml, to include <entry>clck_json</entry> in the postproc_extensions section. The default location of this file is install-path/etc/clck.xml. The result would look something like this:
<postprocessor>
  <postproc_extensions>
    <group>
      <entry>clck_json</entry>
      <entry>category_summary</entry>
      <entry>category_log</entry>
      <entry>clck_execution_warnings</entry>
    </group>
  </postproc_extensions>
</postprocessor>
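If the configuration file is maintained by a script, the entry can be added programmatically. The following Python sketch uses the standard xml.etree.ElementTree module and assumes the file contains a postproc_extensions section with a group element, as in the fragment above; the function name is hypothetical. Note that ElementTree discards XML comments when parsing, so editing by hand may be preferable for a commented file.

```python
import xml.etree.ElementTree as ET


def enable_json_postproc(xml_text: str) -> str:
    """Return xml_text with a clck_json entry in the first postproc group."""
    root = ET.fromstring(xml_text)
    # Works whether <postprocessor> is the root or nested deeper in the document.
    group = root.find(".//postproc_extensions/group")
    if group is None:
        raise ValueError("no <postproc_extensions><group> section found")
    present = any((e.text or "").strip() == "clck_json"
                  for e in group.findall("entry"))
    if not present:
        entry = ET.Element("entry")
        entry.text = "clck_json"
        group.insert(0, entry)  # place it first, as in the fragment above
    return ET.tostring(root, encoding="unicode")


sample = (
    "<postprocessor><postproc_extensions><group>"
    "<entry>category_summary</entry>"
    "</group></postproc_extensions></postprocessor>"
)
print(enable_json_postproc(sample))
```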
The Node Group Feature
Node group is a beta feature. It enables finer-grained control of how Intel® Cluster Checker analyzes collected data by defining groups of nodes and assigning tests to these groups through individual framework definitions (FWDs).
The intent of this feature is to enable analysis by common features or attributes of nodes in a heterogeneous cluster, for example when there are different groups of nodes with the same processor type or speed, the same memory speed, type, or size, or the same communications fabric. While Intel® Cluster Checker can collect data for all of the nodes in a heterogeneous cluster at the same time, it can separate the analysis by the compute nodes' common attributes using a node group configuration file. Instead of reporting the differences between all nodes of all different groups, individual groups can be analyzed by their specific characteristics.
There are no changes to Intel® Cluster Checker performing the collection of the data - this feature only changes the analysis of the collected data by user-defined grouping in a “node group” file. An example of a node group configuration file is provided below.
The Node Group Command Line Option
This feature is enabled by running clck (or clck-analyze) with the command line option -g / --groupfile <nodegroupconfig>, where <nodegroupconfig> is an appropriate node group configuration file. For example:
as a privileged user
clck -F health_admin -g path/to/my/nodegroupconfig.xml
as a non-privileged user
clck -F health_extended_user -g path/to/my/nodegroupconfig.xml
If analysis is run separately from data collection by clck-analyze, the FWDs used in the analysis must have been included in the collection stage to provide sufficient data to later analyze the node group.
An example node group configuration file can be found at: $CLCK_HOME/etc/example_groups_cpu_mem.xml
The Node Group Configuration File
A node group configuration file defines a collection of compute nodes whose data is analyzed together based upon specific FWDs.
The node group configuration file is an XML file that lists group definitions and test definitions:
Group definitions specify node group names and which nodes belong to each group.
Test definitions specify which set of tests (framework definitions) are assigned to which groups.
Here is an example node group configuration file body, listing two groups and one FWD:
<?xml version="1.0" encoding="UTF-8"?>
<node_group_config>
  <nodegroup name="A">
    <nodes>
      <node>c01</node>
      <node>c03</node>
    </nodes>
  </nodegroup>
  <nodegroup name="B">
    <nodes>
      <node>c07</node>
      <node>c08</node>
    </nodes>
  </nodegroup>
  <fwd name="memory_uniformity">
    <nodegroups>
      <group>A</group>
    </nodegroups>
  </fwd>
</node_group_config>
When a node group configuration file is applied to Intel® Cluster Checker analysis (-g / --groupfile), the tests of the listed FWDs compare nodes only within the groups specified for each test. This way, uniformity tests (hardware, firmware, network, software, performance, and so on) can reflect the actual configuration of a heterogeneous cluster.
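To show how the two parts of the file relate, the following Python sketch reads a node group configuration with the standard xml.etree.ElementTree module and returns the group membership and the groups assigned to each FWD. It is an illustrative reader with a hypothetical function name, not the parser used by the tool, and it ignores nodefile references inside a group.

```python
import xml.etree.ElementTree as ET


def load_node_groups(xml_text: str):
    """Return (groups, tests): nodes per group name and group names per FWD."""
    root = ET.fromstring(xml_text)
    groups = {
        ng.get("name"): [(n.text or "").strip() for n in ng.findall("./nodes/node")]
        for ng in root.findall("nodegroup")
    }
    tests = {
        fwd.get("name"): [(g.text or "").strip() for g in fwd.findall("./nodegroups/group")]
        for fwd in root.findall("fwd")
    }
    return groups, tests


config = """
<node_group_config>
  <nodegroup name="A"><nodes><node>c01</node><node>c03</node></nodes></nodegroup>
  <nodegroup name="B"><nodes><node>c07</node><node>c08</node></nodes></nodegroup>
  <fwd name="memory_uniformity"><nodegroups><group>A</group></nodegroups></fwd>
</node_group_config>
"""
groups, tests = load_node_groups(config)
print(groups)
print(tests)
```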
Example of Defining a Node Group
The node group section of the configuration file contains one or more node group definitions. Each definition assigns a group name to multiple servers, which can then be used for grouped analysis. Here is a single node group example:
<nodegroup name="A">
  <nodefiles>
    <path>Path/to/a/nodefile</path>
  </nodefiles>
  <nodes>
    <node>c01</node>
    <node>c04</node>
  </nodes>
</nodegroup>
In this example nodes c01 and c04 along with any nodes specified in the nodefile (same format as with the -f option, single server name per line matching the hostname output) are assigned to group A. Nodes can be assigned to multiple groups.
Node group names must not contain spaces. The group name All is reserved for the group containing all nodes being run. If a FWD is assigned to specific node groups but those groups do not cover all nodes, a corresponding group containing the remaining nodes is added automatically, named All-except-nodegroup-<group-name>.
If a group includes a node that is not being analyzed, that node is ignored. Groups in which no nodes are analyzed in a given run are also ignored.
The <nodegroup> XML section accepts using nodefiles, individually listed nodes or combinations of both as shown in the example above.
Example of Using Node Groups for a Specific Test
This section of the XML configuration file allows the user to define which node groups should be associated with which specific tests (or Framework Definition, fwd) during the analysis phase. This is an example of assigning node groups to specific tests:
<fwd name="cpu_user">
  <nodegroups>
    <group>A</group>
    <group>B</group>
    <group>C</group>
  </nodegroups>
</fwd>
Here, the tests specified in the framework definition cpu_user are analyzed within each of the groups A, B and C. Grouping is applied only to the explicitly specified FWDs. If a FWD that is not listed for grouping includes a FWD that is listed, grouping is applied to the included FWD. For example, -F health_extended_user, which includes the FWD cpu_user, analyzes all uniformity tests on the single group of All nodes, except the cpu_user tests, which are analyzed only within the three individual groups A, B and C.
Note that <fwd name="xyz"> must name a valid framework definition, either one included with Intel® Cluster Checker or a user-defined one. In the example above, cpu_user is a valid framework definition. By contrast, xyz is not a valid framework definition by default, so an entry naming it would not be used.
All three groups analyze the cpu_user test independently of each other. If any nodes remain outside these groups, a group consisting of the remaining nodes is created (called All-except-nodegroups-A-B-C) and analyzed separately. This means nodes in group A are compared only against other nodes in group A. To test the nodes of groups A and B together, add the nodes of group B to group A, or create a separate node group just for this test. Groups can be assigned to more than one test.
FWD entries in the node group configuration file are ignored unless the FWD is run through the command line (the -F option) or the configuration file (by default $CLCK_HOME/etc/clck.xml), or is included by a framework that is run. If none of the groups assigned to a test contain any nodes, analysis ignores that entry.
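The grouping behavior for a single FWD can be sketched in Python as follows. The function name is hypothetical, and the name of the automatically created group follows the All-except-nodegroups-A-B-C form mentioned in the text; how the tool builds that name in other situations is an assumption.

```python
def analysis_groups(all_nodes, groups, assigned):
    """Return the node sets that are analyzed separately for one FWD.

    all_nodes: nodes being analyzed in this run
    groups:    mapping of group name to member nodes
    assigned:  group names listed for the FWD
    """
    analyzed = set(all_nodes)
    result = {}
    for name in assigned:
        # Nodes that are not being analyzed are ignored; empty groups are dropped.
        members = [n for n in groups.get(name, []) if n in analyzed]
        if members:
            result[name] = members
    covered = {n for members in result.values() for n in members}
    remaining = [n for n in all_nodes if n not in covered]
    if remaining:
        # Assumption: the group name is built from the assigned group names.
        result["All-except-nodegroups-" + "-".join(assigned)] = remaining
    return result


nodes = ["c01", "c02", "c03", "c04", "c05", "c06"]
groups = {"A": ["c01", "c02"], "B": ["c03"], "C": ["c04"]}
print(analysis_groups(nodes, groups, ["A", "B", "C"]))
```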
Complete Example File with Output
The following is a complete example of a node group configuration file for a simple cluster of 8 compute nodes with hostnames c[01-08], two types of processors, and two sizes of memory:
compute nodes with Intel® Xeon 6148 processors
4 nodes with 192GB memory
compute nodes with Intel® Xeon 6258R processors
2 nodes with 192GB memory
2 nodes with 768GB memory
The console output shown further below was produced by a privileged user with a command line of the following form:
clck -F health_admin -g path/to/my/nodegroupconfig.xml
A regular user can apply the same node group configuration file to any of their Intel® Cluster Checker tests in the same way, for example:
clck -F health_extended_user -g path/to/my/nodegroupconfig.xml
This is the example node group configuration file for the above configuration:
<?xml version="1.0" encoding="UTF-8"?>
<node_group_config>
  <!-- Node group files can be specified with the '-g <file path/name>' option. -->
  <!-- This option is incompatible with subclustering defined in the node file. -->
  <!-- List group name (no spaces) and included nodes. -->
  <!-- The name "All" is a reserved node group name and will be ignored. -->
  <nodegroup name="xeon6148">
    <nodes>
      <node>c01</node>
      <node>c02</node>
      <node>c03</node>
      <node>c04</node>
    </nodes>
  </nodegroup>
  <nodegroup name="xeon6258R">
    <nodes>
      <node>c05</node>
      <node>c06</node>
      <node>c07</node>
      <node>c08</node>
    </nodes>
  </nodegroup>
  <nodegroup name="xeon6148_mem192GB">
    <nodes>
      <node>c01</node>
      <node>c02</node>
      <node>c03</node>
      <node>c04</node>
    </nodes>
  </nodegroup>
  <nodegroup name="xeon6258R_mem192GB">
    <nodes>
      <node>c05</node>
      <node>c06</node>
    </nodes>
  </nodegroup>
  <nodegroup name="xeon6258R_mem768GB">
    <nodes>
      <node>c07</node>
      <node>c08</node>
    </nodes>
  </nodegroup>
  <nodegroup name="mem192GB">
    <nodes>
      <node>c01</node>
      <node>c02</node>
      <node>c03</node>
      <node>c04</node>
      <node>c05</node>
      <node>c06</node>
    </nodes>
  </nodegroup>
  <nodegroup name="mem768GB">
    <nodes>
      <node>c07</node>
      <node>c08</node>
    </nodes>
  </nodegroup>
  <!-- List which Framework definitions will run on which groups -->
  <!-- Frameworks not being run with or included by -F or -c config will be skipped. -->
  <fwd name="cpu_base">
    <nodegroups>
      <group>xeon6148</group>
      <group>xeon6258R</group>
    </nodegroups>
  </fwd>
  <fwd name="dgemm_cpu_performance">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="sgemm_cpu_performance">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="avx512_performance_ratios_user">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="avx512_performance_ratios_priv">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="hpl_cluster_performance">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="stream_memory_bandwidth_performance">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="syscfg_settings_uniformity">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="kernel_parameter_uniformity">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="lshw_hardware_uniformity">
    <nodegroups>
      <group>xeon6148_mem192GB</group>
      <group>xeon6258R_mem192GB</group>
      <group>xeon6258R_mem768GB</group>
    </nodegroups>
  </fwd>
  <fwd name="memory_uniformity">
    <nodegroups>
      <group>mem192GB</group>
      <group>mem768GB</group>
    </nodegroups>
  </fwd>
</node_group_config>
Here is the console output of this command:
$ clck -f nodelist -g example_groups_cpu_mem.xml -F health_admin
-g/--groupfile is a beta feature currently in development.
Intel(R) Cluster Checker 2021.1 Beta 6 (build 20200403)

Running Collect
................................................................................
Running Analyze

SUMMARY
  Command-line: clck -f nodelist -g example_groups_cpu_mem.xml -F health_admin
  Tests Run: health_admin
  Overall Result: 6 issues found - FUNCTIONALITY (2), HARDWARE UNIFORMITY (2), SOFTWARE UNIFORMITY (2)
--------------------------------------------------------------------------------
  8 nodes tested: c[01-08].skl
  0 nodes with no issues:
  8 nodes with issues: c[01-08].skl
--------------------------------------------------------------------------------
Intel(R) Cluster Checker completed analysis with the following groups:

  User-Configured Groups (Defined in example_groups_cpu_mem.xml)
    1. Group "mem192GB"
       Nodes: c[01-06].skl
       Tests: memory_uniformity
    2. Group "mem768GB"
       Nodes: c[07-08].skl
       Tests: memory_uniformity
    3. Group "xeon6148"
       Nodes: c[01-04].skl
       Tests: cpu_base
    4. Group "xeon6148_mem192GB"
       Nodes: c[01-04].skl
       Tests: dgemm_cpu_performance, stream_memory_bandwidth_performance
    5. Group "xeon6258R"
       Nodes: c[05-08].skl
       Tests: cpu_base
    6. Group "xeon6258R_mem192GB"
       Nodes: c[05-06].skl
       Tests: dgemm_cpu_performance, stream_memory_bandwidth_performance
    7. Group "xeon6258R_mem768GB"
       Nodes: c[07-08].skl
       Tests: dgemm_cpu_performance, stream_memory_bandwidth_performance

  Automatically Configured Groups
    1. Group "All"
       Nodes: c[01-08].skl
       Tests: health_admin
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
  Group "mem192GB": c[01-06].skl
    No issues detected.
  Group "mem768GB": c[07-08].skl
    No issues detected.
  Group "xeon6148": c[01-04].skl
    No issues detected.
  Group "xeon6148_mem192GB": c[01-04].skl
    No issues detected.
  Group "xeon6258R": c[05-08].skl
    No issues detected.
  Group "xeon6258R_mem192GB": c[05-06].skl
    No issues detected.
  Group "xeon6258R_mem768GB": c[07-08].skl
    No issues detected.
  Group "All": c[01-08].skl
    1. Intel(R) Turbo Boost Technology is disabled.
         1 node: c03.skl
    2. The Intel(R) Cluster Checker requires the Intel(R) Omni-Path tool 'opasmaquery'.
         1 node: c01.skl

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
  Group "mem192GB": c[01-06].skl
    No issues detected.
  Group "mem768GB": c[07-08].skl
    No issues detected.
  Group "xeon6148": c[01-04].skl
    No issues detected.
  Group "xeon6148_mem192GB": c[01-04].skl
    No issues detected.
  Group "xeon6258R": c[05-08].skl
    No issues detected.
  Group "xeon6258R_mem192GB": c[05-06].skl
    No issues detected.
  Group "xeon6258R_mem768GB": c[07-08].skl
    No issues detected.
  Group "All": c[01-08].skl
    1. The Intel(R) Turbo Boost Technology status 'disabled', is not uniform. 12% of nodes in the same grouping have the same Intel(R) Turbo Boost Technology status.
         1 node: c03.skl
    2. The Intel(R) Turbo Boost Technology status 'enabled', is not uniform. 88% of nodes in the same grouping have the same Intel(R) Turbo Boost Technology status.
         7 nodes: c[01-02,04-08].skl

PERFORMANCE
No issues detected.

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
  Group "mem192GB": c[01-06].skl
    No issues detected.
  Group "mem768GB": c[07-08].skl
    No issues detected.
  Group "xeon6148": c[01-04].skl
    No issues detected.
  Group "xeon6148_mem192GB": c[01-04].skl
    No issues detected.
  Group "xeon6258R": c[05-08].skl
    No issues detected.
  Group "xeon6258R_mem192GB": c[05-06].skl
    No issues detected.
  Group "xeon6258R_mem768GB": c[07-08].skl
    No issues detected.
  Group "All": c[01-08].skl
    1. The Energy/Performance Bias BIOS setting, '6.00', is not uniform. 88% of nodes in the same grouping have the same Energy/Performance Bias setting. Intel(R) MPI Library works best with these values being consistent.
         7 nodes: c[01-05,07-08].skl
    2. The Energy/Performance Bias BIOS setting, '7.00', is not uniform. 12% of nodes in the same grouping have the same Energy/Performance Bias setting. Intel(R) MPI Library works best with these values being consistent.
         1 node: c06.skl

See the following files for more information: clck_results.log, clck_execution_warnings.log
Note: The group "All" runs every test in the health_admin framework definition (FWD) on all nodes, except for the tests that the group file assigns to the listed groups; "All" skips those tests.
Specifying the Results Destination
The -o/--logfile command line option lets a user set the location of the results log by specifying either a file name or a directory path. The following examples show the expected behavior in each case.
clck -f nodefile -o output_file.log
This command collects and analyzes the data as specified in the framework definition health_base using the specified nodefile and writes results to output_file.log. The execution warnings log file will be written to clck_execution_warnings.log.
clck -f nodefile -o directory/
This command collects and analyzes the data as specified in the framework definition health_base using the specified nodefile and writes results to directory/clck_results.log. If ‘directory/’ is an existing directory, the trailing ‘/’ may be omitted. The execution warnings log file will be written to directory/clck_execution_warnings.log.
Note: If the user does not have write and execute permissions for the path provided to the -o option, the results file is written to the current working directory using only the file name component of the provided path. In that case, if the path provided to the -o option is a directory, the results log file name defaults to clck_results.log.
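The destination rules described in this section can be summarized in a short Python sketch. It is a model of the documented behavior under stated assumptions, not the tool's actual code. The function name resolve_log_paths is hypothetical. The sketch also assumes that the permission check applies to the directory that would contain the log, that the execution warnings log is written next to the results log, and that a directory that does not exist is treated as not writable.

```python
import os

RESULTS_DEFAULT = "clck_results.log"
WARNINGS_DEFAULT = "clck_execution_warnings.log"


def resolve_log_paths(o_arg, cwd):
    """Return (results_path, warnings_path) for a given -o argument."""
    # Relative arguments are interpreted against the working directory.
    target = os.path.join(cwd, o_arg)

    # A trailing separator or an existing directory selects directory mode,
    # where the results log gets its default file name.
    if o_arg.endswith(os.sep) or os.path.isdir(target):
        directory = target.rstrip(os.sep) or os.sep
        filename = RESULTS_DEFAULT
    else:
        directory, filename = os.path.split(target)

    # Assumption: write and execute permissions are checked on the containing
    # directory. Without them, fall back to the current working directory and
    # keep only the file name component.
    if not os.access(directory, os.W_OK | os.X_OK):
        directory = cwd

    # Assumption: the warnings log is placed next to the results log.
    return (
        os.path.join(directory, filename),
        os.path.join(directory, WARNINGS_DEFAULT),
    )
```

For example, resolve_log_paths("output_file.log", os.getcwd()) yields output_file.log and clck_execution_warnings.log in the current working directory, which matches the first example above.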