Knowledge Base
Intel® Cluster Checker is an expert system. A classic definition of an expert system is “an intelligent computer program that uses knowledge and inference procedures to solve problems that are difficult enough to require significant human expertise for their solutions” (Edward A. Feigenbaum, “Knowledge Engineering in the 1980s”, Stanford University Computer Science Department, 1982). The problem that Intel® Cluster Checker solves is diagnosing system-level issues with Beowulf-style clusters.
Two main elements are called out in this definition: knowledge and inference procedures. Knowledge comes from two sources: observations about an actual system and if/then rules that encapsulate human expertise. Intel® Cluster Checker relies on data providers to make observations about the cluster and saves the result in a database (see the Data Providers chapter).
Expert systems differ from typical procedural programs in that there is no fixed order of execution. The order is logically inferred, using one of several common schemes such as the Rete algorithm, by dynamically analyzing the interdependence of rules and facts.
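The idea that execution order is inferred rather than fixed can be illustrated with a deliberately naive forward-chaining loop. This is only a sketch: it is not the Rete algorithm and not how CLIPS is implemented, and the rule and fact names are hypothetical.

```python
# Naive forward chaining: rules fire when their conditions become known
# facts, regardless of the order in which the rules are listed.
# All rule and fact names below are hypothetical.

def forward_chain(facts, rules):
    """Fire rules until no rule adds a new fact.

    Returns the final set of facts and the order in which rules fired.
    """
    facts = set(facts)
    fired = []
    changed = True
    while changed:
        changed = False
        for name, conditions, conclusion in rules:
            # A rule is eligible once all of its conditions are known facts.
            if conclusion not in facts and conditions <= facts:
                facts.add(conclusion)
                fired.append(name)
                changed = True
    return facts, fired


# The rules are intentionally listed in the "wrong" order.
RULES = [
    ("diagnose-hardware", {"memory-not-uniform"}, "hardware-not-uniform"),
    ("observe-memory", {"node-memory-differs"}, "memory-not-uniform"),
]

facts, fired = forward_chain({"node-memory-differs"}, RULES)
print(fired)
```

Although diagnose-hardware is listed first, it can only fire after observe-memory has produced the fact it depends on, so the firing order follows the dependencies between rules and facts.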
One limitation of expert systems is that they are only as good as the knowledge (rules) they contain. To remain relevant, a knowledge base needs to continually grow and change as new and variant cases are uncovered. This chapter addresses how to express human expertise as knowledge base rules to extend the diagnostic capabilities of Intel® Cluster Checker. Please consider contributing extensions to the Intel® Cluster Checker team so that other users can also benefit.
The C Language Integrated Production System (CLIPS) was originally created in the 1980s at NASA’s Johnson Space Center (Joseph Giarratano and Gary Riley, Expert Systems: Principles and Programming, Thomson Course Technology, 2005). CLIPS is an expert system shell that combines an inference engine with a language for representing knowledge. Like many AI environments, the CLIPS language is very similar to LISP.
More recently, CLIPS added object-oriented capabilities. Intel® Cluster Checker is based on the CLIPS Object-Oriented Language (COOL).
The CLIPS User’s Guide is an excellent introduction to CLIPS.
Knowledge Base Overview
Key Concepts
Signs
Signs are one of the core elements of the knowledge base. A sign is an objective observation of the system. For example, if one node in a system has an amount of memory differing from the rest of the system, a sign will appear indicating that the memory is not uniform. Signs, then, do not infer anything about the cluster. Rather, they indicate some observation of an issue based on the data collected. In the Intel® Cluster Checker output, signs generated from CLIPS are referred to as “observations”.
All the varieties of signs have a state slot that represents a state diagram, where a sign is first initialized, then transitions to the observed state when a rule is run, and finally becomes diagnosed if the sign is used to make a diagnosis (see Diagnoses below).
The severity slot contains a value that ranges from 0 to 100. These values map to one of three severity levels - informational, warning, and critical. Informational observations map to the severity of 0-24 and provide additional information or minor issues about the system. An informational observation indicates that the cluster is fully functional but may have minor performance issues or not conform to best practices. Warning observations map to a range of 25-74 and indicate that the cluster is essentially functional but has performance issues and/or a non-core capability has functionality issues or is missing. Critical observations map to a range of 75-100 and indicate that a core cluster capability is non-functional or missing. The most severe critical observations indicate that a cluster component may irreparably fail if not addressed immediately. The rule that sets the sign (that is, transitions it into the observed state) sets the severity, and the sign will appear as an observation in the output.
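The mapping from a numeric severity to a severity level can be sketched as follows. The function name is hypothetical; only the ranges come from the description above.

```python
def severity_level(severity):
    """Map a severity value (0-100) to its severity level."""
    if not 0 <= severity <= 100:
        raise ValueError("severity must be between 0 and 100")
    if severity < 25:
        return "informational"  # 0-24
    if severity < 75:
        return "warning"        # 25-74
    return "critical"           # 75-100


for value in (0, 24, 25, 74, 75, 100):
    print(value, severity_level(value))
```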
Every sign also has an id slot that corresponds to a message catalog key (clck/<version>/kb/data/msg_en.xmc). The message catalog contains a string that describes the sign. Typically the string is a single sentence, but it may be longer. By convention, the id value should be the same as the name of the rule that created it. The id value is also used to look up the sign when making diagnoses.
Finally, the args slot contains variable values to be inserted into the message catalog string. Together, the id and args slots are roughly analogous to the C printf family of functions. The message catalog can be extended by simply adding new entries.
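The printf analogy can be sketched as a lookup followed by a substitution. The format of the message catalog file is not shown here, so this example makes two assumptions: a plain dictionary stands in for the catalog, and printf-style placeholders are used in the message string. The key and message text are hypothetical.

```python
# Assumption: a dictionary stands in for the message catalog, and the
# message strings use printf-style placeholders. The entry is hypothetical.
CATALOG = {
    "memory-not-uniform": "Node %s has %d GiB of memory, unlike its peers.",
}


def render(sign_id, args=()):
    """Look up the message for a sign id and insert the args values."""
    template = CATALOG.get(sign_id)
    if template is None:
        # Assumption: fall back to the bare key when no entry exists.
        return sign_id
    return template % tuple(args)


print(render("memory-not-uniform", ["node07", 64]))
```

Extending the catalog then amounts to adding another key and message string, with the rule supplying matching values through args.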
Diagnoses
Intel® Cluster Checker uses signs to make inferences about the cluster, resulting in diagnoses. Diagnoses are based on one or more signs. Using the non-uniform memory example from above, the non-uniform memory sign will result in a non-uniform hardware diagnosis.
Diagnoses are made based on the value of signs. A diagnosis is also defined by a rule. Whenever a diagnosis is made, the signs used to make the diagnosis should be transitioned to the diagnosed state. This is important because signs that are not used to make a diagnosis (that is, left in the observed state) will be printed out as undiagnosed signs. Undiagnosed signs indicate that an issue was found, but Intel® Cluster Checker was unable to infer anything based on it. Not all signs will result in diagnoses.
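The sign life cycle and the notion of an undiagnosed sign can be modeled in a few lines. This is a conceptual sketch only: the class and function names are hypothetical and do not correspond to the CLIPS classes in the knowledge base.

```python
from dataclasses import dataclass


@dataclass
class Sign:
    """Toy model of a sign with the three states described above."""
    id: str
    state: str = "initialized"


def observe(sign):
    """A rule has fired and set the sign."""
    sign.state = "observed"


def diagnose(diagnosis_id, signs):
    """Create a diagnosis and mark the contributing signs as diagnosed."""
    for sign in signs:
        sign.state = "diagnosed"
    return {"id": diagnosis_id, "signs": [sign.id for sign in signs]}


def undiagnosed(signs):
    """Signs left in the observed state are reported as undiagnosed."""
    return [sign.id for sign in signs if sign.state == "observed"]


memory = Sign("memory-not-uniform")
zombie = Sign("zombie-process")
observe(memory)
observe(zombie)
diagnosis = diagnose("hardware-not-uniform", [memory])
print(undiagnosed([memory, zombie]))
```

Only the zombie-process sign remains in the observed state, so it is the one that would be reported as undiagnosed.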
Similar to signs, diagnoses have severity, id, and args slots. The severity slot will typically be a composite of the signs used to reach the diagnosis. For example, a diagnosis based on one sign reached with low confidence and another sign reached with high confidence should probably have a low to intermediate confidence value, depending on the particular case.
Remedies
Remedies provide actionable steps to resolve an issue, such as changing the permissions on a file or rebooting a node. Remedies are specified using two optional sign slots, remedy and remedy-args. Similar to id, remedy corresponds to a message catalog key (clck/<version>/kb/data/msg_en.xmc) and remedy-args contains variable values to be inserted into the message catalog string. If the remedy slot is empty, then no remedy is displayed.
Basic Implementation
*Note: The duck sample is not working with Intel® Cluster Checker 2019 or Intel® Cluster Checker 2021.
Classes
CLIPS classes are roughly analogous to C structures or C++ classes. Slots are to member variables as classes are to C structures. A slot typically has some attributes, or defined facets, such as the type, default value, etc. See the CLIPS documentation for more information about facets. The slots are populated with information from the database through analyzer extensions.
The class definition for the Duck example follows and can also be found at src/kb/classes/duck.clp in the SDK Duck Sample*:
(defclass DUCK
  "This class corresponds to the 'duck' node rating tool."
  (is-a BASE_NODE BASE_TIMESTAMP DATABASE MULTISET)
  (role concrete)
  (pattern-match reactive)
  (slot count
    (type INTEGER)
    (default 1))
  (slot sound
    (type SYMBOL)
    (allowed-values honk quack)
    (default honk)))
In addition to the explicitly defined slots, the DUCK class inherits slots from its base classes. For instance, the node_id slot, which corresponds to a unique node identifier, is inherited from the BASE_NODE class. If the class represents a property of multiple nodes, such as the network performance between a pair of nodes, it would instead inherit from the NODE_PAIR or BASE_CLUSTER base classes (clck/<version>/kb/core/cluster.clp).
Rules
For each class, there is typically a corresponding rule file. For instance, the DUCK class is defined in the file src/kb/classes/duck.clp, and the corresponding rules are defined in the file src/kb/rules/duck.clp in the SDK Duck Sample*. Based on the data contained in the instances and potentially other information such as the hardware configuration of a node, a rule creates one or more signs or diagnoses.
A CLIPS rule has a left-hand side (LHS) and a right-hand side (RHS), separated by the => token. The LHS is the set of conditions (the “if” part) that describe when the rule should fire. The RHS contains the action (the “then” part) that should be performed when the LHS conditions are met. Typically the action is to create a sign or diagnosis.
Signs
Several varieties of signs are provided (clck/<version>/kb/core/sign.clp).
BOOLEAN_SIGN represents quantities that are either true or false. For example, a process is either in the zombie state or not.
COUNTER_SIGN represents quantities that correspond to a count of something. For example, the number of network retries.
PERFORMANCE_SIGN represents a measure of performance that is either normal, substandard, or invalid. For example, the measured floating point performance meets expectations for the hardware configuration, does not meet expectations, or is an invalid value (such as negative or not a number).
GENERIC_SIGN is a general sign that can be used if one of the preceding specialized sign classes is not appropriate.
Note that in the output, all signs will be referred to as observations.
Organization and Directory Structure
The knowledge base is divided into several sub-components.
The clck/<version>/kb/core sub-directory contains the core data structures and message handlers used by the rest of the knowledge base. These files should typically not be modified.
The diagnostic knowledge is split between the clck/<version>/kb/classes and clck/<version>/kb/rules subdirectories. Class definitions are part of clck/<version>/kb/classes while the logic defining signs and diagnoses is contained in the clck/<version>/kb/rules subdirectory.
The clck/<version>/kb/data sub-directory contains lists of hardware components and their properties, as well as the catalog of messages.
Functions that extend the base CLIPS functionality can be put in the clck/<version>/kb/functions sub-directory.
Framework definitions use the <kb_mods> tag to load CLIPS files. For example, the cpu framework definition uses this tag to load the file cpu.clp, which contains the relevant rules.
Each sub-directory has a file named load.clp. This file loads the rest of the files in the same sub-directory. If, for example, a new rules file is added, it must also be added to the load file in the corresponding clck/<version>/kb/rules sub-directory to be enabled. The user-defined duck.clp file, located at the kb directory level, can contain the following:
(batch* classes/duck.clp)
(batch* rules/duck/load.clp)
While clck/<version>/rules/duck/load.clp contains the following:
(batch* duck-honking.clp)
(batch* duck-less-than-three-quacks.clp)
(batch* duck-stopping.clp)
Automatically Created Objects
A NODE object is automatically created for each node being checked. Each NODE object contains slots for the node architecture, roles, and subcluster membership. These slots may be used to restrict a rule to a particular type of node.
A single instance of the CONFIG class named [config] is automatically created and contains the input configuration parameters. The instance name [config] is reserved for this purpose and no other instances should use this name. This instance may be used to make the behavior of a rule user configurable.
Configurability
The CONFIG class contains all user-configurable options and is defined in clck/<version>/kb/core/config.clp. A single instance of this class always exists with the reserved name [config]. This class can be extended by adding new slots.
The slots of the CONFIG class form a global namespace, so slot names should be chosen with that consideration.
A simplified definition of the CONFIG class is as follows:
(defclass CONFIG
  (is-a USER)
  (role concrete)
  (pattern-match reactive)

  ; clck-checks is a list of connector extensions to be
  ; performed.
  (multislot clck-checks
    (type SYMBOL)
    (default (create$ all_to_all cpu dgemm environment ...)))

  ; The maximum allowable age of a data point, in seconds,
  ; before a data point is considered "too old". The
  ; default is 1 week.
  (slot data-age-threshold
    (type NUMBER)
    (default 604800))
  ...
To use the CONFIG class, a corresponding rule would add a single condition to the left-hand side:
(defrule duck-data-is-too-old
  "Identify instances where the most recent DUCK data should be considered too old."
  ; IF the 'duck' connector extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?)
          (data-age-threshold ?age-threshold))
  ...
The values of the CONFIG slots should always have defaults and are configurable in the Intel® Cluster Checker config file.
The following construct can be used to set values for single slot variables.
<configuration>
  <analyzer>
    <config>
      <ssf-layer>core</ssf-layer>
    </config>
  </analyzer>
</configuration>
The following construct can be used to set values for multislot variables.
<configuration>
  <analyzer>
    <config>
      <clck-checks>
        <entry>PATTERN1</entry>
        <entry>PATTERN2</entry>
      </clck-checks>
    </config>
  </analyzer>
</configuration>
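The difference between the single slot and multislot constructs can be made concrete by reading such a file with a generic XML parser. This sketch does not reflect how Intel® Cluster Checker itself parses its configuration; it assumes, for illustration only, that both constructs appear in one file and that a multislot is recognized by its <entry> children.

```python
import xml.etree.ElementTree as ET

# Assumption: both constructs shown above are combined in a single file.
CONFIG_XML = """
<configuration>
  <analyzer>
    <config>
      <ssf-layer>core</ssf-layer>
      <clck-checks>
        <entry>PATTERN1</entry>
        <entry>PATTERN2</entry>
      </clck-checks>
    </config>
  </analyzer>
</configuration>
"""


def read_config_slots(xml_text):
    """Return a dict of slot name to value (str) or values (list of str)."""
    root = ET.fromstring(xml_text)
    slots = {}
    for element in root.findall("./analyzer/config/*"):
        entries = [entry.text for entry in element.findall("entry")]
        # Elements with <entry> children are treated as multislots.
        slots[element.tag] = entries if entries else element.text
    return slots


print(read_config_slots(CONFIG_XML))
```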
Example
This section steps through the complete DUCK knowledge base example. The source files are provided online in the SDK Duck Sample, specifically in the folder src/kb.
Class Definition
Recall that the duck command rates nodes on a scale from 1 to 5 quacks, or if there is an error during the evaluation, honks instead of quacks. So the key data elements that need to be included in the knowledge base are a node identifier, the sound (quack or honk), and the number of times the sound is repeated. The following is an example CLIPS class definition that includes all of these elements. In an actual distribution, it would be added to the knowledge base as clck/<version>/kb/classes/duck.clp.
(defclass DUCK "This class corresponds to the 'duck' node rating tool." (is-a BASE_NODE BASE_TIMESTAMP DATABASE MULTISET) (role concrete) (pattern-match reactive) (slot count (type INTEGER) (default 1)) (slot sound (type SYMBOL) (allowed-values honk quack) (default honk)))
The node_id slot is inherited from the BASE_NODE class, the row-id slot is inherited from the DATABASE class, and the timestamp slot is inherited from the BASE_TIMESTAMP class. The MULTISET inheritance will be described with the uniformity rule.
With the class defined, the analyzer extension can now create instances based on the content of the database. Rules can now be defined to check the output.
Rules
Rule 1: Missing good output
In this example, the first rule creates a sign whenever the number of quacks is less than 3. In an actual distribution, the rule would be added to the knowledge base as clck/<version>/kb/rules/duck/duck-less-than-three-quacks.clp.
(defrule duck-less-than-three-quacks
  "Create a sign whenever the number of 'quacks' is less than 3."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'quack'
  ?o <- (object (is-a DUCK) (count ?c) (node_id ?n) (sound quack))
  ; AND the number of quacks is less than 3
  (test (< ?c 3))
  =>
  ; THEN create a sign
  (make-instance of COUNTER_SIGN
    (node_id ?n)
    (confidence 90)
    (severity 50)
    (source ?o)
    (state observed)
    (value ?c)
    (id "duck-less-than-three-quacks")
    (args (create$ ?c))))
The LHS of this rule steps through a series of conditions.
An instance of the CONFIG class with the name [config] must exist with the clck-checks slot containing duck. In other words, only fire this rule if the duck analyzer extension is enabled.
A NODE object must exist where the role slot contains either compute or enhanced. In other words, only fire this rule for compute / enhanced nodes. As a side effect, the ‘?n’ variable is populated with the id of the node.
A DUCK object must exist where the sound is quack and the node_id slot is the same as the ?n value found in the prior condition. In other words, only fire this rule for nodes with both a NODE object and a DUCK object. As a side effect, the ?c variable is populated with the number of quacks.
The number of quacks, ?c, must be less than 3.
If all four of these conditions are met, the rule will fire and execute the action on the right-hand side. The rule is automatically evaluated by the inference engine for all possible combinations of objects, so each node is checked by this single rule.
The severity level is arbitrary, and a more sophisticated rule might scale it depending on the number of quacks. For example, 1 quack might have a severity level of 75 while 2 quacks has a severity level of 50. See the severity ranges described under Signs for guidance on setting the severity level.
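To make the matching behavior concrete, the four conditions and the scaled severity suggested above can be written as an explicit loop over every NODE / DUCK combination. This is a conceptual model only: the inference engine does this matching automatically, and the dictionaries below are hypothetical stand-ins for CLIPS instances.

```python
def duck_less_than_three_quacks(checks, nodes, ducks):
    """Return one sign per matching (node, duck) combination."""
    # Condition 1: the duck check must be enabled.
    if "duck" not in checks:
        return []
    signs = []
    for node in nodes:
        # Condition 2: only compute or enhanced nodes are considered.
        if not {"compute", "enhanced"} & set(node["role"]):
            continue
        for duck in ducks:
            # Condition 3: a quacking DUCK with the same node_id.
            if duck["node_id"] != node["node_id"] or duck["sound"] != "quack":
                continue
            # Condition 4: fewer than 3 quacks.
            if duck["count"] < 3:
                # Scaled severity: 75 for 1 quack, 50 for 2 quacks.
                severity = 75 if duck["count"] <= 1 else 50
                signs.append({
                    "node_id": node["node_id"],
                    "severity": severity,
                    "value": duck["count"],
                })
    return signs


nodes = [
    {"node_id": "n1", "role": ["compute"]},
    {"node_id": "n2", "role": ["storage"]},
    {"node_id": "n3", "role": ["enhanced", "login"]},
]
ducks = [
    {"node_id": "n1", "sound": "quack", "count": 2},
    {"node_id": "n2", "sound": "quack", "count": 1},
    {"node_id": "n3", "sound": "quack", "count": 1},
]
print(duck_less_than_three_quacks(["cpu", "duck"], nodes, ducks))
```

The storage node n2 is skipped even though it has a low rating, because it fails the role condition.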
A message catalog entry with the key duck-less-than-three-quacks would be added to clck/<version>/kb/data/msg_en.xmc in an actual distribution. An example message catalog entry is provided online in the SDK Duck Sample*, located at src/kb/data/msg_en.xmc.
Rule 2: Error case
A second rule should be added for the case where the duck honks, indicating a serious error. The overall construction of the rule is similar to the previous rule.
(defrule duck-honking
  "If the duck honks like a goose, something serious has happened."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'honk'
  ?o <- (object (is-a DUCK) (node_id ?n) (sound honk))
  =>
  ; THEN create a sign
  (make-instance of BOOLEAN_SIGN
    (node_id ?n)
    (confidence 100)
    (severity 100)
    (source ?o)
    (state observed)
    (value TRUE)
    (id "duck-honking")))
As above, a message catalog entry with the key duck-honking should be added.
Rule 3: Uniformity
Finally, a rule might be added to verify that all nodes have the same quack rating.
Usually the question of uniformity can be sufficiently answered by determining what fraction of nodes have the same / different value as a particular node. This approach avoids the combinatorial explosion of comparing every node to every other node and also avoids the problems associated with choosing a “reference” node. The MULTISET class is provided for determining uniformity. A multiset is similar to a set except it is a key / value pair where the value is the number of elements with the same key. For example, the set {a, a, a, b} corresponds to the multiset {a:3, b:1}.
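The multiset idea, and the fraction of elements that share a key, can be sketched with a counter. This is not the MULTISET class from the knowledge base; the fraction helper below is a hypothetical stand-in.

```python
from collections import Counter


def fraction(multiset, key):
    """Fraction of all elements in the multiset that have the given key."""
    total = sum(multiset.values())
    return multiset[key] / total if total else 0.0


# The collection {a, a, a, b} becomes the multiset {a:3, b:1}.
multiset = Counter(["a", "a", "a", "b"])
print(dict(multiset))
print(fraction(multiset, "a"), fraction(multiset, "b"))
```

Each element needs only one lookup to learn how typical its value is, which is why no node-to-node comparison or reference node is required.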
The DUCK class inherits from the MULTISET class. The init message handler, roughly analogous to a C++ constructor, must be added to automatically insert the key / value pair into the multiset when each DUCK instance is created:
(defmessage-handler DUCK init after ()
  "Add MULTISET key / value pairs. Skip non-quacks."
  (if (eq ?self:sound quack) then
    (send ?self add (send ?self multiset-key) ?self:count)))

(defmessage-handler DUCK multiset-key ()
  "Generate a distinct key for each node architecture, role, and subcluster combination."
  ; defaults
  (bind ?architecture x86_64)
  (bind ?role compute)
  (bind ?subcluster default)
  (bind ?ins (find-instance ((?n NODE)) (eq ?n:node_id ?self:node_id)))
  (if (= (length ?ins) 1) then
    (bind ?i (nth$ 1 ?ins))
    (bind ?architecture (send ?i get-architecture))
    (bind ?subcluster (send ?i get-subcluster))
    (if (member$ compute (send ?i get-role)) then
      (bind ?role compute)
     else
      (if (member$ enhanced (send ?i get-role)) then
        (bind ?role enhanced))))
  (bind ?key (sym-cat (class ?self) + ?subcluster + ?role + ?architecture))
  (return ?key))
The multiset-key message handler creates distinct keys for each subcluster, node architecture, and node role. This is done to avoid comparing fundamentally different nodes. For example, do not compare compute nodes and storage nodes.
The uniformity rule is:
(defrule duck-quack-count-is-not-consistent
  "Create a sign whenever the number of 'quacks' is not consistent."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'quack'
  ?o <- (object (is-a DUCK) (node_id ?n) (count ?c)
                (multiset_control TRUE) (sound quack))
  ; AND the fraction of nodes with the same quack count is
  ; less than 0.9
  (test (< (send ?o fraction (send ?o multiset-key) ?c) 0.9))
  =>
  (bind ?key (send ?o multiset-key))
  (bind ?fraction (- 1 (send ?o fraction ?key ?c)))
  (make-instance of BOOLEAN_SIGN
    (node_id ?n)
    (confidence (* 100 ?fraction))
    (severity 80)
    (state observed)
    (source ?o)
    (value TRUE)
    (id "duck-quack-count-is-not-consistent")
    (args (create$ (* 100 (send ?o fraction ?key ?c)) ?c))))
The (multiset_control TRUE) condition appears in this rule to guarantee that all values have been added to the multiset before attempting to activate the rule. It should be used in all rules that rely on a multiset value.
The final LHS condition treats a value as correct if at least 90% of nodes share it, so the rule only fires for nodes whose value is shared by fewer than 90% of nodes. This is an arbitrary threshold intended to minimize the number of false positives that get reported.
The RHS creates a temporary variable ?fraction that corresponds to the fraction of nodes that have a different number of quacks.
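The arithmetic of the uniformity rule can be followed end to end in a short model. It is a sketch under assumptions: each duck record carries a precomputed key (standing in for the multiset-key handler), and plain dictionaries stand in for CLIPS instances and signs.

```python
from collections import Counter, defaultdict


def inconsistent_quack_signs(ducks, threshold=0.9):
    """Return a sign for every quacking duck whose count is uncommon."""
    # Build one multiset of quack counts per key, skipping non-quacks.
    multisets = defaultdict(Counter)
    for duck in ducks:
        if duck["sound"] == "quack":
            multisets[duck["key"]][duck["count"]] += 1

    signs = []
    for duck in ducks:
        if duck["sound"] != "quack":
            continue
        counts = multisets[duck["key"]]
        same = counts[duck["count"]] / sum(counts.values())
        if same < threshold:
            different = 1 - same  # corresponds to ?fraction in the rule
            signs.append({
                "node_id": duck["node_id"],
                "confidence": 100 * different,
                "severity": 80,
                "args": (100 * same, duck["count"]),
            })
    return signs


# 15 compute nodes rate 4 quacks, one rates 2; a lone storage node rates 1.
ducks = [{"node_id": f"c{i}", "key": "compute", "sound": "quack", "count": 4}
         for i in range(15)]
ducks.append({"node_id": "c15", "key": "compute", "sound": "quack", "count": 2})
ducks.append({"node_id": "s0", "key": "storage", "sound": "quack", "count": 1})
print(inconsistent_quack_signs(ducks))
```

Only c15 is reported: 15 of the 16 compute nodes share the value 4, which is above the threshold, and the storage node is compared only against its own key.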
Rule 4: Diagnosis
The duck diagnostic tool does not lend itself to diagnosis. The quack rating scale is unambiguous, but it is a closely held trade secret of Waterfowl Industries, and additional information, such as why a node rated 2 quacks instead of 3 or why the duck honked, is not provided.
Diagnoses are typically made by combining one or more signs. For example, consider the combination of the proverbial black swan sign and the duck-honking sign to produce the diagnosis that the duck is honking because it is actually a black swan:
(defrule duck-duck-swan
  "Diagnose the root cause of the honking duck."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND a "duck-honking" sign exists for a node with the
  ; same node_id
  ?s1 <- (object (is-a SIGN) (node_id ?n) (id "duck-honking"))
  ; AND a "black-swan" sign exists for a node with the same
  ; node_id
  ?s2 <- (object (is-a SIGN) (node_id ?n) (id "black-swan"))
  =>
  ; THEN create a DIAGNOSIS and mark the signs as diagnosed
  (send ?s1 put-state diagnosed)
  (send ?s2 put-state diagnosed)
  (make-instance of DIAGNOSIS
    (node_id ?n)
    (confidence 20)
    (severity 100)
    (source (create$ (send ?s1 get-source) (send ?s2 get-source)))
    (id "duck-duck-swan")
    (remedy "duck-duck-swan-remedy")))
Note that this rule is not part of the included sample files.
Custom Rules for Framework Definitions
Framework definitions accept native or custom rule sets as long as they are specified as follows:
<configuration>
  <framework_definition>
    <kb_mods>
      <mod>duck.clp</mod>
    </kb_mods>
  </framework_definition>
</configuration>
duck.clp contains pointers to the duck class and rule file(s). The following CLIPS file is an example of what duck.clp might contain:
(batch* classes/duck.clp)
(batch* rules/duck/load.clp)
If multiple <kb_mods> are specified for loading, they must be located in the same folder, because only one kb mod path can be specified per Framework Definition. If no path is given, the default location, /opt/intel/oneapi/clck/<version>/kb, is assumed. No duplicate classes or rules should be loaded, since they can result in duplicate diagnoses.
Developing with CLIPS
Style
Intel® Cluster Checker requires the following in custom rules:
The rule is defined with “defrule”.
The ID of the sign or diagnosis is defined either literally (with double quotation marks) or using a variable (beginning with the ? symbol and using the bind keyword).
If a rule contains the “not” keyword (such as checking that the same sign does not already exist), it must not be used in the line that defines the sign/diagnosis ID.
The arrow (=>) defining the LHS and RHS of the rule must be on its own line.
In a diagnosis that can be triggered by multiple signs:
Each potential sign ID must be on its own line.
In an OR rule (in which many signs may contribute to the same diagnosis), the “|” symbol must appear after the sign ID.
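Two of these requirements lend themselves to a quick automated check before a rule file is loaded. The following is a hypothetical helper, not part of Intel® Cluster Checker, and it is deliberately simplistic: it inspects lines as plain text and would be confused by an arrow inside a comment or string.

```python
def check_rule_style(rule_text):
    """Return a list of style problems found in a single CLIPS rule."""
    problems = []
    lines = [line.strip() for line in rule_text.splitlines()]

    # The rule must be defined with defrule.
    if not any(line.startswith("(defrule ") for line in lines):
        problems.append('rule is not defined with "defrule"')

    # The arrow separating LHS and RHS must be on its own line.
    arrow_lines = [line for line in lines if "=>" in line]
    if not arrow_lines:
        problems.append("rule has no => arrow")
    elif any(line != "=>" for line in arrow_lines):
        problems.append("the => arrow is not on its own line")
    return problems


GOOD_RULE = """(defrule duck-honking
  ?o <- (object (is-a DUCK) (sound honk))
  =>
  (make-instance of BOOLEAN_SIGN (id "duck-honking")))"""

BAD_RULE = """(defrule duck-honking
  ?o <- (object (is-a DUCK) (sound honk)) =>
  (make-instance of BOOLEAN_SIGN (id "duck-honking")))"""

print(check_rule_style(GOOD_RULE))
print(check_rule_style(BAD_RULE))
```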
Other than these requirements, coding style is largely a matter of personal preference. The following additional style guidelines are recommended:
Do not exceed 80 characters per line.
Generally use alphabetical order for any list of items.
Use all lower case, except for class names.
Use dashes rather than underscores or CamelCase.
Document all classes, functions, message-handlers, rules, etc. using the CLIPS comment field rather than ‘;’ style comments.
Use the same value for the rule name and sign / diagnosis id slot.
Debugging and Profiling
CLIPS includes several techniques to help better understand what it is doing.
One of the most useful debugging techniques is the watch capability (see section 13.2.3 in the CLIPS Basic Programming Guide).
CLIPS also includes a good profiling capability (see section 13.16 in the CLIPS Basic Programming Guide).
Additional debug and/or profile statements may be included, in which case, additional output will be displayed when running an analysis.