Test and Troubleshoot Your Installation
You may need some of the following information to support your users.
Check if Intel GPU started successfully
Use the queries below to check if the Intel GPU started successfully.
Check that you have a card present in the system and visible on the PCI bus. Execute the following command:
sudo lspci -k | egrep "VGA compatible|Display"
If the command fails to return any value, you have a hardware problem (a power cycle might fix it).
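If you want to script this check (for example, across many nodes), a minimal sketch in bash, assuming the same lspci filter as above:
# Report an error if no display controller is visible on the PCI bus
if ! sudo lspci -k | egrep -q "VGA compatible|Display"; then
    echo "No VGA/Display device found on the PCI bus - possible hardware problem" >&2
fi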
Check that the i915 kernel driver is loaded:
lsmod | grep i915
NOTE: If the kernel parameter nomodeset is used, the i915 driver will not load. You can check that the kernel option nomodeset is not present with the following command:
# grep nomodeset /proc/cmdline
BOOT_IMAGE=(hd0,msdos6)/boot/vmlinuz... numa_zonelist_order=N nomodeset edd=off eagerfpu=on
Check that the graphics devices are present. Execute the following command:
ls -l /dev/dri
If you see card0 or renderD128, it means that the card started successfully. Verify that the permissions are in line with your expectations and users have read/write access to the device files.
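As a quick way to review ownership and permissions, you can print the group and mode of each device node in one line (a minimal sketch; the node names vary from system to system):
stat -c "%n  group=%G  mode=%a" /dev/dri/card* /dev/dri/renderD*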
If you do not see any of the values above, use the following command to further investigate possible issues:
dmesg -T | grep i915
If the command returns nothing, the card did not start at all.
If the command returns some value, you should be able to see if the card started properly or if issues have been encountered. Possible issues include:
Integrated graphics and PCIe-based graphics devices are both enabled in the BIOS
Memory Mapped I/O Size is not set to 1024G in Advanced > PCI Configuration
Use the dmesg -T | grep i915 command to find other issues related to the card.
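As a shortcut, you can narrow the dmesg output down to lines that look like problems (the keyword list below is only a heuristic, not an exhaustive filter):
sudo dmesg -T | grep -i i915 | grep -iE "error|fail|fault"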
If the card appears to have started successfully, further verify it by using the commands:
sudo cat /sys/kernel/debug/dri/0/i915_capabilities | grep "platform:"
or
sudo cat /sys/kernel/debug/dri/1/i915_capabilities | grep "platform:"
to see if a reasonable device name is returned.
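If you are not sure which DRM index the card received, the following sketch loops over all nodes under debugfs (root access is assumed because debugfs is normally not readable by regular users):
sudo sh -c 'for f in /sys/kernel/debug/dri/*/i915_capabilities; do echo "== $f =="; grep "platform:" "$f"; done'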
GPU is present but not accessible to users or expected drivers/devices are not found
Users can query the available compute devices with the command clinfo -l (you may need to install clinfo before they can do this). If clinfo does not return the Intel GPUs even though the checks above indicate that the device is present and initialized properly, there are a number of possible reasons:
Permissions issue. Check the group that owns the card0 or renderD128 files in /dev/dri. As noted in Step 4: Set Up User Permissions for Using the Intel GPU Device Files of this document, by default, on each machine with an Intel GPU, you need to give each user access to the local “render” or “video” groups. If you changed the group ownership on these files to a cluster-wide group ID that is not on every account by default (such as the “users” group, which many users are likely to have), be aware that the alphanumeric version of the group ID may not be available when the GPU device is started. In that case, use the numeric version of the group ID, obtained with getent group <groupname>. A combined check is sketched after this list.
Conflicting or missing driver entries. The OpenCL* Installable Client Driver files (ICD files) can be found in the /etc/OpenCL/vendors directory by default, or, if you followed the process in Step 3: Adjust Location of Intel Graphics Compute Runtime, in some other locations containing files with *.icd extensions. Inspect these *.icd files.
Make sure that there is only one *.icd file for each shared library name (device).
Check that file paths specified in *.icd files point to valid locations (invalid paths are especially likely when you are setting up to enable multiple user-mode drivers to be installed at once).
If OCL_ICD_VENDORS and OCL_ICD_FILENAMES are defined, make sure they point to valid locations and do not have any of the above issues. Make sure that users defined these variables in their runtime environment.
Check that the user loaded the environment modules that give access to oneAPI and your driver environment (Step 5: Generate and Set Up Module Files), or otherwise initialized the oneAPI environment and driver locations.
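The sketch below bundles the checks from the items above into one read-only pass. It assumes only standard shell tools; the group name render is the common default, so substitute the group you actually assigned in Step 4:
# Device-file ownership: confirm the owning group and its numeric GID
ls -l /dev/dri
getent group render
# ICD files: print each file and warn if an absolute library path does not exist
for icd in /etc/OpenCL/vendors/*.icd; do
    echo "== $icd =="
    lib=$(cat "$icd")
    echo "$lib"
    # A bare library name (no leading /) is resolved through the loader search path,
    # so only absolute paths are checked here.
    case "$lib" in
        /*) [ -e "$lib" ] || echo "WARNING: $lib does not exist" ;;
    esac
done
# Environment overrides: print them if the user has set them
env | grep -E 'OCL_ICD_VENDORS|OCL_ICD_FILENAMES'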
Long-running compute jobs crash before completion
By default, the Intel Graphics drivers assume that they are used to run graphics applications whose kernels run to completion many times a second. If a kernel does not complete quickly, the i915 driver kills it on the assumption that it is hanging (where “long-running” means running for more than about a couple of seconds). Contrast this with compute jobs, where a kernel may intentionally run for many seconds or minutes.
Assuming the user applications are correctly implemented, check that the steps you took to disable hangcheck and preemption in Step 4: Set Up User Permissions for Using the Intel GPU Device Files worked correctly. They did not work if either of the following commands returns a value other than zero:
sudo cat /sys/module/i915/parameters/enable_hangcheck
find /sys/devices -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' -exec echo {} \; -exec cat {} \;
If the returned values are not zero, go back to Step 4: Set Up User Permissions for Using the Intel GPU Device Files and fix these.
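To review both settings in one pass, with each value labeled by its path, you can use the following sketch (the paths are the same ones queried above):
sudo cat /sys/module/i915/parameters/enable_hangcheck
sudo find /sys/devices -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' \
    -exec sh -c 'printf "%s: " "$1"; cat "$1"' _ {} \;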
Intel® VTune™ Profiler does not collect performance data from the GPU
There are several reasons why a user might not be able to collect performance information:
System variables described in Step 4: Set Up User Permissions for Using the Intel GPU Device Files are not set up. Both of the following queries must return zero:
cat /proc/sys/dev/i915/perf_stream_paranoid
cat /proc/sys/kernel/yama/ptrace_scope (Ubuntu* only)
VTune drivers are not loaded. To check this, use the following command:
lsmod | egrep 'vtsspp|sep5|socperf3|pax'
You should see something like:
vtsspp                405504  0
sep5                 2170880  0
socperf3              598016  1 sep5
pax                    16384  0
Otherwise, the VTune drivers have not been built or started. Make sure that the VTune drivers for your kernel are built by running build_driver in the sepdk/src subdirectory where VTune is installed, and that the VTune drivers are started with insmod-sep in the same directory.
If the VTune driver is started, make sure that the group assigned to it and the debug file system are the same. You can check this with the following commands:
ls -g /dev | grep sep
ls -g /sys/kernel | grep debug
If the groups are not the same, shut down the VTune drivers and restart them to use the same group:
sudo ./rmmod-sep
sudo ./insmod-sep -g <group>
sudo ./boot-script --install -g <group>
sudo /opt/intel/oneapi/vtune/latest/bin64/prepare-debugfs.sh -g <group>
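The following read-only sketch repeats the individual checks from this section in one pass, so you can capture the state of a node with a single copy-and-paste (it uses only the commands already shown above):
cat /proc/sys/dev/i915/perf_stream_paranoid      # expect 0
cat /proc/sys/kernel/yama/ptrace_scope           # expect 0 (Ubuntu only)
lsmod | egrep 'vtsspp|sep5|socperf3|pax'         # expect the four driver lines
ls -g /dev | grep sep                            # note the group column
ls -g /sys/kernel | grep debug                   # should show the same group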
oneAPI Debugger does not work on graphics processes
Make sure that:
The debugger was set up using the instructions at Get Started with Intel® Distribution for GDB* on Linux* OS Host, in particular that the system was booted with the i915.debug_eu=1 kernel variable.
Users set the environment variable ZET_ENABLE_PROGRAM_DEBUGGING=1 before running gdb-oneapi with a GPU application as the target.
You can run additional tests using the system check utility mentioned in Get Started with Intel® Distribution for GDB* on Linux* OS Host.
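On the user side, a minimal sketch of the runtime checks (the application name ./my_gpu_app is only a placeholder for the user's own binary):
# Confirm the kernel was booted with EU debugging enabled
grep -o 'i915.debug_eu=1' /proc/cmdline || echo "i915.debug_eu=1 not set - a reboot with this parameter is required"
# Enable program debugging in the runtime and start the debugger
export ZET_ENABLE_PROGRAM_DEBUGGING=1
gdb-oneapi ./my_gpu_app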
Query a list of OpenCL devices
Run clinfo -l to get a simple list of all available devices. For example, on a node with two graphic cards you might see:
$ clinfo -l
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: Genuine Intel(R) CPU $0000%@
Platform #2: Intel(R) OpenCL HD Graphics
 +-- Device #0: Intel(R) Graphics [0x0205]
 `-- Device #1: Intel(R) Graphics [0x0205]
If you have sourced the oneAPI environment via setvars.sh, the sycl-ls command shows the list of all available compute devices.
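For example, a minimal sketch assuming oneAPI is installed in the default location (adjust the path otherwise):
source /opt/intel/oneapi/setvars.sh
sycl-ls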
Query Intel GPU properties
Run clinfo without any arguments and save the returned result to a file. In this file, you can find details about the offload compute devices discovered, such as hardware characteristics and version information for the driver used to access each device.
Querying version information is particularly useful when a system provides multiple driver versions and the user needs to check which driver version is actually in use, or when that setup is hidden behind a script. Checking the offload devices returned is also a valuable way to confirm that the offload compute environment exposes the correct or expected devices (some configurations may intentionally expose only a subset of the available hardware).
Use grep commands to search for specific information returned by clinfo, for example (a short grep sketch follows this list):
Device name
Driver version
Max compute units
Max sub-groups per work group
Sub-group sizes
Global memory size
Max memory allocation
Preferred work group size multiple
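For example, a short sketch that saves the full report and pulls a few of these fields back out (the output file name is arbitrary, and the grep patterns follow the field names as clinfo prints them):
clinfo > clinfo_report.txt
grep -iE "device name|driver version|max compute units|global memory size|max memory allocation" clinfo_report.txt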