GPU Management and Monitoring
The nvidia-smi command provided by NVIDIA can be used to manage and monitor GPU enabled Compute Nodes. In conjunction with the xCAT xdsh command, you can easily manage and monitor the entire set of GPU enabled Compute Nodes remotely from the Management Node.
Example:
# xdsh <noderange> "nvidia-smi -i 0 --query-gpu=name,serial,uuid --format=csv,noheader"
node01: Tesla K80, 0322415075970, GPU-b4f79b83-c282-4409-a0e8-0da3e06a13c3
...
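The examples below query a single GPU with -i 0. On nodes with more than one GPU, it can help to first list the GPUs (index, name and UUID) present on each node; a minimal sketch, assuming a noderange named gpunodes:

# List every GPU on each node; "gpunodes" is a placeholder for your own
# GPU enabled noderange.
xdsh gpunodes "nvidia-smi -L"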
Warning
The following commands are provided as a convenience. Always consult the nvidia-smi manpage for the latest supported functions.
Management
Some useful nvidia-smi example commands for management.
Set persistence mode. When persistence mode is enabled, the NVIDIA driver remains loaded even when there are no active clients. DISABLED by default:

nvidia-smi -i 0 -pm 1

Toggle ECC support for the GPU (0 to disable, 1 to enable). The new mode takes effect only after a reboot; use --query-gpu=ecc.mode.pending to check the pending setting [Reboot required]:

nvidia-smi -i 0 -e 0

Reset the ECC volatile/aggregate error counters for the target GPUs:

nvidia-smi -i 0 -p 0/1

Set the compute mode for compute applications; query with --query-gpu=compute_mode:

nvidia-smi -i 0 -c 0/1/2/3

Trigger a reset of the GPU:

nvidia-smi -i 0 -r

Enable or disable Accounting Mode, which lets statistics be calculated for each compute process running on the GPU; query with --query-gpu=accounting.mode:

nvidia-smi -i 0 -am 0/1

Set the maximum power management limit in watts; query with --query-gpu=power.limit:

nvidia-smi -i 0 -pl 200
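These settings are usually applied to every GPU node at once rather than one node at a time. The sketch below, for example, enables persistence mode and sets a 200 W power limit on all GPUs of every node in a noderange, then reads the values back; the noderange name gpunodes and the 200 W value are assumptions (check power.min_limit/power.max_limit for the range supported by your boards).

#!/bin/sh
# Apply GPU management settings across a noderange from the Management Node.
# NODERANGE is a placeholder; replace it with your own GPU enabled nodes.
NODERANGE=gpunodes

# Enable persistence mode on all GPUs of every node (omitting -i targets all GPUs).
xdsh $NODERANGE "nvidia-smi -pm 1"

# Cap the power limit at 200 W; pick a value inside the range reported by
# power.min_limit/power.max_limit for the installed boards.
xdsh $NODERANGE "nvidia-smi -pl 200"

# Read the settings back to confirm they took effect.
xdsh $NODERANGE "nvidia-smi --query-gpu=index,persistence_mode,power.limit --format=csv,noheader"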
Monitoring
Some useful nvidia-smi example commands for monitoring.
The number of NVIDIA GPUs in the system:

nvidia-smi --query-gpu=count --format=csv,noheader

The version of the installed NVIDIA display driver:

nvidia-smi -i 0 --query-gpu=driver_version --format=csv,noheader

The VBIOS version of the GPU board:

nvidia-smi -i 0 --query-gpu=vbios_version --format=csv,noheader

Product name, serial number and UUID of the GPU:

nvidia-smi -i 0 --query-gpu=name,serial,uuid --format=csv,noheader

Fan speed:

nvidia-smi -i 0 --query-gpu=fan.speed --format=csv,noheader

The compute mode flag, which indicates whether individual or multiple compute applications may run on the GPU (also known as the exclusivity mode):

nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader

Percent of time over the past sample period during which one or more kernels was executing on the GPU:

nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader

Total corrected ECC errors detected across the entire chip (the sum of device_memory, register_file, l1_cache, l2_cache and texture_memory):

nvidia-smi -i 0 --query-gpu=ecc.errors.corrected.aggregate.total --format=csv,noheader

Core GPU temperature, in degrees C:

nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader

The ECC mode that the GPU is currently operating under:

nvidia-smi -i 0 --query-gpu=ecc.mode.current --format=csv,noheader

The power management status:

nvidia-smi -i 0 --query-gpu=power.management --format=csv,noheader

The last measured power draw for the entire board, in watts:

nvidia-smi -i 0 --query-gpu=power.draw --format=csv,noheader

The minimum and maximum values in watts to which the power limit can be set:

nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit --format=csv
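Several of these fields can be combined into a single --query-gpu call, which makes it easy to collect a periodic health snapshot from every node with xdsh. A minimal sketch, assuming a noderange named gpunodes and a 30 second polling interval (both placeholders):

#!/bin/sh
# Periodically collect one CSV line per GPU from every node in the noderange;
# xdsh prefixes each output line with the node name.
NODERANGE=gpunodes
INTERVAL=30
QUERY="timestamp,index,temperature.gpu,utilization.gpu,power.draw,ecc.errors.corrected.aggregate.total"

while true; do
    xdsh $NODERANGE "nvidia-smi --query-gpu=$QUERY --format=csv,noheader"
    sleep $INTERVAL
done

The output can be redirected to a file or fed into your site's monitoring tooling as needed.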