Sensors and Hardware Monitoring

AMI uses ASDM for sensor discovery and monitoring.

Whenever a compatible PCIe device is detected, the AMI driver issues the GET_SDR ASDM API call as part of its initial setup procedure. This call retrieves all available sensor information from the device over PCI/GCQ.

This data is stored per device, and the set of discovered sensors remains unchanged for the lifecycle of the device. Subsequent sensor readings use the GET_ALL_SENSOR_DATA API and update the relevant fields in the previously discovered sensor data.

Currently, all sensor readings are grouped by sensor type; AMI does not perform individual sensor readings.

Sensor data is exposed through hwmon, the standard Linux kernel API for reporting sensor data via sysfs nodes. Each read triggers an update to the sensor information (i.e., one GET_ALL_SENSOR_DATA API call for the relevant sensor type), subject to a specified refresh timeout. For example, reading the hwmon node temp1_input requests all temperature readings, updates the stored data, and reports the refreshed value.
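
The read-triggered refresh described above can be sketched in a few lines of Python. This is an illustration only, not the AMI driver code: the class name SensorCache, the fetch callback (a stand-in for a GET_ALL_SENSOR_DATA request), and the interpretation of the refresh timeout (cached values are reused until they are at least as old as the timeout) are all assumptions.

```python
import time
from typing import Callable, Dict, List


class SensorCache:
    """Caches readings per sensor type and refreshes them on read."""

    def __init__(self, fetch: Callable[[str], List[float]],
                 refresh_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hypothetical stand-in for GET_ALL_SENSOR_DATA
        self._timeout = refresh_timeout_s
        self._clock = clock
        self._values: Dict[str, List[float]] = {}
        self._stamp: Dict[str, float] = {}

    def read(self, sensor_type: str, index: int) -> float:
        """Return one reading, refreshing the whole sensor type if it is stale."""
        now = self._clock()
        last = self._stamp.get(sensor_type)
        # Assumption: refresh only when no data exists yet or the timeout has elapsed.
        if last is None or now - last >= self._timeout:
            # One request refreshes every sensor of this type.
            self._values[sensor_type] = self._fetch(sensor_type)
            self._stamp[sensor_type] = now
        return self._values[sensor_type][index]
```

Under these assumptions, two reads of different temperature sensors inside the timeout window result in a single request, because the first read already refreshed every temperature value.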

ASDM Parsing

Below are some important notes regarding the parsing of ASDM data:

  • The term “SID” is used to refer to “repo type”.

  • The SID argument in the XGQ/GCQ payload used to request the size of the sensor data should not be 000b (“Request size of sensor data”). It should instead be the type of the SDR being requested; for example, 010b for temperature data.

  • “Min” is not supported.

  • “Lower thresholds” are not supported.

  • The order of threshold support bits is as follows:
    [0] Upper fatal
    [1] Upper critical
    [2] Upper warning
    [3] Lower warning
    [4] Lower critical
    [5] Lower fatal
    [6] Average
    [7] Max

  • The threshold fields themselves are populated in the following order:
    Lower fatal
    Lower critical
    Lower warning
    Upper fatal
    Upper critical
    Upper warning
    Average
    Max

  • The “status” field comes before “average” and “max”.
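
The two orderings in the notes above (support bits versus populated fields) are easy to confuse, so the following Python sketch spells out the mapping. It is a minimal illustration under stated assumptions: bit [0] is taken to be the least significant bit of the support byte, the names SUPPORT_BITS, FIELD_ORDER, and supported_fields are hypothetical, and the “status” field and the exact byte layout of the record are not modelled.

```python
from typing import List

# Bit positions of the threshold support byte, as listed in the notes.
# Assumption: bit [0] is the least significant bit.
SUPPORT_BITS = {
    "upper_fatal": 0,
    "upper_critical": 1,
    "upper_warning": 2,
    "lower_warning": 3,
    "lower_critical": 4,
    "lower_fatal": 5,
    "average": 6,
    "max": 7,
}

# Order in which the threshold fields are populated in the record.
FIELD_ORDER = [
    "lower_fatal", "lower_critical", "lower_warning",
    "upper_fatal", "upper_critical", "upper_warning",
    "average", "max",
]


def supported_fields(support_byte: int) -> List[str]:
    """Return the supported threshold fields in populated order."""
    return [name for name in FIELD_ORDER
            if support_byte & (1 << SUPPORT_BITS[name])]
```

For example, a support byte with only bits [0] and [5] set reports upper fatal and lower fatal as supported, and the sketch lists lower fatal first because that is the order in which the fields are populated.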

Example HWMON Tree

Below is an example of the sensor interface that AMI exposes to the user via HWMON.

[user@linuxpc][25/05/2023 17:00:32][hwmon3]:) pwd
/sys/class/hwmon/hwmon3
[user@linuxpc][25/05/2023 17:00:41][hwmon3]:) tree
.
├── curr1_average
├── curr1_input
├── curr1_label
├── curr1_max
├── curr1_status
├── curr2_average
├── curr2_input
├── curr2_label
├── curr2_max
├── curr2_status
├── curr3_average
├── curr3_input
├── curr3_label
├── curr3_max
├── curr3_status
├── device -> ../../../0000:c1:00.0
├── in0_average
├── in0_input
├── in0_label
├── in0_max
├── in0_status
├── in1_average
├── in1_input
├── in1_label
├── in1_max
├── in1_status
├── in2_average
├── in2_input
├── in2_label
├── in2_max
├── in2_status
├── name
├── power
│   ├── async
│   ├── autosuspend_delay_ms
│   ├── control
│   ├── runtime_active_kids
│   ├── runtime_active_time
│   ├── runtime_enabled
│   ├── runtime_status
│   ├── runtime_suspended_time
│   └── runtime_usage
├── power1_average
├── power1_input
├── power1_label
├── power1_max
├── power1_status
├── subsystem -> ../../../../../../class/hwmon
├── temp1_input
├── temp1_label
├── temp1_max
├── temp1_status
├── temp2_input
├── temp2_label
├── temp2_max
├── temp2_status
├── temp3_input
├── temp3_label
├── temp3_max
├── temp3_status
└── uevent

3 directories, 58 files
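
The nodes in the tree above are plain text files, so they can be read from user space without any AMI-specific library. The sketch below collects the temperature channels from one hwmon directory; the function name read_temperatures is hypothetical, and the only conversion applied is the standard hwmon convention that tempN_input values are reported in millidegrees Celsius.

```python
from pathlib import Path
from typing import Dict


def read_temperatures(hwmon_dir: str) -> Dict[str, float]:
    """Map each tempN label to its current reading in degrees Celsius."""
    readings: Dict[str, float] = {}
    base = Path(hwmon_dir)
    for input_node in sorted(base.glob("temp*_input")):
        prefix = input_node.name[: -len("_input")]  # e.g. "temp1"
        label_node = base / f"{prefix}_label"
        # Fall back to the channel prefix if no label node exists.
        label = label_node.read_text().strip() if label_node.exists() else prefix
        # hwmon reports temperatures in millidegrees Celsius.
        readings[label] = int(input_node.read_text().strip()) / 1000.0
    return readings
```

With the example tree, this would be called as read_temperatures("/sys/class/hwmon/hwmon3"). Each file read goes through the hwmon interface, so it may trigger a sensor refresh on the device as described earlier.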

Example output of the “sensors” command:

[user@linuxpc][25/05/2023 17:00:42][hwmon3]:) sensors
Alveo-pci-2100
Adapter: PCI adapter
12v_pex:      12.24 V  (max = +12.25 V, avg = +12.24 V)
3v3_pex:       3.34 V  (max =  +3.34 V, avg =  +3.34 V)
vccint:      700.00 mV (max =  +0.70 V, avg =  +0.70 V)
PCB:          +33.0°C  (high = +34.0°C)
device:       +37.0°C  (high = +39.0°C)
vccint:       +37.0°C  (high = +39.0°C)
Total Power:  14.00 W  (avg =  12.00 W, max =  14.00 W)
12v_pex:     980.00 mA (max =  +1.00 A, avg =  +0.91 A)
3v3_pex:     720.00 mA (max =  +0.76 A, avg =  +0.72 A)
vccint:        2.00 A  (max =  +2.50 A, avg =  +1.96 A)

Alveo-pci-c100
Adapter: PCI adapter
12v_pex:      12.23 V  (max = +12.25 V, avg = +12.23 V)
3v3_pex:       3.34 V  (max =  +3.34 V, avg =  +3.34 V)
vccint:      699.00 mV (max =  +0.70 V, avg =  +0.70 V)
PCB:          +35.0°C  (high = +36.0°C)
device:       +38.0°C  (high = +40.0°C)
vccint:       +40.0°C  (high = +41.0°C)
Total Power:  15.00 W  (avg =  12.00 W, max =  15.00 W)
12v_pex:       1.06 A  (max =  +1.06 A, avg =  +0.98 A)
3v3_pex:     740.00 mA (max =  +0.78 A, avg =  +0.74 A)
vccint:        2.20 A  (max =  +2.80 A, avg =  +2.13 A)

Page Revision: v. 12