Sensors and Hardware Monitoring

AMI uses ASDM for sensor discovery and monitoring.

When a compatible PCIe device is detected, the AMI driver issues the GET_SDR ASDM API call as part of its initial setup procedure. This call retrieves all available sensor information from the device over PCI/GCQ.

The discovered sensor records are stored per device and persist for the lifetime of the device. Subsequent sensor readings use the GET_ALL_SENSOR_DATA API and update the relevant fields in these previously discovered records.

Sensor readings are currently requested per sensor type; AMI does not read individual sensors.

Sensor data is exposed through hwmon, the standard Linux kernel API for reporting sensor data via sysfs nodes. Each read triggers an update to the sensor information (that is, one GET_ALL_SENSOR_DATA API call per sensor type), subject to a specified refresh timeout. For example, reading the hwmon node temp1_input requests all temperature readings, updates the relevant data, and reports the refreshed values.
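
The per-type refresh behaviour can be illustrated with a small user-space model. This is only a sketch, not the driver's code: the class name, the fetch callback (a stand-in for GET_ALL_SENSOR_DATA), and the timeout value are all hypothetical. It also assumes that the refresh timeout means cached values are reused until the timeout has elapsed since the last refresh.

```python
import time
from typing import Callable, Dict, List


class SensorTypeCache:
    """Caches readings per sensor type and refreshes them at most once per timeout."""

    def __init__(self, fetch: Callable[[str], List[float]], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical stand-in for GET_ALL_SENSOR_DATA
        self._timeout_s = timeout_s  # hypothetical refresh timeout, in seconds
        self._clock = clock
        self._values: Dict[str, List[float]] = {}
        self._stamp: Dict[str, float] = {}

    def read(self, sensor_type: str, index: int) -> float:
        now = self._clock()
        last = self._stamp.get(sensor_type)
        # Assumption: refresh only when no data exists or the timeout has elapsed.
        if last is None or now - last >= self._timeout_s:
            # One request refreshes every sensor of this type.
            self._values[sensor_type] = self._fetch(sensor_type)
            self._stamp[sensor_type] = now
        return self._values[sensor_type][index]
```

In this model, reading any one temperature sensor refreshes all temperature values, which mirrors the temp1_input example above.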

ASDM Parsing

Note the following when parsing ASDM data:

  • The term “SID” is used to refer to “repo type”.

  • When requesting the size of sensor data, the SID argument in the XGQ/GCQ payload should not be 000b (“Request size of sensor data”). It should instead be the type of the SDR being requested; for example, 010b for temperature data.

  • “Min” is not supported.

  • “Lower thresholds” are not supported.

  • The order of threshold support bits is as follows:

    • [0] Upper fatal

    • [1] Upper critical

    • [2] Upper warning

    • [3] Lower warning

    • [4] Lower critical

    • [5] Lower fatal

    • [6] Average

    • [7] Max

  • The threshold fields themselves are populated in the following order:

    • Lower fatal

    • Lower critical

    • Lower warning

    • Upper fatal

    • Upper critical

    • Upper warning

    • Average

    • Max

  • The “status” field comes before “average” and “max”.
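
The two orderings above are easy to mix up, so the following sketch maps a threshold support byte to the supported field names, returned in the order in which the fields are populated. It assumes that index [0] is the least significant bit of the byte. The function and constant names are hypothetical, and the sketch does not model the byte layout of the record or the position of the “status” field.

```python
from typing import List

# Bit positions in the threshold support byte, as listed above
# (assumption: [0] is the least significant bit).
SUPPORT_BITS = [
    "upper_fatal", "upper_critical", "upper_warning",
    "lower_warning", "lower_critical", "lower_fatal",
    "average", "max",
]

# Order in which the fields are populated, as listed above.
FIELD_ORDER = [
    "lower_fatal", "lower_critical", "lower_warning",
    "upper_fatal", "upper_critical", "upper_warning",
    "average", "max",
]


def supported_fields(support_byte: int) -> List[str]:
    """Return the supported field names in the order they are populated."""
    flagged = {
        name for bit, name in enumerate(SUPPORT_BITS)
        if support_byte & (1 << bit)
    }
    return [name for name in FIELD_ORDER if name in flagged]
```

For example, a support byte of 0b11000111 yields the three upper thresholds followed by average and max.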

Thresholds

As mentioned above, some of the sensors have threshold fields configured (a value of “0” indicates that the threshold is not set or used). When the Upper Critical value is reached, the AMI driver kills any processes attached to that device.
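
As an illustration of the “0 means not set” rule, the sketch below classifies a reading against its upper thresholds. The function name and return strings are hypothetical, and it assumes a threshold counts as reached when the value is greater than or equal to it. It does not reproduce the driver's process-kill logic.

```python
def classify(value: float, warn: float = 0.0, crit: float = 0.0,
             fatal: float = 0.0) -> str:
    """Return the highest upper threshold that `value` has reached, or "ok"."""
    # A threshold of 0 means "not set or used", so it is skipped.
    for name, limit in (("fatal", fatal), ("critical", crit), ("warning", warn)):
        if limit != 0 and value >= limit:
            return name
    return "ok"
```

With limits of 80, 85, and no fatal threshold, a reading of 86 is classified as critical, and no reading can be classified as fatal.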

The thresholds are set by the card’s management firmware (see Profiles and CMake build process#SensorsProfile). To view these values, run the ami_tool sensors command with the “-x limits” flag, as shown below for the V80:

Sensor limits

% ami_tool sensors -d 21 -x limits

Name            |      Value | Status    |   Limits (Warn, Crit, Fatal)
-----------------------------------------------------------------------
1V2_GTXAVTT     |   39.000 C | valid     |                          N/A
                |    8.000 A | valid     |                          N/A
                |    1.200 V | valid     |                          N/A
-----------------------------------------------------------------------
0V88_VCC_CPM5   |   36.000 C | valid*    |                          N/A
                |    3.000 A | valid*    |                          N/A
                |    0.879 V | valid*    |                          N/A
-----------------------------------------------------------------------
PCB             |   38.000 C | valid*    |     80.000,  85.000,  95.000
-----------------------------------------------------------------------
Device          |   46.000 C | valid*    |     92.000, 100.000, 105.000
-----------------------------------------------------------------------
VCCINT          |   42.000 C | valid*    |    100.000, 110.000, 125.000
                |   33.000 A | valid*    |                          N/A
                |    0.800 V | valid*    |                          N/A
-----------------------------------------------------------------------
Module_0        |    0.000 C | invalid   |     80.000,  85.000,     N/A
-----------------------------------------------------------------------
Module_1        |    0.000 C | invalid   |     80.000,  85.000,     N/A
-----------------------------------------------------------------------
Module_2        |   36.000 C | valid*    |     80.000,  85.000,     N/A
-----------------------------------------------------------------------
Module_3        |    0.000 C | invalid   |     80.000,  85.000,     N/A
-----------------------------------------------------------------------
DIMM            |   36.000 C | valid*    |                          N/A
-----------------------------------------------------------------------
1V2_VCC_HBM     |   43.000 C | valid*    |                          N/A
                |    4.000 A | valid*    |                          N/A
                |    1.200 V | valid*    |                          N/A
-----------------------------------------------------------------------
Total_Power     |   64.074 W | valid     |                          N/A
-----------------------------------------------------------------------
12V_AUX1        |    1.119 A | valid*    |     12.500,  12.750,     N/A
                |   12.192 V | valid*    |                          N/A
-----------------------------------------------------------------------
12V_AUX2        |    1.459 A | valid*    |     12.500,  12.750,     N/A
                |   12.192 V | valid*    |                          N/A
-----------------------------------------------------------------------
1V2_VCCO_DIMM   |    1.119 A | valid*    |                          N/A
                |    1.208 V | valid*    |                          N/A
-----------------------------------------------------------------------
3V3_PEX         |    1.699 A | valid*    |      3.000,   3.150,     N/A
                |    3.304 V | valid*    |                          N/A
-----------------------------------------------------------------------
12V_PEX         |    2.239 A | valid*    |      5.500,   5.750,     N/A
                |   12.176 V | valid*    |                          N/A
-----------------------------------------------------------------------
3V3_QSFP        |    0.040 A | valid*    |                          N/A
                |    3.296 V | valid*    |                          N/A
-----------------------------------------------------------------------
1V5_VCCAUX      |    1.497 V | valid*    |                          N/A

Example HWMON Tree

Below is an example of the sensor interface that AMI exposes to the user via HWMON.

[user@linuxpc][25/05/2023 17:00:32][hwmon3]:) pwd
/sys/class/hwmon/hwmon3
[user@linuxpc][25/05/2023 17:00:41][hwmon3]:) tree
.
├── curr1_average
├── curr1_input
├── curr1_label
├── curr1_max
├── curr1_status
├── curr2_average
├── curr2_input
├── curr2_label
├── curr2_max
├── curr2_status
├── curr3_average
├── curr3_input
├── curr3_label
├── curr3_max
├── curr3_status
├── device -> ../../../0000:c1:00.0
├── in0_average
├── in0_input
├── in0_label
├── in0_max
├── in0_status
├── in1_average
├── in1_input
├── in1_label
├── in1_max
├── in1_status
├── in2_average
├── in2_input
├── in2_label
├── in2_max
├── in2_status
├── name
├── power
│   ├── async
│   ├── autosuspend_delay_ms
│   ├── control
│   ├── runtime_active_kids
│   ├── runtime_active_time
│   ├── runtime_enabled
│   ├── runtime_status
│   ├── runtime_suspended_time
│   └── runtime_usage
├── power1_average
├── power1_input
├── power1_label
├── power1_max
├── power1_status
├── subsystem -> ../../../../../../class/hwmon
├── temp1_input
├── temp1_label
├── temp1_max
├── temp1_status
├── temp2_input
├── temp2_label
├── temp2_max
├── temp2_status
├── temp3_input
├── temp3_label
├── temp3_max
├── temp3_status
└── uevent

3 directories, 58 files
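
A tree like the one above can also be read from a script. The following is a minimal sketch, assuming only the standard hwmon sysfs units (millidegrees Celsius, millivolts, milliamperes, and microwatts). The function name is hypothetical, and the hwmon index (hwmon3 here) varies between systems.

```python
from pathlib import Path
from typing import Dict

# Divisors from hwmon sysfs units to base units (degrees C, V, A, W).
DIVISOR = {"temp": 1000, "in": 1000, "curr": 1000, "power": 1_000_000}


def read_inputs(hwmon_dir: str) -> Dict[str, float]:
    """Map "kind:label" to the *_input value of each channel, in base units."""
    readings: Dict[str, float] = {}
    root = Path(hwmon_dir)
    for node in sorted(root.glob("*_input")):
        channel = node.name[: -len("_input")]  # e.g. "temp1"
        kind = channel.rstrip("0123456789")    # e.g. "temp"
        if kind not in DIVISOR:
            continue
        label_node = root / f"{channel}_label"
        # Fall back to the channel name when no label node exists.
        label = label_node.read_text().strip() if label_node.exists() else channel
        # The kind is part of the key because labels repeat across sensor types.
        readings[f"{kind}:{label}"] = int(node.read_text()) / DIVISOR[kind]
    return readings
```

Calling read_inputs("/sys/class/hwmon/hwmon3") would return one entry per channel; each file read may trigger a sensor refresh in the driver, as described earlier.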

Example output of the “sensors” command:

[user@linuxpc][25/05/2023 17:00:42][hwmon3]:) sensors

Alveo-pci-2100
Adapter: PCI adapter
VCCINT:        800.00 mV (avg =  +0.80 V, highest =  +0.80 V)
1V2_VCC_HBM:     1.20 V  (avg =  +1.20 V, highest =  +1.20 V)
12V_AUX1:       12.21 V  (avg = +12.22 V, highest = +12.22 V)
12V_AUX2:       12.22 V  (avg = +12.22 V, highest = +12.22 V)
1V2_VCCO_DIMM:   1.21 V  (avg =  +1.21 V, highest =  +1.21 V)
3V3_PEX:         3.31 V  (avg =  +3.31 V, highest =  +3.31 V)
12V_PEX:        12.19 V  (avg = +12.20 V, highest = +12.20 V)
3V3_QSFP:        3.31 V  (avg =  +3.31 V, highest =  +3.31 V)
1V5_VCCAUX:      1.50 V  (avg =  +1.50 V, highest =  +1.50 V)
1V2_GTXAVTT:     1.20 V  (avg =  +1.20 V, highest =  +1.20 V)
0V88_VCC_CPM5: 879.00 mV (avg =  +0.88 V, highest =  +0.88 V)
PCB:            +34.0°C  (high = +80.0°C, crit low = +85.0°C)
                         (crit = +95.0°C, highest = +34.0°C)
Device:         +42.0°C  (high = +92.0°C, crit low = +100.0°C)
                         (crit = +105.0°C, highest = +43.0°C)
VCCINT:         +38.0°C  (high = +100.0°C, crit low = +110.0°C)
                         (crit = +125.0°C, highest = +39.0°C)
Module_0:       +34.0°C  (high = +80.0°C, crit low = +85.0°C)
                         (highest = +34.0°C)
Module_1:       +34.0°C  (high = +80.0°C, crit low = +85.0°C)
                         (highest = +34.0°C)
Module_2:       +32.0°C  (high = +80.0°C, crit low = +85.0°C)
                         (highest = +32.0°C)
Module_3:       +30.0°C  (high = +80.0°C, crit low = +85.0°C)
                         (highest = +30.0°C)
DIMM:           +32.0°C  (highest = +32.0°C)
1V2_VCC_HBM:    +40.0°C  (highest = +40.0°C)
1V2_GTXAVTT:    +36.0°C  (highest = +36.0°C)
0V88_VCC_CPM5:  +34.0°C  (highest = +34.0°C)
Total_Power:    63.53 W  (highest =  64.03 W, avg =  63.65 W)
VCCINT:         34.00 A  (avg = +33.00 A, highest = +34.00 A)
1V2_VCC_HBM:     4.00 A  (avg =  +4.00 A, highest =  +4.00 A)
12V_AUX1:        1.10 A  (crit min = +12.75 A, max = +12.50 A)
                         (avg =  +1.10 A, highest =  +1.12 A)
12V_AUX2:        1.46 A  (crit min = +12.75 A, max = +12.50 A)
                         (avg =  +1.46 A, highest =  +1.48 A)
1V2_VCCO_DIMM:   1.10 A  (avg =  +1.11 A, highest =  +1.12 A)
3V3_PEX:         1.64 A  (crit min =  +3.15 A, max =  +3.00 A)
                         (avg =  +1.64 A, highest =  +1.64 A)
12V_PEX:         2.22 A  (crit min =  +5.75 A, max =  +5.50 A)
                         (avg =  +2.21 A, highest =  +2.22 A)
3V3_QSFP:      120.00 mA (avg =  +0.12 A, highest =  +0.14 A)
1V2_GTXAVTT:     8.00 A  (avg =  +8.00 A, highest =  +8.00 A)
0V88_VCC_CPM5:   3.00 A  (avg =  +3.00 A, highest =  +3.00 A)

Page Revision: v. 20