Alveo I2C Telemetry

AMD/Xilinx® Alveo™ cards support out-of-band (OoB) communication via Alveo I2C/SMBus commands at 7-bit I2C address 0x65 (0xCA in 8-bit form). While 100 kHz and 400 kHz are standard among server BMCs, the Satellite Controller (SC) is tested and supported at I2C speeds between 90 kHz and 700 kHz.
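The relationship between the two address forms can be checked with a short Python sketch. The helper name below is illustrative only and is not part of any Alveo tooling; it assumes the usual I2C convention that the 8-bit form is the 7-bit address shifted left by one, with the read/write flag in bit 0.

```python
ALVEO_I2C_ADDR_7BIT = 0x65  # 7-bit address quoted in the text


def to_8bit_address(addr7: int, read: bool = False) -> int:
    """Shift a 7-bit I2C address left by one and set bit 0 for reads."""
    if not 0 <= addr7 <= 0x7F:
        raise ValueError("7-bit I2C address must be in 0x00..0x7F")
    return (addr7 << 1) | int(read)


print(hex(to_8bit_address(ALVEO_I2C_ADDR_7BIT)))             # 0xca
print(hex(to_8bit_address(ALVEO_I2C_ADDR_7BIT, read=True)))  # 0xcb
```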

The following information is exposed via the Alveo I2C protocol:

  • Thermal sensors such as FPGA, max Board, max DIMM, and max QSFP temperature (if present)
  • Total board power consumption
  • SC FW version number
  • Critical Sensor Data Record (CSDR), specific to the Alveo U30 only

The following table lists the supported commands:

Table: Supported I2C/SMBus Commands

Command/Register Value | Command Description | Transaction Type | Number of Response Bytes
0x01 | Maximum DIMM temperature | Read byte | 1
0x02 | Maximum card temperature | Read byte | 1
0x03 | Card power consumption | Read word | 2
0x04 | Satellite Controller FW version | Block read | 4
0x05 | Maximum FPGA die temperature | Read byte | 1
0x06 | Maximum QSFP temperature | Read byte | 1
0x0F | FPGA reset | Write byte | 1
0x20 | Critical Sensor Data Record | Block read | 64

Note: AMD/Xilinx recommends waiting for 1-2 ms between any two I2C transactions. Without the delay, uninterrupted I2C operation is not guaranteed.
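The command table and the recommended inter-transaction delay can be captured in a small Python sketch. Everything here is an illustration under stated assumptions: `Command`, `COMMANDS`, and `PacedBus` are hypothetical names, and `transact` stands in for whatever I2C/SMBus transport the BMC actually uses (write payloads are omitted for brevity). The 2 ms gap is simply the upper end of the recommended range.

```python
import time
from typing import Callable, Dict, NamedTuple


class Command(NamedTuple):
    description: str
    transaction: str  # SMBus transaction type from the table
    resp_bytes: int   # number of response bytes from the table


# Transcribed from the "Supported I2C/SMBus Commands" table.
COMMANDS: Dict[int, Command] = {
    0x01: Command("Maximum DIMM temperature", "read byte", 1),
    0x02: Command("Maximum card temperature", "read byte", 1),
    0x03: Command("Card power consumption", "read word", 2),
    0x04: Command("Satellite Controller FW version", "block read", 4),
    0x05: Command("Maximum FPGA die temperature", "read byte", 1),
    0x06: Command("Maximum QSFP temperature", "read byte", 1),
    0x0F: Command("FPGA reset", "write byte", 1),
    0x20: Command("Critical Sensor Data Record", "block read", 64),
}

MIN_GAP_S = 0.002  # upper end of the recommended 1-2 ms gap


class PacedBus:
    """Enforce a minimum idle gap between consecutive transactions."""

    def __init__(
        self,
        transact: Callable[[int, int], bytes],  # hypothetical transport hook
        gap_s: float = MIN_GAP_S,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._transact = transact
        self._gap_s = gap_s
        self._clock = clock
        self._sleep = sleep
        self._last_end = float("-inf")  # no transaction has run yet

    def __call__(self, code: int) -> bytes:
        # Sleep only for whatever part of the gap has not already elapsed.
        remaining = self._gap_s - (self._clock() - self._last_end)
        if remaining > 0:
            self._sleep(remaining)
        try:
            return self._transact(code, COMMANDS[code].resp_bytes)
        finally:
            self._last_end = self._clock()
```

Injecting `clock` and `sleep` keeps the pacing logic testable without hardware; a real BMC integration would pass its own transport function as `transact`.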

0x01 - Maximum DIMM Temperature

Note: Not applicable for U30 cards.

The number of DIMMs in Alveo™ cards varies with each product. The primary motivation for the server BMC to read the DIMM temperature is to provide closed-loop thermal monitoring. The most effective way to expose the DIMM temperature is to provide the maximum of all DIMM temperature values. SC FW tracks temperature values internally for all the DIMMs present in the Alveo card and sends only the maximum DIMM temperature value to the server BMC. The server BMC uses command code 0x01 to read the maximum DIMM temperature value. The response data from the Alveo card is 1-byte temperature data (twos complement) with a range of -128°C to 127°C.

Table: Maximum DIMM Temperature, Server BMC Request

Server BMC Request
Command code 0x01
Data bytes N/A

Table: Maximum DIMM Temperature, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0]

1-byte temperature data (twos complement); the range is -128°C to 127°C.

For example:

[Byte 0] = 0xFE represents -2°C

[Byte 0] = 0x23 represents 35°C
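The twos-complement decoding used by this and the other 1-byte temperature commands can be sketched in Python as follows. The function name is illustrative; the two printed values reproduce the examples in the table.

```python
def decode_temperature(raw: int) -> int:
    """Interpret one response byte as a signed twos-complement value in °C."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("expected a single byte (0x00..0xFF)")
    # Values with bit 7 set are negative: subtract 256 to sign-extend.
    return raw - 0x100 if raw & 0x80 else raw


print(decode_temperature(0xFE))  # -2
print(decode_temperature(0x23))  # 35
```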

0x02 - Maximum Board Temperature

Server BMC uses register 0x02 to read the maximum board temperature value. The response data from the Alveo™ card is 1-byte temperature data (twos complement) and the range is -128°C to 127°C.

Table: Maximum Board Temperature, Server BMC Request

Server BMC Request
Command code 0x02
Data bytes N/A

Table: Maximum Board Temperature, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0]

1-byte temperature data (twos complement); the range is -128°C to 127°C.

For example:

[Byte 0] = 0xFE represents -2°C

[Byte 0] = 0x23 represents 35°C

0x03 - Board Power Consumption

Server BMC uses register 0x03 to read the current board power consumption value. The response data from the Alveo™ card is 2-byte power consumption data (LSB first), unit is in watts (W).

Table: Board Power Consumption, Server BMC Request

Server BMC Request
Command code 0x03
Data bytes N/A

Table: Board Power Consumption, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0] [Byte 1]

2-byte power consumption data in watts (W), LSB first. For example:

[Byte 0] [Byte 1] = 0x32 0x00 represents 50W (0x0032)

[Byte 0] [Byte 1] = 0x20 0x01 represents 288W (0x0120)
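Because the power value arrives LSB first, it can be reassembled as a little-endian unsigned integer. A minimal Python sketch (the function name is illustrative) that reproduces both examples above:

```python
def decode_power_watts(resp: bytes) -> int:
    """Combine [Byte 0] (LSB) and [Byte 1] (MSB) into a power value in watts."""
    if len(resp) != 2:
        raise ValueError("power response must be exactly 2 bytes")
    return int.from_bytes(resp, byteorder="little", signed=False)


print(decode_power_watts(bytes([0x32, 0x00])))  # 50
print(decode_power_watts(bytes([0x20, 0x01])))  # 288
```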

0x04 - Satellite Controller Firmware Version

Server BMC uses register 0x04 to read the current SC FW version, which follows xx.yy.zz formatting. The response data from the Alveo™ card is 4 bytes.

Table: SC Firmware Version, Server BMC Request

Server BMC Request
Command code 0x04
Data bytes N/A

Table: SC Firmware Version, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0] [Byte 1] [Byte 2] [Byte 3]

4-byte firmware version, LSB first

[Byte 0]: Firmware version; [Byte 1]: Major revision

[Byte 2]: Minor revision; [Byte 3]: Reserved

For example (bytes listed from [Byte 3] down to [Byte 0], consistent with the byte definitions above):

v6.2.11 = 0x00 0x0B 0x02 0x06

v7.13.9 = 0x00 0x09 0x0D 0x07
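The following Python sketch shows one way to turn the 4-byte response into an xx.yy.zz string. It rests on an assumption that should be verified against real hardware: the bytes arrive in [Byte 0] to [Byte 3] order as defined in the table, which means the example values above are printed from [Byte 3] down to [Byte 0]. The function name is illustrative.

```python
def decode_sc_fw_version(resp: bytes) -> str:
    """Format [Byte 0].[Byte 1].[Byte 2]; [Byte 3] is reserved and ignored."""
    if len(resp) != 4:
        raise ValueError("SC FW version response must be exactly 4 bytes")
    fw_version, major_rev, minor_rev, _reserved = resp
    return f"{fw_version}.{major_rev}.{minor_rev}"


# Example values from the table, reversed into assumed on-wire order.
print(decode_sc_fw_version(bytes(reversed([0x00, 0x0B, 0x02, 0x06]))))  # 6.2.11
print(decode_sc_fw_version(bytes(reversed([0x00, 0x09, 0x0D, 0x07]))))  # 7.13.9
```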

0x05 - Maximum FPGA Die Temperature

Server BMC uses register 0x05 to read the maximum FPGA die temperature value. The response data from the Alveo™ card is 1-byte temperature data (2’s complement) and the range is -128°C to 127°C.

Table: Maximum FPGA Die Temperature, Server BMC Request

Server BMC Request
Command code 0x05
Data bytes N/A

Table: Maximum FPGA Die Temperature, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0]

1-byte temperature data (twos complement); the range is -128°C to 127°C.

For example:

[Byte 0] = 0xFE represents -2°C

[Byte 0] = 0x23 represents 35°C

0x06 - Maximum QSFP Temperature

Note: Not applicable for U30 cards.

Some Alveo™ products come with network interface modules (e.g., QSFP or SFP-DD). The number of SFP modules varies depending on the model. The primary motivation for the server BMC to read the SFP temperature is to provide closed-loop thermal monitoring. The most effective way to expose the SFP temperature is to provide the maximum of all the SFP temperature values.

SC FW internally tracks temperature values for all the SFP modules present in an Alveo™ card, sending only the maximum SFP temperature value to server BMC. Server BMC uses register 0x06 to read the maximum QSFP temperature value. The response data from the Alveo™ card is 1-byte temperature data (twos complement) and the range is -128°C to 127°C.

Table: Maximum QSFP Temperature, Server BMC Request

Server BMC Request
Command code 0x06
Data bytes N/A

Table: Maximum QSFP Temperature, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0]

1-byte temperature data (twos complement); the range is -128°C to 127°C.

For example:

[Byte 0] = 0xFE represents -2°C

[Byte 0] = 0x23 represents 35°C

0x0F - Reset FPGA

A reset of the FPGA through the out-of-band channel is a desirable operation to bring the FPGA out of a stuck condition (e.g., PCIe link down, FPGA lock-up, user workload corruption/hang) that leaves in-band operations ineffective. Server BMC uses register 0x0F to request the reset of the FPGA. Wherever applicable, SC has the capability to reset the FPGA. This feature may not be available in all products. When it is supported, SC firmware responds with the status 0x01 immediately and runs the operation in the background.

Table: Reset FPGA, Server BMC Request

Server BMC Request
Command code 0x0F
Data bytes B0: 0x01 - Cold reset; 0x02 - Warm reset

Table: Reset FPGA, Alveo™ Response

Alveo™ Response
Data bytes [Byte 0]

0x01 - FPGA reset initiated

0x02 - Request failed

0x03 - Operation not supported
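The request and response codes for the reset command map naturally onto enumerations. The Python sketch below is illustrative: the class and function names are hypothetical, and the request is shown simply as the command code followed by data byte B0, without modeling the underlying SMBus framing.

```python
from enum import IntEnum

RESET_FPGA_COMMAND = 0x0F


class ResetType(IntEnum):
    COLD = 0x01
    WARM = 0x02


class ResetStatus(IntEnum):
    INITIATED = 0x01
    FAILED = 0x02
    NOT_SUPPORTED = 0x03


def build_reset_request(kind: ResetType) -> bytes:
    """Return the command code followed by the single data byte B0."""
    return bytes([RESET_FPGA_COMMAND, ResetType(kind)])


def parse_reset_status(byte0: int) -> ResetStatus:
    """Map [Byte 0] of the response to a status; unknown values raise ValueError."""
    return ResetStatus(byte0)


print(build_reset_request(ResetType.WARM).hex())  # 0f02
print(parse_reset_status(0x01).name)              # INITIATED
```

Since the SC answers 0x01 immediately and performs the reset in the background, a caller would treat `INITIATED` as "accepted", not as "completed".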

0x20 - Critical Sensor Data Record (CSDR) Command

Note: Currently, this command is only supported in Alveo™ U30 Hyperscaler SKU.

The CSDR command is implemented as a Block Read from the server BMC’s perspective, and SC sends the data LSB first (i.e., in Byte 0, Byte 1 … Byte 63 order).

The following sensor information is packaged into the CSDR response (64 bytes):

  • Status: Contains TCRIT, PG, ZYNQ error and other status information.
  • Temperature: FPGA, inlet, and outlet sensors.
  • Total power consumption: 3V3 I/V, 12V I/V, 12VAUX I/V.
  • DDR errors: Recoverable and non-recoverable errors.
  • PCIe errors: Recoverable and non-recoverable errors.
  • Network status and temperature, if applicable.

Table: CSDR Command

Offset | Number of Bytes | Register Description | Notes
0 | 4 | Board status information |
4 | 4 | Board security status information |
8 | 1 | Board inlet temperature |
9 | 1 | Board outlet temperature |
10 | 4 | Board edge connector 3.3V input sensor |
14 | 4 | Board edge connector 12V input sensor |
18 | 4 | Board AUX connector 12V input sensor |
22 | 2 | Board total power consumption |
24 | 1 | Device 1 status information |
25 | 2 | Device 1 junction temperature |
27 | 10 | Device 1 advanced error counters |
37 | 1 | Device 2 status information |
38 | 2 | Device 2 junction temperature |
40 | 10 | Device 2 advanced error counters |
50 | 1 | Network module 0 temperature | N/A for U30
51 | 2 | Network module 0 status | N/A for U30
53 | 1 | Network module 1 temperature | N/A for U30
54 | 2 | Network module 1 status | N/A for U30
56 | 8 | Reserved |
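The offset table can be turned directly into a slicing routine. The Python sketch below is a minimal illustration: the field names are hypothetical labels chosen here, while the offsets and sizes are transcribed from the table. It only splits the 64-byte record into raw fields and leaves interpretation to later steps.

```python
# (name, offset, size) transcribed from the CSDR table; names are illustrative.
CSDR_LAYOUT = [
    ("board_status", 0, 4),
    ("board_security_status", 4, 4),
    ("inlet_temp", 8, 1),
    ("outlet_temp", 9, 1),
    ("edge_3v3_sensor", 10, 4),
    ("edge_12v_sensor", 14, 4),
    ("aux_12v_sensor", 18, 4),
    ("total_power", 22, 2),
    ("dev1_status", 24, 1),
    ("dev1_junction_temp", 25, 2),
    ("dev1_error_counters", 27, 10),
    ("dev2_status", 37, 1),
    ("dev2_junction_temp", 38, 2),
    ("dev2_error_counters", 40, 10),
    ("net0_temp", 50, 1),
    ("net0_status", 51, 2),
    ("net1_temp", 53, 1),
    ("net1_status", 54, 2),
    ("reserved", 56, 8),
]
CSDR_SIZE = 64


def split_csdr(record: bytes) -> dict:
    """Slice a 64-byte CSDR response into raw per-field byte strings."""
    if len(record) != CSDR_SIZE:
        raise ValueError(f"CSDR record must be {CSDR_SIZE} bytes, got {len(record)}")
    return {name: record[off:off + size] for name, off, size in CSDR_LAYOUT}


fields = split_csdr(bytes(range(64)))
print(fields["total_power"].hex())  # 1617
```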

Critical Sensor Data Record (CSDR) Command Response

Table: Board Status Information

Note: Bits[18:8] are not applicable for Alveo™ U30.

Bit Field | Bit Field Mapping | Data Format | Sensor Description
Bit[31:27] | Reserved | N/A | N/A
Bit[26:19] | Total # of SC flash writes | 8-bit unsigned; Unit: count | Total # of writes to SC flash, represented in multiples of 100. Ex: 37 count => 3700 writes
Bit[18] | AUX power cable present | 1-bit unsigned; Unit: state | 0: No AUX power cable; 1: AUX cable present
Bit[17] | Network module 1 MODPRSNT | 1-bit unsigned; Unit: state | 0: Not present; 1: Present
Bit[16] | Network module 0 MODPRSNT | 1-bit unsigned; Unit: state | 0: Not present; 1: Present
Bit[15:12] | HBM_CATTRIP event counter | 4-bit unsigned; Unit: count | Number of HBM CATTRIP events after SC code update
Bit[11:8] | TWARN event counter | 4-bit unsigned; Unit: count | Number of TWARN events after SC power up
Bit[7:4] | Power good event counter | 4-bit unsigned; Unit: count | Number of power good events after SC power up
Bit[3:0] | TCRIT event counter | 4-bit unsigned; Unit: count | Number of TCRIT events after SC power up
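The board status fields can be extracted with shifts and masks. The Python sketch below assumes the four bytes at CSDR offset 0 form one 32-bit word sent LSB first (as stated for the CSDR as a whole); the function and key names are illustrative, and the flash-write count is scaled by 100 as described in the table.

```python
def _bits(word: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from an integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_board_status(raw: bytes) -> dict:
    """Decode the 4-byte board status word (assumed little-endian)."""
    if len(raw) != 4:
        raise ValueError("board status must be exactly 4 bytes")
    word = int.from_bytes(raw, "little")
    return {
        "sc_flash_writes": _bits(word, 26, 19) * 100,  # stored in units of 100
        "aux_cable_present": bool(_bits(word, 18, 18)),
        "net1_present": bool(_bits(word, 17, 17)),
        "net0_present": bool(_bits(word, 16, 16)),
        "hbm_cattrip_events": _bits(word, 15, 12),
        "twarn_events": _bits(word, 11, 8),
        "power_good_events": _bits(word, 7, 4),
        "tcrit_events": _bits(word, 3, 0),
    }


# 37 flash-write units, AUX cable present, 2 power good events, 1 TCRIT event.
sample = ((37 << 19) | (1 << 18) | (2 << 4) | 1).to_bytes(4, "little")
print(decode_board_status(sample)["sc_flash_writes"])  # 3700
```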

Table: Board Security Status Information

Bit Field | Bit Field Mapping | Data Format | Sensor Description
Bit[31:16] | Reserved | |
Bit[15] | JTAG Access | 1-bit unsigned; Unit: state | 0: Disabled; 1: Enabled
Bit[14:11] | Flash authentication status | 4-bit unsigned; Unit: state | State: 0=NOT DONE, 1=DONE. Bit 14: FPGA2 Recovery flash device; Bit 13: FPGA2 Primary flash device; Bit 12: FPGA1 Recovery flash device; Bit 11: FPGA1 Primary flash device
Bit[10] | SC_SPI_DEV2_CTRL5 | NA | Reserved
Bit[9] | SC_SPI_DEV2_CTRL4 | 1-bit unsigned; Unit: state | For flash control modes 2b'00 and 2b'01: 0: Flash write protect; 1: Flash write enable
Bit[8:7] | Bit[8]: SC_SPI_DEV2_CTRL3; Bit[7]: SC_SPI_DEV2_CTRL1; DEV2 flash mode control | 2-bit unsigned; Unit: state | 2b'00: DEV2 x2 with WP; 2b'01: SC x1 with WP; 2b'10: DEV2 x4 no WP; 2b'11: Not valid
Bit[6] | SC_SPI_DEV2_CTRL2; Primary/Recovery flash selected | 1-bit unsigned; Unit: state | 0: DEV2 primary flash selected; 1: DEV2 recovery flash selected
Bit[5] | SC_SPI_DEV1_CTRL5 | NA | Reserved
Bit[4] | SC_SPI_DEV1_CTRL4 | 1-bit unsigned; Unit: state | For flash control modes 2b'00 and 2b'01: 0: Flash write protect; 1: Flash write enable
Bit[3:2] | Bit[3]: SC_SPI_DEV1_CTRL3; Bit[2]: SC_SPI_DEV1_CTRL1; DEV1 flash mode control | 2-bit unsigned; Unit: state | 2b'00: DEV1 x2 with WP; 2b'01: SC x1 with WP; 2b'10: DEV1 x4 no WP; 2b'11: Not valid
Bit[1] | SC_SPI_DEV1_CTRL2; Primary/Recovery flash selected | 1-bit unsigned; Unit: state | 0: DEV1 primary flash selected; 1: DEV1 recovery flash selected
Bit[0] | SC_SPI_DEV_SEL; Connects SC to SPI MUX of Dev 1 or Dev 2 | 1-bit unsigned; Unit: state | 0: SC to DEV1 SPI; 1: SC to DEV2 SPI
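A hedged Python sketch of decoding the security status word follows. It assumes the four bytes at CSDR offset 4 are a little-endian 32-bit word and that, in each 2-bit flash mode field, the higher-numbered bit (CTRL3) is the most significant bit of the 2b'xx value. Function, key, and label names are illustrative.

```python
FLASH_MODES = {
    0b00: "device x2 with WP",
    0b01: "SC x1 with WP",
    0b10: "device x4 no WP",
    0b11: "not valid",
}


def decode_security_status(raw: bytes) -> dict:
    """Decode the 4-byte board security status word (assumed little-endian)."""
    if len(raw) != 4:
        raise ValueError("security status must be exactly 4 bytes")
    word = int.from_bytes(raw, "little")

    def bit(n: int) -> bool:
        return bool((word >> n) & 1)

    def device(ctrl2_bit: int) -> dict:
        # Per device: CTRL2 select bit, then the 2-bit mode, then CTRL4.
        return {
            "recovery_flash_selected": bit(ctrl2_bit),
            "flash_mode": FLASH_MODES[(word >> (ctrl2_bit + 1)) & 0b11],
            "flash_write_enabled": bit(ctrl2_bit + 3),
        }

    return {
        "jtag_enabled": bit(15),
        "flash_auth_done": {
            "fpga2_recovery": bit(14),
            "fpga2_primary": bit(13),
            "fpga1_recovery": bit(12),
            "fpga1_primary": bit(11),
        },
        "dev2": device(6),  # Bit[6], Bit[8:7], Bit[9]
        "dev1": device(1),  # Bit[1], Bit[3:2], Bit[4]
        "sc_spi_target": "DEV2" if bit(0) else "DEV1",
    }
```

Note that the write-enable bit is documented as meaningful only for flash control modes 2b'00 and 2b'01, so a caller may want to ignore it in the other modes.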

Table: Board Temperature, Voltage, Current and Power sensors

Bit Field | Bit Field Mapping | Data Format | Sensor Description
Board Inlet Temperature
Byte 0 | Inlet temp sensor value (located at back bracket) | 1-byte twos complement; Unit: Celsius | Range: -128 to 127°C. Example: 0x21 = 33°C, 0xFE = -2°C
Board Outlet Temperature
Byte 0 | Outlet temp sensor value (located at IO bracket) | 1-byte twos complement; Unit: Celsius | Range: -128 to 127°C. Example: 0x21 = 33°C, 0xFE = -2°C
Board Edge Connector 3.3V Input Sensor - Not applicable for U30 cards
Byte[3:2] | Edge connector 3.3V input voltage | 2-byte unsigned | Voltage in volts
Byte[1:0] | Edge connector 3.3V input current | 2-byte unsigned | Current in amps
Board Edge Connector 12V Input Sensor
Byte[3:2] | Edge connector 12V input voltage | 2-byte unsigned | Voltage in volts, LSB 1.25 mV; 0x2570 = 11.98 V
Byte[1:0] | Edge connector 12V input current | 2-byte unsigned | Current in amps, LSB 1.25 mA; 0x2710 = 12.5 A
Board AUX Connector 12V Input Sensor - Not applicable for U30 cards
Byte[3:2] | AUX connector 12V input voltage | 2-byte unsigned | Voltage in volts
Byte[1:0] | AUX connector 12V input current | 2-byte unsigned | Current in amps
Board Total Power
Byte[1:0] | Total card power | 2-byte unsigned; LSB first; Unit: watts | [Byte 0] [Byte 1] = 0x32 0x00 represents 50W (0x0032)
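The 12V edge connector scaling can be illustrated with a short Python sketch. It assumes each 2-byte field is little-endian within the 4-byte sensor block (Byte[1:0] current first, then Byte[3:2] voltage) and applies the 1.25 mV and 1.25 mA LSB weights that the table gives for the 12V edge sensor only. The names are illustrative.

```python
LSB_VOLTS = 0.00125  # 1.25 mV per count (12V edge connector sensor)
LSB_AMPS = 0.00125   # 1.25 mA per count (12V edge connector sensor)


def decode_edge_12v_sensor(raw: bytes) -> tuple:
    """Return (volts, amps) from the 4-byte 12V edge connector sensor block."""
    if len(raw) != 4:
        raise ValueError("sensor block must be exactly 4 bytes")
    current_counts = int.from_bytes(raw[0:2], "little")  # Byte[1:0]
    voltage_counts = int.from_bytes(raw[2:4], "little")  # Byte[3:2]
    return voltage_counts * LSB_VOLTS, current_counts * LSB_AMPS


# Table examples: 0x2570 counts of voltage, 0x2710 counts of current.
block = (0x2710).to_bytes(2, "little") + (0x2570).to_bytes(2, "little")
volts, amps = decode_edge_12v_sensor(block)
print(f"{volts:.2f} V, {amps:.1f} A")  # 11.98 V, 12.5 A
```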

Table: FPGA Device 1 and 2 - Status, Temperature & Error information

Bit Field | Bit Field Mapping | Data Format | Sensor Description
Device 1 Status Information
Bit[7:4] | KeepAlive enum | 4-bit unsigned; Unit: count | Heartbeat counter from FPGA device
Bit[3] | ERRORn_STATUS | 1-bit unsigned; Unit: state | Device PS_ERROR_STATUS pin status. N/A for U30. For details, refer to [*]
Bit[2] | ERRORn | 1-bit unsigned; Unit: state | Device PS_ERROR_OUT pin status. N/A for U30. For details, refer to [*]
Bit[1] | INIT_B | 1-bit unsigned; Unit: state | Device INIT_B pin status. For details, refer to [**]
Bit[0] | FPGA_DONE | 1-bit unsigned; Unit: state | Device DONE pin status. For details, refer to [**]
Device 1 Junction Temperature
Bit[15:8] | HBM junction temperature | 1-byte twos complement; Unit: Celsius | N/A for U30. Example: 0x21 = 33°C, 0xFE = -2°C
Bit[7:0] | FPGA junction temperature | 1-byte twos complement; Unit: Celsius | Example: 0x21 = 33°C, 0x23 = 35°C, 0xFE = -2°C
Device 1 Advanced Error Counters
Byte[9:6] | PCIe correctable error counter | 4-byte unsigned; LSB first; Unit: count | Number of correctable PCIe errors for device 1 after device/SC reboot
Byte[5:4] | PCIe uncorrectable error counter | 2-byte unsigned; LSB first; Unit: count | Number of uncorrectable PCIe errors for device 1 after device/SC reboot
Byte[3:2] | DDR correctable error counter | 2-byte unsigned; LSB first; Unit: count | Number of correctable DDR errors for device 1 after device/SC reboot
Byte[1:0] | DDR uncorrectable error counter | 2-byte unsigned; LSB first; Unit: count | Number of uncorrectable DDR errors for device 1 after device/SC reboot

Device 2 Status Information
Bit[7:4] | KeepAlive enum | 4-bit unsigned; Unit: count | Heartbeat counter from FPGA device
Bit[3] | ERRORn_STATUS | 1-bit unsigned; Unit: state | Device PS_ERROR_STATUS pin status. For details, refer to [*]
Bit[2] | ERRORn | 1-bit unsigned; Unit: state | Device PS_ERROR_OUT pin status. For details, refer to [*]
Bit[1] | INIT_B | 1-bit unsigned; Unit: state | Device INIT_B pin status. For details, refer to [**]
Bit[0] | FPGA_DONE | 1-bit unsigned; Unit: state | Device DONE pin status. For details, refer to [**]
Device 2 Junction Temperature
Bit[15:8] | HBM junction temperature | 1-byte twos complement; Unit: Celsius | N/A for U30. Example: 0x21 = 33°C, 0xFE = -2°C
Bit[7:0] | FPGA junction temperature | 1-byte twos complement; Unit: Celsius | Example: 0x21 = 33°C, 0x23 = 35°C, 0xFE = -2°C
Device 2 Advanced Error Counters
Byte[9:6] | PCIe correctable error counter | 4-byte unsigned; LSB first; Unit: count | Number of correctable PCIe errors for device 2 after device/SC reboot
Byte[5:4] | PCIe uncorrectable error counter | 2-byte unsigned; LSB first; Unit: count | Number of uncorrectable PCIe errors for device 2 after device/SC reboot
Byte[3:2] | DDR correctable error counter | 2-byte unsigned; LSB first; Unit: count | Number of correctable DDR errors for device 2 after device/SC reboot
Byte[1:0] | DDR uncorrectable error counter | 2-byte unsigned; LSB first; Unit: count | Number of uncorrectable DDR errors for device 2 after device/SC reboot

[*] -> See Zynq UltraScale+ Device Technical Reference Manual for signal definition
[**] -> See UltraScale Architecture Configuration User Guide for signal definition
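Each device occupies 13 contiguous CSDR bytes (1 status byte, 2 junction temperature bytes, 10 error counter bytes), so one `struct` format can unpack the whole block. The Python sketch below assumes little-endian packing throughout, with the FPGA temperature (Bit[7:0]) arriving before the HBM temperature (Bit[15:8]). The function and key names are illustrative.

```python
import struct

# B: status, b: FPGA temp, b: HBM temp, then DDR uncorrectable (H),
# DDR correctable (H), PCIe uncorrectable (H), PCIe correctable (I).
_DEVICE_FORMAT = "<BbbHHHI"  # 13 bytes, no padding with "<"


def decode_device_block(block: bytes) -> dict:
    """Decode one device's status, junction temperatures, and error counters."""
    if len(block) != struct.calcsize(_DEVICE_FORMAT):
        raise ValueError("device block must be exactly 13 bytes")
    (status, fpga_c, hbm_c,
     ddr_unc, ddr_cor, pcie_unc, pcie_cor) = struct.unpack(_DEVICE_FORMAT, block)
    return {
        "keepalive": status >> 4,
        "error_status_pin": (status >> 3) & 1,
        "error_out_pin": (status >> 2) & 1,
        "init_b_pin": (status >> 1) & 1,
        "done_pin": status & 1,
        "fpga_temp_c": fpga_c,
        "hbm_temp_c": hbm_c,  # N/A for U30 per the table
        "ddr_uncorrectable": ddr_unc,
        "ddr_correctable": ddr_cor,
        "pcie_uncorrectable": pcie_unc,
        "pcie_correctable": pcie_cor,
    }


sample = bytes([0xA3, 0x23, 0xFE]) + struct.pack("<HHHI", 1, 2, 3, 70000)
print(decode_device_block(sample)["fpga_temp_c"])  # 35
```

For Device 1 the block would be CSDR bytes 24 to 36, and for Device 2 bytes 37 to 49, following the offset table.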

Table: Network Module (QSFP) - Temperature and Status information

Note: Not applicable for U30 cards.

Bit Field | Bit Field Mapping | Data Format | Sensor Description
Network Module 0 Temperature
Byte 0 | Network module 0 temperature | 1-byte twos complement; Unit: Celsius | Range: -128 to 127°C. Example: 0x21 = 33°C, 0xFE = -2°C
Network Module 0 Status
Bit[15] | Reserved | N/A | N/A
Bit[14] | OverCurrentL | 1-bit unsigned; Unit: state | 0: Normal operation; 1: Over-current event
Bit[13] | PowerEnL | 1-bit unsigned; Unit: state | 0: Power off; 1: Power enabled
Bit[12:11] | TxFault[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Transmitter detected a fault
Bit[10:9] | TxDisable[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Transmitter output turned off by host
Bit[8:7] | RxLos[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Optical signal level low
Bit[6:3] | RS0-[2:1],RS1-[2:1] | 4-bit unsigned; Unit: state | 4 bits for SFP-DD; 2 bits for SFP; N/A for QSFP. Speed select by host
Bit[2] | LPMode | 1-bit unsigned; Unit: state | All module types: Power mode control from host. 0: Normal; 1: Low power mode
Bit[1] | IntL | 1-bit unsigned; Unit: state | QSFP only. 0: No event; 1: Interrupt asserted
Bit[0] | ModPrsL | 1-bit unsigned; Unit: state | All module types. 0: Module absent; 1: Module present

Network Module 1 Temperature
Byte 0 | Network module 1 temperature | 1-byte twos complement; Unit: Celsius | Range: -128 to 127°C. Example: 0x21 = 33°C, 0xFE = -2°C
Network Module 1 Status
Bit[15] | Reserved | N/A | N/A
Bit[14] | OverCurrentL | 1-bit unsigned; Unit: state | 0: Normal operation; 1: Over-current event
Bit[13] | PowerEnL | 1-bit unsigned; Unit: state | 0: Power off; 1: Power enabled
Bit[12:11] | TxFault[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Transmitter detected a fault
Bit[10:9] | TxDisable[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Transmitter output turned off by host
Bit[8:7] | RxLos[1:0] | 2-bit unsigned; Unit: state; [1:0] for SFP-DD; [0] for SFP | N/A for QSFP. 0: No event; 1: Optical signal level low
Bit[6:3] | RS0-[2:1],RS1-[2:1] | 4-bit unsigned; Unit: state | 4 bits for SFP-DD; 2 bits for SFP; N/A for QSFP. Speed select by host
Bit[2] | LPMode | 1-bit unsigned; Unit: state | All module types: Power mode control from host. 0: Normal; 1: Low power mode
Bit[1] | IntL | 1-bit unsigned; Unit: state | QSFP only. 0: No event; 1: Interrupt asserted
Bit[0] | ModPrsL | 1-bit unsigned; Unit: state | All module types. 0: Module absent; 1: Module present
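The 16-bit network module status word can be decoded with the same shift-and-mask approach. The Python sketch below assumes the two status bytes are little-endian and follows the polarities exactly as the table states them (for example, 1 means module present). The function and key names are illustrative; multi-bit fields are returned as raw integers because their meaning depends on the module type.

```python
def decode_network_status(raw: bytes) -> dict:
    """Decode the 2-byte network module status word (assumed little-endian)."""
    if len(raw) != 2:
        raise ValueError("network module status must be exactly 2 bytes")
    word = int.from_bytes(raw, "little")
    return {
        "over_current": bool((word >> 14) & 1),
        "power_enabled": bool((word >> 13) & 1),
        "tx_fault": (word >> 11) & 0b11,      # [1:0] SFP-DD, [0] SFP
        "tx_disable": (word >> 9) & 0b11,     # [1:0] SFP-DD, [0] SFP
        "rx_los": (word >> 7) & 0b11,         # [1:0] SFP-DD, [0] SFP
        "rate_select": (word >> 3) & 0b1111,  # RS0/RS1 speed select bits
        "low_power_mode": bool((word >> 2) & 1),
        "interrupt": bool((word >> 1) & 1),   # QSFP only
        "module_present": bool(word & 1),
    }


# Powered module, present, TxFault[0] set, RxLos[1] set, rate select 0b0101.
sample = ((1 << 13) | (0b01 << 11) | (0b10 << 7) | (0b0101 << 3) | 1).to_bytes(2, "little")
print(decode_network_status(sample)["module_present"])  # True
```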

AMD Support

For support resources such as answers, documentation, downloads, and forums, see the Alveo Accelerator Cards AMD/Xilinx Community Forum.

License

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

All images and documentation, including all debug and support documentation, are licensed under the Creative Commons (CC) Attribution 4.0 International License (the “CC-BY-4.0 License”); you may not use this file except in compliance with the CC-BY-4.0 License.

You may obtain a copy of the CC-BY-4.0 License at https://creativecommons.org/licenses/by/4.0/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

XD038 | © Copyright 2023, Advanced Micro Devices Inc.