Alveo Debug Guide

Common Steps

This page covers various support and debugging techniques used across other pages in the debug guide. If you are just starting to debug, consult the main page to determine the best starting point for your needs.

The techniques have been grouped into the following three sections:

Section         Items Covered
System Steps    Host hardware, BIOS, and OS setup
XRT Steps       XRT installation and operation
Card Steps      Card programming and operation

System Steps

This section covers steps involving the host hardware, BIOS, and OS setup.


Log machine state

To quickly collect system and card information during debugging, download and run the following script: WTSxrtLog.sh. The script captures:

  • Initial dmesg state

  • OS

  • lspci state

  • System information with dmidecode and BIOS settings

  • XRT state

  • Which cards pass validation

  • dmesg state after XRT commands


Root/sudo access

For certain actions, root/sudo privileges are required. Determine root/sudo privileges using the whoami and/or sudo whoami commands.

The whoami command reports the current user:

:~> whoami
 root

If the response is root, you have the needed permissions. Otherwise, try:

:~> sudo whoami
root

A response of root indicates your account has sudo access. If you do not have the needed permissions, speak with your system administrator.


Host machine and hypervisor information

To display host machine information and determine if a VM environment is being used, run the following Linux command: sudo dmidecode | less

It will display:

  1. Serial Number

  2. Model

  3. BIOS version

XRT has support for the KVM hypervisor and virtual machines. The dmidecode output has enough information to confirm if XRT is running on a physical machine versus a virtual machine using the manufacturer and product name lines in the beginning of the report. You can also use this information to determine if the machine is a workstation or a server, with a web search.

A physical machine will report the manufacturer and model of the hardware. Common manufacturers include Dell, HP, SuperMicro and IBM. An example of this output, for a Dell server, is displayed below.

Handle 0x0100, DMI type 1, 27 bytes
System Information
	Manufacturer: Dell Inc.
	Product Name: PowerEdge R730
	Version: Not Specified
	Serial Number: xxxxxx

An example of the BIOS version found is displayed below. The BIOS displayed is version 2.8.0 in this case.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: Dell Inc.
	Version: 2.8.0

A virtual machine does not list a specific product model. The manufacturer also changes to the VM provider (normally a software company) as displayed below:

Handle 0x0100, DMI type 1, 27 bytes
System Information
Manufacturer: QEMU
Product Name: Standard PC (Q35 + ICH9, 2009)

Common VM manufacturers include QEMU, Xen, VMware, VirtualBox/Oracle, OpenStack, and KVM. Only KVM is supported; see https://developer.xilinx.com/en/articles/using-alveo-data-center-accelerator-cards-in-a-kvm-environment.html for setup details.
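
The manufacturer check described above can be scripted. The following is a minimal sketch and not part of XRT: classify_host is a hypothetical helper name, and the list of VM vendor strings is simply taken from the names mentioned in this section, so it is an assumption that it covers your hypervisor.

```shell
# Hypothetical helper (not an XRT tool): reads dmidecode text on stdin and
# prints "virtual (<manufacturer>)" or "physical (<manufacturer>)".
# Assumption: the vendor list below mirrors the VM names listed in this guide.
classify_host() {
    awk '
        /^System Information/ { in_sys = 1; next }
        /^Handle /            { in_sys = 0 }
        in_sys && /Manufacturer:/ && mfr == "" {
            sub(/^[ \t]*Manufacturer:[ \t]*/, "")
            mfr = $0
        }
        in_sys && /Product Name:/ && prod == "" {
            sub(/^[ \t]*Product Name:[ \t]*/, "")
            prod = $0
        }
        END {
            kind = "physical"
            if ((mfr " " prod) ~ /QEMU|Xen|VMware|VirtualBox|OpenStack|KVM/)
                kind = "virtual"
            print kind " (" mfr ")"
        }
    '
}
```

On a host this could be used as sudo dmidecode | classify_host. Only the System Information block is inspected, so manufacturer lines from other DMI blocks (for example the base board) are ignored.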


Determine Linux release

Use the cat /etc/*release command to determine the Linux release:

:~> cat /etc/*release
NAME="Red Hat Enterprise Linux"
VERSION="9.1 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.1 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/9/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"
Red Hat Enterprise Linux release 9.1 (Plow)
Red Hat Enterprise Linux release 9.1 (Plow)

In this example, the NAME and VERSION lines show that the system is running RHEL 9.1.

Ubuntu or CentOS outputs look similar.


Determine Linux kernel and header information

For XRT usage, the machine needs compilation header files that match the machine’s Linux kernel. Run uname -r to get the kernel version, as below:

:~> uname -r
4.18.0-193.el8.x86_64

In this example, the system is running kernel version 4.18.0-193. For Ubuntu, only the GA kernels released with the LTS are supported, not the HWE kernels (refer to the latest XRT Release Notes, UG1451, for supported OS/kernel versions).

The kernel headers package names differ between distributions. Directions for Ubuntu and RHEL/CentOS follow.

Ubuntu

Run sudo apt list | grep linux-headers-<kernel release>, where <kernel release> is the output of uname -r, to check which version of the headers package is installed, as shown below:

:~> sudo apt list | grep linux-headers-4.15.0-112-generic

linux-headers-4.15.0-112-generic/bionic-security,now 4.15.0-112.113 amd64 [installed,automatic]

RHEL/CentOS

Run yum list | grep kernel- to check which versions of the kernel-devel and kernel-headers packages are installed, as displayed below:

:~> yum list | grep kernel-
Repository 'UIM_install' is missing name in configuration, using id.
kernel-core.x86_64                                                4.18.0-193.el8                                   @anaconda      
kernel-devel.x86_64                                               4.18.0-193.el8                                   @anaconda      
kernel-headers.x86_64                                             4.18.0-193.el8                                   @anaconda      
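
As a distribution-independent cross-check, the following minimal sketch tests whether build headers for a given kernel release are reachable. It assumes the conventional /lib/modules/<release>/build location that kernel module builds use. have_kernel_headers is a hypothetical helper name, and the optional second argument (an alternate root directory) exists only to make the check easy to try out.

```shell
# Hypothetical helper: succeeds when build headers for the given kernel
# release are reachable under <root>/lib/modules/<release>/build.
# Assumption: headers are exposed at this conventional location.
have_kernel_headers() {
    release=$1
    root=${2:-}
    # -d follows symlinks, so a dangling "build" link counts as missing.
    [ -d "$root/lib/modules/$release/build" ]
}

# Example use on the running kernel:
#   have_kernel_headers "$(uname -r)" && echo "headers found" || echo "headers missing"
```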

Determine PCIe slot type and speed

Run dmidecode to display slot capabilities and determine how many slots in the motherboard support the intended shell’s PCIe configuration (e.g., Gen3x16).

:~> sudo dmidecode | grep -A 10 PCI | grep -A 10 -B 1 Type | less
....
Handle 0x00C2, DMI type 9, 19 bytes
System Slot Information
        Designation: PCI-E Slot 1
        Type: x16
        Current Usage: Available
        Length: Long
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:04:00.0

The Type: field shows this is an x16 slot. Any other value indicates it is not an x16 slot.

Some machines will report a slot is Available or In Use in the Current Usage: field. In the example above the slot is Available.

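The actual speed depends on the card/slot negotiation, which the lspci command reports. Note that LnkCap is the link capability the card advertises; the negotiated link is reported on the LnkSta line of the same output. The example below shows a card limited to an x4 link in an x16 slot.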

:~> sudo lspci -vvvd 10ee: | grep LnkCap
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited

If the card is running slower than the slot, review the shell capabilities with xbmgmt examine -d <bdf> as shown below:

:~> sudo xbmgmt examine -d 04:00.0

-----------------------------------------------------
1/1 [0000:04:00.0] : xilinx_u50lv_gen3x4_xdma_base_2
-----------------------------------------------------
Flash properties
  Type                 : spi
  Serial Number        : 123456789

Device properties
  Type                 : u50lv
  Name                 : ALVEO U50 LV PQ
  Config Mode          : 7
  Max Power            : 75W

Flashable partitions running on FPGA
  Platform             : xilinx_u50lv_gen3x4_xdma_base_2
  SC Version           : 5.0.27
  Platform UUID        : CA1BD561-0169-A52C-E463-B3300DF98172
  Interface UUID       : 05A5E9D4-E079-740E-76C7-499FEEC81DB3

This is a Gen3 x4 shell. The x4 link is expected.

NOTE: Make sure to match the BDF (04:00.0 in this case) with the output from dmidecode to ensure you are looking at the right card.
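
To compare link speed and width at a glance, the lspci output can be reduced to one line per link record. This is a minimal sketch: link_summary is a hypothetical helper name, and it assumes the "Speed <value>, Width <value>" layout shown in the LnkCap example above, which lspci also uses for its LnkSta line.

```shell
# Hypothetical helper: reads "lspci -vvv" text on stdin and prints
# "<LnkCap|LnkSta> <speed> <width>" for each link record found.
# Assumption: each record contains "Speed <v>" and "Width <v>" tokens.
link_summary() {
    awk '
        /LnkCap:|LnkSta:/ {
            tag = ($0 ~ /LnkCap:/) ? "LnkCap" : "LnkSta"
            speed = ""; width = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "Speed") { speed = $(i + 1); sub(/,$/, "", speed) }
                if ($i == "Width") { width = $(i + 1); sub(/,$/, "", width) }
            }
            print tag, speed, width
        }
    '
}

# Example use:
#   sudo lspci -vvvd 10ee: | link_summary
```

A LnkSta width smaller than the LnkCap width suggests the slot or riser, rather than the shell, is limiting the link.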

Another option is to look directly at the silk screening on the motherboard or, in the case of a server, on the risers. Here is an example picture:

[Figure: motherboard labeling diagram]

The green circles show slots that support Gen3 x16 cards. The red circles show Gen3 x8 slots that do not support Gen3 x16 speeds.


Disable fastboot

Use this step if the system recognizes the card only after a warm boot.

  • Boot the system into the BIOS setup menu

    • F12 on many machines

  • Disable the setting that allows the BIOS to skip initialization steps for a faster boot

    • Review system documentation to determine the right setting if there are questions

On a cold boot, the on-card FPGA (with the PCIe link) is being programmed while the machine and BIOS are booting up. Sometimes the BIOS will finish enumerating the PCIe devices before the FPGA programming is finished. A warm boot will cause the BIOS to re-enumerate the devices while the FPGA PCIe link is up.


Check for unpopulated CPUs

On modern systems the PCIe controller is built into the CPU. If there are one or more missing CPUs you may not be able to use all the PCIe slots. Review system documentation to determine which slots are usable.

To check for missing CPUs, look for a Central Processor entry with Status: Unpopulated in the dmidecode log as shown below.  The line Status: Unpopulated means this system has at least one missing CPU.

Handle 0x0401, DMI type 4, 42 bytes
Processor Information
	Socket Designation: CPU2
	Type: Central Processor
	Family: Unknown
	Manufacturer: Not Specified
	ID: 00 00 00 00 00 00 00 00
	Version: Not Specified
	Voltage: Unknown
	External Clock: Unknown
	Max Speed: 4000 MHz
	Current Speed: Unknown
	Status: Unpopulated
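
The check for empty sockets can be scripted as a minimal sketch. count_unpopulated_cpus is a hypothetical helper name, and it assumes the Processor Information layout shown in the example above.

```shell
# Hypothetical helper: reads dmidecode text on stdin and prints the number
# of Processor Information blocks whose Status line says "Unpopulated".
count_unpopulated_cpus() {
    awk '
        /^Processor Information/ { in_cpu = 1; next }
        /^Handle /               { in_cpu = 0 }
        in_cpu && /Status:/ && /Unpopulated/ { n++ }
        END { print n + 0 }
    '
}

# Example use:
#   sudo dmidecode -t processor | count_unpopulated_cpus
```

A result greater than 0 means at least one socket is empty, so some PCIe slots may be unusable.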

Determine UEFI and Secure Boot

As of XRT 2020.1 the drivers do not have the signing to enable Secure Boot and will not work if the BIOS has the UEFI Secure Boot feature enabled.

If you have the mokutil package installed you can determine if Secure Boot is enabled with the following command:

:~> mokutil --sb-state
 SecureBoot enabled

Any other output indicates Secure Boot is not enabled.

Not all machines have mokutil installed, as it is an optional package. You may have to install the package or check the Secure Boot setting in the system BIOS.


XRT Steps

This section covers steps involving XRT installation and operation.


Confirm XRT/platform compatibility

Not all versions of XRT work with all platforms. To determine which XRT versions work with a platform refer to UG1120.


Determine XRT version

Run xbutil examine and review the XRT section of the output, as in the example below.

The Branch: field shows the XRT GitHub branch. It is normally tied to a release such as 2022.1 or 2022.2. If the value is Master, you are using an unofficial release.

The Version: field indicates the build:

:~> xbutil examine
System Configuration
  OS Name              : Linux
  Release              : 5.14.0-162.6.1.el9_1.x86_64
  Version              : #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 63788 MB
  Distribution         : Red Hat Enterprise Linux 9.1 (Plow)
  GLIBC                : 2.34
  Model                : PowerEdge R730

XRT
  Version              : 2.15.225
  Branch               : 2023.1
  Hash                 : adf27adb3cfadc6e4c41d6db814159f1329b24f3
  Hash Date            : 2023-05-03 17:13:10
  XOCL                 : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3
  XCLMGMT              : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3

Devices present
BDF             :  Shell                            Platform UUID                         Device ID       Device Ready*  
-------------------------------------------------------------------------------------------------------------------------
[0000:82:00.1]  :  xilinx_u280_gen3x16_xdma_base_1  283BAB8F-654D-8674-968F-4DA57F7FA5D7  user(inst=128)  Yes            


* Devices that are not ready will have reduced functionality when using XRT tools

The XOCL and XCLMGMT versions must match. If not, XRT will not operate correctly.
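
The driver version comparison can be automated with a minimal sketch like the one below. check_driver_match is a hypothetical helper name, and it assumes the "XOCL : <version>, <hash>" and "XCLMGMT : <version>, <hash>" lines appear as in the example output above.

```shell
# Hypothetical helper: reads "xbutil examine" text on stdin and compares
# the XOCL and XCLMGMT lines. Exit status: 0 match, 1 mismatch, 2 not found.
check_driver_match() {
    awk -F': *' '
        /^[ \t]*XOCL[ \t]*:/    { xocl = $2; sub(/[ \t]+$/, "", xocl) }
        /^[ \t]*XCLMGMT[ \t]*:/ { mgmt = $2; sub(/[ \t]+$/, "", mgmt) }
        END {
            if (xocl == "" || mgmt == "") { print "unknown"; exit 2 }
            if (xocl == mgmt) { print "match: " xocl; exit 0 }
            print "mismatch: xocl=" xocl " xclmgmt=" mgmt
            exit 1
        }
    '
}

# Example use:
#   xbutil examine | check_driver_match
```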


Determine XRT packages using the package manager

Determine which versions (if any) of the packages below are installed:

  • XRT

  • Deployment platform(s)

  • xbtest (an optional Alveo™ related package - see AR 75656 for more details)

  • Development platform(s)

Ubuntu

Determine the platform and xbtest packages with sudo apt list | grep -i xilinx. The machine in the example below has the U50 and U55C deployment platforms installed; xbtest is not installed.

:~> sudo apt list | grep -i xilinx

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

xilinx-cmc-u50/now 1.0.40-3398385 all [installed,local]
xilinx-cmc-u55/now 1.5.25-3395704 all [installed,local]
xilinx-sc-fw-u50/now 5.2.20-1.6d4a0da all [installed,local]
xilinx-sc-fw-u55/now 7.1.22-1.b8c3d15 all [installed,local]
xilinx-u50-gen3x16-xdma-base/now 5-3499627 all [installed,local]
xilinx-u50-gen3x16-xdma-validate/now 5-3499627 all [installed,local]
xilinx-u55c-gen3x16-xdma-base/now 3-3494559 all [installed,local]
xilinx-u55c-gen3x16-xdma-validate/now 3-3506150 all [installed,local]

Determine the XRT package with sudo apt list | grep -i xrt. The example below has xrt 2.15.225 installed.

:~> sudo apt list | grep -i xrt

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

xrt/now 2.15.225 amd64 [installed,local]

Use the XRT release table to see if the XRT version is an official release.

RHEL/CentOS

Determine the platform and xbtest packages with yum list | grep -i xilinx-. The machine in the example below has the Gen3x16 U50 deployment platform installed as well as the associated xbtest; no development platforms are installed.

:~> yum list | grep -i xilinx-
Repository 'UIM_install' is missing name in configuration, using id.
xbtest-xilinx-u50-gen3x16-xdma-blp-1.noarch                       5.0-3055365                                        @@commandline  
xilinx-cmc-u50.noarch                                             1.0.27-3041182                                     @@commandline  
xilinx-sc-fw-u50.noarch                                           5.1.7-1.f121ae9                                    @@commandline  
xilinx-u50-gen3x16-xdma-blp.noarch                                1-2784799                                          @@commandline  

There are two additional xbtest packages that can be seen with yum list | grep -i xbtest:

:~> yum list | grep -i xbtest
Repository 'UIM_install' is missing name in configuration, using id.
xbtest-common.x86_64                                              1.1-1                                              @@commandline  
xbtest-sw-5.x86_64                                                0-1                                                @@commandline  
xbtest-xilinx-u50-gen3x16-xdma-blp-1.noarch                       5.0-3055365                                        @@commandline  

Determine the XRT package with yum list | grep -i xrt. The example below has xrt 2.8.743 installed.

:~> yum list | grep -i xrt
Repository 'UIM_install' is missing name in configuration, using id.
xrt.x86_64                                                        2.8.743-1                                          @@commandline 

Use the XRT release table to see if the XRT version is an official release.


Unload/reload XRT drivers

Use modprobe -r to remove the drivers, as shown below:

sudo modprobe -r xocl
sudo modprobe -r xclmgmt

Use modprobe to reload the drivers, as shown below:

sudo modprobe xclmgmt
sudo modprobe xocl

Order matters for both sets of commands because xocl depends on xclmgmt.
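
The unload and reload sequences can be combined into one function that stops at the first failure. This is a minimal sketch: reload_xrt_drivers is a hypothetical helper name, and the optional first argument (the command used to run each step, sudo by default) is an assumption added so the sequence can be previewed with echo before it is run for real.

```shell
# Hypothetical helper: unloads and reloads the XRT drivers in the order
# given above, stopping at the first command that fails.
# $1 is the runner for each step: "sudo" by default, "echo" for a dry run.
reload_xrt_drivers() {
    run=${1:-sudo}
    $run modprobe -r xocl &&
    $run modprobe -r xclmgmt &&
    $run modprobe xclmgmt &&
    $run modprobe xocl
}

# Dry run (prints the commands without executing them):
#   reload_xrt_drivers echo
# Real run:
#   reload_xrt_drivers
```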


Remove XRT

Removing XRT will also remove the deployment and development platforms. If you plan to re-install XRT, first take a snapshot of the XRT, platform, and xbtest packages installed on the system.

Before removing XRT, run sudo xbmgmt examine -r platform and if any cards have a shell on them, return each card to golden. This is necessary because not all XRT versions work with all platforms. All XRT versions work with a card that has been factory reset. Performing this step makes future card installs easier.

Once each card is returned to golden, use the package manager to remove the current XRT package using one of the following commands. Note that the package manager will also remove the dependent packages.

  • Ubuntu: sudo apt remove xrt

  • RHEL/CentOS: sudo yum remove xrt

If needed use the package manager to find the installed package names.


xclbin fails to load

The first time an accelerator is run on a card, XRT loads the binary description of the accelerator, stored in an .xclbin file.
If the .xclbin fails to load, run xbmgmt examine --device <management BDF>. If the output displays errors, see SC Troubleshooting. Otherwise, there may be a platform mismatch.


XRT release versions and download locations

Available XRT releases along with download links are given in the table below.

Release                      Download Link
2021.1_pu1 (XRT 2.11.634)    RHEL/CentOS 7.x, RHEL/CentOS 8.x, Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04
2021.2 (XRT 2.12.427)        RHEL/CentOS 7.x, RHEL/CentOS 8.x, Ubuntu 18.04, Ubuntu 20.04
2022.1 (XRT 2.13.466)        RHEL/CentOS 7.x, RHEL/CentOS 8.x, Ubuntu 18.04, Ubuntu 20.04
2022.2 (XRT 2.14.354)        RHEL/CentOS 7.x, RHEL/CentOS 8.x, Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04
2023.1 (XRT 2.15.225)        RHEL/CentOS 7.x, RHEL 8.x, RHEL 9.x, Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04

If you are looking for an older XRT that uses the legacy commands, go to the previous XRT release versions and download locations.


    Card Steps

    This section provides various techniques for debugging Alveo cards.

    Displaying Card BDF Values

    Each Alveo card is assigned both a management and a user BDF (Bus:Device:Function). For a given card, the management and user BDF values differ only by the Function digit.

    To obtain the management and user BDF values for installed cards, use the following commands. They require XRT to be installed.

    The XRT documentation provides detailed descriptions on the management and user functions.

    Management BDF

    To display the management BDF values of installed cards, use the following command:

    sudo /opt/xilinx/xrt/bin/xbmgmt examine

    The management BDF values of installed cards are given in the square brackets under ‘Devices present’ and have the form vvvv:xx:yy.0. In the example below, only one device is present and it has a management BDF of 0000:af:00.0.

    Devices present
      [0000:af:00.0] : xilinx_u50_gen3x16_xdma_201920_3
    

    User BDF

    To display the user BDF values of installed cards, use the following command:

    sudo /opt/xilinx/xrt/bin/xbutil examine

    The user BDF values of installed cards are given in the square brackets under ‘Devices present’ and have the form vvvv:xx:yy.1. In the example below, only one device is present and it has a user BDF of 0000:af:00.1.

    Devices present
      [0000:af:00.1] : xilinx_u50_gen3x16_xdma_201920_3
    

    NOTE: The user BDF only exists once a shell has been flashed onto the card. Cards with the factory golden image will only have a management BDF.
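
Because the two values differ only in the function digit, a user BDF can be derived from a management BDF in scripts. This is a minimal sketch: user_bdf_from_mgmt is a hypothetical helper name, and it assumes the .0 (management) and .1 (user) forms shown in the examples above.

```shell
# Hypothetical helper: converts a management BDF (function 0) into the
# matching user BDF (function 1), per the forms shown in this section.
user_bdf_from_mgmt() {
    case $1 in
        *.0) printf '%s.1\n' "${1%.0}" ;;
        *)   echo "not a management BDF: $1" >&2; return 1 ;;
    esac
}

# Example use:
#   user_bdf_from_mgmt 0000:af:00.0
```

The derived value is only meaningful once a shell has been flashed, since a card running the golden image exposes no user function.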

    BDF displayed using lspci

    Alveo card BDF values can also be displayed using the lspci Linux command:

    sudo lspci -vd 10ee:

    See Card Not Recognized for lspci field details. Alveo card PCIe IDs are discussed in the next section.

    In a multi-card system, you may need to cross-reference the lspci output and the xbmgmt examine --verbose output to determine if a card is missing.


    Card PCIe IDs

    Alveo card PCIe ID information is given here. It provides the PCIe ID for both the golden image and for the shell installed.


    Display card and host platform and SC versions

    If the platform or SC running on the card does not match what is installed on the host system, applications will be unable to run correctly. To display the partition and SC version installed on the system and running on the card, use the following command:

    sudo xbmgmt examine --device <management BDF>

    See Displaying Card BDF Values for obtaining the management BDF value.

    An output similar to this will be displayed.

    -------------------------------------------------
    [0000:3b:00.0] : xilinx_u55c_gen3x16_xdma_base_3
    -------------------------------------------------
    Flash properties
      Type                 : spi
      Serial Number        : XFL1P0345SA0
    
    Device properties
      Type                 : u55c
      Name                 : ALVEO U55C
      Config Mode          : 0x7
      Max Power            : 225W
    
    Flashable partitions running on FPGA
      Platform             : xilinx_u55c_gen3x16_xdma_base_3
      SC Version           : 7.1.22
      Platform UUID        : 97088961-FEAE-DA91-52A2-1D9DFD63CCEF
      Interface UUID       : B7AC1ABE-1E3E-1CB6-86D5-A81232452676
    
    Flashable partitions installed in system
      Platform             : xilinx_u55c_gen3x16_xdma_base_3
      SC Version           : 7.1.22
      Platform UUID        : 97088961-FEAE-DA91-52A2-1D9DFD63CCEF
    
    
      Mac Address          : 00:0A:35:08:8C:AD
                           : 00:0A:35:08:8C:AE
                           : 00:0A:35:08:8C:AF
                           : 00:0A:35:08:8C:B0
                           : 00:0A:35:08:8C:B1
                           : 00:0A:35:08:8C:B2
                           : 00:0A:35:08:8C:B3
                           : 00:0A:35:08:8C:B4
    

    When a card has successfully been installed and the firmware has been updated, both entries for Platform and SC version under Flashable partitions running on FPGA and Flashable partitions installed in system must match. If they do not match, the system will be unable to correctly run applications on your card.

    To confirm they match, visually compare the ‘Platform’ and ‘SC Version’ entries under Flashable partitions running on FPGA and Flashable partitions installed in system.

    In the above example, the Platform (xilinx_u55c_gen3x16_xdma_base_3) and SC version (7.1.22) displayed under Flashable partitions running on FPGA and Flashable partitions installed in system both match. If they don’t match, follow the steps in the card’s installation guide to install the deployment software.

    If the platform under Flashable partitions running on FPGA has the word golden or recovery in the name, the card is running the factory image and needs to be flashed before use. To flash the card, follow the steps in the card’s installation guide. See Reverting the card to factory image for details on the factory image.

    Finally, if there is a platform displayed under Flashable partitions running on FPGA but none under Flashable partitions installed in system, as shown in the example below, the deployment platform must first be installed on the system in order to run applications on the card. Follow the steps in the card’s installation guide to install the deployment software.

    -------------------------------------------------
    [0000:3b:00.0] : xilinx_u55c_gen3x16_xdma_base_3
    -------------------------------------------------
    Flash properties
      Type                 : spi
      Serial Number        : XFL1P0345SA0
    
    Device properties
      Type                 : u55c
      Name                 : ALVEO U55C
      Config Mode          : 0x7
      Max Power            : 225W
    
    Flashable partitions running on FPGA
      Platform             : xilinx_u55c_gen3x16_xdma_base_3
      SC Version           : 7.1.22
      Platform UUID        : 97088961-FEAE-DA91-52A2-1D9DFD63CCEF
      Interface UUID       : B7AC1ABE-1E3E-1CB6-86D5-A81232452676
    
    Flashable partitions installed in system
      <none found>        
    
    
      Mac Address          : 00:0A:35:08:8C:AD
                           : 00:0A:35:08:8C:AE
                           : 00:0A:35:08:8C:AF
                           : 00:0A:35:08:8C:B0
                           : 00:0A:35:08:8C:B1
                           : 00:0A:35:08:8C:B2
                           : 00:0A:35:08:8C:B3
                           : 00:0A:35:08:8C:B4
    
    WARNING  : No shell is installed on the system.
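
The visual comparison described in this section can be approximated with a script. This is a minimal sketch: compare_partitions is a hypothetical helper name, and it assumes the section headings and the Platform and SC Version field names appear exactly as in the example outputs above, with a blank line ending each section.

```shell
# Hypothetical helper: reads "xbmgmt examine --device <BDF>" text on stdin
# and compares Platform and SC Version between the card and the host.
# Exit status: 0 match, 1 mismatch, 2 nothing installed on the host.
compare_partitions() {
    awk '
        function val(s) {
            sub(/^[^:]*:[ \t]*/, "", s); sub(/[ \t]+$/, "", s); return s
        }
        /Flashable partitions running on FPGA/     { sec = "card"; next }
        /Flashable partitions installed in system/ { sec = "host"; next }
        /^[ \t]*$/                                 { sec = ""; next }
        sec != "" && /Platform[ \t]*:/   { plat[sec] = val($0) }
        sec != "" && /SC Version[ \t]*:/ { sc[sec] = val($0) }
        END {
            if (plat["host"] == "") { print "no platform installed on host"; exit 2 }
            if (plat["card"] == plat["host"] && sc["card"] == sc["host"]) {
                print "match"; exit 0
            }
            print "mismatch"; exit 1
        }
    '
}

# Example use:
#   sudo xbmgmt examine --device <management BDF> | compare_partitions
```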
    

    Flash the card with a deployment platform

    Once the card has been installed in the system, a deployment platform must be flashed onto the card before the card can run an application. To flash the card with a deployment platform, use the following command:

    sudo xbmgmt program --base --device <management BDF>
    

    See Displaying Card BDF Values for obtaining the management BDF value.

    It is recommended to follow the installation steps in the card’s installation guide. See Card Installation Guides for links to the installation guide, or navigate to your card’s landing page under the Alveo Boards and Kits web page.


    Programming DFX-2RP shell partitions

    The DFX-2RP platforms have a base partition, a shell partition, and user space as described in the XRT platforms documentation under Two Stage Platforms. For DFX-2RP platforms it is necessary to not only flash the base partition, but also to program the shell partition prior to running an application.

    See AR 75975 for more information.


    Determine active or passive card

    The xbmgmt examine -r mechanical command can be used to determine if a card is active or passive based on the fan presence. An active card shows as below:

    sudo xbmgmt examine -d a6:00.0 -r mechanical
    
    -------------------------------------------------
    [0000:a6:00.0] : xilinx_u280_gen3x16_xdma_base_1
    -------------------------------------------------
    Mechanical
      Fans
        FPGA Fan 1
          Critical Trigger Temp : 49 C
          Speed                 : 1410 RPM
    

    A passive card shows as below:

    sudo xbmgmt examine -d 3b:00.0 -r mechanical
    
    -------------------------------------------------
    [0000:3b:00.0] : xilinx_u55c_gen3x16_xdma_base_3
    -------------------------------------------------
    Mechanical
      Fans
        Not present
    
    • A fan listed under Fans indicates the card has a fan. The card is actively cooled and suitable for workstations.

    • Fans: Not present means the card is passively cooled and depends on server fans for cooling.


    Monitor card power and temperature

    During operation it is necessary to ensure the card operates within the power and thermal limits. These limits are dependent on the card version as detailed below:

    The U50 is rated for 75W, supplied by the motherboard via

    • A 10W 3.3V rail

      • Dedicated for HBM power, see AR 75222 for more information

    • And a 65W 12V rail

    The U200/U250/U280 are rated for 225W

    • 75W from the motherboard via

      • A 10W 3.3V rail

      • And a 65W 12V rail

    • Plus 150W power from an 8 pin PCIe AUX power source.

    Power

    Use the following command to display a card’s maximum power along with its current power consumption.
    xbutil examine --device <user BDF> --report electrical

    Depending on the card and how the card was installed, the card may have the following maximum power limits.

    • 75W

      • No PCIe Aux power

      • Only power from the motherboard PCIe slot

      • Not supported for U200/U250/U280 in Vitis™ flows

    • 150W

      • 6 pin PCIe Aux power connected supplying 75W

      • Not supported for U200/U250/U280 in Vitis flows

    • 225W

      • 8 pin PCIe Aux power connected supplying 150W

    An example output is shown below. The maximum power level for the card is 225W, and the card is currently consuming 16.035208W. The report also provides a breakdown of the various power rails.

    -------------------------------------------------
    [0000:82:00.1] : xilinx_u280_gen3x16_xdma_base_1
    -------------------------------------------------
    Electrical
      Max Power              : 225 Watts
      Power                  : 16.035208 Watts
      Power Warning          : false
    
      Power Rails            : Voltage   Current
      12 Volts Auxillary     :  0.436 V,  0.016 A
      12 Volts PCI Express   : 12.254 V,  1.308 A
      3.3 Volts PCI Express  :  3.282 V
      3.3 Volts Auxillary    :  3.347 V
      Internal FPGA Vcc      :  0.851 V,  4.672 A
      DDR Vpp Bottom         :  2.500 V
      DDR Vpp Top            :  2.500 V
      5.5 Volts System       :  5.509 V
      Vcc 1.2 Volts Top      :  1.205 V
      Vcc 1.2 Volts Bottom   :  1.204 V
      1.8 Volts Top          :  1.806 V
      0.9 Volts Vcc          :  0.902 V
      12 Volts SW            : 12.240 V
      Mgt Vtt                :  1.204 V
    

    The following are the three key power rails:

    • 12 Volts PCI Express

      • 12V motherboard rail

        • +/- 8% Voltage tolerance

      • Current should be <= 5.5A

    • 3.3 Volts PCI Express

      • 3.3 Volt motherboard rail

        • +/- 9% Voltage tolerance

      • Current should be <= 3A

    • 12 Volts Auxillary (For U55C/U200/U250/U280)

      • These lines will report a near 0 value if PCIe Aux power is not supplied

      • Current should be <= 12.5A for 225W operation

    For the U50, the 3.3V rail powers the HBM while the 12V rail powers the rest of the card except the SC. You can confirm the HBM power is within its limits by looking at the 3.3 Volts PCI Express voltage and current values.

    :~> xbutil examine -r thermal electrical -d 0000:04:00.1
    ------------------------------------------------
    [0000:04:00.1] : xilinx_u50_gen3x16_xdma_base_5
    ------------------------------------------------
    Electrical
      Max Power              : 75 Watts
      Power                  : 65.196971 Watts
      Power Warning          : false
    
      Power Rails            : Voltage   Current
      12 Volts PCI Express   : 12.182 V,  4.678 A
      3.3 Volts PCI Express  :  3.317 V,  2.475 A
      Internal FPGA Vcc      :  0.852 V, 41.800 A
      Internal FPGA Vcc IO   :  0.848 V,  4.900 A
      5.5 Volts System       :  4.947 V
      1.8 Volts Top          :  1.808 V
      0.9 Volts Vcc          :  0.905 V
      Mgt Vtt                :  1.209 V
      3.3 Volts Vcc          :  3.359 V
      1.2 Volts HBM          :  1.208 V
      Vpp 2.5 Volts          :  2.494 V
    
    Thermals
      PCB Top Front          : 45 C
      PCB Top Rear           : 47 C
      FPGA                   : 70 C
      Int Vcc                : 69 C
      FPGA HBM               : 65 C
    

    Power is calculated as power = voltage * current. From the information provided in the example above, the HBM is using 8.21W (3.317V * 2.475A).
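
The same calculation can be applied to every rail that reports both a voltage and a current. This is a minimal sketch: rail_power is a hypothetical helper name, and it assumes the "<rail name> : <volts> V, <amps> A" layout shown in the electrical reports above.

```shell
# Hypothetical helper: reads an electrical report on stdin and prints
# "<rail name>: <watts> W" for each rail that lists voltage and current.
rail_power() {
    awk -F':' '
        /[0-9.]+ V,[ \t]+[0-9.]+ A[ \t]*$/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            split($2, f, " ")          # f[1] = volts, f[3] = amps
            printf "%s: %.2f W\n", name, f[1] * f[3]
        }
    '
}

# Example use:
#   xbutil examine --device <user BDF> --report electrical | rail_power
```

Rails that report only a voltage are skipped, since no power figure can be derived for them.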

    You can use this script in a second terminal to monitor temperatures and voltages on a card while a design is running. Make sure to modify the script loop for the time required (this is based on the loop count in line 17 and the seconds of delay in between calls on line 19).

    To use, run ./loop_examine.sh <user BDF>. See Displaying Card BDF Values for obtaining the user BDF value.

    Temperature

    Use the following command to display the value of various temperature sensors on the card.
    xbutil examine --device <user BDF> --report thermal

    An example of the output is shown below

    -------------------------------------------------
    [0000:3b:00.1] : xilinx_u55c_gen3x16_xdma_base_3
    -------------------------------------------------
    Thermals
      Temperature            : Celcius
      PCB Top Front          :     36 C
      PCB Top Rear           :     32 C
      FPGA                   :     38 C
      Int Vcc                :     41 C
    

    Use system logs to see if the card exceeded power or thermal limits

    XRT logs critical events in dmesg, similar to the example below:

    [93352.134536] xclmgmt 0000:a3:00.0: clock.m.25165825 ffff90a92b74a010 clock_status_check: Critical temperature or power event, kernel clocks have been stopped.
    [93352.134643] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: AXI Firewall 3 tripped, status: 0x80004, bar offset 0x3000, resource ep_firewall_ctrl_user_00 1 0 1 axi_firewall
    [93352.134648] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 0, ep ep_firewall_blp_ctrl_mgmt_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x1f02000
    [93352.134652] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 1, ep ep_firewall_blp_ctrl_user_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x1f03000
    [93352.134657] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 2, ep ep_firewall_ctrl_mgmt_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x2000
    [93352.134661] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 3, ep ep_firewall_ctrl_user_00 1 0 1 axi_firewall, status: 0x80004, bar offset 0x3000
    [93352.134666] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 4, ep ep_firewall_ctrl_debug_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x4000
    [93352.134670] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 5, ep ep_firewall_data_h2c_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x5000
    [93352.134672] xclmgmt 0000:a3:00.0: health_check_cb: Card requires pci hot reset
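
    A long dmesg capture can be reduced to the lines that matter with a small filter. The Python sketch below assumes the message wording shown in the example above, which may differ between XRT versions; the function name is hypothetical.

```python
import re

# Matches the per-firewall status lines, e.g.
# "check_firewall: Firewall 3, ep ep_firewall_ctrl_user_00 ... status: 0x80004, ..."
_FIREWALL = re.compile(
    r"check_firewall: Firewall (\d+), ep (\S+).*?status: (0x[0-9a-fA-F]+)"
)
CRITICAL_TEXT = "Critical temperature or power event"


def summarize_dmesg(lines):
    """Return (critical_event_seen, [(firewall index, endpoint, status), ...]).

    Only firewalls reporting a non-zero status are listed.
    """
    critical = False
    tripped = []
    for line in lines:
        if CRITICAL_TEXT in line:
            critical = True
        match = _FIREWALL.search(line)
        if match:
            status = int(match.group(3), 16)
            if status != 0:
                tripped.append((int(match.group(1)), match.group(2), status))
    return critical, tripped


# Three lines copied from the example log in this guide.
SAMPLE_LINES = [
    "[93352.134536] xclmgmt 0000:a3:00.0: clock.m.25165825 ffff90a92b74a010 clock_status_check: Critical temperature or power event, kernel clocks have been stopped.",
    "[93352.134657] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 2, ep ep_firewall_ctrl_mgmt_00 1 0 1 axi_firewall, status: 0x0, bar offset 0x2000",
    "[93352.134661] xclmgmt 0000:a3:00.0: firewall.m.12582914 ffff90a92e634c10 check_firewall: Firewall 3, ep ep_firewall_ctrl_user_00 1 0 1 axi_firewall, status: 0x80004, bar offset 0x3000",
]

critical, tripped = summarize_dmesg(SAMPLE_LINES)
print(critical, [(idx, ep, hex(status)) for idx, ep, status in tripped])
```

    To run it against a live system, pass the lines of a saved dmesg capture instead of SAMPLE_LINES.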
    

    The log does not indicate whether this was a temperature event or a power event.

    An over-temperature event may be indicated earlier in dmesg, because XRT logs a warning when a device approaches its temperature limits, as shown below:

    [686835.888454] xclmgmt 0000:05:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of 96C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
    [686836.200333] xocl 0000:05:00.1:  ffff8a0e1e23f098 _xocl_drvinst_open: OPEN 2
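
    The three temperatures in the warning can be extracted to see how close the card came to shutdown. This Python sketch assumes the warning wording shown above; the function name and dictionary keys are illustrative.

```python
import re

# Captures the reported, shutdown, and recommended temperatures in that order.
_TEMP_WARNING = re.compile(
    r"reporting a temperature of (\d+)C\..*?hits (\d+)C\..*?below (\d+)C"
)


def parse_temp_warning(line):
    """Return the temperatures in a check_temp_within_range warning, or None."""
    match = _TEMP_WARNING.search(line)
    if match is None:
        return None
    reported, shutdown, recommended = (int(value) for value in match.groups())
    return {
        "reported_c": reported,
        "shutdown_c": shutdown,
        "recommended_max_c": recommended,
        "margin_to_shutdown_c": shutdown - reported,
    }


# Warning line copied from the example log in this guide.
WARNING = (
    "[686835.888454] xclmgmt 0000:05:00.0: check_temp_within_range: Warning: "
    "A Xilinx acceleration device is reporting a temperature of 96C. "
    "There is a card shutdown limit if the device hits 97C. "
    "Please keep the device below 88C."
)

print(parse_temp_warning(WARNING))
```

    For the example line, the margin to shutdown is 1 C, and the reported temperature is 8 C above the recommended maximum.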
    

    If the limits are exceeded in either case, the device will be shut down. When this happens, a firewall trip can often be observed, as shown below:

    [67066.484066] firewall.m firewall.m.10485760: dev ffff9b24390edc10, check_firewall: AXI Firewall 3 tripped, status: 0x4, bar offset 0x3000, resource ep_firewall_ctrl_user_00 1 0 1
    [67066.484070] xclmgmt 0000:03:00.0: health_check_cb: Card is in a Bad state, notify userpf
    

    This indicates the device was shut down to protect it.

    You can also expect to see a firewall trip message in xbutil examine as shown below:

    Card Power(W)
    16
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Firewall Last Error Status
    Level 3 : 0x80004(RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT|RECS_WRITE_TO_BVALID_MAX_WAIT)
    Error occurred on: Wed 2020-08-19 13:59:13 MDT
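
    The status value is a bit field, so it can help to see which bits are set. The Python sketch below reports bit positions only; the example status 0x80004 has two bits set and the output names two flags, but which flag corresponds to which bit is not stated here, so no names are assigned. The function name is illustrative.

```python
def set_bits(status):
    """Return the positions (0 = least significant) of the bits set in status."""
    return [bit for bit in range(status.bit_length()) if (status >> bit) & 1]


print(set_bits(0x80004))  # status from the xbutil examine example
print(set_bits(0x4))      # status from the dmesg firewall trip example
```

    For 0x80004 this gives bits 2 and 19; for 0x4 it gives bit 2 alone.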
    

    To actively monitor the card’s power and temperature during runtime, use the script provided in the monitoring power and temperature section.


    Reverting the card to factory image

    Xilinx Alveo cards might need to be reverted to their factory (or golden) image. This is recommended when:

    • Preparing to flash a different shell onto the card

    • Preparing to upgrade or change the version of XRT installed on the host

    • The card no longer appears on lspci after programming a custom image onto the card

    AR 71757, Reverting Card to Factory image, provides methods to revert the card using either the Vitis/XRT or Vivado flow.


    Vivado HW Manager

    This section is for more advanced users who are comfortable with manually programming FPGAs.

    If a card cannot be seen via lspci on multiple systems, the next step is to check whether the FPGA is alive by determining if it can be seen in Vivado HW Manager. The steps for connecting the U50/U55C and U200/U250/U280 cards are below.

    For the U50 and U55C:

    1. An Alveo Programming Cable is necessary to see the U50/U55C in Vivado Hardware Manager

    2. The Alveo Programming Cable is shown in Figure 3 of UG1377 (v1.1)

    3. Follow the directions in UG1377 to connect the Alveo Programming Cable between a computer and the card
       a. Pages 1-13
       b. Often the debug machine is a Windows laptop

    4. Open Vivado hardware manager

    5. Tools → Auto connect

    6. In the Tcl Console, run: set_property PARAM.FREQUENCY 1000000 [current_hw_target]

    7. The FPGA should be displayed. In the following example the xu50_0 is displayed:

    HW Manager

    1. If the device shows up in Vivado HW Manager follow AR 71757 to revert the card back to the golden image

    2. Otherwise the FPGA is not working; post on the Xilinx forums

    For the U200/U250/U280:

    1. Plug in a JTAG cable between the U200/U250/U280 card and the debug machine
       a. Often the debug machine is a Windows laptop

    2. Open Vivado hardware manager

    3. Tools → Auto connect

    4. In the Tcl Console, run: set_property PARAM.FREQUENCY 1000000 [current_hw_target]

    5. The xcu200_0 should show as below:

    HW Manager

    1. If the device shows up in Vivado HW Manager follow AR 71757 to revert the card back to the golden image

    2. Otherwise the FPGA is not working; post on the Xilinx forums (https://support.xilinx.com/s/topic/0TO2E000000YKXlWAO/alveo-accelerator-cards)


    Xilinx Support

    For additional support resources such as Answers, Documentation, Downloads, and Alerts, see the Xilinx Support pages. For additional assistance, post your question on the Xilinx Community Forums – Alveo Accelerator Card.

    If you have a suggestion or have found an issue, please send an email to alveo_cards_debugging@xilinx.com.

    License

    All software including scripts in this distribution are licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License.

    You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

    All images and documentation, including all debug and support documentation, are licensed under the Creative Commons (CC) Attribution 4.0 International License (the “CC-BY-4.0 License”); you may not use this file except in compliance with the CC-BY-4.0 License.

    You may obtain a copy of the CC-BY-4.0 License at https://creativecommons.org/licenses/by/4.0/

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    XD027 | © Copyright 2021 Xilinx, Inc.