Alveo Debug Guide

Card Not Recognized

When Alveo™ card(s) are installed and functioning correctly within a host machine, the lspci Linux command will correctly identify and report their details as shown in the example below.

:~> sudo lspci -vd 10ee:
03:00.0 Processing accelerators: Xilinx Corporation Device 5004
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at d2000000 (64-bit, prefetchable) [size=32M]
        Memory at d4000000 (64-bit, prefetchable) [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1c0] #19
        Capabilities: [400] Access Control Services
        Capabilities: [410] #15
        Kernel driver in use: xclmgmt
        Kernel modules: xclmgmt

However, there are times when the card is not recognized by the host OS and lspci does not list the card. This section covers techniques to determine the root cause. If you are just starting to debug, please consult the main page to determine if this is the best page for your purposes.
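
The check above can be scripted. The following is a minimal sketch, assuming only that `lspci -d 10ee:` prints one line per enumerated PCIe function, each starting with its bus address. The helper name `count_xilinx_devices` is hypothetical and is not part of any Xilinx tool.

```shell
#!/bin/sh
# Sketch: count PCIe functions reported for vendor ID 10ee (Xilinx).
# count_xilinx_devices is a hypothetical helper name used for illustration.
# It reads lspci output on stdin and counts the lines that begin with a
# bus address such as "03:00.0" (optionally prefixed by a PCI domain).
count_xilinx_devices() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the helper's exit status clean in that case.
    grep -cE '^([0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7] ' || true
}

# Usage on the host being debugged:
#   n=$(lspci -d 10ee: | count_xilinx_devices)
#   if [ "$n" -gt 0 ]; then echo "Xilinx functions enumerated: $n"
#   else echo "no Xilinx device enumerated"; fi
```

A count of 0 corresponds to the "card not recognized" case that the rest of this page addresses.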

This Page Covers

This page details scenarios and recommended steps to take when the card is not recognized by the host. If your issue is not covered, please post on the Xilinx forums.

Typical causes can be grouped into card and host based issues as given below:

  • Card based

  • Host based

    • Bad or incompatible motherboard slot

    • Slot disabled by host

      • A missing CPU can cause this

    • Poorly seated server risers

    • Other hardware added to the system

    • BIOS settings

You Will Need

Before beginning debug, you need to:

Common Cases


Card not recognized under lspci

This is a large problem space with many potential causes. To narrow down the likely causes, consider both the host machine and the card.

Host machine

Alveo card

Go through the following scenarios and see if they apply:

If the above scenarios don’t apply, test the card/machine for the following:

If none of the above have narrowed down the potential issue:

As a last attempt:

  • Cold boot the system twice

  • Pull power

  • Reseat the Alveo card

  • Reseat the server risers if applicable

  • Bring system back up

  • Check to see if the card is recognized

If the issue hasn’t been addressed, please post your situation and the steps that you have gone through to the Xilinx forums.


Vivado flow

This guide does not cover Vivado™ flow debug.

Next steps:

  • You can revert to golden using AR 71757 to go back to a Vitis™ flow

  • Search for answers on the Xilinx forums


USB cable plugged into card

For cards with a USB port, including U200, U250 and U280, ensure that no USB cable is plugged into the card, because a connected cable will block the FPGA from enumerating on the PCIe bus.

Next steps:

  • Unplug USB cable

  • Cold boot the system


Card recognized after a warm boot

If lspci does not recognize the Alveo card after the machine first powers on, one of the first tests is to perform a warm reboot. If the card is recognized after a warm reboot, this may suggest that there is a BIOS issue.

Before the warm boot, lspci does not recognize the installed card.

:~> sudo lspci -vd 10ee:
:~>

After the warm boot, the same lspci command recognizes the installed card.

:~> sudo lspci -vd 10ee:
03:00.0 Processing accelerators: Xilinx Corporation Device 5004
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at d2000000 (64-bit, prefetchable) [size=32M]
        Memory at d4000000 (64-bit, prefetchable) [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1c0] #19
        Capabilities: [400] Access Control Services
        Capabilities: [410] #15
        Kernel driver in use: xclmgmt
        Kernel modules: xclmgmt

The BIOS can start enumerating devices on the PCIe bus 100 ms after the system starts to boot. Sometimes Alveo cards take longer than that to program the static region of the shell, which contains the PCIe core. Once programmed, the card’s PCIe core remains active even during a warm boot. A warm boot causes the BIOS to re-enumerate devices on the PCIe bus, so if the Alveo card is recognized only after a warm boot, it suggests the BIOS is enumerating before the Alveo card’s PCIe core is active.
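
The cold boot versus warm boot comparison can be captured as a small decision helper. This is only a sketch of the reasoning above: the function name `classify_boot_result` and the messages it prints are hypothetical, and its two arguments are assumed to be the number of lines printed by `lspci -d 10ee:` after each boot.

```shell
#!/bin/sh
# classify_boot_result COLD_COUNT WARM_COUNT
# Hypothetical helper: each argument is the number of lines printed by
# "lspci -d 10ee:" after a cold boot and after the warm boot that follows.
classify_boot_result() {
    cold=$1
    warm=$2
    if [ "$cold" -gt 0 ]; then
        echo "recognized after cold boot"
    elif [ "$warm" -gt 0 ]; then
        echo "recognized only after warm boot: BIOS may be enumerating before the PCIe core is active"
    else
        echo "not recognized after either boot: continue with the other scenarios"
    fi
}

# Usage (run the first line after the cold boot, the rest after the warm boot):
#   cold=$(lspci -d 10ee: | wc -l)    # note the value down; it is lost on reboot
#   warm=$(lspci -d 10ee: | wc -l)
#   classify_boot_result "$cold" "$warm"
```

Because shell variables do not survive a reboot, the cold boot count has to be noted down or written to a file before the warm boot.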

Next steps:


Card not recognized after loading platform or golden image

If the card is no longer recognized after loading a platform or reverting to the golden image, there is an issue with the host-to-card communication.

Next steps:

  • Cold boot the computer and see if the host recognizes the card

  • Perform a warm boot and see if the card is recognized

  • Confirm BIOS is not in safety mode

  • Try reseating the card and see if the host recognizes it

  • See if the FPGA is seen in Vivado HW Manager

    • If it can be seen, revert the card to golden using AR 71757


Card not recognized during operation

If the card is no longer recognized under lspci while the machine is on, the card may be overheating, which causes the FPGA to shut down and drop off the PCIe bus.

If the system does not have enough airflow, the FPGA can overheat, which takes down the PCIe core on the FPGA.

Next steps:

  • Check for an over temperature event in dmesg

  • Use xbutil query to monitor card temperature

  • If there is an over temperature event or if the card is over 90 °C in xbutil query

    • Cold boot the system

    • Increase system airflow

      • May need to turn up fan speed

      • May need to close lid

      • May need to check for airflow blocks in system

        • Stray cables

        • Something blocking fan

  • If temperatures and electrical limits are good, and the card is no longer recognized while running an application
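
The two temperature checks above can be sketched in shell. The 90 degree limit comes from the steps above; the helper names `is_over_temp` and `filter_temp_events` are hypothetical, and the keyword pattern used to scan the kernel log is an assumption because the exact wording of the driver message is not given here. The temperature value itself is expected to be read by hand from xbutil query output, which this sketch does not attempt to parse.

```shell
#!/bin/sh
# Limit taken from the steps above: act if the card reports more than 90 C.
TEMP_LIMIT_C=90

# is_over_temp TEMP_C
# Hypothetical helper: exit status 0 if the integer reading exceeds the limit.
is_over_temp() {
    [ "$1" -gt "$TEMP_LIMIT_C" ]
}

# filter_temp_events
# Hypothetical helper: reads kernel log text on stdin and prints the lines
# that mention temperature. The keywords are an assumption, not the
# driver's documented message format.
filter_temp_events() {
    grep -iE 'over.?temp|temperature|thermal' || true
}

# Usage:
#   sudo dmesg | filter_temp_events
#   is_over_temp 95 && echo "card is above ${TEMP_LIMIT_C} C: cold boot and improve airflow"
```

If the filter prints nothing, widen the pattern or read the dmesg output directly before concluding that no over temperature event occurred.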


Card is recognized in different PCIe slot in same machine

If the card is tested and recognized in a different PCIe slot on the same machine, there is an issue with the intended motherboard slot.

Next steps:


Card is recognized in another machine

If the card is tested and recognized in a different machine, there is an issue with the intended machine settings.

Next steps:


Different Alveo card is recognized in same slot

If a different Alveo card is tested and recognized in the same slot, there is an issue with the intended card.

Next steps:


PCIe bifurcation setting

PCIe bifurcation splits the PCIe link into two (or more) smaller links. The bifurcation setting can be found in the BIOS. Not all systems support bifurcation. The Alveo platforms detailed in this guide expect a non-bifurcated link. Ensure bifurcation is not enabled.

Next step:

  • Refer to the manufacturer’s BIOS documentation to turn off bifurcation.
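
When the card does enumerate, the link width reported by lspci can hint at a bifurcated slot. The sketch below assumes the usual `LnkCap:` and `LnkSta:` lines printed by `lspci -vv`; the helper name `link_width` is hypothetical. A negotiated width below the card's capability can have other causes as well, such as a slot that is mechanically wider than it is wired, so treat the result as a hint and confirm in the BIOS.

```shell
#!/bin/sh
# link_width FIELD
# Hypothetical helper: FIELD is LnkCap (what the card supports) or LnkSta
# (what was negotiated). Reads "lspci -vv" output on stdin and prints the
# lane count from the first matching line, e.g. 16 for "Width x16".
link_width() {
    sed -n "s/.*$1:.*Width x\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Usage (root is needed for lspci to print the capability details):
#   out=$(sudo lspci -vv -d 10ee:)
#   cap=$(printf '%s\n' "$out" | link_width LnkCap)
#   sta=$(printf '%s\n' "$out" | link_width LnkSta)
#   if [ -n "$cap" ] && [ -n "$sta" ] && [ "$sta" -lt "$cap" ]; then
#       echo "link trained at x$sta but the card supports x$cap"
#   fi
```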


Recent system change

System changes may impact card/host interoperability. Certain changes may render the card non-functional.

Please refer to the following sections that discuss next steps for different system changes:

Motherboard replacement

Next steps:

Riser replacement

Next steps:

Additional card installed in system

Next steps:

New or updated BIOS on machine

Next steps:

BIOS settings changed

Next steps:

CPU removal

Next steps:

Other hardware added since last boot

Next steps:

  • Shut down the system

  • Pull power

  • Remove the new PCIe device(s)

  • Reseat the Alveo card

    • Reseat the server risers if applicable

  • Bring system back up

  • Confirm the card is recognized

    • If so, proceed with card installation

    • If not, try the card in:

      • A different slot

      • A different machine


Card not recognized by multiple machines

If the card has been tested in multiple machines and hasn’t been recognized in any of them, there may be a shared machine incompatibility or an issue with the card.

Next steps:

  • If possible, test a card that has been known to work in the same PCIe slot to check if the issue is related to the card or the machines.

  • See if the machines are homogeneous or heterogeneous

    • If homogeneous

      • Check for missing CPUs in the CPU sockets

      • Review BIOS version and settings

  • Look at the LEDs on the card:

  • See if the FPGA is seen in Vivado HW Manager

    • If it can be seen, revert the card to golden using AR 71757
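
To compare machines, it can help to record the same few facts on each one. The sketch below is one possible way to do that, assuming `lscpu` and `dmidecode` are installed on the hosts. The helper name `socket_count` is hypothetical, and it assumes the `Socket(s):` line of `lscpu` counts only sockets that hold a CPU, so the result should be compared against the number of sockets on the motherboard.

```shell
#!/bin/sh
# socket_count
# Hypothetical helper: reads "lscpu" output on stdin and prints the value
# of its "Socket(s):" line.
socket_count() {
    sed -n 's/^Socket(s):[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage on each machine under test:
#   lscpu | socket_count
#   sudo dmidecode -s bios-vendor
#   sudo dmidecode -s bios-version
#   sudo dmidecode -s bios-release-date
```

Machines that report the same BIOS version and socket count are likely to share the same incompatibility, which points toward the BIOS settings or a missing CPU and away from the card.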


Blue LED not illuminated

If the blue LED on the card is not illuminated, the FPGA is not being programmed properly during power on.

Next steps:

  • Try the following:

    • Shut down the system

    • Pull power

    • Reseat the Alveo card

    • Reseat the server risers if applicable

    • Bring system back up

    • Check whether the blue LED is illuminated

    • Move the card to a different slot or machine and try these steps again

  • If the blue LED is still not illuminated, see if the FPGA is seen in Vivado HW Manager

    • If it can be seen, revert the card to golden using AR 71757


Red LED illuminated

If the card’s red LED is illuminated constantly, there is an issue with the on-card power delivery. The card is not usable in this state.

Next steps:

  • Determine if the BIOS is in safety mode

  • If the BIOS is not in safety mode, try the following:

    • Shut down the system

    • Pull power

    • Reseat the Alveo card

    • Reseat the server risers if applicable

    • Bring system back up

    • Check if the red LED is out

  • If the red LED continues to be illuminated, navigate to the Service Portal on xilinx.com and initiate a return request.


BIOS in safety mode

If the card’s red LED is illuminated and the system is exhibiting some or all of the symptoms below, the BIOS is failing to train (establish a PCIe link) with the Alveo card:

  • Lower resolution on video cards

  • Computer fans in full speed safe mode

  • USB devices may not work

Next step:

  • Refer to the manufacturer’s BIOS documentation to address the safety mode condition


Xilinx Support

For additional support resources such as Answers, Documentation, Downloads, and Alerts, see the Xilinx Support pages. For additional assistance, post your question on the Xilinx Community Forums – Alveo Accelerator Card.

Have a suggestion or found an issue? Please send an email to alveo_cards_debugging@xilinx.com.

License

All software including scripts in this distribution are licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

All images and documentation, including all debug and support documentation, are licensed under the Creative Commons (CC) Attribution 4.0 International License (the “CC-BY-4.0 License”); you may not use this file except in compliance with the CC-BY-4.0 License.

You may obtain a copy of the CC-BY-4.0 License at https://creativecommons.org/licenses/by/4.0/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

XD027 | © Copyright 2021 Xilinx, Inc.