Alveo Debug Guide

Card Not Recognized

When Alveo™ card(s) are installed and functioning correctly within a host machine, the lspci Linux command will correctly identify and report their details. However, there are times when the card is not recognized by the host OS and lspci does not list the card. This section covers techniques to determine the root cause.

This Page Covers

This page details scenarios and recommended steps to take when the card is not recognized by the host. If your issue is not covered, please post on the Xilinx forums.

Typical causes can be grouped into card and host based issues as given below:

  • Card based

  • Host based

    • Bad or incompatible motherboard slot

    • Slot disabled by host

      • A missing CPU can cause this

    • Poorly seated server risers

    • Other hardware added to the system

    • BIOS settings

You Will Need

Before beginning debug, you need to:

Common Cases


Confirm system recognizes cards

The lspci command lists the PCIe buses and devices in the system and can be used to confirm that the system recognizes the card. The verbose switch (-v) provides greater detail, while the -d switch filters by vendor and device ID. For Xilinx, the vendor ID is 10ee, so the filter is 10ee:. The resulting command is lspci -vd 10ee:, referred to as lspci in this document.

Below is an example of the lspci output of an Alveo card recognized by the system.

:~> sudo lspci -vd 10ee:
03:00.0 Processing accelerators: Xilinx Corporation Device 5004
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at d2000000 (64-bit, prefetchable) [size=32M]
        Memory at d4000000 (64-bit, prefetchable) [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1c0] #19
        Capabilities: [400] Access Control Services
        Capabilities: [410] #15
        Kernel driver in use: xclmgmt
        Kernel modules: xclmgmt

If no Xilinx cards are recognized by the system, the command produces no output, as illustrated below.

:~> sudo lspci -vd 10ee:
:~>
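The check above can be scripted. The following is a minimal sketch, not part of any Xilinx tooling: the function names count_xilinx_functions and report_card_status are hypothetical, and it assumes each device entry in the lspci output starts at column one with a bus address (optionally prefixed by a PCI domain) while detail lines are indented.

```shell
# Count device entries in lspci output read from stdin.
# A device entry starts with an address such as "03:00.0" or "0000:03:00.0".
count_xilinx_functions() {
    grep -Ec '^([0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7] ' || true
}

# Print a one-line verdict; return non-zero when nothing is listed.
report_card_status() {
    rcs_count=$(count_xilinx_functions)
    if [ "$rcs_count" -gt 0 ]; then
        echo "recognized: $rcs_count Xilinx PCIe function(s)"
    else
        echo "not recognized: no Xilinx PCIe functions listed"
        return 1
    fi
}

# Usage on a live host:
#   sudo lspci -vd 10ee: | report_card_status
```

The count is of PCIe functions, not physical cards: a single card can expose more than one function, so a count above one does not necessarily mean several cards are installed.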

Card not recognized under lspci

There can be many potential causes if a card is not recognized using lspci. To narrow the likely causes it is necessary to consider both the host machine and the card.

Host machine

Alveo card

Review the following scenarios and see if they apply:

If the above scenarios don’t apply, test the card/machine for the following:

If none of the above have narrowed down the potential issue:

As a last attempt:

  • Cold boot the system twice

  • Pull power

  • Reseat the Alveo card

  • Reseat the server risers if applicable

  • Bring system back up

  • Check to see if the card is recognized

If the issue hasn’t been addressed, please post your situation and the steps that you have gone through to the Xilinx forums.


Vivado flow

This guide does not cover Vivado™ flow debug.

Next steps:

  • You can revert to golden using AR 71757 to go back to a Vitis™ flow

  • Search for answers on the Xilinx forums


USB cable plugged into card

For cards with a USB port, including U200, U250 and U280, ensure no USB cable is plugged into the card as it will block the FPGA from enumerating on the PCIe bus.

Next steps:

  • Unplug USB cable from the card

  • Cold boot the system


Card recognized after a warm boot

If lspci does not recognize the Alveo card after the machine first powers on, one of the first tests is to perform a warm reboot. If the card is recognized after a warm reboot, this may suggest that there is a BIOS issue.

Before the warm boot, sudo lspci -vd 10ee: does not recognize the installed card.

:~> sudo lspci -vd 10ee:
:~>

After the warm boot, the same lspci command recognizes the installed card.

:~> sudo lspci -vd 10ee:
03:00.0 Processing accelerators: Xilinx Corporation Device 5004
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at d2000000 (64-bit, prefetchable) [size=32M]
        Memory at d4000000 (64-bit, prefetchable) [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1c0] #19
        Capabilities: [400] Access Control Services
        Capabilities: [410] #15
        Kernel driver in use: xclmgmt
        Kernel modules: xclmgmt

The BIOS can start enumerating devices on the PCIe bus 100 ms after the system starts to boot. Alveo cards sometimes take longer than that to program the static region of the shell, which contains the PCIe core. Once programmed, the card’s PCIe core remains active even during a warm boot. A warm boot causes the BIOS to re-enumerate devices on the PCIe bus, so if the Alveo card is recognized only after a warm boot, it suggests the BIOS is enumerating before the card’s PCIe core is active.
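One way to record this test is to save the lspci output after the cold boot and again after the warm boot, then compare the two captures. The sketch below assumes the outputs were saved to two files; the function names has_xilinx_device and compare_boots and the file names in the usage comment are hypothetical.

```shell
# Succeed if FILE (saved output of "lspci -vd 10ee:") lists at least one device.
has_xilinx_device() {
    grep -Eq '^([0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7] ' "$1" 2>/dev/null
}

# compare_boots COLD_FILE WARM_FILE
compare_boots() {
    if has_xilinx_device "$1"; then
        echo "recognized after cold boot"
    elif has_xilinx_device "$2"; then
        echo "recognized only after warm boot"
    else
        echo "not recognized after either boot"
    fi
}

# Usage:
#   sudo lspci -vd 10ee: > cold.txt     # after powering on
#   sudo reboot
#   sudo lspci -vd 10ee: > warm.txt     # after the warm boot
#   compare_boots cold.txt warm.txt
```

A result of "recognized only after warm boot" matches the BIOS enumeration timing scenario described above; the other two results point to a different cause.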

Next steps:


Card not recognized after loading platform or golden image

If the card is no longer recognized after loading a platform or reverting to the golden image, there is an issue with the host to card communication.

Next steps:


Card not recognized during operation

If the card is no longer recognized by lspci while the machine is on, the card may be overheating, which causes the FPGA to shut down and no longer be detected on the PCIe bus. Overheating can be caused by insufficient airflow. See the respective card data sheet for airflow requirements.
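To find out when the card drops off the bus, its presence can be logged at intervals while the workload runs. This is a minimal sketch under stated assumptions: the function name log_card_presence and the LSPCI_CMD override variable are hypothetical, and the sketch only records presence; it does not read card temperature.

```shell
# Command used to list PCIe devices; can be overridden for testing.
LSPCI_CMD=${LSPCI_CMD:-lspci}

# Print a timestamp followed by "present" or "missing", depending on
# whether any device with vendor ID 10ee is currently listed.
log_card_presence() {
    if [ -n "$($LSPCI_CMD -d 10ee: 2>/dev/null)" ]; then
        lcp_state=present
    else
        lcp_state=missing
    fi
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $lcp_state"
}

# Usage: poll every 10 seconds and append to a log file.
#   while true; do log_card_presence; sleep 10; done >> card_presence.log
```

The first "missing" entry in the log gives an approximate time at which the card stopped being detected, which can then be compared against when the workload started.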

Next steps:


Card is recognized in different PCIe slot in same machine

If the card is tested and recognized in a different PCIe slot on the same machine, there is an issue with the intended motherboard slot.

Next steps:


Card is recognized in another machine

If the card is tested and recognized in a different machine, there is an issue with the intended machine settings.

Next steps:


Different Alveo card is recognized in same slot

If a different Alveo card is tested and recognized in the same slot, there is an issue with the intended card.

Next steps:


PCIe bifurcation setting

PCIe bifurcation splits the PCIe link into two (or more) smaller buses. The bifurcation setting can be found in the BIOS. Not all systems support bifurcation. Alveo platforms detailed in this guide are expecting a non-bifurcated link. Ensure bifurcation is not enabled.

Next step:

  • Refer to the manufacturer’s BIOS documentation to turn off bifurcation.
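If the card does enumerate, the negotiated link width reported by lspci -vv can hint at a bifurcated slot. The sketch below compares the LnkCap (capability) and LnkSta (status) widths; the function names link_width and check_link_width are hypothetical. It is only an indicator: a reduced width can also come from the slot wiring, and a card that does not enumerate at all produces no output to check.

```shell
# link_width FIELD: print the first "Width xN" value found on a FIELD line
# (LnkCap or LnkSta) in "lspci -vv" output read from stdin.
link_width() {
    sed -n "s/.*$1:.*Width x\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Compare the supported and negotiated widths of the first device in stdin.
check_link_width() {
    clw_out=$(cat)
    clw_cap=$(printf '%s\n' "$clw_out" | link_width LnkCap)
    clw_sta=$(printf '%s\n' "$clw_out" | link_width LnkSta)
    if [ -z "$clw_cap" ] || [ -z "$clw_sta" ]; then
        echo "link width not found"
        return 2
    elif [ "$clw_cap" = "$clw_sta" ]; then
        echo "link trained at full width x$clw_sta"
    else
        echo "link trained at x$clw_sta of x$clw_cap"
        return 1
    fi
}

# Usage (root is needed for lspci to read the capability registers):
#   sudo lspci -vvd 10ee: | check_link_width
```

Only the first device in the output is examined; with several cards installed, select one with the lspci -s switch and its bus address.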


Recent system change

System changes may impact card/host interoperability. Certain changes may render the card non-functional.

Please refer to the following sections that discuss next steps for different system changes:

Motherboard replacement

Next steps:

Riser replacement

Next steps:

Additional card installed in system

Next steps:

New or updated BIOS on machine

Next steps:

BIOS settings changed

Next steps:

CPU removal

Next steps:

Other hardware added since last boot

Next steps:


Card not recognized by multiple machines

If the card has been tested in multiple machines and hasn’t been recognized in any of them, there may be a shared machine incompatibility or an issue with the card.

Next steps:

  • If possible, test a card that has been known to work in the same PCIe slot to check if the issue is related to the card or the machines.

  • See if the machines are homogeneous or heterogeneous

    • If homogeneous

      • Check on missing CPU sockets

      • Review BIOS version and settings

  • Look at the LEDs on the card:

  • See if the FPGA is seen in Vivado HW Manager

    • If it can be seen, revert the card to golden using AR 71757
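When reviewing BIOS versions across several machines, it helps to collect the same fields from each one. The sketch below reads them from the Linux sysfs DMI directory, assumed here to be /sys/class/dmi/id; the function name bios_summary and the output file names in the usage comment are hypothetical.

```shell
# bios_summary [DIR]: print BIOS vendor, version, and date as key=value lines.
# DIR defaults to the sysfs DMI directory; fields that cannot be read are
# reported as "unknown".
bios_summary() {
    bs_dir=${1:-/sys/class/dmi/id}
    for bs_field in bios_vendor bios_version bios_date; do
        if [ -r "$bs_dir/$bs_field" ]; then
            printf '%s=%s\n' "$bs_field" "$(cat "$bs_dir/$bs_field")"
        else
            printf '%s=unknown\n' "$bs_field"
        fi
    done
}

# Usage: run on each machine, then compare the results.
#   bios_summary > "$(hostname).bios"
#   diff machineA.bios machineB.bios
```

Identical output on every machine is consistent with a homogeneous set, in which case a shared BIOS version or setting is a candidate cause. BIOS settings themselves are not exposed here and still need to be reviewed in the BIOS setup.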


Blue LED not illuminated

If the blue LED on the card is not illuminated, the FPGA is not being programmed properly during power on.

Next steps:


Red LED illuminated

If the card’s red LED is illuminated constantly, there is an issue with the on card power delivery. The card is not usable in this state.

Next steps:


BIOS in safety mode

If the card’s red LED is illuminated and the system is exhibiting some or all of the following symptoms, the BIOS is failing to train (establish a PCIe link) with the Alveo card:

  • Lower resolution on video cards

  • Computer fans in full speed safe mode

  • USB devices may not work

Next step:

  • Refer to the manufacturer’s BIOS documentation to address this


Xilinx Support

For additional support resources such as Answers, Documentation, Downloads, and Alerts, see the Xilinx Support pages. For additional assistance, post your question on the Xilinx Community Forums – Alveo Accelerator Card.

Have a suggestion or found an issue? Please send an email to alveo_cards_debugging@xilinx.com.

License

All software including scripts in this distribution are licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

All images and documentation, including all debug and support documentation, are licensed under the Creative Commons (CC) Attribution 4.0 International License (the “CC-BY-4.0 License”); you may not use this file except in compliance with the CC-BY-4.0 License.

You may obtain a copy of the CC-BY-4.0 License at https://creativecommons.org/licenses/by/4.0/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

XD027 | © Copyright 2021 Xilinx, Inc.