AVED Debug Techniques

Debug Methods

The debug techniques are listed here in approximately increasing order of complexity and effort.

Card status

Once an error has occurred, check the card status and health to see what is still working and what is not. For AMI-specific commands, see AVED Management Interface userguide (ami_tool).

Some useful checks are:

  • lspci device check

    • Issue ‘sudo lspci -vvs <BDF>’

    • Check that the PF has the expected:

      • device ID

      • number of BARs

      • BAR sizes

      • link speed and width in LnkCap

      • link speed and width in LnkSta

      • Vendor Specific capability present

      • kernel driver in use

        • PF0 AMI

  • ami_tool link check

    • Issue ‘ami_tool pcieinfo -d <BDF>’

    • For each expected PF, check the following entries for consistency with lspci link status

      • PCIe Link

      • NUMA NODE

      • CPU Affinity

  • ami_tool overview check

    • Issue ‘ami_tool overview’

    • Check that the AMI version is as expected

      • AMI version should match AMI driver version

      • AMI major version should match AMC major version

    • For each BDF check

      • ‘Devices State’ is READY

        • If NOT_READY, cold reboot the system

      • Expected design name

  • ami_tool mfg_info check

    • Issue ‘ami_tool mfg_info -d <BDF>’

    • Check

      • eeprom version

      • product name

      • board revision

      • serial no

      • mac address

      • mfg date

      • uuid

      • board part number

      • mfg part number

  • xbtest verify check

    • Issue ‘xbtest -c verify -d <BDF> -F’

    • When completed, for each BDF check

      • “RESULT: ALL TESTS PASSED” present

      • “ERROR” not present

  • sensor check

    • Issue ‘ami_tool sensors -d <BDF> -f json -o <FILE>’

    • Check produced json format is correct

    • For each expected sensor

      • Check entry is present in json

      • Check for expected value
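Parts of the checklist above can be scripted. The following is a minimal sketch rather than a supported tool. The function names are hypothetical. It assumes that lspci prints its LnkCap and LnkSta lines in the usual "Speed ..., Width ..." form, and that the xbtest console output is piped in as text. The sensor check is a plain text search for quoted names, so it does not validate the JSON format or the sensor values.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for some of the card status checks above.

# Reduce `lspci -vv` text on stdin to "<field> <speed> <width>" lines for
# LnkCap and LnkSta, so capability and negotiated status can be compared.
link_summary() {
    sed -nE 's/^[[:space:]]*(LnkCap|LnkSta):.*Speed ([0-9.]+GT\/s).*Width (x[0-9]+).*/\1 \2 \3/p'
}

# Succeed only if the xbtest output on stdin contains the pass banner
# and no line containing "ERROR".
xbtest_passed() {
    local log
    log="$(cat)"
    printf '%s\n' "$log" | grep -qF 'RESULT: ALL TESTS PASSED' &&
        ! printf '%s\n' "$log" | grep -qF 'ERROR'
}

# Print every expected sensor name that is absent (as a quoted string) from
# the JSON file; return non-zero if any are missing.
missing_sensors() {
    local json_file="$1" name missing=0
    shift
    for name in "$@"; do
        if ! grep -qF "\"$name\"" "$json_file"; then
            printf 'missing: %s\n' "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example use on a host with the card installed (not run here):
#   sudo lspci -vvs <BDF> | link_summary
#   xbtest -c verify -d <BDF> -F 2>&1 | xbtest_passed && echo "verify OK"
#   ami_tool sensors -d <BDF> -f json -o sensors.json
#   missing_sensors sensors.json <sensor name> ...
```

If link_summary prints different speed or width values for LnkCap and LnkSta, the link has trained below its capability and is worth investigating.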

xbtest logs and CSV files

For errors that occurred while xbtest was running, check the xbtest logs for context and error reporting.

xbtest writes sensor data into several CSV files. These can be analyzed as text or imported into Microsoft Excel to analyze the data graphically.
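For a quick text-based look at one of these CSV files, a summary of a single column is often enough. The sketch below assumes a simple comma-separated file with one header row and no quoted fields. The function name is hypothetical, and the column name must be taken from the header of the actual xbtest CSV file.

```shell
#!/usr/bin/env bash
# Sketch only: print the minimum, maximum and mean of one named CSV column.
csv_column_stats() {
    local csv_file="$1" column="$2"
    awk -F',' -v col="$column" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        $idx != "" {
            v = $idx + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v; n++
        }
        END { if (n) printf "min=%g max=%g mean=%g\n", min, max, sum / n }
    ' "$csv_file"
}

# Example use (not run here):
#   csv_column_stats <file>.csv <column name>
```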

xbtest can be useful to test the card capabilities and to push it to its limits. This is more often helpful for exposing or reproducing an issue, rather than triaging or debugging an issue. However, xbtest does provide an easy method for continuously reading card sensors such as power and temperature, which can be useful in some debug situations.

For xbtest documentation, see AVED Deployment / xbtest Userguide.

Server logs and sensors

  • BMC (Baseboard Management Controller)

    • This provides remote configuration and power control for the Dell servers that V80 cards have been verified against.

    • There is a BMC for each server. It stores logs of server activity. These can help to diagnose physical server issues (e.g. someone pulled out a power cord, or shut down the wrong server). Use the log timestamps to correlate this with other logs.

    • The BMC stores server sensor readings such as temperature. This historical data can be viewed in logs and charts and can help to diagnose issues with power or temperature. For example, if card sensors are not working properly or are not being reported correctly then xbtest would not report the correct temperature, but the BMC would.

  • Crash logs

    • To help with debug, when a server crash occurs, the server loads a crash kernel, which freezes the state of the server and writes this state and the current messages into a crash directory.

    • This is stored in /var/crash/ where a new directory is created for each server crash, named with a timestamp.
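To find the most recent crash directory, the entries under /var/crash/ can be sorted by modification time. This is a minimal sketch with a hypothetical function name. The files inside each crash directory vary by distribution, so they are not assumed here.

```shell
#!/usr/bin/env bash
# Sketch only: print the most recently modified directory under the crash
# root (default /var/crash). Prints nothing if there are no directories.
latest_crash_dir() {
    local crash_root="${1:-/var/crash}"
    # The trailing "/" in the glob restricts the listing to directories;
    # -t sorts newest first, -d lists the directories themselves.
    ls -1dt "$crash_root"/*/ 2>/dev/null | head -n 1
}

# Example use (not run here; reading /var/crash may require sudo):
#   sudo ls -l "$(latest_crash_dir)"
```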

dmesg

dmesg is an OS tool for reporting messages from the kernel, including from drivers such as AMI. Useful commands are:

dmesg | less          # show full dmesg output, piped to a pager
dmesg -wT             # show live dmesg output, auto-updates when new messages are sent. -T gives wall clock timestamps (default is seconds since server reboot)

Consult OS documentation for full usage details.

dmesg is particularly useful for information about PCIe® connections at server boot and for AMI debug information.

For AMI messages:

  • messages are preceded by ami: and usually include the card BDF.

  • A heartbeat message for AMI/AMC communication appears only when that communication fails or breaks down:

    • No response - “Failed to get the heartbeat msg!” and “AMC Heartbeat expired event received”

    • Incorrect Response - “Heartbeat validation failed!” and “AMC Heartbeat validation event received”
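The AMI messages described above can be filtered out of the full dmesg output with grep. The following sketch assumes the "ami:" prefix and the heartbeat strings exactly as quoted above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: filters for dmesg-style text read from stdin.

# Keep only lines that carry the AMI driver prefix.
ami_messages() {
    grep -F 'ami:'
}

# Keep only the heartbeat failure messages quoted in this section.
ami_heartbeat_failures() {
    grep -E 'Failed to get the heartbeat msg!|Heartbeat validation failed!|AMC Heartbeat (expired|validation) event received'
}

# Example use (not run here):
#   sudo dmesg -T | ami_messages | less
#   sudo dmesg -T | ami_heartbeat_failures
```

Both filters exit with a non-zero status when nothing matches, which makes them usable in scripts as a simple "any heartbeat failures?" test.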

dmesg content is also written to log files under /var/log/. The exact directory and file names differ between Red Hat/CentOS and Ubuntu: on Red Hat/CentOS the full dmesg content is written to the file messages, and on Ubuntu this file is named kern.log. Use sudo to read these log files.

sysfs

sysfs is a Linux pseudo file system that provides information about hardware devices in the server system. sysfs is mounted at /sys/ and can be accessed like any other Linux file system (cd, ls, cat, etc.).

/sys/bus/pci/devices/ shows the PCIe connected devices. Each directory here shows a mapping from the BDF to the associated device within sysfs. Each directory contains many files and subdirectories that show device information available to the OS.

Some examples include:

  • amc_version: version number for the Alveo Management Controller (AMC)

  • logic_uuid: Build UUID
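These attributes can be read with cat from the device directory. The sketch below uses a hypothetical function name and prints the two attributes listed above for one PCIe function. Note that sysfs directory names use the full BDF including the PCI domain, for example 0000:21:00.0. The second argument exists only so that the function can be pointed at a different root directory.

```shell
#!/usr/bin/env bash
# Sketch only: print selected sysfs attributes for one PCIe function.
show_sysfs_attrs() {
    local bdf="$1" root="${2:-/sys/bus/pci/devices}" attr
    for attr in amc_version logic_uuid; do
        if [ -r "$root/$bdf/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$root/$bdf/$attr")"
        else
            # Attribute file absent or unreadable (e.g. driver not loaded).
            printf '%s: not available\n' "$attr"
        fi
    done
}

# Example use (not run here):
#   show_sysfs_attrs <domain>:<bus>:<device>.<function>
```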

Reset and reprogram

Performing a reset after an issue has occurred can help to understand the impact and scope of the issue. For example, if the issue is cleared by a hot reset, the issue is not wide-ranging, but if a full power cycle is required, the issue is more significant.

Several types of reset are possible for an AVED design:

  • power cycle : shut down server using BMC, power cycle system, boot up server using BMC

  • server cold reboot : shut down server using BMC, boot up server using BMC

  • server warm reboot : reboot server using BMC

  • PCIe hot reset (also known as in-band reset; resets the FPGA but does not reprogram it from flash) : ami_tool reload -t sbr -d <bdf>

  • AMI driver reload (put driver into clean state without reprogramming): ami_tool reload -t driver -d <bdf>

*Note: If the issue occurred during a cfgmem_fpt or cfgmem_program operation (see the overview in AVED Management Interface userguide (ami_tool)), go straight to AVED Updating FPT Image in Flash to recover the card.

Hardware read/write

Use ami_tool bar_rd and bar_wr to read from or write to memory-mapped registers or memories within the platform.

bar_rd

% ami_tool bar_rd -d <BDF> -b 2 -a 0x0000ffff -l 4
0000ffff DEAD BEEF DEAD BEEF DEAD BEEF DEAD BEEF
INFO: 4 words read successfully

% ami_tool bar_rd -d <BDF> -b 2 -a 0x0000ffff -l 256 -o outfile.bin
INFO: 256 words read successfully

bar_wr

% ami_tool bar_wr -d <BDF> -b 2 -a 0x0000ffff -i 0xACDCACDC
INFO: 1 words written successfully

% ami_tool bar_wr -d <BDF> -b 2 -a 0x0000ffff -I infile.bin
INFO: 512 words written successfully

Revert to the provided example design

For issues seen in custom designs, it may be useful, where applicable, to revert to the released AVED image and check whether the same issues also occur there.

Swap card slots

If an issue occurs on one card but not on another, try swapping the cards, or moving the card into an empty slot in the server, and see where the issue occurs.

This can help to isolate an issue to a faulty card, faulty slot, or incorrect slot configuration.

Other factors may be involved. For example, the test may fail only intermittently, or a particular server race or load condition may be required to trigger the issue.

Swap cards or servers

Try running the test on a different card and/or in a different server, and see where the issue occurs.

This can help to isolate an issue to a faulty card or faulty server.

Other factors may be involved. For example, the test may fail only intermittently, or a particular server race or load condition may be required to trigger the issue.


Issue Reporting

When reporting an issue against the AVED solution, provide the following information to aid with triage.

Required

  • Details of issue and any debug attempted.

    • Consistent or intermittent issue?

    • Single or multi card?

    • Recoverable via reset/reboot?

  • logs

    • ami_tool overview

    • ami_tool pcieinfo -d <bdf>

    • ami_tool mfg_info -d <bdf>

    • ami_tool sensors -d <bdf>

    • ami_tool cfgmem_info -d <bdf>

    • lspci -vvs <bdf>

    • dmesg
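The required logs can be gathered in one pass by running each command above and saving its output to a file. This is a minimal sketch. The function names and output file names are hypothetical, and only the commands listed above are used. Run it with sufficient privileges so that lspci and dmesg return full detail.

```shell
#!/usr/bin/env bash
# Sketch only: save the output of each required command into one directory.

# Run a command and save its stdout and stderr to <out_dir>/<name>.txt.
capture() {
    local out_dir="$1" name="$2"
    shift 2
    mkdir -p "$out_dir"
    "$@" > "$out_dir/$name.txt" 2>&1
}

# Collect the logs listed under "Required" for one card.
collect_aved_report() {
    local bdf="$1" out_dir="$2"
    capture "$out_dir" overview    ami_tool overview
    capture "$out_dir" pcieinfo    ami_tool pcieinfo -d "$bdf"
    capture "$out_dir" mfg_info    ami_tool mfg_info -d "$bdf"
    capture "$out_dir" sensors     ami_tool sensors -d "$bdf"
    capture "$out_dir" cfgmem_info ami_tool cfgmem_info -d "$bdf"
    capture "$out_dir" lspci       lspci -vvs "$bdf"
    capture "$out_dir" dmesg       dmesg
}

# Example use (not run here):
#   collect_aved_report <bdf> ./aved_report
```

Because stderr is captured as well, a command that fails still leaves its error text in the corresponding file, which is itself useful for triage.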

If applicable

  • xbtest logs:

    • Provide the full set of log files: zip the folder “BDF_<date>_<time>” of the failing run.

    • By default, xbtest stores log files under the folder “xbtest_logs”. Within this folder, each run is stored in its own folder titled BDF_<date>_<time>.

      • If you store your log files elsewhere (-l option), provide that folder instead.

  • server logs
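Packaging the run folder of the failing xbtest run can be scripted as follows. This sketch uses tar with gzip in place of zip, since tar is present on most Linux hosts. The function name is hypothetical, and the archive is written next to the run folder.

```shell
#!/usr/bin/env bash
# Sketch only: pack one xbtest run folder into <folder>.tar.gz beside it
# and print the archive path.
archive_run_folder() {
    local run_dir="${1%/}"    # drop a trailing slash, if any
    [ -d "$run_dir" ] || { echo "not a directory: $run_dir" >&2; return 1; }
    # -C keeps only the run folder name, not the full path, in the archive.
    tar -czf "$run_dir.tar.gz" -C "$(dirname "$run_dir")" "$(basename "$run_dir")"
    printf '%s\n' "$run_dir.tar.gz"
}

# Example use (not run here):
#   archive_run_folder xbtest_logs/BDF_<date>_<time>
```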

Page Revision: v. 22