AVED Debug Techniques

Debug Methods

The debug techniques are listed here in approximate order of increasing complexity and effort.

Programming

If an issue occurs during a cfgmem_fpt or cfgmem_program operation (see the Overview section of AVED Management Interface userguide (ami_tool)), go straight to AVED Updating FPT Image in Flash to recover the card.

Card status

Once an error has occurred, it can be useful to check the card status and health to see what is still working and what is not. Some useful checks are listed below.

For AMI-specific commands, see AVED Management Interface userguide (ami_tool).

lspci device check

Issue ‘sudo lspci -vvs <BDF>’

21:00.0 Processing accelerators: Xilinx Corporation Device 50b4
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 2-1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        NUMA node: 2
        Region 0: Memory at 10400000000 (64-bit, prefetchable) [size=256M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (downgraded), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [188 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [3a0 v1] Data Link Feature <?>
        Capabilities: [3b0 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [400 v1] Lane Margining at the Receiver <?>
        Capabilities: [460 v1] Extended Capability ID 0x2a
        Capabilities: [600 v1] Vendor Specific Information: ID=0020 Rev=0 Len=010 <?>
        Kernel driver in use: ami
        Kernel modules: ami

Check that the PF has the expected:
- device ID
- number of BARs
- BAR sizes
- link speed and width in LnkCap
- link speed and width in LnkSta
- Vendor Specific capability present
- kernel driver in use
  - PF0 AMI
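The link fields in the lspci output can also be compared programmatically. The following is a minimal Python sketch, not part of AVED: the function names are hypothetical, and it assumes the lspci -vv text layout shown above. A LnkSta speed below LnkCap is not necessarily a fault, because the link trains to the highest speed that both the card and the host slot support.

```python
import re

def parse_link(lspci_text):
    """Extract (speed, width) for LnkCap and LnkSta from 'lspci -vv' text."""
    links = {}
    for field in ("LnkCap", "LnkSta"):
        # '.' does not cross newlines, so each match stays on the field's own line.
        match = re.search(rf"{field}:.*?Speed ([\d.]+GT/s).*?Width (x\d+)", lspci_text)
        if match:
            links[field] = (match.group(1), match.group(2))
    return links

def link_is_degraded(lspci_text):
    """Return True when the negotiated link (LnkSta) differs from the capability (LnkCap)."""
    links = parse_link(lspci_text)
    if "LnkCap" not in links or "LnkSta" not in links:
        raise ValueError("LnkCap or LnkSta not found in lspci output")
    return links["LnkCap"] != links["LnkSta"]
```

For the sample output above, parse_link reports 32GT/s x8 for LnkCap and 16GT/s x8 for LnkSta, so link_is_degraded returns True.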

ami_tool link check
- Issue ‘ami_tool pcieinfo -d <BDF>’
- For each expected PF, check the following entries for consistency with lspci link status
  - PCIe Link
  - NUMA NODE
  - CPU Affinity

ami_tool overview check
- Issue ‘ami_tool overview’
- Check the AMI version is as expected
  - AMI version should match AMI driver version
  - AMI major version should match AMC major version
- For each BDF check
  - ‘Devices State’ is READY
    - if the device state is not READY, please cold reboot the system
  - Expected design name and UUID
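The two version rules above can be expressed as a small check. This is a sketch only: it assumes the versions have already been read from the ami_tool overview output by hand, and that they are plain dotted strings such as 2.3.0 (the real output format may differ). The function names are hypothetical.

```python
def major(version):
    """Return the leading integer of a dotted version string such as '2.3.0'."""
    return int(version.strip().split(".")[0])

def versions_consistent(ami_version, driver_version, amc_version):
    """AMI must equal the AMI driver version; AMI and AMC must share a major version."""
    same_as_driver = ami_version.strip() == driver_version.strip()
    same_major_as_amc = major(ami_version) == major(amc_version)
    return same_as_driver and same_major_as_amc
```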
ami_tool mfg_info check
- Issue ‘ami_tool mfg_info -d <BDF>’
- Check
  - eeprom version
  - product name
  - board revision
  - serial no
  - mac address
  - mfg date
  - uuid
  - board part number
  - mfg part number

xbtest verify check
- Issue ‘xbtest -c verify -d <BDF> -F’
- When completed, for each BDF check
  - “RESULT: ALL TESTS PASSED” present
  - “ERROR” not present
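The pass criteria above can be applied to captured xbtest output with a short check. This is a minimal sketch that assumes the output of one run has been saved as text; the function name is hypothetical and only the two strings quoted above are taken from the source.

```python
def verify_passed(log_text):
    """Apply the two criteria: the pass banner is present and no 'ERROR' appears."""
    return "RESULT: ALL TESTS PASSED" in log_text and "ERROR" not in log_text
```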
sensor check
- Issue ‘ami_tool sensors -d <BDF> -f json -o <FILE>’
- Check that the produced JSON is well formed
- For each expected sensor
  - Check entry is present in json
  - Check for expected value
  - See Sensors and Hardware Monitoring for thresholds
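The sensor checks above can be partly automated. The JSON layout produced by ami_tool is not described here, so the following sketch makes no assumption about it: it only confirms that the file parses, then searches the whole tree for each expected sensor name. The function names and any sensor names passed in are hypothetical; value thresholds still need to be checked against Sensors and Hardware Monitoring.

```python
import json

def load_sensor_report(path):
    """Parse the file; json.JSONDecodeError is raised if the JSON is malformed."""
    with open(path) as f:
        return json.load(f)

def names_in(node):
    """Collect every dict key and string value found anywhere in the JSON tree."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(key)
            found |= names_in(value)
    elif isinstance(node, list):
        for item in node:
            found |= names_in(item)
    elif isinstance(node, str):
        found.add(node)
    return found

def missing_sensors(report, expected):
    """Return the expected sensor names that appear nowhere in the report."""
    present = names_in(report)
    return [name for name in expected if name not in present]
```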

xbtest logs and CSV files

For errors that occurred while xbtest was running, check the xbtest logs for context and error reporting.

xbtest writes sensor data into several CSV files. These can be analyzed as text or imported into Microsoft Excel to analyze the data graphically.
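As an alternative to a spreadsheet, a CSV file can be summarized with a few lines of Python. This sketch assumes the file has a header row; the file and column names used with it are hypothetical, so consult the xbtest documentation for the real ones.

```python
import csv

def column_range(csv_path, column):
    """Return (min, max) of a numeric column, or None if no numeric values are found."""
    values = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                values.append(float(row[column]))
            except (KeyError, TypeError, ValueError):
                continue  # skip rows where the column is absent or not numeric
    return (min(values), max(values)) if values else None
```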

xbtest can be useful to test the card capabilities and to push it to its limits. This is more often helpful for exposing or reproducing an issue, rather than triaging or debugging an issue. However, xbtest does provide an easy method for continuously reading card sensors such as power and temperature, which can be useful in some debug situations.

For xbtest documentation, see AVED Deployment / xbtest Userguide.

Server logs and sensors

BMC (Baseboard Management Controller)
- This provides remote configuration and power control for the Dell servers that V80 cards have been verified against.
- There is a BMC for each server. It stores logs of server activity. These can help to diagnose physical server issues (e.g. someone pulled out a power cord, or shut down the wrong server). Use the log timestamps to correlate these events with other logs.
- The BMC stores server sensor readings such as temperature. This historical data can be viewed in logs and charts and can help to diagnose issues with power or temperature. For example, if card sensors are not working properly or are not being reported correctly, xbtest would not report the correct temperature, but the BMC would.

Crash logs
- To help with debug, when a server crash occurs, the server loads a crash kernel, which freezes the state of the server and writes this state and the current messages into a crash directory.
- This is stored in /var/crash/ where a new directory is created for each server crash, named with a timestamp.
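To find the most recent crash, the directories under /var/crash/ can be listed newest first. This is a minimal sketch: it sorts by modification time rather than parsing the timestamp in the directory name, and the function name is hypothetical. Reading /var/crash/ may require root privileges.

```python
from pathlib import Path

def crash_dirs(root="/var/crash"):
    """Return crash directories under root, newest first (by modification time)."""
    base = Path(root)
    if not base.is_dir():
        return []
    dirs = [p for p in base.iterdir() if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)
```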

dmesg

dmesg is an OS tool for reporting messages from the kernel, including from drivers such as AMI. Useful commands are:

dmesg | less          # show full dmesg output, piped to a pager
dmesg -wT             # show live dmesg output, auto-updates when new messages are sent. -T gives wall clock timestamps (default is seconds since server reboot)

Consult OS documentation for full usage details.

dmesg is particularly useful for information about PCIe® connections at server boot and for AMI debug information.

For AMI messages:

Messages are prefixed with ami: and usually include the card BDF.
Heartbeat messages for AMI/AMC communications appear if communication fails or breaks down:
- No response - “Failed to get the heartbeat msg!” and “AMC Heartbeat expired event received”
- Fatal failure (communications have been shut down) - “Heartbeat fail count above threshold! Raising fatal event…” and “AMC Heartbeat fatal event received, stopping GCQ…”
- Incorrect response - “Heartbeat validation failed!” and “AMC Heartbeat validation event received”
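The heartbeat messages listed above can be picked out of saved dmesg output and grouped by failure type. The sketch below matches on substrings of the quoted messages only; the function and category names are hypothetical, and nothing is assumed about the rest of the line layout.

```python
HEARTBEAT_PATTERNS = {
    "no_response": ("Failed to get the heartbeat msg",
                    "AMC Heartbeat expired event received"),
    "fatal": ("Heartbeat fail count above threshold",
              "AMC Heartbeat fatal event received"),
    "incorrect_response": ("Heartbeat validation failed",
                           "AMC Heartbeat validation event received"),
}

def classify_heartbeat(line):
    """Return the failure category for a dmesg line, or None if it is not a heartbeat message."""
    for kind, needles in HEARTBEAT_PATTERNS.items():
        if any(needle in line for needle in needles):
            return kind
    return None

def heartbeat_events(dmesg_text):
    """Return (category, line) pairs for every heartbeat message, in log order."""
    events = []
    for line in dmesg_text.splitlines():
        kind = classify_heartbeat(line)
        if kind:
            events.append((kind, line.strip()))
    return events
```

A run of no_response events followed by a fatal event suggests the heartbeat failure count reached its threshold and communications were shut down.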

dmesg content is also written to log files under /var/log/. The exact directory and file names differ between RedHat/CentOS and Ubuntu (e.g. on RedHat/CentOS the full dmesg content is written to the file messages; on Ubuntu this file is named kern.log). Use sudo to read these log files.

The AMI driver also populates dmesg with log messages received from AMC; these are always additionally prefixed with the string “AMC OUTPUT:”. By default, the only AMC messages printed in dmesg are those with the level “LOG”.
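When reading a long log, it can help to separate messages forwarded from AMC from those generated by the AMI driver itself. This is a sketch under one assumption: forwarded lines carry both the ami: prefix and the AMC OUTPUT: prefix, as the wording above suggests. The function name is hypothetical.

```python
def split_ami_messages(dmesg_text):
    """Split AMI-related dmesg lines into (driver_lines, amc_lines)."""
    driver_lines, amc_lines = [], []
    for line in dmesg_text.splitlines():
        if "ami:" not in line:
            continue  # not an AMI message
        if "AMC OUTPUT:" in line:
            amc_lines.append(line)
        else:
            driver_lines.append(line)
    return driver_lines, amc_lines
```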

Changing message verbosity

The verbosity of the AMI messages can be changed by writing a new value to /sys/bus/pci/drivers/ami/ami_debug_enabled (ensure you have write permission for the file).

If the value is 1, additional debug messages are added to dmesg.
If the value is 0, only errors and important information messages are added to dmesg.
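The sysfs file can be read and written like any text file. The following sketch assumes the file holds the single character 0 or 1 as described above; the function names are hypothetical, and writing normally requires root privileges.

```python
from pathlib import Path

AMI_DEBUG = Path("/sys/bus/pci/drivers/ami/ami_debug_enabled")

def get_ami_debug(path=AMI_DEBUG):
    """Return True if additional AMI debug messages are enabled."""
    return Path(path).read_text().strip() == "1"

def set_ami_debug(enabled, path=AMI_DEBUG):
    """Write 1 to enable or 0 to disable additional AMI debug messages."""
    Path(path).write_text("1" if enabled else "0")
```

Remember to set the value back to 0 after debugging so that dmesg is not flooded with debug messages.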

The verbosity of the AMC messages can be updated with the debug_verbosity command. For example, to enable AMC debug messages, run ami_tool debug_verbosity -d <bdf> -l debug. This causes all AMC debug messages to appear in dmesg.

Resets and reboots

Performing a reset after an issue has occurred can help to understand the impact and scope of the issue. For example, if the issue is cleared by a hot reset, the issue is not wide-ranging, but if a full power cycle is required, the issue is more significant.

Several types of reset are possible for an AVED design:

Power cycle: shut down server using BMC, power cycle system, boot up server using BMC
Server cold reboot: shut down server using BMC, boot up server using BMC
Server warm reboot: reboot server using BMC
Host PCI reset (removes the PCI device from the host and forces a bus rescan; this does not remove power from the device): ami_tool reload -t pci -d <bdf>
PCIe hot reset (also known as in-band reset; resets the FPGA but does not reprogram it from flash): ami_tool reload -t sbr -d <bdf>
AMI driver reload (puts the driver into a clean state without reprogramming): ami_tool reload -t driver -d <bdf>
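When scripting a debug session, the three ami_tool reload variants above can be built from one helper so that the mildest reset is tried first. This is a sketch: the helper name is hypothetical, and the ordering from least to most disruptive is one reading of the list above, not a documented guarantee.

```python
# ami_tool reload types from the list above, mildest first (assumed ordering).
RELOAD_TYPES = ("driver", "sbr", "pci")

def reload_command(reload_type, bdf):
    """Build the argument list for 'ami_tool reload -t <type> -d <bdf>'."""
    if reload_type not in RELOAD_TYPES:
        raise ValueError(f"unknown reload type: {reload_type!r}")
    return ["ami_tool", "reload", "-t", reload_type, "-d", bdf]

# The returned list can be passed to subprocess.run(...) on a host with ami_tool installed.
```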

Revert to provided design

For issues seen in custom designs, it may be useful to revert to the released AVED image, if applicable, to confirm whether the issue is also seen there.

sysfs

sysfs is a Linux pseudo file system that provides information about hardware devices in the server system. sysfs is mounted at /sys/ and can be accessed like any other Linux file system (cd, ls, cat, etc.).

/sys/bus/pci/devices/ shows the PCIe connected devices. Each directory here shows a mapping from the BDF to the associated device within sysfs. Each directory contains many files and subdirectories that show device information available to the OS.

Some examples include:

amc_version: version number of the Alveo Management Controller (AMC)
logic_uuid: build UUID
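These attributes can be read directly from the device directory. The sketch below assumes that amc_version and logic_uuid are plain text files directly under /sys/bus/pci/devices/<BDF>/; the function name is hypothetical. Note that sysfs directory names include the PCI domain, for example 0000:21:00.0.

```python
from pathlib import Path

def read_device_attrs(bdf, names=("amc_version", "logic_uuid"),
                      root="/sys/bus/pci/devices"):
    """Return {attribute: value} for a device; missing attributes map to None."""
    device_dir = Path(root) / bdf
    attrs = {}
    for name in names:
        attr_file = device_dir / name
        attrs[name] = attr_file.read_text().strip() if attr_file.is_file() else None
    return attrs
```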

Hardware read/write

Use ami_tool bar_rd and bar_wr to read from or write to memory-mapped registers or memories within the platform.

bar_rd

% ami_tool bar_rd -d <BDF> -b 2 -a 0x0000ffff -l 4
0000ffff DEAD BEEF DEAD BEEF DEAD BEEF DEAD BEEF
INFO: 4 words read successfully

% ami_tool bar_rd -d <BDF> -b 2 -a 0x0000ffff -l 256 -o outfile.bin
INFO: 256 words read successfully

bar_wr

% ami_tool bar_wr -d <BDF> -b 2 -a 0x0000ffff -i 0xACDCACDC
INFO: 1 words written successfully

% ami_tool bar_wr -d <BDF> -b 2 -a 0x0000ffff -I infile.bin
INFO: 512 words written successfully
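A file written by bar_rd with the -o option can be decoded for inspection. The examples above suggest 32-bit words; this sketch additionally assumes the file is a raw dump in little-endian byte order, which is not confirmed here. The function name is hypothetical.

```python
import struct

def read_words(path):
    """Decode a raw binary dump into a list of 32-bit unsigned integers."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 4:
        raise ValueError("file length is not a multiple of 4 bytes")
    # "<I" = little-endian unsigned 32-bit (assumed byte order).
    return [word for (word,) in struct.iter_unpack("<I", data)]

# Example use: print each word as 8 hex digits.
# for word in read_words("outfile.bin"):
#     print(f"{word:08X}")
```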

Swap card slots

If an issue occurs on one card but not on another, try swapping the cards over or moving the card into an empty slot in the server, and see where the issue occurs.

This can help to isolate an issue to a faulty card, faulty slot, or incorrect slot configuration.

Other factors may be involved. For example, the test may fail only intermittently, or a particular server race or load condition may be required to trigger the issue.

Swap cards or servers

Try running the test on a different card and/or in a different server, and see where the issue occurs.

This can help to isolate an issue to a faulty card or faulty server.

Other factors may be involved. For example, the test may fail only intermittently, or a particular server race or load condition may be required to trigger the issue.

Issue Reporting

When reporting an issue against the AVED solution, provide the following information to aid with triage.

Required

Details of issue and any debug attempted.
- OS and kernel versions of the host server.
- Consistent or intermittent issue?
- Single or multi card?
- Recoverable via reset/reboot?
- Steps to reproduce
logs
- ami_tool overview
- ami_tool pcieinfo -d <bdf>
- ami_tool mfg_info -d <bdf>
- ami_tool sensors -d <bdf>
- ami_tool cfgmem_info -d <bdf> -t <type>
- lspci -vvs <bdf>
- dmesg
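Collecting these logs can be scripted so that every report contains the same set. The following is a sketch, not an AVED utility: the function names and output file names are hypothetical, the cfgmem type must be supplied by the caller, and it assumes ami_tool, lspci and dmesg are on the PATH. Run it with sufficient privileges, since lspci and dmesg may show less detail to non-root users.

```python
import subprocess
from pathlib import Path

def report_commands(bdf, cfgmem_type):
    """Return {output name: argv} for the logs listed above."""
    return {
        "ami_overview": ["ami_tool", "overview"],
        "ami_pcieinfo": ["ami_tool", "pcieinfo", "-d", bdf],
        "ami_mfg_info": ["ami_tool", "mfg_info", "-d", bdf],
        "ami_sensors": ["ami_tool", "sensors", "-d", bdf],
        "ami_cfgmem_info": ["ami_tool", "cfgmem_info", "-d", bdf, "-t", cfgmem_type],
        "lspci": ["lspci", "-vvs", bdf],
        "dmesg": ["dmesg"],
    }

def collect_report(bdf, cfgmem_type, out_dir, run=subprocess.run):
    """Run each command and save stdout and stderr to <out_dir>/<name>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, argv in report_commands(bdf, cfgmem_type).items():
        result = run(argv, capture_output=True, text=True)
        (out / f"{name}.txt").write_text(result.stdout + result.stderr)
```

The run parameter exists only so that the command runner can be replaced, for example in a dry run on a machine without the card.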

If applicable

xbtest logs:
- Provide the full set of log files: zip the folder “BDF_<date>_<time>” of the failing run.
- By default, xbtest stores log files under the folder “xbtest_logs”. Within this folder, each run is stored in its own folder named BDF_<date>_<time>.
  - If you store your log files elsewhere (-l option), provide that folder instead.
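The run folder can be zipped with the Python standard library. This is a minimal sketch; the function name is hypothetical and the path passed in is whichever run folder applies to the failing run.

```python
import shutil
from pathlib import Path

def zip_run(run_dir):
    """Create <run_dir>.zip next to the run folder and return the archive path."""
    run_dir = Path(run_dir).resolve()
    if not run_dir.is_dir():
        raise NotADirectoryError(str(run_dir))
    # base_dir keeps the run folder name as the top-level entry inside the archive.
    return shutil.make_archive(str(run_dir), "zip",
                               root_dir=run_dir.parent, base_dir=run_dir.name)
```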
server logs

Page Revision: v. 36