# AMR Debug Techniques

## *Debug Methods*

The debug techniques are listed here in approximate increasing order of complexity and effort.

### **Programming**

If an issue occurs during a cfgmem\_fpt/cfgmem\_program operation (see the [AMR Management Interface userguide](ami-tool-guide.md#amrmanagementinterfaceuserguide-ami-tool-overview)), go straight to [AMR Updating FPT Image in Flash](update-fpt.md) to recover the card.

### **Card status**

Once an error has occurred, it can be useful to check the card status and health to see what is still working and what is not. For AMI-specific commands, see the [AMR Management Interface userguide (ami\_tool)](ami-tool-guide.md#amrmanagementinterfaceuserguide-ami-tool-overview). Some useful checks are:

- **lspci device check**
    - Issue ‘sudo lspci -vvs \’

            21:00.0 Processing accelerators: Xilinx Corporation Device 50b4
            Subsystem: Xilinx Corporation Device 000e
            Physical Slot: 2-1
            Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
            Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
            Capabilities: [3b0 v1] Physical Layer 16.0 GT/s
            Capabilities: [400 v1] Lane Margining at the Receiver
            Capabilities: [460 v1] Extended Capability ID 0x2a
            Capabilities: [600 v1] Vendor Specific Information: ID=0020 Rev=0 Len=010
            Kernel driver in use: ami
            Kernel modules: ami

    - Check that the PF has the expected
        - device ID
        - number of BARs
        - BAR sizes
        - link speed and width in LnkCap
        - link speed and width in LnkSta
        - Vendor Specific capability present
        - kernel driver in use
            - PF0 AMI
- **ami\_tool link check**
    - Issue ‘ami\_tool pcieinfo -d \’
    - For each expected PF, check the following entries for consistency with the lspci link status
        - PCIe Link
        - NUMA NODE
        - CPU Affinity
- **ami\_tool overview check**
    - Issue ‘ami\_tool overview’
    - Check that the AMI version is as expected
        - AMI version should match AMI driver version
        - AMI major version should match AMC major version
    - For each BDF check
        - ‘Devices State’ is READY
            - If the device state is not READY, cold reboot the system
        - Expected design name and UUID
- **ami\_tool mfg\_info check**
    - Issue ‘ami\_tool mfg\_info -d \’
    - Check
        - eeprom version
        - product name
        - board revision
        - serial no
        - mac address
        - mfg date
        - uuid
        - board part number
        - mfg part number
- **xbtest verify check - Only applicable to V80**
    - Issue ‘xbtest -c verify -d \ -F’
    - When completed, for each BDF check
        - “RESULT: ALL TESTS PASSED” present
        - “ERROR” not present
- **sensor check**
    - Issue ‘ami\_tool sensors -d \ -f json -o \’
    - Check that the JSON produced is well formed
    - For each expected sensor
        - Check that the entry is present in the JSON
        - Check for the expected value
    - See [Sensors and Hardware Monitoring](../firmware/sensors.md) for thresholds

### **xbtest logs and CSV files applicable to V80**

For errors that occurred while xbtest was running, check the xbtest logs for context and error reporting.

xbtest writes sensor data into several CSV files. These can be analyzed as text or imported into Microsoft Excel to analyze the data graphically.

xbtest can be useful to test the card capabilities and to push it to its limits. This is more often helpful for exposing or reproducing an issue than for triaging or debugging one. However, xbtest does provide an easy method for continuously reading card sensors such as power and temperature, which can be useful in some debug situations.

For xbtest documentation, see [xbtest Userguide](../V80/xbtest/install-and-run-xbtest.md).

### **Server logs and sensors**

- **BMC (Baseboard Management Controller)**
    - This provides remote configuration and power control for the Dell servers that V80 cards have been verified against.
    - There is a BMC for each server. It stores logs of server activity. These can help to diagnose physical server issues (e.g. someone pulled out a power cord, or shut down the wrong server). Use the log timestamps to correlate this with other logs.
    - The BMC stores server sensor readings such as temperature. This historical data can be viewed in logs and charts and can help to diagnose issues with power or temperature. For example, if card sensors are not working properly or are not being reported correctly then xbtest would not report the correct temperature, but the BMC would.
- **Crash logs**
    - To help with debug, when a server crash occurs, the server loads a crash kernel, which freezes the state of the server and writes this state and the current messages into a crash directory.
    - This is stored in /var/crash/ where a new directory is created for each server crash, named with a timestamp.

### **dmesg**

dmesg is an OS tool for reporting messages from the kernel, including from drivers such as AMI. Useful commands are:

    dmesg | less    # show full dmesg output, piped to a pager
    dmesg -wT       # show live dmesg output, auto-updates when new messages are sent.
                    # -T gives wall clock timestamps (default is seconds since server reboot)

Consult OS documentation for full usage details.

dmesg is particularly useful for information about PCIe® connections at server boot and for AMI debug information.

For AMI messages:

- messages are preceded by **ami:** and usually include the card BDF.
- heartbeat message for AMI/AMC comms will appear if there’s a failure or breakdown in comms.
    - No response
        - “Failed to get the heartbeat msg\!” and “AMC Heartbeat expired event received”
    - Fatal failure (communications have been shut down)
        - “Heartbeat fail count above threshold\! Raising fatal event…” and “AMC Heartbeat fatal event received, stopping GCQ…”
    - Incorrect Response
        - “Heartbeat validation failed\!” and “AMC Heartbeat validation event received”

dmesg content is also written into log files at /var/log/ - the exact directory and file names here differ between RedHat/CentOS and Ubuntu (e.g. on RedHat/CentOS the full dmesg content is written to file `messages`, on Ubuntu this file is named `kern.log`).
Use sudo to read these log files.

The AMI driver also populates dmesg with log messages received from AMC; these are always additionally prefixed with the string “AMC OUTPUT:”. By default, the only AMC messages printed in dmesg are those with the level “LOG”.

#### Changing message verbosity

The verbosity of the AMI messages can be changed by writing a new value to **/sys/bus/pci/drivers/ami/ami\_debug\_enabled** (ensure you have write permission on the file).

- If the value is 1, additional debug messages are added to dmesg.
- If the value is 0, only errors and important information messages are added to dmesg.

The verbosity of the AMC messages can be updated with the **debug\_verbosity** command. For example, to enable AMC debug messages, run **ami\_tool debug\_verbosity -d \ -l debug**; this causes all AMC debug messages to appear in dmesg.

### **Resets and reboots**

Performing a reset after an issue has occurred can help to understand the impact and scope of the issue. For example, if the issue is cleared by a software device reboot, the issue is not wide-ranging, but if a full power cycle is required, the issue is more significant.

Several types of reset are possible for an AMR design:

- Power cycle: shut down server using BMC, power cycle system, boot up server using BMC
- Server cold reboot: shut down server using BMC, boot up server using BMC
- Server warm reboot: reboot server using BMC
- Host PCI reset (removes the PCI device from the host and forces a bus rescan; this does not remove power from the device): **ami\_tool reload -t pci -d \**
- Software device reboot (firmware-controlled reboot from a flash partition): **ami\_tool device\_boot -d \ -p \**
- AMI driver reload (put driver into clean state without reprogramming): **ami\_tool reload -t driver -d \**

### **Revert to provided design**

For issues seen in custom designs, it may be useful, where applicable, to revert to the released AMR image to confirm whether the same issues are also seen there.
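
When comparing a custom design against the released image, it can help to capture the AMI and AMC messages from dmesg for each run and diff them. The following is a minimal sketch, not part of the AMR tooling. The function names `strip_ts`, `ami_lines` and `amc_lines` and the file names are hypothetical, and it assumes that each kernel log line starts with a bracketed timestamp and that AMI and AMC messages carry the **ami:** and “AMC OUTPUT:” markers described in the dmesg section.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes each kernel log line starts with a bracketed
# timestamp, as printed by dmesg with or without -T.

# Remove the leading "[timestamp]" so two captures can be diffed.
strip_ts() {
    sed -E 's/^\[[^]]*\][[:space:]]*//'
}

# Keep only lines from the AMI driver (marked "ami:").
ami_lines() {
    grep -F 'ami:' | strip_ts
}

# Keep only AMC messages forwarded by the AMI driver (marked "AMC OUTPUT:").
amc_lines() {
    grep -F 'AMC OUTPUT:' | strip_ts
}

# Usage (hypothetical file names):
#   sudo dmesg | ami_lines > custom_design.txt
#   (revert to the released AMR image and rerun the test)
#   sudo dmesg | ami_lines > released_image.txt
#   diff custom_design.txt released_image.txt
```

Stripping the timestamps keeps the diff focused on message content rather than on when each message was logged.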
### **sysfs**

sysfs is a Linux pseudo file system that provides information about hardware devices in the server system. sysfs is mounted at /sys/ and can be accessed like any other Linux file system (cd, ls, cat, etc.).

/sys/bus/pci/devices/ shows the PCIe connected devices. Each directory here shows a mapping from the BDF to the associated device within sysfs. Each directory contains many files and subdirectories that show device information available to the OS. Some examples include:

- amc\_version: version number for the Alveo Management Controller (AMC)
- logic\_uuid: build UUID

### **Hardware read/write**

Use **ami\_tool bar\_rd** and **bar\_wr** to read from or write to memory-mapped registers or memories on the board.

**bar\_rd**

    % ami_tool bar_rd -d -b 2 -a 0x0000ffff -l 4
    0000ffff DEAD BEEF DEAD BEEF DEAD BEEF DEAD BEEF
    INFO: 4 words read successfully
    % ami_tool bar_rd -d -b 2 -a 0x0000ffff -l 256 -o outfile.bin
    INFO: 256 words read successfully

**bar\_wr**

    % ami_tool bar_wr -d -b 2 -a 0x0000ffff -i 0xACDCACDC
    INFO: 1 words written successfully
    % ami_tool bar_wr -d -b 2 -a 0x0000ffff -I infile.bin
    INFO: 512 words written successfully

### **Swap card slots**

If an issue occurs on one card but not on another, try swapping the cards over or moving the card into an empty slot in the server, and see where the issue occurs. This can help to isolate an issue to a faulty card, faulty slot, or incorrect slot configuration.

There may be other factors involved. For example, the test may only fail intermittently, or some server race or load condition may be required to create the conditions that trigger the issue.

### **Swap cards or servers**

Try running the test on a different card and/or in a different server, and see where the issue occurs. This can help to isolate an issue to a faulty card or faulty server.

There may be other factors involved.
For example, the test may only fail intermittently, or some server race or load condition may be required to create the conditions that trigger the issue.

## *Issue Reporting*

When reporting an issue against the AMR solution, provide the following information to aid with triage.

### **Required**

- Details of the issue and any debug attempted.
- OS and kernel versions of the host server.
- Consistent or intermittent issue?
- Single or multi card?
- Recoverable via reset/reboot?
- Steps to reproduce
- Logs
    - ami\_tool overview
    - ami\_tool pcieinfo -d \
    - ami\_tool mfg\_info -d \
    - ami\_tool sensors -d \
    - ami\_tool cfgmem\_info -d \ -t \
    - lspci -vvs \
    - dmesg

### **If applicable**

- xbtest logs:
    - Provide the *full set* of log files: zip the folder “BDF\_\\_\” of the failing run.
    - By default, xbtest stores log files under the folder “xbtest\_logs”. Within this folder, each run is stored in its own folder titled BDF\_\\_\.
    - If you store your files differently (-l option), provide the folder accordingly.
- Server logs
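
The commands in the Required list can be run in one pass and saved to a single folder for attaching to a report. The following is a minimal sketch under stated assumptions, not an official collection script. The function names `capture` and `collect_report`, the output file names, and the argument order are hypothetical. The value passed to `cfgmem_info -t` is supplied by the caller and is not validated here, and the use of `uname` and `/etc/os-release` to record OS and kernel versions is an assumption about the host.

```shell
#!/usr/bin/env bash
# Sketch only: gathers the outputs listed under "Required" into one folder.
# Function names, file names and argument order are hypothetical.

# Run a command and save its stdout and stderr to <outdir>/<name>.txt.
# A failing command is noted in the file instead of aborting the run.
capture() {
    local outdir=$1 name=$2
    shift 2
    "$@" > "$outdir/$name.txt" 2>&1 \
        || echo "command failed with exit status $?: $*" >> "$outdir/$name.txt"
}

# collect_report BDF OUTDIR CFGMEM_TYPE
# CFGMEM_TYPE is passed unchanged to "cfgmem_info -t"; see the ami_tool
# guide for valid values.
collect_report() {
    local bdf=$1 outdir=$2 cfgmem_type=$3
    mkdir -p "$outdir" || return 1
    capture "$outdir" kernel      uname -a
    capture "$outdir" os_release  cat /etc/os-release
    capture "$outdir" overview    ami_tool overview
    capture "$outdir" pcieinfo    ami_tool pcieinfo -d "$bdf"
    capture "$outdir" mfg_info    ami_tool mfg_info -d "$bdf"
    capture "$outdir" sensors     ami_tool sensors -d "$bdf"
    capture "$outdir" cfgmem_info ami_tool cfgmem_info -d "$bdf" -t "$cfgmem_type"
    capture "$outdir" lspci       sudo lspci -vvs "$bdf"
    capture "$outdir" dmesg       sudo dmesg
}

# Usage (BDF taken from the lspci example; folder name is hypothetical):
#   collect_report 21:00.0 ./amr_report "$CFGMEM_TYPE"
```

Because each command is captured independently, a card that no longer responds to one query still yields the remaining outputs, and the failure itself is recorded alongside them.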