..
   comment:: SPDX-License-Identifier: Apache-2.0
   comment:: Copyright (C) 2019-2021 Xilinx, Inc. All rights reserved.

XRT/Board Debug FAQ
-------------------

Debugging failures on board runs can be a daunting task which often requires
*tribal knowledge* to be effective. This document collects the tricks of the
trade to help reduce debug cycles for all users. This is a living document and
will be continuously updated.

Tools of the Trade
~~~~~~~~~~~~~~~~~~

``dmesg``
   Capture Linux kernel and XRT drivers log
``strace``
   Capture trace of system calls made by an XRT application
``gdb``
   Capture stack trace of an XRT application
``lspci``
   Enumerate Xilinx® PCIe devices
``xbutil``
   Query status of Xilinx® PCIe device
``xclbinutil``
   Retrieve info from an xclbin
XRT API Trace
   Run failing application with XRT logging enabled in ``xrt.ini`` file ::

     [Runtime]
     runtime_log=my_run.log

Validating a Working Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~

When observing an application failure on a board, it is important to step back
and validate the board setup. This establishes a clean working environment
before running the failing application. We need to ensure that the board is
enumerating and functioning.

Board Enumeration
   Check if BIOS and Linux can see the board. For Xilinx® boards use the
   ``lspci`` utility ::

     lspci -v -d 10ee:

   Check if XRT can see the board and reports sane values ::

     xbutil examine

XSA Sanity Test
   Run card validation tests such as kernel, bandwidth and dmatest (use
   ``--device`` to select a specific board) ::

     xbutil validate --device

   Check DDR and PCIe bandwidth ::

     xbutil validate --device --run dma

Common Reasons For Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Incorrect Memory Topology Usage
...............................

5.0+ XSAs are considered dynamic platforms which use sparse connectivity
between acceleration kernels and memory controllers (MIGs). This means that a
kernel port can only read/write from/to a specific MIG.
This connectivity is frozen at xclbin generation time and specified in the
``mem_topology`` section of the xclbin. The host application needs to ensure
that it uses the correct memory banks for buffer allocation using
``cl_mem_ext_ptr_t`` for OpenCL applications. For XRT native applications the
bank is specified when allocating a buffer using ``xrt::kernel::group_id()``.

If an application is producing incorrect results it is important to review the
host code to ensure that the host application and the xclbin agree on memory
topology. One way to validate this at runtime is to enable XRT logging in
``xrt.ini`` and then carefully go through all buffer allocation requests.

Memory Read Before Write
........................

Read-Before-Write in 5.0+ XSAs will cause a MIG *ECC* error. This is typically
a user error. For example, a user expects a kernel to write 4KB of data to DDR,
but the kernel produces only 1KB and the user then tries to transfer the full
4KB to the host. It can also happen if the user supplies a 1KB buffer to a
kernel but the kernel tries to read 4KB of data.

Note that an ECC read-before-write error occurs when a read request is made for
a memory location to which no data has been written since the last bitstream
download (which results in MIG initialization). ECC errors stall the affected
MIG since kernels are usually not able to handle this error. This can manifest
in two different ways:

1. The CU may hang or stall because it does not know how to handle this error
   while reading/writing to/from the affected MIG.
   ``xbutil examine --device --report dynamic-regions`` will show that the CU
   is stuck in *BUSY* state and not making progress.

2. The AXI Firewall may trip if a PCIe DMA request is made to the affected MIG,
   as the DMA engine will be unable to complete the request. AXI Firewall trips
   result in the Linux kernel driver killing all processes which have opened
   the device node with the *SIGBUS* signal.
   ``xbutil examine --device --report firewall`` will show if an AXI Firewall
   has indeed tripped, including its timestamp.

Users should review the host code carefully. One common example is compression,
where the size of the compressed data is not known upfront and an application
may try to migrate more data to the host than was produced by the kernel.

Incorrect Frequency Scaling
...........................

Incorrect frequency scaling usually indicates a tooling or infrastructure bug.
Target frequencies for the dynamic (partial reconfiguration) region are frozen
at compile time and specified in the ``clock_freq_topology`` section of the
``xclbin``. If clocks in the dynamic region are running at an incorrect
frequency (higher than specified), kernels will behave erratically:

1. Often a CU will produce completely incorrect results with no identifiable
   pattern.
2. A CU might hang.
3. When run several times, a CU may produce correct results a few times and
   incorrect results the rest of the time.
4. A single CU run may produce a pattern of correct and incorrect result
   segments. For a CU which produces a very long vector output (e.g. vector
   add), the output shows a correct segment, typically 64 bytes or one AXI
   burst, followed by incorrect segments.

Users should check the frequency of the board with
``xbutil examine --device --report platform`` and compare it against the
metadata in the xclbin. ``xclbinutil`` may be used to extract metadata from the
xclbin.

CU Deadlock
...........

HLS scheduler bugs can also result in CU hangs. The CU deadlocks the AXI data
bus, at which point neither read nor write operations can make progress. The
deadlock can be observed with
``xbutil examine --device --report dynamic-regions`` where the CU will appear
stuck in *START* or *---* state. It can also be observed through debug-ip using
the command ``xbutil examine --device --report debug-ip-status``. Note that
this deadlock can cause other CUs which read/write from/to the same MIG to also
hang.

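
A CU that is stuck tends to show the same state in consecutive reports. As a
rough triage aid, the sketch below takes two snapshots of the
``dynamic-regions`` report and checks whether anything changed in between. It
assumes that ``--device`` is followed by the BDF of the board. The function
names, the ``XBUTIL`` override variable and the 5 second default interval are
hypothetical choices made for illustration and are not part of XRT.

.. code-block:: shell

   #!/bin/sh
   # Sketch only: compare two snapshots of the dynamic-regions report.
   # XBUTIL can be overridden, e.g. to point at a specific installation.
   XBUTIL="${XBUTIL:-xbutil}"

   # snapshot_cu_state BDF OUTFILE
   # Writes the dynamic-regions report for the given board to OUTFILE.
   snapshot_cu_state() {
       "$XBUTIL" examine --device "$1" --report dynamic-regions > "$2" 2>&1
   }

   # cu_state_unchanged BDF [INTERVAL_SECONDS]
   # Returns 0 if both snapshots are byte-identical, 1 if they differ,
   # 2 if no temporary directory could be created.
   cu_state_unchanged() {
       bdf="$1"
       interval="${2:-5}"
       tmpdir=$(mktemp -d) || return 2
       snapshot_cu_state "$bdf" "$tmpdir/first.txt"
       sleep "$interval"
       snapshot_cu_state "$bdf" "$tmpdir/second.txt"
       if cmp -s "$tmpdir/first.txt" "$tmpdir/second.txt"; then
           status=0
       else
           status=1
       fi
       rm -rf "$tmpdir"
       return "$status"
   }

With a placeholder BDF, ``cu_state_unchanged 0000:00:00.0 10 && echo "no visible progress"``
would flag a report that did not change over 10 seconds. An unchanged report is
only a hint, since an idle CU looks the same, and if the report contains fields
that always change the comparison would need filtering first.
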

AXI Bus Deadlock
................

AXI Bus deadlocks can be caused by `Memory Read Before Write`_ or
`CU Deadlock`_ described above. These usually show up as a CU hang and may
sometimes cause an AXI Firewall to trip. Run
``xbutil examine --device --report dynamic-regions`` and
``xbutil examine --device --report firewall`` to check if a CU is stuck in
*START* or *--* state or if one of the AXI Firewalls has tripped.

Platform Bugs
.............

Bitstream Download Failures
   Bitstream download failures are usually caused by incompatible xclbin(s).
   The ``dmesg`` log provides more insight into why the download failed. At the
   OpenCL level they usually manifest as Invalid Binary (error -44).

   Rarely, MIG calibration might fail after bitstream download. This will also
   show up as a bitstream download failure. XRT driver messages in ``dmesg``
   usually reveal if MIG calibration failed.

Incorrect Timing Constraints
   If the platform or dynamic region has invalid timing constraints, which is
   really a platform or Vitis tool bug, CUs would show bizarre behaviors. This
   may result in incorrect outputs or CU/application hangs.

Board in Crashed State
~~~~~~~~~~~~~~~~~~~~~~

When a board is in a crashed state, PCIe read operations start returning
``0xFF``. In this state ``xbutil examine`` would show bizarre metrics. For
example ``Temp`` would be very high. Boards in a crashed state may be recovered
with a PCIe hot reset ::

  xbutil reset

If this does not recover the board, perform a warm reboot. After reset/reboot
please follow the steps in `Validating a Working Setup`_.

If for some reason communication between the xocl driver and the management
driver gets disrupted, ``xbutil reset`` may fail to reset the board.
In those cases the following steps are recommended, with the help of a sysadmin
who has root privilege:

1) Unload the xocl driver (also shut down the VM if xocl is running on a VM)
2) Run ``xbmgmt reset``

XRT Scheduling Options
~~~~~~~~~~~~~~~~~~~~~~

XRT has three kernel execution schedulers today: ERT, KDS and legacy. By
default XRT uses ERT which runs on Microblaze. ERT is accessed through KDS
which runs inside the xocl Linux kernel driver. If ERT is not available KDS
uses its own built-in scheduler. From the 2018.2 release onwards KDS (together
with ERT if available in the XSA) is enabled by default. Users can optionally
switch to the legacy scheduler which runs in userspace. Switching schedulers
helps isolate any scheduler-related XRT bugs ::

  [Runtime]
  ert=false
  kds=false

Writing Good Bug Reports
~~~~~~~~~~~~~~~~~~~~~~~~

When creating bug reports please include the following:

1. Output of ``dmesg``
2. Output of ``xbutil examine --device --report all``
3. Application binaries: xclbin, host executable and code, any data files used
   by the application
4. XRT version
5. XSA name and version
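
The items above can be gathered into a single directory before filing a report.
The following is a minimal sketch, assuming that ``--device`` is followed by
the BDF of the board and that the plain ``xbutil examine`` output is a
reasonable place to look for the XRT version and platform name (verify this on
the installed release). The function name ``collect_bug_report``, the output
file names and the ``XBUTIL``/``DMESG`` override variables are hypothetical and
not part of XRT.

.. code-block:: shell

   #!/bin/sh
   # Sketch only: collect bug report material into one directory.
   # XBUTIL and DMESG can be overridden to point at specific binaries.
   XBUTIL="${XBUTIL:-xbutil}"
   DMESG="${DMESG:-dmesg}"

   # collect_bug_report BDF OUTDIR [FILE...]
   # FILE arguments are application artifacts (xclbin, host executable,
   # source code, data files) to copy next to the logs.
   collect_bug_report() {
       bdf="$1"
       outdir="$2"
       shift 2
       mkdir -p "$outdir" || return 1
       # 1. Linux kernel and XRT driver log (may require root on some systems).
       "$DMESG" > "$outdir/dmesg.txt" 2>&1
       # 2. Full device report for the selected board.
       "$XBUTIL" examine --device "$bdf" --report all \
           > "$outdir/xbutil_examine_all.txt" 2>&1
       # 4./5. Host-level report, assumed to help identify XRT version and XSA.
       "$XBUTIL" examine > "$outdir/xbutil_examine.txt" 2>&1
       # 3. Application artifacts supplied by the caller; missing ones are skipped.
       for f in "$@"; do
           [ -e "$f" ] && cp -r "$f" "$outdir/"
       done
       return 0
   }

A call such as ``collect_bug_report 0000:00:00.0 ./bug_report app.xclbin host.exe``
(placeholder BDF and file names) would leave the logs and the copied artifacts
in ``./bug_report``, ready to be archived and attached.
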