XRT/Board Debug FAQ
Debugging failures on board runs can be daunting and often requires tribal knowledge to be effective. This document collects the tricks of the trade to help reduce debug cycles for all users. It is a living document and will be continuously updated.
Tools of the Trade
Capture Linux kernel and XRT drivers log
Capture trace of system calls made by an XRT application
Capture stack trace of an XRT application
Enumerate Xilinx® PCIe devices
Query status of Xilinx® PCIe device
Retrieve info from an xclbin
- XRT API Trace
Run failing application with XRT logging enabled in
Validating a Working Setup
When observing an application failure on a board, it is important to step back and validate the board setup. That will help establish and validate a clean working environment before running the failing application. We need to ensure that the board is enumerating and functioning.
- Board Enumeration
Check if BIOS and Linux can see the board. For Xilinx® boards, use
lspci -v -d 10ee:
Check if XRT can see the board and reports sane values
- XSA Sanity Test
Card validation covering kernel, bandwidth, dmatest, etc. (use --device <bdf> to point at a specific board)
xbutil validate --device <bdf>
Check DDR and PCIe bandwidth
xbutil validate --device <bdf> --run dma
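The checks above can be scripted so that they always run in the same order. The following is a minimal Python sketch, not part of XRT: the helper names `validation_commands` and `run_checks` are hypothetical, only the command lines quoted in this section are used, and it assumes that a non-zero exit status signals a failed check.

```python
import subprocess


def validation_commands(bdf):
    """Return the setup-validation commands from this section, in order.

    `bdf` is the PCIe bus:device.function string of the board under test.
    """
    return [
        # Board enumeration: can BIOS and Linux see the board?
        ["lspci", "-v", "-d", "10ee:"],
        # Card validation.
        ["xbutil", "validate", "--device", bdf],
        # DDR and PCIe bandwidth.
        ["xbutil", "validate", "--device", bdf, "--run", "dma"],
    ]


def run_checks(bdf):
    """Run each command in turn; return the first one that fails, or None.

    Assumption: a missing tool or a non-zero exit status counts as a failure.
    """
    for cmd in validation_commands(bdf):
        try:
            result = subprocess.run(cmd, check=False)
        except FileNotFoundError:
            return cmd
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_checks` with the board's BDF stops at the first failing step, which is usually the right place to start debugging before rerunning the failing application.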
Common Reasons For Failures
Incorrect Memory Topology Usage
5.0+ XSAs are considered dynamic platforms which use sparse connectivity between acceleration kernels and memory controllers (MIGs). This means that a kernel port can only read/write from/to a specific MIG. This connectivity is frozen at xclbin generation time and specified in the mem_topology section of the xclbin. The host application needs to ensure that it uses the correct memory banks for buffer allocation, using cl_mem_ext_ptr_t for OpenCL applications. For XRT native applications the bank is specified when allocating buffer using
If an application is producing incorrect results it is important to review the host code to ensure that host application and xclbin agree on memory topology. One way to validate this at runtime is to enable XRT logging in
xrt.ini and then carefully go through all buffer allocation requests.
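When going through the logged allocation requests, the comparison against the xclbin connectivity can be done mechanically. The sketch below is illustrative only: `find_bank_mismatches` is a hypothetical helper, and both inputs are assumed to be filled in by hand (the expected bank per kernel argument from the xclbin metadata, the requested bank per argument from the host code or log).

```python
def find_bank_mismatches(connectivity, allocations):
    """Return (argument, requested_bank, expected_bank) for each bad allocation.

    connectivity: dict mapping a kernel argument name to the memory bank
        index it is wired to (hypothetical input, taken from the xclbin).
    allocations: list of (argument name, bank index) pairs as requested by
        the host application (hypothetical input, taken from the host code
        or the XRT log).
    """
    mismatches = []
    for arg, bank in allocations:
        expected = connectivity.get(arg)  # None if the argument is unknown
        if bank != expected:
            mismatches.append((arg, bank, expected))
    return mismatches
```

An empty result means the host application and the xclbin agree on memory topology for every listed buffer; any entry points at a buffer allocated in a bank the kernel port cannot reach.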
Memory Read Before Write
Read-Before-Write in 5.0+ XSAs will cause a MIG ECC error. This is typically a user error. For example, the user expects a kernel to write 4KB of data to DDR, the kernel produces only 1KB, and the user then tries to transfer the full 4KB to the host. It can also happen if the user supplies a 1KB buffer to a kernel but the kernel tries to read 4KB of data. Note that an ECC read-before-write error occurs when a read request is made for a memory location to which no data has been written since the last bitstream download (which results in MIG initialization). ECC errors stall the affected MIG since kernels are usually not able to handle this error. This can manifest in two different ways:
CU may hang or stall because it does not know how to handle this error while reading/writing to/from the affected MIG.
xbutil examine --device <bdf> --report dynamic-regions will show that the CU is stuck in BUSY state and not making progress.
AXI Firewall may trip if a PCIe DMA request is made to the affected MIG, as the DMA engine will be unable to complete the request. When an AXI Firewall trips, the Linux kernel driver sends SIGBUS to kill all processes which have opened the device node.
xbutil examine --device <bdf> --report firewall will show if an AXI Firewall has indeed tripped, including its timestamp.
Users should review the host code carefully. One common example is compression where the size of the compressed data is not known upfront and an application may try to migrate more data to host than was produced by the kernel.
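The size arithmetic behind both examples can be checked before issuing a transfer. This is a minimal sketch under stated assumptions: the helper names are hypothetical, both regions are assumed to start at the same offset, and `bytes_produced` is assumed to come from the kernel itself (for example a length value it returns with its output).

```python
def unwritten_bytes(bytes_written, bytes_requested):
    """Bytes that a read of `bytes_requested` would touch without a prior write."""
    if bytes_written < 0 or bytes_requested < 0:
        raise ValueError("sizes must be non-negative")
    return max(0, bytes_requested - bytes_written)


def safe_readback_size(buffer_size, bytes_produced):
    """Largest transfer to the host that only covers data the kernel wrote."""
    if buffer_size < 0 or bytes_produced < 0:
        raise ValueError("sizes must be non-negative")
    return min(buffer_size, bytes_produced)
```

For the 4KB/1KB example above, `unwritten_bytes(1024, 4096)` is 3072, so a full-buffer transfer would read 3072 bytes that were never written. In the compression case, migrating only `safe_readback_size(buffer_size, bytes_produced)` bytes avoids the problem.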
Incorrect Frequency Scaling
Incorrect frequency scaling usually indicates a tooling or infrastructure bug. Target frequencies for the dynamic (partial reconfiguration) region are frozen at compile time and specified in the clock_freq_topology section of the xclbin. If clocks in the dynamic region are running at an incorrect (higher than specified) frequency, kernels will demonstrate weird behavior:
Often a CU will produce completely incorrect result with no identifiable pattern
A CU might hang
When run several times, a CU may produce correct results a few times and incorrect results the rest of the time
A single CU run may produce a pattern of correct and incorrect result segments. Hence for a CU which produces a very long vector output (e.g. vector add), a correct segment (typically 64 bytes or one AXI burst) followed by incorrect segments is generated.
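The segment pattern in the last symptom is easier to spot when the output is compared against a reference in fixed-size chunks. The following sketch assumes a known-good reference output is available on the host; `classify_segments` is a hypothetical helper, and the 64-byte default simply follows the segment size mentioned above.

```python
def classify_segments(expected, actual, segment_size=64):
    """Compare two byte strings segment by segment.

    Returns one boolean per segment: True where the segment of `actual`
    matches the same segment of `expected`.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return [
        expected[i:i + segment_size] == actual[i:i + segment_size]
        for i in range(0, len(expected), segment_size)
    ]
```

A result that alternates between `True` and `False` runs, rather than being wrong throughout or wrong only in the tail, is consistent with the segmented corruption described above.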
Users should check the frequency of the board with
xbutil examine --device <bdf> --report platform and compare it against the metadata in xclbin.
xclbinutil may be used to extract metadata from xclbin.
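Once both sets of numbers are in hand, the comparison itself is straightforward. The sketch below is illustrative: `overclocked_clocks` is a hypothetical helper, the clock names and the 1 MHz tolerance are assumptions, and both dictionaries are expected to be filled in by hand from the platform report and the xclbin metadata.

```python
def overclocked_clocks(target_mhz, measured_mhz, tolerance_mhz=1.0):
    """Return {clock name: (target, measured)} for clocks running too fast.

    target_mhz: clock name -> target frequency from the xclbin metadata.
    measured_mhz: clock name -> frequency reported for the board.
    tolerance_mhz: slack allowed above the target (an assumed default).
    """
    too_fast = {}
    for name, target in target_mhz.items():
        measured = measured_mhz.get(name)
        if measured is not None and measured > target + tolerance_mhz:
            too_fast[name] = (target, measured)
    return too_fast
```

Any entry in the result is a clock in the dynamic region running above its specified frequency, which matches the failure mode described in this section.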
HLS scheduler bugs can also result in CU hangs. The CU deadlocks the AXI data bus, at which point neither read nor write operations can make progress. The deadlock can be observed with
xbutil examine --device <bdf> --report dynamic-regions, where the CU will appear stuck in START or --- state (this can also be observed through debug-ip using the command
xbutil examine --device <bdf> --report debug-ip-status). Note that this deadlock can cause other CUs which read/write from/to the same MIG to also hang.
AXI Bus Deadlock
AXI Bus deadlocks can be caused by Memory Read Before Write or CU Deadlock described above. These usually show up as a CU hang and sometimes may cause an AXI Firewall to trip. Run
xbutil examine --device <bdf> --report dynamic-regions and
xbutil examine --device <bdf> --report firewall to check if the CU is stuck in START or --- state or if one of the AXI Firewalls has tripped.
- Bitstream Download Failures
Bitstream download failures are usually caused by incompatible xclbin(s). The
dmesg log would provide more insight into why the download failed. At the OpenCL level they usually manifest as Invalid Binary (error -44).
Rarely, MIG calibration might fail after bitstream download. This will also show up as a bitstream download failure. Usually XRT driver messages in
dmesg would reveal if MIG calibration failed.
- Incorrect Timing Constraints
If the platform or dynamic region has invalid timing constraints — which is really a platform or Vitis tool bug — CUs would show bizarre behaviors. This may result in incorrect outputs or CU/application hangs.
Board in Crashed State
When a board is in a crashed state, PCIe read operations start returning
0xFF. In this state
xbutil examine would show bizarre
metrics. For example
Temp would be very high. Boards in crashed state
may be recovered with PCIe hot reset
If this does not recover the board perform a warm reboot. After reset/reboot please follow steps in Validating a Working Setup
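The all-ones read pattern described above is easy to test for when raw register or memory contents are available. This is a minimal sketch: `looks_crashed` is a hypothetical helper, and how the raw bytes are obtained from the device is left out on purpose.

```python
def looks_crashed(raw):
    """True if every byte read back is 0xFF, the pattern described above.

    An empty read is treated as inconclusive and returns False.
    """
    return len(raw) > 0 and all(b == 0xFF for b in raw)
```

A single all-0xFF word can be legitimate data, so the check is more convincing when applied to a larger read that should contain varied values.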
If for some reason communication between the xocl driver and the management driver gets disrupted,
xbutil reset may fail to reset the board. In those cases the following steps are recommended, with the help of a sysadmin who has root privilege
unload xocl driver (also shut down VM if xocl is running on a VM)
XRT Scheduling Options
XRT has three kernel execution schedulers today: ERT, KDS and legacy. By default XRT uses ERT, which runs on MicroBlaze. ERT is accessed through KDS, which runs inside the xocl Linux kernel driver. If ERT is not available, KDS uses its own built-in scheduler. From the 2018.2 release onwards KDS (together with ERT if available in the XSA) is enabled by default. Users can optionally switch to the legacy scheduler, which runs in userspace. Switching schedulers will help isolate any scheduler-related XRT bugs.
[Runtime]
ert=false
kds=false
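When switching schedulers repeatedly during debug, it can help to generate the configuration rather than edit it by hand. The sketch below uses Python's standard `configparser` to produce the same settings as the fragment above; `legacy_scheduler_ini` is a hypothetical helper, and writing the result to an xrt.ini file next to the application is left to the caller.

```python
import configparser
import io


def legacy_scheduler_ini():
    """Return ini text with a [Runtime] section that sets ert and kds to false."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep key case exactly as written
    config["Runtime"] = {"ert": "false", "kds": "false"}
    buffer = io.StringIO()
    # Emit "key=value" without spaces around "=", matching the fragment above.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

Removing the two keys again restores the default scheduler selection, so results with and without the legacy scheduler can be compared directly.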
Writing Good Bug Reports
When creating bug reports please include the following:
xbutil examine --device <bdf> --report all
Application binaries: xclbin, host executable and code, any data files used by the application
XSA name and version