AI Engine Debug Walkthrough Tutorial - From Simulation to Hardware |
AI Engine Debug with Hardware Profiling Features |
Overview¶
In this tutorial you will learn how to:
Use hardware profiling features to inspect design.
Hardware Profiling Features¶
Two flows, XSDB and XRT flow, are supported profiling features with AIE design. The profiling feature requires no design source code change to collect profiling data. No special options required to build the design. You can use XRT Flow or XSDB Flow for this tutorial.
XRT Flow¶
Step 1.1 Prepare the boot image and boot the board¶
After the design is built correctly without error, we are ready to run on the hardware board.
Flash the SD card with the built sd_card.img.
Plug the flashed SD card into the SD card slot of the vck190 board.
Connect the USB type C cable to the board and computer that supports serial port connection.
Set the serial port configuration with Speed=115200, Data=8 bit, Parity=none, Stop bits=1 bit and flow control=none.
Power up the vck190 board to see boot messages from serial connection.
Step 1.2 Generate Profiling data¶
Create an xrt.ini
file on SD card using the following lines.
[Debug]
aie_profile = true
aie_profile_interval_us = 1000
aie_profile_core_metrics = heat_map
aie_profile_memory_metrics = conflicts
aie_profile_interface_metrics = input_bandwidths:0
Supported predefined metrics can be found from UG1076, Profiling the AI Engine in Hardware section (https://docs.xilinx.com/r/en-US/ug1076-ai-engine-environment/Profiling-the-AI-Engine-in-Hardware).
Step 1.3 to Run Application after Petalinux Boots up on Board¶
cd /run/media/mmcblk0p1
./host.exe a.xclbin
Step 1.4 Collect Profiling Files¶
After the design run completes on the hardware, the generated profiling files and run_summary files need to be collected and ready to be examined.
Step 1.4.1¶
Make sure aie_profile_edge_[core_metrics]_[memory_metrics]_[interface_metrics].csv
, summary.csv
and xrt.run_summary
files are created on the SD card.
Step 1.4.2¶
Power off the board and take out the sd_card from board and plug in the sd card to host’s sd card reader.
Step 1.4.3¶
Copy aie_profile_edge_[core_metrics]_[memory_metrics]_[interface_metrics].csv
, summary.csv
and xrt.run_summary
files from SD card back to where the design is.
Step 1.5 Launch Vitis Analyzer to Examine Profiling Files¶
vitis_analyzer xrt.run.summary
After issuing above command, expect to see result from Step 3 Expected Result with Vitis_Analyzer and continue this tutorial.
XSDB Flow¶
Step 2.1 Prepare the boot image and boot the board¶
Same as step 1.1 of XRT flow.
Step 2.2 Generate Profiling data¶
Step 2.2.1 Target Connection Setup¶
Run hardware server from computer that connects to target board¶
Launch hw_server from the computer that has JTAG connection to the VCK190 board.
Step 2.2.2 Connect XSDB to Board¶
Launch xsdb from your host computer at the same level of the design’s Work directory: Issue this commands from XSDB prompt,
xsdb
%xsdb connect -url TCP:${COMPUTER NAME/IP}:3121
%xsdb ta
%xsdb ta 1
%xsdb source ${XILINX_VITIS)/scripts/vitis/util/aie_profile.tcl
%xsdb aieprofile start -graphs dut -work-dir ./Work -core-metrics heat_map -memory-metrics conflicts -interface-metrics input_bandwidths:0 -interval 20 -samples 100
note:
-graph: The graph profile data to be captured.
-core-metrics: The core metrics to be captured.
-memory-metrics: The memory metrics to be captured.
-interface-metrics: The interface metrics to be captured.
-interval: The sample interval in milliseconds (default 20).
-samples: The number of counter samples (default 100).
IMPORTANT: After above command issued, wait until Count: 10, Count: 20, … is displayed from XSDB console. This indicates XSDB is ready to collect design profiling data.
Step 2.3 to Run Application after Petalinux Boots up on Board¶
cd /run/media/mmcblk0p1
./host.exe a.xclbin
Step 2.4 Inspect generated Files¶
After XSDB complete, expect to see aie_profile.csv
, summary.csv
and aie_trace_profile.run_summary
files are created on the host computer at the location where XSDB is launched.
Step 2.5 Launch Vitis Analyzer to Examine Profiling Files¶
vitis_analyzer aie_trace_profile.run.summary
After issuing above command, expect to see result from Step 3 Expected Result with Vitis_Analyzer and continue this tutorial.
Step 3 Expected Result with Vitis_Analyzer¶
Vitis_analyzer GUI is launched, select Profile Summary
then AI Engine & Memory
or Interface Channels
.
Step 4 Open Multiple Profiling Runs with Vitis_Analyzer¶
You can run the application as many times as you would like with your preferences. However, some of these metrics’ sets are interconnected because some use group events and others use individual events. For example, the heat_map metric set contains a metric that groups all kinds of stall events in a single metric along with other metrics that group data transfer events (load/store, streams, cascade, etc.) and vector instructions. To get a better view of which stall type(s) are prevalent, re-run with the stalls metric set. To better understand execution, re-run with the execution metric set.
Step 4.1 Generate First Profiling Data¶
Apply heat_map
for core-metrics
, conflicts
for memory-metrics
, and input_bandwidths
for interface-metrics
to collect first profiling data.
Follow step 1.1 to 1.4 if using XRT flow or follow step 2.1 to 2.4 if using XSDB flow. Save profiling data to a directory ex. profile_0
.
Step 4.2 Generate Second Profiling Data¶
Apply execution
for core-metrics
, dma_locks
for memory-metrics
, and output_bandwidths
for interface-metrics
to collect second profiling data.
Follow step 1.1 to 1.4 if using XRT flow or follow step 2.1 to 2.4 if using XSDB flow. Save profiling data to a directory ex. profile_1
.
Step 4.3 Open First Profiling Data¶
Follow step 1.5 for XRT flow or follow step 2.5 for XSDB flow to open first profiling run_summary file with vitis_analyzer.
Step 4.4 Open Second Profiling Data¶
Click on +
from GUI, highlighted in red square to browse and select second profiling run_summary file. Two runs of profiling data are combined.
This example combines first run with heat_map
, conflicts
, and input_bandwidths
metrics and second run with execution
, dma_locks
and output_bandwidth
metrics.
Click on %
to toggle between absolute and percentage values of collected design metrics.
Click on column header to sort the data within those rows. Click once to display selected row data in ascending order. Click twice to display selected row data in descending order. Click three times to disable sorting function.
Profiling Data Explanation¶
An easy way to know the definition of profile data category by moving mouse cursor to column title. For example, Active Time (ms)
is “Amount of time (in ms) AI Engine was active when it was enabled.”
AI Engine core profiling data¶
Active Time (ms)
=Stall Time (ms)
+Active Utilization Time (ms)
.Take tile(6,0) as an example, tile(6,0) is active for a period of 1.401 milliseconds, where 1.333 milliseconds is stalled and 0.069 milliseconds is actively executing instructions. During 0.069 milliseconds active period, 0.061 milliseconds is executing vector instructions. There are 0.008 milliseconds spent on other instructions such as load/store instructions.
There are 76800
Vector instructions
, 79303Load Instructions
and 1135Store Instructions
duringActive Utilization Time (ms)
.
AI Engine memory profiling data¶
Memory Conflict Time (ms)
indicates memory access conflicts time runing AI Engine execution. Recommend rerunningaiesimulator
with-enable-memory-check
option to check design memory access conflicts.Cumulative Memory Errors Time (ms)
indicates time taken due to ECC errors in any of the data memory banks as well as MM2S and S2MM DMAs.
Interface profiling data¶
Select Profile Summary
then Interface Channels
.
PLIO Bandwidth
displays the design ran with PLIO bandwidth in unit of MB/s. This info is a good indication the design is I/O bound or compute bound design and provides guidance to optimize design performance.
Profiling Data Analysis¶
From AI Engine core profiling data, tile(6,1), tile(6,3),… have much larger number of
Store Instructions
. An indication check tile source code if lowering number ofStore Instructions
can be done to improve performance.From AI Engine Memory profiling data, tile(6,1), tile(9,0),… have non-zero
Memory Conflict Time
value. Suggest running AIE simulator to check for memory access violations and clear those violations if any.From AI Engine Memory profiling data, tile(6,1), tile(6,3),… have longer
Cumulative DMA Lock Stalls Time
. This leads to check input/output PLIO area to see if PLIO frequency or PLIO width is implemented properly. Suggest using Integrated Logic Analyzer (ILA) to check PLIO input/output states during run time.From AI Engine Interface profiling data, hardware profiling data shows design
PLIO Bandwidth
. Apply sorting function by click onPLIO Bandwidth (MB/s)
to examine highest and lowest PLIO bandwidth. Highest is 289.741 MB/s vs. lowest 254.709 MB/s is over 13% difference. This difference suggests to check PLIO input/output implementation and use ILA checking PLIO input/output during run time.
Support¶
GitHub issues will be used for tracking requests and bugs. For questions go to support.xilinx.com.
License¶
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
XD005 | © Copyright 2021 Xilinx, Inc.