# Mixed Kernels Design Tutorial with AXI Stream and Vitis

This tutorial demonstrates the design flow for an example mixed-kernel hardware design that includes an RTL kernel, HLS C kernels, and the Vitis Vision Library. The design generates a real-time clock image, resizes it, alpha-mixes it with an input image in global memory, and finally writes the result image back to global memory. An AXI stream interface is used for the kernel-to-kernel connections.

*Figure: Alpha Mixing*


The hardware design includes three kernels: *rtc_gen*, *alpha_mix*, and *strm_dump*. These kernels are connected directly to each other through AXI stream links. The topology of the design is shown in the figure below.

*Figure: Topology*

The design has been verified with the following software/hardware environment and tool chain versions:

* Operating System
  * Redhat/CentOS 7.4 - 7.9
  * Ubuntu 16.04/18.04
  * OpenCV & OpenCL libraries required
* Vitis: 2020.2
* XRT: 2.8.726
* Hardware and Platform (both the deployment and development platforms are needed)
  * Alveo U200 - xilinx_u200_xdma_201830_2, xilinx_u200_gen3x16_xdma_1_1_202020_1
  * Alveo U250 - xilinx_u250_xdma_201830_2, xilinx_u250_gen3x16_xdma_3_1_202020_1
  * Alveo U50 - xilinx_u50_gen3x16_xdma_201920_3
  * Alveo U280 - xilinx_u280_xdma_201920_3

The directory structure and brief explanations are as follows.

~~~
├── doc/                          # documents
├── hw/                           # Hardware build working directory
│   ├── alpha_mix.cpp             # HLS C source code for alpha_mix kernel
│   ├── build_rtc_gen_xo.sh       # shell script to call Vivado to package rtc_gen kernel IP to xo file
│   ├── config_gen.mk             # Makefile sub-module to generate Vitis linking configuration files
│   ├── include/                  # Vitis Vision library include files for HLS C
│   ├── Makefile                  # Makefile for hardware building
│   ├── package_rtc_gen.tcl       # Vivado tcl script to package rtc_gen kernel IP to xo file
│   ├── rtc_gen_ip/               # IP directory for rtc_gen kernel, including all the RTL source code
│   ├── rtc_gen_kernel.xml        # Kernel description file for rtc_gen kernel
│   └── strm_dump.cpp             # HLS C source for strm_dump kernel
├── README.md
├── rtc_gen                       # working directory for rtc_gen kernel development
│   ├── font_sim_data.txt         # text format font library file for RTL simulation
│   └── src/                      # RTL source code for rtc_gen kernel
└── sw/                           # Test program directory
    ├── build/                    # Software build working directory
    │   ├── font.dat              # Font library file including 11 characters
    │   ├── setup_emu.sh          # setup script for emulation mode
    │   └── xrt.ini               # XRT configuration for emulation and debug
    ├── CMakeLists.txt            # cmake configuration file
    ├── media/                    # Media files for test program
    └── src/                      # Test program source code
        ├── rtc_alpha_tb.cpp      # Test program for the whole design
        ├── rtc_gen_test.cpp      # Test program for rtc_gen kernel
        └── xcl2/                 # Xilinx OpenCL include files
~~~

## RTL Kernel: rtc_gen (XO)

*rtc_gen* is the real-time clock digit image generation kernel, written in Verilog HDL. It has an internal always-running real-time clock driven by the AXI bus clock through a clock divider. The time value can be set by the host via kernel arguments. The kernel first loads the font image library from global memory into an on-chip buffer, then outputs the real-time clock digit image through the AXI stream port. The user can also read out the time value from the internal always-running time counter. Each character in the font library is 240 (height) by 160 (width) pixels, and the font library includes 11 characters: digits 0-9 and the colon. Refer to the image below for the font library contents.

*Figure: Font Library*


Each pixel in the font library is represented by 4 bits, which hold the opacity value of that pixel. When output through the AXI stream port, the 4-bit opacity value is expanded to 8 bits by shifting it left by 4 bits and adding 15 (for example, 0xB is expanded to 0xBF). The opacity value is used by the downstream alpha-mixing kernel to generate the time digits with the configured color.

The font image data size for a single character is:

~~~
240 x 160 x 4 = 153,600 bits = 19,200 bytes
~~~

The total font image library size is:

~~~
19,200 x 11 = 211,200 bytes
~~~

*rtc_gen* supports two time formats: one with centi-seconds, HOUR:MIN:SEC:CENTISEC, which has 11 characters; the other without centi-seconds, HOUR:MIN:SEC, which has 8 characters. The time format is set through kernel arguments. Refer to the figures below for examples of the two output time formats.

*Figure: Time Format*
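
The size arithmetic and the 4-bit to 8-bit opacity expansion described above can be checked with a small C++ sketch. The constant and function names here are illustrative only and are not taken from the kernel source.

```cpp
#include <cstddef>
#include <cstdint>

// Font geometry quoted in the text (names are hypothetical).
constexpr std::size_t kCharHeight = 240;   // pixels
constexpr std::size_t kCharWidth = 160;    // pixels
constexpr std::size_t kBitsPerPixel = 4;   // 4-bit opacity per pixel
constexpr std::size_t kNumChars = 11;      // digits 0-9 plus colon

// Bytes needed to store one character of the font library.
constexpr std::size_t char_bytes() {
    return kCharHeight * kCharWidth * kBitsPerPixel / 8;
}

// Bytes needed to store the whole font library.
constexpr std::size_t font_library_bytes() {
    return char_bytes() * kNumChars;
}

// Expand a 4-bit opacity value to 8 bits: shift left by 4, then add 15.
constexpr std::uint8_t expand_opacity(std::uint8_t nibble) {
    return static_cast<std::uint8_t>(((nibble & 0x0F) << 4) + 0x0F);
}

// Compile-time checks against the numbers given in the text.
static_assert(char_bytes() == 19200, "single character size");
static_assert(font_library_bytes() == 211200, "font library size");
static_assert(expand_opacity(0xB) == 0xBF, "0xB expands to 0xBF");
```

With this expansion, a fully opaque font pixel (0xF) maps to 0xFF, while a fully transparent one (0x0) maps to 0x0F rather than 0x00.
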

The *rtc_gen* kernel has three bus interfaces:

+ AXI-Lite slave interface for kernel arguments and control
+ AXI master interface for font library data loading
+ AXI stream master interface for clock digit image output

The kernel is composed of three blocks: *rtc_gen_axi_read_master* for AXI master based font library reading, *rtc_gen_control_s_axi* for AXI slave based kernel arguments and control, and *rtc_gen_core* for the core kernel function and AXI stream output. *rtc_gen_axi_read_master* is a standard block generated by the Vitis/Vivado RTL Kernel Wizard. *rtc_gen_control_s_axi* is also a generated block, but it needs some modifications to add the time value read-out function.

*Figure: rtc_gen Block*

When triggered by the host, the kernel reads the time value from the internal real-time clock and outputs one frame of the time image corresponding to that value.

The following table summarizes the arguments used by the *rtc_gen* kernel.

|No. | Arguments | Width | Description |
| ---- | ---- | ---- | ---- |
|0 | work_mode | 1 |[0]: determines the kernel working mode<br>0 - load font from global memory to on-chip SRAM via AXI read master<br>1 - output RTC digit image via AXI stream master |
|1 | cs_count | 32 |[21:0]: Centi-second counter. For example, if the system clock is 200 MHz, cs_count should be set to 2,000,000 |
|2 | time_format | 1 |[0]: determines whether centi-seconds are included in the output digit images<br>0 - disable centi-seconds output<br>1 - enable centi-seconds output |
|3 | time_set_val | 32 |Set time value for the internal free-running clock:<br>[31:24] - hours<br>[23:16] - minutes<br>[15:8] - seconds<br>[7:0] - centi-seconds |
|4 | time_set_en | 1 |[0]: writing 1 to this bit loads time_set_val into the internal free-running clock |
|5 | time_val | 32 |Read-only register for the internal real-time clock time value:<br>[31:24] - hours<br>[23:16] - minutes<br>[15:8] - seconds<br>[7:0] - centi-seconds |
|6 | read_addr | 64 |AXI master pointer, the FPGA device buffer address of the font library |
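
As an illustration of the register layout in the table, the following C++ sketch packs a time into the *time_set_val* layout, unpacks a *time_val* reading, and derives *cs_count* from the clock frequency. The `RtcTime` type and the function names are hypothetical, and the sketch assumes each 8-bit field holds a plain binary number (not BCD); check the RTL source to confirm the encoding.

```cpp
#include <cstdint>

// Hypothetical helper type, not part of the original source.
struct RtcTime {
    std::uint8_t hours;
    std::uint8_t minutes;
    std::uint8_t seconds;
    std::uint8_t centiseconds;
};

// Pack a time into the time_set_val layout:
// [31:24] hours, [23:16] minutes, [15:8] seconds, [7:0] centi-seconds.
// Assumption: fields are plain binary values.
inline std::uint32_t pack_time(const RtcTime& t) {
    return (static_cast<std::uint32_t>(t.hours) << 24) |
           (static_cast<std::uint32_t>(t.minutes) << 16) |
           (static_cast<std::uint32_t>(t.seconds) << 8) |
           static_cast<std::uint32_t>(t.centiseconds);
}

// Unpack a time_val register reading using the same layout.
inline RtcTime unpack_time(std::uint32_t v) {
    return RtcTime{static_cast<std::uint8_t>((v >> 24) & 0xFF),
                   static_cast<std::uint8_t>((v >> 16) & 0xFF),
                   static_cast<std::uint8_t>((v >> 8) & 0xFF),
                   static_cast<std::uint8_t>(v & 0xFF)};
}

// Number of clock cycles per centi-second (1/100 s) for a given clock.
inline std::uint32_t cs_count_for_clock(std::uint64_t clock_hz) {
    return static_cast<std::uint32_t>(clock_hz / 100);
}
```

For a 200 MHz clock, `cs_count_for_clock(200000000)` gives 2,000,000, matching the example in the table.
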

Please read [RTC_GEN RTL Kernel Creation](./doc/rtc_gen_tutorial.md) for more details of the RTL kernel *rtc_gen* and a step-by-step guide to creating it.

## HLS C Kernel: alpha_mix (XO)

The *alpha_mix* kernel performs the following tasks in order:

* Receive the clock digit image from the *rtc_gen* kernel via the AXI stream port
* Resize the clock digit image with the Vitis Vision Library resize function
* Load the background image from global memory, then execute alpha mixing with the clock digit image
* Send out the mixed image via the AXI stream port

*Figure: alpha_mix flow*
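
The alpha mixing step can be illustrated with the conventional 8-bit blend of a single color channel. This is only a sketch of the general technique; the rounding and scaling used inside the *alpha_mix* kernel are not described here and may differ, and the function name is hypothetical.

```cpp
#include <cstdint>

// Conventional 8-bit alpha blend of one color channel.
// alpha = 0 keeps the background, alpha = 255 keeps the foreground.
// Assumption: the actual alpha_mix kernel may round or scale differently.
inline std::uint8_t blend_channel(std::uint8_t fg, std::uint8_t bg,
                                  std::uint8_t alpha) {
    const unsigned mixed = fg * alpha + bg * (255u - alpha);
    return static_cast<std::uint8_t>((mixed + 127u) / 255u);  // rounded division
}
```

Applied per channel, the opacity value received from *rtc_gen* would play the role of `alpha`, the configured character color the role of `fg`, and the background image pixel the role of `bg`.
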

The *alpha_mix* kernel has four bus interfaces:

* AXI-Lite slave interface for control
* AXI master interface for background image loading
* AXI stream slave interface for clock digit image receiving
* AXI stream master interface for mixed image output

The following table summarizes the arguments used by the *alpha_mix* kernel. Please note that the kernel uses *XF_NPPC8* mode, meaning eight pixels are processed in each clock cycle, so ensure that the *background image width* and the *resized time image width* are integer multiples of 8; otherwise the kernel might hang.

|No. | Arguments | Width | Description |
| ---- | ---- | ---- | ---- |
|0 | reserved | - | - |
|1 | bgr_img_input | 64 | AXI master pointer, FPGA device buffer for input background image |
|2 | reserved | - | - |
|3 | time_img_rows_in | 32 | Input time image height |
|4 | time_img_cols_in | 32 | Input time image width |
|5 | time_img_rows_rsz | 32 | Resized time image height |
|6 | time_img_cols_rsz | 32 | Resized time image width |
|7 | time_img_pos_row | 32 | Time image vertical coordinate, starting from 0 |
|8 | time_img_pos_col | 32 | Time image horizontal coordinate, starting from 0 |
|9 | time_char_color | 32 | Time digit color, bit range [23:0] used for [RGB] |
|10 | time_bgr_color | 32 | Time background color, bit range [23:0] used for [RGB] |
|11 | time_bgr_opacity | 32 | Time background opacity, [7:0] used, value range 0 - 255 |
|12 | bgr_img_rows | 32 | Background image height |
|13 | bgr_img_cols | 32 | Background image width |

Refer to the figure below for the meaning of some of the kernel arguments.

*Figure: alpha_mix kernel arguments*
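
A host program preparing these arguments might use helpers like the following C++ sketch to pack a color into bits [23:0] and to check the multiple-of-8 width constraint. The function names are hypothetical, and the channel order (R in [23:16], G in [15:8], B in [7:0]) is an assumption read from the "[RGB]" notation in the table; verify it against the kernel source.

```cpp
#include <cstdint>

// Pack 8-bit R, G, B values into bits [23:0] of a 32-bit argument.
// Assumption: R occupies [23:16], G [15:8], B [7:0].
constexpr std::uint32_t pack_rgb(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return (static_cast<std::uint32_t>(r) << 16) |
           (static_cast<std::uint32_t>(g) << 8) |
           static_cast<std::uint32_t>(b);
}

// XF_NPPC8 processes eight pixels per clock, so widths must be
// non-zero integer multiples of 8.
constexpr bool is_valid_width(std::uint32_t cols) {
    return cols != 0 && cols % 8 == 0;
}

// Round a width down to the nearest multiple of 8.
constexpr std::uint32_t align_down_8(std::uint32_t cols) {
    return cols & ~7u;
}
```

Checking `is_valid_width` on both the background image width and the resized time image width before launching the kernel avoids the hang mentioned above.
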

Please read [ALPHA_MIX HLS C Kernel Creation](./doc/alpha_mix_tutorial.md) for more details of the HLS C kernel *alpha_mix*.

## HLS C Kernel: strm_dump (XO)

*strm_dump* is a simple HLS kernel that dumps the input AXI stream to global memory via an AXI master. The following table summarizes the arguments used by the *strm_dump* kernel.

|No. | Arguments | Width | Description |
| ---- | ---- | ---- | ---- |
|0 | reserved | - | - |
|1 | output_addr | 64 |AXI master pointer, the FPGA device buffer address for the output image |
|2 | byte_size | 32 |Amount of data to be output, in bytes. This can be calculated from the time format and color depth |

## Bitstream Implementation (XCLBIN)

### rtc_gen_test_hw.xclbin / rtc_gen_test_hw_emu.xclbin

This is a simple test system for the *rtc_gen* kernel. It integrates two kernels, *rtc_gen* and *strm_dump*, which are connected together by an AXI stream bus. Refer to the following connection diagram on the U50 platform. Depending on the build target (hw or hw_emu), one of the two XCLBIN files is generated.

*Figure: rtc_gen_test Diagram*
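
For this test system, the *byte_size* argument of *strm_dump* could be computed as in the C++ sketch below. It assumes that *rtc_gen* emits one byte (the expanded 8-bit opacity) per pixel and that a frame is 240 rows by 160 columns per character; both points are inferred from the kernel description rather than stated explicitly, and the function names are hypothetical.

```cpp
#include <cstdint>

// Generic frame size in bytes.
constexpr std::uint32_t frame_bytes(std::uint32_t rows, std::uint32_t cols,
                                    std::uint32_t bytes_per_pixel) {
    return rows * cols * bytes_per_pixel;
}

// Size of one rtc_gen output frame.
// Assumptions: 240 x 160 pixels per character, 1 byte per pixel,
// 11 characters with centi-seconds and 8 characters without.
constexpr std::uint32_t rtc_frame_bytes(bool with_centiseconds) {
    return frame_bytes(240u, 160u * (with_centiseconds ? 11u : 8u), 1u);
}
```

Under these assumptions the 8-character frame is 307,200 bytes and the 11-character frame is 422,400 bytes.
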

### rtc_alpha_hw.xclbin / rtc_alpha_hw_emu.xclbin

This is the fully implemented system, which integrates all three kernels: *rtc_gen*, *alpha_mix*, and *strm_dump*, connected together via AXI stream buses. Note that the function of *strm_dump* could easily be merged into the *alpha_mix* kernel; it is kept as a separate kernel here only to demonstrate the kernel-to-kernel AXI stream connection. Refer to the following connection diagram on the U50 platform. Depending on the build target (hw or hw_emu), one of the two XCLBIN files is generated.

*Figure: rtc_alpha Diagram*

## Test Program

### rtc_gen_test.cpp

This program first determines the running mode from the environment variable *XCL_EMULATION_MODE*, then uses the binary file *rtc_gen_test_hw.xclbin* or *rtc_gen_test_hw_emu.xclbin* to test the RTL kernel *rtc_gen*. It tests both the 8-digit and 11-digit clock formats, and the generated clock image is displayed directly. The program also uses the XRT low-level API *xclRegRead* to read and print the value of the register *time_val* of the *rtc_gen* kernel, namely the value of the internal hardware time counter. The value of *time_val* is also used to control the image display refresh.

For *xclRegRead* to work correctly, create or modify the *xrt.ini* file in the execution directory to add the following lines:

~~~
[Runtime]
exclusive_cu_context=true
~~~

### rtc_alpha_tb.cpp

This program first determines the running mode from the environment variable *XCL_EMULATION_MODE*, then uses the binary file *rtc_alpha_hw.xclbin* or *rtc_alpha_hw_emu.xclbin* to mix the generated real-time clock images into a background image. The user can select the background image, set the time format, and set the clock time through command-line parameters. The user can also change the color, size, and position of the clock image by modifying the program source code. This test program also uses the *xclRegRead* API to read the value of the register *time_val* of the *rtc_gen* kernel and uses that value to control the image display refresh.
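
The mode detection described for both test programs can be sketched in C++ as follows. This is not the code from the test programs; the function names are hypothetical, and the sketch assumes that the value `hw_emu` selects hardware emulation while any other value, or an unset variable, selects real hardware.

```cpp
#include <cstdlib>
#include <string>

// Pick the xclbin file name for a design ("rtc_gen_test" or "rtc_alpha")
// from the value of XCL_EMULATION_MODE.
// Assumption: "hw_emu" means hardware emulation; anything else
// (including an unset variable, passed as nullptr) means hardware.
inline std::string select_xclbin(const std::string& design, const char* emu_mode) {
    const bool hw_emu = (emu_mode != nullptr) && std::string(emu_mode) == "hw_emu";
    return design + (hw_emu ? "_hw_emu.xclbin" : "_hw.xclbin");
}

// Convenience wrapper that reads the real environment variable.
inline std::string select_xclbin_from_env(const std::string& design) {
    return select_xclbin(design, std::getenv("XCL_EMULATION_MODE"));
}
```

Keeping the environment lookup in a separate wrapper makes the selection logic easy to test without touching the process environment.
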

## How to Use This Repository

Before going through the following steps, source the XRT and Vitis setup files, for example:

~~~
source /opt/xilinx/xrt/setup.sh
source /opt/xilinx/Vitis/2020.2/settings64.sh
~~~

The two test programs need to display images. If you are using a remote server, use a VNC desktop, or an ssh connection with X11 forwarding along with a local X11 server.

### Build the hardware

Change to the *./hw* directory, then use the **make** command to build the three XO files and two XCLBIN files. The available make command options are:

~~~
make
    Display help information

make all TARGET= PLATFORM=
    Command to build all the rtc_gen_test and rtc_alpha xclbin and necessary kernel files (xo) for specified target and platform.
    By default, TARGET=hw, PLATFORM=xilinx_u200_gen3x16_xdma_1_1_202020_1

make all_xo TARGET= PLATFORM=
    Command to build all the kernel files (xo), including rtc_gen.xo, alpha_mix.xo and strm_dump.xo
    By default, TARGET=hw, PLATFORM=xilinx_u200_gen3x16_xdma_1_1_202020_1

make clean
    Command to remove all the generated files.
~~~

In the make command options, TARGET can be *hw* or *hw_emu*. Because the *rtc_gen* kernel does not provide a software emulation model, *sw_emu* mode cannot be used. When TARGET is *hw*, the XCLBIN and XO files have the *_hw* postfix; when TARGET is *hw_emu*, they have the *_hw_emu* postfix. Note that the RTL kernel *rtc_gen* is not affected by the *hw* or *hw_emu* option, so there is only one XO file for it, *rtc_gen.xo*.

PLATFORM can be one of six choices: xilinx_u200_gen3x16_xdma_1_1_202020_1, xilinx_u200_xdma_201830_2, xilinx_u250_gen3x16_xdma_3_1_202020_1, xilinx_u250_xdma_201830_2, xilinx_u50_gen3x16_xdma_201920_3 and xilinx_u280_xdma_201920_3. Whether or not you have these Alveo cards installed, you can use a platform as the build PLATFORM as long as you have installed its development platform package (deb or rpm packages) on your system.
You can look in the */opt/xilinx/platform* directory or use the command *platforminfo -l* to check which platforms are installed. The generated xclbin and xo files are placed in the *./hw* directory after the make command finishes successfully.

For example, to build all XO and XCLBIN files in hardware emulation mode for the U50 card, enter:

~~~
make all TARGET=hw_emu PLATFORM=xilinx_u50_gen3x16_xdma_201920_3
~~~

Because building the XCLBIN files for the hardware target takes a long time, pre-built XCLBIN files (*rtc_gen_test_hw.xclbin* and *rtc_alpha_hw.xclbin*) are also provided for each supported Alveo platform. Note that they are built with the *TARGET=hw* option and cannot be used in *hw_emu* mode. The *hw_emu* target XCLBIN files are much faster to build and are system dependent, so please build them yourself. You can download the pre-built XCLBIN files via the link:

**To use the pre-built xclbin files, copy the two xclbin files corresponding to your target platform into the *./hw* directory; they will be used directly in the downstream steps.**

### Build and run the software

* Step 1: generate Makefile

Change to the *./sw/build* directory, then enter the **cmake ..** or **cmake3 ..** command. This generates the *Makefile* for the software build and links the two XCLBIN files in the *./hw* directory into the *./sw/build* directory.

~~~
cd ./sw/build
cmake ..
~~~

* Step 2: compile the programs

Enter the **make** command to compile the two C++ programs.

~~~
make
~~~

Because the XRT low-level API *xclRegRead* is used in the test programs, the hardware mode and hardware emulation mode need different sets of link libraries. Altogether, four executables are generated after a successful compilation: *rtc_alpha_tb*, *rtc_alpha_tb_emu*, *rtc_gen_test*, *rtc_gen_test_emu*.
Please use the correct executables for hardware or hardware emulation mode.

* Step 3: configure running mode (hardware or hardware emulation)

The script *setup_emu.sh* is provided to set the running mode.

**Run in hardware mode**

If you have not entered emulation mode before, just run the executables *rtc_gen_test* and *rtc_alpha_tb* to run in hardware mode. If you have entered hardware emulation mode and want to return to real hardware mode, use the following command before running the executables:

~~~
source setup_emu.sh -s off
~~~

**Run in hardware emulation mode**

To try the test programs in hardware emulation mode, use the executables *rtc_gen_test_emu* and *rtc_alpha_tb_emu*. Before running them, run the following command first:

~~~
source setup_emu.sh -s on -p PLATFORM_NAME
~~~

*PLATFORM_NAME* is one of the six supported platforms. Run the following command to get help information:

~~~
source setup_emu.sh
~~~

For example, to run the executables in hardware emulation mode with the U50 platform, enter:

~~~
source setup_emu.sh -s on -p xilinx_u50_gen3x16_xdma_201920_3
~~~

*setup_emu.sh* generates the necessary configuration file and sets up the environment.

**Note:** The *PLATFORM_NAME* you enter here must be consistent with the XCLBIN files in the *./sw/build* directory.

For more details on hardware emulation for this example design, please read [Emulation Tutorial](./doc/hw_emu_tutorial.md).

* Step 4: run executables **rtc_gen_test** or **rtc_gen_test_emu**

Run the executable *rtc_gen_test* or *rtc_gen_test_emu* to run the program in hardware or hardware emulation mode. First, an eight-digit clock is displayed. Keep the image window in front and press the *ESC* key, and a second, eleven-digit clock is displayed. Keep the image window in front and press the *ESC* key again to exit the program. The program also reads and prints the value of the register *time_val* of the kernel.

**Don't forget to set the running mode to hardware emulation before running *rtc_gen_test_emu*.**

~~~
./rtc_gen_test
or
./rtc_gen_test_emu
~~~

The program first determines the running mode (hw or hw_emu), then looks for the file *./sw/build/rtc_gen_test_hw.xclbin* or *./sw/build/rtc_gen_test_hw_emu.xclbin*, analyzes it to find the platform it targets, and compares that with the card you have installed. If a mismatch is detected, an error is reported and the program exits.

**Note**: running in hardware emulation mode may take a long time, since it is actually running an RTL simulation.

* Step 5: run executables **rtc_alpha_tb** or **rtc_alpha_tb_emu**

Run the executable *rtc_alpha_tb* or *rtc_alpha_tb_emu* to run the program in hardware or hardware emulation mode. The executable accepts a few command-line parameters; the usage is as follows:

~~~
rtc_alpha_tb [-i BACK_IMAGE] [-f] [-s] [-h]
  -i BACK_IMAGE: set path to the background image, default is ../media/alveo.jpg
  -f : set to use eleven-digit clock, default is eight-digit
  -s : use system time to set the clock, default don't set the clock
  -h : print help information
~~~

Three images are provided in the *./sw/media* directory: alveo.jpg, vitis.jpg and victor.jpg. You can also use other images. Note that the images must be in three-channel format (RGB without transparency). Also, use images that are big enough; otherwise, modify the program source code to adjust the clock image size or position. Some example command lines:

~~~
rtc_alpha_tb
    Mix the clock image with ../media/alveo.jpg and display, don't sync the kernel internal real-time-clock with Linux system clock, and use 8-digit format.

rtc_alpha_tb -i ../media/vitis.jpg -f -s
    Mix the clock image with ../media/vitis.jpg and display, sync the kernel internal real-time-clock with Linux system clock, and use 11-digit format.
~~~

To exit the program, keep the image window in front, then press the **ESC** key.

The program first determines the running mode (hw or hw_emu), then looks for the file *./sw/build/rtc_alpha_hw.xclbin* or *./sw/build/rtc_alpha_hw_emu.xclbin*, analyzes it to find the platform it targets, and compares that with the card you have installed. If a mismatch is detected, an error is reported and the program exits.

You can modify the following *#define* section at the beginning of the *./sw/src/rtc_alpha_tb.cpp* file to adjust the color, size, position, and opacity of the clock image, then repeat **step 2** to re-compile the program and run it to see the result. Make sure that the widths of the background image and the resized clock image are integer multiples of 8.

~~~c++
// position of clock image, top-left corner is (0,0)
#define RTC_POSITION_ROW 64
#define RTC_POSITION_COL 400

// resized clock image size for 8-digit font digit size
// ensure RTC_IMG_WIDTH is integer multiple of 8
#define RTC_IMG_WIDTH_8D 480
#define RTC_IMG_HEIGHT_8D 90

// resized clock image size for 11-digit font digit size
// ensure RTC_IMG_WIDTH is integer multiple of 8
#define RTC_IMG_WIDTH_11D 528
#define RTC_IMG_HEIGHT_11D 72

// clock image font color
#define FONT_COLOR_R 255
#define FONT_COLOR_G 255
#define FONT_COLOR_B 255

// clock image background color
#define BGR_COLOR_R 80
#define BGR_COLOR_G 80
#define BGR_COLOR_B 80

// clock image background opacity
#define BGR_OPA 100
~~~

**Note**: running in hardware emulation mode may take a long time, since it is actually running an RTL simulation. You can use a smaller background image to reduce the run time; in that case, remember to adjust the size and position parameters described above accordingly.

* Step 6: try Vitis profiling function with **rtc_gen_test** and **rtc_alpha_tb** program.

Vitis provides powerful profiling features that give you a deeper view into performance, bandwidth usage, design bottlenecks, and so on.
Please read [Profiling the Application](./doc/profile_tutorial.md) for more details.

Copyright© 2020 Xilinx