Tridiagonal Linear Solver (GTSV)

The GTSV example resides in the L2/benchmarks/gtsv directory. This tutorial is a step-by-step guide covering the commands for building and running the kernel.
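For checking results on the host, a small CPU reference solver can be handy. The sketch below uses the classic Thomas algorithm (sequential elimination without pivoting). It is an illustration only, not the algorithm the FPGA kernel uses, and the name `thomas_solve` and the argument layout are assumptions made for this example.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs on the CPU (no pivoting).

    lower[i] is A[i][i-1] (lower[0] is ignored),
    diag[i]  is A[i][i],
    upper[i] is A[i][i+1] (upper[n-1] is ignored).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 4x4 system with diagonal 2 and off-diagonals -1, built so that x = [1, 2, 3, 4].
solution = thomas_solve(
    lower=[0.0, -1.0, -1.0, -1.0],
    diag=[2.0, 2.0, 2.0, 2.0],
    upper=[-1.0, -1.0, -1.0, 0.0],
    rhs=[0.0, 0.0, 0.0, 5.0],
)
print(solution)
```

Because this version does not pivot, it is only reliable for well-conditioned inputs such as diagonally dominant matrices.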

Executable Usage

Work Directory (Step 1)

The steps for downloading the library and setting up the environment are described in Vitis Solver Library. To access the design, change to the benchmark directory:

cd L2/benchmarks/gtsv

Build kernel (Step 2)

Run the following make command to build the XCLBIN and host binary for a specific device. Note that this process takes a long time, possibly a couple of hours.

source /opt/xilinx/Vitis/2021.1/settings64.sh
source /opt/xilinx/xrt/setenv.sh
export DEVICE=/opt/xilinx/platforms/xilinx_u250_xdma_201830_2/xilinx_u250_xdma_201830_2.xpfm
export TARGET=hw
make run

Run kernel (Step 3)

To get the benchmark results, please run the following command.

./build_dir.hw.xilinx_u250_xdma_201830_2/test_gtsv.exe -xclbin build_dir.hw.xilinx_u250_xdma_201830_2/kernel_gtsv.xclbin -runs 1 -M 1024

GTSV Input Arguments:

Usage: test_gtsv.exe -[-xclbin -runs -M]
       -xclbin     gtsv binary;
       -runs       number of runs;
       -M          size of input Matrix row/column;

Note: Default arguments are set in the Makefile. The defaults are: -runs 1 -M 16.
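When sweeping several matrix sizes, it can help to generate the command line rather than type it by hand. The following is a minimal sketch: the helper name `gtsv_command` is hypothetical, and the `build_dir.<target>.<platform>` layout is assumed to follow the pattern of the command shown above.

```python
def gtsv_command(platform, runs=1, matrix_size=16, target="hw"):
    """Build the argument list for test_gtsv.exe.

    The defaults for runs and matrix_size mirror the Makefile defaults
    mentioned in the note (-runs 1 -M 16). The build directory naming
    is an assumption based on the example command in this tutorial.
    """
    build_dir = f"build_dir.{target}.{platform}"
    return [
        f"./{build_dir}/test_gtsv.exe",
        "-xclbin", f"{build_dir}/kernel_gtsv.xclbin",
        "-runs", str(runs),
        "-M", str(matrix_size),
    ]


# Print one command per matrix size; pass a list to subprocess.run to execute it.
for size in (16, 256, 1024):
    print(" ".join(gtsv_command("xilinx_u250_xdma_201830_2", matrix_size=size)))
```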

Example output (Step 4)

---------------------GTSV Test----------------
Found Platform
Platform Name: Xilinx
INFO: Found Device=xilinx_u250_xdma_201830_2
INFO: Importing build_dir.hw.xilinx_u250_xdma_201830_2/wcc_kernel.xclbin
Loading: 'build_dir.hw.xilinx_u250_xdma_201830_2/wcc_kernel.xclbin'
INFO: kernel has been created
INFO: kernel start------
INFO: kernel end------
INFO: Execution time 53.697ms
INFO: Write DDR Execution time 0.11773ms
INFO: Kernel Execution time 53.198ms
INFO: Read DDR Execution time 0.049562ms
INFO: Total Execution time 53.3653ms
============================================================
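To compare runs, the timing lines can be pulled out of the log with a short script. This sketch assumes the `INFO: <label> time <value>ms` format shown in the example output; the function name `parse_timings` is hypothetical.

```python
import re

# Matches lines such as "INFO: Kernel Execution time 53.198ms".
TIMING_LINE = re.compile(r"^INFO: (.+?) time ([0-9.]+)ms$")


def parse_timings(log_text):
    """Return a dict mapping each timing label to its value in milliseconds."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


# Timing lines copied from the example output.
sample_log = """\
INFO: kernel start------
INFO: kernel end------
INFO: Execution time 53.697ms
INFO: Write DDR Execution time 0.11773ms
INFO: Kernel Execution time 53.198ms
INFO: Read DDR Execution time 0.049562ms
INFO: Total Execution time 53.3653ms
"""

timings = parse_timings(sample_log)
stage_sum = (
    timings["Write DDR Execution"]
    + timings["Kernel Execution"]
    + timings["Read DDR Execution"]
)
print(timings)
print(f"write + kernel + read = {stage_sum:.4f} ms")
```

In this sample, the write, kernel, and read times add up to the reported total to the printed precision, and the kernel accounts for nearly all of it.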

Profiling

GTSV is validated on the Xilinx Alveo U250 board.

Table 4 lists the hardware resources and performance for the double data type. The overall resource utilization is split into two parts: P is the resource usage of the platform, that is, the logic instantiated in the static region of the FPGA card, and K is the resource usage of the kernels (dynamic region). The unroll factor is the number of CUs configured to process the matrix in parallel.

Table 4 GTSV performance for the double data type
Matrix Size	Unroll	Frequency (MHz)	Latency	LUT	REG	BRAM	URAM	DSP
1024x1024	16	300		146609(P)	225383(P)	283(P)	0(P)	7(P)
1024x1024	16	300		112880(K)	118118(K)	144(K)	0(K)	112(K)

Note

The unroll factor is limited by two things: the matrix size and the number of URAM ports. The unroll factor must be less than half of the matrix size, and \(2 \times {Unroll}^{2}\) must be less than the number of URAMs available on the board. In addition, the unroll factor must be a power of 2.
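The constraints in this note can be checked before starting a long build. The sketch below is one reading of them, assuming the last rule means the unroll factor must be a power of 2. The function name is hypothetical, and the URAM count passed in the example (1280) is an assumed figure that should be replaced with the value for the actual target device.

```python
def is_valid_unroll(unroll, matrix_size, available_uram):
    """Check an unroll factor against the three rules in the note."""
    # Rule 3 (as interpreted here): unroll must be a power of 2.
    is_power_of_two = unroll > 0 and (unroll & (unroll - 1)) == 0
    # Rule 1: unroll must be less than half of the matrix size.
    fits_matrix = 2 * unroll < matrix_size
    # Rule 2: 2 * unroll^2 must be less than the available URAM count.
    fits_uram = 2 * unroll ** 2 < available_uram
    return is_power_of_two and fits_matrix and fits_uram


ASSUMED_URAM = 1280  # assumption for illustration; check the device data sheet

for unroll in (8, 12, 16, 32):
    print(unroll, is_valid_unroll(unroll, 1024, ASSUMED_URAM))
```

With these inputs, 16 passes because \(2 \times 16^{2} = 512\) is below the assumed URAM count, while 32 fails because \(2 \times 32^{2} = 2048\) exceeds it.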

Matrix Size	Unroll	URAM	BRAM	DSP	Register	LUT	Kernel time (us)	Frequency (MHz)
1024x1024	16	128	16	960	260297	223889	16.6	291