SPMV (Double precision)¶

SPMV (Double precision) resides in L2/benchmarks/spmv_double directory.

Dataset¶

There are 22 sparse matrices used in the benchmark. These sparse matrices can be downloaded from https://sparse.tamu.edu.

matrix	rows	cols	NNZs
nasa2910	2910	2910	174296
ex9	3363	3363	99471
bcsstk24	3562	3562	159910
bcsstk15	3948	3948	117816
bcsstk28	4410	4410	219024
s3rmt3m3	5357	5357	207695
s2rmq4m1	5489	5489	281111
nd3k	9000	9000	3279690
ted_B_unscaled	10605	10605	144579
ted_B	10605	10605	144579
msc10848	10848	10848	1229778
cbuckle	13681	13681	676515
olafu	16146	16146	1015156
gyro_k	17361	17361	1021159
bodyy4	17546	17546	121938
nd6k	18000	18000	6897316
raefsky4	19779	19779	1328611
bcsstk36	23052	23052	1143140
msc23052	23052	23052	1154814
ct20stif	52329	52329	2698463
nasasrb	54870	54870	2677324
bodyy6	19366	19366	134748

Executable Usage¶

Work Directory(Step 1)

The steps for library download and environment setup can be found in Vitis Sparse Library. For getting the design,

cd L2/benchmarks/spmv_double

Build hw and host (Step 2)

Run the following make command to build your XCLBIN and host binary targeting a specific device. Please be noticed that this process will take a long time, maybe couple of hours.

make build TARGET=hw PLATFORM_REPO_PATHS=/opt/xilinx/platforms DEVICE=xilinx_u280_xdma_291020_3
make host TARGET=hw PLATFORM_REPO_PATHS=/opt/xilinx/platforms DEVICE=xilinx_u280_xdma_291020_3

Generate inputs(Step 3)

conda activate xf_blas
source ./gen_test.sh

The gen_test.sh triggers a set of python scripts to download the .mtx files listed in test.txt under current directory and partitions them evenly across 16 HBM channels. Each paritioned data set, including the value and indices of each NNZ entry, is stored in one HBM channel. Each row of the partitioned data set is padded to multiple of 32 to accommodate the double precision accumulation latency. The padding overhead for each matrix is summarized in the benchmark result as well. This overhead will be reduced with the improvement of floating point support on FPGA platforms.

Run benchmark(Step 4)

To get the benchmark results, please run the following command.

python ./run_test.py

The run_test.py launches the host executable with each partitioned data set and offloads the double precision SpMV operation to U280 card. The SpMV operation is run numerous time (2000 in this benchmark) to mask out the host code overhead. The total run time in the benchmark results includs the OpenCl function call time to trigger the CUs and the hardware run time. The run time [ms] / iteration field gives single SpMV run time on the U280 card.

Example output(Step 5)

All tests pass!
Please find the benchmark results in spmv_perf.csv.

Profiling¶

The SPMV double precision design is validated on Alveo U280 board at 256 MHz frequency. The hardware resource utilizations are listed in the following table.

Table 1 Hardware resources for SPMV double precision design¶
Name	LUT	BRAM	URAM	DSP
Platform	165475	323	64	4
SPMV design	220980	211	64	900
User Budget	1137245	1693	896	9020
Percentage	19.43%	12.46%	7.14%	9.98%

The performance result is shown below.

matrix runs total time[sec] time[ms]/run

nasa2910 2000 0.102513 0.0512565

ex9 2000 0.0759525 0.0379762

bcsstk24 2000 0.0747713 0.0373857

bcsstk15 2000 0.0872443 0.0436221

bcsstk28 2000 0.116322 0.0581609

s3rmt3m3 2000 0.106942 0.0534711

s2rmq4m1 2000 0.126217 0.0631087

nd3k 2000 0.677946 0.338973

ted_B_unscaled 2000 0.136411 0.0682054

ted_B 2000 0.149135 0.0745673

msc10848 2000 0.391394 0.195697

cbuckle 2000 0.216792 0.108396

olafu 2000 0.263899 0.131949

gyro_k 2000 0.412774 0.206387

bodyy4 2000 0.269815 0.134907

nd6k 2000 1.50509 0.752544

raefsky4 2000 0.446744 0.223372

bcsstk36 2000 0.374293 0.187146

msc23052 2000 0.723612 0.361806

ct20stif 2000 1.01894 0.509468

nasasrb 2000 0.780656 0.390328

bodyy6 2000 0.247517 0.123759