SPMV-based Conjugate Gradient Solver with Jacobi Preconditioner¶

Introduction¶

CG solver is widely adopted to solve linear system Ax=b, where the matrix A is symmetric and positive definite. Here is the benchmark for SPMV-based CG solver with the Jacobi preconditioner on Xilinx FPGA Alveo U280.

Benchmark on Hardware¶

Environment Setup (Step 1)¶

Please follow the page Benchmark Overview to correctly setup the environment first.

Hardware Build (Step 2)¶

With the following commands, kernel bitstream cgSolver.xclbin * is built under the directory *./build_dir.hw.xilinx_u280_xdma_201920_3

$ make build TARGET=hw DEVICE=xilinx_u280_xdma_201920_3

Prepare Data (Step 3)¶

Here is a list of the URLs of SPD sparse matrices in the file test.txt. All these sparse matrices are from SuiteSparse Matrix Collection<https://sparse.tamu.edu/>. Users could add more links or trim the existing links in the file. With the following command, these matrices listed in the test.txt file are download from the given links and then are preprocessed. It may take some time to finish the downloading and preprocessing.

$ make data_gen TARGET=hw DEVICE=xilinx_u280_xdma_201920_3

Run on FPGA (Step 4)¶

Check Device¶

If you followed the guide and correctly setup the environment, you are able to run the following command line. You could check whether the target device is prepared and find out the device ID.

$ xbutil scan

Benchmark¶

With the following command, users could benchmark the CG solver with a given matrix.

$ make run TARGET=hw DEVICE=xilinx_u280_xdma_201920_3 mtxName=ted_B

Here lists the configurable parameters with the make command for the benchmark.

Parameters with make command¶
Parameter Name	Default Value	Notes
mtxName	ted_B	Vector size, must be multiple of 16
maxIter	5000	Maximum No. iterations for the solver
tol	1e-12	Fault Tolerance
deviceID	0	Alveo U50 Card ID

Usage¶

Usage: host.exe <XCLBIN File> <Max Iteration> <signature path> <vector path> <mtx_name> [--debug] [device id]
            <XCLBIN File>       path to the xclbin file
            <Max Iteration>     maximum number of iterations
            <Tolerence>         Fault tolerence
            <signature path>    path to the signature files
            <vector path>       path to the vector binary files
            <mtx_name>          sparse matrix name
            <device id>         Device id given by *xbutil scan*

Resource Utilization on Alveo U280¶

The following table lists the resource utilization for SPMV-based CG kernel.

Resource Utilization on U280¶
Name	LUT	LUTAsMem	REG	BRAM	URAM	DSP
User Budget	1104369 [100.00%]	552814 [100.00%]	2217989 [100.00%]	1693 [100.00%]	896 [100.00%]	9020 [100.00%]
Used Resources	285372 [ 25.84%]	36605 [ 6.62%]	442368 [ 19.94%]	267 [ 15.77%]	64 [ 7.14%]	1192 [ 13.22%]

Benchmark Results on Alveo U280 FPGA¶

CPU Hardware information

Model name: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Total threads: 32, Threads/Core: 2, Cores/Socket: 8, Total sockets: 2, Total Cores:16

FPGA Hardware Information

Device name: Xilinx Alveo U280
Fmax: 243MHz

Benchmark Results on U280¶
Matrix Name	Rows/Cols	NNZs	Padded Rows/Cols	Padded NNZs	Padding Ratio	No. iterations	Time per Iter [ms]	Time per Iter on CPU [ms]	Acceleration Ratio
nasa2910	2910	174296	2912	297952	1.70946	1777	0.0511172	0.0692836	1.36
ex9	3363	99471	3364	199328	2.00388	5000	0.0497677	0.0559332	1.12
bcsstk24	3562	159910	3564	222656	1.39238	5000	0.0598962	0.0581827	0.97
bcsstk15	3948	117816	3948	267488	2.27039	658	0.0927269	0.125615	1.35
bcsstk28	4410	219024	4412	319264	1.45767	4878	0.0586356	6.92198	118.05
s3rmt3m3	5357	207695	5360	330624	1.59187	5000	0.0744822	6.55229	87.97
s2rmq4m1	5489	281111	5492	427648	1.52128	1779	0.084562	6.75384	79.87
nd3k	9000	3279690	9000	4277792	1.30433	5000	0.363479	4.66861	12.84
ted_B	10605	144579	10608	548416	3.79319	30	0.984467	6.53108	6.63
ted_B_unscaled	10605	144579	10608	548416	3.79319	16	1.75354	8.59891	4.90
msc10848	10848	1229778	10848	2050720	1.66755	5000	0.230942	5.43921	23.55
cbuckle	13681	676515	13684	924832	1.36705	1282	0.16427	5.48588	33.40
olafu	16146	1015156	16148	1452320	1.43064	5000	0.169174	5.05108	29.86
gyro_k	17361	1021159	17364	1932384	1.89234	5000	0.254172	4.85938	19.12
bodyy4	17546	121938	17548	710112	5.82355	230	0.174435	4.73164	27.13
nd6k	18000	6897316	18000	9415552	1.3651	5000	0.809868	4.25772	5.26
raefsky4	19779	1328611	19780	2268704	1.70758	5000	0.268956	4.22843	15.72
bcsstk36	23052	1143140	23052	1833056	1.60353	5000	0.253049	3.9882	15.76

Convergence¶

Conjugate gradient method may suffer convergent issue for matrices with large condition number. Jacobi preconditioner, adopted in this kernel, is widely used and dramatically reduces the overall number of iterations to solve the linear system. For some matrices, however, the solver with Jacobi preconditioner is not able to converge. For instance, the number of iterations for some matrices in the above table reached the upper limit 5000 with the preset relative tolerance 10e-12.

Although the solver, for some other matrices e.g. ted_B_unscaled, meets the preset tolerance within the preset number of iteration limit, there might still be some mismatches in the result vector compared to the golden reference x. The solution to this issue is to further reduce the tolerance value to such as 10e-15.