GEMV-based Conjugate Gradient Solver with Jacobi Preconditioner¶

Introduction¶

The conjugate gradient (CG) solver is widely used to solve the linear system Ax=b, where the matrix A is symmetric and positive definite (SPD). This page presents the benchmark of the GEMV-based CG solver with the Jacobi preconditioner on the Xilinx Alveo U50 FPGA.
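
For orientation, the algorithm being benchmarked can be sketched in software. The following is a minimal pure-Python reference of CG with a Jacobi (diagonal) preconditioner, not the FPGA kernel itself. The function names and the relative-residual stopping rule are assumptions made for illustration. Each iteration performs one dense matrix-vector product (GEMV), which dominates the cost.

```python
# Reference sketch only: hypothetical helper names, no FPGA involved.

def gemv(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, max_iter=100, tol=1e-12):
    """Solve A x = b for an SPD matrix A; returns (x, iterations_used)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm2 = dot(b, b)
    if b_norm2 == 0.0:
        return x, 0
    for k in range(max_iter):
        # Assumed stopping rule: ||r|| <= tol * ||b||.
        if dot(r, r) <= tol * tol * b_norm2:
            return x, k
        Ap = gemv(A, p)                           # the only O(N^2) step
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

For example, `jacobi_pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` returns a solution close to `[1/11, 7/11]`.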

Executable Usage¶

Environment Setup (Step 1)¶

Please follow the Benchmark Overview page to set up the environment first.

Build Kernel (Step 2)¶

The following command builds the kernel bitstream cgSolver.xclbin under the directory ./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3:

$ make build TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3

Prepare Data (Step 3)¶

To benchmark the kernel, there are two ways to prepare the data.

Randomly-Generated Data (Optional)¶

You can safely skip this step if you choose random data for the benchmark, because data generation is integrated into the run command in the next step. This section explains how it works. Given a vector size N, e.g. 1024, the make data_gen command generates three data files under the directory ./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/data/.

It generates a random SPD matrix A of size NxN with data type FP64 and stores it in row-major order in the file A.mat.
It generates a random FP64 vector x of size N and then computes the vector b = Ax.
The two vectors are stored in the files x.mat and b.mat respectively.

Matrix A and vector b are used as inputs for the solver, and vector x is used as the golden reference.

$ make data_gen TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3 N=1024

where N is the vector size and must be a multiple of 16.
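
The generation steps described above can be sketched as follows. This is an illustration, not the generator shipped with the benchmark: it assumes the .mat files hold raw, headerless FP64 values in native byte order, it builds the SPD matrix by diagonal dominance rather than from a target condition number, and `generate_dataset` is a hypothetical name.

```python
import os
import random
from array import array


def generate_dataset(n, out_dir, seed=0):
    """Write A.mat, x.mat and b.mat as raw FP64 values (assumed file layout)."""
    if n % 16 != 0:
        raise ValueError("N must be a multiple of 16")
    rng = random.Random(seed)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = rng.uniform(-1.0, 1.0)  # symmetric off-diagonals
    for i in range(n):
        # A symmetric, strictly diagonally dominant matrix with a positive
        # diagonal is SPD, so make each diagonal exceed its row's absolute sum.
        A[i][i] = sum(abs(v) for v in A[i]) + 1.0
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]           # golden reference
    b = [sum(a * v for a, v in zip(row, x)) for row in A]    # b = A x
    os.makedirs(out_dir, exist_ok=True)
    flat = array("d", (v for row in A for v in row))         # row-major order
    for name, data in (("A.mat", flat),
                       ("x.mat", array("d", x)),
                       ("b.mat", array("d", b))):
        with open(os.path.join(out_dir, name), "wb") as f:
            data.tofile(f)
    return A, x, b
```

Under these assumptions, A.mat occupies N*N*8 bytes and each vector file occupies N*8 bytes.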

Users’ data¶

Users can prepare their own data for the benchmark.

1. Prepare an SPD matrix with double-precision floating-point data type.
2. Prepare the golden reference vector and the result vector, which is the product of the matrix and the golden reference.
3. Make sure the matrix size is NxN and the vector size is N.
4. Make sure N is a multiple of 16.
5. Store the matrix, golden reference vector and result vector in binary files named A.mat, x.mat and b.mat respectively, and place them in one directory.
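
Before running the benchmark on your own files, it may help to check them against the requirements above. The sketch below assumes the same layout as the generated data (raw, headerless FP64 values, matrix in row-major order), which the requirements do not state explicitly. `load_doubles` and `check_dataset` are hypothetical helper names. Symmetry and a positive diagonal are necessary for SPD but not sufficient, so this is a sanity check rather than a proof.

```python
import os
from array import array


def load_doubles(path, count):
    """Read exactly `count` raw FP64 values from a file (assumed headerless)."""
    data = array("d")
    if os.path.getsize(path) != count * data.itemsize:
        raise ValueError(f"{path}: expected {count} FP64 values")
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return data


def check_dataset(data_dir, n, rel_tol=1e-9):
    """Check the size rules and that b matches A x; raises ValueError on failure."""
    if n % 16 != 0:
        raise ValueError("N must be a multiple of 16")
    A = load_doubles(os.path.join(data_dir, "A.mat"), n * n)
    x = load_doubles(os.path.join(data_dir, "x.mat"), n)
    b = load_doubles(os.path.join(data_dir, "b.mat"), n)
    for i in range(n):
        row = A[i * n:(i + 1) * n]            # row-major: row i is contiguous
        if row[i] <= 0.0:
            raise ValueError(f"A[{i}][{i}] is not positive, so A cannot be SPD")
        for j in range(i):
            if A[i * n + j] != A[j * n + i]:
                raise ValueError(f"A is not symmetric at ({i}, {j})")
        ax_i = sum(a * v for a, v in zip(row, x))
        if abs(ax_i - b[i]) > rel_tol * max(1.0, abs(b[i])):
            raise ValueError(f"b[{i}] does not match (A x)[{i}]")
    return True
```

Calling `check_dataset("path/to/data", 1024)` returns `True` when the three files pass these checks.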

Run on FPGA with Example Data (Step 4)¶

Check Device¶

If you followed the guide and set up the environment correctly, you can run the following command to check whether the target device is ready and to find its device ID.

$ xbutil scan

Benchmark Random Dataset¶

If you chose randomly generated data in step 3, you can skip that step and run the following command with a given vector size N, e.g. 1024, and a maximum number of solver iterations, e.g. 100.

$ make run TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3 N=1024 maxIter=100 deviceID=0

The following table lists the configurable parameters of the make command for the benchmark.

Parameters with make command¶
Parameter Name	Default Value	Notes
N	1024	Vector size, must be a multiple of 16
maxIter	100	Maximum number of solver iterations, must be <= 2000
tol	1e-12	Convergence tolerance
deviceID	0	Alveo U50 card ID
condition_num	128	Condition number of the generated matrix
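
When scripting several benchmark runs, the limits in the table can be checked before make is invoked. The helper below is a hypothetical convenience wrapper, not part of the benchmark. It only assembles the argument list from the parameter names and defaults documented above.

```python
def make_run_command(n=1024, max_iter=100, tol=1e-12, device_id=0,
                     condition_num=128,
                     device="xilinx_u50_gen3x16_xdma_201920_3"):
    """Return the `make run` argument list after checking the documented limits."""
    if n <= 0 or n % 16 != 0:
        raise ValueError("N must be a positive multiple of 16")
    if not 0 < max_iter <= 2000:
        raise ValueError("maxIter must be between 1 and 2000")
    return ["make", "run", "TARGET=hw", f"DEVICE={device}",
            f"N={n}", f"maxIter={max_iter}", f"tol={tol}",
            f"deviceID={device_id}", f"condition_num={condition_num}"]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the benchmark directory, for example `subprocess.run(make_run_command(n=2048), check=True)`.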

Usage¶

For your own data, follow the usage specified below.

Usage: host.exe <XCLBIN File> <Max Iteration> <Tolerance> <Vector Size> <DATA PATH> [device id]
            <XCLBIN File>       path to the xclbin file
            <Max Iteration>     maximum number of iterations
            <Tolerance>         convergence tolerance
            <Vector Size>       size of vector, matrix size N x N
            <DATA PATH>         path to the matrix and vector binary files
            [device id]         device ID given
Resource Utilization¶

The following table lists the resource utilization of the GEMV-based CG kernel with 16 HBM channels storing the matrix.

Resource Utilization on U50¶
Name	LUT	LUTAsMem	REG	BRAM	URAM	DSP
User Budget	699619 [100.00%]	369603 [100.00%]	1447189 [100.00%]	1112 [100.00%]	640 [100.00%]	5936 [100.00%]
Used Resources	186448 [ 26.65%]	17334 [ 4.69%]	325149 [ 22.47%]	128 [ 11.51%]	0 [ 0.00%]	1262 [ 21.26%]

Benchmark Results on Alveo U50 FPGA¶

CPU Hardware information

Model name: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Total threads: 32, Threads/Core: 2, Cores/Socket: 8, Total sockets: 2, Total cores: 16

FPGA Hardware Information

Device name: Xilinx Alveo U50
Fmax: 333MHz
Idle power: 24W

Benchmark Results on U50¶
Vector Size	Time per Iteration [ms]	U50 Performance [GFLOPS]	U50 Energy Efficiency [GFLOPS/W]	CPU Performance [GFLOPS]	Acceleration Ratio
1024	0.073	26.938	0.723	12.996	2.073
2048	0.2557	30.658	0.766	27.469	1.116
4096	0.9202	34.018	0.812	7.776	4.375
8192	3.405	36.742	0.839	8.226	4.467
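
The last column can be cross-checked from the other columns. The sketch below assumes that the acceleration ratio is the U50 performance divided by the CPU performance, and that dividing performance by energy efficiency gives the implied board power during the run. Neither definition is stated on this page, but the first reproduces the table values to three decimals. `derived_metrics` is a hypothetical name.

```python
# Rows copied from the table above:
# (vector size, time per iteration [ms], U50 GFLOPS, U50 GFLOPS/W, CPU GFLOPS, ratio)
ROWS = [
    (1024, 0.073, 26.938, 0.723, 12.996, 2.073),
    (2048, 0.2557, 30.658, 0.766, 27.469, 1.116),
    (4096, 0.9202, 34.018, 0.812, 7.776, 4.375),
    (8192, 3.405, 36.742, 0.839, 8.226, 4.467),
]


def derived_metrics(rows):
    """Return (vector size, speedup, implied power in W) for each row."""
    out = []
    for n, _ms, fpga, eff, cpu, _ratio in rows:
        speedup = fpga / cpu   # assumed definition of the acceleration ratio
        power_w = fpga / eff   # GFLOPS / (GFLOPS/W) = W
        out.append((n, round(speedup, 3), round(power_w, 1)))
    return out


for n, speedup, power_w in derived_metrics(ROWS):
    print(f"N={n}: speedup {speedup}, implied power {power_w} W")
```

Under these assumptions the implied power lies roughly between 37 W and 44 W, compared with the 24 W idle power listed above.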

Power Consumption on FPGA¶

Power data can be obtained with the following command:

$ xbutil top -d <DEVICE ID>