GEMV-based Conjugate Gradient Solver with Jacobi Preconditioner

Introduction

CG solver is widely adopted to solve linear system Ax=b, where the matrix A is symmetric and positive definite. Here is the benchmark for GEMV-based CG solver with the Jacobi preconditioner on Xilinx FPGA Alveo U50.

Executable Usage

Environment Setup (Step 1)

Please follow the page Benchmark Overview to correctly setup the environment first.

Build Kernel (Step 2)

With the following commands, kernel bitstream cgSolver.xclbin is built under the directory ./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3

$ make build TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3

Prepare Data (Step 3)

To benchmark the kernel, there two ways to prepare the data.

Randomly-Generated Data (Optional)

You could safely skip this step as it is integrated with the one in the next step if you choose to use random data for the benchmark. Here states the principle of how it works. With the following commands with given vector size e.g. 1024, three data files are generated under directory ./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/data/.

  1. It generates a random SPD matrix of size NxN with data type FP64 and then stores the data in a row-major to file A.mat.
  2. It generates a random FP64 vector of size N and then compute vector b = Ax.
  3. The two vectors are stored into files x.mat and b.mat respectively.

Matrix A and vector b are used as inputs for the solver, and vector x is used as the golden reference.

$ make data_gen TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3 N=1024

where N is the vector size and must be multiple of 16.

Users’ data

Users could prepare their own data for benchmark. 1. Please prepare a SPD matrix with double precision floating point data type. 2. Please prepare golden reference vector and result vector which is the product of the matrix and the golden reference. 3. Please make sure the matrix size is NxN and vector size is N 4. Please make sure*N* is multiple of 16. 5. Please store the matrix, golden reference vector and result vector to binary files named A.mat, x.mat and b.mat respectively, and place them into a directory.

Run on FPGA with Example Data (Step 4)

Check Device

If you followed the guide and correctly setup the environment, you are able to run the following command line. You could check whether the target device is prepared and find out the device ID.

$ xbutil scan

Benchmark Random Dataset

If you decide to use randomly generated data for benchmark in step 3, you could skip that step and run the following command with given vector size N, e.g. 1024 and maximum number of iterations for the solver e.g. 100.

$ make run TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3 N=1024 maxIter=100 deviceID=0

Here lists the configurable parameters with the make command for the benchmark.

Parameters with make command
Parameter Name Default Value Notes
N 1024 Vector size, must be multiple of 16
maxIter 100 Maximum No. iterations for the solver <= 2000
tol 1e-12 Fault Tolerance
deviceID 0 Alveo U50 Card ID
condition_num 128 Conditioner number for matrix generated

Usage

For users’ own data, follow the usage specified bellow.

Usage: host.exe <XCLBIN File> <Max Iteration> <Vector Size> <DATA PATH> [device id]
            <XCLBIN File>       path to the xclbin file
            <Max Iteration>     maximum number of iterations
            <Tolerence>         Fault tolerence
            <Vector Size>       size of vector, matrix size N x N
            <DATA PATH>         path to the matrix and vector binary files
            <device id>         Device id given

Resource Utilization

The following table lists the resource utilization for GEMV-based CG kernel with 16 HBM channels storing the matrix.

Resource Utilization on U50
Name LUT LUTAsMem REG BRAM URAM DSP
User Budget 699619 [100.00%] 369603 [100.00%] 1447189 [100.00%] 1112 [100.00%] 640 [100.00%] 5936 [100.00%]
Used Resources 186448 [ 26.65%] 17334 [ 4.69%] 325149 [ 22.47%] 128 [ 11.51%] 0 [ 0.00%] 1262 [ 21.26%]

Benchmark Results on Alveo U50 FPGA

CPU Hardware information

  • Model name: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
  • Total threads: 32, Threads/Core: 2, Cores/Socket: 8, Total sockets: 2, Total Cores:16

FPGA Hardware Information

  • Device name: Xilinx Alveo U50
  • Fmax: 333MHz
  • Idle power 24W
Benchmark Results on U50
Vector Size Time per Iteration [ms] U50 Performance [GFLOPS] U50 Energy Efficiency [GFLOPS/W] CPU Performance [GFLOPS] Acceleration Ratio
1024 0.073 26.938 0.723 12.996 2.073
2048 0.2557 30.658 0.766 27.469 1.116
4096 0.9202 34.018 0.812 7.776 4.375
8192 3.405 36.742 0.839 8.226 4.467

Power Consumption on FPGA

Power data could be obtained by

$ xbutil top -d <DEVICE ID>