L2 GEMM benchmark¶

1. gemm_4CU¶

This example resides in L2/benchmarks/memKernel/gemm_4CU directory. The tutorial provides a step-by-step guide that covers commands for building and running kernel. It performs the matrix-matrix multiplication (A * B = C), M is number of rows of matrix A/C, K is number of columns of matrix A/number of rows of matrix B, N is number of columns of matrix B/C

1.1 Executable Usage¶

1.1.1 Work Directory(Step 1)¶

The steps for library download and environment setup can be found in [here](https://github.com/Xilinx/Vitis_Libraries/tree/master/blas/L2/benchmarks#building). For getting the design,

cd L2/benchmarks/memKernel/gemm_4CU

1.1.2 Build kernel(Step 2)¶

Run the following make command to build your XCLBIN and host binary targeting a specific device. Please be noticed that this process will take a long time, maybe couple of hours.

make run TARGET=hw PLATFORM_REPO_PATHS=/opt/xilinx/platforms DEVICE=xilinx_u250_xdma_201830_2

1.1.3 Run kernel(Step 3)¶

To get the benchmark results, please run the following command.

gemm_4CU Input Arguments:

<host application> <xclbin> m k n

For example:

build_dir.hw.xilinx_u250_xdma_201830_2/host.exe build_dir.hw.xilinx_u250_xdma_201830_2/blas.xclbin 64 64 64

1.1.4 Example output(Step 4)¶

Added GEMM 64x64x64  In kernel 0 Added instruction GEMM (64x64 * 64x64)
Added GEMM 64x64x64  In kernel 1 Added instruction GEMM (64x64 * 64x64)
Added GEMM 64x64x64  In kernel 2 Added instruction GEMM (64x64 * 64x64)
Added GEMM 64x64x64  In kernel 3 Added instruction GEMM (64x64 * 64x64)
Added GEMM 64x64x64  Found Platform
Platform Name: Xilinx
INFO: device name is: xilinx_u250_xdma_201830_2
INFO: Importing build_dir.hw.xilinx_u250_xdma_201830_2/blas.xclbin
Loading: 'build_dir.hw.xilinx_u250_xdma_201830_2/blas.xclbin'
INFO: created kernels
loadXclbin  6960.979134 msec
create kernels  13.595438 msec
create buffers  0.176534 msec
INFO: transferred data to kernel 0
INFO: transferred data to kernel 1
INFO: transferred data to kernel 2
INFO: transferred data to kernel 3
copy to kernels  0.884381 msec
INFO: Executed kernel 0
INFO: Executed kernel 1
INFO: Executed kernel 2
INFO: Executed kernel 3
call kernels  0.398135 msec
INFO: Transferred data from kernel0
INFO: Transferred data from kernel1
INFO: Transferred data from kernel2
INFO: Transferred data from kernel3
copyFromFpga  0.260636 msec
total  6976.308826 msec
subtotalFpga  1.750123 msec
DATA_CSV:,DdrWidth,Freq,M,K,N,Ops,KernelCycles,TimeKernelMs,TimeApiMs,EffKernelPct,EffApiPct,PerfKernelTops,PerfApiTops
DATA_CSV:,16,242.000000,64,64,64,2146304,2639,0.010905,1.750123,38.802577,0.241778,0.199516,0.001226

###########  Op Gemm  ###########
  C = postScale(A * B + X) 64x64 = 64x64 * 64x64 + 64 x 64
  Comparing ...
  Compared 4096 values:  exact match 1281  within tolerance 2815  mismatch 0
Gemm C Matches
pass

1.2 Profiling¶

The xclbin could be built in 242 MHz The hardware resource utilization and benchmark results are shown in the two tables below.

Table 1 Hardware resources

Name	LUT	BRAM	URAM	DSP	FF
blasKernel	250679	94	24	1224	430512

Table 2 Benchmark results

M	N	K	Kernel execution time [ms]	api execution time [ms]	Kernel Eff [%]
64	64	64	0.010905	1.750123	38.802577
128	128	128	0.048517	13.802416	69.772592
256	256	256	0.328314	14.645931	82.485022
512	512	512	3.213388	18.199255	67.420400
1024	1024	1024	24.113855	45.519852	71.875005
2048	2048	2048	186.688153	264.195138	74.270743
4096	4096	4096	1469.773731	1708.938204	75.469945