L3 API GEMM benchmark
The benchmark performs the matrix-matrix multiplication C = A * B, where M is the number of rows of matrices A and C, K is the number of columns of A (equal to the number of rows of B), and N is the number of columns of matrices B and C.
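To make the roles of M, K and N concrete, here is a minimal pure-Python reference for the same product. It is only an illustrative sketch: the function name `gemm_reference` is hypothetical and is not part of the library, which runs the multiplication on the FPGA.

```python
def gemm_reference(a, b):
    """Return C = A * B for row-major nested lists (illustrative only)."""
    m, k = len(a), len(a[0])          # A is M x K
    if len(b) != k:
        raise ValueError("columns of A must equal rows of B (both are K)")
    n = len(b[0])                     # B is K x N
    c = [[0] * n for _ in range(m)]   # C is M x N
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


# M = 2, K = 3, N = 4
a = [[1, 2, 3],
     [4, 5, 6]]
b = [[1, 0, 2, 0],
     [0, 1, 0, 2],
     [1, 1, 1, 1]]
c = gemm_reference(a, b)
print(len(c), "x", len(c[0]))  # shape of C is M x N
print(c)
```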
1. memKernel
This example resides in the L3/benchmarks/gemm/memKernel directory. This tutorial is a step-by-step guide covering the commands for building and running the kernel.
1.1 Executable Usage
1.1.1 Work Directory (Step 1)
The steps for downloading the library and setting up the environment are described [here](https://github.com/Xilinx/Vitis_Libraries/tree/master/blas/L2/benchmarks#building). To get to the design, change to the benchmark directory:
cd L3/benchmarks/gemm/memKernel
1.1.2 Build kernel (Step 2)
Run the following make command to build your XCLBIN and host binary targeting a specific device. Note that this process takes a long time, possibly a couple of hours.
make run TARGET=hw PLATFORM_REPO_PATHS=/opt/xilinx/platforms DEVICE=xilinx_u250_xdma_201830_2
1.1.3 Run kernel (Step 3)
To get the benchmark results, run the following command.
Input Arguments:
<host application> <xclbin> <config_info.dat>
For example:
build_dir.hw.xilinx_u250_xdma_201830_2/gemm_bench.exe build_dir.hw.xilinx_u250_xdma_201830_2/blas.xclbin build_dir.hw.xilinx_u250_xdma_201830_2/config_info.dat
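The three arguments in the example share one build directory, so the command line can be assembled from the device name. The sketch below assumes the `build_dir.<TARGET>.<DEVICE>` naming seen in the example holds for other devices and targets as well; the helper name `bench_command` is hypothetical.

```python
from pathlib import PurePosixPath


def bench_command(device, target="hw"):
    """Build the argument list: <host application> <xclbin> <config_info.dat>."""
    # Assumption: the build directory is always named build_dir.<target>.<device>.
    build_dir = PurePosixPath(f"build_dir.{target}.{device}")
    return [
        str(build_dir / "gemm_bench.exe"),
        str(build_dir / "blas.xclbin"),
        str(build_dir / "config_info.dat"),
    ]


cmd = bench_command("xilinx_u250_xdma_201830_2")
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the benchmark directory once the build has finished.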
1.1.4 Example output (Step 4)
xfblasCreate 276.965961 msec
copyToFpga 0.237744 msec
copyFromFpga 0.753792 msec
Api time is 0.991536 msec
DATA_CSV:,Freq,M,K,N,TimeApiMs,EffApiPct,PerfApiTops
DATA_CSV:,242.000000,64,64,64,0.991536,0.426753,0.000541
>> Kernel #0 << Test passed!
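The DATA_CSV lines are convenient to collect when several sizes are benchmarked. The following is a minimal parsing sketch, assuming each DATA_CSV record is printed on its own line and that the first such line is the header; the function name `parse_data_csv` is hypothetical.

```python
SAMPLE_LOG = """\
xfblasCreate 276.965961 msec
copyToFpga 0.237744 msec
copyFromFpga 0.753792 msec
Api time is 0.991536 msec
DATA_CSV:,Freq,M,K,N,TimeApiMs,EffApiPct,PerfApiTops
DATA_CSV:,242.000000,64,64,64,0.991536,0.426753,0.000541
>> Kernel #0 << Test passed!
"""


def parse_data_csv(log_text):
    """Return one dict per DATA_CSV record, keyed by the header names."""
    header = None
    records = []
    for line in log_text.splitlines():
        if not line.startswith("DATA_CSV:"):
            continue
        fields = line.split(",")[1:]      # drop the "DATA_CSV:" tag
        if header is None:
            header = fields               # first tagged line names the columns
        elif fields != header:            # skip repeated headers from later runs
            records.append(dict(zip(header, map(float, fields))))
    return records


records = parse_data_csv(SAMPLE_LOG)
print(records)
```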
1.1.5 Use script to run benchmark
Use MKL to generate the dataset. The usage of the script is:
./run_gemm_mkl.sh number_of_thread datatype g(generate)/b(benchmark)
Then use run_gemm_bench.sh to run the benchmark:
cd ../gemm_mkl
./run_gemm_mkl.sh 16 float g
./run_gemm_bench.sh build_dir.hw.xilinx_u250_xdma_201830_2/blas.xclbin build_dir.hw.xilinx_u250_xdma_201830_2/config_info.dat
1.2 Profiling
The xclbin can be built at 242 MHz. The hardware resource utilization and benchmark results are shown in the two tables below.
Table 1 Hardware resources
Name | LUT | BRAM | URAM | DSP | FF |
---|---|---|---|---|---|
blasKernel | 250679 | 94 | 24 | 1224 | 430512 |
Table 2 Benchmark results
M | N | K | API execution time [ms] | API efficiency [%] | PerfApiTops [Tops] |
---|---|---|---|---|---|
256 | 256 | 256 | 2.295277 | 11.798572 | 0.058818 |
512 | 512 | 512 | 7.185994 | 30.148638 | 0.149859 |
1024 | 1024 | 1024 | 33.357721 | 51.957490 | 0.257887 |
2048 | 2048 | 2048 | 218.662946 | 63.410230 | 0.314501 |
4096 | 4096 | 4096 | 1594.648667 | 69.559988 | 0.344877 |
8192 | 8192 | 8192 | 12695.637510 | 69.897233 | 0.346485 |
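The efficiency figures can be cross-checked against the execution times. The sketch below assumes that 100% efficiency corresponds to 256 multiply-accumulates per clock cycle at the build frequency; this value is inferred from the published numbers, not taken from the kernel source, so treat it as an assumption. The function name `api_efficiency_pct` is hypothetical.

```python
def api_efficiency_pct(m, k, n, time_api_ms, freq_mhz, macs_per_cycle=256):
    """Ideal compute time divided by measured API time, in percent."""
    # Assumption: the ideal kernel retires macs_per_cycle MACs every cycle.
    ideal_cycles = m * k * n / macs_per_cycle
    ideal_ms = ideal_cycles / (freq_mhz * 1e3)   # MHz * 1e3 = cycles per ms
    return 100.0 * ideal_ms / time_api_ms


# (size, API execution time in ms) taken from the memKernel results, 242 MHz
for size, time_ms in [(256, 2.295277), (4096, 1594.648667), (8192, 12695.637510)]:
    eff = api_efficiency_pct(size, size, size, time_ms, freq_mhz=242.0)
    print(f"{size}: {eff:.4f} %")
```

Under this assumption the small sizes are dominated by fixed per-call overhead, which is why efficiency rises with matrix size and levels off near 70%.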
2. streamingKernel
This example resides in the L3/benchmarks/gemm/streamingKernel directory. This tutorial is a step-by-step guide covering the commands for building and running the kernel.
2.1 Executable Usage
2.1.1 Work Directory (Step 1)
The steps for downloading the library and setting up the environment are described [here](https://github.com/Xilinx/Vitis_Libraries/tree/master/blas/L2/benchmarks#building). To get to the design, change to the benchmark directory:
cd L3/benchmarks/gemm/streamingKernel
2.1.2 Build kernel (Step 2)
Run the following make command to build your XCLBIN and host binary targeting a specific device. Note that this process takes a long time, possibly a couple of hours.
make run TARGET=hw PLATFORM_REPO_PATHS=/opt/xilinx/platforms DEVICE=xilinx_u250_gen3x16_xdma_3_1_202020_1
2.1.3 Run kernel (Step 3)
To get the benchmark results, run the following command.
Input Arguments:
<host application> <xclbin> <config_info.dat>
For example:
build_dir.hw.xilinx_u250_gen3x16_xdma_3_1_202020_1/gemm_bench.exe build_dir.hw.xilinx_u250_gen3x16_xdma_3_1_202020_1/blas.xclbin build_dir.hw.xilinx_u250_gen3x16_xdma_3_1_202020_1/config_info.dat
2.1.4 Example output (Step 4)
xfblasCreate 249.914832 msec
copyToFpga 0.243765 msec
copyFromFpga 0.437556 msec
Api time is 0.681321 msec
DATA_CSV:,Freq,M,K,N,TimeApiMs,EffApiPct,PerfApiTops
DATA_CSV:,250.000000,64,64,64,0.681321,0.601185,0.000788
>> Kernel #0 << Test passed!
2.1.5 Use script to run benchmark
Use MKL to generate the dataset. The usage of the script is:
./run_gemm_mkl.sh number_of_thread datatype g(generate)/b(benchmark)
Then use run_gemm_bench.sh to run the benchmark:
cd ../gemm_mkl
./run_gemm_mkl.sh 16 float g
./run_gemm_bench.sh build_dir.hw.xilinx_u250_gen3x16_xdma_3_1_202020_1/blas.xclbin build_dir.hw.xilinx_u250_gen3x16_xdma_3_1_202020_1/config_info.dat
2.2 Profiling
The xclbin can be built at 250 MHz. The hardware resource utilization and benchmark results are shown in the two tables below.
Table 1 Hardware resources
Name | LUT | BRAM | URAM | DSP | REG |
---|---|---|---|---|---|
gemmAddsKernel | 101988 | 0 | 0 | 384 | 192516 |
gemmCPlusXKernel | 8529 | 24 | 0 | 66 | 20358 |
gemmLoadStoreKernel | 7126 | 23 | 0 | 16 | 19457 |
gemmMergeKernel | 8342 | 0 | 0 | 0 | 25219 |
gemmMulsKernel | 50640 | 0 | 0 | 768 | 98013 |
gemmSystolicArrayKernel | 2541 | 0 | 0 | 0 | 240 |
gemmTagsKernel | 20203 | 15 | 0 | 8 | 34678 |
gemmTimerKernel | 32 | 0 | 0 | 0 | 115 |
Table 2 Benchmark results
M | N | K | API execution time [ms] | API efficiency [%] | PerfApiTops [Tops] |
---|---|---|---|---|---|
256 | 256 | 256 | 1.370527 | 19.127241 | 0.024626 |
512 | 512 | 512 | 4.517989 | 46.417820 | 0.059589 |
1024 | 1024 | 1024 | 29.500145 | 56.871639 | 0.072902 |
2048 | 2048 | 2048 | 217.555482 | 61.693563 | 0.079026 |
4096 | 4096 | 4096 | 1685.337895 | 63.710774 | 0.081580 |
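The PerfApiTops figures for the streamingKernel can be recomputed from the matrix sizes and API times. The sketch below assumes an operation count of 2*M*K*N + 3*M*N; that count is an assumption chosen because it reproduces the streamingKernel values listed here, and the benchmark source remains the authority. The function name `perf_api_tops` is hypothetical.

```python
def perf_api_tops(m, k, n, time_api_ms):
    """Tera-operations per second for one GEMM call of the given duration."""
    # Assumption: 2*M*K*N multiply-add ops plus 3*M*N extra ops on the output.
    total_ops = 2 * m * k * n + 3 * m * n
    return total_ops / (time_api_ms * 1e-3) / 1e12


# (size, API execution time in ms) taken from the streamingKernel results
for size, time_ms in [(256, 1.370527), (1024, 29.500145), (4096, 1685.337895)]:
    print(f"{size}: {perf_api_tops(size, size, size, time_ms):.6f} Tops")
```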