L3 API GEMM benchmark

1. Benchmarking Intel® Math Kernel Library (MKL)

1.1 Introduction

Intel® Math Kernel Library (MKL) provides optimized math routines, such as GEMM, for Intel processors. To compare its performance with Xilinx’s XFBLAS library, use the run script (run_gemm_mkl.sh) to generate the test data and run the MKL performance benchmark.

1.2 Benchmarking Steps

1.2.1 Access Nimbix cloud

  • Follow the user guide Vitis On Nimbix to log in to your Nimbix account
  • Launch application “Xilinx Vitis Unified Software Platform 2019.2” and select “Desktop Mode with FPGA”
  • Choose machine type “16 core, 128 GB RAM, Xilinx Alveo U250 FPGA (nx6u_xdma_201830_2_2_3)”
  • Copy the L3/benchmarks/gemm directory to the Nimbix machine, and navigate to the gemm/gemm_mkl directory
  • Follow the steps below to run the Intel® MKL GEMM API benchmarks.

Note

Intel® Math Kernel Library does not require an FPGA; the FPGA is used by Xilinx’s XFBLAS library.

1.2.2 Install Intel® MKL

To install MKL on Nimbix, download the full installation package for MKL 2020 from the Intel® MKL website. You need to register to download the package. After downloading, extract the package and navigate to the directory that contains “install.sh”. Then enter the following command to install MKL.

sudo ./install.sh

1.2.3 Set up MKL environment variables

Intel® MKL: Assuming you have installed Intel® MKL, source the appropriate script to set up the environment variables (such as $MKLROOT).

source <INTEL_MKL_INSTALL_DIR>/bin/mklvars.sh intel64
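A quick way to confirm that the environment was set up is to check $MKLROOT before running the benchmark. The following is a minimal sketch; check_mklroot is a hypothetical helper name, not part of MKL or the benchmark scripts.

```shell
# Hypothetical helper: succeed only if MKLROOT points at an existing directory.
check_mklroot() {
    if [ -n "${MKLROOT:-}" ] && [ -d "$MKLROOT" ]; then
        echo "MKLROOT is set to $MKLROOT"
    else
        echo "MKLROOT is not set or does not exist; source mklvars.sh first" >&2
        return 1
    fi
}

# Usage (run only after sourcing mklvars.sh):
#   check_mklroot && ./run_gemm_mkl.sh 16 float a
```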

1.2.4 Install numactl

NUMACTL: Linux provides a utility called numactl that controls the NUMA scheduling and memory placement policy of a process, which matters when running parallel programs.

On Ubuntu (this only needs to be done once):

sudo apt-get install numactl
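After installation, the NUMA layout of the machine can be inspected with numactl itself. The sketch below wraps that check in a hypothetical helper, numa_summary. Whether run_gemm_mkl.sh already applies a numactl policy internally is not stated here, so the commented invocation is only an illustration.

```shell
# Hypothetical helper: print the NUMA topology, or a hint if numactl is missing.
numa_summary() {
    if command -v numactl >/dev/null 2>&1; then
        numactl --hardware   # lists NUMA nodes with their CPUs and memory sizes
    else
        echo "numactl not found; install it with: sudo apt-get install numactl"
    fi
}

# Usage:
#   numa_summary
# Illustration only: interleave memory allocations across all NUMA nodes.
#   numactl --interleave=all <your_program>
```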

1.2.5 Run MKL benchmarking script

The run script runs the GEMM benchmark with the given number of threads, data type, and work mode, sweeping the GEMM matrix size from 256 to 16384.

./run_gemm_mkl.sh <thread#> <data_type> <mode>

where:

  • thread#: Number of threads to run, e.g. 1, 2, 4, 8, 16, etc.
  • data_type: Either float or double; short is also used in section 2.1.1 to generate XFBLAS test inputs.
  • mode: g for generating the data, b for benchmarking the performance, and a for both workloads.
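To see how performance scales with the thread count, the script can be invoked in a loop. The following is a sketch only; sweep_threads and the DRY_RUN variable are hypothetical names introduced for illustration, and the thread counts are the examples listed above.

```shell
# Hypothetical helper: run (or just print) the benchmark for several thread counts.
sweep_threads() {
    data_type="${1:-float}"   # data type passed through to run_gemm_mkl.sh
    mode="${2:-a}"            # g, b, or a
    for threads in 1 2 4 8 16; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Dry run: only show the command that would be executed.
            echo "./run_gemm_mkl.sh $threads $data_type $mode"
        else
            ./run_gemm_mkl.sh "$threads" "$data_type" "$mode" || return 1
        fi
    done
}

# Print the five commands without executing them.
DRY_RUN=1 sweep_threads float b
```

Unset DRY_RUN to execute the commands from the gemm_mkl directory.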

1.3 Performance Result on Nimbix Cloud

Configuration:

cpu_model         | Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
thread#           | 16
data_type         | float
benchmark command | ./run_gemm_mkl.sh 16 float a

Performance Result (non-caching):

Square matrix size | Matrix pairs running simultaneously | Cache (Y/N) | API time (ms) | TFLOPS
256                | 1                                   | N           | 29.700        | 0.001
512                | 1                                   | N           | 11.799        | 0.023
1024               | 1                                   | N           | 16.591        | 0.129
2048               | 1                                   | N           | 41.319        | 0.416
4096               | 1                                   | N           | 172.369       | 0.797
8192               | 1                                   | N           | 1073.250      | 1.024
16384              | 1                                   | N           | 9060.830      | 0.971

Performance Result (caching):

Square matrix size | Matrix pairs running simultaneously | Cache (Y/N) | API time (ms) | TFLOPS
256                | 1                                   | Y           | 1.380         | 0.024
512                | 1                                   | Y           | 4.038         | 0.066
1024               | 1                                   | Y           | 4.383         | 0.490
2048               | 1                                   | Y           | 21.282        | 0.807
4096               | 1                                   | Y           | 149.755       | 0.918
8192               | 1                                   | Y           | 1042.860      | 1.054
16384              | 1                                   | Y           | 9045.700      | 0.972
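The throughput column is consistent with counting 2·N³ floating-point operations per N×N matrix product, multiplying by the number of matrix pairs, and dividing by the API time. The sketch below reproduces that arithmetic; gemm_tflops is a hypothetical helper name, and the 2·N³ convention is an assumption inferred from the reported numbers rather than taken from the benchmark scripts.

```shell
# Hypothetical helper: throughput in tera-operations per second.
# Arguments: square matrix size N, number of matrix pairs, API time in ms.
gemm_tflops() {
    awk -v n="$1" -v pairs="$2" -v ms="$3" 'BEGIN {
        ops = pairs * 2 * n * n * n       # assumed 2*N^3 operations per product
        printf "%.3f\n", ops / (ms / 1000) / 1e12
    }'
}

gemm_tflops 8192 1 1073.250   # prints 1.024
gemm_tflops 4096 1 172.369    # prints 0.797
```

The pairs argument is 1 for the MKL runs; runs that process several matrix pairs at once pass the pair count instead.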

2. Benchmarking xfblasGemm - Xilinx’s XFBLAS library

Before benchmarking xfblasGemm, download the XFBLAS xclbin files, extract the archive with the “tar -xvzf” command, and copy the folder u250_xdma_201830_2 to the directory L3/overlay.
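The extract-and-copy step can be scripted as follows. This is a sketch under assumptions: install_overlay is a hypothetical helper, the archive name is whatever file you downloaded, and the folder u250_xdma_201830_2 is assumed to sit at the top level of the archive.

```shell
# Hypothetical helper: extract the downloaded archive and copy the overlay folder.
# Arguments: path to the downloaded .tar.gz, path to the L3/overlay directory.
install_overlay() {
    archive="$1"
    overlay_dir="$2"
    workdir="$(mktemp -d)" || return 1
    # Assumes u250_xdma_201830_2 is at the top level of the archive.
    tar -xzf "$archive" -C "$workdir" &&
        mkdir -p "$overlay_dir" &&
        cp -r "$workdir/u250_xdma_201830_2" "$overlay_dir/"
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Usage (substitute the real archive name):
#   install_overlay <downloaded_archive>.tar.gz L3/overlay
```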

2.1 Benchmarking Steps

2.1.1 Generate test inputs and golden reference

Follow the MKL benchmarking steps in section 1.2 to run the MKL benchmarks for the float and short data types; this generates the test inputs and golden reference. For the float data type, run the following command.

./run_gemm_mkl.sh 16 float a

For the short data type, run the following command.

./run_gemm_mkl.sh 16 short a

2.1.2 Build benchmark application

Before benchmarking xfblasGemm, build the host executable for the corresponding .xclbin file with the following script.

./build_gemm_bench.sh config_info_file

2.1.3 Run benchmark

The run script runs the GEMM benchmark with the xclbin and config files, sweeping the GEMM matrix size from 256 to 8192.

./run_gemm_bench.sh xclbin_file config_info_file

where:

  • xclbin_file refers to the gemx.xclbin file, including the path.
  • config_info_file refers to config_info.dat file, including the path.
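Because both files normally live in the same kernel directory, the build and run steps can be chained with a check that the files exist. The sketch below uses the script names shown in the configuration listings of section 2.2; run_fpga_bench is a hypothetical wrapper, not part of the library.

```shell
# Hypothetical wrapper: build the host executable, then run the benchmark.
# Argument: directory containing gemx.xclbin and config_info.dat.
run_fpga_bench() {
    kernel_dir="$1"
    cfg="$kernel_dir/config_info.dat"
    xclbin="$kernel_dir/gemx.xclbin"
    # Stop early if either input file is missing.
    for f in "$cfg" "$xclbin"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    ./build_gemm_bench.sh "$cfg" && ./run_gemm_bench.sh "$xclbin" "$cfg"
}

# Usage:
#   run_fpga_bench ../../overlay/u250_xdma_201830_2/gemm_float_4kernel
```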

2.2 Performance Results on Nimbix Cloud

Configuration:

fpga_model        | Xilinx Alveo U250 FPGA (nx6u_xdma_201830_2_2_3)
Frequency         | 150 MHz
data_type         | float
build command     | ./build_gemm_bench.sh ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/config_info.dat
benchmark command | ./run_gemm_bench.sh ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/gemx.xclbin ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/config_info.dat

Performance Result:

Square matrix size | Matrix pairs running simultaneously | API time (ms) | TFLOPS
256                | 4                                   | 2.715         | 0.049
512                | 4                                   | 7.223         | 0.149
1024               | 4                                   | 40.020        | 0.214
2048               | 4                                   | 292.971       | 0.234
4096               | 4                                   | 1990.240      | 0.276
8192               | 4                                   | 15317.589     | 0.287

Configuration:

fpga_model        | Xilinx Alveo U250 FPGA (nx6u_xdma_201830_2_2_3)
Frequency         | 231 MHz
data_type         | short
build command     | ./build_gemm_bench.sh ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/config_info.dat
benchmark command | ./run_gemm_bench.sh ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/gemx.xclbin ../../overlay/u250_xdma_201830_2/gemm_float_4kernel/config_info.dat

Performance Result:

Square matrix size | Matrix pairs running simultaneously | API time (ms) | TOPS
256                | 4                                   | 1.436         | 0.093
512                | 4                                   | 2.589         | 0.415
1024               | 4                                   | 13.885        | 0.619
2048               | 4                                   | 61.879        | 1.111
4096               | 4                                   | 416.086       | 1.321
8192               | 4                                   | 3443.76       | 1.277