HBM Bandwidth

This is an HBM bandwidth check design. The design contains 3 compute units of a kernel, each with full access to all HBM pseudo-channels (HBM[0:31]). The host application allocates input and output buffers across the HBM banks, runs all 3 compute units concurrently, and measures the overall bandwidth between the kernels and HBM memory.

HBM is a high-performance RAM interface for 3D-stacked DRAM. It can provide very high bandwidth (greater than 400 GB/s) with low power consumption (about 20% compared to GDDR5). The 32 memory resources are referenced as HBM[0:31] by v++ and are accessed via 16 memory controllers.

The host can allocate a buffer into a specific HBM bank using the CL_MEM_EXT_PTR_XILINX flag. A cl_mem_ext_ptr_t object needs to be used when the memory assignment is done explicitly by the user:

cl_mem_ext_ptr_t bufExt;
bufExt.obj = host_pointer;           // host memory backing the buffer
bufExt.param = 0;
bufExt.flags = n | XCL_MEM_TOPOLOGY; // n = index of the target HBM bank (0..31)
buffer_input = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR,
                          size, &bufExt, &err);
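
As a minimal sketch of how such buffers might be placed across the pseudo-channels, the hypothetical helper below assigns buffer i to HBM[i]; the headers are XRT's OpenCL extensions, while the helper name, data layout, and flags are illustrative rather than this example's exact code:

#define CL_HPP_CL_1_2_DEFAULT_BUILD
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <CL/cl_ext_xilinx.h> // cl_mem_ext_ptr_t, XCL_MEM_TOPOLOGY, CL_MEM_EXT_PTR_XILINX
#include <vector>

// Hypothetical helper: one buffer per HBM pseudo-channel, buffer i -> HBM[i].
std::vector<cl::Buffer> make_hbm_buffers(cl::Context& context,
                                         std::vector<std::vector<int>>& host_data,
                                         size_t size_in_bytes) {
    std::vector<cl::Buffer> buffers;
    cl_int err = CL_SUCCESS;
    for (int bank = 0; bank < 32; bank++) {
        cl_mem_ext_ptr_t ext;
        ext.obj   = host_data[bank].data();  // host memory backing this buffer
        ext.param = 0;
        ext.flags = bank | XCL_MEM_TOPOLOGY; // direct this buffer to HBM[bank]
        buffers.emplace_back(context,
                             CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR,
                             size_in_bytes, &ext, &err);
    }
    return buffers;
}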

HBM banks must be associated with the respective kernel I/O ports using the sp option. The mapping between HBM banks and kernel ports is added in the krnl_vaddmul.cfg file:

[connectivity]
sp=krnl_vaddmul_1.in1:HBM[0]
sp=krnl_vaddmul_1.in2:HBM[1]
sp=krnl_vaddmul_1.out_add:HBM[2]
sp=krnl_vaddmul_1.out_mul:HBM[3]
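
The .cfg file is consumed at link time. As a sketch, the v++ link command might look like the following (the .xo file name is an assumption; the platform matches the runtime log below):

v++ -t hw --platform xilinx_u50_xdma_201920_1 --link --config krnl_vaddmul.cfg -o krnl_vaddmul.xclbin krnl_vaddmul.xo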

To see the benefit of HBM, the user can look into the runtime logs and see the overall throughput:

Loading: './build_dir.hw.xilinx_u50_xdma_201920_1/krnl_vaddmul.xclbin'
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_1}] for CU(1)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_2}] for CU(2)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_3}] for CU(3)
THROUGHPUT = 158.3 GB/s
TEST PASSED
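
The reported throughput is essentially the total number of bytes moved by all compute unit ports divided by the measured kernel execution time. Below is a minimal sketch of that arithmetic; the function name and parameters are illustrative, not this example's exact code:

#include <cstdint>

// Sketch of the throughput math: bytes moved across all CU ports divided by
// the measured execution time in seconds, reported in GB/s.
double throughput_gbps(int num_cu, int ports_per_cu,
                       uint64_t bytes_per_port, double seconds) {
    uint64_t total_bytes = uint64_t(num_cu) * ports_per_cu * bytes_per_port;
    return (double(total_bytes) / seconds) * 1e-9;
}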

By default the design uses 3 compute units of the kernel because of power consumption limitations when targeting the U50 platform. The design also supports 8 compute units of the kernel. To run the HBM bandwidth test with 8 compute units targeting the U280 platform, the user needs to update the krnl_vaddmul.cfg and host.cpp files. Add the following lines to the krnl_vaddmul.cfg file:

sp=krnl_vaddmul_4.in1:HBM[12]
sp=krnl_vaddmul_4.in2:HBM[13]
sp=krnl_vaddmul_4.out_add:HBM[14]
sp=krnl_vaddmul_4.out_mul:HBM[15]
sp=krnl_vaddmul_5.in1:HBM[16]
sp=krnl_vaddmul_5.in2:HBM[17]
sp=krnl_vaddmul_5.out_add:HBM[18]
sp=krnl_vaddmul_5.out_mul:HBM[19]
sp=krnl_vaddmul_6.in1:HBM[20]
sp=krnl_vaddmul_6.in2:HBM[21]
sp=krnl_vaddmul_6.out_add:HBM[22]
sp=krnl_vaddmul_6.out_mul:HBM[23]
sp=krnl_vaddmul_7.in1:HBM[24]
sp=krnl_vaddmul_7.in2:HBM[25]
sp=krnl_vaddmul_7.out_add:HBM[26]
sp=krnl_vaddmul_7.out_mul:HBM[27]
sp=krnl_vaddmul_8.in1:HBM[28]
sp=krnl_vaddmul_8.in2:HBM[29]
sp=krnl_vaddmul_8.out_add:HBM[30]
sp=krnl_vaddmul_8.out_mul:HBM[31]
nk=krnl_vaddmul:8

In the host.cpp file, the user needs to change #define NUM_KERNEL from 3 to 8:

#define NUM_KERNEL 8
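
NUM_KERNEL determines how many compute units the host addresses. As the runtime log below suggests, each compute unit is opened as a separate cl::Kernel using the kernel_name:{cu_name} naming convention; the helper below is a sketch of that loop (function name assumed, error handling omitted):

#define CL_HPP_CL_1_2_DEFAULT_BUILD
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <string>
#include <vector>

// Sketch: open one cl::Kernel handle per compute unit, matching the log below.
std::vector<cl::Kernel> create_cu_kernels(cl::Program& program, int num_kernel) {
    std::vector<cl::Kernel> krnls;
    cl_int err = CL_SUCCESS;
    for (int i = 0; i < num_kernel; i++) {
        // XRT convention: "kernel_name:{cu_name}" targets a specific compute unit
        std::string name = "krnl_vaddmul:{krnl_vaddmul_" + std::to_string(i + 1) + "}";
        krnls.emplace_back(program, name.c_str(), &err);
    }
    return krnls;
}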

The user can see a higher benefit of HBM when running on the U280 platform with 8 compute units of the kernel. Below is the runtime log, which shows an overall throughput greater than 400 GB/s:

Loading: './build_dir.hw.xilinx_u280_xdma_201920_1/krnl_vaddmul.xclbin'
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_1}] for CU(1)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_2}] for CU(2)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_3}] for CU(3)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_4}] for CU(4)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_5}] for CU(5)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_6}] for CU(6)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_7}] for CU(7)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_8}] for CU(8)
THROUGHPUT = 421.3 GB/s
TEST PASSED

EXCLUDED PLATFORMS:

  • Alveo U25 SmartNIC

  • Alveo U30

  • Alveo U200

  • All Embedded Zynq Platforms, e.g. zc702, zcu102, etc.

  • All Versal Platforms, e.g. vck190, etc.

  • Alveo U250

  • AWS VU9P F1

  • Samsung SmartSSD Computation Storage Drive

  • Samsung U.2 SmartSSD

  • X3 Compute Shell

  • All NoDMA Platforms, e.g. u50 nodma, etc.

DESIGN FILES

Application code is located in the src directory. Accelerator binary files will be compiled to the xclbin directory. The xclbin directory is required by the Makefile, and its contents will be filled during compilation. A listing of all the files in this example is shown below:

src/host.cpp
src/krnl_vaddmul.cpp
src/krnl_vaddmul.h

Access these files in the GitHub repo.

COMMAND LINE ARGUMENTS

Once the environment has been configured, the application can be executed by:

./hbm_bandwidth <krnl_vaddmul XCLBIN>