HBM Bandwidth - Pseudo Random Ethash

This is an HBM bandwidth example that uses a pseudo random 1024-bit data access pattern to mimic Ethereum Ethash workloads. The design contains 3 compute units of a kernel. Each compute unit reads 1024 bits from a pseudo random address in each of 2 pseudo channels and writes the results of a simple mathematical operation to a pseudo random address in 2 other pseudo channels. To maximize bandwidth, the pseudo channels are used in a P2P-like configuration; see https://developer.xilinx.com/en/articles/maximizing-memory-bandwidth-with-vitis-and-xilinx-ultrascale-hbm-devices.html for more information on HBM memory access configurations. The host application allocates buffers in 12 HBM banks and runs the compute units concurrently to measure the overall bandwidth between the kernel and HBM memory.

KEY CONCEPTS: High Bandwidth Memory, Multiple HBM Pseudo-channels, Random Memory Access, Linear Feedback Shift Register

KEYWORDS: HBM, XCL_MEM_TOPOLOGY, cl_mem_ext_ptr_t

The host application tests HBM interface bandwidth for a pseudo random 1024-bit data access pattern that mimics Ethereum Ethash workloads. The design contains 3 compute units of the kernel. Each compute unit reads 1024 bits from a pseudo random address in each of 2 pseudo channels and writes the results of a simple mathematical operation to a pseudo random address in 2 other pseudo channels. The host application allocates buffers in 12 HBM banks (6 input buffers and 6 output buffers), runs all 3 compute units together, and measures the overall HBM bandwidth.
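The pseudo random addressing described above can be illustrated with a linear feedback shift register. The following is a minimal sketch, not the code from krnl_vaddmul.cpp: it uses a textbook 16-bit Galois LFSR (tap mask 0xB400), and the helper names lfsr_next, lfsr_to_index and lfsr_period, as well as the modulo mapping onto a buffer, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 16-bit Galois LFSR (taps 16, 14, 13, 11; mask 0xB400).
// Textbook maximal-length register, NOT the LFSR used by the actual kernel.
inline uint16_t lfsr_next(uint16_t state) {
    assert(state != 0);              // the all-zero state never changes
    const uint16_t lsb = state & 1u; // bit shifted out on this step
    state >>= 1;
    if (lsb) state ^= 0xB400u;       // apply the feedback taps
    return state;
}

// Hypothetical helper: map an LFSR state to a word index inside a buffer
// holding `num_words` 1024-bit words. The modulo mapping is an assumption.
inline uint32_t lfsr_to_index(uint16_t state, uint32_t num_words) {
    assert(num_words > 0);
    return static_cast<uint32_t>(state) % num_words;
}

// Count how many steps it takes for the sequence to return to the seed.
inline uint32_t lfsr_period(uint16_t seed) {
    uint16_t s = seed;
    uint32_t steps = 0;
    do {
        s = lfsr_next(s);
        ++steps;
    } while (s != seed);
    return steps;
}
```

An LFSR is attractive in hardware because each new address costs only a shift and an XOR, while the sequence still visits every non-zero state before repeating.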

HBM is a high-performance RAM interface for 3D-stacked DRAM. It can provide very high bandwidth, greater than 400 GB/s, with low power consumption (HBM2 ~20 W vs. GDDR5 ~100 W). The HBM is exposed as 32 memory resources, referenced as HBM[0:31] by v++, which are accessed via 16 memory controllers.

The host can allocate a buffer in a specific HBM bank by using the CL_MEM_EXT_PTR_XILINX flag when creating the buffer. A cl_mem_ext_ptr_t object must be used whenever the user assigns memory explicitly; here n is the index of the target HBM bank:

cl_mem_ext_ptr_t bufExt;
bufExt.obj = host_pointer;
bufExt.param = 0;
bufExt.flags = n  | XCL_MEM_TOPOLOGY;
buffer_input = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR, size, &bufExt, &err);
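The flags value in the snippet above combines a bank index with a topology marker bit. As a minimal sketch of that encoding, the helper below builds and decodes such a value. The constant kMemTopologyFlag is assumed to mirror XCL_MEM_TOPOLOGY (bit 31) from the Xilinx OpenCL extension header, and hbm_bank_flags and bank_from_flags are hypothetical names; real host code should include the vendor header rather than redefine the constant.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: mirrors XCL_MEM_TOPOLOGY from the Xilinx OpenCL extension
// header (bit 31 set). Redefined here only to keep the sketch standalone.
constexpr uint32_t kMemTopologyFlag = 1u << 31;

// Hypothetical helper: build the value for cl_mem_ext_ptr_t::flags that
// selects HBM bank `bank` (0..31) by its memory topology index.
inline uint32_t hbm_bank_flags(uint32_t bank) {
    assert(bank < 32);               // HBM[0:31]
    return bank | kMemTopologyFlag;
}

// Hypothetical helper: recover the bank index from a flags value.
inline uint32_t bank_from_flags(uint32_t flags) {
    return flags & ~kMemTopologyFlag;
}
```

With such a helper, the assignment would read bufExt.flags = hbm_bank_flags(n), which keeps the bank range check in one place.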

HBM memory must be associated with the respective kernel I/O ports using the sp option. The mapping between HBM memory and I/O ports is added in the krnl_vaddmul.cfg file:

sp=krnl_vaddmul_1.in1:HBM[0]
sp=krnl_vaddmul_1.in2:HBM[1]
sp=krnl_vaddmul_1.out_add:HBM[2]
sp=krnl_vaddmul_1.out_mul:HBM[3]
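The lines above show the mapping for the first compute unit only. The following sketch generates the equivalent lines for several compute units, assuming that each compute unit takes the next 4 consecutive banks (so 3 compute units cover HBM[0] through HBM[11]). That layout is an assumption consistent with the 12 banks mentioned earlier, not a copy of the shipped krnl_vaddmul.cfg, and make_sp_lines is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: emit one "sp=" line per kernel port and compute unit.
// ASSUMPTION: compute unit k (1-based) uses banks 4*(k-1) .. 4*(k-1)+3.
inline std::vector<std::string> make_sp_lines(int num_cus) {
    assert(num_cus >= 1 && num_cus <= 8);  // 8 CUs x 4 ports = 32 banks
    static const char* const ports[] = {"in1", "in2", "out_add", "out_mul"};
    std::vector<std::string> lines;
    for (int cu = 1; cu <= num_cus; ++cu) {
        for (int p = 0; p < 4; ++p) {
            const int bank = (cu - 1) * 4 + p;
            lines.push_back("sp=krnl_vaddmul_" + std::to_string(cu) + "." +
                            ports[p] + ":HBM[" + std::to_string(bank) + "]");
        }
    }
    return lines;
}
```

Giving each port its own bank keeps the read and write streams of one compute unit from competing for the same pseudo channel.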

To improve random access bandwidth, the latency and num_read_outstanding options have been added to the HLS INTERFACE definitions in krnl_vaddmul.cpp:

void krnl_vaddmul(
    const v_dt *in1,             // Read-Only Vector 1
    const v_dt *in2,             // Read-Only Vector 2
    v_dt *out_add,               // Output Result for ADD
    v_dt *out_mul,               // Output Result for MUL
    const unsigned int size,     // Size in integer
    const unsigned int num_times // Running the same kernel operations num_times
    ) {
#pragma HLS INTERFACE m_axi port = in1 offset = slave bundle = gmem0 latency = 300 num_read_outstanding=64
#pragma HLS INTERFACE m_axi port = in2 offset = slave bundle = gmem1 latency = 300 num_read_outstanding=64
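A rough way to see why more outstanding reads help random access is Little's law: the sustained rate cannot exceed the data in flight divided by the round-trip latency. The sketch below applies that bound under loud assumptions: each request is a single 1024-bit (128-byte) transfer, and the 300 cycles given to the latency option is treated as the actual round-trip latency, which the pragma does not guarantee. The function name max_bytes_per_cycle is hypothetical.

```cpp
#include <cassert>

// Upper bound on sustained read rate from Little's law:
//   rate <= (outstanding requests * bytes per request) / latency
// ASSUMPTIONS: one request carries `bytes_per_request` bytes, and
// `latency_cycles` is the real round-trip latency of a read.
inline double max_bytes_per_cycle(unsigned outstanding,
                                  unsigned bytes_per_request,
                                  unsigned latency_cycles) {
    assert(latency_cycles > 0);
    return static_cast<double>(outstanding) * bytes_per_request /
           latency_cycles;
}
```

Under these assumptions, 64 outstanding 128-byte reads allow roughly 27 bytes per cycle on one port, whereas a single outstanding read would allow less than half a byte per cycle.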

To see the benefit of HBM, check the runtime log for the overall throughput. The following log was reported while running the design on the U50 platform:

Loading: './build_dir.hw.xilinx_u50_xdma_201920_1/krnl_vaddmul.xclbin'
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_1}] for CU(1)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_2}] for CU(2)
Creating a kernel [krnl_vaddmul:{krnl_vaddmul_3}] for CU(3)
OVERALL THROUGHPUT = 138.022 GB/s
CHANNEL THROUGHPUT = 11.501 GB/s
TEST PASSED

By default, the design uses 3 compute units of the kernel because of power consumption limits when targeting the U50 platform.

EXCLUDED PLATFORMS:

Alveo U25 SmartNIC

Alveo U30

Alveo U200

All Embedded Zynq Platforms, e.g. zc702, zcu102, etc.

All Versal Platforms, e.g. vck190, etc.

Alveo U250

AWS VU9P F1

Samsung SmartSSD Computation Storage Drive

Samsung U.2 SmartSSD

X3 Compute Shell

All NoDMA Platforms, e.g. u50 nodma, etc.

DESIGN FILES

Application code is located in the src directory. Accelerator binary files will be compiled to the xclbin directory. The xclbin directory is required by the Makefile, and its contents will be filled during compilation. All the files in this example are listed below:

src/host.cpp
src/krnl_vaddmul.cpp
src/krnl_vaddmul.h

These files are available in the example's GitHub repository.

COMMAND LINE ARGUMENTS

Once the environment has been configured, run the application with:

./hbm_bandwidth_pseudo_random <krnl_vaddmul XCLBIN>