Array Partition

This is a simple example of matrix multiplication (Row x Col) to demonstrate how to achieve better performance by array partitioning, using HLS kernel in Vitis Environment.

KEY CONCEPTS: Kernel Optimization, HLS C Kernel, Array Partition

KEYWORDS: #pragma HLS ARRAY_PARTITION, complete

This example demonstrates how array partition in HLS kernel can help to improve the performance. In this example matrix multiplication functionality is used to showcase the benefit of array partition. Design contains two kernels “matmul” a simple matrix multiplication and “matmul_partition” a matrix multiplication implementation using array partition.

#pragma HLS array partition is used to partition an array into multiple smaller arrays or memories. Arrays can be partitioned in three ways, cyclic, block and complete. In this example, complete partition is used to partition one of the dimension of local Matrix array as below

int B[MAX_SIZE][MAX_SIZE];
int C[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable = B dim = 2 complete
#pragma HLS ARRAY_PARTITION variable = C dim = 2 complete

This array partition helps design to access 2nd dimension of both Matrix B and C concurrently to reduce the overall latency.

To see the benefit of array partition, user can look into system estimate report and see overall latency. Latency Information of normal matmul kernel (without partition):

Compute Unit  Kernel Name  Module Name  Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
------------  -----------  -----------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
matmul_1      matmul       matmul       2068 ~ 3052     2067           2559          3051            6.889 us         8.529 us        9.526 us

Latency Information for matrix multiplication for kernel with partition:

Compute Unit        Kernel Name       Module Name       Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
------------------  ----------------  ----------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
matmul_partition_1  matmul_partition  matmul_partition  277 ~ 1260      276            768           1259            0.920 us         2.560 us        4.196 us

Example generates the following information as output when ran on Alevo U250 Card:

Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.hw.xilinx_u250_gen3x16_xdma_4_1_202210_1/matmul.xclbin
Loading: './build_dir.hw.xilinx_u250_gen3x16_xdma_4_1_202210_1/matmul.xclbin'
|-------------------------+-------------------------|
| Kernel                  |    Wall-Clock Time (ns) |
|-------------------------+-------------------------|
| matmul:                 |                    6826 |
| matmul: partition       |                     853 |
|-------------------------+-------------------------|
| Speedup                 |                8.002345 |
|-------------------------+-------------------------|
Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
Please refer to profile summary for kernel execution time for hardware emulation.
TEST PASSED

EXCLUDED PLATFORMS:

  • All NoDMA Platforms, i.e u50 nodma etc

DESIGN FILES

Application code is located in the src directory. Accelerator binary files will be compiled to the xclbin directory. The xclbin directory is required by the Makefile and its contents will be filled during compilation. A listing of all the files in this example is shown below

src/host.cpp
src/matmul.cpp
src/matmul_partition.cpp

Access these files in the github repo by clicking here.

COMMAND LINE ARGUMENTS

Once the environment has been configured, the application can be executed by

./array_partition <matmul XCLBIN>