Vitis BLAS level 3 provides software API functions to offload BLAS operations to pre-built FPGA images.

1. Introduction

The Vitis BLAS level 3 library is an implementation of BLAS on top of the Xilinx Runtime (XRT). It allows software developers to use the Vitis BLAS library without writing any runtime functions or hardware configurations.

1.1 Data layout

The Vitis BLAS library uses row-major storage. The array index of a matrix element can be calculated with the following macro, where i is the row, j is the column, and ld is the leading dimension.

#define IDX2R(i, j, ld) (((i) * (ld)) + (j))
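
As a small illustration of the macro, the sketch below builds a row-major matrix whose leading dimension is larger than its column count and reads an element back. The helper names `makeSample` and `elementAt` and the sample values are made up for this example; they are not part of the library.

```cpp
#include <cstddef>
#include <vector>

// Row-major index macro, as given in the text above.
#define IDX2R(i, j, ld) (((i) * (ld)) + (j))

// Hypothetical helper: read element (i, j) of a row-major matrix whose rows
// are ld elements apart in memory.
float elementAt(const std::vector<float>& a, int i, int j, int ld) {
    return a[IDX2R(i, j, ld)];
}

// Hypothetical helper: build a rows x cols matrix with leading dimension ld
// in which element (i, j) holds 10*i + j. The unused tail of each row stays 0.
std::vector<float> makeSample(int rows, int cols, int ld) {
    std::vector<float> a(static_cast<std::size_t>(rows) * ld, 0.0f);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            a[IDX2R(i, j, ld)] = 10.0f * i + j;
        }
    }
    return a;
}
```

For a 2 x 3 matrix stored with ld = 4, element (1, 2) lives at array index 1*4 + 2 = 6.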

1.2 Memory Allocation

The Vitis BLAS level 3 library provides three versions of its API for allocating device memory. Choose the version that best fits your application using the table below. Examples of all three versions can be found in the BLAS Level 3 examples.

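Already have host memory input | Host memory matrix sizes are padded | Version to choose
------------------------------ | ----------------------------------- | ----------------------------
Yes                            | Yes                                 | Restricted memory version
Yes                            | No                                  | Default memory version
No                             | Does not apply                      | Pre-allocated memory version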

Restricted memory version

To use the restricted memory version, the input matrix sizes must be multiples of certain configuration values that were used to build the FPGA bitstream. Host memory should also preferably be 4 KB aligned when using this version. Although it imposes requirements on the matrix sizes, the restricted memory version avoids the extra host-side memory copy that the default memory version incurs.
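
A host-side pre-check for the restricted version might look like the sketch below. The block size passed in is a placeholder: the real values come from the configuration used to build the bitstream, which this document does not list. `roundUp`, `isMultipleOf`, and `allocAligned4k` are illustrative names, not library functions, and `posix_memalign` is assumed to be available (POSIX systems).

```cpp
#include <cstddef>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Round n up to the next multiple of block (block > 0).
int roundUp(int n, int block) {
    return ((n + block - 1) / block) * block;
}

// True if n is a positive multiple of block.
bool isMultipleOf(int n, int block) {
    return n > 0 && block > 0 && n % block == 0;
}

// Allocate `bytes` of host memory aligned to 4096 bytes.
// Returns nullptr on failure; release the memory with free().
void* allocAligned4k(std::size_t bytes) {
    void* p = nullptr;
    if (posix_memalign(&p, 4096, bytes) != 0) {
        return nullptr;
    }
    return p;
}
```

With a hypothetical block size of 64, a 100 x 100 matrix would not qualify as is, and `roundUp(100, 64)` gives 128 as the size to allocate and fill.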

Default memory version

This version places no restrictions on the user's host memory and is easy to use. The API functions pad the matrices internally, which incurs an extra memory copy on the host side. The output matrix keeps the original, unpadded sizes.

Pre-allocated memory version

To use this version, users first call API functions to allocate the device memory, then fill the host memory that is mapped to the device memory with values. There is no extra memory copy, and the programming is easier than with the other two versions. However, the matrices must be filled in using the padded sizes, and the output matrix also has padded sizes rather than the original ones. Please see the examples for more usage information.

1.3 Supported Datatypes

  • float

2. Using the Vitis BLAS API

2.1 General description

This section describes how to use the Vitis BLAS level 3 library API.

2.1.1 Error status

Vitis BLAS API function calls return an error status of type xfblasStatus_t.

2.1.2 Vitis BLAS initialization

To initialize the library, the xfblasCreate() function must be called. This function opens the device, downloads the FPGA image to the device, and creates a context on the selected compute unit. For a multi-kernel xclbin, contexts are opened on the corresponding compute units. Please refer to the L3 examples for detailed usage.

2.2 Datatypes Reference

2.2.1 xfblasStatus_t

This type is used for function status returns. Vitis BLAS level 3 library functions return a status with one of the following values.

Item | Meaning | Value
---- | ------- | -----
XFBLAS_STATUS_SUCCESS | The function completed successfully. | 0
XFBLAS_STATUS_NOT_INITIALIZED | The Vitis BLAS library was not initialized. This is usually caused by not calling xfblasCreate beforehand. | 1
XFBLAS_STATUS_INVALID_VALUE | An unsupported value or parameter was passed to the function, for example a negative matrix size. | 2
XFBLAS_STATUS_ALLOC_FAILED | Memory allocation failed inside the Vitis BLAS library. | 3
XFBLAS_STATUS_NOT_SUPPORTED | The requested functionality is not supported yet. | 4
XFBLAS_STATUS_NOT_PADDED | For restricted mode, matrix sizes are not padded correctly. | 5

2.2.2 xfblasEngine_t

The xfblasEngine_t type indicates which engine to use when initializing the Vitis BLAS library. The xfblasEngine_t value must match the FPGA bitstream.

Value              | Meaning
------------------ | ---------------------------
XFBLAS_ENGINE_GEMM | The GEMM engine is selected

2.2.3 xfblasOperation_t

The xfblasOperation_t type indicates which operation needs to be performed with the matrix.

Value       | Meaning
----------- | -----------------------------------------------
XFBLAS_OP_N | The non-transpose operation is selected
XFBLAS_OP_T | The transpose operation is selected
XFBLAS_OP_C | The conjugate transpose operation is selected

2.3 Vitis BLAS Helper Function Reference

2.3.1 xfblasCreate

xfblasStatus_t xfblasCreate(const char* xclbin, string configFile, const char* logFile, xfblasEngine_t engineName, unsigned int kernelNumber = 1, unsigned int deviceIndex = 0)

This function initializes the Vitis BLAS library and creates a handle for the specific engine. It must be called prior to any other Vitis BLAS library calls.

Parameters:

xclbin file path to FPGA bitstream
configFile file path to config_info.dat file
logFile file path to log file
engineName Vitis BLAS engine to run
kernelNumber number of kernels that are being used, default is 1
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the initialization succeeded
xfblasStatus_t 1 if the OpenCL runtime initialization failed
xfblasStatus_t 2 if the xclbin doesn’t contain the engine
xfblasStatus_t 4 if the engine is not supported for now

2.3.2 xfblasFree

xfblasStatus_t xfblasFree(void* A, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function frees memory on the FPGA device.

Parameters:

A pointer to matrix A in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix

2.3.3 xfblasDestroy

xfblasStatus_t xfblasDestroy(unsigned int kernelNumber = 1, unsigned int deviceIndex = 0)

This function releases the handle used by the Vitis BLAS library.

Parameters:

kernelNumber number of kernels that are being used, default is 1
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the shutdown succeeded
xfblasStatus_t 1 if the library was not initialized

2.3.4 xfblasMalloc

xfblasStatus_t xfblasMalloc(short** devPtr, int rows, int lda, int elemSize, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasMalloc(float** devPtr, int rows, int lda, int elemSize, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function allocates memory on the FPGA device.

Parameters:

devPtr pointer to mapped memory
rows number of rows in the matrix
lda leading dimension of the matrix that indicates the total number of cols in the matrix
elemSize number of bytes required to store each element in the matrix
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the allocation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 2 if parameters rows, cols, elemSize, lda <= 0 or cols > lda or data types are not matched
xfblasStatus_t 3 if there is memory already allocated to the same matrix
xfblasStatus_t 4 if the engine is not supported for now

2.3.5 xfblasSetVector

xfblasStatus_t xfblasSetVector(int n, int elemSize, short* x, int incx, short* d_x, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasSetVector(int n, int elemSize, float* x, int incx, float* d_x, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a vector in host memory to FPGA device memory. xfblasMalloc() must be called before this function.

Parameters:

n number of elements in vector
elemSize number of bytes required to store each element in the vector
x pointer to the vector in the host memory
incx the storage spacing between consecutive elements of vector x
d_x pointer to mapped memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 2 if parameters rows, cols, elemSize, lda <= 0 or cols > lda or data types are not matched
xfblasStatus_t 3 if there is no FPGA device memory allocated for the vector
xfblasStatus_t 4 if the engine is not supported for now
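
The incx parameter is described above as the storage spacing between consecutive elements of x, so logical element i sits at x[i * incx]. The sketch below shows that addressing on the host only. `gatherStrided` is a hypothetical helper written for this illustration, and it assumes n >= 0 and incx > 0.

```cpp
#include <cstddef>
#include <vector>

// Collect the n logical elements of a strided vector into a dense one.
// Logical element i is read from x[i * incx].
std::vector<float> gatherStrided(const float* x, int n, int incx) {
    std::vector<float> out(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        out[static_cast<std::size_t>(i)] = x[static_cast<std::size_t>(i) * incx];
    }
    return out;
}
```

With incx = 1 the vector is contiguous; with incx = 2 every other array slot belongs to the vector.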

2.3.6 xfblasGetVector

xfblasStatus_t xfblasGetVector(int n, int elemSize, short* d_x, short* x, int incx, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasGetVector(int n, int elemSize, float* d_x, float* x, int incx, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a vector in FPGA device memory to host memory.

Parameters:

n number of elements in vector
elemSize number of bytes required to store each element in the vector
d_x pointer to mapped memory
x pointer to the vector in the host memory
incx the storage spacing between consecutive elements of vector x
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the vector

2.3.7 xfblasSetMatrix

xfblasStatus_t xfblasSetMatrix(int rows, int cols, int elemSize, short* A, int lda, short* d_A, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasSetMatrix(int rows, int cols, int elemSize, float* A, int lda, float* d_A, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a matrix in host memory to FPGA device memory. xfblasMalloc() must be called before this function.

Parameters:

rows number of rows in the matrix
cols number of cols in the matrix that is being used
elemSize number of bytes required to store each element in the matrix
A pointer to the matrix array in the host memory
lda leading dimension of the matrix that indicates the total number of cols in the matrix
d_A pointer to mapped memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 2 if parameters rows, cols, elemSize, lda <= 0 or cols > lda or data types are not matched
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix
xfblasStatus_t 4 if the engine is not supported for now

2.3.8 xfblasGetMatrix

xfblasStatus_t xfblasGetMatrix(int rows, int cols, int elemSize, short* d_A, short* A, int lda, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasGetMatrix(int rows, int cols, int elemSize, float* d_A, float* A, int lda, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a matrix in FPGA device memory to host memory.

Parameters:

rows number of rows in the matrix
cols number of cols in the matrix that is being used
elemSize number of bytes required to store each element in the matrix
d_A pointer to mapped memory
A pointer to the matrix array in the host memory
lda leading dimension of the matrix that indicates the total number of cols in the matrix
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix

2.3.9 xfblasMallocRestricted

xfblasStatus_t xfblasMallocRestricted(int rows, int cols, int elemSize, void* A, int lda, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function allocates memory on the FPGA device for a row-major matrix in host memory.

Parameters:

rows number of rows in the matrix
cols number of cols in the matrix that is being used
elemSize number of bytes required to store each element in the matrix
A pointer to the matrix array in the host memory
lda leading dimension of the matrix that indicates the total number of cols in the matrix
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the allocation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 2 if parameters rows, cols, elemSize, lda <= 0 or cols > lda or data types are not matched
xfblasStatus_t 3 if there is memory already allocated to the same matrix
xfblasStatus_t 4 if the engine is not supported for now
xfblasStatus_t 5 if rows, cols or lda is not padded correctly

2.3.10 xfblasSetVectorRestricted

xfblasStatus_t xfblasSetVectorRestricted(void* x, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a vector in host memory to FPGA device memory. xfblasMallocRestricted() must be called before this function.

Parameters:

x pointer to the vector in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the vector

2.3.11 xfblasGetVectorRestricted

xfblasStatus_t xfblasGetVectorRestricted(void* x, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a vector in FPGA device memory to host memory.

Parameters:

x pointer to vector x in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the vector

2.3.12 xfblasSetMatrixRestricted

xfblasStatus_t xfblasSetMatrixRestricted(void* A, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a matrix in host memory to FPGA device memory. xfblasMallocRestricted() must be called before this function.

Parameters:

A pointer to the matrix array in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix

2.3.13 xfblasGetMatrixRestricted

xfblasStatus_t xfblasGetMatrixRestricted(void* A, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function copies a matrix in FPGA device memory to host memory.

Parameters:

A pointer to matrix A in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix
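
The restricted-memory functions are meant to be used together with xfblasCreate, xfblasGemm, xfblasFree and xfblasDestroy. The sketch below shows one plausible call order, inferred from the prerequisites stated in this document. So that it compiles without the library or a device, every xfblas function here is a local stand-in that only records its name. The stand-in signatures follow this document, while the file names, the call log, and `runRestrictedGemm` are assumptions made for the illustration.

```cpp
#include <string>
#include <vector>

// ---- Stand-ins so the sketch compiles without the library. ----
// Real code would include the library header instead of this section.
enum xfblasStatus_t { XFBLAS_STATUS_SUCCESS = 0, XFBLAS_STATUS_NOT_INITIALIZED = 1 };
enum xfblasEngine_t { XFBLAS_ENGINE_GEMM };
enum xfblasOperation_t { XFBLAS_OP_N, XFBLAS_OP_T, XFBLAS_OP_C };

static std::vector<std::string> g_calls;  // call log, for illustration only
static xfblasStatus_t logCall(const char* name) {
    g_calls.push_back(name);
    return XFBLAS_STATUS_SUCCESS;
}

xfblasStatus_t xfblasCreate(const char*, std::string, const char*, xfblasEngine_t,
                            unsigned int = 1, unsigned int = 0) { return logCall("xfblasCreate"); }
xfblasStatus_t xfblasMallocRestricted(int, int, int, void*, int,
                                      unsigned int = 0, unsigned int = 0) { return logCall("xfblasMallocRestricted"); }
xfblasStatus_t xfblasSetMatrixRestricted(void*, unsigned int = 0, unsigned int = 0) { return logCall("xfblasSetMatrixRestricted"); }
xfblasStatus_t xfblasGemm(xfblasOperation_t, xfblasOperation_t, int, int, int, int,
                          void*, int, void*, int, int, void*, int,
                          unsigned int = 0, unsigned int = 0) { return logCall("xfblasGemm"); }
xfblasStatus_t xfblasGetMatrixRestricted(void*, unsigned int = 0, unsigned int = 0) { return logCall("xfblasGetMatrixRestricted"); }
xfblasStatus_t xfblasFree(void*, unsigned int = 0, unsigned int = 0) { return logCall("xfblasFree"); }
xfblasStatus_t xfblasDestroy(unsigned int = 1, unsigned int = 0) { return logCall("xfblasDestroy"); }

// ---- The call sequence being illustrated. ----
// a is m x k, b is k x n, c is m x n, all row-major floats. The sizes are
// assumed to already be multiples of the bitstream's configuration values.
// The three file names are placeholders.
xfblasStatus_t runRestrictedGemm(float* a, float* b, float* c, int m, int n, int k) {
    const xfblasStatus_t OK = XFBLAS_STATUS_SUCCESS;
    const int es = static_cast<int>(sizeof(float));

    xfblasStatus_t s = xfblasCreate("gemm.xclbin", "config_info.dat", "xfblas.log",
                                    XFBLAS_ENGINE_GEMM);
    if (s != OK) return s;

    // Allocate device memory for each host matrix (rows, cols, elemSize, ptr, lda).
    if (s == OK) s = xfblasMallocRestricted(m, k, es, a, k);
    if (s == OK) s = xfblasMallocRestricted(k, n, es, b, n);
    if (s == OK) s = xfblasMallocRestricted(m, n, es, c, n);

    // Copy the host matrices to the device.
    if (s == OK) s = xfblasSetMatrixRestricted(a);
    if (s == OK) s = xfblasSetMatrixRestricted(b);
    if (s == OK) s = xfblasSetMatrixRestricted(c);

    // C = 1 * A * B + 1 * C, then copy the result back to the host.
    if (s == OK) s = xfblasGemm(XFBLAS_OP_N, XFBLAS_OP_N, m, n, k, 1, a, k, b, n, 1, c, n);
    if (s == OK) s = xfblasGetMatrixRestricted(c);

    // Release device memory and the library handle.
    xfblasFree(a);
    xfblasFree(b);
    xfblasFree(c);
    xfblasDestroy();
    return s;
}
```

Checking the status after every call, as done here, keeps a failed allocation from being followed by a copy or a kernel launch on memory that was never set up.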

2.3.14 xfblasMallocManaged

xfblasStatus_t xfblasMallocManaged(short** devPtr, int* paddedLda, int rows, int lda, int elemSize, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)
xfblasStatus_t xfblasMallocManaged(float** devPtr, int* paddedLda, int rows, int lda, int elemSize, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function allocates memory on the FPGA device and writes the leading dimension after padding to paddedLda.

Parameters:

devPtr pointer to mapped memory
paddedLda leading dimension of the matrix after padding
rows number of rows in the matrix
lda leading dimension of the matrix that indicates the total number of cols in the matrix
elemSize number of bytes required to store each element in the matrix
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the allocation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 2 if parameters rows, cols, elemSize, lda <= 0 or cols > lda or data types are not matched
xfblasStatus_t 3 if there is memory already allocated to the same matrix
xfblasStatus_t 4 if the engine is not supported for now

2.3.15 xfblasExecute

xfblasStatus_t xfblasExecute (
    unsigned int kernelIndex = 0,
    unsigned int deviceIndex = 0
    )

This function starts the kernel and waits until it finishes.

Parameters:

kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the instructions

2.3.16 xfblasExecuteAsync

void xfblasExecuteAsync (
    unsigned int numKernels = 1,
    unsigned int deviceIndex = 0
    )

This asynchronous function starts all kernels and waits until they finish.

Parameters:

numKernels number of kernels that are being used, default is 1
deviceIndex index of device that is being used, default is 0

2.3.17 xfblasGetByPointer

xfblasStatus_t xfblasGetByPointer (
    void* A,
    unsigned int kernelIndex = 0,
    unsigned int deviceIndex = 0
    )

This function copies a matrix in FPGA device memory to host memory by pointer.

Parameters:

A pointer to matrix A in the host memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix

2.3.18 xfblasGetByAddress

xfblasStatus_t xfblasGetByAddress (
    void* A,
    unsigned long long p_bufSize,
    unsigned int offset,
    unsigned int kernelIndex = 0,
    unsigned int deviceIndex = 0
    )

This function copies a matrix in FPGA device memory to host memory by its address in device memory.

Parameters:

A pointer to matrix A in the host memory
p_bufSize size of matrix A
offset A’s address in device memory
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if there is no FPGA device memory allocated for the matrix

2.4 Vitis BLAS Function Reference

2.4.1 xfblasGemm

xfblasStatus_t xfblasGemm(xfblasOperation_t transa, xfblasOperation_t transb, int m, int n, int k, int alpha, void* A, int lda, void* B, int ldb, int beta, void* C, int ldc, unsigned int kernelIndex = 0, unsigned int deviceIndex = 0)

This function performs the matrix-matrix multiplication C = alpha*op(A)*op(B) + beta*C. See the L3 examples for detailed usage.

Parameters:

transa operation op(A): non-transpose, transpose, or conjugate transpose
transb operation op(B): non-transpose, transpose, or conjugate transpose
m number of rows in matrix A, matrix C
n number of cols in matrix B, matrix C
k number of cols in matrix A, number of rows in matrix B
alpha scalar used for multiplication
A pointer to matrix A in the host memory
lda leading dimension of matrix A
B pointer to matrix B in the host memory
ldb leading dimension of matrix B
beta scalar used for multiplication
C pointer to matrix C in the host memory
ldc leading dimension of matrix C
kernelIndex index of kernel that is being used, default is 0
deviceIndex index of device that is being used, default is 0

Return:

xfblasStatus_t 0 if the operation completed successfully
xfblasStatus_t 1 if the library was not initialized
xfblasStatus_t 3 if not all the matrices have FPGA device memory allocated
xfblasStatus_t 4 if the engine is not supported for now
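
For checking device results on the host, a plain CPU version of the same formula is handy. The sketch below is such a reference, assuming row-major float matrices and integer alpha and beta as in the signature above. `referenceGemm`, `opAt`, and the local `Op` enum are names made up for this example and are not part of the library; the conjugate transpose is left out because it equals the plain transpose for real data.

```cpp
// Local stand-ins for the non-transpose and transpose operations.
enum Op { OP_N, OP_T };

// Element (i, j) of op(X), where X is stored row-major with leading dimension ld.
static float opAt(const float* x, Op op, int i, int j, int ld) {
    return (op == OP_N) ? x[i * ld + j] : x[j * ld + i];
}

// C = alpha * op(A) * op(B) + beta * C
// op(A) is m x k, op(B) is k x n, C is m x n; all matrices are row-major.
void referenceGemm(Op transa, Op transb, int m, int n, int k, int alpha,
                   const float* A, int lda, const float* B, int ldb,
                   int beta, float* C, int ldc) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += opAt(A, transa, i, p, lda) * opAt(B, transb, p, j, ldb);
            }
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
    }
}
```

Note that C is both an input and the output, so a caller that wants a pure product A*B should either pass beta = 0 or zero C first.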

3. Obtain FPGA bitstream

FPGA bitstreams can be built in the examples or tests folder with the following command, where LOCAL_PLATFORM_PATH stands for the local platform path:

make build TARGET=hw PLATFORM_REPO_PATHS=LOCAL_PLATFORM_PATH