template class xf::dsp::aie::blas::matrix_mult::matrix_mult_graph

#include "matrix_mult_graph.hpp"


matrix_mult performs a GEneral Matrix Multiply (GEMM), taking two input matrices of configurable dimensions and data type.

These are the template parameters that configure the Matrix Multiply graph class.



TT_DATA_A describes the type of individual data samples input of Matrix A to the gemm function. This is a typename and must be one of the following:

int16, cint16, int32, cint32, float, cfloat.


TT_DATA_B describes the type of individual data samples input of Matrix B to the gemm function. This is a typename and must be one of the following:

int16, cint16, int32, cint32, float, cfloat. The following rules apply:

  • TT_DATA_B must be an integer type if TT_DATA_A is an integer type.
  • TT_DATA_B must be a float type if TT_DATA_A is a float type.
TP_DIM_A is an unsigned integer which describes the number of elements along the unique dimension (rows) of Matrix A.
TP_DIM_AB is an unsigned integer which describes the number of elements along the common dimension of Matrix A (columns) and Matrix B (rows).
TP_DIM_B is an unsigned integer which describes the number of elements along the unique dimension (columns) of Matrix B.
TP_SHIFT describes the power-of-2 shift down applied to the accumulation of product terms before each output. TP_SHIFT must be in the range 0 to 61.

TP_RND describes the selection of rounding to be applied during the shift down stage of processing. TP_RND must be in the range 0 to 7 where

  • 0 = floor (truncate) e.g. 3.8 would become 3.

  • 1 = ceiling e.g. 3.2 would become 4.

  • 2 = round to positive infinity.

  • 3 = round to negative infinity.

  • 4 = round symmetrical to infinity.

  • 5 = round symmetrical to zero.

  • 6 = round convergent to even.

  • 7 = round convergent to odd.

    Modes 2 to 7 round to the nearest integer. They differ only in how they round for values of 0.5.
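The interaction of TP_SHIFT and TP_RND can be sketched in plain C++. The helper below is hypothetical (shift_round is not a library function) and models the accumulator as a 64-bit signed integer. It assumes mode 0 rounds toward negative infinity for negative values and that modes 2 to 7 differ only on exact ties, as described above; the hardware behaviour should be confirmed against the AI Engine documentation.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not part of the library API.
// Shifts acc down by `shift` bits (assumed 0 to 61) and rounds according
// to `rnd` (0 to 7, numbered as in the TP_RND list; other values act as 7).
constexpr std::int64_t shift_round(std::int64_t acc, unsigned shift, unsigned rnd) {
    if (shift == 0) return acc;                       // nothing is discarded
    const std::int64_t d = std::int64_t{1} << shift;  // divisor 2^shift
    std::int64_t fl = acc / d;                        // truncated quotient
    std::int64_t rem = acc % d;                       // sign follows acc
    if (rem < 0) { rem += d; fl -= 1; }               // now fl = floor, 0 <= rem < d
    if (rem == 0) return fl;                          // exact result, no rounding
    if (rnd == 0) return fl;                          // floor
    if (rnd == 1) return fl + 1;                      // ceiling
    const std::int64_t half = d / 2;
    if (rem > half) return fl + 1;                    // nearest is above
    if (rem < half) return fl;                        // nearest is below
    const bool flOdd = (fl % 2) != 0;                 // tie: fraction is exactly 0.5
    switch (rnd) {
        case 2: return fl + 1;                        // tie to positive infinity
        case 3: return fl;                            // tie to negative infinity
        case 4: return acc >= 0 ? fl + 1 : fl;        // tie away from zero
        case 5: return acc >= 0 ? fl : fl + 1;        // tie toward zero
        case 6: return flOdd ? fl + 1 : fl;           // tie to even
        default: return flOdd ? fl : fl + 1;          // tie to odd
    }
}
```

For example, an accumulator value of 7 with a shift of 1 represents 3.5, which is a tie: modes 2, 4 and 6 give 4, while modes 3, 5 and 7 give 3.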

TP_DIM_A_LEADING describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. Note, a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix.
TP_DIM_B_LEADING describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.
TP_DIM_OUT_LEADING describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.
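As a rough illustration of the two storage schemes, the sketch below maps a (row, column) position to an offset in a flattened buffer. The function flat_index and the constants kRowMajor and kColMajor are hypothetical names chosen for this example; only the values 0 and 1 come from the descriptions above.

```cpp
// Hypothetical names for illustration; the values mirror ROW_MAJOR = 0, COL_MAJOR = 1.
constexpr unsigned kRowMajor = 0;
constexpr unsigned kColMajor = 1;

// Offset of element (row, col) in a flattened rows x cols matrix.
constexpr unsigned flat_index(unsigned row, unsigned col,
                              unsigned rows, unsigned cols,
                              unsigned leading) {
    return leading == kRowMajor ? row * cols + col   // rows are contiguous
                                : col * rows + row;  // columns are contiguous
}
```

Element (r, c) of a column-major rows x cols matrix sits at the same offset as element (c, r) of a row-major cols x rows matrix, which is why a COL_MAJOR matrix can be treated as the transpose of a ROW_MAJOR one.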

TP_ADD_TILING_A, TP_ADD_TILING_B and TP_ADD_DETILING_OUT each describe whether or not to add an additional kernel to rearrange the matrix samples into their required position.

Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.


TP_INPUT_WINDOW_VSIZE_A describes the number of samples in the window API used for input to Matrix A.

It must be of size TP_DIM_A*TP_DIM_AB*N. Typical use has N=1, however N>1 can be utilised to minimise overhead of window API.

This parameter is optional and has a default value of TP_DIM_A*TP_DIM_AB (N=1).


TP_INPUT_WINDOW_VSIZE_B describes the number of samples in the window API used for input to Matrix B.

It must be of size TP_DIM_B*TP_DIM_AB*M. Typical use has M=1, however M>1 can be utilised to minimise overhead of window API.

This parameter is optional and has a default value of TP_DIM_B*TP_DIM_AB (M=1).

Note, the output window will be of size: (TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB). When N and M are both 1, the output window size will be TP_DIM_A * TP_DIM_B.
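The window sizing rules can be checked with a small compile-time sketch. GemmWindows and gemm_windows are hypothetical names for this example; the arithmetic simply restates the formulas given for the two input windows and the output window, with n and m standing for the multipliers N and M.

```cpp
// Hypothetical helper types for illustration; not part of the library API.
struct GemmWindows {
    unsigned a;    // samples in the Matrix A input window
    unsigned b;    // samples in the Matrix B input window
    unsigned out;  // samples in the output window
};

// Applies the sizing rules: A = DIM_A*DIM_AB*N, B = DIM_B*DIM_AB*M,
// out = (A / DIM_AB) * (B / DIM_AB).
constexpr GemmWindows gemm_windows(unsigned dimA, unsigned dimAB, unsigned dimB,
                                   unsigned n = 1, unsigned m = 1) {
    return GemmWindows{dimA * dimAB * n,
                       dimB * dimAB * m,
                       (dimA * dimAB * n / dimAB) * (dimB * dimAB * m / dimAB)};
}
```

With TP_DIM_A = 16, TP_DIM_AB = 32 and TP_DIM_B = 8, the default windows hold 512 and 256 samples and the output window holds 16 * 8 = 128 samples.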


TP_CASC_LEN describes the number of AIE Tiles to split the GEMM operation into.

TP_CASC_LEN splits the operation over TP_DIM_AB, where each kernel utilises the cascade stream to pass partial accumulation results to the next kernel. In effect, each kernel computes dot(A,B) + C, where C is the partial accumulation received from the previous kernel.

Note, it is also possible to tile the operation over multiple AIE tiles by instantiating multiple GEMM graphs with smaller dimensions.
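The cascade split over the common dimension can be modelled on the host with a scalar reference. This is a minimal sketch, not the AIE kernel code: gemm_cascaded and Matrix are hypothetical names, both matrices are assumed to be row-major, and a plain accumulator buffer stands in for the cascade stream. Each pass of the outer loop plays the role of one kernel handling TP_DIM_AB/TP_CASC_LEN columns of A and rows of B.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<long long>;  // flattened, row-major (assumption)

// Hypothetical host-side model of the cascade: A is dimA x dimAB,
// B is dimAB x dimB, and the result is dimA x dimB.
Matrix gemm_cascaded(const Matrix& a, const Matrix& b,
                     unsigned dimA, unsigned dimAB, unsigned dimB,
                     unsigned cascLen) {
    assert(cascLen > 0 && dimAB % cascLen == 0);
    assert(a.size() == static_cast<Matrix::size_type>(dimA) * dimAB);
    assert(b.size() == static_cast<Matrix::size_type>(dimAB) * dimB);
    const unsigned chunk = dimAB / cascLen;     // common-dimension share per kernel
    Matrix acc(static_cast<Matrix::size_type>(dimA) * dimB, 0);  // models the cascade
    for (unsigned k = 0; k < cascLen; ++k) {    // one pass per modelled kernel
        for (unsigned i = 0; i < dimA; ++i) {
            for (unsigned j = 0; j < dimB; ++j) {
                long long partial = 0;
                for (unsigned c = k * chunk; c < (k + 1) * chunk; ++c) {
                    partial += a[i * dimAB + c] * b[c * dimB + j];
                }
                acc[i * dimB + j] += partial;   // dot(A,B) + C for this kernel
            }
        }
    }
    return acc;
}
```

Because addition is associative for integers, the result is the same for any cascade length that divides the common dimension. Floating-point types may show small differences in the last bits, since the order of accumulation changes.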

template <
    typename TT_DATA_A,
    typename TT_DATA_B,
    unsigned int TP_DIM_A,
    unsigned int TP_DIM_AB,
    unsigned int TP_DIM_B,
    unsigned int TP_SHIFT,
    unsigned int TP_RND,
    unsigned int TP_DIM_A_LEADING = ROW_MAJOR,
    unsigned int TP_DIM_B_LEADING = COL_MAJOR,
    unsigned int TP_DIM_OUT_LEADING = ROW_MAJOR,
    unsigned int TP_ADD_TILING_A = 1,
    unsigned int TP_ADD_TILING_B = 1,
    unsigned int TP_ADD_DETILING_OUT = 1,
    unsigned int TP_CASC_LEN = 1
    >
class matrix_mult_graph: public graph

// typedefs

typedef typename std::conditional < (TP_CASC_LEN==1), matMultCasc <false, false>, no_kernel>::type onlyMatMult
typedef typename std::conditional < (TP_CASC_LEN> 1), matMultCasc <false, true>, onlyMatMult>::type firstMatMult
typedef typename std::conditional < (TP_CASC_LEN> 1), matMultCasc <true, false>, firstMatMult>::type lastMatMult
typedef typename std::conditional < (TP_CASC_LEN> 2), matMultCasc <true, true>, lastMatMult>::type middleMatMult
typedef tilerKernelClass <tilingScheme.Atile, tilingScheme.ABtile, dimAPerKernel, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_A_LEADING, TT_DATA_A> TilerClassA
typedef tilerKernelClass <tilingScheme.ABtile, tilingScheme.Btile, (TP_DIM_AB/TP_CASC_LEN), dimBPerKernel, TP_DIM_B_LEADING, TT_DATA_B> TilerClassB
typedef untilerKernelClass <tilingScheme.Atile, tilingScheme.Btile, dimAPerKernel, dimBPerKernel, TP_DIM_OUT_LEADING, outType_t <TT_DATA_A, TT_DATA_B>> DetilerClassOut
typedef ConditionalWidget <isRedundantTilerA?0:TP_ADD_TILING_A, (TP_INPUT_WINDOW_VSIZE_A/TP_CASC_LEN)*sizeof (TT_DATA_A), TilerClassA> TileAConditional
typedef ConditionalWidget <isRedundantTilerB?0:TP_ADD_TILING_B, (TP_INPUT_WINDOW_VSIZE_B/TP_CASC_LEN)*sizeof (TT_DATA_B), TilerClassB> TileBConditional
typedef ConditionalWidget <isRedundantTilerOut?0:TP_ADD_DETILING_OUT, dimAPerKernel*dimBPerKernel*sizeof (outType_t <TT_DATA_A, TT_DATA_B>), DetilerClassOut> DetileOutConditional

// structs

struct no_kernel

// fields

port <input> inA[TP_CASC_LEN]
port <input> inB[TP_CASC_LEN]
port <output> out
static constexpr middleMatMult::tilingStruct tilingScheme
static constexpr unsigned int dimAPerKernel
static constexpr unsigned int dimBPerKernel
static constexpr bool isRedundantTilerA
static constexpr bool isRedundantTilerB
static constexpr bool isRedundantTilerOut


port <input> inA [TP_CASC_LEN]

The input data to the function. The inputs are windows of samples of TT_DATA_A type (inA) and TT_DATA_B type (inB). The number of samples in each window is described by TP_INPUT_WINDOW_VSIZE_A and TP_INPUT_WINDOW_VSIZE_B, which by default are derived from TP_DIM_A, TP_DIM_AB and TP_DIM_B.

port <output> out

A window API of TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB samples, or simply TP_DIM_A * TP_DIM_B samples of a derived output type.



kernel* getKernels ()

Access function to get pointer to kernel (or first kernel in a chained configuration).


matrix_mult_graph ()

This is the constructor function for the Matrix Multiply graph.