template class xf::dsp::aie::blas::matrix_mult::matrix_mult_graph

#include "matrix_mult_graph.hpp"

Overview

matrix_mult performs a GEneral Matrix Multiply (GEMM), taking two input matrices of configurable dimensions and data type.

The following template parameters configure the Matrix Multiply graph class.

Parameters:

TT_DATA_A describes the type of the individual data samples of Matrix A input to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat.
TT_DATA_B describes the type of the individual data samples of Matrix B input to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat. The following rules apply:

  • must be an integer type if TT_DATA_A is an integer type
  • must be a float type if TT_DATA_A is a float type.
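The pairing rule for TT_DATA_A and TT_DATA_B can be sketched as a compile-time check. This is only an illustration in plain C++: the `*_model` structs are hypothetical stand-ins for the AIE complex types, and `is_float_family` and `compatible_types` are names invented here, not part of the library.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the AIE complex types (assumption, not library types).
struct cint16_model { std::int16_t real, imag; };
struct cint32_model { std::int32_t real, imag; };
struct cfloat_model { float real, imag; };

// True for float and cfloat, false for the integer types.
template <typename T> struct is_float_family : std::false_type {};
template <> struct is_float_family<float> : std::true_type {};
template <> struct is_float_family<cfloat_model> : std::true_type {};

// Both types must come from the same family: integer with integer, float with float.
template <typename TA, typename TB>
constexpr bool compatible_types() {
    return is_float_family<TA>::value == is_float_family<TB>::value;
}

static_assert(compatible_types<std::int16_t, cint32_model>(), "integer with integer is allowed");
static_assert(!compatible_types<std::int16_t, float>(), "integer with float is rejected");
```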
TP_DIM_A is an unsigned integer which describes the number of elements along the unique dimension (rows) of Matrix A.
TP_DIM_AB is an unsigned integer which describes the number of elements along the common dimension of Matrix A (columns) and Matrix B (rows).
TP_DIM_B is an unsigned integer which describes the number of elements along the unique dimension (columns) of Matrix B.
TP_SHIFT describes the power-of-2 shift down applied to the accumulation of product terms before each output. TP_SHIFT must be in the range 0 to 61.
TP_RND describes the selection of rounding to be applied during the shift down stage of processing. TP_RND must be in the range 0 to 7, where:

  • 0 = floor (truncate), e.g. 3.8 would become 3
  • 1 = ceiling, e.g. 3.2 would become 4
  • 2 = round to positive infinity
  • 3 = round to negative infinity
  • 4 = round symmetrical to infinity
  • 5 = round symmetrical to zero
  • 6 = round convergent to even
  • 7 = round convergent to odd

Modes 2 to 7 round to the nearest integer. They differ only in how they round values that lie exactly halfway between two integers (a fractional part of 0.5).
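To make the interaction of TP_SHIFT and TP_RND concrete, the following is a minimal software model of the shift-down stage as described above. It is a sketch in plain C++, not the AIE hardware implementation. The helper name `shift_round` is hypothetical, and treating mode 0 as a floor toward negative infinity for negative accumulators is an assumption.

```cpp
#include <cstdint>

// Hypothetical model: divide acc by 2^shift and round according to mode (0 to 7).
inline std::int64_t shift_round(std::int64_t acc, unsigned int shift, unsigned int mode) {
    if (shift == 0) return acc;                      // nothing to round
    const std::int64_t one = std::int64_t{1} << shift;
    const std::int64_t half = one / 2;
    // Floor quotient and non-negative remainder, written without relying on
    // the behaviour of right-shifting negative values.
    std::int64_t fl = acc / one;
    std::int64_t rem = acc % one;
    if (rem < 0) { fl -= 1; rem += one; }
    if (mode == 0) return fl;                        // floor
    if (mode == 1) return rem != 0 ? fl + 1 : fl;    // ceiling
    // Modes 2 to 7 round to nearest; they differ only on exact ties.
    if (rem != half) return rem > half ? fl + 1 : fl;
    const bool flIsEven = (fl % 2) == 0;
    switch (mode) {
        case 2: return fl + 1;                       // tie toward positive infinity
        case 3: return fl;                           // tie toward negative infinity
        case 4: return acc >= 0 ? fl + 1 : fl;       // tie away from zero
        case 5: return acc >= 0 ? fl : fl + 1;       // tie toward zero
        case 6: return flIsEven ? fl : fl + 1;       // tie to even
        default: return flIsEven ? fl + 1 : fl;      // mode 7: tie to odd
    }
}
```

For example, with a shift of 1 an accumulator of 7 represents 3.5, so `shift_round(7, 1, 0)` yields 3 and `shift_round(7, 1, 6)` yields 4.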
TP_DIM_A_LEADING describes the scheme in which the data of Matrix A should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. Note, a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix.
TP_DIM_B_LEADING describes the scheme in which the data of Matrix B should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.
TP_DIM_OUT_LEADING describes the scheme in which the data of the output matrix should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.
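The difference between the two storage schemes can be shown with a small index calculation. This is a sketch only: the `sketch` namespace and `linear_index` are hypothetical names, and the constants mirror the ROW_MAJOR = 0, COL_MAJOR = 1 values quoted above rather than the library's own definitions.

```cpp
#include <cstddef>

namespace sketch {

// Values as documented above; redefined here so the example is self-contained.
constexpr unsigned int ROW_MAJOR = 0;
constexpr unsigned int COL_MAJOR = 1;

// Position in linear memory of element (row, col) of a rows x cols matrix.
constexpr std::size_t linear_index(std::size_t row, std::size_t col,
                                   std::size_t rows, std::size_t cols,
                                   unsigned int leading) {
    return leading == ROW_MAJOR ? row * cols + col   // rows are contiguous
                                : col * rows + row;  // columns are contiguous
}

}  // namespace sketch
```

Swapping the roles of `row` and `col` (and of `rows` and `cols`) in the COL_MAJOR case gives the ROW_MAJOR index of the transposed matrix, which is the sense in which a COL_MAJOR matrix can be read as a transposed ROW_MAJOR one.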
TP_ADD_TILING_A describes whether or not to add an additional kernel to rearrange the Matrix A samples into their required position. Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.
TP_ADD_TILING_B describes whether or not to add an additional kernel to rearrange the Matrix B samples into their required position. Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.
TP_ADD_DETILING_OUT describes whether or not to add an additional kernel to rearrange the output matrix samples into their required position. Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.
TP_INPUT_WINDOW_VSIZE_A describes the number of samples in the window API used for input to Matrix A. It must be of size TP_DIM_A*TP_DIM_AB*N. Typical use has N=1; however, N>1 can be utilised to minimise the overhead of the window API. This parameter is optional and has a default value of TP_DIM_A*TP_DIM_AB (N=1).
TP_INPUT_WINDOW_VSIZE_B describes the number of samples in the window API used for input to Matrix B. It must be of size TP_DIM_B*TP_DIM_AB*M. Typical use has M=1; however, M>1 can be utilised to minimise the overhead of the window API. This parameter is optional and has a default value of TP_DIM_B*TP_DIM_AB (M=1). Note, the output window will be of size: (TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB). When N and M are both 1, the output window size will be TP_DIM_A * TP_DIM_B.
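The window-size relationships quoted above can be evaluated at compile time. The sketch below simply encodes those formulas in plain C++; the function names are hypothetical and not part of the library.

```cpp
// Hypothetical helpers that evaluate the window-size formulas as documented.
constexpr unsigned int window_vsize_a(unsigned int dimA, unsigned int dimAB, unsigned int n = 1) {
    return dimA * dimAB * n;   // TP_DIM_A * TP_DIM_AB * N
}

constexpr unsigned int window_vsize_b(unsigned int dimB, unsigned int dimAB, unsigned int m = 1) {
    return dimB * dimAB * m;   // TP_DIM_B * TP_DIM_AB * M
}

constexpr unsigned int output_window_vsize(unsigned int vsizeA, unsigned int vsizeB, unsigned int dimAB) {
    return (vsizeA / dimAB) * (vsizeB / dimAB);
}

// With N = M = 1 the output window reduces to TP_DIM_A * TP_DIM_B.
static_assert(output_window_vsize(window_vsize_a(16, 32), window_vsize_b(8, 32), 32) == 16 * 8,
              "default window sizes give a TP_DIM_A * TP_DIM_B output window");
```

For instance, with TP_DIM_A = 16, TP_DIM_AB = 32 and TP_DIM_B = 8, the default input windows hold 512 and 256 samples and the output window holds 128 samples.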
TP_CASC_LEN describes the number of AIE tiles to split the GEMM operation into. TP_CASC_LEN splits the operation over TP_DIM_AB, where each kernel utilises the cascade stream to pass partial accumulation results to the next kernel. In effect, each kernel computes dot(A,B) + C, where C is the partial accumulation received from the previous kernel. Note, it is also possible to tile the operation over multiple AIE tiles by instantiating multiple GEMM graphs with smaller dimensions.
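The effect of splitting over TP_DIM_AB can be modelled on a host in plain C++. The sketch below is not the AIE kernel code: `gemm_cascaded` is a hypothetical function, both matrices are assumed to be row-major, and the common dimension is assumed to divide evenly by the cascade length. Each pass of the outer loop stands in for one kernel, which adds its slice of the dot product to the partial sums handed on by the previous one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a GEMM split over the common dimension.
// A is dimA x dimAB and B is dimAB x dimB, both row-major (assumption).
std::vector<long long> gemm_cascaded(const std::vector<int>& A, const std::vector<int>& B,
                                     std::size_t dimA, std::size_t dimAB, std::size_t dimB,
                                     std::size_t cascLen) {
    std::vector<long long> acc(dimA * dimB, 0);    // partial sums passed down the chain
    const std::size_t slice = dimAB / cascLen;     // assumes dimAB % cascLen == 0
    for (std::size_t k = 0; k < cascLen; ++k) {    // one pass per modelled kernel
        for (std::size_t i = 0; i < dimA; ++i) {
            for (std::size_t j = 0; j < dimB; ++j) {
                for (std::size_t ab = k * slice; ab < (k + 1) * slice; ++ab) {
                    acc[i * dimB + j] += static_cast<long long>(A[i * dimAB + ab]) * B[ab * dimB + j];
                }
            }
        }
    }
    return acc;
}
```

Because integer addition is associative, the result is the same for any cascade length that divides the common dimension; only the distribution of work changes.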
template <
    typename TT_DATA_A,
    typename TT_DATA_B,
    unsigned int TP_DIM_A,
    unsigned int TP_DIM_AB,
    unsigned int TP_DIM_B,
    unsigned int TP_SHIFT,
    unsigned int TP_RND,
    unsigned int TP_DIM_A_LEADING = ROW_MAJOR,
    unsigned int TP_DIM_B_LEADING = COL_MAJOR,
    unsigned int TP_DIM_OUT_LEADING = ROW_MAJOR,
    unsigned int TP_ADD_TILING_A = 1,
    unsigned int TP_ADD_TILING_B = 1,
    unsigned int TP_ADD_DETILING_OUT = 1,
    unsigned int TP_INPUT_WINDOW_VSIZE_A = TP_DIM_A* TP_DIM_AB,
    unsigned int TP_INPUT_WINDOW_VSIZE_B = TP_DIM_B* TP_DIM_AB,
    unsigned int TP_CASC_LEN = 1
    >
class matrix_mult_graph: public graph

// typedefs

typedef matrix_mult <TT_DATA_A, TT_DATA_B, TP_DIM_A, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_B, TP_SHIFT, TP_RND, TP_DIM_A_LEADING, TP_DIM_B_LEADING, TP_DIM_OUT_LEADING, (TP_INPUT_WINDOW_VSIZE_A/TP_CASC_LEN), (TP_INPUT_WINDOW_VSIZE_B/TP_CASC_LEN), cascIn, cascOut> matMultCasc
typedef typename std::conditional < (TP_CASC_LEN==1), matMultCasc <false, false>, no_kernel>::type onlyMatMult
typedef typename std::conditional < (TP_CASC_LEN> 1), matMultCasc <false, true>, onlyMatMult>::type firstMatMult
typedef typename std::conditional < (TP_CASC_LEN> 1), matMultCasc <true, false>, firstMatMult>::type lastMatMult
typedef typename std::conditional < (TP_CASC_LEN> 2), matMultCasc <true, true>, lastMatMult>::type middleMatMult
typedef tilerKernelClass <tilingScheme.Atile, tilingScheme.ABtile, dimAPerKernel, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_A_LEADING, TT_DATA_A> TilerClassA
typedef tilerKernelClass <tilingScheme.ABtile, tilingScheme.Btile, (TP_DIM_AB/TP_CASC_LEN), dimBPerKernel, TP_DIM_B_LEADING, TT_DATA_B> TilerClassB
typedef untilerKernelClass <tilingScheme.Atile, tilingScheme.Btile, dimAPerKernel, dimBPerKernel, TP_DIM_OUT_LEADING, outType_t <TT_DATA_A, TT_DATA_B>> DetilerClassOut
typedef ConditionalWidget <isRedundantTilerA?0:TP_ADD_TILING_A, (TP_INPUT_WINDOW_VSIZE_A/TP_CASC_LEN)*sizeof (TT_DATA_A), TilerClassA> TileAConditional
typedef ConditionalWidget <isRedundantTilerB?0:TP_ADD_TILING_B, (TP_INPUT_WINDOW_VSIZE_B/TP_CASC_LEN)*sizeof (TT_DATA_B), TilerClassB> TileBConditional
typedef ConditionalWidget <isRedundantTilerOut?0:TP_ADD_DETILING_OUT, dimAPerKernel*dimBPerKernel*sizeof (outType_t <TT_DATA_A, TT_DATA_B>), DetilerClassOut> DetileOutConditional

// structs

struct no_kernel

// fields

port <input> inA[TP_CASC_LEN]
port <input> inB[TP_CASC_LEN]
port <output> out
static constexpr middleMatMult::tilingStruct tilingScheme
static constexpr unsigned int dimAPerKernel
static constexpr unsigned int dimBPerKernel
static constexpr bool isRedundantTilerA
static constexpr bool isRedundantTilerB
static constexpr bool isRedundantTilerOut

Fields

port <input> inA [TP_CASC_LEN]

The input data to the function. The input consists of two windows of samples, one of TT_DATA_A type and one of TT_DATA_B type. The number of samples in each window is described by TP_INPUT_WINDOW_VSIZE_A and TP_INPUT_WINDOW_VSIZE_B, which by default are derived from TP_DIM_A, TP_DIM_AB and TP_DIM_B.

port <output> out

A window API of TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB samples of a derived output type. With the default window sizes this is simply TP_DIM_A * TP_DIM_B samples.

Methods

getKernels

kernel* getKernels ()

Access function to get a pointer to the kernel (or to the first kernel in a chained configuration).

matrix_mult_graph

matrix_mult_graph ()

This is the constructor function for the Matrix Multiply graph.