DSP Library Functions¶
The Xilinx® digital signal processing library (DSPLib) is a configurable library of kernels for developing applications on Versal™ ACAP AI Engines. It is an open-source library for DSP applications. Kernels are coded in C++ and use special functions called intrinsics that give access to the AI Engine vector processing capabilities. Kernels can be combined to construct graphs for developing complex designs. An example design is provided with this library for reference. Each kernel has a corresponding graph, and it is highly recommended to use the library element’s graph as the entry point. See Using the Examples for more details.
Filters¶
The DSPLib contains several variants of Finite Impulse Response (FIR) filters. On the AI Engine processor, data is packetized into windows. In the case of FIRs, each window is extended by a margin so that the state of the filter at the end of the previous window can be restored before new computations begin. To maximize performance, the window size should therefore be set to the largest value the system allows, although this leads to a corresponding increase in latency. Choosing the window size is a trade-off in which factors such as data movement and latency must be taken into consideration.
Note
With a small window size (for example, 32), you pay a high penalty for the function call overhead. This means that the preamble and postamble will be a major cycle consumer in your function call.
FIR filters have been categorized into classes and placed in a distinct namespace scope: xf::dsp::aie::fir, to prevent name collision in the global scope. Namespace aliasing can be utilized to shorten instantiations:
namespace dsplib = xf::dsp::aie;
Additionally, each FIR filter has been placed in a unique FIR type namespace. The available FIR filter classes and the corresponding graph entry point are listed below:
Table 1: FIR Filter Classes
Function  Graph Entry Point

Single rate, asymmetrical  dsplib::fir::sr_asym::fir_sr_asym_graph 
Single rate, symmetrical  dsplib::fir::sr_sym::fir_sr_sym_graph 
Interpolation, asymmetrical  dsplib::fir::interpolate_asym::fir_interpolate_asym_graph
Decimation, halfband  dsplib::fir::decimate_hb::fir_decimate_hb_graph 
Interpolation, halfband  dsplib::fir::interpolate_hb::fir_interpolate_hb_graph 
Decimation, asymmetric  dsplib::fir::decimate_asym::fir_decimate_asym_graph 
Interpolation, fractional, asymmetric  dsplib::fir::interpolate_fract_asym::fir_interpolate_fract_asym_graph
Decimation, symmetric  dsplib::fir::decimate_sym::fir_decimate_sym_graph 
Conventions for Filters¶
All FIR filters can be configured for various data and coefficient types. Each type can be int16, int32, or float, in either real or complex form. However, configurations with real data and complex coefficients are not supported, nor are configurations where the coefficients are int32 and the data is int16. Data and coefficients must both be integer types or both be float types; mixes are not supported.
The following table lists the supported combinations of data type and coefficient type.
Table 2: Supported Combinations of Data Type and Coefficient Type
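Coefficient Type  Data: Int16  Data: Cint16  Data: Int32  Data: Cint32  Data: Float  Data: Cfloat

Int16  Supported  Supported  Supported  Supported  3  3
Cint16  1  Supported  1  Supported  3  3
Int32  2  2  Supported  Supported  3  3
Cint32  1, 2  2  1  Supported  3  3
Float  3  3  3  3  Supported  Supported
Cfloat  3  3  3  3  3  Supported

Numbered cells are unsupported combinations. Following the restrictions described above: 1 = complex coefficients with real data; 2 = 32-bit integer coefficients with 16-bit integer data; 3 = a mix of integer and float types.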
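The restrictions described in the prose above can be sketched as a compile-time check. This is a minimal illustration only: `DType` and `isSupportedCombo` are hypothetical names, not part of DSPLib, and the three rules are taken from the text rather than from the library source.

```cpp
// Hypothetical stand-ins for the six data/coefficient types (not DSPLib types).
enum class DType { Int16, Cint16, Int32, Cint32, Float, Cfloat };

constexpr bool isComplex(DType t) {
    return t == DType::Cint16 || t == DType::Cint32 || t == DType::Cfloat;
}
constexpr bool isFloat(DType t) {
    return t == DType::Float || t == DType::Cfloat;
}
constexpr bool is16BitInt(DType t) {
    return t == DType::Int16 || t == DType::Cint16;
}
constexpr bool is32BitInt(DType t) {
    return t == DType::Int32 || t == DType::Cint32;
}

// True when the data/coefficient pair passes all three documented rules.
constexpr bool isSupportedCombo(DType data, DType coeff) {
    return !(isComplex(coeff) && !isComplex(data))    // no complex coeffs on real data
        && !(is32BitInt(coeff) && is16BitInt(data))   // no 32-bit coeffs on 16-bit data
        && isFloat(data) == isFloat(coeff);           // no integer/float mixes
}

static_assert(isSupportedCombo(DType::Cint16, DType::Int16),
              "cint16 data with int16 coefficients is a supported combination");
```

A check of this kind can be evaluated before instantiating a filter graph, so that an unsupported pairing is caught early.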

For all filters, the coefficient values are passed not as template parameters but as an array argument: to the constructor for non-reloadable configurations, or to the reload function for reloadable configurations. For symmetrical filters, only the first half (plus the centre tap, if the length is odd) needs to be passed, as the remainder can be derived by symmetry. For halfband filters, only the nonzero coefficients should be entered, so the expected array length is (TP_FIR_LEN+1)/4 + 1, where the additional 1 accounts for the centre tap.
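As a worked example of the array lengths described above, the following sketch computes how many coefficients would be passed for a symmetrical and a halfband filter. The function names are hypothetical helpers, not DSPLib API, and integer division is assumed throughout.

```cpp
// Symmetrical filter: the first half of the taps, plus the centre tap when
// the length is odd. Integer division rounds down, so (len + 1) / 2 covers both cases.
constexpr unsigned symCoeffArrayLen(unsigned firLen) {
    return (firLen + 1) / 2;
}

// Halfband filter: only the nonzero taps of the first half, plus the centre tap.
constexpr unsigned halfbandCoeffArrayLen(unsigned firLen) {
    return (firLen + 1) / 4 + 1;
}

// A 129-tap symmetrical filter needs 65 coefficients.
static_assert(symCoeffArrayLen(129) == 65, "first 64 taps plus the centre tap");
// A 7-tap halfband (c0, 0, c2, c3, c2, 0, c0) needs c0, c2 and the centre tap c3.
static_assert(halfbandCoeffArrayLen(7) == 3, "two nonzero taps plus the centre tap");
```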
The following table lists parameters supported by all the FIR filters:
Table 3: Parameters Supported by FIR Filters
Parameter Name  Type  Description  Range 

TP_FIR_LEN  unsigned  The number of taps  4 to 240 
TP_RND  unsigned int  Round mode  0 = truncate or floor, 1 = ceiling (round up), 2 = positive infinity, 3 = negative infinity, 4 = symmetrical to infinity, 5 = symmetrical to zero, 6 = convergent to even, 7 = convergent to odd
TP_SHIFT  unsigned int  The number of bits to shift accumulation down by before output.  0 to 61 
TT_DATA  typename  Data Type  int16, cint16, int32, cint32, float, cfloat 
TT_COEFF  typename  Coefficient type  int16, cint16, int32, cint32, float, cfloat 
TP_INPUT_WINDOW_VSIZE  unsigned int  The number of samples in the input window.  Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. 
TP_CASC_LEN  unsigned int  The number of cascaded kernels to use for this FIR.  1 to 9. Defaults to 1 if not set. 
TP_DUAL_IP  unsigned int  Use dual inputs (may increase throughput for symmetrical and halfband filters by avoiding load contention by using a second RAM bank for input).  Range 0 (single input), 1 (dual input). Defaults to 0 if not set. 
TP_USE_COEFF_RELOAD  unsigned int  Enable reloadable coefficient feature. An additional ‘coeff’ RTP port will appear on the graph.  0 (no reload), 1 (use reloads). Defaults to 0 if not set. 
TP_NUM_OUTPUTS  unsigned int  Number of FIR output ports  >1
Note
The number of lanes is the number of data elements that are processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., the number of bits in each element) and the register or bus width.
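The ranges in Table 3 lend themselves to a compile-time sanity check before a filter is instantiated. The sketch below is an assumption-laden illustration: `firParamsInRange` is a hypothetical helper, not a DSPLib function, and the number of lanes is passed in by the caller because it depends on the data type.

```cpp
// Hypothetical helper mirroring the ranges listed in Table 3.
// 'lanes' is supplied by the caller (typically 4 or 8, depending on data type).
constexpr bool firParamsInRange(unsigned firLen, unsigned rnd, unsigned shift,
                                unsigned windowVsize, unsigned cascLen,
                                unsigned lanes) {
    return firLen >= 4 && firLen <= 240          // TP_FIR_LEN: 4 to 240
        && rnd <= 7                              // TP_RND: modes 0 to 7
        && shift <= 61                           // TP_SHIFT: 0 to 61
        && cascLen >= 1 && cascLen <= 9          // TP_CASC_LEN: 1 to 9
        && lanes > 0 && windowVsize % lanes == 0;  // window is a multiple of lanes
}

// Example configuration: 81 taps, truncation, shift of 16, 512-sample window,
// no cascade, and an assumed 8 lanes.
static_assert(firParamsInRange(81, 0, 16, 512, 1, 8),
              "example FIR configuration is within the documented ranges");
```

The library performs its own static_assert checks; a helper like this only documents the intent in user code and does not replace them.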
FFT/iFFT¶
The DSPLib contains one FFT/iFFT solution. This is a single channel, decimation in time (DIT) implementation with configurable point size, data type, and FFT/iFFT function.
Point size may be any power of 2 from 16 to 4096. Where the FFT kernel uses ping-pong window input, the upper limit is reduced to 2048 for the cint16 data type and to 1024 for the cfloat or cint32 data types. The 4096 limit can only be achieved where the FFT receives data from, and outputs data to, kernels on the same processor.
Table 4: FFT Parameters
Name  Type  Description  Range 

TT_DATA  Typename  The input data type  cint16, cint32, cfloat 
TT_TWIDDLE  Typename  The twiddle factor type. Determined by TT_DATA  Set to cint16 for data type of cint16 or cint32 and cfloat for data type of cfloat. 
TP_POINT_SIZE  Unsigned int  The number of samples in a frame to be processed  2^N, where N is in the range 4 to 12, though the upper limit may be constrained by device resources. 
TP_FFT_NIFFT  Unsigned int  Forward or reverse transform  0 (IFFT) or 1 (FFT). 
TP_SHIFT  Unsigned int  The number of bits to shift accumulation down by before output.  0 to 61 
TP_CASC_LEN  Unsigned int  The number of kernels the FFT will be divided over.  1 to 12. Defaults to 1 if not set. The maximum is derived from the number of radix 2 stages required for the given point size (N, where pointSize = 2^N). For float data types the maximum is N. For integer data types the maximum is CEIL(N/2).
TP_DYN_PT_SIZE  Unsigned int  FFT point size  2^N, where N is 2 to 12 
TP_WINDOW_VSIZE  Unsigned int  The number of samples in the input window.  Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive memory usage. 
Note
The number of lanes is the number of data elements that are processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., the number of bits in each element) and the register or bus width.
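The maximum TP_CASC_LEN rule in Table 4 can be made concrete with a short calculation. This is a sketch under the stated rule only (N for float data, CEIL(N/2) for integer data); `log2PointSize` and `maxFftCascLen` are hypothetical names, and the point size is assumed to be a power of 2.

```cpp
// N such that pointSize = 2^N (pointSize is assumed to be a power of 2).
constexpr unsigned log2PointSize(unsigned pointSize) {
    return pointSize <= 1 ? 0 : 1 + log2PointSize(pointSize / 2);
}

// Maximum cascade length: N for float data, CEIL(N/2) for integer data.
constexpr unsigned maxFftCascLen(unsigned pointSize, bool isFloatData) {
    return isFloatData ? log2PointSize(pointSize)
                       : (log2PointSize(pointSize) + 1) / 2;
}

// A 1024-point FFT has N = 10 radix 2 stages.
static_assert(maxFftCascLen(1024, false) == 5, "integer data: CEIL(10/2)");
static_assert(maxFftCascLen(1024, true) == 10, "float data: N");
```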
This FFT implementation does not implement the 1/N scaling of an IFFT. Internally, for cint16 and cint32 data, an internal data type of cint32 is used. After each rank, the values are scaled by only enough to normalize the bit growth caused by the twiddle multiplication (i.e., 15 bits). Distortion caused by saturation will be possible for large point sizes and large values when the data type is cint32. In the final stage, the result is scaled by 17 bits for point size from 16 to 1024, by 18 for 2048, and by 19 for 4096.
No scaling is applied at any point when the data type is cfloat. The graph entry point is the following:
xf::dsp::aie::fft::fft_ifft_dit_1ch_graph
Matrix Multiply¶
The DSPLib contains one Matrix Multiply/GEMM (GEneral Matrix Multiply) solution. The GEMM has two input ports connected to two windows of data. The inputs are denoted as Matrix A (inA) and Matrix B (inB). The template parameter TP_DIM_A describes the number of rows of Matrix A. The number of columns of inA must equal the number of rows of inB; this common dimension is denoted by the template parameter TP_DIM_AB. The number of columns of Matrix B is denoted by TP_DIM_B.
An output port connects to a window, where the data for the output matrix is stored. The output matrix has as many rows as inA (TP_DIM_A) and as many columns as inB (TP_DIM_B). The data type of both input matrices can be configured, and the data type of the output is derived from the inputs.
Table 5: Matrix Multiply Parameters
Name  Type  Description  Range 

TT_DATA_A  Typename  The input data type of Matrix A  int16, cint16, int32, cint32, float, cfloat
TT_DATA_B  Typename  The input data type of Matrix B  int16, cint16, int32, cint32, float, cfloat
TP_DIM_A  unsigned int  The number of elements along the unique dimension (rows) of Matrix A  
TP_DIM_AB  unsigned int  The number of elements along the common dimension of Matrix A (columns) and Matrix B (rows)  
TP_DIM_B  unsigned int  The number of elements along the unique dimension (columns) of Matrix B  
TP_SHIFT  unsigned int  power of 2 shift down applied to the accumulation of product terms before each output  In range 0 to 61 
TP_RND  unsigned int  Round mode  0 = truncate or floor, 1 = ceiling (round up), 2 = positive infinity, 3 = negative infinity, 4 = symmetrical to infinity, 5 = symmetrical to zero, 6 = convergent to even, 7 = convergent to odd
TP_DIM_A_LEADING  unsigned int  The scheme in which the data should be stored in memory  ROW_MAJOR = 0 COL_MAJOR = 1 
TP_DIM_B_LEADING  unsigned int  The scheme in which the data should be stored in memory  ROW_MAJOR = 0 COL_MAJOR = 1 
TP_DIM_OUT_LEADING  unsigned int  The scheme in which the data should be stored in memory  ROW_MAJOR = 0 COL_MAJOR = 1 
TP_ADD_TILING_A  unsigned int  Option to add an additional kernel to rearrange the samples of Matrix A  0 = rearrange externally to the graph, 1 = add the additional tiling kernel
TP_ADD_TILING_B  unsigned int  Option to add an additional kernel to rearrange the samples of Matrix B  0 = rearrange externally to the graph, 1 = add the additional tiling kernel
TP_ADD_DETILING_OUT  unsigned int  Option to add an additional kernel to rearrange the samples of the output matrix  0 = rearrange externally to the graph, 1 = add the additional detiling kernel
TP_WINDOW_VSIZE_A  unsigned int  The number of samples in the input window for Matrix A  Must be TP_DIM_A*TP_DIM_AB*N. Defaults to TP_DIM_A*TP_DIM_AB (N=1).
TP_WINDOW_VSIZE_B  unsigned int  The number of samples in the input window for Matrix B  Must be TP_DIM_B*TP_DIM_AB*M. Defaults to TP_DIM_B*TP_DIM_AB (M=1).
TP_CASC_LEN  unsigned int  The number of AIE tiles to split the operation into  Defaults to 1 if not set. 
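The window size constraints in Table 5 follow directly from the matrix dimensions. The sketch below computes them for a given shape; `MatMulShape` and the helper functions are hypothetical and exist only for illustration.

```cpp
// Hypothetical description of a matrix multiply: A is dimA x dimAB, B is dimAB x dimB.
struct MatMulShape {
    unsigned dimA;   // rows of Matrix A (TP_DIM_A)
    unsigned dimAB;  // columns of A and rows of B (TP_DIM_AB)
    unsigned dimB;   // columns of Matrix B (TP_DIM_B)
};

// Samples in the input window for A when it holds n whole matrices (default n = 1).
constexpr unsigned windowVsizeA(MatMulShape s, unsigned n = 1) {
    return s.dimA * s.dimAB * n;
}

// Samples in the input window for B when it holds m whole matrices (default m = 1).
constexpr unsigned windowVsizeB(MatMulShape s, unsigned m = 1) {
    return s.dimB * s.dimAB * m;
}

// Samples in one output matrix: rows of A by columns of B.
constexpr unsigned outputSamples(MatMulShape s) {
    return s.dimA * s.dimB;
}
```

For a 16x32 Matrix A multiplied by a 32x8 Matrix B, this gives default windows of 512 and 256 samples and an output of 128 samples.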
Input matrices are processed in distinct blocks and matrix elements must be rearranged into a specific pattern.
The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x4 tiling pattern.
Note
Indices are quoted assuming a row major matrix. A column major matrix needs to be transposed.
Table 6: Matrix Multiply 4x4 tiling pattern
Tile Col 0  Tile Col 1  Tile Col 2  Tile Col 3  

Tile Row 0  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  
32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  
48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  
Tile Row 1  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79 
80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  
96  97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  
112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  
Tile Row 2  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143 
144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  
160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  
176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  
Tile Row 3  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207 
208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  
224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  
240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255 
This is stored contiguously in memory as:
0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 64, 65, 66, 67, 80, 81, 82, 83, 96, 97, 98, 99, 112, 113, 114, 115, … , 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255
The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x2 tiling pattern.
Table 7: Matrix Multiply 4x2 tiling pattern
Tile Col 0  Tile Col 1  Tile Col 2  Tile Col 3  Tile Col 4  Tile Col 5  Tile Col 6  Tile Col 7  

Tile Row 0  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  
32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  
48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  
Tile Row 1  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79 
80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  
96  97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  
112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  
Tile Row 2  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143 
144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  
160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  
176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  
Tile Row 3  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207 
208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  
224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  
240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255 
This is stored contiguously in memory as:
0, 1, 16, 17, 32, 33, 48, 49, 2, 3, 18, 19, 34, 35, 50, 51, …, 206, 207, 222, 223, 238, 239, 254, 255
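The two memory orders listed above can be reproduced with a small loop nest. The following is a plain C++ sketch, not the DSPLib tiling kernel: `tiledOrder` is a hypothetical helper that assumes a row major source matrix whose dimensions are exact multiples of the tile dimensions.

```cpp
#include <cstddef>
#include <vector>

// Returns the row major indices of a rows x cols matrix in the order they are
// stored in memory for a tileRows x tileCols tiling pattern.
// Assumes rows % tileRows == 0 and cols % tileCols == 0.
std::vector<int> tiledOrder(int rows, int cols, int tileRows, int tileCols) {
    std::vector<int> order;
    order.reserve(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    for (int tr = 0; tr < rows; tr += tileRows) {          // walk tile rows
        for (int tc = 0; tc < cols; tc += tileCols) {      // walk tile columns
            for (int r = 0; r < tileRows; ++r) {           // rows inside one tile
                for (int c = 0; c < tileCols; ++c) {       // columns inside one tile
                    order.push_back((tr + r) * cols + (tc + c));
                }
            }
        }
    }
    return order;
}
```

Calling `tiledOrder(16, 16, 4, 4)` yields a sequence beginning 0, 1, 2, 3, 16, 17, 18, 19, and `tiledOrder(16, 16, 4, 2)` yields one beginning 0, 1, 16, 17, 32, 33, 48, 49, matching the listings above.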
Multiplying a 16x16 matrix (with 4x4 tiling) with a 16x16 matrix (with 4x2 tiling) will result in a 16x16 matrix with 4x2 tiling.
The following table specifies the tiling scheme used for a given data type combination and the corresponding output data type:
Table 8: Matrix Multiply tiling pattern combination
Input Type A  Input Type B  Tiling Scheme A  Tiling Scheme B  Output Type

int16  int16  4x4  4x4  int16 
int16  cint16  4x2  2x2  cint16 
int16  int32  4x2  2x2  int32 
int16  cint32  2x4  4x2  cint32 
cint16  int16  4x4  4x2  cint16 
cint16  cint16  4x4  4x2  cint16 
cint16  int32  4x4  4x2  cint32 
cint16  cint32  2x2  2x2  cint32 
int32  int16  4x4  4x2  int32 
int32  int32  4x4  4x2  int32 
int32  cint16  4x4  4x2  cint32 
int32  cint32  2x2  2x2  cint32 
cint32  int16  2x4  4x2  cint32 
cint32  cint16  2x2  2x2  cint32 
cint32  int32  2x2  2x2  cint32 
cint32  cint32  2x2  2x2  cint32 
float  float  4x4  4x2  float 
float  cfloat  2x4  4x2  cfloat 
cfloat  float  2x4  4x2  cfloat 
cfloat  cfloat  4x2  2x2  cfloat 
The parameters TP_ADD_TILING_A, TP_ADD_TILING_B, and TP_ADD_DETILING_OUT control the inclusion of an additional preprocessing or postprocessing kernel that performs the required data shuffling. When used with TP_DIM_A_LEADING, TP_DIM_B_LEADING, or TP_DIM_OUT_LEADING, the matrix is also transposed in the tiling kernel.
If the additional kernels are not selected, the matrix multiply kernels assume that incoming data is already in the correct format, as specified above. When the TP_CASC_LEN parameter is used, the matrix multiply operation is split across TP_DIM_AB and processed in TP_CASC_LEN kernels. The accumulated partial results of each kernel are passed down the cascade port to the next kernel in the cascade chain until the final kernel provides the expected output. Cascade connections are made internally to the matrix multiply graph.
Each AI Engine kernel in the array is given a submatrix, so the interface to the graph is an array of ports for both A and B.
Input Matrix A (16x16  4x4 Tile  Cascade Length 2):
Table 9: Input Matrix A (16x16  4x4 Tile  Cascade Length 2)
AIE 0  AIE 1  

Tile Col 0  Tile Col 1  Tile Col 2  Tile Col 3  
Tile Row 0  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  
32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  
48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  
Tile Row 1  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79 
80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  
96  97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  
112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  
Tile Row 2  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143 
144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  
160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  
176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  
Tile Row 3  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207 
208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  
224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  
240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255 
Input Matrix B (16x16  4x2 Tile  Cascade Length 2):
Table 10: Input Matrix B (16x16  4x2 Tile  Cascade Length 2)
Tile Col 0  Tile Col 1  Tile Col 2  Tile Col 3  Tile Col 4  Tile Col 5  Tile Col 6  Tile Col 7  

AIE 0  Tile Row 0  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  
32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  
48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  
Tile Row 1  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  
80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  
96  97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  
112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  
AIE 1  Tile Row 2  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143 
144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  
160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  
176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  
Tile Row 3  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  
208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  
224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  
240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255 
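The cascade split shown in the two tables above can be modelled in plain C++: each kernel handles a slice of the common dimension (a block of columns of A and the matching rows of B) and adds its partial products to the running total received from the previous kernel. This is a behavioural sketch only; `cascadedMatMul` is a hypothetical function, it ignores tiling and fixed-point shifts, and it assumes the common dimension divides evenly by the cascade length.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Models a matrix multiply split over cascLen kernels along the common dimension.
// Assumes a is dimA x dimAB, b is dimAB x dimB, and dimAB % cascLen == 0.
Matrix cascadedMatMul(const Matrix& a, const Matrix& b, std::size_t cascLen) {
    const std::size_t dimA = a.size();
    const std::size_t dimAB = b.size();
    const std::size_t dimB = b[0].size();
    const std::size_t slice = dimAB / cascLen;  // common-dimension elements per kernel

    // 'acc' plays the role of the value travelling down the cascade port.
    Matrix acc(dimA, std::vector<long long>(dimB, 0));
    for (std::size_t k = 0; k < cascLen; ++k) {             // kernel k in the chain
        for (std::size_t i = 0; i < dimA; ++i) {
            for (std::size_t j = 0; j < dimB; ++j) {
                for (std::size_t n = k * slice; n < (k + 1) * slice; ++n) {
                    acc[i][j] += a[i][n] * b[n][j];         // partial product for this slice
                }
            }
        }
    }
    return acc;  // what the final kernel in the chain would output
}
```

Because addition is associative, the result is the same for any valid cascade length; only the distribution of work across kernels changes.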
The graph entry point is the following:
xf::dsp::aie::blas::matrix_mult::matrix_mult_graph
Find a full list of descriptions and parameters in the API Reference Overview.
Connections to the cascade ports can be made as follows:
for (int i = 0; i < P_CASC_LEN; i++) {
    connect<>(inA[i], mmultGraph.inA[i]);
    connect<>(inB[i], mmultGraph.inB[i]);
}
connect<>(mmultGraph.out, out);
Widgets¶
Widget API Cast¶
The DSPLib contains a Widget API Cast solution, which provides flexibility when connecting other kernels. This component can change a stream interface to a window interface and vice versa. It may be configured to read two input stream interfaces and interleave the data onto an output window interface. In addition, multiple copies of the output window may be configured to allow extra flexibility when connecting to further kernels.
Table 11: Widget API Cast Parameters
Name  Type  Description  Range 

TT_DATA  typename  Data Type  int16, cint16, int32, cint32, float, cfloat 
TP_IN_API  Unsigned int  The input interface type  0 = window, 1 = stream 
TP_OUT_API  Unsigned int  The output interface type  0 = window, 1 = stream
TP_NUM_INPUTS  Unsigned int  The number of input stream interfaces to be processed  1 to 2
TP_WINDOW_VSIZE  Unsigned int  The number of samples in the input window  Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. 
TP_NUM_OUTPUT_CLONES  Unsigned int  The number of output window ports to write the input data to.  1 to 4
Note
The number of lanes is the number of data elements that are processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., the number of bits in each element) and the register or bus width.
The graph entry point is the following:
xf::dsp::aie::widget::api_cast::widget_api_cast_graph
Widget Real to Complex¶
The DSPLib contains a Widget Real to Complex solution, which provides a utility to convert real data to complex or vice versa.
Table 12: Widget Real to Complex Parameters
Name  Type  Description  Range 

TT_DATA  typename  Input data type  int16, cint16, int32, cint32, float, cfloat
TT_OUT_DATA  typename  Output data type  int16, cint16, int32, cint32, float, cfloat
TP_WINDOW_VSIZE  Unsigned int  The number of samples in the input window  Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. 
Note
The number of lanes is the number of data elements that are processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., the number of bits in each element) and the register or bus width.
The graph entry point is the following:
xf::dsp::aie::widget::real2complex::widget_real2complex_graph
Compiling and Simulating Using the Makefile¶
A Makefile is included with each library element. It is located in the L2/tests/aie/<library_element> directory. Each Makefile holds default values for each of the library element parameters. These values can be edited as required to configure the library element for your needs.
Prerequisites:
source <yourVitisinstallpath>/lin64/Vitis/HEAD/settings64.csh
setenv PLATFORM_REPO_PATHS <yourplatformrepoinstallpath>
source <yourXRTinstallpath>/xbb/xrt/packages/xrt2.1.0centos/opt/xilinx/xrt/setup.csh
setenv DSPLIB_ROOT <yourVitislibrariesinstallpath/dsp>
Use the following command to compile and simulate the reference model with the x86sim target and the AIE graphs with the AIE emulation platform. The output of the reference model (logs/ref_output.txt) is verified against the output of the AIE graphs (logs/uut_output.txt).
make run
To overwrite the default parameters, add desired parameters as arguments to the make command, for example:
make run DATA_TYPE=cint16 SHIFT=16
For a list of all the configurable parameters, see the L2 Library Element Configuration Parameters.
List of all Makefile targets:
make all TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to generate the design for the specified Target and Shell.
make clean
    Command to remove the generated non-hardware files.
make cleanall
    Command to remove all the generated files.
make sd_card TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to prepare sd_card files. This target is only used for embedded devices.
make run TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to run the application in emulation or on board.
make build TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to build the xclbin application.
make host HOST_ARCH=<aarch64>
    Command to build the host application.
Note
For embedded devices such as the vck190, the environment variables SYSROOT, EDGE_COMMON_SW, and PERL need to be set first, and HOST_ARCH is either aarch32 or aarch64. For example:
export SYSROOT=<pathtoplatformsysroot>
export EDGE_COMMON_SW=<pathtorootfsandImagefiles>
export PERL=<pathtoperlinstallationlocation>
Simulation results and diff results are located in the L2/tests/aie/<library_element>/logs/status.txt file. To perform an x86 compilation/simulation, run:
make run TARGET=x86sim
It is also possible to randomly generate coefficient and input data, or to generate specific stimulus patterns such as ALL_ONES or IMPULSE, by setting the stimulus type. For example:
make run STIM_TYPE=4
L2 Library Element Unit Test¶
Each library element category is supplied with a test harness, which is an example of how to use the library element subgraph in the context of a supergraph. These test harnesses (graphs) can be found in the L2/tests/aie/<library_element>/test.hpp and L2/tests/aie/<library_element>/test.cpp files.
Each library element filter category also has a reference model, which is used by the test harness. The reference model graphs can be found in the L2/tests/aie/inc/<library_element>_ref_graph.hpp file.
Although it is recommended that only L2 (graph) library elements are instantiated directly in user code, the kernels underlying the graphs can be found in the L1/include/aie/<library_element>.hpp and L1/src/aie/<library_element>.cpp files.
An example of how a library element may be configured by a parent graph is provided in the L2/examples/fir_129t_sym folder. The example graph, test.h, instantiates the fir_sr_sym graph configured as a 129-tap filter. This example exposes the ports so that the parent graph can be used to replace an existing 129-tap symmetric filter point solution design.
L2 Library Element Configuration Parameters¶
L2 FIR configuration parameters¶
The list below consists of configurable parameters for FIR library elements with their default values.
Table 13: L2 FIR configuration parameters
Name  Type  Default  Description 

DATA_TYPE  typename  cint16  Data Type. 
COEFF_TYPE  typename  int16  Coefficient Type. 
FIR_LEN  unsigned  81  FIR length. 
SHIFT  unsigned  16  Acc results shift down value. 
ROUND_MODE  unsigned  0  Rounding mode. 
INPUT_WINDOW_VSIZE  unsigned  512  Input window size. 
CASC_LEN  unsigned  1  Cascade length. 
INTERPOLATE_FACTOR  unsigned  1  Interpolation factor, see note below 
DECIMATE_FACTOR  unsigned  1  Decimation factor, see note below 
DUAL_IP  unsigned  0  Dual inputs used in symmetric FIRs, see note below 
NITER  unsigned  16  Number of iterations to execute. 
GEN_INPUT_DATA  bool  true  Generate input data samples. When true, generate stimulus data as defined in: DATA_STIM_TYPE. When false, use the input file defined in: INPUT_FILE 
GEN_COEFF_DATA  bool  true  Generate random coefficients. When true, generate stimulus data as defined in: COEFF_STIM_TYPE. When false, use the coefficient file defined in: COEFF_FILE 
DATA_STIM_TYPE  unsigned  0  Supported types: 0 = random, 3 = impulse, 4 = all ones, 5 = incrementing pattern, 6 = sym incrementing pattern, 8 = sine wave
COEFF_STIM_TYPE  unsigned  0  Supported types: 0 = random, 3 = impulse, 4 = all ones, 5 = incrementing pattern, 6 = sym incrementing pattern, 8 = sine wave
INPUT_FILE  string  data/input.txt  Input data samples file. Only used when GEN_INPUT_DATA=false. 
COEFF_FILE  string  data/coeff.txt  Coefficient data file. Only used when GEN_COEFF_DATA=false. 
Note
The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.
Note
Not all dsplib elements support all of the above configurable parameters. Unsupported parameters which are not used have no impact on execution, e.g., parameter INTERPOLATE_FACTOR is only supported by interpolation filters and will be ignored by other library elements.
L2 FFT configuration parameters¶
For the FFT/iFFT library element the list of configurable parameters and default values is presented below.
Table 14: L2 FFT configuration parameters
Name  Type  Default  Description 

DATA_TYPE  typename  cint16  Data Type. 
TWIDDLE_TYPE  typename  cint16  Twiddle Type. 
POINT_SIZE  unsigned  1024  FFT point size. 
SHIFT  unsigned  17  Acc results shift down value. 
FFT_NIFFT  unsigned  0  Forward (1) or reverse (0) transform. 
WINDOW_VSIZE  unsigned  1024  Input/Output window size. By default, set to: $(POINT_SIZE). 
CASC_LEN  unsigned  1  Cascade length. 
DYN_PT_SIZE  unsigned  0  Enable (1) Dynamic Point size feature. 
NITER  unsigned  4  Number of iterations to execute. 
GEN_INPUT_DATA  bool  true  Generate random input data samples. When false, use the input file defined in: INPUT_FILE 
STIM_TYPE  unsigned  0  Supported types: 0 = random, 3 = impulse, 4 = all ones, 5 = incrementing pattern, 6 = sym incrementing pattern, 8 = sine wave
INPUT_FILE  string  data/input.txt  Input data samples file. Only used when GEN_INPUT_DATA=false. 
Note
The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.
L2 Matrix Multiply Configuration Parameters¶
For the Matrix Multiply (GeMM) library element the list of configurable parameters and default values is presented below.
Table 15: L2 Matrix Multiply configuration parameters
Name  Type  Default  Description 

T_DATA_A  typename  cint16  Input A Data Type. 
T_DATA_B  typename  cint16  Input B Data Type. 
P_DIM_A  unsigned  16  Input A Dimension 
P_DIM_AB  unsigned  16  Input AB Common Dimension. 
P_DIM_B  unsigned  16  Input B Dimension. 
SHIFT  unsigned  20  Acc results shift down value. 
ROUND_MODE  unsigned  0  Rounding mode. 
P_CASC_LEN  unsigned  1  Cascade length. 
P_DIM_A_LEADING  unsigned  0  ROW_MAJOR = 0, COL_MAJOR = 1
P_DIM_B_LEADING  unsigned  1  ROW_MAJOR = 0, COL_MAJOR = 1
P_DIM_OUT_LEADING  unsigned  0  ROW_MAJOR = 0, COL_MAJOR = 1
P_ADD_TILING_A  unsigned  1  0 = no additional tiling kernel, 1 = add additional tiling kernel
P_ADD_TILING_B  unsigned  1  0 = no additional tiling kernel, 1 = add additional tiling kernel
P_ADD_DETILING_OUT  unsigned  1  0 = no additional detiling kernel, 1 = add additional detiling kernel
NITER  unsigned  16  Number of iterations to execute. 
STIM_TYPE_A  unsigned  0  Supported types: 0 = random, 3 = impulse, 4 = all ones, 5 = incrementing pattern, 6 = sym incrementing pattern, 8 = sine wave
STIM_TYPE_B  unsigned  0  Supported types: 0 = random, 3 = impulse, 4 = all ones, 5 = incrementing pattern, 6 = sym incrementing pattern, 8 = sine wave
Note
The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.
L2 Widgets Configuration Parameters¶
For the Widgets library elements the list of configurable parameters and default values is presented below.
Table 16: L2 Widget API Cast Configuration Parameters
Name  Type  Default  Description 

DATA_TYPE  typename  cint16  Data Type. 
IN_API  unsigned  0  0 = window, 1 = stream 
OUT_API  unsigned  0  0 = window, 1 = stream 
NUM_INPUTS  unsigned  1  The number of input stream interfaces 
WINDOW_VSIZE  unsigned  256  Input/Output window size. 
NUM_OUTPUT_CLONES  unsigned  1  The number of output window port copies 
Table 17: L2 Widget Real to Complex Configuration Parameters
Name  Type  Default  Description 

DATA_TYPE  typename  cint16  Input Data Type.
DATA_OUT_TYPE  typename  cint16  Output Data Type.
WINDOW_VSIZE  unsigned  256  Input/Output window size. 
Note
The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.