DSP Library Functions

The Xilinx® digital signal processing library (DSPLib) is a configurable library of kernels that can be used to develop applications on Versal™ ACAP AI Engines. This is an Open Source library for DSP applications. Kernels are coded in C++ and contain special functions called intrinsics that give access to AI Engine vector processing capabilities. Kernels can be combined to construct graphs for developing complex designs. An example design is provided with this library for your reference. Each kernel has a corresponding graph. It is highly recommended to use the library element’s graph as the entry-point. See the Using the Examples for more details.

Filters

The DSPLib contains several variants of Finite Impulse Response (FIR) filters. On the AI Engine processor, data is packetized into windows. In the case of FIRs, each window is extended by a margin so that the state of the filter at the end of the previous window may be restored before new computations begin. Therefore, to maximize performance, the window size should be set to the maximum that the system will allow, though this will lead to a corresponding increase in latency. However, this is a complex decision as multiple factors such as data movement and latency need to be taken into consideration.

Note

With a small window size (for example, 32), you pay a high penalty on the function call overhead. This means that the pre/post amble will be major cycle consumer in your function call.

FIR filters have been categorized into classes and placed in a distinct namespace scope: xf::dsp::aie::fir, to prevent name collision in the global scope. Namespace aliasing can be utilized to shorten instantiations:

namespace dsplib = xf::dsp::aie;

Additionally, each FIR filter has been placed in a unique FIR type namespace. The available FIR filter classes and the corresponding graph entry point are listed below:

Table 1: FIR Filter Classes

Function Namespace
Single rate, asymmetrical dsplib::fir::sr_asym::fir_sr_asym_graph
Single rate, symmetrical dsplib::fir::sr_sym::fir_sr_sym_graph
Interpolation asymmetrical dsplib::fir::interpolate_asym::fir_interpolate_asym_graph
Decimation, halfband dsplib::fir::decimate_hb::fir_decimate_hb_graph
Interpolation, halfband dsplib::fir::interpolate_hb::fir_interpolate_hb_graph
Decimation, asymmetric dsplib::fir::decimate_asym::fir_decimate_asym_graph
Interpolation, fractional, asymmetric dsplib::fir::interpolate_fract_asym:: fir_interpolate_fract_asym_graph
Decimation, symmetric dsplib::fir::decimate_sym::fir_decimate_sym_graph

Conventions for Filters

All FIR filters can be configured for various types of data and coefficients. These types can be int16, int32, or float, and also real or complex. However, configurations with real data versus complex coefficients are not supported nor are configurations where the coefficients are int32 and data is int16. Data and coefficients must both be integer types or both be float types, as mixes are not supported.

The following table lists the supported combinations of data type and coefficient type.

Table 2: Supported Combinations of Data Type and Coefficient Type

Data Type
    Int16 Cint16 Int32 Cint32 Float Cfloat
Coefficient type Int16 Supported Supported Supported Supported 3 3
  Cint16 1 Supported 1 Supported 3 3
  Int32 2 2 Supported Supported 3 3
  Cint32 1, 2 2 1 Supported 3 3
  Float 3 3 3 3 Supported Supported
  Cfloat 3 3 3 3 3 Supported
  1. Complex coefficients are not supported for real-only data types.
  2. Coefficient type of higher precision than data type is not supported.
  3. A mix of float and integer types is not supported.

For all filters, the coefficient values are passed, not as template parameters, but as an array argument to the constructor for non-reloadable configurations, or to the reload function for reloadable configurations. In the case of symmetrical filters, only the first half (plus any odd centre tap) need be passed, as the remainder may be derived by symmetry. For halfband filters, only the non-zero coefficients should be entered, so the length of the array expected will be the (TP_FIR_LEN+1)/4 + 1 for the centre tap.

The following table lists parameters supported by all the FIR filters:

Table 3: Parameters Supported by FIR Filters

Parameter Name Type Description Range
TP_FIR_LEN unsigned The number of taps 4 to 240
TP_RND unsigned int Round mode

0 = truncate or floor

1 = ceiling (round up)

2 = positive infinity

3 = negative infinity

4 = symmetrical to infinity 5 = symmetrical to zero

6 = convergent to even

7 = convergent to odd

TP_SHIFT unsigned int The number of bits to shift accumulation down by before output. 0 to 61
TT_DATA typename Data Type int16, cint16, int32, cint32, float, cfloat
TT_COEFF typename Coefficient type int16, cint16, int32, cint32, float, cfloat
TP_INPUT_WINDOW_VSIZE unsigned int The number of samples in the input window. Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.
TP_CASC_LEN unsigned int The number of cascaded kernels to use for this FIR. 1 to 9. Defaults to 1 if not set.
TP_DUAL_IP unsigned int Use dual inputs (may increase throughput for symmetrical and halfband filters by avoiding load contention by using a second RAM bank for input). Range 0 (single input), 1 (dual input). Defaults to 0 if not set.
TP_USE_COEFF_RELOAD unsigned int Enable reloadable coefficient feature. An additional ‘coeff’ RTP port will appear on the graph. 0 (no reload), 1 (use reloads). Defaults to 0 if not set.
TP_NUM_OUTPUTS unsigned int Number of fir output ports >1

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

FFT/iFFT

The DSPLib contains one FFT/iFFT solution. This is a single channel, decimation in time (DIT) implementation with configurable point size, data type, and FFT/iFFT function.

Point size may be any power of 2 from 16 to 4096, but this upper limit will be reduced to 2048 for cint16 data type and 1024 for cfloat or cint32 data type where the FFT kernel uses ping-pong window input. The 4096 limit may only be achieved where the FFT receives and outputs data to/from kernels on the same processor.

Table 4: FFT Parameters

Name Type Description Range
TT_DATA Typename The input data type cint16, cint32, cfloat
TT_TWIDDLE Typename The twiddle factor type. Determined by TT_DATA Set to cint16 for data type of cint16 or cint32 and cfloat for data type of cfloat.
TP_POINT_SIZE Unsigned int The number of samples in a frame to be processed 2^N, where N is in the range 4 to 12, though the upper limit may be constrained by device resources.
TP_FFT_NIFFT Unsigned int Forward or reverse transform 0 (IFFT) or 1 (FFT).
TP_SHIFT Unsigned int The number of bits to shift accumulation down by before output. 0 to 61
TP_CASC_LEN Unsigned int The number of kernels the FFT will be divided over.

1 to 12. Defaults to 1 if not set.

Maximum is derived by the number of radix 2 stages required for the given point size (N where pointSize = 2^N)

For float data types the max is N. For integer data types the max is CEIL(N/2).

TP_DYN_PT_SIZE Unsigned int FFT point size 2^N, where N is 2 to 12
TP_WINDOW_VSIZE Unsigned int The number of samples in the input window. Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive memory usage.

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

This FFT implementation does not implement the 1/N scaling of an IFFT. Internally, for cint16 and cint32 data, an internal data type of cint32 is used. After each rank, the values are scaled by only enough to normalize the bit growth caused by the twiddle multiplication (i.e., 15 bits). Distortion caused by saturation will be possible for large point sizes and large values when the data type is cint32. In the final stage, the result is scaled by 17 bits for point size from 16 to 1024, by 18 for 2048, and by 19 for 4096.

No scaling is applied at any point when the data type is cfloat. The graph entry point is the following:

xf::dsp::aie::fft::fft_ifft_dit_1ch_graph

Matrix Multiply

The DSPLib contains one Matrix Multiply/GEMM (GEneral Matrix Multiply) solution. The gemm has two input ports connected to two windows of data. The inputs are denoted as Matrix A (inA) and Matrix B (inB). Matrix A has a template parameter TP_DIM_A to describe the number of rows of A. The number of columns of inA must be equal to the number of rows of inB. This is denoted with the template parameter TP_DIM_AB. The number of columns of B is denoted by TP_DIM_B.

An output port connects to a window, where the data for the output matrix will be stored. The output matrix will have rows = inA rows (TP_DIM_A) and columns = inB (TP_DIM_B) columns. The data type of both input matrices can be configured and the data type of the output is derived from the inputs.

Table 5: Matrix Multiply Parameters

Name Type Description Range
TT_DATA_A Typename The input data type int16, cint16, int32 cint32 float cfloat
TT_DATA_B Typename The input data type int16, cint16, int32 cint32 float cfloat
TP_DIM_A unsigned int The number of elements along the unique dimension (rows) of Matrix A  
TP_DIM_AB unsigned int The number of elements along the common dimension of Matrix A (columns) and Matrix B (rows)  
TP_DIM_B unsigned int The number of elements along the unique dimension (rows) of Matrix B  
TP_SHIFT unsigned int power of 2 shift down applied to the accumulation of product terms before each output In range 0 to 61
TP_RND unsigned int Round mode

0 = truncate or floor

1 = ceiling (round up)

2 = positive infinity

3 = negative infinity

4 = symmetrical to infinity 5 = symmetrical to zero

6 = convergent to even

7 = convergent to odd

TP_DIM_A_LEADING unsigned int The scheme in which the data should be stored in memory ROW_MAJOR = 0 COL_MAJOR = 1
TP_DIM_B_LEADING unsigned int The scheme in which the data should be stored in memory ROW_MAJOR = 0 COL_MAJOR = 1
TP_DIM_OUT_LEADING unsigned int The scheme in which the data should be stored in memory ROW_MAJOR = 0 COL_MAJOR = 1
TP_ADD_TILING_A unsigned int Option to add an additional kernel to rearrange matrix samples 0 = rearrange externally to the graph
TP_ADD_TILING_B unsigned int Option to add an additional kernel to rearrange matrix samples 0 = rearrange externally to the graph
TP_ADD_DETILING_OUT unsigned int Option to add an additional kernel to rearrange matrix samples 0 = rearrange externally to the graph
TP_WINDOW_VSIZE_A unsigned int The number of samples in the input window for Matrix A Must be of size TP_DIM_A* TP_DIM_AB*N has a default value of TP_DIM_A* TP_DIM_AB (N=1)
TP_WINDOW_VSIZE_B unsigned int The number of samples in the input window for Matrix B Must be of size TP_DIM_B* TP_DIM_AB*M has a default value of TP_DIM_B* TP_DIM_AB (M=1)
TP_CASC_LEN unsigned int The number of AIE tiles to split the operation into Defaults to 1 if not set.

Input matrices are processed in distinct blocks and matrix elements must be rearranged into a specific pattern.

The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x4 tiling pattern.

Note

Indices are quoted assuming a row major matrix. A column major matrix needs to be transposed.

Table 6: Matrix Multiply 4x4 tiling pattern

  Tile Col 0 Tile Col 1 Tile Col 2 Tile Col 3
Tile Row 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Tile Row 1 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
Tile Row 2 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
Tile Row 3 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

This is stored contigulously in memory like:

0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 64, 65, 66, 67, 80, 81, 82, 83, 96, 97, 98, 99, 112, 113, 114, 115, … , 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x2 tiling pattern.

Table 7: Matrix Multiply 4x2 tiling pattern

  Tile Col 0 Tile Col 1 Tile Col 2 Tile Col 3 Tile Col 4 Tile Col 5 Tile Col 6 Tile Col 7
Tile Row 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Tile Row 1 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
Tile Row 2 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
Tile Row 3 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

This is stored contigulously in memory like:

0, 1, 16, 17, 32, 33, 48, 49, 2, 3, 18, 19, 34, 35, 50, 51, …, 206, 207, 222, 223, 238, 239, 254, 255

Multiplying a 16x16 matrix (with 4x4 tiling) with a 16x16 matrix (with 4x2 tiling) will result in a 16x16 matrix with 4x2 tiling.

The following table specifies the tiling scheme used for a given data type combination and the corresponding output data type:

Table 8: Matrix Multiply tiling pattern combination

Input Type Combination Tiling Scheme Output Type
A B A B  
int16 int16 4x4 4x4 int16
int16 cint16 4x2 2x2 cint16
int16 int32 4x2 2x2 int32
int16 cint32 2x4 4x2 cint32
cint16 int16 4x4 4x2 cint16
cint16 cint16 4x4 4x2 cint16
cint16 int32 4x4 4x2 cint32
cint16 cint32 2x2 2x2 cint32
int32 int16 4x4 4x2 int32
int32 int32 4x4 4x2 int32
int32 cint16 4x4 4x2 cint32
int32 cint32 2x2 2x2 cint32
cint32 int16 2x4 4x2 cint32
cint32 cint16 2x2 2x2 cint32
cint32 int32 2x2 2x2 cint32
cint32 cint32 2x2 2x2 cint32
float float 4x4 4x2 float
float cfloat 2x4 4x2 cfloat
cfloat float 2x4 4x2 cfloat
cfloat cfloat 4x2 2x2 cfloat

The parameters TP_ADD_TILING_A, TP_ADD_TILING_B, and TP_ADD_DETILING_OUT control the inclusion of an additional pre-processing / post-processing kernel to perform the required data shuffling. When used with TP_DIM_A_LEADING, TP_DIM_B_LEADING, or TP_DIM_OUT_LEADING, the matrix is also transposed in the tiling kernel.

If the additional kernels are not selected, then the matrix multiply kernels assume incoming data is in the correct format, as specified above. When using the TP_CASC_LEN parameter, the matrix multiply operation is split across TP_DIM_AB and processed in a TP_CASC_LEN number of kernels. The accumulated partial results of each kernel is passed down the cascade port to the next kernel in the cascade chain until the final kernel provides the expected output. Cascade connections are made internally to the matrix multiply graph.

Each AI Engine kernel in the array is given a sub-matrix, so the interface to the graph is an array of ports for both A and B.

Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2):

Table 9: Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2)

  AIE 0 AIE 1
  Tile Col 0 Tile Col 1 Tile Col 2 Tile Col 3
Tile Row 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Tile Row 1 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
Tile Row 2 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
Tile Row 3 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2):

Table 10: Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2)

    Tile Col 0 Tile Col 1 Tile Col 2 Tile Col 3 Tile Col 4 Tile Col 5 Tile Col 6 Tile Col 7
AIE 0 Tile Row 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Tile Row 1 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
AIE 1 Tile Row 2 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
Tile Row 3 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

The graph entry point is the following:

xf::dsp::aie::blas::matrix_mult::matrix_mult_graph

Find a full list of descriptions and parameters in the API Reference Overview.

Connections to the cascade ports can be made as follows:

for (int i = 0 ; i < P_CASC_LEN; i++) {
    connect<>(inA[i], mmultGraph.inA[i]);
    connect<>(inB[i], mmultGraph.inB[i]);
}
connect<>(mmultGraph.out, out);

Widgets

Widget API Cast

The DSPLib contains a Widget API Cast solution, which provides flexibilty when connecting other kernels. This component is able to change the stream interface to window interface and vice-versa. It may be configured to read two input stream interfaces and interleave data onto an output window interface. In addition, multiple copies of output window may be configured to allow extra flexibility when connecting to further kernels.

Table 11: Widget API Cast Parameters

Name Type Description Range
TT_DATA typename Data Type int16, cint16, int32, cint32, float, cfloat
TP_IN_API Unsigned int The input interface type 0 = window, 1 = stream
TP_OUT_API Typename int The output interface type 0 = window, 1 = stream
TP_NUM_INPUTS Unsigned int The number of input stream interfaces to be processed 1 - 2
TP_WINDOW_VSIZE Unsigned int The number of samples in the input window Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.
TP_NUM_OUTPUT_CLONES Unsigned int The number of output window ports to write the input data to. 1 - 4

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

The graph entry point is the following:

xf::dsp::aie::widget::api_cast::widget_api_cast_graph

Widget Real to Complex

The DSPLib contains a Widget Real to Complex solution, which provides a utility to convert real data to complex or vice versa.

Table 12: Widget Real to Complex Parameters

Name Type Description Range
TT_DATA typename Data Type int16, cint16, int32, cint32, float, cfloat
TT_OUT_DATA typename Data Type int16, cint16, int32, cint32, float, cfloat
TP_WINDOW_VSIZE Unsigned int The number of samples in the input window Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

The graph entry point is the following:

xf::dsp::aie::widget::api_cast::widget_api_cast_graph

Compiling and Simulating Using the Makefile

A Makefile is included with each library element. It is located in the L2/tests/aie/<library_element> directory. Each Makefile holds default values for each of the library element parameters. These values can be edited as required to configure the library element for your needs.

Prerequisites:

source <your-Vitis-install-path>/lin64/Vitis/HEAD/settings64.csh
setenv PLATFORM_REPO_PATHS <your-platform-repo-install-path>
source <your-XRT-install-path>/xbb/xrt/packages/xrt-2.1.0-centos/opt/xilinx/xrt/setup.csh
setenv DSPLIB_ROOT <your-Vitis-libraries-install-path/dsp>

Use the following steps to compile, simulate the reference model with the x86sim target and the AIE graphs using AIE emulation plaftorm. The output of the reference model ( logs/ref_output.txt ) is verified against the output of the AIE graphs ( logs/uut_output.txt ).

make run

To overwrite the default parameters, add desired parameters as arguments to the make command, for example:

make run DATA_TYPE=cint16 SHIFT=16

For list of all the configurable parameters, see the L2 Library Element Configuration Parameters.

List of all Makefile targets:

make all TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to generate the design for specified Target and Shell.

make clean
    Command to remove the generated non-hardware files.

make cleanall
    Command to remove all the generated files.

make sd_card TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to prepare sd_card files.
    This target is only used in embedded device.

make run TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to run application in emulation or on board.

make build TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to build xclbin application.

make host HOST_ARCH=<aarch64>
    Command to build host application.

Note

For embedded devices like vck190, env variable SYSROOT, EDGE_COMMON_SW and PERL need to be set first, and HOST_ARCH is either aarch32 or aarch64. For example,

export SYSROOT=< path-to-platform-sysroot >
export EDGE_COMMON_SW=< path-to-rootfs-and-Image-files >
export PERL=<path-to-perl-installation-location >

Simulation results and diff results are located in the in L2/tests/aie/<library_element>/logs/status.txt file. To perform a x86 compilation/simulation, run

make run TARGET=x86sim.

It is also possible to randomly generate coefficient and input data, or to generate specific stimulus patterns like ALL_ONES, IMPULSE, etc. by running

make run STIM_TYPE=4.

L2 Library Element Unit Test

Each library element category comes supplied with a test harness which is an example of how to use the library element subgraph in the context of a super-graph. These test harnesses (graphs) can be found in the L2/tests/aie/<library_element>/test.hpp and L2/tests/aie/<library_element>/test.cpp file.

Each library element filter category also has a reference model which is used by the test harness. The reference models graphs are to be found in the L2/tests/aie/inc/<library_element>_ref_graph.hpp file.

Although it is recommended that only L2 (graphs) library elements are instantiated directly in user code, the kernels underlying the graphs can be found in the L1/include/aie/<library_element>.hpp and the L1/src/aie/<library_element>.cpp files.

An example of how a library element may be configured by a parent graph is provided in the L2/examples/fir_129t_sym folder. The example graph, test.h, in the L2/examples/fir_129t_sym folder instantiates the fir_sr_sym graph configured to be a 129-tap filter. This example exposes the ports such that the parent graph can be used to replace an existing 129-tap symmetric filter point solution design.

L2 Library Element Configuration Parameters

L2 FIR configuration parameters

The list below consists of configurable parameters for FIR library elements with their default values.

Table 13: L2 FIR configuration parameters

Name Type Default Description
DATA_TYPE typename cint16 Data Type.
COEFF_TYPE typename int16 Coefficient Type.
FIR_LEN unsigned 81 FIR length.
SHIFT unsigned 16 Acc results shift down value.
ROUND_MODE unsigned 0 Rounding mode.
INPUT_WINDOW_VSIZE unsigned 512 Input window size.
CASC_LEN unsigned 1 Cascade length.
INTERPOLATE_FACTOR unsigned 1 Interpolation factor, see note below
DECIMATE_FACTOR unsigned 1 Decimation factor, see note below
DUAL_IP unsigned 0 Dual inputs used in symmetric FIRs, see note below
NITER unsigned 16 Number of iterations to execute.
GEN_INPUT_DATA bool true Generate input data samples. When true, generate stimulus data as defined in: DATA_STIM_TYPE. When false, use the input file defined in: INPUT_FILE
GEN_COEFF_DATA bool true Generate random coefficients. When true, generate stimulus data as defined in: COEFF_STIM_TYPE. When false, use the coefficient file defined in: COEFF_FILE
DATA_STIM_TYPE unsigned 0 Supported types: 0 - random 3 - impulse 4 - all ones 5 - incrementing pattern 6 - sym incrementing pattern 8 - sine wave
COEFF_STIM_TYPE unsigned 0 Supported types: 0 - random 3 - impulse 4 - all ones 5 - incrementing pattern 6 - sym incrementing pattern 8 - sine wave
INPUT_FILE string data/input.txt Input data samples file. Only used when GEN_INPUT_DATA=false.
COEFF_FILE string data/coeff.txt Coefficient data file. Only used when GEN_COEFF_DATA=false.

Note

The above configurable parameters range may exceed a library element’s maximum supported range, in which case the compilation will end with a static_assert error informing about the exceeded range.

Note

Not all dsplib elements support all of the above configurable parameters. Unsupported parameters which are not used have no impact on execution, e.g., parameter INTERPOLATE_FACTOR is only supported by interpolation filters and will be ignored by other library elements.

L2 FFT configuration parameters

For the FFT/iFFT library element the list of configurable parameters and default values is presented below.

Table 14: L2 FFT configuration parameters

Name Type Default Description
DATA_TYPE typename cint16 Data Type.
TWIDDLE_TYPE typename cint16 Twiddle Type.
POINT_SIZE unsigned 1024 FFT point size.
SHIFT unsigned 17 Acc results shift down value.
FFT_NIFFT unsigned 0 Forward (1) or reverse (0) transform.
WINDOW_VSIZE unsigned 1024 Input/Output window size. By default, set to: $(POINT_SIZE).
CASC_LEN unsigned 1 Cascade length.
DYN_PT_SIZE unsigned 0 Enable (1) Dynamic Point size feature.
NITER unsigned 4 Number of iterations to execute.
GEN_INPUT_DATA bool true Generate random input data samples. When false, use the input file defined in: INPUT_FILE
STIM_TYPE unsigned 0 Supported types: 0 - random 3 - impulse 4 - all ones 5 - incrementing pattern 6 - sym incrementing pattern 8 - sine wave
INPUT_FILE string data/input.txt Input data samples file. Only used when GEN_INPUT_DATA=false.

Note

The above configurable parameters range may exceed a library element’s maximum supported range, in which case the compilation will end with a static_assert error informing about the exceeded range.

L2 Matrix Multiply Configuration Parameters

For the Matrix Multiply (GeMM) library element the list of configurable parameters and default values is presented below.

Table 15: L2 Matrix Multiply configuration parameters

Name Type Default Description
T_DATA_A typename cint16 Input A Data Type.
T_DATA_B typename cint16 Input B Data Type.
P_DIM_A unsigned 16 Input A Dimension
P_DIM_AB unsigned 16 Input AB Common Dimension.
P_DIM_B unsigned 16 Input B Dimension.
SHIFT unsigned 20 Acc results shift down value.
ROUND_MODE unsigned 0 Rounding mode.
P_CASC_LEN unsigned 1 Cascade length.
P_DIM_A_LEADING unsigned 0 ROW_MAJOR = 0 COL_MAJOR = 1
P_DIM_B_LEADING unsigned 1 ROW_MAJOR = 0 COL_MAJOR = 1
P_DIM_OUT_LEADING unsigned 0 ROW_MAJOR = 0 COL_MAJOR = 1
P_ADD_TILING_A unsigned 1 no additional tiling kernel = 0 add additional tiling kernel = 1
P_ADD_TILING_B unsigned 1 no additional tiling kernel = 0 add additional tiling kernel = 1
P_ADD_DETILING_OUT unsigned 1 no additional detiling kernel = 0 add additional detiling kernel = 1
NITER unsigned 16 Number of iterations to execute.
STIM_TYPE_A unsigned 0 Supported types: 0 - random 3 - impulse 4 - all ones 5 - incrementing pattern 6 - sym incrementing pattern 8 - sine wave
STIM_TYPE_B unsigned 0 Supported types: 0 - random 3 - impulse 4 - all ones 5 - incrementing pattern 6 - sym incrementing pattern 8 - sine wave

Note

The above configurable parameters range may exceed a library element’s maximum supported range, in which case the compilation will end with a static_assert error informing about the exceeded range.

L2 Widgets Configuration Parameters

For the Widgets library elements the list of configurable parameters and default values is presented below.

Table 16: L2 Widget API Casts Configuration Parameters

Name Type Default Description
DATA_TYPE typename cint16 Data Type.
IN_API unsigned 0 0 = window, 1 = stream
OUT_API unsigned 0 0 = window, 1 = stream
NUM_INPUTS unsigned 1 The number of input stream interfaces
WINDOW_VSIZE unsigned 256 Input/Output window size.
NUM_OUTPUT_CLONES unsigned 1 The number of output window port copies

Table 17: L2 Widget Real to Complex Configuration Parameters

Name Type Default Description
DATA_TYPE typename cint16 Data Type.
DATA_OUT_TYPE typename cint16 Data Type.
WINDOW_VSIZE unsigned 256 Input/Output window size.

Note

The above configurable parameters range may exceed a library element’s maximum supported range, in which case the compilation will end with a static_assert error informing about the exceeded range.