DSP Library Functions¶
The Xilinx® digital signal processing library (DSPLib) is a configurable library of kernels that can be used to develop applications on Versal™ ACAP AI Engines. This is an Open Source library for DSP applications. Kernels are coded in C++ and contain special functions called intrinsics that give access to AI Engine vector processing capabilities. Kernels can be combined to construct graphs for developing complex designs. An example design is provided with this library for your reference. Each kernel has a corresponding graph. It is highly recommended to use the library element’s graph as the entry point. See Using the Examples for more details.
Filters¶
The DSPLib contains several variants of Finite Impulse Response (FIR) filters. On the AI Engine processor, data is packetized into windows. In the case of FIRs, each window is extended by a margin so that the state of the filter at the end of the previous window may be restored before new computations begin. Therefore, to maximize performance, the window size should be set to the maximum that the system will allow, though this will lead to a corresponding increase in latency. However, this is a complex decision as multiple factors such as data movement and latency need to be taken into consideration.
Note
With a small window size (for example, 32), you pay a high penalty on the function call overhead. This means that the pre/post amble will be a major cycle consumer in your function call.
FIR filters have been categorized into classes and placed in a distinct namespace scope: xf::dsp::aie::fir, to prevent name collision in the global scope. Namespace aliasing can be utilized to shorten instantiations:
namespace dsplib = xf::dsp::aie;
Additionally, each FIR filter has been placed in a unique FIR type namespace. The available FIR filter classes and the corresponding graph entry point are listed below:
Table 1: FIR Filter Classes
Function | Namespace |
---|---|
Single rate, asymmetrical | dsplib::fir::sr_asym::fir_sr_asym_graph |
Single rate, symmetrical | dsplib::fir::sr_sym::fir_sr_sym_graph |
Interpolation asymmetrical | dsplib::fir::interpolate_asym::fir_interpolate_asym_graph |
Decimation, halfband | dsplib::fir::decimate_hb::fir_decimate_hb_graph |
Interpolation, halfband | dsplib::fir::interpolate_hb::fir_interpolate_hb_graph |
Decimation, asymmetric | dsplib::fir::decimate_asym::fir_decimate_asym_graph |
Interpolation, fractional, asymmetric | dsplib::fir::interpolate_fract_asym::fir_interpolate_fract_asym_graph |
Decimation, symmetric | dsplib::fir::decimate_sym::fir_decimate_sym_graph |
Conventions for Filters¶
All FIR filters can be configured for various types of data and coefficients. These types can be int16, int32, or float, and also real or complex. However, configurations with real data versus complex coefficients are not supported nor are configurations where the coefficients are int32 and data is int16. Data and coefficients must both be integer types or both be float types, as mixes are not supported.
The following table lists the supported combinations of data type and coefficient type.
Table 2: Supported Combinations of Data Type and Coefficient Type
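Coefficient Type \ Data Type | Int16 | Cint16 | Int32 | Cint32 | Float | Cfloat |
---|---|---|---|---|---|---|
Int16 | Supported | Supported | Supported | Supported | 3 | 3 |
Cint16 | 1 | Supported | 1 | Supported | 3 | 3 |
Int32 | 2 | 2 | Supported | Supported | 3 | 3 |
Cint32 | 1, 2 | 2 | 1 | Supported | 3 | 3 |
Float | 3 | 3 | 3 | 3 | Supported | Supported |
Cfloat | 3 | 3 | 3 | 3 | 3 | Supported |
A number marks an unsupported combination. The numbers appear to refer to the restrictions described above: 1 = complex coefficients with real data, 2 = 32-bit integer coefficients with 16-bit integer data, 3 = a mix of integer and float types.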
For all filters, the coefficient values are passed not as template parameters but as an array argument to the constructor for non-reloadable configurations, or to the reload function for reloadable configurations. In the case of symmetrical filters, only the first half (plus any odd centre tap) need be passed, as the remainder may be derived by symmetry. For halfband filters, only the non-zero coefficients should be entered, so the expected array length is (TP_FIR_LEN+1)/4, plus 1 for the centre tap.
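As a rough illustration of the array lengths described above, the following sketch computes how many coefficients would be passed for symmetric and halfband filters. The helper names `sym_coeff_count` and `hb_coeff_count` are hypothetical and not part of DSPLib; only the formulas come from the text.

```cpp
#include <cstddef>

// Hypothetical helpers (not DSPLib API): number of coefficients to pass
// for a given TP_FIR_LEN, following the rules stated in the text.

// Symmetric filter: the first half, plus the centre tap when the length is odd.
constexpr std::size_t sym_coeff_count(std::size_t fir_len) {
    return (fir_len + 1) / 2;
}

// Halfband filter: the non-zero taps of the first half, plus the centre tap.
constexpr std::size_t hb_coeff_count(std::size_t fir_len) {
    return (fir_len + 1) / 4 + 1;
}

// A 129-tap symmetric filter needs 64 mirrored taps and the centre tap.
static_assert(sym_coeff_count(129) == 65, "129-tap symmetric filter");
// An even-length symmetric filter needs exactly half of its taps.
static_assert(sym_coeff_count(8) == 4, "8-tap symmetric filter");
// An 11-tap halfband filter h0,0,h2,0,h4,h5,... needs h0, h2, h4 and centre h5.
static_assert(hb_coeff_count(11) == 4, "11-tap halfband filter");
```

The `static_assert` lines document the expected counts and fail at compile time if the formulas are changed incorrectly.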
The following table lists parameters supported by all the FIR filters:
Table 3: Parameters Supported by FIR Filters
Parameter Name | Type | Description | Range |
---|---|---|---|
TP_FIR_LEN | unsigned | The number of taps | 4 to 240 |
TP_RND | unsigned int | Round mode | 0 = truncate or floor; 1 = ceiling (round up); 2 = positive infinity; 3 = negative infinity; 4 = symmetrical to infinity; 5 = symmetrical to zero; 6 = convergent to even; 7 = convergent to odd |
TP_SHIFT | unsigned int | The number of bits to shift accumulation down by before output. | 0 to 61 |
TT_DATA | typename | Data Type | int16, cint16, int32, cint32, float, cfloat |
TT_COEFF | typename | Coefficient type | int16, cint16, int32, cint32, float, cfloat |
TP_INPUT_WINDOW_VSIZE | unsigned int | The number of samples in the input window. | Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. |
TP_CASC_LEN | unsigned int | The number of cascaded kernels to use for this FIR. | 1 to 9. Defaults to 1 if not set. |
TP_DUAL_IP | unsigned int | Use dual input ports. | Range 0 (single input), 1 (dual input). Defaults to 0 if not set. |
TP_USE_COEFF_RELOAD | unsigned int | Enable reloadable coefficient feature. An additional ‘coeff’ RTP port will appear on the graph. | 0 (no reload), 1 (use reloads). Defaults to 0 if not set. |
TP_NUM_OUTPUTS | unsigned int | Number of FIR output ports | >1 |
Note
The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
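The ranges in Table 3 can be checked before a filter is configured. The sketch below is a standalone illustration, assuming a hypothetical helper named `fir_params_valid`; it is not part of DSPLib, and the lane count is passed in by the caller because it depends on the data type.

```cpp
// Hypothetical compile-time check (not DSPLib API) mirroring the ranges
// listed in Table 3. `lanes` is supplied by the caller (typically 4 or 8).
constexpr bool fir_params_valid(unsigned fir_len, unsigned rnd, unsigned shift,
                                unsigned window_vsize, unsigned lanes,
                                unsigned casc_len = 1) {
    return fir_len >= 4 && fir_len <= 240               // TP_FIR_LEN: 4 to 240
        && rnd <= 7                                     // TP_RND: modes 0 to 7
        && shift <= 61                                  // TP_SHIFT: 0 to 61
        && lanes != 0 && window_vsize % lanes == 0      // window is a multiple of lanes
        && casc_len >= 1 && casc_len <= 9;              // TP_CASC_LEN: 1 to 9
}

// Example values chosen for illustration only.
static_assert(fir_params_valid(81, 0, 16, 512, 8), "valid configuration");
static_assert(!fir_params_valid(300, 0, 16, 512, 8), "too many taps");
static_assert(!fir_params_valid(81, 0, 16, 510, 8), "window not a multiple of lanes");
```

The library itself reports out-of-range values through its own `static_assert` messages; a helper like this only makes the constraints visible in user code.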
FFT/iFFT¶
The DSPLib contains one FFT/iFFT solution. This is a single channel, decimation in time (DIT) implementation with configurable point size, data type, and FFT/iFFT function.
Point size may be any power of 2 from 16 to 4096, but this upper limit will be reduced to 2048 for cint16 data type and 1024 for cfloat or cint32 data type where the FFT kernel uses ping-pong window input. The 4096 limit may only be achieved where the FFT receives and outputs data to/from kernels on the same processor.
Table 4: FFT Parameters
Name | Type | Description | Range |
---|---|---|---|
TT_DATA | Typename | The input data type | cint16, cint32, cfloat |
TT_TWIDDLE | Typename | The twiddle factor type. Determined by TT_DATA | Set to cint16 for data type of cint16 or cint32 and cfloat for data type of cfloat. |
TP_POINT_SIZE | Unsigned int | The number of samples in a frame to be processed | 2^N, where N is in the range 4 to 12, though the upper limit may be constrained by device resources. |
TP_FFT_NIFFT | Unsigned int | Forward or reverse transform | 0 (IFFT) or 1 (FFT). |
TP_SHIFT | Unsigned int | The number of bits to shift accumulation down by before output. | 0 to 61 |
TP_CASC_LEN | Unsigned int | The number of kernels the FFT will be divided over. | 1 to 12. Defaults to 1 if not set. The maximum is derived from the number of radix-2 stages required for the given point size (N, where pointSize = 2^N). For float data types the maximum is N. For integer data types the maximum is CEIL(N/2). |
TP_DYN_PT_SIZE | Unsigned int | FFT point size | 2^N, where N is 2 to 12 |
TP_WINDOW_VSIZE | Unsigned int | The number of samples in the input window. | Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive memory usage. |
Note
The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
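The TP_CASC_LEN limit in Table 4 can be written as a small calculation. The following sketch assumes hypothetical helper names (`log2_exact`, `valid_point_size`, `max_casc_len`) that are not part of DSPLib; it only restates the rule that the maximum is N for float data and CEIL(N/2) for integer data.

```cpp
// Hypothetical helpers (not DSPLib API).

// log2 of an exact power of two, e.g. log2_exact(1024) == 10.
constexpr unsigned log2_exact(unsigned n) {
    return n <= 1 ? 0 : 1 + log2_exact(n / 2);
}

// Point size must be a power of 2 from 16 to 4096.
constexpr bool valid_point_size(unsigned n) {
    return n >= 16 && n <= 4096 && (n & (n - 1)) == 0;
}

// Maximum cascade length: N for float data, CEIL(N/2) for integer data,
// where point_size = 2^N.
constexpr unsigned max_casc_len(unsigned point_size, bool is_float) {
    return is_float ? log2_exact(point_size)
                    : (log2_exact(point_size) + 1) / 2;
}

static_assert(valid_point_size(1024), "1024 is a supported point size");
static_assert(!valid_point_size(1000), "1000 is not a power of 2");
static_assert(max_casc_len(1024, false) == 5, "integer data, N = 10");
static_assert(max_casc_len(2048, false) == 6, "integer data, N = 11");
static_assert(max_casc_len(4096, true) == 12, "float data, N = 12");
```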
This FFT implementation does not implement the 1/N scaling of an IFFT. For cint16 and cint32 data, an internal data type of cint32 is used. After each rank, the values are scaled only enough to normalize the bit growth caused by the twiddle multiplication (i.e., 15 bits). Distortion caused by saturation is possible for large point sizes and large values when the data type is cint32. In the final stage, the result is scaled by 17 bits for point sizes from 16 to 1024, by 18 bits for 2048, and by 19 bits for 4096.
No scaling is applied at any point when the data type is cfloat. The graph entry point is the following:
xf::dsp::aie::fft::fft_ifft_dit_1ch_graph
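Because the 1/N scaling of the IFFT is not applied, a forward transform followed by an inverse transform returns the input multiplied by the point size (ignoring any TP_SHIFT scaling). The sketch below shows this with a naive floating-point DFT; it is a mathematical illustration only, does not use the library, and the function name `dft_unscaled` is hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT with no 1/N scaling in either direction.
// forward = true uses exp(-j*2*pi*k*t/N); forward = false uses exp(+j*2*pi*k*t/N).
std::vector<std::complex<double>> dft_unscaled(
        const std::vector<std::complex<double>>& x, bool forward) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    const double sign = forward ? -1.0 : 1.0;
    std::vector<std::complex<double>> y(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>(k * t) / static_cast<double>(n);
            acc += x[t] * std::polar(1.0, angle);
        }
        y[k] = acc;
    }
    return y;
}

// Applies the 1/N factor that an unscaled inverse transform leaves out.
void scale_by_inverse_n(std::vector<std::complex<double>>& y) {
    const double n = static_cast<double>(y.size());
    for (std::complex<double>& v : y) {
        v /= n;
    }
}
```

Calling `dft_unscaled(dft_unscaled(x, true), false)` yields N times `x`; `scale_by_inverse_n` restores the original values. With the integer data types, an equivalent correction would have to be folded into the shift settings or applied downstream.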
Matrix Multiply¶
The DSPLib contains one Matrix Multiply/GEMM (GEneral Matrix Multiply) solution. The gemm has two input ports connected to two windows of data. The inputs are denoted as Matrix A (inA) and Matrix B (inB). Matrix A has a template parameter TP_DIM_A to describe the number of rows of A. The number of columns of inA must be equal to the number of rows of inB. This is denoted with the template parameter TP_DIM_AB. The number of columns of B is denoted by TP_DIM_B.
An output port connects to a window, where the data for the output matrix will be stored. The output matrix will have rows = inA rows (TP_DIM_A) and columns = inB (TP_DIM_B) columns. The data type of both input matrices can be configured and the data type of the output is derived from the inputs.
Table 5: Matrix Multiply Parameters
Name | Type | Description | Range |
---|---|---|---|
TT_DATA_A | Typename | The input data type of Matrix A | int16, cint16, int32, cint32, float, cfloat |
TT_DATA_B | Typename | The input data type of Matrix B | int16, cint16, int32, cint32, float, cfloat |
TP_DIM_A | unsigned int | The number of elements along the unique dimension (rows) of Matrix A | |
TP_DIM_AB | unsigned int | The number of elements along the common dimension of Matrix A (columns) and Matrix B (rows) | |
TP_DIM_B | unsigned int | The number of elements along the unique dimension (columns) of Matrix B | |
TP_SHIFT | unsigned int | power of 2 shift down applied to the accumulation of product terms before each output | In range 0 to 61 |
TP_RND | unsigned int | Round mode | 0 = truncate or floor; 1 = ceiling (round up); 2 = positive infinity; 3 = negative infinity; 4 = symmetrical to infinity; 5 = symmetrical to zero; 6 = convergent to even; 7 = convergent to odd |
TP_DIM_A_LEADING | unsigned int | The scheme in which the data should be stored in memory | ROW_MAJOR = 0; COL_MAJOR = 1 |
TP_DIM_B_LEADING | unsigned int | The scheme in which the data should be stored in memory | ROW_MAJOR = 0; COL_MAJOR = 1 |
TP_DIM_OUT_LEADING | unsigned int | The scheme in which the data should be stored in memory | ROW_MAJOR = 0; COL_MAJOR = 1 |
TP_ADD_TILING_A | unsigned int | Option to add an additional kernel to rearrange matrix samples | 0 = rearrange externally to the graph |
TP_ADD_TILING_B | unsigned int | Option to add an additional kernel to rearrange matrix samples | 0 = rearrange externally to the graph |
TP_ADD_DETILING_OUT | unsigned int | Option to add an additional kernel to rearrange matrix samples | 0 = rearrange externally to the graph |
TP_WINDOW_VSIZE_A | unsigned int | The number of samples in the input window for Matrix A | Must be of size TP_DIM_A*TP_DIM_AB*N. The default value is TP_DIM_A*TP_DIM_AB (N=1). |
TP_WINDOW_VSIZE_B | unsigned int | The number of samples in the input window for Matrix B | Must be of size TP_DIM_B*TP_DIM_AB*M. The default value is TP_DIM_B*TP_DIM_AB (M=1). |
TP_CASC_LEN | unsigned int | The number of AIE tiles to split the operation into | Defaults to 1 if not set. |
Input matrices are processed in distinct blocks and matrix elements must be rearranged into a specific pattern.
The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x4 tiling pattern.
Note
Indices are quoted assuming a row major matrix. A column major matrix needs to be transposed.
Table 6: Matrix Multiply 4x4 tiling pattern
Tile Col 0 | Tile Col 1 | Tile Col 2 | Tile Col 3 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tile Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | |
Tile Row 1 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | |
96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | |
112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | |
Tile Row 2 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | |
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | |
176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | |
Tile Row 3 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 |
208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | |
224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | |
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
This is stored contiguously in memory like:
0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 64, 65, 66, 67, 80, 81, 82, 83, 96, 97, 98, 99, 112, 113, 114, 115, … , 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255
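The rearrangement shown above can be reproduced with a short host-side routine. This is a minimal sketch, assuming a row-major input whose dimensions are exact multiples of the tile dimensions; the names `tile_matrix` and `make_index_matrix` are hypothetical and not part of DSPLib.

```cpp
#include <cstddef>
#include <vector>

// Rearranges a row-major rows x cols matrix into tile_rows x tile_cols tiles.
// Tiles are emitted left to right, then top to bottom; the elements inside
// each tile are emitted in row-major order.
// Assumes rows % tile_rows == 0 and cols % tile_cols == 0.
template <typename T>
std::vector<T> tile_matrix(const std::vector<T>& in, std::size_t rows, std::size_t cols,
                           std::size_t tile_rows, std::size_t tile_cols) {
    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += tile_rows) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile_cols) {
            for (std::size_t r = 0; r < tile_rows; ++r) {
                for (std::size_t c = 0; c < tile_cols; ++c) {
                    out.push_back(in[(r0 + r) * cols + (c0 + c)]);
                }
            }
        }
    }
    return out;
}

// Builds a matrix whose element values equal their row-major index,
// matching the numbering used in the tiling tables.
std::vector<int> make_index_matrix(std::size_t rows, std::size_t cols) {
    std::vector<int> m(rows * cols);
    for (std::size_t i = 0; i < m.size(); ++i) {
        m[i] = static_cast<int>(i);
    }
    return m;
}
```

Calling `tile_matrix(make_index_matrix(16, 16), 16, 16, 4, 4)` gives a sequence starting 0, 1, 2, 3, 16, 17, 18, 19 and ending 252, 253, 254, 255. Other tile shapes are obtained by changing the last two arguments.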
The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x2 tiling pattern.
Table 7: Matrix Multiply 4x2 tiling pattern
Tile Col 0 | Tile Col 1 | Tile Col 2 | Tile Col 3 | Tile Col 4 | Tile Col 5 | Tile Col 6 | Tile Col 7 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tile Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | |
Tile Row 1 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | |
96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | |
112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | |
Tile Row 2 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | |
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | |
176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | |
Tile Row 3 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 |
208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | |
224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | |
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
This is stored contiguously in memory like:
0, 1, 16, 17, 32, 33, 48, 49, 2, 3, 18, 19, 34, 35, 50, 51, …, 206, 207, 222, 223, 238, 239, 254, 255
Multiplying a 16x16 matrix (with 4x4 tiling) with a 16x16 matrix (with 4x2 tiling) will result in a 16x16 matrix with 4x2 tiling.
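One way to see why the output inherits the tile rows of A and the tile columns of B is to multiply the tiled buffers block by block. The sketch below is a plain reference model, not the AI Engine kernel; `Tiling`, `tile_offset`, and `matmul_tiled` are hypothetical names, and both inputs are assumed to be already stored tile by tile.

```cpp
#include <cstddef>
#include <vector>

// Describes a matrix stored tile by tile (tiles left to right, then top to
// bottom; elements inside a tile in row-major order).
struct Tiling {
    std::size_t rows, cols;            // full matrix dimensions
    std::size_t tile_rows, tile_cols;  // tile dimensions
};

// Offset of the first element of tile (tile_r, tile_c) in the tiled buffer.
std::size_t tile_offset(const Tiling& t, std::size_t tile_r, std::size_t tile_c) {
    const std::size_t tiles_per_row = t.cols / t.tile_cols;
    return (tile_r * tiles_per_row + tile_c) * t.tile_rows * t.tile_cols;
}

// Multiplies tiled A by tiled B. Requires ta.cols == tb.rows and
// ta.tile_cols == tb.tile_rows. Each A tile times B tile product is a
// ta.tile_rows x tb.tile_cols block, which becomes the output tile shape.
std::vector<long long> matmul_tiled(const std::vector<long long>& a, const Tiling& ta,
                                    const std::vector<long long>& b, const Tiling& tb) {
    const Tiling tc{ta.rows, tb.cols, ta.tile_rows, tb.tile_cols};
    std::vector<long long> c(tc.rows * tc.cols, 0);
    for (std::size_t I = 0; I < ta.rows / ta.tile_rows; ++I) {
        for (std::size_t J = 0; J < tb.cols / tb.tile_cols; ++J) {
            for (std::size_t K = 0; K < ta.cols / ta.tile_cols; ++K) {
                const long long* at = &a[tile_offset(ta, I, K)];
                const long long* bt = &b[tile_offset(tb, K, J)];
                long long* ct = &c[tile_offset(tc, I, J)];
                for (std::size_t i = 0; i < ta.tile_rows; ++i)
                    for (std::size_t j = 0; j < tb.tile_cols; ++j)
                        for (std::size_t k = 0; k < ta.tile_cols; ++k)
                            ct[i * tb.tile_cols + j] +=
                                at[i * ta.tile_cols + k] * bt[k * tb.tile_cols + j];
            }
        }
    }
    return c;
}
```

With `ta = {16, 16, 4, 4}` and `tb = {16, 16, 4, 2}`, the returned buffer holds the 16x16 product in 4x2 tiles.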
The following table specifies the tiling scheme used for a given data type combination and the corresponding output data type:
Table 8: Matrix Multiply tiling pattern combination
Input Type A | Input Type B | Tiling Scheme A | Tiling Scheme B | Output Type |
---|---|---|---|---|
int16 | int16 | 4x4 | 4x4 | int16 |
int16 | cint16 | 4x2 | 2x2 | cint16 |
int16 | int32 | 4x2 | 2x2 | int32 |
int16 | cint32 | 2x4 | 4x2 | cint32 |
cint16 | int16 | 4x4 | 4x2 | cint16 |
cint16 | cint16 | 4x4 | 4x2 | cint16 |
cint16 | int32 | 4x4 | 4x2 | cint32 |
cint16 | cint32 | 2x2 | 2x2 | cint32 |
int32 | int16 | 4x4 | 4x2 | int32 |
int32 | int32 | 4x4 | 4x2 | int32 |
int32 | cint16 | 4x4 | 4x2 | cint32 |
int32 | cint32 | 2x2 | 2x2 | cint32 |
cint32 | int16 | 2x4 | 4x2 | cint32 |
cint32 | cint16 | 2x2 | 2x2 | cint32 |
cint32 | int32 | 2x2 | 2x2 | cint32 |
cint32 | cint32 | 2x2 | 2x2 | cint32 |
float | float | 4x4 | 4x2 | float |
float | cfloat | 2x4 | 4x2 | cfloat |
cfloat | float | 2x4 | 4x2 | cfloat |
cfloat | cfloat | 4x2 | 2x2 | cfloat |
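The output types in Table 8 follow a regular pattern: the output is complex if either input is complex, and 32-bit if either integer input is 32-bit. The sketch below encodes that observation; it is read off the table rather than taken from the library source, and the `DType` enum and `out_type` function are hypothetical names.

```cpp
// Hypothetical type tags (not DSPLib API) for the six supported data types.
enum class DType { I16, CI16, I32, CI32, F32, CF32 };

constexpr bool is_complex(DType t) {
    return t == DType::CI16 || t == DType::CI32 || t == DType::CF32;
}

constexpr bool is_float(DType t) {
    return t == DType::F32 || t == DType::CF32;
}

constexpr bool is_32bit_int(DType t) {
    return t == DType::I32 || t == DType::CI32;
}

// Output type for inputs a and b, as observed in Table 8.
// Assumes a and b are both integer types or both float types.
constexpr DType out_type(DType a, DType b) {
    return is_float(a)
        ? ((is_complex(a) || is_complex(b)) ? DType::CF32 : DType::F32)
        : ((is_32bit_int(a) || is_32bit_int(b))
            ? ((is_complex(a) || is_complex(b)) ? DType::CI32 : DType::I32)
            : ((is_complex(a) || is_complex(b)) ? DType::CI16 : DType::I16));
}

static_assert(out_type(DType::I16, DType::I16) == DType::I16, "int16 x int16");
static_assert(out_type(DType::CI16, DType::I32) == DType::CI32, "cint16 x int32");
static_assert(out_type(DType::F32, DType::CF32) == DType::CF32, "float x cfloat");
```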
The parameters TP_ADD_TILING_A, TP_ADD_TILING_B, and TP_ADD_DETILING_OUT control the inclusion of an additional pre-processing / post-processing kernel to perform the required data shuffling. When used with TP_DIM_A_LEADING, TP_DIM_B_LEADING, or TP_DIM_OUT_LEADING, the matrix is also transposed in the tiling kernel.
If the additional kernels are not selected, then the matrix multiply kernels assume incoming data is in the correct format, as specified above. When using the TP_CASC_LEN parameter, the matrix multiply operation is split across TP_DIM_AB and processed in a TP_CASC_LEN number of kernels. The accumulated partial results of each kernel are passed down the cascade port to the next kernel in the cascade chain until the final kernel provides the expected output. Cascade connections are made internally to the matrix multiply graph.
Each AI Engine kernel in the array is given a sub-matrix, so the interface to the graph is an array of ports for both A and B.
Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2):
Table 9: Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2)
AIE 0 | AIE 1 | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tile Col 0 | Tile Col 1 | Tile Col 2 | Tile Col 3 | |||||||||||||
Tile Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | |
Tile Row 1 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | |
96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | |
112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | |
Tile Row 2 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | |
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | |
176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | |
Tile Row 3 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 |
208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | |
224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | |
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2):
Table 10: Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2)
Tile Col 0 | Tile Col 1 | Tile Col 2 | Tile Col 3 | Tile Col 4 | Tile Col 5 | Tile Col 6 | Tile Col 7 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AIE 0 | Tile Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | ||
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | ||
48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | ||
Tile Row 1 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | ||
96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | ||
112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | ||
AIE 1 | Tile Row 2 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | ||
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | ||
176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | ||
Tile Row 3 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | |
208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | ||
224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | ||
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
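The split across TP_DIM_AB shown in Tables 9 and 10 can be modelled as a chain of partial products, each covering one slice of the common dimension and adding to the result passed in from the previous stage. This is a reference sketch on untiled row-major data, assuming the common dimension divides evenly by the cascade length; `partial_matmul` and `cascaded_matmul` are hypothetical names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Adds A(m x k) * B(k x n), restricted to the slice [k_begin, k_end) of the
// common dimension, onto the accumulator received from the previous stage.
std::vector<long long> partial_matmul(const std::vector<int>& a, const std::vector<int>& b,
                                      std::size_t m, std::size_t k, std::size_t n,
                                      std::size_t k_begin, std::size_t k_end,
                                      std::vector<long long> acc) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t kk = k_begin; kk < k_end; ++kk)
                acc[i * n + j] += static_cast<long long>(a[i * k + kk]) * b[kk * n + j];
    return acc;
}

// Models a cascade of casc_len stages. Stage s handles columns
// [s*slice, (s+1)*slice) of A and the matching rows of B.
// Assumes k % casc_len == 0.
std::vector<long long> cascaded_matmul(const std::vector<int>& a, const std::vector<int>& b,
                                       std::size_t m, std::size_t k, std::size_t n,
                                       std::size_t casc_len) {
    std::vector<long long> acc(m * n, 0);
    const std::size_t slice = k / casc_len;
    for (std::size_t stage = 0; stage < casc_len; ++stage) {
        acc = partial_matmul(a, b, m, k, n, stage * slice, (stage + 1) * slice,
                             std::move(acc));
    }
    return acc;
}
```

For the 16x16 example with a cascade length of 2, the first stage covers columns 0 to 7 of A and rows 0 to 7 of B, and the second stage covers the remainder; the final accumulator equals the full product.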
The graph entry point is the following:
xf::dsp::aie::blas::matrix_mult::matrix_mult_graph
Find a full list of descriptions and parameters in the API Reference Overview.
Connections to the cascade ports can be made as follows:
for (int i = 0; i < P_CASC_LEN; i++) {
    connect<>(inA[i], mmultGraph.inA[i]);
    connect<>(inB[i], mmultGraph.inB[i]);
}
connect<>(mmultGraph.out, out);
Widgets¶
Widget API Cast¶
The DSPLib contains a Widget API Cast solution, which provides flexibility when connecting other kernels. This component can convert a stream interface to a window interface and vice versa. It may be configured to read two input stream interfaces and interleave data onto an output window interface. In addition, multiple copies of the output window may be configured to allow extra flexibility when connecting to further kernels.
Table 11: Widget API Cast Parameters
Name | Type | Description | Range |
---|---|---|---|
TT_DATA | typename | Data Type | int16, cint16, int32, cint32, float, cfloat |
TP_IN_API | Unsigned int | The input interface type | 0 = window, 1 = stream |
TP_OUT_API | Unsigned int | The output interface type | 0 = window, 1 = stream |
TP_NUM_INPUTS | Unsigned int | The number of input stream interfaces to be processed | 1 - 2 |
TP_WINDOW_VSIZE | Unsigned int | The number of samples in the input window | Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. |
TP_NUM_OUTPUT_CLONES | Unsigned int | The number of output window ports to write the input data to. | 1 - 4 |
Note
The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
The graph entry point is the following:
xf::dsp::aie::widget::api_cast::widget_api_cast_graph
Widget Real to Complex¶
The DSPLib contains a Widget Real to Complex solution, which provides a utility to convert real data to complex or vice versa.
Table 12: Widget Real to Complex Parameters
Name | Type | Description | Range |
---|---|---|---|
TT_DATA | typename | Input data type | int16, cint16, int32, cint32, float, cfloat |
TT_OUT_DATA | typename | Output data type | int16, cint16, int32, cint32, float, cfloat |
TP_WINDOW_VSIZE | Unsigned int | The number of samples in the input window | Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. |
Note
The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
The graph entry point is the following:
xf::dsp::aie::widget::real2complex::widget_real2complex_graph
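The conversion itself can be pictured with a small host-side model. The text does not spell out the exact behaviour, so the sketch below assumes that real-to-complex sets the imaginary part to zero and that complex-to-real keeps only the real part; the function names are hypothetical and `float` stands in for the supported data types.

```cpp
#include <complex>
#include <vector>

// Assumed behaviour: each real sample becomes a complex sample with a zero
// imaginary part.
std::vector<std::complex<float>> real_to_complex(const std::vector<float>& in) {
    std::vector<std::complex<float>> out;
    out.reserve(in.size());
    for (float v : in) {
        out.emplace_back(v, 0.0f);
    }
    return out;
}

// Assumed behaviour: the imaginary part of each complex sample is discarded.
std::vector<float> complex_to_real(const std::vector<std::complex<float>>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (const std::complex<float>& v : in) {
        out.push_back(v.real());
    }
    return out;
}
```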
DDS / Mixer¶
The DSPLib contains a DDS and Mixer solution.
In DDS Only mode, there is a single output port that contains the sine/cosine components corresponding to the programmed phase increment. The phase increment is a fixed uint32 value provided as a constructor argument, where 2^31 corresponds to Pi (180 degrees phase increment). The number of samples sent through the output port is determined by the TP_INPUT_WINDOW_VSIZE parameter. The output port can be a window interface or a stream interface depending on the use of TP_API.
Mixer inputs are enabled with the TP_MIXER_MODE template parameter. There are two modes that have the mixer functionality enabled. In MIXER_MODE_1, a single input port is exposed and the input samples are complex multiplied by the DDS output for the given phase increment. In MIXER_MODE_2, two input ports are exposed for multi-carrier operation, with the first behaving as in MIXER_MODE_1, and the second input port getting complex multiplied with the complex conjugate of the DDS signal then accumulated to the result of the first complex multiply operation.
Table 13: DDS / Mixer Parameters
Name | Type | Description | Range |
---|---|---|---|
TT_DATA | typename | Data Type | cint16 |
TP_INPUT_WINDOW_VSIZE | Unsigned int | The number of samples to process each iteration | Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use. |
TP_MIXER_MODE | Unsigned int | Mode of operation | 0 = DDS Only 1 = Single input mixer 2 = Two input mixer |
TP_API | Unsigned int | I/O interface port type | 0 = Window 1 = Stream |
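The phase increment and the two mixer modes described above can be sketched in floating point. This is a behavioural model under stated assumptions, not the library kernel: the DDS output is taken as a unit-amplitude phasor in `double` rather than scaled cint16, and the names `phase_increment`, `dds_samples`, `mix_mode1`, and `mix_mode2` are hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts a frequency to a uint32 phase increment, where 2^32 is a full
// turn (so 2^31 corresponds to Pi).
std::uint32_t phase_increment(double freq_hz, double sample_rate_hz) {
    const double turns = std::fmod(freq_hz / sample_rate_hz, 1.0);
    // The conversion to uint32 wraps modulo 2^32, which matches phase wrap.
    return static_cast<std::uint32_t>(std::llround(turns * 4294967296.0));
}

// Generates n DDS samples cos(phase) + j*sin(phase) from a 32-bit phase
// accumulator that wraps naturally on overflow.
std::vector<std::complex<double>> dds_samples(std::uint32_t phase_inc, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(n);
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double angle = 2.0 * pi * (static_cast<double>(acc) / 4294967296.0);
        out[i] = std::polar(1.0, angle);
        acc += phase_inc;
    }
    return out;
}

// MIXER_MODE_1: input sample multiplied by the DDS sample.
std::complex<double> mix_mode1(std::complex<double> in, std::complex<double> dds) {
    return in * dds;
}

// MIXER_MODE_2: first input times DDS, plus second input times conj(DDS).
std::complex<double> mix_mode2(std::complex<double> in1, std::complex<double> in2,
                               std::complex<double> dds) {
    return in1 * dds + in2 * std::conj(dds);
}
```

For example, a tone at one quarter of the sample rate gives a phase increment of 2^30, and the generated samples step through 1, j, -1, -j.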
Compiling and Simulating Using the Makefile¶
A Makefile is included with each library element. It is located in the L2/tests/aie/<library_element> directory. Each Makefile holds default values for each of the library element parameters. These values can be edited as required to configure the library element for your needs.
Prerequisites:
source <your-Vitis-install-path>/lin64/Vitis/HEAD/settings64.csh
setenv PLATFORM_REPO_PATHS <your-platform-repo-install-path>
source <your-XRT-install-path>/xbb/xrt/packages/xrt-2.1.0-centos/opt/xilinx/xrt/setup.csh
setenv DSPLIB_ROOT <your-Vitis-libraries-install-path/dsp>
Use the following command to compile and simulate the reference model with the x86sim target, and the AIE graphs using the AIE emulation platform. The output of the reference model (logs/ref_output.txt) is verified against the output of the AIE graphs (logs/uut_output.txt).
make run
To overwrite the default parameters, add desired parameters as arguments to the make command, for example:
make run DATA_TYPE=cint16 SHIFT=16
For a list of all the configurable parameters, see the L2 Library Element Configuration Parameters.
List of all Makefile targets:
make all TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to generate the design for specified Target and Shell.
make clean
    Command to remove the generated non-hardware files.
make cleanall
    Command to remove all the generated files.
make sd_card TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to prepare sd_card files. This target is only used in embedded device.
make run TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to run application in emulation or on board.
make build TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to build xclbin application.
make host HOST_ARCH=<aarch64>
    Command to build host application.
Note
For embedded devices like vck190, the environment variables SYSROOT, EDGE_COMMON_SW, and PERL need to be set first, and HOST_ARCH is either aarch32 or aarch64. For example:
export SYSROOT=< path-to-platform-sysroot >
export EDGE_COMMON_SW=< path-to-rootfs-and-Image-files >
export PERL=<path-to-perl-installation-location >
Simulation results and diff results are located in the L2/tests/aie/<library_element>/logs/status.txt file. To perform an x86 compilation/simulation, run:
make run TARGET=x86sim
It is also possible to randomly generate coefficient and input data, or to generate specific stimulus patterns like ALL_ONES, IMPULSE, etc. by running
make run STIM_TYPE=4
L2 Library Element Unit Test¶
Each library element category comes supplied with a test harness, which is an example of how to use the library element subgraph in the context of a super-graph. These test harnesses (graphs) can be found in the L2/tests/aie/<library_element>/test.hpp and L2/tests/aie/<library_element>/test.cpp files.
Each library element filter category also has a reference model, which is used by the test harness. The reference model graphs can be found in the L2/tests/aie/inc/<library_element>_ref_graph.hpp file.
Although it is recommended that only L2 (graphs) library elements are instantiated directly in user code, the kernels underlying the graphs can be found in the L1/include/aie/<library_element>.hpp and the L1/src/aie/<library_element>.cpp files.
An example of how a library element may be configured by a parent graph is provided in the L2/examples/fir_129t_sym folder. The example graph, test.h, in the L2/examples/fir_129t_sym folder instantiates the fir_sr_sym graph configured to be a 129-tap filter. This example exposes the ports such that the parent graph can be used to replace an existing 129-tap symmetric filter point solution design.
L2 Library Element Configuration Parameters¶
L2 FIR configuration parameters¶
The list below consists of configurable parameters for FIR library elements with their default values.
Table 14: L2 FIR configuration parameters
Name | Type | Default | Description |
---|---|---|---|
DATA_TYPE | typename | cint16 | Data Type. |
COEFF_TYPE | typename | int16 | Coefficient Type. |
FIR_LEN | unsigned | 81 | FIR length. |
SHIFT | unsigned | 16 | Acc results shift down value. |
ROUND_MODE | unsigned | 0 | Rounding mode. |
INPUT_WINDOW_VSIZE | unsigned | 512 | Input window size. |
CASC_LEN | unsigned | 1 | Cascade length. |
INTERPOLATE_FACTOR | unsigned | 1 | Interpolation factor, see note below |
DECIMATE_FACTOR | unsigned | 1 | Decimation factor, see note below |
DUAL_IP | unsigned | 0 | Dual inputs used in symmetric FIRs, see note below |
NITER | unsigned | 16 | Number of iterations to execute. |
GEN_INPUT_DATA | bool | true | Generate input data samples. When true, generate stimulus data as defined in: DATA_STIM_TYPE. When false, use the input file defined in: INPUT_FILE |
GEN_COEFF_DATA | bool | true | Generate random coefficients. When true, generate stimulus data as defined in: COEFF_STIM_TYPE. When false, use the coefficient file defined in: COEFF_FILE |
DATA_STIM_TYPE | unsigned | 0 | Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave |
COEFF_STIM_TYPE | unsigned | 0 | Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave |
INPUT_FILE | string | data/input.txt | Input data samples file. Only used when GEN_INPUT_DATA=false. |
COEFF_FILE | string | data/coeff.txt | Coefficient data file. Only used when GEN_COEFF_DATA=false. |
Note
The above configurable parameters range may exceed a library element’s maximum supported range, in which case the compilation will end with a static_assert error informing about the exceeded range.
Note
Not all dsplib elements support all of the above configurable parameters. Unsupported parameters are not used and have no impact on execution. For example, INTERPOLATE_FACTOR is supported only by interpolation filters and is ignored by other library elements.
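The SHIFT parameter in Table 14 scales the accumulator result down before it is written out. The sketch below shows that idea in plain C++ under stated assumptions: it uses a 64-bit accumulator and floor rounding only, and it does not model the library's ROUND_MODE options or any saturation. The function names `shiftDownFloor` and `scaledDot` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Divide the accumulator by 2^shift, rounding toward negative infinity.
// Written with division so the result does not rely on the behaviour of
// right-shifting a negative value.
inline std::int64_t shiftDownFloor(std::int64_t acc, unsigned shift) {
    assert(shift < 62);
    const std::int64_t divisor = std::int64_t{1} << shift;
    std::int64_t q = acc / divisor;   // truncates toward zero
    if (acc % divisor < 0) {
        --q;                          // adjust negative results down to the floor
    }
    return q;
}

// Multiply-accumulate int16 data with int16 coefficients in a wide
// accumulator, then apply the shift-down.
inline std::int64_t scaledDot(const std::vector<std::int16_t>& data,
                              const std::vector<std::int16_t>& coeff,
                              unsigned shift) {
    assert(data.size() == coeff.size());
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        acc += static_cast<std::int64_t>(data[i]) * coeff[i];
    }
    return shiftDownFloor(acc, shift);
}
```

For example, with coefficients scaled by 2^15, a single tap of 16384 (0.5) applied to a sample of 1000 gives an accumulator of 16384000, and a shift of 15 brings the result back to 500.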
L2 FFT configuration parameters¶
The table below lists the configurable parameters and default values for the FFT/iFFT library element.
Table 15: L2 FFT configuration parameters
Name | Type | Default | Description |
---|---|---|---|
DATA_TYPE | typename | cint16 | Data Type. |
TWIDDLE_TYPE | typename | cint16 | Twiddle Type. |
POINT_SIZE | unsigned | 1024 | FFT point size. |
SHIFT | unsigned | 17 | Accumulator results shift-down value. |
FFT_NIFFT | unsigned | 0 | Forward (1) or inverse (0) transform. |
WINDOW_VSIZE | unsigned | 1024 | Input/Output window size. By default, set to: $(POINT_SIZE). |
CASC_LEN | unsigned | 1 | Cascade length. |
DYN_PT_SIZE | unsigned | 0 | Enable (1) Dynamic Point size feature. |
NITER | unsigned | 4 | Number of iterations to execute. |
GEN_INPUT_DATA | bool | true | Generate input data samples. When true, generate stimulus data as defined by STIM_TYPE. When false, use the input file defined by INPUT_FILE. |
STIM_TYPE | unsigned | 0 | Supported types: 0 - random, 3 - impulse, 4 - all ones, 5 - incrementing pattern, 6 - sym incrementing pattern, 8 - sine wave. |
INPUT_FILE | string | data/input.txt | Input data samples file. Only used when GEN_INPUT_DATA=false. |
Note
The range of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error that reports the exceeded range.
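The stimulus type codes listed in Table 15 can be pictured with a small generator. The sketch below covers only the deterministic types 3, 4 and 5 and rests on assumptions: the impulse is placed at index 0, the incrementing pattern starts at 0 and wraps to stay within the int16 range, and the function name `makeStimulus` is hypothetical. It is not the library's test bench code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Build `n` int16 samples for a given stimulus type code.
// Only codes 3 (impulse), 4 (all ones) and 5 (incrementing) are sketched.
inline std::vector<std::int16_t> makeStimulus(unsigned stimType, std::size_t n) {
    std::vector<std::int16_t> out(n, 0);
    switch (stimType) {
        case 3:  // impulse: assumed to be a single 1 at index 0
            if (n > 0) {
                out[0] = 1;
            }
            break;
        case 4:  // all ones
            for (std::size_t i = 0; i < n; ++i) {
                out[i] = 1;
            }
            break;
        case 5:  // incrementing pattern: assumed to start at 0 and wrap at 32768
            for (std::size_t i = 0; i < n; ++i) {
                out[i] = static_cast<std::int16_t>(i % 32768);
            }
            break;
        default:
            // Codes 0, 6 and 8 are left out of this sketch.
            throw std::invalid_argument("stimulus type not covered by this sketch");
    }
    return out;
}
```

An impulse input is a convenient first check because the output then depends only on the element's own coefficients or twiddles.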
L2 Matrix Multiply Configuration Parameters¶
The table below lists the configurable parameters and default values for the Matrix Multiply (GeMM) library element.
Table 16: L2 Matrix Multiply configuration parameters
Name | Type | Default | Description |
---|---|---|---|
T_DATA_A | typename | cint16 | Input A Data Type. |
T_DATA_B | typename | cint16 | Input B Data Type. |
P_DIM_A | unsigned | 16 | Input A Dimension. |
P_DIM_AB | unsigned | 16 | Input AB Common Dimension. |
P_DIM_B | unsigned | 16 | Input B Dimension. |
SHIFT | unsigned | 20 | Accumulator results shift-down value. |
ROUND_MODE | unsigned | 0 | Rounding mode. |
P_CASC_LEN | unsigned | 1 | Cascade length. |
P_DIM_A_LEADING | unsigned | 0 | ROW_MAJOR = 0, COL_MAJOR = 1. |
P_DIM_B_LEADING | unsigned | 1 | ROW_MAJOR = 0, COL_MAJOR = 1. |
P_DIM_OUT_LEADING | unsigned | 0 | ROW_MAJOR = 0, COL_MAJOR = 1. |
P_ADD_TILING_A | unsigned | 1 | 0 = no additional tiling kernel, 1 = add additional tiling kernel. |
P_ADD_TILING_B | unsigned | 1 | 0 = no additional tiling kernel, 1 = add additional tiling kernel. |
P_ADD_DETILING_OUT | unsigned | 1 | 0 = no additional detiling kernel, 1 = add additional detiling kernel. |
NITER | unsigned | 16 | Number of iterations to execute. |
STIM_TYPE_A | unsigned | 0 | Supported types: 0 - random, 3 - impulse, 4 - all ones, 5 - incrementing pattern, 6 - sym incrementing pattern, 8 - sine wave. |
STIM_TYPE_B | unsigned | 0 | Supported types: 0 - random, 3 - impulse, 4 - all ones, 5 - incrementing pattern, 6 - sym incrementing pattern, 8 - sine wave. |
Note
The range of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error that reports the exceeded range.
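The leading-dimension parameters in Table 16 select between row-major and column-major storage. The sketch below shows how each choice changes the linear offset of an element, together with a plain reference multiply. It assumes that A is P_DIM_A x P_DIM_AB and B is P_DIM_AB x P_DIM_B. The function names `elementOffset` and `matMulRef` are hypothetical, and the code does not model tiling, SHIFT or the AI Engine kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr unsigned ROW_MAJOR = 0;
constexpr unsigned COL_MAJOR = 1;

// Linear offset of element (row, col) in a rows x cols matrix.
inline std::size_t elementOffset(std::size_t row, std::size_t col,
                                 std::size_t rows, std::size_t cols,
                                 unsigned leading) {
    return leading == ROW_MAJOR ? row * cols + col : col * rows + row;
}

// Reference multiply: out (dimA x dimB) = A (dimA x dimAB) * B (dimAB x dimB).
// Each matrix is stored in the layout given by its leading-dimension flag.
inline std::vector<int> matMulRef(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  std::size_t dimA, std::size_t dimAB,
                                  std::size_t dimB, unsigned leadA,
                                  unsigned leadB, unsigned leadOut) {
    assert(a.size() == dimA * dimAB);
    assert(b.size() == dimAB * dimB);
    std::vector<int> out(dimA * dimB, 0);
    for (std::size_t r = 0; r < dimA; ++r) {
        for (std::size_t c = 0; c < dimB; ++c) {
            int acc = 0;
            for (std::size_t k = 0; k < dimAB; ++k) {
                acc += a[elementOffset(r, k, dimA, dimAB, leadA)] *
                       b[elementOffset(k, c, dimAB, dimB, leadB)];
            }
            out[elementOffset(r, c, dimA, dimB, leadOut)] = acc;
        }
    }
    return out;
}
```

With the defaults in Table 16 (A row-major, B column-major, output row-major), each output element reads one contiguous row of A and one contiguous column of B.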
L2 Widgets Configuration Parameters¶
The tables below list the configurable parameters and default values for the Widgets library elements.
Table 17: L2 Widget API Casts Configuration Parameters
Name | Type | Default | Description |
---|---|---|---|
DATA_TYPE | typename | cint16 | Data Type. |
IN_API | unsigned | 0 | 0 = window, 1 = stream. |
OUT_API | unsigned | 0 | 0 = window, 1 = stream. |
NUM_INPUTS | unsigned | 1 | The number of input stream interfaces. |
WINDOW_VSIZE | unsigned | 256 | Input/Output window size. |
NUM_OUTPUT_CLONES | unsigned | 1 | The number of output window port copies. |
Table 18: L2 Widget Real to Complex Configuration Parameters
Name | Type | Default | Description |
---|---|---|---|
DATA_TYPE | typename | cint16 | Input Data Type. |
DATA_OUT_TYPE | typename | cint16 | Output Data Type. |
WINDOW_VSIZE | unsigned | 256 | Input/Output window size. |
Note
The range of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error that reports the exceeded range.
L2 DDS/Mixer Configuration Parameters¶
The table below lists the configurable parameters and default values for the DDS/Mixer library element.
Table 19: L2 DDS/Mixer Configuration Parameters
Name | Type | Default | Description |
---|---|---|---|
DATA_TYPE | typename | cint16 | Data Type. |
INPUT_WINDOW_VSIZE | unsigned | 256 | Input/Output window size. |
MIXER_MODE | unsigned | 2 | The mode of operation of the dds_mixer. |
TP_API | unsigned | 0 | 0 = window, 1 = stream |