DSP Library Functions¶

The Xilinx® digital signal processing library (DSPLib) is a configurable library of kernels that can be used to develop applications on Versal™ ACAP AI Engines. This is an open-source library for DSP applications. Kernels are coded in C++ and contain special functions called intrinsics that give access to AI Engine vector processing capabilities. Kernels can be combined to construct graphs for developing complex designs. An example design is provided with this library for your reference. Each kernel has a corresponding graph, and it is highly recommended to use the library element’s graph as the entry point. See Using the Examples for more details.

Filters¶

The DSPLib contains several variants of Finite Impulse Response (FIR) filters. On the AI Engine processor, data is packetized into windows. In the case of FIRs, each window is extended by a margin so that the state of the filter at the end of the previous window may be restored before new computations begin. Therefore, to maximize performance, the window size should be set to the maximum that the system will allow, though this will lead to a corresponding increase in latency. However, this is a complex decision as multiple factors such as data movement and latency need to be taken into consideration.

Note

With a small window size (for example, 32), you pay a high penalty on the function call overhead: the pre/post-amble becomes a major cycle consumer in each function call.

FIR filters have been categorized into classes and placed in a distinct namespace scope: xf::dsp::aie::fir, to prevent name collision in the global scope. Namespace aliasing can be utilized to shorten instantiations:

namespace dsplib = xf::dsp::aie;

Additionally, each FIR filter has been placed in a unique FIR type namespace. The available FIR filter classes and the corresponding graph entry points are listed below:

Table 1: FIR Filter Classes

Function	Namespace
Single rate, asymmetrical	dsplib::fir::sr_asym::fir_sr_asym_graph
Single rate, symmetrical	dsplib::fir::sr_sym::fir_sr_sym_graph
Interpolation, asymmetrical	dsplib::fir::interpolate_asym::fir_interpolate_asym_graph
Decimation, halfband	dsplib::fir::decimate_hb::fir_decimate_hb_graph
Interpolation, halfband	dsplib::fir::interpolate_hb::fir_interpolate_hb_graph
Decimation, asymmetric	dsplib::fir::decimate_asym::fir_decimate_asym_graph
Interpolation, fractional, asymmetric	dsplib::fir::interpolate_fract_asym::fir_interpolate_fract_asym_graph
Decimation, symmetric	dsplib::fir::decimate_sym::fir_decimate_sym_graph

Conventions for Filters¶

All FIR filters can be configured for various types of data and coefficients. These types can be int16, int32, or float, and also real or complex. However, configurations with real data versus complex coefficients are not supported nor are configurations where the coefficients are int32 and data is int16. Data and coefficients must both be integer types or both be float types, as mixes are not supported.

The following table lists the supported combinations of data type and coefficient type.

Table 2: Supported Combinations of Data Type and Coefficient Type

Coefficient Type \ Data Type	Int16	Cint16	Int32	Cint32	Float	Cfloat
Int16	Supported	Supported	Supported	Supported	3	3
Cint16	1	Supported	1	Supported	3	3
Int32	2	2	Supported	Supported	3	3
Cint32	1, 2	2	1	Supported	3	3
Float	3	3	3	3	Supported	Supported
Cfloat	3	3	3	3	3	Supported

Notes:
1. Complex coefficients are not supported for real-only data types.
2. Coefficient type of higher precision than data type is not supported.
3. A mix of float and integer types is not supported.
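The three restrictions stated above can be expressed as a small compile-time predicate. The following is a minimal standalone sketch, not DSPLib code: the `Type` enum and the helper names are hypothetical stand-ins for the AI Engine data types, chosen only for illustration.

```cpp
// Hypothetical tags standing in for the AI Engine data types; not DSPLib names.
enum class Type { Int16, Cint16, Int32, Cint32, Float, Cfloat };

constexpr bool is_float(Type t) { return t == Type::Float || t == Type::Cfloat; }

constexpr bool is_complex(Type t) {
    return t == Type::Cint16 || t == Type::Cint32 || t == Type::Cfloat;
}

// Bits per real component (a complex sample holds two such components).
constexpr int component_bits(Type t) {
    return (t == Type::Int16 || t == Type::Cint16) ? 16 : 32;
}

// True when the (data, coefficient) pair satisfies all three restrictions.
constexpr bool is_supported(Type data, Type coeff) {
    return is_float(data) == is_float(coeff)               // no float/integer mix
        && (!is_complex(coeff) || is_complex(data))        // no complex coefficients on real data
        && component_bits(coeff) <= component_bits(data);  // coefficients not wider than data
}

static_assert(is_supported(Type::Cint16, Type::Int16), "cint16 data, int16 coefficients");
static_assert(!is_supported(Type::Int16, Type::Int32), "int32 coefficients exceed int16 data");
```

A check of this kind could sit in a parent graph header so that an unsupported pairing is rejected before the library's own checks are reached.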

For all filters, the coefficient values are passed, not as template parameters, but as an array argument to the constructor for non-reloadable configurations, or to the reload function for reloadable configurations. In the case of symmetrical filters, only the first half (plus the centre tap, if the length is odd) needs to be passed, as the remainder can be derived by symmetry. For halfband filters, only the non-zero coefficients should be entered, so the expected array length is (TP_FIR_LEN+1)/4, plus 1 for the centre tap.
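The array lengths described above are simple integer arithmetic on TP_FIR_LEN. The sketch below shows that arithmetic in isolation; the function names are hypothetical and are not part of DSPLib.

```cpp
// Coefficients to pass for a symmetric FIR: the first half, plus the
// centre tap when the length is odd.
constexpr unsigned sym_coeff_count(unsigned fir_len) {
    return (fir_len + 1) / 2;
}

// Coefficients to pass for a halfband FIR: the non-zero taps of one half,
// (TP_FIR_LEN+1)/4, plus 1 for the centre tap.
constexpr unsigned hb_coeff_count(unsigned fir_len) {
    return (fir_len + 1) / 4 + 1;
}

static_assert(sym_coeff_count(129) == 65, "129 taps: 64 + centre tap");
static_assert(sym_coeff_count(16) == 8, "even length: exactly half");
static_assert(hb_coeff_count(7) == 3, "7-tap halfband: 2 side taps + centre tap");
```

For example, a 7-tap halfband filter has taps h0, 0, h2, h3, h2, 0, h0, so only h0, h2, and the centre tap h3 are passed.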

The following table lists parameters supported by all the FIR filters:

Table 3: Parameters Supported by FIR Filters

Parameter Name	Type	Description	Range
TP_FIR_LEN	unsigned	The number of taps	4 to 240
TP_RND	unsigned int	Round mode	0 = truncate or floor; 1 = ceiling (round up); 2 = positive infinity; 3 = negative infinity; 4 = symmetrical to infinity; 5 = symmetrical to zero; 6 = convergent to even; 7 = convergent to odd
TP_SHIFT	unsigned int	The number of bits to shift accumulation down by before output.	0 to 61
TT_DATA	typename	Data Type	int16, cint16, int32, cint32, float, cfloat
TT_COEFF	typename	Coefficient type	int16, cint16, int32, cint32, float, cfloat
TP_INPUT_WINDOW_VSIZE	unsigned int	The number of samples in the input window.	Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.
TP_CASC_LEN	unsigned int	The number of cascaded kernels to use for this FIR.	1 to 9. Defaults to 1 if not set.
TP_DUAL_IP	unsigned int	Use dual inputs (may increase throughput for symmetrical and halfband filters by avoiding load contention by using a second RAM bank for input).	Range 0 (single input), 1 (dual input). Defaults to 0 if not set.
TP_USE_COEFF_RELOAD	unsigned int	Enable reloadable coefficient feature. An additional ‘coeff’ RTP port will appear on the graph.	0 (no reload), 1 (use reloads). Defaults to 0 if not set.
TP_NUM_OUTPUTS	unsigned int	Number of FIR output ports	>1

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
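The ranges in Table 3 can be checked ahead of instantiation in a parent graph. The following is a minimal sketch under stated assumptions: `fir_params_ok` is a hypothetical helper, not a DSPLib function, and the lane count is supplied by the caller because it depends on the data type.

```cpp
// Hypothetical helper mirroring the ranges listed in Table 3.
// lanes is an assumption supplied by the caller (typically 4 or 8).
constexpr bool fir_params_ok(unsigned fir_len, unsigned shift, unsigned rnd,
                             unsigned window_vsize, unsigned lanes,
                             unsigned casc_len = 1) {
    return fir_len >= 4 && fir_len <= 240              // TP_FIR_LEN: 4 to 240
        && shift <= 61                                 // TP_SHIFT: 0 to 61
        && rnd <= 7                                    // TP_RND: 0 to 7
        && lanes != 0 && window_vsize % lanes == 0     // window is a multiple of lanes
        && casc_len >= 1 && casc_len <= 9;             // TP_CASC_LEN: 1 to 9
}

// 81 taps, shift 16, round mode 0, and a 512-sample window, with 8 lanes assumed.
static_assert(fir_params_ok(81, 16, 0, 512, 8), "configuration is within range");
static_assert(!fir_params_ok(241, 16, 0, 512, 8), "TP_FIR_LEN above 240");
```

Passing this check does not guarantee that a particular filter variant accepts the configuration, since individual library elements may apply tighter limits.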

FFT/iFFT¶

The DSPLib contains one FFT/iFFT solution. This is a single channel, decimation in time (DIT) implementation with configurable point size, data type, and FFT/iFFT function.

Point size may be any power of 2 from 16 to 4096, but this upper limit will be reduced to 2048 for cint16 data type and 1024 for cfloat or cint32 data type where the FFT kernel uses ping-pong window input. The 4096 limit may only be achieved where the FFT receives and outputs data to/from kernels on the same processor.

Table 4: FFT Parameters

Name	Type	Description	Range
TT_DATA	Typename	The input data type	cint16, cint32, cfloat
TT_TWIDDLE	Typename	The twiddle factor type. Determined by TT_DATA	Set to cint16 for data type of cint16 or cint32 and cfloat for data type of cfloat.
TP_POINT_SIZE	Unsigned int	The number of samples in a frame to be processed	2^N, where N is in the range 4 to 12, though the upper limit may be constrained by device resources.
TP_FFT_NIFFT	Unsigned int	Forward or reverse transform	0 (IFFT) or 1 (FFT).
TP_SHIFT	Unsigned int	The number of bits to shift accumulation down by before output.	0 to 61
TP_CASC_LEN	Unsigned int	The number of kernels the FFT will be divided over.	1 to 12. Defaults to 1 if not set. Maximum is derived by the number of radix 2 stages required for the given point size (N where pointSize = 2^N). For float data types the max is N. For integer data types the max is CEIL(N/2).
TP_DYN_PT_SIZE	Unsigned int	FFT point size	2^N, where N is 2 to 12
TP_WINDOW_VSIZE	Unsigned int	The number of samples in the input window.	Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive memory usage.

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.
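The upper bound on TP_CASC_LEN given in Table 4 follows directly from the point size. The sketch below works through that rule; the function names are hypothetical and the code is independent of DSPLib.

```cpp
// log2 of a power-of-two point size (for example, 1024 -> 10).
constexpr unsigned log2_pow2(unsigned point_size) {
    return point_size <= 1 ? 0 : 1 + log2_pow2(point_size / 2);
}

// Maximum TP_CASC_LEN for pointSize = 2^N:
// N for float data types, CEIL(N/2) for integer data types.
constexpr unsigned max_casc_len(unsigned point_size, bool is_float_data) {
    return is_float_data ? log2_pow2(point_size)
                         : (log2_pow2(point_size) + 1) / 2;
}

static_assert(max_casc_len(1024, false) == 5, "integer data, N = 10");
static_assert(max_casc_len(1024, true) == 10, "float data, N = 10");
static_assert(max_casc_len(2048, false) == 6, "integer data, N = 11, rounded up");
```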

This FFT implementation does not implement the 1/N scaling of an IFFT. For cint16 and cint32 data, an internal data type of cint32 is used. After each rank, the values are scaled only enough to normalize the bit growth caused by the twiddle multiplication (i.e., 15 bits). Distortion caused by saturation is therefore possible for large point sizes and large values when the data type is cint32. In the final stage, the result is scaled by 17 bits for point sizes from 16 to 1024, by 18 bits for 2048, and by 19 bits for 4096.

No scaling is applied at any point when the data type is cfloat. The graph entry point is the following:

xf::dsp::aie::fft::fft_ifft_dit_1ch_graph
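Because the 1/N factor of the IFFT is not applied by the library, a design that needs a true inverse transform has to apply it elsewhere. The following is a minimal sketch for cfloat-style data, modelled with `std::complex<float>`; the function name is hypothetical, and where the scaling is actually performed (a downstream kernel or host code) is a design decision the text above does not prescribe.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Apply the 1/N factor that the IFFT leaves out.
// frame holds one complete output frame of N samples.
void apply_ifft_scaling(std::vector<std::complex<float>>& frame) {
    assert(!frame.empty());
    const float inv_n = 1.0f / static_cast<float>(frame.size());
    for (std::complex<float>& sample : frame) {
        sample *= inv_n;  // scales the real and imaginary parts together
    }
}
```

For integer data types a power-of-two N means the same factor could instead be folded into a shift, but how that interacts with TP_SHIFT is not covered here.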

Matrix Multiply¶

The DSPLib contains one Matrix Multiply/GEMM (GEneral Matrix Multiply) solution. The gemm has two input ports connected to two windows of data. The inputs are denoted as Matrix A (inA) and Matrix B (inB). Matrix A has a template parameter TP_DIM_A to describe the number of rows of A. The number of columns of inA must be equal to the number of rows of inB. This is denoted with the template parameter TP_DIM_AB. The number of columns of B is denoted by TP_DIM_B.

An output port connects to a window, where the data for the output matrix will be stored. The output matrix has the same number of rows as inA (TP_DIM_A) and the same number of columns as inB (TP_DIM_B). The data type of both input matrices can be configured and the data type of the output is derived from the inputs.

Table 5: Matrix Multiply Parameters

Name	Type	Description	Range
TT_DATA_A	Typename	The input data type for Matrix A	int16, cint16, int32, cint32, float, cfloat
TT_DATA_B	Typename	The input data type for Matrix B	int16, cint16, int32, cint32, float, cfloat
TP_DIM_A	unsigned int	The number of elements along the unique dimension (rows) of Matrix A
TP_DIM_AB	unsigned int	The number of elements along the common dimension of Matrix A (columns) and Matrix B (rows)
TP_DIM_B	unsigned int	The number of elements along the unique dimension (columns) of Matrix B
TP_SHIFT	unsigned int	Power-of-2 shift down applied to the accumulation of product terms before each output	0 to 61
TP_RND	unsigned int	Round mode	0 = truncate or floor; 1 = ceiling (round up); 2 = positive infinity; 3 = negative infinity; 4 = symmetrical to infinity; 5 = symmetrical to zero; 6 = convergent to even; 7 = convergent to odd
TP_DIM_A_LEADING	unsigned int	The scheme in which the data of Matrix A should be stored in memory	ROW_MAJOR = 0, COL_MAJOR = 1
TP_DIM_B_LEADING	unsigned int	The scheme in which the data of Matrix B should be stored in memory	ROW_MAJOR = 0, COL_MAJOR = 1
TP_DIM_OUT_LEADING	unsigned int	The scheme in which the data of the output matrix should be stored in memory	ROW_MAJOR = 0, COL_MAJOR = 1
TP_ADD_TILING_A	unsigned int	Option to add an additional kernel to rearrange the samples of Matrix A	0 = rearrange externally to the graph, 1 = add the tiling kernel to the graph
TP_ADD_TILING_B	unsigned int	Option to add an additional kernel to rearrange the samples of Matrix B	0 = rearrange externally to the graph, 1 = add the tiling kernel to the graph
TP_ADD_DETILING_OUT	unsigned int	Option to add an additional kernel to rearrange the samples of the output matrix	0 = rearrange externally to the graph, 1 = add the detiling kernel to the graph
TP_WINDOW_VSIZE_A	unsigned int	The number of samples in the input window for Matrix A	Must be of size TP_DIM_A*TP_DIM_AB*N. Defaults to TP_DIM_A*TP_DIM_AB (N=1).
TP_WINDOW_VSIZE_B	unsigned int	The number of samples in the input window for Matrix B	Must be of size TP_DIM_B*TP_DIM_AB*M. Defaults to TP_DIM_B*TP_DIM_AB (M=1).
TP_CASC_LEN	unsigned int	The number of AIE tiles to split the operation into	Defaults to 1 if not set.

Input matrices are processed in distinct blocks and matrix elements must be rearranged into a specific pattern.

The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x4 tiling pattern.

Note

Indices are quoted assuming a row major matrix. A column major matrix needs to be transposed.

Table 6: Matrix Multiply 4x4 tiling pattern

	Tile Col 0				Tile Col 1				Tile Col 2				Tile Col 3
Tile Row 0	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47
	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63
Tile Row 1	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79
	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95
	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111
	112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127
Tile Row 2	128	129	130	131	132	133	134	135	136	137	138	139	140	141	142	143
	144	145	146	147	148	149	150	151	152	153	154	155	156	157	158	159
	160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175
	176	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191
Tile Row 3	192	193	194	195	196	197	198	199	200	201	202	203	204	205	206	207
	208	209	210	211	212	213	214	215	216	217	218	219	220	221	222	223
	224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239
	240	241	242	243	244	245	246	247	248	249	250	251	252	253	254	255

This is stored contiguously in memory as:

0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51, 4, 5, 6, 7, 20, 21, 22, 23, 36, 37, 38, 39, 52, 53, 54, 55, 8, 9, 10, 11, 24, 25, 26, 27, 40, 41, 42, 43, 56, 57, 58, 59, 12, 13, 14, 15, 28, 29, 30, 31, 44, 45, 46, 47, 60, 61, 62, 63, 64, 65, 66, 67, 80, 81, 82, 83, 96, 97, 98, 99, 112, 113, 114, 115, … , 204, 205, 206, 207, 220, 221, 222, 223, 236, 237, 238, 239, 252, 253, 254, 255

The following table demonstrates how a 16x16 input matrix should be rearranged into a 4x2 tiling pattern.

Table 7: Matrix Multiply 4x2 tiling pattern

	Tile Col 0		Tile Col 1		Tile Col 2		Tile Col 3		Tile Col 4		Tile Col 5		Tile Col 6		Tile Col 7
Tile Row 0	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47
	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63
Tile Row 1	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79
	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95
	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111
	112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127
Tile Row 2	128	129	130	131	132	133	134	135	136	137	138	139	140	141	142	143
	144	145	146	147	148	149	150	151	152	153	154	155	156	157	158	159
	160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175
	176	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191
Tile Row 3	192	193	194	195	196	197	198	199	200	201	202	203	204	205	206	207
	208	209	210	211	212	213	214	215	216	217	218	219	220	221	222	223
	224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239
	240	241	242	243	244	245	246	247	248	249	250	251	252	253	254	255

This is stored contiguously in memory as:

0, 1, 16, 17, 32, 33, 48, 49, 2, 3, 18, 19, 34, 35, 50, 51, …, 206, 207, 222, 223, 238, 239, 254, 255
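Both memory orders listed above follow the same rule: tiles are visited left to right and then top to bottom, and the samples inside each tile are emitted row by row. The sketch below reproduces that reordering for a row-major input. It is a hypothetical host-side helper for preparing or checking test data, not the DSPLib tiling kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a row-major rows x cols matrix into tiled order.
// tile_rows x tile_cols is the tile shape, e.g. 4x4 or 4x2.
std::vector<int> to_tiled(const std::vector<int>& m,
                          std::size_t rows, std::size_t cols,
                          std::size_t tile_rows, std::size_t tile_cols) {
    assert(m.size() == rows * cols);
    assert(tile_rows != 0 && tile_cols != 0);
    assert(rows % tile_rows == 0 && cols % tile_cols == 0);
    std::vector<int> out;
    out.reserve(m.size());
    for (std::size_t tr = 0; tr < rows; tr += tile_rows)            // tile row
        for (std::size_t tc = 0; tc < cols; tc += tile_cols)        // tile column
            for (std::size_t r = tr; r < tr + tile_rows; ++r)       // row inside the tile
                for (std::size_t c = tc; c < tc + tile_cols; ++c)   // column inside the tile
                    out.push_back(m[r * cols + c]);
    return out;
}
```

Calling `to_tiled(m, 16, 16, 4, 4)` on a matrix holding the indices 0 to 255 yields the sequence starting 0, 1, 2, 3, 16, 17, 18, 19, and `to_tiled(m, 16, 16, 4, 2)` yields the sequence starting 0, 1, 16, 17, 32, 33, 48, 49.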

Multiplying a 16x16 matrix (with 4x4 tiling) with a 16x16 matrix (with 4x2 tiling) will result in a 16x16 matrix with 4x2 tiling.

The following table specifies the tiling scheme used for a given data type combination and the corresponding output data type:

Table 8: Matrix Multiply tiling pattern combination

Input Type A	Input Type B	Tiling Scheme A	Tiling Scheme B	Output Type
int16	int16	4x4	4x4	int16
int16	cint16	4x2	2x2	cint16
int16	int32	4x2	2x2	int32
int16	cint32	2x4	4x2	cint32
cint16	int16	4x4	4x2	cint16
cint16	cint16	4x4	4x2	cint16
cint16	int32	4x4	4x2	cint32
cint16	cint32	2x2	2x2	cint32
int32	int16	4x4	4x2	int32
int32	int32	4x4	4x2	int32
int32	cint16	4x4	4x2	cint32
int32	cint32	2x2	2x2	cint32
cint32	int16	2x4	4x2	cint32
cint32	cint16	2x2	2x2	cint32
cint32	int32	2x2	2x2	cint32
cint32	cint32	2x2	2x2	cint32
float	float	4x4	4x2	float
float	cfloat	2x4	4x2	cfloat
cfloat	float	2x4	4x2	cfloat
cfloat	cfloat	4x2	2x2	cfloat

The parameters TP_ADD_TILING_A, TP_ADD_TILING_B, and TP_ADD_DETILING_OUT control the inclusion of an additional pre-processing / post-processing kernel to perform the required data shuffling. When used with TP_DIM_A_LEADING, TP_DIM_B_LEADING, or TP_DIM_OUT_LEADING, the matrix is also transposed in the tiling kernel.

If the additional kernels are not selected, then the matrix multiply kernels assume incoming data is in the correct format, as specified above. When using the TP_CASC_LEN parameter, the matrix multiply operation is split across TP_DIM_AB and processed in TP_CASC_LEN kernels. The accumulated partial results of each kernel are passed down the cascade port to the next kernel in the cascade chain until the final kernel provides the expected output. Cascade connections are made internally to the matrix multiply graph.

Each AI Engine kernel in the array is given a sub-matrix, so the interface to the graph is an array of ports for both A and B.

Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2):

Table 9: Input Matrix A (16x16 - 4x4 Tile - Cascade Length 2)

	AIE 0								AIE 1
	Tile Col 0				Tile Col 1				Tile Col 2				Tile Col 3
Tile Row 0	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47
	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63
Tile Row 1	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79
	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95
	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111
	112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127
Tile Row 2	128	129	130	131	132	133	134	135	136	137	138	139	140	141	142	143
	144	145	146	147	148	149	150	151	152	153	154	155	156	157	158	159
	160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175
	176	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191
Tile Row 3	192	193	194	195	196	197	198	199	200	201	202	203	204	205	206	207
	208	209	210	211	212	213	214	215	216	217	218	219	220	221	222	223
	224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239
	240	241	242	243	244	245	246	247	248	249	250	251	252	253	254	255

Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2):

Table 10: Input Matrix B (16x16 - 4x2 Tile - Cascade Length 2)

		Tile Col 0		Tile Col 1		Tile Col 2		Tile Col 3		Tile Col 4		Tile Col 5		Tile Col 6		Tile Col 7
AIE 0	Tile Row 0	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
		16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
		32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47
		48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63
	Tile Row 1	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79
		80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95
		96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111
		112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127
AIE 1	Tile Row 2	128	129	130	131	132	133	134	135	136	137	138	139	140	141	142	143
		144	145	146	147	148	149	150	151	152	153	154	155	156	157	158	159
		160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175
		176	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191
	Tile Row 3	192	193	194	195	196	197	198	199	200	201	202	203	204	205	206	207
		208	209	210	211	212	213	214	215	216	217	218	219	220	221	222	223
		224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239
		240	241	242	243	244	245	246	247	248	249	250	251	252	253	254	255
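The split shown in Tables 9 and 10 divides the common dimension between the kernels: with a cascade length of 2, AIE 0 receives columns 0 to 7 of Matrix A and rows 0 to 7 of Matrix B, and AIE 1 receives the rest. The following sketch models that arithmetic in plain C++ to show why the accumulated partial results add up to the full product. The names are hypothetical, tiling is ignored, and an even split of the common dimension is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Accumulate into acc the products for the slice [k_begin, k_end) of the
// common dimension. a is dim_a x dim_ab, b is dim_ab x dim_b.
void accumulate_slice(const Matrix& a, const Matrix& b, Matrix& acc,
                      std::size_t k_begin, std::size_t k_end) {
    for (std::size_t i = 0; i < acc.size(); ++i)
        for (std::size_t j = 0; j < acc[i].size(); ++j)
            for (std::size_t k = k_begin; k < k_end; ++k)
                acc[i][j] += a[i][k] * b[k][j];
}

// Model of the cascade: kernel n handles the n-th slice of the common
// dimension and hands its accumulated partial result to kernel n + 1.
Matrix cascaded_matmul(const Matrix& a, const Matrix& b, std::size_t casc_len) {
    assert(!a.empty() && !b.empty() && casc_len != 0);
    const std::size_t dim_ab = b.size();
    assert(a[0].size() == dim_ab);
    assert(dim_ab % casc_len == 0);  // even split assumed for this sketch
    Matrix acc(a.size(), std::vector<long long>(b[0].size(), 0));
    const std::size_t slice = dim_ab / casc_len;
    for (std::size_t n = 0; n < casc_len; ++n)
        accumulate_slice(a, b, acc, n * slice, (n + 1) * slice);
    return acc;  // output of the final kernel in the chain
}
```

The result is the same for every valid cascade length, which is why only the final kernel in the chain needs an output port.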

The graph entry point is the following:

xf::dsp::aie::blas::matrix_mult::matrix_mult_graph

Find a full list of descriptions and parameters in the API Reference Overview.

Connections to the cascade ports can be made as follows:

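// Connect one A input and one B input per kernel in the cascade.
for (int i = 0; i < P_CASC_LEN; i++) {
    connect<>(inA[i], mmultGraph.inA[i]);
    connect<>(inB[i], mmultGraph.inB[i]);
}
// The final kernel in the cascade chain drives the single output port.
connect<>(mmultGraph.out, out);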

Widgets¶

Widget API Cast¶

The DSPLib contains a Widget API Cast solution, which provides flexibility when connecting other kernels. This component is able to change a stream interface to a window interface and vice versa. It may be configured to read two input stream interfaces and interleave data onto an output window interface. In addition, multiple copies of the output window may be configured to allow extra flexibility when connecting to further kernels.

Table 11: Widget API Cast Parameters

Name	Type	Description	Range
TT_DATA	typename	Data Type	int16, cint16, int32, cint32, float, cfloat
TP_IN_API	Unsigned int	The input interface type	0 = window, 1 = stream
TP_OUT_API	Unsigned int	The output interface type	0 = window, 1 = stream
TP_NUM_INPUTS	Unsigned int	The number of input stream interfaces to be processed	1 - 2
TP_WINDOW_VSIZE	Unsigned int	The number of samples in the input window	Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.
TP_NUM_OUTPUT_CLONES	Unsigned int	The number of output window ports to write the input data to.	1 - 4

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

The graph entry point is the following:

xf::dsp::aie::widget::api_cast::widget_api_cast_graph

Widget Real to Complex¶

The DSPLib contains a Widget Real to Complex solution, which provides a utility to convert real data to complex or vice versa.

Table 12: Widget Real to Complex Parameters

Name	Type	Description	Range
TT_DATA	typename	Input data type	int16, cint16, int32, cint32, float, cfloat
TT_OUT_DATA	typename	Output data type	int16, cint16, int32, cint32, float, cfloat
TP_WINDOW_VSIZE	Unsigned int	The number of samples in the input window	Must be a multiple of the number of lanes used (typically 4 or 8). No enforced range, but large windows will result in mapper errors due to excessive RAM use.

Note

The number of lanes is the number of data elements that is being processed in parallel, e.g., presented at the input window. This varies depending on the data type (i.e., number of bits in each element) and the register or bus width.

The graph entry point is the following:

xf::dsp::aie::widget::real2complex::widget_real2complex_graph

Compiling and Simulating Using the Makefile¶

A Makefile is included with each library element. It is located in the L2/tests/aie/<library_element> directory. Each Makefile holds default values for each of the library element parameters. These values can be edited as required to configure the library element for your needs.

Prerequisites:

source <your-Vitis-install-path>/lin64/Vitis/HEAD/settings64.csh
setenv PLATFORM_REPO_PATHS <your-platform-repo-install-path>
source <your-XRT-install-path>/xbb/xrt/packages/xrt-2.1.0-centos/opt/xilinx/xrt/setup.csh
setenv DSPLIB_ROOT <your-Vitis-libraries-install-path>/dsp

Use the following command to compile and simulate the reference model with the x86sim target and the AIE graphs on the AIE emulation platform. The output of the reference model (logs/ref_output.txt) is verified against the output of the AIE graphs (logs/uut_output.txt).

make run

To overwrite the default parameters, add desired parameters as arguments to the make command, for example:

make run DATA_TYPE=cint16 SHIFT=16

For a list of all the configurable parameters, see the L2 Library Element Configuration Parameters.

List of all Makefile targets:

make all TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to generate the design for specified Target and Shell.

make clean
    Command to remove the generated non-hardware files.

make cleanall
    Command to remove all the generated files.

make sd_card TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to prepare sd_card files.
    This target is only used for embedded devices.

make run TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to run application in emulation or on board.

make build TARGET=<aiesim/x86sim/hw_emu/hw> DEVICE=<FPGA platform> HOST_ARCH=<aarch64>
    Command to build xclbin application.

make host HOST_ARCH=<aarch64>
    Command to build host application.

Note

For embedded devices like vck190, the environment variables SYSROOT, EDGE_COMMON_SW, and PERL need to be set first, and HOST_ARCH must be either aarch32 or aarch64. For example:

export SYSROOT=< path-to-platform-sysroot >
export EDGE_COMMON_SW=< path-to-rootfs-and-Image-files >
export PERL=<path-to-perl-installation-location >

Simulation results and diff results are located in the L2/tests/aie/<library_element>/logs/status.txt file. To perform an x86 compilation/simulation, run

make run TARGET=x86sim

It is also possible to randomly generate coefficient and input data, or to generate specific stimulus patterns like ALL_ONES, IMPULSE, etc. by running

make run STIM_TYPE=4

L2 Library Element Unit Test¶

Each library element category comes supplied with a test harness, which is an example of how to use the library element subgraph in the context of a super-graph. These test harnesses (graphs) can be found in the L2/tests/aie/<library_element>/test.hpp and L2/tests/aie/<library_element>/test.cpp files.

Each library element filter category also has a reference model which is used by the test harness. The reference model graphs can be found in the L2/tests/aie/inc/<library_element>_ref_graph.hpp file.

Although it is recommended that only L2 (graphs) library elements are instantiated directly in user code, the kernels underlying the graphs can be found in the L1/include/aie/<library_element>.hpp and the L1/src/aie/<library_element>.cpp files.

An example of how a library element may be configured by a parent graph is provided in the L2/examples/fir_129t_sym folder. The example graph, test.h, in the L2/examples/fir_129t_sym folder instantiates the fir_sr_sym graph configured to be a 129-tap filter. This example exposes the ports such that the parent graph can be used to replace an existing 129-tap symmetric filter point solution design.

L2 Library Element Configuration Parameters¶

L2 FIR configuration parameters¶

The list below consists of configurable parameters for FIR library elements with their default values.

Table 13: L2 FIR configuration parameters

Name	Type	Default	Description
DATA_TYPE	typename	cint16	Data Type.
COEFF_TYPE	typename	int16	Coefficient Type.
FIR_LEN	unsigned	81	FIR length.
SHIFT	unsigned	16	Acc results shift down value.
ROUND_MODE	unsigned	0	Rounding mode.
INPUT_WINDOW_VSIZE	unsigned	512	Input window size.
CASC_LEN	unsigned	1	Cascade length.
INTERPOLATE_FACTOR	unsigned	1	Interpolation factor, see note below
DECIMATE_FACTOR	unsigned	1	Decimation factor, see note below
DUAL_IP	unsigned	0	Dual inputs used in symmetric FIRs, see note below
NITER	unsigned	16	Number of iterations to execute.
GEN_INPUT_DATA	bool	true	Generate input data samples. When true, generate stimulus data as defined in: DATA_STIM_TYPE. When false, use the input file defined in: INPUT_FILE
GEN_COEFF_DATA	bool	true	Generate random coefficients. When true, generate stimulus data as defined in: COEFF_STIM_TYPE. When false, use the coefficient file defined in: COEFF_FILE
DATA_STIM_TYPE	unsigned	0	Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave
COEFF_STIM_TYPE	unsigned	0	Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave
INPUT_FILE	string	data/input.txt	Input data samples file. Only used when GEN_INPUT_DATA=false.
COEFF_FILE	string	data/coeff.txt	Coefficient data file. Only used when GEN_COEFF_DATA=false.

Note

The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.

Note

Not all dsplib elements support all of the above configurable parameters. Unsupported parameters which are not used have no impact on execution, e.g., parameter INTERPOLATE_FACTOR is only supported by interpolation filters and will be ignored by other library elements.

L2 FFT configuration parameters¶

For the FFT/iFFT library element the list of configurable parameters and default values is presented below.

Table 14: L2 FFT configuration parameters

Name	Type	Default	Description
DATA_TYPE	typename	cint16	Data Type.
TWIDDLE_TYPE	typename	cint16	Twiddle Type.
POINT_SIZE	unsigned	1024	FFT point size.
SHIFT	unsigned	17	Acc results shift down value.
FFT_NIFFT	unsigned	0	Forward (1) or reverse (0) transform.
WINDOW_VSIZE	unsigned	1024	Input/Output window size. By default, set to: $(POINT_SIZE).
CASC_LEN	unsigned	1	Cascade length.
DYN_PT_SIZE	unsigned	0	Enable (1) Dynamic Point size feature.
NITER	unsigned	4	Number of iterations to execute.
GEN_INPUT_DATA	bool	true	Generate random input data samples. When false, use the input file defined in: INPUT_FILE
STIM_TYPE	unsigned	0	Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave
INPUT_FILE	string	data/input.txt	Input data samples file. Only used when GEN_INPUT_DATA=false.

Note

The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.

L2 Matrix Multiply Configuration Parameters¶

For the Matrix Multiply (GeMM) library element the list of configurable parameters and default values is presented below.

Table 15: L2 Matrix Multiply configuration parameters

Name	Type	Default	Description
T_DATA_A	typename	cint16	Input A Data Type.
T_DATA_B	typename	cint16	Input B Data Type.
P_DIM_A	unsigned	16	Input A Dimension
P_DIM_AB	unsigned	16	Input AB Common Dimension.
P_DIM_B	unsigned	16	Input B Dimension.
SHIFT	unsigned	20	Acc results shift down value.
ROUND_MODE	unsigned	0	Rounding mode.
P_CASC_LEN	unsigned	1	Cascade length.
P_DIM_A_LEADING	unsigned	0	ROW_MAJOR = 0, COL_MAJOR = 1
P_DIM_B_LEADING	unsigned	1	ROW_MAJOR = 0, COL_MAJOR = 1
P_DIM_OUT_LEADING	unsigned	0	ROW_MAJOR = 0, COL_MAJOR = 1
P_ADD_TILING_A	unsigned	1	0 = no additional tiling kernel, 1 = add additional tiling kernel
P_ADD_TILING_B	unsigned	1	0 = no additional tiling kernel, 1 = add additional tiling kernel
P_ADD_DETILING_OUT	unsigned	1	0 = no additional detiling kernel, 1 = add additional detiling kernel
NITER	unsigned	16	Number of iterations to execute.
STIM_TYPE_A	unsigned	0	Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave
STIM_TYPE_B	unsigned	0	Supported types: 0 - random; 3 - impulse; 4 - all ones; 5 - incrementing pattern; 6 - sym incrementing pattern; 8 - sine wave

Note

The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.

L2 Widgets Configuration Parameters¶

For the Widgets library elements the list of configurable parameters and default values is presented below.

Table 16: L2 Widget API Casts Configuration Parameters

Name	Type	Default	Description
DATA_TYPE	typename	cint16	Data Type.
IN_API	unsigned	0	0 = window, 1 = stream
OUT_API	unsigned	0	0 = window, 1 = stream
NUM_INPUTS	unsigned	1	The number of input stream interfaces
WINDOW_VSIZE	unsigned	256	Input/Output window size.
NUM_OUTPUT_CLONES	unsigned	1	The number of output window port copies

Table 17: L2 Widget Real to Complex Configuration Parameters

Name	Type	Default	Description
DATA_TYPE	typename	cint16	Data Type.
DATA_OUT_TYPE	typename	cint16	Output Data Type.
WINDOW_VSIZE	unsigned	256	Input/Output window size.

Note

The ranges of the configurable parameters above may exceed a library element’s maximum supported range, in which case compilation ends with a static_assert error reporting the exceeded range.