i48({{-82, -253}, {643, 467}, {9582, 0}, {-192, 140}});
// Declare shared memory buffers
...
```
It would be tedious to repeat this for every AI Engine, so a utility is provided that extracts this information for all AI Engines. Navigate back to `Emulation-AIE/Work/aie` and type `GetDeclare.sh`. The output starts as follows:
```C++
Row 0
DoubleStream::FIR_MultiKernel_cout<512, 0, false, false> i48({{-82, -253}, {643, 467}, {9582, 0}, {-192, 140}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i49({{0, -204}, {984, 1355}, {7421, 2411}, {-882, 287}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i50({{11, -35}, {550, 1691}, {3936, 2860}, {-1079, 0}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i51({{-198, 273}, {0, 647}, {1023, 1409}, {-755, -245}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i52({{-642, 467}, {538, -1656}, {-200, -615}, {-273, -198}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i53({{-1026, 333}, {2860, -3936}, {0, -1778}, {22, 30}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i54({{-927, 0}, {6313, -4587}, {517, -1592}, {63, 194}});
DoubleStream::FIR_MultiKernel_cin<512, 0, false, false> i55({{-226, -73}, {9113, -2961}, {467, -643}, {0, 266}});
Row 1
DoubleStream::FIR_MultiKernel_cin<512, 0, true, false> i56({{-226, -73}, {9113, -2961}, {467, -643}, {0, 266}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i57({{-82, -253}, {643, 467}, {9582, 0}, {-192, 140}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i58({{0, -204}, {984, 1355}, {7421, 2411}, {-882, 287}});
DoubleStream::FIR_MultiKernel_cincout<512, 0, false, false> i59({{11, -35}, {550, 1691}, {3936, 2860}, {-1079, 0}});
...
```
In row 0, no kernel should discard any sample; in row 1, only the first kernel discards one sample, and so on.
Finally, all the kernels must be connected together, with a cascade stream between neighboring kernels and an input stream feeding each of them.
## Compilation and Analysis
Ensure that `InitPythonPath` has been sourced in the `Utils` directory.
Navigate to the `MultiKernel` directory. In the `Makefile`, three targets are defined:
- `aie`
- Compiles the graph and the kernels
- `aie_sim`
- Runs the AI Engine System C simulator
- `aie_viz`
- Runs `vitis_analyzer` on the output summary
Take a look at the source code (kernel and graph) to familiarize yourself with the C++ instantiation of kernels. In `graph.cpp`, the PL-to-AI Engine connections are declared as 64-bit interfaces running at 500 MHz, allowing for maximum bandwidth on the AI Engine array AXI-Stream network.
To run the simulation, input data must be generated. Change directory to `data` and type `GenerateStreams`. The following parameters should be set for this example:
![missing image](../Images/generateDualStreamsSSR8.jpg)
Click **Generate** then **Exit**. The generated files `PhaseIn_0_0.txt` ... `PhaseIn_7_7.txt` should contain mainly 0's, with a few 1's and 2's. The number of samples per stream is half of the value declared in the C++ code, because that declared length covers the concatenation of both input streams.
Type `make all` and wait for the `vitis_analyzer` GUI to open. Vitis analyzer can show the graph, how it has been implemented in the device, and the complete timeline of the simulation. In this case the graph is no longer simple: it contains 64 kernels mapped onto 64 AI Engines.
Click **Graph** to visualize the graph of the application:
![missing image](../Images/Graph8Phases.jpg)
The 64 kernels and their 32 independent input streams are clearly visible. The top graph is for the output phases 0, 2, 4, and 6, the phases where the cascade stream goes from left to right on the physical device, and the bottom graph is for the phases 1, 3, 5, and 7 where the cascade stream goes from right to left.
Click **Array** to visualize where the kernels have been placed, and how they are fed from the PL:
![missing image](../Images/Array8Phases.jpg)
In this view, the cascade streams connecting neighboring AI Engines are key to the performance of this graph. With the four location constraints that were added, the placer had only one solution for the kernel placement: this square. The router had an easy job feeding all these kernels, simply using the south-north AXI-Streams. The path back to the PL from the extremities also uses only vertical AXI-Streams.
Finally click **Trace** to look at how the entire simulation went through. This may be useful to track where your AI Engine stalls if performance is not as expected:
Now the output of the filter can be displayed. The input being a set of Dirac impulses, the impulse response of the filter should be recognizable throughout the waveform. Navigate to `Emulation-AIE/aiesimulator_output/data` and look at `output_0.txt`. Each line is prepended with a time stamp and contains two complex outputs. Type `ProcessAIEOutput output_*` to display them:
![missing image](../Images/GraphOutput8Phases.jpg)
The top graph shows the real part of the output, the bottom graph the imaginary part. On both, the filter impulse response is recognizable.
The performance of this architecture can be measured using the timestamped output. In the same directory (`Emulation-AIE/aiesimulator_output/data`) type `StreamThroughput output_*`:
```
output_0_0.txt --> 896.67 Msps
output_0_1.txt --> 896.67 Msps
output_1_0.txt --> 891.99 Msps
output_1_1.txt --> 893.54 Msps
output_2_0.txt --> 896.67 Msps
output_2_1.txt --> 896.67 Msps
output_3_0.txt --> 891.99 Msps
output_3_1.txt --> 893.54 Msps
output_4_0.txt --> 898.25 Msps
output_4_1.txt --> 896.67 Msps
output_5_0.txt --> 891.99 Msps
output_5_1.txt --> 893.54 Msps
output_6_0.txt --> 898.25 Msps
output_6_1.txt --> 896.67 Msps
output_7_0.txt --> 891.99 Msps
output_7_1.txt --> 893.54 Msps
-----------------------
Total Throughput --> 14318.64 Msps
```
This architecture achieves slightly over 14 Gsps, less than the expected maximum (16 Gsps) because of the cycles spent on initialization each time the kernels are called. The performance improves as the frame length increases. For a 32K-sample frame length the performance obtained is:
```
Total Throughput --> 15960.30 Msps
```
This is almost the expected maximum.
Copyright © 2020–2021 Xilinx
XD020