AI Engine Development

# AI Engine GMIO Programming Model

This example introduces the AI Engine GMIO programming model. It includes three steps:

* [Step 1 - Synchronous GMIO Transfer](#Step-1---Synchronous-GMIO-Transfer)
* [Step 2 - Asynchronous GMIO transfer for Input and Synchronous GMIO transfer for Output](#Step-2---Asynchronous-GMIO-transfer-for-Input-and-Synchronous-GMIO-transfer-for-Output)
* [Step 3 - Asynchronous GMIO Transfer and Hardware flow](#Step-3---Asynchronous-GMIO-Transfer-and-Hardware-flow)

We will use AI Engine simulator event trace in each step to see how performance can be improved step by step. The last step introduces code to make GMIO work in hardware.

### Step 1 - Synchronous GMIO Transfer

In this step, we introduce the synchronous GMIO transfer mode. Change the working directory to `single_aie_gmio/step1`. Looking at the graph code `aie/graph.h`, it can be seen that the design has one output `out` and one input `din`, with an AI Engine kernel `weighted_sum_with_margin`.

```cpp
class mygraph: public adf::graph
{
private:
  adf::kernel k_m;

public:
  adf::port out;
  adf::port din;

  mygraph()
  {
    k_m = adf::kernel::create(weighted_sum_with_margin);

    adf::connect>(din, k_m.in[0]);
    adf::connect>(k_m.out[0], out);

    adf::source(k_m) = "weighted_sum.cc";
    adf::runtime(k_m)= 0.9;
  };
};
```

Examine the host code in `aie/graph.cpp`. It is seen that two `GMIO` ports, `gmioIn` and `gmioOut`, are instantiated and they are connected to the platform input and output. Then the graph is instantiated and connected to the platform.

```cpp
using namespace adf;

GMIO gmioIn("gmioIn",64,1000);
GMIO gmioOut("gmioOut",64,1000);
adf::simulation::platform<1,1> platform(&gmioIn,&gmioOut);

mygraph gr;
adf::connect<> net0(gr.out, platform.sink[0]);
adf::connect<> net1(platform.src[0], gr.din);
```

The GMIO instantiation `gmioIn` represents the DDR memory space to be read by the AI Engine and `gmioOut` represents the DDR memory space to be written by the AI Engine.
The constructor specifies the logical name of the GMIO, the burst length of the memory-mapped AXI4 transaction (64, 128, or 256 bytes), and the required bandwidth in MB/s (here 1000 MB/s).

Inside the main function, two 256-element int32 arrays (1024 bytes each) are allocated by `GMIO::malloc`. `dinArray` points to the memory space to be read by the AI Engine and `doutArray` points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to `GMIO::gm2aie_nb`, `GMIO::aie2gm_nb`, `GMIO::gm2aie`, and `GMIO::aie2gm` must be allocated by `GMIO::malloc`. After the input data is allocated, it can be initialized.

```cpp
int32* dinArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
int32* doutArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
```

`doutRef` holds the golden output reference. It can be allocated by a standard `malloc` because it is not involved in any GMIO transfer.

```cpp
int32* doutRef=(int32*)malloc(BLOCK_SIZE_in_Bytes);
```

`GMIO::gm2aie` and `GMIO::gm2aie_nb` initiate read transfers, in which the AI Engine reads from DDR memory using memory-mapped AXI transactions. The first argument of `GMIO::gm2aie` and `GMIO::gm2aie_nb` is the pointer to the start address of the memory space for the transaction (here `dinArray`). The second argument is the transaction size in bytes. The memory space for the transaction must lie within the memory space allocated by `GMIO::malloc`. Similarly, `GMIO::aie2gm` and `GMIO::aie2gm_nb` initiate write transfers from the AI Engine to DDR memory.

`GMIO::gm2aie_nb` and `GMIO::aie2gm_nb` are non-blocking functions that return immediately once the transaction is issued; they do not wait for the transaction to complete. In contrast, `GMIO::gm2aie` and `GMIO::aie2gm` are blocking: they return only after the transaction completes.
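
The difference between the blocking and non-blocking calls can be pictured with a small host-only model. The sketch below is an illustration under stated assumptions, not the ADF API: `MockGmio`, `transfer`, `transfer_nb`, `wait`, `DemoResult`, and `demo_transfers` are hypothetical names, and the mock simply defers the copy until `wait()` is called, whereas real hardware would progress the transfer in the background.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical host-side stand-in for a GMIO port. This is NOT the ADF API;
// it only models "done on return" versus "issued on return".
class MockGmio {
public:
    // Blocking, in the spirit of GMIO::gm2aie / GMIO::aie2gm:
    // the data has been copied by the time the call returns.
    void transfer(int32_t* dst, const int32_t* src, std::size_t bytes) {
        std::memcpy(dst, src, bytes);
    }

    // Non-blocking, in the spirit of GMIO::gm2aie_nb / GMIO::aie2gm_nb:
    // the copy is only queued here. In this mock it runs inside wait().
    void transfer_nb(int32_t* dst, const int32_t* src, std::size_t bytes) {
        pending_.push_back([dst, src, bytes] { std::memcpy(dst, src, bytes); });
    }

    // Completes every transfer queued by transfer_nb, then clears the queue.
    void wait() {
        for (auto& job : pending_) job();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

// Records when the destination buffers actually hold the source data.
struct DemoResult {
    bool blocking_done_on_return;
    bool nb_done_before_wait;
    bool nb_done_after_wait;
};

DemoResult demo_transfers(std::size_t count) {
    std::vector<int32_t> src(count), sync_dst(count, -1), async_dst(count, -1);
    for (std::size_t i = 0; i < count; ++i) src[i] = static_cast<int32_t>(i);
    const std::size_t bytes = count * sizeof(int32_t);

    MockGmio port;
    DemoResult r{};
    port.transfer(sync_dst.data(), src.data(), bytes);
    r.blocking_done_on_return = (sync_dst == src);   // safe to read immediately
    port.transfer_nb(async_dst.data(), src.data(), bytes);
    r.nb_done_before_wait = (async_dst == src);      // not yet guaranteed
    port.wait();
    r.nb_done_after_wait = (async_dst == src);       // guaranteed from here on
    return r;
}
```

Calling `demo_transfers(256)` mirrors the 256-element int32 buffers used in this step. The point of the model is the ordering rule: after a non-blocking call, the host should not rely on the buffer contents until an explicit synchronization has completed.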
```cpp
gmioIn.gm2aie(dinArray,BLOCK_SIZE_in_Bytes);
gr.run(ITERATION);
gmioOut.aie2gm(doutArray,BLOCK_SIZE_in_Bytes);
```

The blocking transfer (`gmioIn.gm2aie`) has to complete before `gr.run()` because the GMIO transfer is in synchronous mode here. However, the window input of the graph (ping-pong buffered by default) has only two buffers to store the received data. This means that at most two blocks of window input data can be transferred by a blocking GMIO transfer before the graph runs. Beyond that, `GMIO::gm2aie` blocks the design. In this example program, `ITERATION` is set to one.

Because `GMIO::aie2gm()` works in synchronous mode, output processing can start as soon as the call returns.

__Note:__ The memory is non-cacheable for GMIO in Linux.

In the example program, the design runs four iterations in a loop. In the loop, pre-processing and post-processing are done before and after data transfer.

```cpp
for(int i=0;i<4;i++){
    //pre-processing
    for(int j=0;j
```
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright© 2020–2021 Xilinx
XD007