AI Engine Development

# AI Engine GMIO Performance Profile

> **Note**: This tutorial targets the [VCK190 ES board](https://www.xilinx.com/products/boards-and-kits/vck190.html). This board is currently available via early access. If you have already purchased this board, download the necessary files from the lounge and ensure you have the correct licenses installed. If you do not have a board and ES license, contact your Xilinx sales contact.

The AI Engine tools map each GMIO port to a tile DMA channel one-to-one. Mapping multiple GMIO ports to one tile DMA channel is not supported. The number of GMIO ports supported for a given device is limited. For example, the XCVC1902 device on the VCK190 board has 16 AI Engine to NoC master units (NMUs) in total. Each AI Engine to NMU supports two MM2S and two S2MM channels. Hence a maximum of 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs can be supported, although the existing hardware platform can limit this further.

This example uses 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs in the graph, and profiles the performance from one input and one output up to 32 inputs and 32 outputs using several methods. You will then learn about the NoC bandwidth and the advantages and disadvantages of choosing GMIO for data transfer.

## Design Introduction

This design has a graph with 32 AI Engine kernels. Each kernel has one input and one output. Thus, 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are connected to the graph.

Change the working directory to `perf_profile_aie_gmio`. Take a look at the graph code in `aie/graph.h`:
```cpp
static const int col[32]={6,13,14,45,18,42,4,30,48,49,9,16,29,39,40,31,2,3,46,0,43,27,41,26,11,17,47,1,19,10,34,7};

class mygraph: public adf::graph
{
private:
  adf::kernel k[32];

public:
  adf::port<adf::direction::out> dout[32];
  adf::port<adf::direction::in> din[32];

  mygraph()
  {
    for(int i=0;i<32;i++){
      k[i] = adf::kernel::create(vec_incr);
      adf::connect<adf::window<1024>>(din[i], k[i].in[0]);   // 1024-byte input window
      adf::connect<adf::window<1032>>(k[i].out[0], dout[i]); // 1032-byte output window
      adf::source(k[i]) = "vec_incr.cc";
      adf::runtime<adf::ratio>(k[i])= 1;
      adf::location<adf::kernel>(k[i])=adf::tile(col[i],0);
    }
  };
};
```

The code above sets a location constraint (`adf::location`) for each kernel. This is to save time for `aiecompiler`. Note that each kernel has an input window size of 1024 bytes and an output window size of 1032 bytes.

Next, examine the kernel code `aie/vec_incr.cc`. It adds one to each int32 input and additionally outputs the cycle counter of the AI Engine tile. As introduced later, this counter can be used to calculate the system throughput.

```cpp
void vec_incr(input_window_int32* data,output_window_int32* out){
  alignas(32) int32 const1[16]={1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
  v16int32 vec1=*(v16int32*)const1;
  for(int i=0;i<16;i++)
  chess_prepare_for_pipelining
  chess_loop_range(4,)
  {
    v16int32 vdata=window_readincr_v16(data);
    v16int32 vresult=add16(vdata,vec1);
    window_writeincr(out,vresult);
  }
  unsigned long long time=get_cycles(); //cycle counter of the AI Engine tile
  window_writeincr(out,time);
}
```

Next, examine the host code `aie/graph.cpp`. The concepts introduced in [AIE GMIO Programming Model](./single_aie_gmio.md) apply here. This section focuses on the new concepts and on how to do performance profiling. Some constants defined in the code are as follows:

```cpp
#if !defined(__AIESIM__) && !defined(__X86SIM__)
const int ITERATION=4096;
#else
const int ITERATION=4;
#endif
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;
```

For the hardware flow, `ITERATION` is 4096; otherwise, it is 4. This ensures that the AI Engine simulator can finish in a short amount of time.

In the main function, the PS code profiles `num` GMIO inputs and outputs, where `num` takes the values 1, 2, 4, 8, 16, and 32. Non-blocking GMIO APIs (`GMIO::gm2aie_nb` and `GMIO::aie2gm_nb`) are used for GMIO transactions, and `GMIO::wait` is used for output data synchronization. A kernel can finish only after its input and output data have been transferred. Because the graph is started for all the AI Engine kernels but only some of them are profiled, the remaining kernels are flushed after the profiling code by transferring data to and from them.

```cpp
for(int num=1;num<=32;num*=2){
  //Pre-processing
  for(int i=0;i<32;i++){
    for(int j=0;j
...
```

```cpp
...
(timeEnd - mTimeStart).count();
  }
  void reset() { mTimeStart = std::chrono::high_resolution_clock::now(); }
};
```

The code to start profiling is as follows:

```cpp
Timer timer;
```

The code to end profiling and calculate performance is as follows:

```cpp
double timer_stop=timer.stop();
double throughput=(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/timer_stop*1000000/1024/1024;
std::cout<<"Throughput (by timer GMIO in num="<
...
```

```cpp
...
the_last){
    the_last=end[i];
  }
}
std::cout<<"Throughput (by AIE kernel cycles in="<
...
```
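The timer-based throughput formula shown above, and the idea of deriving elapsed time from the cycle counters that each kernel appends to its output window, can be sketched on the host side as follows. This is a minimal, self-contained illustration under stated assumptions, not the tutorial's own code: `StopWatch`, `bytes_moved`, `throughput_mb_per_s`, `latest_cycle`, and `cycles_to_us` are hypothetical names, and the clock frequency passed to `cycles_to_us` is an assumption that has to be replaced by the actual AI Engine clock of the platform. Which cycle counters mark the start and the end of the measured interval is also left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stopwatch: reports elapsed wall-clock time in microseconds.
class StopWatch {
  std::chrono::steady_clock::time_point start_;
public:
  StopWatch() { reset(); }
  void reset() { start_ = std::chrono::steady_clock::now(); }
  double elapsed_us() const {
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start_).count();
  }
};

// Per-iteration window sizes taken from the graph: 1024 bytes in, 1032 bytes out.
constexpr std::size_t kInBytesPerIter = 1024;
constexpr std::size_t kOutBytesPerIter = 1032;

// Total bytes moved through `num` input/output port pairs over `iterations` iterations.
std::size_t bytes_moved(int num, int iterations) {
  assert(num > 0 && iterations > 0);
  return (kInBytesPerIter + kOutBytesPerIter) *
         static_cast<std::size_t>(iterations) * static_cast<std::size_t>(num);
}

// Throughput in MB/s (1 MB = 1024*1024 bytes) from a byte count and elapsed microseconds.
double throughput_mb_per_s(std::size_t bytes, double elapsed_us) {
  assert(elapsed_us > 0.0);
  return static_cast<double>(bytes) / elapsed_us * 1e6 / (1024.0 * 1024.0);
}

// Largest cycle count among the per-port end counters.
std::uint64_t latest_cycle(const std::vector<std::uint64_t>& end_cycles) {
  assert(!end_cycles.empty());
  std::uint64_t the_last = end_cycles[0];
  for (std::uint64_t c : end_cycles) {
    if (c > the_last) the_last = c;
  }
  return the_last;
}

// Converts a cycle-count span to microseconds.
// ASSUMPTION: clock_mhz is the AI Engine clock in MHz; the caller supplies the real value.
double cycles_to_us(std::uint64_t first_cycle, std::uint64_t last_cycle, double clock_mhz) {
  assert(last_cycle > first_cycle && clock_mhz > 0.0);
  return static_cast<double>(last_cycle - first_cycle) / clock_mhz;
}
```

With these helpers, `throughput_mb_per_s(bytes_moved(num, ITERATION), watch.elapsed_us())` mirrors the timer-based calculation, while `throughput_mb_per_s(bytes, cycles_to_us(first, latest_cycle(ends), clock_mhz))` gives a cycle-based estimate that excludes host-side software overhead.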
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright© 2020–2021 Xilinx
XD007