AI Engine Development

See Vitis™ Development Environment on xilinx.com
See Vitis-AI™ Development Environment on xilinx.com

AI Engine GMIO Performance Profile

Note: This tutorial targets the VCK190 ES board. This board is currently available via early access. If you have already purchased this board, download the necessary files from the lounge and ensure you have the correct licenses installed. If you do not have a board and ES license, contact your Xilinx sales contact.

AI Engine tools support mapping the GMIO port to the tile DMA one-to-one. It does not support mapping multiple GMIO ports to one tile DMA channel. There is a limit on the number of GMIO ports supported for a given device. For example, the XCVC1902 device on the VCK190 board has 16 AI Engine to NoC master unit (NMU) in total. For each AI Engine to NMU, it supports two MM2S and two S2MM channels. Hence there can be a maximum of 32 AI Engine GMIO inputs, 32 AI Engine GMIO outputs supported, but note that it can be further limited by the existing hardware platform.

In this example, we will utilize 32 AI Engine GMIO inputs, 32 AI Engine GMIO outputs in the graph, and profile the performance from one input and one output to 32 inputs and 32 outputs through various ways. Then you will learn about the NOC bandwidth and the advantages and disadvantages of choosing GMIO for data transfer.

Design Introduction

This design has a graph that has 32 AI Engine kernels. Each kernel has one input and one output. Thus, 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are connected to the graph.

Change the working directory to perf_profile_aie_gmio. Take a look at the graph code in aie/graph.h.

static const int col[32]={6,13,14,45,18,42,4,30,48,49,9,16,29,39,40,31,2,3,46,0,43,27,41,26,11,17,47,1,19,10,34,7};

class mygraph: public adf::graph
{
private:
  adf::kernel k[32];

public:
  adf::port<adf::direction::out> dout[32];
  adf::port<adf::direction::in> din[32];

  mygraph()
  {
     for(int i=0;i<32;i++){
      k[i] = adf::kernel::create(vec_incr);
      adf::connect<adf::window<1024>>(din[i], k[i].in[0]);
      adf::connect<adf::window<1032>>(k[i].out[0], dout[i]);
      adf::source(k[i]) = "vec_incr.cc";
      adf::runtime<adf::ratio>(k[i])= 1;
      adf::location<adf::kernel>(k[i])=adf::tile(col[i],0);
     }
  };
};

In the code above, there are location constraints adf::location for each kernel. This is to save time for aiecompiler. Note that each kernel has an input window size of 1024 bytes and output window size of 1032 bytes.

Next, examine the kernel code aie/vec_incr.cc. It adds each int32 input by one and additionally outputs the cycle counter of the AI Engine tile. Due to the later introduction, this counter can be used to calculate the system throughput.

void vec_incr(input_window_int32* data,output_window_int32* out){
  alignas(32) int32 const1[16]={1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
  v16int32 vec1=*(v16int32*)const1;
  for(int i=0;i<16;i++)
  chess_prepare_for_pipelining
  chess_loop_range(4,)
  {
    v16int32 vdata=window_readincr_v16(data);
    v16int32 vresult=add16(vdata,vec1);
    window_writeincr(out,vresult);
  }
  unsigned long long time=get_cycles(); //cycle counter of the AI Engine tile
  window_writeincr(out,time);
}

Next, examine the host code aie/graph.cpp. The concepts introduced in AIE GMIO Programming Model apply here. We will focus on the new concepts and how to do performance profiling. Some constants defined in the code are as follows:

#if !defined(__AIESIM__) && !defined(__X86SIM__)
const int ITERATION=4096;
#else
const int ITERATION=4;
#endif
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;

If it is for hardware flow, ITERATION is 4096 otherwise, it is four. This is to make sure that the AI Engine simulator can finish in a short amount of time.

In the main function, the PS code is going to profile num GMIO inputs and outputs, and num is from 1, 2, 4, to 32. Non-blocking GMIO APIs (GMIO::gm2aie_nb and GMIO::aie2gm_nb) are used for GMIO transactions, and GMIO::wait is used for output data synchronization. Only when the input and output data are transferred for the kernel, can the kernel be finished. This is because the graph is started for all the AI Engine kernels, but only some of the kernels are profiled. After the code for profiling, the remaining kernels are flushed by transferring data to and from the remaining AI Engine kernels.

for(int num=1;num<=32;num*=2){
  //Pre-processing
  for(int i=0;i<32;i++){
    for(int j=0;j<BLOCK_SIZE_in_Bytes/sizeof(int);j++){
     dinArray[i][j]=j+num;
    }
   }
   gr.run(ITERATION);

   //Profile starts here
   for(int i=0;i<num;i++){
    gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
    gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
   }
   for(int i=0;i<num;i++){
    gmioOut[i].wait();
   }
   //Profile ends here

   //check output correctness
   for(int i=0;i<num;i++){
        for(int j=0;j<BLOCK_SIZE_out_Bytes/sizeof(int);j++){
          if(j%258!=256 && j%258!=257 && doutArray[i][j]!=j+num+1-j/258*2){
             std::cout<<"ERROR:dout["<<i<<"]["<<j<<"]="<<doutArray[i][j]<<std::endl;
             error++;
           break;
          }
       }
    }

  //flush remain stalling kernels
  for(int i=num;i<32;i++){
   gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
   gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
  }
  gr.wait();
}

Performance Profiling Methods

In this example, we will introduce some methods for profiling the design. The code to be profiled is in aie/graph.cpp:

//Profile starts here
for(int i=0;i<num;i++){
  gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
  gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
}
for(int i=0;i<num;i++){
  gmioOut[i].wait();
}
//Profile ends here

Note: This tutorial assumes that AI Engine runs at 1 GHz.

  1. Profile by XRT xrtGraphTimeStamp API

XRT provides xrtGraphTimeStamp API to get the timestamp of a graph. The unit of timestamp is AI Engine Cycle. An XRT API is required to open graph to get the graph handle. The code to start profiling is as follows:

xuid_t uuid;
xrtDeviceGetXclbinUUID(dhdl, uuid);
auto ghdl=xrtGraphOpen(dhdl,uuid,"gr");
long long t_stamp=xrtGraphTimeStamp(ghdl);

The code to end profiling and calculate performance is as follows:

t_stamp=xrtGraphTimeStamp(ghdl)-t_stamp;
std::cout<<"Throughput (by graph timstamp) bandwidth(GMIO in num="<<num<<",out num="<<num<<"):\t"<<(double)(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/(double)t_stamp*1000<<"M bytes/s"<<std::endl;

The code is guarded by macro __TIME_STAMP__. To use this method of profiling, define __TIME_STAMP__ for g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__TIME_STAMP__ -I$(XILINX_HLS)/include/ -I${SDKTARGETSYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SDKTARGETSYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

To run it in hardware, use the following make command to build the hardware image:

make package TARGET=hw

After the package is done, run the following commands in the Linux prompt after booting Linux from an SD card:

export XILINX_XRT=/usr
cd /mnt/sd-mmcblk0p1
./host.exe a.xclbin

The output is as follows:

GMIO::malloc completed
Throughput (by graph timstamp) bandwidth(GMIO in num=1,out num=1):5331.96M bytes/s
Throughput (by graph timstamp) bandwidth(GMIO in num=2,out num=2):8575.5M bytes/s
Throughput (by graph timstamp) bandwidth(GMIO in num=4,out num=4):9833.88M bytes/s
Throughput (by graph timstamp) bandwidth(GMIO in num=8,out num=8):10426.6M bytes/s
Throughput (by graph timstamp) bandwidth(GMIO in num=16,out num=16):10498.1M bytes/s
Throughput (by graph timstamp) bandwidth(GMIO in num=32,out num=32):10750.4M bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!
  1. Profile by C++ class API

The code to use C++ class API is common for Linux system for various platforms. The Timer is defined as follows:

class Timer {
  std::chrono::high_resolution_clock::time_point mTimeStart;
  public:
    Timer() { reset(); }
    long long stop() {
      std::chrono::high_resolution_clock::time_point timeEnd =
          std::chrono::high_resolution_clock::now();
      return std::chrono::duration_cast<std::chrono::microseconds>(timeEnd -
                                                                  mTimeStart)
          .count();
    }
    void reset() { mTimeStart = std::chrono::high_resolution_clock::now(); }
};

The code to start profiling is as follows:

Timer timer;

The code to end profiling and calculate performance is as follows:

double timer_stop=timer.stop();
double throughput=(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/timer_stop*1000000/1024/1024;
std::cout<<"Throughput (by timer GMIO in num="<<num<<",out num="<<num<<"):\t"<<throughput<<"M Bytes/s"<<std::endl;

The code is guarded by macro __TIMER__. To use this method of profiling, define __TIMER__ for g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__TIMER__ -I$(XILINX_HLS)/include/ -I${SYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware is same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
Throughput (by timer GMIO in num=1,out num=1):5076.64M Bytes/s
Throughput (by timer GMIO in num=2,out num=2):8335.5M Bytes/s
Throughput (by timer GMIO in num=4,out num=4):9543.97M Bytes/s
Throughput (by timer GMIO in num=8,out num=8):9717.18M Bytes/s
Throughput (by timer GMIO in num=16,out num=16):10154.1M Bytes/s
Throughput (by timer GMIO in num=32,out num=32):10246.8M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!
  1. Profile by AI Engine cycles got from AI Engine kernels

In this design, we output the AI Engine cycles at the end of each iteration. Each iteration produces 256 int32 data, plus a long long AI Engine cycle counter number. We record the very beginning cycle and the last cycle of all the AI Engine kernels to be profiled because multiple AI Engine kernels start at different cycles though they are enabled by the same graph::run. Thus, we can calculate the system throughput for all the kernels.

Note that there is some gap between the actual performance and the calculated number because there is already some data transfer before the recorded starting cycle. However, the overhead is negligible when the total iteration number is high, which is 4096 in this example.

The code to get AI Engine cycles and calculate the system throughput is as follows:

long long start[32];
long long end[32];
long long very_beginning=0x0FFFFFFFFFFFFFFF;
long long the_last=0;
for(int i=0;i<num;i++){
  start[i]=*(long long*)&doutArray[i][256];
  end[i]=*(long long*)&doutArray[i][BLOCK_SIZE_out_Bytes/sizeof(int)-2];
  if(start[i]<very_beginning){
    very_beginning=start[i];
  }
  if(end[i]>the_last){
    the_last=end[i];
  }
}
std::cout<<"Throughput (by AIE kernel cycles in="<<num<<",out="<<num<<") ="<<(double)(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/(double)(the_last-very_beginning)*1000<<"M Bytes/s"<<std::endl;

The code is guarded by macro __AIE_CYCLES__. To use this method of profiling, define __AIE_CYCLES__ for g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__AIE_CYCLES__ -I$(XILINX_HLS)/include/ -I${SYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware are the same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
Throughput (by AIE kernel cycles in=1,out=1) =6545.68M Bytes/s
Throughput (by AIE kernel cycles in=2,out=2) =10267.4M Bytes/s
Throughput (by AIE kernel cycles in=4,out=4) =11036.8M Bytes/s
Throughput (by AIE kernel cycles in=8,out=8) =10807.9M Bytes/s
Throughput (by AIE kernel cycles in=16,out=16) =10958.2M Bytes/s
Throughput (by AIE kernel cycles in=32,out=32) =10892.8M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!
  1. Profile by event API

The AI Engine has hardware performance counters and can be configured to count hardware events for measuring performance metrics. The API used in this example is to profile graph throughput regarding the specific GMIO port. There may be confliction when multiple GMIO ports are used for event API because of the restriction that performance counter is shared between GMIO ports that access the same AI Engine-PL interface column. Thus, we only profile one GMIO output to show this methodology.

The code to start profiling is as follows:

std::cout<<"total input/output num="<<num<<std::endl;
event::handle handle[32];
for(int i=0;i<1;i++){
  handle[i] = event::start_profiling(gmioOut[i], event::io_stream_start_to_bytes_transferred_cycles, BLOCK_SIZE_out_Bytes);
}

The code to end profling and calculate performance is as follows:

long long cycle_count[32];
for(int i=0;i<1;i++){
  cycle_count[i] = event::read_profiling(handle[i]);
  event::stop_profiling(handle[i]);
}
for(int i=0;i<1;i++){
  double bandwidth = (double)BLOCK_SIZE_out_Bytes / ((double)cycle_count[i] ) *1000; //byte per second
  std::cout<<"Throughput (by event API) gmioOut["<<i<<"] bandwidth="<<bandwidth<<"M Bytes/s"<<std::endl;
}

In this example, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event to the event that indicates BLOCK_SIZE_out_Bytes bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred.

For detailed usage about event API, refer to the Versal ACAP AI Engine Programming Environment User Guide (UG1076).

The code is guarded by macro __USE_EVENT_PROFILE__. To use this method of profiling, define __USE_EVENT_PROFILE__ for g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__USE_EVENT_PROFILE__ -I$(XILINX_HLS)/include/ -I${SDKTARGETSYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SDKTARGETSYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware is same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
total input/output num=1
Throughput (by event API) gmioOut[0] bandwidth=3321.1M Bytes/s
total input/output num=2
Throughput (by event API) gmioOut[0] bandwidth=2737.72M Bytes/s
total input/output num=4
Throughput (by event API) gmioOut[0] bandwidth=1608.08M Bytes/s
total input/output num=8
Throughput (by event API) gmioOut[0] bandwidth=1044.15M Bytes/s
total input/output num=16
Throughput (by event API) gmioOut[0] bandwidth=972.539M Bytes/s
total input/output num=32
Throughput (by event API) gmioOut[0] bandwidth=989.08M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!

It is seen that when used GMIO ports number increases, the performance for the specific GMIO port drops, indicating that the total system throughput is limited by NOC and DDR bandwidth.

Conclusion

In this tutorial, you learned about:

  • Programming model for AI Engine GMIO

  • Ways to profile system performance and also, about the NOC and DDR bandwidth.



Licensed under the Apache License, Version 2.0 (the “License”);

you may not use this file except in compliance with the License.

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an “AS IS” BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and

limitations under the License.

Copyright© 2020–2021 Xilinx
XD007