Overview

In the 2021.1 release, the Vitis Vision library added a few functions that are implemented on the AI Engine™ of Xilinx Versal ACAP devices and validated on VCK190 boards. These implementations exploit the VLIW and SIMD vector processing capabilities of the AI Engine™.

Basic Features

To process high resolution images, the xfcvDataMovers component is also provided. It divides the image into tiled units and uses efficient data movers to manage the transfer of tiles to and from the AI Engine™ cores. You can find more information on the types of data movers and their usage in the Getting Started with Vitis Vision AIEngine Library Functions section.

Vitis Vision AIE Library Contents

Vitis Vision AIEngine™ files are organized into the following directories:

Table. Vitis Vision AIE Library Contents

Folder                    Details
L1/include/aie/imgproc    Contains header files of vision AIEngine™ functions
L1/include/aie/common     Contains header files of data movers and other utility functions
L1/lib/sw                 Contains the data-mover library object files
L2/tests/aie              Contains the ADF graph code and host code using data movers and vision AIEngine™ functions from L1/include/aie

Getting Started with Vitis Vision AIE

This section describes the methodology to accelerate Vitis Vision AIE library functions on Versal adaptive compute acceleration platforms (ACAPs). It covers the creation of Adaptive Data Flow (ADF) graphs, setting up the virtual platform, and writing the corresponding host code. It also covers the various verification models, including x86-based simulation, cycle-accurate AIE simulation, HW emulation, and HW runs, using a suitable Makefile.

AIE Prerequisites

  1. Valid installation of Vitis™ 2021.2 or later version and the corresponding licenses.
  2. Install the Vitis Vision libraries, if you intend to use libraries compiled differently than what is provided in Vitis.
  3. Install the card for which the platform is supported in Vitis 2021.2 or later versions.
  4. If targeting an embedded platform, set up the evaluation board.
  5. Xilinx® Runtime (XRT) must be installed. XRT provides a software interface to Xilinx FPGAs.
  6. Install/compile OpenCV libraries (with a compatible libjpeg.so). The appropriate compiler version (x86/aarch32/aarch64) must be used based on the processor of the target board.

Note

All Vitis Vision AIE library functions were tested against OpenCV version 4.4.0.

Vitis AIE Design Methodology

The following are the critical components in making a kernel work on a platform using Vitis™:

  1. Prepare the Kernels
  2. Data Flow Graph construction
  3. Setting up platform ports
  4. Host code integration
  5. Makefile to compile the kernel for x86 simulation / AIE simulation / HW emulation / HW runs

Prepare the Kernels

Kernels are computation functions that form the fundamental building blocks of the data flow graph specifications. Kernels are declared as ordinary C/C++ functions that return void and can use special data types as arguments (discussed in Window and Streaming Data API). Each kernel should be defined in its own source file. This organization is recommended for reuse and faster compilation. Furthermore, the kernel source files should include all relevant header files to allow for independent compilation. It is recommended that a header file (kernels.h in this documentation) declare the function prototypes for all kernels used in a graph. An example is shown below.

#ifndef _KERNELS_16B_H_
#define _KERNELS_16B_H_

#include <adf/stream/types.h>
#include <adf/window/types.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PARALLEL_FACTOR_16b 16 // Parallelization factor for 16b operations (16x mults)
#define SRS_SHIFT 10           // SRS shift used (can be increased if input data is likewise adjusted)

void filter2D(input_window_int16* input, const int16_t (&coeff)[16], output_window_int16* output);

#endif

The functions packaged with the Vitis Vision AIE library are pre-optimized vector implementations of various computer vision tasks. These functions can be directly included in a user kernel, as shown in the example below:

#include "imgproc/xf_filter2d_16b_aie.hpp"
#include "kernels.h"

void filter2D(input_window_int16* input, const int16_t (&coeff)[16], output_window_int16* output) {
        xf::cv::aie::filter2D_k3_border(input, coeff, output);
}

Data Flow Graph construction

Once the AIE kernels have been prepared, the next step is to create a data flow graph class which defines the top-level ports, run-time parameters, connectivity, constraints, and so on. This consists of the following steps:

  1. Create graph.h and include the Adaptive Data Flow (ADF) header file (adf.h). Also include the header file with the kernel function prototypes (kernels.h).

    #include <adf.h>
    #include "kernels.h"
    
  2. Define your graph class by using the objects which are defined in the adf namespace. All user graphs are derived from the class graph.

    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class myGraph : public graph {
    private:
        kernel k1;
    };
    
  3. Add top-level ports to the graph. These ports are responsible for data transfers to/from the kernels.

    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
        kernel k1;
    
    public:
        port<input> inptr;
        port<output> outptr;
        port<input> kernelCoefficients;
    };
    
  4. Specify connections of the top-level ports to the kernels. The primary connection types are window, stream, and run-time parameter. Below is example code specifying connectivity.

    class myGraph : public adf::graph {
    private:
        kernel k1;
    public:
        port<input> inptr;
        port<output> outptr;
        port<input> kernelCoefficients;
    
        myGraph() {
            k1 = kernel::create(filter2D);
            adf::connect<window<TILE_WINDOW_SIZE> >(inptr, k1.in[0]);
            adf::connect<parameter>(kernelCoefficients, async(k1.in[1]));
            adf::connect<window<TILE_WINDOW_SIZE> >(k1.out[0], outptr);
        }
    };
    
  5. Specify the source file location and other constraints for each kernel.

    class myGraph : public adf::graph {
    private:
        kernel k1;
    public:
        port<input> inptr;
        port<output> outptr;
        port<input> kernelCoefficients;
    
        myGraph() {
            k1 = kernel::create(filter2D);
            adf::connect<window<TILE_WINDOW_SIZE> >(inptr, k1.in[0]);
            adf::connect<parameter>(kernelCoefficients, async(k1.in[1]));
            adf::connect<window<TILE_WINDOW_SIZE> >(k1.out[0], outptr);
            source(k1) = "xf_filter2d.cc";
            // Initial mapping
            runtime<ratio>(k1) = 0.5;
        }
    };
    

Setting up platform ports

The next step is to create a graph.cpp file with the platform ports and virtual platform specification. A virtual platform specification helps connect the data flow graph with the external I/O mechanisms specific to the chosen target for testing or eventual deployment. The platform could be specified for a simulation, emulation, or actual hardware execution target.

simulation::platform<inputs, outputs> platform_name(port_attribute_list);

There are three types of platform port attributes, which describe how data is transferred to/from the AIE cores.

FileIO

By default, a platform port attribute is a string name used to construct an attribute of type FileIO. The string specifies the name of an input or output file relative to the current directory that will source or sink the platform data. The explicit form is specified in the following example using a FileIO constructor.

FileIO* in = new FileIO(input_file_name);
FileIO* out = new FileIO(output_file_name);
simulation::platform<1,1> plat(in,out);

FileIO ports are solely for the purpose of application simulation in the absence of an actual hardware platform. They are provided as a matter of convenience to test out a data flow graph in isolation before it is connected to a real platform. An actual hardware platform exports either stream or memory ports.

PLIO

A PLIO port attribute is used to make external stream connections that cross the AI Engine to programmable logic (PL) boundary. The following example shows how PLIO attributes can be used in a program to read input data from a file or write output data to a file. The width and frequency of the PLIO port are also provided in the PLIO constructor. For more details, please refer to PLIO Attributes.

//Virtual platform ports
PLIO* in1 = new PLIO("DataIn1", adf::plio_64_bits, "data/input.txt");
PLIO* out1 = new PLIO("DataOut1", adf::plio_64_bits, "data/output.txt");
simulation::platform<1, 1> platform(in1,out1);

//Graph object
myGraph filter_graph;

//Virtual platform connectivity
connect<> net0(platform.src[0], filter_graph.inptr);
connect<> net1(filter_graph.outptr, platform.sink[0]);

GMIO

A GMIO port attribute is used to make external memory-mapped connections to or from the global memory. These connections are made between an AI Engine graph and the logical global memory ports of a hardware platform design. For more details, please refer to GMIO Attributes.

GMIO gmioIn1("gmioIn1", 64, 1000); // arguments: logical name, burst length in bytes, required bandwidth in MB/s
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1, 1> platform(&gmioIn1, &gmioOut);

myGraph filter_graph;

connect<> net0(platform.src[0], filter_graph.in1);
connect<> net1(filter_graph.out1, platform.sink[0]);

Host code integration

Depending upon the functional verification model used, the top-level application can be written in one of two ways.

x86Simulation / AIE simulation

In this mode, the top-level application can be written inside the graph.cpp file. The application contains an instance of the ADF graph and a main function within which APIs are called to initialize, run, and end the graph. It may also have additional APIs to update run-time parameters. Additionally, for HW emulation / HW run modes, the main() function can be guarded by an #ifdef to ensure the graph is only initialized once, or run only once. The following example code is the simple application defined in Creating a Data Flow Graph (Including Kernels), with the additional guard macros __AIESIM__ and __X86SIM__.

// Virtual platform ports
PLIO* in1 = new PLIO("DataIn1", adf::plio_64_bits, "data/input.txt");
PLIO* out1 = new PLIO("DataOut1", adf::plio_64_bits, "data/output.txt");
simulation::platform<1, 1> platform(in1, out1);

// Graph object
myGraph filter_graph;

// Virtual platform connectivity
connect<> net0(platform.src[0], filter_graph.inptr);
connect<> net1(filter_graph.outptr, platform.sink[0]);

#define SRS_SHIFT 10
float kData[9] = {0.0625, 0.1250, 0.0625, 0.125, 0.25, 0.125, 0.0625, 0.125, 0.0625};


#if defined(__AIESIM__) || defined(__X86SIM__)
int main(int argc, char** argv) {
    filter_graph.init();
    filter_graph.update(filter_graph.kernelCoefficients, float2fixed_coeff<10, 16>(kData).data(), 16);
    filter_graph.run(1);
    filter_graph.end();
    return 0;
}
#endif

In case GMIO-based ports are used:

#if defined(__AIESIM__) || defined(__X86SIM__)
int main(int argc, char** argv) {
    ...
    ...
    int16_t* inputData = (int16_t*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
    int16_t* outputData = (int16_t*)GMIO::malloc(BLOCK_SIZE_in_Bytes);

    //Prepare input data
    ...
    ...

    filter_graph.init();
    filter_graph.update(filter_graph.kernelCoefficients, float2fixed_coeff<10, 16>(kData).data(), 16);

    filter_graph.run(1);

    //GMIO Data transfer calls
    gmioIn[0].gm2aie_nb(inputData, BLOCK_SIZE_in_Bytes);
    gmioOut[0].aie2gm_nb(outputData, BLOCK_SIZE_in_Bytes);
    gmioOut[0].wait();

    printf("after grph wait\n");
    filter_graph.end();

    ...
}
#endif

HW emulation / HW run

For x86Simulation / AIE simulation, the top-level application has simple ADF API calls to initialize / run / end the graph. However, for actual AI Engine graph applications, the host code must do much more than these simple tasks. The top-level PS application running on the Cortex®-A72 controls the graph and the PL kernels: it manages data inputs to the graph, handles data outputs from the graph, and controls any PL kernels working with the graph. Sample code is illustrated below:

1. Open device, load xclbin, and get uuid.

auto dhdl = xrtDeviceOpen(0);//device index=0

xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
xuid_t uuid;
xrtDeviceGetXclbinUUID(dhdl, uuid);
adf::registerXRT(dhdl, uuid);

2. Allocate output buffer objects and map to host memory

xrtBufferHandle out_bohdl = xrtBOAlloc(dhdl, output_size_in_bytes, 0, /*BANK=*/0);
std::complex<short> *host_out = (std::complex<short>*)xrtBOMap(out_bohdl);

3. Get kernel and run handles, set arguments for the kernel, and launch the kernel.

xrtKernelHandle s2mm_khdl = xrtPLKernelOpen(dhdl, top->m_header.uuid, "s2mm"); // Open kernel handle
xrtRunHandle s2mm_rhdl = xrtRunOpen(s2mm_khdl);
xrtRunSetArg(s2mm_rhdl, 0, out_bohdl); // set kernel arg
xrtRunSetArg(s2mm_rhdl, 2, OUTPUT_SIZE); // set kernel arg
xrtRunStart(s2mm_rhdl); //launch s2mm kernel

// ADF API:run, update graph parameters (RTP) and so on
gr.init();
gr.update(gr.size, 1024);//update RTP
gr.run(16);//start AIE kernel
gr.wait();

4. Wait for kernel completion.

auto state = xrtRunWait(s2mm_rhdl);

5. Sync output device buffer objects to host memory.

xrtBOSync(out_bohdl, XCL_BO_SYNC_BO_FROM_DEVICE , output_size_in_bytes,/*OFFSET=*/ 0);

6. Post-processing on host memory (host_out).

Vitis Vision AIE library functions provide optimized vector implementations of various computer vision algorithms. These functions are expected to process high-resolution images. However, because the local memory of the AIE core module is limited, an entire image cannot fit into it. Also, accessing DDR to read / write image data is highly inefficient, both for performance and for power. To overcome this limitation, the host code is expected to split the high-resolution image into smaller tiles which fit into the AI Engine local memory in a ping-pong fashion. Splitting a high-resolution image into smaller tiles is a complex operation, as it needs to be aware of overlap regions and borders. The tile size is also expected to be aligned with the vectorization factor of the kernel.
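As a rough illustration of the bookkeeping this tiling involves (a conceptual sketch only, not the library's actual implementation; the image dimensions, tile sizes, overlaps, and alignment helper below are assumptions), the host would need to derive an aligned tile grid along these lines:

#include <cstdio>

// Illustrative only: count the tiles needed to cover an image when each tile
// carries an overlap border and its usable width is aligned (rounded down)
// to the kernel vectorization factor.
static int alignDown(int value, int multiple) {
    return (value / multiple) * multiple;
}

int main() {
    const int imageWidth = 1920, imageHeight = 1080; // assumed full HD input
    const int vectorizationFactor = 16;              // e.g. 16 lanes for 16-bit data
    const int overlapH = 1, overlapV = 1;            // 3x3 filter -> 1 pixel overlap
    const int maxTileWidth = 256, maxTileHeight = 16;

    // Usable (non-overlapping) region of each tile; the width is kept a
    // multiple of the vectorization factor so the AIE kernel processes whole vectors.
    int usableWidth  = alignDown(maxTileWidth - 2 * overlapH, vectorizationFactor);
    int usableHeight = maxTileHeight - 2 * overlapV;

    int tilesX = (imageWidth + usableWidth - 1) / usableWidth;   // ceiling division
    int tilesY = (imageHeight + usableHeight - 1) / usableHeight;

    printf("Tile grid: %d x %d = %d tiles\n", tilesX, tilesY, tilesX * tilesY);
    return 0;
}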

To facilitate this, the Vitis Vision library provides data movers which perform smart tiling / stitching of high-resolution images and meet all of the above requirements. Two versions are made available, providing data movement capabilities using the PLIO and GMIO interfaces respectively. A high-level class abstraction is provided with a simple API interface to facilitate data transfers. The class abstraction allows a seamless transition between the PLIO and GMIO methods of data transfer.

Important

For HW emulation / HW runs, it is imperative to include graph.cpp inside host.cpp. This is because the platform port specification and the ADF graph object instance are declared in graph.cpp.

xfcvDataMovers

The xfcvDataMovers class provides a high-level API abstraction to initiate data transfer from DDR to the AIE core and vice versa for HW emulation / HW runs. Because each AIE core has a limited amount of local memory, which is not sufficient to fit an entire high-resolution image (input / output), each image needs to be partitioned into smaller tiles and then sent to the AIE core for computation. After computation, the tiled output image is stitched back to generate the high-resolution image at the output. This process involves complex computation, as the tiling needs to ensure proper border handling and overlap processing in the case of convolution-based kernels.

An xfcvDataMovers class object takes a few simple parameters from the user and provides a simple data transaction API in which the user does not have to deal with this complexity. Moreover, it provides a template parameter using which the application can switch from PL-based data movement to GMIO-based (and vice versa) seamlessly.

Table. xfcvDataMovers Template Parameters

Parameter                  Description
KIND                       Type of object: TILER / STITCHER
DATA_TYPE                  Data type of the AIE core kernel input or output
TILE_HEIGHT_MAX            Maximum tile height
TILE_WIDTH_MAX             Maximum tile width
AIE_VECTORIZATION_FACTOR   AIE core vectorization factor
CORES                      Number of AIE cores to be used
PL_AXI_BITWIDTH            For PL-based data movers: the data width for AXI transfers between DDR and PL
USE_GMIO                   Set to true to use GMIO-based data transfer

Table. xfcvDataMovers Constructor Parameters

Parameter   Description
overlapH    Horizontal overlap of the AIE core / pipeline
overlapV    Vertical overlap of the AIE core / pipeline

Note

Horizontal and vertical overlaps should be computed for the complete pipeline. For example, if the pipeline has a single 3x3 2D filter, then the overlap sizes (both horizontal and vertical) will be 1. However, in the case of two such filter operations back to back, the overlap size will be 2. Currently, users are expected to provide this input correctly.

Data transfer using the xfcvDataMovers class can be done in one of two ways.

  1. PLIO data movers

    This is the default mode of operation for the xfcvDataMovers class. When this method is used, data is transferred using hardware Tiler / Stitcher IPs provided by Xilinx. The Makefiles provided with the design examples shipped with the library give the location of the .xo files for these IPs and show how to incorporate them in the Vitis build system. The user needs to create one xfcvDataMovers object per input / output image, as shown in the code below.

    Important

    The implementations of Tiler and Stitcher for PLIO are provided as .xo files in the ‘L1/lib/hw’ folder. By using these files, you are agreeing to the terms and conditions specified in the LICENSE.txt file available in the same directory.

    int overlapH = 1;
    int overlapV = 1;
    xF::xfcvDataMovers<xF::TILER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR> tiler(overlapH, overlapV);
    xF::xfcvDataMovers<xF::STITCHER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR> stitcher;
    

    The choice of MAX_TILE_HEIGHT / MAX_TILE_WIDTH constrains the image tile size, which in turn governs local memory usage. The image tile size in bytes can be computed as below:

    Image tile size = TILE_HEADER_SIZE_IN_BYTES + MAX_TILE_HEIGHT * MAX_TILE_WIDTH * sizeof(DATA_TYPE)

    Here, TILE_HEADER_SIZE_IN_BYTES is 128 bytes for the current version of Tiler / Stitcher, and DATA_TYPE in the above example is int16_t (2 bytes). For example, with MAX_TILE_HEIGHT = 16 and MAX_TILE_WIDTH = 256, the tile size is 128 + 16 * 256 * 2 = 8320 bytes.

    Note

    The current version of the HW data movers has an 8_16 configuration (i.e. 8-bit image element data type on the host side and 16-bit image element data type on the AIE kernel side). In the future, more such configurations will be provided (for example, 8_8 / 16_16, etc.).

    Tiler / Stitcher IPs use PL resources available on VCK boards. For the 8_16 configuration, the table below illustrates the resource utilization numbers for these IPs. The numbers correspond to a single instance of each IP.

    Tiler / Stitcher resource utilization (8_16 config)

               LUTs   FFs    BRAMs   DSPs   Fmax
    Tiler      2761   3832   5       13     400 MHz
    Stitcher   2934   3988   5       7      400 MHz
    Total      5695   7820   10      20
  2. GMIO data movers

    The transition to GMIO-based data movers can be achieved by using a specialized template implementation of the above class. All the above constraints with respect to image tile size calculation are valid here as well. Sample code is shown below:

    xF::xfcvDataMovers<xF::TILER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR, 1, 0, true> tiler(1, 1);
    xF::xfcvDataMovers<xF::STITCHER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR, 1, 0, true> stitcher;
    

    Note

    The last template parameter is set to true, implying GMIO specialization.

Once the objects are constructed, simple API calls can be made to initiate the data transfers. Sample code is shown below:

//For PLIO
auto tiles_sz = tiler.host2aie_nb(src_hndl, srcImageR.size());
stitcher.aie2host_nb(dst_hndl, dst.size(), tiles_sz);

//For GMIO
auto tiles_sz = tiler.host2aie_nb(srcData.data(), srcImageR.size(), {"gmioIn[0]"});
stitcher.aie2host_nb(dstData.data(), dst.size(), tiles_sz, {"gmioOut[0]"});

Note

GMIO data transfers take an additional argument, which is the corresponding GMIO port to be used.

Note

For GMIO-based transfers, blocking methods are available as well (host2aie(…) / aie2host(…)). For PLIO-based data transfers, only non-blocking API calls are provided.

Using ‘tiles_sz’, the user can run the graph the appropriate number of times.

filter_graph.run(tiles_sz[0] * tiles_sz[1]);

After the runs are started, the user needs to wait for all transactions to complete.

filter_graph.wait();
tiler.wait();
stitcher.wait();

Note

The current implementation of xfcvDataMovers supports only one core. Multi-core support is planned for future releases.

Evaluating the Functionality

You can build the kernels and test the functionality through x86 simulation, cycle-accurate AIE simulation, HW emulation, or a HW run on the board. The basic environment setup commands are listed in the Makefile section below.

x86 Simulation

Please refer to the x86 Functional Simulation section in the Vitis Unified Software Development Platform 2021.2 Documentation. For host code development, please refer to the Programming the PS Host Application section.

AIE Simulation

Please refer to the AIE Simulation section in the Vitis Unified Software Development Platform 2021.2 Documentation. For host code development, please refer to the Programming the PS Host Application section.

HW emulation

Please refer to the Programming the PS Host Application section in the Vitis Unified Software Development Platform 2021.2 Documentation.

Testing on HW

After the build for the hardware target completes, an sd_card.img file will be generated in the build directory.

  1. Use software like Etcher to flash the sd_card.img file onto an SD card.
  2. After flashing is complete, insert the SD card into the SD card slot on the board and power on the board.
  3. Use Tera Term to connect to the COM port and wait for the system to boot up.
  4. After boot-up is done, go to the /media/sd-mmcblk0p1 directory and run the executable file.

Please refer to the hw_run section in the Vitis Unified Software Development Platform 2021.2 Documentation.

Design example Using Vitis Vision AIE Library

The following example application performs a 2D filtering operation over a grayscale image. The convolution kernel is a 3x3 window with floating-point representation. The coefficients are converted to a fixed-point representation before being passed to the AIE core for computation. The results are cross-validated against an OpenCV reference implementation. The example illustrates both PLIO- and GMIO-based data transfers.
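The host code in this example performs that float-to-fixed conversion with a helper named float2fixed_coeff<SHIFT, N>(...), whose implementation is not shown in this document. A minimal sketch of such a conversion (an assumption, not the library's actual helper) could look like the following:

#include <array>
#include <cmath>
#include <cstdint>

// Minimal sketch (assumption): scale each floating-point coefficient by
// 2^SHIFT, round to the nearest integer, and zero-pad to N entries so the
// result matches the coefficient vector length expected by the AIE kernel.
template <int SHIFT, int N>
std::array<int16_t, N> float2fixed_coeff(const float (&coeff)[9]) {
    std::array<int16_t, N> fixed{}; // zero-initialized padding
    for (int i = 0; i < 9; i++) {
        fixed[i] = static_cast<int16_t>(std::round(coeff[i] * (1 << SHIFT)));
    }
    return fixed;
}

// Usage mirroring the host code shown later in this example:
// filter_graph.update(filter_graph.kernelCoefficients,
//                     float2fixed_coeff<10, 16>(kData).data(), 16);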

ADF Graph

An AI Engine program consists of a data flow graph specification written in C++. The data flow graph consists of top-level ports, kernel instances, and connectivity. A graph.h file is created which includes the header adf.h.

For more details on data flow graph creation, please refer to AI Engine Programming.

#include "kernels.h"
#include <adf.h>

using namespace adf;

class myGraph : public adf::graph {
   public:
    kernel k1;
    port<input> inptr;
    port<output> outptr;
    port<input> kernelCoefficients;

  myGraph() {
     k1 = kernel::create(filter2D);
     adf::connect<window<TILE_WINDOW_SIZE> >(inptr, k1.in[0]);
     adf::connect<parameter>(kernelCoefficients, async(k1.in[1]));
     adf::connect<window<TILE_WINDOW_SIZE> >(k1.out[0], outptr);

     source(k1) = "xf_filter2d.cc";
     // Initial mapping
     runtime<ratio>(k1) = 0.5;
  };
};

Platform Ports

A top-level application file graph.cpp is created which contains an instance of the graph class and is connected to a simulation platform. A virtual platform specification helps to connect the data flow graph written with external I/O mechanisms specific to the chosen target for testing or eventual deployment.

#include "graph.h"

// Virtual platform ports
PLIO* in1 = new PLIO("DataIn1", adf::plio_64_bits, "data/input.txt");
PLIO* out1 = new PLIO("DataOut1", adf::plio_64_bits, "data/output.txt");
simulation::platform<1, 1> platform(in1, out1);

// Graph object
myGraph filter_graph;

// Virtual platform connectivity
connect<> net0(platform.src[0], filter_graph.inptr);
connect<> net1(filter_graph.outptr, platform.sink[0]);
  1. PLIO

    A PLIO port attribute is used to make external stream connections that cross the AI Engine to programmable logic (PL) boundary. PLIO attributes are used to specify the port name, port bit width, and the input/output file names. Note that when simulating PLIO with data files, the data should be organized to accommodate both the width of the PL block as well as the data type of the connecting port on the AI Engine block. For example, with a 64-bit PLIO and 16-bit data, each line of the text file would hold four values.

    //Platform ports
    PLIO* in1 = new PLIO("DataIn1", adf::plio_64_bits, "data/input.txt");
    PLIO* out1 = new PLIO("DataOut1", adf::plio_64_bits, "data/output.txt");
    
  2. GMIO

    A GMIO port attribute is used to make external memory-mapped connections to or from the global memory. These connections are made between an AI Engine graph and the logical global memory ports of a hardware platform design.

    //Platform ports
    GMIO gmioIn1("gmioIn1", 64, 1000);
    GMIO gmioOut("gmioOut", 64, 1000);
    
    //Virtual platform
    simulation::platform<1, 1> platform(&gmioIn1, &gmioOut);
    
    //Graph object
    myGraph filter_graph;
    
    //Platform ports
    connect<> net0(platform.src[0], filter_graph.in1);
    connect<> net1(filter_graph.out1, platform.sink[0]);
    

Host code

The host code ‘host.cpp’ runs on the host processor and contains the code to initialize and run the data movers and the ADF graph. XRT APIs are used to create the required buffers in the device memory.

First, a golden reference image is generated using OpenCV:

int run_opencv_ref(cv::Mat& srcImageR, cv::Mat& dstRefImage, float coeff[9]) {
    cv::Mat tmpImage;
    cv::Mat kernel = cv::Mat(3, 3, CV_32F, coeff);
    cv::filter2D(srcImageR, dstRefImage, -1, kernel, cv::Point(-1, -1), 0, cv::BORDER_REPLICATE);
    return 0;
}

Then, the xclbin is loaded on the device and the device handles are created:

xF::deviceInit(xclBinName);

Buffers for input and output data are created using the XRT APIs, and data from the input cv::Mat is copied to the XRT buffer.

void* srcData = nullptr;
xrtBufferHandle src_hndl = xrtBOAlloc(xF::gpDhdl, (srcImageR.total() * srcImageR.elemSize()), 0, 0);
srcData = xrtBOMap(src_hndl);
memcpy(srcData, srcImageR.data, (srcImageR.total() * srcImageR.elemSize()));

// Allocate output buffer
void* dstData = nullptr;
xrtBufferHandle dst_hndl = xrtBOAlloc(xF::gpDhdl, (op_height * op_width * srcImageR.elemSize()), 0, 0);
dstData = xrtBOMap(dst_hndl);
cv::Mat dst(op_height, op_width, srcImageR.type(), dstData);

xfcvDataMovers objects tiler and stitcher are created. For more details, refer to the xfcvDataMovers section above.

xF::xfcvDataMovers<xF::TILER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR> tiler(1, 1);
xF::xfcvDataMovers<xF::STITCHER, int16_t, MAX_TILE_HEIGHT, MAX_TILE_WIDTH, VECTORIZATION_FACTOR> stitcher;

The ADF graph is initialized and the filter coefficients are updated:

filter_graph.init();
filter_graph.update(filter_graph.kernelCoefficients, float2fixed_coeff<10, 16>(kData).data(), 16);

Metadata containing the tile information is generated.

tiler.compute_metadata(srcImageR.size());

The data transfer to the AIE via the data movers is initiated along with the graph run, and further execution waits until the data transfer is complete.

auto tiles_sz = tiler.host2aie_nb(src_hndl, srcImageR.size());
stitcher.aie2host_nb(dst_hndl, dst.size(), tiles_sz);

std::cout << "Graph run(" << (tiles_sz[0] * tiles_sz[1]) << ")\n";

filter_graph.run(tiles_sz[0] * tiles_sz[1]);

filter_graph.wait();
tiler.wait();
stitcher.wait();

Makefile

Run ‘make help’ to get the list of supported commands and flows. Running the below commands will initiate a hardware build.

source < path-to-Vitis-installation-directory >/settings64.sh
export SYSROOT=< path-to-platform-sysroot >
export EDGE_COMMON_SW=< path-to-rootfs-and-Image-files >
make all TARGET=hw DEVICE=< path-to-platform-directory >/< platform >.xpfm

This example demonstrates how a function or a pipeline of functions can run on multiple AIE cores to achieve higher throughput. A back-to-back Filter2D pipeline running on three AIE cores is demonstrated. The source files can be found in the L3/tests/aie/Filter2D_multicore/16bit_aie_8bit_pl directory.

This example tests the performance of a back-to-back Filter2D pipeline with three images processed in parallel on three AIE cores. Each AIE core is fed by one instance each of the Tiler and Stitcher PL kernels.

The tutorial provides a step-by-step guide that covers commands for building and running the pipeline.

Executable Usage

  • Work Directory (Step 1)

The steps for library download and environment setup can be found in the README of the L3 folder. Please refer to Getting Started with Vitis Vision AIEngine Library Functions for more details. To get the design:

cd L3/tests/aie/Filter2D_multicore/16bit_aie_8bit_pl
  • Build kernel (Step 2)

Run the following make command to build your XCLBIN and host binary targeting a specific device. Please note that this process can take a long time, possibly a couple of hours.

export DEVICE=< path-to-platform-directory >/< platform >.xpfm
make all TARGET=hw
  • Run kernel (Step 3)

To get the benchmark results, please run the following command.

make run TARGET=hw
  • Running on HW

After the build for the hardware target completes, an sd_card.img file will be generated in the build directory.

  1. Use software like Etcher to flash the sd_card.img file onto an SD card.
  2. After flashing is complete, insert the SD card into the SD card slot on the board and power on the board.
  3. Use Tera Term to connect to the COM port and wait for the system to boot up.
  4. After boot-up is done, go to the /media/sd-mmcblk0p1 directory and run the executable file.

Performance

The performance is shown below:

Table 1. Performance numbers in terms of FPS (Frames Per Second) for full HD images

Dataset               FPS
Full HD (1920x1080)   555