XRT Native APIs¶

From 2020.2 release XRT provides a new XRT API set in C, C++, and Python flavor.

To use the native XRT APIs, the host application must link with the xrt_coreutil library. Compiling host code with XRT native C++ API requires C++ standard with -std=c++17 (or newer).

Example g++ command

g++ -g -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o host.exe host.cpp -lxrt_coreutil -pthread

The XRT native API supports both the C and C++ flavor of APIs. For general host code development, C++-based APIs are recommended, hence this document only describes the C++-based API interfaces. The full Doxygen generated C and C++ API documentation can be found in XRT Native Library C++ API.

The C++ Class objects used for the APIs are

	C++ Class	Header files
Device	`xrt::device`	`#include <xrt/xrt_device.h>`
XCLBIN	`xrt::xclbin`	`#include <experimental/xrt_xclbin.h>`
Buffer	`xrt::bo`	`#include <xrt/xrt_bo.h>`
Kernel	`xrt::kernel`	`#include <xrt/xrt_kernel.h>`
Run	`xrt::run`	`#include <xrt/xrt_kernel.h>`
User-managed Kernel	`xrt::ip`	`#include <experimental/xrt_ip.h>`
Graph	`xrt::graph`	`#include <experimental/aie.h>` `#include <experimental/graph.h>`

Majority of the core data structures are defined inside in the header files at $XILINX_XRT/include/xrt/ directory. Few newer features such as xrt::ip, xrt::aie related header files are inside $XILINX_XRT/include/experimental directory. The API interfaces that are in the experimental folder are subject to breaking changes.

The common host code flow using the above data structures is as below

Open Xilinx Device and Load the XCLBIN
Create Buffer objects to transfer data to kernel inputs and outputs
Use the Buffer class member functions for the data transfer between host and device (before and after the kernel execution).
Use Kernel and Run objects to offload and manage the compute-intensive tasks running on FPGA.

Below we will walk through the common API usage to accomplish the above tasks.

Device and XCLBIN¶

Device and XCLBIN class provide fundamental infrastructure-related interfaces. The primary objective of the device and XCLBIN related APIs are

Open a Device
Load compiled kernel binary (or XCLBIN) onto the device

The simplest code to load an XCLBIN as below

     unsigned int dev_index = 0;
     auto device = xrt::device(dev_index);
     auto xclbin_uuid = device.load_xclbin("kernel.xclbin");

The above code block shows

The xrt::device class’s constructor is used to open the device (enumerated as 0)
The member function xrt::device::load_xclbin is used to load the XCLBIN from the filename.
The member function xrt::device::load_xclbin returns the XCLBIN UUID, which is required to open the kernel (refer the Kernel Section).

The class constructor xrt::device::device(const std::string& bdf) also supports opening a device object from a Pcie BDF passed as a string.

     auto device = xrt::device("0000:03:00.1");

The xrt::device::get_info() is a useful member function to obtain necessary information about a device. Some of the information such as Name, BDF can be used to select a specific device to load an XCLBIN

     std::cout << "device name:     " << device.get_info<xrt::info::device::name>() << "\n";
     std::cout << "device bdf:      " << device.get_info<xrt::info::device::bdf>() << "\n";

Buffers¶

Buffers are primarily used to transfer the data between the host and the device. The Buffer related APIs are discussed in the following three subsections

Buffer allocation and deallocation
Data transfer using Buffers
Miscellaneous other Buffer APIs

1. Buffer allocation and deallocation¶

The C++ interface for buffers as below

The class constructor xrt::bo is mainly used to allocates a buffer object 4K aligned. By default, a regular buffer is created (optionally the user can create other types of buffers by providing a flag).

     auto bank_grp_arg0 = kernel.group_id(0); // Memory bank index for kernel argument 0
     auto bank_grp_arg1 = kernel.group_id(1); // Memory bank index for kernel argument 1

     auto input_buffer = xrt::bo(device, buffer_size_in_bytes,bank_grp_arg0);
     auto output_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_arg1);

In the above code xrt::bo buffer objects are created using the class constructor. Please note the following

As no special flags are used a regular buffer will be created. Regular buffer is most common type of buffer that has a host backing pointer allocated by user space in heap memory and a device buffer allocated in the specified memory bank.
The second argument specifies the buffer size.
The third argument is used to specify the enumerated memory bank index (to specify the buffer location) where the buffer should be allocated. There are two ways to specify the memory bank index

Through kernel arguments: In the above example, the xrt::kernel::group_id() member function is used to pass the memory bank index. This member function accept kernel argument-index and automatically detect corresponding memory bank index by inspecting XCLBIN.

Passing Memory bank index: The xrt::kernel::group_id() also accepts the direct memory bank index (as observed from xbutil examine --report memory output).

Creating special Buffers¶

The xrt::bo() constructors accept multiple other buffer flags those are described using enum class argument with the following enumerator values

xrt::bo::flags::normal: Regular buffer (default)
xrt::bo::flags::device_only: Device only buffer (meant to be used only by the kernel, there is no host backing pointer).
xrt::bo::flags::host_only: Host only buffer (buffer resides in the host memory directly transferred to/from the kernel)
xrt::bo::flags::p2p: P2P buffer, A special type of device-only buffer capable of peer-to-peer transfer
xrt::bo::flags::cacheable: Cacheable buffer can be used when the host CPU frequently accessing the buffer (applicable for edge platform).

The below example shows creating a P2P buffer on a device memory bank connected to argument 3 of the kernel.

     auto p2p_buffer = xrt::bo(device, buffer_size_in_byte,xrt::bo::flags::p2p, kernel.group_id(3));

Creating Buffers from the user pointer¶

The xrt::bo() constructor can also be called using a pointer provided by the user. The user pointer must be aligned to 4K boundary.

     // Host Memory pointer aligned to 4K boundary
     int *host_ptr;
     posix_memalign(&host_ptr,4096,MAX_LENGTH*sizeof(int));

     // Sample example filling the allocated host memory
     for(int i=0; i<MAX_LENGTH; i++) {
     host_ptr[i] = i;  // whatever
     }

     auto mybuf = xrt::bo (device, host_ptr, MAX_LENGTH*sizeof(int), kernel.group_id(3));

2. Data transfer using Buffers¶

XRT Buffer API library provides a rich set of APIs helping the data transfers between the host and the device, between the buffers, etc. We will discuss the following data transfer style

Data transfer between host and device by Buffer read/write API
Data transfer between host and device by Buffer map API
Data transfer between buffers by copy API

I. Data transfer between host and device by Buffer read/write API¶

To transfer the data from the host to the device, the user first needs to update the host-side buffer backing pointer followed by a DMA transfer to the device.

The xrt::bo class has following member functions for the same functionality

xrt::bo::write()
xrt::bo::sync() with flag XCL_BO_SYNC_BO_TO_DEVICE

To transfer the data from the device to the host, the steps are reversed, the user first needs to do a DMA transfer from the device followed by the reading data from the host-side buffer backing pointer.

The corresponding xrt::bo class’s member functions are

xrt::bo::sync() with flag XCL_BO_SYNC_BO_FROM_DEVICE
xrt::bo::read()

Code example of transferring data from the host to the device

     auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
     // Prepare the input data
     int buff_data[data_size];
     for (auto i=0; i<data_size; ++i) {
         buff_data[i] = i;
     }

     input_buffer.write(buff_data);
     input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);

Note the C++ xrt::bo::sync, xrt::bo::write, xrt::bo::read etc has overloaded version that can be used for partial buffer sync/read/write by specifying the size and the offset. For the above code example, the full buffer size and offset=0 are assumed as default arguments.

Also note that if the buffer is created through the user-pointer, the xrt::bo::write or xrt::bo::read is not required before or after the xrt::bo::sync call.

For the device only buffers (created with xrt::bo::flags::device_only flag) the xrt::bo::sync() operation is not required, only xrt::bo::write() (or xrt::bo::read()) is sufficient for DMA operation. As for the device only buffer there is no host backing storage, the xrt::bo::write() (or xrt::bo::read()) directly performs DMA operation to (or from) the device memory.

Below is the example for creation of device only buffers.

     xrt::bo::flags device_flags = xrt::bo::flags::device_only;
     auto device_only_buffer = xrt::bo(device, size_in_bytes, device_flags, bank_grp_arg0);

Here is how xrt::bo::read() and xrt::bo::write() API’s to read/write directly from/to device only buffer there is no host backing storage.

xrt::bo::write(const void* src, size_t size, size_t seek): Copies data from src to device buffer directly.
xrt::bo::read(void* dst, size_t size, size_t skip): Copies data from device buffer to dst.

II. Data transfer between host and device by Buffer map API¶

The API xrt::bo::map() allows mapping the host-side buffer backing pointer to a user pointer. The host code can subsequently exercise the user pointer for the data reads and writes. However, after writing to the mapped pointer (or before reading from the mapped pointer) the API xrt::bo::sync() should be used with direction flag for the DMA operation.

Code example of transferring data from the host to the device by this approach

     auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
     auto input_buffer_mapped = input_buffer.map<int*>();

     for (auto i=0;i<data_size;++i) {
         input_buffer_mapped[i] = i;
     }

     input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);

III. Data transfer between the buffers by copy API¶

XRT provides xrt::bo::copy() API for deep copy between the two buffer objects if the platform supports a deep-copy (for detail refer M2M feature described in Memory-to-Memory (M2M)). If deep copy is not supported by the platform the data transfer happens by shallow copy (the data transfer happens via host).

     dst_buffer.copy(src_buffer, copy_size_in_bytes);

The API xrt::bo::copy() also has overloaded versions to provide a different offset than 0 for both the source and the destination buffer.

3. Miscellaneous other Buffer APIs¶

This section describes a few other specific use-cases using buffers.

DMA-BUF API¶

XRT provides Buffer export and import APIs primarily used for sharing buffer across devices (P2P application) and processes. The buffer handle obtained from xrt::bo::export_buffer() is essentially a file descriptor, hence sending across the processes requires a suitable IPC mechanism (example, UDS or Unix Domain Socket) to translate the file descriptor of one process into another process.

xrt::bo::export_buffer(): Export the buffer to an exported buffer handle
xrt::bo() constructor: Allocate a BO imported from exported buffer handle

Consider the situation of exporting buffer from device 1 to device 2 (inside same host process).

     auto buffer_exported = buffer_device_1.export_buffer();
     auto buffer_device_2 = xrt::bo(device_2, buffer_exported);

In the above example

The buffer buffer_device_1 is a buffer allocated on device 1
buffer_device_1 is exported by the member function xrt::bo::export_buffer
The new buffer buffer_device_2 is imported for device_2 by the constructor xrt::bo

Sub-buffer support¶

The xrt::bo class constructor can also be used to allocate a sub-buffer from a parent buffer by specifying a start offset and the size.

In the example below a sub-buffer is created from a parent buffer of size 4 bytes starting from its offset 0

     size_t sub_buffer_size = 4;
     size_t sub_buffer_offset = 0;

     auto sub_buffer = xrt::bo(parent_buffer, sub_buffer_size, sub_buffer_offset);

Buffer information¶

XRT provides few other API Class member functions to obtain information related to the buffer.

The member function xrt::bo::size(): Size of the buffer
The member function xrt::bo::address() : Physical address of the buffer

Kernel and Run¶

To execute a kernel on a device, a kernel class (xrt::kernel) object has to be created from currently loaded xclbin. The kernel object can be used to execute the kernel function on the hardware instance (Compute Unit or CU) of the kernel.

A Run object (xrt::run) represents an execution of the kernel. Upon finishing the kernel execution, the Run object can be reused to invoke the same kernel function if desired.

The following topics are discussed below

Obtaining kernel object from XCLBIN
Getting the bank group index of a kernel argument
Execution of kernel and dealing with the associated run
Other kernel related API

Obtaining kernel object from XCLBIN¶

The kernel object is created from the device, XCLBIN UUID and the kernel name using xrt::kernel() constructor as shown below

     auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
     auto krnl = xrt::kernel(device, xclbin_uuid, name);

Note: A single kernel object (when created by a kernel name) can be used to execute multiple CUs as long as CUs are having identical interface connectivity. If all the CUs of the kernel are not having identical connectivity, XRT assigns a subset of CUs (one or more CUs with identical connectivity) to the created kernel object and discards the rest of the CUs (discarded CUs are not used during the execution of a kernel). For this type of situation creating a kernel object using mangled CU names can be more useful.

As an example, assume a kernel name is foo having 3 CUs foo_1, foo_2, foo_3. The CUs foo_1 and foo_2 are connected to DDR bank 0, but the CU foo_3 is connected to DDR bank 1.

Opening kernel object for foo_1 and foo_2 (as they have identical interface connection)

     krnl_obj_1_2 = xrt::kernel(device, xclbin_uuid, "foo:{foo_1,foo_2}");

Opening kernel object for foo_3

     krnl_obj_3 = xrt::kernel(device, xclbin_uuid, "foo:{foo_3}");

Getting bank group index of the kernel argument¶

We have seen in the Buffer creation section that it is required to provide the buffer location during the buffer creation. The member function xrt::kernel::group_id() returns the memory bank index (or id) of a specific argument of the kernel. This id is passed as a parameter of xrt::bo() constructor to create the buffer on the same memory bank.

Let us review the example below where the buffer is allocated for the kernel’s first (argument index 0) argument.

     auto input_buffer = xrt::bo(device, buffer_size_in_bytes, kernel.group_id(0));

If the kernel bank index is ambiguous then kernel.group_id() returns the last memory bank index in the list it maintains. This is the case when the kernel has multiple CU with different connectivity for that argument. For example, let’s assume a kernel argument (argument 0) is connected to memory bank 0, 1, 2 (for 3 CUs), then kernel.group_id(0) will return the last index from the group {0,1,2}, i.e. 2. As a result the buffer is created on the memory bank 2, so the buffer cannot be used for the CU0 and CU1.

However, in the above situation, the user can always create 3 distinct kernel objects corresponds to 3 CUs (by using the {kernel_name:{cu_name(s)}} for xrt::kernel constructor) to execute the CUs by separate xrt::kernel objects.

Executing the kernel¶

Execution of the kernel is associated with a Run object. The kernel can be executed by the xrt::kernel::operator() that takes all the kernel arguments in order. The kernel execution API returns a run object corresponding to the execution.

     // 1st kernel execution
     auto run = kernel(buf_a, buf_b, scalar_1);
     run.wait();

     // 2nd kernel execution with just changing 3rd argument
     run.set_arg(2,scalar_2); // Arguments are specified starting from 0
     run.start();
     run.wait();

The xrt::kernel class provides overloaded operator () to execute the kernel with a comma-separated list of arguments.

The above c++ code block is demonstrating

The kernel execution using the xrt::kernel() operator with the list of arguments that returns a xrt::run object. This is an asynchronous API and returns after submitting the task.
The member function xrt::run::wait() is used to block the current thread until the current execution is finished.
The member function xrt::run::set_arg() is used to set one or more kernel argument(s) before the next execution. In the example above, only the last (3rd) argument is changed.
The member function xrt::run::start() is used to start the next kernel execution with new argument(s).

Other kernel APIs¶

Obtaining the run object before execution: Example of the previous section shows to obtain a xrt::run object when the kernel is executed (kernel execution returns a run object). However, a xrt::run object can be obtained even before the kernel execution. The flow is as below

Open a Run object by the xrt::run constructor with a kernel argument).
Set the kernel arguments associated for the next execution by the member function xrt::run::set_arg().
Execute the kernel by the member function xrt::run::start().
Wait for the execution finish by the member function xrt::run::wait().

Timeout while wait for kernel finish: The member function xrt::run::wait() blocks the current thread until the kernel execution finishes. To specify a timeout supported API xrt::run::wait() also accepts a timeout in millisecond unit.

User Managed Kernel¶

The xrt::kernel is used to execute the kernels with standard control interface through AXI-Lite control registers. These standard control interfaces are well defined and understood by XRT but transparent to the user. These XRT managed kernels should always be represented by xrt::kernel objects in the host code.

The XRT also supports custom control interface for a kernel. These type of kernels (a.k.a User-Managed Kernel) must be managed by the user by writing/reading to/from the AXI-Lite registers controlling these kernels. To differentiate from the XRT managed kernel, class xrt::ip is used to specify a user-managed kernel inside the user host code.

Creating `xrt::ip` object from XCLBIN¶

The xrt::ip object creation is very similar to creating a kernel.

     auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
     auto ip = xrt::ip(device, xclbin_uuid, "ip_name");

An ip object can only be opened in exclusive mode. That means at a time, only one thread/process can access IP at the same time. This is required for a safety reason because multiple threads/processes reading/writing to the AXI-Lite registers at the same time potentially leads to a race situation.

Allocating buffers for the IP inputs/outputs¶

Similar to XRT managed kernel xrt::bo objects are used to create buffers for IP ports. However, the memory bank location must be specified explicitly by providing enumerated index of the memory bank.

Below is a example of creating two buffers. Note the last argument of xrt::bo is the enumerated index of the memory bank as seen by the XRT (in this example index 8 corresponds to the host-memory bank). The bank index can be obtained by xbutil examine --report memory command.

     auto buf_in_a = xrt::bo(device, DATA_SIZE, xrt::bo::flags::host_only, 8);
     auto buf_in_b = xrt::bo(device, DATA_SIZE, xrt::bo::flags::host_only, 8);

Reading and write CU mapped registers¶

To read and write from the AXI-Lite register space to a CU (specified by xrt::ip object in the host code), the required member functions from the xrt::ip class are

xrt::ip::read_register
xrt::ip::write_register

     int read_data;
     int write_data = 7;

     auto ip = xrt::ip(device, xclbin_uuid, "foo:{foo_1}");

     read_data = ip.read_register(READ_OFFSET);
     ip.write_register(WRITE_OFFSET,write_data);

In the above code block

The CU named “foo_1” (name syntax: “kernel_name:{cu_name}”) is opened exclusively.
The Register Read/Write operation is performed.

Graph¶

In Versal ACAPs with AI Engines, the XRT Graph class (xrt::graph) and its member functions can be used to dynamically load, monitor, and control the graphs executing on the AI Engine array.

A note regarding Device and Buffer: In AIE based application, the device and buffer have some additional functionlities. For this reason the classes xrt::aie::device and xrt::aie::buffer are recommended to specify device and buffer objects.

Graph Opening and Closing¶

The xrt::graph object can be opened using the uuid of the currently loaded XCLBIN file as shown below

     auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
     auto graph = xrt::graph(device, xclbin_uuid, "graph_name");

The graph object can be used to execute the graph function on the AIE tiles.

Reset Functions¶

The member function xrt::graph::reset() is used to reset a specified graph by disabling tiles and enabling tile reset.

     auto device = xrt::aie::device(0);

     // load XCLBIN
     ...

     auto graph = xrt::graph(device, xclbin_uuid, "graph_name");
     // Graph Reset
     graph.reset();

The member function xrt::aie::device::reset_array() is used to reset the whole AIE array. But after this AIE reset functionality is called, the PDI get lost, so a special AIE only XCLBIN has be loaded (This flow is for advanced user only).

Graph execution¶

XRT provides basic graph execution control interfaces to initialize, run, wait, and terminate graphs for a specific number of iterations. Below we will review some of the common graph execution styles.

Graph execution for a fixed number of iterations¶

A graph can be executed for a fixed number of iterations followed by a “busy-wait” or a “time-out wait”.

Busy Wait scheme

The graph can be executed for a fixed number of iteration by xrt::graph::run() API using an iteration argument. Subsequently, xrt::graph::wait() or xrt::graph::end() API should be used (with argument 0) to wait until graph execution is completed.

Let’s review the below example

The graph is executed for 3 iterations by API xrt::graph::run() with the number of iterations as an argument.
The API xrt::graph::wait(0) is used to wait till the iteration is done.
- The API xrt::graph::wait() is used because the host code needs to execute the graph again.
The Graph is executed again for 5 iteration
The API xrt::graph::end(0) is used to wait till the iteration is done.
- After xrt::graph::end() the same graph can not be executed.

     // start from reset state
     graph.reset();

     // run the graph for 3 iteration
     graph.run(3);

     // Wait till the graph is done
     graph.wait(0);  // Use graph::wait if you want to execute the graph again


     graph.run(5);
     graph.end(0);  // Use graph::end if you are done with the graph execution

Timeout wait scheme

As shown in the above example xrt::graph::wait(0) performs a busy-wait and suspend the execution till the graph is not done. If desired a timeout version of the wait can be achieved by xrt::graph::wait(std::chrono::milliseconds) which can be used to wait for some specified number of milliseconds, and if the graph is not done do something else in the meantime. An example is shown below

     // start from reset state
     graph.reset();

     // run the graph for 100 iteration
     graph.run(100);

      while (1) {

        try {
           graph.wait(5);
        }
        catch (const std::system_error& ex) {

           if (ex.code().value() == ETIME) {

              std::cout << "Timeout, reenter......" << std::endl;
              // Do something

           }
       }

Infinite Graph Execution¶

The graph runs infinitely if xrt::graph::run() is called with iteration argument 0. While a graph running infinitely the APIs xrt::graph::wait(), xrt::graph::suspend() and xrt::graph::end() can be used to suspend/end the graph operation after some number of AIE cycles. The API xrt::graph::resume() is used to execute the infinitely running graph again.

     // start from reset state
     graph.reset();

     // run the graph infinitely
     graph.run(0);

     graph.wait(3000);  // Suspends the graph after 3000 AIE cycles from the previous start

     graph.resume(); // Restart the suspended graph again to run forever

     graph.suspend(); // Suspend the graph immediately

     graph.resume(); // Restart the suspended graph again to run forever

     graph.end(5000);  // End the graph operation after 5000 AIE cycles from the previous start

In the example above

The member function xrt::graph::run(0) is used to execute the graph infinitely
The member function xrt::graph::wait(3000) suspends the graph after 3000 AIE cycles from the graph starts.
- If the graph was already run more than 3000 AIE cycles the graph is suspended immediately.
The member function xrt::graph::resume() is used to restart the suspended graph
The member function xrt::graph::suspend() is used to suspend the graph immediately
The member function xrt::graph::end(5000) is ending the graph after 5000 AIE cycles from the previous graph start.
- If the graph was already run more than 5000 AIE cycles the graph ends immediately.
- Using xrt::graph::end() eliminates the capability of rerunning the Graph (without loading PDI and a graph reset again).

Measuring AIE cycle consumed by the Graph¶

The member function xrt::graph::get_timestamp() can be used to determine AIE cycle consumed between a graph start and stop.

Here in this example, the AIE cycle consumed by 3 iteration is calculated

     // start from reset state
     graph.reset();

     uint64_t begin_t = graph.get_timestamp();

     // run the graph for 3 iteration
     graph.run(3);

     graph.wait(0);

     uint64_t end_t = graph.get_timestamp();

     std::cout<<"Number of AIE cycles consumed in the 3 iteration is: "<< end_t-begin_t;

RTP (Runtime Parameter) control¶

The xrt::graph class contains member function to update and read the runtime parameters of the graph.

The member function xrt::graph::update() to update the RTP
The member function xrt::graph::read() to read the RTP.

     graph.reset();

     graph.run(2);

     float increment = 1.0;
     graph.update("mm.mm0.in[2]", increment);

     // Do more things
     graph.run(16);
     graph.wait(0);

     // Read RTP
     float increment_out;
     graph.read("mm.mm0.inout[0]", &increment_out);
     std::cout<<"\n RTP value read<<increment_out;

In the above example, the member function xrt::graph::update() and xrt::graph::read() are used to update and read the RTP values respectively. Note the function arguments

The hierarchical name of the RTP port
Variable to set/read the RTP

DMA operation to and from Global Memory IO¶

The AIE buffer class xrt::aie::bo supports member function xrt::aie::bo::sync() that can be used to synchronize the buffer contents between global memory and AIE. The following code shows a sample example

     auto device = xrt::aie::device(0);

     // Buffer from global memory (GM) to AIE
     auto in_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);

     // Buffer from AIE to global memory (GM)
     auto out_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);

     auto inp_bo_map = in_bo.map<float *>();
     auto out_bo_map = out_bo.map<float *>();

     // Prepare input data
     std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);


     in_bo.sync("in_sink", XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);

     out_bo.sync("out_sink", XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float), 0);

The above code shows

Input and output buffer (in_bo and out_bo) to the graph are created and mapped to the user space
The member function xrt::aie::bo::sync is used for data transfer using the following arguments
- The name of the GMIO ports associated with the DMA transfer
- The direction of the buffer transfer
  GMIO to Graph: XCL_BO_SYNC_BO_GMIO_TO_AIE
  
  Graph to GMIO: XCL_BO_SYNC_BO_AIE_TO_GMIO
- The size and the offset of the buffer

GMIOs & External Buffers¶

XRT provides a buffer class xrt::aie::buffer which represents GMIO & External-Buffers. GMIOs & External Buffers facilitates the movement of data from global memory (like DDR) to the AI Engine and vice versa.Both GMIO and External buffers work together to manage data flow efficiently by ensuring that large datasets can be processed effectively without overwhelming local memory resources

The AIE buffer xrt::aie::buffer object creation would be succesful if GMIO/External buffer exists with given name.

This class has overloaded member function xrt::aie::buffer::sync(...) that can be used to synchronize the buffer contents between global memory and AIE.

xrt::aie::buffer::sync(xrt::bo bo, …) synchronizes the buffer content between xrt::aie::buffer (GMIO/External Buffer) & xrt::bo (Global Memory)

xrt::aie::buffer::sync(xrt::bo ping, xrt::bo pong, …) configures the External buffer with ping/pong buffer for parallelism.

The following code shows a sample example with a single input/output GMIO/External Buffer. Data gets transferred from global buffer “in_bo” to “gr.in1”

   auto device = xrt::aie::device(0);
   auto uuid = device.load_xclbin(xclbin-filename);

   // Create Buffer in DDR/Global memory & prepare input
   auto in_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto inp_bo_map = in_bo.map<float *>();
   std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);

   // Create Buffer in DDR/Global memory to store output
   auto out_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto out_bo_map = out_bo.map<float *>();

   // create GMIO/External Buffer object for input & sync from in_bo
   auto xrt::aie::buffer in_buffer = xrt::aie::buffer(device, uuid,"gr.in1");
   in_buffer.sync(in_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);

   //run your graphs which uses output GMIO/External buffer

   // create GMIO/External Buffer object for output & sync from out_bo
   auto xrt::aie::buffer out_buffer  = xrt::aie::buffer(device, uuid,"gr.out1");
   out_buffer.sync(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);

This class has overloaded member function xrt::aie::buffer::async(...) that can be used to initiate an asynchronize operation to synchronize the xrt::bo buffer object

xrt::aie::buffer::async(xrt::bo bo, …) initiate an asynchronize operation between xrt::aie::buffer (GMIO/External Buffer) & xrt::bo (Global Memory)

xrt::aie::buffer::async(xrt::bo ping,xrt::bo pong, …) initiate an asynchronize operation between xrt::aie::buffer (GMIO/External Buffer) & ping/pong xrt::bo objects

xrt::aie::buffer::wait() waits for the asynchronize operation to complete

The following code shows a sample example with a single input/output GMIO/External Buffer. Data gets transferred from global buffer “in_bo” to “gr.in1”

   auto device = xrt::aie::device(0);
   auto uuid = device.load_xclbin(xclbin-filename);

   // Create Buffer in DDR/Global memory & prepare input
   auto in_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto inp_bo_map = in_bo.map<float *>();
   std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);

   // Create Buffer in DDR/Global memory to store output
   auto out_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto out_bo_map = out_bo.map<float *>();

   // create GMIO/External Buffer object for input & sync from in_bo
   auto xrt::aie::buffer in_buffer = xrt::aie::buffer(device, uuid,"gr.in1");
   in_buffer.async(in_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);

   //run your graphs which uses output GMIO/External buffer

   // create GMIO/External Buffer object for output & sync from out_bo
   auto xrt::aie::buffer out_buffer  = xrt::aie::buffer(device, uuid,"gr.out1");
   out_buffer.async(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);
   out_buffer.wait();

Ping Pong buffers¶

The following code shows ping-pong buffer example.This shows an example with a ping/ping buffer being set on one of External buffer

   auto device = xrt::aie::device(0);
   auto uuid = device.load_xclbin(xclbin-filename);

   // Create a Buffer in DDR/Global Memory & prepare input
   auto in_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto in_bo_map = in_bo.map<float *>();
   std::copy(my_float_array,my_float_array+SIZE,in_bo_map);

   // create GMIO/External Buffer object for input
   auto xrt::aie::buffer in_buffer  = xrt::aie::buffer(device, uuid,"gr.in1");
   in_buffer.sync(in_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);

   // Create a Buffer in DDR/Global Memory for storing output
   auto out_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto out_bo_map = out_bo.map<float *>();

   // Create Ping/Pong Buffers in global memory for intermediate buffers
   auto ext1_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
   auto ext2_bo  = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);

   // create GMIO/External Buffer object for input
   auto xrt::aie::buffer ping_pong_bo  = xrt::aie::buffer(device, uuid,"gr.ext1");
   ping_pong_bo.sync(ext1_bo, ext2_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);
   // create GMIO/External Buffer object for output
   auto xrt::aie::buffer out_buffer  = xrt::aie::buffer(device, uuid,"gr.out1");
   out_buffer.sync(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);

XRT Error API¶

In general, XRT APIs can encounter two types of errors:

Synchronous error: Error can be thrown by the API itself. The host code can catch these exception and take necessary steps.
Asynchronous error: Errors from the underneath driver, system, hardware, etc.

XRT provides an xrt::error class and its member functions to retrieve the asynchronous errors into the userspace host code. This helps to debug when something goes wrong.

Member function xrt::error::get_error_code() - Gets the last error code and its timestamp of a given error class
Member function xrt::error::get_timestamp() - Gets the timestamp of the last error
Member function xrt:error::to_string() - Gets the description string of a given error code.

NOTE: The asynchronous error retrieving APIs are at an early stage of development and only supports AIE related asynchronous errors. Full support for all other asynchronous errors is planned in a future release.

Example code

     graph.run(runInteration);

     try {
        graph.wait(timeout);
     }
     catch (const std::system_error& ex) {

        if (ex.code().value() == ETIME) {
           xrt::error error(device, XRT_ERROR_CLASS_AIE);

           auto errCode = error.get_error_code();
           auto timestamp = error.get_timestamp();
           auto err_str = error.to_string();

           /* code to deal with this specific error */
           std::cout << err_str << std::endl;
        } else {
         /* Something else */
        }
     }

The above code shows

After timeout occurs from xrt::graph::wait() the member functions xrt::error class are called to retrieve asynchronous error code and timestamp
Member function xrt::error::to_string() is called to obtain the error string.

Profiling¶

In Versal ACAPs with AI Engines, the XRT Profiling class (xrt::aie::profiling) and its member functions can be used to configure AI Engine hardware resources for performance profiling and event tracing.

Create Profiling Event¶

The class constructor xrt::aie::profiling is used to create profiling event object as shown below

     auto event = xrt::aie::profiling(device);

The profling object can be used to execute the profiling functions and collect profile statistics by calling profiling APIs.

Start Profiling¶

The member function xrt::aie::profiling::start() is used to start performance counters in AI Engine as per the profiling option passed as an argument. This function configures the performance counters in the AI Engine and starts profiling.

     auto graph = xrt::graph(device, xclbin_uuid, "graph_name");
     auto handle = event.start(xrt::aie::profiling::profiling_option option, std::string& port1, std::string& port2, int value);

     // run graph
     ...
     s2mm_run.wait();

It returns a handle to be used by read and stop.

Read Profiling¶

The xrt::aie::profiling::read function will return the current performance counter value associated with the profiling handle. It can be used using profiling event object as shown below

     long long cycle_count = profile.read();

Stop Profiling¶

The xrt::aie::profiling::stop function stops the performance profiling associated with the profiling handle and releases the corresponding hardware resources.

  event.stop();
  double throughput = output_size_in_bytes / (cycle_count *0.8 * 1e-3);
  // Every AIE cycle is 0.8ns in production board
  std::cout << "Throughput of the graph: " << throughput << " MB/s" << std::endl;

Asynchornous Programming with XRT (experimental)¶

From the 22.1 release, XRT offers a simple asynchronous programming mechanism through the user-defined queues. The xrt::queue is lightweight, general-purpose queue implementation which is completely separated from core XRT native API data structures. If needed, the user can also use their own queue implementation instead of the implementation offered by xrt::queue.

XRT queue implementation needs #include <experimental/xrt_queue.h to be added as the header file. The implementation also use C++17 features so the host code must be compiled with g++ -std=c++17

As a premise, by default, all XRT native APIs execute from the host-thread, so if an API has synchronous behavior the host-thread blocks until that task is finished. For example

    buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);

As xrt::bo::sync is a synchronous API, the host thread blocks until its completion.

XRT defines a queue with xrt::queue with the following properties

Any task enqueued on the queue runs parallel to the original host-thread. So the host threads does not wait for its completion and can do other tasks in parallel.

The task enqueued on the queue must be synchronous in nature

All the tasks enqueued on the queue will be completed in the order it is enqueued (strict in-order execution).

The task enqueued on a queue can be any C++ Callable, which can conveniently be expressed by a C++ lambda

When a synchronous task is enqueued on a queue, an event (xrt::queue:event) is returned. The event can be used for multiple purposes, such as:

The host-thread can wait on that event to synchronize with the queue::enqueue(task) from where the event was generated.

The event can be enqueued to other queues to synchronize among the queues

Let’s understand the above properties by reviewing the examples below

Executing synchronous task asynchronously¶

    auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
    auto bo0_map = bo0.map<dtype*>();
    .... // fill buffer content

    xrt::queue my_queue;
    auto sync_event = queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });

    myCpuTask(b);  // here we can perform other host task that will run parallel to the above bo::sync task

    sync_event.wait();  // stall the host-thread till sync operation completes

The above code shows the synchronous API xrt::bo::sync is enqueued through an xrt::queue. The argument of xrt::queue is unnamed callable written using C++ lambda capturing buffer object. This technique is useful to execute any synchronous task asynchronously from the host-thread, and while this task is ongoing, the host-thread can do other operation in parallel (myCpuTask() in the above code). The return type of xrt::queue::enqueue() is type of xrt::queue::event which is later synchronized to the host-thread by xrt::queue::event::wait() blocking function.

Executing multiple tasks through queue¶

Every new xrt::queue can be thought of as a new thread running parallel to the host-thread executing a series of synchronous tasks following the order they were submitted (enqueued) on the queue. For example, let’s consider tasks A, B, C and D as below

Task A: Host to device data transfer (buffer bo0)

Task B: Execute the kernel and wait for the kernel to finish execution

Task C: Device to host data transfer (buffer bo_out)

Task D: Check return data bo_out

The above four tasks should be executed in-order for correct functionality. To execute them in parallel to the host-thread, these four tasks can be enqueued through a queue as below.

    xrt::queue queue;
    queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    queue.enqueue([&run] {run.start(); run.wait(); });
    queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
    queue.enqueue([&bo_out_map]{my_function_to_check_data(bo_out_map)});

The user can create and use as many queues in the host code to overlap tasks in parallel. Next, we will see how it is possible to synchronize among the queues using the event.

Using events to synchronize among the queues¶

Let’s assume in the above example, it is required to do two host-to-device buffer transfers before the kernel execution. If using a single queue the code would appear as

    xrt::queue main_queue;
    main_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    main_queue.enqueue([&bo1] {bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    main_queue.enqueue([&run] {run.start(); run.wait(); });
    main_queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });

In the above code, as a single queue (main_queue) is used, the host-to-device data transfers for buffer bo0 and bo1 would happen sequentially. In order to do parallel data transfer for bo0 and bo1, a separate queue is needed for one of the buffers, and also it is required to ensure that the kernel executes only after both the buffer transfers are completed.

    xrt::queue main_queue;
    xrt::queue queue_bo1;
    main_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    auto bo1_event = queue_bo1.enqueue([&bo1] {bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    main_queue.enqueue(bo1_event);
    main_queue.enqueue([&run] {run.start(); run.wait(); });
    main_queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });

In line number 43 and 44 bo0 and bo1 host-to-device data transfers are enqueued through two separate queues to achieve parallel transfers. To synchronize between these two queues the returned event from the queue_bo1 is enqueued in the main_queue, similar to a task enqueue (line 45). As a result, any other task submitted after that event won’t execute until the event is finished. So in the above code example, subsequent task in the main_queue (such as kernel execution) would wait till the bo1_event is completed. By submitting an event returned from a queue::enqueue to another queue, we can synchronize among the queues.