XRT Native APIs¶

Starting with the 2020.2 release, XRT provides a new set of native APIs in C, C++, and Python flavors. This document introduces the usage of the C and C++ APIs.

To use the native XRT APIs, the host application must link with the xrt_coreutil library.

Example g++ command

g++ -g -std=c++14 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o host.exe host.cpp -lxrt_coreutil -pthread

The core data structures in C and C++ are as below

Data structure	C++ Class	C Type (Handle)
Device	xrt::device	xrtDeviceHandle
XCLBIN	xrt::xclbin	xrtXclbinHandle
Buffer	xrt::bo	xrtBufferHandle
Kernel	xrt::kernel	xrtKernelHandle
Run	xrt::run	xrtRunHandle
Graph	TBD	xrtGraphHandle

All the core data structures are defined in the header files in the $XILINX_XRT/include/experimental/ directory. In the user host code, it is sufficient to include "experimental/xrt_kernel.h" and "experimental/xrt_aie.h" (when using Graph APIs) to access all the APIs related to these data structures.

    #include "experimental/xrt_kernel.h"
    #include "experimental/xrt_aie.h"

The common host code flow using the above data structures is as below

Open Xilinx Device and Load the XCLBIN

Set up the Buffers that are used to transfer the data between the host and the device

Use the Buffer APIs for the data transfer between host and device (before and after the kernel execution).

Use Kernel and Run handle/objects to offload and manage the compute-intensive tasks running on FPGA.

Below we will walk through the common API usage to accomplish the above tasks.

Device and XCLBIN¶

The Device and XCLBIN classes provide fundamental infrastructure-related interfaces. The primary objectives of the device and XCLBIN related APIs are to

Open a Device

Load compiled kernel binary (or XCLBIN) onto the device

Example C API based code

10      xrtDeviceHandle device = xrtDeviceOpen(0);
11 
12      xrtXclbinHandle xclbin = xrtXclbinAllocFilename("kernel.xclbin");
13 
14      xrtDeviceLoadXclbinHandle(device,xclbin);
15      ..............
16      ..............
17      xrtDeviceClose(device);

The above code block shows

Opening the device (enumerated as 0) and getting the device handle xrtDeviceHandle (line 10)
Device indices are enumerated as 0, 1, 2, and so on, and can be observed with xbutil scan
>>xbutil scan
INFO: Found total 2 card(s), 2 are usable
.............
[0] 0000:b3:00.1 xilinx_u250_gen3x16_base_1 user(inst=129)
[1] 0000:65:00.1 xilinx_u50_gen3x16_base_1 user(inst=128)
Opening the XCLBIN from the filename and getting an XCLBIN handle xrtXclbinHandle (line 12)

Loading the XCLBIN onto the device through the XCLBIN handle with the API xrtDeviceLoadXclbinHandle (line 14)

Closing the device handle at the end of the application (line 17)

C++: The equivalent C++ API based code

    unsigned int dev_index = 0;
    auto device = xrt::device(dev_index);
    auto xclbin_uuid = device.load_xclbin("kernel.xclbin");

The above code block shows

The xrt::device class’s constructor is used to open the device

The member function xrt::device::load_xclbin is used to load the XCLBIN from the filename.

The member function xrt::device::load_xclbin returns the XCLBIN UUID, which is required to open the kernel (refer to the Kernel section).

Buffers¶

Buffers are primarily used to transfer the data between the host and the device. The Buffer related APIs are discussed in the following three subsections

Buffer allocation and deallocation

Data transfer using Buffers

Miscellaneous other Buffer APIs

1. Buffer allocation and deallocation¶

XRT provides the following APIs for buffer allocation and deallocation

xrtBOAlloc: Allocates a 4K-aligned buffer object; the API must be called with the appropriate flags.

xrtBOAllocUserPtr: Allocates a buffer object using a pointer provided by the user. The user pointer must be aligned to a 4K boundary.

xrtBOFree: Deallocates the allocated buffer.

15      xrtMemoryGroup bank_grp_idx_0 = xrtKernelArgGroupId(kernel, 0);
16      xrtMemoryGroup bank_grp_idx_1 = xrtKernelArgGroupId(kernel, 1);
17 
18      xrtBufferHandle input_buffer = xrtBOAlloc(device, buffer_size_in_bytes, XRT_BO_FLAGS_NONE, bank_grp_idx_0);
19      xrtBufferHandle output_buffer = xrtBOAlloc(device, buffer_size_in_bytes, XRT_BO_FLAGS_NONE, bank_grp_idx_1);
20 
21      ....
22      ....
23      xrtBOFree(input_buffer);
24      xrtBOFree(output_buffer);

The above code block shows

Buffer allocation API xrtBOAlloc at lines 18,19

Buffer deallocation API xrtBOFree at lines 23,24

The various arguments of the API xrtBOAlloc are

Argument 1: The device on which the buffer should be allocated

Argument 2: The size (in bytes) of the buffer

Argument 3: xrtBufferFlags: Used to specify the buffer type, most commonly used types are

XRT_BO_FLAGS_NONE: Regular Buffer

XRT_BO_FLAGS_DEV_ONLY: Device only Buffer (meant to be used only by the kernel).

XRT_BO_FLAGS_HOST_ONLY: Host only Buffer (the buffer resides in host memory and is transferred directly to/from the kernel)

XRT_BO_FLAGS_P2P: P2P Buffer, buffer for NVMe transfer

XRT_BO_FLAGS_CACHEABLE: Cacheable buffer, which can be used when the host CPU frequently accesses the buffer (applicable to embedded platforms).

Argument 4: xrtMemoryGroup: Enumerated Memory Bank to specify the location on the device where the buffer should be allocated. The xrtMemoryGroup is obtained by the API xrtKernelArgGroupId as shown in line 15 (for more details of this API refer to the Kernel section).

C++: The equivalent C++ API based code

    auto bank_grp_idx_0 = kernel.group_id(0);
    auto bank_grp_idx_1 = kernel.group_id(1);

    auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
    auto output_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_1);

In the above code, xrt::bo buffer objects are created using the class’s constructor. Note that no buffer flag is passed, because the constructor creates a regular buffer by default. The available buffer flags for xrt::bo are defined by an enum class with the following enumerator values

xrt::bo::flags::normal: Default, Regular Buffer

xrt::bo::flags::device_only: Device only Buffer (meant to be used only by the kernel).

xrt::bo::flags::host_only: Host only Buffer (the buffer resides in host memory and is transferred directly to/from the kernel)

xrt::bo::flags::p2p: P2P Buffer, buffer for NVMe transfer

xrt::bo::flags::cacheable: Cacheable buffer, which can be used when the host CPU frequently accesses the buffer (applicable to embedded platforms).
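
The 4K alignment requirement stated above for user-provided pointers can be met on the host side before the pointer is handed to the allocation API. The following is a minimal sketch, assuming a POSIX host where posix_memalign is available. The helper names alloc_4k_aligned and is_4k_aligned are hypothetical and are not part of XRT; the sketch only shows how an aligned block could be obtained and checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdlib.h>  // posix_memalign, free (POSIX, assumed available)

constexpr std::size_t kPageAlignment = 4096;  // the 4K boundary named in the text

// Deleter so that the aligned block is released with free().
struct FreeDeleter {
    void operator()(void* p) const noexcept { free(p); }
};
using AlignedBlock = std::unique_ptr<void, FreeDeleter>;

// Hypothetical helper: returns a 4K-aligned block of at least `bytes` bytes,
// or an empty pointer if the allocation fails.
inline AlignedBlock alloc_4k_aligned(std::size_t bytes) {
    assert(bytes > 0);
    void* raw = nullptr;
    if (posix_memalign(&raw, kPageAlignment, bytes) != 0) {
        raw = nullptr;
    }
    return AlignedBlock(raw);
}

// Hypothetical helper: true when the pointer sits on a 4K boundary.
inline bool is_4k_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kPageAlignment == 0;
}
```

A block obtained this way stays owned by the host code, so it has to outlive any buffer object created from its pointer.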

2. Data transfer using Buffers¶

The XRT Buffer API library provides a rich set of APIs for data transfers between the host and the device, between buffers, etc. We will discuss the following data transfer styles

Data transfer between host and device by Buffer read/write API

Data transfer between host and device by Buffer map API

Data transfer between buffers by copy API

I. Data transfer between host and device by Buffer read/write API¶

To transfer the data from the host to the device, the user first needs to update the host-side buffer backing pointer followed by a DMA transfer to the device.

The following C APIs are used for the above tasks

xrtBOWrite

xrtBOSync with flag XCL_BO_SYNC_BO_TO_DEVICE

In C++, xrt::bo class has following member functions for the same functionality

xrt::bo::write

xrt::bo::sync with flag XCL_BO_SYNC_BO_TO_DEVICE

To transfer the data from the device to the host, the steps are reversed: the user first needs to do a DMA transfer from the device, followed by reading the data from the host-side buffer backing pointer.

The following C APIs are used for the above tasks

xrtBOSync with flag XCL_BO_SYNC_BO_FROM_DEVICE

xrtBORead

In C++ the corresponding xrt::bo class’s member functions are

xrt::bo::sync with flag XCL_BO_SYNC_BO_FROM_DEVICE

xrt::bo::read

Code example of transferring data from the host to the device

    xrtBufferHandle input_buffer = xrtBOAlloc(device, buffer_size_in_bytes, XRT_BO_FLAGS_NONE, bank_grp_idx_0);

    // Prepare the input data
    int buff_data[data_size];
    for (int i=0; i<data_size; ++i) {
        buff_data[i] = i;
    }

    // Copy into the host-side backing storage, then DMA it to the device
    xrtBOWrite(input_buffer, buff_data, data_size*sizeof(int), 0);
    xrtBOSync(input_buffer, XCL_BO_SYNC_BO_TO_DEVICE, data_size*sizeof(int), 0);

C++: The equivalent C++ API based code

    auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
    // Prepare the input data
    int buff_data[data_size];
    for (auto i=0; i<data_size; ++i) {
        buff_data[i] = i;
    }

    input_buffer.write(buff_data);
    input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);

Note that the C++ member functions xrt::bo::sync, xrt::bo::write, xrt::bo::read, etc. have overloaded versions that can be used for partial buffer sync/read/write by specifying the size and the offset. In the above code example, the full buffer size and offset 0 are used as default arguments.
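
To make the size and offset semantics of a partial write or read concrete, the following sketch models a host-side backing store with a plain byte vector. HostBackingStore is a hypothetical stand-in written for illustration; it is not the xrt::bo class and does not touch a device. The range check it performs is an assumption about what a sensible caller would verify, not a statement about XRT's internal behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical model of a buffer's host-side backing storage.
class HostBackingStore {
public:
    explicit HostBackingStore(std::size_t bytes) : data_(bytes, 0) {}

    std::size_t size() const { return data_.size(); }

    // Partial write: copies `size` bytes from `src` to byte position `offset`.
    void write(const void* src, std::size_t size, std::size_t offset) {
        assert(src != nullptr);
        check_range(size, offset);
        std::memcpy(data_.data() + offset, src, size);
    }
    // Full write: whole buffer, offset 0 (mirrors the default arguments).
    void write(const void* src) { write(src, data_.size(), 0); }

    // Partial read: copies `size` bytes starting at `offset` into `dst`.
    void read(void* dst, std::size_t size, std::size_t offset) const {
        assert(dst != nullptr);
        check_range(size, offset);
        std::memcpy(dst, data_.data() + offset, size);
    }
    // Full read: whole buffer, offset 0.
    void read(void* dst) const { read(dst, data_.size(), 0); }

private:
    // Rejects any range that does not fit inside the buffer.
    void check_range(std::size_t size, std::size_t offset) const {
        if (offset > data_.size() || size > data_.size() - offset) {
            throw std::out_of_range("size/offset outside the buffer");
        }
    }
    std::vector<unsigned char> data_;
};
```

Both size and offset are expressed in bytes, so a write of two int values into the second half of a four-int buffer uses size 2*sizeof(int) and offset 2*sizeof(int).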

II. Data transfer between host and device by Buffer map API¶

The API xrtBOMap (C++: xrt::bo::map) allows mapping the host-side buffer backing pointer to a user pointer. The host code can subsequently exercise the user pointer for the data reads and writes. However, after writing to the mapped pointer (or before reading from the mapped pointer) the API xrtBOSync (C++: xrt::bo::sync) should be used with direction flag for the DMA operation.

Code example of transferring data from the host to the device by this approach

    xrtBufferHandle input_buffer = xrtBOAlloc(device, buffer_size_in_bytes, XRT_BO_FLAGS_NONE, bank_grp_idx_0);
    int* input_buffer_mapped = (int*)xrtBOMap(input_buffer);

    // Write directly through the mapped pointer
    for (int i=0; i<data_size; ++i) {
        input_buffer_mapped[i] = i;
    }

    xrtBOSync(input_buffer, XCL_BO_SYNC_BO_TO_DEVICE, buffer_size_in_bytes, 0);

C++: The equivalent C++ API based code

    auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
    auto input_buffer_mapped = input_buffer.map<int*>();

    for (auto i=0; i<data_size; ++i) {
        input_buffer_mapped[i] = i;
    }

    input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);

III. Data transfer between the buffers by copy API¶

XRT provides the xrtBOCopy (C++: xrt::bo::copy) API for a deep copy between two buffer objects if the platform supports deep copy (for details refer to the M2M feature described in Memory-to-Memory (M2M)). If deep copy is not supported by the platform, the data transfer happens by shallow copy (the data transfer happens via the host).

API Example in C, all arguments are self-explanatory

    size_t dst_buffer_offset = 0;
    size_t src_buffer_offset = 0;
    xrtBOCopy(dst_buffer, src_buffer, size_of_copy, dst_buffer_offset, src_buffer_offset);

C++: The equivalent C++ API based code

    dst_buffer.copy(src_buffer, copy_size_in_bytes);

The API xrt::bo::copy also has an overloaded version that accepts non-zero offsets for both the source and the destination buffer.

3. Miscellaneous other Buffer APIs¶

This section describes a few other specific use-cases using buffers.

DMA-BUF API¶

XRT provides Buffer export and import APIs, primarily used for sharing buffers across devices (P2P applications) and processes.

xrtBOExport (C++: xrt::bo::export_buffer): Export the buffer to an exported buffer handle

xrtBOImport (C++: xrt::bo constructor) : Allocate a BO imported from exported buffer handle

Consider the situation of exporting buffer from device 1 to device 2.

    xclBufferExportHandle buffer_exported = xrtBOExport(buffer_device_1);
    xrtBufferHandle buffer_device_2 = xrtBOImport(device_2, buffer_exported);

In the above example

The buffer buffer_device_1 is a buffer allocated on device 1

buffer_device_1 is exported to an xclBufferExportHandle by API xrtBOExport

The exported buffer of type xclBufferExportHandle is imported to device 2 by API xrtBOImport

C++: The equivalent C++ API based code

    auto buffer_exported = buffer_device_1.export_buffer();
    auto buffer_device_2 = xrt::bo(device_2, buffer_exported);

In the above example

The buffer buffer_device_1 is a buffer allocated on device 1

buffer_device_1 is exported by the member function xrt::bo::export_buffer

The new buffer buffer_device_2 is imported for device_2 by the constructor xrt::bo

Sub-buffer support¶

The API xrtBOSubAlloc (C++: supported by an xrt::bo class constructor) allocates a sub-buffer from a parent buffer by specifying a start offset and the size.

In the example below, a sub-buffer of size 4 bytes is created from a parent buffer, starting at offset 0

    xrtBufferHandle parent_buffer;  // assumed to be allocated earlier
    xrtBufferHandle sub_buffer;

    size_t sub_buffer_size = 4;
    size_t sub_buffer_offset = 0;

    sub_buffer = xrtBOSubAlloc(parent_buffer, sub_buffer_size, sub_buffer_offset);

C++: The equivalent C++ API based code

In C++ a sub-buffer is created by using the xrt::bo class’s constructor using the parent buffer, size, and offset as parameters.

    size_t sub_buffer_size = 4;
    size_t sub_buffer_offset = 0;

    auto sub_buffer = xrt::bo(parent_buffer, sub_buffer_size, sub_buffer_offset);

Buffer information¶

XRT provides a few other APIs to obtain information related to the buffer.

xrtBOSize (C++: member function xrt::bo::size): Size of the buffer

xrtBOAddr (C++: member function xrt::bo::address) : Physical address of the buffer

Kernel and Run¶

The XRT kernel APIs support creating a kernel handle (or object in C++) from the currently loaded xclbin. The kernel handle is used to execute the kernel function on the hardware instance (Compute Unit or CU) of the kernel.

A Run handle/object represents an execution of the kernel. Upon finishing the kernel execution, the Run handle/object can be reused to invoke the same kernel function if desired.

The following topics are discussed below

Obtaining kernel handle/object from XCLBIN

Getting the bank group index of a kernel argument

Reading and writing CU mapped registers

Execution of kernel and dealing with the associated run

Other kernel execution related API

Obtaining kernel handle/object from XCLBIN¶

The kernel handle (or object) is created from the device, XCLBIN UUID and the kernel name.

    xuid_t xclbin_uuid;
    xrtXclbinGetUUID(xclbin, xclbin_uuid);

    xrtKernelHandle kernel = xrtPLKernelOpen(device, xclbin_uuid, "kernel_name");
    ....
    ....
    xrtKernelClose(kernel);

In the above code example

The UUID of the XCLBIN is retrieved by the API xrtXclbinGetUUID

The kernel is created by the API xrtPLKernelOpen

The kernel is closed by the API xrtKernelClose

Note: For a kernel with more than one CU, a kernel handle (or object) should represent CUs that all have identical interface connectivity. If the CUs of the kernel do not all have identical connectivity, specific CU name(s) should be used to obtain a kernel handle (or object) that represents a subset of CUs with identical connectivity. Otherwise XRT selects a group of CUs internally and discards the rest (discarded CUs are not used during the execution of the kernel).

As an example, assume a kernel name is foo having 3 CUs foo_1, foo_2, foo_3. The CUs foo_1 and foo_2 are connected to DDR bank 0, but the CU foo_3 is connected to DDR bank 1.

Opening a kernel handle for foo_1 and foo_2 (as they have identical interface connectivity)
    xrtKernelHandle cu_group_1 = xrtPLKernelOpen(device, xclbin_uuid, "foo:{foo_1,foo_2}");
Opening a kernel handle for foo_3
    xrtKernelHandle cu_group_2 = xrtPLKernelOpen(device, xclbin_uuid, "foo:{foo_3}");
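
The grouping in the foo example can also be derived programmatically when the host code knows which bank each CU is connected to. The sketch below is a simplification under stated assumptions: each CU is described by a single bank index (real connectivity is per argument), and the CuInfo struct and both helper functions are hypothetical names, not XRT APIs. It only builds the "kernel_name:{cu_name,...}" strings that would then be passed to the kernel open call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one CU: its name and the bank it connects to.
struct CuInfo {
    std::string name;
    int bank;
};

// Builds "kernel:{cu_a,cu_b}" from a kernel name and a non-empty CU list.
inline std::string make_cu_group_name(const std::string& kernel,
                                      const std::vector<std::string>& cus) {
    assert(!cus.empty());
    std::string result = kernel + ":{";
    for (std::size_t i = 0; i < cus.size(); ++i) {
        if (i != 0) {
            result += ",";
        }
        result += cus[i];
    }
    result += "}";
    return result;
}

// Groups CUs that share a bank and returns one name string per group,
// ordered by bank index.
inline std::vector<std::string> group_names_by_bank(const std::string& kernel,
                                                    const std::vector<CuInfo>& cus) {
    std::map<int, std::vector<std::string>> by_bank;
    for (const CuInfo& cu : cus) {
        by_bank[cu.bank].push_back(cu.name);
    }
    std::vector<std::string> names;
    for (const auto& entry : by_bank) {
        names.push_back(make_cu_group_name(kernel, entry.second));
    }
    return names;
}
```

For the three CUs described above (foo_1 and foo_2 on bank 0, foo_3 on bank 1), this yields the two strings used in the kernel open calls.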

C++: In C++, an xrt::kernel object is created with the constructor of the xrt::kernel class.

    auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
    auto krnl = xrt::kernel(device, xclbin_uuid, name);

Exclusive access of the kernel’s CU¶

The API xrtPLKernelOpen opens a kernel’s CU in a shared mode so that the CU can be shared with the other processes. In some cases, it is required to open the CU in exclusive mode (for example, when it is required to read/write CU mapped register). Exclusive CU opening fails if the CU is already opened in either shared or exclusive access.

    xrtKernelHandle kernel = xrtPLKernelOpenExclusive(device, xclbin_uuid, "name");

C++: In C++, xrt::kernel constructor can be called with an additional enum class argument to access the kernel in exclusive mode. The enumerator values are:

xrt::kernel::cu_access_mode::shared (default xrt::kernel constructor argument)

xrt::kernel::cu_access_mode::exclusive

    auto krnl = xrt::kernel(device, xclbin_uuid, name, xrt::kernel::cu_access_mode::exclusive);

Getting bank group index of the kernel argument¶

We have seen in the Buffer allocation section that the buffer location must be provided when a buffer is created. XRT provides an API xrtKernelArgGroupId (C++: xrt::kernel::group_id) that returns the bank index (ID) of a specific argument of the kernel. This ID is passed as the last argument of the xrtBOAlloc API (in C++, of the xrt::bo constructor) to create the buffer on the same memory bank.

Let us review the example below, where the buffer is allocated for the kernel’s first argument (argument index 0) by using this API

    xrtMemoryGroup idx_0 = xrtKernelArgGroupId(kernel, 0); // bank index of 0th argument
    xrtBufferHandle a = xrtBOAlloc(device, data_size*sizeof(int), XRT_BO_FLAGS_NONE, idx_0);

C++: The equivalent C++ API based code

    auto input_buffer = xrt::bo(device, buffer_size_in_bytes, kernel.group_id(0));

The API fails if the kernel bank index is ambiguous, for example, if the kernel has multiple CUs with different connectivity for that argument. In those cases, it is required to create a kernel object/handle with a specific CU (or group of CUs with identical connectivity).

Reading and writing CU mapped registers¶

To read and write from the AXI-Lite register space corresponding to a CU, the CU must be opened in exclusive mode (in shared mode, multiple processes can access the CU’s address space, hence it is unsafe if they are trying to access/change registers at the same time leading to a potential race behavior). The required APIs for kernel register read and write are

xrtKernelReadRegister (C++: member function xrt::kernel::read_register)

xrtKernelWriteRegister (C++: member function xrt::kernel::write_register)

    uint32_t read_data;
    uint32_t write_data = 7;

    xrtKernelHandle kernel = xrtPLKernelOpenExclusive(device, xclbin_uuid, "foo:{foo_1}");

    xrtKernelReadRegister(kernel, READ_OFFSET, &read_data);
    xrtKernelWriteRegister(kernel, WRITE_OFFSET, write_data);

    xrtKernelClose(kernel);

In the above code block

The CU named “foo_1” (name syntax: “kernel_name:{cu_name}”) is opened exclusively.

The Register Read/Write operation is performed.

The kernel is closed.

C++: The equivalent C++ API example

    int read_data;
    int write_data = 7;

    auto krnl = xrt::kernel(device, xclbin_uuid, "foo:{foo_1}", xrt::kernel::cu_access_mode::exclusive);

    read_data = krnl.read_register(READ_OFFSET);
    krnl.write_register(WRITE_OFFSET, write_data);

Obtaining the argument offset¶

The register read/write access APIs use the register offset as shown in the above examples. The user can get the register offset of a corresponding kernel argument from the v++ generated .xclbin.info file and use with the register read/write APIs.

--------------------------
Instance:        foo_1
Base Address: 0x1800000

Argument:          a
Register Offset:   0x10

However, XRT also provides APIs to obtain the register offset for CU arguments. In the example below, the C API xrtKernelArgOffset is used to obtain the offset of the third argument of the CU foo:foo_1.

    // Assume foo has 3 arguments, a,b,c (arg 0, arg 1 and arg 2 respectively)

    xrtKernelHandle kernel = xrtPLKernelOpenExclusive(device, xclbin_uuid, "foo:{foo_1}");
    uint32_t arg_c_offset = xrtKernelArgOffset(kernel, 2);

C++: The equivalent C++ API example

    // Assume foo has 3 arguments, a,b,c (arg 0, arg 1 and arg 2 respectively)

    auto krnl = xrt::kernel(device, xclbin_uuid, "foo:{foo_1}", xrt::kernel::cu_access_mode::exclusive);
    auto offset = krnl.offset(2);

Executing the kernel¶

Execution of the kernel is associated with a Run handle (or object). The kernel can be executed by the API xrtKernelRun (in C++ overloaded operator xrt::kernel::operator()) that takes all the kernel arguments in order. The kernel execution API returns a run handle (or object) corresponding to the execution.

    // 1st kernel execution
    xrtRunHandle run = xrtKernelRun(kernel, buf_a, buf_b, scalar_1);
    xrtRunWait(run);

    // 2nd kernel execution with just changing 3rd argument
    xrtRunSetArg(run, 2, scalar_2); // Arguments are specified starting from 0
    xrtRunStart(run);
    xrtRunWait(run);

    // Close the run handle
    xrtRunClose(run);

Note the following APIs regarding the above example

The kernel is executed by xrtKernelRun API by specifying all its arguments to obtain a Run handle

The API xrtKernelRun is non-blocking. It returns as soon as it submits the job without waiting for the kernel’s actual execution start.

The host code uses xrtRunWait API to block the current thread and wait till the kernel execution is finished.

After a run is finished, the same run handle can be reused to execute the kernel multiple times if desired.

API xrtRunSetArg is used to set one or more arguments, in the example above only the last (3rd) argument is changed before the second execution

API xrtRunStart is used to execute the kernel using the run handle.

API xrtRunClose is used to close the Run handle.

C++: The equivalent C++ code

In C++ the xrt::kernel class provides overloaded operator () to execute the kernel with a comma-separated list of arguments.

    // 1st kernel execution
    auto run = kernel(buf_a, buf_b, scalar_1);
    run.wait();

    // 2nd kernel execution with just changing 3rd argument
    run.set_arg(2, scalar_2); // Arguments are specified starting from 0
    run.start();
    run.wait();

The above C++ code block demonstrates

The kernel execution using the xrt::kernel() operator with the list of arguments that returns a xrt::run object. This is an asynchronous API and returns after submitting the task.

The member function xrt::run::wait is used to block the current thread until the current execution is finished.

The member function xrt::run::set_arg is used to set one or more kernel argument(s) before the next execution. In the example above, only the last (3rd) argument is changed.

The member function xrt::run::start is used to start the next kernel execution with new argument(s).

Graph¶

In Versal ACAPs with AI Engines, the XRT Graph APIs can be used to dynamically load, monitor, and control the graphs executing on the AI Engine array. As of the 2020.2 release, XRT provides a set of C APIs for graph control. The C++ APIs are planned for a future release. Also, as of the 2020.2 release Graph APIs are only supported on the Edge platform.

A graph handle is of type xrtGraphHandle.

Graph Opening and Closing¶

The XRT graph APIs support obtaining a graph handle from the currently loaded xclbin. The required APIs for graph open and close are

xrtGraphOpen: API provides the handle of the graph from the device, XCLBIN UUID, and the graph name.

xrtGraphClose: API to close the graph handle.

    xuid_t xclbin_uuid;
    xrtXclbinGetUUID(xclbin, xclbin_uuid);

    xrtGraphHandle graph = xrtGraphOpen(device, xclbin_uuid, "graph_name");
    ....
    ....
    xrtGraphClose(graph);

The graph handle obtained from xrtGraphOpen is used to execute the graph function on the AIE tiles.

Reset Functions¶

Two reset functions are available:

API xrtAIEResetArray is used to reset the whole AIE array.

API xrtGraphReset is used to reset a specified graph by disabling tiles and enabling tile reset.

    xrtDeviceHandle device_handle = xrtDeviceOpen(0);
    ...
    // AIE Array Reset
    xrtAIEResetArray(device_handle);

    xrtGraphHandle graphHandle = xrtGraphOpen(device_handle, xclbin_uuid, "graph_name");
    // Graph Reset
    xrtGraphReset(graphHandle);

Graph execution¶

XRT provides basic graph execution control APIs to initialize, run, wait, and terminate graphs for a specific number of iterations. Below we will review some of the common graph execution styles.

Graph execution for a fixed number of iterations¶

A graph can be executed for a fixed number of iterations followed by a “busy-wait” or a “time-out wait”.

Busy Wait scheme

The graph can be executed for a fixed number of iterations by the xrtGraphRun API using an iteration argument. Subsequently, the xrtGraphWait or xrtGraphEnd API should be used (with argument 0) to wait until the graph execution is completed.

Let’s review the below example

The graph is executed for 3 iterations by the API xrtGraphRun with the number of iterations as an argument.
The API xrtGraphWait(graphHandle,0) is used to wait until the iterations are done.
- The API xrtGraphWait is used because the host code needs to execute the graph again.
The graph is executed again for 5 iterations.
The API xrtGraphEnd(graphHandle,0) is used to wait until the iterations are done.
- After xrtGraphEnd the same graph should not be executed again.

    // start from reset state
    xrtGraphReset(graphHandle);

    // run the graph for 3 iterations
    xrtGraphRun(graphHandle, 3);

    // Wait till the graph is done
    xrtGraphWait(graphHandle, 0);  // Use xrtGraphWait if you want to execute the graph again

    // run the graph for 5 more iterations
    xrtGraphRun(graphHandle, 5);
    xrtGraphEnd(graphHandle, 0);  // Use xrtGraphEnd if you are done with the graph execution

Timeout wait scheme

As shown in the above example, xrtGraphWait(graphHandle,0) performs a busy-wait and suspends the host execution until the graph is done. If desired, a timeout version of the wait can be achieved with xrtGraphWaitDone, which waits for a specified number of milliseconds; if the graph is not done by then, the host code can do something else in the meantime. An example is shown below

    // start from reset state
    xrtGraphReset(graphHandle);

    // run the graph for 100 iterations
    xrtGraphRun(graphHandle, 100);

    while (1) {
        auto rval = xrtGraphWaitDone(graphHandle, 5);
        std::cout << "Wait for graph done returns: " << rval << std::endl;
        if (rval == -ETIME) {
            std::cout << "Timeout, reenter......" << std::endl;
            // Do something
        }
        else  // Graph is done, quit the loop
            break;
    }
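
The loop above retries without limit. Where an upper bound on the total wait is wanted, the same pattern can be wrapped with an attempt counter. The following is a minimal sketch, not part of XRT: wait_with_retries and WaitResult are hypothetical names, and the wait_once callable stands in for a call such as xrtGraphWaitDone(graphHandle, timeout_ms), assumed here to return 0 when the graph is done and -ETIME on timeout.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Outcome of the bounded wait.
enum class WaitResult { done, timed_out, failed };

// Calls `wait_once(timeout_ms)` up to `max_attempts` times.
// Assumed contract of `wait_once`: 0 = done, -ETIME = timeout,
// any other value = error.
inline WaitResult wait_with_retries(const std::function<int(int)>& wait_once,
                                    int timeout_ms, int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const int rval = wait_once(timeout_ms);
        if (rval == 0) {
            return WaitResult::done;
        }
        if (rval != -ETIME) {
            return WaitResult::failed;  // not a timeout: report the error
        }
        // Timed out: the host code could do other work here before retrying.
    }
    return WaitResult::timed_out;
}
```

In host code the callable would typically be a small lambda that captures the graph handle, which keeps the retry logic testable without hardware.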

Infinite Graph Execution¶

The graph runs infinitely if xrtGraphRun is called with iteration argument -1. While a graph is running infinitely, the APIs xrtGraphWait, xrtGraphSuspend and xrtGraphEnd can be used to suspend/end the graph operation after some number of AIE cycles. The API xrtGraphResume is used to resume execution of a suspended graph.

    // start from reset state
    xrtGraphReset(graphHandle);

    // run the graph infinitely
    xrtGraphRun(graphHandle, -1);

    xrtGraphWait(graphHandle, 3000);  // Suspends the graph after 3000 AIE cycles from the previous start

    xrtGraphResume(graphHandle); // Restart the suspended graph again to run forever

    xrtGraphSuspend(graphHandle); // Suspend the graph immediately

    xrtGraphResume(graphHandle); // Restart the suspended graph again to run forever

    xrtGraphEnd(graphHandle, 5000);  // End the graph operation after 5000 AIE cycles from the previous start

In the example above

The API xrtGraphRun(graphHandle, -1) is used to execute the graph infinitely
The API xrtGraphWait(graphHandle,3000) suspends the graph after 3000 AIE cycles from the previous graph start.
- If the graph has already run for more than 3000 AIE cycles, the graph is suspended immediately.
The API xrtGraphResume is used to restart the suspended graph
The API xrtGraphSuspend is used to suspend the graph immediately
The API xrtGraphEnd(graphHandle,5000) ends the graph after 5000 AIE cycles from the previous graph start.
- If the graph has already run for more than 5000 AIE cycles, the graph ends immediately.
- Using xrtGraphEnd eliminates the capability of rerunning the graph (without loading the PDI and resetting the graph again).

Measuring AIE cycle consumed by the Graph¶

The API xrtGraphTimeStamp can be used to determine the number of AIE cycles consumed between a graph start and stop.

In this example, the number of AIE cycles consumed by 3 iterations is calculated

    // start from reset state
    xrtGraphReset(graphHandle);

    uint64_t begin_t = xrtGraphTimeStamp(graphHandle);

    // run the graph for 3 iterations
    xrtGraphRun(graphHandle, 3);

    xrtGraphWait(graphHandle, 0);

    uint64_t end_t = xrtGraphTimeStamp(graphHandle);

    std::cout << "Number of AIE cycles consumed in the 3 iterations is: " << end_t - begin_t;
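
A cycle count such as end_t - begin_t can be turned into elapsed time once the AIE clock frequency is known. The sketch below shows only the arithmetic. The function name cycles_to_microseconds is hypothetical, and the clock frequency is an input the caller has to supply from the platform documentation; no particular frequency is implied by this document.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of cycle timestamps into elapsed microseconds.
// `clock_hz` is the AIE clock frequency in Hz, supplied by the caller.
inline double cycles_to_microseconds(std::uint64_t begin_cycles,
                                     std::uint64_t end_cycles,
                                     double clock_hz) {
    assert(end_cycles >= begin_cycles);
    assert(clock_hz > 0.0);
    // cycles / (cycles per second) = seconds; scaled by 1e6 for microseconds.
    return static_cast<double>(end_cycles - begin_cycles) * 1e6 / clock_hz;
}
```

As a worked case, 3000 cycles at an assumed clock of 1 GHz corresponds to about 3 microseconds.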

RTP (Runtime Parameter) control¶

XRT provides the API to update and read the runtime parameters of the graph.

The API xrtGraphUpdateRTP to update the RTP
The API xrtGraphReadRTP to read the RTP.

    int ret = xrtGraphReset(graphHandle);
    if (ret) throw std::runtime_error("Unable to reset graph");

    ret = xrtGraphRun(graphHandle, 2);
    if (ret) throw std::runtime_error("Unable to run graph");

    // Update RTP
    float increment[1] = {1};
    const char *inVect = reinterpret_cast<const char *>(increment);
    xrtGraphUpdateRTP(graphHandle, "mm.mm0.in[2]", inVect, sizeof(float));

    // Do more things
    xrtGraphRun(graphHandle, 16);
    xrtGraphWait(graphHandle, 0);

    // Read RTP
    float increment_out[1] = {1};
    char *outVect = reinterpret_cast<char *>(increment_out);
    xrtGraphReadRTP(graphHandle, "mm.mm0.inout[0]", outVect, sizeof(float));
    std::cout << "\n RTP value read " << increment_out[0];

In the above example, the API xrtGraphUpdateRTP and xrtGraphReadRTP are used to update and read the RTP values respectively. Note the API arguments

The hierarchical name of the RTP port

Pointer to write or read the RTP variable

The size of the RTP value.

DMA operation to and from Global Memory IO¶

XRT provides API xrtAIESyncBO to synchronize the buffer contents between GMIO and AIE. The following code shows a sample example

    xrtDeviceHandle device_handle = xrtDeviceOpen(0);

    // Buffer from GM to AIE
    xrtBufferHandle in_bo_handle = xrtBOAlloc(device_handle, SIZE * sizeof(float), 0, 0);

    // Buffer from AIE to GM
    xrtBufferHandle out_bo_handle = xrtBOAlloc(device_handle, SIZE * sizeof(float), 0, 0);

    float *inp_bo_map = (float *)xrtBOMap(in_bo_handle);
    float *out_bo_map = (float *)xrtBOMap(out_bo_handle);

    // Prepare input data
    std::copy(my_float_array, my_float_array + SIZE, inp_bo_map);

    xrtAIESyncBO(device_handle, in_bo_handle, "in_sink", XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float), 0);

    xrtAIESyncBO(device_handle, out_bo_handle, "out_sink", XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float), 0);

The above code shows

Input and output buffer (in_bo_handle and out_bo_handle) to the graph are created and mapped to the user space

The API xrtAIESyncBO is used for data transfer using the following arguments

Device and Buffer Handle

The name of the GMIO ports associated with the DMA transfer

The direction of the buffer transfer

GMIO to Graph: XCL_BO_SYNC_BO_GMIO_TO_AIE

Graph to GMIO: XCL_BO_SYNC_BO_AIE_TO_GMIO

The size and the offset of the buffer

XRT Error API¶

In general, XRT APIs can encounter two types of errors:

Synchronous error: Error can be thrown by the API itself. These types of errors should be checked against all APIs (strongly recommended).

Asynchronous error: Errors from the underneath driver, system, hardware, etc.

XRT provides a couple of APIs to retrieve the asynchronous errors into the userspace host code. This helps to debug when something goes wrong.

xrtErrorGetLast - Gets the last error code and its timestamp of a given error class

xrtErrorGetString - Gets the description string of a given error code.

NOTE: The asynchronous error retrieving APIs are at an early stage of development and support only AIE related asynchronous errors. Full support for all other asynchronous errors is planned in a future release.

Example code

41      rval = xrtGraphRun(graphHandle, runInteration);
42      if (rval != 0) {
43          /* code to handle synchronous xrtGraphRun error */
44          goto fail;
45      }
46 
47      rval = xrtGraphWaitDone(graphHandle, timeout);
48      if (rval == -ETIME) {
49          /* wait Graph done timeout without further information */
50          xrtErrorCode errCode;
51          uint64_t timestamp;
52 
53          rval = xrtErrorGetLast(devHandle, XRT_ERROR_CLASS_AIE, &errCode, &timestamp);
54          if (rval == 0) {
55              size_t len = 0;
56              if (xrtErrorGetString(devHandle, errCode, nullptr, 0, &len))
57                  goto fail;
58              std::vector<char> buf(len);  // or C equivalent
59              if (xrtErrorGetString(devHandle, errCode, buf.data(), buf.size(), nullptr))
60                  goto fail;
61              /* code to deal with this specific error */
62              std::cout << buf.data() << std::endl;
63          }
64     }
65     /* more code can be added here to check other error class */

The above code shows

As good practice, synchronous error checking is done directly against all APIs (lines 41, 47, 53, 56, 59)

After a timeout occurs from xrtGraphWaitDone, the API xrtErrorGetLast is called to retrieve the asynchronous error code (line 53)

Using the error code, the API xrtErrorGetString is called with a null output buffer to get the length of the error string (line 56)

The API xrtErrorGetString is called a second time to get the full error string (line 59)
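
The two-call pattern described above (first query the length, then fetch the text) is independent of XRT and can be sketched with a stand-in query function. In the sketch below, fake_error_string is a hypothetical function that only mimics the (out, len, out_len) shape of the call; its message text and its choice to count the terminating null byte in the reported length are assumptions made for this illustration, not statements about xrtErrorGetString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Assumed shape of the query: fills `out` (up to `len` bytes) and/or reports
// the required length through `out_len`; returns 0 on success.
using StringQuery = std::function<int(char* out, std::size_t len, std::size_t* out_len)>;

// Hypothetical stand-in for an error-string query.
inline int fake_error_string(char* out, std::size_t len, std::size_t* out_len) {
    static const char message[] = "example error text";
    if (out_len != nullptr) {
        *out_len = sizeof(message);  // length including the terminating '\0'
    }
    if (out != nullptr && len > 0) {
        std::strncpy(out, message, len);
        out[len - 1] = '\0';  // always leave the output terminated
    }
    return 0;
}

// Two-call pattern: the first call asks only for the length,
// the second call fills a buffer of that size.
inline bool fetch_string(const StringQuery& query, std::string& result) {
    std::size_t len = 0;
    if (query(nullptr, 0, &len) != 0) {
        return false;
    }
    std::vector<char> buf(len + 1, '\0');  // one spare byte as a safety margin
    if (query(buf.data(), buf.size(), nullptr) != 0) {
        return false;
    }
    result.assign(buf.data());
    return true;
}
```

Sizing the buffer one byte larger than the reported length keeps the result terminated whether or not the reported length already counts the null byte.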
