XRT Native APIs¶
Starting with the 2020.2 release, XRT provides a native API set in C, C++, and Python flavors.
To use the native XRT APIs, the host application must link with the xrt_coreutil library. Compiling host code with the XRT native C++ API requires the C++17 standard (-std=c++17) or newer.
Example g++ command
g++ -g -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o host.exe host.cpp -lxrt_coreutil -pthread
The XRT native API is available in both C and C++ flavors. For general host code development the C++-based APIs are recommended, hence this document describes only the C++-based API interfaces. The full Doxygen generated C and C++ API documentation can be found in XRT Native Library C++ API.
The C++ Class objects used for the APIs are
| | C++ Class | Header files |
|---|---|---|
| Device | xrt::device | xrt/xrt_device.h |
| XCLBIN | xrt::xclbin | xrt/xrt_xclbin.h |
| Buffer | xrt::bo | xrt/xrt_bo.h |
| Kernel | xrt::kernel | xrt/xrt_kernel.h |
| Run | xrt::run | xrt/xrt_kernel.h |
| User-managed Kernel | xrt::ip | experimental/xrt_ip.h |
| Graph | xrt::graph | experimental/xrt_graph.h, experimental/xrt_aie.h |
The majority of the core data structures are defined in header files in the $XILINX_XRT/include/xrt/ directory. A few newer features, such as xrt::ip and the xrt::aie related classes, have their header files in the $XILINX_XRT/include/experimental directory. The API interfaces in the experimental folder are subject to breaking changes. Typical includes are shown below.
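A typical set of includes for the classes above is shown in this minimal sketch; the paths assume the host code is compiled with -I$XILINX_XRT/include as in the g++ command above, and exact header locations may differ between XRT releases.

#include <xrt/xrt_device.h>          // xrt::device
#include <xrt/xrt_bo.h>              // xrt::bo
#include <xrt/xrt_kernel.h>          // xrt::kernel and xrt::run
#include <experimental/xrt_ip.h>     // xrt::ip (user-managed kernel)
#include <experimental/xrt_graph.h>  // xrt::graph (AIE)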
The common host code flow using the above data structures is as below
Open Xilinx Device and Load the XCLBIN
Create Buffer objects to transfer data to kernel inputs and outputs
Use the Buffer class member functions for the data transfer between host and device (before and after the kernel execution).
Use Kernel and Run objects to offload and manage the compute-intensive tasks running on FPGA.
Below we will walk through the common API usage to accomplish the above tasks.
Device and XCLBIN¶
The Device and XCLBIN classes provide fundamental infrastructure-related interfaces. The primary objectives of the device and XCLBIN related APIs are
Open a Device
Load compiled kernel binary (or XCLBIN) onto the device
The simplest code to load an XCLBIN is as below
10 unsigned int dev_index = 0;
11 auto device = xrt::device(dev_index);
12 auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
The above code block shows
The xrt::device class constructor is used to open the device (enumerated as 0)
The member function xrt::device::load_xclbin is used to load the XCLBIN from the file name
The member function xrt::device::load_xclbin returns the XCLBIN UUID, which is required to open the kernel (refer to the Kernel and Run section)
The class constructor xrt::device::device(const std::string& bdf) also supports opening a device object from a PCIe BDF passed as a string.
10 auto device = xrt::device("0000:03:00.1");
The member function xrt::device::get_info() is useful to obtain necessary information about a device. Some of this information, such as the name and BDF, can be used to select a specific device before loading an XCLBIN
10 std::cout << "device name: " << device.get_info<xrt::info::device::name>() << "\n";
11 std::cout << "device bdf: " << device.get_info<xrt::info::device::bdf>() << "\n";
Buffers¶
Buffers are primarily used to transfer the data between the host and the device. The Buffer related APIs are discussed in the following three subsections
Buffer allocation and deallocation
Data transfer using Buffers
Miscellaneous other Buffer APIs
1. Buffer allocation and deallocation¶
The C++ interface for buffers is as below
The class constructor xrt::bo is used to allocate a 4K-aligned buffer object. By default a regular buffer is created (optionally, the user can create other types of buffers by providing a flag).
15 auto bank_grp_arg0 = kernel.group_id(0); // Memory bank index for kernel argument 0
16 auto bank_grp_arg1 = kernel.group_id(1); // Memory bank index for kernel argument 1
17
18 auto input_buffer = xrt::bo(device, buffer_size_in_bytes,bank_grp_arg0);
19 auto output_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_arg1);
In the above code, xrt::bo buffer objects are created using the class constructor. Please note the following
As no special flags are used, a regular buffer is created. A regular buffer is the most common type of buffer; it has a host backing pointer allocated by user space in heap memory and a device buffer allocated in the specified memory bank.
The second argument specifies the buffer size.
The third argument specifies the enumerated memory bank index (the buffer location) where the buffer should be allocated. There are two ways to specify the memory bank index
Through kernel arguments: In the above example, the xrt::kernel::group_id() member function is used to pass the memory bank index. This member function accepts a kernel argument index and automatically detects the corresponding memory bank index by inspecting the XCLBIN.
Passing the memory bank index directly: The enumerated memory bank index (as observed from the xbutil examine --report memory output) can also be passed directly.
Creating special Buffers¶
The xrt::bo() constructors accept several other buffer flags, described by an enum class argument with the following enumerator values
xrt::bo::flags::normal : Regular buffer (default)
xrt::bo::flags::device_only : Device-only buffer (meant to be used only by the kernel; there is no host backing pointer)
xrt::bo::flags::host_only : Host-only buffer (the buffer resides in host memory and is directly transferred to/from the kernel)
xrt::bo::flags::p2p : P2P buffer, a special type of device-only buffer capable of peer-to-peer transfer
xrt::bo::flags::cacheable : Cacheable buffer, can be used when the host CPU frequently accesses the buffer (applicable for edge platforms)
The below example shows creating a P2P buffer on a device memory bank connected to argument 3 of the kernel.
15 auto p2p_buffer = xrt::bo(device, buffer_size_in_bytes, xrt::bo::flags::p2p, kernel.group_id(3));
Creating Buffers from the user pointer¶
The xrt::bo()
constructor can also be called with a pointer provided by the user. The user pointer must be aligned to a 4K boundary.
15 // Host Memory pointer aligned to 4K boundary
16 int *host_ptr;
17 posix_memalign((void**)&host_ptr, 4096, MAX_LENGTH*sizeof(int));
18
19 // Sample example filling the allocated host memory
20 for(int i=0; i<MAX_LENGTH; i++) {
21 host_ptr[i] = i; // whatever
22 }
23
24 auto mybuf = xrt::bo (device, host_ptr, MAX_LENGTH*sizeof(int), kernel.group_id(3));
2. Data transfer using Buffers¶
The XRT buffer API library provides a rich set of APIs to help with data transfers between the host and the device, between buffers, etc. We will discuss the following data transfer styles
Data transfer between host and device by Buffer read/write API
Data transfer between host and device by Buffer map API
Data transfer between buffers by copy API
I. Data transfer between host and device by Buffer read/write API¶
To transfer the data from the host to the device, the user first needs to update the host-side buffer backing pointer followed by a DMA transfer to the device.
The xrt::bo class has the following member functions for this purpose
xrt::bo::write()
xrt::bo::sync() with flag XCL_BO_SYNC_BO_TO_DEVICE
To transfer the data from the device to the host, the steps are reversed: the user first needs to do a DMA transfer from the device, followed by reading the data from the host-side buffer backing pointer.
The corresponding xrt::bo class member functions are
xrt::bo::sync() with flag XCL_BO_SYNC_BO_FROM_DEVICE
xrt::bo::read()
Code example of transferring data from the host to the device
20 auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
21 // Prepare the input data
22 int buff_data[data_size];
23 for (auto i=0; i<data_size; ++i) {
24 buff_data[i] = i;
25 }
26
27 input_buffer.write(buff_data);
28 input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);
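For the reverse direction (device to host), a minimal sketch assuming an output_buffer allocated like the input buffer above and a host array out_data of the same size:

int out_data[data_size];
output_buffer.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
output_buffer.read(out_data);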
Note that the C++ xrt::bo::sync, xrt::bo::write, xrt::bo::read, etc. have overloaded versions that can be used for partial buffer sync/read/write by specifying a size and an offset. In the above code examples, the full buffer size and offset=0 are assumed as default arguments.
Also note that if the buffer is created from a user pointer, the xrt::bo::write or xrt::bo::read call is not required before or after the xrt::bo::sync call.
For device-only buffers (created with the xrt::bo::flags::device_only flag) the xrt::bo::sync() operation is not required; xrt::bo::write() (or xrt::bo::read()) alone is sufficient for the DMA operation. Because a device-only buffer has no host backing storage, xrt::bo::write() (or xrt::bo::read()) directly performs the DMA operation to (or from) the device memory.
Below is an example of creating a device-only buffer.
18 xrt::bo::flags device_flags = xrt::bo::flags::device_only;
19 auto device_only_buffer = xrt::bo(device, size_in_bytes, device_flags, bank_grp_arg0);
Since a device-only buffer has no host backing storage, the xrt::bo::write() and xrt::bo::read() APIs read/write directly from/to the device buffer, as shown in the sketch after the list.
xrt::bo::write(const void* src, size_t size, size_t seek) : Copies data from src to the device buffer directly.
xrt::bo::read(void* dst, size_t size, size_t skip) : Copies data from the device buffer to dst.
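For example, a minimal sketch of direct transfers to and from the device-only buffer created above; src_data and dst_data are hypothetical host arrays of size_in_bytes bytes.

device_only_buffer.write(src_data, size_in_bytes, 0);  // host data -> device memory
device_only_buffer.read(dst_data, size_in_bytes, 0);   // device memory -> host data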
II. Data transfer between host and device by Buffer map API¶
The API xrt::bo::map() allows mapping the host-side buffer backing pointer to a user pointer. The host code can subsequently exercise the user pointer for data reads and writes. However, after writing to the mapped pointer (or before reading from the mapped pointer) the API xrt::bo::sync() should be used with the appropriate direction flag for the DMA operation.
Code example of transferring data from the host to the device by this approach
20 auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_idx_0);
21 auto input_buffer_mapped = input_buffer.map<int*>();
22
23 for (auto i=0;i<data_size;++i) {
24 input_buffer_mapped[i] = i;
25 }
26
27 input_buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);
III. Data transfer between the buffers by copy API¶
XRT provides the xrt::bo::copy() API for a deep copy between two buffer objects if the platform supports a deep copy (for details refer to the M2M feature described in Memory-to-Memory (M2M)). If deep copy is not supported by the platform, the data transfer happens by shallow copy (the data goes via the host).
25 dst_buffer.copy(src_buffer, copy_size_in_bytes);
The API xrt::bo::copy()
also has overloaded versions to provide a different offset than 0 for both the source and the destination buffer.
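A sketch of such an overload is shown below; the argument order used here (size, then source offset, then destination offset) is an assumption, so verify it against xrt_bo.h.

// Copy copy_size_in_bytes from src_buffer starting at src_offset
// into dst_buffer starting at dst_offset (argument order assumed)
dst_buffer.copy(src_buffer, copy_size_in_bytes, src_offset, dst_offset);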
3. Miscellaneous other Buffer APIs¶
This section describes a few other specific use-cases using buffers.
DMA-BUF API¶
XRT provides buffer export and import APIs primarily used for sharing buffers across devices (P2P applications) and processes. The buffer handle obtained from xrt::bo::export_buffer() is essentially a file descriptor, hence sending it across processes requires a suitable IPC mechanism (for example, a UDS or Unix Domain Socket) to translate the file descriptor of one process into another process.
xrt::bo::export_buffer() : Exports the buffer to an exported buffer handle
xrt::bo() constructor: Allocates a BO imported from an exported buffer handle
Consider the situation of exporting a buffer from device 1 to device 2 (inside the same host process).
18 auto buffer_exported = buffer_device_1.export_buffer();
19 auto buffer_device_2 = xrt::bo(device_2, buffer_exported);
In the above example
The buffer buffer_device_1 is a buffer allocated on device 1
buffer_device_1 is exported by the member function xrt::bo::export_buffer
The new buffer buffer_device_2 is imported for device_2 by the xrt::bo constructor
Sub-buffer support¶
The xrt::bo class constructor can also be used to allocate a sub-buffer from a parent buffer by specifying a start offset and a size.
In the example below, a 4-byte sub-buffer is created from a parent buffer, starting at offset 0 of the parent
18 size_t sub_buffer_size = 4;
19 size_t sub_buffer_offset = 0;
20
21 auto sub_buffer = xrt::bo(parent_buffer, sub_buffer_size, sub_buffer_offset);
Buffer information¶
XRT provides a few other class member functions to obtain information related to a buffer, as shown in the sketch after the list.
The member function xrt::bo::size() : Size of the buffer
The member function xrt::bo::address() : Physical address of the buffer
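A minimal sketch using the input_buffer created earlier:

std::cout << "buffer size (bytes): " << input_buffer.size() << "\n";
std::cout << "buffer device address: " << input_buffer.address() << "\n";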
Kernel and Run¶
To execute a kernel on a device, a kernel class (xrt::kernel) object has to be created from the currently loaded XCLBIN. The kernel object can be used to execute the kernel function on a hardware instance (Compute Unit or CU) of the kernel.
A Run object (xrt::run) represents an execution of the kernel. Upon finishing the kernel execution, the Run object can be reused to invoke the same kernel function if desired.
The following topics are discussed below
Obtaining kernel object from XCLBIN
Getting the bank group index of a kernel argument
Execution of kernel and dealing with the associated run
Other kernel related API
Obtaining kernel object from XCLBIN¶
The kernel object is created from the device, the XCLBIN UUID, and the kernel name using the xrt::kernel() constructor as shown below
35 auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
36 auto krnl = xrt::kernel(device, xclbin_uuid, name);
Note: A single kernel object (when created by a kernel name) can be used to execute multiple CUs as long as the CUs have identical interface connectivity. If all the CUs of the kernel do not have identical connectivity, XRT assigns a subset of CUs (one or more CUs with identical connectivity) to the created kernel object and discards the rest of the CUs (discarded CUs are not used during the execution of the kernel). For this type of situation, creating a kernel object using mangled CU names can be more useful.
As an example, assume a kernel named foo has 3 CUs: foo_1, foo_2, and foo_3. The CUs foo_1 and foo_2 are connected to DDR bank 0, but the CU foo_3 is connected to DDR bank 1.
Opening a kernel object for foo_1 and foo_2 (as they have identical interface connectivity)
35 krnl_obj_1_2 = xrt::kernel(device, xclbin_uuid, "foo:{foo_1,foo_2}");
Opening kernel object for foo_3
35 krnl_obj_3 = xrt::kernel(device, xclbin_uuid, "foo:{foo_3}");
Getting bank group index of the kernel argument¶
As seen in the buffer creation section, the buffer location must be provided during buffer creation. The member function xrt::kernel::group_id() returns the memory bank index (or id) of a specific kernel argument. This id is passed as a parameter of the xrt::bo() constructor to create the buffer on the same memory bank.
Let us review the example below, where the buffer is allocated for the kernel's first argument (argument index 0).
15 auto input_buffer = xrt::bo(device, buffer_size_in_bytes, kernel.group_id(0));
If the kernel bank index is ambiguous, kernel.group_id() returns the last memory bank index in the list it maintains. This is the case when the kernel has multiple CUs with different connectivity for that argument. For example, assume a kernel argument (argument 0) is connected to memory banks 0, 1, and 2 (for 3 CUs); then kernel.group_id(0) will return the last index from the group {0,1,2}, i.e. 2. As a result the buffer is created on memory bank 2, so the buffer cannot be used for CU0 and CU1.
However, in the above situation, the user can always create 3 distinct kernel objects corresponding to the 3 CUs (by using the {kernel_name:{cu_name(s)}} syntax in the xrt::kernel constructor) and execute the CUs through separate xrt::kernel objects, as shown in the sketch below.
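A sketch for the foo example above, using one kernel object per CU so that each buffer can target the memory bank its CU actually uses:

auto krnl_cu1 = xrt::kernel(device, xclbin_uuid, "foo:{foo_1}");
auto krnl_cu2 = xrt::kernel(device, xclbin_uuid, "foo:{foo_2}");
auto krnl_cu3 = xrt::kernel(device, xclbin_uuid, "foo:{foo_3}");

// Buffers for argument 0 now land on the bank each CU is connected to
auto buf_cu1 = xrt::bo(device, buffer_size_in_bytes, krnl_cu1.group_id(0));
auto buf_cu3 = xrt::bo(device, buffer_size_in_bytes, krnl_cu3.group_id(0));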
Executing the kernel¶
Execution of the kernel is associated with a Run object. The kernel can be executed through xrt::kernel::operator(), which takes all the kernel arguments in order. This kernel execution API returns a run object corresponding to the execution.
50 // 1st kernel execution
51 auto run = kernel(buf_a, buf_b, scalar_1);
52 run.wait();
53
54 // 2nd kernel execution with just changing 3rd argument
55 run.set_arg(2,scalar_2); // Arguments are specified starting from 0
56 run.start();
57 run.wait();
The xrt::kernel class provides an overloaded operator() to execute the kernel with a comma-separated list of arguments.
The above C++ code block demonstrates
The kernel execution using the xrt::kernel() operator with the list of arguments, which returns an xrt::run object. This is an asynchronous API and returns after submitting the task.
The member function xrt::run::wait() is used to block the current thread until the current execution is finished.
The member function xrt::run::set_arg() is used to set one or more kernel argument(s) before the next execution. In the example above, only the last (3rd) argument is changed.
The member function xrt::run::start() is used to start the next kernel execution with the new argument(s).
Other kernel APIs¶
Obtaining the run object before execution: The example in the previous section shows how to obtain an xrt::run object when the kernel is executed (the kernel execution returns a run object). However, an xrt::run object can also be obtained before the kernel execution. The flow is as below (a sketch follows the list)
Open a Run object by the xrt::run constructor with a kernel object as its argument.
Set the kernel arguments for the next execution by the member function xrt::run::set_arg().
Execute the kernel by the member function xrt::run::start().
Wait for the execution to finish by the member function xrt::run::wait().
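A minimal sketch of this flow, reusing the buffer and scalar names from the execution example above:

auto run = xrt::run(kernel);   // create the run object without starting it
run.set_arg(0, buf_a);         // set the kernel arguments by index
run.set_arg(1, buf_b);
run.set_arg(2, scalar_1);
run.start();                   // launch the kernel
run.wait();                    // block until the execution finishes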
Timeout while waiting for the kernel to finish: The member function xrt::run::wait() blocks the current thread until the kernel execution finishes. To avoid blocking indefinitely, xrt::run::wait() also accepts a timeout specified in milliseconds.
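For example, a sketch assuming a (hypothetical) 1000 ms timeout; the exact overloads and return type of the timed wait should be verified against xrt_kernel.h.

// Returns when the kernel finishes or after roughly 1000 ms, whichever comes first
auto state = run.wait(std::chrono::milliseconds(1000));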
User Managed Kernel¶
The xrt::kernel class is used to execute kernels with a standard control interface through AXI-Lite control registers. These standard control interfaces are well defined and understood by XRT but transparent to the user. Such XRT-managed kernels should always be represented by xrt::kernel objects in the host code.
XRT also supports a custom control interface for a kernel. This type of kernel (a.k.a. user-managed kernel) must be managed by the user by writing/reading to/from the AXI-Lite registers controlling the kernel. To differentiate it from an XRT-managed kernel, the class xrt::ip is used to specify a user-managed kernel inside the user host code.
Creating xrt::ip object from XCLBIN¶
The xrt::ip object creation is very similar to creating a kernel.
35 auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
36 auto ip = xrt::ip(device, xclbin_uuid, "ip_name");
An ip object can only be opened in exclusive mode, that is, only one thread or process can access the IP at a time. This is required for safety reasons, because multiple threads/processes reading/writing the AXI-Lite registers at the same time could lead to a race condition.
Allocating buffers for the IP inputs/outputs¶
Similar to an XRT-managed kernel, xrt::bo objects are used to create buffers for the IP ports. However, the memory bank location must be specified explicitly by providing the enumerated index of the memory bank.
Below is an example of creating two buffers. Note that the last argument of xrt::bo is the enumerated index of the memory bank as seen by XRT (in this example index 8 corresponds to the host-memory bank). The bank index can be obtained by the xbutil examine --report memory command.
35 auto buf_in_a = xrt::bo(device, DATA_SIZE, xrt::bo::flags::host_only, 8);
36 auto buf_in_b = xrt::bo(device, DATA_SIZE, xrt::bo::flags::host_only, 8);
Reading and writing CU mapped registers¶
To read and write the AXI-Lite register space of a CU (represented by an xrt::ip object in the host code), the required member functions of the xrt::ip class are
xrt::ip::read_register
xrt::ip::write_register
35 int read_data;
36 int write_data = 7;
37
38 auto ip = xrt::ip(device, xclbin_uuid, "foo:{foo_1}");
39
40 read_data = ip.read_register(READ_OFFSET);
41 ip.write_register(WRITE_OFFSET,write_data);
In the above code block
The CU named “foo_1” (name syntax: “kernel_name:{cu_name}”) is opened exclusively.
The Register Read/Write operation is performed.
Graph¶
In Versal ACAPs with AI Engines, the XRT Graph class (xrt::graph
) and its member functions can be used to dynamically load, monitor, and control the graphs executing on the AI Engine array.
A note regarding Device and Buffer: In AIE based applications, the device and buffer have some additional functionalities. For this reason the classes xrt::aie::device and xrt::aie::buffer are recommended for specifying device and buffer objects.
Graph Opening and Closing¶
The xrt::graph
object can be opened using the UUID of the currently loaded XCLBIN file as shown below
35 auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
36 auto graph = xrt::graph(device, xclbin_uuid, "graph_name");
The graph object can be used to execute the graph function on the AIE tiles.
Reset Functions¶
The member function xrt::graph::reset()
is used to reset a specified graph by disabling tiles and enabling tile reset.
45 auto device = xrt::aie::device(0);
46
47 // load XCLBIN
48 ...
49
50 auto graph = xrt::graph(device, xclbin_uuid, "graph_name");
51 // Graph Reset
52 graph.reset();
The member function xrt::aie::device::reset_array() is used to reset the whole AIE array. However, after this AIE reset is performed the PDI is lost, so a special AIE-only XCLBIN has to be loaded afterwards (this flow is for advanced users only).
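A minimal sketch, assuming the device was opened as an xrt::aie::device as in the reset example above:

device.reset_array();   // resets the whole AIE array; the PDI is lost
// load a special AIE-only XCLBIN before using the AIE array again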
Graph execution¶
XRT provides basic graph execution control interfaces to initialize, run, wait, and terminate graphs for a specific number of iterations. Below we will review some of the common graph execution styles.
Graph execution for a fixed number of iterations¶
A graph can be executed for a fixed number of iterations followed by a “busy-wait” or a “time-out wait”.
Busy Wait scheme
The graph can be executed for a fixed number of iterations by the xrt::graph::run() API using an iteration argument. Subsequently, the xrt::graph::wait() or xrt::graph::end() API should be used (with argument 0) to wait until the graph execution is completed.
Let’s review the below example
The graph is executed for 3 iterations by API
xrt::graph::run()
with the number of iterations as an argument.The API
xrt::graph::wait(0)
is used to wait till the iteration is done.The API xrt::graph::wait() is used because the host code needs to execute the graph again.
The Graph is executed again for 5 iteration
The API
xrt::graph::end(0)
is used to wait till the iteration is done.After
xrt::graph::end()
the same graph can not be executed.
35 // start from reset state
36 graph.reset();
37
38 // run the graph for 3 iteration
39 graph.run(3);
40
41 // Wait till the graph is done
42 graph.wait(0); // Use graph::wait if you want to execute the graph again
43
44
45 graph.run(5);
46 graph.end(0); // Use graph::end if you are done with the graph execution
Timeout wait scheme
As shown in the above example, xrt::graph::wait(0) performs a busy-wait and suspends execution until the graph is done. If desired, a timeout version of the wait can be used: xrt::graph::wait(std::chrono::milliseconds) waits for the specified number of milliseconds, and if the graph is not done by then the host can do something else in the meantime. An example is shown below
35 // start from reset state
36 graph.reset();
37
38 // run the graph for 100 iteration
39 graph.run(100);
40
41 while (1) {
42
43 try {
44 graph.wait(5); break; // graph finished within the timeout
45 }
46 catch (const std::system_error& ex) {
47
48 if (ex.code().value() == ETIME) {
49
50 std::cout << "Timeout, reenter......" << std::endl;
51 // Do something
52
53 }
54 }
55 }
Infinite Graph Execution¶
The graph runs infinitely if xrt::graph::run() is called with an iteration argument of 0. While a graph is running infinitely, the APIs xrt::graph::wait(), xrt::graph::suspend(), and xrt::graph::end() can be used to suspend/end the graph operation after some number of AIE cycles. The API xrt::graph::resume() is used to resume the infinitely running graph again.
39 // start from reset state
40 graph.reset();
41
42 // run the graph infinitely
43 graph.run(0);
44
45 graph.wait(3000); // Suspends the graph after 3000 AIE cycles from the previous start
46
47
48 graph.resume(); // Restart the suspended graph again to run forever
49
50 graph.suspend(); // Suspend the graph immediately
51
52 graph.resume(); // Restart the suspended graph again to run forever
53
54 graph.end(5000); // End the graph operation after 5000 AIE cycles from the previous start
In the example above
The member function xrt::graph::run(0) is used to execute the graph infinitely.
The member function xrt::graph::wait(3000) suspends the graph after 3000 AIE cycles from the graph start. If the graph has already run more than 3000 AIE cycles, the graph is suspended immediately.
The member function xrt::graph::resume() is used to restart the suspended graph.
The member function xrt::graph::suspend() is used to suspend the graph immediately.
The member function xrt::graph::end(5000) ends the graph after 5000 AIE cycles from the previous graph start. If the graph has already run more than 5000 AIE cycles, the graph ends immediately. Using xrt::graph::end() eliminates the capability of rerunning the graph (without loading the PDI and resetting the graph again).
Measuring AIE cycles consumed by the Graph¶
The member function xrt::graph::get_timestamp() can be used to determine the AIE cycles consumed between a graph start and stop.
In the example below, the AIE cycles consumed by 3 iterations are calculated
35 // start from reset state
36 graph.reset();
37
38 uint64_t begin_t = graph.get_timestamp();
39
40 // run the graph for 3 iteration
41 graph.run(3);
42
43 graph.wait(0);
44
45 uint64_t end_t = graph.get_timestamp();
46
47 std::cout << "Number of AIE cycles consumed in the 3 iterations: " << end_t - begin_t << std::endl;
RTP (Runtime Parameter) control¶
The xrt::graph class contains member functions to update and read the runtime parameters of the graph
The member function xrt::graph::update() to update an RTP
The member function xrt::graph::read() to read an RTP
35 graph.reset();
36
37 graph.run(2);
38
39 float increment = 1.0;
40 graph.update("mm.mm0.in[2]", increment);
41
42 // Do more things
43 graph.run(16);
44 graph.wait(0);
45
46 // Read RTP
47 float increment_out;
48 graph.read("mm.mm0.inout[0]", &increment_out);
49 std::cout << "\n RTP value read: " << increment_out;
In the above example, the member functions xrt::graph::update() and xrt::graph::read() are used to update and read the RTP values, respectively. Note the function arguments
The hierarchical name of the RTP port
Variable to set/read the RTP
DMA operation to and from Global Memory IO¶
The AIE buffer class xrt::aie::bo supports the member function xrt::aie::bo::sync(), which can be used to synchronize the buffer contents between global memory and the AIE. The following code shows a sample example
35 auto device = xrt::aie::device(0);
36
37 // Buffer from global memory (GM) to AIE
38 auto in_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
39
40 // Buffer from AIE to global memory (GM)
41 auto out_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
42
43 auto inp_bo_map = in_bo.map<float *>();
44 auto out_bo_map = out_bo.map<float *>();
45
46 // Prepare input data
47 std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);
48
49
50 in_bo.sync("in_sink", XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);
51
52 out_bo.sync("out_sink", XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float), 0);
The above code shows
The input and output buffers (in_bo and out_bo) to the graph are created and mapped to the user space.
The member function xrt::aie::bo::sync is used for the data transfer, with the following arguments
The name of the GMIO port associated with the DMA transfer
The direction of the buffer transfer
GMIO to Graph: XCL_BO_SYNC_BO_GMIO_TO_AIE
Graph to GMIO: XCL_BO_SYNC_BO_AIE_TO_GMIO
The size and the offset of the buffer
GMIOs & External Buffers¶
XRT provides a buffer class xrt::aie::buffer which represents GMIOs and external buffers. GMIOs and external buffers facilitate the movement of data from global memory (like DDR) to the AI Engine and vice versa. Both GMIOs and external buffers work together to manage data flow efficiently by ensuring that large datasets can be processed effectively without overwhelming local memory resources.
The xrt::aie::buffer object creation succeeds only if a GMIO/external buffer with the given name exists.
This class has an overloaded member function xrt::aie::buffer::sync(...) that can be used to synchronize the buffer contents between global memory and the AIE.
xrt::aie::buffer::sync(xrt::bo bo, …) synchronizes the buffer contents between the xrt::aie::buffer (GMIO/external buffer) and an xrt::bo (global memory)
xrt::aie::buffer::sync(xrt::bo ping, xrt::bo pong, …) configures the external buffer with ping/pong buffers for parallelism
The following code shows a sample example with a single input/output GMIO/External Buffer. Data gets transferred from global buffer “in_bo” to “gr.in1”
1 auto device = xrt::aie::device(0);
2 auto uuid = device.load_xclbin(xclbin_filename);
3
4 // Create Buffer in DDR/Global memory & prepare input
5 auto in_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
6 auto inp_bo_map = in_bo.map<float *>();
7 std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);
8
9 // Create Buffer in DDR/Global memory to store output
10 auto out_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
11 auto out_bo_map = out_bo.map<float *>();
12
13 // create GMIO/External Buffer object for input & sync from in_bo
14 auto in_buffer = xrt::aie::buffer(device, uuid, "gr.in1");
15 in_buffer.sync(in_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);
16
17 //run your graphs which uses output GMIO/External buffer
18
19 // create GMIO/External Buffer object for output & sync from out_bo
20 auto out_buffer = xrt::aie::buffer(device, uuid, "gr.out1");
21 out_buffer.sync(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);
This class also has an overloaded member function xrt::aie::buffer::async(...) that can be used to initiate an asynchronous operation to synchronize the buffer contents with an xrt::bo buffer object
xrt::aie::buffer::async(xrt::bo bo, …) initiates an asynchronous operation between the xrt::aie::buffer (GMIO/external buffer) and an xrt::bo (global memory)
xrt::aie::buffer::async(xrt::bo ping, xrt::bo pong, …) initiates an asynchronous operation between the xrt::aie::buffer (GMIO/external buffer) and ping/pong xrt::bo objects
xrt::aie::buffer::wait() waits for the asynchronous operation to complete
The following code shows a sample example with a single input/output GMIO/External Buffer. Data gets transferred from global buffer “in_bo” to “gr.in1”
1 auto device = xrt::aie::device(0);
2 auto uuid = device.load_xclbin(xclbin_filename);
3
4 // Create Buffer in DDR/Global memory & prepare input
5 auto in_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
6 auto inp_bo_map = in_bo.map<float *>();
7 std::copy(my_float_array,my_float_array+SIZE,inp_bo_map);
8
9 // Create Buffer in DDR/Global memory to store output
10 auto out_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
11 auto out_bo_map = out_bo.map<float *>();
12
13 // create GMIO/External Buffer object for input & sync from in_bo
14 auto in_buffer = xrt::aie::buffer(device, uuid, "gr.in1");
15 in_buffer.async(in_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);
16
17 //run your graphs which uses output GMIO/External buffer
18
19 // create GMIO/External Buffer object for output & sync from out_bo
20 auto out_buffer = xrt::aie::buffer(device, uuid, "gr.out1");
21 out_buffer.async(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);
22 out_buffer.wait();
Ping Pong buffers¶
The following code shows a ping-pong buffer example, with a ping/pong buffer pair being set on one of the external buffers.
1 auto device = xrt::aie::device(0);
2 auto uuid = device.load_xclbin(xclbin_filename);
3
4 // Create a Buffer in DDR/Global Memory & prepare input
5 auto in_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
6 auto in_bo_map = in_bo.map<float *>();
7 std::copy(my_float_array,my_float_array+SIZE,in_bo_map);
8
9 // create GMIO/External Buffer object for input
10 auto in_buffer = xrt::aie::buffer(device, uuid, "gr.in1");
11 in_buffer.sync(in_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float), 0);
12
13 // Create a Buffer in DDR/Global Memory for storing output
14 auto out_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
15 auto out_bo_map = out_bo.map<float *>();
16
17 // Create Ping/Pong Buffers in global memory for intermediate buffers
18 auto ext1_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
19 auto ext2_bo = xrt::aie::bo (device, SIZE * sizeof (float), 0, 0);
20
21 // create GMIO/External Buffer object for the intermediate buffer
22 auto ping_pong_bo = xrt::aie::buffer(device, uuid, "gr.ext1");
23 ping_pong_bo.sync(ext1_bo, ext2_bo, XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float),0);
24 // create GMIO/External Buffer object for output
25 auto out_buffer = xrt::aie::buffer(device, uuid, "gr.out1");
26 out_buffer.sync(out_bo, XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float),0);
XRT Error API¶
In general, XRT APIs can encounter two types of errors:
Synchronous errors: Errors thrown by the API itself. The host code can catch these exceptions and take the necessary steps.
Asynchronous errors: Errors from the underlying driver, system, hardware, etc.
XRT provides an xrt::error
class and its member functions to retrieve the asynchronous errors into the userspace host code. This helps to debug when something goes wrong.
The member function xrt::error::get_error_code() - Gets the last error code and its timestamp for a given error class
The member function xrt::error::get_timestamp() - Gets the timestamp of the last error
The member function xrt::error::to_string() - Gets the description string of a given error code.
NOTE: The asynchronous error retrieving APIs are at an early stage of development and only support AIE related asynchronous errors. Full support for all other asynchronous errors is planned in a future release.
Example code
41 graph.run(runInteration);
42
43 try {
44 graph.wait(timeout);
45 }
46 catch (const std::system_error& ex) {
47
48 if (ex.code().value() == ETIME) {
49 xrt::error error(device, XRT_ERROR_CLASS_AIE);
50
51 auto errCode = error.get_error_code();
52 auto timestamp = error.get_timestamp();
53 auto err_str = error.to_string();
54
55 /* code to deal with this specific error */
56 std::cout << err_str << std::endl;
57 } else {
58 /* Something else */
59 }
60 }
The above code shows
After a timeout occurs from xrt::graph::wait(), the member functions of the xrt::error class are called to retrieve the asynchronous error code and timestamp.
The member function xrt::error::to_string() is called to obtain the error string.
Profiling¶
In Versal ACAPs with AI Engines, the XRT Profiling class (xrt::aie::profiling
) and its member functions can be used to configure AI Engine hardware resources for performance profiling and event tracing.
Create Profiling Event¶
The class constructor xrt::aie::profiling is used to create a profiling event object as shown below
35 auto event = xrt::aie::profiling(device);
The profiling object can be used to execute the profiling functions and collect profile statistics by calling the profiling APIs.
Start Profiling¶
The member function xrt::aie::profiling::start() is used to start performance counters in the AI Engine as per the profiling option passed as an argument. This function configures the performance counters in the AI Engine and starts profiling.
45 auto graph = xrt::graph(device, xclbin_uuid, "graph_name");
46 auto handle = event.start(option, port1, port2, value); // option: xrt::aie::profiling::profiling_option, port1/port2: std::string, value: int
47
48 // run graph
49 ...
50 s2mm_run.wait();
It returns a handle to be used by read and stop.
Read Profiling¶
The xrt::aie::profiling::read function returns the current performance counter value associated with the profiling handle. It can be called on the profiling event object as shown below
35 long long cycle_count = event.read();
Stop Profiling¶
The xrt::aie::profiling::stop
function stops the performance profiling associated with the profiling handle and releases the corresponding hardware resources.
35 event.stop();
36 double throughput = output_size_in_bytes / (cycle_count *0.8 * 1e-3);
37 // Every AIE cycle is 0.8ns in production board
38 std::cout << "Throughput of the graph: " << throughput << " MB/s" << std::endl;
Asynchronous Programming with XRT (experimental)¶
From the 22.1 release, XRT offers a simple asynchronous programming mechanism through user-defined queues. The xrt::queue is a lightweight, general-purpose queue implementation that is completely separate from the core XRT native API data structures. If needed, the user can also use their own queue implementation instead of the one offered by xrt::queue.
The XRT queue implementation needs #include <experimental/xrt_queue.h> to be added as a header file. The implementation also uses C++17 features, so the host code must be compiled with g++ -std=c++17.
As a premise, by default all XRT native APIs execute from the host-thread, so if an API has synchronous behavior the host-thread blocks until that task is finished. For example
41 buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);
As xrt::bo::sync
is a synchronous API, the host thread blocks until its completion.
XRT defines a queue with xrt::queue having the following properties
Any task enqueued on the queue runs in parallel to the original host-thread, so the host-thread does not wait for its completion and can do other tasks in parallel.
The task enqueued on the queue must be synchronous in nature.
All tasks enqueued on a queue are completed in the order they were enqueued (strict in-order execution).
The task enqueued on a queue can be any C++ callable, which can conveniently be expressed by a C++ lambda.
When a synchronous task is enqueued on a queue, an event (xrt::queue::event) is returned. The event can be used for multiple purposes, such as:
The host-thread can wait on that event to synchronize with the queue::enqueue(task) from which the event was generated.
The event can be enqueued on other queues to synchronize among the queues.
Let’s understand the above properties by reviewing the examples below
Executing synchronous task asynchronously¶
41 auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
42 auto bo0_map = bo0.map<dtype*>();
43 .... // fill buffer content
44
45 xrt::queue my_queue;
46 auto sync_event = my_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
47
48 myCpuTask(b); // here we can perform other host task that will run parallel to the above bo::sync task
49
50 sync_event.wait(); // stall the host-thread till sync operation completes
The above code shows the synchronous API xrt::bo::sync being enqueued through an xrt::queue. The argument of xrt::queue::enqueue() is an unnamed callable, written as a C++ lambda capturing the buffer object. This technique is useful to execute any synchronous task asynchronously from the host-thread; while this task is ongoing, the host-thread can do other operations in parallel (myCpuTask() in the above code). The return type of xrt::queue::enqueue() is an xrt::queue::event, which is later synchronized to the host-thread by the blocking function xrt::queue::event::wait().
Executing multiple tasks through queue¶
Every new xrt::queue can be thought of as a new thread, running parallel to the host-thread, that executes a series of synchronous tasks in the order they were submitted (enqueued) on the queue. For example, let's consider tasks A, B, C, and D as below
Task A: Host to device data transfer (buffer bo0)
Task B: Execute the kernel and wait for the kernel to finish execution
Task C: Device to host data transfer (buffer bo_out)
Task D: Check return data bo_out
The above four tasks should be executed in-order for correct functionality. To execute them in parallel to the host-thread, these four tasks can be enqueued through a queue as below.
41 xrt::queue queue;
42 queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
43 queue.enqueue([&run] {run.start(); run.wait(); });
44 queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
45 queue.enqueue([&bo_out_map]{my_function_to_check_data(bo_out_map);});
The user can create and use as many queues as needed in the host code to overlap tasks in parallel. Next, we will see how to synchronize among the queues using events.
Using events to synchronize among the queues¶
Let’s assume in the above example, it is required to do two host-to-device buffer transfers before the kernel execution. If using a single queue the code would appear as
41 xrt::queue main_queue;
42 main_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
43 main_queue.enqueue([&bo1] {bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
44 main_queue.enqueue([&run] {run.start(); run.wait(); });
45 main_queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
In the above code, as a single queue (main_queue) is used, the host-to-device data transfers for buffers bo0 and bo1 happen sequentially. In order to do parallel data transfers for bo0 and bo1, a separate queue is needed for one of the buffers, and it is also required to ensure that the kernel executes only after both buffer transfers are completed.
41 xrt::queue main_queue;
42 xrt::queue queue_bo1;
43 main_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
44 auto bo1_event = queue_bo1.enqueue([&bo1] {bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
45 main_queue.enqueue(bo1_event);
46 main_queue.enqueue([&run] {run.start(); run.wait(); });
47 main_queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
In lines 43 and 44, the bo0 and bo1 host-to-device data transfers are enqueued through two separate queues to achieve parallel transfers. To synchronize between these two queues, the event returned from the queue_bo1 enqueue is enqueued in the main_queue, similar to a task enqueue (line 45). As a result, any other task submitted after that event won't execute until the event is finished. So in the above code example, the subsequent tasks in the main_queue (such as the kernel execution) would wait till the bo1_event is completed. By submitting an event returned from one queue::enqueue to another queue, we can synchronize among the queues.