.. _xrt_native_apis.rst:

..
   comment:: SPDX-License-Identifier: Apache-2.0
   comment:: Copyright (C) 2019-2021 Xilinx, Inc. All rights reserved.

XRT Native APIs
===============

From the 2020.2 release, XRT provides a new XRT API set in C, C++, and Python flavors. To use the native XRT APIs, the host application must link with the **xrt_coreutil** library. Compiling host code with the XRT native C++ API requires the C++17 standard (``-std=c++17``) or newer. Example g++ command

.. code-block:: shell

    g++ -g -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o host.exe host.cpp -lxrt_coreutil -pthread

The XRT native API supports both C and C++ flavors. For general host code development, the C++-based APIs are recommended, hence this document only describes the C++-based API interfaces. The full Doxygen-generated C and C++ API documentation can be found in :doc:`xrt_native.main`.

The C++ class objects used by the APIs are

+----------------------+-------------------+------------------------------------------------+
|                      | C++ Class         | Header files                                   |
+======================+===================+================================================+
| Device               | ``xrt::device``   | ``#include <xrt/xrt_device.h>``                |
+----------------------+-------------------+------------------------------------------------+
| XCLBIN               | ``xrt::xclbin``   | ``#include <xrt/xrt_xclbin.h>``                |
+----------------------+-------------------+------------------------------------------------+
| Buffer               | ``xrt::bo``       | ``#include <xrt/xrt_bo.h>``                    |
+----------------------+-------------------+------------------------------------------------+
| Kernel               | ``xrt::kernel``   | ``#include <xrt/xrt_kernel.h>``                |
+----------------------+-------------------+------------------------------------------------+
| Run                  | ``xrt::run``      | ``#include <xrt/xrt_kernel.h>``                |
+----------------------+-------------------+------------------------------------------------+
| User-managed Kernel  | ``xrt::ip``       | ``#include <experimental/xrt_ip.h>``           |
+----------------------+-------------------+------------------------------------------------+
| Graph                | ``xrt::graph``    | ``#include <experimental/xrt_graph.h>``        |
|                      |                   |                                                |
|                      |                   | ``#include <experimental/xrt_aie.h>``          |
+----------------------+-------------------+------------------------------------------------+

The majority of the core data structures are defined in the header files in the ``$XILINX_XRT/include/xrt/`` directory. A few newer features, such as the ``xrt::ip`` and ``xrt::aie`` related classes, have their header files in the ``$XILINX_XRT/include/experimental`` directory. The API interfaces in the experimental folder are subject to breaking changes.

The common host code flow using the above data structures is as below

- Open Xilinx **Device** and load the **XCLBIN**
- Create **Buffer** objects to transfer data to kernel inputs and outputs
- Use the Buffer class member functions for the data transfer between host and device (before and after the kernel execution)
- Use **Kernel** and **Run** objects to offload and manage the compute-intensive tasks running on the FPGA

Below we will walk through the common API usage to accomplish the above tasks.
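As a quick orientation before the detailed sections, the following is a minimal sketch of that flow. The XCLBIN file name ``kernel.xclbin``, the kernel name ``vadd``, and its argument order (input buffer, output buffer, element count) are illustrative assumptions, not part of any specific platform.

.. code:: c++

    #include <xrt/xrt_device.h>
    #include <xrt/xrt_bo.h>
    #include <xrt/xrt_kernel.h>

    #include <vector>

    int main()
    {
        const size_t elements = 1024;
        const size_t size_in_bytes = elements * sizeof(int);
        std::vector<int> host_in(elements, 1), host_out(elements);

        // Open the device (enumerated as 0) and load the XCLBIN
        auto device = xrt::device(0);
        auto uuid = device.load_xclbin("kernel.xclbin");

        // Open the kernel and allocate buffers on the memory banks connected to its arguments
        auto krnl = xrt::kernel(device, uuid, "vadd");               // hypothetical kernel name
        auto in_buf  = xrt::bo(device, size_in_bytes, krnl.group_id(0));
        auto out_buf = xrt::bo(device, size_in_bytes, krnl.group_id(1));

        // Transfer the input data to the device
        in_buf.write(host_in.data());
        in_buf.sync(XCL_BO_SYNC_BO_TO_DEVICE);

        // Offload the compute task and wait for it to finish
        auto run = krnl(in_buf, out_buf, elements);                  // hypothetical argument order
        run.wait();

        // Transfer the results back to the host
        out_buf.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
        out_buf.read(host_out.data());

        return 0;
    }

The sections below describe each of these steps and objects in detail.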
Device and XCLBIN
-----------------

The Device and XCLBIN classes provide the fundamental infrastructure-related interfaces. The primary objectives of the device and XCLBIN related APIs are

- Open a Device
- Load the compiled kernel binary (or XCLBIN) onto the device

The simplest code to load an XCLBIN is as below

.. code:: c++
   :number-lines: 10

        unsigned int dev_index = 0;
        auto device = xrt::device(dev_index);
        auto xclbin_uuid = device.load_xclbin("kernel.xclbin");

The above code block shows

- The ``xrt::device`` class's constructor is used to open the device (enumerated as 0)
- The member function ``xrt::device::load_xclbin`` is used to load the XCLBIN from the filename
- The member function ``xrt::device::load_xclbin`` returns the XCLBIN UUID, which is required to open the kernel (refer to the Kernel section)

The class constructor ``xrt::device::device(const std::string& bdf)`` also supports opening a device object from a PCIe BDF passed as a string.

.. code:: c++
   :number-lines: 10

        auto device = xrt::device("0000:03:00.1");

The ``xrt::device::get_info()`` member function is useful to obtain necessary information about a device. Some of this information, such as the name and BDF, can be used to select a specific device to load an XCLBIN

.. code:: c++
   :number-lines: 10

        std::cout << "device name: " << device.get_info<xrt::info::device::name>() << "\n";
        std::cout << "device bdf: " << device.get_info<xrt::info::device::bdf>() << "\n";


Buffers
-------

Buffers are primarily used to transfer data between the host and the device. The Buffer related APIs are discussed in the following three subsections

1. Buffer allocation and deallocation
2. Data transfer using Buffers
3. Miscellaneous other Buffer APIs

1. Buffer allocation and deallocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The C++ interface for buffer allocation is as below. The class constructor ``xrt::bo`` is mainly used to allocate a 4K-aligned buffer object. By default, a regular buffer is created (optionally, the user can create other types of buffers by providing a flag).

.. code:: c++
   :number-lines: 15

        auto bank_grp_arg0 = kernel.group_id(0); // Memory bank index for kernel argument 0
        auto bank_grp_arg1 = kernel.group_id(1); // Memory bank index for kernel argument 1

        auto input_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_arg0);
        auto output_buffer = xrt::bo(device, buffer_size_in_bytes, bank_grp_arg1);

In the above code, the ``xrt::bo`` buffer objects are created using the class constructor. Please note the following

- As no special flags are used, a regular buffer is created. A regular buffer is the most common type of buffer; it has a host backing pointer allocated by user space in heap memory and a device buffer allocated in the specified memory bank.
- The second argument specifies the buffer size.
- The third argument specifies the enumerated memory bank index (the buffer location) where the buffer should be allocated.

There are two ways to specify the memory bank index

- Through kernel arguments: In the above example, the ``xrt::kernel::group_id()`` member function is used to obtain the memory bank index. This member function accepts a kernel argument index and automatically detects the corresponding memory bank index by inspecting the XCLBIN.
- Passing the memory bank index: The ``xrt::kernel::group_id()`` also accepts the direct memory bank index (as observed from the ``xbutil examine --report memory`` output).

Creating special Buffers
************************

The ``xrt::bo`` constructors accept other buffer flags that are described using an ``enum class`` argument with the following enumerator values, illustrated in the examples that follow

- ``xrt::bo::flags::normal``: Regular buffer (default)
- ``xrt::bo::flags::device_only``: Device-only buffer (meant to be used only by the kernel, there is no host backing pointer)
- ``xrt::bo::flags::host_only``: Host-only buffer (the buffer resides in host memory and is directly transferred to/from the kernel)
- ``xrt::bo::flags::p2p``: P2P buffer, a special type of device-only buffer capable of peer-to-peer transfer
- ``xrt::bo::flags::cacheable``: Cacheable buffer, which can be used when the host CPU frequently accesses the buffer (applicable to edge platforms)
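As an illustrative sketch (the kernel argument indices used here are hypothetical), device-only and host-only buffers are created by passing the corresponding flag to the same constructor used above:

.. code:: c++

    // Device-only buffer: used only by the kernel, no host backing pointer is allocated
    auto dev_only_buffer  = xrt::bo(device, buffer_size_in_bytes, xrt::bo::flags::device_only, kernel.group_id(1));

    // Host-only buffer: resides in host memory and is transferred directly to/from the kernel
    auto host_only_buffer = xrt::bo(device, buffer_size_in_bytes, xrt::bo::flags::host_only, kernel.group_id(2));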
The below example shows creating a P2P buffer on a device memory bank connected to argument 3 of the kernel.

.. code:: c++
   :number-lines: 15

        auto p2p_buffer = xrt::bo(device, buffer_size_in_bytes, xrt::bo::flags::p2p, kernel.group_id(3));

Creating Buffers from the user pointer
**************************************

The ``xrt::bo`` constructor can also be called using a pointer provided by the user. The user pointer must be aligned to a 4K boundary.

.. code:: c++
   :number-lines: 15

        // Host Memory pointer aligned to 4K boundary
        int *host_ptr;
        posix_memalign((void **)&host_ptr, 4096, MAX_LENGTH * sizeof(int));

        // Sample example filling the allocated host memory
        for (int i = 0; i < MAX_LENGTH; ++i)
            host_ptr[i] = i;

        // Create a buffer object backed by the user pointer
        auto user_ptr_buffer = xrt::bo(device, host_ptr, MAX_LENGTH * sizeof(int), kernel.group_id(0));

The example below shows data transfer to and from a graph through its GMIO ports using ``xrt::aie::bo`` objects and their ``sync`` member function.

.. code:: c++

        auto in_bo = xrt::aie::bo(device, SIZE * sizeof(float), xrt::bo::flags::normal, 0);
        auto inp_bo_map = in_bo.map<float*>();
        auto out_bo = xrt::aie::bo(device, SIZE * sizeof(float), xrt::bo::flags::normal, 0);
        auto out_bo_map = out_bo.map<float*>();

        // Prepare input data
        std::copy(my_float_array, my_float_array + SIZE, inp_bo_map);

        in_bo.sync("in_sink", XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE * sizeof(float), 0);
        out_bo.sync("out_sink", XCL_BO_SYNC_BO_AIE_TO_GMIO, SIZE * sizeof(float), 0);

The above code shows

- The input and output buffers (``in_bo`` and ``out_bo``) to the graph are created and mapped to the user space
- The member function ``xrt::aie::bo::sync`` is used for the data transfer using the following arguments

  - The name of the GMIO port associated with the DMA transfer
  - The direction of the buffer transfer

    - GMIO to Graph: ``XCL_BO_SYNC_BO_GMIO_TO_AIE``
    - Graph to GMIO: ``XCL_BO_SYNC_BO_AIE_TO_GMIO``

  - The size and the offset of the buffer

XRT Error API
-------------

In general, the XRT APIs can encounter two types of errors:

- Synchronous error: Error thrown by the API itself. The host code can catch these exceptions and take the necessary steps.
- Asynchronous error: Errors from the underlying driver, system, hardware, etc.

XRT provides an ``xrt::error`` class and its member functions to retrieve asynchronous errors in the userspace host code. This helps to debug when something goes wrong.

- Member function ``xrt::error::get_error_code()``: Gets the last error code and its timestamp for a given error class
- Member function ``xrt::error::get_timestamp()``: Gets the timestamp of the last error
- Member function ``xrt::error::to_string()``: Gets the description string of a given error code

**NOTE**: The asynchronous error retrieving APIs are at an early stage of development and only support AIE related asynchronous errors. Full support for all other asynchronous errors is planned in a future release.

Example code

.. code:: c++
   :number-lines: 41

        graph.run(runIteration);

        try {
            graph.wait(timeout);
        }
        catch (const std::system_error& ex) {
            if (ex.code().value() == ETIME) {
                xrt::error error(device, XRT_ERROR_CLASS_AIE);

                auto errCode = error.get_error_code();
                auto timestamp = error.get_timestamp();
                auto err_str = error.to_string();

                /* code to deal with this specific error */
                std::cout << err_str << std::endl;
            }
            else {
                /* Something else */
            }
        }

The above code shows

- After a timeout occurs from ``xrt::graph::wait()``, the member functions of the ``xrt::error`` class are called to retrieve the asynchronous error code and timestamp
- Member function ``xrt::error::to_string()`` is called to obtain the error string
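The example above covers the asynchronous category. A synchronous error, by contrast, surfaces directly as a C++ exception thrown by the failing call, so ordinary exception handling applies. Below is a minimal, illustrative sketch assuming the XCLBIN file may be missing or incompatible:

.. code:: c++

    try {
        auto xclbin_uuid = device.load_xclbin("kernel.xclbin");
    }
    catch (const std::exception& ex) {
        // The exception message describes why the call failed
        std::cerr << "load_xclbin failed: " << ex.what() << std::endl;
        // application-specific recovery (or exit) goes here
    }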
Asynchronous Programming with XRT (experimental)
------------------------------------------------

From the 22.1 release, XRT offers a simple asynchronous programming mechanism through user-defined queues. The ``xrt::queue`` is a lightweight, general-purpose queue implementation that is completely separate from the core XRT native API data structures. If needed, the user can also use their own queue implementation instead of the implementation offered by ``xrt::queue``.

The XRT queue implementation requires ``#include <experimental/xrt_queue.h>``. The example below enqueues a single synchronous task through a queue.

.. code:: c++

        auto bo0_map = bo0.map<int*>();
        .... // fill buffer content

        xrt::queue my_queue;
        auto sync_event = my_queue.enqueue([&bo0] { bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });

        myCpuTask(b); // here we can perform other host tasks that will run in parallel to the above bo::sync task

        sync_event.wait(); // stall the host-thread till the sync operation completes

The above code shows that the synchronous API ``xrt::bo::sync`` is enqueued through an ``xrt::queue``. The argument of ``xrt::queue::enqueue`` is an unnamed callable, written using a C++ lambda capturing the buffer object. This technique is useful to execute any synchronous task asynchronously from the host-thread; while this task is ongoing, the host-thread can do other operations in parallel (``myCpuTask()`` in the above code). The return type of ``xrt::queue::enqueue()`` is ``xrt::queue::event``, which is later synchronized to the host-thread by the blocking function ``xrt::queue::event::wait()``.

Executing multiple tasks through queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every new ``xrt::queue`` can be thought of as a new thread running parallel to the host-thread, executing a series of synchronous tasks in the order they were submitted (enqueued) to the queue. For example, let's consider tasks A, B, C and D as below

- Task A: Host to device data transfer (buffer ``bo0``)
- Task B: Execute the kernel and wait for the kernel to finish execution
- Task C: Device to host data transfer (buffer ``bo_out``)
- Task D: Check the returned data in ``bo_out``

The above four tasks should be executed in-order for correct functionality. To execute them in parallel to the host-thread, these four tasks can be enqueued through a queue as below.

.. code:: c++
   :number-lines: 41

        xrt::queue queue;
        queue.enqueue([&bo0] { bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
        queue.enqueue([&run] { run.start(); run.wait(); });
        queue.enqueue([&bo_out] { bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
        queue.enqueue([&bo_out_map] { my_function_to_check_data(bo_out_map); });

The user can create and use as many queues in the host code as needed to overlap tasks in parallel.
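For instance, two fully independent pipelines can be overlapped simply by giving each its own queue. The sketch below assumes two hypothetical, independent kernel runs ``run_a`` and ``run_b`` with their own buffers; the two pipelines execute in parallel to each other and to the host-thread:

.. code:: c++

    xrt::queue queue_a;
    xrt::queue queue_b;

    // Pipeline A: transfer input, execute, read back the result
    queue_a.enqueue([&bo_a]     { bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    queue_a.enqueue([&run_a]    { run_a.start(); run_a.wait(); });
    queue_a.enqueue([&bo_a_out] { bo_a_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });

    // Pipeline B runs in parallel to pipeline A and to the host-thread
    queue_b.enqueue([&bo_b]     { bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
    queue_b.enqueue([&run_b]    { run_b.start(); run_b.wait(); });
    queue_b.enqueue([&bo_b_out] { bo_b_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });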
Next, we will see how it is possible to synchronize among the queues using events.

Using events to synchronize among the queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's assume that in the above example it is required to do two host-to-device buffer transfers before the kernel execution. Using a single queue, the code would appear as

.. code:: c++
   :number-lines: 41

        xrt::queue main_queue;
        main_queue.enqueue([&bo0] { bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
        main_queue.enqueue([&bo1] { bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
        main_queue.enqueue([&run] { run.start(); run.wait(); });
        main_queue.enqueue([&bo_out] { bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });

In the above code, as a single queue (``main_queue``) is used, the host-to-device data transfers for buffers ``bo0`` and ``bo1`` happen sequentially.

To do the ``bo0`` and ``bo1`` data transfers in parallel, a separate queue is needed for one of the buffers, and it must also be ensured that the kernel executes only after both buffer transfers are completed.

.. code:: c++
   :number-lines: 41

        xrt::queue main_queue;
        xrt::queue queue_bo1;
        main_queue.enqueue([&bo0] { bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
        auto bo1_event = queue_bo1.enqueue([&bo1] { bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
        main_queue.enqueue(bo1_event);
        main_queue.enqueue([&run] { run.start(); run.wait(); });
        main_queue.enqueue([&bo_out] { bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });

In lines 43 and 44, the ``bo0`` and ``bo1`` host-to-device data transfers are enqueued through two separate queues to achieve parallel transfers. To synchronize between these two queues, the event returned from ``queue_bo1`` is enqueued in the ``main_queue``, similar to a task enqueue (line 45). As a result, any other task submitted after that event won't execute until the event is finished. So in the above code example, subsequent tasks in the ``main_queue`` (such as the kernel execution) wait until ``bo1_event`` is completed. By submitting an event returned from a ``queue::enqueue`` to another queue, we can synchronize among the queues.
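Finally, to synchronize the whole pipeline back to the host-thread, the event returned by the last enqueued task can be waited on, in the same way as shown earlier for a single task. A short sketch continuing the above example (reusing ``bo_out_map`` and ``my_function_to_check_data`` from the earlier task list):

.. code:: c++

    // Enqueue the final task and keep its event
    auto done_event = main_queue.enqueue([&bo_out_map] { my_function_to_check_data(bo_out_map); });

    // ... other host-side work can run here in parallel ...

    done_event.wait(); // blocks until every task enqueued before (and including) this one has finished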