Stream Simple

This is a simple vector-add C kernel design with two stream inputs and one stream output. It demonstrates how a streaming kernel can be implemented and how the host can send data directly to the kernel without going through global memory.

KEY CONCEPTS: Read/Write Stream, Create/Release Stream

KEYWORDS: cl_stream, CL_STREAM_EOT, CL_STREAM_NONBLOCKING

This simple streaming kernel example demonstrates how a host application can transfer data directly through a streaming interface without moving the data to global memory. The host application demonstrates both the blocking and the non-blocking approach to the stream interface between host and device. A stream interface does not require address management because the data is accessed sequentially. Stream interfaces are useful for applications where the data is too big to reside on an FPGA or is being streamed from a sensor network.

Inside the kernel code, an HLS INTERFACE pragma must be defined for every streaming interface:

#include "ap_axi_sdata.h"
#include "hls_stream.h"
typedef qdma_axis<32, 0, 0, 0> pkt;
void krnl_stream_adder1(hls::stream<pkt> &a, hls::stream<pkt> &output) {
    #pragma HLS INTERFACE axis port=a
    #pragma HLS INTERFACE axis port=output
    #pragma HLS INTERFACE s_axilite port=return bundle=control
...
}

hls::stream kernels use a special class, qdma_axis<D,0,0,0>, for kernel streams; it requires the header file ap_axi_sdata.h. It has the variables data, last and keep to manage the data transfer.

data: Internally, the qdma_axis datatype holds an ap_uint<D>, which can be accessed with the get_data() and set_data() methods.

keep: For all data before the last, the keep variable must be set to -1 to denote that all bytes of the data are valid. For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data transfer, the kernel can truncate the last data to 1, 2 or 3 bytes by using the set_keep() function.

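typedef qdma_axis<32, 0, 0, 0> pkt;
pkt t_out;                      // output packet
t_out.set_data(tmpOut);         // tmpOut: result value computed by the kernel
t_out.set_last(t1.get_last());  // t1: input packet; forward its last flag
t_out.set_keep(-1);             // all bytes valid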
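
As a minimal sketch of the keep convention described above, the helper below computes a keep value for a word in which only some bytes are valid. The name keep_mask is hypothetical and is not part of ap_axi_sdata.h. The bit layout (one bit per byte, low-order bytes valid first) is an assumption based on the usual AXI4-Stream TKEEP convention and should be checked against the platform documentation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ap_axi_sdata.h): returns a keep mask with
// one bit per valid byte, assuming the low-order bytes are the valid ones.
// word_bytes is the stream word width in bytes (4 for qdma_axis<32, 0, 0, 0>).
inline std::uint32_t keep_mask(unsigned valid_bytes, unsigned word_bytes = 4) {
    assert(word_bytes >= 1 && word_bytes <= 32);
    assert(valid_bytes >= 1 && valid_bytes <= word_bytes);
    if (valid_bytes == 32) return 0xFFFFFFFFu; // avoid shifting by the full width
    return (1u << valid_bytes) - 1u;
}
```

Under this assumption, keep_mask(4) gives 0xF, the same bit pattern that -1 produces in a 4-bit keep field, and keep_mask(2) gives 0x3 for a final word that carries only two bytes.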

last: The final data transferred must be identified by the last variable. The get_last() and set_last() methods are used to read and set the last variable. The kernel does not know how many data items are coming through the stream. It checks get_last() after every transfer and stops reading when get_last() returns 1.
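
The read-until-last loop can be sketched on the host with plain C++ stand-ins. MockPkt and MockStream below are hypothetical types that only mimic the accessors named above; they are not the real qdma_axis or hls::stream classes, and the add-one operation is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Stand-in for qdma_axis<32, 0, 0, 0>; only mimics the accessors named above.
struct MockPkt {
    std::uint32_t data = 0;
    bool last = false;
    std::uint32_t get_data() const { return data; }
    bool get_last() const { return last; }
    void set_data(std::uint32_t d) { data = d; }
    void set_last(bool l) { last = l; }
};

// Stand-in for hls::stream<pkt>: a plain FIFO with read() and write().
struct MockStream {
    std::queue<MockPkt> fifo;
    MockPkt read() {
        assert(!fifo.empty());
        MockPkt p = fifo.front();
        fifo.pop();
        return p;
    }
    void write(const MockPkt &p) { fifo.push(p); }
};

// Adds 1 to every input word and forwards the last flag. The loop has no
// element count: it stops once a packet with last set has been processed.
// Returns the number of packets handled.
inline unsigned stream_add_one(MockStream &in, MockStream &out) {
    unsigned count = 0;
    bool eos = false;
    while (!eos) {
        MockPkt t1 = in.read();
        MockPkt t_out;
        t_out.set_data(t1.get_data() + 1);
        t_out.set_last(t1.get_last());
        out.write(t_out);
        eos = t1.get_last();
        ++count;
    }
    return count;
}
```

Feeding three packets with last set only on the third makes the loop run exactly three times, without the function ever being told the length.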

Host-to-kernel streaming is supported by all QDMA-based platforms. To use the streaming support and utility APIs provided in xcl2.hpp, the programmer needs to specify the following in the host code:

// Declaration of custom stream APIs that bind to the Xilinx Streaming APIs.
decltype(&clCreateStream) xcl::Stream::createStream = nullptr;
decltype(&clReleaseStream) xcl::Stream::releaseStream = nullptr;
decltype(&clReadStream) xcl::Stream::readStream = nullptr;
decltype(&clWriteStream) xcl::Stream::writeStream = nullptr;
decltype(&clPollStreams) xcl::Stream::pollStreams = nullptr;

The streaming class needs to be initialized before use, as below:

xcl::Stream::init(platform_id);

To connect a stream object to a specific kernel of the design, the following steps are needed:

cl_int ret; // receives the error code of the second createStream call
cl_mem_ext_ptr_t ext;
ext.param = krnl_adder1.get();
ext.obj = NULL;
ext.flags = 0; // Connect to argument 0 of the kernel
cl_stream write_stream_a = xcl::Stream::createStream(device.get(), CL_STREAM_WRITE_ONLY, CL_STREAM, &ext, nullptr);
ext.flags = 1; // Connect to argument 1 of the kernel
cl_stream read_stream = xcl::Stream::createStream(device.get(), CL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);

The xcl::Stream::createStream API is used to create a stream; its read and write properties are determined by the flags CL_STREAM_WRITE_ONLY and CL_STREAM_READ_ONLY. ext.flags is used to specify the kernel argument to which the stream is connected.

There are blocking and non-blocking APIs to transfer data between the host and the kernel via the stream interface.

A blocking stream requires the stream operation (read or write) to finish before the next operation can be executed. The following shows a blocking call of writeStream, made inside a child thread:

cl_stream_xfer_req b_wr_req{0};
b_wr_req.flags = CL_STREAM_EOT;
b_wr_req.priv_data = (void *)"b_write_a";
// Thread 1 for writing data to input stream 1 independently in case of default blocking transfers.
std::thread thr1(xcl::Stream::writeStream,
                write_stream_a,     // cl_stream object
                h_a.data(),         // host memory pointer from where the data has to be transferred
                vector_size_bytes,  // size of data to be transferred in bytes
                &b_wr_req,          // xfer req flag to indicate type of transfer
                &ret);

Similarly, the following shows a blocking call of readStream, made inside a child thread:

cl_stream_xfer_req b_rd_req{0};
b_rd_req.flags = CL_STREAM_EOT;
b_rd_req.priv_data = (void *)"b_read_out";
// Output thread to read the stream data independently in case of default blocking transfers.
std::thread thr2(xcl::Stream::readStream,
                 read_stream,       //cl_stream object
                 hw_results.data(), // host memory pointer on which data will be read
                 vector_size_bytes, // max size of data which can be stored in host memory
                 &b_rd_req,         // xfer_req flag to indicate type of transfer
                 &ret);

The xcl::Stream::readStream and xcl::Stream::writeStream APIs are used to read from and write to streams respectively.

Both are blocking calls running in different threads, so the host application needs to wait for both threads to finish, which completes the data transfer, using the thread join() API as below:

thr1.join();
thr2.join();
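
The two-thread pattern can be tried without hardware using a small host-only model. BoundedPipe and transfer below are hypothetical stand-ins for the device stream and the Xilinx calls, not part of xcl2.hpp. The sketch also assumes, as one plausible reason for the separate threads, that a blocking write cannot finish until some output has been drained.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a device stream: a small bounded FIFO whose
// write blocks while full and whose read blocks while empty.
class BoundedPipe {
public:
    explicit BoundedPipe(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    void blocking_write(const std::vector<int> &src) {
        for (int v : src) {
            std::unique_lock<std::mutex> lk(m_);
            not_full_.wait(lk, [this] { return q_.size() < depth_; });
            q_.push(v);
            not_empty_.notify_one();
        }
    }

    void blocking_read(std::vector<int> &dst, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [this] { return !q_.empty(); });
            dst.push_back(q_.front());
            q_.pop();
            not_full_.notify_one();
        }
    }

private:
    std::size_t depth_;
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// Runs the writer and the reader in separate threads and joins both,
// mirroring the thr1/thr2 pattern. Calling blocking_write and then
// blocking_read from a single thread would hang whenever the input is
// larger than the pipe depth, because the write could never complete.
inline std::vector<int> transfer(const std::vector<int> &input, std::size_t depth) {
    BoundedPipe pipe(depth);
    std::vector<int> result;
    std::thread writer(&BoundedPipe::blocking_write, &pipe, std::cref(input));
    std::thread reader(&BoundedPipe::blocking_read, &pipe, std::ref(result), input.size());
    writer.join();
    reader.join();
    return result;
}
```

Calling transfer with eight values and a depth of 2 returns the same eight values in order; only after both join() calls return is the result safe to read.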

In the case of a non-blocking stream, other operations can be carried out while data is being written to or read from the stream. A non-blocking stream requires the CL_STREAM_NONBLOCKING flag to be specified in the transfer initiation request.

cl_stream_xfer_req nb_wr_req{0};
nb_wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
nb_wr_req.priv_data = (void *)"nb_write_a";
xcl::Stream::writeStream(write_stream_a, h_a.data(), vector_size_bytes, &nb_wr_req, &ret);
cl_stream_xfer_req nb_rd_req{0};
nb_rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
nb_rd_req.priv_data = (void *)"nb_read";
xcl::Stream::readStream(read_stream, hw_results.data(), vector_size_bytes, &nb_rd_req, &ret);

Since non-blocking stream transfers are asynchronous and return immediately, the blocking API xcl::Stream::pollStreams is used to monitor the completion status of the transfers. It returns execution to the host code after the stream transfers have completed.

cl_streams_poll_req_completions poll_req[2]{0, 0}; // 2 Requests
auto num_compl = 2;
xcl::Stream::pollStreams(device.get(), poll_req, 2, 2, &num_compl, 50000, &ret);
// Blocking API, waits for 2 poll request completion or 50000ms, whichever occurs first.
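
The issue-then-poll flow can be illustrated with a host-only analogy built on std::async. This is not the Xilinx API: start_copy and poll_completions are hypothetical names, and each std::future merely stands in for a pending non-blocking request.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <vector>

// Starts an asynchronous copy and returns immediately, like a request
// flagged as non-blocking; the future stands in for the pending request.
inline std::future<void> start_copy(const std::vector<int> &src, std::vector<int> &dst) {
    return std::async(std::launch::async, [&src, &dst] { dst = src; });
}

// Hypothetical analogue of a poll call: blocks until every pending request
// has finished or the timeout elapses, and returns how many completed.
inline int poll_completions(std::vector<std::future<void>> &pending,
                            std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int completed = 0;
    for (auto &f : pending) {
        if (f.wait_until(deadline) == std::future_status::ready) ++completed;
    }
    return completed;
}
```

In this analogy the host starts both copies, is free to do other work, and then calls poll_completions once with a generous timeout; a return value equal to the number of requests means the destination buffers are ready to use.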

xcl::Stream::releaseStream is used to release stream objects.

xcl::Stream::releaseStream(read_stream);
xcl::Stream::releaseStream(write_stream_a);

EXCLUDED PLATFORMS

Platforms containing the following strings in their names are not supported for this example:

zc
xdma
xilinx_u250_qep
aws
samsung

DESIGN FILES

Application code is located in the src directory. Accelerator binary files will be compiled to the xclbin directory. The xclbin directory is required by the Makefile and its contents will be filled during compilation. A listing of all the files in this example is shown below:

src/host.cpp
src/krnl_stream_vadd.cpp

COMMAND LINE ARGUMENTS

Once the environment has been configured, the application can be executed with:

./vadd_stream <krnl_stream_vadd XCLBIN>