Stream Simple
This is a simple vector-add C kernel design with two stream inputs and one stream output. It demonstrates how a streaming kernel can be implemented and how the host can send data directly to the kernel without going through global memory.
KEY CONCEPTS: Read/Write Stream, Create/Release Stream
KEYWORDS: cl_stream, CL_STREAM_EOT, CL_STREAM_NONBLOCKING
This simple streaming kernel example demonstrates how a host application can transfer data directly through a streaming interface without moving it to global memory. The host application demonstrates both approaches to the host-device stream interface: blocking and non-blocking. A stream interface does not require address management, because data is accessed sequentially. Stream interfaces are useful for applications where either the data is too big to reside on an FPGA or the data is being streamed from a sensor network.
Inside the kernel code, an HLS pragma must be defined for every streaming interface.
#include "ap_axi_sdata.h"
#include "hls_stream.h"
typedef qdma_axis<32, 0, 0, 0> pkt;
void krnl_stream_adder1(hls::stream<pkt> &a, hls::stream<pkt> &output) {
#pragma HLS INTERFACE axis port=a
#pragma HLS INTERFACE axis port=output
#pragma HLS INTERFACE s_axilite port=return bundle=control
...
}
hls::stream kernels use a special class, qdma_axis<D,0,0,0>, for kernel streams, which requires the header file ap_axi_sdata.h. It has the variables data, last and keep to manage the data transfer.
data: Internally, the qdma_axis datatype holds an ap_uint<D>, which can be accessed by the get_data() and set_data() methods.
keep: For all data before the last, the keep variable must be set to -1 to denote that all bytes of the data are valid. For the last data, the kernel has the flexibility to send fewer bytes. For example, in a four-byte data transfer, the kernel can truncate the last data by sending 1, 2 or 3 bytes using the set_keep() function.
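qdma_axis<32, 0, 0, 0> t_out;    // output packet (an object, not a typedef)
t_out.set_data(tmpOut);
t_out.set_last(t1.get_last());   // propagate the end-of-stream marker from the input
t_out.set_keep(-1);              // all bytes valid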
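To make the partial-word case concrete, here is a minimal sketch of how a keep value for a truncated last word could be computed. It assumes that keep is a per-byte mask with one bit per byte and the lowest bit covering the first byte (the usual AXI4-Stream TKEEP convention), so that -1 corresponds to all bits set. The helper keep_mask is hypothetical and not part of ap_axi_sdata.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ap_axi_sdata.h).
// Assumption: keep holds one bit per byte, lowest bit = first byte,
// so "all bytes valid" (-1) is a mask with every bit set.
inline std::uint32_t keep_mask(unsigned valid_bytes, unsigned word_bytes = 4) {
    assert(word_bytes >= 1 && word_bytes <= 32);
    assert(valid_bytes >= 1 && valid_bytes <= word_bytes);
    if (valid_bytes == 32) {
        return 0xFFFFFFFFu;  // avoid shifting a 32-bit value by 32
    }
    return (1u << valid_bytes) - 1u;  // lowest valid_bytes bits set
}
```

Under that assumption, a kernel sending only two valid bytes in its final 32-bit word would pass keep_mask(2), that is 0x3, to set_keep(), while keep_mask(4) gives the same 0xF pattern that -1 produces in a four-bit field.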
last: The final data transferred must be identified by the last variable. The get_last() and set_last() methods are used to read and set the last variable. The kernel does not know how many data items are coming through the stream. The stream is polled by calling get_last() after every transfer, and the loop breaks when get_last() returns 1.
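The last-terminated loop described above can be sketched on the host with plain C++, without any HLS headers. MockPkt and mock_stream_adder are stand-ins invented for this illustration: std::queue replaces hls::stream, and only the data, last and keep fields are modeled. It is a simulation of the control flow, not the kernel source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Stand-in for a qdma_axis packet; only the fields discussed above are modeled.
struct MockPkt {
    std::uint32_t data;
    bool last;
    int keep;  // -1 means all bytes valid
};

// Adds two packet streams element by element. The length is not known in
// advance: the loop runs until a packet with last set has been processed.
// Returns the number of packets written to out.
inline std::size_t mock_stream_adder(std::queue<MockPkt>& a,
                                     std::queue<MockPkt>& b,
                                     std::queue<MockPkt>& out) {
    std::size_t count = 0;
    bool eos = false;
    do {
        // A real stream read would block here; the mock requires data to be present.
        assert(!a.empty() && !b.empty());
        MockPkt t1 = a.front(); a.pop();
        MockPkt t2 = b.front(); b.pop();
        MockPkt t_out{t1.data + t2.data, t1.last, -1};
        out.push(t_out);
        ++count;
        eos = t1.last;  // checked after every transfer
    } while (!eos);
    return count;
}
```

The output packet copies last from the input, so the consumer of the output stream can detect the end of the transfer in the same way.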
Host-to-kernel streaming is supported by all QDMA-based platforms. To use the streaming support and utility APIs provided in xcl2.hpp, the programmer needs to specify the following in the host code:
// Declaration of custom stream APIs that binds to Xilinx Streaming APIs.
decltype(&clCreateStream) xcl::Stream::createStream = nullptr;
decltype(&clReleaseStream) xcl::Stream::releaseStream = nullptr;
decltype(&clReadStream) xcl::Stream::readStream = nullptr;
decltype(&clWriteStream) xcl::Stream::writeStream = nullptr;
decltype(&clPollStreams) xcl::Stream::pollStreams = nullptr;
The streaming class needs to be initialized before use, as shown below:
xcl::Stream::init(platform_id);
To connect a streaming object to a specific kernel of the design, the following steps are needed:
cl_mem_ext_ptr_t ext;
ext.param = krnl_adder1.get();
ext.obj = NULL;
ext.flags = 0; // Connect to argument 0 of the kernel
cl_stream write_stream_a = xcl::Stream::createStream(device.get(), CL_STREAM_WRITE_ONLY, CL_STREAM, &ext, nullptr);
ext.flags = 1; // Connect to argument 1 of the kernel
cl_stream read_stream = xcl::Stream::createStream(device.get(), CL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);
The xcl::Stream::createStream API is used to create a stream. Its read and write properties are determined by the flags CL_STREAM_WRITE_ONLY and CL_STREAM_READ_ONLY. The flags field of the cl_mem_ext_ptr_t structure (ext.flags above) is used to specify the kernel argument to which the stream is connected.
There are blocking and non-blocking APIs to transfer data between the host and the kernel via the stream interface.
A blocking stream requires each stream operation (read or write) to finish before the next operation can be executed. The following shows a blocking call of writeStream, issued from a child thread:
cl_stream_xfer_req b_wr_req{0};
b_wr_req.flags = CL_STREAM_EOT;
b_wr_req.priv_data = (void *)"b_write_a";
// Thread 1 for writing data to input stream 1 independently in case of default blocking transfers.
std::thread thr1(xcl::Stream::writeStream,
write_stream_a, // cl_stream object
h_a.data(), // host memory pointer from where the data has to be transferred
vector_size_bytes, // size of data to be transferred in bytes
&b_wr_req, // xfer req flag to indicate type of transfer
&ret);
Similarly, the following shows a blocking call of readStream, also issued from a child thread:
cl_stream_xfer_req b_rd_req{0};
b_rd_req.flags = CL_STREAM_EOT;
b_rd_req.priv_data = (void *)"b_read_out";
// Output thread to read the stream data independently in case of default blocking transfers.
std::thread thr2(xcl::Stream::readStream,
read_stream, //cl_stream object
hw_results.data(), // host memory pointer on which data will be read
vector_size_bytes, // max size of data which can be stored in host memory
&b_rd_req, // xfer_req flag to indicate type of transfer
&ret);
xcl::Stream::readStream and xcl::Stream::writeStream, which bind to clReadStream and clWriteStream, are used to read from and write to streams respectively.
Both are blocking calls running inside different threads, so the host application needs to wait for the threads to finish before the data transfer can be considered complete, using the thread join() API as below:
thr1.join();
thr2.join();
In the case of a non-blocking stream, other operations can be carried out while data is being written to or read from the stream. A non-blocking stream requires the CL_STREAM_NONBLOCKING flag to be specified in the transfer initiation request.
cl_stream_xfer_req nb_wr_req{0};
nb_wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
nb_wr_req.priv_data = (void *)"nb_write_a";
xcl::Stream::writeStream(write_stream_a, h_a.data(), vector_size_bytes, &nb_wr_req, &ret);
cl_stream_xfer_req nb_rd_req{0};
nb_rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
nb_rd_req.priv_data = (void *)"nb_read";
xcl::Stream::readStream(read_stream, hw_results.data(), vector_size_bytes, &nb_rd_req, &ret);
Since non-blocking stream calls are asynchronous and return immediately, xcl::Stream::pollStreams, a blocking API, is used to monitor the completion status of the transfers. It returns control to the host code once the requested transfers have completed or the timeout expires.
cl_streams_poll_req_completions poll_req[2]{0, 0}; // 2 Requests
auto num_compl = 2;
xcl::Stream::pollStreams(device.get(), poll_req, 2, 2, &num_compl, 50000, &ret);
// Blocking API, waits for 2 poll request completion or 50000ms, whichever occurs first.
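The submit-then-poll pattern can be illustrated with a small host-only mock. Everything in it is hypothetical: MockXferReq, MockCompletion, MockStreamQueue and bytes_for are invented for this sketch and are not the Xilinx types. It also assumes that each completion record echoes the priv_data tag of the request it belongs to, which is how the sketch tells the read completion apart from the write completion. Unlike the real polling call, the mock never blocks and has no timeout: every submitted transfer is treated as finished by the time poll is called.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-ins for a transfer request and a completion record.
struct MockXferReq { const void* priv_data; std::size_t nbytes; };
struct MockCompletion { const void* priv_data; std::size_t nbytes; };

class MockStreamQueue {
public:
    // Non-blocking submit: records the request and returns immediately.
    void submit(const MockXferReq& req) { in_flight_.push_back(req); }

    // Reports up to max_compl completions and returns how many were written.
    // Assumption: a completion carries the tag of its originating request.
    int poll(MockCompletion* out, int max_compl) {
        assert(out != nullptr && max_compl >= 0);
        int n = 0;
        while (n < max_compl && !in_flight_.empty()) {
            const MockXferReq& r = in_flight_.front();
            out[n].priv_data = r.priv_data;
            out[n].nbytes = r.nbytes;
            ++n;
            in_flight_.pop_front();
        }
        return n;
    }

private:
    std::deque<MockXferReq> in_flight_;
};

// Returns the byte count of the completion tagged with tag, or 0 if none matches.
inline std::size_t bytes_for(const MockCompletion* c, int n, const void* tag) {
    for (int i = 0; i < n; ++i) {
        if (c[i].priv_data == tag) {
            return c[i].nbytes;
        }
    }
    return 0;
}
```

In this model the host would submit the write and read requests back to back, continue with other work, and later call poll once to collect both completions, using the tag to find out how many bytes the read actually delivered.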
xcl::Stream::releaseStream is used to release stream objects.
xcl::Stream::releaseStream(read_stream);
xcl::Stream::releaseStream(write_stream_a);
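One possible way to avoid forgetting the release calls is to tie them to scope with a std::unique_ptr and a custom deleter. The sketch below shows the idea with stand-ins only: FakeStream, fake_release, release_count and make_scoped_stream are hypothetical names for this illustration, and a real host program would call its own release function from the deleter instead.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a stream handle.
struct FakeStream { int id; };

// Counts releases so the behavior can be observed.
inline int& release_count() {
    static int n = 0;
    return n;
}

// Hypothetical stand-in for the release call.
inline void fake_release(FakeStream* s) {
    assert(s != nullptr);
    ++release_count();
    delete s;
}

// Deleter that invokes the release call when the owner goes out of scope.
struct StreamReleaser {
    void operator()(FakeStream* s) const { fake_release(s); }
};

using ScopedStream = std::unique_ptr<FakeStream, StreamReleaser>;

inline ScopedStream make_scoped_stream(int id) {
    return ScopedStream(new FakeStream{id});
}
```

With this arrangement each stream is released exactly once when its owner leaves scope, including on early returns and exceptions.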
EXCLUDED PLATFORMS
Platforms containing the following strings in their names are not supported for this example:
zc
xdma
xilinx_u250_qep
aws
samsung
DESIGN FILES
Application code is located in the src directory. Accelerator binary files will be compiled to the xclbin directory. The xclbin directory is required by the Makefile and its contents will be filled during compilation. A listing of all the files in this example is shown below:
src/host.cpp
src/krnl_stream_vadd.cpp
COMMAND LINE ARGUMENTS
Once the environment has been configured, the application can be executed by running:
./vadd_stream <krnl_stream_vadd XCLBIN>