The streamOneToN API is designed for distributing data from one source to multiple processor units.
Three different algorithms have been implemented.
To ensure throughput, it is very common to pass a vector of elements in FPGA data paths, so streamOneToN supports element-vector input when the data elements are passed in the form of ap_uint.
It also offers an overload with a generic template type for non-vector input.
The round-robin algorithm distributes elements to output streams in circular order, starting from the output stream with index 0.
With generic type input, the function dispatches one element per cycle. This mode works best for sharing the multi-cycle processing work across an array of units.
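The circular dispatch order described above can be sketched in plain C++. This is a software model only, not the library's implementation: the function name roundRobinDispatch is hypothetical, and each std::queue stands in for one output stream.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical software model of round-robin dispatch.
// Each std::queue stands in for one output stream.
template <typename T>
void roundRobinDispatch(const std::vector<T>& input,
                        std::vector<std::queue<T>>& outputs) {
    if (outputs.empty()) return;  // nothing to dispatch to
    std::size_t next = 0;         // dispatching starts at output index 0
    for (const T& elem : input) {
        outputs[next].push(elem);            // one element per iteration ("cycle")
        next = (next + 1) % outputs.size();  // advance in circular order
    }
}
```

With seven input elements and three outputs, output 0 receives elements 0, 3 and 6, output 1 receives 1 and 4, and output 2 receives 2 and 5.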
With the input cast to a long ap_uint vector, a higher input rate can be achieved.
This implementation consists of two dataflow processes working in parallel. The first one breaks the vector into a ping-pong buffer, while the second one reads from the buffers and schedules output in round-robin order.
The ping-pong buffers are implemented as two ap_uint of width equal to the least common multiple (LCM) of the input width and the total output stream width. This imposes a limitation, as the LCM must be no more than AP_INT_MAX_W, which defaults to 1024 in HLS.
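To make the width constraint concrete, the following sketch computes the buffer width at compile time and compares it with the default limit. The helper bufferWidth and the constant kDefaultMaxW are hypothetical names used for illustration, and the configurations are made-up examples. std::lcm requires C++17.

```cpp
#include <numeric>  // std::lcm (C++17)

// Hypothetical helper: width in bits of one ping-pong buffer, taken as the
// LCM of the input width and the total output width (N streams of Wout bits).
constexpr unsigned bufferWidth(unsigned win, unsigned n, unsigned wout) {
    return std::lcm(win, n * wout);
}

// Default value of AP_INT_MAX_W as stated in the text.
constexpr unsigned kDefaultMaxW = 1024;

// 64-bit input, 3 outputs of 32 bits: lcm(64, 96) = 192 bits, within the default.
static_assert(bufferWidth(64, 3, 32) == 192, "unexpected LCM");
static_assert(bufferWidth(64, 3, 32) <= kDefaultMaxW, "fits the default limit");

// 512-bit input, 7 outputs of 64 bits: lcm(512, 448) = 3584 bits, above the default.
static_assert(bufferWidth(512, 7, 64) == 3584, "unexpected LCM");
static_assert(bufferWidth(512, 7, 64) > kDefaultMaxW, "needs a larger AP_INT_MAX_W");
```

The second configuration shows how an output count that shares few factors with the input width can inflate the buffer well beyond either individual width.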
AP_INT_MAX_W can be set to larger values, but doing so may slow down HLS synthesis. To effectively override AP_INT_MAX_W, the macro must be set before the first inclusion of ap_int.h.
This library tries to override AP_INT_MAX_W to 4096, but this is only effective when ap_int.h has not been included before the utility library headers.
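The ordering requirement can be illustrated with a small preprocessor sketch. The guarded fallback below is an assumed stand-in for what the header does when the macro is undefined. It is not copied from ap_int.h, and the real include is left as a comment so that the fragment compiles without the HLS headers.

```cpp
// Set the limit first, before any header that pulls in ap_int.h.
#define AP_INT_MAX_W 4096

// In an HLS project, the include would follow here:
// #include <ap_int.h>

// Assumed stand-in for the header's fallback (illustration only):
// a default is applied only when the macro is still undefined.
#ifndef AP_INT_MAX_W
#define AP_INT_MAX_W 1024
#endif

// The user-provided value survives because it was defined first.
static_assert(AP_INT_MAX_W == 4096, "override must precede the first inclusion");
```

If the define were placed after the include, a fallback of this shape would already have fixed the value and the later definition would come too late to take effect.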
The load-balancing algorithm does not keep a fixed order in dispatching. Instead, it skips successors that cannot read and tries to feed as much data as possible to the outputs.
The design of the primitive includes 3 modules:
- read: Read data from the input stream, then output it through one stream whose width is lcm(Win, N * Wout) bits. Here, the least common multiple of Win and N * Wout is the inner buffer size, chosen to reconcile the different input and output widths.
- reduce: Split the wide word into an array of elements.
- distribute: Read the array of elements and distribute them to output streams that are not yet full.
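The skipping behavior of the distribute step can be modeled in plain C++ as follows. This is a minimal sketch under stated assumptions, not the library's code: BoundedFifo and distributeOnce are hypothetical names, and a depth-limited std::queue stands in for an output stream that can become full.

```cpp
#include <cstddef>
#include <deque>
#include <queue>
#include <vector>

// Hypothetical bounded FIFO standing in for one output stream.
template <typename T>
struct BoundedFifo {
    std::size_t depth;  // maximum number of buffered elements
    std::queue<T> q;
    bool full() const { return q.size() >= depth; }
};

// One scheduling pass: hand pending elements to outputs that are not full,
// skipping those that cannot accept data. Returns the number dispatched.
template <typename T>
std::size_t distributeOnce(std::deque<T>& pending,
                           std::vector<BoundedFifo<T>>& outputs) {
    std::size_t sent = 0;
    for (auto& out : outputs) {
        if (pending.empty()) break;  // nothing left to hand out
        if (out.full()) continue;    // skip instead of stalling on this output
        out.q.push(pending.front());
        pending.pop_front();
        ++sent;
    }
    return sent;
}
```

Unlike round-robin dispatch, the element-to-output mapping here depends on which outputs happen to have room, so downstream logic cannot rely on a fixed order.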
The current implementation has the following limitations:
- It uses a wide ap_uint as internal buffer. The buffer is as wide as the least common multiple (LCM) of the input width and the total output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library will try to override AP_INT_MAX_W to 4096, but the user should ensure that ap_int.h has not been included before the library headers.
- Too large an AP_INT_MAX_W will significantly slow down HLS synthesis.
The depth of output streams must be no less than 4 due to internal delay.