Internals of streamNToOne
The streamNToOne API is designed for collecting data from multiple processor units. Three different algorithms have been implemented: RoundRobinT, LoadBalanceT and TagSelectT.
To ensure throughput, it is very common to pass a vector of elements in FPGA data paths, so streamNToOne supports element vector output if the data elements are passed in the form of ap_uint. It also offers an overload with a generic template type for non-vector output.
Round-Robin
The round-robin algorithm collects elements from the input streams in circular order, starting from the input stream with index 0.
Generic Type
With generic type input, the function forwards one element per cycle. This mode works best for collecting results from an array of units that share multi-cycle processing work.
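The circular collection order can be illustrated with a small software model. This is a minimal sketch, not the library's HLS implementation: the function name collect_round_robin is hypothetical, std::deque stands in for an input stream, and an empty deque is assumed to mean that the stream has already ended.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Software-only model of round-robin collection (not the HLS implementation).
// Each std::deque stands in for one input stream; an empty deque is treated
// as a stream that has already ended (an assumption of this sketch).
template <typename T>
std::vector<T> collect_round_robin(std::vector<std::deque<T>> inputs) {
    std::vector<T> out;
    std::size_t remaining = 0;
    for (const auto& in : inputs) remaining += in.size();

    std::size_t idx = 0; // start from input stream 0
    while (remaining > 0) {
        if (!inputs[idx].empty()) {
            out.push_back(inputs[idx].front()); // one element per modeled cycle
            inputs[idx].pop_front();
            --remaining;
        }
        idx = (idx + 1) % inputs.size(); // advance in circular order
    }
    return out;
}
```

With three inputs holding {1, 4, 7}, {2, 5} and {3, 6, 8, 9}, this model yields the elements in the order 1 through 9, because each pass visits inputs 0, 1, 2 in turn and skips the ones that have run out.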
Vector Output
The design of the primitive includes 3 modules:
- fetch: attempt to read data from the n input streams.
- vectorize: Inner buffers as wide as the least common multiple of N * Win and Wout are used to combine the inputs into vectors.
- emit: read vectorized data and emit to output stream.
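The three stages can be pictured with a bit-level software model. This is a rough sketch under stated assumptions, not the library's code: the function name pack_round_robin is hypothetical, a std::deque<bool> stands in for the wide inner buffer, every input is assumed to hold the same number of elements, and the LSB-first packing order with input 0 in the lowest bits is a choice made only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Software-only model of the fetch / vectorize / emit stages.
// Assumptions of this sketch: every input holds the same number of elements,
// elements are packed LSB-first, and input 0 lands in the lowest bits.
std::vector<std::uint32_t> pack_round_robin(
        const std::vector<std::vector<std::uint32_t>>& inputs,
        unsigned win, unsigned wout) {
    assert(win >= 1 && win <= 32 && wout >= 1 && wout <= 32);
    std::deque<bool> bits; // stands in for the wide inner buffer
    std::vector<std::uint32_t> out;
    const std::size_t rounds = inputs.empty() ? 0 : inputs[0].size();

    for (std::size_t r = 0; r < rounds; ++r) {
        // fetch: take one element from each input, in index order
        for (const auto& in : inputs) {
            assert(in.size() == rounds);
            // vectorize: append the low win bits of the element to the buffer
            for (unsigned b = 0; b < win; ++b) {
                bits.push_back(((in[r] >> b) & 1u) != 0);
            }
        }
        // emit: write out every complete wout-bit word
        while (bits.size() >= wout) {
            std::uint32_t word = 0;
            for (unsigned b = 0; b < wout; ++b) {
                if (bits.front()) word |= (std::uint32_t{1} << b);
                bits.pop_front();
            }
            out.push_back(word);
        }
    }
    return out; // leftover bits (a partial word) are dropped in this sketch
}
```

With three 8-bit inputs and a 16-bit output, the buffer cycles through LCM(24, 16) = 48 bits: two rounds of input fill exactly three output words, which is why a buffer of that width lets both sides stay aligned.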
Attention
Current implementation has the following limitations:
- It uses a wide ap_uint as internal buffer. The buffer is as wide as the least common multiple (LCM) of the total input width (N * Win) and the output width (Wout). The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library will try to override AP_INT_MAX_W to 4096, but the user should ensure that ap_int.h has not been included before the library headers.
- Too large an AP_INT_MAX_W will significantly slow down HLS synthesis.
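Whether a given configuration fits under the width limit can be checked ahead of synthesis with a few lines of plain C++. The helper names below (gcd_u, lcm_u, buffer_width, fits_limit) are hypothetical and not part of the library; the sketch only applies the LCM rule stated above to the total input width N * Win and the output width Wout.

```cpp
#include <cassert>

// Greatest common divisor and least common multiple, usable at compile time.
constexpr unsigned gcd_u(unsigned a, unsigned b) { return b == 0 ? a : gcd_u(b, a % b); }
constexpr unsigned lcm_u(unsigned a, unsigned b) { return a / gcd_u(a, b) * b; }

// Inner buffer width for n inputs of win bits each and one output of wout bits.
constexpr unsigned buffer_width(unsigned n, unsigned win, unsigned wout) {
    return lcm_u(n * win, wout);
}

// Runtime helper: true when the buffer fits within the given width limit.
inline bool fits_limit(unsigned n, unsigned win, unsigned wout, unsigned limit) {
    assert(n > 0 && win > 0 && wout > 0); // zero widths are meaningless here
    return buffer_width(n, win, wout) <= limit;
}

// 3 inputs of 24 bits into a 64-bit output: LCM(72, 64) = 576 bits,
// which fits under the default limit of 1024.
static_assert(buffer_width(3, 24, 64) == 576, "LCM(72, 64) = 576");

// 3 inputs of 32 bits into a 512-bit output: LCM(96, 512) = 1536 bits,
// which exceeds 1024 but fits under 4096.
static_assert(buffer_width(3, 32, 512) == 1536, "LCM(96, 512) = 1536");
```

The second case shows why the override matters: a modest configuration can already need a buffer wider than the default limit, while widths that share few common factors grow the LCM quickly.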
Load-Balancing
The load-balancing algorithm does not keep a fixed collection order. Instead, it skips predecessors that cannot be read and tries to feed as much data as possible to the output.
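The skipping behaviour can be modeled in software by giving each element the cycle at which it becomes readable. This is a minimal sketch, not the library's arbitration logic: the names Pending and collect_load_balance are hypothetical, and the policy of scanning from the input after the one served last is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One pending element: its value and the first cycle at which it can be read.
struct Pending {
    int value;
    int ready_cycle;
};

// Software-only model of load-balanced collection (not the HLS implementation).
// Each cycle, scan the inputs starting after the one served last and take the
// first element that is ready; inputs that cannot be read are skipped.
// The rotating start position is an assumption of this sketch.
std::vector<int> collect_load_balance(std::vector<std::deque<Pending>> inputs) {
    std::vector<int> out;
    std::size_t remaining = 0;
    for (const auto& in : inputs) remaining += in.size();

    const std::size_t n = inputs.size();
    std::size_t start = 0;
    for (int cycle = 0; remaining > 0; ++cycle) {
        for (std::size_t k = 0; k < n; ++k) {
            const std::size_t idx = (start + k) % n;
            if (!inputs[idx].empty() && inputs[idx].front().ready_cycle <= cycle) {
                out.push_back(inputs[idx].front().value);
                inputs[idx].pop_front();
                --remaining;
                start = (idx + 1) % n;
                break; // at most one element per modeled cycle
            }
        }
    }
    return out;
}
```

If input 0 only becomes readable at cycle 3 while inputs 1 and 2 have data earlier, the model serves inputs 1 and 2 first instead of stalling on input 0, which is the difference from a strict round-robin order.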
Generic Type
Vector Output
The design of the primitive includes 3 modules:
- fetch: attempt to read data from the n input streams.
- vectorize: Inner buffers as wide as the least common multiple of N * Win and Wout are used to combine the inputs into vectors.
- emit: read vectorized data and emit to output stream.
Attention
Current implementation has the following limitations:
- It uses a wide ap_uint as internal buffer. The buffer is as wide as the least common multiple (LCM) of the total input width (N * Win) and the output width (Wout). The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library will try to override AP_INT_MAX_W to 4096, but the user should ensure that ap_int.h has not been included before the library headers.
- Too large an AP_INT_MAX_W will significantly slow down HLS synthesis.
Important
The depth of output streams must be no less than 4 due to internal delay.
Tag-Select
This algorithm collects data elements according to the provided tags. Each tag is used as the index of an input stream, and every input element is expected to be accompanied by a tag.
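Tag-driven collection can be sketched as a lookup from tag to input stream. This is a software model under stated assumptions, not the library's interface: the name collect_by_tag is hypothetical, the tags are modeled as a plain sequence of indices, and each tag is assumed to name an input that currently holds an element.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Software-only model of tag-driven collection (not the HLS implementation).
// Each tag names the input stream that supplies the next output element.
template <typename T>
std::vector<T> collect_by_tag(std::vector<std::deque<T>> inputs,
                              const std::vector<std::size_t>& tags) {
    std::vector<T> out;
    out.reserve(tags.size());
    for (std::size_t tag : tags) {
        assert(tag < inputs.size());  // tag must index an existing input
        assert(!inputs[tag].empty()); // the selected input must hold an element
        out.push_back(inputs[tag].front());
        inputs[tag].pop_front();
    }
    return out;
}
```

For inputs {1, 2}, {10} and {100, 200} with the tag sequence 2, 0, 1, 2, 0, the model emits 100, 1, 10, 200, 2, so the output order is decided entirely by the tags rather than by stream position or readiness.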