Internals of streamNToOne

The streamNToOne API is designed for collecting data from multiple processing units. Three algorithms are implemented: RoundRobinT, LoadBalanceT and TagSelectT.

To sustain throughput, FPGA data paths commonly pass a vector of elements per transfer, so streamNToOne supports vector output when the data elements are passed as ap_uint. It also offers an overload for a generic template type with non-vector output.

Round-Robin

The round-robin algorithm collects elements from the input streams in circular order, starting from the input stream with index 0.

Generic Type

With generic type input, the function collects one element per cycle. This mode works best for gathering results from an array of units that share multi-cycle processing work.
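
The circular collection order can be illustrated with a small software model. This is only a sketch under stated assumptions: std::queue stands in for an HLS stream, collectRoundRobin is a hypothetical name rather than the library API, and the model stops when the stream whose turn it is has no data, whereas a hardware read would wait.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Software model only: std::queue stands in for an HLS stream.
// collectRoundRobin is a hypothetical name, not the library API.
template <typename T>
std::vector<T> collectRoundRobin(std::vector<std::queue<T>>& inputs) {
    assert(!inputs.empty()); // at least one input stream is required
    std::vector<T> output;
    std::size_t idx = 0; // collection starts at input stream 0
    // Strict order: stop as soon as the stream whose turn it is has no data.
    while (!inputs[idx].empty()) {
        output.push_back(inputs[idx].front()); // one element per iteration
        inputs[idx].pop();
        idx = (idx + 1) % inputs.size(); // advance in circular order
    }
    return output;
}
```

With three inputs holding {0, 3}, {1, 4} and {2, 5}, the model returns the elements in the order 0, 1, 2, 3, 4, 5.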

Vector Output

The primitive is built from three modules:

fetch: attempts to read data from the N input streams.
vectorize: combines the inputs into vectors, using inner buffers as wide as the least common multiple of N * Win and Wout.
emit: reads the vectorized data and writes it to the output stream.
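
The vectorize and emit steps can be pictured with a simplified packing model. The sketch below assumes 8-bit input elements and a 32-bit output word, places the first element in the least significant bits, and ignores the LCM-wide buffering of the real design. The name packWords, the bit order and the handling of a trailing partial word are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of "vectorize" + "emit": pack 8-bit elements into
// 32-bit words. Assumption: the first element fills the lowest bits.
inline std::vector<std::uint32_t> packWords(const std::vector<std::uint8_t>& elems,
                                            unsigned perWord = 4) {
    assert(perWord >= 1 && perWord <= 4); // at most four 8-bit lanes fit in 32 bits
    std::vector<std::uint32_t> words;
    std::uint32_t buffer = 0; // word being assembled
    unsigned filled = 0;      // number of elements currently in the buffer
    for (std::uint8_t e : elems) {
        buffer |= static_cast<std::uint32_t>(e) << (8 * filled);
        if (++filled == perWord) { // buffer is full: emit it
            words.push_back(buffer);
            buffer = 0;
            filled = 0;
        }
    }
    return words; // a trailing partial word is dropped in this sketch
}
```

For the elements 0x11, 0x22, 0x33, 0x44, 0x55 the model emits the single word 0x44332211 and leaves the fifth element unemitted.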

Figure: Structure of vectorized round-robin collection

Attention

The current implementation has the following limitations:

It uses a wide ap_uint as its internal buffer. The buffer is as wide as the least common multiple (LCM) of the total input width and the output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
This library tries to override AP_INT_MAX_W to 4096, but the user must ensure that ap_int.h has not been included before the library headers.
Setting AP_INT_MAX_W too large significantly slows down HLS synthesis.
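
To check ahead of synthesis whether a configuration fits within the limit, the buffer width can be computed at compile time. The sketch below assumes the width is the LCM of N * Win and Wout, as described for the vectorize module. The helper names and the example configurations are hypothetical.

```cpp
// Compile-time helpers (hypothetical names) for the internal buffer width.
constexpr unsigned gcdBits(unsigned a, unsigned b) {
    return b == 0 ? a : gcdBits(b, a % b);
}

constexpr unsigned lcmBits(unsigned a, unsigned b) {
    return a / gcdBits(a, b) * b;
}

// Buffer width in bits: LCM of the total input width (n * winBits)
// and the output width (woutBits).
constexpr unsigned bufferWidth(unsigned n, unsigned winBits, unsigned woutBits) {
    return lcmBits(n * winBits, woutBits);
}

// True when a buffer of the given width fits within the ap_uint limit.
constexpr bool fitsLimit(unsigned widthBits, unsigned maxBits) {
    return widthBits <= maxBits;
}

// 4 inputs of 32 bits, 96-bit output: LCM(128, 96) = 384 bits.
static_assert(bufferWidth(4, 32, 96) == 384, "unexpected buffer width");
// 5 inputs of 48 bits, 512-bit output: LCM(240, 512) = 7680 bits,
// which exceeds even the overridden limit of 4096.
static_assert(!fitsLimit(bufferWidth(5, 48, 512), 4096), "expected to exceed limit");
```

Widths that share few common factors grow the buffer quickly, so choosing Wout as a multiple or divisor of N * Win keeps the buffer small.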

Load-Balancing

The load-balancing algorithm does not keep a fixed collection order. Instead, it skips predecessors whose streams cannot be read and tries to feed as much data as possible to the output.
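
The skipping behavior can be sketched as a software model. As an assumption, std::queue stands in for an HLS stream and an empty queue represents an input that cannot be read. collectLoadBalance is a hypothetical name, and the sweep order shown here is only one possible policy, not necessarily the one the library uses.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Software model only: an empty queue models an input that cannot be read.
// collectLoadBalance is a hypothetical name, not the library API.
template <typename T>
std::vector<T> collectLoadBalance(std::vector<std::queue<T>>& inputs) {
    std::size_t total = 0;
    for (const auto& in : inputs) total += in.size();

    std::vector<T> output;
    bool progressed = true;
    while (progressed) { // keep sweeping until no input yields data
        progressed = false;
        for (auto& in : inputs) {
            if (in.empty()) continue; // skip an input that cannot be read
            output.push_back(in.front());
            in.pop();
            progressed = true;
        }
    }
    assert(output.size() == total); // every available element is collected once
    return output;
}
```

With inputs {10, 11, 12}, {} and {20}, the model returns 10, 20, 11, 12: the empty input never stalls the collection, and the fullest input contributes the most elements.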

Generic Type

Vector Output

The primitive is built from three modules:

fetch: attempts to read data from the N input streams.
vectorize: combines the inputs into vectors, using inner buffers as wide as the least common multiple of N * Win and Wout.
emit: reads the vectorized data and writes it to the output stream.

Figure: Structure of vectorized load-balance collection

Attention

The current implementation has the following limitations:

It uses a wide ap_uint as its internal buffer. The buffer is as wide as the least common multiple (LCM) of the total input width and the output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
This library tries to override AP_INT_MAX_W to 4096, but the user must ensure that ap_int.h has not been included before the library headers.
Setting AP_INT_MAX_W too large significantly slows down HLS synthesis.

Important

The depth of output streams must be no less than 4 due to internal delay.

Tag-Select

This algorithm collects data elements according to the provided tags. Each tag is used as the index of an input stream, and every input element is expected to be accompanied by a tag.
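
A minimal software model of tag-driven collection is sketched below. It assumes the tags arrive as a separate stream and that each tag names the input stream to read next. std::queue stands in for an HLS stream, and collectByTag is a hypothetical name rather than the library API.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Software model only: each tag selects the input stream to read next.
// collectByTag is a hypothetical name, not the library API.
template <typename T>
std::vector<T> collectByTag(std::vector<std::queue<T>>& inputs,
                            std::queue<std::size_t>& tags) {
    std::vector<T> output;
    while (!tags.empty()) {
        const std::size_t idx = tags.front();
        tags.pop();
        assert(idx < inputs.size());  // a tag must be a valid stream index
        assert(!inputs[idx].empty()); // every tag must have a matching element
        output.push_back(inputs[idx].front());
        inputs[idx].pop();
    }
    return output;
}
```

With inputs {1, 2} and {10} and the tag sequence 1, 0, 0, the model returns 10, 1, 2.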
