CSCMV Kernel APIs

Note

The CSCMV implementation in the current release uses 16 HBM channels on the U280 card. Future releases may use all 32 HBM channels, the maximum number available, to achieve the best possible performance.

bufTransColVecKernel

#include "bufTransColVecKernel.hpp"
void bufTransColVecKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransColVecKernel is used to buffer and dispatch input column vector entries across multiple CUs of xBarColKernel

Parameters:

in0 input column vector entries stream
out0 output column vector entries stream for CU0 of xBarColKernel
out1 output column vector entries stream for CU1 of xBarColKernel
out2 output column vector entries stream for CU2 of xBarColKernel
out3 output column vector entries stream for CU3 of xBarColKernel
out4 output column vector entries stream for CU4 of xBarColKernel
out5 output column vector entries stream for CU5 of xBarColKernel
out6 output column vector entries stream for CU6 of xBarColKernel
out7 output column vector entries stream for CU7 of xBarColKernel
out8 output column vector entries stream for CU8 of xBarColKernel
out9 output column vector entries stream for CU9 of xBarColKernel
out10 output column vector entries stream for CU10 of xBarColKernel
out11 output column vector entries stream for CU11 of xBarColKernel
out12 output column vector entries stream for CU12 of xBarColKernel
out13 output column vector entries stream for CU13 of xBarColKernel
out14 output column vector entries stream for CU14 of xBarColKernel
out15 output column vector entries stream for CU15 of xBarColKernel
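
The signature above implies a width conversion: in0 carries SPARSE_ddrMemBits-wide words, while each output stream carries SPARSE_dataBits*SPARSE_parEntries-wide words. The following is a minimal software sketch of that conversion only, not the kernel's implementation. The macro values (512, 32, 4), the entry ordering within a word, and the names MemWord, OutWord and splitMemWord are assumptions made for illustration. How the resulting words are assigned to the 16 output streams is not described here and is not modeled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed values for illustration only; the real ones come from the build configuration.
constexpr std::size_t kDdrMemBits = 512; // stand-in for SPARSE_ddrMemBits
constexpr std::size_t kDataBits = 32;    // stand-in for SPARSE_dataBits
constexpr std::size_t kParEntries = 4;   // stand-in for SPARSE_parEntries

static_assert(kDdrMemBits % (kDataBits * kParEntries) == 0,
              "a memory word must hold a whole number of output words");

constexpr std::size_t kEntriesPerMemWord = kDdrMemBits / kDataBits;
constexpr std::size_t kOutWordsPerMemWord = kEntriesPerMemWord / kParEntries;

// Hypothetical software stand-ins for the ap_uint<> stream words.
using MemWord = std::array<std::uint32_t, kEntriesPerMemWord>;
using OutWord = std::array<std::uint32_t, kParEntries>;

// Split one memory-width word into narrower words of kParEntries entries each.
// Assumption: entries are taken in ascending index order.
std::array<OutWord, kOutWordsPerMemWord> splitMemWord(const MemWord& in) {
    std::array<OutWord, kOutWordsPerMemWord> out{};
    for (std::size_t w = 0; w < kOutWordsPerMemWord; ++w) {
        for (std::size_t e = 0; e < kParEntries; ++e) {
            out[w][e] = in[w * kParEntries + e];
        }
    }
    return out;
}
```

With the assumed values, one 512-bit memory word yields four 128-bit output words.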

bufTransNnzColKernel

#include "bufTransNnzColKernel.hpp"
void bufTransNnzColKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransNnzColKernel is used to buffer and dispatch the column pointers across multiple CUs of xBarColKernel

Parameters:

in0 input column pointer entries stream
out0 output column pointer entries stream for CU0 of xBarColKernel
out1 output column pointer entries stream for CU1 of xBarColKernel
out2 output column pointer entries stream for CU2 of xBarColKernel
out3 output column pointer entries stream for CU3 of xBarColKernel
out4 output column pointer entries stream for CU4 of xBarColKernel
out5 output column pointer entries stream for CU5 of xBarColKernel
out6 output column pointer entries stream for CU6 of xBarColKernel
out7 output column pointer entries stream for CU7 of xBarColKernel
out8 output column pointer entries stream for CU8 of xBarColKernel
out9 output column pointer entries stream for CU9 of xBarColKernel
out10 output column pointer entries stream for CU10 of xBarColKernel
out11 output column pointer entries stream for CU11 of xBarColKernel
out12 output column pointer entries stream for CU12 of xBarColKernel
out13 output column pointer entries stream for CU13 of xBarColKernel
out14 output column pointer entries stream for CU14 of xBarColKernel
out15 output column pointer entries stream for CU15 of xBarColKernel

cscRowKernel

#include "cscRowKernel.hpp"
void cscRowKernel (
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

cscRowKernel is used to accumulate the multiplication results for the same row

Parameters:

in0 the input axis stream of the NNZs’ values and row indices
in1 the input axis stream of column vector entries for the NNZs
out the output axis stream of result row vector entries
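
The accumulation that cscRowKernel performs can be modeled in plain C++ as a reference for checking results on the host. This is a sketch under stated assumptions, not the kernel's implementation: the NnzEntry struct, the accumulateRows function, the use of double, and the explicit numRows argument are hypothetical, and the real kernel works on packed ap_uint stream words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one NNZ as delivered on in0: a value and its row index.
struct NnzEntry {
    double value;
    std::uint32_t row;
};

// Multiply each NNZ value by its matching column vector entry (the in1 stream)
// and accumulate the products that share a row index.
std::vector<double> accumulateRows(const std::vector<NnzEntry>& nnz,
                                   const std::vector<double>& colEntries,
                                   std::size_t numRows) {
    assert(nnz.size() == colEntries.size()); // one column vector entry per NNZ
    std::vector<double> result(numRows, 0.0);
    for (std::size_t i = 0; i < nnz.size(); ++i) {
        assert(nnz[i].row < numRows);
        result[nnz[i].row] += nnz[i].value * colEntries[i];
    }
    return result;
}
```

For example, the NNZs (2.0, row 0), (3.0, row 1), (4.0, row 0) paired with column vector entries 1.0, 2.0, 0.5 accumulate to 4.0 for row 0 and 6.0 for row 1.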

loadColKernel

#include "loadColKernel.hpp"
void loadColKernel (
    ap_uint <SPARSE_ddrMemBits>* p_colValPtr,
    ap_uint <SPARSE_ddrMemBits>* p_nnzColPtr,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out1
    )

loadColKernel is used to read the input column vector and pointers out of the device memory

Parameters:

p_colValPtr device memory pointer for reading the input column vector
p_nnzColPtr device memory pointer for reading the column pointers of NNZ entries
out0 the output axis stream of the column vector entries
out1 the output axis stream of the column pointer entries
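
The data flow of loadColKernel, two device buffers read into two streams, can be sketched in software as follows. The names MemWord, ColStreams and loadCol are hypothetical, std::queue stands in for hls::stream, and a 64-bit integer stands in for ap_uint<SPARSE_ddrMemBits>. The kernel signature carries no explicit lengths, so the two word-count arguments are assumptions added only to make the sketch self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

using MemWord = std::uint64_t; // stand-in for ap_uint<SPARSE_ddrMemBits>

// Software stand-in for the two output streams.
struct ColStreams {
    std::queue<MemWord> out0; // column vector entries
    std::queue<MemWord> out1; // column pointer entries
};

// Read both buffers word by word and push each word to its stream.
// colValWords and nnzColWords are hypothetical length arguments.
ColStreams loadCol(const MemWord* p_colValPtr, std::size_t colValWords,
                   const MemWord* p_nnzColPtr, std::size_t nnzColWords) {
    assert(p_colValPtr != nullptr || colValWords == 0);
    assert(p_nnzColPtr != nullptr || nnzColWords == 0);
    ColStreams streams;
    for (std::size_t i = 0; i < colValWords; ++i) {
        streams.out0.push(p_colValPtr[i]);
    }
    for (std::size_t i = 0; i < nnzColWords; ++i) {
        streams.out1.push(p_nnzColPtr[i]);
    }
    return streams;
}
```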

xBarColKernel

#include "xBarColKernel.hpp"
void xBarColKernel (
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

xBarColKernel is used to select input column vector entries according to the input column pointers

Parameters:

in0 input axis stream of column vector entries, processed in parallel
in1 input axis stream of column pointer entries, processed in parallel
out output axis stream of the column vector entries for the NNZs, processed in parallel
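
The selection step can be illustrated with a plain C++ reference model. It assumes the standard CSC convention, in which the column pointer array has one more entry than there are columns and column j owns the NNZs in the range [colPtr[j], colPtr[j+1]). The pointer encoding the kernel actually consumes is not specified here, so treat this as a sketch. The function name selectColEntries and the use of double are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For every NNZ, emit the column vector entry of the column that NNZ belongs to.
// Assumption: colPtr follows the standard CSC layout (size = number of columns + 1).
std::vector<double> selectColEntries(const std::vector<double>& colVec,
                                     const std::vector<std::uint32_t>& colPtr) {
    assert(colPtr.size() == colVec.size() + 1);
    std::vector<double> selected;
    selected.reserve(colPtr.back()); // last pointer equals the total NNZ count
    for (std::size_t j = 0; j < colVec.size(); ++j) {
        assert(colPtr[j] <= colPtr[j + 1]);
        // Column j contributes one copy of colVec[j] per NNZ it holds.
        for (std::uint32_t k = colPtr[j]; k < colPtr[j + 1]; ++k) {
            selected.push_back(colVec[j]);
        }
    }
    return selected;
}
```

With a column vector {10, 20, 30} and column pointers {0, 2, 2, 3}, the first column holds two NNZs, the second none, and the third one, so the selected entries are {10, 10, 30}. This output lines up one-to-one with the NNZ values and row indices that cscRowKernel receives on in0.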