CSCMV Kernel APIs

Note

The CSCMV implementation in the current release uses 16 HBM channels on the U280 card. Future releases may use all 32 HBM channels, the maximum available, to achieve the best possible performance.

bufTransColVecKernel

#include "fp32/bufTransColVecKernel.hpp"
void bufTransColVecKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransColVecKernel is used to buffer and dispatch input column vector entries across multiple CUs of xBarColKernel

Parameters:

in0 input column vector entries stream
out0 output column vector entries stream for CU0 of xBarColKernel
out1 output column vector entries stream for CU1 of xBarColKernel
out2 output column vector entries stream for CU2 of xBarColKernel
out3 output column vector entries stream for CU3 of xBarColKernel
out4 output column vector entries stream for CU4 of xBarColKernel
out5 output column vector entries stream for CU5 of xBarColKernel
out6 output column vector entries stream for CU6 of xBarColKernel
out7 output column vector entries stream for CU7 of xBarColKernel
out8 output column vector entries stream for CU8 of xBarColKernel
out9 output column vector entries stream for CU9 of xBarColKernel
out10 output column vector entries stream for CU10 of xBarColKernel
out11 output column vector entries stream for CU11 of xBarColKernel
out12 output column vector entries stream for CU12 of xBarColKernel
out13 output column vector entries stream for CU13 of xBarColKernel
out14 output column vector entries stream for CU14 of xBarColKernel
out15 output column vector entries stream for CU15 of xBarColKernel
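
The one-to-sixteen fan-out described above can be pictured with a small host-side model. This is only a sketch in plain C++, not the HLS kernel: `std::vector` stands in for the streams, and the names `kNumCus`, `Range`, and `dispatchColVec` are hypothetical. This section does not state how entries are assigned to each CU, so the sketch assumes the caller supplies one index range per CU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Number of output streams in the signature above (out0..out15).
constexpr std::size_t kNumCus = 16;

// Hypothetical per-CU range [first, second) into the buffered column vector.
// How the real kernel chooses these ranges is an assumption left to the caller.
using Range = std::pair<std::size_t, std::size_t>;

// Software model: copy each CU's slice of the buffered vector to its output.
std::array<std::vector<float>, kNumCus> dispatchColVec(
    const std::vector<float>& buffered,
    const std::array<Range, kNumCus>& ranges) {
    std::array<std::vector<float>, kNumCus> outs;
    for (std::size_t cu = 0; cu < kNumCus; ++cu) {
        assert(ranges[cu].first <= ranges[cu].second);
        assert(ranges[cu].second <= buffered.size());
        const auto first = static_cast<std::ptrdiff_t>(ranges[cu].first);
        const auto last = static_cast<std::ptrdiff_t>(ranges[cu].second);
        outs[cu].assign(buffered.begin() + first, buffered.begin() + last);
    }
    return outs;
}
```

Ranges may overlap in this model, because more than one CU can need the same column vector entry; an empty range yields an empty output for that CU.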

bufTransNnzColKernel

#include "fp32/bufTransNnzColKernel.hpp"
void bufTransNnzColKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransNnzColKernel is used to buffer and dispatch the column pointers across multiple CUs of the xBarColKernel

Parameters:

in0 input column pointer entries stream
out0 output column pointer entries stream for CU0 of xBarColKernel
out1 output column pointer entries stream for CU1 of xBarColKernel
out2 output column pointer entries stream for CU2 of xBarColKernel
out3 output column pointer entries stream for CU3 of xBarColKernel
out4 output column pointer entries stream for CU4 of xBarColKernel
out5 output column pointer entries stream for CU5 of xBarColKernel
out6 output column pointer entries stream for CU6 of xBarColKernel
out7 output column pointer entries stream for CU7 of xBarColKernel
out8 output column pointer entries stream for CU8 of xBarColKernel
out9 output column pointer entries stream for CU9 of xBarColKernel
out10 output column pointer entries stream for CU10 of xBarColKernel
out11 output column pointer entries stream for CU11 of xBarColKernel
out12 output column pointer entries stream for CU12 of xBarColKernel
out13 output column pointer entries stream for CU13 of xBarColKernel
out14 output column pointer entries stream for CU14 of xBarColKernel
out15 output column pointer entries stream for CU15 of xBarColKernel

cscRowKernel

#include "fp32/cscRowKernel.hpp"
void cscRowKernel (
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

cscRowKernel is used to accumulate the multiplication results for the same row

Parameters:

in0 the input axis stream of the NNZs’ values and row indices
in1 the input axis stream of column vector entries for the NNZs
out the output axis stream of result row vector entries
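
The accumulation that cscRowKernel performs can be illustrated with a plain C++ reference model. This is a host-side sketch rather than the kernel itself: the struct `NnzEntry`, the function `accumulateRows`, and the use of `std::vector` in place of the streams are assumptions made for illustration, and the packed layout of the stream words is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical unpacked form of one NNZ as carried on in0: value and row index.
struct NnzEntry {
    float value;
    std::uint32_t row;
};

// Reference model: multiply each NNZ value by its matching column vector
// entry (the in1 data) and add the product to the result entry of its row.
std::vector<float> accumulateRows(const std::vector<NnzEntry>& nnzs,
                                  const std::vector<float>& colEntries,
                                  std::size_t numRows) {
    assert(nnzs.size() == colEntries.size());  // one column entry per NNZ
    std::vector<float> result(numRows, 0.0f);
    for (std::size_t i = 0; i < nnzs.size(); ++i) {
        assert(nnzs[i].row < numRows);
        result[nnzs[i].row] += nnzs[i].value * colEntries[i];
    }
    return result;
}
```

Products that share a row index end up in the same result entry, which is the "same row" accumulation named in the description.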

loadColKernel

#include "fp32/loadColKernel.hpp"
void loadColKernel (
    ap_uint <SPARSE_ddrMemBits>* p_colValPtr,
    ap_uint <SPARSE_ddrMemBits>* p_nnzColPtr,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out1
    )

loadColKernel is used to read the input column vector and pointers out of the device memory

Parameters:

p_colValPtr device memory pointer for reading the input column vector
p_nnzColPtr device memory pointer for reading the column pointers of NNZ entries
out0 the output axis stream of the column vector entries
out1 the output axis stream of the column pointer entries

readWriteHbmKernel

#include "fp32/readWriteHbmKernel.hpp"
void readWriteHbmKernel (
    ap_uint <SPARSE_hbmMemBits>* p_rd0,
    ap_uint <SPARSE_hbmMemBits>* p_wr0,
    ap_uint <SPARSE_hbmMemBits>* p_rd1,
    ap_uint <SPARSE_hbmMemBits>* p_wr1,
    ap_uint <SPARSE_hbmMemBits>* p_rd2,
    ap_uint <SPARSE_hbmMemBits>* p_wr2,
    ap_uint <SPARSE_hbmMemBits>* p_rd3,
    ap_uint <SPARSE_hbmMemBits>* p_wr3,
    ap_uint <SPARSE_hbmMemBits>* p_rd4,
    ap_uint <SPARSE_hbmMemBits>* p_wr4,
    ap_uint <SPARSE_hbmMemBits>* p_rd5,
    ap_uint <SPARSE_hbmMemBits>* p_wr5,
    ap_uint <SPARSE_hbmMemBits>* p_rd6,
    ap_uint <SPARSE_hbmMemBits>* p_wr6,
    ap_uint <SPARSE_hbmMemBits>* p_rd7,
    ap_uint <SPARSE_hbmMemBits>* p_wr7,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in2,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in3,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in4,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in5,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in6,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in7
    )

readWriteHbmKernel is used to read NNZ values and row indices from HBM and to write the result row vector back to HBM

Parameters:

p_rd0 the device memory pointer, which is mapped to HBM channel 0, for reading NNZs’ values and row indices
p_wr0 the device memory pointer, which is mapped to HBM channel 0, for storing result row vector
p_rd1 the device memory pointer, which is mapped to HBM channel 1, for reading NNZs’ values and row indices
p_wr1 the device memory pointer, which is mapped to HBM channel 1, for storing result row vector
p_rd2 the device memory pointer, which is mapped to HBM channel 2, for reading NNZs’ values and row indices
p_wr2 the device memory pointer, which is mapped to HBM channel 2, for storing result row vector
p_rd3 the device memory pointer, which is mapped to HBM channel 3, for reading NNZs’ values and row indices
p_wr3 the device memory pointer, which is mapped to HBM channel 3, for storing result row vector
p_rd4 the device memory pointer, which is mapped to HBM channel 4, for reading NNZs’ values and row indices
p_wr4 the device memory pointer, which is mapped to HBM channel 4, for storing result row vector
p_rd5 the device memory pointer, which is mapped to HBM channel 5, for reading NNZs’ values and row indices
p_wr5 the device memory pointer, which is mapped to HBM channel 5, for storing result row vector
p_rd6 the device memory pointer, which is mapped to HBM channel 6, for reading NNZs’ values and row indices
p_wr6 the device memory pointer, which is mapped to HBM channel 6, for storing result row vector
p_rd7 the device memory pointer, which is mapped to HBM channel 7, for reading NNZs’ values and row indices
p_wr7 the device memory pointer, which is mapped to HBM channel 7, for storing result row vector
out0 the output NNZ values and row indices axis stream to CU0 of cscRowKernel
in0 the input result row vector axis stream from CU0 of cscRowKernel
out1 the output NNZ values and row indices axis stream to CU1 of cscRowKernel
in1 the input result row vector axis stream from CU1 of cscRowKernel
out2 the output NNZ values and row indices axis stream to CU2 of cscRowKernel
in2 the input result row vector axis stream from CU2 of cscRowKernel
out3 the output NNZ values and row indices axis stream to CU3 of cscRowKernel
in3 the input result row vector axis stream from CU3 of cscRowKernel
out4 the output NNZ values and row indices axis stream to CU4 of cscRowKernel
in4 the input result row vector axis stream from CU4 of cscRowKernel
out5 the output NNZ values and row indices axis stream to CU5 of cscRowKernel
in5 the input result row vector axis stream from CU5 of cscRowKernel
out6 the output NNZ values and row indices axis stream to CU6 of cscRowKernel
in6 the input result row vector axis stream from CU6 of cscRowKernel
out7 the output NNZ values and row indices axis stream to CU7 of cscRowKernel
in7 the input result row vector axis stream from CU7 of cscRowKernel

xBarColKernel

#include "fp32/xBarColKernel.hpp"
void xBarColKernel (
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

xBarColKernel is used to select input column vector entries according to the input column pointers

Parameters:

in0 input axis stream of column vector entries processed in parallel
in1 input axis stream of column pointer entries processed in parallel
out output axis stream of column vector entries for the NNZs, processed in parallel
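
The selection step can be sketched as a plain C++ reference model. It assumes the conventional CSC layout, in which the NNZs of column `j` occupy positions `colPtr[j]` up to `colPtr[j + 1]`; the pointer encoding actually used on the `in1` stream may differ. The function name `selectColEntries` and the use of `std::vector` in place of streams are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference model: emit one column vector entry per NNZ, so that a later
// stage can multiply each NNZ value by the entry of its own column.
std::vector<float> selectColEntries(const std::vector<float>& colVec,
                                    const std::vector<std::uint32_t>& colPtr) {
    // Assumed CSC convention: one more pointer than there are columns.
    assert(colPtr.size() == colVec.size() + 1);
    std::vector<float> perNnz;
    perNnz.reserve(colPtr.back());
    for (std::size_t col = 0; col < colVec.size(); ++col) {
        assert(colPtr[col] <= colPtr[col + 1]);
        // Repeat the entry once for every NNZ stored in this column.
        for (std::uint32_t k = colPtr[col]; k < colPtr[col + 1]; ++k) {
            perNnz.push_back(colVec[col]);
        }
    }
    return perNnz;
}
```

A column with no NNZs contributes nothing to the output, so the output length equals the total NNZ count rather than the column count.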