CSCMV Kernel APIs

Note

The CSCMV implementation in the current release uses 16 HBM channels on the U280 card. Future releases may use all 32 HBM channels, the maximum available, to achieve the best possible performance.

bufTransColVecKernel

#include "fp32/bufTransColVecKernel.hpp"

void bufTransColVecKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransColVecKernel buffers the input column vector entries and dispatches them across multiple compute units (CUs) of xBarColKernel.

Parameters:

in0	input column vector entries stream
out0	output column vector entries stream for CU0 of xBarColKernel
out1	output column vector entries stream for CU1 of xBarColKernel
out2	output column vector entries stream for CU2 of xBarColKernel
out3	output column vector entries stream for CU3 of xBarColKernel
out4	output column vector entries stream for CU4 of xBarColKernel
out5	output column vector entries stream for CU5 of xBarColKernel
out6	output column vector entries stream for CU6 of xBarColKernel
out7	output column vector entries stream for CU7 of xBarColKernel
out8	output column vector entries stream for CU8 of xBarColKernel
out9	output column vector entries stream for CU9 of xBarColKernel
out10	output column vector entries stream for CU10 of xBarColKernel
out11	output column vector entries stream for CU11 of xBarColKernel
out12	output column vector entries stream for CU12 of xBarColKernel
out13	output column vector entries stream for CU13 of xBarColKernel
out14	output column vector entries stream for CU14 of xBarColKernel
out15	output column vector entries stream for CU15 of xBarColKernel
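
The buffer-and-dispatch role described above can be modelled in plain C++ without the HLS headers. The following is a minimal sketch, not the kernel's implementation. It assumes 32-bit data, 4 parallel entries and a 512-bit memory word (stand-ins for SPARSE_dataBits, SPARSE_parEntries and SPARSE_ddrMemBits, whose actual values are not given here), and it assumes each CU is sent a contiguous range of buffered words. The dispatch policy, the Range type and the bufferAndDispatch name are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Assumed widths for illustration only; the real values come from the
// SPARSE_* macros and are not stated in this section.
constexpr std::size_t kDataBits = 32;     // stand-in for SPARSE_dataBits
constexpr std::size_t kParEntries = 4;    // stand-in for SPARSE_parEntries
constexpr std::size_t kDdrMemBits = 512;  // stand-in for SPARSE_ddrMemBits
constexpr std::size_t kNumCus = 16;       // out0 .. out15

constexpr std::size_t kEntriesPerMemWord = kDdrMemBits / kDataBits;
constexpr std::size_t kParWordsPerMemWord = kEntriesPerMemWord / kParEntries;

using MemWord = std::array<float, kEntriesPerMemWord>;  // models one in0 word
using ParWord = std::array<float, kParEntries>;         // models one outN word
// Hypothetical per-CU range [first, second), counted in ParWord units.
using Range = std::pair<std::size_t, std::size_t>;

// Buffers every word from in0, then sends each CU its assumed range.
std::array<std::queue<ParWord>, kNumCus> bufferAndDispatch(
    std::queue<MemWord>& in0, const std::array<Range, kNumCus>& ranges) {
    // Buffering step: split each wide word into narrower ParWords.
    std::vector<ParWord> buffer;
    while (!in0.empty()) {
        const MemWord wide = in0.front();
        in0.pop();
        for (std::size_t n = 0; n < kParWordsPerMemWord; ++n) {
            ParWord narrow{};
            for (std::size_t e = 0; e < kParEntries; ++e) {
                narrow[e] = wide[n * kParEntries + e];
            }
            buffer.push_back(narrow);
        }
    }

    // Dispatch step: ranges may overlap, since several CUs may need
    // the same column vector entries.
    std::array<std::queue<ParWord>, kNumCus> out;
    for (std::size_t cu = 0; cu < kNumCus; ++cu) {
        const Range& r = ranges[cu];
        assert(r.first <= r.second && r.second <= buffer.size());
        for (std::size_t i = r.first; i < r.second; ++i) {
            out[cu].push(buffer[i]);
        }
    }
    return out;
}
```

Under these assumed widths, one in0 word yields 512 / (32 * 4) = 4 output words. The hardware kernel operates on hls::stream objects and runs its stages concurrently; std::queue is used here only to keep the sketch compilable with a standard toolchain.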

bufTransNnzColKernel

#include "fp32/bufTransNnzColKernel.hpp"

void bufTransNnzColKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )

bufTransNnzColKernel buffers the column pointers and dispatches them across multiple CUs of xBarColKernel.

Parameters:

in0	input column pointer entries stream
out0	output column pointer entries stream for CU0 of xBarColKernel
out1	output column pointer entries stream for CU1 of xBarColKernel
out2	output column pointer entries stream for CU2 of xBarColKernel
out3	output column pointer entries stream for CU3 of xBarColKernel
out4	output column pointer entries stream for CU4 of xBarColKernel
out5	output column pointer entries stream for CU5 of xBarColKernel
out6	output column pointer entries stream for CU6 of xBarColKernel
out7	output column pointer entries stream for CU7 of xBarColKernel
out8	output column pointer entries stream for CU8 of xBarColKernel
out9	output column pointer entries stream for CU9 of xBarColKernel
out10	output column pointer entries stream for CU10 of xBarColKernel
out11	output column pointer entries stream for CU11 of xBarColKernel
out12	output column pointer entries stream for CU12 of xBarColKernel
out13	output column pointer entries stream for CU13 of xBarColKernel
out14	output column pointer entries stream for CU14 of xBarColKernel
out15	output column pointer entries stream for CU15 of xBarColKernel

cscRowKernel

#include "fp32/cscRowKernel.hpp"

void cscRowKernel (
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

cscRowKernel accumulates the multiplication results that belong to the same row.

Parameters:

in0	the input axis stream of the NNZs’ values and row indices
in1	the input axis stream of column vector entries for the NNZs
out	the output axis stream of result row vector entries
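
The row accumulation can be illustrated with a small software model. This is a sketch under stated assumptions rather than the kernel code: the Nnz struct, the accumulateRows name and the use of std::vector in place of the three streams are hypothetical, and each NNZ is assumed to arrive already paired with its column vector entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical software view of one NNZ delivered on in0.
struct Nnz {
    float value;        // NNZ value
    std::uint32_t row;  // row index of the NNZ
};

// nnzs models in0, colEntries models in1 (one entry per NNZ),
// and the returned vector models the data sent on out.
std::vector<float> accumulateRows(const std::vector<Nnz>& nnzs,
                                  const std::vector<float>& colEntries,
                                  std::size_t numRows) {
    assert(nnzs.size() == colEntries.size());
    std::vector<float> result(numRows, 0.0f);
    for (std::size_t i = 0; i < nnzs.size(); ++i) {
        assert(nnzs[i].row < numRows);
        // Products that share a row index are summed into one entry.
        result[nnzs[i].row] += nnzs[i].value * colEntries[i];
    }
    return result;
}
```

For example, NNZs (2.0, row 0), (3.0, row 1) and (4.0, row 0) paired with column entries 1.0, 10.0 and 0.5 produce the row vector {4.0, 30.0}.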

loadColKernel

#include "fp32/loadColKernel.hpp"

void loadColKernel (
    ap_uint <SPARSE_ddrMemBits>* p_colValPtr,
    ap_uint <SPARSE_ddrMemBits>* p_nnzColPtr,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& out1
    )

loadColKernel reads the input column vector and the column pointers from device memory.

Parameters:

p_colValPtr	device memory pointer for reading the input column vector
p_nnzColPtr	device memory pointer for reading the column pointers of NNZ entries
out0	the output axis stream of the column vector entries
out1	the output axis stream of the column pointer entries
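
Because p_colValPtr points to memory words of SPARSE_ddrMemBits each, the host has to lay out the column vector in whole memory words. The sketch below shows one way to do that, assuming 32-bit float data, a 512-bit memory word and zero padding of the last word. These widths, the padding choice and the packColumnVector helper are assumptions for illustration and are not part of the library API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed widths for illustration only.
constexpr std::size_t kDataBits = 32;     // stand-in for SPARSE_dataBits
constexpr std::size_t kDdrMemBits = 512;  // stand-in for SPARSE_ddrMemBits
constexpr std::size_t kBytesPerMemWord = kDdrMemBits / 8;
constexpr std::size_t kEntriesPerMemWord = kDdrMemBits / kDataBits;

// Hypothetical host-side helper: copies the column vector into a byte
// buffer whose size is a whole number of memory words, zero-padded.
std::vector<std::uint8_t> packColumnVector(const std::vector<float>& x) {
    static_assert(sizeof(float) * 8 == kDataBits, "sketch assumes 32-bit float");
    // Round the entry count up to a whole number of memory words.
    const std::size_t numWords =
        (x.size() + kEntriesPerMemWord - 1) / kEntriesPerMemWord;
    std::vector<std::uint8_t> buffer(numWords * kBytesPerMemWord, 0);
    if (!x.empty()) {
        std::memcpy(buffer.data(), x.data(), x.size() * sizeof(float));
    }
    return buffer;
}
```

With these widths, a vector of 17 entries occupies two memory words (128 bytes), of which the last 15 entries are padding.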

readWriteHbmKernel

#include "fp32/readWriteHbmKernel.hpp"

void readWriteHbmKernel (
    ap_uint <SPARSE_hbmMemBits>* p_rd0,
    ap_uint <SPARSE_hbmMemBits>* p_wr0,
    ap_uint <SPARSE_hbmMemBits>* p_rd1,
    ap_uint <SPARSE_hbmMemBits>* p_wr1,
    ap_uint <SPARSE_hbmMemBits>* p_rd2,
    ap_uint <SPARSE_hbmMemBits>* p_wr2,
    ap_uint <SPARSE_hbmMemBits>* p_rd3,
    ap_uint <SPARSE_hbmMemBits>* p_wr3,
    ap_uint <SPARSE_hbmMemBits>* p_rd4,
    ap_uint <SPARSE_hbmMemBits>* p_wr4,
    ap_uint <SPARSE_hbmMemBits>* p_rd5,
    ap_uint <SPARSE_hbmMemBits>* p_wr5,
    ap_uint <SPARSE_hbmMemBits>* p_rd6,
    ap_uint <SPARSE_hbmMemBits>* p_wr6,
    ap_uint <SPARSE_hbmMemBits>* p_rd7,
    ap_uint <SPARSE_hbmMemBits>* p_wr7,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in2,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in3,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in4,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in5,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in6,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in7
    )

readWriteHbmKernel reads NNZ values and row indices from HBM and writes the result row vector to HBM.

Parameters:

p_rd0	the device memory pointer, which is mapped to HBM channel 0, for reading NNZs’ values and row indices
p_wr0	the device memory pointer, which is mapped to HBM channel 0, for storing result row vector
p_rd1	the device memory pointer, which is mapped to HBM channel 1, for reading NNZs’ values and row indices
p_wr1	the device memory pointer, which is mapped to HBM channel 1, for storing result row vector
p_rd2	the device memory pointer, which is mapped to HBM channel 2, for reading NNZs’ values and row indices
p_wr2	the device memory pointer, which is mapped to HBM channel 2, for storing result row vector
p_rd3	the device memory pointer, which is mapped to HBM channel 3, for reading NNZs’ values and row indices
p_wr3	the device memory pointer, which is mapped to HBM channel 3, for storing result row vector
p_rd4	the device memory pointer, which is mapped to HBM channel 4, for reading NNZs’ values and row indices
p_wr4	the device memory pointer, which is mapped to HBM channel 4, for storing result row vector
p_rd5	the device memory pointer, which is mapped to HBM channel 5, for reading NNZs’ values and row indices
p_wr5	the device memory pointer, which is mapped to HBM channel 5, for storing result row vector
p_rd6	the device memory pointer, which is mapped to HBM channel 6, for reading NNZs’ values and row indices
p_wr6	the device memory pointer, which is mapped to HBM channel 6, for storing result row vector
p_rd7	the device memory pointer, which is mapped to HBM channel 7, for reading NNZs’ values and row indices
p_wr7	the device memory pointer, which is mapped to HBM channel 7, for storing result row vector
out0	the output NNZ values and row indices axis stream to CU0 of cscRowKernel
in0	the input result row vector axis stream from CU0 of cscRowKernel
out1	the output NNZ values and row indices axis stream to CU1 of cscRowKernel
in1	the input result row vector axis stream from CU1 of cscRowKernel
out2	the output NNZ values and row indices axis stream to CU2 of cscRowKernel
in2	the input result row vector axis stream from CU2 of cscRowKernel
out3	the output NNZ values and row indices axis stream to CU3 of cscRowKernel
in3	the input result row vector axis stream from CU3 of cscRowKernel
out4	the output NNZ values and row indices axis stream to CU4 of cscRowKernel
in4	the input result row vector axis stream from CU4 of cscRowKernel
out5	the output NNZ values and row indices axis stream to CU5 of cscRowKernel
in5	the input result row vector axis stream from CU5 of cscRowKernel
out6	the output NNZ values and row indices axis stream to CU6 of cscRowKernel
in6	the input result row vector axis stream from CU6 of cscRowKernel
out7	the output NNZ values and row indices axis stream to CU7 of cscRowKernel
in7	the input result row vector axis stream from CU7 of cscRowKernel
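
The parameter list pairs each HBM channel N with one read pointer, one write pointer and one stream in each direction to CU N of cscRowKernel. The following sketch models that pairing in plain C++. The Channel struct, the readPhase and writePhase functions and the 64-bit Word type are hypothetical stand-ins, and the two phases run sequentially here, whereas the hardware kernel would be expected to overlap them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

constexpr std::size_t kNumChannels = 8;  // p_rd0..p_rd7 and p_wr0..p_wr7

using Word = std::uint64_t;  // opaque stand-in for one wide memory/stream word

// Hypothetical model of everything tied to one HBM channel.
struct Channel {
    std::vector<Word> rd;  // models p_rdN: NNZ values and row indices
    std::vector<Word> wr;  // models p_wrN: result row vector
    std::queue<Word> out;  // models outN, towards CU N of cscRowKernel
    std::queue<Word> in;   // models inN, from CU N of cscRowKernel
};

// Streams each channel's stored words to its own CU.
void readPhase(std::array<Channel, kNumChannels>& channels) {
    for (Channel& c : channels) {
        for (Word w : c.rd) {
            c.out.push(w);
        }
    }
}

// Drains each CU's result stream into the same channel's write buffer.
void writePhase(std::array<Channel, kNumChannels>& channels) {
    for (Channel& c : channels) {
        while (!c.in.empty()) {
            c.wr.push_back(c.in.front());
            c.in.pop();
        }
    }
}
```

The point of the model is that channels never exchange data: words read through p_rdN only reach CU N, and results from CU N are only stored through p_wrN.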

xBarColKernel

#include "fp32/xBarColKernel.hpp"

void xBarColKernel (
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out
    )

xBarColKernel is used to select input column vector entries according to the input column pointers

Parameters:

in0	input axis stream of parallelly processed column vector entries
in1	input axis stream of parallelly processed column pointer entries
out	output axis stream of parallelly processed column vector entries for the NNZs
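
The selection step can be sketched as a gather operation. This is a minimal software model, assuming each column pointer is a zero-based index into the buffered column vector entries; that interpretation, the selectColEntries name and the use of std::vector instead of streams are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// colVec models the entries received on in0, colPtrs models in1 (one
// pointer per NNZ), and the returned vector models the data sent on out.
std::vector<float> selectColEntries(const std::vector<float>& colVec,
                                    const std::vector<std::uint32_t>& colPtrs) {
    std::vector<float> selected;
    selected.reserve(colPtrs.size());
    for (std::uint32_t p : colPtrs) {
        assert(p < colVec.size());
        // Emit the column vector entry that this NNZ multiplies with.
        selected.push_back(colVec[p]);
    }
    return selected;
}
```

For a column vector {10, 20, 30} and pointers {2, 0, 2, 1}, the model emits {30, 10, 30, 20}, one entry per NNZ, in the order that the NNZs are consumed downstream.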