CSCMV Kernel APIs
Note
The CSCMV implementation in the current release uses 16 HBM channels on the U280 card. Future releases may use all 32 HBM channels, the maximum available, to achieve the best possible performance.
bufTransColVecKernel
#include "fp32/bufTransColVecKernel.hpp"
void bufTransColVecKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )
bufTransColVecKernel is used to buffer and dispatch input column vector entries across multiple CUs of xBarColKernel.
Parameters:
in0 | input column vector entries stream |
out0 | output column vector entries stream for CU0 of xBarColKernel |
out1 | output column vector entries stream for CU1 of xBarColKernel |
out2 | output column vector entries stream for CU2 of xBarColKernel |
out3 | output column vector entries stream for CU3 of xBarColKernel |
out4 | output column vector entries stream for CU4 of xBarColKernel |
out5 | output column vector entries stream for CU5 of xBarColKernel |
out6 | output column vector entries stream for CU6 of xBarColKernel |
out7 | output column vector entries stream for CU7 of xBarColKernel |
out8 | output column vector entries stream for CU8 of xBarColKernel |
out9 | output column vector entries stream for CU9 of xBarColKernel |
out10 | output column vector entries stream for CU10 of xBarColKernel |
out11 | output column vector entries stream for CU11 of xBarColKernel |
out12 | output column vector entries stream for CU12 of xBarColKernel |
out13 | output column vector entries stream for CU13 of xBarColKernel |
out14 | output column vector entries stream for CU14 of xBarColKernel |
out15 | output column vector entries stream for CU15 of xBarColKernel |
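The stream types in the signature above imply a width conversion: each input word is SPARSE_ddrMemBits wide, while each output word carries SPARSE_parEntries entries of SPARSE_dataBits each. The following is a minimal host-side sketch of that conversion in plain C++, not the kernel's HLS implementation. The constant values (32-bit entries, 4 entries per output word, 512-bit input words), the entry ordering, and the helper name splitDdrWords are assumptions made for illustration; the sketch also does not model how words are distributed among the 16 outputs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values for illustration only; the real SPARSE_* macros come from the build.
constexpr std::size_t kDataBits = 32;     // stands in for SPARSE_dataBits
constexpr std::size_t kParEntries = 4;    // stands in for SPARSE_parEntries
constexpr std::size_t kDdrMemBits = 512;  // stands in for SPARSE_ddrMemBits

constexpr std::size_t kEntriesPerDdrWord = kDdrMemBits / kDataBits;
constexpr std::size_t kNarrowPerDdrWord = kEntriesPerDdrWord / kParEntries;
static_assert(kDdrMemBits % (kDataBits * kParEntries) == 0,
              "a wide word must hold a whole number of narrow words");

using DdrWord = std::array<std::uint32_t, kEntriesPerDdrWord>;
using NarrowWord = std::array<std::uint32_t, kParEntries>;

// Hypothetical helper: split each wide memory word into narrow words,
// lowest-indexed entries first (the ordering is an assumption).
std::vector<NarrowWord> splitDdrWords(const std::vector<DdrWord>& in) {
    std::vector<NarrowWord> out;
    out.reserve(in.size() * kNarrowPerDdrWord);
    for (const DdrWord& wide : in) {
        for (std::size_t n = 0; n < kNarrowPerDdrWord; ++n) {
            NarrowWord narrow{};
            for (std::size_t e = 0; e < kParEntries; ++e) {
                narrow[e] = wide[n * kParEntries + e];
            }
            out.push_back(narrow);
        }
    }
    return out;
}
```

Under these assumed widths, one input word yields four output words of four entries each.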
bufTransNnzColKernel
#include "fp32/bufTransNnzColKernel.hpp"
void bufTransNnzColKernel (
    hls::stream <ap_uint <SPARSE_ddrMemBits>>& in0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out8,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out9,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out10,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out11,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out12,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out13,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out14,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out15
    )
bufTransNnzColKernel is used to buffer and dispatch the column pointers across multiple CUs of xBarColKernel.
Parameters:
in0 | input column pointer entries stream |
out0 | output column pointer entries stream for CU0 of xBarColKernel |
out1 | output column pointer entries stream for CU1 of xBarColKernel |
out2 | output column pointer entries stream for CU2 of xBarColKernel |
out3 | output column pointer entries stream for CU3 of xBarColKernel |
out4 | output column pointer entries stream for CU4 of xBarColKernel |
out5 | output column pointer entries stream for CU5 of xBarColKernel |
out6 | output column pointer entries stream for CU6 of xBarColKernel |
out7 | output column pointer entries stream for CU7 of xBarColKernel |
out8 | output column pointer entries stream for CU8 of xBarColKernel |
out9 | output column pointer entries stream for CU9 of xBarColKernel |
out10 | output column pointer entries stream for CU10 of xBarColKernel |
out11 | output column pointer entries stream for CU11 of xBarColKernel |
out12 | output column pointer entries stream for CU12 of xBarColKernel |
out13 | output column pointer entries stream for CU13 of xBarColKernel |
out14 | output column pointer entries stream for CU14 of xBarColKernel |
out15 | output column pointer entries stream for CU15 of xBarColKernel |
cscRowKernel
#include "fp32/cscRowKernel.hpp"
void cscRowKernel ( hls::stream <ap_uint <SPARSE_hbmMemBits>>& in0, hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1, hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out )
cscRowKernel is used to accumulate the multiplication results that belong to the same row.
Parameters:
in0 | the input axis stream of the NNZs’ values and row indices |
in1 | the input axis stream of column vector entries for the NNZs |
out | the output axis stream of result row vector entries |
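The accumulation described above can be sketched on the host in plain C++ as follows. This is a software reference under stated assumptions, not the HLS kernel: the NnzEntry struct, the function name accumulateRows, and the use of std::vector in place of streams are all hypothetical. It assumes the i-th column vector entry on in1 is the one that multiplies the i-th NNZ on in0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one NNZ: its value and the row it belongs to.
struct NnzEntry {
    float value;
    std::uint32_t row;
};

// Multiply each NNZ by its matching column vector entry and add the
// product to the result entry of the NNZ's row.
std::vector<float> accumulateRows(const std::vector<NnzEntry>& nnzs,
                                  const std::vector<float>& colVals,
                                  std::size_t numRows) {
    assert(nnzs.size() == colVals.size());  // one vector entry per NNZ
    std::vector<float> rowVec(numRows, 0.0f);
    for (std::size_t i = 0; i < nnzs.size(); ++i) {
        assert(nnzs[i].row < numRows);
        rowVec[nnzs[i].row] += nnzs[i].value * colVals[i];
    }
    return rowVec;
}
```

For example, the 2x2 matrix with rows (1, 0) and (2, 3) has NNZs (1, row 0), (2, row 1), (3, row 1) in column order. With the input vector (10, 100), the per-NNZ vector entries are 10, 10, 100, and the accumulated result is (10, 320).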
loadColKernel
#include "fp32/loadColKernel.hpp"
void loadColKernel ( ap_uint <SPARSE_ddrMemBits>* p_colValPtr, ap_uint <SPARSE_ddrMemBits>* p_nnzColPtr, hls::stream <ap_uint <SPARSE_ddrMemBits>>& out0, hls::stream <ap_uint <SPARSE_ddrMemBits>>& out1 )
loadColKernel is used to read the input column vector and the column pointers from device memory.
Parameters:
p_colValPtr | device memory pointer for reading the input column vector |
p_nnzColPtr | device memory pointer for reading the column pointers of NNZ entries |
out0 | the output axis stream of the column vector entries |
out1 | the output axis stream of the column pointer entries |
readWriteHbmKernel
#include "fp32/readWriteHbmKernel.hpp"
void readWriteHbmKernel (
    ap_uint <SPARSE_hbmMemBits>* p_rd0,
    ap_uint <SPARSE_hbmMemBits>* p_wr0,
    ap_uint <SPARSE_hbmMemBits>* p_rd1,
    ap_uint <SPARSE_hbmMemBits>* p_wr1,
    ap_uint <SPARSE_hbmMemBits>* p_rd2,
    ap_uint <SPARSE_hbmMemBits>* p_wr2,
    ap_uint <SPARSE_hbmMemBits>* p_rd3,
    ap_uint <SPARSE_hbmMemBits>* p_wr3,
    ap_uint <SPARSE_hbmMemBits>* p_rd4,
    ap_uint <SPARSE_hbmMemBits>* p_wr4,
    ap_uint <SPARSE_hbmMemBits>* p_rd5,
    ap_uint <SPARSE_hbmMemBits>* p_wr5,
    ap_uint <SPARSE_hbmMemBits>* p_rd6,
    ap_uint <SPARSE_hbmMemBits>* p_wr6,
    ap_uint <SPARSE_hbmMemBits>* p_rd7,
    ap_uint <SPARSE_hbmMemBits>* p_wr7,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out0,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out1,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out2,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in2,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out3,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in3,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out4,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in4,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out5,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in5,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out6,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in6,
    hls::stream <ap_uint <SPARSE_hbmMemBits>>& out7,
    hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in7
    )
readWriteHbmKernel is used to read NNZ values and row indices from HBM and to write the result row vector to HBM.
Parameters:
p_rd0 | the device memory pointer, which is mapped to HBM channel 0, for reading NNZs’ values and row indices |
p_wr0 | the device memory pointer, which is mapped to HBM channel 0, for storing result row vector |
p_rd1 | the device memory pointer, which is mapped to HBM channel 1, for reading NNZs’ values and row indices |
p_wr1 | the device memory pointer, which is mapped to HBM channel 1, for storing result row vector |
p_rd2 | the device memory pointer, which is mapped to HBM channel 2, for reading NNZs’ values and row indices |
p_wr2 | the device memory pointer, which is mapped to HBM channel 2, for storing result row vector |
p_rd3 | the device memory pointer, which is mapped to HBM channel 3, for reading NNZs’ values and row indices |
p_wr3 | the device memory pointer, which is mapped to HBM channel 3, for storing result row vector |
p_rd4 | the device memory pointer, which is mapped to HBM channel 4, for reading NNZs’ values and row indices |
p_wr4 | the device memory pointer, which is mapped to HBM channel 4, for storing result row vector |
p_rd5 | the device memory pointer, which is mapped to HBM channel 5, for reading NNZs’ values and row indices |
p_wr5 | the device memory pointer, which is mapped to HBM channel 5, for storing result row vector |
p_rd6 | the device memory pointer, which is mapped to HBM channel 6, for reading NNZs’ values and row indices |
p_wr6 | the device memory pointer, which is mapped to HBM channel 6, for storing result row vector |
p_rd7 | the device memory pointer, which is mapped to HBM channel 7, for reading NNZs’ values and row indices |
p_wr7 | the device memory pointer, which is mapped to HBM channel 7, for storing result row vector |
out0 | the output NNZ values and row indices axis stream to CU0 of cscRowKernel |
in0 | the input result row vector axis stream from CU0 of cscRowKernel |
out1 | the output NNZ values and row indices axis stream to CU1 of cscRowKernel |
in1 | the input result row vector axis stream from CU1 of cscRowKernel |
out2 | the output NNZ values and row indices axis stream to CU2 of cscRowKernel |
in2 | the input result row vector axis stream from CU2 of cscRowKernel |
out3 | the output NNZ values and row indices axis stream to CU3 of cscRowKernel |
in3 | the input result row vector axis stream from CU3 of cscRowKernel |
out4 | the output NNZ values and row indices axis stream to CU4 of cscRowKernel |
in4 | the input result row vector axis stream from CU4 of cscRowKernel |
out5 | the output NNZ values and row indices axis stream to CU5 of cscRowKernel |
in5 | the input result row vector axis stream from CU5 of cscRowKernel |
out6 | the output NNZ values and row indices axis stream to CU6 of cscRowKernel |
in6 | the input result row vector axis stream from CU6 of cscRowKernel |
out7 | the output NNZ values and row indices axis stream to CU7 of cscRowKernel |
in7 | the input result row vector axis stream from CU7 of cscRowKernel |
xBarColKernel
#include "fp32/xBarColKernel.hpp"
void xBarColKernel ( hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in0, hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& in1, hls::stream <ap_uint <SPARSE_dataBits*SPARSE_parEntries>>& out )
xBarColKernel is used to select input column vector entries according to the input column pointers.
Parameters:
in0 | input axis stream of column vector entries, processed in parallel |
in1 | input axis stream of column pointer entries, processed in parallel |
out | output axis stream of the column vector entries selected for the NNZs, processed in parallel |
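The selection step can be illustrated with a plain C++ reference. This sketch assumes the standard CSC convention, in which the column pointer array has one more element than there are columns and colPtr[c] to colPtr[c + 1] delimits the NNZs of column c. The on-device pointer encoding used by the kernel may differ, and the function name selectColEntries is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For every NNZ, emit the column vector entry of the column that holds it.
// Assumes standard CSC column pointers: colPtr.size() == colVec.size() + 1.
std::vector<float> selectColEntries(const std::vector<std::uint32_t>& colPtr,
                                    const std::vector<float>& colVec) {
    assert(colPtr.size() == colVec.size() + 1);
    std::vector<float> out;
    out.reserve(colPtr.back());  // the last pointer equals the NNZ count
    for (std::size_t c = 0; c < colVec.size(); ++c) {
        assert(colPtr[c] <= colPtr[c + 1]);
        // Repeat the entry once per NNZ stored in column c.
        for (std::uint32_t k = colPtr[c]; k < colPtr[c + 1]; ++k) {
            out.push_back(colVec[c]);
        }
    }
    return out;
}
```

With column pointers (0, 2, 3) and the input vector (10, 100), the function returns (10, 10, 100): two copies for the two NNZs in column 0 and one for the single NNZ in column 1. The output has one entry per NNZ, which is the form the second input of cscRowKernel is documented to carry.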