Vitis Graph Library Tutorial¶
Get and Run the Vitis Graph Library¶
Get the Dependencies¶
Setup Environment¶
#!/bin/bash
source <Vitis_install_path>/Vitis/2022.1/settings64.sh
source /opt/xilinx/xrt/setup.sh
source /opt/xilinx/xrm/setup.sh
export PLATFORM_REPO_PATHS=<path to platforms>
export DEVICE=xilinx_u50_gen3x16_xdma_5_202210_1
export TARGET=sw_emu
Note: The TARGET environment variable can be set to sw_emu, hw_emu, or hw, depending on which Vitis target you want to run. sw_emu is for C-level emulation, hw_emu is for RTL-level emulation, and hw is for the real on-board test. For more information about Vitis targets, please have a look here.
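For example, to switch to hardware emulation for a later run, you can either re-export the variable or override it on the make command line; the exact handling is defined by each case's Makefile, so treat this as a sketch:

#!/bin/bash
export TARGET=hw_emu      # RTL-level emulation for all subsequent builds
# or override it for a single invocation:
make run TARGET=hw_emu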
Download the Vitis Graph Library¶
#!/bin/bash
git clone https://github.com/Xilinx/Vitis_Libraries.git
cd Vitis_Libraries/graph
Run an L3 Example¶
#!/bin/bash
cd L3/tests/SSSP   # SSSP is an example case; change directory to any other case in L3/tests if interested
make help          # show the available make commands
make host          # build the binary that runs on the host
make xclbin        # build the binary that runs on the Alveo card
make run           # run the entire program
make cleanall
For more explanation of the L3 cases, please read the L3 API section.
Run an L2 Example¶
#!/bin/bash
cd L2/tests/shortest_path_float_pred   # shortest_path_float_pred is an example case; change directory to any other case in L2/tests if interested
make help          # show the available make commands
make host          # build the binary that runs on the host
make xclbin        # build the binary that runs on the Alveo card
make run           # run the entire program
make cleanall
For more explanation of the L2 cases, please read the L2 API section.
Run an L1 Example¶
#!/bin/bash
cd L1/tests/hw/dense_similarity_int   # dense_similarity_int is an example case; change directory to any other case in L1/tests if interested
make help                  # show the available make commands
make run CSIM=1            # run C-level simulation of the HLS code
make run CSYNTH=1 COSIM=1  # run RTL-level co-simulation of the HLS code
make cleanall
For more explanation of the L1 cases, please read the L1 API section.
How Vitis Graph Library Works¶
The Vitis Graph Library aims to provide reference Vitis implementations of a set of graph processing algorithms that fit the Xilinx Alveo series acceleration cards. The APIs in the Vitis Graph Library are organized into three layers, namely L1/L2/L3, each targeting a different audience.
- L3 APIs are located at Vitis_Libraries/graph/L3/include. They are pure software APIs provided for customers who want a fast deployment of graph processing algorithms on Alveo cards. They offer a series of software designs that make efficient use of the resources on Alveo cards and deliver high-performance graph processing.
- L2 APIs are located at Vitis_Libraries/graph/L2/include. They are a set of compute-unit designs, implemented in HLS code, that run on Alveo cards. These L2 APIs need to be compiled as OpenCL kernels and are called through OpenCL APIs.
- L1 APIs are located at Vitis_Libraries/graph/L1/include. They are the basic components used to compose compute units. The L1 APIs are all well-optimized HLS designs and are able to fit various resource constraints.
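For a quick look at where each layer lives in the repository (paths as listed above):

#!/bin/bash
cd Vitis_Libraries/graph
ls L3/include   # software-level APIs for fast deployment
ls L2/include   # HLS compute-unit (OpenCL kernel) designs
ls L1/include   # HLS building blocks used to compose compute units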
L3 API¶
Target Audience¶
If a fast deployment of an FPGA-accelerated graph processor is required, the Vitis Graph L3 APIs are the best choice. These APIs provide pre-designed and well-optimized Vitis compute units, together with efficient software management of Alveo resources. To deploy graph accelerators, all users need to do is call these C++ L3 APIs.
Example Usage¶
Please run the following commands to build the library (do not forget to install XRT/XRM and set up the environment):
#!/bin/bash
cd Vitis_Libraries/graph/L3/lib
make libgraphL3
export LD_LIBRARY_PATH=<PATH TO YOUR Vitis_Libraries/graph/L3/lib>:$LD_LIBRARY_PATH
To make use of the L3 APIs, add Vitis_Libraries/graph/L3/include to the include path and Vitis_Libraries/graph/L3/lib to the library path when compiling the code.
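As a minimal sketch of such a compile line (assuming the build above produced libgraphL3.so and that XRT is installed under /opt/xilinx/xrt; the source file name is illustrative, and additional libraries such as XRM may be needed depending on the case):

#!/bin/bash
# my_l3_host.cpp is a placeholder for your own host source file
g++ -std=c++14 my_l3_host.cpp \
    -I Vitis_Libraries/graph/L3/include \
    -I /opt/xilinx/xrt/include \
    -L Vitis_Libraries/graph/L3/lib -lgraphL3 \
    -L /opt/xilinx/xrt/lib -lxilinxopencl -lpthread \
    -o my_l3_host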
The following steps are usually required to call the L3 APIs:
- Set up the handle
xf::graph::L3::Handle::singleOP op0;       // create a configuration of the operation (such as shortest path, WCC)
op0.operationName = "shortestPathFloat";
xf::graph::L3::Handle handle0;
handle0.addOp(op0);                        // initialize the Alveo board with the required operation; more than one kind of operation may be added
handle0.setUp();                           // download binaries to the FPGAs
- Set up and deploy the graph
xf::graph::Graph<uint32_t, DT> g("CSR", numVertices, numEdges, offsetsCSR, indicesCSR, weightsCSR);  // create the graph
(handle0.opsp)->loadGraph(g);  // deploy the graph data
- Run the required operation
auto ev = xf::graph::L3::shortestPath(handle0, nSource, &sourceID, weighted, g, result, pred);  // run the operation; this is a non-blocking call that actually starts a thread
int ret = ev.wait();  // wait for the operation to finish
- Release resources
(handle0.opsp)->join();  // join the thread
handle0.free();          // release other memory
g.freeBuffers();         // release graph memory
L2 API¶
Target Audience¶
If a pure FPGA-based graph accelerator is required, the Vitis Graph L2 interfaces might be of interest. The L2 APIs provide HLS functions that can be directly built into a Vitis compute unit (OpenCL kernel). The test cases of the L2 APIs are good references for compiling and running the FPGA binaries (xclbins). Simple OpenCL code is also provided to make use of the generated FPGA binaries. To efficiently manage these FPGA binaries and make use of the FPGA resources, please take a look at the L3 API.
Example Usage¶
The L2 APIs can be found at Vitis_Libraries/graph/L2/include. A typical kernel calling the L2 APIs may look like this:
extern "C" void shortestPath_top(ap_uint<32>* config, ap_uint<512>* offset, ap_uint<512>* column, ap_uint<512>* weight, ap_uint<512>* ddrQue512, ap_uint<32>* ddrQue, ap_uint<512>* result512, ap_uint<32>* result, ap_uint<512>* pred512, ap_uint<32>* pred, ap_uint<8>* info) { const int depth_E = E; const int depth_V = V; #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \ 32 max_write_burst_length = 2 max_read_burst_length = 8 bundle = gmem0 port = config depth = 4 #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \ 32 max_write_burst_length = 2 max_read_burst_length = 8 bundle = gmem0 port = offset depth = depth_V #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \ 32 max_write_burst_length = 2 max_read_burst_length = 32 bundle = gmem1 port = column depth = depth_E #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \ 32 max_write_burst_length = 2 max_read_burst_length = 32 bundle = gmem2 port = weight depth = depth_E #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 1 max_write_burst_length = 2 max_read_burst_length = 2 bundle = gmem3 port = ddrQue depth = depth_E*16 #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 1 max_write_burst_length = 2 max_read_burst_length = 2 bundle = gmem3 port = ddrQue512 depth = depth_E #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = result512 depth = depth_V #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = info depth = 8 #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = result depth = depth_V*16 #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 1 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem5 port = pred512 depth = depth_V #pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \ 1 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem5 port = pred depth = depth_V*16 xf::graph::singleSourceShortestPath<32, MAXOUTDEGREE>(config, offset, column, weight, ddrQue512, ddrQue, result512, result, pred512, pred, info); }
It is usually a wrapper function around the APIs in Vitis_Libraries/graph/L2/include. The most interesting part might be the following code:
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \
    32 max_write_burst_length = 2 max_read_burst_length = 8 bundle = gmem0 port = config depth = 4
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \
    32 max_write_burst_length = 2 max_read_burst_length = 8 bundle = gmem0 port = offset depth = depth_V
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \
    32 max_write_burst_length = 2 max_read_burst_length = 32 bundle = gmem1 port = column depth = depth_E
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \
    32 max_write_burst_length = 2 max_read_burst_length = 32 bundle = gmem2 port = weight depth = depth_E
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    1 max_write_burst_length = 2 max_read_burst_length = 2 bundle = gmem3 port = ddrQue depth = depth_E*16
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    1 max_write_burst_length = 2 max_read_burst_length = 2 bundle = gmem3 port = ddrQue512 depth = depth_E
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = result512 depth = depth_V
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = info depth = 8
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    32 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem4 port = result depth = depth_V*16
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    1 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem5 port = pred512 depth = depth_V
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 32 num_read_outstanding = \
    1 max_write_burst_length = 64 max_read_burst_length = 2 bundle = gmem5 port = pred depth = depth_V*16
These are the HLS interface pragmas. They are responsible for configuring the interfaces of the FPGA binary and may vary with the target Alveo board. For more information about these pragmas, please see the Vitis HLS interface pragma documentation.
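As an annotated reading of one of them (the values are copied from the config port above; the comments summarize the general meaning of each option, so refer to the Vitis HLS documentation for the authoritative definitions):

// Annotated copy of the interface pragma for the config port shown above:
//   m_axi                  - map this argument to an AXI4 master (global memory) interface
//   offset = slave         - the buffer base address is passed through the AXI4-Lite control interface
//   latency = 32           - expected global-memory access latency hint used by the scheduler
//   num_write_outstanding / num_read_outstanding - how many write/read requests may be in flight
//   max_write_burst_length / max_read_burst_length - maximum AXI burst lengths
//   bundle = gmem0         - ports sharing a bundle share one physical AXI interface
//   port = config          - the kernel argument this pragma configures
//   depth = 4              - pointer depth assumed for co-simulation
#pragma HLS INTERFACE m_axi offset = slave latency = 32 num_write_outstanding = 1 num_read_outstanding = \
    32 max_write_burst_length = 2 max_read_burst_length = 8 bundle = gmem0 port = config depth = 4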
The steps to compile the C/C++ code into FPGA binaries are captured in the Makefile of each test case. There are generally two steps:
- v++ --compile compiles the C/C++ code into RTL code. A .xo file is generated in this step.
- v++ --link links the .xo file into the FPGA binary. A .xclbin file is generated in this step.
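As a rough sketch of those two steps for the shortestPath_top kernel above (using the TARGET and DEVICE variables from the environment setup; the kernel source file name is illustrative, and the real Makefiles add further options such as connectivity and configuration files):

#!/bin/bash
# step 1: compile the HLS C++ kernel into an RTL kernel object (.xo)
v++ --compile -t $TARGET --platform $DEVICE -k shortestPath_top \
    -o shortestPath_top.xo shortestPath_kernel.cpp

# step 2: link the .xo into an FPGA binary (.xclbin)
v++ --link -t $TARGET --platform $DEVICE \
    -o shortestPath_top.xclbin shortestPath_top.xo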
For more information about compiling the HLS code, please visit here.
The code to make use of the FPGA binaries is usually C/C++ code with OpenCL APIs and typically contains the following steps:
- Create the entire platform and OpenCL kernels
std::vector<cl::Device> devices = xcl::get_xil_devices();
cl::Device device = devices[0];
cl::Context context(device, NULL, NULL, NULL, &fail);
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &fail);
cl::Program::Binaries xclBins = xcl::import_binary_file(xclbin_path);
devices.resize(1);
cl::Program program(context, devices, xclBins, NULL, &fail);
cl::Kernel shortestPath;
shortestPath = cl::Kernel(program, "shortestPath_top", &fail);
- Create cl::Buffers and decide which data needs to be transferred to the FPGA device and back to the host machine.
std::vector<cl::Memory> ob_in;
cl::Buffer offset_buf = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                                   sizeof(ap_uint<32>) * (numVertices + 1), &mext_o[0]);
ob_in.push_back(offset_buf);

std::vector<cl::Memory> ob_out;
cl::Buffer result_buf = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                                   sizeof(float) * ((numVertices + 1023) / 1024) * 1024, &mext_o[6]);
ob_out.push_back(result_buf);
- Set arguments for FPGA OpenCL kernels
shortestPath.setArg(j++, config_buf);
shortestPath.setArg(j++, offset_buf);
shortestPath.setArg(j++, column_buf);
shortestPath.setArg(j++, weight_buf);
shortestPath.setArg(j++, ddrQue_buf);
shortestPath.setArg(j++, ddrQue_buf);
shortestPath.setArg(j++, result_buf);
shortestPath.setArg(j++, result_buf);
shortestPath.setArg(j++, pred_buf);
shortestPath.setArg(j++, pred_buf);
shortestPath.setArg(j++, info_buf);
- Set up event dependencies
std::vector<cl::Event> events_write(1);
std::vector<cl::Event> events_kernel(1);
std::vector<cl::Event> events_read(1);

q.enqueueMigrateMemObjects(ob_in, 0, nullptr, &events_write[0]);         // Transfer Host data to Device
q.enqueueTask(shortestPath, &events_write, &events_kernel[0]);           // execution of the OpenCL kernels (FPGA binaries)
q.enqueueMigrateMemObjects(ob_out, 1, &events_kernel, &events_read[0]);  // Transfer Device data to Host
- Run OpenCL tasks and execute FPGA binaries
q.finish()
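Because the buffers above are created with CL_MEM_USE_HOST_PTR, the host arrays wrapped via mext_o hold the output once the read migration completes; q.finish() blocks until all enqueued events have finished. A minimal sketch of picking up the results (the result host array name is illustrative):

q.finish();  // block until the write, kernel, and read events have all completed

// the host array backing result_buf (wrapped via mext_o[6]) now holds the shortest-path distances
for (int i = 0; i < numVertices && i < 10; ++i) {
    std::cout << "distance[" << i << "] = " << result[i] << std::endl;
}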
L1 API¶
Target Audience¶
The target audience of the L1 APIs are users who are familiar with HLS programming and want to test, profile, or modify operators, or add new ones. With the HLS test projects provided in the L1 layer, users can get the following (a sketch of the corresponding make flow is shown after the list):
- Functional correctness tests, both in C simulation and co-simulation
- Performance profiling from the HLS synthesis report and co-simulation
- Resource and timing evaluation from Vivado synthesis.
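All three checks are driven by make flags in the L1 test Makefiles; a typical flow (the flag names follow the "Run an L1 Example" section above, and the availability of the Vivado flags may vary per case) looks like:

#!/bin/bash
cd L1/tests/hw/dense_similarity_int
make run CSIM=1                       # functional check via C simulation
make run CSYNTH=1 COSIM=1             # HLS synthesis report + RTL co-simulation for performance profiling
make run VIVADO_SYN=1 VIVADO_IMPL=1   # Vivado synthesis/implementation for resource and timing evaluation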