2020.2 Vitis™ - Runtime and System Optimization
See Vitis™ Development Environment on xilinx.com
Overview¶
The FPGA image that we’ve loaded contains a very simple vector addition core. It takes two buffers of arbitrary length as inputs and produces a buffer of equal length as an output. As the name implies, during the process it adds them together.
Our code has not really been optimized to run well in an FPGA. It's mostly equivalent to putting the algorithm in the listing below directly into the FPGA fabric. This isn't particularly efficient. The implementation tools will do some optimization for us, but we can make it better. With this code we can process one addition operation on each tick of the clock, but we're still only producing one 32-bit output at a time.
void vadd_sw(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t size)
{
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
It’s very important to note that at this point there is no way this code will beat the processor. The clock in the FPGA fabric is significantly slower than the CPU clock. This is expected, though - thinking back to our earlier example, we’re only loading a single passenger into each car on the train. We also have overhead to pass the data over PCIe, set up DMA, etc. For the next few examples, we’ll look at how to efficiently manage the buffers for our inputs and outputs to this function. Only after that will we start to take advantage of the acceleration we can get from the Alveo Data Center accelerator card.
Key Code¶
This example is the first time we’re going to actually run something on the FPGA, modest though it may be. In order to run something on the card there are four things that we must do:
Allocate and populate the buffers we'll use to send and receive data from the card.
Transfer those buffers between the host memory space and the Alveo global memory.
Run the kernel to act on those buffers.
Transfer the results of the kernel operation back to the host memory space so that they can be accessed via the processor.
As you can see, only one of those things actually takes place on the card. Memory management will make or break your application’s performance, so let’s start to take a look at that.
If you haven't done acceleration work before, you may be tempted to jump in and just use normal calls to malloc() or new to allocate your memory. In this example we'll do just that, allocating a series of buffers to transfer between the host and the Alveo card. We'll allocate four buffers: two input buffers to add together, one output buffer for the Alveo to use, and an extra buffer for a software implementation of our vadd function. This allows us to see something interesting: how we allocate memory for Alveo also impacts how efficiently the processor will run.
Buffers are allocated simply, as in the listing below. In our case, BUFSIZE is 24 MiB, or 6 × 1024 × 1024 values of type uint32_t. Any code not mentioned here is either identical or functionally equivalent to the previous examples.
uint32_t* a = new uint32_t[BUFSIZE];
uint32_t* b = new uint32_t[BUFSIZE];
uint32_t* c = new uint32_t[BUFSIZE];
uint32_t* d = new uint32_t[BUFSIZE];
This will allocate memory that is virtual, paged, and, most importantly, non-aligned. In particular it’s this last one that is going to cause some problems, as we’ll soon see.
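The buffers are then filled with test vectors, and the software implementation is run into d so that we can later verify the hardware result and compare runtimes. A minimal sketch of that step (the initialization values here are assumptions, not the tutorial's exact code):
// Populate the input buffers with arbitrary test data and compute the
// software reference result into 'd' for later comparison and timing.
for (uint32_t i = 0; i < BUFSIZE; i++) {
    a[i] = i;
    b[i] = 2 * i;
    c[i] = 0;            // will be overwritten by the kernel result
}
vadd_sw(a, b, d, BUFSIZE);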
Once we allocate the buffers and populate them with initial test vectors, the next acceleration step is to send them down to the Alveo global memory. We do that by creating OpenCL buffer objects using the flag CL_MEM_USE_HOST_PTR. This tells the API that rather than allocating its own buffer, it should use the pointers we provide. This isn't necessarily bad, but because we haven't taken any care in allocating our pointers, it's going to hurt our performance.
The listing below contains the code mapping our allocated buffers to OpenCL buffer objects.
std::vector<cl::Memory> inBufVec, outBufVec;
cl::Buffer a_to_device(context,
                       static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                                 CL_MEM_USE_HOST_PTR),
                       BUFSIZE * sizeof(uint32_t),
                       a,
                       NULL);
cl::Buffer b_to_device(context,
                       static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                                 CL_MEM_USE_HOST_PTR),
                       BUFSIZE * sizeof(uint32_t),
                       b,
                       NULL);
cl::Buffer c_from_device(context,
                         static_cast<cl_mem_flags>(CL_MEM_WRITE_ONLY |
                                                   CL_MEM_USE_HOST_PTR),
                         BUFSIZE * sizeof(uint32_t),
                         c,
                         NULL);
inBufVec.push_back(a_to_device);
inBufVec.push_back(b_to_device);
outBufVec.push_back(c_from_device);
What we're doing here is allocating cl::Buffer objects, which are recognized by the API, and passing in pointers a, b, and c from our previously-allocated buffers. The additional flags CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY specify to the runtime the visibility of these buffers from the perspective of the kernel. In other words, a and b are written to the card by the host; to the kernel they are read-only. Then, c is read back from the card to the host; to the kernel it is write-only. We additionally add these buffer objects to vectors so that we can transfer multiple buffers at once (note that we're essentially adding pointers to the vectors, not the data buffers themselves).
Next, we can transfer the input buffers down to the Alveo card:
cl::Event event_sp;
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &event_sp);
clWaitForEvents(1, (const cl_event *)&event_sp);
In this code snippet the "main event" is the call to enqueueMigrateMemObjects(). We pass in our vector of buffers, a 0 to indicate that this is a transfer from the host to the device, and a cl::Event object.
This is a good time to segue briefly into synchronization. When we enqueue the transfer we're adding it to the runtime's 'to-do list', if you will, but not actually waiting for it to complete. By registering a cl::Event object, we can then decide to wait on that event at any point in the future. In general this isn't a point where you would necessarily want to wait, but we've done this at various points throughout the code to more easily instrument it to display the time taken for various operations. This adds a small amount of overhead to the application, but again, this is a learning exercise and not an example of optimizing for maximum performance.
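As an aside, the example waits using the C API call clWaitForEvents(); with the OpenCL C++ wrapper the same wait can be expressed directly on the event object. A minimal sketch with equivalent behavior (not the tutorial's exact code):
cl::Event event_sp;
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &event_sp);
// Block until the host-to-device migration associated with this event completes.
event_sp.wait();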
We now need to tell the runtime what to pass to our kernel, and we do that in the next listing. Recall that our argument list looked like this:
(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t size)
In our case a is argument 0, b is argument 1, and so on.
krnl.setArg(0, a_to_device);
krnl.setArg(1, b_to_device);
krnl.setArg(2, c_from_device);
krnl.setArg(3, BUFSIZE);
Next, we add the kernel itself to the command queue so that it will begin executing. Generally speaking, you would enqueue the transfers and the kernel such that they’d execute back-to-back rather than synchronizing in between. The line of code that adds the execution of the kernel to the command queue is:
q.enqueueTask(krnl, NULL, &event_sp);
If you don't want to wait at this point you can again pass in NULL instead of a cl::Event object.
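For example, using the default arguments of the C++ wrapper (a minimal sketch; the example source itself registers the event for timing purposes):
// Enqueue the kernel without registering an event; the call returns
// immediately and nothing blocks until a later synchronization point.
q.enqueueTask(krnl);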
And, finally, once the kernel completes we want to transfer the memory back to the host so that we can access the new values from the CPU. This is done as follows:
q.enqueueMigrateMemObjects(outBufVec, CL_MIGRATE_MEM_OBJECT_HOST, NULL, &event_sp);
clWaitForEvents(1, (const cl_event *)&event_sp);
In this instance we do want to wait for synchronization. This is important; recall that when we call these enqueue functions, we're placing entries onto the command queue in a non-blocking manner. If we then attempt to access the buffer immediately after enqueuing the transfer, it may not have finished reading back in.
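With the read-back complete, the hardware result can be checked against the software reference we computed into d earlier. A sketch of such a check (the example's actual reporting may differ):
// Compare the Alveo result (c) against the software reference (d).
// Assumes <iostream> is included elsewhere in the host code.
bool match = true;
for (uint32_t i = 0; i < BUFSIZE; i++) {
    if (c[i] != d[i]) {
        match = false;
        break;
    }
}
std::cout << (match ? "TEST PASSED" : "TEST FAILED") << std::endl;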
Excluding the FPGA configuration from example 0, the new additions in order to run the kernel are:
Allocate buffers in the normal way. We’ll soon see that there are better ways of doing this, but this is the way many people experimenting with acceleration might do it their first time.
Map the allocated buffers to cl::Buffer objects.
Enqueue the migration of the input buffers (a and b) to Alveo device global memory.
Set the kernel arguments, both buffers and scalar values.
Run the kernel.
Read the results of the kernel back into CPU host memory, synchronizing on the completion of the read.
If this were a real application, only one synchronization point would be needed; see the sketch below. As previously mentioned, we're using several to better report on the timing of the various operations in the workflow.
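A minimal flow with a single synchronization point might look like the following sketch. It assumes the default in-order command queue, so the transfers and the kernel execute back-to-back in the order they were enqueued; it is not the instrumented code used in this example.
// Enqueue everything without intermediate waits...
q.enqueueMigrateMemObjects(inBufVec, 0);                            // host -> device
krnl.setArg(0, a_to_device);
krnl.setArg(1, b_to_device);
krnl.setArg(2, c_from_device);
krnl.setArg(3, BUFSIZE);
q.enqueueTask(krnl);                                                // run the kernel
q.enqueueMigrateMemObjects(outBufVec, CL_MIGRATE_MEM_OBJECT_HOST);  // device -> host
// ...and synchronize exactly once, before the CPU touches the results.
q.finish();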
Running the Application¶
With the XRT initialized, run the application by executing the following command from the build directory:
./01_simple_malloc alveo_examples
The program will output a message similar to this:
-- Example 1: Vector Add with Malloc() --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBINFile Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ./alveo_examples.xclbin
Running kernel test with malloc()ed buffers
WARNING: unaligned host pointer 0x154f7909e010 detected, this leads to extra memcpy
WARNING: unaligned host pointer 0x154f7789d010 detected, this leads to extra memcpy
WARNING: unaligned host pointer 0x154f7609c010 detected, this leads to extra memcpy
Simple malloc vadd example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 247.371 ms
Allocating memory buffer: 0.030 ms
Populating buffer inputs: 47.955 ms
Software VADD run: 35.706 ms
Map host buffers to OpenCL buffers: 64.656 ms
Memory object migration enqueue: 24.829 ms
Set kernel arguments: 0.009 ms
OCL Enqueue task: 0.064 ms
Wait for kernel to complete: 92.118 ms
Read back computation results: 24.887 ms
Note that we have some warnings about unaligned host pointers. Because we didn’t take care with our allocation, none of our buffers that we’re transferring to or from the Alveo card are aligned to the 4 KiB boundaries needed by the Alveo DMA engine. Because of this, we need to copy the buffer contents so they’re aligned before transfer, and that operation is quite expensive.
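As a quick illustration (not part of the example source), you can check whether a pointer meets the DMA engine's 4 KiB requirement simply by looking at its address; memory from new or malloc() is typically aligned to only 8 or 16 bytes:
#include <cstdint>
#include <cstdio>

// Returns true if 'ptr' falls on a 4 KiB (4096-byte) boundary, which is what
// the Alveo DMA engine needs in order to avoid the extra memcpy warned about above.
static bool is_4k_aligned(const void *ptr)
{
    return (reinterpret_cast<std::uintptr_t>(ptr) % 4096) == 0;
}

int main()
{
    std::uint32_t *a = new std::uint32_t[1024];
    std::printf("a %s 4 KiB aligned\n", is_4k_aligned(a) ? "is" : "is NOT");
    delete[] a;
    return 0;
}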
From this point on in our examples, let’s keep a close eye on these numbers. While there will be some variability on the latency run-to-run, generally speaking we are looking for deltas in each particular area. For now let’s establish a baseline:
| Operation                   | Example 1   |
|-----------------------------|-------------|
| OCL Initialization          | 247.371 ms  |
| Buffer Allocation           | 30 µs       |
| Buffer Population           | 47.955 ms   |
| Software VADD               | 35.706 ms   |
| Buffer Mapping              | 64.656 ms   |
| Write Buffers Out           | 24.829 ms   |
| Set Kernel Args             | 9 µs        |
| Kernel Runtime              | 92.118 ms   |
| Read Buffer In              | 24.887 ms   |
| ΔAlveo→CPU                  | -418.228 ms |
| ΔAlveo→CPU (algorithm only) | -170.857 ms |
Extra Exercises¶
Some things to try to build on this experiment:
Vary the size of the buffers allocated. Can you derive an approximate relationship between buffer size and the timing for individual operations? Do they all scale at the same rate?
If you remove synchronization between each step, what is the quantitative effect on the runtime?
What happens if you remove the synchronization after the final buffer copy from Alveo back to the host?
Key Takeaways¶
Once again we have to pay our FPGA configuration "tax". We will need to save at least 250 ms over the CPU to make up for it. Note that our trivial example will never beat the CPU if we're just looking at processing a single buffer!
Simply-allocated memory isn’t a good candidate for passing to accelerators, as we’ll incur a memory copy to compensate. We’ll investigate the impact this has in subsequent examples.
OpenCL works on command queues. It’s up to the developer how and when to synchronize, but care must be taken when reading buffers back in from the Alveo global memory to ensure synchronization before the CPU accesses the data in the buffer.
Read Example 2: Aligned Memory Allocation
Copyright© 2019-2021 Xilinx