(CL_MEM_WRITE_ONLY |
CL_MEM_USE_HOST_PTR),
BUFSIZE * sizeof(uint32_t),
c,
NULL);
inBufVec.push_back(a_to_device);
inBufVec.push_back(b_to_device);
outBufVec.push_back(c_from_device);
```
What we're doing here is allocating `cl::Buffer` objects, which are recognized by the API, and passing in
pointers `a`, `b`, and `c` from our previously-allocated buffers. The additional flags `CL_MEM_READ_ONLY` and
`CL_MEM_WRITE_ONLY` specify to the runtime the visibility of these buffers from the perspective of the
kernel. In other words, `a` and `b` are written to the card by the host; to the kernel they are **read
only**. Then, `c` is read back from the card to the host; to the kernel it is **write only**. We
additionally add these buffer objects to vectors so that we can transfer multiple buffers at once (note that
we're essentially adding pointers to the vectors, not the data buffers themselves).
Next, we can transfer the input buffers down to the Alveo card:
```cpp
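cl::Event event_sp;
// Flags value 0: migrate the input buffers from the host to the device
q.enqueueMigrateMemObjects(inBufVec, 0, NULL, &event_sp);
// Block until the transfer has completed (done here for timing purposes)
clWaitForEvents(1, (const cl_event *)&event_sp);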
```
In this code snippet the "main event" is the call to `enqueueMigrateMemObjects()`. We pass in our
vector of buffers, a flags value of `0` to indicate that this is a transfer from host to device, and a
`cl::Event` object.
This is a good time to segue briefly into synchronization. When we enqueue the transfer we're adding it to the
runtime's 'to-do list', if you will, but not actually waiting for it to complete. By registering a
`cl::Event` object, we can then decide to wait on that event at any point in the future. In general this
isn't a point where you would necessarily want to wait, but we've done this at various points throughout the
code to more easily instrument it to display the time taken for various operations. This adds a small amount
of overhead to the application, but again, this is a learning exercise and not an example of optimizing for
maximum performance.
We now need to tell the runtime what to pass to our kernel, and we do that in the next listing. Recall that
our argument list looked like this:
`(uint32_t*a, uint32_t*b, uint32_t*c, uint32_t size)`
In our case `a` is argument 0, `b` is argument 1, and so on.
```cpp
krnl.setArg(0, a_to_device);
krnl.setArg(1, b_to_device);
krnl.setArg(2, c_from_device);
krnl.setArg(3, BUFSIZE);
```
Next, we add the kernel itself to the command queue so that it will begin executing. Generally speaking, you
would enqueue the transfers and the kernel such that they'd execute back-to-back rather than synchronizing in
between. The line of code that adds the execution of the kernel to the command queue is:
```cpp
q.enqueueTask(krnl, NULL, &event_sp);
```
If you don't want to wait at this point you can again pass in `NULL` instead of a `cl::Event` object.
And, finally, once the kernel completes we want to transfer the memory back to the host so that we can access
the new values from the CPU. This is done as follows:
```cpp
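// Migrate the output buffer from the device back to host memory
q.enqueueMigrateMemObjects(outBufVec, CL_MIGRATE_MEM_OBJECT_HOST, NULL, &event_sp);
// Block until the read-back has completed before touching the data on the CPU
clWaitForEvents(1, (const cl_event *)&event_sp);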
```
In this instance we do want to wait for synchronization. This is important; recall that when we call these
enqueue functions, we're placing entries onto the command queue in a **non-blocking** manner. If we then
attempt to access the buffer immediately after enqueuing the transfer, it may not have finished reading back in.
Excluding the FPGA configuration from example 0, the new additions in order to run the kernel are:
1. Allocate buffers in the normal way. We'll soon see that there are better ways of doing this, but this is the way many people experimenting with acceleration might do it their first time.
2. Map the allocated buffers to `cl::Buffer` objects.
3. Enqueue the migration of the input buffers (`a` and `b`) to Alveo device global memory.
4. Set the kernel arguments, both buffers and scalar values.
5. Run the kernel.
6. Read the results of the kernel back into CPU host memory, synchronizing on the completion of the read.
Only one synchronization would be needed were this a real application. As previously mentioned, we're using
several to better report on the timing of various operations in the workflow.
## Running the Application
With XRT initialized, run the application with the following command from the build directory:
`./01_simple_malloc alveo_examples`
The program will output a message similar to this:
```
-- Example 1: Vector Add with Malloc() --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBINFile Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ./alveo_examples.xclbin
Running kernel test with malloc()ed buffers
WARNING: unaligned host pointer 0x154f7909e010 detected, this leads to extra memcpy
WARNING: unaligned host pointer 0x154f7789d010 detected, this leads to extra memcpy
WARNING: unaligned host pointer 0x154f7609c010 detected, this leads to extra memcpy
Simple malloc vadd example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 247.371 ms
Allocating memory buffer: 0.030 ms
Populating buffer inputs: 47.955 ms
Software VADD run: 35.706 ms
Map host buffers to OpenCL buffers: 64.656 ms
Memory object migration enqueue: 24.829 ms
Set kernel arguments: 0.009 ms
OCL Enqueue task: 0.064 ms
Wait for kernel to complete: 92.118 ms
Read back computation results: 24.887 ms
```
Note that we have some warnings about unaligned host pointers. Because we didn't take care with our
allocation, none of our buffers that we're transferring to or from the Alveo card are aligned to the 4 KiB
boundaries needed by the Alveo DMA engine. Because of this, we need to copy the buffer contents so they're
aligned before transfer, and that operation is quite expensive.
From this point on in our examples, let's keep a close eye on these numbers. While there will be some
run-to-run variability in latency, generally speaking we are looking for deltas in each particular area.
For now let's establish a baseline:
| Operation | Example 1 |
| -------------------------------------- | :---------: |
| OCL Initialization | 247.371 ms |
| Buffer Allocation | 30 µs |
| Buffer Population | 47.955 ms |
| Software VADD | 35.706 ms |
| Buffer Mapping | 64.656 ms |
| Write Buffers Out | 24.829 ms |
| Set Kernel Args | 9 µs |
| Kernel Runtime | 92.118 ms |
| Read Buffer In | 24.887 ms |
| ΔAlveo→CPU | -418.228 ms |
| ΔAlveo→CPU (algorithm only) | -170.857 ms |
## Extra Exercises
Some things to try to build on this experiment:
- Vary the size of the buffers allocated. Can you derive an approximate relationship between buffer size and
the timing for individual operations? Do they all scale at the same rate?
- If you remove synchronization between each step, what is the quantitative effect on the runtime?
- What happens if you remove the synchronization after the final buffer copy from Alveo back to the host?
## Key Takeaways
- Once again we have to pay our FPGA configuration "tax". We will need to save at least 250 ms over the CPU
to make up for it. Note that our trivial example will never beat the CPU if we're just looking at
processing a single buffer!
- Simply-allocated memory isn't a good candidate for passing to accelerators, as we'll incur a memory copy to
compensate. We'll investigate the impact this has in subsequent examples.
- OpenCL works on command queues. It's up to the developer how and when to synchronize, but care must be
taken when reading buffers back in from the Alveo global memory to ensure synchronization before the CPU
accesses the data in the buffer.
Read [**Example 2:** Aligned Memory Allocation](./02-aligned-memory-allocation.md)
Copyright© 2019-2021 Xilinx