    std::vector<cl::Memory> in_vec;
    in_vec.push_back(a);
    in_vec.push_back(b);

    // Transfer this chunk's inputs to the card, waiting on the previously
    // enqueued transfer (tx_events) so transfers stay ordered
    q.enqueueMigrateMemObjects(in_vec, 0, &tx_events, &m_event);
    krnl_events.push_back(m_event);
    tx_events.push_back(m_event);
    if (tx_events.size() > 1) {
        // Keep only the most recent transfer event in the wait list
        tx_events[0] = tx_events[1];
        tx_events.pop_back();
    }

    krnl.setArg(0, a);
    krnl.setArg(1, b);
    krnl.setArg(2, c);
    krnl.setArg(3, (uint32_t)(size / sizeof(uint32_t)));

    // Run the kernel once this chunk's inputs have been transferred
    q.enqueueTask(krnl, &krnl_events, &k_event);
    krnl_events.push_back(k_event);
    if (rx_events.size() == 1) {
        // Also wait on the previous chunk's readback before starting ours
        krnl_events.push_back(rx_events[0]);
        rx_events.pop_back();
    }

    // Transfer the results back once the kernel run (and any outstanding
    // readback) has completed
    c_vec.push_back(c);
    q.enqueueMigrateMemObjects(c_vec,
                               CL_MIGRATE_MEM_OBJECT_HOST,
                               &krnl_events,
                               &event);
    rx_events.push_back(event);
    return 0;
}
```
In this new function we’re doing basically the same sequence of events that we had before:
1. Enqueue migration of the input buffers from host memory to the Alveo memory.
2. Set the kernel arguments to the current buffers.
3. Enqueue the run of the kernel.
4. Enqueue a transfer of the results back to host memory.
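As a sketch, the calling side then just enqueues every chunk back-to-back and drains the pipeline once at the end. The names here (`NUM_CHUNKS`, `sub_a`, `sub_b`, `sub_c`) are illustrative assumptions, not the example's actual identifiers:

```
std::vector<cl::Event> events(NUM_CHUNKS);
for (int i = 0; i < NUM_CHUNKS; i++) {
    // Each call chains its transfer -> kernel -> readback onto the queue
    enqueue_subbuf_vadd(q, krnl, events[i], sub_a[i], sub_b[i], sub_c[i]);
}
cl::Event::waitForEvents(events);  // wait for the whole pipeline at once
```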
The difference, though, is that now we’re enqueueing them in a truly pipelined fashion. We aren’t
waiting for each event to fully complete, as we were in previous examples, because that would defeat the
whole purpose of pipelining. Instead we’re using event-based dependencies. By using `cl::Event` objects we
can build a chain of events that must complete before any subsequent chained events run (non-linked events can
still be scheduled at any time).
We enqueue multiple runs of the kernel and then wait for all of them to complete, which will result in much
more efficient scheduling. Note that if we had built the same structure as in
[Example 4](./04-parallelizing-the-data-path.md) using this queuing method we’d see the same results as then,
because the runtime has no way of knowing whether or not we can safely start processing before sending all of
the data. As designers we have to tell the scheduler what can and cannot be done.
And, finally, none of this would happen in the correct sequence if we didn’t do one more very important
thing: we have to request an out-of-order command queue by passing in the flag
`CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE` when we create it.
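As a minimal sketch (assuming `context` and `device` already exist, and omitting error checking), the queue creation looks like:

```
cl::CommandQueue q(context, device,
                   CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                   CL_QUEUE_PROFILING_ENABLE);
```

With a default in-order queue the wait-lists would be redundant: every enqueued command would serialize behind the previous one and the pipeline overlap would vanish.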
The code in this example should otherwise seem familiar. We now call these functions instead of calling the
API directly from `main()`, but it’s otherwise unchanged.
There is something interesting, though, about mapping buffer `c` back into userspace: we don’t have to work
with the individual sub-buffers. Because the sub-buffers have already been migrated back to host memory, and
because creating sub-buffers doesn’t change the underlying pointers, we can still work with the parent even
though we have children (and the parent buffer somehow even manages to sleep through the night!).
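A hedged sketch of that relationship (chunk sizes and variable names are assumptions): each child is created with `CL_BUFFER_CREATE_TYPE_REGION` and aliases a slice of the parent’s storage, so one mapping of the parent sees every chunk’s results:

```
cl_buffer_region region = {chunk * chunk_bytes, chunk_bytes};
cl::Buffer sub_c = c.createSubBuffer(CL_MEM_WRITE_ONLY,
                                     CL_BUFFER_CREATE_TYPE_REGION,
                                     &region);
// ... enqueue transfers and kernel runs per sub-buffer, then:
uint32_t *results = (uint32_t *)q.enqueueMapBuffer(c, CL_TRUE, CL_MAP_READ,
                                                   0, total_bytes);
```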
## Running the Application
With XRT initialized, run the application with the following command from the build directory:
`./05_pipelined_vadd alveo_examples`
The program will output a message similar to this:
```
-- Example 5: Pipelining Kernel Execution --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
-- Running kernel test with XRT-allocated contiguous buffers and wide VADD (16 values/clock)
OCL-mapped contiguous buffer example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 263.001 ms
Allocate contiguous OpenCL buffers: 915.048 ms
Map buffers to userspace pointers: 0.282 ms
Populating buffer inputs: 1166.471 ms
Software VADD run: 1195.575 ms
Memory object migration enqueue: 0.441 ms
Wait for kernel to complete: 692.173 ms
```
And comparing these results to the previous run:
| Operation | Example 4 | Example 5 | Δ4→5 |
| --------------------- | :---------: | :---------: | :-------------: |
| Software VADD | 820.596 ms | 1166.471 ms | 345.875 ms |
| Hardware VADD (Total) | 1184.897 ms | 692.172 ms | −492.725 ms |
| ΔAlveo→CPU | 364.186 ms | −503.402 ms | −867.588 ms |
Mission accomplished for sure this time. Look at those margins!
There’s no way this would turn around on us now, right? Let’s sneak out early - I’m sure there isn’t an
“other shoe” that’s going to drop.
## Extra Exercises
Some things to try to build on this experiment:
- Play around with the buffer sizes again. Is there a similar inflection point in this exercise?
- Capture the traces again too; can you see the difference? How does the choice of the number of sub-buffers
impact runtime (if it does)?
## Key Takeaways
- Intelligently managing your data transfer and command queues can lead to significant speedups.
Read [**Example 6:** Meet the Other Shoe](./06-meet-the-other-shoe.md)
Copyright © 2019-2021 Xilinx