(CL_MEM_READ_WRITE |
CL_MEM_EXT_PTR_XILINX),
BUFSIZE*sizeof(uint32_t),
&bank1_ext,
NULL);
```
You can see that this code is very similar to the previous example, with the difference being that we're
passing our `cl_mem_ext_ptr_t` object to the `cl::Buffer` constructor and are using the flags field to
specify the memory bank we need to use for that particular buffer. Note again that there is fundamentally no
reason for us to do this for such a simple example, but since this example is so structurally similar to the
previous one it seemed like a good time to mix it in. This can be a very useful technique for performance
optimization for very heavy workloads.
Otherwise, aside from switching to the `wide_vadd` kernel the code here is exactly unchanged other than
increasing the buffer size to 1 GiB. This was done to better accentuate the differences between the hardware
and software.
## Running the Application
With the XRT initialized, run the application by running the following command from the build directory:
`./04_wide_vadd alveo_examples`
The program will output a message similar to this:
```
-- Example 4: Parallelizing the Data Path --
Loading XCLBin to program the Alveo board:
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
Running kernel test XRT-allocated buffers and wide data path:
OCL-mapped contiguous buffer example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 244.463 ms
Allocate contiguous OpenCL buffers: 37.903 ms
Map buffers to userspace pointers: 0.333 ms
Populating buffer inputs: 30.033 ms
Software VADD run : 21.489 ms
Memory object migration enqueue : 4.639 ms
Set kernel arguments: 0.012 ms
OCL Enqueue task: 0.090 ms
Wait for kernel to complete : 9.003 ms
Read back computation results : 2.197 ms
```
| Operation | Example 3 | Example 4 | Δ3→4 |
| ------------------------------------- | :----------: | :----------: | :-------------: |
| OCL Initialization | 247.460 ms | 244.463 ms | - |
| Buffer Allocation | 30.365 ms | 37.903 ms | 7.538 ms |
| Buffer Population | 22.527 ms | 30.033 ms | 7.506 ms |
| Software VADD | 24.852 ms | 21.489 ms | -3.363 ms |
| Buffer Mapping | 222 µs | 333 µs | - |
| Write Buffers Out | 66.739 ms | 4.639 ms | - |
| Set Kernel Args | 14 µs | 12 µs | - |
| Kernel Runtime | 92.068 ms | 9.003 ms | -83.065 ms |
| Read Buffer In | 2.243 ms | 2.197 ms | - |
| ΔAlveo→CPU | -323.996 ms | -247.892 ms | -76.104 ms |
| ΔFPGA→CPU (algorithm only) | -76.536 ms | 5.548 ms | -82.084 ms |
Mission accomplished - we've managed to beat the CPU!
There are a couple of interesting bits here, though. The first thing to notice is that, as you probably
expected, the time required to transfer that data into and out of the FPGA hasn't changed. Like we disussed
in earlier sections, that's effectively a fixed time based on your memory topology, amount of data, and
overall system memory bandwidth utilization. You'll see some minor run-to-run variability here, especially
in a virtualized environment like a cloud data center, but for the most part you can treat it like a
fixed-time operation.
The more interesting one, though, is the achieved speedup. You might be surprised to see that widening the
data path to 16 words per clock did not, in fact, result in a 16x speedup. The kernel itself processes 16
words per clock, but what we're seeing in the timed results is a cumulative effect of the external DDR
latency. Each interface will burst in data, process it, and burst it out. But the inherent latency of going
back and forth to DDR slows things down and we only actually achieve a 10x speedup over the single-word
implementation.
Unfortunately, we're hitting a fundamental problem with vector addition: it's too easy. The computational
complexity of vadd is O(N), and a very simpleO(N) at that. As a result you very quickly become I/O bandwidth
bound, not computation bound. With more complex algorithms, such as nested loops which are O(Nx) or even
computationally complex O(N) algorithms like filters needing lots of calculations, we can achieve
significantly higher acceleration by performing more computation inside the FPGA fabric instead of going out
to the memory quite so often. The best candidate algorithms for acceleration have very little data transfer
and lots of calculations.
We can also run another experiment with a larger buffer. Change the buffer size:
```cpp
#define BUFSIZE (1024*1024*256)// 256*sizeof(uint32_t) = 1 GiB
```
Rebuild and run it again, and now let's switch over to a simplified set of metrics as in the table below. I
think you get the idea on OpenCL initialization by now, too, so we'll only compare the algorithm's
performance. That doesn't mean that this initialization time isn't important, but you should architect your
application so it's a one-time cost that you pay during other initialization operations.
We will also stop tracking the buffer allocation and initialization times. Again this is something that
you'd generally want to do when setting up your application. If you're routinely allocating large buffers in
the critical path of your application, odds are you'd get more "bang for your buck" rethinking your
architecture rather than trying to apply hardware acceleration.
| Operation | Example 4 |
| --------------------- | :---------: |
| Software VADD | 820.596 ms |
| Buffer PCIe TX | 383.907 ms |
| VADD Kernel | 484.050 ms |
| Buffer PCIe RX | 316.825 ms |
| Hardware VADD (Total) | 1184.897 ms |
| ΔAlveo→CPU | 364.186 ms |
Oh no! We had such a good thing going, and here we went and ruined it. With this larger buffer we're not beating the CPU anymore at all. What happened?
Looking closely at the numbers, we can see that while the hardware kernel itself is processing the data pretty efficiently, the data transfer to and from the FPGA is really eating up a lot of our time. In fact, if you were to draw out our overall execution timeline, it would look something like figure 3.4 at this point.
![Kernel Execution Timeline](./images/04_kernel_execution_time.jpg)
This chart is figurative, but to relative scale. Looking closely at the numbers, we see that in order to
beat the CPU our entire kernel would need to run to completion in only about 120 ms. For that to work we
would need to run four times faster (not likely), process four times as much data per clock, or run four
accelerators in parallel. Which one should we pick? Let's leave that for the next example.
## Extra Exercises
Some things to try to build on this experiment:
- Play around with the buffer sizes again. Can you find the inflection point where the CPU becomes faster?
- Try to capture the Vitis Timeline Trace and view this for yourself. The instructions for doing so are
found in the [Vitis Documentation](https://www.xilinx.com/html_docs/xilinx_2020_1/vitis_doc/index.html)
## Key Takeaways
- Parallelizing the kernel can provide great results, but you need to take care that your data transfer
doesn't start to dominate your execution time. Or if it does, that you stay under your target execution
time.
- Simple computation isn't always a good candidate for acceleration, unless you're trying to free up the
processor to work on other tasks. Better candidates for acceleration are complex workloads with high
complexity, especially O(Nx) algorithms like nested loops, etc.
- If you have to wait for all of the data to transfer before you begin processing you may have trouble
hitting your overall performance target even if your kernels are highly optimized.
Now we've seen that just parallelizing the data path isn't quite enough. Let's see what we can do about
those transfer times.
Read [**Example 5:** Optimizing Compute and Transfer](./05-optimizing-compute-and-transfer.md)
Copyright© 2019-2021 Xilinx