2020.2 Vitis™ - Runtime and System Optimization
Example 2: Aligned Memory Allocation

See Vitis™ Development Environment on xilinx.com
## Overview

In our last example we allocated memory simply, but as we saw, the DMA engine requires that our buffers be aligned to 4 KiB page boundaries. If the buffers are not so aligned, which they likely won't be if we don't explicitly ask for it, then the runtime will copy the buffers so that their contents are aligned.

That's an expensive operation, but can we quantify how expensive? And how can we allocate aligned memory?

## Key Code

This is a relatively short example in that we're only changing four lines vs. Example 1, our buffer allocation. There are various ways to allocate aligned memory, but in this case we'll make use of a POSIX function, `posix_memalign()`. This change replaces our previous allocation with the code shown below. We also need to include an additional header not shown in the listing, `memory`.

```cpp
// Initialize every pointer, not just the last one in the declaration
uint32_t *a = NULL, *b = NULL, *c = NULL, *d = NULL;
// Request 4096-byte (4 KiB) alignment for each buffer
posix_memalign((void**)&a, 4096, BUFSIZE*sizeof(uint32_t));
posix_memalign((void**)&b, 4096, BUFSIZE*sizeof(uint32_t));
posix_memalign((void**)&c, 4096, BUFSIZE*sizeof(uint32_t));
posix_memalign((void**)&d, 4096, BUFSIZE*sizeof(uint32_t));
```

Note that for our calls to `posix_memalign()`, we're passing in our requested alignment, which in this case is 4 KiB as we discussed previously. This is the only change to the code vs. the first example.

Note that we have changed the allocation for _all_ of the buffers, including buffer `d`, which is only used by the CPU baseline VADD function. We'll see if this has any impact on the runtime performance for both the accelerator and the CPU.
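The listing above ignores the return value of `posix_memalign()`, which is 0 on success and an error number otherwise. As a minimal sketch of a more defensive variant, the helpers below wrap the call and verify the resulting alignment. The names `alloc_aligned_u32` and `is_aligned` are hypothetical and do not appear in the example source; the 4096-byte default simply mirrors the 4 KiB requirement discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign(), free()

// Hypothetical helper (not part of the example source): allocate `count`
// uint32_t elements aligned to `alignment` bytes.
// Returns nullptr if posix_memalign() reports an error.
static uint32_t *alloc_aligned_u32(size_t count, size_t alignment = 4096)
{
    void *ptr = nullptr;
    // posix_memalign() returns 0 on success; `alignment` must be a power
    // of two and a multiple of sizeof(void *).
    if (posix_memalign(&ptr, alignment, count * sizeof(uint32_t)) != 0) {
        return nullptr;
    }
    return static_cast<uint32_t *>(ptr);
}

// Hypothetical helper: true if `ptr` sits on an `alignment`-byte boundary.
static bool is_aligned(const void *ptr, size_t alignment = 4096)
{
    assert(alignment != 0);
    return reinterpret_cast<uintptr_t>(ptr) % alignment == 0;
}
```

Under these assumptions, each buffer could be allocated with something like `uint32_t *a = alloc_aligned_u32(BUFSIZE);`, checked against `nullptr`, and later released with `free(a)`, since memory obtained from `posix_memalign()` is freed the usual way.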
## Running the Application

With the XRT initialized, run the application by running the following command from the build directory:

`./02_aligned_malloc alveo_examples`

The program will output a message similar to this:

```
-- Example 2: Vector Add with Aligned Allocation --

Loading XCLBin to program the Alveo board:

Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
Running kernel test with aligned virtual buffers

Simple malloc vadd example complete!

--------------- Key execution times ---------------
OpenCL Initialization: 256.254 ms
Allocating memory buffer: 0.055 ms
Populating buffer inputs: 47.884 ms
Software VADD run: 35.808 ms
Map host buffers to OpenCL buffers : 9.103 ms
Memory object migration enqueue: 6.615 ms
Set kernel arguments: 0.014 ms
OCL Enqueue task: 0.116 ms
Wait for kernel to complete: 92.110 ms
Read back computation results: 2.479 ms
```

This seems at first glance to be much better! Let's compare these results to our results from Example 1 to see how things have changed. Refer to the table below for details, noting that we'll exclude minor run-to-run variation from the comparison to help keep things clean.

| Operation                   |  Example 1  |  Example 2  |    Δ1→2     |
| --------------------------- | :---------: | :---------: | :---------: |
| OCL Initialization          | 247.371 ms  | 256.254 ms  |      -      |
| Buffer Allocation           |    30 µs    |    55 µs    |    25 µs    |
| Buffer Population           |  47.955 ms  |  47.884 ms  |      -      |
| Software VADD               |  35.706 ms  |  35.808 ms  |      -      |
| Buffer Mapping              |  64.656 ms  |  9.103 ms   | -55.553 ms  |
| Write Buffers Out           |  24.829 ms  |  6.615 ms   | -18.214 ms  |
| Set Kernel Args             |    9 µs     |    14 µs    |      -      |
| Kernel Runtime              |  92.118 ms  |  92.110 ms  |      -      |
| Read Buffer In              |  24.887 ms  |  2.479 ms   | -22.408 ms  |
| ΔAlveo→CPU                  | -418.228 ms | -330.889 ms |  87.339 ms  |
| ΔAlveo→CPU (algorithm only) | -170.857 ms | -74.269 ms  |  96.588 ms  |

Nice!
By only changing four lines of code we've managed to shave nearly 100 ms off of our execution time. The CPU is still faster, but just by changing one minor thing about how we're allocating memory we saw a huge improvement. That's really down to the memory copy that's needed for alignment; if we take a few extra microseconds to ensure the buffers are aligned when we allocate them, we can save orders of magnitude more time later when those buffers are consumed.

Also note that, as expected in this use case, the software runtime is the same. We're changing the alignment of the allocated memory, but otherwise it's normal userspace memory allocation.

## Extra Exercises

Some things to try to build on this experiment:

- Once again vary the size of the buffers allocated. Do the relationships that you derived in the previous example still hold true?
- Experiment with other methods of allocating aligned memory (not the OCL API). Do you see differences between the approaches, beyond minor run-to-run fluctuations?

## Key Takeaways

- Unaligned memory will kill your performance. Always ensure buffers you want to share with the Alveo card are aligned.

Now we're getting somewhere! Let's try using the OpenCL API to allocate memory and see what happens.

Read [**Example 3:** Memory Allocation with XRT](./03-xrt-memory-allocation.md)

Copyright© 2019-2021 Xilinx