std::vector<cl::Event> task_events;
for (int cu = 0; cu < compute_units; cu++) {
    cl::Event task_event;
    // Kernel arguments 6 and 7 give each CU its own band of image lines.
    convolve_kernel.setArg(6, cu * lines_per_compute_unit);
    convolve_kernel.setArg(7, lines_per_compute_unit);
    // Each task waits on iteration_events and signals its own task_event.
    q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
    task_events.push_back(task_event);
}
std::copy(task_events.begin(), task_events.end(), std::back_inserter(iteration_events));
```
This `for` loop launches one task per CU. Each call to `enqueueTask` receives its own event object, which is then added to the `task_events` vector. Notice that the task events are not added to `iteration_events` until after the loop completes: each task should depend only on the `enqueueWriteBuffer` call, not on the other tasks.
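For context, here is a minimal sketch of the setup this loop depends on. The command queue must be created with out-of-order execution enabled so the four tasks can actually overlap, and `iteration_events` initially holds only the event from the `enqueueWriteBuffer` call. The buffer and size names (`input_buf`, `frame_bytes`, `frame_ptr`) are placeholders rather than the tutorial's actual variables.
```
// Sketch only: out-of-order execution lets the runtime run the CU tasks
// concurrently instead of serializing them in submission order.
cl::CommandQueue q(context, device,
                   CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                   CL_QUEUE_PROFILING_ENABLE);

std::vector<cl::Event> iteration_events;
cl::Event write_event;
// Non-blocking write of the input frame; its event gates all four tasks.
q.enqueueWriteBuffer(input_buf, CL_FALSE, 0, frame_bytes, frame_ptr,
                     nullptr, &write_event);
iteration_events.push_back(write_event);
```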
Now you can compile and run the design; you should see results similar to those in the following section.
## Run Hardware Emulation for Multiple Compute Units
1. Before running emulation, you need to set the number of CUs to 4. To do that, open `design.cfg` and modify the `nk` option as follows.
```
nk=convolve_fpga:4
```
The `nk` option specifies the number of kernel instances, or CUs, created during the linking step of the build process. For this lab, set it to 4. (The fuller syntax of `nk` is sketched after these steps.)
2. Go to the `makefile` directory.
3. Use the following command to run hardware emulation.
```
make run TARGET=hw_emu STEP=multicu SOLUTION=1 NUM_FRAMES=1
```
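For reference, the `nk` option also lets you name the individual CU instances explicitly. A sketch of the fuller syntax (the `[connectivity]` section header and the explicit instance names are shown for illustration; the default names, as seen in the emulation output below, are `convolve_fpga_1` through `convolve_fpga_4`):
```
[connectivity]
# nk=<kernel name>:<number of CUs>:<name 1>.<name 2>...<name N>
nk=convolve_fpga:4:convolve_fpga_1.convolve_fpga_2.convolve_fpga_3.convolve_fpga_4
```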
The following output shows the results of this kernel running on four CUs.
```
Processed 0.08 MB in 42.810s (0.00 MBps)
INFO: [Vitis-EM 22] [Wall clock time: 01:34, Emulation time: 0.102462 ms] Data transfer between kernel(s) and global memory(s)
convolve_fpga_1:m_axi_gmem1-DDR[0] RD = 24.012 KB WR = 0.000 KB
convolve_fpga_1:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_1:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_2:m_axi_gmem1-DDR[0] RD = 22.012 KB WR = 0.000 KB
convolve_fpga_2:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_2:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_3:m_axi_gmem1-DDR[0] RD = 24.012 KB WR = 0.000 KB
convolve_fpga_3:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_3:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_4:m_axi_gmem1-DDR[0] RD = 22.000 KB WR = 0.000 KB
convolve_fpga_4:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_4:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
```
You can now perform four times more work in roughly the same amount of time. More data is read from global memory because each CU must also fetch the padding lines that surround its band of the image.
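The sketch below illustrates this effect. It is not taken from the kernel source: the filter height and the even split of lines across CUs are assumptions made for illustration. Each CU writes only its own band of lines but must read extra neighboring lines to fill the convolution window, so the reads exceed the writes.
```
#include <algorithm>
#include <cstdio>

// Illustration only: why each CU reads more lines than it writes.
// FILTER_H and the even split are assumptions, not the kernel's code.
int main() {
    const int img_height = 40, compute_units = 4, FILTER_H = 3;
    const int lines_per_cu = img_height / compute_units; // 10 lines per CU
    const int pad = FILTER_H / 2;                        // halo above and below
    for (int cu = 0; cu < compute_units; cu++) {
        int first = cu * lines_per_cu;
        int read_first = std::max(first - pad, 0);
        int read_last  = std::min(first + lines_per_cu + pad, img_height);
        std::printf("CU %d: writes %d lines, reads %d lines\n",
                    cu, lines_per_cu, read_last - read_first);
    }
    return 0;
}
```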
## View Profile Summary Report for Hardware Emulation
Use the following command to view the Profile Summary report.
```
make view_run_summary TARGET=hw_emu STEP=multicu
```
The kernel execution time for each of the four CUs is around 0.065 ms.
Here is the updated table.
| Step | Image Size | Time (HW-EM)(ms) | Reads (KB) | Writes (KB) | Avg. Read (KB) | Avg. Write (KB) | BW (MBps) |
| :-------------- | :--------- | ---------------: | ---------: | ----------: | -------------: | --------------: | --------: |
| baseline | 512x10 | 3.903 | 344 | 20.0 | 0.004 | 0.004 | 5.2 |
| localbuf | 512x10 | 1.574 (2.48x) | 21 (0.12x) | 20.0 | 0.064 | 0.064 | 13 |
| fixed-type data | 512x10 | 0.46 (3.4x) | 21 | 20.0 | 0.064 | 0.064 | 44 |
| dataflow | 512x10 | 0.059 (7.8x) | 21 | 20.0 | 0.064 | 0.064 | 347 |
| multi-CU | 512x40* | 0.06 (0.98x) | 92 (4.3x) | 80.0 (4x) | 0.064 | 0.064 | 1365* |
>**NOTE:**
>
>* The multi-CU version processed four times the data of the previous versions. Even though the execution time of each individual CU does not change, the four CUs running in parallel raise system performance by almost four times.
>* The bandwidth figure is calculated as 4x data/time. Data transfer time is not accounted for, and the four CUs are assumed to execute fully in parallel. This is less accurate than a hardware run, but it serves as a reference for how effective the optimizations are.
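As a quick sanity check on that figure, assume each pixel is a 4-byte word, which matches the 20.000 KB written per CU for a 512x10 band (512 x 10 x 4 = 20,480 bytes). The four CUs together then move 4 x 20,480 = 81,920 bytes, and 81,920 bytes / 0.06 ms ≈ 1365 MBps, which is the value in the table.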
## Next Steps
In this step, you performed host code optimizations by using an out-of-order command queue and by executing multiple CUs. In the next step, you will be [Using QDMA Streaming with Multiple Compute Units](./qdma.md).
Copyright © 2020 Xilinx