& write_stream,
int elements)
{
    pkt t_out;
    ap_int<32> a_out;
    RGBPixel tmpout;
    while (elements--) {
        // Read one pixel from the internal stream and pack it into an AXI4-Stream word.
        write_stream >> tmpout;
        a_out = tmpout.func1();
        t_out.set_data(a_out);
        // Assert TLAST only on the final word of the frame.
        t_out.set_last(elements == 0 ? 1 : 0);
        // Mark all bytes valid and write every word, including the last one.
        t_out.set_keep(-1);
        outFrame.write(t_out);
    }
}
```
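Every word written to `outFrame` has `set_keep(-1)` applied, which drives all TKEEP bits high to mark every byte of the data word as valid, while `set_last(1)` asserts TLAST only on the final word so the downstream QDMA stream can detect the end of the transfer.

For context, the input side of a streaming kernel typically performs the inverse conversion, unpacking AXI4-Stream words back into pixels. The following is only a minimal sketch of that pattern, not the tutorial's actual code; the helper name `read_frame` and the `RGBPixel(ap_int<32>)` constructor are assumptions, while `pkt`, `RGBPixel`, and the stream types mirror the function above.
```
// Hypothetical read-side helper (sketch only): unpack one AXI4-Stream
// word per pixel and forward it to the internal processing stream.
static void read_frame(hls::stream<pkt>& inFrame,
                       hls::stream<RGBPixel>& read_stream,
                       int elements)
{
    while (elements--) {
        pkt t_in = inFrame.read();          // one stream word per pixel
        ap_int<32> a_in = t_in.get_data();  // extract the 32-bit payload
        RGBPixel tmp(a_in);                 // assumed constructor from a packed word
        read_stream << tmp;                 // hand off to the compute dataflow
    }
}
```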
## Run Hardware Emulation for Multiple CUs with Streaming Interfaces
1. Before running emulation, you need to set the number of CUs to 4. Open the `design.cfg` and modify the `nk` option as follows.
```
nk=convolve_fpga:4
```
The `nk` option, set in the `[connectivity]` section of the configuration file, specifies the number of kernel instances, or CUs, created during the linking step of the build process.
2. Go to the `makefile` directory.
3. Use the following command to run hardware emulation.
```
make run TARGET=hw_emu STEP=qdma SOLUTION=1 NUM_FRAMES=1
```
The following output shows the data transferred over the streaming interfaces when the kernel runs on four CUs.
```
Data transfer on stream interfaces
HOST-->convolve_fpga_1:coefficient 0.035 KB
HOST-->convolve_fpga_3:inFrame 24.012 KB
convolve_fpga_3:outFrame-->HOST 20.000 KB
HOST-->convolve_fpga_4:coefficient 0.035 KB
HOST-->convolve_fpga_4:inFrame 22.000 KB
convolve_fpga_4:outFrame-->HOST 20.000 KB
HOST-->convolve_fpga_1:inFrame 22.012 KB
convolve_fpga_1:outFrame-->HOST 20.000 KB
HOST-->convolve_fpga_2:coefficient 0.035 KB
HOST-->convolve_fpga_2:inFrame 24.012 KB
convolve_fpga_2:outFrame-->HOST 20.000 KB
HOST-->convolve_fpga_3:coefficient 0.035 KB
```
You can now perform four times more work in about the same amount of time. Because each CU also needs to read the surrounding padded lines, more data is transferred in than the kernel writes out.
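As a rough sanity check on these sizes: each CU produces 10 lines of the 512x40 image, and with one 32-bit word per pixel that is 10 x 512 x 4 bytes, or about 20 KB, matching the `outFrame` transfers above. The `inFrame` transfers of roughly 22 KB to 24 KB correspond to reading one or two extra padded lines per CU.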
## View Profile Summary Report for Hardware Emulation
Use the following command to view the Profile Summary report.
```
make view_run_summary TARGET=hw_emu STEP=qdma
```
The kernel execution time for each of the four CUs is around 0.135061 ms.
Here is the updated table.
| Step | Image Size | Time (HW-EM)(ms) | Reads (KB) | Writes (KB) | Avg. Read (KB) | Avg. Write (KB) | Bandwidth (MBps) |
| :------------------- | :--------- | ---------------: | --------------: | -------------: | -------------: | --------------: | ---------: |
| baseline | 512x10 | 3.903 | 344 | 20.0 | 0.004 | 0.004 | 5.2 |
| localbuf | 512x10 | 1.574 (2.48x) | 21 (0.12x) | 20.0 | 0.064 | 0.064 | 13 |
| fixed-type data | 512x10 | 0.46 (3.4x) | 21 | 20.0 | 0.064 | 0.064 | 44 |
| dataflow | 512x10 | 0.059 (7.8x) | 21 | 20.0 | 0.064 | 0.064 | 347 |
| multi-CU | 512x40*| 0.358 | 92 | 80.0 (4x)| 0.064 | 0.064 | 1365* |
| Stream-multi-CU | 512x40*| 0.130561 (~3x) | 96.188 (4.3x) | 80.0 | 22.540 | 0.036 | 1200 |
>**NOTE:**
>
>* The Stream-multi-CU version processed four times the data compared to the previous versions. Even though the execution time for each CU does not change, running four CUs in parallel increases the system throughput by almost four times.
>* This value is calculated as 4x data/time. The data transfer time is not accounted for, and the four CUs are assumed to execute in parallel. This is not as accurate as a hardware run, but it serves as a reference for judging the effectiveness of the optimizations.
## Next Steps
In this step, you modified the host code and kernel code to generate multiple streaming CUs. In the next step, you [run the accelerator in hardware](./RunOnHardware.md).
Copyright © 2020 Xilinx