available)
    {
        // Block until the flags for sub-buffer 'iter' have been read back,
        // then account for the newly available flags
        flagWait[iter].wait();
        available += subbuf_doc_info[iter].size / sizeof(uint);
        iter++;
    }

    // Compute the document score using only the words whose flags are set
    for (unsigned i = 0; i < size; i++, n++)
    {
        curr_entry = input_doc_words[n];
        inh_flags  = output_inh_flags[n];
        if (inh_flags)
        {
            unsigned frequency = curr_entry & 0x00ff;  // low byte: word frequency
            unsigned word_id   = curr_entry >> 8;      // upper bits: word ID
            ans += profile_weights[word_id] * (unsigned long)frequency;
        }
    }
    profile_score[doc] = ans;
}
```
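For context, the `flagWait` events used above are produced when each sub-buffer of flags is migrated back to the host. The following is a minimal sketch of the call that could record such an event (assuming an OpenCL `cl::CommandQueue q` and a `subbuf_inh_flags` array of flag sub-buffers; both names are illustrative, not necessarily the tutorial's exact code):
```
// Illustrative: non-blocking device-to-host migration of one sub-buffer's
// flags; flagWait[iter] completes once the data has landed in host memory.
q.enqueueMigrateMemObjects({subbuf_inh_flags[iter]},   // flags for chunk 'iter'
                           CL_MIGRATE_MEM_OBJECT_HOST, // device -> host
                           nullptr,                    // no wait list
                           &flagWait[iter]);           // event the host waits on
```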
### Run the Application Using the Bloom8x Kernel
Go to the `makefile` directory, and run the `make` command.
```
cd $LAB_WORK_DIR/makefile; make run STEP=sw_overlap TARGET=hw PF=8 ITER=8
```
The following output displays.
```
Processing 1398.905 MBytes of data
Splitting data in 8 sub-buffers of 174.863 MBytes for FPGA processing
--------------------------------------------------------------------
Executed FPGA accelerated version | 427.1341 ms ( FPGA 230.345 ms )
Executed Software-Only version | 3057.6307 ms
--------------------------------------------------------------------
Verification: PASS
```
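The "Splitting data in 8 sub-buffers" message corresponds to the PF=8 setting. Below is a minimal sketch of how one large input buffer could be carved into equally sized OpenCL sub-buffers (`total_size`, `buffer_doc_words`, and `subbuf_doc_words` are illustrative assumptions, not the tutorial's exact code):
```
// Illustrative: split one large cl::Buffer into num_iter region sub-buffers.
// Each region origin must respect the device's base-address alignment
// (CL_DEVICE_MEM_BASE_ADDR_ALIGN).
const unsigned int num_iter = 8;             // PF=8 from the make command
const size_t chunk = total_size / num_iter;  // bytes per sub-buffer
std::vector<cl::Buffer> subbuf_doc_words(num_iter);
for (unsigned int i = 0; i < num_iter; i++) {
    cl_buffer_region region = { i * chunk, chunk };  // {origin, size}
    subbuf_doc_words[i] = buffer_doc_words.createSubBuffer(
        CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &region);
}
```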
### Review Profile Report and Timeline Trace for the Bloom8x Kernel
1. Run the following command to view the Timeline Trace report for the Bloom8x kernel.
```
vitis_analyzer $LAB_WORK_DIR/build/sw_overlap/kernel_8/hw/runOnfpga_hw.xclbin.run_summary
```
2. Zoom in to display the Timeline Trace report.
![](./images/sw_overlap_timeline_trace.PNG)
- As shown in the *OpenCL API Calls* row of the *Host* section, the red segments (marked with red squares) are now narrower, which indicates that the host CPU processing now overlaps with the FPGA processing, improving the overall application execution time. In the previous steps, the host remained completely idle until the FPGA finished all of its processing.
- The *Data Transfer -> Write* row of the *Host* section shows almost no gaps: the host-to-device transfers run nearly back to back. The kernel compute time of each invocation is smaller than the corresponding host transfer.
- Each kernel run, and the write of its flags to DDR, overlaps with the next host-to-device transfer, as sketched after this list.
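The bullets above describe a classic split-buffer pipeline. Here is a hedged sketch of the per-iteration enqueue pattern that could produce this timeline (assuming an out-of-order `cl::CommandQueue q`, a `cl::Kernel runOnfpga`, and illustrative sub-buffer arrays; per-iteration kernel arguments and error handling are omitted):
```
// With an out-of-order queue, only the event dependencies below serialize
// work, so iteration i+1's write can overlap iteration i's kernel run.
std::vector<cl::Event> writeDone(num_iter), kernelDone(num_iter), flagWait(num_iter);
for (unsigned int iter = 0; iter < num_iter; iter++) {
    // 1. Host->Device: send this chunk of input words (non-blocking)
    q.enqueueMigrateMemObjects({subbuf_doc_words[iter]}, 0 /* to device */,
                               nullptr, &writeDone[iter]);

    // 2. Kernel: starts as soon as this chunk's write completes
    std::vector<cl::Event> runAfter = { writeDone[iter] };
    q.enqueueTask(runOnfpga, &runAfter, &kernelDone[iter]);

    // 3. Device->Host: read this chunk's flags back after the kernel finishes,
    //    recording the flagWait event consumed by the host scoring loop
    std::vector<cl::Event> readAfter = { kernelDone[iter] };
    q.enqueueMigrateMemObjects({subbuf_inh_flags[iter]}, CL_MIGRATE_MEM_OBJECT_HOST,
                               &readAfter, &flagWait[iter]);
}
```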
### Review Profile Summary Report for the Bloom8x Kernel
1. The *Kernels & Compute Unit: Kernel Execution* section reports 168 ms. This should be the same as when the Bloom8x kernel was run with ITER=8.
2. The *Kernels & Compute Unit: Compute Unit Stalls* section confirms "External Memory" stalls of about 20.045 ms, compared to no "External Memory" stalls when a single buffer was used. These stalls result in slower data transfers and kernel compute compared to the single-buffer run.
![](./images/sw_overlap_stalls.PNG)
3. The *Host Data Transfer: Host Transfer* section shows that the Host to Global Memory WRITE transfer takes about 207.5 ms, and the Host to Global Memory READ transfer takes about 36.4 ms.
![](./images/sw_overlap_profile_host.PNG)
4. The *Kernels & Compute Unit: Compute Unit Utilization* section shows that CU utilization is about 71%. This is an important measure that represents how much of the device execution time the CU was active; a back-of-envelope check follows this list.
![](./images/sw_overlap_profile_CU_util.PNG)
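As a rough sanity check of that utilization figure (illustrative arithmetic only, using the numbers reported above; the report's exact device execution window may be bounded slightly differently):
```
#include <cstdio>

int main() {
    double cu_active_ms = 168.0;    // Kernel Execution total from the report
    double device_ms    = 230.345;  // FPGA time from the run output
    // CU utilization = time the CU was active / device execution time
    std::printf("CU utilization ~= %.1f%%\n", 100.0 * cu_active_ms / device_ms);
    return 0;                       // prints ~72.9%, close to the reported ~71%
}
```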
In the next lab, you will compare the results for "Host Data Transfer Rates" and "CU Utilization".
### Throughput Achieved
- Based on the results, the throughput of the application is 1399 MB / 427 ms, or approximately 3.27 GB/s. You have now achieved approximately 7.2 times (3057 ms / 427 ms) the performance of the software-only version, as the quick check below illustrates.
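These figures can be verified directly from the run output (illustrative arithmetic only):
```
#include <cstdio>

int main() {
    double data_mb = 1398.905;   // data processed, from the run output
    double fpga_ms = 427.1341;   // FPGA-accelerated end-to-end time
    double sw_ms   = 3057.6307;  // software-only time
    // MB per ms is numerically equal to GB per s
    std::printf("Throughput: %.2f GB/s\n", data_mb / fpga_ms);  // ~3.27 GB/s
    std::printf("Speed-up:   %.1fx\n",     sw_ms  / fpga_ms);   // ~7.2x
    return 0;
}
```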
### Opportunities for Performance Improvements
The host and the kernel are accessing the same DDR bank at the same time, which resulted in external memory stalls of 20.045 ms. These concurrent accesses cause memory contention and limit the speed-up of the application execution. In the next module, you will [make use of an additional DDR bank](./6_using-multiple-ddr.md) to minimize this memory contention.
---------------------------------------
Copyright © 2020 Xilinx