> 8;
unsigned hash_pu = MurmurHash2(word_id, 3, 1);
unsigned hash_lu = MurmurHash2(word_id, 3, 5);
bool doc_end = (word_id == docTag);
unsigned hash1 = hash_pu & hash_bloom;
bool inh1 = (!doc_end) && (bloom_filter_local[j][hash1 >> 5] & (1 << (hash1 & 0x1f)));
unsigned hash2 = (hash_pu + hash_lu) & hash_bloom;
bool inh2 = (!doc_end) && (bloom_filter_local[j][hash2 >> 5] & (1 << (hash2 & 0x1f)));
inh_flags(7 + j * 8, j * 8) = (inh1 && inh2) ? 1 : 0;
}
flag_stream.write(inh_flags);
}
}
```
- Added `#pragma HLS UNROLL`
- Unrolls the inner loop to make four copies of the hash functionality.
- Vitis HLS will try to pipeline the outer loop with `II=1`. With the inner loop unrolled, the outer loop can be initiated every clock cycle, computing 4 words in parallel.
- Added `#pragma HLS LOOP_TRIPCOUNT min=1 max=3500000`
- Provides trip-count bounds so that the latency of the function can be reported after HLS synthesis.
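As a hedged illustration of these pragmas, the following plain-C++ sketch mirrors the loop structure above. `toy_hash` is a stand-in for `MurmurHash2`, and the filter dimensions and the `hash_bloom` mask are illustrative values, not the tutorial's actual ones; in a regular C++ compiler the `#pragma HLS` lines are simply ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MurmurHash2: any cheap hash works for this sketch.
static unsigned toy_hash(unsigned word, unsigned seed) {
    unsigned h = word * 0x5bd1e995u + seed;
    h ^= h >> 13;
    return h;
}

const int PARALLELIZATION = 4;  // matches PF=4 in this step

// Sketch of the parallelized flag computation: PARALLELIZATION words per
// outer-loop iteration; HLS fully unrolls the inner loop and pipelines the
// outer loop with II=1.
void compute_flags_sketch(const unsigned* words, int num_words,
                          const unsigned bloom_filter_local[PARALLELIZATION][1024],
                          unsigned hash_bloom, unsigned docTag, uint8_t* flags) {
    for (int i = 0; i < num_words; i += PARALLELIZATION) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=3500000
#pragma HLS PIPELINE II=1
        for (int j = 0; j < PARALLELIZATION; j++) {
#pragma HLS UNROLL
            unsigned word_id = words[i + j] >> 8;
            bool doc_end = (word_id == docTag);
            unsigned hash1 = toy_hash(word_id, 1) & hash_bloom;
            bool inh1 = !doc_end &&
                (bloom_filter_local[j][hash1 >> 5] & (1u << (hash1 & 0x1f)));
            unsigned hash2 = toy_hash(word_id, 5) & hash_bloom;
            bool inh2 = !doc_end &&
                (bloom_filter_local[j][hash2 >> 5] & (1u << (hash2 & 0x1f)));
            // A word is flagged only when both hash probes hit the filter.
            flags[i + j] = (inh1 && inh2) ? 1 : 0;
        }
    }
}
```

Because the inner loop is fully unrolled, the four bit tests happen in the same cycle against four private copies of the filter, which is why the outer loop can accept a new group of words every clock.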
### Build the Kernel Using the Vitis Tool Flow
Now, build the kernel using the Vitis compiler. The Vitis compiler will call the Vitis HLS tool to synthesize the C++ kernel code into an RTL kernel. You will also review the reports to confirm if the kernel meets the latency/throughput requirements for your performance goals.
1. Use the following command to build the kernel.
```
cd $LAB_WORK_DIR/makefile; make build STEP=single_buffer PF=4 TARGET=hw_emu
```
This command will call the `v++` compiler which then calls the Vitis HLS tool to translate the C++ code into RTL code that can be used to run Hardware Emulation.
>**NOTE**: For the purposes of this tutorial, only 100 input words are used, because running Hardware Emulation with the full data set would take a long time.
2. Then, use the following commands to visualize the HLS Synthesis Report in the Vitis analyzer.
```
vitis_analyzer ../build/single_buffer/kernel_4/hw_emu/runOnfpga_hw_emu.xclbin.link_summary
```
![](./images/4_Kernel_4_link.PNG)
- The `compute_hash_flags` latency reported is 875,011 cycles. This is based on a total of 3,500,000 words computed with 4 words in parallel, so the loop has 875,000 iterations; including the `MurmurHash2` latency, the total of 875,011 cycles is close to optimal.
- The `compute_hash_flags_dataflow` function shows `dataflow` in the Pipeline column. This indicates that task-level parallelism is enabled and the sub-functions within `compute_hash_flags_dataflow` are expected to overlap.
- The latency reported for the `read_bloom_filter` function is 16,385 cycles for reading the Bloom filter coefficients from the DDR using the `bloom_filter` `maxi` port. This loop iterates about 16,000 times, reading 32 bits of the Bloom filter coefficients per iteration.
3. The HLS reports confirm that the latency of the function meets your target. You still need to ensure the functionality is correct when communicating with the host. In the next section, you will walk through the initial host code and run the software and hardware emulation.
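The `read_bloom_filter` behavior can be sketched as follows; this is a hedged model, and the function signature and `size` parameter are illustrative rather than the tutorial's exact code.

```cpp
#include <cstdint>

// Hedged sketch of read_bloom_filter: a pipelined copy of the Bloom filter
// coefficients from DDR (via the bloom_filter maxi port) into on-chip memory,
// one 32-bit word per iteration.
void read_bloom_filter(const uint32_t* bloom_filter, unsigned size,
                       uint32_t* bloom_filter_local) {
    for (unsigned i = 0; i < size; ++i) {
#pragma HLS PIPELINE II=1
        // Sequential accesses let the tool infer an AXI burst read.
        bloom_filter_local[i] = bloom_filter[i];
    }
}
```

With one word read per cycle plus a small pipeline ramp-up, a loop of roughly 16,000 iterations lines up with the reported latency of 16,385 cycles.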
### Review the Initial Host Code
The initial version of the accelerated application code structure follows the structure of the original software version. The entire input buffer is transferred from the host to the FPGA in a single transaction. Then, the FPGA accelerator performs the computation. Finally, the results are read back from the FPGA to the host before being post-processed.
The following figure shows the sequential process implemented in this first step: the host writes data to the device, the accelerator computes on the FPGA, and the flags are read back to the host. The Profile score is calculated sequentially on the CPU after all the flags are received by the host.
![](./images/overlap_single_buffer.PNG)
The FPGA accelerator computes the hash values and flags for the provided input words.
The functionality of the different inputs passed to the accelerator kernel is as follows:
* `input_doc_words`: Input array that contains the 32-bit words for all the documents.
* `bloom_filter`: Bloom filter array that contains the inserted search array hash values.
* `total_size`: Unsigned `int` that represents the total size processed by the FPGA when called.
* `load_weights`: Boolean that allows the `bloom_filter` array to load only once to the FPGA in the case of multiple kernel invocations.
The output of the accelerator is as follows:
* `output_inh_flags`: Output array of 8-bit outputs where each bit in the 8-bit output indicates whether a word is present in the Bloom filter, that is then used for computing score in the CPU.
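As a sketch of how the host might consume `output_inh_flags` (the helper below is hypothetical; the tutorial's actual scoring code lives in the host sources), each word's flag byte carries the membership result in its least significant bit:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical host-side post-processing helper: count words flagged as
// possible matches. In the kernel snippet above, a word's flag byte is 1
// only when both hash probes hit the Bloom filter.
size_t count_flagged(const uint8_t* output_inh_flags, size_t total_words) {
    size_t hits = 0;
    for (size_t i = 0; i < total_words; ++i)
        if (output_inh_flags[i] & 1)  // LSB = membership flag
            ++hits;
    return hits;
}
```

Only the flagged words then need the full score lookup on the CPU, which is what makes the Bloom filter an effective pre-filter.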
### Run Software Emulation, Hardware Emulation, and Hardware
1. To ensure the application passes Software Emulation with your changes, run the following command.
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=sw_emu
```
Make sure that the Software Emulation is passing.
2. Next, to verify the functionality is intact, use the following command to run Hardware Emulation.
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=hw_emu
```
- The command output shows that the SIMULATION PASSED, which confirms that the generated hardware is functionally correct. However, you have not yet run on the actual FPGA hardware.
> **NOTE**: This tutorial is provided with `xclbin` files in the `$LAB_WORK_DIR/xclbin_save` directory. The `SOLUTION=1` option can be added to the make target for using these `xclbin` files for `hw` runs. These `xclbin` files were generated for Alveo U200 cards only. You must generate new `xclbin` files for every platform used in this tutorial.
3. Run the following command to execute the application on hardware. This run computes 100,000 documents on the hardware.
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw
```
* If you are using an `xclbin` provided as part of solution in this tutorial, then use the following command.
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw SOLUTION=1
```
* `PF=4` sets the `PARALLELIZATION` macro to 4 in `$LAB_WORK_DIR/reference_files/compute_score_fpga_kernel.cpp`, so that four words are processed in parallel.
* `ITER=1` indicates that the input is sent in a single iteration (a single buffer).
The following output displays.
```
Loading runOnfpga_hw.xclbin
Processing 1398.903 MBytes of data
Running with a single buffer of 1398.903 MBytes for FPGA processing
--------------------------------------------------------------------
Executed FPGA accelerated version | 838.5898 ms ( FPGA 447.964 ms )
Executed Software-Only version | 3187.0354 ms
--------------------------------------------------------------------
Verification: PASS
```
- The total FPGA time is 447 ms. This includes the host-to-DDR transfer, the compute on the FPGA, and the DDR-to-host transfer.
- The total time for computing 100,000 documents is about 838 ms.
At this point, review the Profile reports and Timeline Trace to extract information, such as how much time it takes to transfer the data between host and kernel and how much time it takes to compute on the FPGA.
### Visualize the Resources Utilized
Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to generate the `link_summary`, as it is not provided as part of the tutorial. You can also skip this step.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin.link_summary
```
The HLS Synthesis Report shows the number of LUTs, REG, and BRAM utilized for the Bloom4x kernel implementation.
![](./images/kernel_4_util.PNG)
### Review Profile Reports and Timeline Trace
Use the Vitis analyzer to visualize the `run_summary` report.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin.run_summary
```
The Profile Summary and Timeline Trace reports are useful tools to analyze the performance of the FPGA-accelerated application.
### Review Profile Summary Report
* *Kernels & Compute Unit: Kernel Execution* indicates that the Total Time by kernel enqueue is about 292 ms.
![](./images/kernel_4_profile_1.PNG)
- Four words are computed in parallel, and the accelerator runs at 300 MHz. In total, 350,000,000 words are computed (3,500 words/document * 100,000 documents).
- Number of words / (clock frequency * parallelization factor) = 350M / (300M * 4) = 291.6 ms. The measured FPGA compute time is almost the same as this theoretical value.
* *Host Data Transfer: Host Transfer* shows that the Host Write Transfer to DDR takes 145 ms and the Host Read Transfer from DDR takes 36 ms.
![](./images/kernel_4_profile_2.PNG)
- The Host Write transfer at a theoretical PCIe bandwidth of 9 GB/s should take 1399 MB / 9 GB/s = approximately 155 ms.
- The Host Read transfer at a theoretical PCIe bandwidth of 12 GB/s should take 350 MB / 12 GB/s = approximately 29 ms.
- The reported numbers indicate that the PCIe transfers are occurring at close to the maximum bandwidth.
* *Kernels & Compute Unit: Compute Unit Stalls* confirms that there are almost no "External Memory Stalls".
![](./images/Kernel_4_Guidance_Stall.PNG)
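These estimates can be reproduced with two small helpers; the 9 GB/s and 12 GB/s figures are the assumed theoretical PCIe bandwidths from the report discussion above, not measured values.

```cpp
#include <cassert>
#include <cmath>

// Theoretical kernel compute time: words / (clock frequency * words per cycle),
// returned in milliseconds.
double theoretical_compute_ms(double words, double freq_hz, int pf) {
    return words / (freq_hz * pf) * 1e3;
}

// Theoretical transfer time for a payload in MB over a bandwidth in GB/s,
// returned in milliseconds (1 GB = 1000 MB convention).
double transfer_ms(double mbytes, double gbytes_per_s) {
    return mbytes / (gbytes_per_s * 1000.0) * 1e3;
}
```

For example, `theoretical_compute_ms(350e6, 300e6, 4)` gives about 291.7 ms, matching the 292 ms kernel execution time in the Profile Summary.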
### Review the Timeline Trace
The Timeline Trace shows each data transfer from the host to the FPGA and back to the host as it occurs. It makes visible that the host-to-FPGA transfer, the FPGA compute, and the FPGA-to-host transfer happen sequentially.
![](./images/Kernel_4_time_1.PNG)
- Operations execute sequentially: data is transferred from the host to the FPGA, the FPGA computes, and the results are transferred back from the FPGA to the host.
- At any given time, either the host or FPGA has access to the DDR. In other words, there is no memory contention between the host and kernel accessing the same DDR.
- With a single buffer, the kernel itself runs with the lowest latency and the most optimized performance.
### Throughput Achieved
Based on the results, the throughput of the application is 1399 MB / 838 ms = approximately 1.66 GB/s. This is your first attempt at running the application on hardware, and you already have about four times the performance of the software-only version.
### Opportunities for Performance Improvements
Because there is no contention for external memory access, this is the best possible performance for a kernel computing four words in parallel.
* The kernel reads 512 bits of data, but only 128 bits are used to compute 4 words in parallel. The kernel has the potential to compute more words, because resources are still available on the FPGA. You can increase the number of words processed in parallel and experiment with 8 and 16 words.
* Also note that even though the kernel operates at its best performance, it has to wait until the host completes the full transfer. It is usually recommended to send a larger buffer from the host to the DDR, but a very large buffer adds a delay before the kernel can start, impacting overall performance.
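As a rough plain-C++ model (not the `ap_uint<512>` HLS type) of the line utilization described above, each 512-bit DDR access delivers 16 32-bit words, of which a given parallelization factor consumes only the first `pf`:

```cpp
#include <cstdint>

const int WORDS_PER_LINE = 512 / 32;  // 16 x 32-bit words per DDR access

// Fraction of each 512-bit line actually consumed: 0.25 at PF=4, 1.0 at PF=16.
double line_utilization(int pf) {
    return static_cast<double>(pf) / WORDS_PER_LINE;
}

// Illustrative extraction: only pf of the 16 available words are used per access.
void extract_words(const uint32_t line[WORDS_PER_LINE], int pf, uint32_t* out) {
    for (int j = 0; j < pf; ++j)
        out[j] = line[j];
}
```

This is why moving from `PF=4` to `PF=16` costs no extra memory bandwidth: the data is already arriving; the kernel just starts consuming all of it.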
You will continue this lab and create a kernel that computes 8 words in parallel. In the next lab, [Analyze Data Movement Between Host and Kernel](./5_data-movement.md), you will experiment with splitting the buffer into multiple chunks and observe how this affects the overall performance.
## Bloom8x: Kernel Implementation Using 8 Words in Parallel
For the Bloom4x configuration, you read 512-bit input values from the DDR and computed 4 words in parallel, using only 128 bits of each input value. This step enables you to compute 8 words in parallel.
You can achieve this by using `PF=8` on the command line. Use the following steps to compute 8 words in parallel.
### Run Hardware on the FPGA
Use the following command to run the hardware on the FPGA.
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=8
```
The following output displays.
```
Single_Buffer: Running with a single buffer of 1398.903 MBytes for FPGA processing
--------------------------------------------------------------------
Executed FPGA accelerated version | 739.4475 ms ( FPGA 315.475 ms )
Executed Software-Only version | 3053.9516 ms
--------------------------------------------------------------------
Verification: PASS
```
- The total FPGA time is 315 ms. This includes the host-to-DDR transfer, the compute on the FPGA, and the DDR-to-host transfer.
- As expected, when computing 8 words in parallel, the total FPGA time, which includes the data transfer between the host and the device, is reduced from 447 ms to 315 ms.
### Visualize the Resources Utilized
Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to generate the `link_summary`, as it is not provided as part of the tutorial. You can also skip this step.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.link_summary
```
From the HLS Synthesis Report, you can see that the number of resources increased for LUTs, REG, and BRAM compared to the Bloom4x kernel implementation.
![](./images/kernel_8_util.PNG)
### Review Profile Summary Report and Timeline Trace
Use the Vitis analyzer to visualize the `run_summary` report.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.run_summary
```
Review the profile reports and compare the metrics with the Bloom4x configuration.
1. *Kernels & Compute Unit: Kernel Execution* reports 146 ms compared to 292 ms. This is half the time, as 8 words are now computed in parallel instead of 4.
2. *Host Data Transfer: Host Transfer* reports the same transfer times.
3. The overall application gain comes entirely from the kernel processing 8 words in parallel instead of 4.
![](./images/Kernel_8_time_1.PNG)
### Throughput Achieved
- Based on the results, the throughput of the application is 1399 MB / 739 ms = approximately 1.9 GB/s. You now have about 4x the performance of the software-only version.
## Bloom16x: Kernel Implementation Using 16 Words in Parallel
In the previous step, you read 512-bit input values from the DDR and computed 8 words in parallel. You can compute 16 words in parallel by setting `PF=16` on the command line. Use the following steps to compute 16 words in parallel.
### Run Hardware on the FPGA
```
cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=16
```
The following output displays.
```
Processing 1398.903 MBytes of data
Single_Buffer: Running with a single buffer of 1398.903 MBytes for FPGA processing
--------------------------------------------------------------------
Executed FPGA accelerated version | 694.6162 ms ( FPGA 270.275 ms )
Executed Software-Only version | 3052.5701 ms
--------------------------------------------------------------------
Verification: PASS
```
As in the previous steps, you can also review the Profile and Timeline Trace reports. This is left as an exercise.
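Putting the three runs side by side, throughput is simply data size over total application time; the helper below is illustrative, with the figures copied from the run outputs above.

```cpp
#include <cassert>
#include <cmath>

// Throughput in GB/s for a payload in MB and a total application time in ms
// (MB/ms equals GB/s with the 1 GB = 1000 MB convention).
double throughput_gbps(double mbytes, double total_ms) {
    return mbytes / total_ms;
}
```

Applying it to the measured times (838.6 ms, 739.4 ms, 694.6 ms for PF=4/8/16) shows diminishing returns: doubling the kernel parallelism no longer doubles application throughput, because the fixed host transfer time dominates.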
### Opportunities for Performance Improvements
In this lab, you focused on building an optimal kernel computing 4, 8, and 16 words in parallel using a single buffer, and you reviewed the reports. As you observed in the Timeline Trace, the kernel cannot start until the whole buffer is transferred to the DDR, which takes about 145 ms when a single buffer is used. This is a substantial delay, considering that the kernel execution time is smaller than the total host transfer time.
## Next Steps
In the next lab, you [explore sending documents in multiple buffers](./5_data-movement.md) rather than sending all of the documents at once from the host, and see how this improves the overall application performance. You will also use the `Bloom8x` and `Bloom16x` kernel configurations in that exploration.
Copyright© 2020 Xilinx