Release Note


In the 2021.1 release, GQE receives early-access support for the following features:

  • 64-bit join support: the gqeJoin kernel and its companion gqePart kernel have been extended to 64-bit keys and payloads, so that larger data sets can be supported.
  • Initial Bloom-filter support: the gqeJoin kernel now ships with a mode in which it executes Bloom-filter probing. This improves efficiency in certain multi-node flows where minimizing data size at an early stage is important.
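To illustrate why Bloom-filter probing can shrink data at an early stage, the following is a minimal software sketch. It is not the gqeJoin implementation: the class name, the hash function, and the filter size are all assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over integer keys (illustrative, not the GQE kernel logic)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: int):
        # Derive num_hashes bit positions from salted BLAKE2b digests of the key.
        raw = key.to_bytes(8, "little", signed=True)
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(raw, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def bloom_prefilter(build_keys, probe_rows):
    """Keep only the (key, payload) probe rows whose key may appear in build_keys."""
    bf = BloomFilter()
    for k in build_keys:
        bf.add(k)
    return [row for row in probe_rows if bf.might_contain(row[0])]
```

A Bloom filter never produces false negatives, so every row that would match in the join survives the pre-filter, while most non-matching rows are dropped before they have to be moved between nodes.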

Both features are currently offered as L3 pure-software APIs; please check the corresponding L3 test cases.


The 2020.2 release brings a major update to the GQE kernel design, and brand new L3 APIs for JOIN and GROUP-BY AGGREGATE.

  • The GQE kernels now take each column as an input buffer, which can greatly simplify data preparation on the host-code side. Also, allocating multiple buffers on the host side should cause fewer out-of-memory issues than allocating one big contiguous buffer, especially when the server is under heavy load.
  • The L2 layer now provides command classes to generate the configuration bits for GQE kernels. Developers no longer have to dive into the bitmap table to understand which bit(s) to toggle to enable or disable a function in the GQE pipeline. Thus the host code can be less error-prone and more maintainable.
  • The all-new experimental L3 APIs are built on our experiments with, and insights into, scaling the problem size that GQE can handle. They can break the tables down into parts based on hash, and call the GQE kernels for multiple rounds in a well-scheduled fashion. The execution strategy is separated from the execution itself, so database gurus can fine-tune the execution based on table statistics without touching the OpenCL execution part.
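The idea of hash-based partitioning with multiple kernel rounds, and of keeping the strategy apart from the execution, can be sketched in plain software as follows. This is a conceptual illustration only, not the L3 API: all function names and the `schedule` parameter are hypothetical, and `join_one_part` merely stands in for one GQE kernel invocation.

```python
from collections import defaultdict


def hash_partition(rows, num_parts):
    """Split (key, payload) rows into num_parts buckets by key hash."""
    parts = [[] for _ in range(num_parts)]
    for key, payload in rows:
        parts[hash(key) % num_parts].append((key, payload))
    return parts


def join_one_part(build_rows, probe_rows):
    """Stand-in for one kernel call: in-memory hash join of one partition pair."""
    table = defaultdict(list)
    for key, payload in build_rows:
        table[key].append(payload)
    return [(key, b, p) for key, p in probe_rows for b in table.get(key, [])]


def partitioned_join(build_rows, probe_rows, num_parts=4, schedule=None):
    """Join two tables partition by partition.

    `schedule` is the strategy (here simply an ordering of partition
    indices), kept separate from the code that executes each round.
    """
    build_parts = hash_partition(build_rows, num_parts)
    probe_parts = hash_partition(probe_rows, num_parts)
    order = schedule if schedule is not None else range(num_parts)
    result = []
    for i in order:
        # Rows with equal keys always land in the same partition index,
        # so each round only needs one pair of partitions.
        result.extend(join_one_part(build_parts[i], probe_parts[i]))
    return result
```

Because both tables are partitioned with the same hash, each round handles a fraction of the data, which is what lets the total problem size exceed what a single kernel call can hold.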


The 2020.1 release contains:

  • Compound sort API (compoundSort): Three sort algorithm modules were provided previously; this new API combines insertSort and mergeSort to provide a more scalable solution for on-chip sorting. With 32-bit integer keys, the URAM resources on one SLR allow the design to scale to 2M entries.
  • Better HBM bandwidth usage in hash-join (hashJoinV3): ECC is enabled in the 2019.2 Alveo U280 shell, so a write to HBM that is smaller than the ECC size becomes a read-modify-write and wastes some bandwidth. The hashJoinV3 primitive in this release has been modified to use a 256-bit port to avoid this problem.
  • Various bug fixes: many small issues have been cleaned up, especially in the host code of L2/demos.
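The combination of insertion sort and merge sort behind the compound sort idea can be sketched in software as below. This is only an illustration of the general technique: the chunk size is an arbitrary assumption, and the k-way merge via `heapq.merge` is a software substitute that does not reflect how the compoundSort hardware module is structured.

```python
import heapq


def insertion_sort(chunk):
    """Sort a small chunk in place and return it."""
    for i in range(1, len(chunk)):
        item = chunk[i]
        j = i - 1
        # Shift larger elements one slot to the right.
        while j >= 0 and chunk[j] > item:
            chunk[j + 1] = chunk[j]
            j -= 1
        chunk[j + 1] = item
    return chunk


def compound_sort(keys, chunk_size=8):
    """Insertion-sort fixed-size chunks, then merge the sorted runs."""
    runs = [insertion_sort(list(keys[i:i + chunk_size]))
            for i in range(0, len(keys), chunk_size)]
    return list(heapq.merge(*runs))
```

Insertion sort is cheap for short runs, while merging scales to many runs, which is why combining the two can handle larger inputs than either stage alone.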


The 2019.2 release introduces GQE (generic query engine) kernels, which are post-bitstream programmable and allow different SQL queries to be accelerated with one xclbin. It is conceptually a big step from per-query design, and a sound example of Xilinx’s acceleration approach.

Each GQE kernel is essentially a programmable pipeline of execution step primitives, which can be enabled or bypassed via run-time configuration.
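As a rough software analogy, a pipeline whose steps are enabled or bypassed by configuration bits might look like the sketch below. The step names, bit assignments, and row layout are all hypothetical; the real GQE configuration bitmap is not reproduced here.

```python
from enum import IntFlag


class Step(IntFlag):
    """Hypothetical enable bits; not the actual GQE bitmap layout."""
    FILTER = 1
    JOIN = 2
    AGGREGATE = 4


def run_pipeline(rows, config, lookup=None):
    """Apply enabled steps in a fixed order; a disabled step passes rows through."""
    lookup = lookup or {}
    if config & Step.FILTER:
        # Keep rows whose value column is positive.
        rows = [r for r in rows if r[1] > 0]
    if config & Step.JOIN:
        # Attach the lookup payload to rows whose key is present.
        rows = [(k, v, lookup[k]) for k, v in rows if k in lookup]
    if config & Step.AGGREGATE:
        # Reduce the value column to a single sum.
        rows = [("sum", sum(r[1] for r in rows))]
    return rows
```

The same function serves different queries purely by changing `config`, for example `Step.FILTER | Step.AGGREGATE`, much as one xclbin serves different SQL queries through run-time configuration.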

Internal Release

The first release provides a range of HLS primitives for mapping the execution plan steps of a relational database. They cover most of the operations that occur in the plans generated from the 22 TPC-H queries.

These modules work in streaming fashion and can work in parallel when possible.