GQE Kernel Design
Join Kernel
The GQE join kernel is composed of multiple post-bitstream programmable primitives, and can execute not only hash-join but also a number of primitives often found as the prologue or epilogue of join operations. With bypass paths built into its data path, it can even execute without performing a join.
The internal structure of this kernel is illustrated in the figure above. The internal multi-join supports three reconfigurable modes: inner join, anti-join and semi-join. The kernel works with three input buffers, two for data and one for configuration, and emits its result to one output buffer with the same data structure as its data input buffers.
This uniform buffer structure allows the kernel to be scheduled seamlessly, using one invocation's output as the next one's input. A buffer contains an input table or an intermediate result table, and each table holds a list of data columns, placed one after another and aligned to 512-bit boundaries.
Each column’s data structure is essentially a 512-bit header, followed by raw data.
Note
In the current release, all columns are expected to have the same number of elements of the same type, so only the first column's header is used by the kernel. This will likely change in a future release.
The buffer and column data structures are shown in the figure below.
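As a concrete illustration of this layout, the sketch below shows how a host program might locate a column inside such a buffer. It is a minimal model only: the 32-bit element type and the stride arithmetic are assumptions for illustration, and the actual header field encoding is defined by the kernel, not here.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative walk of the buffer layout described above: each
// column is a 512-bit (64-byte) header followed by raw data, and
// columns sit back to back on 512-bit boundaries. The 32-bit
// element type and the stride computation are assumptions.
constexpr std::size_t kAlign = 64; // 512 bits

struct ColumnView {
    const std::uint8_t* header; // 64-byte column header
    const std::uint8_t* data;   // raw element data
};

// Locate column `col` in a table buffer holding `nrow` 32-bit
// elements per column (all columns share the same row count).
ColumnView locate_column(const std::uint8_t* table, std::size_t nrow, int col) {
    std::size_t data_bytes = nrow * sizeof(std::uint32_t);
    // per-column footprint, rounded up to the next 512-bit boundary
    std::size_t stride = kAlign + (data_bytes + kAlign - 1) / kAlign * kAlign;
    const std::uint8_t* base = table + static_cast<std::size_t>(col) * stride;
    return ColumnView{base, base + kAlign};
}
```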
The configuration buffer programs the kernel at runtime: it toggles execution-step primitives on or off, and defines the filter and/or evaluation expressions. Each row of the table below corresponds to one 512-bit word of the buffer; the details are documented as follows:
| 511-192 | 191-184 | 183-120 | 119-56 | 6 | 5-3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Shuffle | Tab C col sel | Tab B col-id | Tab A col-id | append | join sel | dual key | aggr on | join on |
| (padding at MSB) eval-0 config | | | | | | | | |
| (padding at MSB) eval-1 config | | | | | | | | |
| filter Tab A config | | | | | | | | |
| filter Tab A config (cont'd) | | | | | | | | |
| (padding at MSB) filter Tab A config (cont'd) | | | | | | | | |
| filter Tab B config | | | | | | | | |
| filter Tab B config (cont'd) | | | | | | | | |
| (padding at MSB) filter Tab B config (cont'd) | | | | | | | | |
Both input tables A and B support up to 8 columns. The selection and order of columns in the pipeline is specified via column indices: within each buffer, the columns are indexed starting from 0, and `-1` is a special value that instructs the table scanner to feed zeros for that column. For example, suppose Table A's column indices are `[3, 2, 1, -1, -1]`. Then Column 3 will be the first in the data path, Column 2 the second, and Column 1 the third; Column 0 and any other columns are ignored, and the last two slots in the data path will be filled with zeros.
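As an illustration, the `Tab A col-id` field (bits 119-56 of the first configuration word) spans 64 bits for up to 8 columns, which suggests one byte per column. The helper below packs an index list under that assumption, with `0xFF` standing for `-1`; the real field encoding should be taken from the kernel specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packing of up to 8 column indices into the 64-bit
// col-id field, one signed byte per column; -1 (0xFF) tells the
// scanner to feed zeros for that slot. The one-byte-per-column
// encoding is an assumption for illustration.
uint64_t pack_col_ids(const std::vector<int8_t>& ids) {
    uint64_t field = 0;
    for (std::size_t i = 0; i < ids.size() && i < 8; ++i) {
        field |= static_cast<uint64_t>(static_cast<uint8_t>(ids[i])) << (8 * i);
    }
    return field;
}

// Example from the text above:
// uint64_t tab_a = pack_col_ids({3, 2, 1, -1, -1});
```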
The filter configuration is aligned to the lower bits; each filter's configuration fully covers the first two of its 512-bit slots and partially uses the third.
Here the `join_on` option toggles whether hash-join is enabled or bypassed in the pipeline. The `dual_key` option instructs the kernel to use both the first and the second column as the join key in hash-join; when it is asserted, the third column becomes the first part of the payload input. The `join_sel` option selects the work mode of multi-join: 0 for normal hash join, 1 for semi-join and 2 for anti-join. The `append` option toggles whether append mode is enabled when writing out consecutively joined tables; it is typically used when joining two sub-tables after hash partitioning.
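Putting these flags together, a host could assemble the low control bits of the first configuration word as below. This is a hypothetical helper based only on the bit positions listed in the table above.

```cpp
#include <cstdint>

// Assemble the low control bits of configuration word 0, following
// the bit positions in the table above: bit 0 join on, bit 1 aggr
// on, bit 2 dual key, bits 5-3 join sel, bit 6 append.
uint64_t make_ctrl_bits(bool join_on, bool aggr_on, bool dual_key,
                        unsigned join_sel, bool append) {
    uint64_t w = 0;
    w |= static_cast<uint64_t>(join_on) << 0;
    w |= static_cast<uint64_t>(aggr_on) << 1;
    w |= static_cast<uint64_t>(dual_key) << 2;
    w |= static_cast<uint64_t>(join_sel & 0x7u) << 3;
    w |= static_cast<uint64_t>(append) << 6;
    return w;
}

// e.g. a semi-join on a single key, no aggregation, no append:
// uint64_t ctrl = make_ctrl_bits(true, false, false, 1, false);
```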
The eval config is for the dynamicEval primitive and is aligned to the lower bits of the 512-bit word allocated for it.
The aggregation always computes min, max, sum and count for each of its input columns. When `aggr_on` is set, the aggregation values are written out instead of the original rows.
Caution
To support large sum and count values, the output data width of these two fields is doubled, so the aggregation result for each column is `min, max, sum LSB, sum MSB, count LSB, count MSB`.
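Assuming 32-bit input columns, a host could reassemble the widened fields from the six output slots as in the sketch below; the slot order follows the caution above, while the 32-bit element width is an assumption.

```cpp
#include <cstdint>

// Reassemble the doubled-width sum and count from the six
// aggregation output slots of one column (min, max, sum LSB,
// sum MSB, count LSB, count MSB). 32-bit elements are assumed.
struct ColumnAggr {
    uint32_t min_v;
    uint32_t max_v;
    uint64_t sum_v;
    uint64_t count_v;
};

ColumnAggr read_aggr(const uint32_t slot[6]) {
    ColumnAggr a;
    a.min_v = slot[0];
    a.max_v = slot[1];
    a.sum_v = (static_cast<uint64_t>(slot[3]) << 32) | slot[2];   // MSB:LSB
    a.count_v = (static_cast<uint64_t>(slot[5]) << 32) | slot[4]; // MSB:LSB
    return a;
}
```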
The write option is essentially a bit mask for the 8 slots in the data path: a column is written to the output buffer only when its corresponding bit is asserted.
Caution
Due to a limitation in the current `write_out` implementation, the output buffer must always provide 8 column slots, even when not all are used.
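Building such a mask is straightforward; the small helper below follows the bit-per-slot interpretation described above.

```cpp
#include <cstdint>
#include <initializer_list>

// Build the 8-bit write mask: one bit per data-path slot, asserted
// for every column that should reach the output buffer.
uint8_t make_write_mask(std::initializer_list<int> slots) {
    uint8_t mask = 0;
    for (int s : slots) {
        if (s >= 0 && s < 8) mask |= static_cast<uint8_t>(1u << s);
    }
    return mask;
}

// e.g. write only the first three data-path slots:
// uint8_t mask = make_write_mask({0, 1, 2});
```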
The hardware resource utilization of the join kernel is shown in the table below (working at 182 MHz).
| Primitive | Quantity | LUT | LUT as memory | LUT as logic | Register | BRAM36 | URAM | DSP |
|---|---|---|---|---|---|---|---|---|
| Scan | 1 | 12814 | 4758 | 8056 | 18968 | 0 | 0 | 2 |
| Filter | 4 | 2155 | 13 | 2142 | 1776 | 0.5 | 0 | 0 |
| Hash join | 1 | 118625 | 30396 | 88229 | 171852 | 254.5 | 192 | 80 |
| Eval | 2 | 2362 | 315 | 2047 | 2325 | 0 | 0 | 21 |
| Direct aggr | 1 | 1958 | 0 | 1958 | 3307 | 0 | 0 | 0 |
| Write | 1 | 20604 | 5693 | 14911 | 30275 | 0 | 0 | 0 |
| AXI DDR | 1 | 6803 | 1370 | 5433 | 15045 | 60 | 0 | 0 |
| AXI HBM | 1 | 25734 | 3321 | 22413 | 31290 | 32 | 0 | 0 |
| Total | | 236692 | 63384 | 173608 | 330108 | 348.5 | 192 | 124 |
Aggregate Kernel
The GQE aggregate kernel is another key kernel of the General Query Engine (GQE); it supports both grouping and non-grouping aggregate operations.
The internal structure of this kernel is shown in the figure above. It consists of one scan, one write, two evaluation modules, one filter, a hash group aggregate, and one direct aggregate primitive. The raw input table is scanned in and written out column by column. Before entering the hash group aggregate module, each row is evaluated and filtered, so new elements can be generated and some rows discarded. The two evaluation modules are cascaded to support more complex expressions.
Hash group aggregate is the key module in this kernel. It is a multi-PU implementation, shown in the following diagram. Each PU requires 2 HBM banks and a number of URAM blocks to buffer distinct keys as well as payloads after aggregation, and an internal loop is implemented so that all input rows are consumed across iterations. All PUs work in parallel to achieve higher performance.
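As a purely behavioral model of what one PU computes, the sketch below aggregates rows into a hash map keyed by the group key. The choice of min/max/sum/count as the aggregate functions mirrors the direct aggregation described for the join kernel and is illustrative only; the hardware buffers keys and payloads in URAM and HBM rather than in a hash map.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Behavioral model of hash group aggregation: for each distinct
// key, keep min, max, sum and count of a 32-bit payload.
struct GroupAggr {
    uint32_t min_v = UINT32_MAX;
    uint32_t max_v = 0;
    uint64_t sum_v = 0;
    uint64_t count_v = 0;
};

std::unordered_map<uint64_t, GroupAggr>
group_aggregate(const std::vector<std::pair<uint64_t, uint32_t>>& rows) {
    std::unordered_map<uint64_t, GroupAggr> groups;
    for (const auto& row : rows) {
        GroupAggr& g = groups[row.first];
        g.min_v = std::min(g.min_v, row.second);
        g.max_v = std::max(g.max_v, row.second);
        g.sum_v += row.second;
        g.count_v += 1;
    }
    return groups;
}
```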
The data structure of the input and output tables is the same as that of the join kernel. The whole configuration is composed of 128 32-bit slots, and the details of the configuration buffer are listed in the following table:
| Module | Module Config Width | Position |
|---|---|---|
| Scan | 64 bit | config[0]~config[1] |
| Eval0 | 289 bit | config[2]~config[11] |
| Eval1 | 289 bit | config[12]~config[21] |
| Filter | 45*32 bit | config[22]~config[66] |
| Shuffle0 | 64 bit | config[67]~config[68] |
| Shuffle1 | 64 bit | config[69]~config[70] |
| Shuffle2 | 64 bit | config[71]~config[72] |
| Shuffle3 | 64 bit | config[73]~config[74] |
| Group Aggr | 4*32 bit | config[75]~config[78] |
| Column Merge | 64 bit | config[79]~config[80] |
| Aggregate | 1 bit | config[81] |
| Write | 16 bit | config[82] |
| Reserved | | config[83]~config[127] |
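For reference, the slot layout above can be mirrored in host code with a set of offset constants, as in this hypothetical convenience header (not part of any published API):

```cpp
#include <array>
#include <cstdint>

// Slot offsets mirroring the aggregate kernel configuration layout
// above: 128 32-bit slots in total.
namespace aggr_cfg {
constexpr int kScan      = 0;   // config[0]..config[1]
constexpr int kEval0     = 2;   // config[2]..config[11]
constexpr int kEval1     = 12;  // config[12]..config[21]
constexpr int kFilter    = 22;  // config[22]..config[66]
constexpr int kShuffle0  = 67;  // config[67]..config[68]
constexpr int kShuffle1  = 69;  // config[69]..config[70]
constexpr int kShuffle2  = 71;  // config[71]..config[72]
constexpr int kShuffle3  = 73;  // config[73]..config[74]
constexpr int kGroupAggr = 75;  // config[75]..config[78]
constexpr int kColMerge  = 79;  // config[79]..config[80]
constexpr int kAggregate = 81;  // 1-bit flag in config[81]
constexpr int kWrite     = 82;  // 16-bit mask in config[82]
constexpr int kSlots     = 128; // config[83]..config[127] reserved
} // namespace aggr_cfg

using AggrConfig = std::array<uint32_t, aggr_cfg::kSlots>;
```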
The hardware resource utilization of the hash group aggregate kernel is shown in the table below (working at 193 MHz).
| Primitive | Quantity | LUT | LUT as memory | LUT as logic | Register | BRAM36 | URAM | DSP |
|---|---|---|---|---|---|---|---|---|
| Scan | 1 | 12209 | 4758 | 7451 | 18974 | 0 | 0 | 2 |
| Eval | 8 | 2153 | 426 | 1727 | 2042 | 4 | 0 | 21 |
| Filter | 4 | 2168 | 13 | 2155 | 1764 | 0.5 | 0 | 0 |
| Group Aggr | 1 | 162202 | 27819 | 134383 | 210926 | 62 | 256 | 0 |
| Direct Aggr | 1 | 4349 | 0 | 4349 | 6611 | 0 | 0 | 0 |
| Write | 1 | 30938 | 9490 | 21448 | 43579 | 0 | 0 | 0 |
| AXI DDR | 1 | 4586 | 1313 | 3273 | 78855 | 18 | 0 | 0 |
| AXI HBM | 1 | 20528 | 4456 | 16072 | 45416 | 124 | 0 | 0 |
| Total | | 298470 | 60402 | 238068 | 399737 | 255 | 256 | 2 |
Partition Kernel
The GQE partition kernel partitions a table's rows into clusters according to the hash of selected key columns. This kernel is designed to scale the problem size that can be handled by the GQE join or aggregate kernel. To reduce the size of intermediate data, it is equipped with a dynamic filter like the other kernels.
The internal structure of this kernel is illustrated in the figure above. It works with two input buffers and one output buffer. The kernel first scans the input table into multiple columns, then filters them (if a filter condition is given in the configuration buffer). After that, each row is dispatched into a bucket based on the hash value of its primary key. Finally, every full hash bucket triggers one burst write into the output buffer.
The details of the hash partition module are shown in the following figure. One URAM array buffers one burst length of rows for each hash partition, and the module behind the build PU splits each row back into multiple columns so that the output format is compatible with the other kernels.
To simplify the design, the GQE partition kernel reuses the scan and filter configuration of the GQE join kernel. As mentioned above, the data structure of the input and output tables is also the same as that of the join kernel.
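Conceptually, the dispatch step behaves like the host-side model below. The power-of-two partition count and the `std::hash` stand-in are illustrative assumptions and do not reproduce the kernel's actual hash function.

```cpp
#include <cstdint>
#include <functional>

// Host-side model of the dispatch step: each row goes to the bucket
// selected by the hash of its key. std::hash and the power-of-two
// partition count are placeholders for the kernel's real hash.
uint32_t bucket_of(uint64_t key, uint32_t num_parts /* power of two */) {
    uint64_t h = std::hash<uint64_t>{}(key);
    return static_cast<uint32_t>(h & (num_parts - 1));
}
```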
The hardware resource utilization of a single hash partition kernel is shown in the table below (working at 200 MHz).
| Primitive | Quantity | LUT | LUT as memory | LUT as logic | Register | BRAM36 | URAM | DSP |
|---|---|---|---|---|---|---|---|---|
| Scan | 1 | 12032 | 4752 | 7280 | 19383 | 0 | 0 | 2 |
| Filter | 4 | 3551 | 809 | 2742 | 3809 | 0.5 | 0 | 0 |
| Hash partition | 1 | 26363 | 1521 | 27496 | 45762 | 20 | 256 | 5 |
| Write | 1 | 26083 | 5082 | 21286 | 33202 | 1 | 0 | 3 |
| AXI DDR | 1 | 5046 | 1010 | 4036 | 11303 | 59.5 | 0 | 0 |
| Total | | 76091 | 14102 | 61989 | 116884 | 81 | 256 | 10 |
Attention
To use the GQE partition kernel, the host must pass the number of partitions through a kernel argument, create a corresponding number of sub-buffers on the partition kernel's output, and invoke the GQE join or aggregate kernel once per partition.
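A minimal OpenCL host-side sketch of that flow is shown below. The kernel argument index, buffer sizes and the per-partition stride are hypothetical placeholders that depend on the actual xclbin; on real devices the sub-buffer origin must also meet the device's base-address alignment.

```cpp
#include <CL/cl.h>
#include <vector>

// Sketch: split the partition kernel's output buffer into one
// sub-buffer per partition, then run the join kernel once per
// partition with that sub-buffer as its table input.
void run_partitioned_join(cl_command_queue q, cl_mem part_out,
                          cl_kernel join_krn, cl_uint num_parts,
                          size_t part_bytes) {
    std::vector<cl_mem> subs(num_parts);
    for (cl_uint i = 0; i < num_parts; ++i) {
        cl_buffer_region region{i * part_bytes, part_bytes};
        cl_int err = CL_SUCCESS;
        // flags = 0 inherits the parent buffer's access flags
        subs[i] = clCreateSubBuffer(part_out, 0,
                                    CL_BUFFER_CREATE_TYPE_REGION,
                                    &region, &err);
        // argument index 0 is a placeholder for the table input
        clSetKernelArg(join_krn, 0, sizeof(cl_mem), &subs[i]);
        clEnqueueTask(q, join_krn, 0, nullptr, nullptr);
    }
    clFinish(q);
    for (cl_mem m : subs) clReleaseMemObject(m);
}
```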