Data Compression Library Tutorial

Data Compression and Hardware Acceleration

Data compression is the reduction of the number of bits needed to represent data. Compressing data saves storage capacity, speeds up data transfer, and decreases the cost of storing data.

Why acceleration is required and how it helps

A general-purpose CPU has both computational capabilities and limitations. Additional hardware acceleration is sometimes used to perform certain functions faster and more efficiently. Hardware accelerators improve the performance of a specific algorithm by allowing greater concurrency (i.e., parallel execution) tailored to the application.

GZIP is one such algorithm, widely used in applications such as file storage, distributed systems, and genetics. Traditionally, CPU-based solutions are limited to speeds of MB/s, but there is high demand for an accelerated GZIP that provides throughput on the order of GB/s. Hence, this algorithm is a strong candidate for acceleration.

GZIP is a combination of LZ77 and Huffman coding, and it is a block-based processing algorithm. The main advantage of the block structure is that each block can be processed independently, which enables greater concurrency and helps achieve higher performance, as the sketch below illustrates.
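To make the concurrency argument concrete, here is a minimal host-side C++ sketch. The compressBlock() helper and the thread-based parallelism are illustrative assumptions, not the library's API; on an FPGA the same block independence lets multiple hardware engines work in parallel.

#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for a per-block compressor (LZ77 + Huffman in GZIP);
// it simply copies the data so the sketch stays self-contained.
static std::vector<uint8_t> compressBlock(std::vector<uint8_t> block) {
    return block;
}

// Because no block depends on its neighbours, every block can be compressed
// concurrently: software threads here, independent engines on an FPGA.
static std::vector<std::vector<uint8_t>> compressAllBlocks(
        const std::vector<uint8_t>& input, std::size_t blockSize = 32 * 1024) {
    std::vector<std::future<std::vector<uint8_t>>> jobs;
    for (std::size_t off = 0; off < input.size(); off += blockSize) {
        std::size_t len = std::min(blockSize, input.size() - off);
        std::vector<uint8_t> block(input.begin() + off, input.begin() + off + len);
        jobs.push_back(std::async(std::launch::async,
                                  [b = std::move(block)] { return compressBlock(b); }));
    }
    std::vector<std::vector<uint8_t>> out; // one compressed buffer per block
    for (auto& j : jobs) out.push_back(j.get());
    return out;
}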

A Xilinx Alveo card can help improve performance in the following ways:

  1. Instruction parallelism by creating a customized and long pipeline.
  2. Data parallelism by processing multiple blocks at the same time.
  3. Customizable memory hierarchy of BRAM/URAM/HBM, providing high bandwidth of memory access.

The Xilinx FPGA-based solution accelerates both compression and decompression with multicore and multibyte architectures, speeding up overall processing time and resulting in improved system throughput and efficient resource utilization.

How Xilinx Data Compression Library Works

The Xilinx data compression library is an open-source, performance-optimized Vitis library written in C++ for accelerating data compression applications on Xilinx accelerator cards in a variety of use cases. The library covers two levels of acceleration, the module level and the pre-defined kernel level, and is evolving to offer a third level of pure software APIs that work with pre-defined hardware overlays.

  • L1: Module level. It provides optimized hardware implementations of the core LZ-based and algorithm-specific data compression modules.
  • L2: Kernel level. This section calls the compression/decompression kernels, which internally use the optimized hardware modules, to showcase various kernel demos.
  • L3: Software API level. It wraps the details of offloading acceleration with a prebuilt binary (overlay) and allows users to accelerate data compression tasks on Alveo cards without hardware development.

The library is designed as a specialized compression engine, multiple instances of which can run concurrently on the same Xilinx accelerator card to meet the high-throughput requirements of your workload. This reduces bandwidth consumption and overall infrastructure costs, whether on premises or in the cloud.

The GZIP compression kernel takes raw data as input, compresses it in a block-based fashion, and writes the output to global memory. LZ77 is a byte-based compression scheme. The resulting output from this kernel is represented as 32-bit packets of the form <Literal, Match Length, Distance>; a possible packing is sketched below. The kernel also outputs literal and distance frequencies for dynamic Huffman tree generation. This output is consumed by the TreeGen and Huffman kernels.
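As an illustration of the packet format, here is a hedged C++ sketch. The 8/8/16-bit split below is an assumption for readability, not the kernel's documented layout; note that DEFLATE match lengths can reach 258, so a real packing may divide the bits differently.

#include <cstdint>

// Assumed illustrative layout of one 32-bit <Literal, Match Length, Distance>
// packet; the actual field widths in the kernel may differ.
struct LZ77Packet {
    uint8_t literal;      // literal byte, meaningful when matchLength == 0
    uint8_t matchLength;  // back-reference length; 0 marks a plain literal
    uint16_t distance;    // how far back in the window the match starts
};

static uint32_t pack(const LZ77Packet& p) {
    return (uint32_t(p.literal) << 24) | (uint32_t(p.matchLength) << 16) |
           uint32_t(p.distance);
}

static LZ77Packet unpack(uint32_t w) {
    return {uint8_t(w >> 24), uint8_t((w >> 16) & 0xFF), uint16_t(w & 0xFFFF)};
}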

L3 API

L3 APIs are scalable solutions that achieve maximum performance with an optimized host and kernel in an end-to-end solution. Here we target a larger number of compute units to reach maximum throughput and to find the bandwidth saturation point of the design.

This demo showcases Xilinx Alveo U250 acceleration of Gzip_app and Xilinx Alveo U50 (HBM platform) acceleration of Gzip_hbm, for both compression and decompression. It also supports Zlib via a host argument switch.

Tested Tool: 2021.2
Tested XRT: 2021.2
Tested XSA: xilinx_u50_gen3x16_xdma_201920_3

| Flow | Compute Units | Compression Ratio | FMax | LUT | BRAM | URAM | Memory | Throughput |
|------|---------------|-------------------|------|-----|------|------|--------|------------|
| Gzip_app | Compression: 2, Decompression: 8 | 2.70 | 300 MHz | 202K | 362 | 144 | DDR | Compression: 632 MB/s, Decompression: 408.8 MB/s |
| Gzip_hbm | Compression: 6, Decompression: 8 | 2.70 | 450 MHz | 277K | 503 | 208 | HBM | Compression: 961 MB/s, Decompression: 356 MB/s |

This application is present under the L3/demos directory. Follow the build instructions to generate the executable and binary.

The generated host executable is named “xil_gzip” and is placed in the ./build directory.

Executable Usage

  1. To compress a single file:

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -c <input file_name>

  2. To decompress a single compressed file:

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -d <compressed file_name>

  3. To validate a single file (compress, then decompress):

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -t <input file_name>

  4. To compress multiple files:

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -cfl <files.list>

  5. To decompress multiple compressed files:

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -dfl <compressed files.list>

  6. To validate multiple files (compress, then decompress):

    ./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -l <files.list>

    • <files.list>: a file containing the input file names, each with its path
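For example, a files.list is a plain-text file with one path per line (the file names below are only illustrative):

./sample1.txt
./data/sample2.bin
./data/sample3.xml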

The default design flow is GZIP. To run ZLIB, enable the -zlib switch on the command line, as shown below:

./build/xil_gzip -xbin ./build/xclbin_<xsa_name>_<TARGET mode>/compress_decompress.xclbin -c <input file_name> -zlib 1

L2 API

L2 APIs are for users who have some understanding of HLS and FPGA programming and want to modify the kernels.

These APIs are Vitis-flow-based designs in which communication and data transfer happen between the kernel and the host: the kernel works on the data and the output is sent back to the host. They demonstrate the optimized kernels at their best performance.

GZIP supports a 32KB block size by default, but this library also supports smaller block sizes, namely 8KB and 16KB. In addition to multiple block sizes, the data compression library provides both dynamic and static Huffman modules, which are optimized to give good performance.
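The practical difference is that dynamic Huffman spends a pass measuring each block before coding it, while static Huffman uses the fixed code table from the DEFLATE specification. A minimal C++ sketch of the per-block tally that feeds dynamic tree generation:

#include <array>
#include <cstdint>
#include <vector>

// Dynamic Huffman first counts how often each symbol occurs in the block;
// the resulting histogram drives construction of the per-block code tree.
// Static Huffman skips this step and uses DEFLATE's predefined codes.
static std::array<uint32_t, 256> literalHistogram(const std::vector<uint8_t>& block) {
    std::array<uint32_t, 256> freq{};
    for (uint8_t b : block) ++freq[b];
    return freq;
}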

| Architecture | Compression Ratio | Throughput | FMax | LUT | BRAM | URAM |
|--------------|-------------------|------------|------|-----|------|------|
| GZipc 32KB Compress Stream | 2.70 | 2.0 GB/s | 300 MHz | 54K | 141 | 64 |
| GZip 8KB Compress Stream | 2.70 | 2.0 GB/s | 300 MHz | 57.5K | 100 | 48 |
| GZip 16KB Compress Stream | 2.70 | 2.0 GB/s | 282 MHz | 58K | 164 | 48 |
| Gzipc_block_mm 32KB | 2.70 | 2.0 GB/s | 300 MHz | 57K | 135 | 64 |
| Gzipc_static 32KB | 2.70 | 2.0 GB/s | 300 MHz | 35K | 45 | 64 |

The library designs support both free-running and memory-mapped kernels; a minimal sketch of the two interface styles follows.
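For orientation, here is a hedged Vitis HLS C++ sketch of the two kernel styles. Port names and data widths are illustrative, and real kernels carry additional interface detail.

#include <ap_int.h>
#include <hls_stream.h>

// Free-running style: no start/stop handshake (ap_ctrl_none) and purely
// streaming I/O, so the kernel can process data continuously.
extern "C" void freeRunningPass(hls::stream<ap_uint<8> >& in,
                                hls::stream<ap_uint<8> >& out) {
#pragma HLS interface ap_ctrl_none port = return
#pragma HLS interface axis port = in
#pragma HLS interface axis port = out
    out.write(in.read());
}

// Memory-mapped style: data is read from and written back to global
// memory over AXI master interfaces.
extern "C" void memoryMappedPass(const ap_uint<8>* in, ap_uint<8>* out, int n) {
#pragma HLS interface m_axi port = in bundle = gmem0
#pragma HLS interface m_axi port = out bundle = gmem1
    for (int i = 0; i < n; ++i) out[i] = in[i];
}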

GZip/Zlib Memory Mapped and GZip/Zlib Compress Stream: support dynamic Huffman.

GZip/Zlib Streaming: full standard support (dynamic Huffman, fixed Huffman, and stored blocks).
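Those three block types are not library-specific: they come from the 3-bit block header that RFC 1951 (DEFLATE) places in front of every block. A small C++ sketch of how the header bits select the type:

#include <cstdint>
#include <cstdio>

// bits holds the low 3 header bits: bit 0 is BFINAL (last block in the
// stream), bits 1-2 are BTYPE (00 stored, 01 fixed, 10 dynamic).
static void describeBlockHeader(uint8_t bits) {
    bool finalBlock = bits & 0x1;
    switch ((bits >> 1) & 0x3) {
        case 0: std::printf("stored block%s\n", finalBlock ? " (final)" : ""); break;
        case 1: std::printf("fixed Huffman block%s\n", finalBlock ? " (final)" : ""); break;
        case 2: std::printf("dynamic Huffman block%s\n", finalBlock ? " (final)" : ""); break;
        default: std::printf("reserved BTYPE 11 (invalid)\n"); break;
    }
}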

Commands to Run L2 and L3 cases

cd L2/tests/
# build and run the following using a U250 platform
make run TARGET=sw_emu DEVICE=/path/to/xilinx_u250_gen3x16_xdma_3_1_202020_1/

# delete generated files
make cleanall

Here, TARGET decides the FPGA binary type

  • sw_emu is for software emulation
  • hw_emu is for hardware emulation
  • hw is for deployment on physical card. (Compilation to hardware binary often takes hours.)

Besides run, the Vitis case Makefile also allows host and xclbin as build targets.

L1 API

L1 APIs are for users who are familiar with HLS programming and want to test, profile, or modify the HLS modules. With the HLS test projects provided in the L1 layer, users can get:

  1. Function correctness tests, in both C simulation and co-simulation
  2. Performance profiling from the HLS synthesis report and co-simulation
  3. Resource and timing figures from Vivado synthesis

Command to Run L1 cases

cd L1/tests/

make run CSIM=1 CSYNTH=0 COSIM=0 VIVADO_SYN=0 VIVADO_IMPL=0 \
    DEVICE=/path/to/xilinx_u250_gen3x16_xdma_3_1_202020_1/

Test control variables are:

  • CSIM for high level simulation.
  • CSYNTH for high level synthesis to RTL.
  • COSIM for co-simulation between software test bench and generated RTL.
  • VIVADO_SYN for synthesis by Vivado.
  • VIVADO_IMPL for implementation by Vivado.

For all of these variables, setting the value to 1 enables the corresponding step, while 0 skips it. The default value of each control variable is 0, so a variable can be omitted from the command line if its step is not wanted.
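For example, to check correctness in both C simulation and co-simulation in a single run:

make run CSIM=1 COSIM=1 DEVICE=/path/to/xilinx_u250_gen3x16_xdma_3_1_202020_1/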