This page lists research publications produced in the context of the XACC program, as well as papers that may be of interest to the XACC community.


If you would like to contribute to this page by adding a reference to your publication, please follow the contribution guidelines.


Name | Author(s) | Institution | Link | Notes
Distributed Recommendation Inference on FPGA Clusters | Yu Zhu et al. | ETH Zurich | Paper | An efficient distributed recommendation inference system on an FPGA cluster that optimizes both the memory-bound embedding layer and the computation-bound fully-connected layers. The system achieves a maximum speedup of 28.95x while guaranteeing very low latency.
EasyNet: 100 Gbps Network for HLS | Zhenhao He et al. | ETH Zurich | Paper, Github | Integrates an open-source 100 Gbps TCP/IP stack into Vitis without degrading its performance. A set of MPI-like communication primitives abstracts away the low-level details of the networking stack.
FleetRec: Large-Scale Recommendation Inference on Hybrid GPU-FPGA Clusters | Wenqi Jiang et al. | ETH Zurich | Paper, Github | A high-performance and scalable recommendation inference system operating within tight latency constraints. By disaggregating computation and memory across GPUs and FPGAs and bridging them with a high-speed network, FleetRec gains the best of both worlds and scales out naturally by adding nodes to the cluster.
MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions | Wenqi Jiang et al. | ETH Zurich | Paper | A high-performance inference engine for recommendation systems. MicroRec accelerates recommendation inference by (1) redesigning the data structures to reduce the number of lookups and (2) taking advantage of HBM on FPGA accelerators to reduce latency through parallel lookups.
Optimized Implementation of the HPCG Benchmark on Reconfigurable Hardware | Alberto Zeni et al. | Xilinx Inc. | Paper | The HPCG benchmark is a modern complement to the HPL benchmark for evaluating HPC systems. This paper presents the first FPGA-based implementation of HPCG that takes advantage of customized compute architectures. The high-performance multi-FPGA implementation achieves up to 108.3 GFlops on one Xilinx Alveo U280 and 346.5 GFlops on four, performance comparable to modern GPUs.
SKT: A One-Pass Multi-Sketch Data Analytics Accelerator | Monica Chiosa et al. | ETH Zurich / Accemic Technologies | Paper, Github | An FPGA-based accelerator that computes several sketches along with basic statistics (average, max, min, etc.) in a single pass over a data stream. SKT characterizes a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams arriving over either PCIe or TCP/IP and is built to fit emerging cloud service architectures.
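As software background for the kind of single-pass processing SKT performs in hardware, here is a minimal Python sketch that computes basic statistics together with an AMS-style estimate of the second frequency moment in one pass over the stream (the function and its parameters are illustrative, not taken from the SKT code base):

```python
import statistics

def one_pass_summary(stream, num_counters=64):
    """One pass over a stream: basic statistics plus an AMS-style
    estimate of the second frequency moment (sum of squared counts).
    Illustrative software analogy of SKT's hardware sketches."""
    counters = [0] * num_counters
    n, total = 0, 0
    lo = hi = None
    for x in stream:
        n += 1
        total += x
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        for j in range(num_counters):
            # Deterministic +/-1 hash per (counter, item) pair.
            sign = 1 if hash((j, x)) % 2 == 0 else -1
            counters[j] += sign
    # Each counter squared is an unbiased estimator of F2; average them.
    f2_estimate = statistics.mean(c * c for c in counters)
    return {"count": n, "avg": total / n, "min": lo, "max": hi,
            "f2_estimate": f2_estimate}
```

All quantities are updated incrementally, so memory use is independent of stream length, which is the property that makes such sketches attractive for hardware streaming over PCIe or TCP/IP.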


Name | Author(s) | Institution | Link | Notes
Do OS abstractions make sense on FPGAs? | Dario Korolija et al. | ETH Zurich | Paper | To what extent do traditional OS abstractions make sense in the context of an FPGA as part of a hybrid system? This paper introduces Coyote, which supports secure spatial and temporal multiplexing of the FPGA between tenants, virtual memory, communication, and memory management inside a uniform execution environment.
EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal in GPUs | Seung Won Min et al. | University of Illinois at Urbana-Champaign | Paper | Sparse-matrix computation.
Extending High-Level Synthesis for Task-Parallel Programs | Yuze Chi et al. | UCLA | Paper | Extends the HLS C++ language and presents a fully automated framework with programmer-friendly interfaces, universal software simulation, and fast code generation to overcome the limitations of current HLS tools for task-parallel programs.
FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache | Ashutosh Dhar et al. | University of Illinois at Urbana-Champaign | Paper | Energy-efficient computation.
Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs | Fabio Maschi et al. | ETH Zurich | Paper | Explores how hardware acceleration can (i) improve the performance of the MCT module (lower latency, higher throughput) and (ii) reduce the amount of computing resources needed.
Portable Linear Algebra on FPGA using Data-Centric Parallel Programming | Manuel Burger et al. | ETH Zurich | Paper | 2020 XOHW Winner (PhD).
Specializing the network for scatter-gather workloads | Catalina Alvarez et al. | ETH Zurich | Paper | Explores hardware offload of the scatter-gather primitive. This approach not only virtually eliminates CPU usage but, with suitable scheduling of responses, also speeds up scatter by allowing parallel queries.
Weighing up the new kid on the block: Impressions of using Vitis for HPC software development | Nick Brown et al. | The University of Edinburgh | Paper | A Vitis case study using the Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing, and optimizing HPC codes.
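The scatter-gather primitive studied in the entry above can be illustrated in software. This is a minimal sketch in which local callables stand in for remote nodes; it is not the paper's network-offloaded implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(query, workers):
    """Scatter one query to all workers in parallel and gather the
    replies in submission order. 'workers' are plain callables that
    stand in for remote nodes reached over the network."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(w, query) for w in workers]
        return [f.result() for f in futures]

# Hypothetical workers: each tags its reply with its node id.
workers = [lambda q, i=i: (i, q + i) for i in range(4)]
replies = scatter_gather(5, workers)
```

Issuing the queries in parallel rather than sequentially is what the hardware offload exploits to speed up scatter while freeing the CPU.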


Name | Author(s) | Institution | Link | Notes
AcMC²: Accelerating Markov Chain Monte Carlo Algorithms for Probabilistic Models | Subho S. Banerjee et al. | University of Illinois at Urbana-Champaign | Paper | A compiler that transforms probabilistic models into optimized hardware accelerators.
Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs | Yao Chen et al. | National University of Singapore | Paper | An open-source automated tool chain, Cloud-DNN, that takes trained CNN models specified in Caffe as input, performs a set of transformations, and maps the model to a cloud-based FPGA. Cloud-DNN significantly improves the overall design productivity of CNNs on FPGAs while satisfying the emergent computational requirements.
Flexible Communication Avoiding Matrix Multiplication on FPGA with HLS | Johannes de Fine Licht et al. | ETH Zurich | Paper | A flexible, fully HLS-based, high-performance matrix multiplication accelerator, capable of efficiently utilizing all available resources on the target device, including on multi-SLR FPGAs.
High-Performance Distributed Memory Programming on Reconfigurable Hardware | Tiziano De Matteis et al. | ETH Zurich | Paper | SMI, an API that unifies the flexibility and single-program, multiple-data approach of MPI with the streaming programming model of spatial architectures.
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters | Subho S. Banerjee et al. | University of Illinois at Urbana-Champaign | Paper | System schedulers.
hlslib: Software Engineering for Hardware Design | Johannes de Fine Licht et al. | ETH Zurich | Paper | A collection of extensions for Vitis to improve developer quality of life, including CMake integration, better vectorization support, and support for simulating dataflow kernels with feedback dependencies.
Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures | Tal Ben-Nun et al. | ETH Zurich | Paper | Enables high-level programming of FPGAs from Python using the dataflow-based SDFG representation, allowing productive optimization of programs via provided graph transformations without modifying the input program, and generating highly efficient FPGA kernel code.
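The blocking idea behind communication-avoiding matrix multiplication (see the entry above) can be sketched in plain Python. The tile size and loop order here are illustrative; the accelerator's actual buffering strategy is described in the paper:

```python
def matmul_tiled(A, B, tile=2):
    """Tiled (blocked) matrix multiplication on lists of lists.
    Each tile of A and B is reused across a whole tile of C, which is
    the reuse that communication-avoiding schedules maximize so that
    slow-memory traffic shrinks as the fast-memory tile grows."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one tile pair; bounds handle ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C
```

On an FPGA the "fast memory" is on-chip BRAM/URAM, so the same schedule maps naturally to a streaming accelerator.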


Name | Author(s) | Institution | Link | Notes
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks | Michaela Blott et al. | Xilinx Inc. | FINN-R, Paper | A framework for Quantized Neural Networks on reconfigurable hardware.
Transformations of High-Level Synthesis Codes for High-Performance Computing | Johannes de Fine Licht et al. | ETH Zurich | Paper | A survey of important source-to-source optimization techniques for high-throughput HLS codes, targeting pipelining, parallelism, and memory bandwidth utilization.


Name | Author(s) | Institution | Link | Notes
Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL | Lorenzo Di Tucci et al. | Xilinx Inc. and Politecnico di Milano | Paper | Smith-Waterman: a key bioinformatics algorithm.
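For readers unfamiliar with the algorithm, the Smith-Waterman local-alignment recurrence accelerated above can be sketched in a few lines of Python (the scoring parameters are illustrative defaults, not those of the paper's OpenCL implementation):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score via dynamic programming.
    H[i][j] is the best score of any local alignment ending at
    a[i-1], b[j-1]; clamping at 0 is what makes the alignment local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The anti-diagonal cells of H are independent of one another, which is the parallelism that FPGA systolic-array implementations exploit.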


Name | Author(s) | Institution | Link | Notes
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference | Yaman Umuroglu et al. | Xilinx Inc. | FINN, Paper | A framework for Binarized Neural Networks on reconfigurable hardware.
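A core trick behind binarized neural network inference of the kind FINN accelerates is replacing multiply-accumulate with XNOR and popcount. A minimal Python sketch of the identity (the LSB-first encoding convention is illustrative):

```python
def encode(vec):
    """Encode a {-1, +1} vector as an integer bitmask, LSB first
    (bit set = +1). Illustrative encoding, not FINN's data layout."""
    return sum(1 << i for i, v in enumerate(vec) if v > 0)

def binarized_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n given as
    bitmasks. Positions where the bits differ contribute -1, equal
    positions contribute +1, hence: dot = n - 2 * popcount(a XOR b)."""
    mask = (1 << n) - 1
    disagreements = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * disagreements

# Example: dot([+1,-1,+1,+1], [+1,+1,-1,+1]) = 1 - 1 - 1 + 1 = 0.
d = binarized_dot(encode([1, -1, 1, 1]), encode([1, 1, -1, 1]), 4)
```

Because an entire layer's arithmetic collapses to wide XNOR gates and popcount trees, binarized layers map extremely densely onto FPGA LUTs.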

Copyright© 2021 Xilinx