Internal Design of Closed Form Merton 76

Overview

The Merton 76 model (also known as the Merton Jump Diffusion model) describes the dynamics of a financial market; it extends the Black-Scholes model by adding a random jump component to the asset price.
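In one common statement of the closed form (generic notation rather than the kernel's own parameter names), the call price is a Poisson-weighted sum of Black-Scholes prices:

    C = \sum_{n=0}^{\infty} \frac{e^{-\lambda'\tau} (\lambda'\tau)^n}{n!} \, BS(S, K, r_n, \sigma_n, \tau),
    \qquad \lambda' = \lambda(1 + k), \quad
    \sigma_n^2 = \sigma^2 + \frac{n\,\delta^2}{\tau}, \quad
    r_n = r - \lambda k + \frac{n \ln(1 + k)}{\tau}

where lambda is the jump intensity, k the mean relative jump size and delta^2 the variance of the log jump size. The kernel truncates the infinite sum after a fixed number of terms (see MAX_N below).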

Design Structure

There are two layers to the kernel: the engine itself and the IO wrapper.

The Engine (M76Engine.hpp)

The engine computes a single Merton 76 closed-form solution for a European call option in two parts. The first part calculates a series of Black-Scholes solutions, each weighted by the random jump term (a Poisson-distributed weight). The second part sums these results to form the overall call price. The engine is split into two parts in order to minimize latency and maximize throughput, and it is pipelined so that multiple call price calculations can be processed efficiently. Pipelining is difficult when the inner loop of a nested loop is a summation, so the summing loop is hoisted out of the main engine loop to avoid this pipelining issue. A sketch of this two-part structure follows.
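The sketch below illustrates the two-part structure described above. The function and parameter names are illustrative only and are not the actual M76Engine.hpp interface; the Black-Scholes helper is a stand-in for the library's Black-Scholes closed-form kernel.

    #include <cmath>

    #define MAX_N 100  // truncation point of the Poisson-weighted series

    // Standard normal CDF via erfc.
    static float normCdf(float x) {
        return 0.5f * std::erfc(-x / std::sqrt(2.0f));
    }

    // Stand-in for the library's Black-Scholes closed-form call price.
    static float bsCall(float S, float K, float r, float sigma, float tau) {
        float sqrtTau = std::sqrt(tau);
        float d1 = (std::log(S / K) + (r + 0.5f * sigma * sigma) * tau)
                   / (sigma * sqrtTau);
        float d2 = d1 - sigma * sqrtTau;
        return S * normCdf(d1) - K * std::exp(-r * tau) * normCdf(d2);
    }

    // Two-part engine structure: weighted Black-Scholes terms, then a sum.
    // lambda = jump intensity, kappa = mean relative jump size,
    // delta = std. dev. of the log jump size.
    static float m76CallSketch(float S, float K, float r, float sigma, float tau,
                               float lambda, float kappa, float delta) {
        float terms[MAX_N];
        float lambdaP = lambda * (1.0f + kappa);   // jump-adjusted intensity
        float weight  = std::exp(-lambdaP * tau);  // Poisson weight for n = 0

        // Part 1: series of Black-Scholes solutions, each weighted by the
        // Poisson jump probability. This is the loop that is pipelined.
    WEIGHTED_BS_LOOP:
        for (int n = 0; n < MAX_N; ++n) {
            float sigmaN = std::sqrt(sigma * sigma + n * delta * delta / tau);
            float rN     = r - lambda * kappa + n * std::log(1.0f + kappa) / tau;
            terms[n]     = weight * bsCall(S, K, rN, sigmaN, tau);
            weight      *= lambdaP * tau / (n + 1);  // next Poisson weight
        }

        // Part 2: separate summing loop, hoisted out of the engine loop so
        // that Part 1 pipelines cleanly.
        float price = 0.0f;
    SUM_LOOP:
        for (int n = 0; n < MAX_N; ++n) {
            price += terms[n];
        }
        return price;
    }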

The M76 engine makes use of the Black-Scholes closed-form kernel. The Greeks are calculated by the Black-Scholes kernel but are ignored by the M76 kernel.

By default the engine sums 100 iterations of weighted Black-Scholes terms. This can be changed via the #define MAX_N in m76_engine_defn.hpp.
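For example, the truncation point is controlled by this define (only this line is shown; the rest of the header is omitted):

    // m76_engine_defn.hpp (excerpt)
    #define MAX_N 100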

IO Wrapper (m76_kernel.cpp)

The wrapper takes a parameter array as input and iterates through it, calling the engine for each entry. The results are likewise returned as an array in order to make full use of DMA on the FPGA, since a single batch data transfer is much faster than many individual transfers. The data is first read from global memory into local memory, then processed by the kernel, and finally written from local memory back to global memory. This is done because the extra time required by the copies is more than compensated for by the speedup the engine gains from accessing local memory.
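The sketch below shows this read/compute/write structure. The argument types, names, and the assumption of 8 float parameters per calculation are illustrative and differ from the real m76_kernel.cpp signature; m76CallSketch() is the engine sketch shown earlier.

    #define BATCH_SIZE 2048   // maximum calculations per batch
    #define NUM_PARAMS 8      // assumed parameters per calculation (illustrative)

    extern "C" void m76_kernel_sketch(const float *in, float *out, int num) {
        float localIn[BATCH_SIZE * NUM_PARAMS];
        float localOut[BATCH_SIZE];

        // 1. Copy the parameter array from global memory into local memory
        //    in one burst, so the DMA sees a single batch transfer.
    READ_LOOP:
        for (int i = 0; i < num * NUM_PARAMS; ++i) localIn[i] = in[i];

        // 2. Call the engine once per entry.
    CALC_LOOP:
        for (int i = 0; i < num; ++i) {
            const float *p = &localIn[i * NUM_PARAMS];
            localOut[i] = m76CallSketch(p[0], p[1], p[2], p[3],
                                        p[4], p[5], p[6], p[7]);
        }

        // 3. Copy the results from local memory back to global memory.
    WRITE_LOOP:
        for (int i = 0; i < num; ++i) out[i] = localOut[i];
    }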

The wrapper processes up to 2048 calculations in one batch. This number can be increased at the expense of FPGA memory, and doing so gives a performance advantage when processing large amounts of data due to the kernel's pipelining.

In order to speed up kernel execution for a large number of calculations, the input array and the sum array are partitioned by a factor of 8 and the inner loop is unrolled, which creates 8 engines working in parallel. The array partitioning is needed to make multiple read accesses possible in the same clock cycle. A partition factor of 8 balances efficiency against the amount of FPGA resource required; the sketch below illustrates the scheme.
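This is a minimal sketch of the partition/unroll pattern, not the kernel's actual code: localIn/localOut stand in for the kernel's local input and sum arrays, and engineStandIn() for the per-entry M76 engine call.

    #define BATCH_SIZE 2048

    // Stand-in for the per-entry M76 engine call.
    static float engineStandIn(float x) { return x * 2.0f; }

    void calcBlock(float localIn[BATCH_SIZE], float localOut[BATCH_SIZE]) {
    #pragma HLS array_partition variable=localIn  cyclic factor=8
    #pragma HLS array_partition variable=localOut cyclic factor=8

    CALC_LOOP:
        for (int i = 0; i < BATCH_SIZE; i += 8) {
    #pragma HLS pipeline
            // Fully unrolled inner loop: 8 engine instances run in parallel,
            // which requires 8 reads/writes per cycle and hence the cyclic
            // partitioning of the arrays by a factor of 8.
            for (int j = 0; j < 8; ++j) {
    #pragma HLS unroll
                localOut[i + j] = engineStandIn(localIn[i + j]);
            }
        }
    }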

Resource Utilization

Area information for the floating-point kernel:

FF: 136311 (17% of an SLR on the U200 board)

LUT: 132573 (33% of an SLR on the U200 board)

DSP: 718 (31% of an SLR on the U200 board)

BRAM: 548 (38% of an SLR on the U200 board)

URAM: 0

Throughput

The theoretical throughput depends on several factors. A floating-point (single-precision) kernel will be faster than a double-precision kernel. A larger MAX_N will give more accurate results but will decrease throughput. The kernel has been pipelined in order to increase throughput when a large number of inputs is processed.

The overall processing time is composed of three stages: transferring data to the FPGA, running the computations, and transferring the results back from the FPGA. The demo contains options to measure these timings, as described in the README.md file.

As an example, processing a batch of 2048 call calculations with the floating-point kernel and MAX_N = 100 breaks down as follows:

Time to transfer data = 0.207ms

Time for 2048 calculations = 0.969ms (equates to ~0.47us per calculation)

Time to transfer results = 0.078ms
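
Adding the three stages gives an indicative end-to-end figure for this particular batch (derived directly from the numbers above; actual results will vary with the board and host):

    total time  = 0.207 ms + 0.969 ms + 0.078 ms = 1.254 ms
    batch rate  = 2048 / 1.254 ms ~ 1.6 million calculations per second (including transfers)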