# Covariance Matrix and Regularizaiton¶

## Overview¶

In probability theory and statistics, a covariance matrix, (aka, variance-covariance matrix) is a :math: Ntimes N square matrix that contains the variances and covariances associated with N observed variables. The diagonal elements of the matrix contain the variances of the variables, and the off-diagonal elements contain the covariances between all possible pairs of variables. At the same time, in order to solve an ill-posed problem or to prevent overfitting, there are four ways to regularize the covariance matrix, including hard-thresholding. soft-thresholding, banding, and tapering.

## Algorithm¶

### Covariance Matrix¶

For a Matrix :math: X=left ( a_{n,m} right )_{Ntimes M} with N variables from M observations, its covariance matrix can be expressed as:

where :math: i,jin left [ 0,N-1 right ], :math: a_{i,m} denotes m-th element of the i-th row, :math: bar{a_{i}} denotes the expected value (mean) of the all observations on the i-th row.

The variance-covariance matrix is symmetric because the element :math: cleft ( i,j right ) the same as the element :math: cleft ( j,i right ). Therefore, for the implementation of the covariance matrix, it only needs to calculate a lower triangular matrix.

### Covariance Regularizaiton¶

For the algorithm of covariance regularization, please refer to “High-Dimensional Covariance Estimation” by Mohsen Pourahmadi.

## Implementation¶

The implementation of the covariance matrix and covariance regularization are very common and simple. There is not elaborated here. Here, the key optimization based on the design of FPGA of covariance matrix is introduced. According to the formula or implementation code of the covariance matrix, the core design requires 3 layers of loops, which will cause the biggest latency. Therefore, it needs to be optimized to improve throughput.

Firstly, the loop of the bottom and middle layers is unrolled to increase throughput. However, due to the self-addition operation in the loop, the effect of unroll operator is not particularly obvious to reduce latency. So, the core calculation part and the self-addition part in the underlying loop is split into two processes, passing the intermediate result through the stream by using pragma dataflow to improve throughput. See the function covCoreWrapper for details.

## Profiling¶

The hardware resources utilization for the covariance matrix are listed in Table 3. (Vivado result)

Primitives | BRAM | DSP | LUT | clock period(ns) |

covCoreStrm | 448 | 730 | 92318 | 3.962 |

covReHardThreshold | 0 | 4 | 312 | 2.341 |

covReSoftThreshold | 0 | 15 | 5006 | 3.016 |

covReBand | 0 | 4 | 403 | 2.797 |

covReTaper | 0 | 7 | 1168 | 2.729 |