Cholesky¶

Overview¶

Cholesky decomposition is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, in the form of \(A = LL^*\). \(A\) is a Hermitian positive-definite matrix, \(L\) is a lower triangular matrix with real and positive diagonal entries, and \(L^*\) denotes the conjugate transpose of \(L\). Cholesky decomposition is useful for efficient numerical solutions.

\[A = L*L^*\]

Implementation¶

DataType Supported¶

float
x_complex<float>
std::complex<float>
ap_fixed
x_complex<ap_fixed>
std::complex<ap_fixed>

Note

Subnormall values are not supported. If used, the synthesized hardware will flush these to zero, and the behavior will differ versus software simulation.

Interfaces¶

Template parameters:
- RowsColsA Defines the matrix dimensions
- InputType Input data type
- OutputType Output data type
- TRAITS Cholesky traits class
Arguments:
- matrixAStrm Stream of Square Hermitian/symmetric positive definite input matrix
- matrixLStrm Stream of Lower or upper triangular output matrix
Return Values:
- 0 = Success.
- 1 = Failure. The function attempted to find the square root of a negative number, that is, the input matrix A was not Hermitian/symmetric positive definite.

Implementation Controls¶

Specifications¶

There is a configuration class derived from the base configuration class xf::solver::choleskyTraits by redefining the appropriate class member.

struct my_cholesky_traits : xf::solver::choleskyTraits<LOWER_TRIANGULAR, DIM, MATRIX_IN_T, MATRIX_OUT_T> {
    static const int ARCH = SEL_ARCH;
};

The default base configuration class is as following. If the input datatype is complex or ap_fixed, please refer to L1/include/hw/cholesky.hpp for more details.

template <bool LowerTriangularL, int RowsColsA, typename InputType, typename OutputType>
struct choleskyTraits {
    typedef InputType PROD_T;
    typedef InputType ACCUM_T;
    typedef InputType ADD_T;
    typedef InputType DIAG_T;
    typedef InputType RECIP_DIAG_T;
    typedef InputType OFF_DIAG_T;
    typedef OutputType L_OUTPUT_T;
    static const int ARCH = 1;
    static const int INNER_II = 1;
    static const int UNROLL_FACTOR = 1;
    static const int UNROLL_DIM = (LowerTriangularL == true ? 1 : 2);
    static const int ARCH2_ZERO_LOOP = true;
};

Note

ARCH: Select implementation: 0=Basic, 1=Lower latency architecture, 2=Further improved latency architecture
INNER_II: Specify the pipelining target for the inner loop
UNROLL_FACTOR: The inner loop unrolling factor for the choleskyAlt2 architecture(2) to increase throughput
UNROLL_DIM: Dimension to unroll matrix
ARCH2_ZERO_LOOP: Additional implementation “switch” for the choleskyAlt2 architecture (2).

The configuration class is supplied to the xf::solver::cholesky function as a template paramter as follows.

template <bool LowerTriangularL,
          int RowsColsA,
          class InputType,
          class OutputType,
          typename TRAITS = choleskyTraits<LowerTriangularL, RowsColsA, InputType, OutputType> >
int cholesky(hls::stream<InputType>& matrixAStrm, hls::stream<OutputType>& matrixLStrm)

Key Factors¶

The following table summarizes how the key factors from the configuration class influence resource utilization, function throughput (initiation interval), and function latency. The values of Low, Medium, and High are relative to the other key factors.

Table 3 Cholesky Key Factor Summary¶
Key Factor	Value	Resources	Throughput	Latency
Architecture (ARCH)	0	Low	Low	High
	1	Medium	Medium	Medium
	2	High	High	Low
Inner loop pipeling (INNER_II)	1	High	High	Low
Inner loop pipeling (INNER_II)	>1	Low	Low	High
Inner loop unrolling (UNROLL_FACTOR)	1	Low	Low	High
Inner loop unrolling (UNROLL_FACTOR)	>1	High	High	Low

Note

Architecture
- 0: Uses the lowest DSP utilization and lowest throughput.
- 1: Uses higher DSP utilization but minimized memory utilization with increased throughput. This value does not support inner loop unrolling to further increase throughput.
- 2: Uses highest DSP and memory utilization. This value supports inner loop unrolling to improve overall throughput with a limited increase in DSP resources. This is the most flexible architecture for design exploration.
Inner loop pipeling
- >1: For ARCH 2, enables resource share and reduce the DSP utilization. When using complex floating-point data types, setting the value to 2 or 4 significantly reduces DSP utilization.
Inner loop unrolling
- For ARCH 2, duplicates the hardware required to implement the loop processing by a specified factor, executes the corresponding number of loop iterations in parallel, and increases throughput but also increases DSP and memory utilization.