xclbin timing closure tips

Overview

The two main timing issues you may encounter are in the power and GT_MAC CUs. Solving them requires some pblocks and, potentially, a change of the default Vivado strategy.

Both fixes can be applied by adding configuration to your Vitis options file (vpp.ini):

  • pblock: use a TCL hook file (see STEPS.PLACE_DESIGN.TCL.PRE below).

  • strategy: set a Vivado run property.

Here is an example:

[vivado]
prop=run.impl_1.STEPS.PLACE_DESIGN.TCL.PRE=../../vpp_cfg/place_design_pre.tcl
prop=run.impl_1.strategy=Performance_EarlyBlockPlacement
prop=run.impl_1.STEPS.POST_ROUTE_PHYS_OPT_DESIGN.IS_ENABLED=true
prop=run.impl_1.STEPS.ROUTE_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.impl_1.STEPS.POST_ROUTE_PHYS_OPT_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.IS_ENABLED=true
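
The TCL file referenced by STEPS.PLACE_DESIGN.TCL.PRE is a regular Vivado hook script sourced just before place_design. A minimal sketch of its content is shown below; the actual pblock constraints to put in it are described in the Power CU and GT_MAC CU sections of this page.

# place_design_pre.tcl -- sourced by Vivado just before place_design.
# Collect here the create_pblock / resize_pblock / add_cells_to_pblock
# commands described in the sections below.
puts "INFO: applying pre-place pblock constraints"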

Run multi-strategy on LSF

If the above options are not enough to close timing, you can also try to run multiple strategies over a cluster:

  • Use the __ALL_IMPL__ macro to apply settings to all runs.

  • Use multiStrategiesWaitOnAllRuns to see the result of all strategies.

    • Seeing all results gives you an indication of how hard it is for the tool to close timing.

[advanced]
#param=compiler.multiStrategiesWaitOnAllRuns=1
## only for vivado >=2022.1
#param=compiler.errorOnPulseWidthViolation=false

[vivado]
impl.lsf={bsub -R "select[(type==X86_64) && (osdistro=rhel || osdistro=centos) && (osver == ws7) && (ossp > 3)] rusage[mem=48000]" -N -q long -W 48:00}
impl.strategies=ALL
prop=run.__ALL_IMPL__.STEPS.PLACE_DESIGN.TCL.PRE=../../vpp_cfg/place_design_pre.tcl
prop=run.__ALL_IMPL__.STEPS.ROUTE_DESIGN.TCL.PRE=../../vpp_cfg/route_design_pre.tcl
prop=run.__ALL_IMPL__.STEPS.POST_ROUTE_PHYS_OPT_DESIGN.IS_ENABLED=true
prop=run.__ALL_IMPL__.STEPS.ROUTE_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.__ALL_IMPL__.STEPS.POST_ROUTE_PHYS_OPT_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.__ALL_IMPL__.STEPS.PHYS_OPT_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.__ALL_IMPL__.STEPS.PHYS_OPT_DESIGN.IS_ENABLED=true
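
If running all strategies is too expensive on your cluster, the same impl.strategies option can typically be given a comma-separated list of strategy names instead of ALL. This is a sketch; the strategy names below are only examples, and the accepted syntax should be checked against your Vitis version.

[vivado]
## run only a subset of implementation strategies
impl.strategies=Performance_EarlyBlockPlacement,Congestion_SpreadLogic_high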

Note

With Vivado 2022.1 (or more recent), and only in multi-strategy builds, pulse width timing violations are not ignored. This is a different behaviour compared to a single-strategy build. Pulse width violations in the power CU can be safely ignored; to do so, uncomment param=compiler.errorOnPulseWidthViolation=false in the [advanced] section above.

Power CU

Although the power CU constrains all FFs, DSPs, BRAMs and URAMs, timing closure is not always achieved. This can happen if the CLOCK_LOW_FANOUT constraint set for the clk_throttling block of the power CU is ignored.

  • You may have a setup violation between the FF Gate_Fast_d1 and the CE pin of FCLK (BUFGCE_DIV).

../_images/bufgce-div.png

Clock throttling BUFGCE_DIV

In this case, you can emulate the effect of the CLOCK_LOW_FANOUT constraint by creating a pblock and assigning the clock throttling block of the power CU to it.

In place_design_pre.tcl, you can add the following constraints (update the pblock according to your design).

Place the pblock over the clock region in which the GC.FCLK clock buffer of the UCS IP is located.

create_pblock pwr_SLR0_clk_throttle
resize_pblock pwr_SLR0_clk_throttle -add CLOCKREGION_X4Y0:CLOCKREGION_X4Y0
add_cells_to_pblock -quiet pwr_SLR0_clk_throttle [get_cells -hierarchical -filter {NAME =~ level0_i/ulp/krnl_powertest_slr0_1/*/pwr_ctrl.clk_throttle/*}]

Here is how to locate the clock buffer:

../_images/bufgce-div-source.png

Clock throttling BUFGCE_DIV
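
If you prefer the TCL console over the GUI, a small sketch like the following can help find the BUFGCE_DIV buffers and where they are placed, so the pblock can be put over the matching clock region. The filter is an assumption; narrow it down with a NAME pattern matching your shell hierarchy if needed.

# Sketch: list the BUFGCE_DIV clock buffers and their placement.
foreach buf [get_cells -hierarchical -filter {REF_NAME == BUFGCE_DIV}] {
    puts "[get_property NAME $buf] -> [get_property LOC $buf]"
}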

Tip

It is also recommended to cascade the power CUs if you have multiple of them, so that only one power CU contains the throttling logic.

GT_MAC CU

The GT_MAC CU is mainly composed of the xxv_ethernet IP, which is huge (without the RS-FEC it is still 75k FFs and 63k LUTs, plus a limited quantity of RAMB36). There are numerous paths with more than 8 levels of logic/routing. Although most of the DRP logic runs at a relatively low frequency (75 MHz), giving the xxv_ethernet too much placement freedom can leave the design unable to meet timing.

  • E.g. system clock: rxoutclk_out[0]_9; slack: -0.118 ns

Pblock usage

Do not let Vivado use the entire SLR and freely place the xxv_ethernet IP. Use pblocks to squeeze the xxv_ethernet instances. Here are two examples showing how to pblock one or two instances of xxv_ethernet.

When creating the pblock, don’t be afraid to increase the CLBL/CLBM utilization to 80-85%.

Example: u50 GT_MAC CU: 1 xxv_ethernet pblock

# Create pblock for krnl_gt_mac_test0_1
create_pblock gt_krnl0
resize_pblock [get_pblocks gt_krnl0] -add {CLOCKREGION_X0Y6:CLOCKREGION_X5Y7}
add_cells_to_pblock -quiet [get_pblocks gt_krnl0] [get_cells -hierarchical -filter {NAME =~ level0_i/ulp/krnl_gt_mac_test0_1/*/mac_wrapper/gty_4lanes.xxv_ip.mac/*}]

Example: u55c GT_MAC CU: 2 xxv_ethernet pblocks

# Create pblock for krnl_gt_mac_test0_1
create_pblock gt_krnl0
resize_pblock [get_pblocks gt_krnl0] -add {CLOCKREGION_X0Y4:CLOCKREGION_X6Y6}
add_cells_to_pblock -quiet [get_pblocks gt_krnl0] [get_cells -hierarchical -filter {NAME =~ level0_i/ulp/krnl_gt_mac_test0_1/*/mac_wrapper/gty_4lanes.xxv_ip.mac/*}]

# Create pblock for krnl_gt_mac_test1_1
create_pblock gt_krnl1
resize_pblock [get_pblocks gt_krnl1] -add {CLOCKREGION_X0Y5:CLOCKREGION_X6Y7}
add_cells_to_pblock -quiet [get_pblocks gt_krnl1] [get_cells -hierarchical -filter {NAME =~ level0_i/ulp/krnl_gt_mac_test1_1/*/mac_wrapper/gty_4lanes.xxv_ip.mac/*}]
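
To check how full a pblock actually is (for example, whether the CLBL/CLBM utilization stays in the 75-85% range discussed above), the standard utilization report can be scoped to the pblock:

# Report the resources used inside a GT pblock (run on a placed or routed design).
report_utilization -pblocks [get_pblocks gt_krnl0]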

Note

The two pblocks overlap because the GT transceivers are located close to each other on the die, and the CLBL/CLBM utilization is at ~75%.

../_images/gt-pblock-overlap.png

GT pblock overlap

Force LOC of BRAM

Unless you use incremental compilation with Vitis™, you can try the following approach: extract placement guidance from a project that passes timing and feed it into the one that fails. You don’t need to extract the location of each FF; fixing the BRAM locations already forces a decent repetition of the P&R result.

Only GT BRAMs

First, try to close timing with only the GT CU present (and maybe the other CUs in the SLR, e.g. DDR). From the timing-closed project:

Force LOC of Only GT BRAMs

  1. Open the DCP.

  2. Find the current BRAM locations. Only search for BLOCKRAM in the *mac* instance; do not try to extract the BRAMs of the memory subsystem.

     show_objects -name find_1 [get_cells -hierarchical -filter { PRIMITIVE_TYPE =~ BLOCKRAM.*.* && NAME =~ "*inst_krnl_core/inst_gt_test_top/*mac*" } ]

  3. In the search result window:

       • Highlight the BRAMs.

       • Right-click.

       • Select: Fix Cells.

  4. In the TCL console:

     write_xdc -exclude_timing mac_bram.xdc

       • mac_bram.xdc should then contain the LOC constraints for the BRAMs.

       • -exclude_timing isn’t strictly needed, but it makes the XDC a bit smaller as it excludes all timing-related constraints and keeps only the physical constraints.

  5. Extract the BRAM locations from mac_bram.xdc.

  6. Add these locations in place_design_pre.tcl (see the constraint sketch below).
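
For illustration, the lines you end up copying from mac_bram.xdc into place_design_pre.tcl are plain LOC constraints. The site and cell names below are hypothetical; always take the real pairs from your own mac_bram.xdc.

# Hypothetical example of an extracted BRAM LOC constraint (site and cell names are placeholders).
set_property LOC RAMB36_X5Y30 [get_cells {level0_i/ulp/krnl_gt_mac_test0_1/inst/mac_wrapper/example_bram_reg}]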

All ULP BRAMs

If fixing only the location of the GT BRAMs is still not enough, you need a more complex approach to reach timing closure:

Force LOC of all ULP BRAMs

  1. Insert ALL CUs, but reduce the utilization of the power CU to 1% DSP per SLR.

     The idea is to have the infrastructure in place (mem-SS, hbm-SS, AXI, …) without putting any extra pressure on the tool with a power CU.

  2. Build with multi-strategy.

     You may want to use param=compiler.multiStrategiesWaitOnAllRuns=1, so you can see which strategies close timing and how many of them do (which gives an indication of how hard timing closure is).

  3. From a successful strategy (frequently the Performance_EarlyBlockPlacement strategy):

       • Extract ALL BRAM LOCs of the ULP (mem-SS, hbm-SS, GT); see the extraction sketch after this list.

       • If your platform contains DDR, you can also extract the calibration DSPs of the DDR controller.

       • Add LOC constraints on them in your place_design_pre.tcl script.

  4. Build the power CU around all these LOC-constrained BRAMs/DSPs.

       • If a clock region contains some LOC-constrained BRAMs, simply don’t use any BRAM of that clock region at all.

       • If you really want to insert the power CU where LOC-constrained BRAMs are present, use invalid.tcl to exclude all of them from the power CU.

  5. Build in multi-strategy again.

     This time, comment out param=compiler.multiStrategiesWaitOnAllRuns=1, as you want the build to finish as quickly as possible.
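
For the BRAM LOC extraction step above, one possible approach is to dump the LOC property of every placed ULP BRAM from the timing-closed checkpoint into a small constraints file. This is a minimal sketch, assuming the level0_i/ulp hierarchy prefix used elsewhere on this page; the output file name is arbitrary, and the filter should be adapted to your design.

# Sketch: run in the opened, timing-closed DCP.
# Writes one "set_property LOC ..." line per placed BRAM of the ULP.
set fh [open ulp_bram_loc.tcl w]
foreach cell [get_cells -hierarchical -filter {PRIMITIVE_TYPE =~ BLOCKRAM.*.* && NAME =~ "level0_i/ulp/*"}] {
    set loc [get_property LOC $cell]
    if {$loc ne ""} {
        puts $fh "set_property LOC $loc \[get_cells {$cell}\]"
    }
}
close $fh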

Frequency override

To help timing closure, the frequency of the CU clocks can be overridden by setting the following parameters in wizard_config.json.

Clocks description

cu_configuration.clock.0.freq

  • Range: [200; 500]

  • Default: 300 MHz

  • Description: This parameter defines the frequency (in MHz) of clock0 for all CUs. xbtest_wizard consumes this information and sets the Vitis configuration kernel_frequency for clock index 0. clock0 is used in all CUs for AXI interfaces.

cu_configuration.clock.1.freq

  • Range: [200; 500]

  • Default: 500 MHz

  • Description: This parameter defines the frequency (in MHz) of clock1 for all CUs. xbtest_wizard consumes this information and sets the Vitis configuration kernel_frequency for clock index 1. clock1 is only used in power CUs, for all sites defined in the power floorplan.
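
For reference, the kernel_frequency setting that xbtest_wizard derives from these parameters corresponds to the standard Vitis clock-index syntax. Below is a hedged sketch of the equivalent hand-written options-file entry; the values are examples and the exact placement in the config file should be checked against your Vitis version.

## equivalent hand-written Vitis option (index 0 = clock0, index 1 = clock1)
kernel_frequency=0:300|1:500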

Enable clock scaling

If the frequency override is not enough, clock scaling may be enabled.

The Vitis™ linker run as part of the xclbin_generate workflow is set to disable automatic frequency scaling by default. This way, the xclbin generation will fail when timing is not met.

This is done by setting the following Vitis command line option to false:

compiler.enableAutoFrequencyScaling

If timing is not met, the xclbin_generate workflow will fail.

If you then decide that you can accept the timing failure, you can add the following option to your run (not recommended).

compiler.skipTimingCheckAndFrequencyScaling
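
If you need to set these options yourself, they would typically go through the [advanced] section of your Vitis options file, in the same param= form used earlier on this page. This is a sketch; the exact placement and the =true value for the skip option are assumptions to verify against your Vitis version.

[advanced]
## keep the default: fail the build when timing is not met
#param=compiler.enableAutoFrequencyScaling=false
## only if you accept a timing failure (not recommended); value assumed
#param=compiler.skipTimingCheckAndFrequencyScaling=true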

Warning

If the actual frequency of ap_clk and the configured frequency (cu_configuration.clock.0.freq) differ, the BW and latency measurements reported by xbtest are not accurate.