Important Design Considerations from PG344

Navigating a lengthy product guide while debugging a design can be overwhelming. To streamline troubleshooting, this article lists the essential design considerations taken directly from the product guide and condenses more than 200 pages into a focused 10-12 page summary. For full details on any consideration, follow the link given for its section to the original product guide.

Note

Link: Resets

  • If your board is designed to use the same PCIe edge connectors to operate with CPM and PL PCIE, then AMD recommends using the PS reset through the Control Interface and Processing System (CIPS) IP core.

Note

Link: Descriptor Engine

  • The descriptor engine will have only one DMA read outstanding per queue at a time and can read as many descriptors as can fit in a Max Read Request Size (MRRS).

  • If a queue is associated with interrupt aggregation, AMD recommends that the status descriptor be turned off, and instead the DMA status be received from the interrupt aggregation ring.

Note

Link: H2C Stream Engine

  • The total length of all descriptors put together must be less than 64 KB.

  • A packet with multiple descriptors straddling is not allowed due to the lack of per queue storage.
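
As a minimal sketch of the 64 KB rule above, the check below sums the byte lengths of a set of H2C stream descriptors. The function name and the list-of-lengths input are illustrative assumptions, and so is the choice of which descriptors are summed together; the product guide remains the authority on the exact scope of the limit.

```python
H2C_ST_MAX_TOTAL_BYTES = 64 * 1024  # limit quoted above: total must be LESS than 64 KB


def h2c_total_length_ok(desc_lengths):
    """Return True if the combined descriptor length is below 64 KB.

    desc_lengths: iterable of per-descriptor byte counts (hypothetical input).
    """
    total = sum(desc_lengths)
    return total < H2C_ST_MAX_TOTAL_BYTES


if __name__ == "__main__":
    print(h2c_total_length_ok([4096] * 15))  # 60 KB in total
    print(h2c_total_length_ok([4096] * 16))  # exactly 64 KB in total
```

Because the rule is "less than", a set of descriptors that adds up to exactly 64 KB fails the check.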

Note

Link: C2H Stream Engine

  • In Simple Bypass Mode, the engine does not track anything for the queue, and the user logic can define its own method to receive descriptors. The user logic is then responsible for delivering the packet and associated descriptor through the simple bypass interface.

  • The ordering of the descriptors fetched by a queue in the bypass interface and the C2H stream interface must be maintained across all queues in bypass mode.

Note

Link: AXI Memory Mapped Bridge Master Interface

  • The AXI MM Bridge Master interface is used for high bandwidth access to AXI Memory Mapped space from the host. The interface supports up to 32 outstanding AXI reads and writes. One or more PCIe BAR of any physical function (PF) or virtual function (VF) can be mapped to the AXI-MM bridge master interface. This selection must be made prior to design compilation.

  • Note that all VFs belonging to the same PF share the same PCIe to AXI translation vector. Therefore, the AXI address space of each VF is concatenated together. Use VFG_OFFSET to calculate the actual starting address of AXI for a particular VF.

Note

Link: AXI Memory Mapped Bridge Slave Interface

  • The AXI-MM Bridge Slave interface is used for high bandwidth memory transfers between the user logic and the Host. AXI to PCIe translation is supported through the AXI to PCIe BARs. The interface will split requests as necessary to obey PCIe MPS and 4 KB boundary crossing requirements. Up to 32 outstanding read and write requests are supported.
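
The request splitting described above can be illustrated with a short sketch. It only demonstrates the two constraints named in the text (no chunk larger than MPS, no chunk crossing a 4 KB boundary); it is not the bridge's actual algorithm, and the function name, argument names, and example values are assumptions.

```python
PAGE_4K = 4096  # PCIe requests must not cross a 4 KB address boundary


def split_request(addr, length, mps):
    """Split (addr, length) into (addr, size) chunks.

    Each chunk is at most `mps` bytes and never crosses a 4 KB boundary.
    """
    if length < 0 or mps <= 0:
        raise ValueError("length must be >= 0 and mps must be > 0")
    chunks = []
    while length > 0:
        to_boundary = PAGE_4K - (addr % PAGE_4K)  # bytes left before the next 4 KB line
        size = min(length, mps, to_boundary)
        chunks.append((addr, size))
        addr += size
        length -= size
    return chunks


if __name__ == "__main__":
    for chunk_addr, chunk_size in split_request(0xFF0, 0x40, mps=256):
        print(hex(chunk_addr), chunk_size)
```

In this example a 64-byte request starting 16 bytes below a 4 KB boundary is cut at the boundary even though it is smaller than the assumed MPS of 256 bytes.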

Note

Link: Interrupt Module

  • Queue-based interrupts and user interrupts are allowed on PFs and VFs, but error interrupts are allowed only on PFs.

Note

Link: General Design of Queues

  • PIDX update should never be equal to CIDX. For example, with a queue size of 8 (entry indexes 0 to 7, with index 7 reserved for status), if CIDX is 0, the maximum PIDX update would be 6.

Note

Link: QDMA Subsystem Limitations

  • Use AXI SmartConnect to support Narrow Burst.

  • ECC and Slave Narrow Burst support are mutually exclusive.

  • If you want an ECC feature, the recommendation is to up-size your AXI Master externally.

Note

Link: Performance and Resource Utilization

  • The product guide lists the QDMA register settings that AMD recommends for better performance. Performance numbers can vary based on the system and OS used.

  • AMD recommends that you limit the total outstanding descriptor fetch to be less than 8 KB on the PCIe. For example, limit the outstanding credits across all queues to 512 for a 16B descriptor.
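
The arithmetic behind the credit example above is simply the 8 KB budget divided by the descriptor size. The sketch below reproduces it; the function and constant names are illustrative assumptions. It follows the worked example in the text (512 credits for a 16B descriptor), which corresponds to exactly 8 KB.

```python
FETCH_BUDGET_BYTES = 8 * 1024  # outstanding descriptor fetch budget quoted above


def max_outstanding_credits(desc_size_bytes, budget_bytes=FETCH_BUDGET_BYTES):
    """Return how many descriptors of the given size fit in the fetch budget."""
    if desc_size_bytes <= 0:
        raise ValueError("descriptor size must be positive")
    return budget_bytes // desc_size_bytes


if __name__ == "__main__":
    for size in (8, 16, 32):
        print(size, max_outstanding_credits(size))
```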

Note

Link: Descriptor Context

  • After the queue is enabled, the software context should only be updated through the direct mapped address space to update the Producer Index and Interrupt Arm bit, unless the queue is being disabled.

  • The hardware context and credit context contain only status. It is only necessary to interact with the hardware and credit contexts as part of queue initialization to clear them to all zeros.

Note

Link: Software Descriptor Context Structure 0x0-C2H and 0x1-H2C

  • If bypass mode is not enabled, 32B is required for Memory Mapped DMA, 16B is required for H2C Stream DMA, and 8B is required for C2H Stream DMA.
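
A small lookup makes the size requirement above easy to check in software. This is only a sketch: the dictionary keys and the function name are hypothetical, and returning None for bypass mode simply reflects that the rule above is stated for queues where bypass mode is not enabled.

```python
# Descriptor sizes in bytes quoted above for queues that are NOT in bypass mode.
INTERNAL_MODE_DESC_BYTES = {
    "mm": 32,      # Memory Mapped DMA
    "h2c_st": 16,  # H2C Stream DMA
    "c2h_st": 8,   # C2H Stream DMA
}


def required_desc_bytes(dma_type, bypass_enabled=False):
    """Return the required descriptor size, or None when bypass mode is enabled."""
    if bypass_enabled:
        return None  # the rule above does not apply to bypass mode
    return INTERNAL_MODE_DESC_BYTES[dma_type]


if __name__ == "__main__":
    print(required_desc_bytes("mm"))
    print(required_desc_bytes("c2h_st"))
```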

Note

Link: Credit Descriptor Context Structure

  • The credit descriptor context is for internal DMA use only and it can be read from the indirect bus for debug.

Note

Link: Descriptor Fetch

  • If fetch crediting is enabled, the user logic is required to provide a credit for each descriptor that should be fetched.

  • If queue size is 8, which contains the entry index 0 to 7, the last entry (index 7) is reserved for status. This index should never be used for the PIDX update, and the PIDX update should never be equal to CIDX. For this case, if CIDX is 0, the maximum PIDX update would be 6.
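
The PIDX rule above can be sketched as two small helpers. The text only gives the case of queue size 8 with CIDX 0; extending it to other CIDX values by wrapping around the usable entries is an assumption of this sketch, as are the function names.

```python
def max_pidx_update(cidx, queue_size):
    """Return the largest PIDX that may be written for a given CIDX.

    The last ring entry (index queue_size - 1) is reserved for status, so
    PIDX and CIDX move over indexes 0 .. queue_size - 2, and PIDX must
    never be made equal to CIDX.
    """
    usable = queue_size - 1  # entries available for descriptors
    if queue_size < 3 or not 0 <= cidx < usable:
        raise ValueError("invalid queue size or CIDX")
    return (cidx - 1) % usable  # assumed wrap-around behaviour


def is_valid_pidx_update(pidx, cidx, queue_size):
    """Return True if pidx avoids the status entry and differs from cidx."""
    usable = queue_size - 1
    return 0 <= pidx < usable and pidx != cidx


if __name__ == "__main__":
    print(max_pidx_update(0, 8))          # case described in the text
    print(is_valid_pidx_update(7, 0, 8))  # index 7 is the status entry
```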

Note

Link: Internal Mode Writeback and Interrupts AXI-MM and H2C-ST

  • It is recommended that the wbi_chk bit be set for all internal mode operations, including when interval mode is enabled.

Note

Link: Descriptor Bypass Mode

  • To perform DMA operations, the user logic drives descriptors (must be QDMA format) into the descriptor bypass input interface.

Note

Link: Descriptor Bypass Mode Writeback/Interrupts

  • Once a descriptor with the sdi bit is sent, another irq_arm assertion must be observed before another descriptor with the sdi bit can be sent.

  • If you set the sdi bit when the arm bit is not properly observed, an interrupt might or might not be sent, and software might hang indefinitely waiting for an interrupt.

  • When interrupts are not enabled, setting the sdi bit has no restriction. However, excessive writeback events can severely reduce the descriptor engine performance and consume write bandwidth to the host.
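
When interrupts are enabled, the sdi and irq_arm rule above behaves like a one-shot gate: each observed irq_arm permits one descriptor with sdi set. The class below is a behavioural sketch of that bookkeeping only. The class and method names are hypothetical, and real user logic would implement this in RTL against the actual ports.

```python
class SdiGate:
    """Tracks whether a descriptor with the sdi bit may be sent."""

    def __init__(self):
        self._armed = False  # no irq_arm observed yet

    def on_irq_arm(self):
        """Record that an irq_arm assertion was observed."""
        self._armed = True

    def try_send_sdi(self):
        """Return True if sdi may be set now, consuming the arm."""
        if not self._armed:
            return False  # setting sdi here risks a lost interrupt
        self._armed = False
        return True


if __name__ == "__main__":
    gate = SdiGate()
    print(gate.try_send_sdi())  # not armed yet
    gate.on_irq_arm()
    print(gate.try_send_sdi())  # armed, so sdi is allowed once
    print(gate.try_send_sdi())  # must wait for the next irq_arm
```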

Note

Link: Traffic Manager Output Interface

  • While the tm_dsc_sts interface is a valid/ready interface, it should not be back-pressured for optimal performance. Since multiple events trigger a tm_dsc_sts cycle, if internal buffering is filled, descriptor fetching will be halted to prevent generation of new events.

Note

Link: Errors

  • After the queue is invalidated, if there is an error you can determine the cause by reading the error registers and context for that queue. You must clear and remove that queue, and then add the queue back later when needed.

  • If the descriptor fetch itself encounters an error, the descriptor will be marked with an error bit. If the error bit is set, the contents of the descriptor should be considered invalid.

Note

Link: Memory Mapped DMA

  • PCIe-to-PCIe, and AXI MM-to-AXI MM DMAs are not supported.

Note

Link: Operation

  • Any descriptors that have already started the source buffer fetch will continue to be processed. Reassertion of the run bit will result in resetting internal engine state and should only be done when the engine is quiesced.

  • Once sufficient read completion data is received, the write request will be issued to the destination interface in the same order that the read data was requested. Before the request is retired, the destination interfaces must accept all the write data and provide a completion response.

Note

Link: AXI Memory Mapped Descriptor for H2C and C2H 32B

  • Internal mode memory mapped DMA must configure the descriptor queue to be 32B and follow the above descriptor format. In bypass mode, the descriptor format is defined by the user logic, which must drive the H2C or C2H MM bypass input port.

Note

Link: Internal and Bypass Modes

  • If the packet is present in host memory in non-contiguous space, then it has to be defined by more than one descriptor, and this requires that the queue be programmed in bypass mode.

  • When fcrd_en is enabled in the software context, DMA will wait for the user application to provide credits. When fcrd_en is not set, the DMA uses a pointer update, fetches descriptors and sends the descriptor out. The user application should not send in credits.

  • Because the bypass mode allows a packet to span multiple descriptors, the user logic needs to indicate to QDMA which descriptor marks the Start-Of-Packet (SOP) and which marks the End-Of-Packet (EOP).

  • At the QDMA H2C Stream bypass-in interface, among other pieces of information, the user logic needs to provide: Address, Length, SOP, and EOP. It is required that once the user logic feeds SOP descriptor information into QDMA, it must eventually feed EOP descriptor information also.

  • Descriptors for these multi-descriptor packets must be fed in sequentially. Other descriptors not belonging to the packet must not be interleaved within the multidescriptor packet.

  • The user logic must accumulate the descriptors up to the EOP descriptor, before feeding them back to QDMA. Not doing so can result in a hang. The QDMA will generate a TLAST at the QDMA H2C AXI4-Stream data output once it issues the last beat for the EOP descriptor. This is guaranteed because the user is required to submit the descriptors for a given packet sequentially.

  • The Stream engine is designed to saturate PCIe for packet sizes as low as 128B, so AMD recommends that you restrict the packet size to be host page size or maximum transfer unit as required by the user application.
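
The SOP/EOP rules above (every packet starts with SOP, eventually ends with EOP, and is not interleaved with other descriptors) can be checked offline with a short validator, for example against a simulation trace. The tuple layout, function name, and messages are assumptions made for this sketch.

```python
def check_bypass_sequence(descs):
    """Check a list of (qid, sop, eop) tuples in bypass-in submission order.

    Returns a list of problem descriptions; an empty list means the
    sequence follows the rules described above.
    """
    problems = []
    open_qid = None  # queue of the packet currently in progress, if any
    for i, (qid, sop, eop) in enumerate(descs):
        if open_qid is None:
            if not sop:
                problems.append(f"descriptor {i}: packet does not start with SOP")
            open_qid = None if eop else qid
        else:
            if qid != open_qid:
                problems.append(
                    f"descriptor {i}: qid {qid} interleaved into packet of qid {open_qid}"
                )
                continue  # keep waiting for the open packet's EOP
            if sop:
                problems.append(f"descriptor {i}: new SOP before EOP")
            if eop:
                open_qid = None
    if open_qid is not None:
        problems.append(f"packet on qid {open_qid} never received EOP")
    return problems


if __name__ == "__main__":
    good = [(0, True, False), (0, False, False), (0, False, True)]
    bad = [(0, True, False), (1, True, True), (0, False, True)]
    print(check_bypass_sequence(good))
    print(check_bypass_sequence(bad))
```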

Note

Link: H2C Stream Descriptor 16B

  • This H2C descriptor format is only applicable for internal mode.

Note

Link: Descriptor Metadata

  • Passing metadata on the tuser is not supported for a queue in bypass mode and consequently there is no input to provide the metadata on the QDMA H2C Stream bypass-in interface.

Note

Link: Zero Length Descriptor

  • The user logic must set both the SOP and EOP for a zero byte descriptor. If not done, an error will be flagged by the H2C Stream Engine.

Note

Link: H2C Stream Status Descriptor Writeback

  • The format of the H2C-ST status descriptor written to the descriptor ring is different from that written into the interrupt coalesce entry.

Note

Link: Handling Descriptors With Errors

  • For a queue in bypass mode, it is the responsibility of the user logic to not issue a batch of descriptors with an error descriptor. Instead, it must send just one descriptor with error input asserted on the H2C Stream bypass-in interface and set the SOP, EOP, no_dma signal, and sdi or mrkr_req signal to make the H2C Stream Engine send a writeback to Host.

Note

Link: C2H Stream Engine

  • The buffer size is fixed on a per-queue basis.

  • The QDMA requires software to post the full ring size so the C2H stream engine can fetch the needed number of descriptors for all received packets.

  • For performance reasons, the software is required to post the PIDX as soon as possible to ensure there are always enough descriptors in the ring.

Note

Link: C2H Prefetch Engine

  • sw_crdt (Software credit): The software must initialize it to 0 and then treat it as read-only.

Note

Link: C2H Stream Modes

  • The C2H bypass input has a single interface that is shared by simple bypass mode and cache bypass mode.

  • If you already have the descriptor cached on the device, there is no need to fetch one from the host and you should follow the simple bypass mode for the C2H Stream application. In simple bypass mode, do not provide credits to fetch the descriptor, and instead, you need to send in the descriptor on the descriptor bypass interface.

  • AXI4-Stream C2H Simple Bypass mode and Cache Bypass mode both use the same bypass-in ports (c2h_byp_in_st_csh_* ports).

  • For simple bypass transfer to work, a prefetch tag is needed and it can be fetched from the QDMA IP.

  • The user application must request a prefetch tag before sending any traffic for a simple bypass queue through the C2H ST engine. Invalid queues or non-bypass queues should not request any tags using this method, as it might reduce performance by freezing tags that never get used.

  • The prefetch tag needs to be reserved upfront before any traffic can start. One prefetch tag per target host is required.

  • In most applications, one prefetch tag for a host is needed. In Simple Bypass mode, the tag is not tied to any descriptor ring. For the queues that share the same prefetch tag, the data and descriptors need to come in the same order.

  • For Simple Bypass, the data and descriptors are both controlled by the user logic, which must guarantee that the order is maintained. For example, when the data stream has packets in the order Q0, Q1, Q2, the descriptor input cannot be sent as Q1, Q2, Q0.

  • The user application writes to the MDMA_C2H_PFCH_BYP_QID (0x1408) register with the qid for a simple bypass queue, then reads from MDMA_C2H_PFCH_BYP_TAG (0x140C) register to retrieve the corresponding prefetch tag. This tag must be driven with all bypass_in descriptors for as long as the current qid is valid. If a current qid is invalidated, a new prefetch tag must be requested with a valid qid.

  • The prefetch tag points to the CAM that stores the active queues in the prefetch engine. The qid that was used to fetch the prefetch tag must also be used as the qid for all simple bypass packets. Assign the qid to dma_s_axis_c2h_ctrl_qid.

  • The prefetch tag and the qid that was used to fetch the tag should be used for all simple bypass packets. This information needs to be communicated to the user side.

  • The c2h_byp_in_st_csh_pfch_tag[6:0] port can have the same prefetch_tag for as long as the original qid is valid.

  • No sequence is required between descriptor bypass in, data payload, and completion packets.

  • When prefetch mode is enabled, the user application cannot send credits as input in QDMA Descriptor Credit input ports.

  • In cache bypass mode, the prefetch tag is maintained by the IP internally. The c2h_byp_out_pfch_tag[6:0] signal should be looped back as the c2h_byp_in_st_csh_pfch_tag[6:0] input. The prefetch tag points to the CAM that stores the active queues in the prefetch engine.

  • No sequence is required between payload and completion packets.
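
The simple bypass tag request described above (write the qid to MDMA_C2H_PFCH_BYP_QID, then read MDMA_C2H_PFCH_BYP_TAG) can be sketched as follows. The register offsets come from the text. Everything else is an assumption: the write32/read32 accessor interface is hypothetical, the 7-bit mask is inferred from the [6:0] width of the tag port rather than from a register description, and FakeRegs is a stand-in with made-up behaviour so the sequence can be exercised without hardware.

```python
MDMA_C2H_PFCH_BYP_QID = 0x1408  # offset quoted above
MDMA_C2H_PFCH_BYP_TAG = 0x140C  # offset quoted above
PFCH_TAG_MASK = 0x7F            # assumption: tag sits in the low 7 bits


def request_prefetch_tag(regs, qid):
    """Write the qid, then read back the prefetch tag for a simple bypass queue."""
    regs.write32(MDMA_C2H_PFCH_BYP_QID, qid)
    return regs.read32(MDMA_C2H_PFCH_BYP_TAG) & PFCH_TAG_MASK


class FakeRegs:
    """Stand-in register file; hands out tags in order (made-up behaviour)."""

    def __init__(self):
        self._mem = {}
        self._next_tag = 0

    def write32(self, offset, value):
        self._mem[offset] = value & 0xFFFFFFFF
        if offset == MDMA_C2H_PFCH_BYP_QID:
            self._mem[MDMA_C2H_PFCH_BYP_TAG] = self._next_tag
            self._next_tag = (self._next_tag + 1) & PFCH_TAG_MASK

    def read32(self, offset):
        return self._mem.get(offset, 0)


if __name__ == "__main__":
    regs = FakeRegs()
    tag = request_prefetch_tag(regs, qid=5)
    print("drive this tag on c2h_byp_in_st_csh_pfch_tag:", tag)
```

The returned tag would then accompany every bypass-in descriptor for as long as the qid stays valid, and a new tag would be requested if the qid is invalidated.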

Note

Link: Handling Descriptors With Errors

  • For a queue in bypass mode, it is the responsibility of the user logic to not issue a batch of descriptors with an error descriptor. Instead, it must send just one descriptor with error input asserted on the C2H Stream bypass-in interface and set the SOP, EOP, no_dma signal, and sdi or mrkr_req signal to make the C2H Stream Engine send a writeback to Host.

Note

Link: Completion Engine

  • Although not a requirement, a CMPT is typically used with a C2H queue.

  • The user-defined portion of the CMPT packet typically needs to specify the length of the data packet transferred and whether or not descriptors were consumed as a result of the data packet transfer.

  • Maximum buffer size register 0xB50 bits[31:26] is programmed to 0 (default value). This value might result in an overflow depending on the simulator or the synthesis tool used. To avoid overflow, set 0xB50 bits[31:26] to maximum value of 63.
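
Setting bits[31:26] of register 0xB50 to 63, as recommended above, is a read-modify-write of a 6-bit field. The sketch below shows only the bit manipulation on a 32-bit value; how the register is actually read and written depends on your driver, and the helper names are assumptions.

```python
MAX_BUF_SIZE_REG = 0xB50      # register offset quoted above
FIELD_HI, FIELD_LO = 31, 26   # bit range quoted above
FIELD_MAX = 63                # recommended value


def set_bits(value, hi, lo, field):
    """Return `value` with bits [hi:lo] replaced by `field` (32-bit register)."""
    width = hi - lo + 1
    if field < 0 or field >> width:
        raise ValueError("field does not fit in the bit range")
    mask = ((1 << width) - 1) << lo
    return (value & ~mask & 0xFFFFFFFF) | (field << lo)


def with_max_buffer_size(reg_value):
    """Return the register value with bits[31:26] set to 63."""
    return set_bits(reg_value, FIELD_HI, FIELD_LO, FIELD_MAX)


if __name__ == "__main__":
    print(hex(with_max_buffer_size(0x00000000)))
```

All bits outside [31:26] are preserved, so the helper is safe to apply to whatever value was read back from the register.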

Note

Link: Completion Context Structure

  • baddr4_low: Since the minimum alignment supported is 64B in this case, this field must be 0.

  • pidx: Completion Ring Producer Index. This is a field written by the hardware. The software must initialize it to 0 and then treat it as read-only.

  • color: Color bit to be used on Completion.

Note

Link: Completion Status Structure

  • In order to make the QDMA write Completion Status to the Completion ring, Completion Status must be enabled in the Completion context.

Note

Link: Completion Status/Interrupt Moderation

  • When in TRIGGER_EVERY, TRIGGER_USER, and TRIGGER_USER_COUNT mode, the software must read all the Completion entries in the Completion ring as indicated by an interrupt (or a Completion Status write).

Note

Link: Address Translation

  • When the No Address Translation option is selected, one full 64-bit BAR space is given for slave data transfer, and you must set up any address translation that is needed. If No Address Translation is not selected, the DMA performs the address translation.

Note

Link: Slave Address Translation Examples

  • The slave bridge does not support narrow burst AXI transfers.

Note

Link: Legacy Interrupt

  • To enable the legacy interrupt, the software needs to set the en_lgcy_intr bit in the register QDMA_GLBL_GLBL_INTERRUPT_CFG (0x2C4).

Note

Link: Function Map Table

  • Along with FMAP table programming in the IP, you must program the FMAP table in the Mailbox IP. This is needed for function level reset (FLR) procedure.

Note

Link: Context Programming

  • A host profile table context needs to be programmed before any context settings.

Note

Link: Queue Setup

  • If interrupts/status writes are desired (enabled in the Completion Context), an initial Completion CIDX update is required to send the hardware into a state where it is sensitive to trigger conditions. This initial CIDX update is required, because when out of reset, the hardware initializes into an unarmed state.

Note

Link: Host Profile

  • The host profile must be programmed to represent the root port host. The host profile can be programmed through context programming. Select QDMA_CTXT_SELC_HOST_PROFILE (4'hA) in QDMA_IND_CTXT_CMD.

  • H2C AXI4-MM steering bit and C2H AXI4-MM steering bit should be set to 0s. If not, DMA AXI4-MM transfers do not work. For most cases, host profile context structure is all 0s, and host profile must still be programmed to represent a host.

Note

Link: Resets

  • Reset the QDMA logic through the soft_reset_n port. This port needs to be held in reset for a minimum of 100 clock cycles (axi_aclk cycles). This does not reset PCIe hard block. It resets only the DMA portion of logic. This reset can be asserted if there is a DMA hang or some error condition.

  • The use cases that prompt the use of soft_reset include the following: the DMA hangs and the user is not getting proper values; DMA transfers have errors, but the PCIe links are good; the DMA records some asynchronous error.

  • After soft_reset, you must reinitialize the queues and program all queue context.
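
The 100-cycle minimum for soft_reset_n translates into a hold time that depends on the axi_aclk frequency. The helper below does that conversion. The function name is an assumption, and 250 MHz is only an example clock, not a value taken from the product guide.

```python
MIN_RESET_CYCLES = 100  # minimum axi_aclk cycles quoted above


def min_soft_reset_hold_ns(axi_aclk_hz, cycles=MIN_RESET_CYCLES):
    """Return the minimum soft_reset_n hold time in nanoseconds."""
    if axi_aclk_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles * 1e9 / axi_aclk_hz


if __name__ == "__main__":
    # Example only: a 250 MHz axi_aclk gives a 4 ns period.
    print(min_soft_reset_hold_ns(250e6), "ns")
```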

Note

Link: Expansion ROM

  • The maximum size for the Expansion ROM BAR should be no larger than 16 MB. Selecting an address space larger than 16 MB can result in a non-compliant core.

Note

Link: Data Path Errors

  • Any DMA during and after the parity error should be considered invalid. If there is a parity error and transfer hangs or stops, the DMA will log the error. You must investigate and fix the parity issues.

Note

Link: AXI Bridge Slave Ports

  • The valid data identified by s_axib_wstrb must be continuous from the first byte enable to the last byte enable.

Note

Link: AXI4 Stream H2C Ports

  • m_axis_h2c_tuser_err : If set, indicates the packet has an error. The error could come from the PCIe, or the error could be in the DMA transfer. AMD recommends that you look at the error registers and context for details.

Note

Link: AXI4 Stream C2H Ports

  • s_axis_c2h_ctrl_len [15:0] : ctrl_len is in bytes and should be valid in first beat of the packet.

  • s_axis_c2h_mty [5:0] : Empty byte should be set in last beat.

Note

Link: AXI4 Stream C2H Completion Ports

  • HAS_PLD. The CMPT packet has a corresponding payload packet, and it needs to wait for the payload packet to be sent before sending the CMPT packet.

  • s_axis_c2h_cmpt_tvalid must be asserted until s_axis_c2h_cmpt_tready is asserted.

Note

Link: VDM Ports

  • When this interface is not used, Ready must be tied-off to 1.

Note

Link: QDMA Descriptor Bypass Input Ports

  • For performance reasons, AMD recommends that this port be asserted once every 32 or 64 descriptors, and be asserted at the last descriptor if there are no more descriptors left.

  • If h2c_byp_in_st_no_dma is set, then both h2c_byp_in_st_sop and h2c_byp_in_st_eop must be set.

  • h2c_byp_in_mm_len[27:0] : The DMA data length. The upper 12 bits must be tied to 0. Thus only the lower 16 bits of this field can be used for specifying the length.

  • h2c_byp_in_mm_sdi : For performance reasons, AMD recommends that this port be asserted once in 32 or 64 descriptors and be asserted at the last descriptor if there are no more descriptors left.

  • c2h_byp_in_st_csh_pfch_tag[6:0] : In Cache Bypass mode, you must loop back c2h_byp_out_pfch_tag[6:0] to c2h_byp_in_st_csh_pfch_tag[6:0]. In Simple Bypass mode, user needs to pass in the Prefetch tag value from MDMA_C2H_PFCH_BYP_TAG (0x140C) register.
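
Several of the port rules above are simple enough to check in a testbench model. The sketch below covers three of them: the usable range of h2c_byp_in_mm_len, the no_dma requirement on SOP and EOP, and one way to place sdi once every N descriptors plus on the last one. Function names are hypothetical, and the sdi helper is just one reading of the recommendation, not a definitive schedule.

```python
MM_LEN_USABLE_BITS = 16  # 28-bit field with the upper 12 bits tied to 0


def mm_len_ok(length):
    """Return True if the length fits in the lower 16 bits of h2c_byp_in_mm_len."""
    return 0 <= length < (1 << MM_LEN_USABLE_BITS)


def st_flags_ok(no_dma, sop, eop):
    """Return True if no_dma is clear, or if both sop and eop are set."""
    return bool((not no_dma) or (sop and eop))


def sdi_flags(num_desc, interval=32):
    """Return one sdi flag per descriptor: every `interval`-th and the last one."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return [((i + 1) % interval == 0) or (i == num_desc - 1) for i in range(num_desc)]


if __name__ == "__main__":
    flags = sdi_flags(70, interval=32)
    print([i for i, f in enumerate(flags) if f])  # indexes where sdi is set
```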

Note

Link: QDMA Descriptor Bypass Output Ports

  • h2c_byp_out_cidx [15:0] : The user must echo this field back to QDMA when submitting the descriptor on the bypass-in interface.

  • h2c_byp_out_rdy : When this interface is not used, Ready must be tied-off to 1.

  • c2h_byp_out_cidx [15:0] : The user must echo this field back to QDMA when submitting the descriptor on the bypass-in interface.

  • c2h_byp_out_rdy : When this interface is not used, Ready must be tied-off to 1.

Note

Link: QDMA Descriptor Credit Input Ports

  • dsc_crdt_in_vld : When asserted, the user must present valid data on the bus and maintain the bus values until both valid and ready are asserted on the same cycle.

  • dsc_crdt_in_fence : The fence bit should only be set for a queue that is enabled, and has both descriptors and credits available, otherwise a hang condition might occur.

Note

Link: QDMA Traffic Manager Credit Output Ports

  • tm_dsc_sts_rdy : When this interface is not used, Ready must be tied-off to 1.

Note

Link: Queue Status Ports

  • qsts_out_rdy : Ready must be tied to 1 so status output will not be blocked. Even if this interface is not used, the ready port must be tied to 1.

Note

Link: QDMA PF Address Register Space

  • When you generate the IP in default mode, not all registers are exposed. For example, debug registers will be missing. Refer to the qdma_v5_0_pf_registers.csv file to identify the debug registers. To expose all registers, use the following Tcl command during IP generation:

      set_property CONFIG.debug_mode {DEBUG_REG_ONLY} [get_ips qdma_0]