AVED QDMA

Xilinx QDMA

The Xilinx PCI Express Multi Queue DMA (QDMA) IP provides high-performance direct memory access (DMA) via PCI Express.

Both the Linux kernel driver and the DPDK driver can be run on a PCI Express root-port host PC to interact with the QDMA endpoint IP via PCI Express.

For detailed documentation, refer to the following links:


Installing

Before building, add the PCIe identifier to the table at the end of the PF section in src/pci_ids.h .

The identifier can be found by issuing the following command:

[xilinx@] lspci -vd 10ee:
21:00.0 Processing accelerators: Xilinx Corporation Device 50b4
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 2-1
        Flags: bus master, fast devsel, latency 0, NUMA node 2, IOMMU group 27
        Memory at 2bf70000000 (64-bit, prefetchable) [size=256M]
        Capabilities: <access denied>
        Kernel driver in use: ami
        Kernel modules: ami

21:00.1 Processing accelerators: Xilinx Corporation Device 50b5
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 2-1
        Flags: bus master, fast devsel, latency 0, NUMA node 2, IOMMU group 27
        Memory at 2bf80000000 (64-bit, prefetchable) [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: qdma-pf
        Kernel modules: qdma_pf, ami, qdma_vf

# On the line for 21:00.1 above, the PCIe identifier is shown as 50b5
# Add 50b5 to the end of the PF table in src/pci_ids.h in the following form:
{ PCI_DEVICE(0x10ee, 0x50b5), },        /** V80 */
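
Before building, a quick check such as the following can confirm that the new entry is in place (a minimal sketch; the path src/pci_ids.h is relative to the QDMA driver source tree, as above):

# Confirm the new device ID is present in the PF table
[xilinx@] grep -n "0x50b5" src/pci_ids.h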

The QDMA driver can then be built:

# Build the QDMA driver
[xilinx@] cd dma_ip_drivers/QDMA/linux-kernel && make

# Install the QDMA driver
[xilinx@] make install-mods

# After install-mods, check that the module has been probed
[xilinx@] lspci -vd 10ee:
21:00.0 Processing accelerators: Xilinx Corporation Device 50b4
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 2-1
        Flags: bus master, fast devsel, latency 0, NUMA node 2, IOMMU group 27
        Memory at 2bf70000000 (64-bit, prefetchable) [size=256M]
        Capabilities: <access denied>
        Kernel driver in use: ami
        Kernel modules: ami

21:00.1 Processing accelerators: Xilinx Corporation Device 50b5
        Subsystem: Xilinx Corporation Device 000e
        Physical Slot: 2-1
        Flags: bus master, fast devsel, latency 0, NUMA node 2, IOMMU group 27
        Memory at 2bf80000000 (64-bit, prefetchable) [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: qdma-pf
        Kernel modules: qdma_pf, ami, qdma_vf
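
If the driver does not appear as above, it can also help to confirm that the QDMA modules are actually loaded (a minimal check, using the module names reported under "Kernel modules" above):

# List loaded QDMA modules
[xilinx@] lsmod | grep qdma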

Examples:

Note: All examples were run with sudo privileges.

List Available Devices

[xilinx@] dma-ctl dev list

qdma01000       0000:01:00.0    max QP: 0, -~-
qdma01001       0000:01:00.1    max QP: 0, -~-
qdma01002       0000:01:00.2    max QP: 0, -~-
qdma01003       0000:01:00.3    max QP: 0, -~-

**Set Qmax**

[xilinx@] dma-ctl dev list

qdma01000       0000:01:00.0    max QP: 0, -~-
qdma01001       0000:01:00.1    max QP: 0, -~-
qdma01002       0000:01:00.2    max QP: 0, -~-
qdma01003       0000:01:00.3    max QP: 0, -~-


qdmavf01004     0000:01:00.4    max QP: 0, -~-


[xilinx@] echo 100 > /sys/bus/pci/devices/0000\:01\:00.0/qdma/qmax
[xilinx@] echo 100 > /sys/bus/pci/devices/0000\:01\:00.1/qdma/qmax
[xilinx@] echo 100 > /sys/bus/pci/devices/0000\:01\:00.2/qdma/qmax
[xilinx@] echo 100 > /sys/bus/pci/devices/0000\:01\:00.3/qdma/qmax
[xilinx@] echo 100 > /sys/bus/pci/devices/0000\:01\:00.4/qdma/qmax
[xilinx@] dma-ctl dev list

qdma01000       0000:01:00.0    max QP: 100, 0~99
qdma01001       0000:01:00.1    max QP: 100, 100~199
qdma01002       0000:01:00.2    max QP: 100, 200~299
qdma01003       0000:01:00.3    max QP: 100, 300~399


qdmavf01004     0000:01:00.4    max QP: 100, 400~499
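
Rather than writing each function's qmax individually, the same values can be set in a loop (a sketch, assuming the same BDF 0000:01:00.x as above):

# Set qmax to 100 for all five functions in one go
[xilinx@] for fn in 0 1 2 3 4; do echo 100 > /sys/bus/pci/devices/0000:01:00.$fn/qdma/qmax; done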

Queue Management

# Queue stats
[xilinx@] dma-ctl qdma01000 stat

qdma01000:statistics
Total MM H2C packets processed = 0
Total MM C2H packets processed = 0
Total ST H2C packets processed = 0
Total ST C2H packets processed = 0
Min Ping Pong Latency = 0
Max Ping Pong Latency = 0
Avg Ping Pong Latency = 0

# Add a queue
[xilinx@] dma-ctl qdma01000 q add idx 4 mode mm dir h2c

qdma01000-MM-4 H2C added.
Added 1 Queues.

# Start a queue
[xilinx@] dma-ctl qdma01000 q start idx 4 dir h2c
dma-ctl: Info: Default ring size set to 2048

1 Queues started, idx 4 ~ 4.
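
When a queue is no longer needed it can be stopped and removed again; the following sketch assumes the same queue idx 4 added above, with the q stop / q del subcommands mirroring q add / q start:

# Stop a queue
[xilinx@] dma-ctl qdma01000 q stop idx 4 dir h2c

# Delete a queue
[xilinx@] dma-ctl qdma01000 q del idx 4 dir h2c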

Read/Write Operations

# Write to DMA
[xilinx@] dma-to-device [OPTIONS]
-d (--device) device path from /dev. Device name is formed as qdmabbddf-<mode>-<queue_number>. Ex: /dev/qdma01000-MM-0
-a (--address) the start address on the AXI bus
-s (--size) size of a single transfer in bytes, default 32 bytes
-o (--offset) page offset of transfer
-c (--count) number of transfers, default 1
-f (--data input file) filename to read the data from.
-w (--data output file) filename to write the data of the transfers
-h (--help) print usage help and exit
-v (--verbose) verbose output

# Example Write
[xilinx@] dma-to-device -d /dev/qdma06000-MM-0 -s 64
size=64 Average BW = 375.194937 KB/sec

# Read from DMA
[xilinx@] dma-from-device [OPTIONS]
-d (--device) device path from /dev. Device name is formed as qdmabbddf-<mode>-<queue_number>. Ex: /dev/qdma01000-MM-0
-a (--address) the start address on the AXI bus
-s (--size) size of a single transfer in bytes, default 32 bytes.
-o (--offset) page offset of transfer
-c (--count) number of transfers, default is 1.
-f (--file) file to write the data of the transfers
-h (--help) print usage help and exit
-v (--verbose) verbose output

# Example Read
[xilinx@] dma-from-device -d /dev/qdma01000-MM-1 -s 64
size=64 Average BW = 328.311188 KB/sec

# Compare Example
# Create a 128 KB file filled with random values
[xilinx@] dd if=/dev/urandom bs=1024 count=128 of=file_128kb conv=notrunc

# Example write to address 0
[xilinx@] dma-to-device -d /dev/qdma06000-MM-0 -a 0 -s 131072 -f file_128kb

# Example read from address 0
[xilinx@] dma-from-device -d /dev/qdma06000-MM-0 -a 0 -s 131072 -f output_128kb

# Compare the files
[xilinx@] cmp ./file_128kb ./output_128kb
# If cmp produces no output, the files are identical
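
The same round trip can be repeated at several transfer sizes with a small loop (a sketch, assuming the same device node /dev/qdma06000-MM-0 as above and that the memory at address 0 is large enough for each size):

# Write, read back and compare at 4 KB, 64 KB and 1 MB
[xilinx@] for sz in 4096 65536 1048576; do \
              dd if=/dev/urandom bs=$sz count=1 of=in_$sz status=none; \
              dma-to-device   -d /dev/qdma06000-MM-0 -a 0 -s $sz -f in_$sz; \
              dma-from-device -d /dev/qdma06000-MM-0 -a 0 -s $sz -f out_$sz; \
              cmp in_$sz out_$sz && echo "size $sz OK"; \
          done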

DMA Perf

Standard IO tools such as fio can be used for performing IO operations using the char device interface.

However, most of these tools send or receive one packet at a time and wait for that packet to complete, so they cannot keep the driver and hardware busy enough for performance measurement. Although fio supports asynchronous interfaces, it does not continuously submit IO requests while polling for completions in parallel.
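
As an illustration only (a hypothetical invocation, assuming an MM queue has been added and started and is exposed as /dev/qdma01000-MM-0), fio's libaio engine can be pointed at the char device:

# Asynchronous writes through the QDMA char device using fio
[xilinx@] fio --name=qdma-h2c --filename=/dev/qdma01000-MM-0 \
              --ioengine=libaio --iodepth=16 --rw=write --bs=4k --size=1M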

To overcome this limitation, Xilinx developed the dma-perf tool, which leverages the asynchronous functionality provided by the libaio library. With libaio, an application submits an IO request to the driver and the driver returns control to the caller immediately (i.e., non-blocking). The completion notification is delivered separately, so the application can poll for completions and free each buffer once its completion arrives.

DMA Performance Tools

usage: dma-perf [OPTIONS]
   -c (--config) config file that has configuration for IO

[xilinx@] dma-perf -c perf_config.txt
qdma65000-MM-0 H2C added.
Added 1 Queues.
Queues started, idx 0 ~ 0.
qdma65000-MM-0 C2H added.
Added 1 Queues.
Queues started, idx 0 ~ 0.
dmautils(16) threads
Exit Check: tid =8, req_sbmitted=1495488 req_completed=1495488 dir=H2C, intime=0 loop_count=0,
Exit Check: tid =13, req_sbmitted=1482752 req_completed=1482752 dir=C2H, intime=0 loop_count=0,
Exit Check: tid =14, req_sbmitted=1494720 req_completed=1494720 dir=H2C, intime=0 loop_count=0,
Exit Check: tid =8, req_sbmitted=1495488 req_completed=1495488 dir=H2C, intime=0 loop_count=0,
Exit Check: tid =14, req_sbmitted=1494720 req_completed=1494720 dir=H2C, intime=0 loop_count=0,
Exit Check: tid =6, req_sbmitted=1495488 req_completed=1495488 dir=H2C, intime=1495360 loop_count=1,
Exit Check: tid =5, req_sbmitted=1485568 req_completed=1485568 dir=C2H, intime=1485440 loop_count=1,
Exit Check: tid =11, req_sbmitted=1454208 req_completed=1454208 dir=C2H, intime=1454080 loop_count=1,
Exit Check: tid =13, req_sbmitted=1482944 req_completed=1482944 dir=C2H, intime=1482752 loop_count=1,
Exit Check: tid =0, req_sbmitted=1495168 req_completed=1495168 dir=H2C, intime=1494976 loop_count=2,
Exit Check: tid =10, req_sbmitted=1495104 req_completed=1495104 dir=H2C, intime=1494912 loop_count=2,
Exit Check: tid =12, req_sbmitted=1494592 req_completed=1494592 dir=H2C, intime=1494400 loop_count=2,
Exit Check: tid =9, req_sbmitted=1486784 req_completed=1486784 dir=C2H, intime=1486592 loop_count=2,
Exit Check: tid =15, req_sbmitted=1485248 req_completed=1485248 dir=C2H, intime=1485056 loop_count=2,
Exit Check: tid =1, req_sbmitted=1486656 req_completed=1486656 dir=C2H, intime=1486592 loop_count=1,
Exit Check: tid =4, req_sbmitted=1495872 req_completed=1495872 dir=H2C, intime=1495744 loop_count=1,
Exit Check: tid =3, req_sbmitted=1486336 req_completed=1486336 dir=C2H, intime=1486208 loop_count=2,
Exit Check: tid =7, req_sbmitted=1486400 req_completed=1486400 dir=C2H, intime=1486208 loop_count=2,
Exit Check: tid =2, req_sbmitted=1495744 req_completed=1495744 dir=H2C, intime=1495616 loop_count=2,
Exit Check: tid =10, req_sbmitted=1495296 req_completed=1495104 dir=H2C, intime=1494912 loop_count=10000,
Exit Check: tid =11, req_sbmitted=1454464 req_completed=1454336 dir=C2H, intime=1454080 loop_count=10000,
Exit Check: tid =5, req_sbmitted=1485632 req_completed=1485504 dir=C2H, intime=1485440 loop_count=10000,
Exit Check: tid =0, req_sbmitted=1495616 req_completed=1495424 dir=H2C, intime=1494976 loop_count=10000,
Exit Check: tid =12, req_sbmitted=1494912 req_completed=1494720 dir=H2C, intime=1494400 loop_count=10000,
Exit Check: tid =6, req_sbmitted=1495616 req_completed=1495488 dir=H2C, intime=1495360 loop_count=10000,
Stopped Queues 0 -> 0.
Exit Check: tid =9, req_sbmitted=1486912 req_completed=1486720 dir=C2H, intime=1486592 loop_count=10000,
Exit Check: tid =15, req_sbmitted=1485952 req_completed=1485760 dir=C2H, intime=1485056 loop_count=10000,
Exit Check: tid =13, req_sbmitted=1483456 req_completed=1483264 dir=C2H, intime=1482752 loop_count=10000,
Stopped Queues 0 -> 0.
Deleted Queues 0 -> 0.
Deleted Queues 0 -> 0.
WRITE: total pps = 3987072 BW = 255.172608 MB/sec
READ: total pps = 3950976 BW = 252.862464 MB/sec

The dma-perf tool takes a configuration file as input. The configuration file format is shown below.

Example Config File

name=mm_1_1
mode=mm #mode
dir=bi #dir
pf_range=0:0 #no spaces
q_range=0:0 #no spaces
wb_acc=5
tmr_idx=9
cntr_idx=0
trig_mode=usr_cnt
rngidx=9
ram_width=15 #31 bits - 2^31 = 2GB
runtime=30 #secs
num_threads=8
bidir_en=1
num_pkt=64
pkt_sz=64
offset_q_en=1
h2c_q_start_offset=0x100
h2c_q_offset_intvl=10
c2h_q_start_offset=0x200
c2h_q_offset_intvl=20
pci_bus=06
pci_device=00

Parameters

  • name : name of the configuration

  • mode : mode of the queue, streaming (st) or memory-mapped (mm). Defaults to mm.

  • dir : direction of the queue, host-to-card (h2c), card-to-host (c2h), or both (bi).

  • pf_range : Range of the PFs from 0-3 on which the performance metrics are to be collected.

  • q_range : Range of the Queues from 0-2047 on which the performance metrics are to be collected.

  • flags : queue flags

  • wb_acc : write back accumulation index from CSR register ( 0 - 15 )

  • tmr_idx : timer index from CSR register ( 0 - 15 )

  • cntr_idx : Counter index from CSR register ( 0 - 15 )

  • trig_mode : trigger mode (every, usr_cnt, usr, usr_tmr, dis)

  • rngidx : Ring index from CSR register ( 0 - 15 )

  • runtime : Duration of the performance runs, time in seconds.

  • num_threads : number of threads used by the dma-perf application to pump traffic to the queues

  • bidir_en : Enable or Disable the bi-direction mode ( 0: Disable, 1: Enable )

  • num_pkt : number of packets

  • pkt_sz : Packet size

  • mm_chnl : MM Channel ( 0 - 1 ) for Versal devices

  • keyhole_en : Enable the Keyhole feature

  • offset : Offset to be written to for MM Performance Use cases

  • aperture_sz : Size of aperture when using the keyhole feature

  • offset_q_en : Offset queue enable (0-1) to enable H2C/C2H queues offsets.

  • h2c_q_start_offset : Start address of H2C queue.

  • h2c_q_offset_intvl : Fixed interval for subsequent H2C queues offsets.

  • c2h_q_start_offset : Start address of C2H queue.

  • c2h_q_offset_intvl : Fixed interval for subsequent C2H queues offsets.

  • pci_bus : PCI bus number of the device under test.

  • pci_device : PCI device number of the device under test.
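
Putting the parameters together, a trimmed configuration can be generated and run in one step (a sketch, assuming the board enumerates at PCI bus 06, device 00, and reusing the CSR index values from the example config above):

# Generate a minimal bidirectional MM config for PF 0, queue 0, and run it
[xilinx@] cat > perf_config.txt << 'EOF'
name=mm_min
mode=mm
dir=bi
pf_range=0:0
q_range=0:0
wb_acc=5
tmr_idx=9
cntr_idx=0
trig_mode=usr_cnt
rngidx=9
runtime=30
num_threads=8
bidir_en=1
num_pkt=64
pkt_sz=64
pci_bus=06
pci_device=00
EOF
[xilinx@] dma-perf -c perf_config.txt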