Security of Alveo Platform

_images/XSA-shell.svg

Alveo shell mgmt and user components, data and control paths

Security is built into Alveo platform hardware and software architecture. The platform is made up of two physical partitions: an immutable Shell and user compiled User partition. This design allows end users to perform Dynamic Function eXchange (Partial Reconfiguration in classic FPGA terminology) in the well defined User partition while the static Shell provides key infrastructure services. Alveo shells assume PCIe host (with access to PF0) is part of Root-of-Trust. The following features reinforce security of the platform:

  1. Two physical function shell design

  2. Clearly classified trusted vs untrusted shell peripherals

  3. Signing of xclbins

  4. AXI Firewall

  5. Well-defined compute kernel execution model

  6. No direct access to PCIe TLP from User partition

  7. Treating User partition as untrused partition

Shell

The Shell provides core infrastructure to the Alveo platform. It includes hardened PCIe block which provides physical connectivity to the host PCIe bus via two physical functions as described in XRT and Vitis™ Platform Overview. The Shell is trusted partition of the Alveo platform and for all practical purposes should be treated like an ASIC. During system boot, the shell is loaded from the PROM. Once loaded, the Shell cannot be changed.

In the figure above, the Shell peripherals shaded blue can only be accessed from management physical function 0 (PF0) while those shaded violet can be accessed from user physical function 1 (PF1). From PCIe topology point of view PF0 owns the device and performs supervisory actions on the device. It is part of Root-of-Trust. Peripherals shaded blue are trusted while those shaded violet are not. Alveo shells use a specialized IP called PCIe Demux which routes PCIe traffic destined for PF0 to PF0 AXI network and those destined for PF1 to PF1 AXI network. It is responsible for the necessary isolation between PF0 and PF1.

Trusted peripherals includes ICAP for bitstream download (DFX), CMC for sensors and thermal management, Clock Wizards for clock scaling, QSPI Ctrl for PROM access (shell upgrades), DFX Isolation, Firewall controls and ERT UART.

All peripherals in the shell except XDMA/QDMA are slaves from PCIe point of view and cannot initiate PCIe transactions. Alveo shells have one of XDMA or QDMA PCIe DMA engine. Both XDMA and QDMA are regular PCIe scatter-gather DMA engine with a well defined programming model.

The Shell provides a control path and a data path to the user compiled image loaded on User partition. The Firewalls in control and data paths protect the Shell from un-trusted User partition. For example if a slave in DFX has a bug or is malicious the appropriate firewall will step in and protect the Shell from the failing slave as soon as a non compliant AXI transaction is placed on AXI bus.

Newer revisions of shell have a feature called sb which provides direct access to host memory from kernels in the User partition. With this feature kernels can initiate PCIe burst transfers from PF1 without direct access to PCIe bus. AXI Firewall (SI) in reverse direction protects PCIe from non-compliant transfers.

Note

Features sb and PCIe Peer-to-Peer (P2P) are not available in all shells.

For more information on firewall protection see Firewall section below.

For shell update see Shell Update section below.

Compatibility enforcement between Shell and User xclbin is described in platform_partitions

PCIe Topology

As mentioned before Alveo platforms have two physical function architecture where each function has its own BARs. The table below gives overview of the topology and functionality.

PF

BAR

Driver

Purpose

0

0

xclmgmt

Memory mapped access to privileged IPs in the shell as shown in the Figure above.

0

2

xclmgmt

Setup MSI-X vector table

1

0

xocl

Access to register maps of user compiled compute units in the DFX region

1

2

xocl

Memory mapped access to XDMA/QDMA PCIe DMA engine programming registers

1

4

xocl

CPU direct and P2P access to device attached DDR/HBM/PL-RAM memory. By default its size is limited to 256MB but can be expanded using XRT xbutil tool as described in PCIe Peer-to-Peer (P2P)

Sample output of Linux lspci command for U50 device below:

dx4300:~>lspci -vvv -d 10ee:
02:00.0 Processing accelerators: Xilinx Corporation Device 5020
        Subsystem: Xilinx Corporation Device 000e
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        NUMA node: 0
        Region 0: Memory at 20fd2000000 (64-bit, prefetchable) [size=32M]
        Region 2: Memory at 20fd4020000 (64-bit, prefetchable) [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: xclmgmt
        Kernel modules: xclmgmt

02:00.1 Processing accelerators: Xilinx Corporation Device 5021
        Subsystem: Xilinx Corporation Device 000e
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 66
        NUMA node: 0
        Region 0: Memory at 20fd0000000 (64-bit, prefetchable) [size=32M]
        Region 2: Memory at 20fd4000000 (64-bit, prefetchable) [size=128K]
        Region 4: Memory at 20fc0000000 (64-bit, prefetchable) [size=256M]
        Capabilities: <access denied>
        Kernel driver in use: xocl
        Kernel modules: xocl

dx4300:~>

Dynamic Function eXchange

User compiled image packaged as xclbin is loaded on the Dynamic Functional eXchange partition by the Shell. The image may be signed with a private key and its public key registered with Linux kernel keyring. The xclbin signature is validated by xclmgmt driver. This guarantees that only known good user compiled images are loaded by the Shell. The image load is itself effected by xclmgmt driver which binds to PF0. xclmgmt driver downloads the bitstream packaged in the bitstream section of xclbin by programming the ICAP peripheral. The management driver also discovers the target frequency of the User partition by reading the xclbin clock section and then programs the clocks which are controlled from Shell. DFX is exposed as one atomic ioctl by xclmgmt driver.

xclbin is a container which packs FPGA bitstream for the User partition and host of related metadata like clock frequencies, information about instantiated compute units, etc. The compute units typically expose a well defined register space on the PCIe BAR for access by XRT. An user compiled image does not have any physical path to directly interact with PCIe Bus. Compiled images do have access to device DDR.

More information on xclbin can be found in Binary Formats.

Xclbin Generation

Users compile their Verilog/VHDL/OpenCL/C/C++ design using Vitis™ compiler, v++ which also takes the shell specification as a second input. By construction the Vitis™ compiler, v++ generates image compatible with User partition of the shell. The compiler uses a technology called PR Verify to ensure that the user design physically confines itself to User partition and does not attempt to overwrite portions of the Shell. It also validates that all the IOs between the DFX and Shell are going through fixed pins exposed by Shell.

Signing of Xclbins

xclbin signing process is similar to signing of Linux kernel modules. xclbins can be signed by XRT xclbinutil utility. The signing adds a PKCS7 signature at the end of xclbin. The signing certificate is then registered with appropriate key-ring. When Linux is running in UEFI secure mode, signature verification is enforced using signing certification in system key-ring (when Linux is not running in secure mode there is no such verification).

Firewall

Alveo hardware design uses standard AXI bus. As shown in the figure the control path uses AXI-Lite and data path uses AXI4 full. Specialized hardware element called AXI Protocol Firewall monitors all transactions going across the bus into the un-trusted User partition. It is possible that one or more AXI slave in the DFX partition is not fully AXI-compliant or deadlocks/stalls/hangs during operation. When an AXI slave in DFX partition fails, AXI Firewall trips – it starts completing AXI transactions on behalf of the slave so the master and the specific AXI bus is not impacted – to protect the Shell. The AXI Firewall starts completing all transactions on behalf of misbehaving slave while also notifying the mgmt driver about the trip. The xclmgmt driver then starts taking recovery action. xclmgmt posts a XCL_MAILBOX_REQ_FIREWALL message to xocl using MailBox to inform the peer about FireWall trip. xocl can suggest a reset by sending a XCL_MAILBOX_REQ_HOT_RESET message to xclmgmt via mailBox. Note that even if no reset is performed the AXI Protocol Firewall will continue to protect the host PCIe bus. DFX partition will be unavailable till device is reset. A reboot of host is not required to reset the device.

Alveo boards with multiple FPGA devices on the same board like U30 support card level reset. Mailbox usage by each device on the card is similar to that of single device cards, however firewall trip in one device will trigger reset to all devices on the card.

AXI Firewall in Slave Interface (SI) mode also protects the host from errant transactions initiated by kernels over Slave Bridge. For example if an AXI master kernel in the Dynamic Region issues a non compliant AXI transaction like starting a burst transfer but stalling afterwards, the AXI Firewall (SI) will complete the transaction on behalf of the failing kernel. This protects PCIe from un-correctable errors.

PCIe Bus Safety

As explained in the Firewall section above PCIe bus is protected by AXI Firewalls on both control and data path. DFX Isolation only exposes AXI bus (AXI-Lite for control and AXI-Full for data paths) to the Dynamic Region. Kernels compiled by user which sit in Dynamic Region do not have direct access to PCIe bus and hence cannot generate TLP packets. This removes the risk of an errant User partition compromising the PCIe bus and taking over the host system. PCIe Demux IP ensures that all PCIe transactions mastered by device over P2P, XDMA/QDMA and SB data paths are only possible over PF1. This is critical for Pass-through Virtualization where host should not see any transactions initiated by PF1.

Deployment Models

In all deployment models PCIe host with access to PF0 is considered part of Root-of-Trust.

Baremetal

In Baremetal deployment model, both physical functions are visible to the end user who does not have root privileges. End user have access to both XRT xclmgmt and XRT xocl drivers. The system administrator trusts both drivers which provide well defined XCLMGMT (PCIe Management Physical Function) Driver Interfaces and XOCL (PCIe User Physical Function) Driver Interfaces. End user does have the privilege to load xclbins which should be signed for maximum security. This will ensure that only known good xclbins are loaded by end users.

Certain operations like resetting the board and upgrading the flash image on PROM (from which the shell is loaded on system boot) require root privileges and are effected by xclmgmt driver.

Pass-through Virtualization

In Pass-through Virtualization deployment model, management physical function (PF0) is only visible to the host but user physical function (PF1) is visible to the guest VM. Host considers the guest VM a hostile environment. End users in guest VM may be root and may be running modified implementation of XRT xocl driver – XRT xclmgmt driver does not trust XRT xocl driver. xclmgmt as described before exposes well defined XCLMGMT (PCIe Management Physical Function) Driver Interfaces to the host. In a good and clean deployment end users in guest VM interact with standard xocl using well defined XOCL (PCIe User Physical Function) Driver Interfaces.

As explained under the Shell section above, by design xocl has limited access to violet shaded Shell peripherals. This ensures that users in guest VM cannot perform any privileged operation like updating flash image or device reset. A user in guest VM can only perform operations listed under USER PF (PF1) section in XRT and Vitis™ Platform Overview.

A guest VM user can potentially crash a compute unit in User partition, deadlock data path AXI bus or corrupt device memory. If the user has root access he may compromise VM memory. But none of this can bring down the host or the PCIe bus. Host memory is protected by system IOMMU. Device reset and recovery is described below.

A user cannot load a malicious xclbin on the User partition since xclbin downloads are done by xclmgmt drive. xclbins are passed on to the host via a plugin based MPD/MSD framework defined in Mailbox Subdevice Driver. Host can add any extra checks necessary to validate xclbins received from guest VM.

This deployment model is ideal for public cloud where host does not trust the guest VM. This is the prevalent deployment model for FaaS operators.

Summary

Behavior

Deployment Model

Bare Metal

Pass-through

System admin trusts drivers

xocl

Yes

No

xclmgmt

Yes

Yes

End user has root access

xocl

No

Maybe

xclmgmt

No

No

End user can crash device

Yes

Yes

End user can crash PCIe bus

No

No

End user with root access can crash PCIe bus

Yes

No

Mailbox

Mailbox is used for communication between user physical function driver, xocl and management physical function driver, xclmgmt. The Mailbox hardware design and xclmgmt driver mailbox handling implementation has the ability to throttle requests coming from xocl driver.

xclmgmt driver has twofold security protections on the h/w mailbox. From packet layer, xclmgmt monitors the receiving packet rates and can enforce a threshold. If the receiving packet rates exceeds the threshold, the mailbox is disabled which prevents the guest from sending any more commands over mailbox. Only a hot reset on the FPGA device from xclmgmt can recover it. From message layer,system administrator can configure the xclmgmt driver to ignore specific mailbox opcodes.

Here is an example how System administrator managing the privileged management physical function driver xclmgmt can configure the mailbox to ignore specific opcodes using xbmgmt utility.

# In host
Host>$ sudo xbmgmt dump --config --output /tmp/config.ini -d bdf

# Edit the dumped ini file and change the value to key 'mailbox_channel_disable'
# eg. if both xclbin download and reset are to be disabled, one can set
# mailbox_channel_disable=0x120
# where 0x120 is 1 << XCL_MAILBOX_REQ_LOAD_XCLBIN |
#                1 << XCL_MAILBOX_REQ_HOT_RESET
# as defined as below
# XCL_MAILBOX_REQ_UNKNOWN =             0,
# XCL_MAILBOX_REQ_TEST_READY =          1,
# XCL_MAILBOX_REQ_TEST_READ =           2,
# XCL_MAILBOX_REQ_LOCK_BITSTREAM =      3,
# XCL_MAILBOX_REQ_UNLOCK_BITSTREAM =    4,
# XCL_MAILBOX_REQ_HOT_RESET =           5,
# XCL_MAILBOX_REQ_FIREWALL =            6,
# XCL_MAILBOX_REQ_LOAD_XCLBIN_KADDR =   7,
# XCL_MAILBOX_REQ_LOAD_XCLBIN =         8,
# XCL_MAILBOX_REQ_RECLOCK =             9,
# XCL_MAILBOX_REQ_PEER_DATA =           10,
# XCL_MAILBOX_REQ_USER_PROBE =          11,
# XCL_MAILBOX_REQ_MGMT_STATE =          12,
# XCL_MAILBOX_REQ_CHG_SHELL =           13,
# XCL_MAILBOX_REQ_PROGRAM_SHELL =       14,
# XCL_MAILBOX_REQ_READ_P2P_BAR_ADDR =   15,

Host>$ vi /tmp/config.ini

# Load config
Host>$ xbmgmt advanced --load-conf --input=/tmp/config.ini -d bdf

Mailbox Subdevice Driver has details on mailbox usage.

Device Reset and Recovery

Device reset and recovery is a privileged operation and can only be performed by xclmgmt driver. xocl driver can request device reset by sending a message to xclmgmt driver over the Mailbox. An end user can reset a device by using XRT xbutil utility. This utility talks to xocl driver which uses the reset message as defined in Mailbox Subdevice Driver

Currently Alveo boards are reset by using PCIe bus hot reset mechanism. This resets the board peripherals and also the PCIe link. As part of reset, drivers kill all the clients which have opened the device node by sending them a SIGBUS.

On some Alveo boards like u30, there are multiple FPGA devices supported with help of pcie bifurcation. The reset in this case is card level reset, which means, a reset issued from one FPGA device will result in all FPGAs on same board being reset. Both xocl and xclmgmt drivers can identify other FPGA devices on same board and handle the reset accordingly.

Shell Update

Shell update is like firmware update in conventional PCIe devices. Shell updates are distributed as signed RPM/DEB package files by Xilinx®. Shells may be upgraded using XRT xbmgmt utility by system administrators only. The upgrade process will update the PROM. A cold reboot of host is required in In order to boot the platform from the updated image.

Compute Kernel Execution Models

XRT and Alveo support software defined compute kernel execution models having standard AXI hardware interfaces. More details on XRT Controlled Kernel Execution Models. These well understood models do not require direct register access from user space. To execute a compute kernel XRT has a well defined exec command buffer API and a wait for exec completion API. These operations are exposed as ioctls by the xocl driver.