Servers in data centers host an increasing variety of PCIe devices, e.g. GPUs, NVMe SSDs, NICs, accelerator cards and FPGAs. For high throughput and low latency, CPU-bypass direct communication among PCIe devices (e.g. GPUDirect, NVMe-oF) is flourishing. However, many PCIe devices are designed to talk only to their drivers on the CPU, and the PCIe register and DMA interfaces are intricate and often undocumented. To capture PCIe packets and debug PCIe protocol implementations, developers need PCIe protocol analyzers, which are expensive (~$250K), hard to deploy in production environments, and unable to modify the PCIe TLP packets that pass through.

In this work, we design and implement a transparent PCIe debugger and gateway on a commodity FPGA-based PCIe board. The PCIe gateway captures packets bump-in-the-wire between a target PCIe device (e.g. a NIC) and the CPU. Because PCIe routing is fixed, an ARP-spoofing-like attack on the PCIe fabric is impossible. However, we can spoof the device driver into redirecting its PCIe traffic through our PCIe gateway. Communication between a PCIe device and the CPU falls into two categories according to the initiator.

The first category is memory-mapped I/O (MMIO) from CPU to device, in which the CPU accesses the memory region pointed to by a PCIe BAR. The device driver obtains the BAR address from a kernel routine, which we modify or hook. If the device ID matches the device we want to capture, we (see the sketch after this list):

  1. Allocate a spoof memory region from the BAR of the PCIe gateway.
  2. Set up a mapping on the PCIe gateway from the spoof physical address to the real BAR address of the target device, by sending this information to the PCIe gateway via MMIO.
  3. Return the spoof physical address instead of the real BAR address.
  4. When the target driver maps the BAR to a virtual address in kernel or user mode, the mapping actually points to the spoof physical address.
  5. When a kernel or user-mode process accesses the virtual address of the target device's PCIe BAR, it actually accesses the spoof memory region, so the MMIO TLP goes to the PCIe gateway. The gateway then forwards the MMIO TLP to the target device.
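
Below is a minimal sketch of such a hook, assuming a Linux driver that obtains its MMIO mapping via pci_ioremap_bar(). The gw_* helpers, the hook installation mechanism (e.g. kprobes or a patched kernel), and the example vendor/device IDs are our own illustrative assumptions, not existing kernel APIs:

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Hypothetical helpers exported by our gateway driver (not kernel APIs). */
    extern phys_addr_t gw_alloc_spoof_region(size_t len);  /* carve a region out of the gateway BAR */
    extern void gw_install_mapping(phys_addr_t spoof, phys_addr_t real,
                                   size_t len);            /* program the FPGA via MMIO */

    #define TARGET_VENDOR 0x15b3  /* example: Mellanox */
    #define TARGET_DEVICE 0x1013  /* example: ConnectX-4 */

    /* Hooked variant of pci_ioremap_bar(): for the target device, hand the
     * driver a mapping of the spoof region on the gateway instead of the
     * device's real BAR (steps 1-3); later kernel/user mappings of the BAR
     * then point at the gateway (steps 4-5). */
    void __iomem *hooked_pci_ioremap_bar(struct pci_dev *pdev, int bar)
    {
        phys_addr_t real = pci_resource_start(pdev, bar);
        size_t len = pci_resource_len(pdev, bar);
        phys_addr_t spoof;

        if (pdev->vendor != TARGET_VENDOR || pdev->device != TARGET_DEVICE)
            return pci_ioremap_bar(pdev, bar);  /* not the target: unchanged */

        spoof = gw_alloc_spoof_region(len);     /* step 1 */
        gw_install_mapping(spoof, real, len);   /* step 2 */
        return ioremap(spoof, len);             /* step 3: driver now talks to the gateway */
    }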

The second category is DMA from device to CPU. At first sight, there is no way to know which memory addresses the device will access. However, a well-behaved device should only access host memory regions allocated or mapped by its device driver. In Linux, there are two ways for a device driver to obtain a DMA-able memory region and its physical address (see the example after this list):

  1. Allocate a DMA-able memory region (e.g. the host buffers in the Catapult driver, the WQ and CQ in the Mellanox driver).
  2. Map a virtual memory region to physical memory (e.g. a user-allocated data buffer registered for RDMA).
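
Concretely, these two paths correspond to the standard Linux DMA API calls sketched below; the function is only an illustration of where the DMA (bus) addresses come from, not code from our system:

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    /* The two standard ways a Linux driver obtains a DMA address. */
    static int dma_paths_demo(struct pci_dev *pdev, void *buf, size_t len)
    {
        dma_addr_t ring_dma, buf_dma;
        void *ring;

        /* Path 1: allocate a fresh DMA-able region (descriptor rings, WQ/CQ, ...). */
        ring = dma_alloc_coherent(&pdev->dev, 4096, &ring_dma, GFP_KERNEL);
        if (!ring)
            return -ENOMEM;

        /* Path 2: map an existing buffer (e.g. user memory registered for RDMA). */
        buf_dma = dma_map_single(&pdev->dev, buf, len, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(&pdev->dev, buf_dma)) {
            dma_free_coherent(&pdev->dev, 4096, ring, ring_dma);
            return -EIO;
        }

        /* ring_dma and buf_dma are the addresses the driver hands to the
         * device; these are exactly the values our hooks replace. */
        return 0;
    }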

Similar to the MMIO scenario, we can hook these kernel routines to match the device ID, allocate a spoof memory region from the FPGA BAR, set up a mapping on the FPGA from the spoof address to the real host memory address, and finally return the spoof address instead of the real host physical address. When the target driver sends the physical address to the target device via some opaque protocol, it actually sends the spoof address in the FPGA BAR. The target device therefore regards the spoof address as a host memory address and sends DMA read/write TLPs to the PCIe gateway, which forwards them to the correct host memory address.
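
A corresponding sketch for the DMA path, again with hypothetical gw_* helpers and an is_target_device() ID check of our own; the real hook would wrap whichever allocation or mapping routines the target driver uses:

    #include <linux/dma-mapping.h>

    /* Hypothetical gateway helpers, as in the MMIO sketch above. */
    extern dma_addr_t gw_alloc_spoof_dma(size_t len);
    extern void gw_install_dma_mapping(dma_addr_t spoof, dma_addr_t real,
                                       size_t len);
    extern bool is_target_device(struct device *dev);  /* device-ID match */

    /* Hooked variant of dma_alloc_coherent(): the CPU-side buffer is real,
     * but the driver receives a spoof DMA address inside the gateway BAR,
     * so the target device's DMA TLPs are routed to the gateway first. */
    static void *hooked_dma_alloc_coherent(struct device *dev, size_t len,
                                           dma_addr_t *handle, gfp_t gfp)
    {
        void *cpu_addr = dma_alloc_coherent(dev, len, handle, gfp);

        if (cpu_addr && is_target_device(dev)) {
            dma_addr_t spoof = gw_alloc_spoof_dma(len);
            gw_install_dma_mapping(spoof, *handle, len);  /* FPGA forwards to real host memory */
            *handle = spoof;  /* driver passes the spoof address to the device */
        }
        return cpu_addr;
    }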

As long as the device driver uses standard kernel routines to access the BAR and map DMA memory, the PCIe gateway acts as a bidirectional transparent proxy between the CPU and a specified PCIe device. The PCIe gateway can then capture PCIe packets, or modify the PCIe TLP packets in flight.
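
As a conceptual model (not the actual FPGA implementation), the gateway's forwarding logic amounts to a small spoof-to-real address translation table applied to every TLP, with capture or modification happening at the rewrite point:

    #include <stdint.h>

    /* One entry of the gateway's spoof-to-real address map. */
    struct gw_map_entry {
        uint64_t spoof_base;  /* address in the gateway BAR seen by CPU or device */
        uint64_t real_base;   /* real device BAR or host memory address */
        uint64_t len;
    };

    /* For each incoming TLP address: rewrite and forward if it hits a
     * mapping; the TLP can be logged or modified here as well. */
    static uint64_t gw_translate(const struct gw_map_entry *tbl, int n,
                                 uint64_t addr)
    {
        for (int i = 0; i < n; i++)
            if (addr >= tbl[i].spoof_base &&
                addr - tbl[i].spoof_base < tbl[i].len)
                return tbl[i].real_base + (addr - tbl[i].spoof_base);
        return addr;  /* no match: pass through unchanged */
    }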

People

  • Bojie Li, 4th-year Ph.D. student at MSRA and USTC
  • Dr. Andrew Putnam, Principal Research Hardware Design Engineer in Microsoft Research Technologies (MSR-T)
  • Dr. Lintao Zhang, Principal Researcher in Microsoft Research Asia
