Hardware-based transports, such as RDMA, are becoming prevalent because of their low latency, high throughput, and low CPU overhead. However, current RDMA NICs have limited on-NIC memory for storing per-flow transport state. When the number of flows exceeds the memory capacity, the NIC must swap flow state out to host memory over PCIe, which degrades performance.
This paper presents a hardware-based transport without per-flow state. At its core, flow state bounces between the two end hosts along with a data packet, analogous to a thread whose state is always in flight. To enable multiple in-flight packets, each thread is assigned a distinct sequence of packets to send. Each thread can fork, throttle, and merge independently, which effectively simulates a window-based congestion control mechanism. For loss recovery, we design a single epoch-based loss detector shared by all flows; it enables selective retransmission, and its storage size is proportional to the number of packets lost in a round trip. When there are more losses than the NIC can handle, the receiver CPU is notified to recover the loss.
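To make the thread analogy concrete, the following is a minimal sketch of how state carried in a packet could define a distinct packet sequence, and how a fork could split that sequence between two threads. The `ThreadState` fields, the stride-based partitioning, and the function names are illustrative assumptions, not the encoding used by the paper; throttle and merge are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadState:
    """Hypothetical state carried inside a packet instead of on the NIC."""
    flow_id: int
    next_seq: int  # next packet sequence number this thread will send
    stride: int    # gap between consecutive packets owned by this thread


def advance(t: ThreadState) -> ThreadState:
    """Return the state to carry after the thread sends packet t.next_seq."""
    return ThreadState(t.flow_id, t.next_seq + t.stride, t.stride)


def fork(t: ThreadState) -> tuple[ThreadState, ThreadState]:
    """Split one thread into two whose sequences interleave (assumed scheme).

    Doubling the stride and offsetting the second thread by the old stride
    keeps the two sequences disjoint while covering the parent's sequence.
    """
    doubled = t.stride * 2
    return (
        ThreadState(t.flow_id, t.next_seq, doubled),
        ThreadState(t.flow_id, t.next_seq + t.stride, doubled),
    )


def packets(t: ThreadState, count: int) -> list[int]:
    """List the next `count` sequence numbers the thread would send."""
    sent = []
    for _ in range(count):
        sent.append(t.next_seq)
        t = advance(t)
    return sent
```

Under these assumptions, forking a thread doubles the number of packets a flow can have in flight, much like growing a congestion window, while neither endpoint keeps a per-flow table: everything needed to send the next packet travels with the previous one.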
We design and implement RDMA, TCP, and TLS transports without per-flow state in an FPGA prototype. The transports incur little network bandwidth and CPU overhead. Simulations and testbed experiments show that flows share network bandwidth fairly in a multi-bottleneck network and that the transports solve the incast problem even better than DCTCP and DCQCN. With a large number of concurrent flows, the throughput of our stateless hardware-based TLS transport is 100x that of a stateful hardware-based transport and 50x that of a software-based transport.