Accelerators such as GPUs, TPUs and FPGAs are deployed at scale in data centers to accelerate many online serving applications, e.g. machine learning inference, image processing, encryption and compression. These applications typically receive a request from the network, pre-process it, call a computationally intensive routine, post-process the result and finally send a response back to the network. With accelerators, the computationally intensive routine is replaced by an RPC to the accelerator device. Here a challenge arises: what should the CPU do while waiting for the accelerator?
The traditional approach is to relinquish the CPU after sending the offloading request, so that the OS scheduler switches to another thread. However, a context switch in Linux takes several microseconds, and a fine-grained offloaded task itself takes only several to tens of microseconds, so it soon completes and wakes the thread up again. The context switch overhead not only wastes CPU cycles, but also adds thread wake-up latency to the request processing latency. A second approach is to busy-wait until the offloaded task completes, which wastes the CPU for the entire duration of the task. A third approach is to rewrite the application to do other jobs within the thread while waiting for the accelerator, which requires modifying application code. In this work, we build a library that transparently finds and executes non-conflicting jobs within the thread, without modifying application code.
Observing that most concurrent applications use an event-driven programming style, we propose transparent coroutines, in which the processing of each OS event (e.g. a received network message) is treated as a coroutine. When the application issues an offloaded operation, the stack and registers of the current coroutine are saved. We then process either another OS event or the completion of an offloaded operation. For an OS event, a new coroutine is created. For an offload completion, the corresponding coroutine is restored and executes the post-processing.
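The control flow described above can be illustrated with a minimal sketch. It is not the library itself: it models each coroutine as a Python generator rather than a saved stack and register set, and the names handle_event, fake_accelerator, advance and run_events are hypothetical. The scheduling policy (start all pending events first, then deliver completions) is likewise an assumption made only to keep the example short.

```python
from collections import deque

def handle_event(event, log):
    """Hypothetical application handler: pre-process, offload, post-process."""
    log.append(("pre", event))
    result = yield event              # issue the offload; the coroutine is suspended here
    log.append(("post", event, result))

def fake_accelerator(payload):
    """Stand-in for the accelerator (assumption: it simply upper-cases a string)."""
    return payload.upper()

def advance(coro, value, in_flight):
    """Resume a coroutine; if it issues an offload, park it until completion."""
    try:
        payload = coro.send(value)    # send(None) starts a fresh generator
    except StopIteration:
        return                        # the handler ran to completion
    in_flight.append((coro, payload))

def run_events(events):
    log = []
    pending = deque(events)           # OS events not yet handed to the application
    in_flight = deque()               # suspended coroutines awaiting an offload result
    while pending or in_flight:
        if pending:
            # OS event: create a new coroutine and run it up to its offload.
            advance(handle_event(pending.popleft(), log), None, in_flight)
        else:
            # Offload completion: restore the coroutine to run post-processing.
            coro, payload = in_flight.popleft()
            advance(coro, fake_accelerator(payload), in_flight)
    return log
```

Calling run_events(["a", "b"]) runs the pre-processing of both events before either post-processing step, which is the overlap that replaces blocking or busy waiting in a single thread.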
The challenge originates from the out-of-order execution of post-processing and new coroutines. For example, if a connection receives two consecutive messages, and the processing of the second message depends on the post-processing result of the first, then the application will misbehave if the second message is processed before the first has finished post-processing. To this end, we need to identify conflicting OS events and withhold them from the application until the conflicting coroutines complete. Conflicts originate from global state changed by a coroutine during post-processing. For most applications, we observe that such global state either belongs to a specific connection (i.e. file descriptor) or is protected by locks. In light of this, we partition events by file descriptor and conceptually create one coroutine to process the events of each file descriptor. If a coroutine is blocked on a lock held by another coroutine, it is paused until the lock is released.
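The per-file-descriptor partitioning can be sketched as follows, again with generators standing in for coroutines. This is an illustrative sketch under stated assumptions: the names handle_message and run_partitioned are hypothetical, offloads complete in the order they were issued, and lock handling is omitted. An event is withheld while an earlier coroutine on the same file descriptor is still in flight.

```python
from collections import defaultdict, deque

def handle_message(fd, msg, log):
    """Hypothetical handler for one message on one connection."""
    log.append(("pre", fd, msg))
    yield msg                         # offload issued; the coroutine is suspended
    log.append(("post", fd, msg))

def run_partitioned(events):
    log = []
    queues = defaultdict(deque)       # withheld events, partitioned by file descriptor
    for fd, msg in events:
        queues[fd].append(msg)
    busy = set()                      # descriptors with a coroutine in flight
    in_flight = deque()               # (fd, suspended coroutine)
    while in_flight or any(queues.values()):
        # Report at most one event per idle descriptor; conflicting ones stay queued.
        for fd, q in queues.items():
            if q and fd not in busy:
                coro = handle_message(fd, q.popleft(), log)
                next(coro)            # run pre-processing up to the offload
                busy.add(fd)
                in_flight.append((fd, coro))
        # Deliver the oldest offload completion (assumed to complete in issue order).
        fd, coro = in_flight.popleft()
        try:
            coro.send("done")
            in_flight.append((fd, coro))   # the handler issued another offload
        except StopIteration:
            busy.discard(fd)          # descriptor is idle; its next event may be reported
    return log
```

With the input [(3, "m1"), (3, "m2"), (4, "n1")], the second message on descriptor 3 starts only after the first has finished post-processing, while the message on descriptor 4 is free to overlap with both.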
We are evaluating transparent offload on the OpenSSL encryption library used by Nginx and lighttpd, as well as on a DNN inference application. We expect the applications to behave correctly without code modification, and we expect transparent offload to outperform both the context switch and busy-waiting approaches.