Fault tolerance is critical for distributed applications. Many request serving and batch processing frameworks have been proposed to simplify programming of fault tolerant distributed systems, which basically ask the programmers to separate states from computation and store states in a fault-tolerant system. However, many existing applications (e.g. Node.js, Memcached and Python in Tensorflow) do not support fault tolerance, and fault tolerant systems are often slower than their non-fault-tolerant counterparts. In this work, we take up the challenge of achieving transparent and efficient fault tolerance for general distributed applications. Challenges include process migration, deterministic replay and distributed snapshot.
First, fault tolerance at different abstraction levels have trade-offs. Fault tolerance at architecture level requires specialized hardware. Fault tolerance at VM level observes all network communication to be bi-directional and cannot capture high-level semantics such as inter-process communication. Fault tolerance at system call level needs modifications to the OS kernel to migrate a process, i.e., extract process states from the source host and inject them to the destination host. Process migration in Linux is complicated because states of multiple processes are fused in a monolithic kernel. The Unikernel approach fails to support many inter-process communication mechanisms, e.g. semaphores. To this end, we adopt the SocksDirect architecture and build a distributed library OS in user space that is compatible with standard Linux APIs, so that the snapshot of a process captures states of both the library and the application, while preserving high-level semantics for optimization.
Second, state machine replication (SMR) and snapshot-replay are two major approaches to achieve fault tolerance. SMR requires at least two hosts to execute exactly the same application, thus introducing CPU overhead. Snapshot-based systems typically buffer output of the application during the interval between two adjacent snapshots, because when a host fails, the system cannot guarantee deterministic execution since last snapshot. This so-called output commit problem introduces significant request serving latency to transparent fault-tolerant systems. Alternatively, logging all non-deterministic events of an application still involves significant overhead. To this end, FTLinux predicts the non-deterministic events of an application according to its recent execution history. If the prediction is correct, the application continues. Otherwise, FTLinux waits a short period for the prediction to come true, because many uncertainties originate from minute timing fluctuations. On timeout, FTLinux logs the mispredicted event.
We plan to design and implement FTLinux on commodity servers running Linux kernel. We plan to evaluate FTLinux using both request serving and batch processing applications. For request serving applications such as Nginx, Node.js, Memcached and SQLite, FTLinux is expected to achieve transparent fault tolerance with negligible request latency and CPU overhead. We expect that failure of one host does not affect the remaining part of the system, and the failed host can recover quickly. For batch processing applications such as GraphX, Apache Storm and Tensorflow, FTLinux is also expected to demonstrate low CPU overhead and fast recovery. It is worth noting that the fault tolerance overhead and recovery speed of FTLinux is even better than the built-in fault tolerance mechanisms of GraphX, Apache Storm and Tensorflow. To the best of our knowledge, FTLinux is the first transparent and efficient fault tolerant system for general distributed applications on commodity Linux servers.