- ASPLOSServerless Computing on Heterogeneous ComputersDong Du, Qingyuan Liu, Xueqiang Jiang, and 3 more authorsIn Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
Existing serverless computing platforms are built upon homogeneous computers, limiting the function density and restricting serverless computing to limited scenarios. We introduce Molecule, the first serverless computing system utilizing heterogeneous computers. Molecule enables both general-purpose devices (e.g., Nvidia DPU) and domain-specific accelerators (e.g., FPGA and GPU) for serverless applications that significantly improve function density (50% higher) and application performance (up to 34.6x). To achieve these results, we first propose XPU-Shim, a distributed shim to bridge the gap between underlying multi-OS systems (when using general-purpose devices) and our serverless runtime (i.e., Molecule). We further introduce vectorized sandbox, a sandbox abstraction to abstract hardware heterogeneity (when using domain-specific accelerators). Moreover, we also review state-of-the-art serverless optimizations on startup and communication latency and overcome the challenges to implement them on heterogeneous computers. We have implemented Molecule on real platforms with Nvidia DPUs and Xilinx FPGAs and evaluate it using benchmarks and real-world applications.
- TOCSBoosting Inter-Process Communication with Architectural SupportYubin Xia, Dong Du, Zhichao Hua, and 3 more authorsACM Trans. Comput. Syst. Jul 2022
IPC (inter-process communication) is a critical mechanism for modern OSes, including not only microkernels such as seL4, QNX, and Fuchsia where system functionalities are deployed in user-level processes, but also monolithic kernels like Android where apps frequently communicate with plenty of user-level services. However, existing IPC mechanisms still suffer from long latency. Previous software optimizations of IPC usually cannot bypass the kernel that is responsible for domain switching and message copying/remapping across different address spaces; hardware solutions such as tagged memory or capability replace page tables for isolation, but usually require non-trivial modification to existing software stack to adapt to the new hardware primitives. In this article, we propose a hardware-assisted OS primitive, XPC (Cross Process Call), for efficient and secure synchronous IPC. XPC enables direct switch between IPC caller and callee without trapping into the kernel and supports secure message passing across multiple processes without copying. We have implemented a prototype of XPC based on the ARM AArch64 with Gem5 simulator and RISC-V architecture with FPGA boards. The evaluation shows that XPC can reduce IPC call latency from 664 to 21 cycles, 14\texttimes–123\texttimes improvement on Android Binder (ARM), and improve the performance of real-world applications on microkernels by 1.6\texttimes on Sqlite3.
- OSDIScalable Memory Protection in the PENGLAI EnclaveErhu Feng, Xu Lu, Dong Du, and 5 more authorsIn 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) Jul 2021
- ASPLOSCatalyzer: Sub-Millisecond Startup for Serverless Computing with Initialization-Less BootingDong Du, Tianyi Yu, Yubin Xia, and 5 more authorsIn Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Jul 2020
Serverless computing promises cost-efficiency and elasticity for high-productive software development. To achieve this, the serverless sandbox system must address two challenges: strong isolation between function instances, and low startup latency to ensure user experience. While strong isolation can be provided by virtualization-based sandboxes, the initialization of sandbox and application causes non-negligible startup overhead. Conventional sandbox systems fall short in low-latency startup due to their application-agnostic nature: they can only reduce the latency of sandbox initialization through hypervisor and guest kernel customization, which is inadequate and does not mitigate the majority of startup overhead.This paper proposes Catalyzer, a serverless sandbox system design providing both strong isolation and extremely fast function startup. Instead of booting from scratch, Catalyzer restores a virtualization-based function instance from a well-formed checkpoint image and thereby skips the initialization on the critical path (init-less). Catalyzer boosts the restore performance by on-demand recovering both user-level memory state and system state. We also propose a new OS primitive, sfork (sandbox fork), to further reduce the startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer removes the initialization cost by reusing state, which enables general optimizations for diverse serverless functions. The evaluation shows that Catalyzer reduces startup latency by orders of magnitude, achieves < 1ms latency in the best case, and significantly reduces the end-to-end latency for real-world workloads. Catalyzer has been adopted by Ant Financial, and we also present lessons learned from industrial development.
- ACM SoCCCharacterizing Serverless Platforms with ServerlessbenchTianyi Yu, Qingyuan Liu, Dong Du, and 6 more authorsIn Proceedings of the 11th ACM Symposium on Cloud Computing Jul 2020
Serverless computing promises auto-scalability and cost-efficiency (in "pay-as-you-go" manner) for high-productive software development. Because of its virtue, serverless computing has motivated increasingly new applications and services in the cloud. This, however, also presents new challenges including how to efficiently design high-performance serverless platforms and how to efficiently program on the platforms.This paper proposes ServerlessBench, an open-source benchmark suite for characterizing serverless platforms. It includes test cases exploring characteristic metrics of serverless computing, e.g., communication efficiency, startup latency, stateless overhead, and performance isolation. We have applied the benchmark suite to evaluate the most popular serverless computing platforms, including AWS Lambda, Open-Whisk, and Fn, and present new serverless implications from the study. For example, we show scenarios where decoupling an application into a composition of serverless functions can be beneficial in cost-saving and performance, and that the "stateless" property in serverless computing can hurt the execution performance of serverless functions. These implications form several design guidelines, which may help platform designers to optimize serverless platforms and application developers to design their functions best fit to the platforms.
- SOSPUsing Concurrent Relational Logic with Helpers for Verifying the AtomFS File SystemMo Zou, Haoran Ding, Dong Du, and 3 more authorsIn Proceedings of the 27th ACM Symposium on Operating Systems Principles Jul 2019
Concurrent file systems are pervasive but hard to correctly implement and formally verify due to nondeterministic interleavings. This paper presents AtomFS, the first formally-verified, fine-grained, concurrent file system, which provides linearizable interfaces to applications. The standard way to prove linearizability requires modeling linearization point of each operation—the moment when its effect becomes visible atomically to other threads. We observe that path inter-dependency, where one operation (like rename) breaks the path integrity of other operations, makes the linearization point external and thus poses a significant challenge to prove linearizability.To overcome the above challenge, this paper presents Concurrent Relational Logic with Helpers (CRL-H), a framework for building verified concurrent file systems. CRL-H is made powerful through two key contributions: (1) extending prior approaches using fixed linearization points with a helper mechanism where one operation of the thread can logically help other threads linearize their operations; (2) combining relational specifications and rely/guarantee conditions for relational and compositional reasoning. We have successfully applied CRL-H to verify the linearizability of AtomFS directly in C code. All the proofs are mechanized in Coq. Evaluations show that AtomFS speeds up file system workloads by utilizing fine-grained, multicore concurrency.
- ISCAXPC: Architectural Support for Secure and Efficient Cross Process CallDong Du, Zhichao Hua, Yubin Xia, and 2 more authorsIn Proceedings of the 46th International Symposium on Computer Architecture Jul 2019
Microkernel has many intriguing features like security, fault-tolerance, modularity and customizability, which recently stimulate a resurgent interest in both academia and industry (including seL4, QNX and Google’s Fuchsia OS). However, IPC (inter-process communication), which is known as the Achilles’ Heel of microkernels, is still the major factor for the overall (poor) OS performance. Besides, IPC also plays a vital role in monolithic kernels like Android Linux, as mobile applications frequently communicate with plenty of user-level services through IPC. Previous software optimizations of IPC usually cannot bypass the kernel which is responsible for domain switching and message copying/remapping; hardware solutions like tagged memory or capability replace page tables for isolation, but usually require non-trivial modification to existing software stack to adapt the new hardware primitives. In this paper, we propose a hardware-assisted OS primitive, XPC (Cross Process Call), for fast and secure synchronous IPC. XPC enables direct switch between IPC caller and callee without trapping into the kernel, and supports message passing across multiple processes through the invocation chain without copying. The primitive is compatible with the traditional address space based isolation mechanism and can be easily integrated into existing microkernels and monolithic kernels. We have implemented a prototype of XPC based on a Rocket RISC-V core with FPGA boards and ported two microkernel implementations, seL4 and Zircon, and one monolithic kernel implementation, Android Binder, for evaluation. We also implement XPC on GEM5 simulator to validate the generality. The result shows that XPC can reduce IPC call latency from 664 to 21 cycles, up to 54.2x improvement on Android Binder, and improve the performance of real-world applications on microkernels by 1.6x on Sqlite3 and 10x on an HTTP server with minimal hardware resource cost.
- JCSTSplitPass: A Mutually Distrusting Two-Party Password ManagerJournal of Computer Science and Technology Jul 2018
- Usenix ATCEPTI: Efficient Defence against Meltdown Attack for Unpatched VMsZhichao Hua, Dong Du, Yubin Xia, and 2 more authorsIn Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference Jul 2018
The Meltdown vulnerability, which exploits the inherent out-of-order execution in common processors like x86, ARM and PowerPC, has shown to break the fundamental isolation boundary between user and kernel space. This has stimulated a non-trivial patch to modern OS to separate page tables for user space and kernel space, namely, KPTI (kernel page table isolation). While this patch stops kernel memory leakages from rouge user processes, it mandates users to patch their kernels (usually requiring a reboot), and is currently only available on the latest versions of OS kernels. Further, it also introduces non-trivial performance overhead due to page table switching during user/kernel crossings.In this paper, we present EPTI, an alternative approach to defending against the Meltdown attack for unpatched VMs (virtual machines) in cloud, yet with better performance than KPTI. Specifically, instead of using two guest page tables, we use two EPTs (extended page tables) to isolate user space and kernel space, and unmap all the kernel space in user’s EPT to achieve the same effort of KPTI. The switching of EPTs is done through a hardware-support feature called EPT switching within guest VMs without hypervisor involvement. Meanwhile, EPT switching does not flush TLB since each EPT has its own TLB, which further reduces the overhead. We have implemented our design and evaluated it on Intel Kaby Lake CPU with different versions of Linux kernel. The results show that EPTI only introduces up to 13% overhead, which is around 45% less than KPTI.