Publications
Publications on system software and computer architecture (# indicates the corresponding author).
2024
- [SoCC] On-demand and Parallel Checkpoint/Restore for GPU Applications. Yanning Yang, Dong Du (#), Haitao Song, and 1 more author. In Proceedings of the 2024 ACM Symposium on Cloud Computing, 2024.
Leveraging serverless computing for cloud-based machine learning services is on the rise, promising cost-efficiency and flexibility that are crucial for ML applications relying on high-performance GPUs and substantial memory. However, despite modern serverless platforms handling diverse devices like GPUs seamlessly on a pay-as-you-go basis, a longstanding challenge remains: startup latency, a well-studied issue when serverless is CPU-centric. For example, initializing GPU apps with minor GPU models, like MobileNet, demands several seconds. For more intricate models such as GPT-2, startup latency can escalate to around 10 seconds, vastly overshadowing the short computation time for GPU-based inference. Prior solutions tailored for CPU serverless setups, like fork() and Checkpoint/Restore, cannot be directly and effectively applied due to differences between CPUs and GPUs. This paper presents gCROP (GPU Checkpoint/Restore made On-demand and Parallel), the first GPU runtime that achieves <100ms startup latency for GPU apps with up to 774 million parameters (3.1GB GPT-2-Large model). The key insight behind gCROP is to selectively restore essential states on demand and in parallel during boot from a prepared checkpoint image. To this end, gCROP first introduces a global service, GPU Restore Server, which can break the existing barrier between restore stages and achieve parallel restore. Besides, gCROP leverages both CPU and GPU page faults, and can on-demand restore both CPU and GPU data with profile-guided order to mitigate costs caused by faults. Moreover, gCROP designs a multi-checkpoint mechanism to increase the common contents among checkpoint images and utilizes deduplication to reduce storage costs. Implementation and evaluations on AMD GPUs show significant improvement in startup latency, 6.4x-24.7x compared with booting from scratch and 3.9x-23.5x over the state-of-the-art method (CRIU).
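The CPU-side half of this on-demand restore idea can be illustrated with Linux's userfaultfd, which lets a user-space thread resolve page faults lazily. The sketch below is not gCROP itself (gCROP additionally handles GPU page faults, a GPU Restore Server, and AMD GPU state); it is only a minimal, self-contained illustration that fills a page from a stand-in buffer the first time the page is touched. It assumes a kernel that permits unprivileged userfaultfd; compile with `cc -pthread`.

```c
/* Minimal on-demand page "restore" with Linux userfaultfd (illustrative only). */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd;
static long page_size;

/* Fault handler: populates each page only when it is first touched.
 * A real restore path would read the page from a checkpoint image here. */
static void *fault_handler(void *arg) {
    (void)arg;
    char *page = aligned_alloc(page_size, page_size);
    memset(page, 'A', page_size);                 /* stand-in for checkpointed data */
    for (;;) {
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        poll(&pfd, 1, -1);
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) <= 0 || msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(unsigned long)(page_size - 1),
            .src = (unsigned long)page,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);          /* resolve the fault on demand */
    }
    return NULL;
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) { perror("userfaultfd"); return 1; }

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    size_t len = 16 * page_size;
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t tid;
    pthread_create(&tid, NULL, fault_handler, NULL);

    /* First access triggers the on-demand "restore" of just this page. */
    printf("page 3 starts with: %c\n", region[3 * page_size]);
    return 0;
}
```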
@inproceedings{10.1145/3698038.3698510, author = {Yang, Yanning and Du, Dong and Song, Haitao and Xia, Yubin}, title = {On-demand and Parallel Checkpoint/Restore for GPU Applications}, year = {2024}, isbn = {9798400712869}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3698038.3698510}, doi = {10.1145/3698038.3698510}, booktitle = {Proceedings of the 2024 ACM Symposium on Cloud Computing}, pages = {415–433}, numpages = {19}, keywords = {Checkpoint and Restore, Cloud Computing, GPUs, Startup Latency}, location = {Redmond, WA, USA}, series = {SoCC '24} }
- [ATC] Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu. Qingyuan Liu, Yanning Yang, Dong Du (#), and 5 more authors. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), Jul 2024.
@inproceedings{298488, author = {Liu, Qingyuan and Yang, Yanning and Du, Dong and Xia, Yubin and Zhang, Ping and Feng, Jia and Larus, James R. and Chen, Haibo}, title = {Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu}, booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)}, year = {2024}, isbn = {978-1-939133-41-0}, address = {Santa Clara, CA}, pages = {1--17}, url = {https://www.usenix.org/conference/atc24/presentation/liu-qingyuan}, publisher = {USENIX Association}, month = jul }
- [OSDI] Using Dynamically Layered Definite Releases for Verifying the RefFS File System. Mo Zou, Dong Du, Mingkai Dong, and 1 more author. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Jul 2024.
@inproceedings{298732, author = {Zou, Mo and Du, Dong and Dong, Mingkai and Chen, Haibo}, title = {Using Dynamically Layered Definite Releases for Verifying the {RefFS} File System}, booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)}, year = {2024}, isbn = {978-1-939133-40-3}, address = {Santa Clara, CA}, pages = {629--648}, url = {https://www.usenix.org/conference/osdi24/presentation/zou}, publisher = {USENIX Association}, month = jul }
- [ISCA] sNPU: Trusted Execution Environments on Integrated NPUs. Erhu Feng, Dahu Feng, Dong Du (#), and 2 more authors. In 51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024.
@inproceedings{DBLP:conf/isca/FengFDXC24, author = {Feng, Erhu and Feng, Dahu and Du, Dong and Xia, Yubin and Chen, Haibo}, title = {sNPU: Trusted Execution Environments on Integrated NPUs}, booktitle = {51st {ACM/IEEE} Annual International Symposium on Computer Architecture, {ISCA} 2024, Buenos Aires, Argentina, June 29 - July 3, 2024}, pages = {708--723}, publisher = {{IEEE}}, year = {2024}, url = {https://doi.org/10.1109/ISCA59077.2024.00057}, doi = {10.1109/ISCA59077.2024.00057}, timestamp = {Fri, 16 Aug 2024 20:48:15 +0200}, biburl = {https://dblp.org/rec/conf/isca/FengFDXC24.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- [ASPLOS] sIOPMP: Scalable and Efficient I/O Protection for TEEs. Erhu Feng, Dahu Feng, Dong Du (#), and 4 more authors. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024.
Trusted Execution Environments (TEEs), like Intel SGX/TDX, AMD SEV-SNP, ARM TrustZone/CCA, have been widely adopted in prevailing architectures. However, these TEEs typically do not consider I/O isolation (e.g., defending against malicious DMA requests) as a first-class citizen, which may degrade the I/O performance. Traditional methods like using IOMMU or software I/O can degrade throughput by at least 20% for I/O intensive workloads. The main reason is that the isolation requirements for I/O devices differ from CPU ones. This paper proposes a novel I/O isolation mechanism for TEEs, named sIOPMP (scalable I/O Physical Memory Protection), with three key features. First, we design a Multi-stage-Tree-based checker, supporting more than 1,000 hardware regions. Second, we classify the devices into hot and cold, and support unlimited devices with the mountable entry. Third, we propose a remapping mechanism to switch devices between hot and cold status for dynamic I/O workloads. Evaluation results show that sIOPMP introduces only negligible performance overhead for both benchmarks and real-world workloads, and improves network throughput by 20%–38% compared with IOMMU-based mechanisms or software I/O adopted in TEEs.
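As a rough software analogue of the checker described above, the sketch below validates a device's DMA request against a per-device table of permitted physical regions, using a binary search in place of the paper's multi-stage tree walk. It is only an illustration of the lookup-and-check idea, not the hardware design (no mountable entries or hot/cold remapping); the names `region_t` and `dma_allowed` are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One protected physical-memory region and the permissions a device holds on it. */
typedef struct {
    uint64_t base;   /* inclusive start of the region */
    uint64_t size;   /* region length in bytes */
    uint8_t  perm;   /* bit 0: read allowed, bit 1: write allowed */
} region_t;

/* Per-device table of allowed regions, kept sorted by base address so a
 * lookup can binary-search it -- a (much simplified) software stand-in for
 * the tree walk a hardware checker would perform. */
typedef struct {
    region_t *regions;
    size_t    count;
} device_table_t;

static bool dma_allowed(const device_table_t *t, uint64_t addr,
                        uint64_t len, bool is_write) {
    size_t lo = 0, hi = t->count;
    while (lo < hi) {                       /* find last region with base <= addr */
        size_t mid = lo + (hi - lo) / 2;
        if (addr < t->regions[mid].base)
            hi = mid;
        else
            lo = mid + 1;
    }
    if (lo == 0) return false;              /* below every region */
    const region_t *r = &t->regions[lo - 1];
    if (addr + len > r->base + r->size) return false;   /* spills out of region */
    return is_write ? (r->perm & 2) : (r->perm & 1);
}

int main(void) {
    region_t regs[] = {
        { 0x80000000, 0x10000, 3 },  /* RW window */
        { 0x90000000, 0x01000, 1 },  /* RO page   */
    };
    device_table_t dev = { regs, 2 };
    printf("%d\n", dma_allowed(&dev, 0x80001000, 64, true));   /* 1: allowed  */
    printf("%d\n", dma_allowed(&dev, 0x90000000, 16, true));   /* 0: RO only  */
    return 0;
}
```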
@inproceedings{10.1145/3620665.3640378, author = {Feng, Erhu and Feng, Dahu and Du, Dong and Xia, Yubin and Zheng, Wenbin and Zhao, Siqi and Chen, Haibo}, title = {sIOPMP: Scalable and Efficient I/O Protection for TEEs}, year = {2024}, isbn = {9798400703850}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3620665.3640378}, doi = {10.1145/3620665.3640378}, booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2}, pages = {1061–1076}, numpages = {16}, location = {La Jolla, CA, USA}, series = {ASPLOS '24} }
2023
- [MICRO] Accelerating Extra Dimensional Page Walks for Confidential Computing. Dong Du, Bicheng Yang, Yubin Xia, and 1 more author. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023.
To support highly scalable and fine-grained computing paradigms such as microservices and serverless computing better, modern hardware-assisted confidential computing systems, such as Intel TDX and ARM CCA, introduce permission table to achieve fine-grained and scalable memory isolation among different domains. However, it also adds an extra dimension to page walks besides page tables, leading to significantly more memory references (e.g., 4 → 12 for RISC-V Sv39). We observe that most costs (about 75%) caused by the extra dimension of page walks are used to validate page table pages. Based on this observation, this paper proposes HPMP (Hybrid Physical Memory Protection), a hardware-software co-design (on RISC-V) that protects page table pages using segment registers and normal pages using permission tables to balance scalability and performance. We have implemented HPMP and Penglai-HPMP (a TEE system based on HPMP) on FPGA with two RISC-V cores (both in-order and out-of-order). Evaluation results show that HPMP can reduce costs by 23.1%–73.1% on BOOM and significantly improve performance on real-world applications, including serverless computing (FunctionBench) and Redis.
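The hybrid check can be pictured as follows: page-table pages are validated against a handful of segment (base/bound) registers, while normal pages fall back to a per-page permission lookup. The C sketch below is only a software model of that decision under simplified assumptions (one domain, one permission bit per page), not the RISC-V hardware described in the paper; all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NUM_SEG    4            /* a few segment registers for page-table pages */

/* Contiguous physical range owned by one domain, used for its page-table pages. */
typedef struct { uint64_t base, bound; } segment_t;

static segment_t seg[NUM_SEG] = {
    { 0x80000000, 0x80040000 },  /* the domain's page-table pool */
};

/* One permission bit per physical page for "normal" data pages
 * (a real permission table would hold per-domain permissions). */
static uint8_t perm_bitmap[1 << 16];

/* Hybrid check: page-table pages hit the cheap segment comparison,
 * everything else falls back to the permission-table lookup. */
static bool access_ok(uint64_t paddr, bool is_page_table_page) {
    if (is_page_table_page) {
        for (int i = 0; i < NUM_SEG; i++)
            if (paddr >= seg[i].base && paddr < seg[i].bound)
                return true;
        return false;
    }
    uint64_t pfn = paddr >> PAGE_SHIFT;
    return perm_bitmap[pfn >> 3] & (1u << (pfn & 7));
}

int main(void) {
    uint64_t data_page = 0x00200000;
    uint64_t pfn = data_page >> PAGE_SHIFT;
    perm_bitmap[pfn >> 3] |= 1u << (pfn & 7);          /* grant the data page */
    printf("%d %d\n", access_ok(0x80001000, true),     /* 1: inside a segment */
                      access_ok(data_page, false));    /* 1: bitmap grants it */
    return 0;
}
```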
@inproceedings{10.1145/3613424.3614293, author = {Du, Dong and Yang, Bicheng and Xia, Yubin and Chen, Haibo}, title = {Accelerating Extra Dimensional Page Walks for Confidential Computing}, year = {2023}, isbn = {9798400703294}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3613424.3614293}, doi = {10.1145/3613424.3614293}, booktitle = {Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture}, pages = {654–669}, numpages = {16}, location = {Toronto, ON, Canada}, series = {MICRO '23} }
- [HPCA] Efficient Distributed Secure Memory with Migratable Merkle Tree. Erhu Feng, Dong Du (#), Yubin Xia, and 1 more author. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023.
@inproceedings{10071130, author = {Feng, Erhu and Du, Dong and Xia, Yubin and Chen, Haibo}, booktitle = {2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)}, title = {Efficient Distributed Secure Memory with Migratable Merkle Tree}, year = {2023}, volume = {}, number = {}, pages = {347-360}, doi = {10.1109/HPCA56546.2023.10071130} }
- [ACM SoCC] The Gap Between Serverless Research and Real-World Systems. Qingyuan Liu, Dong Du (#), Yubin Xia, and 2 more authors. In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023.
With the emergence of the serverless computing paradigm in the cloud, researchers have explored many challenges of serverless systems and proposed solutions such as snapshot-based booting. However, we have noticed that some of these optimizations are based on oversimplified assumptions that lead to infeasibility and hide real-world issues. This paper aims to analyze the gap between current serverless research and real-world systems from a perspective of industry, and present new observations, challenges, opportunities, and insights that may address the discrepancies.
@inproceedings{10.1145/3620678.3624785, author = {Liu, Qingyuan and Du, Dong and Xia, Yubin and Zhang, Ping and Chen, Haibo}, title = {The Gap Between Serverless Research and Real-World Systems}, year = {2023}, isbn = {9798400703874}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3620678.3624785}, doi = {10.1145/3620678.3624785}, booktitle = {Proceedings of the 2023 ACM Symposium on Cloud Computing}, pages = {475–485}, numpages = {11}, keywords = {cloud computing, serverless, sidecar, scheduling}, location = {Santa Cruz, CA, USA}, series = {SoCC '23} }
2022
- [ASPLOS] Serverless Computing on Heterogeneous Computers. Dong Du, Qingyuan Liu, Xueqiang Jiang, and 3 more authors. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022.
Existing serverless computing platforms are built upon homogeneous computers, limiting the function density and restricting serverless computing to limited scenarios. We introduce Molecule, the first serverless computing system utilizing heterogeneous computers. Molecule enables both general-purpose devices (e.g., Nvidia DPU) and domain-specific accelerators (e.g., FPGA and GPU) for serverless applications that significantly improve function density (50% higher) and application performance (up to 34.6x). To achieve these results, we first propose XPU-Shim, a distributed shim to bridge the gap between underlying multi-OS systems (when using general-purpose devices) and our serverless runtime (i.e., Molecule). We further introduce vectorized sandbox, a sandbox abstraction to abstract hardware heterogeneity (when using domain-specific accelerators). Moreover, we also review state-of-the-art serverless optimizations on startup and communication latency and overcome the challenges to implement them on heterogeneous computers. We have implemented Molecule on real platforms with Nvidia DPUs and Xilinx FPGAs and evaluate it using benchmarks and real-world applications.
@inproceedings{10.1145/3503222.3507732, author = {Du, Dong and Liu, Qingyuan and Jiang, Xueqiang and Xia, Yubin and Zang, Binyu and Chen, Haibo}, title = {Serverless Computing on Heterogeneous Computers}, year = {2022}, isbn = {9781450392051}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3503222.3507732}, doi = {10.1145/3503222.3507732}, booktitle = {Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems}, pages = {797–813}, numpages = {17}, keywords = {function-as-a-service, serverless computing, heterogeneous computers, operating system, Cloud computing}, location = {Lausanne, Switzerland}, series = {ASPLOS 2022} }
- [TOCS] Boosting Inter-Process Communication with Architectural Support. Yubin Xia, Dong Du, Zhichao Hua, and 3 more authors. ACM Trans. Comput. Syst., Jul 2022.
IPC (inter-process communication) is a critical mechanism for modern OSes, including not only microkernels such as seL4, QNX, and Fuchsia where system functionalities are deployed in user-level processes, but also monolithic kernels like Android where apps frequently communicate with plenty of user-level services. However, existing IPC mechanisms still suffer from long latency. Previous software optimizations of IPC usually cannot bypass the kernel that is responsible for domain switching and message copying/remapping across different address spaces; hardware solutions such as tagged memory or capability replace page tables for isolation, but usually require non-trivial modification to existing software stack to adapt to the new hardware primitives. In this article, we propose a hardware-assisted OS primitive, XPC (Cross Process Call), for efficient and secure synchronous IPC. XPC enables direct switch between IPC caller and callee without trapping into the kernel and supports secure message passing across multiple processes without copying. We have implemented a prototype of XPC based on the ARM AArch64 with Gem5 simulator and RISC-V architecture with FPGA boards. The evaluation shows that XPC can reduce IPC call latency from 664 to 21 cycles, 14×–123× improvement on Android Binder (ARM), and improve the performance of real-world applications on microkernels by 1.6× on Sqlite3.
@article{10.1145/3532861, author = {Xia, Yubin and Du, Dong and Hua, Zhichao and Zang, Binyu and Chen, Haibo and Guan, Haibing}, title = {Boosting Inter-Process Communication with Architectural Support}, year = {2022}, issue_date = {November 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {39}, number = {1–4}, issn = {0734-2071}, url = {https://doi.org/10.1145/3532861}, doi = {10.1145/3532861}, journal = {ACM Trans. Comput. Syst.}, month = jul, articleno = {6}, numpages = {35}, keywords = {Operating system, inter-process communication, microkernel, hardware-software co-design} }
2021
- [OSDI] Scalable Memory Protection in the PENGLAI Enclave. Erhu Feng, Xu Lu, Dong Du, and 5 more authors. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), Jul 2021.
@inproceedings{273705, author = {Feng, Erhu and Lu, Xu and Du, Dong and Yang, Bicheng and Jiang, Xueqiang and Xia, Yubin and Zang, Binyu and Chen, Haibo}, title = {Scalable Memory Protection in the {PENGLAI} Enclave}, booktitle = {15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 21)}, year = {2021}, isbn = {978-1-939133-22-9}, pages = {275--294}, url = {https://www.usenix.org/conference/osdi21/presentation/feng}, publisher = {{USENIX} Association}, month = jul }
2020
- [ASPLOS] Catalyzer: Sub-Millisecond Startup for Serverless Computing with Initialization-Less Booting. Dong Du, Tianyi Yu, Yubin Xia, and 5 more authors. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020.
Serverless computing promises cost-efficiency and elasticity for high-productive software development. To achieve this, the serverless sandbox system must address two challenges: strong isolation between function instances, and low startup latency to ensure user experience. While strong isolation can be provided by virtualization-based sandboxes, the initialization of sandbox and application causes non-negligible startup overhead. Conventional sandbox systems fall short in low-latency startup due to their application-agnostic nature: they can only reduce the latency of sandbox initialization through hypervisor and guest kernel customization, which is inadequate and does not mitigate the majority of startup overhead. This paper proposes Catalyzer, a serverless sandbox system design providing both strong isolation and extremely fast function startup. Instead of booting from scratch, Catalyzer restores a virtualization-based function instance from a well-formed checkpoint image and thereby skips the initialization on the critical path (init-less). Catalyzer boosts the restore performance by on-demand recovering both user-level memory state and system state. We also propose a new OS primitive, sfork (sandbox fork), to further reduce the startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer removes the initialization cost by reusing state, which enables general optimizations for diverse serverless functions. The evaluation shows that Catalyzer reduces startup latency by orders of magnitude, achieves < 1ms latency in the best case, and significantly reduces the end-to-end latency for real-world workloads. Catalyzer has been adopted by Ant Financial, and we also present lessons learned from industrial development.
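The state-reuse idea behind init-less booting has a familiar plain-process analogue: initialize once in a template, then create instances with fork() so they inherit the warm state copy-on-write. The sketch below shows only that analogue; Catalyzer's sfork extends the idea to whole virtualization-based sandboxes, which this toy program does not attempt.

```c
/* Plain-process analogue of "reuse state instead of re-initializing". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATE_BYTES (64u << 20)   /* pretend this is a language runtime + libraries */

static char *heavy_init(void) {
    char *state = malloc(STATE_BYTES);
    memset(state, 0x5a, STATE_BYTES);     /* expensive warm-up, paid once */
    return state;
}

static void handle_request(const char *state, int id) {
    printf("instance %d reuses warm state byte: 0x%x\n", id, state[0] & 0xff);
}

int main(void) {
    char *state = heavy_init();           /* template: init-full, slow path  */
    for (int i = 0; i < 3; i++) {
        pid_t pid = fork();               /* instance: init-less, fast path  */
        if (pid == 0) {
            handle_request(state, i);     /* inherited copy-on-write state   */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
    }
    free(state);
    return 0;
}
```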
@inproceedings{10.1145/3373376.3378512, author = {Du, Dong and Yu, Tianyi and Xia, Yubin and Zang, Binyu and Yan, Guanglu and Qin, Chenggang and Wu, Qixuan and Chen, Haibo}, title = {Catalyzer: Sub-Millisecond Startup for Serverless Computing with Initialization-Less Booting}, year = {2020}, isbn = {9781450371025}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3373376.3378512}, doi = {10.1145/3373376.3378512}, booktitle = {Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems}, pages = {467–481}, numpages = {15}, keywords = {startup latency, serverless computing, checkpoint and restore, operating system}, location = {Lausanne, Switzerland}, series = {ASPLOS '20} }
- [ACM SoCC] Characterizing Serverless Platforms with ServerlessBench. Tianyi Yu, Qingyuan Liu, Dong Du, and 6 more authors. In Proceedings of the 11th ACM Symposium on Cloud Computing, 2020.
Serverless computing promises auto-scalability and cost-efficiency (in "pay-as-you-go" manner) for high-productive software development. Because of its virtue, serverless computing has motivated increasingly new applications and services in the cloud. This, however, also presents new challenges including how to efficiently design high-performance serverless platforms and how to efficiently program on the platforms. This paper proposes ServerlessBench, an open-source benchmark suite for characterizing serverless platforms. It includes test cases exploring characteristic metrics of serverless computing, e.g., communication efficiency, startup latency, stateless overhead, and performance isolation. We have applied the benchmark suite to evaluate the most popular serverless computing platforms, including AWS Lambda, OpenWhisk, and Fn, and present new serverless implications from the study. For example, we show scenarios where decoupling an application into a composition of serverless functions can be beneficial in cost-saving and performance, and that the "stateless" property in serverless computing can hurt the execution performance of serverless functions. These implications form several design guidelines, which may help platform designers to optimize serverless platforms and application developers to design their functions best fit to the platforms.
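One of the metrics the suite characterizes, startup latency, can be felt with a toy measurement: time a cold start (spawning a fresh process) against a warm invocation (calling an already-initialized handler). The snippet below is not part of ServerlessBench, whose tests target real platforms such as AWS Lambda and OpenWhisk; it is a hypothetical micro-measurement for intuition only.

```c
/* Toy cold-vs-warm latency comparison; assumes a POSIX system with /bin/true. */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static void warm_function(void) { /* already-initialized handler: nothing to set up */ }

int main(void) {
    double t0 = now_us();
    pid_t pid = fork();                       /* cold start: a brand-new process */
    if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    double cold = now_us() - t0;

    t0 = now_us();
    warm_function();                          /* warm start: reuse the instance  */
    double warm = now_us() - t0;

    printf("cold ~%.0f us, warm ~%.2f us\n", cold, warm);
    return 0;
}
```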
@inproceedings{10.1145/3419111.3421280, author = {Yu, Tianyi and Liu, Qingyuan and Du, Dong and Xia, Yubin and Zang, Binyu and Lu, Ziqian and Yang, Pingchao and Qin, Chenggang and Chen, Haibo}, title = {Characterizing Serverless Platforms with Serverlessbench}, year = {2020}, isbn = {9781450381376}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3419111.3421280}, doi = {10.1145/3419111.3421280}, booktitle = {Proceedings of the 11th ACM Symposium on Cloud Computing}, pages = {30–44}, numpages = {15}, location = {Virtual Event, USA}, series = {SoCC '20} }
2019
- [SOSP] Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System. Mo Zou, Haoran Ding, Dong Du, and 3 more authors. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019.
Concurrent file systems are pervasive but hard to correctly implement and formally verify due to nondeterministic interleavings. This paper presents AtomFS, the first formally-verified, fine-grained, concurrent file system, which provides linearizable interfaces to applications. The standard way to prove linearizability requires modeling linearization point of each operation—the moment when its effect becomes visible atomically to other threads. We observe that path inter-dependency, where one operation (like rename) breaks the path integrity of other operations, makes the linearization point external and thus poses a significant challenge to prove linearizability. To overcome the above challenge, this paper presents Concurrent Relational Logic with Helpers (CRL-H), a framework for building verified concurrent file systems. CRL-H is made powerful through two key contributions: (1) extending prior approaches using fixed linearization points with a helper mechanism where one operation of the thread can logically help other threads linearize their operations; (2) combining relational specifications and rely/guarantee conditions for relational and compositional reasoning. We have successfully applied CRL-H to verify the linearizability of AtomFS directly in C code. All the proofs are mechanized in Coq. Evaluations show that AtomFS speeds up file system workloads by utilizing fine-grained, multicore concurrency.
@inproceedings{10.1145/3341301.3359644, author = {Zou, Mo and Ding, Haoran and Du, Dong and Fu, Ming and Gu, Ronghui and Chen, Haibo}, title = {Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System}, year = {2019}, isbn = {9781450368735}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3341301.3359644}, doi = {10.1145/3341301.3359644}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles}, pages = {259–274}, numpages = {16}, location = {Huntsville, Ontario, Canada}, series = {SOSP '19} }
- [ISCA] XPC: Architectural Support for Secure and Efficient Cross Process Call. Dong Du, Zhichao Hua, Yubin Xia, and 2 more authors. In Proceedings of the 46th International Symposium on Computer Architecture, 2019.
Microkernel has many intriguing features like security, fault-tolerance, modularity and customizability, which recently stimulate a resurgent interest in both academia and industry (including seL4, QNX and Google’s Fuchsia OS). However, IPC (inter-process communication), which is known as the Achilles’ Heel of microkernels, is still the major factor for the overall (poor) OS performance. Besides, IPC also plays a vital role in monolithic kernels like Android Linux, as mobile applications frequently communicate with plenty of user-level services through IPC. Previous software optimizations of IPC usually cannot bypass the kernel which is responsible for domain switching and message copying/remapping; hardware solutions like tagged memory or capability replace page tables for isolation, but usually require non-trivial modification to existing software stack to adapt the new hardware primitives. In this paper, we propose a hardware-assisted OS primitive, XPC (Cross Process Call), for fast and secure synchronous IPC. XPC enables direct switch between IPC caller and callee without trapping into the kernel, and supports message passing across multiple processes through the invocation chain without copying. The primitive is compatible with the traditional address space based isolation mechanism and can be easily integrated into existing microkernels and monolithic kernels. We have implemented a prototype of XPC based on a Rocket RISC-V core with FPGA boards and ported two microkernel implementations, seL4 and Zircon, and one monolithic kernel implementation, Android Binder, for evaluation. We also implement XPC on GEM5 simulator to validate the generality. The result shows that XPC can reduce IPC call latency from 664 to 21 cycles, up to 54.2x improvement on Android Binder, and improve the performance of real-world applications on microkernels by 1.6x on Sqlite3 and 10x on an HTTP server with minimal hardware resource cost.
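Since XPC is a hardware primitive, it cannot be exercised directly from portable user code, but its moving parts can be modeled in software: a table of registered callee entry points, a per-caller capability mask, and a message window that is handed over rather than copied. The sketch below is such a model with invented names (`xentry_t`, `relay_seg_t`, `xcall`); it is not the RISC-V implementation and elides the kernel's role in setting up entries and capabilities.

```c
/* Software model of a capability-checked cross-process call with a
 * relay-segment-style message window (illustrative, single process). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 8

typedef struct { char *base; size_t len; } relay_seg_t;   /* handed over, not copied */
typedef void (*xentry_fn)(relay_seg_t *seg);

typedef struct {
    xentry_fn handler;   /* callee entry point */
    bool      valid;
} xentry_t;

static xentry_t xentry_table[MAX_ENTRIES];
static uint64_t xcall_cap;            /* bit i set: this caller may invoke entry i */

static void fs_service(relay_seg_t *seg) {            /* a toy callee */
    printf("callee got: %.*s\n", (int)seg->len, seg->base);
    memcpy(seg->base, "done", 5);                      /* reply in place */
    seg->len = 5;
}

static bool xcall(int id, relay_seg_t *seg) {
    if (id < 0 || id >= MAX_ENTRIES || !xentry_table[id].valid)
        return false;
    if (!(xcall_cap & (1ull << id)))                   /* capability check */
        return false;
    xentry_table[id].handler(seg);                     /* direct switch, no copy */
    return true;
}

int main(void) {
    xentry_table[0] = (xentry_t){ fs_service, true };
    xcall_cap = 1;                                     /* grant entry 0 to this caller */
    char buf[64] = "open /tmp/a";
    relay_seg_t seg = { buf, strlen(buf) };
    if (xcall(0, &seg))
        printf("caller sees reply: %s\n", seg.base);
    return 0;
}
```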
@inproceedings{10.1145/3307650.3322218, author = {Du, Dong and Hua, Zhichao and Xia, Yubin and Zang, Binyu and Chen, Haibo}, title = {XPC: Architectural Support for Secure and Efficient Cross Process Call}, year = {2019}, isbn = {9781450366694}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3307650.3322218}, doi = {10.1145/3307650.3322218}, booktitle = {Proceedings of the 46th International Symposium on Computer Architecture}, pages = {671–684}, numpages = {14}, keywords = {microkernel, operating system, accelerators, inter-process communication}, location = {Phoenix, Arizona}, series = {ISCA '19} }
2018
- [JCST] SplitPass: A Mutually Distrusting Two-Party Password Manager. Journal of Computer Science and Technology, 2018.
@article{split-pass, author = {}, title = {SplitPass: A Mutually Distrusting Two-Party Password Manager}, publisher = {Journal of Computer Science and Technology}, year = {2018}, journal = {Journal of Computer Science and Technology}, volume = {33}, number = {1}, eid = {98}, numpages = {17}, pages = {98}, keywords = {password manager;privacy protection;mobile-cloud system}, doi = {10.1007/s11390-018-1810-y} }
- [USENIX ATC] EPTI: Efficient Defence against Meltdown Attack for Unpatched VMs. Zhichao Hua, Dong Du, Yubin Xia, and 2 more authors. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, Jul 2018.
The Meltdown vulnerability, which exploits the inherent out-of-order execution in common processors like x86, ARM and PowerPC, has shown to break the fundamental isolation boundary between user and kernel space. This has stimulated a non-trivial patch to modern OS to separate page tables for user space and kernel space, namely, KPTI (kernel page table isolation). While this patch stops kernel memory leakages from rogue user processes, it mandates users to patch their kernels (usually requiring a reboot), and is currently only available on the latest versions of OS kernels. Further, it also introduces non-trivial performance overhead due to page table switching during user/kernel crossings. In this paper, we present EPTI, an alternative approach to defending against the Meltdown attack for unpatched VMs (virtual machines) in cloud, yet with better performance than KPTI. Specifically, instead of using two guest page tables, we use two EPTs (extended page tables) to isolate user space and kernel space, and unmap all the kernel space in user’s EPT to achieve the same effect as KPTI. The switching of EPTs is done through a hardware-support feature called EPT switching within guest VMs without hypervisor involvement. Meanwhile, EPT switching does not flush TLB since each EPT has its own TLB, which further reduces the overhead. We have implemented our design and evaluated it on Intel Kaby Lake CPU with different versions of Linux kernel. The results show that EPTI only introduces up to 13% overhead, which is around 45% less than KPTI.
@inproceedings{10.5555/3277355.3277380, author = {Hua, Zhichao and Du, Dong and Xia, Yubin and Chen, Haibo and Zang, Binyu}, title = {EPTI: Efficient Defence against Meltdown Attack for Unpatched VMs}, year = {2018}, isbn = {9781931971447}, publisher = {USENIX Association}, address = {USA}, booktitle = {Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference}, pages = {255–266}, numpages = {12}, location = {Boston, MA, USA}, series = {USENIX ATC '18} }