Mike Rapoport, a UNICORE member representing IBM, and James Bottomley, also of IBM, gave a talk titled “Restricted Address Spaces for Container Security” at the Open Source Summit on September 28.
As the abstract explains, “containers are generally perceived as less secure than virtual machines”, and the speakers “suggest exploring the possibility of using MMU as the hardware isolation mechanism to make containers even more secure”. Traditionally, the Linux kernel uses a single page table to manage all its objects, so any kernel data is accessible from anywhere in the kernel. From a security standpoint, this ability of the kernel to reach any memory from any part of its code is a liability. The fundamental mechanism of container isolation, namespaces, makes most kernel objects private to a namespace, so kernel code running outside the namespace has no need to access these private objects.
The authors presented restricted kernel address spaces and their use with Linux namespaces to ensure that the private objects of a namespace cannot be accessed by other parts of the kernel. A restricted page table is assigned to a namespace in a way that minimizes overhead and allows private objects to be removed from the default kernel page table. The speakers also presented possible optimizations for direct map management that reduce the performance penalty caused by direct map fragmentation.
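
To make the idea concrete, here is a toy Python model of it. It is only an illustration and not the speakers' implementation or real kernel code: the `AddressSpace` and `Kernel` classes and all object names are hypothetical, and a Python set stands in for a page table. The sketch assumes that shared objects are mapped everywhere, while a namespace's private objects are mapped only in that namespace's restricted table.

```python
# Toy model only: real page tables are hardware-walked structures managed
# by the kernel. All class and object names here are hypothetical.


class AddressSpace:
    """A set of mapped object names, standing in for a page table."""

    def __init__(self, name):
        self.name = name
        self.mapped = set()

    def map(self, obj):
        self.mapped.add(obj)

    def read(self, obj):
        # An access to an unmapped object fails, as a page fault would.
        if obj not in self.mapped:
            raise PermissionError(f"{obj!r} is not mapped in {self.name}")
        return obj


class Kernel:
    def __init__(self):
        self.default = AddressSpace("default")
        self.restricted = {}  # namespace name -> AddressSpace

    def create_namespace(self, ns):
        # The restricted table starts out with every shared mapping.
        space = AddressSpace(f"restricted:{ns}")
        space.mapped |= self.default.mapped
        self.restricted[ns] = space
        return space

    def add_shared(self, obj):
        # Shared objects are visible in the default and all restricted tables.
        self.default.map(obj)
        for space in self.restricted.values():
            space.map(obj)

    def add_private(self, ns, obj):
        # Private objects are mapped only in the owning namespace's table,
        # so they never appear in the default table.
        self.restricted[ns].map(obj)
```

In this model, code running with the table for a hypothetical namespace `netns-a` can read both shared objects and that namespace's private objects, while a read of the same private object through the default table, or through another namespace's table, raises `PermissionError`.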