Page Table Isolation Explained


Following the recent serious security holes in Intel x86 processors (dubbed Meltdown and Spectre), OS designers have incorporated changes in the memory-management part of their kernels. In the case of Linux, this is the second major change to kernel memory management in recent months; the previous one was upgrading the kernel to use 5-level paging.

The changes to the Linux kernel have been made at the lowest level possible to keep them almost invisible to the rest of the kernel, so that other subsystems see as little impact as possible and need no changes of their own. To implement the isolation, the kernel uses a bigger, 8K global directory (PGD) for each process. The first half of this table is similar to what a PGD looked like in previous versions, while the second half is the one whose address is loaded into CR3 when the processor runs in user mode; in other words, it is the part that constitutes the user process address space. The difference between the two halves is the absence of kernel-space mappings in the user half of this new 8K PGD. I have tried to describe this graphically in the figure below.

In the figure you can see that the kernel address space still includes the user part in RAM (as it always did), but the kernel data are not mapped into the user address space. Taking a look at early page table construction during the boot process also shows this new half being constructed. We can see this in the file “/arch/x86/kernel/head_64.S”:


#define PTI_USER_PGD_FILL	512
NEXT_PGD_PAGE(early_top_pgt)
	/* Kernel half: 511 empty entries plus the kernel mapping in the last slot. */
	.fill	511,8,0
#ifdef CONFIG_X86_5LEVEL
	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
#else
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
#endif
	/* User half: 512 more zeroed entries, one extra 4K page. */
	.fill	PTI_USER_PGD_FILL,8,0

Here we add 512 new entries (taking 4K of extra space) immediately after the top-level PGD; they play the role of the second half described above.
Now the question is: how does the kernel manage this? As I said, the global page directory is 8K in size, 4K for each part. The kernel uses the 13th bit (bit number 12 when counting from zero) to switch the pointer between the two parts. This works because each 4K half is spanned by the low 12 bits of the address, so the addresses of the two halves differ only in bit 12.
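
To make this concrete, here is a minimal user-space sketch (not kernel code; the address values are invented for illustration, only the bit-12 relationship matters):


#include <stdio.h>

int main(void)
{
	/* A made-up, 4K-aligned kernel-half PGD address (bit 12 clear). */
	unsigned long kernel_pgd = 0x1000000UL;

	/* The user half is the very next 4K page: same address with bit 12 set. */
	unsigned long user_pgd = kernel_pgd | (1UL << 12);

	printf("kernel half: %#lx\n", kernel_pgd);	/* prints 0x1000000 */
	printf("user half:   %#lx\n", user_pgd);	/* prints 0x1001000 */
	return 0;
}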

The kernel switches between these parts by flipping that bit whenever necessary. One security measure the developers took was to set the NX bit on the user mappings in the kernel half, that is, in all user page entries of the first half of the PGD. This way, if the kernel returns to user space while the kernel address space is still loaded in CR3, the user process cannot execute any code. A new file added to the kernel code base performs this exact action at line 140. The file is “/arch/x86/mm/pti.c”:


.
.
.

if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
    (__supported_pte_mask & _PAGE_NX))
	pgd.pgd |= _PAGE_NX;

.
.
.

As another example of switching between the kernel and user PGDs, you can see this in the same file:


pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
{
       .
       .
	kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;

The function “kernel_to_user_pgdp” has been defined in “/arch/x86/include/asm/pgtable_64.h” as:


static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
{
	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
}

“ptr_set_bit” does one thing: it sets the requested bit (bit 12 in this case) to one and returns the resulting pointer to the caller:


static inline void *ptr_set_bit(void *ptr, int bit)
{
	unsigned long __ptr = (unsigned long)ptr;

	__ptr |= BIT(bit);
	return (void *)__ptr;
}
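
For completeness, the same header also defines the inverse pair that goes from the user half back to the kernel half. I am quoting these from the same kernel version, so treat the exact text as illustrative:


static inline void *ptr_clear_bit(void *ptr, int bit)
{
	unsigned long __ptr = (unsigned long)ptr;

	__ptr &= ~BIT(bit);
	return (void *)__ptr;
}

static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
{
	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
}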

“PTI_PGTABLE_SWITCH_BIT” is defined as “PAGE_SHIFT”, which in turn is defined as 12:


#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT

While the system is running, a mandatory page-table switch is made every time a user process calls a system function (any time it executes a syscall instruction). The SYSCALL entry code of the Linux kernel has been changed into this:


ENTRY(entry_SYSCALL_64_trampoline)
	UNWIND_HINT_EMPTY
	swapgs

	/* Stash the user RSP. */
	movq	%rsp, RSP_SCRATCH

	/* Note: using %rsp as a scratch reg. */
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

	/* Load the top of the task stack into RSP */
	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp

	/* Start building the simulated IRET frame. */
	pushq	$__USER_DS			/* pt_regs->ss */
	pushq	RSP_SCRATCH			/* pt_regs->sp */
	pushq	%r11				/* pt_regs->flags */
	pushq	$__USER_CS			/* pt_regs->cs */
	pushq	%rcx				/* pt_regs->ip */

	/*
	 * x86 lacks a near absolute jump, and we can't jump to the real
	 * entry text with a relative jump.  We could push the target
	 * address and then use retq, but this destroys the pipeline on
	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
	 * spill RDI and restore it in a second-stage trampoline.
	 */
	pushq	%rdi
	movq	$entry_SYSCALL_64_stage2, %rdi
	jmp	*%rdi
END(entry_SYSCALL_64_trampoline)

Look at this line:


SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

This line invokes the SWITCH_TO_KERNEL_CR3 macro and uses RSP as an auxiliary register to change the pointer in CR3. There are two macros involved, both defined in “/arch/x86/entry/calling.h”:


.macro ADJUST_KERNEL_CR3 reg:req
	ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
	/* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
	andq    $(~PTI_SWITCH_MASK), \reg
.endm
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
.Lend_\@:
.endm

The current value of CR3 is loaded into RSP, whose value is then AND'ed with (~PTI_SWITCH_MASK) to clear two bits. We are mainly interested in bit 12, which moves the pointer back to the kernel half of the PGD. But bit 11 is also cleared here, and that has nothing to do with our pointer translation. The reason for clearing bit 11 is more technical, but here is a quick explanation. In Intel processors, when the PCIDE flag is set in the CR4 register, the low 12 bits of the CR3 register take on a special meaning in IA-32e mode: they hold an identifier called the “Process Context Identifier” (PCID), which lets the processor distinguish between cached translations belonging to different linear address spaces. The kernel uses bit 11 of the PCID to tell the user context apart from the kernel context, so it must be cleared on kernel entry as well. In this situation, bits 12 and up still contain the 4K-aligned physical address of the PGD.
Anyway, after changing the bits in RSP we store it back into CR3, and from that moment on memory addresses are translated using the kernel page table entries. User pages are shared between the two address spaces, but any change to kernel data is not propagated to the user address space, which is described by the second half of the 8K PGD.
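
To make the mask arithmetic concrete, here is a minimal user-space sketch; the constants and the CR3 value are hypothetical stand-ins for the kernel's macros, not the real definitions:


#include <stdio.h>

/* Hypothetical stand-ins mirroring the kernel's constants. */
#define PTI_USER_PGTABLE_MASK	(1UL << 12)	/* selects the user half of the PGD */
#define PTI_USER_PCID_MASK	(1UL << 11)	/* selects the user PCID/ASID */
#define PTI_SWITCH_MASK		(PTI_USER_PGTABLE_MASK | PTI_USER_PCID_MASK)

int main(void)
{
	/* A made-up user-mode CR3: user PGD half and user PCID, PCID number 5. */
	unsigned long user_cr3 = 0x1001805UL;

	/* Clearing bits 12 and 11 yields the kernel PGD half with the kernel PCID. */
	unsigned long kernel_cr3 = user_cr3 & ~PTI_SWITCH_MASK;

	printf("user CR3:   %#lx\n", user_cr3);		/* prints 0x1001805 */
	printf("kernel CR3: %#lx\n", kernel_cr3);	/* prints 0x1000005 */
	return 0;
}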

When the kernel's job is finished, it switches the PGD pointer back to the beginning of the user half of our 8K PGD.
With this scenario the kernel address space is physically removed from the user page tables, and the kernel has to switch between address spaces many times a second. Although the kernel only needs to flip a couple of bits, as you can see this requires several assembly instructions to execute. With all that said, if we want to list the various kinds of overhead inflicted by the new design, we can describe them as follows:
1. A PGD twice as big (memory overhead)
2. Having to replicate some page table entries across the two address spaces (computation overhead)
3. Managing entries differently in the two address spaces (computation overhead)
4. Having to switch between the address spaces twice each time a user program makes a syscall (computation overhead)
