Kernel Memory Management Dissected (Part 3)

We have seen how page tables and memory are created and managed when the system starts running, and also when a new process is forked by its parent. As I said in the previous post, we now want to understand how the page tables are revised while a program is running. By revising I mean how the kernel enforces its paging policies when a process wants to access a portion of its memory that is not resident in RAM at the time of the request. Time is gold, so without further ado let’s see how the kernel fulfills such a request.

Hardware Interrupts
Much of what we know as “kernel work” is actually done through interrupt handlers. Network interactions, input from peripheral devices like the mouse and keyboard, system timers and many other parts of the computer are made possible by interrupts. I assume readers are already familiar with the concept of interrupts, so rather than introducing them again I will explain directly what role they play in system memory management, i.e. how they help the system organize the RAM and the page tables. If you remember from my earlier posts, the page table entries are hardware compatible; they are filled in accordance with the processor’s requirements. We introduced Intel’s page table structure and noted that the kernel has been written to comply with those requirements, because the hardware Memory Management Unit itself reads the page table entries and interprets their meaning. The kernel therefore does not interfere with address translation unless there is a problem with an address, in which case the processor informs the kernel through an interrupt signal and the respective interrupt handler is run. The problem can be any of a wide range of reasons why an address fails to be resolved or to be granted access. Two general interrupts usually occur in these situations: General Protection Fault and Page Fault. The former is normally not recoverable and the offending program must be terminated, but the latter happens when we need to perform some revisions to our page table entries.
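
To make the “hardware compatible” point concrete, here is a minimal user-space sketch (not kernel code) of how the bits of an x86-64 page table entry are laid out and interpreted. The bit positions follow the Intel manuals; the macro names and the sample entry value are mine, chosen purely for illustration.


#include <stdio.h>
#include <stdint.h>

/* Bit positions of an x86-64 page table entry, per the Intel SDM. */
#define PTE_PRESENT  (1ULL << 0)            /* page is resident in RAM       */
#define PTE_RW       (1ULL << 1)            /* writable (0 = read-only)      */
#define PTE_USER     (1ULL << 2)            /* accessible from user mode     */
#define PTE_ACCESSED (1ULL << 5)            /* set by the MMU on any access  */
#define PTE_DIRTY    (1ULL << 6)            /* set by the MMU on a write     */
#define PTE_NX       (1ULL << 63)           /* no-execute                    */
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL  /* bits 12..51: physical frame   */

static void decode_pte(uint64_t pte)
{
	if (!(pte & PTE_PRESENT)) {
		/* Any access through this entry raises a page fault. */
		printf("not present -> page fault on access\n");
		return;
	}
	printf("frame 0x%llx %s %s %s\n",
	       (unsigned long long)(pte & PTE_PFN_MASK),
	       (pte & PTE_RW)   ? "rw"   : "ro",
	       (pte & PTE_USER) ? "user" : "kernel",
	       (pte & PTE_NX)   ? "nx"   : "exec");
}

int main(void)
{
	decode_pte(0x000000001234A067ULL); /* present, rw, user, accessed, dirty */
	decode_pte(0x0000000000000000ULL); /* not present */
	return 0;
}

This is exactly the kind of entry the MMU walks on every translation; the kernel’s job is to keep these bits consistent with its own bookkeeping.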

Page Fault
As the name suggests, this happens when we have to do something with a page before allowing a program, or even the kernel itself, to access that page. Page faults are propagated to the kernel through the interrupt vectors. Each interrupt has a unique number assigned to it (called an Interrupt Request Line or Interrupt Vector), and each vector has a handler routine that is executed when that interrupt happens. Interrupts can be added, removed, enabled and disabled dynamically at runtime, but some interrupts have handlers that need to be registered with the processor (through the IDTR register, which points to the beginning of the interrupt descriptor table) immediately after the kernel is loaded and ready to run. These interrupts are defined in “arch/x86/entry/entry_64.S” for x86 based computers. The interrupt handlers are registered in that file using these two lines:


idtentry general_protection	do_general_protection	has_error_code=1
trace_idtentry page_fault	do_page_fault		has_error_code=1
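
Before we follow the page fault handler, here is a concrete, simplified picture of what these registrations amount to at the hardware level: the IDTR register is made to point (with the privileged lidt instruction) at a table of gate descriptors, one per vector; page faults are vector 14 on x86. The gate layout below follows the Intel SDM for 64-bit interrupt gates, but the names, the selector value and the dummy handler are mine, not the kernel’s.


#include <stdio.h>
#include <stdint.h>

/* A 64-bit interrupt gate descriptor (16 bytes), per the Intel SDM. */
struct idt_gate {
	uint16_t offset_low;   /* handler address bits 0..15   */
	uint16_t selector;     /* kernel code segment selector */
	uint8_t  ist;          /* interrupt stack table index  */
	uint8_t  type_attr;    /* present bit, DPL, gate type  */
	uint16_t offset_mid;   /* handler address bits 16..31  */
	uint32_t offset_high;  /* handler address bits 32..63  */
	uint32_t reserved;
} __attribute__((packed));

struct idt_ptr {
	uint16_t limit;        /* table size in bytes - 1 */
	uint64_t base;         /* table address           */
} __attribute__((packed));

#define PAGE_FAULT_VECTOR 14

static struct idt_gate idt[256];

static void dummy_handler(void) { }

static void set_gate(int vector, void (*handler)(void))
{
	uint64_t addr = (uint64_t)handler;

	idt[vector].offset_low  = addr & 0xFFFF;
	idt[vector].offset_mid  = (addr >> 16) & 0xFFFF;
	idt[vector].offset_high = (uint32_t)(addr >> 32);
	idt[vector].selector    = 0x10; /* kernel CS; illustrative value  */
	idt[vector].type_attr   = 0x8E; /* present, DPL 0, interrupt gate */
}

int main(void)
{
	struct idt_ptr p = { sizeof(idt) - 1, (uint64_t)idt };

	set_gate(PAGE_FAULT_VECTOR, dummy_handler);
	printf("gate 14: selector=%#x type_attr=%#x\n",
	       idt[PAGE_FAULT_VECTOR].selector,
	       idt[PAGE_FAULT_VECTOR].type_attr);

	/* Boot code would now execute: asm volatile("lidt %0" : : "m"(p));
	 * lidt is privileged, so this sketch only builds and prints the table. */
	(void)p;
	return 0;
}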

Of these two registrations, we are interested in the page fault handler, “do_page_fault”. The handler is defined in “arch/x86/mm/fault.c”:


dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = read_cr2(); /* Get the faulting address */
	enum ctx_state prev_state;

	/*
	 * We must have this function tagged with __kprobes, notrace and call
	 * read_cr2() before calling anything else. To avoid calling any kind
	 * of tracing machinery before we've observed the CR2 value.
	 *
	 * exception_{enter,exit}() contain all sorts of tracepoints.
	 */

	prev_state = exception_enter();
	__do_page_fault(regs, error_code, address);
	exception_exit(prev_state);
}

As you have probably guessed just by looking at it, this function itself doesn’t do anything special. “exception_enter” prepares the CPU to enter the kernel context (and after the page fault is handled we call, not surprisingly, “exception_exit”). The linear address that caused the fault is read from the CR2 register, which was filled by the hardware before the interrupt handler was invoked, and is then passed to __do_page_fault.
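
If you are curious how that read is actually performed, here is a minimal sketch assuming GCC-style inline assembly on x86-64. The real read_cr2() lives in the kernel’s arch headers, but it boils down to the same single mov; note that reading a control register is privileged, so this only works at ring 0, i.e. inside the kernel.


/* Sketch of reading the faulting linear address out of CR2. */
static inline unsigned long my_read_cr2(void)
{
	unsigned long addr;

	/* The CPU stored the faulting linear address in CR2 before
	 * transferring control to the page fault handler. */
	asm volatile("mov %%cr2, %0" : "=r"(addr));
	return addr;
}

Let’s go one step deeper and see what __do_page_fault is: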


static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
		unsigned long address)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	struct mm_struct *mm;
	int fault, major = 0;
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

	tsk = current;
	mm = tsk->mm;

	if (kmemcheck_active(regs))
		kmemcheck_hide(regs);
	prefetchw(&mm->mmap_sem);

	if (unlikely(kmmio_fault(regs, address)))
		return;

	/* Faults in the kernel half of the address space are handled specially: */
	if (unlikely(fault_in_kernel_space(address))) {
		if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
			if (vmalloc_fault(address) >= 0)
				return;

			if (kmemcheck_fault(regs, address, error_code))
				return;
		}

		/* Can handle a stale RO->RW TLB: */
		if (spurious_fault(error_code, address))
			return;

		/* kprobes don't want to hook the spurious faults: */
		if (kprobes_fault(regs))
			return;
		bad_area_nosemaphore(regs, error_code, address, NULL);

		return;
	}

	/* kprobes don't want to hook the spurious faults: */
	if (unlikely(kprobes_fault(regs)))
		return;

	if (unlikely(error_code & PF_RSVD))
		pgtable_bad(regs, error_code, address);

	if (unlikely(smap_violation(error_code, regs))) {
		bad_area_nosemaphore(regs, error_code, address, NULL);
		return;
	}

	if (unlikely(faulthandler_disabled() || !mm)) {
		bad_area_nosemaphore(regs, error_code, address, NULL);
		return;
	}

	if (user_mode(regs)) {
		local_irq_enable();
		error_code |= PF_USER;
		flags |= FAULT_FLAG_USER;
	} else {
		if (regs->flags & X86_EFLAGS_IF)
			local_irq_enable();
	}

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

	if (error_code & PF_WRITE)
		flags |= FAULT_FLAG_WRITE;
	if (error_code & PF_INSTR)
		flags |= FAULT_FLAG_INSTRUCTION;
	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
		if ((error_code & PF_USER) == 0 &&
		    !search_exception_tables(regs->ip)) {
			bad_area_nosemaphore(regs, error_code, address, NULL);
			return;
		}
retry:
		down_read(&mm->mmap_sem);
	} else {
		might_sleep();
	}

	/* Find the first VMA whose end lies above the faulting address: */
	vma = find_vma(mm, address);
	if (unlikely(!vma)) {
		bad_area(regs, error_code, address);
		return;
	}
	if (likely(vma->vm_start <= address))
		goto good_area;
	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
		bad_area(regs, error_code, address);
		return;
	}
	if (error_code & PF_USER) {
		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
			bad_area(regs, error_code, address);
			return;
		}
	}
	if (unlikely(expand_stack(vma, address))) {
		bad_area(regs, error_code, address);
		return;
	}

good_area:
	if (unlikely(access_error(error_code, vma))) {
		bad_area_access_error(regs, error_code, address, vma);
		return;
	}

	
	/* Hand the real work to the generic fault handler in mm/memory.c: */
	fault = handle_mm_fault(vma, address, flags);
	major |= fault & VM_FAULT_MAJOR;

	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* Retry at most once */
		if (flags & FAULT_FLAG_ALLOW_RETRY) {
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			flags |= FAULT_FLAG_TRIED;
			if (!fatal_signal_pending(tsk))
				goto retry;
		}

		/* User mode? Just return to handle the fatal exception */
		if (flags & FAULT_FLAG_USER)
			return;

		/* Not returning to user mode? Handle exceptions or die: */
		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
		return;
	}

	up_read(&mm->mmap_sem);
	if (unlikely(fault & VM_FAULT_ERROR)) {
		mm_fault_error(regs, error_code, address, vma, fault);
		return;
	}

	if (major) {
		tsk->maj_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
	} else {
		tsk->min_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
	}

	check_v8086_mode(regs, address, tsk);
}

Despite being larger than the previous function, this function is still relatively high level and doesn’t do much itself. Nevertheless the steps taken here are notable, and I will explain the more important ones, those that pertain to the normal situations in which page faults occur. First, note that page faults are only meaningful for user space programs’ data, not the kernel’s, since all kernel code and data must always be resident in memory and never swapped out; thus in normal conditions we shouldn’t see any page faults for the kernel’s own parts. The first lines of the function store the current task descriptor and its memory descriptor addresses in the two local variables “tsk” and “mm” respectively. The function then checks whether the fault is related to a user space program or to the kernel itself; in the latter case it deals with a few special kernel conditions and then returns.

The pertinent VMA for the address is then found by calling find_vma with “mm” and the linear “address”. Remember that a program is only allowed to access a virtual address when that address is in the range of one of its VMAs. If a corresponding VMA is not found, the function terminates after a call to “bad_area(…)”. If the address is reasonable (the kernel compares it with the start of the respective VMA), the code jumps to the “good_area” part, and control is passed to another function named “handle_mm_fault”, defined in “mm/memory.c”. The next function that continues the procedure of loading the page with the desired user space content is “__handle_mm_fault”, defined in the same file. Explaining how this function loads a physical page needs a lengthy blog post by itself, and for now I don’t see it necessary to delve into its details. Just remember that this function both creates the page tables pertaining to the requested page (if they are not already in memory) and loads the page itself into memory.

Another important job of this function is to check whether the page needs to be duplicated because of the Copy-On-Write mechanism. Remember when I said in the previous post that the newly forked process shares the pages of its parent in a read-only manner? Those pages are marked read-only in their corresponding page table entries, and any time either the parent or the child wants to write to them, a page fault happens and the kernel (more exactly, a function called later in the course of running __handle_mm_fault, named “do_wp_page”) copies the page into a new page and marks it as writable so that the program can perform the desired write operation. This whole procedure is invisible to both the parent process and the child.

“handle_mm_fault” can indicate either a MAJOR or a MINOR fault in its return value. Major faults are the ones that require blocking the user space program and loading the page from the disk (so they involve some I/O), while minor faults are the ones that don’t need I/O and consequently don’t lead to blocking the process. If, for example, the desired page is already in memory (either in a kernel cache or in the possession of another process), no disk interaction is required.
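
You can actually watch both the copy-on-write path and the fault counters from user space. Below is a small self-contained experiment (ordinary user-space C, not kernel code, and my own illustration rather than anything from the kernel tree): the parent faults a buffer in, forks, and the child’s first writes to the inherited pages hit read-only PTEs, so do_wp_page copies them, which shows up in the child’s rusage minor fault counter.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define SZ (4 * 1024 * 1024)

static long minor_faults(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	char *buf = malloc(SZ);

	memset(buf, 1, SZ);            /* fault the pages in before forking */

	if (fork() == 0) {
		long before = minor_faults();

		memset(buf, 2, SZ);    /* writes hit read-only PTEs -> COW  */
		printf("child minor faults for %d KiB of COW: %ld\n",
		       SZ / 1024, minor_faults() - before);
		_exit(0);
	}
	wait(NULL);
	free(buf);
	return 0;
}

On a typical x86-64 system with 4 KiB pages you should see on the order of one minor fault per written page (about a thousand here), and no major faults, since no disk I/O is involved.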

Process Context Switch
Now we want to see how the kernel pulls a process out of the running state and replaces it with another process that starts or resumes execution. To switch one process for another, the kernel generally needs to do two things:
– Replace the Page Global Directory (PGD) of the old process with the new one’s.
– Load the hardware context of the new process (the values of all registers for the new process).

The step that we are about to inspect is the first one: the memory switch. Switching is performed when the kernel scheduler calls the “context_switch” function defined in “kernel/sched/core.c”. Let’s take a look at the function:


static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct pin_cookie cookie)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);

	mm = next->mm;
	oldmm = prev->active_mm;
	
	arch_start_context_switch(prev);

	if (!mm) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm_irqs_off(oldmm, mm, next);

	.
	.
	.
}

This is what happens in the function to switch to a new memory context:
* The “mm” local variable is loaded with the memory descriptor of the new process.
* The “oldmm” local variable is loaded with the memory descriptor of the previous process, the one that is going to be preempted.
* If mm is NULL, no memory switch is performed and the new process (which must be a kernel thread, for the reason I mentioned in the previous post) borrows the memory context of the old process.
* But if mm – the memory descriptor of the new process – is not NULL, then the kernel switches to the new memory page tables by calling “switch_mm_irqs_off”. This is an architecture-dependent function and is defined in “arch/x86/mm/tlb.c”.


void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	unsigned cpu = smp_processor_id();

	if (likely(prev != next)) {
		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
			unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
			pgd_t *pgd = next->pgd + stack_pgd_index;
			if (unlikely(pgd_none(*pgd)))
				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
		}
#ifdef CONFIG_SMP
		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		this_cpu_write(cpu_tlbstate.active_mm, next);
#endif
		cpumask_set_cpu(cpu, mm_cpumask(next));

		load_cr3(next->pgd);	/* <-- the actual memory switch */

		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

		/* Stop flush ipis for the previous mm */
		cpumask_clear_cpu(cpu, mm_cpumask(prev));

		/* Load per-mm CR4 state */
		load_mm_cr4(next);

#ifdef CONFIG_MODIFY_LDT_SYSCALL
		if (unlikely(prev->context.ldt != next->context.ldt))
			load_mm_ldt(next);
#endif
	.
	.
	.
}


The function first checks (when the kernel is built with CONFIG_VMAP_STACK) that the new process’s PGD contains an entry covering the current stack, and if not it copies that entry from init_mm (introduced in earlier posts). This is – as the code itself highlights – unlikely to be needed.
Then the per-CPU TLB state is updated, and finally, at the line I have marked with a comment, the actual memory switch happens: the new process’s PGD address is loaded into the CR3 register.
From this moment on the hardware memory management unit uses the new process’s page tables to access RAM. Keep in mind that the kernel does not need to switch to a separate PGD when the processor jumps from user space to kernel space, because the kernel is mapped identically into every process’s address space – though user space has no access to those mappings; they exist solely to be used in kernel mode.
The function then performs some adjustments on the CR4 control register and the LDT and returns. The rest of the function handles the rare case in which it is called with the “prev” and “next” parameters equal to each other. Also, as you can see in the code above, we usually don’t need to reload the LDT register.
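
For completeness, here is a minimal sketch of the operation hiding behind load_cr3(), assuming GCC-style inline assembly; the kernel’s helper also converts the PGD pointer from a virtual to a physical address first (via __pa), which I gloss over here, and like the other control register accesses this is only possible at ring 0.


/*
 * CR3 must receive the PHYSICAL address of the PGD. Writing CR3 switches
 * the address space and, when PCIDs are not in use, flushes all
 * non-global TLB entries.
 */
static inline void my_load_cr3(unsigned long pgd_phys)
{
	asm volatile("mov %0, %%cr3" : : "r"(pgd_phys) : "memory");
}

Because the kernel’s own mappings are marked global in their page table entries, they survive this flush – one more reason the shared kernel mapping described above is cheap.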

Very well. In this three-part series I tried to clarify how the Linux kernel implements paging and how it manages memory. Although these articles revolve around Linux and the x86 architecture, the overall implementation principles are the same for other architectures and operating systems. For this series too I tried to follow my general writing strategy of keeping posts short and rich, and I hope the explanations were satisfactory. You can leave comments if you have questions, or use the mail address provided in the sidebar of the page to contact me.
Till next post… good luck!
