Kernel Memory Management Dissected (Part 2)

In the previous post we saw how the early page tables are built after the kernel takes control. As I mentioned there, those tables are not of much use for the applications that are the actual reason an operating system is implemented in the first place. When a new process is about to run, the OS needs to create a fresh set of page tables for that process. Before we get to that point, let’s start with the data structures that hold the necessary information about pages. The state of each physical page frame is kept in an instance of a structure that is nicely called “page”! This structure is defined in “include/linux/mm_types.h”. The memory of every running entity is described by a structure called mm_struct, defined in the same file. Each such structure has a Page Global Directory assigned to it through the member “pgd”. The kernel stores the information related to the page tables of its own address space in an mm_struct instance called “init_mm”, defined in “mm/init-mm.c” (ridiculously, there’s another file in the same directory called “mm_init.c”. See? Linux is not as open as it seems). This memory descriptor is not of much use except during kernel initialization, besides playing the role of the init task’s memory descriptor.


struct mm_struct init_mm = {
	.mm_rb		= RB_ROOT,
	.pgd		= swapper_pg_dir,
	.mm_users	= ATOMIC_INIT(2),
	.mm_count	= ATOMIC_INIT(1),
	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
	.user_ns	= &init_user_ns,
	INIT_MM_CONTEXT(init_mm)
};
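
For orientation, here is a heavily trimmed sketch of the mm_struct members we will keep bumping into in this and the following posts. This is not the real definition (the one in “include/linux/mm_types.h” has dozens more fields and its layout changes between kernel versions); it only lists the members relevant to our discussion:


struct mm_struct {
	struct vm_area_struct *mmap;	/* head of the list of memory areas (VMAs) */
	struct rb_root mm_rb;		/* the same areas kept in a red-black tree */
	pgd_t *pgd;			/* Page Global Directory: the top-level page table */
	atomic_t mm_users;		/* tasks using this address space */
	atomic_t mm_count;		/* references to the descriptor itself */
	spinlock_t page_table_lock;	/* protects the page tables */
	struct rw_semaphore mmap_sem;	/* serializes changes to the area list */
	/* ... many more fields ... */
};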

Remember the function start_kernel in “init/main.c”? There’s a call to “setup_arch()” in there, an architecture-specific function that in our case resides in “arch/x86/kernel/setup.c”:


void __init setup_arch(char **cmdline_p)
{
	memblock_reserve(__pa_symbol(_text),
			 (unsigned long)__bss_stop - (unsigned long)_text);

	early_reserve_initrd();
	.
	.
	init_mem_mapping();
	.
	.

The function we track here is “init_mem_mapping”, defined in “arch/x86/mm/init.c”; it decides which physical ranges need to be mapped and hands them to “init_memory_mapping” in the same file:


unsigned long __ref init_memory_mapping(unsigned long start,
					       unsigned long end)
{
	struct map_range mr[NR_RANGE_MR];
	unsigned long ret = 0;
	int nr_range, i;

	pr_debug("init_memory_mapping: [mem %#010lx-%#010lx]\n",
	       start, end - 1);

	memset(mr, 0, sizeof(mr));
	nr_range = split_mem_range(mr, 0, start, end);

	for (i = 0; i < nr_range; i++)
		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
						   mr[i].page_size_mask);

	add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);

	return ret >> PAGE_SHIFT;
}
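
As a side note, the “map_range” elements that split_mem_range fills in are instances of a small helper structure defined near the top of the same file. Paraphrased, it is roughly:


struct map_range {
	unsigned long start;		/* physical start of the range */
	unsigned long end;		/* physical end of the range */
	unsigned page_size_mask;	/* which large page sizes (2M/1G) may map it */
};

So split_mem_range’s job is essentially to cut [start, end) into pieces whose alignment allows the biggest possible pages to be used.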

As I said, we are tracking our way down to the point where the structures holding the information of the kernel address space itself are populated. The next important function to look at is “kernel_physical_mapping_init”, defined in “arch/x86/mm/init_64.c”:


unsigned long __meminit
kernel_physical_mapping_init(unsigned long paddr_start,
			     unsigned long paddr_end,
			     unsigned long page_size_mask)
{
	bool pgd_changed = false;
	unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;

	paddr_last = paddr_end;
	vaddr = (unsigned long)__va(paddr_start);
	vaddr_end = (unsigned long)__va(paddr_end);
	vaddr_start = vaddr;

	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
		pgd_t *pgd = pgd_offset_k(vaddr);
		pud_t *pud;

		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;

		if (pgd_val(*pgd)) {
			pud = (pud_t *)pgd_page_vaddr(*pgd);
			paddr_last = phys_pud_init(pud, __pa(vaddr),
						   __pa(vaddr_end),
						   page_size_mask);
			continue;
		}

		pud = alloc_low_page();
		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
					   page_size_mask);

		spin_lock(&init_mm.page_table_lock);
		pgd_populate(&init_mm, pgd, pud);
		spin_unlock(&init_mm.page_table_lock);
		pgd_changed = true;
	}

	if (pgd_changed)
		sync_global_pgds(vaddr_start, vaddr_end - 1, 0);

	__flush_tlb_all();

	return paddr_last;
}

What this function does is first calculate the relevant physical and virtual addresses, following the same procedure I explained in the previous post, and then, after some other minor tasks, fill in the page tables owned by “init_mm”. “pgd_populate” is a short function defined in “arch/x86/include/asm/pgalloc.h”:


static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
{
	paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));
}

paravirt_alloc_pud is superfluous here (you can see why by looking at its definition!), and set_pgd simply stores the value given by the second parameter into the entry pointed to by the first one.
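
If you are curious, on a non-paravirtualized build set_pgd boils down to a plain store; the native variant in “arch/x86/include/asm/pgtable_64.h” is essentially just:


static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
	*pgdp = pgd;	/* write the new entry (PUD address plus flags) into the PGD slot */
}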

Well, that’s enough early kernel memory initialization; I don’t want to dwell on it further and I’d rather start explaining how user space is built. First, let’s talk a little bit about processes. From this point on everything gets a little smoother and friendlier than the harsh assembly and memory management code we have seen so far, since we are gradually getting farther from the hardware.

When a process is selected by the scheduler to run, the kernel loads the set of page tables belonging to that specific process. By loading we mean changing the CR3 register of the processor so that it points to the PML4 table pertinent to the process. The address of the page tables to be loaded into CR3 is reachable through the “mm” member of the process’s task descriptor. The task descriptor is defined in “include/linux/sched.h”. It’s a very large structure and we will get acquainted with its important members step by step. “mm” is defined as a “struct mm_struct *”, alongside another member named “active_mm”. In normal situations active_mm is equal to “mm”, and it denotes the address space a process is using during execution. It’s worth mentioning that the “mm” field of kernel threads (threads that only run in kernel space) is always NULL. In that case the kernel does not load a new address space; instead, the kernel thread borrows the page tables of the previous process. Kernel threads don’t need to access user space; the only region they need is kernel space, which is mapped identically into every process’s address space. This is why they can use any process’s page tables.
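
To make that concrete, here is a minimal sketch of what “loading a new set of page tables” amounts to. This is not the real kernel code (the actual work happens in switch_mm(), which also deals with TLB flushing, lazy-TLB mode and per-CPU bookkeeping); it only illustrates the mm/active_mm logic described above:


/* Sketch only, assuming a hypothetical helper called from the context
 * switch path; the real implementation is switch_mm() on x86. */
static void sketch_switch_address_space(struct task_struct *prev,
					struct task_struct *next)
{
	if (next->mm) {
		/* Ordinary user task: point CR3 at its own top-level table. */
		next->active_mm = next->mm;
		write_cr3(__pa(next->mm->pgd));
	} else {
		/* Kernel thread: it has no address space of its own, so it
		 * borrows the previous one and CR3 is left untouched. */
		next->active_mm = prev->active_mm;
	}
}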

The kernel doesn’t simply hand the whole address space over to the process with the help of the hardware memory management unit. Instead, each process is only allowed to access certain ranges of its address space. These ranges are described by other structures, called areas. Each area covers one or more complete pages (so it must be 4K aligned) and has its own access rights. Such areas are represented in the kernel by “vm_area_struct” instances (called VMAs). Each area is contiguous in the process address space (but not necessarily in physical RAM). All memory areas (also called memory regions) are stored in a doubly linked list. The first area is pointed to by the member “mmap” of the process’s memory descriptor, and each vm_area_struct has two pointers to the next and previous areas. The access rights of each area are reflected onto the corresponding page frames by setting the matching flags in the process’s page table entries. The areas are the memory regions the process has the right to at least read. So if some part of memory (kernel space, for example) is not to be accessed by the process at all, that part is simply not covered by any VMA. I had published a post that very briefly looked into how processes are born in Linux. Here we need a more detailed feel for the creation phase in order to understand how a process is given its own memory.
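
Just to illustrate how these pieces fit together, here is a small hypothetical helper (not taken from the kernel source) that walks the VMA list of a memory descriptor the way many in-kernel users do; in real code you would hold mmap_sem while walking the list:


/* Hypothetical example: dump all memory areas of an address space. */
static void dump_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	/* mm->mmap points to the first area; vm_next links the rest. */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		printk(KERN_DEBUG "area [%lx-%lx) flags %lx\n",
		       vma->vm_start, vma->vm_end, vma->vm_flags);
}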

A process is created by the fork() system call. fork() duplicates the process that issues the syscall into a new process and establishes its necessary data structures, among which the page tables are of special interest to us here. fork() and the other functions related to replicating a process are defined in “kernel/fork.c”.


SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
	/* can not support in nommu mode */
	return -EINVAL;
#endif
}

The function doesn’t do anything special except calling _do_fork with SIGCHLD as the flags parameter. This parameter carries some guidelines for the functions that follow; SIGCHLD is the signal that will be delivered to the parent when the child terminates. _do_fork() is defined in the same file and it’s still a fairly small function:


long _do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr,
	      unsigned long tls)
{
	struct task_struct *p;
	int trace = 0;
	long nr;

	
	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if ((clone_flags & CSIGNAL) != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		if (likely(!ptrace_event_enabled(current, trace)))
			trace = 0;
	}

	p = copy_process(clone_flags, stack_start, stack_size,
			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
	add_latent_entropy();
	
	if (!IS_ERR(p)) {
		struct completion vfork;
		struct pid *pid;

		trace_sched_process_fork(current, p);

		pid = get_task_pid(p, PIDTYPE_PID);
		nr = pid_vnr(pid);

		if (clone_flags & CLONE_PARENT_SETTID)
			put_user(nr, parent_tidptr);

		if (clone_flags & CLONE_VFORK) {
			p->vfork_done = &vfork;
			init_completion(&vfork);
			get_task_struct(p);
		}

		wake_up_new_task(p);

		/* forking complete and child started to run, tell ptracer */
		if (unlikely(trace))
			ptrace_event_pid(trace, pid);

		if (clone_flags & CLONE_VFORK) {
			if (!wait_for_vfork_done(p, &vfork))
				ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
		}

		put_pid(pid);
	} else {
		nr = PTR_ERR(p);
	}
	return nr;
}

The most important line in this function is


p = copy_process(clone_flags, stack_start, stack_size, child_tidptr, NULL, trace, tls, NUMA_NO_NODE);

copy_process(…) is the actual function that does the replication. It’s a big function and I will only explain the important parts (not that the other parts are garbage, but they are not relevant to our discussion and I really don’t have time to cover them all. I’m organizing my articles so as to provide the knowledge necessary to start working with the security aspects of the kernel).


static __latent_entropy struct task_struct *copy_process(
					unsigned long clone_flags,
					unsigned long stack_start,
					unsigned long stack_size,
					int __user *child_tidptr,
					struct pid *pid,
					int trace,
					unsigned long tls,
					int node)
{
    .
    .
    struct task_struct *p;
    .
    .
    p = dup_task_struct(current, node);
    .
    .
    retval = sched_fork(clone_flags, p);
    .
    .
    retval = copy_mm(clone_flags, p);
    .
    .

The address of the new process descriptor, which will be assigned to the process being created, is stored in the local variable “p”. The dup_task_struct function makes a copy of the parent’s task descriptor to serve as the new process descriptor. Certain fields of the new task descriptor are changed in that function to fit the new process’s conditions (but as I said, they’re not relevant to this post). The next interesting function is “sched_fork”, which introduces the new process to the kernel scheduler so it can run once the forking is complete.
And then comes the “copy_mm” function. We’re finally at this point. Okay, let’s trace it step by step.


static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    .
    .
    mm = dup_mm(tsk);
    .
    .
good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
} 

Again we skip to the important parts that are executed in the normal forking of a process. The actual memory duplication is done in dup_mm, which itself is broken down into a few smaller phases:


static struct mm_struct *dup_mm(struct task_struct *tsk)
{
	struct mm_struct *mm, *oldmm = current->mm;
	int err;

	mm = allocate_mm();
	if (!mm)
		goto fail_nomem;

	memcpy(mm, oldmm, sizeof(*mm));

	if (!mm_init(mm, tsk, mm->user_ns))
		goto fail_nomem;

	err = dup_mmap(mm, oldmm);
	if (err)
		goto free_pt;

	mm->hiwater_rss = get_mm_rss(mm);
	mm->hiwater_vm = mm->total_vm;

	if (mm->binfmt && !try_module_get(mm->binfmt->module))
		goto free_pt;

	return mm;

free_pt:
	/* don't put binfmt in mmput, we haven't got module yet */
	mm->binfmt = NULL;
	mmput(mm);

fail_nomem:
	return NULL;
}

First a new “struct mm_struct” instance is created and its address is stored in the local variable “mm”. Then the old process’s memory descriptor is copied into this new descriptor with a simple memcpy call. Now the hardest part of memory duplication arises: the dup_mmap function. This function traverses the memory areas of the parent process and creates identical areas in the child’s memory descriptor. For each of the parent’s VMAs it then calls a function named “copy_page_range”, which this time is defined in another file, mm/memory.c; that file is the kernel’s main memory manager as far as page tables are concerned. Below is the main loop of the dup_mmap function, back in kernel/fork.c:


for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
		struct file *file;

		if (mpnt->vm_flags & VM_DONTCOPY) {
			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
			continue;
		}
		charge = 0;
		if (mpnt->vm_flags & VM_ACCOUNT) {
			unsigned long len = vma_pages(mpnt);

			if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
				goto fail_nomem;
			charge = len;
		}
		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
		if (!tmp)
			goto fail_nomem;
		*tmp = *mpnt;
		INIT_LIST_HEAD(&tmp->anon_vma_chain);
		retval = vma_dup_policy(mpnt, tmp);
		if (retval)
			goto fail_nomem_policy;
		tmp->vm_mm = mm;
		if (anon_vma_fork(tmp, mpnt))
			goto fail_nomem_anon_vma_fork;
		tmp->vm_flags &=
			~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP);
		tmp->vm_next = tmp->vm_prev = NULL;
		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
		file = tmp->vm_file;
		if (file) {
			struct inode *inode = file_inode(file);
			struct address_space *mapping = file->f_mapping;

			get_file(file);
			if (tmp->vm_flags & VM_DENYWRITE)
				atomic_dec(&inode->i_writecount);
			i_mmap_lock_write(mapping);
			if (tmp->vm_flags & VM_SHARED)
				atomic_inc(&mapping->i_mmap_writable);
			flush_dcache_mmap_lock(mapping);
			/* insert tmp into the share list, just after mpnt */
			vma_interval_tree_insert_after(tmp, mpnt,
					&mapping->i_mmap);
			flush_dcache_mmap_unlock(mapping);
			i_mmap_unlock_write(mapping);
		}
	
		if (is_vm_hugetlb_page(tmp))
			reset_vma_resv_huge_pages(tmp);

		/*
		 * Link in the new vma and copy the page table entries.
		 */
		*pprev = tmp;
		pprev = &tmp->vm_next;
		tmp->vm_prev = prev;
		prev = tmp;

		__vma_link_rb(mm, tmp, rb_link, rb_parent);
		rb_link = &tmp->vm_rb.rb_right;
		rb_parent = &tmp->vm_rb;

		mm->map_count++;
      --->      retval = copy_page_range(mm, oldmm, mpnt); 

		if (tmp->vm_ops && tmp->vm_ops->open)
			tmp->vm_ops->open(tmp);

		if (retval)
			goto out;
	}

Look at the line I have marked with an arrow. This call is what copies the page table entries of each VMA for the child process:


int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
		struct vm_area_struct *vma)
{
	pgd_t *src_pgd, *dst_pgd;
	unsigned long next;
	unsigned long addr = vma->vm_start;
	unsigned long end = vma->vm_end;
	unsigned long mmun_start;	/* For mmu_notifiers */
	unsigned long mmun_end;		/* For mmu_notifiers */
	bool is_cow;
	int ret;
	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
			!vma->anon_vma)
		return 0;

	if (is_vm_hugetlb_page(vma))
		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
		ret = track_pfn_copy(vma);
		if (ret)
			return ret;
	}
	
	is_cow = is_cow_mapping(vma->vm_flags);
	mmun_start = addr;
	mmun_end   = end;
	if (is_cow)
		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
						    mmun_end);

	ret = 0;
	dst_pgd = pgd_offset(dst_mm, addr);
	src_pgd = pgd_offset(src_mm, addr);
	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(src_pgd))
			continue;
		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
					    vma, addr, next))) {
			ret = -ENOMEM;
			break;
		}
	} while (dst_pgd++, src_pgd++, addr = next, addr != end);

	if (is_cow)
		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
	return ret;
}

This function adds page table entries for the child process level by level. As you can see there’s a loop over the PGD entries, which then jumps into another function to iterate over the PUD entries for that PGD entry. That function is “copy_pud_range”, which in turn calls another function that handles the PMD entries for each corresponding PUD entry, and this descent continues until all of the parent’s pages corresponding to each PTE have been handled. Then we go back to the first loop in copy_page_range to continue building the child’s page tables until all the parent’s pages have been traversed. Note that “copying” here does not mean duplicating the contents of the parent’s pages. What is done here is to write the addresses of the parent’s physical pages into the child’s PTE entries, so that both parent and child point to and use the same physical pages. This saves the kernel a huge amount of useless copying of pages that are not even guaranteed to be used in the child’s context (often the child immediately loads another executable by calling a function from the exec family, consequently discarding all its current pages; we’ll talk about that later).
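
To make the last point clearer, here is a heavily simplified sketch of what happens for each PTE. The real logic lives in copy_one_pte() in mm/memory.c and also handles swap entries, page reference counting and reverse mapping; the essence, though, is that the child maps the very same page frame, and for a private writable mapping both sides are write-protected so that the first write triggers the mechanism described next:


/* Simplified sketch, not the real copy_one_pte(): share the parent's
 * physical page with the child and write-protect both for COW. */
static void copy_one_pte_sketch(pte_t *dst_pte, pte_t *src_pte, bool is_cow)
{
	pte_t pte = *src_pte;

	if (is_cow) {
		pte = pte_wrprotect(pte);	/* clear the writable bit */
		set_pte(src_pte, pte);		/* the parent loses write access too */
	}
	set_pte(dst_pte, pte);			/* the child maps the same page frame */
}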

After this stage is completed the child is almost born, healthy and ready to run. But remember, it only has read-only access to the shared pages. As soon as the child wants to modify anything in its address space, the kernel gives it a fresh copy of the corresponding page (a mechanism appropriately called Copy-On-Write). In the next post I’ll explain how this is done, and also how the kernel switches between address spaces.
Good Luck!
