Kernel Memory Management Dissected (Part 1)

NOTE: This series is based on Linux kernel 4.9.5. The latest kernels work slightly differently from what I explain here.

I started to introduce some basic kernel internals in the previous posts. It’s been a while since I decided to write about memory management at the operating system level, and fortunately today I could set aside a few hours to write the first part of what I had been planning to share.
Kernel code alone is harsh enough, and everything gets far more complex when it comes to managing memory. It’s not the theory behind memory management policies that makes this a difficult topic; it’s the implementation. The theory is what you probably learned in the “Operating Systems” course of your BSc. studies, but if you have some experience designing and writing computer programs you surely know there’s a significant distance between theory and implementation. Then again, it’s not the implementation itself that’s difficult in the kernel’s purview; it’s the lack of documentation around the kernel code. Linux is open source, but it is so poorly documented that comprehending its code is much more a matter of reverse engineering than of research.

That said, there’s another hurdle in understanding this part of the kernel (the memory-related concepts), and it’s the technical parlance. The related code is very low level and requires familiarity even with the underlying hardware. Note that the theory — the way operating systems manage memory — is similar across hardware architectures and operating systems, but the implementations differ. Specifically, since paging is directly supported by the hardware, the paging implementations of different operating systems are more similar when the underlying processors are of the same family.

Memory management is one of the most challenging parts of operating system design, since the slightest mistake in the design or in the respective code brings the whole OS down along with all its running applications. So memory management needs a delicate and accurate design on the side of the operating system designers.
I start the explanation by looking at how the very first paging structures are constructed after you turn on your computer, covering enough technical detail to give a fair and precise picture of how operating systems warm up the memory to leverage the virtual memory technique. If you are not familiar with the paging mechanism overall, read my post about Paging and Segmentation before proceeding to the rest of this post.

A quick look at how an operating system starts

After you press the power button there are generally two possible ways the computer loads a boot loader. In the first (and older) manner, you have a legacy BIOS-based motherboard which loads the instructions of a boot loader from the first sector of the first bootable device. In this mode a device is considered bootable if it carries an MBR signature, defined by convention to be 0xAA55. If the firmware (the code that runs as the BIOS) finds this signature at bytes number 511 and 512, it loads the first 512 bytes of that device into RAM at the physical address 0x7C00. To understand this scenario better I have written a simple boot loader and tested it on a virtual machine to see the results. This is the boot loader code:

 


[bits 16]
[org 0x7C00]

mov ah,0x5
mov al,0x01    ;Set screen to page 1
int 0x10

mov cx,line1_end-line1
mov bp,line1
mov bh,0x01   	;Write on page one
mov bl,0xA    	;Text color 
mov dx,0      	;First line
mov ah,0x13
int 0x10

mov cx,line2_end-line2
mov bp,line2   	;Second line
mov dh,0x01
int 0x10

jmp $

line1 db "This is a simple boot loader"
line1_end db 0
line2 db "Cyrus Sh"
line2_end db 0

times 510-($-$$) db 0
dw 0xAA55

Compile it with:

nasm -f bin -o boot.bin boot.asm

Then, using the resulting binary, you can boot either real hardware or a virtual machine; in either case the result is a short message on the screen.

First, let’s look at the content of our assembled file:


0000000 05b4 01b0 10cd 1cb9 bd00 7c23 01b7 0ab3
0000010 00ba b400 cd13 b910 0008 40bd b67c cd01
0000020 eb10 54fe 6968 2073 7369 6120 7320 6d69
0000030 6c70 2065 6f62 746f 6c20 616f 6564 0072
0000040 7943 7572 2073 6853 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200

Now let’s see what the memory looks like when our boot loader runs:


0x7c00: 0xb4 0x05 0xb0 0x01 0xcd 0x10 0xb9 0x1c
0x7c08: 0x00 0xbd 0x23 0x7c 0xb7 0x01 0xb3 0x0a
0x7c10: 0xba 0x00 0x00 0xb4 0x13 0xcd 0x10 0xb9
0x7c18: 0x08 0x00 0xbd 0x40 0x7c 0xb6 0x01 0xcd
0x7c20: 0x10 0xeb 0xfe 0x54 0x68 0x69 0x73 0x20
0x7c28: 0x69 0x73 0x20 0x61 0x20 0x73 0x69 0x6d
0x7c30: 0x70 0x6c 0x65 0x20 0x62 0x6f 0x6f 0x74
0x7c38: 0x20 0x6c 0x6f 0x61 0x64 0x65 0x72 0x00
0x7c40: 0x43 0x79 0x72 0x75 0x73 0x20 0x53 0x68
0x7c48: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
*
0x7df8: 0x00 0x00 0x00 0x00 0x00 0x00 0x55 0xaa

Look at the last two bytes of both the file content and the RAM dump: they hold the MBR signature 0xAA55 (which appears as 55 AA on little-endian systems, since the less significant byte is stored at the lower address). So the firmware loads the first sector of the bootable device at location 0x7C00 and the CPU jumps there to execute the boot loader instructions.
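The firmware’s bootability check described above can be sketched in a few lines of Python (a simplified illustration; real BIOSes may apply extra heuristics):

```python
MBR_SIGNATURE = b"\x55\xaa"  # 0xAA55 stored little-endian: 0x55 at byte 510, 0xAA at byte 511

def is_bootable(sector: bytes) -> bool:
    # A legacy BIOS loads the first 512-byte sector and checks its last two bytes
    return len(sector) == 512 and sector[-2:] == MBR_SIGNATURE

# A sector shaped like the one produced by our boot.asm: code + zero padding + signature
sector = bytes(510) + b"\x55\xaa"
print(is_bootable(sector))   # → True
```

This is exactly what the `times 510-($-$$) db 0` / `dw 0xAA55` pair at the end of the boot loader arranges for.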
In the second case the firmware is UEFI based, which has a completely different booting story. Such firmware runs an EFI-compatible application located in the ESP partition. We won’t delve into the details, but remember that in both cases the boot loader must hand control over to the kernel code. Where the boot loader jumps to start the kernel differs based on which boot scenario is used. In legacy BIOS mode the first kernel instructions run while the CPU is still in real mode, and they start at offset 0x200 of the kernel image. The first instruction at this offset is a short jump to 0x268, which runs the rest of the real mode kernel code.


Real mode kernel start:

|            512-byte boot sector            |      setup code ...
0         0x1f1 (header)     0x1fe     0x200                 0x268
                             0x55aa    _start                start_of_setup
                                       jmp ------------------^
                                       ^
                                       '
                                       '
                               Bootloader jumps here

On UEFI-based motherboards the processor doesn’t start in real mode and the execution path is also different. The kernel can be compiled with UEFI support by setting the “EFI stub support” config option in the “Processor type and features” category. Let’s jump straight to the point that relates to memory management: building the first page tables in boot/compressed/head_64.S.


        .
        .
        .
/*
 * Prepare for entering 64 bit mode
 */

	/* Load new GDT with the 64bit segments using 32bit descriptor */
	leal	gdt(%ebp), %eax
	movl	%eax, gdt+2(%ebp)
	lgdt	gdt(%ebp)

	/* Enable PAE mode */
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4

 /*
  * Build early 4G boot pagetable
  */
	/* Initialize Page tables to 0 */
	leal	pgtable(%ebx), %edi
	xorl	%eax, %eax
	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
	rep	stosl

	/* Build Level 4 */
	leal	pgtable + 0(%ebx), %edi
	leal	0x1007 (%edi), %eax
	movl	%eax, 0(%edi)

	/* Build Level 3 */
	leal	pgtable + 0x1000(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:	movl	%eax, 0x00(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

	/* Build Level 2 */
	leal	pgtable + 0x2000(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:	movl	%eax, 0(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

	/* Enable the boot page tables */
	leal	pgtable(%ebx), %eax
	movl	%eax, %cr3
        .
        .
        .

As you see, all we have here is pure x86 assembly. Sifting through the kernel’s C code is a headache, let alone its assembly, but I’ll try to help you avoid the headache by explaining the key stages and instructions in building and arranging the memory page tables step by step. I have pasted a part of boot/compressed/head_64.S above. Look at the comment at the beginning of this part: as you can predict, the code following it prepares the processor to enter 64-bit mode. As the Intel documentation mandates (Intel® 64 and IA-32 Architectures Software Developer’s Manual), we need to set the PAE bit of the CR4 register in order to activate IA-32e mode. This is what we do after loading the GDTR register. The next part is the start of building the first tables. Initially we zero out all the memory corresponding to the area that will hold the page tables. This is necessary to prevent random entries from being present in our page tables.

movl	$(BOOT_INIT_PGT_SIZE/4), %ecx

This instruction initializes the ECX register with a number big enough for the next instruction to zero the whole area; ECX plays the role of a counter. BOOT_INIT_PGT_SIZE is defined as:

# define BOOT_INIT_PGT_SIZE	(6*4096)

So BOOT_INIT_PGT_SIZE is 24KB. Each entry of any type of page table is 8 bytes and each table holds 512 entries, so each table is 4KB wide; thus we have 6 page tables here that are initialized to zero. In the next parts, these tables are filled to form 3 translation levels. If you have read my previous posts you have seen that Intel uses 4-level paging in 64-bit mode. Four is the maximum; we can also have 3 or even 2 levels in total. It is the entries of the page tables that specify whether there are more levels ahead in translating an address or not: the processor interprets the entries and the translation procedure differently based on the information bits of those entries. What makes the difference between these entries here is the PS bit in the second-level page table.

To understand how the entries are stored in these tables, you must first know the structure of the tables and the translation process. Physical addresses are up to 52 bits and linear addresses are 48 bits wide in 64-bit mode. Since each table holds 512 entries of 8 bytes each, every table is 4KB, and the physical address of each table must be 4K-aligned. Each entry contains an address part and an information part. The address part is the upper 40 bits of the physical address of either a page (a memory frame) or another (next-level) page table. The first table here contains only one entry, and that entry simply points to the next page table, which resides immediately after it. Since each page table is 4KB wide, if the address of a page table is X then the next adjacent table has the address X+0x1000. Irrespective of how the start address of each table is calculated, what matters here is how an address can be stored in the right place in each entry. As I said, each entry has an address part and an information part. The information bits comprise bits [0:11].
So we have 12 bits that indicate various aspects of how the processor interprets that entry. The address part comprises bits [12:51]; bits [52:63] are mostly unused or irrelevant to our discussion. So the upper 40 bits of a physical address must be stored in bits 12 through 51. How can we store these bits together with the information bits using a single instruction? Remember that I said the page tables are 4K-aligned; that is, the rightmost 12 bits of their physical addresses are zero. This facilitates appending the information bits to the corresponding entry. To better understand this, look at the figure below:

The rightmost 12 bits of the physical address are all zero, so the information part can be placed in those bits simply by adding the respective value to the physical address and then putting the result in the corresponding entry. Now let’s get back to our assembly instructions.
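As a sanity check, here is the same trick in Python: because the table’s physical address is 4K-aligned, a plain addition of the flag value yields a well-formed entry, and masking the low 12 bits recovers the address (bit names follow the Intel SDM; the layout below ignores bits [52:63] for simplicity):

```python
PAGE_P  = 1 << 0   # present
PAGE_RW = 1 << 1   # writable
PAGE_US = 1 << 2   # user-mode access allowed

def make_entry(table_phys: int, flags: int) -> int:
    assert table_phys & 0xFFF == 0      # 4K-aligned: low 12 bits free for flags
    return table_phys + flags           # addition == bitwise OR here

entry = make_entry(0x1000, PAGE_P | PAGE_RW | PAGE_US)
print(hex(entry))             # → 0x1007, exactly the constant used in the assembly
print(hex(entry & ~0xFFF))    # → 0x1000, the address part recovered
```

This is why the code can add 0x1007 in one `leal` instead of assembling the address and the flags separately.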

leal pgtable + 0(%ebx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)

Ignore the EBX register, since it contains a constant offset that is added to the start of all our tables (assume it’s zero). The address of the level 4 page table is denoted by “pgtable”, a label defined in the same file that contains 4KB of zeros. The address of pgtable is stored in the EDI register by the first instruction. The second instruction adds 0x1007 to EDI and stores the result in EAX. Now this is the interesting part. Remember each table was 4KB and our tables are adjacent, so the start of each one is 4KB away from its next neighbor. On the other hand these addresses are 4K-aligned based on the initial location of the kernel in RAM, so their rightmost 12 bits are zero. Ergo, to append the information bits we can simply add the hexadecimal equivalent of those bits, treating the whole entry as a single number. Clear?
Now, what does the number seven (binary 111) say about the entry that contains it in its rightmost 3 bits? Look at the figure below (the information bits differ slightly between page tables of different levels, but that’s not necessary to show in this figure):


P=1 means the page is present in main memory.
R/W=1 means the page is both readable and writable.
U/S=1 means user mode access is allowed to the memory region addressed by this entry.

Okay, now let’s see what the 3rd level page table contains:


       /* Build Level 3 */
	leal	pgtable + 0x1000(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:	movl	%eax, 0x00(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

For this table we fill 4 entries, which point to the next 4 adjacent level 2 tables respectively. First of all, note that it computes the address of the level 3 table by adding 0x1000 to the address of the label “pgtable”, which as said before acts as the level 4 table. Then it stores the addresses of the corresponding level 2 tables one by one, with 0x7 as their information bits. ECX is initialized to 4, and in each iteration the address of the next table is calculated by simply adding 4KB to the address of the previous one.
Note that the addresses we are mapping through these early page tables coincide with the physical addresses of the corresponding pages. That is, at this point there are no distinct virtual addresses; we build this mapping just to enable the processor to access the kernel code and data after paging is activated.
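A small Python model of this loop (with EBX assumed zero, as above) shows the four level 3 entries it produces:

```python
def build_level3(pgtable: int) -> list:
    # Mimics the "Build Level 3" loop: 4 entries, each pointing to one of
    # the 4 adjacent level 2 tables, with information bits 0x7 (P|RW|US)
    edi = pgtable + 0x1000          # address of the level 3 table itself
    eax = edi + 0x1007              # first level 2 table (edi + 0x1000) plus flags 0x7
    entries = []
    for _ in range(4):
        entries.append(eax)
        eax += 0x1000               # next adjacent 4KB table
    return entries

print([hex(e) for e in build_level3(0)])  # → ['0x2007', '0x3007', '0x4007', '0x5007']
```

With pgtable at 0, the level 2 tables occupy 0x2000–0x5FFF, which is consistent with the 6-table (24KB) area that was zeroed earlier.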
The most important part is the next one, where we fill the 4 level 2 page tables.


        leal	pgtable + 0x2000(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:	movl	%eax, 0(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

The value “pgtable + 0x2000” is the start of the third table, which is the first level 2 page table (level 2 page tables are also called PDE or PMD tables). Now look at the value playing the role of the information bits here: 0x183. This decodes to a different binary value than the 0x7 we saw for the previous tables: 0x183 in binary is 110000011.

G means global, i.e. the translations of linear addresses through this entry are global in the TLB. Here the PS bit is also set, and when the PS bit of a PDE entry is 1 the processor considers the 40-bit physical address in this entry as the address of a 2MB page rather than of another page table. We have 4 page tables left, so 4*512=2048 entries are filled similarly. Each one points to a 2MB physical page, so a total of 2048*2MB=4GB of memory is addressed by these 6 tables. Again these entries map each linear address to an equal value, so after entering 64-bit mode we are effectively still using physical addresses. Generating the tables that translate the kernel’s virtual address space is done at a later stage. The next instructions load the CR3 register with the address of pgtable; from the Intel CPU’s point of view, this is the PML4 address.
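The coverage arithmetic and the 0x183 flag decoding can be checked directly (bit positions per the Intel SDM):

```python
# 4 level 2 tables x 512 entries x 2MB pages = 4GB identity-mapped
print(4 * 512 * (2 << 20) == 4 << 30)   # → True

# Decode 0x183 = 0b110000011
flags = 0x183
print(bool(flags & (1 << 0)))   # P:   page present                 → True
print(bool(flags & (1 << 1)))   # R/W: writable                     → True
print(bool(flags & (1 << 7)))   # PS:  entry maps a 2MB page        → True
print(bool(flags & (1 << 8)))   # G:   translation global in TLB    → True
```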

       leal	pgtable(%ebx), %eax
	movl	%eax, %cr3

Then we jump to “startup_64”, and after decompressing the main kernel image we jump into arch/x86/kernel/head_64.S. This is where the page tables are revised and changed to enable the actual virtual-to-physical address translation.
First, the code here calculates the difference between the expected location of the kernel in RAM and its current location. Normally these two are equal, but when KASLR is enabled at compile time for early boot address space randomization, there will be a distance between the two. In most cases it’s disabled by default, since it is assumed that users may want to hibernate their systems, a feature which is infeasible in current kernel versions when KASLR changes the expected addresses.


        /*
	 * Compute the delta between the address I am compiled to run at and the
	 * address I am actually running at.
	 */
	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp

	/* Is the address not 2M aligned? */
	testl	$~PMD_PAGE_MASK, %ebp
	jnz	bad_address

	/*
	 * Is the address too large?
	 */
	leaq	_text(%rip), %rax
	shrq	$MAX_PHYSMEM_BITS, %rax
	jnz	bad_address

	/*
	 * Fixup the physical addresses in the page table
	 */
	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)

	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)

	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)

This difference is calculated by leveraging RIP-relative addressing, a technique that makes the address of some desired location in code or data independent of its compile-time address. You may already be familiar with RIP-relative addressing, but the concept is important enough that I’ll devote a few more lines to explaining it.
Relative addressing is the act of addressing a location within a program without needing to know its static address beforehand. Consider an example. I have a variable called “X” whose address is 100. At some point of execution there’s an instruction whose purpose is to update the value of X, for example writing zero to the memory location that holds X. If my program is loaded such that its starting point coincides with the start of memory, then the location of X in RAM is 100. So if the aforementioned instruction simply asks the processor to write zero to location 100 of main memory, the value will be correctly written to X. But what if my program is loaded somewhere else, so that the address of X while the program is running is no longer 100? What happens if the same instruction that hardcodes the address 100 gets executed? In that case the processor writes somewhere other than X, thereby corrupting the program’s memory.

This is where relative addressing comes in handy. To work this problem out, rather than hardcoding an address in the instruction, we can calculate the distance between the desired address and the address of the current instruction at compile time, and use this offset in the instruction instead of the address itself. Since the address of the current instruction is always known and kept in the RIP register, the correct address of any memory location can be specified by its relative distance from the current instruction. In this case we can ask the assembler (mind you, it’s the assembler; in principle this has nothing to do with the processor) to calculate the offset and put it in the instruction as a value to be added to RIP at execution time. In our hypothetical program we could write an instruction like this (a pseudo-instruction, not a valid one):

mov 100(rip),0

Assemblers are designed such that if you add a value to RIP like above, that value (100 here) is replaced with the distance between the target and the instruction (the next instruction, to be more precise, since the CPU increments RIP immediately after fetching an instruction from memory). For example, if the current instruction is at location 30 of the program and X is, as mentioned, at 100, then the above instruction turns into this when the respective machine code is written to the output binary:

mov 70(rip),0

On the other hand, at run time the CPU doesn’t treat RIP differently from any other register when it encounters such an instruction. It just sees that it should access a memory location that is 70 bytes away from the current value of RIP. It doesn’t matter what RIP is at that time; the distance between RIP and X will always be the same.
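The hypothetical example above boils down to simple arithmetic, which we can sketch (keeping the text’s simplification that the offset is measured from the instruction at 30 rather than from the one after it):

```python
def rip_relative_offset(instr_addr: int, target: int) -> int:
    # The assembler replaces the absolute address with this distance
    return target - instr_addr

offset = rip_relative_offset(30, 100)
print(offset)        # → 70
# At run time the CPU adds the offset back to RIP; the result is correct
# no matter where the program was loaded
for load_base in (0, 0x4000, 0x7C00):
    assert (load_base + 30) + offset == load_base + 100
```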
Using these two lines, the kernel code calculates the aforementioned distance between the physical address the kernel should have been loaded at (0x1000000) and the physical address it is actually loaded and running at now. As I said, KASLR is disabled by default, so the kernel is decompressed and loaded at the preset address, 0x1000000.

       leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp

The first line gets the physical address of the _text section using its address relative to RIP and stores it in RBP. Then it subtracts from this value the difference between the fixed compile-time addresses of the _text section and __START_KERNEL_map. The former is determined by the organization of the assembly file, the custom addresses that have been defined, and finally the linker script. We can simply read this address with a debugger to verify it’s exactly what we expect:

(gdb) p &_text
$1 = 0xffffffff81000000 

The “0xffffffff8” part is the virtual relocation; we are still dealing with physical addresses, so at run time _text actually resides at location 0x1000000. This is the hardcoded physical address of the kernel after decompression.
The latter is defined as:

#define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

So the result of this subtraction that is stored in RBP is:

RBP = (0xffffffff81000000 - 0xffffffff80000000) - 0x1000000 = 0

This value is then added to the entries of the page tables defined in this file, which will soon play the role of the page tables the processor uses to map virtual memory. So we should first look at the page table definitions. What follows is the list of page tables defined in the same file, in ascending order of their physical addresses:

The last one (dynamic+4) is a table whose presence can be inferred by reading the code in the assembly file but which doesn’t have any name associated with it (or at least I couldn’t find one); it is placed immediately after early_dynamic_pgts. Now let’s look at the definitions of these tables:


__INITDATA
NEXT_PAGE(early_level4_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(early_dynamic_pgts)
	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0

NEXT_PAGE(init_level4_pgt)
	.fill	512,8,0

NEXT_PAGE(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
		KERNEL_IMAGE_SIZE/PMD_SIZE)

NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	/* 8MB reserved for vsyscalls + a 2MB hole = 4 + 1 entries */
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0

Let’s investigate them one by one. early_level4_pgt has only one active entry: the first 511 entries are filled with zero, so they are effectively deactivated. The only active entry, the last one in the table, contains the physical address of level3_kernel_pgt. You will shortly understand why we only set the last one. The _PAGE_TABLE and _KERNPG_TABLE that you see in the definitions above are the information bits appended to an entry; we don’t need to cover them now, as they are not relevant to the actual mapping (except the PS bit, which I will mention whenever necessary). early_dynamic_pgts and init_level4_pgt are filled with zero.

Now look at how level3_kernel_pgt is initialized. The first directive fills the first L3_START_KERNEL entries of this table with zero. L3_START_KERNEL is defined as pud_index(__START_KERNEL_map). Now it’s getting beautiful: this is the moment you should get a feeling for the “Virtual Memory Management” tables. Okay, remember what __START_KERNEL_map was:

#define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

This is the START, the very beginning of the accessible kernel address space. Virtual addresses below this are not accessible, even by the kernel, after the physical address of early_level4_pgt is loaded into CR3. It is the hardware that fails to map the lower addresses, since they are not mapped in the PML4 table, unless you reload or modify the page tables. Now if we want to map this address, where should we put the corresponding entry in our level 4 page table? The PGD index of this virtual address (the index used for the first stage of translation in the PML4) is 111111111: all 9 bits are 1. So which entry does this correspond to? The 512th. This is why we placed the entry in the last slot of early_level4_pgt.

The PUD index (level 3 index) of this address is 111111110, i.e. 510, which is exactly what L3_START_KERNEL evaluates to (the kernel comment in the listing above makes the same calculation). Indexes start from zero, so this address corresponds to entry number 510. The addresses below __START_KERNEL_map need not be accessible, so the preceding entries must all be disabled; consequently we fill the first L3_START_KERNEL = 510 entries with zero. This is what the first .fill directive does.

level2_kernel_pgt can address 1GB of memory, and so can level2_fixmap_pgt. The virtual addresses relating to level2_kernel_pgt start from 0xffffffff80000000 and cover 1GB; this means 1GB of the address space is reserved for the kernel’s code, data and modules. The virtual addresses relating to level2_fixmap_pgt start from 0xffffffffC0000000, whose PUD index is 111111111; this also covers 1GB. We don’t need to map more memory in the kernel space; 2GB is more than enough.
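The index arithmetic above is easy to verify in Python; pgd_index and pud_index below mirror the kernel macros of the same names:

```python
START_KERNEL_MAP = 0xFFFFFFFF80000000

def pgd_index(va: int) -> int:
    return (va >> 39) & 0x1FF       # level 4 (PML4) index: 9 bits

def pud_index(va: int) -> int:
    return (va >> 30) & 0x1FF       # level 3 index: next 9 bits

print(pgd_index(START_KERNEL_MAP))    # → 511: the last entry of early_level4_pgt
print(pud_index(START_KERNEL_MAP))    # → 510: the value of L3_START_KERNEL
print(pud_index(0xFFFFFFFFC0000000))  # → 511: the level2_fixmap_pgt slot
```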
level2_kernel_pgt maps the kernel’s code and data to the first 512MB of RAM.
Look at the single line that does this:

PMDS(0, __PAGE_KERNEL_LARGE_EXEC,KERNEL_IMAGE_SIZE/PMD_SIZE)

PMDS is a macro that fills the entries of a page table. The macro expands as:


#define PMDS(START, PERM, COUNT)			\
	i = 0 ;						\
	.rept (COUNT) ;					\
	.quad	(START) + (i << PMD_SHIFT) + (PERM) ;	\
	i = i + 1 ;					\
	.endr

This macro receives the physical address of the location that is going to be mapped and builds a page table that is considered a level 2 table (also called a page middle directory), filling as many entries as COUNT denotes. PERM is part of the information bits; this value has the PS bit set to one, so we are creating mappings to 2MB physical pages. KERNEL_IMAGE_SIZE is defined as 512 * 1024 * 1024 = 512MB:


#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024)

PMD_SIZE is 2MB, so COUNT will be 512MB/2MB = 256. This means the PMDS macro fills the first 256 entries, mapping the first 512MB of memory. Since these page tables are 4K-aligned, the resulting table is still 4KB with 512 entries, only 256 of which are used.

The first 506 entries of level2_fixmap_pgt are disabled. The address of level1_fixmap_pgt is stored in the next entry, and the remaining 5 entries are reserved for other purposes. Finally, level1_fixmap_pgt is initialized with zeros.

Now, getting back to the beginning of the file: I said the code adds a fixed value to the physical addresses stored in these tables. In normal situations that fixed value is zero, due to KASLR being disabled and thus the kernel being loaded at the expected physical location. In the next stage the kernel builds some identity-mapped page tables. These tables are similar, from a functional point of view, to the first page tables built in boot/compressed/head_64.S; those are the page tables currently in use while the code in kernel/head_64.S is running, but soon we will switch to the newly created ones.
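The PMDS expansion can be simulated to see the entries it emits. In the sketch below, 0x1E3 is a stand-in for __PAGE_KERNEL_LARGE_EXEC (whose exact value we don’t need here); what matters is that the PS bit, 0x80, is set in it:

```python
PMD_SHIFT = 21
PMD_SIZE = 1 << PMD_SHIFT                     # 2MB
KERNEL_IMAGE_SIZE = 512 * 1024 * 1024

def pmds(start: int, perm: int, count: int) -> list:
    # Mirrors the .rept body: (START) + (i << PMD_SHIFT) + (PERM)
    return [start + (i << PMD_SHIFT) + perm for i in range(count)]

entries = pmds(0, 0x1E3, KERNEL_IMAGE_SIZE // PMD_SIZE)
print(len(entries))                          # → 256: half of a 512-entry table
print(entries[1] - entries[0] == PMD_SIZE)   # → True: consecutive 2MB frames
```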


leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx

	movq	%rdi, %rax
	shrq	$PGDIR_SHIFT, %rax

The shift above extracts the PML4 index part of the address. This address is physical, and an identical virtual address is mapped for it.


	leaq	(4096 + _KERNPG_TABLE)(%rbx), %rdx // rdx = early_dynamic_pgts
	movq	%rdx, 0(%rbx,%rax,8) 
	movq	%rdx, 8(%rbx,%rax,8) 

	addq	$4096, %rdx  //next table: 
	movq	%rdi, %rax
	shrq	$PUD_SHIFT, %rax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)  
	incl	%eax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)  

	addq	$8192, %rbx
	movq	%rdi, %rax
	shrq	$PMD_SHIFT, %rdi
	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
	leaq	(_end - 1)(%rip), %rcx
	shrq	$PMD_SHIFT, %rcx
	subq	%rdi, %rcx
	incl	%ecx

1:
	andq	$(PTRS_PER_PMD - 1), %rdi
	movq	%rax, (%rbx,%rdi,8)
	incq	%rdi
	addq	$PMD_SIZE, %rax
	decl	%ecx
	jnz	1b

	/*
	 * Fixup the kernel text+data virtual addresses. Note that
	 * we might write invalid pmds, when the kernel is relocated
	 * cleanup_highmap() fixes this up along with the mappings
	 * beyond _end.
	 */
	leaq	level2_kernel_pgt(%rip), %rdi
	leaq	4096(%rdi), %r8
	/* See if it is a valid page table entry */
1:	testb	$1, 0(%rdi)
	jz	2f
	addq	%rbp, 0(%rdi)
	/* Go to the next page */
2:	addq	$8, %rdi
	cmp	%r8, %rdi
	jne	1b

	/* Fixup phys_base */
	addq	%rbp, phys_base(%rip)

	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f

Since the virtual addresses equal to the kernel’s physical addresses must now be mapped in these identity-mapped tables, and the PML4 index of these virtual addresses is zero, we fill the first entry of early_level4_pgt. You can see that the same memory location is mapped to different virtual addresses; this is one of those useful features of the virtual memory technique. In our case, the kernel pages are mapped both to the actual virtual addresses specified at compile time and to virtual addresses equal to the physical addresses themselves, so that the kernel can keep running and correctly accessing memory immediately after switching to the newly created tables. At that moment RIP still holds the physical address of the current instruction; the switch into the kernel’s virtual address space happens later.

To build the identity-mapped part, we store the physical address of the _text section in RDI and RAX. Then RAX is shifted 39 bits to the right so that it contains the PML4 index of the physical address, which is zero. Also, the address of early_level4_pgt is loaded into the RBX register. Then we store the address of the dynamic page table into the first and second entries of early_level4_pgt. Similarly, the first two entries of the dynamic table are loaded with the start address of the next page table (located at dynamic+4K) that I mentioned in the list of page tables defined in head_64.S. Then we calculate the number of PMD entries between the end and the start of the kernel.


        movq	%rdi, %rax
	shrq	$PMD_SHIFT, %rdi
	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
	leaq	(_end - 1)(%rip), %rcx
	shrq	$PMD_SHIFT, %rcx
	subq	%rdi, %rcx

The calculation is done by subtracting the leftmost 27 bits of the physical address of _text from those of _end. Note that these two addresses are identical in their PML4 and PUD parts, so that part becomes zero in the result, which is what we want: the goal is to compare only the PMD parts in order to determine how many 2MB pages are needed to cover the region of memory that contains the kernel code.


        incl	%ecx

1:
	andq	$(PTRS_PER_PMD - 1), %rdi
	movq	%rax, (%rbx,%rdi,8)
	incq	%rdi
	addq	$PMD_SIZE, %rax
	decl	%ecx
	jnz	1b

Again the PS bit of these entries is set to one, so we end up with PMD entries that directly map 2MB pages, which in turn contain the kernel code. You will see that these identity-mapping entries are short-lived: soon after the kernel is ready to leave the initialization phase, they are destroyed.
Remember that we fixed up the physical addresses in the virtual-address mapping entries earlier, when we started running this file. Next, the same is done for the level2_kernel_pgt entries. Keep in mind that early KASLR randomization only affects the physical addresses; the virtual addresses are those defined in the source files and are not changed by KASLR during the early boot process. Now we are ready to activate the early_level4_pgt table and enter the virtual address space:

       addq	%rbp, phys_base(%rip)

	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f

Again we store the KASLR relocation offset somewhere else, this time in a variable called phys_base (a label, to be more precise). In most cases it will be zero, for the reasons I explained earlier. Then the "expected" physical address of early_level4_pgt is loaded into the RAX register and we jump to the following label "1" defined in secondary_startup_64. Note that this is a short jump: the actual assembly instruction (eb 0f) contains the offset to the label, not its absolute virtual address.

Then the sum of RAX and phys_base (that is, the correct run-time physical address of early_level4_pgt) is loaded into the CR3 register. From this moment on we are using the new page tables we just created, but RIP still contains the physical address of the current instruction.

Now it’s time to enter into the virtual address space of the kernel:

       movq	$1f, %rax
	jmp	*%rax 

First the absolute virtual address of the following label is stored in the RAX register; this value is hardcoded at compile time. Then RAX is effectively loaded into RIP by the jmp instruction, and from this moment on we are running from virtual addresses. This is the end of using physical addresses for the kernel code and data. After some other initialization steps (which are not memory-related, so we skip them) the code jumps to x86_64_start_kernel, defined in arch/x86/kernel/head64.c:


asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
{
	int i;

	/*
	 * Build-time sanity checks on the kernel image and module
	 * area mappings. (these are purely build-time and produce no code)
	 */
	BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
	BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
	BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
	BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
	BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
	BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
	BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
				(__START_KERNEL & PGDIR_MASK)));
	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

	cr4_init_shadow();

	/* Kill off the identity-map trampoline */
	reset_early_page_tables();

	clear_bss();

	clear_page(init_level4_pgt);

	kasan_early_init();

	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
		set_intr_gate(i, early_idt_handler_array[i]);
	load_idt((const struct desc_ptr *)&idt_descr);

	copy_bootdata(__va(real_mode_data));

	/*
	 * Load microcode early on BSP.
	 */
	load_ucode_bsp();

	/* set init_level4_pgt kernel high mapping*/
	init_level4_pgt[511] = early_level4_pgt[511];

	x86_64_start_reservations(real_mode_data);
}

Look at the line containing "reset_early_page_tables();". Remember when I said the identity-map entries are short-lived? This is where they are destroyed, so that kernel data can be accessed only through virtual addresses. This function zeroes all entries of the early_level4_pgt table except the last one, which is used to translate virtual addresses starting at 0xffffffff80000000.


static void __init reset_early_page_tables(void)
{
	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
	next_early_pgt = 0;
	write_cr3(__pa_nodebug(early_level4_pgt));
}

Notice that this function also rewrites the physical address of early_level4_pgt (computed with the __pa_nodebug macro) into the CR3 register. Kernel developers sometimes redo things just in case, when the overhead is negligible, and this line seems to be one of those cases. By calling start_kernel() we finally leave head64.c and enter init/main.c.
At this point the kernel is running normally with virtual memory enabled. But note that so far we have only created the kernel address space mappings; this is not enough to run user space applications. In future posts I'll explain how those pages are constructed and how the kernel switches between address spaces. I don't guarantee that the posts of this memory management series will be consecutive, since they require more time to write than other posts generally do, but I'll try to cover the main concepts and details of the way the Linux kernel manages memory in a 3-part series, of which this article is the first.
Good luck until next post.
