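I have long wanted to write, while learning, about the Linux operating system and the Linux kernel in particular, but never found the time. Now an opportunity has come up: a foreign company can close an office at short notice, and luckily mine is not closing immediately, which leaves me two or three quiet months to sort out my thoughts. Taking advantage of the open source code, I analyzed the design and implementation of some key parts of the Linux kernel, mainly booting, memory management, interrupt handling, process scheduling, block devices and networking. I have also added some experience and insights accumulated over time. I am writing this down mainly so that I do not forget it; if it occasionally catches a few eyes and leads to discussion with like-minded people, I would be honored.
We begin at the point where you have a fairly recent 64-bit x86 machine with a fairly recent Linux distribution installed, and you press the power button.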
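1 Starting from power-on
After power is applied, the machine goes through a series of hardware start-up steps, which I skip here (I am not familiar with them). The CPU then starts working (it fetches its first instruction) and the POST is executed. Below are the initial values of several registers after the CPU powers up.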
http://s9/mw690/b02f77c8gcc3d4ccf78f8&690
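Analysis:
1. CR0.PE = 0: the processor is in real mode.
This bit determines the addressing mode used while the BIOS code executes.
2. RIP = FFF0h
3. CS.Base = FFFF_0000h
4. From RIP and CS.Base, the physical address of the first instruction is CS.Base + RIP = FFFF_FFF0h. After address decoding in the north bridge and south bridge, this address normally maps to the ROM that holds the BIOS (for example SPI flash). It lies exactly at 4G - 16.
From this point on, the CPU executes the BIOS.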
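The address arithmetic in item 4 can be checked with a few lines of C. This is only a sketch of the calculation; the function name reset_vector is made up for illustration, and the two input values are the ones listed above.

```c
#include <stdint.h>

/* Address of the first instruction fetched after reset:
 * the hidden CS base plus the instruction pointer.
 * (reset_vector is a hypothetical helper name.) */
static uint64_t reset_vector(uint64_t cs_base, uint64_t rip)
{
    return cs_base + rip;
}

/* Distance between an address and the 4G boundary. */
static uint64_t bytes_below_4g(uint64_t addr)
{
    return (1ULL << 32) - addr;
}
```

Calling reset_vector(0xFFFF0000, 0xFFF0) gives 0xFFFFFFF0, and bytes_below_4g of that value is 16, which is the "4G - 16" mentioned above.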
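2 The BIOS runs the MBR (Master Boot Record)
Starting from that first instruction, the BIOS runs for a while (hardware detection, interrupt controller initialization, memory initialization and so on) and then enumerates the installed hard disk and floppy drives, USB devices, network devices and so on, building the so-called Boot Device Structure. Then, following the configured boot order, it calls INT 19h to try to boot an OS from these devices. If a device carries boot information, the boot flow is entered.
Here we assume the bootable device is a hard disk, so the flow continues by running the MBR on that disk.
The MBR is the first sector of the hard disk (each sector is 512 bytes). Of these 512 bytes, the first 446 hold the boot code, the next 64 bytes hold the partition table (4 * 16 bytes), and the last two bytes are the boot signature (fixed at 0x55, 0xAA). The BIOS checks these two bytes first and only treats the disk as bootable if they have these values.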
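MBR structure:
Offset    Contents                        Size (bytes)
0h        Master boot code                446
01BEh     Disk partition table (DPT)      64
01FEh     Boot signature (0x55 0xAA)      2
(The hexadecimal signature 55 AA marks the end of a valid boot record. The boot record of every partition must carry this signature.)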
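As a small illustration of the layout above, the signature test that the BIOS applies can be sketched in C. The macro and function names are my own; only the offsets and the 0x55/0xAA values come from the table.

```c
#include <stdint.h>

#define MBR_SIZE        512
#define MBR_CODE_SIZE   446     /* master boot code, offset 0h      */
#define MBR_DPT_OFFSET  0x1BE   /* partition table, 4 * 16 bytes    */
#define MBR_DPT_SIZE    64
#define MBR_SIG_OFFSET  0x1FE   /* boot signature, 2 bytes          */

/* Returns 1 if the sector ends with the 0x55 0xAA boot signature.
 * (mbr_is_bootable is a hypothetical name for this check.) */
static int mbr_is_bootable(const uint8_t sector[MBR_SIZE])
{
    return sector[MBR_SIG_OFFSET] == 0x55 &&
           sector[MBR_SIG_OFFSET + 1] == 0xAA;
}
```

The three regions add up to exactly one sector: 446 + 64 + 2 = 512.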
Disk partition table (DPT) structure:
Offset    Contents                        Size (bytes)
01BEh     Partition entry 1               16
01CEh     Partition entry 2               16
01DEh     Partition entry 3               16
01EEh     Partition entry 4               16
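Partition entry (Partition Data Table) structure:
Offset    Contents                                                          Size (bytes)
00h       Boot indicator                                                    1
01h       Starting head                                                     1
02h       Starting sector (top 2 bits are the highest 2 cylinder bits)      1
03h       Starting cylinder (low bits of the cylinder number)               1
04h       System indicator                                                  1
05h       Ending head                                                       1
06h       Ending sector (top 2 bits are the highest 2 cylinder bits)        1
07h       Ending cylinder (low bits of the cylinder number)                 1
08h       Number of sectors preceding this partition                        4
0Ch       Number of sectors in this partition                               4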
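Boot indicator:
00h: non-bootable partition
80h: bootable partition (only one partition may carry this ID)
System indicator:
00h: unknown operating system
01h: DOS FAT12 (16-bit sector count)
02h: XENIX
04h: DOS FAT16 (16-bit sector count)
05h: DOS extended partition (DOS 3.3+)
06h: DOS 4.0 (Compaq 3.31), 32-bit sector count
51h: Ontrack extended partition
64h: Novell
75h: PCIX
DBh: CP/M
FFh: BBT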
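To make the 16-byte entry layout concrete, here is a sketch in C that decodes one entry and looks for the active partition. The struct and function names are hypothetical, and multi-byte fields are assumed to be little-endian, as is usual on x86.

```c
#include <stdint.h>

/* Decoded view of one 16-byte partition entry (names are illustrative). */
struct part_entry {
    uint8_t  boot;          /* 00h: 80h = bootable, 00h = not bootable */
    unsigned start_head;    /* 01h */
    unsigned start_sector;  /* 02h, low 6 bits */
    unsigned start_cyl;     /* 02h top 2 bits + 03h */
    uint8_t  type;          /* 04h: system indicator */
    uint32_t lba_start;     /* 08h: sectors preceding the partition */
    uint32_t num_sectors;   /* 0Ch: sectors in the partition */
};

/* Read a little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static struct part_entry parse_entry(const uint8_t raw[16])
{
    struct part_entry e;
    e.boot         = raw[0];
    e.start_head   = raw[1];
    e.start_sector = raw[2] & 0x3F;
    e.start_cyl    = ((unsigned)(raw[2] & 0xC0) << 2) | raw[3];
    e.type         = raw[4];
    e.lba_start    = le32(raw + 8);
    e.num_sectors  = le32(raw + 12);
    return e;
}

/* Index (0..3) of the first entry marked 80h in a 64-byte DPT, or -1. */
static int find_active(const uint8_t dpt[64])
{
    for (int i = 0; i < 4; i++)
        if (dpt[i * 16] == 0x80)
            return i;
    return -1;
}
```

For example, an entry whose bytes at 02h and 03h are C1h and FFh decodes to sector 1 and cylinder 1023, because the top two bits of the sector byte extend the cylinder number to 10 bits.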
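Here is an example. The BIOS confirms from 55 AA that the disk is bootable, then loads the sector into memory and runs its first 446 bytes. These 446 bytes are usually called the primary boot loader. This code typically searches the partition table for the entry whose first byte is 80h (the bootable partition), then loads the first sector of that partition (called the volume boot record or boot sector) and hands control to it.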
http://s14/mw690/b02f77c8g7ad2f126206d&690
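GRUB (GRand Unified Bootloader) is the boot loader commonly used by modern Linux systems to load the Linux kernel. GRUB has three stages: stage1, stage1.5 and stage2, as shown in the figure below. Stage1 is installed in the MBR discussed in the previous section. Stage 1.5 depends on the file system: for example, to load from an ext2 or ext3 file system, e2fs_stage1_5 is loaded. When stage 1.5 has finished, control passes to stage2.
Once stage2 is loaded, GRUB can display the list of available kernels (defined in /boot/grub/grub.conf). We can select a kernel and even modify additional kernel parameters. Alternatively, we can use a command-line shell for advanced manual control over the boot process.
With the second-stage boot loader in memory, the file system can be queried, and the default kernel image and initrd image are loaded into memory. When these images are ready, the stage 2 boot loader invokes the kernel image.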
http://s5/mw690/b02f77c8gcc3d72a370a4&690
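Once the kernel image has been loaded into memory and the stage 2 boot loader releases control, the kernel stage begins. The kernel image is not directly executable; it is a compressed image. Typically it is a zImage (compressed image, less than 512KB) or a bzImage (big compressed image, larger than 512KB) that was compressed beforehand with zlib. At the head of the image sits a routine that does a small amount of hardware setup, decompresses the kernel contained in the image, places it at a fixed location in memory, and then calls the kernel, at which point the kernel boot process starts.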
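The flow is described below using the rhel6.2 x86_64 kernel source as an example.
The bzImage begins at arch/x86/boot/header.S. GRUB loads it at memory address 0x90000, but skips the first 512 bytes and starts execution directly at 0x90200.
The first 512 bytes of the bzImage mentioned above are called the bootsector. Modern Linux kernels (2.6 and later) cannot be booted directly from the bzImage and must be loaded by a bootloader. If something tries to boot from byte 0, this code prints a warning and then reboots the machine. The excerpt below is the code that forms these 512 bytes.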
==== from arch/x86/boot/header.S ====
BOOTSEG         = 0x07C0
SYSSEG          = 0x1000

#ifndef SVGA_MODE
#define SVGA_MODE ASK_VGA
#endif

#ifndef RAMDISK
#define RAMDISK 0
#endif

#ifndef ROOT_RDONLY
#define ROOT_RDONLY 1
#endif

        .code16
        .section ".bstext", "ax"

        .global bootsect_start
bootsect_start:

        # Normalize the start address
        ljmp    $BOOTSEG, $start2

start2:
        movw    %cs, %ax
        movw    %ax, %ds
        movw    %ax, %es
        movw    %ax, %ss
        xorw    %sp, %sp
        sti
        cld

        movw    $bugger_off_msg, %si

msg_loop:
        lodsb
        andb    %al, %al
        jz      bs_die
        movb    $0xe, %ah
        movw    $7, %bx
        int     $0x10
        jmp     msg_loop

bs_die:
        # Allow the user to press a key, then reboot
        xorw    %ax, %ax
        int     $0x16
        int     $0x19

        # int 0x19 should never return.  In case it does anyway,
        # invoke the BIOS reset code...
        ljmp    $0xf000,$0xfff0

        .section ".bsdata", "a"
bugger_off_msg:
        .ascii  "Direct booting from floppy is no longer supported.\r\n"
        .ascii  "Please use a boot loader program instead.\r\n"
        .ascii  "\n"
        .ascii  "Remove disk and press any key to reboot . . .\r\n"
        .byte   0

        # Kernel attributes; used by setup.  This is part 1 of the
        # header, from the old boot sector.

        .section ".header", "a"
        .globl  hdr
hdr:                                    # [E1]
setup_sects:    .byte 0
root_flags:     .word ROOT_RDONLY
syssize:        .long 0
ram_size:       .word 0
vid_mode:       .word SVGA_MODE
root_dev:       .word 0
boot_flag:      .word 0xAA55
==== from arch/x86/boot/header.S ====
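The far jumps in this excerpt, and the 0x90000/0x90200 addresses mentioned earlier, all rely on real-mode segment:offset addressing. As a rough sketch (the helper name is my own), the linear address is the segment shifted left by four bits plus the offset:

```c
#include <stdint.h>

/* Real-mode linear address = segment * 16 + offset.
 * (real_mode_linear is a hypothetical helper name.) */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

With this, BOOTSEG:0 (0x07C0:0x0000) is linear 0x7C00, segment 0x9000 with offset 0x0200 is 0x90200, and the BIOS reset jump target 0xf000:0xfff0 is 0xFFFF0.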
The code after the first 512 bytes:
==== from arch/x86/boot/header.S ====
        .globl  _start
_start:
                # Explicitly enter this as bytes, or the assembler
                # tries to generate a 3-byte jump here, which causes
                # everything else to push off to the wrong offset.
                .byte   0xeb            # short (2-byte) jump [E2]
                .byte   start_of_setup-1f
1:

        # Part 2 of the header, from the old setup.S

                .ascii  "HdrS"          # header signature
                .word   0x020a          # header version number (>= 0x0105)
                                        # or else old loadlin-1.5 will fail)
                .globl realmode_swtch
realmode_swtch: .word   0, 0            # default_switch, SETUPSEG
start_sys_seg:  .word   SYSSEG          # obsolete and meaningless, but just
                                        # in case something decided to "use" it
                .word   kernel_version-512 # pointing to kernel version string
                                        # above section of header is compatible
                                        # with loadlin-1.5 (header v1.5). Don't
                                        # change it.

type_of_loader: .byte   0               # 0 means ancient bootloader, newer
                                        # bootloaders know to change this.
                                        # See Documentation/i386/boot.txt for
                                        # assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH     = 1                     # If set, the kernel is loaded high
CAN_USE_HEAP    = 0x80                  # If set, the loader also has set
                                        # heap_end_ptr to tell how much
                                        # space behind setup.S can be used for
                                        # heap purposes.
                                        # Only the loader knows what is free
                .byte   LOADED_HIGH

setup_move_size: .word  0x8000          # size to move, when setup is not
                                        # loaded at 0x90000. We will move setup
                                        # to 0x90000 then just before jumping
                                        # into the kernel. However, only the
                                        # loader knows how much data behind
                                        # us also needs to be loaded.

code32_start:                           # here loaders can put a different
                                        # start address for 32-bit code.
                .long   0x100000        # 0x100000 = default for big kernel

ramdisk_image:  .long   0               # address of loaded ramdisk image
                                        # Here the loader puts the 32-bit
                                        # address where it loaded the image.
                                        # This only will be read by the kernel.

ramdisk_size:   .long   0               # its size in bytes

bootsect_kludge:
                .long   0               # obsolete

heap_end_ptr:   .word   _end+STACK_SIZE-512
                                        # (Header version 0x0201 or later)
                                        # space from here (exclusive) down to
                                        # end of setup code can be used by setup
                                        # for local heap purposes.

ext_loader_ver:
                .byte   0               # Extended boot loader version
ext_loader_type:
                .byte   0               # Extended boot loader type

cmd_line_ptr:   .long   0               # (Header version 0x0202 or later)
                                        # If nonzero, a 32-bit pointer
                                        # to the kernel command line.
                                        # The command line should be
                                        # located between the start of
                                        # setup and the end of low
                                        # memory (0xa0000), or it may
                                        # get overwritten before it
                                        # gets read.  If this field is
                                        # used, there is no longer
                                        # anything magical about the
                                        # 0x90000 segment; the setup
                                        # can be located anywhere in
                                        # low memory 0x10000 or higher.

ramdisk_max:    .long   0x7fffffff
                                        # (Header version 0x0203 or later)
                                        # The highest safe address for
                                        # the contents of an initrd
                                        # The current kernel allows up to 4 GB,
                                        # but leave it at 2 GB to avoid
                                        # possible bootloader bugs.

kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN  #physical addr alignment
                                                #required for protected mode
                                                #kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel:    .byte 1
#else
relocatable_kernel:    .byte 0
#endif
min_alignment:          .byte MIN_KERNEL_ALIGN_LG2      # minimum alignment
pad3:                   .word 0

cmdline_size:   .long   COMMAND_LINE_SIZE-1     #length of the command line,
                                                #added with boot protocol
                                                #version 2.06

hardware_subarch:       .long 0                 # subarchitecture, added with 2.07
                                                # default to 0 for normal x86 PC

hardware_subarch_data:  .quad 0

payload_offset:         .long ZO_input_data
payload_length:         .long ZO_z_input_len

setup_data:             .quad 0                 # 64-bit physical pointer to
                                                # single linked list of
                                                # struct setup_data

pref_address:           .quad LOAD_PHYSICAL_ADDR        # preferred load addr

#define ZO_INIT_SIZE    (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
#define VO_INIT_SIZE    (VO__end - VO__text)
#if ZO_INIT_SIZE > VO_INIT_SIZE
#define INIT_SIZE ZO_INIT_SIZE
#else
#define INIT_SIZE VO_INIT_SIZE
#endif
init_size:              .long INIT_SIZE         # kernel initialization size

# End of setup header ##################################################### [E3]
        .section ".entrytext", "ax"
start_of_setup:
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
        movw    $0x0000, %ax            # Reset disk controller
        movb    $0x80, %dl              # All disks
        int     $0x13
#endif

# Force %es = %ds
        movw    %ds, %ax
        movw    %ax, %es
        cld

# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
# which happened to work by accident for the old code.  Recalculate the stack
# pointer if %ss is invalid.  Otherwise leave it alone, LOADLIN sets up the
# stack behind its own code, so we can't blindly put it directly past the heap.

        movw    %ss, %dx
        cmpw    %ax, %dx        # %ds == %ss?
        movw    %sp, %dx
        je      2f              # -> assume %sp is reasonably set

        # Invalid %ss, make up a new stack
        movw    $_end, %dx
        testb   $CAN_USE_HEAP, loadflags
        jz      1f
        movw    heap_end_ptr, %dx
1:      addw    $STACK_SIZE, %dx
        jnc     2f
        xorw    %dx, %dx        # Prevent wraparound

2:      # Now %dx should point to the end of our stack space
        andw    $~3, %dx        # dword align (might as well...)
        jnz     3f
        movw    $0xfffc, %dx    # Make sure we're not zero
3:      movw    %ax, %ss
        movzwl  %dx, %esp       # Clear upper half of %esp
        sti                     # Now we should have a working stack

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
        pushw   %ds
        pushw   $6f
        lretw
6:

# Check signature at end of setup
        cmpl    $0x5a5aaa55, setup_sig
        jne     setup_bad

# Zero the bss
        movw    $__bss_start, %di
        movw    $_end+3, %cx
        xorl    %eax, %eax
        subw    %di, %cx
        shrw    $2, %cx
        rep; stosl

# Jump to C code (should not return)
        calll   main                    # [E4]

# Setup corrupt somehow...
setup_bad:
        movl    $setup_corrupt, %eax
        calll   puts
        # Fall through...

        .globl  die
        .type   die, @function
die:
        hlt
        jmp     die

        .size   die, .-die

        .section ".initdata", "a"
setup_corrupt:
        .byte   7
        .string "No setup signature found...\n"
==== from arch/x86/boot/header.S ====
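The field order in the header listing fixes where a boot loader finds each value inside the image. The sketch below checks a loaded image the way a loader might; the function names are hypothetical, and the offsets are my own count from the listing (boot_flag is the last word of the first 512 bytes, the 2-byte jump sits at 0x200, so "HdrS" lands at 0x202, the version at 0x206 and code32_start at 0x214).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OFF_BOOT_FLAG     0x1FE  /* .word 0xAA55                  */
#define OFF_HEADER        0x202  /* .ascii "HdrS"                 */
#define OFF_VERSION       0x206  /* .word 0x020a in this kernel   */
#define OFF_CODE32_START  0x214  /* .long 0x100000 by default     */

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns the header version, or -1 if the image does not look like
 * a kernel with a setup header. */
static int setup_header_version(const uint8_t *img, size_t len)
{
    if (len < OFF_CODE32_START + 4)
        return -1;
    if (le16(img + OFF_BOOT_FLAG) != 0xAA55)
        return -1;
    if (memcmp(img + OFF_HEADER, "HdrS", 4) != 0)
        return -1;
    return le16(img + OFF_VERSION);
}

/* Reads code32_start; the caller must have validated the header first. */
static uint32_t setup_code32_start(const uint8_t *img)
{
    const uint8_t *p = img + OFF_CODE32_START;
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
```

For the header shown above, setup_header_version would report 0x020a and setup_code32_start would report 0x100000, unless the boot loader has overwritten the latter.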
Below is the definition of the main function.
==== from arch/x86/boot/main.c ====
void main(void)
{
        copy_boot_params();
        init_heap();
        if (validate_cpu()) {
                puts("Unable to boot - please use a kernel appropriate "
                     "for your CPU.\n");
                die();
        }
        set_bios_mode();
        detect_memory();
        keyboard_set_repeat();
        query_mca();
        query_ist();
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
        query_apm_bios();
#endif
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
        query_edd();
#endif
        set_video();
        if (cmdline_find_option_bool("quiet"))
                boot_params.hdr.loadflags |= QUIET_FLAG;
        go_to_protected_mode();
}
==== from arch/x86/boot/main.c ====
After the necessary hardware setup and environment setup (stack, heap and so on) are complete, the code switches to protected mode (go_to_protected_mode).
==== from arch/x86/boot/pm.c ====
void go_to_protected_mode(void)
{
        realmode_switch_hook();
        if (enable_a20()) {
                puts("A20 gate not responding, unable to boot...\n");
                die();
        }
        reset_coprocessor();
        mask_all_interrupts();
        setup_idt();
        setup_gdt();
        protected_mode_jump(boot_params.hdr.code32_start,
                            (u32)&boot_params + (ds() << 4));
}
==== from arch/x86/boot/pm.c ====
protected_mode_jump is defined in arch/x86/boot/pmjump.S.
==== from arch/x86/boot/pmjump.S ====
GLOBAL(protected_mode_jump)
        movl    %edx, %esi              # Pointer to boot_params table

        xorl    %ebx, %ebx
        movw    %cs, %bx
        shll    $4, %ebx
        addl    %ebx, 2f
        jmp     1f                      # Short jump to serialize on 386/486
1:

        movw    $__BOOT_DS, %cx
        movw    $__BOOT_TSS, %di

        movl    %cr0, %edx
        orb     $X86_CR0_PE, %dl        # Protected mode
        movl    %edx, %cr0

        # Transition to 32-bit mode
        .byte   0x66, 0xea              # ljmpl opcode
2:      .long   in_pm32                 # offset
        .word   __BOOT_CS               # segment
ENDPROC(protected_mode_jump)
==== from arch/x86/boot/pmjump.S ====
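Here the entry point is set to 0x100000 (that is, code32_start in the header). Which code actually sits at 0x100000 depends on how GRUB loaded the image. GRUB loads arch/x86/boot/compressed/head_64.S at the 1M mark in memory (0x100000). Its entry point is startup_32. The code is excerpted below.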
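==== from arch/x86/boot/compressed/head_64.S ====
        __HEAD
        .code32
ENTRY(startup_32)
        cld
        /* KEEP_SEGMENTS set: the boot loader asks us not to reload segments */
        testb   $(1<<6), BP_loadflags(%esi)
        jnz     1f

        cli
        movl    $(__KERNEL_DS), %eax
        movl    %eax, %ds
        movl    %eax, %es
        movl    %eax, %ss
1:

        /* %ebp = delta between the compiled and the actual load address */
        leal    (BP_scratch+4)(%esi), %esp
        call    1f
1:      popl    %ebp
        subl    $1b, %ebp

        /* set up a stack and make sure the CPU supports long mode */
        movl    $boot_stack_end, %eax
        addl    %ebp, %eax
        movl    %eax, %esp

        call    verify_cpu
        testl   %eax, %eax
        jnz     no_longmode

#ifdef CONFIG_RELOCATABLE
        movl    %ebp, %ebx
        movl    BP_kernel_alignment(%esi), %eax
        decl    %eax
        addl    %eax, %ebx
        notl    %eax
        andl    %eax, %ebx
#else
        movl    $LOAD_PHYSICAL_ADDR, %ebx
#endif

        /* target address to relocate to for decompression */
        addl    $z_extract_offset, %ebx

        /* load the new GDT with the 64-bit segments */
        leal    gdt(%ebp), %eax
        movl    %eax, gdt+2(%ebp)
        lgdt    gdt(%ebp)

        /* enable PAE mode */
        xorl    %eax, %eax
        orl     $(X86_CR4_PAE), %eax
        movl    %eax, %cr4

        /* build the early 4G boot page table: zero it first */
        leal    pgtable(%ebx), %edi
        xorl    %eax, %eax
        movl    $((4096*6)/4), %ecx
        rep     stosl

        /* level 4 */
        leal    pgtable + 0(%ebx), %edi
        leal    0x1007 (%edi), %eax
        movl    %eax, 0(%edi)

        /* level 3 */
        leal    pgtable + 0x1000(%ebx), %edi
        leal    0x1007(%edi), %eax
        movl    $4, %ecx
1:      movl    %eax, 0x00(%edi)
        addl    $0x00001000, %eax
        addl    $8, %edi
        decl    %ecx
        jnz     1b

        /* level 2 */
        leal    pgtable + 0x2000(%ebx), %edi
        movl    $0x00000183, %eax
        movl    $2048, %ecx
1:      movl    %eax, 0(%edi)
        addl    $0x00200000, %eax
        addl    $8, %edi
        decl    %ecx
        jnz     1b

        /* enable the boot page tables */
        leal    pgtable(%ebx), %eax
        movl    %eax, %cr3

        /* enable long mode in EFER */
        movl    $MSR_EFER, %ecx
        rdmsr
        btsl    $_EFER_LME, %eax
        wrmsr

        /* prepare the far return into 64-bit code */
        pushl   $__KERNEL_CS
        leal    startup_64(%ebp), %eax
        pushl   %eax

        /* enable paging and protected mode, which activates long mode */
        movl    $(X86_CR0_PG | X86_CR0_PE), %eax
        movl    %eax, %cr0

        /* jump from 32-bit compatibility mode into 64-bit mode */
        lret
ENDPROC(startup_32)
==== from arch/x86/boot/compressed/head_64.S ====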
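The three page-table loops in startup_32 are easier to follow when written out in C. The sketch below fills a 6-page buffer with the same entry values as the assembly (flags 0x007 for the upper levels, 0x183 for the 2MB level 2 entries) and then walks it like the MMU would. The function names are made up, and `base` stands for the physical address of pgtable, which is assumed to be 4KB aligned.

```c
#include <stdint.h>
#include <string.h>

#define PGTABLE_ENTRIES ((6 * 4096) / 8)  /* six 4KB pages of 8-byte entries */

/* Fill tbl as startup_32 does: 1 level 4 page, 1 level 3 page,
 * 4 level 2 pages holding 2048 entries of 2MB each (4GB in total). */
static void build_boot_pgtable(uint64_t *tbl, uint64_t base)
{
    memset(tbl, 0, PGTABLE_ENTRIES * sizeof(uint64_t));

    tbl[0] = base + 0x1007;                           /* level 4 -> level 3 */

    for (int i = 0; i < 4; i++)                       /* level 3 -> level 2 */
        tbl[512 + i] = base + 0x2007 + (uint64_t)i * 0x1000;

    for (int i = 0; i < 2048; i++)                    /* level 2: 2MB pages */
        tbl[1024 + i] = 0x183 + (uint64_t)i * 0x200000;
}

/* Walk the tables for a virtual address below 4GB. */
static uint64_t translate(const uint64_t *tbl, uint64_t base, uint64_t va)
{
    uint64_t l4 = tbl[(va >> 39) & 511];
    const uint64_t *l3t = tbl + ((l4 & ~0xFFFULL) - base) / 8;
    uint64_t l3 = l3t[(va >> 30) & 511];
    const uint64_t *l2t = tbl + ((l3 & ~0xFFFULL) - base) / 8;
    uint64_t l2 = l2t[(va >> 21) & 511];
    return (l2 & ~0x1FFFFFULL) + (va & 0x1FFFFF);
}
```

Every address below 4GB translates to itself, which is what lets the code keep running at the same addresses once paging is switched on.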
Execution now enters the 64-bit mode flow.
==== from arch/x86/boot/compressed/head_64.S ====
        .code64
        .org 0x200
ENTRY(startup_64)
        /* set up data segments */
        xorl    %eax, %eax
        movl    %eax, %ds
        movl    %eax, %es
        movl    %eax, %ss
        movl    %eax, %fs
        movl    %eax, %gs
        lldt    %ax
        movl    $0x20, %eax
        ltr     %ax

#ifdef CONFIG_RELOCATABLE
        leaq    startup_32(%rip), %rbp
        movl    BP_kernel_alignment(%rsi), %eax
        decl    %eax
        addq    %rax, %rbp
        notq    %rax
        andq    %rax, %rbp
#else
        movq    $LOAD_PHYSICAL_ADDR, %rbp
#endif

        /* target address to relocate to for decompression */
        leaq    z_extract_offset(%rbp), %rbx

        /* set up the stack */
        leaq    boot_stack_end(%rbx), %rsp

        /* zero EFLAGS */
        pushq   $0
        popfq

        /* copy the compressed kernel to the end of the buffer,
         * where in-place decompression is safe */
        pushq   %rsi
        leaq    (_bss-8)(%rip), %rsi
        leaq    (_bss-8)(%rbx), %rdi
        movq    $_bss, %rcx
        shrq    $3, %rcx
        std
        rep     movsq
        cld
        popq    %rsi

        /* jump to the relocated address */
        leaq    relocated(%rbx), %rax
        jmp     *%rax

        .text
relocated:

        /* clear BSS */
        xorl    %eax, %eax
        leaq    _bss(%rip), %rdi
        leaq    _ebss(%rip), %rcx
        subq    %rdi, %rcx
        shrq    $3, %rcx
        rep     stosq

        /* decompress, then jump to the new kernel */
        pushq   %rsi
        movq    %rsi, %rdi
        leaq    boot_heap(%rip), %rsi
        leaq    input_data(%rip), %rdx
        movl    $z_input_len, %ecx
        movq    %rbp, %r8               /* [E5] */
        call    decompress_kernel
        popq    %rsi

        jmp     *%rbp                   /* [E7] */

        .data
gdt:
        .word   gdt_end - gdt
        .long   gdt
        .word   0
        .quad   0x0000000000000000
        .quad   0x00af9a000000ffff
        .quad   0x00cf92000000ffff
        .quad   0x0080890000000000
        .quad   0x0000000000000000
gdt_end:

        .bss
        .balign 4
boot_heap:
        .fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
        .fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:

        .section ".pgtable","a",@nobits
        .balign 4096
pgtable:
        .fill 6*4096, 1, 0
==== from arch/x86/boot/compressed/head_64.S ====
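The .quad values in the gdt table are packed segment descriptors. A small decoder, sketched here with made-up struct and function names and following the standard x86 descriptor bit layout, shows what distinguishes the entries:

```c
#include <stdint.h>

/* Unpacked view of an 8-byte x86 segment descriptor (illustrative names). */
struct seg_desc {
    uint32_t base;
    uint32_t limit;   /* 20-bit limit, in units chosen by g */
    uint8_t  access;  /* present, DPL, type bits */
    int      l;       /* 64-bit code segment */
    int      db;      /* default operand size (32-bit when set) */
    int      g;       /* granularity: limit counted in 4KB units */
};

static struct seg_desc decode_desc(uint64_t d)
{
    struct seg_desc s;
    s.limit  = (uint32_t)((d & 0xFFFF) | ((d >> 32) & 0xF0000));
    s.base   = (uint32_t)(((d >> 16) & 0xFFFFFF) | ((d >> 32) & 0xFF000000));
    s.access = (uint8_t)(d >> 40);
    s.l      = (int)((d >> 53) & 1);
    s.db     = (int)((d >> 54) & 1);
    s.g      = (int)((d >> 55) & 1);
    return s;
}
```

Decoding 0x00af9a000000ffff yields access byte 0x9a with the L bit set (a 64-bit code segment), while 0x00cf92000000ffff yields access byte 0x92 with D/B set instead (a flat data segment). 0x0080890000000000 has access byte 0x89, a TSS descriptor, which is the entry that `movl $0x20, %eax; ltr %ax` selects.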
Below is the implementation of decompress_kernel.
==== from arch/x86/boot/compressed/misc.c ====
asmlinkage void decompress_kernel(void *rmode, memptr heap,
                                  unsigned char *input_data,
                                  unsigned long input_len,
                                  unsigned char *output)
{
        real_mode = rmode;

        if (real_mode->hdr.loadflags & QUIET_FLAG)
                quiet = 1;

        if (real_mode->screen_info.orig_video_mode == 7) {
                vidmem = (char *) 0xb0000;
                vidport = 0x3b4;
        } else {
                vidmem = (char *) 0xb8000;
                vidport = 0x3d4;
        }

        lines = real_mode->screen_info.orig_video_lines;
        cols = real_mode->screen_info.orig_video_cols;

        free_mem_ptr     = heap;
        free_mem_end_ptr = heap + BOOT_HEAP_SIZE;

        if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
                error("Destination address inappropriately aligned");
#ifdef CONFIG_X86_64
        if (heap > 0x3fffffffffffUL)
                error("Destination address too large");
#else
        if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff))
                error("Destination address too large");
#endif
#ifndef CONFIG_RELOCATABLE
        if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
                error("Wrong destination address");
#endif

        if (!quiet)
                putstr("\nDecompressing Linux... ");
        decompress(input_data, input_len, NULL, NULL, output, NULL, error);
        parse_elf(output);
        if (!quiet)
                putstr("done.\nBooting the kernel.\n");
        return;
}
==== from arch/x86/boot/compressed/misc.c ====
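In the flow above, the function decompress decompresses the kernel into an ELF-format kernel (stored at output). Without CONFIG_RELOCATABLE, output starts at 16M. With CONFIG_RELOCATABLE, the address is recalculated but must still be 2M aligned. In my experiments, the address was normally 16M even with CONFIG_RELOCATABLE enabled.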
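The recalculation under CONFIG_RELOCATABLE is the decl/addl/notl/andl sequence seen in the assembly (decl/addq/notq/andq in the 64-bit path), which rounds the load address up to the next multiple of the alignment. A sketch in C, with a made-up function name and assuming the alignment is a power of two:

```c
#include <stdint.h>

/* Round addr up to a multiple of align (align must be a power of two).
 * Mirrors: decl %eax; addl %eax,%ebx; notl %eax; andl %eax,%ebx */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    uint64_t mask = align - 1;      /* decl */
    return (addr + mask) & ~mask;   /* addl, notl, andl */
}
```

For instance, align_up(0x100000, 0x200000) is 0x200000, and an address that is already aligned, such as 0x1000000 with a 2M alignment, is returned unchanged.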
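The function parse_elf then parses the ELF kernel (code segment, data segment and so on). When decompress_kernel returns, execution jumps into the parsed kernel, whose first address is startup_64 in arch/x86/kernel/head_64.S.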
==== from arch/x86/boot/compressed/misc.c ====
static void parse_elf(void *output)
{
#ifdef CONFIG_X86_64
        Elf64_Ehdr ehdr;
        Elf64_Phdr *phdrs, *phdr;
#else
        Elf32_Ehdr ehdr;
        Elf32_Phdr *phdrs, *phdr;
#endif
        void *dest;
        int i;

        memcpy(&ehdr, output, sizeof(ehdr));
        if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
           ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
           ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
           ehdr.e_ident[EI_MAG3] != ELFMAG3) {
                error("Kernel is not a valid ELF file");
                return;
        }

        if (!quiet)
                putstr("Parsing ELF... ");

        phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);
        if (!phdrs)
                error("Failed to allocate space for phdrs");

        memcpy(phdrs, output + ehdr.e_phoff, sizeof(*phdrs) * ehdr.e_phnum);

        for (i = 0; i < ehdr.e_phnum; i++) {
                phdr = &phdrs[i];

                switch (phdr->p_type) {
                case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
                        dest = output;
                        dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
                        dest = (void *)(phdr->p_paddr);
#endif
                        memcpy(dest,
                               output + phdr->p_offset,
                               phdr->p_filesz);
                        break;
                default: break;
                }
        }
}
==== from arch/x86/boot/compressed/misc.c ====
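The only address arithmetic in parse_elf is the destination of each PT_LOAD segment. The sketch below isolates it; the function name is hypothetical, and LOAD_PHYSICAL_ADDR is assumed to be 16M (0x1000000), matching the 16M figure quoted earlier rather than a value read from any particular kernel config.

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000ULL  /* assumed: 16M */

/* Destination of a PT_LOAD segment.
 * Relocatable kernel: keep the segment's offset from LOAD_PHYSICAL_ADDR,
 * but apply it to the address the kernel was actually decompressed to.
 * Non-relocatable kernel: use the link-time physical address as is. */
static uint64_t segment_dest(uint64_t output, uint64_t p_paddr,
                             int relocatable)
{
    if (relocatable)
        return output + (p_paddr - LOAD_PHYSICAL_ADDR);
    return p_paddr;
}
```

So if a relocatable kernel was decompressed at 0x2000000, a segment linked at 0x1600000 would be copied to 0x2600000; with output at 16M the two cases give the same result.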
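4.2 Entering the "real" Linux kernel
After 4.1 (initial loading, decompression, parsing and reloading of the kernel), the CPU jumps to startup_64 in arch/x86/kernel/head_64.S. At last we are in the so-called "real" kernel. The main job of startup_64 is to enable paging. It finally jumps to the C function x86_64_start_kernel.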
==== from arch/x86/kernel/head_64.S ====
#define pud_index(x)    (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))

L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET)
L3_PAGE_OFFSET = pud_index(__PAGE_OFFSET)
L4_START_KERNEL = pgd_index(__START_KERNEL_map)
L3_START_KERNEL = pud_index(__START_KERNEL_map)

        .text
        __HEAD
        .code64
        .globl startup_64
startup_64:
        /* %rbp = delta between the compiled and the actual load address */
        leaq    _text(%rip), %rbp
        subq    $_text - __START_KERNEL_map, %rbp

        /* is the address not 2M aligned? */
        movq    %rbp, %rax
        andl    $~PMD_PAGE_MASK, %eax
        testl   %eax, %eax
        jnz     bad_address

        /* is the address too large? */
        leaq    _text(%rip), %rdx
        movq    $PGDIR_SIZE, %rax
        cmpq    %rax, %rdx
        jae     bad_address

        /* fix up the physical addresses in the page tables */
        addq    %rbp, init_level4_pgt + 0(%rip)
        addq    %rbp, init_level4_pgt + (L4_PAGE_OFFSET*8)(%rip)
        addq    %rbp, init_level4_pgt + (L4_START_KERNEL*8)(%rip)

        addq    %rbp, level3_ident_pgt + 0(%rip)

        addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
        addq    %rbp, level3_kernel_pgt + (511*8)(%rip)

        addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)
        //…
        jmp secondary_startup_64
ENTRY(secondary_startup_64)
        /* enable PAE mode and PGE */
        movl    $(X86_CR4_PAE | X86_CR4_PGE), %eax
        movq    %rax, %cr4

        /* set up the early boot 4-level page tables */
        movq    $(init_level4_pgt - __START_KERNEL_map), %rax
        addq    phys_base(%rip), %rax
        movq    %rax, %cr3

        /* make sure we execute from virtual addresses */
        movq    $1f, %rax
        jmp     *%rax
1:

        /* check whether NX is implemented */
        movl    $0x80000001, %eax
        cpuid
        movl    %edx,%edi

        /* set up EFER */
        movl    $MSR_EFER, %ecx
        rdmsr
        btsl    $_EFER_SCE, %eax
        btl     $20,%edi
        jnc     1f
        btsl    $_EFER_NX, %eax
1:      wrmsr

#define CR0_STATE       (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
                         X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
                         X86_CR0_PG)
        movl    $CR0_STATE, %eax
        movq    %rax, %cr0

        /* set up a boot-time stack */
        movq    stack_start(%rip),%rsp

        /* zero EFLAGS after setting rsp */
        pushq   $0
        popfq

        lgdt    early_gdt_descr(%rip)

        /* set up data segments */
        movl    $__KERNEL_DS,%eax
        movl    %eax,%ds
        movl    %eax,%ss
        movl    %eax,%es
        movl    %eax,%fs
        movl    %eax,%gs

        /* set up the GS base */
        movl    $MSR_GS_BASE,%ecx
        movq    initial_gs(%rip),%rax
        movq    %rax,%rdx
        shrq    $32,%rdx
        wrmsr

        /* pass the real-mode data pointer to C */
        movl    %esi, %edi

        movq    initial_code(%rip),%rax
        pushq   $0              # fake return address to stop unwinder
        pushq   $__KERNEL_CS    # set correct cs
        pushq   %rax            # target address in negative space
        lretq

        __REFDATA
        .align  8
        ENTRY(initial_code)
        .quad   x86_64_start_kernel
        ENTRY(initial_gs)
        .quad   INIT_PER_CPU_VAR(irq_stack_union)

        ENTRY(stack_start)
        .quad   init_thread_union+THREAD_SIZE-8
        .word   0
        __FINITDATA
        //…
==== from arch/x86/kernel/head_64.S ====
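The L4_*/L3_* constants and the hard-coded 510 index into level3_kernel_pgt come from slicing a virtual address into 9-bit table indexes. The sketch below reproduces that arithmetic. The shift values and the two addresses are assumptions on my part (the usual x86_64 values for a 2.6.32-era kernel: __START_KERNEL_map = 0xffffffff80000000 and __PAGE_OFFSET = 0xffff880000000000); they do not appear in the excerpt.

```c
#include <stdint.h>

/* Assumed constants, not taken from the excerpt above. */
#define PGDIR_SHIFT         39
#define PUD_SHIFT           30
#define PTRS_PER_TABLE      512
#define START_KERNEL_MAP    0xffffffff80000000ULL
#define PAGE_OFFSET_ASSUMED 0xffff880000000000ULL

/* Index into the level 4 table (PGD). */
static unsigned pgd_index(uint64_t va)
{
    return (unsigned)((va >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1));
}

/* Index into the level 3 table (PUD). */
static unsigned pud_index(uint64_t va)
{
    return (unsigned)((va >> PUD_SHIFT) & (PTRS_PER_TABLE - 1));
}
```

Under these assumptions, pgd_index(START_KERNEL_MAP) is 511 and pud_index(START_KERNEL_MAP) is 510, which lines up with the `level3_kernel_pgt + (510*8)` fixup in startup_64; pgd_index(PAGE_OFFSET_ASSUMED) is 272.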
Below is the implementation of x86_64_start_kernel.
==== from arch/x86/kernel/head64.c ====
void __init x86_64_start_kernel(char * real_mode_data)
{
        int i;

        BUILD_BUG_ON(MODULES_VADDR < KERNEL_IMAGE_START);
        BUILD_BUG_ON(MODULES_VADDR-KERNEL_IMAGE_START < KERNEL_IMAGE_SIZE);
        BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
        BUILD_BUG_ON((KERNEL_IMAGE_START & ~PMD_MASK) != 0);
        BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
        BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
        BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
                                (__START_KERNEL & PGDIR_MASK)));
        BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

        clear_bss();
        zap_identity_mappings();
        cleanup_highmap();

        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
#ifdef CONFIG_EARLY_PRINTK
                set_intr_gate(i, &early_idt_handlers[i]);
#else
                set_intr_gate(i, early_idt_handler);
#endif
        }
        load_idt((const struct desc_ptr *)&idt_descr);

        if (console_loglevel == 10)
                early_printk("Kernel alive\n");

        x86_64_start_reservations(real_mode_data);
}

void __init x86_64_start_reservations(char *real_mode_data)
{
        copy_bootdata(__va(real_mode_data));

        reserve_trampoline_memory();

        reserve_early(__pa_symbol(&_text), __pa_symbol(&__bss_stop), "TEXT DATA BSS");

#ifdef CONFIG_BLK_DEV_INITRD
        if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
                unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
                unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
                unsigned long ramdisk_end   = ramdisk_image + ramdisk_size;
                reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");
        }
#endif

        reserve_ebda_region();

        start_kernel();
}
==== from arch/x86/kernel/head64.c ====
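Finally, execution enters the architecture-independent start_kernel(). At this point the Linux boot phase can be considered essentially complete; start_kernel and everything after it can be regarded as the initialization process.
[E1] This is also the start of the header section, exported through the global symbol hdr.
[E2] eb is the opcode of a short jump instruction, and the next byte is the offset. Execution jumps to start_of_setup.
[E3] End of the header section definition. Note that code32_start is 0x100000, which is an important value.
[E4] Calls into C code. The main function is in arch/x86/boot/main.c.
[E5] Without CONFIG_RELOCATABLE, output starts at 16M. With CONFIG_RELOCATABLE, the address is recalculated but must still be 2M aligned.
[E7] Jumps to the output target address.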