
Linux Kernel Analysis: Boot and Startup

(2012-10-17 16:10:55)
Tags: software, operating system, linux

Category: software

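    I have long wanted to write about the Linux operating system, and the Linux kernel in particular, while learning it, but never found the time. Recently an opportunity came up: foreign companies sometimes close an office on short notice, and fortunately not immediately this time, which leaves me two or three months to settle down and sort out my thoughts. Taking advantage of the open source code, I analyzed the design and implementation of some key parts of the Linux kernel: kernel startup, memory management, interrupt handling, process scheduling, block devices, networking, and so on, along with some experience and lessons I have accumulated. I am writing this down mainly so that I do not forget it; if it occasionally attracts some attention and leads to discussion with like-minded people, that would be an honor.

    We start from the point where you have a fairly recent 64-bit x86 machine with a fairly recent Linux distribution installed, and you press the power button.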

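1      Power On

    After power-on the machine goes through a series of hardware startup steps, which I skip here (because I am not familiar with them). The CPU then starts working (fetches its first instruction) and runs the POST. Below are the initial values of several registers after CPU power-on.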

[Figure: initial register values after CPU power-on]
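Analysis:
1. CR0.PE = 0: the processor is in real mode. This bit determines the addressing mode used while the BIOS code executes.
2. RIP = FFF0h
3. CS.Base = FFFF_0000h
4. From RIP and CS.Base, the physical address of the first instruction is CS.Base + RIP = FFFF_FFF0h. Through the address mapping of the north bridge and south bridge, this address is generally routed to the ROM that holds the BIOS (for example SPI flash); it sits at exactly 4G - 16.

From this point on, the CPU executes the BIOS.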
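The arithmetic in item 4 can be checked with a few lines of C. This is only a sketch of the calculation; the function names are made up for illustration and do not come from any BIOS or kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address of the first instruction fetched after reset:
 * the hidden CS base plus the instruction pointer. */
uint64_t reset_vector(uint32_t cs_base, uint16_t ip)
{
    return (uint64_t)cs_base + ip;
}

/* Distance from an address up to the 4 GiB boundary. */
uint64_t bytes_below_4g(uint64_t addr)
{
    assert(addr <= (UINT64_C(1) << 32));  /* only meaningful below 4 GiB */
    return (UINT64_C(1) << 32) - addr;
}
```

With CS.Base = FFFF_0000h and RIP = FFF0h, `reset_vector` yields FFFF_FFF0h and `bytes_below_4g` yields 16, which is the "4G - 16" stated above.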

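2      BIOS Executes the MBR (Master Boot Record)

2.1    BIOS Detects the Boot Device

Starting from the first instruction, the BIOS runs for a while (hardware detection, interrupt controller initialization, memory initialization, and so on), then enumerates (probes) the installed hard disk and floppy drives, USB devices, network devices, etc., forming the so-called Boot Device Structure. Following the configured boot order, it then calls INT 19h to try to boot an OS from these devices. If a device carries boot information, the boot flow is entered.

Here we assume the bootable device is a hard disk, so we enter the flow that runs the MBR on that disk.

2.2    Executing the MBR

The MBR is located in the first sector of the hard disk (each sector is 512 bytes). The first 446 bytes contain boot code, the next 64 bytes are the partition table (4 * 16 bytes), and the last two bytes are the boot signature (fixed at 0x55 0xAA). The BIOS checks these two bytes first; only if they hold these values does it treat the disk as bootable.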

 

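MBR structure:

Offset  Content                       Size (bytes)
0h      Master boot code              446
01BEh   Disk partition table (DPT)    64
01FEh   Boot signature (0x55 0xAA)    2

(The hexadecimal marker 55 AA marks the end of a valid boot record. The boot record of every partition must carry this marker.)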

 

Disk partition table (DPT) structure:

Offset  Content                       Size (bytes)
01BEh   Partition entry 1             16
01CEh   Partition entry 2             16
01DEh   Partition entry 3             16
01EEh   Partition entry 4             16

 

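Partition entry (Partition Data Table) structure:

Offset  Content                                                              Size (bytes)
00h     Boot indicator                                                       1
01h     Starting head                                                        1
02h     Starting sector (bits 0-5; bits 6-7 are the high 2 bits of the cylinder)   1
03h     Starting cylinder (low 8 bits)                                       1
04h     System indicator                                                     1
05h     Ending head                                                          1
06h     Ending sector (bits 0-5; bits 6-7 are the high 2 bits of the cylinder)     1
07h     Ending cylinder (low 8 bits)                                         1
08h     Number of sectors preceding this partition                           4
0Ch     Total number of sectors in this partition                            4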
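The packed cylinder/head/sector bytes at offsets 01h-03h (and 05h-07h) are easier to follow in code. The sketch below decodes one such 3-byte field; `struct chs` and `decode_chs` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chs {
    unsigned cylinder;  /* 10 bits: 0..1023 */
    unsigned head;      /* 8 bits */
    unsigned sector;    /* 6 bits: 1..63 */
};

/* Decode a 3-byte CHS field of a partition entry.
 * bytes[0] = head
 * bytes[1] = sector in bits 0-5, cylinder bits 8-9 in bits 6-7
 * bytes[2] = cylinder bits 0-7 */
struct chs decode_chs(const uint8_t bytes[3])
{
    struct chs out;
    assert(bytes != NULL);
    out.head = bytes[0];
    out.sector = bytes[1] & 0x3F;
    out.cylinder = ((unsigned)(bytes[1] & 0xC0) << 2) | bytes[2];
    return out;
}
```

For the bytes FE FF FF this gives head 254, sector 63, cylinder 1023, the largest value the field can express.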

 

Boot indicator

00h: non-bootable partition

80h: bootable partition (only one partition may carry this value)

 

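System indicator

00h: unknown operating system
01h: DOS FAT12 (16-bit sector count)
02h: XENIX
04h: DOS FAT16 (16-bit sector count)
05h: DOS extended partition (DOS 3.3+)
06h: DOS 4.0 (Compaq 3.31), 32-bit sector count
51h: Ontrack extended partition
64h: Novell
75h: PC/IX
DBh: CP/M
FFh: BBT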

 

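Below is an example. The BIOS confirms from 55 AA that the disk is bootable, then loads the sector into memory (at 0x7C00) and runs its first 446 bytes of code. These 446 bytes are usually called the primary boot loader. Typically this code searches the partition table for the entry whose first byte is 80h (bootable partition), loads the first sector of that partition (called the volume boot record or boot sector) and hands control to it.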
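The two checks described here, the 55 AA signature and the search for the 80h entry, can be sketched in C as follows. This is an illustration operating on a 512-byte buffer, not the code of any real boot loader; all function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBR_SIZE          512
#define MBR_TABLE_OFFSET  0x1BE   /* start of the partition table */
#define MBR_ENTRY_SIZE    16
#define MBR_SIG_OFFSET    0x1FE   /* 0x55 0xAA lives here */

/* Returns 1 if the sector ends with the 0x55 0xAA boot signature. */
int mbr_is_bootable(const uint8_t *sector)
{
    assert(sector != NULL);
    return sector[MBR_SIG_OFFSET] == 0x55 &&
           sector[MBR_SIG_OFFSET + 1] == 0xAA;
}

/* Returns the index (0-3) of the first entry flagged 80h, or -1 if none. */
int mbr_active_partition(const uint8_t *sector)
{
    int i;
    assert(sector != NULL);
    for (i = 0; i < 4; i++) {
        if (sector[MBR_TABLE_OFFSET + i * MBR_ENTRY_SIZE] == 0x80)
            return i;
    }
    return -1;
}

/* Reads the little-endian 32-bit field at entry offset 08h:
 * the number of sectors preceding the partition (its start LBA). */
uint32_t mbr_partition_start_lba(const uint8_t *sector, int index)
{
    const uint8_t *p;
    assert(sector != NULL && index >= 0 && index < 4);
    p = sector + MBR_TABLE_OFFSET + index * MBR_ENTRY_SIZE + 8;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

A primary boot loader would then read the sector at the returned LBA and jump to it; that disk access is left out here because it needs BIOS services.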

[Figure: MBR example]

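3      GRUB

GRUB (GRand Unified Bootloader) is the boot loader that modern Linux systems commonly use to load the Linux kernel. GRUB has three stages: stage1, stage1.5 and stage2, as shown in the figure below. Stage1 is installed into the MBR discussed in the previous section. Stage 1.5 depends on the file system: to load from ext2, ext3 or a similar file system, e2fs_stage1_5 is used. When stage 1.5 finishes, control passes to stage2.

    Once stage2 is loaded, GRUB can display the list of available kernels (defined in /boot/grub/grub.conf). We can select a kernel and even edit its additional kernel parameters. We can also use a command-line shell for advanced manual control over the boot process.

With the second-stage boot loader in memory, the file system can be queried, and the default kernel image and initrd image are loaded into memory. Once these images are ready, the stage 2 boot loader invokes the kernel image.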

 

[Figure: GRUB stages (stage1, stage1.5, stage2)]

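4 Entering the Linux kernel

4.1 Initial kernel load, decompression, ELF parsing and reload

When the kernel image has been loaded into memory and the stage 2 boot loader hands over control, the kernel stage begins. The kernel image is not an executable kernel but a compressed kernel image. Usually it is a zImage (compressed image, smaller than 512 KB) or a bzImage (big compressed image, larger than 512 KB), compressed beforehand with zlib. At the front of this image is a routine that does a small amount of hardware setup, decompresses the kernel contained in the image, places it at a fixed location in memory, and then calls the kernel, which starts the kernel boot process.

 

The following walks through this flow using the RHEL 6.2 x86_64 kernel source as an example.

The bzImage starts with arch/x86/boot/header.S. GRUB loads it at memory address 0x90000, but skips the first 512 bytes and starts executing directly at 0x90200.

The first 512 bytes of the bzImage mentioned above are called the bootsector. Modern Linux kernels (2.6 and later) cannot be booted directly from the bzImage; a boot loader is required. If something does try to boot from byte 0, a warning is printed and the machine reboots. The excerpt below is the code that forms these 512 bytes.

==== from arch/x86/boot/header.S ====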

BOOTSEG     = 0x07C0       

SYSSEG      = 0x1000       

 

#ifndef SVGA_MODE

#define SVGA_MODE ASK_VGA

#endif

 

#ifndef RAMDISK

#define RAMDISK 0

#endif

 

#ifndef ROOT_RDONLY

#define ROOT_RDONLY 1

#endif

 

    .code16

    .section ".bstext", "ax"

 

    .global bootsect_start

bootsect_start:

 

    # Normalize the start address

    ljmp    $BOOTSEG, $start2

 

start2:

    movw    %cs, %ax

    movw    %ax, %ds

    movw    %ax, %es

    movw    %ax, %ss

    xorw    %sp, %sp

    sti

    cld

 

    movw    $bugger_off_msg, %si

 

msg_loop:

    lodsb

    andb    %al, %al

    jz  bs_die

    movb    $0xe, %ah

    movw    $7, %bx

    int $0x10

    jmp msg_loop

 

bs_die:

    # Allow the user to press a key, then reboot

    xorw    %ax, %ax

    int $0x16

    int $0x19

 

    # int 0x19 should never return.  In case it does anyway,

    # invoke the BIOS reset code...

    ljmp    $0xf000,$0xfff0

 

    .section ".bsdata", "a"

bugger_off_msg:

    .ascii  "Direct booting from floppy is no longer supported.\r\n"

    .ascii  "Please use a boot loader program instead.\r\n"

    .ascii  "\n"

    .ascii  "Remove disk and press any key to reboot . . .\r\n"

    .byte   0

 

 

    # Kernel attributes; used by setup.  This is part 1 of the

    # header, from the old boot sector.

 

    .section ".header", "a"

    .globl  hdr

[E1] hdr:

setup_sects:    .byte 0        

root_flags: .word ROOT_RDONLY

syssize:    .long 0        

ram_size:   .word 0        

vid_mode:   .word SVGA_MODE

root_dev:   .word 0        

boot_flag:  .word 0xAA55

==== from arch/x86/boot/header.S ====

 

Code after the first 512 bytes:

==== from arch/x86/boot/header.S ====

.globl  _start

_start:

        # Explicitly enter this as bytes, or the assembler

        # tries to generate a 3-byte jump here, which causes

        # everything else to push off to the wrong offset.

        .byte   0xeb        # short (2-byte) jump[E2] 

        .byte   start_of_setup-1f

1:

    # Part 2 of the header, from the old setup.S

        .ascii  "HdrS"      # header signature

        .word   0x020a      # header version number (>= 0x0105)

                    # or else old loadlin-1.5 will fail)

        .globl realmode_swtch

realmode_swtch: .word   0, 0        # default_switch, SETUPSEG

start_sys_seg:  .word   SYSSEG      # obsolete and meaningless, but just

                    # in case something decided to "use" it

        .word   kernel_version-512 # pointing to kernel version string

                    # above section of header is compatible

                    # with loadlin-1.5 (header v1.5). Don't

                    # change it.

type_of_loader: .byte   0   # 0 means ancient bootloader, newer

                    # bootloaders know to change this.

                    # See Documentation/i386/boot.txt for

                    # assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags

loadflags:

LOADED_HIGH = 1         # If set, the kernel is loaded high

CAN_USE_HEAP    = 0x80          # If set, the loader also has set

                    # heap_end_ptr to tell how much

                    # space behind setup.S can be used for

                    # heap purposes.

                    # Only the loader knows what is free

        .byte   LOADED_HIGH

 

setup_move_size: .word  0x8000      # size to move, when setup is not

                    # loaded at 0x90000. We will move setup

                    # to 0x90000 then just before jumping

                    # into the kernel. However, only the

                    # loader knows how much data behind

                    # us also needs to be loaded.

 

code32_start:               # here loaders can put a different

                    # start address for 32-bit code.

        .long   0x100000    # 0x100000 = default for big kernel

 

ramdisk_image:  .long   0   # address of loaded ramdisk image

                    # Here the loader puts the 32-bit

                    # address where it loaded the image.

                    # This only will be read by the kernel.

ramdisk_size:   .long   0   # its size in bytes

 

bootsect_kludge:

        .long   0   # obsolete

 

heap_end_ptr:   .word   _end+STACK_SIZE-512

                    # (Header version 0x0201 or later)

                    # space from here (exclusive) down to

                    # end of setup code can be used by setup

                    # for local heap purposes.

 

ext_loader_ver:

        .byte   0   # Extended boot loader version

ext_loader_type:

        .byte   0   # Extended boot loader type

 

cmd_line_ptr:   .long   0   # (Header version 0x0202 or later)

                    # If nonzero, a 32-bit pointer

                    # to the kernel command line.

                    # The command line should be

                    # located between the start of

                    # setup and the end of low

                    # memory (0xa0000), or it may

                    # get overwritten before it

                    # gets read.  If this field is

                    # used, there is no longer

                    # anything magical about the

                    # 0x90000 segment; the setup

                    # can be located anywhere in

                    # low memory 0x10000 or higher.

 

ramdisk_max:    .long 0x7fffffff

                    # (Header version 0x0203 or later)

                    # The highest safe address for

                    # the contents of an initrd

                    # The current kernel allows up to 4 GB,

                    # but leave it at 2 GB to avoid

                    # possible bootloader bugs.

 

kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN  #physical addr alignment

                        #required for protected mode

                        #kernel

#ifdef CONFIG_RELOCATABLE

relocatable_kernel:    .byte 1

#else

relocatable_kernel:    .byte 0

#endif

min_alignment:      .byte MIN_KERNEL_ALIGN_LG2  # minimum alignment

pad3:           .word 0

 

cmdline_size:   .long   COMMAND_LINE_SIZE-1     #length of the command line,

                                                #added with boot protocol

                                                #version 2.06

 

hardware_subarch:   .long 0         # subarchitecture, added with 2.07

                        # default to 0 for normal x86 PC

 

hardware_subarch_data:  .quad 0

 

payload_offset:     .long ZO_input_data

payload_length:     .long ZO_z_input_len

 

setup_data:     .quad 0         # 64-bit physical pointer to

                        # single linked list of

                        # struct setup_data

 

pref_address:       .quad LOAD_PHYSICAL_ADDR    # preferred load addr

 

#define ZO_INIT_SIZE    (ZO__end - ZO_startup_32 + ZO_z_extract_offset)

#define VO_INIT_SIZE    (VO__end - VO__text)

#if ZO_INIT_SIZE > VO_INIT_SIZE

#define INIT_SIZE ZO_INIT_SIZE

#else

#define INIT_SIZE VO_INIT_SIZE

#endif

init_size:      .long INIT_SIZE     # kernel initialization size

 

# End of setup header #####################################################[E3] 

 

    .section ".entrytext", "ax"

start_of_setup:

#ifdef SAFE_RESET_DISK_CONTROLLER

# Reset the disk controller.

    movw    $0x0000, %ax        # Reset disk controller

    movb    $0x80, %dl      # All disks

    int $0x13

#endif

 

# Force %es = %ds

    movw    %ds, %ax

    movw    %ax, %es

    cld

 

# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,

# which happened to work by accident for the old code.  Recalculate the stack

# pointer if %ss is invalid.  Otherwise leave it alone, LOADLIN sets up the

# stack behind its own code, so we can't blindly put it directly past the heap.

 

    movw    %ss, %dx

    cmpw    %ax, %dx    # %ds == %ss?

    movw    %sp, %dx

    je  2f      # -> assume %sp is reasonably set

 

    # Invalid %ss, make up a new stack

    movw    $_end, %dx

    testb   $CAN_USE_HEAP, loadflags

    jz  1f

    movw    heap_end_ptr, %dx

1:  addw    $STACK_SIZE, %dx

    jnc 2f

    xorw    %dx, %dx    # Prevent wraparound

 

2:  # Now %dx should point to the end of our stack space

    andw    $~3, %dx    # dword align (might as well...)

    jnz 3f

    movw    $0xfffc, %dx    # Make sure we're not zero

3:  movw    %ax, %ss

    movzwl  %dx, %esp   # Clear upper half of %esp

    sti         # Now we should have a working stack

 

# We will have entered with %cs = %ds+0x20, normalize %cs so

# it is on par with the other segments.

    pushw   %ds

    pushw   $6f

    lretw

6:

 

# Check signature at end of setup

    cmpl    $0x5a5aaa55, setup_sig

    jne setup_bad

 

# Zero the bss

    movw    $__bss_start, %di

    movw    $_end+3, %cx

    xorl    %eax, %eax

    subw    %di, %cx

    shrw    $2, %cx

    rep; stosl

 

# Jump to C code (should not return)

    calll   main[E4] 

 

# Setup corrupt somehow...

setup_bad:

    movl    $setup_corrupt, %eax

    calll   puts

    # Fall through...

 

    .globl  die

    .type   die, @function

die:

    hlt

    jmp die

 

    .size   die, .-die

 

    .section ".initdata", "a"

setup_corrupt:

    .byte   7

    .string "No setup signature found...\n"

==== from arch/x86/boot/header.S ====
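Because the header fields are laid out one after another starting at the boot sector's tail, each of them ends up at a fixed byte offset in the bzImage file. The sketch below reads a few of them from a buffer. The offsets are the ones given in the kernel's boot protocol document (the boot.txt that the header comments refer to); the helper names are invented for this example and the code is not taken from any boot loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets from the start of the image (boot protocol document). */
#define HDR_SETUP_SECTS   0x1F1   /* setup_sects, 1 byte */
#define HDR_BOOT_FLAG     0x1FE   /* boot_flag, must be 0xAA55 */
#define HDR_MAGIC         0x202   /* "HdrS" */
#define HDR_VERSION       0x206   /* header version, e.g. 0x020a */
#define HDR_CODE32_START  0x214   /* code32_start */

uint16_t hdr_rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t hdr_rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if buf carries both the boot flag and the "HdrS" signature. */
int has_setup_header(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 0x218)
        return 0;
    return hdr_rd16(buf + HDR_BOOT_FLAG) == 0xAA55 &&
           memcmp(buf + HDR_MAGIC, "HdrS", 4) == 0;
}

/* Size of the real-mode part: the boot sector plus setup_sects sectors.
 * The protocol treats a setup_sects value of 0 as 4. */
uint32_t real_mode_size(const uint8_t *buf)
{
    unsigned sects;
    assert(buf != NULL);
    sects = buf[HDR_SETUP_SECTS];
    if (sects == 0)
        sects = 4;
    return (sects + 1) * 512;
}
```

Everything past `real_mode_size` bytes in the file is the protected-mode part that the boot loader places at code32_start.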

 

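Below is the definition of main.

==== from arch/x86/boot/main.c ====

void main(void)
{
    /* First, copy the boot header into the "zeropage" */
    copy_boot_params();

    /* End of heap check */
    init_heap();

    /* Make sure we have all the proper CPU support */
    if (validate_cpu()) {
        puts("Unable to boot - please use a kernel appropriate "
             "for your CPU.\n");
        die();
    }

    /* Tell the BIOS what CPU mode we intend to run in */
    set_bios_mode();

    /* Detect memory layout */
    detect_memory();

    /* Set keyboard repeat rate */
    keyboard_set_repeat();

    /* Query MCA information */
    query_mca();

    /* Query Intel SpeedStep (IST) information */
    query_ist();

    /* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif

    /* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
    query_edd();
#endif

    /* Set the video mode */
    set_video();

    /* Parse the command line for 'quiet' and pass it to the decompressor */
    if (cmdline_find_option_bool("quiet"))
        boot_params.hdr.loadflags |= QUIET_FLAG;

    /* Do the last thing real mode code does */
    go_to_protected_mode();
}

==== from arch/x86/boot/main.c ====

After the necessary hardware and environment setup (stack, heap, and so on), execution switches to protected mode (go_to_protected_mode).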

 

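==== from arch/x86/boot/pm.c ====

void go_to_protected_mode(void)
{
    /* Hook before leaving real mode, also disables interrupts */
    realmode_switch_hook();

    /* Enable the A20 gate */
    if (enable_a20()) {
        puts("A20 gate not responding, unable to boot...\n");
        die();
    }

    /* Reset coprocessor (IGNNE#) */
    reset_coprocessor();

    /* Mask all interrupts in the PIC */
    mask_all_interrupts();

    /* Actual transition to protected mode... */
    setup_idt();
    setup_gdt();
    protected_mode_jump(boot_params.hdr.code32_start,
                (u32)&boot_params + (ds() << 4));
}

==== from arch/x86/boot/pm.c ====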

 

arch/x86/boot/pmjump.S defines protected_mode_jump.

==== from arch/x86/boot/pmjump.S ====

GLOBAL(protected_mode_jump)
    movl    %edx, %esi      # Pointer to boot_params table

    xorl    %ebx, %ebx
    movw    %cs, %bx
    shll    $4, %ebx
    addl    %ebx, 2f
    jmp 1f          # Short jump to serialize on 386/486
1:

    movw    $__BOOT_DS, %cx
    movw    $__BOOT_TSS, %di

    movl    %cr0, %edx
    orb $X86_CR0_PE, %dl    # Protected mode
    movl    %edx, %cr0

    # Transition to 32-bit mode
    .byte   0x66, 0xea      # ljmpl opcode
2:  .long   in_pm32         # offset
    .word   __BOOT_CS       # segment
ENDPROC(protected_mode_jump)

==== from arch/x86/boot/pmjump.S ====
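Both `(u32)&boot_params + (ds() << 4)` in pm.c and the `shll $4, %ebx` / `addl %ebx, 2f` pair in pmjump.S perform the same conversion: a real-mode segment:offset pair becomes a linear address by shifting the segment left by four bits and adding the offset. A minimal C sketch of that rule follows; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode segment:offset to linear address: segment * 16 + offset. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Worked values that appear in this article. */
void check_linear_examples(void)
{
    assert(linear_address(0x07C0, 0x0000) == 0x7C00);   /* BOOTSEG */
    assert(linear_address(0x9020, 0x0000) == 0x90200);  /* setup entry */
    assert(linear_address(0xF000, 0xFFF0) == 0xFFFF0);  /* BIOS reset code */
}
```

After the switch to protected mode the flat segments have base 0, so pointers such as the boot_params address must already be linear when they are handed over in %esi.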

 

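Here the entry point is set to 0x100000 (the code32_start field in the header). Which code sits at 0x100000 depends on how GRUB loaded the image: GRUB places the code from arch/x86/boot/compressed/head_64.S at the 1M mark in memory (0x100000). Its entry point is startup_32. An excerpt follows.

==== from arch/x86/boot/compressed/head_64.S ====

    __HEAD
    .code32
ENTRY(startup_32)
    cld

    /* KEEP_SEGMENTS set: the boot loader asks us not to reload segments */
    testb $(1<<6), BP_loadflags(%esi)
    jnz 1f

    cli
    movl    $(__KERNEL_DS), %eax
    movl    %eax, %ds
    movl    %eax, %es
    movl    %eax, %ss
1:

    /* %ebp = delta between the link address and the actual load address */
    leal    (BP_scratch+4)(%esi), %esp
    call    1f
1:  popl    %ebp
    subl    $1b, %ebp

    /* Set up a stack and make sure the CPU supports long mode */
    movl    $boot_stack_end, %eax
    addl    %ebp, %eax
    movl    %eax, %esp

    call    verify_cpu
    testl   %eax, %eax
    jnz no_longmode

    /* %ebx = address the kernel will run at, rounded up to kernel_alignment */
#ifdef CONFIG_RELOCATABLE
    movl    %ebp, %ebx
    movl    BP_kernel_alignment(%esi), %eax
    decl    %eax
    addl    %eax, %ebx
    notl    %eax
    andl    %eax, %ebx
#else
    movl    $LOAD_PHYSICAL_ADDR, %ebx
#endif

    /* Target address to relocate to for decompression */
    addl    $z_extract_offset, %ebx

    /* Load a new GDT that contains the 64-bit segments */
    leal    gdt(%ebp), %eax
    movl    %eax, gdt+2(%ebp)
    lgdt    gdt(%ebp)

    /* Enable PAE mode */
    xorl    %eax, %eax
    orl $(X86_CR4_PAE), %eax
    movl    %eax, %cr4

    /* Build the early 4G boot page table: clear all six pages first */
    leal    pgtable(%ebx), %edi
    xorl    %eax, %eax
    movl    $((4096*6)/4), %ecx
    rep stosl

    /* Build level 4 */
    leal    pgtable + 0(%ebx), %edi
    leal    0x1007 (%edi), %eax
    movl    %eax, 0(%edi)

    /* Build level 3 */
    leal    pgtable + 0x1000(%ebx), %edi
    leal    0x1007(%edi), %eax
    movl    $4, %ecx
1:  movl    %eax, 0x00(%edi)
    addl    $0x00001000, %eax
    addl    $8, %edi
    decl    %ecx
    jnz 1b

    /* Build level 2: 2048 entries mapping 2M pages */
    leal    pgtable + 0x2000(%ebx), %edi
    movl    $0x00000183, %eax
    movl    $2048, %ecx
1:  movl    %eax, 0(%edi)
    addl    $0x00200000, %eax
    addl    $8, %edi
    decl    %ecx
    jnz 1b

    /* Enable the boot page tables */
    leal    pgtable(%ebx), %eax
    movl    %eax, %cr3

    /* Enable long mode in EFER (Extended Feature Enable Register) */
    movl    $MSR_EFER, %ecx
    rdmsr
    btsl    $_EFER_LME, %eax
    wrmsr

    /* Push the 64-bit CS and the address of startup_64 for the lret below */
    pushl   $__KERNEL_CS
    leal    startup_64(%ebp), %eax
    pushl   %eax

    /* Enter paged protected mode, activating long mode */
    movl    $(X86_CR0_PG | X86_CR0_PE), %eax
    movl    %eax, %cr0

    /* Jump from 32-bit compatibility mode into 64-bit mode */
    lret
ENDPROC(startup_32)

==== from arch/x86/boot/compressed/head_64.S ====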
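The order of the control-register writes in startup_32 matters: PAE first, then CR3, then EFER.LME, and paging last. The sketch below models that sequence on a plain struct so the bit values can be inspected. `struct cpu_state` and both functions are hypothetical; the bit positions are the architectural ones (CR0.PE bit 0, CR0.PG bit 31, CR4.PAE bit 5, EFER.LME bit 8, EFER at MSR C000_0080h).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_CR0_PE    (UINT32_C(1) << 0)    /* protection enable */
#define X86_CR0_PG    (UINT32_C(1) << 31)   /* paging */
#define X86_CR4_PAE   (UINT32_C(1) << 5)    /* physical address extension */
#define MSR_EFER      UINT32_C(0xC0000080)  /* extended feature enable register */
#define EFER_LME      (UINT64_C(1) << 8)    /* long mode enable */

/* Hypothetical model of the registers touched by startup_32. */
struct cpu_state {
    uint32_t cr0;
    uint32_t cr4;
    uint64_t efer;
};

/* Apply the same steps, in the same order, as startup_32. */
void enter_long_mode(struct cpu_state *s)
{
    assert(s != NULL);
    s->cr4 = X86_CR4_PAE;               /* 1. enable PAE */
    /* 2. CR3 would be loaded with the page-table root here */
    s->efer |= EFER_LME;                /* 3. set EFER.LME */
    s->cr0 = X86_CR0_PG | X86_CR0_PE;   /* 4. turn on paging */
}

/* Long mode becomes active once PAE, LME and paging are all set. */
int long_mode_active(const struct cpu_state *s)
{
    assert(s != NULL);
    return (s->cr4 & X86_CR4_PAE) != 0 &&
           (s->efer & EFER_LME) != 0 &&
           (s->cr0 & X86_CR0_PG) != 0;
}
```

On real hardware the CPU is still executing 32-bit compatibility-mode code after step 4; only the far return into a 64-bit code segment (the `lret` above) switches to 64-bit mode.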

 

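Processing then continues in 64-bit mode.

==== from arch/x86/boot/compressed/head_64.S ====

    .code64
    .org 0x200
ENTRY(startup_64)
    /* Set up the data segments */
    xorl    %eax, %eax
    movl    %eax, %ds
    movl    %eax, %es
    movl    %eax, %ss
    movl    %eax, %fs
    movl    %eax, %gs
    lldt    %ax
    movl    $0x20, %eax
    ltr %ax

    /* %rbp = address the kernel will run at */
#ifdef CONFIG_RELOCATABLE
    leaq    startup_32(%rip) , %rbp
    movl    BP_kernel_alignment(%rsi), %eax
    decl    %eax
    addq    %rax, %rbp
    notq    %rax
    andq    %rax, %rbp
#else
    movq    $LOAD_PHYSICAL_ADDR, %rbp
#endif

    /* Target address to relocate to for decompression */
    leaq    z_extract_offset(%rbp), %rbx

    /* Set up the stack */
    leaq    boot_stack_end(%rbx), %rsp

    /* Zero EFLAGS */
    pushq   $0
    popfq

    /* Copy the compressed kernel to the end of the buffer (backwards),
       where in-place decompression is safe */
    pushq   %rsi
    leaq    (_bss-8)(%rip), %rsi
    leaq    (_bss-8)(%rbx), %rdi
    movq    $_bss , %rcx
    shrq    $3, %rcx
    std
    rep movsq
    cld
    popq    %rsi

    /* Jump to the relocated address */
    leaq    relocated(%rbx), %rax
    jmp *%rax

    .text
relocated:

    /* Clear BSS (the stack is currently empty) */
    xorl    %eax, %eax
    leaq    _bss(%rip), %rdi
    leaq    _ebss(%rip), %rcx
    subq    %rdi, %rcx
    shrq    $3, %rcx
    rep stosq

    /* Do the decompression, then jump to the new kernel */
    pushq   %rsi                    /* save the real mode argument */
    movq    %rsi, %rdi              /* real mode address */
    leaq    boot_heap(%rip), %rsi   /* malloc area for decompression */
    leaq    input_data(%rip), %rdx  /* input_data */
    movl    $z_input_len, %ecx      /* input_len */
    movq    %rbp, %r8               /* output target address [E5] */
    call    decompress_kernel       /* [E6] */
    popq    %rsi

    /* Jump to the decompressed kernel */
    jmp *%rbp                       /* [E7] */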

 

    .data

gdt:

    .word   gdt_end - gdt

    .long   gdt

    .word   0

    .quad   0x0000000000000000 

    .quad   0x00af9a000000ffff 

    .quad   0x00cf92000000ffff 

    .quad   0x0080890000000000 

    .quad   0x0000000000000000 

gdt_end:

 

    .bss

    .balign 4

boot_heap:

    .fill BOOT_HEAP_SIZE, 1, 0

boot_stack:

    .fill BOOT_STACK_SIZE, 1, 0

boot_stack_end:

 

    .section ".pgtable","a",@nobits

    .balign 4096

pgtable:

    .fill 6*4096, 1, 0

==== from arch/x86/boot/compressed/head_64.S ====
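The `pgtable: .fill 6*4096` reservation and the constants in the level-2 loop (0x183, 0x200000, 2048) fit together, which a short calculation shows. The sketch below reproduces the entry values and the resulting sizes; the function names are made up for this example, and the reading of 0x183 as present, writable, 2M page and global follows the architectural bit layout.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K  UINT64_C(0x1000)
#define PAGE_SIZE_2M  UINT64_C(0x200000)
#define PDE_FLAGS     UINT64_C(0x183)  /* present | writable | 2M page | global */
#define NUM_PDES      2048             /* loop count in startup_32 */

/* Value of the i-th level-2 entry written by the loop in startup_32. */
uint64_t level2_entry(unsigned i)
{
    assert(i < NUM_PDES);
    return PDE_FLAGS + (uint64_t)i * PAGE_SIZE_2M;
}

/* Bytes identity-mapped by all level-2 entries together. */
uint64_t mapped_bytes(void)
{
    return NUM_PDES * PAGE_SIZE_2M;
}

/* Pages needed: one level-4 table, one level-3 table, and as many
 * level-2 tables as NUM_PDES 8-byte entries occupy. */
unsigned pgtable_pages(void)
{
    return 1 + 1 + (unsigned)(NUM_PDES * 8 / PAGE_SIZE_4K);
}
```

So the boot page table identity-maps the first 4 GiB with 2M pages and occupies exactly the six pages reserved under `pgtable`.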

 

Below is the implementation of decompress_kernel.

==== from arch/x86/boot/compressed/misc.c ====

asmlinkage void decompress_kernel(void *rmode, memptr heap,

                  unsigned char *input_data,

                  unsigned long input_len,

                  unsigned char *output)

{

    real_mode = rmode;

 

    if (real_mode->hdr.loadflags & QUIET_FLAG)

        quiet = 1;

 

    if (real_mode->screen_info.orig_video_mode == 7) {

        vidmem = (char *) 0xb0000;

        vidport = 0x3b4;

    } else {

        vidmem = (char *) 0xb8000;

        vidport = 0x3d4;

    }

 

    lines = real_mode->screen_info.orig_video_lines;

    cols = real_mode->screen_info.orig_video_cols;

 

    free_mem_ptr     = heap;

    free_mem_end_ptr = heap + BOOT_HEAP_SIZE;

 

    if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))

        error("Destination address inappropriately aligned");

#ifdef CONFIG_X86_64

    if (heap > 0x3fffffffffffUL)

        error("Destination address too large");

#else

    if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff))

        error("Destination address too large");

#endif

#ifndef CONFIG_RELOCATABLE

    if ((unsigned long)output != LOAD_PHYSICAL_ADDR)

        error("Wrong destination address");

#endif

 

    if (!quiet)

        putstr("\nDecompressing Linux... ");

    decompress(input_data, input_len, NULL, NULL, output, NULL, error);

    parse_elf(output);

    if (!quiet)

        putstr("done.\nBooting the kernel.\n");

    return;

}

==== from arch/x86/boot/compressed/misc.c ====

 

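In the flow above, decompress unpacks the kernel and yields the kernel in ELF format (stored at output). Without CONFIG_RELOCATABLE, output starts at 16M; with CONFIG_RELOCATABLE the address is recomputed but must still be 2M aligned. In my experiments the address is usually 16M even with CONFIG_RELOCATABLE enabled.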
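The recomputation under CONFIG_RELOCATABLE is the decl/addl/notl/andl sequence in head_64.S: the load address is rounded up to the next multiple of kernel_alignment. A sketch in C follows. The example values in the note after it (a load address of 0x100000 and an alignment of 16M) are assumptions chosen for illustration; the real alignment comes from CONFIG_PHYSICAL_ALIGN in the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (a power of two),
 * mirroring the decl / addl / notl / andl sequence. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    uint64_t mask;
    assert(align != 0 && (align & (align - 1)) == 0);
    mask = align - 1;               /* decl */
    return (addr + mask) & ~mask;   /* addl, notl, andl */
}
```

Under those assumed values, `align_up(0x100000, 0x1000000)` is 0x1000000, that is 16M, which would be consistent with the experimental observation above; with a 2M alignment the same load address would round up to 0x200000 instead.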

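parse_elf then parses the ELF kernel further (code segment, data segment, and so on). When decompress_kernel returns, execution jumps into the parsed kernel, whose first address is startup_64 in arch/x86/kernel/head_64.S.

==== from arch/x86/boot/compressed/misc.c ====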

static void parse_elf(void *output)

{

#ifdef CONFIG_X86_64

    Elf64_Ehdr ehdr;

    Elf64_Phdr *phdrs, *phdr;

#else

    Elf32_Ehdr ehdr;

    Elf32_Phdr *phdrs, *phdr;

#endif

    void *dest;

    int i;

 

    memcpy(&ehdr, output, sizeof(ehdr));

    if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||

       ehdr.e_ident[EI_MAG1] != ELFMAG1 ||

       ehdr.e_ident[EI_MAG2] != ELFMAG2 ||

       ehdr.e_ident[EI_MAG3] != ELFMAG3) {

        error("Kernel is not a valid ELF file");

        return;

    }

 

    if (!quiet)

        putstr("Parsing ELF... ");

 

    phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);

    if (!phdrs)

        error("Failed to allocate space for phdrs");

 

    memcpy(phdrs, output + ehdr.e_phoff, sizeof(*phdrs) * ehdr.e_phnum);

 

    for (i = 0; i < ehdr.e_phnum; i++) {

        phdr = &phdrs[i];

 

        switch (phdr->p_type) {

        case PT_LOAD:

#ifdef CONFIG_RELOCATABLE

            dest = output;

            dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);

#else

            dest = (void *)(phdr->p_paddr);

#endif

            memcpy(dest,

                   output + phdr->p_offset,

                   phdr->p_filesz);

            break;

        default: break;

        }

    }

}

==== from arch/x86/boot/compressed/misc.c ====
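Two pieces of parse_elf are worth isolating: the four-byte magic check and the destination calculation for a PT_LOAD segment in the relocatable case. The sketch below restates both as standalone functions with hypothetical names; the numbers in the note after it are invented sample values, not taken from a real kernel image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf starts with the ELF magic: 0x7F 'E' 'L' 'F'. */
int is_elf(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 4 && buf[0] == 0x7F && buf[1] == 'E' &&
           buf[2] == 'L' && buf[3] == 'F';
}

/* Destination of a PT_LOAD segment when the kernel is relocatable:
 * p_paddr is shifted by the distance between the actual output
 * address and the link-time base (LOAD_PHYSICAL_ADDR). */
uint64_t segment_dest(uint64_t output, uint64_t p_paddr,
                      uint64_t load_physical_addr)
{
    return output + (p_paddr - load_physical_addr);
}
```

For instance, a segment linked at 0x1200000 with a link-time base of 0x1000000 lands at 0x2200000 if output is 0x2000000; when output equals the link-time base, the destination is simply p_paddr, matching the non-relocatable branch.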

 

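4.2    Entering the "real" Linux kernel

After section 4.1 (initial load, decompression, ELF parsing and reload), the CPU jumps to startup_64 in arch/x86/kernel/head_64.S and finally enters the so-called "real" kernel. startup_64 mainly completes the paging setup by switching to the kernel's own page tables, and then jumps into the C function x86_64_start_kernel.

==== from arch/x86/kernel/head_64.S ====

#define pud_index(x)    (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))

L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET)
L3_PAGE_OFFSET = pud_index(__PAGE_OFFSET)
L4_START_KERNEL = pgd_index(__START_KERNEL_map)
L3_START_KERNEL = pud_index(__START_KERNEL_map)

    .text
    __HEAD
    .code64
    .globl startup_64
startup_64:

    /* %rbp = difference between where we were linked and where we run */
    leaq    _text(%rip), %rbp
    subq    $_text - __START_KERNEL_map, %rbp

    /* Is the address not 2M aligned? */
    movq    %rbp, %rax
    andl    $~PMD_PAGE_MASK, %eax
    testl   %eax, %eax
    jnz bad_address

    /* Is the address too large? */
    leaq    _text(%rip), %rdx
    movq    $PGDIR_SIZE, %rax
    cmpq    %rax, %rdx
    jae bad_address

    /* Fix up the physical addresses in the initial page tables */
    addq    %rbp, init_level4_pgt + 0(%rip)
    addq    %rbp, init_level4_pgt + (L4_PAGE_OFFSET*8)(%rip)
    addq    %rbp, init_level4_pgt + (L4_START_KERNEL*8)(%rip)

    addq    %rbp, level3_ident_pgt + 0(%rip)

    addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq    %rbp, level3_kernel_pgt + (511*8)(%rip)

    addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)

//…

    jmp secondary_startup_64
ENTRY(secondary_startup_64)

    /* Enable PAE mode and PGE */
    movl    $(X86_CR4_PAE | X86_CR4_PGE), %eax
    movq    %rax, %cr4

    /* Set up the early boot stage 4-level page tables */
    movq    $(init_level4_pgt - __START_KERNEL_map), %rax
    addq    phys_base(%rip), %rax
    movq    %rax, %cr3

    /* Make sure we are executing from virtual addresses */
    movq    $1f, %rax
    jmp *%rax
1:

    /* Check whether NX is implemented */
    movl    $0x80000001, %eax
    cpuid
    movl    %edx,%edi

    /* Set up EFER (Extended Feature Enable Register) */
    movl    $MSR_EFER, %ecx
    rdmsr
    btsl    $_EFER_SCE, %eax    /* enable system call */
    btl $20,%edi                /* No Execute supported? */
    jnc     1f
    btsl    $_EFER_NX, %eax
1:  wrmsr                       /* make the changes effective */

    /* Set up cr0 */
#define CR0_STATE   (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
             X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
             X86_CR0_PG)
    movl    $CR0_STATE, %eax
    /* Make the changes effective */
    movq    %rax, %cr0

    /* Set up a boot time stack */
    movq stack_start(%rip),%rsp

    /* Zero EFLAGS after setting %rsp */
    pushq $0
    popfq

    /* Load the real GDT now that we run from virtual addresses */
    lgdt    early_gdt_descr(%rip)

    /* Set up the data segments */
    movl $__KERNEL_DS,%eax
    movl %eax,%ds
    movl %eax,%ss
    movl %eax,%es

    movl %eax,%fs
    movl %eax,%gs

    /* Set up %gs through MSR_GS_BASE */
    movl    $MSR_GS_BASE,%ecx
    movq    initial_gs(%rip),%rax
    movq    %rax,%rdx
    shrq    $32,%rdx
    wrmsr

    /* %esi points to the real mode data; pass it to C */
    movl    %esi, %edi

    /* Finally jump to C code, now running at the real kernel address */
    movq    initial_code(%rip),%rax
    pushq   $0      # fake return address to stop unwinder
    pushq   $__KERNEL_CS    # set correct cs
    pushq   %rax        # target address in negative space
    lretq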

 

   

    __REFDATA

    .align  8

    ENTRY(initial_code)

    .quad   x86_64_start_kernel

    ENTRY(initial_gs)

    .quad   INIT_PER_CPU_VAR(irq_stack_union)

 

    ENTRY(stack_start)

    .quad  init_thread_union+THREAD_SIZE-8

    .word  0

    __FINITDATA

 

//…

==== from arch/x86/kernel/head_64.S ====
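The hard-coded slots 510 and 511 in the page-table fixups follow from the index macros at the top of the excerpt. The sketch below recomputes them in C. It assumes 4-level paging with 512 entries per table, __START_KERNEL_map = 0xffffffff80000000 and __PAGE_OFFSET = 0xffff880000000000; these constants are assumptions about kernels of this generation and should be checked against the headers of the tree being read.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PTRS_PER_TABLE  512

/* Assumed values, see the paragraph above. */
#define START_KERNEL_MAP  UINT64_C(0xffffffff80000000)
#define PAGE_OFFSET       UINT64_C(0xffff880000000000)

/* Index into the top-level (level 4) table. */
unsigned pgd_index(uint64_t va)
{
    return (unsigned)((va >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1));
}

/* Index into the level-3 table. */
unsigned pud_index(uint64_t va)
{
    return (unsigned)((va >> PUD_SHIFT) & (PTRS_PER_TABLE - 1));
}

/* Worked values for the two kernel base addresses. */
void check_index_examples(void)
{
    assert(pgd_index(START_KERNEL_MAP) == 511);  /* L4_START_KERNEL */
    assert(pud_index(START_KERNEL_MAP) == 510);  /* L3_START_KERNEL */
    assert(pgd_index(PAGE_OFFSET) == 272);       /* L4_PAGE_OFFSET */
    assert(pud_index(PAGE_OFFSET) == 0);         /* L3_PAGE_OFFSET */
}
```

Slot 510 of level3_kernel_pgt is therefore the entry that covers the kernel text mapping, which is why it is one of the entries startup_64 adjusts by %rbp.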

 

Below is the implementation of x86_64_start_kernel.

==== from arch/x86/kernel/head64.c ====

void __init x86_64_start_kernel(char * real_mode_data)

{

    int i;

 

   

    BUILD_BUG_ON(MODULES_VADDR < KERNEL_IMAGE_START);

    BUILD_BUG_ON(MODULES_VADDR-KERNEL_IMAGE_START < KERNEL_IMAGE_SIZE);

    BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);

    BUILD_BUG_ON((KERNEL_IMAGE_START & ~PMD_MASK) != 0);

    BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);

    BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));

    BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==

                (__START_KERNEL & PGDIR_MASK)));

    BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

 

   

    clear_bss();

 

   

    zap_identity_mappings();

 

   

    cleanup_highmap();

 

    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {

#ifdef CONFIG_EARLY_PRINTK

        set_intr_gate(i, &early_idt_handlers[i]);

#else

        set_intr_gate(i, early_idt_handler);

#endif

    }

    load_idt((const struct desc_ptr *)&idt_descr);

 

    if (console_loglevel == 10)

        early_printk("Kernel alive\n");

 

    x86_64_start_reservations(real_mode_data);

}

 

void __init x86_64_start_reservations(char *real_mode_data)

{

    copy_bootdata(__va(real_mode_data));

 

    reserve_trampoline_memory();

 

    reserve_early(__pa_symbol(&_text), __pa_symbol(&__bss_stop), "TEXT DATA BSS");

 

#ifdef CONFIG_BLK_DEV_INITRD

   

    if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {

        unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;

        unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;

        unsigned long ramdisk_end   = ramdisk_image + ramdisk_size;

        reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");

    }

#endif

 

    reserve_ebda_region();

 

   

 

    start_kernel();

}

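==== from arch/x86/kernel/head64.c ====

 

Finally we enter the architecture-independent start_kernel(). At this point the Linux boot and startup phase is essentially complete; start_kernel and everything after it can be regarded as initialization.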

 


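 [E1] This is also the start of the header section, exposed through the global symbol hdr.

 [E2] eb is the opcode of a short jump; the next byte is the offset. Execution jumps to start_of_setup.

 [E3] End of the header section definition. Note that code32_start is 0x100000; this is an important value.

 [E4] Calls into C code; this main function is in arch/x86/boot/main.c.

 [E5] Without CONFIG_RELOCATABLE, output starts at 16M; with CONFIG_RELOCATABLE the address is recomputed but must still be 2M aligned.

 [E6] Calls this function to decompress the kernel.

 [E7] Jumps to the output target address.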


 
