Post on 25-Jun-2020

Linux Kernel Components

- Memory management
- Process management
- Inter-process communication (IPC)
- File systems
- Networking
- Device control and device drivers

Memory Management

Memory Management
- Virtual memory mechanism
  - Overcomes physical memory limitations
    - Makes the system appear to have more memory than it actually has by sharing it among competing processes as they need it.
- The memory management subsystem provides
  - Large address spaces
  - Protection
    - Each process has its own virtual address space, completely separate from the others, so a process running one application cannot affect another.
    - The hardware virtual memory mechanisms allow areas of memory to be protected against writing.

Memory Management
- The memory management subsystem provides (cont.)
  - Memory mapping
    - Used to map image and data files into a process's address space
    - In memory mapping, the contents of a file are linked directly into the virtual address space of a process
  - Fair physical memory allocation
  - Shared virtual memory
    - Can also be used as an inter-process communication (IPC) mechanism, with processes exchanging information via memory common to all of them.
    - Linux supports the Unix System V shared memory IPC.
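The memory mapping idea above can be tried from user space. The following is a minimal sketch, assuming a Unix-like system and Python's standard `mmap` module; the helper name `demo_file_mapping` and the temporary file are illustrative only and are not part of the slides.

```python
import mmap
import os
import tempfile


def demo_file_mapping():
    """Map a small file, modify it through memory, and read the file back."""
    fd, path = tempfile.mkstemp()          # hypothetical scratch file for the demo
    try:
        os.write(fd, b"hello world")
        os.close(fd)
        with open(path, "r+b") as f:
            # Length 0 maps the whole file. The default access mode gives a
            # shared, writable mapping, so stores reach the file's pages.
            with mmap.mmap(f.fileno(), 0) as view:
                view[0:5] = b"HELLO"       # an ordinary memory store, no write() call
                view.flush()               # push the dirty page back to the file
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo_file_mapping())
```

The file contents change even though the program never calls `write()` on the modified bytes, which is the sense in which the file is "linked directly" into the address space.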

Memory Management
- x86 memory management
  - Segmentation
  - Paging
- Linux memory management
  - Memory initialization
  - Memory allocation and deallocation
  - Memory map
  - Page fault handling
  - Demand paging and page replacement

Memory Address
- Until now, a memory address has simply meant the way to access a memory cell
  - On the 80x86, however, we have to say precisely which kind of "address" we mean
- Memory address
  - Logical address
  - Linear address (or virtual address)
  - Physical address

Memory Address (Cont.)
- Logical address
  - Used to specify the address of an operand or of an instruction
  - Consists of a segment and an offset value
- Linear address
  - A single 32-bit unsigned integer that can be used to address up to 4 GB

Memory Address (Cont.)
- Physical address
  - Used to address memory cells in memory chips
  - Represented as a 32-bit unsigned integer

Logical address -> SEGMENTATION UNIT (HW) -> Linear address -> PAGING UNIT (HW) -> Physical address

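Segment Translation

[Figure: a logical address consists of a 16-bit segment selector (bits 15-0) and a 32-bit offset (bits 31-0). The selector picks a segment descriptor from the segment descriptor table; the descriptor's base address is added to the offset to give the linear address (Dir, Page, Offset).]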

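Linear Address Translation

[Figure: the 32-bit linear address is split into Directory (bits 31-22, 10 bits), Table (bits 21-12, 10 bits), and Offset (bits 11-0, 12 bits). CR3 (PDBR) locates the page directory; the directory entry locates a page table; the page-table entry locates the page in physical memory; the offset selects the byte within that page, giving the physical address.]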
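The 10/10/12 split of the linear address can be written out as a few shifts and masks. This is only an illustrative sketch: the nested dictionary `toy_directory` stands in for the page directory and page tables, and the function names are made up for the example.

```python
def split_linear_address(linear):
    """Split a 32-bit linear address into (directory, table, offset) fields."""
    directory = (linear >> 22) & 0x3FF   # bits 31-22: index into the page directory
    table = (linear >> 12) & 0x3FF       # bits 21-12: index into the page table
    offset = linear & 0xFFF              # bits 11-0: byte offset inside the 4 KB page
    return directory, table, offset


def translate(linear, page_directory):
    """Walk a toy two-level table: page_directory[dir][table] is a frame base address."""
    directory, table, offset = split_linear_address(linear)
    frame_base = page_directory[directory][table]
    return frame_base + offset


# Hypothetical mapping: directory entry 0x300, table entry 1 -> frame at 0x00400000.
toy_directory = {0x300: {0x001: 0x00400000}}
print(hex(translate(0xC0001ABC, toy_directory)))   # prints 0x400abc
```

A missing key in the toy dictionaries plays the role of a cleared Present flag; a real paging unit would raise a page fault at that point.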

Segmentation and Paging

[Figure: combined translation. Segment selector + offset (logical address) -> segment descriptor -> segment base address -> linear address space (Dir, Table, Offset) -> page directory -> page table -> page in the physical address space.]

Abstract model of Virtual to Physical address mapping

[Figure: Process X and Process Y each have virtual memory pages VPFN0-VPFN7 and a page table of their own; the two page tables map the virtual pages onto the physical memory frames PFN0-PFN4.]

An Abstract Model of VM (Cont.)
- Each page table entry contains:
  - Valid flag
  - Physical page frame number
  - Access control information
- x86 page table entry and page directory entry:

[Figure: 32-bit entry layout. Bits 31-12: Page Address; bit 6: D; bit 5: A; bit 2: U/S; bit 1: R/W; bit 0: P.]
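The entry layout above can be decoded with bit masks. The sketch below assumes only the bit positions shown in the layout (P = bit 0, R/W = bit 1, U/S = bit 2, A = bit 5, D = bit 6, page address in bits 31-12); the function and key names are illustrative.

```python
PRESENT = 1 << 0          # P
READ_WRITE = 1 << 1       # R/W
USER_SUPERVISOR = 1 << 2  # U/S
ACCESSED = 1 << 5         # A
DIRTY = 1 << 6            # D


def decode_entry(entry):
    """Break a 32-bit page table entry into its address and flag fields."""
    return {
        "page_address": entry & 0xFFFFF000,   # bits 31-12
        "present": bool(entry & PRESENT),
        "writable": bool(entry & READ_WRITE),
        "user": bool(entry & USER_SUPERVISOR),
        "accessed": bool(entry & ACCESSED),
        "dirty": bool(entry & DIRTY),
    }


print(decode_entry(0x00400067))
```

When the Present flag is clear the hardware ignores the other bits, so a decoder used by an operating system would check `present` first and treat the rest of the entry as its own data.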

Entries of Page Directories and Page Tables
- Present flag: bit 0
  - 1: the page (or Page Table) is in memory
  - 0: not in memory; the remaining entry bits may be used by the O.S.
    - When such an entry is accessed, the paging unit stores the linear address in a control register named cr2 and generates the page fault exception

Entries of Page Directories and Page Tables (Cont.)
- Read/Write flag: bit 1
  - Contains the access right (read-only or read/write) of the page or of the Page Table
- User/Supervisor flag: bit 2
  - The privilege level required to access the page or Page Table

Entries of Page Directories and Page Tables (Cont.)
- Accessed flag: bit 5
  - Set each time the paging unit addresses the corresponding page frame
  - Used by the O.S. to select pages to be swapped out
  - Never reset by the paging unit, only by the O.S.

Entries of Page Directories and Page Tables (Cont.)
- Dirty flag: bit 6
  - Applies only to Page Table entries
  - Set each time a write operation is performed on the page frame
  - Used by the O.S. to select pages to be swapped out
  - Never reset by the paging unit, only by the O.S.

Demand Paging
- Loading virtual pages into memory as they are accessed
  - The kernel must decide whether the page is in the swap file or somewhere else on disk
- Segment/page fault handling
  - Segmentation fault
    - Invalid virtual address
  - Page fault
    - Page not present
    - Page protection violation

Swapping
- Needed when a process has to bring a virtual page into physical memory and there are no free physical pages available
- Linux uses a Least Recently Used (LRU) page aging technique to choose pages which might be removed from the system
- Kernel Swap Daemon (kswapd)

How to Select the Page to Be Swapped Out

- The process that performed fewer page faults is selected to reclaim memory.
- The Accessed flag included in each page table entry is used to approximate the LRU algorithm
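One textbook way to approximate LRU with nothing but an Accessed flag is the second-chance (clock) scan. The sketch below shows that general technique, not the kernel's actual code; the page dictionaries and the `hand` position are assumptions made for the example.

```python
def select_victim(pages, hand=0):
    """Return (victim_index, next_hand) using a second-chance scan.

    pages is a non-empty list of dicts, each with a boolean 'accessed' flag
    standing in for bit 5 of a page table entry.
    """
    while True:
        page = pages[hand]
        if page["accessed"]:
            # Recently used: the O.S. clears the flag and moves on, so the
            # page is evicted only if it stays untouched until the next pass.
            page["accessed"] = False
            hand = (hand + 1) % len(pages)
        else:
            return hand, (hand + 1) % len(pages)


pages = [{"accessed": True}, {"accessed": False}, {"accessed": True}]
print(select_victim(pages))   # prints (1, 2)
```

The loop always terminates because every visited flag is cleared, so after at most one full pass some page has `accessed == False`.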

When to Perform Page Swap-out
- The kernel thread kswapd is activated once every second and swaps pages out if the number of free page frames has fallen below a predefined threshold
- When a memory request to the Buddy system cannot be satisfied

Page Fault Handling
- Demand paging
  - A technique that defers page frame allocation until the process actually accesses a page that is not in RAM, which causes a "Page fault" exception
- Copy on Write
  - When a fork() system call is issued, page frames are shared between the parent and the child process. Whenever the parent or the child attempts to write into a shared page frame, an exception occurs and the kernel gives the writer a new page frame.
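The Copy on Write rule can be modelled with a reference count per page frame. This is a toy model in user-space Python under stated assumptions: the classes `Frame` and `AddressSpace` are invented for illustration, and the "exception" is reduced to a check inside `write()`.

```python
class Frame:
    """A physical page frame with a count of address spaces that map it."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1


class AddressSpace:
    def __init__(self):
        self.pages = {}   # virtual page number -> Frame

    def fork(self):
        """Create a child that shares every frame with the parent."""
        child = AddressSpace()
        for vpn, frame in self.pages.items():
            frame.refcount += 1
            child.pages[vpn] = frame
        return child

    def write(self, vpn, data):
        """Write to a page, copying the frame first if it is still shared."""
        frame = self.pages[vpn]
        if frame.refcount > 1:
            frame.refcount -= 1          # the writer stops sharing the old frame
            frame = Frame(frame.data)    # and gets a private copy
            self.pages[vpn] = frame
        frame.data = data


parent = AddressSpace()
parent.pages[0] = Frame("original")
child = parent.fork()
child.write(0, "changed by child")
print(parent.pages[0].data, "|", child.pages[0].data)
```

Pages that are never written stay shared for the lifetime of both processes, which is why a child that calls execve() right after fork() costs almost no copying.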

Page Allocation and Deallocation
- The Buddy algorithm
  - Pages are allocated in blocks whose sizes are powers of 2.
    - If the block of pages found is larger than requested, it must be broken down until there is a block of the right size.
  - The page deallocation code recombines pages into larger blocks of free pages whenever it can
    - Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free.

Splitting of Memory in a Buddy Heap
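The splitting and recombining steps of the Buddy algorithm can be sketched in a few lines. This is a simplified model, not the kernel allocator: blocks are identified by their starting page number, free lists are plain Python sets, and the class name `BuddyAllocator` is made up for the example.

```python
class BuddyAllocator:
    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the starting page numbers of free blocks of 2**k pages.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)   # initially one big free block

    def alloc(self, order):
        """Allocate a block of 2**order pages; return its first page or None."""
        k = order
        while k <= self.max_order and not self.free_lists[k]:
            k += 1                          # look for a larger free block
        if k > self.max_order:
            return None
        start = min(self.free_lists[k])
        self.free_lists[k].remove(start)
        while k > order:                    # split until the block is the right size
            k -= 1
            self.free_lists[k].add(start + (1 << k))   # upper half becomes a free buddy
        return start

    def free(self, start, order):
        """Free a block and merge it with its buddy for as long as the buddy is free."""
        k = order
        while k < self.max_order:
            buddy = start ^ (1 << k)        # buddy differs only in bit k of the page number
            if buddy not in self.free_lists[k]:
                break
            self.free_lists[k].remove(buddy)
            start = min(start, buddy)
            k += 1
        self.free_lists[k].add(start)


heap = BuddyAllocator(max_order=4)          # 16 pages in total
first = heap.alloc(0)                       # one page: splits 16 -> 8 -> 4 -> 2 -> 1
second = heap.alloc(1)                      # two pages
heap.free(first, 0)
heap.free(second, 1)
print(heap.free_lists[4])                   # prints {0}
```

Computing the buddy with an XOR works because blocks of 2**k pages are always aligned on a 2**k page boundary.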

Caching
- Use physical memory as a cache
  - Defer writing to disk
  - The sync() system call forces disk synchronization by writing all dirty pages
  - The O.S. also periodically writes dirty pages to disk

Process Management

The Process/Kernel Model
- Each CPU can run in either User Mode or Kernel Mode
  - The 80x86 has four different execution states
    - Linux uses only User Mode and Kernel Mode
  - Each CPU also provides special instructions to switch between these modes

The Process/Kernel Model (Cont.)
- The process/kernel model assumes
  - Processes that require a kernel service use specific programming constructs called system calls
  - The kernel itself is not a process but a process manager that implements many system calls

The Process/Kernel Model (Cont.)
- However, Unix systems also include a few privileged processes called kernel threads
  - They run in Kernel Mode and in the kernel address space
  - They do not interact with users
  - They are usually created during system startup and remain alive until the system is shut down.

The Process/Kernel Model (Cont.)
- Kernel routines can be activated in several ways
  - A process invokes a system call
  - The CPU executing the process signals an exception.
  - A peripheral device issues an interrupt to the CPU
    - The corresponding interrupt handler is invoked
  - A kernel thread is executed

Transitions Between User and Kernel Mode

Process Management
- Process
  - An instance of a program in execution
  - Contains
    - The program's code and data
    - The program counter
    - All of the CPU's registers
    - Process stacks (containing temporary data)
  - Runs in its own virtual address space
    - Is not capable of interacting with another process except through secure, kernel-managed mechanisms

Process Address Space
- Each process runs in its private address space
  - When running in User Mode, the process refers to its private stack, data, and code areas
  - When running in Kernel Mode, the process addresses the kernel data and code areas and uses a private kernel stack
    - Since several kernel control paths can exist, each kernel control path uses its own private kernel stack

Process Descriptor
- To manage processes,
  - The kernel must have a clear picture of what each process is doing
    - This is the role of the process descriptor
  - A task_struct type structure whose fields contain all the information related to a single process

Process Descriptor (Cont.)
- task_struct data structure
  - Process state
  - Scheduling information
  - Inter-process communication
  - Times and timers
  - Tty
    - tty_struct: keeps track of the tty associated with the process
  - Fs
    - fs_struct: keeps track of the current directory
  - Files
    - files_struct: pointers to file descriptors
  - Virtual memory
    - mm_struct: pointer to memory areas
  - Signal
    - signal_struct: records the signals received
  - Processor-specific context
  - ……

Process State
- State field
  - An array of flags, each of which describes a possible state
  - These states are mutually exclusive, hence exactly one flag of state is set

Process State

[Figure: process state diagram with the states ready, executing, suspended, stopped, and zombie, and transitions labeled creation, scheduling, input/output, end of input/output, signal, and termination.]

Process State
- Running
  - TASK_RUNNING
  - The process is either running (it is the current process in the system) or it is ready to run
- Suspended/Blocked
  - The process is waiting for an event or for a resource.
  - Linux differentiates between two types of waiting process:
    - Interruptible
      - TASK_INTERRUPTIBLE
      - Waiting processes can be interrupted by signals
    - Uninterruptible
      - TASK_UNINTERRUPTIBLE
      - Waiting processes are waiting directly on hardware conditions and cannot be interrupted by signals

Process State
- Stopped
  - TASK_STOPPED
  - The process has been stopped, usually by receiving a signal.
  - Also applies to a process that is being debugged by another process using the ptrace() system call and has passed control to the monitoring process
- Zombie
  - TASK_ZOMBIE
  - This is a halted process which, for some reason, still has a task_struct data structure in the task vector.

Process Information
- Scheduling information
- Identifiers
  - Process ID, user ID, group ID
- Inter-process communication
  - Signals, pipes, and semaphores
  - Shared memory and message queues
- Times and timers
  - jiffies

Process Information
- File system
  - Processes can open and close files
  - The process task_struct contains pointers to descriptors for each open file and to two VFS inodes: home directory and current directory.
- Virtual memory
  - The Linux kernel must track how that virtual memory is mapped onto the system's physical memory.

Parenthood Relationships Among Processes
- Processes have a parent/child relationship
- Several fields in the process descriptor of a process P represent these relationships
  - p_opptr (original parent)
    - Points to the process descriptor of the process that created P
    - Or points to the descriptor of process 1 (init) if the parent process no longer exists

Parenthood Relationships Among Processes (Cont.)
- p_pptr (parent)
  - Points to the descriptor of P's current parent
  - This is the process that must be signaled when the child process terminates
  - Usually the same as p_opptr, but when another process issues a ptrace() system call to monitor P, the two differ
- p_cptr (child)
  - Points to the process descriptor of the youngest child of P

Parenthood Relationships Among Processes (Cont.)
- p_ysptr (younger sibling)
  - Points to the process descriptor of the process that was created immediately after P by P's current parent
- p_osptr (older sibling)
  - Points to the process descriptor of the process that was created immediately before P by P's current parent

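Process Relationship

[Figure: a parent process with its oldest child, a middle child, and its youngest child. Every child points to the parent through p_pptr and p_opptr; the parent's p_cptr points to the youngest child; siblings are linked to each other through p_osptr and p_ysptr.]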

Scheduling
- Two kinds of processes supported
  - Normal
  - Real time
- Priority-based scheduling algorithm
- Pre-emptive scheduling
- Time slice: 200 ms
- Scheduling: select the most deserving process to run
  - Priority: weight
    - Normal: counter
    - Real time: counter + 1000
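The weight rule above is enough to sketch how "the most deserving process" is chosen. This is a simplified model that uses only the two cases from the slide; the `Task` class and `pick_next` helper are hypothetical names, and real schedulers take more factors into account.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    counter: int            # remaining time slice, in clock ticks
    realtime: bool = False


def weight(task):
    """Normal tasks weigh their counter; real-time tasks weigh counter + 1000."""
    return task.counter + 1000 if task.realtime else task.counter


def pick_next(run_queue):
    """Select the runnable task with the largest weight."""
    return max(run_queue, key=weight)


run_queue = [Task("editor", 12), Task("compiler", 20), Task("audio", 3, realtime=True)]
print(pick_next(run_queue).name)   # prints audio
```

Adding 1000 guarantees that any runnable real-time task outweighs every normal task, as long as normal counters stay well below 1000.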

When Does the Scheduler Run?
- The scheduler is run from several places within the kernel:
  - After putting the current process onto a wait queue (when it waits for some event)
  - At the end of a system call (just before a process is returned to user mode from system mode)
  - When the system timer has just set the current process's counter to zero
  - When a process exits

Scheduler
- Each time the scheduler is run, it
  - Selects the next process
    - Priority: weight
      - Normal: counter
      - Real time: counter + 1000
  - Performs process switching
    - Saves the hardware context of the current process
    - Loads the hardware context of the new process
    - Updates page table entries
    - Flushes those TLB entries that belonged to the old process

A Process's Files

[Figure: current points to the task_struct, whose files field leads to the table of open files, whose entries in turn lead to the table of i-nodes.]

Process Virtual Address Space Handling
- A process's address space
  - Contains all the virtual memory addresses that the process is allowed to reference
  - Stored as a list of memory area descriptors

Process Virtual Address Space Handling (Cont.)
- A process's virtual address space contains
  - The executable code of the program
  - The initialized data of the program
  - The uninitialized data of the program
  - The initial program stack (i.e., the User Mode stack)
  - The executable code and data of needed shared libraries
  - The heap (for memory dynamically requested by the program)

Process Virtual Address Space Handling (Cont.)
- Demand paging
  - Loading virtual pages into memory as they are accessed
    - The kernel must decide whether the page is in the swap file or somewhere else on disk

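Process Address Space

[Figure: layout of the process address space. Starting from address 0: code, data, data (bss); toward the top of the user area: stack, arguments, environment; kernel memory begins at 0xC0000000.]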

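A Process's Virtual Memory

[Figure: the mm field of the task_struct points to the process's mm_struct (fields count, pgd, mmap, mmap_avl, mmap_sem). The mmap field heads a list of vm_area_struct descriptors (fields vm_end, vm_start, vm_flags, vm_inode, vm_ops, vm_next) chained through vm_next; one describes the code area and the next the data area of the process's virtual memory.]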

Times and Timers
- Times
  - At each clock tick, the kernel updates the amount of time, in jiffies, that the process has spent in system and user mode
- Interval timers
  - Real: SIGALRM
  - Virtual: this timer only ticks when the process is running: SIGVTALRM

The Process List
- All existing process descriptors are linked by a circular doubly linked list
  - Linked by the prev_task and next_task fields of task_struct
  - The head of the list is the init_task descriptor, which is called process 0 or swapper
  - A macro called for_each_task scans the whole process list
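The circular doubly linked process list can be modelled directly. This is a user-space sketch with invented helper names (`TaskStruct`, `link_task`); only the field names prev_task and next_task and the idea of a for_each_task scan starting from init_task come from the slide.

```python
class TaskStruct:
    def __init__(self, pid):
        self.pid = pid
        self.next_task = self   # a single descriptor forms a one-element ring
        self.prev_task = self


def link_task(init_task, task):
    """Insert task at the tail of the ring, just before init_task."""
    tail = init_task.prev_task
    tail.next_task = task
    task.prev_task = tail
    task.next_task = init_task
    init_task.prev_task = task


def for_each_task(init_task):
    """Visit every descriptor after init_task until the ring wraps around."""
    task = init_task.next_task
    while task is not init_task:
        yield task
        task = task.next_task


init_task = TaskStruct(0)                  # process 0, the head of the list
for pid in (1, 2, 3):
    link_task(init_task, TaskStruct(pid))
print([t.pid for t in for_each_task(init_task)])   # prints [1, 2, 3]
```

Because the list is circular, the scan needs no separate end marker: it stops when it comes back to the head.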

Fig. 3-3. The Process List

The List of TASK_RUNNING Processes
- The kernel maintains a doubly linked list of TASK_RUNNING processes called the runqueue
- This list is implemented through the run_list field in the process descriptor

How Processes Are Organized
- The runqueue list links all processes in the TASK_RUNNING state; processes in the other states are organized differently
  - Processes in the TASK_STOPPED or TASK_ZOMBIE state are not linked in specific lists
    - They are only accessed via PID or via the linked lists of the child processes of a particular parent
  - Processes in the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state are queued in wait queues

Wait Queues
- Wait queues implement conditional waits on events
  - A process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control
  - A wait queue represents a set of sleeping processes
    - They are woken up by the kernel when some condition becomes true

Wait Queues (Cont.)
- Since wait queues are modified by interrupt handlers as well as by kernel functions
  - Wait queues must be protected from concurrent access

Wait Queues (Cont.)
- There are two kinds of sleeping processes
  - Exclusive processes
    - Only one process is selected for running when the event occurs
  - Nonexclusive processes
    - All processes are woken up by the kernel when the event occurs
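The difference between exclusive and nonexclusive sleepers can be illustrated with a toy wait queue. The sketch assumes one particular policy for a mixed queue (wake every nonexclusive sleeper plus at most one exclusive sleeper per event); the `WaitQueue` class is illustrative and involves no real blocking or locking.

```python
from collections import deque


class WaitQueue:
    def __init__(self):
        self.sleepers = deque()   # entries are (name, exclusive) pairs

    def sleep(self, name, exclusive=False):
        """Record that a process is waiting for the event."""
        self.sleepers.append((name, exclusive))

    def wake_up(self):
        """Wake all nonexclusive sleepers and at most one exclusive sleeper."""
        woken, still_asleep = [], deque()
        exclusive_woken = False
        for name, exclusive in self.sleepers:
            if not exclusive:
                woken.append(name)
            elif not exclusive_woken:
                woken.append(name)
                exclusive_woken = True
            else:
                still_asleep.append((name, exclusive))   # must wait for the next event
        self.sleepers = still_asleep
        return woken


queue = WaitQueue()
queue.sleep("logger")
queue.sleep("worker-1", exclusive=True)
queue.sleep("worker-2", exclusive=True)
print(queue.wake_up())   # prints ['logger', 'worker-1']
print(queue.wake_up())   # prints ['worker-2']
```

Exclusive waits are useful when only one sleeper can actually consume the resource, since waking all of them would only make them race and go back to sleep.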

Process Resource Limits
- Each process has an associated set of resource limits
  - They specify the amount of system resources it can use
    - RLIMIT_AS: the maximum size of the process address space, in bytes

Process Resource Limits (Cont.)
- RLIMIT_CORE: the maximum core dump file size, in bytes
- RLIMIT_CPU: the maximum CPU time for the process, in seconds
- RLIMIT_DATA: the maximum heap size, in bytes

Process Resource Limits (Cont.)
- RLIMIT_FSIZE: the maximum file size allowed, in bytes
- RLIMIT_LOCKS: the maximum number of file locks
- RLIMIT_MEMLOCK: the maximum size of nonswappable memory, in bytes
  - Checked when a process uses the mlock() or mlockall() system call to lock a page frame

Process Resource Limits (Cont.)
- RLIMIT_NOFILE: the maximum number of open file descriptors
- RLIMIT_NPROC: the maximum number of processes that the user can own
- RLIMIT_RSS: the maximum number of page frames owned by the process
  - Maximum resident set size
- RLIMIT_STACK: the maximum User Mode stack size, in bytes

Process Resource Limits (Cont.)
- The resource limits are stored in the rlim field of the process descriptor
- The field is an array of elements of type struct rlimit

    struct rlimit {
        unsigned long rlim_cur;   /* current (soft) limit */
        unsigned long rlim_max;   /* maximum allowed value of rlim_cur */
    };

Process Resource Limits (Cont.)
- The rlim_cur field is the current resource limit for the resource
  - current->rlim[RLIMIT_CPU].rlim_cur represents the current process's CPU time limit
- The rlim_max field is the maximum allowed value for the resource limit
  - A user can use the getrlimit() and setrlimit() system calls to increase rlim_cur up to rlim_max
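The getrlimit() and setrlimit() calls are exposed in user space by Python's standard `resource` module on Unix-like systems. The sketch below reads the (rlim_cur, rlim_max) pair for RLIMIT_NOFILE and sets a new soft limit that never exceeds the hard limit; the helper names `show` and `set_soft_nofile` are made up for the example.

```python
import resource


def show(limit):
    """Render a limit value, which may be the special 'no limit' constant."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def set_soft_nofile(new_soft):
    """Set the soft RLIMIT_NOFILE limit, clamped to the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)   # (rlim_cur, rlim_max)
    if hard != resource.RLIM_INFINITY:
        if new_soft == resource.RLIM_INFINITY or new_soft > hard:
            new_soft = hard          # rlim_cur may not be raised above rlim_max
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", show(soft), "hard =", show(hard))
```

The sketch leaves the hard limit untouched, so the change stays within what an unprivileged process is allowed to do.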

Executing Programs
- Process creation
  - Fork
- Program execution
  - Exec
- Linux binary formats
  - ELF, a.out, script

Creating Processes
- Unix relies on process creation to satisfy user requests
  - The shell creates a new process to execute each user request
- Traditional Unix duplicated the resources owned by the parent for the child
  - This requires copying the entire address space of the parent
  - However, the child may not need so many resources, especially if it issues an immediate execve()

Creating Processes (Cont.)
- Modern Unix kernels solve this with
  - Copy-on-Write: parent and child share physical pages; when either one tries to write to a shared page, the kernel allocates a new page frame for the writer
  - Lightweight processes: allow parent and child to share many per-process kernel data structures
  - The vfork() system call: creates a process that shares the memory address space of its parent
    - The parent is blocked until the child exits or executes a new program
    - This prevents the parent from overwriting data needed by the child

Clone(), Fork() and Vfork() System Calls
- Lightweight processes are created in Linux by using clone() with four parameters.
  - fn: the function to be executed
  - arg: points to the data passed to fn()
  - flags
    - The low byte specifies the signal number to be sent to the parent when the child terminates
    - The remaining three bytes encode a group of flags
  - child_stack: specifies the User Mode stack pointer.
    - If 0, the child shares the parent's stack until Copy-on-Write is invoked
    - However, it must be non-null if the child wants to share the same address space as the parent

Clone(), Fork() and Vfork() System Calls (Cont.)
- Flags
  - CLONE_VM: share the memory descriptor and all Page Tables
  - CLONE_FS:
    - Share the table that identifies the root and current working directories
    - Share the bitmask (called umask) used to mask the initial file permissions of a new file

Clone(), Fork() and Vfork() System Calls (Cont.)
- CLONE_FILES: share the table that identifies the open files
- CLONE_PARENT: set the parent of the child to the parent of the calling process
  - The child and the calling process are siblings and have the same parent
- CLONE_PID: share the PID (only used by Process 0, on multiprocessor systems)
- CLONE_PTRACE: if the parent is being ptraced, the child is traced as well

Clone(), Fork() and Vfork() System Calls (Cont.)
- CLONE_SIGHAND: share the table that identifies the signal handlers
- CLONE_THREAD: insert the child into the same thread group as the parent
- CLONE_SIGNAL: CLONE_SIGHAND + CLONE_THREAD
  - Allows a signal to be sent to all threads of a multithreaded application
- CLONE_VFORK: used by the vfork() system call

Clone(), Fork() and Vfork() System Calls (Cont.)
- clone() is a wrapper function defined in the C library that in turn uses the clone() system call
  - The clone() system call only has the flags and child_stack parameters
  - Thus, when the system call returns to the clone() function
    - It determines whether it is in the parent or in the child
    - If in the child, it executes the fn() function

Clone(), Fork() and Vfork() System Calls (Cont.)
- The fork() system call is implemented by a clone() system call in which
  - flags specifies the SIGCHLD signal and all the other flags are clear
  - child_stack is 0

Clone(), Fork() and Vfork() System Calls (Cont.)
- vfork() is implemented by a clone() system call in which
  - flags = the SIGCHLD signal + CLONE_VM + CLONE_VFORK
  - child_stack is 0

Kernel Thread
- Some critical tasks are scheduled in the background
  - Flushing disk caches, swapping out unused page frames…
- The system processes that perform them run only in Kernel Mode
  - They are called kernel threads

Kernel Thread (Cont.)
- Differences between kernel threads and regular processes
  - A kernel thread executes a single specific kernel C function
    - Regular processes execute kernel functions only through system calls
  - A kernel thread runs only in Kernel Mode
    - Regular processes run alternately in Kernel Mode and User Mode
  - A kernel thread uses only linear addresses greater than PAGE_OFFSET
    - Regular processes use all four gigabytes of linear addresses

Kernel Thread (Cont.)
- Kernel threads
  - Process 0
  - Process 1
  - keventd: executes the tasks in the tq_context task queue, mentioned in Chapter 4
  - kapm: handles the events related to APM (Advanced Power Management)
  - kswapd: performs memory reclaiming
  - kflushd (or bdflush): flushes dirty buffers to disk to reclaim memory
  - kupdated: flushes old dirty buffers to disk to reduce the risk of filesystem inconsistencies
  - ksoftirqd: runs the tasklets, mentioned in Chapter 4

Process 0
- The ancestor of all processes, also called the swapper process
- Created from scratch during the initialization phase by the start_kernel() function
- Executes the cpu_idle() function after having created the init process
- Process 0 is selected by the scheduler only when there are no other processes in the TASK_RUNNING state

Process 1
- Also called the init process, since it executes the init() function that completes the initialization of the kernel
- Created by process 0 by calling the kernel_thread() function
- Process 1 creates and monitors the activity of all processes that implement the outer layers of the operating system
  - For example, it routinely issues wait() system calls to get rid of all zombie processes

Destroying Processes
- Destroying processes
  - Invoke the exit() library function
    - Releases the resources allocated by the C library
    - Invokes the _exit() system call
  - The kernel may force a process to die
    - The process received a specific signal
    - An unrecoverable CPU exception occurred
- Both of the above approaches invoke the do_exit() function

Process Termination
- Process termination is handled by the do_exit() function
  - Sets the PF_EXITING flag in the flags field of the process descriptor
  - Removes, if necessary, the process descriptor from an IPC semaphore queue
    - Releases the semaphore
  - Removes, if necessary, the process descriptor from a dynamic timer queue
    - Releases the timer

Process Termination (Cont.)
- Examines the process's data structures related to paging, filesystem, open file descriptors, and signal handling
  - Removes each of them if no other processes are sharing them
- Decrements the resource counters of the modules used by the process
- Sets the exit_code field of the process descriptor to the process termination code
  - Equal to either the _exit() system call parameter or an error code supplied by the kernel

Process Termination (Cont.)
- Invokes the exit_notify() function to update the parenthood relationships of both the parent and the child processes
  - All child processes become the children of another process or of the init process
  - Sets the state field of the process descriptor to TASK_ZOMBIE
- Invokes the schedule() function to select a new process to run

Process Removal
- In Unix, a process can query the kernel to
  - Obtain the PID of its parent process
  - Obtain the execution state of its children
    - When a child has terminated, its termination code tells the parent (via a wait()-like system call) whether it was carried out successfully
- Thus, the kernel is not allowed to discard the data included in a process descriptor right after the process terminates
  - The data must be saved until the parent process is notified
  - This is why the TASK_ZOMBIE state was introduced

Zombie Process - Summary
- The wait() system call allows a process to wait until one of its children terminates
  - It returns the process ID (PID) of the terminated child
- Zombie process
  - A process that has terminated but whose parent has not yet executed a wait() system call
  - Still holds its task_struct data structure

Zombie Process - Summary (Cont.)
- The related data structure is not released until the wait() call
- But what if a parent terminates without issuing a wait() call?
  - For example, it starts a child in the background and then exits

Zombie Process - Summary (Cont.)
- The solution relies on the init process
  - Created during system initialization
- When a child terminates with no parent
  - Its parent is changed to init
- Init then routinely issues wait() system calls
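The fork, exit, and wait cycle summarized above can be observed from user space with Python's standard `os` module on a Unix-like system. This is a minimal sketch; the helper name `run_child_and_reap` is invented for the example. Between the child's `_exit()` and the parent's `waitpid()`, the child is a zombie in the sense described above.

```python
import os


def run_child_and_reap(exit_code):
    """Fork a child that exits at once, then collect its PID and exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately through the _exit() system call.
        os._exit(exit_code)
    # Parent: until this call returns, the terminated child's descriptor
    # is kept by the kernel so that its termination code can be read.
    reaped_pid, status = os.waitpid(pid, 0)
    return pid, reaped_pid, os.WEXITSTATUS(status)


child_pid, reaped_pid, code = run_child_and_reap(7)
print(child_pid == reaped_pid, code)   # prints True 7
```

If the parent exited without calling `waitpid()`, the child would be reparented to init, which performs the same reaping step on its behalf.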