Linux Kernel Components
- Memory management
- Process management
- Inter-process communication (IPC)
- File systems
- Networking
- Device control and device drivers
Memory Management
- Virtual memory mechanism
  - Overcomes physical memory limitations
    - Makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.
- The memory management subsystem provides
  - Large address spaces
  - Protection
    - Each process has its own virtual address space, completely separate from the others, so a process running one application cannot affect another.
    - The hardware virtual memory mechanisms allow areas of memory to be protected against writing.
Memory Management
- The memory management subsystem provides
  - Memory mapping
    - Used to map image and data files into a process's address space
    - In memory mapping, the contents of a file are linked directly into the virtual address space of a process
  - Fair physical memory allocation
  - Shared virtual memory
    - Can also be used as an inter-process communication (IPC) mechanism, with processes exchanging information via memory common to all of them.
    - Linux supports the Unix System V shared memory IPC.
Memory Management
- x86 memory management
  - Segmentation
  - Paging
- Linux memory management
  - Memory initialization
  - Memory allocation and deallocation
  - Memory map
  - Page fault handling
  - Demand paging and page replacement
Memory Address
- Traditionally, a memory address is simply the way to access a memory cell
  - With the 80x86, however, we have to specify precisely which kind of "address" we mean
- Memory address
  - Logical address
  - Linear address (or virtual address)
  - Physical address
Memory Address (Cont.)
- Logical address
  - Used to specify the address of an operand or of an instruction
  - Consists of a segment and an offset value
- Linear address
  - A single 32-bit unsigned integer that can be used to address up to 4 GB
Memory Address (Cont.)
- Physical address
  - Used to address memory cells in memory chips
  - Represented as 32-bit unsigned integers
[Figure: logical address -> SEGMENTATION UNIT (HW) -> linear address -> PAGING UNIT (HW) -> physical address]
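Segment Translation
[Figure: a logical address consists of a selector (bits 15-0) and an offset (bits 31-0). The selector picks a segment descriptor from the segment descriptor table; the descriptor's base address is added to the offset to form the linear address (Dir, Page, Offset).]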
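Linear Address Translation
[Figure: the 32-bit linear address is split into Directory (bits 31-22, 10 bits), Table (bits 21-12, 10 bits) and Offset (bits 11-0, 12 bits). CR3 (PDBR) points to the page directory; the directory entry selects a page table; the page-table entry selects a page in physical memory; the offset gives the physical address within that page.]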
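The 10/10/12 split of a linear address can be sketched with shifts and masks. This is an illustration only, assuming 32-bit addresses and 4 KB pages as in the figure; `struct linear_fields` and `split_linear` are hypothetical names, not kernel identifiers.

```c
#include <stdint.h>

/* Fields of a 32-bit x86 linear address with 4 KB pages (10/10/12 split). */
struct linear_fields {
    uint32_t dir;     /* bits 31-22: index into the page directory */
    uint32_t table;   /* bits 21-12: index into the page table */
    uint32_t offset;  /* bits 11-0: byte offset inside the page */
};

/* Hypothetical helper: split a linear address into its three fields. */
struct linear_fields split_linear(uint32_t addr)
{
    struct linear_fields f;
    f.dir    = addr >> 22;
    f.table  = (addr >> 12) & 0x3FFu;
    f.offset = addr & 0xFFFu;
    return f;
}
```

For the address 0xC0101234 this yields directory index 0x300, table index 0x101 and offset 0x234.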
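Segmentation and Paging
[Figure: the segment selector of a logical address picks a segment descriptor, whose segment base address plus the offset gives a linear address in the linear address space; the Dir, Table and Offset fields of that linear address are then resolved through the page directory and page table to a page in the physical address space.]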
Abstract model of Virtual to Physical address mapping
[Figure: Process X and Process Y each have virtual memory pages VPFN0 to VPFN7; each process's page table maps those virtual pages onto physical memory frames PFN0 to PFN4.]
An Abstract Model of VM (Cont.)
- Each page table entry contains:
  - Valid flag
  - Physical page frame number
  - Access control information
- x86 page table entry and page directory entry:
  [Figure: bits 31-12 Page Address; bit 6 D; bit 5 A; bit 2 U/S; bit 1 R/W; bit 0 P]
Entries of Page Directories and Page Tables
- Present flag: bit 0
  - 1: in memory
  - 0: not in memory; the remaining entry bits may be used by the OS
    - On an access through such an entry, the paging unit stores the linear address in a control register named cr2 and generates the Page Fault exception
Entries of Page Directories and Page Tables (Cont.)
- Read/Write flag: bit 1
  - Contains the access right of the page or of the Page Table
- User/Supervisor flag: bit 2
  - The privilege level required to access the page or Page Table
Entries of Page Directories and Page Tables (Cont.)
- Accessed flag: bit 5
  - Set when the paging unit addresses the corresponding page frame
  - Used by the OS to select the page to be swapped out
  - Never reset by the paging unit, only by the OS
Entries of Page Directories and Page Tables (Cont.)
- Dirty flag: bit 6
  - Applies only to Page Table entries
  - Set each time a write operation is performed on the page frame
  - Used by the OS to select the page to be swapped out
  - Never reset by the paging unit, only by the OS
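The flag bits described on these slides can be read and cleared with simple masks. The sketch below assumes the 32-bit entry layout shown earlier; the `PTE_*` macros and function names are hypothetical and are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the slides; names are illustrative only. */
#define PTE_PRESENT   (1u << 0)
#define PTE_RW        (1u << 1)
#define PTE_USER      (1u << 2)
#define PTE_ACCESSED  (1u << 5)
#define PTE_DIRTY     (1u << 6)
#define PTE_ADDR_MASK 0xFFFFF000u   /* bits 31-12: page frame address */

bool pte_present(uint32_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

uint32_t pte_frame_address(uint32_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* The hardware only sets the Accessed bit; clearing it is the OS's job. */
uint32_t pte_clear_accessed(uint32_t pte)
{
    return pte & ~PTE_ACCESSED;
}

/* A present page that has been written must be saved before it is reused. */
bool pte_needs_writeback(uint32_t pte)
{
    return (pte & (PTE_PRESENT | PTE_DIRTY)) == (PTE_PRESENT | PTE_DIRTY);
}
```

Periodically clearing the Accessed bit and checking later whether the hardware has set it again is one way an OS can approximate LRU.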
Demand Paging
- Loading virtual pages into memory as they are accessed
  - Decide whether the page is in the swap file or somewhere else on disk
- Segment/page fault handling
  - Segmentation fault
    - Invalid virtual address
  - Page fault
    - Page not present
    - Page protection violation
Swapping
- Needed when a process must bring a virtual page into physical memory and there are no free physical pages available
- Linux uses a Least Recently Used (LRU) page aging technique to choose pages which might be removed from the system
- Kernel swap daemon (kswapd)
How to Select the Page to Be Swapped Out
- The process that performed fewer page faults is selected to reclaim memory.
- The Accessed flag included in each page table entry is used to simulate the LRU algorithm
When to Perform Page Swap-out
- The kernel thread kswapd is activated once every second whenever the number of free page frames falls below a predefined threshold
- When a memory request to the Buddy system cannot be satisfied
Page Fault Handling
- Demand paging
  - A technique that defers page frame allocation until the page is accessed; because the page is not yet in RAM, the access causes a "Page fault" exception
- Copy on Write
  - When a fork() system call is issued, page frames are shared between the parent and the child process. Whenever the parent or the child attempts to write into a shared page frame, an exception occurs.
Page Allocation and Deallocation
- The Buddy algorithm
  - Pages are allocated in blocks whose sizes are powers of 2.
    - If the block of pages found is larger than requested, it must be broken down until there is a block of the right size.
  - The page deallocation code recombines pages into larger blocks of free pages whenever it can
    - Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free.
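Two small calculations sit at the heart of a buddy allocator: rounding a request up to a power-of-two block, and locating a block's buddy. The sketch below assumes blocks are aligned to their own size and identified by the index of their first page; `order_for_pages` and `buddy_of` are hypothetical names, not the kernel's functions.

```c
#include <stddef.h>

/* Smallest order such that a block of 2^order pages can hold `pages` pages. */
unsigned order_for_pages(size_t pages)
{
    unsigned order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* First-page index of the buddy of the block starting at `page_index`:
 * with size-aligned blocks, the buddy differs only in bit `order`. */
size_t buddy_of(size_t page_index, unsigned order)
{
    return page_index ^ ((size_t)1 << order);
}
```

A request for 3 pages is served from an order-2 (4-page) block; when the order-2 block at page 8 is freed, the block to check for merging starts at page 12.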
Caching
- Use physical memory as a cache
  - Defer writing to disk
  - The sync() system call forces disk synchronization by writing all dirty pages
  - All operating systems also periodically write dirty pages to disk
The Process/Kernel Model
- Each CPU can run in either User Mode or Kernel Mode
  - The 80x86 has four different execution states
    - Linux uses only User Mode and Kernel Mode
  - Each CPU also provides special instructions to switch between these modes
The Process/Kernel Model (Cont.)
- The process/kernel model assumes
  - Processes that require a kernel service use specific programming constructs called system calls
  - The kernel itself is not a process but a process manager that implements many system calls
The Process/Kernel Model (Cont.)
- However, Unix systems also include a few privileged processes called kernel threads
  - They run in Kernel Mode and in the kernel address space
  - They do not interact with users
  - They are usually created during system startup and remain alive until the system is shut down.
The Process/Kernel Model (Cont.)
- Kernel routines can be activated in several ways
  - A process invokes a system call
  - The CPU executing the process signals an exception
  - A peripheral device issues an interrupt to the CPU
    - This invokes the interrupt handler
  - A kernel thread is executed
Process Management
- Process
  - An instance of a program in execution
  - Contains
    - The program's code and data
    - The program counter
    - All of the CPU's registers
    - Process stacks (containing temporary data)
  - Runs in its own virtual address space
    - Is not capable of interacting with another process except through secure, kernel-managed mechanisms
Process Address Space
- Each process runs in its private address space
  - A process running in User Mode refers to private stack, data, and code areas
  - When running in Kernel Mode, the process addresses the kernel data and code areas and uses a private kernel stack
    - Since several kernel control paths can exist, each kernel control path uses its own private kernel stack
Process Descriptor
- To manage processes
  - The kernel must have a clear picture of what each process is doing
    - This is what the process descriptor is for
  - It is a task_struct type structure whose fields contain all the information related to a single process
Process Descriptor (Cont.)
- task_struct data structure
  - Process state
  - Scheduling information
  - Inter-process communication
  - Times and timers
  - Tty
    - tty_struct: keeps track of the tty associated with the process
  - Fs
    - fs_struct: keeps track of the current directory
  - Files
    - files_struct: pointers to file descriptors
  - Virtual memory
    - mm_struct: pointer to memory areas
  - Signal
    - signal_struct: records signals received
  - Processor-specific context
  - ……
Process State
- State field
  - An array of flags, each of which describes a possible state
  - These states are mutually exclusive, and hence exactly one state flag is set
Process State
[Figure: state diagram with the states ready, executing, suspended, stopped and zombie; transitions are labelled creation, scheduling, signal, input/output, end of input/output and termination.]
Process State
- Running
  - TASK_RUNNING
  - The process is either running (it is the current process in the system) or it is ready to run
- Suspended/Blocked
  - The process is waiting for an event or for a resource.
  - Linux differentiates between two types of waiting process:
    - Interruptible
      - TASK_INTERRUPTIBLE
      - Waiting processes can be interrupted by signals
    - Uninterruptible
      - TASK_UNINTERRUPTIBLE
      - Waiting processes are waiting directly on hardware conditions and cannot be interrupted by signals
Process State
- Stopped
  - TASK_STOPPED
  - The process has been stopped, usually by receiving a signal.
  - Also applies to a process that is being debugged by another process using the ptrace() system call and has passed control to the monitoring process
- Zombie
  - TASK_ZOMBIE
  - This is a halted process which, for some reason, still has a task_struct data structure in the task vector.
Process Information
- Scheduling information
- Identifiers
  - Process ID, user ID, group ID
- Inter-process communication
  - Signals, pipes and semaphores
  - Shared memory and message queues
- Times and timers
  - jiffies
Process Information
- File system
  - Processes can open and close files
  - The process task_struct contains pointers to descriptors for each open file and to two VFS inodes: home directory and current directory.
- Virtual memory
  - The Linux kernel must track how that virtual memory is mapped onto the system's physical memory.
Parenthood Relationships Among Processes
- Processes have a parent/child relationship
- Several fields in the process descriptor of a process P represent these relationships
  - p_opptr (original parent)
    - Points to the process descriptor of the process that created P
    - Or points to the descriptor of process 1 (init) if the parent process no longer exists
Parenthood Relationships Among Processes (Cont.)
- p_pptr (parent)
  - Points to the descriptor of P's current parent
  - This is the process that must be signaled when the child process terminates
  - Usually the same as p_opptr, but when another process issues a ptrace() system call to monitor P, the two differ
- p_cptr (child)
  - Points to the process descriptor of the youngest child of P
Parenthood Relationships Among Processes (Cont.)
- p_ysptr (younger sibling)
  - Points to the process descriptor of the process that was created immediately after P by P's current parent
- p_osptr (older sibling)
  - Points to the process descriptor of the process that was created immediately before P by P's current parent
Process Relationship
[Figure: a parent and its children (oldest child, child, youngest child). Each child's p_pptr and p_opptr point to the parent; the parent's p_cptr points to the youngest child; siblings are linked through p_osptr and p_ysptr.]
Scheduling
- Two kinds of processes supported
  - Normal
  - Real time
- Priority-based scheduling algorithm
- Pre-emptive scheduling
- Time slice: 200 ms
- Scheduling: select the most deserving process to run
  - Priority: weight
    - Normal: counter
    - Real time: counter + 1000
When Does the Scheduler Run?
- The scheduler is run from several places within the kernel:
  - After putting the current process onto a wait queue (when it waits for some event)
  - At the end of a system call (just before a process is returned to user mode from system mode)
  - When the system timer has just set the current process's counter to zero
  - When a process exits
Scheduler
- Each time the scheduler is run, it
  - Selects the next process
    - Priority: weight
      - Normal: counter
      - Real time: counter + 1000
  - Performs process switching
    - Saves the hardware context of the current process
    - Loads the hardware context of the new process
    - Updates page table entries
    - Flushes those TLB entries that belonged to the old process
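The selection step can be sketched as a scan for the largest weight, using the rule from the slide (counter for normal processes, counter + 1000 for real-time ones). This is a toy model, not the kernel's scheduler; `struct toy_task`, `weight` and `pick_next` are hypothetical names.

```c
#include <stddef.h>

/* Toy stand-in for the scheduling fields of a process descriptor. */
struct toy_task {
    int real_time;  /* non-zero for a real-time process */
    int counter;    /* remaining time slice, in ticks */
};

/* Weight as given on the slide: real-time processes get a 1000 bonus. */
int weight(const struct toy_task *t)
{
    return t->real_time ? t->counter + 1000 : t->counter;
}

/* Index of the most deserving task, or -1 if the array is empty. */
int pick_next(const struct toy_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 || weight(&tasks[i]) > weight(&tasks[best]))
            best = (int)i;
    }
    return best;
}
```

With this rule, a real-time process with a small counter still outranks any normal process whose counter is below 1000.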
A Process's Files
[Figure: current points to the task_struct, whose files field leads to the table of open files; entries in that table refer to the table of i-nodes.]
Process Virtual Address Space Handling
- A process's address space
  - Contains all the virtual memory addresses that the process is allowed to reference
  - Stored as a list of memory area descriptors
Process Virtual Address Space Handling (Cont.)
- A process's virtual address space contains
  - The executable code of the program
  - The initialized data of the program
  - The uninitialized data of the program
  - The initial program stack (i.e., the User Mode stack)
  - The executable code and data of needed shared libraries
  - The heap (for memory dynamically requested by the program)
Process Virtual Address Space Handling (Cont.)
- Demand paging
  - Loading virtual pages into memory as they are accessed
    - Decide whether the page is in the swap file or somewhere else on disk
A Process's Virtual Memory
[Figure: the mm field of task_struct points to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem); mmap heads a list of vm_area_struct entries (vm_end, vm_start, vm_flags, vm_inode, vm_ops, vm_next) chained through vm_next, each describing one region of the process's virtual memory, such as code or data.]
Times and Timers
- Times
  - On each clock tick, the kernel updates the amount of time, in jiffies, that the process has spent in system and user mode
- Interval timers
  - Real: SIGALRM
  - Virtual: this timer only ticks when the process is running: SIGVTALRM
The Process List
- All existing process descriptors are linked by a circular doubly linked list
  - Linked by the prev_task and next_task fields of task_struct
  - The head of the list is the init_task descriptor, which is called process 0 or swapper
  - A macro called for_each_task scans the whole process list
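A circular doubly linked list with a fixed head can be walked with a small macro. The sketch below is a simplified stand-in and not the kernel's definitions: `struct list_task`, `init_task_list`, `link_task`, `for_each_toy_task` and `count_tasks` are hypothetical, and only the `prev_task`/`next_task` field names follow the slide.

```c
#include <stddef.h>

struct list_task {
    int pid;
    struct list_task *prev_task;
    struct list_task *next_task;
};

/* An empty circular list: the head points to itself in both directions. */
void init_task_list(struct list_task *head)
{
    head->prev_task = head;
    head->next_task = head;
}

/* Link `t` just before the head, i.e. at the tail of the circular list. */
void link_task(struct list_task *head, struct list_task *t)
{
    t->prev_task = head->prev_task;
    t->next_task = head;
    head->prev_task->next_task = t;
    head->prev_task = t;
}

/* Visit every task except the head, stopping when the walk wraps around. */
#define for_each_toy_task(p, head) \
    for ((p) = (head)->next_task; (p) != (head); (p) = (p)->next_task)

/* Count all tasks, including the head (the "process 0" of this toy list). */
int count_tasks(struct list_task *head)
{
    struct list_task *p;
    int n = 1;
    for_each_toy_task(p, head)
        n++;
    return n;
}
```

Because the list is circular, the traversal needs no NULL check: it ends when it arrives back at the head.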
The List of TASK_RUNNING Processes
- The kernel maintains a doubly linked list of TASK_RUNNING processes called the runqueue
- This list is implemented through the run_list field in the process descriptor
How Processes Are Organized
- The runqueue list links only the processes in the TASK_RUNNING state; processes in other states are handled differently
  - Processes in the TASK_STOPPED or TASK_ZOMBIE state are not linked in specific lists
    - They are only accessed via PID or via the linked lists of the child processes of a particular parent
  - Processes in the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state are queued in wait queues
Wait Queues
- Wait queues implement conditional waits on events
  - A process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control
  - A wait queue represents a set of sleeping processes
    - They are woken up by the kernel when some condition becomes true
Wait Queues (Cont.)
- Wait queues are modified by interrupt handlers as well as by kernel functions
  - They must therefore be protected from concurrent access
Wait Queues (Cont.)
- There are two kinds of sleeping processes
  - Exclusive processes
    - Only one process is selected for running when the event occurs
  - Nonexclusive processes
    - All processes are woken up by the kernel when the event occurs
Process Resource Limits
- Each process has an associated set of resource limits
  - They specify the amount of system resources it can use
    - RLIMIT_AS: the maximum size of the process address space, in bytes
Process Resource Limits (Cont.)
- RLIMIT_CORE: the maximum core dump file size, in bytes
- RLIMIT_CPU: the maximum CPU time for the process, in seconds
- RLIMIT_DATA: the maximum heap size, in bytes
Process Resource Limits (Cont.)
- RLIMIT_FSIZE: the maximum file size allowed, in bytes
- RLIMIT_LOCKS: the maximum number of file locks
- RLIMIT_MEMLOCK: the maximum size of nonswappable memory, in bytes
  - Checked when a process uses the mlock() or mlockall() system call to lock a page frame
Process Resource Limits (Cont.)
- RLIMIT_NOFILE: the maximum number of open file descriptors
- RLIMIT_NPROC: the maximum number of child processes that the user can own
- RLIMIT_RSS: the maximum number of page frames owned by the process
  - That is, the maximum resident set size
- RLIMIT_STACK: the maximum User Mode stack size, in bytes
Process Resource Limits (Cont.)
- The resource limits are stored in the rlim field of the process descriptor
- This field is an array of elements of type struct rlimit
    struct rlimit {
        unsigned long rlim_cur;
        unsigned long rlim_max;
    };
Process Resource Limits (Cont.)
- The rlim_cur field is the current resource limit for the resource
  - current->rlim[RLIMIT_CPU].rlim_cur represents the current process's CPU time limit
- The rlim_max field is the maximum allowed value for the resource limit
  - A user can use the getrlimit() and setrlimit() system calls to increase rlim_cur up to rlim_max
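From user space, the two fields are read and changed with the POSIX `getrlimit()` and `setrlimit()` calls. A minimal sketch that raises the soft limit (`rlim_cur`) to the hard limit (`rlim_max`) follows; the wrapper name `raise_soft_limit_to_max` is hypothetical.

```c
#include <sys/resource.h>

/* Raise the soft limit of `resource` to its hard limit.
 * Returns 0 on success and -1 on error, like the underlying calls. */
int raise_soft_limit_to_max(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* an unprivileged process may go this far */
    return setrlimit(resource, &rl);
}
```

For example, `raise_soft_limit_to_max(RLIMIT_CORE)` allows core dumps as large as the hard limit permits.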
Executing Programs
- Process creation
  - Fork
- Program execution
  - Exec
- Linux binary formats
  - ELF, a.out, script
Creating Processes
- Unix relies on process creation to satisfy user requests
  - The shell creates a new process to execute the user request
- Traditional Unix duplicated the resources owned by the parent for the child
  - This requires copying the entire address space of the parent
  - However, the child may not need so many resources, especially if it issues an immediate execve()
Creating Processes (Cont.)
- Modern Unix kernels solve this with
  - Copy-on-Write: only when either process tries to write to a shared physical page does the kernel allocate a new page frame
  - Lightweight processes: allow parent and child to share many per-process kernel data structures
  - The vfork() system call: creates a process that shares the memory address space of its parent
    - The parent is blocked until the child exits or executes a new program
    - This prevents the parent from overwriting data needed by the child
Clone(), Fork() and Vfork() System Calls
- Lightweight processes are created in Linux by using clone() with four parameters.
  - fn: the function to be executed
  - arg: points to data passed to fn()
  - flags
    - The low byte specifies the signal number to be sent to the parent when the child terminates
    - The remaining three bytes encode a group of flags
  - child_stack: specifies the User Mode stack pointer.
    - If 0, the child shares the parent's stack until Copy-on-Write is invoked
    - However, it must be non-null if the child shares the same address space as the parent
Clone(), Fork() and Vfork() System Calls (Cont.)
- Flags
  - CLONE_VM: share the memory descriptor and all Page Tables
  - CLONE_FS:
    - Share the table that identifies the root and current working directories
    - Share the bitmask (called umask) used to mask the initial file permissions of a new file
Clone(), Fork() and Vfork() System Calls (Cont.)
- Flags (continued)
  - CLONE_FILES: share the table that identifies the open files
  - CLONE_PARENT: set the parent of the child to the parent of the calling process
    - The child and the calling process are siblings and have the same parent
  - CLONE_PID: share the PID (used only by process 0, on multiprocessor systems)
  - CLONE_PTRACE: if the parent is being ptraced, the child is traced as well
Clone(), Fork() and Vfork() System Calls (Cont.)
- Flags (continued)
  - CLONE_SIGHAND: share the table that identifies the signal handlers
  - CLONE_THREAD: insert the child into the same thread group as the parent
  - CLONE_SIGNAL: CLONE_SIGHAND + CLONE_THREAD
    - Used to send a signal to all threads of a multithreaded application
  - CLONE_VFORK: used by the vfork() system call
Clone(), Fork() and Vfork() System Calls (Cont.)
- clone() is a wrapper function defined in the C library that in turn uses the clone() system call
  - The clone() system call has only the flags and child_stack parameters
  - Thus, when the system call returns to the clone() function
    - The function determines whether it is in the parent or the child
    - If it is in the child, it executes the fn() function
Clone(), Fork() and Vfork() System Calls (Cont.)
- The fork() system call is implemented by a clone() system call in which
  - flags specifies the SIGCHLD signal and all other flags are clear
  - child_stack is 0
Clone(), Fork() and Vfork() System Calls (Cont.)
- vfork() is implemented by a clone() system call in which
  - flags = the SIGCHLD signal + CLONE_VM + CLONE_VFORK
  - child_stack is 0
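The flag words described for fork() and vfork() can be assembled and taken apart as follows. This is an illustration only: the `EX_` constants are local copies of the values Linux's `<sched.h>` gives to `CSIGNAL`, `CLONE_VM` and `CLONE_VFORK`, redefined here so the sketch does not depend on feature-test macros, and the function names are hypothetical.

```c
#include <signal.h>

/* Local copies of the Linux <sched.h> values (assumed, for illustration). */
#define EX_CSIGNAL     0x000000ffUL  /* low byte: signal sent on child exit */
#define EX_CLONE_VM    0x00000100UL  /* share memory descriptor, page tables */
#define EX_CLONE_VFORK 0x00004000UL  /* parent sleeps until child releases mm */

/* fork(): only the exit signal is set, every clone flag is clear. */
unsigned long fork_clone_flags(void)
{
    return (unsigned long)SIGCHLD;
}

/* vfork(): exit signal plus shared address space plus vfork semantics. */
unsigned long vfork_clone_flags(void)
{
    return (unsigned long)SIGCHLD | EX_CLONE_VM | EX_CLONE_VFORK;
}

/* Extract the signal number stored in the low byte of a flags word. */
int exit_signal_of(unsigned long flags)
{
    return (int)(flags & EX_CSIGNAL);
}
```

In both cases the low byte carries SIGCHLD, which is why the parent of a forked or vforked child is notified when that child terminates.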
Kernel Thread
- Some critical tasks are scheduled in the background
  - Flushing disk caches, swapping out unused page frames…
- These system processes run only in Kernel Mode
  - They are called kernel threads
Kernel Thread (Cont.)
- Differences between kernel threads and regular processes
  - A kernel thread executes a single specific kernel C function
    - Regular processes execute kernel functions only through system calls
  - A kernel thread runs only in Kernel Mode
    - Regular processes run alternately in Kernel Mode and User Mode
  - A kernel thread uses only linear addresses greater than PAGE_OFFSET.
    - Regular processes use all four gigabytes of linear addresses
Kernel Thread (Cont.)
- Kernel threads
  - Process 0
  - Process 1
  - keventd: executes the tasks in the tq_context task queue, mentioned in Chapter 4
  - kapm: handles the events related to APM (Advanced Power Management)
  - kswapd: performs memory reclaiming
  - kflushd (or bdflush): flushes dirty buffers to disk to reclaim memory
  - kupdated: flushes old dirty buffers to disk to reduce the risk of filesystem inconsistencies
  - ksoftirqd: runs the tasklets, mentioned in Chapter 4
Process 0
- The ancestor of all processes, also called the swapper process
- Created from scratch during the initialization phase by the start_kernel() function
- Executes the cpu_idle() function after having created the init process
- Process 0 is selected by the scheduler only when there are no other TASK_RUNNING tasks
Process 1
- Also called the init process, since it executes the init() function that completes the initialization of the kernel
- Created by process 0 by calling the kernel_thread() function
- Process 1 creates and monitors the activity of all processes that implement the outer layers of the operating system
  - For example, it routinely issues wait() system calls to get rid of all zombie processes
Destroying Processes
- Destroying processes
  - A process invokes the exit() library function
    - Releases the resources allocated by the C library
    - Invokes the _exit() system call
  - The kernel may force a process to die
    - The process received a specific signal
    - An unrecoverable CPU exception occurred
- Both of the above approaches invoke the do_exit() function
Process Termination
- Process termination is handled by the do_exit() function
  - Sets the flags field of the process descriptor to PF_EXITING
  - Removes, if necessary, the process descriptor from an IPC semaphore queue
    - Releases the semaphore
  - Removes, if necessary, the process descriptor from a dynamic timer queue
    - Releases the timer
Process Termination (Cont.)
- do_exit() (continued)
  - Examines the process's data structures related to paging, filesystem, open file descriptors, and signal handling
    - Removes each of them if no other processes are sharing them
  - Decrements the resource counter of the modules used by the process
  - Sets the exit_code field of the process descriptor to the process termination code
    - Equal to either the _exit() system call parameter or an error code supplied by the kernel
Process Termination (Cont.)
- do_exit() (continued)
  - Invokes the exit_notify() function to update the parenthood relationships of both the parent and child processes
    - All child processes become the children of another process or of the init process
    - Sets the state field of the process descriptor to TASK_ZOMBIE
  - Invokes the schedule() function to select a new process to run
Process Removal
- In Unix, a process can query the kernel to
  - Obtain the PID of its parent process
  - Obtain the execution state of its children
    - When a child has terminated, its termination code tells the parent (via a wait()-like system call) whether it was carried out successfully
- Thus, the kernel is not allowed to discard the data included in a process descriptor right after the process terminates
  - The data must be saved until the parent process is notified
  - This is why the TASK_ZOMBIE state was introduced
Zombie Process - Summary
- The wait() system call allows a process to wait until one of its children terminates
  - It returns the process ID (PID) of the terminated child
- Zombie process
  - A process that has terminated, but whose parent has not yet executed the wait() system call
  - It still holds its task_struct data structure
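The life cycle summarized here can be observed from user space: between the child's `_exit()` and the parent's `waitpid()`, the child is a zombie whose termination code is still kept by the kernel. A minimal sketch, where `spawn_and_reap` is a hypothetical helper name:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then reap it.
 * Returns the child's exit code, or -1 on error. */
int spawn_and_reap(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(code);        /* child: terminates, stays a zombie until reaped */

    int status;
    if (waitpid(pid, &status, 0) != pid)   /* parent: collect the status */
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only after `waitpid()` returns can the kernel release what remains of the child's process descriptor.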
Zombie Process - Summary (Cont.)
- The related data structure is not released until the wait() call
- But what if a parent terminates without issuing a wait() call?
  - For example, the parent starts a child in the background and then exits