The user space program invokes the read system call through _libc_read.
_libc_read makes the transition into kernel space by loading the EAX
register with the appropriate system call number, in this case
__NR_read, which is defined in asm/unistd.h, and then generating a
software interrupt (int 0x80) that enters the kernel through an
interrupt gate.
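The step above can be tried from user space. The following is a minimal sketch, assuming a Linux system with glibc; raw_read is a hypothetical helper name, not part of any library. It issues read by system call number through syscall(2), which hides the actual trap instruction (int 0x80 on older i386 setups, other entry instructions on newer ones).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: perform read(2) by system call number, roughly what
 * a libc wrapper does internally. The number travels in a register (EAX on
 * i386) and the kernel uses it to pick the handler. */
long raw_read(int fd, void *buf, size_t count)
{
    assert(buf != NULL);
    return syscall(SYS_read, fd, buf, count);
}
```

On success raw_read returns the number of bytes read, just like read; on failure syscall returns -1 and sets errno.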
The system call routine in arch/i386/kernel/entry.S uses the syscall
table to look up the appropriate sys_* function based on the contents
of the EAX register. In this case the sys_read function is invoked.
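The table lookup can be modeled in plain C. This is only a sketch under stated assumptions: every demo_* name is hypothetical, handlers take three long arguments, and empty slots are NULL (the real table fills unused slots with a "not implemented" handler instead). The numbers 3 and 4 match __NR_read and __NR_write on i386.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Every toy handler takes three long arguments, standing in for the
 * registers that carry system call parameters. */
typedef long (*demo_syscall_fn)(long, long, long);

enum { DEMO_NR_read = 3, DEMO_NR_write = 4, DEMO_NR_syscalls = 5 };

long demo_sys_read(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;               /* pretend every byte was read */
}

long demo_sys_write(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;               /* pretend every byte was written */
}

/* Indexed by system call number; unlisted slots are NULL. */
demo_syscall_fn demo_call_table[DEMO_NR_syscalls] = {
    [DEMO_NR_read]  = demo_sys_read,
    [DEMO_NR_write] = demo_sys_write,
};

/* What the entry code does with EAX: bounds check, then indirect call. */
long demo_dispatch(long nr, long a, long b, long c)
{
    if (nr < 0 || nr >= DEMO_NR_syscalls || demo_call_table[nr] == NULL)
        return -ENOSYS;
    return demo_call_table[nr](a, b, c);
}
```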
sys_read, located in fs/read_write.c, looks up the file object using
the fd parameter. Using the file object's function pointer for read,
sys_read invokes the FS-dependent read call.
The function pointer was initialized when the file was opened and is
set differently depending on the type of FS and file, i.e. regular
file, block file, or char file. In the case of ext2 and most other
FSs, and even block files, the read call is set to the
generic_file_read function located in mm/filemap.c.
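The indirection through the file object can be sketched as follows. This is a user-space model, not kernel code: the vfs_* names, the fixed-size descriptor table, and the in-memory "file contents" are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct vfs_file;

/* Cut-down analogue of the per-file-type operations table. */
struct vfs_file_operations {
    long (*read)(struct vfs_file *f, char *buf, size_t count);
};

struct vfs_file {
    const struct vfs_file_operations *f_op; /* chosen at open time */
    const char *data;                       /* stand-in for file contents */
    size_t size;
    size_t pos;
};

/* Stand-in for a generic read routine: copy from the backing data. */
long vfs_generic_read(struct vfs_file *f, char *buf, size_t count)
{
    size_t left = f->size - f->pos;

    if (count > left)
        count = left;
    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (long)count;
}

const struct vfs_file_operations vfs_generic_fops = { .read = vfs_generic_read };

enum { VFS_MAX_FD = 4 };
struct vfs_file *vfs_fd_table[VFS_MAX_FD];

/* Analogue of sys_read: fd -> file object -> call through f_op->read. */
long vfs_sys_read(int fd, char *buf, size_t count)
{
    struct vfs_file *f;

    if (fd < 0 || fd >= VFS_MAX_FD || (f = vfs_fd_table[fd]) == NULL)
        return -EBADF;
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -EINVAL;
    return f->f_op->read(f, buf, count);
}
```

A different file type would simply install a different operations table at open time; vfs_sys_read itself never changes.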
generic_file_read checks whether the O_DIRECT flag has been selected.
If it has, the generic_file_direct_IO function is called; otherwise
the do_generic_file_read function is executed. In this example we are
following the path for a regular file without the O_DIRECT flag set.
do_generic_file_read, located in mm/filemap.c, is a complex function
that tries to optimize for sequential reads. It looks for the data in
the page cache by calling __find_page_nolock.
__find_page_nolock, located in mm/filemap.c, fails because in this
example our data lives on disk and is not in any cache.
__find_page_nolock returns NULL.
do_generic_file_read
calls page_cache_alloc.
page_cache_alloc, located in include/linux/pagemap.h, is just a
wrapper for alloc_page, which returns a newly allocated page frame.
page_cache_alloc
returns with the allocated page frame.
do_generic_file_read adds the page to the page cache by calling
__add_to_page_cache.
__add_to_page_cache, located in mm/filemap.c, inserts the new page
into the page cache.
__add_to_page_cache
returns.
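The lookup, miss, allocate, and insert sequence can be modeled with a small hash table. This is a sketch only: the pc_* names, the hash function, and the use of calloc in place of a page frame allocator are assumptions for illustration, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { PC_HASH_SIZE = 16 };

struct pc_page {
    const void *mapping;     /* which file (address space) owns the page */
    unsigned long index;     /* page-sized offset within that file */
    struct pc_page *next;    /* hash chain */
};

struct pc_page *pc_hash_table[PC_HASH_SIZE];

unsigned pc_hash(const void *mapping, unsigned long index)
{
    return (unsigned)((((uintptr_t)mapping >> 4) + index) % PC_HASH_SIZE);
}

/* Analogue of the cache lookup: NULL means a miss. */
struct pc_page *pc_find_page(const void *mapping, unsigned long index)
{
    struct pc_page *p = pc_hash_table[pc_hash(mapping, index)];

    while (p != NULL && !(p->mapping == mapping && p->index == index))
        p = p->next;
    return p;
}

/* Analogue of allocating a page and adding it to the cache. */
struct pc_page *pc_add_page(const void *mapping, unsigned long index)
{
    struct pc_page *p = calloc(1, sizeof *p);
    unsigned h = pc_hash(mapping, index);

    if (p == NULL)
        return NULL;
    p->mapping = mapping;
    p->index = index;
    p->next = pc_hash_table[h];
    pc_hash_table[h] = p;
    return p;
}

/* The read path: look up first, allocate and insert only on a miss. */
struct pc_page *pc_get_page(const void *mapping, unsigned long index, int *was_miss)
{
    struct pc_page *p = pc_find_page(mapping, index);

    *was_miss = (p == NULL);
    return p != NULL ? p : pc_add_page(mapping, index);
}
```

A second request for the same (mapping, index) pair finds the page already cached and skips the allocation.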
The readpage function pointer of the file inode's i_mapping address
space is used to fill the new page. The function pointer is FS
dependent; in the case of ext2 it points to the ext2_readpage function.
ext2_readpage, located in fs/ext2/inode.c, is just a wrapper for
block_read_full_page, located in fs/buffer.c.
block_read_full_page locks the page and creates empty buffers for the
page. The function then calls get_block for each buffer to map it to
its block on disk. In this case none of the buffers is up to date,
because the data is located on disk. Next the buffers are locked and
set_buffer_async_io is called. This routine just sets the bh->b_end_io
function pointer to the end_buffer_io_async function. Next submit_bh
is invoked.
submit_bh, located in drivers/block/ll_rw_blk.c, just creates a new
bio, initializes it from the buffer_head, sets the bio's bi_end_io
function pointer to end_bio_bh_io_sync, and then calls submit_bio.
submit_bio, located in drivers/block/ll_rw_blk.c, just does some
validity checks, updates some kernel statistics, and then invokes
generic_make_request.
generic_make_request, located in drivers/block/ll_rw_blk.c, gets the
device's request_queue and calls the device's make_request_fn function
pointer. This function pointer can be defined by the device driver, or
the device can choose to use the generic function __make_request. LVM
and MD are examples of drivers that define their own make_request
function. In this example we assume the device that is being read from
is using the default __make_request.
The __make_request function, located in drivers/block/ll_rw_blk.c,
must arrange to transfer the given block. It must grab the queue's
request lock before manipulating the request queue (NOTE: in 2.4 this
was a global lock for all request queues; 2.5 has a lock for each
request queue). The __make_request function allows clustered requests
by delaying the actual I/O so that requests operating on adjacent
blocks can be joined together. This is done by plugging the queue,
which the function blk_plug_device accomplishes.
blk_plug_device, located in drivers/block/ll_rw_blk.c, schedules the
plug_tq task queue descriptor in the tq_disk task queue, which causes
the device's request_fn routine to be activated later.
After scheduling the task, blk_plug_device returns.
On return from blk_plug_device, the __make_request function allocates
a new request for the request queue and adds the bio to the request.
Then __make_request unlocks the queue and returns. Note: on subsequent
calls to __make_request the kernel applies an elevator algorithm to
the requests. This algorithm tries to keep the disk head moving in the
same direction as long as possible; this approach tends to minimize
seek times while ensuring that all requests get satisfied eventually.
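The effect of plugging and merging can be illustrated with a toy queue. This is a simplified sketch, not the kernel's elevator: the rq_* names are hypothetical, requests are plain sector ranges, and the "elevator" is reduced to keeping the queue sorted by start sector.

```c
#include <assert.h>
#include <stddef.h>

enum { RQ_MAX = 8 };

struct rq_request {
    unsigned long sector;       /* first sector of the range */
    unsigned long nr_sectors;   /* length of the range */
};

struct rq_queue {
    struct rq_request req[RQ_MAX]; /* kept sorted by start sector */
    size_t nr;
    int plugged;
};

/* Returns 1 if the range was merged into an existing request, 0 if a new
 * request was queued, and -1 if the queue is full. */
int rq_make_request(struct rq_queue *q, unsigned long sector,
                    unsigned long nr_sectors)
{
    size_t i, pos;

    if (q->nr == 0)
        q->plugged = 1;  /* empty queue: plug so later requests can merge */

    for (i = 0; i < q->nr; i++) {
        struct rq_request *r = &q->req[i];

        if (r->sector + r->nr_sectors == sector) {  /* back merge */
            r->nr_sectors += nr_sectors;
            return 1;
        }
        if (sector + nr_sectors == r->sector) {     /* front merge */
            r->sector = sector;
            r->nr_sectors += nr_sectors;
            return 1;
        }
    }
    if (q->nr == RQ_MAX)
        return -1;

    /* No adjacent request: insert in ascending sector order. */
    pos = q->nr;
    while (pos > 0 && q->req[pos - 1].sector > sector) {
        q->req[pos] = q->req[pos - 1];
        pos--;
    }
    q->req[pos].sector = sector;
    q->req[pos].nr_sectors = nr_sectors;
    q->nr++;
    return 0;
}
```

Four submissions covering sectors 100-107, 108-115, 40-47, and 92-99 end up as only two queued requests, which is the point of delaying the I/O while the queue is plugged.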
The __make_request function returns 0 to generic_make_request, which
causes generic_make_request to return.
On return from generic_make_request, submit_bio returns 1.
The submit_bh function returns the result from submit_bio.
On return from submit_bh, block_read_full_page repeats the submit_bh
call for each remaining buffer. Once all buffers have been submitted,
block_read_full_page returns 0.
ext2_readpage, the function behind the readpage pointer, returns the
result from block_read_full_page.
On return from readpage, do_generic_file_read checks whether the page
is up to date. If it is not, as in this case, it issues a readahead
for this page. On return from the readahead, wait_on_page is called.
wait_on_page, located in include/linux/pagemap.h, checks whether the
page is locked and, if so, invokes __wait_on_page.
__wait_on_page declares a wait queue entry, adds it to the page's wait
queue, sets the task state to TASK_UNINTERRUPTIBLE, and then invokes
sync_page.
sync_page, located in mm/filemap.c, invokes the sync_page function
pointer associated with the page. That pointer was initialized to
block_sync_page when the file was opened.
block_sync_page, located in fs/buffer.c, requests that the tq_disk
task queue be run by calling run_task_queue.
run_task_queue sets up the tq_disk task queue to be run on the next
invocation of schedule.
Return from
run_task_queue.
block_sync_page
function returns 0.
sync_page function
returns 0.
__wait_on_page goes to sleep on return from sync_page by calling
schedule. sync_page and schedule are called repeatedly until the page
is unlocked. This eventually causes the tq_disk task queue to be run.
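The shape of this wait loop can be shown with a single-threaded model. This is a rough sketch under strong assumptions: there is no real sleeping or scheduling, the wp_* names are hypothetical, and "running the disk queue" completes the I/O immediately and unlocks the page.

```c
#include <assert.h>
#include <stddef.h>

enum { WP_MAX_KICKS = 1000 };

struct wp_page {
    int locked;     /* set while I/O on the page is in flight */
    int io_queued;  /* a request for the page sits on the plugged queue */
};

/* Stand-in for sync_page: run the disk queue so that queued I/O is
 * started; in this model the I/O also completes and unlocks the page. */
void wp_sync_page(struct wp_page *page)
{
    if (page->io_queued) {
        page->io_queued = 0;
        page->locked = 0;
    }
}

/* Analogue of the wait loop: kick the queue, then "sleep", until the page
 * is unlocked. Returns the number of kicks, or -1 if nothing will ever
 * unlock the page (a guard that only this model needs). */
int wp_wait_on_page(struct wp_page *page)
{
    int kicks = 0;

    assert(page != NULL);
    while (page->locked) {
        if (kicks == WP_MAX_KICKS)
            return -1;
        wp_sync_page(page);
        kicks++;
        /* a real kernel would call schedule() here and sleep */
    }
    return kicks;
}
```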
The tq_disk task queue uses a function pointer to call
generic_unplug_device. The tq_disk entry is initialized to call
generic_unplug_device in blk_init_queue, which is called during
initialization of the block driver; this is not shown in this thread.
generic_unplug_device
checks to make sure that the queue is not empty and calls the device
driver specific function pointer request_fn. This starts the I/O.
A typical driver request function does the following: it checks the
validity of the request, then spawns a data transfer and returns
immediately without ending the request. This frees up the CPU and
allows further requests to be collected while the device is dealing
with the current one. Once the device has completed the transfer it
issues an interrupt, and the bottom half of the interrupt handles the
I/O completion by calling end_that_request_first.
end_that_request_first, located in drivers/block/ll_rw_blk.c, ends the
I/O on the first buffer attached by calling bio_endio.
bio_endio, located in fs/bio.c, marks the bio object up to date and
calls the bio's function pointer bi_end_io. This was originally
initialized in the function submit_bh to end_bio_bh_io_sync.
end_bio_bh_io_sync, located in drivers/block/ll_rw_blk.c, calls the
buffer head's function pointer b_end_io. This function pointer was
initialized in block_read_full_page to end_buffer_io_async.
end_buffer_io_async, located in fs/buffer.c, marks the buffer and the
page up to date and unlocks the page.
end_buffer_io_async returns.
end_bio_bh_io_sync releases the bio struct and returns 0.
bio_endio returns 0.
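The chain of completion callbacks walked through above can be sketched in user-space C. The cb_* types and functions are hypothetical stand-ins with far fewer fields than the kernel structures; the sketch only shows where each function pointer is installed and in which order the handlers fire.

```c
#include <assert.h>
#include <stddef.h>

struct cb_bh {
    int uptodate;
    int locked;
    void (*b_end_io)(struct cb_bh *bh, int uptodate);
};

struct cb_bio {
    struct cb_bh *bi_private;               /* buffer carried by this bio */
    int (*bi_end_io)(struct cb_bio *bio);
    int uptodate;
};

/* Analogue of end_buffer_io_async: record the result, unlock. */
void cb_end_buffer_io_async(struct cb_bh *bh, int uptodate)
{
    bh->uptodate = uptodate;
    bh->locked = 0;
}

/* Analogue of end_bio_bh_io_sync: forward completion to the buffer. */
int cb_end_bio_bh_io_sync(struct cb_bio *bio)
{
    struct cb_bh *bh = bio->bi_private;

    bh->b_end_io(bh, bio->uptodate);
    return 0;
}

/* Analogue of the page-read step: lock the buffer, pick its handler. */
void cb_prepare_buffer(struct cb_bh *bh)
{
    bh->locked = 1;
    bh->uptodate = 0;
    bh->b_end_io = cb_end_buffer_io_async;
}

/* Analogue of submit_bh: wrap the buffer in a bio, pick the bio handler. */
void cb_submit_bh(struct cb_bio *bio, struct cb_bh *bh)
{
    bio->bi_private = bh;
    bio->bi_end_io = cb_end_bio_bh_io_sync;
    bio->uptodate = 0;
}

/* Analogue of bio_endio: called when the device reports completion. */
int cb_bio_endio(struct cb_bio *bio, int uptodate)
{
    bio->uptodate = uptodate;
    return bio->bi_end_io(bio);
}
```

Each layer only knows the callback of the layer directly above it, so the driver can complete a bio without knowing that a buffer head or a page is waiting behind it.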
end_that_request_first sets up the next buffer_head to be transferred
(if any) and returns 1; otherwise the request is finished and it
returns 0.
The request function calls end_that_request_last when it is done with
the request; if there are more buffers, it grabs the next one and
spawns another data transfer. end_that_request_last just returns the
request to the request queue free list. Then the request function
returns.
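The return-value convention described above (1 while buffers remain, 0 when the request is finished) leads to a simple driver loop. The sketch below is a synchronous model with hypothetical drv_* names; a real driver spreads the iterations across interrupts rather than looping.

```c
#include <assert.h>
#include <stddef.h>

struct drv_request {
    int buffers_left;   /* buffers still to be transferred */
    int finished;       /* set once the request has been retired */
};

/* Analogue of end_that_request_first: complete one buffer and report
 * whether more remain (1) or the request is done (0). */
int drv_end_request_first(struct drv_request *r)
{
    if (r->buffers_left > 0)
        r->buffers_left--;
    return r->buffers_left > 0;
}

/* Analogue of end_that_request_last: retire the request. */
void drv_end_request_last(struct drv_request *r)
{
    r->finished = 1;
}

/* One transfer per buffer, then retire. Returns the number of transfers. */
int drv_request_fn(struct drv_request *r)
{
    int transfers = 0;

    assert(r != NULL && r->buffers_left >= 1);
    do {
        transfers++;    /* stands for one data transfer plus its interrupt */
    } while (drv_end_request_first(r));
    drv_end_request_last(r);
    return transfers;
}
```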
generic_unplug_device returns, thus ending the task queue run.
Eventually schedule is called again and control is returned to the
__wait_on_page function.
__wait_on_page checks whether the page is unlocked. In this case it
is; the unlock happened in end_buffer_io_async. __wait_on_page sets
the state of the thread to TASK_RUNNING, removes itself from the wait
queue, and returns.
wait_on_page returns.
do_generic_file_read copies the page to user space by calling the
actor function. The actor routine has to update the user buffer
pointer and the remaining count. do_generic_file_read then updates the
access time of the inode and returns.
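The bookkeeping an actor has to do can be sketched as follows. This is modeled only loosely on the kernel's read descriptor; the act_* names are hypothetical, and memcpy stands in for the copy to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down analogue of a read descriptor. */
struct act_read_desc {
    size_t written;   /* bytes copied so far */
    size_t count;     /* bytes still wanted by the caller */
    char *buf;        /* next destination byte in the "user" buffer */
};

/* Actor: copy up to `size` bytes starting at `offset` within a cached
 * page, then advance the buffer pointer and the counters. Returns the
 * number of bytes actually copied. */
size_t act_read_actor(struct act_read_desc *desc, const char *page,
                      size_t offset, size_t size)
{
    if (size > desc->count)
        size = desc->count;
    memcpy(desc->buf, page + offset, size);
    desc->buf += size;
    desc->count -= size;
    desc->written += size;
    return size;
}
```

Because the descriptor carries the running state, the caller can invoke the actor once per page and stop as soon as count reaches zero; written then becomes the return value of the read.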
generic_file_read updates the return value and then returns it.
sys_read notifies the file's parent directory of the access, updates
the file object's statistics, and returns.
The system_call function handles signals, possibly schedules, does a
RESTORE_ALL, and executes iret. In this case iret returns program
control from the software-generated interrupt gate to the user space
program.
_libc_read checks for
errors and returns.
User space returns
from the read call with the amount of data it has read.