17 October 2015

I want to understand the Linux kernel IO path, by setting up gdb debugging and tracing it step by step.

Build Kernel

First I started up a CentOS 7 VM, which is my debug target, and built the kernel. To build a custom CentOS 7 kernel, I followed the official guide. Remember to change the kernel identifier in step 4.

Kernel source will be at ~/rpmbuild/BUILD/kernel-*/linux-*/. The kernel config is copied from the existing OS at /boot/config-$(uname -r). Make sure the config options below are enabled.
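
At a minimum, the kgdb and debuginfo related options should be on, something like the following (this list is my reconstruction, adjust to your needs):

CONFIG_DEBUG_INFO=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_FRAME_POINTER=y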

The build took me 3 hours and 6GB of disk space, not counting the time to install the rpms. After it finishes, reboot and grub into the second <your-kernel-identifier>.*.debug kernel. That is the kernel you just built. If you want to debug the kernel from boot, press e in grub and append the kernel options kgdbwait kgdboc=ttyS0,115200. The kernel will then suspend before boot and wait for your gdb connection. Otherwise, the kernel boots normally. Then input the lines below in a shell to trigger a debug session. Note that I'm using serial port ttyS0 to connect the debug target and the debug host.

echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc
echo g > /proc/sysrq-trigger

The debug target should now be suspended, waiting for your gdb connection. To connect, input the lines below on your debug host VM. My debug host is a Fedora VM with a graphical desktop. The debug host VM should already have the kernel source and vmlinux from where you built the kernel.

$ cd <kernel-source-dir>
$ cat gdbinit
set serial baud 115200
target remote /dev/ttyS0
$ gdb -x gdbinit vmlinux

The debug session should begin now. For a complete guide, see my blog and find “Debug via Serial Port”. I tried to set up Eclipse CDT + gdb remote debugging on my debug host VM, but it never worked, so I stick to gdb on the shell command line.

A Program to Generate IO Writes

To generate IO writes, I wrote a program. There is only one C file, named loop_write.c. Every few seconds it writes a line into a file, reports the file descriptor fd, and flushes. It flushes so that the writes travel deeper into the block layer rather than getting stuck in the kernel page cache. Below is my program.

# my loop_write.c to generate writes periodically to fs
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

int main() {
    FILE* fp = NULL;
    int fd = 0;
    char templ[] = "Hello world I'm printing something here. Date = %u\n";
    unsigned int now = 0;
    int count = 0;

    fp = fopen("test.txt","at+");
    if (NULL == fp) {
        printf("Open file failed.\n");
        exit(1);
    }

    fd = fileno(fp);
    if (fd < 0) {
        printf("Cannot get fd.\n");
        exit(1);
    }
    printf("The file fd = %d\n", fd);
    printf("\n");
    fprintf(fp, "\nThe file fd = %d\n", fd);

    while (1) {
        now = (unsigned)time(NULL);
        fprintf(fp, templ, now);
        printf(templ, now);

        count++;
        fflush(fp);    // flush only the FILE* level
        printf("Flush the FILE* level.\n");
        if (count % 5 == 0) {   
            fsync(fd);    // tell OS to flush data to disk
            printf("Flush the OS level.\n");
        }
        printf("\n");
        sleep(5);
    }

    fclose(fp);
    return 0;
}
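
The program above exercises the buffered write path. The trace later also covers a direct IO path (xfs_file_dio_aio_write). To poke that path instead, a variant can open the file with O_DIRECT; below is a minimal sketch (not part of my original program). Note that O_DIRECT requires the buffer address, write size and file offset to be aligned to the logical block size, 512 here.

# a hypothetical loop_write variant that bypasses the page cache
#define _GNU_SOURCE    /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    void *buf = NULL;
    int fd = open("test_direct.txt", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    /* O_DIRECT: buffer, size and offset must be block aligned */
    if (posix_memalign(&buf, 512, 512)) { printf("Cannot alloc.\n"); exit(1); }
    memset(buf, 'x', 512);

    if (write(fd, buf, 512) != 512)    /* goes straight to the block layer */
        perror("write");

    free(buf);
    close(fd);
    return 0;
}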

Trace into the Rabbit Hole

Next I will start tracing into the kernel IO path. My kernel version is 3.10.0-229.1.2.el7.centos.local_20151002.x86_64.debug. The most helpful material is the Linux kernel IO stack diagram from Wikipedia.

Linux Kernel IO Stack

Preparation steps

# on debug host VM, copy the kernel executable
$ cd <kernel-source-dir>
$ scp <debug-target>:~/rpmbuild/BUILD/kernel-3.10.0-229.1.2.el7/linux-3.10.0-229.1.2.el7.centos.local_20151007.x86_64/vmlinux ./

# on debug target VM, prepare the xfs module for gdb (since my loop_write.c writes to xfs)
$ lsmod | grep xfs    # you must make sure xfs module is loaded
xfs                   915019  2
libcrc32c              12644  1 xfs
$ ls /sys/module/xfs/    # to see runtime module files
$ cd /sys/module/xfs/sections/
$ cat .text .data .bss    # we will use the addresses later
0xffffffffa0144000
0xffffffffa0213000
0xffffffffa022b618
$ ls /lib/modules/3.10.0-229.1.2.el7.centos.local_20151002.x86_64.debug/kernel/fs/xfs/xfs.ko    # this is the xfs module executable (it doesn't have debuginfo)

# on debug host VM, copy xfs.o
$ cd <kernel-source-dir>
$ scp <debug-target>:~/rpmbuild/BUILD/kernel-3.10.0-229.1.2.el7/linux-3.10.0-229.1.2.el7.centos.local_20151007.x86_64/fs/xfs/xfs.o ./fs/xfs/
$ #scp <debug-target>:/lib/modules/3.10.0-229.1.2.el7.centos.local_20151002.x86_64.debug/kernel/fs/xfs/xfs.ko ./fs/xfs/    # I once tried the .ko file, but it doesn't contain debuginfo

# on debug target VM, start my loop_write.c
cd ~/workspace/loop_write
gcc loop_write.c
nohup ./a.out >a.out.log 2>&1 &

Now dive into the FS layer. The filesystem in use is XFS.

# on debug target VM, start debugging (make sure you booted the local_20151002.*.debug kernel, and the debug host has the matching version of the kernel source)
$ echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc
$ echo g > /proc/sysrq-trigger

# on debug host VM, start debug session
# tips: while debugging, avoid switching windows to the debug target VM, otherwise gdb may be interrupted by a received SIGTRAP (or other) signal
# tips: always delete unnecessary breakpoints so that gdb will not catch irrelevant function calls
# tips: plug in the laptop's power, otherwise my gdb session was frequently interrupted by trap signals
# tips: if you receive a SIGTRAP, 0x00000... in irq_stack_union () stuff, keep stepping with `n`, it will get you back later

$ cd <kernel-source-dir>
$ gdb -x gdbinit vmlinux
# input `break sys_write if (fd==3)`, or `break sys_write if (count==59)`, where 59 is the length of the loop_write.c message
# input `c`. if you break on the right call, the parameter `count` should be 59
fs/read_write.c::SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)    # expands to sys_write(...). the n in SYSCALL_DEFINEn is the argument count.
    ret = vfs_write(f.file, buf, count, &pos);
    # input `break vfs_write`
    # input `n`. gdb should be breaking at `vfs_write`
    # input `frame`, you should see `buf@entry` holds what loop_write.c prints to its file
         ret = file->f_op->write(file, buf, count, pos);
         # input `delete breakpoint 1-2`    # to avoid catching unrelated func calls
         # input `break fs/read_write.c:466`    # just break on the line above. if I don't do this, sometimes gdb goes astray
         # input 'c'
            fs/read_write.c::do_sync_write(..)
                # sync_write is actually implemented by aio_write
                ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);    # I'm using xfs. filp->f_op->aio_write == fs/xfs/xfs_file.c::xfs_file_operations.xfs_file_aio_write
                
                # since xfs is a loadable module rather than built into the kernel, gdb cannot see or step into it by default. we need to load the module symbols
                # to load module symbols, input `add-symbol-file ./fs/xfs/xfs.o 0xffffffffa0144000 -s .data 0xffffffffa0213000 -s .bss 0xffffffffa022b618`; input 'p xfs_file_aio_write' to verify
                # input 'break xfs_file_aio_write', then `s`. with the added module symbols, I can now dive into the xfs code
                
                    # when we have O_DIRECT, direct io will be launched
                    ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
                        ret = generic_file_direct_write(iocb, iovp, &nr_segs, pos, &iocb->ki_pos, count, ocount);
                            written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);    # points to fs/xfs/xfs_aops.c::xfs_vm_direct_IO
                                ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, xfs_get_blocks_direct, xfs_end_io_direct_write, NULL, 0);
                                    return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, get_block, end_io, submit_io, flags);
                                        retval = do_direct_IO(dio, &sdio, &map_bh);
                                            ret = submit_page_section(dio, sdio, page, offset_in_page, this_chunk_bytes, sdio->next_block_for_io, map_bh);
                                                ret = dio_send_cur_page(dio, sdio, map_bh);
                                                    dio_bio_submit(dio, sdio);    # in fs/direct-io.c
                                                        submit_bio(dio->rw, bio);    # block/blk-core.c::submit_bio is the entrance from the kernel FS layer to the kernel block layer. I will trace it later.

                    # otherwise, buffered io (staged in the kernel page cache) will be launched
                    ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos, ocount);
                        ret = generic_file_buffered_write(iocb, iovp, nr_segs, pos, &iocb->ki_pos, count, 0);
                            status = generic_perform_write(file, &i, pos);
                                struct address_space *mapping = file->f_mapping;    # `address_space` links current writing file to kernel page cache
                                const struct address_space_operations *a_ops = mapping->a_ops;    # where is the actual `address_space_operations` object assigned? it is defined at fs/xfs/xfs_aops.c::xfs_address_space_operations
                                
                                # a buffered write only needs to write data to kernel page cache
                                status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata);    # points to fs/xfs/xfs_aops.c::xfs_address_space_operations.xfs_vm_write_begin
                                copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
                                status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata);    # points to fs/xfs/xfs_aops.c::xfs_address_space_operations.xfs_vm_write_end
                                
                                # if there are too many dirty pages, flush to disk
                                balance_dirty_pages_ratelimited(mapping);    # defined at mm/page-writeback.c
                                    ...    # ratelimit checks: do we really need to flush dirty pages yet?
                                    balance_dirty_pages(mapping, current->nr_dirtied);    # we are really going to flush dirty pages
                                        struct backing_dev_info *bdi = mapping->backing_dev_info;
                                        bdi_start_background_writeback(bdi);    # defined at fs/fs-writeback.c
                                            bdi_wakeup_thread(bdi);
                                                mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);    # push delayed_work `bdi->wb.dwork` to workqueue

    # so what is the `bdi->wb.dwork`?
    mm/backing-dev.c::bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
        INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);   # so, `bdi->wb.dwork` is fs/fs-writeback.c::bdi_writeback_workfn (a standalone sketch of this delayed-work pattern follows after this trace)

                                                # continued from above
                                                mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
                                                    
                                                    # async task executed by workqueue
                                                    fs/fs-writeback.c::bdi_writeback_workfn(struct work_struct *work)
                                                        pages_written = wb_do_writeback(wb);
                                                            wrote += wb_writeback(wb, work);
                                                                progress = writeback_sb_inodes(work->sb, wb, work);
                                                                    write_chunk = writeback_chunk_size(wb->bdi, work);
                                                                        ...    # just calc size, doesn't do actual io
                                                                    __writeback_single_inode(inode, &wbc);    # why the only lower level entrance I found is 'single' inode, no batch?
                                                                        ret = do_writepages(mapping, wbc);
                                                                            ret = mapping->a_ops->writepages(mapping, wbc);    # points to fs/xfs/xfs_aops.c::xfs_vm_writepages, defined at fs/xfs/xfs_aops.c::xfs_address_space_operations
                                                                                fs/xfs/xfs_aops.c::xfs_vm_writepages(struct address_space *mapping, struct writeback_control *wbc)
                                                                                    return generic_writepages(mapping, wbc);
                                                                                        ret = write_cache_pages(mapping, wbc, __writepage, mapping);
                                                                                            ret = (*writepage)(page, wbc, data);    # `writepage` points to mm/page-writeback.c::__writepage
                                                                                                int ret = mapping->a_ops->writepage(page, wbc);    # points to fs/xfs/xfs_aops.c::xfs_vm_writepage
                                                                                                    fs/xfs/xfs_aops.c::xfs_vm_writepage(struct page *page, struct writeback_control *wbc)
                                                                                                        bh = head = page_buffers(page);
                                                                                                        xfs_cluster_write(inode, page->index + 1, &imap, &ioend, wbc, end_index);    # looks like it doesn't actually do io
                                                                                                            done = xfs_convert_page(inode, pvec.pages[i], tindex++, imap, ioendp, wbc);    # allocate & map buffers
                                                                                                        xfs_submit_ioend(wbc, iohead, err);    # submit all of the bios
                                                                                                            bio = xfs_alloc_ioend_bio(bh);
                                                                                                            xfs_submit_ioend_bio(wbc, ioend, bio);
                                                                                                                submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);    # block/blk-core.c::submit_bio, the same end as the direct io path. here we enter the kernel block layer
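
A side note before moving on: the INIT_DELAYED_WORK + mod_delayed_work pair seen above around `bdi->wb.dwork` is the generic kernel delayed-work pattern. Below is a minimal self-contained module sketch of that pattern; it is my own illustration, not kernel code (it uses the shared system_wq instead of the dedicated bdi_wq).

# a toy module showing the delayed-work pattern used by bdi writeback
#include <linux/module.h>
#include <linux/workqueue.h>

static struct delayed_work my_dwork;

static void my_workfn(struct work_struct *work)
{
    pr_info("delayed work ran\n");    /* bdi_writeback_workfn plays this role */
}

static int __init my_init(void)
{
    INIT_DELAYED_WORK(&my_dwork, my_workfn);    /* bind the work function, like bdi_wb_init */
    /* delay 0 means "run as soon as a worker picks it up",
     * which is what bdi_wakeup_thread() passes for bdi->wb.dwork */
    mod_delayed_work(system_wq, &my_dwork, 0);
    return 0;
}

static void __exit my_exit(void)
{
    cancel_delayed_work_sync(&my_dwork);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");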

Continuing from the above. Dive into the block layer.

# block/blk-core.c::submit_bio(..) is the entrance from the kernel fs layer to the kernel block layer.
# in the above section we traced the kernel fs layer, i.e. the xfs filesystem (a kernel module). next we will trace into the kernel block layer.

# let's start from block/blk-core.c::submit_bio (continuing from the fs layer gdb above)
block/blk-core.c::submit_bio(int rw, struct bio *bio)    # `struct bio` is defined at include/linux/blk_types.h and include/linux/bio.h.
    generic_make_request(bio);
        struct request_queue *q = bdev_get_queue(bio->bi_bdev);    # get the request queue of the current block device. `request_queue` is defined at blkdev.h
            return bdev->bd_disk->queue;    # `bdev_get_queue` is defined at include/linux/blkdev.h
        q->make_request_fn(q, bio);    # `make_request_fn` defined at blkdev.h
            block/blk-core.c::blk_queue_bio(struct request_queue *q, struct bio *bio)    # I dug out who `make_request_fn` is with gdb.

    # so which place assigned blk_queue_bio to q->make_request_fn?
    block/blk-core.c::blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
        return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
            q = blk_init_allocated_queue(uninit_q, rfn, lock);
                blk_queue_make_request(q, blk_queue_bio);    # defined in block/blk-settings.c::blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
                    q->make_request_fn = mfn;

            # continued from above
            block/blk-core.c::blk_queue_bio(struct request_queue *q, struct bio *bio)  
                struct request *req;    # defined at include/linux/blkdev.h
                el_ret = elv_merge(q, &req, bio);    # `*req` is assigned. decides the merge type (not the actual merge); returns ELEVATOR_NO_MERGE, ELEVATOR_BACK_MERGE or ELEVATOR_FRONT_MERGE (defined at include/linux/elevator.h).
                
                # if `el_ret` is ELEVATOR_BACK_MERGE or ELEVATOR_FRONT_MERGE
                bio_attempt_back_merge(q, req, bio)    # or bio_attempt_front_merge(q, req, bio). this one merges the `bio` rather than `req`
                    blk_account_io_start(req, false);
                elv_bio_merged(q, req, bio);
                    struct elevator_queue *e = q->elevator;    # `elevator_queue`, `elevator_ops` is defined at include/linux/elevator.h
                    e->type->ops.elevator_bio_merged_fn(q, rq, bio);    # these `q->elevator->type->op.*` are defined at block/*-iosched.c

    # what are those `q->elevator->type->op.*` functions? they are the linux io scheduler.
    # by `cat /sys/block/sda/queue/scheduler`, you can see (or change) your current kernel io scheduler, one of noop, deadline, or cfq.
    # take cfq as an example
    block/cfq-iosched.c::iosched_cfq = {
        .ops = {
            .elevator_merged_fn =       cfq_merged_request,
            .elevator_bio_merged_fn =   cfq_bio_merged,
            ...
        },
    }

                    # continued from above
                    e->type->ops.elevator_bio_merged_fn(q, rq, bio);    
                attempt_back_merge(q, req)    # or attempt_front_merge(q, req)
                    struct request *next = elv_latter_request(q, rq);
                    return attempt_merge(q, rq, next);    # this one merges `req` rather than `bio`
                        elv_merge_requests(q, req, next);
                        blk_account_io_merge(next);    # get statistics of the io merge
                            part = req->part;    # `part` is of type `hd_struct`, which is defined at include/linux/genhd.h
                elv_merged_request(q, req, el_ret);
                    struct elevator_queue *e = q->elevator;
                    e->type->ops.elevator_merged_fn(q, rq, type);    # see above

                # else if `el_ret` is ELEVATOR_NO_MERGE
                init_request_from_bio(req, bio);
                if (plug) {    # `struct blk_plug` is defined at include/linux/blkdev.h
                    list_add_tail(&req->queuelist, &plug->list);    # kernel plug feature to pool and batch io requests. here we just put the new io request on the plug list instead of handling it immediately (a usage sketch of the plug API follows after this trace)
                    blk_account_io_start(req, true);    # accounting io statistics
                } else {
                    add_acct_request(q, req, where);    # do accounting for this io request, and merge it to io queue
                        blk_account_io_start(rq, true);    # accounting io statistics
                        __elv_add_request(q, rq, where);    # add the request to the block io request queue. overall there are many ways for the elevator to add an io request to the io queue
                            switch (where) {
                            case ELEVATOR_INSERT_REQUEUE, ELEVATOR_INSERT_FRONT:
                                list_add(&rq->queuelist, &q->queue_head);    # add `rq->queuelist` to `q->queue_head`
                            case ELEVATOR_INSERT_BACK:
                                list_add_tail(&rq->queuelist, &q->queue_head);
                                __blk_run_queue(q);    # we kick the queue here
                            case ELEVATOR_INSERT_SORT_MERGE:
                                elv_attempt_insert_merge(q, rq)
                            case ELEVATOR_INSERT_SORT:
                                q->elevator->type->ops.elevator_add_req_fn(q, rq);    # where and what is these elevator functions? see above iosched_cfq's
                            case ELEVATOR_INSERT_FLUSH:
                                blk_insert_flush(rq);
                            }
                    __blk_run_queue(q);
                        __blk_run_queue_uncond(q);
                            q->request_fn(q);    # handle the io requests in request queue `q` with the low level driver. this is the gate from the kernel block layer to the kernel block driver layer
                                drivers/scsi/scsi_lib.c::scsi_request_fn(struct request_queue *q)    # how did I dig out that `scsi_request_fn` is the actual `request_fn`? because of the `plug`, I cannot directly step from `blk_queue_bio` to this line. so I broke on `__blk_run_queue_uncond` and stepped into `q->request_fn(q)` to dig. after repeating this many times I'm sure `scsi_request_fn` is it.
                            
    # so where is q->request_fn assigned? it is assigned in the kernel block driver layer (the next layer below the block layer).
    # drivers under drivers/block/* (or some other places) invoke `blk_init_queue`. for example drivers/block/hd.c::hd_init.
    # different drivers assign different `request_fn`s, i.e. it is driver specific
    # my debug target VM, which runs on virtualbox, uses `scsi_request_fn`. it is registered as follows
    drivers/scsi/scsi_scan.c::scsi_alloc_sdev(struct scsi_target *starget, unsigned int lun, void *hostdata)
        sdev->request_queue = scsi_alloc_queue(sdev);
            q = __scsi_alloc_queue(sdev->host, scsi_request_fn);
                q = blk_init_queue(request_fn, NULL);
                    block/blk-core.c::blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
                        return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
                            q = blk_init_allocated_queue(uninit_q, rfn, lock);
                                q->request_fn       = rfn;

    # are there other places where `q->request_fn` is called? for example, is there an async thread that keeps handling requests in the io queue? (I didn't find the answer)
    # I searched the whole source directory; only `__blk_run_queue_uncond` calls it.
    # digging the call hierarchy in reverse order ...
    q->request_fn(q);
        block/blk_core.c::__blk_run_queue_uncond(struct request_queue *q)
            block/blk_core.c::__blk_run_queue(struct request_queue *q)
                block/blk_core.c::__blk_drain_queue(struct request_queue *q, bool drain_all)    # looks like one invocation of `q->request_fn(q)` is enough to drain the whole queue
                ... # a lot of parent functions
            block/blk_exec.c::blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk, struct request *rq, int at_head, rq_end_io_fn *done)
                blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, struct request *rq, int at_head)
                    block/scsi_ioctl.c::__blk_send_generic(struct request_queue *q, struct gendisk *bd_disk, int cmd, int data)
                        blk_send_start_stop(struct request_queue *q, struct gendisk *bd_disk, int data)
                            scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mode, unsigned int cmd, void __user *arg)    # scsi is much more complex than driver drivers/block/hd.c
                ... # a lot of parent functions
                                
                                drivers/scsi/scsi_lib.c::scsi_request_fn(struct request_queue *q)
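
One more side note: the `plug` branch in `blk_queue_bio` above comes from the block plugging API. A submitter batches bios roughly like below (a sketch of the usage pattern in this kernel era; the surrounding function and the ready-made `bios` array are my assumptions):

# how a submitter uses the plug that blk_queue_bio checks for
#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/blkdev.h>

static void submit_many(struct bio **bios, int n)
{
    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);    /* current task now has a plug; blk_queue_bio
                               * will park new requests on plug->list */
    for (i = 0; i < n; i++)
        submit_bio(WRITE, bios[i]);
    blk_finish_plug(&plug);   /* unplug: flush the whole batch into the queue */
}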

Continuing from the above. Dive into the block driver layer. My block driver is SCSI.

# in the above section we traced the kernel block layer. next we will trace into the kernel block driver layer. drivers/block/hd.c is a good example with only one file.
# but scsi is much more complex. my debug target VM uses scsi as the block driver, so I will dive into the scsi one. first, I need to prepare the kernel modules so that gdb can see their debuginfo

# on debug target VM prepare the scsi sd driver module for gdb
$ lsmod | grep sd
$ cd /sys/module/sd_mod/sections/
$ cat .text .data .bss
0xffffffffa0120000
0xffffffffa0129000
0xffffffffa0129998
$ ls /lib/modules/3.10.0-229.1.2.el7.centos.local_20151002.x86_64.debug/kernel/drivers/scsi/sd_mod.ko    # the sd module executable

# on debug target VM prepare the scsi mptspi driver module for gdb
$ lsmod | grep mptspi
$ cd /sys/module/mptspi/sections/
$ cat .text .data .bss
0xffffffffa00d3000
0xffffffffa00d7000
0xffffffffa00d7698
$ ls /lib/modules/3.10.0-229.1.2.el7.centos.local_20151002.x86_64.debug/kernel/drivers/message/fusion/mptspi.ko    # the mptspi driver is not under drivers/scsi/, weird

# on debug host VM, copy sd_mod.o and mptspi.o. if you don't have sd_mod.o, go to <debug-target>:<kernel-source-dir> and run `make modules` to build them
$ cd <kernel-source-dir>
$ scp <debug-target>:~/rpmbuild/BUILD/kernel-3.10.0-229.1.2.el7/linux-3.10.0-229.1.2.el7.centos.local_20151007.x86_64/drivers/scsi/sd_mod.o drivers/scsi/
$ scp <debug-target>:~/rpmbuild/BUILD/kernel-3.10.0-229.1.2.el7/linux-3.10.0-229.1.2.el7.centos.local_20151007.x86_64/drivers/message/fusion/mptspi.o drivers/message/fusion/

# my block driver layer uses scsi. but scsi itself is layered into drivers again.
# scsi consists of high level drivers: sg, sr, sd, st. their code is at drivers/scsi/* and each defines a `struct scsi_driver` variable
# and low level drivers: there are a lot, e.g. fibre channel, SAS, iSCSI. their code is at drivers/scsi/* and each defines a `struct scsi_host_template` variable. low level drivers are also called scsi host adapter drivers.

# I have to find which drivers I'm using. to find my disk major:minor numbers, `lsblk`.
# to find what high level scsi driver my disks are using
$ ll /sys/dev/block/*/device/driver
/sys/dev/block/8:0/device/driver -> ../../../../../../bus/scsi/drivers/sd
# in my case, my high level driver is the `sd` driver.
# when we `ls /dev/sd*`, the `sd` comes from here. in the old days we had IDE drives, which gave /dev/hd*; now we all use SCSI disks, which give /dev/sd*

# to find my low level scsi driver
$ udevadm info -a -n /dev/sda | grep -oP 'DRIVERS?=="\K[^"]+'
sd
mptspi
# the second one, `mptspi`, is my low level scsi driver.
# `lsscsi` is also a good tool
$ yum install -y lsscsi
$ lsscsi -Hlv    # to show scsi host info
$ lsscsi -t    # -t shows the transport I'm using - spi
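
# before the gdb trace: the `host->hostt->queuecommand(host, cmd)` call we will hit below dispatches into the low level driver; `hostt` is the driver's `struct scsi_host_template`.
# a heavily trimmed sketch of what a low level driver supplies (a toy of mine, not mptspi):

#include <linux/module.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static int toy_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
{
    /* a real driver hands `cmd` to hardware here (compare mptspi_qcmd)
     * and completes it later from its irq handler */
    cmd->result = DID_OK << 16;
    cmd->scsi_done(cmd);    /* toy behaviour: complete immediately */
    return 0;
}

static struct scsi_host_template toy_template = {
    .module       = THIS_MODULE,
    .name         = "toy_hba",
    .queuecommand = toy_qcmd,    /* this is what scsi_dispatch_cmd calls */
    .this_id      = -1,
};
/* a real driver passes &toy_template to scsi_host_alloc() at probe time */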

# let's start from `scsi_request_fn` (continuing from the fs layer and block layer gdb)
drivers/scsi/scsi_lib.c::scsi_request_fn(struct request_queue *q)
    for (;;) {
        req = blk_peek_request(q);
        cmd = req->special;
        rtn = scsi_dispatch_cmd(cmd);
            rtn = host->hostt->queuecommand(host, cmd);
            # we need to load kernel module symbols here
            # add-symbol-file ./drivers/scsi/sd_mod.o 0xffffffffa0120000 -s .data 0xffffffffa0129000 -s .bss 0xffffffffa0129998
            # add-symbol-file ./drivers/message/fusion/mptspi.o 0xffffffffa00d3000 -s .data 0xffffffffa00d7000 -s .bss 0xffffffffa00d7698
            # input `break mptspi_qcmd` so that gdb does not jump over it
            # input `s`
                drivers/message/fusion/mptspi.c::mptspi_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *SCpnt)    # my scsi low level driver here is mptspi
                    # input `delete break N` to delete earlier breakpoints and avoid unwanted catches
                    mptscsih_qcmd(SCpnt);
                        mpt_put_msg_frame(ioc->DoneCtx, ioc, mf);    # posts an MPT request frame to the request post FIFO of a specific MPT adapter
                            CHIPREG_WRITE32(&ioc->chip->RequestFifo, mf_dma_addr);    # there is a `#define CHIPREG_WRITE32(addr,val)   writel(val, addr)` and a `#include <asm/io.h>`
                                arch/x86/include/asm/io.h::writel(val, addr)    # links to include/asm-generic/io.h::'#define writel(b,addr) __raw_writel(__cpu_to_le32(b),addr)', I guess
                                    include/asm-generic/io.h::writel(b,addr)    # the '#define writel(b,addr) __raw_writel(__cpu_to_le32(b),addr)' links to below, I guess
                                        arch/x86/include/asm/io.h::__raw_writel(__cpu_to_le32(b),addr)    # there is a `#define __raw_writel __writel`
                                            arch/x86/include/asm/io.h::build_mmio_write(__writel, "l", unsigned int, "r", )    # the `build_mmio_write` macro defines `__writel`
                                                static inline void writel(unsigned val, volatile void __iomem *addr)     # note the `mf_dma_addr` above: the request frame is DMA-mapped, and this MMIO write posts its bus address to the chip.
                                                { asm volatile("mov" size " %0,%1": :"r" (val), \
                                                "m" (*(volatile unsigned int __force *)addr) ); }
    }
   
    # where does the high level driver sd.c come into play? by searching `to_driver` in drivers/scsi I found the following.
    # so I guess the high level drivers interact with the generic scsi code, and the low level drivers have no direct connection to them
    drivers/scsi/scsi.c::scsi_finish_command(struct scsi_cmnd *cmd)
        drv = scsi_cmd_to_driver(cmd);
    drivers/scsi/scsi_error.c::scsi_eh_action(struct scsi_cmnd *scmd, int rtn)
        struct scsi_driver *sdrv = scsi_cmd_to_driver(scmd);
    drivers/scsi/scsi_lib.c::scsi_prep_fn(struct request_queue *q, struct request *req)
        ret = scsi_cmd_to_driver(cmd)->init_command(cmd);
    drivers/scsi/scsi_lib.c::scsi_unprep_fn(struct request_queue *q, struct request *req)
        struct scsi_driver *drv = scsi_cmd_to_driver(cmd);


# since we post the scsi data by DMA, how do we handle the interrupt callback when it finishes? I found the irq handler mpt_interrupt(..)
# I'm not sure how mpt_interrupt(..) is registered as the irq handler (likely via request_irq somewhere in mptbase.c), but it should be the key entrance
drivers/message/fusion/mptbase.c::mpt_interrupt(int irq, void *bus_id)
    mpt_reply(ioc, pa);
        freeme = MptCallbacks[cb_idx](ioc, mf, mr);    # invoke the IO callback
        # the io callback can be drivers/message/fusion/mptscsih.c::mptscsih_io_done(..), mptscsih_taskmgmt_complete(..), or mptscsih_scandv_complete(..). I will take mptscsih_io_done as the example
            drivers/message/fusion/mptscsih.c::mptscsih_io_done(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *mr)
                switch(status) {
                case MPI_IOCSTATUS_SUCCESS:
                    sc->result = (DID_OK << 16) | scsi_status;
                }
                sc->scsi_done(sc);    # issue the command callback
                    drivers/scsi/scsi.c::scsi_done(struct scsi_cmnd *cmd)
                        blk_complete_request(cmd->request);
                            block/blk-softirq.c::blk_complete_request(struct request *req)
                                __blk_complete_request(struct request *req)
                                    list = this_cpu_ptr(&blk_cpu_done);    # the `static DEFINE_PER_CPU(struct list_head, blk_cpu_done);` defines `blk_cpu_done`
                                    list_add_tail(&req->ipi_list, list);    # the `req->ipi_list` is put into `blk_cpu_done` list

                                        # below is invoked asynchronously by BLOCK_SOFTIRQ. it is registered by `open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);` at block/blk-softirq.c
                                        # I'm not sure why we need a soft irq for block io completion. searching "raise_softirq(BLOCK_SOFTIRQ" found nothing; it is probably raised via `raise_softirq_irqoff(BLOCK_SOFTIRQ)` in `__blk_complete_request`, which is why the plain text search misses it
                                        block/blk-softirq.c::blk_done_softirq(struct softirq_action *h)
                                            cpu_list = this_cpu_ptr(&blk_cpu_done);
                                            list_replace_init(cpu_list, &local_list);    # `local_list` now holds the local per cpu list `blk_cpu_done`
                                            while (!list_empty(&local_list)) {
                                                struct request *rq;
                                                rq = list_entry(local_list.next, struct request, ipi_list);
                                                list_del_init(&rq->ipi_list);
                                                rq->q->softirq_done_fn(rq);
                                                    drivers/scsi/scsi_lib.c::scsi_softirq_done(struct request *rq)    # I dug out who `softirq_done_fn` is with gdb. it is assigned at drivers/scsi/scsi_lib.c::scsi_alloc_queue(struct scsi_device *sdev)
                                                        struct scsi_cmnd *cmd = rq->special;
                                                        scsi_finish_command(cmd);
                                                            scsi_io_completion(cmd, good_bytes);
                                                                blk_end_request_all(req, 0);
                                                                    pending = blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
                                                                        blk_finish_request(rq, error);
                                                                            __blk_put_request(req->q, req);
                                                                                elv_completed_request(q, req);
                                                                                    freed_request(rl, flags);
                                                                                        __freed_request(rl, sync);
                                                                                            wake_up(&rl->wait[sync]);    # at block/blk-core.c::__freed_request(..)
                                            }
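
# side note on the `open_softirq` registration mentioned above: this is roughly what block/blk-softirq.c does at boot (paraphrased from the source, cpu-hotplug notifier omitted):
static __init int blk_softirq_init(void)
{
    int i;

    /* one completion list per cpu: the `blk_cpu_done` used above */
    for_each_possible_cpu(i)
        INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));

    open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);    /* register the handler */
    return 0;
}
subsys_initcall(blk_softirq_init);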

# remember how fs/read_write.c::vfs_write(..) waits for the aio to complete and wakes up.
# I'm not sure how the `wake_up(&rl->wait[sync])` above connects to the `atomic_read(&iocb->ki_users)` here, but I guess that's it.
fs/read_write.c::vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
    ret = do_sync_write(file, buf, count, pos);
        ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
        ret = wait_on_sync_kiocb(&kiocb);    # here's how we wait for io to complete. defined at fs/aio.c
            while (atomic_read(&iocb->ki_users)) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (!atomic_read(&iocb->ki_users))
                    break;
                io_schedule();
            }
            __set_current_state(TASK_RUNNING);

Come back to fs layer and study the kernel page cache

# next, let's go back to the kernel fs layer, and study the page cache `address_space` stuff
# remember that when we dug into `xfs_file_buffered_aio_write`, we stepped into this `generic_perform_write` function
mm/filemap.c::generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos);
    # the `address_space` object is the bridge to the kernel page cache.
    # inside, `address_space` defines a `struct radix_tree_root page_tree`
    # the kernel page cache is actually held by this radix_tree.
    # a radix tree is a space-optimized trie. logically it is a big, sparse array.
    # to implement it, we break the array into pieces and put a tree index on top. the holes in the array need no actual tree nodes, which is the space optimization. (a toy radix tree sketch follows at the end of this section)
    struct address_space *mapping = file->f_mapping;    # `address_space` is defined at include/linux/fs.h
    const struct address_space_operations *a_ops = mapping->a_ops;    # here the actual a_op is fs/xfs/xfs_aops.c::xfs_address_space_operations.
    
    # here is how we write user data to kernel page cache
    status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata);    # step 1: use `address_space` to allocate / find the proper pages in the kernel page cache, i.e. the `&page` we will write into.
        fs/xfs/xfs_aops.c::xfs_vm_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata)
            page = grab_cache_page_write_begin(mapping, index, flags);
                page = find_lock_page(mapping, index);
                    struct page *page = __find_lock_page(mapping, offset);
                        page = __find_get_page(mapping, offset);
                            pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);     # now it is the radix tree stuff
                or page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
            status = __block_write_begin(page, pos, len, xfs_get_blocks);    # some preparation work

    copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);    # step 2: copy the user data into the `page`, which is backed by the kernel page cache
        kaddr = kmap_atomic(page);
        left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
            arch/x86/include/asm/uaccess_64.h::__copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
                return copy_user_generic(dst, (__force const void *)src, size);
                    alternative_call_2(copy_user_generic_unrolled,    # call to low level asm code here
                            copy_user_generic_string,
                            X86_FEATURE_REP_GOOD,
                            copy_user_enhanced_fast_string,
                            X86_FEATURE_ERMS,
                            ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
                                 "=d" (len)),
                            "1" (to), "2" (from), "3" (len)
                            : "memory", "rcx", "r8", "r9", "r10", "r11");
        kunmap_atomic(kaddr);
    
    status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata);    # step 3: end the write and mark the page dirty
        fs/xfs/xfs_aops.c::xfs_vm_write_end(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata)
            ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);    # defined at fs/buffer.c
                copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
                    __block_commit_write(inode, page, start, start+copied);
                        set_buffer_uptodate(bh);
                        mark_buffer_dirty(bh);
                            __set_page_dirty(page, mapping, 0);    # defined at fs/buffer.c
                                radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY);    # now it is the radix tree stuff
                                __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    
    # next, if there are too many dirty pages, flush them to disk
    balance_dirty_pages_ratelimited(mapping);    # defined at mm/page-writeback.c
        ...    # a long chain of function calls
            # the `writepages` here does actually write pages back to disk
            ret = mapping->a_ops->writepages(mapping, wbc);    # points to fs/xfs/xfs_aops.c::xfs_vm_writepages.
                ...    # another long chain of function calls
                # the `writepage` here does actually write the page back to disk
                int ret = mapping->a_ops->writepage(page, wbc);    # points to fs/xfs/xfs_aops.c::xfs_vm_writepage
                    ...    # yet another long chain of function calls
                        submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);    # block/blk-core.c::submit_bio, the entrance to the kernel block layer
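
To make the radix tree description above concrete, below is a toy two-level radix tree in plain userspace C. It is my own illustration; the real lib/radix-tree.c is far more elaborate (tags, RCU, arbitrary height).

# a toy 2-level radix tree: a sparse 4096-slot "array" that only
# allocates memory for populated subtrees
#include <stdio.h>
#include <stdlib.h>

#define BITS   6              /* 6 index bits per level */
#define SLOTS  (1 << BITS)    /* 64 slots per node, capacity 64*64 = 4096 */

struct node { void *slots[SLOTS]; };

static struct node root;      /* level 1: indexes the level-2 nodes */

static void insert(unsigned long index, void *item)
{
    struct node **leaf = (struct node **)&root.slots[index >> BITS];
    if (!*leaf)
        *leaf = calloc(1, sizeof(struct node));    /* allocate only when needed */
    (*leaf)->slots[index & (SLOTS - 1)] = item;
}

static void *lookup(unsigned long index)
{
    struct node *leaf = root.slots[index >> BITS];
    return leaf ? leaf->slots[index & (SLOTS - 1)] : NULL;
}

int main(void)
{
    insert(4000, "page at file offset 4000");
    printf("%s\n", (char *)lookup(4000));    /* hit */
    printf("%p\n", lookup(5));               /* hole: NULL, no node allocated */
    return 0;
}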

Things I'm still not clear about

1. in the block driver layer, when a write finishes and calls back, how is BLOCK_SOFTIRQ triggered, and why do we need it?
2. in the block driver layer, when a write finishes and calls back, how does `wake_up(&rl->wait[sync])` work in the end, and how does fs/aio.c::wait_on_sync_kiocb(..) wake up at the right time?
3. in the block layer, is there an async thread that keeps processing block io requests in the request queue? i.e. an async thread that keeps calling block/blk_core.c::__blk_run_queue_uncond(..) { .. q->request_fn(q); ..}

An easy way to piece together the add-symbol-file commands for gdb

echo add-symbol-file ./fs/xfs/xfs.o $(cat /sys/module/xfs/sections/.text) -s .data $(cat /sys/module/xfs/sections/.data) -s .bss $(cat /sys/module/xfs/sections/.bss)
echo add-symbol-file ./drivers/scsi/sd_mod.o $(cat /sys/module/sd_mod/sections/.text) -s .data $(cat /sys/module/sd_mod/sections/.data) -s .bss $(cat /sys/module/sd_mod/sections/.bss)
echo add-symbol-file ./drivers/message/fusion/mptspi.o $(cat /sys/module/mptspi/sections/.text) -s .data $(cat /sys/module/mptspi/sections/.data) -s .bss $(cat /sys/module/mptspi/sections/.bss)
