A Full Analysis of the Linux Block Layer - v0.1

binarydady 2018-07-21 18:17:43



perftrace@gmail.com

 

1     Foreword

Descriptions of the block layer are scattered across many sites on the web, and the classic books, not being updated in time, inevitably lag behind the latest code, for example on the block layer's multi-queue support. So it is time for a dedicated article on the Linux block layer. This article is based on kernel 4.17.2.

Since many of the topics here could each fill a standalone article, the information density is fairly high. If you find you cannot read it straight through, read selectively or in stages, digesting one part at a time; after all, the article was not written in one sitting either. The reference links given at the end are excellent material; if your English allows, skim them without worrying about the details, since much of their content has already been folded into this article.

2     Overall structure

An operating system kernel is a genuinely complicated thing; if we dove straight into code, I suspect most readers would walk away. So we first lay out the overall framework and start from the logic: abstract first, then concrete. Let's go.

The block layer is the interface through which file systems access storage devices; it is the bridge connecting the file system and the driver.

In the code, the block layer can itself be split into two layers: the bio layer and the request layer, as shown in the figure below.

(Figure: block layer overview, with the bio layer sitting above the request layer)


 

2.1     The bio layer

The file system ultimately calls generic_make_request, handing it requests built as bio structures, which enter the bio layer. The function itself does not return a result; when the I/O completes, the function specified in bio->bi_end_io is invoked asynchronously. The bio layer is essentially just this generic_make_request function, and it is very thin.
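To make this callback-style completion concrete, here is a minimal user-space sketch. It is not kernel code; all names (toy_bio, toy_make_request) are invented for illustration. The point is that the submitter gets no return value and learns of completion only through a bi_end_io-style callback.

```c
/* Toy model of asynchronous bio completion (illustrative names only).
 * The submitter does not wait on a return value; completion is
 * delivered by invoking the bi_end_io callback, mirroring how bios
 * submitted via generic_make_request complete in the kernel. */
struct toy_bio {
    int status;                          /* 0 = success */
    int completed;                       /* set by the callback */
    void (*bi_end_io)(struct toy_bio *);
};

static void toy_end_io(struct toy_bio *bio)
{
    bio->completed = 1;                  /* the owner observes completion here */
}

static void toy_make_request(struct toy_bio *bio)
{
    /* A real implementation would queue the bio and return; the device
     * would finish it later. We complete inline to stay self-contained. */
    bio->status = 0;
    bio->bi_end_io(bio);
}
```

The caller sets bi_end_io before submitting, just as file systems do with real bios.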

Below the bio layer sits the request layer, which, as the figure shows, comes in multi-queue and single-queue variants.

2.2     The request layer

The request layer has single-queue and multi-queue variants; multi-queue is the product of the kernel's evolution, and quite possibly only multi-queue will remain in the future. For single-queue, generic_make_request calls blk_queue_bio; for multi-queue, it calls blk_mq_make_request.

2.2.1 Single queue

Single-queue was designed mainly with traditional mechanical disks in mind: the actuator arm can only be in one place at a time, so a single queue suffices. Put differently, a single queue is exactly what a rotating disk wants. To exploit the characteristics of mechanical disks, the single queue has three key tasks:

- Collect multiple contiguous operations into one request, to make full use of the hardware. The code checks the queue to see whether a request can accept the new bio; if it can, the scheduler approves the merge, and otherwise merging with other requests is considered later. Requests thus grow large and contiguous.

- Order the requests to reduce seek time, without delaying important requests. We cannot know how important each request is, nor how much time a seek wastes, so this is handled by a queue scheduler such as deadline, cfq, or noop.

- Get requests down to the underlying driver: push them down when they are ready, and provide a mechanism for completion notification. The driver registers a request_fn() function via blk_init_queue_node; it is called when new requests appear on the queue. The driver calls blk_peek_request to collect and process requests, and when a request completes the driver simply fetches the next one rather than waiting for request_fn() to be called again. Each completed request finishes with blk_finish_request().
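The contiguity test behind the first task (merging) can be sketched as a toy model. The field and function names below are invented, not the kernel's: a bio can be back-merged when it starts exactly where an existing request ends, and front-merged when it ends where the request starts.

```c
/* Toy model of the sector-contiguity checks behind request merging
 * (illustrative names, not kernel API). */
typedef unsigned long long toy_sector_t;

struct toy_req {
    toy_sector_t sector;   /* first sector covered by the request */
    toy_sector_t nr_sects; /* length in sectors */
};

/* Back merge: the new bio begins right where the request ends. */
static int can_back_merge(const struct toy_req *rq, toy_sector_t bio_sector)
{
    return rq->sector + rq->nr_sects == bio_sector;
}

/* Front merge: the new bio ends right where the request begins. */
static int can_front_merge(const struct toy_req *rq, toy_sector_t bio_sector,
                           toy_sector_t bio_sects)
{
    return bio_sector + bio_sects == rq->sector;
}

static void back_merge(struct toy_req *rq, toy_sector_t bio_sects)
{
    rq->nr_sects += bio_sects; /* request grows but stays contiguous */
}
```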

Some devices can accept multiple requests at once, taking new requests before the previous one completes. Requests can be tagged so that, when one completes, the right one can be carried forward. As devices keep advancing and do more of the scheduling work internally, the demand for multi-queue keeps growing.

2.2.2 Multi-queue

Another motivation for multi-queue is that systems have more and more cores; funnelling them all into a single queue becomes a direct performance bottleneck.

If a queue is allocated per NUMA node or per CPU, the pressure of feeding requests into the queue drops dramatically. But if the hardware accepts one submission at a time, multi-queue ultimately still needs a merging step.

We know that the cfq scheduler also keeps multiple queues internally, but multi-queue differs from cfq: cfq associates requests with priorities, while multi-queue binds queues to the hardware.

The multi-queue request layer has two kinds of hardware-related queues: software staging queues (also called submission queues) and hardware dispatch queues.

A software staging queue is represented by a blk_mq_ctx structure, allocated according to the CPU topology, one per NUMA node or per CPU, and requests are added to these queues. They are managed by dedicated multi-queue schedulers, commonly bfq, kyber, or mq-deadline. Software queues on different CPUs are never aggregated across CPUs.

Hardware dispatch queues are allocated according to the target hardware; there may be just one, or as many as 2048. The driver is responsible for feeding the layers below. The request layer allocates one blk_mq_hw_ctx structure (hardware context) per hardware queue, and requests are ultimately passed to the low-level driver together with their hardware context. This queue is responsible for throttling submissions to the device driver, preventing overload. A request should preferably be dispatched from its software queue to the hardware queue on the same CPU, to improve CPU cache locality.

Another difference from single-queue is that multi-queue request structures are preallocated. Each request structure carries a numeric tag that identifies it to the device.

Instead of providing a request_fn(), multi-queue requires an operations structure, blk_mq_ops, which defines the callbacks; the most important is queue_rq(). There are also others for timeouts, polling, request initialization, and so on. When the scheduler decides a request is ready and should no longer sit on a queue, it calls queue_rq(), pushing the request down and out of the request layer (in single-queue it is the driver that pulls requests from the queue). queue_rq may place the request on an internal FIFO queue or process it directly. The function can also refuse a request by returning BLK_STS_RESOURCE, which leaves the request on the staging queue. Apart from BLK_STS_RESOURCE and BLK_STS_OK, all other return values indicate errors.
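The queue_rq() contract described above can be sketched as a user-space toy model (all names here are invented; the real callback takes a hardware context and a request descriptor): the driver either accepts a request or pushes back, in which case the request stays staged and is retried later.

```c
/* Toy model of the queue_rq() accept/push-back contract
 * (illustrative, not the kernel API). */
enum toy_status { TOY_STS_OK = 0, TOY_STS_RESOURCE = 1 };

struct toy_dev {
    int in_flight;  /* requests currently owned by the "hardware" */
    int depth;      /* device queue depth */
};

static enum toy_status toy_queue_rq(struct toy_dev *dev)
{
    if (dev->in_flight >= dev->depth)
        return TOY_STS_RESOURCE;  /* device busy: leave request staged */
    dev->in_flight++;             /* accepted: handed to hardware */
    return TOY_STS_OK;
}

/* Dispatch loop: stop at the first push-back, keep the rest staged. */
static int toy_dispatch(struct toy_dev *dev, int staged)
{
    int sent = 0;
    while (staged-- > 0 && toy_queue_rq(dev) == TOY_STS_OK)
        sent++;
    return sent;
}
```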

2.2.2.1  Multi-queue scheduling

A multi-queue driver does not need a scheduler configured; it then works like the single-queue noop scheduler, handing requests to the driver whenever blk_mq_run_hw_queue() or blk_mq_delay_run_hw_queue() is called. Multi-queue scheduler callbacks are defined in the operations set elevator_mq_ops, chiefly insert_requests() and dispatch_request(). insert_requests() inserts requests into the staging queues, and dispatch_request() selects a request to feed into a given hardware queue. Both are optional; the kernel is flexible. Without insert_requests, requests are simply appended at the tail; without dispatch_request, requests are taken from any staging queue and thrown at the hardware queue, which hurts performance (though with only one hardware queue it makes no difference).

As mentioned above, the three commonly used multi-queue schedulers are mq-deadline, bfq, and kyber.

mq-deadline has an insert_request function that ignores the staging queues and inserts requests directly into two global, time-ordered queues, one for reads and one for writes, while its dispatch_request function picks from a queue based on time, size, and starvation. Note that the name differs from the one in elevator_mq_ops; it lacks the trailing s.

The bfq scheduler, short for Budget Fair Queueing, is an upgrade of cfq, though it behaves more like mq-deadline: it does not use per-CPU staging queues. With multiple queues, a single spinlock is acquired by all CPUs.

The Kyber I/O scheduler does use per-CPU (or per-node) staging queues. It provides no insert_requests function, relying on the default behavior, and its dispatch_request function maintains internal queues per hardware context. Discussion of this scheduler only began in early 2017, so many details may still change; we will skip over it here.

2.2.2.2  Multi-queue topology

Finally, let's look at the diagram that makes this clearest:

(Figure: multi-queue topology, software staging queues feeding hardware dispatch queues)

The figure shows more software staging queues than hardware dispatch queues.

In fact there are three possible cases:

- More software staging queues than hardware dispatch queues

Two or more software staging queues are assigned to one hardware queue; at dispatch time, requests are drawn from all the associated software queues.

- Fewer software staging queues than hardware dispatch queues

In this case, software queues are mapped sequentially onto hardware queues.

- Equal numbers of software staging queues and hardware dispatch queues

This is a straight 1:1 mapping.
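The three cases above can be captured by one mapping function. This is a simplified sketch (the kernel's real default, blk_mq_map_queues, also accounts for CPU topology): assign each of nr_sw software queues to one of nr_hw hardware queues proportionally, which collapses to the identity when the counts match.

```c
/* Toy mapping of software (per-CPU) queues onto hardware queues.
 * Groups consecutive software queues when nr_sw > nr_hw, spreads them
 * when nr_sw < nr_hw, and is the identity when nr_sw == nr_hw.
 * Simplified; the kernel's blk_mq_map_queues also considers topology. */
static unsigned int toy_map_queue(unsigned int sw, unsigned int nr_sw,
                                  unsigned int nr_hw)
{
    return sw * nr_hw / nr_sw;
}
```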

 

2.2.3 When will multi-queue replace single-queue?

This may still take some time, since nothing new is perfect. Red Hat's internal storage testing uncovered performance problems with mq-deadline, and some other companies have also seen performance regressions in their tests. Still, it is only a matter of time, and not a long one.

Unfortunately, most books describe only the single-queue case, including the schedulers they cover. Fortunately, this series does cover multi-queue; feel free to share it with your friends.

2.3     bio based driver

Early on, to get around the kernel's single-queue bottleneck, the request layer had to be bypassed. A driver that goes through the request layer is called a request-based driver; one that skips the request layer is called a bio-based driver. How is it skipped?

A device driver can register a make_request_fn by calling blk_queue_make_request; make_request_fn handles bios directly. generic_make_request then invokes the device's make_request_fn on the bio, bypassing the request layer below the bio layer, because some devices, such as SSDs, do not need the request layer's merging and sorting.

In fact, this mechanism was not designed for SSD performance at all; it was originally meant for MD RAID, which processes requests and forwards them down to the real underlying hardware devices.

Moreover, while bio-based drivers exist to dodge the kernel's single-queue bottleneck, they bring a problem of their own: every driver has to handle, and reinvent, everything itself, so the code is not reusable. That is why the kernel introduced the multi-queue mechanism, and bio-based drivers will gradually die out. See: blk-mq: new multi-queue block IO queueing mechanism.

Overall, the bio layer is thin: it only builds I/O requests into bio structures and hands them to the appropriate make_request_fn(). The request layer is thick, housing the schedulers, request merging, and so on.

3     Request dispatch logic

3.1     Multi-queue

3.1.1 Request submission

The make_request function used by multi-queue is blk_mq_make_request. When the device supports only a single hardware queue or the request is asynchronous, plugging is skipped or greatly reduced; if the request is synchronous, the driver performs no plugging.

make_request performs request merging: if the device allows plugging, it searches the plug list for a suitable merge candidate, and finally maps the request to the software queue of the current CPU. The submission path involves no I/O-scheduler callbacks.

make_request sends synchronous requests to the corresponding hardware queue immediately; asynchronous or flush (batched) requests are delayed so they can be merged more efficiently before dispatch.

So make_request handles synchronous and asynchronous requests somewhat differently.

3.1.2 Request dispatch

If the I/O request is synchronous (plugging is thus not allowed in multi-queue), dispatch is carried out in the context of that same request.

If it is asynchronous or a flush, dispatch may run in the context of another request associated with the same hardware queue, or may be executed later by scheduled deferred work.

In multi-queue this is implemented by blk_mq_run_hw_queue: synchronous requests are dispatched to the driver immediately, while asynchronous requests are deferred. For a synchronous request, it calls the internal function __blk_mq_run_hw_queue, which first joins the software queues associated with the current hardware queue into the existing dispatch list; having collected the entries, it dispatches each request to the driver, where queue_rq finally handles it.

The whole logic lives in blk_mq_make_request.

The multi-queue logic is shown below:

(Figure: blk_mq request flow)

A high-resolution version of the blk_mq diagram:

https://github.com/kernel-z/filesystem/blob/master/blk_mq.png

 

3.2     Single queue

In the single-queue case, generic_make_request calls blk_queue_bio to process the bio structure. It is the most important function in the block layer and deserves close attention; it is packed with content, and you will almost certainly get "lost" on a first reading.

blk_queue_bio performs the elevator scheduling work, with front merges and back merges; if no merge is possible it creates a new request, and finally calls blk_account_io_start to record the start of I/O processing. Many I/O monitoring statistics begin at this function.

The logic is much simpler than the code. First check whether the bio can merge into the process's plug list; if not, check whether it can merge into the block layer request queue. If neither merge applies, create a new request for the bio. Here it splits again on whether plugging is possible: if so, check whether the existing plug list needs flushing; if not, simply hang the request on the plug list and return. If plugging is not possible, add the request to the request queue and call __blk_run_queue, which invokes rq->request_fn (set by the device driver; for SCSI it is scsi_request_fn) and leaves the block layer. That is the overall logic of blk_queue_bio, shown below:

(Figure: blk_queue_bio single-queue request flow)

High-resolution diagram:

https://github.com/kernel-z/filesystem/blob/master/blk_single.png
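The branching just described can be condensed into a small decision skeleton. This is an illustrative user-space model, not the kernel function; the enum and parameter names are invented:

```c
/* Toy decision skeleton of the blk_queue_bio path (illustrative names):
 * try a plug-list merge, then a queue merge, else allocate a new
 * request and either park it on the plug list or run the queue. */
enum toy_action { MERGED_PLUG, MERGED_QUEUE, PLUGGED_NEW, RUN_QUEUE };

static enum toy_action toy_queue_bio(int can_merge_plug, int can_merge_queue,
                                     int task_has_plug)
{
    if (can_merge_plug)
        return MERGED_PLUG;   /* bio absorbed by a plugged request */
    if (can_merge_queue)
        return MERGED_QUEUE;  /* bio absorbed by a queued request */
    if (task_has_plug)
        return PLUGGED_NEW;   /* new request parked on the plug list */
    return RUN_QUEUE;         /* new request queued; __blk_run_queue runs */
}
```

Note that plug-list merging is tried first because it needs no queue lock, which is exactly why plugging helps under contention.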

Requests on the plug list, besides being flushed by a later blk_queue_bio call chain (blk_flush_plug_list), can also be triggered by the process scheduler:

schedule ->
    sched_submit_work ->
        blk_schedule_flush_plug() ->
            blk_flush_plug_list(plug, true) ->
                queue_unplugged ->
                    blk_run_queue_async

which wakes the kblockd workqueue to perform the unplug.

Requests on the plug list are first flushed into the request queue, and in the end everything goes down through __blk_run_queue, which calls ->request_fn, a function that varies by driver (for SCSI it is scsi_request_fn).

3.2.1 Function summary

Insert: __elv_add_request

Split: blk_queue_split

Merge: bio_attempt_front_merge / bio_attempt_back_merge, blk_attempt_plug_merge

Issue I/O: __blk_run_queue

4     Block layer initialization analysis (SCSI)

At driver initialization time, the hardware determines whether the driver can use multi-queue. This fixes both the block layer entry function through which requests are queued (blk_queue_bio or blk_mq_make_request) and the exit function that finally issues requests out of the block layer (scsi_request_fn or scsi_queue_rq).

4.1     SCSI as an example

4.1.1 scsi_alloc_sdev

While probing SCSI devices, the driver uses scsi_alloc_sdev. It allocates and initializes the I/O state and returns a pointer to a scsi_device structure; the scsi_device records the host, channel, id, and lun, and is added to the appropriate lists.

It makes the following check:

if (shost_use_blk_mq(shost))
        sdev->request_queue = scsi_mq_alloc_queue(sdev);
else
        sdev->request_queue = scsi_old_alloc_queue(sdev);

If the device can use multi-queue, scsi_mq_alloc_queue is called; otherwise the single-queue scsi_old_alloc_queue is used. The sdev parameter is the scsi_device.

scsi_mq_alloc_queue calls blk_mq_init_queue, which ultimately registers the blk_mq_make_request function.

The initialization logic follows; it was too wide horizontally, so it has been turned vertical:

(Figure: SCSI block layer initialization flow)

High-resolution diagram:

https://github.com/kernel-z/filesystem/blob/master/scsi-init.png

Below are explanations of the concrete structures and functions in the code; combined with the text above, they should make the block layer easier to understand.

5     Key structures

5.1     request

The request structure represents a request to operate on a block device. It is placed on a request_queue and processed when the time is right.

It is defined in include/linux/blkdev.h:

struct request {        

        struct request_queue *q;  /* owning queue */

        struct blk_mq_ctx *mq_ctx;

 

        int cpu;        

        unsigned int cmd_flags;         /* op and common flags */

        req_flags_t rq_flags;    

 

        int internal_tag;

 

        /* the following two fields are internal, NEVER access directly */

        unsigned int __data_len;        /* total data len */     

        int tag;        

        sector_t __sector;              /* sector cursor */      

 

        struct bio *bio;

        struct bio *biotail;     

 

        struct list_head queuelist; /* request queue list link */

 

        /*

         * The hash is used inside the scheduler, and killed once the

         * request reaches the dispatch list. The ipi_list is only used

         * to queue the request for softirq completion, which is long

         * after the request has been unhashed (and even removed from

         * the dispatch list).

         */

        union {

                struct hlist_node hash; /* merge hash */

                struct list_head ipi_list;

        };

 

        /*

         * The rb_node is only used inside the io scheduler, requests

         * are pruned when moved to the dispatch queue. So let the

         * completion_data share space with the rb_node.

         */

        union {

                struct rb_node rb_node; /* sort/lookup */

                struct bio_vec special_vec;

                void *completion_data;

                int error_count; /* for legacy drivers, don't use */

        };

        /*

         * Three pointers are available for the IO schedulers, if they need

         * more they have to dynamically allocate it.  Flush requests are

         * never put on the IO scheduler. So let the flush fields share

         * space with the elevator data.

         */

        union {

                struct {

                        struct io_cq            *icq;

                        void                    *priv[2];

                } elv;

 

                struct {

                        unsigned int            seq;

                        struct list_head        list;

                        rq_end_io_fn            *saved_end_io;

                } flush;

        };

 

        struct gendisk *rq_disk;

        struct hd_struct *part;

        unsigned long start_time;

        struct blk_issue_stat issue_stat;

        /* Number of scatter-gather DMA addr+len pairs after

         * physical address coalescing is performed.

         */

        unsigned short nr_phys_segments;

 

#if defined(CONFIG_BLK_DEV_INTEGRITY)

        unsigned short nr_integrity_segments;

#endif

 

        unsigned short write_hint;

        unsigned short ioprio;

 

        unsigned int timeout;

 

        void *special;          /* opaque pointer available for LLD use */

 

        unsigned int extra_len; /* length of alignment and padding */

 

        /*

         * On blk-mq, the lower bits of ->gstate (generation number and

         * state) carry the MQ_RQ_* state value and the upper bits the

         * generation number which is monotonically incremented and used to

         * distinguish the reuse instances.

         *

         * ->gstate_seq allows updates to ->gstate and other fields

         * (currently ->deadline) during request start to be read

         * atomically from the timeout path, so that it can operate on a

         * coherent set of information.

         */

        seqcount_t gstate_seq;

        u64 gstate;

 

        /*

         * ->aborted_gstate is used by the timeout to claim a specific

         * recycle instance of this request.  See blk_mq_timeout_work().

         */

        struct u64_stats_sync aborted_gstate_sync;

        u64 aborted_gstate;

 

        /* access through blk_rq_set_deadline, blk_rq_deadline */

        unsigned long __deadline;

 

        struct list_head timeout_list;

 

        union {

                struct __call_single_data csd;

                u64 fifo_time;

        };

 

        /*

         * completion callback.

         */

        rq_end_io_fn *end_io;

        void *end_io_data;

 

        /* for bidi */

        struct request *next_rq;

 

#ifdef CONFIG_BLK_CGROUP

        struct request_list *rl;                /* rl this rq is alloced from */

        unsigned long long start_time_ns;

        unsigned long long io_start_time_ns;    /* when passed to hardware */

#endif

};

 

It represents an I/O request at the block device driver level; after transformation by the I/O scheduler layer, the request is sent down to the block device driver for processing.

 

5.2     request_queue

Every block device has one queue, and requests against the device are placed on it. Because block device I/O cannot complete immediately (I/O is slow), all requests are queued and handled when the time is right.

It is defined in include/linux/blkdev.h:

struct request_queue {

        /*

         * Together with queue_head for cacheline sharing

         */

        struct list_head        queue_head;/* list of pending requests */

        struct request          *last_merge;/* first merge candidate in the queue */

        struct elevator_queue   *elevator;/* pointer to the elevator object */

        int                     nr_rqs[2];      /* # allocated [a]sync rqs */

        int                     nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

 

        atomic_t                shared_hctx_restart;

 

        struct blk_queue_stats  *stats;

        struct rq_wb            *rq_wb;

 

        /*

         * If blkcg is not used, @q->root_rl serves all requests.  If blkcg

         * is used, root blkg allocates from @q->root_rl and all other

         * blkgs from their own blkg->rl.  Which one to use should be

         * determined using bio_request_list().

         */

        struct request_list     root_rl;

 

        request_fn_proc         *request_fn;/* entry point of the driver's strategy routine */

        make_request_fn         *make_request_fn;

        poll_q_fn               *poll_fn;

        prep_rq_fn              *prep_rq_fn;

        unprep_rq_fn            *unprep_rq_fn;

        softirq_done_fn         *softirq_done_fn;

        rq_timed_out_fn         *rq_timed_out_fn;

        dma_drain_needed_fn     *dma_drain_needed;

        lld_busy_fn             *lld_busy_fn;

        /* Called just after a request is allocated */

        init_rq_fn              *init_rq_fn;

        /* Called just before a request is freed */

        exit_rq_fn              *exit_rq_fn;

        /* Called from inside blk_get_request() */

        void (*initialize_rq_fn)(struct request *rq);

 

        const struct blk_mq_ops *mq_ops;

 

        unsigned int            *mq_map;

 

        /* sw queues */

        struct blk_mq_ctx __percpu      *queue_ctx;

        unsigned int            nr_queues;

        unsigned int            queue_depth;

 

        /* hw dispatch queues */

        struct blk_mq_hw_ctx    **queue_hw_ctx;

        unsigned int            nr_hw_queues;

 

        /*

         * Dispatch queue sorting

         */

        sector_t                end_sector;

        struct request          *boundary_rq;

 

        /*

         * Delayed queue handling

         */

        struct delayed_work     delay_work;

 

        struct backing_dev_info *backing_dev_info;

 

        /*

         * The queue owner gets to use this for whatever they like.

         * ll_rw_blk doesn't touch it.

         */

        void                    *queuedata;

 

        /*

         * various queue flags, see QUEUE_* below

         */

        unsigned long           queue_flags;

 

        /*

         * ida allocated id for this queue.  Used to index queues from

         * ioctx.

         */

        int                     id;

 

        /*

         * queue needs bounce pages for pages above this limit

         */

        gfp_t                   bounce_gfp;

 

        /*

         * protects queue structures from reentrancy. ->__queue_lock should

         * _never_ be used directly, it is queue private. always use

         * ->queue_lock.

         */

        spinlock_t              __queue_lock;

        spinlock_t              *queue_lock;

 

        /*

         * queue kobject

         */

        struct kobject kobj;

 

        /*

         * mq queue kobject

         */

        struct kobject mq_kobj;

 

#ifdef  CONFIG_BLK_DEV_INTEGRITY

        struct blk_integrity integrity;

#endif  /* CONFIG_BLK_DEV_INTEGRITY */

 

#ifdef CONFIG_PM

        struct device           *dev;

        int                     rpm_status;

        unsigned int            nr_pending;

#endif

 

        /*

         * queue settings

         */

        unsigned long           nr_requests;    /* Max # of requests */

        unsigned int            nr_congestion_on;

        unsigned int            nr_congestion_off;

        unsigned int            nr_batching;

 

        unsigned int            dma_drain_size;

        void                    *dma_drain_buffer;

        unsigned int            dma_pad_mask;

        unsigned int            dma_alignment;

 

        struct blk_queue_tag    *queue_tags;

        struct list_head        tag_busy_list;

 

        unsigned int            nr_sorted;

        unsigned int            in_flight[2];

 

        /*

         * Number of active block driver functions for which blk_drain_queue()

         * must wait. Must be incremented around functions that unlock the

         * queue_lock internally, e.g. scsi_request_fn().

         */

        unsigned int            request_fn_active;

 

        unsigned int            rq_timeout;

        int                     poll_nsec;

 

        struct blk_stat_callback        *poll_cb;

        struct blk_rq_stat      poll_stat[BLK_MQ_POLL_STATS_BKTS];

 

        struct timer_list       timeout;

        struct work_struct      timeout_work;

        struct list_head        timeout_list;

 

        struct list_head        icq_list;

#ifdef CONFIG_BLK_CGROUP

        DECLARE_BITMAP          (blkcg_pols, BLKCG_MAX_POLS);

        struct blkcg_gq         *root_blkg;

        struct list_head        blkg_list;

#endif

 

        struct queue_limits     limits;

 

        /*

         * Zoned block device information for request dispatch control.

         * nr_zones is the total number of zones of the device. This is always

         * 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones

         * bits which indicates if a zone is conventional (bit clear) or

         * sequential (bit set). seq_zones_wlock is a bitmap of nr_zones

         * bits which indicates if a zone is write locked, that is, if a write

         * request targeting the zone was dispatched. All three fields are

         * initialized by the low level device driver (e.g. scsi/sd.c).

         * Stacking drivers (device mappers) may or may not initialize

         * these fields.

         */

        unsigned int            nr_zones;

        unsigned long           *seq_zones_bitmap;

        unsigned long           *seq_zones_wlock;

 

        /*

         * sg stuff

         */

        unsigned int            sg_timeout;

        unsigned int            sg_reserved_size;

        int                     node;

#ifdef CONFIG_BLK_DEV_IO_TRACE

        struct blk_trace        *blk_trace;

        struct mutex            blk_trace_mutex;

#endif

        /*

         * for flush operations

         */

        struct blk_flush_queue  *fq;

 

        struct list_head        requeue_list;

        spinlock_t              requeue_lock;

        struct delayed_work     requeue_work;

 

        struct mutex            sysfs_lock;

 

        int                     bypass_depth;

        atomic_t                mq_freeze_depth;

 

#if defined(CONFIG_BLK_DEV_BSG)

        bsg_job_fn              *bsg_job_fn;

        struct bsg_class_device bsg_dev;

#endif

 

#ifdef CONFIG_BLK_DEV_THROTTLING

        /* Throttle data */

        struct throtl_data *td;

#endif

        struct rcu_head         rcu_head;

        wait_queue_head_t       mq_freeze_wq;

        struct percpu_ref       q_usage_counter;

        struct list_head        all_q_node;

 

        struct blk_mq_tag_set   *tag_set;

        struct list_head        tag_set_list;

        struct bio_set          *bio_split;

 

#ifdef CONFIG_BLK_DEBUG_FS

        struct dentry           *debugfs_dir;

        struct dentry           *sched_debugfs_dir;

#endif

 

        bool                    mq_sysfs_init_done;

 

        size_t                  cmd_size;

        void                    *rq_alloc_data;

 

        struct work_struct      release_work;

#define BLK_MAX_WRITE_HINTS     5

        u64                     write_hints[BLK_MAX_WRITE_HINTS];

};

This structure is enormous; it is getting close to sk_buff in size.


It maintains the queue of block-device-driver-level I/O requests; all requests are inserted into this queue, and each disk device has exactly one queue (even if it has multiple partitions).

A request_queue contains multiple requests, and each request may contain multiple bios; request merging is the process of adding multiple bios into the same request according to various rules.


5.3     elevator_queue

The elevator scheduling queue; every request queue has one.

struct elevator_queue

{

        struct elevator_type *type;

        void *elevator_data;

        struct kobject kobj;

        struct mutex sysfs_lock;

        unsigned int registered:1;

        unsigned int uses_mq:1;

        DECLARE_HASHTABLE(hash, ELV_HASH_BITS);

};

5.4     elevator_type

An elevator type is really a scheduling algorithm type.

struct elevator_type

{              

        /* managed by elevator core */

        struct kmem_cache *icq_cache;

                

        /* fields provided by elevator implementation */

        union {

                struct elevator_ops sq;

                struct elevator_mq_ops mq;

        } ops;

        size_t icq_size;        /* see iocontext.h */

        size_t icq_align;       /* ditto */

        struct elv_fs_entry *elevator_attrs;

        char elevator_name[ELV_NAME_MAX];

        const char *elevator_alias;

        struct module *elevator_owner;

        bool uses_mq;

#ifdef CONFIG_BLK_DEBUG_FS

        const struct blk_mq_debugfs_attr *queue_debugfs_attrs;

        const struct blk_mq_debugfs_attr *hctx_debugfs_attrs;

#endif 

 

        /* managed by elevator core */

        char icq_cache_name[ELV_NAME_MAX + 6];  /* elvname + "_io_cq" */

        struct list_head list;

};

5.4.1 iosched_cfq

For example, the cfq scheduler structure, which specifies all of the scheduler's callback functions.

static struct elevator_type iosched_cfq = {

        .ops.sq = {

                .elevator_merge_fn =            cfq_merge,

                .elevator_merged_fn =           cfq_merged_request,

                .elevator_merge_req_fn =        cfq_merged_requests,

                .elevator_allow_bio_merge_fn =  cfq_allow_bio_merge,

                .elevator_allow_rq_merge_fn =   cfq_allow_rq_merge,

                .elevator_bio_merged_fn =       cfq_bio_merged,

                .elevator_dispatch_fn =         cfq_dispatch_requests,

                .elevator_add_req_fn =          cfq_insert_request,

                .elevator_activate_req_fn =     cfq_activate_request,

                .elevator_deactivate_req_fn =   cfq_deactivate_request,

                .elevator_completed_req_fn =    cfq_completed_request,

                .elevator_former_req_fn =       elv_rb_former_request,

                .elevator_latter_req_fn =       elv_rb_latter_request,

                .elevator_init_icq_fn =         cfq_init_icq,

                .elevator_exit_icq_fn =         cfq_exit_icq,

                .elevator_set_req_fn =          cfq_set_request,

                .elevator_put_req_fn =          cfq_put_request,

                .elevator_may_queue_fn =        cfq_may_queue,

                .elevator_init_fn =             cfq_init_queue,

                .elevator_exit_fn =             cfq_exit_queue,

                .elevator_registered_fn =       cfq_registered_queue,

        },

        .icq_size       =       sizeof(struct cfq_io_cq),

        .icq_align      =       __alignof__(struct cfq_io_cq),

        .elevator_attrs =       cfq_attrs,

        .elevator_name  =       "cfq",

        .elevator_owner =       THIS_MODULE,

};

5.5     gendisk

Next, the disk structure gendisk (defined in <include/linux/genhd.h>), the kernel's representation of a single disk drive and the most important data structure in the block I/O subsystem.

struct gendisk {

        /* major, first_minor and minors are input parameters only,

         * don't use directly.  Use disk_devt() and disk_max_parts().

         */

        int major;                      /* major number of driver */

        int first_minor;

        int minors;                     /* maximum number of minors, =1 for

                                         * disks that can't be partitioned. */

 

        char disk_name[DISK_NAME_LEN];  /* name of major driver */

        char *(*devnode)(struct gendisk *gd, umode_t *mode);

 

        unsigned int events;            /* supported events */

        unsigned int async_events;      /* async events, subset of all */

 

        /* Array of pointers to partitions indexed by partno.

         * Protected with matching bdev lock but stat and other

         * non-critical accesses use RCU.  Always access through

         * helpers.

         */

        struct disk_part_tbl __rcu *part_tbl;

        struct hd_struct part0;

 

        const struct block_device_operations *fops;

        struct request_queue *queue;

        void *private_data;

 

        int flags;

        struct rw_semaphore lookup_sem;

        struct kobject *slave_dir;

 

        struct timer_rand_state *random;

        atomic_t sync_io;               /* RAID */

        struct disk_events *ev;

#ifdef  CONFIG_BLK_DEV_INTEGRITY

        struct kobject integrity_kobj;

#endif  /* CONFIG_BLK_DEV_INTEGRITY */

        int node_id;

        struct badblocks *bb;

        struct lockdep_map lockdep_map;

};

The structure holds the device major number, minor numbers (marking different partitions), the disk drive name (which appears in /proc/partitions and sysfs), the device's operation set (block_device_operations), the device's I/O request queue, drive state, drive capacity, and the driver's internal data pointer private_data.

Related functions: alloc_disk allocates a disk, and del_gendisk drops a reference to the structure.

Allocating a gendisk structure does not make the disk available to the system; you must also initialize the structure and call add_disk. Once add_disk is called, the disk is "live" and its methods can be invoked at any time; the kernel may start touching the device. In fact the first call may even happen before add_disk returns: the kernel will read the first few bytes looking for a partition table. So do not call add_disk, and thereby accept requests against the disk, before the driver is fully initialized and ready.

5.6     hd_struct

The disk partition structure.

struct hd_struct {

        sector_t start_sect;

        /*

         * nr_sects is protected by sequence counter. One might extend a

         * partition while IO is happening to it and update of nr_sects

         * can be non-atomic on 32bit machines with 64bit sector_t.

         */

        sector_t nr_sects;

        seqcount_t nr_sects_seq;

        sector_t alignment_offset;

        unsigned int discard_alignment;

        struct device __dev;

        struct kobject *holder_dir;

        int policy, partno;

        struct partition_meta_info *info;

#ifdef CONFIG_FAIL_MAKE_REQUEST

        int make_it_fail;

#endif

        unsigned long stamp;

        atomic_t in_flight[2];

#ifdef  CONFIG_SMP

        struct disk_stats __percpu *dkstats;

#else

        struct disk_stats dkstats;

#endif

        struct percpu_ref ref;

        struct rcu_head rcu_head;

};

5.7     bio

Before the 2.4 kernel, the buffer-head approach was used, which broke every I/O request into 512-byte chunks, so a high-performance I/O subsystem could not be built on it. A major piece of work in 2.5 was supporting high-performance I/O, hence today's bio structure.

The bio structure carries the actual data of a request: a request contains one or more bios, and at the bottom the device is actually operated on bio by bio as they are handed to the driver.

The code merges a bio into an existing request structure, or creates a new request if needed; the bio contains all the information the driver needs to execute the request.

A bio holds multiple pages, and these pages correspond to a contiguous region on disk. Since files are not stored contiguously on disk, a file I/O is quite likely split into multiple bio structures before it is submitted to the block device.

The structure is defined in include/linux/blk_types.h; unfortunately it has changed considerably over time, and in particular no longer matches the LDD book.

/*

 * main unit of I/O for the block layer and lower layers (ie drivers and

 * stacking drivers)

 */

struct bio {

        struct bio              *bi_next;       /* request queue link */

        struct gendisk          *bi_disk;

        unsigned int            bi_opf;         /* bottom bits req flags,

                                                 * top bits REQ_OP. Use

                                                 * accessors.

                                                 */

        unsigned short          bi_flags;       /* status, etc and bvec pool number */

        unsigned short          bi_ioprio;

        unsigned short          bi_write_hint;

        blk_status_t            bi_status;

        u8                      bi_partno;

 

        /* Number of segments in this BIO after

         * physical address coalescing is performed.

         */

        unsigned int            bi_phys_segments;

 

        /*

         * To keep track of the max segment size, we account for the

         * sizes of the first and last mergeable segments in this bio.

         */

        unsigned int            bi_seg_front_size;

        unsigned int            bi_seg_back_size;

 

        struct bvec_iter        bi_iter;

 

        atomic_t                __bi_remaining;

        bio_end_io_t            *bi_end_io;

 

        void                    *bi_private;

#ifdef CONFIG_BLK_CGROUP

        /*

         * Optional ioc and css associated with this bio.  Put on bio

         * release.  Read comment on top of bio_associate_current().

         */

        struct io_context       *bi_ioc;

        struct cgroup_subsys_state *bi_css;

#ifdef CONFIG_BLK_DEV_THROTTLING_LOW

        void                    *bi_cg_private;

        struct blk_issue_stat   bi_issue_stat;

#endif

#endif

        union {

#if defined(CONFIG_BLK_DEV_INTEGRITY)

                struct bio_integrity_payload *bi_integrity; /* data integrity */

#endif

        };

 

        unsigned short          bi_vcnt;        /* how many bio_vec's */

 

        /*

         * Everything starting with bi_max_vecs will be preserved by bio_reset()

         */

 

        unsigned short          bi_max_vecs;    /* max bvl_vecs we can hold */

 

        atomic_t                __bi_cnt;       /* pin count */

 

        struct bio_vec          *bi_io_vec;     /* the actual vec list */

 

        struct bio_set          *bi_pool;

 

        /*

         * We can inline a number of vecs at the end of the bio, to avoid

         * double allocations for a small number of bio_vecs. This member

         * MUST obviously be kept at the very end of the bio.

         */

        struct bio_vec          bi_inline_vecs[0];

};

5.8     bio_vec

The bio_vec structure lives in include/linux/bvec.h:

struct bio_vec {

        struct page     *bv_page; /* physical page holding the buffer */

        unsigned int    bv_len;   /* size in bytes */

        unsigned int    bv_offset;/* offset in bytes */

};
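To show how the (page, offset, len) segments compose a transfer, here is a toy user-space sketch (the struct is a stand-in, not the kernel's): the total size of a bio is the sum of bv_len over its vector, which is what iterators like bio_for_each_segment walk in the kernel.

```c
/* Toy model of a bio's segment vector (illustrative, not kernel code):
 * each segment is a (page, offset, len) triple, and the total transfer
 * size is the sum of the segment lengths. */
struct toy_bvec {
    void        *bv_page;   /* stand-in for struct page * */
    unsigned int bv_len;    /* bytes in this segment */
    unsigned int bv_offset; /* byte offset within the page */
};

static unsigned int toy_bio_bytes(const struct toy_bvec *vec,
                                  unsigned short vcnt)
{
    unsigned int total = 0;
    for (unsigned short i = 0; i < vcnt; i++)
        total += vec[i].bv_len;  /* walk the vector, like bio iterators do */
    return total;
}
```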


 

5.10 Multi-queue structures

5.10.1          blk_mq_ctx

Represents the software staging queues.

struct blk_mq_ctx {

        struct {

                spinlock_t              lock;

                struct list_head        rq_list;

        }  ____cacheline_aligned_in_smp;

       

        unsigned int            cpu;

        unsigned int            index_hw;

       

        /* incremented at dispatch time */

        unsigned long           rq_dispatched[2];

        unsigned long           rq_merged;

       

        /* incremented at completion time */   

        unsigned long           ____cacheline_aligned_in_smp rq_completed[2];

       

        struct request_queue    *queue;

        struct kobject          kobj;

};

 

5.10.2          blk_mq_hw_ctx

The multi-queue hardware queue. Its mapping to blk_mq_ctx is established via map_queues in blk_mq_ops, and the mapping is also stored in mq_map in the request_queue.

/**

 * struct blk_mq_hw_ctx - State for a hardware queue facing the hardware block device

 */

struct blk_mq_hw_ctx {

        struct {

                spinlock_t              lock;

                struct list_head        dispatch;

                unsigned long           state;          /* BLK_MQ_S_* flags */

        } ____cacheline_aligned_in_smp;

 

        struct delayed_work     run_work;

        cpumask_var_t           cpumask;

        int                     next_cpu;

        int                     next_cpu_batch;

                       

        unsigned long           flags;          /* BLK_MQ_F_* flags */

               

        void                    *sched_data;

        struct request_queue    *queue;

        struct blk_flush_queue  *fq;

 

        void                    *driver_data;                                                                                               

 

        struct sbitmap          ctx_map;                                                                                                    

 

        struct blk_mq_ctx       *dispatch_from;

 

        struct blk_mq_ctx       **ctxs;

        unsigned int            nr_ctx;

       

        wait_queue_entry_t      dispatch_wait;

        atomic_t                wait_index;

 

        struct blk_mq_tags      *tags;

        struct blk_mq_tags      *sched_tags;

 

        unsigned long           queued;

        unsigned long           run;

#define BLK_MQ_MAX_DISPATCH_ORDER       7

        unsigned long           dispatched[BLK_MQ_MAX_DISPATCH_ORDER];

 

        unsigned int            numa_node;

        unsigned int            queue_num;

 

        atomic_t                nr_active;

        unsigned int            nr_expired;

 

        struct hlist_node       cpuhp_dead;

        struct kobject          kobj;

 

        unsigned long           poll_considered;

        unsigned long           poll_invoked;

        unsigned long           poll_success;

 

#ifdef CONFIG_BLK_DEBUG_FS

        struct dentry           *debugfs_dir;

        struct dentry           *sched_debugfs_dir;

#endif

 

        /* Must be the last member - see also blk_mq_hw_ctx_size(). */

        struct srcu_struct      srcu[0];

};
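As a concrete illustration of the ctx-to-hctx mapping described above: by default, every CPU's software staging queue is bound round-robin to one hardware dispatch queue. A user-space sketch with hypothetical helper names, modeling what the kernel's default blk_mq_map_queues boils down to:

```c
#include <stddef.h>

/* Hypothetical model of the default CPU -> hardware queue map: each
 * software staging queue (blk_mq_ctx, one per CPU) is bound round-robin
 * to a hardware dispatch queue (blk_mq_hw_ctx). */
unsigned int ctx_to_hctx(unsigned int cpu, unsigned int nr_hw_queues)
{
    return cpu % nr_hw_queues;
}

/* Fill mq_map[] the way the tag set's mq_map table is built at init. */
void build_mq_map(unsigned int *mq_map, unsigned int nr_cpus,
                  unsigned int nr_hw_queues)
{
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        mq_map[cpu] = ctx_to_hctx(cpu, nr_hw_queues);
}
```

With 8 CPUs and 2 hardware queues, CPUs 0, 2, 4, 6 share hardware queue 0 and CPUs 1, 3, 5, 7 share hardware queue 1.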

 

5.10.3          blk_mq_tag_set

Describes the tag set shared by all hardware queues of a device: the driver's blk_mq_ops, the number and depth of the hardware queues, and the mq_map table.
struct blk_mq_tag_set {

        unsigned int            *mq_map;

        const struct blk_mq_ops *ops;

        unsigned int            nr_hw_queues;

        unsigned int            queue_depth;    /* max hw supported */

        unsigned int            reserved_tags;

        unsigned int            cmd_size;       /* per-request extra data */

        int                     numa_node;

        unsigned int            timeout;

        unsigned int            flags;          /* BLK_MQ_F_* */

        void                    *driver_data;

 

        struct blk_mq_tags      **tags;

 

        struct mutex            tag_list_lock;

        struct list_head        tag_list;

};
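The tags field points at per-hardware-queue tag bitmaps; a completed command is identified by its tag number alone. A toy user-space version of the allocate/free contract, with a single 64-bit word standing in for the kernel's sharded sbitmap (purely illustrative):

```c
#include <stdint.h>

/* Toy tag allocator: one 64-bit word, so queue_depth <= 64. The
 * kernel's sbitmap spreads the bits across cache lines, but the
 * contract is the same: a completed request's tag number is enough
 * to find the request again. */
struct tag_map { uint64_t bits; unsigned int depth; };

int tag_alloc(struct tag_map *tm)            /* returns tag or -1 */
{
    for (unsigned int t = 0; t < tm->depth; t++) {
        if (!(tm->bits & ((uint64_t)1 << t))) {
            tm->bits |= (uint64_t)1 << t;
            return (int)t;
        }
    }
    return -1;                               /* queue full */
}

void tag_free(struct tag_map *tm, int tag)
{
    tm->bits &= ~((uint64_t)1 << tag);
}
```

Freed tags become reusable immediately, which is why the driver must be done with a request before completing its tag.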

 

5.11 Operation tables

5.11.1          elevator_ops

The callback set implemented by a single-queue I/O scheduler.

struct elevator_ops

{

        elevator_merge_fn *elevator_merge_fn;

        elevator_merged_fn *elevator_merged_fn;

        elevator_merge_req_fn *elevator_merge_req_fn;

        elevator_allow_bio_merge_fn *elevator_allow_bio_merge_fn;

        elevator_allow_rq_merge_fn *elevator_allow_rq_merge_fn;

        elevator_bio_merged_fn *elevator_bio_merged_fn;

 

        elevator_dispatch_fn *elevator_dispatch_fn;

        elevator_add_req_fn *elevator_add_req_fn;

        elevator_activate_req_fn *elevator_activate_req_fn;

        elevator_deactivate_req_fn *elevator_deactivate_req_fn;

 

        elevator_completed_req_fn *elevator_completed_req_fn;

 

        elevator_request_list_fn *elevator_former_req_fn;

        elevator_request_list_fn *elevator_latter_req_fn;

 

        elevator_init_icq_fn *elevator_init_icq_fn;     /* see iocontext.h */

        elevator_exit_icq_fn *elevator_exit_icq_fn;     /* ditto */

 

        elevator_set_req_fn *elevator_set_req_fn;

        elevator_put_req_fn *elevator_put_req_fn;

 

        elevator_may_queue_fn *elevator_may_queue_fn;

 

        elevator_init_fn *elevator_init_fn;

        elevator_exit_fn *elevator_exit_fn;

        elevator_registered_fn *elevator_registered_fn;

};

 

 

 

5.11.2          elevator_mq_ops

The callback set implemented by a multi-queue I/O scheduler.

struct elevator_mq_ops {

        int (*init_sched)(struct request_queue *, struct elevator_type *);

        void (*exit_sched)(struct elevator_queue *);

        int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);

        void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);

 

        bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);

        bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);

        int (*request_merge)(struct request_queue *q, struct request **, struct bio *);

        void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);

        void (*requests_merged)(struct request_queue *, struct request *, struct request *);

        void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);

        void (*prepare_request)(struct request *, struct bio *bio);

        void (*finish_request)(struct request *);

        void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);

        struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);

        bool (*has_work)(struct blk_mq_hw_ctx *);

        void (*completed_request)(struct request *);

        void (*started_request)(struct request *);

        void (*requeue_request)(struct request *);

        struct request *(*former_request)(struct request_queue *, struct request *);

        struct request *(*next_request)(struct request_queue *, struct request *);

        void (*init_icq)(struct io_cq *);

        void (*exit_icq)(struct io_cq *);

};

5.11.3          Multi-queue

5.11.3.1        blk_mq_ops

The multi-queue operations table. It is the bridge between the blk-mq core and the block device driver, and therefore central to the whole path.

struct blk_mq_ops {

        /*     

         * Queue request

         */    

        queue_rq_fn             *queue_rq; // dispatches one request to the driver

               

        /*

         * Reserve budget before queue request, once .queue_rq is

         * run, it is driver's responsibility to release the

         * reserved budget. Also we have to handle failure case

         * of .get_budget for avoiding I/O deadlock.

         */

        get_budget_fn           *get_budget;

        put_budget_fn           *put_budget;

       

        /*

         * Called on request timeout

         */

        timeout_fn              *timeout;

       

        /*     

         * Called to poll for completion of a specific tag.

         */

        poll_fn                 *poll;

 

        softirq_done_fn         *complete;

 

        /*

         * Called when the block layer side of a hardware queue has been

         * set up, allowing the driver to allocate/init matching structures.

         * Ditto for exit/teardown.

         */

        init_hctx_fn            *init_hctx;

        exit_hctx_fn            *exit_hctx;

 

        /*

         * Called for every command allocated by the block layer to allow

         * the driver to set up driver specific data.

         *

         * Tag greater than or equal to queue_depth is for setting up

         * flush request.

         *

         * Ditto for exit/teardown.

         */


        init_request_fn         *init_request;

        exit_request_fn         *exit_request;

        /* Called from inside blk_get_request() */

        void (*initialize_rq_fn)(struct request *rq);

 

        map_queues_fn           *map_queues; // builds the blk_mq_ctx -> blk_mq_hw_ctx mapping

 

#ifdef CONFIG_BLK_DEBUG_FS

        /*

         * Used by the debugfs implementation to show driver-specific

         * information about a request.

         */

        void (*show_rq)(struct seq_file *m, struct request *rq);

#endif

};

           SCSI's multi-queue operation set is one example:

5.11.3.1.1  scsi_mq_ops

The blk_mq_ops used by the modern SCSI driver; the legacy single-queue handler was scsi_request_fn.

static const struct blk_mq_ops scsi_mq_ops = {

        .get_budget     = scsi_mq_get_budget,

        .put_budget     = scsi_mq_put_budget,

        .queue_rq       = scsi_queue_rq,

        .complete       = scsi_softirq_done,

        .timeout        = scsi_timeout,

#ifdef CONFIG_BLK_DEBUG_FS

        .show_rq        = scsi_show_rq,

#endif

        .init_request   = scsi_mq_init_request,

        .exit_request   = scsi_mq_exit_request,

        .initialize_rq_fn = scsi_initialize_rq,

        .map_queues     = scsi_map_queues,

};

6     Key functions

6.1     Multi-queue

6.1.1 blk_mq_flush_plug_list

Flushes the plugged requests in the multi-queue path; it is invoked from blk_flush_plug_list.

6.1.2 blk_mq_make_request: the multi-queue entry point

This function is the multi-queue counterpart of the single-queue blk_queue_bio and serves as the entry point of the multi-queue path; its body lays out how an I/O travels through blk-mq.

The overall logic closely mirrors blk_queue_bio.

If the bio can be merged into the current process's plug list, it is merged there and the function returns. Otherwise blk_mq_sched_bio_merge tries to merge it into the scheduler's queues. In either case the merge may be a front merge or a back merge.

If the bio cannot be merged anywhere, a new request is allocated for it. Depending on flush, sync and plug conditions the request takes one of several branches, but they all end up calling blk_mq_run_hw_queue to issue the I/O to the device.
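One of those branches, the plug path, flushes the plug list before adding a new request once the list gets too long or the last request too large. That check can be modeled as follows (the constants mirror BLK_MAX_REQUEST_COUNT and BLK_PLUG_FLUSH_SIZE; treat the exact values as illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the kernel constants used in the plug path. */
#define MAX_REQUEST_COUNT 16           /* cf. BLK_MAX_REQUEST_COUNT */
#define PLUG_FLUSH_SIZE   (128 * 1024) /* cf. BLK_PLUG_FLUSH_SIZE */

/* Should the plug list be flushed before adding another request?
 * Flush when the list is long enough, or when the most recent request
 * has grown large enough, mirroring the check in blk_mq_make_request. */
bool should_flush_plug(unsigned int request_count, size_t last_rq_bytes)
{
    return request_count >= MAX_REQUEST_COUNT ||
           last_rq_bytes >= PLUG_FLUSH_SIZE;
}
```

The point of the threshold is to keep plugging bounded: batching helps merging, but an unbounded plug list would add latency.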

static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)

{      

        const int is_sync = op_is_sync(bio->bi_opf);

        const int is_flush_fua = op_is_flush(bio->bi_opf);

        struct blk_mq_alloc_data data = { .flags = 0 };

        struct request *rq;

        unsigned int request_count = 0;

        struct blk_plug *plug;

        struct request *same_queue_rq = NULL;

        blk_qc_t cookie;

        unsigned int wb_acct;

       

        blk_queue_bounce(q, &bio);

       

        blk_queue_split(q, &bio); // split the bio to fit the device's hardware limits

       

        if (!bio_integrity_prep(bio))

                return BLK_QC_T_NONE;

               

        if (!is_flush_fua && !blk_queue_nomerges(q) &&

            blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq)) // try to merge into the process's plug list

                return BLK_QC_T_NONE;

 

        if (blk_mq_sched_bio_merge(q, bio)) // try a scheduler merge; done if it succeeds

                return BLK_QC_T_NONE;

 

        wb_acct = wbt_wait(q->rq_wb, bio, NULL);

 

        trace_block_getrq(q, bio, bio->bi_opf);

 

        rq = blk_mq_get_request(q, bio, bio->bi_opf, &data); // could not merge: allocate a new request

        if (unlikely(!rq)) {

                __wbt_done(q->rq_wb, wb_acct);

                if (bio->bi_opf & REQ_NOWAIT)

                        bio_wouldblock_error(bio);

                return BLK_QC_T_NONE;

        }

        wbt_track(&rq->issue_stat, wb_acct);

 

        cookie = request_to_qc_t(data.hctx, rq);

 

        plug = current->plug;

        if (unlikely(is_flush_fua)) { // flush/FUA: bypass the scheduler

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio); // fill the request from the bio before pushing it down

 

                /* bypass scheduler for flush rq */

                blk_insert_flush(rq);

                blk_mq_run_hw_queue(data.hctx, true); // kick the hardware queue to issue the I/O

        } else if (plug && q->nr_hw_queues == 1) { // plugging allowed and a single hardware queue

                struct request *last = NULL;

 

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

 

                /*

                 * @request_count may become stale because of schedule

                 * out, so check the list again.

                 */

                if (list_empty(&plug->mq_list))

                        request_count = 0;

                else if (blk_queue_nomerges(q))

                        request_count = blk_plug_queued_count(q);

 

                if (!request_count)

                        trace_block_plug(q);

                else

                        last = list_entry_rq(plug->mq_list.prev);

 

                if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&

                    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {

                        blk_flush_plug_list(plug, false);

                        trace_block_plug(q);

                }

 

                list_add_tail(&rq->queuelist, &plug->mq_list);

        } else if (plug && !blk_queue_nomerges(q)) {

                blk_mq_bio_to_request(rq, bio);

 

                /*

                 * We do limited plugging. If the bio can be merged, do that.

                 * Otherwise the existing request in the plug list will be

                 * issued. So the plug list will have one request at most

                 * The plug list might get flushed before this. If that happens,

                 * the plug list is empty, and same_queue_rq is invalid.

                 */

                if (list_empty(&plug->mq_list))

                        same_queue_rq = NULL;

                if (same_queue_rq)

                        list_del_init(&same_queue_rq->queuelist);

                list_add_tail(&rq->queuelist, &plug->mq_list);

 

                blk_mq_put_ctx(data.ctx);

 

                if (same_queue_rq) {

                        data.hctx = blk_mq_map_queue(q,

                                        same_queue_rq->mq_ctx->cpu);

                        blk_mq_try_issue_directly(data.hctx, same_queue_rq,

                                        &cookie);

                }

        } else if (q->nr_hw_queues > 1 && is_sync) {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_try_issue_directly(data.hctx, rq, &cookie);

        } else if (q->elevator) {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_sched_insert_request(rq, false, true, true);

        } else {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_queue_io(data.hctx, data.ctx, rq);

                blk_mq_run_hw_queue(data.hctx, true);

        }

 

        return cookie;

}

6.2     Single queue

6.2.1 blk_flush_plug_list

The single-queue counterpart of blk_mq_flush_plug_list. It flushes the requests on the process's plug list into the elevator via __elv_add_request and then calls __blk_run_queue to issue the I/O.

6.2.2 blk_queue_bio: the single-queue entry point

This is the single-queue submission function, responsible for getting a bio onto the request queue; it is called from generic_make_request. Once multi-queue fully replaces the single-queue path, this function will be history.

A bio submitted through this function is handled in one of the following ways:

l   If the process is plugged, try to merge the bio into the process's plugged list, i.e. current->plug.list.

l   If the process is unplugged, let the I/O scheduler find a suitable request and merge the bio into it.

l   If the bio cannot be merged into any existing request, fall through to allocating a free request dedicated to this bio.

Whether the merge happens on the plugged list or inside the I/O scheduler, it can go in two directions:

back merges are done by bio_attempt_back_merge,

front merges by bio_attempt_front_merge.
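Both directions reduce to a sector-adjacency check. A user-space sketch with hypothetical struct fields, simplified from what the merge helpers actually test (the real code also checks operation type, limits and integrity):

```c
#include <stdbool.h>

typedef unsigned long long sector_t;

/* Simplified request/bio: a start sector plus a length in sectors. */
struct io_range { sector_t sector; sector_t nr_sectors; };

/* Back merge: the bio starts exactly where the request ends. */
bool can_back_merge(const struct io_range *rq, const struct io_range *bio)
{
    return rq->sector + rq->nr_sectors == bio->sector;
}

/* Front merge: the bio ends exactly where the request starts. */
bool can_front_merge(const struct io_range *rq, const struct io_range *bio)
{
    return bio->sector + bio->nr_sectors == rq->sector;
}
```

So for a request covering sectors 100-107, a bio starting at 108 is a back-merge candidate and a bio covering 92-99 a front-merge candidate.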

static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)

{      

        struct blk_plug *plug; // per-process plug

        int where = ELEVATOR_INSERT_SORT;

        struct request *req, *free;

        unsigned int request_count = 0;

        unsigned int wb_acct;

 

        /*

         * low level driver can indicate that it wants pages above a

         * certain limit bounced to low memory (ie for highmem, or even

         * ISA dma in theory)

         */

        blk_queue_bounce(q, &bio);

                       

        blk_queue_split(q, &bio); // split the bio according to the queue's limits.max_sectors and limits.max_segments (set up in blk_set_default_limits)

              

        if (!bio_integrity_prep(bio)) // prepare the integrity payload if the device needs one

                return BLK_QC_T_NONE;

 

        if (op_is_flush(bio->bi_opf)) { // REQ_PREFLUSH/REQ_FUA need special handling

                spin_lock_irq(q->queue_lock);

                where = ELEVATOR_INSERT_FLUSH;

                goto get_rq;

        }

 

        /*

         * Check if we can merge with the plugged list before grabbing

         * any locks.

         */

        if (!blk_queue_nomerges(q)) { // may this queue merge? (QUEUE_FLAG_NOMERGES)

                if (blk_attempt_plug_merge(q, bio, &request_count, NULL)) // merge into the plug list and return; the plugged I/O is flushed later

                        return BLK_QC_T_NONE;

        } else

                request_count = blk_plug_queued_count(q); // just count the plugged requests

 

        spin_lock_irq(q->queue_lock);

 

        switch (elv_merge(q, &req, bio)) { // ask the elevator whether and where the bio can merge

        case ELEVATOR_BACK_MERGE: // back merge: the bio joins the tail of an existing request

                if (!bio_attempt_back_merge(q, req, bio)) // do the back merge

                        break;

                elv_bio_merged(q, req, bio);

                free = attempt_back_merge(q, req);

                if (free)

                        __blk_put_request(q, free);

                else

                        elv_merged_request(q, req, ELEVATOR_BACK_MERGE);

                goto out_unlock;

        case ELEVATOR_FRONT_MERGE: // front merge: the bio joins the head of an existing request

                if (!bio_attempt_front_merge(q, req, bio))

                        break;

                elv_bio_merged(q, req, bio);

                free = attempt_front_merge(q, req);

                if (free)

                        __blk_put_request(q, free);

                else

                        elv_merged_request(q, req, ELEVATOR_FRONT_MERGE);

                goto out_unlock;

        default:

                break;

        }

 

get_rq:

        wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);

 

        /*

         * Grab a free request. This is might sleep but can not fail.

         * Returns with the queue unlocked.

         */

        blk_queue_enter_live(q);

        req = get_request(q, bio->bi_opf, bio, 0); // no merge was possible on the plug list or in the queue: allocate a fresh request

        if (IS_ERR(req)) {

                blk_queue_exit(q);

                __wbt_done(q->rq_wb, wb_acct);

                if (PTR_ERR(req) == -ENOMEM)

                        bio->bi_status = BLK_STS_RESOURCE;

                else

                        bio->bi_status = BLK_STS_IOERR;

                bio_endio(bio);

                goto out_unlock;

        }

        wbt_track(&req->issue_stat, wb_acct);

 

        /*

         * After dropping the lock and possibly sleeping here, our request

         * may now be mergeable after it had proven unmergeable (above).

         * We don't worry about that case for efficiency. It won't happen

         * often, and the elevators are able to handle it.

         */

        blk_init_request_from_bio(req, bio); // initialize the request from the bio

 

        if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))

                req->cpu = raw_smp_processor_id();

 

        plug = current->plug;

        if (plug) {

                /*

                 * If this is the first request added after a plug, fire

                 * of a plug trace.

                 *

                 * @request_count may become stale because of schedule

                 * out, so check plug list again.

                 */

                if (!request_count || list_empty(&plug->list))

                        trace_block_plug(q);

                else {

                        struct request *last = list_entry_rq(plug->list.prev);

                        if (request_count >= BLK_MAX_REQUEST_COUNT ||

                            blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE) {

                                blk_flush_plug_list(plug, false); // too many or too large plugged requests: flush them (false = not triggered by schedule); this calls __elv_add_request to push them into the elevator

                                trace_block_plug(q);

                        }

                }

                list_add_tail(&req->queuelist, &plug->list); // add the request to the plug list

                blk_account_io_start(req, true); // start I/O accounting

        } else {

                spin_lock_irq(q->queue_lock);

                add_acct_request(q, req, where); // calls blk_account_io_start and __elv_add_request to put the request on the queue

                __blk_run_queue(q); // not plugged: kick the queue and start the I/O

out_unlock:

                spin_unlock_irq(q->queue_lock);

        }

 

        return BLK_QC_T_NONE;

}

 

6.3     Initialization functions

6.3.1 blk_mq_init_queue

       This function initializes the software staging queues and the hardware dispatch queues and builds the mapping between them.

It also installs blk_mq_make_request by calling blk_queue_make_request.

 

struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)

{

        struct request_queue *uninit_q, *q;

 

        uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);

        if (!uninit_q)

                return ERR_PTR(-ENOMEM);

       

        q = blk_mq_init_allocated_queue(set, uninit_q);

        if (IS_ERR(q))

                blk_cleanup_queue(uninit_q);

       

        return q;

}

 

6.3.2 blk_mq_init_request

This function invokes the driver's .init_request callback.

static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,

                               unsigned int hctx_idx, int node)

{

        int ret;

 

        if (set->ops->init_request) {

                ret = set->ops->init_request(set, rq, hctx_idx, node);

                if (ret)

                        return ret;

        }

 

        seqcount_init(&rq->gstate_seq);

        u64_stats_init(&rq->aborted_gstate_sync);

        /*

         * start gstate with gen 1 instead of 0, otherwise it will be equal

         * to aborted_gstate, and be identified timed out by

         * blk_mq_terminate_expired.

         */

        WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);

 

        return 0;

}

 

6.3.3 blk_init_queue

Initializes a single-queue request queue; it simply calls blk_init_queue_node.

struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)

{

        return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);

}

blk_init_queue_node in turn calls blk_init_allocated_queue:

struct request_queue *

blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)

{

        struct request_queue *q;

        q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);

        if (!q)

                return NULL;

               

        q->request_fn = rfn;

        if (blk_init_allocated_queue(q) < 0) {

                blk_cleanup_queue(q);

                return NULL;

        }

        return q;

}

 

6.3.4 blk_queue_make_request

   blk_queue_make_request installs a queue's make_request_fn; in the multi-queue case it is used to install blk_mq_make_request.

 

6.4     The bridging function

6.4.1 generic_make_request

       This function bridges the layers above and below it, which is why its definition carries such an extensive explanatory comment for developers.

generic_make_request is the entry point of the bio layer: it hands the bio over to the block layer's request queue, calling blk_queue_bio for single-queue devices and blk_mq_make_request for multi-queue ones.

/**

 * generic_make_request - hand a buffer to its device driver for I/O

 * @bio:  The bio describing the location in memory and on the device.

 *

 * generic_make_request() is used to make I/O requests of block

 * devices. It is passed a &struct bio, which describes the I/O that needs

 * to be done.

 *

 * generic_make_request() does not return any status.  The

 * success/failure status of the request, along with notification of

 * completion, is delivered asynchronously through the bio->bi_end_io

 * function described (one day) else where.

 *

 * The caller of generic_make_request must make sure that bi_io_vec

 * are set to describe the memory buffer, and that bi_dev and bi_sector are

 * set to describe the device address, and the

 * bi_end_io and optionally bi_private are set to describe how

 * completion notification should be signaled.

 *

 * generic_make_request and the drivers it calls may use bi_next if this

 * bio happens to be merged with someone else, and may resubmit the bio to

 * a lower device by calling into generic_make_request recursively, which

 * means the bio should NOT be touched after the call to ->make_request_fn.

 */

blk_qc_t generic_make_request(struct bio *bio)

{

        /*

         * bio_list_on_stack[0] contains bios submitted by the current

         * make_request_fn.

         * bio_list_on_stack[1] contains bios that were submitted before

         * the current make_request_fn, but that haven't been processed

         * yet.

         */

        struct bio_list bio_list_on_stack[2];

        blk_mq_req_flags_t flags = 0;

        struct request_queue *q = bio->bi_disk->queue; // queue of the device this bio targets

        blk_qc_t ret = BLK_QC_T_NONE;

 

        if (bio->bi_opf & REQ_NOWAIT) // propagate REQ_NOWAIT into the queue-enter flags

                flags = BLK_MQ_REQ_NOWAIT;

        if (blk_queue_enter(q, flags) < 0) { // can the queue accept requests right now?

                if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))

                        bio_wouldblock_error(bio);

                else

                        bio_io_error(bio);

                return ret;

        }

 

        if (!generic_make_request_checks(bio)) // sanity-check the bio

                goto out;

 

        /*

         * We only want one ->make_request_fn to be active at a time, else

         * stack usage with stacked devices could be a problem.  So use

         * current->bio_list to keep a list of requests submited by a

         * make_request_fn function.  current->bio_list is also used as a

         * flag to say if generic_make_request is currently active in this

         * task or not.  If it is NULL, then no make_request is active.  If

         * it is non-NULL, then a make_request is active, and new requests

         * should be added at the tail

         */

        if (current->bio_list) { // non-NULL means generic_make_request is already active in this task (e.g. a stacked MD device resubmitting a bio): just queue it and return

                bio_list_add(&current->bio_list[0], bio);

                goto out;

        }

 

        /* following loop may be a bit non-obvious, and so deserves some

         * explanation.

         * Before entering the loop, bio->bi_next is NULL (as all callers

         * ensure that) so we have a list with a single bio.

         * We pretend that we have just taken it off a longer list, so

         * we assign bio_list to a pointer to the bio_list_on_stack,

         * thus initialising the bio_list of new bios to be

         * added.  ->make_request() may indeed add some more bios

         * through a recursive call to generic_make_request.  If it

         * did, we find a non-NULL value in bio_list and re-enter the loop

         * from the top.  In this case we really did just take the bio

         * of the top of the list (no pretending) and so remove it from

         * bio_list, and call into ->make_request() again.

         */

        BUG_ON(bio->bi_next);

        bio_list_init(&bio_list_on_stack[0]); // init the list of bios to submit now

        current->bio_list = bio_list_on_stack; // publish it in task_struct; it is reset to NULL before returning

        do { // loop: feed each bio to make_request_fn

                bool enter_succeeded = true;

 

                if (unlikely(q != bio->bi_disk->queue)) { // does this bio target the same queue as the previous one?

                        if (q)

                                blk_queue_exit(q); // drop the reference taken by blk_queue_enter

                        q = bio->bi_disk->queue; // switch to the next bio's queue

                        flags = 0;

                        if (bio->bi_opf & REQ_NOWAIT)

                                flags = BLK_MQ_REQ_NOWAIT;

                        if (blk_queue_enter(q, flags) < 0) {

                                enter_succeeded = false;

                                q = NULL;

                        }

                }

 

                if (enter_succeeded) { // the queue was entered successfully

                        struct bio_list lower, same;

 

                        /* Create a fresh bio_list for all subordinate requests */

                        bio_list_on_stack[1] = bio_list_on_stack[0]; // bios submitted before this make_request_fn go to slot [1]

                        bio_list_init(&bio_list_on_stack[0]); // fresh list for the bios this call will submit

                        ret = q->make_request_fn(q, bio); // the key call: ->make_request_fn

 

                        /* sort new bios into those for a lower level

                         * and those for the same level

                         */

                        bio_list_init(&lower); // sort the new bios into lower-level vs same-level lists

                        bio_list_init(&same);

                        while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) // walk the bios this call produced

                                if (q == bio->bi_disk->queue)

                                        bio_list_add(&same, bio);

                                else

                                        bio_list_add(&lower, bio);

                        /* now assemble so we handle the lowest level first */

                        bio_list_merge(&bio_list_on_stack[0], &lower); // reassemble: lowest level first

                        bio_list_merge(&bio_list_on_stack[0], &same);

                        bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);

                } else {

                        if (unlikely(!blk_queue_dying(q) &&

                                        (bio->bi_opf & REQ_NOWAIT)))

                                bio_wouldblock_error(bio);

                        else

                                bio_io_error(bio);

                }

                bio = bio_list_pop(&bio_list_on_stack[0]); // next bio, if any

        } while (bio);

        current->bio_list = NULL; /* deactivate */

 

out:

        if (q)

                blk_queue_exit(q);

        return ret;

}
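The on-stack bio_list dance above exists to turn recursion (a stacked device resubmitting bios through generic_make_request) into iteration. The pattern, reduced to a user-space sketch with a hypothetical work-item type:

```c
#include <stddef.h>

/* Hypothetical work item standing in for struct bio. */
struct work { struct work *next; };

static struct work *pending; /* models current->bio_list */

static void push(struct work *w) { w->next = pending; pending = w; }

/* Handle one item plus everything it spawns, without recursion: a
 * handler queues follow-up work instead of calling itself, exactly as
 * a stacked device resubmits bios via generic_make_request(). */
int process_all(struct work *first, struct work *spawn, int nspawn)
{
    int handled = 0;

    pending = NULL;
    push(first);
    while (pending) {
        struct work *w = pending;

        pending = w->next;
        handled++;
        if (w == first) /* only the first item "resubmits" children */
            for (int i = 0; i < nspawn; i++)
                push(&spawn[i]);
    }
    return handled;
}
```

The loop bounds stack usage to one frame no matter how deeply devices are stacked, which is precisely why the kernel keeps the list in current->bio_list instead of recursing.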

7     References

An article that does not give its references is not playing fair.

https://lwn.net/Articles/736534/

https://lwn.net/Articles/738449/

https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)

https://miuv.blog/2017/10/21/linux-block-mq-simple-walkthrough/

https://hyunyoung2.github.io/2016/09/14/Multi_Queue/

http://ari-ava.blogspot.com/2014/07/opw-linux-block-io-layer-part-4-multi.html

Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems

The multiqueue block layer

Blk-mq: new multi-queue block IO queueing mechanism
