CFQ (Complete Fairness Queueing)

The main aim of the CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth for all the processes which request an I/O operation.

CFQ maintains a per-process queue for the processes which request I/O
operations (synchronous requests). In case of asynchronous requests, all the
requests from all the processes are batched together according to their
process's I/O priority.

CFQ ioscheduler tunables

slice_idle
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.

By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single spindle SATA/SAS disks where we can cut down on the overall number of
seeks and see improved throughput.

Setting slice_idle to 0 will remove all the idling on queues/service tree
level and one should see overall improved throughput on faster storage
devices like multiple SATA/SAS disks in a hardware RAID configuration. The
downside is that the isolation provided from WRITES also goes down and the
notion of IO priority becomes weaker.

So depending on storage and workload, it might be useful to set slice_idle=0.
In general, for SATA/SAS disks and software RAID of SATA/SAS disks, keeping
slice_idle enabled should be useful. For any configuration where there are
multiple spindles behind a single LUN (host based hardware RAID controller or
storage arrays), setting slice_idle=0 might end up in better throughput and
acceptable latencies.
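
All of the tunables below live under /sys/block/<dev>/queue/iosched/ while
CFQ is the active scheduler. As a sketch (sda is a stand-in for the actual
device):

  # Check the current idle window (in milliseconds).
  cat /sys/block/sda/queue/iosched/slice_idle

  # Disable per-queue idling, e.g. for a multi-spindle hardware RAID LUN.
  echo 0 > /sys/block/sda/queue/iosched/slice_idle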

back_seek_max
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance of
a "front" request, then the seeking cost of the two requests is considered
equivalent. So the scheduler will not bias toward one or the other request
(otherwise the scheduler would bias toward the front request). The default
value of back_seek_penalty is 2.
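
A brief worked example with the default penalty of 2, plus the corresponding
sysfs writes (the device name and distances are illustrative only):

  # A request 1024 KB behind the head costs as much as one 2048 KB ahead:
  #   effective backward cost = 1024 KB * back_seek_penalty(2) = 2048 KB,
  # so CFQ treats the two requests as equally close.
  echo 16384 > /sys/block/sda/queue/iosched/back_seek_max   # 16 MB back-seeks
  echo 2     > /sys/block/sda/queue/iosched/back_seek_penalty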

fifo_expire_async
This parameter is used to set the timeout of asynchronous requests. Default
value of this is 248ms.

fifo_expire_sync
This parameter is used to set the timeout of synchronous requests. Default
value of this is 124ms. In order to favor synchronous requests over
asynchronous ones, this value should be decreased relative to
fifo_expire_async.
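
For instance, synchronous requests could be favored further by shrinking
their FIFO timeout relative to the asynchronous one (values illustrative,
device name assumed):

  # Expire waiting sync requests after 80ms but let async ones wait 400ms,
  # so starving sync requests get serviced sooner.
  echo 80  > /sys/block/sda/queue/iosched/fifo_expire_sync
  echo 400 > /sys/block/sda/queue/iosched/fifo_expire_async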

group_idle
This parameter forces idling at the CFQ group level instead of the CFQ
queue level. It was introduced after a bottleneck was observed on higher
end storage, caused by idling on a sequential queue and thereby allowing
dispatch from only a single queue at a time. The idea with this parameter
is that it can be run with slice_idle=0 and group_idle=8, so that idling
does not happen on individual queues in the group but happens overall on
the group, and thus still keeps the IO controller working.

Not idling on individual queues in the group will dispatch requests from
multiple queues in the group at the same time and achieve higher throughput
on higher end storage.

Default value for this parameter is 8ms.
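
The combination described above would look like this through sysfs (device
name assumed):

  # Keep cgroup IO control effective on fast storage: idle only at the
  # group level, not on individual queues.
  echo 0 > /sys/block/sda/queue/iosched/slice_idle
  echo 8 > /sys/block/sda/queue/iosched/group_idle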

low_latency
This parameter is used to enable/disable the latency mode of the CFQ
scheduler. If latency mode (called low_latency) is enabled, CFQ tries
to recompute the slice time for each process based on the target_latency set
for the system. This favors fairness over throughput. Disabling low
latency (setting it to 0) ignores target latency, allowing each process in the
system to get a full time slice.

By default low latency mode is enabled.
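
A minimal sketch for a throughput-oriented box (hypothetical device name):

  # Trade some fairness and latency for throughput by disabling latency
  # mode, letting every process run out its full time slice.
  echo 0 > /sys/block/sda/queue/iosched/low_latency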

target_latency
This parameter is used to calculate the time slice for a process if cfq's
latency mode is enabled. It will ensure that sync requests have an estimated
latency. But if the sequential workload is higher (e.g. sequential reads),
then to meet the latency constraints, throughput may decrease because of less
time for each process to issue I/O requests before the cfq queue is switched.

Though this can be overcome by disabling the latency mode, it may increase
the read latency for some applications. This parameter allows changing
target_latency through the sysfs interface, which can provide balanced
throughput and read latency.

Default value for target_latency is 300ms.
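
For example, keeping latency mode enabled but relaxing the target can win
back some sequential throughput (the value is illustrative only):

  # Give each queue more slice time before CFQ shrinks slices to meet
  # the latency target: raise it from the 300ms default to 500ms.
  echo 500 > /sys/block/sda/queue/iosched/target_latency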

slice_async
This parameter is the same as slice_sync but for the asynchronous queue. The
default value is 40ms.

slice_async_rq
This parameter is used to limit the dispatching of asynchronous requests to
the device request queue in a queue's slice time. The maximum number of
requests that are allowed to be dispatched also depends upon the io priority.
Default value for this is 2.
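
As a sketch, async writeback could be squeezed further in favor of
synchronous readers (values illustrative, device name assumed):

  # Halve the async slice and cap async dispatches per slice at 1.
  echo 20 > /sys/block/sda/queue/iosched/slice_async
  echo 1  > /sys/block/sda/queue/iosched/slice_async_rq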

slice_sync
When a queue is selected for execution, the queue's IO requests are only
executed for a certain amount of time (time_slice) before switching to another
queue. This parameter is used to calculate the time slice of a synchronous
queue.

time_slice is computed using the below equation:
time_slice = slice_sync + (slice_sync/5 * (4 - prio))

To increase the time_slice of synchronous queues, increase the value of
slice_sync. Default value is 100ms.
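
Plugging the default slice_sync of 100ms into the equation shows how the IO
priority level (prio, 0 being highest) stretches or shrinks the slice:

  #   prio 0 : 100 + (100/5) * (4 - 0) = 180ms   (longest slice)
  #   prio 4 : 100 + (100/5) * (4 - 4) = 100ms   (default priority)
  #   prio 7 : 100 + (100/5) * (4 - 7) =  40ms   (shortest slice)
  echo 100 > /sys/block/sda/queue/iosched/slice_sync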

quantum
This specifies the number of requests dispatched to the device queue. In a
queue's time slice, a request will not be dispatched if the number of requests
in the device exceeds this parameter. This parameter is used for synchronous
requests.

In case of storage with several disks, this setting can limit the parallel
processing of requests. Therefore, increasing the value can improve
performance, although this can cause the latency of some I/O to increase due
to more requests being queued.
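
On a multi-spindle LUN this might look like the following (the value and
device name are assumptions chosen to illustrate the knob):

  # Allow more sync requests in the device queue at once so all spindles
  # behind the LUN stay busy, at some cost in per-request latency.
  echo 16 > /sys/block/sda/queue/iosched/quantum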

CFQ Group scheduling

CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level, while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.

Another way to think about it is assuming that each internal node has an
implicit leaf child node which hosts all the tasks, and whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.

        weight  leaf_weight
 root :   125      125
 A    :   500      750
 B    :   250      500
 AA   :   500      500
 AB   :  1000      500

root never has a parent, making its weight meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no children and thus their tasks have no children cgroups to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.

          root
       /    |   \
      A     B    leaf
     500   250   125
   /  |  \
  AA  AB  leaf
 500 1000  750

If all cgroups have active IOs and are competing with each other, disk
time will be distributed like the following.

Distribution below root. The total active weight at this level is
A:500 + B:250 + root-leaf:125 = 875.

 root-leaf :   125 /  875      =~ 14%
 A         :   500 /  875      =~ 57%
 B(-leaf)  :   250 /  875      =~ 28%

A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.

 A-leaf    : ( 750 / 2250) * A =~ 19%
 AA(-leaf) : ( 500 / 2250) * A =~ 12%
 AB(-leaf) : (1000 / 2250) * A =~ 25%
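
Assuming the blkio controller is mounted at the conventional cgroup-v1 path
/sys/fs/cgroup/blkio, the hierarchy above could be set up like this:

  mkdir -p /sys/fs/cgroup/blkio/A/AA /sys/fs/cgroup/blkio/A/AB \
           /sys/fs/cgroup/blkio/B
  # root's weight is kept in sync with its leaf_weight.
  echo  125 > /sys/fs/cgroup/blkio/blkio.leaf_weight
  echo  500 > /sys/fs/cgroup/blkio/A/blkio.weight
  echo  750 > /sys/fs/cgroup/blkio/A/blkio.leaf_weight
  echo  250 > /sys/fs/cgroup/blkio/B/blkio.weight
  echo  500 > /sys/fs/cgroup/blkio/A/AA/blkio.weight
  echo 1000 > /sys/fs/cgroup/blkio/A/AB/blkio.weight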

CFQ IOPS Mode for group scheduling

Basic CFQ design is to provide priority based time slices. A higher priority
process gets a bigger time slice and a lower priority process gets a smaller
time slice. Measuring time becomes harder if storage is fast and supports NCQ,
as it is then better to dispatch multiple requests from multiple cfq queues
into the request queue at a time. In such a scenario, it is not possible to
measure the time consumed by a single queue accurately.

What is possible though is to measure the number of requests dispatched from
a single queue and also allow dispatch from multiple cfq queues at the same
time. This effectively becomes fairness in terms of IOPS (IO operations per
second).

If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
to IOPS mode and starts providing fairness in terms of number of requests
dispatched. Note that this mode switching takes effect only for group
scheduling. For non-cgroup users nothing should change.
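
So, on NCQ-capable storage, switching group scheduling to IOPS-based
fairness is in effect a single write (device name assumed):

  # With NCQ present, this flips CFQ group scheduling into IOPS mode:
  # groups are charged per request dispatched instead of per time slice.
  echo 0 > /sys/block/sda/queue/iosched/slice_idle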

CFQ IO scheduler Idling Theory

Idling on a queue is primarily about waiting for the next request to come
on the same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.

The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read will come only after completion of the
previous one), then not dispatching requests from other queues should help,
as we did not move the disk head and kept on dispatching sequential IO from
one single queue.

CFQ has the following service trees, and various queues are put on these
trees.

	sync-idle	sync-noidle	async

All cfq queues doing synchronous sequential IO go onto the sync-idle tree.
On this tree we idle on each queue individually.

All synchronous non-sequential queues go onto the sync-noidle tree. Also any
requests which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch,
we will idle only once the last queue has dispatched its IO and there is
no more IO on this service tree.

All async writes go on the async service tree. There is no idling on async
queues.

CFQ has some optimizations for SSDs: if it detects a non-rotational
medium which can support a higher queue depth (multiple requests in
flight at a time), then it cuts down on idling of individual queues and
all the queues move to the sync-noidle tree and only tree idle remains. This
tree idling provides isolation from buffered write queues on the async tree.

Q1. Why idle at all on queues marked with REQ_NOIDLE?

A1. We only do tree idle (all queues on the sync-noidle tree) on queues
    marked with REQ_NOIDLE. This helps in providing isolation from all the
    sync-idle queues. Otherwise, in the presence of many sequential readers,
    other synchronous IO might not get a fair share of the disk.

    For example, say there are 10 sequential readers doing IO and they get
    100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
    roughly after 1 second. If after completion of the REQ_NOIDLE request we
    do not idle, and after a couple of milliseconds another REQ_NOIDLE
    request comes in, again it will be scheduled after 1 second. Repeat this
    and notice how a workload can lose its disk share and suffer due to
    multiple sequential readers.

    fsync can generate dependent IO where a bunch of data is written in the
    context of fsync, and later some journaling data is written. Journaling
    data comes in only after fsync has finished its IO (at least for ext4
    that seemed to be the case). Now if one decides not to idle on the fsync
    thread due to REQ_NOIDLE, then the next journaling write will not get
    scheduled for another second. A process doing small fsyncs will suffer
    badly in the presence of multiple sequential readers.

    Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
    provides isolation from multiple sequential readers, and at the same
    time we do not idle on individual threads.

Q2. When to specify REQ_NOIDLE?

A2. I would think that whenever one is doing a synchronous write and not
    expecting more writes to be dispatched from the same context soon, one
    should be able to specify REQ_NOIDLE on the writes, and that probably
    should work well for most of the cases.