| CFQ (Complete Fairness Queueing) | 
 | =============================== | 
 |  | 
 | The main aim of CFQ scheduler is to provide a fair allocation of the disk | 
 | I/O bandwidth for all the processes which requests an I/O operation. | 
 |  | 
 | CFQ maintains the per process queue for the processes which request I/O | 
 | operation(synchronous requests). In case of asynchronous requests, all the | 
 | requests from all the processes are batched together according to their | 
 | process's I/O priority. | 
 |  | 
 | CFQ ioscheduler tunables | 
 | ======================== | 
 |  | 
 | slice_idle | 
 | ---------- | 
 | This specifies how long CFQ should idle for next request on certain cfq queues | 
 | (for sequential workloads) and service trees (for random workloads) before | 
 | queue is expired and CFQ selects next queue to dispatch from. | 
 |  | 
 | By default slice_idle is a non-zero value. That means by default we idle on | 
 | queues/service trees. This can be very helpful on highly seeky media like | 
 | single spindle SATA/SAS disks where we can cut down on overall number of | 
 | seeks and see improved throughput. | 
 |  | 
 | Setting slice_idle to 0 will remove all the idling on queues/service tree | 
 | level and one should see an overall improved throughput on faster storage | 
 | devices like multiple SATA/SAS disks in hardware RAID configuration. The down | 
 | side is that isolation provided from WRITES also goes down and notion of | 
 | IO priority becomes weaker. | 
 |  | 
 | So depending on storage and workload, it might be useful to set slice_idle=0. | 
 | In general I think for SATA/SAS disks and software RAID of SATA/SAS disks | 
 | keeping slice_idle enabled should be useful. For any configurations where | 
 | there are multiple spindles behind single LUN (Host based hardware RAID | 
 | controller or for storage arrays), setting slice_idle=0 might end up in better | 
 | throughput and acceptable latencies. | 
 |  | 
 | back_seek_max | 
 | ------------- | 
 | This specifies, given in Kbytes, the maximum "distance" for backward seeking. | 
 | The distance is the amount of space from the current head location to the | 
 | sectors that are backward in terms of distance. | 
 |  | 
 | This parameter allows the scheduler to anticipate requests in the "backward" | 
 | direction and consider them as being the "next" if they are within this | 
 | distance from the current head location. | 
 |  | 
 | back_seek_penalty | 
 | ----------------- | 
 | This parameter is used to compute the cost of backward seeking. If the | 
 | backward distance of request is just 1/back_seek_penalty from a "front" | 
 | request, then the seeking cost of two requests is considered equivalent. | 
 |  | 
 | So scheduler will not bias toward one or the other request (otherwise scheduler | 
 | will bias toward front request). Default value of back_seek_penalty is 2. | 
 |  | 
 | fifo_expire_async | 
 | ----------------- | 
 | This parameter is used to set the timeout of asynchronous requests. Default | 
 | value of this is 248ms. | 
 |  | 
 | fifo_expire_sync | 
 | ---------------- | 
 | This parameter is used to set the timeout of synchronous requests. Default | 
 | value of this is 124ms. In case to favor synchronous requests over asynchronous | 
 | one, this value should be decreased relative to fifo_expire_async. | 
 |  | 
 | group_idle | 
 | ----------- | 
 | This parameter forces idling at the CFQ group level instead of CFQ | 
 | queue level. This was introduced after a bottleneck was observed | 
 | in higher end storage due to idle on sequential queue and allow dispatch | 
 | from a single queue. The idea with this parameter is that it can be run with | 
 | slice_idle=0 and group_idle=8, so that idling does not happen on individual | 
 | queues in the group but happens overall on the group and thus still keeps the | 
 | IO controller working. | 
 | Not idling on individual queues in the group will dispatch requests from | 
 | multiple queues in the group at the same time and achieve higher throughput | 
 | on higher end storage. | 
 |  | 
 | Default value for this parameter is 8ms. | 
 |  | 
 | latency | 
 | ------- | 
 | This parameter is used to enable/disable the latency mode of the CFQ | 
 | scheduler. If latency mode (called low_latency) is enabled, CFQ tries | 
 | to recompute the slice time for each process based on the target_latency set | 
 | for the system. This favors fairness over throughput. Disabling low | 
 | latency (setting it to 0) ignores target latency, allowing each process in the | 
 | system to get a full time slice. | 
 |  | 
 | By default low latency mode is enabled. | 
 |  | 
 | target_latency | 
 | -------------- | 
 | This parameter is used to calculate the time slice for a process if cfq's | 
 | latency mode is enabled. It will ensure that sync requests have an estimated | 
 | latency. But if sequential workload is higher(e.g. sequential read), | 
 | then to meet the latency constraints, throughput may decrease because of less | 
 | time for each process to issue I/O request before the cfq queue is switched. | 
 |  | 
 | Though this can be overcome by disabling the latency_mode, it may increase | 
 | the read latency for some applications. This parameter allows for changing | 
 | target_latency through the sysfs interface which can provide the balanced | 
 | throughput and read latency. | 
 |  | 
 | Default value for target_latency is 300ms. | 
 |  | 
 | slice_async | 
 | ----------- | 
 | This parameter is same as of slice_sync but for asynchronous queue. The | 
 | default value is 40ms. | 
 |  | 
 | slice_async_rq | 
 | -------------- | 
 | This parameter is used to limit the dispatching of asynchronous request to | 
 | device request queue in queue's slice time. The maximum number of request that | 
 | are allowed to be dispatched also depends upon the io priority. Default value | 
 | for this is 2. | 
 |  | 
 | slice_sync | 
 | ---------- | 
 | When a queue is selected for execution, the queues IO requests are only | 
 | executed for a certain amount of time(time_slice) before switching to another | 
 | queue. This parameter is used to calculate the time slice of synchronous | 
 | queue. | 
 |  | 
 | time_slice is computed using the below equation:- | 
 | time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the | 
 | time_slice of synchronous queue, increase the value of slice_sync. Default | 
 | value is 100ms. | 
 |  | 
 | quantum | 
 | ------- | 
 | This specifies the number of request dispatched to the device queue. In a | 
 | queue's time slice, a request will not be dispatched if the number of request | 
 | in the device exceeds this parameter. This parameter is used for synchronous | 
 | request. | 
 |  | 
 | In case of storage with several disk, this setting can limit the parallel | 
 | processing of request. Therefore, increasing the value can improve the | 
 | performance although this can cause the latency of some I/O to increase due | 
 | to more number of requests. | 
 |  | 
 | CFQ Group scheduling | 
 | ==================== | 
 |  | 
 | CFQ supports blkio cgroup and has "blkio." prefixed files in each | 
 | blkio cgroup directory. It is weight-based and there are four knobs | 
 | for configuration - weight[_device] and leaf_weight[_device]. | 
 | Internal cgroup nodes (the ones with children) can also have tasks in | 
 | them, so the former two configure how much proportion the cgroup as a | 
 | whole is entitled to at its parent's level while the latter two | 
 | configure how much proportion the tasks in the cgroup have compared to | 
 | its direct children. | 
 |  | 
 | Another way to think about it is assuming that each internal node has | 
 | an implicit leaf child node which hosts all the tasks whose weight is | 
 | configured by leaf_weight[_device]. Let's assume a blkio hierarchy | 
 | composed of five cgroups - root, A, B, AA and AB - with the following | 
 | weights where the names represent the hierarchy. | 
 |  | 
 |         weight leaf_weight | 
 |  root :  125    125 | 
 |  A    :  500    750 | 
 |  B    :  250    500 | 
 |  AA   :  500    500 | 
 |  AB   : 1000    500 | 
 |  | 
 | root never has a parent making its weight is meaningless. For backward | 
 | compatibility, weight is always kept in sync with leaf_weight. B, AA | 
 | and AB have no child and thus its tasks have no children cgroup to | 
 | compete with. They always get 100% of what the cgroup won at the | 
 | parent level. Considering only the weights which matter, the hierarchy | 
 | looks like the following. | 
 |  | 
 |           root | 
 |        /    |   \ | 
 |       A     B    leaf | 
 |      500   250   125 | 
 |    /  |  \ | 
 |   AA  AB  leaf | 
 |  500 1000 750 | 
 |  | 
 | If all cgroups have active IOs and competing with each other, disk | 
 | time will be distributed like the following. | 
 |  | 
 | Distribution below root. The total active weight at this level is | 
 | A:500 + B:250 + C:125 = 875. | 
 |  | 
 |  root-leaf :   125 /  875      =~ 14% | 
 |  A         :   500 /  875      =~ 57% | 
 |  B(-leaf)  :   250 /  875      =~ 28% | 
 |  | 
 | A has children and further distributes its 57% among the children and | 
 | the implicit leaf node. The total active weight at this level is | 
 | AA:500 + AB:1000 + A-leaf:750 = 2250. | 
 |  | 
 |  A-leaf    : ( 750 / 2250) * A =~ 19% | 
 |  AA(-leaf) : ( 500 / 2250) * A =~ 12% | 
 |  AB(-leaf) : (1000 / 2250) * A =~ 25% | 
 |  | 
 | CFQ IOPS Mode for group scheduling | 
 | =================================== | 
 | Basic CFQ design is to provide priority based time slices. Higher priority | 
 | process gets bigger time slice and lower priority process gets smaller time | 
 | slice. Measuring time becomes harder if storage is fast and supports NCQ and | 
 | it would be better to dispatch multiple requests from multiple cfq queues in | 
 | request queue at a time. In such scenario, it is not possible to measure time | 
 | consumed by single queue accurately. | 
 |  | 
 | What is possible though is to measure number of requests dispatched from a | 
 | single queue and also allow dispatch from multiple cfq queue at the same time. | 
 | This effectively becomes the fairness in terms of IOPS (IO operations per | 
 | second). | 
 |  | 
 | If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches | 
 | to IOPS mode and starts providing fairness in terms of number of requests | 
 | dispatched. Note that this mode switching takes effect only for group | 
 | scheduling. For non-cgroup users nothing should change. | 
 |  | 
 | CFQ IO scheduler Idling Theory | 
 | =============================== | 
 | Idling on a queue is primarily about waiting for the next request to come | 
 | on same queue after completion of a request. In this process CFQ will not | 
 | dispatch requests from other cfq queues even if requests are pending there. | 
 |  | 
 | The rationale behind idling is that it can cut down on number of seeks | 
 | on rotational media. For example, if a process is doing dependent | 
 | sequential reads (next read will come on only after completion of previous | 
 | one), then not dispatching request from other queue should help as we | 
 | did not move the disk head and kept on dispatching sequential IO from | 
 | one queue. | 
 |  | 
 | CFQ has following service trees and various queues are put on these trees. | 
 |  | 
 | 	sync-idle	sync-noidle	async | 
 |  | 
 | All cfq queues doing synchronous sequential IO go on to sync-idle tree. | 
 | On this tree we idle on each queue individually. | 
 |  | 
 | All synchronous non-sequential queues go on sync-noidle tree. Also any | 
 | request which are marked with REQ_NOIDLE go on this service tree. On this | 
 | tree we do not idle on individual queues instead idle on the whole group | 
 | of queues or the tree. So if there are 4 queues waiting for IO to dispatch | 
 | we will idle only once last queue has dispatched the IO and there is | 
 | no more IO on this service tree. | 
 |  | 
 | All async writes go on async service tree. There is no idling on async | 
 | queues. | 
 |  | 
 | CFQ has some optimizations for SSDs and if it detects a non-rotational | 
 | media which can support higher queue depth (multiple requests at in | 
 | flight at a time), then it cuts down on idling of individual queues and | 
 | all the queues move to sync-noidle tree and only tree idle remains. This | 
 | tree idling provides isolation with buffered write queues on async tree. | 
 |  | 
 | FAQ | 
 | === | 
 | Q1. Why to idle at all on queues marked with REQ_NOIDLE. | 
 |  | 
 | A1. We only do tree idle (all queues on sync-noidle tree) on queues marked | 
 |     with REQ_NOIDLE. This helps in providing isolation with all the sync-idle | 
 |     queues. Otherwise in presence of many sequential readers, other | 
 |     synchronous IO might not get fair share of disk. | 
 |  | 
 |     For example, if there are 10 sequential readers doing IO and they get | 
 |     100ms each. If a REQ_NOIDLE request comes in, it will be scheduled | 
 |     roughly after 1 second. If after completion of REQ_NOIDLE request we | 
 |     do not idle, and after a couple of milli seconds a another REQ_NOIDLE | 
 |     request comes in, again it will be scheduled after 1second. Repeat it | 
 |     and notice how a workload can lose its disk share and suffer due to | 
 |     multiple sequential readers. | 
 |  | 
 |     fsync can generate dependent IO where bunch of data is written in the | 
 |     context of fsync, and later some journaling data is written. Journaling | 
 |     data comes in only after fsync has finished its IO (atleast for ext4 | 
 |     that seemed to be the case). Now if one decides not to idle on fsync | 
 |     thread due to REQ_NOIDLE, then next journaling write will not get | 
 |     scheduled for another second. A process doing small fsync, will suffer | 
 |     badly in presence of multiple sequential readers. | 
 |  | 
 |     Hence doing tree idling on threads using REQ_NOIDLE flag on requests | 
 |     provides isolation from multiple sequential readers and at the same | 
 |     time we do not idle on individual threads. | 
 |  | 
 | Q2. When to specify REQ_NOIDLE | 
 | A2. I would think whenever one is doing synchronous write and not expecting | 
 |     more writes to be dispatched from same context soon, should be able | 
 |     to specify REQ_NOIDLE on writes and that probably should work well for | 
 |     most of the cases. |