|  | CPU frequency and voltage scaling code in the Linux(TM) kernel | 
|  |  | 
|  |  | 
|  | L i n u x    C P U F r e q | 
|  |  | 
|  | C P U F r e q   G o v e r n o r s | 
|  |  | 
|  | - information for users and developers - | 
|  |  | 
|  |  | 
|  | Dominik Brodowski  <linux@brodo.de> | 
|  | some additions and corrections by Nico Golde <nico@ngolde.de> | 
|  |  | 
|  |  | 
|  |  | 
|  | Clock scaling allows you to change the clock speed of the CPUs on the | 
|  | fly. This is a nice method to save battery power, because the lower | 
|  | the clock speed, the less power the CPU consumes. | 
|  |  | 
|  |  | 
|  | Contents: | 
|  | --------- | 
|  | 1.   What is a CPUFreq Governor? | 
|  |  | 
|  | 2.   Governors In the Linux Kernel | 
|  | 2.1  Performance | 
|  | 2.2  Powersave | 
|  | 2.3  Userspace | 
|  | 2.4  Ondemand | 
|  | 2.5  Conservative | 
|  | 2.6  Interactive | 
|  |  | 
|  | 3.   The Governor Interface in the CPUfreq Core | 
|  |  | 
|  |  | 
|  |  | 
|  | 1. What Is A CPUFreq Governor? | 
|  | ============================== | 
|  |  | 
|  | Most cpufreq drivers (in fact, all except one, longrun) or even most | 
|  | cpu frequency scaling algorithms only allow the CPU to be set to one | 
|  | frequency. In order to offer dynamic frequency scaling, the cpufreq | 
|  | core must be able to tell these drivers about a "target frequency". So | 
|  | these specific drivers will be transformed to offer a "->target" | 
|  | call instead of the existing "->setpolicy" call. For "longrun", all | 
|  | stays the same, though. | 
|  |  | 
|  | How to decide what frequency within the CPUfreq policy should be used? | 
|  | That's done using "cpufreq governors". Two are already in this patch | 
|  | -- they're the already existing "powersave" and "performance" which | 
|  | set the frequency statically to the lowest or highest frequency, | 
|  | respectively. At least two more such governors will be ready for | 
|  | addition in the near future, and likely many more will follow, as there | 
|  | are various theories and models of dynamic frequency scaling. With the | 
|  | generic interface that cpufreq offers to scaling governors, these can | 
|  | be tested extensively, and the best one can be selected for each | 
|  | specific use. | 
|  |  | 
|  | Basically, it's the following flow graph: | 
|  |  | 
|  | CPU can be set to switch independently   |   CPU can only be set | 
|  |     within specific "limits"             |   to specific frequencies | 
|  |  | 
|  |                    "CPUfreq policy" | 
|  |     consists of frequency limits (policy->{min,max}) | 
|  |          and CPUfreq governor to be used | 
|  |                   /              \ | 
|  |                  /                \ | 
|  |                 /         the cpufreq governor decides | 
|  |                /          (dynamically or statically) | 
|  |               /           what target_freq to set within | 
|  |              /            the limits of policy->{min,max} | 
|  |             /                      \ | 
|  |            /                        \ | 
|  | Using the ->setpolicy call,    Using the ->target call, | 
|  |     the limits and the         the frequency closest | 
|  |      "policy" is set.          to target_freq is set. | 
|  |                                It is assured that it | 
|  |                                is within policy->{min,max} | 
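The right-hand branch of the flow graph can be sketched in user space as follows. This is only an illustrative model, not kernel code: the function names and the frequency table are hypothetical, frequencies are assumed to be in kHz, and "closest" is simplified to the nearest supported value, whereas the real ->target call also takes a relation argument.

```python
def clamp_to_policy(target_khz, policy_min_khz, policy_max_khz):
    """Keep a governor's requested frequency inside policy->{min,max}."""
    return max(policy_min_khz, min(target_khz, policy_max_khz))


def resolve_target(target_khz, policy_min_khz, policy_max_khz, supported_khz):
    """Model of a ->target call: pick the supported frequency closest to
    target_khz while staying inside the policy limits."""
    # Only supported frequencies inside the policy limits are candidates.
    candidates = [f for f in supported_khz
                  if policy_min_khz <= f <= policy_max_khz]
    if not candidates:
        raise ValueError("no supported frequency inside the policy limits")
    clamped = clamp_to_policy(target_khz, policy_min_khz, policy_max_khz)
    # Simplification: nearest candidate by absolute distance.
    return min(candidates, key=lambda f: abs(f - clamped))
```

For example, with a hypothetical table of 800000, 1200000, 1600000 and 2000000 kHz and policy limits of 1000000 to 1800000 kHz, a request for 2500000 kHz resolves to 1600000 kHz.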
|  |  | 
|  |  | 
|  | 2. Governors In the Linux Kernel | 
|  | ================================ | 
|  |  | 
|  | 2.1 Performance | 
|  | --------------- | 
|  |  | 
|  | The CPUfreq governor "performance" sets the CPU statically to the | 
|  | highest frequency within the borders of scaling_min_freq and | 
|  | scaling_max_freq. | 
|  |  | 
|  |  | 
|  | 2.2 Powersave | 
|  | ------------- | 
|  |  | 
|  | The CPUfreq governor "powersave" sets the CPU statically to the | 
|  | lowest frequency within the borders of scaling_min_freq and | 
|  | scaling_max_freq. | 
|  |  | 
|  |  | 
|  | 2.3 Userspace | 
|  | ------------- | 
|  |  | 
|  | The CPUfreq governor "userspace" allows the user, or any userspace | 
|  | program running with UID "root", to set the CPU to a specific frequency | 
|  | by making a sysfs file "scaling_setspeed" available in the CPU-device | 
|  | directory. | 
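A minimal sketch of how a root-owned program might use this file is shown here. The helper name is hypothetical, and two details are assumptions not stated above: that the CPU-device directory is something like /sys/devices/system/cpu/cpu0/cpufreq, and that the value is written in kHz.

```python
from pathlib import Path


def set_speed(cpufreq_dir, freq_khz):
    """Write a frequency (assumed to be in kHz) to scaling_setspeed.

    cpufreq_dir is the CPU-device directory that holds the sysfs file;
    writing there normally requires root and the "userspace" governor.
    """
    path = Path(cpufreq_dir) / "scaling_setspeed"
    path.write_text(f"{int(freq_khz)}\n")
    return path
```

Passing the directory as an argument keeps the helper usable against a scratch directory when trying it out without touching the real sysfs tree.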
|  |  | 
|  |  | 
|  | 2.4 Ondemand | 
|  | ------------ | 
|  |  | 
|  | The CPUfreq governor "ondemand" sets the CPU frequency depending on | 
|  | the current usage. To do this the CPU must have the capability to | 
|  | switch the frequency very quickly.  A number of parameters are | 
|  | accessible through sysfs files: | 
|  |  | 
|  | sampling_rate: measured in uS (10^-6 seconds), this is how often you | 
|  | want the kernel to look at the CPU usage and to make decisions on | 
|  | what to do about the frequency.  Typically this is set to values of | 
|  | around '10000' or more. Its default value is (cmp. with users-guide.txt): | 
|  | transition_latency * 1000 | 
|  | Be aware that transition latency is in ns and sampling_rate is in us, so you | 
|  | get the same sysfs value by default. | 
|  | The sampling rate should always be adjusted with the transition latency | 
|  | in mind. To set the sampling rate to 750 times the transition latency | 
|  | in bash (as said, 1000 is the default), do: | 
|  | echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) \ | 
|  |     >ondemand/sampling_rate | 
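The unit conversion behind the sampling_rate default and the shell command can be written out as a small sketch. The function name is hypothetical; the arithmetic mirrors the integer division that the shell performs.

```python
def sampling_rate_us(transition_latency_ns, multiplier=1000):
    """Sampling rate in us for a transition latency given in ns.

    With the default multiplier of 1000, the ns-to-us conversion cancels
    out and the result has the same numeric value as the latency.
    """
    return transition_latency_ns * multiplier // 1000
```

With a transition latency of 10000 ns, the default multiplier gives 10000 us, and a multiplier of 750 gives 7500 us.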
|  |  | 
|  | sampling_rate_min: | 
|  | The sampling rate is limited by the HW transition latency: | 
|  | transition_latency * 100 | 
|  | Or by kernel restrictions: | 
|  | If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. | 
|  | If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the | 
|  | limits depend on the CONFIG_HZ option: | 
|  | HZ=1000: min=20000us  (20ms) | 
|  | HZ=250:  min=80000us  (80ms) | 
|  | HZ=100:  min=200000us (200ms) | 
|  | The higher of the kernel and HW latency restrictions is shown and | 
|  | used as the minimum sampling rate. | 
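The way the two restrictions combine can be sketched as below. Two points are assumptions rather than statements from the text: the HZ-dependent limits in the table all equal 20 timer ticks, so the sketch computes them that way, and the hardware limit is read with the same ns-to-us convention as the sampling_rate default. All function names are hypothetical.

```python
def kernel_min_us(hz, no_hz_common):
    """Kernel-side lower bound on the sampling rate, in us."""
    if no_hz_common:
        return 10_000  # fixed 10ms limit
    # Inferred from the table: every listed value equals 20 timer ticks.
    return 20 * (1_000_000 // hz)


def hw_min_us(transition_latency_ns):
    """Hardware-side lower bound: transition_latency * 100, with the
    latency in ns and the result in us (assumed unit convention)."""
    return transition_latency_ns * 100 // 1000


def sampling_rate_min_us(transition_latency_ns, hz, no_hz_common):
    """The higher of the two restrictions is the effective minimum."""
    return max(kernel_min_us(hz, no_hz_common),
               hw_min_us(transition_latency_ns))
```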
|  |  | 
|  | up_threshold: defines what the average CPU usage between the samplings | 
|  | of 'sampling_rate' needs to be for the kernel to make a decision on | 
|  | whether it should increase the frequency.  For example when it is set | 
|  | to its default value of '95' it means that between the checking | 
|  | intervals the CPU needs to be on average more than 95% in use to then | 
|  | decide that the CPU frequency needs to be increased. | 
|  |  | 
|  | ignore_nice_load: this parameter takes a value of '0' or '1'. When | 
|  | set to '0' (its default), all processes are counted towards the | 
|  | 'cpu utilisation' value.  When set to '1', the processes that are | 
|  | run with a 'nice' value will not count (and thus be ignored) in the | 
|  | overall usage calculation.  This is useful if you are running a CPU | 
|  | intensive calculation on your laptop and do not care how long it | 
|  | takes to complete: you can 'nice' it and prevent it from taking part | 
|  | in the decision on whether to increase your CPU frequency. | 
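How up_threshold and ignore_nice_load interact can be illustrated with a rough model. The function names and the split of a sampling interval into busy, nice and idle time are hypothetical simplifications; the governor's real accounting is more involved.

```python
def cpu_load_percent(busy_us, nice_us, idle_us, ignore_nice_load=0):
    """Average load over one sampling interval, in percent.

    busy_us excludes niced work; nice_us is time spent in niced processes.
    With ignore_nice_load=1 the niced time does not count as usage.
    """
    total = busy_us + nice_us + idle_us
    if total == 0:
        return 0
    counted = busy_us if ignore_nice_load else busy_us + nice_us
    return 100 * counted // total


def wants_increase(load_percent, up_threshold=95):
    """True when the load is above up_threshold (default 95)."""
    return load_percent > up_threshold
```

In an interval with 500 us of normal work, 480 us of niced work and 20 us idle, the load is 98% and triggers an increase by default, but only 50% when ignore_nice_load is 1.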
|  |  | 
|  | sampling_down_factor: this parameter controls the rate at which the | 
|  | kernel makes a decision on when to decrease the frequency while running | 
|  | at top speed. When set to 1 (the default) decisions to reevaluate load | 
|  | are made at the same interval regardless of current clock speed. But | 
|  | when set to greater than 1 (e.g. 100) it acts as a multiplier for the | 
|  | scheduling interval for reevaluating load when the CPU is at its top | 
|  | speed due to high load. This improves performance by reducing the overhead | 
|  | of load evaluation and helping the CPU stay at its top speed when truly | 
|  | busy, rather than shifting back and forth in speed. This tunable has no | 
|  | effect on behavior at lower speeds/lower CPU loads. | 
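The multiplier effect of sampling_down_factor can be sketched as follows. The function and parameter names are hypothetical, and the model only captures the interval arithmetic described above.

```python
def next_sample_delay_us(sampling_rate_us, sampling_down_factor,
                         at_top_speed_due_to_load):
    """Delay until load is next re-evaluated, in us.

    The factor stretches the interval only while the CPU runs at its top
    speed because of high load; otherwise the plain sampling rate applies.
    """
    if at_top_speed_due_to_load and sampling_down_factor > 1:
        return sampling_rate_us * sampling_down_factor
    return sampling_rate_us
```

With a sampling rate of 10000 us and a factor of 100, load is re-evaluated only once per second while the CPU is busy at top speed, and every 10000 us otherwise.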
|  |  | 
|  | powersave_bias: this parameter takes a value between 0 and 1000. It | 
|  | defines the percentage (times 10) value of the target frequency that | 
|  | will be shaved off of the target. For example, when set to 100 (10%) | 
|  | and the ondemand governor would have targeted 1000 MHz, it will target | 
|  | 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 | 
|  | (disabled) by default. | 
|  | When the AMD frequency sensitivity powersave bias driver | 
|  | (drivers/cpufreq/amd_freq_sensitivity.c) is loaded, this parameter | 
|  | defines the workload frequency sensitivity threshold below which a | 
|  | lower frequency is chosen instead of the ondemand governor's original | 
|  | target. The frequency sensitivity is a hardware reported (on AMD | 
|  | Family 16h Processors and above) value between 0 and 100% that tells | 
|  | software how the performance of the workload running on a CPU will | 
|  | change when frequency changes. A workload with sensitivity of 0% | 
|  | (memory/IO-bound) will not perform any better on higher core | 
|  | frequency, whereas a workload with sensitivity of 100% (CPU-bound) | 
|  | will perform better the higher the frequency. When the driver is | 
|  | loaded, this is set to 400 by default, so for CPUs running workloads | 
|  | with a sensitivity value below 40% a lower frequency is chosen. | 
|  | Unloading the driver or writing 0 will disable this feature. | 
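Both readings of powersave_bias come down to simple arithmetic on a value in tenths of a percent. The sketch below uses hypothetical function names and is not taken from the ondemand or AMD driver sources.

```python
def biased_target_khz(target_khz, powersave_bias):
    """Shave powersave_bias (in tenths of a percent) off the target."""
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000")
    return target_khz - target_khz * powersave_bias // 1000


def prefers_lower_freq(sensitivity_percent, powersave_bias=400):
    """AMD frequency sensitivity reading: a lower frequency is chosen when
    the workload's sensitivity is below powersave_bias / 10 percent.
    A bias of 0 never matches, which disables the feature."""
    return sensitivity_percent * 10 < powersave_bias
```

A target of 1000000 kHz with a bias of 100 becomes 900000 kHz, which matches the 1000 MHz to 900 MHz example above.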
|  |  | 
|  |  | 
|  | 2.5 Conservative | 
|  | ---------------- | 
|  |  | 
|  | The CPUfreq governor "conservative", much like the "ondemand" | 
|  | governor, sets the CPU depending on the current usage.  It differs in | 
|  | behaviour in that it gracefully increases and decreases the CPU speed | 
|  | rather than jumping to max speed the moment there is any load on the | 
|  | CPU.  This behaviour is more suitable in a battery powered environment. | 
|  | The governor is tweaked in the same manner as the "ondemand" governor | 
|  | through sysfs with the addition of: | 
|  |  | 
|  | freq_step: this describes what percentage steps the cpu freq should be | 
|  | increased and decreased smoothly by.  By default the cpu frequency will | 
|  | increase in 5% chunks of your maximum cpu frequency.  You can change this | 
|  | value to anywhere between 0 and 100 where '0' will effectively lock your | 
|  | CPU at a speed regardless of its load whilst '100' will, in theory, make | 
|  | it behave identically to the "ondemand" governor. | 
|  |  | 
|  | down_threshold: same as the 'up_threshold' found for the "ondemand" | 
|  | governor but for the opposite direction.  For example when set to its | 
|  | default value of '20' it means that the CPU usage needs to be below | 
|  | 20% between samples to have the frequency decreased. | 
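One evaluation step of the "conservative" behaviour, combining freq_step with the two thresholds, might look like the sketch below. The function name is hypothetical, frequencies are assumed to be in kHz, and up_threshold is left as a required argument because its default for this governor is not stated here.

```python
def conservative_next_khz(cur_khz, load_percent, min_khz, max_khz,
                          up_threshold, down_threshold=20, freq_step=5):
    """Return the frequency after one sample under a stepwise policy."""
    # The step is a percentage of the maximum frequency, not the current one.
    step_khz = max_khz * freq_step // 100
    if load_percent > up_threshold:
        return min(cur_khz + step_khz, max_khz)
    if load_percent < down_threshold:
        return max(cur_khz - step_khz, min_khz)
    return cur_khz
```

With a 2000000 kHz maximum, the default freq_step of 5 moves the frequency in 100000 kHz chunks. A freq_step of 0 yields a zero step and so never changes the frequency, while 100 reaches the maximum in a single step.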
|  |  | 
|  | sampling_down_factor: similar in function to the "ondemand" governor's | 
|  | parameter of the same name. But in "conservative", it controls the | 
|  | rate at which the kernel makes a decision on when to decrease the | 
|  | frequency while running at any speed. Load for frequency increase is | 
|  | still evaluated every sampling rate. | 
|  |  | 
|  | 2.6 Interactive | 
|  | --------------- | 
|  |  | 
|  | The CPUfreq governor "interactive" is designed for latency-sensitive, | 
|  | interactive workloads. This governor sets the CPU speed depending on | 
|  | usage, similar to "ondemand" and "conservative" governors, but with a | 
|  | different set of configurable behaviors. | 
|  |  | 
|  | The tuneable values for this governor are: | 
|  |  | 
|  | target_loads: CPU load values used to adjust speed to influence the | 
|  | current CPU load toward that value.  In general, the lower the target | 
|  | load, the more often the governor will raise CPU speeds to bring load | 
|  | below the target.  The format is a single target load, optionally | 
|  | followed by pairs of CPU speeds and CPU loads to target at or above | 
|  | those speeds.  Colons can be used between the speeds and associated | 
|  | target loads for readability.  For example: | 
|  |  | 
|  | 85 1000000:90 1700000:99 | 
|  |  | 
|  | targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until | 
|  | 1.7GHz and above, at which load 99% is targeted.  If speeds are | 
|  | specified these must appear in ascending order.  Higher target load | 
|  | values are typically specified for higher speeds, that is, target load | 
|  | values also usually appear in an ascending order. The default is | 
|  | target load 90% for all speeds. | 
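The target_loads format can be parsed and queried with a short sketch like the one below. The function names are hypothetical and this is not the governor's own parser; it only follows the format rules described above, with speeds in kHz as in the example.

```python
def parse_speed_table(text):
    """Parse 'default [speed:value ...]' into (default, [(speed, value)]).

    Colons are optional separators, so they are treated like spaces.
    """
    tokens = text.replace(":", " ").split()
    if not tokens or len(tokens) % 2 == 0:
        raise ValueError("expected one value followed by speed/value pairs")
    numbers = [int(t) for t in tokens]
    pairs = list(zip(numbers[1::2], numbers[2::2]))
    speeds = [speed for speed, _ in pairs]
    if speeds != sorted(speeds):
        raise ValueError("speeds must appear in ascending order")
    return numbers[0], pairs


def lookup(table, speed_khz):
    """Return the value that applies at speed_khz."""
    default, pairs = table
    value = default
    for speed, pair_value in pairs:
        if speed_khz >= speed:  # entry applies at or above its speed
            value = pair_value
    return value
```

For "85 1000000:90 1700000:99", a lookup at 800000 kHz gives 85, at 1500000 kHz gives 90, and at 1700000 kHz or above gives 99. The same speed/value layout is used by above_hispeed_delay.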
|  |  | 
|  | min_sample_time: The minimum amount of time to spend at the current | 
|  | frequency before ramping down. Default is 80000 uS. | 
|  |  | 
|  | hispeed_freq: An intermediate "hi speed" at which to initially ramp | 
|  | when CPU load hits the value specified in go_hispeed_load.  If load | 
|  | stays high for the amount of time specified in above_hispeed_delay, | 
|  | then speed may be bumped higher.  Default is the maximum speed | 
|  | allowed by the policy at governor initialization time. | 
|  |  | 
|  | go_hispeed_load: The CPU load at which to ramp to hispeed_freq. | 
|  | Default is 99%. | 
|  |  | 
|  | above_hispeed_delay: When speed is at or above hispeed_freq, wait for | 
|  | this long before raising speed in response to continued high load. | 
|  | The format is a single delay value, optionally followed by pairs of | 
|  | CPU speeds and the delay to use at or above those speeds.  Colons can | 
|  | be used between the speeds and associated delays for readability.  For | 
|  | example: | 
|  |  | 
|  | 80000 1300000:200000 1500000:40000 | 
|  |  | 
|  | uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay | 
|  | 200000 uS is used until speed 1.5 GHz, at which speed (and above) | 
|  | delay 40000 uS is used.  If speeds are specified these must appear in | 
|  | ascending order.  Default is 20000 uS. | 
|  |  | 
|  | timer_rate: Sample rate for reevaluating CPU load when the CPU is not | 
|  | idle.  A deferrable timer is used, such that the CPU will not be woken | 
|  | from idle to service this timer until something else needs to run. | 
|  | (The maximum time to allow deferring this timer when not running at | 
|  | minimum speed is configurable via timer_slack.)  Default is 20000 uS. | 
|  |  | 
|  | timer_slack: Maximum additional time to defer handling the governor | 
|  | sampling timer beyond timer_rate when running at speeds above the | 
|  | minimum.  For platforms that consume additional power at idle when | 
|  | CPUs are running at speeds greater than minimum, this places an upper | 
|  | bound on how long the timer will be deferred prior to re-evaluating | 
|  | load and dropping speed.  For example, if timer_rate is 20000uS and | 
|  | timer_slack is 10000uS then timers will be deferred for up to 30msec | 
|  | when not at lowest speed.  A value of -1 means defer timers | 
|  | indefinitely at all speeds.  Default is 80000 uS. | 
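The relationship between timer_rate and timer_slack can be sketched as follows. The function name is hypothetical, and treating the minimum speed as having no upper bound on deferral is an interpretation of the description above, not a statement from the governor source.

```python
def max_timer_deferral_us(timer_rate_us=20000, timer_slack_us=80000,
                          at_min_speed=False):
    """Upper bound on how long the sampling timer may be deferred, in us.

    Returns None when there is no bound: at minimum speed (assumed), or
    when timer_slack is -1.
    """
    if at_min_speed or timer_slack_us == -1:
        return None
    return timer_rate_us + timer_slack_us
```

With timer_rate at 20000 us and timer_slack at 10000 us, the bound is 30000 us, the 30 msec from the example above.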
|  |  | 
|  | boost: If non-zero, immediately boost speed of all CPUs to at least | 
|  | hispeed_freq until zero is written to this attribute.  If zero, allow | 
|  | CPU speeds to drop below hispeed_freq according to load as usual. | 
|  | Default is zero. | 
|  |  | 
|  | boostpulse: On each write, immediately boost speed of all CPUs to | 
|  | hispeed_freq for at least the period of time specified by | 
|  | boostpulse_duration, after which speeds are allowed to drop below | 
|  | hispeed_freq according to load as usual. | 
|  |  | 
|  | boostpulse_duration: Length of time to hold CPU speed at hispeed_freq | 
|  | on a write to boostpulse, before allowing speed to drop according to | 
|  | load as usual.  Default is 80000 uS. | 
|  |  | 
|  |  | 
|  | 3. The Governor Interface in the CPUfreq Core | 
|  | ============================================= | 
|  |  | 
|  | A new governor must register itself with the CPUfreq core using | 
|  | "cpufreq_register_governor". The struct cpufreq_governor, which has to | 
|  | be passed to that function, must contain the following values: | 
|  |  | 
|  | governor->name     -  A unique name for this governor | 
|  | governor->governor -  The governor callback function | 
|  | governor->owner    -  THIS_MODULE for the governor module (if | 
|  |                       appropriate) | 
|  |  | 
|  | The governor->governor callback is called with the current (or to-be-set) | 
|  | cpufreq_policy struct for that CPU, and an unsigned int event. The | 
|  | following events are currently defined: | 
|  |  | 
|  | CPUFREQ_GOV_START:   This governor shall start its duty for the CPU | 
|  |                      policy->cpu | 
|  | CPUFREQ_GOV_STOP:    This governor shall end its duty for the CPU | 
|  |                      policy->cpu | 
|  | CPUFREQ_GOV_LIMITS:  The limits for CPU policy->cpu have changed to | 
|  |                      policy->min and policy->max. | 
|  |  | 
|  | If you need other "events" outside of your driver, _only_ use the | 
|  | cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the | 
|  | CPUfreq core to ensure proper locking. | 
|  |  | 
|  |  | 
|  | The CPUfreq governor may call the CPU processor driver using one of | 
|  | these two functions: | 
|  |  | 
|  | int cpufreq_driver_target(struct cpufreq_policy *policy, | 
|  |                           unsigned int target_freq, | 
|  |                           unsigned int relation); | 
|  |  | 
|  | int __cpufreq_driver_target(struct cpufreq_policy *policy, | 
|  |                             unsigned int target_freq, | 
|  |                             unsigned int relation); | 
|  |  | 
|  | target_freq must be within policy->min and policy->max, of course. | 
|  | What's the difference between these two functions? When your governor | 
|  | is still in a direct code path of a call to governor->governor, the | 
|  | per-CPU cpufreq lock is still held in the cpufreq core, and there's | 
|  | no need to lock it again (in fact, this would cause a deadlock). So | 
|  | use __cpufreq_driver_target only in these cases. In all other cases | 
|  | (for example, when there's a "daemonized" function that wakes up | 
|  | every second), use cpufreq_driver_target to lock the cpufreq per-CPU | 
|  | lock before the command is passed to the cpufreq processor driver. | 
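The locking rule can be illustrated with a toy user-space model. Everything in it is hypothetical: the class, its method names and the use of a Python threading.Lock merely stand in for the per-CPU cpufreq lock, and a timeout replaces the real deadlock so that the mistake is observable.

```python
import threading


class PolicyModel:
    """Toy stand-in for one cpufreq policy and its per-CPU lock."""

    def __init__(self, min_khz, max_khz):
        self.min_khz, self.max_khz = min_khz, max_khz
        self.cur_khz = min_khz
        self.lock = threading.Lock()  # non-reentrant on purpose

    def _driver_target(self, target_khz):
        # Models __cpufreq_driver_target: the caller already holds the lock.
        if not self.lock.locked():
            raise RuntimeError("caller must hold the policy lock")
        if not self.min_khz <= target_khz <= self.max_khz:
            raise ValueError("target_freq outside policy->min/max")
        self.cur_khz = target_khz
        return 0

    def driver_target(self, target_khz, timeout_s=0.1):
        # Models cpufreq_driver_target: takes the lock itself.
        if not self.lock.acquire(timeout=timeout_s):
            raise RuntimeError("lock already held; the kernel would deadlock")
        try:
            return self._driver_target(target_khz)
        finally:
            self.lock.release()

    def run_governor_callback(self, callback):
        # Models the core calling governor->governor with the lock held.
        with self.lock:
            return callback(self)
```

A callback passed to run_governor_callback should call _driver_target, while code running outside that path (the "daemonized" case) should call driver_target. Calling driver_target from inside the callback raises RuntimeError in this model instead of hanging.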
|  |  |