In my professional work I often conduct recruitment interviews, part of which are questions about Linux-family systems. One of the standard questions concerns an important, although in fact not very telling, system statistic: the load average (abbreviated to loadavg). Unfortunately, it happens, even with candidates with really impressive professional experience, that they are not able to correctly describe and interpret the information conveyed by the load average.
So I asked myself what the reason for this is. I checked trusted "sources of truth", such as search engines (DuckDuckGo and Google) and Wikipedia. The latter in fact described the statistic most fully. Unfortunately, the Internet is full of simplifications, or even outright errors, on this subject. There is nothing surprising in this, as the system manuals themselves often do not fully describe what this statistic is.
To discuss the average system load, however, it is necessary to step back and introduce a minimum of theory about processes and their states in Linux. I have allowed myself to draw from a proven source of knowledge: the EuroLinux training manual for Enterprise Linux administration at the basic level.
Process states on Linux
Excerpts from the “Enterprise Linux Administration I” training manual
Processes can be in different states. To understand the output of commands that describe processes, you must first understand these states. The table below lists the flags displayed next to process names and explains the state assigned to each flag in the internal kernel structures.
| Flag | Name | Kernel name and explanation |
|---|---|---|
| R | Running or runnable | `TASK_RUNNING`. The process is being executed or is waiting to be executed. It can run in user space (user code) or in kernel space (kernel code) |
| S | Interruptible sleep | `TASK_INTERRUPTIBLE`. The process is waiting for a condition to be met, e.g. access to a resource or a signal (such a signal is, for example, the exit of a child process) |
| D | Uninterruptible sleep | `TASK_UNINTERRUPTIBLE`. The process is sleeping as with the S flag, but will not respond to signals. It is used when interrupting or killing the process could put devices or the process in an undefined state. Generally used during I/O operations. The process also enters this state when its memory pages are dumped to or loaded from extended memory, i.e. "swapped" |
| T | Stopped | `TASK_STOPPED`. The process was stopped by the appropriate signals. It can be resumed by another signal |
| t | Traced | `TASK_TRACED`. The process is being debugged and its execution is being traced. It is temporarily paused so its status can be inspected |
| Z | Zombie | `EXIT_ZOMBIE`. The child process has finished and wants to inform the parent of its exit code. Unfortunately, sometimes parents do not handle their children properly. In that case the child becomes a zombie |
| X | Dead | `EXIT_DEAD`. The process has finished and the parent has cleaned up after its child. All process resources are released. Under normal circumstances this state should not be visible |
In addition to the flags listed above, there is also a K flag for "killable" (a D-flagged process that can nevertheless be killed) and an I flag for "idle". Processes in the I state are rarely seen, however, as it is usually used by internal kernel threads.
The diagram below shows the different states of processes and the basic interactions between them.
Where are the process states defined?
The Linux process itself and the process states are defined in the kernel sources in source/include/linux/sched.h ("sched" is an abbreviation of "scheduler"). Let me paste the code snippet that defines the states of a process:
```c
/* Used in tsk->state: */
#define TASK_RUNNING            0x0000
#define TASK_INTERRUPTIBLE      0x0001
#define TASK_UNINTERRUPTIBLE    0x0002
#define __TASK_STOPPED          0x0004
#define __TASK_TRACED           0x0008
/* Used in tsk->exit_state: */
#define EXIT_DEAD               0x0010
#define EXIT_ZOMBIE             0x0020
#define EXIT_TRACE              (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_PARKED             0x0040
#define TASK_DEAD               0x0080
#define TASK_WAKEKILL           0x0100
#define TASK_WAKING             0x0200
#define TASK_NOLOAD             0x0400
#define TASK_NEW                0x0800
#define TASK_STATE_MAX          0x1000

/* Convenience macros for the sake of set_current_state: */
#define TASK_KILLABLE           (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED            (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED             (TASK_WAKEKILL | __TASK_TRACED)
#define TASK_IDLE               (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Convenience macros for the sake of wake_up(): */
#define TASK_NORMAL             (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* get_task_state(): */
#define TASK_REPORT             (TASK_RUNNING | TASK_INTERRUPTIBLE | \
                                 TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
                                 __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
                                 TASK_PARKED)
```
The most important process states are those that the kernel lets us read easily with the const char *get_task_state() function, which maps the state to one of the strings in task_state_array.
```c
static const char * const task_state_array[] = {
    /* states in TASK_REPORT: */
    "R (running)",        /* 0x00 */
    "S (sleeping)",       /* 0x01 */
    "D (disk sleep)",     /* 0x02 */
    "T (stopped)",        /* 0x04 */
    "t (tracing stop)",   /* 0x08 */
    "X (dead)",           /* 0x10 */
    "Z (zombie)",         /* 0x20 */
    "P (parked)",         /* 0x40 */

    /* states beyond TASK_REPORT: */
    "I (idle)",           /* 0x80 */
};

static inline const char *get_task_state(struct task_struct *tsk)
{
    BUILD_BUG_ON(1 + ilog2(TASK_REPORT_MAX) != ARRAY_SIZE(task_state_array));

    return task_state_array[task_state_index(tsk)];
}
```
Now that we know the process states at the kernel level (note the deliberate choice of their values, which allows bitwise operations on them), we can move on and discuss what loadavg is.
Load average – definition
According to the man 5 proc manual, loadavg contains a statistic that is the average number of processes in states R and D. This means that we have here the following processes:
- queued for execution (state R)
- executed (state R)
- sleeping in D state – waiting for disk I/O.
Most articles on the Internet miss the fact that the R state can mean both executing and queuing for execution. The concept of a state waiting to be run (state R) is also often confused with the concept of a state waiting for I/O (state D). In addition, no attention is paid to the fact that when RAM is saturated and swap is used, processes also fall into state D.
Load average is given for 3 periods:
- the last minute
- the last 5 minutes
- the last 15 minutes.
This allows the administrator to understand the system load trend.
The load average should be considered in the context of the number of available CPUs, taking into account the SMT (Simultaneous Multithreading) and HT (Hyper-Threading) mechanisms. The easiest way to find the number of available processors (in the sense of logical processors) is to use the nproc command.

An example of calling uptime (which returns, among other things, the load average) and nproc:
```
[user@host loadavg]$ uptime
17:22:55 up 5:37, 5 users, load average: 4.92, 4.89, 4.84
[user@host loadavg]$ nproc
8
```
The simplest rule of thumb is that the load average should be smaller than the number of available compute units. Of course, depending on the actual use of individual resources, the system and its services may still not perform satisfactorily. Looking at the example above, with 8 logical processors and a loadavg of about 5, the system is not overloaded. However, if there were fewer than 5 processors, we would be dealing with a potentially overloaded system.
File interface /proc/loadavg
Under the /proc/loadavg path, the Linux kernel provides a file interface with information about the average system load. It is used by programs such as uptime, w or top.
```
[user@host ~]$ strace w |& grep loadavg
read(6, "grep\0--color=auto\0loadavg\0", 2047) = 26
openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 6
[user@host ~]$ strace uptime |& grep loadavg
openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) =
```
The /proc/loadavg file returns the following values (pseudocode transcription):

```
LOAD_AVG_1_MIN LOAD_AVG_5_MIN LOAD_AVG_15_MIN NUMBER_OF_RUNNING/NUMBER_OF_THREADS LAST_CREATED_PID
```
Sample reading of the file /proc/loadavg:

```
[user@host ~]$ cat /proc/loadavg
5.93 2.14 1.67 17/1396 173294
```
The source file that defines this interface is linux/fs/proc/loadavg.c. Let me include a code snippet, as I find it a very good example of how fascinating kernel code can be:
```c
static int loadavg_proc_show(struct seq_file *m, void *v)
{
    unsigned long avnrun[3];

    get_avenrun(avnrun, FIXED_1/200, 0);

    seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
        LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
        LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
        LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
        nr_running(), nr_threads,
        idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
    return 0;
}
```
Here I would like to draw attention to the non-obvious techniques used in such a short piece of code.

- First, note the unsigned long values, which are projected to an integer part by the LOAD_INT macro and to a fractional part by the LOAD_FRAC macro. Among most high-level language developers this solution raises some eyebrows. However, in kernel programming, where code efficiency is critical, this type of "hack" in the sense of an exceptionally clever solution (http://www.catb.org/jargon/html/H/hack.html) is particularly desirable.
- It is also worth paying attention to the nr_threads variable, which holds the number of tasks in the system, and the nr_running() function, which returns only the tasks that are in the R state.
Reading the load average with a system call
To read loadavg we can use the sysinfo system call. It returns a sysinfo structure that looks like this:
```c
struct sysinfo {
    long uptime;             /* Seconds since boot */
    unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
    unsigned long totalram;  /* Total usable main memory size */
    unsigned long freeram;   /* Available memory size */
    unsigned long sharedram; /* Amount of shared memory */
    unsigned long bufferram; /* Memory used by buffers */
    unsigned long totalswap; /* Total swap space size */
    unsigned long freeswap;  /* Swap space still available */
    unsigned short procs;    /* Number of current processes */
    unsigned int mem_unit;   /* Memory unit size in bytes */
    char _f[20-2*sizeof(long)-sizeof(int)];
                             /* Padding to 64 bytes */
};
```
If we want to write a program that reads loadavg without using the file interface (the /proc/loadavg file discussed earlier), the above system call is the easiest way to collect information about the state of the system. An example of a simple C report:
```c
#include <linux/kernel.h>
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    // consts
    const double mb = 1024 * 1024;
    // loads[] are fixed point with 16 fractional bits, hence the 1 << 16 scale
    const double load_scale = 1 << SI_LOAD_SHIFT;
    // get sysinfo structure
    struct sysinfo s;
    if (sysinfo(&s) != 0) {
        perror("sysinfo");
        return 1;
    }
    // print report; memory fields are expressed in units of mem_unit bytes
    printf("--- System Report ---\n");
    printf("Load AVG: 1min[%4.2f] 5min[%4.2f] 15min[%4.2f]\n",
           s.loads[0] / load_scale,
           s.loads[1] / load_scale,
           s.loads[2] / load_scale);
    printf("Total RAM: %8.2f MB\n", s.totalram * (double)s.mem_unit / mb);
    printf("Free (not used) RAM: %8.2f MB\n", s.freeram * (double)s.mem_unit / mb);
    printf("Total swap memory: %8.2f MB\n", s.totalswap * (double)s.mem_unit / mb);
    printf("Free swap memory: %8.2f MB\n", s.freeswap * (double)s.mem_unit / mb);
    printf("Total process count: %d\n", s.procs);
    return 0;
}
```
and its compilation along with a run:

```
[user@host loadavg]$ gcc -Wpedantic system-report.c -o system-report && ./system-report
--- System Report ---
Load AVG: 1min[0.32] 5min[0.22] 15min[0.18]
Total RAM: 31871.28 MB
Free (not used) RAM: 22587.86 MB
Total swap memory: 20095.99 MB
Free swap memory: 20095.99 MB
Total process count: 1470
```
Myths Related to Load Average
Myth 0 – load average is the number of processes currently running on the system/CPU
This is the most common mistake (resulting from a lack of knowledge about process states) that I observe during recruitment interviews. As we have emphasised several times above, the load average includes both processes in state R (running or ready to run) and in state D (waiting). If the load average counted only running processes, it could never be greater than the number of compute units in the system. This answer therefore quickly leads the candidate into contradictions.
Myth 1 – load average is the arithmetic mean
This is one of the more difficult "follow-up questions" (additional questions during the recruitment process, often aimed at checking how thorough the candidate's knowledge of the subject is).
The man 5 proc manual says:

```
/proc/loadavg
    The first three fields in this file are load average figures
    giving the number of jobs in the run queue (state R) or waiting
    for disk I/O (state D) averaged over 1, 5, and 15 minutes. (...)
```
It would seem, then, that we are dealing with an ordinary arithmetic mean. The final answer is given by the Linux sources, which explicitly describe the load average as:
```c
/*
 * ...
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 * ...
 */
```
Thus, we are dealing with an exponentially decaying (moving) average, in which older samples have less impact on the result than newer ones.
Myth 2 – high load average is always related to the processor
As previously explained, loadavg counts processes in states R and D using an exponentially decaying average. This means that these processes can:
- wait for CPU (R)
- run on CPU (R)
- wait for disk (D)
- wait for the memory pages to be loaded from the disk (D).
So we have at least three components that can be overloaded:
- CPU
- disk
- RAM memory (depletion and need to use swap).
There are therefore several possibilities which can occur individually, in pairs, or all at once.
In addition, a high loadavg does not necessarily mean that the system is overloaded. For example, a processor loaded at half its capacity plus swapping (RAM saturation forcing memory pages to be dumped to and loaded from disk) can result in a high loadavg.
The load average is therefore a statistic that mainly signals the need for further diagnostic steps to understand the state of the system and its processes.
Myth 3 – load average really makes sense
As a curiosity on the border of computer science and philosophy, I would like to quote the comment contained in the Linux sources in the linux/sched/loadavg.c file.
```c
/*
 * kernel/sched/loadavg.c
 *
 * This file contains the magic bits required to compute the global loadavg
 * figure. Its a silly number but people think its important. We go through
 * great pains to make it work on big machines and tickless kernels.
 */
```
The point here is that the Linux kernel can manage the processor in such a way that timer interrupts do not occur at fixed intervals but are scheduled dynamically (hence the name tickless: the regular timer "tick", firing e.g. at 1000 Hz, combined with the suffix "-less", meaning "without").
To be honest, when I first saw this comment, I was speechless. After all, we are talking about one of the most important statistics from an administrator's point of view! When reading the source files, however, note that from a kernel programmer's point of view this statistic is not at all obvious or "certain" (as far as trust is concerned), for several reasons:
- processors dynamically scale their performance according to a number of factors, including, but not limited to, load and temperature
- to calculate it accurately, the system would have to be frozen at least while the loadavg value is being read (moreover, another comment in this file mentions this: "These values are estimates at best, so no need for locking.")
- for multiprocessor machines, as well as tickless kernels, the calculation is even rougher.
A natural question therefore arises: does load average make sense? One might be tempted to say that if the authors themselves have doubts, maybe not quite. However, even if at an expert level of knowledge and understanding we can see the flaws of this solution, in general it works, so it is worth keeping Murphy's law in mind here. As I said at the outset, please treat the above statement as the loose musings of a modest author.
Summary
The article discusses the states processes can be in, how the load average looks from the inside (i.e. in the Linux kernel sources) and how we, as administrators and programmers, can get to this value. Finally, we dealt with common myths about this useful statistic.
Bibliography
https://elixir.bootlin.com/linux/v5.9.3/source/include/linux/sched.h#L69
https://elixir.bootlin.com/linux/v5.9.3/source/fs/proc/array.c#L129
https://elixir.bootlin.com/linux/v5.9.3/source/fs/proc/loadavg.c
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
man 5 proc
man 2 sysinfo