Load average, process states on Linux, and some loadavg myths


In my professional work I often conduct recruitment interviews, part of which consists of questions about Linux-family systems. One of the standard questions concerns an important, although in fact not very informative, system statistic: the system load average (abbreviated to loadavg). Unfortunately, even candidates with genuinely impressive professional experience are sometimes unable to correctly describe and interpret the information conveyed by the load average.

So I asked myself what the reason for this might be. I checked the usual “sources of truth”: search engines (DuckDuckGo and Google) and Wikipedia. The latter in fact described the statistic most completely. Unfortunately, the Internet is full of simplifications and even outright errors on this subject. This is hardly surprising, as the system manuals themselves often do not fully describe what this statistic is.

To discuss the system load average, however, we first need to step back and introduce a minimum of theory about processes and their states in Linux. I will draw on a proven source of knowledge: the EuroLinux Training Manual for Enterprise Linux Administration at the Basic Level.

Process states on Linux

Excerpts from the “Enterprise Linux Administration I” training manual

Processes can be in different states. To understand the output of commands that describe processes, you must first understand these states. The table below lists the flags displayed next to process names, together with the corresponding state names in the internal kernel structures and an explanation of each state.

Flag | Name | Kernel name and explanation
R | Running or runnable | TASK_RUNNING. The process is being executed or is waiting to be executed. It can run in user space (user code) or in kernel space (kernel code)
S | Interruptible sleep | TASK_INTERRUPTIBLE. The process is waiting for a condition to be met, e.g. access to a resource or a signal (such a signal is, for example, the exit of a child process)
D | Uninterruptible sleep | TASK_UNINTERRUPTIBLE. The process is sleeping as for the S flag, but will not respond to signals. This state is used when interrupting or killing the process could leave a device or the process itself in an undefined state. It is generally used during I/O operations. A process also enters this state when its memory pages are being written out to or read back from extended memory, i.e. "swapped"
T | Stopped | TASK_STOPPED. The process was stopped by the appropriate signals. It can be resumed by another signal
t | Traced | TASK_TRACED. The process is being debugged and its execution is being traced. It is temporarily paused so that its state can be inspected
Z | Zombie | EXIT_ZOMBIE. The child process has finished and wants to report its exit code to the parent. Unfortunately, some parents do not reap their children properly. In that case the child becomes a zombie
X | Dead | EXIT_DEAD. The process has finished and the parent has reaped its child. All process resources are released. Under normal circumstances this state should not be visible

In addition to the flags listed above, there is also a K flag for "killable" (a D-state process that can nevertheless be killed) and an I flag for "idle". Processes in the latter state are rarely seen, however, as these are usually internal kernel threads.

The diagram below shows the different states of processes and the basic interactions between them.

process states

Where are the process states defined?

The Linux process itself and the process states are defined in the kernel sources in include/linux/sched.h ("sched" is an abbreviation of "scheduler"). Here is the code snippet that defines the process states:

/* Used in tsk->state: */
#define TASK_RUNNING			0x0000
#define TASK_INTERRUPTIBLE		0x0001
#define TASK_UNINTERRUPTIBLE		0x0002
#define __TASK_STOPPED			0x0004
#define __TASK_TRACED			0x0008
/* Used in tsk->exit_state: */
#define EXIT_DEAD			0x0010
#define EXIT_ZOMBIE			0x0020
#define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_PARKED			0x0040
#define TASK_DEAD			0x0080
#define TASK_WAKEKILL			0x0100
#define TASK_WAKING			0x0200
#define TASK_NOLOAD			0x0400
#define TASK_NEW			0x0800
#define TASK_STATE_MAX			0x1000

/* Convenience macros for the sake of set_current_state: */
#define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)

#define TASK_IDLE			(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Convenience macros for the sake of wake_up(): */
#define TASK_NORMAL			(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* get_task_state(): */
#define TASK_REPORT			(TASK_RUNNING | TASK_INTERRUPTIBLE | \
					 TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
					 __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
					 TASK_PARKED)

The most important process states are those that the kernel lets us read easily with the get_task_state function, which maps the state to one of the strings contained in task_state_array.

static const char * const task_state_array[] = {

	/* states in TASK_REPORT: */
	"R (running)",		/* 0x00 */
	"S (sleeping)",		/* 0x01 */
	"D (disk sleep)",	/* 0x02 */
	"T (stopped)",		/* 0x04 */
	"t (tracing stop)",	/* 0x08 */
	"X (dead)",		/* 0x10 */
	"Z (zombie)",		/* 0x20 */
	"P (parked)",		/* 0x40 */

	/* states beyond TASK_REPORT: */
	"I (idle)",		/* 0x80 */
};

static inline const char *get_task_state(struct task_struct *tsk)
{
	BUILD_BUG_ON(1 + ilog2(TASK_REPORT_MAX) != ARRAY_SIZE(task_state_array));
	return task_state_array[task_state_index(tsk)];
}

Knowing the process states at the kernel level (note the careful choice of their values, which allows bitwise operations to be performed on them), we can move on and discuss what loadavg is.

Load average – definition

According to the manual (man 5 proc), loadavg contains the average number of processes in states R and D. This means that the following processes are counted:

  • queued for execution (state R)
  • currently being executed (state R)
  • sleeping in state D, i.e. waiting for disk I/O.

Most write-ups on the Internet overlook the fact that state R can mean both being executed and queuing for execution. The concept of waiting to run (state R) is also often confused with waiting for I/O (state D). In addition, no attention is paid to the fact that when RAM is saturated and swap is used, processes also fall into state D.

Load average is given for 3 periods:

  • the last minute
  • the last 5 minutes
  • the last 15 minutes.

This allows the administrator to understand the system load trend.

The load average should be considered in the context of the number of CPUs available, taking into account SMT (Simultaneous MultiThreading) and HT (Hyper-Threading, Intel's implementation of SMT). The easiest way to find the number of available processors (in the sense of logical processors) is the nproc command.

An example of calling the uptime command (which returns, among other things, the load average) and nproc:

[user@localhost loadavg]$ uptime
 17:22:55 up  5:37,  5 users,  load average: 4.92, 4.89, 4.84
[user@localhost loadavg]$ nproc
8

The simplest rule of thumb for the load average is that it should be smaller than the number of available compute units. Of course, depending on the actual use of individual resources, the system and its services may still not perform satisfactorily. Looking at the example above, with 8 logical processors and a loadavg of about 5, the system is not overloaded. However, if the number of processors were less than 5, we would be dealing with a potentially overloaded system.

File interface /proc/loadavg

Under the /proc/loadavg path, the Linux kernel provides a file interface exposing information about the average system load. It is used by programs such as uptime, w and top.

[user@localhost ~]$ strace w |& grep loadavg 
read(6, "grep\0--color=auto\0loadavg\0", 2047) = 26
openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 6
[user@localhost ~]$ strace uptime |& grep loadavg 
openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 

The /proc/loadavg file returns the following values (pseudocode transcription):

LOAD_AVG_1_MIN LOAD_AVG_5_MIN LOAD_AVG_15_MIN NUMBER_OF_RUNNING/NUMBER_OF_THREADS LAST_CREATED_PID

Sample reading of the file /proc/loadavg

[user@localhost ~]$ cat /proc/loadavg
5.93 2.14 1.67 17/1396 173294

As for the source file defining this interface, it is linux/fs/proc/loadavg.c. Let me include a code snippet, as I find it a very good example of how fascinating kernel code can be:

static int loadavg_proc_show(struct seq_file *m, void *v)
{
	unsigned long avnrun[3];

	get_avenrun(avnrun, FIXED_1/200, 0);

	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
		nr_running(), nr_threads,
		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
	return 0;
}

Here I would like to draw attention to the non-obvious techniques used in such a short piece of code.

  1. First, note the use of an unsigned long variable, which is then projected onto an integer part by the LOAD_INT macro and a fractional part by the LOAD_FRAC macro. Among most high-level language developers this solution would raise eyebrows. However, in kernel programming, where code efficiency is critical, this type of "hack" in the sense of an exceptionally clever solution (http://www.catb.org/jargon/html/H/hack.html) is particularly desirable.
  2. It is also worth noting the nr_threads variable, which holds the number of tasks in the system, and the nr_running() function, which returns only the tasks that are in the R state.

Reading the load average with a system call

To read loadavg we can use the sysinfo system call. It fills in a sysinfo structure that looks like this:

struct sysinfo {
   long uptime;             /* Seconds since boot */
   unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
   unsigned long totalram;  /* Total usable main memory size */
   unsigned long freeram;   /* Available memory size */
   unsigned long sharedram; /* Amount of shared memory */
   unsigned long bufferram; /* Memory used by buffers */
   unsigned long totalswap; /* Total swap space size */
   unsigned long freeswap;  /* Swap space still available */
   unsigned short procs;    /* Number of current processes */
   unsigned int mem_unit;   /* Memory unit size in bytes */
   char _f[20-2*sizeof(long)-sizeof(int)];
                            /* Padding to 64 bytes */
};

Note that, as man 2 sysinfo explains, the memory and swap fields are given in multiples of mem_unit bytes (on typical desktop and server systems mem_unit is simply 1).

If we want to write a program that reads loadavg without using the file interface (the /proc/loadavg discussed earlier), the above system call is the easiest way to collect information about the state of the system. An example of a simple report in C:

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    // constants
    const double mb = 1024 * 1024;
    // loads[] is fixed point, scaled by 1 << SI_LOAD_SHIFT, i.e. 65536
    const double load_avg_sysinfo_scale = 65536.0;
    // get the sysinfo structure
    struct sysinfo s;
    if (sysinfo(&s) != 0) {
        perror("sysinfo");
        return 1;
    }
    // print the report; memory fields are in units of s.mem_unit bytes
    printf("--- System Report ---\n");
    printf("Load AVG: 1min[%4.2f] 5min[%4.2f] 15min[%4.2f]\n",
            s.loads[0] / load_avg_sysinfo_scale,
            s.loads[1] / load_avg_sysinfo_scale,
            s.loads[2] / load_avg_sysinfo_scale);
    printf("Total RAM: %8.2f MB\n", (double)s.totalram * s.mem_unit / mb);
    printf("Free (not used) RAM: %8.2f MB\n", (double)s.freeram * s.mem_unit / mb);
    printf("Total swap memory: %8.2f MB\n", (double)s.totalswap * s.mem_unit / mb);
    printf("Free swap memory: %8.2f MB\n", (double)s.freeswap * s.mem_unit / mb);
    printf("Total process count: %hu\n", s.procs);
    return 0;
}

and its compilation and launch:

[user@localhost loadavg]$ gcc -Wpedantic system-report.c -o system-report && ./system-report
--- System Report ---
Load AVG: 1min[0.64] 5min[0.44] 15min[0.36]
Total RAM: 31871.28 MB
Free (not used) RAM: 22587.86 MB
Total swap memory: 20095.99 MB
Free swap memory: 20095.99 MB
Total process count: 1470

Myths Related to Load Average

Myth 0 – load average is the number of processes currently running on the system/CPU

This is the most common mistake (stemming from a lack of knowledge about process states) that I observe during recruitment interviews. As emphasised several times above, the load average includes both processes in state R (running and ready to run) and in state D (uninterruptible sleep, typically waiting for I/O). If the load average counted only running processes, it could never be greater than the number of compute units in the system. This answer therefore quickly leads the candidate into contradictions.

Myth 1 – load average is the arithmetic mean

This is one of the more difficult "follow-up questions" (additional questions asked during a recruitment interview, often aimed at checking how thorough the candidate's knowledge of the subject is).

The man 5 proc manual says:

/proc/loadavg
    The first three fields in this file are load average figures
    giving the number of jobs in the run queue (state R) or waiting
    for disk I/O (state D) averaged over 1, 5, and 15 minutes. (...)

It would seem, then, that we are dealing with an ordinary arithmetic mean. The final answer is given by the Linux sources, which explicitly describe the load average as:

/*
 * ...
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 * ...
 */

Thus, we are dealing with an exponential moving average, in which older samples have less impact on the result than newer ones.

Myth 2 – high load average is always related to the processor

As previously explained, loadavg counts processes in states R and D using an exponential moving average. This means that these processes can:

  • wait for CPU (R)
  • run on CPU (R)
  • wait for disk (D)
  • wait for the memory pages to be loaded from the disk (D).

So we have at least three components that can be overloaded:

  • CPU
  • disk
  • RAM memory (depletion and need to use swap).

There are therefore several possibilities which can occur individually, in pairs, or all at once.

In addition, a high loadavg need not mean that the system is overloaded. For example, a processor running at half capacity combined with swapping (RAM saturation requiring memory pages to be written to and read from disk) can result in a high loadavg.

The load average is therefore a statistic whose main value is to signal the need for further diagnostic steps to understand the state of the system and its processes.

Myth 3 - load average really makes sense

As a curiosity on the border between computer science and philosophy, I would like to quote a comment from the Linux sources, in the kernel/sched/loadavg.c file.

/*
 * kernel/sched/loadavg.c
 *
 * This file contains the magic bits required to compute the global loadavg
 * figure. Its a silly number but people think its important. We go through
 * great pains to make it work on big machines and tickless kernels.
 */

The point here is that the Linux kernel can manage the processor in such a way that timer interrupts do not occur at fixed intervals (hence the name tickless: a "tick" is the periodic timer interrupt, fired at a fixed frequency such as 1000 Hz, and "less" means "without") but are scheduled dynamically.

To be honest, when I first saw this comment, I was speechless. After all, we are talking about one of the most important statistics in the system from an administrator's point of view! When reading the source files, it is worth noting that from a kernel programmer's point of view this statistic is not at all obvious or "certain" (in terms of trust), for several reasons:

  • processors dynamically scale their performance according to a number of factors, including, but not limited to, load and temperature
  • to calculate it accurately, the system would have to be frozen at least while the loadavg value is read (as another comment in this file puts it: "These values are estimates at best, so no need for locking.")
  • for multiprocessor machines, as well as for tickless kernels, the calculation is even more approximate.

A natural question therefore arises: does the load average make sense? One might be tempted to say that if the authors themselves have doubts, maybe not entirely. However, even if at an expert level of knowledge we can see the flaws of the solution, in general it works, and it is worth keeping Murphy's law in mind here. As I said at the outset, please treat the above statement as the loose considerations of a modest author.

Summary

This article discussed the states processes can be in, what the load average looks like from the inside (i.e. in the Linux kernel sources), and how we, as administrators and programmers, can get at this value. Finally, we dealt with common myths about this useful statistic.

Bibliography

https://elixir.bootlin.com/linux/v5.9.3/source/include/linux/sched.h#L69
https://elixir.bootlin.com/linux/v5.9.3/source/fs/proc/array.c#L129
https://elixir.bootlin.com/linux/v5.9.3/source/fs/proc/loadavg.c
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
man 5 proc
man 2 sysinfo