Tuesday, September 4, 2007

What the hell is Unix Load Average?

This article is mostly based on Linux, since that is what I have installed on this PC, and I was busy reading my daily dose of kernel source on http://lxr.linux.no

Load Average (LA) on a server is a touchy subject. It depends very much on the type of tasks running on the server. I have had servers sitting at 4 < LA < 6 performing better than another server at 0 < LA < 2. I have seen 40+ a few times, and could probably push it to 100+ #DEFINE (Please send a multi processor server).

Usually it's best if your load average is under the CPU-core count.

Slow I/O (disk, network) will increase the load average, since there are more processes waiting for a resource.

Think of load average as queue length: the number of processes that are running or waiting to be processed.

1) What am I talking about? How do I find the load average?
The following commands, to name a few, give the Load Average:
top
w
uptime
On Linux it's also available in /proc/loadavg.
vmstat is another way of getting similar data.
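
If you would rather have the numbers in a script than on screen, here is a minimal Python sketch. The helper name parse_loadavg is made up for illustration. The five fields it expects (three averages, running/total scheduling entities, last PID) are what Linux puts in /proc/loadavg. Other UNIX's will not have that file, which is why the sketch also shows os.getloadavg().

```python
import os


def parse_loadavg(line):
    """Split one line of /proc/loadavg into named fields (Linux layout)."""
    one, five, fifteen, entities, last_pid = line.split()
    running, total = entities.split("/")
    return {
        "1min": float(one),
        "5min": float(five),
        "15min": float(fifteen),
        "running": int(running),   # currently runnable scheduling entities
        "total": int(total),       # all scheduling entities on the system
        "last_pid": int(last_pid),
    }


if __name__ == "__main__":
    # Works on most UNIX's, returns the 1, 5 and 15 minute averages.
    print(os.getloadavg())

    # Linux only: the same numbers plus a bit extra.
    if os.path.exists("/proc/loadavg"):
        with open("/proc/loadavg") as f:
            print(parse_loadavg(f.read()))
```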

2) Where do the magic numbers come from?
The timer.c in the Linux kernel. The kernel samples the number of running/waiting processes and feeds the samples through a statistical formula (an exponentially damped moving average, which deserves its own blog post and examples) to produce averages over 1 min, 5 min and 15 min. These are then made available to the system.
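
Until that blog post happens, here is a rough Python sketch of the idea. It is not the kernel code: the kernel does this in fixed-point integer math, while this sketch uses floats, and the function names are my own. The assumptions are a sample every 5 seconds and a decay factor of exp(-5/window).

```python
import math

SAMPLE_SECONDS = 5  # assumption: the run queue is sampled about every 5 seconds


def next_load(load, active, window_seconds):
    """One update step of an exponentially damped moving average."""
    decay = math.exp(-SAMPLE_SECONDS / window_seconds)
    return load * decay + active * (1.0 - decay)


def simulate(active, seconds, window_seconds, start=0.0):
    """Feed a constant number of active tasks for `seconds` seconds."""
    load = start
    for _ in range(seconds // SAMPLE_SECONDS):
        load = next_load(load, active, window_seconds)
    return load


if __name__ == "__main__":
    # Two busy processes on an idle box, watched for one minute.
    for window in (60, 300, 900):
        print(window, round(simulate(2, 60, window), 2))
```

Starting from an idle box, two busy processes for one minute only push the 1 min figure to about 1.26, not 2. The averages lag behind the real queue, and the 15 min one lags the most.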

3) Interpretation:
Let's say your numbers were 5.99, 5.52, 5.13 and this is a single CPU server.
This tells me that over the last minute the CPU was overloaded by 499%: on average one process had the CPU and 4.99 more were waiting for it.
Obviously 100% could be handled. More important is that over 15 min the overload was still 413%, so this is not a short spike.

Thus a 6 CPU server with the same work would have had LA<1 per CPU (the reported numbers are not divided by the CPU count). If you could find such a thing.
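
The arithmetic behind those percentages, as a small sketch. The function names are invented for this post, and the only assumption is that "100%" means every CPU is busy and nothing is waiting.

```python
import os


def load_per_cpu(load, cpus):
    """Load average divided by the number of CPUs."""
    return load / cpus


def overload_percent(load, cpus):
    """How far above 'every CPU busy' the box is, in percent (0 if not above)."""
    return max(0.0, (load / cpus - 1.0) * 100.0)


if __name__ == "__main__":
    for load in (5.99, 5.52, 5.13):
        print(load, round(overload_percent(load, 1)), round(load_per_cpu(load, 6), 2))

    # The same sums for the box this runs on (cpu_count can return None).
    cpus = os.cpu_count() or 1
    print([round(load_per_cpu(v, cpus), 2) for v in os.getloadavg()])
```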

4) What to do now?
Do not press the RED button!!!! Whatever you do, do not press the RED button!
Find out if it's CPU or I/O.
if (I/O) Disk or network I/O?
if (Disk) Filesystem or Swap?

Use "vmstat 2" and stare at the screen; it will give you a good idea quickly. top is handy for finding the application that is hogging, as is "ps aux", though I am getting suspicious of ps's results.
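
If staring is not your thing, the same decision tree can be sketched in Python. This assumes the usual Linux vmstat columns (r, b, si, so, wa among them). The function names and the thresholds are invented for illustration, so tune them for your own box. Note that vmstat's b and wa columns mostly reflect disk, so network trouble still needs a separate look.

```python
def parse_vmstat(header, row):
    """Pair vmstat column names with the integer values of one sample row."""
    return dict(zip(header.split(), (int(v) for v in row.split())))


def guess_bottleneck(sample, cpus=1):
    """Very rough guess at what is driving the load. Thresholds are arbitrary."""
    if sample["si"] > 0 or sample["so"] > 0:
        return "swap"   # pages moving to/from swap
    if sample["b"] > 0 or sample["wa"] >= 20:
        return "io"     # processes blocked, or CPU mostly waiting on I/O
    if sample["r"] > cpus:
        return "cpu"    # more runnable processes than CPUs
    return "ok"


if __name__ == "__main__":
    header = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
    row = "6 0 0 812340 40212 390112 0 0 3 12 250 410 97 3 0 0 0"
    print(guess_bottleneck(parse_vmstat(header, row), cpus=1))
```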

5) The fluffy stuff.
Network services like http, smtp, ftp causing a high LA is usually I/O related. Look at your networking.
Databases can be interesting; here you'll have to look at network, disk, swap and CPU.

6) So what about threads? You had to ask... mmm. The exact handling of thread count in relation to LA differs between the UNIX's.

May the source guide you: sched.c. The answer is revealed as a comment.

/*
 * nr_running, nr_uninterruptible and nr_context_switches:
 *
 * externally visible scheduler statistics: current number of runnable
 * threads, current number of uninterruptible-sleeping threads, total
 * number of context switches performed since bootup.
 */

7) Should I care?
If the box is doing OK with LA>1 then don't stress. Just know that it is a bit busy. If the box is not performing well and you have LA>1, it is really useful to graph this value, so you know when the load is occurring.
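
For the graphing, you first need the numbers on disk. A minimal sketch, assuming a CSV file is good enough to feed whatever graphing tool you like. The file name, interval and function names are placeholders of mine.

```python
import csv
import os
import time


def format_row(timestamp, loads):
    """One CSV row: unix time followed by the three load averages."""
    return [int(timestamp)] + [f"{v:.2f}" for v in loads]


def log_load(path, interval=60, samples=5):
    """Append `samples` rows to `path`, one every `interval` seconds."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(samples):
            writer.writerow(format_row(time.time(), os.getloadavg()))
            f.flush()  # so the data survives if the box falls over
            if i < samples - 1:
                time.sleep(interval)


if __name__ == "__main__":
    log_load("loadavg.csv", interval=60, samples=5)
```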

Need to know more?

If you wanted to you could have read the man pages:
man uptime

<snip>system load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.
</snip>



E&O.E.
