Monitoring Processes

City College of San Francisco - CS260A
Linux System Administration
Module: Processes

The ps command is the standard way to monitor processes on a Unix or Linux system. ps has more options than any other Unix command that I know of. The reason is that the options of ps vary significantly between the System V and BSD Unix variants, and the Linux version of ps tries to be all things to all users. The result is a nightmare of seemingly conflicting options, but it does let you tailor the output of ps using the syntax of whichever variant you prefer.

We will concentrate on learning a few standard ps options and the type of data output by ps. You can learn how to be more specific on your own or just use standard tools to filter the output and extract the bits you want.

By default, ps gives you abbreviated information on the processes owned by the current user and associated with the current terminal. Normally this group includes all processes subordinate to the login shell (or the shell that was originally started in the window). The abbreviated information includes the process id, the tty it is attached to, the CPU time it has used and the command name.

Common ps options

System V option	meaning
-e	all processes on the system
-l, -f, -fl	extended information
-u user	all processes with the euid of user. user may be a comma- or space- separated list of users
-p pid	only process with process id pid. pid may be a comma- or space- separated list of pids
-o fields	only output these fields in this order. Here fields is a comma-separated list of fields.

The -l and -f options each give a different mix of output fields. In particular, -l gives the priorities and CPU time, while -f gives the command arguments and start time (the wall-clock time at which the process started). -fl gives a mixture of this data.

Each field output has a name (abbreviation) which appears in upper-case in the header of the output. The field names (in lower-case) can also be used in the field specification of the -o option above.

field	meaning
uid	user id (or user name). In -o this means uid. Use uname for user name
pid	process id
ppid	parent process id
ni	nice number
pri	scheduling priority
sz	total memory size (in memory pages on Linux, currently 4 kB each)
rsz	resident set size (in kB on Linux)
time	cpu time consumed (system time + user time)
stime	wall-clock time the process was started
s	state (S=sleep,R=runnable,T=stop...)
cmd	command. with -o this means 'command + arguments'
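As a sketch of using standard tools to filter ps output, the shell function below keeps the header line and sorts the remaining lines by their second column, largest first. The name top_rsz is made up for this example, and it assumes the input has rsz in column two.

```shell
# top_rsz [N]: read ps output on stdin, keep the header, and print the
# N lines (default 5) with the largest value in column 2.
# Assumes column 2 is RSZ, e.g. input from: ps -e -o pid,rsz,cmd
top_rsz() {
    n=${1:-5}
    # read consumes only the header line; sort then sees the rest
    IFS= read -r header && printf '%s\n' "$header"
    sort -k2,2nr | head -n "$n"
}
```

On a Linux system it could be used as: ps -e -o pid,rsz,cmd | top_rsz 10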

The -o option to ps is very useful for specifying exactly the output you want, although its format is very system-specific. An example of the Linux version is below. The field names are the standard abbreviations from the table above. Note that the numbers output for sz are smaller than those for rsz. Since the resident set is a subset of the total size, this looks impossible. The explanation is that sz is reported in pages while rsz is reported in kB: at 4 kB per page, the first process below has an sz of 1169 pages, or 4676 kB.

$ ps -o pid,uid,pri,rsz,sz
  PID   UID PRI   RSZ    SZ
31599   500  24  1492  1169
31626   500  22   760  1048
$
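To compare sz and rsz directly, sz can be converted to kB. The function below is a minimal sketch: the name sz_to_kb is hypothetical, it assumes SZ is the last column of the input (as in the example above), and it takes the page size in kB from getconf unless one is given as an argument.

```shell
# sz_to_kb [PAGE_KB]: read ps output on stdin and append an SZ_KB column.
# Assumes SZ (in pages) is the last field on every line.
# PAGE_KB defaults to the system page size reported by getconf.
sz_to_kb() {
    page_kb=${1:-$(( $(getconf PAGESIZE) / 1024 ))}
    awk -v kb="$page_kb" '
        NR == 1 { print $0, "SZ_KB"; next }   # header line
                { print $0, $NF * kb }        # pages * kB per page
    '
}
```

For example: ps -o pid,uid,pri,rsz,sz | sz_to_kb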

top

top is a very useful command to help the system administrator keep track of processes executing on the system and of the use of resources. It displays a page of data containing summary system statistics and the ps-type output of the processes that are consuming the most CPU time.

top [ -d delay ] [ -n iterations ] [ -p pid,pid,pid... ]

top, by default, continually updates the screen every few seconds (the delay) and runs forever (infinite iterations), selecting the biggest CPU users as the processes to display. The options allow for the monitoring of specific processes in addition to changing the delay and number of iterations. Other options include 'batch mode' operation, where top writes plain-text output that can be redirected into a file for later examination or analysis.
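A batch-mode snapshot can be sketched as follows. This assumes the procps version of top found on most Linux systems, where -b selects batch mode and -n sets the number of iterations; the function name top_snapshot is made up for this example.

```shell
# top_snapshot [LINES]: print the first LINES lines (default 15) of a
# single batch-mode iteration of top: the summary statistics followed
# by the processes at the head of the list.
top_snapshot() {
    lines=${1:-15}
    top -b -n 1 | head -n "$lines"
}
```

The output is plain text, so it can be saved for later: top_snapshot 30 > top.out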

If the system performance is significantly degraded, top can help identify the issue. However, in times of system bottlenecks, top is just another process, and if it is difficult to run any processes, it can be difficult to get information from top. To remedy this problem, it is useful to nice top so that it runs with increased priority (raising priority with a negative nice value requires root privileges). This is needed so often that some versions of top have an option to run with a decreased nice value, which removes the need for nice.

top is interactive, and responds to command keystrokes when it is running. The most important of these are h for help and q for quit.

uptime

A simple program that provides a quick thumbnail of system response time is uptime:

bash$ uptime
12:53pm up 3 days, 23:44, 17 users, load average: 0.06, 0.18, 1.04
bash$

uptime displays the current time, how long the system has been up, the number of logged-in users, and the one-, five-, and fifteen-minute load averages, in that order. The load average is the average number of processes in the ready-to-run state (running or waiting to run) during the period; on Linux, processes in uninterruptible sleep, typically waiting for disk I/O, are counted as well. The reference load average on a single-CPU system is 1: on average, the system always has a process to run. As the load average rises above that, system response time suffers. Load averages are also output by top and other process-monitoring commands.

The display of the three load averages provides a quick 'history' of how system load is changing. In the example above, the system load has decreased dramatically over the last fifteen minutes, since the one-minute average is much lower than the 15-minute average. This is important and reassuring information to a system administrator investigating why the system has been slow. If the numbers were reversed, like 1.04, 0.18, 0.06, it would mean the system is getting much busier. In this case it might be appropriate to run the top command and examine which processes are using the resources.
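The comparison described above can be sketched in a few lines of shell. The function name load_trend is hypothetical; it simply compares the one-minute and fifteen-minute averages passed as arguments.

```shell
# load_trend ONE FIVE FIFTEEN: report whether load is rising or falling
# by comparing the 1-minute average with the 15-minute average.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one + 0 > fifteen + 0)      print "rising"
        else if (one + 0 < fifteen + 0) print "falling"
        else                            print "steady"
    }'
}
```

On Linux the three averages are the first three fields of /proc/loadavg, so one possible use is: load_trend $(cut -d' ' -f1-3 /proc/loadavg)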

vmstat

vmstat [ delay [count] ]

vmstat gives virtual memory statistics: a one-line summary of process, memory, paging, swapping, I/O and CPU statistics. The output of vmstat without arguments is easily misunderstood, because the single line it prints contains averages since the system was started. To see current activity, you must run vmstat with a delay and a count greater than one. The first line is always the since-boot averages, so ignore it if you are interested in current statistics. delay is the number of seconds between measurements. If delay is given with no count, the count is infinite: 'measure every delay seconds forever'.

Example:

[gboyd@nelson ~]$ vmstat 1 7
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0    144  20848  21004 668868    0    0     4     3   62   22  3  0 97  0  0
 0  0    144  20784  21004 668868    0    0     0     0 1006  880  2  0 98  0  0
 3  0    144 129932  21168 558320    0    0 32104 44300 1360 1190  2 23 52 24  0
 1  4    144  42636  21256 643228    0    0     4 55584 1181  577  1 14  0 85  0
 0  4    144  43132  21264 643220    0    0     0 12068 1145  449  0  1  0 99  0
 0  1    144  12296  17584 681408    0    4  4820   212 1184  707  1 17  4 79  0
 0  0    144  12296  17584 681408    0    0     0     0 1003  266  0  0 98  2  0
[gboyd@nelson ~]$

The above run of vmstat shows a brief flurry of system activity. During the middle measurements, a large I/O operation occurred that caused the following effects:

the number of processes waiting to run (the r field under procs) increased briefly to 3
the blocks transferred in the i/o section (blocks-in (bi) and blocks-out (bo)) increased dramatically.
the i/o operation actually produced a few pages swapped out (so)
during the i/o operation, the number of context switches (cs) per second increased to 1190. (A context switch is performed whenever the running process is changed.) For comparison, our hills server performs about 400 context switches per second when the load average is near one.
the cpu %idle (id) dropped briefly, with most of that time moving to the 'waiting for i/o' state (wa). The system time (sy) increased while the user time (us) did not. This indicates that most of the cpu time in use was spent in the kernel providing services for processes rather than executing code in the processes themselves.
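Two small helpers can make this kind of analysis easier. Both are sketches with made-up names: vmstat_current hides the since-boot line, assuming the two header lines are followed directly by that line, and high_iowait assumes the 17-column layout shown above, where wa is field 16.

```shell
# vmstat_current [DELAY] [COUNT]: show COUNT current samples.
# Asks vmstat for one extra sample and drops line 3, which holds the
# averages since boot (lines 1 and 2 are the headers).
vmstat_current() {
    delay=${1:-1}
    count=${2:-5}
    vmstat "$delay" "$((count + 1))" | awk 'NR != 3'
}

# high_iowait [MIN]: read vmstat output on stdin and print the samples
# whose I/O wait percentage (wa, field 16) is at least MIN (default 50).
high_iowait() {
    min_wa=${1:-50}
    awk -v min="$min_wa" 'NR > 2 && $16 + 0 >= min + 0'
}
```

They can be combined: vmstat_current 1 10 | high_iowait 50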

iostat

iostat provides an alternate view of CPU and hard disk utilization from vmstat:

you can choose between CPU (-c), disk (-d) and NFS statistics (-n). The default is -c -d.
it displays its output in units per second, and the units can be chosen. (-k, -m)
output is broken down by device, and a %utilization measurement can be added to attempt to diagnose i/o bottlenecks and do some simple load-balancing. (-x)

The interface is similar to vmstat. An interval and count follow the options, and the first measurement reports averages since the system was started.

I would give sample output here, but I/O is so fast on our systems that simulating interesting data takes too much time. Try the command

iostat -k -d -x 1 10

sar

The kernel keeps counters for many system events in /proc: i/o transfers, process activity, paging behavior, cpu utilization, even interrupts processed. The sysstat data collector (sadc, normally run periodically from cron or a systemd timer) saves this data to a daily file, /var/log/sa/saNN, where NN is the day of the month. sar reads that data and prints it in a human-readable form for analysis. By default, the current day's data is examined, which covers the activity since midnight. You can use sar to do two things with these records:

sar [options] [-s starttime ] [-e stoptime] [-f filename]

display all or part of the data recorded since midnight. The start and stop time are in hh:mm:ss format. The options limit the types of measurements shown. The default is "CPU measurements only". You can use -A for "all measurements".

If you add the -f filename option, filename should point to the sa file in /var/log/sa corresponding to the day you want to analyze.

sar [ options ] interval [ count ]

start displaying certain current measurements beginning now as the records are written. The values of interval and count determine what is displayed:

if interval is 0, count may not be used. sar outputs average statistics since the system was started.
if interval is non-zero, sar outputs count measurements, one every interval seconds. A missing count means "continuous display".

Examples:

sar -f /var/log/sa/sa05

outputs the CPU usage information from the file for the 5th of this month

sar

outputs the CPU usage information from today's file

sar -A

outputs all the information from today's file

sar -A -s 12:00:00 -e 13:00:00

outputs all data collected between Noon and 1pm today

sar 0

outputs CPU usage information summary since the system was started.
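Since the daily files are named by day of the month, a small helper can build the file name for sar -f. This is only a sketch: sa_file is a hypothetical name, and the /var/log/sa directory is the location given above, which may differ on other distributions.

```shell
# sa_file [DAY]: print the path of the sa file for a day of the month.
# DAY defaults to today. A leading zero is stripped before formatting
# so that values such as 08 are not rejected as invalid octal numbers.
sa_file() {
    day=${1:-$(date +%d)}
    day=${day#0}
    printf '/var/log/sa/sa%02d\n' "$day"
}
```

For example, to see all measurements for the 5th of this month: sar -A -f "$(sa_file 5)"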

The statistics output by sar are detailed. See sar(1) for a description of the fields.

Other process tools

sleep N

is a command that simply sleeps for N seconds. It can be used in a shell script to force a delay. Most daemons, for example, run in a loop that sleeps for a while, then checks for work to do.

wait [pid]

is a shell built-in command that suspends the current shell until process pid (or, by default, all of the current shell's children) has exited.
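The two commands are often used together. The sketch below (run_and_wait is a made-up name) starts two background jobs, waits for one of them by pid, then waits for whatever is left.

```shell
# run_and_wait: start two background sleeps and wait for them to exit.
run_and_wait() {
    sleep 1 &
    first=$!             # $! holds the pid of the most recent background job
    sleep 2 &
    wait "$first"        # block until that one child has exited
    echo "first job exited with status $?"
    wait                 # no pid: block until all remaining children exit
    echo "all children finished"
}
```

In a real script the sleep commands would be replaced by the actual background work.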
