City College of San Francisco - CS260A
Unix/Linux System Administration

Module: Administration Basics I

finding files

Preview question: Suppose you wanted to make a copy of each file on the system that had changed in the last week. How would you do this? If you know how to use find, the problem is simple.

find outputs paths to files on the system that have a particular set of characteristics. It is used to gather statistics on the filesystem and is one of the system administrator's best friends.

Notes:

Two qualities of find make it very useful.

Before we investigate further, consider the following fairly simple snippet of shell programming:

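# Collect large, stale regular files not owned by root in a temporary list.
find / ! -user root -size +1M -mtime +365 -atime +365 -type f > /tmp/$$files
while IFS= read -r path; do
    # Skip files that are already compressed.
    if ! file "$path" | grep -q 'compress' ; then
        ls -l "$path" >> /var/log/compressed   # record the original attributes
        gzip "$path"
    fi
done < /tmp/$$files
rm -f /tmp/$$files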

This snippet locates all the regular files whose size is greater than one megabyte and that have been neither modified nor accessed for a year. It omits files owned by root. The files found are checked to see whether they are already compressed. If not, they are compressed and a record of the original attributes is kept in a log file.

The snippet above saves space on the filesystem and leaves a log from which a report can be generated later, so that the owners of this clutter can be notified.

For the regular user, find can be very useful as well. One such use is to locate a lost file. ("I know I put that report somewhere")

The general format of a find command is

find [-H|-L|-P] <list of paths> [ options ]

list of paths is a list of places to begin looking. (A list is a whitespace-separated set of paths, usually to directories.) This is usually a single directory such as / or ~. find is always recursive. The path to each file found is output. The default is "find everything" and the default path (on linux) is . (the current directory).

The new form of the find command has one option appearing before the path list. This option tells find what to do when it encounters a symbolic link. There are three possibilities:

-P  never follow symbolic links. With this option, a symlink is a symlink and the thing it points to is not examined. This option is the default.

-L always follow symbolic links. With this option, all symlinks are dereferenced and the things they point to examined.

-H only follow symlinks if they are in the list of paths

Hopefully, the difference between these options can be seen by a simple example. In this example we will simply use find and wc -l to tell how many things find finds. By default find finds everything, and wc -l counts each item found.

We create a symlink to /bin in the current directory and a symlink to /mnt in a subdirectory. Then we see how many things find finds in each.

$ ls -lR link-in-subdir link-to-bin
lrwxrwxrwx. 1 gboyd gboyd    4 Jun 28 18:18 link-to-bin -> /bin
link-in-subdir:
total 0
lrwxrwxrwx. 1 gboyd gboyd 4 Jun 28 18:19 mnt -> /mnt
$ find /mnt | wc -l
7
$ find /bin | wc -l
125

Using -P (never follow symlinks) only finds the things themselves:

$ find -P link-to-bin link-in-subdir
link-to-bin
link-in-subdir
link-in-subdir/mnt
$ find -P link-to-bin link-in-subdir | wc -l
3

Using -L goes through each link, adding the contents of /bin (125 items) and /mnt (7 items). Together with link-in-subdir itself (1), the total is 133:

$ find -L link-to-bin link-in-subdir | wc -l
133

Using -H only goes through the symlink on the command-line itself. The result is all the items in /bin (125) plus link-in-subdir and the unfollowed symlink inside it (2):

$ find -H link-to-bin link-in-subdir | wc -l
127 
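If you would like to reproduce the comparison without touching /bin or /mnt, here is a minimal sketch that builds a small throwaway tree instead. The directory and link names (target, other, sub, link-to-target, link-to-other) are made up for this illustration, so the counts differ from the ones above.

```shell
# Build a throwaway tree: two real directories and two symlinks to them.
work=$(mktemp -d)
mkdir "$work/target" "$work/other" "$work/sub"
touch "$work/target/a" "$work/target/b" "$work/other/c"
ln -s "$work/target" "$work/link-to-target"     # symlink named on the command-line
ln -s "$work/other" "$work/sub/link-to-other"   # symlink met during the search

# Count what find reports under each symlink policy.
count_P=$(find -P "$work/link-to-target" "$work/sub" | wc -l | tr -d ' ')
count_H=$(find -H "$work/link-to-target" "$work/sub" | wc -l | tr -d ' ')
count_L=$(find -L "$work/link-to-target" "$work/sub" | wc -l | tr -d ' ')
echo "-P: $count_P  -H: $count_H  -L: $count_L"   # -P: 3  -H: 5  -L: 6

rm -rf "$work"
```

-P sees only the two links and sub. -H also descends into link-to-target (adding a and b). -L descends into both links (adding c as well).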

The -L, -P and -H options must appear before the list of paths, and they apply to every path on the command-line. With GNU find, if more than one of them is given, the last one takes effect. Remember, the default is -P: don't follow symlinks.

find options

find options are used to restrict the items output. They are a bit strange: each is a word rather than a single letter, and most take an argument. For example, the option meaning "owned by a particular user" is followed by the user (as in -user gboyd), and the option meaning "things this large" is followed by the size (as in -size 40k).

Many find options (such as -size) take an argument that is an integer indicating this many. For these, the option compares some attribute to that number. The default behavior is to match the number exactly. A prefix can be used to perform other comparisons:

N exactly N
+N more than N
-N less than N

You can see an example of this in the syntax +1M in the first example in this section to indicate more than one megabyte. In the options below, instances where these prefixes may be used are indicated by N.
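The three comparisons can be tried out on a scratch directory. This is only a sketch: the file names small, exact and large are invented for the illustration, and the c suffix on -size is used so that sizes are counted in bytes and the comparisons are exact.

```shell
work=$(mktemp -d)
# Three files of 50, 100 and 150 bytes.
head -c 50  /dev/zero > "$work/small"
head -c 100 /dev/zero > "$work/exact"
head -c 150 /dev/zero > "$work/large"

equal=$(find "$work" -type f -size 100c)    # N:  exactly 100 bytes
more=$(find "$work" -type f -size +100c)    # +N: more than 100 bytes
less=$(find "$work" -type f -size -100c)    # -N: less than 100 bytes

# Strip the directory part so only the file names are shown.
equal=${equal##*/}; more=${more##*/}; less=${less##*/}
echo "exactly: $equal  more: $more  less: $less"
# exactly: exact  more: large  less: small

rm -rf "$work"
```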

find has a huge number of options. The most useful ones are indicated below:

option meaning
-name "pattern" only output files whose name matches the wildcard pattern. Be sure to use quotes around the pattern, so that the shell does not expand it.
-iname "pattern" same as -name, except the pattern is case-insensitive (linux only)
-type t limit search to files whose file type is t. t may be f (regular files), d (directories), l (symlinks), c or b (device files), etc
-print output the path of each file found. Very old versions of find required this to produce any output. Modern versions imply -print when the expression contains no other action (such as -exec or -ok).
-user u find files that are owned by the username u. -uid u can be used for a user id u.
-group g find files whose group name is g. -gid g can be used for group id g
-inum num find files whose inode number is num
-links N find files that have N links
-perm onum find files whose permissions match the octal value onum exactly. More often, you are only interested in particular permissions. In this case, use -onum or /onum. (-perm -044 searches for files that have the read permission for both group and other set, irrespective of the rest of the permission bits. -perm /044 searches for files that have either the read permission for group or the read permission for other set, irrespective of other permissions.)
-size N whose size is N. By default, N is measured in 512-byte blocks. Use the suffix c for bytes, and, on linux, k for kilobytes, M for megabytes and G for gigabytes.
-follow follow symbolic links and examine what they point to. Thus if ./x is a symlink to a file, find . -type f -follow finds ./x whereas find . -type f does not.
This option is probably only found in older code, and the GNU find documentation marks it as deprecated in favor of -L. It used to be synonymous with the -L option, but now it affects only the tests that appear after it on the command-line.
-{c,a,m}time N whose change (ctime), access (atime) or modification (mtime) date is N days ago.
-newer file whose modification date is more recent than the modification date of file. Usually file is a date-stamp file created by touch -t [[CC]YY]MMDDHHMM[.SS]
-anewer file whose access date is more recent than the modification date of file
-exec cmd execute cmd on each file found. Here cmd is a Unix command with the special syntax {} to indicate where in cmd the current file's path should be inserted. The end of cmd is indicated by a semicolon, which must be escaped (\;) (see the section on running commands on the files found below)
-ok cmd just like -exec, but verification is requested for each file before cmd is executed
-noleaf suppress an optimization that find makes by assuming that directory link counts follow Unix conventions. Use this option when searching non-Unix filesystems such as CDROMs or MSDOS. (I haven't tested this - see the man page)
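The difference between -perm -onum and -perm /onum is easier to see on real files. The sketch below uses three invented file names (both, group-only, neither) and a small helper called names, which exists only in this example, to print the matching file names on one line. The /onum form is GNU find syntax, so this assumes a linux system.

```shell
work=$(mktemp -d)
touch "$work/both" "$work/group-only" "$work/neither"
chmod 644 "$work/both"         # group and other can both read
chmod 640 "$work/group-only"   # only group can read
chmod 600 "$work/neither"      # neither group nor other can read

# Helper for this example only: strip directories, sort, join on one line.
names() { sed 's|.*/||' | sort | paste -sd ' ' -; }

# -perm -044: every listed bit must be set.
all_bits=$(find "$work" -type f -perm -044 | names)
# -perm /044: at least one of the listed bits must be set.
any_bit=$(find "$work" -type f -perm /044 | names)

echo "all bits: $all_bits"   # all bits: both
echo "any bit:  $any_bit"    # any bit:  both group-only

rm -rf "$work"
```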

Combining find options with or and ! (negation)

If two find options are used, and is implied by default. Thus

find ~ -type f -name "*.txt"

outputs files that are regular files and whose names end in .txt. You can change this default by using the -o conjunction

find ~ -type f -o -name "*.txt"

outputs files that are regular files or whose names end in .txt (or both, of course). Note that and binds more tightly than or, so

find ~ -type f -user gboyd -o -user root

outputs regular files that are owned by gboyd as well as everything that is owned by root. If you want the or to be limited to the two -user options, you must add parentheses for grouping. However, parentheses are shell metacharacters, so they must be escaped:

find ~ -type f \( -user gboyd -o -user root \)

outputs regular files that are owned by either root or gboyd

You can also negate an option by preceding it with an exclamation point. Just make sure that the exclamation point has spaces around it. (! is a bash special character, but is ignored if it is not a prefix)

find ~ -type f ! -user gboyd

outputs regular files that are not owned by gboyd. Of course you can apply this to parenthesized expressions as well. For example,

find ~ -type f ! \( -user gboyd -o -user root \)

finds all regular files owned by anyone other than gboyd or root.
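Since testing with -user requires files owned by several users, here is a sketch of the same precedence rules using -type and -name instead. The file and directory names are invented, and names is a helper defined only for this example.

```shell
work=$(mktemp -d)
touch "$work/a.txt" "$work/b.log"   # two regular files
mkdir "$work/c.txt" "$work/d"       # two directories, one with a .txt name

# Helper for this example only: strip directories, sort, join on one line.
names() { sed 's|.*/||' | sort | paste -sd ' ' -; }

# and binds tighter than or: (directory AND *.txt) OR *.log
ungrouped=$(find "$work" -type d -name "*.txt" -o -name "*.log" | names)
# escaped parentheses limit the or: directory AND (*.txt OR *.log)
grouped=$(find "$work" -type d \( -name "*.txt" -o -name "*.log" \) | names)
# negation: regular files whose names do not end in .txt
negated=$(find "$work" -type f ! -name "*.txt" | names)

echo "ungrouped: $ungrouped"   # ungrouped: b.log c.txt
echo "grouped:   $grouped"     # grouped:   c.txt
echo "negated:   $negated"     # negated:   b.log

rm -rf "$work"
```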

Applying commands

Once you generate a list of files that match a set of criteria, you often want to do something to them. There are several ways to accomplish this.

The most general method for applying a set of commands to your file list is to save the list in a file (one-per-line), then read it into a while loop, processing one file (line) at a time. We saw an example of this at the beginning of this section. In this case, you can do any kind of processing you like to the data: your limit is simply the limit of shell scripting.

If you review that while loop, you may ask the question, Why save the list in a file at all? Can't you just use a pipe to send the output of find to the input of the loop? The answer is yes; however, there is an issue: a pipe separates two processes, so any state that you save in the while loop is gone when the loop exits. For example, the following loop (which is silly, but provides a simple illustration) doesn't do what you want.

nfiles=0
find . | while read path; do
((nfiles++))
done
echo $nfiles

This code outputs 0, since the loop's nfiles variable is a copy of the nfiles variable in the surrounding shell. (The surrounding shell's nfiles variable was initialized to 0 and never modified.)

If this is confusing, don't worry. The first time I encountered this it took me a long time to figure out, and I have forgotten it and been tricked again many times. Just avoid piping to a while loop (or to any snippet of shell code). Instead, save the output of find in a file and feed the file's contents into the loop like we did at the beginning of this section.
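If you are using bash and would rather not create a temporary file, one alternative (not part of the original example, and not available in plain sh) is process substitution. The loop then reads from find without being on the receiving end of a pipe, so it runs in the current shell. A minimal sketch, using a scratch directory with three invented file names:

```shell
work=$(mktemp -d)
touch "$work/one" "$work/two" "$work/three"

nfiles=0
# < <(...) feeds find's output to the loop through redirection, not a pipe,
# so the loop's changes to nfiles are still visible after it ends.
while IFS= read -r path; do
    nfiles=$((nfiles + 1))
done < <(find "$work" -type f)
echo "$nfiles"   # 3

rm -rf "$work"
```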

If you only need to apply a single command to each item in your list, you can choose from two simpler techniques: the -exec option of find, or xargs

find -exec

The -exec option of find embeds a single command on find's command-line and runs the command on each path that find finds. The syntax can be a bit difficult at first, so a simple example will help.

find . -type f -exec wc -l {} \;

Here the command to run is wc -l {}. The {} cookie instructs find where on the command-line of wc to insert the current path output by find. The escaped semi-colon tells find where the end of the embedded command is, in case more find options appear after -exec. The use of the {} cookie allows the placement of the file path anywhere on the command-line of the embedded command:

find . -type f -name "*.txt" -exec cp {} ~/textfiles \; -print

This command copies each of the files found to the given directory. Normally, the files copied would not be reported, but we have added the -print option after -exec so that we can see which files are copied and what their paths were.

Note that in older versions of find, the {} cookie must appear all by itself on the command-line of the embedded command. In these versions the command

find . -type f -exec mv {} {}.sv \;

is disastrous, as the second {} does not get substituted! This results in moving every regular file to a file of the same name (literally, {}.sv) one after the other, overwriting every file but the last one! At the time of this writing, the linux version of find seems to work as you would hope, but you should test your version of find before running this command on a lot of data. Or you could just use xargs, which seems to work in any case (see below).

A very simple example shows just how useful -exec is. Suppose you have just uploaded your entire website and discovered that the permissions are wrong! You need to add x permission for other to each directory in the website and r permission for other to each regular file in the website. By hand, this would be a mess, but using find -exec, it is simple. Just go to the root directory of the website and do this

find . -type d -exec chmod o+x {} \;
find . -type f -exec chmod o+r {} \;
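The preview question at the top of this section (copy every file that changed in the last week) can be approached the same way. The sketch below is one possible answer, not the only one. It works on two scratch directories, src and dest, that stand in for the real tree and the real backup location, and it backdates one file so that there is something to skip.

```shell
src=$(mktemp -d)     # stand-in for the tree to search
dest=$(mktemp -d)    # stand-in for the place that receives the copies
touch "$src/recent.txt"
touch -t 200001010000 "$src/old.txt"   # backdate this file to the year 2000

# -mtime -7 selects files modified less than 7 days ago;
# cp -p keeps the original timestamps and permissions on the copy.
find "$src" -type f -mtime -7 -exec cp -p {} "$dest" \;

copied=$(ls "$dest")
echo "$copied"   # recent.txt

rm -rf "$src" "$dest"
```

Note that this copies everything into one flat directory, so two files with the same name in different directories would collide. On a real system you would also want to keep the destination outside the tree being searched.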

xargs

xargs (extract arguments) is used to apply a Unix command to a list of paths read from standard input. You give it a command, and it applies that command to the paths in the list:

$ echo a.txt b | xargs ls -l
-rw--w---- 1 gboyd gboyd 0 Oct 10 16:59 a.txt
-rw-----w- 1 gboyd gboyd 0 Oct 10 16:59 b
$

This is very similar to find -exec, but there are some differences.
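One difference that is easy to demonstrate: find -exec cmd {} \; runs cmd once per path, whereas xargs by default packs as many paths as it can onto a single command-line. The sketch below uses echo as the command so that each invocation shows up as one line of output. The -n 1 option asks xargs for one argument per invocation.

```shell
# Default behavior: all three items are handed to a single echo.
batched=$(printf 'a\nb\nc\n' | xargs echo)
echo "$batched"    # a b c

# -n 1: echo is run three times, once per item, giving three lines.
one_each=$(printf 'a\nb\nc\n' | xargs -n 1 echo)
echo "$one_each"
```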

xargs has options to modify this basic behavior. Although its wide choice of options allows a lot of customization, -i makes it function similarly to find -exec. (GNU xargs now documents -i as deprecated. The equivalent standard spelling is -I {}.) Just like find, xargs -i uses the {} cookie to indicate where on the command-line the path should be inserted. If the input is one path per line, most versions of xargs now deal correctly with spaces in filenames, and it is not a problem to embed the cookie in other text. Thus, although our earlier example of renaming files using -exec mv {} {}.sv \; was a disaster, the following works just fine on linux:

$ ls
a file   a.txt     b     stamp
$ find . -type f | xargs -i mv {} {}.sv
$ ls
a file.sv   a.txt.sv   b.sv   stamp.sv
$
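If you do not want to depend on how your xargs treats whitespace, a common approach is to have find separate the paths with NUL bytes (-print0) and tell xargs to split on NUL only (-0). Both options are supported by the GNU and BSD tools, though older systems may lack them. Here is a sketch of the same rename on a scratch directory, written with the -I {} spelling:

```shell
work=$(mktemp -d)
touch "$work/a file" "$work/b"

# -print0 ends each path with a NUL byte; xargs -0 splits its input on NUL,
# so spaces (and even newlines) in file names are passed through intact.
find "$work" -type f -print0 | xargs -0 -I {} mv {} {}.sv

renamed=$(ls "$work")
echo "$renamed"
# a file.sv
# b.sv

rm -rf "$work"
```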

Example:

We will do a single complex example that should illustrate how you can tackle a complex question with find.

You want to find all configuration files that may have been modified since the system was installed. Suppose you are going to limit your search to files owned by either root or bin beneath /etc. You would also like to know if there appears to be another version of the file (hopefully the original version), which would be named the same, with an added suffix. You should omit XML files (.xml), as they are auto-generated. The system was last installed on June 24, 2012. Ignore files that are not text files. The resulting list should be in ls -l format.

First, let's analyze the issues here:

touch -t 1206240001 $$stamp  # this format is YYMMDDHHMM
find /etc -type f ! -name "*.xml" -newer $$stamp \( -user root -o -user bin \) | while read path; do
    if file "$path" | grep -q text; then
        echo "$path"
        other=$(ls -d "$path".* 2>/dev/null)
        if [ -f "$other" ]; then
            echo "$other"
        fi
    fi
done | xargs -i ls -l {}
When I did a similar thing on my system, I found 21 configuration files that were changed.
The program above has a bug: If multiple "alternate versions" of a configuration file exist, none of the alternate versions is reported. Can you fix the bug? (Hint: it will require a for loop.)

locate xxx

Often, you just need to locate static files with a specific name on the system, and you are not interested in file properties such as size. This may be for research, or just because you forgot where something is. Using the find command to do this is silly, as it analyzes the filesystem directly, and can be quite time-consuming. Instead, most linux systems keep a database of the current filesystem list, updated nightly, and you can search this list for a file pattern. We will call this database the locate database. Remember that this database is only as accurate as its last update, which is configurable on your system.

The locate command searches the locate database (created by updatedb) for files on the system that match a given pattern. In essence, it is a quick version of a simple find command with the caveat that its [cached] database of file paths is only updated periodically (usually once a day). Thus, although it is faster than find, it does not allow find options (except the implied -name, of course), and it gives information that is slightly out-of-date. It is very useful, however, for locating a misplaced file, or for finding files on the system that go with a particular program, assuming they are named for the program or are in a directory that is named for it. Note that your system must be configured to periodically run updatedb(8) for the locate database to be accurate. (This is usually configured as a standard cron(8) job.)

locate xxx outputs the path to each object on the system whose path contains xxx. A bare name can therefore match far more than you intend, so it usually pays to give locate a pattern. By default, locate treats patterns as wildcards.

locate xxx

outputs the path to each object on the system whose path contains xxx, i.e., it implies *xxx*. For example, locate httpd outputs 171 file paths - all files whose name or whose directory's name is any variation of httpd. Probably these files are associated with the webserver, so the output may very well be of interest.

locate '*xxx'

outputs the path to each object on the system whose path ends in xxx. (Quote the pattern so that the shell does not expand it before locate sees it.) In the case of httpd, this command outputs 21 paths.

locate '*/xxx'

outputs the path to each object on the system whose path ends in /xxx. Again, in the case of httpd, this command outputs 11 paths, each to a file or directory named httpd.

locate '*/old/*'

outputs the path to each object on the system whose path contains a directory named old.

The patterns allowed with locate are, by default, globbing (wildcard) patterns, as used on the shell command-lines. The caveat, as you see, is that a lack of globbing characters implies a leading and trailing *. Supposedly you can use extended regular expressions with locate as well (if you use an option), but cursory tests of this were mixed. Here is an example that does work, though:

locate --regexp '\.pdf$'

outputs all the paths in the locate database whose name ends in .pdf.

The locate database is created by updatedb with the exceptions indicated in /etc/updatedb.conf. If you examine that file on our systems, you will notice that filesystems of type nfs are excluded ("pruned"). Thus on our systems you cannot use locate to locate files beneath your home directory. Besides this, the locate database is created (and updated) as a cron job overnight. On well-configured systems, this job will be picked up by anacron after a day or so, but locate will fail to find anything immediately after installation!



Copyright 2014 Greg Boyd - All Rights Reserved.