City College of San Francisco - CS260A
Unix/Linux System Administration

Module: Administration Basics I

Comparing Text Files

Preview question: You are preparing for a new release and encounter a configuration file that you modified. As you should, you saved the original version, but you cannot remember what changes you made. The files are 246 lines each. How do you compare them so that you can decide if the changes are needed when you reinstall the system?

diff is used to compare two text files

diff first second

outputs information about what changes need to be made to first so that it matches second. For reasons we will see later, we will refer to the file that we want to change (first in the example above) as the file on the left. Similarly, the file we are comparing it against (second in the example above) will be referred to as the file on the right. 

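diff distinguishes three types of alterations to the file on the left:

- add: lines that must be added to the file on the left
- delete: lines that must be deleted from the file on the left
- change: lines in the file on the left that must be replaced by lines from the file on the right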

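Before we look at what diff outputs for each of these three types of differences, let's consider what information we would need in order to implement an add block. For this alteration, we need the following information:

- the line number in the file on the left after which the new lines should be added
- the line numbers of the lines in the file on the right that should be copied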

This is exactly the information that diff outputs. To enable the viewer to decide whether she wants to incorporate the alteration, diff also outputs a copy of the lines from the file on the right that should be copied. 

The example below shows part of the comparison of two files currently on our Linux boxes:

[gboyd@luann ~]$ diff /etc/pam.d/gdm-password.instdefault /etc/pam.d/gdm-password
2a3,4
> auth       required    pam_succeed_if.so user != root quiet
>

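Here is how to interpret this difference:

- The a indicates an add block.
- The number to the left of the a (2) is the line in the file on the left after which the lines should be added.
- The numbers to the right of the a (3,4) are the line numbers, in the file on the right, of the lines to copy.
- The right-pointing symbol (>) indicates that the lines shown are from the file on the right. Here they are the auth line and a blank line.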

We will now just reverse the comparison to create a delete block:

[gboyd@luann ~]$ diff /etc/pam.d/gdm-password /etc/pam.d/gdm-password.instdefault
3,4d2
< auth       required    pam_succeed_if.so user != root quiet
<

In this example, the d indicates delete, the line numbers to delete from the file on the left are indicated to the left of the d, and the number to the right of the d indicates the line number in the file on the right that you will be at after the deletion (i.e., these lines would have appeared at this line in the file on the right). Again, the left-pointing symbol indicates the lines shown are from the file on the left.
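The add and delete blocks above can be reproduced on throwaway files. The following is a minimal sketch: the file names left and right and their contents are made up for illustration, and it assumes mktemp -d is available on your system.

```shell
# Hypothetical scratch files; nothing here touches /etc.
work=$(mktemp -d)
printf 'a\nb\nc\n'       > "$work/left"
printf 'a\nb\nx\ny\nc\n' > "$work/right"

# diff exits with status 1 when the files differ, hence the "|| true".
add_block=$(diff "$work/left" "$work/right") || true
del_block=$(diff "$work/right" "$work/left") || true

printf '%s\n' "$add_block"   # 2a3,4 followed by "> x" and "> y"
printf '%s\n' "$del_block"   # 3,4d2 followed by "< x" and "< y"

rm -rf "$work"
```

Swapping the two file arguments turns the add block into the matching delete block, just as in the gdm-password comparison.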

In this case, because diff outputs copies of the lines in question in addition to their line numbers, we can judge at a glance whether the change matters. Here the lines are an active auth rule and a blank line, not a comment, so the change should be reviewed before the system is reinstalled.

Last, we will look at a change block from a different file (in /etc)

[gboyd@luann etc]$ diff login.defs login.defs.instdefault
33c33
< UID_MIN                  7000
---
> UID_MIN                  1000

The change block above indicates that one line has changed. The c indicates a change block. The line number to the left of the c indicates the number of the line changed in the file on the left, and the line itself is output below with the left-facing symbol. The line number to the right of the c indicates the number of the line changed in the file on the right. Again, the line itself is output below with the right-facing symbol. The lines from the two files are separated by a dashed line. Of course, this change block consists of a single line, and the change did not insert or delete lines from the original, so the line number in each file is the same. If it consisted of multiple sequential lines, each line number indicator would be in the form startline,stopline (see a later example).
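To see the startline,stopline form in a change block, here is a small sketch on made-up files (the names left and right and their contents are assumptions for illustration only):

```shell
# Two hypothetical files that differ in lines 2 and 3.
work=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$work/left"
printf 'a\nB\nC\nd\n' > "$work/right"

# diff exits with status 1 when the files differ, hence the "|| true".
change_block=$(diff "$work/left" "$work/right") || true

# Prints the header 2,3c2,3, then "< b" and "< c", a line of three
# dashes, then "> B" and "> C".
printf '%s\n' "$change_block"

rm -rf "$work"
```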

Similarly, looking at a set of change blocks from a file comparison can show a pattern of changes:

-bash$ diff classify1.bash classify1.bash.sv
10c10
<    echo "$progname: need exactly one argument" >&2
---
>    echo "$progname: need exactly one argument"
15c15
<    echo -e "$progname: \"$thisfile\" is neither a directory nor a file." >&2
---
>    echo -e "$progname: \"$thisfile\" is neither a directory nor a file."
19c19
<    echo -e "$progname: \"$thisfile\" is not readable." >&2
---
>    echo -e "$progname: \"$thisfile\" is not readable."

Here we can see that the author edited classify1.bash to send error messages to standard error instead of standard output. Once you become comfortable with the output format of diff you can easily spot patterns of changes, even when they are mixed in with other output.
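When a comparison produces many blocks, one way to get an overview is to filter the output of diff. In the default output format only the block headers begin with a digit, so grep can pick them out. This is a sketch on made-up files, not something diff does for you:

```shell
# Hypothetical script fragments: left sends messages to standard error.
work=$(mktemp -d)
printf 'echo one >&2\nkeep\necho two >&2\n' > "$work/left"
printf 'echo one\nkeep\necho two\n'         > "$work/right"

# Only the block headers (such as 1c1) start with a digit.
headers=$(diff "$work/left" "$work/right" | grep '^[0-9]') || true

# Only the lines from the file on the left start with "<".
left_lines=$(diff "$work/left" "$work/right" | grep '^<') || true

printf '%s\n' "$headers"      # 1c1 and 3c3
printf '%s\n' "$left_lines"   # the two lines ending in >&2

rm -rf "$work"
```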

Options to diff

As with most Unix programs, diff has a number of options. There are only four that are important enough to mention here. We will illustrate the first two options and how they differ (!) with an example:

-bash$ diff classify1.bash classify1.bash.sv
26,30c26,28
< *directory*)          echo "$thisfile is a directory" ;;
<       *ascii*)        echo "$thisfile    is an ascii file" ;;
<
<       *commands*)     echo "$thisfile    is a commands file" ;;
<
---
>     *directory*)      echo "$thisfile is a directory" ;;
>       *ascii*)        echo "$thisfile is an ascii file" ;;
>       *commands*)     echo "$thisfile is a commands file" ;;
-bash$

A quick look at this change block shows that the differences here are all due to whitespace. This is the most important type of information to suppress when using diff, as any change in indentation can pollute the output with a deluge of useless information.

There are two options to diff that suppress the output of differences that are only due to whitespace:

- -b ignores differences in the amount of whitespace
- -w ignores all whitespace

Let's look at the above change block when we use each of these options:

-bash$ diff -b classify1.bash classify1.bash.sv
26c26
< *directory*)          echo "$thisfile is a directory" ;;
---
>     *directory*)      echo "$thisfile is a directory" ;;
28d27
<
30d28
<
-bash$

With -b the differences due to amounts of whitespace between words are suppressed.

-bash$ diff -w classify1.bash classify1.bash.sv
28d27
<
30d28
<
-bash$

With -w, the differences due to the amount of leading whitespace are suppressed as well. Note that neither -w nor -b suppresses the appearance of additional blank lines in classify1.bash.
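The difference between -b and -w can be checked on three one-line files. This is a minimal sketch; the file names and the report helper are made up for illustration.

```shell
work=$(mktemp -d)
printf 'a b\n'   > "$work/single"     # one space between words
printf 'a  b\n'  > "$work/spaced"     # two spaces between words
printf '  a b\n' > "$work/indented"   # leading whitespace

# Hypothetical helper: $1 = option, $2 and $3 = file names in $work.
# diff exits with status 0 only when it reports no differences.
report() {
    if diff "$1" "$work/$2" "$work/$3" > /dev/null; then
        echo "diff $1 $2 $3: no differences reported"
    else
        echo "diff $1 $2 $3: differences reported"
    fi
}

b_spaced=$(report -b spaced single)       # amount of whitespace only
b_indented=$(report -b indented single)   # whitespace versus none
w_indented=$(report -w indented single)   # all whitespace ignored

printf '%s\n' "$b_spaced" "$b_indented" "$w_indented"

rm -rf "$work"
```

Only the middle comparison reports a difference, which matches the 26c26 block that remains in the -b output above.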

Another option we will quickly show adds lines of context to the output of diff, so that you can review a few lines before and after the difference. This option (-CN, where N is the number of lines of context you want), also changes the output format:

$ diff -C3 -w classify1.bash classify1.bash.sv
*** classify1.bash      Thu Jan 15 18:16:30 2009
--- classify1.bash.sv   Thu Jan 15 18:07:00 2009
***************
*** 25,33 ****
  case "$class" in
  *directory*)          echo "$thisfile is a directory" ;;
        *ascii*)        echo "$thisfile    is an ascii file" ;;
-
        *commands*)     echo "$thisfile    is a commands file" ;;
-
        *)              echo "don't know what $thisfile is." ;;
 
  esac
--- 25,31 ----
$

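Personally, I think this output format is more complicated, but here we go:

- The first two lines name the files being compared, with their modification times. The file on the left is marked with *** and the file on the right with ---.
- Each block of differences begins with a line of asterisks.
- *** 25,33 **** gives the range of lines shown from the file on the left. Those lines follow, and the lines marked with - would have to be deleted from the file on the left.
- --- 25,31 ---- gives the corresponding range of lines in the file on the right. Because this block consists only of deletions, diff does not repeat the lines from the file on the right.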
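A smaller context diff may be easier to read. The sketch below uses made-up five-line files and one line of context. The two header lines of the output contain path names and timestamps that will differ on your system, so only the block itself is described in the comments.

```shell
# Hypothetical files: right is left with the line "c" removed.
work=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$work/left"
printf 'a\nb\nd\ne\n'    > "$work/right"

# -C 1 asks for one line of context around each difference.
context=$(diff -C 1 "$work/left" "$work/right") || true

# After the two header lines and a line of asterisks, the block reads:
#   *** 2,4 ****
#     b
#   - c
#     d
#   --- 2,3 ----
printf '%s\n' "$context"

rm -rf "$work"
```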

The last option -s (that's lower-case s) has diff report files that are identical. Normally, running diff on two identical files produces no output. If you add the -s option, however, diff tells you the files are identical:

$ diff x.html x.html
$
$ diff -s x.html x.html
Files x.html and x.html are identical
$
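In a script, you usually do not need the message from -s at all, because diff also reports its result through its exit status: 0 when no differences were found and 1 when the files differ. The following sketch shows both approaches; the file names current and saved are made up for illustration.

```shell
# Hypothetical configuration file and an identical saved copy.
work=$(mktemp -d)
printf 'UID_MIN 1000\n' > "$work/current"
cp "$work/current" "$work/saved"

# With -s, diff says so when the files are identical.
same_msg=$(diff -s "$work/current" "$work/saved")
printf '%s\n' "$same_msg"

# Without -s, test the exit status and discard the output.
if diff "$work/current" "$work/saved" > /dev/null; then
    verdict="unchanged"
else
    verdict="changed"
fi
echo "configuration is $verdict"

rm -rf "$work"
```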

Although getting used to even the basic output format of diff can be confusing, it is an extremely useful tool and is essential for keeping track of such things as changes to configuration files or differing configurations between releases of some package. 

Using diff on a pair of directories

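Running diff on a pair of directories reports the following:

- files that appear in only one of the directories
- the differences between files that have the same name in both directories but different contents
- subdirectories that are common to both directories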

No information is output about a file that is the same in both directories.
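Here is a small sketch of a directory comparison. The directory names old and new and the files in them are made up for illustration.

```shell
work=$(mktemp -d)
mkdir "$work/old" "$work/new"
printf 'same\n' > "$work/old/same.conf"
printf 'same\n' > "$work/new/same.conf"
printf 'v1\n'   > "$work/old/changed.conf"
printf 'v2\n'   > "$work/new/changed.conf"
printf 'x\n'    > "$work/old/only_old.conf"

# Run diff from inside $work so the report shows short path names.
dir_report=$(cd "$work" && diff old new) || true

# The report contains a 1c1 change block for changed.conf and the line
# "Only in old: only_old.conf". It does not mention same.conf.
printf '%s\n' "$dir_report"

rm -rf "$work"
```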

The comm utility

comm [option] file1 file2

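comm compares sorted files line by line, outputting three columns of information:

- column 1: lines that appear only in file1
- column 2: lines that appear only in file2
- column 3: lines that appear in both files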

The options -1 -2 and -3 can be used to suppress the corresponding column. Thus,

comm -12 file1 file2

outputs lines that are common to both file1 and file2 (i.e. columns 1 and 2 are suppressed, leaving only column 3)

Remember, file1 and file2 must be sorted. 
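If the files are not already sorted, sort them first. The sketch below uses made-up files one and two, and sets LC_ALL=C so that sort and comm agree on the ordering.

```shell
LC_ALL=C; export LC_ALL

# Hypothetical unsorted input files.
work=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$work/one"
printf 'fig\nkiwi\napple\n' > "$work/two"

# comm needs sorted input, so sort each file into a new file first.
sort "$work/one" > "$work/one.sorted"
sort "$work/two" > "$work/two.sorted"

common=$(comm -12 "$work/one.sorted" "$work/two.sorted")     # column 3 only
only_one=$(comm -23 "$work/one.sorted" "$work/two.sorted")   # column 1 only
only_two=$(comm -13 "$work/one.sorted" "$work/two.sorted")   # column 2 only

printf 'common: %s\n' $common         # apple and fig
printf 'only in one: %s\n' $only_one  # pear
printf 'only in two: %s\n' $only_two  # kiwi

rm -rf "$work"
```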

comm is very useful when comparing two directories to find missing or common files. For example, consider the ls listing of two similar directories:

$ ls adminbasics
compression.html      file_info.html             template.html
diff.html             gathering_info.html        transferring_files.html
diff.html.sv          index.html
extracting_info.html  madewithNvu80x15clear.png
$ ls adminbasics.sv
compression.html      file_info.html       madewithNvu80x15clear.png
diff.html             gathering_info.html  template.html
extracting_info.html  index.html           transferring_files.html
$

Obviously, these two directories have some common files. If you produce a listing of each directory

$ ls adminbasics > a.ls
$ ls adminbasics.sv > a.sv.ls
$

then run comm on the listing, you can see the comparison.

$ comm a.ls a.sv.ls
                compression.html
                diff.html
diff.html.sv
                extracting_info.html
                file_info.html
                gathering_info.html
                index.html
                madewithNvu80x15clear.png
                template.html
                transferring_files.html
$

The directories have the same files except for one, which is only in the directory on the left. (It is not obvious from this listing, but column 2 is missing, which indicates that the directory on the right does not contain any files that the directory on the left does not contain.)

(Note that in this case, it would probably have been easier to run diff on the two listings.)

If you want to list only the common files, use comm -12

$ comm -12 a.ls a.sv.ls
compression.html
diff.html
extracting_info.html
file_info.html
gathering_info.html
index.html
madewithNvu80x15clear.png
template.html
transferring_files.html
$
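The other combinations of options work the same way. As a sketch, comm -23 lists the files that are only in the directory on the left, and comm -13 those that are only in the directory on the right. The directories site and site.sv below are made up for illustration.

```shell
LC_ALL=C; export LC_ALL   # keep the ordering used by ls and comm consistent

# Hypothetical directories with one extra file on the left.
work=$(mktemp -d)
mkdir "$work/site" "$work/site.sv"
touch "$work/site/index.html" "$work/site/diff.html" "$work/site/diff.html.sv"
touch "$work/site.sv/index.html" "$work/site.sv/diff.html"

ls "$work/site"    > "$work/a.ls"
ls "$work/site.sv" > "$work/a.sv.ls"

only_left=$(comm -23 "$work/a.ls" "$work/a.sv.ls")    # suppress columns 2 and 3
only_right=$(comm -13 "$work/a.ls" "$work/a.sv.ls")   # suppress columns 1 and 3

echo "only on the left:  $only_left"    # diff.html.sv
echo "only on the right: $only_right"   # nothing

rm -rf "$work"
```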



Copyright 2012 Greg Boyd - All Rights Reserved.