Comparing Text Files

City College of San Francisco - CS260A
Unix/Linux System Administration
Module: Administration Basics I

Preview question: You are preparing for a new release and encounter a configuration file that you modified. As you should, you saved the original version, but you cannot remember what changes you made. The files are 246 lines each. How do you compare them so that you can decide if the changes are needed when you reinstall the system?

diff is used to compare two text files. The command

diff first second

outputs information about what changes need to be made to first so that it matches second. For reasons we will see later, we will refer to the file that we want to change (first in the example above) as the file on the left. Similarly, the file we are comparing it against (second in the example above) will be referred to as the file on the right.

diff distinguishes three types of alterations to the file on the left:

lines that only appear in the file on the right. These lines must be added to the file on the left to make it match the file on the right. This is called an add block.
lines that only appear in the file on the left. These lines must be deleted from the file on the left to make it match the file on the right. This is called a delete block.
lines that are similar in both files, and whose contents must be changed (edited) in the file on the left to match the file on the right. This is called a change block.

Before we look at what diff outputs for each of these three types of differences, let's consider what information we would need in order to implement an add block. In this alteration, we need the following information:

where to add the lines to the file on the left. (i.e., after which line should the needed lines be added?)
which lines to add from the file on the right. (i.e., what line numbers should be copied?)

This is exactly the information that diff outputs. To enable the viewer to decide whether she wants to incorporate the alteration, diff also outputs a copy of the lines from the file on the right that should be copied.

The example below shows part of the comparison of two files currently on our Linux boxes:

[gboyd@luann ~]$ diff /etc/pam.d/gdm-password.instdefault /etc/pam.d/gdm-password
2a3,4
> auth required pam_succeed_if.so user != root quiet
>

Here is how to interpret this difference:

the a on the first line indicates this is an add block
the line number to the left of the a indicates the line number of the file on the left after which the copied lines should be added.
the line number(s) to the right of the a indicate the line number(s) of the file on the right to copy to the file on the left.
the line(s) following are a copy of the line(s) that should be copied from the file on the right to the file on the left. The fact that they come from the file on the right is reinforced by the use of the greater-than symbol, which points to the right.
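The add block can be reproduced on a pair of throwaway files. This is a minimal sketch, assuming a POSIX shell with diff and mktemp available; the file names left and right and their contents are invented for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)

# "left" is the file we want to change; "right" is the file it should match.
printf 'alpha\nbeta\ngamma\n' > "$tmp/left"
printf 'alpha\nbeta\nnew one\nnew two\ngamma\n' > "$tmp/right"

# diff exits with status 1 when the files differ, so "|| true" keeps
# the script going even if it is run with "set -e".
add_out=$(diff "$tmp/left" "$tmp/right" || true)
printf '%s\n' "$add_out"
# 2a3,4
# > new one
# > new two

rm -rf "$tmp"
```

Read the header as: after line 2 of left, add lines 3 through 4 of right.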

We will now just reverse the comparison to create a delete block:

[gboyd@luann ~]$ diff /etc/pam.d/gdm-password /etc/pam.d/gdm-password.instdefault
3,4d2
< auth required pam_succeed_if.so user != root quiet
<

In this example, the d indicates a delete block. The line numbers to the left of the d are the lines to delete from the file on the left. The number to the right of the d is the line in the file on the right after which these lines would have appeared; it tells you where the two files line up again once the deletion is made. Again, the left-pointing symbol indicates that the lines shown are from the file on the left.

Because diff outputs copies of the lines in question in addition to their line numbers, we can often decide at a glance whether a change matters. Here the difference is an active auth rule plus a blank line, not a comment, so it is a change we would want to review before reinstalling.
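Swapping the two arguments turns a delete block into the matching add block. The sketch below shows both directions on invented files (left, right, and their contents are assumptions made for illustration).

```shell
tmp=$(mktemp -d)

# left has two extra lines that right does not have.
printf 'alpha\nbeta\nextra one\nextra two\ngamma\n' > "$tmp/left"
printf 'alpha\nbeta\ngamma\n' > "$tmp/right"

# left compared against right: the extra lines must be deleted.
del_out=$(diff "$tmp/left" "$tmp/right" || true)
printf '%s\n' "$del_out"
# 3,4d2
# < extra one
# < extra two

# Reverse the arguments: the same lines must now be added.
rev_out=$(diff "$tmp/right" "$tmp/left" || true)
printf '%s\n' "$rev_out"
# 2a3,4
# > extra one
# > extra two

rm -rf "$tmp"
```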

Last, we will look at a change block from a different file (in /etc):

[gboyd@luann etc]$ diff login.defs login.defs.instdefault
33c33
< UID_MIN 7000
---
> UID_MIN 1000

The change block above indicates that one line has changed. The c indicates a change block. The line number to the left of the c indicates the number of the changed line in the file on the left, and the line itself is output below with the left-facing symbol. The line number to the right of the c indicates the number of the changed line in the file on the right. Again, the line itself is output below with the right-facing symbol. The lines from the two files are separated by a dashed line. Of course, this change block consists of a single line, and the change did not insert or delete lines, so the line number in each file is the same. If it consisted of multiple sequential lines, each line number indicator would be in the form startline,stopline (see a later example).
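A change block does not have to cover the same number of lines on each side. In this sketch (file names and contents are invented), two lines on the left are replaced by three on the right, so both sides of the c use the startline,stopline form.

```shell
tmp=$(mktemp -d)

printf 'one\ntwo\nthree\nfour\nfive\n' > "$tmp/left"
printf 'one\nTWO\nTHREE\nTHREE-AND-A-HALF\nfour\nfive\n' > "$tmp/right"

chg_out=$(diff "$tmp/left" "$tmp/right" || true)
printf '%s\n' "$chg_out"
# 2,3c2,4
# < two
# < three
# ---
# > TWO
# > THREE
# > THREE-AND-A-HALF

rm -rf "$tmp"
```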

Similarly, looking at a set of change blocks from a file comparison can show a pattern of changes:

-bash$ diff classify1.bash classify1.bash.sv
10c10
<    echo "$progname: need exactly one argument" >&2
---
>    echo "$progname: need exactly one argument"
15c15
<    echo -e "$progname: \"$thisfile\" is neither a directory nor a file." >&2
---
>    echo -e "$progname: \"$thisfile\" is neither a directory nor a file."
19c19
<    echo -e "$progname: \"$thisfile\" is not readable." >&2
---
>    echo -e "$progname: \"$thisfile\" is not readable."

Here we can see that the author edited classify1.bash to send error messages to standard error instead of standard output. Once you become comfortable with the output format of diff you can easily spot patterns of changes, even when they are mixed in with other output.
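One way to see the shape of a set of changes at a glance is to keep only the header line of each block. The sketch below assumes the normal output format shown so far, in which header lines are the only ones that begin with a digit. The two script files and their contents are invented for illustration.

```shell
tmp=$(mktemp -d)

# Two versions of a tiny script: one sends errors to standard error.
printf 'echo "bad usage" >&2\nexit 1\necho "not readable" >&2\n' > "$tmp/new.sh"
printf 'echo "bad usage"\nexit 1\necho "not readable"\n' > "$tmp/old.sh"

# Keep only the block headers (1c1, 3c3, ...).
headers=$(diff "$tmp/new.sh" "$tmp/old.sh" | grep '^[0-9]' || true)
printf '%s\n' "$headers"
# 1c1
# 3c3

# Keep only the lines taken from the file on the left.
left_lines=$(diff "$tmp/new.sh" "$tmp/old.sh" | grep '^<' || true)
printf '%s\n' "$left_lines"
# < echo "bad usage" >&2
# < echo "not readable" >&2

rm -rf "$tmp"
```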

Options to diff

As with most Unix programs, diff has a number of options. There are only four that are important enough to mention here. We will illustrate the first two options and how they differ (!) with an example:

-bash$ diff classify1.bash classify1.bash.sv
26,30c26,28
< *directory*)          echo "$thisfile is a directory" ;;
<       *ascii*)        echo "$thisfile    is an ascii file" ;;
<
<       *commands*)     echo "$thisfile    is a commands file" ;;
<
---
>     *directory*)      echo "$thisfile is a directory" ;;
>       *ascii*)        echo "$thisfile is an ascii file" ;;
>       *commands*)     echo "$thisfile is a commands file" ;;
-bash$

A quick look at this change block shows that the differences here are all due to whitespace. This is the most important type of information to suppress when using diff, as any change in indentation can pollute the output with a deluge of useless information.

There are two options to diff that suppress the output of differences that are only due to whitespace:

-b suppresses differences due to differing amounts of whitespace between words
-w is like -b, but ignores whitespace entirely, so differences due to leading whitespace are suppressed as well.

Let's look at the above change block when we use each of these options:

-bash$ diff -b classify1.bash classify1.bash.sv
26c26
< *directory*)          echo "$thisfile is a directory" ;;
---
>     *directory*)      echo "$thisfile is a directory" ;;
28d27
<
30d28
<
-bash$

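With -b the differences due to amounts of whitespace between words are suppressed. The *directory*) line is still reported because it has leading whitespace in one file and none in the other.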

-bash$ diff -w classify1.bash classify1.bash.sv
28d27
<
30d28
<
-bash$

With -w, the differences due to the amount of leading whitespace are suppressed as well. Note that neither -w nor -b suppresses the appearance of additional blank lines in classify1.bash.
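The same progression can be reproduced on a small pair of files. This is a sketch with invented file names and contents; only the block headers are captured, so the three runs are easy to compare.

```shell
tmp=$(mktemp -d)

# left: extra spaces between words, an unindented line, and a blank line.
printf 'start\nfoo    bar\nbaz\nmiddle\n\nend\n' > "$tmp/left"
# right: single spaces, an indented line, and no blank line.
printf 'start\nfoo bar\n    baz\nmiddle\nend\n' > "$tmp/right"

# Helper (hypothetical name): run diff with the given options and keep
# only the block header lines.
hunks() { diff "$@" "$tmp/left" "$tmp/right" | grep '^[0-9]' || true; }

plain_hunks=$(hunks)
b_hunks=$(hunks -b)
w_hunks=$(hunks -w)

printf 'plain:\n%s\n' "$plain_hunks"   # 2,3c2,3 and 5d4
printf 'with -b:\n%s\n' "$b_hunks"     # 3c3 and 5d4
printf 'with -w:\n%s\n' "$w_hunks"     # 5d4 only

rm -rf "$tmp"
```

With -b the "foo bar" line drops out, with -w the indented "baz" line drops out too, and the blank line is reported in all three runs.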

Another option we will quickly show adds lines of context to the output of diff, so that you can review a few lines before and after the difference. This option (-CN, where N is the number of lines of context you want), also changes the output format:

$ diff -C3 -w classify1.bash classify1.bash.sv
*** classify1.bash      Thu Jan 15 18:16:30 2009
--- classify1.bash.sv   Thu Jan 15 18:07:00 2009
***************
*** 25,33 ****
case "$class" in
*directory*)          echo "$thisfile is a directory" ;;
        *ascii*)        echo "$thisfile    is an ascii file" ;;
-
        *commands*)     echo "$thisfile    is a commands file" ;;
-
        *)              echo "don't know what $thisfile is." ;;

esac
--- 25,31 ----
$

Personally, I think this output format is more complicated, but here we go:

The two lines with the filenames above the difference block indicate that asterisk notation will be used for lines that come from the file on the left, while dash notation will be used for lines from the file on the right. This notation is used again in the two line number ranges in the output:

the line number range with the asterisks surrounding it means these lines refer to the file on the left. Since the range is followed by a set of lines, these lines are copies from the file on the left.
the line number range with the dashes surrounding it means that the corresponding section of the file on the right has that line number range.

The - sign before the two blank lines means that these lines must be deleted from the file on the left to make it agree with the file on the right. If the lines had a + symbol prefix they would have to be added to the file on the left. In that case, the copy of the lines would be from the file on the right. Last, if a line needed to be changed, it would be flagged with an exclamation point, and copies of the lines from each file would appear.
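To see the ! and + markers as well, here is a small sketch that produces a context diff containing one changed line and one added line. The file names and contents are invented. Because the header lines carry timestamps that vary from run to run, the sketch prints the whole output but only captures the marked lines.

```shell
tmp=$(mktemp -d)

printf 'one\ntwo\nthree\nfour\n' > "$tmp/left"
printf 'one\nTWO\nthree\nfour\nfive\n' > "$tmp/right"

# One line of context around each difference.
ctx_out=$(diff -C1 "$tmp/left" "$tmp/right" || true)
printf '%s\n' "$ctx_out"

# Keep only lines flagged with -, + or ! followed by a space.
# The "***" and "---" header lines do not match this pattern.
marked=$(printf '%s\n' "$ctx_out" | grep '^[-+!] ' || true)
printf '%s\n' "$marked"
# ! two
# ! TWO
# + five

rm -rf "$tmp"
```

The changed line appears twice, once from each file, while the added line appears only in the section for the file on the right.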

The last option -s (that's lower-case s) has diff report files that are identical. Normally, running diff on two identical files produces no output. If you add the -s option, however, diff tells you the files are identical:

$ diff x.html x.html
$
$ diff -s x.html x.html
Files x.html and x.html are identical
$
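In scripts, the exit status of diff is often more convenient than its output: diff exits with 0 when no differences were found, 1 when differences were found, and a larger value when something went wrong. A minimal sketch with invented file names:

```shell
tmp=$(mktemp -d)

printf 'same\n' > "$tmp/a"
cp "$tmp/a" "$tmp/b"
printf 'different\n' > "$tmp/c"

# Identical files: -s prints a "Files ... are identical" message,
# and the exit status is 0.
if diff -s "$tmp/a" "$tmp/b"; then same_status=0; else same_status=$?; fi

# Differing files: the output is discarded here, and the exit status is 1.
if diff "$tmp/a" "$tmp/c" > /dev/null; then diff_status=0; else diff_status=$?; fi

printf 'identical pair: %s\n' "$same_status"
printf 'differing pair: %s\n' "$diff_status"

rm -rf "$tmp"
```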

Although getting used to even the basic output format of diff can be confusing, it is an extremely useful tool and is essential for keeping track of such things as changes to configuration files or differing configurations between releases of some package.

Using diff on a pair of directories

Running diff on a pair of directories reports the following:

the files that appear only in one of the directories
for a file that appears in each directory, diff is run and the output shown.

No information is output about a file that is the same in both directories.
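The sketch below builds two small directories to show each case. The directory and file names are invented, and the output shown in the comments is what GNU diff prints; other implementations may word it slightly differently.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/new"

# Same in both directories: nothing is reported.
printf 'same\n' > "$tmp/old/common.txt"
cp "$tmp/old/common.txt" "$tmp/new/common.txt"

# Present in both but different: diff is run on the pair.
printf 'v1\n' > "$tmp/old/changed.txt"
printf 'v2\n' > "$tmp/new/changed.txt"

# Present in only one directory.
printf 'x\n' > "$tmp/old/only_old.txt"

# cd first so the paths in the output stay short.
dir_out=$(cd "$tmp" && diff old new || true)
printf '%s\n' "$dir_out"
# diff old/changed.txt new/changed.txt
# 1c1
# < v1
# ---
# > v2
# Only in old: only_old.txt

rm -rf "$tmp"
```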

The comm utility

comm [option] file1 file2

comm compares sorted files line by line, outputting three columns of information:

lines that appear in the first file only (the first column)
lines that appear in the second file only (the second column)
lines that appear in both files (the third column)

The options -1, -2, and -3 can be used to suppress the corresponding column. Thus,

comm -12 file1 file2

outputs lines that are common to both file1 and file2 (i.e., columns 1 and 2 are suppressed, leaving only column 3).

Remember, file1 and file2 must be sorted.
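If the files are not already sorted, sort them first. The sketch below uses invented files and sets LC_ALL=C for both sort and comm so that the two programs agree on the ordering; that setting is a precaution chosen for this example, not something the course notes require.

```shell
tmp=$(mktemp -d)

# Two unsorted lists.
printf 'pear\napple\nfig\n' > "$tmp/one.txt"
printf 'fig\nkiwi\napple\n' > "$tmp/two.txt"

LC_ALL=C sort "$tmp/one.txt" > "$tmp/one.sorted"
LC_ALL=C sort "$tmp/two.txt" > "$tmp/two.sorted"

# Column 3 only: lines in both files.
common=$(LC_ALL=C comm -12 "$tmp/one.sorted" "$tmp/two.sorted")
# Column 1 only: lines only in the first file.
only_one=$(LC_ALL=C comm -23 "$tmp/one.sorted" "$tmp/two.sorted")
# Column 2 only: lines only in the second file.
only_two=$(LC_ALL=C comm -13 "$tmp/one.sorted" "$tmp/two.sorted")

printf 'common:\n%s\n' "$common"       # apple, fig
printf 'only in one: %s\n' "$only_one" # pear
printf 'only in two: %s\n' "$only_two" # kiwi

rm -rf "$tmp"
```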

comm is very useful when comparing two directories to find missing or common files. For example, consider the ls listing of two similar directories:

$ ls adminbasics
compression.html      file_info.html             template.html
diff.html             gathering_info.html        transferring_files.html
diff.html.sv          index.html
extracting_info.html  madewithNvu80x15clear.png
$ ls adminbasics.sv
compression.html      file_info.html       madewithNvu80x15clear.png
diff.html             gathering_info.html  template.html
extracting_info.html  index.html           transferring_files.html
$

Obviously, these two directories have some common files. If you produce a listing of each directory

$ ls adminbasics > a.ls
$ ls adminbasics.sv > a.sv.ls
$

then run comm on the two listings, you can see the comparison.

$ comm a.ls a.sv.ls
                compression.html
                diff.html
diff.html.sv
                extracting_info.html
                file_info.html
                gathering_info.html
                index.html
                madewithNvu80x15clear.png
                template.html
                transferring_files.html
$

The directories have the same files except for one, which is only in the directory on the left. (It is not obvious from this listing, but column 2 is empty, which indicates that the directory on the right does not contain any files that the directory on the left does not contain.)

(Note that in this case, it would probably have been easier to run diff on the two listings.)

If you want to list only the common files, use comm -12

$ comm -12 a.ls a.sv.ls
compression.html
diff.html
extracting_info.html
file_info.html
gathering_info.html
index.html
madewithNvu80x15clear.png
template.html
transferring_files.html
$
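The opposite question, which files exist only in the directory on the left, can be answered by suppressing columns 2 and 3 instead. The sketch below recreates a cut-down version of the two directories with empty files; the scratch location is an assumption made for illustration, and LC_ALL=C is set so that ls and comm use the same ordering.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/adminbasics" "$tmp/adminbasics.sv"

# A reduced version of the listings above, using empty files.
touch "$tmp/adminbasics/diff.html" "$tmp/adminbasics/diff.html.sv" \
      "$tmp/adminbasics/index.html"
touch "$tmp/adminbasics.sv/diff.html" "$tmp/adminbasics.sv/index.html"

LC_ALL=C ls "$tmp/adminbasics" > "$tmp/a.ls"
LC_ALL=C ls "$tmp/adminbasics.sv" > "$tmp/a.sv.ls"

# Suppress columns 2 and 3, leaving the files found only on the left.
only_left=$(LC_ALL=C comm -23 "$tmp/a.ls" "$tmp/a.sv.ls")
printf '%s\n' "$only_left"
# diff.html.sv

rm -rf "$tmp"
```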
