Binary Files

Dealing with text files is easy, and Linux is awash with tools to handle them. Better still, you can simply look at a text file to see what it contains. Binary files are another matter, and most files on a Linux system are, indeed, binary files, for which the cat command just produces a mess.

In this section we will discuss two tools for dealing with binary files - the od command, which displays the contents of binary files as text for examination, and the dd command, which can be used to take apart binary files. First, however, we will visit the minimally-useful program cmp.

cmp can be used to compare two files, whether text or binary. It compares them byte by byte and stops when it finds the first difference:

In this example, two obviously different binary files were compared. Surprisingly, the first 24 bytes were identical! cmp has options to actually output the differing bytes, but it is most useful for verifying that two binary files are identical:
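Both uses can be sketched with a few throwaway files. This is only an illustration: the scratch directory and the names a.bin, b.bin and c.bin are made up here and are not the files compared above. The -s option suppresses cmp's output, so that only the exit status (0 for identical, 1 for different) is left.

```shell
# Scratch directory so that no real files are touched (hypothetical names throughout).
tmpdir=$(mktemp -d)

# a.bin and b.bin share their first four bytes and differ at byte 5;
# c.bin is an exact copy of a.bin.
printf 'ABCDxyz' > "$tmpdir/a.bin"
printf 'ABCDXyz' > "$tmpdir/b.bin"
cp "$tmpdir/a.bin" "$tmpdir/c.bin"

# Default mode: report the position of the first difference, then stop.
# cmp exits with status 1 when the files differ.
diff_status=0
cmp "$tmpdir/a.bin" "$tmpdir/b.bin" || diff_status=$?

# -s (silent) prints nothing, which suits scripts: test the exit status instead.
same_status=0
cmp -s "$tmpdir/a.bin" "$tmpdir/c.bin" || same_status=$?

echo "a vs b: exit $diff_status   a vs c: exit $same_status"
rm -rf "$tmpdir"
```

In a script, the silent form is usually all that is needed: if cmp -s exits with 0, the two files are byte-for-byte identical.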

The od command is used to examine binary data, displaying it in text. Unfortunately, its basic output format is somewhat scary. Let's take a look:

$ od /bin/cat | more
0000000 042577 043114 000402 000001 000000 000000 000000 000000
0000020 000002 000076 000001 000000 014120 000100 000000 000000
0000040 000100 000000 000000 000000 132670 000000 000000 000000
0000060 000000 000000 000100 000070 000011 000100 000040 000037
0000100 000006 000000 000005 000000 000100 000000 000000 000000
0000120 000100 000100 000000 000000 000100 000100 000000 000000
0000140 000770 000000 000000 000000 000770 000000 000000 000000
0000160 000010 000000 000000 000000 000003 000000 000004 000000
0000200 001070 000000 000000 000000 001070 000100 000000 000000
0000220 001070 000100 000000 000000 000034 000000 000000 000000

This default output format shows od's heritage from the earliest days of Unix. In fact, od and dd are among the oldest programs on Unix.

The first column in od's output indicates the address (or byte offset) of the following line of data. The remainder of that line consists of data converted to numeric text. The annoying thing about this output is that it is in octal (base 8) - hence od's name - octal dump.

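The format of the data dump is equally annoying - it is output in two-byte quantities as octal numbers. Thus, bytes 0 and 1 of the file /bin/cat, interpreted as a two-byte (16-bit) integer, are 042577 in octal, which is 0100010101111111 in binary or 457F in hexadecimal. On a little-endian machine such as an x86 PC, byte 0 (7F) is the low-order half of that number and byte 1 (45, the letter E) is the high-order half.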
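The arithmetic can be checked from the shell. The sketch below relies on printf reading a number with a leading 0 as octal, and it rebuilds the two bytes by hand (octal 177 is 7F, followed by the letter E, which is 45) rather than reading /bin/cat, so it does not depend on what is installed.

```shell
# A leading 0 makes printf read the number as octal; %04x prints it in hex.
word_hex=$(printf '%04x' 042577)
echo "octal 042577 = hex $word_hex"

# The same two bytes, dumped one byte at a time, in file order.
# tr strips od's padding so that only the hex digits remain.
byte_hex=$(printf '\177E' | od -An -tx1 | tr -d ' \n')
echo "bytes in file order: $byte_hex"

# Dumped as one 2-byte integer, the result depends on the machine's byte order.
printf '\177E' | od -An -tx2
```

The first two results are the same everywhere; only the last line changes with the byte order of the machine running it.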

At this point, a few of my readers may be seriously considering dropping the class. Don't worry - we will not spend too long on this topic. But you probably think that the output format above will only be worth interpreting in the most dire of circumstances. Fortunately, the modern version of od allows you to change its output format (and how it interprets the data) with a few options. These options govern how the data is output (the data type) and, thankfully, with recent versions of od, what radix is used to output the addresses:

radixes: d for decimal, o for octal (default), x for hexadecimal. The -Ar option is used to indicate the radix for the address column, where r is the radix.

types: c for character, d for integers output as decimal, x for integers output as hexadecimal, etc. The type is normally followed by a size (how many bytes to output at a time); for the character type no size is given, since 1 is implied. The -tT[S] option is used to indicate a type T and size S. Thus, in our example above, we could output 2-byte integers in hexadecimal with a hexadecimal address like this:

$ od -Ax -tx2 /bin/cat | more
000000 457f 464c 0102 0001 0000 0000 0000 0000
000010 0002 003e 0001 0000 1850 0040 0000 0000
000020 0040 0000 0000 0000 b5b8 0000 0000 0000
000030 0000 0000 0040 0038 0009 0040 0020 001f
000040 0006 0000 0005 0000 0040 0000 0000 0000
000050 0040 0040 0000 0000 0040 0040 0000 0000
000060 01f8 0000 0000 0000 01f8 0000 0000 0000
000070 0008 0000 0000 0000 0003 0000 0004 0000
000080 0238 0000 0000 0000 0238 0040 0000 0000

If you are not convinced yet, let's consider a problem: You are supposed to extract the number from the output of this command

Seems pretty straightforward, so you write a simple sed command to delete everything on the line beginning with the first space character. Unfortunately, it doesn't work.

What the heck? After re-issuing the same command to ensure it doesn't work (as most of us do - admit it!), you examine the output of the du command using od, telling it to output the data as individual characters:

This immediately reveals the problem: du uses a tab, not a space, as the separator after its size field. Then you adjust your sed command appropriately!
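The whole episode can be replayed in a scratch directory. This is a sketch under assumptions: the directory and file are made up, and it assumes a du that separates its fields with a tab, as described here (GNU du does). The tab is built with printf so that it stays visible in the script.

```shell
# Hypothetical scratch directory with one small file in it.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/file.txt"

# Dump du's output character by character; the separator appears as \t.
du -s "$workdir" | od -tc

# Hold a literal tab in a variable, then delete from the first tab to end of line.
tab=$(printf '\t')
size=$(du -s "$workdir" | sed "s/${tab}.*//")
echo "size field: $size"

rm -rf "$workdir"
```

What remains in size is just the number that du printed before the tab.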

$ od -tc test.sh
0000000   #   !  \0   /   b   i   n   /   b   a   s   h  \n   #       t
0000020   h   i   s       s   h   e   l   l       s   c   r   i   p   t
0000040      \n   #       b   l   a   h       b   l   a   h       b   l
0000060   a   h  \n  \n

The bottom line here is that od is not needed very often - but when it is, nothing else will work. Get comfortable with simple od commands such as those above to examine the bytes in some data output or, possibly, to examine a binary data value at a particular location in a file. A little bit of od in your arsenal will serve you well.
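For the last case, examining a value at a known location, od's -j (skip) and -N (count) options narrow the dump to just the bytes of interest. A minimal sketch, using a made-up 12-byte file in which a 4-byte field sits at offset 4:

```shell
# Hypothetical sample file: 4 bytes of header, a 4-byte field, 4 bytes of trailer.
sample=$(mktemp)
printf 'HEAD\001\002\003\004TAIL' > "$sample"

# Skip the first 4 bytes (-j 4), then dump only the next 4 (-N 4)
# as single hex bytes, with decimal offsets in the address column.
od -Ad -tx1 -j 4 -N 4 "$sample"

# The same field with the address column suppressed (-An) and padding removed,
# which is handy when a script needs the value itself.
field=$(od -An -tx1 -j 4 -N 4 "$sample" | tr -d ' \n')
echo "field bytes: $field"

rm -f "$sample"
```

Dumping single bytes with -tx1 keeps the result independent of the machine's byte order.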

Also seldom used but indispensable when it is needed is the dd command. dd is often said to stand for disk-to-disk copy, in reference to its most famous use - making a carbon copy of the data on a device. You can actually copy an entire [unmounted] device to a disk file for a fairly quick carbon-copy backup.

One of dd's more common uses is to create a placeholder file of a certain size, filling it with constant or random data. We will see it used like this to create a swap file later. It can also be used to extract part of a file or device, such as the boot sector on a disk, for backup or for offline examination. One last, probably dated, use of dd is to divide one large file into several smaller files for transport on a limited capacity device or across the internet, then put the pieces together afterwards.

dd's options are unusual. They consist of an option word followed by an equals sign, then the option parameters. The option words are not prefixed with a dash. Here are the most commonly used options:
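As a sketch of that syntax, the commands below use if= (input file), of= (output file), bs= (block size in bytes), count= (number of blocks to copy) and skip= (number of input blocks to skip first) to build a placeholder file, then split a file in two and join it again. The file names are made up and the sizes are kept tiny.

```shell
workdir=$(mktemp -d)

# Placeholder file: 4 blocks of 1024 bytes read from /dev/zero.
# dd reports its statistics on stderr, which is discarded here.
dd if=/dev/zero of="$workdir/placeholder" bs=1024 count=4 2>/dev/null
placeholder_size=$(wc -c < "$workdir/placeholder" | tr -d ' ')

# A file with varied content to split: 4 blocks of random data.
dd if=/dev/urandom of="$workdir/big" bs=1024 count=4 2>/dev/null

# First piece: blocks 0 and 1. Second piece: skip 2 blocks, copy the rest.
dd if="$workdir/big" of="$workdir/part1" bs=1024 count=2 2>/dev/null
dd if="$workdir/big" of="$workdir/part2" bs=1024 skip=2 2>/dev/null

# Put the pieces back together and verify the result with cmp.
cat "$workdir/part1" "$workdir/part2" > "$workdir/rejoined"
rejoin_status=0
cmp -s "$workdir/big" "$workdir/rejoined" || rejoin_status=$?

echo "placeholder: $placeholder_size bytes, rejoin check: exit $rejoin_status"
rm -rf "$workdir"
```

Note that count= and skip= are measured in blocks of bs= bytes, not in bytes, so changing the block size changes what they mean.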

Example: to back up the [512 byte] boot sector of a hard disk at /dev/sda and place the sector in a file /xyz/bootsector:
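On a real system this would most likely be dd if=/dev/sda of=/xyz/bootsector bs=512 count=1, run as root. Since reading /dev/sda needs root and a disk of that name, the sketch below runs the same options against a made-up disk image file so that it can be tried safely; disk.img and the scratch bootsector file are stand-in names.

```shell
workdir=$(mktemp -d)

# Stand-in for /dev/sda: an 8-sector (4096-byte) image of random data.
dd if=/dev/urandom of="$workdir/disk.img" bs=512 count=8 2>/dev/null

# Copy exactly one 512-byte block from the start of the "disk".
# The real command would read if=/dev/sda and write of=/xyz/bootsector.
dd if="$workdir/disk.img" of="$workdir/bootsector" bs=512 count=1 2>/dev/null

sector_size=$(wc -c < "$workdir/bootsector" | tr -d ' ')

# Check that the copy matches the first 512 bytes of the image.
match_status=0
head -c 512 "$workdir/disk.img" | cmp -s - "$workdir/bootsector" || match_status=$?

echo "bootsector: $sector_size bytes, match check: exit $match_status"
rm -rf "$workdir"
```

Setting bs=512 and count=1 is what limits the copy to a single sector; without count=, dd would keep reading to the end of the device.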