City College of San Francisco - CS260A
Unix/Linux System Administration

Module: Administration Basics I

Compression

Preview question: If you want to download some data from a Unix system, you will often encounter the extension .tar.gz (or .tgz). Would you know what to do with the file?


Most of us have used zip files. The zip file format is a compressed archive: it both combines files into one file (archiving) and shrinks them (compressing). Traditionally, these two operations were separate, and they remain separate on Unix systems today. One program performs the archiving (tar or cpio) and another performs the compression (gzip, bzip2, or xz). (zip programs exist on many versions of Unix now, but they are rarely used except to share data with non-Unix systems. Instead, the newer xz compression is often used; xz files can be opened on a non-Unix system using 7zip.)

We will cover archiving programs in detail much later. However, tar files are so common on Unix that we should learn their basic use now, so we will use tar to explain archiving.

Simple archiving with tar

Archiving is the process of combining the contents of one or more directory structures, and all the files in them, into one archive file. An archive file created by tar is affectionately called a tarball. The resulting tarball has everything needed to reproduce the source structure and contents. Unlike the compression programs described below, archive programs on Unix do not delete the source data.

The archive file should be placed outside the directory being archived.

Example: create an archive of the directory Doc and all its contents. Name the archive Doc.tar

tar -cf Doc.tar Doc

After this command is finished, Doc.tar can be moved to a new location (or new system) and re-expanded to reproduce the Doc directory using

tar -xf Doc.tar 


(Note: -cf is a combination of -c and -f. The -f option takes an argument, the name of the archive, so Doc.tar must come immediately after the f. Although it should be acceptable to give the options separately, as in -c -f Doc.tar, some older versions of tar insist on a single - with all the options after it. Linux tar is not so fussy. Also, tar is so old that it is often used without the leading dash for the options, so tar xf Doc.tar is acceptable as well.)
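The create-then-extract round trip described above can be tried safely in a scratch directory. This is only a sketch: the names work, restored, notes.txt and more.txt are made up for the demonstration, and mktemp and diff are assumed to be available, as they are on typical Linux systems.

```shell
# Round trip: archive a directory, move the tarball, restore it elsewhere.
# All file and directory names here are invented for the demo.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir -p Doc/sub
echo "hello" > Doc/notes.txt
echo "world" > Doc/sub/more.txt

tar -cf Doc.tar Doc        # create the archive; Doc itself is left in place

mkdir restored
mv Doc.tar restored/       # stands in for copying the tarball to a new system
cd restored || exit 1
tar -xf Doc.tar            # re-expand: recreates Doc/ under the current directory

# diff -r prints nothing and succeeds when the two trees are identical
diff -r "$work/Doc" Doc && echo "restored copy matches the original"
```

Because everything happens under a fresh temporary directory, nothing outside it is touched.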

Information on tar options

option            meaning
-v                verbose: lists the files archived or extracted (with -c or -x), or gives ls -l type output (with -t)
-c                create archive
-x                extract archive
-t                list archive (t stands for table of contents)
-f archivepath    path to tarball
-z                archive is compressed with gzip
-j                archive is compressed with bzip2
-J                archive is compressed with xz

One of the options -c, -x, or -t must be given. The -f archivepath option must also be given, since the default location of the archive is usually a device that does not exist (it used to be the default tape drive).

You can combine the options in a single string, but, for portability, the f option must come last, so that it is immediately followed by the archivepath.

When extracting, the results are placed relative to the current directory. When archiving, the path to the directory to archive should be given relative to the current directory. tar overwrites anything that is in its way even if the object in its way is read-only.
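Both behaviors in the paragraph above (overwriting, and extracting relative to the current directory) are easy to see in a scratch directory. The following is a minimal sketch; the names abc, foo, x.tar and elsewhere are hypothetical, and it only exercises an ordinary writable file, not the read-only case.

```shell
# Sketch: tar overwrites existing files and extracts relative to the
# directory it is run from. All names are made up for the demo.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir abc
echo "original" > abc/foo
tar -cf x.tar abc                 # path given relative to the current directory

# 1. Extraction replaces whatever is in the way, without asking.
echo "edited after archiving" > abc/foo
tar -xf x.tar
cat abc/foo                       # prints: original

# 2. The same archive extracted from a different directory lands there.
mkdir elsewhere
(cd elsewhere && tar -xf ../x.tar)
ls elsewhere/abc                  # prints: foo
```

The practical lesson is to check which directory you are in before extracting, since edits made after the archive was created are lost without warning.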

Examples

tar xvf x.tar

Extract the archive x.tar, giving a list of data extracted. If the archive contained a file abc/foo, a directory abc would be created in the current directory and foo placed in it. If foo previously existed, it would be overwritten. The archive is not deleted.

tar -czf abc.tgz abc

Create the archive abc.tgz to contain the directory abc and all its contents. The resulting archive is gzipped. Here, tar is silent. The directory abc is not removed.

tar -t -v -f x.tar

Output a table of contents of the archive x.tar, using an ls -l style format. 

Note that the leading dash is redundant when the options are given as a single string.
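A common habit, given that extraction overwrites files, is to list a tarball before extracting it so you know what it will create. The sketch below builds a small throwaway archive first so that it is self-contained; abc, foo and x.tar are example names only.

```shell
# Sketch: inspect an archive with -t before extracting it.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir abc
echo "data" > abc/foo
tar -cf x.tar abc

listing=$(tar -tf x.tar)     # names only, one per line
echo "$listing"

tar -tvf x.tar               # ls -l style: permissions, owner, size, date, name
```

If every name in the listing starts with the same top-level directory (abc/ here), extraction will create just that one directory in the current directory.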

Since an archive contains all of the information in the original data, it is approximately the same size:

$ du -sk Doc
25508   Doc
$ du -sk Doc.tar
21064   Doc.tar
$

Because of this, archives are often compressed before they are transported to another system.

Compression

Compression programs work by finding redundant patterns in data and replacing them with shorter descriptions of those patterns. There are several compression utilities on Linux, but gzip is the most commonly used. Each of them compresses the input file, then replaces it with the compressed version. By default, the compressed file takes the input's name plus an added extension.

compressor input  --> produces input.xx (and deletes input!)

compressor    xx     uncompressor
gzip          gz     gunzip
bzip2         bz2    bunzip2
xz            xz     unxz
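The replace-the-input behavior can be watched directly. This sketch uses gzip and gunzip from the table on a deliberately repetitive file; the file name input and its contents are invented for the demonstration.

```shell
# Sketch: a compressor replaces its input with input.xx, and the
# matching uncompressor reverses that. Test data is made up.
work=$(mktemp -d)
cd "$work" || exit 1

# 2000 identical lines: plenty of redundancy for the compressor to find.
i=0
while [ "$i" -lt 2000 ]; do
    echo "the same line of text, repeated"
    i=$((i + 1))
done > input

before=$(wc -c < input)
gzip input                         # produces input.gz and deletes input
ls                                 # prints: input.gz
after=$(wc -c < input.gz)
echo "before: $before bytes, after: $after bytes"

gunzip input.gz                    # produces input and deletes input.gz
ls                                 # prints: input
```

Real data is rarely this repetitive, so expect far less dramatic savings in practice.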

The command

gzip Doc.tar

compresses Doc.tar and replaces it with Doc.tar.gz. When uncompressing Doc.tar.gz, you can give either the base name or the entire name as input:

gunzip Doc.tar  or    gunzip Doc.tar.gz

Either command looks for Doc.tar.gz, uncompresses it, and replaces it with Doc.tar.

uncompress works similarly, but bunzip2 requires the entire name of the compressed file when uncompressing it.

You can avoid this naming restriction by using standard input and standard output:

$ gzip < Doc.tar > Doc.tar.g
$ gunzip < Doc.tar.g > Doc.tar
$
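When gzip reads standard input it has no file to rename or delete, so the original survives and the output can be called anything. A small sketch, in which Doc.tar is just a stand-in text file and copy.tar is a made-up name:

```shell
# Sketch: with redirection, gzip and gunzip never see file names,
# so nothing is deleted and any extension works.
work=$(mktemp -d)
cd "$work" || exit 1
echo "some archive bytes" > Doc.tar     # stand-in file; any content will do

gzip < Doc.tar > Doc.tar.g              # nonstandard extension, Doc.tar is kept
gunzip < Doc.tar.g > copy.tar           # gunzip is not told the odd file name

# cmp is silent and succeeds when the two files are identical
cmp Doc.tar copy.tar && echo "round trip is byte-for-byte identical"
ls                                      # Doc.tar, Doc.tar.g and copy.tar all exist
```

This is handy when you want to keep the uncompressed file around as well as the compressed one.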

Question: You find xxx.tar.gz on the Internet. How do you extract it? 

Answer: The extensions show that this is a gzipped tarball. First, uncompress it using gunzip xxx.tar. Then extract the archive using tar -x -f xxx.tar (remember, gunzip removed the .gz extension).

gunzip xxx.tar

tar -x -f xxx.tar

If the input was named xxx.tgz, you can use standard input and output:

gunzip < xxx.tgz > xxx.tar

tar -x -f xxx.tar

On Linux, tar can handle gzip compression itself. You can skip the separate gzip/gunzip step by adding the z option to tar:

tar -cvzf Doc.tgz Doc

produces a gzipped tarball of the Doc directory. It can be uncompressed and restored using

tar -xvzf Doc.tgz
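On a system whose tar lacks the z option, the same result can usually be had by piping tar into gzip. The sketch below assumes that tar accepts - as the archive path to mean standard output or standard input, which is true of the tar programs commonly found on Linux but is not covered in the text above. Doc-piped.tgz and ch1.txt are made-up names.

```shell
# Sketch: tar's z option versus an explicit tar | gzip pipeline.
# Assumes "-f -" means stdout (when creating) or stdin (when listing).
work=$(mktemp -d)
cd "$work" || exit 1
mkdir Doc
echo "chapter one" > Doc/ch1.txt

tar -czf Doc.tgz Doc                     # one step, using the z option
tar -cf - Doc | gzip > Doc-piped.tgz     # two programs joined by a pipe

# Both files are gzipped tarballs; list each one a different way.
one_step=$(tar -tzf Doc.tgz)
piped=$(gunzip < Doc-piped.tgz | tar -tf -)
[ "$one_step" = "$piped" ] && echo "both archives list the same files"
```

The pipeline form never writes an intermediate Doc.tar to disk, which matters when space is tight.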

Notes for the compression programs

gzip - has compression levels 1 through 9, where 9 gives the most compression. (You specify the level using -N, where N is the level desired, for example gzip -9.) As you might expect, higher levels take longer and generally save more space. The default level (6) is a good tradeoff between time and space. gzip can also be used to uncompress (instead of gunzip), since gunzip simply calls gzip -d:

$ ls -l /bin/{gzip,gunzip}
-rwxr-xr-x. 1 root root    61 Jun 29  2010 /bin/gunzip
-rwxr-xr-x. 1 root root 68616 Jun 29  2010 /bin/gzip

$ file /bin/gunzip
/bin/gunzip: POSIX shell script text executable
$ cat /bin/gunzip
#!/bin/sh
PATH=${GZIP_BINDIR-'/bin'}:$PATH
exec gzip -d "$@"
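To get a feel for the compression levels, you can compress the same file at level 1 and level 9 and compare. This is a rough sketch on invented data (the numbers 1 to 20000); sizes on real files will differ, so the script prints them rather than assuming any particular result.

```shell
# Sketch: compare gzip -1 and gzip -9, then decode both with gzip -d.
work=$(mktemp -d)
cd "$work" || exit 1

# Sample input: the numbers 1..20000, one per line (made-up test data).
i=1
while [ "$i" -le 20000 ]; do
    echo "$i"
    i=$((i + 1))
done > sample

gzip -1 < sample > fast.gz         # least compression, fastest
gzip -9 < sample > best.gz         # most compression, slowest
echo "original: $(wc -c < sample) bytes"
echo "level 1:  $(wc -c < fast.gz) bytes"
echo "level 9:  $(wc -c < best.gz) bytes"

# gzip -d is what the gunzip script runs; either level decodes to the same data.
gzip -d < fast.gz | cmp - sample && echo "fast.gz decodes to the original"
gzip -d < best.gz | cmp - sample && echo "best.gz decodes to the original"
```

The level only affects how hard gzip works when compressing; uncompressing needs no level option.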

compress - older, faster compressor that uses the .Z extension (not usually installed on Linux). gzip can uncompress files that were compressed with compress.

bzip2 - usually compresses better than gzip, but is slower. You will see .tar.bz2 (or .tbz2) files on the Internet, but not nearly as often as .tar.gz files.
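If you are curious how the compressors compare on your own system, a loop like the following runs each one on the same file. It is only a sketch: the sample data is invented, the output names sample.gzip and so on are made up to avoid clashes, and any compressor that is not installed is skipped. It relies on the -c option (write to standard output, keep the input), which gzip, bzip2 and xz all accept.

```shell
# Sketch: compare output sizes of whichever compressors are installed.
work=$(mktemp -d)
cd "$work" || exit 1

i=1
while [ "$i" -le 20000 ]; do
    echo "line number $i of the sample file"
    i=$((i + 1))
done > sample

echo "original: $(wc -c < sample) bytes"
for tool in gzip bzip2 xz; do
    if command -v "$tool" > /dev/null 2>&1; then
        "$tool" -c sample > "sample.$tool"     # -c: write to stdout, keep sample
        echo "$tool: $(wc -c < "sample.$tool") bytes"
    else
        echo "$tool: not installed, skipped"
    fi
done
```

Results vary a great deal with the kind of data, so it is worth trying this on a file that resembles what you actually need to transfer.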



Copyright 2012 Greg Boyd - All Rights Reserved.