Information Kept about Files

City College of San Francisco - CS260A
Unix/Linux System Administration
Module: Administration Basics I

Information Kept about Data

Everything that is data is a file. This basic tenet of Unix can be confusing, but the simplification that it provides may be one of the reasons Unix has survived so long. At the level of interface between the user and the system, all data is housed in this one unit - the file.

There are different types of files, so far as the system is concerned, but these distinguish how to handle the file object at the system level - not what we normally think of as the filetype (e.g., PDF, JPEG).

Before we list and discuss the type of file information kept on Unix, we will briefly review where data is stored. Before we do that, however, I must rant for a moment about this units issue.

Units

As you know, computers work in bits. Multiple bits create the basic unit of storage, and since bits are binary (base 2), basic units of storage have always been powers of two. Traditionally, a kilobyte referred to 1024 (2**10) bytes. Also traditionally, a kilobyte was abbreviated KB or kB. (Similarly, a megabyte (MB) referred to 1024x1024 or 1048576 bytes.) Transmission units are a different story. When transmission units are spoken of, kilo usually refers to 1000, perhaps to avoid confusing the public. To make this worse, computer advertisements often use the 1000 multiple in all contexts, as the number quoted in MB is larger when using multiples of 1000! Thus, from Wikipedia: "Although the prefix kilo- means 1000, the term kilobyte and symbol KB have historically been used to refer to either 1024 (2¹⁰) bytes or 1000 (10³) bytes, dependent upon context."

Someone in some standards organization decided to standardize this in favor of the "public" use of multiples of 1000. They added the word "kibibyte" (KiB) to denote the traditional power-of-two idea. However (again from Wikipedia), "in the over‑12 years that have since elapsed, the proposal has seen little adoption by the computer industry." To make matters worse, the standards organization adopted an exception for suffixes produced by computer programs, where M and k denote the traditional power-of-two meaning.

Fortunately, for most people, the roughly 5% difference between MB and MiB rarely matters. (Of course, this grows to about 7% when you get to GB/GiB and 10% when you get to TB/TiB, measured relative to the units of 1000.) If you care, however, you probably have to investigate, unless you are fortunate enough to have a document quoted in MiB. (Although some of the computer users I surveyed thought MiB meant the 1000 counterpart.) Annoying.
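The percentages above are easy to check with a little arithmetic. The following is a minimal sketch, assuming bash and awk are available; the helper name pct_diff is made up for this illustration and is not a standard command.

```shell
#!/bin/bash
# pct_diff is a hypothetical helper, not a standard command.
# It prints how much larger 1024**n is than 1000**n, as a percentage.
pct_diff() {
    # $1 = exponent n (1 for KB, 2 for MB, 3 for GB, 4 for TB)
    # LC_ALL=C keeps the decimal separator a period regardless of locale.
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.1f\n", (1024^n / 1000^n - 1) * 100 }'
}

kb=$(pct_diff 1)
mb=$(pct_diff 2)
gb=$(pct_diff 3)
tb=$(pct_diff 4)

echo "KB: ${kb}%  MB: ${mb}%  GB: ${gb}%  TB: ${tb}%"
```

The gap grows with each prefix because the factor 1024/1000 is applied once more at every step.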

I will dispense with this problem in my documentation. Any measure of KB, MB, GB, TB, etc., always refers to the power-of-two version. I will not be editing my documents and adding an 'i' based on an ill-conceived and poorly-adopted standard.

The term "block" is another issue, but we will defer that until we discuss filesystems.

How a file is kept track of

The data comprising a file is divided into two logical places: the file's contents (the data itself) and information about the file. The file's information is kept in an information node (inode). Inodes are kept logically in a table, which is an array of inodes. The index of a particular inode in the array is called the inode number.

The file's inode contains all of the unique information about the file, including the location of its data (the location of the file's contents). Thus, given an inode number, you have access to all of the information comprising the file. Inode numbers are, of course, unique for a filesystem. You can see the inode number for a piece of data by adding the -i option to ls.
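As a quick illustration of looking up an inode number, here is a minimal sketch. It assumes bash and the GNU version of stat found on linux; the scratch file name notes.txt is arbitrary.

```shell
#!/bin/bash
# Work in a scratch directory so nothing existing is touched.
workdir=$(mktemp -d)
touch "$workdir/notes.txt"

# ls -i prints the inode number in front of the name.
ls -i "$workdir/notes.txt"

# Capture just the number: it is the first whitespace-separated field.
inum_ls=$(ls -i "$workdir/notes.txt" | awk '{ print $1 }')

# GNU stat can print the same number using the %i format.
inum_stat=$(stat -c '%i' "$workdir/notes.txt")

echo "ls says $inum_ls, stat says $inum_stat"

rm -rf "$workdir"
```

The actual number differs from system to system; what matters is that both commands report the same one.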

Unix would be perfectly happy to refer to files using inode numbers, but humans prefer to use names. When a name is associated with the file, it creates a reference called a link. A link is a name-inode number pair. The link is stored in a special binary file called a directory.

A directory, then, is simply a binary file containing links (name-inode number pairs). The names are conceptually thought of as being "in the directory".
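The idea that a name is just a reference to an inode can be seen by giving one inode two names. This is a minimal sketch, assuming bash and GNU stat; the names first and second are arbitrary.

```shell
#!/bin/bash
workdir=$(mktemp -d)
echo "some data" > "$workdir/first"

# ln (without -s) adds a second name-inode number pair for the same inode.
ln "$workdir/first" "$workdir/second"

# Both names are listed with the same inode number.
ls -li "$workdir"

inum_first=$(stat -c '%i' "$workdir/first")
inum_second=$(stat -c '%i' "$workdir/second")

# Removing one name only removes that link; the data is still
# reachable through the other name.
rm "$workdir/first"
remaining=$(cat "$workdir/second")
echo "after rm first, second still contains: $remaining"

rm -rf "$workdir"
```

Neither name is the "real" one. Each is simply an entry in the directory that points at the same inode.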

A directory, of course, is a file, and thus has an inode. The name of the directory is associated with the directory's inode number and this link is stored in another directory, which is the parent directory.

This logic can be used to trace the path of a file backwards to the root directory of the filesystem. We will learn much more about filesystems and their organization later in the course.

Information kept about files

Much of the information kept in the inode is normally seen in the output of ls -l. Items of particular interest include:

the filetype. The filetype indicates how the system should deal with the file. The most common filetype is regular file. When users use the word file they are referring to regular files. Other filetypes are directory, symbolic link and device files. There are about a dozen filetypes. The filetype is indicated by the first character in the ls -l output of the file.
the number of links. This is the number of names associated with this inode. Every instance of a name/inode number pair that references this inode counts as a link. Regular files typically have one link. Directories have at least two links - one for the name of the directory, which is kept in the directory's parent directory, and one for dot (.) that is kept in the directory itself. When the number of links (references) to the inode is decremented to zero, the inode is no longer referenced. At that time, it, and any data it refers to, are recycled.
the permissions.
the uid of the file's owner and the gid of the file's group. (The name of the owner and of the group are looked up in /etc/passwd and /etc/group.)
the file's dates. Dates are stored as seconds since the beginning of the epoch. (That is, seconds since the birth of Unix, which was Jan 1, 1970.) There are three dates:

the modification date is the last time the file's contents were changed. This is the date that is output by default by ls -l.
the access date is the last time the file's contents were read. The access date may be examined by ls -lu. (NOTE: on linux filesystems such as ext4 that are mounted with the default relatime option, the access date is only updated if it is older than the last modified or last changed date or if it is more than one day out of date!)
the change date is the last time the inode was changed. The change date may be examined by ls -lc.

the file's size in bytes.
pointers to the file's data blocks or, in ext4, information about its extents.

Besides being output by the ls -l command, information about a piece of data can be examined by the stat(1) command on linux. For those interested, the definition of the inode structure for the ext2/ext3 filesystem can be found in
/usr/include/linux/ext2_fs.h
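The items listed above can be pulled out individually with the format option of stat. This is a minimal sketch, assuming the GNU versions of stat and date found on linux (the -c and -d options are not available in the same form on every Unix); the file and directory names are arbitrary.

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/file"
chmod 640 "$workdir/file"
mkdir "$workdir/dir"

# %F filetype, %h number of links, %a permissions (octal),
# %u and %g numeric uid and gid, %s size in bytes,
# %X %Y %Z access, modification and change dates in seconds since the epoch.
stat -c 'type=%F links=%h perms=%a uid=%u gid=%g size=%s atime=%X mtime=%Y ctime=%Z' \
    "$workdir/file"

ftype=$(stat -c '%F' "$workdir/file")
flinks=$(stat -c '%h' "$workdir/file")
fperms=$(stat -c '%a' "$workdir/file")
fsize=$(stat -c '%s' "$workdir/file")
mtime=$(stat -c '%Y' "$workdir/file")

# Convert the stored seconds into a readable date.
date -d "@$mtime"

# Link count of a new, empty directory. On ext4 this is typically 2
# (its name in the parent, plus dot); other filesystems may differ.
echo "directory links: $(stat -c '%h' "$workdir/dir")"

rm -rf "$workdir"
```

Note that stat reports the owner and group as numbers here, which is how the inode stores them. The names shown by ls -l are looked up afterwards.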

Curious facts about file information

A symbolic link simply associates a path to another file with the symlink name. On some versions of Unix, if the path is short enough, it is stored directly in the inode, using the space usually reserved for the file's block pointers. The example below illustrates this:

$ ln -s foo_file foo
$ ls -l foo
lrwxrwxrwx. 1 gboyd cisdept 8 Jan 10 14:22 foo -> foo_file
$ du -sk foo
0 foo
$

In this example, the path associated with the symlink will fit in the inode. Thus, even though the file's size in the ls -l output shows 8 bytes (the number of bytes in the path that the symlink refers to), du -sk shows that there are no data blocks associated with the symlink file. In contrast,

$ ln -s 'very long file path' foo
$ ls -l foo
lrwxrwxrwx. 1 gboyd cisdept 191 Jan 10 14:22 foo -> very long file path
$ du -sk foo
4 foo
$

(Note that the very long file path was created by holding a character down until it repeated; the ls -l output above shows that the resulting path is 191 bytes long.) In this example, the path that the symlink refers to is too long to fit in the inode. Thus, it must be placed in a data block, and du -sk shows that the minimum amount of data has been allocated (this is evidently a filesystem with 4k blocks).

The permissions of symbolic links are meaningless. This is shown on linux by setting them to 777 when the symlink is created. You cannot change a symlink's permissions either; using chmod changes the permissions of the item the symlink refers to.
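The behavior of chmod on a symlink can be checked directly. This is a minimal sketch, assuming bash and GNU stat on linux; the names target and link are arbitrary.

```shell
#!/bin/bash
workdir=$(mktemp -d)
touch "$workdir/target"
chmod 644 "$workdir/target"
ln -s target "$workdir/link"

# GNU stat does not follow symlinks unless -L is given,
# so this reports the permissions of the symlink itself.
link_perms_before=$(stat -c '%a' "$workdir/link")

# chmod follows the symlink and changes the file it refers to.
chmod 600 "$workdir/link"

target_perms=$(stat -c '%a' "$workdir/target")
link_perms_after=$(stat -c '%a' "$workdir/link")

echo "target: $target_perms  link before: $link_perms_before  link after: $link_perms_after"

rm -rf "$workdir"
```

The target's permissions change while the symlink's own permissions stay as they were.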

Since device files have no size, the device's major and minor numbers are kept in the inode instead, and ls -l displays them where the size would normally appear.
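The major and minor numbers of a device file can be seen in the ls -l output and extracted with stat. This is a minimal sketch, assuming bash and GNU stat on linux, and assuming /dev/null exists as an ordinary character device.

```shell
#!/bin/bash
# In the ls -l output, the column that normally holds the size
# shows "major, minor" for a device file.
ls -l /dev/null

dev_type=$(stat -c '%F' /dev/null)

# %t and %T print the major and minor numbers in hexadecimal.
major_hex=$(stat -c '%t' /dev/null)
minor_hex=$(stat -c '%T' /dev/null)

# 16#... is bash's notation for a base-16 number.
echo "$dev_type: major $((16#$major_hex)), minor $((16#$minor_hex))"
```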

Location of Executables

As you know, when you start an executable at the command-line, the user's PATH variable is used to find that executable and run it. Thus, if you execute the command

$ xxx

and your PATH is

$ echo $PATH
/usr/lib/qt-3.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/gboyd/bin

the shell will search each directory in $PATH sequentially until xxx is found, then run it. Normally, you trust that the executable found is the one you want. In fact, you don't care which directory in your PATH contains xxx. From time to time you may need to know where an executable comes from. There are several commands that may be used to give you information about the location of executables on your system and, unfortunately, there is a lot of confusion about the difference between them.
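The sequential search can be modeled with a short function. This is only a sketch of the idea, assuming bash: the function name find_in_path is made up for this illustration, and the real shell also considers aliases, functions, built-ins and its hash table before it walks the PATH.

```shell
#!/bin/bash
# find_in_path is a hypothetical helper that models the PATH search.
# It prints the first executable regular file named $1 found in $PATH.
find_in_path() {
    local name=$1 dir dirs
    # Split PATH on colons into an array.
    IFS=: read -r -a dirs <<< "$PATH"
    for dir in "${dirs[@]}"; do
        # An empty PATH entry means the current directory.
        [ -z "$dir" ] && dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example: where would ls be found?
find_in_path ls
```

Because the function returns at the first match, directories earlier in PATH take precedence over later ones, just as in the shell.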

whereis xxx

According to its man page, the whereis command "attempts to locate the desired program in a list of standard Linux places." Also, according to the man page, "whereis has a hard-coded path, so it may not always find what you’re looking for."

This means that whereis uses its own internally-defined set of directories and search order, which may have nothing to do with your $PATH setting.

As an example, consider the output of the following commands

$ hello
hello, world
$ whereis hello
hello:
$

In this example, the executable hello is obviously found by the shell, but whereis does not locate it. This is because hello is in my personal bin directory (the last directory in my $PATH above) and the whereis command does not look there!

So what is the purpose of the whereis command? It is used to locate standard system executables. Even given this, there is no guarantee that the command whereis finds is the one that will be executed when you run it. whereis does not use your PATH, and your PATH may include other directories or may list directories containing conflicting commands in a different order. Here is another example:

$ whereis find
find: /bin/find /usr/bin/find /usr/share/man/man1p/find.1p.gz /usr/share/man/man1/find.1.gz

This seems to indicate that the find command will be executed preferentially from /bin. (whereis on linux also outputs the path to man pages about the name in question.) However, examining my PATH again (see above) shows that the version in /usr/bin is the one found first!

whereis is useful to locate standard system files. Do not use it to ask the question "where did that executable come from?"

which xxx

In a way, the algorithm used by which is even more confusing than that used by whereis. Originally, the which command came from the C shell, and it used the C shell's ${path} to find executables. This, in fact, is still true on hills. On linux, however, the which command apparently uses bash's current PATH. Just be careful of which when you go to another system, as it may not behave the same as the bash built-in type:

type xxx

type uses bash's internal PATH (i.e., your current PATH) to give you information about where command xxx comes from. On linux it appears to be similar to which. In our example of the hello command above, both type and which give the correct information:

$ which hello
~/bin/hello
$ type hello
hello is hashed (/home/gboyd/bin/hello)
$

The Korn-shell equivalent of type is whence.
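When the same name exists in more than one PATH directory, bash's type can list every match, and command -v gives a portable way to ask which one would run. This is a minimal sketch, assuming bash; the command name hello_demo and the directories first and second are invented for the illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
mkdir "$workdir/first" "$workdir/second"

# Create two different executables with the same name.
for d in first second; do
    printf '#!/bin/sh\necho "hello from %s"\n' "$d" > "$workdir/$d/hello_demo"
    chmod +x "$workdir/$d/hello_demo"
done

# Put both directories at the front of the PATH, "first" before "second".
PATH="$workdir/first:$workdir/second:$PATH"

# type reports the one that would be executed.
type hello_demo

# type -a lists every match, in PATH order.
type -a hello_demo

# type -P prints only the path found by the PATH search.
chosen=$(type -P hello_demo)

# command -v is the POSIX way to ask the same question.
portable=$(command -v hello_demo)

# Combining -a and -P prints the path of every match, one per line.
all_matches=$(type -a -P hello_demo)

echo "chosen:   $chosen"
echo "portable: $portable"

rm -rf "$workdir"
```

Since type and command -v are part of the shell itself, they answer using the same PATH the shell would search.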
