City College of San Francisco - CS260A
Unix/Linux System Administration

Module: Filesystems I

Filesystem Types

Unix filesystems have three major components: superblock(s), inodes and data blocks.

In addition to these major components there are some data structures to keep track of free space, and a journal, if the filesystem uses one. These take up a very small amount of space - less than 100MB - even if journaling is used.

Different Unix filesystem types vary in how they lay out these components, in their block size, and in their allocation and bookkeeping strategies.

The System V Filesystem

The easiest way to discuss older filesystem types used on Unix and linux systems is to start with the base model, the System V Filesystem. The System V Filesystem (sysv) is very simple. The partition is viewed as a linear array of blocks. When the filesystem is created, this array is allocated very simply: the superblock comes first, followed by a fixed-size array of inodes, with the remainder of the partition used as data blocks.

The blocksize was very small (no more than 1kB), so that as little space as possible was wasted.

Other filesystems are measured against the System V Filesystem. It was intended to be simple and to use space efficiently. It accomplished these goals, but it also had some significant drawbacks:

Consider this HTML file. It consumes slightly more than 10kB of space on linux. The System V Filesystem would allocate this file as eleven allocation units of one kB each. If you think about allocating and deallocating multiple files of various sizes in this fashion, it does not take long to see your disk become fragmented.
Small disk errors were very common on earlier disks. These consisted mostly of areas of the disk that were marginal and eventually failed, causing one or more sectors in a localized area to become unreliable. Other small disk errors arose from 'head crashes': old disks often had 'flying heads' that could strike the surface of the disk, marring the magnetic coating and destroying information. If either of these types of errors occurred at the superblock, the entire filesystem was lost. If one occurred in the inode area, the loss of a mere 8kB of that area (64 inodes) destroyed all the information associated with 64 files, essentially losing those files and all their contents, irrespective of how large the files were. If some of the lost inodes described directories, the damage was more severe.

Finally, accessing a file could require reads from widely separated areas of the disk: the inode, any index blocks, and the data blocks could be scattered across the partition. Again, this cost is due to using movable heads on older disk drives. Today it would be less, but each location on the disk that must be read requires a separate queued read, and there may be some cost due to disk latency (the time it takes for the disk to spin around to the position you want).
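The block arithmetic in the 10kB example above can be sketched in the shell (the exact byte count is an assumption for illustration; any size between 10kB and 11kB gives the same result):

```shell
# System V-style allocation: whole 1kB blocks, rounded up.
size=10342          # assumed file size in bytes, slightly more than 10kB
bs=1024             # System V block size
blocks=$(( (size + bs - 1) / bs ))   # round up to whole blocks
echo "$blocks blocks of $bs bytes = $(( blocks * bs )) bytes allocated"
# -> 11 blocks of 1024 bytes = 11264 bytes allocated
```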

The Berkeley Fast File System (FFS or UFS)

This filesystem, also called the Unix file system (UFS) or fast file system (FFS), enhanced versions of which are still in use today, addresses many of the drawbacks of the System V Filesystem, and forms the base of every major Unix (and linux) filesystem today. It incorporates several significant enhancements to remedy those shortcomings.

A cylinder group is a set of 'functionally adjacent' cylinders that can be readily accessed together. Storing all the pieces required to access a file's data (the inode, directory, index blocks, and data blocks) together is a significant optimization, especially on older disks with moving heads. (Note: index blocks or indirect blocks will be discussed in the next section under Block Size)

Suppose you have two files: one is 8kB and one is 40kB. On a filesystem with a 64kB block size, you would have to allocate one block per file, for a total allocation of 128kB to store 48kB of information. If fragments with the granularity of 8kB are allowed, both files can be allocated in a single block, by creating an 8kB fragment at the end of the block containing the 40kB file, resulting in the allocation of 64kB for the storage of 48kB of information. As you might imagine, the complexity introduced to handle the allocation strategy and bookkeeping for this enhancement is significant.
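The savings from fragments in the example above can be checked with shell arithmetic (8kB fragments subdivide a 64kB block into eight pieces):

```shell
bs=$((64 * 1024)); frag=$((8 * 1024))   # block and fragment sizes
f1=$((40 * 1024)); f2=$((8 * 1024))     # the two files from the example

# Without fragments, each file occupies a whole block.
no_frag=$(( 2 * bs ))

# With fragments, count 8kB pieces needed: 5 + 1 = 6, which fit in the
# eight fragments of a single 64kB block.
frags=$(( f1 / frag + f2 / frag ))
with_frag=$(( frags <= bs / frag ? bs : 2 * bs ))

echo "without fragments: $(( no_frag / 1024 ))kB, with: $(( with_frag / 1024 ))kB"
# -> without fragments: 128kB, with: 64kB
```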

The UFS has been extremely successful, and forms the basis of the linux extended filesystem (ext2), which is the filesystem used on most linux systems today. ext2 does not implement fragments, however, and takes a compromise position on blocksize: an increase over the System V Filesystem, but considerably less than the UFS. The RedHat system we use today allows filesystem block sizes of 1, 2, or 4kB. It also simplified the cylinder group concept, and made one further enhancement: the path stored in a symbolic link is stored directly in the inode, if it will fit, rather than in an associated datablock.
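A minimal sketch of selecting and verifying the ext2 block size, using a scratch image file so no root access is needed (assumes the e2fsprogs tools are installed; the image path is an arbitrary choice):

```shell
# Create an 8MB scratch image and make an ext2 filesystem in it with a
# 4kB block size (-b accepts 1024, 2048, or 4096). -F forces mkfs to
# work on a regular file.
dd if=/dev/zero of=/tmp/ext2.img bs=1M count=8 2>/dev/null
mkfs.ext2 -F -q -b 4096 /tmp/ext2.img

# Confirm the block size from the superblock.
tune2fs -l /tmp/ext2.img | grep 'Block size'
```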

Journaled Filesystems

Each of the filesystems we have studied uses traditional updates. Filesystems are block-access, and the kernel is free to optimize block I/O as it wishes. This means that some filesystem writes will be delayed. Even if this were not the case, sudden system interruptions can cause filesystem errors. 

Consider the creation of a new file. The update might occur using the following steps:

  1. data blocks are allocated and filled with the data
  2. an inode is allocated, and initialized to describe the data. The block addresses of the data are added to the inode.
  3. the file's directory is updated to refer to the inode and associate a name

If each of these write operations is considered atomic, a system interruption, such as a power failure, could occur between any steps. If it occurs before step 2 is complete, the data is lost, as no record has been made of it (there is no inode). (The record is in memory, but that doesn't do any good when the power fails.) If the failure occurs between steps 2 and 3, however, the data is safe, as the combination of the inode and data blocks comprise the data, but it has not been 'attached' to a directory. The likelihood of an interruption occurring at this point is increased because directory blocks are 'sticky'. Since updates to directories are often localized (in other words, the same directory is updated multiple times in a short time interval), they are kept in memory longer, delaying their update. (Note that we are speaking of milliseconds here.) 

When the system comes back up, the state of the filesystem is unknown. The only solution is to analyze it completely. This involves accessing much of the filesystem to verify all the bookkeeping, to add the missing updates, and to gather unreferenced (unattached) inodes. This process is the purview of fsck, the file system check program, and is the subject of a later section. For our purposes, the issue is how long the process takes. As filesystems increase in size, the time required to check them completely increases dramatically. The difference between an old filesystem of 500MB and a current filesystem of 20GB is on the order of 100 times. Today it can take hours to check a filesystem completely. This is not good when you need your system.

The solution to this problem is smarter updates and more bookkeeping. This is what a journaled filesystem does. Basically, updates are grouped into consistent sets, and a journal is kept of pending updates (transactions), including the data to be updated. When a transaction is 'committed', the data is forced to disk and the transaction entry can then be removed. When the filesystem must be recovered after a system interruption, the only areas that can possibly be inconsistent are those with pending changes (transactions), so only the journal must be replayed and its results checked. This greatly simplifies and dramatically speeds up the process of recovering the system. Journaling filesystems commonly found on linux include ext3 and ReiserFS.

Journaling filesystems have some overhead: writes may be done twice (once to the journal, once to the filesystem), and the journal takes some space. Thus, journaling filesystems are not recommended on extremely small filesystems or on very slow devices.

Note: it may be unwise to use a journaled filesystem on a pen (USB flash) drive, as journaled filesystems do significantly more writes to the device (and to the same location) due to the journal. ext2 may be a better option.

ext3

The ext3 filesystem is simply ext2 with journaling added. The purpose of the journal is to speed filesystem recovery after a crash. It is recommended that all filesystems of any significant size be ext3 instead of ext2.

The ext3 filesystem contains tunable parameters that force a consistency check periodically. These parameters may be changed using tune2fs(8).
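A sketch of inspecting and changing those parameters with tune2fs, run here against a scratch image so no root is needed (on a real system you would name the device, e.g. /dev/sda1, as root; the values chosen are just examples):

```shell
# Build a small scratch filesystem to experiment on.
dd if=/dev/zero of=/tmp/tune.img bs=1M count=8 2>/dev/null
mkfs.ext2 -F -q /tmp/tune.img

# Show the current periodic-check parameters.
tune2fs -l /tmp/tune.img | grep -E 'Maximum mount count|Check interval'

# Force a check every 30 mounts or every 180 days, whichever comes first.
tune2fs -c 30 -i 180d /tmp/tune.img
```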

ext4

The ext4 filesystem uses extents rather than a block array to address file fragments. (An extent is a contiguous set of disk blocks.) This results in faster allocation, less fragmentation, and more compact bookkeeping, particularly for large files.

ext4 also does a much better job of optimizing disk writes. This has an important consequence: in an ext3 filesystem, a file was flushed to disk when its transaction was committed. (This essentially mimics the application performing an fsync() system call on the file when it is written.) In ext4 the actual writes to disk may be delayed by several seconds in the absence of an explicit fsync() call. Although the kernel tries to compensate for problematic situations by forcing data to disk, this means that ext4 has a somewhat higher probability of data loss due to a sudden power failure unless the applications that write the data are more strict. The benefit of this write optimization is, of course, a faster filesystem.

ext4 file timestamps have a much finer (sub-second) granularity than earlier ext filesystems. The documentation also indicates that the ext4 filesystem implements a file creation date. These characteristics can be confirmed on our system by examining the inode with the filesystem debugger (but see the notes in that section.) The updated time fields were also changed to delay the 'Unix end of the world' from 2038 to 2242, at which date Unix time will reset to a long long time ago.
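The 2038 limit mentioned above is simply a signed 32-bit time_t overflowing; GNU date (assumed available) can display the exact moment:

```shell
# A signed 32-bit time_t overflows 2^31 - 1 seconds after the 1970 epoch.
# The -d @N form (GNU date) interprets N as seconds since the epoch.
max=$(( 2**31 - 1 ))
echo "last representable second: $max"
date -u -d @$max      # -> Tue Jan 19 03:14:07 UTC 2038
```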

Write barriers on ext3 and ext4 filesystems

This problem of delayed writes discussed with ext4 filesystems is compounded on both ext3 and ext4 filesystems by the larger on-disk caches on newer large disk devices. When a file is flushed to disk, these devices report the file as written once it is in the disk cache. If the disk cache is not persistent upon a power failure (i.e., it does not have a battery backup), this data could be lost. To avoid this problem, the administrator can turn on write barriers for the filesystem in question. If write barriers are on, the on-disk cache is flushed at times that ensure that data is actually written to the disk device when it is committed. Write barriers also ensure that the on-disk cache is flushed when an explicit fsync() call is used by the application.

According to the documentation, write barriers are not enabled in ext3 filesystems by default, but they are enabled by default in ext4 filesystems. I cannot find a way to verify that, since no barrier flag appears in the output of tune2fs. Apparently the only way to confirm it would be a bitwise analysis of a dump of the filesystem if you knew where the parameter value was stored.

Write barriers can be controlled by the barrier and nobarrier options to mount.
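For example, barriers could be disabled on a filesystem whose disk cache is battery-backed (a sketch: the device name and mount point below are placeholders, not from our systems):

```
# /etc/fstab fragment - disable write barriers on one ext4 filesystem
# (device and mount point are placeholders):
/dev/sdb1   /data   ext4   defaults,barrier=0   1 2
```

The same option can be applied to a mounted filesystem with `mount -o remount,barrier=0 /data`.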

Note that erecting write barriers may significantly erode the value of the on-disk cache and thus affect performance of the filesystem.

ReiserFS

ReiserFS was a filesystem with a newer design that stored directory entries, inodes, and the beginning data blocks in a single structure. The result was vastly improved performance, especially in applications that deal with a large number of very small files. The future development of this filesystem and its new variant, Reiser4, has been somewhat clouded due to the legal problems of its main developer, Hans Reiser (see wikipedia for more information). ReiserFS is, however, currently supported by the linux kernel. Although it heralded a new generation of filesystem designs, it has now pretty much been sidetracked by others.

XFS

Developed by SGI in the 1990s, XFS was open-sourced around 2000 and included in linux kernels after that. It is the first well-known filesystem to use B+-trees, which it uses for block allocation and directories. The use of this tree structure to allocate space for extents makes file allocation much faster. It is the first true 64-bit filesystem and, as such, supports huge files (8 exbibytes, or 8x2^60 bytes) and volumes (16 exbibytes). Adopted as the base filesystem by RedHat as of RHEL7, support for it was even added to grub2, meaning it can be the only filesystem in use on most RHEL7 systems.

Like ext4, XFS is a journaling filesystem that uses delayed allocation to attempt to allocate files using contiguous space, and it likewise supports write barriers. Benchmarks show an improvement of at least 10% in throughput compared to parallel ext4 filesystems.

XFS filesystems can be defragmented while in-use. They can also be extended (resized to a larger size) while in-use, as can ext4. They cannot, however, be shrunk.

Future development

At the time of this writing, it appears that ext4 and XFS are being used as a bridge to a newest-generation filesystem named btrfs, which, although still under development, is available on RHEL7. An initial examination of btrfs characteristics shows some awesome new capabilities, including the ability to maintain files that share common portions. This greatly simplifies, at a filesystem level, the taking of file and device snapshots. A live snapshot can be created that grows in size as the original object changes - only recording the changed blocks and sharing unchanged data with the original. btrfs (named for the b-trees that it is based on) appears to have many of the qualities of a logical volume manager. It can span volumes, support RAID arrays, etc. Check out the wikipedia article if you are interested.

Other filesystem types

Many other filesystem types are supported at least partially in the linux kernel. For more information on your system, see fs(5). Note that some or most of these will be disabled in your kernel, as many are legacy code and are not often used. The only two other filesystems that are always available are the msdos-compatible filesystem, msdos, and the ISO9660 filesystem, which is used for CDs and DVDs.

Pseudo-filesystems

The file interface is so central to a Unix system that it has been used for various non-file functions. Thus, the standard file interface, which everyone knows and understands, can be used for other data-oriented tasks. For example, a terminal connection writes to a 'file' for the terminal. This 'file' is not a file at all: it is a communication area set up to transfer the data between the kernel and the device. Usually mounted under /dev/pts, this set of 'files' will accept data destined for the terminal. Similarly, Unix shared memory may appear as a file. A process using shared memory opens something that looks like a file and uses the standard file interface to read and write shared memory. In actuality, these reads and writes are converted to appropriate system calls to access or update the appropriate data.

Pseudo-filesystems are used to 'attach' these 'fake' filesystems into the Unix filesystem. These pseudo-filesystems will appear in the output of mount, but their device will be indicated as none. They also may appear in the file system table, as we will see. Other commonly-used 'fake' filesystems include the well-known /proc filesystem as well as tmpfs for shared memory and sysfs for device information.
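A quick way to see some of these pseudo-filesystems on a running linux system is to filter /proc/mounts (itself part of the /proc pseudo-filesystem) by filesystem type:

```shell
# /proc/mounts fields: device  mountpoint  fstype  options  dump  pass
# Pseudo-filesystems have no real device in column 1 (often 'none',
# 'proc', 'tmpfs', ...). Select a few well-known types by column 3.
awk '$3 == "proc" || $3 == "tmpfs" || $3 == "sysfs" || $3 == "devpts"' /proc/mounts
```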


This page was made entirely with free software on linux: Kompozer and Openoffice.org

Copyright 2014 Greg Boyd - All Rights Reserved.
