File Systems Understanding the Internals
Henry Newman
This month, I will cover the internal workings and implementations of file
systems. File systems are the next stage in the data path as part of my series
on storage. In my last column, I covered volume management how volumes
are created and managed, and performance issues related to striping volumes.
For file systems that require a volume manager to control multiple devices,
understanding the integration between the volume manager and file system is
critical.
File System Services
A file system provides a management framework for the storage. It allows you
to see and manage the data via the framework that we call files. File systems
do the following:
- Manage the available space Allocation maps, where the files reside
on the storage, removal of files
- Provide access control Standard UNIX/POSIX permissions, Access Control
Lists (ACL)
- Support standard UNIX/POSIX interfaces read/write, fread/fwrite,
aioread/aiowrite
- Support other feature functions Hierarchical storage management,
special I/O for databases
Some file systems have additional features and functions for data sharing across servers, security, file locking, and hierarchical storage management (HSM).
File System Functions
Historically, file systems needed to be checked after a system crash for consistency.
As file systems got larger and disk performance did not scale with the size,
the time to check the file system after a crash became longer and longer. I
remember back in 1992 waiting for 11 hours to fsck(1M) a Cray file system
that I had crashed (the customer was not happy). Since that time, a number of
technologies have been developed to improve this functionality.
Logging
Log-based file systems were first discussed in a USENIX paper, The Design
and Implementation of a Log Structured File System, by Rosenbaum and Ousterhout
in 1990. In this paper, the authors analyzed I/O usage and concluded that:
- I/Os are short.
- Most I/Os are random.
- All data is fragmented on current file system technology (BSD and other
file systems).
- Disk access speeds have not changed much.
Since that time, file system vendors such as Compaq/HP (ADvFS), Veritas (VxFS), SGI (XFS), IBM (JFS), Sun (UFS Logging), and others, as well as Linux file systems (EXT3, ReiserFS) have taken the original concept and modified it to log metadata operations. The goal is to ensure that the file system metadata is synchronized when the log area becomes full. If the system crashes, the expectation is that only the metadata log will have to be checked for consistency after the boot not all of the metadata. This check is commonly called fsck(1M). This methodology was developed based on the requirement to boot quickly after a crash.
Log Placement
Most file system and volume managers allow the placement of the log on another
device different from the data. This is done to reduce contention and increase
the performance of the file system metadata operations. Each time the files
are written or opened, the inode is kept in the log and it is periodically written
to the metadata area within the actual file system metadata area(s).
When doing a large number of metadata logging operations, logging device performance can become an issue. With logging, the file system metadata is copied two times.
- The file system metadata is written to the log device.
- The file system metadata is moved from the log device to the file system
metadata area after the log becomes full.
This double copy can become a performance bottleneck if, for example, a large number of metadata operations fill the log and the file system is busy with data operations, or if the log device is slower than the number of log operations that are needed.
Some people (including me) have philosophical issues with logs and logging, and you either fall into the logging camp or non-logging camp, and there are far more loggers than non-loggers. These file system wars are best left to bar room discussions (which occasionally turn into brawls), but in a nutshell, the original reason for logging was to cache data and metadata. Logging is currently used as a method for fast file system recovery. Fast file system recovery is the requirement, not the logging, and I believe other methods can meet the requirement.
Inode and Data Location
Two types of inode formats are used by file systems to describe the location
of the data associated with the file being represented by the inode. They are:
- Indirect blocks
- Extents
Indirect Blocks
If a file system is using indirect blocks, the inode points to the data block
but all subsequent data blocks are pointed to by the last data block. So, the
first inode points to the first data allocation, and the last data within that
first allocation points to the location of the next data block. UFS is a good
example of a file system that uses indirect block allocation to represent block
address. Using indirect blocks becomes a huge issue when doing random reads,
because often the whole file must be read to determine the location of the data
that is needed. Here are some good Web resources for further information:
http://ou800doc.caldera.com/FS_admin/ \
_ufs_Inodes.html#_The_ufs_Inodes_Disk_Block_Addre
http://www.cs.wisc.edu/~bart/537/lecturenotes/s25.html
http://www.stanford.edu/class/cs140/projects/filesys/review.html
The advantage of this indirect block method was that adding to the end of
a file was simple, because the data block was in memory (remember that 8K of
memory was huge in 1965). It was a great idea given the available technology
at the time.
Extents
With extent-based inode usage, the inode contains the address list of where
data resides in the file system. For many file systems, each inode contains
space for 15 addresses and a pointer to the location of the next inode, which
can also contain 15 addresses. The data blocks contain any information as to
the location of other data unlike the indirect block representation.
See the following for more information:
http://www.cs.utexas.edu/users/mootaz/cs372/Lectures/L-17.ppt
http://www.cs.uwyo.edu/~seker/courses/4740/lect12.ppt
Extent representation is used in almost all file systems today. Indirect block
allocation was proposed with the original UFS design circa 1965, and might have
met the hardware and software requirements of that day. However, extent-based
allocation is far more optimal given todays hardware and software structure.
I believe, given current technology trends, that extents will become outdated
as we move to the future.
Data Representation/Allocation
Data representation and allocation involves how and where file systems place
data on the managed devices (note that I do not say disks, because some file
system support hierarchical storage management (HSM) so that data can reside
on tape or disk or both). Data allocation algorithms and data placement becomes
a big issue for some file systems. As mentioned in my previous article, most
file systems require a volume manager when using multiple devices.
Given what I have seen for most user and scratch file systems, a 90/10 rule applies. 90% of the files use 10% of the space, and 10% of the files use 90% of the space. Sometimes the distribution is 95/5 or even a bi-modal distribution, but you are likely to have a very skewed distribution of sizes and counts for files not a statistically normal distribution. Understanding how allocation is accomplished in a file system provides a better understanding of how the file system will scale and perform in your environment given the file sizes, the allocation sizes, and how the data is accessed. There are generally two types of representation of free space.
- Btrees used by Veritas, XFS Reiser, and other file systems
- Bitmaps used by QFS, NFTS, HFS (Mac/Linux)
A B-tree is a sorted list of all the free allocations and used space within the file system. It is important to understand how space is allocated and used from within the tree. Some good reading on B-tree allocation, which is used in most file systems, can be found at:
http://cs-people.bu.edu/dbuzan/cs112/lab7/lab92.html
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
Bitmap allocation is less commonly used. This method is used where each bit
in the map represents a single allocation unit such as 512 bytes (a single hardware
sector on a disk) or up to 65536 KB (the largest allocation in the Sun QFS file
system). There are at least two methods for free space searching. Most file
systems use first-fit algorithms, however, at least two other file systems use
best-fit Cray NC1FS and Fujitsu FPFS.
http://www.cs.utexas.edu/users/dahlin/Classes/UGOS/quiz/soln6.html
The first-fit method tries to find the first space within the file system
that matches the allocation size. In some file systems, the first-fit method
is used to find the space closest to the last allocation of the file being extended.
The best-fit method tries to find the best place in the file system for the
allocation of the data. This method is used to try to reduce total file system
fragmentation. The best-fit method always takes more CPU cycles than first-fit,
because the whole file system must be searched for the best allocation.
The best-fit method also works better at reducing fragmentation, especially when files can be pre-allocated (for file systems that support this method) or for large allocations (e.g., multiple MBs). Most vendors do not support this method, because most users do not pre-allocate and most allocations in file systems are not large. Also, because volume managers provide the file system with a single address space, the search time would be huge in a large file system. Even for file systems that allow round-robin allocation, the best-fit algorithm requires searching the whole round-robin device (partition).
In general, B-trees can do a better job for smaller allocations, because the allocation time is faster. As the file system grows, and data is added to the end of files, the files become fragmented. Bitmaps are better for larger allocations, but the time for allocation (searching for free space) can be much greater, especially if the best-fit algorithm is used. Both bitmaps and B-trees can be optimized for your operational requirements with tunable parameters in the file system, volume manager, RAID configuration, and allocations.
Conclusions
Ive provided an overview of how file systems work internally and some
common file system activities. This sets the foundation for understanding file
system tuning parameters and their means. In the next column, I will cover some
of the specific tunables, creating volumes and file systems on Solaris (UFS,
VxFS/VxVM, QFS), AIX (JFS), and Linux (ResierFS, EXT3). Using these file systems
and volume managers will provide a basis for understanding other file systems
and volume managers.
Henry Newman has worked in the IT industry for more than 20 years. Originally
at Cray Research and now with a consulting organization, he has provided expertise
in systems architecture and performance analysis to customers in government,
scientific research, and industry around the world. His focus is high-performance
computing, storage and networking for UNIX systems, and he previously authored
a monthly column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.