Trees Are Fast

 

By Hans Reiser

 

The Naming System Venture
6979 Exeter Dr.
Oakland, CA 94611-1625
reiser@ricochet.net (be sure to send me email if you want to be on the reiserfs announcements and code discussion list)
 URL (for this paper, in case you printed it): http://idiom.com/~beverly/reiserfs.html
Linux Kernel Containing Beta of Reiserfs
Recent Changes to the above kernel (New file system checker, new (faster) read code with packing locality read-ahead, fixed some bugs.  It's a 960k tar file that you overlay the kernel with.)

Current Code in Progress: Revising VFS-reiserfs interface to not use leashes, using the Linux dcache code, porting from 2.1.42 to 2.1.57.  After that we will fix NFS interaction, and then simplify balancing code (we hope to cut it in half by use of careful coding.)


 

ABSTRACT:

Reiserfs is a file system using a variant on classical balanced tree algorithms. Compared with ext2fs, a conventional block allocation based file system running under the same operating system and employing the same buffering code, the results suggest that these algorithms are more effective in time performance for large files and for small files not near the node size; that they become less effective in time performance but significantly more effective in space performance as file size approaches the node size; and that they become markedly more effective in both space and time as file size decreases substantially below the node size (4k), reaching order of magnitude advantages for file sizes of 100 bytes. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer.



Introduction
    General Systems Science Motivations
Should File Boundaries Be Block Aligned?
Balanced Trees and Large File I/O
Serialization and Consistency
Why Aggregate Small Objects at the File System Level?
Tree Definitions
Using the Tree to Optimize Layout of Files
    Physical Layout
    Node Layout
    Ordering within the Tree
    Node Balancing Optimizations
        Drops
        Factors That Influence Code Complexity in Balancing Algorithms
Buffering & the Preserve List
Benchmark Results
    Benchmark Selection Issues
    80/20 Banded File Set Benchmarks
    0-250 Byte Uniform Distribution File Set Benchmarks
   Multiple Process Medium Sized File Benchmarks
    Single Process Medium Sized File Benchmarks
    Single Process Large Sized File Benchmarks
    Multiple Process Large Sized File Benchmarks
Benchmark Results Discussion
Lessons From Log Structured File Systems
Directions For the Future
Conclusion
Acknowledgments
Business Model And Licensing
References

Introduction

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike ``The Use of Name Spaces in Plan9'' ]. None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. Most readers aren't substantially interested in the general systems science motivations for such work, so they are not inlined in this paper.

This implementation offers value to the average Linux user, in that it offers higher performance for large and small files (files not near a 4k node in size) than the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most.

Since ext2fs exceeds FFS, UFS, and NTFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who find its new method of employing preserve lists for preserving meta-data without substantive performance loss to be attractive.


Should File Boundaries Be Block Aligned?

Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple-block files when locality of reference across files is poor); it wastes disk and buffer space in storing every less than fully packed block; it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present; it increases the average number of block fetches required to access every file in a directory; and it results in simpler code.

The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity is a major disadvantage of the approach compared to the 6,000 lines of the ext2fs approach.

I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. This can be effective as a first approximation, and the performance results of [C-FFS] appear more successful than those of the earlier implementation of this approach in [NTFS]. Note that the authors of C-FFS fail to point out that they are comparing their performance to FFS, which is quite unsuitable for small files, especially compared to ext2fs. One cannot tell from their results whether they are faster or slower than ext2fs.

I considered the following in my decision not to use the approach employed by NTFS and C-FFS: Any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many? How does one decide what level to aggregate them at, and when to take them back from the parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions.

In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms.

Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. Subsequent to my releasing on the net my initial results and a detailed description of my design, [C-FFS] embarked on a small-file file system research project that has many ideas in common with ours. They published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. It is difficult to compare the packing of the two approaches since they are rather vague in their definitions of their packing algorithms. That said, their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. It is unfortunate that they compared their results to FFS rather than ext2fs, as the FFS approach chooses both a synchronization policy and a fragment layout policy that is markedly slower for smaller files. Since ext2fs would also show marked advantages over FFS for small files, one wonders how much advantage they would have over ext2fs. Hopefully they will port to Linux and it can then be measured.

Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. While reading the algorithms, the reader is encouraged to note the areas in which reiserfs needs to go farther in doing so.


Why Aggregate Small Objects at the File System Level?

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers who care about performance choosing not to store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance.

Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many of the solutions currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.

Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it.

Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution.

Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly, based on then-current usage statistics, that there was no substantial market need for disk drives larger than 10MB. While [C-FFS] points out that 80% of file accesses are to files below 10k, it is not reasonable to expect usage measurements of file systems for which small files are inappropriate to show that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance.

In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.


Tree Definitions

I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies.

Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node.

In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).

The tree has three node types: internal nodes, formatted nodes, and unformatted nodes.

The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.)

Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. Reiserfs starts at the root node, examines its contents, and based on it can determine which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
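The descent just described can be sketched in a few lines of C. This is only an illustrative sketch using hypothetical types and helper names (node, read_node, and key_cmp are assumptions for illustration, not the reiserfs source); it shows how the delimiting keys in internal nodes select the subtree to descend into.

struct key  { unsigned locality_id, object_id, offset, uniqueness; };
struct node {
    int        is_internal;        /* nonzero for internal nodes                     */
    int        nr_keys;            /* delimiting keys; child pointers = nr_keys + 1  */
    struct key keys[64];           /* delimiting keys, sorted ascending              */
    unsigned   child_blocknr[65];  /* pointers to sub-trees                          */
};

extern struct node *read_node(unsigned blocknr);               /* hypothetical buffer cache read */
extern int key_cmp(const struct key *a, const struct key *b);  /* <0, 0, >0                      */

/* Return the formatted node that should contain the item with the given key. */
struct node *search_by_key(unsigned root_blocknr, const struct key *k)
{
    struct node *n = read_node(root_blocknr);

    while (n->is_internal) {
        int i = 0;
        /* The key preceding a child pointer duplicates the first key of that
           subtree, so descend into the last child whose delimiting key is <= k. */
        while (i < n->nr_keys && key_cmp(&n->keys[i], k) <= 0)
            i++;
        n = read_node(n->child_blocknr[i]);
    }
    return n;  /* a formatted node; the caller then searches its items */
}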

The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree.

All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.

Formatted nodes consist of items. Items have three types: direct items, indirect items, and directory items. All items contain a key which is unique to the item. This key is used to sort, and find, the item.

Direct items contain the tails of files; a tail is the last part of a file (the last file_size mod 4k bytes).

Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes.

Directory items contain the key of the first directory entry in the item followed by a number of directory entries.

A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item.
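The rule above for where a tail is stored can be restated as a small decision function. The following is an illustrative sketch only; NODE_HEADER_SIZE and ITEM_HEADER_SIZE are assumed overhead values chosen for illustration, not figures from the reiserfs source.

#define NODE_SIZE          4096
#define NODE_HEADER_SIZE     24   /* assumed per-node overhead, for illustration */
#define ITEM_HEADER_SIZE     24   /* assumed per-item overhead, for illustration */
#define MAX_DIRECT_TAIL (NODE_SIZE - NODE_HEADER_SIZE - ITEM_HEADER_SIZE)

enum tail_storage { NO_TAIL, TAIL_AS_DIRECT_ITEM, TAIL_IN_UNFORMATTED_NODE };

enum tail_storage classify_tail(unsigned long file_size)
{
    unsigned long tail_size = file_size % NODE_SIZE;  /* the last file_size mod 4k bytes */

    if (tail_size == 0)
        return NO_TAIL;                  /* file ends exactly on a node boundary    */
    if (tail_size <= MAX_DIRECT_TAIL)
        return TAIL_AS_DIRECT_ITEM;      /* may still be split across two nodes     */
    return TAIL_IN_UNFORMATTED_NODE;     /* referenced from an indirect item, with a
                                            count of the space used                 */
}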

Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining).

The first item of a file or directory contains its stat data.

In reiserfs formatted nodes have the property that no node and its two neighbors can be compressed into two nodes. Internal nodes have the property that each node is at least 50% full. This difference in compressibility reflects a different prioritization of space efficiency vs. balancing overhead for the two levels. I feel greater compression is best left to an FS cleaner to perform rather than attempting it dynamically.
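For concreteness, the two balancing invariants just stated can be written as predicates. This is a sketch that ignores per-node and per-item header overhead; the used-space accounting in the real balancing code is more involved.

#define NODE_SIZE 4096

/* Formatted-node invariant: a node and its two neighbors must not be
   compressible into two nodes. */
int formatted_node_invariant(unsigned left_used, unsigned this_used, unsigned right_used)
{
    return left_used + this_used + right_used > 2 * NODE_SIZE;
}

/* Internal-node invariant: each internal node is at least 50% full. */
int internal_node_invariant(unsigned used)
{
    return used >= NODE_SIZE / 2;
}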


Using the Tree to Optimize Layout of Files

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

Physical Layout

For SCSI drives this mapping of logical block numbers to physical locations is performed by the drive manufacturer; for IDE drives it is done by the device driver; and for all drives it may also be done by volume management software. The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature for FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes without tracking cylinder boundaries accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

Node Layout

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in, reversing when I reach a disk boundary. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in this paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor).
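The search policy described above amounts to a simple elevator sweep over the block bitmap. The following sketch is an interpretation of that description, not the reiserfs implementation; the allocator structure, bitmap representation, and function names are assumptions made for illustration.

struct allocator {
    const unsigned char *bitmap;     /* one bit per block, 1 = block in use          */
    unsigned long        nr_blocks;
    int                  direction;  /* +1 or -1; persists across calls (the elevator
                                        direction)                                    */
};

static int block_used(const struct allocator *a, unsigned long b)
{
    return (a->bitmap[b / 8] >> (b % 8)) & 1;
}

/* Start at the block number of the left neighbor in the tree ordering and sweep
   in the direction of the previous allocation, reversing at the disk boundaries.
   Returns a free block number, or -1 if none is found. */
long find_free_block(struct allocator *a, unsigned long left_neighbor_blocknr)
{
    unsigned long b = left_neighbor_blocknr;
    unsigned long scanned = 0;

    while (scanned++ < 2 * a->nr_blocks + 2) {
        if (!block_used(a, b))
            return (long)b;
        if (a->direction > 0 && b + 1 == a->nr_blocks)
            a->direction = -1;                    /* reverse at the right edge */
        else if (a->direction < 0 && b == 0)
            a->direction = 1;                     /* reverse at the left edge  */
        else
            b = (a->direction > 0) ? b + 1 : b - 1;
    }
    return -1;
}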

Ordering within the Tree

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.)

It is not necessary for a file's packing to reflect its name, that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call. It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different than that determined by object names is more appropriate.

Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and this is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items, in the current version I store one key in the item which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes. For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature.
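A sketch of this key as a C structure may help. The field widths below are assumptions made for illustration (the paper does not give exact sizes); what matters is that keys compare lexicographically on the fields in the order shown.

struct reiserfs_key_sketch {
    unsigned int locality_id;   /* object_id of the parent directory by default      */
    unsigned int object_id;     /* unique id of the file or directory                */
    unsigned int offset;        /* byte offset within the object, or the first 4
                                   bytes of the filename for directory items         */
    unsigned int uniqueness;    /* item type; disambiguates directory entries
                                   identical in their first 4 bytes                  */
};

int key_cmp(const struct reiserfs_key_sketch *a, const struct reiserfs_key_sketch *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}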

Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories.

In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering, it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

Node Balancing Optimizations

When balancing nodes I do so according to the following ordered priorities:
  1. minimize number of nodes used
  2. minimize number of nodes affected by the balancing operation
  3. minimize the number of uncached nodes affected by the balancing operation
  4. if shifting to another formatted node is necessary, maximize the bytes shifted
Priority 4) is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that policy 4 will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well, in that if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.

Drops (the difficult design issues in the current version that our next version can do better)

Even though I knew all of the following reasons why it was a bad idea, and knew that it was a bad idea on the whole, I let the BLOB approach of using indirect items be introduced into the code for mistaken non-technical reasons, rather than depending entirely on internal nodes. Let me start the discussion of what was learned during the course of my mistake by describing the first approach employed in detail. It is my hope that others will thereby benefit from my painful, in-depth exploration of why the traditional BLOB approach is measurably slower than my initial approach.

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files.

Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node.

In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. The performance loss for introducing indirect items (BLOBs) was marked, and unmistakable.

Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging.

A moderately better solution would have been to simply eliminate the requirement that the start of a multi-node file be placed at the start of a node, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap.)

For this reason I have the following plan for the next version.  I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data.  It is only natural to think that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another.  This is indeed why it took me two years to realize it could be made to work.  My plan is to interleave the nodes of the three trees according to the following algorithm:

Block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs.  The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.  Currently we use the left neighbor of the node in the tree as the node it should be placed near.  This is nice and simple.  Oh well.  Time to create a virtual neighbor layer.

The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near.  This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality.  Pseudo-code is best for describing this:

/* For use by reiserfs_get_new_blocknrs when determining where in the bitmap to
   start the search for a free block, and for use by the read-ahead algorithm when
   there are not enough nodes to the right and in the same packing locality for
   packing locality read-ahead purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type, find the virtual
       neighbor of that node. */
    if (body node)
        if (first body node of file)
            if (a node in the tail tree with a lesser key but the same packing locality exists)
                return blocknr of such node with the largest key
            else
                find the node in the stat_data tree with the largest key less than
                the key of the current node and return its blocknr
        else
            return blocknr of the node in the body tree with the largest key less
            than the key of the current node
    else if (tail node)
        if (a node in the body tree belonging to the same file as the first tail of the current node exists)
            return its blocknr
        else if (a node in the tail tree with a lesser delimiting key but the same packing locality exists)
            return blocknr of such node with the largest delimiting key
        else
            return blocknr of the node in the stat_data tree with the largest key
            less than the key of the current node
    else /* is a stat_data tree node */
        if (a stat_data node with a lesser key from the same packing locality exists)
            return blocknr of such node with the largest key
        else /* no node from the same packing locality with a lesser key exists */
            return blocknr of the node with the largest delimiting key that is less
            than the key of the current node
}

/* for use by packing locality read-ahead */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}

It is my hope that this will improve caching of pointers to unformatted nodes, plus improving caching of directory entries and stat_data, by separating them from file bodies to a greater extent.  I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

Code Complexity

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this notably increased code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from being liquid drops to being air drops also increased it. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type.

When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction.  The NxN effect of the number of items on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite.  The balancing code will employ a set of item operations which all item types must support.  The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item specific item operation handler is called.  Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now.
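A minimal sketch of such an item-operations layer follows. The operation names and the stub direct-item handlers are hypothetical, chosen only to show the shape of the abstraction: the balancer calls through a per-item-type table instead of switching on item types, so adding a compressed item type would only add a new table.

struct item { int length; };   /* placeholder; real items carry type-specific data */

struct item_operations {
    int  (*bytes_fit)(const struct item *it, int free_space);  /* how much can shift */
    void (*shift_left)(struct item *from, struct item *to, int bytes);
    int  (*can_merge)(const struct item *left, const struct item *right);
};

/* Example table for direct items, with stub bodies. */
static int  direct_bytes_fit(const struct item *it, int free_space)
{ return it->length < free_space ? it->length : free_space; }
static void direct_shift_left(struct item *from, struct item *to, int bytes)
{ from->length -= bytes; to->length += bytes; }
static int  direct_can_merge(const struct item *l, const struct item *r)
{ (void)l; (void)r; return 1; }

static const struct item_operations direct_item_ops = {
    direct_bytes_fit, direct_shift_left, direct_can_merge,
};

/* The balancing code calls through the table without knowing the item's type;
   a new item type only supplies a new table of operations. */
static inline int item_bytes_fit(const struct item_operations *ops,
                                 const struct item *it, int free_space)
{
    return ops->bytes_fit(it, free_space);
}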

We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing since it pretty much has to do so anyway.  That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.


Balanced Trees and Large File I/O

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography.

Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries.

By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent based file systems such as VxFS, and write-clustering file systems such as ext2fs, apply similar techniques, but not so effectively that they choose 512-byte rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space.

There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files.
 
 

 


Serialization and Consistency

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design.  Let's define the two extremes in serialization so that the reason for this can be clear.  Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request.   Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm).  The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks.  Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence.  Most of high performance design centers around the twin issues of allowing the elevator algorithm to make I/Os in their layout sequence, and of ensuring that the layout sequence is as close as possible to a linear sequence.

Many file systems (almost all of the older ones) add serialization of I/O that is not requested by the application programmer, and provide no mechanism for the application programmer to avoid such serialization when it is not desired. This results in applications stalling on I/O without the application programmer desiring that they do so. Since the serialization is typically performed at file boundaries, the performance impact of unrequested serialization becomes progressively more dramatic as one decreases in file size below the track size, and for small files it can completely dominate performance.

The motivation for the massive inefficiency of such an approach is that the original UFS and FAT file systems lacked the ability to avoid becoming corrupted when only part of an update to the file system reached disk. To minimize this problem, file systems such as [FFS], [UFS], and [NTFS] employed aggressive unrequested synchronization. The large investment by many in the file system developer community in skill at implementing older technologies that assume added serialization means that the newer approaches of mostly avoiding serialization will be slow in their adoption.

Snyder discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file.

Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity, [McVoy] previously suggested but did not implement ordering of buffer writes.

Ext2fs was a leader in avoiding unrequested serialization, and I have much personal experience with it that leads me to prefer compiles that are fast. Its breaking the POSIX standard is not in practice a problem: only a few applications have a need for adding an fsync() after an fclose() in their source code when they are ported to Linux. I might argue that it is the best actually implemented policy at this time of the various OSes for performing compiling, but I do not by any means claim it to be optimal. [ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste.:-) ]

[NTFS] by default adds unrequested serialization to an extent that even old fashioned file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies).

Reiserfs employs a new scheme called preserve lists for ensuring recoverability, with unrequested serialization occurring only when disk space is low, and recoverability in principle being better than that of [FFS]. (It most likely will not be better in reality until we have gone through extensive testing.) I have avoided changing Linux buffering for reiserfs 1.0, in part so as to make the benchmarks more fair, in part out of a policy of doing as little additional to our core technology as possible before we ship a filesystem useful to Linux users.

Reiserfs, like ext2fs, allows the application programmer control over serialization, and for small files this is crucial to performance. Reiserfs and ext2fs default to unserialized I/O, and perform serialization on request (via fsync() ), but the critical issue is not what is the default, but whether there is a choice. I will go so far as to say that I currently err primarily on the side of not providing sufficient options to the programmer for specifying desired buffer delay policy. 


Buffering & the Preserve List

I maintain in cache a buffer tree whose buffers contain a version of (newer or the same as) some fraction of the nodes of the disk tree, rather than hashing buffers in the usual manner for Linux. When I need a node I read it from disk, and connect it to the appropriate parent node of the buffer tree. The buffer tree has the special property that if a disk node is in the cache then its parent is in the cache as well, and to maintain this property I free only leaves of the buffer tree when I need to free buffers. The details of our method of attaching buffers in memory to this internal tree are inefficient in the current implementation, and while it was correct that there were more critical design issues to focus on, and that improving other aspects of buffering probably still could yield a higher reward, it will be rewritten from scratch soon. I make no claim that it is better than traditional Linux hashing, I didn't manage this aspect of the code closely enough, and now that I have read it carefully I am eager to discard it in the next version.

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a far simpler to code scheme that was also more effective in typical usage patterns. This scheme was as follows:

If an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing it, place the old block number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the elevator_direction until it reaches the edge of the disk, and then it reverses the value of elevator_direction and searches in that direction. Elevator_direction is stored in the super block. Yes, this odd-seeming use of an elevator_direction shared variable does make it measurably faster.) When a ``moment of consistency'' is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no dirty formatted nodes in memory (this could be made more precise but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to.
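The bookkeeping described above can be sketched as follows. This is an interpretation for illustration, not the reiserfs code; the buffer structure and the helpers find_free_block_near and mark_block_free are assumptions.

#include <stdlib.h>

struct buffer {
    unsigned long blocknr;   /* block this buffer will be written to */
    int           dirty;
};

struct preserved { unsigned long blocknr; struct preserved *next; };
static struct preserved *preserve_list;

extern unsigned long find_free_block_near(unsigned long left_neighbor_blocknr);
extern void mark_block_free(unsigned long blocknr);

/* Called when an item has been shifted out of the node held in bh. */
void preserve_sending_node(struct buffer *bh, unsigned long left_neighbor_blocknr)
{
    struct preserved *p = malloc(sizeof(*p));
    if (!p)
        return;   /* sketch only; real code must not silently fail here */

    p->blocknr = bh->blocknr;                 /* old location is preserved, not freed */
    p->next = preserve_list;
    preserve_list = p;

    bh->blocknr = find_free_block_near(left_neighbor_blocknr);  /* write goes elsewhere */
    bh->dirty = 1;
}

/* Called at a moment of consistency (no dirty formatted nodes in memory):
   the old copies can no longer be needed for recovery and may be reused. */
void free_preserve_list(void)
{
    while (preserve_list) {
        struct preserved *p = preserve_list;
        preserve_list = p->next;
        mark_block_free(p->blocknr);
        free(p);
    }
}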

In practice, despite its seeming inelegance, it is remarkably effective and efficient. The alternatives such as logging cost overhead consistently, whereas preserve list overhead has an erraticity of occurrence that contributes to the intuitive inelegance. It is a bit odd how the log structured file system style of consistently paying overhead for writing twice and writing far from the semantic neighbor seems intuitively more elegant than the preserve list practice of paying occasionally for a sync and slightly for writing near to the optimal placement, yes? When comparing the approach to a log, please remember that one can use an FS cleaner with this approach as well, even if the data does not move as far, and if the spikes caused by occasionally syncing the disk when it is full are distasteful one can always add a gentler throttle to the file system to ensure that moments of consistency occur. Further, it lays the groundwork for a future fastboot mechanism in which prior to each pass of the elevator algorithm I will write the preserve list and the list of buffers on the elevator list to disk. Log structured file systems do not eliminate the need for recovery, they confine it. Storing on disk the preserve list and the elevator list, I will be able to confine recovery without employing a log. Only implementation will prove the extent of the overhead this mechanism costs....

Note that ext2fs avoids the metadata shifting problem for directory entries by never shrinking directories.


Benchmark Results

Benchmark Selection Issues

The P.M. Chen benchmark results are best understood after spending significant time reading the P.M. Chen benchmark paper, and the graphs take considerable space. They are consistent with my results printed here, but are not included for space reasons. (Anyone is welcome to sit down at my benchmarking PC and run your or my benchmarks. I find I usually learn new things from other people's benchmarks, though I warn that you will likely be disappointed in how often benchmarks designed to be universal are seriously flawed in benchmarking file systems with designs not considered by the author of the benchmark.)

My benchmarks had their origins in the Andrew Benchmark, but I was motivated to innovate by the following issues: 1) Unlike the file systems it was designed to test, both ext2fs and reiserfs avoid the use of unrequested serialization, and as a result the CPU, and not either of the file systems, is the bottleneck for compiles. Unfortunately I am reduced to copying files using cp if I want to generate a sufficiently heavy load using a 200MHz Pentium Pro. 2) Even the use of sed to search for all occurrences of the word kangaroo was user process CPU bound for reiserfs (sed is just barely fast enough to not bottleneck ext2). I found it necessary to write a simple program to read a file without processing it. 3) I find it much more scientifically interesting to use benchmarks to illustrate the impact of different design tradeoffs as I vary file size and directory size than to engage in the usual discussion of which approach is better overall via a single numeric measure. Such measures necessarily involve an arbitrary controversial weighting of various aspects of performance with much referencing of tradition and authority figures, and so I defer such discussions of whether mini-vans can move more cargo than motorcycles to their appropriate marketing forums.
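The original read-only benchmark program is not reproduced in this paper; a minimal stand-in that reads each file named on the command line and discards the data, so that the file system rather than the reading process is the bottleneck, might look like this:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[65536];

    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
            continue;
        }
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                       /* discard the data */
        close(fd);
    }
    return 0;
}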

LADDIS is a network file system benchmark, and optimizing performance for NFS is a research topic of its own that I did not choose to pursue however important it might be.

80/20 Banded File Set Benchmarks

As said earlier, currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system (<10K). Here I show a benchmark that explores implications of the speculative notion that if one changes the range of object sizes for which file systems are useful, and it becomes sensible to store objects 100 bytes in size in the file system, within a generation of programmers 80% of file accesses might once again be to files sized in the first order of magnitude of file sizes for which it is sensible to store files in the file system. Note the result that the effect on time is more dramatic than the effect on space.

In this benchmark, when creating a file I determine which order of magnitude its size will fall in, with one file in five being promoted into a higher size class than the base size, one in five of those being promoted to the next size, and so on. The size of an individual file is randomly distributed within the size range of its category. Most bytes belong to the larger sizes, most files belong to the smaller sizes. I assume 100% inode space utilization for ext2fs. The file sets are identical in all ways for ext2fs and reiserfs (the pseudo-randomization uses the same seed of 1 for both file systems; I avoided changing the seed to produce a more ``average'' randomization.)
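The following is one plausible reading of that size-generation rule, written as a small program; it is an illustration, not the original benchmark source, and the exact promotion and size-drawing details of the original may differ.

#include <stdio.h>
#include <stdlib.h>

static unsigned long banded_size(void)
{
    unsigned long limit = 100;                     /* smallest class: 0-100 bytes      */

    while (limit < 10 * 1000 * 1000 && rand() % 5 == 0)
        limit *= 10;                               /* one in five promoted a class up  */

    return (unsigned long)(rand() % (limit + 1));  /* uniform within the chosen class  */
}

int main(void)
{
    srand(1);                                      /* same seed for both file systems  */
    for (int i = 0; i < 16000; i++)
        printf("%lu\n", banded_size());
    return 0;
}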

Note that the literature has not yet produced a statistical distribution function for file sizes, though many differing statistics are available on the net. The author does not think that the function used here is correct for that purpose (it has too many large files), but it is perhaps adequate to show that the impact of small files on total system performance can be proportional to their number rather than to their space consumption. I wish to note that the use of a file set containing files in the 0-100 byte range is in no way fair to ext2fs, which, like most file systems, was not specially designed for files below 1k, though its architecture handles such a file set better than most file systems would, thanks to its synchronization policy.

To distinguish the effect of directory size on the efficiency of directory operations, I produce these statistics for the same file set stored variously in trees with one directory, 10 sub-directories, 100 sub-sub-directories, and 1000 sub-sub-sub-directories. I use 16000 files with the sizes shown in the table below. Note that with 16000 files in one directory ext2 is exceeding its design intent: directory insertions take long enough that it is not truly usable, and for this reason larger benchmarks were not performed. Ratios were calculated using measured numbers with more precision than those shown in the tables.
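One plausible way to map a file index onto the one-, ten-, hundred-, and thousand-directory layouts is sketched below; the directory and file names are illustrative assumptions, not the names actually used in the benchmark trees:

    /* leaf_path() -- spread file number "index" evenly over nleaves leaf
     * directories (nleaves = 1, 10, 100 or 1000 for the four benchmark trees).
     * Naming is illustrative only. */
    #include <stdio.h>

    void leaf_path(int index, int nleaves, char out[64])
    {
        int leaf = index % nleaves;      /* which leaf directory gets this file */

        if (nleaves == 1)
            sprintf(out, "f%05d", index);
        else if (nleaves == 10)
            sprintf(out, "d%d/f%05d", leaf, index);
        else if (nleaves == 100)
            sprintf(out, "d%d/d%d/f%05d", leaf / 10, leaf % 10, index);
        else                             /* 1000 sub-sub-sub-directories */
            sprintf(out, "d%d/d%d/d%d/f%05d",
                    leaf / 100, (leaf / 10) % 10, leaf % 10, index);
    }

With 16000 files this yields 1600, 160, and 16 files per leaf directory respectively.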
 
 

80/20 Banded File Set Numbers and Sizes

number of files  size of files  total size of files
5  0-10M  33,712,005
21  0-1M  9,335,273
113  0-100K  6,193,394
615  0-10K  3,055,154
3002  0-1K  1,479,559
12244  0-100B  610,149
total: 16000 files, 54,385,534 bytes
 
Percentage of 506MB disk partition used after copying a tree containing the 80/20 Banded File Set, and then interleaved copying of both trees (cp -r testdir testdir2; find . -exec cp {} {}.o \;):
ext2  reiserfs  ratio (ext2/reiserfs)
58%  43%  1.35
 

 

 

Time results for 80/20 Banded File Set of 16000 files

all times are of the form ext2/reiserfs=ratio
Phase of Benchmark:  files in one directory  files in ten subdirectories  files in 100 subsubdirectories  files in 1000 subsubsub-directories
creation of 80/20 Banded File Set  4:31/10.48=26  42.35/13.42=3.16  25.95/15.45=1.68  54.84/40.86=1.34
cp -r testdir testdir2  7:35/37.22=12  1:53/39.13=2.9  1:35/46.76=2.0  2:16/1:04=2.13
scanning all files  6:32/57.40=6.83  2:42/55.62=2.9  2:29/59.79=2.5  2:54/1:33=1.9
recursive stat of all files (find . -exec ls -l {} >&/dev/null \;) [*]  9:05/3:35=2.63  5:11/3:35=1.44  4:14/3:38=1.17  5:05/4:01=1.27
copies semantically interleaved with originals (find linux -name '*' -exec cp {} {}.o \;) [**]  30:30/3:59=7.7  5:39/4:19=1.3  5:39/4:19=1.3  7:06/4:49=1.5
accessing stat data (du -a | wc -l)  15:23/35.71=26  2:37/37.59=4.2  1:29/42.58=2.08  1:59/38.45=3.1
rm -rf *  3:32/21.15=10  56.17/22.57=2.5  53.52/25.72=2.08  1:58/29.93=4.0

[*] The find command, not the file system, was the bottleneck for reiserfs in all four cases, and for ext2 in the 100 and 1000 directory cases.
[**] The find command was significant overhead for reiserfs in all four cases.
 
 

0-250 Byte Uniform Distribution File Set Benchmarks

Time Benchmarks

all times are of the form ext2/reiserfs=ratio
Phase of Benchmark:  files in one directory  files in ten subdirectories  files in 100 subsubdirectories  files in 1000 subsubsub-directories
creation of File Set  4:15/4.51=57  26.16/04.44=5.9  11.10/5.24=2.11  25.03/16.40=1.53
cp -r testdir testdir2  7:22/17.58=25  1:30/16.74=5.4  1:08/16.74=4.0  1:52/19.32=5.8
scanning all files  6:15/36.35=10.4  2:26/35.69=4.1  5:33/1:11=4.7  4.1
recursive stat of all files (find . -exec ls -l {} >&/dev/null \;) [*]  9:02/3:32=2.6  5:09/3:33=1.45  4:04/3:32=1.15  4:17/3:47=1.14
copy semantically interleaved with originals (find linux -name '*' -exec cp {} {}.o \;) [**]  30:01/3:04=10  6:59/3:10=2.21  5:05/3:12=1.6  6:33/3:14=2.02
accessing stat data (du -a | wc -l)  15:25/6.38=145  2:29/6.33=24  1:14/6.31=12  1:07/6.79=10
rm -rf *  3:05/19.42=9.5  27.45/21.72=1.26  16.29/24.38=0.67  40.97/27.97=1.46

[*] The find command, not the file system, was the bottleneck for reiserfs in all four cases, and for ext2 in the 100 and 1000 directory cases.
[**] The find command was significant overhead for reiserfs in all four cases.
Note: this benchmark of 16000 files of 0-250 bytes may have been a bit small for reiserfs, but it was quite lengthy for ext2, and so it seemed inappropriate to run a larger one.

 

Space Utilization Benchmark

Percentage of 506MB disk partition used after copying the whole tree of 16000 0-250 byte files (1,988,298 bytes total), and then interleaved copying of both trees (cp -r a b; find . -exec cp {} {}.o \;):
ext2  reiserfs  ratio (ext2/reiserfs)
14%  3%  5.0
 

Multiple Process Medium Sized File Benchmarks

 (X11R6, 72.8MB in 4198 files, average 17k size)
Phase of Benchmark  Time for ext2fs  Time for reiserfs  ext2fs/reiserfs
three simultaneous copies of file set (same source, three different semantic locations in file system)  2:40  2:16  1.18
three simultaneous semantically interleaved copies (different sources: find $i -exec cp {} {}.o \;)  3:35  2:51  1.25
three simultaneous recursive stats of all files (find $i -exec ls -l {} \;)  2:28  3:03  0.82
three simultaneous reads of three trees  1:12  1:21  0.89
three simultaneous deletions (rm -rf $i)  0:34.31  0:25  1.37
 

Single Process Medium Sized File Benchmarks

 (Linux kernel, 14k average file size, 38MB total)
Phase of Benchmark  Time for ext2fs  Time for reiserfs  ext2fs/reiserfs
cp -r of linux kernel source code  38.2  31.85  1.13
scanning all files (find . -exec wc {} >&/dev/null \;)  2:51  2:40  1.07
recursive stat of all files (du -s * >&/dev/null)  2:58  3:09  0.94
copy all .c files to .o files for seven linux kernels, semantically interleaved with the originals (find linux -name '*.c' -exec cp {} {}.o \;)  3:25.27  2:59  1.23
find and then semantically interleaved delete of files (find -type f -name '*[aos]*' -exec rm {} \;)  3:56.82  3:07  1.56
rm -rf *  37.79  5.38  10.1
 
 

Single Process Large Sized File Benchmarks

 
(files > 512k from /usr, 110 files, 84MB total, 634k average size)
Phase of Benchmark  Time for ext2fs  Time for reiserfs  ext2fs/reiserfs
cp -r of file set  0:45.79  0:36.39  1.26
semantically interleaved copying of 163MB (find usr -exec cp {} {}.o \;)  2:05.31  1:16.56  1.64
recursive stat of all files (find . -exec ls -l {} \;)  0:03.44  0:03.02  1.14
scanning all files before interleaved copy is performed (find . -exec wc {} >&/dev/null \;)  0:31.44  0:27.68  1.14
scanning all files after interleaved copy is performed (find . -exec wc {} >&/dev/null \;)  1:00.55  0:59.22  1.02
find and then semantically interleaved delete of files (find -type f -name '*[aos]*' -exec rm {} \;)  0:25.88  0:02.38  11
rm -rf *  0:02.63  0:00.16  16

 

Multiple Process Large Sized File Benchmarks

 (files > 512k from /usr, 153 files, 97MB total, 634k average size)
Phase of Benchmark  Time for ext2fs  Time for reiserfs  ext2fs/reiserfs
three simultaneous copies of the same 97MB file set  44.14  36.52  1.21
three simultaneous semantically interleaved copies of 128MB, different sources (find usr -exec cp {} {}.o \;)  64.89  54.96  1.18
recursive stat of all files (find . -exec ls -l {} \;)  1.72  1.44  1.19
scanning all files (find . -exec wc {} >&/dev/null \;)  15.78  14.36  1.10
three find and then semantically interleaved deletions of files (find $i -type f -name '*[aos]*' -exec rm {} \;)  6.5  0.81  8.0
three simultaneous tree deletes (rm -rf $i)  7.78  0.92  8.5
 


Benchmark Results Discussion

(Note: we have just implemented packing locality read-ahead and eliminated some serialization of tail reading; it is faster now, and a complete set of new benchmarks will follow in a few weeks.)


Lessons From Log Structured File Systems

Many techniques from other file systems have not yet been applied, primarily to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques. My main reservation concerns when such optimizations should run: I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly affect the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exact ordering of block layout by semantic order, use of the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher performance file system I will apply some of those lessons.

In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.


Directions For the Future

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files when it is an option available to the OS designer, but it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent when it is not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do everything in the file system that they could do in a database, and never really did understand why not.
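Purely as a hypothetical illustration of what a handle-less call of the sort mentioned above might look like (this is not a proposed or implemented reiserfs interface), consider:

    /* Hypothetical only: access the object named by path at a given offset,
     * without a prior open() and without holding a file descriptor. */
    #include <sys/types.h>

    ssize_t read_by_name(const char *path, off_t offset, void *buf, size_t count);
    ssize_t write_by_name(const char *path, off_t offset, const void *buf, size_t count);

For objects of a hundred bytes or so, the open()/close() and file descriptor bookkeeping can be a significant fraction of the total access cost, and that is the overhead such calls would aim to avoid.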

Conclusion

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the difference reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file; a moderate advantage was found for large files. Coding cost lies mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are kept low. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating, or simply in using my work, will find me friendly; I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for collaboration, with commercial needs provided for.


Acknowledgments

The author was the project initiator, primary architect, supplier of funding, and one of the programmers. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system (the better code), and is remarkably gifted at just making things work. Thanks Vladimir. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.


Business Model and Licensing

I personally favor performing a balance of commercial and public works in my life.  I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users.  This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors.  It is not usable by them without such exceptions, and many of them are wise enough to understand that.  I expect that Linux will prove quite effective at sampling my intended market, but if you suspect that I also simply like seeing more people use it even when it is free to them, oh well.

I believe it is not so much the cost that has made Linux so successful as it is the openness.  Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it).  Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.  At some point an OS vendor will realize that if it:

that it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have.  Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.....

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project.  This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting.  Smaller contributions are to be noted in the source code, and the acknowledgements section of this paper.

If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors.  Through packaging ourselves as a group, we are more marketable to such OS vendors.  Many of us have spent too much time working at day jobs unrelated to our Linux work.  This is too hard, and I hope to make things easier for us all.  If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also.  Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.


References

http://[Soon to exist....] will contain source code (a complete easy to install and try Linux kernel source code tree with reiserfs added plus the scripts used in my benchmarks). Currently reiserfs is implemented for Linux 2.0.7, but it will be ported to 2.1.* soon, and released. I hope some of you will find it an accessible and convenient testbed for your own experiments with balanced tree file systems.

G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.

[Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. It employs balanced trees for filenames, and it was an interesting file system architecture for its time in a number of ways; its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.

[Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ, superbly written but sadly dated, contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].

[BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.

[Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.

[C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. A very well written paper focused on small file issues; they use markedly similar notions (most especially their concept of grouping, compared to my packing localities), and they do so without bothering to credit our work, despite our having posted our initial results and the underlying concepts on the web immediately prior to what, according to their web page, would seem to be the start of their work. To receive such flattery from MIT is most unexpected.

[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.

[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984 describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments places it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists, forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories)

[Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' (abstract only).

[Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.

[Holton and Das] Holton, Mike., Das, Raj., ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that XFS is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.

[Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, but it was too CPU bound for both ext2fs and reiserfs.

[Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length.

[LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.

[Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.

[McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records.

[McVoy and Kleiman], the implementation of write-clustering for Sun's UFS.

[OLE] ``Inside OLE'' by Kraig Brockschmidt discusses Structured Storage; http://www.microsoft.com/mspress/books/abs/5-843-2b.htm (abstract only).

[Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985.

[NTFS] ``Inside the Windows NT File System'', written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel; Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and for leading the OS designer effort to integrate small objects into the file name space.

[Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference Winter 1988

[Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. It is a pity he no longer has this paper on his home page; it was short, informal, and drove home why inconsistent naming schemes in an OS are detrimental. A discussion of naming in Plan 9 is on the web.

[Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, February 1992 ACM Transactions on Computer Systems, this paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.sunlabs.com/~ouster/seltzer93.html and the arguments by Margo Seltzer it links to.

[Snyder] ``tmpfs: A Virtual Memory File System''. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.

[Vahalia] Uresh Vahalia, ``Unix