From: CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP  4-OCT-1988 13:23
To: ARISIA::EVERHART
Subj: Re: VMS vs. UNIX file system

Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Mon, 3 OCT 88 21:27:12 PDT
Received: from ucbvax.Berkeley.EDU by KL.SRI.COM with TCP; Mon, 3 Oct 88 21:23:33 PDT
Received: by ucbvax.Berkeley.EDU (5.59/1.31)
	id AA14927; Mon, 3 Oct 88 18:18:21 PDT
Received: from USENET by ucbvax.Berkeley.EDU with netnews
	for info-vax@kl.sri.com (info-vax@kl.sri.com)
	(contact usenet@ucbvax.Berkeley.EDU if you have questions)
Date: 1 Oct 88 08:07:25 GMT
From: chris@mimsy.umd.edu (Chris Torek)
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Subject: Re: VMS vs. UNIX file system
Message-Id: <13802@mimsy.UUCP>
References: <880928131853.164e@CitHex.Caltech.Edu>
Sender: info-vax-request@kl.sri.com
To: info-vax@kl.sri.com

And now for some real information.... :-)

In article <880928131853.164e@CitHex.Caltech.Edu> carl@CITHEX.CALTECH.EDU
(Carl J Lydick) writes (in 78 character records, right justified, which
detracts from readability on a CRT---please do not do it):

>... UNIX (and here I'm basing my claims on personal experience with
>ULTRIX, SYSTEM V, NORMIX [a variation on AT&T's TSUNIX developed by
>Norman Wilson at Caltech to take advantage of VAX architecture; its
>main advantage over AT&T's product was that it allowed paging], and
>XENIX; there may be implementations of the UNIX file system which don't
>have the problems I describe, but I've never seen one) file systems
>tend to be a bit on the flakey side.

We need to back up a bit to deal with this one.  There are essentially
two underlying `file systems' in the Unix world as it now stands: the
V7 file system, and the 4.2BSD `fast file system'.  There are many
variants of both, some more reliable than others, but the 4.2BSD file
system was designed with reliability in mind.  (Of course, the 4.2BSD
release itself was extremely buggy, but that is what happens when your
contract says that you must release by a certain date, whether the
software works or not.  One should consider the 4.2BSD file system in
a context in which the rest of the kernel has been debugged, e.g., in
4.3BSD---and notably *not* in early Ultrix releases.  SunOS 3.2 still
has some interesting bugs here too.)

>When the system crashes, a full FSCK with interactive input from a UNIX
>guru is called for;

This has proven false in my own experience (which, incidentally, does
not include System V).  When the system crashes---rarely, except on
4.2BSD-based systems---fsck usually manages everything quite nicely.
(I am not sure what the phrase `a full fsck' is intended to imply:
there is no such thing as a `partial' fsck.)  Certainly there are
times when a `guru' is required: e.g., after the drive catches fire
and you forgot to back up the machine all month.  VMS is not immune to
these either.

>the information on the disk (the combination of the state of the
>filesystem as recorded on the disk and the algorithms for cleaning it
>up as embodied in fsck) is frequently inadequate for reconstruction of
>the filesystem ...

In what way?  4.1BSD and 4.3BSD (ignoring 4.2BSD for the obvious
reason) both go to quite a bit of trouble to ensure that the data on
the disk is sufficient to reconstruct a clean state.  This has a
noticeable cost in performance; it is done quite deliberately, so as
to prevent your `guru required' scenario.
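The `trouble' is, at bottom, an ordering discipline: the kernel pushes
inodes and directory entries to disk synchronously, and in a careful
order, so that the worst a crash can leave behind is a resource that
is allocated but not yet referenced---something fsck can reclaim with
no human help.  Here is a minimal sketch of the idea in C; every name
in it is an invented stand-in, not an actual 4.3BSD kernel routine:

	/*
	 * Sketch of the metadata-ordering idea (not the 4.3BSD code;
	 * the structures and helpers here are invented stand-ins).
	 */
	#include <stdio.h>
	#include <string.h>

	struct inode  { int i_number; int i_allocated; };
	struct direct { char d_name[14]; int d_ino; };

	static struct inode itable[64];	/* toy in-core inode table */

	static struct inode *
	ialloc(void)
	{
		int i;

		for (i = 1; i < 64; i++)
			if (!itable[i].i_allocated) {
				itable[i].i_allocated = 1;
				itable[i].i_number = i;
				return &itable[i];
			}
		return 0;
	}

	static void
	write_sync(const char *what, int n)
	{
		/* stands in for a synchronous bwrite() of the block */
		printf("sync write: %s %d\n", what, n);
	}

	int
	create_file(struct direct *dp, const char *name)
	{
		struct inode *ip = ialloc();

		if (ip == 0)
			return -1;
		write_sync("inode", ip->i_number);	/* step 1: inode first */
		strncpy(dp->d_name, name, sizeof dp->d_name - 1);
		dp->d_name[sizeof dp->d_name - 1] = '\0';
		dp->d_ino = ip->i_number;
		write_sync("directory block", dp->d_ino); /* step 2: the name */
		/*
		 * A crash between steps 1 and 2 leaves an allocated but
		 * unreferenced inode, which fsck reclaims automatically.
		 * The reverse order could leave a name pointing at
		 * uninitialised disk, which no program can repair
		 * unattended.
		 */
		return 0;
	}

	int
	main(void)
	{
		struct direct d;

		return create_file(&d, "example");
	}

The performance cost mentioned above is precisely those synchronous
writes: the system waits for each one rather than letting the buffer
cache reorder them.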
From what I have heard, older V7 file systems did not do this; perhaps
the claim was true there (and perhaps it remains true in System V
based kernels such as Xenix).

>ODS-2 has at times suffered similar problems ... but problems
>of this sort have been acknowledged to be bugs, and have been fixed
>fairly soon after they've been discovered, by and large; on UNIX, such
>problems have been around for long enough without any visible efforts
>to fix them that they've pretty much got to be considered features by
>now.

Berkeley is not a software vendor and as such is under no real
obligation to provide fixes; yet I recall some effort in distributing
fixes for the more major kernel bugs in 4.2BSD, in particular those
affecting the file system.  Any software fault that results in the
system crashing, in lost data, or in file system state that cannot be
recovered automatically is most certainly considered a bug.  Perhaps
our efforts are not visible enough for you; I assure you that they are
there.

>Under ODS-2, every block on a disk (including the bad-block track
>on last-track devices) is either allocated to a file or is free to be
>allocated to a file.  The UNIX file system has an entire class of
>blocks that aren't allocated to a file and that cannot be allocated to
>a file: they're called inodes.  Since they're not in a file, they're
>set up as a (doubly?) linked list.  Unallocated blocks other than
>inodes are described by another linked list, the freelist.

This is true of the V7 file system.  The 4.2BSD Fast File System is
organised rather differently.

>One of the things fsck does is to search the disk for blocks that
>aren't in the freelist or the inode list, and tries to figure out which
>one they should belong to.

No `figuring out' is required: that information is inherent in the
block index, given the data from the superblock.  (The V7 file system
has only a single superblock; if this is overwritten, you are indeed
in trouble.  The V7 superblock was, however, largely static, so
disasters were rare.  The 4.2BSD FFS superblock is replicated; more on
that in a moment.)

>ODS-2 has all the inode-equivalents (file headers) allocated to
>INDEXF.SYS.  In case the file header for INDEXF.SYS itself
>becomes corrupted, there's a backup header for it in a readily
>locatable position on the disk.  Information on whether a block
>(actually, a cluster) is allocated is stored in BITMAP.SYS.

(This creates an interesting bootstrap problem: to find BITMAP.SYS you
must find INDEXF.SYS; to find INDEXF.SYS you must find INDEXF.SYS.  I
imagine INDEXF.SYS is in a fixed location, a la a Unix superblock.)

In the 4.2BSD Fast File System, the disk is organised into cylinder
groups.  Each cylinder group contains c.g. summary information,
including a copy of the superblock, an inode area, and one or two data
areas (depending on whether the cg data and inodes split the cylinder
group or lie at either end).  The primary superblock is located at
block 8 (where `block' here means 1 kbyte); the first backup
superblock is at block 32, and the others are placed in a spiral
pattern along the disk, so that no single failure of a platter or head
can lose every copy of the superblock data.  When you make a file
system, the newfs program% prints the locations of the alternate
superblocks; you are expected to save this data, but if you forget, or
lose it, it can be regenerated (as long as you remember the parameters
to newfs!).

-----
% Yes, newfs, not mkfs: 4.3BSD-tahoe no longer has a separate mkfs.
-----
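The regeneration is possible because the alternate-superblock
locations are pure arithmetic on the newfs parameters: a fixed offset
within each cylinder group, plus a per-group rotation term that
produces the spiral.  A rough sketch of the calculation in C follows;
the formula is paraphrased from my reading of the FFS layout, not
copied from the 4.3BSD sources, and the names and sample values are
invented:

	/*
	 * Sketch: recompute alternate-superblock locations from the
	 * file system parameters.  The rotation term is what makes the
	 * copies spiral across the cylinders so that no one platter or
	 * head holds them all.  Values here are illustrative only.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		long fpg = 7680;	/* fragments per cylinder group */
		long sblkno = 16;	/* superblock offset within a group */
		long cgoffset = 32;	/* extra offset per rotation step */
		int ncyc = 8;		/* length of the rotation cycle */
		int ncg = 32;		/* number of cylinder groups */
		int cg;

		for (cg = 0; cg < ncg; cg++) {
			long base = (long)cg * fpg;
			long rot = cgoffset * (cg % ncyc);
			printf("cg %2d: alternate superblock at fragment %ld\n",
			    cg, base + rot + sblkno);
		}
		return 0;
	}

If memory serves, fsck will take one of these alternates by hand, as
in `fsck -b 32 /dev/rhp0a', when the primary superblock has been
clobbered.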
>Information on whether a file header is in use is stored both in the
>file header and in a bitmap in INDEXF.SYS.  This means that under
>ODS-2, you can, in principle, find the next free block or file header
>with a single read followed by a single VAX instruction (though on
>non-VAX-11 architecture VAXen, the single instruction is
>software-emulated).  Under UNIX, you have to scan the inode list until
>you find a free inode; allocating the first free block involves
>removing it from the freelist (which could, I suppose, be done using a
>REMQUE instruction, if the version of UNIX takes advantage of the VAX
>queue instructions).  Because ODS-2 uses a storage bitmap, searching
>for the first set of n contiguous free blocks can be done quite
>efficiently, again with only one read from disk required; under UNIX,
>you have to scan the freelist, which can be a fairly time-consuming
>process.

Again, this is not true of the 4.2BSD FFS.  It too uses a free-block
bitmap (and 4.3BSD makes use of the fancy VAX instructions where they
are present, and does *not* trap to emulation code where they are not,
provided you compile with the proper flags).  Moreover, the bitmap is
organised to account for rotational delay and, optionally, contiguous
transfers.  (More work is needed in other parts of the kernel to make
contiguous transfers more efficient, but the functionality is there.)
The inode free list is also kept as a bitmap.

Incidentally, the VAX `remque' instruction is not particularly
efficient.  On a VAX-11/780, removing from a queue head takes about 14
times as long as a `regular' instruction (14 us); the interlocked
remqhi is even worse, at over 30 us.  (Actually, I think these figures
are for an insert plus a remove, not just the remove; it certainly
makes more sense that way.)

[begin quote---this is from an article that appeared on Usenet in 1983]

The following VAX instruction timings were obtained from a former DEC
employee.  I cannot vouch for their accuracy and have no idea how they
were obtained.  (Times are in microseconds; the last two columns give
the speed of the 750 and the 730 relative to the 780.)

    VAX-11/780 vs. VAX-11/750 vs. VAX-11/730, WITHOUT FPA

    INSTRUCTION                          780    750    730    750    730
    INSERT AT TAIL + REMOVE FROM HEAD  14.00  15.07  26.89  0.929  0.521
    INTERLOCKED INSERT + REMOVE        30.43  26.43  41.14  1.151  0.740

    [and for contrast]

    ADDB  Reg, Reg                      0.40   0.94   2.88  0.426  0.139
    ADDW  Reg, Reg                      0.40   0.93   2.69  0.430  0.149
    ADDL  Reg, Reg                      0.40   0.93   2.53  0.430  0.158
    ADDL3 Reg, Reg, Reg                 0.60   1.29   2.88  0.465  0.208
    ADDL  #imed, Reg                    0.84   1.69   4.95  0.497  0.170
    ADDL  @Reg, Reg                     0.80   1.11   3.21  0.721  0.249
    ADDL  Reg, @Reg                     1.33   1.61   4.21  0.826  0.316

    [much deleted]

    VAX-11/780 vs. VAX-11/750 vs. VAX-11/730, WITH FPA

    INSTRUCTION                          780    750    730    750    730
    INSERT AT TAIL + REMOVE FROM HEAD  14.00  15.07  27.51  0.929  0.509
    INTERLOCKED INSERT + REMOVE        30.43  26.43  41.02  1.151  0.742

    [much more deleted]

[end quote]

For contrast, a subroutine call and return (CALLS #0 + RET) takes
14.75 us, according to these tables.

To return to the 4.2BSD Fast File System: an important property of
this system is that it is self-organising.  File blocks are allocated
in rotationally optimal positions (provided, of course, that such
blocks are free), and the system attempts to group related files
together.  (The rule is that files go in the same cylinder group as
their parent directory, while subdirectories go in a different
cylinder group.  This rule is applied recursively.)  A sketch of the
rule appears below.
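In rough C, the placement rule looks something like the following.
This is a sketch of the policy only; the names are invented, and the
real 4.2BSD code (ialloc and friends) also weighs how many directories
each group already holds:

	/*
	 * Sketch of the FFS placement rule described above: plain files
	 * stay in their parent directory's cylinder group, while new
	 * directories are pushed to a different, relatively empty group.
	 * All names are invented; this is not the 4.2BSD interface.
	 */
	#include <stdio.h>

	#define NCG 32			/* cylinder groups (example value) */

	long ifree[NCG];		/* free inode count per group */

	int
	choose_cg(int parent_cg, int is_directory)
	{
		int cg, best;

		if (!is_directory)	/* files: stay near the directory */
			return parent_cg;

		/* directories: spread out into the emptiest other group */
		best = (parent_cg + 1) % NCG;
		for (cg = 0; cg < NCG; cg++)
			if (cg != parent_cg && ifree[cg] > ifree[best])
				best = cg;
		return best;
	}

	int
	main(void)
	{
		ifree[3] = 100;		/* pretend group 3 is emptiest */
		printf("file under cg 5 -> cg %d\n", choose_cg(5, 0));
		printf("dir  under cg 5 -> cg %d\n", choose_cg(5, 1));
		return 0;
	}

Because the rule runs at every create, the clustering maintains itself
as the disk fills and empties.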
One of the most common advertisements I see in magazines oriented
toward VMS is for a disk reorganiser.  These things apparently
squirrel around in the ODS-2 file system, moving files hither and yon
to place them better and to reduce fragmentation.  The 4.2BSD FFS does
this every time you write a new file, and all you need do to
reorganise a file is copy it.  Fragmentation occurs only at the end of
files, since the file system is block-based, not extent-based; only
when files shrink is there any unnecessary fragmentation.  Again,
copying the files will fix this.

Summary: 4.3BSD goes to quite a bit of trouble to make the file system
reliable.  It succeeds.  The underlying implementation is not simple,
but it is reasonably fast.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris