From: CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP  4-OCT-1988 13:23
To: ARISIA::EVERHART
Subj: Re: VMS vs. UNIX file system

Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Mon, 3 OCT 88 21:27:12 PDT
Received: from ucbvax.Berkeley.EDU by KL.SRI.COM with TCP; Mon, 3 Oct 88 21:23:33 PDT
Received: by ucbvax.Berkeley.EDU (5.59/1.31)
	id AA14927; Mon, 3 Oct 88 18:18:21 PDT
Received: from USENET by ucbvax.Berkeley.EDU with netnews
	for info-vax@kl.sri.com (info-vax@kl.sri.com)
	(contact usenet@ucbvax.Berkeley.EDU if you have questions)
Date: 1 Oct 88 08:07:25 GMT
From: chris@mimsy.umd.edu (Chris Torek)
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Subject: Re: VMS vs. UNIX file system
Message-Id: <13802@mimsy.UUCP>
References: <880928131853.164e@CitHex.Caltech.Edu>
Sender: info-vax-request@kl.sri.com
To: info-vax@kl.sri.com

And now for some real information.... :-)

In article <880928131853.164e@CitHex.Caltech.Edu> carl@CITHEX.CALTECH.EDU
(Carl J Lydick) writes (in 78 character records, right justified, which
detracts from readability on a CRT---please do not do it):

>... UNIX (and here I'm basing my claims on personal experience with
>ULTRIX, SYSTEM V, NORMIX [a variation on AT&T's TSUNIX developed by
>Norman Wilson at Caltech to take advantage of VAX architecture; its
>main advantage over AT&T's product was that it allowed paging], and
>XENIX; there may be implementations of the UNIX file system which don't
>have the problems I describe, but I've never seen one) file systems
>tend to be a bit on the flakey side.

We need to back up a bit to deal with this one.  There are essentially
two underlying `file systems' in the Unix world as it now stands: the
V7 file system, and the 4.2BSD `fast file system'.  There are many
variants of both, some more reliable than others, but the 4.2BSD file
system was designed with reliability in mind.  (Of course, the 4.2BSD
release itself was extremely buggy, but that is what happens when your
contract says that you must release by a certain date, whether the
software works or not.  One should consider the 4.2BSD file system in
a context in which the rest of the kernel has been debugged, e.g., in
4.3BSD---and notably *not* in early Ultrix releases.  SunOS 3.2 still
has some interesting bugs here too.)

>When the system crashes, a full FSCK with interactive input from a UNIX
>guru is called for;

This has proven false in my own experience (which, incidentally, does
not include System V).  When the system crashes---rarely, except on
4.2BSD-based systems---fsck usually manages everything quite nicely.
(I am not sure what the phrase `a full fsck' is intended to imply:
there is no such thing as a `partial' fsck.)  Certainly there are
times when a `guru' is required: e.g., after the drive catches fire
and you forgot to back up the machine all month.  VMS is not immune to
these either.

>the information on the disk (the combination of the state of the
>filesystem as recorded on the disk and the algorithms for cleaning it
>up as embodied in fsck) is frequently inadequate for reconstruction of
>the filesystem ...

In what way?  4.1BSD and 4.3BSD (ignoring 4.2BSD for the obvious
reason) both go to quite a bit of trouble to ensure that the data on
the disk is sufficient to reconstruct a clean state.  This has a
noticeable cost in performance; it is done quite deliberately, so as
to prevent your `guru required' scenario.
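The `trouble' is, at bottom, an ordering discipline: the kernel pushes
inodes and directory entries to disk synchronously, and in a careful
order, so that the worst a crash can leave behind is a resource that
is allocated but not yet referenced---something fsck can reclaim with
no human help.  Here is a minimal sketch of the idea in C; every name
in it is an invented stand-in, not an actual 4.3BSD kernel routine:

	/*
	 * Sketch of the metadata-ordering idea (not the 4.3BSD code;
	 * the structures and helpers here are invented stand-ins).
	 */
	#include <stdio.h>
	#include <string.h>

	struct inode  { int i_number; int i_allocated; };
	struct direct { char d_name[14]; int d_ino; };

	static struct inode itable[64];	/* toy in-core inode table */

	static struct inode *
	ialloc(void)
	{
		int i;

		for (i = 1; i < 64; i++)
			if (!itable[i].i_allocated) {
				itable[i].i_allocated = 1;
				itable[i].i_number = i;
				return &itable[i];
			}
		return 0;
	}

	static void
	write_sync(const char *what, int n)
	{
		/* stands in for a synchronous bwrite() of the block */
		printf("sync write: %s %d\n", what, n);
	}

	int
	create_file(struct direct *dp, const char *name)
	{
		struct inode *ip = ialloc();

		if (ip == 0)
			return -1;
		write_sync("inode", ip->i_number);	/* step 1: inode first */
		strncpy(dp->d_name, name, sizeof dp->d_name - 1);
		dp->d_name[sizeof dp->d_name - 1] = '\0';
		dp->d_ino = ip->i_number;
		write_sync("directory block", dp->d_ino); /* step 2: the name */
		/*
		 * A crash between steps 1 and 2 leaves an allocated but
		 * unreferenced inode, which fsck reclaims automatically.
		 * The reverse order could leave a name pointing at
		 * uninitialised disk, which no program can repair
		 * unattended.
		 */
		return 0;
	}

	int
	main(void)
	{
		struct direct d;

		return create_file(&d, "example");
	}

The performance cost mentioned above is precisely those synchronous
writes: the system waits for each one rather than letting the buffer
cache reorder them.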
From what I have heard, older V7 file systems did not do this; perhaps
the claim was true there (and perhaps it remains true in System V
based kernels such as Xenix).

>ODS-2 has at times suffered similar problems ... but problems
>of this sort have been acknowledged to be bugs, and have been fixed
>fairly soon after they've been discovered, by and large; on UNIX, such
>problems have been around for long enough without any visible efforts
>to fix them that they've pretty much got to be considered features by
>now.

Berkeley is not a software vendor and as such is under no real
obligation to provide fixes; yet I recall some effort in distributing
fixes for the more major kernel bugs in 4.2BSD, in particular those
affecting the file system.  Any software fault that results in the
system crashing, in lost data, or in file system state that cannot be
recovered automatically is most certainly considered a bug.  Perhaps
our efforts are not visible enough for you; I assure you that they are
there.

>Under ODS-2, every block on a disk (including the bad-block track
>on last-track devices) is either allocated to a file or is free to be
>allocated to a file.  The UNIX file system has an entire class of
>blocks that aren't allocated to a file and that cannot be allocated to
>a file: they're called inodes.  Since they're not in a file, they're
>set up as a (doubly?) linked list.  Unallocated blocks other than
>inodes are described by another linked list, the freelist.

This is true of the V7 file system.  The 4.2BSD Fast File System is
organised rather differently.

>One of the things fsck does is to search the disk for blocks that
>aren't in the freelist or the inode list, and tries to figure out which
>one they should belong to.

No `figuring out' is required: that information is inherent in the
block index, given the data from the superblock.  (The V7 file system
has only a single superblock; if this is overwritten, you are indeed
in trouble.  The V7 superblock was, however, largely static, so
disasters were rare.  The 4.2BSD FFS superblock is replicated; more on
that in a moment.)

>ODS-2 has all the inode-equivalents (file headers) allocated to
>INDEXF.SYS.  In case the file header for INDEXF.SYS itself
>becomes corrupted, there's a backup header for it in a readily
>locatable position on the disk.  Information on whether a block
>(actually, a cluster) is allocated is stored in BITMAP.SYS.

(This creates an interesting bootstrap problem: to find BITMAP.SYS you
must find INDEXF.SYS; to find INDEXF.SYS you must find INDEXF.SYS.  I
imagine INDEXF.SYS is in a fixed location, a la a Unix superblock.)

In the 4.2BSD Fast File System, the disk is organised into cylinder
groups.  Each cylinder group contains c.g. summary information,
including a copy of the superblock, an inode area, and one or two data
areas (depending on whether the cg data and inodes split the cylinder
group or lie at either end).  The primary superblock is located at
block 8 (where `block' here means 1 kbyte); the first backup
superblock is at block 32, and the others are placed in a spiral
pattern along the disk, so that no single failure of a platter or head
can lose every copy of the superblock data.  When you make a file
system, the newfs program% prints the locations of the alternate
superblocks; you are expected to save this data, but if you forget, or
lose it, it can be regenerated (as long as you remember the parameters
to newfs!).

-----
% Yes, newfs, not mkfs: 4.3BSD-tahoe no longer has a separate mkfs.
-----
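The regeneration is possible because the alternate-superblock
locations are pure arithmetic on the newfs parameters: a fixed offset
within each cylinder group, plus a per-group rotation term that
produces the spiral.  A rough sketch of the calculation in C follows;
the formula is paraphrased from my reading of the FFS layout, not
copied from the 4.3BSD sources, and the names and sample values are
invented:

	/*
	 * Sketch: recompute alternate-superblock locations from the
	 * file system parameters.  The rotation term is what makes the
	 * copies spiral across the cylinders so that no one platter or
	 * head holds them all.  Values here are illustrative only.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		long fpg = 7680;	/* fragments per cylinder group */
		long sblkno = 16;	/* superblock offset within a group */
		long cgoffset = 32;	/* extra offset per rotation step */
		int ncyc = 8;		/* length of the rotation cycle */
		int ncg = 32;		/* number of cylinder groups */
		int cg;

		for (cg = 0; cg < ncg; cg++) {
			long base = (long)cg * fpg;
			long rot = cgoffset * (cg % ncyc);
			printf("cg %2d: alternate superblock at fragment %ld\n",
			    cg, base + rot + sblkno);
		}
		return 0;
	}

If memory serves, fsck will take one of these alternates by hand, as
in `fsck -b 32 /dev/rhp0a', when the primary superblock has been
clobbered.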
>Information on whether a file header is in use is stored both in the
>file header and in a bitmap in INDEXF.SYS.  This means that under
>ODS-2, you can, in principle, find the next free block or file header
>with a single read followed by a single VAX instruction (though on
>non-VAX-11 architecture VAXen, the single instruction is
>software-emulated).  Under UNIX, you have to scan the inode list until
>you find a free inode; allocating the first free block involves
>removing it from the freelist (which could, I suppose, be done using a
>REMQUE instruction, if the version of UNIX takes advantage of the VAX
>queue instructions).  Because ODS-2 uses a storage bitmap, searching
>for the first set of n contiguous free blocks can be done quite
>efficiently, again with only one read from disk required; under UNIX,
>you have to scan the freelist, which can be a fairly time-consuming
>process.

Again, this is not true of the 4.2BSD FFS.  It too uses a free-block
bitmap (and 4.3BSD makes use of the fancy VAX instructions where they
are present, and does *not* trap to emulation code where they are not,
provided you compile with the proper flags).  Moreover, the bitmap is
organised to account for rotational delay and, optionally, contiguous
transfers.  (More work is needed in other parts of the kernel to make
contiguous transfers more efficient, but the functionality is there.)
The inode free list is also kept as a bitmap.

Incidentally, the VAX `remque' instruction is not particularly
efficient.  On a VAX-11/780, removing from a queue head takes about 14
times as long as a `regular' instruction (14 us); the interlocked
remqhi is even worse, at over 30 us.  (Actually, I think these figures
are for an insert plus a remove, not just the remove; it certainly
makes more sense that way.)

[begin quote---this is from an article that appeared on Usenet in 1983]

The following VAX instruction timings were obtained from a former DEC
employee.  I cannot vouch for their accuracy and have no idea how they
were obtained.  (Times are in microseconds; the last two columns give
the speed of the 750 and the 730 relative to the 780.)

    VAX-11/780 vs. VAX-11/750 vs. VAX-11/730, WITHOUT FPA

    INSTRUCTION                          780    750    730    750    730
    INSERT AT TAIL + REMOVE FROM HEAD  14.00  15.07  26.89  0.929  0.521
    INTERLOCKED INSERT + REMOVE        30.43  26.43  41.14  1.151  0.740

    [and for contrast]

    ADDB  Reg, Reg                      0.40   0.94   2.88  0.426  0.139
    ADDW  Reg, Reg                      0.40   0.93   2.69  0.430  0.149
    ADDL  Reg, Reg                      0.40   0.93   2.53  0.430  0.158
    ADDL3 Reg, Reg, Reg                 0.60   1.29   2.88  0.465  0.208
    ADDL  #imed, Reg                    0.84   1.69   4.95  0.497  0.170
    ADDL  @Reg, Reg                     0.80   1.11   3.21  0.721  0.249
    ADDL  Reg, @Reg                     1.33   1.61   4.21  0.826  0.316

    [much deleted]

    VAX-11/780 vs. VAX-11/750 vs. VAX-11/730, WITH FPA

    INSTRUCTION                          780    750    730    750    730
    INSERT AT TAIL + REMOVE FROM HEAD  14.00  15.07  27.51  0.929  0.509
    INTERLOCKED INSERT + REMOVE        30.43  26.43  41.02  1.151  0.742

    [much more deleted]

[end quote]

For contrast, a subroutine call and return (CALLS #0 + RET) takes
14.75 us, according to these tables.

To return to the 4.2BSD Fast File System: an important property of
this system is that it is self-organising.  File blocks are allocated
in rotationally optimal positions (provided, of course, that such
blocks are free), and the system attempts to group related files
together.  (The rule is that files go in the same cylinder group as
their parent directory, while subdirectories go in a different
cylinder group.  This rule is applied recursively.)  A sketch of the
rule appears below.
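In rough C, the placement rule looks something like the following.
This is a sketch of the policy only; the names are invented, and the
real 4.2BSD code (ialloc and friends) also weighs how many directories
each group already holds:

	/*
	 * Sketch of the FFS placement rule described above: plain files
	 * stay in their parent directory's cylinder group, while new
	 * directories are pushed to a different, relatively empty group.
	 * All names are invented; this is not the 4.2BSD interface.
	 */
	#include <stdio.h>

	#define NCG 32			/* cylinder groups (example value) */

	long ifree[NCG];		/* free inode count per group */

	int
	choose_cg(int parent_cg, int is_directory)
	{
		int cg, best;

		if (!is_directory)	/* files: stay near the directory */
			return parent_cg;

		/* directories: spread out into the emptiest other group */
		best = (parent_cg + 1) % NCG;
		for (cg = 0; cg < NCG; cg++)
			if (cg != parent_cg && ifree[cg] > ifree[best])
				best = cg;
		return best;
	}

	int
	main(void)
	{
		ifree[3] = 100;		/* pretend group 3 is emptiest */
		printf("file under cg 5 -> cg %d\n", choose_cg(5, 0));
		printf("dir  under cg 5 -> cg %d\n", choose_cg(5, 1));
		return 0;
	}

Because the rule runs at every create, the clustering maintains itself
as the disk fills and empties.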
One of the most common advertisements I see in magazines oriented
toward VMS is for a disk reorganiser.  These things apparently
squirrel around in the ODS-2 file system, moving files hither and yon
to place them better and to reduce fragmentation.  The 4.2BSD FFS does
this every time you write a new file, and all you need do to
reorganise a file is copy it.  Fragmentation occurs only at the end of
files, since the file system is block-based, not extent-based; only
when files shrink is there any unnecessary fragmentation.  Again,
copying the files will fix this.

Summary: 4.3BSD goes to quite a bit of trouble to make the file system
reliable.  It succeeds.  The underlying implementation is not simple,
but it is reasonably fast.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris