Mark (and Bill, as a copy FYI)... As I mentioned to you yesterday, I've been working up a paper at home about some storage ideas. It is not done (it may turn out to be barely begun), and I intend to explore some issues further and add more detail on how the thing can be implemented (or I could discuss it with folks...I already own code that does softlinks, and started code for a cacher in about 3/1995...if there's interest). I also want to think through some other alternatives that would not be free of the need to alter VMS sources, just to see if there's some cheap mod that could be used instead. However, these ideas seem interesting enough that they should perhaps be shared now, while they might be of value to y'all over in EDO.

My personal bias, BTW, is that a fixed-up ODS-2 needs to exist in the future, and that Spiralog will not prove best in all cases. But when you might have thousands of disk volumes [I'm working on SCSI naming upgrades in "real life"], the current storage management scheme gets to need help real badly. Since I've worked in storage management (e.g. wrote my own HSM, and jukebox control software that is still far better technically than what DEC bought from Perceptics [e.g. works RIGHT in clusters, fails over, etc. etc. ...! Dammit...wish DEC had bought THAT! ...so much for bias...], wrote shadow/stripe/compressing disk drivers, journalling drivers for anything, etc. etc.), I figured these ideas need to be explored. If I were still doing ISV work I'd be trying to sell the cache now, and might have implemented the rest of the second idea also by now...what I have already went far along that road, and I was struggling with various schemes for melding filesystems for some time, since the need for such has been obvious to me for years.

Anyhow, please share this around; I don't know who the right folks to get it are, but would love to see the ideas at least considered. BTW, if you want a demo of MY HSM, undelete, etc., I can bring in a VMSinstallable kit; it runs on VAX or Alpha VMS.

Glenn Everhart@star.enet.dec.com   star::everhart   dtn 381 1497

--------------------------------

From: US2RMC::"EVERHART@Arisia.GCE.Com" 17-JUL-1996 19:27:14.74
Subj: opnrestart.txt

File System Extension
Glenn C. Everhart
Everhart@Arisia.GCE.Com (or everhart@gce.mv.com)
Everhart@star.enet.dec.com (w)

This document is meant to suggest a couple of variant schemes that may be able to enhance VMS' file system manageability and usability.

It would seem clear that a disk farm of dozens or more volumes, in which each volume is a separate entity, has some disadvantages as well as advantages. The advantages lie in backup and error recovery, where a file structure that becomes toast can be recovered in a more reasonable time frame than would be the case if the file structure spanned the whole farm. Burroughs learned that a long time ago... There are also security advantages, since volumes can be protected, and volume access serves as a kind of "mandatory" protection for the volume contents. These, however, tend not to be widely visible.

The disadvantages are in managing the thing. Once a disk runs out of capacity with systems like NTFS or Files-11, files must be migrated, usually manually. These considerations are constantly visible to everyone and represent an operational disadvantage vis-a-vis unix.

The disadvantage group can be dealt with by allocating disk control structures (notably, bitmaps) sized for a larger space than is actually there, and permitting use according to what storage is really present. (If all virtual space is visible, a driver can return the device full error code, and VMS will respond sensibly. Of course, one could modify the VCB slightly at mount time to adjust free blocks as well. My compressing disk did the first, not the second, since its situation was too dynamic.)
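To make the "rubber disk" check concrete, here is a minimal sketch in C of what the driver-level test might look like. All names, sizes, and structures are invented for illustration (this is not real driver code, the block counts are scaled way down for the demo, and the status values merely stand in for VMS condition codes such as SS$_DEVICEFULL): the virtual volume advertises far more blocks than it has backing store, backing blocks are assigned on first write, and once real storage runs out the driver returns device-full so the file system backs off the way it would on a genuinely full disk.

    /* Minimal "rubber disk" sketch: the virtual volume advertises
     * VIRTUAL_BLOCKS, but only PHYSICAL_BLOCKS of real storage exist.
     * Backing blocks are assigned on first write; when they run out the
     * driver returns a device-full status (a stand-in for SS$_DEVICEFULL).
     * All names here are invented for illustration only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define VIRTUAL_BLOCKS  (1UL << 20)   /* size the bitmap/file structure sees */
    #define PHYSICAL_BLOCKS (1UL << 16)   /* real storage actually present       */

    #define ST_SUCCESS     1              /* stand-ins for VMS condition codes   */
    #define ST_DEVICEFULL  2
    #define ST_BADBLOCK    4

    static unsigned long *lbn_map;        /* virtual LBN -> physical LBN+1; 0 = unmapped */
    static unsigned long  next_physical;  /* count of backing blocks handed out  */

    int rubber_write(unsigned long virtual_lbn)
    {
        if (virtual_lbn >= VIRTUAL_BLOCKS)
            return ST_BADBLOCK;                  /* outside even the virtual size */
        if (lbn_map[virtual_lbn] == 0) {         /* first write to this block     */
            if (next_physical >= PHYSICAL_BLOCKS)
                return ST_DEVICEFULL;            /* real storage exhausted        */
            lbn_map[virtual_lbn] = ++next_physical;
        }
        /* ...issue the real transfer to block lbn_map[virtual_lbn] - 1...        */
        return ST_SUCCESS;
    }

    int main(void)
    {
        lbn_map = calloc(VIRTUAL_BLOCKS, sizeof *lbn_map);
        if (!lbn_map) return 1;
        printf("write lbn 12345 -> status %d\n", rubber_write(12345));
        free(lbn_map);
        return 0;
    }

A real driver would presumably keep such a map in nonpaged pool or on the device itself; the sketch only shows the shape of the check.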
However, build huge "rubber disks" like this and the advantages will tend to be lost, if that is all you do. (Not that it's bad to do; just that one loses the advantages.)

Possibility 1:

I will mention another possibility: a disk-backed LBN cache, where the cache lives on disks...as many as you like...and the file structure is again of the "rubber" class. However, the backing store need not all be of equal speed, since the cache system could see to it that recently accessed storage was on fast disks, while slower store could be used for older data. The resulting system would logically span many disks (the cache system handling device boundary issues as a side effect) and be sized for many disks, but the underlying store could be mostly quite a bit slower than usual, possibly even residing partly on things like tapes.

My design for such a beast involves letting a cache server handle the actual work of swapping things around, and providing a second path to the real disks which only the cache server would use. (Details would in some ways resemble what mount verify does, with a bit more locking and general usability, but the cache server would use normal (well, almost normal) $QIO functions, and the whole thing can be constructed with straightforward I/O interception techniques, not really needing mods to the filesystem or the executive.) Disk access for in-cache storage (and the cache could be gigabytes or larger) would be very fast, handled by tiny mods to IRPs on the way to the appropriate storage device. The cache server gets into the picture only for cache misses (and must coordinate across the cluster when the cache map changes). This kind of thing is wonderful if you have a solid state disk, by the way; it can be used in layers if you want, though logically it will look like a huge volume, or maybe several such. (It would also adapt easily for WORMs.) Note that the fact that the cache server does NOT have to be active for all writes to cache can mean much less than the usual amount of cache contention; there may be situations where this will be a significant win. Also, this can be a handy thing to think about in a read-intensive situation, where Spiralog is disadvantaged; the access to the fixed blocks might transparently be migrated to a small area of a cache device and tend to be "close". Of course, a conventional memory-based writethrough cache could be used below this if desired, so long as it was an LBN cache...

This would be useful and interesting, but in some ways still has the problems one has with single huge file structures. (The backup/restore problem does become easier, since "new stuff" is on a much smaller store than older things, and older things can be backed up as they migrate to slower store.)
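As a rough sketch of the fast path just described, here is the kind of routing an interception layer could do, written in C with invented types (these are not the real VMS IRP/UCB definitions). A hit patches the request so it goes to the cache device at a translated LBN; only a miss has to involve the cache server, which is also where the cluster-wide coordination on map changes would live.

    /* Sketch of the LBN-cache fast path.  All types and names are invented
     * stand-ins, not the real VMS IRP/UCB structures.  On a hit the request
     * is redirected to the cache device with a translated block number;
     * only misses are handed to the cache server. */
    #include <stdio.h>
    #include <stddef.h>

    struct device { int unit; };            /* stand-in for a UCB-like unit      */

    struct io_request {                     /* IRP-like: one transfer            */
        struct device *target;              /* device the transfer will go to    */
        unsigned long  lbn;                 /* starting logical block            */
        unsigned long  count;               /* blocks to transfer                */
    };

    struct cache_extent {                   /* one mapped run of blocks          */
        unsigned long backing_lbn;          /* LBN on the slow/backing device    */
        unsigned long cache_lbn;            /* where it lives on the cache disk  */
        unsigned long count;
    };

    struct cache_map {
        struct device       *cache_device;  /* fast disk (or solid state disk)   */
        struct cache_extent *extent;
        size_t               n_extents;
    };

    /* Returns 1 if the request was redirected (hit), 0 if the cache server
     * must be involved (miss).  A real version would hold the cluster-wide
     * lock protecting the map while reading it. */
    int cache_route(struct io_request *req, const struct cache_map *map)
    {
        size_t i;
        for (i = 0; i < map->n_extents; i++) {
            const struct cache_extent *e = &map->extent[i];
            if (req->lbn >= e->backing_lbn &&
                req->lbn + req->count <= e->backing_lbn + e->count) {
                /* tiny mod to the request on its way down: new device, new LBN */
                req->target = map->cache_device;
                req->lbn    = e->cache_lbn + (req->lbn - e->backing_lbn);
                return 1;
            }
        }
        return 0;   /* miss: queue to the cache server */
    }

    int main(void)
    {
        struct device slow = {1}, fast = {2};
        struct cache_extent ext = { 1000, 0, 500 };  /* backing 1000..1499 cached */
        struct cache_map map = { &fast, &ext, 1 };
        struct io_request req = { &slow, 1100, 16 };
        if (cache_route(&req, &map))
            printf("hit: unit %d lbn %lu\n", req.target->unit, req.lbn);
        else
            printf("miss: hand to cache server\n");
        return 0;
    }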
Possibility 2:

Another possibility exists that I'd like to suggest. My idea here is that a collection of disks will be handled by two directory structures: one on a "master" volume, and one on each individual volume. The "master" volume will either have only directories, or will have directories pointing to files on itself and on everything else. Like a volume set, this structure will be managed as one file structure. Unlike a volume set, it will have individual volumes as sensible entities unto themselves, so that the unit of backup will be the individual volume. The total directory tree would resemble one in unix, with a root directory and all other volumes falling at mount points somewhere below root.

One key to this is to add a restart capability to open. A second key is to maintain both sets of directories in parallel, stating clearly that the per-volume directories get maintained first and the master directories are handled second, so that in crashes some inconsistency can occur (and be fixed later). The third key is that when old files are opened, they need to be found where they are, whereas new files should be created where space exists. The space trick might be doable using the volume set logic, but in fact can be done by a front end.

The way you'd do this with a front end looks something like this: you insert some processing by intercepting the XQP calls' FDT routines (to be filesystem independent). (Yeah, other possibilities exist too; I'll describe this one.)

Create: For non-kernel channels, you save the user open request in a pool data structure, and keep all context (including previous-mode PSL) there. When you find the disk with the most free space, save the original channel UCB and point the CCB at the desired disk. Now enter the directory entry in the master disk's directory, with an ACE (or other marking if you like) to tell where the real file is. Then let the original operation run and update the directory on the disk where the file will have its data. Capture deaccess (close) so the channel can be put back as the program expects.

Open: Save context as before, and issue a read-ACL (or otherwise read the file markings) to get the file location. Now again replace the channel UCB, saving the original one, and let the user open go with the user FIB having had its file ID filled in, DID clear, and on the right disk. If the marking has the FID this is direct; this works well but is not symbolic. A symbolic access will want to do a full lookup of the file (a kernel thread can do it from the intercept) and then alter the open. (Note that the open IRP must be pointed at the right disk too.) When you use a kernel thread, get an AST and issue the next processing from there. You can replace the PSL previous mode temporarily to reissue the user I/O from inside AST context. (I've used a special kernel AST for this.) If you care to read a marking and interpret it as a filename or filepath somewhere, the kernel-mode state machine gets more complex, but at each step in the path you issue an IO$_ACCESS for the desired file on the desired disk and check for markings on the final access, possibly repeating the process.

This amounts to a facility to restart an open request, even though it gets layered ahead of the "real" open. It can be done using the existing facilities if need be.
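Here is a minimal sketch, in C with invented stand-in types (not the real CCB/UCB/FIB definitions or a real ACE layout), of the channel bookkeeping the open redirect amounts to: save the device the channel pointed at, retarget the channel and the FIB from the marking read out of the master directory entry, let the open run, and put the channel back at deaccess.

    /* Sketch of the open-redirect step.  All structures are invented
     * stand-ins for the VMS CCB/UCB/FIB and the ACE "marking"; they only
     * show the bookkeeping: save the channel's original device, retarget
     * channel + FIB from the marking, and restore at deaccess. */
    #include <stdio.h>
    #include <stdlib.h>

    struct device { int unit; };               /* stand-in for a UCB-like unit   */

    struct channel {                           /* CCB-like: a user I/O channel   */
        struct device *dev;                    /* device the channel points at   */
    };

    struct file_id { unsigned short num, seq, rvn; };   /* FID-like triple       */

    struct fib {                               /* FIB-like: what the open carries*/
        struct file_id fid;                    /* target file ID                 */
        struct file_id did;                    /* directory ID (cleared here)    */
    };

    struct marking {                           /* what the master-directory ACE says */
        struct device *real_dev;               /* where the file's data really is*/
        struct file_id fid;                    /* its file ID on that volume     */
    };

    struct saved_open {                        /* per-open context kept in pool  */
        struct channel *chan;
        struct device  *original_dev;          /* to restore at deaccess         */
    };

    /* Redirect an open; the marking was read from the master directory entry. */
    struct saved_open *redirect_open(struct channel *chan, struct fib *fib,
                                     const struct marking *m)
    {
        struct saved_open *ctx = malloc(sizeof *ctx);
        if (!ctx) return NULL;
        ctx->chan         = chan;
        ctx->original_dev = chan->dev;         /* remember where the channel pointed */

        chan->dev = m->real_dev;               /* point the channel at the real disk */
        fib->fid  = m->fid;                    /* open by file ID on that volume     */
        fib->did  = (struct file_id){0, 0, 0};
        /* ...now let the (re-issued) user open run against the real disk...         */
        return ctx;
    }

    /* At deaccess (close), put the channel back the way the program expects. */
    void restore_channel(struct saved_open *ctx)
    {
        ctx->chan->dev = ctx->original_dev;
        free(ctx);
    }

    int main(void)
    {
        struct device master_vol = {0}, data_vol = {3};
        struct channel chan = { &master_vol };
        struct fib fib = { {0, 0, 0}, {0, 0, 0} };
        struct marking m = { &data_vol, {42, 1, 1} };
        struct saved_open *ctx = redirect_open(&chan, &fib, &m);
        printf("channel now points at unit %d, fid %u\n",
               chan.dev->unit, fib.fid.num);
        if (ctx) restore_channel(ctx);         /* at deaccess                    */
        return 0;
    }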
Access to directories in this case presumes that create, open, delete, etc. all perform their operations on the on-disk "one-volume" directory structures, and also on a set of master directories, leaving markings in the files in the master directories giving pointers to the files in the one-volume directories. This is closely akin to unix softlinks, with the ability to select a site automatically. Reading directories linked in this way could probably be handled too, by keeping track of the volume used to read a directory. However, RMS' insistence on directory caching gets in the way here. It would be desirable to somehow flag that a directory was for a different volume, possibly by hacking off a high bit or two of the file ID and keeping an internal table of volumes and currently accessed directories. However, this is a sidelight and not what I am suggesting.

The duplicated directory handling will take some time (the amount depending on the underlying system) to create files, and following links will take a bit of time, but no process-context mods are really needed here, and header and directory caching will reduce the time needed. What is gained is very flexible filesystem space allocation across many volumes, with the ability to use the entire filesystem name space as a single entity. The added linking for file access could be avoided by accessing a particular disk, of course, but none of this depends on what filesystems are in use. If all filesystems supported the ODS-2 ACP interface, conceivably the access could be transparent across all of them. Rules for volume selection do not need to be based only on space, either.

While these systems can be implemented with heavy hacking in the file system or the I/O system, you'll note that they can also be handled with a layer between these two, and if done that way, they need no modifications to the VMS kernel, nor to the file system. The per-volume directory is treated as primary here, as currently, and the master directory is the secondary path. In addition, provision of softlinks would be trivial for files, and probably manageable for directories. There would of course need to be some facility for detecting when the master directory was not fully updated, much as mount verification is done now.

The advantage of a system like this is that the entire disk farm would appear to be one large tree structure, and adding new disks or even new filesystems would be simple. Initially I might not bother with crossing disk boundaries with this one, though it might turn out not to be too hard to handle; the link headers might have the marking in them to point at the next disk. In such a system, too, each individual disk would have valid file structures of its own, and it would not be necessary to worry so much about volume size limits so long as aggregate store remained.

Another advantage of a dual directory system is that folks would tend to use the "master" directory the bulk of the time. Put that on a Spiralog disk (possibly with the suggestion #1 cache below it), where directories are stored in B-trees, and directory performance may soar (at least on reads) even though the underlying store might be ODS-2 or otherwise use slower techniques. There is some synergy possible here that's worth thinking about too.
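Finally, as an illustration of the update ordering stated earlier (per-volume directory first, master second, with enough of a record left behind that a later consistency pass, akin to mount verification, can repair the master), here is a small sketch in C. The operations are stubs invented for this example; they only show the ordering, not real directory or journal code, and the device and file names are made up.

    /* Sketch of the two-phase directory update ordering: the per-volume
     * directory entry is made first, the master entry second, with an
     * intention record in between so that a crash leaves something a later
     * consistency pass can find and repair.  The "directory" operations are
     * stubs standing in for real directory/journal updates. */
    #include <stdio.h>

    static int enter_per_volume_entry(const char *vol, const char *name)
    { printf("per-volume %s: enter %s\n", vol, name); return 1; }

    static int log_master_intent(const char *name)
    { printf("master: log intent for %s\n", name); return 1; }

    static int enter_master_entry(const char *vol, const char *name)
    { printf("master: enter %s -> marking points at %s\n", name, vol); return 1; }

    static void clear_master_intent(const char *name)
    { printf("master: clear intent for %s\n", name); }

    /* Create file "name" whose data will live on volume "vol". */
    int create_linked_file(const char *vol, const char *name)
    {
        /* 1. The per-volume directory is primary: update it first.           */
        if (!enter_per_volume_entry(vol, name))
            return 0;
        /* 2. Record the intention so a crash past this point is detectable.  */
        if (!log_master_intent(name))
            return 0;   /* file still exists and is reachable on vol itself   */
        /* 3. Make the master-directory entry (with its marking/ACE).         */
        if (!enter_master_entry(vol, name))
            return 0;   /* the later consistency pass will retry this step    */
        /* 4. Both directories now agree.                                     */
        clear_master_intent(name);
        return 1;
    }

    int main(void)
    {
        return create_linked_file("DUA3", "WORK.DAT") ? 0 : 1;
    }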