Mark (and Bill, as a copy FYI)... As I mentioned to you yesterday, I've been working up a paper at home about some storage ideas. It is not done (it may turn out to be barely begun), and I intend to explore some issues further and add more detail on how the thing can be implemented (or I could discuss it with folks...I already own code that does softlinks, and started code for a cacher in about 3/1995...if there's interest). I also want to think through some other alternatives that would not be free of the need to alter VMS sources, just to see if there's some cheap mod that could be used instead. However, these ideas seem interesting enough that they should perhaps be shared now, while they might be of value to y'all over in EDO.

My personal bias, BTW, is that a fixed-up ODS-2 needs to exist in the future, and that Spiralog will not prove best in all cases. But when you might have thousands of disk volumes [I'm working on SCSI naming upgrades in "real life"], the current storage management scheme gets to need help real badly. Since I've worked in storage management (e.g. wrote my own HSM, and jukebox control software that is still far better technically than what DEC bought from Perceptics [e.g. works RIGHT in clusters, fails over, etc. etc. ...! Dammit...wish DEC had bought THAT! ...so much for bias...], wrote shadow/stripe/compressing disk drivers, journalling drivers for anything, etc. etc.), I figured these ideas need to be explored. If I were still doing ISV work I'd be trying to sell the cache now, and might have implemented the rest of the second idea also by now...what I have already went far along that road, and I was struggling with various schemes for melding filesystems for some time, since the need for such has been obvious to me for years.

Anyhow, please share this around; I don't know who the right folks to get it are, but would love to see the ideas at least considered. BTW, if you want a demo of MY HSM, undelete, etc., I can bring in a VMSinstallable kit; it runs on VAX or Alpha VMS.

Glenn Everhart@star.enet.dec.com   star::everhart   dtn 381 1497

--------------------------------

From: US2RMC::"EVERHART@Arisia.GCE.Com" 17-JUL-1996 19:27:14.74
Subj: opnrestart.txt

File System Extension
Glenn C. Everhart
Everhart@Arisia.GCE.Com (or everhart@gce.mv.com)
Everhart@star.enet.dec.com (w)

This document is meant to suggest a couple of variant schemes that may be able to enhance VMS' file system manageability and usability.

It would seem clear that a disk farm of dozens or more volumes, in which each volume is a separate entity, has some disadvantages as well as advantages. The advantages lie in backup and error recovery, where a file structure that becomes toast can be recovered in a more reasonable time frame than would be the case if the file structure spanned the whole farm. Burroughs learned that a long time ago... There are also security advantages, since volumes can be protected, and volume access serves as a kind of "mandatory" protection for the volume contents. These, however, tend not to be widely visible.

The disadvantages are in managing the thing. Once a disk runs out of capacity with systems like NTFS or Files-11, files must be migrated, usually manually. These considerations are constantly visible to everyone and represent an operational disadvantage vis-a-vis unix.

The disadvantage group can be dealt with by allocating disk control structures (notably, bitmaps) sized for a larger space than is actually there, and permitting use according to what storage is really present. (If all virtual space is visible, a driver can return the device full error code, and VMS will respond sensibly. Of course, one could modify the VCB slightly at mount time to adjust free blocks as well. My compressing disk did the first, not the second, since its situation was too dynamic.)
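To make the "rubber disk" check concrete, here is a minimal sketch in C of what the driver-level test might look like. All names, sizes, and structures are invented for illustration (this is not real driver code, the block counts are scaled way down for the demo, and the status values merely stand in for VMS condition codes such as SS$_DEVICEFULL): the virtual volume advertises far more blocks than it has backing store, backing blocks are assigned on first write, and once real storage runs out the driver returns device-full so the file system backs off the way it would on a genuinely full disk.

    /* Minimal "rubber disk" sketch: the virtual volume advertises
     * VIRTUAL_BLOCKS, but only PHYSICAL_BLOCKS of real storage exist.
     * Backing blocks are assigned on first write; when they run out the
     * driver returns a device-full status (a stand-in for SS$_DEVICEFULL).
     * All names here are invented for illustration only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define VIRTUAL_BLOCKS  (1UL << 20)   /* size the bitmap/file structure sees */
    #define PHYSICAL_BLOCKS (1UL << 16)   /* real storage actually present       */

    #define ST_SUCCESS     1              /* stand-ins for VMS condition codes   */
    #define ST_DEVICEFULL  2
    #define ST_BADBLOCK    4

    static unsigned long *lbn_map;        /* virtual LBN -> physical LBN+1; 0 = unmapped */
    static unsigned long  next_physical;  /* count of backing blocks handed out  */

    int rubber_write(unsigned long virtual_lbn)
    {
        if (virtual_lbn >= VIRTUAL_BLOCKS)
            return ST_BADBLOCK;                  /* outside even the virtual size */
        if (lbn_map[virtual_lbn] == 0) {         /* first write to this block     */
            if (next_physical >= PHYSICAL_BLOCKS)
                return ST_DEVICEFULL;            /* real storage exhausted        */
            lbn_map[virtual_lbn] = ++next_physical;
        }
        /* ...issue the real transfer to block lbn_map[virtual_lbn] - 1...        */
        return ST_SUCCESS;
    }

    int main(void)
    {
        lbn_map = calloc(VIRTUAL_BLOCKS, sizeof *lbn_map);
        if (!lbn_map) return 1;
        printf("write lbn 12345 -> status %d\n", rubber_write(12345));
        free(lbn_map);
        return 0;
    }

A real driver would presumably keep such a map in nonpaged pool or on the device itself; the sketch only shows the shape of the check.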
However, build huge "rubber disks" like this and the advantages will tend to be lost, if that is all you do. (Not that it's bad to do; just that one loses the advantages.)

Possibility 1:

I will mention another possibility: a disk-backed LBN cache, where the cache lives on disks...as many as you like...and the file structure is again of the "rubber" class. However, the backing store need not all be of equal speed, since the cache system could see to it that recently accessed storage was on fast disks, while slower store could be used for older data. The resulting system would logically span many disks (the cache system handling device boundary issues as a side effect) and be sized for many disks, but the underlying store could be mostly quite a bit slower than usual, possibly even residing partly on things like tapes.

My design for such a beast involves letting a cache server handle the actual work of swapping things around, and providing a second path to the real disks which only the cache server would use. (Details would in some ways resemble what mount verify does, with a bit more locking and general usability, but the cache server would use normal (well, almost normal) $QIO functions, and the whole thing can be constructed with straightforward I/O interception techniques, not really needing mods to the filesystem or the executive.) Disk access for in-cache storage (and the cache could be gigabytes or larger) would be very fast, handled by tiny mods to IRPs on the way to the appropriate storage device. The cache server gets into the picture only for cache misses (and must coordinate across the cluster when the cache map changes). This kind of thing is wonderful if you have a solid state disk, by the way; it can be used in layers if you want, though logically it will look like a huge volume, or maybe several such. (It would also adapt easily for WORMs.) Note that the fact that the cache server does NOT have to be active for all writes to cache can mean much less than the usual amount of cache contention; there may be situations where this will be a significant win. Also, this can be a handy thing to think about in a read-intensive situation, where Spiralog is disadvantaged; the access to the fixed blocks might transparently be migrated to a small area of a cache device and tend to be "close". Of course, a conventional memory-based writethrough cache could be used below this if desired, so long as it was an LBN cache...

This would be useful and interesting, but in some ways still has the problems one has with single huge file structures. (The backup/restore problem does become easier, since "new stuff" is on a much smaller store than older things, and older things can be backed up as they migrate to slower store.)
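As a rough sketch of the fast path just described, here is the kind of routing an interception layer could do, written in C with invented types (these are not the real VMS IRP/UCB definitions). A hit patches the request so it goes to the cache device at a translated LBN; only a miss has to involve the cache server, which is also where the cluster-wide coordination on map changes would live.

    /* Sketch of the LBN-cache fast path.  All types and names are invented
     * stand-ins, not the real VMS IRP/UCB structures.  On a hit the request
     * is redirected to the cache device with a translated block number;
     * only misses are handed to the cache server. */
    #include <stdio.h>
    #include <stddef.h>

    struct device { int unit; };            /* stand-in for a UCB-like unit      */

    struct io_request {                     /* IRP-like: one transfer            */
        struct device *target;              /* device the transfer will go to    */
        unsigned long  lbn;                 /* starting logical block            */
        unsigned long  count;               /* blocks to transfer                */
    };

    struct cache_extent {                   /* one mapped run of blocks          */
        unsigned long backing_lbn;          /* LBN on the slow/backing device    */
        unsigned long cache_lbn;            /* where it lives on the cache disk  */
        unsigned long count;
    };

    struct cache_map {
        struct device       *cache_device;  /* fast disk (or solid state disk)   */
        struct cache_extent *extent;
        size_t               n_extents;
    };

    /* Returns 1 if the request was redirected (hit), 0 if the cache server
     * must be involved (miss).  A real version would hold the cluster-wide
     * lock protecting the map while reading it. */
    int cache_route(struct io_request *req, const struct cache_map *map)
    {
        size_t i;
        for (i = 0; i < map->n_extents; i++) {
            const struct cache_extent *e = &map->extent[i];
            if (req->lbn >= e->backing_lbn &&
                req->lbn + req->count <= e->backing_lbn + e->count) {
                /* tiny mod to the request on its way down: new device, new LBN */
                req->target = map->cache_device;
                req->lbn    = e->cache_lbn + (req->lbn - e->backing_lbn);
                return 1;
            }
        }
        return 0;   /* miss: queue to the cache server */
    }

    int main(void)
    {
        struct device slow = {1}, fast = {2};
        struct cache_extent ext = { 1000, 0, 500 };  /* backing 1000..1499 cached */
        struct cache_map map = { &fast, &ext, 1 };
        struct io_request req = { &slow, 1100, 16 };
        if (cache_route(&req, &map))
            printf("hit: unit %d lbn %lu\n", req.target->unit, req.lbn);
        else
            printf("miss: hand to cache server\n");
        return 0;
    }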
Possibility 2:

Another possibility exists that I'd like to suggest. My idea here is that a collection of disks will be handled by two directory structures: one on a "master" volume, and one on each individual volume. The "master" volume will either have only directories, or will have directories pointing to files on itself and on everything else. Like a volume set, this structure will be managed as one file structure. Unlike a volume set, it will have individual volumes as sensible entities unto themselves, so that the unit of backup will be the individual volume. The total directory tree would resemble one in unix, with a root directory and all other volumes falling at mount points somewhere below root.

One key to this is to add a restart capability to open. A second key is to maintain both sets of directories in parallel, stating clearly that the per-volume directories get maintained first and the master directories are handled second, so that in crashes some inconsistency can occur (and be fixed later). The third key is that when old files are opened, they need to be found where they are, whereas new files should be created where space exists. The space trick might be doable using the volume set logic, but in fact can be done by a front end.

The way you'd do this with a front end looks something like this: you insert some processing by intercepting the XQP calls' FDT routines (to be filesystem independent). (Yeah, other possibilities exist too; I'll describe this one.)

Create: For non-kernel channels, you save the user open request in a pool data structure, and keep all context (including previous-mode PSL) there. When you find the disk with the most free space, save the original channel UCB and point the CCB at the desired disk. Now enter the directory entry in the master disk's directory, with an ACE (or other marking if you like) to tell where the real file is. Then let the original operation run and update the directory on the disk where the file will have its data. Capture deaccess (close) so the channel can be put back as the program expects.

Open: Save context as before, and issue a read-ACL (or otherwise read the file markings) to get the file location. Now again replace the channel UCB, saving the original one, and let the user open go with the user FIB having had its file ID filled in, DID clear, and on the right disk. If the marking has the FID this is direct; this works well but is not symbolic. A symbolic access will want to do a full lookup of the file (a kernel thread can do it from the intercept) and then alter the open. (Note that the open IRP must be pointed at the right disk too.) When you use a kernel thread, get an AST and issue the next processing from there. You can replace the PSL previous mode temporarily to reissue the user I/O from inside AST context. (I've used a special kernel AST for this.) If you care to read a marking and interpret it as a filename or filepath somewhere, the kernel-mode state machine gets more complex, but at each step in the path you issue an IO$_ACCESS for the desired file on the desired disk and check for markings on the final access, possibly repeating the process.

This amounts to a facility to restart an open request, even though it gets layered ahead of the "real" open. It can be done using the existing facilities if need be.
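Here is a minimal sketch, in C with invented stand-in types (not the real CCB/UCB/FIB definitions or a real ACE layout), of the channel bookkeeping the open redirect amounts to: save the device the channel pointed at, retarget the channel and the FIB from the marking read out of the master directory entry, let the open run, and put the channel back at deaccess.

    /* Sketch of the open-redirect step.  All structures are invented
     * stand-ins for the VMS CCB/UCB/FIB and the ACE "marking"; they only
     * show the bookkeeping: save the channel's original device, retarget
     * channel + FIB from the marking, and restore at deaccess. */
    #include <stdio.h>
    #include <stdlib.h>

    struct device { int unit; };               /* stand-in for a UCB-like unit   */

    struct channel {                           /* CCB-like: a user I/O channel   */
        struct device *dev;                    /* device the channel points at   */
    };

    struct file_id { unsigned short num, seq, rvn; };   /* FID-like triple       */

    struct fib {                               /* FIB-like: what the open carries*/
        struct file_id fid;                    /* target file ID                 */
        struct file_id did;                    /* directory ID (cleared here)    */
    };

    struct marking {                           /* what the master-directory ACE says */
        struct device *real_dev;               /* where the file's data really is*/
        struct file_id fid;                    /* its file ID on that volume     */
    };

    struct saved_open {                        /* per-open context kept in pool  */
        struct channel *chan;
        struct device  *original_dev;          /* to restore at deaccess         */
    };

    /* Redirect an open; the marking was read from the master directory entry. */
    struct saved_open *redirect_open(struct channel *chan, struct fib *fib,
                                     const struct marking *m)
    {
        struct saved_open *ctx = malloc(sizeof *ctx);
        if (!ctx) return NULL;
        ctx->chan         = chan;
        ctx->original_dev = chan->dev;         /* remember where the channel pointed */

        chan->dev = m->real_dev;               /* point the channel at the real disk */
        fib->fid  = m->fid;                    /* open by file ID on that volume     */
        fib->did  = (struct file_id){0, 0, 0};
        /* ...now let the (re-issued) user open run against the real disk...         */
        return ctx;
    }

    /* At deaccess (close), put the channel back the way the program expects. */
    void restore_channel(struct saved_open *ctx)
    {
        ctx->chan->dev = ctx->original_dev;
        free(ctx);
    }

    int main(void)
    {
        struct device master_vol = {0}, data_vol = {3};
        struct channel chan = { &master_vol };
        struct fib fib = { {0, 0, 0}, {0, 0, 0} };
        struct marking m = { &data_vol, {42, 1, 1} };
        struct saved_open *ctx = redirect_open(&chan, &fib, &m);
        printf("channel now points at unit %d, fid %u\n",
               chan.dev->unit, fib.fid.num);
        if (ctx) restore_channel(ctx);         /* at deaccess                    */
        return 0;
    }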
Access to directories in this case presumes that create, open, delete, etc. all perform their operations on the on-disk "one-volume" directory structures, and also on a set of master directories, leaving markings in the files in the master directories giving pointers to the files in the one-volume directories. This is closely akin to unix softlinks, with the ability to select a site automatically. Reading directories linked in this way could probably be handled too, by keeping track of the volume used to read a directory. However, RMS' insistence on directory caching gets in the way here. It would be desirable to somehow flag that a directory was for a different volume, possibly by hacking off a high bit or two of the file ID and keeping an internal table of volumes and currently accessed directories. However, this is a sidelight and not what I am suggesting.

The duplicated directory handling will take some time (the amount depending on the underlying system) to create files, and following links will take a bit of time, but no process-context mods are really needed here, and header and directory caching will reduce the time needed. What is gained is very flexible filesystem space allocation across many volumes, with the ability to use the entire filesystem name space as a single entity. The added linking for file access could be avoided by accessing a particular disk, of course, but none of this depends on what filesystems are in use. If all filesystems supported the ODS-2 ACP interface, conceivably the access could be transparent across all of them. Rules for volume selection do not need to be based only on space, either.

While these systems can be implemented with heavy hacking in the file system or the I/O system, you'll note that they can also be handled with a layer between these two, and if done that way, they need no modifications to the VMS kernel, nor to the file system. The per-volume directory is treated as primary here, as currently, and the master directory is the secondary path. In addition, provision of softlinks would be trivial for files, and probably manageable for directories. There would of course need to be some facility for detecting when the master directory was not fully updated, much as mount verification is done now.

The advantage of a system like this is that the entire disk farm would appear to be one large tree structure, and adding new disks or even new filesystems would be simple. Initially I might not bother with crossing disk boundaries with this one, though it might turn out not to be too hard to handle; the link headers might have the marking in them to point at the next disk. In such a system, too, each individual disk would have valid file structures of its own, and it would not be necessary to worry so much about volume size limits so long as aggregate store remained.

Another advantage of a dual directory system is that folks would tend to use the "master" directory the bulk of the time. Put that on a Spiralog disk (possibly with the suggestion #1 cache below it), where directories are stored in B-trees, and directory performance may soar (at least on reads) even though the underlying store might be ODS-2 or otherwise use slower techniques. There is some synergy possible here that's worth thinking about too.
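Finally, as an illustration of the update ordering stated earlier (per-volume directory first, master second, with enough of a record left behind that a later consistency pass, akin to mount verification, can repair the master), here is a small sketch in C. The operations are stubs invented for this example; they only show the ordering, not real directory or journal code, and the device and file names are made up.

    /* Sketch of the two-phase directory update ordering: the per-volume
     * directory entry is made first, the master entry second, with an
     * intention record in between so that a crash leaves something a later
     * consistency pass can find and repair.  The "directory" operations are
     * stubs standing in for real directory/journal updates. */
    #include <stdio.h>

    static int enter_per_volume_entry(const char *vol, const char *name)
    { printf("per-volume %s: enter %s\n", vol, name); return 1; }

    static int log_master_intent(const char *name)
    { printf("master: log intent for %s\n", name); return 1; }

    static int enter_master_entry(const char *vol, const char *name)
    { printf("master: enter %s -> marking points at %s\n", name, vol); return 1; }

    static void clear_master_intent(const char *name)
    { printf("master: clear intent for %s\n", name); }

    /* Create file "name" whose data will live on volume "vol". */
    int create_linked_file(const char *vol, const char *name)
    {
        /* 1. The per-volume directory is primary: update it first.           */
        if (!enter_per_volume_entry(vol, name))
            return 0;
        /* 2. Record the intention so a crash past this point is detectable.  */
        if (!log_master_intent(name))
            return 0;   /* file still exists and is reachable on vol itself   */
        /* 3. Make the master-directory entry (with its marking/ACE).         */
        if (!enter_master_entry(vol, name))
            return 0;   /* the later consistency pass will retry this step    */
        /* 4. Both directories now agree.                                     */
        clear_master_intent(name);
        return 1;
    }

    int main(void)
    {
        return create_linked_file("DUA3", "WORK.DAT") ? 0 : 1;
    }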