Article 147619 of comp.os.vms
Note: this is "blue sky" stuff and rather long...
Glenn C. Everhart, 18-May-1996

Once in a while the group should have a "blue sky" discussion (just as we sometimes have in magic sessions at DECUS symposia). If you don't care for this one you can skip it now... If on the other hand you have any thoughts about these kinds of issues, perhaps this can turn into a useful discussion about methods.

For the last several years I've thought about ways to make the remote virtual disks I've given out work read/write from many locations. Since at this point I've not seen anyone else's code that does this, it seems appropriate to share some design information I've worked out, for whatever anyone wants to do with it. (These speculations and design efforts got their start in a DECUS session around 1989 or 1990 where I got to talking with a speaker about how to accomplish this; some of the others who were there at the time may remember it, though I have forgotten who the speaker was. Some of the attendees were in the VAX SIG leadership at the time...)

The FDdriver-based remote disk uses a very simple protocol to talk to its server, which just does logical I/O. Logical I/O is therefore effectively moved from one machine to another. This is fine as far as it goes, and is what the MSCP server does also. Inside a cluster there are locking mechanisms the file system uses to coordinate file accesses, so any disk with a unique name and allocation class gets treated as the same storage, regardless of the underlying mechanism. I've used this to have VDdriver provide cluster virtual disks without serving the virtual disks, with VEdriver or VQdriver or WQdriver to do shadowing with each node running its own shadow driver, and with SDdriver or VWdriver to do striping with each node running its own stripe driver.

In networks, however, this locking traffic does not exist. Now, it is conceivable to try to export the lock traffic by keeping "shadow" locks around and having a remote server grab and release the appropriate disk locks in response to messages from a server on the machine where the disk lives, that server using blocking ASTs to notice when the various locks are grabbed. This has some dangers, though: mainly, the XQP usage of locks is not a documented interface and is likely to change with time (think about it...), and it also IMPORTS into the local file system any flakiness from the network's glitches. Still, the notion of transmitting notice of when the bitmap is locked, and sending updates out when this happens, has its attractions. But you must also sense when the index file, directories, and (if you want to get this deep) records of files are written. The extent cache exists precisely to avoid hitting the bitmap every time something needs to be written. If we continue to export logical I/O, this would allow operation in a network with shared writing roughly as if it were a local system; the export would then be in both directions. Note of course that trying to track all access locking for records is a hard problem, because there is locking that goes on within the XQP and RMS which is not explicitly available outside. Providing shared reading (read-only access) or exclusive access is much simpler, since the lock names used for files are generally known.

Another approach is also feasible, and perhaps easier to understand. Recall for this that every file access to a disk goes through the ACP interface (see the I/O User's Reference Manual).
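For concreteness, here is a rough user-mode sketch of what an ACP-QIO file open by file ID looks like. It is illustrative only: the helper name is invented, the exact FIB member spellings vary between versions of fibdef.h, and the argument list should be checked against the I/O User's Reference Manual rather than trusted from this fragment.

/* Illustrative only: open a file by file ID through the ACP-QIO interface.
 * "chan" must already be assigned to the disk with sys$assign.  Exact FIB
 * member spellings differ between fibdef.h versions.                      */
#include <descrip.h>
#include <fibdef.h>
#include <iodef.h>
#include <ssdef.h>
#include <starlet.h>
#include <string.h>

int acp_open_by_fid(unsigned short chan, unsigned short fid[3])
{
    struct fibdef fib;
    struct dsc$descriptor fibdsc;
    unsigned short iosb[4];
    int status;

    memset(&fib, 0, sizeof fib);
    fib.fib$w_fid[0] = fid[0];            /* file number                   */
    fib.fib$w_fid[1] = fid[1];            /* sequence number               */
    fib.fib$w_fid[2] = fid[2];            /* relative volume number        */
    fib.fib$l_acctl  = FIB$M_NOWRITE;     /* shared read: deny writers     */

    fibdsc.dsc$w_length  = sizeof fib;    /* P1 is a descriptor of the FIB */
    fibdsc.dsc$b_dtype   = DSC$K_DTYPE_Z;
    fibdsc.dsc$b_class   = DSC$K_CLASS_S;
    fibdsc.dsc$a_pointer = (char *) &fib;

    status = sys$qiow(0, chan, IO$_ACCESS | IO$M_ACCESS, iosb, 0, 0,
                      &fibdsc, 0, 0, 0, 0, 0);
    if (status & 1)
        status = iosb[0];                 /* final status from the ACP/XQP */
    return status;                        /* SS$_NORMAL on success         */
}

Create, delete, extend, and truncate go through the same channel with io$_create, io$_delete, and io$_modify, which is what makes the interception scheme below workable.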
A protocol that allows one system to handle create, delete, extend, and truncate operations need not worry about the bitmap; those are the only operations that affect the bitmap. If we export logical I/O as the existing drivers do, then AS LONG AS THE UNDERLYING FILE STRUCTURE IS THE SAME the disks will be handled such that the storage will not be corrupted. This is not quite enough in general, since directory updates, for example, might not be coordinated. Some of this can be handled by insisting we mount /nocache, so that the system has to "go to the well" to get information off the disk every time.

But it is also worthwhile to be able to transport information about virtual I/O. If we do this enough, and do NOT transport logical I/O, it is indeed possible (but more work) to have most file operations work across a network invisibly. You lose certain backup operations (so that, for example, backup/physical would not work as it does for FDdriver). You gain in principle the ability to make any file system on "the other end" act like normal ODS-2. (Operations like this can be useful if you're trying to do things like NFS clients.) For example, backup/record does virtual I/O to the index file to write dates. What's important is to ensure that writes get propagated around; for reading, you may be able to get away with transporting logical I/O, provided the systems speak the same file structure language. Thus transporting virtual writes and logical reads is one possible scheme.

I/O operations can be intercepted in a driver at FDT time, or by stealing the XQP entry and gaining access there once the IRP is fully formed. Those are the convenient points. At XQP entry you are in a normal kernel AST context within your process, have the full IRP to work with, and can still access data buffers.

Suppose now I want to open a file for read and I'm on a system remote from the actual disk. I have a server running on the disk system (the one with the actual disk), and locally a virtual driver and a process it talks to that do the actual work. Now I can have my local system send a message to the server telling it to open the file on my behalf (transmitting security information) so that it stays locked on the disk system. Generally I'd do this for exclusive or read-only access (there's a bit in the FIB one can look at to find this out quickly). The disk system need not send any more information back, but it will "know" that the file is open. I then just let my normal local open run, and it reads the disk with read-logical and opens the file locally. (Alternatively I can have the remote system send back information about the file if I plan to handle only virtual I/O.)

To create a file, I need to tell the disk system to do the create and get it to tell me the resulting name (or report an error); if the create went well, I then need to locally open an existing file. (The disk system would open the file at its end if needed.) Fortunately the arguments to io$_access and io$_create are nearly alike, so this can be handled. It is permissible to wait until the disk system finishes its operation before proceeding with the local operation, so things like the io$m_create modifier on io$_access can be handled.

To delete a file, I need to get both ends to delete it, but only the remote end will actually do the delete. I may handle the deletion by simply telling the local system that the disk cache is now invalid (as clusters do), so it will forget about the file and any directory entries it thought might contain pointers to it.
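To make the message traffic concrete, here is one possible (entirely hypothetical) layout for the private protocol between the local intercept and the disk-system server. None of these structures or names exist anywhere; they simply show the shape such a protocol might take.

/* HYPOTHETICAL wire format for the local-intercept <-> disk-system server
 * exchange sketched above.  None of these are existing VMS or FDdriver
 * structures; they are invented purely to show the shape of a private
 * protocol carrying the structure-changing operations.                   */
#include <stdint.h>

enum rfs_op {                    /* operations performed on the disk system   */
    RFS_OPEN     = 1,            /* open & hold the file (exclusive or r/o)   */
    RFS_CREATE   = 2,            /* create; reply carries resultant name/FID  */
    RFS_DELETE   = 3,            /* delete; local side just invalidates cache */
    RFS_EXTEND   = 4,            /* extend; local side does a window turn     */
    RFS_TRUNCATE = 5             /* truncate; local side does a window turn   */
};

struct rfs_request {
    uint16_t op;                 /* one of enum rfs_op                        */
    uint16_t namelen;            /* length of the filename that follows       */
    uint32_t uic;                /* requester's UIC, for the protection check */
    uint32_t acctl;              /* requested access (read-only, exclusive...) */
    uint32_t blocks;             /* size for extend/truncate, else zero       */
    /* followed by namelen bytes of filename                                  */
};

struct rfs_reply {
    uint32_t status;             /* VMS condition value from the server's QIO */
    uint16_t fid[3];             /* file ID, so the local open can find it    */
    uint16_t namelen;            /* length of resultant name that follows     */
    /* followed by namelen bytes of resultant filename                        */
};

The point of returning the file ID in the reply is that the local side can then run its normal open against a file that now exists, using read-logical for the data, exactly as in the create case described above.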
To truncate a file, the disk system gets told to do the truncate, and the local system just gets told to do a window turn (invalidating its windows so it loads them off disk again). Extend works the same way.

The server process must be guarded against being clobbered, as must the local process which talks to it. Setting nodelete (and maybe forcex pending and delete pending) is fairly effective. Tricks like those used by Ehud Gavron's "invisible" program could be used, but I consider doing that on someone's system to be a slimy trick; if something goes wrong, having a hidden process makes it hard to find out what's wrong, and a process cannot be hidden from its necessary network connections anyway.

It is possible to monitor the io$_available, io$_unload, and io$_packack functions to see when a disk is dismounted or mounted, and to use io$_unload, for example, as a way to achieve a clean exit. Dismount/nounload generates io$_available, and dismount without /nounload (if the disk is marked as a removable device) generates io$_unload. Mount generates io$_packack. (The same pattern holds for non-disks, e.g. tapes, as well.)

To handle virtual read/write one must send it across, have the remote system do it, and (for read) return the data. It is also possible to mark a window block on the local system such that your "ACP" intercept code gets called for every operation. At FDT time, of course, virtual I/O entries are used for all I/O. If you want the XQP to see the file, you must be sure the window blocks are marked so that the XQP is always called. This can be done, if need be, by setting IRP$L_PID to the address of your completion routine (save the original value somewhere first!), resetting the structures as needed, and then restoring IRP$L_PID and performing the actual completion. All XQP paths check this completion path explicitly so that "system" routine completion code can be invoked. If, of course, you propose to use all virtual I/O, you can set up your own local window block that will have this effect, since the disk system will be handling your actual I/O. (The server had better have lots of channels available...)

Note that I do not consider using the MSCP or TMSCP protocols a viable approach for any of this transport. The reason is that these protocols are complex and documentation of them is not generally available outside Digital. Moreover, like other undocumented interfaces they will change now and again, and for disks MSCP is not sufficient for the operations discussed above. Remote tape access is best handled by using something like ZT_driver, which has gotten very good. (Over TCP/IP one can use the same code, but one must add some code to break up large records and recombine them, and to handle retries; a rough sketch of that sort of framing appears below.) While this means adding one's own drivers and so on, talking to the TMSCP server is not something I believe can be done safely by a third party. The effort of trying to reverse engineer the Digital protocols would be enormous, would violate licenses in any case, and would break when the protocol changes. (This is so even though the listings are on the listing CDs; they are not there to be lifted wholesale, and other methods of figuring out the protocols would be on very questionable ground if the intent was to duplicate the effect completely.) Rather, a remote driver is perfectly feasible.

Some functions are possible to add, of course. One could, for example, add functions to a server to report ucb$l_record from the real tape's UCB instead of computing it.
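As for the TCP/IP point above, here is an invented sketch (not taken from ZT_driver) of the sort of framing needed to break large tape records into pieces and recombine them on the far side of a stream connection; send_all() stands in for whatever transport routine actually moves the bytes and does retries.

/* Invented example (not ZT_driver's actual code): framing large tape
 * records over a TCP byte stream.  Each record goes out as one or more
 * (header, data) pieces; the receiver collects pieces until offset +
 * piecelen == reclen and then hands the whole record to the tape code.   */
#include <stdint.h>

#define PIECE_MAX 8192           /* send large records in pieces this big  */

struct piece_hdr {
    uint32_t reclen;             /* total length of the tape record        */
    uint32_t offset;             /* where this piece falls in the record   */
    uint32_t piecelen;           /* data bytes following this header       */
};

/* send_all() must loop (and retry as appropriate) until "len" bytes of
 * "buf" have gone out on "link"; it returns nonzero on success.          */
int send_record(int link, const char *rec, uint32_t reclen,
                int (*send_all)(int link, const void *buf, uint32_t len))
{
    uint32_t off = 0;

    do {                         /* zero-length records still get a header */
        struct piece_hdr h;
        h.reclen   = reclen;
        h.offset   = off;
        h.piecelen = (reclen - off > PIECE_MAX) ? PIECE_MAX : reclen - off;
        if (!send_all(link, &h, sizeof h))          return 0;
        if (!send_all(link, rec + off, h.piecelen)) return 0;
        off += h.piecelen;
    } while (off < reclen);

    return 1;                    /* whole record sent                      */
}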
One could also then profitably allow some remote control over things like tape density, compression, compression algorithm, and so on, so long as the underlying drivers have these capabilities. (On SCSI tapes, io$_diagnose can be used once you know the "magic densities" to send to different drives; see the examples on the SIG tapes.) While using ZT_driver in a for-sale package has its questionable side too, I'm speaking for public consumption here and don't think there'd be any problem with someone adding to it as a service. Building something comparable looks feasible too. Private protocols are handy for this, since they can readily be extended and are not subject to change out from under you.

The approaches I have suggested depend by and large on documented VMS interfaces which (now that the step 2 driver interface is here) are likely to be stable for a while. It is interesting to think about approaches like this in terms of network independence as well. Personally I'm inclined to wish such methods were used instead of the current VMS scheme of having network code incestuously involved with RMS code; it would seem to make both harder to update.

(Also, as I've speculated before, VMS I/O at driver level is very fast, and user I/O can be fast also, by using the system/ACP interface. If one cared to, say, port a C runtime from Linux or some other system where the only underlying services needed are block I/O (remember RT11's runtime?), it ought to run exceedingly fast. Of course, if such a thing had been used instead of the RMS approach, it would have the difficulties that NFS does: needing automount and auto-dismount, lest pool fill up with UCBs that were not in use. The current approach requires less memory and is in some ways simpler.)

(It might be possible (barely) to use this path to tweak ftp access to a site into something that looks like a disk file system, as was mentioned recently. It would need a local cache like AFS's, though, to keep local copies of the files being accessed... I don't see a way to fake per-record access without something like that.)

(Another aside: don't you wish other vendors would publish enough interfaces to allow this kind of thing?)