From: STAR::EVERHART "Glenn C. Everhart 603 881 1497"  2-JUL-1996 09:03:15.40
To:   MOVIES::MOVIES::PALMER
CC:   EVERHART
Subj: RE: Cache in vms

Thanks for the "Ack", Julian. The first point was a bit of a nicety... basically you still cache VBNs, but you can recognize reads from MSCP.

You might think about this: I've already suggested to Eleanor (RMS group) that some way (per process?) to force the RAH/WBH bits on for RMS, maybe even as crude as setting them on the way in, would be a general performance win, and cheap to do... it saves both code path length and time. (We had an I/O walkthrough here where I suggested it and she agreed this would be a good thing.)

The disk-based cache should not slip too far to the back of your mind. Operation is something like this: normal start-I/O gets redirected to tell a server about the read/write request, and you provide an alternate path to the real disks that works like vddriver (I can send this to you if you don't have it) but calls ioc$*initiate instead of insioqc. When the server gets the message about work to be done, it looks in a memory data structure and ensures the data is in cache, or uses its special "on the side" interface to get it there and write out whatever was there; it caches biggish (~100 blocks, maybe) chunks of disk space.

There also needs to be an interlock in the intercept to ensure that normal I/O is not started while the server is running, since it will also have to do periodic cleanup; and when the server modifies the in-memory page of the mapping structure it must coordinate globally to be sure this stays cluster-coherent. Data reads and writes do not need this coordination since they go to nonvolatile store, and the map structure needs to be on nonvolatile disk always. (I'm reciting this from memory... the original docs are at home... so don't be too critical of the order of presentation, please.)

Actually, the intercept driver itself (an example intercept driver is available on request, in source too) will handle the mapping if the in-memory page of the mapping structure tells it where the cache data is. All it needs to do is (after all) clobber the LBN in irp$l_media, point irp$l_ucb at the right device, and get the right locks if it lacks them; the example intercept shows this (and a toy sketch of the redirect appears below). Then it can just queue the IRP off. I envision several disks served by one cache.

You'll note the beauty of this is that the server does all the fancy 64-bit mapping and so on; once it gets the data to cache, a user read is simply restarted pointing at the disk cache (backed by your memory cache perhaps, or even its own special one if need be...). NO funny mapping tricks needed, and you use a documented interface to read your data.

There is a hook in post-processing now to speed that up (it'll need to be stolen now and then) so that it can take place entirely at IPL 8, skipping the IPL 4 interrupt and the fork interrupt that can otherwise be needed. You set the IRP fastio bit and the finipl8 bit (new for 7.1) together to do this, then clear them in your intercept code (actually, restore the prior context) and continue with whatever, then do a real post...

By using a server to do the dirty work of the disk cache update, you see, most of the difficult kernel work is simply bypassed.
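Just to show the shape of that redirect, here's a toy user-mode model in C. The structures are stand-ins of my own, NOT the real IRP/UCB layouts, and queuing the IRP is left as a comment; the real intercept patches irp$l_media / irp$l_ucb in the actual IRP and sends it down the ioc$*initiate path.

    /* Toy model of the intercept redirect.  The structures here are
     * stand-ins, NOT the real IRP/UCB layouts; a real intercept patches
     * irp$l_media / irp$l_ucb and queues the IRP as described above.   */
    #include <stdio.h>
    #include <stddef.h>

    #define CHUNK_BLOCKS 100        /* cache in ~100-block chunks        */

    typedef struct toy_ucb { const char *devname; } toy_ucb;

    typedef struct toy_irp {        /* stand-in for the real IRP         */
        unsigned int l_media;       /* LBN of the transfer               */
        toy_ucb     *l_ucb;         /* device the transfer goes to       */
    } toy_irp;

    /* One entry of the in-memory page of the mapping structure:
     * "chunk N of the real disk lives at cache-disk LBN X, or not".     */
    typedef struct map_entry {
        int          valid;
        unsigned int cache_lbn;     /* start LBN of chunk on cache disk  */
    } map_entry;

    /* If the map says the chunk is cached, clobber the LBN and point the
     * IRP at the cache disk; otherwise leave it for the server.         */
    static int redirect(toy_irp *irp, map_entry *map, size_t nchunks,
                        toy_ucb *cache_ucb)
    {
        size_t chunk = irp->l_media / CHUNK_BLOCKS;
        if (chunk >= nchunks || !map[chunk].valid)
            return 0;                               /* miss: wake server */
        irp->l_media = map[chunk].cache_lbn + irp->l_media % CHUNK_BLOCKS;
        irp->l_ucb   = cache_ucb;                   /* hit: retarget IRP */
        return 1;   /* caller just queues the IRP to the cache disk now  */
    }

    int main(void)
    {
        toy_ucb real = { "DKA100" }, cache = { "DKA200 (cache)" };
        map_entry map[4] = { {0,0}, {1, 500}, {0,0}, {1, 700} };
        toy_irp irp = { 173, &real };               /* LBN 173 = chunk 1 */
        if (redirect(&irp, map, 4, &cache))
            printf("hit: go to %s LBN %u\n", irp.l_ucb->devname, irp.l_media);
        return 0;
    }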
The server could in principle work in kernel mode, of course, but that gets harder to do; running in a process means simple $qiow can be used for the underlying disk access (or possibly $qio, if you're sure the synchronization can be handled correctly by your own interlocks, doing the wait in the virtual-disk path instead of the "real" disk path). You need to take some care about the busy bit, since dkdriver sets/clears it nonstandardly, and you may want to do some special handling of it to ensure you don't cause trouble while error processing on the real disk is going on. This shouldn't be all that hard. Note too that write-through is NOT needed to the real disk for this... just to the cache disk.

It was and is my thought that this could turn out to be even more useful with Spiralog than with ODS-2, since it would tend to mean that on-disk reads of the fixed locations would land near the r/w heads of the cache disk. With one stroke you solve some of the Spiralog caching problem as well as that of many other devices, and provide a real, serious benefit to buying DEC's solid state disks, which currently are pretty hard to use in practice (I've heard from customers). I figured you should have some details...

One other hack, by the way, for an LBN cluster cache is to break the disk into ~10K pieces and lock each one. That gives a manageable number of locks to coordinate across the cluster (a sketch of the arithmetic is at the end of this note). A low-level piece of driver code can readily examine the appropriate lock and go ahead and use the cache if it is valid and no conflict exists. For writing it would have to grab the lock first, if not held already. The less fragmented the disk is, the more useful this is.

One trick I've used for several years is to trap io$_modify, and when a file is extended I alter the extend quantity as follows (a rough sketch of the arithmetic is also at the end of this note):

 1. Boost the quantity by some fraction (usually 1/4) of the file size.
 2. Ensure it isn't more than 1/8 of the free space on the volume at the time.
 3. Cap it at a user-settable maximum (default 100K blocks).
 4. Optionally, test that the extend quantity is over some user-set minimum, or bail out.
 5. Ensure that the quantity chosen is at least as large as the original request (VITAL!!).
 6. If enabled, every Nth time (default N=1), and if the extent is not already marked contiguous or contig-best-try, set the CBT bit to get some of the chaff out of the extent cache.

This DRAMATICALLY reduces fragmentation and speeds ODS-2 up a lot, particularly on large files. I'll send you code to try it yourself if you want... I use this on my workstations here, using an intercept. It takes maybe 20 lines of code in the intercept to do the real work. If you have this in, of course, demos like the one at DECUS may go the other way, since the trips "to the well" to get disk blocks become fairly rare, but I think the result will please you.

So much for today's attempt to be helpful... Write if you have questions.

Glenn Everhart
star::everhart
dtn 381 1497
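Here is the lock-per-piece arithmetic from the LBN cluster cache hack above, as a toy in C. The resource-name format is just something for illustration; real code would take the lock out with $enq and consult the lock value block.

    /* Arithmetic of the lock-per-piece hack: carve the disk into ~10K
     * pieces, one cluster lock per piece.  The resource-name format is
     * made up for illustration; real code would $enq on the resource
     * and look at the lock value block.                                */
    #include <stdio.h>

    #define NPIECES 10000u                    /* ~10K pieces per disk   */

    /* Blocks covered by each piece, rounded up. */
    static unsigned piece_blocks(unsigned disk_blocks)
    {
        return (disk_blocks + NPIECES - 1) / NPIECES;
    }

    /* Which lock covers a given LBN. */
    static unsigned piece_of(unsigned lbn, unsigned disk_blocks)
    {
        return lbn / piece_blocks(disk_blocks);
    }

    int main(void)
    {
        unsigned disk = 4110480u;             /* e.g. an RZ28-sized disk */
        char resnam[64];
        unsigned piece = piece_of(123456u, disk);

        /* Reads: examine this lock and use the cache if it says valid
         * and nothing conflicts.  Writes: grab the lock first if it is
         * not already held.                                             */
        snprintf(resnam, sizeof resnam, "CACHE$DKA100$%05u", piece);
        printf("piece size %u blocks, LBN 123456 -> lock %s\n",
               piece_blocks(disk), resnam);
        return 0;
    }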
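And here is roughly what the extend-quantity tweak works out to, as arithmetic in C. The names and defaults are illustrative, and the cap in step 3 is my reading of "user-settable maximum"; the real thing is those ~20 lines in the intercept.

    /* Rough arithmetic of the io$_modify extend tweak described above.
     * Names and defaults are illustrative only.                        */
    #include <stdio.h>

    typedef struct extend_policy {
        unsigned max_blocks;     /* step 3 cap, default 100K blocks      */
        unsigned min_blocks;     /* step 4 optional floor (0 = disabled) */
        unsigned cbt_every_n;    /* step 6: force CBT every Nth extend   */
    } extend_policy;

    typedef struct extend_result {
        unsigned quantity;       /* blocks actually asked for            */
        int      set_cbt;        /* set contig-best-try this time?       */
    } extend_result;

    static extend_result compute_extend(unsigned requested,
                                        unsigned file_blocks,
                                        unsigned free_blocks,
                                        unsigned extend_count,
                                        int already_ctg_or_cbt,
                                        const extend_policy *p)
    {
        extend_result r = { requested, 0 };
        unsigned q = requested + file_blocks / 4;     /* 1. boost by ~1/4 file   */
        if (q > free_blocks / 8) q = free_blocks / 8; /* 2. <= 1/8 of free space */
        if (q > p->max_blocks)   q = p->max_blocks;   /* 3. user-settable cap    */
        if (p->min_blocks && q < p->min_blocks)       /* 4. optional minimum...  */
            return r;                                 /*    ...else bail out     */
        if (q < requested) q = requested;             /* 5. never shrink (VITAL) */
        r.quantity = q;
        if (p->cbt_every_n && extend_count % p->cbt_every_n == 0
            && !already_ctg_or_cbt)
            r.set_cbt = 1;                            /* 6. set CBT now and then */
        return r;
    }

    int main(void)
    {
        extend_policy p = { 100000, 0, 1 };           /* defaults from the note  */
        extend_result r = compute_extend(50, 20000, 900000, 1, 0, &p);
        printf("extend %u blocks, CBT=%d\n", r.quantity, r.set_cbt);
        return 0;
    }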