From: STAR::EVERHART "Glenn C. Everhart 603 881 1497"  2-JUL-1996 09:03:15.40
To:   MOVIES::MOVIES::PALMER
CC:   EVERHART
Subj: RE: Cache in vms

Thanks for the "Ack", Julian. The first point was a bit of a nicety... basically you still cache VBNs, but you can recognize reads from MSCP.

You might think about this: I've already suggested to Eleanor (RMS group) that some way (per process?) to force the RAH/WBH bits on for RMS, maybe even as crude as setting them on the way in, would be a general performance win, and cheap to do... it saves both code path length and time. (We had an I/O walkthrough here where I suggested it and she agreed this would be a good thing.)

The disk-based cache should not slip too far to the back of your mind. Operation is something like this: normal start-I/O gets redirected to tell a server about the read/write request, and you provide an alternate path to the real disks that works like vddriver (I can send this to you if you don't have it) but calls ioc$*initiate instead of insioqc. When the server gets the message about work to be done, it looks in a memory data structure and ensures the data is in cache, or uses its special "on the side" interface to get it there and write out whatever was there; it caches biggish (~100 blocks, maybe) chunks of disk space.

There also needs to be an interlock in the intercept to ensure that normal I/O is not started while the server is running, since it will also have to do periodic cleanup; and when the server modifies the in-memory page of the mapping structure it must coordinate globally to be sure this stays cluster-coherent. Data reads and writes do not need this coordination since they go to nonvolatile store, and the map structure needs to be on nonvolatile disk always. (I'm reciting this from memory... the original docs are at home... so don't be too critical of the order of presentation, please.)

Actually, the intercept driver itself (an example intercept driver is available on request, in source too) will handle the mapping if the in-memory page of the mapping structure tells it where the cache data is. All it needs to do is (after all) clobber the LBN in irp$l_media, point irp$l_ucb at the right device, and get the right locks if it lacks them; the example intercept shows this (and a toy sketch of the redirect appears below). Then it can just queue the IRP off. I envision several disks served by one cache.

You'll note the beauty of this is that the server does all the fancy 64-bit mapping and so on; once it gets the data to cache, a user read is simply restarted pointing at the disk cache (backed by your memory cache perhaps, or even its own special one if need be...). NO funny mapping tricks needed, and you use a documented interface to read your data.

There is a hook in post-processing now to speed that up (it'll need to be stolen now and then) so that it can take place entirely at IPL 8, skipping the IPL 4 interrupt and the fork interrupt that can otherwise be needed. You set the IRP fastio bit and the finipl8 bit (new for 7.1) together to do this, then clear them in your intercept code (actually, restore the prior context) and continue with whatever, then do a real post...

By using a server to do the dirty work of the disk cache update, you see, most of the difficult kernel work is simply bypassed.
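Just to show the shape of that redirect, here's a toy user-mode model in C. The structures are stand-ins of my own, NOT the real IRP/UCB layouts, and queuing the IRP is left as a comment; the real intercept patches irp$l_media / irp$l_ucb in the actual IRP and sends it down the ioc$*initiate path.

    /* Toy model of the intercept redirect.  The structures here are
     * stand-ins, NOT the real IRP/UCB layouts; a real intercept patches
     * irp$l_media / irp$l_ucb and queues the IRP as described above.   */
    #include <stdio.h>
    #include <stddef.h>

    #define CHUNK_BLOCKS 100        /* cache in ~100-block chunks        */

    typedef struct toy_ucb { const char *devname; } toy_ucb;

    typedef struct toy_irp {        /* stand-in for the real IRP         */
        unsigned int l_media;       /* LBN of the transfer               */
        toy_ucb     *l_ucb;         /* device the transfer goes to       */
    } toy_irp;

    /* One entry of the in-memory page of the mapping structure:
     * "chunk N of the real disk lives at cache-disk LBN X, or not".     */
    typedef struct map_entry {
        int          valid;
        unsigned int cache_lbn;     /* start LBN of chunk on cache disk  */
    } map_entry;

    /* If the map says the chunk is cached, clobber the LBN and point the
     * IRP at the cache disk; otherwise leave it for the server.         */
    static int redirect(toy_irp *irp, map_entry *map, size_t nchunks,
                        toy_ucb *cache_ucb)
    {
        size_t chunk = irp->l_media / CHUNK_BLOCKS;
        if (chunk >= nchunks || !map[chunk].valid)
            return 0;                               /* miss: wake server */
        irp->l_media = map[chunk].cache_lbn + irp->l_media % CHUNK_BLOCKS;
        irp->l_ucb   = cache_ucb;                   /* hit: retarget IRP */
        return 1;   /* caller just queues the IRP to the cache disk now  */
    }

    int main(void)
    {
        toy_ucb real = { "DKA100" }, cache = { "DKA200 (cache)" };
        map_entry map[4] = { {0,0}, {1, 500}, {0,0}, {1, 700} };
        toy_irp irp = { 173, &real };               /* LBN 173 = chunk 1 */
        if (redirect(&irp, map, 4, &cache))
            printf("hit: go to %s LBN %u\n", irp.l_ucb->devname, irp.l_media);
        return 0;
    }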
The server could in principle work in kernel mode, of course, but that gets harder to do; running in a process means simple $qiow can be used for the underlying disk access (or possibly $qio, if you're sure the synchronization can be handled correctly by your own interlocks, doing the wait in the virtual-disk path instead of the "real" disk path). You need to take some care about the busy bit, since dkdriver sets/clears it nonstandardly, and you may want to do some special handling of it to ensure you don't cause trouble while error processing on the real disk is going on. This shouldn't be all that hard. Note too that write-through is NOT needed to the real disk for this... just to the cache disk.

It was and is my thought that this could turn out to be even more useful with Spiralog than with ODS-2, since it would tend to mean that on-disk reads of the fixed locations would land near the r/w heads of the cache disk. With one stroke you solve some of the Spiralog caching problem as well as that of many other devices, and provide a real, serious benefit to buying DEC's solid state disks, which currently are pretty hard to use in practice (I've heard from customers). I figured you should have some details...

One other hack, by the way, for an LBN cluster cache is to break the disk into ~10K pieces and lock each one. That gives a manageable number of locks to coordinate across the cluster (a sketch of the arithmetic is at the end of this note). A low-level piece of driver code can readily examine the appropriate lock and go ahead and use the cache if it is valid and no conflict exists. For writing it would have to grab the lock first, if not held already. The less fragmented the disk is, the more useful this is.

One trick I've used for several years is to trap io$_modify, and when a file is extended I alter the extend quantity as follows (a rough sketch of the arithmetic is also at the end of this note):

 1. Boost the quantity by some fraction (usually 1/4) of the file size.
 2. Ensure it isn't more than 1/8 of the free space on the volume at the time.
 3. Cap it at a user-settable maximum (default 100K blocks).
 4. Optionally, test that the extend quantity is over some user-set minimum, or bail out.
 5. Ensure that the quantity chosen is at least as large as the original request (VITAL!!).
 6. If enabled, every Nth time (default N=1), and if the extent is not already marked contiguous or contig-best-try, set the CBT bit to get some of the chaff out of the extent cache.

This DRAMATICALLY reduces fragmentation and speeds ODS-2 up a lot, particularly on large files. I'll send you code to try it yourself if you want... I use this on my workstations here, using an intercept. It takes maybe 20 lines of code in the intercept to do the real work. If you have this in, of course, demos like the one at DECUS may go the other way, since the trips "to the well" to get disk blocks become fairly rare, but I think the result will please you.

So much for today's attempt to be helpful... Write if you have questions.

Glenn Everhart
star::everhart
dtn 381 1497
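Here is the lock-per-piece arithmetic from the LBN cluster cache hack above, as a toy in C. The resource-name format is just something for illustration; real code would take the lock out with $enq and consult the lock value block.

    /* Arithmetic of the lock-per-piece hack: carve the disk into ~10K
     * pieces, one cluster lock per piece.  The resource-name format is
     * made up for illustration; real code would $enq on the resource
     * and look at the lock value block.                                */
    #include <stdio.h>

    #define NPIECES 10000u                    /* ~10K pieces per disk   */

    /* Blocks covered by each piece, rounded up. */
    static unsigned piece_blocks(unsigned disk_blocks)
    {
        return (disk_blocks + NPIECES - 1) / NPIECES;
    }

    /* Which lock covers a given LBN. */
    static unsigned piece_of(unsigned lbn, unsigned disk_blocks)
    {
        return lbn / piece_blocks(disk_blocks);
    }

    int main(void)
    {
        unsigned disk = 4110480u;             /* e.g. an RZ28-sized disk */
        char resnam[64];
        unsigned piece = piece_of(123456u, disk);

        /* Reads: examine this lock and use the cache if it says valid
         * and nothing conflicts.  Writes: grab the lock first if it is
         * not already held.                                             */
        snprintf(resnam, sizeof resnam, "CACHE$DKA100$%05u", piece);
        printf("piece size %u blocks, LBN 123456 -> lock %s\n",
               piece_blocks(disk), resnam);
        return 0;
    }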
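And here is roughly what the extend-quantity tweak works out to, as arithmetic in C. The names and defaults are illustrative, and the cap in step 3 is my reading of "user-settable maximum"; the real thing is those ~20 lines in the intercept.

    /* Rough arithmetic of the io$_modify extend tweak described above.
     * Names and defaults are illustrative only.                        */
    #include <stdio.h>

    typedef struct extend_policy {
        unsigned max_blocks;     /* step 3 cap, default 100K blocks      */
        unsigned min_blocks;     /* step 4 optional floor (0 = disabled) */
        unsigned cbt_every_n;    /* step 6: force CBT every Nth extend   */
    } extend_policy;

    typedef struct extend_result {
        unsigned quantity;       /* blocks actually asked for            */
        int      set_cbt;        /* set contig-best-try this time?       */
    } extend_result;

    static extend_result compute_extend(unsigned requested,
                                        unsigned file_blocks,
                                        unsigned free_blocks,
                                        unsigned extend_count,
                                        int already_ctg_or_cbt,
                                        const extend_policy *p)
    {
        extend_result r = { requested, 0 };
        unsigned q = requested + file_blocks / 4;     /* 1. boost by ~1/4 file   */
        if (q > free_blocks / 8) q = free_blocks / 8; /* 2. <= 1/8 of free space */
        if (q > p->max_blocks)   q = p->max_blocks;   /* 3. user-settable cap    */
        if (p->min_blocks && q < p->min_blocks)       /* 4. optional minimum...  */
            return r;                                 /*    ...else bail out     */
        if (q < requested) q = requested;             /* 5. never shrink (VITAL) */
        r.quantity = q;
        if (p->cbt_every_n && extend_count % p->cbt_every_n == 0
            && !already_ctg_or_cbt)
            r.set_cbt = 1;                            /* 6. set CBT now and then */
        return r;
    }

    int main(void)
    {
        extend_policy p = { 100000, 0, 1 };           /* defaults from the note  */
        extend_result r = compute_extend(50, 20000, 900000, 1, 0, &p);
        printf("extend %u blocks, CBT=%d\n", r.quantity, r.set_cbt);
        return 0;
    }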