From: MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX"  8-JUL-1992 16:29:58.10
To: "info-vax@crvax.sri.com"
CC:
Subj: FD: drives cluster-wide

A while ago I tried to run FDdriver on a cluster and encountered (and had
to fix) a problem with invalid buffer lengths.

As background: FDdriver has an internal buffer which mediates between the
user process and the host process that is doing the work. To conserve pool,
I have generally left this buffer sized at 4K or 8K (depending on how I
felt that day) and set ucb$l_maxbcnt to match. On a single machine this
works fine, and the QIO service ensures that transfers to the driver are
limited to this size. (Longer ones get split, which is no problem for the
logic, so user code never sees it.) The unit initialization code in
FDdriver sets this up properly, and as long as the driver buffer size and
the buffer sizes in the host program and the network server all match, it
works remotely just fine.

On a cluster with the MSCP serve-all SYSGEN parameter set, as soon as
FDdriver is loaded and units are connected via SYSGEN CONNECT, MSCP sets up
client UCBs on all other cluster nodes so it can do transfers to them, even
though those units are still marked offline and invalid. However, rather
than take the UCB parameters from the just-loaded FDdriver (which are
already set correctly on the machine where the driver is loaded), MSCP
assumes they are all "standard", so the ucb$l_maxbcnt field on the client
UCBs is set to (if memory serves) FE00 hex. For a VD: type disk this is no
problem. For an FD: type disk it is, and the problem was compounded by the
fact that I had no logic in FDdriver to check for it.

Having no simple way to modify the behavior of SCS in this regard, I
modified the FDHOST/CLEAR option in fdhostcry5.mar to set the
ucb$l_maxbcnt field correctly (a sketch of the idea appears below). This
let me set up remote FD: units as MSCP-served units and have everything
work: issue the fdhostcry5/clear FDAn: command first, then mount the FD:
device, or in any case run FDHOST/CLEAR before anyone tries to use the
device. This has to be done on every node that wants to use the device.

I also added code so that FDdriver recognizes when MSCP has sent it a
too-long packet and rejects the I/O if that happens, rather than possibly
corrupting data inside the driver and perhaps in other pool (also sketched
below). The MSCP packets are queued directly to the start-io entry of a
disk driver and do not go through its FDT code; here again, "standard" FDT
processing is assumed.

I am putting this material on the S92 VAX tapes. It can be posted if there
is enough demand.

Another way to get the driver to work with less work for the user, at the
cost of more pool, is to make the internal buffer size 65024 decimal (FE00
hex) in FDdriver. Just set the assembly parameter FQ_BUFSIZ to 65024 and
the assumptions that MSCP makes will be correct. You must similarly make
the buffers in fdhostremot.mar and fdremsrv.for match this size, and set
your RMS network buffer count to 128 to cover it so the network
data-transfer QIOs will work (see the last sketch below). A caution:
transfers this large over slow links can cause DECnet timeouts. I had that
problem once doing BACKUP over a 9600-baud asynch DECnet circuit; block
sizes over 4000 or so would often cause the link to drop, just from
timeouts on the circuit from the VS2000 to the VAX 785 I was using at the
time (going into a DZ11!!). Simply switching to an Ethernet circuit
deep-sixed those problems.
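For the curious, the heart of the /CLEAR fix is nothing more than stuffing
the right number into the client UCB. What follows is a minimal sketch,
not the actual fdhostcry5.mar code: it assumes the caller has already
located the client UCB address (say, from the CCB of a channel assigned to
the device) and enters kernel mode via $CMKRNL with that address as the
routine's one argument; fq_bufsiz stands in for whatever FQ_BUFSIZ the
driver was built with.

        .title  fdmaxfix  -  sketch of the FDHOST/CLEAR UCB patch
        .library /sys$library:lib.mlb/
        $ucbdef                         ; UCB field offsets
        $ssdef                          ; SS$_ status codes

fq_bufsiz = 8192                        ; must match FQ_BUFSIZ in FDdriver

; Called in kernel mode via $CMKRNL; 4(ap) = client UCB address (by value).
; A longword store is atomic on the VAX, so nothing fancier is attempted
; here; a more careful version would hold the appropriate device lock.
        .entry  set_maxbcnt, ^m<r5>
        movl    4(ap), r5               ; r5 = client UCB address
        movl    #fq_bufsiz, ucb$l_maxbcnt(r5) ; clamp max transfer length
        movzwl  #ss$_normal, r0         ; report success
        ret
        .end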
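The driver-side sanity check is equally small. Roughly (again a sketch of
the idea, not the exact FDdriver source), at the start-io entry, where by
the usual driver conventions R3 holds the IRP address and R5 the UCB
address:

; Check the byte count before touching the internal buffer.  MSCP-served
; requests arrive here directly, bypassing the FDT routines, so the usual
; QIO size checks have not been applied to them.
        cmpl    irp$l_bcnt(r3), #fq_bufsiz ; longer than our buffer?
        blequ   10$                     ; no - size is safe, go do it
        movzwl  #ss$_ivbuflen, r0       ; yes - "invalid buffer length",
        clrl    r1                      ;  no bytes transferred
        reqcom                          ; complete the I/O in error
10$:                                    ; ...normal processing continues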
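If you take the big-buffer route instead, the arithmetic on the RMS side
goes like this: FE00 hex is 65024 decimal, and 65024 / 512 = 127, so a
maximal transfer spans 127 network blocks; a network block count of 128
covers it with one to spare (SET RMS_DEFAULT/NETWORK_BLOCK_COUNT=128). In
the driver it is just:

fq_bufsiz = 65024                       ; = ^XFE00, the MSCP client default
; 65024 bytes / 512 bytes per block = 127 blocks per maximal transfer,
; so an RMS network block count of 128 covers it with one block spare.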
I regret any inconvenience, but in my old job I didn't have access to a
cluster to try these things out on. Now, I do. If any DEC person might be
willing to reveal any magic about how one can get ucb$l_maxbcnt set
correctly, or other MSCP tricks, I'd be most grateful. In the meanwhile my
nasty and brutish hack will do the job.

Glenn Everhart
Everhart@Raxco.com