From: MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX"  8-JUL-1992 16:29:58.10
To: "info-vax@crvax.sri.com"
CC:
Subj: FD: drives cluster-wide

A while ago I tried to run FDdriver on a cluster and encountered (and had
to fix) a problem with invalid buffer lengths.

As background: FDdriver has an internal buffer which mediates between the
user process and the host process that is doing the work. To conserve pool,
I have generally left this buffer sized at 4K or 8K (depending on how I
felt that day) and set ucb$l_maxbcnt to match. On a single machine this
works fine, and the QIO service ensures that transfers to the driver are
limited to this size. (Longer ones get split, which is no problem for the
logic, so user code never sees it.) The unit initialization code in
FDdriver sets this up properly, and as long as the driver buffer size and
the buffer sizes in the host program and the network server all match, it
works remotely just fine.

On a cluster with the MSCP serve-all SYSGEN parameter set, as soon as
FDdriver is loaded and units are connected via SYSGEN CONNECT, MSCP sets up
client UCBs on all other cluster nodes so it can do transfers to them, even
though those units are still marked offline and invalid. However, rather
than take the UCB parameters from the just-loaded FDdriver (which are
already set correctly on the machine where the driver is loaded), MSCP
assumes they are all "standard", so the ucb$l_maxbcnt field on the client
UCBs is set to (if memory serves) FE00 hex. For a VD: type disk this is no
problem. For an FD: type disk it is, and the problem was compounded by the
fact that I had no logic in FDdriver to check for it.

Having no simple way to modify the behavior of SCS in this regard, I
modified the FDHOST/CLEAR option in fdhostcry5.mar to set the
ucb$l_maxbcnt field correctly (a sketch of the idea appears below). This
let me set up remote FD: units as MSCP-served units and have everything
work: issue the fdhostcry5/clear FDAn: command first, then mount the FD:
device, or in any case run FDHOST/CLEAR before anyone tries to use the
device. This has to be done on every node that wants to use the device.

I also added code so that FDdriver recognizes when MSCP has sent it a
too-long packet and rejects the I/O if that happens, rather than possibly
corrupting data inside the driver and perhaps in other pool (also sketched
below). The MSCP packets are queued directly to the start-io entry of a
disk driver and do not go through its FDT code; here again, "standard" FDT
processing is assumed.

I am putting this material on the S92 VAX tapes. It can be posted if there
is enough demand.

Another way to get the driver to work with less work for the user, at the
cost of more pool, is to make the internal buffer size 65024 decimal (FE00
hex) in FDdriver. Just set the assembly parameter FQ_BUFSIZ to 65024 and
the assumptions that MSCP makes will be correct. You must similarly make
the buffers in fdhostremot.mar and fdremsrv.for match this size, and set
your RMS network buffer count to 128 to cover it so the network
data-transfer QIOs will work (see the last sketch below). A caution:
transfers this large over slow links can cause DECnet timeouts. I had that
problem once doing BACKUP over a 9600-baud asynch DECnet circuit; block
sizes over 4000 or so would often cause the link to drop, just from
timeouts on the circuit from the VS2000 to the VAX 785 I was using at the
time (going into a DZ11!!). Simply switching to an Ethernet circuit
deep-sixed those problems.
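For the curious, the heart of the /CLEAR fix is nothing more than stuffing
the right number into the client UCB. What follows is a minimal sketch,
not the actual fdhostcry5.mar code: it assumes the caller has already
located the client UCB address (say, from the CCB of a channel assigned to
the device) and enters kernel mode via $CMKRNL with that address as the
routine's one argument; fq_bufsiz stands in for whatever FQ_BUFSIZ the
driver was built with.

        .title  fdmaxfix  -  sketch of the FDHOST/CLEAR UCB patch
        .library /sys$library:lib.mlb/
        $ucbdef                         ; UCB field offsets
        $ssdef                          ; SS$_ status codes

fq_bufsiz = 8192                        ; must match FQ_BUFSIZ in FDdriver

; Called in kernel mode via $CMKRNL; 4(ap) = client UCB address (by value).
; A longword store is atomic on the VAX, so nothing fancier is attempted
; here; a more careful version would hold the appropriate device lock.
        .entry  set_maxbcnt, ^m<r5>
        movl    4(ap), r5               ; r5 = client UCB address
        movl    #fq_bufsiz, ucb$l_maxbcnt(r5) ; clamp max transfer length
        movzwl  #ss$_normal, r0         ; report success
        ret
        .end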
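The driver-side sanity check is equally small. Roughly (again a sketch of
the idea, not the exact FDdriver source), at the start-io entry, where by
the usual driver conventions R3 holds the IRP address and R5 the UCB
address:

; Check the byte count before touching the internal buffer.  MSCP-served
; requests arrive here directly, bypassing the FDT routines, so the usual
; QIO size checks have not been applied to them.
        cmpl    irp$l_bcnt(r3), #fq_bufsiz ; longer than our buffer?
        blequ   10$                     ; no - size is safe, go do it
        movzwl  #ss$_ivbuflen, r0       ; yes - "invalid buffer length",
        clrl    r1                      ;  no bytes transferred
        reqcom                          ; complete the I/O in error
10$:                                    ; ...normal processing continues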
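If you take the big-buffer route instead, the arithmetic on the RMS side
goes like this: FE00 hex is 65024 decimal, and 65024 / 512 = 127, so a
maximal transfer spans 127 network blocks; a network block count of 128
covers it with one to spare (SET RMS_DEFAULT/NETWORK_BLOCK_COUNT=128). In
the driver it is just:

fq_bufsiz = 65024                       ; = ^XFE00, the MSCP client default
; 65024 bytes / 512 bytes per block = 127 blocks per maximal transfer,
; so an RMS network block count of 128 covers it with one block spare.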
I regret any inconvenience, but in my old job I didn't have access to a
cluster to try these things out on. Now, I do. If any DEC person might be
willing to reveal any magic about how one can get ucb$l_maxbcnt set
correctly, or other MSCP tricks, I'd be most grateful. In the meanwhile my
nasty and brutish hack will do the job.

Glenn Everhart
Everhart@Raxco.com