HSZ40 Switching - Design Spec V2 Draft, 1-Jul-1996
Glenn C. Everhart
--------------------------------------------------------------------

Problem Statement:

The HSZ40 series with the next release of HSOF will offer a dual bus failover capability. This is characterized by some new INQUIRY information so that a host can be informed that the failover is possible, and by new logic to provide a "preferred" initial path. (A single SCSI bus failover is also offered, but that requires no software changes.)

When the devices come up under current autoconfiguration, it is to be expected that each device will appear twice: once via its path over the first SCSI bus from the HSZ40 to the host, and once via the other. Notwithstanding this, the devices are not duplicated, and because they have two aliases, the file system can readily corrupt file structures located on these devices. Some means of controlling access is needed so that a single path is used at any given moment and normal VMS operations do not notice the dual path, while still allowing access to the devices via the second SCSI bus in the event the first fails. Allowing accesses to be shared over the busses initially is highly desirable as well, and is supported to a degree by the HSZ firmware. (This is done by allowing a preference to be stated for each device, so that some devices can be set to be "preferred" over each bus.) This failover must be available for disks. It should be available for other devices also.

Background:

Some HSZ devices have multiple SCSI bus connections, and the issue of failover between them has arisen. These connections can be connected either to the same SCSI bus (providing dual paths to that bus so that the failure of either controller does not prevent access to devices connected to the HSZ) or to different SCSI busses. If both SCSI controllers on the HSZ are connected to the same SCSI bus, the HSZ will be able to handle failover within itself so that a host on the bus will not notice any change. However, when each controller is connected to a different SCSI bus, the host must be involved.

In this case, an HSZ might be on two ports on a system, with two SCSI controllers, and all LUNs attached to the HSZ will therefore show up twice; a disk might show up as DKB300: and as DKD300:, for example, if the HSZ were connected to the second and fourth SCSI adapters on the machine. At the HSZ itself, it is possible to set a preferred path to the device, and it will appear unready on the other path, but both could be configured and would refer to the same device. Having dual names for the same storage violates the VMS cluster naming scheme and can result in disk corruption, so this situation by itself is not satisfactory.

Fortunately the HSZ itself provides certain bits of information which an operating system can use to figure out which devices are which. First, when in this dual-bus configuration, an HSZ will return some extra data in INQUIRY responses. This data includes:

* The serial number of this controller
* The serial number of the alternate controller
* A bitmask of LUNs which are preferred for this controller.

Therefore one can determine, from the INQUIRY data, whether the device is an HSZ, what this and the "other" controller is, and whether this particular device is preferred on "this" controller. (The bitmask changes to reflect the actual situation, so that if one controller fails, all LUNs are marked as preferred on the other.)
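As an illustration only (not the actual HSOF data layout), the extra INQUIRY data might be pictured as a structure of roughly the following shape. The field names, widths, and offsets here are assumptions made for discussion:

#include <stdint.h>

#define HSZ_SERIAL_LEN 12           /* assumed length of a controller serial */

struct hsz_dual_bus_inquiry {
    char     this_ctrl_serial[HSZ_SERIAL_LEN];   /* controller answering the INQUIRY */
    char     other_ctrl_serial[HSZ_SERIAL_LEN];  /* alternate controller (nulls if single bus) */
    uint32_t preferred_lun_mask;                 /* bit n set => LUN n preferred on this path */
};

/* A path is the preferred one for a LUN if the LUN's bit is set in the mask
 * and the alternate serial is non-null (null serials mean single-bus use). */
static int path_is_preferred(const struct hsz_dual_bus_inquiry *inq, unsigned lun)
{
    if (lun >= 32)
        return 0;                   /* mask shown here covers LUNs 0-31 only */
    if (inq->other_ctrl_serial[0] == '\0')
        return 0;                   /* not a dual-bus configuration */
    return (inq->preferred_lun_mask & (1u << lun)) != 0;
}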
This extra information is present only in the dual bus case (the serial numbers being nulled otherwise). This permits a driver to determine, when configuring a device, whether this particular path to the device is the preferred one or an alternate non-preferred one. Moreover, the controller serial numbers are unique and visible to all nodes on a cluster, so that if a device name is chosen based on them, it will automatically be the same for all cluster nodes.

In addition, the HSZ firmware is being given the ability to notify drivers when a controller fails. This presumes that some devices are active on each controller, and works by having the HSZ detect the controller failure. If this happens, the next I/O to the good controller will receive a CHECK CONDITION status (unit attention). The sense data then uses some vendor unique sense codes for failover (and eventually failback) events and returns the good controller serial number, the failed controller serial number, the failed controller target number, and a bitmask of LUNs moved. In addition, when this happens, the surviving controller kills (resets) the other controller to keep it from trying to continue operation. This information can permit the processor to be notified of a path failure without necessarily having to incur timeout and mount verify delays.

On VMS, however, a SCSI adapter on a failed path may have I/O in various states within its control, and if this is the case, some method of extracting it is needed. The usual path for this function is for timeouts to occur and force I/O requeue and mount verify. Where I/O is in progress to a device, there is no convenient external handle available to extract it (and the notion that as a side effect of a successful I/O on, say, MKB200:, we might stop and redirect all I/O active on DKD400: seems likely to be far more complex and error prone than can be tolerated, if it can be done at all on all adapters). Therefore this information is likely to be most useful where the failed path devices are in fact idle. Where I/O is in progress at some stage within a SCSI adapter, it will have to be timed out or otherwise cleared from the adapter before a path switchover can take place. (This also means that in the event a transient failure occurs, nothing will be left "in the pipeline" to a device at switch time.) Actual HSZ switchover is done by a SCSI START command (which is issued as part of the IO$_PACKACK operation in VMS) so that host software has some control.

There is a proposal to the SCSI-3 committee which details a more general configuration, in which some number of devices are controlled by a set of controllers, where a device may be accessible from one or more of the controllers at a time. It is anticipated that LUN ownership might have to be established in this case via reserve/release to set initial path preference (if only one path at a time may be used). This proposal defines some SCSI commands which may be sent to a storage control device to report which controllers and devices are associated and to set up access. Since these devices will have their own LUNs and device types (apart from the disks, tapes, etc. behind them), it is apparent that an IO$_PACKACK to a disk would have to have been preceded by some FC initialization commands. The unit init code of a new class driver may be the most logical place for such commands. Failover or failback is to be reported by ASC/ASCQ event codes, the same as for the HSZ.
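For illustration only, a class driver might classify the failover/failback unit attention along the following lines. The ASC/ASCQ values and the sense-data layout shown are placeholders, not the real vendor unique codes defined by HSOF:

#include <stdint.h>

#define ASC_HSZ_EVENT     0x90   /* placeholder vendor-unique ASC              */
#define ASCQ_HSZ_FAILOVER 0x01   /* placeholder ASCQ: other controller failed  */
#define ASCQ_HSZ_FAILBACK 0x02   /* placeholder ASCQ: failback has occurred    */

struct hsz_failover_sense {
    uint8_t  asc, ascq;
    char     good_ctrl_serial[12];   /* surviving controller                 */
    char     failed_ctrl_serial[12]; /* controller that was reset            */
    uint8_t  failed_target;          /* SCSI target of the failed controller */
    uint32_t moved_lun_mask;         /* LUNs moved to the surviving side     */
};

/* Returns 1 for a failover event, 2 for failback, 0 if unrelated sense data. */
static int classify_hsz_event(const struct hsz_failover_sense *s)
{
    if (s->asc != ASC_HSZ_EVENT)
        return 0;
    return (s->ascq == ASCQ_HSZ_FAILBACK) ? 2 : 1;
}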
While this proposal is not yet definite, this specification does attempt to be generally compatible with it. (A server, for a specific case, can communicate with a control device if need be when a failover is signalled.)

Goals:

* Support HSZ failover for HSZ7x type controllers where two SCSI busses are connected to a single machine.
* Leave open expansion possibilities.
* Be compatible with the planned HSG failover mechanism (which is generally similar to the HSZ one, with some differences due to the changes between SCSI-2 and SCSI-3).
* If possible, facilitate failover between direct SCSI connections and MSCP or other server connections. (That is, a design that may help with MSCP failover should be preferred over one that cannot.)

Non-Goals:

* Support more than 2 busses.
* Support the case where both HSZ controllers are on a single bus (this is supported within the HSZ).
* Solve the device naming problem generally.
* Dynamic routing or load balancing between paths to a device in full detail.
* Describe details of compatibility with the HSG proposed failover scheme.

Discussion of goals:

Much more complex situations may arise in the future, where devices are reachable via any of several paths. Controllers are under discussion which have 16 bus interconnects available to different computers, which will need to do load balancing, and which will need to have devices handled in such a way that confusion does not result from multiple names. The approach discussed here does not attempt to deal with this complexity yet, but to find a way to deal with the part of the failover problem defined by the HSZ firmware (HSOF 3.0 and later) which requires host CPU cooperation. It does attempt not to constrain its implementation too much, so that extension of the switching to more than two busses, routing of I/O dynamically between several paths, and failover between paths regardless of their method of connection can be contemplated as extensions to it rather than total reworks. All these are possible, but all will require additional design effort, which is not covered directly here. The techniques here appear to be usefully extensible in the directions mentioned, but the full set of issues around these other, related problems has not yet been addressed.

The configurations being addressed are therefore limited at this time to the dual-bus HSZ cases. The more general case of many paths via many controller types with possible load balancing is not addressed here, save in part, and with important questions about how to generalize synchronization boundary conditions not dealt with in their full generality. That general discussion is beyond the scope of this design. The design proposed here is also a VMS variant of the kind of driver interface called "streams" in the Unix world. This is an interesting sidelight which may be suggestive, but going beyond this sidebar comment is also beyond the scope of this design. This document should be considered the design spec for HSZ failover primarily, though critiques of the design where it may be over-specialized in ways which will make it harder to solve follow-on problems might be appropriate.

Approach:

It was initially considered that some form of altering SCSI connections' purely SCSI-structure-based "routing" to devices might be feasible for the switching needed here.
However, in principle two SCSI busses can be controlled by entirely different SCSI port drivers, so that an attempt to alter connections on the fly at port level could involve considerable complexity in ensuring that port driver specific structures were initialized as the port drivers expect. (These initializations are not all alike.) Also, a "port level" approach does not deal with the appearance of multiple class driver units after autoconfiguration. Idle drivers might be revectored, but any links between SCSI structures and class drivers would need to be traced and reset, any future asynchronous events would need to be blocked from access to the structures during this time, and any port driver specificities in the SCDT in particular would be a problem. Since the failover scheme used in DUdriver is basically near the top of the I/O chain in the class driver, this seemed a more promising direction to go, and had the extra advantage that it might facilitate failover between DK and DU.

Therefore a simpler approach has been investigated. This approach involves small modifications to DKdriver (and possibly, but not necessarily, other drivers) to recognize HSZ units which are non-preferred path aliases of other devices and to mark them so that the MSCP server and normal VMS mounting services do not attempt to access them. This will ensure that for each device, one and only one mountable, serveable class level device appears. The alternate path will however still be autoconfigured, so that the SCSI connections will be created and initialized as at present by the class drivers. The alternate path will however have its data structures set so that they are effectively invisible to normal VMS users. This will mean that the device will exist, but will not be found by VMS search routines as a device available for channel assignment.

Then at some point moderately early in system startup, but after autoconfigure, a switching driver will be inserted. This driver will implement the failover policy by gaining control at the class driver start_io entry point for the preferred path device, and doing monitoring or switching. Sufficient units of this virtual driver will be connected to handle all pairs of disks present, and a server will be started which will scan the device configuration (from the SCSI data base which by then will have been set up) and connect the pairs of disks appropriately, and also remain active awaiting notification of failures so that it can direct the failover of idle devices to a remaining good path. Only one server is required for any number of such devices.

(Insertion of the switching component earlier in the boot sequence is possible, and may be desired at some point, but it must occur at least after all local disks are configured. This may not be difficult with the new file oriented configurator, but remains to be investigated. The basic feasibility of the approach appears adequate even if startup is deferred to one of the startup scripts, though earlier connection may make it unnecessary to use a sysgen parameter, at the expense of some early boot code to effectively rename a device.)

(The switching driver will also intercept all other relevant entries pointed to by the DDTAB tables of drivers, to ensure that where the device is being accessed, the accesses are properly routed to the "live" device.
Entries relevant are altstart, mount verify, pending I/O, auxiliary routines, and cancel I/O, from current examination; register dump appears not to need to be switched because of its calling usage. The pending I/O entry will be used primarily to ensure that I/O is seen even if a driver directly pulls requests off its queue.)

The intercept driver's monitoring function will monitor I/O requests coming to the device so that when an IO$_PACKACK coming from mount verify is seen for the Nth time (initially, the third), this will be taken to mean that I/O via the currently active path is infeasible, and that it is time to try switching. When this happens, I/O packets will have their paths switched. The driver will either be set to requeue IRPs received to the alternate path driver (and gain control at I/O posting time to complete the I/O in the original device's context), or to stop doing this and allow IRPs to continue to the original start_io entry of the initially preferred path's driver. Also, some "special" I/O status returns will be monitored (implemented as alternate success statuses in the current thinking) so that a server can be notified if an I/O returns from one controller and indicates the HSZ has found that its other controller has failed. The switching driver can switch paths on command as well, provided that there is no I/O active on a device being switched. I/O is defined as active if an IRP has been seen at the driver's start-io entry point and has not yet been seen at I/O postprocessing.

IMPORTANT: What VMS needs for valid file structures is that the device name as seen by the rest of the system be uniform. Once the switching component is present, this name can be that of either path, regardless of which device is actually preferred. The intent is not to force a preferred allocation of HSZ slots, but to set names uniformly, permitting the HSZ console choice of actual path preference to be honored. The switching takes place "under" the chosen device name, with the initial state of the switch being set so that the preferred device is used initially. If the switching software is to be loaded early in the boot path, some cooperation with DKdriver to honor the HSZ preference (or later, generic SCSI-3 preferences) will be needed. This is not expected to be a large amount of code.

DESIGN:

There are two new components, SWDRIVER and SWCTL, and some modifications to DKDRIVER, used to produce the failover. (Similar changes can be made to other class drivers in a second pass; the switching software is largely independent of device class and can readily have those limitations removed for devices which cannot support mount verify.)

DKDRIVER CHANGES:

DKdriver is to be modified so that in unit init, when it looks at INQUIRY data from the HSZ, it determines whether this device is on a "non-preferred" path (this being returned by the HSZ INQUIRY data). If so, it sets the DEV$V_NOCLU bit in its DEVCHAR2 field so that the MSCP server will not initially serve the non-preferred path. The preferred path for naming will be chosen as that with the lower controller serial number where possible (or the higher, depending on a parameter which must be set the same clusterwide). Thus, all nodes will see the same path, and it will be possible to boot the cluster even if one path's controller is down. (The problem of a shared SCSI bus being sampled by one node with path A down and soon after by another node with path A back up again is otherwise rather intractable.)
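A minimal sketch of this unit-init decision follows. DEV$V_NOCLU/DEVCHAR2 and the clusterwide parameter are the real notions described above; the helper names, the UCB stand-in, and the bit value are invented for illustration:

#include <string.h>

#define DEV_M_NOCLU 0x8000          /* placeholder for the real DEV$M_NOCLU mask */

struct dk_unit {                    /* stand-in for the relevant UCB fields */
    unsigned int devchar2;
};

/* Pick the naming path by comparing controller serial strings.  Whether the
 * lower or the higher serial wins is the boolean parameter which must be set
 * identically clusterwide. */
static int this_is_naming_path(const char *this_serial, const char *other_serial,
                               int prefer_lower)
{
    int cmp = strcmp(this_serial, other_serial);
    return prefer_lower ? (cmp < 0) : (cmp > 0);
}

/* Called from unit init after the HSZ INQUIRY data has been read.  A null
 * "other" serial means single-bus operation, and nothing special is done. */
void dk_mark_alias_path(struct dk_unit *ucb, const char *this_serial,
                        const char *other_serial, int prefer_lower)
{
    if (other_serial[0] == '\0')
        return;
    if (!this_is_naming_path(this_serial, other_serial, prefer_lower))
        ucb->devchar2 |= DEV_M_NOCLU;   /* hide the alias from the MSCP server */
}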
In this way, boot time consistency can be assured in naming and access. In the HSZ failover case, the device will come up with two aliases, and will return to each DKdriver unit the controller serial numbers of the "current" and "other" path controllers. Thus a given device might be visible as, say, DKA300 and DKB300. Where the "A" controller happens to be the preferred path, DKA300 will be identified as a disk and come up normally, while this code will cause DKB300 to be reset so as not to be visible to other nodes or to users.

The HSZ will experience timeouts when a bus fails, which will produce mount verify conditions. In addition, should the HSZ detect a controller failure, it will allow failover to take place and will signal this by generating CHECK CONDITION on the next I/O to the "good" side controller. The CHECK CONDITION operations within DKdriver that handle UNIT ATTENTION will in fact return success with the current DKdriver. To preserve the indication that the devices are operating correctly, yet allow the switching server to obtain the signal, DKdriver will, in this situation, return alternate success reports which set the 16384 bit of the I/O status word (unused by DKdriver in any other context), and also the 8192 bit if this is a failback. These returns will be sent to the DKdriver caller. However, it is expected that the switching driver SWdriver will act upon them. The I/O status will "really" always be SS$_NORMAL in this case, and DKdriver will check the sense data flags to ensure that the (Digital vendor unique) codes are present before setting these flag bits in the return code. DKdriver will NOT, however, perform any switching operations on its own. This means that minimal DKdriver modification is made here, but the vital information needed is present and is passed on by DKdriver to the layers of the failover system above it. Where these alternate success statuses are seen by the switching driver, it will remove them prior to really completing the I/O, thus hiding any unusual behavior from applications or other VMS layers.

SWDRIVER

SWdriver stands for "SWitching Driver" and is (currently) a two way toggle switch sending I/O either to one disk or another, assuming the disks used are in fact the same device accessed over different paths. (Extending the driver to be an N-way switch should be straightforward, treating paths 3 through N the same as path 2, but is not needed for any currently known problem. Future systems may however require this.)

If Bus B fails and some operation is completed on Bus A (these being the two busses on the HSZ40), the HSZ will generate CHECK CONDITION responses which DKdriver and other drivers need to be able to turn into statuses the switch can recognize. The CHECK CONDITION data will indicate that Bus B has failed, not that anything is wrong with the current device on Bus A. To perform failover promptly when this happens, it will be necessary to have some server aware of the whole HSZ configuration and able to command switchover promptly. Accordingly, the switch driver is programmed to send a signal to a server when it recognizes such a condition, so that the server can command switchover to the remaining path. This server can have the necessary global configuration information so that all devices can be switched to the good path. (The server will also send an IO$_PACKACK to get the device to come online at that time, before anything else is queued there.)
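The alternate success statuses described above might be encoded and stripped roughly as follows. This is a sketch under stated assumptions: the bit values 16384 and 8192 come from the text, while the symbol names and the helper functions are invented for illustration (SS$_NORMAL is the normal VMS success status, value 1):

#define SS_NORMAL          1        /* SS$_NORMAL                               */
#define DK_M_HSZ_EVENT     0x4000   /* 16384: HSZ reported other-controller event */
#define DK_M_HSZ_FAILBACK  0x2000   /* 8192: additionally set when it is a failback */

/* DKdriver side: the status stays a success, with the flag bits ORed in only
 * when the Digital vendor unique sense codes were actually present. */
static unsigned short dk_encode_status(int hsz_event, int is_failback)
{
    unsigned short st = SS_NORMAL;
    if (hsz_event) {
        st |= DK_M_HSZ_EVENT;
        if (is_failback)
            st |= DK_M_HSZ_FAILBACK;
    }
    return st;
}

/* SWdriver side: the flag bits are removed before final completion, so layers
 * above the switch only ever see SS$_NORMAL. */
static unsigned short sw_strip_status(unsigned short st)
{
    return st & (unsigned short)~(DK_M_HSZ_EVENT | DK_M_HSZ_FAILBACK);
}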
Also, some code will be added to DKdriver to ensure the controller serial numbers are made available to the server, so that it can find the pairs of controllers automatically rather than needing to have the pairing supplied by a customer. Periodic polling of devices will also be added to the server component, so that an operator can be notified of device failover. (There is a special I/O path in the switching driver allowing the server to contact all actually-known channels in spite of the otherwise opaque overloading of the chosen device name.) The server will initially determine device pairs by issuing INQUIRY packets using IO$_DIAGNOSE, so that DKdriver need not store information about controller IDs (an illustrative sketch of this pairing appears below). It will ensure that the UCB$V_NOASSIGN flag is set in UCB$L_STS of non-preferred paths to help make them invisible, and will make such other modifications as are needed to ensure that the scan_device routines in the VMS exec cannot see the extra paths either. These must scale so that multiple extra paths can be managed.

Operationally, then, autoconfig does not change. Since DKDRIVER will be altered to ensure that no disks are served via multiple paths, the switch logic can be loaded during normal startup commands and need not run very early in the boot path. Tapes and generic devices for the most part are not made visible as early, and it is possible that resetting the alternate units' characteristics for those device types can be done by the switching software itself, after autoconfiguration has run. If this causes problems, the tape driver will need to be edited also to prevent too-early detection of tape alternate paths.

Loading the switching code after full VMS is up simplifies it greatly, at the cost of failover not functioning until this code is loaded. Normal disk operation would be unchanged by the switch (the actual intercept is synchronized at fork level, which is necessary for any access to the intercepted path), but an HSZ controller failure would not be recovered if it occurred within the first few seconds (up to a few minutes) of system operation. However, once the software is loaded, a switchover could be accomplished, presuming the failed devices were in mount verify state and had not timed out during the interval. Thus even in the case of a very early controller failure, a remedy could be applied partially "ex post facto". (The SWdriver code would simply have to start counting mount verify PACKACKs after they had already been underway for a while.) Only a system disk failure early on would not be covered in this way, since the recovery code would not load; this can be considered much the same as a failure during early booting, and a reboot would use the other controller and succeed.

In only one case does something unusual need to be done: when the boot disk is on the higher numbered controller. In this case, setting a boolean sysgen parameter will allow booting off the higher serial number controller by making it preferred. While this effectively changes the device physical names, a configuration file option will allow them to be effectively reset for all but the system disk. It is hoped that this will be a rare circumstance. The system will then, when running, see one device name per device, and the path switching will take place below the start_io level in a way invisible to anything in VMS above driver level.
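Returning to the pair determination mentioned above, the server can recognize two units as paths to the same HSZ LUN when each path's "other" controller serial is the other path's "this" controller serial (and the unit numbers match). A sketch, with illustrative structure and function names only:

#include <string.h>

struct path_info {
    char devname[16];              /* e.g. "DKB300"; unit numbers must also match */
    char this_serial[12];          /* from INQUIRY on this path                   */
    char other_serial[12];         /* alternate controller reported on this path  */
};

/* Two paths describe the same HSZ device if their controller serials are
 * mirror images of one another. */
static int paths_are_pair(const struct path_info *a, const struct path_info *b)
{
    return strcmp(a->this_serial,  b->other_serial) == 0 &&
           strcmp(a->other_serial, b->this_serial)  == 0;
}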
By simply requeueing the IRP, high performance can be achieved, and only minimal changes to driver operation (mainly to handle the new information in the INQUIRY data and the extra CHECK CONDITION flags) are needed, none of them of major import. The functionality here is completely orthogonal to the device naming scheme in use; in practice it does not matter what the device naming scheme is so long as IOC$SEARCHDEV can still find both devices. It is further expected that the QIO server will eventually perform operations somewhat akin to this. By functioning in this way, the system avoids adding greatly to the complexity of DKdriver (et alia) and can be extended to handle other failover situations rather simply, though in such cases the custom signals from the HSZ will not be used, or will be used only in limited ways.

It should be added that for SCSI drivers, the mere startup of mount verify does not in itself mean that bus failover is appropriate, since SCSI RESET can be a normal part of system function. This is why the switch is not set to switch paths at the first pack-ack (or indeed at the start of the mount verify condition). This is also the reason why the switch does not simply intercept the start-mount-verify driver entry. In fact, the IO$_PACKACK will generate a SCSI START command on the new path, which the HSZ40 needs in order to switch its internal indicators. This situation is different from that obtaining for DUdriver, where mount verify generally does mean a path failure may have occurred.

SWDRIVER INTERNALS

SWdriver is an intercept driver which intercepts disk start-io entries. This is done by code which creates a copy of the DDT table, located in the intercept driver's UCB, and points the intercepted driver's UCB$L_DDT vector at it. This permits a per-drive intercept, and is done in such a way that the vector can be intercepted by other similar intercepts totally reversibly, and in any order, so long as they follow the connection logic (which has been published). (Because the intercepted DDT is located within the intercept driver UCB, the intercept code can locate the intercept driver UCB using this DDT. Some additional code exists to allow the code to be sure it has the data for its own intercept, not another on a possible chain of them.)

When the intercept is present, start-io for the "primary" path disk now points at the intercept address within a unit of SWdriver, which also knows the UCB addresses of the "primary" and "secondary" path devices. An IRP entering here is first examined to see if it is a mount verify pack-ack IRP (and counted; if 3 of these are seen in a row, SWdriver switches to the "secondary" path). By using mount verification in this way, SWdriver assures that I/O through the failed path has been idled. (The mount verify driver entries are NOT used because for SCSI a mount verify condition does not necessarily mean a bad path.) SWdriver also counts outstanding I/O and arranges to gain control at I/O post time (so it can count down the I/O and post it). This is done by saving IRP$L_PID and replacing it with an address within SWdriver which will count the I/O down and, after replacing the modified fields, perform a real I/O completion on the IRP.

Now if the I/O request is being routed to the primary path, SWdriver just calls the primary path start-io entry and returns. Since it is entered as part of the primary driver, it has all needed locks. If on the other hand the path routed to is the secondary, SWdriver calls INSIOQC instead, redirecting the IRP to the secondary device.
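The routing decision made at the intercepted start-io entry might be sketched as below. The real driver manipulates IRP$L_PID/IRP$L_UCB and calls the primary start-io entry or INSIOQC; here those actions are reduced to a returned decision so the logic can be shown self-contained, and the names, the threshold handling, and the idle-only check are assumptions for illustration:

#define MV_PACKACK_THRESHOLD 3      /* Nth mount-verify PACKACK triggers a switch */

enum sw_path   { SW_PRIMARY = 1, SW_SECONDARY = 2 };
enum sw_action { SW_SEND_PRIMARY, SW_REQUEUE_SECONDARY };

struct sw_unit {
    enum sw_path path;              /* currently selected path                 */
    int mv_packack_count;           /* consecutive mount-verify PACKACKs seen  */
    int active_io;                  /* IRPs seen at start-io, not yet posted   */
};

/* Called for each IRP arriving at the intercepted start-io entry. */
enum sw_action sw_route_irp(struct sw_unit *sw, int is_mv_packack)
{
    if (is_mv_packack) {
        /* Repeated mount-verify PACKACKs mean the active path is infeasible;
         * switch only once nothing is outstanding on the old path. */
        if (++sw->mv_packack_count >= MV_PACKACK_THRESHOLD && sw->active_io == 0)
            sw->path = SW_SECONDARY;
    } else {
        sw->mv_packack_count = 0;   /* ordinary traffic resets the count */
    }

    sw->active_io++;                /* counted back down at I/O post time */
    return (sw->path == SW_PRIMARY) ? SW_SEND_PRIMARY : SW_REQUEUE_SECONDARY;
}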
The primary device is unbusied in this case also, since SWdriver is acting in lieu of the primary device, which will not in fact get any I/O when it is routed this way. IRP$L_UCB is pointed at the secondary device during this operation, to be replaced with its original value when the I/O is posted. In all cases, when the I/O completes (and without a detour through IPL 4 if assembled that way), SWdriver regains control. At this point it decrements the outstanding I/O count, replaces the few IRP fields it needed to regain control, and completes the I/O (via a call to COM$POST, since it has no right to alter the underlying driver's busy or unbusy state). If on the secondary path, SWdriver checks the I/O to ensure that mount verification is begun on it also, as this would not otherwise be done. The I/O checking, mount verify processing, and postprocessing are all done in the context of the primary path, so that the primary path remains mounted and apparently active, though the secondary path may in fact be the one in use.

To save volatile parameters from an IRP during the switching, SWdriver currently overwrites the IRP argument areas (which are used prior to start_io but are not used after that point) to hold a number of IRP fields which are being reused to route the packet. The usage is as follows:

    Field              Saves contents of:
    IRP$Q_QIO_P1+4     IRP$L_STS    (if fast finish shortcut only)
    IRP$Q_QIO_P2       IRP$L_MEDIA  (block number)
    IRP$Q_QIO_P2+4     IRP$L_PID    (PID, used to capture post processing)
    IRP$Q_QIO_P2+8     IRP$L_UCB

While it is of course possible to allocate another structure to hold this information, these IRP fields are used by no other driver code, since they are present only to make the $QIO arguments available to FDT code, which completes before start-io code can be run. It may be desirable to consider extending the IRP to supply dedicated fields for this functionality, or perhaps to reuse some of the fields which shadowing uses where the device is not shadowed, and otherwise use some separate structure. This approach does however provide very fast operation. The fields mentioned are saved and restored so that the IRP can be passed to another driver, yet have its I/O posted in the context of the correct driver. Saving IRP$L_MEDIA is necessary to ensure that IRPs which are re-inserted in device I/O queues at the start of mount verify have the correct block information. The UCB and PID fields must be altered to redirect the IRP to another driver and to regain control when the I/O is posted by that driver. The IRP$L_STS field must also be treated this way if a "shortcut" to avoid IPL 4 processing is used; this shortcut is present to minimize the extra overhead caused by this approach, using the fast path I/O processing to eliminate most of the completion overhead which would otherwise be seen due to the need for two request completion calls.

SWdriver also has an interface for program controlled path switching. This is built using the IO$_RETCENTER function code sent to SWdriver itself. (It is meant as a private interface.) This code passes a single parameter, 1 or 2, to indicate whether to take the primary or the secondary path. When this function is sent to SWdriver, it will switch to the selected path, provided that its count of active I/O (I/O seen at start-io and not yet seen at I/O post) is ZERO. When the HSZ sends notice that "the other controller has failed", the switch server sends a packack to the currently inactive path to flush out all I/O before switching in this way.
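From the server side, the private path-switch interface described above would amount to a $QIO of IO$_RETCENTER to the SW unit with P1 = 1 (primary) or 2 (secondary). A sketch of such a caller, with minimal error handling; the routine name and calling pattern are illustrative, though the system services and descriptor usage are standard VMS:

#include <string.h>
#include <descrip.h>
#include <iodef.h>
#include <starlet.h>

static int sw_select_path(const char *sw_device, int path /* 1 = primary, 2 = secondary */)
{
    unsigned short chan;
    unsigned short iosb[4];
    struct dsc$descriptor_s dev;
    int status;

    dev.dsc$w_length  = (unsigned short)strlen(sw_device);
    dev.dsc$b_dtype   = DSC$K_DTYPE_T;
    dev.dsc$b_class   = DSC$K_CLASS_S;
    dev.dsc$a_pointer = (char *)sw_device;

    status = sys$assign(&dev, &chan, 0, 0);
    if (!(status & 1))
        return status;

    /* SWdriver honors the request only when its active-I/O count is zero. */
    status = sys$qiow(0, chan, IO$_RETCENTER, iosb, 0, 0,
                      path, 0, 0, 0, 0, 0);
    if (status & 1)
        status = iosb[0];

    sys$dassgn(chan);
    return status;
}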
The secondary device exists independently and is just addressed directly. The primary device, recall, has its start-io entry stolen, so there is code in SWdriver which will notice an I/O with all I/O function modifiers set, strip them all, and send the I/O to the primary path, whether or not that path is currently connected for other purposes. The reason for this packack is to ensure that any "left over" activity on the path is flushed, and also to issue the necessary SCSI functions to activate the path. This will be required for HSZ40 and up, and is likely to be important for others.

To interact with the failover server, SWdriver sends messages to a mailbox allocated by the failover server, whose UCB address has been stored in part of the SWdriver UCB extension. Thus SWdriver can use CALL_WRTMAILBOX, a documented interface, to send messages to the server indicating that a mount-verify-initiated switchover has occurred, or that an I/O status with the 16384 bit set has been seen. These messages are simply sent, provided the server is present. The server is sent enough information to tell which devices are involved, and one server can handle any number of pairs of switched devices. It has the convention that SW units must be allocated and enabled starting with unit zero. (There is a UCB table in SWdriver which limits the number of units permitted, but its size is an assembly parameter and can be made as large as needed. Currently it is set for 500 units or fewer.)

Mount Verify

The mount verify service functions only with a normally mounted device. It is desirable for a similar service to be optionally available for foreign device pairs, where a database vendor may be handling the disk itself. This cannot be the default, but is sensible as a general matter. Fortunately, there is a server available which is able to handle much of the complexity here. If this function is implemented, it is feasible for SWdriver to notice error codes that currently result in mount verify being used, communicate these to the server, and have the server/switch driver call mount verify entry points (if any) in the appropriate drivers (to flush I/O) and within the intercept driver to requeue any I/O that may have been outstanding and handle device busy, with the server issuing the periodic packack functions via its private "wormhole" I/O functions permitting access to separate paths as needed. (The "wormhole" functions as currently planned use patterns of some of the function modifier bits as flags, so that the design scales easily to a modest number of paths, one or two dozen perhaps being a practical maximum. This should exceed what will be needed.) By the use of such functions, this system should be able to provide what amounts to mount verify functions on foreign devices, and thus to handle failover.

Defect Containment

The investigation has already produced a driver and control suite, which serves as the source of a code count. The software written for this purpose (not counting some library functions used to allow the optional configuration file to be free form) totals some 3216 lines of code. It is estimated that another ~250 lines of code will be needed for the automatic controller-pair recognition, and the DKdriver lines already added (to side copies) to support these functions total 180. Thus there are so far about 3400 lines of code, and the total for HSZ failover functionality may be expected to come, when all is said and done, to 3650 to 4000 (to pick a round number) lines of code.
The bogey number of defects expected in 4000 lines of code, at one per 40 lines of code, would be 100. However, for code which is already unit tested (the driver and control daemon code) this estimate is believed to be high, and an estimate of 10 defects per KLOC is suggested for that segment of the code. This would mean about 34 defects in the code so far, plus another ~10 in code yet to be generated. Not all of this code is new (it builds on some older virtual disk driver examples which have been functioning for several years), and the switching driver code has been tested on one system, which is why a lower defect count is expected to cover the code so far. Methods for defect removal include (in addition to unit tests):

* Overall design - minimal modifications will be introduced into the (already complex) SCSI drivers to support the failover functions. This can be expected to be the chief contributor to defect containment, since the changes to existing SCSI drivers form a small fraction of the overall effort and their function is limited, on the whole, to reporting information to the failover system.

* Reviews. It will be important to have the code in the driver reviewed so that its design, and particularly its detailed control flow, can be examined. The same goes for the server components, particularly where privileged.

* Stress testing. The code must be tested in SMP and large cluster environments to catch any timing subtleties.