Multi Path Switching
Problem Statement & Investigation Report
Glenn C. Everhart
August 30, 1996 (Minor revs 10/1/96)
--------------------------------------------------------------------

The Current Situation:

The VMS SCSI system currently uses one and only one path to reach any device, and that path is used to determine the device names. As a result, SCSI device names are unique and stable. Also, the system can guarantee some I/O to occur sequentially in situations where this is needed (principally in mount verify situations and cluster state transitions). However, in VMS clusters, other paths exist in some situations which cannot be used, and thus SCSI disks always have a single point of failure, even where there exists hardware capable of avoiding this.

Problem Statement:

A variety of situations exist or will shortly exist in which there will be more than one path to a device, and in which these paths will need to be coordinated so that only one is active at a time, and so that VMS sees only one, cluster-unique name for this storage clusterwide, regardless of the physical path to the storage being used. This situation exists with SCSI clusters now, where direct SCSI paths to a disk exist at the same time as MSCP served paths. It will shortly exist with HSZ series controllers where these are attached to more than one SCSI bus in a cluster. It can be expected to appear in Fibre Channel connection topologies and in other intelligent controllers in the future. Moreover, when the QIOserver is introduced, it will define additional paths to storage which will need to be coordinated and to fail over.

To maintain file system integrity, each storage entity (disk, generally) in VMS must have a name which is consistent and unique across a cluster. To provide greatest system robustness and availability, it should be possible to switch from one path to another, at least when one fails, and possibly at other times. This is needed for disks in the shortest time frame. Similar controls for other device types are desirable.

Characteristics of Multiport Storage Devices

There are several multiport SCSI device controllers currently known, and more are on the way. One of these is the HSZ50 (and HSZ70). Others from EMC, CMD, and Seagate exist, to name a few. Some provide information about their paths, others do not, so it is likely that special case code to handle each, or a manual (or semi-manual) way to handle some of these devices, will be needed for full generality.

The HSZ series will offer a number of pieces of support information which can be used by systems software to determine what is present. The HSZ50 will have two controllers which can be connected to two separate SCSI busses on a cluster (shared or separate). Fortunately the HSZ itself provides certain bits of information which an operating system can use to figure out which devices are which. (This information may need to be manually fed in where similar support is not present in hardware.)

First, when in this dual-bus configuration, an HSZ will return some extra data in INQUIRY responses. This data includes:

* The serial number of this controller
* The serial number of the alternate controller
* A bitmask of LUNs which are preferred for this controller
* State of the "other" controller

Therefore one can determine, from the INQUIRY data, if the device is an HSZ, what this and the "other" controller are, and whether this particular device is preferred on "this" controller. (The bitmask changes to reflect the actual situation, so that if one controller fails, all LUNs are marked as preferred on the other.) This extra information is present only in the dual bus case (the serial numbers being nulled otherwise).
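As a rough illustration of how a configuring driver might use this data, the following C sketch tests whether the path the INQUIRY was issued over is the preferred one for a given LUN. The structure layout, field names, and lengths are assumptions made for the sketch, not the actual HSZ INQUIRY format.

    /* Sketch only: the layout of the vendor-specific INQUIRY area is
     * assumed, not taken from the HSZ specification.  Field names,
     * offsets, and lengths are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define HSZ_SERIAL_LEN 12     /* assumed length of a controller serial number */

    struct hsz_inquiry_extra {    /* hypothetical layout of the extra INQUIRY data */
        char     this_ctlr_serial[HSZ_SERIAL_LEN];   /* serial number of this controller      */
        char     other_ctlr_serial[HSZ_SERIAL_LEN];  /* serial number of alternate controller */
        uint32_t preferred_lun_mask;                 /* bit n set => LUN n preferred here     */
        uint8_t  other_ctlr_state;                   /* state of the "other" controller       */
    };

    /* Return nonzero when this path should be treated as usable: either the
     * serial numbers are nulled (single-bus case, nothing to arbitrate) or
     * the LUN is marked preferred on the controller this INQUIRY went to.
     * Assumes lun < 32 since the mask is modeled as 32 bits. */
    static int hsz_path_is_preferred(const struct hsz_inquiry_extra *x, unsigned lun)
    {
        char null_serial[HSZ_SERIAL_LEN] = {0};

        if (memcmp(x->this_ctlr_serial, null_serial, HSZ_SERIAL_LEN) == 0 ||
            memcmp(x->other_ctlr_serial, null_serial, HSZ_SERIAL_LEN) == 0)
            return 1;    /* not the dual-bus configuration */

        return (x->preferred_lun_mask >> lun) & 1u;
    }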
This permits a driver to determine, when configuring a device, that this particular path to the device is the preferred one or is an alternate, non-preferred one. Moreover, the controller serial numbers are unique and visible to all nodes on a cluster, so that if a device name is chosen based on them, it will automatically be the same for all cluster nodes.

In addition, the HSZ firmware is being given the ability to notify drivers when a controller fails. This presumes that some devices are active on each controller, and works by having the HSZ detect the controller failure. If this happens, the next I/O to the good controller will receive a CHECK CONDITION status (unit attention). The sense data then uses some vendor-unique sense codes for failover (and eventually failback) events and returns the good controller serial number, the failed controller serial number, the failed controller target number, and a bitmask of LUNs moved. In addition, when this happens, the surviving controller kills (resets) the other controller to keep it from trying to continue operation. This information can permit the processor to be notified of a path failure without necessarily having to incur timeout and mount verify delays.

On VMS, however, a SCSI adapter on a failed path may have I/O in various states within its control, and if this is the case, some method of extracting it is needed. The usual path for this function is for timeouts to occur and force I/O requeue and mount verify. Where I/O is in progress to a device, there is no convenient external handle available to extract it. Therefore this information is likely to be most useful where the failed path devices are in fact idle. Where I/O is in progress at some stage within a SCSI adapter, it will have to be timed out or otherwise cleared from the adapter before a path switchover can take place. (This also means that in the event a transient failure occurs, nothing will be left "in the pipeline" to a device at switch time.) Actual HSZ switchover is done by a SCSI START command (which is done as part of the IO$_PACKACK operation in VMS) so that host software has some control.

The other multipath controllers offer different hints; the EMC controller allows one to intermix usage of its multiple paths in any mix desired, dynamically. One could in principle alternate which bus was used with that hardware, though cluster configuration and mount verify need to be able to ensure that old traffic is flushed before new traffic begins; this is more complex than static path usage, and is not being considered for this round of design.
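Before moving on, the sketch below shows how a host might recognize the failover unit attention described above and extract the event data. The vendor-unique ASC/ASCQ values and the sense-data offsets are placeholders invented for illustration; the real codes and layout come from the HSZ firmware specification and are not reproduced here.

    /* Sketch only: ASC/ASCQ values and sense-data offsets below are
     * placeholders, not the actual HSZ vendor-unique codes. */
    #include <stdint.h>
    #include <string.h>

    #define SENSE_KEY_UNIT_ATTENTION 0x06
    #define ASC_HSZ_FAILOVER  0x80    /* placeholder vendor-unique ASC  */
    #define ASCQ_HSZ_FAILOVER 0x01    /* placeholder vendor-unique ASCQ */

    struct hsz_failover_event {
        char     good_ctlr_serial[12];
        char     failed_ctlr_serial[12];
        uint8_t  failed_ctlr_target;
        uint32_t luns_moved;          /* bit n set => LUN n now owned by survivor */
    };

    /* Called when a CHECK CONDITION is seen on the surviving controller's
     * path.  Returns 1 and fills in *ev if the sense data describes a
     * controller failover, 0 otherwise. */
    static int hsz_parse_failover_sense(const uint8_t *sense, int sense_len,
                                        struct hsz_failover_event *ev)
    {
        if (sense_len < 47)
            return 0;
        if ((sense[2] & 0x0f) != SENSE_KEY_UNIT_ATTENTION)
            return 0;
        if (sense[12] != ASC_HSZ_FAILOVER || sense[13] != ASCQ_HSZ_FAILOVER)
            return 0;

        /* Offsets of the vendor-specific area are assumed for the sketch. */
        memcpy(ev->good_ctlr_serial,   &sense[18], 12);
        memcpy(ev->failed_ctlr_serial, &sense[30], 12);
        ev->failed_ctlr_target = sense[42];
        ev->luns_moved = (uint32_t)sense[43] << 24 | (uint32_t)sense[44] << 16 |
                         (uint32_t)sense[45] << 8  | (uint32_t)sense[46];
        return 1;
    }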
Futures Proposal:

There is a proposal to the SCSI-3 committee which details a more general configuration, in which some number of devices are controlled by a set of controllers, where a device may be accessible from one or more of the controllers at a time. It is anticipated that LUN ownership might have to be established in this case via reserve/release to set initial path preference (if only one path at a time may be used). This proposal defines some SCSI commands which may be sent to a storage control device to report which controllers and devices are associated and to set up access. Since these devices will have their own LUNs and device types (apart from the disks, tapes, etc. behind them), it is apparent that an IO$_PACKACK to a disk would have to have been preceded by some FC initialization commands. The unit init code of a new class driver may be the most logical place for such commands. Failover or failback is to be reported by ASC/ASCQ event codes, the same as for the HSZ. While this suggestion is not yet definite, this specification does attempt to be generally compatible with it. (A server, for a specific case, can communicate with a control device if need be when a failover is signalled.)

In addition to issues with controllers like HSZ, we must remember that issues already exist with the MSCP server in SCSI clusters. At times, disks are visible with two paths, one via MSCP and one via direct shared SCSI connections. Until the advent of SCSI clusters, a SCSI disk could be reached either locally or via MSCP, never both. Now, however, both paths can exist and be useful, but no failover is possible. Failover is supported with DUdriver, but not generally with other drivers such as DKdriver. Wherever there is at least one redundant path to a device, however, it is highly desirable that the system be able to switch invisibly to the good path when one of the paths fails.

Goals:

* Support HSZ failover for HSZ7x type controllers where two SCSI busses are connected to a single system or cluster.
* Support other multipath failover cases such as SCSI to/from MSCP paths or to and from other server types (e.g., QIOserver).
* Be compatible with the planned HSG failover mechanism (which is generally similar to the HSZ one, with some differences due to the changes between SCSI-2 and SCSI-3).

Non-Goals:

* Code for more than two paths initially, though the code must extend easily to "n" paths.
* Support the case where both HSZ controllers are on a single bus (this is supported within the HSZ and needs no system support).
* Solve the device naming problem generally.
* Dynamic routing or load balancing between paths to a device in full detail. (That is, it is expected that the solution must function to switch a failed path, but it is not initially necessary for a solution to load share between multiple paths dynamically in the first design pass.)
* Describe details of compatibility with the HSG proposed failover scheme.
* Support magtape sharing in the first round. (There may be details that differ for different device classes.)

Problem Components

There are two components to this problem: how to arrange that I/O be directed to the correct place (and how exactly to switch it), and how device names may be kept cluster unique where there is more than one path to a given piece of storage and these paths' "natural" names are in general not the same. These will be discussed separately here. (In the current SCSI system, because there is one path only, these issues don't arise.)

Switching Solutions:

1: Non-Solution: SCSI Port-Class relinking

To deal with the problems of the HSZ only, it was initially considered that some form of altering SCSI connections' purely SCSI structure based "routing" to devices might be feasible for the switching needed here. However, in principle, two SCSI busses can be controlled by entirely different SCSI port drivers, so that an attempt to alter connections on the fly at port level could involve considerable complexity in ensuring that port driver specific structures were initialized as the port drivers expect. (These initializations are not all alike.)
Also, a "port level" approach does not deal with the appearance of multiple class driver units after autoconfiguration. That is, one must move not one, but several devices' structures such that no timing windows are opened up for the rest of VMS or for the SCSI subsystem. Idle drivers might be revectored, but any links between SCSI structures and class drivers would need to be traced and reset, and any future asynchronous events would need to be blocked from access to the structures during this time, and any port driver specificities in the SCDT in particular would be a problem. This can probably be done, but looked error prone and complex to build and to test. Finally, this approach would not help with failover between SCSI and MSCP or other servers; it is peculiar to the internals of SCSI (as they currently stand), and represents an interface which has never been externally constrained to follow a set design. Rather, this interface has been mutable, and may remain that way. 2: Non-Solution: Add Code to DKdriver It is possible to imagine adding switching code directly into DKdriver and possibly other class drivers which might redirect I/O as needed. In DUdriver, this approach is used, and DUdriver is quite tightly connected with the MSCP server and mount verify. In DUdriver, I/O is switched by requeueing, basically in the beginning of the start-io path, to the "other" path if such is needed. However, the details of most of the processing are heavily involved with MSCP, and rather complex. This part of the 5 driver interface is a large and complex one and touches many places in VMS, so that an attempt to duplicate the DU processing in DK would be a sizable increase in DKdriver's complexity. This would introduce considerable risk into DKdriver and may in the end prove completely infeasible. (It is worth noting that this complexity and the difficulty of maintaining it are among the reasons for the qio server project.) The current DUdriver code does not in fact switch to a local DK path now, even when the UCBs and CDDBs are "cross linked", probably because the DU data structures are not completely set up in DKdriver (and don't exist at all in some other disk drivers). What is more of a problem, the path switching done between DUdriver, the MSCP server, and SCS is specialized to exactly two paths maximum between a piece of storage and a processor. When a new served path must be found, MSCP commands are sent to attempt this. While this is appropriate for MSCP storage, it is not particularly efficient for other drivers. It is however important to consider that driver mount verification code must be called to ensure that related components be notified, as mount verification is the signal used in DUdriver that path modification may be needed. A simpler variant of this approach would be to include some "generic" switching code early in DKdriver which would simple queue packets to other driver start-io entries. This would also increase the complexity of the DKdriver driver interface, but less so than attempting to duplicate DUdriver processing. While this is a feasible approach, it has two problems: 1. It would be specialized to DKdriver, and would need to be separately added to any other drivers in the future where failover may be desired. MKdriver and GKdriver may turn out to be early candidates. 2. Each DK unit "knows" only about its own device, where multiple path switching would need additional structures and controls to be added to the DK interface so that groupings of devices might be known. 
The class driver is not at the most appropriate level of globality to perform this operation, and again this would mean easily user visible changes to the DKdriver program interface, which would have to be doable in any future drivers which incorporated the modifications. Note that either of these approaches involves requeueing the IRP to the "real" destination device, so the requeueing overhead exists in any related approach. To the extent that DK-internal code lacks a well defined interface at its back end, it would tend to become specialized over time and would likely become harder with time to move to other drivers where it was needed. The lack of generality of this approach makes it appear less desirable than the final one. The approach of adding code to DKdriver seems less than optimal.

3: Solution: Class Switching

The one interface in the I/O driver system which is well defined and constrained to be relatively stable is that at its top. Control transfer into drivers must follow conventions OVMS has long established and which are documented. Therefore, a switching function using this interface should be fairly universal, and may reduce the amount of work to keep it current as future VMS designs arise. It is a confirmation of the value of such a choice that the DUdriver failover control is also managed very close to the "top" (i.e., start-io) entry of that driver.

Therefore a simpler approach has been investigated. In this approach, a dedicated device switching subsystem inserts itself at the class driver external entry points and serves to abstract the device name used a level away from the actual device driver paths. Thus, above this level, the OVMS system sees one device with one name which corresponds to the one piece of storage underneath. Below this level, two or more paths may exist, all leading to the same physical storage. This level's job, then, is to switch all I/O seen by the rest of OVMS (at a documented and fairly stable interface) between these underlying paths. It is not necessary that I/O order be preserved (since the SCSI drivers do not in general preserve it), except that in certain well defined circumstances such as path or cluster node failure, mount verify, and a few more, the ordering guarantees offered by SCSI must be maintained. (See Figure 3.)

This subsystem must offer a control interface distinct from the underlying drivers, which can work with components able to group devices together, select one name consistently across a cluster for the storage device, and allow communication to each underlying device for those switching situations when sequentiality across all paths must be ensured. (A server communicating with an intercept driver is the scheme in mind here.) Note that the external VMS driver interface is the input interface of this intercept driver, since it will insert its processing in the start-io points of underlying drivers, and that its output interface is also this same external, documented VMS driver interface. These are documented and rather stable interfaces, and simple enough to control. While a separate subsystem may not have direct internal access to underlying driver private structures, this is a software engineering advantage. If additional information is needed, it must be explicitly made available and added somehow to the underlying driver's external interface, or some driver-private inspection method must be devised to aid processing in some cases.
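As a rough illustration of the intercept technique, the C sketch below saves a primary path driver's original start-io entry and substitutes a routine that hands each IRP to whichever underlying path is currently active. The structures are simplified stand-ins for the real VMS DDT/UCB/IRP definitions, and the synchronization a real driver would need (IPL, spinlocks, forking) is omitted; it shows only the shape of the idea, not actual driver source.

    /* Sketch only: simplified stand-ins for VMS structures. */
    struct irp;                                    /* I/O request packet              */
    struct mp_unit;
    struct ucb {                                   /* unit control block (simplified) */
        struct mp_unit *mp;                        /* back pointer added for the sketch */
    };
    typedef void (*start_io_t)(struct irp *, struct ucb *);

    struct ddt {                                   /* driver dispatch table (simplified) */
        start_io_t start_io;
    };

    struct path {
        struct ucb *ucb;                           /* that path's own unit       */
        start_io_t  saved_start_io;                /* original driver start-io   */
    };

    struct mp_unit {                               /* one multipath storage unit */
        struct path paths[2];
        int         active;                        /* index of the path in use   */
    };

    /* Replacement start-io, installed in the primary path's DDT.  The rest
     * of VMS still queues to the primary device name; this routine simply
     * requeues the IRP to whichever underlying path is currently active. */
    static void mp_start_io(struct irp *irp, struct ucb *primary_ucb)
    {
        struct mp_unit *mp = primary_ucb->mp;
        struct path *p = &mp->paths[mp->active];
        p->saved_start_io(irp, p->ucb);
    }

    /* Insertion: remember the original entry, then point the DDT at the
     * switch layer.  In real driver code this must be done with proper
     * IPL/spinlock synchronization, omitted here. */
    static void mp_intercept(struct ddt *primary_ddt, struct mp_unit *mp,
                             int primary_index)
    {
        mp->paths[primary_index].saved_start_io = primary_ddt->start_io;
        primary_ddt->start_io = mp_start_io;
    }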
Path switching when mount verification occurs (as is the usual case for DUdriver) can be handled without additions to the external interface, since mount verify currently can operate with any driver. With a general purpose switch subsystem, however, switchover functions can be used for any multiple paths to storage, of whatever nature, and it becomes possible to create cluster consistent names for devices across widely divergent interconnects, by creating a dummy alias device name if need be and arranging that actual I/O pass through other paths. (This is a possible future item.)

It should be noted too that the MSCP server currently operates promptly when it sees mount verification and uses underlying facilities to find another server node if one is available. It would normally be expected that other servers may wish to do this as well. Should such support be desired, the switching software layer will need to send additional requests to a served path which will cause a new path to be probed for. The current suggested solution will allow the MSCP server's mount verify processing to happen before giving up on that path, so it already deals with MSCP.

Naming Solutions:

Background:

The path switching solution is related to device naming in its need to ensure that no device will be accessed via different names in the cluster. While this must be done uniformly across a cluster, the namespace can be the current one or any future one; other paths will be hidden (and the search and scan routines must be altered slightly to hide any N such hidden paths) but can exist. A path can come into being with its own name, and the path switching system can handle it under another name, even well after boot, so long as the name being hidden has not begun to be used. Apart from ensuring cluster commonality, path switching is largely independent of the choice of naming schemes.

There is one effect that is necessary, and that is that whatever is needed to ensure common device names clusterwide must be provided. This means in particular that when an alternate path comes up which is not the cluster-required one, some effort is necessary to see that it is not itself served to other nodes where the primary path exists, and is not used. The preferred switching solution can switch served path UCBs also, but the uniformly named cluster access path must be present and used first, so that only one path is used. This will require some additions in DKdriver to ensure that the "right" path is served, and alternate path access is delayed until the switching system is present.

Figure 1 shows some possibilities with SCSI clusters and dual pathed HSZ which will illustrate some of the complexity involved. The picture with QIOserver and MSCP server added is different in detail but not different in concept. It is likely that the type of processing in DUTUSUBS and in DKdriver which looks for other paths to the same storage will have to be extended to a multipath case. While detail will change, the overall concept need not.

Cluster Consistency Strategies

To accomplish cluster consistent naming, it is essential that some method exist which will prevent use of extra paths, even if they are visible, before the switching code can start. Refer again to Figure 1 and it will be apparent that this is not trivial, particularly since a cluster cannot expect to have any fully shared storage. There are several approaches that might be used to achieve this.
Naming Solution 1: Common Heuristic

This approach involves use of the same heuristic decision algorithm in the entire cluster for determining which name to use. Thus, for example, where multiple controllers exist, there must be some way by which a driver coming up can find out whether the present UCB is for the lowest or highest numbered controller, and the algorithm would be "use the path with the lowest serial number controller". The proposed HSZ firmware makes this information available in its INQUIRY data. To deal with multiple paths for SCSI, it would suffice to add a test at the end of unit init to check whether this UCB was the lowest numbered controller one and, if not, hide the UCB so it will not come into use (a sketch of such a check appears below). Other controller services proposed for SCSI-3 will have similar capabilities, and a common heuristic will permit selection of the proper direct path. So long as servers do not start for nonpreferred paths, this will mean that the same, common device name will be seen everywhere. Similarly one could have a rule by which any server path being set up checks pre-existing paths and hides any new path if it is not preferred.

It should be made clear that heuristic code for many types of controllers may be needed, with the ability to specify a configuration file manually as a back-up for those for which no heuristics yet exist. The ability to delay configuration of SCSI IDs may need to be used to handle some of these cases by allowing all connections to be delayed so that access to the device will start after switching software first runs. (One "heuristic" in fact could be that some cluster nodes simply don't configure a particular device automatically, set by hand, in cases where a controller provides no information.) This approach will mean code in the MSCP driver to also detect QIOserver presence and vice versa, as well as to detect (as currently) local paths. Also, code would be needed in DKdriver and any other affected drivers to check whether a path was optimal, as well as code to hide other pre-existing served paths or to hide itself if another path existed currently.

Note: In the presence of switching software, initially configuring a served UCB as the "primary" name to be used is not as performance critical as might otherwise be thought, since any failover will permit the switching code to switch to a local path if one exists. It may even be feasible to force mount verify processing to achieve this after the switching code starts.

The common heuristic approach can work provided that devices with multiple paths can supply some per-path information to drive the decision. It does not require boot path modifications, and does not require intra-cluster negotiation. The processing involved is similar to what is now present. Because only one local path would ever be used to access a device for purposes of determining its cluster name, there could be no conflicts of names. Once other local paths were connected, of course, they could be made available via server, or one could serve the multipath alias device with suitable enabling modifications. The former, with corresponding MSCP and DUTUSUBS modifications, appears simplest, since the latter generates routing issues with MSCP packets whose solution could be complex.
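The following is a minimal sketch of such a unit-init check, assuming the INQUIRY extra data described earlier has already been fetched for this path. The type, field, and helper names (including hide_ucb) are invented for illustration; the real test would live at the end of DKdriver unit init and use whatever mechanism actually keeps a UCB from coming into use.

    /* Sketch only: serial numbers are assumed to be fixed-length byte
     * strings comparable with memcmp, and "hide" stands for whatever
     * mechanism keeps a UCB from being assigned or served. */
    #include <string.h>

    #define SERIAL_LEN 12

    struct path_info {
        char this_ctlr_serial[SERIAL_LEN];    /* from this path's INQUIRY data */
        char other_ctlr_serial[SERIAL_LEN];   /* from the same INQUIRY data    */
    };

    /* Common heuristic: every node keeps only the path through the
     * controller with the lowest serial number, so all nodes
     * independently pick the same name for the storage unit. */
    static int keep_this_path(const struct path_info *p)
    {
        char nulls[SERIAL_LEN] = {0};

        /* Only one controller visible: nothing to arbitrate. */
        if (memcmp(p->other_ctlr_serial, nulls, SERIAL_LEN) == 0)
            return 1;

        return memcmp(p->this_ctlr_serial, p->other_ctlr_serial, SERIAL_LEN) <= 0;
    }

    /* At the end of unit init (pseudocode):
     *     if (!keep_this_path(&info))
     *         hide_ucb(ucb);      // hypothetical: prevent use and serving
     */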
The approach does have a possible disadvantage, though: the port chosen by heuristic may turn out to be connected to only one system, while another port of the device may be connected to a shared bus, so that instead of instantly allowing two direct paths with no name conflict, only one is allowed, until failover to the direct paths is possible after loading of the switching components. This can be a temporary problem; certainly the HSZ firmware permits one (via sending SCSI START to the other port) to force mount verify to happen and failover to occur on the "less connected" port, so long as some global site of intelligence exists to determine when such should be done. Once the switching software is in place, a local path can be used even though the device name came from a served path. (See Figures 2 and 3 for the connectivity.) The common heuristic in the HSZ case would be to use the controller marked as preferred by the customer, this information being available also. Basically, however, so long as only one local device name can be chosen by local paths (and this is guaranteed on shared SCSI busses already), naming in a cluster will be unique.

Naming Solution 2: Common Configuration File

If every cluster node is able to access a common configuration file, and this file is required to be the same even if storage access requires multiple copies of it to exist, then such a file could be used to select which UCBs should be enabled. It would be necessary to also have cluster code check, at cluster state transitions, that all mappings agreed, and hang any node with a different map, which could involve a sizeable amount of traffic. This traffic cannot well be avoided, though, since where separate files are involved, they can get out of synchronization. John Hallyburton's IR on SCSI naming discusses some methods in detail for how this might happen. Unlike the problem of naming devices on busses where the bus is connected to one or a few systems, the problem here is that devices may be on widely different paths, attached physically to different computers, so a configuration file covering all such names must be truly global clusterwide.

Making a config file work would require that somehow, early in the life of the system, it should be read in from disk so that driver unit init routines could determine whether a particular device should be connected. Cluster state transition code, or something similar which runs early in the system life, must also compare this data, for any disks being actively used at minimum, so as to ensure that no disk being used on a system differs in name from that same disk anywhere else. Identifying the "same disk" means using the unique device ID.

Cluster transition hooking can be accomplished by sending worldwide ID / device name pairs to the cluster coordinator, to verify that no mispairings are seen. The size of this list is likely not to be excessive; perhaps 128 bits per device could encode the pairs, with 64 devices per 1K of memory, so that 2048 devices would occupy 32K. When a node was joining a cluster it could then send its pairings to the coordinator node, which would validate that no different matches were used. It is possible that the lock system could be used instead, within cluster code called from IOGEN, so that at boot time the device name validity can be established. In this case, locks could be taken out with names corresponding to devices, and lock values corresponding to worldwide IDs. Should a node acquire such a lock and see a different value from its preferred one, it would "know" that another node already running had used that name with a different device, and bugcheck after a message. This would need to live in the swapper process to allow it to live as long as the system, and could be implemented by a doorbell (to reduce the number of locks that must be constantly present) which the cluster coordinator would use, or by having a lot of reserved locks. (If cluster logicals were implemented and loaded early enough at boot, even those might be used.)
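To make the coordinator-side check concrete, here is a minimal sketch of the pair validation, following the rough 128-bits-per-device sizing above. The types and function are invented for illustration, and no cluster transport or lock traffic is shown; it only demonstrates the rule that no device name may bind to two different worldwide IDs, and no ID to two names.

    /* Sketch only: each entry packs a device name and a worldwide ID
     * into 128 bits, matching the rough sizing in the text. */
    #include <stdint.h>
    #include <stddef.h>

    struct name_wwid_pair {
        uint64_t name;     /* device name, packed into 64 bits (assumed encoding) */
        uint64_t wwid;     /* worldwide unique device ID                          */
    };

    /* Coordinator check: a joining node's pairings must not map a known
     * name to a different WWID, nor a known WWID to a different name.
     * Returns 0 on success, -1 on a mismatch (the joining node would
     * then be refused, or bugcheck after a message, as discussed above). */
    static int validate_pairs(const struct name_wwid_pair *known, size_t nknown,
                              const struct name_wwid_pair *joining, size_t njoin)
    {
        for (size_t j = 0; j < njoin; j++) {
            for (size_t k = 0; k < nknown; k++) {
                if (known[k].name == joining[j].name &&
                    known[k].wwid != joining[j].wwid)
                    return -1;             /* same name, different device */
                if (known[k].wwid == joining[j].wwid &&
                    known[k].name != joining[j].name)
                    return -1;             /* same device, different name */
            }
        }
        return 0;
    }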
Recommendations re: common names:

For initial operation, the "common heuristic" solution appears much simpler to implement and would be well to implement first. This has the extra virtue that it may simplify backporting of this capability. Over the longer haul, it is likely that a configuration file approach will be needed to handle naming issues. The naming project needs to ensure only one name per device in any case, and is a better place to put this kind of information. Even if a cut-down "new naming" scheme is used to only extend LUNs in a more or less fixed way (let the unit number be, say, 400*SCSI ID + SCSI LUN, a fixed algorithm for SCSI-3 and/or Fibre Channel), this means that multipath failover can accomplish what it needs to. If multiple paths are never configured, there may be only one path to select from, which is not bad either.

Defect Containment

The investigation has already resulted in a driver and control suite which will serve as a source of a code count. The software written for this purpose (not counting some library functions used to allow the optional configuration file to be free form) totals some 3216 lines of code. It is estimated that another ~250 lines of code will be needed for the automatic controller-pair recognition, and the DKdriver lines already added (to side copies) to support these functions total 180. Thus there are so far about 3400 lines of code, and the total for HSZ failover functionality may be expected, when all is said and done, to be 3650 to 4000 (to pick a round number) lines of code. The bogey number of defects expected in 4000 lines of code at one per 40 lines of code would be 100. However, for code which is unit tested already (the driver and control daemon code) this estimate is reported high, and an estimate of 10 defects per KLOC is suggested for that segment of the code. This would mean about 34 defects in the code so far, plus another ~10 in code to be generated. Not all of this code is new (some older virtual disk driver examples, which have been functioning for several years, were built on), and the switching driver code has been tested on one system, which is why it is expected that a lower defect count will cover the code so far.

Methods for defect removal include (in addition to unit tests):

* Overall design - minimal modifications will be introduced into the (already complex) SCSI drivers to support the failover functions. This can be expected to be the chief contributor to defect containment, since the changes to existing SCSI drivers form a small fraction of the overall effort and their function is, on the whole, limited to reporting information to the failover system.

* Reviews. It will be important to have the driver code reviewed so that its design, and particularly its detailed control flow, can be examined. The same goes for the server components, particularly where privileged.

* Stress testing. The code must be tested in SMP and large cluster environments to catch any timing subtleties.
Resources:

Glenn Everhart will work on this project.

Issues:

* Details of the function of QIOserver have not been finalized at this time, and these may have some effect, although designs are being shared and the mesh currently looks good.

* Should the SCSI committee not ultimately support supplying information about all paths when querying a device path, the "common heuristic" approach cannot be used. This would be a change from current plans, but could happen. The approach will be fine for HSZs, though, and by the time such a change could occur, it is likely that the code for unique worldwide name support will have been completed, so that the configuration file approach should be workable.

Figures:

The following figures are illustrations of some of the types of connections which may occur, considering HSZ dual paths and their connections as particular examples. Figure 1 illustrates two possible ways a dual-bus HSZ can be connected to a cluster (in addition to which the case exists of two busses on a single machine where the busses are not shared SCSI busses). Figure 2 illustrates the way naming works to the rest of VMS, again using the HSZ system, which is one near-term dual path system, when using an intercept layer. Figure 3 is meant to show the switch server in the picture. It is implicit here that the switch driver has interfaces for its control and communication which are distinct from those of any underlying drivers. (For that reason, controls do not need to be added to any underlying drivers, of whatever sort.)

[Figure 1. Dual-bus HSZ connections to a cluster - diagram not reproduced.]

[Figure 2. The Software Layering Used - DKA100 (the primary name) feeds a switching software layer just inside the DKDRIVER start_io entry; beneath it are the DKA100 and DKB100 UCBs (one or the other used), each with its own DKDRIVER and port driver (PKA/PKB), both reaching the HSZ controller and its disk(s).]

[Figure 3. Layering with Switching Subsystem Added above Class drivers - the switch server and switch layer sit just below VMS services; DKAn is visible to the rest of VMS while the switch layer hides the extra DKBn path; below are class and port drivers for each path (one path may be via some network), leading to the storage device.]

[Figure 4. Various ways in which disks and processors may be connected - (1) a single connection between a CPU and a disk; (2) a disk reached through two controllers from the same CPU; (3) a multiport disk/controller connected to two SCSI adapters in the same CPU; (4) a CPU with two busses, each to one controller of an HSZ talking to the disk; (5) two clustered machines, each connected to one controller of an HSZ or similar.]

Types Of Connections

Consider Figure 4 above. It shows 5 basic types of connectivity. This proposal is irrelevant to the drawing labelled 1, since that is a single path case. In the drawing labelled 2 it could be relevant, since a single disk might appear as two IDs, though if the controller were an HSZ, the controller would deal with the failover internally and such dual paths would not appear. We would be able to switch paths provided some other controller were used, either with some heuristic or by manually controlling serving and path configuration on one side at least. (Adding a facility to locally connect a disk unit without permitting serving could be useful in manual setups in such situations; forcing use of set /served in all cases is at times onerous.) The third drawing shows a multiport disk/controller hooked to two SCSI adapters in the same host.
From the CPU point of view, for a non-HSZ, drawings 2 and 3 are similar in that two different device pathnames point at the same disk. Currently these are illegal configurations. This proposal can support them to a degree with a configuration file, in the absence of information about the devices. If a device-based ID which is known to be site unique is available, a heuristic can also be devised to facilitate control of serving. Failing that, manual control of serving may be needed. (Automatic serving control could be handled by the switching server if some simple class driver support were added to prevent any server connections at initial driver load, for example by setting a few bits in the UCB at unit init time, with server connections made after the switching code loaded. This would give much more automated support.)

The fourth drawing, showing an HSZ connected on two busses to a processor, and the fifth, showing connection through servers, with two HSZ controllers connected to two busses on two different but clustered processors, are fully supported by this proposal. So long as each path can find out about other paths' existence, DKdriver can inhibit servers. This will support HSZ and should work with the proposed SCSI interconnects. A still larger variety of devices can be handled if server access to SCSI disks is inhibited until switching servers can be activated, after which these servers must use cluster communications to validate that their configurations (whether from config files or heuristics) are clusterwide uniform.

Appendix A: Some Technical Details

Since some work on a prototype was done as one of the early tasks of this IR, it seems fitting to present some of the thought and detail that went into this prototype, and some considerations that can lead to its evolution toward a final system.

Overview:

The thinking which went into the server prototype was that each machine which has a multipath device will need two components: a server, whose job it will be to group devices together and control what clusterwide names are used, as well as to allow periodic checking to do timely switching; and a switching intercept driver, which will insert itself into the DDTAB entries of the chosen primary path driver and requeue IRPs to the appropriate paths it "knows" about. The server will know some intimate details of the switching driver controls, and the switching driver will offer special controls to allow the server to pass commands to individual paths, providing a "back door" opening beyond the abstraction that the main path is one path only. The switching driver can also internally watch for mount verify IRPs and use them to signal itself to switch paths, and will keep track of what activity exists at its subordinate drivers. Since the switching will work via an intercept, the device name can be one of the original path names, and no VMS data structures above the intercepted driver need be touched. Thus any risk introduced by adding a switching component is isolated therein. The queueing that is added to the path is exactly what is added for DUdriver, or what is used for striping drivers and some other "performance enhancing" VMS components. Thus the overhead should not be a large worry.

Some Internals:

In order to implement this approach, it is necessary to select a "primary path" from among the paths to the device, so that this path's name can be used clusterwide. (This choice must be the same for all nodes in a cluster.)
That done, a switching layer which can exist just ahead of the underlying drivers must control all driver entry points from outside. These are stored in the DDTAB structure. The primary ones are start-io, pending-io, mount-verify-start/finish, altstart-io, I/O cancel, and of course the FDT tables. The switching functions contemplated here should not need to alter FDT entries, but will need control to intercept the primary path's driver entries. (Other path drivers should be hidden and set to no-assign status to keep them from being used.)

The basic approach considered implies that a new UCB I/O queue will be used instead of the primary driver's queue as the system input for I/O to one or more of the paths. It is then the switching layer's job to vector this I/O to underlying paths, to keep track of what is active and what is idle, and to ensure that on path transfer all I/O is revectored to the correct place. In addition, it is necessary to provide the ability for specialized software to control path selection. This approach does require some additional data over that of, in effect, swapping between two whole SCSI port drivers, since one switches each device, not each port. Still, where some multipath points may exist via servers and some via SCSI, it provides a single approach which can handle all cases, regardless of the interconnects.

Maintaining a separate input queue is facilitated by the DDTAB entries. When a new I/O is to be initiated, it is added to the device queue by calling the pending_io entry, or is started directly from the start-io entry. When mount verify is started, the mount verify entry is called to do the actual requeueing of the IRP back to the device, and the mount verify end entry is called to actually start new I/O after mount verify. When altstart is called, the switch layer will be able to use internal information to tell where the supplied IRP needs to be sent. Finally, by altering the IRP$L_PID field of each IRP, the switch layer is able to gain control after each I/O so it can tell when I/O is done, and can monitor I/O for purposes of initiating mount verify conditions when the driver just completing the actual work happens to be one of the paths not known as mounted by VMS. (A testbed experiment has validated the feasibility of this "third party" switching, demonstrating it working in both directions.)

The treatment of mount verify entries is important here. When mount verify start is called, the underlying driver's routine must be called (if present) to insert the current IRP on the driver's wait queue, and then the intercept must move it to the multipath input queue. When mount verify ends, the intercept will move the first IRP through its operations and into the path's driver's input queue, and then call the mount verify end routine to actually begin I/O again.

It is intended that path switching will occur either after a failed path has gone into mount verify processing (so that the I/O system will have been fully idled, and after mount verify starts, to give SCSI drivers a chance to recover from SCSI RESET before concluding that a path has in fact failed, and so the MSCP/DUdriver code may seek a different served path if possible), or when the switching paths are known to be idle. Idleness in the second case will be determined by counting. The I/O outstanding count is the number of IRPs into the driver less the number of IRPs postprocessed out. When this is zero, the driver is necessarily idle.
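A minimal sketch of that bookkeeping follows. The completion hook stands in for the IRP$L_PID redirection just described, the names are invented, and the SMP interlocking a real driver would need is omitted.

    /* Sketch only: "outstanding" counts IRPs handed to a path's driver
     * less IRPs seen again at I/O postprocessing.  The real code gains
     * its completion callback by redirecting IRP$L_PID; here that is
     * modeled as an explicit hook, with no atomics or IPL rules shown. */
    struct path_state {
        volatile long outstanding;       /* IRPs in flight beneath this path */
    };

    static void path_io_started(struct path_state *ps)
    {
        ps->outstanding++;               /* IRP queued into this path's driver   */
    }

    static void path_io_completed(struct path_state *ps)
    {
        ps->outstanding--;               /* IRP came back through postprocessing */
    }

    /* A path may be abandoned (or both paths pack-acked for a sequential
     * switch) only when nothing remains in flight beneath it. */
    static int path_is_idle(const struct path_state *ps)
    {
        return ps->outstanding == 0;
    }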
When paths are switched while idle, the intent is also that a packack should be issued on both paths, to force sequential processing at the device and thus guarantee again that no path being abandoned can have any remaining I/O. This is in fact a stronger guarantee than is needed, since during most operations the paths could perfectly well be shared. Only where there is a need from an upper layer to enforce sequentiality is it necessary that this sequentiality be carried to the lower levels, or when a condition at the device requires only one path to be used for correct recovery (as, for example, bad block replacement done by a CPU might be). The approach here is to defer any consideration of performing such synchronization dynamically for a time, but to make path failover work. The approach under consideration does not preclude a much more dynamic style of path sharing, but this is not intended immediately.

Mount Verify and /Foreign Mounts

The mount verify service functions only with a normally mounted device. It is desirable for similar service to be optionally available for foreign device pairs, where a database vendor may be handling the disk itself. This cannot be the default, but is sensible as a general matter. Fortunately, there is a server available which is able to handle much of the complexity here. If this function is implemented, it is feasible for the switching driver to requeue its I/O to its input, set the mount verify bits in underlying devices, and have the server process perform essentially the same operations that mount verify does. This will allow the same path failover that occurs on mounted devices to be offered for foreign mounted devices, should a site so select. Since some database vendors operate on non-filestructured disks, this will permit a significant functional gain for their support. Again, driver routines can be called even for secondary drivers, so that underlying code will notice no changes.

Code Assists

Some simple optimizations will be present to detect whether underlying drivers have such DDTAB entries as mount verify start/end, pending_io, and so on, so that IRPs which can be handled more simply by just requeueing to the multipath input queue need not be queued and then moved. However, an underlying driver may be able to speed up operations if it is able to provide some kind of assist so that the path switching system might know when purely local-path processing of a path has completed. This should be possible to add "unobtrusively" in return status from a packack, for example, since we are able in the intercept code to look at any IRP field. There need not even be any user visible changes. The initial scheme of waiting for some number of mount verify pack-ack functions (IO$_PACKACK) is usable but crude, and might be improved by incorporating some knowledge of the specifics of what was going on at lower levels. (This can mean noting that a bus reset is being handled in SCSI, or noting that a call to MSCP revalidate may be pending, as opposed to complete.) A status bit meaning either "the local processing has failed; try switching if possible" or "more local processing is pending; delay a switch" might be used. The former allows one to expedite switching, while the latter permits one to delay it. The packack count one would use varies accordingly. Alternatively there could be two bits, one per meaning.
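If such an assist were added, it might take a form like the following. The two flag bits, the default retry count, and the way they adjust the number of pack-acks the switch layer will wait through are all hypothetical, shown only to make the expedite/delay idea concrete.

    /* Sketch only: these flag bits do not exist in any current driver;
     * they illustrate the "expedite" / "delay" assist described above. */
    #define MP_ASSIST_LOCAL_FAILED   0x1   /* local recovery failed: switch now    */
    #define MP_ASSIST_LOCAL_PENDING  0x2   /* local recovery in progress: hold off */

    #define MP_PACKACK_RETRIES_DEFAULT 4   /* assumed default mount-verify pack-acks */

    /* Decide how many more mount-verify pack-acks to allow on the current
     * path before the switch layer gives up on it. */
    static int mp_packack_budget(unsigned assist_flags)
    {
        if (assist_flags & MP_ASSIST_LOCAL_FAILED)
            return 0;                              /* expedite: switch immediately     */
        if (assist_flags & MP_ASSIST_LOCAL_PENDING)
            return 2 * MP_PACKACK_RETRIES_DEFAULT; /* delay: give local recovery time  */
        return MP_PACKACK_RETRIES_DEFAULT;
    }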
Underlying driver support is highly desirable, but it is noteworthy that a multipath switching layer can be constructed with no driver modification at all. This will make retrofit, with at least partial function, easy. Such a solution will not handle full multipath, but could be used in cases such as devices on a shared SCSI bus, permitting failover between MSCP and a direct SCSI path.