From: Bill Todd [billtodd@foo.mv.com]
Sent: Monday, February 21, 2000 6:57 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: The Future of VMS?

There's been so much information of dubious validity floating around on this
issue (Sun clusters vs. VMS clusters) that I spent some time over at sun.com
today trying to raise the quality level. Turns out it's not all that easy.

For example, in "Sun Clusters A White Paper" we find the following:

"The fact that all nodes share access to all files and devices lets system
administrators readily move resources around on the cluster. The programming
interface for files and devices is global across a cluster, so moving a
physical resource doesn't involve having to change any software."

"The server sends "checkpoint" messages (meta data) to the secondary server
for each write operation to guarantee the integrity of data in case of
failover... If the primary server fails for any reason, the secondary server
assumes the identity of the primary, and the framework automatically
redirects requests so that replication and failover are transparent to the
client."

(The 'client' in this case is some other cluster node accessing the files as
if they were local: the client's accesses continue transparently, without
interruption other than the brief pause for the server fail-over.)

"The Sun cluster file system is fast. It is implemented with a caching scheme
that makes access to data from remote files almost as fast as local access.
The file system is implemented flexibly enough that it can be used with a
range of cluster interconnects. So in the case of the SCI-based Sun Cluster
Channel, it can take advantage of remote Direct Memory Access (DMA) to avoid
buffering information on the remote node and do a direct memory access back
to the requesting node."

"All data written on the cluster file system is mirrored for higher data
availability. This fact, combined with the ability of any node to access any
file means that software installed once anywhere on the cluster is accessible
by any other node on the cluster - even if the original node that installed
it is not online. This feature permits the cluster to have global work
queues, global print queues, and a global mail repository."

"All the nodes in a given cluster can share a single logical host name, and a
global IP address."
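For anyone who wants a concrete picture of the checkpoint scheme described in
the second quote above, here is a minimal sketch of what it amounts to. It is
my own illustration in Python, not Sun code: every class, method, and field
name is invented, and it ignores networking, durability, and everything else
that makes the real implementation hard.

# Sketch only: the primary records each write's metadata on the secondary
# before acknowledging the write, so the secondary can assume the primary's
# identity after a failure without losing any acknowledged state.

class Secondary:
    def __init__(self):
        self.checkpoints = []          # metadata for every acknowledged write
        self.is_primary = False

    def receive_checkpoint(self, meta):
        # In the real scheme this would be made durable before the primary
        # acknowledges the client's write.
        self.checkpoints.append(meta)

    def take_over(self):
        # Fail-over: adopt the primary's identity; requests get redirected here.
        self.is_primary = True
        return self.checkpoints


class Primary:
    def __init__(self, secondary):
        self.secondary = secondary
        self.data = {}

    def write(self, path, payload):
        # Ship the checkpoint (metadata only) to the secondary first...
        self.secondary.receive_checkpoint({"path": path, "length": len(payload)})
        # ...then apply the write locally and acknowledge it.
        self.data[path] = payload
        return "ack"


if __name__ == "__main__":
    backup = Secondary()
    server = Primary(backup)
    server.write("/global/app/config", b"x" * 128)
    # Simulated primary failure: the secondary already holds the metadata for
    # every acknowledged write, so it can take over consistently.
    print(backup.take_over(), backup.is_primary)

The essential property is simply that the secondary never lacks the metadata
for a write the primary has acknowledged - which is what lets it "assume the
identity of the primary" with nothing visible to clients beyond the brief
fail-over pause.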
The trouble is that this paper was written in 1997 and describes the cluster
file system as arriving in one of the later development phases. In other
words, everything above (save possibly the last sentence, which is not
file-system-specific) was vaporware: even the Solaris 8 Operating Environment
paper for the March 2000 release indicates that cluster file system
functionality is a future feature, not present in Solaris 8 or Sun Cluster
2.2. I never did succeed in finding out how the file system works in any Sun
cluster you can buy *today* (or even in the near future), so I assume it's
just what's been available in non-cluster environments (journaling UFS plus
VxFS and other third-party offerings). If and when the above vaporware
materializes, Sun clusters will offer a file system application environment
effectively equivalent to VMS's - which is what I mistakenly believed was
already the case.

Meanwhile, they may only be able to offer NFS access between other cluster
nodes and the node managing the private disks that hold a portion of the
file system - a far cry in both function and performance from existing VMS
facilities. Still, within the performance and functional limitations of NFS,
and allowing for the slower reaction of fail-over vs. a cluster state
transition, that should more or less let application instances on other
cluster nodes continue without having to restart, since the combination of
NFS access and the identification of private data resources by special IP
addresses that can be carried over to the new host should support this.
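To spell out what "carried over to the new host" involves, here is a minimal
sketch of the shared-nothing fail-over sequence itself - again my own
illustration, not anything from Sun or Compaq documentation; the heartbeat
scheme, the timeout value, and all the names and addresses are assumptions
invented for the example.

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before declaring the primary dead

class Node:
    def __init__(self, name, service_ip, services):
        self.name = name
        self.service_ip = service_ip      # the address clients/NFS mounts use
        self.services = services
        self.last_heartbeat = time.time()

    def heartbeat_age(self):
        return time.time() - self.last_heartbeat


def fail_over(primary, backup):
    """Backup node's takeover steps once the primary is presumed dead."""
    # 1. Decide the primary is really gone, not just busy or off the network.
    if primary.heartbeat_age() < HEARTBEAT_TIMEOUT:
        return False
    # 2. Take over the primary's service IP so clients are redirected here.
    backup.service_ip = primary.service_ip
    # 3. Acquire the primary's disks (journal replay or fsck would happen here).
    # 4. Restart the primary's services; only now can requests be answered.
    backup.services = backup.services + list(primary.services)
    return True


if __name__ == "__main__":
    primary = Node("node1", "10.0.0.10", ["nfsd", "dbserver"])
    backup = Node("node2", "10.0.0.11", [])
    primary.last_heartbeat -= 60.0        # simulate a primary that has gone away
    print(fail_over(primary, backup), backup.service_ip, backup.services)

Every one of those steps takes time during which the application is
unavailable; how long that window is, and how visible it is to users, is what
the rest of this exchange is really about.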
So I owe Kerry a partial apology, even though he still missed the mark and
still doesn't understand the concept of fail-over and why it's not a relevant
distinction to attempt to draw.

It's not *application fail-over* that's the issue: Sun clusters can indeed
support applications that run multiple cooperating instances on multiple
nodes, just as they can on a VMS cluster. What Sun clusters don't (yet)
appear able to do is offer a cluster-wide file system with single-system Unix
semantics (rather than the more limited semantics supported by
client/server-style NFS), let alone one that rides through a failed host node
transparently. But if/when they do offer it (as described in the white paper
above), the fact that it's implemented on top of a shared-nothing hardware
architecture instead of a shared-disk hardware architecture won't detract
from its utility: while there are marginal scaling advantages that an
optimally-implemented shared-disk architecture enjoys over an
optimally-implemented shared-nothing architecture, VMS itself is far enough
from an optimal implementation that they will be down in the noise.

In some other areas he may have short-changed Solaris, since in "What's New
in the Solaris 8 Operating Environment" we find:

"Many mid-range to high-end SPARC systems support a long list of hardware
boards that can be changed without shutting down the system, including:
memory and CPU boards, I/O controllers, network interface cards (NICs), disk
drives, and other SCSI devices."

"If it became necessary for Sun's engineers to work with a customer to
diagnose or correct an operating environment bug, technology in the Solaris 8
Operating Environment enables them to patch most areas of the system without
rebooting. These dynamically applied patches re-vector crucial kernel code to
the patched code without interrupting the operation of applications."

These certainly help any single node stay up through many (not all)
situations that would require most systems to be shut down. There's also
mention somewhere of the ability to upgrade incrementally to new release
*features*, which may or may not imply that mixed-version clusters (and hence
rolling upgrades) are supported.

Hope this helps straighten out the record. It's certainly possible that I
misunderstood or missed something, but I'd appreciate solid references for
any additions or corrections: there's been too much hot air blown around
here, and while it may be to Compaq's advantage for its customers to believe
that 'nothing else even comes close' to VMS, it's not necessarily healthy for
those customers.

- bill

Main, Kerry wrote in message
news:910612C07BCAD1119AF40000F86AF0D803B2BB8C@kaoexc4.kao.dec.com...
> Bill,
>
> >> I'm not sure what your recent kick about 'fail-over' is all about. Sun
> clusters do it. VMS clusters do it.<<
>
> I disagree.
>
> I guess it depends on what you mean by fail-over. I define fail-over as
> having to restart application(s) somewhere because the primary system has
> gone away. OpenVMS does not restart applications when a node goes away. The
> applications are already running on another node.
>
> There is a big difference in terms of recovery time, scalability (adding
> entire systems and storage sub-systems as required) and complexity in a
> fail-over solution vs what an OpenVMS Cluster offers.
>
> Imho, the issue relates to the concept of what a "shared nothing"
> architecture offers vs what a "shared everything" architecture offers.
>
> With OpenVMS, it is a simple application db or user re-connect to the same
> cluster alias and that user/application instance continues to run (albeit
> the current transaction would have to be re-entered). The applications are
> already running on all nodes (all accessing the same files), logicals are
> already defined, batch and print queues already defined, users and system
> files all exist on the exact same disk etc.
>
> With a fail-over, shared nothing architecture based solution, the local
> drives are owned exclusively by the local system and "served" to other
> systems. If that local system fails, the backup system has to first notice
> the primary has really gone away (primary not just heavily loaded or a NIC
> has hung up), then it has to gain control (assuming a dual control SCSI or
> some other shared storage adapter) of the disk devices, then it has to
> restart the primary applications on that system before it can start to
> respond to user requests.
>
> During the fail-over period and these reconfig activities, the application
> is unavailable to the business community.
>
> Other considerations for a fail-over "shared nothing" solution which an
> OpenVMS "shared everything" solution does not need to consider:
>
> - OS upgrades and patches, HW (memory add, PCI add) require fail-over which
> means downtime and scheduling it with the user community. OpenVMS can do all
> of these transparently to the end users (load balancing, DNS and cluster
> alias).
>
> With OpenVMS clusters, the concept of SYSTEM availability can be configured
> so that it is separate from the concept of APPLICATION availability.
>
> In other words, who cares if systems are being rebooted in the background as
> long as the application is 100% available? This is not the case with a
> shared nothing fail-over architecture.
>
> - A failover solution needs to restart batch and print queues on the
> secondary system to make it transparent to the application and user
> communities.
>
> - Logicals need to be set up to point to new locations.
>
> - How to automatically update the DNS so that it no longer directs users at
> the failed system - keeping in mind that standard DNS only does round robin
> resolution - no load balancing.
>
> - If the cpu load becomes too great on the primary system, the only HW
> option is to upgrade it, but that means upgrading the backup as well, since
> you usually want similar HW configurations in a fail-over shared nothing
> cluster configuration as the backup server must deal with the entire load of
> the primary.
>
> So, to summarize, fail-over is painful no matter how quickly it is done. For
> proactive reasons, it still requires scheduling with users, and becomes very
> visible.
>
> In addition, many business requirements no longer allow the luxury of not
> counting the scheduled downtime against these vendors' "high availability"
> numbers.
>
> In today's rapidly and extremely dynamic environments brought on by the
> Internet, OS upgrades, HW upgrades (memory, PCI, entire system and storage
> subsystems) and tuning reboots due to hugely changing business loads are a
> fact of life.
>
> I recommend that for any vendor that quotes high availability numbers, ask
> them if those numbers include scheduled downtime for addressing these
> issues.
>
> Regards,
>
> Kerry Main
> Senior Consultant,
> Compaq Canada
> Professional Services
> Voice : 613-592-4660
> FAX : 819-772-7036
> Email : kerry.main@compaq.com
>
>
> -----Original Message-----
> From: Bill Todd [mailto:billtodd@foo.mv.com]
> Sent: Sunday, February 20, 2000 1:20 AM
> To: Info-VAX@Mvb.Saic.Com
> Subject: Re: The Future of VMS?
>
>
> Main, Kerry wrote in message
> news:910612C07BCAD1119AF40000F86AF0D803B2BB86@kaoexc4.kao.dec.com...
>
> ...
>
> > Would you not want a solution that does not use the "F" word (fail-over)
> > for availability whereby OS upgrades, HW adds (memory, PCI) etc all need
> > "SCHEDULED" downtime ?
> >
> > On another thought here is a math question for those solutions which
> > depend on the "F" word for availability and it relates to scalability -
> >
> > "What do you do in a "N" cpu system, when the cpu load becomes "N+1"
> > (assuming N is the number of CPU's in a box). Upgrade ? Ok, but I guess
> > that means you have to do the backup system as well I guess. Now, since we
> > are dealing with failover solutions, I guess this all means more
> > "scheduled" downtime - right?
>
> I'm not sure what your recent kick about 'fail-over' is all about. Sun
> clusters do it. VMS clusters do it. The only configurations that don't are
> the hardware-redundant lock-stepped implementations from Tandem (Integrity,
> I think), Stratus, (used to be) ftVAX, and perhaps others I'm not familiar
> with where the application just keeps on running on the surviving hardware
> without pause.
>
> Fail-over occurs when a failed node's load is taken up by some other
> node(s). It's not transparent to the software running on the failed node.
> It can be transparent to software running outside the failed node that's
> using the cluster as an apparently homogeneous, non-stop (save for the brief
> pause during fail-over) resource, as long as that external software doesn't
> interact with the specific cluster node in any way that holds 'session'
> context specific to that node. It can appear transparent to external
> software that operates through a local stub that handles transitions from
> one cluster node to another even if node-specific session context does
> exist.
>
> Sun clusters and even NT 'clusters' support this kind of fail-over
> essentially the same way VMS clusters do, though the fact that their kind of
> fail-over includes acquisition of privately-owned data resources instead of
> continued use of concurrently-shared data resources can make the transition
> slightly longer - but only by the difference between the duration of a
> cluster state transition (since the VMS file system does not require
> explicit restart recovery operations by virtue of its 'careful update'
> approach to integrity) and the seconds-long (at most) mounting and recovery
> of a journaled file system. Application-level recovery can extend these
> durations in all cases (including VMS).
>
> Are you saying that, e.g., Sun clusters can't add a node to a running
> cluster as load increases, without rebooting all the cluster nodes? That,
> if true, would certainly be a valid point of comparison - but it sounds more
> as if you are comparing a VMS cluster to a Sun SMP box, which is pretty
> much irrelevant (given that Sun supports clustering multiple individual
> nodes rather than just partitions within an SMP - which I'm pretty sure it
> does).
>
> Perhaps one of your points is that Sun doesn't provide release-to-release
> compatibility such that one can perform a 'rolling upgrade' to the OSs on
> the cluster nodes without taking the entire cluster off line. If true, that
> also would be a valid knock on Sun compared to VMS.
>
> Don't make the mistake of assuming that because Sun clusters don't
> concurrently share disk access that only the node that manages a particular
> portion of the cluster data has access to it: other nodes can access it as
> well, just as VMS nodes can access data held on disks private to other
> cluster members if those members export those disks to the cluster (the Sun
> approach exports at the file system rather than the disk level, but this
> difference is transparent to applications).
>
> In sum, I don't think you understand how little functional difference, from
> the application viewpoint, there may be between VMS clusters and Sun
> clusters - and whatever differences may exist (I called out a couple above
> that *might*), they don't involve the concept of 'fail-over', since that's
> the same in both.
>
> - bill
>