From: Bill Todd [billtodd@foo.mv.com]
Sent: Monday, February 21, 2000 6:57 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: The Future of VMS?

There's been so much information of dubious validity floating around on this
issue (Sun clusters vs. VMS clusters) that I spent some time over at sun.com
today trying to raise the quality level. Turns out it's not all that easy.

For example, in "Sun Clusters A White Paper" we find the following:

"The fact that all nodes share access to all files and devices lets system
administrators readily move resources around on the cluster. The programming
interface for files and devices is global across a cluster, so moving a
physical resource doesn't involve having to change any software."

"The server sends "checkpoint" messages (meta data) to the secondary server
for each write operation to guarantee the integrity of data in case of
failover... If the primary server fails for any reason, the secondary server
assumes the identity of the primary, and the framework automatically
redirects requests so that replication and failover are transparent to the
client."

(The 'client' in this case is some other cluster node accessing the files as
if they were local: the client's accesses continue transparently, without
interruption other than the brief pause for the server fail-over.)

"The Sun cluster file system is fast. It is implemented with a caching scheme
that makes access to data from remote files almost as fast as local access.
The file system is implemented flexibly enough that it can be used with a
range of cluster interconnects. So in the case of the SCI-based Sun Cluster
Channel, it can take advantage of remote Direct Memory Access (DMA) to avoid
buffering information on the remote node and do a direct memory access back
to the requesting node."

"All data written on the cluster file system is mirrored for higher data
availability. This fact, combined with the ability of any node to access any
file means that software installed once anywhere on the cluster is accessible
by any other node on the cluster - even if the original node that installed
it is not online. This feature permits the cluster to have global work
queues, global print queues, and a global mail repository."

"All the nodes in a given cluster can share a single logical host name, and a
global IP address."
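For anyone who wants a concrete picture of the checkpoint scheme described in
the second quote above, here is a minimal sketch of what it amounts to. It is
my own illustration in Python, not Sun code: every class, method, and field
name is invented, and it ignores networking, durability, and everything else
that makes the real implementation hard.

# Sketch only: the primary records each write's metadata on the secondary
# before acknowledging the write, so the secondary can assume the primary's
# identity after a failure without losing any acknowledged state.

class Secondary:
    def __init__(self):
        self.checkpoints = []          # metadata for every acknowledged write
        self.is_primary = False

    def receive_checkpoint(self, meta):
        # In the real scheme this would be made durable before the primary
        # acknowledges the client's write.
        self.checkpoints.append(meta)

    def take_over(self):
        # Fail-over: adopt the primary's identity; requests get redirected here.
        self.is_primary = True
        return self.checkpoints


class Primary:
    def __init__(self, secondary):
        self.secondary = secondary
        self.data = {}

    def write(self, path, payload):
        # Ship the checkpoint (metadata only) to the secondary first...
        self.secondary.receive_checkpoint({"path": path, "length": len(payload)})
        # ...then apply the write locally and acknowledge it.
        self.data[path] = payload
        return "ack"


if __name__ == "__main__":
    backup = Secondary()
    server = Primary(backup)
    server.write("/global/app/config", b"x" * 128)
    # Simulated primary failure: the secondary already holds the metadata for
    # every acknowledged write, so it can take over consistently.
    print(backup.take_over(), backup.is_primary)

The essential property is simply that the secondary never lacks the metadata
for a write the primary has acknowledged - which is what lets it "assume the
identity of the primary" with nothing visible to clients beyond the brief
fail-over pause.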
The trouble is that this paper was written in 1997 and describes the cluster
file system as arriving in one of the later development phases. In other
words, everything above (save possibly the last sentence, which is not
file-system-specific) was vaporware: even the Solaris 8 Operating Environment
paper for the March 2000 release indicates that cluster file system
functionality is a future feature, not present in Solaris 8 or Sun Cluster
2.2. I never did succeed in finding out how the file system works in any Sun
cluster you can buy *today* (or even in the near future), so I assume it's
just what's been available in non-cluster environments (journaling UFS plus
VxFS and other third-party offerings). If and when the above vaporware
materializes, Sun clusters will offer a file system application environment
effectively equivalent to VMS's - which is what I mistakenly believed was
already the case.

Meanwhile, they may only be able to offer NFS access between other cluster
nodes and the node managing the private disks that hold a portion of the
file system - a far cry in both function and performance from existing VMS
facilities. Still, within the performance and functional limitations of NFS,
and allowing for the slower reaction of fail-over vs. a cluster state
transition, that should more or less let application instances on other
cluster nodes continue without having to restart, since the combination of
NFS access and the identification of private data resources by special IP
addresses that can be carried over to the new host should support this.
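To spell out what "carried over to the new host" involves, here is a minimal
sketch of the shared-nothing fail-over sequence itself - again my own
illustration, not anything from Sun or Compaq documentation; the heartbeat
scheme, the timeout value, and all the names and addresses are assumptions
invented for the example.

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before declaring the primary dead

class Node:
    def __init__(self, name, service_ip, services):
        self.name = name
        self.service_ip = service_ip      # the address clients/NFS mounts use
        self.services = services
        self.last_heartbeat = time.time()

    def heartbeat_age(self):
        return time.time() - self.last_heartbeat


def fail_over(primary, backup):
    """Backup node's takeover steps once the primary is presumed dead."""
    # 1. Decide the primary is really gone, not just busy or off the network.
    if primary.heartbeat_age() < HEARTBEAT_TIMEOUT:
        return False
    # 2. Take over the primary's service IP so clients are redirected here.
    backup.service_ip = primary.service_ip
    # 3. Acquire the primary's disks (journal replay or fsck would happen here).
    # 4. Restart the primary's services; only now can requests be answered.
    backup.services = backup.services + list(primary.services)
    return True


if __name__ == "__main__":
    primary = Node("node1", "10.0.0.10", ["nfsd", "dbserver"])
    backup = Node("node2", "10.0.0.11", [])
    primary.last_heartbeat -= 60.0        # simulate a primary that has gone away
    print(fail_over(primary, backup), backup.service_ip, backup.services)

Every one of those steps takes time during which the application is
unavailable; how long that window is, and how visible it is to users, is what
the rest of this exchange is really about.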
So I owe Kerry a partial apology, even though he still missed the mark and
still doesn't understand the concept of fail-over and why it's not a relevant
distinction to attempt to draw.

It's not *application fail-over* that's the issue: Sun clusters can indeed
support applications that run multiple cooperating instances on multiple
nodes, just as they can on a VMS cluster. What Sun clusters don't (yet)
appear able to do is offer a cluster-wide file system with single-system Unix
semantics (rather than the more limited semantics supported by
client/server-style NFS), let alone one that rides through a failed host node
transparently. But if/when they do offer it (as described in the white paper
above), the fact that it's implemented on top of a shared-nothing hardware
architecture instead of a shared-disk hardware architecture won't detract
from its utility: while there are marginal scaling advantages that an
optimally-implemented shared-disk architecture enjoys over an
optimally-implemented shared-nothing architecture, VMS itself is far enough
from an optimal implementation that they will be down in the noise.

In some other areas he may have short-changed Solaris, since in "What's New
in the Solaris 8 Operating Environment" we find:

"Many mid-range to high-end SPARC systems support a long list of hardware
boards that can be changed without shutting down the system, including:
memory and CPU boards, I/O controllers, network interface cards (NICs), disk
drives, and other SCSI devices."

"If it became necessary for Sun's engineers to work with a customer to
diagnose or correct an operating environment bug, technology in the Solaris 8
Operating Environment enables them to patch most areas of the system without
rebooting. These dynamically applied patches re-vector crucial kernel code to
the patched code without interrupting the operation of applications."

These certainly help any single node stay up through many (not all)
situations that would require most systems to be shut down. There's also
mention somewhere of the ability to upgrade incrementally to new release
*features*, which may or may not imply that mixed-version clusters (and hence
rolling upgrades) are supported.

Hope this helps straighten out the record. It's certainly possible that I
misunderstood or missed something, but I'd appreciate solid references for
any additions or corrections: there's been too much hot air blown around
here, and while it may be to Compaq's advantage for its customers to believe
that 'nothing else even comes close' to VMS, it's not necessarily healthy for
those customers.

- bill

Main, Kerry wrote in message
news:910612C07BCAD1119AF40000F86AF0D803B2BB8C@kaoexc4.kao.dec.com...
> Bill,
>
> >> I'm not sure what your recent kick about 'fail-over' is all about. Sun
> clusters do it. VMS clusters do it.<<
>
> I disagree.
>
> I guess it depends on what you mean by fail-over. I define fail-over as
> having to restart application(s) somewhere because the primary system has
> gone away. OpenVMS does not restart applications when a node goes away. The
> applications are already running on another node.
>
> There is a big difference in terms of recovery time, scalability (adding
> entire systems and storage sub-systems as required) and complexity in a
> fail-over solution vs what an OpenVMS Cluster offers.
>
> Imho, the issue relates to the concept of what a "shared nothing"
> architecture offers vs what a "shared everything" architecture offers.
>
> With OpenVMS, it is a simple application db or user re-connect to the same
> cluster alias and that user/application instance continues to run (albeit
> the current transaction would have to be re-entered). The applications are
> already running on all nodes (all accessing the same files), logicals are
> already defined, batch and print queues already defined, users and system
> files all exist on the exact same disk etc.
>
> With a fail-over, shared nothing architecture based solution, the local
> drives are owned exclusively by the local system and "served" to other
> systems. If that local system fails, the backup system has to first notice
> the primary has really gone away (primary not just heavily loaded or a NIC
> has hung up), then it has to gain control (assuming a dual control SCSI or
> some other shared storage adapter) of the disk devices, then it has to
> restart the primary applications on that system before it can start to
> respond to user requests.
>
> During the fail-over period and these reconfig activities, the application
> is unavailable to the business community.
>
> Other considerations for a fail-over "shared nothing" solution which an
> OpenVMS "shared everything" solution does not need to consider:
>
> - OS upgrades and patches, HW (memory add, PCI add) require fail-over which
> means downtime and scheduling it with the user community. OpenVMS can do all
> of these transparently to the end users (load balancing, DNS and cluster
> alias).
>
> With OpenVMS clusters, the concept of SYSTEM availability can be configured
> so that it is separate from the concept of APPLICATION availability.
>
> In other words, who cares if systems are being rebooted in the background as
> long as the application is 100% available? This is not the case with a
> shared nothing fail-over architecture.
>
> - A failover solution needs to restart batch and print queues on the
> secondary system to make it transparent to the application and user
> communities.
>
> - Logicals need to be set up to point to new locations.
>
> - How to automatically update the DNS so that it no longer directs users at
> the failed system - keeping in mind that standard DNS only does round robin
> resolution - no load balancing.
>
> - If the cpu load becomes too great on the primary system, the only HW
> option is to upgrade it, but that means upgrading the backup as well, since
> you usually want similar HW configurations in a fail-over shared nothing
> cluster configuration as the backup server must deal with the entire load of
> the primary.
>
> So, to summarize, fail-over is painful no matter how quickly it is done. For
> proactive reasons, it still requires scheduling with users, and becomes very
> visible.
>
> In addition, many business requirements no longer allow the luxury of not
> counting the scheduled downtime against these vendors' "high availability"
> numbers.
>
> In today's rapidly and extremely dynamic environments brought on by the
> Internet, OS upgrades, HW upgrades (memory, PCI, entire system and storage
> subsystems) and tuning reboots due to hugely changing business loads are a
> fact of life.
>
> I recommend that for any vendor that quotes high availability numbers, ask
> them if those numbers include scheduled downtime for addressing these
> issues.
>
> Regards,
>
> Kerry Main
> Senior Consultant,
> Compaq Canada
> Professional Services
> Voice : 613-592-4660
> FAX : 819-772-7036
> Email : kerry.main@compaq.com
>
>
> -----Original Message-----
> From: Bill Todd [mailto:billtodd@foo.mv.com]
> Sent: Sunday, February 20, 2000 1:20 AM
> To: Info-VAX@Mvb.Saic.Com
> Subject: Re: The Future of VMS?
>
>
> Main, Kerry wrote in message
> news:910612C07BCAD1119AF40000F86AF0D803B2BB86@kaoexc4.kao.dec.com...
>
> ...
>
> > Would you not want a solution that does not use the "F" word (fail-over)
> > for availability whereby OS upgrades, HW adds (memory, PCI) etc all need
> > "SCHEDULED" downtime ?
> >
> > On another thought here is a math question for those solutions which
> > depend on the "F" word for availability and it relates to scalability -
> >
> > "What do you do in a "N" cpu system, when the cpu load becomes "N+1"
> > (assuming N is the number of CPU's in a box). Upgrade ? Ok, but I guess
> > that means you have to do the backup system as well I guess. Now, since we
> > are dealing with failover solutions, I guess this all means more
> > "scheduled" downtime - right?
>
> I'm not sure what your recent kick about 'fail-over' is all about. Sun
> clusters do it. VMS clusters do it. The only configurations that don't are
> the hardware-redundant lock-stepped implementations from Tandem (Integrity,
> I think), Stratus, (used to be) ftVAX, and perhaps others I'm not familiar
> with where the application just keeps on running on the surviving hardware
> without pause.
>
> Fail-over occurs when a failed node's load is taken up by some other
> node(s). It's not transparent to the software running on the failed node.
> It can be transparent to software running outside the failed node that's
> using the cluster as an apparently homogeneous, non-stop (save for the brief
> pause during fail-over) resource, as long as that external software doesn't
> interact with the specific cluster node in any way that holds 'session'
> context specific to that node. It can appear transparent to external
> software that operates through a local stub that handles transitions from
> one cluster node to another even if node-specific session context does
> exist.
>
> Sun clusters and even NT 'clusters' support this kind of fail-over
> essentially the same way VMS clusters do, though the fact that their kind of
> fail-over includes acquisition of privately-owned data resources instead of
> continued use of concurrently-shared data resources can make the transition
> slightly longer - but only by the difference between the duration of a
> cluster state transition (since the VMS file system does not require
> explicit restart recovery operations by virtue of its 'careful update'
> approach to integrity) and the seconds-long (at most) mounting and recovery
> of a journaled file system. Application-level recovery can extend these
> durations in all cases (including VMS).
>
> Are you saying that, e.g., Sun clusters can't add a node to a running
> cluster as load increases, without rebooting all the cluster nodes? That,
> if true, would certainly be a valid point of comparison - but it sounds more
> as if you are comparing a VMS cluster to a Sun SMP box, which is pretty
> much irrelevant (given that Sun supports clustering multiple individual
> nodes rather than just partitions within an SMP - which I'm pretty sure it
> does).
>
> Perhaps one of your points is that Sun doesn't provide release-to-release
> compatibility such that one can perform a 'rolling upgrade' to the OSs on
> the cluster nodes without taking the entire cluster off line. If true, that
> also would be a valid knock on Sun compared to VMS.
>
> Don't make the mistake of assuming that because Sun clusters don't
> concurrently share disk access that only the node that manages a particular
> portion of the cluster data has access to it: other nodes can access it as
> well, just as VMS nodes can access data held on disks private to other
> cluster members if those members export those disks to the cluster (the Sun
> approach exports at the file system rather than the disk level, but this
> difference is transparent to applications).
>
> In sum, I don't think you understand how little functional difference, from
> the application viewpoint, there may be between VMS clusters and Sun
> clusters - and whatever differences may exist (I called out a couple above
> that *might*), they don't involve the concept of 'fail-over', since that's
> the same in both.
>
> - bill
>