From: CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP 21-SEP-1988 02:22
To: ARISIA::EVERHART
Subj: Thoughts On Dump Files. (Creation, Management And Analysis Of)...

Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Tue, 20 SEP 88 21:30:23 PDT
Received: from central.cis.upenn.edu by KL.SRI.COM with TCP; Tue, 20 Sep 88 21:15:13 PDT
Received: from LINC.CIS.UPENN.EDU by central.cis.upenn.edu
    id AA07047; Tue, 20 Sep 88 22:02:29 EST
Received: from XRT.UPENN.EDU by linc.cis.upenn.edu
    id AA27511; Tue, 20 Sep 88 22:02:21 EDT
Posted-Date: Tue, 20 Sep 88 22:03 EDT
Message-Id: <8809210202.AA27511@linc.cis.upenn.edu>
Date: Tue, 20 Sep 88 22:03 EDT
From: "Clayton, Paul D."
Subject: Thoughts On Dump Files. (Creation, Management And Analysis Of)...
To: INFO-VAX@KL.SRI.COM
X-Vms-To: @INFOVAX,CLAYTON

There are up to three VMS SYSGEN parameters that indicate how VMS will
perform a memory dump in the event of a fatal error. These three are
listed below.

SAVEDUMP   - ignored if the dump is being written to
             SYS$SYSTEM:SYSDUMP.DMP; otherwise the following actions
             are taken.

           - when set to '0', and the crash dump is written to the
             system page file, SYS$SYSTEM:PAGEFILE.SYS, because there
             is no separate dump file, SYS$SYSTEM:SYSDUMP.DMP, the dump
             will NOT be kept on the next system boot. This eliminates
             any chance of analyzing it to determine what went wrong.

           - when set to '1', and the crash dump is written to the
             system page file, it is mandatory to perform the following
             step as part of the next system boot.

                 $ANALYZE/CRASH COPY SYS$SYSTEM:PAGEFILE.SYS XXX.YY

                 where: XXX.YY is the file to use for storing the dump

             This should be done first thing in the boot process, so
             that the space in the page file can be used for its
             intended purpose. If this is not done, the space is not
             available, and the system may not be able to boot
             completely without problems.

             For those still running V3 of VMS, note that a bug in the
             COPY command resulted in the SP register being increased
             by eight (8) for each COPY command issued on the same dump
             file. This was corrected in V4 of VMS.

DUMPBUG    - when set to '0', the memory contents, the processor
             status, and the error log buffers at the time of shutdown
             are NOT written to a file for later analysis. In other
             words, there will be nothing to help prevent a problem
             from recurring, since this information is not available.

           - when set to '1', a memory dump, the current processor
             status, and the error log buffers are written to one of
             two places. If the file SYS$SYSTEM:SYSDUMP.DMP was present
             at boot time, it is the first choice to write to. If this
             file is not present, the information is written to the
             primary system page file, SYS$SYSTEM:PAGEFILE.SYS.

DUMPSTYLE  - when set to '0', the entire amount of physical memory is
(VMS V5      written to the dump file, whichever one is used. This
 only)       results in the same actions as those taken under VMS V4.

           - when set to '1', only those portions of physical memory
             that were marked as 'valid' at the time of the shutdown
             are written to the dump file. Physical memory used by and
             allocated to VMS is written first; then, if there is more
             space in the dump file, memory taken by user processes is
             written as well. If the dump file is too small to hold the
             pages VMS is using, no later analysis of anything can be
             performed. It should be noted that you have no control
             over which user processes get written to the dump and
             which do not, in the event the dump file is too small to
             hold everything.

There are several important issues to understand here. Problems with
any one of them can result in useless, or no, information to aid in
future analysis.

1.  System dump files are not created on the 'fly' like those for
    terminal servers, the job controller and printer symbionts. These
    files have to be created and maintained by the system manager.

2.  The dump file, under VMS 4, has to be at least the size given by
    the following equation.

        # of physical pages of memory + 4

    It does not matter whether the dump file is the primary system page
    file or the separate SYS$SYSTEM:SYSDUMP.DMP file; the required size
    is the same. It does not hurt anything if the file is larger than
    this value. If the page file is used, then the primary page file
    must be at least this big; all other page files are ignored for
    this purpose.

    The bottom line here is that under V4, you have to save all the
    information to be able to use any of it.

3.  Under VMS 5, the dump file does not have to be big enough to hold
    all the information, regardless of which file is used. The amount
    of information that is stored is determined by the size of the
    file. If the file is too small, the particular piece of information
    needed to determine the exact cause of the crash may not be
    available, or the contents of the dump file could be totally
    useless. It should also be noted that you have no control over
    which parts are saved when a 'compressed' dump is performed.

    A good starting point for sizing a very usable 'compressed' dump
    file is to average the maximum amount of physical memory USED over
    a given time period, then add between 2,000 and 7,000 to this. An
    increase in system usage could result in this value changing over
    time.

    The bottom line here is that a partial dump is supported, and an
    analysis can be performed on the parts that are saved.

4.  If there is no SYS$SYSTEM:SYSDUMP.DMP file present at the time of
    the boot, then the dump will be written to the primary page file.
    If the file SYS$SYSTEM:SYSDUMP.DMP is created after the boot, it
    will not be used until after the system has been rebooted and is
    then taken down again.

5.  The file SYS$SYSTEM:SYSDUMP.DMP is not kept open by VMS during the
    course of normal system operation.
    In other words, issuing the command

        $SHOW DEVICE/FILE SYS$SYSDEVICE

    would not show the file to be open. This also means that if the DCL
    command DELETE is issued against this file, it will not report any
    error about the file being 'locked' by another user, and the file
    will in fact be deleted from the disk.

    This is a problem area that must be avoided at all costs. If this
    file exists at boot time and is then deleted, for whatever reason,
    and the system is later shut down or crashes, the system disk is in
    all likelihood corrupted. The amount of corruption largely depends
    on how heavily the disk is used for purposes other than the VMS
    operating system itself.

    When VMS wants to write the dump file, 'normal' VMS disk I/O
    operations are not used. The bootstrap device driver, which is a
    bare bones subsystem, is used to write the information to the file.
    The true starting logical block number of the dump file, which is
    stored by VMS at boot time, is used to locate where to write the
    information. No directory lookups are performed to 'find' the file,
    and the information is written directly.

    The implication is that no checks are made during shutdown to
    determine whether the dump file present at boot time is still
    around at system shutdown time. If it was deleted and the system
    continued operation, then the space that the dump file had reserved
    would be used, as needed, for new files on the disk. These new
    files, and maybe some file headers themselves, WILL be overwritten
    when it comes time for VMS to write to the dump file it knew of at
    the previous boot.

6.  In the event that the dump file is deleted, by whatever cause and
    for whatever reason, the only safe way to bring down the VAX
    processor that was to use that dump file is to HALT the machine. Do
    not do a normal shutdown or '@CRASH'. You can dismount the disks
    yourself, provided that no open files are present, and stop the
    queue manager before this drastic action is taken.

    A new version of the dump file should also be created before the
    machine is halted and rebooted. Note that this new dump file will
    not be used for this system shutdown and does not replace the
    deleted dump file for that purpose. The new dump file will be used
    the next time the system is shut down.

7.  In order to conserve space on the system disk, the dump file(s) can
    be shared between processors by placing the dump file in
    SYS$COMMON:[SYSEXE] instead of the usual node-specific
    SYS$SPECIFIC:[SYSEXE] directory. While this can save considerable
    space, there are several drawbacks to this method.

    a.  This only works on 'Cluster Common System Disks', which are
        used for both CI and NI based VAXClusters, and only for
        processors that are using the same disk as their system disk.

    b.  Under VMS V4, and under V5 when compression is disabled, the
        dump file has to be large enough to accommodate the largest
        memory size of any single processor in the group that is
        sharing it. Under VMS 5 with compression enabled, the size of
        the dump file has to be the 'best guess' of what will hold all
        the needed information.

    c.  If multiple VAX processors that are sharing a dump file crash
        or otherwise come down, the contents of the dump file are
        questionable. The Distributed Lock Manager is not used when the
        information is written out, so the result may be a 'mixture' of
        several processors, which renders it unusable for later
        analysis.

    If the above conditions are acceptable, then the dump file can be
    shared. In order to share the dump file, pick the largest one and
    issue the following command.

        $RENAME/LOG SYS$SPECIFIC:[SYSEXE]SYSDUMP.DMP SYS$COMMON:[SYSEXE]*

    This command should be issued while logged into the machine with
    the largest dump file. Note that the RENAME command does not 'move'
    the file header or the file itself, so the system whose dump file
    is being renamed can still be taken down normally without
    corrupting the system disk. The other processors that are to share
    the dump file should be taken down normally. Their dump files, in
    the directory

        disk:[SYSx.SYSEXE]SYSDUMP.DMP

        where: disk: is the disk that the group uses as their system
                     disk
               x     is the 'root number' for the processor(s) that no
                     longer need a dump file specific to them

    can then be deleted from any remaining VAX processor(s) in the
    VAXCluster.

8.  The 'compression' feature that is available under VMS V5 will save
    disk space as well, but the size of the dump file needed to hold
    the information for a complete analysis can change from one crash
    to the next. It depends on the problem that caused the system to
    crash in the first place.

9.  If there are problems with the dump file (it is not present or was
    deleted) and/or the page file dump was not saved, then the ERRLOG
    buffers as they stood at the time of the crash are also not
    available for analysis. These error buffers may hold vital
    information about the hardware failure that actually caused the
    problem. In the event of a good system dump and a later reboot, the
    contents of the error buffers from the dump file are written to the
    ERRLOG process for recording purposes and later use by maintenance
    personnel.

Hope this helps some in understanding how dump files work, and how to
manage them in the future. :-)

pdc

Paul D. Clayton
Address - CLAYTON%XRT@RELAY.UPENN.EDU

Disclaimer: All thoughts and statements here are my own and NOT those
of my employer, and are also not based on, or contain, restricted
information.
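
As an illustration of the sizing rules in items 2, 3 and 7b, the
arithmetic can be written out in a few lines. What follows is only a
rough sketch in Python, not something that runs on the VAX itself. The
function names, the sample page counts and the default margin are all
made up for the example and are not part of VMS; the only figures taken
from the text are the "+ 4" and the 2,000 to 7,000 range.

```python
# Rough sketch of the dump file sizing rules from items 2, 3 and 7b.
# Every name below is illustrative; none of this is a VMS interface.

FULL_DUMP_EXTRA_PAGES = 4  # the "+ 4" in the item 2 equation


def full_dump_size(physical_pages):
    """Item 2: size needed for a complete dump (V4, or V5 uncompressed)."""
    return physical_pages + FULL_DUMP_EXTRA_PAGES


def shared_full_dump_size(physical_pages_per_processor):
    """Item 7b: a shared file must fit the largest processor in the group."""
    return full_dump_size(max(physical_pages_per_processor))


def compressed_dump_estimate(max_used_samples, margin=2000):
    """Item 3: average the observed maximum memory use, then add a margin.

    The margin is the 2,000 to 7,000 suggested in item 3, in the same
    units as the samples. This is a starting point, not a guarantee.
    """
    if not 2000 <= margin <= 7000:
        raise ValueError("margin should be between 2,000 and 7,000")
    average = sum(max_used_samples) // len(max_used_samples)
    return average + margin


if __name__ == "__main__":
    # Hypothetical three-processor group; the page counts are invented.
    nodes = [32768, 65536, 16384]
    print(full_dump_size(65536))
    print(shared_full_dump_size(nodes))
    print(compressed_dump_estimate([30000, 34000, 32000]))
```

The compressed estimate is deliberately a separate function: as items 3
and 8 point out, that number is a guess that has to be revisited as
system usage changes, while the full dump size follows directly from
the amount of physical memory.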