From: MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX" 28-OCT-1992 09:56:30.33
To:   INFO-VAX@KL.SRI.COM
CC:
Subj: RE: How to create pagefile on "Too fragmented" disk

Having just worked out the following work-around (and having decided that
SYSGEN CREATE and/or the Files-11 XQP on VMS 5.5-1 is somewhat broken), I
thought I'd pass it on.

Problem: system wedged at 4.30 pm with full pagefile. Plenty-ish space on
the new 2Gb disk, so the obvious cure: create a second pagefile before it
wedges again.

$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=80000

comes back with "File only partly created - disk may be too fragmented", or
something like that. BUT IT ISN'T. Investigating with a fragmentation
analysis tool tells me that an 80000-block file can easily be created in a
mere 12 extents. DUMP/HEADER shows me that after grabbing four decent-sized
extents, the rest of the too-small file is full of tiny fragments, right
down to single-cluster ones. AAAAARGH!

After trying a lot of things that didn't work - like creating a small
pagefile first in an attempt to sweep the 'junk' file extents into that
one, or creating a heap of small files with the same in mind, or using
$SET RMS /EXTEND=6000 in case that stopped use of the tiny extents - I
thought "well, if it gets the first few extents right, let's sneak up on
it". There were plenty of 8000-block extents available, so I tried:

$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=8000
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=16000
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=24000
  ....
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=80000
$

Voila! DUMP shows allocation of 19 extents: not as good as the 12 that a
human could have done, but not too bad, and a helluva lot better than
having to take the system down at almost no notice to backup and restore a
2Gb disk :-)

DEC - I don't care what the reason is, it's BROKEN.

I'm not so sure about that. Here's a guess as to what happened:

Among the many caches in the file system is an extent cache. When the XQP
looks for extents to extend a file, it starts with the extent cache. If the
extent cache runs out, it falls back on the disk allocation bitmap - a much
slower operation.

In a cluster, each member keeps its own extent cache. From the point of
view of any one node, the extents in other members' caches are already
allocated. This allows a member to use blocks from its own extent cache
without worrying about what the other cluster members are doing. It's
possible for a member to tell the other members to flush their extent
caches, thus making the pre-allocated extents effectively available again.
The full details of these algorithms are complex and I've never looked into
them (never had any reason to). In fact, there is both an extent cache and
a bitmap cache, and I'm not sure what the distinction is - perhaps the
extent cache contains a list of free extents while the bitmap cache is a
copy of part of the bitmap. I don't know. But the intent is clear: to
improve I/O system performance by cutting down on disk I/O to the bitmap
file, and even on messages among cluster members.

A side-effect of caches, however, is that no one node in a cluster has full
knowledge of the free space on a disk (which is why $GETDVI doesn't return
correct information, only an approximation - SHOW DEVICE does some magic to
ask around the cluster). Even in a single-node system, the CPU, looking
only at its caches, doesn't have a complete picture of the disk.
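(As a rough illustration of that last point - just a sketch using the
commands already mentioned; F$GETDVI is the DCL lexical interface to
$GETDVI, and the exact numbers will of course differ on your system:)

$ WRITE SYS$OUTPUT F$GETDVI("SYS$SYSDEVICE:","FREEBLOCKS")  ! this node's estimate
$ SHOW DEVICE SYS$SYSDEVICE:                                ! does the cluster-wide "magic"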
In either case, it can get the information - but that costs I/O operations,
CPU, and time.

As a result, it is not necessarily true that the XQP will find anything
like the optimal set of extents when allocating a file. Consider what it
would have to do to accomplish this. It would repeatedly have to find the
largest remaining extent on the disk and allocate it to the new file until
the file had reached its target size. On a single node, this would require
repeated searches of the bitmap (which could be optimized by keeping track
of, say, the top 5 extents found on each search - but what if another
process frees more disk space in the meantime? These searches can take
quite some time, and we don't want to lock everyone else out of the disk.)
In a cluster, you're talking about repeated requests to all nodes.
Performance would be terrible.

What I'm pretty sure is going on is very simple: the node you are
allocating the pagefile on is managing to allocate it entirely out of its
caches. To do so, though, it has to get down to "the dregs" of its caches,
the many small fragments. Creating a small pagefile, or many small files,
doesn't help much because they don't change the distribution of fragment
sizes in the caches by much - there are plenty of small fragments to go
around. In fact, if I were designing such a cache, I'd deliberately aim for
a distribution of cached fragment sizes that includes a few large
fragments, more medium-sized ones, and plenty of small ones. After all,
most requests are for fairly small chunks, and if you fill the cache with
the largest extents on the disk, you'll just end up fragmenting them
rapidly to fulfill the small requests - bad policy.

On the other hand, growing the pagefile in large, manual chunks probably
helps because, each time you grow it, you seriously deplete the cache - in
particular, you use up all the large extents in it - so the XQP sets out to
refill it (from the disk or from other cluster nodes). Because as a human
being you enter the next command an eon later, it has plenty of time to do
so.

Should the XQP do better? What does "better" mean? How much performance at
file allocation time are you willing to trade for a less fragmented file?
How often will this make a difference? The pagefile is an unusual case:
it's a large file that has to be fully mapped in a limited number of
extents. For the typical 80000-block file, you'll never see the difference
in performance between 10 extents and 50. How much effort should be put
into an optimization that hardly ever makes a difference?

Note that you can probably affect the tradeoff a bit by using a larger
cache. The larger cache, in effect, keeps local more of the global
information. Of course, in a cluster, if EVERYONE'S cache grows, this
doesn't help much.

You may suggest that SYSGEN should be smarter, since it KNOWS that it is
creating a pagefile. But SYSGEN isn't doing the allocation - the XQP is.
(The RMS /EXTEND quantity is irrelevant, BTW, because it controls how much
should be added to an existing file when it grows - but SYSGEN certainly
creates the file at the target size. It's also quite possible that SYSGEN
doesn't bother with RMS for this, since it will have to do direct XQP calls
anyway to check the fragmentation of the resulting file.) To do any better,
SYSGEN would have to get into the innards of the file system in some way.
Again, how much effort is appropriate for what is a rare special-case
situation?
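(For what it's worth, the "sneak up on it" workaround above could be
automated with a few lines of DCL. This is only a sketch: it assumes, as in
the original posting, that repeating SYSGEN CREATE with a larger /SIZE
extends the existing file, and the 8000-block step and 30-second pause -
meant to give the XQP time to refill its extent cache from the bitmap or
from other cluster members - are guesses on my part, not anything
documented.)

$ size = 8000
$ grow_loop:
$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE='size'
$ IF size .GE. 80000 THEN GOTO done    ! target size from the posting
$ WAIT 00:00:30                        ! let the XQP refill its caches (guess)
$ size = size + 8000
$ GOTO grow_loop
$ done:
$ EXIT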
BTW, when you couldn't create one 80000-block pagefile, did you consider
creating two 40000-block pagefiles, or three 27000-block pagefiles? Since
you indicate that there were plenty of 8000-block free extents available,
and that your 80000-block file got four of them (i.e., there are about four
in the cache), you should have had no trouble at all creating the three
files, and you probably could have gotten away with the two files. In
effect, by using multiple pagefiles, you would be multiplying the total
number of extents that can be used for all pagefiles in the system. These
days, the cost of additional pagefiles is minimal, since a process can be
split among several.

And by the way, when are you going to support disk partitions and/or
VDDRIVER?? With the latest disks coming in at several Gbytes, we need them!

What problem would they solve - especially, what problem related to this
issue? (I'm not saying this is a bad idea - I especially think VDDRIVER is
a fine idea, but I'm not so sure that disk partitions are worth the
headache - but how about some justification?)

(BTW, not to start a flame war, but did you ever notice that when people
talk about all the things Unix has, they always include all the free stuff
you can pull off the net - but when they talk about VMS, they only talk
about what is officially supported? The vast bulk of free software, for ANY
system out there, can't be compared in quality, documentation, or available
support to VDDRIVER. Hell, most COMMERCIAL software isn't as good. If
VDDRIVER solves a problem for you, why will it solve it any better if it
comes on the VMS tape?)

-- Jerry