From: MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX" 28-OCT-1992 09:56:30.33
To:   INFO-VAX@KL.SRI.COM
CC:
Subj: RE: How to create pagefile on "Too fragmented" disk

Having just worked out the following work-around (and having decided that
SYSGEN CREATE and/or the Files-11 XQP on VMS 5.5-1 is somewhat broken), I
thought I'd pass it on.

Problem: system wedged at 4.30 pm with full pagefile. Plenty-ish space on
the new 2Gb disk, so the obvious cure: create a second pagefile before it
wedges again.

$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=80000

comes back with "File only partly created - disk may be too fragmented", or
something like that. BUT IT ISN'T. Investigating with a fragmentation
analysis tool tells me that an 80000-block file can easily be created in a
mere 12 extents. DUMP/HEADER shows me that after grabbing four decent-sized
extents, the rest of the too-small file is full of tiny fragments, right
down to single-cluster ones. AAAAARGH!

After trying a lot of things that didn't work - like creating a small
pagefile first in an attempt to sweep the 'junk' file extents into that
one, or creating a heap of small files with the same in mind, or using
$SET RMS /EXTEND=6000 in case that stopped use of the tiny extents - I
thought "well, if it gets the first few extents right, let's sneak up on
it". There were plenty of 8000-block extents available, so I tried:

$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=8000
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=16000
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=24000
  ....
  CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE=80000
$

Voila! DUMP shows allocation of 19 extents: not as good as the 12 that a
human could have done, but not too bad, and a helluva lot better than
having to take the system down at almost no notice to backup and restore a
2Gb disk :-)

DEC - I don't care what the reason is, it's BROKEN.

I'm not so sure about that. Here's a guess as to what happened:

Among the many caches in the file system is an extent cache. When the XQP
looks for extents to extend a file, it starts with the extent cache. If the
extent cache runs out, it falls back on the disk allocation bitmap - a much
slower operation.

In a cluster, each member keeps its own extent cache. From the point of
view of any one node, the extents in other members' caches are already
allocated. This allows a member to use blocks from its own extent cache
without worrying about what the other cluster members are doing. It's
possible for a member to tell the other members to flush their extent
caches, thus making the pre-allocated extents effectively available again.
The full details of these algorithms are complex and I've never looked into
them (never had any reason to). In fact, there is both an extent cache and
a bitmap cache, and I'm not sure what the distinction is - perhaps the
extent cache contains a list of free extents while the bitmap cache is a
copy of part of the bitmap. I don't know. But the intent is clear: to
improve I/O system performance by cutting down on disk I/O to the bitmap
file, and even on messages among cluster members.

A side-effect of caches, however, is that no one node in a cluster has full
knowledge of the free space on a disk (which is why $GETDVI doesn't return
correct information, only an approximation - SHOW DEVICE does some magic to
ask around the cluster). Even in a single-node system, the CPU, looking
only at its caches, doesn't have a complete picture of the disk.
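(As a rough illustration of that last point - just a sketch using the
commands already mentioned; F$GETDVI is the DCL lexical interface to
$GETDVI, and the exact numbers will of course differ on your system:)

$ WRITE SYS$OUTPUT F$GETDVI("SYS$SYSDEVICE:","FREEBLOCKS")  ! this node's estimate
$ SHOW DEVICE SYS$SYSDEVICE:                                ! does the cluster-wide "magic"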
In either case, it can get the information - but that costs I/O operations,
CPU, and time.

As a result, it is not necessarily true that the XQP will find anything
like the optimal set of extents when allocating a file. Consider what it
would have to do to accomplish this. It would repeatedly have to find the
largest remaining extent on the disk and allocate it to the new file until
the file had reached its target size. On a single node, this would require
repeated searches of the bitmap (which could be optimized by keeping track
of, say, the top 5 extents found on each search - but what if another
process frees more disk space in the meantime? These searches can take
quite some time, and we don't want to lock everyone else out of the disk.)
In a cluster, you're talking about repeated requests to all nodes.
Performance would be terrible.

What I'm pretty sure is going on is very simple: the node you are
allocating the pagefile on is managing to allocate it entirely out of its
caches. To do so, though, it has to get down to "the dregs" of its caches,
the many small fragments. Creating a small pagefile, or many small files,
doesn't help much because they don't change the distribution of fragment
sizes in the caches by much - there are plenty of small fragments to go
around. In fact, if I were designing such a cache, I'd deliberately aim for
a distribution of cached fragment sizes that includes a few large
fragments, more medium-sized ones, and plenty of small ones. After all,
most requests are for fairly small chunks, and if you fill the cache with
the largest extents on the disk, you'll just end up fragmenting them
rapidly to fulfill the small requests - bad policy.

On the other hand, growing the pagefile in large, manual chunks probably
helps because, each time you grow it, you seriously deplete the cache - in
particular, you use up all the large extents in it - so the XQP sets out to
refill it (from the disk or from other cluster nodes). Because as a human
being you enter the next command an eon later, it has plenty of time to do
so.

Should the XQP do better? What does "better" mean? How much performance at
file allocation time are you willing to trade for a less fragmented file?
How often will this make a difference? The pagefile is an unusual case:
it's a large file that has to be fully mapped in a limited number of
extents. For the typical 80000-block file, you'll never see the difference
in performance between 10 extents and 50. How much effort should be put
into an optimization that hardly ever makes a difference?

Note that you can probably affect the tradeoff a bit by using a larger
cache. The larger cache, in effect, keeps local more of the global
information. Of course, in a cluster, if EVERYONE'S cache grows, this
doesn't help much.

You may suggest that SYSGEN should be smarter, since it KNOWS that it is
creating a pagefile. But SYSGEN isn't doing the allocation - the XQP is.
(The RMS /EXTEND quantity is irrelevant, BTW, because it controls how much
should be added to an existing file when it grows - but SYSGEN certainly
creates the file at the target size. It's also quite possible that SYSGEN
doesn't bother with RMS for this, since it will have to do direct XQP calls
anyway to check the fragmentation of the resulting file.) To do any better,
SYSGEN would have to get into the innards of the file system in some way.
Again, how much effort is appropriate for what is a rare special-case
situation?
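(For what it's worth, the "sneak up on it" workaround above could be
automated with a few lines of DCL. This is only a sketch: it assumes, as in
the original posting, that repeating SYSGEN CREATE with a larger /SIZE
extends the existing file, and the 8000-block step and 30-second pause -
meant to give the XQP time to refill its extent cache from the bitmap or
from other cluster members - are guesses on my part, not anything
documented.)

$ size = 8000
$ grow_loop:
$ MCR SYSGEN CREATE SYS$SYSDEVICE:[SYS0.SYSEXE]PAGEFILE2.SYS /SIZE='size'
$ IF size .GE. 80000 THEN GOTO done    ! target size from the posting
$ WAIT 00:00:30                        ! let the XQP refill its caches (guess)
$ size = size + 8000
$ GOTO grow_loop
$ done:
$ EXIT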
BTW, when you couldn't create one 80000-block pagefile, did you consider
creating two 40000-block pagefiles, or three 27000-block pagefiles? Since
you indicate that there were plenty of 8000-block free extents available,
and that your 80000-block file got four of them (i.e., there are about four
in the cache), you should have had no trouble at all creating the three
files, and you probably could have gotten away with the two files. In
effect, by using multiple pagefiles, you would be multiplying the total
number of extents that can be used for all pagefiles in the system. These
days, the cost of additional pagefiles is minimal, since a process can be
split among several.

And by the way, when are you going to support disk partitions and/or
VDDRIVER?? With the latest disks coming in at several Gbytes, we need them!

What problem would they solve - especially, what problem related to this
issue? (I'm not saying this is a bad idea - I especially think VDDRIVER is
a fine idea, but I'm not so sure that disk partitions are worth the
headache - but how about some justification?)

(BTW, not to start a flame war, but did you ever notice that when people
talk about all the things Unix has, they always include all the free stuff
you can pull off the net - but when they talk about VMS, they only talk
about what is officially supported? The vast bulk of free software, for ANY
system out there, can't be compared in quality, documentation, or available
support to VDDRIVER. Hell, most COMMERCIAL software isn't as good. If
VDDRIVER solves a problem for you, why will it solve it any better if it
comes on the VMS tape?)

-- Jerry