From: "Glenn C. Everhart" To: info-vax@crvax.sri.com Subject: compressing disk I have thought about a compressing virtual disk for some time but never got one working, though I did make a start way back. My initial concept was to use an isam file to hold the storage, compressing records. Others use something like this approach, doing their own space management and doing a better job than normal ISAM. On reflection, I don't think that's a terribly good approach since the isam file would need to be frequently reorganized to gain space. What I'd do now would be to have the host program manage a whole slew of files, each file comprising the contents of perhaps 100 blocks of the virtual disk (so it's long enough for adaptive algorithms to do some good). A good host process for fddriver/fqdriver should also have some caching in itself. On reading a block, the host would read the file and decompress it into its cache someplace. Since each cache "cell" would be for 100 didk oops, disk blocks or so, storage would be easy to manage. Writing would write into this cache area. When the space was needed, the relevant cache chunk would be compressed and written over the old file that had held that range of blocks. The host file system would then manage the actual space allocation. This scheme would work on unix too, using sunfddvr.c (also on sig tapes). I believe that having a "fence" block below which the disk storage would be just kept on A file is still a good idea; this would allow the disk to be INITed /INDEX=BEG/HEADERS=nnnn so the index file would not be compressed, speeding many access operations. I'd keep at least 10 and possibly more of these cache blocks, depending on how much virtual address space I felt could be used for it; 100 would be better still; I'd treat them as LRU cache, but would provide a periodic flush to disk every 10-20 minutes (guess) and would provide a way to tell the host process to flush on command so a shutdown would be able to ensure all data had flushed to disk. The overhead of flushing to disk for EVERY write would be a lot; a write to 1 block might result in compressing 100 and probably doing file open/close/truncate. I would also advise that if one plays this game (and I'd be happy to see someone do it and put the result on the sig tapes for next time) that the files containing all the actual storage live on a VD: or similar virtual disk, because it's gonna fragment REAL fast. It should be added that a compressing disk is vulnerable to system crashes, so that maintaining journals of written material and removing them once compressed versions are safely written is a worthwhile thing. A journal can be roughly the size of the in memory cache, and might be organized to be in the same order, with maybe one or two blocks extra off to the side that had the index of cache block <-> disk LBN. Periodic flushes of a cache are a kludge that might be tried instead (Unix uses such a scheme) but it is inherently more vulnerable to corruption. By keeping the index file NOT compressed, of course, we reduce the liklihood of total disk corruption; this is a good thing. Incidentally, I have working code that catches FDT time activity and lets me compress files onto disk when space is filling up and automatically decompresses them if anyone opens them. The effect is that basically any files you like can be compressed on the disk (even to all the ones you ever look at) and automatically get decompressed and pulled out when you try to open them. 
Incidentally, I have working code that catches FDT-time activity and lets me compress files on disk when space is filling up, automatically decompressing them if anyone opens them. The effect is that basically any files you like can be compressed on the disk (even all the ones you ever look at) and they automatically get decompressed and pulled out when you try to open them. It's very flexible, since it can store them anywhere you like as well (e.g., on tape, on other disks, across any net, and so on), compressed or not as pleases you. However, it's non-free and may become a product. It has a ton of other functions also. The nice thing about that method of operation is that you can have a site policy about what gets compressed first or preferentially, so you can play this game mostly with seldom-accessed files and take less of a hit in performance. You can use the network file system hooks in Unix to get a similar kind of effect, but the performance hit may be worse; my code on VMS is pretty efficient at deciding what needs to be done...

I should add that the concept of leaving file headers out there as tags for files stored "somewhere else" is a dirty approach if there is any way to store your index some other way; I plan to clean that up and have most of the support for doing so done already.

If anyone cares to build a compressing host for fddriver/fqdriver, though, please feel free to contact me if you have questions and/or want to be sure you are starting with the latest & greatest code (the F92 tapes are actually fine).

Glenn C. Everhart (Raxco, Inc.)
Everhart@Arisia.GCE.Com
215 358 5875