Enhancing VMSINDEX: Ohio Cookbook


Overview

In our environment, a less-experienced VMS user does the LYNX crawling and the indexing for most of the servers. To reduce the chances of catastrophe, that user's account has OPER privilege but lacks the VMS privileges needed to move the new index files to the production location. The COPYTHEM job (which re-submits itself each time it runs) is submitted /HOLD by someone who does have the necessary privileges. When a new set of files is ready, either of us can set the job's parameter to indicate which server's files are ready for update and release the job.
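
The COPYTHEM procedure itself is not reproduced in this document. The following is only a minimal sketch of the perpetual-job pattern it relies on; the parameter convention, the staging directory, and the file naming are assumptions made for illustration.

$! COPYTHEM.COM -- hypothetical sketch of the perpetual copy job; the
$! parameter convention, staging directory, and file naming below are
$! assumptions, not the real procedure.
$! P1 names the server whose new index files are ready (e.g. "OHIOU").
$ if p1 .nes. "" then -
      copy disk9:[index.new]'p1'*.*  www_root:[index]*.*
$! Re-submit on hold, so that the job waits until a suitably privileged
$! person sets its parameter and releases it again.
$ submit/hold sys$login:copythem.com
$ exit

Releasing the held job for a particular server can then be done with something like SET ENTRY nnnn/PARAMETERS=("OHIOU")/RELEASE, where nnnn is the entry number that SUBMIT reported.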

When a new server comes to our attention, we crawl that server, use the FIRSTINDEX procedure to build the three index files, and use the COPYTHEM perpetual batch job to move them to production. Then we create and optimize an FDL file for each of the three index files and place those FDL files where the REINDEX procedure expects them for future indexing cycles. This provides better performance, both for the indexing process and for the searching process.

The various steps include:

  1. Crawling a particular server with LYNX. This creates a collection of text files named LNKnnnnnnnn.DAT, each of which contains one Web page as seen using LYNX. For large sites (servers with more than perhaps 1,000 Web pages), some intervention during the crawl may be required for good performance; see Coping with Large REJECT.DAT Files, below. A sketch of a typical invocation appears after this list.

  2. Removing duplicate and external files. This is done by running DEDUP1 in batch. DEDUP1 reads the URL from the first line of each LNKnnnnnnnn.DAT file, trims default filenames from the end, lowercases all letters in the URL for those servers that are not case-sensitive, and then sorts the resulting lines by URL with the /NODUPLICATES qualifier, producing a sorted summary file (a sketch of this pass also appears after this list). Manually editing that sorted summary file permits the removal of the lines for external pages.

    Even if you crawl using "-realm", LYNX will still leave you with disk files for external pages if server configuration redirects point to pages on another server. Because the summary file is sorted in URL order, those external entries will all be at the top or bottom of the file, and hence quick for a person to identify and remove, even for a large site.

    A second batch job, DEDUP2, preserves the LNKnnnnnnnn.DAT files whose names remain in the edited, sorted summary file, and deletes the duplicates and other rejected files.

  3. Creating the indexes with VMSINDEX. The first time, using the FIRSTINDEX procedure, the three resulting RMS indexed files are simply CONVERTed to provide partial optimization. Then, while those index files are in use, we create an optimized FDL file for each (see Creating and Optimizing the FDL Files, below), so that future cycles of crawling and re-indexing with the REINDEX procedure will be more efficient.

  4. Copying the three index files to the production disk and directory with the COPYTHEM procedure.

  5. The first time a new server is indexed, the search page's HTML must be modified to permit specification of that server.

  6. From time to time we re-create the combined index of all servers that are part of the institution's Web presence. This is simplified by the fact that we keep most of the LNKnnnnnnnn.DAT files within a single directory tree. We use a dedicated procedure, INDEXOHIOU, to invoke VMSINDEX with appropriate options to allow for the large number of words and pages. We use similar dedicated procedures, as needed, for the largest servers.
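
For step 1, the LYNX we use carries the enhancements described separately, so the exact invocation varies; the sketch below shows only a minimal traversal crawl, and the image location, the working directory, and the starting URL are all placeholders.

$! Hypothetical crawl invocation; every name below is a placeholder.
$ lynx :== $disk9:[lynx]lynx.exe           ! foreign-command symbol for the LYNX image
$ set default disk9:[crawl.newserver]      ! directory that will collect the LNKnnnnnnnn.DAT files
$! The options are quoted so that DCL does not upcase them.
$ lynx "-crawl" "-traversal" "-realm" "http://newserver.placeholder.edu/"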
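
For step 2, the DEDUP1 and DEDUP2 procedures are not reproduced here either. The sketch below shows the kind of DCL that DEDUP1's summarize-and-sort pass implies; the summary file names, the 132-byte key width, and the use of index.html as the only default filename are illustrative assumptions.

$! Hypothetical sketch of a DEDUP1-style pass (not the actual DEDUP1.COM).
$! For each LNKnnnnnnnn.DAT file, write one summary line: the normalized
$! URL, padded to a fixed width, followed by the file name.  Then sort by
$! URL, keeping only one line per distinct URL.
$ open/write sum summary.lis
$loop:
$ file = f$search("lnk*.dat")
$ if file .eqs. "" then goto done
$ open/read in 'file'
$ read in url                              ! the first line holds the URL
$ close in
$ url = f$edit(url, "TRIM,LOWERCASE")      ! lowercasing suits case-blind servers only
$ if f$locate("index.html", url) .ne. f$length(url) then -
      url = f$extract(0, f$locate("index.html", url), url)
$ write sum f$fao("!132AS !AS", url, f$parse(file,,,"NAME"))
$ goto loop
$done:
$ close sum
$ sort/key=(position=1,size=132)/noduplicates summary.lis summary.srt
$ exit

The sorted output is what gets hand-edited to drop the external pages before DEDUP2 deletes every LNKnnnnnnnn.DAT file whose name no longer appears in it.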


Coping with Large REJECT.DAT Files

As discussed in the Status section of my description of enhancements to LYNX, the crawling performance degrades for large sites, especially large personal-page sites, because every found link has to be checked against every line of REJECT.DAT, a file that accumulates many lines if the pages are not self-contained or have coding errors in the links. For example, a recent crawl of a 25,000-page server generated nearly 4.5 MBytes of REJECT.DAT files.

At this time, the workaround I use is as follows:

  1. Log in at another terminal or through another window.

  2. $ SHOW PROCESS

  3. Inspect the output to determine the Process ID (PID) of this controlling process.

  4. $ SHOW USER {myself}/FULL

  5. Inspect the output to determine the PID of the other, CRAWLing session.

  6. $ SET PROCESS/ID=nnnnnnnn/SUSPEND

    Make sure that the PID you specify is NOT that of the controlling process you are typing in.

  7. $ SHOW DEVICE {disk}/FILES/NOSYS

  8. Inspect the output to see whether REJECT.DAT is open. If it is open, skip to step 10 and then come back to step 6.

  9. If REJECT.DAT is NOT open, as determined in the previous step, then perform one of the following two steps, as appropriate:

    • $ COPY    REJECT.TEMPLATE    REJECT.DAT

    • $ RENAME    REJECT.DAT    REJECT.OLD

  10. $ SET PROCESS/ID=nnnnnnnn/NOSUSPEND

Doing this every few hours significantly speeds up the crawl, but large sites can still take days because REJECT.DAT grows too large overnight.
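
If you find yourself doing this often, the three active steps can be bundled into a small procedure. The sketch below is hypothetical (the procedure name and parameter are mine, not part of the toolkit); it covers only steps 6, 9, and 10, so the step-8 check that REJECT.DAT is not open must still be made by hand first.

$! SWAPREJECT.COM -- hypothetical sketch of steps 6, 9, and 10 above.
$! P1 is the PID of the CRAWLing session, NOT the controlling session
$! you are typing in.  Run it from the crawl directory, and only after
$! the step-8 check has shown that REJECT.DAT is not currently open.
$ set process/id='p1'/suspend
$ rename reject.dat reject.old          ! or:  copy reject.template reject.dat
$ set process/id='p1'/nosuspend
$ exit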


Creating and Optimizing the FDL Files

This example works on ohiou.sel; we repeat the same sequence for each of the three index files.

$! Describe the live index file in FDL form, including its analysis data,
$! then run the standard Optimize script against that description.
$ set default disk9:[index.newfdl]
$ analyze/rms/fdl/output=ohiou-sel.fdl    www_root:[index]ohiou.sel
$ edit/fdl/script=optimize    ohiou-sel.fdl
$! Keep only the newest working copies, then file them as the next
$! version in the FDL directory that REINDEX uses.
$ purge    *.*
$ rename    *.*    [-.fdl]*.*.0
$! Check the dates, then discard the superseded FDL versions.
$ set default [-.fdl]
$ dir/dat *ohiou*.*.*
$ purge *ohiou*.*
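
The REINDEX procedure is not reproduced in this document, but applying one of these optimized FDL files is ordinary CONVERT usage. The illustration below is hypothetical: the staging location of the freshly built index file is an assumption, while the FDL file is the one placed in the [-.fdl] directory above.

$! Hypothetical illustration of applying the optimized FDL file during a
$! later re-indexing cycle; the staging directory is an assumption.
$ convert/fdl=disk9:[index.fdl]ohiou-sel.fdl  -
          disk9:[index.new]ohiou.sel  disk9:[index.new]ohiou.sel

With the same file specification for input and output, CONVERT simply creates a new, reorganized version of the file.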





Dick Piccard revised this file (http://ouvaxa.cats.ohiou.edu/vmsindex/examples/ohiocookbook.html) on September 29, 2000.

Please E-mail comments or suggestions to piccard@ohio.edu