From:	MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX" 13-JAN-1993 15:56:50.01
To:	0005066432@mcimail.com
CC:	info-vax@sri.com, info-pascal@brl.mil, compilers@iecc.cambridge.ma.us
Subj:	RE: [TDR] Robinson's Index

	Paul Robinson <TDARCOS@MCIMAIL.COM> writes:
>
>The following is a demonstration on how to solve an indexing or hash-table
>problem; since the query was from a Pascal list, the examples are in
>Pascal. The following has applications to a lot of different things, so I
>thought I would post my answer.
>
>About indexing a large (120,000+ word) dictionary.
>
>In list INFO-PASCAL@BRL.MIL, a user ( Robert_salesas@mindlink.bc.ca) spoke
>about Tim Ciceran's work on a spelling checker and how Robert is doing
>something similar but the hash codes are too large.  I have a simpler
>feature he could try:
>
>You don't need every word in the list, you only need to have enough
>material on hand to reduce the search point down to a reasonable amount.
>If you can do a SEEK() on the file, I have a *very* simple method, as long
>as the file is a static file and does not change.
>
>I have never seen this method documented elsewhere, and I thought it up,
>so I am referring to it as "Robinson's Index".
>
>Sort the file into alphabetical order.  Write a program to scan your file
>and index every word by the first two letters.  Create a list of all 676
>digrams plus the 26 letters A-Z.  Here's what you do:
>
>Create a table in the dictionary file preceding the list of words which is
>702 x 12 characters. Leave this table in the dictionary file consisting of
>26 entries of the following: a letter of the alphabet, two blanks and 6
>spaces followed by a single character "0" (plus CRLF that WRITELN adds).
>You then add the other 676 digrams consisting of the two letters AA, a
>space, and 6 blanks and a 0.  Then AB through AZ, then BA, then BZ through
>ZZ.
>
>It is at this point the dictionary, in alphabetical order, would follow.
>You could either make all words fixed size, (which might waste some space)
>or simply do WRITELN and end each line with a CRLF.   One simple way to
>create this would be to copy the file, and keep an index of how many bytes
>have been written, then go back and rewrite the records with the
>replacement indexes.  Note I put a two-byte "key" plus a space and room
>for a 7 byte index.  I recommend a 7 byte index to allow the dictionary to
>be up to 9.9 million characters. Since a dictionary of that size is
>probably already at or near 1 million characters anyway, the point is
>probably moot, and we are only talking about the extra one byte in the 702
>index records.
>
>If you were doing the Oxford English Dictionary or something huge, or if
>timing is extremely critical in searches, I'd suggest providing for a 9
>digit number and possibly setting up trigrams of the first 3 characters.
>
>Scan this file for (1) the first word which starts with each letter of the
>alphabet; (this becomes the first entry in the single letter class, i.e.
>"A " through "Z ").   (2) the first word to start with any of the digrams
>that are in the list, i.e. AA, CA, CR, GH, IR etc.  For words with
>symbols, you probably want to put those in the start for that letter, so
>that "i.e." is looked up under the letter "I " for example.
>
>When scanning the dictionary, the same table at the top of the file is
>created internally in memory (it's only going to be 702 long words or
>about 3K), and can be created as
>
>   Word_Index: ARRAY ['a'..'z','@'..'z'] of LONGINT;
>
>or for trigrams:
>
>   Word_Index: ARRAY ['a'..'z','@'..'z','@'..'z'] of LONGINT;
>
>(Space will be changed internally to '@' to fit the ASCII code sequence.)
>Initialize all elements of 'Word_Index' to zero.  For the first word found
>for each matching letter or digram, in the above array, you will store the
>current byte position at which that word begins.  If no word is found, you
>be left with 0 as the index.
>
>Once you have scanned the whole file, you go back and rewrite the 702
>entries at the top of the file with the byte index for each letter and
>digram.  Since the entries are pre-loaded with 0, you could simply reopen
>the file, then do a simple write of the index keys, (changing the second
>"@" to a space when writing it out) along with a
>
>     WRITELN(Word_index[highloop,lowloop]:8)
>
>(use 8 to put the extra space between the letter/digram and the index.)
>This method would also allow someone to read the file with the TYPE
>command.
>
>Now, when your program wants to examine the dictionary, all you do is
>pre-load the 702 words representing the letters A-Z, plus the two letter
>digrams.  This only requires an array of 702 longwords, or LESS THAN 3K.
>
>Now, when you get a word, you look up the digram in the index.  If that
>digram exists, you do a SEEK() to that byte of the file.  If the index for
>that digram is 0, you do a SEEK() to the file entry of the single letter
>with which the word in question begins.
>
>Then just do reads from that point.  If you have one of the routines to
>allow you to do Random I/O to a text file, then you can do READLN on the
>next few words.   Chances are, you will only need to scan a few words, as
>not very many digrams are have a lot of words.  As I stated earlier, if
>this is too slow, use trigrams which would require only 18,954 words; you
>could even use a linked list, not use the empty entries, and reduce the
>amount of memory used for the index.  There are lots of choices you can
>make.
>
>Once you get a word that matches, that word is okay; if the word found in
>the dictionary is higher alphabetically, then the word that the user
>entered is not valid.  If you were doing this for a word processor, it
>would be time to then turn to the user's private dictionary.
>

	A couple of optimizations occurred to me immediately upon reading 
this.  The two are mutually exclusive.

	If you are going to use all the possible digrams in your table, 
there is no need to store them in the table since they will occur at
calculable offsets from the beginning of the table.  It is probably faster
to calculate the offset into the table than to search the table.  All you
really need to store are the pointers into the file. 

	The other possible optimization is to shorten the table by 
eliminating the digrams that are "impossible" in the English language; e.g.
the letter 'Q' is always followed by 'U' in English so we can eliminate 
twentyfive digrams beginning with 'Q'.


*************************************************************************
*                                                                       *
*                        Here, there be dragons!                        *
*                      dragon@nscvax.princeton.edu                      *
*                                                                       *
*                                                Richard B. Gilbert     *
*************************************************************************