13-Jul-87 06:57:57-MDT,20882;000000000000 Return-Path: Received: from nems.ARPA by SIMTEL20.ARPA with TCP; Mon, 13 Jul 87 06:57:03 MDT Received: by nems.ARPA id AA02207; Mon, 13 Jul 87 08:33:15 edt Message-Id: <8707131233.AA02207@nems.ARPA> Date: 13 Jul 87 08:32 EDT From: science@nems.ARPA (Mark Zimmermann) Subject: correl.13t3.c program listing To: kotelly@mitre.ARPA, research@nems.ARPA, cperry@mitre.ARPA, mike@media-lab.media.mit.edu, gergely@drea-xx.ARPA, decvax!savax!mf@ucbvax.berkeley.edu, pbr%pco@bco-multics.ARPA, microsof!brianc@beaver.cs.washington.edu, dlcdev!glenn@eddie.mit.edu, mbj@hq.lcs.mit.edu, ed298-ca@violet.berkeley.edu, despain@ucbvax.berkeley.edu, science@nems.ARPA, master@nems.ARPA, programs@nems.ARPA, rthum@simtel20.ARPA, bptpcapb%uiamvs.bitnet@wiscvm.wisc.edu, davis%wanginst.edu@relay.cs.net, kersch%gmuvax.bitnet@wiscvm.wisc.edu, rezende%corwin.ccs.northeastern.edu@RELAY.CS.NET, vinge%sdsu.UUCP@sdcsvax.ucsd.edu, rada@mcs.nlm.nih.gov Cc: science@nems.ARPA Appended below, gods of UNIX willing, is a C program to take an inverted index in my simpleminded format and perform statistical correlations on chosen words in that index ... as with the indexer.c program, it seems to run fine on a VAX, a Sun workstation, and on the Macintosh ... any and all bug reports and critiques of my C coding style would be appreciated, as would comments "Don't send me any more of this stuff!" .... ^z -------------------------------------------------- /* transportable (?) version of index correlator v.13 -- 870707-9 ^z * this file is correl.13t3.c -- compile with command: * cc correl.13t2.c -lm * on the VAX (need -lm to do math library) * then run by saying: * a.out doc_file_name index_file_name proximity_length word1 word2 ... * where index_file is the index built for doc_file, proximity_length is * the distance in bytes to consider correlations over, and word1, word2, ... * are the words which ('OR'd together) define the subindex to correlate * against. * * (Note that proximity is discretized in the program in steps of size 32 * ( = CHUNK_LENGTH) and so actually there is a somewhat broader range * considered than strictly N bytes for proximity_length = N ... larger by * as much as 31 bytes, though the odds are against that.) * * Try values for proximity_length of 32 for "within a few words", * 320 for "within a few sentences", and 3200 for "within a few paragraphs". * * This correlator program evolved from a prototype in MacForth Plus * -- C version was originally developed in Lightspeed C on the Macintosh; * has been tested on the NEMS.arpa VAX and on Sun Workstations... * * Contact Mark^Zimmermann (9511 Gwyndale Dr., Silver Spring, MD 20910; * science@nems.arpa, or [75066,2044] on CompuServe; telephone * 301-565-2166) for additional details, bug reports, etc. */ /* Program to correlate terms in an index ... based on the * "popwords" prototype in MacForth Plus for Browser v.244+ * * CONCEPT: run the program and give it a command line consisting * of: a document file name * an index file name * a proximity range in bytes * a list of words which define the subject under consideration * For example, the command line might look like: * junkdocument junkindex 32 garbage trash * which would cause the subindex defined by proximity within 32 characters * to the words "garbage" or "trash" within the index file junkindex to be * correlated with all the words in the file ... words that were associated * with "garbage" or "trash" would then be printed out (or sent to a chosen * file with the UNIX > operator) * * Eventually, may want to allow wild cards in the command line (e.g., * trash* to include trash, trashes, trashy, etc.) * * Currently, all the document file is used for is to get its length * (and thus the length of the valid_bits array) ... seems a waste to * demand it as an argument, but what else can I do?.... * * 870629-30... ^z */ #include #include #include #include /* now, here comes my header file of definitions... ^z */ /* headers for my correlator experiments - 870629-30-... ^z */ /* CHUNK_LENGTH is the length of a chunk of text in the document file * that gets handled by a single bit in the index file */ #define CHUNK_LENGTH 32 /* FLAG_COMPRESSION is the conversion factor from bytes in the document file * to bytes in the flag array ... specifically, it is just * sizeof(char) * CHUNK_LENGTH * It is the ratio of the size of the document file to the size of the * valid_bits flag array, that is, the amount of compression used.... */ #define FLAG_COMPRESSION 256 /* KEY_LENGTH is the number of letters we have in each record */ #define KEY_LENGTH 12 /* INDEX_REC_LENGTH is KEY_LENGTH plus the size of a long int used * as an index_offset pointer */ #define INDEX_REC_LENGTH 16 /* define a structure to hold my index records: * a record begins with the offset from the file start (long int) * and then contains KEY_LENGTH characters, already filtered into * all capitals (if that is the chosen filter_char function) * as the key for use in sorting that index record * (and for later browsing of the index) */ typedef struct index_rec { long index_offset; char index_key[KEY_LENGTH]; } INDEX_REC; #ifndef TRUE #define TRUE 1 #endif #ifndef FALSE #define FALSE 0 #endif /* next, the main routine to read in arguments, open files, etc. */ main (argc, argv) int argc; char *argv[]; { char *valid_bits, *vb, *malloc(); long doc_size, index_size, proximity_range; unsigned long start_time, end_time; double valid_frac, threshhold_sigmas, findvalidfrac(), sqrt(); FILE *doc_file, *index_file; printf ("correl version 13t3 by ^z\n"); if (argc < 5) { printf ("Not enough command line arguments!\n"); printf ("Usage: doc_file index_file proximity word1 word2 ... \n"); exit(); } /* keep track of time, for user's amusement */ time (&start_time); printf ("Starting work: %s\n", ctime (&start_time)); /* open the two files doc_file (text) and index_file (binary) */ doc_file = fopen (argv[1], "r"); if (doc_file == NULL) { printf ("Unable to open document file \"%s\"!", argv[1]); exit(); } index_file = fopen (argv[2], "rb"); if (index_file == NULL) { printf ("Unable to open index file \"%s\"!", argv[2]); exit(); } /* find the length of each file */ fseek (doc_file, 0L, 2); doc_size = ftell (doc_file); printf ("Document file \"%s\" is %ld characters long.\n", argv[1], doc_size); fseek (index_file, 0L, 2); index_size = ftell (index_file); printf ("Index file \"%s\" contains %ld words.\n", argv[2], index_size / INDEX_REC_LENGTH); /* reserve a block of memory for the array of flag bits * create the block cleared out to all zeros (nothing valid) * (the for-loop to do that, and the pointer vb, were added for VAX * compatibility, since apparently it doesn't support calloc() or * clalloc() commands to clear a block of allocated memory ... sorry!) * * note that if an integer has 32 bits, the call to malloc() will suffice * for large files, up to hundreds of gigabytes; on the other hand, if * an int is only 16 bits (as in Lightspeed C) one won't be able to handle * documents bigger than 16 MB or so ... hence, my usage of clalloc and * long ints in the LSC version of this program) */ valid_bits = malloc (1 + doc_size / FLAG_COMPRESSION); if (valid_bits == NULL ) { printf("Sorry, out of memory error!\n"); exit(); } for (vb = valid_bits; vb < valid_bits + 1 + doc_size / FLAG_COMPRESSION; vb++) *vb = 0; /* read in proximity_range from command line */ proximity_range = atol (argv[3]); if (proximity_range < CHUNK_LENGTH || proximity_range > doc_size) { printf ("%ld is a bad choice for proximity_range!\n", proximity_range); exit(); } else { printf ("Defining neighborhoods of %ld characters around words...\n", proximity_range); } /* set bits in the validbits array of flags for the neighborhood * of every word in the input list */ define_neighborhood (argc, argv, index_file, valid_bits, doc_size, index_size, proximity_range); /* compute the fraction of the document defined as valid by the * chosen input words, and the resulting threshhold number of * standard deviations above the noise required for a word to * be recognized as significantly correlated */ valid_frac = findvalidfrac ( valid_bits, doc_size); printf ("Valid fraction of total = %f\n", valid_frac); if (valid_frac == 0.) { printf ("Nothing correlates with the empty set!\n"); exit(); } if (valid_frac == 1.) { printf ("Everything correlates equally with the universe!\n"); exit(); } threshhold_sigmas = (1.0 - valid_frac) / sqrt (valid_frac); printf ("Threshhold surplus sigmas = %f\n", threshhold_sigmas); /* scan through index, reporting on all words that come up * highly correlated with the chosen subindex */ show_correlated_words ( index_file, valid_bits, valid_frac, threshhold_sigmas); /* print some final statistics */ printf ("\nCorrelation completed!\n"); time (&end_time); printf ("Ending work: %s\n", ctime (&end_time)); printf ("Elapsed time = %ld seconds.\n", end_time - start_time); printf ("Correlation rate = %.2f Megabytes/hour.\n", doc_size * 0.003600 / ( end_time - start_time )); } define_neighborhood (argc, argv, index_file, valid_bits, doc_size, index_size, proximity_range) int argc; long doc_size, index_size, proximity_range; char *argv[], *valid_bits; FILE *index_file; { int i; long index_rec_number, next_rec_number, find_word(), find_next_key(); for (i = 4; i < argc; ++i) { index_rec_number = find_word (argv[i], index_file, index_size); if (index_rec_number >= 0L) { next_rec_number = find_next_key (index_rec_number, index_file); printf ("\"%s\" occurs %ld times in index file\n", argv[i], next_rec_number - index_rec_number); for ( ; index_rec_number < next_rec_number; ++index_rec_number) set_valid_bits (index_rec_number, index_file, valid_bits, doc_size, proximity_range); } else printf ("\"%s\" not found in index file!\n", argv[i]); } } double findvalidfrac (valid_bits, doc_size) char *valid_bits; long doc_size; { long i, valid_count = 0, total_count = 0; for (i = 0; i < doc_size; i += CHUNK_LENGTH) { valid_count += get_valid_bit (i, valid_bits); ++total_count; } return ((double) valid_count / total_count); } show_correlated_words (index_file, valid_bits, valid_frac, threshhold_sigmas) double valid_frac, threshhold_sigmas; char *valid_bits; FILE *index_file; { long current_item = 0L, valid_counts, count_valid_keys(), total_counts, count_repeated_keys(); double surplus_sigmas, sqrt(); INDEX_REC current_rec; printf ("\nCorrelated words in subindex:\n"); printf ("\n word excess valid/total\n"); printf ( " ---- ------ ----- -----\n\n"); while (get_index_rec (current_item, index_file, ¤t_rec)) { total_counts = count_repeated_keys (current_item, index_file); valid_counts = count_valid_keys (current_item, index_file, total_counts, valid_bits); if (valid_counts > 1L) { surplus_sigmas = (valid_counts - total_counts * valid_frac) / sqrt (total_counts * valid_frac); if (surplus_sigmas > threshhold_sigmas) printf ("%.12s %5.1f %6ld/%-6ld\n", current_rec.index_key, surplus_sigmas, valid_counts, total_counts); } current_item += total_counts; } } set_valid_bits (index_rec_number, index_file, valid_bits, doc_size, proximity_range) long index_rec_number, doc_size, proximity_range; FILE *index_file; char *valid_bits; { long first_offset, last_offset, i; INDEX_REC this_rec; get_index_rec (index_rec_number, index_file, &this_rec); first_offset = this_rec.index_offset - proximity_range; if (first_offset < 0L ) first_offset = 0L; last_offset = this_rec.index_offset + proximity_range; if (last_offset >= doc_size) last_offset = doc_size - 1L; for (i = first_offset; i <= last_offset; i += CHUNK_LENGTH) set_the_valid_bit (i, valid_bits); } set_the_valid_bit (offset, valid_bits) long offset; char *valid_bits; { int bit_number; long byte_number; byte_number = offset / FLAG_COMPRESSION; bit_number = (offset % FLAG_COMPRESSION) / CHUNK_LENGTH; *(valid_bits + byte_number) |= 1 << bit_number; } get_valid_bit (offset, valid_bits) long offset; char *valid_bits; { int bit_number, the_bit; long byte_number; byte_number = offset / FLAG_COMPRESSION; bit_number = (offset % FLAG_COMPRESSION) / CHUNK_LENGTH; if (*(valid_bits + byte_number) & (1 << bit_number)) return (TRUE); else return (FALSE); } long count_valid_keys (index_item, index_file, total_counts, valid_bits) long index_item, total_counts; FILE *index_file; char *valid_bits; { long next_item, valid_keys = 0L; INDEX_REC this_rec; next_item = index_item + total_counts; for ( ; index_item < next_item; ++index_item) { get_index_rec ( index_item, index_file, &this_rec); if (get_valid_bit (this_rec.index_offset, valid_bits)) ++valid_keys; } return (valid_keys); } /* finally, file "index utilities.c": */ /* utility functions to move around in an index array * * functions are: * * long count_repeated_keys (long index_item, FILE *index_file) * (counts how many identical keys including index_item follow) * int equal_keys (INDEX_REC *rec1, INDEX_REC *rec2) * (tests two keys for equality) * long find_first_key (long index_item, FILE *index_file) * (finds first key in a block of identical ones) * long find_next_key (long index_item, FILE *index_file) * (finds first key of block after index_item's block) * int get_index_rec (long index_item, FILE *index_file, * INDEX_REC *this_rec) * (fetches index_item's record from index_file into this_rec -- * NOTE: must supply an address for last argument!) * long find_word (char *word, FILE *index_file, long index_size) * (binary search through index to find a word) * * ^z 870629-30-... */ /* function to count how many index records * including the current one occur with the same key * * ^z -- 870616-... */ long count_repeated_keys (index_item, index_file) long index_item; FILE *index_file; { long find_next_key(); return (find_next_key (index_item, index_file) - index_item); } /* function to determine if two index keys are equal * * ^z -- 870616-... */ int equal_keys (rec1, rec2) INDEX_REC *rec1, *rec2; { int i; for (i = 0; i < KEY_LENGTH && rec1->index_key[i] == rec2->index_key[i]; ++i); if (i == KEY_LENGTH) return (TRUE); else return (FALSE); } /* function to find first key in a block of identical keys * do this by taking the current key and backing up until a different * key is found (or until item #0 at the top of the index file is * reached) * * ^z -- 870616-... */ long find_first_key(index_item, index_file) long index_item; FILE *index_file; { long test_item; INDEX_REC this_rec, test_rec; if (index_item == 0L) return (0L); get_index_rec (index_item, index_file, &this_rec); test_item = index_item - 1L; get_index_rec (test_item, index_file, &test_rec); while (equal_keys (&this_rec, &test_rec) && test_item > 0L) get_index_rec (--test_item, index_file, &test_rec); return (++test_item); } /* function to find the next different key in an index file (or to * stop when the last key in the file is reached and return a pointer * to one step past the last valid record) * * ^z -- 870616-19-... */ long find_next_key(index_item, index_file) long index_item; FILE *index_file; { int good_item; long test_item; INDEX_REC this_rec, test_rec; good_item = get_index_rec (index_item, index_file, &this_rec); if ( ! good_item) /* if current key is already bad, return it */ return (index_item); /* and look no further */ test_item = index_item; do good_item = get_index_rec (++test_item, index_file, &test_rec); while (good_item && equal_keys (&this_rec, &test_rec)); return (test_item); } /* function to fetch an index record from disk to a * chosen location ... return 0 if failed (probably EOF), 1 if ok * * NOTE: must be called with an address for last argument! * don't call it with an index_rec itself there!! * * ^z -- 870616 * */ int get_index_rec (index_item, index_file, this_rec_ptr) long index_item; FILE *index_file; INDEX_REC *this_rec_ptr; { fseek (index_file, INDEX_REC_LENGTH * index_item, 0); return (fread ((char *)this_rec_ptr, INDEX_REC_LENGTH, 1, index_file)); } /* routine to find a word using a binary search in an index file * long index_size is length of index file in bytes * * ^z 870629-30... */ long find_word (word, index_file, index_size) char *word; long index_size; FILE *index_file; { int compar, i; long left_bound, right_bound, current_item; INDEX_REC this_rec, current_rec; /* initialize pointers for binary search */ left_bound = 0L; right_bound = index_size / INDEX_REC_LENGTH - 1; /* fill this_rec.index_key with blanks */ for (i = 0; i < KEY_LENGTH; ++i) this_rec.index_key[i] = ' '; /* copy characters from input word into this_rec.index_key * NOTE: on Sun Workstation's C compiler toupper() messes * up non-lower-case characters -- must therefore put in * an if statement to redefine it for that machine. Not * necessary on VAX or Macintosh in my experiments.... */ for (i = 0; i < KEY_LENGTH && *word != '\0'; ++i, ++word) this_rec.index_key[i] = (islower (*word) ? toupper(*word) : *word); /* now begin binary search by taking record pointed to by * current_item, copying it into current_rec, and comparing * it with this_rec to see whether we need to go left, right, * or have found it */ do { current_item = (left_bound + right_bound) / 2; get_index_rec (current_item, index_file, ¤t_rec); compar = strncmp (current_rec.index_key, this_rec.index_key, KEY_LENGTH); if (compar > 0) right_bound = current_item - 1; else left_bound = current_item + 1; } while (left_bound <= right_bound && compar != 0); /* now make sure that item found is the first one of its block * of identical keys, by backing up as necessary */ current_item = find_first_key (current_item, index_file); /* return the item number if found, or -1 if failed to find it */ if (compar == 0) return (current_item); else return (-1L); } -------