
# Features and System requirements



ht://Dig Copyright © 1995-2000 The ht://Dig Group
Please see the file COPYING for license information.




Features



Here are some of the major features of ht://Dig. They are in no particular order.



* Intranet searching

ht://Dig has the ability to search through many servers on a network by acting as a WWW browser.

* It is free

The whole system is released under the GNU General Public License.

* Robot exclusion is supported

The Standard for Robot Exclusion is supported by ht://Dig.

* Boolean expression searching

Searches can be arbitrarily complex using boolean expressions.

* Configurable search results

The output of a search can easily be tailored to your needs by means of providing HTML templates.

* Fuzzy searching

Searches can be performed using various configurable algorithms. Currently the following algorithms are supported (in any combination):

  • exact
  • soundex
  • metaphone
  • common word endings
  • synonyms


* Searching of HTML and text files

Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions.

* Keywords can be added to HTML documents

Any number of keywords can be added to HTML documents which will not show up when the document is viewed. This is used to make a document more likely to be found and also to make it appear higher in the list of matches.

* Email notification of expired documents

Special meta information can be added to HTML documents which can be used to notify the maintainer of those documents at a certain time. It is handy to get reminded when to remove the "New" images from a certain page, for example.

* A protected server can be indexed

ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server or parts of a server that are protected by a username and password.

* Searches on subsections of the database

It is easy to set up a search which only returns documents whose URL matches a certain pattern. This becomes very useful for people who want to make their own data searchable without having to use a separate search engine or database.

* Full source code included

The search engine comes with full source code. The whole system is released under the terms and conditions of the GNU General Public License version 2.0.

* The depth of the search can be limited

Instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.

* Full support for the ISO-Latin-1 character set

Both SGML entities like '&agrave;' and ISO-Latin-1 characters can be indexed and searched.
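
The depth limit from the feature list above can be pictured as a breadth-first walk over the link graph that stops following links once a page is the maximum number of "mouse-clicks" from the start document. The following is only a minimal sketch in Python under that assumption; it is not ht://Dig's own code, and the function name `pages_within_depth`, the `links` dictionary and the sample site are hypothetical.

```python
from collections import deque


def pages_within_depth(links, start, max_depth):
    """Return {url: depth} for every page reachable within max_depth clicks.

    `links` is a hypothetical in-memory link graph: it maps each URL to the
    list of URLs that page links to.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] >= max_depth:
            continue  # at the limit: keep the page, but do not follow its links
        for target in links.get(url, []):
            if target not in depth:  # first visit is the shortest click path
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth


if __name__ == "__main__":
    # Hypothetical site used only for illustration.
    site = {
        "/": ["/a", "/b"],
        "/a": ["/c"],
        "/c": ["/d"],
    }
    print(pages_within_depth(site, "/", 2))
```

With a limit of 2, the page "/d" in the sample site is left out because it is three clicks away from "/".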





Requirements to build ht://Dig



ht://Dig was developed under Unix using C++.



For this reason, you will need a Unix machine, a C compiler, and a C++ compiler. (The C compiler is needed to compile some of the GNU libraries.)



Unfortunately, I only have access to a couple of different Unix machines. ht://Dig has been tested on these machines:

There are reports of ht://Dig working on a number of other platforms.

libstdc++



If you plan on using g++ to compile ht://Dig, you have to make sure that libstdc++ has been installed. Unfortunately, libstdc++ is a separate package from gcc/g++. You can get libstdc++ from the GNU software archive.



 Berkeley 'make'



The build relies heavily on the make program. The problem with this is that not all make programs are the same. The make program must understand the 'include' statement, as in

    include somefile


The Berkeley 4.4 make program doesn't use this syntax; instead it wants

    .include "somefile"


and hence it cannot be used to build ht://Dig.



If your make program doesn't understand the right 'include' syntax, it is best to get and install GNU make (gnumake) before you try to compile everything. The alternative is to change all the Makefiles.




 Disk space requirements



The search engine will require lots of disk space to store its databases. Unfortunately, there is no exact formula to compute the space requirements. It depends on the number of documents you are going to index but also on the various options you use. To give you an idea of the space requirements, here is what I have deduced from our own database size at San Diego State University.



If you keep around the wordlist database (for update digging instead of initial digging) I found that multiplying the number of documents covered by 12,000 will come pretty close to the space required, in bytes.



We have about 13,000 documents:


         13,000
         12,000 x
    -----------
    156,000,000

or about 150 MB.

Without the wordlist database, the factor drops down to about 7,500:


         13,000
          7,500 x
    -----------
     97,500,000

or about 93 MB.
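
The rule of thumb above is easy to apply to other collection sizes. The snippet below is a rough sketch in Python that only restates the two factors quoted in the text (12,000 and 7,500 bytes per document); the function names are hypothetical, and the conversion assumes 1 MB = 1,048,576 bytes. Real database sizes will vary with the options you use.

```python
# Rule-of-thumb factors quoted in the text (bytes per indexed document).
BYTES_PER_DOC_WITH_WORDLIST = 12_000
BYTES_PER_DOC_WITHOUT_WORDLIST = 7_500


def estimate_db_bytes(num_documents, keep_wordlist=True):
    """Return a rough database size in bytes for the given document count."""
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    factor = (BYTES_PER_DOC_WITH_WORDLIST if keep_wordlist
              else BYTES_PER_DOC_WITHOUT_WORDLIST)
    return num_documents * factor


def to_megabytes(num_bytes):
    """Convert bytes to megabytes, assuming 1 MB = 1,048,576 bytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    # The 13,000-document example from the text, with and without the wordlist.
    for keep in (True, False):
        size = estimate_db_bytes(13_000, keep_wordlist=keep)
        print(f"wordlist kept: {keep}, {size:,} bytes, "
              f"about {to_megabytes(size):.0f} MB")
```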

Keep in mind that we keep at most 50,000 bytes of each document. This may seem a lot, but most documents aren't very big and it gives us a big enough chunk to almost always show an excerpt of the matches.



You may find that if you store most of each document, the databases are almost the same size as, or even larger than, the documents themselves! Remember that if you're storing a significant portion of each document (say 50,000 bytes as above), you have that requirement, plus the size of the word database and all the additional information about each document (size, URL, date, etc.) required for searching.

Andrew Scherpbier <andrew@contigo.com>
Last modified: $Date: 2000/02/17 22:05:23 $