[AscToHTM] Documentation for AscToHTM conversion utility
----------------------------------------------------------------------------
6 Using File Policies
File policies have two main uses :
a. To correct any failure of analysis that AscToHTM makes. Hopefully
this won't be needed too much as the core analysis engine
improves.
Examples include page width, whether or not underlined section
headings are expected etc.
b. To tell the program how to produce a better HTML end product in ways
that couldn't possibly be inferred from the original text.
Examples include adding colour and titles to the page, as well as
requesting that a large document be split into several pages and a
contents list created.
6.1 An example conversion
This documentation has itself been converted using AscToHTM. The files
used were
o A2HDOCO.TXT. This is the text version of the documentation. The
text version is kept as the master copy and updated as required.
It's then converted to HTML.
o IA2HDOCO.POL. This is the policy file used to create the HTML
version of this document. Only those policies that differ from the
defaults have been added.
This policy file "includes" the link dictionary A2HLINKS.TXT.
o A2HDOCOH.TXT. This is the header HTML added at the top of each
generated HTML page.
o A2HDOCOS.TXT. This is the JavaScript HTML added into the
<HEAD> ... </HEAD> portion of the generated HTML page. This particular
example toggles the logo when the mouse is over it (only if you
use Netscape V3.0 or above though).
o A2HLINKS.TXT. This is the link dictionary used for this document
and is used to add hyperlinks to the main text file.
o A2HDOCOF.TXT. This is the footer HTML added at the bottom of each
generated HTML page.
These files are included in the distribution kit as an example set of
documentation.
You can, of course, use AscToHTM to convert this doco into whatever
format, colour etc that you wish.
6.2 Layout policies
6.2.1 Indentation
AscToHTM has the following indentation policies that will normally be
correctly calculated on the analysis pass :-
Indentation
-----------
Indent position(s) : 0 6 8 16
Hanging paragraph position(s) : (none)
6.2.1.1 "Indent position(s)"
These are the positions of the major indent levels in the document.
6.2.1.2 "Hanging paragraph position(s)"
These refer to the indentation levels at which definition paragraphs
are expected (see 5.6.2).
6.2.2 Page layout
AscToHTM has the following general layout policies that will normally
be correctly calculated on the analysis pass :-
Page layout
-----------
Page width : 78
Expect Contents List : FALSE
Text Justification : Left
6.2.2.1 "Page width"
This value can be used to influence short line and centred text
detection. It also helps to determine if the definition characters ':'
and '-' (see 5.6.1) are to be regarded as "strong" or "weak".
6.2.2.2 "Expect contents list"
See the discussion in 4.3.3.
6.2.2.3 "Text justification"
This policy is important in detecting pre-formatted text.
The possible values are "left", "center" (i.e. left and right), "right"
and "none". If text is centered then padding spaces may be added. This
has to be ignored when detecting pre-formatted text.
6.2.3 Paragraphs
AscToHTM has the following paragraph policies that will normally be
correctly calculated on the analysis pass :-
Paragraphs
----------
Expect blank lines between paras : TRUE
New Paragraph Offset : (none)
6.2.3.1 "Expect blank lines between paras"
Paragraphs are normally expected to have blank lines before them. Where
this isn't true (e.g. on a text file dumped from Word) different
algorithms can be applied more rigorously.
6.2.3.2 "New paragraph offset"
This policy refers to any hanging indent. Again, this is a Word for
Windows favourite.
6.2.4 Styling
AscToHTM has the following "styling" policies that will normally be
correctly calculated on the analysis pass :-
Style
-----
Highlight Definition Text : FALSE
Allow automatic centring : FALSE
Allow automatic 1-line <PRE> : TRUE
6.2.4.1 "Highlight definition text"
This policy specifies whether or not the definition term (the part
marked up in <DT> ... </DT>) should be placed in bold for greater
emphasis (see 5.6).
6.2.4.2 "Allow automatic centring"
This policy allows automatic detection of centred text to be performed.
This is normally left switched off, as it is prone to give errors. This
algorithm may be refined in later versions.
6.2.4.3 "Allow automatic 1-line <PRE>"
This policy allows individual lines to be placed in their own
<PRE> ... </PRE> sections.
This is sometimes desirable, so it is switched on by default. An
example is a document that has lines with page numbers at the top of
each page. Of course, page numbers make no sense in an HTML document.
6.3 Section headers
AscToHTM has the following section heading policies that will normally
be correctly calculated on the analysis pass :-
Section Types
-------------
First Section Number : 1
Expect Numbered Sections : TRUE
Expect Underlined Headings : FALSE
Expect Capitalised Headings : FALSE
Expect Second Word Headings : FALSE
We have 3 recognised headings
Heading level 0 is of form : "" N "" at indents : 0 /-1
Heading level 1 is of form : "" N.N "" at indents : 0 /-1
Heading level 2 is of form : "" N.N.N "" at indents : 0 /-1
Section headers are far and away the most complex things the analysis
pass has to detect, and the most likely area for errors to occur.
6.3.1 "First Section Number"
This policy indicates what the first section number is. Normally this
starts at 1, but if it starts higher, then AscToHTM may reject headers
as being out of sequence, and fail to detect the presence or absence of
contents lists correctly.
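The effect of this policy can be pictured with a small sketch. It is
not AscToHTM's actual algorithm; the function name and the strict
"next number must follow the last" rule are assumptions made purely
for illustration.

```python
def accept_headings(numbers, first_section_number=1):
    """Keep only the chapter numbers that arrive in sequence.

    Hypothetical rule: a heading is accepted only if its number is the
    one expected next, starting from first_section_number.
    """
    expected = first_section_number
    accepted = []
    for number in numbers:
        if number == expected:
            accepted.append(number)
            expected += 1
    return accepted
```

With the default of 1, a document whose chapters run 3, 4, 5 has every
heading rejected; setting the first section number to 3 accepts them
all.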
6.3.2 "Expect Numbered Sections"
This indicates whether or not numbered sections are to be expected.
6.3.3 "Expect Underlined Headings"
This indicates whether or not underlined headers are to be expected.
AscToHTM normally promotes any underlined lines to section headers.
This policy can be used to switch that behaviour off.
6.3.4 "Expect Capitalised Headings"
This indicates whether or not a line that is wholly capitalised should
be regarded as a section heading.
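As a rough illustration of what "wholly capitalised" might mean, the
sketch below treats a line as a candidate heading when it contains at
least one letter and no lower case letters. The function name and the
exact test are assumptions, not the program's own logic.

```python
def is_capitalised_heading(line: str) -> bool:
    """Return True if the line has letters and none are lower case."""
    text = line.strip()
    has_letters = any(ch.isalpha() for ch in text)
    # Digits and punctuation are unchanged by upper(), so they are ignored.
    return has_letters and text == text.upper()
```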
6.3.5 "Expect Second Word Headings"
*** not fully supported in this version ***
6.3.6 "Heading level ..."
*** not fully supported in this version ***
6.4 Bullets
AscToHTM has the following bullet point policies that will normally be
correctly calculated on the analysis pass :-
Bullet Types
-------------
Expect Numbered bullets : FALSE
Expect alphabetic bullets : TRUE
Expect Roman Numeral bullets : FALSE
Bullet characters
-----------------
Bullet Char : '-'
AscToHTM tries hard not to get confused by the "1", "a" and "I" that
happen to end up at the start of lines by chance. These could get
mistaken for bullet points.
6.4.1 "Expect Numbered bullets"
This indicates that numerical bullets are expected (but you probably
guessed that).
6.4.2 "Expect alphabetic bullets"
This does likewise for alphabetic bullet points.
AscToHTM recognises (and distinguishes between) upper and lower case
variants.
6.4.3 "Expect Roman Numeral bullets"
This does likewise for roman numerals. Again upper and lower case
variants are recognised.
6.4.4 "Bullet Char"
These policy lines indicate character(s) that can occur at the start of
a line to represent a bullet point. Special attention is paid to '-'
and 'o' characters, but any character will do.
Use one line per bullet char.
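A minimal sketch of bullet detection follows, assuming one set entry
per "Bullet Char" policy line and assuming a bullet must be followed by
a space. Both the names and that rule are illustrative only.

```python
BULLET_CHARS = {"-", "o"}  # assumed: one entry per "Bullet Char" line


def split_bullet(line: str, bullet_chars=BULLET_CHARS):
    """Return (bullet, text) if the line starts with a bullet, else None."""
    stripped = line.lstrip()
    # Require a space after the character so that words such as "only"
    # are not mistaken for an 'o' bullet.
    if len(stripped) >= 2 and stripped[0] in bullet_chars and stripped[1] == " ":
        return stripped[0], stripped[2:].strip()
    return None
```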
6.5 Definitions
AscToHTM has the following "definitions" policies that will normally be
correctly calculated on the analysis pass :-
Definitions
-----------
Definition Char : '-' (weak)
Definition Char : ':' (strong)
See the discussion in 5.6.1
6.6 Hyperlink policies
AscToHTM has the following hyperlink policies set as defaults :-
Hyperlinks
----------
Create hyperlinks : TRUE
Create mailto links : TRUE
Create NEWS links : TRUE
Cross-refs at level : 2
6.6.1 "Create hyperlinks"
This policy really means that all http, www and ftp URLs will get
converted to hyperlinks.
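The kind of conversion involved can be sketched with a regular
expression. The pattern, the function name and the handling of
trailing punctuation are all assumptions; AscToHTM's own rules are not
given here.

```python
import re

# Assumed pattern: an http:// or ftp:// prefix, or a bare "www." prefix,
# followed by everything up to the next space, quote or angle bracket.
URL_RE = re.compile(r'\b(?:(?:http|ftp)://|www\.)[^\s<>"]+')


def link_urls(text: str) -> str:
    """Wrap each probable URL in an anchor tag."""
    def to_anchor(match: re.Match) -> str:
        url = match.group(0)
        trailing = ""
        # Move sentence punctuation outside the link.
        while url[-1] in ".,;:)":
            trailing = url[-1] + trailing
            url = url[:-1]
        href = url if "://" in url else "http://" + url
        return f'<A HREF="{href}">{url}</A>{trailing}'
    return URL_RE.sub(to_anchor, text)
```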
6.6.2 "Create mailto links"
This indicates that probable email addresses such as jaf@yrl.co.uk
are to be converted into mailto hyperlinks.
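A comparable sketch for email addresses is shown below. The pattern is
a deliberately loose assumption about what counts as a "probable"
address, and the function name is hypothetical.

```python
import re

# Assumed pattern: name@host with at least one dot in the host part.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def link_emails(text: str) -> str:
    """Wrap each probable email address in a mailto anchor."""
    def to_anchor(match: re.Match) -> str:
        address = match.group(0)
        return f'<A HREF="mailto:{address}">{address}</A>'
    return EMAIL_RE.sub(to_anchor, text)
```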
6.6.3 "Create NEWS links"
This indicates that probable newsgroup references such as
alt.games.mornington.cresent (sic) are to be converted.
6.6.4 "Cross-refs at level"
This indicates the section level at which and above which all
cross-references are to be converted to hyperlinks.
For example a value of 2 means all n.n, n.n.n etc references are
converted. A value of "1" might seem desirable, but is liable to give
many false references (see 5.3.2).
This behaviour may be improved in later versions.
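To make the level rule concrete, the sketch below finds dotted numbers
in a line and keeps those with at least the requested number of
levels. The pattern and function name are assumptions for
illustration.

```python
import re

# A run of digits, optionally followed by ".digits" groups, that is not
# part of a longer number.
SECTION_REF_RE = re.compile(r"(?<![\d.])\d+(?:\.\d+)*(?!\d|\.\d)")


def find_cross_refs(text: str, min_level: int = 2):
    """Return section references that have at least min_level levels."""
    refs = []
    for match in SECTION_REF_RE.finditer(text):
        ref = match.group(0)
        level = ref.count(".") + 1  # "5.3.2" has three levels
        if level >= min_level:
            refs.append(ref)
    return refs
```

Lowering min_level to 1 makes every bare number a candidate, which
shows why that setting tends to produce false references.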
6.7 Extra HTML details
AscToHTM has the following HTML policies that will only ever take
effect if supplied in a user policy file :-
HTML details
------------
Document Title : ASC2HTML user documentation
Background Colour : DDDDCC
Background Image : (none)
HTML Script file : (none)
HTML header : a2hdocoh.txt
HTML footer : a2hdocof.txt
These "policies" allow you to start "adding value" to the HTML
generated. That is, they allow you to specify things that cannot be
inferred from the original text.
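The listings in this section all take the form "name : value". A
policy file in that layout could be read with something like the
sketch below. The assumption that a policy file uses exactly this
layout, and the function name, are illustrative only.

```python
def parse_policy_line(line: str):
    """Split 'Name : value' into (name, value); return None otherwise."""
    # Split on the first " : " only, so values may contain colons.
    name, separator, value = line.partition(" : ")
    if not separator or not name.strip():
        return None
    return name.strip(), value.strip()
```

Title and underline lines such as "HTML details" contain no " : "
separator and are skipped.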
6.7.1 "Document Title"
This identifies the text to be placed in the <TITLE> ... </TITLE>
markup in the document header.
We did consider defaulting to the first line of text, but that rarely
works.
6.7.2 "Background Colour"
This identifies the colour to be placed in the BGCOLOR attribute of the
<BODY> tag. The value is normally expressed as a 6-digit hexadecimal
value in the range 000000 (black) to FFFFFF (white).
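The 6-digit value is three 2-digit hexadecimal numbers giving the red,
green and blue components. The sketch below checks a value and builds
a tag from it. The function names are hypothetical, and the leading
"#" in the attribute is an assumption about the output rather than
something stated here.

```python
import re


def rgb_components(colour: str):
    """Return (red, green, blue) for a 6-digit hexadecimal colour."""
    if not re.fullmatch(r"[0-9A-Fa-f]{6}", colour):
        raise ValueError(f"not a 6-digit hexadecimal colour: {colour!r}")
    return tuple(int(colour[i:i + 2], 16) for i in (0, 2, 4))


def body_tag(colour: str) -> str:
    """Build a BODY tag carrying the colour (assumed output form)."""
    rgb_components(colour)  # raises ValueError if the value is malformed
    return f'<BODY BGCOLOR="#{colour.upper()}">'
```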
6.7.3 "Background Image"
This identifies the URL of any image to be placed in the BACKGROUND
attribute of the <BODY> tag.
6.7.4 "HTML Script file"
This identifies the name of a text include file to be transcribed into
the <HEAD> ... </HEAD> portion of the generated HTML page.
This allows you to place JavaScript in your pages (though you'll be a
little limited as to what it can act on).
In the near future HTML will support style sheets, so again, this
provides a hook to refine the appearance of your HTML pages.
6.7.5 "HTML header"
This identifies the name of a text include file to be transcribed into
the HTML file at the top of the <BODY> ... </BODY> portion of the
generated HTML page.
This can be used to add standard headers, logos, contact addresses to
your HTML pages, and is especially useful to give a consistent "look
and feel" when breaking your document up into a number of smaller HTML
files.
6.7.6 "HTML footer"
This identifies the name of a text include file to be transcribed into
the HTML file at the bottom of the <BODY> ... </BODY> portion of the
generated HTML page.
This can be used to add "return to home page" links, and contact
addresses to your HTML pages. Again, this helps to give a consistent
"look and feel" when breaking your document up into a number of smaller
HTML files.
6.8 File organisation policies
AscToHTM has the following HTML policies that will only ever take
effect if supplied in a user policy file :-
File split details
------------------
Split level : 1
Min HTML File size : (none)
Add contents list : TRUE
Add navigation bar : TRUE
Use DOS filenames : TRUE
DOS filename root : a2h
These policies control how your document is divided into one or more
HTML files, and how those files are to be named and linked together
with hyperlinks.
6.8.1 "Split level"
This identifies the heading level at which the generated HTML should be
split into smaller files.
A value of "none" will put all the HTML into one file.
A value of "1" will create a new HTML file for each new major section.
A value of "2" will create a new HTML file for each new n.n section,
whilst "3" creates a new document for each n.n.n section, and so on.
The first file created normally has a name that matches the source
file. Subsequent files append the section number, separated by
underscores.
Thus a file called MYDOC.TXT will generate MYDOC.HTML, MYDOC_1.HTML,
MYDOC_1_1.HTML etc...
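The naming rule can be sketched as follows. The function name and the
representation of a section number as a tuple are assumptions made for
the example.

```python
def split_file_name(source: str, section=()) -> str:
    """Name the HTML file for a section of the given source file.

    section is a tuple of section numbers, e.g. (1, 1) for section 1.1;
    the empty tuple names the first file.
    """
    root = source.rsplit(".", 1)[0]  # drop the ".TXT" extension
    parts = [root] + [str(number) for number in section]
    return "_".join(parts) + ".HTML"
```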
6.8.2 "Min HTML File size"
This policy is only relevant when splitting the document into small
output files, i.e. a "split level" is specified (see 6.8.1).
This policy specifies a minimum output HTML size in lines (although
this is only approximate).
This can be useful for documents that have chapters where all the
content is in the sub-sections. In such documents you'd end up with a
virtually empty chapter heading file if this policy is not used.
6.8.3 "Add contents list"
This policy specifies that AscToHTM should generate a contents list to
match all the section headings that it marks up. This contents list
will consist of hyperlinks to take you to the corresponding section and
HTML file.
The placement of the contents list depends on how you have decided to
split up your output HTML (see 6.8.1).
If you decide to convert MYDOC.TXT to a single HTML file MYDOC.HTML,
AscToHTM will create a separate file called CONTENTS_MYDOC.HTML and add
a link to this file at the top of MYDOC.HTML. You can, if you wish,
simply cut and paste this file into MYDOC.HTML.
If you decide to convert MYDOC.TXT into several files, then the
contents list is placed at the bottom of MYDOC.HTML, and points to all
the newly created files. Any text before the first section in your
document will be placed before the contents list in MYDOC.HTML.
Whenever you elect to have a contents list generated, any lines
perceived by AscToHTM as being part of a contents list in the original
document will be discarded.
6.8.4 "Add navigation bar"
This policy is only relevant if you have elected to split your document
into a number of smaller HTML files (see 6.8.1).
In such cases this policy allows you to have a navigation bar inserted
at the foot of each HTML page, before any standard footer is added.
The navigation bar consists of
o A "Previous" link, to take you to the previous HTML page.
o A "Next" link, to take you to the next HTML page.
o A "Contents" link, to take you to the start of the next section in
the contents list.
6.8.5 "Use DOS filenames"
This policy allows you to specify that the HTML file names must be DOS
compatible.
If selected the filenames will all have a ".HTM" extension, and be
given upper case names.
Any file name whose root exceeds 8 characters will be shortened by
keeping the first 3 characters, and adding a unique 5-digit number
derived from the longer name.
See the discussion in 4.2.4.
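The shortening rule might look like the sketch below. How AscToHTM
derives its 5-digit number is not described, so a CRC-32 checksum
reduced to 5 digits stands in for it here; that choice, like the
function name, is purely an assumption, and unlike the real scheme it
does not guarantee uniqueness.

```python
import zlib


def dos_file_name(root: str) -> str:
    """Return an upper case, DOS-compatible .HTM name for a file root."""
    root = root.upper()
    if len(root) > 8:
        # Stand-in for the program's own number: a checksum of the long
        # name, reduced to 5 digits.
        code = zlib.crc32(root.encode("utf-8")) % 100000
        root = f"{root[:3]}{code:05d}"
    return root + ".HTM"
```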
6.8.6 "DOS filename root"
Where DOS filenames are used this allows you to specify an up-to-5
character root to which any section numbers will be appended (see
6.8.1).
If splitting a document at 2 levels we normally recommend a 3-character
filename root.
Thus MYDOC.TXT given a root of MYD would produce MYD.HTM, MYD_1.HTM,
MYD_1_1.HTM etc... which are all less than 8 characters and thus
maintain some readability.
If no root were specified, MYDOC_1_1.HTM would be renamed to
MYDnnnnn.HTM where "nnnnn" would be a generated 5-digit code.
See the discussion in 4.2.4.
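Naming with a root can be sketched as below. The function name is
hypothetical, and the sketch does not attempt the renaming that
applies when the result is still longer than 8 characters.

```python
def rooted_dos_name(root: str, section=()) -> str:
    """Append section numbers to a DOS filename root of up to 5 characters."""
    if not 1 <= len(root) <= 5:
        raise ValueError("the root must be 1 to 5 characters long")
    parts = [root.upper()] + [str(number) for number in section]
    return "_".join(parts) + ".HTM"
```

A 3-character root leaves room for two single-digit section levels,
which matches the recommendation above.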
6.9 Link definitions
Link definitions appear as follows :-
Link Dictionary
---------------
Link definition : "A2HDOCO.TXT" = "Source text" + "/~jaf/A2HDOCO.TXT"
That is, the text to be matched, the text to be used in its place as
the highlighted text, and the URL this link is to point to (in this
case a relative URL).
See the discussion in 4.2.2.
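The three parts of a link definition can be pulled apart and applied
as in the sketch below. The pattern assumes the quoted layout shown in
the listing above; the function names and the simple
replace-everywhere behaviour are assumptions.

```python
import re

# Assumed layout: "match text" = "link text" + "URL"
LINK_DEF_RE = re.compile(r'^\s*"([^"]*)"\s*=\s*"([^"]*)"\s*\+\s*"([^"]*)"\s*$')


def parse_link_definition(value: str):
    """Return (match_text, link_text, url), or None if the layout differs."""
    match = LINK_DEF_RE.match(value)
    return match.groups() if match else None


def apply_link(text: str, definition) -> str:
    """Replace every occurrence of the match text with a hyperlink."""
    match_text, link_text, url = definition
    return text.replace(match_text, f'<A HREF="{url}">{link_text}</A>')
```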
6.10 Analysis policies
AscToHTM has the following policies that can be used to influence its
analysis :-
Analysis
---------
Min chapter size : 8
6.10.1 "Min chapter size"
This policy allows you to specify the minimum chapter size expected in
the document (in numbers of lines). AscToHTM will ignore any apparent
chapter headings that appear too close together.
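One way to picture this filter is sketched below. Which of two
close-set headings is ignored is not stated, so keeping the earlier
one is an assumption, as is the function name.

```python
def filter_chapter_headings(heading_lines, min_chapter_size=8):
    """Drop candidate headings that follow the last kept one too closely.

    heading_lines holds the line numbers of apparent chapter headings.
    """
    kept = []
    for line_number in sorted(heading_lines):
        if not kept or line_number - kept[-1] >= min_chapter_size:
            kept.append(line_number)
    return kept
```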
----------------------------------------------------------------------------
© 1997 John A. Fotheringham