AscToHTM Documentation for AscToHTM conversion utility
This documentation can be downloaded in .zip format.

5 HTML markup produced

5.1 Indentation

AscToHTM performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.

AscToHTM attempts to indent the HTML code to match the output indentation level, to make it easier to read. The indentations themselves will be marked up using <BLOCKQUOTE> ... </BLOCKQUOTE> tags.
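
The kind of analysis described above can be pictured with a short Python sketch. This is not AscToHTM's actual code; the function names, the 5% threshold and the spaces-only assumption are all illustrative.

```python
from collections import Counter

def detect_indent_positions(lines, min_share=0.05):
    """Return the sorted column offsets that occur often enough to be
    treated as indent levels (min_share is an assumed threshold)."""
    counts = Counter()
    for line in lines:
        stripped = line.lstrip(" ")
        if stripped:  # ignore blank lines
            counts[len(line) - len(stripped)] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(col for col, n in counts.items() if n / total >= min_share)

def indent_level(line, positions):
    """Map a line to the deepest detected position at or left of its offset."""
    offset = len(line) - len(line.lstrip(" "))
    level = 0
    for i, col in enumerate(positions):
        if col <= offset:
            level = i
    return level

def wrap_in_blockquotes(text, level):
    """Wrap text in one <BLOCKQUOTE> pair per indent level."""
    return "<BLOCKQUOTE>" * level + text + "</BLOCKQUOTE>" * level
```

A rarely used offset (for example a single stray line) falls below the threshold and is folded into the nearest shallower level rather than creating a level of its own.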

5.2 Header Lines

AscToHTM recognises various types of headers. Where headers are found and deemed consistent with the prevailing document policy (correct indentation, right type, in numerical sequence, etc.), AscToHTM will use the standard <Hn> ... </Hn> markup.

In addition, AscToHTM inserts a named anchor tag (<A> ... </A>) to allow hyperlink jumps to this point. These anchors are used, for example, in the contents list and in the cross-reference hyperlinks that AscToHTM generates.

5.2.1 Numbered headers

This is the preferred heading type and the type that AscToHTM has most success with. Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present, more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported. Support for these is planned soon, possibly via user policy files.
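
As a rough illustration of what a consistency check on N.N.N headers might involve, here is a minimal Python sketch. The regular expression, the sequence rule and the anchor name format (section_N.N) are assumptions made for the example, not AscToHTM's real behaviour.

```python
import re

# A section number such as 5, 5.2 or 5.2.1, followed by the title text.
HEADER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")

def parse_header(line):
    """Return ((5, 2, 1), 'Title') for a candidate header line, else None."""
    m = HEADER_RE.match(line)
    if not m:
        return None
    number = tuple(int(part) for part in m.group(1).split("."))
    return number, m.group(2).strip()

def is_in_sequence(prev, cur):
    """True if cur plausibly follows prev: either the first child of prev,
    or the next sibling at the same or a shallower level."""
    if prev is None:
        return True
    for depth in range(1, len(prev) + 1):
        if cur == prev[:depth - 1] + (prev[depth - 1] + 1,):
            return True
    return cur == prev + (1,)

def header_html(number, title):
    """Emit <Hn> markup with a named anchor (anchor name format is assumed)."""
    level = min(len(number), 6)
    label = ".".join(str(part) for part in number)
    return f'<H{level}><A NAME="section_{label}">{label} {title}</A></H{level}>'
```

A line that parses as a header but fails is_in_sequence (say 5.4 directly after 5.2) would be a candidate for rejection, or for treatment as a numbered bullet instead.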

5.2.2 Capitalised headers

AscToHTM can treat wholly capitalised lines as headers. It also allows for such headers to be spread over more than one line.


5.2.3 Underlined headers

AscToHTM can recognise underlined text, and optionally promote the preceding line to be a section header.
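
A simple way to picture underline detection is sketched below in Python. The set of underline characters and the length tolerance are assumptions chosen for the example; AscToHTM's own rules may differ.

```python
def is_underline(line, min_len=3):
    """True if the line consists of one repeated underline character."""
    s = line.strip()
    return len(s) >= min_len and len(set(s)) == 1 and s[0] in "-=_~*"

def find_underlined_headers(lines, tolerance=2):
    """Return (line_index, text) for each line that is followed by an
    underline of roughly the same length."""
    headers = []
    for i in range(1, len(lines)):
        text = lines[i - 1].strip()
        if text and not is_underline(text) and is_underline(lines[i]):
            if abs(len(lines[i].strip()) - len(text)) <= tolerance:
                headers.append((i - 1, text))
    return headers
```

The length comparison is what separates an underline from a long ruled line, which would more naturally become a horizontal rule.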

5.2.4 Numbered paragraphs

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToHTM can recognise this, and mark up such lines by placing the number in bold, and not using <Hn> ... </Hn> markup on the whole line.


5.3 Hyperlinks

5.3.1 Contents List lines

Contents list lines are marked up in bold, and turned into a hyperlink pointing at the section referenced. The text is sized according to heading type in the range +/- 1 font size from normal (3).

5.3.2 Cross-references

AscToHTM can convert cross-references to other sections into hyperlinks to those sections. Unfortunately, this is currently only possible for second-level and deeper numeric headings (n.n, n.n.n, n.n.n.n etc.).

This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in later releases.
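
The restriction to multi-level numbers can be illustrated with a small Python sketch. The pattern only matches numbers containing at least one dot, and only links those that correspond to a known section. The function name and the anchor format are hypothetical.

```python
import re

# At least two numeric parts (n.n or deeper), not embedded in a longer token.
XREF_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)+)(?!\w|\.\d)")

def link_cross_references(text, known_sections):
    """Turn references to known section numbers into hyperlinks."""
    def repl(match):
        number = match.group(1)
        if number in known_sections:
            return f'<A HREF="#section_{number}">{number}</A>'
        return number  # e.g. a version number or a price: leave it alone
    return XREF_RE.sub(repl, text)
```

Checking against the set of sections actually found in the document is one way to keep the error rate down; a bare "5" never matches, so single numbers are left untouched.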

5.3.3 URLs

AscToHTM can convert any URLs in the document to hyperlinks. This includes http and ftp URLs and any web addresses beginning with www.
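
A minimal Python sketch of this kind of URL detection follows. It covers only the cases named above (http, ftp and www. addresses); the pattern, the function name and the handling of trailing punctuation are assumptions for illustration.

```python
import re

URL_RE = re.compile(r'\b(?:(?:http|ftp)://|www\.)[^\s<>"]+')

def link_urls(text):
    """Wrap each detected URL in an <A HREF> tag."""
    def repl(match):
        url = match.group(0)
        trailing = ""
        # Sentence punctuation at the end is probably not part of the URL.
        while url and url[-1] in ".,;:!?)":
            trailing = url[-1] + trailing
            url = url[:-1]
        # Bare www. addresses need a scheme to work as a link target.
        href = url if "://" in url else "http://" + url
        return f'<A HREF="{href}">{url}</A>{trailing}'
    return URL_RE.sub(repl, text)
```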

5.3.4 Usenet Newsgroups

AscToHTM can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, by default AscToHTM currently only converts newsgroups in known USENET hierarchies, such as rec.gardens.

This can be overcome either by placing "news:" in front of the newsgroup name (e.g. news:uk.d-i-y) or by relaxing this condition via a document policy (see 6.3.2.4).
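
The rule described above (known hierarchies only, unless a "news:" prefix is present or the check is relaxed) might be sketched in Python as follows. The hierarchy list here is illustrative only and is not AscToHTM's own list; the strict flag stands in for the document policy.

```python
import re

# Illustrative set of top-level hierarchies (an assumption for this sketch).
KNOWN_HIERARCHIES = {"comp", "humanities", "misc", "news", "rec", "sci", "soc", "talk"}

NEWSGROUP_RE = re.compile(
    r"(?<![\w.@/:-])(news:)?([a-z][a-z0-9+-]*(?:\.[a-z0-9+-]+)+)(?![\w@])"
)

def link_newsgroups(text, known=KNOWN_HIERARCHIES, strict=True):
    """Link dotted names that look like newsgroups. With strict=True, only
    known hierarchies or names with an explicit news: prefix are linked."""
    def repl(match):
        prefix, group = match.group(1), match.group(2)
        top_level = group.split(".")[0]
        if prefix or not strict or top_level in known:
            return f'<A HREF="news:{group}">{match.group(0)}</A>'
        return match.group(0)
    return NEWSGROUP_RE.sub(repl, text)
```

With strict=False almost any dotted lower-case name is linked, which shows why the default is conservative.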

5.3.5 E-mail addresses

AscToHTM can convert any email addresses into hypertext mailto: links.
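
For illustration, e-mail detection can be approximated with a deliberately simple pattern, as in the Python sketch below. Real address syntax is considerably more permissive; the pattern and function name are assumptions, not AscToHTM's implementation.

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def link_emails(text):
    """Wrap each detected address in a mailto: hyperlink."""
    return EMAIL_RE.sub(
        lambda m: f'<A HREF="mailto:{m.group(0)}">{m.group(0)}</A>', text
    )
```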

5.4 Hanging paragraph indents

Some documents, especially ones dumped from Word, have hanging paragraph indents. That is, the first line of each paragraph starts at a different offset from the rest of the paragraph.

AscToHTM struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If writing a text file from scratch with AscToHTM in mind, then it is best to avoid this practice.


5.5 Bullets

AscToHTM detects and supports several types of bullets.

5.5.1 Bullet chars

Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet text. AscToHTM can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.

Bullets of this type are given a <UL> ... <LI> ... </UL> markup.
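
The following Python sketch shows one way such a bullet character could be picked out by counting, and how the result maps onto <UL> markup. The candidate characters and the minimum count are assumptions for the example.

```python
from collections import Counter

def detect_bullet_char(lines, candidates="-o*+", min_count=2):
    """Return the candidate character that most often starts a line and is
    followed by a space, or None if no character is used often enough."""
    counts = Counter()
    for line in lines:
        s = line.lstrip()
        if len(s) > 2 and s[0] in candidates and s[1] == " ":
            counts[s[0]] += 1
    if not counts:
        return None
    char, n = counts.most_common(1)[0]
    return char if n >= min_count else None

def bullets_to_html(lines, bullet):
    """Join continuation lines onto their bullet and emit <UL> markup."""
    items = []
    for line in lines:
        s = line.strip()
        if not s:
            continue
        if s.startswith(bullet + " "):
            items.append(s[2:])
        elif items:
            items[-1] += " " + s  # continuation of the previous bullet
    return "\n".join(["<UL>"] + [f"<LI>{item}" for item in items] + ["</UL>"])
```

Applied to the example above, the two bullets become two <LI> entries, with the three-line paragraph joined into one.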


5.5.2 Numbered bullets

AscToHTM can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

Numbered bullets are given an <OL TYPE=1 START=N> ... <LI> ... </OL> markup.

Note:
Not all browsers support this type of markup. In such cases, it's possible that the numbering of bullets will get reset to 1 every so often. However, this isn't a problem with either Netscape or Internet Explorer.
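
To show where the START value comes from, here is a minimal Python sketch that takes it from the first bullet's number. The bullet pattern (a number followed by "." or ")") and the function name are assumptions for illustration.

```python
import re

NUM_BULLET_RE = re.compile(r"^\s*(\d+)[.)]\s+(\S.*)$")

def numbered_bullets_to_html(lines):
    """Emit <OL> markup whose START attribute is the first bullet's number."""
    items = []
    for line in lines:
        m = NUM_BULLET_RE.match(line)
        if m:
            items.append((int(m.group(1)), m.group(2).strip()))
    if not items:
        return ""
    out = [f"<OL TYPE=1 START={items[0][0]}>"]
    out += [f"<LI>{text}" for _, text in items]
    out.append("</OL>")
    return "\n".join(out)
```

A browser that ignores START would number this list from 1, which is the resetting effect mentioned in the note above.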

5.5.3 Alphabetic bullets

AscToHTM detects upper and lower case alphabetic bullets. These are marked up like numbered bullets, with TYPE=a.

5.5.4 Roman Numeral bullets

AscToHTM detects upper and lower case roman numeral bullets. These are marked up like numbered bullets, with TYPE=i.
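
Choosing between the numeric, alphabetic and roman list types can be ambiguous (a lone "i" could be either), so the Python sketch below classifies a whole run of markers rather than a single one. It assumes each run starts at 1, a or i; the function names are hypothetical and this is not AscToHTM's actual algorithm.

```python
def to_roman(n):
    """Lower-case roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out

def classify_markers(markers):
    """Return the HTML TYPE value ('1', 'a', 'A', 'i' or 'I') for a run of
    bullet markers, or None if the run is not a recognised sequence."""
    if not markers:
        return None
    if all(m.isdigit() for m in markers):
        return "1"
    lower = [m.lower() for m in markers]
    upper_case = markers[0].isupper()
    # Roman is tested first, so a lone "i" is read as a roman numeral.
    if lower == [to_roman(i) for i in range(1, len(markers) + 1)]:
        return "I" if upper_case else "i"
    if lower == [chr(ord("a") + i) for i in range(len(markers))]:
        return "A" if upper_case else "a"
    return None
```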



5.6 Definitions

5.6.1 Definition lines

A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example:

        IMHO = In my humble opinion

        Address : Somewhere over the rainbow.

AscToHTM attempts to determine what definition characters are used and whether they are strong (only ever used in a definition) or weak (only sometimes used in a definition).

AscToHTM marks up definition lines by placing a <BR> on the end of the line to preserve the original line structure. Where this decision is made incorrectly, unexpected breaks can appear in the text.

AscToHTM offers the option of marking up the definition term in bold. This is not the default behaviour, however.
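
The strong/weak distinction can be made concrete with a small Python sketch. The test for what "looks like" a definition (a short term before the character, some text after it) is an assumption made for the example, as are all the names.

```python
def looks_like_definition(line, ch, max_term_words=3):
    """Heuristic: a short term, the definition character, then some text."""
    term, sep, rest = line.partition(ch)
    return bool(sep) and bool(rest.strip()) and 0 < len(term.split()) <= max_term_words

def classify_definition_chars(lines, candidates=":="):
    """Label each candidate character 'strong' if every line containing it
    looks like a definition, or 'weak' if only some of them do."""
    result = {}
    for ch in candidates:
        with_char = [ln for ln in lines if ch in ln]
        definitions = [ln for ln in with_char if looks_like_definition(ln, ch)]
        if definitions:
            result[ch] = "strong" if len(definitions) == len(with_char) else "weak"
    return result

def mark_up_definition(line, ch, bold_term=False):
    """Append <BR> to keep the line break; optionally embolden the term."""
    term, _, rest = line.strip().partition(ch)
    term = term.strip()
    if bold_term:
        term = f"<B>{term}</B>"
    return f"{term} {ch} {rest.strip()}<BR>"
```

In a document that also contains times such as 10:30 in running text, the colon would come out as weak, while an equals sign used only in definitions would come out as strong.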


5.6.2 Definition paragraphs

AscToHTM also recognises the use of definition paragraphs such as :-

      Note:     This is a definition paragraph whereby the whole
                paragraph is defining the term shown on the first line.
                Unfortunately AscToHTM currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

This gets marked up in a <DL> <DT>...</DT> <DD>...</DD> </DL> sequence.

Note:
This is a "definition" paragraph, i.e. the whole paragraph defines the term shown on the first line. Unfortunately AscToHTM currently only copes with single paragraphs (i.e. not with continuation paragraphs), and only with single word definitions.
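
A minimal Python sketch of this markup, under the same limits (a single-word term and a single paragraph), might look as follows. The pattern, which expects the term to be followed by a colon and at least two spaces, is an assumption for the example.

```python
import re

# A single-word term, a colon, a gap of two or more spaces, then the text.
DEF_PARA_RE = re.compile(r"^\s*(\S+):\s{2,}(\S.*)$")

def definition_paragraph_to_html(lines):
    """Convert one definition paragraph to <DL> markup, or return None if
    the first line does not look like 'Term:   text'."""
    if not lines:
        return None
    m = DEF_PARA_RE.match(lines[0])
    if not m:
        return None
    body = [m.group(2).strip()] + [ln.strip() for ln in lines[1:] if ln.strip()]
    return f"<DL>\n<DT>{m.group(1)}:</DT>\n<DD>{' '.join(body)}</DD>\n</DL>"
```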


5.7 Quoted lines

AscToHTM recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as > at the start of each line.

Consequently, AscToHTM adds a <BR> tag at the end of such lines to preserve the layout in the original.
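
As an illustration, the following Python sketch appends <BR> to lines that start with a quote character and escapes the > so that it displays correctly in HTML. Treating only > as a quote character is an assumption; the function name is hypothetical.

```python
import html

def mark_up_quoted(lines, quote_chars=">"):
    """Escape each line for HTML and add <BR> after quoted lines."""
    out = []
    for line in lines:
        escaped = html.escape(line, quote=False)
        stripped = line.lstrip()
        if stripped and stripped[0] in quote_chars:
            escaped += "<BR>"
        out.append(escaped)
    return out
```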


5.8 Pre-formatted text

5.8.1 Lines and form feeds

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such.

Otherwise they become horizontal rules (<HR>). Form feeds or page breaks also become <HR> markups.


5.8.2 User defined pre-formatted text

AscToHTM normally ignores any HTML markup in the original text. The sole exceptions are any preprocessor tags which a user may insert into their text document (see Using the preprocessor).

For example :-

      The use of BEGIN_PRE and END_PRE preprocessor commands (see 7.2) in 
      the text documents tells AscToHTM that this portion of the document 
      has been formatted by the user and should be left unchanged.  

5.8.3 Automatically detected pre-formatted text

AscToHTM attempts to spot chunks of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToHTM marks these sections up in <PRE> ... </PRE> tags.

Eventually it is hoped to add full <TABLE> ... </TABLE> generation for such sections.


5.9 Centred text

AscToHTM can attempt to spot chunks of centred text. However, because this can easily go wrong, this option is normally switched off.

Centring is only switched on for single isolated lines, or any group of at least two lines. <CENTER> ... </CENTER> markup is used.
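
One plausible test for a centred line is to compare its left and right margins against an assumed page width, as in the Python sketch below. The page width of 72 columns and the tolerance are assumptions; AscToHTM's own criteria are not documented here.

```python
def is_centred(line, page_width=72, tolerance=2):
    """True if the left and right margins are nearly equal and non-zero."""
    text = line.strip()
    if not text:
        return False
    left = len(line) - len(line.lstrip(" "))
    right = page_width - len(line.rstrip())
    return left > 0 and right > 0 and abs(left - right) <= tolerance

def centre_markup(line):
    """Wrap the stripped line in <CENTER> tags."""
    return f"<CENTER>{line.strip()}</CENTER>"
```

Ordinary indented text with a short last line can pass this test by accident, which suggests why the option is off by default.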





© 1997 John A. Fotheringham