A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

			      Christine Gianone
		 Manager, Kermit Development and Distribution
             Columbia University Center for Computing Activities
                            612 West 115th Street
                           New York, NY 10025, USA

                                DRAFT NUMBER 3
                                 JULY 7, 1989

ABSTRACT

A two-level extension to the presentation layer of the Kermit file transfer
protocol is proposed to allow transfer of non-English-language text files
between unlike computers.  Level 1 allows substitution of single character
sets other than ASCII in Kermit's normal text-file transfer syntax.  Level 2
specifies a new transfer syntax in which multiple character sets may be used,
along with mechanisms for switching among them as defined in ISO Standard
2022.

This is still a DRAFT proposal.  Readers with knowledge of real-world
multi-alphabet applications and file formats are urged to comment on the
suitability of this proposal.  It is assumed the reader is familiar with the
Kermit file transfer protocol.


SUMMARY OF CHANGES SINCE DRAFT #2, March 30, 1989

 - Separation of extension into Levels 1 and 2.
 - Additional file attributes for preannouncement of character sets.
 - Criteria for selection of character sets.
 - Handling of unknown character sets.
 - Handling of "illegal" characters in data.
 - Preliminary specification of user-loadable translation tables.
 - Avoidance of cryptic ISO terminology in Kermit commands.


ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments on the
first two drafts.  In most cases, their suggestions or the information they
provided have been incorporated into the third draft.

  John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
  Alan Curtis (University of London, UK)
  Frank da Cruz (Columbia University, USA)
  Joe Doupnik (Utah State University, USA)
  Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
  John Klensin (Massachusetts Institute of Technology, USA)
  Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
  Vladimir Novikov (VNIIPAS, Moscow, USSR)
  Jacob Palme (Stockholm University, Sweden)
  Andre Pirard (University of Liege, Belgium)
  Paul Placeway (Ohio State University, USA)
  Gisbert W. Selke (University of Bonn, West Germany)
  Fridrik Skulason (University of Iceland, Reykjavik)
  Johan van Wingen (Leiden, Netherlands)
  Konstantin Vinogradov (ICSTI, Moscow, USSR)
  Amanda Walker (InterCon Systems Corp, USA)

Thanks also to the following people for organizing meetings or conferences
in their countries at which the issues of this proposal were discussed:

  Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
  Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)

and thanks also to those who attended these gatherings!


STATEMENT OF THE PROBLEM

Kermit has always been able to transfer text files between unlike systems
(e.g. a UNIX system with ASCII stream text files and an IBM mainframe with
EBCDIC record-oriented text files).  To do the text file code conversion,
Kermit transfers text in ASCII.  But ASCII only includes enough letters and
symbols for English.

There are now computers capable of representing the characters of other
languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew,
Arabic, and Greek characters, Japanese and Chinese ideograms.  But different
computer manufacturers use different codes for these characters.

For example, the IBM PS/2 and the Apple Macintosh have character sets that are
"8-bit ASCII".  When the character value is 32-127, the character is
(normally) a standard ASCII graphic (printable) character.  When the value is
128 or higher, it is a special character.  But the PC and the Macintosh assign
different special characters to these values.  Here are just a few examples:

   Value     PS/2 Character      Macintosh Character
    138       Small e grave       Small a umlaut
    143       Capital A ring      Small e grave
    144       Capital E acute     Small e circumflex
    136       Small e circumflex  Small a grave

When a file contains "8-bit ASCII", Kermit presently transfers it without any
character translation.  Therefore, a text file written in French, German,
Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain
the wrong characters when it arrives at its destination: the PS/2's e-grave
becomes a-umlaut on the Macintosh, etc.
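
The mismatch can be reproduced with a short sketch.  It assumes that Python's
standard "cp437" and "mac_roman" codecs are adequate stand-ins for the PS/2
and Macintosh character sets; the function name and the sample values 130 and
144 are chosen here only for illustration.

```python
# Interpret one byte value under two different 8-bit character sets.
# Assumption: the "cp437" and "mac_roman" codecs stand in for the
# PS/2 and Macintosh sets discussed above.

def interpret(value, codec):
    """Return the character that a one-byte value denotes in a codec."""
    return bytes([value]).decode(codec)

for value in (130, 144):
    # Same stored value, two different characters.
    print(value, interpret(value, "cp437"), interpret(value, "mac_roman"))
```

Nothing in the stored bytes distinguishes the two readings, which is why the
file's character set has to be known before any translation can be done.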

The problem is compounded when a file is composed of characters from more than
one character set, for example a Japanese text file that contains Kanji,
Katakana, and Roman characters.

There are many computer vendors in the world and nobody controls what codes
they use to represent characters.  Without a standard protocol for
transferring non-ASCII text, each computer would have to know the codes of all
the other computers in order for correct transfer of non-English text files to
occur between unlike systems.


NORMAL KERMIT FILE TRANSFER SYNTAX

The Kermit file transfer protocol makes a distinction between text and binary
files.  Binary files are transmitted with no translation or conversion.  For
text files, Kermit defines a standard transfer syntax, namely
ASCII characters with carriage return and linefeed (CRLF) after each line, so
that text may be stored in useful fashion on any computer to which it is
transferred.  Each Kermit program knows how to translate from the local
text-file storage conventions to ASCII/CRLF syntax, and vice versa.  This is
the basic, required, and default mode of operation for any Kermit program, and
it will be referred to as Kermit's "Normal" or "Level 0" syntax.
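
As a minimal sketch of Level 0, assume a local system that stores text as a
stream of ASCII characters with a single linefeed ending each line (as UNIX
does).  The two function names are hypothetical, and record-oriented or
EBCDIC systems would need their own versions.

```python
# Level 0 sketch: local stream text <-> Kermit's ASCII/CRLF transfer syntax.
# Assumption: the local convention is one linefeed at the end of each line.

def to_transfer_syntax(text):
    """Convert local text to ASCII bytes with CRLF after each line."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()                 # final linefeed ended the last line
    return b"".join(line.encode("ascii") + b"\r\n" for line in lines)

def from_transfer_syntax(data):
    """Convert ASCII/CRLF transfer syntax back to the local convention."""
    return data.decode("ascii").replace("\r\n", "\n")
```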

EXPANDED KERMIT TRANSFER SYNTAX

This proposal adds two additional levels of transfer syntax, Levels 1 and 2.
Level 1 permits the use of a single character set other than ASCII in the
transfer syntax.  These additional character sets are taken from recognized
national or international standards, such as ISO 8859-1 (Latin Alphabet 1),
JIS X 0208 (Japanese), etc.

By using a standard character set (other than ASCII), it is possible to
transfer a file containing more than one language.  For example Latin Alphabet
1 can represent a file containing a mixture of Italian, Norwegian, French,
German, English, and Icelandic.

Level 2 allows a mixture of character sets to transfer mixed-language text
that requires characters from more than one standard character set, for
example a document written in Russian, French, and Greek.

The additional levels are optional features for Kermit programs, except that
Level 2 should not be provided without Level 1.

The additional overhead incurred by a Kermit program running in text mode at
any level can be avoided when transferring files between two computers that
use the same codes and formats.  Simply use the command SET FILE TYPE BINARY
to disable all translations and reformatting.

The following discussion applies to text-file transfer only.  When the Kermit
user has selected binary file transfer, none of the text-file conversions
discussed here apply.


EXPANDED SYNTAX, LEVEL 1

When all the characters in a text file can be represented by a single
character set, then that character set can be used in place of ASCII in
Kermit's text file transfer syntax.

As with ASCII, there must be a mapping between the local file character set
and the character set of the common transfer syntax.  That is, there must be a
pair of translation tables in the program, one from local to common, and one
from common to local.  Since this mode of operation is not Kermit's normal
behavior, it must be selected by the user.  The new Kermit commands are:

  SET FILE CHARACTER-SET <file-character-set-name>
  SET TRANSFER-SYNTAX CHARACTER-SET <transfer-character-set-name>

The file character set is a system-dependent item.  Some computers have only
one character set, in which case the SET FILE CHARACTER-SET command would be
unnecessary.  But other computers allow the use of different character sets,
often without any way to identify a file's encoding.  For example, the IBM PC
family running MS-DOS 3.3 or later supports a variety of "code pages" and
allows users to switch among them, as described in Chapter 9, "Code Page
Switching", of the IBM DOS 3.3 manual.  Thus, on any given PC, a file may be
encoded using Code Page 437 (USA), Code Page 850 (Multilingual), Code Page 860
(Portugal), Code Page 865 (Norway), etc.  If you have set your Code Page to
437, you may display a file created using Code Page 865 on your screen but the
wrong characters are likely to appear.

Therefore, Kermit for the IBM PC family will require the SET FILE
CHARACTER-SET command, with operands to denote the code page such as CP437,
CP850, etc.  The default character set would be the PC's original set, CP437.
Those who use other sets can avoid keying in a SET FILE CHARACTER-SET command
every time Kermit is started, by including this command in the program's
initialization file.

Similar remarks apply to European computers that use the "national replacement
characters" allowed by ISO Standard 646.  This standard specifies a 7-bit
character set equivalent to ASCII, but with national variants in which certain
non-alphanumeric ASCII graphic characters are replaced by "national
characters", as shown in Table 1.

_____________________________________________________________________________

Column/Row   ASCII          German         Finnish   Norwegian     French

  04/00      at-sign        section       at-sign   at-sign       a-grave
  05/11      left-bracket   A-umlaut      A-umlaut  AE-diphthong  degree
  05/12      backslash      O-umlaut      O-umlaut  O-slash       c-cedilla
  05/13      right-bracket  U-umlaut      A-circle  A-circle      section  
  06/00      accent-grave   accent-grave  e-acute   accent-grave  accent-grave
  07/11      left-brace     a-umlaut      a-umlaut  ae-diphthong  e-acute
  07/12      vertical-bar   o-umlaut      o-umlaut  o-slash       u-grave
  07/13      right-brace    u-umlaut      a-circle  a-circle      e-grave
  07/14      tilde          ess-zet       u-umlaut  tilde         umlaut

           Table 1: ISO 646 Usage in Selected Countries
_____________________________________________________________________________

(See Figure 1 for an explanation of column/row notation.)

For example, the German phrase "Gr<u-umlaut><ess-zet> aus K<o-umlaut>ln" would
be rendered in ASCII as "Gr}~ aus K|ln", and the ASCII C-language phrase
"{~a[x]}" would become "<a-umlaut><ess-zet>a<A-umlaut>x<U-umlaut><u-umlaut>"
in German ISO 646.  The German user would want Kermit to interpret the local
file characters as German in the former case, and as ASCII in the latter.
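
The two interpretations can be sketched as follows.  The table is the German
column of Table 1; the names GERMAN_ISO646 and read_as are hypothetical, and
Python's Unicode strings are used only as a convenient way to name the
resulting characters.

```python
# The German column of Table 1, keyed by the ASCII character that
# occupies the same code position.
GERMAN_ISO646 = {
    "@": "\u00a7",   # 04/00 section sign
    "[": "\u00c4",   # 05/11 A-umlaut
    "\\": "\u00d6",  # 05/12 O-umlaut
    "]": "\u00dc",   # 05/13 U-umlaut
    "{": "\u00e4",   # 07/11 a-umlaut
    "|": "\u00f6",   # 07/12 o-umlaut
    "}": "\u00fc",   # 07/13 u-umlaut
    "~": "\u00df",   # 07/14 ess-zet
}

def read_as(text, national_table=None):
    """Interpret 7-bit text as ASCII (no table) or as a national variant."""
    if national_table is None:
        return text
    return text.translate(str.maketrans(national_table))
```

The same stored characters give different results depending on which file
character set the user has declared, which is the purpose of the SET FILE
CHARACTER-SET command.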

SPECIFYING THE TRANSFER SYNTAX

To select Level 1, the user must type the command

  SET TRANSFER-SYNTAX CHARACTER-SET <name>

where <name> is the name of a standard character set.  To minimize the work of
the programmer, the consternation of the user, and the memory requirements for
the Kermit program itself, the number of character sets which Kermit uses for
Level 1 transfer syntax should be kept to a minimum.  As a starting point, the
sets shown in Table 2 are recommended.  The criteria for including a character
set in this table are:

1. ASCII (= ISO-646 International Reference Version, IRV) is included.

2. Except for ASCII, each set should be either (a) the "right half" of an
   8-bit single-byte set whose "left half" is the same as ASCII/ISO-646-IRV,
   or (b) a multi-byte set.

3. Each character in the set should be self-contained, and not formed as
   a composite of other characters.

4. The set must be listed in the ISO International Register of Character 
   Sets, so that it has a unique registration number and designating escape
   sequence.  (But provisions are made for other registration authorities.)

5. The set must be a national or international standard graphic character 
   set, intended for use in computer text processing or programming (as
   opposed to Videotex, Teletex, OCR, device control, and other applications).
   This category may include line-drawing or technical character sets which
   fit the other criteria.

Note in particular that the national variants of ISO 646 are not included,
since these are covered adequately by ASCII and the ISO Latin alphabets.

Standard "Kermit names" (for use with the SET TRANSFER-SYNTAX command) are
given to these character sets so that they may be referred to uniformly in
all Kermit implementations.  These names are chosen to be mnemonic, so that
users don't have to remember long numbers like "ISO-8859-3".  The choice of
single words like "CYRILLIC" implies that there will not be more than one
transfer syntax for Cyrillic text.  However, if these standards change in the
future, it will be possible to append further identifying material to these
names, e.g. "CYRILLIC-2", "CYRILLIC-3", etc.

_____________________________________________________________________________

US 7-bit ASCII, equivalent to the ISO 646 International Reference Version
  (IRV) character set.  English, German without umlauts or ess-zet, etc.
  Kermit name: NORMAL.  ISO Registration Number: 2.

ISO 8859-1, Latin Alphabet 1, for Dutch, English, Faeroese, Finnish, French,
  German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and
  Swedish.
  Kermit name: LATIN1.  ISO Registration Number: 100.

ISO 8859-2, Latin Alphabet 2.  Albanian, Czech, English, German, Hungarian,
  Polish, Romanian, Serbocroatian, Slovak, and Slovene.
  Kermit name: LATIN2.  ISO Registration Number: 101.

ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto,
  French, Galician, German, Italian, Maltese, and Turkish.
  Kermit name: LATIN3.  ISO Registration Number: 109.

ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish, German,
  Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.  ISO Registration Number: 110.

ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian,
  Macedonian, Russian, Serbocroatian, and Ukrainian (Compatible with USSR GOST
  Standard 19768-1987 and ECMA-113).
  Kermit name: CYRILLIC.  ISO Registration Number: 144.

ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.  ISO Registration Number: 127.

ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.  ISO Registration Number: 126.

ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.  ISO Registration Number: 138.

ISO DIS 8859-9, Latin Alphabet 5, in which six Icelandic letters from
  Latin Alphabet 1 were replaced by six other letters needed for Turkish.
  Kermit name: LATIN5.  ISO Registration Number: 148.

CSN 36 91 03, Czechoslovak Standard alphabet.
  Kermit name: CZECH.  ISO Registration Number: 139.

JIS X 0201, a 1-byte code including ASCII and Japanese Katakana.
  Kermit name: KATAKANA.  ISO Registration Number: 13 (Kana), 14 (Roman).

JIS X 0208, a 2-byte code containing Japanese Kanji, Katakana, Hiragana,
  Roman, Greek, and Russian characters, plus special symbols, etc.
  Kermit name: KANJI.  ISO Registration Number: 87.

Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.  ISO Registration Number: 58.

KS C 5601 (1987), a 2-byte code for Korean.
  Kermit name: KOREAN.  ISO Registration Number: 149.

            Table 2: Standard 8-Bit Character Sets
_____________________________________________________________________________

The ISO Latin alphabets and the Czech character set are 8-bit character sets
whose left half is identical with ASCII, and whose right half contains the
special characters.  The ISO registration number refers only to the right half
of each of these character sets.  But each of these sets must be used in its
entirety, because the unaccented Roman letters, the digits, and the
punctuation marks appear only in the ASCII left half.  Therefore, this
proposal considers an 8-bit character set composed of ASCII plus one of the
right-half sets to be a SINGLE character set.  The Kermit character-set name
refers to the two halves combined as a single set.  See Figure 2 for the
layout of an 8-bit character set.

A particular Kermit program need not incorporate all of these character sets.
In many cases, a single 8-bit character set such as LATIN1 will suffice.  For
example, in the USSR there are at least five computer codes in use for
Cyrillic characters.  But all of them can be mapped to ISO Latin/Cyrillic,
which also includes ASCII.  So in all likelihood, a Soviet version of Kermit
need only use CYRILLIC in its Level-1 transfer syntax, allowing it to transfer
Russian and English language text files among computers using different codes.

When a language is representable in more than one character set from this
table, as are English, German, Finnish, Czech, Turkish, etc., the character
set highest on the list which adequately represents the language should be
preferred.  For example, NORMAL for English, LATIN1 for French, LATIN1 for
German (because it represents German better than ASCII), LATIN5 for Turkish
(because it represents Turkish better than LATIN3), etc.  This is to maximize
the chance that any two particular Kermit programs will recognize the same
character sets.

Unfortunately, but unavoidably, the burden of choosing the best transfer
syntax character set must be placed upon the user.  If a file containing a
mixture of Finnish, English, and Danish must be transferred, the user must
find a character set that can adequately represent all three languages, in
this case Latin Alphabet 4.  A table like Table 3 should be provided in the
user documentation to help the user make this selection.

_____________________________________________________________________________

    Arabic     ARABIC                      Italian        LATIN1,3
    Bulgarian  CYRILLIC                    Kanji          KANJI
    Chinese    CHINESE                     Katakana       KATAKANA, KANJI
    Czech      CZECH, LATIN2               Korean         KOREAN
    Danish     LATIN4                      Latvian        LATIN4
    Dutch      LATIN1,2,3,4                Lithuanian     LATIN4
    English    NORMAL,LATIN1,2,3,4,5,etc   Norwegian      LATIN1,4
    Esperanto  LATIN3                      Polish         LATIN2
    Estonian   LATIN4                      Portuguese     LATIN1
    Finnish    LATIN1,4                    Romanian       LATIN2
    Flemish    LATIN1,2,3,4,5              Russian        CYRILLIC
    French     LATIN1,3,5                  Serbocroatian  LATIN2
    German     LATIN1,2,3,4,5              Slovak         LATIN2
    Greek      GREEK                       Spanish        LATIN1
    Hebrew     HEBREW                      Swedish        LATIN1,4
    Hungarian  LATIN2                      Turkish        LATIN5,3
    Icelandic  LATIN1                      Ukrainian      CYRILLIC

          Table 3: Preferred Transfer Syntax Character Sets

_____________________________________________________________________________
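
A program could offer the same help that Table 3 gives the user.  The sketch
below holds only an excerpt of the table, and the rule it applies (try the
sets in the first language's order of preference and take the first one
that every other language also lists) is an assumption made for
illustration, not part of this proposal.

```python
# Excerpt of Table 3.  Each list keeps the order of preference shown there.
PREFERRED = {
    "Danish":  ["LATIN4"],
    "English": ["NORMAL", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"],
    "Finnish": ["LATIN1", "LATIN4"],
    "French":  ["LATIN1", "LATIN3", "LATIN5"],
    "German":  ["LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"],
    "Russian": ["CYRILLIC"],
    "Turkish": ["LATIN5", "LATIN3"],
}

def choose_transfer_set(languages):
    """Return one transfer character set usable for every language named.

    Candidates are tried in the first language's order of preference.
    None means that no single set in this excerpt covers all of them.
    """
    first, *rest = languages
    for name in PREFERRED[first]:
        if all(name in PREFERRED[other] for other in rest):
            return name
    return None
```

For the Finnish, English, and Danish file discussed above this yields LATIN4.
A result of None, as for Russian together with French, indicates a file that
Level 1 cannot carry in a single set.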


IMPLEMENTATION OF LEVEL 1

The Level-1 Kermit extension can be added to existing Kermit programs with
a minimum of effort.  The following steps are required for each Kermit program:

1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't
   have it already.  SET FILE TYPE TEXT enables text-file character set
   conversion at all levels.  SET FILE TYPE BINARY disables conversions of
   all kinds.

2. Add the SET FILE CHARACTER-SET <name> command.  The set of <names> should
   include ASCII (used for program source, etc) plus the names of any
   "national" character sets that are used on this particular computer.

3. Add the SET TRANSFER-SYNTAX <option> command.  The options should include
   NORMAL (Kermit's normal, unextended ASCII-based transfer syntax), as well
   as CHARACTER-SET <name>, including the names of one or more character sets
   from Table 2 which contain the characters from the computer's local
   character set(s) in (2).

4. Add translation tables between each pair of character sets in (2) and (3).
   For each pair, two translation tables are necessary: one from the local
   set to the common one, and one from the common set to the local one.

5. Add SHOW commands to let the user find out (a) what character sets are
   available, and (b) which ones are currently selected, for the transfer
   syntax and for local files.

6. Optionally, to allow for user-defined character-set translations, also
   add the LOAD TRANSLATION-TABLE, SHOW TRANSLATION-TABLE, and DUMP
   TRANSLATION-TABLE commands (described in the next section).

Example: To transfer a Finnish-language text file from a computer that uses
the Finnish ISO 646 national replacement set to an IBM PS/2, and to store the
file using the PS/2's Multilingual Code Page:

  On the sending computer:
    SET FILE TYPE TEXT
    SET FILE CHARACTER-SET FINNISH-ISO646
    SET TRANSFER-SYNTAX CHARACTER-SET LATIN1

  On the receiving computer:
    SET FILE TYPE TEXT
    SET TRANSFER-SYNTAX CHARACTER-SET LATIN1
    SET FILE CHARACTER-SET CP850

To transfer a C-language source program between the same two computers:

  On the sending computer:
    SET FILE TYPE TEXT
    SET FILE CHARACTER-SET ASCII
    SET TRANSFER-SYNTAX CHARACTER-SET LATIN1

  On the receiving computer:
    SET FILE TYPE TEXT
    SET TRANSFER-SYNTAX CHARACTER-SET LATIN1
    SET FILE CHARACTER-SET ASCII
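
The Finnish example amounts to two independent translations, one on each
side, that meet at the common transfer syntax.  In the sketch below the
Finnish table comes from Table 1, Python's "latin-1" and "cp850" codecs are
assumed to match LATIN1 and the Multilingual Code Page, and Unicode strings
take the place of the byte-to-byte tables a real Kermit program would hold.
The function names are hypothetical.

```python
# Finnish column of Table 1, keyed by the ASCII character in that position.
FINNISH_ISO646 = str.maketrans({
    "[": "\u00c4",   # 05/11 A-umlaut
    "\\": "\u00d6",  # 05/12 O-umlaut
    "]": "\u00c5",   # 05/13 A-circle
    "`": "\u00e9",   # 06/00 e-acute
    "{": "\u00e4",   # 07/11 a-umlaut
    "|": "\u00f6",   # 07/12 o-umlaut
    "}": "\u00e5",   # 07/13 a-circle
    "~": "\u00fc",   # 07/14 u-umlaut
})

def sender_translate(file_bytes):
    """Local Finnish ISO 646 file data -> LATIN1 transfer syntax."""
    return file_bytes.decode("ascii").translate(FINNISH_ISO646).encode("latin-1")

def receiver_translate(transfer_bytes):
    """LATIN1 transfer syntax -> local CP850 file data."""
    return transfer_bytes.decode("latin-1").encode("cp850")
```

Neither side needs to know the other's file character set; each knows only
its own set and LATIN1.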


TRANSLATION TABLES

In some cases, translation tables will be 1-for-1.  That is, the two character
sets are the same size, and each character from one set can be found in the
other set.  In such cases, the translation table need be only a list of
numbers, in which position "n" in the table contains the translation for
character number "n".

In many cases, the two character sets will be the same size, but certain
characters from one will be lacking in the other, and/or vice versa.  For
example, IBM Code Page 850 and the Apple Macintosh sets are both "8-bit
ASCII", but the IBM set lacks the Macintosh's Y-umlaut, and the Macintosh
set lacks IBM's Y-acute.
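
A table of this kind can be sketched as a 256-entry byte string in which
position n holds the translation of character n.  The builder below is
hypothetical: it assumes Python's "cp850" and "mac_roman" codecs describe the
two sets, and it fills every position that has no counterpart with a question
mark, which is only one possible policy.

```python
def build_table(src_codec, dst_codec, substitute=b"?"):
    """Build a 256-entry table; position n translates character n."""
    table = bytearray(256)
    for n in range(256):
        try:
            out = bytes([n]).decode(src_codec).encode(dst_codec)
        except UnicodeError:
            out = substitute        # character is lacking in one of the sets
        table[n] = out[0] if len(out) == 1 else substitute[0]
    return bytes(table)

CP850_TO_MAC = build_table("cp850", "mac_roman")

def translate(data, table):
    """Apply a 1-for-1 table to a run of bytes."""
    return data.translate(table)
```

With this table, CP850's Y-acute (value 237) comes out as "?" because the
Macintosh set has no such character, while characters found in both sets are
mapped to their Macintosh values.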

In other cases the character sets will be different sizes.  We have long been
familiar with this problem when translating between 7-bit ASCII and 8-bit
EBCDIC.  In Japan, there must be translations between single-byte Roman,
Greek, and Cyrillic characters and the two-byte JIS X 0208 character set.  In
the future we must allow for the possibility of a multibyte universal
character set containing characters from many languages, as proposed in ISO
DIS 10646.  Here, the translation will be between 7- or 8-bit one-byte codes
and a multibyte code.

It is recommended that translation tables built into Kermit programs be as
general and useful as possible, substituting the closest possible character
when an exact match is not available.  For instance, when translating from
Latin-1 to ASCII, accented letters should be translated into the corresponding
unaccented letters: a-acute becomes a, etc.
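
One way to generate such a table, offered only as a sketch, is to take the
base letter from each character's Unicode canonical decomposition.  This
relies on the fact that Latin-1 values coincide with the first 256 Unicode
code points; the helper name is hypothetical, and characters that do not
decompose (ess-zet, the AE diphthong, and so on) simply become "?" here.

```python
import unicodedata

def closest_ascii(ch, substitute="?"):
    """Return the unaccented ASCII letter for ch, or a substitute."""
    if ord(ch) < 128:
        return ch
    base = unicodedata.normalize("NFD", ch)[0]   # letter before its accent
    return base if ord(base) < 128 else substitute

# Position n holds the ASCII translation of Latin-1 character n.
LATIN1_TO_ASCII = bytes(ord(closest_ascii(chr(n))) for n in range(256))
```

A generated table of this kind is a starting point at best and would still
need review for each language.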

Translation tables should be designed for maximum flexibility.  They should
not be restricted to 1-to-1, byte-for-byte mappings, but should also allow
for many-to-single and single-to-many byte mappings.  Translation tables are
best left to native speakers of each language.

USER-DEFINED TRANSLATIONS

It should be possible for users to alter Kermit's translation tables or to add
new ones, without having to change the program's source code.  For example, in
certain situations it might be preferable to have a-grave rendered in ASCII as
"a", but in others as "`a", "a`", or even "?".  It is also likely that new
character sets will appear which are unknown to the Kermit program.

To return to the example of translating Latin-1 into ASCII, consider German
text containing the Ess-Zet character and vowels with umlauts.  It is
acceptable to render Ess-Zet as "ss", and to render an umlaut vowel as the
same vowel followed by "e".  But this should not necessarily be done for
languages other than German.  Therefore German users might wish to set up a
special translation table which performs these functions, so that
"Gr<u-umlaut><ess-zet> aus K<o-umlaut>ln" would become "Gruess aus Koeln"
(correct German) rather than "Grus aus Koln".
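
Such a German table needs single-to-many mappings, so a plain 256-entry list
of single values is not enough.  The sketch below uses a dictionary from
Latin-1 values to byte strings; the names are hypothetical and the in-memory
form is an assumption, unrelated to the file layout of Appendix C.

```python
# Latin-1 value -> replacement ASCII string, following German usage.
GERMAN_LATIN1_TO_ASCII = {
    0xC4: b"Ae", 0xD6: b"Oe", 0xDC: b"Ue",   # capital umlaut vowels
    0xE4: b"ae", 0xF6: b"oe", 0xFC: b"ue",   # small umlaut vowels
    0xDF: b"ss",                             # ess-zet
}

def translate(data, table, default=b"?"):
    """Translate Latin-1 bytes to ASCII, allowing one-to-many mappings."""
    out = bytearray()
    for byte in data:
        if byte < 128:
            out.append(byte)                 # ASCII passes through unchanged
        else:
            out += table.get(byte, default)
    return bytes(out)
```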

For these reasons, we suggest a standard format for translation tables, and a
LOAD TRANSLATION-TABLE command to allow the user to add new character sets to
a Kermit program's repertoire, or to alter current translations.

Each table within a program is assigned an arbitrary tablename.  For example,
LATIN1-CP850 could be the name for the Latin-1 to CP850 table in the PS/2, and
CP850-LATIN1 could be the name for the table in the other direction.  To load
a replacement table, the user would issue the command:

  LOAD TRANSLATION-TABLE <tablename> <filename>

where <tablename> is the name to assign to the new table.  If a table with
that name already exists, it is replaced.  The layout of a loadable
translation table is given in Appendix C.

A Kermit program, upon loading one of these files, would set up the
translation table, add the name of the table and of the character sets
themselves to the appropriate keyword tables, and so on.

So that the translation-table related commands can also be effective for
built-in translation tables, it is recommended that the built-in tables be
designed in the same format as the loadable tables.

Two additional commands should be furnished to allow the user to get
information about the currently loaded tables:

  SHOW TRANSLATION-TABLE <tablename>

which would give summary information, and:

  DUMP TRANSLATION-TABLE <tablename> <filename>

which would write out a translation table (even a built-in one) in the form
shown in Appendix C, so that it could be edited and loaded again.


ATTRIBUTE PACKETS AT LEVEL 1

The objective of Kermit's Level-1 extension is to accommodate as many
computers as possible with a minimum of programming effort.  But this approach
places a burden on the user in the form of new commands and the confusion
which results if the user forgets to issue these commands.

Level 1 does not require support for Kermit File Attribute Packets, whose use
is negotiated in the Kermit Initialization exchange.  But the user's burden
can be alleviated if the sending Kermit program uses an attribute field to
inform the receiving Kermit of the character set to be used in the transfer
syntax.  The receiving program can accept or refuse the file based on whether
it supports the specified character set.  If the receiving program refuses a
file, the user can override this refusal, for example, if a long file contains
only a word or two in an unknown character set.  The most common user-override
is the command SET ATTRIBUTES OFF.  However, this also disables other
desirable effects of attribute packets, such as prenotification of file size.
Therefore, it might be desirable to let the user specify exactly which
attributes are to be "turned off", e.g. SET ATTRIBUTES CHARACTER-SET OFF.

In order for the sender to inform the receiver of the transfer alphabet, a new
value for the Encoding attribute ("*") is defined, namely "C", which is
substituted for the normal value "A" (ASCII).  "C" means that the actual
character set is specified in a separate new character-set attribute, "2".
The operand of the character-set attribute is the letter "I" (for ISO)
followed by an ISO registration number for the character set, as shown in
Table 2, expressed in decimal ASCII digits, for example:

  +---+---+---+  +---+---+------+
  | * | ! | C |  | 2 | $ | I100 |
  +---+---+---+  +---+---+------+

where "*" is code for the Encoding Attribute (or transfer syntax), "!" is the
length (ASCII 33 - 32 = 1) of its value, and the single character "C" is the
value itself, which means "I'm using a specified Character set, look in the
'2' attribute to find out which one."  If "C" is specified as the encoding,
but the "2" attribute is absent, this should be diagnosed as a protocol error.

"2" is the character set attribute, "$" is the length of the following data
(in this case, ASCII 36 - 32 = 4), and the four characters "I100" mean
"character set ISO 100", i.e. ISO Latin-1.  The "I" is included to allow for
the possibility of other character set registration authorities (for example,
K for Kermit Development and Distribution).
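
The field layout just described (attribute code, length plus 32 as a single
character, then the value) can be sketched as follows.  The function names
are hypothetical, and the sketch covers only the data field of the Attribute
packet, not the packet framing around it.

```python
def encode_attribute(code, value):
    """Return one attribute field: code, length character, value."""
    if len(value) > 94:
        raise ValueError("value too long for a one-character length")
    return code + chr(len(value) + 32) + value

def character_set_attributes(registration_number, authority="I"):
    """Announce a Level-1 transfer character set by registration number."""
    return (encode_attribute("*", "C")
            + encode_attribute("2", authority + str(registration_number)))

def decode_attributes(data):
    """Split an attribute data field into a {code: value} dictionary."""
    fields = {}
    i = 0
    while i < len(data):
        length = ord(data[i + 1]) - 32
        fields[data[i]] = data[i + 2:i + 2 + length]
        i += 2 + length
    return fields
```

For ISO Latin-1 the encoder produces the nine characters *!C2$I100, matching
the two fields shown in the diagram above.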

Based on this information, the receiver may accept or reject the file, using
Kermit's normal attribute response mechanism.  To accept, it puts a "Y" as the
first character of the data field, followed by "*" to indicate that it accepts
Level-1 transfer syntax, and a "2" to indicate it accepts the specified
character set.  To refuse, it puts an "N" instead of a "Y", followed by "*" if
it cannot do Level-1, and/or a "2" if it does not recognize the character set.
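
The receiver's decision can be sketched in a few lines.  The function and
its parameters are hypothetical; in particular, representing the receiver's
repertoire as a set of strings such as "I100" is an assumption made for the
example.

```python
def attribute_response(charset, level1_supported, known_sets):
    """Return the data field of the reply to a character-set announcement.

    charset is the value of the "2" attribute, for example "I100".
    """
    refused = ""
    if not level1_supported:
        refused += "*"          # cannot do Level-1 transfer syntax
    if charset not in known_sets:
        refused += "2"          # does not recognize this character set
    return "N" + refused if refused else "Y*2"
```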

If the file is refused in this manner, the sending Kermit can issue an
informative message to the user, and the user can cancel the transfer manually
and then find some other way to transfer the file, for example in normal,
Level-0 text or binary mode with pre- and postprocessing, or even by loading a
new translation table.

It is recognized that there are presently Kermit implementations in the USSR,
Japan, and elsewhere that use character sets other than the ones listed in
Table 2 in their transfer syntax, and/or sets that are not listed in the
International Register.  It is recommended that these Kermit programs be
converted to use the recommended standard character sets.

LEVEL 1 PERFORMANCE

Level 1 can be used to transfer files containing special characters when
character-set switching is not required.  However, Level-1 transfer will not
always be efficient.  Since the special characters have their 8th bits set to
one, there will be a lot of 8th-bit prefixing in the 7-bit environment -- the
higher the proportion of special characters to ASCII characters, the lower the
efficiency.  For a language like Russian, in which all letters come from the
right half of the character set, efficiency will be very poor.

Therefore, even though Russian (and Greek, Hebrew, and Arabic) are served by
Level 1 of this proposal, files encoded in these character sets can be
transferred more efficiently using the facilities of Level 2 in the 7-bit
communication environment.  See Table 4.
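
The effect can be estimated with a simplified model in which every byte
whose 8th bit is set costs one extra prefix character on a 7-bit link.  The
sketch ignores control-character prefixing, quoting of the prefix characters
themselves, and packet overhead, and it assumes Python's "latin-1" and
"iso8859_5" codecs correspond to LATIN1 and CYRILLIC.  The function names
are hypothetical.

```python
def seven_bit_cost(data):
    """Characters needed to send data when 8th-bit bytes need a prefix."""
    return sum(2 if byte & 0x80 else 1 for byte in data)

def expansion(text, codec):
    """Ratio of characters sent to characters of file data."""
    data = text.encode(codec)
    return seven_bit_cost(data) / len(data)
```

Under this model plain English text has an expansion of 1.0, while a Russian
word written entirely in Cyrillic letters has an expansion of 2.0.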

EXPANDED SYNTAX, LEVEL 2: MULTIPLE CHARACTER SETS

Suppose there is a computer that can store a file containing characters from
many languages.  It may do this by using a multibyte character code, or by
imbedding some kind of control information in the file to mark each change of
character set.

One such computer is the Xerox Star and its successors, described by Joseph D.
Becker in the Scientific American articles "Multilingual Word Processing"
(July 1984) and "Arabic Word Processing" (July 1987).  The Star stores textual
data intermixed with special codes.  A byte of all 1's means "alphabet shift",
followed by another byte or two to identify the alphabet.

Another, more limited, example is a computer using one of the AT&T Extended
UNIX Codes (EUC), such as JAE for Japan.  In this code, a byte with its high
order bit set to zero is ASCII.  If it is set to one, then it is either a
1-byte Kanji or 1-byte Katakana character, or (if it has a certain special
value) a shift character indicating that the next two bytes are a Kanji
character.  (See N. Takahashi and W. Krone, "The Language Problem", UNIX
Review, February 1987.)

A third example is an IBM PC or PS/2 running a commercial word processor which
uses the PC's graphics adapter to display characters from different alphabets
(Roman, Greek, etc) in different renditions (bold, italic, underlined).  A
multilanguage word processor file may contain not only alphabet information,
but also formatting and rendition information.  The format of these word
processor files is proprietary, and differs from product to product.

A final example is a simple IBM PC "8-bit ASCII" text file which also contains
the PC's line-drawing characters.  These characters have no equivalents in the
ISO Latin alphabets, and so two standard character sets would be required to
represent these files during transmission.

Now suppose we want to transfer a multi-language text file from one computer
to a different kind of computer.  Since there will be a growing need to do
this, and a growing number of computers and applications that will support
multi-language text in incompatible ways, it is clearly impractical to require
each computer to know the formats and codes of each other computer.

Once again, a standard common intermediate representation, or transfer syntax,
is required so that each Kermit program need only know the codes and formats
used on its own computer, plus the transfer syntax.  But unlike Kermit's
normal transfer syntax, and unlike Kermit's Level-1 extended transfer syntax,
the multi-language syntax must embody an in-band mechanism for identifying
character sets and switching among them.

Fortunately, these mechanisms are already well-defined in the host-terminal
communications environment, and they can be readily adapted to Kermit file
transfer.  The mechanisms we propose to use are defined in the following
international standards:

  ISO 4873, "Information Processing - ISO 8-bit code for information
    interchange - Structure and rules for implementation"

  ISO 2022, "Information Processing - ISO 7-bit and 8-bit coded character 
    sets - Code extension techniques"

  ISO 2375, "Data Processing - Procedure for registration of Escape Sequences"

  ISO "International Register of Coded Character Sets to be Used with Escape
    Sequences"

These standards are summarized in Appendix B, "How the Standards Work".


KERMIT MULTI-CHARACTER-SET FILE TRANSFER

Level 2 Kermit syntax is intended for transferring multilanguage files that
cannot be adequately represented in a single standard character set.  The new
"international" transfer syntax preannounces character sets by their ISO
registration numbers, designates them by their registered escape sequences,
and invokes them by single or locking shifts as defined in ISO Standard 2022.

ENABLING LEVEL-2 TRANSFER SYNTAX

The new transfer syntax must be selected in some way, either automatically or
explicitly by the user.  In the automatic case, the Kermit program recognizes
(somehow) that it is to transfer a multi-alphabet text file.  In the manual
case, the user issues the command SET TRANSFER-SYNTAX INTERNATIONAL.

It must also be possible for the user to override the automatic use of
international syntax via the command SET TRANSFER-SYNTAX NORMAL.

PROTOCOL PRIOR TO DATA TRANSFER

It is strongly recommended that any Kermit program which is to use
international syntax also support file attribute (A) packets.  These are used
for two purposes: (1) to inform the receiver that international syntax will be
used and with which ISO-2022 facilities, and (2) to preannounce the file's
character sets.  This will give the receiver the opportunity to refuse files
that it cannot translate, and to allocate the necessary resources for those
files which it can accept.

In Level 2, the value of the encoding attribute "*" should be the uppercase
letter "I" (for International), optionally followed by one or more ISO-2022
announcer letters, as listed in Appendix C, for example "IBZ" to declare that
G0 and G1 will be used with locking shifts, and G2 with single shifts.

  +---+---+-----+
  | * | # | IBZ |
  +---+---+-----+

In addition, the sender may (but is not required to) preannounce the transfer
syntax character sets by listing them in the new attribute, "2", "Character
Sets".  The value of the character-sets attribute is a comma-separated list of
ISO character set registration numbers.  For example:

  +---+---+----------------+
  | 2 | . | I100,I127,I144 |
  +---+---+----------------+

where "2" is the character-set attribute, "." is the length of the following
value (in this case "." = ASCII 46 - 32 = 14) and the next 14 bytes list ISO
character set numbers 100, 127, and 144, each number prefixed by "I" to
denote the ISO registration authority.
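
The attribute layout described above can be sketched in a few lines of Python.
This is only an illustration: the function names encode_attribute and
character_sets_value are hypothetical, and the 94-character limit is an
assumption based on the length being carried in a single printable character.

```python
def encode_attribute(tag: str, value: str) -> str:
    """Build one attribute field: tag, length character, then the value."""
    if len(tag) != 1:
        raise ValueError("attribute tag must be a single character")
    if len(value) > 94:
        # Assumption: one printable length character covers 0..94.
        raise ValueError("value too long for a single length character")
    # The length is sent as a printable character: length + 32.
    return tag + chr(len(value) + 32) + value


def character_sets_value(registration_numbers):
    """Comma-separated ISO registration numbers, each prefixed by "I"."""
    return ",".join("I%d" % n for n in registration_numbers)


print(encode_attribute("*", "IBZ"))                                  # *#IBZ
print(encode_attribute("2", character_sets_value([100, 127, 144])))  # 2.I100,I127,I144
```

The two printed fields correspond to the two diagrams above: "#" encodes a
length of 3 and "." a length of 14.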

If the sending Kermit can ascertain a file's character sets easily, it should
send this information in the attribute packet.  Otherwise preannouncement of
character sets could require a time-consuming scan through the file prior to
sending it, which is undesirable for large files not only because it reduces
Kermit's efficiency, but also because it could cause the entire Kermit session
to time out.  Therefore, preannouncement of character sets is not required.

The receiver may accept or refuse a file using Kermit's normal attribute reply
mechanism.  When accepting the file, it should include, at minimum, the "*"
attribute in its acceptance, so that the sender will know that the receiver
understands international syntax.  When refusing a file, it should indicate
what caused the problem: "*" means it can't do international transfer syntax,
but "2" (without "*") means that one or more of the announced character sets
are unknown.

If the file is refused in this manner, the sending Kermit can issue an
informative message to the user, and the user can cancel the transfer manually
and then find some other way to transfer the file (for example, binary mode,
or normal text mode with pre- and postprocessing, or even by loading a new
translation table).

DATA TRANSFER PROTOCOL

Transfer of a multi-character-set text file in international transfer syntax
by Kermit is similar to transfer of a 7-bit ASCII text file, except that it
may contain embedded control characters and escape sequences to identify and
switch between character sets.  The file sender translates the file's
characters (if necessary) into one or more registered alphabets, imbedding
character-set designation and shifting codes in the data stream, and
terminates lines of text (records) with CRLF as in ASCII text mode.  The file
receiver translates from international transfer syntax into the format
demanded by the local system or application.  All of this occurs before Kermit
packet encoding by the sender, and after Kermit packet decoding by the
receiver.

ISO 2022 states that "at the beginning of information interchange, except
where the interchanging parties have agreed otherwise, all designations shall
be defined by use of the appropriate escape sequences, and the shift status
shall be defined by the use of the appropriate locking-shift functions."
Kermit programs should "agree otherwise" that the default G0 character set is
the US-ASCII/ISO-646-IRV (International Reference Version) 7-bit character
set; thus international transfer syntax can be identical to Normal Kermit
transfer syntax when transferring 7-bit text files.  There are no defaults for
G1, G2, or G3, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII range, an escape sequence
from Table 5 must be issued, designating the alphabet to which they belong
(using the identification letters shown in Table 5) to the desired
intermediate character set G0, G1, G2, or G3.  This sequence must be given
before the first occurrence of a character in that alphabet.  If no such
sequence is given, then all characters are treated as ASCII data, including
<ESC>, the shift characters, and bytes with their 8th bits set to one.  In
other words, the file transfer behaves in the normal Kermit fashion for text
files.

Since ISO 8859 character sets are subject to revision from time to time, an
alphabet selector may be preceded by <ESC>&F, where F is the revision number
(@ = 1, A = 2, B = 3, etc).  For example, <ESC>&@<ESC>-A means Latin Alphabet
Number One, Revision One.  (This information is from ISO 2022 6.3.13.)
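
As a small illustration of the revision prefix, the following Python sketch
builds <ESC>&F from a revision number.  The function name revision_prefix is
hypothetical, and the upper bound (final character no higher than "~") is an
assumption, not something stated in this proposal.

```python
ESC = "\x1b"


def revision_prefix(revision: int) -> str:
    """Return <ESC>&F for a revision number, where @ = 1, A = 2, B = 3."""
    final = 0x3F + revision  # revision 1 maps to "@" (0x40)
    if revision < 1 or final > 0x7E:
        # Assumed limit: the final character must stay within "@".."~".
        raise ValueError("revision number out of range")
    return ESC + "&" + chr(final)


# Latin Alphabet Number One, Revision One, designated to G1.
latin1_rev1 = revision_prefix(1) + ESC + "-A"
print(repr(latin1_rev1))
```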

ISO 2022 escape sequences are inserted into the data, and are
indistinguishable by the Kermit packet encoder/decoder from the data itself.
Therefore these escape sequences may be broken across packets, just as any
other data may be.

UNKNOWN ALPHABETS

It is not required that the sender preannounce all of a file's character sets
prior to transfer.  Suppose a file contains a mixture of alphabets, some known
to the receiver, others not.  At some point, an alphabet designator arrives
which the receiving Kermit does not recognize.  Should the receiving Kermit
cancel the file transfer, or accept the unknown code?  A new command is
provided to let the user control what happens in this situation:

  SET UNKNOWN-ALPHABET {KEEP, CANCEL}.

If the user elects CANCEL, then the receiver will behave as if the user
had manually cancelled the file, i.e. it will put the character "X" in the
data field of its next acknowledgement, and the sender (assuming it supports
this feature) will stop sending the file.

If the user elects KEEP, the file will be accepted in its entirety.  But the
unknown code should be marked in case the user wants to fix it afterwards.  To
do this, the receiving program accepts the designator for the unknown alphabet
and stores it in the file as data, with subsequent characters stored
untranslated.  When the unknown character set is shifted out of (or the end of
file arrives), the receiving Kermit stores the ISO-2022 Coding Method
Delimiter, <ESC>d, and resumes translation.  If the unknown alphabet is
shifted back into, the designating escape sequence is stored again, and the
process resumes.  Unknown alphabets may be nested in this manner.

The default behavior should be "KEEP".  This command should also be effective
at Level 1, where it would simply prevent the receiving Kermit from refusing
a file on the basis of the character set used to transfer it.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the
local storage medium.  It may be ASCII, EBCDIC, a proprietary word processor
format, IBM code page, or anything else.  It is an implementation "detail" for
the Kermit programmer to convert between the local file representation for
multi-alphabet text files and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the
necessary identifying information, in which case the sending Kermit program
can automatically emit the appropriate escape sequences during file transfer.
In others, the user will have to tell the sending program how the file is
encoded.  The suggested command is:

  SET FILE TYPE <xxx>

where <xxx> specifies how the file is (or when receiving, is to be) encoded on
disk.  This will necessarily be highly dependent on the system's conventions,
or the conventions of the applications to be used with the file (e.g. a
multi-language word processing program).  Possibilities for <xxx> might
include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE,
ALEPH-BET, PC-HANGUL.

BREAKING THE RULES

If the local file is not encoded according to ISO 2022 rules, it may contain
<ESC>, <SO>, and <SI> characters.  It is up to the Kermit program to know
what these characters mean in the context of the file's format, and to either
strip them from the file or translate them to something else.  The ISO 2022
rules forbid the use of these characters as data to be transferred.

If a file is to be transferred using international syntax, and it contains
any of the characters significant to this syntax, namely <ESC>, <SI>, <SO>,
<SS2>, or <SS3>, then such characters should be prefixed during transmission
with Datalink Escape, <DLE>, C0 character 01/00 (Control-P).  Furthermore,
if <DLE> itself occurs in the data, it should also be prefixed with <DLE>.
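
A minimal sketch of this quoting rule follows.  It assumes that <SS2> and
<SS3> appear in the data as their single-byte C1 codes 08/14 and 08/15 (in
their 7-bit forms they begin with <ESC>, which is quoted anyway).  The names
dle_quote and dle_unquote are hypothetical, and the unquoting function is
simply the assumed inverse for the receiving side.

```python
DLE = 0x10  # Datalink Escape, character 01/00 (Control-P)

# ESC, SO, SI from C0; SS2 and SS3 as 8-bit C1 codes (an assumption).
SIGNIFICANT = {0x1B, 0x0E, 0x0F, 0x8E, 0x8F, DLE}


def dle_quote(data: bytes) -> bytes:
    """Prefix every syntax-significant byte, and DLE itself, with DLE."""
    out = bytearray()
    for b in data:
        if b in SIGNIFICANT:
            out.append(DLE)
        out.append(b)
    return bytes(out)


def dle_unquote(data: bytes) -> bytes:
    """Remove DLE prefixes, keeping the byte that follows each one."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == DLE:
            b = next(it, None)
            if b is None:
                raise ValueError("dangling DLE at end of data")
        out.append(b)
    return bytes(out)


sample = b"a\x1bb\x10c"
print(dle_quote(sample))
```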

LEVEL-2 PERFORMANCE

Kermit programs may use the full range of ISO 2022 code extension techniques,
including use of G0, G1, G2, and G3 in both the 7-bit and 8-bit environments,
with both single-byte and multibyte character sets.  In the general case, G0
will be used for ASCII and English, G1 for the "native language" of the local
country or region, G2 for a third language, and G3 for a fourth.  Additional
character sets may be swapped in and out of G2 and G3 as required.

Transmission of 8-bit data in the 7-bit environment is accomplished by Kermit
using 8th-bit prefixing, which is an optional feature of the Kermit protocol.
However, most popular implementations of Kermit do include this feature.  If a
Kermit program cannot do 8th-bit prefixing, then it must operate in the ISO
2022 7-bit environment, shifting GL among the intermediate graphics sets
G0-G3.

If the Kermit program can do 8th-bit prefixing, the choice of the ISO 2022
7-bit or 8-bit environment is entirely independent of the communication
channel.  Selection of the ISO 2022 7-bit or 8-bit environment should be made
on other grounds, such as transmission efficiency or program simplicity.  For
example, if the ISO 2022 8-bit environment is used on a 7-bit channel, then
Kermit will have to do 8th-bit prefixing.

On a 7-bit communication channel, the best choice of ISO 7-bit or 8-bit
environment depends on the nature of the data to be transferred.  If there is
little or no 8-bit data (as in English text), it doesn't matter.  If there is
frequent shifting between 7-bit and 8-bit characters (as in French or
Portuguese), then single shifts would tend to be more efficient than locking
shifts, and Kermit's 8th-bit prefixing is equivalent to a single shift.
Therefore, use the ISO 8-bit environment and let Kermit do the prefixing.  If
there are long strings of 8-bit characters, as in "right-sided" languages
like Russian, Greek, Arabic, and Hebrew, then locking shifts are more
efficient -- use the ISO 7-bit environment.
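
The tradeoff can be made concrete with a rough count of overhead bytes.  The
sketch below is a simplification under stated assumptions: 8th-bit prefixing
costs one extra byte per 8-bit character, a locking-shift run costs two extra
bytes (one shift out, one shift back), and one-time designation sequences and
all other Kermit encoding are ignored.  The function names are hypothetical.

```python
def prefix_overhead(data: bytes) -> int:
    """Extra bytes if every 8-bit character gets one 8th-bit prefix."""
    return sum(1 for b in data if b & 0x80)


def locking_shift_overhead(data: bytes) -> int:
    """Extra bytes if each run of 8-bit characters is bracketed by shifts."""
    runs = 0
    in_run = False
    for b in data:
        high = bool(b & 0x80)
        if high and not in_run:
            runs += 1  # a new run of right-half characters begins
        in_run = high
    return 2 * runs


german = "gef\u00e4hrlich".encode("latin-1")
# A word made up entirely of right-half characters (14 bytes).
right_sided = bytes(ord(c) | 0x80 for c in "`PW^gP`^RP]]kY")

for name, data in (("German", german), ("right-sided", right_sided)):
    print(name, prefix_overhead(data), locking_shift_overhead(data))
```

Under these assumptions an isolated accented letter is cheaper to prefix,
while a long run of right-half characters is cheaper to bracket with locking
shifts.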

In Japan, many computer systems use at least three character sets, Roman
(close to ASCII), Katakana (a 1-byte code), and Kanji (a 2-byte code).  Kanji
is specified in JIS X 0208, which also includes Roman, Hiragana, Katakana, and
some other character sets, but these are double width and not normally used.
Roman characters are usually taken from the left half of JIS X 0201, and
Katakana from the right half.  Japanese text frequently shifts between Roman,
Kana, and Kanji, and therefore requires three active character sets, for
example G0 (Roman), G1 (Kana), and G2 or G3 (Kanji).  In the 8-bit
environment, data transfer can be quite efficient: locking shifts are used to
shift GL between Roman and Kana, and any bytes with the 8th bit set to one
automatically invoke Kanji in GR as a multi-byte character set.  In the 7-bit
environment, locking shifts would also be used to select Kanji.  Note that
locking shifts are more efficient in this case than Kermit 8th-bit prefixing
because Kanji characters consist of more than one byte, and tend to occur in
runs.  For Japanese, therefore, it is better to use the ISO 7-bit environment
on a 7-bit communication channel.

The situation is summarized in Table 4.

_____________________________________________________________________________

                            ISO 2022 Environment
                     7-bit                       8-bit
       +------------------------------+-----------------------------+
       | Recommended for right-       | Recommended for 2-sided     |
 7-bit | sided languages like Greek,  | languages like French,      |
  data | Russian, Arabic, Hebrew.     | German, etc.  Use Kermit's  |
  path | Use ISO 2022 locking shifts. | 8th-bit prefix for special  |
       | Also for Japanese.           | characters.                 |
       +------------------------------+-----------------------------+
       | No reason to use ISO 7-bit   | Clear transmission of 8-bit |
 8-bit | environment on a clear 8-bit | characters.  Use for both   |
  data | communication channel.       | left- and right-sided       |
  path | OK for 7-bit ASCII, though.  | languages.                  |
       |                              |                             |
       +------------------------------+-----------------------------+

          Table 4: Selecting ISO 7- vs 8-Bit Environment
_____________________________________________________________________________

The user should have control over whether the ISO-2022 7-bit or 8-bit
environment is used.  To allow this, the command SET TRANSFER-SYNTAX
INTERNATIONAL may be extended as follows:

  SET TRANSFER-SYNTAX INTERNATIONAL [ {7, 8} ]

which means that an optional final field may be included to specify the
7- or 8-bit ISO-2022 environment.  The default should be 8, since it is the
most efficient method in most cases.

If Kermit -- at all levels -- offered locking shifts in addition to single
shifts, then international syntax could always proceed in the 8-bit
environment, and this would simplify implementation considerably.  A proposal
on locking shifts for Kermit is forthcoming.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner
for text files, without any escapes or shifts, even in ISO 2022 mode.  The
"encoding" file attribute, if used with international transfer syntax, could
be "*#IAJ2"I2" (encoding = international with GL = G0, ISO 2022 7-bit
environment, character set = ASCII).  Or it could be simply "*!A" (ASCII).

A text file containing characters from a language or languages covered by a
single alphabet other than ASCII can be transferred exactly like an ASCII text
file, except that the attribute, if used, would denote the character set, e.g.
"*!C2$I100" for Latin-1.  In the 7-bit environment, international syntax can
be used to cut down on Kermit's 8th-bit prefixing overhead, in which case the
attributes might look like "*#IBJ2$I144", and any strings of GR characters
would be preceded by LS1 and transmitted with their high-order bits set to
zero.

A multi-character-set text file will require an escape sequence to identify
each alphabet.  The attribute packet would show international encoding,
optionally including the ISO 2022 facilities announcers, and the character
sets, as in "*#ICK2)I100,I144".

In the 7-bit environment, <SO> and <SI> are used to shift between the G0 and
G1 sets.  In the absence of any specific designators, the G0 set is presumed
to be ASCII.  Example:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich"
(where ae is a way of writing umlaut-a without an umlaut).  <ESC>-A designates
Latin-1 into G1, <SO> shifts GL out to G1, "d" is the left-half equivalent
of umlaut-a, and <SI> shifts GL back in to G0.

For clarity and consistency with the ISO-2022 recommendations, it is
recommended that the text begin with explicit character set designations, and
then explicitly shift into the G0 set, rather than defaulting to it:

  <ESC>(B<ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
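
The two forms of this example can be reproduced with a short Python sketch.
It is deliberately limited: G0 is ASCII, G1 is Latin-1, only <SO> and <SI> are
used, and C1 control characters and <DLE> quoting are not handled.  The
function name encode_7bit_latin1 and its "explicit" parameter are
hypothetical.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"


def encode_7bit_latin1(text: str, explicit: bool = False) -> str:
    """Encode Latin-1 text for the 7-bit environment with locking shifts."""
    out = []
    designated = False  # has Latin-1 been designated to G1 yet?
    shifted = False     # is G1 currently invoked into GL?
    if explicit:
        # Designate ASCII to G0 and Latin-1 to G1, then shift in to G0.
        out.append(ESC + "(B" + ESC + "-A" + SI)
        designated = True
    for b in text.encode("latin-1"):
        high = b >= 0x80
        if high and not designated:
            out.append(ESC + "-A")  # designate before first use
            designated = True
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(chr(b & 0x7F))   # send the left-half equivalent
    if shifted:
        out.append(SI)              # leave GL pointing at G0
    return "".join(out)


print(repr(encode_7bit_latin1("gef\u00e4hrlich")))
print(repr(encode_7bit_latin1("gef\u00e4hrlich", explicit=True)))
```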

A text file containing characters from multiple ISO 8859 alphabets requires a
designation sequence for each alphabet.  In the 7-bit environment, SO and SI
can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B
can be used to select G0 of any of the alphabets, since these are all the
same.  For example, the following text contains the same word in English,
French, and Russian:

  <ESC>-A<SI>Disappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the
subsequent <SO> and <SI> shifts switch between G0 and this G1 set to form the
English and French words.  The second escape sequence assigns the
Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this
new set.

Another 7-bit example, in which the same word is repeated in English,
Russian, and German, shows how a locking shift remains in effect when the
alphabet is changed.  We begin in Latin/Cyrillic, start with an English word
from G0, shift to G1 for the Russian word, and while still in G1 switch to
Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of
Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest
of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.

Some rules and hints to remember:

1. In the 8-bit communication environment, always use 8-bit character
   transmission -- it's more efficient.

2. There can be no more than four character sets designated at one time.
   Generally designate ASCII to G0, the most frequently-used non-ASCII set
   to G2, less frequently used sets to G3 and G1.  If a file has more than 
   four sets, swap the least frequently used sets in and out of G3 and G1.

3. Single shifts can only be used with G2 and G3.  This is why G2 and G3
   are preferred to G1.

4. Only two character sets can be invoked at once in the 8-bit communication
   environment, and only one in the 7-bit environment.

TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a
feature of many Kermit programs.  It is hoped that these terminal emulators
will evolve along the lines of the ISO standards mentioned above.  In some
cases, this is already a fact, insofar as DEC VT300 series terminals already
follow these standards and Kermit programs are beginning to emulate these
terminals.

In this regard, it is important to note that not all languages are written
from left to right, top to bottom.  Hebrew and Arabic are two examples of
right-to-left languages, and Japanese and Chinese may be written top to
bottom.  The order of the text characters on disk or on the transmission line
does not necessarily reflect their order on the screen or the printed page.

Kermit should be as easy to use as possible, but should still give the user
the ability to specify exactly what character codes are in use for both
terminal emulation and file transfer.  There should also be a consistent set
of commands for all Kermit programs.

SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing
programs.  These programs have many functions besides switching among
alphabets.  They may also endow text with special attributes such as boldface,
italic, underline, super- or subscript, color, etc, and render characters in a
variety of type styles and sizes.  Each text processing program may have its
own unique formats and conventions.

These special effects are not addressed by this proposal.  Nevertheless, it is
likely that a multi-alphabet file produced by a text processing program also
contains special effects.  In order for a Kermit program to send a
multi-alphabet file, it must have detailed knowledge of the file's format and
coding conventions.  Therefore, the Kermit program should be able to strip out
the special effects, and send only the text.  Otherwise the result would be
meaningless when received on an unlike system or for use with a different
application.  (When transferring such files between like systems or compatible
applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document
description languages to the Kermit protocol, so that Kermit will be able to
transfer formatted documents between unlike systems and applications.
Presently, there are many competing would-be standards including IBM DCA and
DIA, DEC DDIF, US Navy DIF, Postscript.  There are also two ISO standards
emerging in this area: Standard Generalized Markup Language (ISO 8879, 9069,
and 9573), and Office Document Architecture (ISO 8613).  This is an area for
further study.


APPENDIX A: STANDARDS

ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for
  Information Interchange" (US ASCII), is the 7-bit code currently used by
  Kermit for transferring text files. 

ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character 
  Sets for Information Interchange", gives us a 7-bit character set equivalent
  to ASCII with provision for substituting "national characters" in selected
  positions.

ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for
  Information Interchange - Structure and Rules for Implementation", defines
  8-bit character sets, their graphic and control regions, and how to extend
  an 8-bit character set by using multiple intermediate graphics sets.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit
  Coded Character Sets - Code Extension Techniques", describes how to use
  8-bit character sets in both 7-bit and 8-bit environments, and how to switch
  among different character sets and alphabets.

ISO International Register of Coded Character Sets to be Used with Escape
  Sequences.  This is the source of the ISO registration numbers.

ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape
  Sequences".  The procedure by which a character set gets into the above
  register and has a registration number and designating escape sequence
  assigned to it.

JIS X 0202, "Code Extension Techniques for Use with the Code for Information
  Interchange", the Japanese counterpart of ISO 2022.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded
  Character Set of the American National Standard Code for Information
  Interchange", describes 7- and 8-bit codes and extension techniques in
  approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see Table 5 for ECMA equivalents), "Information
  Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the
  actual 8-bit character sets to be used for many of the world's languages.
  The left half of each of these is the same as ASCII and ISO 646.  Each
  character, including those with diacritics, is represented by a single byte.

ISO is the International Organization for Standardization, ANSI is the
American National Standards Institute, and ECMA is the European Computer
Manufacturers Association.  JIS means Japan Industrial Standard.

The ISO/ECMA standards discussed in this proposal may be obtained free of
charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested.
ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY  10017

ANSI standards may be ordered, for a fee, from:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY  10018


APPENDIX B: HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set.  This set is
divided into two parts:

  1. 33 "control characters" (characters 0 through 31, and character 127).
  2. 95 "graphic characters" (32-126).

"Graphics" means printing characters -- characters that make ink appear on the
page or phosphor glow on the screen (as opposed to pixel- or line-oriented
picture graphics), plus the space character.  The ASCII / ISO-646 IRV
character set is shown in Figure 1, arranged in a table of 16 rows and 8
columns.

_____________________________________________________________________________

      00  01  02  03  04  05  06  07
     +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p |
  01 |SOH DC1| !   1   A   Q   a   q |
  02 |STX DC2| "   2   B   R   b   r |
  03 |ETX DC3| #   3   C   S   c   s |
  04 |EOT DC4| $   4   D   T   d   t |
  05 |ENQ NAK| %   5   E   U   e   u |
  06 |ACK SYN| &   6   F   V   f   v |
  07 |BEL ETB| '   7   G   W   g   w |
  08 |BS  CAN| (   8   H   X   h   x |
  09 |HT  EM | )   9   I   Y   i   y |
  10 |LF  SUB| *   :   J   Z   j   z |
  11 |VT  ESC| +   ;   K   [   k   { |
  12 |FF  FS | ,   <   L   \   l   | |
  13 |CR  GS | -   =   M   ]   m   } |
  14 |SO  RS | .   >   N   ^   n   ~ |
  15 |SI  US | /   ?   O   _   o  DEL|
     +---+---+---+---+---+---+---+---+

  Figure 1: The ASCII / ISO-646 International
     Reference Version 7-bit Character Set
_____________________________________________________________________________

Characters are often referred to by their column and row position in this type
of table.  For example, character 05/08 in Figure 1 is "X".  Columns 00-01,
plus character 07/15, comprise the control set.  Columns 02-07, minus
character 07/15, comprise the graphics.
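
The column/row notation is simply the high and low four bits of the character
code written in decimal.  A minimal Python sketch (the function names position
and from_position are hypothetical):

```python
def position(code: int) -> str:
    """Return the column/row position of a character code, e.g. "05/08"."""
    if not 0 <= code <= 255:
        raise ValueError("code must fit in 8 bits")
    return "%02d/%02d" % (code >> 4, code & 0x0F)


def from_position(pos: str) -> int:
    """Convert a column/row position such as "05/08" back to a code."""
    column, row = (int(part) for part in pos.split("/"))
    return column * 16 + row


print(position(ord("X")))           # 05/08
print(position(127))                # 07/15
print(chr(from_position("05/08")))  # X
```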

8-bit character sets are described in ISO 4873.  An 8-bit character set has
two sides.  Each side has a control set and a graphics set.  The "left half"
consists of the control set C0 and the graphics set GL (Graphics Left).  GL
has 94 characters, and corresponds to ASCII (and ISO 646 IRV) positions
02/01-07/14.  SP (space) and DEL are not considered part of GL.  All the
characters in the left half have their high-order, or 8th, bit set to zero,
and are therefore representable in 7 bits.  The "right half" consists of the
control set C1 and the graphics set GR (Graphics Right).  All characters in
the right half have their 8th bits set to one.  Figure 2 shows the layout of
an 8-bit character set.

_____________________________________________________________________________

     <--C0--> <---------GL---------->  <--C1--> <---------GR---------->
       00  01  02  03  04  05  06  07    08  09  10  11  12  13  14  15
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p | |    DCS|---+                   |
  01 |SOH DC1| !   1   A   Q   a   q | |    PU1|                       |
  02 |STX DC2| "   2   B   R   b   r | |    PU2|                       |
  03 |ETX DC3| #   3   C   S   c   s | |    STS|                       |
  04 |EOT DC4| $   4   D   T   d   t | |IND CCH|                       |
  05 |ENQ NAK| %   5   E   U   e   u | |NEL MW |                       |
  06 |ACK SYN| &   6   F   V   f   v | |SSA SPA|                       |
  07 |BEL ETB| '   7   G   W   g   w | |ESA EPA|                       |
  08 |BS  CAN| (   8   H   X   h   x | |HTS    |      (special         |
  09 |HT  EM | )   9   I   Y   i   y | |HTJ    |       graphics)       |
  10 |LF  SUB| *   :   J   Z   j   z | |VTS    |                       |
  11 |VT  ESC| +   ;   K   [   k   { | |PLD CSI|                       |
  12 |FF  FS | ,   <   L   \   l   | | |PLU ST |                       |
  13 |CR  GS | -   =   M   ]   m   } | |RI  OSC|                       |
  14 |SO  RS | .   >   N   ^   n   ~ | |SS2 PM |                       |
  15 |SI  US | /   ?   O   _   o  DEL| |SS3 APC|                   +---|
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
     <--C0--> <---------GL---------->  <--C1--> <---------GR---------->

                      Figure 2: An 8-Bit Character Set
_____________________________________________________________________________

GR character sets can have either 94 or 96 characters.  A 94-character GR set
begins in position 10/01 and ends in position 15/14, with Space (SP) occupying
position 10/00 and DEL in position 15/15, just like GL (the corners shown in
GR in the diagram).  A 96-character set has graphic characters in all 96
positions, 10/00 through 15/15.

An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters.
This number is sufficient to represent the characters in many of the world's
written languages, but not necessarily sufficient to represent all the graphic
symbols required in a given application, for instance a multi-language
document.

To represent a greater number of graphic characters, ISO 4873 defines four
"intermediate sets" of graphic characters, of either 94 or 96 characters each.
These are called G0, G1, G2, and G3.  The G0 set never has more than 94
graphic characters, and G1-G3 can have up to 96 each.  Therefore there can be
up to:

  94 + (3 x 96) = 382

graphics characters simultaneously within the repertoire of a given device.

These intermediate graphics sets are kept in tables in the memory of the
terminal or computer.  One of the intermediate sets (usually G0) is assigned
to GL, and (in the 8-bit communications environment) another may be assigned
to GR.  When the terminal or computer receives a data byte, the numeric value
of its bits denotes the position of the character in GL or GR.  For example,
the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII.  In the
8-bit environment, any byte with its 8th bit set to zero is from GL, and a
byte with its 8th bit set to one is from GR.
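
The column/row arithmetic above can be sketched in a few lines of Python.
This is only an illustration: the function names position() and region() are
invented here and are not part of the proposal.

```python
def position(byte):
    """Return the column/row notation (e.g. '04/01') for a byte value 0-255."""
    if not 0 <= byte <= 255:
        raise ValueError("byte value out of range")
    # High 4 bits are the column, low 4 bits are the row.
    return "%02d/%02d" % (byte >> 4, byte & 0x0F)


def region(byte):
    """Classify a byte as C0, GL, C1, or GR in the 8-bit environment."""
    if not 0 <= byte <= 255:
        raise ValueError("byte value out of range")
    eighth_bit = byte & 0x80
    if (byte & 0x7F) < 0x20:          # columns 0-1 and 8-9 hold controls
        return "C1" if eighth_bit else "C0"
    return "GR" if eighth_bit else "GL"


print(position(65), region(65))       # uppercase A in ASCII
```

Here position(65) gives "04/01" and region(65) gives "GL", matching the
uppercase A example in the text.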

A language like English can be represented adequately in GL, because all the
required characters fit there.  When a language has more than 94 characters,
one of two techniques is used to represent all the characters:

  1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and
     the special characters (like accented letters) in GR.  French, German,
     and Russian are examples.

  2. For languages with many symbols (e.g. where a symbol is assigned
     to each word, rather than to each sound), represent each character
     with multiple bytes rather than one byte.  Japanese Kanji, for example,
     uses a 2-byte code.  A multibyte code may be assigned to G0, G1, G2, or
     G3, just like a single-byte code. 

How do we assign actual character sets to G0-G3, and how do we associate the
intermediate character sets with the active character set?

Selection of character sets is accomplished using special control characters
and escape sequences embedded within the data stream as described in ISO
Standard 2022.  An escape sequence is used to DESIGNATE a particular alphabet
(such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular
intermediate graphics set (G0, G1, G2, or G3).  A shift function is used to
INVOKE a particular intermediate graphics set into GL or GR.  In programmer's
terms, GL and GR are pointers into the array of tables G0..G3, and the shift
functions simply change the values of these pointers.
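
The "pointers into an array of tables" view can be modelled directly.  The
following Python sketch is illustrative only; the class and method names are
invented, and the initial state (G0 in GL, G1 in GR) is an assumption made
for the example rather than something stated above.

```python
class Iso2022State:
    """Toy model: G0..G3 hold designated sets, GL and GR are indexes."""

    def __init__(self):
        self.g = [None, None, None, None]   # G0..G3, names of designated sets
        self.gl = 0                         # assumed initial state: G0 in GL
        self.gr = 1                         # assumed initial state: G1 in GR

    def designate(self, n, charset):
        """DESIGNATE: load a character set into intermediate set Gn."""
        self.g[n] = charset

    def invoke_left(self, n):
        """INVOKE: a shift function points GL at Gn."""
        self.gl = n

    def invoke_right(self, n):
        """INVOKE: a shift function points GR at Gn (8-bit environment)."""
        self.gr = n

    def active(self):
        """Return the sets currently reachable through GL and GR."""
        return self.g[self.gl], self.g[self.gr]


state = Iso2022State()
state.designate(0, "ASCII")
state.designate(1, "LATIN1")
print(state.active())
```

Note that a designation changes the contents of a table, while an invocation
changes only which table GL or GR refers to.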

In our discussion, we use the following notation (numbers are decimal unless
otherwise noted):

  <ESC> Escape (ASCII 27, character 01/11)
  <SP>  Space  (ASCII 32, character 02/00)
  <SO>  Shift Out (Ctrl-N, ASCII 14, character 00/14)
  <SI>  Shift In  (Ctrl-O, ASCII 15, character 00/15)

Table 5 shows the alphabet designation functions for single-byte and
multi-byte character sets in both the 7-bit and 8-bit environments.  The
character which is substituted for "F" identifies the actual character set to
be used.

_____________________________________________________________________________

  Escape            
 Sequence     Function                                         Invoked By

  <ESC>(F     assigns 94-character graphics set "F" to G0.     SI or LS0
  <ESC>)F     assigns 94-character graphics set "F" to G1.     SO or LS1
  <ESC>*F     assigns 94-character graphics set "F" to G2.     SS2 or LS2
  <ESC>+F     assigns 94-character graphics set "F" to G3.     SS3 or LS3
  <ESC>-F     assigns 96-character graphics set "F" to G1.     SO or LS1
  <ESC>.F     assigns 96-character graphics set "F" to G2.     SS2 or LS2
  <ESC>/F     assigns 96-character graphics set "F" to G3.     SS3 or LS3
  <ESC>$(F    assigns multibyte character set "F" to G0.       SI or LS0
  <ESC>$)F    assigns multibyte character set "F" to G1.       SO or LS1
  <ESC>$*F    assigns multibyte character set "F" to G2.       SS2 or LS2
  <ESC>$+F    assigns multibyte character set "F" to G3.       SS3 or LS3

             Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________
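
As an illustration, the rows of Table 5 can be captured in a small lookup
table.  This is a sketch only; the names INTERMEDIATES and designator() are
invented for the example.

```python
ESC = "\x1b"

# Intermediate characters from Table 5, keyed by (kind of set, G-set number).
INTERMEDIATES = {
    (94, 0): "(", (94, 1): ")", (94, 2): "*", (94, 3): "+",
    (96, 1): "-", (96, 2): ".", (96, 3): "/",
    ("multibyte", 0): "$(", ("multibyte", 1): "$)",
    ("multibyte", 2): "$*", ("multibyte", 3): "$+",
}


def designator(kind, g, final):
    """Build the escape sequence that designates set `final` to G<g>.

    kind is 94, 96, or "multibyte".  Combinations missing from Table 5
    (such as a 96-character set in G0) raise ValueError.
    """
    try:
        intermediate = INTERMEDIATES[(kind, g)]
    except KeyError:
        raise ValueError("Table 5 has no sequence for %r in G%r" % (kind, g))
    return ESC + intermediate + final


print(repr(designator(94, 0, "B")))   # ASCII to G0
```

For example, designator(96, 1, "A") yields <ESC>-A and
designator("multibyte", 1, "B") yields <ESC>$)B.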

Table 6 shows the escape sequences used to designate each of the
registered character sets discussed in this proposal to G1 (except that ASCII
is designated to G0, which is the normal situation).  It is important to note
that the final letter is not always sufficient to designate a character set.
For example, Czech Standard and JIS Katakana are both designated by letter I.
But the two can be distinguished by the intermediate characters of the escape
sequence, which specify whether the set is single- or multibyte, or, when both
sets are single-byte, whether there are 94 or 96 characters.

_____________________________________________________________________________

                            Escape    ISO          ECMA        ISO/ECMA
 Alphabet Name              Sequence  Reference    Reference   Registration

  ASCII (ANSI X3.4-1986)    <ESC>(B   ISO 646 IRV  ECMA-6        6
  Latin Alphabet No. 1      <ESC>-A   ISO 8859-1   ECMA-94     100
  Latin Alphabet No. 2      <ESC>-B   ISO 8859-2   ECMA-94     101
  Latin Alphabet No. 3      <ESC>-C   ISO 8859-3   ECMA-94     109
  Latin Alphabet No. 4      <ESC>-D   ISO 8859-4   ECMA-94     110
  Latin/Cyrillic            <ESC>-L   ISO 8859-5   ECMA-113    144
  Latin/Arabic              <ESC>-G   ISO 8859-6   ECMA-114    127
  Latin/Greek               <ESC>-F   ISO 8859-7   ECMA-118    126
  Latin/Hebrew              <ESC>-H   ISO 8859-8   ECMA-121    138
  Latin Alphabet No. 5      <ESC>-M   ISO 8859-9   ECMA-128    148
  Czech Standard CSN 369 03 <ESC>-I   none         none        139
* Math/Technical Set        <ESC>-K   none         none        143
  Chinese (CAS GB 2312-80)  <ESC>$)A  none         none         58
  Japanese (JIS X 0208)     <ESC>$)B  none         none         87
  JIS-Katakana (JIS X 0201) <ESC>)I   none         none         13
  JIS-Roman (JIS X 0201)    <ESC>)J   none         none         14
  Korean (KS C 5601-1987)   <ESC>$)C  none         none        149

   Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________

* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series,
  and other math/technical/line-drawing characters, but there is apparently no
  such standard set at this time.

Tables 7 and 8 show the shift functions that are used to invoke the
intermediate character sets.  These shift functions may be either locking or
single.  "Locking shift" is like shift-lock on a typewriter.  It means that
all subsequent characters until the next shift are to be taken from the
designated intermediate character set.  "Single shift" applies only to the
character (either single or multibyte) that follows it immediately, but single
shift functions are only available for the G2 and G3 sets.  Locking shift
functions remain in effect across alphabet changes.

In the 7-bit environment, only one character set, GL, can be active at a time.
The active character set can be selected from among the intermediate sets
G0-G3 by the shifts shown in Table 7.  Control characters from C0 are
transmitted as-is, and those from the C1 set are sent as <ESC> followed by
the character whose code is the C1 value minus 64.  For example, the C1
character 10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").
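
The C1 conversion is simple arithmetic, sketched below in Python.  The
function names are invented for this illustration.

```python
ESC = 0x1B


def c1_to_7bit(byte):
    """Represent a C1 control (128-159) as ESC plus (value - 64)."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 control character")
    return bytes([ESC, byte - 64])


def c1_from_7bit(seq):
    """Recover the C1 control from its two-byte 7-bit representation."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 representation")
    return seq[1] + 64


print(c1_to_7bit(129))    # the example from the text: ESC followed by "A"
```

The same rule accounts for the 7-bit single shifts: SS2 is C1 character 08/14
(142 decimal), and 142 - 64 = 78 = "N", giving <ESC>N.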

_____________________________________________________________________________

 Shift  Representation  Name              Function

  SI       Ctrl-O       Shift In          invoke G0 into GL
  SO       Ctrl-N       Shift Out         invoke G1 into GL
  LS2      <ESC>n       Locking Shift 2   invoke G2 into GL
  LS3      <ESC>o       Locking Shift 3   invoke G3 into GL
  SS2      <ESC>N       Single Shift 2    select single character from G2
  SS3      <ESC>O       Single Shift 3    select single character from G3

               Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________

In the 8-bit environment two character sets, GL and GR, can be active at once.
A GL character is selected by a byte whose 8th bit is zero, and a GR character
by a byte whose eighth bit is one.  The actual character sets assigned to GL
and GR are selected by the shifts shown in Table 8.  Control characters from
both the C0 and C1 sets are sent as is.

_____________________________________________________________________________

 Shift  Representation  Name                   Function

  LS0      Ctrl-O       Locking Shift 0        invoke G0 into GL
  LS1      Ctrl-N       Locking Shift 1        invoke G1 into GL
  LS2      <ESC>n       Locking Shift 2        invoke G2 into GL
  LS3      <ESC>o       Locking Shift 3        invoke G3 into GL
  LS1R     <ESC>~       Locking Shift 1 Right  invoke G1 into GR
  LS2R     <ESC>}       Locking Shift 2 Right  invoke G2 into GR
  LS3R     <ESC>|       Locking Shift 3 Right  invoke G3 into GR
  SS2       08/14       Single Shift 2         select single character from G2
  SS3       08/15       Single Shift 3         select single character from G3

             Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________

So we have a 3-tiered system.  At the bottom tier lie all the world's coded
character sets.  We can designate up to four of them at a time, one to each
of the intermediate graphics sets G0, G1, G2, and G3, using the escape
sequences shown in Tables 5 and 6.  The terminal or computer keeps each of
the selected intermediate sets in memory.  There is also one active set,
composed of GL and GR.  The intermediate sets are invoked to GL or GR (one at
a time) by the shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8.  A
simplified diagram for the 8-bit environment is shown in Figure 3 (see ISO
2022 for detailed diagrams of both the 7-bit and 8-bit environments).  On a
more sophisticated output device, Figure 3 would contain numerous arrows
pointing upwards to demonstrate the operation of the designators and shifts.

_____________________________________________________________________________

                   +--+--------+  +--+--------+
                   |C0|   GL   |  |C1|   GR   |
                   |  |        |  |  |        |                  8-Bit
                   |  |        |  |  |        |                  Code
                   |  |        |  |  |        |                  In Use
                   +--+--------+  +--+--------+
                     
                   
         LS0          LS1,LS1R      LS2,LS2R      LS3,LS3R       Shifts
                                      SS2           SS3
       +--------+    +--------+    +--------+    +--------+      Intermediate
       |        |    |        |    |        |    |        |      Graphics
       |   G0   |    |   G1   |    |   G2   |    |   G3   |      Sets
       |        |    |        |    |        |    |        |
       +--------+    +--------+    +--------+    +--------+
                                                                 Alphabet
                                                                 Designation
 <ESC>(B      <ESC>-A      <ESC>-B      <ESC>-L      <ESC>$)B    Sequences
                                                    +---------+
+--------+   +--------+   +--------+   +--------+  +--------+ |  The world's
| ISO    |   | ISO    |   |  ISO   |   |  ISO   |  | JIS X  | |  registered
| 646IRV |   | Latin  |   |  Latin |   |  Latin |  | 0208   | |  character
|(ASCII) |   | 1      |   |  2     |   |Cyrillic|  | Kanji  | +  sets
+--------+   +--------+   +--------+   +--------+  +--------+

          Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________

For example, the following sequence would be used to transmit the German word
"<umlaut-u>bern<umlaut-a>chtig" using Latin Alphabet 1 in the 7-bit
environment:

  <ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig

where:

  <ESC>(B   designates ASCII to G0
  <ESC>-A   designates Latin Alphabet 1 to G1
  <SO>      invokes G1 to GL
  |         is character 07/12, but since G1 is invoked to GL, it really
              denotes character 15/12, which is <umlaut-u>
  <SI>      invokes G0 to GL
  bern      are characters from G0, which is invoked in GL
  <SO>      invokes G1 to GL
  d         is character 06/04, but since G1 is invoked to GL, it really
              denotes character 14/04, which is <umlaut-a>
  <SI>      invokes G0 to GL
  chtig     are characters from G0

The same word could be transmitted in the 7-bit environment using single
shifts, if Latin Alphabet 1 were designated to G2 (or G3):

  <ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig

(where <ESC>.A designates Latin-1, a 96-character set, to G2, and <ESC>N is
Single Shift 2).

In the 8-bit environment it could be transmitted using no shifts at all:

  <ESC>(B<ESC>-A<umlaut-u>bern<umlaut-a>chtig

The designation escape sequences are transmitted only at the beginning of a
session and need not be repeated after the initial designations are made,
unless an intermediate set (G0-G3) is to be recycled.
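
The locking-shift form of the 7-bit example can be reproduced with a short
Python sketch.  It is deliberately limited: the function names are invented,
only ASCII in G0 and Latin Alphabet 1 in G1 are handled, C1 controls are
rejected, and data that itself contains <SO>, <SI>, or <ESC> is not treated
(see the prefixing rule in Appendix E).

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"
PREFIX = ESC + b"(B" + ESC + b"-A"    # ASCII to G0, Latin Alphabet 1 to G1


def encode_7bit(text):
    """Encode Latin-1 text for the 7-bit environment using SO/SI."""
    out = bytearray(PREFIX)
    in_g1 = False                     # True while G1 is invoked into GL
    for b in text.encode("latin-1"):
        if 0x80 <= b <= 0x9F:
            raise ValueError("C1 control characters are not handled here")
        want_g1 = b >= 0xA0           # right-half graphic character?
        if want_g1 != in_g1:
            out += SO if want_g1 else SI
            in_g1 = want_g1
        out.append(b & 0x7F)          # strip the 8th bit
    if in_g1:
        out += SI                     # leave G0 invoked at the end
    return bytes(out)


def decode_7bit(data):
    """Inverse of encode_7bit, for the same limited subset."""
    if not data.startswith(PREFIX):
        raise ValueError("expected the two designators first")
    out = bytearray()
    in_g1 = False
    for b in data[len(PREFIX):]:
        if b == SO[0]:
            in_g1 = True
        elif b == SI[0]:
            in_g1 = False
        else:
            out.append(b | 0x80 if in_g1 else b)
    return out.decode("latin-1")


print(encode_7bit("\xfcbern\xe4chtig"))
```

The encoded bytes correspond to <ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig,
the first sequence shown above.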

To understand the three-tiered design of ISO 2022, imagine a computer
programmed to display a mixture of character sets on its screen.  A large
collection of fonts might be stored on the disk, one font per file.  These are
the character sets of the bottom tier.  When a font is needed, it will be read
from the disk and stored in memory in an array, for rapid access.  If several
fonts are needed, they will be stored in several arrays.  These arrays are the
intermediate character sets, G0-G3.  When a data byte arrives to be displayed,
the actual graphic representation is taken from GL or GR (depending on the
byte's 8th bit).  GL is associated with one of the intermediate graphic sets,
and GR with another.  If no more than four character sets are used, then each
one needs to be read from the disk only once, and display is rapid and
efficient thereafter.

ANNOUNCING ISO 2022 FACILITIES

A large portion of ISO 2022 is devoted to describing how 8-bit characters may
be transmitted on a 7-bit communication path, for example when parity is in
use.  In the 7-bit environment, there is only GL -- no GR.  Therefore, all
characters are transmitted with their 8th bit removed, and shifts are used to
specify which intermediate set they belong to.

In fact, there are many possible ways to use the ISO 2022 code extension
facilities within both 7-bit and 8-bit environments.  For example, the sender
may inform the receiver in advance whether G1, G2, or G3 will be used, etc, so
that the receiver can allocate the appropriate resources.  At the beginning of
any particular data transfer, the facilities that actually will be used can be
announced with a sequence of the form <ESC><SP>F, where F is replaced by an
ISO 2022 announcer.  Several of the most important ones are described here.
Table 9 lists all the defined announcers in summary form.  For details, see
ISO 2022.

<ESC><SP>A means that only the G0 set will be used, invoked into GL.  No
  shift functions will be used.  In the 8-bit environment, GR is not used.
  In other words, only a single 7-bit character set is used.

<ESC><SP>B means the G0 and G1 sets will be used with locking shifts.  In the
  7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL.  In the
  8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL.  In other
  words, two character sets are used, with characters from both sets always
  sent as 7-bit values, with locking shifts used to specify the 8th bit.

<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0
  invoked in GL and G1 in GR.  No locking shift functions are used.  In other
  words, a single 8-bit character set is used, with all 8 bits transmitted as
  data.  GL is selected when the character's 8th bit is zero, GR is selected
  when the 8th bit is one.

<ESC><SP>D means that G0 and G1 will be used, with locking shifts in the
  7-bit environment only.  In the 7-bit environment, <SI> invokes G0 into GL
  and <SO> invokes G1 into GL.  In the 8-bit environment, G0 is invoked in GL
  and G1 in GR, and all 8 bits of each character are transmitted with no
  shifts.

<ESC><SP>L means that Level 1 of ISO 4873 will be used.  That is, a single
  8-bit character set with C0, G0, C1, and G1, with no shift functions.
  This is like <ESC><SP>C.

<ESC><SP>M means that Level 2 of ISO 4873 will be used.  This is equivalent
  to Level 1, with the addition of G2 and G3.  Characters from G2 and G3 are
  invoked only by the single-shift functions SS2 and SS3.

<ESC><SP>N means that Level 3 of ISO 4873 will be used.  This is equivalent
  to Level 2 with the addition of the locking shift functions LS1R, LS2R, and
  LS3R. (Note that ISO 4873 does not concern itself with the 7-bit
  environment, and therefore does not discuss the use of LS0, LS1, LS2, or
  LS3.) 

_____________________________________________________________________________

Esc Sequence  7-Bit Environment          8-Bit Environment 

<ESC><SP>A    G0->GL                     G0->GL
<ESC><SP>B    G0-(SI)->GL, G1-(SO)->GL   G0-(LS0)->GL, G1-(LS1)->GL
<ESC><SP>C    (not used)                 G0->GL, G1->GR
<ESC><SP>D    G0-(SI)->GL, G1-(SO)->GL   G0->GL, G1->GR
<ESC><SP>E    Full preservation of shift functions in 7 & 8 bit environments
<ESC><SP>F    C1 represented as <ESC>F   C1 represented as <ESC>F
<ESC><SP>G    C1 represented as <ESC>F   C1 represented as 8-bit quantity
<ESC><SP>H    All graphic character sets have 94 characters
<ESC><SP>I    All graphic character sets have 94 or 96 characters
<ESC><SP>J    In a 7 or 8 bit environment, a 7 bit code is used
<ESC><SP>K    In an 8 bit environment, an 8 bit code is used
<ESC><SP>L    Level 1 of ISO 4873 is used
<ESC><SP>M    Level 2 of ISO 4873 is used
<ESC><SP>N    Level 3 of ISO 4873 is used
<ESC><SP>P    G0 is used in addition to any other sets:
              G0 -(SI)-> GL              G0 -(LS0)-> GL
<ESC><SP>R    G1 is used in addition to any other sets:
              G1 -(SO)-> GL              G1 -(LS1)-> GL
<ESC><SP>S    G1 is used in addition to any other sets:
              G1 -(SO)-> GL              G1 -(LS1R)-> GR
<ESC><SP>T    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL             G2 -(LS2)-> GL
<ESC><SP>U    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL             G2 -(LS2R)-> GR
<ESC><SP>V    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL             G3 -(LS3)-> GL
<ESC><SP>W    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL             G3 -(LS3R)-> GR
<ESC><SP>Z    G2 is used in addition to any other sets:
              SS2 invokes a single character from G2
<ESC><SP>[    G3 is used in addition to any other sets:
              SS3 invokes a single character from G3

                     Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________


APPENDIX C: PRELIMINARY DESIGN FOR LOADABLE TRANSLATION TABLES

The translation table is specified in a file written entirely in printable
ASCII, with line divisions as shown.  Numbers are represented as ASCII decimal
digits.

Line  Contents
 1.    Name of this table
 2.    The word "COMMON" or "LOCAL"
 3.    Name of SOURCE character set (translating FROM)
 4.    Number of bytes per character of source set (1, 2, 3, 1-2, etc)
 5.    Number of characters per plane of source set (94, 96, 128)
 6.    Name of TARGET character set (translating TO)
 7.    Number of bytes per character of target set (1, 2, 3, 1-2, etc)
 8.    Number of characters per plane of target set (94, 96, 128)
 9.    Designating sequence for COMMON character set.
10.    Version number of common character set (blank if none)
11.    Registration number of common character set (e.g. I100, blank if none)
12.    Direction of writing (Left-to-right, Right-to-left, Upwards, etc)
13.    Number of entries in the translation table.
14.    Count of lines, n, between this line and beginning of translation table.
15 - 14+n.  Reserved for future use.
15+n...     The translation table itself.

Line 2 is either COMMON or LOCAL, and applies to the SOURCE character set.
LOCAL means that the source character set is local, and the target character
set is common, i.e. the one used during transmission in the transfer syntax.
COMMON means vice-versa.

Lines 9-11 apply to the COMMON character set, which may be either the source
or target set, as specified in line 2.

Line 3 gives the name of the source character set, which is either local or
common, depending on line 2.

Line 4 specifies the number of bytes per character in the source character
set.  For example, 1 for ASCII, ISO Latin-1, etc, 2 for JIS X 0208, etc.
The notation "1-2" means that a character can be either one or two bytes, as
in (for instance) CCITT T.61, where "A" is the single character "A", but
the two-byte sequence "`A" is the single character A-grave.

Line 5 specifies the number of "characters per plane".  In a single-byte
character set there is one plane; in a multibyte set there are many.  In the
ISO world, an important distinction is made between 94-character sets and
96-character sets.  See Appendix B for a fuller explanation.

Lines 6-8 are like lines 3-5, but for the target character set.  If the source
set was local, the target set is common, and vice versa.

Lines 9-11 give further information about the standard, COMMON character set.
Line 9 gives the designating sequences required to assign the set to G0, G1,
G2, and G3 (see Tables 5 and 6), in that order, with the bytes written as
decimal numbers, each byte separated by a space, and each sequence separated
by a comma.  For example, the entry for a 94-character set whose final
designating letter is "B" would look like this:

  27 40 66, 27 41 66, 27 42 66, 27 43 66

If a character set cannot be assigned to G0 (as is the case with a
96-character set), then the first entry would be left blank (the final letter
here is A):

  , 27 45 65, 27 46 65, 27 47 65
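
For illustration, the two entries above can be generated mechanically.  The
Python function below is a sketch with an invented name, and it covers
single-byte sets only.

```python
def designation_line(final, size):
    """Build the line-9 entry for a single-byte set of 94 or 96 characters."""
    if size == 94:
        intermediates = ["(", ")", "*", "+"]     # G0, G1, G2, G3
    elif size == 96:
        intermediates = [None, "-", ".", "/"]    # no G0 designator exists
    else:
        raise ValueError("size must be 94 or 96")
    entries = []
    for inter in intermediates:
        if inter is None:
            entries.append("")                   # blank entry for G0
        else:
            # ESC is 27; the other bytes are written as decimal codes.
            entries.append("27 %d %d" % (ord(inter), ord(final)))
    return ", ".join(entries)


print(designation_line("B", 94))
print(designation_line("A", 96))
```

The two calls print the two example entries shown above.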

Line 10 gives the revision number of the common character set, as described in
the "Data Transfer Protocol" section of the description of Level 2.  This
should be a simple digit, like 1, 2, 3, etc.

Line 11 gives the "Kermit registration number" of the common character set,
such as I100 for ISO character set number 100 (Latin Alphabet 1).

Line 12, direction of writing, has nothing to do with file transfer, but is
included in case the same table is also to be used with terminal emulation.
The actual notation should be the letter L (Left-to-right), R (Right-to-left),
U (Upwards), D (Downwards), or B (Boustrophedon, i.e. alternating L and R).

Line 13 is self-explanatory.

Line 14 allows for future expansion of this "information header".

The remaining lines, after any reserved lines, contain the translation table
itself.  Each of these lines contains a pair of characters or strings in
ASCII decimal representation, with the members of the pair separated by a
comma, followed optionally by a semicolon and a comment, like "Uppercase A
Circumflex":

    <character from source set>, <character from target set> ; <comment>

Each byte of a character is separated by a space, for example:

    231, 135           ; c Cedilla (Latin-1 to CP850)
    228, 97 101        ; Latin-1 a-umlaut to ASCII "ae"
    97 101, 228        ; "ae" to Latin-1 a-umlaut
   123 456, 234 567    ; Something from a pair of 2-byte character sets

The character pair is listed, rather than a single value (as in most
translation tables) to allow for special translations like "ae" to a-umlaut.

There is no rule against having different numbers of bytes on either side of
the comma.  There is also no requirement that every line have the same number
of bytes on its left or right side as the other lines, nor that every
position be filled.  If a position is vacant, the program should take some
kind of default action, like substituting a question mark.
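
A reader for the table lines might look like the following Python sketch.  It
is not part of the proposal: the function names are invented, the header
lines are assumed to have been consumed already, and the choice to try the
longest left-hand entry first is an assumption made so that multi-byte
entries such as "ae" take effect.

```python
def parse_entry(line):
    """Split one table line into (left_bytes, right_bytes, comment)."""
    body, _, comment = line.partition(";")
    left, comma, right = body.partition(",")
    if not comma:
        raise ValueError("missing comma in table line: %r" % line)
    as_numbers = lambda field: tuple(int(tok) for tok in field.split())
    return as_numbers(left), as_numbers(right), comment.strip()


def build_table(lines):
    """Build a dictionary from the table lines, skipping blank lines."""
    table = {}
    for line in lines:
        if line.strip():
            left, right, _ = parse_entry(line)
            table[left] = right
    return table


def translate(table, data, default=(ord("?"),)):
    """Translate a sequence of byte values, longest left-hand match first.

    Values with no entry are replaced by `default` (a question mark).
    """
    longest = max((len(key) for key in table), default=1)
    out, i = [], 0
    while i < len(data):
        for n in range(min(longest, len(data) - i), 0, -1):
            key = tuple(data[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            out.extend(default)       # vacant position: default action
            i += 1
    return out


table = build_table(["231, 135      ; c Cedilla (Latin-1 to CP850)",
                     "228, 97 101   ; Latin-1 a-umlaut to ASCII ae"])
print(translate(table, [228, 231]))
```

With these two entries, the byte values 228 and 231 translate to 97, 101,
and 135.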


APPENDIX D: SUMMARY OF NEW KERMIT COMMANDS


SET FILE TYPE { BINARY, TEXT, <other> }

    BINARY means no translation, and overrides all other file-related
    commands, including SET TRANSFER-SYNTAX.

    TEXT is the default.  Enables Level 0, 1, or 2 transfer syntax,
    depending on the setting of SET TRANSFER-SYNTAX.

    <other> means any application-specific format known to the Kermit
    program, like WORDPERFECT, and also implies text, rather than binary, mode.


SET FILE CHARACTER-SET <name>

    Effective only when file type is TEXT.
    Tells Kermit what character set the file is coded in,
    or what character set to translate an incoming file to.


SET TRANSFER-SYNTAX { NORMAL, CHARACTER-SET <name>, INTERNATIONAL [{7,8}] }

    NORMAL is default.  Treat text and binary files as before.

    CHARACTER-SET <name> invokes the Level 1 extension.

    INTERNATIONAL invokes the Level 2 extension.  7 or 8 specifies the
      ISO-2022 7- or 8-bit environment.


SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }

    For use with international transfer syntax.  Tells receiver whether to
    keep or cancel an incoming file that contains an unknown character set.
    KEEP is the default.

LOAD TRANSLATION-TABLE <tablename> <filename>

    Load a new translation table, or overlay an existing one, from the
    specified file.

SHOW TRANSLATION-TABLE <name>

    Show information about the named translation table.  If <name> is omitted,
    show information about all translation tables.

DUMP TRANSLATION-TABLE <name> <filename>

    Write the contents of the named table to the specified file, in a format
    compatible with the LOAD TRANSLATION-TABLE command.

SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name-of-attribute> { ON, OFF }

    Enables or disables processing of attribute packets, or specific
    attribute fields such as DATE, CHARACTER-SET, LENGTH, etc.

SHOW { CHARACTER-SETS, TRANSFER-SYNTAX, TRANSLATION-TABLES }

    Display what character sets, transfer syntax, and translation tables
    are available, and which ones are currently selected.


APPENDIX E:
ESCAPE SEQUENCES AND CONTROL CHARACTERS FOR KERMIT LEVEL-2 TRANSFER SYNTAX

1. Designation of character sets.
   The final letter "F" denotes the character set, e.g. "A" for ISO Latin-1.

  <ESC>(F     assigns 94-character graphics set "F" to G0.
  <ESC>)F     assigns 94-character graphics set "F" to G1.
  <ESC>*F     assigns 94-character graphics set "F" to G2.
  <ESC>+F     assigns 94-character graphics set "F" to G3.
  <ESC>-F     assigns 96-character graphics set "F" to G1.
  <ESC>.F     assigns 96-character graphics set "F" to G2.
  <ESC>/F     assigns 96-character graphics set "F" to G3.
  <ESC>$(F    assigns multibyte character set "F" to G0.
  <ESC>$)F    assigns multibyte character set "F" to G1.
  <ESC>$*F    assigns multibyte character set "F" to G2.
  <ESC>$+F    assigns multibyte character set "F" to G3.

2. Shift functions:

Character(s) Name      Function 
  Ctrl-O      SI,LS0    Shift In (invoke G0 to GL)
  Ctrl-N      SO,LS1    Shift Out (invoke G1 to GL)
  <ESC>n      LS2       Locking Shift 2 (invoke G2 to GL)
  <ESC>o      LS3       Locking Shift 3 (invoke G3 to GL)
  <ESC>~      LS1R      Locking Shift 1 Right (invoke G1 to GR)
  <ESC>}      LS2R      Locking Shift 2 Right (invoke G2 to GR)
  <ESC>|      LS3R      Locking Shift 3 Right (invoke G3 to GR)
  <ESC>N      SS2       Single Shift 2, 7-bit version, single char from G2
   08/14      SS2       Single Shift 2, 8-bit version, single char from G2
  <ESC>O      SS3       Single Shift 3, 7-bit version, single char from G3
   08/15      SS3       Single Shift 3, 8-bit version, single char from G3

3. Coding method delimiter:

When receiving text in an unknown character set, store the character set
designator, then store the untranslated characters, and terminate with the
coding method delimiter.

  <ESC>d

4. Special characters in data:

If any of the following characters appear in the data itself, they must be
prefixed during transmission with <DLE>, datalink escape, 01/00, Control-P:

  <SO>   00/14
  <SI>   00/15
  <DLE>  01/00
  <ESC>  01/11
  <SS2>  08/14
  <SS3>  08/15
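
The prefixing rule can be sketched in Python as follows.  The function names
are invented, and the sketch makes no assumption about how this step
interacts with Kermit's own control-character prefixing, which is not
addressed here.

```python
DLE = 0x10

# SO, SI, DLE, ESC, SS2, SS3: the characters listed above.
SPECIAL = frozenset([0x0E, 0x0F, 0x10, 0x1B, 0x8E, 0x8F])


def dle_quote(data):
    """Prefix each special character in the data with DLE."""
    out = bytearray()
    for b in data:
        if b in SPECIAL:
            out.append(DLE)
        out.append(b)
    return bytes(out)


def dle_unquote(data):
    """Remove DLE prefixes, keeping the character that follows each one."""
    out = bytearray()
    stream = iter(data)
    for b in stream:
        if b == DLE:
            try:
                b = next(stream)      # the quoted character, taken literally
            except StopIteration:
                raise ValueError("DLE at end of data")
        out.append(b)
    return bytes(out)


print(dle_quote(b"a\x1bb"))
```

Because DLE is itself in the list, a literal DLE in the data is sent as two
DLE characters, and the receiver can always tell a prefix from data.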


APPENDIX F: SIMPLIFIED FLOW DIAGRAM OF KERMIT TRANSFER SYNTAX OPTIONS


  SET FILE TYPE BINARY (overrides SET TRANSFER-SYNTAX command)
  |    |
  N    Y-->  Transfer file unmodified.  END.
  |
  Text mode.  Three possibilities:
  SET TRANSFER-SYNTAX NORMAL (the default)
  |    |
  N    Y-->  LEVEL 0: Transfer syntax is ASCII with CRLF as line terminator.
  |          Sending program translates from local format to transfer syntax,
  |          Receiving program translates from transfer syntax to local format.
  |          END.
  |
  SET TRANSFER-SYNTAX CHARACTER-SET LATIN1 (or any other single character set)
  |    |
  N    Y-->  LEVEL 1: Transfer syntax is specified character set with CRLFs.
  |          Sender translates from local format to specified character set.
  |          Receiver translates from specified character set to local format.
  |          END.
  |
  File composed of more than one character set:
  SET TRANSFER-SYNTAX INTERNATIONAL
  |    |
  N    Y-->  LEVEL 2: Transfer syntax is ISO-2022.  Assumes that sender can
  |          identify the different character sets in the local file, and
  |          can translate them to registered character sets if necessary.
  |          |
  |          Sender specifies encoding ("*") to be International ("I"),
  |          and lists ISO-2022 announcers.  Sender also optionally lists the
  |          alphabets to be used in new character-set ("2") attribute.
  |          |
  |          Receiver agrees to these facilities and alphabets?
  |          |    |
  |          Y    N --> Receiver rejects the file, indicating "*" and/or "2"
  |          |          as the reason.  END.
  |          |
  |          Receiver accepts the file.
  |          |
  |          Transfer begins.  Sender translates from local file format to
  |          the character sets of the transfer syntax, using ISO-2022
  |          announcers, designators, and shifts to switch among them.
  |          |
  |          2 --> Receiver heeds announcers, designators, and shifts, and
  |                translates from the indicated character sets to local 
  |                representation.
  |                |
  |                If the receiver encounters an alphabet it does not know, it 
  |                will act according to the SET UNKNOWN-CHARACTER-SET command:
  |                |      |
  |                KEEP   CANCEL --> Reject the file by putting an X (Cancel 
  |                |                 File) code in the data field of its 
  |                |                 Acknowledgement.  END.
  |                |
  |                (default) Continue to receive the file, but store the 
  |                designator for the unknown alphabet along with the 
  |                untranslated characters from that alphabet, until the next 
  |                known alphabet is encountered.  Mark the end of the 
  |                untranslated material with <ESC>d.  Warn user.
  |                |
  |                END.
  |
  Reserved for future (e.g. ISO 10646)...

(END)
                                                                                           