Return-Path: <fdc@watsun.cc.columbia.edu>
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0)
	id AA05724; Thu, 2 Mar 89 01:18:38 EST
Received: from watsun.cc.columbia.edu by cunixc.cc.columbia.edu (5.54/5.10) id AA03005; Thu, 2 Mar 89 01:16:06 EST
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0)
	id AA05568; Thu, 2 Mar 89 01:02:25 EST
Date: Thu, 2 Mar 1989 1:02:25 EST
From: Frank da Cruz <fdc@watsun.cc.columbia.edu>
To: Joe Doupnik <jrd@usu.bitnet>, Paul Placeway <paul@cis.ohio-state.edu>,
        Andre Pirard <A-PIRARD@bliulg11.bitnet>,
        Baruch Cochavy <baruchc@techunix.bitnet>,
        Johan Van Wingen <MOSGLA@hlerul2.bitnet>,
        Ken-ichiro Murakami <MURAKAMI%NTT-20.NTT.JP@relay.cs.net>,
        Kohichi Nishimoto <s153380%tkov02.DEC@decwrl.dec.com>,
        Hirofumi Fujii <KEIBUN@jpnkekvm.bitnet>,
        Gisbert W.Selke <RECK@dbnuama1.bitnet>,
        Kurt Enulf <UPSKE@seguc11.bitnet>,
        Jacob Palme <jacob_palme_qz@qzcom.bitnet>,
        Per Lindberg <Per_Lindberg_ZQ@qzcom.bitnet>,
        "Bj|rn Larsen" <x_larsen_b%use.uio.uninett@tor.nta.no>,
        "Hans A. ]lien" <hans%ifi.uio.no@tor.nta.no>,
        Kai U.Leppamaki <LK-KLE@finhut.bitnet>,
        Steve Jenkins <pdsoft%uk.ac.lancs.cent1@nss.cs.ucl.ac.uk>,
        Jean Dutertre <dutertre%padis1.DEC@decwrl.dec.com>,
        Gerard Gaye <GAYE@frsac11.bitnet>,
        David Guerlet <KERMIT@czheth5a.bitnet>,
        Bernie Eiben <eiben@tops20.dec.com>,
        Volker Edelhoff <edelhoff@unido.bitnet>
Cc: Christine M Gianone <cmg@cunixc.cc.columbia.edu>
Subject: Kermit International Character Set Proposal
Message-Id: <CMM.0.88.604821745.fdc@watsun.cc.columbia.edu>

	 A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

		     Christine Gianone and Frank da Cruz
	     Columbia University Center for Computing Activities
				   New York
				March 1, 1989


PREFACE

This is a ROUGH FIRST DRAFT of a proposal, based upon a reading of the current
ISO and ECMA character-set standards, some familiarity with the issues
involved, and limited testing with devices that claim to implement these
standards (such as the DEC VT340 terminal).  Readers are urged to correct us
if we have misinterpreted the standards, to fill in missing information, and
to make any comments or criticisms they desire.  Readers with knowledge of
real-world multi-alphabet applications and file formats are especially urged
to comment on how this proposal meshes with these particular file formats.

This first draft is being sent to a selected list of people who we know to be
familiar with both Kermit and the character set issues discussed here, so your
comments will be especially helpful before we circulate this proposal among a
wider audience, most likely by mailing it to Info-Kermit and the ISO8859
discussion group.


INTRODUCTION

The Kermit protocol makes a distinction between text and binary files, and it
defines a particular transfer syntax for text files, namely ASCII characters
with carriage return and linefeed (CRLF) after each line, so that text may be
stored in useful fashion on any computer it is transferred to.  Each Kermit
program knows how to translate from the local storage conventions to Kermit's
transfer syntax, and vice versa.  In this way, text files can be transferred
between unlike systems (say, an EBCDIC card oriented system and an ASCII
stream file system) and remain useful after transfer.

Now that the world's computer users have begun to find US ASCII insufficient
for their uses, and ISO, ECMA, etc, are adopting standard codes for the
world's other alphabets, and vendors like IBM, DEC, and Apple have begun to
make these characters available on their displays (albeit in different
positions), and people are beginning to produce increasing numbers of
multilingual documents, ... (will this sentence ever end???) ... Kermit's text
file transfer syntax needs to be extended to allow for texts in a mixture of
alphabets.

It is best if this can be done in line with currently existing and evolving
standards.  Here are the standards we believe are pertinent:

ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for
  Information Interchange" (US ASCII), is the 7-bit code currently used by
  Kermit for transferring text files. 

ISO 646 (1983), "Information Processing - ISO 7-bit Coded Character Sets for
  Information Interchange", gives us a 7-bit character set equivalent to
  ASCII, and says we can substitute "national characters" for the ASCII
  characters #$@[\]^`{|}.  Different languages put different characters in
  these positions, and there's no mechanism defined to specify which language
  is being used. 

ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for
  Information Interchange - Structure and Rules for Implementation", defines
  8-bit character sets, their graphic and control regions, and how to extend
  an 8-bit character set by using multiple graphics sets.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit
  Coded Character Sets - Code Extension Techniques", describes how to use
  8-bit character sets in both 7-bit and 8-bit environments, and how to switch
  among different character sets and alphabets.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded
  Character Set of the American National Standard Code for Information
  Interchange", describes 7- and 8-bit codes and extension techniques in
  approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see below for ECMA equivalents), "Information
  Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the
  actual 8-bit character sets to be used for many of the world's languages.
  The "lower half" (C0 + G0) of each of these is the same as ASCII and ISO
  646.

ISO is the International Organization for Standardization, ANSI is the American
National Standards Institute, and ECMA is the European Computer Manufacturers
Association.


HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set.  This set is
divided into several parts:

  1. 32 control characters (characters 0 through 31).
  2. Space (SP, character 32).
  3. 94 graphic, or printing, characters (33-126).
  4. DEL (rubout, character 127), considered a control character.

The control characters except DEL compose the C0 part of ASCII, and the
graphic characters plus SP and DEL compose the G0 part.  If the ASCII alphabet
is written in a table of 16 rows and 8 columns, then the left 2 columns are the
C0 set, and the right 6 columns are the G0 set:

     <--C0--> <---------G0---------->
      00  01  02  03  04  05  06  07
     +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p |
  01 |SOH DC1| !   1   A   Q   a   q |
  02 |STX DC2| "   2   B   R   b   r |
  03 |ETX DC3| #   3   C   S   c   s |
  04 |EOT DC4| $   4   D   T   d   t |
  05 |ENQ NAK| %   5   E   U   e   u |
  06 |ACK SYN| &   6   F   V   f   v |
  07 |BEL ETB| '   7   G   W   g   w |
  08 |BS  CAN| (   8   H   X   h   x |
  09 |HT  EM | )   9   I   Y   i   y |
  10 |LF  SUB| *   :   J   Z   j   z |
  11 |VT  ESC| +   ;   K   [   k   { |
  12 |FF  FS | ,   <   L   \   l   | |
  13 |CR  GS | -   =   M   ]   m   } |
  14 |SO  RS | .   >   N   ^   n   ~ |
  15 |SI  US | /   ?   O   _   o  DEL|
     +---+---+---+---+---+---+---+---+
     <--C0--> <---------G0---------->
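
As a minimal illustration of the table layout (the function name "position"
is invented for this sketch and appears in no standard), a code's column and
row are its quotient and remainder on division by 16, and columns 00-01 make
up the C0 set:

```python
def position(code):
    """Return (column, row, region) of a 7-bit code in the 16-row table."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit code")
    column, row = divmod(code, 16)
    # Columns 00 and 01 are the C0 controls; columns 02-07 are G0
    # (SP and DEL included, as in the text above).
    region = "C0" if column < 2 else "G0"
    return column, row, region


print(position(ord("A")))   # (4, 1, 'G0')
print(position(27))         # (1, 11, 'C0'), i.e. ESC
```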

Many vendors are now using the full 8 bits available within the computer byte
(and on the transmission line in some cases) for character representation.
At first there were ad-hoc character assignments (e.g. IBM PC 8-bit ASCII,
Apple Macintosh ASCII, etc), but standards are beginning to emerge.

8-bit character sets are described in ISO 4873 and ANSI X3.41.  An 8-bit
character set has two halves.  The "left half" or "lower half" corresponds to
ASCII (and ISO 646).  All the characters in the left half have their
high-order, or 8th, bit set to zero, and are therefore representable in 7
bits.  The "right half" or "upper half" mirrors the left half in that the
first 32 characters are control characters, and the remaining 94 or 96
characters are graphics, but all characters in the right half have their high
order bits set to one.  The right-half controls are called C1, and the
right-half graphics are called G1:

     <--C0--> <---------G0---------->  <--C1--> <---------G1---------->
       00  01  02  03  04  05  06  07    08  09  10  11  12  13  14  15
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p | |    DCS|---+                   |
  01 |SOH DC1| !   1   A   Q   a   q | |    PU1|                       |
  02 |STX DC2| "   2   B   R   b   r | |    PU2|                       |
  03 |ETX DC3| #   3   C   S   c   s | |    STS|                       |
  04 |EOT DC4| $   4   D   T   d   t | |IND CCH|                       |
  05 |ENQ NAK| %   5   E   U   e   u | |NEL MW |                       |
  06 |ACK SYN| &   6   F   V   f   v | |SSA SPA|                       |
  07 |BEL ETB| '   7   G   W   g   w | |ESA EPA|                       |
  08 |BS  CAN| (   8   H   X   h   x | |HTS    |      (special         |
  09 |HT  EM | )   9   I   Y   i   y | |HTJ    |       graphics)       |
  10 |LF  SUB| *   :   J   Z   j   z | |VTS    |                       |
  11 |VT  ESC| +   ;   K   [   k   { | |PLD CSI|                       |
  12 |FF  FS | ,   <   L   \   l   | | |PLU ST |                       |
  13 |CR  GS | -   =   M   ]   m   } | |RI  OSC|                       |
  14 |SO  RS | .   >   N   ^   n   ~ | |SS2 PM |                       |
  15 |SI  US | /   ?   O   _   o  DEL| |SS3 APC|                   +---|
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
     <--C0--> <---------G0---------->  <--C1--> <---------G1---------->

G1 character sets can have either 94 or 96 characters.  A 94-character G1 set
has Space (SP) in its first position and DEL in its last, just like G0 (the
notches shown in G1 in the diagram).  A 96-character set has graphic
characters in all 96 positions.
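
The four regions of the 8-bit table can be told apart from the bit pattern
alone.  The following is only a sketch of that rule (the name "region" is
hypothetical): the 8th bit picks the half, and the low 7 bits decide between
controls and graphics.

```python
def region(byte):
    """Classify an 8-bit code as C0, G0, C1, or G1."""
    if not 0 <= byte <= 255:
        raise ValueError("not an 8-bit code")
    right_half = bool(byte & 0x80)      # high-order (8th) bit
    low = byte & 0x7F                   # position within the half
    if low < 32:                        # first two columns of either half
        return "C1" if right_half else "C0"
    return "G1" if right_half else "G0"


print(region(0x1B))   # C0 (ESC)
print(region(0x9B))   # C1 (CSI, column 09 row 11)
print(region(0xE9))   # G1
```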

ISO 4873 allows up to four sets of printable graphic characters, of either 94
or 96 characters each (G0, G1, G2, and G3), "active" at one time, plus the
control sets (C0 and C1).  Therefore there can be up to

  2 x 32 + 4 x 96 = 448

characters simultaneously within the repertoire of a given device.  But in
today's world, 7- or 8-bit character i/o is the norm imposed by our terminals,
computer architectures, and (most of all) the nature of our asynchronous
communication devices and transmission systems.  Therefore, a terminal or
computing device will receive at most a 7- or 8-bit code capable of denoting
only 128 or 256 different characters, out of the possible 448.  How can the
additional characters be accessed in the 8-bit environment?  In the 7-bit
environment?

Switching among character sets is accomplished using special control
characters or escape sequences embedded within the data stream.  These control
characters and escape sequences are specified in ISO 2022.  In the following
discussion, we use this notation (numbers are in decimal unless otherwise
noted):

  <ESC> Escape (ASCII 27)
  <SP>  Space  (ASCII 32)
  <SO>  Shift Out (Ctrl-N, ASCII 14)
  <SI>  Shift In  (Ctrl-O, ASCII 15)

ISO 2022 provides two separate mechanisms for handling multiple character
sets.  The first is a set of escape sequences for assigning a particular
alphabet (such as Cyrillic, Hebrew, Arabic, etc) to a particular character set
(G0, G1, G2, or G3).  The second is a set of functions for shifting among the
currently active sets.  Here are the alphabet selectors and shift functions:

  Escape Sequence                                         Shift Function

  <ESC>(F - assigns 94-character graphics set "F" to G0.  Invoke by SI
  <ESC>)F - assigns 94-character graphics set "F" to G1.  Invoke by SO
  <ESC>*F - assigns 94-character graphics set "F" to G2.  Invoke by SS2 or LS2
  <ESC>+F - assigns 94-character graphics set "F" to G3.  Invoke by SS3 or LS3
  <ESC>-F - assigns 96-character graphics set "F" to G1.  Invoke by SO
  <ESC>.F - assigns 96-character graphics set "F" to G2.  Invoke by SS2 or LS2
  <ESC>/F - assigns 96-character graphics set "F" to G3.  Invoke by SS3 or LS3

The values for "F" are discussed below.  The shift functions are:

  SO  (Ctrl-N) - Shift Out:       select G1 (locking shift)
  SI  (Ctrl-O) - Shift In:        select G0 (locking shift)
  LS2 (<ESC>n) - Locking Shift 2: select G2 (locking shift)
  LS3 (<ESC>o) - Locking Shift 3: select G3 (locking shift)
  SS2 (<ESC>N) - Single Shift 2:  select G2 (single character shift)
  SS3 (<ESC>O) - Single Shift 3:  select G3 (single character shift)

"Locking shift" is like shift-lock on a typewriter.  It means that all
subsequent characters until the next shift character are to be taken from the
designated character set.  "Single shift" applies only to the character that
follows it immediately, but single shift functions are only available for
the G2 and G3 character sets.  Locking shift functions remain in effect across
alphabet changes.
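
The designation and shift tables above are regular enough to be generated
mechanically.  Here is a small sketch; the names INTERMEDIATE, LOCKING_SHIFT,
SINGLE_SHIFT, and designate are invented for illustration and are not taken
from ISO 2022.

```python
ESC = b"\x1b"

# Intermediate character for each (set size, target set) pair,
# copied from the escape sequence table above.
INTERMEDIATE = {
    (94, "G0"): b"(", (94, "G1"): b")", (94, "G2"): b"*", (94, "G3"): b"+",
    (96, "G1"): b"-", (96, "G2"): b".", (96, "G3"): b"/",
}

# Shift functions, copied from the list above.
LOCKING_SHIFT = {"G0": b"\x0f", "G1": b"\x0e",
                 "G2": ESC + b"n", "G3": ESC + b"o"}
SINGLE_SHIFT = {"G2": ESC + b"N", "G3": ESC + b"O"}


def designate(size, target, final):
    """Build the escape sequence assigning graphics set `final` to `target`."""
    # A KeyError here means the table lists no such combination,
    # e.g. a 96-character set assigned to G0.
    return ESC + INTERMEDIATE[(size, target)] + final.encode("ascii")


print(designate(96, "G1", "A"))   # ESC - A
print(designate(94, "G0", "B"))   # ESC ( B
```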

There are many possible ways to use these code extension facilities within
both 7-bit and 8-bit environments.  In any particular data transfer, the
facilities that are actually used can be announced using <ESC><SP>F, where the
possibilities for F are listed in ISO 2022.  This "announcer" escape sequence
should be sent at the beginning of the data transfer.  <ESC><SP>B means the G0
and G1 sets will be used, where <SI> invokes G0 in the left half and <SO>
invokes G1 in the left half.  <ESC><SP>C means the full 8-bit set shall be
used, with no shifting.  There are many other possibilities, which need not
concern us here.

ISO 8859 defines a series of 8-bit character sets.  In each of these, the left
half (G0) is the ISO 646 set, i.e. 7-bit ASCII.  Because of this, the left
half of any ISO 8859 character set may be used to represent English or any
other Latin-alphabet language that can make do without diacritical marks (e.g.
German without umlauts or ess-zet).

By convention, the G0 set can be selected with <ESC>(B.  When we say "by
convention" we mean that each of the ISO 8859 standards says to select the G0
set using this sequence, even if the G1 set is selected using some other
letter, like A, C, L, etc (see below).  Theoretically, <ESC>(A could also be
used to select the G0 set of "alphabet A", <ESC>(L could select the G0 set of
"alphabet L", etc.

Languages with special characters must use specific ISO 8859 G1 sets.  These
sets are specified (to date) in ISO 8859-1 through 8859-9:

ISO 8859-1 is Latin Alphabet No. 1.  The right half (G1) contains all the
  special characters needed for Dutch, Faeroese, Finnish, French, German,
  Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.
  Select G1 with <ESC>-A. 

ISO 8859-2 is Latin Alphabet No. 2.  G1 contains special characters for
  Albanian, Czech, German, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak,
  and Slovene.  Select G1 with <ESC>-B.

ISO 8859-3 is Latin Alphabet No. 3, for Afrikaans, Catalan, Esperanto,
  French, Galician, German, Italian, Maltese, and Turkish.  Select G1 with
  <ESC>-C.

ISO 8859-4 is Latin Alphabet No. 4, for Danish, Estonian, Finnish, German,
  Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.  Select G1
  with <ESC>-D.

ISO 8859-5 is the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian,
  Macedonian, Russian, Serbo-Croatian, and Ukrainian.  Select G1 with <ESC>-L.
  (Compatible with USSR GOST Standard 19768-1987 and ECMA-113.)

ISO 8859-6 is the Latin/Arabic Alphabet.  *** Selection sequence unknown ***.

ISO 8859-7 is the Latin/Greek Alphabet.  Select G1 with <ESC>-F.

ISO 8859-8 is the Latin/Hebrew Alphabet.  Select G1 with <ESC>-H.

ISO DIS 8859-9 is Latin alphabet No. 5, in which six Icelandic letters from
  Latin Alphabet No. 1 were replaced by 6 letters needed for Turkish.  Select
  G1 with <ESC>-M.

The alphabet selection escape sequences are registered in the International
Register of Coded Character Sets under the provisions of ISO 2375, "Data
Processing - Procedure for Registration of Escape Sequences".  The
registration authority is the ECMA, which periodically issues updates.  The
reference number for this register is ISBN 2-12-953907-0.

There may also be "private alphabets", such as those found on DEC terminals.
In the DEC environment only, these may be selected using escape sequences
listed in the DEC manuals, e.g. <ESC>)> to select the DEC Technical
94-character set and assign it to G1.

Alphabet Summary Table:

  Esc Seq   Alphabet Name             ISO Number     ECMA Number

  <ESC>(B   ASCII (ANSI X3.4-1986)    ISO 646        ECMA-6
  <ESC>-A   Latin Alphabet No. 1      ISO 8859-1     ECMA-94
  <ESC>-B   Latin Alphabet No. 2      ISO 8859-2     ECMA-94
  <ESC>-C   Latin Alphabet No. 3      ISO 8859-3     ECMA-94
  <ESC>-D   Latin Alphabet No. 4      ISO 8859-4     ECMA-94
  <ESC>-L   Latin/Cyrillic            ISO 8859-5     ECMA-113
  <ESC>-*** Latin/Arabic              ISO 8859-6     ECMA-114
  <ESC>-F   Latin/Greek               ISO 8859-7     ECMA-118
  <ESC>-H   Latin/Hebrew              ISO 8859-8     ECMA-121
  <ESC>-M   Latin Alphabet No. 5      ISO 8859-9     ECMA-128

*** Unassigned as of June 1986
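
For experimentation, the summary table can be written down as a lookup from
the final character of the G1 selector to an alphabet.  This is a sketch only:
the names G1_ALPHABETS and g1_codec are invented, the codec names are those of
Python's standard library rather than anything defined by ISO or Kermit, and
Latin/Arabic is left out because the table gives no final character for it.

```python
# Final character of <ESC>-F  ->  (alphabet name, Python codec name)
G1_ALPHABETS = {
    "A": ("Latin Alphabet No. 1", "iso8859_1"),
    "B": ("Latin Alphabet No. 2", "iso8859_2"),
    "C": ("Latin Alphabet No. 3", "iso8859_3"),
    "D": ("Latin Alphabet No. 4", "iso8859_4"),
    "L": ("Latin/Cyrillic",       "iso8859_5"),
    "F": ("Latin/Greek",          "iso8859_7"),
    "H": ("Latin/Hebrew",         "iso8859_8"),
    "M": ("Latin Alphabet No. 5", "iso8859_9"),
}


def g1_codec(final):
    """Return the Python codec for the alphabet selected by <ESC>-final."""
    try:
        return G1_ALPHABETS[final][1]
    except KeyError:
        raise ValueError("no alphabet known for final character %r" % final)


# The same right-half byte means different things in different alphabets.
print(bytes([0xE9]).decode(g1_codec("A")))   # e with acute accent
print(bytes([0xE9]).decode(g1_codec("L")))   # Cyrillic small letter shcha
```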


KERMIT FILE TRANSFER

Different computer systems and software packages have different conventions
for representing, storing, and displaying mixed-alphabet textual data.  Such
data can be transferred in binary mode by Kermit, but it will only make sense
when transferred to a system that uses the same representational conventions.

To transfer mixed-alphabet textual data between systems that use different
conventions, a new mechanism is required.  Currently, Kermit defines the
"common intermediate representation", or "transfer syntax", for textual data
(before encoding) to be ASCII characters arranged in lines or records
terminated by ASCII Carriage Return and Linefeed (CRLF).  Henceforth, this
will be known as the Normal Kermit transfer syntax.

The extension proposed here will allow a Kermit program that has specific
knowledge of the local file format (or formats) for storing multilingual or
multi-alphabet text to translate between these system- and
application-specific formats and a new format to be used during file transfer.
This will be called ISO-2022 Kermit transfer syntax.  Like all extensions to
the original Kermit protocol, this will be an optional feature of any Kermit
program.


SELECTING ISO-2022 TRANSFER SYNTAX

The proposed extension to the Kermit protocol follows a subset of ISO 2022, in
which a single ISO-8859 alphabet, comprising a C0, G0, C1, and G1 set, may
be active at one time, and in which:

 - the C1 and G1 sets are transmitted using ISO 2022's 7-bit code extension
   techniques,

 - escape sequences can be used to switch among different alphabets,

 - the C0, G0, and C1 sets are assumed to be identical for all alphabets.

Kermit's default transfer syntax is Normal.  Kermit's ISO-2022 transfer syntax
must therefore be enabled in some way, either automatically or explicitly by
the user.  In the automatic case, the Kermit program recognizes (somehow) that
it is to transfer a multi-alphabet text file.  In the manual case, the user
issues a SET command.

The sending Kermit may inform the receiving Kermit of the selected transfer
syntax by means of the Kermit File Attribute (A) packet, whose use is
negotiated in the Kermit Initialization exchange.  There is an attribute "*"
(ASCII 42) which represents "encoding", with values like "A" for Normal Kermit
ASCII encoding, "E" for EBCDIC (so far, never used).  The proposed new value
is "I8", for "ISO 8-bit character sets".  The receiver can agree to accept the
file or refuse it using Kermit's attribute reply mechanism.  If the receiver
does not do attribute packets, then the sender may still elect to send the
file (with a warning to the user), as either a binary file or an 8-bit text
file, for storing (and perhaps forwarding) purposes only.
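
To make the attribute concrete, here is a sketch of how the "*" field might
look inside the A packet's data.  It assumes the general Kermit attribute
layout (one attribute character, one length character made printable by
adding 32, then the value); that layout comes from the Kermit protocol
manual rather than from this proposal, and the function names are invented.

```python
def tochar(n):
    """Kermit's tochar(): make a small number printable by adding 32."""
    if not 0 <= n <= 94:
        raise ValueError("value out of range for tochar")
    return chr(n + 32)


def attribute_field(kind, value):
    """Build one attribute subfield: type, encoded length, data."""
    return kind + tochar(len(value)) + value


# Encoding attribute announcing ISO 8-bit character sets (proposed value).
print(attribute_field("*", "I8"))   # *"I8
# Encoding attribute for Normal Kermit ASCII encoding.
print(attribute_field("*", "A"))    # *!A
```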

It should also be possible for the user to select ISO-2022 transfer syntax
using an explicit SET command.  This command would have to be given to both
Kermits in order for the ISO transfer syntax to have its desired effect.  The
suggested command is:

  SET TRANSFER-SYNTAX ISO8

This denotes the use of ISO 8-bit alphabets.

(By the way, if the user gives this command to the sender, but not to the
receiver, then the received file will be stored in ISO 2022 format, with the
escape codes mixed with the file characters on disk; if Attribute packets are
not being used, then the receiver will get no warning).

The advantage of using Attribute packets is that the sending Kermit can
automatically inform the receiving Kermit of the file transfer syntax, so that
the user does not have to type a SET command to both Kermits.  On a computer
system where the Kermit program can recognize the attributes and encoding of a
file automatically, this mechanism will allow files of different types (text,
binary, multi-alphabet) to be sent together as a group, even between unlike
systems.  The drawback is that the attribute mechanism must be programmed into
a Kermit program that doesn't already have it.

There should be a way for the user to disable the use of ISO-2022 transfer
syntax.  The recommended command is SET TRANSFER-SYNTAX NORMAL.


DESCRIPTION OF ISO-2022 TRANSFER SYNTAX

Transfer of a multi-character-set text file in ISO-2022 transfer syntax is the
same as transfer of a 7-bit ASCII text file, except that it may contain
embedded escape sequences to switch between character sets.  The file sender
translates the file's characters (if necessary) into one or more selected ISO
8859 alphabets, and terminates lines of text (records) with CRLF.  The file
receiver translates from ISO-2022 transfer syntax into the format demanded by
the local system or application.  The current alphabet is designated by an
escape sequence, and locking shift functions switch between its G0 and G1 sets.

The mechanism described in ISO 646 for building composite graphic characters
by overprinting using Backspace or Carriage Return should not be used; this
practice is prohibited by ISO 8859.

ISO-2022 transfer syntax uses only 7-bit data.  If any character arrives
with its high-order (8th) bit set to one (after stripping of parity and Kermit
decoding), there has been an error.

ISO 2022 states that "at the beginning of information interchange, except
where the interchanging parties have agreed otherwise, all designations shall
be defined by use of the appropriate escape sequences, and the shift status
shall be defined by the use of the appropriate locking-shift functions."
Kermit programs should "agree otherwise" that the default character set is the
US ASCII / ISO-646 / ECMA-6 7-bit set; thus ISO-2022 transfer syntax can be
identical to Normal Kermit transfer syntax when transferring 7-bit text files.
There is no default G1 set, in the interest of fairness to all countries and
peoples.

When the text contains characters outside the ASCII alphabet, an escape
sequence must be used to identify which other alphabet these characters belong
to.  This sequence is <ESC>-F, where F is the officially registered letter for
that alphabet, e.g.  A-D for Latin Alphabets 1-4, L for Cyrillic, etc.  This
sequence assigns the designated alphabet to the active G1 set.

The G1 set is transmitted in its 7-bit form to eliminate Kermit's 8th-bit
prefix overhead on 7-bit connections.  Once a G1 set is selected, it remains in
effect until another G1 set is selected.  Switching between the G0 (ASCII) set
and the G1 (extended) set is done using the ISO-2022 "locking shifts":

  SO (Ctrl-N) - select G1 (the extended set)
  SI (Ctrl-O) - select G0 (the ASCII set)

If a particular set is already invoked, use of the corresponding shift has no
effect.

During file transfer, an <ESC>-F or <ESC>)F sequence must be given before the
first occurrence of an extended character from a 96-character or 94-character
set, respectively.  If no such sequence is given, then all characters are
treated as ASCII data, including <ESC>, <SI>, and <SO>.  In other words, the
file transfer behaves in the normal Kermit fashion for text files.

The C0 and C1 sets, i.e. the two sets of control characters, are not subject
to shifting.  Control characters from the C1 set must be transmitted using
2-character escape sequences, as described in ISO 2022: <ESC>@, <ESC>A,
<ESC>B, etc, stand for 10000000, 10000001, 10000010, etc (binary).  This
method results in less Kermit encoding overhead on 7-bit connections than
would sending these characters "bare" (which is not allowed).
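
The sender's side of these rules might be sketched as follows, for a file
already held in a single 8-bit ISO 8859 alphabet.  This is an illustration
under stated assumptions, not a reference implementation: the function name
is invented, only 96-character G1 sets are handled, a closing <SI> is emitted
by choice, and C0 characters in the input (including any <ESC>, <SO>, or <SI>)
are passed through without the checking a real program would need.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def encode_iso2022_7bit(data, final="A"):
    """Convert 8-bit single-alphabet text to the 7-bit transfer syntax."""
    out = bytearray()
    designated = False      # has <ESC>-F been sent yet?
    shifted = False         # is G1 currently invoked (after <SO>)?
    for b in data:
        low = b & 0x7F
        if low < 0x20:                      # control: never shifted
            if b & 0x80:                    # C1 -> <ESC> + (code - 64)
                out += bytes([ESC, low + 0x40])
            else:                           # C0 passes through (assumption)
                out.append(b)
        elif b & 0x80:                      # G1 graphic
            if not designated:
                out += bytes([ESC]) + b"-" + final.encode("ascii")
                designated = True
            if not shifted:
                out.append(SO)
                shifted = True
            out.append(low)                 # send the 7-bit form
        else:                               # G0 graphic
            if shifted:
                out.append(SI)
                shifted = False
            out.append(b)
    if shifted:                             # leave the stream in G0
        out.append(SI)
    return bytes(out)


# "caf" + e-acute in Latin-1 becomes: c a f <ESC>-A <SO> i <SI>
encoded = encode_iso2022_7bit(b"caf\xe9")
print(encoded == b"caf\x1b-A\x0ei\x0f")     # True
```

Note that plain ASCII input comes out unchanged, which matches the statement
that the transfer behaves in the normal Kermit fashion when no extended
characters are present.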

All the escaping and shifting operations specified here take place before
normal Kermit packet encoding, and are subject to Kermit's control-character
and repeat-count prefixing.  For example, <ESC>-A<SO>x<SI>y becomes #[-A#Nx#Oy
according to Kermit's normal rules for control character prefixing.
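
Under Kermit's usual rule, a control character is sent as the prefix "#"
followed by the character XORed with 64, so ESC (27) travels as "#[", SO (14)
as "#N", and SI (15) as "#O".  A minimal sketch of that step alone follows;
the function name is invented, the input is assumed to be 7-bit, and
repeat-count prefixing and packetizing are left out.

```python
def prefix_controls(data, quote=b"#"):
    """Apply Kermit control-character prefixing to 7-bit data."""
    out = bytearray()
    for b in data:
        if b < 32 or b == 127:              # control character
            out += quote + bytes([b ^ 64])  # ctl(x) = x XOR 64
        elif bytes([b]) == quote:           # the prefix itself is doubled
            out += quote + quote
        else:
            out.append(b)
    return bytes(out)


print(prefix_controls(b"\x1b-A\x0ex\x0fy"))   # b'#[-A#Nx#Oy'
```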

ISO-2022 transfer syntax may be used in conjunction with even, odd, mark, or
space parity, or with no parity at all.  8-bit data is never transferred in
this mode, so 8th-bit prefixing will never occur.


ADDITIONAL ESCAPES

The preceding mode of operation is the one described in ISO-2022 under
"Announcer 4/2" for the 7-bit environment, which is selected by the escape
sequence <ESC><SP>B.  This means that the G0 and G1 sets are used, both in
their 7-bit forms, with <SO> and <SI> used to shift between them.  "Announcer
4/10" <ESC><SP>J specifies that a 7-bit code is used, even in an 8-bit
environment.  The use of 2-character escape sequences for C1 characters can be
announced using <ESC><SP>F (the "F" in this case is really an F).  For
clarity, these escape sequences may be sent at the beginning of the file
transfer, but they are not required.

Similarly, the ISO-2022 Coding Method Delimiter, <ESC>d, may be transmitted at
the end of the file, or at any point within the file after which this coding
method is no longer used.

Since ISO 8859 character sets are subject to revision from time to time, an
alphabet selector may be preceded by <ESC>&F, where F is the revision number
(@ = 1, A = 2, B = 3, etc).  For example, <ESC>&@<ESC>-A means Latin Alphabet
Number One, Revision One.


TRANSFER SYNTAX SUMMARY

All characters are 7-bit, and all sequences are optional, with one exception:
if an extended alphabet is selected, <SO> and <SI> are required to shift
between its G0 and G1 sets.

Preamble:
  <ESC><SP>J<ESC><SP>B<ESC><SP>F (before first file characters):

    <ESC><SP>J - Using 7-bit code.
    <ESC><SP>B - Map both G0 and G1 into the left half.
    <ESC><SP>F - Using 2-character escape sequences for C1 set.

Alphabet selector:
  <ESC>(B<ESC>&@<ESC>-F (before first use of extended characters):

    <ESC>(B - Designate the normal (ASCII, ISO 646) G0 character set.
    <ESC>&@ - Specify the alphabet revision number, if any (@=1, A=2, etc)
    <ESC>-F - Designate the alphabet for G1 (substitute the appropriate F)

Alphabet shifts:
  <SO> - Select G1 set (extended characters)
  <SI> - Select G0 set (ASCII, ISO 646) (default)

Postamble:
  <ESC>d - Coding method delimiter (terminator), at end of file.
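
A receiver for the syntax summarized above could be sketched like this.  It
is an illustration only.  The names are invented; the result is returned as a
Python (Unicode) string purely for convenience; the receiver is assumed to
know already that ISO-2022 transfer syntax is in use, so escape sequences are
always interpreted; announcers, the G0 designation, revision numbers, and the
coding method delimiter are recognized and skipped without any checking.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final character of the G1 selector -> Python codec (an assumption of
# this sketch; Latin/Arabic is omitted because its final is not listed).
CODECS = {"A": "iso8859_1", "B": "iso8859_2", "C": "iso8859_3",
          "D": "iso8859_4", "L": "iso8859_5", "F": "iso8859_7",
          "H": "iso8859_8", "M": "iso8859_9"}


def decode_transfer_syntax(data):
    """Decode 7-bit ISO-2022 transfer syntax into a string."""
    out = []
    g1 = None               # there is no default G1 set
    shifted = False         # G0 is invoked by default
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b > 0x7F:
            raise ValueError("8th bit set in 7-bit transfer syntax")
        if b == ESC:
            start = i
            while i < len(data) and 0x20 <= data[i] <= 0x2F:
                i += 1      # collect intermediate characters
            if i >= len(data):
                raise ValueError("truncated escape sequence")
            inter, final = data[start:i], data[i]
            i += 1
            if inter in (b"-", b")"):               # G1 designation
                g1 = chr(final)
            elif not inter and 0x40 <= final <= 0x5F:
                out.append(chr(final + 0x40))       # C1 control
            # anything else (announcer, <ESC>(B, <ESC>&@, <ESC>d): skipped
        elif b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        elif shifted and b >= 0x20:                 # G1 graphic
            codec = CODECS.get(g1)
            if codec is None:
                raise ValueError("no usable G1 alphabet designated")
            out.append(bytes([b | 0x80]).decode(codec, errors="replace"))
        else:                                       # C0 or G0
            out.append(chr(b))
    return "".join(out)


stream = b"\x1b J\x1b B\x1b F\x1b(B\x1b&@\x1b-Acaf\x0ei\x0f!\x1bd"
print(decode_transfer_syntax(stream) == "caf\u00e9!")   # True
```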


LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the
local storage medium.  It may be ASCII, EBCDIC, a proprietary word processor
format, IBM code page, or anything else.  It is an implementation "detail" for
the Kermit programmer to convert between the local file representation for
multi-alphabet text files, and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the
necessary identifying information, in which case the sending Kermit program
can automatically emit the appropriate escape sequences during file transfer.
In others, the user will have to tell the sending program how the file is
encoded.  If file attribute packets are not used, the user will also have to
tell the receiving Kermit that the transfer syntax is ISO-2022, and in what
format to store the file upon receipt.

The suggested command is SET FILE TYPE <xxx>, where <xxx> specifies how the
file is (or when receiving, is to be) encoded on disk.  This will necessarily
be highly dependent on the system's conventions, or the conventions of the
applications to be used with the file (e.g. a multi-language word processing
program).  Possibilities for <xxx> might include application names like
WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, or system-specific names like
IBM-CODE-PAGE-437 (the IBM PC US character set), IBM-CODE-PAGE-850
(multilingual), IBM-CODE-PAGE-865 (Norway), etc.

It may be that a file is encoded entirely in a single ISO-8859 alphabet, e.g.
Latin Alphabet No. 1, or Latin/Cyrillic, but the file itself contains no
information to that effect.  Therefore, it should be possible for the user to
specify the alphabet in the SET FILE TYPE command, where the possibilities
are:

  LATIN1-ISO8      ARABIC-ISO8
  LATIN2-ISO8      CYRILLIC-ISO8
  LATIN3-ISO8      GREEK-ISO8
  LATIN4-ISO8      HEBREW-ISO8
  LATIN5-ISO8

The part before the dash is the name of the alphabet, and the "-ISO8"
says that the alphabet belongs to the ISO family of 8-bit character sets.
This allows for the possibility of other encoding methods for the same
languages, e.g. GREEK-DEC, where the Greek letters are taken from the DEC
technical character set.

If the local file is not encoded according to ISO 2022 rules, it may contain
<ESC>, <SO>, and <SI> characters.  It is up to the Kermit program to know
what these characters mean in the context of the file's format, and to either
strip them from the file or translate them to something else.  The ISO 2022
rules forbid the use of these characters as data to be transferred.


SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing
programs.  These programs have many functions besides switching among
alphabets.  They may also endow text with special attributes such as boldface,
italic, underline, super- or subscript, color, etc, and render characters in a
variety of type styles and sizes.  Each text processing program may have its
own unique formats and conventions.

These special effects are not addressed by this proposal.  Nevertheless, it is
likely that a multi-alphabet file produced by a text processing program also
contains special effects.  In order for a Kermit program to send a
multi-alphabet file, it must have detailed knowledge of the file's format and
coding conventions.  Therefore, the Kermit program should be able to strip out
the special effects, and send only the text.  Otherwise the result would be
meaningless when received on an unlike system or for use with a different
application.  (When transferring such files between like systems or compatible
applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document
description languages to Kermit, so that Kermit will be able to transfer
formatted documents between unlike systems and applications.  Presently, there
are many competing would-be standards, including IBM DCA and DIA, DEC DDIF, US
Navy DIF, ISO ODA and ODIF, and PostScript.  Kermit should wait for the dust to
settle and then pick a relatively simple, stable alternative.  (Comments
welcome!)


ARCHIVING

The Kermit protocol includes a so-far little-used archiving function.  In this
mode, Kermit stores incoming file data together with the attribute packets
that precede it, so that the file can be retrieved and reconstituted on
another system at a later time.  In archive mode, the alphabet escapes and
shifts should not be interpreted by the receiving Kermit, but simply stored as
data.


MULTIBYTE ALPHABETS

This proposal does not address alphabets such as Japanese, Chinese, and Korean
that do not fit into 8-bit character sets.  A new standard, ISO 10646, is in
preparation.  This standard will define a universal 3-byte character code to
cover all the world's written languages, providing for 1- and 2-byte shortcuts
within a given language environment.  All designation, invocation and shifting
as in ISO 2022 will be avoided.  When and if this standard becomes relatively
stable, it too can be added as a Kermit file transfer syntax option, perhaps
ISO24.

In the meantime, national versions of Kermit can (and do) use SET FILE TYPE
commands to identify the encoding or standard used for a multibyte alphabet.
For example, some Japanese Kermit programs have the command SET FILE TYPE
TEXT, BINARY, or KANJI, and add a further command to specify the local Kanji
encoding: SET KANJI VAX, JIS, or SHIFTJIS (JIS is the Japan Industrial
Standard, JIS X 0208; SHIFTJIS is JIS X 0202 which differs from JIS X 0208 by
the introduction of escape sequences to shift between Kanji and ASCII; VAX is
the encoding used on Japanese VAX/VMS systems).  These Kermit programs use
SHIFTJIS as the transfer syntax, and the Kermit program maps between it and
the local format, which may be VAX, JIS, or SHIFTJIS.  To better mesh with the
current proposal, however, these programs should make a distinction between
the file format and the transfer syntax by adding a command like SET
TRANSFER-SYNTAX SHIFTJIS.
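
The distinction between file format and transfer syntax can be sketched in
Python as follows.  This is only an illustration, not part of any existing
Kermit program: the function names are hypothetical, and the pairing of the
JIS and SHIFTJIS keywords with Python's "iso2022_jp" and "shift_jis" codecs is
an assumption.  VAX is left out because no standard codec is assumed for it.

```python
# Hypothetical mapping from the SET KANJI keywords to Python codec names.
# Both pairings are assumptions made for this sketch.
CODECS = {
    "JIS": "iso2022_jp",
    "SHIFTJIS": "shift_jis",
}


def to_transfer_syntax(local_bytes: bytes, file_format: str,
                       transfer_syntax: str = "SHIFTJIS") -> bytes:
    """Sender side: convert from the local file format to the transfer syntax."""
    text = local_bytes.decode(CODECS[file_format])
    return text.encode(CODECS[transfer_syntax])


def from_transfer_syntax(wire_bytes: bytes, file_format: str,
                         transfer_syntax: str = "SHIFTJIS") -> bytes:
    """Receiver side: convert from the transfer syntax to the local format."""
    text = wire_bytes.decode(CODECS[transfer_syntax])
    return text.encode(CODECS[file_format])
```

Keeping the two settings separate means the sender and receiver may each use a
different local format, as long as they agree on the transfer syntax.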

In this connection, a "rider" to this proposal is that "JS" (for SHIFTJIS)
be added to the list of Kermit encodings under Attribute "*".
Designations for Chinese, Korean, and other multibyte-character-set languages
are welcome, as are alternative designations for Japanese.


TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a
feature of many Kermit programs.  It is hoped that these terminal emulators
will evolve along the lines of the ISO standards mentioned above.  In some
cases, this is already a fact, insofar as DEC VT200 and 300 series terminals
already follow these standards.

In this regard, it is important to note that not all languages are written
from left to right, top to bottom.  Hebrew and Arabic are two examples of
right-to-left languages, and Japanese and Chinese may be written top to
bottom.  The order of the text characters on disk or on the transmission line
does not necessarily reflect their order on the screen or the printed page.


FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner
for text files, without any escapes or shifts, even in ISO8 mode.

A text file containing characters from a language or languages covered by a
single ISO 8859 alphabet will require an <ESC>-F sequence to identify the
alphabet.  <SO> and <SI> are used to shift between the G0 and G1 sets.  The
following lines all produce the same result:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".
  <ESC>-AA dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>&@<ESC>-A<ESC>(B<SI>A dangerous German word is "gef<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich"
(where ae is a way of writing umlaut-a without an umlaut).

For clarity and consistency with the ISO-2022 recommendations, the last form
is preferred: the text begins with an announcement of the G0 and G1 sets in
use, including the version number, and then explicitly shifts into the G0 set,
rather than defaulting to it.  Similarly, use of the preamble at the beginning
of the file and the postamble at the end is also recommended.
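
As an illustration of why the four forms above are equivalent, here is a
minimal decoder sketch in Python.  It is not a reference implementation.  It
assumes that every escape sequence is exactly three characters long, that only
the <ESC>-A and <ESC>-L designations occur, and that codes 0x20-0x7F sent
while shifted map to 0xA0-0xFF of the designated alphabet.  The names
decode_transfer_syntax, G1_CODECS, and EXAMPLES are hypothetical.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"

# Final characters of ESC - F mapped to Python codec names.  Only the two
# alphabets used in this section's examples are listed.
G1_CODECS = {"A": "iso8859_1", "L": "iso8859_5"}


def decode_transfer_syntax(data: str) -> str:
    """Decode 7-bit text using ESC - F designations and SO/SI locking shifts."""
    g1_codec = None   # alphabet currently designated as G1, if any
    shifted = False   # True while G1 is invoked by SO
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == ESC:
            # Simplification: every escape sequence is ESC plus 2 characters.
            intermediate, final = data[i + 1], data[i + 2]
            if intermediate == "-":
                g1_codec = G1_CODECS[final]
            # ESC ( B and ESC & @ leave this decoder's state unchanged.
            i += 3
            continue
        if ch == SO:
            shifted = True
        elif ch == SI:
            shifted = False
        elif shifted and ch >= " ":
            if g1_codec is None:
                raise ValueError("SO seen before any G1 designation")
            # Assumption: codes 0x20-0x7F map to 0xA0-0xFF of the alphabet.
            out.append(bytes([ord(ch) | 0x80]).decode(g1_codec))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# The four equivalent forms of the German example.
TAIL = 'A dangerous German word is "gef' + SO + "d" + SI + 'hrlich".'
EXAMPLES = [
    'A dangerous German word is "gef' + ESC + "-A" + SO + "d" + SI + 'hrlich".',
    ESC + "-A" + TAIL,
    ESC + "-A" + SI + TAIL,
    ESC + "&@" + ESC + "-A" + ESC + "(B" + SI + TAIL,
]
```

Under these assumptions, all four entries of EXAMPLES decode to the same
string, in which the "d" sent between <SO> and <SI> becomes umlaut-a.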

A text file containing characters from multiple ISO 8859 alphabets requires an
<ESC>-F sequence to identify each alphabet.  SO and SI can be used to shift
between G0 and G1 of the current alphabet, and <ESC>(B can be used to select
G0 of any of the alphabets, since these are all the same.  For example, the
following text contains the same word in English, French, and Russian:

  <ESC>-A<SI>Disappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the
subsequent <SO> and <SI> shifts apply to its G0 and G1 sets, which are used to
form the English and French words.  The second escape sequence assigns the
Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this
new set.
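
The sending side of the multi-alphabet example can be sketched in the same
way.  The Python below is only an illustration under stated assumptions: the
names encode_transfer_syntax, g1_code, and ALPHABETS are hypothetical, only
Latin Alphabet No. 1 and Latin/Cyrillic are considered, and a designation is
emitted lazily, just before the first character that needs it, rather than at
the start of the text.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"

# (Python codec name, final character of ESC - F), in order of preference.
ALPHABETS = [("iso8859_1", "A"), ("iso8859_5", "L")]


def g1_code(ch: str, codec: str):
    """Return the 7-bit code of ch in the G1 half of codec, or None."""
    try:
        byte = ch.encode(codec)[0]
    except UnicodeEncodeError:
        return None
    return chr(byte & 0x7F) if byte >= 0xA0 else None


def encode_transfer_syntax(text: str) -> str:
    """Encode text as 7-bit data with ESC - F designations and SO/SI shifts."""
    out = []
    current = None    # codec currently designated as G1
    shifted = False   # True while G1 is invoked by SO
    for ch in text:
        if ord(ch) < 0x80:            # G0 (ASCII) character
            if shifted:
                out.append(SI)
                shifted = False
            out.append(ch)
            continue
        code = g1_code(ch, current) if current else None
        if code is None:              # need a different G1 alphabet
            for codec, final in ALPHABETS:
                code = g1_code(ch, codec)
                if code is not None:
                    out.append(ESC + "-" + final)
                    current = codec
                    break
            else:
                raise ValueError("no listed alphabet contains %r" % ch)
        if not shifted:
            out.append(SO)
            shifted = True
        out.append(code)
    if shifted:                       # leave the stream in G0 at the end
        out.append(SI)
    return "".join(out)
```

Applied to the English, French, and Russian text of the example, this sketch
yields the same shifted characters as shown above.  The only differences are
that <ESC>-A appears just before the first <SO> and that no initial <SI> is
sent.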

A final example, in which the same word is repeated in English, Russian, and
German, shows how a locking shift remains in effect when the alphabet
is changed.  We begin in Latin/Cyrillic, start with an English word from G0,
shift to G1 for the Russian word, and while still in G1 switch to Latin
Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung
(where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.


PERFORMANCE

For each file, the preamble and postamble add from 0 to 11 characters.  Each
alphabet change, for instance when switching between Finnish and Russian, adds
3 characters; each shift between G0 and G1 adds one shift character; and each
C1 control character is replaced by a 2-character escape sequence.

For files of any length at all, the preamble/postamble overhead is negligible.
It is recommended that the "ambles" be included for compatibility with other
ISO-2022-conformant applications.

The restriction of data to 7 bits during transmission should not incur a high
transmission penalty, since the locking shift mechanism will tend to add fewer
characters to the transmission stream than would 8th-bit prefixing of
characters from the G1 set (although in the worst case -- a file composed of
characters alternating between the G0 and G1 sets -- the overhead of shifting
would actually be higher).  The use of two-character escape sequences for the
C1 control set should also have small impact; the overhead will be the same as
for 8th-bit prefixing, but these characters should appear rarely in text
files.

Hence, the transmission overhead of ISO-2022 transfer syntax should not be
significantly different from that of normal Kermit, and in some cases (e.g.
for texts completely in Russian, Greek, Hebrew, or Arabic) the overhead is far
lower.
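
The comparison can be made concrete with a rough count.  The Python sketch
below is an estimate under simplifying assumptions, not a measurement of any
Kermit program: the text uses a single alphabet and contains no C1 control
characters, the preamble and postamble are ignored, the stream is returned to
G0 at the end, and 8th-bit prefixing costs one extra character per G1
character.  Both function names are hypothetical.

```python
def locking_shift_overhead(data: bytes) -> int:
    """Extra characters: 3 for one ESC - F designation plus one per SO or SI."""
    shifts = 0
    shifted = False
    for b in data:
        high = b >= 0x80              # character belongs to the G1 set
        if high != shifted:
            shifts += 1               # one SO or SI
            shifted = high
    if shifted:
        shifts += 1                   # final SI to return to G0
    designation = 3 if shifts else 0  # no designation needed for pure ASCII
    return designation + shifts


def prefix_overhead(data: bytes) -> int:
    """Extra characters with 8th-bit prefixing: one per G1 character."""
    return sum(1 for b in data if b >= 0x80)
```

For a 14-letter Russian word in Latin/Cyrillic, this count gives 5 extra
characters with locking shifts against 14 with prefixing.  For six characters
alternating between G0 and G1 it gives 9 against 3, which is the worst case
mentioned above.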


WHERE TO GET STANDARDS

The ISO/ECMA standards discussed in this proposal may be obtained free of
charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested.
We tried this ourselves, and got delivery within about two weeks.

ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY  10017

ANSI standards may be ordered, for a fee, from:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY  10018


SUMMARY

We hope that this attempt to blend Kermit text file transfer with the ISO
international character set standards is in keeping with the intended use of
those standards.  Anyone who can offer insights as to whether we are using
the standards appropriately is encouraged to comment.

 2-Mar-89 17:20:40-GMT,2351;000000000001
Return-Path: <fdc@cunixc.cc.columbia.edu>
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0)
	id AA10260; Thu, 2 Mar 89 12:20:30 EST
Received: by cunixc.cc.columbia.edu (5.54/5.10) id AA01174; Thu, 2 Mar 89 12:16:20 EST
Date: Thu, 2 Mar 1989 12:16:19 EST
From: Frank da Cruz <fdc@cunixc.cc.columbia.edu>
To: Joe Doupnik <jrd@usu.bitnet>, Paul Placeway <paul@cis.ohio-state.edu>,
        Andre Pirard <A-PIRARD@bliulg11.bitnet>,
        Baruch Cochavy <baruchc@techunix.bitnet>,
        Johan Van Wingen <MOSGLA@hlerul2.bitnet>,
        Ken-ichiro Murakami <MURAKAMI%NTT-20.NTT.JP@relay.cs.net>,
        Kohichi Nishimoto <s153380%tkov02.DEC@decwrl.dec.com>,
        Hirofumi Fujii <KEIBUN@jpnkekvm.bitnet>,
        Gisbert W.Selke <RECK@dbnuama1.bitnet>,
        Kurt Enulf <UPSKE@seguc11.bitnet>,
        Jacob Palme <jacob_palme_qz@qzcom.bitnet>,
        Per Lindberg <Per_Lindberg_ZQ@qzcom.bitnet>,
        "Bj|rn Larsen" <x_larsen_b%use.uio.uninett@tor.nta.no>,
        "Hans A. ]lien" <hans%ifi.uio.no@tor.nta.no>,
        Kai U.Leppamaki <LK-KLE@finhut.bitnet>,
        Steve Jenkins <pdsoft%uk.ac.lancs.cent1@nss.cs.ucl.ac.uk>,
        Jean Dutertre <dutertre%padis1.DEC@decwrl.dec.com>,
        Gerard Gaye <GAYE@frsac11.bitnet>,
        David Guerlet <KERMIT@czheth5a.bitnet>,
        Bernie Eiben <eiben@tops20.dec.com>,
        Volker Edelhoff <edelhoff@unido.bitnet>
Subject: Kermit/ISO proposal
Cc: Christine M Gianone <cmg@cunixc.cc.columbia.edu>
Message-Id: <CMM.0.88.604862179.fdc@cunixc.cc.columbia.edu>

It occurs to me that since the proposal was sent from a brand-new computer,
some of you might not be able to reply to the message.  You can also mail
to us as cmg@cunixc.cc.columbia.edu and fdc@cunixc.cc.columbia.edu, or simply
(but less efficiently) as cmg@columbia.edu and fdc@columbia.edu.  And on
BITNET/EARN you can send direct to KERMIT@CUVMA or FDCCU@CUVMA.  If you don't
know what I'm talking about (i.e. if you didn't receive the proposal) please
let me know and I'll get it to you somehow.

Meanwhile, I'd also appreciate any comments on how it meshes with X.400 and
FTAM and other ISO application protocols in their current incarnations.  I
have some several-year-old drafts of these standards, and as far as I can
tell, the only character set they talk about is ISO 646.

Thanks!  - Frank

                                                                                                                                                                                                                                                                          