Date: Thu, 2 Mar 1989 1:02:25 EST
From: Frank da Cruz
To: Joe Doupnik, Paul Placeway, Andre Pirard, Baruch Cochavy, Johan Van Wingen, Ken-ichiro Murakami, Kohichi Nishimoto, Hirofumi Fujii, Gisbert W. Selke, Kurt Enulf, Jacob Palme, Per Lindberg, "Bj|rn Larsen", "Hans A. ]lien", Kai U. Leppamaki, Steve Jenkins, Jean Dutertre, Gerard Gaye, David Guerlet, Bernie Eiben, Volker Edelhoff
Cc: Christine M Gianone
Subject: Kermit International Character Set Proposal

A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

Christine Gianone and Frank da Cruz
Columbia University Center for Computing Activities
New York
March 1, 1989

PREFACE

This is a ROUGH FIRST DRAFT of a proposal, based upon a reading of the current ISO and ECMA character-set standards, some familiarity with the issues involved, and limited testing with devices that claim to implement these standards (such as the DEC VT340 terminal). Readers are urged to correct us if we have misinterpreted the standards, to fill in missing information, and to make any comments or criticisms they desire. Readers with knowledge of real-world multi-alphabet applications and file formats are especially urged to comment on how this proposal meshes with these particular file formats.

This first draft is being sent to a selected list of people who we know to be familiar with both Kermit and the character set issues discussed here, so your comments will be especially helpful before we circulate this proposal among a wider audience, most likely by mailing it to Info-Kermit and the ISO8859 discussion group.
INTRODUCTION

The Kermit protocol makes a distinction between text and binary files, and it defines a particular transfer syntax for text files, namely ASCII characters with carriage return and linefeed (CRLF) after each line, so that text may be stored in a useful fashion on any computer it is transferred to. Each Kermit program knows how to translate from the local storage conventions to Kermit's transfer syntax, and vice versa. In this way, text files can be transferred between unlike systems (say, an EBCDIC card oriented system and an ASCII stream file system) and remain useful after transfer.

Now that the world's computer users have begun to find US ASCII insufficient for their uses, and ISO, ECMA, etc, are adopting standard codes for the world's other alphabets, and vendors like IBM, DEC, and Apple have begun to make these characters available on their displays (albeit in different positions), and people are beginning to produce increasing numbers of multilingual documents, ... (will this sentence ever end???) ... Kermit's text file transfer syntax needs to be extended to allow for texts in a mixture of alphabets. It is best if this can be done in line with currently existing and evolving standards.

Here are the standards we believe are pertinent:

  ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files.

  ISO 646 (1983), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII, and says we can substitute "national characters" for ASCII characters #$@[\]^`{|}. Different languages put different characters in these positions, and there's no mechanism defined to specify which language is being used.
  ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple graphics sets.

  ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets and alphabets.

  ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022.

  ISO 8859 (1987-present) (see below for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages. The "lower half" (C0 + G0) of each of these is the same as ASCII and ISO 646.

ISO is the International Organization for Standardization, ANSI is the American National Standards Institute, and ECMA is the European Computer Manufacturers Association.

HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set. This set is divided into several parts:

  1. 32 control characters (characters 0 through 31).
  2. Space (SP, character 32).
  3. 94 graphic, or printing, characters (33-126).
  4. DEL (rubout, character 127), considered a control character.

The control characters except DEL compose the C0 part of ASCII, and the graphic characters plus SP and DEL compose the G0 part.
If the ASCII alphabet is written in a table of 16 rows and 8 columns, then the left 2 columns are the C0 set, and the right 6 columns are the G0 set:

     <--C0--> <---------G0---------->
      00  01  02  03  04  05  06  07
     +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p |
  01 |SOH DC1| !   1   A   Q   a   q |
  02 |STX DC2| "   2   B   R   b   r |
  03 |ETX DC3| #   3   C   S   c   s |
  04 |EOT DC4| $   4   D   T   d   t |
  05 |ENQ NAK| %   5   E   U   e   u |
  06 |ACK SYN| &   6   F   V   f   v |
  07 |BEL ETB| '   7   G   W   g   w |
  08 |BS  CAN| (   8   H   X   h   x |
  09 |HT  EM | )   9   I   Y   i   y |
  10 |LF  SUB| *   :   J   Z   j   z |
  11 |VT  ESC| +   ;   K   [   k   { |
  12 |FF  FS | ,   <   L   \   l   | |
  13 |CR  GS | -   =   M   ]   m   } |
  14 |SO  RS | .   >   N   ^   n   ~ |
  15 |SI  US | /   ?   O   _   o  DEL|
     +---+---+---+---+---+---+---+---+
     <--C0--> <---------G0---------->

Many vendors are now using the full 8 bits available within the computer byte (and on the transmission line in some cases) for character representation. At first there were ad-hoc character assignments (e.g. IBM PC 8-bit ASCII, Apple Macintosh ASCII, etc), but standards are beginning to emerge. 8-bit character sets are described in ISO 4873 and ANSI X3.41.

An 8-bit character set has two halves. The "left half" or "lower half" corresponds to ASCII (and ISO 646). All the characters in the left half have their high-order, or 8th, bit set to zero, and are therefore representable in 7 bits. The "right half" or "upper half" mirrors the left half in that the first 32 characters are control characters, and the remaining 94 or 96 characters are graphics, but all characters in the right half have their high-order bits set to one. The right-half controls are called C1, and the right-half graphics are called G1:

     <--C0--> <---------G0---------->   <--C1--> <---------G1---------->
      00  01  02  03  04  05  06  07     08  09  10  11  12  13  14  15
     +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p |  |    DCS|---+                   |
  01 |SOH DC1| !   1   A   Q   a   q |  |    PU1|                       |
  02 |STX DC2| "   2   B   R   b   r |  |    PU2|                       |
  03 |ETX DC3| #   3   C   S   c   s |  |    STS|                       |
  04 |EOT DC4| $   4   D   T   d   t |  |IND CCH|                       |
  05 |ENQ NAK| %   5   E   U   e   u |  |NEL MW |                       |
  06 |ACK SYN| &   6   F   V   f   v |  |SSA SPA|                       |
  07 |BEL ETB| '   7   G   W   g   w |  |ESA EPA|                       |
  08 |BS  CAN| (   8   H   X   h   x |  |HTS    |       (special        |
  09 |HT  EM | )   9   I   Y   i   y |  |HTJ    |       graphics)       |
  10 |LF  SUB| *   :   J   Z   j   z |  |VTS    |                       |
  11 |VT  ESC| +   ;   K   [   k   { |  |PLD CSI|                       |
  12 |FF  FS | ,   <   L   \   l   | |  |PLU ST |                       |
  13 |CR  GS | -   =   M   ]   m   } |  |RI  OSC|                       |
  14 |SO  RS | .   >   N   ^   n   ~ |  |SS2 PM |                       |
  15 |SI  US | /   ?   O   _   o  DEL|  |SS3 APC|                   +---|
     +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
     <--C0--> <---------G0---------->   <--C1--> <---------G1---------->

G1 character sets can have either 94 or 96 characters. A 94-character G1 set has Space (SP) in its first position and DEL in its last, just like G0 (the notches shown in G1 in the diagram). A 96-character set has graphic characters in all 96 positions.

ISO 4873 allows up to four sets of printable graphic characters, of either 94 or 96 characters each (G0, G1, G2, and G3), "active" at one time, plus the control sets (C0 and C1). Therefore there can be up to 2 x 32 + 4 x 96 = 448 characters simultaneously within the repertoire of a given device.

But in today's world, 7- or 8-bit character i/o is the norm imposed by our terminals, computer architectures, and (most of all) the nature of our asynchronous communication devices and transmission systems. Therefore, a terminal or computing device will receive at most a 7- or 8-bit code capable of denoting only 128 or 256 different characters, out of the possible 448. How can the additional characters be accessed in the 8-bit environment? In the 7-bit environment?

Switching among character sets is accomplished using special control characters or escape sequences embedded within the data stream. These control characters and escape sequences are specified in ISO 2022.
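
The layout of the 8-bit code table described above can be sketched in a few lines of Python. This is only an illustration: the function name region and the constant MAX_REPERTOIRE are invented here, and DEL is reported as part of G0 because that is how the table columns in this proposal are drawn.

```python
def region(code: int) -> str:
    """Return the region (C0, G0, C1, or G1) that an 8-bit code falls in."""
    if not 0 <= code <= 255:
        raise ValueError("not an 8-bit code")
    half = "0" if code < 128 else "1"   # left half: C0/G0, right half: C1/G1
    low7 = code & 0x7F                  # position within the half
    return ("C" if low7 < 32 else "G") + half


# Largest repertoire: two 32-character control sets plus four
# 96-character graphic sets (G0 through G3).
MAX_REPERTOIRE = 2 * 32 + 4 * 96
```

For example, region(0x1B) gives "C0" for Escape and region(0xE4) gives "G1" for a right-half graphic character.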
In the following discussion, we use this notation (numbers are in decimal unless otherwise noted):

  <ESC>  Escape (ASCII 27)
  <SP>   Space (ASCII 32)
  <SO>   Shift Out (Ctrl-N, ASCII 14)
  <SI>   Shift In (Ctrl-O, ASCII 15)

ISO 2022 provides two separate mechanisms for handling multiple character sets. The first is a set of escape sequences for assigning a particular alphabet (such as Cyrillic, Hebrew, Arabic, etc) to a particular character set (G0, G1, G2, or G3). The second is a set of functions for shifting among the currently active sets. Here are the alphabet selectors and shift functions:

  Escape Sequence                                            Shift Function
  <ESC>(F - assigns 94-character graphics set "F" to G0.     Invoke by SI
  <ESC>)F - assigns 94-character graphics set "F" to G1.     Invoke by SO
  <ESC>*F - assigns 94-character graphics set "F" to G2.     Invoke by SS2 or LS2
  <ESC>+F - assigns 94-character graphics set "F" to G3.     Invoke by SS3 or LS3
  <ESC>-F - assigns 96-character graphics set "F" to G1.     Invoke by SO
  <ESC>.F - assigns 96-character graphics set "F" to G2.     Invoke by SS2 or LS2
  <ESC>/F - assigns 96-character graphics set "F" to G3.     Invoke by SS3 or LS3

The values for "F" are discussed below. The shift functions are:

  SO  (Ctrl-N) - Shift Out: select G1 (locking shift)
  SI  (Ctrl-O) - Shift In: select G0 (locking shift)
  LS2 (<ESC>n) - Locking Shift 2: select G2 (locking shift)
  LS3 (<ESC>o) - Locking Shift 3: select G3 (locking shift)
  SS2 (<ESC>N) - Single Shift 2: select G2 (single character shift)
  SS3 (<ESC>O) - Single Shift 3: select G3 (single character shift)

"Locking shift" is like shift-lock on a typewriter. It means that all subsequent characters until the next shift character are to be taken from the designated character set. "Single shift" applies only to the character that follows it immediately, but single shift functions are only available for the G2 and G3 character sets. Locking shift functions remain in effect across alphabet changes.

There are many possible ways to use these code extension facilities within both 7-bit and 8-bit environments.
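
As a rough illustration of the designation sequences listed above, the following Python sketch classifies a three-character sequence by its intermediate character. The names DESIGNATORS and parse_designation are invented for this example, and it makes no attempt to validate the final character "F" against the register.

```python
ESC = 0x1B

# Intermediate character -> (graphic set designated, size of that set),
# following the table of designation sequences above.
DESIGNATORS = {
    "(": ("G0", 94), ")": ("G1", 94), "*": ("G2", 94), "+": ("G3", 94),
    "-": ("G1", 96), ".": ("G2", 96), "/": ("G3", 96),
}


def parse_designation(seq: bytes):
    """Parse a sequence such as ESC - A; return (set, size, final) or None."""
    if len(seq) != 3 or seq[0] != ESC:
        return None
    intermediate, final = chr(seq[1]), chr(seq[2])
    if intermediate not in DESIGNATORS:
        return None
    target, size = DESIGNATORS[intermediate]
    return target, size, final
```

For instance, parse_designation(b"\x1b-A") returns ("G1", 96, "A"), meaning that the 96-character set with final character A is assigned to G1.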
In any particular data transfer, the facilities that are actually used can be announced using <ESC><SP>F, where the possibilities for F are listed in ISO 2022. This "announcer" escape sequence should be sent at the beginning of the data transfer. <ESC><SP>B means the G0 and G1 sets will be used, where <SI> invokes G0 in the left half, <SO> invokes G1 in the left half. <ESC><SP>C means the full 8-bit set shall be used, with no shifting. There are many other possibilities, which need not concern us here.

ISO 8859 defines a series of 8-bit character sets. In each of these, the left half (G0) is the ISO 646 set, i.e. 7-bit ASCII. Because of this, the left half of any ISO 8859 character set may be used to represent English or any other Latin-alphabet language that can make do without diacritical marks (e.g. German without umlauts or ess-zet). By convention, the G0 set can be selected with <ESC>(B. When we say "by convention" we mean that each of the ISO 8859 standards says to select the G0 set using this sequence, even if the G1 set is selected using some other letter, like A, C, L, etc (see below). Theoretically, <ESC>(A could also be used to select the G0 set of "alphabet A", <ESC>(L could select the G0 set of "alphabet L", etc.

Languages with special characters must use specific ISO 8859 G1 sets. These sets are specified (to date) in ISO 8859-1 through 8859-9:

  ISO 8859-1 is Latin Alphabet No. 1. The right half (G1) contains all the special characters needed for Dutch, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish. Select G1 with <ESC>-A.

  ISO 8859-2 is Latin Alphabet No. 2. G1 contains special characters for Albanian, Czech, German, Hungarian, Polish, Romanian, Serbocroatian, Slovak, and Slovene. Select G1 with <ESC>-B.

  ISO 8859-3 is Latin Alphabet No. 3, for Afrikaans, Catalan, Esperanto, French, Galician, German, Italian, Maltese, and Turkish. Select G1 with <ESC>-C.

  ISO 8859-4 is Latin Alphabet No. 4, for Danish, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish. Select G1 with <ESC>-D.

  ISO 8859-5 is the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, Macedonian, Russian, Serbocroatian, and Ukrainian. Select G1 with <ESC>-L. (Compatible with USSR GOST Standard 19768-1987 and ECMA-113.)

  ISO 8859-6 is the Latin/Arabic Alphabet. *** Selection sequence unknown ***.

  ISO 8859-7 is the Latin/Greek Alphabet. Select G1 with <ESC>-F.

  ISO 8859-8 is the Latin/Hebrew Alphabet. Select G1 with <ESC>-H.

  ISO DIS 8859-9 is Latin Alphabet No. 5, in which six Icelandic letters from Latin Alphabet No. 1 were replaced by six letters needed for Turkish. Select G1 with <ESC>-M.

The alphabet selection escape sequences are registered in the International Register of Coded Character Sets under the provisions of ISO 2375, "Data Processing - Procedure for Registration of Escape Sequences". The registration authority is the ECMA, which periodically issues updates. The reference number for this register is ISBN 2-12-953907-0.

There may also be "private alphabets", such as those found on DEC terminals. In the DEC environment only, these may be selected using escape sequences listed in the DEC manuals, e.g. <ESC>)> to select the DEC Technical 94-character set and assign it to G1.

Alphabet Summary Table:

  Esc Seq     Alphabet Name            ISO Number    ECMA Number
  <ESC>(B     ASCII (ANSI X3.4-1986)   ISO 646       ECMA-6
  <ESC>-A     Latin Alphabet No. 1     ISO 8859-1    ECMA-94
  <ESC>-B     Latin Alphabet No. 2     ISO 8859-2    ECMA-94
  <ESC>-C     Latin Alphabet No. 3     ISO 8859-3    ECMA-94
  <ESC>-D     Latin Alphabet No. 4     ISO 8859-4    ECMA-94
  <ESC>-L     Latin/Cyrillic           ISO 8859-5    ECMA-113
  <ESC>-***   Latin/Arabic             ISO 8859-6    ECMA-114
  <ESC>-F     Latin/Greek              ISO 8859-7    ECMA-118
  <ESC>-H     Latin/Hebrew             ISO 8859-8    ECMA-121
  <ESC>-M     Latin Alphabet No. 5     ISO 8859-9    ECMA-128

  *** Unassigned as of June 1986

KERMIT FILE TRANSFER

Different computer systems and software packages have different conventions for representing, storing, and displaying mixed-alphabet textual data.
Such data can be transferred in binary mode by Kermit, but it will only make sense when transferred to a system that uses the same representational conventions. To transfer mixed-alphabet textual data between systems that use different conventions, a new mechanism is required.

Currently, Kermit defines the "common intermediate representation", or "transfer syntax", for textual data (before encoding) to be ASCII characters arranged in lines or records terminated by ASCII Carriage Return and Linefeed (CRLF). Henceforth, this will be known as the Normal Kermit transfer syntax.

The extension proposed here will allow a Kermit program that has specific knowledge of the local file format (or formats) for storing multilingual or multi-alphabet text to translate between these system- and application-specific formats and a new format to be used during file transfer. This will be called ISO-2022 Kermit transfer syntax. Like all extensions to the original Kermit protocol, this will be an optional feature of any Kermit program.

SELECTING ISO-2022 TRANSFER SYNTAX

The proposed extension to the Kermit protocol follows a subset of ISO 2022, in which a single ISO-8859 alphabet, composed of a C0, G0, C1, and G1 set, may be active at one time, and in which:

  - the C1 and G1 sets are transmitted using ISO 2022's 7-bit code extension techniques,
  - escape sequences can be used to switch among different alphabets,
  - the C0, G0, and C1 sets are assumed to be identical for all alphabets.

Kermit's default transfer syntax is Normal. Kermit's ISO-2022 transfer syntax must therefore be enabled in some way, either automatically or explicitly by the user. In the automatic case, the Kermit program recognizes (somehow) that it is to transfer a multi-alphabet text file. In the manual case, the user issues a SET command.
The sending Kermit may inform the receiving Kermit of the selected transfer syntax by means of the Kermit File Attribute (A) packet, whose use is negotiated in the Kermit Initialization exchange. There is an attribute "*" (ASCII 42) which represents "encoding", with values like "A" for Normal Kermit ASCII encoding, "E" for EBCDIC (so far, never used). The proposed new value is "I8", for "ISO 8-bit character sets". The receiver can agree to accept the file or refuse it using Kermit's attribute reply mechanism. If the receiver does not do attribute packets, then the sender may still elect to send the file (with a warning to the user), as either a binary file or an 8-bit text file, for storing (and perhaps forwarding) purposes only.

It should also be possible for the user to select ISO-2022 transfer syntax using an explicit SET command. This command would have to be given to both Kermits in order for the ISO transfer syntax to have its desired effect. The suggested command is:

  SET TRANSFER-SYNTAX ISO8

This denotes the use of ISO 8-bit alphabets. (By the way, if the user gives this command to the sender, but not to the receiver, then the received file will be stored in ISO 2022 format, with the escape codes mixed with the file characters on disk; if Attribute packets are not being used, then the receiver will get no warning.)

The advantage of using Attribute packets is that the sending Kermit can automatically inform the receiving Kermit of the file transfer syntax, so that the user does not have to type a SET command to both Kermits. On a computer system where the Kermit program can recognize the attributes and encoding of a file automatically, this mechanism will allow files of different types (text, binary, multi-alphabet) to be sent together as a group, even between unlike systems. The drawback is that the attribute mechanism must be programmed into a Kermit program that doesn't already have it.
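
To make the attribute mechanism concrete, here is a minimal Python sketch of how the proposed encoding attribute might appear in the data field of an A packet. It assumes the subfield layout recalled from the Kermit Protocol Manual (attribute character, then the length encoded as a printable character by adding 32, then the data); that layout is not restated in this proposal, so treat it as an assumption. The function names are invented for this example.

```python
def tochar(n: int) -> str:
    """Make a small number printable by adding 32 (Kermit's char())."""
    if not 0 <= n <= 94:
        raise ValueError("value does not fit in one printable character")
    return chr(n + 32)


def attribute(kind: str, value: str) -> str:
    """Build one attribute subfield: kind, encoded length, data (assumed layout)."""
    return kind + tochar(len(value)) + value


def parse_attributes(data: str) -> dict:
    """Split an A-packet data field into {kind: value} under the same assumption."""
    fields, i = {}, 0
    while i < len(data):
        kind, length = data[i], ord(data[i + 1]) - 32
        fields[kind] = data[i + 2:i + 2 + length]
        i += 2 + length
    return fields
```

Under these assumptions, attribute("*", "I8") yields the four characters *"I8, and a receiver that does not recognize the value "I8" could refuse the file through the attribute reply mechanism.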
There should be a way for the user to disable the use of ISO-2022 transfer syntax. The recommended command is SET TRANSFER-SYNTAX NORMAL.

DESCRIPTION OF ISO-2022 TRANSFER SYNTAX

Transfer of a multi-character-set text file in ISO-2022 transfer syntax is the same as transfer of a 7-bit ASCII text file, except that it may contain embedded escape sequences to switch between character sets. The file sender translates the file's characters (if necessary) into one or more selected ISO 8859 alphabets, and terminates lines of text (records) with CRLF. The file receiver translates from ISO-2022 transfer syntax into the format demanded by the local system or application.

The current alphabet is designated by an escape sequence, and locking shift functions switch between its G0 and G1 sets. The mechanism described in ISO 646 for building composite graphic characters by overprinting using Backspace or Carriage Return should not be used; this practice is prohibited by ISO 8859.

ISO-2022 transfer syntax uses only 7-bit data. If any character arrives with its high-order (8th) bit set to one (after stripping of parity and Kermit decoding), there has been an error.

ISO 2022 states that "at the beginning of information interchange, except where the interchanging parties have agreed otherwise, all designations shall be defined by use of the appropriate escape sequences, and the shift status shall be defined by the use of the appropriate locking-shift functions." Kermit programs should "agree otherwise" that the default character set is the US ASCII / ISO-646 / ECMA-6 7-bit set; thus ISO-2022 transfer syntax can be identical to Normal Kermit transfer syntax when transferring 7-bit text files. There is no default G1 set, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII alphabet, an escape sequence must be used to identify which other alphabet these characters belong to.
This sequence is <ESC>-F, where F is the officially registered letter for that alphabet, e.g. A-D for Latin Alphabets 1-4, L for Cyrillic, etc. This sequence assigns the designated alphabet to the active G1 set. The G1 set is transmitted in its 7-bit form to eliminate Kermit's 8th-bit prefix overhead on 7-bit connections. Once a G1 set is selected, it remains in effect until another G1 set is selected.

Switching between the G0 (ASCII) set and the G1 (extended) set is done using the ISO-2022 "locking shifts":

  SO (Ctrl-N) - select G1 (the extended set)
  SI (Ctrl-O) - select G0 (the ASCII set)

If a particular set is already invoked, use of the corresponding shift has no effect.

During file transfer, an <ESC>-F or <ESC>)F sequence must be given before the first occurrence of an extended character from a 96-character or 94-character set, respectively. If no such sequence is given, then all characters are treated as ASCII data, including <ESC>, <SO>, and <SI>. In other words, the file transfer behaves in the normal Kermit fashion for text files.

The C0 and C1 sets, i.e. the two sets of control characters, are not subject to shifting. Control characters from the C1 set must be transmitted using 2-character escape sequences, as described in ISO 2022: <ESC>@, <ESC>A, <ESC>B, etc, stand for 10000000, 10000001, 10000010, etc (binary). This method results in less Kermit encoding overhead on 7-bit connections than would sending these characters "bare" (which is not allowed).

All the escaping and shifting operations specified here take place before normal Kermit packet encoding, and are subject to Kermit's control-character and repeat-count prefixing. For example, <ESC>-A<SO>x<SI>y becomes #[-A#Nx#Oy according to Kermit's normal rules for control character prefixing (each control character is sent as the prefix "#" followed by the character with its 64 bit inverted, so <ESC> becomes #[, <SO> becomes #N, and <SI> becomes #O).

ISO-2022 transfer syntax may be used in conjunction with even, odd, mark, or space parity, or with no parity at all. 8-bit data is never transferred in this mode, so 8th-bit prefixing will never occur.
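
The rules in this section can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: the names ALPHABETS, encode, decode, and kermit_ctl_prefix are invented here, Python's standard iso8859_N codecs stand in for the alphabets, only 96-character G1 designations are handled, the encoder works on text from a single alphabet, and repeat-count prefixing is left out. Ending the stream in G0 is a choice made by this sketch, not a requirement stated in the proposal.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"

# Final character of the G1 designation -> Python codec for that 8-bit set.
ALPHABETS = {
    "A": "iso8859_1", "B": "iso8859_2", "C": "iso8859_3", "D": "iso8859_4",
    "L": "iso8859_5", "F": "iso8859_7", "H": "iso8859_8", "M": "iso8859_9",
}


def encode(text: str, final: str) -> str:
    """Encode text from a single alphabet into the 7-bit transfer syntax."""
    out, in_g1, designated = [], False, False
    for byte in text.encode(ALPHABETS[final]):
        if byte in (0x1B, 0x0E, 0x0F):
            raise ValueError("ESC, SO, and SI may not appear as file data")
        if byte < 0x20:                       # C0: sent as is, never shifted
            out.append(chr(byte))
            continue
        if 0x80 <= byte < 0xA0:               # C1: 2-character escape form
            out.append(ESC + chr(byte - 0x40))
            continue
        want_g1 = byte >= 0xA0
        if want_g1 and not designated:        # designate G1 before first use
            out.append(ESC + "-" + final)
            designated = True
        if want_g1 != in_g1:                  # locking shift only when needed
            out.append(SO if want_g1 else SI)
            in_g1 = want_g1
        out.append(chr(byte & 0x7F))
    if in_g1:
        out.append(SI)                        # this sketch's choice: end in G0
    return "".join(out)


def decode(stream: str) -> str:
    """Decode the 7-bit transfer syntax back into a Python (Unicode) string."""
    if ESC + "-" not in stream:               # no designation: plain ASCII data
        return stream
    out, codec, in_g1, i = [], None, False, 0
    while i < len(stream):
        ch = stream[i]
        if ch == ESC:
            nxt = stream[i + 1]
            if nxt == "-":                    # designate a 96-character G1 set
                codec = ALPHABETS[stream[i + 2]]
                i += 3
            elif nxt in " (&":                # announcer, G0 designation, revision
                i += 3
            elif nxt == "d":                  # coding method delimiter
                i += 2
            elif "@" <= nxt <= "_":           # 7-bit form of a C1 control
                out.append(chr(ord(nxt) + 0x40))
                i += 2
            else:
                raise ValueError("unsupported escape sequence")
            continue
        if ch == SO:
            in_g1 = True
        elif ch == SI:
            in_g1 = False
        elif in_g1 and ch >= " ":
            if codec is None:
                raise ValueError("shift to G1 before any designation")
            out.append(bytes([ord(ch) | 0x80]).decode(codec))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def kermit_ctl_prefix(data: str, prefix: str = "#") -> str:
    """Apply Kermit control-character prefixing (no repeat counts)."""
    out = []
    for ch in data:
        code = ord(ch)
        if code < 32 or code == 127:
            out.append(prefix + chr(code ^ 64))
        elif ch == prefix:
            out.append(prefix + ch)
        else:
            out.append(ch)
    return "".join(out)
```

A sender would call encode() on the text and then apply kermit_ctl_prefix() to the result while building packets, so that the shift characters travel as #N and #O. A receiver would undo the prefixing and then call decode().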
ADDITIONAL ESCAPES

The preceding mode of operation is the one described in ISO-2022 under "Announcer 4/2" for the 7-bit environment, which is selected by the escape sequence <ESC><SP>B. This means that the G0 and G1 sets are used, both in their 7-bit forms, with <SO> and <SI> used to shift between them. "Announcer 4/10" <ESC><SP>J specifies that a 7-bit code is used, even in an 8-bit environment. The use of 2-character escape sequences for C1 characters can be announced using <ESC><SP>F (the "F" in this case is really an F). For clarity, these escape sequences may be sent at the beginning of the file transfer, but they are not required.

Similarly, the ISO-2022 Coding Method Delimiter, <ESC>d, may be transmitted at the end of the file, or at any point within the file after which this coding method is no longer used.

Since ISO 8859 character sets are subject to revision from time to time, an alphabet selector may be preceded by <ESC>&F, where F is the revision number (@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet Number One, Revision One.

TRANSFER SYNTAX SUMMARY

All characters are 7-bit. All sequences are optional, except that if an extended alphabet is selected, <SO> and <SI> are required to shift between its G0 and G1 sets.

Preamble: <ESC><SP>J<ESC><SP>B<ESC><SP>F (before first file characters):

  <ESC><SP>J - Using 7-bit code.
  <ESC><SP>B - Map both G0 and G1 into the left half.
  <ESC><SP>F - Using 2-character escape sequences for C1 set.

Alphabet selector: <ESC>(B<ESC>&@<ESC>-F (before first use of extended characters):

  <ESC>(B - Designate the normal (ASCII, ISO 646) G0 character set.
  <ESC>&@ - Specify the alphabet revision number, if any (@=1, A=2, etc).
  <ESC>-F - Designate the alphabet for G1 (substitute the appropriate F).

Alphabet shifts:

  <SO> - Select G1 set (extended characters)
  <SI> - Select G0 set (ASCII, ISO 646) (default)

Postamble:

  <ESC>d - Coding method delimiter (terminator), at end of file.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the local storage medium.
It may be ASCII, EBCDIC, a proprietary word processor format, IBM code page, or anything else. It is an implementation "detail" for the Kermit programmer to convert between the local file representation for multi-alphabet text files and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the necessary identifying information, in which case the sending Kermit program can automatically emit the appropriate escape sequences during file transfer. In others, the user will have to tell the sending program how the file is encoded. If file attribute packets are not used, the user will also have to tell the receiving Kermit that the transfer syntax is ISO-2022, and in what format to store the file upon receipt.

The suggested command is SET FILE TYPE followed by a type name that specifies how the file is (or when receiving, is to be) encoded on disk. This will necessarily be highly dependent on the system's conventions, or the conventions of the applications to be used with the file (e.g. a multi-language word processing program). Possibilities for the type name might include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, or system-specific names like IBM-CODE-PAGE-437 (the IBM PC US character set), IBM-CODE-PAGE-850 (multilingual), IBM-CODE-PAGE-865 (Norway), etc.

It may be that a file is encoded entirely in a single ISO-8859 alphabet, e.g. Latin Alphabet No. 1, or Latin/Cyrillic, but the file itself contains no information to that effect. Therefore, it should be possible for the user to specify the alphabet in the SET FILE TYPE command, where the possibilities are:

  LATIN1-ISO8    ARABIC-ISO8
  LATIN2-ISO8    CYRILLIC-ISO8
  LATIN3-ISO8    GREEK-ISO8
  LATIN4-ISO8    HEBREW-ISO8
  LATIN5-ISO8

The part before the dash is the name of the alphabet, and the "-ISO8" says that the alphabet belongs to the ISO family of 8-bit character sets. This allows for the possibility of other encoding methods for the same languages, e.g.
GREEK-DEC, where the Greek letters are taken from the DEC technical character set.

If the local file is not encoded according to ISO 2022 rules, it may contain <ESC>, <SO>, and <SI> characters. It is up to the Kermit program to know what these characters mean in the context of the file's format, and to either strip them from the file or translate them to something else. The ISO 2022 rules forbid the use of these characters as data to be transferred.

SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing programs. These programs have many functions besides switching among alphabets. They may also endow text with special attributes such as boldface, italic, underline, super- or subscript, color, etc, and render characters in a variety of type styles and sizes. Each text processing program may have its own unique formats and conventions. These special effects are not addressed by this proposal.

Nevertheless, it is likely that a multi-alphabet file produced by a text processing program also contains special effects. In order for a Kermit program to send a multi-alphabet file, it must have detailed knowledge of the file's format and coding conventions. Therefore, the Kermit program should be able to strip out the special effects, and send only the text. Otherwise the result would be meaningless when received on an unlike system or for use with a different application. (When transferring such files between like systems or compatible applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document description languages to Kermit, so that Kermit will be able to transfer formatted documents between unlike systems and applications. Presently, there are many competing would-be standards including IBM DCA and DIA, DEC DDIF, US Navy DIF, ISO ODA and ODIF, and Postscript. Kermit should wait for the dust to settle and then pick a relatively simple, stable alternative. (Comments welcome!)
ARCHIVING

The Kermit protocol includes a so-far little-used archiving function. In this mode, Kermit stores incoming file data together with the attribute packets that precede it, so that the file can be retrieved and reconstituted on another system at a later time. In archive mode, the alphabet escapes and shifts should not be interpreted by the receiving Kermit, but simply stored as data.

MULTIBYTE ALPHABETS

This proposal does not address alphabets such as Japanese, Chinese, and Korean that do not fit into 8-bit character sets. A new standard, ISO 10646, is in preparation. This standard will define a universal 3-byte character code to cover all the world's written languages, providing for 1- and 2-byte shortcuts within a given language environment. All designation, invocation and shifting as in ISO 2022 will be avoided. When and if this standard becomes relatively stable, it too can be added as a Kermit file transfer syntax option, perhaps ISO24.

In the meantime, national versions of Kermit can (and do) use SET FILE TYPE commands to identify the encoding or standard used for a multibyte alphabet. For example, some Japanese Kermit programs have the command SET FILE TYPE TEXT, BINARY, or KANJI, and add a further command to specify the local Kanji encoding: SET KANJI VAX, JIS, or SHIFTJIS (JIS is the Japan Industrial Standard, JIS X 0208; SHIFTJIS is JIS X 0202 which differs from JIS X 0208 by the introduction of escape sequences to shift between Kanji and ASCII; VAX is the encoding used on Japanese VAX/VMS systems).

These Kermit programs use SHIFTJIS as the transfer syntax, and the Kermit program maps between it and the local format, which may be VAX, JIS, or SHIFTJIS. To better mesh with the current proposal, however, these programs should make a distinction between the file format and the transfer syntax by adding a command like SET TRANSFER-SYNTAX SHIFTJIS.
In this connection, a "rider" to this proposal is that "JS" (for SHIFTJIS) be added to the list of Kermit encodings under Attribute "*". Designations for Chinese, Korean, and other multibyte-character-set languages are welcome, as are alternative designations for Japanese.

TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a feature of many Kermit programs. It is hoped that these terminal emulators will evolve along the lines of the ISO standards mentioned above. In some cases, this is already a fact, insofar as DEC VT200 and 300 series terminals already follow these standards.

In this regard, it is important to note that not all languages are written from left to right, top to bottom. Hebrew and Arabic are two examples of right-to-left languages, and Japanese and Chinese may be written top to bottom. The order of the text characters on disk or on the transmission line does not necessarily reflect their order on the screen or the printed page.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner for text files, without any escapes or shifts, even in ISO8 mode.

A text file containing characters from a language or languages covered by a single ISO 8859 alphabet will require an <ESC>-F sequence to identify the alphabet. <SO> and <SI> are used to shift between the G0 and G1 sets. The following lines all produce the same result:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".
  <ESC>-AA dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>&@<ESC>-A<ESC>(B<SI>A dangerous German word is "gef<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich" (where ae is a way of writing umlaut-a without an umlaut). It is transmitted as the letter "d", which is the 7-bit form of its Latin Alphabet No. 1 code.
For clarity and consistency with the ISO-2022 recommendations, the latter form is preferred: the text begins with an announcement of the G0 and G1 sets in use, including the version number, and then explicitly shifts into the G0 set, rather than defaulting to it. Similarly, use of the preamble at the beginning of the file and the postamble at the end is also recommended.

A text file containing characters from multiple ISO 8859 alphabets requires an <ESC>-F sequence to identify each alphabet. SO and SI can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B can be used to select G0 of any of the alphabets, since these are all the same. For example, the following text contains the same word in English, French, and Russian:

  <ESC>-ADisappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the subsequent <SO> and <SI> shifts apply to its G0 and G1 sets, which are used to form the English and French words. The second escape sequence assigns the Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this new set.

A final example, in which the same word is repeated in English, Russian, and German, shows how a locking shift remains in effect when the alphabet is changed. We begin in Latin/Cyrillic, start with an English word from G0, shift to G1 for the Russian word, and while still in G1 switch to Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.

PERFORMANCE

For each file, the preamble and postamble add from 0 to 11 characters. Each alphabet change (for instance, switching between Finnish and Russian) adds 3 characters, each shift between G0 and G1 adds one shift character, and each C1 control character is replaced by a 2-character escape sequence. For files of any length at all, the preamble/postamble overhead is negligible.
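The per-alphabet and per-shift costs described above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not part of the proposal: the function names encode_7bit and overhead are hypothetical, only Latin Alphabet No. 1 is handled, the preamble and postamble are omitted, C1 control characters are rejected rather than converted to escape sequences, and the designation is emitted only when the text actually contains G1 characters.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"
DESIGNATE_LATIN1_G1 = ESC + "-A"   # 3 characters: assign Latin Alphabet No. 1 to G1


def encode_7bit(text):
    """Encode Latin-1 text as a 7-bit stream using locking shifts (sketch)."""
    data = text.encode("latin-1")          # fails for characters outside Latin-1
    out = []
    if any(byte >= 0xA0 for byte in data):
        out.append(DESIGNATE_LATIN1_G1)    # announce the alphabet once, up front
    in_g1 = False                          # assumption: the stream starts in G0
    for byte in data:
        if byte < 0x20:
            out.append(chr(byte))          # C0 controls pass through unshifted
            continue
        if 0x80 <= byte < 0xA0:
            raise ValueError("C1 controls are not handled in this sketch")
        want_g1 = byte >= 0xA0
        if want_g1 != in_g1:
            out.append(SO if want_g1 else SI)   # one character per shift
            in_g1 = want_g1
        out.append(chr(byte & 0x7F))       # strip the 8th bit for transmission
    if in_g1:
        out.append(SI)                     # leave the stream in G0 at the end
    return "".join(out)


def overhead(text):
    """Number of extra characters added by designation and shifts."""
    return len(encode_7bit(text)) - len(text)
```

With this sketch, the word "gefährlich" costs 5 extra characters (3 for the designation, plus one SO and one SI), while a pure ASCII string is passed through unchanged.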
It is recommended that the "ambles" be included for compatibility with other ISO-2022-conformant applications.

The restriction of data to 7 bits during transmission should not incur a high transmission penalty, since the locking shift mechanism will tend to add fewer characters to the transmission stream than would 8th-bit prefixing of characters from the G1 set (although in the worst case -- a file composed of characters alternating between the G0 and G1 sets -- the overhead of shifting would actually be higher). The use of two-character escape sequences for the C1 control set should also have little impact; the overhead will be the same as for 8th-bit prefixing, but these characters should appear rarely in text files.

Hence, the transmission overhead of ISO-2022 transfer syntax should not be significantly different from that of normal Kermit, and in some cases (e.g. for texts completely in Russian, Greek, Hebrew, or Arabic) the overhead is far lower.

WHERE TO GET STANDARDS

The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested. We tried this ourselves, and got delivery within about two weeks.

ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY 10017

ANSI standards may be ordered, for a fee, from:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY 10018

SUMMARY

We hope that this attempt to blend Kermit text file transfer with the ISO international character set standards is in keeping with the intended use of those standards. Anyone who can offer insights as to whether we are using the standards appropriately is encouraged to comment.
2-Mar-89 17:20:40-GMT,2351;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA10260; Thu, 2 Mar 89 12:20:30 EST Received: by cunixc.cc.columbia.edu (5.54/5.10) id AA01174; Thu, 2 Mar 89 12:16:20 EST Date: Thu, 2 Mar 1989 12:16:19 EST From: Frank da Cruz To: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , Gisbert W.Selke , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Kai U.Leppamaki , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff Subject: Kermit/ISO proposal Cc: Christine M Gianone Message-Id: It occurs to me that since the proposal was sent from a brand-new computer, some of you might not be able to reply to the message. You can also mail to us as cmg@cunixc.cc.columbia.edu and fdc@cunixc.cc.columbia.edu, or simply (but less efficiently) as cmg@columbia.edu and fdc@columbia.edu. And on BITNET/EARN you can send direct to KERMIT@CUVMA or FDCCU@CUVMA. If you don't know what I'm talking about (i.e. if you didn't receive the proposal) please let me know and I'll get it to you somehow. Meanwhile, I'd also appreciate any comments on how it meshes with X.400 and FTAM and other ISO application protocols in their current incarnations. I have some several-year-old drafts of these standards, and as far as I can tell, the only character set they talk about is ISO 646. Thanks! - Frank