Date: Thu, 2 Mar 1989 1:02:25 EST
From: Frank da Cruz
To: Joe Doupnik, Paul Placeway, Andre Pirard, Baruch Cochavy, Johan Van Wingen, Ken-ichiro Murakami, Kohichi Nishimoto, Hirofumi Fujii, Gisbert W. Selke, Kurt Enulf, Jacob Palme, Per Lindberg, Bjørn Larsen, Hans A. Ålien, Kai U. Leppamaki, Steve Jenkins, Jean Dutertre, Gerard Gaye, David Guerlet, Bernie Eiben, Volker Edelhoff
Cc: Christine M Gianone
Subject: Kermit International Character Set Proposal

A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

Christine Gianone and Frank da Cruz
Columbia University Center for Computing Activities
New York
March 1, 1989

PREFACE

This is a ROUGH FIRST DRAFT of a proposal, based upon a reading of the current ISO and ECMA character-set standards, some familiarity with the issues involved, and limited testing with devices that claim to implement these standards (such as the DEC VT340 terminal). Readers are urged to correct us if we have misinterpreted the standards, to fill in missing information, and to make any comments or criticisms they desire. Readers with knowledge of real-world multi-alphabet applications and file formats are especially urged to comment on how this proposal meshes with these particular file formats.

This first draft is being sent to a selected list of people who we know to be familiar with both Kermit and the character set issues discussed here, so your comments will be especially helpful before we circulate this proposal among a wider audience, most likely by mailing it to Info-Kermit and the ISO8859 discussion group.

INTRODUCTION

The Kermit protocol makes a distinction between text and binary files, and it defines a particular transfer syntax for text files, namely ASCII characters with carriage return and linefeed (CRLF) after each line, so that text may be stored in useful fashion on any computer it is transferred to. Each Kermit program knows how to translate from the local storage conventions to Kermit's transfer syntax, and vice versa. In this way, text files can be transferred between unlike systems (say, an EBCDIC card oriented system and an ASCII stream file system) and remain useful after transfer.

Now that the world's computer users have begun to find US ASCII insufficient for their uses, and ISO, ECMA, etc, are adopting standard codes for the world's other alphabets, and vendors like IBM, DEC, and Apple have begun to make these characters available on their displays (albeit in different positions), and people are beginning to produce increasing numbers of multilingual documents, ... (will this sentence ever end???) ... Kermit's text file transfer syntax needs to be extended to allow for texts in a mixture of alphabets. It is best if this can be done in line with currently existing and evolving standards.

Here are the standards we believe are pertinent:

ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files.

ISO 646 (1983), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII, and says we can substitute "national characters" for ASCII characters #$@[\]^`{|}. Different languages put different characters in these positions, and there's no mechanism defined to specify which language is being used.

ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple graphics sets.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets and alphabets.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see below for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages. The "lower half" (C0 + G0) of each of these is the same as ASCII and ISO 646.

ISO is the International Organization for Standardization, ANSI is the American National Standards Institute, and ECMA is the European Computer Manufacturers Association.

HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set. This set is divided into several parts:

  1. 32 control characters (characters 0 through 31).
  2. Space (SP, character 32).
  3. 94 graphic, or printing, characters (33-126).
  4. DEL (rubout, character 127), considered a control character.

The control characters except DEL compose the C0 part of ASCII, and the graphic characters plus SP and DEL compose the G0 part.
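
The four-way division above can be sketched in a few lines of Python. This is only an illustration of the numeric ranges just listed; the function name `classify` and the label strings are made up for this example and do not come from any standard.

```python
def classify(code: int) -> str:
    """Name the part of the 7-bit set that a character code falls in."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit character code")
    if code <= 31:
        return "C0 control"   # characters 0 through 31
    if code == 32:
        return "SP"           # Space
    if code <= 126:
        return "graphic"      # the 94 printing characters, 33-126
    return "DEL"              # character 127

# Count how many codes land in each part.
counts = {}
for code in range(128):
    part = classify(code)
    counts[part] = counts.get(part, 0) + 1
print(counts)
```

Counting all 128 codes this way gives 32 controls, one Space, 94 graphics, and one DEL, which is the breakdown given in the list.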

If the ASCII alphabet is written in a table of 16 rows and 8 columns, then the left 2 columns are the C0 set, and the right 6 columns are the G0 set:

      <--C0--> <---------G0---------->
      00  01  02  03  04  05  06  07
     +---+---+---+---+---+---+---+---+
 00  |NUL DLE| SP  0   @   P   `   p |
 01  |SOH DC1| !   1   A   Q   a   q |
 02  |STX DC2| "   2   B   R   b   r |
 03  |ETX DC3| #   3   C   S   c   s |
 04  |EOT DC4| $   4   D   T   d   t |
 05  |ENQ NAK| %   5   E   U   e   u |
 06  |ACK SYN| &   6   F   V   f   v |
 07  |BEL ETB| '   7   G   W   g   w |
 08  |BS  CAN| (   8   H   X   h   x |
 09  |HT  EM | )   9   I   Y   i   y |
 10  |LF  SUB| *   :   J   Z   j   z |
 11  |VT  ESC| +   ;   K   [   k   { |
 12  |FF  FS | ,   <   L   \   l   | |
 13  |CR  GS | -   =   M   ]   m   } |
 14  |SO  RS | .   >   N   ^   n   ~ |
 15  |SI  US | /   ?   O   _   o  DEL|
     +---+---+---+---+---+---+---+---+
      <--C0--> <---------G0---------->

Many vendors are now using the full 8 bits available within the computer byte (and on the transmission line in some cases) for character representation. At first there were ad-hoc character assignments (e.g. IBM PC 8-bit ASCII, Apple Macintosh ASCII, etc), but standards are beginning to emerge. 8-bit character sets are described in ISO 4873 and ANSI X3.41.

An 8-bit character set has two halves. The "left half" or "lower half" corresponds to ASCII (and ISO 646). All the characters in the left half have their high-order, or 8th, bit set to zero, and are therefore representable in 7 bits. The "right half" or "upper half" mirrors the left half in that the first 32 characters are control characters, and the remaining 94 or 96 characters are graphics, but all characters in the right half have their high order bits set to one. The right-half controls are called C1, and the right-half graphics are called G1:

      <--C0--> <---------G0---------->   <--C1--> <---------G1---------->
      00  01  02  03  04  05  06  07     08  09  10  11  12  13  14  15
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
 00  |NUL DLE| SP  0   @   P   `   p | |    DCS|---+                   |
 01  |SOH DC1| !   1   A   Q   a   q | |    PU1|                       |
 02  |STX DC2| "   2   B   R   b   r | |    PU2|                       |
 03  |ETX DC3| #   3   C   S   c   s | |    STS|                       |
 04  |EOT DC4| $   4   D   T   d   t | |IND CCH|                       |
 05  |ENQ NAK| %   5   E   U   e   u | |NEL MW |                       |
 06  |ACK SYN| &   6   F   V   f   v | |SSA SPA|                       |
 07  |BEL ETB| '   7   G   W   g   w | |ESA EPA|                       |
 08  |BS  CAN| (   8   H   X   h   x | |HTS    |       (special        |
 09  |HT  EM | )   9   I   Y   i   y | |HTJ    |       graphics)       |
 10  |LF  SUB| *   :   J   Z   j   z | |VTS    |                       |
 11  |VT  ESC| +   ;   K   [   k   { | |PLD CSI|                       |
 12  |FF  FS | ,   <   L   \   l   | | |PLU ST |                       |
 13  |CR  GS | -   =   M   ]   m   } | |RI  OSC|                       |
 14  |SO  RS | .   >   N   ^   n   ~ | |SS2 PM |                       |
 15  |SI  US | /   ?   O   _   o  DEL| |SS3 APC|                   +---|
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
      <--C0--> <---------G0---------->   <--C1--> <---------G1---------->

G1 character sets can have either 94 or 96 characters. A 94-character G1 set has Space (SP) in its first position and DEL in its last, just like G0 (the notches shown in G1 in the diagram). A 96-character set has graphic characters in all 96 positions.

ISO 4873 allows up to four sets of printable graphic characters, of either 94 or 96 characters each (G0, G1, G2, and G3), "active" at one time, plus the control sets (C0 and C1). Therefore there can be up to 2 x 32 + 4 x 96 = 448 characters simultaneously within the repertoire of a given device.

But in today's world, 7- or 8-bit character i/o is the norm imposed by our terminals, computer architectures, and (most of all) the nature of our asynchronous communication devices and transmission systems. Therefore, a terminal or computing device will receive at most a 7- or 8-bit code capable of denoting only 128 or 256 different characters, out of the possible 448. How can the additional characters be accessed in the 8-bit environment? In the 7-bit environment?

Switching among character sets is accomplished using special control characters or escape sequences embedded within the data stream. These control characters and escape sequences are specified in ISO 2022.

In the following discussion, we use this notation (numbers are in decimal unless otherwise noted):

  <ESC>  Escape (ASCII 27)
  <SP>   Space (ASCII 32)
  <SO>   Shift Out (Ctrl-N, ASCII 14)
  <SI>   Shift In (Ctrl-O, ASCII 15)

ISO 2022 provides two separate mechanisms for handling multiple character sets. The first is a set of escape sequences for assigning a particular alphabet (such as Cyrillic, Hebrew, Arabic, etc) to a particular character set (G0, G1, G2, or G3). The second is a set of functions for shifting among the currently active sets. Here are the alphabet selectors and shift functions:

  Escape Sequence                                            Shift Function
  <ESC>(F - assigns 94-character graphics set "F" to G0.    Invoke by SI
  <ESC>)F - assigns 94-character graphics set "F" to G1.    Invoke by SO
  <ESC>*F - assigns 94-character graphics set "F" to G2.    Invoke by SS2 or LS2
  <ESC>+F - assigns 94-character graphics set "F" to G3.    Invoke by SS3 or LS3
  <ESC>-F - assigns 96-character graphics set "F" to G1.    Invoke by SO
  <ESC>.F - assigns 96-character graphics set "F" to G2.    Invoke by SS2 or LS2
  <ESC>/F - assigns 96-character graphics set "F" to G3.    Invoke by SS3 or LS3

The values for "F" are discussed below. The shift functions are:

  SO  (Ctrl-N) - Shift Out: select G1 (locking shift)
  SI  (Ctrl-O) - Shift In: select G0 (locking shift)
  LS2 (<ESC>n) - Locking Shift 2: select G2 (locking shift)
  LS3 (<ESC>o) - Locking Shift 3: select G3 (locking shift)
  SS2 (<ESC>N) - Single Shift 2: select G2 (single character shift)
  SS3 (<ESC>O) - Single Shift 3: select G3 (single character shift)

"Locking shift" is like shift-lock on a typewriter. It means that all subsequent characters until the next shift character are to be taken from the designated character set. "Single shift" applies only to the character that follows it immediately, but single shift functions are only available for the G2 and G3 character sets. Locking shift functions remain in effect across alphabet changes.

There are many possible ways to use these code extension facilities within both 7-bit and 8-bit environments.
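
To make the interplay of designation, locking shifts, and single shifts concrete, here is a minimal Python sketch that walks a 7-bit stream and reports, for each graphic character, which set (G0 to G3) it is taken from and which alphabet letter is currently assigned to that set. The function name `trace` and the tuple layout are hypothetical, and the sketch assumes that G0 starts out as ASCII ("B") and that the stream is well formed (no truncated escape sequences).

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"

# Intermediate character of a designating escape sequence -> (set, size).
DESIGNATORS = {"(": ("G0", 94), ")": ("G1", 94), "*": ("G2", 94), "+": ("G3", 94),
               "-": ("G1", 96), ".": ("G2", 96), "/": ("G3", 96)}
LOCKING = {"n": "G2", "o": "G3"}   # LS2, LS3 (sent as ESC n, ESC o)
SINGLE = {"N": "G2", "O": "G3"}    # SS2, SS3 (sent as ESC N, ESC O)

def trace(stream: str):
    """Return (character, set, alphabet letter) for each graphic character."""
    designated = {"G0": "B"}   # assumption: G0 is ASCII unless redesignated
    locked = "G0"              # set selected by the last locking shift
    single = None              # set selected for the next character only
    out = []
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == ESC:
            nxt = stream[i + 1]
            if nxt in DESIGNATORS:          # designation: ESC, intermediate, F
                slot, _size = DESIGNATORS[nxt]
                designated[slot] = stream[i + 2]
                i += 3
                continue
            if nxt in LOCKING:
                locked = LOCKING[nxt]
            elif nxt in SINGLE:
                single = SINGLE[nxt]
            else:
                raise ValueError("escape sequence not handled by this sketch")
            i += 2
            continue
        if ch == SO:
            locked = "G1"
        elif ch == SI:
            locked = "G0"
        else:
            slot = single or locked
            single = None                   # a single shift covers one character
            out.append((ch, slot, designated.get(slot)))
        i += 1
    return out

sample = "a" + ESC + "-A" + SO + "d" + ESC + "*F" + ESC + "N" + "xy" + SI + "z"
for item in trace(sample):
    print(item)
```

In the sample, "d" is taken from G1 after the locking shift, "x" comes from G2 because of the single shift, "y" falls back to G1 because the locking shift is still in force after the G2 designation, and "z" returns to G0. The letter "F" stands in for an unspecified alphabet, as in the table above.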
In any particular data transfer, the facilities that are actually used can be announced using <ESC><SP>F, where the possibilities for F are listed in ISO 2022. This "announcer" escape sequence should be sent at the beginning of the data transfer. <ESC><SP>B means the G0 and G1 sets will be used, where <SI> invokes G0 in the left half, <SO> invokes G1 in the left half. <ESC><SP>C means the full 8-bit set shall be used, with no shifting. There are many other possibilities, which need not concern us here.

ISO 8859 defines a series of 8-bit character sets. In each of these, the left half (G0) is the ISO 646 set, i.e. 7-bit ASCII. Because of this, the left half of any ISO 8859 character set may be used to represent English or any other Latin-alphabet language that can make do without diacritical marks (e.g. German without umlauts or ess-zet).

By convention, the G0 set can be selected with <ESC>(B. When we say "by convention" we mean that each of the ISO 8859 standards says to select the G0 set using this sequence, even if the G1 set is selected using some other letter, like A, C, L, etc (see below). Theoretically, <ESC>(A could also be used to select the G0 set of "alphabet A", <ESC>(L could select the G0 set of "alphabet L", etc.

Languages with special characters must use specific ISO 8859 G1 sets. These sets are specified (to date) in ISO 8859-1 through 8859-9:

ISO 8859-1 is Latin Alphabet No. 1. The right half (G1) contains all the special characters needed for Dutch, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish. Select G1 with <ESC>-A.

ISO 8859-2 is Latin Alphabet No. 2. G1 contains special characters for Albanian, Czech, German, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, and Slovene. Select G1 with <ESC>-B.

ISO 8859-3 is Latin Alphabet No. 3, for Afrikaans, Catalan, Esperanto, French, Galician, German, Italian, Maltese, and Turkish. Select G1 with <ESC>-C.

ISO 8859-4 is Latin Alphabet No.
4, for Danish, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish. Select G1 with <ESC>-D.

ISO 8859-5 is the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, Macedonian, Russian, Serbo-Croatian, and Ukrainian. Select G1 with <ESC>-L. (Compatible with USSR GOST Standard 19768-1987 and ECMA-113).

ISO 8859-6 is the Latin/Arabic Alphabet. *** Selection sequence unknown ***.

ISO 8859-7 is the Latin/Greek Alphabet. Select G1 with <ESC>-F.

ISO 8859-8 is the Latin/Hebrew Alphabet. Select G1 with <ESC>-H.

ISO DIS 8859-9 is Latin Alphabet No. 5, in which six Icelandic letters from Latin Alphabet No. 1 were replaced by six letters needed for Turkish. Select G1 with <ESC>-M.

The alphabet selection escape sequences are registered in the International Register of Coded Character Sets under the provisions of ISO 2375, "Data Processing - Procedure for Registration of Escape Sequences". The registration authority is the ECMA, which periodically issues updates. The reference number for this register is ISBN 2-12-953907-0.

There may also be "private alphabets", such as those found on DEC terminals. In the DEC environment only, these may be selected using escape sequences listed in the DEC manuals, e.g. <ESC>)> to select the DEC Technical 94-character set and assign it to G1.

Alphabet Summary Table:

  Esc Seq     Alphabet Name             ISO Number    ECMA Number
  <ESC>(B     ASCII (ANSI X3.4-1986)    ISO 646       ECMA-6
  <ESC>-A     Latin Alphabet No. 1      ISO 8859-1    ECMA-94
  <ESC>-B     Latin Alphabet No. 2      ISO 8859-2    ECMA-94
  <ESC>-C     Latin Alphabet No. 3      ISO 8859-3    ECMA-94
  <ESC>-D     Latin Alphabet No. 4      ISO 8859-4    ECMA-94
  <ESC>-L     Latin/Cyrillic            ISO 8859-5    ECMA-113
  <ESC>-***   Latin/Arabic              ISO 8859-6    ECMA-114
  <ESC>-F     Latin/Greek               ISO 8859-7    ECMA-118
  <ESC>-H     Latin/Hebrew              ISO 8859-8    ECMA-121
  <ESC>-M     Latin Alphabet No. 5      ISO 8859-9    ECMA-128

  *** Unassigned as of June 1986

KERMIT FILE TRANSFER

Different computer systems and software packages have different conventions for representing, storing, and displaying mixed-alphabet textual data.
Such data can be transferred in binary mode by Kermit, but it will only make sense when transferred to a system that uses the same representational conventions. To transfer mixed-alphabet textual data between systems that use different conventions, a new mechanism is required.

Currently, Kermit defines the "common intermediate representation", or "transfer syntax", for textual data (before encoding) to be ASCII characters arranged in lines or records terminated by ASCII Carriage Return and Linefeed (CRLF). Henceforth, this will be known as the Normal Kermit transfer syntax.

The extension proposed here will allow a Kermit program that has specific knowledge of the local file format (or formats) for storing multilingual or multi-alphabet text to translate between these system- and application-specific formats and a new format to be used during file transfer. This will be called ISO-2022 Kermit transfer syntax. Like all extensions to the original Kermit protocol, this will be an optional feature of any Kermit program.

SELECTING ISO-2022 TRANSFER SYNTAX

The proposed extension to the Kermit protocol follows a subset of ISO 2022, in which a single ISO-8859 alphabet, comprised of a C0, G0, C1, and G1 set, may be active at one time, and in which:

  - the C1 and G1 sets are transmitted using ISO 2022's 7-bit code extension techniques,
  - escape sequences can be used to switch among different alphabets,
  - the C0, G0, and C1 sets are assumed to be identical for all alphabets.

Kermit's default transfer syntax is Normal. Kermit's ISO-2022 transfer syntax must therefore be enabled in some way, either automatically or explicitly by the user. In the automatic case, the Kermit program recognizes (somehow) that it is to transfer a multi-alphabet text file. In the manual case, the user issues a SET command.

The sending Kermit may inform the receiving Kermit of the selected transfer syntax by means of the Kermit File Attribute (A) packet, whose use is negotiated in the Kermit Initialization exchange. There is an attribute "*" (ASCII 42) which represents "encoding", with values like "A" for Normal Kermit ASCII encoding, "E" for EBCDIC (so far, never used). The proposed new value is "I8", for "ISO 8-bit character sets". The receiver can agree to accept the file or refuse it using Kermit's attribute reply mechanism. If the receiver does not do attribute packets, then the sender may still elect to send the file (with a warning to the user), as either a binary file or an 8-bit text file, for storing (and perhaps forwarding) purposes only.

It should also be possible for the user to select ISO-2022 transfer syntax using an explicit SET command. This command would have to be given to both Kermits in order for the ISO transfer syntax to have its desired effect. The suggested command is:

  SET TRANSFER-SYNTAX ISO8

This denotes the use of ISO 8-bit alphabets. (By the way, if the user gives this command to the sender, but not to the receiver, then the received file will be stored in ISO 2022 format, with the escape codes mixed with the file characters on disk; if Attribute packets are not being used, then the receiver will get no warning).

The advantage of using Attribute packets is that the sending Kermit can automatically inform the receiving Kermit of the file transfer syntax, so that the user does not have to type a SET command to both Kermits. On a computer system where the Kermit program can recognize the attributes and encoding of a file automatically, this mechanism will allow files of different types (text, binary, multi-alphabet) to be sent together as a group, even between unlike systems. The drawback is that the attribute mechanism must be programmed into a Kermit program that doesn't already have it.
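
As an illustration of how the proposed "I8" value might appear inside an Attribute packet's data field, here is a small Python sketch. It assumes the customary Kermit attribute layout (attribute character, then a length character formed by adding 32 to the data length, then the data itself); the helper names `tochar`, `unchar`, `encoding_attribute`, and `parse_attributes` are chosen for this example only.

```python
def tochar(n: int) -> str:
    """Make a small number printable by adding 32, as Kermit does."""
    return chr(n + 32)

def unchar(c: str) -> int:
    """Inverse of tochar."""
    return ord(c) - 32

def encoding_attribute(value: str) -> str:
    """Build the '*' (encoding) attribute: '*', length character, value."""
    return "*" + tochar(len(value)) + value

def parse_attributes(field: str) -> dict:
    """Split an attribute data field into {attribute character: value}."""
    attrs = {}
    i = 0
    while i < len(field):
        kind = field[i]
        length = unchar(field[i + 1])
        attrs[kind] = field[i + 2 : i + 2 + length]
        i += 2 + length
    return attrs

print(encoding_attribute("A"))    # Normal Kermit ASCII encoding
print(encoding_attribute("I8"))   # proposed ISO 8-bit character sets value
print(parse_attributes(encoding_attribute("I8")))
```

A receiver that finds "I8" under "*" and does not support it could then decline the file through the attribute reply mechanism described above.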

There should be a way for the user to disable the use of ISO-2022 transfer syntax. The recommended command is SET TRANSFER-SYNTAX NORMAL.

DESCRIPTION OF ISO-2022 TRANSFER SYNTAX

Transfer of a multi-character-set text file in ISO-2022 transfer syntax is the same as transfer of a 7-bit ASCII text file, except that it may contain embedded escape sequences to switch between character sets. The file sender translates the file's characters (if necessary) into one or more selected ISO 8859 alphabets, and terminates lines of text (records) with CRLF. The file receiver translates from ISO-2022 transfer syntax into the format demanded by the local system or application.

The current alphabet is designated by an escape sequence, and locking shift functions switch between its G0 and G1 sets. The mechanism described in ISO 646 for building composite graphic characters by overprinting using Backspace or Carriage Return should not be used; this practice is prohibited by ISO 8859.

ISO-2022 transfer syntax uses only 7-bit data. If any character arrives with its high-order (8th) bit set to one (after stripping of parity and Kermit decoding), there has been an error.

ISO 2022 states that "at the beginning of information interchange, except where the interchanging parties have agreed otherwise, all designations shall be defined by use of the appropriate escape sequences, and the shift status shall be defined by the use of the appropriate locking-shift functions." Kermit programs should "agree otherwise" that the default character set is the US ASCII / ISO-646 / ECMA-6 7-bit set; thus ISO-2022 transfer syntax can be identical to Normal Kermit transfer syntax when transferring 7-bit text files. There is no default G1 set, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII alphabet, an escape sequence must be used to identify which other alphabet these characters belong to.
This sequence is <ESC>-F, where F is the officially registered letter for that alphabet, e.g. A-D for Latin Alphabets 1-4, L for Cyrillic, etc. This sequence assigns the designated alphabet to the active G1 set. The G1 set is transmitted in its 7-bit form to eliminate Kermit's 8th-bit prefix overhead on 7-bit connections. Once a G1 set is selected, it remains in effect until another G1 set is selected.

Switching between the G0 (ASCII) set and the G1 (extended) set is done using the ISO-2022 "locking shifts":

  <SO> (Ctrl-N) - select G1 (the extended set)
  <SI> (Ctrl-O) - select G0 (the ASCII set)

If a particular set is already invoked, use of the corresponding shift has no effect.

During file transfer, an <ESC>-F or <ESC>)F sequence must be given before the first occurrence of an extended character from a 96-character or 94-character set, respectively. If no such sequence is given, then all characters are treated as ASCII data, including <ESC>, <SO>, and <SI>. In other words, the file transfer behaves in the normal Kermit fashion for text files.

The C0 and C1 sets, i.e. the two sets of control characters, are not subject to shifting. Control characters from the C1 set must be transmitted using 2-character escape sequences, as described in ISO 2022: <ESC>@, <ESC>A, <ESC>B, etc, stand for 10000000, 10000001, 10000010, etc (binary). This method results in less Kermit encoding overhead on 7-bit connections than would sending these characters "bare" (which is not allowed).

All the escaping and shifting operations specified here take place before normal Kermit packet encoding, and are subject to Kermit's control-character and repeat-count prefixing. For example, <ESC>-A<SO>x<SI>y becomes #[-A#Nx#Oy according to Kermit's normal rules for control character prefixing (each control character is sent as the # prefix followed by the character with 64 added).

ISO-2022 transfer syntax may be used in conjunction with even, odd, mark, or space parity, or with no parity at all. 8-bit data is never transferred in this mode, so 8th-bit prefixing will never occur.
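
The sender's side of these rules can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the local file is already a sequence of 8-bit bytes in a single ISO 8859 alphabet whose G1 set has 96 characters, that the file contains no bare ESC, SO, or SI, and that the Kermit control prefix is the default "#". The function names `to_transfer_syntax` and `kermit_ctl_prefix` are hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

def to_transfer_syntax(data: bytes, final: str = "A") -> bytes:
    """Turn 8-bit ISO 8859 bytes into the 7-bit form described above."""
    out = bytearray()
    designated = False   # has ESC - F been sent yet?
    in_g1 = False        # current locking-shift state
    for b in data:
        if b >= 0xA0:                      # G1 graphic character
            if not designated:
                out += bytes([ESC]) + b"-" + final.encode("ascii")
                designated = True
            if not in_g1:
                out.append(SO)
                in_g1 = True
            out.append(b & 0x7F)           # send in 7-bit form
        elif b >= 0x80:                    # C1 control: ESC + (code - 64)
            out += bytes([ESC, b - 0x40])
        else:                              # C0 control or G0 graphic
            if in_g1 and b >= 0x20:        # controls are not subject to shifting
                out.append(SI)
                in_g1 = False
            out.append(b)
    if in_g1:
        out.append(SI)                     # leave the stream shifted back to G0
    return bytes(out)

def kermit_ctl_prefix(data: bytes, prefix: bytes = b"#") -> bytes:
    """Apply Kermit control-character prefixing to 7-bit data."""
    out = bytearray()
    for b in data:
        if b < 0x20 or b == 0x7F:
            out += prefix + bytes([b ^ 0x40])   # e.g. ESC (27) -> '['
        elif bytes([b]) == prefix:
            out += prefix + prefix              # a literal prefix is doubled
        else:
            out.append(b)
    return bytes(out)

seven_bit = to_transfer_syntax(b"gef\xe4hrlich", final="A")
print(seven_bit)
print(kermit_ctl_prefix(seven_bit))
```

The sketch shifts back to G0 before every G0 graphic character, including Space, since position 2/0 belongs to the G1 set while a 96-character set is shifted in. Repeat-count prefixing and packetizing are left out.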

ADDITIONAL ESCAPES

The preceding mode of operation is the one described in ISO-2022 under "Announcer 4/2" for the 7-bit environment, which is selected by the escape sequence <ESC><SP>B. This means that the G0 and G1 sets are used, both in their 7-bit forms, with <SO> and <SI> used to shift between them. "Announcer 4/10" <ESC><SP>J specifies that a 7-bit code is used, even in an 8-bit environment. The use of 2-character escape sequences for C1 characters can be announced using <ESC><SP>F (the "F" in this case is really an F). For clarity, these escape sequences may be sent at the beginning of the file transfer, but they are not required.

Similarly, the ISO-2022 Coding Method Delimiter, <ESC>d, may be transmitted at the end of the file, or at any point within the file after which this coding method is no longer used.

Since ISO 8859 character sets are subject to revision from time to time, an alphabet selector may be preceded by <ESC>&F, where F is the revision number (@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet Number One, Revision One.

TRANSFER SYNTAX SUMMARY

All characters are 7-bit, and all sequences are optional, except that if an extended alphabet is selected, <SO> and <SI> are required to shift between its G0 and G1 sets.

Preamble: <ESC><SP>J<ESC><SP>B<ESC><SP>F (before first file characters):

  <ESC><SP>J - Using 7-bit code.
  <ESC><SP>B - Map both G0 and G1 into the left half.
  <ESC><SP>F - Using 2-character escape sequences for C1 set.

Alphabet selector: <ESC>(B<ESC>&@<ESC>-F (before first use of extended characters):

  <ESC>(B - Designate the normal (ASCII, ISO 646) G0 character set.
  <ESC>&@ - Specify the alphabet revision number, if any (@=1, A=2, etc)
  <ESC>-F - Designate the alphabet for G1 (substitute the appropriate F)

Alphabet shifts:

  <SO> - Select G1 set (extended characters)
  <SI> - Select G0 set (ASCII, ISO 646) (default)

Postamble:

  <ESC>d - Coding method delimiter (terminator), at end of file.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the local storage medium.
It may be ASCII, EBCDIC, a proprietary word processor format, IBM code page, or anything else. It is an implementation "detail" for the Kermit programmer to convert between the local file representation for multi-alphabet text files, and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the necessary identifying information, in which case the sending Kermit program can automatically emit the appropriate escape sequences during file transfer. In others, the user will have to tell the sending program how the file is encoded. If file attribute packets are not used, the user will also have to tell the receiving Kermit that the transfer syntax is ISO-2022, and in what format to store the file upon receipt.

The suggested command is SET FILE TYPE followed by a keyword that specifies how the file is (or when receiving, is to be) encoded on disk. This will necessarily be highly dependent on the system's conventions, or the conventions of the applications to be used with the file (e.g. a multi-language word processing program). Possibilities for the keyword might include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, or system-specific names like IBM-CODE-PAGE-437 (the IBM PC US character set), IBM-CODE-PAGE-850 (multilingual), IBM-CODE-PAGE-865 (Norway), etc.

It may be that a file is encoded entirely in a single ISO-8859 alphabet, e.g. Latin Alphabet No. 1, or Latin/Cyrillic, but the file itself contains no information to that effect. Therefore, it should be possible for the user to specify the alphabet in the SET FILE TYPE command, where the possibilities are:

  LATIN1-ISO8     ARABIC-ISO8
  LATIN2-ISO8     CYRILLIC-ISO8
  LATIN3-ISO8     GREEK-ISO8
  LATIN4-ISO8     HEBREW-ISO8
  LATIN5-ISO8

The part before the dash is the name of the alphabet, and the "-ISO8" says that the alphabet belongs to the ISO family of 8-bit character sets. This allows for the possibility of other encoding methods for the same languages, e.g.
GREEK-DEC, where the Greek letters are taken from the DEC technical character set.

If the local file is not encoded according to ISO 2022 rules, it may contain <ESC>, <SO>, and <SI> characters. It is up to the Kermit program to know what these characters mean in the context of the file's format, and to either strip them from the file or translate them to something else. The ISO 2022 rules forbid the use of these characters as data to be transferred.

SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing programs. These programs have many functions besides switching among alphabets. They may also endow text with special attributes such as boldface, italic, underline, super- or subscript, color, etc, and render characters in a variety of type styles and sizes. Each text processing program may have its own unique formats and conventions.

These special effects are not addressed by this proposal. Nevertheless, it is likely that a multi-alphabet file produced by a text processing program also contains special effects. In order for a Kermit program to send a multi-alphabet file, it must have detailed knowledge of the file's format and coding conventions. Therefore, the Kermit program should be able to strip out the special effects, and send only the text. Otherwise the result would be meaningless when received on an unlike system or for use with a different application. (When transferring such files between like systems or compatible applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document description languages to Kermit, so that Kermit will be able to transfer formatted documents between unlike systems and applications. Presently, there are many competing would-be standards including IBM DCA and DIA, DEC DDIF, US Navy DIF, ISO ODA and ODIF, and PostScript. Kermit should wait for the dust to settle and then pick a relatively simple, stable alternative. (Comments welcome!)

ARCHIVING

The Kermit protocol includes a so-far little-used archiving function. In this mode, Kermit stores incoming file data together with the attribute packets that precede it, so that the file can be retrieved and reconstituted on another system at a later time. In archive mode, the alphabet escapes and shifts should not be interpreted by the receiving Kermit, but simply stored as data.

MULTIBYTE ALPHABETS

This proposal does not address alphabets such as Japanese, Chinese, and Korean that do not fit into 8-bit character sets. A new standard, ISO 10646, is in preparation. This standard will define a universal 3-byte character code to cover all the world's written languages, providing for 1- and 2-byte shortcuts within a given language environment. All designation, invocation and shifting as in ISO 2022 will be avoided. When and if this standard becomes relatively stable, it too can be added as a Kermit file transfer syntax option, perhaps ISO24.

In the meantime, national versions of Kermit can (and do) use SET FILE TYPE commands to identify the encoding or standard used for a multibyte alphabet. For example, some Japanese Kermit programs have the command SET FILE TYPE TEXT, BINARY, or KANJI, and add a further command to specify the local Kanji encoding: SET KANJI VAX, JIS, or SHIFTJIS (JIS is the Japan Industrial Standard, JIS X 0208; SHIFTJIS is JIS X 0202 which differs from JIS X 0208 by the introduction of escape sequences to shift between Kanji and ASCII; VAX is the encoding used on Japanese VAX/VMS systems). These Kermit programs use SHIFTJIS as the transfer syntax, and the Kermit program maps between it and the local format, which may be VAX, JIS, or SHIFTJIS.

To better mesh with the current proposal, however, these programs should make a distinction between the file format and the transfer syntax by adding a command like SET TRANSFER-SYNTAX SHIFTJIS.
In this connection, a "rider" to this proposal is that "JS" (for SHIFTJIS) be added to the list of Kermit encodings under Attribute "*". Designations for Chinese, Korean, and other multibyte-character-set languages are welcome, as are alternative designations for Japanese.

TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a feature of many Kermit programs. It is hoped that these terminal emulators will evolve along the lines of the ISO standards mentioned above. In some cases, this is already a fact, insofar as DEC VT200 and 300 series terminals already follow these standards.

In this regard, it is important to note that not all languages are written from left to right, top to bottom. Hebrew and Arabic are two examples of right-to-left languages, and Japanese and Chinese may be written top to bottom. The order of the text characters on disk or on the transmission line does not necessarily reflect their order on the screen or the printed page.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner for text files, without any escapes or shifts, even in ISO8 mode.

A text file containing characters from a language or languages covered by a single ISO 8859 alphabet will require an <ESC>-F sequence to identify the alphabet. <SO> and <SI> are used to shift between the G0 and G1 sets. The following lines all produce the same result:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".
  <ESC>-AA dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
  <ESC>&@<ESC>-A<ESC>(B<SI>A dangerous German word is "gef<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich" (where ae is a way of writing umlaut-a without an umlaut).
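
The receiving side can be sketched in Python to check that the example lines above really do decode to the same text. This is a rough illustration under several assumptions: the name `decode` and the `CODECS` table are invented here, Python's present-day ISO 8859 codecs stand in for the alphabets (they may differ in detail from the 1987 editions), G0 is always taken to be ASCII, and the rule that ESC, SO, and SI are plain data when no alphabet has been designated is not implemented.

```python
ESC, SO, SI = "\x1b", "\x0e", "\x0f"

# Registered letter -> Python codec standing in for the ISO 8859 alphabet.
CODECS = {"A": "iso8859-1", "B": "iso8859-2", "C": "iso8859-3", "D": "iso8859-4",
          "L": "iso8859-5", "F": "iso8859-7", "H": "iso8859-8", "M": "iso8859-9"}

def decode(stream: str) -> str:
    """Translate 7-bit ISO-2022 transfer syntax into ordinary text."""
    g1 = None        # codec currently designated as G1
    in_g1 = False    # locking-shift state; survives alphabet changes
    out = []
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == ESC:
            kind = stream[i + 1]
            if kind == "-":                 # designate a 96-character G1 set
                g1 = CODECS[stream[i + 2]]
                i += 3
            elif kind in "(& ":             # G0 designation, revision, announcer
                i += 3                      # accepted and otherwise ignored
            elif kind == "d":               # coding method delimiter
                i += 2
            elif "@" <= kind <= "_":        # 2-character form of a C1 control
                out.append(chr(ord(kind) + 0x40))
                i += 2
            else:
                raise ValueError("escape sequence not handled by this sketch")
            continue
        if ch == SO:
            in_g1 = True
        elif ch == SI:
            in_g1 = False
        elif in_g1 and ch >= " ":
            if g1 is None:
                raise ValueError("shift to G1 before any designation")
            out.append(bytes([ord(ch) | 0x80]).decode(g1))
        else:
            out.append(ch)
        i += 1
    return "".join(out)

lines = [
    'A dangerous German word is "gef' + ESC + "-A" + SO + "d" + SI + 'hrlich".',
    ESC + '-AA dangerous German word is "gef' + SO + "d" + SI + 'hrlich".',
    ESC + "-A" + SI + 'A dangerous German word is "gef' + SO + "d" + SI + 'hrlich".',
    ESC + "&@" + ESC + "-A" + ESC + "(B" + SI
        + 'A dangerous German word is "gef' + SO + "d" + SI + 'hrlich".',
]
for line in lines:
    print(decode(line))
```

Each line prints with the umlaut restored, because "d" with its 8th bit set is the code for umlaut-a in Latin Alphabet No. 1.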
For clarity and consistency with the ISO-2022 recommendations, the latter form is preferred: the text begins with an announcement of the G0 and G1 sets in use, including the version number, and then explicitly shifts into the G0 set, rather than defaulting to it. Similarly, use of the preamble at the beginning of the file and the postamble at the end is also recommended.

A text file containing characters from multiple ISO 8859 alphabets requires an <ESC>-F sequence to identify each alphabet. SO and SI can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B can be used to select G0 of any of the alphabets, since these are all the same. For example, the following text contains the same word in English, French, and Russian:

  <ESC>-ADisappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the subsequent <SO> and <SI> shifts apply to its G0 and G1 sets, which are used to form the English and French words. The second escape sequence assigns the Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this new set.

A final example, in which the same word is repeated in English, Russian, and German, shows how a locking shift remains in effect when the alphabet is changed. We begin in Latin/Cyrillic, start with an English word from G0, shift to G1 for the Russian word, and while still in G1 switch to Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.

PERFORMANCE

For each file, the preamble and postamble add from 0 to 11 characters. There are an additional 3 characters per alphabet change, for instance when switching between Finnish and Russian, an additional shift character for every shift between G0 and G1, and finally the 2-character escape sequences used in place of the C1 control characters. For files of any length at all, the preamble/postamble overhead is negligible.
It is recommended that the "ambles" be included for compatibility with other ISO-2022-conformant applications.

The restriction of data to 7 bits during transmission should not incur a high transmission penalty, since the locking shift mechanism will tend to add fewer characters to the transmission stream than would 8th-bit prefixing of characters from the G1 set (although in the worst case -- a file composed of characters alternating between the G0 and G1 sets -- the overhead of shifting would actually be higher). The use of two-character escape sequences for the C1 control set should also have small impact; the overhead will be the same as for 8th-bit prefixing, but these characters should appear rarely in text files.

Hence, the transmission overhead of ISO-2022 transfer syntax should not be significantly different from that of normal Kermit, and in some cases (e.g. for texts completely in Russian, Greek, Hebrew, or Arabic) the overhead is far lower.

WHERE TO GET STANDARDS

The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested. We tried this ourselves, and got delivery within about two weeks.

ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY 10017

ANSI standards may be ordered, for a fee, from:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY 10018

SUMMARY

We hope that this attempt to blend Kermit text file transfer with the ISO international character set standards is in keeping with the intended use of those standards. Anyone who has can offer insights as to whether we are using the standards appropriately is encouraged to comment.
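The overhead comparison made in the PERFORMANCE section can be tried out with a small sketch. The function names are illustrative only. As a simplifying assumption, the sketch counts characters before Kermit's own packet encoding (so the extra prefix Kermit adds when quoting control characters such as SO and SI is not counted), and it leaves out the designating escape sequence and the "ambles".

```python
SO, SI = 0x0E, 0x0F   # locking shifts: SO invokes G1, SI invokes G0

def locking_shift_encode(text: str, codec: str = "iso8859_1") -> bytes:
    """7-bit form of 8-bit text: G1 characters lose their high bit and are
    bracketed by SO ... SI (designation sequences are omitted here)."""
    out = bytearray()
    in_g1 = False
    for b in text.encode(codec):
        want_g1 = b >= 0xA0
        if want_g1 != in_g1:
            out.append(SO if want_g1 else SI)
            in_g1 = want_g1
        out.append(b & 0x7F)
    if in_g1:
        out.append(SI)          # leave the stream shifted back to G0
    return bytes(out)

def shift_overhead(text: str, codec: str = "iso8859_1") -> int:
    """Characters added by locking shifts."""
    return len(locking_shift_encode(text, codec)) - len(text.encode(codec))

def prefix_overhead(text: str, codec: str = "iso8859_1") -> int:
    """Characters added by 8th-bit prefixing: one prefix per 8-bit character."""
    return sum(1 for b in text.encode(codec) if b >= 0x80)

german = "gef\u00e4hrlich"                 # one umlaut in ASCII text
russian = "разочарованный"                 # entirely Cyrillic
worst = "a\u00e4a\u00e4a\u00e4"            # alternating G0 / G1 characters
```

Under these assumptions the German word costs 2 extra characters with shifts against 1 with prefixing, the all-Cyrillic word costs 2 against 14, and the alternating string costs 6 against 3, which matches the qualitative claims in the text.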
2-Mar-89 17:20:40-GMT,2351;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA10260; Thu, 2 Mar 89 12:20:30 EST Received: by cunixc.cc.columbia.edu (5.54/5.10) id AA01174; Thu, 2 Mar 89 12:16:20 EST Date: Thu, 2 Mar 1989 12:16:19 EST From: Frank da Cruz To: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , Gisbert W.Selke , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Kai U.Leppamaki , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff Subject: Kermit/ISO proposal Cc: Christine M Gianone Message-Id: It occurs to me that since the proposal was sent from a brand-new computer, some of you might not be able to reply to the message. You can also mail to us as cmg@cunixc.cc.columbia.edu and fdc@cunixc.cc.columbia.edu, or simply (but less efficiently) as cmg@columbia.edu and fdc@columbia.edu. And on BITNET/EARN you can send direct to KERMIT@CUVMA or FDCCU@CUVMA. If you don't know what I'm talking about (i.e. if you didn't receive the proposal) please let me know and I'll get it to you somehow. Meanwhile, I'd also appreciate any comments on how it meshes with X.400 and FTAM and other ISO application protocols in their current incarnations. I have some several-year-old drafts of these standards, and as far as I can tell, the only character set they talk about is ISO 646. Thanks! 
- Frank 3-Mar-89 19:59:09-GMT,5320;000000000011 Return-Path: <@cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA01921; Fri, 3 Mar 89 14:59:04 EST Message-Id: <8903031959.AA01921@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA27239; Fri, 3 Mar 89 14:58:52 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 3521; Fri, 03 Mar 89 14:55:23 EST Received: from VM1.EARN-ULG.AC.BE by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1035; Fri, 03 Mar 89 14:55:21 EST Received: by BLIULG11 (Mailer R2.02) id 8393; Fri, 03 Mar 89 18:53:57 +0100 Date: Fri, 03 Mar 89 16:59:40 +0100 From: Andre' PIRARD Subject: Re: MacKermit and national characters To: Paul Placeway Cc: Frank da Cruz In-Reply-To: Your message of Mon, 27 Feb 89 22:59:57 EST Paul, Well, for one thing we're not on strike, but there was a lot of talking. About the "fonts", Frank is right. The 80-9F range (128-159, really) is "forbidden" in ISO. I suspect a reason is they would map to control characters when sent on 7-bit lines and could upset some nodes. About switching G sets, I include below a message that once appeared on the ISO8859 list. I still wonder how multiple fonts are used by MacKermit. Are they in the code or does MacKermit refer to external font files? Please pardon my ignorance about MacIntosh internals. As to the keyboard, one thing I should explain is that our keyboards are the other way round. We key in some accented symbols directly, form others with dead keys overstrikes, but are missing the @ and some others of ASCII. But these missing symbols are still available thru the Alt key (the one next to the Apple one, with some kind of sleigh on it) combined with many keys yielding apparently the whole sets of Apple characters. 
I expect the same should hold for US keyboards with slightly different assignments. In that case, we shouldn't touch the user interface by changing a well defined Apple convention, and the best is to translate them to ISO to keep keyboard independence. We can leave the Apple-missing characters to the user's taste by using the keyboard macros, which would have to yield the ISO codes I guess. Any key code readily comes to Kermit in a keyboard independent way since Mathias's version, so we get those missing ASCII characters back to life. But even in "no parity" mode I can't get the special characters to be echoed to the screen. They get the bell sound. I think the problem is being 8-bit.

In all but "no-parity" mode, the final operation would be converting A0-FF to 20-7F and embedding it between SO/SI for sending to the line. One thing to be careful about is how the keyboard is echoed to the screen (although it must not be a very frequent mode of operation, it is still useful for checking).

Tell me if I can be of some help. As soon as you get SO/SI working, I can undertake live tests.

The network has been bad this last week. My latest notes from you are of the 27th + Frank's reply.

André.

Date: Fri, 27 May 88 18:35:47 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: John Kesich
Subject: Re: Extended ASCII with Kermit
To: Andre' Pirard
In-Reply-To: Message of Fri, 27 May 88 14:44:05 +0200 from

From my reading of ISO's 646, 2022, 4873, 8859-1 & 8859-2 I have come to the conclusion that there is a fairly widespread misunderstanding of ISO8859. If I'm the one who has misunderstood I hope someone will take the trouble to correct me.

People seem to think that you pick one of the ISO8859-x sets and then those 256 characters are the only ones used. However, ISO's 2022 & 4873 define a number of escape sequences for switching among different versions (as they term character sets which conform to the standards).
What this means is that simple translation table mappings are not enough to translate ISO to other code sets; one must also change translation tables 'on the fly' as the escape sequences are encountered. A somewhat simplified example may help to illustrate the problem:

data stream (ISO notation)   hex        comments
--------------------------   --------   --------
ESC 02/00 04/12              1B 20 4C   select level 1 of ISO4873
ESC 02/13 04/01              1B 2D 41   designate (and invoke) ISO8859-1's G1 set
12/00                        C0         1st 'real' character - capital A, grave accent
ESC 02/13 04/02              1B 2D 42   designate (and invoke) ISO8859-2's G1 set
12/00                        C0         2nd 'real' character - capital R, acute accent

Does an implementation which uses a single set of ISO8859-x characters conform to the standard? Even if it does, would it make any sense to standardize on a particular ISO8859-x to the exclusion of others? Finally, if one were to do so, how would the 2 character text in my example be transmitted?

Any implementation which doesn't include the ISO escape sequences will eventually have to incorporate some such mechanism. I think the ISO escape sequences should be a part of any standard which is adopted.
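The point about changing translation tables "on the fly" can be illustrated with a short decoder for the example stream above. This is a minimal sketch, not a conforming ISO 2022 implementation: it recognizes only the two kinds of escape sequence that appear in the example, and the table of G1 sets and the function name are assumptions of the sketch.

```python
ESC = 0x1B

# Final byte of an "ESC 02/13 F" designation -> Python codec for that G1 set.
# Only the two sets used in the example are listed (an assumption of this sketch).
G1_SETS = {0x41: "iso8859_1", 0x42: "iso8859_2"}

def decode_stream(data: bytes) -> str:
    """Decode an 8-bit stream, switching the G1 table at each designation."""
    out = []
    g1 = None                       # codec currently designated to G1
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            if i + 2 >= len(data):
                raise ValueError("truncated escape sequence")
            intermediate, final = data[i + 1], data[i + 2]
            if intermediate == 0x20:
                pass                # announcer (e.g. ISO 4873 level): nothing to decode
            elif intermediate == 0x2D and final in G1_SETS:
                g1 = G1_SETS[final] # later right-half bytes use this table
            else:
                raise ValueError("escape sequence not handled by this sketch")
            i += 3
        elif b >= 0xA0:
            if g1 is None:
                raise ValueError("G1 character before any designation")
            out.append(bytes([b]).decode(g1))
            i += 1
        else:
            out.append(chr(b))      # left half: plain ASCII in this sketch
            i += 1
    return "".join(out)

stream = bytes.fromhex("1B 20 4C 1B 2D 41 C0 1B 2D 42 C0")
decoded = decode_stream(stream)
```

The same byte C0 decodes to two different characters (U+00C0, then U+0154), which a single fixed 256-entry table cannot reproduce.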
5-Mar-89 6:38:30-GMT,6160;000000000411 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA18531; Sun, 5 Mar 89 01:38:28 EST Resent-Message-Id: <8903050638.AA18531@watsun.cc.columbia.edu> Message-Id: <8903050638.AA18531@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA08526; Sun, 5 Mar 89 01:38:20 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4127; Sun, 05 Mar 89 01:34:46 EST Received: by CUVMB (Mailer X1.25) id 3524; Sun, 05 Mar 89 01:34:45 EST Received: from JPNKEKVM by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3523; Sun, 05 Mar 89 01:34:44 EST Received: by JPNKEKVM (Mailer R2.02) id 2408; Sun, 05 Mar 89 15:37:52 JST Date: SUN, 05 MAR 89 15:35:06 JST From: Hirofumi Fujii Subject: RE:Kermit/ISO proposal To: Frank da Cruz , Joe Doupnik , Ken-Ichirou Murakami , Hirohide Mikami , Hirohide Mikami , Masamichi Ute Resent-Date: Sun, 05 Mar 89 01:34:44 EST Resent-From: Network Mailer Resent-To: fdc@cunixc.cc.columbia.edu Dear Frank, Thank you very much for informing us about the proposal of Kermit extension. 1. My understanding of your proposal is like following figures; Is it correct ? If it is correct, I agree with you. 
+------------------------[ Sender machine ]--------------------------+
|                                                                    |
|  Read file in internal (local) representation                      |
|                  |                                                 |
|                  v                                                 |
|  Internal-to-ISO converter (machine dependent)                     |
|  ( ESC-sequence + GL/GR character sets according to the ISO-2022 ) |
|                  |                                                 |
|                  v                                                 |
|  Traditional Kermit SEND routine                                   |
+--------------------------------------------------------------------+
                   ||
                   ||  ( communication line )
                   ||
                   \/
+--------------------------------------------------------------------+
|  Traditional Kermit RECEIVE routine                                |
|                  |                                                 |
|                  v                                                 |
|  ISO-to-internal converter (machine dependent)                     |
|  ( Interpret the ISO-2022 ESC-sequence )                           |
|                  |                                                 |
|                  v                                                 |
|  Write file in internal (local) representation                     |
|                                                                    |
+-----------------------[ Receiver machine ]-------------------------+

2. About the Japanese character sets

Your description of MULTIBYTE ALPHABETS is not correct.

(1) SHIFTJIS is NOT the Japanese standard (the name is quite misleading). It is the internal code of the Japanese MS-DOS, like EBCDIC.

(2) JIS X 0202 and X 0208 are different kinds of standards. The title of JIS X 0202 is "Code Extension Techniques for Use with the Code for Information Interchange", and of JIS X 0208 is "Code of the Japanese Graphic Character Set for Information Interchange". JIS X 0202 corresponds to ISO-2022. JIS X 0208 is the table of the code and its graphical representation (like the ASCII table). This is the so-called JIS-code table.

(3) It is possible to send a Kanji file within the ISO-2022 scheme (therefore, it is not necessary to prepare some attribute like 'JS' for Japanese character sets). JIS X 0202 (I am not sure whether the following are in ISO-2022 or not) defines

    <ESC>$F  and <ESC>$,F  designate multi-byte character set "F" to G0
    <ESC>$)F and <ESC>$-F  designate multi-byte character set "F" to G1
    <ESC>$*F and <ESC>$.F  designate multi-byte character set "F" to G2
    <ESC>$+F and <ESC>$/F  designate multi-byte character set "F" to G3

and invocation of these character sets to GL or GR is the same as in ISO-2022 (including single- and locking-shifts).
JIS X 0208 is the 2-byte character set for Japanese (Symbol:147 + Number:10 + Roman:52 + Hiragana:83 + Katakana:86 + Greek:48 + Russian:66 + Kanji:6353 + Rule:32 characters!). The above "F" for the JIS X 0208 character set is assigned to "B(4/2)" (I am not sure whether it is ISO-registered or not, but it is described in JIS X 0208). For example, <ESC>$B designates the JIS X 0208 character set to G0. Therefore, we can send a Kanji file using this scheme; for example, to send a Kanji file from an MS-DOS machine (SHIFTJIS) to a VAX (DEC-KANJI):

    read file in SHIFTJIS
    convert SHIFTJIS to JIS X 0202 form
    send packet
         |
         v
    receive packet
    convert JIS X 0202 form to DEC-KANJI
    write file

And I think this is compatible with your proposal if my understanding is correct.

(4) I don't know about the new standard, ISO 10646. However, many Japanese people have already used the above method (by hand). So, I think we would be very happy if the above scheme were included in Kermit.

05-Mar-1989
Hirofumi Fujii
National Lab. for High-energy Physics (KEK)
JAPAN
KEIBUN@JPNKEKVM.BITNET
KEKVAX::KEIBUN (HEPNET)

6-Mar-89 4:17:33-GMT,7127;000000000001
Return-Path:
Received: from cc.usu.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA27782; Sun, 5 Mar 89 23:17:15 EST
Message-Id: <8903060417.AA27782@watsun.cc.columbia.edu>
Date: Sun, 5 Mar 89 21:16 MDT
From: Joe Doupnik
Subject: Reply.ISO-2022 (pretty neat)
To: fdc@watsun.cc.columbia.edu
X-Vms-To: IN%"fdc@watsun.cc.columbia.edu",JRD

Frank and the group,

The description of ISO 2022 appears to be clear enough to use, and I appreciate being part of the initial discussion group. I would like to add some small comments to the discussion, however.

It appears that the ISO suggestion is directed at two objectives, terminal emulation and file transfer. Set aside terminal emulation for a moment since it is well adapted to this method, and let's examine the file transfer case.
Background
----------

At the bottom of the protocol stack Kermit implements some encoding techniques such as control quoting, eight bit prefixing, run length encoding, and so forth. These are intended to be transparent operations to alleviate many communications channel difficulties. They are understood only by matching Kermits and operate on an unstructured stream of characters; they are Kermit to Kermit protocol items.

Above that we have the highly useful but nevertheless sticky area of using CR/LF in packets to indicate file system record delimiters. Who knows what a record is? The best we can do is cope with the two common display control commands, CR and LF, and use CR/LF in the character stream where the local operating system would do an equivalent if displaying the stream on a very simple terminal device. We even treat Horizontal Tabs as literal text.

The CR/LF item means two things to me. First, we are transferring flat files, sequentially. Second, we are attempting to map an important piece of file system architecture from one side to the other via an in-line message (CR/LF). Thus, CR/LF is not literal file data unless we force it to be so by blinding the receiver. This is a file system to file system message about record demarcation.

We recognize that much work needs to be done to work with files which are not simple sequential objects. This includes indexed files, file descriptor blocks, resource forks, and other structured or multicomponent "objects". And we wish to be careful in distinguishing the object itself from access methods. I think that even the most elaborate object can be reduced to one or more flat files, with reconstruction rules attached, since we do just that when making backups and patching disks. Reconstruction might not be much fun, but it is possible.

Commentary
----------

Now we come to the ISO parts of the proposal. It is suggested that we use the ISO 2022 conventions to encode the contents of files.
I interpret this to mean that a local Kermit needs to understand the contents or "meaning" of a file (as distinct from the file system architecture). The contents are controlled by applications programs rather than being firmly rooted in the host file system design. This is really an applications program to applications program protocol, or in ISO networking terms a Presentation Layer service. In the case at hand the destination application is a model terminal. Again, the two sides must cooperate to achieve error free transmission and that requires a decoder to match (i.e., understand) the encoder. Existing Kermit negotiation mechanisms can easily achieve that match, though some short replies from Kermit Servers are outside files and cannot be encoded this way. It also means there are major difficulties with Kermit discovering how to interpret the file/object without help from either the user or hopefully from the host's file system. The SET FILE TYPE command illustrates the point. What the proposal is saying is that each Kermit would have a set of filter procedures to understand the file's contents and communicate the information in-line via ISO 2022 conventions. Needless to say, word processor formats, spread sheets, and other popular applications program data are not 100% convertible between systems (and barely from version to version of the same program on the same machine). Ref: SPECIAL EFFECTS section of the proposal. My view is Kermit ought to stay well away from making any selection of "useful" versus "special effects" material in such files. Underlying the whole ISO 2022 discussion is the concept of displayable characters, with no reference to file systems. It is a terminal communications mechanism: a sequential stream of characters, with hints to the display hardware to select the symbols shown for given (heavily overloaded) data codes. It presumes the capabilities of the display hardware to provide those symbols and to position them appropriately. 
Thus for terminal emulation I strongly support the ISO 2022 convention, and follow ons, in Kermits. Regarding ISO 2022 and file transfer I think that the two are not necessarily related. For example, a text file composed in two or more mixed languages needs internal codes to indicate how data items are to be displayed; the editor or other applications program uses some kind of system for this. If the system is private then it becomes burdensome for production Kermits to support that filter. The filter would best be done as either a stand alone program or a special loadable filter to Kermit (not an easy thing to accomplish if the filter needs to recognize strings of characters at a time). Such a filter would be like to ISO 2022 and the matching ISO 2022 to display at the other end, but word processors use far more elaborate display formatting methods than ISO 2022 understands since the target is usually not a simple character based terminal. Matters make sense when the file is homogeneous, say ISO 2022, or Kanji, or Spanish. In principle no filter is needed for the communications channel aside from squeezing 8-bit data into 7-bit form via shift locks. (I omit consideration of non-8-bit systems). Stand alone filter programs could convert local forms to one more generally understood, and back again on the other side and do so for even complicated multi-character representations. To me this means in-built filters are tailored to specific kinds of homogeneous files and a single Kermit implementation has either a small embedded set and/or a convenient way to load new procedures at run time (I vote for both). Loadable ones allow local enhancements and are the pathway to transferring hetrogeneous documents, even though the implementation details will be a real headache if it's code rather than a data table. In summary, I think the ISO 2022 approach has much merit and I support it. 
At the same time we should be sensitive to the fact that we are discussing some, and only some, terminal based display attributes but not file system differences nor, for the most part, applications programs. This is not much to add to the discussion really, except I think it is a good concept in its area. Joe Doupnik 6-Mar-89 13:57:25-GMT,2024;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA01156; Mon, 6 Mar 89 08:57:17 EST Resent-Message-Id: <8903061357.AA01156@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA05330; Mon, 6 Mar 89 08:56:21 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4307; Mon, 06 Mar 89 08:53:32 EST Received: by CUVMB (Mailer X1.25) id 4208; Mon, 06 Mar 89 08:53:31 EST Received: from CUVMB by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4207; Mon, 06 Mar 89 08:53:30 EST Received: from watsun by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with TCP; Mon, 06 Mar 89 08:53:29 EST Received: by watsun (4.0/SMI-4.0) id AA01110; Mon, 6 Mar 89 08:55:48 EST Date: Mon, 6 Mar 1989 8:52:47 EST From: Frank da Cruz To: Hirofumi Fujii Cc: Frank da Cruz , Joe Doupnik , Ken-Ichirou Murakami , Hirohide Mikami , Hirohide Mikami , Masamichi Ute , Christine M Gianone Subject: RE:Kermit/ISO proposal In-Reply-To: Your message of SUN, 05 MAR 89 15:35:06 JST Message-Id: Resent-Date: Mon, 06 Mar 89 08:53:30 EST Resent-From: Network Mailer Resent-To: fdc@cunixc.cc.columbia.edu Thanks very much for your explanation of the Japanese standards. Your understanding of our proposal was completely correct, and we're very glad to see that the same mechanism can be used for the present-day Japanese codes. We will change our proposal to reflect what you have said. Too bad we didn't get to meet you when we were in Japan. Thanks again! 
- Frank 6-Mar-89 22:41:21-GMT,2987;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA05998; Mon, 6 Mar 89 17:41:18 EST Date: Mon, 6 Mar 1989 17:41:18 EST From: Frank da Cruz To: Joe Doupnik Subject: Proposal Message-Id: Joe, thanks for the comments. I take it the main discomfort you have with the proposal is that a terminal-oriented mechanism is being bent into a file transfer application. And you're right. And as you point out, Kermit has always done that (with the CRLF line terminators, etc). So the proposal carries along shamelessly in this tradition. Anyway, if you want to be able to unambiguously flag different alphabets in a file that's being transferred, what other mechanism is there? I don't think there is one, except maybe certain proprietary schemes cooked up by Xerox, etc. It's also true that by following the terminal model, we seem to restrict ourselves to flat, sequential files. Simple Kermit programs have never claimed to be able to transfer anything else. There is some not-very-well-thought-out mumbling in the Attributes section of KtB intended to address this problem (it doesn't, really). Do you think Kermit -- or any other nonproprietary file transfer protocol -- will ever be able to handle complex record-oriented files (e.g. ISAM or other kinds of databases containing strings, integers, floats, bit flags, etc) between unlike systems, without resorting to tricks (like Kermit-11 sending the FAB in the Attribute packet)? I doubt that even ISO FTAM with full-blown ASN.1 encoding could do it (I may be wrong). Even if it could, it probably should not be a goal of Kermit to do everything that FTAM can do (even though it can pretty much do everything that FTP can!). So anyway, insofar as the proposal confines itself to TEXT (and it does), we're in pretty good shape. 
As to implementation -- we have the choice of putting the conversions into Kermit (either statically or dynamically) or forcing the user to run pre- and postprocessors. Naturally, I'd rather see Kermit do the work whenever possible to save users headaches, confusion, and disk space. The question is, how hard will it be on the programmer? Probably MS-DOS is an extreme case, with hundreds of mutually-incompatible word processors, every user of each clamoring for JRD to put support into MS-Kermit. At the other extreme are the national versions of VMS, where all files are encoded in a single ISO 8859 alphabet (e.g. Roman/Hebrew). What I hope will come out of this is some incentive for makers of multi-alphabet software to get together and come up with some common file formats, preferably in line with existing standards. Speaking of which, do you have the name and number of the Wordperfect guy I passed along to you? Maybe I can grill him about Wordperfect formats to get a better idea of what goes on in a typical "real-world" application... Thanks again! - Frank 8-Mar-89 3:24:57-GMT,1064;000000000411 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA20052; Tue, 7 Mar 89 22:24:55 EST Received: from ntt-sh.ntt.jp ([129.60.57.1]) by cunixc.cc.columbia.edu (5.54/5.10) id AA25613; Tue, 7 Mar 89 22:23:01 EST Received: by ntt-sh.ntt.jp (3.2/ntt-sh-03c) with TCP; Wed, 8 Mar 89 12:24:22 JST Date: Wed, 8 Mar 89 12:23:03 I From: ken-ichiro murakami Subject: Re: ISO Kermit Proposal To: fdc@watsun.cc.columbia.edu In-Reply-To: Message-Id: <12476256355.16.MURAKAMI@NTT-20.NTT.JP> Hi Frank! I've got the messages from you and from Dr.Fujii. Now, we, DECUS Japan, are considering to have a meeting to discuss about your proposal. Many users might have their opinion as well as Dr.Fujii. So, Mr.Nishimoto at DECUS Japan is preparing to send postal mail to Kermit users. The meeting will be held in the last week on March. 
Would you please wait for the result? I've already send a mail to Dr.Fujii about it. -Ken ------- 8-Mar-89 15:22:28-GMT,1288;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA24513; Wed, 8 Mar 89 10:22:24 EST Date: Wed, 8 Mar 1989 10:22:23 EST From: Frank da Cruz To: Andre' PIRARD Subject: Re: Kermit International Character Set Proposal In-Reply-To: Your message of Wed, 08 Mar 89 13:16:05 +0100 Message-Id: For now, I think the discussion group should be confined to the people in the message header, except for the Finnish guy, who apparently has disappeared. If you "reply all" that should do the trick. If the discussion becomes lively and detailed, then maybe I'll set up a mailing list. I've already received two substantive replies. One from Joe Doupnik (MS-Kermit) who likes it, but says we should stress that we're still following the terminal model -- all these escape sequences are really designed for host- terminal interaction (but what else is there that we can apply to the problem at hand?). The other from the people in Japan, who told me that the ISO 2022 scheme also applies to Japanese codes, and they don't need to be a special case -- just a multibyte application of ISO 2022 and 4873. So far nothing back from anyone else yet. - Frank 8-Mar-89 23:13:42-GMT,2877;000000000011 Return-Path: <@cuvmb.cc.columbia.edu:PEPMNT@CFAAMP.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA28557; Wed, 8 Mar 89 18:13:41 EST Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA16757; Wed, 8 Mar 89 18:11:43 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 5586; Wed, 08 Mar 89 18:09:50 EST Received: from CFAAMP.BITNET (PEPMNT) by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8824; Wed, 08 Mar 89 18:09:49 EST Date: Wed, 1989 Mar 8 14:48:24 EST From: (John F. 
Chandler) PEPMNT@cfaamp.bitnet
To: (Frank da Cruz) fdc@cunixc.cc.columbia.edu
Subject: Kermit alphabets
Message-Id:

Frank,

I have read through the draft protocol extension, and I have a few comments.

1. It wasn't clear until rather late in the document that the proposal was for pure 7-bit transmission. Since my thinking about character sets had been mostly in terms of 256-to-256 translation tables, I kept looking for 8-bit features.

2. In the section "DESCRIPTION OF ISO-2022 TRANSFER SYNTAX" at line 437 it says:

     to shifting. Control characters from the C1 set must be transmitted
     using 2-character escape sequences, as described in ISO 2022: <ESC>@,
     <ESC>A, <ESC>B, etc, stand for 10000000, 10000001, 10000010, etc
     (binary). This method results in less Kermit encoding overhead on
     7-bit connections than would sending these characters "bare" (which
     is not allowed).

   However, the overhead should be the same, since <ESC> gets encoded to two characters. By the way, the next paragraph gives the encoding for <ESC> as #$, rather than #[.

3. In the 3rd paragraph of that section, by the way, I would say "parity stripping", instead of "stripping of parity" -- it's a matter of style and also to avoid the impression that you meant to say "stripping off".

4. Line 617 (in the section "MULTIBYTE ALPHABETS") contains the phrase "Kermit Kermit encodings" -- should that be "Kermit alphabet encodings" or perhaps just "encodings"?

5. I presume the very last paragraph ("SUMMARY") will be dropped from the ultimate draft. If not, the last sentence should be amended from "Anyone who has can offer" by dropping the "has".

6. The section "ADDITIONAL ESCAPES" is a little unclear. In view of the quote above, which says 2-byte escapes *must* be used for C1, it seems superfluous to require F -- presumably one or the other of these statements is incorrect.
Perhaps the description of what *may* be sent should be more of a *prescription* -- in particular, I think <ESC>d should not be permitted anywhere except at the end of a file and should be required there (in Kermit transfers, that is).

John

9-Mar-89 1:20:40-GMT,1818;000000000001
Return-Path:
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA29372; Wed, 8 Mar 89 20:20:29 EST
Date: Wed, 8 Mar 1989 20:20:28 EST
From: Frank da Cruz
In-Reply-To: Your message of 03/08 19:21:24
Cc: Christine M Gianone
To: Gisbert W.Selke
Subject: Re: ISO / Kermit Proposal
Message-Id:

Actually, we have a vested interest in keeping the world from blowing itself up, so any little bit we can do to help people and nations communicate with each other... Also (let's be honest) maybe we'll get some more trips out of it...

Your specific comments were very good. We're not sure what to do about the overloaded ISO-646 characters... Maybe there's a list somewhere of what "national" characters are used in these positions in each country, so that the ISO-8859 equivalents can be identified.

And yes, the problem of non-language-related escapes within multilanguage files is a conundrum. Presumably, it "shouldn't happen", but in real life, who knows?

We agree about the silly single character shifts. In this case Western Europeans lose (having to shift between ASCII and special characters all the time), whereas Russians, Israelis, Greeks, and Arabs win -- they can stay in the "right half" all the time. The shifts could be avoided by using all 8 bits, but then we'd get even more overhead on 7-bit connections due to Kermit's 8th-bit-prefixing mechanism.

Extended negotiations probably won't happen. A single, two-sided exchange is embedded too deeply in the protocol. On some level, the user is simply going to have to know something about the alphabet codes in use, and which ones are supported by the Kermit programs.

Thanks again!
- Chris and Frank

9-Mar-89 2:04:17-GMT,2289;000000000001
Return-Path:
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA29634; Wed, 8 Mar 89 21:04:16 EST
Message-Id: <8903090204.AA29634@watsun.cc.columbia.edu>
Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA05259; Wed, 8 Mar 89 21:02:16 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 5657; Wed, 08 Mar 89 21:00:26 EST
Received: by CUVMB (Mailer X1.25) id 9131; Wed, 08 Mar 89 21:00:25 EST
Date: 03/08 20:44:16
From: FDCCU@cuvmb.cc.columbia.edu
Subject: RECK NOTE - PUN file from RSCS
To: FDC@cunixc.cc.columbia.edu
Reply-To: RSCS@cuvmb.cc.columbia.edu

Date: 9 March 1989, 02:29:11 SET
From: Gisbert W.Selke +49 228 225888
To: FDCCU at CUVMA
Re: Caught in the act

Frank and Chris,

Yes, I'm quite sympathetic to that healthy blend of philanthropy and hedonism that drives you in the making of Kermit...

A few more quick comments on comments on comments:

(i) In Germany, the ISO 646 standard is quite rigorously the following:

      left square bracket  : A umlaut      left curly brace  : a umlaut
      right square bracket : U umlaut      right curly brace : u umlaut
      backslash            : O umlaut      vertical bar      : o umlaut
      tilde                : ess-zet

So, even if you can't tell from looking at the file itself, a semi-knowledgeable user will know which overloading is used. (Or am I being too optimistic?) Currently, I am using a filter when going from the PC (IBM extended ASCII) to the host (ISO 646) or vice versa.

(ii) Other-purpose escape sequences: at least on the PC, it's a common thing to have, say, batch files, or user screens, using ANSI features (colouring, highlighting, positioning,...) *and* umlauts. So it would of course be nice to have this catered for, but we've been living without. - There should be some warning, though, if extraneous escape sequences are encountered in an ISO file, instead of stealthily doing some wild alphabet switching.
(iii) Shiftin/out: why not use 8-bit data where available? Binaries work like that, too. (iv) Extended negotiation: I agree... if it ain't broken, don't fix it. On my way to extended feasting, \Gisbert 10-Mar-89 14:04:23-GMT,3706;000000000611 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA15480; Fri, 10 Mar 89 09:04:21 EST Message-Id: <8903101404.AA15480@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA18990; Fri, 10 Mar 89 09:01:24 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 6202; Fri, 10 Mar 89 09:00:27 EST Received: by CUVMB (Mailer X1.25) id 1342; Fri, 10 Mar 89 09:00:25 EST Date: 03/10 08:51:16 From: FDCCU@cuvmb.cc.columbia.edu Subject: PUN file from RSCS - MOSGLA.MAIL X-Tag: FILE (9356) ORIGIN HLERUL2 MAILER 3/10/89 3:52:41 E.S.T. To: FDC@cunixc.cc.columbia.edu Reply-To: MAILER%HLERUL2@cuvmb.cc.columbia.edu Date: Fri, 10 Mar 89 13:42 CET From: "Johan van Wingen" To: "F. da Cruz" Subject: Kermit/ISO Dear Christine and Frank I read your proposal with great interest, although I am not a Kermit, nor even a network expert. Congratulations with your tutorial on ISO standards, parts of which I would like to copy in my own documents in the future (source stated of course). My present comments are very provisional, and do not cover the Kermit part in detail. ISO is strictly Int. Organization for Standardization. ISO 4873 contains an important feature: "levels". With Level 1 no shifts are allowed, with Level 2 only single shifts, with Level 3 all the rest. (I keep the documents at home, not here, so I cannot quote literally now.) Thus the generality of ISO 2022 is somewhat restricted here. It must be said that there are no known implementations of ISO 2022 in data processing whatsoever, so one should be careful not to raise too high expectations of its use. 
But it is good as a specification method. It still remains one of the most impenetrable ISO standards. It is not so much ISO 8859 that defines a series of 8-bit character sets, but ISO 4873. The left half will become after revision identical to ASCII, that is ISO 646 International Reference Version Revised (not plain ISO 646). There is a ISO-XYZ in development where switching the right hand part is defined. Another standard is now being proposed, a non-extensible 8-bit set with NULL, ESC and 254 other characters, graphic or control. I submitted a first draft, a copy of which I may send you. It includes only HT (tab), CR and LF. Both West and East Europe is covered, and Turkish. As for Kermit I think it important to indicate the Level of ISO 4873 used. Most environments do not allow midstream code table switching, and it is only fair to tell when that is not intended. For program texts only Level 1 will be permitted. There is a strong tendency with SC2, and still more with SC22, to do away all 7-bit processing. Then ISO 646 will only be kept for the CCITT Telematic services. At the meeting of the SC22 Ad Hoc Group on Character Handling in Programming Languages earlier this week in Paris, there was a strong request for giving names to specific coded character sets, like LATIN1, LATIN2. This could also be used for SET FILE TYPE, after it has been standardized (by SC2). As for document description languages there is now Standard Generalized Mark-up Language (SGML) for which there is an ISO standard, and a very active users group. No characters other than found in ISO standards are used. Not even adapting Kermit will be required for its use in transferring files. Be sure this is not meant to be my final and last reaction. FROM J. W. van Wingen MOSGLA@HLERUL2.BITNET Mail to P. O. 
Box 486, 2300AL Leiden, Netherlands

13-Mar-89 13:04:31-GMT,4408;000000000011
Return-Path:
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA11432; Mon, 13 Mar 89 08:04:29 EST
Message-Id: <8903131304.AA11432@watsun.cc.columbia.edu>
Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA12680; Mon, 13 Mar 89 08:04:17 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 6847; Mon, 13 Mar 89 08:00:28 EST
Received: by CUVMB (Mailer X1.25) id 3703; Mon, 13 Mar 89 08:00:26 EST
Date: 03/13 07:22:11
From: FDCCU@cuvmb.cc.columbia.edu
Subject: PUN file from RSCS - MOSGLA.MAIL
X-Tag: FILE (2245) ORIGIN HLERUL2 MAILER 3/13/89 2:25:25 E.S.T.
To: FDC@cunixc.cc.columbia.edu
Reply-To: MAILER%HLERUL2@cuvmb.cc.columbia.edu

Date: Mon, 13 Mar 89 13:23 CET
From: "Johan van Wingen"
To: "F. da Cruz"
Subject: Kermit/ISO

Dear Frank

Here are some additional comments.

You can order the International Register of Coded Character Sets from ECMA free, on official paper, stating your name and address as the recipient. Ask also for ECMA Memento 1989, a nice mandarin-coloured booklet which includes a list of all ECMA standards.

The final character for Arabic (Part 6) is G. Then we can update:

Alphabet Summary Table:

Esc Seq  Alphabet Name                          ISO Number   ECMA Number  Regist. nr.
  (B     ASCII (ANSI X3.4-1986)                 ISO 646      ECMA-6
  -A     Latin Alphabet No. 1                   ISO 8859-1   ECMA-94      100
  -B     Latin Alphabet No. 2                   ISO 8859-2   ECMA-94      101
  -C     Latin Alphabet No. 3                   ISO 8859-3   ECMA-94      109
  -D     Latin Alphabet No. 4                   ISO 8859-4   ECMA-94      110
  -L     Latin/Cyrillic                         ISO 8859-5   ECMA-113     144
  -G     Latin/Arabic                           ISO 8859-6   ECMA-114     127
  -F     Latin/Greek                            ISO 8859-7   ECMA-118     126
  -H     Latin/Hebrew                           ISO 8859-8   ECMA-121     138
  -M     Latin Alphabet No. 5                   ISO 8859-9   ECMA-128     148

Other registered sets (with also a 96-char. G1)

  -I     Czech Standard ($ <-> currency sign)                             139
  -J     Right Half of ISO 6937-2                            ECMA- ?      142
  -K     Mathematical/Technical set                          ECMA- ?      143

A large lot has been registered as a G0. These also include many of the national versions of ISO 646, but not all. I'll send a table as soon as I have typed one.

In Dutch the "ij" and the "IJ" are sometimes handled as a separate character, and sorted with "y" (as in the list in the Railway timetable), but even if two letters, "ij" is always capitalized as "IJ". As a single letter both figure in ISO 6937-2 and in the National Coded Character Standard for Korean (!!!).

The SGML documents are:

  ISO 8879      SGML
  ISO TR 9573   SGML Users Guide
  ISO 9069      SDIF (SGML Document Interchange Format)

The ODA (Office Document Architecture) standard is:

  ISO 8613      Parts 1,2,4-8 (there is no 3) (630 pages !)

I only receive SC2, SC18 and SC22 documents, not those from SC6, SC21, (which would be too much for a single person to understand). Thus I have not got details about FTAM. As far as I know X.400 is MOTIS which is about to be approved as ISO 10021. I received the DIS, but did not study it. It is under SC18.

As for naming and identifying entities in data transfer increasing use is being made of ISO 8824,8825 ASN.1 Abstract Syntax Notation 1. This may even prove an alternative for ISO 2022. In the document SC2/WG3 N 48, Revision of ISO 4873, it is stated:

  "A proposal is now under review in SC21/WG6 for extension of ASN.1 in the area of character coding identification. That proposal uses Registration Numbers to identify character sets. It also assumes that a single identification number is sufficient to identify coding structure, and that SC2 will maintain a list of such numbers. For simplicity, the proposal assumes that such a number can be derived directly from the final character of the announcer ESC sequence. This feature of the proposal must be clarified before the proposal is finally approved."

More in my next contribution.

FROM J. W. van Wingen MOSGLA@HLERUL2.BITNET
Mail to P. O.
Box 486, 2300AL Leiden, Netherlands 17-Mar-89 15:05:28-GMT,1774;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA09434; Fri, 17 Mar 89 10:05:25 EST Message-Id: <8903171505.AA09434@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA09299; Fri, 17 Mar 89 10:03:39 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 9051; Fri, 17 Mar 89 10:00:25 EST Received: by CUVMB (Mailer X1.25) id 1922; Fri, 17 Mar 89 10:00:24 EST Date: 03/17 09:28:17 From: FDCCU@cuvmb.cc.columbia.edu Subject: PUN file from RSCS - MOSGLA.MAIL X-Tag: FILE (9635) ORIGIN HLERUL2 MAILER 3/17/89 4:32:15 E.S.T. To: fdc@cunixc.cc.columbia.edu Reply-To: MAILER%HLERUL2@cuvmb.cc.columbia.edu Date: Fri, 17 Mar 89 15:18 CET From: "Johan van Wingen" To: "F. da Cruz" Subject: More on char. sets Dear Frank To continue, SGML is Standard Generalized Mark-up Language. A precursor, GML, runs on IBM systems under DCF, with MVS or VM. The East-Asian standards are: China: (ISO Reg. 58) GB 2312-80 Japan: (ISO Reg. 87) JIS X 0208 (formerly JIS C 6226-1983) Korea: (ISO Reg. 149) KS C 5601-1987 I located a VT340 in the Computing Centre, and got it demonstrated. I was quite impressed. It is possible to show all the otherwise unprintable characters on the screen, and all the escape sequences, and to change the "mode" for displaying what you would see on paper. In the manual EK-VT3XX-TP-001 you find on page 25 the list of ISO 646 versions that you wanted (except Hungarian). FROM J. W. van Wingen MOSGLA@HLERUL2.BITNET Mail to P. O. 
Box 486, 2300AL Leiden, Netherlands 18-Mar-89 15:13:32-GMT,6449;000000000011 Return-Path: Received: from ntt-20.NTT.JP by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA24177; Sat, 18 Mar 89 10:12:40 EST Date: Sun, 19 Mar 89 00:09:45 I From: ken-ichiro murakami Subject: Re: ISO Kermit Proposal To: fdc@watsun.cc.columbia.edu Cc: cmg@cunixc.cc.columbia.edu, KEIBUN%JPNKEKVM.BITNET@ume.cc.tsukuba.junet, murakami@ntt-20.ntt.jp In-Reply-To: Message-Id: <12479006446.14.MURAKAMI@NTT-20.NTT.JP> Frank and Chris, I talked with Mr.Hirofumi Fujii at KEK about your proposal. We confirmed that we needed the facility to convert character code and we had the same idea except for our implementation model. However, we have not came to the conclusion. It will take for a few weeks to find solution. So, I'll give you a comment for your original proposal. Prior to give you a comment, I must explain you our complex situation about Kanji code. As Mr.Fujii pointed out, we have several kinds of Kanji code as follows; (1) SHIFTJIS --- 2 byte length, MSB is used, mainly used in micro computer OS such as MS-DOS and CP/M (2) EUC --- 2 byte length, MSB is used, mainly used in mini computer and workstation OS such as SUN, DEC and ELIS (NTT's AI workstation) EUC stands for Extended Unix Code. It's equivalent to VAX code. (3) JIS-7 --- 2 byte length, MSB is unused, standard Kanji code on UUCP and TCP/IP. This might be equivalent to ISO-2022(JIS X 0202). (4) JIS-8 --- 2 byte length, MSB is used. I don't know in detail. (5) vendor specific Kanji code such as IBM, XEROX, etc Since there is no de facto standard Kanji code, we are often confused by the inconsistency. This also affects terminal emulation facility in Kermit. We must support more than three Kanji code. In our(NTT's) implementation, we prepared SET KANJI {JIS-7|VAX(EUC)|SHIFTJIS} command to inform BOTH emulator AND file transfer module of the Kanji code. The problem is that this will make inconsistency between your proposal and our requirement. 
If we adopt your command (SET FILE TYPE), we must prepare another command (SET TERMINAL) for the terminal emulator to specify the Kanji code. It's inconvenient, since we must specify the Kanji code twice. It's a dilemma. ;-(

In contrast with our implementation, Mr.Fujii has yet another idea. In his implementation, he prepared a SET TERMINAL KANJI command to specify the Kanji code only for the terminal emulator. To support local Kanji conversion, he will prepare SET LOCAL TRANSLATION {ON|OFF|EUC|JIS8|JIS7}. This command resembles your SET FILE TYPE command. If the user specifies ON, the file transfer module will convert Kanji based on the Kanji code specified in the SET TERMINAL KANJI command. If another code (EUC, JIS8 or JIS7) is specified, Kanji is converted based on the specified Kanji code type.

Basically, we agree with your proposal. It's a good idea to have a standard character code on the transmission channel. However, it will take a long time for ALL Kermit implementations to have a code conversion facility. For the present, we would also like to allow local Kanji conversion from the local Kanji code to the remote Kanji code (a non-ISO code). This is our common requirement. I hope we can find a common solution for Kanji conversion implementation.

The following is my comment on your proposal.

>In the meantime, national versions of Kermit can (and do) use SET FILE TYPE
>commands to identify the encoding or standard used for a multibyte alphabet.
>For example, some Japanese Kermit programs have the command SET FILE TYPE
>TEXT, BINARY, or KANJI, and add a further command to specify the local Kanji
                                                                  ~~~~~
>encoding: SET KANJI VAX, JIS, or SHIFTJIS (JIS is the Japan Industrial
>Standard, JIS X 0208; SHIFTJIS is JIS X 0202 which differs from JIS X 0208 by
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>the introduction of escape sequences to shift between Kanji and ASCII; VAX is
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>the encoding used on Japanese VAX/VMS systems). These Kermit programs use
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>SHIFTJIS as the transfer syntax, and the Kermit program maps between it and
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>the local format, which may be VAX, JIS, or SHIFTJIS. To better mesh with the
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>current proposal, however, these programs should make a distinction between
>the file format and the transfer syntax by adding a command like SET
>TRANSFER-SYNTAX SHIFTJIS.
>In this connection, a "rider" to this proposal is that "JS" (for SHIFTJIS)
                                                              ~~~~~~~~~~~~
>be added to the list of Kermit Kermit encodings under Attribute "*".
>Designations for Chinese, Korean, and other multibyte-character-set languages
>are welcome, as are alternative designations for Japanese.

[corrected sentences]

In the meantime, national versions of Kermit can (and do) use SET FILE TYPE commands to identify the encoding or standard used for a multibyte alphabet. For example, some Japanese Kermit programs have the command SET FILE TYPE TEXT, BINARY, or KANJI, and add a further command to specify the remote Kanji encoding: SET KANJI VAX(EUC), JIS, or SHIFTJIS (JIS is the Japan Industrial Standard, JIS X 0202 and JIS X 0208; SHIFTJIS and VAX(EUC) are the encodings used on Japanese MS-DOS systems and VAX/VMS systems respectively). These Kermit programs use the specified Kanji encoding as the transfer syntax, and the Kermit program maps between the remote format and the local one, which may be VAX(EUC), JIS, or SHIFTJIS. To better mesh with the current proposal, however, these programs should make a distinction between the file format and the transfer syntax by adding a command like SET TRANSFER-SYNTAX JIS.

In this connection, a "rider" to this proposal is that "JS" (for JIS) be added to the list of Kermit Kermit encodings under Attribute "*".
Designations for Chinese, Korean, and other multibyte-character-set languages are welcome, as are alternative designations for Japanese. -Ken P.S. I posted to fj.kermit(kermit news group in Japan) for Korean and Chinese character set. But, nobody has contacted me yet. ------- 19-Mar-89 22:07:43-GMT,10386;000000000401 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA07279; Sun, 19 Mar 89 17:05:18 EST Date: Sun, 19 Mar 1989 17:05:17 EST From: Frank da Cruz To: ken-ichiro murakami Cc: Hirofumi Fujii , Christine M Gianone Subject: Re: ISO Kermit Proposal In-Reply-To: Your message of Sun, 19 Mar 89 00:09:45 I Message-Id: Ken, many thanks for your comments, and your careful reading of the proposal. You have clarified the situation for us a lot. In particular, it seems that we did not devote enough attention to terminal emulation in our proposal. We hope that the following comments will be useful in your discussions with Hiro and in the meeting which you mentioned previously. First, we have several questions: 1. Do the Japanese codes JIS-7, JIS-8, EUC (VAX, DEC), and SHIFTJIS also include the ASCII alphabet? Or must you use two different alphabets to switch between ASCII and Kanji? If so, how? 2. Do the Japanese codes include Kana? 3. On a particular system, such as the VAX, the MS-DOS PC, or the ELIS workstation, is it normal to use only one character code internally? 4. Is JIS C 6228 equivalent to ISO 2022? That is, does it specify the same mechanisms for transmitting 8-bit data over a 7-bit connection -- Shift-In and Shift-Out to switch between the G0 and G1 sets? Kermit should be as easy to use as possible, but should still give the user the ability to specify exactly what character codes are in use for both terminal emulation and file transfer. There should also be a consistent set of commands for all Kermit programs. 
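
[Editorial sketch, not part of the original message.] Question 4 above asks about the ISO 2022 mechanism of using Shift-Out and Shift-In to carry 8-bit data on a 7-bit connection. A minimal Python sketch of that idea follows, under these assumptions: the function names are hypothetical, only locking shifts are handled, and C1 controls and literal SO/SI bytes in the data are rejected rather than encoded.

```python
SO, SI = 0x0E, 0x0F  # Shift-Out invokes G1, Shift-In returns to G0


def to_7bit(data: bytes) -> bytes:
    """Fold 8-bit text onto a 7-bit channel using locking shifts."""
    out = bytearray()
    in_g1 = False
    for b in data:
        if b in (SO, SI) or 0x80 <= b <= 0x9F:
            # Assumption: C1 controls and literal shift bytes are out of scope.
            raise ValueError(f"byte 0x{b:02X} is outside the scope of this sketch")
        if b < 0x20:
            out.append(b)  # C0 controls are unaffected by the shift state
            continue
        want_g1 = b >= 0xA0  # right-half graphic character
        if want_g1 != in_g1:
            out.append(SO if want_g1 else SI)
            in_g1 = want_g1
        out.append(b & 0x7F)
    if in_g1:
        out.append(SI)  # leave the channel in G0 at the end
    return bytes(out)


def from_7bit(data: bytes) -> bytes:
    """Undo to_7bit: restore the 8th bit on characters sent while in G1."""
    out = bytearray()
    in_g1 = False
    for b in data:
        if b == SO:
            in_g1 = True
        elif b == SI:
            in_g1 = False
        elif b < 0x20:
            out.append(b)
        else:
            out.append(b | 0x80 if in_g1 else b)
    return bytes(out)
```

For a Latin-1 word with one isolated accented letter, such as "café", the sketch spends three bytes on that letter (SO, the letter, SI), which is the overhead the Western European correspondents in this thread are concerned about.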
TERMINAL EMULATION

The following command should specify what character set is sent and received on the transmission medium during terminal emulation. The Kermit program must translate between this character set and the one that is used locally.

   SET TERMINAL CHARACTER-SET

This command already exists, but is currently used only in MS-DOS Kermit, and only to switch between US and UK ASCII. We should extend this command to select any character code, and we should have a standard set of character-set names, including the currently defined ISO 8-bit alphabets:

   LATIN1-ISO, ..., LATIN5-ISO, CYRILLIC-ISO, GREEK-ISO, HEBREW-ISO, ARABIC-ISO, etc.

7-bit ASCII and its national variants (ISO-646):

   ASCII-US, ASCII-UK, ASCII-FR, ASCII-DE, ASCII-IT, ASCII-NL, ASCII-ES, ASCII-DK, ASCII-FI, ASCII-IS, ASCII-SE, ASCII-NO, ASCII-TR, etc.

And for Japanese:

   KANJI-JIS, KANJI-SHIFTJIS, KANJI-EUC (this is the same as VAX or DEC?).

For example, an MS-DOS computer might use SHIFTJIS locally, but a VAX communicates using EUC, so the MS-DOS Kermit user would give the command SET TERM CHAR KANJI-EUC.

We assume that a Kermit terminal emulator may be used to connect to a variety of computers -- DEC, IBM, Fujitsu, Hitachi, etc -- which probably use different character codes for communicating with terminals. So unfortunately, the Japanese user who logs in to more than one kind of computer will have to issue the appropriate SET TERMINAL CHARACTER-SET command each time.

You may have noticed that we did not define separate names for 7-bit and 8-bit versions of the same alphabet. We think that the actual method used for transmitting these alphabets should be governed by the SET PARITY setting. That is, if parity is EVEN, ODD, MARK, or SPACE, then the 7-bit code extension techniques described in ISO 2022 (and JIS C 6228?) should be used. If parity is NONE, then 8-bit codes may be sent and received.

FILE TRANSFER

Now, what about file transfer? Here we must answer three questions.
First, is the file text, binary, or some special application? Second, what character code is used in the local file? Third, what character set is used inside the Kermit packets? These are specified in separate commands:

   SET FILE TYPE {TEXT, BINARY, WORDSTAR, ...}
   SET FILE CHARACTER-SET
   SET TRANSFER-SYNTAX

SET FILE TYPE BINARY means that data is transmitted and received without any translation or conversion at all. SET FILE TYPE TEXT means that alphabet and record format conversions are done. SET FILE TYPE with an application name (such as WORDSTAR) means that some application-specific conversions are done between the disk file and the transfer syntax. This would be used with word processors, spreadsheets, databases, etc. A lot of design work needs to be done in this area!

SET FILE CHARACTER-SET can be any of the alphabet names listed above, or also some system-dependent codes like EBCDIC or IBM-CODEPAGE-xxx (IBM mainframes), CDC-SIXBIT (CDC mainframes), etc. This applies to files of type TEXT, but not BINARY, and may or may not apply to application-specific file types, depending on the application.

The possibilities for TRANSFER-SYNTAX should be much more restricted. So far we have NORMAL -- the old Kermit syntax (which follows TEXT or BINARY). We have proposed adding ISO8 for the European 8-bit alphabets. We should now also add names for the common Asian codes, such as KANJI-JIS, KANJI-EUC, or KANJI-SHIFTJIS. Ideally, Japanese Kermit programmers would agree upon only one transfer syntax for Kanji, and preferably this would be a code that also included ASCII as a subset.

We believe that the terminal emulation character set should not be linked to the file transfer syntax. There are potentially hundreds of different terminal character sets in the world, but we don't want the Kermit protocol to have to know about them, otherwise we will have a situation in which each Kermit program would have to know the codes of hundreds of other systems.
This is the kind of combinatorial problem that data communication protocols are designed to avoid. And we are in the lucky position of being able to design the Kermit protocol in the best possible way right now. So far, we have separated the 8-bit European alphabets from Kanji in this discussion. What mechanism can be used to allow Kanji to coexist with French, Hebrew, Russian, Greek, and other language codes? We hope that the answer to this question is that JIS C 6228 uses the same mechanisms as ISO 2022 to identify and switch between alphabets. Therefore, we hope it will be possible to use an escape sequence to identify Kanji code, and therefore to switch between Japanese and ISO alphabets in the same data stream. SIMPLICITY So now the poor user is faced with several confusing commands: SET TERMINAL CHARACTER-SET, SET FILE CHARACTER-SET, SET FILE TYPE, and SET TRANSFER-SYNTAX. If a Kermit program has all these commands, how can we make it easy to use? Can we supply each command with a useful default? TERMINAL CHARACTER-SET: It is not possible to specify a default special terminal character set for a particular Kermit program, because it depends on what kind of computer is on the other end of the connection. Therefore, the default must remain what it has always been -- ASCII. FILE TYPE: The default is, as always, TEXT. FILE CHARACTER-SET: The default here should be the local system's normal encoding of text (ASCII, EBCDIC, LATIN1-ISO, KANJI-JIS, etc). TRANSFER-SYNTAX: The default is, as always, NORMAL (that is, ASCII text or binary, depending on SET FILE TYPE). Other possibilities like ISO8 or JIS must be specified. We cannot change these basic defaults, because these are already used by hundreds of different Kermit programs all over the world. So how can we make Kermit easy to use by the Japanese (or Korean, or Russian, or ...) user? 
These four commands give all the information necessary to perform the required translations during both terminal emulation and file transfer. But there may be thousands of different combinations of these four commands with all their possible parameters.

The best way to simplify the user interface is to define macros for the combinations that are commonly used at each site. For example, in MS-DOS Kermit the Japanese user could have the following definitions in her or his MSKERMIT.INI file:

   define vax set parity even, set term char kanji-euc, -
     set transfer-syntax kanji-euc, set file type shiftjis
   define fujitsu set parity none, set term char kanji-jis, -
     set transfer-syntax kanji-jis, set file type shiftjis
   define pdp11 set parity even, set term char ascii-us, -
     set transfer-syntax normal, set file type text

and then just give "vax" or "fujitsu" commands. Japanese Kermit programs could even be distributed with a commonly-useful set of macros pre-defined and documented. Kermit programs that do not have command macros can define special new commands that are equivalent to specific combinations of the four character-set-related commands.

SUMMARY

We have attempted to specify the simplest set of commands that can be used in all Kermit programs all over the world. Unfortunately, they are not as simple as we would want them to be because the character code used for terminal emulation might be different from the local file system's character code, and also from the Kermit file transfer syntax. And we also must have a way to identify the local file format (text, binary, word processor, database, etc).

Obviously, many Japanese Kermit programs that are in operation today do not have these same commands, or do not use them in the same way. That's OK.
They should still be able interoperate with "new" Kermit programs in any of several ways: (1) using their current SET KANJI, SET TERMINAL KANJI, SET LOCAL TRANSLATION, and similar commands; (2) using binary-mode file transfer between systems that use the same code; or (3) making sure that the file transfer syntax of "old" and "new" Kanji-capable Kermit programs is compatible. We hope that (3) is possible. But we think it is more important that Kanji file transfer syntax be compatible with ISO8 transfer syntax -- that is, use the same kinds of escape sequences -- so that Japanese is not treated differently from all the other languages, and so that Japanese text can be mixed with text in any other language. Thank you once again for your careful attention and your valuable insights. We hope that agreement can be reached very soon. - Christine and Frank 20-Mar-89 23:05:09-GMT,5910;000000000011 Return-Path: <@cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA18673; Mon, 20 Mar 89 18:01:42 EST Message-Id: <8903202301.AA18673@watsun.cc.columbia.edu> Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0266; Mon, 20 Mar 89 17:56:20 EST Received: from VM1.EARN-ULG.AC.BE by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5632; Mon, 20 Mar 89 17:56:19 EST Received: by BLIULG11 (Mailer R2.02) id 3540; Mon, 20 Mar 89 23:50:55 +0100 Date: Mon, 20 Mar 89 23:46:41 +0100 From: Andre' PIRARD Subject: Re: Kermit International Character Set Proposal To: Frank da Cruz , Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , "Gisbert W.Selke" , Kurt Enulf , Jacob Palme , Per Lindberg , Bj|rn Larsen , "Hans A. 
]lien" , "Kai U.Leppamaki" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff Cc: Christine M Gianone In-Reply-To: Message of Thu, 2 Mar 1989 1:02:25 EST from > A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS Well, we are busy moving to new premises and it's going to last probably at least one week after Easter. But I take some extra work time for a short reply to the main points. Please excuse if my English is not very "careful" this time. First of all, many thanks to Frank for considering the problem and for a neat summary of the standards. Now, I think some history of our problem will help. ASCII (ANSI X 3.4) is a well settled standard, but only covers some languages. 7-bit communication is a well encrusted habit too. For those reasons, the least-shaking way to support other languages was to invent ISO 646 which redefines some characters of ASCII. But, in addition to the inherent inconvenience, the amount redefined is not enough for many languages and we had to compose additional characters (circumflex and trema in French) by mean of (or the other way round according to occasional taste). It allowed some word processing, but is a nightmare for data processing. Imagine that in DBASE fixed fields, then trying to sort it! The only easy way out was to uppercase everything, an excuse to drop the accents. This is why 8-bit extended sets, and especially an ISO 8859 standard, are being applauded here (but why we much regret some software insist on screening out the 8th bit on the IBM PC). But, sorrily, that charm holds only when confining into a single version of ISO 8859. These standards define how to transmit data, not how to store it. It probably comes from the evidence of how to store ASCII (text files are coherent within a given system and almost across all of them) and that this fact extends to a single 8859 version. Thus, every software knows what to store. 
Thinking of storing multi-ISO data by using 2022 would render it even less manageable than with 646. Leaving it up to anybody's whim is starting the same story all over again. And this time, the 8 bits are exhausted and I really don't know where the wind blows from. Maybe it is too soon to say.

So, my opinion is that while ISO 8859 + 2022 is OK to instruct a terminal or printer how to switch ISO versions and is quite suitable for a Kermit's terminal mode, the lack of definition of how to store the data makes Kermit's file transfer a real problem, mostly because it will not know how a particular software would store it. I am not sure there is a present way out of this dead end, but I hope to be wrong because I sure hope for one.

The problem I raised is that, even when restricting to a single version of ISO, one cannot switch a Macintosh for an IBM PC on a communication line without telling the other end that it happened. That's because they use different code points for the upper half of what are roughly equivalent codes. So, the "other end" has to be aware of the Mac, the PC, the Amiga, the Atari etc... And the PC has to be aware of the Mac, the Atari etc...

Translation tables for terminal mode and file transfer are a real plus for any purpose, easy to do and not committing. The Kermit protocol does well for 8-bit transfer, but SI/SO is needed for 7-bit wide terminal mode. I didn't want to further bother people with our problems and suggested that these tables be patchable internals, but my idea is that each machine talking the best of its ISO8859 on the communication line is the suggestion to include in an apparent program's code support option to make each machine understand the other at best.

Before going to more details, I can say I loved the idea of Kermit being able to transmit an emerging standard file structure (being able to tell text from pictures). But why drop 8th bit quoting?
It's Kermit's own right to superimpose it's own data encoding and, on the contrary, I can tell Kermit 8th bit quoting is more efficient that SI/SO would be (2 to 3, accented letters are most often isolated in French). Thanks once more, Frank. And hoping to hear from others' opinion. Gee, it's late! Andr). 21-Mar-89 0:13:58-GMT,2413;000000000001 Return-Path: Received: from cc.usu.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19362; Mon, 20 Mar 89 19:13:00 EST Message-Id: <8903210013.AA19362@watsun.cc.columbia.edu> Date: Mon, 20 Mar 89 17:10 MDT From: Joe Doupnik Subject: Andre's ISO comments. To: fdc@watsun.cc.columbia.edu X-Vms-To: IN%"fdc@watsun.cc.columbia.edu" From: USU::JRD "Joe Doupnik" 20-MAR-1989 17:09 To: IN%"A-PIRARD@BLIULG11.BITNET",JRD Subj: RE: Re: Kermit International Character Set Proposal Frank, I think that Andre is making the case FOR using ISO 2022, or relatives. Terminal communications is one part, where the host emits characters using a standard of one kind or another. In that case the PC has to translate comms line codes to displayable characters via a table (as present) or through in-line shift locks (ISO style or similar). If Macs and PCs do their job properly then the host always sends the same bytes for the same text and the screens appear much alike. In fact, terminal emulation has the messier task of needing to convert from more host formats than file transfers. File transfers need language translation at both ends and a small number of comms line forms. The proposal is directed at finding those small number of comms line formats, and at the same time satisfying some or most of the terminal emulation aspects as a consequence. If we do select a set then a particular Kermit need only understand the set <--> local conversion for file transfer, plus any optional terminal emulation problems (the poor PC's need to do the most work here, alas). 
If the sets are well selected they will include the widely used "local" forms, such as ISO xxxx and straight ASCII and Kermit privately encoded but otherwise transparent Binary (omitting the byte ordering problems) stream i/o. Eight vs seven bits is always a worry. It affects terminal emulation more than file transfer but Andre's point of shift vs prefixing overhead on some languages is well taken. However, the same file ought to be transportable through either channel width automatically (by knowing whether parity is used or if the channel is seven bits wide regardless, as on VAX VMS systems). Personally, I think that extra comms line characters might offset extra program execution time for some encoding methods and thus make throughput a difficult quantity to estimate. Regards, Joe D. 21-Mar-89 0:47:02-GMT,3127;000000000401 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19083; Mon, 20 Mar 89 18:43:50 EST Date: Mon, 20 Mar 1989 18:43:43 EST From: Frank da Cruz To: Andre' PIRARD Cc: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , "Gisbert W.Selke" , Kurt Enulf , Jacob Palme , Per Lindberg , Bj|rn Larsen , "Hans A. ]lien" , "Kai U.Leppamaki" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff , Christine M Gianone , Frank da Cruz Subject: Re: Kermit International Character Set Proposal In-Reply-To: Your message of Mon, 20 Mar 89 23:46:41 +0100 Message-Id: Brief response to Andre's message... First of all, Christine Gianone is the principal author of the ISO / Kermit proposal, I only helped! Second, Andre is absolutely correct: the proposal begs the question of local file storage. Yes, there is no standard for storing mixed alphabets within files. But by using ISO 2022 as the file transfer syntax, we are able to represent mixed alphabets unambiguously on the communication line. 
It is up to the Kermit programs to convert between this syntax and the local storage formats. We recognize that there are many application-specific formats for mixed alphabets, so it is up to the Kermit programmer to learn these formats and make the conversions. We hope that the introduction of this extension to the Kermit protocol will in some small way provide an incentive for computer and software makers and standards organizations to speed up their efforts to define storage formats for mixed alphabet files. Third, others have complained about the lack of attention to terminal emulation in this proposal. This deficiency will be corrected in the next draft of the proposal. Finally, others have also suggested that there is no reason (other than complexity) to restrict the ISO / Kermit file transfer syntax to the 7-bit environment with locking shifts (similar to ISO 4873 Level 3). If this is the general opinion, then the proposal will be amended to allow for 8-bit data transfer without shifts (similar to ISO 4873 Level 1). Level 2 is not considered practical, because too much complexity is required if we are to keep G2 and G3 sets active. Further opinions? - Chris and Frank 21-Mar-89 14:55:32-GMT,2886;000000000001 Return-Path: Received: from tut.cis.ohio-state.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA25944; Tue, 21 Mar 89 09:53:48 EST Received: from morganucodon.cis.ohio-state.edu by tut.cis.ohio-state.edu (5.61/3.890314) id AA25428; Tue, 21 Mar 89 09:52:54 -0500 Received: by morganucodon.cis.ohio-state.edu (3.2/2.890120) id AA07368; Tue, 21 Mar 89 09:50:09 EST Date: Tue, 21 Mar 89 09:50:09 EST From: Paul W. 
Placeway Message-Id: <8903211450.AA07368@morganucodon.cis.ohio-state.edu> To: A-PIRARD%BLIULG11.BITNET@cunyvm.cuny.edu Cc: Frank da Cruz , Christine M Gianone , Joe Doupnik , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , Gisbert W.Selke , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Kai U.Leppamaki , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff In-Reply-To: Andre' PIRARD's message of Mon, 20 Mar 89 23:46:41 +0100 <8903202303.AA07885@cheops.cis.ohio-state.edu> Subject: Kermit International Character Set Proposal Reply-To: paul@cis.ohio-state.edu

I have to agree with Andre': since Kermit already has a transport layer capable of 8-bit wide transmission, why further confuse things by making the text translation layer only 7 bits wide? One of the strengths of the Kermit protocol is a reasonable layering. As Andre' said, the Kermit 8-bit quote is more efficient than the locking shifts for Latin character based languages, and if the actual communication path is 8 bits wide, then there is no penalty for using the G1 characters at all.

I like the idea of a standard international text transfer protocol for Kermit, and think the preamble definition of character sets used, and the ability to switch them back and forth, is a good thing, but we should use the whole 8-bit channel, and let lower layers deal with shoving the data through 7-bit hardware. If the local machine cannot store 8-bit text, fine; the local Kermit already is responsible for translating into the local text format.
-- Paul 22-Mar-89 15:26:29-GMT,8463;000000000401 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA03833; Wed, 22 Mar 89 10:23:24 EST Resent-Message-Id: <8903221523.AA03833@watsun.cc.columbia.edu> Message-Id: <8903221523.AA03833@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA23238; Wed, 22 Mar 89 10:23:11 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 1030; Wed, 22 Mar 89 10:19:06 EST Received: by CUVMB (Mailer X1.25) id 8160; Wed, 22 Mar 89 10:19:04 EST Received: from JPNKEKVM by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8159; Wed, 22 Mar 89 10:19:03 EST Received: by JPNKEKVM (Mailer R2.02) id 2231; Thu, 23 Mar 89 00:22:03 JST Date: THU, 23 MAR 89 00:21:28 JST From: Hirofumi Fujii Subject: Re:Kermit/ISO proposal To: Frank da Cruz , Joe Doupnik , Andre Pirard , Ken-Ichirou Murakami , Kohichi Nishimoto Resent-Date: Wed, 22 Mar 89 10:19:03 EST Resent-From: Network Mailer Resent-To: fdc@cunixc.cc.columbia.edu - Summary - I agree to the proposal of the Kermit ISO 2022 extension. And I also agree to allow the 8-bit data for ISO/Kermit file transfer. The overhead to switch the character set in 7-bit environment is very high for Japanese text files. I propose to use the Kermit A-packet with ISO 2022 announcer to negotiate the extension protocol. - End of summary - I don't know about the ISO 2022. My only knowledge about this standard is the JIS X 0202 (JIS = Japanese Industrial Standard) which is corresponding to the ISO 2022. There may be the diffrences between JIS X 0202 and ISO 2022, so I will briefly describe my understanding about this standard at the bottom of this mail. Please point out if my understanding is wrong. In the followings, it is assumed that the JIS X 0202 is equivalent to the ISO 2022. 
Many Japanese computer systems use at least 3 character sets: Roman (almost equivalent to US-national, i.e., so-called ASCII), Katakana (1-byte code) and Kanji (2-byte code). And the internal (local) representation of these characters is system (OS) dependent. For example,

  SYSTEM      Roman         Katakana         Kanji
  ----------  ------------  ---------------  ------------------------
  MS-DOS      JIS-Roman     JIS-Katakana     MS-Kanji (Shift-JIS)
  VAX/VMS     JIS-Roman     (see note 1)     DEC-Kanji (see note 2)
  IBM/VM/CMS  EBCDIC-Roman  EBCDIC-Katakana  IBM-Kanji
  Unix        JIS-Roman     (see note 1)     EUC (Extended Unix Code)

  (Note) 1. Katakana is invoked by the Locking-shift mechanism on VAX/VMS,
            and by the Single-shift mechanism on Unix.
         2. DEC-Kanji code and EUC are almost equivalent.

Usually, a Japanese text file contains all three of the above character sets. And in this case, switching the character set in the 7-bit environment is very expensive. Let me show you an example:

  1234567890123456789012345678
  ----------------------------
  This is an English sentence.
  ----------------------------
  NNNNRKKRNNRRRRRRRRRNNNNNNNNR

where N is Kanji, K is Katakana and R is Roman (of course this is not a real Japanese sentence, but the character sets in a sentence look like this).

In the 7-bit environment, we usually assign G0:Roman, G1:Katakana, so the above sentence is translated as

  <ESC>$BThis<ESC>(J <SO>is<SI> <ESC>$Ban<ESC>(J English <ESC>$Bsentence<ESC>(J.

(<ESC>, <SO> and <SI> each stand for a single control character.) The 28-byte text needs an additional 20 bytes in this case!

In the 8-bit environment, we usually assign at the beginning GL=G0:Roman, G1:Katakana, GR=G3:Kanji, so the above sentence becomes

  This is an English sentence.
  ^^^^    ^^         ^^^^^^^^

where ^ means 8th bit ON (GR character set). In this case, only 2 bytes are required to switch the character set. (Note that the Locking-shift mechanism is required even in the 8-bit environment.)

This discussion is restricted to Japanese, but I think the situation is the same for other countries where more than three character sets are used. The 8-bit environment is better than the 7-bit one.
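The byte counts in the example above (20 extra bytes in the 7-bit environment, 2 in the 8-bit environment) can be checked with a small sketch. This is only an illustration of the arithmetic, not anyone's implementation: it assumes a 3-byte escape sequence for each designation into G0, 1 byte each for SO and SI, and that the text starts and ends in Roman. The function names are hypothetical.

```python
# N = Kanji, K = Katakana, R = Roman, one label per byte of the sample text.
PATTERN = "NNNNRKKRNNRRRRRRRRRNNNNNNNNR"

ESC_SEQ_LEN = 3  # assumed length of a designation such as ESC $ B or ESC ( J
SHIFT_LEN = 1    # SO or SI


def runs(pattern):
    """Collapse per-byte labels into the sequence of consecutive runs."""
    out = []
    for label in pattern:
        if not out or out[-1] != label:
            out.append(label)
    return out


def overhead_7bit(pattern):
    """G0 holds Roman or Kanji (re-designated as needed); G1 holds Katakana."""
    extra = 0
    g0 = "R"          # set currently designated to G0
    in_g1 = False     # True while Katakana is invoked by SO
    for label in runs(pattern):
        if label == "K":
            if not in_g1:
                extra += SHIFT_LEN      # SO
                in_g1 = True
        else:
            if in_g1:
                extra += SHIFT_LEN      # SI
                in_g1 = False
            if g0 != label:
                extra += ESC_SEQ_LEN    # re-designate G0
                g0 = label
    return extra


def overhead_8bit(pattern):
    """GL = Roman (G0) or Katakana (G1, via SO/SI); Kanji sits in GR."""
    extra = 0
    in_g1 = False
    for label in runs(pattern):
        if label == "N":
            continue  # Kanji is selected by the 8th bit alone
        want_g1 = (label == "K")
        if want_g1 != in_g1:
            extra += SHIFT_LEN
            in_g1 = want_g1
    return extra


print(len(PATTERN), overhead_7bit(PATTERN), overhead_8bit(PATTERN))
```

Under these assumptions the three Kanji runs account for 18 of the 20 extra bytes in the 7-bit case, which is why the cost grows so quickly for text that alternates between Kanji and Roman.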
However, a full 8-bit implementation requires a lot of effort. Therefore, I propose to use the announcer to negotiate the extension protocol. The form of the ISO 2022 announcer is ESC 2/0 F. The sending Kermit informs the receiver of the 'F' of the announcer by using the Kermit A-packet (e.g., with encoding (*) T{xxx} where xxx is the combination of 'F'). The receiver accepts or refuses by using the Kermit reply mechanism. I think this may be a great help for implementation.

............................................................................

The following is my understanding of JIS X 0202 (ISO 2022).

1. This standard defines code extension techniques for information
   interchange in BOTH 7bit AND 8bit environments.
                  ^^^^^^^^^^^^^^^^^^

2. There are four intermediate character sets called G0, G1, G2 and G3.
             ^^^^                                    ^^^^^^^^^^^^^^^^^

3. In the 7bit environment, only one character set can be activated at one
   time. The active character set can be selected from the above
   intermediate character sets by issuing the following control codes:

   SI  (Shift in)             invoke G0 character set
   SO  (Shift out)            invoke G1 character set
   LS2 (locking-shift two)    invoke G2 character set
   LS3 (locking-shift three)  invoke G3 character set

4. In the 8bit environment, two character sets, GL and GR, can be activated
   at one time. The GL character set is selected if the 8th bit is OFF and
   GR is selected if the 8th bit is ON. The active character sets are
   selected by

   LS0  (Locking-shift zero)         invoke G0 character set to GL
   LS1  (Locking-shift one)          invoke G1 character set to GL
   LS2  (Locking-shift two)          invoke G2 character set to GL
   LS3  (Locking-shift three)        invoke G3 character set to GL
   LS1R (Locking-shift one right)    invoke G1 character set to GR
   LS2R (Locking-shift two right)    invoke G2 character set to GR
   LS3R (Locking-shift three right)  invoke G3 character set to GR

5. In both 7bit and 8bit environments, a single-byte character set is
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^
   designated to the intermediate character set by issuing the following
   ESC sequences:

   ESC 2/8  F  or  ESC 2/12 F   designate character set F to G0
   ESC 2/9  F  or  ESC 2/13 F   designate character set F to G1
   ESC 2/10 F  or  ESC 2/14 F   designate character set F to G2
   ESC 2/11 F  or  ESC 2/15 F   designate character set F to G3

   where F is A:UK, B:US,..., I:JIS-Katakana, J:JIS-Roman etc.

   A multi-byte character set is designated to the intermediate character
     ^^^^^^^^^^^^^^^^^^^^^^^^
   set by

   ESC 2/4 F       or  ESC 2/4 2/12 F   designate character set F to G0
   ESC 2/4 2/9 F   or  ESC 2/4 2/13 F   designate character set F to G1
   ESC 2/4 2/10 F  or  ESC 2/4 2/14 F   designate character set F to G2
   ESC 2/4 2/11 F  or  ESC 2/4 2/15 F   designate character set F to G3

   where F is A:Chinese-Kanji, B:Japanese-Kanji etc.

6. One character in the G2 or G3 character set can be invoked by issuing

   SS2  invoke next one character from G2
   SS3  invoke next one character from G3

   In the 8bit environment, the character is invoked to the GL character
   set.

7. At the beginning of the information interchange, the extension method
   used in the subsequent data stream is announced by

   ESC 2/0 F

   where F is one of 4/1, 4/2, 4/3, 4/4, 4/5, 4/6, 4/7, 5/0, 5/2, 5/3,
   5/4, 5/5, 5/6, 5/7, 5/10, 5/11. For example,

   4/1  G0 only. No LS. GR is not used.
   4/2  G0 and G1. SI and SO (LS0 and LS1) are used.
   4/3  G0 and G1 only in 8bit env. LS's are not used. GL = G0, GR = G1.
   etc.
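The column/row notation used in items 5 and 7 above maps directly onto byte values: x/y is the byte (x * 16 + y), so 2/8 is '(' and 2/4 is '$'. A minimal sketch of how a program might build these sequences follows. The function names are hypothetical, and only the first alternative of each pair listed above is generated.

```python
ESC = 0x1B


def col_row(col, row):
    """Convert 'column/row' notation (e.g. 2/8) to a byte value."""
    return (col << 4) | row


# Intermediate byte selecting the target set (first alternative in item 5).
INTERMEDIATE = {
    "G0": col_row(2, 8),   # '('
    "G1": col_row(2, 9),   # ')'
    "G2": col_row(2, 10),  # '*'
    "G3": col_row(2, 11),  # '+'
}


def designate(target, final, multibyte=False):
    """Build the ESC sequence designating set `final` to G0..G3."""
    seq = bytearray([ESC])
    if multibyte:
        seq.append(col_row(2, 4))  # '$' marks a multi-byte set
        if target != "G0":         # item 5 lists plain "ESC 2/4 F" for G0
            seq.append(INTERMEDIATE[target])
    else:
        seq.append(INTERMEDIATE[target])
    seq.append(ord(final))
    return bytes(seq)


def announce(col, row):
    """Build the announcer ESC 2/0 F of item 7, with F given as col/row."""
    return bytes([ESC, col_row(2, 0), col_row(col, row)])


print(designate("G0", "J"))                  # JIS-Roman to G0
print(designate("G1", "I"))                  # JIS-Katakana to G1
print(designate("G0", "B", multibyte=True))  # Japanese-Kanji to G0
print(announce(4, 2))                        # "G0 and G1, SI and SO used"
```

The first and third results are the ESC ( J and ESC $ B sequences that appear in the 7-bit example sentence earlier in this message.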
-------------- Hirofumi Fujii National Laboratory for High Energy Physics (KEK) KEIBUN@JPNKEKVM.BITNET 22-Mar-89 16:28:01-GMT,3543;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA04632; Wed, 22 Mar 89 11:19:30 EST Date: Wed, 22 Mar 1989 11:19:29 EST From: Frank da Cruz To: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , Gisbert W.Selke , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff , John Chandler Subject: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit Cc: Frank da Cruz , Christine M Gianone Message-Id: The prevailing sentiment seems to be to allow 8-bit data transfer, a`la ISO 4873 Level 1, and let Kermit's packet encoding do all of the transformations necessary to transfer 8-bit characters in the 7-bit environment. That means that whenever a character with its 8th bit set to 1 is transmitted on a 7-bit connection (i.e. when PARITY is not NONE), it will be prefixed by the Kermit 8th-bit-prefix character (normally '&'). This is equivalent to the Single Shift that is used in ISO 4873 Level 2. This mode of operation depends upon the Kermit program having the 8th-bit-prefixing option. This is an OPTIONAL feature of the Kermit protocol, negotiated between the two Kermits. In practice, most widely-used Kermit programs do have this feature. The main advantage of allowing 8-bit ISO text transfer is that the overhead is lower for languages like French and German that shift frequently between the G0 and G1 sets. A disadvantage is that languages like Russian, Greek, Hebrew, and Arabic that tend to stay in the G1 set will have a very high prefixing overhead in the 7-bit environment. The question now becomes: should Kermit continue to allow ISO 7-bit text transfer with locking shifts, as originally proposed? 
If we do not, then Kermit programs that do not implement 8th-bit prefixing will not be able to transfer mixed-alphabet texts. But maybe that's OK -- we can simply state that 8th-bit prefixing is a PREREQUISITE for mixed-alphabet text transfer. The advantage here is simplicity. The Kermit program will be simple, and the protocol specification will be simple. This increases the chance that programmers will actually want to -- and be able to -- do the work. Allowing the full range of ISO 4873 / 2022 code extension techniques would give us the greatest flexibility (e.g. efficiency for both French and Cyrillic), but would make the Kermit mixed-alphabet text protocol specification nearly as complicated as the ISO standards themselves, and about as likely to be implemented. Shall we take a vote? - Christine and Frank 22-Mar-89 17:59:22-GMT,12197;000000000401 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA05922; Wed, 22 Mar 89 12:59:17 EST Resent-Message-Id: <8903221759.AA05922@watsun.cc.columbia.edu> Message-Id: <8903221759.AA05922@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA04140; Wed, 22 Mar 89 12:59:06 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 1170; Wed, 22 Mar 89 12:55:01 EST Received: by CUVMB (Mailer X1.25) id 8560; Wed, 22 Mar 89 12:55:00 EST Received: from JPNKEKVM by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8559; Wed, 22 Mar 89 12:54:59 EST Received: by JPNKEKVM (Mailer R2.02) id 2301; Thu, 23 Mar 89 02:57:39 JST Date: THU, 23 MAR 89 02:57:09 JST From: Hirofumi Fujii Subject: Japanese character sets To: Frank da Cruz , Joe Doupnik , Ken-Ichirou Murakami , Kohichi Nishimoto Resent-Date: Wed, 22 Mar 89 12:54:59 EST Resent-From: Network Mailer Resent-To: fdc@cunixc.cc.columbia.edu Dear Frank and Ken, I think I can answer some of the Frank's questions. 
I also put the description of JIS at the bottom of this mail (Appendix A and Appendix B).

1. Japanese code systems:

The answer is NO. A Japanese computer system has at least three character sets: Roman (almost ASCII), Katakana (1-byte code), and Kanji (2-byte code). The Kanji set also includes the Roman characters, but the face of each character is double width. Therefore they should be considered different characters. The local representations for these character sets are

  OS          Roman        Katakana          Kanji
  ----------  -----------  ----------------  ----------------------
  MS-DOS      JIS X 0201   JIS X 0201 in GR  MS-Kanji (SHIFTJIS)
  VAX/VMS     US-national  (see note 1)      DEC-Kanji (see note 2)
  IBM/VM/CMS  EBCDIC       EBCDIC-Katakana   IBM-Kanji
  UNIX        JIS X0201    EUC (see note 3)  EUC (see note 4)
  Elis        JIS X0201    EUC (see note 3)  EUC (see note 4)

  (Note 1) Invoked by LS2 (Locking shift two).
  (Note 2) JIS X 0208 in GR (i.e., 8th bit on).
  (Note 3) Invoked by SS2 (Single shift two) and 8th bit on.
  (Note 4) JIS X 0208 in GR (i.e., 8th bit on).

To switch between Roman, Katakana and Kanji, both the shift mechanism and the GR extension in the 8-bit environment are used. MS-DOS uses GL as Roman (ASCII), GR as Katakana, and the 1st byte of a Kanji is mapped to the C1 area and the undefined Katakana area. Therefore MS-DOS does not need the shift mechanism, but it violates the standard (C1 is used as a visible character). VAX/VMS, UNIX and Elis use GL as Roman, GR as Kanji, and Katakana is invoked by the shift mechanism. IBM/VM/CMS uses its own code system. Kanji is invoked by some shift-like mechanism.

2. Kana

The answer is YES. Kana is a Japanese phonetic character set. There are two types of Kana, Hirakana (Hiragana) and Katakana. Both character sets are included in JIS X 0208 (Kanji set). However, all the characters in JIS X 0208 have a double-width character face. JIS X 0201 also defines Katakana, and its character face is single width. (See Appendix A).

3. Internal character set

The answer is NO. As described in 1, we normally use at least three character sets.
4. JIS X 0202 (old name is JIS C 6228) and ISO 2022

I'm not sure. It is written in a footnote of JIS X 0202 that JIS X 0202 corresponds to ISO 2022 (see APPENDIX B).

TERMINAL EMULATION

SET TERMINAL CHARACTER-SET

My Kermit (MSVP98) has another command, SET TERMINAL KANJI CODE . The SET TERMINAL CHARACTER-SET command specifies the GL character set. However, Kanji is mainly used in the GR character set as described above. This is why we need another command to specify the Kanji code. There is one more reason we need another command: the code for key input. SET TERMINAL KANJI CODE is also used for key input character conversion. To unify these commands, I propose

  SET TERMINAL CHARACTER-SET [as {GL,GR}]

where the default is GL. And for key input

  SET KEYINPUT CHARACTER-SET [as {GL,GR}]

Hirofumi Fujii
National Laboratory for High Energy Physics (KEK)

<<<<<<<<<<<<<<<<<<<<< Appendix A >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Japanese code for information interchange

  JIS X 0201  Code for Information Interchange
  JIS X 0208  Code of the Japanese Graphic Character Set for Information
              Interchange

First, I will explain JIS X 0208 (old name is JIS C6226). This is a code system for Japanese characters. All Japanese characters are represented in a 2-byte code. Each byte is in the range '21'x - '7E'x, i.e., the same range as the ASCII character set. This is called the JIS Kanji code, but the set contains not only Kanji but also English, Greek, Russian, etc. At present, this character set contains the following characters.

  -  147 special symbols (+,-,square_root,arrows,etc.)
  -   10 numeric characters (0,1,2,...,9)
  -   52 Roman characters (A,B,C,..,Z,a,b,c,...,z)
  -   83 Hira-kana characters
  -   86 Kata-kana characters
  -   48 Greek characters (Upper alpha, beta,..., Lower alpha,..)
  -   66 Russian characters
  - 6353 Kanji characters (1st level 2965 + 2nd level 3388)
  -   32 Rules

The code table of JIS X 0208 looks like

                                     2nd byte
           | 21 | 22 | 23 |...| 30 | 31 | 32 |...| 41 | 42 |...| 7C | 7D | 7E |
      -----+----+----+----+---+----+----+----+---+----+----+---+----+----+----+
 1st    21 | SP | .. | .. |...| .. | .. | .. |...| .. | .. |...| .. | .. | .. |
 byte   22 | .. | .. | .. |...| .. | .. | .. |...| .. | .. |...| .. | .. | .. |
        23 | .. | .. | .. |...|  0 |  1 |  2 |...|  A |  B |...| .. | .. | .. |
         : |  :
        7E | .. | .. | .. |...| .. | .. | .. |...| .. | .. |...| .. | .. | .. |

For example, the code of Roman 'A' is '2341'x (1st byte is '23'x and 2nd byte is '41'x). Therefore, if you use a simple English terminal, it is displayed as '#A'. The difference between '2341'x and '41'x (ASCII 'A') is the width of the character face. All the JIS X 0208 characters are double width because Kanji are so complex to display.

There is another character set, JIS X 0201 (old name is JIS C6220). This is a 1-byte code system like ASCII. The character face is single width. This set contains Roman characters in the range '21'x - '7E'x, and Katakana in the range 'A1'x - 'DF'x.
   +--+--+--+--+--+--+--+--++--+--+--+--+--+--+--+--+
   |00|10|20|30|40|50|60|70||80|90|A0|B0|C0|D0|E0|F0|
   +--+--+--+--+--+--+--+--++--+--+--+--+--+--+--+--+
 00|     |SP               ||     |           |     |
 01|     |                 ||     |           |     |
 02|  C  |                 ||  U  |           |  U  |
 03|  O  |                 ||  N  |           |  N  |
 04|  N  |                 ||  D  |           |  D  |
 05|  T  |                 ||  E  |           |  E  |
 06|  R  |      Roman      ||  F  | Katakana  |  F  |
 07|  O  |                 ||  I  |           |  I  |
 08|  L  |                 ||  N  |           |  N  |
 09|  S  |                 ||  E  |           |  E  |
 0A|     |                 ||  D  |           |  D  |
 0B|     |                 ||     |           |     |
 0C|     |                 ||     |           |     |
 0D|     |                 ||     |           |     |
 0E|     |                 ||     |           |     |
 0F|     |              DEL||     |           |     |
   +--+--+--+--+--+--+--+--++--+--+--+--+--+--+--+--+

JIS X0201 looks like this. The code for Roman characters is almost equivalent to ASCII.

<<<<<<<<<<<<<<<<<<<<< Appendix B >>>>>>>>>>>>>>>>>>>>>>>>>>>

The following is my understanding of JIS X 0202.

  JIS X 0202  Code Extension Techniques for Use with the Code for
              Information Interchange

1. This standard defines code extension techniques for information
   interchange in BOTH 7bit AND 8bit environments.
                  ^^^^^^^^^^^^^^^^^^

2. There are four intermediate character sets called G0, G1, G2 and G3.
             ^^^^                                    ^^^^^^^^^^^^^^^^^

3. In the 7bit environment, only one character set can be activated at one
   time. The active character set can be selected from the above
   intermediate character sets by issuing the following control codes:

   SI  (Shift in)             invoke G0 character set
   SO  (Shift out)            invoke G1 character set
   LS2 (locking-shift two)    invoke G2 character set
   LS3 (locking-shift three)  invoke G3 character set

4. In the 8bit environment, two character sets, GL and GR, can be activated
   at one time. The GL character set is selected if the 8th bit is OFF and
   GR is selected if the 8th bit is ON. The active character sets are
   selected by

   LS0  (Locking-shift zero)         invoke G0 character set to GL
   LS1  (Locking-shift one)          invoke G1 character set to GL
   LS2  (Locking-shift two)          invoke G2 character set to GL
   LS3  (Locking-shift three)        invoke G3 character set to GL
   LS1R (Locking-shift one right)    invoke G1 character set to GR
   LS2R (Locking-shift two right)    invoke G2 character set to GR
   LS3R (Locking-shift three right)  invoke G3 character set to GR

5. In both 7bit and 8bit environments, a single-byte character set is
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^
   designated to the intermediate character set by issuing the following
   ESC sequences:

   ESC 2/8  F  or  ESC 2/12 F   designate character set F to G0
   ESC 2/9  F  or  ESC 2/13 F   designate character set F to G1
   ESC 2/10 F  or  ESC 2/14 F   designate character set F to G2
   ESC 2/11 F  or  ESC 2/15 F   designate character set F to G3

   where F is A:UK, B:US,..., I:JIS-Katakana, J:JIS-Roman etc.

   A multi-byte character set is designated to the intermediate character
     ^^^^^^^^^^^^^^^^^^^^^^^^
   set by

   ESC 2/4 F       or  ESC 2/4 2/12 F   designate character set F to G0
   ESC 2/4 2/9 F   or  ESC 2/4 2/13 F   designate character set F to G1
   ESC 2/4 2/10 F  or  ESC 2/4 2/14 F   designate character set F to G2
   ESC 2/4 2/11 F  or  ESC 2/4 2/15 F   designate character set F to G3

   where F is A:Chinese-Kanji, B:Japanese-Kanji etc.

6. One character in the G2 or G3 character set can be invoked by issuing

   SS2  invoke next one character from G2
   SS3  invoke next one character from G3

   In the 8bit environment, the character is invoked to the GL character
   set.

7. At the beginning of the information interchange, the extension method
   used in the subsequent data stream is announced by

   ESC 2/0 F

   where F is one of 4/1, 4/2, 4/3, 4/4, 4/5, 4/6, 4/7, 5/0, 5/2, 5/3,
   5/4, 5/5, 5/6, 5/7, 5/10, 5/11. For example,

   4/1  G0 only. No LS. GR is not used.
   4/2  G0 and G1. SI and SO (LS0 and LS1) are used.
   4/3  G0 and G1 only in 8bit env. LS's are not used. GL = G0, GR = G1.
   etc.
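Looking back at Appendix A, the example of the double-width Roman 'A' at '2341'x, together with the earlier note that EUC is "JIS X 0208 in GR (i.e., 8th bit on)", can be illustrated as below. The sketch uses the `euc_jp`, `iso2022_jp` and `shift_jis` codecs from a present-day Python standard library purely as a convenient stand-in; they are not part of this discussion, and the variable names are illustrative.

```python
# JIS X 0208 code for the double-width Roman 'A': 1st byte '23'x, 2nd '41'x.
jis = bytes([0x23, 0x41])

# On a plain ASCII terminal the two bytes show up as ordinary characters.
print(jis.decode("ascii"))  # '#A'

# EUC places the same two bytes in GR, i.e. with the 8th bit turned on.
euc = bytes(b | 0x80 for b in jis)
wide_a = euc.decode("euc_jp")
print(hex(ord(wide_a)))  # code point of the double-width 'A'

# The 7-bit form wraps the unchanged JIS bytes in designation sequences.
seven_bit = wide_a.encode("iso2022_jp")
print(seven_bit)

# MS-Kanji (Shift-JIS) remaps the first byte into the C1 area.
sjis = wide_a.encode("shift_jis")
print(sjis.hex())
```

The Shift-JIS result has a first byte in the '80'x - '9F'x range, which is the C1 usage that the MS-DOS remark in item 1 describes as violating the standard.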
------------------------------------- Hirofumi Fujii National Laboratory for High Energy Physics (KEK) KEIBUN@JPNKEKVM.BITNET  22-Mar-89 23:40:10-GMT,3611;000000000011 Return-Path: <@cuvmb.cc.columbia.edu:PEPMNT@CFAAMP.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA09707; Wed, 22 Mar 89 18:40:08 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 1374; Wed, 22 Mar 89 18:35:57 EST Received: by CUVMB (Mailer X1.25) id 9382; Wed, 22 Mar 89 18:35:56 EST Date: Wed, 1989 Mar 22 17:16 EST From: "John F. Chandler" Subject: Re: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit In-Reply-To: fdc@watsun.cc.columbia.edu message of Wed, 22 Mar 1989 11:19:29 EST Message-Id: To: Frank da Cruz > A disadvantage is that languages like Russian, Greek, > Hebrew, and Arabic that tend to stay in the G1 set will have a very high > prefixing overhead in the 7-bit environment. > > The question now becomes: should Kermit continue to allow ISO 7-bit text > transfer with locking shifts, as originally proposed? If we do not, then > Kermit programs that do not implement 8th-bit prefixing will not be able to > transfer mixed-alphabet texts. I'm not sure the term mixed-alphabet is quite the right one here. Wouldn't the languages cited above normally be encoded in 8-bit alphabets that are "mixed" only in the sense that the character sets place the Latin alphabet in the G0 slots? I would imagine that the typical use would consist of un-mixed Cyrillic or Hebrew or whatever. > the full range of ISO 4873 / 2022 code extension techniques would > give us the greatest flexibility (e.g. efficiency for both French and > Cyrillic), but... I have another suggestion that just occurred to me. First, let me state what I am assuming about the nature of text files: 1. Truly mixed-alphabet stuff is rare, that is, stuff reguiring more than a single 256-entry character set. 
I realize that ideographic text representation requires more than 256 distinct characters, but I think the solution to that difficulty is to represent ideograms by strings of bytes, rather than to define universal escape sequences for switching among alphabets. 2. In non-ideographic representations, the language will either be written either entirely in G0 (or in G1) or will switch back and forth frequently between G0 and G1. This certainly fits all the languages represented by the various ISO 8859 alphabets mentioned in the draft protocol extension. I confess that I don't know the situation for the various Japanese syllabaries, nor whether anyone would choose to use them if given a chance to use the standard combination of kana and kanji. Given the above, I think it makes sense to offer a single new feature: 8th-bit-complement mode. In that mode, ALL codes 0-127 would be swapped with those 128-255 as the first step of Kermit encoding (and the last step of decoding). Such a mode could be selected via an Attribute (or perhaps by a Capability flag). The implementation would be simple and would entail no overhead associated with scanning data streams for locking shifts. 8BC mode would be the method of choice for all the languages that use G1 exclusively, but could also be useful for transferring certain kinds of binary files -- there's nothing about it that need be restricted to text files. Being new to this discussion, I don't know whether this idea has been suggested before, but it seems to me to merit consideration. 
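A minimal sketch of the 8th-bit-complement idea, assuming it is applied to the raw data just before Kermit's normal packet encoding. The function names are hypothetical, and the ISO 8859-5 sample text is only an illustration of a language that stays mostly in G1.

```python
def complement_8th_bit(data: bytes) -> bytes:
    """Swap codes 0-127 with 128-255 by flipping the 8th bit of every byte."""
    return bytes(b ^ 0x80 for b in data)


def prefixes_needed(data: bytes) -> int:
    """Count bytes that would need an 8th-bit prefix on a 7-bit link."""
    return sum(1 for b in data if b & 0x80)


# Sample Cyrillic text in ISO 8859-5: letters are in G1, punctuation in G0.
text = "Привет, мир".encode("iso8859_5")

swapped = complement_8th_bit(text)
print(prefixes_needed(text), prefixes_needed(swapped))

# The transformation is its own inverse, so the receiver applies it again.
restored = complement_8th_bit(swapped)
print(restored == text)
```

In this sample only the comma and the space need prefixing after the swap, instead of all nine letters. For text that alternates frequently between G0 and G1 the swap would gain little, which matches the assumption stated in point 2 above.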
John 23-Mar-89 7:02:41-GMT,5061;000000000401 Return-Path: Received: from cc.usu.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA15639; Thu, 23 Mar 89 02:02:37 EST Message-Id: <8903230702.AA15639@watsun.cc.columbia.edu> Date: Thu, 23 Mar 89 00:01 MDT From: Joe Doupnik Subject: ISO commentary To: fdc@watsun.cc.columbia.edu X-Vms-To: IN%"fdc@watsun.cc.columbia.edu",JRD Chris, Frank, and the group: A little sense is dawning on me about ISO 7 vs 8 bit issue. It seems that ISO 2022 is nearly the superset of the various specifications which have been mentioned. It includes the availability of four graphics sets, G0-G3, shifts between them, allowance for display symbols composed of two or more characters, and the shift mechanisms to operate these tables via either 7 or 8 bit communications paths. ISO 4873, as I read it, is similar to 2022, but omits 7 bit codes, is more stringent about the control codes (C0, C1), uses essentially the same shift mechanisms execpt that SI/SO are replaced with full escape sequences. 4873 also provides levels of capability for having only two, three, or all four alphabet sets (G0-G3) whereas 2022 presumes presence of all four. ISO 2022 has all the features, yet retains more conventional use of control codes; 2022 seems to be the standard of choice. Since the underlying available alphabets are the same in these cases, there being normally four, the communications part becomes concerned with the number of bytes needed to select one or the other either for a single character or for a string of characters. ISO 4873 prefers to use escape sequences to toggle between the two active alphabets but 2022 retains the historical SI/SO short form. In a 7 bit environment only one alphabet is active at any instant so that either SI/SO or escape sequences are required to access any one of the other three. 
In other words, the 8 bit scheme lets two alphabets be "on line" but the 7 bit scheme is restricted to one "on line" (but the swap from one to a second can be accomplished via SI/SO, a whole byte versus one bit in a character). As Hirofumi demonstrated so clearly, there is a considerable difference when three alphabets are needed in rapid succession. Without question, the 8-bit channel lets the high order bit in a byte select one of two and then the shifting codes reload one alphabet upon demand, similar to paging memory. All of the above standards use an escape sequence to select which particular "language" is to be loaded into G0-G3. Thus this is no longer a communications performance issue. One communications issue which still is not clear to me is the ordinary Kermit 8-bit quoting mechanism. If ISO-XXXX were used I suspect that it would be applied "above" the ordinary Kermit methods (meaning before packet encoding and after packet decoding). In that case ordinary Kermit provides the illusion of an 8-bit channel, as it must for many systems. When the channel is 8-bits wide, it seems clear to me that an 8-bit ISO style code is the shortest method for all the work. When the channel is 7-bits wide we would need to run tests to determine whether ISO 7-bit shifts or Kermit 8-bit quoting is more efficient. ISO wins on long strings in one alphabet by stating a lock-shift escape sequence, and Kermit wins on most per-character alphabet swaps by needing only one quote byte rather than an escape sequence (it is almost a draw when SI/SO can be used to swap, but ISO loses by requiring a Kermit control-quote prefix on SI/SO). I think I've talked myself into saying that ISO 2022, or similar, has about all the features we need for both Western and Eastern languages and that the 8-bit version is faster by letting Kermit's 8-bit quoting mechanism provide the channel width (if necessary). 
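The trade-off described above can be made concrete with a rough byte count. The sketch below is only an illustration under simplifying assumptions, not part of the proposal: it charges one extra byte for every character sent with Kermit's 8th-bit prefix, and two bytes for every SO or SI (the shift character plus the Kermit control-quote prefix it needs). It ignores packet framing, repeat counts, and every other kind of quoting. The function names are invented for this example.

```python
def cost_8th_bit_prefix(data: bytes) -> int:
    """Bytes sent if each character with the high bit set gets a one-byte prefix."""
    return sum(2 if b & 0x80 else 1 for b in data)


def cost_locking_shift(data: bytes, shift_cost: int = 2) -> int:
    """Bytes sent if runs of high-bit characters are entered with SO and left with SI.

    shift_cost is 2 by assumption: the shift is a control character, so it
    travels with a control-quote prefix. No shift back is counted at the end.
    """
    total = 0
    in_g1 = False                  # transmission starts in the G0 (low) set
    for b in data:
        high = bool(b & 0x80)
        if high != in_g1:          # alphabet change: emit SO or SI
            total += shift_cost
            in_g1 = high
        total += 1                 # the character itself, sent as 7 bits
    return total


# A text that stays in the upper set (Cyrillic-like or Hebrew-like usage)
all_high = bytes([0xC0] * 100)
# A text that changes alphabet at every character (the worst case for shifts)
alternating = b"a\xe9" * 10

for name, sample in (("all high", all_high), ("alternating", alternating)):
    print(name, cost_8th_bit_prefix(sample), cost_locking_shift(sample))
```

Under these assumptions the all-high sample costs 200 bytes with prefixing and 102 with locking shifts, while the alternating sample costs 30 bytes with prefixing and 58 with shifts, which matches the direction of the argument: shifts pay off on long runs, prefixing on frequent swaps.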
The ISO 2022 locking shift sequences allow any of the four alphabets to be loaded into the left (low order) character codes and any of G1-G3 to be loaded into the right half. This flexibility can drastically reduce the need for high bits on characters when a single Western (one byte yields one graphical symbol) language is used. The last doubt in my mind then relates to terminal emulation in a 7 bit environment. Here the Kermit 8 bit quoting mechanism is not available. I think that it is not a difficult task to allow both 7 and 8 bit ISO shifts to be available in a terminal emulator, selected automatically by presence of parity and overridable by existing Kermit commands. People writing terminal emulators also will need to face the direction of writing problem; it's not so simple. [I know I should not say much about this now; but, you know, keyboards generate a lot of these symbols. At some point we will need to think more about translating keystrokes to communications line characters.] Finally, I'd like to express my appreciation to Hirofumi for explaining the complicated typography environment in Japan using terms that even I can understand. Joe Doupnik P.S. Frank and Chris: I've not posted this to the group because my typing skills would collapse about half way through! 23-Mar-89 20:52:12-GMT,5316;000000000001 Return-Path: <@cuvmb.cc.columbia.edu:PEPMNT@CFAAMP.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA22919; Thu, 23 Mar 89 15:52:05 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 1882; Thu, 23 Mar 89 15:47:50 EST Received: from CFAAMP.BITNET (PEPMNT) by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1139; Thu, 23 Mar 89 15:47:40 EST Date: Thu, 1989 Mar 23 15:40:14 EST From: (John F. 
Chandler) PEPMNT@cfaamp To: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , "Gisbert W. Selke" , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff , John Chandler , Frank da Cruz Subject: Re: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit In-Reply-To: fdc@watsun.cc.columbia.edu message of Wed, 22 Mar 1989 11:19:29 EST Message-Id: Pardon the resending of this message -- the first time I tried, I used the LISTSERV DISTRIBUTE feature, and at least one LISTSERV in the chain rejected the message, so that an unknown number of copies were lost. Rather than guess who got it, I'm just sending to everyone again... ----------------------------------------------------------------- Date: Wed, 1989 Mar 22 17:16 EST Subject: Re: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit In-reply-to: fdc@watsun.cc.columbia.edu message of Wed, 22 Mar 1989 11:19:29 EST Message-id: > A disadvantage is that languages like Russian, Greek, > Hebrew, and Arabic that tend to stay in the G1 set will have a very high > prefixing overhead in the 7-bit environment. > > The question now becomes: should Kermit continue to allow ISO 7-bit text > transfer with locking shifts, as originally proposed? If we do not, then > Kermit programs that do not implement 8th-bit prefixing will not be able to > transfer mixed-alphabet texts. I'm not sure the term mixed-alphabet is quite the right one here. Wouldn't the languages cited above normally be encoded in 8-bit alphabets that are "mixed" only in the sense that the character sets place the Latin alphabet in the G0 slots? I would imagine that the typical use would consist of un-mixed Cyrillic or Hebrew or whatever. > the full range of ISO 4873 / 2022 code extension techniques would > give us the greatest flexibility (e.g. 
efficiency for both French and > Cyrillic), but... I have another suggestion that just occurred to me. First, let me state what I am assuming about the nature of text files: 1. Truly mixed-alphabet stuff is rare, that is, stuff requiring more than a single 256-entry character set. I realize that ideographic text representation requires more than 256 distinct characters, but I think the solution to that difficulty is to represent ideograms by strings of bytes, rather than to define universal escape sequences for switching among alphabets. 2. In non-ideographic representations, the language will either be written entirely in G0 (or in G1) or will switch back and forth frequently between G0 and G1. This certainly fits all the languages represented by the various ISO 8859 alphabets mentioned in the draft protocol extension. I confess that I don't know the situation for the various Japanese syllabaries, nor whether anyone would choose to use them if given a chance to use the standard combination of kana and kanji. Given the above, I think it makes sense to offer a single new feature: 8th-bit-complement mode. In that mode, ALL codes 0-127 would be swapped with those 128-255 as the first step of Kermit encoding (and the last step of decoding). Such a mode could be selected via an Attribute (or perhaps by a Capability flag). The implementation would be simple and would entail no overhead associated with scanning data streams for locking shifts. 8BC mode would be the method of choice for all the languages that use G1 exclusively, but could also be useful for transferring certain kinds of binary files -- there's nothing about it that need be restricted to text files. Being new to this discussion, I don't know whether this idea has been suggested before, but it seems to me to merit consideration. 
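The 8th-bit-complement idea amounts to flipping the top bit of every byte before Kermit encoding and flipping it again after decoding. The following is a minimal sketch of that step alone, assuming nothing about how the mode would be negotiated; the function names and the sample bytes are made up for illustration, and control-character quoting is ignored.

```python
def complement_8th_bit(data: bytes) -> bytes:
    """Swap codes 0-127 with codes 128-255 by flipping the high bit of each byte."""
    return bytes(b ^ 0x80 for b in data)


def prefixes_needed(data: bytes) -> int:
    """Count the characters that would need an 8th-bit prefix on a 7-bit link."""
    return sum(1 for b in data if b & 0x80)


# Hypothetical sample: eight bytes from the upper half of an 8-bit set
# (here values in the ISO 8859-8 Hebrew letter range) and one space.
sample = b"\xf9\xec\xe5\xed \xf2\xe5\xec\xed"

swapped = complement_8th_bit(sample)
print(prefixes_needed(sample), prefixes_needed(swapped))

# The same function undoes the swap on the receiving side.
restored = complement_8th_bit(swapped)
```

In this sample only the space needs a prefix after the swap, instead of all eight letters. Text that alternates between the two halves gains nothing, which is the objection raised in the reply that follows.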
John 24-Mar-89 0:20:05-GMT,1408;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA25079; Thu, 23 Mar 89 19:20:03 EST Date: Thu, 23 Mar 1989 19:20:02 EST From: Frank da Cruz To: "John F. Chandler" Subject: Re: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit In-Reply-To: Your message of Wed, 1989 Mar 22 17:16 EST Message-Id: We were just rereading your suggestion about 8th-bit-complement mode. The underlying idea seems to be that if in a given environment (like Hebrew or Cyrillic) most characters are from GR, then complementing the 8th bit would give greater efficiency. So this would have to be on a per-file (or per-language) basis. It wouldn't help out the Germans or French, who must switch between GL and GR frequently. And it certainly would not bring much benefit to the Japanese (see Hirofumi Fujii's message that I forwarded to you a few minutes ago, even though this one might arrive first). We're beginning to think the only way to satisfy everybody is to allow a full implementation of ISO 2022, perhaps (as Hiro suggested) with an announcer in the attribute packet to specify what facilities are being used -- full 8-bit transfer with no shifts, 7-bit transfer with locking shifts, and (sigh) even allowing for G2 and G3 sets and the corresponding single shifts. - Chris & Frank 24-Mar-89 3:41:59-GMT,6351;000000000001 Return-Path: <@cuvmb.cc.columbia.edu:PEPMNT@CFAAMP.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA27154; Thu, 23 Mar 89 22:41:56 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 2055; Thu, 23 Mar 89 22:37:41 EST Received: from CFAAMP.BITNET (PEPMNT) by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1745; Thu, 23 Mar 89 22:37:34 EST Date: Thu, 1989 Mar 23 20:10:47 EST From: (John F. 
Chandler) PEPMNT@cfaamp.bitnet To: Joe Doupnik , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , "Gisbert W. Selke" , Kurt Enulf , Jacob Palme , Per Lindberg , "Bj|rn Larsen" , "Hans A. ]lien" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff , John Chandler , Frank da Cruz Subject: Re: Kermit / ISO Transfer Syntax: 7-bit vs 8-bit In-Reply-To: fdc@watsun.cc.columbia.edu message of Thu, 23 Mar 1989 19:20:02 EST Message-Id: > The underlying idea seems to be that if in a given environment (like > Hebrew or Cyrillic) most characters are from GR, then complementing the > 8th bit would give greater efficiency. So this would have to be on a > per-file (or per-language) basis. Right. As I pointed out, this need not even be restricted to text files, since some binaries would have a preponderance of bytes with the 8th bit set and could profit from the same efficiency. > It wouldn't help out the Germans or > French, who must switch between GL and GR frequently. I think you can easily prove that any scheme without a carefully tailored compression will not help the French, Germans, Swedes, etc. One possibility, of course, is to adopt a transfer coding that swaps the essential extra characters for little-used symbols -- German, in particular, requires only 7 extra letters, and though French technically requires 18, a common practice is to omit diacritical marks on upper-case vowels, thereby reducing the need to 10. Such a scheme would obviously have to be worked out for each language and would therefore entail a lot of work, but it would certainly increase the "efficiency" of transfers. 
I doubt, though, that the overhead involved in managing code selections would be worthwhile, and the complexity introduced by having an extra, generalized translation step would certainly give everyone headaches -- it's bad enough in TTY-mode IBM mainframe transfers with disk -(ETOA)-> -(TATOE)-> -(system E to A with parity)-> receiver and vice versa. > And it certainly > would not bring much benefit to the Japanese Well, that depends. Hirofumi Fujii's example was something of an extreme case, and there is certainly one mode of operation that would fit in very well, namely, using Kana to the exclusion of Kanji. The occasional word in Roman text would, in 8th-bit-complement mode, have to get 8th-bit prefixing, but the bulk of the text would go through with one byte transmitted per character. The difficulty, of course, is that Japanese has lots of homonyms, so that Kana-only text can be ambiguous. Thus, that mode is somewhat less than desirable. This brings me to the final point. I agree that the exigencies of Japanese text require more than a single 256-character alphabet, but they may be unique in that respect -- Chinese, for example, does not have a syllabary and so might use an entirely 2-byte representation for ideogram-only text (I don't know what they would do with foreign words or how frequently they would come up). The question I would raise is whether there is a need for a translation between the stored text and the transmission medium. Note the distinction: hard copy of a text file may appear to have a mixture of fonts and alphabets, but the underlying disk file has everything encoded into 1-byte units (that is, in all the schemes we seem to be considering). Kermit can transfer the bytes without knowing how to decode them and can also transfer an attribute or two informing the receiver how, in principle, to decode them. 
Therefore, there is only one reason for Kermit to define a standard transmission protocol that includes decoding the stored disk file and re-encoding in the transmission protocol, namely, the fear that a receiving machine will (A) need to use the text file locally in some form *other* than the original, (B) have a Kermit smart enough to decode the transmission protocol and re-encode in that *other* form, and (C) *not* have other software smart enough to translate the coding scheme found on the originating machine. One possible scenario I can imagine is the case of an IBM mainframe with one ETOA mapping for a combined Roman+Kana set from "EBCDIC+Kana" to ASCII+Kana and a completely different mapping (perhaps idempotent) from the mainframe Kanji codes to the JIS ones. If such a thing can happen, and if the file is to be transmitted usefully to unlike machines, then it is clearly necessary for the mainframe Kermit to decode the file before transmitting. In that case, I agree that an implementation of something like the full ISO 2022 is necessary. On the other hand, it may be that a single mapping suffices, or perhaps nobody would want to process mixed-alphabet text files on an IBM mainframe anyway. I think a fuller discussion of the actual situation would be useful here. 
John 24-Mar-89 13:59:44-GMT,14064;000000000001 Return-Path: <@cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA00903; Fri, 24 Mar 89 08:59:40 EST Message-Id: <8903241359.AA00903@watsun.cc.columbia.edu> Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 2135; Fri, 24 Mar 89 08:55:25 EST Received: from VM1.EARN-ULG.AC.BE by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2027; Fri, 24 Mar 89 08:55:23 EST Received: by BLIULG11 (Mailer R2.02) id 1684; Fri, 24 Mar 89 14:48:29 +0100 Date: Fri, 24 Mar 89 14:37:42 +0100 From: Andre' PIRARD Subject: Re: ISO/Kermit 7 vs 8 bit transfer syntax To: Frank da Cruz , Paul Placeway , Andre Pirard , Baruch Cochavy , Johan Van Wingen , Ken-ichiro Murakami , Kohichi Nishimoto , Hirofumi Fujii , "Gisbert W.Selke" , Kurt Enulf , Jacob Palme , Per Lindberg , Bj|rn Larsen , "Hans A. ]lien" , Steve Jenkins , Jean Dutertre , Gerard Gaye , David Guerlet , Bernie Eiben , Volker Edelhoff , John Chandler , Joe Doupnik In-Reply-To: Message of Thu, 23 Mar 1989 10:24:16 EST from Despite my being in a desolated empty room because of our moving, I interleave some comments with packing as long as I am still able to use the network. You will find below a document I find essential to our discussion. Thanks to Johan van Wingen for the information. It is an answer to my first note: "how would systems store multilingual character sets". In other words, "what would Kermit have to transfer". I see this ISO 10646 project as responding to the fundamental needs of leading software developers with international scope and I guess they must be longing for that. As we say, "Il n'y a pas de fumée sans feu". The task of Kermit would be much easier on these more efficient grounds (I guess translating between the 4 "forms-of-use" and between them and single-byte ISO 8859 versions). 
But I think that, despite their evident lack of performance, the present standards will continue to hold for terminal mode tied to 7-bit or 8-bit lines because they keep in line with present hardware in working by "adding more fonts". It's probably a thing to do on machines with a graphic screen even though many users will be satisfied with a single ISO version. But for that reason, the hosts driving the terminal might forget to specify which version they use and the default one should be customizable. I hope this unifying standard will come very soon to solve those intricate problems of Asian languages. The fundamental question is whether to wait or do something in the meantime that could have to be erased in the end? These Asian people have a strong vote weight. My own testimony is that we will use our single ISO version while waiting and are interested in a mere hidden byte-to-byte translation capability in file transfer in addition to the terminal mode extensions. Hear you all in one week time. André.

Date: Fri, 10 Feb 89 15:27:00 CET From: Johan van Wingen Subject: Informal Introduction to ISO 10646 To: Andre' Pirard

ISO/IEC JTC1/SC2/WG2 N 274
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
INTERNATIONAL ELECTROTECHNICAL COMMISSION
Joint Technical Committee 1
Subcommittee 2 Characters and Information Coding, Working Group 2

======================================================================
Introduction to ISO 10646 - Multiple-Octet Coded Character Set
======================================================================

A new standard is being developed within Working Group 2 of ISO/IEC JTC1/SC2 for the multiple-octet coded character set. Formal drafts will be issued during 1989. Its purpose is to provide a single character code which will permit the written form of all present-day languages throughout the world to be used within computers, to be processed and interchanged. 
All types of text written in character form will be provided for, from simple commercial documents to publication of technical reports etc. Also the bibliographic requirements of librarians will be met.

The structure of the whole code may be illustrated thus, with an octet of bits for each dimension:

[Figure: the code drawn as a stack of planes, with axes labelled Plane, Row and Cell. The front plane is the basic multi-lingual plane, showing graphic segments with alphabetic zones A00, A01, A10 and A11 and further zones J1, C1 and K1. The supplementary planes behind it are labelled (bibliographic), (Chinese), (Japanese), (Korean) and (future standardization).]

The basic multi-lingual plane will contain four segments for graphic characters, each holding 96 * 96 characters. Each segment will be divided into two zones: an alphabetic zone of 16 * 96 characters, and another zone either for the most-frequently used characters of the Chinese, Japanese and Korean ideographic scripts, or for certain special purposes.

The shaded area outside the graphic quadrants will be used for control functions. All those of ISO 6429, ISO 6937 and ISO 8613 will be available, with the same coding.

The supplementary planes will accommodate characters that overflow from the basic multi-lingual plane.

A coded character anywhere in the code may be uniquely identified by means of three octets:

[Diagram: three adjacent boxes labelled Plane-octet, Row-octet and Cell-octet, annotated "m-s" and "l-s".]

NOTE: Sequences of characters run horizontally along the rows, not vertically as in previous code tables.

The code may be used in different forms-of-use:

a) A four-octet form, in which the three octets for the character are preceded by one for systems use. Three octet coding will never be used.
b) A two-octet form, restricted exclusively to a single plane. Especially for users with alphabetic scripts, this will accommodate probably 99% of their applications.
c) A two-octet form with extension using occasional four-octets.
d) A compacted form, permitting strings of related characters to be used as single-octets.

The basic multi-lingual plane is being designed to permit easy inter-working with existing 8-bit codes. Generally, conversion will be by the table look-up technique; however, conversion with ISO 8859 parts 1,2,5,6,7,8 may use a simple algorithm.

All designation, invocation and shifting as in ISO 2022 will be avoided. It is considered that the consequent simplification of software, especially for generalized applications in the OSI environment, will make this code economically attractive despite the relatively extravagant use of bits.

The layout of the basic multi-lingual plane may be illustrated in FIGURE 1 (below), the axes being not drawn linearly. NOTE: The value of any octet is shown in simple decimal notation, e.g. 032, 255.

The contents of any of the rows are set out in detailed code tables. These are drawn on a pro-forma which shows a complete row in twelve strips, each of 16 graphic characters. Because the code is designed to be used as a whole, especially the basic multi-lingual plane, no significance attaches to whether certain characters are in the left-hand or right-hand halves of a row, or early or late in the code table. 
A character once included in the code table is not duplicated elsewhere. Therefore for any particular application characters will be taken from many different places in the code table. For example users within Greece will find Greek letters in row 040, the equivalent Latin letters they use for transliteration in row 032, and some symbols they use in row 034.

It will be trivially easy to adapt any equipment designed for the Japanese or Chinese scripts to provide all the characters of the basic multi-lingual plane. Therefore it is expected that suitable cost-effective equipment will become readily available.

The feature of fixed length coding, especially in the two-octet mode-of-use, will make this code very easy to use in high-level programming languages and other software as employed for OSI and ODA.

Hugh McG Ross, editor. Revised Oct. 1988

FIGURE 1 ISO 10646 Structure of the basic multi-lingual plane

Cell-octet axis: 000, 032 to 126 (left-hand segments), 160 to 255 (right-hand segments).
Row-octet axis: labelled 000, 032 to 126 (upper segments), 160 to 255 (lower segments).

Row label    Contents                                                   Zone
032, 033     Latin script for European languages,                       Alphabetic scripts
             ISO 8859-1 and -2 and ISO 6937-2
034          Extended symbols from ISO 8879                             Alphabetic scripts
035          Extended Latin script for all world languages              Alphabetic scripts
037          Special African and phonetic letters                       Alphabetic scripts
038          Cyrillic script for major languages;                       Alphabetic scripts
             Cyrillic for all minority languages
040          Greek script for all forms of writing                      Alphabetic scripts
042          Arabic script for all languages                            Alphabetic scripts
043          Hebrew script                                              Alphabetic scripts
044          Other scripts                                              Alphabetic scripts
048 to 126   Left: Japanese, JIS X 0208. Right: Special Purpose.        Ideographs
160          Indian scripts                                             Alphabetic
(no label)   Mathematical symbols                                       Alphabetic
(no label)   Oriental scripts                                           Alphabetic
176 to 255   Left: Chinese, GB 2312. Right: Korean, KS C 5601.          Ideographs

24-Mar-89 17:35:49-GMT,2564;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA02622; Fri, 24 Mar 89 12:30:49 EST Date: Fri, 24 Mar 1989 12:30:49 EST From: Frank da Cruz To: ISO/Kermit Discussion Group Subject: New ISO/Kermit Mailing List Message-Id: A new mailing list has been set up at Columbia for the benefit of those who are interested in taking part in the discussion of adding mechanisms for transfer of multi-language text to the Kermit file transfer protocol. You can send e-mail to everyone in this group by using any of the following addresses: isokermit@watsun.cc.columbia.edu isokermit@cunixc.cc.columbia.edu ISOKERM@CUVMA.BITNET (It might take a few days for the BITNET/EARN address to be established but the others are working now.) The isokermit mailing list currently has the following members: Christine M Gianone Joe Doupnik Paul Placeway Andre Pirard Gisbert W. Selke Ken-ichiro Murakami Hirofumi Fujii Kohichi Nishimoto Dvorah Baruch Cochavy Johan Van Wingen Kurt Enulf Jacob Palme Per Lindberg Frithjov Iversen "Bj|rn Larsen" "Hans A. ]lien" Steve Jenkins Jean Dutertre Gerard Gaye David Guerlet Volker Edelhoff Frank da Cruz This group was selected to provide a cross-section of our international Kermit correspondents, and also to get some expertise from active members of the ISO8859 discussion group. Once the members of this group have come to a reasonable consensus, the proposal will be placed before the (much) larger Kermit and ISO8859 discussion groups. 
Please respond to this message, so we will know if the mailing list is working. Let us know if you want to be removed, or if you know of someone else who should be added. Thanks! - Christine and Frank 25-Mar-89 2:58:09-GMT,1696;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA08103; Fri, 24 Mar 89 21:58:08 EST Resent-Message-Id: <8903250258.AA08103@watsun.cc.columbia.edu> Message-Id: <8903250258.AA08103@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA10105; Fri, 24 Mar 89 21:57:59 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 2491; Fri, 24 Mar 89 21:53:49 EST Received: by CUVMB (Mailer X1.25) id 3215; Fri, 24 Mar 89 21:53:48 EST Received: from JPNKEKVM by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3214; Fri, 24 Mar 89 21:53:47 EST Received: by JPNKEKVM (Mailer R2.02) id 4600; Sat, 25 Mar 89 11:57:14 JST Date: SAT, 25 MAR 89 11:56:41 JST From: Hirofumi Fujii Subject: Kermit 8th bit quoting To: Joe Doupnik , Frank da Cruz , Ken-Ichirou Murakami Resent-Date: Fri, 24 Mar 89 21:53:47 EST Resent-From: Network Mailer Resent-To: fdc@cunixc.cc.columbia.edu Sorry, I did not know that the 8th-bit quoting of Kermit is OPTIONAL. Please give me a few days to consider the transfer method for Japanese text files in a 7-bit environment (of course it is possible if we use ISO 2022 for the 7-bit environment... but I want to check the efficiency etc....). 
Hirofumi Fujii National Laboratory for High Energy Physics (KEK) 26-Mar-89 22:19:24-GMT,3916;000000000001 Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA00995; Sun, 26 Mar 89 17:14:32 EST Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA27610; Sun, 26 Mar 89 17:14:19 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 2782; Sun, 26 Mar 89 17:10:11 EST Received: from techunix.bitnet (BARUCHC) by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4155; Sun, 26 Mar 89 17:10:09 EST Return-Path: Date: Mon, 27 Mar 89 00:13:26 +0200 From: Baruch Cochavy Comments: Domain style address is "baruchc@techunix.technion.ac.il" Message-Id: <8903262213.AA26074@techunix.bitnet> To: isokermit@cunixc.cc.columbia.edu Subject: ISO/Kermit - some remarks. Here is a comment I had sent to Joe Doupnik and his reply. Hello Joe, I mail this query to you and not to the list, since my view would be a bit controversial. First, let me see if I get the situation right: 1. We all have local file formats. 2. We wish there was a way Kermit could support local file format exchange. 3. So, ISO (whatever ..) is considered as a common ground. Now, this all means that the local Kermit would have to have some knowledge of the local file format and local character set representation. Since no common file structure, nor data representation exists, this means that each and every Kermit would have to know not only local file formats, but also remote file formats. Take MS-Kermit and C-Kermit, for example. If I transfer an Alef-Bet (A Hebrew word processor) file, then *both* the sender and the receiver must be aware of the file format and representation, for this file represents Hebrew characters in the 80h-9ah, per IBM code page 972. Else, the receiver side would have to store the data in some common format and representation, hoping that it would be usable at its side. 
Now, given the number of different file formats available, and local data representation, I can see no way we can produce anything meaningful. I'm sure I got things wrong somewhere along the line. Could you please enlighten me ? Many thanks, Baruch Cochavy baruchc@techunix.BITNET baruchc@techunix.technion.ac.il >From jrd@usu.bitnet Sun Mar 26 23:31:35 1989 Baruch, Yes, you are quite correct. If there are zillions of local file formats then the local Kermit would need to know about some of them, on a local basis. The idea is to convert them, as much as possible, to some "standard" format for transmission. Needless to say, we both agree that the local formats are application specific and unless there is a simple-minded way of writing a filter program for use at run time then the Kermits would be, ah, rather large! I'll forward one of my messages to Frank on this particular item. We see eye to eye on this one. Columbia's view is, I think: yes, this is true right now. However, in the near future vendors might shift to the "standard" forms and the Kermit project will be one of the forces being applied. In addition, local filter programs could be written as standalone items, to convert applications output to ISO XXXX or another acceptable format. The terminal emulation part is not bad for many situations, but even that does not accommodate some programs now in existence (such as Alef-Bet). You may want to repeat your comments for the group since the consequences are substantial. Meanwhile, I'm preparing to add ISO ???? support in the terminal emulator as an extension of the current material (won't lose anything). I want to make any new mechanism more pleasant for your situation at the same time. Thanks for the comment, Joe D. 
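A standalone filter of the kind mentioned above could, in the simplest case, be a byte-to-byte table. The sketch below is a hypothetical illustration only: it takes the 80h-9Ah range quoted in the message as the local positions of the Hebrew letters, assumes they appear in the same order as the Hebrew letters of ISO 8859-8 (E0h-FAh), passes 7-bit codes through unchanged, and replaces anything else with "?". It says nothing about the real Alef-Bet file format, whose formatting codes are not described in this discussion.

```python
def build_table(src_first: int, src_last: int, dst_first: int,
                fallback: int = ord("?")) -> bytes:
    """Build a 256-entry translation table for use with bytes.translate().

    Codes 0-127 map to themselves, the range src_first..src_last maps to a
    range of equal length starting at dst_first, and every other 8-bit code
    maps to the fallback character (a lossy choice made for this sketch).
    """
    table = bytearray(range(128)) + bytearray([fallback] * 128)
    for offset in range(src_last - src_first + 1):
        table[src_first + offset] = dst_first + offset
    return bytes(table)


# Assumed ranges: local letters at 80h-9Ah, ISO 8859-8 letters at E0h-FAh.
LOCAL_TO_ISO = build_table(0x80, 0x9A, 0xE0)
ISO_TO_LOCAL = build_table(0xE0, 0xFA, 0x80)

local_text = b"\x80\x81\x9a ok"                    # three letters plus ASCII
transfer_text = local_text.translate(LOCAL_TO_ISO)  # what would be sent
round_trip = transfer_text.translate(ISO_TO_LOCAL)  # what the receiver stores
```

Each side needs only its own pair of tables to and from the transfer character set, so neither Kermit has to know the other machine's local code.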
----- From fdc@watsun.cc.columbia.edu Sun Mar 26 17:38:32 1989 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA01322; Sun, 26 Mar 89 17:38:32 EST Received: from watsun.cc.columbia.edu by cunixc.cc.columbia.edu (5.54/5.10) id AA28224; Sun, 26 Mar 89 17:38:20 EST Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA01319; Sun, 26 Mar 89 17:38:27 EST Date: Sun, 26 Mar 1989 17:38:26 EST From: Frank da Cruz To: Baruch Cochavy Cc: isokermit@cunixc.cc.columbia.edu Subject: Re: ISO/Kermit - some remarks. In-Reply-To: Your message of Mon, 27 Mar 89 00:13:26 +0200 Message-Id: In response to Baruch's message... We had hoped it would be clearer that we are proposing a "common intermediate representation" or "transfer syntax" for transfer of multi-language (multi-alphabet, multi-character-set) text files. Therefore, any particular Kermit program will only have to know one or more file formats for its own local computer, plus the standard Kermit transfer syntax. No Kermit program will have to know another computer's file formats. The situation for terminal emulation is obviously a little bit trickier. However, terminal emulation is not part of the Kermit file transfer protocol, but rather a feature of many Kermit programs. It is presumed (and hoped) that any PC-based Kermit program that supports the new multi-language text file transfer syntax (whatever it finally turns out to be) will also support terminal emulation in the same languages. 
- Chris & Frank 26-Mar-89 23:48:48-GMT,3216;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA01645; Sun, 26 Mar 89 18:39:46 EST Date: Sun, 26 Mar 1989 18:39:45 EST From: Frank da Cruz To: protocols@rutgers.edu Subject: Multi-alphabet text files Message-Id: We are looking for information on any standards -- corporate, de-facto, national, international, or lack of any of these -- for storage (as opposed to transmission) of textual data that contains a mixture of alphabets, for example, Roman, Hebrew, Arabic, Cyrillic, Greek, Japanese, Chinese, Korean, Cherokee, ... Commonly-used computer alphabets today include the well-known 7-bit US ASCII and its "national" variations (UK ASCII, ISO-646 with various national characters substituted for ASCII brackets, etc), the ISO 8859 family of 8-bit alphabets (Latin 1-5, Cyrillic, Hebrew, Arabic, Greek, etc), the several Japanese alphabets (JIS X 0201, JIS X 0208, etc), and so on. For transmission of text composed of more than one alphabet, we convert from local storage conventions to the international standard alphabets (e.g. ISO or JIS) and then use the mechanisms and escape sequences defined in ISO 4873 and ISO 2022 (or JIS X 0202) for switching between them. But for storing mixed-alphabet text within a computer file, what do we have? We have the "corporate standard" alphabets, such as the EBCDIC and ASCII "code pages" used on IBM mainframes and PCs, DEC Kanji, the Xerox character sets, the Macintosh character sets, and so on... Does anyone know anything about "8-bit UNIX" -- the extension of UNIX to languages other than English? How about national versions of VAX/VMS, like French, German, or Hebrew VMS? Is it true that most multi-language text files are those created by word processing programs, and are therefore in special proprietary or private formats, which include not only mechanisms for alphabet switching, but also special effects like font selection, highlighting, page formatting, etc? 
What are some popular multi-language word processing programs (for the PC, PS/2, Macintosh, etc), and what do their file formats look like? How difficult is it to separate the alphabet selection from the page formatting? This query is connected with an effort to extend the Kermit file transfer protocol to include a transfer syntax for multi-language text. This transfer syntax will probably wind up using the ISO 4873 and 2022 mechanisms for switching among ISO 8859 alphabets, with similar mechanisms applied to Japanese and other multi-byte character sets. Meanwhile, real-world examples of multi-language file formats are needed to test the proposed (and evolving) Kermit file transfer syntax against. Please respond to any of the following addresses: cmg@watsun.cc.columbia.edu KERMIT@CUVMA.BITNET fdc@watsun.cc.columbia.edu FDCCU@CUVMA.BITNET If you are interested in participating in the ensuing discussion, also ask to be added to the "isokermit" mailing list. Thanks for your help! Christine Gianone Frank da Cruz cmg@watsun.cc.columbia.edu fdc@watsun.cc.columbia.edu KERMIT@CUVMA.BITNET FDCCU@CUVMA.BITNET 27-Mar-89 12:25:13-GMT,1925;000000000001 Return-Path: Received: from rutgers.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA03044; Mon, 27 Mar 89 07:25:11 EST Received: from think.UUCP by rutgers.edu (5.59/SMI4.0/RU1.1/3.04) with UUCP id AA27232; Mon, 27 Mar 89 07:25:00 EST Received: by news.think.com; Mon, 27 Mar 89 06:53:59 EST Received: by redsox.bsw.com (smail2.5) id AA13748; 27 Mar 89 06:37:51 EST (Mon) Received: by redsox.bsw.com (5.51/smail2.5/09-10-88) id AA13744; Mon, 27 Mar 89 06:37:50 EST Date: Mon, 27 Mar 89 06:37:50 EST From: campbell@redsox.bsw.com (Larry Campbell) Message-Id: <8903271137.AA13744@redsox.bsw.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Multi-alphabet text files Newsgroups: comp.protocols.misc In-Reply-To: Organization: The Boston Software Works, Inc. You didn't mention T.61. 
That's what we use in our email gateway products (Wang/DEC/UNIX), mainly because it's specified in X.400. We would have preferred ISO 8859 because it's simpler, but the ISO document specifically says ISO 8859 is *not* to be used in any "CCITT telematic application". T.61 does have the advantage of allowing you to apply diacritic marks to *any* character, but the disadvantage that characters with diacritics take two bytes, so when you translate from a typical character set into T.61 the output can be longer than the input. I'm not familiar with PC word processors, but in the DEC and Wang word processors with which I am familiar, alphabet selection is completely orthogonal (as it should be) to font and style selection. Anyway, I have a considerable interest in this whole question, so I would appreciate being included in any mailing list you might construct. -- Larry Campbell The Boston Software Works, Inc. campbell@bsw.com 120 Fulton Street wjh12!redsox!campbell Boston, MA 02146 28-Mar-89 6:10:14-GMT,5077;000000000401 Return-Path: <@cuvmb.cc.columbia.edu:KEIBUN@JPNKEKVM.BITNET> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA08095; Tue, 28 Mar 89 01:10:13 EST Message-Id: <8903280610.AA08095@watsun.cc.columbia.edu> Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 3343; Tue, 28 Mar 89 01:05:52 EST Received: from JPNKEKVM by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6085; Tue, 28 Mar 89 01:05:50 EST Received: by JPNKEKVM (Mailer R2.02) id 1574; Tue, 28 Mar 89 15:02:28 JST Date: TUE, 28 MAR 89 15:01:22 JST From: Hirofumi Fujii Subject: Re: ISO / Kermit To: Frank da Cruz In-Reply-To: Your message of Mon, 27 Mar 1989 10:48:23 EST Dear Frank Yes, I agree to the 8bit proposal. In 8bit environment, it is obvious that 8bit data transfer is more efficient than 7bit one. 
However, as Joe mentioned, it is also true that in a 7-bit environment Kermit's 8-bit quoting mechanism may not be efficient (at best it is equivalent to the single shift of ISO-2022). For Japanese, the locking shift mechanism (or SI/SO) is more efficient than single shift (8-bit quoting) because characters appear in word units in most cases. So, how about the following?

-----------------(beginning of my proposal)----------------------

Instead of 'I8' of the original proposal, use I<F>, where <F> is the final character used in the announcer of ISO-2022. For example,

  <F>   Meaning

   A    Only G0 is used. All characters are mapped into the left half.
        Shift mechanism is not used.

   B    Map both G0 and G1 into the left half. SI and SO are used to
        switch between G0 and G1 (original proposal).

   D    In a 7-bit environment, map both G0 and G1 into GL. SI and SO are
        used to switch between G0 and G1.
        In an 8-bit environment, both GL (left half) and GR (right half)
        are used. Locking shift is not used.

  etc., in addition to the above

   8 (or something)   Full ISO-2022 is used (G0, G1, G2, G3,
                      locking shift, single shift, etc.)

   X (or something)   ISO-10646 ?!

  etc., etc., etc.

(note) 'A', 'B' and 'D' are the final characters of the announcer, <ESC><SP>F.

The above transfer protocol is initiated by the sender. If the receiving Kermit does not support the protocol requested by the sender, but supports one of the above, it returns the Y packet with that data. If the receiving Kermit does not support any of the above protocols, it returns an N packet. A Kermit which supports international character sets MUST SUPPORT AT LEAST PROTOCOL 'D'.

-----------------------( end of my proposal )----------------------

Locking-shift vs Single-shift
-----------------------------

In the case of Japanese, if the communication line is 7-bit, it is more efficient to use 'B' even if Kermit supports 8-bit quoting, because many of the characters appear in word units, i.e., locking shift (or SI/SO) is more efficient than single shift (8-bit quoting).
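The efficiency argument above can be made concrete with a small byte count. The following Python sketch is only an illustration under simplifying assumptions: it ignores Kermit's control-character prefixing, packet framing, and the case where the prefix or shift characters occur in the data, and it takes "&" as the 8th-bit prefix. The function names are hypothetical and not part of any Kermit program.

```python
SO, SI = 0x0E, 0x0F       # Shift Out / Shift In control characters
QUOTE = ord("&")          # assumed 8th-bit prefix character

def encode_with_quoting(data: bytes) -> bytes:
    """Prefix every 8-bit byte individually (behaves like a single shift)."""
    out = bytearray()
    for b in data:
        if b & 0x80:
            out.append(QUOTE)
        out.append(b & 0x7F)
    return bytes(out)

def encode_with_locking_shift(data: bytes) -> bytes:
    """Emit SO before a run of 8-bit bytes and SI when the run ends."""
    out = bytearray()
    shifted = False
    for b in data:
        high = bool(b & 0x80)
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(b & 0x7F)
    if shifted:
        out.append(SI)    # leave the stream in the unshifted state
    return bytes(out)

# A "word" of six 8-bit characters between two 7-bit characters.
word_run = bytes([0x41] + [0xB1] * 6 + [0x42])
# Isolated 8-bit characters alternating with 7-bit ones.
alternating = bytes([0x41, 0xE9, 0x42, 0xE9])

for name, sample in (("word run", word_run), ("alternating", alternating)):
    print(name,
          len(encode_with_quoting(sample)),
          len(encode_with_locking_shift(sample)))
```

For the run of six 8-bit characters the locking shift form is shorter (10 bytes against 14), while for isolated 8-bit characters the per-character prefix is shorter (6 bytes against 8), which is the trade-off described above.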
G2 and G3 character sets
------------------------

It is also better to use G2 or G3 for Japanese. However, it requires a more complicated shift mechanism. So, I think it should be optional. Actually, I have checked my Japanese mail, and found that use of the Katakana character set (Hankaku katakana) is quite rare. Therefore, I think G0 and G1 are enough in many cases.

ISO-10646
---------

I have no opinion about ISO-10646. The above 'D' in the 8-bit environment uses both GL and GR, and does not use the locking-shift mechanism. This is very similar to ISO-10646. The only difference is that ISO-10646 is a multi-byte code. Therefore, I think it is easy to extend to the ISO-10646 feature within the above scheme. This is one of the reasons why I chose the 'D' protocol as the default.

Terminal Emulator
-----------------

I think that, in the Kermit protocol, it is not necessary to say anything about the terminal emulator. It is machine dependent and can be handled within the local routines. Actually, my Kermit (MSVP98) already has ISO-2022 features (it supports the G0, G1, G2 and G3 character sets, and all locking-shift and single-shift mechanisms) within the scheme of MS-Kermit. I have not modified any machine-independent routines of MS-Kermit. Joe has separated the Kermit modules very nicely and clearly.
-------------- Hirofumi Fujii National Laboratory for High Energy Physics (KEK) KEIBUN@JPNKEKVM.BITNET From @cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET Thu Mar 30 10:25:51 1989 Return-Path: <@cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET> Received: from columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA18213; Thu, 30 Mar 89 10:25:51 EST Received: from cuvmb.cc.columbia.edu by columbia.edu (5.59++/0.3) with SMTP id AA16472; Thu, 30 Mar 89 10:24:07 EST Resent-Message-Id: <8903301524.AA16472@columbia.edu> Message-Id: <8903301524.AA16472@columbia.edu> Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4455; Thu, 30 Mar 89 10:19:38 EST Received: from VM1.EARN-ULG.AC.BE by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0380; Thu, 30 Mar 89 10:19:36 EST Received: by BLIULG11 (Mailer R2.03B) id 0757; Thu, 30 Mar 89 17:21:37 +0200 Resent-Date: Thu, 30 Mar 89 17:16:27 +0200 Resent-From: Andre' PIRARD Resent-To: ISO/Kermit Discussion Group Date: Fri, 24 Mar 89 14:37:42 +0100 From: Andre' PIRARD Subject: Re: ISO/Kermit 7 vs 8 bit transfer syntax To: Frank da Cruz , Joe Doupnik In-Reply-To: Message of Thu, 23 Mar 1989 10:24:16 EST from Well, I am back to the network on and off and I see a mailing list setup. I am not sure this message I sent before being disconnected made it through. It is the test reply asked for anyway. Andre. ----------------------------Original message---------------------------- Despite my being in a desolated empty room because of our moving, I interleave some comments with packing as long as I am still able to use the network. You will find below a document I find essential to our discussion. Thanks to Johan van Wingen for the information. It is an answer to my first note: "how would systems store multilingual character sets". In other words, "what would Kermit have to transfer". 
I see this ISO 10646 project as responding to the fundamental needs of leading software developers with international scope and I guess they must be longing for that. As we say, "Il n'y a pas de fumée sans feu". The task of Kermit would be much easier on these more efficient grounds (I guess translating between the 4 "forms-of-use" and between them and single-byte ISO 8859 versions).

But I think that, despite their evident lack of performance, the present standards will continue to hold for terminal mode tied to 7-bit or 8-bit lines because they keep in line with present hardware in working by "adding more fonts". It's probably a thing to do on machines with a graphic screen even though many users will be satisfied with a single ISO version. But for that reason, the hosts driving the terminal might forget to specify which version it uses and the default one should be customizable.

I hope this unifying standard will come very soon to solve those intricate problems of Asian languages. The fundamental question is whether to wait or do something in the meantime that could have to be erased in the end? These Asian people have a strong vote weight.

My own testimony is that we will use our single ISO version while waiting and are interested by a mere hidden byte-to-byte translation capability in file transfer in addition to the terminal mode extensions.

Hear you all in one week time.

André.
Date: Fri, 10 Feb 89 15:27:00 CET
From: Johan van Wingen
Subject: Informal Introduction to ISO 10646
To: Andre' Pirard

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION          ISO/IEC JTC1/SC2/WG2
INTERNATIONAL ELECTROTECHNICAL COMMISSION                              N 274
Joint Technical Committee 1
Subcommittee 2 Characters and Information Coding, Working Group 2

======================================================================
Introduction to ISO 10646 - Multiple-Octet Coded Character Set
======================================================================

A new standard is being developed within Working Group 2 of ISO/IEC JTC1/SC2 for the multiple-octet coded character set. Formal drafts will be issued during 1989.

Its purpose is to provide a single character code which will permit the written form of all present-day languages throughout the world to be used within computers, to be processed and interchanged. All types of text written in character form will be provided for, from simple commercial documents to publication of technical reports etc. Also the bibliographic requirements of librarians will be met.

The structure of the whole code may be illustrated thus, with an octet of bits for each dimension:

[Diagram: a stack of planes, each a grid of rows and cells. The front plane is the basic multi-lingual plane, divided into four segments with alphabetic zones A00, A01, A10 and A11 and further zones J1, C1 and K1. Behind it are the supplementary planes, labelled (bibliographic), (Chinese), (Japanese), (Korean) and (future standardization).]

The basic multi-lingual plane will contain four segments for graphic characters, each holding 96 * 96 characters.

Each segment will be divided into two zones: an alphabetic zone of 16 * 96 characters, and another zone either for the most-frequently used characters of the Chinese, Japanese and Korean ideographic scripts, or for certain special purposes.

The shaded area outside the graphic quadrants will be used for control functions. All those of ISO 6429, ISO 6937 and ISO 8613 will be available, with the same coding.

The supplementary planes will accommodate characters that overflow from the basic multi-lingual plane.

A coded character anywhere in the code may be uniquely identified by means of three octets:

      +--------------+--------------+--------------+
 m-s  | Plane-octet  |  Row-octet   |  Cell-octet  |  l-s
      +--------------+--------------+--------------+

NOTE: Sequences of characters run horizontally along the rows, not vertically as in previous code tables.

The code may be used in different forms-of-use:

a) A four-octet form, in which the three octets for the character are preceded by one for systems use. Three octet coding will never be used.

b) A two-octet form, restricted exclusively to a single plane. Especially for users with alphabetic scripts, this will accommodate probably 99% of their applications.

c) A two-octet form with extension using occasional four-octets.

d) A compacted form, permitting strings of related characters to be used as single-octets.

The basic multi-lingual plane is being designed to permit easy inter-working with existing 8-bit codes.
Generally, conversion will be by the table look-up technique; however, conversion with ISO 8859 parts 1,2,5,6,7,8 may use a simple algorithm.

All designation, invocation and shifting as in ISO 2022 will be avoided. It is considered that the consequent simplification of software, especially for generalized applications in the OSI environment, will make this code economically attractive despite the relatively extravagant use of bits.

The layout of the basic multi-lingual plane may be illustrated in FIGURE 1 (next page), the axes being not drawn linearly.

NOTE: The value of any octet is shown in simple decimal notation, e.g. 032, 255.

The contents of any of the rows are set out in detailed code tables. These are drawn on a pro-forma which shows a complete row in twelve strips, each of 16 graphic characters.

Because the code is designed to be used as a whole, especially the basic multi-lingual plane, no significance attaches to whether certain characters are in the left hand or right-hand halves of a row, or early or late in the code table. A character once included in the code table is not duplicated elsewhere.

Therefore for any particular application characters will be taken from many different places in the code table. For example users within Greece will find Greek letters in row 040, the equivalent Latin letters they use for transliteration in row 032, and some symbols they use in row 034.

It will be trivially easy to adapt any equipment designed for the Japanese or Chinese scripts to provide all the characters of the basic multi-lingual plane. Therefore it is expected that suitable cost-effective equipment will become readily available.
The feature of fixed length coding, especially in the two-octet mode-of-use, will make this code very easy to use in high-level programming languages and other software as employed for OSI and ODA.

Hugh McG Ross, editor. Revised Oct. 1988

FIGURE 1   ISO 10646   Structure of the basic multi-lingual plane

Cell-octets run across the figure (000, 032 to 126, 160 to 255); row-octets run down it. Entries with no cell range span both graphic halves of the row.

  Row-octet   Contents
  ---------   ----------------------------------------------------------
  032, 033    Latin script for European languages;
              ISO 8859-1 and -2 and ISO 6937-2           \
  034         Extended symbols from ISO 8879              |
  035         Extended Latin script for all world         |
              languages                                   |
  037         Special African and phonetic letters        |  Alphabetic
  038         Cyrillic script for major languages;        |  scripts
              Cyrillic for all minority languages         |
  040         Greek script for all forms of writing       |
  042         Arabic script for all languages             |
  043         Hebrew script                               |
  044         Other scripts                              /
  048 - 126   Cells 032-126: Japanese ideographs, JIS X 0208
              Cells 160-255: Special purpose
  160         Indian scripts                             \
              Mathematical symbols                        |  Alphabetic
              Oriental scripts                           /
  176 - 255   Cells 032-126: Chinese ideographs, GB 2312
              Cells 160-255: Korean ideographs, KS C 5601

30-Mar-89 22:51:43-GMT,18689;000000000001
Return-Path:
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19916; Thu, 30 Mar 89 17:40:00 EST
Date: Thu, 30 Mar 1989 17:39:59 EST
From: Christine M Gianone
To: isokermit
Subject: ISO/Kermit Proposal Draft #2, Part 1 of 4
Message-Id:

A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

Christine Gianone and Frank da Cruz
Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA

DRAFT NUMBER 2
March 30, 1989

ABSTRACT

An extension to the Kermit file transfer protocol is proposed to allow transfer of multi-language text files between unlike computer systems. The new transfer syntax uses the 8-bit character sets defined in the ISO 8859 and similar standards, and mechanisms for switching among them defined in ISO 2022. Japanese and other multi-byte character sets are handled by similar mechanisms.

SUMMARY OF CHANGES SINCE DRAFT #1, March 1, 1989

 - Summary of current standards expanded and clarified.
 - Kermit file transfer syntax expanded to allow full range of ISO 2022 mechanisms in both 7-bit and 8-bit environments.
 - ISO 2022 announcers added to attribute packet.
 - Incorporation of Japanese character sets.

ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments on the first draft. In most cases, their suggestions or the information they provided have been incorporated into the second draft.
John Chandler (Harvard/Smithsonian Center for Astrophysics, USA) Joe Doupnik (Utah State University, USA) Hirofumi Fujii (Japan National Laboratory of High Energy Physics) Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo) Jacob Palme (Stockholm University, Sweden) Andre Pirard (University of Liege, Belgium) Paul Placeway (Ohio State University, USA) Gisbert W. Selke (University of Bonn, West Germany) Johan van Wingen (Leiden, Netherlands) PREFACE This is a DRAFT proposal, based upon a reading of current character-set standards, some familiarity with the issues involved, and limited testing with devices that claim to implement these standards (such as the DEC VT340 terminal). Readers are urged to correct us if we have misinterpreted the standards, to fill in missing information, and to make any comments or criticisms they desire. Readers with knowledge of real-world multi-alphabet applications and file formats are especially urged to comment on the suitability of this proposal. THIS IS NOT A FINAL REVISION. ANY AND ALL PROPOSED TECHNIQUES AND MECHANISMS MAY CHANGE, BASED UPON FURTHER DISCUSSION. In fact, this draft is still quite rough and may contain inconsistencies and mistakes -- please point them out! Even after the draft has reached the "final" stage, it is fully expected that changes will be necessary once programmers start to actually transform the protocol description into working code. INTRODUCTION The Kermit file transfer protocol makes a distinction between text and binary files, and it defines a particular transfer syntax for text files, namely 7-bit ASCII characters with carriage return and linefeed (CRLF) after each line, so that text may be stored in useful fashion on any computer to which it is transferred. Each Kermit program knows how to translate from the local text-file storage conventions to Kermit's transfer syntax, and vice versa. 
In this way, text files can be transferred between unlike systems (say, an EBCDIC card-oriented system and an ASCII stream file system) and remain useful after transfer. Now that the world's computer users have begun to find US ASCII insufficient for their uses, and standards organizations are adopting standard codes for the world's other alphabets, and vendors like IBM, DEC, and Apple have begun to make these characters available on their displays (albeit in different positions), and people are beginning to produce increasing numbers of multilingual documents, Kermit's text file transfer syntax must be extended to allow for texts in a mixture of alphabets. It is best if this can be done in line with currently existing and evolving information interchange standards, including ANSI X3.4 (ASCII), ISO 646, ISO 4873, ISO 2022, ISO 8859, etc. These and other standards which we believe to be pertinent are listed in Appendix A. To transfer text files containing a mixture of alphabets, we propose to treat Kermit data transfer in the same manner as ISO 2022 treats a terminal-to-host data transmission, by embedding specific escape sequences and control characters in the data stream for the purposes of alphabet identification and switching. Any of the world's standard registered alphabets (Table 5) may be included in this scheme, no matter whether they are single-byte codes (such as ASCII or ISO Latin Alphabet 1) or multi-byte codes (such as Japanese Kanji), and any number of alphabets may be used within a single text file. This extension to the Kermit protocol will be called ISO-2022 Transfer Syntax. Like all other Kermit protocol extensions, this one will be optional. In a Kermit program that supports ISO-2022 Transfer Syntax, commands will be included for the user to enable and disable this feature. 
When ISO-2022 Transfer Syntax is enabled, the sending Kermit program will translate from the local storage formats and conventions for multi-language text into ISO-2022 Transfer Syntax, and the receiving Kermit will translate from the transfer syntax into its own local storage formats and conventions (or it may elect not to do so). Therefore, each Kermit program will have to know only about the transfer syntax and its own computer's local formats.

WHY ISO 2022?

Many different multi-alphabet transfer syntaxes are imaginable. Our aim has been to settle on a single syntax that achieves the best balance among the following requirements:

 1. The ability to represent any character in any coded character set
 2. The ability to uniquely identify each coded character set
 3. The ability to switch among different coded character sets
 4. The ability to work in both the 7-bit and 8-bit transmission environments
 5. Minimization of transmission overhead within the Kermit encoding scheme
 6. Compatibility with existing applicable standards
 7. Fairness to all nationalities

ISO 2022, when used in conjunction with the ISO 8859 and other registered single-byte and multi-byte character sets, would seem to fulfill all the listed requirements. Any character in any registered character set (see Appendix C) can be transmitted unambiguously, any number of character sets can be used in a single transmission, mechanisms are available for both 7-bit and 8-bit transmission, and transmission overhead can be minimized by selection of single and locking shifts. And finally, the flexibility of ISO 2022 can result in fair treatment for each alphabet. For example, a language like Russian can be transmitted efficiently in the 7-bit environment because shifting between G0 and G1 is relatively infrequent, whereas a language like French shifts very frequently between G0 and G1 to get at the accented characters. ISO 2022 allows locking shifts in the Russian case and single shifts in the French case.
And since ISO 2022 allows for both single-byte and multi-byte character sets, and because it has compatible counterparts in Asia, the same scheme can apply to Asian character sets. This will allow Kermit to transfer computer text containing virtually any mixture of languages.

HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set. This set is divided into two parts:

 1. 33 "control characters" (characters 0 through 31, and character 127).
 2. 95 "graphic characters" (32-126).

ISO 646 allows for national variations (explained later), but an International Reference Version (IRV) is defined, which is identical to US ASCII except in the appearance of the graphic used for character 36 ("$" in US ASCII and currency sign in the IRV) and for character 126 (tilde "~" in US ASCII, overline in the IRV).

"Graphics" means printing characters -- characters that make ink appear on the page or phosphor glow on the screen (as opposed to pixel- or line-oriented picture graphics).

The ASCII / IRV character set is shown in Figure 1, arranged in a table of 16 rows and 8 columns.

_____________________________________________________________________________

     00  01  02  03  04  05  06  07
    +---+---+---+---+---+---+---+---+
00  |NUL DLE| SP  0   @   P   `   p |
01  |SOH DC1| !   1   A   Q   a   q |
02  |STX DC2| "   2   B   R   b   r |
03  |ETX DC3| #   3   C   S   c   s |
04  |EOT DC4| $   4   D   T   d   t |
05  |ENQ NAK| %   5   E   U   e   u |
06  |ACK SYN| &   6   F   V   f   v |
07  |BEL ETB| '   7   G   W   g   w |
08  |BS  CAN| (   8   H   X   h   x |
09  |HT  EM | )   9   I   Y   i   y |
10  |LF  SUB| *   :   J   Z   j   z |
11  |VT  ESC| +   ;   K   [   k   { |
12  |FF  FS | ,   <   L   \   l   | |
13  |CR  GS | -   =   M   ]   m   } |
14  |SO  RS | .   >   N   ^   n   ~ |
15  |SI  US | /   ?   O   _   o  DEL|
    +---+---+---+---+---+---+---+---+

Figure 1: The ASCII / ISO-646 International Reference Version Character Set
_____________________________________________________________________________

Characters are often referred to by their column and row position in this type of table. For example, character 05/08 in Figure 1 is "X".
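The column/row notation is simply the high and low four bits of the character code written in decimal. A minimal Python sketch of the conversion follows; the function names are chosen here for illustration only and are not part of the proposal.

```python
def position_to_code(column: int, row: int) -> int:
    """Convert a column/row position such as 05/08 to its numeric code."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0-15")
    return column * 16 + row

def code_to_position(code: int) -> str:
    """Convert a numeric code (0-255) to column/row notation."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    return f"{code >> 4:02d}/{code & 0x0F:02d}"

print(position_to_code(5, 8), chr(position_to_code(5, 8)))  # 05/08
print(code_to_position(27))    # ESC
print(code_to_position(127))   # DEL
```

With these definitions 05/08 gives code 88, the letter "X" as in the example above, and codes 27 and 127 come out as 01/11 and 07/15.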
Columns 00-01, plus character 07/15, comprise the control set. Columns 02-07, minus character 07/15, comprise the graphics.

8-bit character sets are described in ISO 4873 and ANSI X3.41 (see Appendix A). An 8-bit character set has two sides. Each side has a control set and a graphics set. The "left half" consists of the control set C0 and the graphics set GL (Graphics Left). GL has 94 characters, and corresponds to ASCII (and ISO 646) positions 02/01-07/14. SP (space) and DEL are not considered part of GL. All the characters in the left half have their high-order, or 8th, bit set to zero, and are therefore representable in 7 bits. The "right half" consists of the control set C1 and the graphics set GR (Graphics Right). All characters in the right half have their 8th bits set to one. Figure 2 shows the layout of an 8-bit character set.

_____________________________________________________________________________

    <--C0--> <---------GL---------->   <--C1--> <---------GR---------->
     00  01  02  03  04  05  06  07     08  09  10  11  12  13  14  15
    +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
00  |NUL DLE| SP  0   @   P   `   p |  |    DCS|---+                   |
01  |SOH DC1| !   1   A   Q   a   q |  |    PU1|                       |
02  |STX DC2| "   2   B   R   b   r |  |    PU2|                       |
03  |ETX DC3| #   3   C   S   c   s |  |    STS|                       |
04  |EOT DC4| $   4   D   T   d   t |  |IND CCH|                       |
05  |ENQ NAK| %   5   E   U   e   u |  |NEL MW |                       |
06  |ACK SYN| &   6   F   V   f   v |  |SSA SPA|                       |
07  |BEL ETB| '   7   G   W   g   w |  |ESA EPA|                       |
08  |BS  CAN| (   8   H   X   h   x |  |HTS    |       (special        |
09  |HT  EM | )   9   I   Y   i   y |  |HTJ    |       graphics)       |
10  |LF  SUB| *   :   J   Z   j   z |  |VTS    |                       |
11  |VT  ESC| +   ;   K   [   k   { |  |PLD CSI|                       |
12  |FF  FS | ,   <   L   \   l   | |  |PLU ST |                       |
13  |CR  GS | -   =   M   ]   m   } |  |RI  OSC|                       |
14  |SO  RS | .   >   N   ^   n   ~ |  |SS2 PM |                       |
15  |SI  US | /   ?   O   _   o  DEL|  |SS3 APC|                   +---|
    +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
    <--C0--> <---------GL---------->   <--C1--> <---------GR---------->

Figure 2: An 8-Bit Character Set
_____________________________________________________________________________

GR character sets can have either 94 or 96 characters. A 94-character GR set begins in position 10/01 and ends in position 15/14, with Space (SP) occupying position 10/00 and DEL in position 15/15, just like G0 (the corners shown in GR in the diagram). A 96-character set has graphic characters in all 96 positions, 10/00 through 15/15.

An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters. This number is sufficient to represent the characters in many of the world's written languages, but not necessarily sufficient to represent all the graphic symbols required in a given application, for instance a multi-language document.

To represent a greater number of graphic characters, ISO 4873 defines four "intermediate sets" of graphic characters, of either 94 or 96 characters each. These are called G0, G1, G2, and G3. The G0 set never has more than 94 graphic characters, and G1-G3 can have up to 96 each. Therefore there can be up to:

  (2 x 32) + 94 + (3 x 96) = 446

characters simultaneously within the repertoire of a given device.

These intermediate graphics sets are kept in tables in the memory of the terminal or computer. One of the intermediate sets is assigned to GL, and (in the 8-bit communications environment) another may be assigned to GR. When the terminal or computer receives a data byte, the numeric value of its bits denotes the position of the character in GL or GR. For example, the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII. In the 8-bit environment, any byte with its 8th bit set to zero is from GL, and any byte with its 8th bit set to one is from GR. A language like English can be represented adequately in GL, because all the required characters fit there.
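The layout in Figure 2 and the repertoire count can be checked with a few lines of Python. This is a sketch only: the `classify` function and its region labels are illustrative names, and it assumes a 96-character GR set (for a 94-character set, positions 10/00 and 15/15 would not be graphic).

```python
def classify(byte: int) -> str:
    """Return the region of an 8-bit code table that a byte value falls in."""
    if not 0 <= byte <= 255:
        raise ValueError("byte must be in the range 0-255")
    right_half = bool(byte & 0x80)     # 8th bit set: right half
    low7 = byte & 0x7F
    if low7 < 0x20:                    # columns 00-01 or 08-09
        return "C1" if right_half else "C0"
    if not right_half and low7 == 0x20:
        return "SP"
    if not right_half and low7 == 0x7F:
        return "DEL"
    return "GR" if right_half else "GL"

# Two control sets of 32, a 94-character G0, and three 96-character sets.
repertoire = (2 * 32) + 94 + (3 * 96)

print(classify(0b01000001))   # uppercase A, position 04/01
print(classify(0xE9))         # a right-half graphic character
print(classify(0x9B))         # CSI, a C1 control
print(repertoire)
```

The byte 01000001 falls in GL as stated above, and the repertoire works out to 446.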
When a language has more than 94 characters, two techniques are used to represent all the characters:

1. For Roman-alphabet languages, put ASCII (or the ISO-646 IRV) in GL and the special characters (like accented letters) in GR. French and German are examples.

2. For languages with many symbols (e.g. where a symbol is assigned to each word, rather than to each sound), represent each character with multiple bytes rather than one byte. Japanese Kanji, for example, uses a 2-byte code. A multibyte code may be assigned to G0, G1, G2, or G3, just like a single-byte code.

So far we have a terminal or computer with an "active" GL/GR character set, and four intermediate character sets G0, G1, G2, and G3. How do we assign actual character sets to G0-G3, and how do we associate the intermediate character sets with the active character set?

Selection of character sets is accomplished using special control characters or escape sequences embedded within the data stream as described in ISO 2022. An escape sequence is used to DESIGNATE a particular alphabet (such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular intermediate graphics set (G0, G1, G2, or G3). A shift function is used to INVOKE a particular intermediate graphics set into GL or GR.

In our discussion, we use the following notation (numbers are decimal unless otherwise noted):

   <ESC>  Escape (ASCII 27, character 01/11)
   <SP>   Space (ASCII 32, character 02/00)
   <SO>   Shift Out (Ctrl-N, ASCII 14, character 00/14)
   <SI>   Shift In (Ctrl-O, ASCII 15, character 00/15)

Table 1 shows the alphabet designators and shift functions for single-byte and multi-byte character sets. The same escape sequences are used for character set designation in both the 7-bit and 8-bit environments. The character which is substituted for "F" identifies the actual character set to be used; these are listed in Table 5.

The shift functions may be either locking or single. "Locking shift" is like shift-lock on a typewriter.
It means that all subsequent characters until the next shift are to be taken from the designated intermediate character set. "Single shift" applies only to the character (either single or multibyte) that follows it immediately, but single shift functions are only available for the G2 and G3 sets. Locking shift functions remain in effect across alphabet changes.

_____________________________________________________________________________

 Escape Sequence  Function                                       Invoked By

   <ESC>(F        assigns 94-character graphics set "F" to G0.   SI  or LS0
   <ESC>)F        assigns 94-character graphics set "F" to G1.   SO  or LS1
   <ESC>*F        assigns 94-character graphics set "F" to G2.   SS2 or LS2
   <ESC>+F        assigns 94-character graphics set "F" to G3.   SS3 or LS3
   <ESC>-F        assigns 96-character graphics set "F" to G1.   SO  or LS1
   <ESC>.F        assigns 96-character graphics set "F" to G2.   SS2 or LS2
   <ESC>/F        assigns 96-character graphics set "F" to G3.   SS3 or LS3
   <ESC>$(F       assigns multibyte character set "F" to G0.     SI  or LS0
   <ESC>$)F       assigns multibyte character set "F" to G1.     SO  or LS1
   <ESC>$*F       assigns multibyte character set "F" to G2.     SS2 or LS2
   <ESC>$+F       assigns multibyte character set "F" to G3.     SS3 or LS3

          Table 1: Escape Sequences for Alphabet Designation

(Note, <ESC>$F was used in earlier versions of ISO 2022 to assign a multibyte character set to G0, and no provisions were made to assign multibyte character sets to G1-G3.)
_____________________________________________________________________________

In the 7-bit environment, only one character set, GL, can be active at a time. The active character set can be selected from among the intermediate sets G0-G3 by the shifts shown in Table 2. Control characters from C0 are transmitted as is, and those from the C1 set are sent as <ESC> followed by the character whose value is the C1 value minus 64. For example, the C1 character 10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").
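
The two mechanical rules just described, building a designating escape sequence from Table 1 and representing a C1 control in the 7-bit environment as the Escape character (ASCII 27) followed by the value minus 64, can be sketched as follows. This is a hedged illustration only: the function names designate, c1_to_7bit, and c1_from_7bit are hypothetical and do not come from any Kermit program.

```python
ESC = 0x1B  # Escape, ASCII 27, character 01/11

# Intermediate characters from Table 1, keyed by (set size, target G-set).
# A 96-character set cannot be designated to G0, so (96, 0) is absent.
_INTERMEDIATES = {
    (94, 0): b"(", (94, 1): b")", (94, 2): b"*", (94, 3): b"+",
    (96, 1): b"-", (96, 2): b".", (96, 3): b"/",
}


def designate(final: str, g: int, size: int = 94, multibyte: bool = False) -> bytes:
    """Build the escape sequence that designates set `final` to G0..G3."""
    if multibyte:
        # Multibyte sets use '$' followed by the 94-set intermediate.
        size = 94
    key = (size, g)
    if key not in _INTERMEDIATES:
        raise ValueError(f"no designator for a {size}-character set in G{g}")
    prefix = b"$" if multibyte else b""
    return bytes([ESC]) + prefix + _INTERMEDIATES[key] + final.encode("ascii")


def c1_to_7bit(byte: int) -> bytes:
    """Represent a C1 control (128..159) as ESC plus (value - 64)."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 control character")
    return bytes([ESC, byte - 64])


def c1_from_7bit(seq: bytes) -> int:
    """Inverse of c1_to_7bit: recover the 8-bit C1 value."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit representation of a C1 control")
    return seq[1] + 64


if __name__ == "__main__":
    print(designate("B", 0))                  # ASCII to G0
    print(designate("A", 1, size=96))         # a 96-character set to G1
    print(designate("B", 1, multibyte=True))  # a multibyte set to G1
    print(c1_to_7bit(129))                    # the example from the text
```

The final letters used in the demonstration ("B", "A") are simply examples of the "F" placeholder; which letter identifies which alphabet is a matter for Table 5.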
_____________________________________________________________________________

 Shift  Representation  Name             Function

  SI     Ctrl-O         Shift In         invoke G0 into GL
  SO     Ctrl-N         Shift Out        invoke G1 into GL
  LS2    <ESC>n         Locking Shift 2  invoke G2 into GL
  LS3    <ESC>o         Locking Shift 3  invoke G3 into GL
  SS2    <ESC>N         Single Shift 2   select single character from G2
  SS3    <ESC>O         Single Shift 3   select single character from G3

             Table 2: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________

30-Mar-89 22:48:55-GMT,19888;000000000001
Return-Path:
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19919; Thu, 30 Mar 89 17:40:55 EST
Date: Thu, 30 Mar 1989 17:40:55 EST
From: Christine M Gianone
To: isokermit
Subject: ISO/Kermit Proposal Draft #2, Part 2 of 4
Message-Id:

In the 8-bit environment two character sets, GL and GR, can be active at once. A GL character is selected by a byte whose 8th bit is zero, and a GR character by a byte whose 8th bit is one. The actual character sets assigned to GL and GR are selected by the shifts shown in Table 3. Control characters from both the C0 and C1 sets are sent as is.

_____________________________________________________________________________

 Shift  Representation  Name                   Function

  LS0    Ctrl-O         Locking Shift 0        invoke G0 into GL
  LS1    Ctrl-N         Locking Shift 1        invoke G1 into GL
  LS2    <ESC>n         Locking Shift 2        invoke G2 into GL
  LS3    <ESC>o         Locking Shift 3        invoke G3 into GL
  LS1R   <ESC>~         Locking Shift 1 Right  invoke G1 into GR
  LS2R   <ESC>}         Locking Shift 2 Right  invoke G2 into GR
  LS3R   <ESC>|         Locking Shift 3 Right  invoke G3 into GR
  SS2    08/14          Single Shift 2         select single character from G2
  SS3    08/15          Single Shift 3         select single character from G3

             Table 3: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________

So we have a 3-tiered system. At the bottom tier lie all the world's coded character sets.
We can designate up to four of them, one to each of the intermediate graphics sets G0, G1, G2, and G3, using the escape sequences shown in Table 1. The terminal or computer keeps each of the selected intermediate sets in memory. The terminal or computer also has one active set, composed of GR and GL. The intermediate sets are invoked to GL or GR (one at a time) by the shifts SO, SI, LS0, LS1, etc.

A simplified diagram for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed diagrams of both the 7-bit and 8-bit environments). On a more sophisticated output device, Figure 3 would contain numerous arrows pointing upwards to demonstrate the operation of the designators and shifts.

_____________________________________________________________________________

                  +--+--------+    +--+--------+
                  |C0|   GL   |    |C1|   GR   |
                  |  |        |    |  |        |        8-Bit
                  |  |        |    |  |        |        Code
                  |  |        |    |  |        |        In Use
                  +--+--------+    +--+--------+

         LS0        LS1,LS1R     LS2,LS2R     LS3,LS3R      Shifts
                                 SS2          SS3

      +--------+   +--------+   +--------+   +--------+     Intermediate
      |        |   |        |   |        |   |        |     Graphics
      |   G0   |   |   G1   |   |   G2   |   |   G3   |     Sets
      |        |   |        |   |        |   |        |
      +--------+   +--------+   +--------+   +--------+

  Alphabet
  Designation      <ESC>(B    <ESC>-A    <ESC>-B    <ESC>-L    <ESC>$)B
  Sequences

  The world's     +--------+ +--------+ +--------+ +--------+ +--------+
  registered      |  ISO   | |  ISO   | |  ISO   | |  ISO   | | JIS X  |
  character       |  646   | | Latin  | | Latin  | | Latin  | |  0208  |
  sets            |(ASCII) | |   1    | |   2    | |Cyrillic| | Kanji  |
                  +--------+ +--------+ +--------+ +--------+ +--------+

         Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________

To understand the three-tiered design of ISO 2022, imagine a computer programmed to display a mixture of character sets on its screen. A large collection of fonts might be stored on the disk, one font per file. These are the character sets of the bottom tier. When a font is needed, it will be read from the disk and stored in memory in an array, for rapid access.
If several fonts are needed, they will be stored in several arrays. These arrays are the intermediate character sets, G0-G3. When a data byte arrives to be displayed, the actual graphic representation is taken from GL or GR (depending on the byte's 8th bit). GL is associated with one of the intermediate graphic sets, and GR with another. If no more than four character sets are used, then each one needs to be read from the disk only once, and display is rapid and efficient thereafter.

ANNOUNCING ISO 2022 FACILITIES

A large portion of ISO 2022 is devoted to describing how 8-bit characters may be transmitted on a 7-bit communication path, for example when parity is in use. In the 7-bit environment, there is only GL -- no GR. Therefore, all characters are transmitted with their 8th bit removed, and shifts are used to specify which intermediate set they belong to.

In fact, there are many possible ways to use the ISO 2022 code extension facilities within both 7-bit and 8-bit environments. For example, the sender may inform the receiver in advance whether G1, G2, or G3 will be used, etc. At the beginning of any particular data transfer, the facilities that actually will be used can be announced using a sequence of the form <ESC><SP>F. These sequences are listed in Appendix B.

CHARACTER SETS

Many coded character sets are used in the transmission of textual data between computers and terminals, in telegraphy, in the international Teletex system, and in many other telecommunications applications. From our point of view, these fall into two categories: those that are standardized, well-defined, and registered in the International Register of Coded Character Sets under the provisions of ISO 2375, "Data Processing - Procedure for Registration of Escape Sequences", under the authority of the ECMA (European Computer Manufacturers Association), and those that are not.
Kermit's ISO 2022 Transfer Syntax will be able to transfer files containing REGISTERED 8-bit single-byte or multi-byte character sets. Registration implies that international standards bodies -- which typically include representatives of major computer companies -- have agreed upon the character set, so that variations are unlikely to occur. There must be an unambiguous way for the sender to identify each alphabet for the receiver, and registration of an approved character set results in a unique designating escape sequence. Currently, there are several major categories of registered 8-bit character sets: the ISO 8859 family of 8-bit alphabets, the CCITT telegraphy alphabets, and the Asian multibyte alphabets. These are summarized in Appendix C. KERMIT FILE TRANSFER The Kermit file transfer protocol currently defines two syntaxes for data transfer: TEXT, in which characters are represented in ASCII, and records are terminated by Carriage-Return Linefeed. The sending program translates from local text storage format (e.g. EBCDIC card images) to CRLF-terminated ASCII records, and the receiving program translates from CRLF-terminated ASCII records into its own local text file storage format (e.g. LF-terminated ASCII records). This is Kermit's default file transfer syntax, and it may also be selected by the Kermit command SET FILE TYPE TEXT. BINARY, in which no transformations are done at all. This mode must be requested explicitly by the Kermit user with the command SET FILE TYPE BINARY. The original assumption was that the file transfer syntax would never change, and would therefore be a function of the file type, text or binary. Henceforth, this mode of operation will be called Kermit's NORMAL TRANSFER SYNTAX. Different computer systems and software packages have different conventions for representing, storing, and displaying mixed-alphabet textual data. 
Such data can be transferred by Kermit using normal transfer syntax, but it will only make sense when transferred to a system that uses the same representational conventions (analogous to binary file transfer). To transfer mixed-alphabet textual data between systems that use different conventions, a new mechanism is required. By specifying ISO-2022 TRANSFER SYNTAX as the common intermediate representation, we ensure that any particular Kermit program will only have to know about its own local file formats and the standard transfer syntax -- it will never have to know anything about file formats on other kinds of computers (in the same way that Kermit's normal text file syntax works now). The extension proposed here will allow a Kermit program that has specific knowledge of the local file format (or formats) for storing multilingual or multi-alphabet text to translate between these system- and application-specific formats and the new syntax to be used during file transfer. SELECTING ISO-2022 TRANSFER SYNTAX Kermit's default transfer syntax is NORMAL (meaning either ASCII text, or binary, according to SET FILE TYPE). Kermit's ISO-2022 transfer syntax must therefore be enabled in some way, either automatically or explicitly by the user. In the automatic case, the Kermit program recognizes (somehow) that it is to transfer a multi-alphabet text file. In the manual case, the user issues a SET command: SET TRANSFER-SYNTAX ISO-2022 It must also be possible to override the automatic use of ISO-2022 syntax via the command: SET TRANSFER-SYNTAX NORMAL The sending Kermit may inform the receiving Kermit of the selected transfer syntax by means of the Kermit File Attribute (A) packet, whose use is negotiated in the Kermit Initialization exchange. There is an attribute "*" (ASCII 42) which represents "encoding", with values like "A" for Normal Kermit ASCII encoding, "B" for binary, "E" for EBCDIC (so far, never used). 
The proposed new value for this attribute is "I", followed by one or more ISO 2022 announcer letters (the letters after <ESC><SP> shown in Table 4), for example IA, IB, IC, etc.

The receiver can agree to accept the file or refuse it using Kermit's attribute reply mechanism. Refusal could occur because the receiving Kermit does not support the ISO 2022 facilities announced by the sender, or because the receiver does not support the ISO 2022 transfer syntax at all. To refuse, the receiver puts the character "N" in the data field of its acknowledgement to the A packet, followed by the character "*" (along with any other Kermit attribute designators it objects to).

If the receiver does not do attribute packets, then the sender may still elect to send the file (with a warning to the user), as either a binary file or an 8-bit text file, for storing (and perhaps forwarding) purposes only. In this case, the file will be stored on the receiving computer in ISO-2022 transfer syntax.

The advantage of using Attribute packets is that the sending Kermit can automatically inform the receiving Kermit of the file transfer syntax, so that the user does not have to type a SET command to both Kermits. On a computer system where the Kermit program can recognize the attributes and encoding of a file automatically, this mechanism will allow files of different types (ASCII text, binary, multi-alphabet text) to be sent together as a group, even between unlike systems. The drawback is that the attribute mechanism must be programmed into a Kermit program that doesn't already have it.

DESCRIPTION OF KERMIT'S ISO-2022 TRANSFER SYNTAX

Transfer of a multi-character-set text file in ISO-2022 transfer syntax by Kermit is similar to transfer of a 7-bit ASCII text file, except that it may contain embedded control characters and escape sequences to identify and switch between character sets.
The file sender translates the file's characters (if necessary) into one or more registered alphabets, and terminates lines of text (records) with CRLF, as in ASCII text mode. The file receiver translates from ISO-2022 transfer syntax into the format demanded by the local system or application. All of this occurs before Kermit packet encoding by the sender, and after Kermit packet decoding by the receiver, as shown in Figure 4.

_____________________________________________________________________________

                      +----------------------------------+
                      |  File data                       |
                      |                                  |
    Sending Kermit    |  Conversion to transfer syntax   |
                      |                                  |
                      |  Kermit encoding                 |
                      +----------------------------------+
                      |  Transmission of Kermit packets  |
                      +----------------------------------+
                      |  Kermit decoding                 |
                      |                                  |
    Receiving Kermit  |  Conversion from transfer syntax |
                      |                                  |
                      |  File data                       |
                      +----------------------------------+

       Figure 4: ISO 2022 Transfer Syntax and Kermit Packet Encoding
_____________________________________________________________________________

ISO 2022 states that "at the beginning of information interchange, except where the interchanging parties have agreed otherwise, all designations shall be defined by use of the appropriate escape sequences, and the shift status shall be defined by the use of the appropriate locking-shift functions."

Kermit programs should "agree otherwise" that the default G0 character set is the US ASCII / ISO-646 / ECMA-6 7-bit set; thus ISO-2022 transfer syntax can be identical to Normal Kermit transfer syntax when transferring 7-bit text files. There are no defaults for G1, G2, or G3, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII alphabet, an escape sequence from Table 1 must be issued, designating the alphabet to which they belong (using the identification letters shown in Table 5) to the desired intermediate character set G0, G1, G2, or G3.
This sequence must be given before the first occurrence of a character in that alphabet. If no such sequence is given, then all characters are treated as ASCII data, including <ESC>, the shift characters, and bytes with their 8th bits set to one. In other words, the file transfer behaves in the normal Kermit fashion for text files.

ISO 2022 escape sequences are inserted into the data, and are indistinguishable by the Kermit packet encoder/decoder from the data itself. Therefore these escape sequences may be broken across packets, just as any other data may be.

CHOOSING THE APPROPRIATE ISO 2022 FACILITIES

This proposal allows Kermit programs to use the full range of ISO 2022 code extension techniques, including use of G0, G1, G2, and G3 in both the 7-bit and 8-bit environments, with both single-byte and multibyte character sets. In the general case, G0 will be used for ASCII and English, G1 for the "native language" of the local country or region, G2 for a third language, and G3 for a fourth. Additional character sets may be swapped in and out of G2 and G3 as required.

Transmission of 8-bit data in the 7-bit environment is accomplished by Kermit using 8th-bit prefixing, which is an optional feature of the Kermit protocol. However, most popular implementations of Kermit do include this feature. If a Kermit program cannot do 8th-bit prefixing, then it must operate in the ISO 2022 7-bit environment, shifting GL among the intermediate graphics sets G0-G3.

If the Kermit program can do 8th-bit prefixing, the choice of the ISO 2022 7-bit or 8-bit environment is entirely independent of the communication channel. 8-bit communication may be used on a 7-bit channel (in which case Kermit does the required 8th-bit prefixing), or 7-bit communication can be done on an 8-bit channel. Or any other combination. Selection of the ISO 2022 7-bit or 8-bit environment should be made on other grounds, such as transmission efficiency or program simplicity.
For example, if the ISO 2022 8-bit environment is used on a 7-bit channel, then Kermit will have to do 8th-bit prefixing, which can be much less efficient than locking shifts.

Taken in their entirety, the ISO 2022 facilities are quite complex and may be "overkill" for many applications. Let's look at some specific examples in which a subset will do.

1. ASCII text

Only a single G0 character set is active. Normal Kermit ASCII text file transfer syntax is used, in conjunction with normal Kermit packet encoding, in which control characters are translated to printable characters (e.g. Ctrl-A becomes #A). In the 7-bit environment (when parity must be used on the communication line), characters which have their 8th bit set to one are transmitted with the 8th bit replaced by a parity bit, and prefixed by the character "&" (ASCII 38). In the 8-bit environment (no parity), the 8th bit of the transmitted character (after control prefixing) is the same as the 8th bit of the original data character.

No ISO 2022 escape sequences are necessary, but if ASCII files are to be transferred when using ISO-2022 syntax, they may be prefixed with the announcer <ESC><SP>A to specify that only the G0 set is used and <ESC>(B to designate the left half of ISO 8859/2 to the G0 set, and the attribute packet may contain "*IA".

2. A single ISO 8859 alphabet

This method will be quite common in countries like France, Germany, Italy, Poland, the USSR, or Greece, where a single alphabet (such as ISO Latin 1, ISO Latin/Cyrillic, ISO Latin/Greek) can be used to represent all text in the native language (plus, in most cases, also English, and most computer programming languages).

The choice of ISO 2022 facilities depends upon two factors: whether transmission is to be in the 7-bit or 8-bit environment, and whether the language is predominantly "one-sided". A one-sided language confines itself mostly to either the G0 set (like English) or the G1 set (like Russian, Greek, Hebrew, or Arabic).
A two-sided language jumps frequently between the two sets (like French).

In the 8-bit environment, the G0 and G1 sets are used without shifts. That is, the alphabet should be transmitted in its full 8-bit form. C0 and C1 characters are also transmitted as-is.

In the 7-bit environment, we must choose between Kermit's 8th-bit prefixing and ISO 2022 locking shifts. For left-sided languages (like English), the question is largely irrelevant since few G1 characters will be encountered. For right-sided languages like Russian, it is clear that locking shifts will result in far less transmission overhead than Kermit's per-character 8th-bit prefixing. For two-sided languages like French, Kermit's 8th-bit prefixing is equivalent to an ISO 2022 single shift function, and will probably result in less overhead than locking shifts. The situation is summarized in Table 6.

_____________________________________________________________________________

                   8-bit Environment            7-bit Environment
 Language Type   Assignments     Announcer    Assignments     Announcer

 Left-sided      G0->GL, G1->GR      C        G0->GL, G1->GR      C
 Right-sided     G0->GL, G1->GR      C        G0->GL, G1->GL      B
 Two-sided       G0->GL, G1->GR      C        G0->GL, G1->GR      C

 Table 6: ISO 2022 Facilities for Single Alphabet Text Transfer with Kermit
_____________________________________________________________________________

30-Mar-89 22:49:59-GMT,17722;000000000001
Return-Path:
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19922; Thu, 30 Mar 89 17:42:48 EST
Date: Thu, 30 Mar 1989 17:42:48 EST
From: Christine M Gianone
To: isokermit
Subject: ISO/Kermit Proposal Draft #3, Part 3 of 4
Message-Id:

3. ISO 646 text

ISO 646 has many national variations, in which national characters are substituted for ASCII brackets, etc. Some examples are shown in Table 7. When transferring such text between systems that use the same encoding, normal ASCII text file syntax may be used (as is commonly done today).
_____________________________________________________________________________

             ASCII
 Column/Row  Graphic  German     Finnish    Norwegian     French

   04/00       @      section    @          @             a grave
   05/11       [      A umlaut   A umlaut   AE diphthong  degree
   05/12       \      O umlaut   O umlaut   O slash       c cedilla
   05/13       ]      U umlaut   A circle   A circle      section
   06/00       `      `          e acute    `             `
   07/11       {      a umlaut   a umlaut   ae diphthong  e acute
   07/12       |      o umlaut   o umlaut   o slash       u grave
   07/13       }      u umlaut   a circle   a circle      e grave
   07/14       ~      ess-zet    u umlaut   ~             diaeresis

              Table 7: ISO 646 Usage in Selected Countries
_____________________________________________________________________________

When transferring ISO 646 text with a system that uses a different encoding, ISO 2022 transfer syntax should be used, along with the appropriate ISO 8859 alphabet, for instance ISO Latin 1 for German, Finnish, Norwegian, or French. The special characters from Table 7 should be translated into the ISO 8859 alphabet equivalents. In this case, the comments about the "sidedness" of the language vs the 7-bit environment also apply.

Users should be cautioned to make a distinction between text documents and computer program source. For program source, normal Kermit ASCII text syntax should be used (SET TRANSFER-SYNTAX NORMAL), otherwise programs in C, Pascal, etc, will have their brackets, braces, not-signs, and logical ORs appear on the target system as accented letters, etc.

4. Japan

Many Japanese computer systems use at least three character sets, Roman (close to ASCII), Katakana (a 1-byte code), and Kanji (a 2-byte code). Kanji is specified in JIS X 0208, which also includes Roman, Hiragana, Katakana, and some other character sets, but these are double width and not normally used. Roman characters are usually taken from the left half of JIS X 0201, and Katakana from the right half.

Japanese text frequently shifts between Roman, Kana, and Kanji, and therefore requires three active character sets, for example G0 (Roman), G1 (Kana), and G2 or G3 (Kanji).
In the 8-bit environment, data transfer can be quite efficient: locking shifts are used to shift GL between Roman and Kana, and any bytes with the 8th bit set to one automatically invoke Kanji in GR as a multi-byte character set. In the 7-bit environment, locking shifts would be used to select Kanji as well as Kana. Note that locking shifts are more efficient in this case than Kermit 8th-bit prefixing because Kanji characters consist of more than one byte.

It is an open matter whether the ISO 2022 7-bit or 8-bit environment should be chosen by the programmer, based on the language, or whether the user should be given the option of choosing the environment using a possibly confusing SET command option.

ADDITIONAL ESCAPES

Since ISO 8859 character sets are subject to revision from time to time, an alphabet selector may be preceded by <ESC>&F, where F is the revision number (@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet Number One, Revision One.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the local storage medium. It may be ASCII, EBCDIC, a proprietary word processor format, IBM code page, or anything else. It is an implementation "detail" for the Kermit programmer to convert between the local file representation for multi-alphabet text files, and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the necessary identifying information, in which case the sending Kermit program can automatically emit the appropriate escape sequences during file transfer. In others, the user will have to tell the sending program how the file is encoded. The suggested command is SET FILE TYPE, followed by a keyword that specifies how the file is (or when receiving, is to be) encoded on disk. This will necessarily be highly dependent on the system's conventions, or the conventions of the applications to be used with the file (e.g. a multi-language word processing program).
Possibilities for this keyword might include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, ALEPH-BET, PC-HANGUL.

It may be that a file is encoded entirely in a single ISO-8859 alphabet, e.g. Latin Alphabet No. 1, or Latin/Cyrillic, but the file itself contains no information to that effect. Therefore, it must be possible for the user to specify the alphabet, independent of the application, using the new command SET FILE CHARACTER-SET, followed by a character-set name, which might be one of the following:

   LATIN1-ISO8      ARABIC-ISO8      IBM-CODE-PAGE-437
   LATIN2-ISO8      CYRILLIC-ISO8    IBM-CODE-PAGE-850
   LATIN3-ISO8      GREEK-ISO8       IBM-CODE-PAGE-865
   LATIN4-ISO8      HEBREW-ISO8      KANJI-SHIFJIS
   LATIN5-ISO8      KANJI-JIS        KANJI-EUC

The part before the dash is the name of the alphabet, and the "-ISO8" says that the alphabet belongs to the ISO family of 8-bit character sets. This allows for the possibility of other encoding methods for the same languages, e.g. GREEK-DEC, where the Greek letters are taken from the DEC technical character set.

If the local file is not encoded according to ISO 2022 rules, it may contain <ESC>, <SO>, and <SI> characters. It is up to the Kermit program to know what these characters mean in the context of the file's format, and to either strip them from the file or translate them to something else. The ISO 2022 rules forbid the use of these characters as data to be transferred.

MISMATCHED CAPABILITIES

Each Kermit program should be informed -- either by SET command or through an Attribute packet -- that ISO 2022 transfer syntax is in use. Once the two Kermit programs have been instructed to use ISO 2022 syntax, it is still possible that the sender will announce an ISO 2022 facility that the receiver does not support, or will designate an alphabet that the receiver is not familiar with. If this happens, the receiver can cancel the file transfer by putting an "X" in the data field of its acknowledgement to the data packet which contained the unknown announcer or designator, and the sender will stop sending.
At that point, the user can find some workaround like sending the file using normal transfer syntax, etc. To prevent useless data transfer, it is recommended that all announcers and alphabet designators be transmitted at the beginning of the file, so that cancellation can take place as early as possible. The announcers should be included in the data transfer even though they appear in the Attribute packet, for two reasons: 1. The receiver might not support Attribute packets. 2. The receiver might want to store the file in the ISO 2022 transfer syntax, e.g. for display on a terminal, or for postprocessing by another program. SPECIAL EFFECTS Today, most multi-alphabet files are produced by proprietary text processing programs. These programs have many functions besides switching among alphabets. They may also endow text with special attributes such as boldface, italic, underline, super- or subscript, color, etc, and render characters in a variety of type styles and sizes. Each text processing program may have its own unique formats and conventions. These special effects are not addressed by this proposal. Nevertheless, it is likely that a multi-alphabet file produced by a text processing program also contains special effects. In order for a Kermit program to send a multi-alphabet file, it must have detailed knowledge of the file's format and coding conventions. Therefore, the Kermit program should be able to strip out the special effects, and send only the text. Otherwise the result would be meaningless when received on an unlike system or for use with a different application. (When transferring such files between like systems or compatible applications, Kermit binary mode transfers will suffice.) At some future time, it might be possible to adapt one of the popular document description languages to the Kermit protocol, so that Kermit will be able to transfer formatted documents between unlike systems and applications. 
Presently, there are many competing would-be standards including IBM DCA and DIA, DEC DDIF, US Navy DIF, and PostScript. There are also two ISO standards emerging in this area: Standard Generalized Markup Language (ISO 8879, 9069, and 9573), and Office Document Architecture (ISO 8613). This is an area for further study.

ARCHIVING

The Kermit protocol includes a so-far little-used archiving function. In this mode, Kermit stores incoming file data together with the attribute packets that precede it, so that the file can be retrieved and reconstituted on another system at a later time. In archive mode, the alphabet escapes and shifts should not be interpreted by the receiving Kermit, but simply stored as data.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner for text files, without any escapes or shifts, even in ISO 2022 mode.

A text file containing characters from a language or languages covered by a single ISO 8859 alphabet will require an <ESC>-F sequence to identify the alphabet. In the 7-bit environment, <SO> and <SI> are used to shift between the G0 and G1 sets. The following lines all produce the same result:

   A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".
   <ESC><SP>B<ESC>-AA dangerous German word is "gef<SO>d<SI>hrlich".
   <ESC><SP>B<ESC>-AA dangerous German word is "gef<SO>d<SI>hrlich".
   <ESC>&@<ESC>-A<ESC>(B<SI>A dangerous German word is "gef<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich" (where ae is a way of writing umlaut-a without an umlaut). For clarity and consistency with the ISO-2022 recommendations, the last form is preferred: the text begins with an announcement of the G0 and G1 sets in use, including the version number, and then explicitly shifts into the G0 set, rather than defaulting to it.

A text file containing characters from multiple ISO 8859 alphabets requires an <ESC>-F sequence to identify each alphabet.
In the 7-bit environment, SO and SI can be used to shift between G0 and G1 of the current alphabet, and (B can be used to select G0 of any of the alphabets, since these are all the same. For example, the following text contains the same word in English, French, and Russian:

  -ADisappointed, digu, -L`PW^gP`^RP]]kY.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the subsequent SI and SO shifts apply to its G0 and G1 sets, which are used to form the English and French words. The second escape sequence assigns the Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this new set.

Another 7-bit example, in which the same word is repeated in English, Russian, and German, shows how a locking shift remains in effect when the alphabet is changed. We begin in Latin/Cyrillic, start with an English word from G0, shift to G1 for the Russian word, and while still in G1 switch to Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  -LAlteration _U`UTU[ZP -ADnderung.

The following example (contributed by Hirofumi Fujii) illustrates why 8-bit operation is desirable when there are more than two character sets: Usually, Japanese text files contain Roman, Kana, and Kanji characters. In this case, switching the character set in the 7-bit environment is very expensive. For example, suppose the following sentence contains characters from all three sets:

  1234567890123456789012345678
  ----------------------------
  This is an English sentence.
  ----------------------------
  NNNNRKKRNNRRRRRRRRRNNNNNNNNR

where N is Kanji, K is Katakana and R is Roman (of course this is not a real Japanese sentence, but the distribution of character sets within the sentence looks like this). In the 7-bit environment, we usually assign:

  G0:Roman, G1:Katakana

so the above sentence is translated as

  $BThis(J is $Ban(J English $Bsentence(J.

The 28-byte text needs an additional 20 bytes in this case!
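The figure of 20 extra bytes can be checked with a short Python sketch. It rests on assumptions that are not spelled out in the text: each Kanji run is bracketed by a 3-byte designator on the way in (ESC $ B) and a 3-byte designator on the way out (ESC ( J), and each Katakana run is bracketed by SO and SI. The names below are illustrative only.

```python
from itertools import groupby

# Assumed cost, in bytes, of entering and leaving one run of each kind.
COST_7BIT = {
    "N": len(b"\x1b$B") + len(b"\x1b(J"),  # Kanji: designate, then back to Roman
    "K": len(b"\x0e") + len(b"\x0f"),      # Katakana: SO ... SI
    "R": 0,                                # Roman is the resting state
}

def switching_overhead(pattern: str, cost=COST_7BIT) -> int:
    """Sum the switching bytes over each run of identical set letters."""
    return sum(cost[kind] for kind, _run in groupby(pattern))

pattern = "NNNNRKKRNNRRRRRRRRRNNNNNNNNR"
extra = switching_overhead(pattern)
```

The pattern has three Kanji runs and one Katakana run, so under these assumptions the overhead is 3 x 6 + 2 = 20 bytes on top of the 28 character positions.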
In the 8-bit environment, we usually assign at the beginning:

  GL=G0:Roman G1:Katakana GR=G3:Kanji

so the above sentence becomes

  This is an English sentence.
  ^^^^    ^^         ^^^^^^^^

where ^ means 8th bit = one (GR character set). In this case, only 2 bytes are required to switch the character set. (Note that the locking-shift mechanism is required even in the 8-bit environment.)

TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a feature of many Kermit programs. It is hoped that these terminal emulators will evolve along the lines of the ISO standards mentioned above. In some cases, this is already a fact, insofar as DEC VT200 and 300 series terminals already follow these standards. In this regard, it is important to note that not all languages are written from left to right, top to bottom. Hebrew and Arabic are two examples of right-to-left languages, and Japanese and Chinese may be written top to bottom. The order of the text characters on disk or on the transmission line does not necessarily reflect their order on the screen or the printed page.

Kermit should be as easy to use as possible, but should still give the user the ability to specify exactly what character codes are in use for both terminal emulation and file transfer. There should also be a consistent set of commands for all Kermit programs. The following command should specify what character set is sent and received on the transmission medium during terminal emulation. The Kermit program must translate between this character set and the one that is used locally.

  SET TERMINAL CHARACTER-SET [{GL, GR}]

This command already exists, but is currently used only in MS-DOS Kermit, and only to switch between US and UK ASCII.
We should extend this command to select any character code, and to assign it to GL (default) or GR, and we should have a standard set of names, including the currently defined ISO 8-bit alphabets:

  LATIN1-ISO8, ..., LATIN5-ISO8, CYRILLIC-ISO8, GREEK-ISO8, HEBREW-ISO8, ARABIC-ISO8, etc.

7-bit ASCII and its national variants (ISO-646):

  ASCII-US, ASCII-UK, ASCII-FR, ASCII-DE, ASCII-IT, ASCII-NL, ASCII-ES, ASCII-DK, ASCII-FI, ASCII-IS, ASCII-SE, ASCII-NO, ASCII-TR, etc.

And for Japanese:

  KANJI-JIS, KANJI-SHIFTJIS, KANJI-EUC, etc.

For example, an MS-DOS computer might use SHIFTJIS locally, but a VAX communicates using EUC, so the MS-DOS Kermit user would give the command SET TERM CHAR KANJI-EUC. For keyboard character input, in addition to the current per-key SET KEY mechanism, there should be a way to assign an entire translation table to the entire keyboard. This command would be:

  SET TERMINAL KEYMAP [{GL, GR}]

THE IBM PC

How can ISO 2022 transfer syntax be used on the IBM PC (USA version)? It happens that the original IBM PC (without graphics adapter) contains many of the characters needed for Latin Alphabet 1 in its character ROM. GL is equivalent to ASCII, and GR has the accented vowels, etc, but in nonstandard positions. For example, the PC has A-umlaut in 08/14 (a C1 position!), whereas Latin 1 has it in 12/04. Therefore, translation tables must be written to convert from Latin 1 to IBM PC, and vice versa. The PC's character ROM does not contain letters from other sets, so the PC would only be able to handle ISO Latin 1. PCs with certain graphics adapters (and all PS/2's), on the other hand, can load different character sets from disk files into their character generators. IBM calls these files "code pages". USA Code Page 437 (the one used on the original PC) was capable of supporting 5 languages, whereas the new Multinational Code Page 850 can support 11 (according to the DOS 3.3 manual).
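The Latin 1 to IBM PC translation tables mentioned above can be sketched as follows. This is an illustration in Python rather than anything a 1989 Kermit would contain: it leans on Python's standard "latin-1" and "cp437" codecs, and the helper name and the choice of "?" for untranslatable characters are assumptions.

```python
def build_table(src: str, dst: str, fallback: int = ord("?")) -> bytes:
    """Hypothetical helper: build a 256-entry byte translation table.

    Each source code is mapped to the destination code of the same
    character; codes with no counterpart map to the fallback byte.
    """
    table = bytearray(256)
    for code in range(256):
        char = bytes([code]).decode(src)
        try:
            table[code] = char.encode(dst)[0]
        except UnicodeEncodeError:
            table[code] = fallback
    return bytes(table)

LATIN1_TO_PC = build_table("latin-1", "cp437")
PC_TO_LATIN1 = build_table("cp437", "latin-1")

# A-umlaut: 12/04 (hex C4) in Latin 1, 08/14 (hex 8E) on the PC.
pc_bytes = b"\xc4".translate(LATIN1_TO_PC)
```

The two tables are not inverses of each other: characters present in only one of the sets collapse to the fallback, which is the information loss that a per-file character-set declaration has to make visible to the user.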
There are also special code pages for Portuguese, French-Canadian, and Norwegian. A file can be created or displayed using any one of these code pages. However, the file itself contains no information about which code page was used, so it's up to the user to switch to the appropriate code page before accessing the file. For this reason, the Kermit program would need a SET FILE CHARACTER-SET command. There is no mechanism defined in IBM PC-DOS for switching code pages within a file. Therefore, mixed alphabet files are only possible within the private environment of proprietary PC-based multilingual word processors. A Kermit program would need to know the details... 30-Mar-89 22:50:04-GMT,15421;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19925; Thu, 30 Mar 89 17:43:37 EST Date: Thu, 30 Mar 1989 17:43:37 EST From: Christine M Gianone To: isokermit Subject: ISO/Kermit Proposal Draft #2, Part 4 of 4 Message-Id: APPENDIX A: STANDARDS (Also see Appendix C) ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files. ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII with provision for substituting "national characters" in selected positions. ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple graphics sets. ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets and alphabets. 
JIS X 0202, "Code Extension Techniques for Use with the Code for Information Interchange", the Japanese counterpart of ISO 2022.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see Table 5 for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages. The left half of each of these is the same as ASCII and ISO 646. Each character, including those with diacritics, is represented by a single byte.

ISO is the International Organization for Standardization, ANSI is the American National Standards Institute, ECMA is the European Computer Manufacturers Association. The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested. We tried this ourselves, and got delivery within about two weeks. ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY 10017

ANSI standards may be ordered, for a fee, from:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY 10018

APPENDIX B: ISO 2022 ANNOUNCERS

At the beginning of data transfer, the actual ISO 2022 facilities that will be used may be announced by means of escape sequences. Several of the most important ones are described here. Table 4 lists all the defined announcers in summary form. For details, see ISO 2022.

A means that only the G0 set will be used, invoked into GL. No shift functions will be used. In the 8-bit environment, GR is not used.
In other words, only a single 7-bit character set is used.

B means the G0 and G1 sets will be used with locking shifts. In the 7-bit environment SI invokes G0 into GL, SO invokes G1 into GL. In the 8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL. In other words, two character sets are used, with characters from both sets always sent as 7-bit values, with locking shifts used to specify the 8th bit.

C means that G0 and G1 will be used in the 8-bit environment, with G0 invoked in GL and G1 in GR. No locking shift functions are used. In other words, a single 8-bit character set is used, with all 8 bits transmitted as data. GL is selected when the character's 8th bit is zero, GR is selected when the 8th bit is one.

D means that G0 and G1 will be used with locking shifts. In the 7-bit environment, SI invokes G0 into GL and SO invokes G1 into GL. In the 8-bit environment, all 8 bits of each character are transmitted with no shifts.

L means that Level 1 of ISO 4873 will be used. That is, a single 8-bit character set with C0, G0, C1, and G1, with no shift functions. This is like C.

M means that Level 2 of ISO 4873 will be used. This is equivalent to Level 1, with the addition of G2 and G3. Characters from G2 and G3 are invoked only by the single-shift functions SS2 and SS3.

N means that Level 3 of ISO 4873 will be used. This is equivalent to Level 2 with the addition of the locking shift functions LS1R, LS2R, and LS3R. (Note that ISO 4873 does not concern itself with the 7-bit environment, and therefore does not discuss the use of LS0, LS1, LS2, or LS3.)
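As a rough illustration of how a receiver might recognize the announcers described above, here is a Python sketch. It assumes the ISO 2022 form of an announcer, namely ESC, then SPACE (02/00), then the final letter; the dictionary wording and the function names are hypothetical and cover only the letters discussed in this appendix.

```python
ESC_SP = b"\x1b "   # assumed announcer introducer: ESC followed by SPACE

ANNOUNCERS = {
    "A": "G0 only, invoked into GL; no shift functions",
    "B": "G0 and G1 with locking shifts, both invoked into GL",
    "C": "G0 in GL and G1 in GR (8-bit only); no locking shifts",
    "D": "locking shifts in 7-bit; G0 in GL and G1 in GR in 8-bit",
    "L": "Level 1 of ISO 4873",
    "M": "Level 2 of ISO 4873",
    "N": "Level 3 of ISO 4873",
}

def announcer(final: str) -> bytes:
    """Build the 3-byte announcer for one final letter."""
    return ESC_SP + final.encode("ascii")

def split_announcers(data: bytes):
    """Peel leading announcers off the start of the file data."""
    finals = []
    pos = 0
    while data[pos:pos + 2] == ESC_SP and pos + 2 < len(data):
        finals.append(chr(data[pos + 2]))
        pos += 3
    return finals, data[pos:]

finals, rest = split_announcers(announcer("C") + announcer("L") + b"text")
unsupported = [f for f in finals if f not in ANNOUNCERS]
```

A receiver that finds anything in `unsupported` at the very start of the data can cancel the transfer early, which is the reason the proposal asks for announcers to be sent at the beginning of the file.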
_____________________________________________________________________________

Esc Sequence  7-Bit Environment            8-Bit Environment

A             G0->GL                       G0->GL
B             G0-(SI)->GL, G1-(SO)->GL     G0-(LS0)->GL, G1-(LS1)->GL
C             (not used)                   G0->GL, G1->GR
D             G0-(SI)->GL, G1-(SO)->GL     G0->GL, G1->GR
E             Full preservation of shift functions in 7 & 8 bit environments
F             C1 represented as ESC F      C1 represented as ESC F
G             C1 represented as ESC F      C1 represented as 8-bit quantity
H             All graphic character sets have 94 characters
I             All graphic character sets have 94 or 96 characters
J             In a 7 or 8 bit environment, a 7 bit code is used
K             In an 8 bit environment, an 8 bit code is used
L             Level 1 of ISO 4873 is used
M             Level 2 of ISO 4873 is used
N             Level 3 of ISO 4873 is used
P             G0 is used in addition to any other sets:
              G0 -(SI)-> GL                G0 -(LS0)-> GL
R             G1 is used in addition to any other sets:
              G1 -(SO)-> GL                G1 -(LS1)-> GL
S             G1 is used in addition to any other sets:
              G1 -(SO)-> GL                G1 -(LS1R)-> GR
T             G2 is used in addition to any other sets:
              G2 -(LS2)-> GL               G2 -(LS2)-> GL
U             G2 is used in addition to any other sets:
              G2 -(LS2)-> GL               G2 -(LS2R)-> GR
V             G3 is used in addition to any other sets:
              G3 -(LS3)-> GL               G3 -(LS3)-> GL
W             G3 is used in addition to any other sets:
              G3 -(LS3)-> GL               G3 -(LS3R)-> GR
Z             G2 is used in addition to any other sets:
              SS2 invokes a single character from G2
[             G3 is used in addition to any other sets:
              SS3 invokes a single character from G3

Table 4: ISO 2022 Announcer Summary
_____________________________________________________________________________

APPENDIX C: CHARACTER SET STANDARDS AND DESIGNATION SEQUENCES

ISO 8859 defines a series of 8-bit character sets. In each of these, the left half (called G0 in this appendix) is the same as US 7-bit ASCII. Because of this, the left half of any ISO 8859 character set may be used to represent English or any other Latin-alphabet language that can make do without diacritical marks (e.g. German without umlauts or ess-zet, Dutch with ij considered two letters, etc.).
By convention, the G0 set can be selected with (B. When we say "by convention" we mean that each of the ISO 8859 standards says to select the G0 set using this sequence, even if the G1 set (right half) is selected using some other letter, like A, C, L, etc (see below). Theoretically, (A could also be used to select the G0 set of "alphabet A", (L could select the G0 set of "alphabet L", etc. Languages with special characters (i.e. non-ASCII graphics) must use specific ISO 8859 G1 sets. These sets are specified (to date) in ISO 8859-1 through 8859-9:

ISO 8859-1 is Latin Alphabet No. 1. The right half (G1) contains all the special characters needed for Dutch, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish. Select G1 with -A.

ISO 8859-2 is Latin Alphabet No. 2. G1 contains special characters for Albanian, Czech, German, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, and Slovene. Select G1 with -B.

ISO 8859-3 is Latin Alphabet No. 3, for Afrikaans, Catalan, Esperanto, French, Galician, German, Italian, Maltese, and Turkish. Select G1 with -C.

ISO 8859-4 is Latin Alphabet No. 4, for Danish, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish. Select G1 with -D.

ISO 8859-5 is the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, Macedonian, Russian, Serbo-Croatian, and Ukrainian (compatible with USSR GOST Standard 19768-1987 and ECMA-113). Select G1 with -L.

ISO 8859-6 is the Latin/Arabic Alphabet. Select G1 with -G.

ISO 8859-7 is the Latin/Greek Alphabet. Select G1 with -F.

ISO 8859-8 is the Latin/Hebrew Alphabet. Select G1 with -H.

ISO DIS 8859-9 is Latin Alphabet No. 5, in which six Icelandic letters from Latin Alphabet No. 1 were replaced by six letters needed for Turkish. Select G1 with -M.
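The designators listed above lend themselves to a small lookup table. The Python sketch below is only an illustration: it assumes each "-X" shown above is preceded by ESC on the wire (ESC 02/13 designates a 96-character set as G1), and the names are hypothetical.

```python
ESC = b"\x1b"

# Final letters for the G1 halves of ISO 8859-1 through 8859-9, as listed above.
G1_FINAL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "L", 6: "G", 7: "F", 8: "H", 9: "M"}
PART_BY_FINAL = {final: part for part, final in G1_FINAL.items()}

DESIGNATE_G0_ASCII = ESC + b"(B"   # the conventional G0 designator

def designate_g1(part: int) -> bytes:
    """Return the escape sequence selecting ISO 8859-<part> as G1."""
    return ESC + b"-" + G1_FINAL[part].encode("ascii")

def part_from_designator(seq: bytes) -> int:
    """Inverse lookup for a receiver; raises KeyError for unknown finals."""
    if len(seq) != 3 or seq[:2] != ESC + b"-":
        raise ValueError("not a 96-character-set G1 designator")
    return PART_BY_FINAL[chr(seq[2])]

cyrillic = designate_g1(5)
```

A receiving Kermit that gets a KeyError here has met an alphabet it does not know, which is the point at which the proposal suggests cancelling the transfer.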
OTHER CHARACTER SET STANDARDS:

ISO 646 (1983), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII, and says we can substitute "national characters" for ASCII characters #$@[\]^`{|}. Different languages put different characters in these positions, and there's no mechanism defined to specify which language is being used. ISO 646 is commonly used in Europe, and much confusion results from the substitution of national characters for brackets and other symbols that are used in programming languages like C.

CCITT T.61 (1984), "Character repertoire and coded character sets for the international Teletex service". This is an extension of ISO-646 into the 8-bit arena, but unlike ISO 8859, T.61 uses character combinations to represent letters with diacritical marks. For example, the 2-byte sequence ^o would represent the single character o-circumflex. The left half of this set is equivalent to ASCII and ISO 646, except that the following characters are left undefined: 05/12 (ASCII "\"), 05/14 (ASCII "^"), 07/11 (ASCII "{"), 07/13 (ASCII "}"), and 07/14 (ASCII "~"). The right half contains currency signs, mathematical symbols, diacritical marks, and characters used in roman-alphabet languages that cannot be formed by combining A-Z with a diacritical mark (like Dutch "ij", Icelandic thorn, German Ess-Zet, etc).

ISO 6937, "Coded character sets for text communication". ISO 6937/2-1987, "Latin alphabetic and non-alphabetic graphic characters" is the ISO equivalent of CCITT T.61. The right half of this set may be selected using -J. Note that when this alphabet is used, special procedures must be used to translate between its two-byte sequences for accented letters and the single-byte representation of these characters in other sets.

JIS X 0201, "Code for Information Interchange", a 1-byte code containing ASCII in the left half and Japanese Katakana in the right half.
Select G0 with -J, and G1 with -I (See POSSIBLE PROBLEM, below). JIS X 0208, "Code of the Japanese Graphic Character Set for Information Interchange", a 2-byte code containing Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian characters, plus special symbols, etc. Select with $)B. CAS GB 2312-80, Chinese. ISO Reg 58. KS C 5601 (1987), Korean. ISO Reg 149. ISO DIS 10646, "Multiple-Octet Coded Character Set". This is a new standard under development by Working Group 2 of ISO/IEC JTC1/SC2. Formal drafts have not yet been issued. Its purpose is to provide a single character code for all present-day languages throughout the world, with provision also for technical and bibliographic documents. According to a preliminary description, the basic multilingual plane will permit easy interworking with existing 8-bit codes, but all designation, invocation and shifting as in ISO 2022 will be avoided. When this standard becomes formalized, it can be incorporated into Kermit as a new transfer syntax. CCITT is the International Telephone and Telegraph Consultative Committee, GOST is the USSR standards organization, and JIS means Japan Industrial Standard. The alphabet selection escape sequences are registered in the International Register of Coded Character Sets under the provisions of ISO 2375, "Data Processing - Procedure for Registration of Escape Sequences". The registration authority is the ECMA, which periodically issues updates. Some registered character sets are shown in Table 5; the ISO Number is the number of the ISO standard, ECMA Ref is the corresponding ECMA standard number, and ECMA Registration is the ECMA character set registration number (currently unused, but which will be incorporated into future revisions of ISO 8824,8825: ASN.1). The escape sequences shown (except in the ASCII entry) assign the given set to G1. There may also be "private alphabets", such as those found on DEC terminals. 
In the DEC environment only, these may be selected using escape sequences listed in the DEC manuals, e.g. )> to select the DEC Technical 94-character set and assign it to G1.

_____________________________________________________________________________

Alphabet Name             Esc Seq  ISO Number   ECMA Ref   ECMA Registration

ASCII (ANSI X3.4-1986)    (B       ISO 646      ECMA-6     ?
Latin Alphabet No. 1      -A       ISO 8859-1   ECMA-94    100
Latin Alphabet No. 2      -B       ISO 8859-2   ECMA-94    101
Latin Alphabet No. 3      -C       ISO 8859-3   ECMA-94    109
Latin Alphabet No. 4      -D       ISO 8859-4   ECMA-94    110
Latin/Cyrillic            -L       ISO 8859-5   ECMA-113   144
Latin/Arabic              -G       ISO 8859-6   ECMA-114   127
Latin/Greek               -F       ISO 8859-7   ECMA-118   126
Latin/Hebrew              -H       ISO 8859-8   ECMA-121   138
Latin Alphabet No. 5      -M       ISO 8859-9   ECMA-128   148
Czech Standard            -I       ?            ?          139
Right Half, ISO 6937-2    -J       ISO 6937-2   ?          142
Math/Technical Set        -K       ?            ?          143
Chinese (CAS GB 2312-80)  $)A      ?            ?          ?
Japanese (JIS 0208)       $)B      ?            ?          ?
Korean (KS C 5601-1987)   ?        ?            ?          ?

Table 5: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________

POSSIBLE PROBLEM: There seems to be conflict between ISO/ECMA alphabet codes and the codes used in Japan:

  Letter  Europe   Japan
  I       Czech    JIS-Katakana
  J       ISO6937  JIS-Roman

THE END

30-Mar-89 20:32:22-GMT,6876;000000000411
Return-Path: <@cuvmb.cc.columbia.edu:ISO8859@JHUVM.BITNET>
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19127; Thu, 30 Mar 89 15:32:15 EST
Message-Id: <8903302032.AA19127@watsun.cc.columbia.edu>
Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA22424; Thu, 30 Mar 89 15:29:26 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4600; Thu, 30 Mar 89 15:27:44 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0924; Thu, 30 Mar 89 15:27:43 EST
Received: by BITNIC (Mailer X1.25) id 0137; Thu, 30 Mar 89 15:28:38 EST
Date:
Thu, 30 Mar 89 12:47:24 CST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: query about overstruck characters in ISO 8859 To: Frank da Cruz Johan van Wingen has pointed out several times in this forum that in ISO 8859, as opposed to ISO 6937, 646, and other earlier coded character sets, it is illegal to use backspaces to overstrike two characters as a method of obtaining a new character. At least, that's what I understood him to say. ISO 8859-1 : 1987 (E) says (paragraph 7) "The use of control functions, such as BACKSPACE or CARRIAGE RETURN for the coded representation of composite characters is prohibited by ISO 8859." I have two questions: (1) just what sorts of activities are supposed to be forbidden here? and (2) why? To be more specific: if I need to print a Serbo-Croatian word containing a 'c' with an acute accent, I could probably do any of the following things (depending on my system environment). Which of them are legal, and which illegal? And can we construct a rationale for the legality and illegality of each? (= *should* they be legal?) (a) embed the sequence 'c' BACKSPACE ´. (hex 63 07 B4) in my file (if I'm using an editor that allows me to embed backspace characters, as some do and some don't) and let the printer, the display, and other devices deal with it as best they can. The display will probably show me the acute, and the printer will do an overstrike, unless it's a line printer, in which case I may get a variety of things but almost certainly not what I want. (b) use a Script command like ".dc bs <" and then use the combination 'c<´.' in my file. Script will arrange to have the acute and the 'c' overstruck, either by issuing a backspace or by doing something else. (c) use the same Script command, and also define a Script symbol with ".sr cacute = 'c<´.'" or ".sr cacute = 'c&sysbs.´." and then in my file use "&cacute." instead of "c<´." 
(d) use some relevant system facility (either in Script or in a microcomputer word processor) to define the width of hex B4 as 0. Then send the sequence hex B4 63 to the printer. (e) use the editor or some (imaginary) Script facilities to embed a sequence like ESC '-' 'B' (hex 1B 2D 42) at the beginning of my file to set up ISO 8859-2 as my G1 character set, and then in my file embed SHIFT-IN X'B6' SHIFT-OUT (hex 0F B6 0E) for the acute-accented 'c' (f) embed the ESC '-' 'B' sequence in some way, use Script's symbol facility to define ".sr cacute = &x'0FB60E' " and then use "&cacute." in my file as usual. If I understand the text of paragraph 7, approach (a) is clearly in violation of the spirit and letter of the standard. What about approach (b)? In my file, I'm not using any control characters to create composite characters: only graphics. I don't expect any editor to resolve the multi-character encoding for me and display an accented 'c'. But I am, I admit, using backspace or CR in the printer stream (or if the printer is more sophisticated, maybe something even more devious). Or perhaps I'm not. I don't know what Script97 does with the Xerox 9700; all I know is that the ".sr" command given should give me something resembling the character I want on my output. Approach (c) is much the same as (b), except that a lot of these symbols are already defined at installation. Is it a violation of the standard to use them, if they produce backspaces in the printer data stream? Approach (d) avoids the backspace in the data stream, but probably violates another part of paragraph 7: "None of these characters are non-spacing." Approach (e) and (f) sound as though they are what the standards committee expects us to do. But given that very few pieces of software will handle such escape sequences, I am not sure what paragraph 7 can mean or is supposed to mean for sites, developers, or end users. 
If I cannot use character 11/4 (acute accent) to form composite characters, why is it there? For use in mathematics to distinguish symbols (K and K' = K-prime)? In that case it would be far better to use slots 11/4, 10/8, 11/8, and 10/15 to include Turkish, and define another single character set for all sorts of mathematical symbols. ("Lead us not into temptation.") I imagine the point of paragraph 7 must be to say that extension of the character set to handle things like accented 'c' should be done through the extension techniques defined by other ISO standards, and not by overstriking characters of the ISO 8859 sets. In an ideal world, all the equipment would support ISO 8859-1 through -9, and ISO 2022 and so on. But in the real world -- is it considered a violation of ISO 8859 to use non-standard code extension techniques in order to make non-conforming equipment produce appropriate results? Our printer probably doesn't have a-umlaut as a separate character. Is it a violation of paragraph 7 to write a printer driver that reads character 14/4 from a file and sends an overstrike sequence including BACKSPACE to the printer? Would it be a violation if the printer driver translated from ISO 8859 to ISO 6937? Frankly, I find the blanket prohibition against use of BACKSPACE and CR in paragraph 7 a bit confusing and don't believe I understand the logic behind it. I am involved in a large international project to formulate methods for encoding literary and linguistic data in machine-readable form. It is important that we be able to recommend sound practice for encoding diacritics. To me, that means practice which agrees with relevant standards. But it is also essential that the recommended practice be something that people can actually work with using the software that exists. So I am particularly interested in finding out what the character set committee had in mind when they wrote paragraph 7. 
-Michael Sperberg-McQueen Editor in Chief, ACH / ACL / ALLC Text Encoding Initiative University of Illinois at Chicago 30-Mar-89 23:54:00-GMT,1277;000000000001 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA20163; Thu, 30 Mar 89 18:53:57 EST Date: Thu, 30 Mar 1989 18:53:57 EST From: Frank da Cruz To: ASCII/EBCDIC character set related issues Cc: Christine M Gianone Subject: Re: query about overstruck characters in ISO 8859 In-Reply-To: Your message of Thu, 30 Mar 89 12:47:24 CST Message-Id: We share your curiosity about the ISO8859 prohibition on composite characters. Not that it doesn't make sense -- ISO 8859 wants a character to be a character, so that it is possible for character and string oriented software to deal with text in a uniform way. Hence ISO 8859 shuns the composite "character building" allowed by ISO 646, and *required* by CCITT T.61. Our curiosity, like yours, is about how mixed-alphabet data is to be stored on disk. This relates closely to an extension to the Kermit file transfer protocol that we're working on, for transferring text in mixed alphabets between unlike systems. If you'd like to read & comment on it, or want to be added to the "isokermit" discussion group, let us know. 
- Christine Gianone and Frank da Cruz From @cunyvm.cuny.edu:RECK@DBNUAMA1.BITNET Fri Mar 31 20:01:29 1989 Return-Path: <@cunyvm.cuny.edu:RECK@DBNUAMA1.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA24487; Fri, 31 Mar 89 20:01:29 EST Message-Id: <8904010101.AA24487@watsun.cc.columbia.edu> Received: from CUNYVM.CUNY.EDU ([128.228.1.2]) by cunixc.cc.columbia.edu (5.54/5.10) id AA09964; Fri, 31 Mar 89 19:57:35 EST Received: from DBNUAMA1.BITNET by CUNYVM.CUNY.EDU (IBM VM SMTP R1.1) with BSMTP id 2534; Fri, 31 Mar 89 19:59:46 EST Date: Sat, 01 Apr 89 01:49:08 SET To: isokermit@cunixc.cc.columbia.edu From: RECK%DBNUAMA1.BITNET@cunyvm.cuny.edu Comment: CROSSNET mail via SMTP@INTERBIT Subject: Comments on ISO/Kermit proposal, 2nd draft Date: 1 April 1989, 00:37:03 SET From: Gisbert W.Selke +49 228 225888 To: ISOKERM at CUVMA Re: ISO/Kermit proposal, 2nd draft Here are a few musings after reading the second draft; as usual, let me start with what I'm uneasy about: (i) In the less enthusiastic moments, I ask myself, "How likely is all this ever to be implemented? Isn't it too much to expect an aspiring, or even an accomplished, Kermit implementor to know about a whole bunch of proprietary word processor formats?" With the advent of SGML, if ever, that may become less of an obstacle, but I don't think we're near now... - So, why not do the whole thing via extraneous filters, converting, on one side, from BlunderWrite 4.F format to ISO-something, transmit that via ordinary Kermit, and re-encode to WonderType 0.07 on the other side? That would only require the standardization of multi-lingual text file transfer - which, as we see right now, is a non-trivial task in itself -, but wouldn't put any further burden on Kermit programmers. 
Writing such filters, on the other hand, is not so hard a task for any moderately skilled programmer who has access to the word processor specs; hence, the scheme is easily extended to any and all word processor in the world. And, with many Kermits allowing to write macros and/or scripts and run programmes from within, running such a filter may be virtually transparent to the end user. OK, so that gets us back in time, when we had to boo/uu/hexify binary files in order to mail them. So what. (Well, I'm not particularly sure if I really mean what I have written. Don't crucify me.) (ii) If, contrary to what I seem to suggest in (i), the translation mechanism is included into Kermit implementations (as I expect it to happen), should there be some standard syntax for telling Kermit to employ a user-specified translation table that helps coping with a particular text format that the implementor didn't know about? I'm not at all sure that such a reasonably flexible mechanism could be specified; I'm thinking vaguely of something like the 'input translation'/'set key ...' feature of MS-Kermit, which has been of great help to us here in Germany (thanks, Joe!). The input translation works on a character-per-character basis only, though; that wouldn't be complex enough for text files, and that's where the problems start. Anyway, giving the local wizards a chance to customize Kermit would probably help. (iii) The draft mentions the problem of mis-matched Kermit implementations, i.e., two Kermits knowing about different subsets of ISO. If the subsets are disjoint, then one can but fall back on plain ASCII (or forsake readability on an unlike target system); but, in the case of partially different subsets, we can do better. Imagine an originating Kermit knowing about all of ISO, and a receiving Kermit not knowing about Latin-4 and Latin-9. 
Then, if the sender tries to negotiate to send a, say Turkish (or German) text encoded as Latin-9, the receiver will not accept, and the whole transfer has to fall back on plain ASCII, losing all the national characters on the way. - As an alternative, if, in the initial exchange (say, the A packet) the sender lists all the pertaining variants it knows about, then the receiver may choose one of those variants that it in turn knows about, and, in the ACK, tells the sender which variant it actually should use. So this mechanism is quite similar to the matching of capabilities as negotiated in the CAPAS byte(s) - essentially, there should be a way for the receiver to transmit information on its abilities back to the sender. (This remains within the one-step negotiation scheme of Kermit - no extended prattling is involved.) (iv) On what grounds should a Kermit choose 7 or 8 bit environment: I don't think it should be left to the programmer. MS-Kermit (and most others) is written in the US, but is used in Western Europe, in Israel, in Japan,... It should be left to the user, really - probably by tieing it to some 'set language' command, maybe with additional options to override this standard (which I will use if and when I start writing dadaist poems consisting mainly of umlauts). (v) A minor point on the wording of the draft: the term 'n-bit environment' seems to me to be used inconsistently in various places; e.g., in and near to table 6, it refers to the properties of the communications path (table 6 talks about using C in a 7-bit environment), whereas appendix B uses the term with respect to the subsection of ISO 2022 that is being employed (and, consequently, states that C must not be used in a 7-bit environment). This had me confused for a while; maybe it should be made clearer that these are slightly different concepts, and that Kermit, under certain conditions, provides a mock-8-bit environment as seen from the ISO level. 
(vi) Another, even lesser point: in appendix C, the use of 'left (rsp. right) half' for G0 (rsp. G1) left me briefly puzzled, too, given that G1 may reside in GL, etc. It would seem clearer to me if all talk about 'left' and 'right' is dropped in this context. In spite of the evidence I have just given, let me tell you that I like the whole ISO/Kermit idea, and I did enjoy reading the second draft, which is a big improvement on the first version (and I liked that one, too). Taking the full power of ISO 2022 is certainly the right thing to do - it can be done, and that way, we're less likely to outgrow the restrictions we put on ourselves now. I appreciate your work, Chris and Frank! \Gisbert From MURAKAMI@ntt-20.ntt.jp Tue Apr 4 23:40:10 1989 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA13049; Tue, 4 Apr 89 23:40:10 EDT Received: from ntt-sh.ntt.jp ([129.60.57.1]) by cunixc.cc.columbia.edu (5.54/5.10) id AA14324; Tue, 4 Apr 89 23:32:46 EDT Received: by ntt-sh.ntt.jp (3.2/ntt-sh-03c) with TCP; Wed, 5 Apr 89 12:36:16 JST Date: Wed, 5 Apr 89 12:36:13 I From: ken-ichiro murakami Subject: Re: ISO/Kermit Proposal Draft #2 To: isokermit@watsun.cc.columbia.edu Cc: murakami@ntt-20.ntt.jp In-Reply-To: Message-Id: <12483598785.15.MURAKAMI@NTT-20.NTT.JP> Chris, Frank, Fujii-san and kermit wizards, Your revised draft greatly helped me to understand ISO 2022 facilities. Thank you. I think it's very important to clarify the background technology and the reason why the Kermit standard representation is adopted. I believe the draft is also helpful to extend other protocols such as FTP. Here is my opinion for the second draft and comments for the previous messages on mailing-list. My opinion is based on the policy that Kermit protocol and its command should be clearly structured(layered) and simple. I believe well structured system helps user to understand its usage. 
My opinion is summarized as follows;

(1) simplify terminal emulation commands
(2) separate coding into two categories, generic coding and
    application specific coding.
(3) proposal for new command SET FILE FILTER to offer flexible
    interface for application specific coding
(4) compatibility with conventional kermit program
(5) 8bit transparency(8th-bit quoting) which should be ensured
    NOT as optional but as mandatory facility.

When I was writing my opinion, I asked myself a simple question. "It seems new facility makes kermit complex and lose kermit the simplicity and the beauty. Is it possible to reduce the number of new command?" If user requires only to convert files which are created by some applications such as MACWRITE and WORDPERFECT, SET FILE FILTER command might do everything. How about non-standard Japanese Kanji coding? It seems SET FILE FILTER command can do it. If ISO2022 is adopted in Kermit, we have to modify both server and client kermit. Do we have enough wizards who can modify them in Japan? .......

I'm still confused by these questions. I hope I can make my idea clear in the next mail. Sorry. I expect you to correct my proposal and idea.

-Ken

Ken-ichiro Murakami
NTT Laboratories
Tokyo, Japan

1. TERMINAL EMULATION

>From: Hirofumi Fujii
>
>1. Japanese code systems:
>   Japanese computer system has at least three character sets,
>   Roman (almost ASCII), Katakana(1byte code), and Kanji
>   (2byte code). Kanji set also includes the Roman characters
>   but the face of the character is double width. Therefore it
>   should be considered as different characters.
>   The local representation for these character sets are
>
>   OS          Roman        Katakana          Kanji
>   ----------  -----------  ----------------  ----------------------
>   MS-DOS      JIS X 0201   JIS X 0201 in GR  MS-Kanji (SHIFTJIS)
>   VAX/VMS     US-national  (see note 1)      DEC-Kanji (see note 2)
>   IBM/VM/CMS  EBCDIC       EBCDIC-Katakana   IBM-Kanji
>   UNIX        JIS X0201    EUC (see note 3)  EUC (see note 4)
>   ^^^^^^^^^^^^^^^^
>   Elis        JIS X0201    EUC (see note 3)  EUC (see note 4)
>
>   (Note 1) Invoked by LS2 (Locking shift two).
>   (Note 2) JIS X 0208 in GR (i.e., 8th bit on).
>   (Note 3) Invoked by SS2 (Single shift two) and 8th bit on.
>   (Note 4) JIS X 0208 in GR (i.e., 8th bit on).

In Japan, each vendor extended original UNIX with different Kanji code. There are three kinds of Kanji UNIX systems now. There are no standard Kanji code in Kanji UNIX.

    OS          Roman        Katakana          Kanji
    ----------  -----------  ----------------  ----------------------
    MS-DOS      JIS X 0201   JIS X 0201 in GR  MS-Kanji (SHIFTJIS)
    VAX/VMS     US-national  (see note 1)      DEC-Kanji (see note 2)
    IBM/VM/CMS  EBCDIC       EBCDIC-Katakana   IBM-Kanji
    UNIX        JIS X0201    EUC (see note 3)  EUC (see note 4)
                             JIS X 0201 in GR  MS-Kanji (SHIFTJIS)
                             JIS X0208(7bit)   JIS X0208(7bit code)
    Elis        JIS X0201    EUC (see note 3)  EUC (see note 4)

>From: Hirofumi Fujii
>
>TERMINAL EMULATION
>
>SET TERMINAL CHARACTER-SET
>
>My Kermit (MSVP98) have another command, SET TERMINAL KANJI CODE .
>The SET TERMINAL CHARACTER-SET specifies GL character set. However,
>Kanji is mainly used in GR character set as described above. This is
>because we need another command to specify the Kanji code. There is one
>more reason we need another command. It is the code for keyinput.
>SET TERMINAL KANJI CODE also used for keyinput character conversion.
>To unify these command, I propose
>
>   SET TERMINAL CHARACTER-SET [as {GL,GR}]
>
>where the default is GL. And for keyinput
>
>   SET KEYINPUT CHARACTER-SET [as {GL,GR}]

Is it necessary to specify GL and GR parameter in separate manner? This makes it complex to specify character code for terminal emulation.
Of course, this problem may be improved by MACRO. However, FOR SIMPLICITY, we should not add new option which is not used so often. How about to consider the CHARACTER-SET as a set of codes which consists Roman, Katakana and Kanji. For example, when we interact with UNIX, we issue SET TERMINAL CHARACTER-SET EUC command. The parameter EUC means Roman=JIS X0201, Katakana=EUC and Kanji=EUC. Usually, remote host uses the same character code for input and output. In addition, we don't type Kanji code directly. In Japan, we usually uses Front End Processor which is a device driver or a resident program on PC and convert Roman or Katakana to Kanji code. The converted Kanji code is passed to operating system such as MS-DOS. This means it's not necessary to have SET KEYINPUT(KEYMAP) command. So, SET TERMINAL CHARACTER-SET should be applied also for output from PC. Even if you want to redirect and transmit characters not from keyboard but from file(by TRANSMIT command), this approach could work well. In our environment, we are satisfied with the conventional SET KEY command and we don't need yet another command to specify keyboard character mapping. >From: Joe Doupnik > > The last doubt in my mind then relates to terminal emulation in a >7 bit environment. Here the Kermit 8 bit quoting mechanism is not available. >I think that it is not a difficult task to allow both 7 and 8 bit ISO shifts >to be available in a terminal emulator, selected automatically by presence of >parity and overridable by existing Kermit commands. In Japan, we have character encoding standard JIS X0208. However, it's NON-standard and one of Kanji codes. Actually, we have more than three Kanji encoding non-standard and it's difficult to find JIS x0208 oriented machine. So, we have to specify the Kanji encoding explicitly in TERMINAL EMULATION as well as FILE TRANSFER. This means standard representation in communication channel is related BOTH terminal emulation AND file transfer. 
Especially, Japanese Kermit lovers are eager for the standardization. Of course, it's possible to use SET TRANSFER-SYNTAX command to specify character coding to terminal emulator as well as file transfer to reduce the number of kermit command set. But, Frank thinks that two commands (SET TERMINAL CHARACTER-SET and SET TRANSFER SYNTAX) should be prepared for terminal emulation and file transfer respectively. >From: Hirofumi Fujii > >Terminal Emulator >----------------- >I think, in Kermit protocol, it is not necessary to say about the terminal >emulator. It is machine dependent and can be handled within the local >routines. >Actually, my Kermit (MSVP98) already has ISO-2022 features (supports >G0, G1, G2 and G3 character sets, all locking-shift and single-shift >mechanisms) within the scheme of MS-Kermit. I have not modified any >machine-independent routines of the MS-Kermit. Joe has separated Kermit >modules very nicely and clearly. I know terminal emulator can support both ISO-2022 and EUC as Kanji. However, we cannot specify SHIFTJIS code without SET TERMINAL CHARACTER-SET. Actually, we cannot use UNIX manufactured by SONY without the command. So, it's necessary to implement SET TERMINAL CHARACTER-SET command. We cannot do without it! >The second DRAFT says: > >TERMINAL EMULATION > >In this regard, it is important to note that not all languages are written >from left to right, top to bottom. Hebrew and Arabic are two examples of >right-to-left languages, and Japanese and Chinese may be written top to >bottom. The order of the text characters on disk or on the transmission line >do not necessarily reflect their order on the screen or the printed page. In our(Japanese) case, we usually write in left-to-right manner. So, the order of text character will reflect their order. I don't know the order in other languages. >The following command should specify what character set is sent and received >on the transmission medium during terminal emulation. 
The Kermit program must
>translate between this character set and the one that is used locally.
>
>SET TERMINAL CHARACTER-SET [{GL, GR}]
>  This command already exists, but is currently used only in MS-DOS Kermit, and
>  only to switch between US and UK ASCII. We should extend this command to
>  select any character code, and to assign it to GL (default) or GR, and we
>  should have a standard set of 's including the currently defined ISO
>  8-bit alphabets:
>
>    LATIN1-ISO8, ..., LATIN5-ISO8, CYRILLIC-ISO8, GREEK-ISO8,
>    HEBREW-ISO8, ARABIC-ISO8, etc.
>
>  7-bit ASCII and its national variants (ISO-646):
>
>    ASCII-US, ASCII-UK, ASCII-FR, ASCII-DE, ASCII-IT, ASCII-NL, ASCII-ES,
>    ASCII-DK, ASCII-FI, ASCII-IS, ASCII-SE, ASCII-NO, ASCII-TR, etc.
>
>  And for Japanese:
>
>    KANJI-JIS, KANJI-SHIFTJIS, KANJI-EUC, etc.
>
>For example, an MS-DOS computer might use SHIFTJIS locally, but a VAX
>communicates using EUC, so the MS-DOS Kermit user would give the command SET
>TERM CHAR KANJI-EUC.

As I pointed out, it's better not to have [{GL, GR}]. Instead, we should consider the specified parameter as a set of Roman, Katakana and Kanji.

> The second DRAFT says;
>
>For keyboard character input, in addition to the current per-key SET KEY
>mechanism, there should be a way to assign an entire translation table to the
>entire keyboard. This command would be:
>
>  SET TERMINAL KEYMAP [{GL, GR}]

As I pointed out, we can do without KEYMAP command. If user want to change keyboard mapping, the user should use macro which includes several SET KEY commands.

2. PROPOSAL for NEW or EXTENDED KERMIT COMMANDS

In the DRAFT, several new commands are defined. In addition, there are some extended conventional commands related to character encoding.
They are;

    SET FILE TYPE {TEXT|BINARY|WORDPERFECT......}
    SET FILE CHARACTER-SET {KANJI-EUC|KANJI-SHIFTJIS|....}
    SET TRANSFER-SYNTAX {NORMAL|ISO-2022}
    SET TERMINAL CHARACTER-SET {KANJI-EUC|KANJI-SHIFTJIS|....}
    ( I think we don't need SET TERMINAL KEYMAP command)

I'm confused by these commands, because it seems two level representation(coding) are mixed. I think there are two protocol layers in the proposal, (1) generic character coding and (2) application oriented coding. We should distinguish these two layers clearly. As for (1), it's common problem in all files and have relation to communication channel. However, (2) is specific to some word processor such as WORDPERFECT and considered as a upper layer on (1). (1) is stable and clearly defined, but (2) is unstable since it's application defined and new software may use other coding. As for (2), Kermit should be flexible. So, it's better to translate presentation by filter in batch manner as Mr.Gisbert W.Selke said in his mail.

SET FILE TYPE command in the draft confused me, since the command includes both (1) and (2) specification in one command. So, I think we don't have to extend SET FILE TYPE command. Rather, we propose new SET FILE FILTER command to convert application oriented coding every time a file is transferred. Our proposal is as follows;

    SET FILE TYPE {TEXT|BINARY}
        specifies local file translation
        If TEXT is specified, character may be converted.
        If BINARY is specified, character is not converted.

    SET FILE FILTER {NONE|conversion-program-name}
        Specified program converts application oriented coding
        prior to transfer file in batch manner, if SET FILE TYPE TEXT
        is specified.

    SET FILE CHARACTER-SET {KANJI-EUC|KANJI-SHIFTJIS|....}
        specifies local file coding

    SET TRANSFER-SYNTAX {NORMAL|ISO-2022}
        specifies communication channel coding in file transfer

    SET TERMINAL CHARACTER-SET {KANJI-EUC|KANJI-SHIFTJIS|....}
        specifies communication channel coding in terminal emulation

We believe SET FILE FILTER command offers flexible interface to application oriented coding. If two programs are necessary for receive and transmit, we may need argument {RECEIVE|TRANSMIT} after the program name.

The following figure shows layers related to character coding.

    +-------------+--------------------+
    | WORDPERFECT | XYWRITE | MACWRITE |
    |  (application specific coding)   |
    +-------------+---------+----------+--------------------------+
    | TEXT file (coding is converted)  | BINARY (never converted) |
    +=============================================================+
    |                                                             |
    |   8bit-transparent channel compensated by 8th-bit quoting   |
    |                                                             |
    +-------------------------------------------------------------+
    |                                                             |
    |     raw transmission channel (non-transparent channel)      |
    |                                                             |
    +-------------------------------------------------------------+
                      Kermit presentation layer

As for SET TERMINAL CHARACTER-SET command, We could integrate SET TRANSFER-SYNTAX command and SET TERMINAL CHARACTER-SET command if SET TRANSFER-SYNTAX command is extended, since SET TERMINAL CHARACTER-SET is considered as a coding specification in communication channel. (This is described in detail in the following section.)

3. CONSIDERATION FOR THE CONVENTIONAL KERMIT PROGRAMS (COMPATIBILITY)

Indeed, we should make progress toward common coding system described in the draft. Currently, there is no such implementation. However, such automatic conversion function, Kanji coding conversion, is strongly requested in Japan. NTT has already developed local Kanji coding conversion facility and has distributed the improved Kermit in Japan. This local conversion requires no modification on server side.
It's impossible to implement the proposed function in all Kermit servers immediately. So, we must consider how we should implement LOCAL Kanji coding translation and integrate these commands with the proposed new commands for the present.

Consider what character coding is adopted in the conventional Kermit transmission channel. It's the server's coding. If the remote host is IBM, the coding is IBM-Kanji. If the remote host is VAX, the coding on the transmission channel is KANJI-EUC. If the remote host is UNIX, the coding may be KANJI-JIS (equivalent to ISO-2022). So, we can consider it as if we specified the command SET TRANSFER-SYNTAX {KANJI-EUC|KANJI-IBM|ISO-2022(JIS)}. The local Kermit can convert the coding if the remote host's coding is specified by the SET TRANSFER-SYNTAX command. Therefore, we propose to extend the command for the present. Of course, these extended options will be deleted in the future. We must consider initial file attribute negotiation for extended parameters. (But, I have no idea now.)

    SET TRANSFER-SYNTAX {NORMAL|ISO-2022(JIS)|KANJI-EUC|KANJI-IBM...}

If the remote host Kanji coding is specified by this command, we can utilize the information also for terminal emulation. This is because the same Kanji coding is used both in the file system and in interaction with the remote host. So, we can offer a macro to set coding for both SET TRANSFER-SYNTAX and SET TERMINAL CHARACTER-SET. (Indeed, we can integrate these two commands if file transfer and interaction use the same coding in the future. But, Frank recommended us to consider these codings as different functions. I understand his recommendation.) (If it's impossible to modify the shell, we can use the same coding for file transfer and terminal emulation. But, it's hard to modify the shell.)

4. CONFLICT BETWEEN ISO/ECMA AND THE CODES IN JAPAN

In the second draft, the conflict is reported.
According to an article in a magazine in Japan, 4/7(G) and 4/8(H) are allocated to Swedish Roman Character, 4/9(I) and 4/10(J) are allocated to Japanese Roman Character. I'll ask a scientist in NTT who is a member of one of the ISO committees about this conflict and report it on this mailing-list.

5. 8bit transparency

The important function which Kermit offers is 8bit transparency on any transmission channel. ISO 2022 also offers the same mechanism. In the proposal, 7bit coding is allowed to keep transparency for Kermit programs which have no 8th-bit quoting facility. In my opinion, Kermit should offer 8th-bit quoting by default. This makes the Kermit layer model simple. The 8bit transparent channel is considered as a common base for both ISO2022 coded TEXT data and binary file data. If 7bit coding is employed for transparency, it violates the natural Kermit layer model. We should separate transmission channel and data encoding clearly.

Some people have already pointed out that 8bit coding makes file transfer fast. So, I think 8bit coding is suitable for Kermit. However, 8th-bit quoting must be defined NOT as an optional BUT as a mandatory function for 8bit encoding.

    +-------------------------------------+
    |     application oriented coding     |
    +-----------------------+-------------+
    | 8bit(7bit) ISO coding | binary data |
    +-------------------------------------+
    |      8bit transparent channel       |
    +-------------------------------------+
    |        non-transparent line         |
    +-------------------------------------+

6. Announcer letter in attribute packet

I agree with this beautiful negotiation.
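
[Illustration of the 8th-bit quoting discussed in section 5 above. This is only a minimal sketch in Python, not code from any Kermit program. It assumes Kermit's customary prefix characters, '#' for control characters and '&' for the 8th bit; the function names encode_data and decode_data are hypothetical, and repeat-count prefixing, packet framing and checksums are left out.]

```python
# Sketch of Kermit-style prefixing that carries 8-bit data over a 7-bit,
# non-transparent line. Prefix characters are assumed defaults.
CTL_PREFIX = ord('#')   # assumed control-character prefix
BIN_PREFIX = ord('&')   # assumed 8th-bit prefix
PREFIXES = (CTL_PREFIX, BIN_PREFIX)


def is_control(b7):
    """True for 7-bit control codes (0x00-0x1F) and DEL (0x7F)."""
    return b7 < 0x20 or b7 == 0x7F


def encode_data(data: bytes) -> bytes:
    """Encode arbitrary bytes as printable 7-bit characters."""
    out = bytearray()
    for b in data:
        if b & 0x80:                 # 8th bit set: announce it, then strip it
            out.append(BIN_PREFIX)
            b &= 0x7F
        if is_control(b):            # control code: prefix and make printable
            out.append(CTL_PREFIX)
            out.append(b ^ 0x40)
        elif b in PREFIXES:          # a literal prefix character: quote it
            out.append(CTL_PREFIX)
            out.append(b)
        else:
            out.append(b)
    return bytes(out)


def decode_data(data: bytes) -> bytes:
    """Inverse of encode_data."""
    out = bytearray()
    i = 0
    while i < len(data):
        high = 0
        b = data[i]
        i += 1
        if b == BIN_PREFIX:          # next character has its 8th bit set
            high = 0x80
            b = data[i]
            i += 1
        if b == CTL_PREFIX:          # quoted prefix or transformed control code
            b = data[i]
            i += 1
            if b not in PREFIXES:
                b ^= 0x40
        out.append(b | high)
    return bytes(out)
```

[With this scheme the two EUC bytes A4 A2 travel as the four printable characters &$&" so the cost of transparency on a 7-bit line is one extra character per 8-bit byte, which is the speed concern mentioned above.]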
-------
-------
From MURAKAMI@ntt-20.ntt.jp Fri Apr 7 09:43:04 1989
Return-Path:
Received: from ntt-20.NTT.JP by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA23205; Fri, 7 Apr 89 09:43:04 EDT
Date: Fri, 7 Apr 89 22:42:20 I
From: ken-ichiro murakami
Subject: conflict between ECMA/ISO alphabet codes and the codes in Japan
To: isokermit@watsun.cc.columbia.edu
Cc: murakami@ntt-20.ntt.jp
Message-Id: <12484233414.26.MURAKAMI@NTT-20.NTT.JP>

Chris and Frank,

In the second DRAFT, conflict between ISO/ECMA alphabet codes and the codes used in Japan is pointed out. I asked Mr.Kouichi Suzuki in NTT, who knows ISO specification very well, about it and confirmed that there is NO conflict.

-Ken

Ken-ichiro Murakami
NTT Laboratories
Tokyo, Japan

>
> POSSIBLE PROBLEM: There seems to be conflict between ISO/ECMA alphabet codes
> and the codes used in Japan:
>
>   Letter  Europe   Japan
>   I       Czech    JIS-Katakana
>   J       ISO6937  JIS-Roman

(1) What is assigned by ISO/ECMA?

ECMA assigns both alphabet codes and escape sequences for it. Therefore, even if the same alphabet code number is assigned to two code sets, the different escape code ensures its uniqueness.

(2) 94 character set and 96 character set

ISO2022 defines two kinds of character set. One is the 94 character set, which doesn't include 2/0 and 7/15, such as JIS X0201 and ISO646(IRV); the other is the 96 character set. Different escape sequences for designation are used for 94 and 96 character sets as follows;

    2/8  F ; designate 94 character set to G0
    2/9  F ; designate 94 character set to G1
    2/10 F ; designate 94 character set to G2
    2/11 F ; designate 94 character set to G3
    2/12 F ; designate 96 character set to G0
    2/13 F ; designate 96 character set to G1
    2/14 F ; designate 96 character set to G2
    2/15 F ; designate 96 character set to G3

Therefore, the designation is corrected as follows;

> Alphabet Name            Esc Seq  ISO Number  ECMA Ref  ECMA Registration
>
> ASCII (ANSI X3.4-1986)   (B       ISO 646     ECMA-6    ?

    Registration No. 6
        G0 set: 2/8  4/2
        G1 set: 2/9  4/2
        G2 set: 2/10 4/2
        G3 set: 2/11 4/2

> Latin Alphabet No. 1     -A       ISO 8859-1  ECMA-94   100

    Registration No. 100
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 1)
        G1 set: 2/13 4/1
        G2 set: 2/14 4/1
        G3 set: 2/15 4/1

> Latin Alphabet No. 2     -B       ISO 8859-2  ECMA-94   101

    Registration No. 101
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 2)
        G1 set: 2/13 4/2
        G2 set: 2/14 4/2
        G3 set: 2/15 4/2

> Latin Alphabet No. 3     -C       ISO 8859-3  ECMA-94   109

    Registration No. 109
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 3)
        G1 set: 2/13 4/3
        G2 set: 2/14 4/3
        G3 set: 2/15 4/3

> Latin Alphabet No. 4     -D       ISO 8859-4  ECMA-94   110

    Registration No. 110
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 4)
        G1 set: 2/13 4/4
        G2 set: 2/14 4/4
        G3 set: 2/15 4/4

> Latin/Cyrillic           -L       ISO 8859-5  ECMA-113  144
> Latin/Arabic             -G       ISO 8859-6  ECMA-114  127

    Registration No. 127
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 6)
        G1 set: 2/13 4/7
        G2 set: 2/14 4/7
        G3 set: 2/15 4/7

> Latin/Greek              -F       ISO 8859-7  ECMA-118  126

    Registration No. 126
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 8859 part 7)
        G1 set: 2/13 4/6
        G2 set: 2/14 4/6
        G3 set: 2/15 4/6

> Latin/Hebrew             -H       ISO 8859-8  ECMA-121  138

    Registration No. 138
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0)
        G1 set: 2/13 4/8
        G2 set: 2/14 4/8
        G3 set: 2/15 4/8

> Latin Alphabet No. 5     -M       ISO 8859-9  ECMA-128  148
> Czech Standard           -I       ?           ?         139

    Registration No. 139
        G0 set: ---
        G1 set: 2/13 4/9
        G2 set: 2/14 4/9
        G3 set: 2/15 4/9

> JIS-Roman                -I       ?           ?         14

    Registration No. 14
        G0 set: 2/8  4/10
        G1 set: 2/9  4/10
        G2 set: 2/10 4/10
        G3 set: 2/11 4/10

> Right Half, ISO 6937-2   -J       ISO 6937-2  ?         142

    Registration No. 142
        G0 set: --- (not defined because this set is specified
                     not to be invoked into G0; see ISO 6937 part 2)
        G1 set: 2/13 4/10
        G2 set: 2/14 4/10
        G3 set: 2/15 4/10

> JIS-Katakana             -I       ?           ?         13

    Registration No. 13
        G0 set: 2/8  4/9
        G1 set: 2/9  4/9
        G2 set: 2/10 4/9
        G3 set: 2/11 4/9

> Math/Technical Set       -K       ?           ?         143
> Chinese (CAS GB 2312-80) $)A      ?           ?         ?
> Chinese (CAS GB 2312-80) $)A      ?           ?         58
> Japanese (JIS 0208)      $)B      ?           ?         ?
> Japanese (JIS 0208)      $)B      ?           ?         87
> Korean (KS C 5601-1987)  ?        ?           ?         ?
>
> Table 5: Alphabets, Selectors, Standards, and Registration Numbers

-------
From @cuvmb.cc.columbia.edu:JPALME@COM.QZ.SE Sat Apr 8 08:35:17 1989
Return-Path: <@cuvmb.cc.columbia.edu:JPALME@COM.QZ.SE>
Received: from columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA27194; Sat, 8 Apr 89 08:35:17 EDT
Received: from cuvmb.cc.columbia.edu by columbia.edu (5.59++/0.3) with SMTP id AA06855; Sat, 8 Apr 89 08:35:13 EDT
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 8506; Sat, 08 Apr 89 08:35:23 EDT
Received: from SEARN.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5206; Sat, 08 Apr 89 08:34:19 EDT
Received: from QZCOM by SEARN.BITNET (Mailer X1.25) with BSMTP id 1868; Sat, 08 Apr 89 13:34:00 EDT
Message-Id: <409118@QZCOM>
Date: 08 Apr 89 12:45 +0200
From: "Jacob Palme QZ"
Reply-To: "Jacob Palme QZ"
To: "ISO/Kermit Discussion Group"
Subject: ODA/ODIF

The ISO standard for exchange of data between word processors is called ODA/ODIF. If KERMIT is to include a facility for such transfer, the natural way would be to send the text in ODA/ODIF, and translate at either end to ODA/ODIF format.
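
[Illustration for the designation list in Mr. Murakami's message of April 7 above. The column/row notation (for example 2/13 4/1) can be turned into the actual bytes of an escape sequence as sketched below in Python. The intermediate bytes are taken from the list in that message as written, including the 2/12 entry; the names col_row and designate are hypothetical, and multi-byte sets such as Kanji, which need an extra 2/4 intermediate, are not covered.]

```python
# Sketch: build ISO 2022 designation sequences from column/row notation.
ESC = 0x1B


def col_row(col: int, row: int) -> int:
    """Convert column/row notation such as 4/1 to a byte value (0x41)."""
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return (col << 4) | row


# Intermediate byte for (set size, target G-set), as listed in the message.
INTERMEDIATE = {
    (94, 0): col_row(2, 8),
    (94, 1): col_row(2, 9),
    (94, 2): col_row(2, 10),
    (94, 3): col_row(2, 11),
    (96, 0): col_row(2, 12),
    (96, 1): col_row(2, 13),
    (96, 2): col_row(2, 14),
    (96, 3): col_row(2, 15),
}


def designate(set_size: int, g: int, final: int) -> bytes:
    """Return ESC, intermediate, final for a single-byte character set."""
    return bytes([ESC, INTERMEDIATE[(set_size, g)], final])


# ASCII (final 4/2) designated to G0, and Latin Alphabet No. 1 (final 4/1)
# designated to G1; these correspond to "(B" and "-A" in Table 5.
ascii_to_g0 = designate(94, 0, col_row(4, 2))
latin1_to_g1 = designate(96, 1, col_row(4, 1))
```

[Because the intermediate byte differs for 94 and 96 character sets, two sets may share the same final letter and still be designated unambiguously, which is the point made in (1) and (2) of that message.]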
From MURAKAMI@ntt-20.ntt.jp Mon Apr 10 09:56:04 1989 Return-Path: Received: from ntt-20.NTT.JP by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA09406; Mon, 10 Apr 89 09:56:04 EDT Date: Mon, 10 Apr 89 19:44:55 I From: ken-ichiro murakami Subject: Kermit meeting report from Tokyo To: isokermit@watsun.cc.columbia.edu Message-Id: <12484987549.15.MURAKAMI@NTT-20.NTT.JP> KERMIT MEETING REPORT from TOKYO April 7, 1989 Ken-ichiro Murakami NTT laboratories Tokyo, Japan Kermit experts in Japan had a meeting in Tokyo on April 4 and discussed the Kermit Extension for International Character Sets. The meeting was sponsored by DECUS Japan. About 10 people were present. This short report summarizes their opinions and the controversial points. DATE Tuesday, April 4, 1989, 15:20 pm - 17:20 pm DEC Japan Executive Room, Tokyo MEMBER Akira Itoh (University of Tokyo) Hirofumi Fujii (National Lab. for High-energy Physics) Hideaki Mikami (NTT Laboratories) Hideki Nakakita (NICON Systems) Kazuhisa Ohta (Nihon UNISYS) Koichi Nishimoto (DECUS Japan) Kenji Rikitake (University of Tokyo) Ken-ichiro Murakami (NTT Laboratories) Mamoru Ushimaru (University of Tokyo) Yutaka Ogawa (NTT Laboratories) Youichi Kazama (NTT Laboratories) SUMMARY In Japan, we have many non-standard Kanji codes and we are always bothered by this confusion. Therefore, we think it is important to have a standard representation like ISO-2022 for file transfer. In particular, this standardization will bring us a convenient function, that is, automatic Kanji code conversion. To realize this function, we should consider further the user interface (command set) and implementation. CONTROVERSIAL POINTS 1. NEW COMMANDS for TERMINAL EMULATION Is it necessary to have a new SET TERMINAL KEYMAP command? Is it necessary to have an argument {GR|GL} for SET TERMINAL CHARACTER-SET? A: User interface should be as simple as possible. This means we must minimize the number of new commands. 
Usually, the same Kanji code is used both input and output. Therefore, we don't need the KEYMAP command. B: Some UNIX machines use 8-bit EUC code as the internal Kanji encoding. However, these machines cannot receive EUC because of 8th-bit non-transparency. For such systems, it's necessary to have the KEYMAP command. A: Even if the system uses both EUC and JIS as input and output respectively, the terminal emulator can display both EUC and JIS simultaneously. If you issue SET TERMINAL CHARACTER-SET JIS, you can receive both EUC and JIS. Therefore, we don't need KEYMAP command. B: How about setting both CHARACTER-SET and KEYMAP automatically, when user issues SET TERMINAL CHARACTER-SET command? A: You can receive both EUC and JIS simultaneously. Would you really use the KEYMAP command? We should not add a new Kermit command if we don't need it. 2. SET FILE FILTER is a good idea. But, is it allowed to use non-ISO-2022 code? Using ISO-2022 on the communication channel is a very good idea and we should adopt ISO-2022 transfer syntax. However, we also need the SET FILE FILTER command until the ISO-2022 facility is implemented in the popular Kermit programs. We can convert Kanji code locally by this command for the present. For local conversion, we have to specify Kanji code in remote host. For this purpose, SET TRANSFER-SYNTAX will be used. The current draft specification only allows NORMAL and ISO-2022 as the argument. We may need additional arguments for other non-standard Kanji code such as EUC and SHIFTJIS. OPINIONS 1. We should keep Kermit commands simple. Novice users are often confused by a bunch of Kermit commands. (Since experts have a mental model for Kermit, they tend not to notice novice users' problems.) So, we should keep the set of Kermit commands as small as possible. For example, the SET TERMINAL KEYMAP command should be deleted, if we have an alternative way. 2. Is it possible for Japanese to modify Kermit for ISO2022? 
We Japanese have a few Kermit experts and contributors. Even if the ISO-2022 transfer standard is adopted, we cannot expect somebody to modify kermit for Japanese. This means we cannot use ISO-2022 right away. We should also consider yet another way to convert Kanji code in local. How we can merge the ISO-2022 and the conventional local Kanji conversion? 3. We have the same requirement to convert special files created by application programs. The popular Japanese word-processor ICHITARO creates special files like WORDPERFECT. Many users want to transfer these files and share them on mainframes or workstations. However, these files contain special format control characters and require conversion prior to transfer. For this purpose, automatic conversion facility is desirable. The SET FILE FILTER command proposed by NTT will be convenient for this purpose. CONCLUSION We have not reached a conclusion. We must consider how we can integrate SET FILE FILTER command with SET TRANSFER-SYNTAX command and Attribute negotiation. ISO 2022 might bring us automatic Kanji conversion in the future. We have to find yet another way for local Kanji conversion which can be merged with the ISO-2022 mechanism. ACKNOWLEDGMENT We would like to express our appreciation to Ms. Christine Gianone and Mr. Frank da Cruz for their help and consideration for the Japanese Kanji inconsistency problem. We also express special thanks to Mr. Koichi Nishimoto, Administrator of DECUS Japan, for supporting this meeting. ---< cut here >--- << Questions and Comments from Chris and Frank. >> Prior to delivering of this report, I asked Chris and Frank to correct the English in my report. Thank you very much. Chris and Frank also gave me comments and questions. So, I'll answer them. Please note that the answer is MY OWN opinion. Other members may have different opinions. >CONTROVERSIAL POINTS > >1. NEW COMMANDS for TERMINAL EMULATION > > Is it necessary to have a new SET TERMINAL KEYMAP command? 
> Is it necessary to have an argument {GR|GL} for SET TERMINAL > CHARACTER-SET? > > A: User interface should be as simple as possible. This means we must >minimize the number of new commands. Usually, the same Kanji code is used >both input and output. Therefore, we don't need the KEYMAP command. [But the KEYMAP command might be useful for other reasons, too. For example, in MS-DOS Kermit, you must enter many SET KEY commands to change the key map. Then, if you want to switch to another language, you must enter many more SET KEY commands. To switch back and forth between languages, you must have big macro definitions or TAKE files full of SET KEY commands, and you must execute them frequently. A more convenient approach for language switching is to build several complete keymaps into the Kermit program, assign names to them, and give a command to conveniently select an entire keymap. For example, SET TERMINAL KEYMAP EUC, SET TERMINAL KEYMAP NORWEGIAN, SET TERMINAL KEYMAP FRENCH...] OK. Some of Kermit users may need this function. As Frank said before, we need macro to set character code both in emulator and in keyboard since the same code is used for input and output usually. > B: Some UNIX machines use 8-bit EUC code as the internal Kanji >encoding. However, these machines cannot receive EUC because of 8th-bit >non-transparency. For such systems, it's necessary to have the KEYMAP >command. [We don't understand. How does the terminal transmit 8-bit EUC keystrokes to the UNIX system in the 7-bit environment? Does it use shifts like and ?] It's impossible to transmit 8bit-EUC keystrokes to the UNIX system in the 7-bit environment. Therefore, we have already adopted 7bit ISO2022 in communication channel such as TCP/IP and UUCP. However, I heard some CRAZY machines offered inconsistent environment, that is, 7bit for input and 8bit for output. I don't know these systems in detail. Mr.Fujii at KEK might explain us about this strange story. 
:-) > A: Even if the system uses both EUC and JIS as input and output >respectively, the terminal emulator can display both EUC and JIS >simultaneously. If you issue SET TERMINAL CHARACTER-SET JIS, you can >receive both EUC and JIS. Therefore, we don't need KEYMAP command. [How can the terminal display two character sets simultaneously? How does it know which set an incoming character belongs to? Are the codes compatible?] It's easy to distinguish between EUC Kanji and JIS Kanji(JIS means ISO2022 in 7bit environment.), because EUC Kanji doesn't overlaps with JIS Kanji. As for KataKana and Roman ASCII, they use the same code. > B: How about setting both CHARACTER-SET and KEYMAP automatically, when >user issues SET TERMINAL CHARACTER-SET command? [You mean that SET TERMINAL CHARACTER-SET should do two things: (1) assign a table that maps communication line input bytes to screen graphics, and (2) that assigns a table of transmission codes to the keyboard. This is OK, so long as the user still has a way to change the keyboard layout to correspond to her or his typing preferences.] Yes. We need macro to set both SET TERMINAL CHARACTER-SET and SET KEYMAP simultaneously. >2. SET FILE FILTER is a good idea. But, is it allowed to use non-ISO-2022 >code? > > Using ISO-2022 on the communication channel is a very good idea and we >should adopt ISO-2022 transfer syntax. However, we also need the SET FILE >FILTER command until the ISO-2022 facility is implemented in the popular >Kermit programs. We can convert Kanji code locally by this command for the >present. [Do we understand the SET FILE FILTER command? It seems to mean that a separate program must run to translate between the file format and the transfer syntax. Of course, this command can only be used on multiprocessing computer systems like UNIX, where one program can run another one, and their input and output can be piped together. 
On systems that can't do this, the user must run a preprocessor program before running Kermit, and then use Kermit to transfer the file in the normal way (TRANSFER-SYNTAX NORMAL). If the file is very big, then this can be most inconvenient -- disks will fill up, processing time will triple, etc. But on systems like UNIX, the SET FILE FILTER command is definitely a good idea. It can be used not only for international characters, but for compression, etc.] The filter program runs IN BATCH MANNER before file transfer or after file transfer on single process system such as MS-DOS. As you pointed out, this may inconvenient -- disks will fill up, processing time will triple, etc. In this case, user should keep enough room for code conversion. After the meeting, we further considered this function and noticed that powerful macro facility and take command might enable us to implement the same function as SET FILE FILTER. (MS-DOS Kermit supports the powerful macro facility.) This will never affect ISO2022 standardization. I'm trying to write such code conversion scenario using MACRO and TAKE command. > For local conversion, we have to specify Kanji code in remote host. >For this purpose, SET TRANSFER-SYNTAX will be used. The current draft >specification only allows NORMAL and ISO-2022 as the argument. We may >need additional arguments for other non-standard Kanji code such as EUC >and SHIFTJIS. [We did not intend to imply that NORMAL and ISO-2022 were the only allowable transfer syntaxes. Any REGISTERED character set should be transferred using ISO-2022. Any UNREGISTERED character set, such as EUC or SHIFTJIS (or even HEXADECIMAL or EBCDIC), can be used as a transfer syntax by itself, without any inline control codes for alphabet selection or shifting. So yes, SET TRANSFER-SYNTAX {EUC, SHIFTJIS} should also be allowed so that existing Japanese Kermit programs can interoperate with new ones.] 
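To make the local conversion under discussion concrete, here is a minimal sketch in Python (not Kermit code, and not part of the proposal) of one direction of it: turning 8-bit EUC Kanji into the 7-bit JIS form that uses ISO 2022 escape sequences. It assumes only the common case of ASCII plus two-byte JIS X 0208 Kanji. The function name euc_to_jis7 is made up for illustration, and half-width Katakana and supplementary sets (the 0x8E and 0x8F prefixes) are rejected rather than handled.

```python
ESC_KANJI = b"\x1b$B"  # ISO 2022 designation: two-byte JIS X 0208 Kanji
ESC_ASCII = b"\x1b(B"  # ISO 2022 designation: back to ASCII


def euc_to_jis7(data: bytes) -> bytes:
    """Convert EUC (ASCII + two-byte Kanji only) to 7-bit JIS. Illustrative sketch."""
    out = bytearray()
    in_kanji = False
    i = 0
    while i < len(data):
        b = data[i]
        if 0xA1 <= b <= 0xFE:
            # An EUC Kanji is two bytes, both with the 8th bit set.
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError(f"truncated or invalid EUC pair at offset {i}")
            if not in_kanji:
                out += ESC_KANJI
                in_kanji = True
            # Clearing the 8th bit of each byte gives the JIS code.
            out.append(b & 0x7F)
            out.append(data[i + 1] & 0x7F)
            i += 2
        elif b < 0x80:
            if in_kanji:
                out += ESC_ASCII
                in_kanji = False
            out.append(b)
            i += 1
        else:
            # 0x8E / 0x8F prefixes and stray 8-bit bytes are outside this sketch.
            raise ValueError(f"unsupported byte 0x{b:02X} at offset {i}")
    if in_kanji:
        out += ESC_ASCII  # leave the stream in ASCII at end of file
    return bytes(out)
```

Each EUC Kanji byte is the corresponding JIS byte with the 8th bit set, so the conversion amounts to clearing that bit and bracketing each Kanji run with designation escapes. This is also why the two forms cannot be confused on input: one lives entirely above 0x80, the other entirely below it.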
If we can implement local Kanji code conversion using macro and TAKE command, our requirement never affect ISO2022 standardization. >OPINIONS >2. Is it possible for Japanese to modify Kermit for ISO2022? > >We Japanese have a few Kermit experts and contributors. Even if the >ISO-2022 transfer standard is adopted, we cannot expect somebody to modify >kermit for Japanese. This means we cannot use ISO-2022 right away. We >should also consider yet another way to convert Kanji code in local. How >we can merge the ISO-2022 and the conventional local Kanji conversion? [Do this by allowing new Japanese Kermit programs to support SET TRANSFER-SYNTAX {EUC, SHIFTJIS} so they can talk to old Japanese Kermit programs that support only these transfer syntaxes.] Take command and macro may enable us to convert Kanji code in local. Until ISO2022 is implemented for Japanese, we may be able to use this local conversion facility. >ACKNOWLEDGMENT > >We would like to express our appreciation to Ms. Christine Gianone and >Mr. Frank da Cruz for their help and consideration for the Japanese Kanji >inconsistency problem. We also express special thanks to Mr. Koichi >Nishimoto, Administrator of DECUS Japan, for supporting this meeting. [And we appreciate the efforts of the Japanese contingent, and their valuable contributions to this proposal. The Japanese, with their multiplicity of character sets and their extensive practical experience in this area, are the "acid test" of any attempt to extend Kermit or any other data communications protocol to encompass multiple character sets. - Christine and Frank] Thank you very much for correcting my English. It's a good practice for me. -Ken (murakami%ntt-20.ntt.jp@relay-cs-net or murakami@ntt-20.ntt.jp) ------- From jrd Mon Apr 10 17:54:32 1989 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA13825; Mon, 10 Apr 89 17:54:32 EDT Date: Mon, 10 Apr 1989 17:54:31 EDT From: "Joe R. 
Doupnik" To: isokermit Cc: jrd Subject: ISO-2022 end results Message-Id: Chris, Frank, and the group: I think it is worth remembering that multi-language documents or terminal sessions presume the ability to maintain and display two, and sometimes more, alphabets. For printed output three or more alphabets is an implementation detail and is similar to font changes for word processed documents. For computer displays it is a more complicated problem because those with which I am familiar have at most 256 bytes (GL and GR) of pattern information (code byte + bit map) and we can't add a third ON THE SAME SCREEN without hardware games or going to bitmapped graphics mode. Hopefully the Japanese manufacturers have done better than 256 bytes. In the US some of the major word processor vendors have decided that the only sensible way of displaying fonts is to use graphics mode, a good stack of bitmap files, and then ask for patience by the users as the screen is updated slowly. While ISO 2022 provides the tools to change alphabets at any point there is a practical aspect of viewing files. The latter normally restricts the display to a maximum of two fonts/languages, one each in GL and GR tables. Many dot matrix or laser printers are freed from this limitation, at the cost of downloading new fonts when needed. A rather smart printer program is needed to accomplish this; mine is called WordPerfect. One conclusion to be drawn from this is taken from word processing programs: when a given "font" is not available then substitute the nearest equivalent character (by heuristics particular to each vendor). Adopting such a strategy for file transfer is dangerous because the file is not being reproduced precisely, and may not have warning messages about substitutions. It can be useful for terminal emulation however. 
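The substitution strategy described above might look roughly like the following Python sketch. It is an illustration only: it leans on the Unicode decomposition tables in Python's standard unicodedata module as a modern stand-in for the vendor heuristics mentioned, and the names nearest_equivalent and substitute_for_display are hypothetical. It returns the list of substitutions made, so that the caller can at least warn the user.

```python
import unicodedata


def nearest_equivalent(ch, repertoire, fallback="?"):
    """Return ch if displayable, else its unaccented base letter, else fallback."""
    if ch in repertoire:
        return ch
    # Decompose (e.g. A-with-ring -> A + combining ring) and drop the marks.
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch)
        if not unicodedata.combining(c)
    )
    if base and all(c in repertoire for c in base):
        return base
    return fallback


def substitute_for_display(text, repertoire):
    """Map text onto the display repertoire, reporting every substitution."""
    shown = []
    substitutions = []
    for ch in text:
        replacement = nearest_equivalent(ch, repertoire)
        if replacement != ch:
            substitutions.append((ch, replacement))
        shown.append(replacement)
    return "".join(shown), substitutions
```

For a terminal that can only show printable ASCII, substitute_for_display("Ångström", ascii_set) would show "Angstrom" and report two substitutions. The same loss is exactly what makes the approach unsafe for file transfer.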
The second conclusion is that we might focus on two character sets at a (long) time for terminal emulation, with substitutions, and let file transfers employ the full range of ISO-2022 as required. Finally, with terminal emulation it is convenient for echos of what we send to appear on the screen in the expected form rather than sending one form and receiving a different mapping. Thus, keyboard output needs to track the received language. Joe Doupnik From JRD@cc.usu.edu Mon Apr 10 18:15:11 1989 Return-Path: Received: from cc.usu.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA14048; Mon, 10 Apr 89 18:15:11 EDT Message-Id: <8904102215.AA14048@watsun.cc.columbia.edu> Date: Thu, 6 Apr 89 21:03 MDT From: Joe Doupnik To: isokermit@watsun.cc.columbia.edu X-Vms-To: IN%"isokermit@watsun.cc.columbia.edu",JRD Chris, Frank, and the group: Ken-Ichiro Murakami brought up some important points on the command set of Kermit. I agree that there should be separate commands for file transfer and terminal emulation. I think we are trying to arrange file transfer commands so that SET FILE TYPE manages record delimiters and Columbia would like this to be extended to manage more general file constructions such as found in applications programs; both uses are file system items. It is appropriate to include applications program names in this command because the file format is of concern, even though it does strongly involve elements of alphabet (or font) changing. The purpose of the command SET TRANSFER-SYNTAX is to manually inform the other machine that a polyalphabet file is being transferred using ISO-2022 methods and thus allow the host to be aware of and to translate ISO-2022 to perhaps another storage form. It is a user override since the file reader may not be able to decide independently to deliver characters literally or in ISO forms. 
Clearly, normal Kermit quoting mechanisms can manage both forms on the communications link, but the receiving host may need advance notification about how to store text in its own local format. The command can also be used to enable the local Kermit's file reader to use ISO conventions.

An excerpt from draft proposal #2:

> SELECTING ISO-2022 TRANSFER SYNTAX
>
> Kermit's default transfer syntax is NORMAL (meaning either ASCII text, or
> binary, according to SET FILE TYPE). Kermit's ISO-2022 transfer syntax
> must therefore be enabled in some way, either automatically or explicitly by
> the user. In the automatic case, the Kermit program recognizes (somehow) that
> it is to transfer a multi-alphabet text file. In the manual case, the user
> issues a SET command:
>
>   SET TRANSFER-SYNTAX ISO-2022
>
> It must also be possible to override the automatic use of ISO-2022 syntax
> via the command:
>
>   SET TRANSFER-SYNTAX NORMAL

Personally, I think we should be very cautious about adding more than NORMAL or ISO-2022 to the list of methods. ISO is supposed to allow conversion from almost any to almost any other representation and thus reduce the "N by N" problem (where each side understands how to convert all the forms of the other side).

On mandatory eight bit quoting capabilities: it would certainly simplify matters. Tough on long departed developers. And I tend to agree with Gisbert that files from applications programs are best managed with standalone filters. As a practical matter, adding executable code to a Kermit program during operation of the program is not an easy thing to accomplish. We should not overlook the chance to understand a few widely accepted file formats, so each Kermit implementation may include some or let the user do the filtering externally.

Terminal emulation ought to be separated from file transfer, in my opinion.
A diverse collection of files may be transferred via ISO methods while the terminal communications language might be unrelated to the files. We do need some method of permitting the local user to define the contents of the display adapter (or equivalent) for both Left and Right tables. Thus GL and GR qualifiers in SET TERMINAL CHARACTER-SET are convenient options. Translation of keystrokes is still a problem for me and requires more information from persons most affected. One concept I try to keep in mind is that an echo of what we transmit ought to appear "correctly" on the local screen. Curiously, ISO-2022 says a great deal about displays but nothing about keyboards. Somehow we need to transmit character codes representing both GR and GL tables (without memorizing lots of escape codes). Defining a few specialized keys to transmit the ISO table shift codes is possible and would be similar to operating in an outgoing 7-bit environment and an incoming 8-bit one. After all, the keyboard maps to only one table at a time. Keyboard questions should not impact the proposal before us, however. Joe Doupnik From lts!amanda@uunet.uu.net Tue Apr 11 09:47:23 1989 Return-Path: Received: from uunet.UU.NET by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA20330; Tue, 11 Apr 89 09:47:23 EDT Received: from lts.UUCP by uunet.UU.NET (5.61/1.14) with UUCP id AA01767; Tue, 11 Apr 89 09:47:18 -0400 Received: by lts.UUCP (4.12/2.881128) id AA01856; Tue, 11 Apr 89 08:08:47 est Date: Tue, 11 Apr 89 08:08:47 est From: Amanda Walker Message-Id: <8904111308.AA01856@lts.UUCP> To: isokermit@watsun.cc.columbia.edu Subject: Display vs. File Transfer Joe, While I think you have made some excellent points, I'm not so sure that we should limit ourselves to two-character-sets-at-a-time. There seem to me to be three major classes of displays out there that we need to worry about: 1. Hardware character generator, perhaps with some non-vanilla ASCII characters available. 
This includes most terminals, the IBM MDA & CGA, and most terminal emulators for window systems.

2. Software programmable character generator with a limited number of slots. This includes the IBM EGA & VGA, the Hercules RAMfont boards, the DEC VT220/240/320/340, and so on.

3. True graphics displays. This includes the Macintosh, Amiga, a PC running Windows or the Presentation Manager, and most workstations (in theory, anyway...).

On class one machines, you do the best you can. On, say, an IBM MDA/CGA, the ROM character generator does have a fair amount of ISO 8859/1 in it; it's not complete, but you do get most of the characters with diacritical marks, which makes it more useful than nothing.

I think it would be a mistake to cripple Kermit implementations on machines that fall into the latter two classes in order to accommodate machines of the first one. There will always be implementation limits--for example, very few machines will be able to display the full JIS Kanji set--but I think these should be left up to the implementation, rather than being part of the spec.

There are also tricks you can play with machines in class 2, which are the ones you seemed concerned about. On a machine with an EGA/VGA/MCGA/Hercules card, you get two full 256-slot character sets. Nothing says that each of these must be a single ISO 2022 character set... For example, I wrote the EGA support for my company's PC telnet package, which emulates a VT220. It leaves the default character set alone, and loads a secondary character set with only those characters in the VT100 line-drawing set and ISO 8859/1 that don't appear in the ROM font. So far it handles VT100 graphics, DMCS (DEC Multinational character set, which is almost but not quite ISO 8859/1), and ISO 8859/1, and there's still a fair amount of room left. If things get cramped (which they will as we add character sets), I may end up loading a new set into slot 0 as well, but so far that's been unnecessary.
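The secondary-font trick can be sketched as a simple set difference. The Python below is only an illustration under stated assumptions: it takes Python's cp437 codec as a stand-in for the PC ROM font repertoire and latin_1 for ISO 8859/1, ignores the VT100 line-drawing set, and the function name glyphs_to_load is invented. It says nothing about how the actual telnet package is written.

```python
def glyphs_to_load(rom_codec="cp437", wanted_codec="latin_1"):
    """Assign secondary-font slots to wanted characters the ROM font lacks.

    Returns a dict mapping each missing character to a slot number in the
    loadable (secondary) character set. Codec names are assumptions used
    to approximate the ROM font and the target character set.
    """
    # Everything the ROM font can already draw.
    rom = {bytes([b]).decode(rom_codec) for b in range(256)}
    # The right-hand (GR) half of the wanted 8-bit set.
    wanted = {bytes([b]).decode(wanted_codec) for b in range(0xA0, 0x100)}
    # Only the characters absent from ROM need a loadable slot.
    missing = sorted(wanted - rom)
    return {ch: slot for slot, ch in enumerate(missing)}
```

Because only the missing characters take up slots, the 256-slot secondary set keeps room for other sets, which matches the observation that there is still a fair amount of room left.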
On class 3 machines, of course, it's not a problem. This is where things seem to be going, as well. I, for one, would be happy to run my IBM PC/AT in graphics mode if I needed to see all of the characters I was using. No, it's not as speedy as text mode, but it's not all that shabby (esp. if you're not running under Windows :-)).

One of the nice things about Kermit so far has been that it has taken more of a "Greatest Common Factor" philosophy than a "Least Common Denominator" one, and I don't think we should stop now.

Amanda Walker
InterCon Systems Corporation

P.S. I can't donate any code to the project, but I'd be happy to send the PC folks a copy of the supplementary font I mentioned above, if they're interested. Just let me know where I should mail it...

From @cunyvm.cuny.edu:FI@NORUNIT.BITNET Wed Apr 12 13:56:08 1989
Return-Path: <@cunyvm.cuny.edu:FI@NORUNIT.BITNET>
Received: from CUNYVM.CUNY.EDU by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA04421; Wed, 12 Apr 89 13:56:08 EDT
Message-Id: <8904121756.AA04421@watsun.cc.columbia.edu>
Received: from NORUNIT.BITNET by CUNYVM.CUNY.EDU (IBM VM SMTP R1.1) with BSMTP id 0575; Wed, 12 Apr 89 13:55:57 EDT
Date: Wed, 12 Apr 89 16:10:44 ECT
To: isokermit@watsun.cc.columbia.edu
From: FI%NORUNIT.BITNET@cunyvm.cuny.edu
Comment: CROSSNET mail via SMTP@INTERBIT

To the ISO-kermit list administration:

This list seems to be lacking a '-request' address, which forces me to distribute this message to everybody. My apologies for this. Please remove me from the list until further notice.
Frithjov Iversen From mcvax!krafla!frisk@uunet.uu.net Wed Apr 12 21:14:36 1989 Return-Path: Received: from uunet.UU.NET by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA08362; Wed, 12 Apr 89 21:14:36 EDT Received: from mcvax.UUCP by uunet.UU.NET (5.61/1.14) with UUCP id AA24988; Wed, 12 Apr 89 21:14:29 -0400 Received: by mcvax.cwi.nl via EUnet; Wed, 12 Apr 89 20:14:16 +0200 (MET) Received: by hafro.is (5.57/smail2.5/08-08-88) id AA24627; Wed, 12 Apr 89 13:09:08 GMT Received: by rhi.hi.is (13.1/smail2.5/03-10-88) id AA28781; Wed, 12 Apr 89 13:02:31 gmt From: mcvax!rhi.hi.is!frisk@uunet.uu.net (Fridrik Skulason) Message-Id: <8904121302.AA28781@rhi.hi.is> Subject: 8859/1 kermit in Iceland To: isokermit@watsun.cc.columbia.edu Date: Wed, 12 Apr 89 13:02:29 GMT X-Mailer: Elm [version 2.1 PL1] Just a few random thoughts... Here in Iceland we have been using kermit with ISO 8859/1 translation for almost three years now. This was done since our national language contains 10 characters outside the 7-bit ASCII character set, and we wanted to simplify the file transfer process. The changes: A) File transfer: The command SET FILE TYPE {BINARY|TEXT} was added for Kermit versions that did not support it before (like IBM PC). No changes were made in the BINARY case, but when transmitting TEXT files, automatic translation to ISO 8859/1 was done. Also, when receiving TEXT files, translation from ISO 8859/1 to the native character set was done automatically. So, when transmitting text files from an IBM PC using Code Page 861 (861 is the Icelandic PC character set) to a VAX using DEC- Multilingual character set (which is almost, but not quite ISO 8859/1), the PC translated the file to ISO 8859/1 and the VAX translated the file to DEC Multinational. 
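As a rough sketch of this arrangement (not the Icelandic Kermit code itself), the Python below treats ISO 8859/1 as the common transfer form, with each side translating only between it and its own native set. It uses Python's standard cp861, latin_1 and mac_iceland codecs. mac_iceland stands in for the receiving side because Python has no DEC Multinational codec, the function names are hypothetical, and replacing untranslatable characters with "?" is an assumption, not something this message specifies.

```python
TRANSFER = "latin_1"  # ISO 8859/1 as the on-the-wire character set


def to_transfer(data: bytes, native: str) -> bytes:
    """Sender side: native character set -> ISO 8859/1."""
    # Characters with no ISO 8859/1 equivalent become "?" (an assumption).
    return data.decode(native).encode(TRANSFER, errors="replace")


def from_transfer(data: bytes, native: str) -> bytes:
    """Receiver side: ISO 8859/1 -> native character set."""
    return data.decode(TRANSFER).encode(native, errors="replace")


if __name__ == "__main__":
    pc_file = "Þórður".encode("cp861")              # as stored on the PC
    wire = to_transfer(pc_file, "cp861")            # what goes down the line
    mac_file = from_transfer(wire, "mac_iceland")   # as stored on the receiver
    print(mac_file.decode("mac_iceland"))
```

With a common intermediate set, each implementation needs only two tables (to and from ISO 8859/1) rather than one pair for every other machine it might talk to. The cost is that anything outside the intermediate set, such as PC box-drawing characters, cannot survive the trip.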
Native character set A ----> ISO 8859/1 ----> Native character set B

We also changed the Macintosh Kermit in a similar way, but in other cases (HP 9000 and ATARI ST) no changes were needed, since those machines use the ISO 8859/1 character set here in Iceland. The only problem that we have run into is that some characters can not be represented in ISO 8859/1. The most common problem is when people try to transmit files containing PC line/box drawing characters. This, however, has not been a serious problem.

B) Terminal emulation. A new command:

SET TRANSLATION {NONE | ISO | CP850 | DEC-MULTI | ROMAN-8}

was added to the IBM PC Kermit. It was used to specify what sort of translation should be applied to incoming/outgoing characters while in terminal emulation mode. An important thing to note is that this translation command is totally independent of the file translation. All incoming characters that could not be translated to Code Page 861 were displayed as character 168 (upside-down question mark).

From MURAKAMI@ntt-20.ntt.jp Thu Apr 13 02:41:30 1989
Return-Path:
Received: from ntt-20.NTT.JP by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA10524; Thu, 13 Apr 89 02:41:30 EDT
Date: Thu, 13 Apr 89 15:37:37 I
From: ken-ichiro murakami
Subject: Simple Macro commands for local (Kanji) code conversion
To: isokermit@watsun.cc.columbia.edu
Cc: murakami@ntt-20.ntt.jp
Message-Id: <12485718037.27.MURAKAMI@NTT-20.NTT.JP>

Hi. I wrote macros to convert Kanji in batch manner. This was my first experience writing Kermit macro definitions. I found this facility is powerful enough to handle a file. However, it's impossible to handle multiple files using the conventional macro facility. Therefore, I tried in vain to utilize MS-DOS batch commands such as the FOR command. The FOR command allows only a single command as its argument. So, I'm compelled to invoke yet another batch command file. This caused a problem.
After yet another batch file is executed, the variable which is used in the FOR command is cleared. So, I gave up using the FOR batch command.

To handle multiple files specified by wild card, we need the following new macro commands:

(1) FOR command like MS-DOS batch command. For example,

define send-all FOR \%f IN (*.ASM) ksend \%f
define ksend send \%1

Of course, file expansion command (*.ASM) is necessary in the FOR command.

(2) STRING-SEARCH command to check if the specified file contains wild card or not. For example,

define check-wild if string-search \%1 * goto check,-
  if string-search \%1 ? goto check,-
  goto ok,-
  :check, if defined \%2 goto error,-

This is used to check parameters specified in commands such as (K)SEND and (K)GET. If the first argument contains wild card, the second parameter should not be specified.

(3) Clear the variables \%1 .. \%9. In the conventional Kermit macro, parameters for a macro command such as \%1 are not cleared. This caused a problem. It's impossible to decide whether the user specified macro parameters or not. For example, the following command sequence causes the problem.

MS-KERMIT>KSEND foo.bar qwe.asd
MS-KERMIT>KGET kkk.kkk

The second command will rename the file kkk.kkk to qwe.asd. If it's impossible to clear these variables, we must have a convention to clear them. I defined such a macro to clear variables after every macro execution.

My experience shows that it's possible to make macro commands for code conversion. They can convert special characters created by specific application programs such as WORDPERFECT. (Under condition that (1), (2) and (3) are supported.) This makes it possible to consider ISO2022 transfer convention and application oriented character coding conversion separately. So, I take back my proposal, that is, SET TRANSFER-SYNTAX argument extension. Only {NORMAL|ISO2022} is OK.

As for SET FILE {TEXT|BINARY|WORDPERFECT..} command, it's desirable to separate application oriented file coding such as WORDPERFECT.
If these application names are included in the command, we must modify the command every time a new program appears. Rather, we should process these special files by macro. So, I propose not to extend SET FILE argument. {TEXT|BINARY} is enough.

-Ken

---< cut here >---
echo Kermit, AUTOMATIC KANJI CONVERSION MACROS version -1 April 12, 1989\13
echo \9 by Ken-ichiro Murakami\13
echo USAGE:\13
echo \9 KANJI {JIS|EUC|SJIS} specifies Kanji code in remote host\13
echo \9 KSEND source [destination] sends a source file after conversion\13
echo \9 KTRANSMIT source transmits a source file after conversion\13
echo \9 KGET source [destination] gets a file and converts Kanji code\13
echo \9 KRECEIVE source [destination] gets a file and converts Kanji code\13
echo NOTE:\13
echo \9 NO WILD CARD is allowed in the file specification.\13
echo \9 Reserved variables are %r, %s and %x. Don't overwrite them.\13
echo \9 Reserved file name is SYS9999.TMP.\13

; ******* CAUTION! ******
; Because of limited MS-DOS batch command, it's impossible to process multiple
; files specified by wild card. Therefore, I gave up to support wild card.
; I would like to request MS-kermit to support the following facility
; to compensate the limited MS-DOS batch command.
;(1) FOR command like MS-DOS FOR batch command
;(2) File name expansion facility in FOR command like (*.ASM) in MS-DOS batch
; command. This will be used for wild card.
;(3) String-search predicate for IF command argument. This will be used
; to inspect wild card specification in variable.
;(4) If macro argument is not specified, unspecified arguments(variables)
; should be cleared. It's impossible to decide whether the argument is
; specified or not in the conventional MS-KERMIT macro, because variables
; for arguments are not cleared.

; ******* definitions for KANJI macro command *******
define kanji if equal \%1 jis do set-kanji-jis,-
  if equal \%1 euc do set-kanji-euc, if equal \%1 sjis do set-kanji-sjis,-
  if equal \%x ok goto done,-
  :no-arg, do kanji-usage,-
  :done, define \%x, clr-var
define kanji-usage echo argument is JIS\44 EUC or SJIS\13

; Set register according to remote host Kanji code
; This value is used as a parameter for program CONVERT
; Note that MS-DOS uses SHIFTJIS as local Kanji code
; -1 = convert SHIFTJIS to JIS
; -2 = convert SHIFTJIS to EUC
; -3 = convert JIS to SHIFTJIS
; -4 = convert EUC to SHIFTJIS
define set-kanji-jis define \%s -1, define \%r -3, echo remote host is JIS,-
  set terminal kanji-code jis-7, define \%x ok
define set-kanji-euc define \%s -2, define \%r -4, echo remote host is EUC,-
  set terminal kanji-code DEC-code, define \%x ok
define set-kanji-sjis define \%s, define \%r, echo remote host is SHIFTJIS\13,-
  echo Use SEND\44 RECEIVE\44 GET and TRANSMIT instead of Kanji macros\13,-
  set terminal kanji-code Shift-JIS, define \%x ok

; ****** macro definitions for KSEND macro command *******
define ksend if not defined \%1 goto ksend-usage,-
  if not defined \%s goto ksend-error,-
  ksend1, clr-var,stop,-
  :ksend-usage, echo need local file name,clr-var,stop,-
  :ksend-error, echo specify remote Kanji code,clr-var
define ksend1 if not defined \%2 define \%2 \%1,-
  if exist SYS9999.TMP del SYS9999.TMP,-
  run convert \%s \%1 SYS9999.TMP,-
  if exist SYS9999.TMP send SYS9999.TMP \%2,-
  if exist SYS9999.TMP del SYS9999.TMP
define clr-var define \%1, define \%2, define \%3

; ****** macro definitions for KGET macro command *******
define kget if not defined \%1 goto kget-usage,-
  if not defined \%s goto kget-error,-
  kget1, clr-var,stop,-
  :kget-usage, echo need remote file name,clr-var,stop,-
  :kget-error, echo specify remote Kanji code,clr-var
define kget1 if not defined \%2 define \%2 \%1,-
  if exist SYS9999.TMP del SYS9999.TMP,-
  get \%1 SYS9999.TMP,if not exist SYS9999.TMP,goto nop,-
  run convert \%r SYS9999.TMP \%2,del SYS9999.TMP,-
  :nop

; ****** macro definitions for KTRANSMIT macro command *******
define ktransmit if not defined \%1 goto ktransmit-usage,-
  if not defined \%s goto ktransmit-error,-
  ktransmit1, clr-var,stop,-
  :ktransmit-usage, echo need local file name,clr-var,stop,-
  :ktransmit-error, echo specify remote Kanji code,clr-var
define ktransmit1 if exist SYS9999.TMP del SYS9999.TMP,-
  run convert \%s \%1 SYS9999.TMP,-
  if exist SYS9999.TMP transmit SYS9999.TMP,-
  if exist SYS9999.TMP del SYS9999.TMP

; ****** macro definitions for KRECEIVE macro command *******
define kreceive if not defined \%1 goto kreceive-usage,-
  if not defined \%s goto kreceive-error,-
  kreceive1, clr-var,stop,-
  :kreceive-usage, echo need remote file name,clr-var,stop,-
  :kreceive-error, echo specify remote Kanji code,clr-var
define kreceive1 if not defined \%2 define \%2 \%1,-
  if exist SYS9999.TMP del SYS9999.TMP,-
  receive \%1 SYS9999.TMP,if not exist SYS9999.TMP,goto nop,-
  run convert \%r SYS9999.TMP \%2,del SYS9999.TMP,-
  :nop
-------

From @cuvmb.cc.columbia.edu:MOSGLA@HLERUL2.BITNET Thu Apr 13 08:52:14 1989
Return-Path: <@cuvmb.cc.columbia.edu:MOSGLA@HLERUL2.BITNET>
Received: from columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA12668; Thu, 13 Apr 89 08:52:14 EDT
Received: from cuvmb.cc.columbia.edu by columbia.edu (5.59++/0.3) with SMTP id AA24454; Thu, 13 Apr 89 08:52:03 EDT
Message-Id: <8904131252.AA24454@columbia.edu>
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0750; Thu, 13 Apr 89 08:52:03 EDT
Received: from HLERUL2.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2807; Thu, 13 Apr 89 08:52:02 EDT
Date: Thu, 13 Apr 89 14:35 CET
From: "Johan van Wingen"
To: isokermit@watsun.cc.columbia.edu
Subject: First comments on 2nd draft

Dear Kermit listers

My mailing of 6 April of the following text seems not to have arrived.
######################################################################## I have not had the time to study the new draft closely. Two general remarks should be made first. 1. If a mechanism is required for including code extension techniques in Kermit, ISO 2022 will create a Paradise for you. It is certainly to be preferred over other methods made ad hoc by manufacturers. 2. ISO 2022 is implemented only very rarely in the data processing world, and for good reasons. Thus what is offered to you is in fact a Fata Morgana. The main service ISO 2022 did, is that it acts as an ordering principle. The new developments in ISO JTC1/SC2 go in a different direction. While the main framework of ISO 2022 will be kept, with its announcing sequences, the C and G selecting system will remain only a cumbersome alternative to the multiple-octet standard (10646, at DP stage, not yet at DIS!), and the 254 graphic character code, now proposed. As for Appendix C, ISO 6937 does not present a usable alternative to ISO 8859. Mr. Palme's comments are quite misleading. (There are NO national variants of ISO 8859, Icelandic is in ISO 8859-1!) It requires special hardware for dealing with diacritics, because accented letters are being coded with TWO bytes, instead of ONE, as in ISO 8859. To the opinion of ISO SC2/WG3 members, who maintain it, ISO 6937 is almost dead now, and will only be continued for the sake of CCITT. ISO and ECMA registrations are the same thing. Thus for Chinese and Korean put 58, 149. Japan is 87. Replace the other "?" by "none". The "final letter problem" I have to verify in my files at home. DP 10646 (140 pages) is now circulated for voting (ending 30 May). The Netherlands voted already (yesterday) NO, in order to have several things changed. It may be possible to get copies of the document from ANSI, or from the institutes in your own country (DIN, AFNOR, BSI etc.). 
########################################################################

For additional comments, I support Mr. Palme's suggestion to consider ODA/ODIF. It is ISO 8613, in 8 Parts, more than 600 pages. I have got it, but it is no easy matter. Good luck with it.

FROM J. W. van Wingen MOSGLA@HLERUL2.BITNET
Mail to P. O. Box 486, 2300AL Leiden, Netherlands

From KLENSIN@infoods.mit.edu Thu Apr 13 13:30:28 1989
Return-Path:
Received: from INFOODS.MIT.EDU by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA14942; Thu, 13 Apr 89 13:30:28 EDT
Received: by INFOODS id <0000026B061@INFOODS.MIT.EDU> ; Thu, 13 Apr 89 13:28:28 EDT
Date: Thu, 13 Apr 89 13:27:11 EDT
From: John C Klensin
Subject: Draft Number 2 comments -- Quite long.
To: isokermit@watsun.cc.columbia.edu
X-Vms-Mail-To: EXOS%"isokermit@watsun.cc.columbia.edu"
Message-Id: <890413132711.0000026B061@INFOODS.MIT.EDU>

Three prefatory comments:

(1) I want to apologize in advance for the length of these comments and the high likelihood that they cover old ground: I'm an inadvertent latecomer to these proceedings.

(2) I think it is appropriate to identify myself and my perspective for those who don't know them. I don't claim to speak for any of the groups that follow, only to have absorbed some perspective from my interactions with them:

(i) I have been technical director for a series of projects over the last few decades that have, among other things, been concerned with data interchange among machines with different internal representations. The most recent of these is a UN-sponsored activity in the international interchange of data about the nutrient composition of foods. In that context, we have had to deal with representation of food names in local language and character sets in files that will be used in many countries and environments.
(ii) In conjunction with another project, I was involved in some of the early discussions and design of the Internet FTP that led to the TYPE, MODE, STRU, and SITE commands, which have never done what it was hoped that they would do, and which were intended to address some of the issues raised in the draft.

(iii) I have a certain amount of standards development experience, both in the US and internationally. I chair the US committee on PL/I (X3J1) and have been convenor of the corresponding ISO working group. PL/I was the first widely-available programming language with dialects that contain provision for simultaneous use of single- and multiple-byte character sets. Those provisions were rejected for inclusion in the standard at the recommendation of the company that developed them. The PL/I standards are also the only language standards in the ISO arena that have been defined by precise semi-formal methods, in which it has not been possible to paper over some of the issues with which the draft "iso kermit" document deals.

(iv) Finally, I chair the Standards Committee of ACM and, largely as a consequence, am a member of ANSI/ISSB. For readers from outside the USA, ISSB acts as a "board of directors" for policy issues in Information System standardization within the USA. Among other things, it approves and, when needed, coordinates, all of the US Member Body TAGs to the ISO working groups and subcommittees.

(3) Johan van Wingen's note of 13 Apr arrived just as I was about to send this off and it covers some of the same ground. I have not removed the redundancies, but wish to reinforce what I see as his three main points that are relevant to my argument:

(i) there are few, if any, serious implementations of ISO2022. At best, vendors use a little bit of it for a few things.

(ii) the industry (and standards) trends are toward eight-bit character sets, and then toward multiple-octet character sets.
If devices support several of those sets, they are more likely to do it by some major and semi-permanent activity (switches, setup modes, etc) than by dynamic ISO2022 switching.

(iii) The ISO 8859-n components are the only variations there are. There are no national variants *within* a registered ISO8859 set. That is the point of ISO 8859, for better or for worse. And, as mentioned above, that point reflects industry and hardware realities. ISO 6937 is, or was, a reasonable *communications* standard (hence the CCITT connection), but a terrible file-storage and processing standard (because the concept "length of a character string" and the concept "number of characters in a character string" are either intimately related or all of the programming languages and database languages/systems that deal with these things go crazy). DP10646 is in DP vote. Given a single "no", that means realistically that it is at least a year to 18 months away from being an ISO standard. And, despite SC2's optimism, it is not yet clear who (among hardware and software vendors) are going to pay any attention to it in practice.

SUMMARY OF WHAT FOLLOWS

Having studied "Draft 2", I think it launches us, and kermit, down the proverbial slippery slope. While the capabilities suggested are, without question, useful, they go well beyond the near-term capabilities of practical devices. Even the devices cited as compliant to these standards are not, in fact, compliant at the level anticipated and required by the draft. Similarly, the draft implies the availability of translations in each kermit that are certainly hard, may not be well-defined, and that are certainly not defined in the standards referenced. I suspect that, partially as a symptom of this slippery slope phenomenon, the draft wanders off into areas that don't really have much to do with character sets.

I will outline some small difficulties with the draft, then suggest an alternative way of looking at the problem(s).
Some of the nit-picking is included because it sets the foundation for the suggestions that follow. QUIBBLES AND EDITORIAL REMARKS 1) Just below figure 2, paragraph starting "An 8-bit alphabet...": As far as I know, it would be correct to say that the number is adequate to represent the characters in all of the world's "alphabetic" languages. The one-graphic-per-word character sets are not alphabetic in the sense that term is typically used. 2) Three paragraphs below, starting "A language like English...": "adequately GL" should be "adequately in GL". Note that this is not strictly true as soon as punctuation and special characters are considered. Note that the mapping of dollar-sign-symbol (in ASCII) to universal-currency-symbol (in ISO 646) to pound-sterling-symbol (in the UK (BSI) version of ISO 646 whose name I don't recall at the moment) has been the source of considerable confusion as people try to figure out whether amounts should be multiplied by appropriate conversion factors. It is claimed that "English" is used in both countries, but communication between them requires/suggests at least two distinct currency symbols, with distinct and agreed-upon code points. 3) This may or may not be appropriate here, but it seems to me to be intimately related to the current draft proposal. As kermit moves into eight-bit character sets, it may be appropriate to modify the protocol somewhat to understand that combinations such as "8 data bits, parity, one stop" are as well defined as "7 data bits, parity, one stop" and "8 data bits, no parity, one stop". The latter two are supported by the existing protocols, the former is not. The only arguments I know of against supporting parity along with eight data bits are (i) that terminals and modems don't handle it, which is no longer true: many do, and (ii) that the trends are toward OSI-like transmission, with length encoding and ECC. 
On the other hand, if we have that, we don't need kermit (or at least a lot of what kermit has been most important for in the past). 4) Under "CHARACTER SETS" is a discussion of the registration activity. Note that ECMA is "simply" the registration authority under ISO2375. As such, they are acting as ISO's agent, and have obligations to register (or not register) things independent of the desires and plans of their membership. More important, while coordination has been steadily improving, there is no requirement that CCITT register its character sets with ECMA or anyone else. There are many respects in which CCITT "Recommendations" bear more resemblance to treaties than they do to voluntary standards: the members of CCITT are governments and PTTs, while ISO is a voluntary organization whose member bodies may be private and voluntary organizations (in the USA, the formal representation to CCITT is in the State Department; the ISO member body is ANSI, a private organization with no ties to the government other than government representation on many of its boards and committees). The material in appendix B is better in this regard. On this same topic, note that it is very easy, although time-consuming, to register a character set under ISO2375, as long as the characters in it are part of a pre-defined character repertoire. The ISO standard that contains the identifications of all known Latin-based alphabet characters should probably be on the reference list--I don't have the number handy, but can look it up if needed. Since, in principle, more variations can be registered every month, a scheme that assumes that a given kermit will be able to accept alphabet designation sequence and do the "right" mapping of graphics onto the local preference implies continual updating of tables, etc. Note that registration applications and draft international standards don't count. They are subject to change, and sometimes do. 
In particular, there is a strong argument for removing all of the "Esc Seq"s from Table 5 that do not have corresponding Registration numbers in the last column. The effect of that on the table may call out another phrase that periodically occurs when thinking about standards: "premature for standardization". We really don't want kermit in the middle of this rapidly-changing situation if we can avoid it. Referring to the "possible problem" called out at the bottom of the document, the theory according to ISO is that Japan will have to get in line. JIS had the option to taking significant exception to those assignments and apparently did not, which may speak for their intentions. I await Ken-ichiro Murakami's second round of comments on this, after his expert advice arrives. 5) In at least most of the kermit implementations I have worked with, the TEXT/ BINARY distinction does not affect file transfer and is often required by one of the transferring kermits but not the other. With a few trivial exceptions, and the less trivial ASCII->EBCDIC translation going to IBM hosts, the choice of FILE TYPE tends to impact storage representations, rather than data conversion. 6) The fourth paragraph under "SELECTING ISO-2022 TRANSFER SYNTAX" points to "table 4". Table 4 is very well hidden in appendix B and should be pointed to there. 7) In the paragraph immediately under Figure 4, ISO 646 is referred to. When discussing these types of issues, it is important to distinguish between ISO 646 IRV (i.e., "ASCII" with "universal currency symbol" substituted for "dollar sign") and ISO 646 BV (which is where the national variations on special character positions show up). CCITT IA4 should also be on that list as another almost-ISO 646 character set. 8) Under TERMINAL EMULATION, the DEC VT200 and VT300 series terminals are identified as ones that "already follow these standards". This is not strictly true: these terminals do implement GL-GR switching. 
They implement a very limited number of character sets that can be bound to G0 and G1. They will ignore any ISO2022 announcer or designation sequence that they don't understand, which is not the sort of thing that is usually considered a satisfactory implementation (although it is typical of ISO 2022 "implementations"). It would be more accurate to describe the VT200 (which does not even support Latin Alphabet 1/ ISO 8859-1) and VT300 as fairly dumb terminals, with limited character set switching capability, that implement what capability they do have by using ISO 2022 control sequences. I would favor support for eight bit characters in seven bit environments in the kermit terminal emulators. For that purpose, ISO2022 shifts should be supported, rather than the eighth-bit-quoting of the file transfer protocol, since the terminal emulators are talking to hosts, not to kermit servers. That said, please understand that very few hosts implement this stuff: Digital's model with VAX/VMS is fairly typical, with most of the operating system supporting GR graphics (high-bit-on) and ISO8859-1 and DEC's "multinational" variations only if the terminal is operating in eight-data-bit mode. C1 controls are, however, supported with escape sequences if the terminal is operating in seven-data-bit mode only. 9) In Appendix A, availability of ISO standards is listed. It may be that some text was left out. But, the correct statement as I understand it is more or less as follows: - Each of the national ISO member bodies is ISO's official sales agent in that country. Consequently, ISO standards should be ordered from ANSI in the USA, from BSI in the UK, from AFNOR in France, from DIN in Germany, etc. The UN Bookstore may also carry some ISO standards. ISO standards are never free; the ISO Central Secretariat derives important operating income from the sale of standards by its agents. - CCITT is part of the International Telecommunications Union (ITU) and hence part of the UN system.
Its recommendations are available from the ITU secretariat in Geneva, from the UN Bookstore(s), and through the CCITT national committees and/or PTTs in many countries. Some national standards bodies, including ANSI, also carry CCITT colored books (sets of recommendations). The cost of CCITT colored books depends on where you get them: sometimes costs are absorbed in indirect ways. ANSI, which derives a significant fraction of its annual expenses from the sale of publications, charges for them, for its own Standards, and for ISO Standards. ANOTHER WAY TO THINK ABOUT THE PROBLEM In the general case, we can expect a given system to actively support only a small fraction of the character sets that it is possible to register. Except for high-performance bit mapped devices, it is likely that a large fraction of the characters in the repertoires from which the registered character sets are and will be drawn will not be available. Conversion among word processor formats is extremely complex and typically involves the loss of information, since different processors support different capabilities (and, hence, capabilities for specification). To provide one specific example, on MSDOS, a barely adequate WordStar 5.0 to WordPerfect 5.0 converter is a larger program than MSKERMIT, not counting user-supplied tables that drive the heuristics. I suggest that, terminal emulation aside, the problem of sending special character sets ("special" is anything but ISO 646 BV, with no national-use character positions included) is really a problem of telling the receiving system *what kind* of "binary" file is being sent, and reaching appropriate agreement. Anything else involves very complex conversion issues that don't belong in kermit, that will be little used in practice, and that probably can't be made to work. Two historical kermit assumptions are key in figuring out what should be done. If either is relaxed, other options are possible.
The first of these assumptions is that options should be agreed to on a single exchange only: the Telnet negotiation arrangement, in which "DO" and "WONT" requests are exchanged until some agreement is reached, has not been considered acceptable. Second, although less explicit, is the assumption that error-reporting behavior as to what can be handled (as distinct from actual transmission errors) should occur during send/receive negotiation, not be discovered midway in a transmission. Sending the ISO2022 announcer C (or D) tells the receiving kermit that it should expect alphabet selection escape sequences, but gives no clue as to what alphabets might be selected. That is undesirable: given limits of devices, what is wanted is to know *what* alphabets might be specified *before* the transfer is begun. Referring to the "MISMATCHED CAPABILITIES" section of the draft, an "X" earlier is clearly better than an "X" later, but an early "NAK" with "I'm not going to deal with that" is far more acceptable. If the "X" after transmission starts is the agreed option, *please* let's specify exactly what goes into the [rest of the] data field when this situation is encountered. Parenthetically, reason 1 there is not relevant: if the receiver does not know enough about attribute packets, then the control sequences will end up in the file, with whatever conversions are in effect. I think this implies that SET TRANSFER-SYNTAX ISO-2022, or any of its relatives, imply "I can handle attribute packets, and, if you can't, we are going to call this off". "Call this off", here, might mean "sender (user, not software) will convert to ASCII and retry" or "send binary, will fix up at receiver end". The protocol should not care. The overwhelming number of useful transfers will occur among parties that support the same character sets, whatever they might be. 
For both files and terminal emulation, the "how do I translate from what is coming in to what I have" question is, in practice, less interesting than the "how do I figure out what they are sending, so I can match it" question. Let me ignore for a moment the way that the attributes are encoded. What is needed is a way for me to send to the remote kermit "I am about to send eight-bit characters. The whole transfer is going to be in ISO8859-6, with *no* embedded ISO2022 controls. Can you cope with it?". Now, the answer to that question is either "yes" or "no", and, as a sender, I don't care whether "yes" means "we have a device that can display it" or "we have local capability to convert to something else". In principle, if we devise a general enough way to say that, then I can also say "I am about to send WordPerfect 5.0, can you cope?". Again, the receiver's answer is either "yes" or "no", and "yes" might mean "we support WordPerfect around here" or "we can accept that and have a conversion program that does a plausible job". To a considerable extent, ISO8859 came into being to avoid data transfers and files with embedded ISO2022 controls. The 8859 theory is, more or less, "if we can agree which character set we will use among ourselves and be clear about it, then we are in a modern version of 'the good old days'". I'd recommend a look at the ways message character sets are (or used to be; I'm still working from the Red Books) specified in the X.400 suite to reinforce this. On the other hand, one of the things that I should be able to put in that inquiry packet is "I want to send you a file with all sorts of embedded ISO2022 controls, including alphabet switching". Then, and only then, does much of the complexity of the existing draft come into play. And, if the receiver says "yes", an ability to handle any alphabet registered up to that day is presumably assumed.
Since that is not realistic, I want to repeat here a variation of a suggestion that I saw go by a few days ago (I don't still have it, or would acknowledge the idea more specifically): that acceptance of ISO2022 sending should be followed by a packet that specifies what I intend to bind onto G0-G3, and/or a list of what I would *like* to use, in descending order. In the former case, the receiver would say "yes" or "no"; in the latter, it would send back its list of preferences, leaving off anything it could not handle. Both of those are consistent with the general kermit model. Or, one could break the rules and negotiate back and forth. This is a game, incidentally, that a VT300 emulator could play without problems, since the emulator could know what character sets were supported, what was available for downloading from the attached computer, and could give an orderly reply as to what it was prepared to cope with. Another observation about word processors and similar programs: The notion of word processor format conversion is basically unworkable, even more unworkable than "how do I deal with the GR of ISO8859-6 on a VT52", since there are ISO recommendations for Arabic-Latin transliteration. I am singling out 8859-6, incidentally, not because of any prejudice, but since it is the one of the 8859 sets for which I haven't seen an emulator (even on a high-performance workstation) within a few miles of MIT. I doubt that emulators are readily available in a similar distance of most of the other contributors. And shifting from 8859-1 to 8859-6 or 8859-8 and back again with announcer sequences, but within the same document or line, raises some *hard* "terminal" problems. The problem is the information loss mentioned above, or, more important, the need to make up information. These things have gotten complicated enough that there are very complex conversion programs on the market.
Even those programs, or at least the best of them, work with user-supplied tables that specify parameters for the heuristics that understand certain constructs. And people will want to send Postscript and similar files as well (conversion from Postscript back to "word processor" is analogous to decompiling a program or to optical character recognition). Keep in mind that the current versions of WordStar and WordPerfect (at least) permit imbedding bit-mapped graphic files in text documents, so Postscript is not a big stretch. While ODA and SGML have been suggested as alternatives, each has special properties. ODA is basically just another word processor format, with the advantage that it is an international standard and the disadvantage that it has few implementations. And SGML has its own mechanisms for dealing with "funny" characters, which are usually not expected to appear directly in files to be transferred: I would rather know that an incoming file is SGML, not just "text", but the protocol does not need to do anything different, certainly not try to convert it to, e.g., PostScript. PROPOSAL Having denounced everything in sight, let me try to make a simplifying proposal. I. Terminal emulation There is, at the moment, no protocol for a remote host to tell terminals what they should, or must, support. If such a protocol is defined, it will be by the host and terminal vendors, we hope in conjunction with appropriate International Standards. But we will still be emulating real terminals, with small variations. That is why we call it "terminal emulation", after all. Just as kermit terminal emulators gradually moved from Z19/VT52 support to VT100 support, we should anticipate and encourage the emulation of devices that are able to support some of the ISO8859-n sets (e.g., the VT3xx) and the subsets of ISO2022 controls provided by those terminals. 
If those emulators can move beyond the subsets, so much the better, but ISO8859-n support, for a small set of n's, is much more important than general ISO2022 support including midstream character set switching. I am very encouraged in this regard by what I have been able to infer about the work in Japan, and would like to hear more about it. In no event does this issue have much to do with the transfer protocol, except for the implications of SET DEFAULT-DEVICE SCREEN, where that is implemented. I am at least sympathetic with Joe Doupnik's 10 April comments relating to terminals and terminal emulation. Similar thinking motivated the radical suggestions made here. I agree with Ken-ichiro that SET KEY should be adequate. If SET TERMINAL KEYMAP is needed (for reasons I don't understand), provision should be made for assigning a different keymap everytime the character set changes. I.e., you probably want to permit binding keymaps to G0, G1,..., rather than, or in addition to, GL and GR. I could easily be wrong about this, I haven't thought about it very much. II. File transfer 1. If both sender and receiver don't support attribute packets, then only "binary" is going to work for any of this. 2. Define, and acquire, an attribute packet something like the following. The command breakdown may be wrong, as may the keywords. However, whatever is done, it is important to define things so that the default preferred character set and file information ("transfer syntax") can be specified in a start-up macro and retained, while the "extended" characteristic is turned on and off. 2a. Support SET FILE-TYPE EXTENDED, as well as "text" and "binary". "Extended" implies attribute handling, and attribute "*X". 2b. Support SET TRANSFER-SYNTAX where is a list that gets registered and built into the protocol on the same basis as attribute packets. I would suggest that the following concepts are good candidates for " (note that these are the concepts, not the specific keywords). 
Each keyword is associated with its own definition of a value field. Value fields, however, are specified by generating rules, such as those outlined below, not by the kermit protocol manual.

- ISO8859. Value field is a part number and a date. (WARNING: these things get revised, not always consistently).
- ISO2022. Value field is a date, followed by the proposed announcer sequences, followed by a list of the character sets that will be referenced.
- ISO ODA. Value field is a date, and maybe some other stuff (I haven't studied ODA in enough detail).
- ISO SGML. Value field is a date.
- Proprietary word processor. Value field is a name and a version, and some conventions are needed about how those names are spelled. Note that "WordPerfect 5.0" file formats are not equivalent to "WordPerfect 4.2" file formats, and that this is a symptom of a general problem: more capability typically means file format revisions; significant increases in capability imply non-upward-compatible file format revisions.
- Proprietary 'picture' or 'page description'. Values are things like "Postscript", "HP-PCL", MSP, etc. It is not clear to me why this category and the previous one should be separate, but my intuition says that they should be.
- CCITT recommendation. Number and date (or color).
- ISO standard (not covered above). Number and date.
- National standard (not covered above). National standards body acronym (these are listed/specified in ISO documents), number, and date.

There may be some others, and I would encourage something that a server and user can agree upon, without assuring uniqueness, such as:

- private. some value field.

3. The associated attribute packet contains *X and an encoding of the keyword and value string. When it arrives, the receiving kermit either agrees or rejects, as usual with attribute packets. And, as suggested above, "agree" implies an ability to deal with the result, not what that ability is going to be.
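The keyword-and-value idea in items 2b and 3 might be sketched roughly as follows. This is only an illustration under stated assumptions, not anything defined by the Kermit protocol: the `TransferSyntax` type, the `KEYWORD=value` text form, the keyword spellings, and the helpers `receiver_accepts` and `filter_preferences` are all hypothetical names invented for the example. The real encoding inside an attribute packet would have to be specified separately.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class TransferSyntax:
    """One proposed transfer syntax: a keyword plus a keyword-specific value."""
    keyword: str  # hypothetical spelling, e.g. "ISO8859", "ISO2022", "PRIVATE"
    value: str    # hypothetical value field, e.g. part number and date

    def encode(self) -> str:
        # Assumed text form; the actual attribute encoding is not defined here.
        return f"{self.keyword}={self.value}"

    @staticmethod
    def decode(text: str) -> "TransferSyntax":
        keyword, _, value = text.partition("=")
        return TransferSyntax(keyword, value)


def receiver_accepts(offer: TransferSyntax, supported: Set[TransferSyntax]) -> bool:
    """Answer only yes or no; how the receiver will cope is its own business."""
    return offer in supported


def filter_preferences(wanted: Iterable[str], supported: Set[str]) -> List[str]:
    """Return the sender's list, in the sender's order, minus what we cannot handle."""
    return [name for name in wanted if name in supported]
```

A sender would build one `TransferSyntax`, send its encoded form before any data, and go ahead only on a "yes". The `filter_preferences` helper corresponds to the single-exchange variant in which the receiver returns its own list with the unsupported entries left off.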
An important issue not addressed in the draft is that we (at least) increasingly use kermit as an element in multi-hop transfers, e.g., PC or workstation to host via kermit, to other host via FTP, to other host via some encoding with a mail envelope, to PC or workstation via kermit. Even if both workstations can handle, say, ISO8859-5, that is no guarantee that the intermediate hosts can. No real problem here, but the files will either have to arrive as binary and be post-decoded (as below) or the kermit on the receiving workstation will need to be given a *local* SET RECEIVE TRANSFER-SYNTAX ... command that deals with the file in the way that files of that type are dealt with, independent of any attribute negotiations. 4. For operating systems for which the capability is appropriate, a kermit command that looks something like: SET DATA-CONVERTER would permit user specification of a routine that could be invoked on the fly to convert those particular types of files. The semantics of "routine" would have to be local-system-dependent. For the record, I don't think that this is a good idea, but it is a way to incorporate the capability that some of you seem to want within kermit. Note that, as soon as you move either into word processor land (or even things like SGML), "conversion" is not a matter for tables, but for programs. Some similar capability that *does* permit user-specified remapping would be appropriate. If ISO8859-6 is going to be "converted" into ISO8859-1 so that it can be displayed on an ISO8859-1 machine with no 8859-6 capability, then the specification of the actions should be under user control. The ISO Romanization of Arabic, incidentally, runs left-to- right. But the main point is that 2-3 above get a "binary" file delivered, and delivered with sufficient information to decode it. And they enforce agreement on the file organization and content before the file is transferred. 
That decoding is probably best done post-transfer, rather than on-the-fly, if for no other reason than that one is likely to want to have the file -- as transferred -- available when the decoding produces something unexpected. If you decode or translate on the fly, a new set of debugging options that save the transferred data would be helpful.

The logical flow chart now looks like:

    Open transfer connection
        |
    Tell receiver what you are going to send
        |
    Receiver agrees to accept that form
        |
    Send data in that form
        |
    Close transfer connection

Note that notions of "convert to transfer syntax" and "kermit encoding" don't appear here, at either end. That is reasonable, practical, and probably The Right Thing. At the same time, things are not being transferred as "text" that are not "text" as that is traditionally understood by kermit implementations.

John Klensin
Klensin@INFOODS.MIT.EDU

From @cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET Fri Apr 14 11:48:13 1989
Return-Path: <@cuvmb.cc.columbia.edu:A-PIRARD@BLIULG11.BITNET>
Received: from columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA24542; Fri, 14 Apr 89 11:48:13 EDT
Received: from cuvmb.cc.columbia.edu by columbia.edu (5.59++/0.3) with SMTP id AA07620; Fri, 14 Apr 89 11:47:45 EDT
Message-Id: <8904141547.AA07620@columbia.edu>
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 1479; Fri, 14 Apr 89 11:47:43 EDT
Received: from VM1.EARN-ULG.AC.BE by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5383; Fri, 14 Apr 89 11:47:42 EDT
Received: by BLIULG11 (Mailer R2.03B) id 3958; Fri, 14 Apr 89 17:46:36 +0200
Date: Fri, 14 Apr 89 14:45:26 +0200
From: Andr'e PIRARD
Subject: Re: 8859/1 kermit in Iceland
To: Fridrik Skulason , ISO/Kermit Discussion Group
In-Reply-To: Your message of Wed, 12 Apr 89 13:02:29 GMT

Abridged, Fridrik says:

> Here in Iceland we have been using kermit with ISO 8859/1 translation
> for almost three years now.
> This was done since our national language
> contains 10 characters outside the 7-bit ASCII character set, and we
> wanted to simplify the file transfer process.
>
> No changes were made in the BINARY case, but when transmitting
> TEXT files, automatic translation to ISO 8859/1 was done. Also,
> when receiving TEXT files, translation from ISO 8859/1 to the
> native character set was done automatically.
>
> Native character set A ----> ISO 8859/1 ----> Native character set B
>
> We also changed the Macintosh Kermit in a similar way, but in other
> cases (HP 9000 and ATARI ST) no changes were needed, since those
> machines use the ISO 8859/1 character set here in Iceland.
>
> SET TRANSLATION {NONE | ISO | CP850 | DEC-MULTI | ROMAN-8}
> It was used to specify what sort of translation should be applied
> to incoming/outgoing characters while in terminal emulation mode.

I quote this note because it tells me that what Fridrik has done is exactly what I have done between CMS Kermit and the IBM PC (in our own program), what I am longing to have for the Macintosh and others, and what I proposed to Frank as a very low cost, high value addition to Kermit implementations: allowing generalized byte-to-byte translation in both text transfer and terminal mode, and recommending the use of this translation so that a common code, ISO 8859-x, is talked on the line to remove the NxN problem. I said even hidden patchable translation tables are useful, but the more user interface (SETs or the like) the better. I am sure Fridrik is not the only one. On the contrary, this solution will satisfy the majority of those who can do with a single version of ISO 8859 (and who maybe do not know what ISO 8859 is), a useful solution *now*, because it is usable with today's software. Now I quite understand Frank's reaction that being restricted to a single -x is a pity and that my proposition is no use to e.g. the Eastern languages. And I sure praise him for having raised the debate to higher ground.
But, given the size of his document and that of the comments it already raised, I am afraid this proposition is very difficult to implement and to use. So, I suggest that international characters be supported on two levels: 1) restricted, within a single version of ISO8859, in the proposition's terms no announcers, switchers etc... In fact, having it work with today's Kermits to which translation and maybe simple commands are added. That's in fact coordinating Fridrik's work, mine and maybe others' and is straightforward in its definition. 2) general, across multiple codes or ISO versions, a usage in which I am not interested, but my heart is with implementers and I will comment: If a multibyte standard existed, most of the guessing at how to store and transmit the data would not exist. ISO 10646 is surely of that kind, but Johan van Wingen tells me it is not sure we should wait for it. But, even if we don't want to bet on 10646, the multibyte idea is there. Why insist on sticking to existing standards for what is our own right to define under the Kermit protocol? If ISO 2022 is used just to switch among several codes, why not transmit double bytes ccdd where cc is the code and dd the data. The only reason would be performance, but is it really so, or that important, compared with simplicity? (Of course, cc being constant, only dd would be transmitted in the restricted case). Or, even though 10646 is not waited for, trying to be as close as possible to its definition could ease yet another conversion when/if it comes to reality. Now I'll sure contact Fridrik for some software and ideas exchange soon. Hoping for the best move for Kermit. André.
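The "restricted" level described above, with byte-to-byte translation through one common code on the line, can be sketched as below. This is a minimal illustration, assuming Python's standard codecs (`cp850`, `latin_1`, `mac_roman`) as stand-ins for the per-machine tables; the function name `build_table` and the `?` fallback for unmappable characters are assumptions made for the example, not anything taken from an existing Kermit. With N local character sets, each machine needs only its own two tables (to and from the common code) instead of a table for every pair.

```python
def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Build a 256-entry byte-to-byte table from codec `src` to codec `dst`.

    Bytes with no equivalent in `dst` map to `fallback` (an assumed policy).
    """
    table = bytearray(256)
    for code in range(256):
        try:
            out = bytes([code]).decode(src).encode(dst)
        except UnicodeError:
            out = fallback
        table[code] = out[0]
    return bytes(table)


# Sender side: local IBM PC code page to the common code on the line.
to_wire = build_table("cp850", "latin_1")
# Receiver side: common code on the line to the local Macintosh set.
from_wire = build_table("latin_1", "mac_roman")

local_pc_bytes = "café Ærø".encode("cp850")
wire_bytes = local_pc_bytes.translate(to_wire)
local_mac_bytes = wire_bytes.translate(from_wire)
```

Neither end needs to know the other's native set; both only agree that ISO 8859-1 is what travels on the line.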
From lts!amanda@uunet.uu.net Fri Apr 14 14:45:59 1989
Return-Path:
Received: from uunet.UU.NET by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA25822; Fri, 14 Apr 89 14:45:59 EDT
Received: from lts.UUCP by uunet.UU.NET (5.61/1.14) with UUCP id AA20648; Fri, 14 Apr 89 14:45:55 -0400
Received: by lts.UUCP (4.12/2.881128) id AA09736; Fri, 14 Apr 89 13:40:06 est
Date: Fri, 14 Apr 89 13:40:06 est
From: Amanda Walker
Message-Id: <8904141840.AA09736@lts.UUCP>
To: isokermit@watsun.cc.columbia.edu
Subject: Are we getting off track, or am I just confused...?

I've only been on this list for a week or so, but I have read the draft proposals, and am pretty familiar with the ISO standards we've been talking about. It seems to me that some of the debate going on is ranging pretty far from the original goal of this effort, at least the way I understand it. Now, I may be covering familiar ground in this message, but it evidently seems simpler to me than it does to some others. Looking at things from the point of view of a long-time Kermit user, Kermit does two things: file transfer and terminal emulation. In the case of many implementations, the terminal emulation function amounts to acting as a "virtual cable," reducing it to basically a means of transferring files. Historically, one of Kermit's main strengths was the fact that it allows us to transfer a text file from one machine to another, even (or especially) when they have widely differing internal formats for storing such files. It does this by defining the representation taken by such files (i.e. printable ASCII delimited by CR-LF pairs) and a way to encode this representation into packets that can be sent over almost any communications channel. As I see it, one of the main points of the ISO-Kermit idea is to extend this representation of a "Kermit text stream" to include polyalphabetic text.
As several people have pointed out, once there is a common representation, each implementation only has to deal with its own native formats. I think that ISO 2022 is very appropriate for such a representation. By adding a few more control characters to the representation, we gain the ability to send polyalphabetic text, while remaining compatible with pre-existing kermits. The only "protocol" extension I see any need for is a flag in the initial negotiation saying "I can send/handle ISO 2022", much in the way long packets are handled now. If one side can't handle it, the current kermit text format is used. Now, if a given implementation is smart enough to do other kinds of translation (such as handling complexly formatted documents or, say, TeX notation for diacriticals and non-ASCII characters), that's fine, but it's an implementation feature that the user can ask for, not part of the representation "on the wire." One of the advantages of using ISO 2022 controls is that if all else fails, a file can be transmitted as an 8-bit stream, thus preserving the information even though one end of the connection may not be able to interpret it. Given that both sides of a connection can handle translating from their native character set to and from an ISO 2022 stream, I think that we've accomplished what we set out to. The fact that most machines can only handle one character set at a time (at least for simple text documents) is a red herring, I think, as is the fact that most of these character sets are only partially-intersecting subsets of what can be represented using ISO 2022. It still lets us preserve as much information as possible, which, once again, has been one of the biggest strengths of Kermit. Once we go beyond simple text file transfer into the realm of being able to interchange arbitrarily formatted documents, it's time to look at ODA or full ANSI X3.64 or something, but that seems to me to be a separate issue.
I am personally interested in it, but I think we should take this one step at a time. ISO 2022 is a way to represent a polyalphabetic text stream, and if that's what we want, it'll do the job quite well. It's straightforward and will bring immediate benefit to Kermit users.

The second major issue is terminal emulation, where it would be very nice to be able to view polyalphabetic text. Some existing terminals (such as the DEC VT340) are a start, but microcomputer implementations of terminal emulators are an excellent testbed for doing a much better job. I think that using ISO 2022 is also a good way to start on this, since so far, most hosts talk to their terminals over a text stream. However, I still would like to keep file transfer and terminal emulation separate, despite the fact that in many implementations they may well share code. As I mentioned in a previous message, I think it would be a mistake to cripple one machine because of another's shortcomings. I think that implementors should be encouraged to put as many of the registered character sets into their emulators as is practical.

I hope this is making sense--it's been a long week.

On to some implementation details....

For machines (such as an IBM with an EGA) where there is a programmable character generator with a limited number of slots, one way to make the most of the hardware would be to treat the CG as a cache, and only keep it loaded with the characters that are used on a given screenful. Aside from test patterns :-), I can't think of many times when you would actually have more than 512 different alphabetic glyphs on the screen at once (I'm not counting Kanji/Hanzi as alphabetic), even for multilingual text. I have used a similar technique in printer drivers for printers with very limited download capacity (anyone remember the Xerox 2700?). The only problem I can think of for the EGA/VGA is that you'd have to be real careful to avoid screen flicker when updating the character set table...
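The "character generator as a cache" idea above can be sketched in a few lines of Python. This is an illustration only, not something from the original message: the class name `GlyphCache`, the `(charset, code)` glyph keys, and the least-recently-used eviction policy are all assumptions. A real emulator would also have to pin glyphs that are still visible on screen and would actually load bitmaps into the hardware; here a counter stands in for the load.

```python
from collections import OrderedDict


class GlyphCache:
    """Treat a fixed number of character-generator slots as an LRU cache.

    Hypothetical sketch: glyph keys are (charset, code) tuples, and a
    "load" is only counted, not actually written to any hardware.
    """

    def __init__(self, slots):
        self.slots = slots
        self._slot_of = OrderedDict()  # glyph key -> slot, least recent first
        self.loads = 0                 # how many glyph downloads were needed

    def slot_for(self, glyph):
        """Return the slot holding `glyph`, loading it if necessary."""
        if glyph in self._slot_of:
            # Cache hit: mark as most recently used, no download needed.
            self._slot_of.move_to_end(glyph)
            return self._slot_of[glyph]
        if len(self._slot_of) < self.slots:
            # A free slot is still available.
            slot = len(self._slot_of)
        else:
            # Cache full: reuse the slot of the least recently used glyph.
            _, slot = self._slot_of.popitem(last=False)
        self._slot_of[glyph] = slot
        self.loads += 1
        return slot


cache = GlyphCache(slots=2)
sequence = [("latin1", 0xE9), ("greek", 0xE1), ("latin1", 0xE9), ("cyrillic", 0xD0)]
slots_used = [cache.slot_for(g) for g in sequence]
```

With two slots, the third request is a hit and the fourth evicts the Greek glyph, so only three loads are needed for four requests.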
On a related note, I have seen reference to a freely available JIS-sponsored dot matrix (20x20 dot?) font set that has glyphs for the entire JIS Kanji/Roman/Cyrillic/Greek set. Does anyone have any information about how to obtain this? If it truly is freely available, it could save a lot of work in implementing Kanji versions of Kermit...

Amanda Walker
InterCon Systems Corporation
amanda@lts.UUCP

From KLENSIN@infoods.mit.edu Fri Apr 14 17:02:43 1989
Return-Path:
Received: from INFOODS.MIT.EDU by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA26944; Fri, 14 Apr 89 17:02:43 EDT
Received: by INFOODS id <0000047A071@INFOODS.MIT.EDU> ; Fri, 14 Apr 89 16:59:45 EDT
Date: Fri, 14 Apr 89 16:36:17 EDT
From: John C Klensin
Subject: RE: Are we getting off track, or am I just confused...?
To: Amanda Walker
X-Vms-Mail-To: EXOS%"Amanda Walker "
Message-Id: <890414163617.0000047A071@INFOODS.MIT.EDU>
Cc: isokermit@watsun.cc.columbia.edu

Amanda,

It depends on how one defines the problem, and I'm not sure, after reading your note, how you define it.

If the notion is "take my internal character set, transform it to ISO2022 sequences and an arbitrary collection of registered character sets, and send it to a remote host with the expectation that it will translate to its nearest approximation to what was sent" then I think you create chaos. The problem is that the list of "registered character sets" is not a closed set. In principle, each week can bring a new one, and, with the introduction of non-Latin characters into ISO 8859, you can't predict from already-known ones what the new one will be. On-the-fly translation from an unknown character set to a local one so that local characters can be stored in files (much like ASCII->EBCDIC translation now occurs) impresses me as a difficult job.

Alternately, the idea might be "take my internal character set, transform it into ISO 2022 form, including the correct announcers and identifiers to tell the remote kermit what my character set is, send it to the remote, and expect that it will place it in a file, ISO2022 controls and all". That seems to me to be perfectly sensible and consistent with my model of what is plausible. If that--perfectly canonical--ISO 2022 code containing file is then to be converted to something else on the target machine (either a single character set or a different canonicalization or even a different set of ISO2022 introducers and announcers), then that is quite reasonable, too, but it is not part of the kermit problem.

The third interpretation is that one really wants to have a single canonical character set into which all "text" other than ASCII is translated in the hope that a local system can map it into the characters it understands and can display locally and that it can map characters it does not understand into some local convention. If that is the intent, then the right solution is some universal multi-octet character set--"universal" in the sense that all conceivable characters are in it--for kermits to use in transmission. Unfortunately, such a Standard is probably some time off, and, unless you want to eliminate a large fraction of the world's population, it may be three or more bytes rather than two. A factor of three in file transmission size is a pretty high price to pay if all I want to do is to send a file in, say, German or French.

And, while I agree about the number of characters one would have to download to an EGA or VGA "most of the time", one of the things we realized looking at some related issues for PL/I is that almost any scheme that limits the number of character sets on a screen can be broken by text that uses parallel translation or a multilingual dictionary.
Those are precisely the things that some of us would like to send but, for most purposes that do not involve Kanji or Hanzi, ISO 8859-n, for small values of n, tends to suffice without a great deal of complexity, in-line character set changes, or multiple octet sets.

John Klensin
Klensin@INFOODS.MIT.EDU

From lts!amanda@uunet.uu.net Fri Apr 14 19:47:02 1989
Return-Path:
Received: from uunet.UU.NET by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA29112; Fri, 14 Apr 89 19:47:02 EDT
Received: from lts.UUCP by uunet.UU.NET (5.61/1.14) with UUCP id AA15025; Fri, 14 Apr 89 19:46:58 -0400
Received: by lts.UUCP (4.12/2.881128) id AA10357; Fri, 14 Apr 89 18:36:26 est
Date: Fri, 14 Apr 89 18:36:26 est
From: Amanda Walker
Message-Id: <8904142336.AA10357@lts.UUCP>
To: isokermit@watsun.cc.columbia.edu
Subject: A difference in scope, I think.

John,

I apologize if I was a little vague in my previous message; I think that my interpretation of the scope of the current proposal (based on my first reading of Draft #2) was quite a bit more limited than yours, which caused my confusion at your previous message. After rereading the draft and the last couple of weeks' worth of traffic on this mailing list, I am less confused about your position, but a little more confused about just what the proposal is supposed to do (which, I suppose, is exactly what this list is for...).

My initial interpretation of the draft, and one that I still think is heavily implied by the introduction, is that the extension is indeed meant to, as you put it:

"take my internal character set, transform it to ISO2022 sequences and an arbitrary collection of registered character sets, and send it to a remote host with the expectation that it will translate to its nearest approximation to what was sent."

I agree that this is an open-ended problem, since the set of registered character sets is open-ended.
That is, I believe, why the new attribute field was introduced--so that two Kermits can determine what they can interchange effectively. Basically, I'd rephrase your quote above as "I know what I can send, and you know what you can receive; if these overlap, we now know what we can interchange without having to know anything about each other's respective character sets." This is a much more restricted capability than the ability to transfer arbitrary polyalphabetic text, and one I find quite plausible. Fridrik's and Andre's experience seems to bear this out.

The idea of "take my internal character set, transform it into ISO 2022 form, including the correct announcers and identifiers to tell the remote kermit what my character set is, send it to the remote, and expect that it will place it in a file, ISO2022 controls and all" is trivial, given the ability to translate out of the native character set; all the sending kermit has to do is do the translation but flag the file as a binary file.

The wider problem, that of universal interchange of arbitrary polyalphabetic text, is definitely a fascinating one, but I'm not sure it's tractable in kermit in the short term, if only because for most (especially small) machines that run kermit, we'd have to basically write a whole text processing system to handle it. I don't think that should be part of this effort.

Of course, there are a few machines which can in fact represent almost arbitrary polyalphabetic text as a subset of their generic text format. The Macintosh is one, thanks to the Script Manager, but so far it's an exception to the rule, and even it only handles horizontal-format text--so much for Mongolian...

This brings up a question I had: I was under the impression that the set of registered character sets defined, in effect, a single mapping of multi-byte codes to alphabetic glyphs, with ISO 2022 defining how to encode a stream of these codes into a 7 or 8 bit data stream in a reasonably concise manner.
Is this in fact how it was intended to work? If not, is it an unreasonable way to look at it? I guess the gist of what I am trying to say is that the draft proposal seems to be intended to allow Kermit users to interchange files between dissimilar systems with as little modification to Kermit (and therefore as little modification to the text stream format) as possible. I didn't read it as a general solution by any means. Frank and Christine, if this was the wrong interpretation, please let me know :-). I hope this clears things up a bit. Amanda Walker InterCon Systems Corporation amanda@lts.UUCP From KLENSIN@infoods.mit.edu Sat Apr 15 10:47:45 1989 Return-Path: Received: from INFOODS.MIT.EDU by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA04948; Sat, 15 Apr 89 10:47:45 EDT Received: by INFOODS id <0000057B061@INFOODS.MIT.EDU> ; Sat, 15 Apr 89 10:45:49 EDT Date: Sat, 15 Apr 89 09:55:27 EDT From: John C Klensin Subject: RE: A difference in scope, I think. To: Amanda Walker X-Vms-Mail-To: EXOS%"Amanda Walker " Message-Id: <890415095527.0000057B061@INFOODS.MIT.EDU> Cc: isokermit@watsun.cc.columbia.edu Amanda, I, too, may be confused about what was intended. A few observations: > "take my internal character set, transform it to ISO2022 > sequences and an arbitrary collection of registered character > sets, and send it to a remote host with the expectation that it > will translate to its nearest approximation to what was sent." >I agree thyat this is an open-ended problem, since the set of registered >character sets is open-ended. That is, I believe why the new attribute >field was introduced--so that two Kermits can determine whether what they >can interchange effectively. Basically, I'd rehprase your quote above >as "I know what I can send, and you know what you can receive; if these >overlap, we now know what we can interchange without having to know anything >about each other's respective character sets." 
This is a much more >restricted capability than the ability to transfer arbitrary polyalphabetic >text, and one I find quite plausible. Fridrik's and Andre's experience >seem to bear this out. I think this is what I'm trying to get to, but, for this purpose, the draft proposal is (slightly) inadequate. To make it adequate, information needs to be exchanged, at attribute-packet evaluation time, about what character set(s) are going to be designated. I don't think that finding out, halfway through a transfer, that you are including a reference to a character set that I've never heard of is a satisfactory way to proceed. >The idea of "take my internal character set, transform it into ISO >2022 form, including the correct announcers and identifiers to tell >the remote kermit what my character set is, send it to the remote, and >expect that it will place it in a file, ISO2022 controls and all" is >trivial, given the ability to translate out of the native character >set; all the sending kermit has to do is do the translation but flag >the file as a binary file. Possibly trivial, but extremely useful. This is *not* a binary file, it is a text file with certain embedded controls. Knowing the latter makes it much easier to apply post-transmission processing. It also makes ISO->EBCDIC translation possible if solid mappings from ISO-eight- bit sets to extended EBCDIC ever solidify. An EBCDIC analogue to ISO2022 is equally trivial; the problem is getting the code page mess to settle down. So, when I say "place it in a file, ... and all", I don't necessarily mean "completely untransformed", which is the criterion we apply to "binary". >The wider problem, that of universal interchange of arbitrary >polyalphabetic text, is definitely a fascinating one, but I'm not >sure it's tractable in kermit in the short term, if only because >for most (especially small) machines that run kermit, we'd have to >basically write a whole text processing system to handle it. 
I don't >think that should be part of this effort. Concur. My concern was that the proposal included enough things that doing those in combination implied an effort large and complex enough that one might as well do this task. >Of course, there are a few machines which can in fact represent almost >arbitrary polyalphabetic text as a subset of of their generic text format. >The Macintosh is one, thanks to the Script Manager, but so far it's an >exception to the rule, and even it only handles horizontal-format text--so >much for Mongolian... And only if someone builds the appropriate tables. I don't pay enough attention to the Macintosh to know, but has anyone done tables for Thai yet? Just as one example, others are possible. >This brings up a question I had: I was under the impression that the set >of registered character set defined, in effect, a single mapping of multi-byte >codes to alphabetic glyphs, with ISO 2022 defining how to encode a stream >of these code into a 7 or 8 bit data stream in a reasonably concise manner. >Is this in fact how it was intended to work? If not, is it an unreasonable >way to look at it? Let me give an impression of the answer, since I'm not completely sure I understand the question, nor am I sure that I know the correct answer. Johan, please comment on this, since I think you are the expert. Simple answer (possibly wrong) to one version of your question (I'm here ignoring the "multi-byte" part): Yes, that was the way it was intended, but then someone discovered non-Roman alphabets. So, today the answer is "no", regardless of what was intended. And the reason why it is "no" makes this "an unreasonable way to look at it". Warning: in what follows, the term "character" is roughly the same as "graphic", although the character standards are quite careful to avoid specifying exactly how a character should be printed. However, "Icelandic Lower-case Eth" is a "character" in this sense (one of those that appear in no other alphabet). 
"Character" is independent of "alphabet", even though the name of the alphabet(s) in which it appears may be part of the character's identification. More important, it is independent of any particular "character set" (really a code table) or position (column and row) of that table. "code point" is a synonym for "position" as used in this sense. To all practical intents and purposes, "code point" ("column and row") of a "character set" or "code table" is isomorphic with the binary encoding of that code point in data transmission, but that is not strictly true either. For the (extended) Roman alphabet characters, there is an international standard that lists all of the characters and assigns standard identifiers (names and sequence numbers) to them. The registration procedure says, more or less, "produce a list that identifies those characters with code points" and we will assign it a number. So, for the Roman-based alphabets, I think the answer to your intended question is still "yes", with the "only" problem being that, while I can know what code tables I use, I can't know how to translate an arbitrary code page, given only its identifier, even if I somehow know that it is Roman-only. The reference listing of characters provides an unambiguous way to map between (Roman) character sets, if mappings exist (of course, each character set contains only a few characters from the whole, so 'mappings' are possible only when one set is just a different coding permutation of the characters in another. Or if you decide to map to "closest approximation in graphic form". That is a disaster, since those mapping are often misleading and not reversible. But SC2, in its wisdom, decided to permit ISO8859 character sets that contain distinctly non-Roman characters and some characters that, because of right-to-left problems, I'm not sure what to do with even if I "understand" the character set and have the characters (see "Example" below). 
For those non-Roman character assignments, there is, in at least some cases, no reference international standard. Hence no universal anything that we are just mapping and remapping. And, if the government of Thailand decides to register a Roman-Thai character set as ISO DP 8859-x, it will probably eventually be approved, and the answer to your question becomes, seriously, "no". Alternate answer, taking the "multi-byte" seriously: Let me say this as strongly as I can: THERE IS NO SUCH THING AS A STANDARD, OR EVEN NEAR-STANDARD, MULTIBYTE CHARACTER SET. ANYWHERE. The closest approximation is ISO DP 10646, and it is a *PROPOSAL*, not a Standard, or even a Draft Standard. It is also, in the opinion of The Netherlands and others, defective and it is quite probable that, if it is ever approved, it will be changed from the present form. It is also not large enough to handle the number of Hanji that China's standards bodies believe they need, since they claim they need at least three octets. I have not studied 10646, so have no further opinions on it. However, the point is that it is not suitable, today, as the base for anything. > I guess the gist of what I am trying to say is that the draft proposal >seems to be intended to allow Kermit users to interchange files between >dissimilar systems with as little modification to Kermit (and therefore >as little modification to the text stream format) as possible. If this were the proposal, I would be in favor of it. While I don't think the proposal modifies the text stream format very much, I read a demand for a lot of data conversion capability into things like Figure 4 and the surrounding text. Also read the paragraphs "LOCAL FILE REPRESENTATION". This is not an "implementation detail" or, as we say here, a "small matter of programming". It is an impossible task, requiring programs that automatically adapt to changing knowledge and conversions that can't occur, at least without significant loss of information, in real systems. 
Or maybe I'm seriously misreading the intent, in which case someone should please correct me before I clutter the bandwidth up any further. But comments made by others about how nice it would be to have kermit perform automatic WordPerfect to ODA conversions imply that I'm not the only one reading things this way. >I hope this clears things up a bit. I hope so, too. I'm getting pretty confused, and hope that I haven't added to the general confusion. John Klensin Klensin@INFOODS.MIT.EDU From jrd Sat Apr 15 21:15:37 1989 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA08824; Sat, 15 Apr 89 21:15:37 EDT Date: Sat, 15 Apr 1989 21:15:36 EDT From: "Joe R. Doupnik" To: isokermit Cc: jrd Subject: Short commentary Message-Id: Just a brief commentary on the most recent discussions between John and Amanda, or "The practical side of affairs, as I perceive them." People already use statically defined glyph/token/symbol tables, relating display glyphs and the corresponding octet value(s) or code point. Some are fully registered, others such as the IBM-PC left table(s) are not. In either case they are present and popular enough for us to pay attention in Kermit. I think that disposes of the arguments concerning glyphs in isolation versus existing in tables: only the tables are of concern to Kermit because only the tables are used to prepare the original file. Concepts dealing with isolated glyphs for all of the world's major languages need to count "our" octets and budgets. I think it also obviates the discussion about how many octets or whatever are needed to decide which glyph to "display" because that is really a detail. After all, the escape sequences we have been discussing are multi-character and some, CSI for example, can be one octet or two, depending on whether 7 or 8 bit controls are used. While the etymology of individual characters can be fascinating it has no bearing on our Kermit discussion. 

To amplify the sentence above, we have been concerned about the receiver not possessing a table matching that in the transmitter, and the possibility of letting the receiver select "a close equivalent." To select implies knowledge of the transmitter's table; in fact the table may exist at the receiver. But the catch is: the receiver is promising that it can do something constructive with the character, such as print, display, or store it in recognizable form at the time the character arrives on the communications circuit. It may not be able to do that, short of physically drawing the thing, because the storage device (printer, screen, disk/file system, etc) has no such ability. Consequently, the receiver either accepts text (thanks John) verbatim without understanding code tables and such or it knows how to reformat the information for local consumption. That's a simple choice yielding receiver responses of "Yes, I can deal with that file" or "No, I can't", regardless of how smart the Kermit receiver code might be (internal tables vs exporting the information to the operating system).

The number of tables, such as ISO-8859-n, need not be huge in practice. How many are there now? Fewer than two dozen I would guess if we don't count Eastern languages. At 256 bytes per table that is not a huge storage problem; realities of equipment reduce the quantity of active tables much more. Ok, the receiver has plenty of tables but Kermit has to decide where the file output is going: printer (can it manage Greek etc?), disk (uh oh), screen (so what kind of display is in use right now?). A mismatch at that point means translation cannot occur. It also means that Kermit would need to know a lot about the local computer system; too much, in my opinion. Thus, the poor user needs to inform the Kermit receiver which tables are permitted with which destinations; not very pleasant, but I see few alternatives.
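
To make the "256 bytes per table" and "Yes, I can deal with that file / No, I can't" points concrete, here is a small Python sketch. It is an illustration under stated assumptions, not part of the original message: the names `PERMITTED`, `can_accept`, and `build_table` are hypothetical, the destinations and set names are invented examples of what a user might declare, and Python's built-in `latin-1` and `cp437` codecs stand in for an ISO 8859-1 table and an IBM PC table.

```python
# Hypothetical user declaration: which character sets each output
# destination is allowed to receive (an assumption for illustration).
PERMITTED = {
    "disk": {"ASCII", "ISO-8859-1"},
    "printer": {"ASCII"},
}


def can_accept(announced_sets, destination, permitted=PERMITTED):
    """Receiver's yes/no answer: every announced set must be permitted
    for the chosen destination, otherwise the answer is no."""
    return set(announced_sets) <= permitted.get(destination, set())


def build_table(source, target, fallback=b"?"):
    """Build a 256-byte translation table from one single-byte code
    table to another; unmappable positions get the fallback byte."""
    table = bytearray()
    for code in range(256):
        char = bytes([code]).decode(source)
        try:
            encoded = char.encode(target)
        except UnicodeEncodeError:
            encoded = fallback
        table += encoded if len(encoded) == 1 else fallback
    return bytes(table)


# One 256-byte table per (source, target) pair.
LATIN1_TO_CP437 = build_table("latin-1", "cp437")

# Translating a received buffer is then a single table lookup per octet.
translated = b"caf\xe9".translate(LATIN1_TO_CP437)
```

The fallback byte illustrates the loss Joe warns about: any position with no counterpart in the target table becomes `?` and cannot be recovered afterwards.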

I do have a question for our non-US colleagues: how do your file systems store mixed language text? I'm betting that almost none has the slightest concept of language or mixed character sets, just raw octets or equivalent storage units. If this were the case what should Kermit do? My thought is Kermit ought to store/deliver the file in ISO-2022 form and let later utilities/hardware deal with any conversions, except when the local "system" (not just Kermit) possesses the character set of the transmitter.

ISO-2022 and similar are just mechanisms for heavily overloading octets and thus reducing the number of them on a communications link; they are intended to be transparent when the communications process completes. They are data compression methods, not translation methods. John and Amanda have elaborated that point but it needs reinforcing. The only place where aspects of language occur is specifying the active character sets. I can easily imagine using ISO-2022 to transmit digitized pictures effectively, as patterns in a set of unregistered tables. (Let's not get distracted discussing pictures of characters; FAX does that job well enough already).

Summarizing: tables count, both sides need a 100% match as a system or the output is considered verbatim, the number of tables is normally not large, the number of octets describing a table entry is a detail, ISO-2022 represents a fairly useful method of walking a hierarchy of such tables, the NxN problem has not gone away.

There must be some blind spots in the above discussion or I am missing the whole point of this proposal.

Joe D.

From MURAKAMI@ntt-20.ntt.jp Mon Apr 17 11:23:58 1989
Return-Path:
Received: from ntt-20.NTT.JP by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA23362; Mon, 17 Apr 89 11:23:58 EDT
Date: Mon, 17 Apr 89 21:53:16 I
From: ken-ichiro murakami
Subject: my poinion for ISO2022
To: isokermit@watsun.cc.columbia.edu
Message-Id: <12486834997.27.MURAKAMI@NTT-20.NTT.JP>

Hi.
Kermit> SET TRANSFER-SYNTAX ENGLISH (default is Japanese) :-) Here are some opinions and comments for ISO2022 discussion. First, I take back my proposal SET FILE FILTER command. I think kermit macro and external program can do the same thing as filter. (However, it's not on-the-fly but post- or pre-transfer.) John recommend me to use MAKE program instead of MS-DOS batch command. (Thanks, John.) I have not finished to write automatic Kanji conversion macro using make program. But, it seems the make program brings us convenient multiple file handling and conditional branch facility. Second, I explains you some Japanese situation and some comments to the previous discussions. (1) John says; >As kermit moves into >eight-bit character sets, it may be appropriate to modify the protocol >somewhat to understand that combinations such as "8 data bits, parity, one >stop" are as well defined as "7 data bits, parity, one stop" and "8 data >bits, no parity, one stop". The latter two are supported by the existing >protocols, the former is not. Is it correct? I thought all these combinations were allowed in Kermit protocol. If both end use the same combination, there is no problem. I think it's out of scope. Frank, is it correct? (2) John says; > Referring to the "possible problem" called out at the bottom of the >document, the theory according to ISO is that Japan will have to get in >line. JIS had the option to taking significant exception to those >assignments and apparently did not, which may speak for their intentions. >I await Ken-ichiro Murakami's second round of comments on this, after his >expert advice arrives. We, Japanese, have really a lot of Kanji code and we are often annoyed by them. As John said, JIS had the option to taking significant exception. But, it is only true for ISO2022 interpretation. As I reported before, THERE IS NO CONFLICT between ISO/ECMA alphabet codes and escape sequences and the codes used in Japan. Please refer my previous comment about it. 
As for loose Japanese interpretation for ISO2022, we seldom encounter a problem. It's possible to say there is ISO2022-like de-facto standard. If this is incorrect, Dr.Fujii will give us a comment. (3) John said, >that acceptance of ISO2022 sending should be followed by a packet that >specifies what I intend to bind onto G0-G4, and/or a list of what I would >*like* to use, in descending order. In the former case, the receiver >would say "yes" or "no"; in the latter, it would send back its list of >preferences, leaving off anything it could not handle. Both of those are >consistent with the general kermit model. Or, one could break the rules >and negotiate back and forth. As John pointed out, there are two view points; (a) whether ISO2022 is supported or not (b) what character set is supported Consider, there are two micro computers which support ISO2022. One is only for Japanese, the other is only for Chinese. Since the latter cannot handle Kanji, file transfer between these computers may be aborted because of unexpected character sets. What should we do? The problem is that we don't know whether the file in the former computer contains Kanji or not. If we want to know it, we must read through the file before file transfer. I don't know what I should do in this case. (4) John said; >I. Terminal emulation . . >If those emulators can move beyond the subsets, so much >the better, but ISO8859-n support, for a small set of n's, is much more >important than general ISO2022 support including midstream character set >switching. I am very encouraged in this regard by what I have been >able to infer about the work in Japan, and would like to hear more about >it. Most of terminals around me support ANSI X3.32, ANSI X3.41, ANSI X3.4, ANSI X3.64, ISO 646, ISO2022, ISO6429 and other Kanji codes such as EUC and SHIFTJIS. These documents say nothing about ISO8859-n. 
Since ordinary communication channel such as TCP/IP and UUCP ensures only 7-bit transparency, it's necessary for Japanese terminal to support 7-bit ISO2022. (Both EUC and SHIFTJIS needs 8-bit transparency.) Therefore, ISO2022 is very popular in Japan. (5) John said; >if both workstations can handle, say, ISO8859-5, that is no guarantee that >the intermediate hosts can. No real problem here, but the files will >either have to arrive as binary and be post-decoded (as below) or the >kermit on the receiving workstation will need to be given a *local* SET >RECEIVE TRANSFER-SYNTAX ... command that deals with the file in the way >that files of that type are dealt with, independent of any attribute >negotiations. You can implement this facility by macro independent of Kermit command. I think we should not add new kermit command if you can do with macro. (6) John said; > 4. For operating systems for which the capability is appropriate, a >kermit command that looks something like: > SET DATA-CONVERTER >would permit user specification of a routine that could be invoked on the >fly to convert those particular types of files. The semantics of >"routine" would have to be local-system-dependent. For the record, I >don't think that this is a good idea, but it is a way to incorporate the >capability that some of you seem to want within kermit. This command is the same as SET FILE FILTER command which I proposed before. I think macro might do the same thing as SET DATA-CONVERTER command. (7) John said; > But the main point is that 2-3 above get a "binary" file delivered, and >delivered with sufficient information to decode it. And they enforce >agreement on the file organization and content before the file is >transferred. That decoding is probably best done post-transfer, rather >than on-the-fly, if for no reason than that one is likely to like to have >the file -- as transferred -- available when the decoding produces >something unexpected. 
If you decode or translate on the fly, a new set of
>debugging options that save the transferred data would be helpful.

This is an important point. It seems safe to adopt post-transfer decoding.

(8) Andr'e said;
> So, I suggest that international characters be supported on two levels:
> 1) restricted, within a single version of ISO8859, in the.....
> 2) general, across multiple codes or ISO versions, in which usage ...

We, Japanese, need 2). 1) will bring us nothing. For Japanese, standardization of Kanji (ISO2022) is important. Ordinarily, our files contain various characters including Kanji, Kana, and Roman ASCII. In addition, we have many Kanji character codes.

That's all. Since English is one of my weak points, I might have misunderstood what other guys said. If so, please correct me. Our objective in adopting ISO 2022 is automatic Kanji conversion. It seems it is slightly different from your objective.

-Ken

P.S. Since I'm very busy, I cannot afford to post my opinion anymore this month. It's a time-consuming job for me to read and write English. Sorry.
-------

From fdc Mon Apr 17 21:40:34 1989
Return-Path:
Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA28863; Mon, 17 Apr 89 21:40:34 EDT
Date: Mon, 17 Apr 1989 21:40:33 EDT
From: Frank da Cruz
To: isokermit
Subject: De nonnullis
Message-Id:

(is anyone else having trouble mailing to ISOKERM@CUVMA.BITNET? It's supposed to work...)
---------------
Date: 18 April 1989, 00:54:22 SET
From: Gisbert W.Selke +49 228 225888
To: ISOKERM at CUVMA
Re: De nonnullis

Mhm, I'm not sure I had anticipated discussions of this scope. Well.
Living in a mostly non-English speaking country and having to cope with several (mis)representations of German characters, I think I'd be quite happy with a standard - a *Kermit* standard that not necessarily coincides completely with any ISO or whatever standard or draft or proposal - to be able to transfer German files from one system to the (un-like) other, with at least the characters coming out the 'same'. Now, this means I'm not interested mainly in converting from one word processor format to the other; as long as there are many different formats living on the same machine, we can't even dream of tackling that task in a general fashion. So that should indeed be left as an implementation detail, or as mere programming, preferably outside of Kermit proper. What I *am* interested in is defining a standard so that I may reasonably expect a file - containing, say, a maths text (in German, with Greek characters) - that I send from my PC to arrive well and legible on my friend's VAX in Turkey, say. Basically, the ISO kermit draft proposal does exactly this. Apart from details, having this would be a great thing for us - practically speaking. And I think it's not too Eurocentristic either - we should indeed be able to cope with Eastern languages within this scheme, at least to a great extent. The purpose of all this is, of course, *file* transfer - so the question of how the receiving system might be able to represent physically whatever it receives is of minor interest (oh yes, it should be able to store incoming bits...). Of course I can write a TeX text on my vanilla PC, and I'm not at all bothered that it doesn't have the hardware to display it properly on screen, or even on my cheapo dot matrix printer - so, again, sending such a file is not a question of Kermit (per se) or the receiver's hardware but of the software that it runs. 
(Imagine someone sending me a file in Hebrew - all I'd need is, say, an ISO-to-TeXXeT converter to print it out or even view it on screen. No extraordinary hardware!) To repeat: what we (well, I for one) need is a reasonable chance of two arbitrary Kermits speaking the same language. Anything else should be left to the respective Kermit or stand-alone 'mere programming', and shouldn't be allowed to clobber the transmission standard. So that does indeed reduce the NxN problem to a 2x(N-1) problem. - As a matter of course, we may suggest a standardized syntax for the user to specify what conversion to use (*if* a particular Kermit implementation provides such conversion mechanisms); but that's not really what is at stake here. Yes, the user must, in general, know something about the file format E uses; still, it's much better to have to say something like 'set file-type xyz' than to have no chance at all to get a file thorugh unharmed. And I guess that's all you can - practically - really ask for - as long as we don't have The Universal Word Processor. \Gisbert From cmg Tue Apr 18 11:40:38 1989 Return-Path: Received: by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA04967; Tue, 18 Apr 89 11:40:38 EDT Date: Tue, 18 Apr 1989 11:40:37 EDT From: Christine M Gianone To: isokermit Subject: ISO/Kermit Draft Proposal.... Message-Id: Many thanks to John Klensin for his detailed, thoughtful, and probing comments on the second ISO/Kermit draft proposal, and to Amanda Walker, Joe Doupnik, et al, for their remarks also. ISO 2022 TRANSFER SYNTAX It is true that ISO 2022 is not widely implemented, except to some degree in DEC VT300-series terminals and a few other places. Nevertheless, we are proposing to use it in Kermit as a file transfer syntax because it is the only widely known and approved mechanism for designating and switching among textual character sets. 
Furthermore, the 7- and 8-bit single-byte and multibyte character sets described in the proposal are in use today, and they are standardized and registered. Our concern with registration is that there be a unique identifier for each character set. We assume (perhaps naively) that these character sets -- which are agreed upon after months or years of debate by national and international standards organizations -- will not change very often, meaning not more than once every 5 or 10 years (like, say, US ASCII or GOST Latin/Cyrillic), in other words much less frequently than do Kermit programs themselves. And when they do, adjusting Kermit programs will be the least of our problems! In any case, the proposal allows for both the character set identifier AND the version number. It's true that other sets will be registered as time goes on, and that the world may eventually settle on universal multibyte code. Neither of these considerations should hold us back from trying to extend Kermit file transfer to accommodate alphabets beyond ASCII. Nor should the current proposal preclude addition at a later time of a transfer syntax built around a universal character code. It is important that we agree that the use of ISO 2022 escape sequences and shifts with registered alphabets provides a completely unambiguous representation of the text being transferred. This is the "common intermediate representation" or "transfer syntax", crucial to any communication protocol, that avoids the n x n problem in which every computer must know about every other computer's formats. So the most fundamental question we can ask is: "Have we chosen the best transfer syntax?". Once this matter is settled, all other questions boil down to implementation -- conversion between local file format and the transfer syntax, matching capabilities between two ISO-capable Kermits, and the design of the user interface. 
LOCAL FILE FORMAT

As John points out, the proposal deliberately sidesteps the issue of local storage format. ISO 2022 is not used for this, and there is no other widely accepted standard. And in fact, mixed-language documents are often embedded in some word-processor's complicated proprietary and version-dependent formatting. This issue has already prompted some lively debate, as well as suggestions for a SET FILE FILTER command to pre- and postprocess such files, and even more ambitious suggestions to incorporate ISO ODA or SGML in the transfer syntax. This latter suggestion must be deferred for further study, but we must take care to ensure that the multinational text transfer syntax that we decide upon does not preclude the later addition of a "document description language" layer. Anyone familiar with ODA or SGML is urged to consider this question.

Nevertheless, if we are to extend the Kermit protocol to transfer multinational text, the extension should be as general and unrestrictive as possible. So some day, when (if) multi-language files are common and easily parsable, the Kermit protocol will be ready to transfer them. In the meantime, the proposal should also address more mundane tasks like transferring, say, a French document from a PC to a Macintosh.

MATCHING CAPABILITIES BETWEEN KERMITS

John makes the very good point that the attribute packet includes ISO 2022 announcers, but not the alphabet designators. Therefore, a file transfer could fail in midstream when the receiver sees an unknown alphabet designator. This was a deliberate omission. Do we want to turn Kermit into a 2-pass compiler? This could be tedious for large files, particularly in multi-file operations, where the entire operation could time out before the sender had time to comb through a 9000-megabyte file collecting alphabets.
In a common case, we're transferring files between incompatible systems, but the files contain only characters that have equivalents in (say) the ISO Latin-1 alphabet. No problem here, so long as both Kermits know this alphabet and use it in the transfer syntax. But as soon as we mix two or more character sets, there is the chance that these might include one that the receiving Kermit will not know about.

How does the sending Kermit tell the receiving Kermit what alphabets will be used? First, note that we have not required the use of attribute packets. Therefore any attribute-based notification will be optional. Second, we have recommended (but not required) that alphabet designating sequences be transmitted at the head of the file data. Perhaps this is a bad idea -- it might cause a lot of unnecessary disk activity at the receiver -- premature reading in of translation tables, etc.

Prenotification of alphabets should remain optional, so that the sender is not compelled to read the entire file before sending it. But assuming the sender knows the alphabets in advance, then it makes sense to send a list of them in an attribute packet. This means we have to (a) define a new attribute, and (b) settle upon the syntax of the alphabet designators.

The first unused attribute code is "2" (ASCII 50, 03/02 if you prefer). Let's call this one "Character Set". One or more of these may appear in an Attribute packet in any position. The format is:

   +---+---+-----+
   | 2 | L | CSD |
   +---+---+-----+

where CSD is the "character set designator", and L is the length of the CSD. We must still decide the format of the CSD. It can't just be the final letter of the designating escape sequence, because these are not necessarily unique (see Table 5 of Draft 2 of the proposal). There would seem to be several choices for the CSD:

1. Something we make up, like "A" for Latin 1, "B" for Latin-2, etc.

2. Everything after the <ESC> in the designating sequence (as shown in Table 5 of the second draft).
The first character (such as "-", "(", or "$") would tell whether it was a 94, 96, or multibyte character set. The problem here is that different characters are used in this position for the same character set, depending on whether it is to be designated as G0, G1, G2, or G3. Also, the designators are different lengths, depending on whether the character set is single- or multi-byte.

3. The ECMA registration number, as shown in Table 5. This makes more sense, since we are simply identifying the character set and not making irrelevant assertions about where to assign it. But in the latest version of Table 5, we are still missing registration numbers for JIS-Katakana and JIS-Roman. Can anyone supply these?

OK, so now we have a way for the sender to notify the receiver about the alphabets in advance, so the receiver can reject the file if it contains an unknown alphabet and, because character sets now have their own attribute type, the receiver can inform the sender exactly why the file has been rejected. This can save the user time, but it doesn't get the file transferred. The user currently has one alternative: SET TRANSFER-SYNTAX NORMAL on the receiving end, send the file with ISO-2022 syntax so that it will be stored with embedded alphabet designators and shifts on the receiving end, and then postprocess it after it is received.

A second alternative is now suggested to cover the case where a file contains a mixture of alphabets, some known to the receiver, others not. The receiver has not been prenotified of the alphabets, and has not rejected the file. At some point, an alphabet designator arrives which the receiving Kermit does not recognize. We suggest that this designator be accepted and stored as data, and that subsequent characters be stored untranslated. When a designator for a known alphabet arrives, the receiving Kermit stores the ISO-2022 Coding Method Delimiter, <ESC>d, and resumes translation.
We suggest that this be the default behavior when an unexpected, unknown alphabet arrives, but that the behavior can be controlled by a new command, SET UNKNOWN-ALPHABET {KEEP, DISCARD}.

Now suppose the receiving computer has applications or devices which support a character set which the receiving Kermit does not know about? To cover this case, we can define a standard format for translation tables, and provide a LOAD CHARACTER-SET command to allow the user to add new character sets to a Kermit program's repertoire. What should such a file look like?

Here's a first attempt to design the format. The file is written entirely in printable ASCII, with line divisions as shown. Numbers are represented as ASCII decimal digits.

   Line   Contents
   1.     Transfer Syntax Character Set Name (e.g. LATIN1-ISO)
   2.     Local Character Set Name (e.g. IBM-CODE-PAGE-437)
   3.     Number of bytes per character (1, 2, 3) = b
   4.     Number of Characters per plane (94, 96) = n
   5.     ISO/ECMA Registration Number of Transfer Character Set (0 if none)
   6.     Final Letter of Designating Sequence for Transfer Character Set
   7.     Version Number of Transfer Character Set (0 if none)
   8.     Direction of display (encoded to allow for any combination of
          left, right, upwards, downwards, boustrophedon, etc).
   9-16.  Reserved.

Each line, 17 through (17 + n^b), contains a pair of characters in 8-bit form, in ASCII decimal representation, with the two characters separated by a comma: , The character pair is listed, rather than a single value (as in most translation tables) so that the program may build two tables from it, one for each direction of translation. For a single byte set, the numbers vary from 0 to 255. Typical entries might look like:

   43, 43
   243, 224

For a multibyte set, each byte is represented separately, for example:

   37 143, 255 10
   37 144, 255 142

The obvious limitation of this kind of loadable translation table is that it is one-to-one.
It will not accommodate transfer syntaxes like T.61, which would require some two-to-one mappings, nor local file formats in which special characters might be represented by sequences. Is it worth expanding the syntax of the loadable translation table to allow for arbitrary translations? There is also an implication that character set names must be standardized and registered, so that different Kermit versions will mean the same thing by the same command, and possibly that built-in translation tables can be overlaid by user-defined ones, and also so that tables may be built up and shared by Kermit users. Perhaps most significant (and we're not sure if this is a plus or a minus), user-loadable character sets would also let people use ISO-2022 transfer syntax with nonstandard and/or unregistered character sets, for instance by a Cherokee Indian organization that composes its newspaper on a combination of PCs and Macintoshes that use different encodings for Cherokee. John argued that character set standards change all the time. We think it's more likely that computer-system-specific character sets will change all the time -- witness the hot debate over IBM EBCDIC Code Pages. Either way, we have an argument for user-loadable (or site-loadable) translation tables. To complement the LOAD CHARACTER-SET command, there should also be a SHOW CHARACTER-SET command, by which the Kermit program tells the user what character sets it knows about, including the name, size, designator, version, and registration number of each set. A SINGLE ALPHABET For the foreseeable future, few of the complications of the proposal will come into play. As many have pointed out, the overwhelming use of an "ISO-Extended" Kermit will be within a single 8-bit character set, like ISO 8859-1. Recall that the proposal says this can be done by sending the data as-is, with no shifts, announcers, or designators, provided both Kermits know what the transfer alphabet is. 
However, the proposal is vague on exactly how to accomplish this. The recent messages from Fridrik and Andre amplify this concern. To address this common case, we suggest a command like this:

   SET TRANSFER-SYNTAX LATIN1-ISO

to specify that the sender will translate from local notation to (e.g.) ISO Latin 1, or that the receiver will interpret incoming data as ISO Latin 1. This command would differ from SET TRANSFER-SYNTAX ISO-2022 in that no ISO-2022 announcers, designators, shifts, or other controls would be inserted into the data stream.

In order for the sender to inform the receiver of the transfer alphabet, another new operand can be defined for the Encoding ("*") attribute. This could be "O" (uppercase O, for "Other alphabet"), meaning that the character set is specified in the "2" attribute, of which there should be only one per file. Example (recall that Attribute packets are in Parameter-Length-Value notation):

   +---+---+---+---+---+-----+
   | * | ! | O | 2 | # | 100 |
   +---+---+---+---+---+-----+
     P   L   V   P   L    V

Encoding = "O", Alphabet = 100 (ECMA Registration Number for Latin 1, assuming this is the convention we adopt for alphabet identification).

CONCLUSION

In light of recent comments, it would seem useful to break the ISO/Kermit proposal into levels.

Level 0 (default) is the current "normal" file transfer syntax.

Level 1: Allow user to specify transfer syntax to be some character set other than ASCII, with no announcers, designators or shifts.

Level 2: Full ISO-2022 transfer syntax, as previously proposed, amended to allow (a) preannouncement of character sets, and (b) user-defined loadable character sets.

Defer all discussion of "higher-level" presentation syntaxes (wordprocessor/database/spreadsheet/graphics formats, etc, not to mention LZW compression...) until MUCH LATER, but keep them in mind while designing Kermit's international-alphabet extension.
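As a rough illustration of the Attribute packet example given earlier in this message (Encoding "O" followed by a Character Set attribute carrying registration number 100), here is a minimal Python sketch of the Parameter-Length-Value encoding. It assumes the usual Kermit convention that a length is made printable by adding 32, which is consistent with "!" standing for length 1 and "#" for length 3 in the example. The function names are made up for this sketch, and the "2" attribute itself is only a proposal at this point.

```python
def tochar(n):
    """Make a small number printable by adding 32 (assumed Kermit convention)."""
    if not 0 <= n <= 94:
        raise ValueError("length does not fit in one printable character")
    return chr(n + 32)


def unchar(c):
    """Inverse of tochar()."""
    return ord(c) - 32


def encode_attributes(attrs):
    """Encode (parameter, value) pairs as Parameter-Length-Value text."""
    return "".join(param + tochar(len(value)) + value for param, value in attrs)


def decode_attributes(data):
    """Decode Parameter-Length-Value text back into (parameter, value) pairs."""
    attrs = []
    i = 0
    while i < len(data):
        if i + 1 >= len(data):
            raise ValueError("attribute is missing its length field")
        param = data[i]
        length = unchar(data[i + 1])
        value = data[i + 2:i + 2 + length]
        if len(value) != length:
            raise ValueError("attribute value is truncated")
        attrs.append((param, value))
        i += 2 + length
    return attrs


# Encoding = "O" (other alphabet), Character Set = "100" (proposed "2" attribute)
field = encode_attributes([("*", "O"), ("2", "100")])
print(field)                      # *!O2#100
print(decode_attributes(field))   # [('*', 'O'), ('2', '100')]
```

Because each attribute carries its own length, a receiver that does not know the "2" attribute can still skip over it and read the rest of the packet.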
Finally, note that Kermit Levels 0, 1, and 2 can be mixed, so that (for example) a PC can convert from its international code page to ISO Latin 1, with or without announcers, designators, and shifts, send it to another Kermit operating at Level 0, and have it stored as transmitted for later postprocessing. For John's "quibbles" and editorial remarks, as well as corrections of fact, we are grateful, and they will be reflected in the next draft. We are also indebted to Amanda, Joe, Ken, Hiro, Andre, Fridrick, Gisbert, and all others who have contributed to the current round of discussions. We enclose below a simplified flow diagram representing the operation of an extended Kermit program. Further discussion welcome! - Christine and Frank NOTE: The following does not take into account the FILE FILTER function which has been proposed and withdrawn, and may be proposed again. Also, an 8-bit transmission channel is assumed. SET FILE TYPE BINARY (overrides SET TRANSFER-SYNTAX command) | | N Y--> Transfer file unmodified. END. | Text mode. Three possibilities: SET TRANSFER-SYNTAX NORMAL (the default) | | N Y--> LEVEL 0: Transfer syntax is ASCII with CRLF as line terminator. | Sending program translates from local format to transfer syntax, | Receiving program translates from transfer syntax to local format. | END. | SET TRANSFER-SYNTAX LATIN1-ISO (or any other single character set) | | N Y--> LEVEL 1: Transfer syntax is specified character set with CRLFs. | Sender translates from local format to specified character set | using (a) default built-in table, (b) built-in table selected by | SET FILE CHARACTER-SET, or (c) user-defined table obtained via | LOAD CHARACTER-SET and then selected by SET FILE CHARACTER-SET. 
| Receiver may operate at Level 0 (file will be stored as sent), | or Level 1 or 2 (user must issue {SET FILE, LOAD} CHARACTER-SET | commands if this is not the program's default character set, and | also the appropriate SET TRANSFER-SYNTAX command to have receiving | Kermit program convert the file to local form). END. | File composed of more than one character set: SET TRANSFER-SYNTAX ISO-2022 | | N Y--> LEVEL 2: Transfer syntax is ISO-2022. Assumes that sender can | Identify the different character sets in the local file, and | can translate them to registered character sets if necessary. | | | Do sender and receiver both support Attribute packets? | | | | N Y--> Sender specifies encoding ("*") to be ISO-2022 ("I"), | | and lists ISO-2022 announcers. Sender also optionally | | lists the alphabets to be used in Attribute "2". | | | | | Receiver agrees to these facilities and alphabets? | | | | | | Y N --> Receiver rejects the file, indicating "*" | | | or "2" as the reason. END. | | | | | | NOTE: If receiver has been set to NORMAL | | | transfer-syntax, it will always accept the | | | file. | | | | | Receiver accepts the file. | | | | Transfer begins. Sender translates from local file format to | the character sets of the transfer syntax, using ISO-2022 | announcers, designators, and shifts to switch among them. | | | If the sender encounters an unknown alphabet while reading the | file, it should send an EOF (Z) packet with the D (Discard) code | in the data field and proceed to the next file, if any. Receiver | will discard or keep the file according to its SET INCOMPLETE | setting. | | | Receiver may operate at Level 0, Level 1, or Level 2. | | | | | | | 0 -- > Receiver stores the data in the form | | | transmitted, retaining all | | | announcers, designators, and shifts, | | | but converting from ASCII/CRLF | | | format to local text format. 
| | | | | 1 --> Receiver stores the data as transmitted, | | retaining all announcers, designators, and | | shifts, but converting from the specified | | transfer character set to local text format. | | | 2 --> Receiver heeds announcers, designators, and shifts, and | translates from the indicated character set to local | representation. Translations are according to built-in | tables, or tables obtained via LOAD CHARACTER-SET. | | | If the receiver encounters an alphabet it does not know, it | will act according to the SET UNKNOWN-ALPHABET command: | | | | KEEP DISCARD --> Reject the file by putting an X (Cancel | | File) code in the data field of its | | Acknowledgement. END. | | | (default) Continue to receive the file, but store the | designator for the unknown alphabet along with the | untranslated characters from that alphabet, until the next | known alphabet is encountered. Mark the end of the | untranslated material with the ISO-2022 Coding Method | Delimiter, d. Also, issue a warning to the user. | | | END. | Reserved for future (e.g. ISO 10646)... (END) From lts!amanda@uunet.uu.net Tue Apr 18 12:46:53 1989 Return-Path: Received: from uunet.UU.NET by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA05678; Tue, 18 Apr 89 12:46:53 EDT Received: from lts.UUCP by uunet.UU.NET (5.61/1.14) with UUCP id AA06348; Tue, 18 Apr 89 12:45:24 -0400 Received: by lts.UUCP (4.12/2.881128) id AA15825; Tue, 18 Apr 89 11:40:23 est Date: Tue, 18 Apr 89 11:40:23 est From: Amanda Walker Message-Id: <8904181640.AA15825@lts.UUCP> To: isokermit@watsun.cc.columbia.edu Subject: Re: Frank & Christine's latest messages I think the three-level idea is a good one, but I do want to bring up one point that John made to me in email: as soon as alphabet shifts are introduced, the encdoding of a particular piece of text is no longer unambiguous, since glyphs are duplicated across alphabets. 
The JIS Kanji set contains roman, cyrillic, and greek characters; ISO 8859/1 contains non-ASCII characters that are also in ISO 8859/2, and so on. This is not a problem for rendering the text, but it could be for interpretation. Christine's suggestion of mandating the ability to use a dynamically-loaded translation table does alleviate this problem to some extent. If Machine A -> ISO 2022 is a one-to-many mapping (which it will be in the general case), then ISO 2022 -> Machine B needs to be able to be a many-to-one mapping. --Amanda
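
To make the one-to-many / many-to-one point concrete, here is a small Python sketch that builds both translation directions from entry lines in the style of the proposed loadable table (decimal byte values, a comma between the two characters, multibyte characters written as space-separated bytes). Several things here are assumptions for illustration only: the proposal as quoted does not say which side of the comma is the local character, so the sketch treats the first as local and the second as transfer; the rule "first entry wins" for the sending direction is invented; and the sample values are hypothetical, not taken from any registered set. Only entry lines are handled, not the 16 header lines.

```python
def parse_char(text):
    """Turn '37 143' into (37, 143); a single-byte character becomes (43,)."""
    return tuple(int(part) for part in text.split())


def build_tables(entry_lines):
    """Build local->transfer and transfer->local tables from 'a, b' lines."""
    to_transfer = {}
    to_local = {}
    for line in entry_lines:
        left, right = line.split(",")
        local = parse_char(left)        # assumed: first value is the local code
        transfer = parse_char(right)    # assumed: second value is the transfer code
        # Sending: if a local code is listed more than once, keep the first
        # transfer code given for it (an arbitrary choice for this sketch).
        to_transfer.setdefault(local, transfer)
        # Receiving: every listed transfer code maps back to its local code,
        # so several transfer codes may share one local code (many-to-one).
        to_local.setdefault(transfer, local)
    return to_transfer, to_local


# Hypothetical entries: local code 65 is listed against two transfer codes.
entries = [
    "65, 65",
    "65, 193",
    "37 143, 255 10",
]
to_transfer, to_local = build_tables(entries)
print(to_transfer[(65,)])    # (65,)
print(to_local[(193,)])      # (65,)
print(to_local[(255, 10)])   # (37, 143)
```

With a table like this the receiving side can fold duplicated glyphs onto one local code, but the original choice of alphabet is lost once the file has been stored in local form.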