A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

Christine Gianone
Manager, Kermit Development and Distribution
Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA

DRAFT NUMBER 3
JULY 7, 1989

ABSTRACT

A two-level extension to the presentation layer of the Kermit file transfer protocol is proposed to allow transfer of non-English-language text files between unlike computers. Level 1 allows substitution of single character sets other than ASCII in Kermit's normal text-file transfer syntax. Level 2 specifies a new transfer syntax in which multiple character sets may be used, along with mechanisms for switching among them as defined in ISO Standard 2022.

This is still a DRAFT proposal. Readers with knowledge of real-world multi-alphabet applications and file formats are urged to comment on the suitability of this proposal. It is assumed the reader is familiar with the Kermit file transfer protocol.

SUMMARY OF CHANGES SINCE DRAFT #2, March 30, 1989

  - Separation of extension into Levels 1 and 2.
  - Additional file attributes for preannouncement of character sets.
  - Criteria for selection of character sets.
  - Handling of unknown character sets.
  - Handling of "illegal" characters in data.
  - Preliminary specification of user-loadable translation tables.
  - Avoidance of cryptic ISO terminology in Kermit commands.

ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments on the first two drafts. In most cases, their suggestions or the information they provided have been incorporated into the third draft.

  John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
  Alan Curtis (University of London, UK)
  Frank da Cruz (Columbia University, USA)
  Joe Doupnik (Utah State University, USA)
  Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
  John Klensin (Massachusetts Institute of Technology, USA)
  Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
  Vladimir Novikov (VNIIPAS, Moscow, USSR)
  Jacob Palme (Stockholm University, Sweden)
  Andre Pirard (University of Liege, Belgium)
  Paul Placeway (Ohio State University, USA)
  Gisbert W. Selke (University of Bonn, West Germany)
  Fridrik Skulason (University of Iceland, Reykjavik)
  Johan van Wingen (Leiden, Netherlands)
  Konstantin Vinogradov (ICSTI, Moscow, USSR)
  Amanda Walker (InterCon Systems Corp, USA)

Thanks also to the following people for organizing meetings or conferences in their countries at which the issues of this proposal were discussed:

  Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
  Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)

and thanks also to those who attended these gatherings!

STATEMENT OF THE PROBLEM

Kermit has always been able to transfer text files between unlike systems (e.g. a UNIX system with ASCII stream text files and an IBM mainframe with EBCDIC record-oriented text files). To do the text file code conversion, Kermit transfers text in ASCII. But ASCII only includes enough letters and symbols for English.

There are now computers capable of representing the characters of other languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew, Arabic, and Greek characters, Japanese and Chinese ideograms. But different computer manufacturers use different codes for these characters.

For example, the IBM PS/2 and the Apple Macintosh have character sets that are "8-bit ASCII". When the character value is 32-127, the character is (normally) a standard ASCII graphic (printable) character. When the value is 128 or higher, it is a special character.
But the PC and the Macintosh assign different special characters to these values. Here are just a few examples:

  Value   PS/2 Character        Macintosh Character
   138    Small e grave         Small a umlaut
   143    Capital A ring        Small e grave
   144    Capital E acute       Small e circumflex
   136    Small e circumflex    Small a grave

When a file contains "8-bit ASCII", Kermit presently transfers it without any character translation. Therefore, a text file written in French, German, Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain the wrong characters when it arrives at its destination: the PS/2's e-grave becomes a-umlaut on the Macintosh, etc.

The problem is compounded when a file is composed of characters from more than one character set, for example a Japanese text file that contains Kanji, Katakana, and Roman characters.

There are many computer vendors in the world and nobody controls what codes they use to represent characters. Without a standard protocol for transferring non-ASCII text, each computer would have to know the codes of all the other computers in order for correct transfer of non-English text files to occur between unlike systems.

NORMAL KERMIT FILE TRANSFER SYNTAX

The Kermit file transfer protocol makes a distinction between text and binary files. Binary files are transmitted with no translation or conversion. For text files, Kermit defines a standard transfer syntax, namely ASCII characters with carriage return and linefeed (CRLF) after each line, so that text may be stored in useful fashion on any computer to which it is transferred.

Each Kermit program knows how to translate from the local text-file storage conventions to ASCII/CRLF syntax, and vice versa. This is the basic, required, and default mode of operation for any Kermit program, and it will be referred to as Kermit's "Normal" or "Level 0" syntax.

EXPANDED KERMIT TRANSFER SYNTAX

This proposal adds two additional levels of transfer syntax, Levels 1 and 2.

Level 1 permits the use of a single character set other than ASCII in the transfer syntax. These additional character sets are taken from recognized national or international standards, such as ISO 8859-1 (Latin Alphabet 1), JIS X 0208 (Japanese), etc. By using a standard character set (other than ASCII), it is possible to transfer a file containing more than one language. For example, Latin Alphabet 1 can represent a file containing a mixture of Italian, Norwegian, French, German, English, and Icelandic.

Level 2 allows a mixture of character sets to transfer mixed-language text that requires characters from more than one standard character set, for example a document written in Russian, French, and Greek.

The additional levels are optional features for Kermit programs, except that Level 2 should not be provided without Level 1.

The additional overhead incurred by a Kermit program running in text mode at any level can be avoided when transferring files between two computers that use the same codes and formats. Simply use the command SET FILE TYPE BINARY to disable all translations and reformatting.

The following discussion applies to text-file transfer only. When the Kermit user has selected binary file transfer, none of the text-file conversions discussed here apply.

EXPANDED SYNTAX, LEVEL 1

When all the characters in a text file can be represented by a single character set, then that character set can be used in place of ASCII in Kermit's text file transfer syntax. As with ASCII, there must be a mapping between the local file character set and the character set of the common transfer syntax. That is, there must be a pair of translation tables in the program, one from local to common, and one from common to local.

Since this mode of operation is not Kermit's normal behavior, it must be selected by the user. The new Kermit commands are:

  SET FILE CHARACTER-SET <name>
  SET TRANSFER-SYNTAX CHARACTER-SET <name>

The file character set is a system-dependent item.
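
The pair of translation tables described above can be illustrated with a minimal Python sketch. It builds 256-entry byte tables and routes one character from a PS/2 to a Macintosh through Latin Alphabet 1. The function name build_table, the "?" substitute for unrepresentable characters, and the use of Python's built-in cp437, latin_1, and mac_roman codecs as stand-ins for the tables a real Kermit program would carry are all assumptions of this sketch, not part of the proposal.

```python
def build_table(source_codec: str, target_codec: str,
                substitute: bytes = b"?") -> bytes:
    """Return a 256-entry table mapping each source byte to a target byte.

    Only single-byte character sets are handled. A character that the
    target set cannot represent maps to `substitute` (an assumption of
    this sketch; the proposal treats such cases separately).
    """
    table = bytearray(256)
    for value in range(256):
        try:
            char = bytes([value]).decode(source_codec)
            table[value] = char.encode(target_codec)[0]
        except UnicodeError:
            table[value] = substitute[0]
    return bytes(table)


# Sender (PS/2, file character set CP437): local -> common.
LOCAL_TO_COMMON = build_table("cp437", "latin_1")
# Receiver (Macintosh): common -> local.
COMMON_TO_LOCAL = build_table("latin_1", "mac_roman")

ps2_file = bytes([144])                        # Capital E acute on the PS/2
on_the_wire = ps2_file.translate(LOCAL_TO_COMMON)
mac_file = on_the_wire.translate(COMMON_TO_LOCAL)

untranslated = ps2_file.decode("mac_roman")    # Macintosh view, no translation
translated = mac_file.decode("mac_roman")      # Macintosh view, with translation
```

Without translation, value 144 is displayed on the Macintosh as small e circumflex; with both tables applied, the Macintosh file holds the Macintosh code for capital E acute.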
Some computers have only one character set, in which case the SET FILE CHARACTER-SET command would be unnecessary. But other computers allow the use of different character sets, often without any way to identify a file's encoding. For example, the IBM PC family running MS-DOS 3.3 or later supports a variety of "code pages" and allows users to switch among them, as described in Chapter 9, "Code Page Switching", of the IBM DOS 3.3 manual. Thus, on any given PC, a file may be encoded using Code Page 437 (USA), Code Page 850 (Multilingual), Code Page 860 (Portugal), Code Page 865 (Norway), etc. If you have set your Code Page to 437, you may display a file created using Code Page 865 on your screen, but the wrong characters are likely to appear.

Therefore, Kermit for the IBM PC family will require the SET FILE CHARACTER-SET command, with operands to denote the code page such as CP437, CP850, etc. The default character set would be the PC's original set, CP437. Those who use other sets can avoid keying in a SET FILE CHARACTER-SET command every time Kermit is started, by including this command in the program's initialization file.

Similar remarks apply to European computers that use the "national replacement characters" allowed by ISO Standard 646. This standard specifies a 7-bit character set equivalent to ASCII, but with national variants in which certain non-alphanumeric ASCII graphic characters are replaced by "national characters", as shown in Table 1.
_____________________________________________________________________________

Column/Row  ASCII          German        Finnish   Norwegian     French
  04/00     at-sign        section       at-sign   at-sign       a-grave
  05/11     left-bracket   A-umlaut      A-umlaut  AE-diphthong  degree
  05/12     backslash      O-umlaut      O-umlaut  O-slash       c-cedilla
  05/13     right-bracket  U-umlaut      A-circle  A-circle      section
  06/00     accent-grave   accent-grave  e-acute   accent-grave  accent-grave
  07/11     left-brace     a-umlaut      a-umlaut  ae-diphthong  e-acute
  07/12     vertical-bar   o-umlaut      o-umlaut  o-slash       u-grave
  07/13     right-brace    u-umlaut      a-circle  a-circle      e-grave
  07/14     tilde          ess-zet       u-umlaut  tilde         umlaut

Table 1: ISO 646 Usage in Selected Countries
_____________________________________________________________________________

(See Figure 1 for an explanation of column/row notation.)

For example, the German phrase "Grüß aus Köln" would be rendered in ASCII as "Gr}~ aus K|ln", and the ASCII C-language phrase "{~a[x]}" would become "äßaÄxÜü" in German ISO 646. The German user would want Kermit to interpret the local file characters as German in the former case, and as ASCII in the latter.

SPECIFYING THE TRANSFER SYNTAX

To select Level 1, the user must type the command

  SET TRANSFER-SYNTAX CHARACTER-SET <name>

where <name> is the name of a standard character set.

To minimize the work of the programmer, the consternation of the user, and the memory requirements for the Kermit program itself, the number of character sets which Kermit uses for Level 1 transfer syntax should be kept to a minimum. As a starting point, the sets shown in Table 2 are recommended. The criteria for including a character set in this table are:

  1. ASCII (= ISO-646 International Reference Version, IRV) is included.

  2. Except for ASCII, each set should be either (a) the "right half" of an 8-bit single-byte set whose "left half" is the same as ASCII/ISO-646-IRV, or (b) a multi-byte set.

  3. Each character in the set should be self-contained, and not formed as a composite of other characters.

  4.
     The set must be listed in the ISO International Register of Character Sets, so that it has a unique registration number and designating escape sequence. (But provisions are made for other registration authorities.)

  5. The set must be a national or international standard graphic character set, intended for use in computer text processing or programming (as opposed to Videotex, Teletex, OCR, device control, and other applications). This category may include line-drawing or technical character sets which fit the other criteria.

Note in particular that the national variants of ISO 646 are not included, since these are covered adequately by ASCII and the ISO Latin alphabets.

Standard "Kermit names" (for use with the SET TRANSFER-SYNTAX command) are given to these character sets so that they may be referred to uniformly in all Kermit implementations. These names are chosen to be mnemonic, so that users don't have to remember long numbers like "ISO-8859-3". The choice of single words like "CYRILLIC" implies that there will not be more than one transfer syntax for Cyrillic text. However, if these standards change in the future, it will be possible to append further identifying material to these names, e.g. "CYRILLIC-2", "CYRILLIC-3", etc.

_____________________________________________________________________________

US 7-bit ASCII, equivalent to the ISO 646 International Reference Version (IRV) character set. English, German without umlauts or ess-zet, etc.
  Kermit name: NORMAL.  ISO Registration Number: 2.

ISO 8859-1, Latin Alphabet 1, for Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.
  Kermit name: LATIN1.  ISO Registration Number: 100.

ISO 8859-2, Latin Alphabet 2. Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbocroatian, Slovak, and Slovene.
  Kermit name: LATIN2.  ISO Registration Number: 101.

ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto, French, Galician, German, Italian, Maltese, and Turkish.
  Kermit name: LATIN3.  ISO Registration Number: 109.

ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.  ISO Registration Number: 110.

ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, Macedonian, Russian, Serbocroatian, and Ukrainian (compatible with USSR GOST Standard 19768-1987 and ECMA-113).
  Kermit name: CYRILLIC.  ISO Registration Number: 144.

ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.  ISO Registration Number: 127.

ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.  ISO Registration Number: 126.

ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.  ISO Registration Number: 138.

ISO DIS 8859-9, Latin Alphabet 5, in which six Icelandic letters from Latin Alphabet 1 were replaced by six other letters needed for Turkish.
  Kermit name: LATIN5.  ISO Registration Number: 148.

CSN 36 91 03, Czechoslovak Standard alphabet.
  Kermit name: CZECH.  ISO Registration Number: 139.

JIS X 0201, a 1-byte code including ASCII and Japanese Katakana.
  Kermit name: KATAKANA.  ISO Registration Number: 13 (Kana), 14 (Roman).

JIS X 0208, a 2-byte code containing Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian characters, plus special symbols, etc.
  Kermit name: KANJI.  ISO Registration Number: 87.

Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.  ISO Registration Number: 58.

KS C 5601 (1987), a 2-byte code for Korean.
  Kermit name: KOREAN.  ISO Registration Number: 149.

Table 2: Standard 8-Bit Character Sets
_____________________________________________________________________________

The ISO Latin alphabets and the Czech character set are 8-bit character sets whose left half is identical with ASCII, and whose right half contains the special characters.
The ISO registration number refers only to the right half of each of these character sets. But each of these sets must be used in its entirety, because the unaccented Roman letters, the digits, and the punctuation marks appear only in the ASCII left half. Therefore, this proposal considers an 8-bit character set composed of ASCII plus one of the right-half sets to be a SINGLE character set. The Kermit character-set name refers to the two halves combined as a single set. See Figure 2 for the layout of an 8-bit character set.

A particular Kermit program need not incorporate all of these character sets. In many cases, a single 8-bit character set such as LATIN1 will suffice. For example, in the USSR there are at least five computer codes in use for Cyrillic characters. But all of them can be mapped to ISO Latin/Cyrillic, which also includes ASCII. So in all likelihood, a Soviet version of Kermit need only use CYRILLIC in its Level-1 transfer syntax, allowing it to transfer Russian and English language text files among computers using different codes.

When a language is representable in more than one character set from this table, as are English, German, Finnish, Czech, Turkish, etc., the character set highest on the list which adequately represents the language should be preferred. For example, NORMAL for English, LATIN1 for French, LATIN1 for German (because it represents German better than ASCII), LATIN5 for Turkish (because it represents Turkish better than LATIN3), etc. This is to maximize the chance that any two particular Kermit programs will recognize the same character sets.

Unfortunately, but unavoidably, the burden of choosing the best transfer syntax character set must be placed upon the user. If a file containing a mixture of Finnish, English, and Danish must be transferred, the user must find a character set that can adequately represent all three languages, in this case Latin Alphabet 4.
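
The selection rule just described (prefer the set highest in Table 2 that adequately represents every language in the file) can be sketched in a few lines of Python. The names TABLE2_ORDER, COVERAGE, and preferred_set are hypothetical; the coverage data is a small subset transcribed from Table 3, with CYRILLIC added for English on the grounds that ISO Latin/Cyrillic includes ASCII. Turkish is left out because its preference for LATIN5 over LATIN3 rests on quality of representation rather than list position.

```python
# Kermit names in the order the sets appear in Table 2.
TABLE2_ORDER = [
    "NORMAL", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "CYRILLIC", "ARABIC",
    "GREEK", "HEBREW", "LATIN5", "CZECH", "KATAKANA", "KANJI", "CHINESE",
    "KOREAN",
]

# Subset of Table 3: which sets adequately represent each language.
COVERAGE = {
    "English": {"NORMAL", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5",
                "CYRILLIC"},
    "French": {"LATIN1", "LATIN3", "LATIN5"},
    "German": {"LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"},
    "Finnish": {"LATIN1", "LATIN4"},
    "Danish": {"LATIN4"},
    "Russian": {"CYRILLIC"},
}


def preferred_set(languages):
    """Return the first set in Table 2 order that covers every language.

    Returns None when no single set is adequate, in which case a Level 1
    transfer syntax cannot represent the file.
    """
    candidates = set(TABLE2_ORDER)
    for language in languages:
        candidates &= COVERAGE[language]
    for name in TABLE2_ORDER:
        if name in candidates:
            return name
    return None


finnish_english_danish = preferred_set(["Finnish", "English", "Danish"])
russian_french = preferred_set(["Russian", "French"])
```

With this data, the Finnish, English, and Danish mixture yields LATIN4, matching the example in the text, while Russian with French yields None because no single set in the table covers both.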
A table like Table 3 should be provided in the user documentation to help the user make this selection.

_____________________________________________________________________________

Arabic      ARABIC                      Italian        LATIN1,3
Bulgarian   CYRILLIC                    Kanji          KANJI
Chinese     CHINESE                     Katakana       KATAKANA, KANJI
Czech       CZECH, LATIN2               Korean         KOREAN
Danish      LATIN4                      Latvian        LATIN4
Dutch       LATIN1,2,3,4                Lithuanian     LATIN4
English     NORMAL,LATIN1,2,3,4,5,etc   Norwegian      LATIN1,4
Esperanto   LATIN3                      Polish         LATIN2
Estonian    LATIN4                      Portuguese     LATIN1
Finnish     LATIN1,4                    Romanian       LATIN2
Flemish     LATIN1,2,3,4,5              Russian        CYRILLIC
French      LATIN1,3,5                  Serbocroatian  LATIN2
German      LATIN1,2,3,4,5              Slovak         LATIN2
Greek       GREEK                       Spanish        LATIN1
Hebrew      HEBREW                      Swedish        LATIN1,4
Hungarian   LATIN2                      Turkish        LATIN5,3
Icelandic   LATIN1                      Ukrainian      CYRILLIC

Table 3: Preferred Transfer Syntax Character Sets
_____________________________________________________________________________

IMPLEMENTATION OF LEVEL 1

The Level-1 Kermit extension can be added to existing Kermit programs with a minimum of effort. The following steps are required for each Kermit program:

  1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't have it already. SET FILE TYPE TEXT enables text-file character set conversion at all levels. SET FILE TYPE BINARY disables conversions of all kinds.

  2. Add the SET FILE CHARACTER-SET <name> command. The set of <name>s should include ASCII (used for program source, etc) plus the names of any "national" character sets that are used on this particular computer.

  3. Add the SET TRANSFER-SYNTAX