Network Working Group                                        K. Simonsen
Request for Comments: 1345                   Rationel Almen Planlaegning
                                                               June 1992


                  Character Mnemonics & Character Sets

Status of the Memo

   This memo provides information for the Internet community. It does
   not specify an Internet standard. Distribution of this memo is
   unlimited.

Summary

   This memo lists a selection of characters and their presence in some
   coded character sets. To facilitate the coded character set
   tabulations, an unambiguous mnemonic for each character is used, and
   a format for tabulating the coded character sets is defined. The
   coded character sets are given names for easy reference. A family of
   coded character sets called the mnemonic character sets, and
   conversion between these coded character sets without information
   loss, are defined.

   The character set names are registered with the Internet Assigned
   Numbers Authority (IANA). Additional character sets not described in
   this memo should be registered with the IANA.

   This memo may be updated periodically, or additional specifications
   may be published, to reflect other coded character sets. Please send
   any comments, including comments about the accuracy of the tables, to
   the author, Keld Simonsen.

1.  INTRODUCTION

   With the growing internationalization of the Internet, support for
   many coded character sets is required. It is the intention of this
   memo to document precisely the mapping between all characters and
   their corresponding coded representations in various coded character
   sets, and give names to these coded character sets, so they can be
   referenced unambiguously in Internet standards. This memo does not
   indicate anything about the validity of using these specifications in
   any Internet standard, so you should consult each individual Internet
   standard to see which coded character sets and names are allowed
   there.

   Unambiguous character mnemonics are specified, which provide a
   practical way of identifying a character, without reference to a
   coded character set and its code in this coded character set. The
   mnemonics are written in a minimal set of characters, namely the
   invariant 83 graphical characters of ISO 646, which is a kind of
   greatest common subset to be found between the majority of coded
   character sets, including ASCII, national variants of the ISO 646
   7-bit character set and various EBCDICs. In addition, the coded
   representation of each of these characters has the same numeric value
   in all coded character sets compatible with ISO standards. All of
   them except two, EXCLAMATION MARK and QUOTATION MARK, have the same
   coded representation in all variants of EBCDIC. This minimal set of
   characters is called the reference character set in this memo. The
   mnemonics can be used in Internet standards for easy and unambiguous
   reference, and they can also serve as a fallback representation in
   various Internet specifications.

   The coded character sets covered include all parts of ISO 8859, ISO
   6937-2 and all ISO 646 conforming coded character sets in the ISO
   character set registry managed by ECMA according to ISO 2375. Almost
   all graphic coded character sets in the ECMA registry (1) are
   covered. The graphic coded character sets not included are registry
   numbers 31, 38, 39, 53, 59, 68, 71, 72, 129 and 137. In addition many
   vendor defined character sets are covered, including PC codepages
   (4), (7), (8), many EBCDIC character sets (4), (5), (6) and HP, DEC
   and Apple character sets (8), (9), (10), (13), (14). The East-Asian
   16-bit character sets from the ECMA registry are also included in
   this memo.

2.  CHARACTER MNEMONICS

2.1  General Syntax

   The character mnemonics are taken from the ISO committee draft (CD)
   of the POSIX.2 standard (3). They are classified into two groups:

   1.  A group with two-character mnemonics - primarily intended for
       alphabetic scripts like Latin, Greek, Cyrillic, Hebrew and
       Arabic, and special characters.

   2.  A group with variable-length mnemonics - primarily intended for
       non-alphabetic scripts like Japanese and Chinese, but also used
       for some accented letters and special characters.

   In the two-character mnemonics, all invariant graphic characters in
   the ISO 646 character codes except "&" are used, i.e. the following
   characters:

      ! " % ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
      A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
      a b c d e f g h i j k l m n o p q r s t u v w x y z

   The character "_" is not used as the first character.

   In the variable-length mnemonics, the character "_" is not used as
   the first character. If it is used within a name, it is doubled.

   The mnemonics can be used in several different ways for different
   purposes. One of these is the description of coded character sets,
   which is detailed in section 3. Another is extending a given coded
   character set to a mnemonic character set. This is described in
   section 4. The restrictions on the use of the characters "&" and "_"
   are due to demands of the compositional methods of these techniques.

2.2  ISO Official Long Descriptive Character Name

   For each mnemonic, the character for which it stands is indicated in
   the following table by a long descriptive name. This name is
   identical to the ISO name of the character as given in reference (2).
   For a few characters that are not included there, descriptive names
   of the same kind are introduced in this memo. The source of each
   character is stated in the table after the name and should be
   consulted for a reliable identification of the character.

   These long descriptive names consist only of the capital Latin
   letters of the invariant part of ISO 646, the digits, "-", and SPACE.
   Digits are only used in names of ideographic and Hangul characters
   and never as the first character.

2.3  The 2-character Mnemonics

   The two-character mnemonics include various accented Latin letters,
   Greek, Cyrillic, Hebrew, Arabic, Hiragana and Katakana. Also a fair
   number of special characters are included. Almost all ISO or ISO
   registered 7- and 8-bit graphical coded character sets are covered
   with these two-character mnemonics.

   The two characters are chosen so the graphical appearance in the
   reference set resembles as much as possible (within the possibilities
   available) the graphical appearance of the character. The basic
   character set of ISO 646 is used as the reference set, as mentioned
   above.

   The characters in the reference character set are chosen to represent
   themselves.

   For control characters from ISO 646 the two-character acronyms of ISO
   2047 are used as mnemonics. For the other control characters of ISO
   6429, two-character mnemonics have been selected based on the
   variable-length acronyms used in that standard.

   Letters, including Greek, Cyrillic, Arabic and Hebrew, are
   represented with the base letter as the first letter, and the second
   letter represents an accent or relation to a non-Latin script.
   Non-Latin letters are transliterated to Latin letters, following
   transliteration standards as closely as possible. This is also done
   with Latin letters such as ETH and THORN, and the
   Danish/Norwegian/Swedish letter A WITH RING ABOVE is transliterated
   into "aa".

   After a letter, the second character signifies the following:

      Exclamation mark           ! Grave
      Apostrophe                 ' Acute accent
      Greater-Than sign          > Circumflex accent
      Question Mark              ? tilde
      Hyphen-Minus               - Macron
      Left parenthesis           ( Breve
      Full Stop                  . Dot Above
      Colon                      : Diaeresis
      Comma                      , Cedilla
      Underline                  _ Underline
      Solidus                    / Stroke
      Quotation mark             " Double acute accent
      Semicolon                  ; Ogonek
      Less-Than sign             < Caron
      Zero                       0 Ring above
      Two                        2 Hook
      Nine                       9 Horn

      Equals                     = Cyrillic
      Asterisk                   * Greek
      Percent sign               % Greek/Cyrillic special
      Plus                       + smalls: Arabic, capitals: Hebrew
      Three                      3 some Latin/Greek/Cyrillic letters
      Four                       4 Bopomofo
      Five                       5 Hiragana
      Six                        6 Katakana

   In designing the mnemonics the following special characters were
   reserved:

   The ampersand is reserved as an intro character, indicating that the
   following string is in the mnemonic character set.

   The underline character is reserved for the variable-length
   mnemonics. This use does not eliminate usage as an accent or language
   identifier.

   Special characters are encoded with some mnemonic value. These are
   not systematic throughout, but most mnemonics start with a related
   special character of the reference set.

2.4  The Variable-length Character Mnemonics

   The variable-length character mnemonics are primarily meant for the
   ideographic characters in larger Asian character sets, but are also
   used for accented characters with several accents and some special
   characters.

   To keep the mnemonics as short as possible, which both saves storage
   and eases input, short names are preferred. The Chinese standard GB
   2312-1980, the Japanese standards JIS X0208 and JIS X0212, and the
   Korean standard KS C 5601 all identify characters by row and column
   numbers between 1 and 94. So two positions each for the row and the
   column, plus a character set identifier of one character, would be
   almost as short as possible.
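
   The row/column naming scheme just described can be illustrated with a
   minimal sketch. The function names short_name and parse_short_name
   are hypothetical and not part of this memo, and the sketch assumes
   that row and column are each written as exactly two decimal digits
   after a one-character identifier supplied by the caller.

```python
from typing import Tuple


def short_name(charset_id: str, row: int, column: int) -> str:
    """Build a short name from an identifier and a row/column position.

    Assumption (not stated normatively in the memo): row and column are
    each zero-padded to two decimal digits.
    """
    if len(charset_id) != 1:
        raise ValueError("character set identifier must be one character")
    for label, value in (("row", row), ("column", column)):
        # Rows and columns are numbered 1 through 94 in these standards.
        if not 1 <= value <= 94:
            raise ValueError(f"{label} must be between 1 and 94, got {value}")
    return f"{charset_id}{row:02d}{column:02d}"


def parse_short_name(name: str) -> Tuple[str, int, int]:
    """Split a five-character short name into (identifier, row, column)."""
    digits = name[1:]
    if len(name) != 5 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a short name: {name!r}")
    row, column = int(digits[:2]), int(digits[2:])
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError(f"row/column out of range in {name!r}")
    return name[0], row, column
```

   For instance, short_name("j", 32, 10) returns "j3210", and
   parse_short_name("j3210") recovers the identifier, row 32 and column
   10. The identifier is simply whatever single character the caller
   supplies; the sketch does not check it against any registered list.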
   The following character set identifiers are defined:

      c  GB 2312-1980
      j  JIS X0208-1990
      J  JIS X0212-1990
      k  KS C 5601-1987

   This system for the representation of ideographic characters and
   Hangul characters is not truly mnemonic, but it provides short
   representations that are easy to connect to the corresponding
   character by means of the code table of an official character set
   standard. Alternative methods based on the graphic appearance or the
   pronunciation of the characters are thought to be infeasible.

   One prominent character in the reference character set is reserved
   for identifying variable-length mnemonics, namely the underline
   character "_". This character is intended as a delimiter both at the
   front and at the end of the mnemonic. An example of its use would be:
   (&=intro):

      &_j3210_ &_j4436_&_j6530_

3.  CHARACTER MNEMONIC TABLE

   The following table contains the character mnemonic and the encoding
   and long descriptive name of ISO 2DIS 10646 (2). Although ISO 10646
   is only at the DIS stage at the time of writing and is still under
   considerable debate, the long descriptive naming in the DIS is
   considered to be stable and the best official ISO reference to
   character names. The 2-octet encoded value of the ISO 2DIS 10646 is
   also used, but only as an identification of the character, and it
   should only be used for identification purposes as the coded
   representation may be changed in the final 10646 international
   standard.

   Some characters not in the ISO 2DIS 10646 are allocated values in the
   private use zone and given names and references to a character set
   where they are used.

   The format of the table is:

   1st field is the character mnemonic (mostly 2 characters).
   2nd field is the ISO 2DIS 10646 code in hexadecimal.
   3rd field is the long descriptive name of ISO 2DIS 10646.

 SP     0020    SPACE
 !      0021    EXCLAMATION MARK
 "      0022    QUOTATION MARK
 Nb     0023    NUMBER SIGN
 DO     0024    DOLLAR SIGN
 %      0025    PERCENT SIGN
 &      0026    AMPERSAND
 '      0027    APOSTROPHE
 (      0028    LEFT PARENTHESIS
 )      0029    RIGHT PARENTHESIS
 *      002a    ASTERISK
 +      002b    PLUS SIGN
 ,      002c    COMMA
 -      002d    HYPHEN-MINUS
 .      002e    FULL STOP
 /      002f    SOLIDUS
 0      0030    DIGIT ZERO
 1      0031    DIGIT ONE
 2      0032    DIGIT TWO
 3      0033    DIGIT THREE
 4      0034    DIGIT FOUR
 5      0035    DIGIT FIVE
 6      0036    DIGIT SIX
 7      0037    DIGIT SEVEN
 8      0038    DIGIT EIGHT
 9      0039    DIGIT NINE
 :      003a    COLON
 ;      003b    SEMICOLON
 <      003c    LESS-THAN SIGN
 =      003d    EQUALS SIGN
 >      003e    GREATER-THAN SIGN
 ?      003f    QUESTION MARK
 At     0040    COMMERCIAL AT
 A      0041    LATIN CAPITAL LETTER A
 B      0042    LATIN CAPITAL LETTER B
 C      0043    LATIN CAPITAL LETTER C
 D      0044    LATIN CAPITAL LETTER D
 E      0045    LATIN CAPITAL LETTER E
 F      0046    LATIN CAPITAL LETTER F
 G      0047    LATIN CAPITAL LETTER G
 H      0048    LATIN CAPITAL LETTER H
 I      0049    LATIN CAPITAL LETTER I
 J      004a    LATIN CAPITAL LETTER J
 K      004b    LATIN CAPITAL LETTER K
 L      004c    LATIN CAPITAL LETTER L
 M      004d    LATIN CAPITAL LETTER M
 N      004e    LATIN CAPITAL LETTER N
 O      004f    LATIN CAPITAL LETTER O
 P      0050    LATIN CAPITAL LETTER P
 Q      0051    LATIN CAPITAL LETTER Q
 R      0052    LATIN CAPITAL LETTER R
 S      0053    LATIN CAPITAL LETTER S
 T      0054    LATIN CAPITAL LETTER T
 U      0055    LATIN CAPITAL LETTER U
 V      0056    LATIN CAPITAL LETTER V
 W      0057    LATIN CAPITAL LETTER W
 X      0058    LATIN CAPITAL LETTER X
 Y      0059    LATIN CAPITAL LETTER Y
 Z      005a    LATIN CAPITAL LETTER Z
 <(     005b    LEFT SQUARE BRACKET
 //     005c    REVERSE SOLIDUS
 )>     005d    RIGHT SQUARE BRACKET
 '>     005e    CIRCUMFLEX ACCENT
 _      005f    LOW LINE
 '!     0060    GRAVE ACCENT
 a      0061    LATIN SMALL LETTER A
 b      0062    LATIN SMALL LETTER B
 c      0063    LATIN SMALL LETTER C
 d      0064    LATIN SMALL LETTER D
 e      0065    LATIN SMALL LETTER E
 f      0066    LATIN SMALL LETTER F
 g      0067    LATIN SMALL LETTER G
 h      0068    LATIN SMALL LETTER H
 i      0069    LATIN SMALL LETTER I
 j      006a    LATIN SMALL LETTER J
 k      006b    LATIN SMALL LETTER K
 l      006c    LATIN SMALL LETTER L
 m      006d    LATIN SMALL LETTER M
 n      006e    LATIN SMALL LETTER N
 o      006f    LATIN SMALL LETTER O
 p      0070    LATIN SMALL LETTER P
 q      0071    LATIN SMALL LETTER Q
 r      0072    LATIN SMALL LETTER R
 s      0073    LATIN SMALL LETTER S
 t      0074    LATIN SMALL LETTER T
 u      0075    LATIN SMALL LETTER U
 v      0076    LATIN SMALL LETTER V
 w      0077    LATIN SMALL LETTER W
 x      0078    LATIN SMALL LETTER X
 y      0079    LATIN SMALL LETTER Y
 z      007a    LATIN SMALL LETTER Z
 (!     007b    LEFT CURLY BRACKET
 !!     007c    VERTICAL LINE
 !)     007d    RIGHT CURLY BRACKET
 '?     007e    TILDE
 NS     00a0    NO-BREAK SPACE
 !I     00a1    INVERTED EXCLAMATION MARK
 Ct     00a2    CENT SIGN
 Pd     00a3    POUN