imapext-2007: docs/rfc/rfc5051.txt annotate

author	yuuji@gentei.org
date	Mon, 14 Sep 2009 15:17:45 +0900
Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007


         i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

Abstract

   This document describes "i;unicode-casemap", a simple
   case-insensitive collation for Unicode strings. It provides equality,
   substring, and ordering operations.

1. Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics. It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   However, the "i;ascii-casemap" collation does not produce
   satisfactory results with non-ASCII characters. It is possible, with
   a modest extension, to provide a more sophisticated collation with
   greater multilingual applicability than "i;ascii-casemap". This
   extension provides case-independent comparisons for a much greater
   number of characters. It also collates characters with diacriticals
   with the non-diacritical character forms.

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap". It does not replace the
   "i;basic" collation described in [BASIC].

2. Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which is
   case-insensitive in its treatment of characters. It provides
   equality, substring, and ordering operations. The validity test
   operation returns "valid" for any input.

Crispin                     Standards Track                     [Page 1]

RFC 5051                    i;unicode-casemap               October 2007

   This collation allows strings in arbitrary (and mixed) character
   sets, as long as the character set for each string is identified and
   it is possible to convert the string to Unicode. Strings which have
   an unidentified character set and/or cannot be converted to Unicode
   are not rejected, but are treated as binary.

   Each input string is prepared by converting it to a "titlecased
   canonicalized UTF-8" string according to the following steps, using
   UnicodeData.txt ([UNICODE-DATA]):

      (1) A Unicode codepoint is obtained from the input string.

          (a) If the input string is in a known charset that can be
              converted to Unicode, a sequence in the string's charset
              is read and checked for validity according to the rules of
              that charset. If the sequence is valid, it is converted
              to a Unicode codepoint. Note that for input strings in
              UTF-8, the UTF-8 sequence must be valid according to the
              rules of [UTF-8]; e.g., overlong UTF-8 sequences are
              invalid.

          (b) If the input string is in an unknown charset, or an
              invalid sequence occurs in step (1)(a), conversion ceases.
              No further preparation is performed, and any partial
              preparation results are discarded. The original string is
              used unchanged with the i;octet comparator.

      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
          are performed on the resulting codepoint from step (1)(a).

          (a) If the codepoint has a titlecase property in
              UnicodeData.txt (this is normally the same as the
              uppercase property), the codepoint is converted to the
              codepoints in the titlecase property.

          (b) If the resulting codepoint from (2)(a) has a decomposition
              property of any type in UnicodeData.txt, the codepoint is
              converted to the codepoints in the decomposition property.
              This step is recursively applied to each of the resulting
              codepoints until no more decomposition is possible
              (effectively Normalization Form KD).

          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
          (COMBINING CARON). None of U+0044, U+007A, or U+030C has a
          decomposition property. Therefore, U+01C4 is converted
          to U+0044 U+007A U+030C by this step.

      (3) The resulting codepoint(s) from step (2) is/are appended, in
          UTF-8 format, to the "titlecased canonicalized UTF-8" string.

      (4) Repeat from step (1) until there is no more data in the input
          string.

   Following the above preparation process on each string, the equality,
   ordering, and substring operations are as for i;octet.
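
   The preparation steps above can be sketched in Python roughly as
   follows. This is a minimal sketch under stated assumptions, not a
   reference implementation. The names prepare, _titlecase, _decompose,
   and casemap_equal are hypothetical. Python's unicodedata module does
   not expose the simple titlecase field of UnicodeData.txt, so step
   (2)(a) is approximated with str.title() on a single character and
   the result is accepted only when it is a single codepoint. The
   Unicode revision is whatever the Python build ships with.

```python
import unicodedata


def _titlecase(ch):
    """Step (2)(a), approximated (assumption): use str.title() on one
    character, and keep the original when the full case mapping expands
    to several codepoints (UnicodeData.txt has no such simple mapping)."""
    titled = ch.title()
    return titled if len(titled) == 1 else ch


def _decompose(ch):
    """Step (2)(b): recursively expand the decomposition property."""
    fields = unicodedata.decomposition(ch).split()
    # Drop a compatibility tag such as "<compat>"; the rest are hex codepoints.
    codepoints = [f for f in fields if not f.startswith("<")]
    if not codepoints:
        return ch  # no decomposition property: keep the codepoint
    return "".join(_decompose(chr(int(cp, 16))) for cp in codepoints)


def prepare(data, charset="utf-8"):
    """Return the "titlecased canonicalized UTF-8" form of data (bytes)."""
    try:
        # Step (1)(a): strict decoding rejects invalid sequences,
        # including overlong UTF-8.
        text = data.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # Step (1)(b): unknown charset or invalid input; the original
        # octets are used unchanged.
        return data
    # Steps (2) through (4): map every codepoint, then emit UTF-8.
    return "".join(_decompose(_titlecase(ch)) for ch in text).encode("utf-8")


def casemap_equal(a, b, charset="utf-8"):
    """Equality operation: i;octet comparison of the prepared strings."""
    return prepare(a, charset) == prepare(b, charset)
```

   With this sketch, prepare("\u01c4".encode("utf-8")) yields the UTF-8
   encoding of U+0044 U+007A U+030C, matching the worked example in
   step (2). Ordering can use the prepared byte strings directly as sort
   keys, since Python compares bytes objects octet by octet.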

   It is permitted to use an alternative implementation of the above
   preparation process if it produces the same results. For example, it
   may be more convenient for an implementation to convert all input
   strings to a sequence of UTF-16 or UTF-32 values prior to performing
   any of the step (2) actions. Similarly, if all input strings are (or
   are convertible to) Unicode, it may be possible to use UTF-32 as an
   alternative to UTF-8 in step (3).

   Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
   because UTF-16 surrogates will cause i;octet to collate codepoints
   U+E000 through U+FFFF after non-BMP codepoints.
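
   The ordering problem in the note can be checked with a few lines of
   Python. This is an illustrative sketch only; octet_order_matches is a
   hypothetical helper, and big-endian encodings are chosen so that no
   byte order mark is emitted. It compares a BMP codepoint above the
   surrogate range with the first non-BMP codepoint.

```python
bmp_high = "\uffff"      # BMP codepoint above the surrogate range
non_bmp = "\U00010000"   # first codepoint outside the BMP


def octet_order_matches(encoding):
    """True if octet order under `encoding` agrees with codepoint order."""
    by_octets = bmp_high.encode(encoding) < non_bmp.encode(encoding)
    by_codepoint = bmp_high < non_bmp  # Python compares str by codepoint
    return by_octets == by_codepoint


for enc in ("utf-8", "utf-32-be", "utf-16-be"):
    print(enc, octet_order_matches(enc))
```

   UTF-8 and UTF-32 preserve codepoint order under octet comparison.
   UTF-16 does not, because U+10000 is encoded with a lead surrogate
   (0xD800), which compares lower than 0xFFFF.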

   This collation is not locale-sensitive. Consequently, care should be
   taken when using OS-supplied functions to implement this collation.
   Functions such as strcasecmp and toupper are sometimes
   locale-sensitive and may inconsistently casemap letters.

   The i;unicode-casemap collation is well suited to use with many
   Internet protocols and computer languages. Use with natural language
   is often inappropriate; even though the collation apparently supports
   languages such as Swahili and English, in real-world use it tends to
   mis-sort a number of types of string:

   o people and place names containing scripts that are not collated
     according to "alphabetical order".
   o words with characters that have diacriticals. However,
     i;unicode-casemap generally does a better job than i;ascii-casemap
     for most (but not all) languages. For example, German umlaut
     letters will sort correctly, but some Scandinavian letters will
     not.
   o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
     in English).
   o strings containing other non-letter symbols; e.g., euro and pound
     sterling symbols, quotation marks other than '"', dashes/hyphens,
     etc.
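
   The diacritical case in the list above can be illustrated with a
   short Python sketch. The key function approx_key is a hypothetical
   approximation of the preparation process: it titlecases each
   character with str.title() and then applies the library NFKD
   routine, which is an assumption made for brevity rather than the
   exact procedure of step (2).

```python
import unicodedata


def approx_key(s):
    """Approximate sort key (assumption): per-character titlecase, then NFKD."""
    titled = "".join(c.title() if len(c.title()) == 1 else c for c in s)
    return unicodedata.normalize("NFKD", titled).encode("utf-8")


# U+00C4 is A WITH DIAERESIS, U+00C5 is A WITH RING ABOVE,
# U+00F6 is O WITH DIAERESIS.
words = ["Zorn", "\u00c4rger", "Apfel", "\u00c5ngstr\u00f6m", "Bach"]
ordered = sorted(words, key=approx_key)
print(ordered)
```

   The word beginning with U+00C4 lands among the other "A" words, as a
   German reader would expect. The word beginning with U+00C5 also
   lands among the "A" words, whereas Swedish alphabetical order places
   that letter after "Z".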

3. Unicode Casemap Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="5051" scope="global" intendedUse="common">
     <identifier>i;unicode-casemap</identifier>
     <title>Unicode Casemap</title>
     <operations>equality order substring</operations>
     <specification>RFC 5051</specification>
     <owner>IETF</owner>
     <submitter>mrc@cac.washington.edu</submitter>
   </collation>

4. Security Considerations

   The security considerations for [UTF-8], [STRINGPREP], and
   [UNICODE-SECURITY] apply and are normative to this specification.

   The results from this comparator will vary depending upon the
   implementation for several reasons. Implementations MUST consider
   whether these possibilities are a problem for their use case:

   1) New characters added in Unicode may have decomposition or
      titlecase properties that will not be known to an implementation
      based upon an older revision of Unicode. This impacts step (2).

   2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
      does not require normalization of out-of-order diacriticals.
      However, an implementation MAY use an NFKD library routine that
      does such normalization. This impacts step (2)(b) and possibly
      also step (1)(a), and is an issue only with ill-formed UTF-8
      input.

   3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
      (and, by extension, US-ASCII) are the only mandatory-to-implement
      charsets. This impacts step (1)(a).

      Implementations SHOULD, as far as feasible, support all the
      charsets they are likely to encounter in the input data, in order
      to avoid poor collation caused by the fall through to the (1)(b)
      rule.

   4) Other charsets may have revisions which add new characters that
      are not known to an implementation based upon an older revision.
      This impacts step (1)(a) and possibly also step (1)(b).
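
   The difference described in item 2 between the recursive
   decomposition of step (2)(b) and a library NFKD routine can be seen
   on a string whose combining marks are not in canonical order. The
   following Python sketch is illustrative only; recursive_decompose is
   a hypothetical helper that expands decomposition properties without
   reordering, and the result depends on the Unicode data shipped with
   the Python build.

```python
import unicodedata


def recursive_decompose(text):
    """Expand decomposition properties recursively, without reordering marks."""
    out = []
    for ch in text:
        fields = [f for f in unicodedata.decomposition(ch).split()
                  if not f.startswith("<")]
        if fields:
            expanded = "".join(chr(int(f, 16)) for f in fields)
            out.append(recursive_decompose(expanded))
        else:
            out.append(ch)
    return "".join(out)


# "a", COMBINING ACUTE ACCENT (class 230), COMBINING DOT BELOW (class 220):
# the two marks are not in canonical order.
out_of_order = "a\u0301\u0323"
plain = recursive_decompose(out_of_order)
library = unicodedata.normalize("NFKD", out_of_order)
print(plain == library)
```

   The recursive expansion leaves the marks as they were entered, while
   the library routine moves the dot below in front of the acute
   accent, so the two prepared strings differ under i;octet. For input
   whose marks are already in canonical order, such as U+1E69, both
   approaches give the same result.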

   An attacker may create input that is ill-formed or in an unknown
   charset, with the intention of impacting the results of this
   comparator or exploiting other parts of the system which process this
   input in different ways. Note, however, that even well-formed data
   in a known charset can impact the result of this comparator in
   unexpected ways. For example, an attacker can substitute U+0041
   (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
   U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
   non-match of strings which visually appear the same and/or causing
   the string to appear elsewhere in a sort.
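
   The substitution described above can be checked with a short Python
   sketch. It is illustrative only and approximates the preparation
   process with str.title() followed by the library NFKD routine (an
   assumption made for brevity); the variable names are hypothetical.

```python
import unicodedata

lookalikes = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "GREEK CAPITAL LETTER ALPHA": "\u0391",
    "CYRILLIC CAPITAL LETTER A": "\u0410",
}

# Approximate prepared form of each character, as UTF-8 octets.
prepared = {
    name: unicodedata.normalize("NFKD", ch.title()).encode("utf-8")
    for name, ch in lookalikes.items()
}

for name, octets in sorted(prepared.items(), key=lambda item: item[1]):
    print(octets.hex(), name)
```

   None of the three characters has a case or decomposition mapping to
   another, so each keeps its own UTF-8 encoding. The three compare
   unequal and sort into the Latin, Greek, and Cyrillic ranges
   respectively.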

5. IANA Considerations

   The i;unicode-casemap collation defined in section 2 has been added
   to the registry of collations defined in [COMPARATOR].

6. Normative References

   [COMPARATOR]       Newman, C., Duerst, M., and A. Gulbrandsen,
                      "Internet Application Protocol Collation
                      Registry", RFC 4790, February 2007.

   [STRINGPREP]       Hoffman, P. and M. Blanchet, "Preparation of
                      Internationalized Strings ("stringprep")", RFC
                      3454, December 2002.

   [UTF-8]            Yergeau, F., "UTF-8, a transformation format of
                      ISO 10646", STD 63, RFC 3629, November 2003.

   [UNICODE-DATA]     <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>

                      Although the UnicodeData.txt file referenced
                      here is part of the Unicode standard, it is
                      subject to change as new characters are added
                      to Unicode and errors are corrected in Unicode
                      revisions. As a result, it may be less stable
                      than might otherwise be implied by the
                      standards status of this specification.

   [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
                      Considerations", February 2006,
                      <http://www.unicode.org/reports/tr36/>.

7. Informative References

   [BASIC]            Newman, C., Duerst, M., and A. Gulbrandsen,
                      "i;basic - the Unicode Collation Algorithm",
                      Work in Progress, March 2007.

   [IMAP-SORT]        Crispin, M. and K. Murchison, "Internet Message
                      Access Protocol - SORT and THREAD Extensions",
                      Work in Progress, September 2007.

Author's Address

   Mark R. Crispin
   Networks and Distributed Computing
   University of Washington
   4545 15th Avenue NE
   Seattle, WA 98105-4527

   Phone: +1 (206) 543-5762
   EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.