imapext-2007

annotate docs/rfc/rfc5051.txt @ 0:ada5e610ab86

imap-2007e
author yuuji@gentei.org
date Mon, 14 Sep 2009 15:17:45 +0900
rev   line source
yuuji@0 1
yuuji@0 2
yuuji@0 3
yuuji@0 4
yuuji@0 5
yuuji@0 6
yuuji@0 7 Network Working Group M. Crispin
yuuji@0 8 Request for Comments: 5051 University of Washington
yuuji@0 9 Category: Standards Track October 2007
yuuji@0 10
yuuji@0 11
yuuji@0 12 i;unicode-casemap - Simple Unicode Collation Algorithm
yuuji@0 13
yuuji@0 14 Status of This Memo
yuuji@0 15
yuuji@0 16 This document specifies an Internet standards track protocol for the
yuuji@0 17 Internet community, and requests discussion and suggestions for
yuuji@0 18 improvements. Please refer to the current edition of the "Internet
yuuji@0 19 Official Protocol Standards" (STD 1) for the standardization state
yuuji@0 20 and status of this protocol. Distribution of this memo is unlimited.
yuuji@0 21
yuuji@0 22 Abstract
yuuji@0 23
yuuji@0 24 This document describes "i;unicode-casemap", a simple case-
yuuji@0 25 insensitive collation for Unicode strings. It provides equality,
yuuji@0 26 substring, and ordering operations.
yuuji@0 27
yuuji@0 28 1. Introduction
yuuji@0 29
yuuji@0 30 The "i;ascii-casemap" collation described in [COMPARATOR] is quite
yuuji@0 31 simple to implement and provides case-independent comparisons for the
yuuji@0 32 26 Latin alphabetics. It is specified as the default and/or baseline
yuuji@0 33 comparator in some application protocols, e.g., [IMAP-SORT].
yuuji@0 34
yuuji@0 35 However, the "i;ascii-casemap" collation does not produce
yuuji@0 36 satisfactory results with non-ASCII characters. It is possible, with
yuuji@0 37 a modest extension, to provide a more sophisticated collation with
yuuji@0 38 greater multilingual applicability than "i;ascii-casemap". This
yuuji@0 39 extension provides case-independent comparisons for a much greater
yuuji@0 40 number of characters. It also collates characters with diacriticals
yuuji@0 41 with the non-diacritical character forms.
yuuji@0 42
yuuji@0 43 This collation, "i;unicode-casemap", is intended to be an alternative
yuuji@0 44 to, and preferred over, "i;ascii-casemap". It does not replace the
yuuji@0 45 "i;basic" collation described in [BASIC].
yuuji@0 46
yuuji@0 47 2. Unicode Casemap Collation Description
yuuji@0 48
yuuji@0 49 The "i;unicode-casemap" collation is a simple collation which is
yuuji@0 50 case-insensitive in its treatment of characters. It provides
yuuji@0 51 equality, substring, and ordering operations. The validity test
yuuji@0 52 operation returns "valid" for any input.
yuuji@0 53
yuuji@0 54
yuuji@0 55
yuuji@0 56
yuuji@0 57
yuuji@0 58 Crispin Standards Track [Page 1]
yuuji@0 59
yuuji@0 60 RFC 5051 i;unicode-casemap October 2007
yuuji@0 61
yuuji@0 62
yuuji@0 63 This collation allows strings in arbitrary (and mixed) character
yuuji@0 64 sets, as long as the character set for each string is identified and
yuuji@0 65 it is possible to convert the string to Unicode. Strings which have
yuuji@0 66 an unidentified character set and/or cannot be converted to Unicode
yuuji@0 67 are not rejected, but are treated as binary.
yuuji@0 68
yuuji@0 69 Each input string is prepared by converting it to a "titlecased
yuuji@0 70 canonicalized UTF-8" string according to the following steps, using
yuuji@0 71 UnicodeData.txt ([UNICODE-DATA]):
yuuji@0 72
yuuji@0 73 (1) A Unicode codepoint is obtained from the input string.
yuuji@0 74
yuuji@0 75 (a) If the input string is in a known charset that can be
yuuji@0 76 converted to Unicode, a sequence in the string's charset
yuuji@0 77 is read and checked for validity according to the rules of
yuuji@0 78 that charset. If the sequence is valid, it is converted
yuuji@0 79 to a Unicode codepoint. Note that for input strings in
yuuji@0 80 UTF-8, the UTF-8 sequence must be valid according to the
yuuji@0 81 rules of [UTF-8]; e.g., overlong UTF-8 sequences are
yuuji@0 82 invalid.
yuuji@0 83
yuuji@0 84 (b) If the input string is in an unknown charset, or an
yuuji@0 85 invalid sequence occurs in step (1)(a), conversion ceases.
yuuji@0 86 No further preparation is performed, and any partial
yuuji@0 87 preparation results are discarded. The original string is
yuuji@0 88 used unchanged with the i;octet comparator.
yuuji@0 89
yuuji@0 90 (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
yuuji@0 91 are performed on the resulting codepoint from step (1)(a).
yuuji@0 92
yuuji@0 93 (a) If the codepoint has a titlecase property in
yuuji@0 94 UnicodeData.txt (this is normally the same as the
yuuji@0 95 uppercase property), the codepoint is converted to the
yuuji@0 96 codepoints in the titlecase property.
yuuji@0 97
yuuji@0 98 (b) If the resulting codepoint from (2)(a) has a decomposition
yuuji@0 99 property of any type in UnicodeData.txt, the codepoint is
yuuji@0 100 converted to the codepoints in the decomposition property.
yuuji@0 101 This step is recursively applied to each of the resulting
yuuji@0 102 codepoints until no more decomposition is possible
yuuji@0 103 (effectively Normalization Form KD).
yuuji@0 104
yuuji@0 105 Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
yuuji@0 106 has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
yuuji@0 107 WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
yuuji@0 108 decomposition property of U+0044 (LATIN CAPITAL LETTER D)
yuuji@0 109 U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
yuuji@0 110 decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
yuuji@0 119 (COMBINING CARON). None of U+0044, U+007A, or U+030C has
yuuji@0 120 any decomposition properties. Therefore, U+01C4 is converted
yuuji@0 121 to U+0044 U+007A U+030C by this step.
yuuji@0 122
yuuji@0 123 (3) The resulting codepoint(s) from step (2) is/are appended, in
yuuji@0 124 UTF-8 format, to the "titlecased canonicalized UTF-8" string.
yuuji@0 125
yuuji@0 126 (4) Repeat from step (1) until there is no more data in the input
yuuji@0 127 string.
yuuji@0 128
yuuji@0 129 Following the above preparation process on each string, the equality,
yuuji@0 130 ordering, and substring operations are as for i;octet.
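
The preparation steps and the i;octet operations that follow them can be sketched in Python. This is a minimal, non-normative sketch: the names simple_titlecase, decompose, prepare, equal, ordering, and substring are hypothetical, and the data comes from Python's bundled unicodedata tables rather than from any particular revision of UnicodeData.txt. One assumption is worth stating loudly: Python has no direct accessor for the single-codepoint titlecase field, so the sketch approximates it with str.title() and ignores any result that is longer than one codepoint.

```python
import unicodedata


def simple_titlecase(ch):
    # Assumption: approximates the UnicodeData.txt titlecase field.
    # str.title() applies full case mapping, so multi-codepoint results
    # (e.g. for U+00DF) are ignored and the codepoint is kept unchanged.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def decompose(ch):
    # Step (2)(b): recursively expand the decomposition field of any type.
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    # Drop a leading compatibility tag such as "<compat>".
    fields = [f for f in mapping.split() if not f.startswith("<")]
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def prepare(raw, charset="utf-8"):
    # Step (1): decode strictly; unknown charsets and invalid sequences
    # fall back to the original octets, per step (1)(b).
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return raw
    # Steps (2) to (4): titlecase, decompose, append as UTF-8.
    return "".join(decompose(simple_titlecase(c)) for c in text).encode("utf-8")


def equal(a, b):
    return prepare(a) == prepare(b)


def ordering(a, b):
    # Python compares bytes objects octet by octet, as i;octet does.
    pa, pb = prepare(a), prepare(b)
    return (pa > pb) - (pa < pb)


def substring(needle, haystack):
    return prepare(needle) in prepare(haystack)
```

With these helpers, prepare("\u01c4".encode("utf-8")) yields the UTF-8 encoding of U+0044 U+007A U+030C, matching the example in step (2).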
yuuji@0 131
yuuji@0 132 It is permitted to use an alternative implementation of the above
yuuji@0 133 preparation process if it produces the same results. For example, it
yuuji@0 134 may be more convenient for an implementation to convert all input
yuuji@0 135 strings to a sequence of UTF-16 or UTF-32 values prior to performing
yuuji@0 136 any of the step (2) actions. Similarly, if all input strings are (or
yuuji@0 137 are convertible to) Unicode, it may be possible to use UTF-32 as an
yuuji@0 138 alternative to UTF-8 in step (3).
yuuji@0 139
yuuji@0 140 Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
yuuji@0 141 because UTF-16 surrogates will cause i;octet to collate codepoints
yuuji@0 142 U+E000 through U+FFFF after non-BMP codepoints.
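
The ordering problem in the note can be observed directly. The following Python sketch (the function name is hypothetical) encodes a high BMP codepoint and the first non-BMP codepoint in three encodings and checks whether octet order still agrees with codepoint order.

```python
def octet_order_matches_codepoint_order(encoding):
    # U+FFFF is below U+10000 in codepoint order, so a suitable encoding
    # must also place its octets first under an i;octet comparison.
    bmp = "\uffff".encode(encoding)
    non_bmp = "\U00010000".encode(encoding)
    return bmp < non_bmp


for name in ("utf-8", "utf-32-be", "utf-16-be"):
    print(name, octet_order_matches_codepoint_order(name))
```

UTF-8 and UTF-32 keep the order; UTF-16 does not, because U+10000 is encoded with a surrogate pair beginning with octet 0xD8, which sorts below the 0xFF that begins U+FFFF.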
yuuji@0 143
yuuji@0 144 This collation is not locale sensitive. Consequently, care should be
yuuji@0 145 taken when using OS-supplied functions to implement this collation.
yuuji@0 146 Functions such as strcasecmp and toupper are sometimes locale
yuuji@0 147 sensitive and may inconsistently casemap letters.
yuuji@0 148
yuuji@0 149 The i;unicode-casemap collation is well suited to use with many
yuuji@0 150 Internet protocols and computer languages. Use with natural language
yuuji@0 151 is often inappropriate; even though the collation apparently supports
yuuji@0 152 languages such as Swahili and English, in real-world use it tends to
yuuji@0 153 mis-sort a number of types of string:
yuuji@0 154
yuuji@0 155 o people and place names containing scripts that are not collated
yuuji@0 156 according to "alphabetical order".
yuuji@0 157 o words with characters that have diacriticals. However,
yuuji@0 158 i;unicode-casemap generally does a better job than i;ascii-casemap
yuuji@0 159 for most (but not all) languages. For example, German umlaut
yuuji@0 160 letters will sort correctly, but some Scandinavian letters will
yuuji@0 161 not.
yuuji@0 162 o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
yuuji@0 163 in English).
yuuji@0 164 o strings containing other non-letter symbols; e.g., euro and pound
yuuji@0 165 sterling symbols, quotation marks other than '"', dashes/hyphens,
yuuji@0 166 etc.
yuuji@0 167
yuuji@0 175 3. Unicode Casemap Collation Registration
yuuji@0 176
yuuji@0 177 <?xml version='1.0'?>
yuuji@0 178 <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
yuuji@0 179 <collation rfc="5051" scope="global" intendedUse="common">
yuuji@0 180 <identifier>i;unicode-casemap</identifier>
yuuji@0 181 <title>Unicode Casemap</title>
yuuji@0 182 <operations>equality order substring</operations>
yuuji@0 183 <specification>RFC 5051</specification>
yuuji@0 184 <owner>IETF</owner>
yuuji@0 185 <submitter>mrc@cac.washington.edu</submitter>
yuuji@0 186 </collation>
yuuji@0 187
yuuji@0 188 4. Security Considerations
yuuji@0 189
yuuji@0 190 The security considerations for [UTF-8], [STRINGPREP], and
yuuji@0 191 [UNICODE-SECURITY] apply and are normative to this specification.
yuuji@0 192
yuuji@0 193 The results from this comparator will vary depending upon the
yuuji@0 194 implementation for several reasons. Implementations MUST consider
yuuji@0 195 whether these possibilities are a problem for their use case:
yuuji@0 196
yuuji@0 197 1) New characters added in Unicode may have decomposition or
yuuji@0 198 titlecase properties that will not be known to an implementation
yuuji@0 199 based upon an older revision of Unicode. This impacts step (2).
yuuji@0 200
yuuji@0 201 2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
yuuji@0 202 does not require normalization of out-of-order diacriticals.
yuuji@0 203 However, an implementation MAY use an NFKD library routine that
yuuji@0 204 does such normalization. This impacts step (2)(b) and possibly
yuuji@0 205 also step (1)(a), and is an issue only with ill-formed UTF-8
yuuji@0 206 input.
yuuji@0 207
yuuji@0 208 3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
yuuji@0 209 (and, by extension, US-ASCII) are the only mandatory-to-implement
yuuji@0 210 charsets. This impacts step (1)(a).
yuuji@0 211
yuuji@0 212 Implementations SHOULD, as far as feasible, support all the
yuuji@0 213 charsets they are likely to encounter in the input data, in order
yuuji@0 214 to avoid poor collation caused by the fall through to the (1)(b)
yuuji@0 215 rule.
yuuji@0 216
yuuji@0 217 4) Other charsets may have revisions which add new characters that
yuuji@0 218 are not known to an implementation based upon an older revision.
yuuji@0 219 This impacts step (1)(a) and possibly also step (1)(b).
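
The difference described in item 2 can be illustrated with a short Python sketch. The helper names are hypothetical, and the data comes from Python's bundled unicodedata tables. The sample string places a combining acute accent (combining class 230) before a combining dot below (combining class 220): the recursive decomposition of step (2)(b) leaves that order alone, while a library NFKD routine reorders the marks.

```python
import unicodedata


def decompose(ch):
    # Recursive decomposition only, as in step (2)(b); no reordering.
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    fields = [f for f in mapping.split() if not f.startswith("<")]
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def step_2b(text):
    return "".join(decompose(c) for c in text)


sample = "a\u0301\u0323"  # marks not in canonical order
print([hex(ord(c)) for c in step_2b(sample)])
print([hex(ord(c)) for c in unicodedata.normalize("NFKD", sample)])
```

Two implementations that differ in this respect will produce different prepared strings for such input, and therefore different comparison results.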
yuuji@0 220
yuuji@0 231 An attacker may create input that is ill-formed or in an unknown
yuuji@0 232 charset, with the intention of impacting the results of this
yuuji@0 233 comparator or exploiting other parts of the system which process this
yuuji@0 234 input in different ways. Note, however, that even well-formed data
yuuji@0 235 in a known charset can impact the result of this comparator in
yuuji@0 236 unexpected ways. For example, an attacker can substitute U+0041
yuuji@0 237 (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
yuuji@0 238 U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
yuuji@0 239 non-match of strings which visually appear the same and/or causing
yuuji@0 240 the string to appear elsewhere in a sort.
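
The substitution example can be checked with a few lines of Python (the names below are hypothetical). None of the three look-alike capitals has a decomposition or a different titlecase form in Python's bundled Unicode tables, so preparation leaves them as three distinct UTF-8 sequences that never compare equal.

```python
import unicodedata

LOOKALIKES = ("\u0041", "\u0391", "\u0410")


def describe(ch):
    # Name, decomposition field (empty if none), and UTF-8 octets.
    return unicodedata.name(ch), unicodedata.decomposition(ch), ch.encode("utf-8")


for ch in LOOKALIKES:
    name, decomposition, octets = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomposition!r} utf8={octets.hex()}")
```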
yuuji@0 241
yuuji@0 242 5. IANA Considerations
yuuji@0 243
yuuji@0 244 The i;unicode-casemap collation defined in section 2 has been added
yuuji@0 245 to the registry of collations defined in [COMPARATOR].
yuuji@0 246
yuuji@0 247 6. Normative References
yuuji@0 248
yuuji@0 249 [COMPARATOR] Newman, C., Duerst, M., and A. Gulbrandsen,
yuuji@0 250 "Internet Application Protocol Collation
yuuji@0 251 Registry", RFC 4790, February 2007.
yuuji@0 252
yuuji@0 253 [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
yuuji@0 254 Internationalized Strings ("stringprep")", RFC
yuuji@0 255 3454, December 2002.
yuuji@0 256
yuuji@0 257 [UTF-8] Yergeau, F., "UTF-8, a transformation format of
yuuji@0 258 ISO 10646", STD 63, RFC 3629, November 2003.
yuuji@0 259
yuuji@0 260 [UNICODE-DATA] <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
yuuji@0 262
yuuji@0 263 Although the UnicodeData.txt file referenced
yuuji@0 264 here is part of the Unicode standard, it is
yuuji@0 265 subject to change as new characters are added
yuuji@0 266 to Unicode and errors are corrected in Unicode
yuuji@0 267 revisions. As a result, it may be less stable
yuuji@0 268 than might otherwise be implied by the
yuuji@0 269 standards status of this specification.
yuuji@0 270
yuuji@0 271 [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
yuuji@0 272 Considerations", February 2006,
yuuji@0 273 <http://www.unicode.org/reports/tr36/>.
yuuji@0 274
yuuji@0 287 7. Informative References
yuuji@0 288
yuuji@0 289 [BASIC] Newman, C., Duerst, M., and A. Gulbrandsen,
yuuji@0 290 "i;basic - the Unicode Collation Algorithm",
yuuji@0 291 Work in Progress, March 2007.
yuuji@0 292
yuuji@0 293 [IMAP-SORT] Crispin, M. and K. Murchison, "Internet Message
yuuji@0 294 Access Protocol - SORT and THREAD Extensions",
yuuji@0 295 Work in Progress, September 2007.
yuuji@0 296
yuuji@0 297 Author's Address
yuuji@0 298
yuuji@0 299 Mark R. Crispin
yuuji@0 300 Networks and Distributed Computing
yuuji@0 301 University of Washington
yuuji@0 302 4545 15th Avenue NE
yuuji@0 303 Seattle, WA 98105-4527
yuuji@0 304
yuuji@0 305 Phone: +1 (206) 543-5762
yuuji@0 306 EMail: MRC@CAC.Washington.EDU
yuuji@0 307
yuuji@0 343 Full Copyright Statement
yuuji@0 344
yuuji@0 345 Copyright (C) The IETF Trust (2007).
yuuji@0 346
yuuji@0 347 This document is subject to the rights, licenses and restrictions
yuuji@0 348 contained in BCP 78, and except as set forth therein, the authors
yuuji@0 349 retain all their rights.
yuuji@0 350
yuuji@0 351 This document and the information contained herein are provided on an
yuuji@0 352 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
yuuji@0 353 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
yuuji@0 354 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
yuuji@0 355 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
yuuji@0 356 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
yuuji@0 357 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
yuuji@0 358
yuuji@0 359 Intellectual Property
yuuji@0 360
yuuji@0 361 The IETF takes no position regarding the validity or scope of any
yuuji@0 362 Intellectual Property Rights or other rights that might be claimed to
yuuji@0 363 pertain to the implementation or use of the technology described in
yuuji@0 364 this document or the extent to which any license under such rights
yuuji@0 365 might or might not be available; nor does it represent that it has
yuuji@0 366 made any independent effort to identify any such rights. Information
yuuji@0 367 on the procedures with respect to rights in RFC documents can be
yuuji@0 368 found in BCP 78 and BCP 79.
yuuji@0 369
yuuji@0 370 Copies of IPR disclosures made to the IETF Secretariat and any
yuuji@0 371 assurances of licenses to be made available, or the result of an
yuuji@0 372 attempt made to obtain a general license or permission for the use of
yuuji@0 373 such proprietary rights by implementers or users of this
yuuji@0 374 specification can be obtained from the IETF on-line IPR repository at
yuuji@0 375 http://www.ietf.org/ipr.
yuuji@0 376
yuuji@0 377 The IETF invites any interested party to bring to its attention any
yuuji@0 378 copyrights, patents or patent applications, or other proprietary
yuuji@0 379 rights that may cover technology that may be required to implement
yuuji@0 380 this standard. Please address the information to the IETF at
yuuji@0 381 ietf-ipr@ietf.org.