imapext-2007
diff docs/rfc/rfc5051.txt @ 0:ada5e610ab86
imap-2007e
author | yuuji@gentei.org |
---|---|
date | Mon, 14 Sep 2009 15:17:45 +0900 |
parents | |
children | |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/rfc/rfc5051.txt	Mon Sep 14 15:17:45 2009 +0900
@@ -0,0 +1,395 @@
+
+Network Working Group                                         M. Crispin
+Request for Comments: 5051                      University of Washington
+Category: Standards Track                                   October 2007
+
+
+         i;unicode-casemap - Simple Unicode Collation Algorithm
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements.  Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol.  Distribution of this memo is unlimited.
+
+Abstract
+
+   This document describes "i;unicode-casemap", a simple
+   case-insensitive collation for Unicode strings.  It provides
+   equality, substring, and ordering operations.
+
+1.  Introduction
+
+   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
+   simple to implement and provides case-independent comparisons for the
+   26 Latin alphabetics.  It is specified as the default and/or baseline
+   comparator in some application protocols, e.g., [IMAP-SORT].
+
+   However, the "i;ascii-casemap" collation does not produce
+   satisfactory results with non-ASCII characters.  It is possible, with
+   a modest extension, to provide a more sophisticated collation with
+   greater multilingual applicability than "i;ascii-casemap".  This
+   extension provides case-independent comparisons for a much greater
+   number of characters.  It also collates characters with diacriticals
+   with the non-diacritical character forms.
+
+   This collation, "i;unicode-casemap", is intended to be an alternative
+   to, and preferred over, "i;ascii-casemap".  It does not replace the
+   "i;basic" collation described in [BASIC].
+
+2.  Unicode Casemap Collation Description
+
+   The "i;unicode-casemap" collation is a simple collation which is
+   case-insensitive in its treatment of characters.  It provides
+   equality, substring, and ordering operations.  The validity test
+   operation returns "valid" for any input.
+
+Crispin                     Standards Track                     [Page 1]
+
+RFC 5051                   i;unicode-casemap                October 2007
+
+   This collation allows strings in arbitrary (and mixed) character
+   sets, as long as the character set for each string is identified and
+   it is possible to convert the string to Unicode.  Strings which have
+   an unidentified character set and/or cannot be converted to Unicode
+   are not rejected, but are treated as binary.
+
+   Each input string is prepared by converting it to a "titlecased
+   canonicalized UTF-8" string according to the following steps, using
+   UnicodeData.txt ([UNICODE-DATA]):
+
+      (1) A Unicode codepoint is obtained from the input string.
+
+          (a) If the input string is in a known charset that can be
+              converted to Unicode, a sequence in the string's charset
+              is read and checked for validity according to the rules of
+              that charset.  If the sequence is valid, it is converted
+              to a Unicode codepoint.  Note that for input strings in
+              UTF-8, the UTF-8 sequence must be valid according to the
+              rules of [UTF-8]; e.g., overlong UTF-8 sequences are
+              invalid.
+
+          (b) If the input string is in an unknown charset, or an
+              invalid sequence occurs in step (1)(a), conversion ceases.
+              No further preparation is performed, and any partial
+              preparation results are discarded.  The original string is
+              used unchanged with the i;octet comparator.
+
+      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
+          are performed on the resulting codepoint from step (1)(a).
+
+          (a) If the codepoint has a titlecase property in
+              UnicodeData.txt (this is normally the same as the
+              uppercase property), the codepoint is converted to the
+              codepoints in the titlecase property.
+
+          (b) If the resulting codepoint from (2)(a) has a decomposition
+              property of any type in UnicodeData.txt, the codepoint is
+              converted to the codepoints in the decomposition property.
+              This step is recursively applied to each of the resulting
+              codepoints until no more decomposition is possible
+              (effectively Normalization Form KD).
+
+          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
+          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
+          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
+          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
+          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
+          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
+          (COMBINING CARON).  Neither U+0044, U+007A, nor U+030C have
+          any decomposition properties.  Therefore, U+01C4 is converted
+          to U+0044 U+007A U+030C by this step.
+
+      (3) The resulting codepoint(s) from step (2) is/are appended, in
+          UTF-8 format, to the "titlecased canonicalized UTF-8" string.
+
+      (4) Repeat from step (1) until there is no more data in the input
+          string.
+
+   Following the above preparation process on each string, the equality,
+   ordering, and substring operations are as for i;octet.
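Reviewer sketch: preparation steps (1) through (4) of Section 2 could be written in Python roughly as below. This is a minimal illustration under stated assumptions, not the code in this repository. The names `decompose` and `prepare` are hypothetical. The Unicode data is whatever version the Python `unicodedata` module bundles, not a specific UnicodeData.txt revision. `str.title()` applies full case mappings, so it may differ from the simple titlecase field of UnicodeData.txt for a few characters such as U+00DF.

```python
import unicodedata


def decompose(ch: str) -> str:
    """Step (2)(b): recursively expand the decomposition of one code point."""
    fields = unicodedata.decomposition(ch).split()
    # A compatibility decomposition starts with a tag such as "<compat>".
    # The RFC accepts decompositions "of any type", so the tag is dropped.
    if fields and fields[0].startswith("<"):
        fields = fields[1:]
    if not fields:
        return ch  # no decomposition property: keep the code point
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def prepare(data: bytes, charset: str) -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of data.

    If the charset is unknown or the input is invalid, step (1)(b)
    applies and the original octets are returned unchanged.
    """
    try:
        # Step (1)(a): Python's UTF-8 decoder rejects overlong sequences.
        text = data.decode(charset)
    except (LookupError, ValueError):
        return data  # step (1)(b): fall back to plain i;octet input
    out = []
    for ch in text:
        # Step (2)(a): titlecase (assumption: full mapping via str.title()).
        for titled in ch.title():
            out.append(decompose(titled))  # step (2)(b)
    # Step (3): emit UTF-8 so that i;octet can compare the results.
    return "".join(out).encode("utf-8")
```

Since Python compares `bytes` objects octet by octet, equality and ordering under this sketch reduce to `prepare(a, cs) == prepare(b, cs)` and `prepare(a, cs) < prepare(b, cs)`, and a substring test to `prepare(a, cs) in prepare(b, cs)`.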
+
+   It is permitted to use an alternative implementation of the above
+   preparation process if it produces the same results.  For example, it
+   may be more convenient for an implementation to convert all input
+   strings to a sequence of UTF-16 or UTF-32 values prior to performing
+   any of the step (2) actions.  Similarly, if all input strings are (or
+   are convertible to) Unicode, it may be possible to use UTF-32 as an
+   alternative to UTF-8 in step (3).
+
+   Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
+   because UTF-16 surrogates will cause i;octet to collate codepoints
+   U+E000 through U+FFFF after non-BMP codepoints.
+
+   This collation is not locale sensitive.  Consequently, care should be
+   taken when using OS-supplied functions to implement this collation.
+   Functions such as strcasecmp and toupper are sometimes locale
+   sensitive and may inconsistently casemap letters.
+
+   The i;unicode-casemap collation is well suited to use with many
+   Internet protocols and computer languages.  Use with natural language
+   is often inappropriate; even though the collation apparently supports
+   languages such as Swahili and English, in real-world use it tends to
+   mis-sort a number of types of string:
+
+   o  people and place names containing scripts that are not collated
+      according to "alphabetical order".
+   o  words with characters that have diacriticals.  However,
+      i;unicode-casemap generally does a better job than i;ascii-casemap
+      for most (but not all) languages.  For example, German umlaut
+      letters will sort correctly, but some Scandinavian letters will
+      not.
+   o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
+      in English),
+   o  strings containing other non-letter symbols; e.g., euro and pound
+      sterling symbols, quotation marks other than '"', dashes/hyphens,
+      etc.
+
+3.  Unicode Casemap Collation Registration
+
+   <?xml version='1.0'?>
+   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
+   <collation rfc="5051" scope="global" intendedUse="common">
+     <identifier>i;unicode-casemap</identifier>
+     <title>Unicode Casemap</title>
+     <operations>equality order substring</operations>
+     <specification>RFC 5051</specification>
+     <owner>IETF</owner>
+     <submitter>mrc@cac.washington.edu</submitter>
+   </collation>
+
+4.  Security Considerations
+
+   The security considerations for [UTF-8], [STRINGPREP], and
+   [UNICODE-SECURITY] apply and are normative to this specification.
+
+   The results from this comparator will vary depending upon the
+   implementation for several reasons.  Implementations MUST consider
+   whether these possibilities are a problem for their use case:
+
+   1) New characters added in Unicode may have decomposition or
+      titlecase properties that will not be known to an implementation
+      based upon an older revision of Unicode.  This impacts step (2).
+
+   2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
+      does not require normalization of out-of-order diacriticals.
+      However, an implementation MAY use an NFKD library routine that
+      does such normalization.  This impacts step (2)(b) and possibly
+      also step (1)(a), and is an issue only with ill-formed UTF-8
+      input.
+
+   3) The set of charsets handled in step (1)(a) is open-ended.  UTF-8
+      (and, by extension, US-ASCII) are the only mandatory-to-implement
+      charsets.  This impacts step (1)(a).
+
+      Implementations SHOULD, as far as feasible, support all the
+      charsets they are likely to encounter in the input data, in order
+      to avoid poor collation caused by the fall through to the (1)(b)
+      rule.
+
+   4) Other charsets may have revisions which add new characters that
+      are not known to an implementation based upon an older revision.
+      This impacts step (1)(a) and possibly also step (1)(b).
+
+   An attacker may create input that is ill-formed or in an unknown
+   charset, with the intention of impacting the results of this
+   comparator or exploiting other parts of the system which process this
+   input in different ways.  Note, however, that even well-formed data
+   in a known charset can impact the result of this comparator in
+   unexpected ways.  For example, an attacker can substitute U+0041
+   (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
+   U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
+   non-match of strings which visually appear the same and/or causing
+   the string to appear elsewhere in a sort.
+
+5.  IANA Considerations
+
+   The i;unicode-casemap collation defined in section 2 has been added
+   to the registry of collations defined in [COMPARATOR].
+
+6.  Normative References
+
+   [COMPARATOR]       Newman, C., Duerst, M., and A. Gulbrandsen,
+                      "Internet Application Protocol Collation
+                      Registry", RFC 4790, February 2007.
+
+   [STRINGPREP]       Hoffman, P. and M. Blanchet, "Preparation of
+                      Internationalized Strings ("stringprep")", RFC
+                      3454, December 2002.
+
+   [UTF-8]            Yergeau, F., "UTF-8, a transformation format of
+                      ISO 10646", STD 63, RFC 3629, November 2003.
+
+   [UNICODE-DATA]     <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
+
+                      Although the UnicodeData.txt file referenced
+                      here is part of the Unicode standard, it is
+                      subject to change as new characters are added
+                      to Unicode and errors are corrected in Unicode
+                      revisions.  As a result, it may be less stable
+                      than might otherwise be implied by the
+                      standards status of this specification.
+
+   [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
+                      Considerations", February 2006,
+                      <http://www.unicode.org/reports/tr36/>.
+
+7.  Informative References
+
+   [BASIC]            Newman, C., Duerst, M., and A. Gulbrandsen,
+                      "i;basic - the Unicode Collation Algorithm",
+                      Work in Progress, March 2007.
+
+   [IMAP-SORT]        Crispin, M. and K. Murchison, "Internet Message
+                      Access Protocol - SORT and THREAD Extensions",
+                      Work in Progress, September 2007.
+
+Author's Address
+
+   Mark R. Crispin
+   Networks and Distributed Computing
+   University of Washington
+   4545 15th Avenue NE
+   Seattle, WA  98105-4527
+
+   Phone: +1 (206) 543-5762
+   EMail: MRC@CAC.Washington.EDU
+
+Full Copyright Statement
+
+   Copyright (C) The IETF Trust (2007).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
+   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
+   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
+   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights.  Information
+   on the procedures with respect to rights in RFC documents can be
+   found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard.  Please address the information to the IETF at
+   ietf-ipr@ietf.org.
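Reviewer sketch: the Section 2 note that UTF-16 is unsuitable for step (3) can be checked with a few lines of Python. The helper name `octet_less` and the two sample code points are hypothetical choices for illustration. The sketch compares one BMP code point above the surrogate block with one non-BMP code point under three encodings, using octet-by-octet comparison as i;octet does.

```python
def octet_less(a: str, b: str, encoding: str) -> bool:
    """Return True if a sorts before b when the encoded octets are compared."""
    return a.encode(encoding) < b.encode(encoding)


bmp = "\ue000"          # U+E000, first BMP code point above the surrogates
non_bmp = "\U00010000"  # U+10000, first code point outside the BMP

# In code point order, U+E000 comes before U+10000.
codepoint_order = ord(bmp) < ord(non_bmp)

# UTF-8 (EE 80 80 vs F0 90 80 80) and big-endian UTF-32 keep that order.
utf8_order = octet_less(bmp, non_bmp, "utf-8")
utf32_order = octet_less(bmp, non_bmp, "utf-32-be")

# UTF-16 encodes U+10000 as the surrogate pair D800 DC00, whose leading
# octet D8 is lower than E0, so the order is reversed.
utf16_order = octet_less(bmp, non_bmp, "utf-16-be")
```

Here `codepoint_order`, `utf8_order`, and `utf32_order` agree, while `utf16_order` does not, which is why the RFC allows UTF-32 but not UTF-16 as the form handed to i;octet.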