imapext-2007

view docs/rfc/rfc5051.txt @ 0:ada5e610ab86

imap-2007e
author yuuji@gentei.org
date Mon, 14 Sep 2009 15:17:45 +0900
Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007

i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Abstract

This document describes "i;unicode-casemap", a simple case-
insensitive collation for Unicode strings. It provides equality,
substring, and ordering operations.

1. Introduction

The "i;ascii-casemap" collation described in [COMPARATOR] is quite
simple to implement and provides case-independent comparisons for the
26 Latin alphabetics. It is specified as the default and/or baseline
comparator in some application protocols, e.g., [IMAP-SORT].

However, the "i;ascii-casemap" collation does not produce
satisfactory results with non-ASCII characters. It is possible, with
a modest extension, to provide a more sophisticated collation with
greater multilingual applicability than "i;ascii-casemap". This
extension provides case-independent comparisons for a much greater
number of characters. It also collates characters with diacriticals
with the non-diacritical character forms.

This collation, "i;unicode-casemap", is intended to be an alternative
to, and preferred over, "i;ascii-casemap". It does not replace the
"i;basic" collation described in [BASIC].

2. Unicode Casemap Collation Description

The "i;unicode-casemap" collation is a simple collation which is
case-insensitive in its treatment of characters. It provides
equality, substring, and ordering operations. The validity test
operation returns "valid" for any input.

Crispin                     Standards Track                     [Page 1]

RFC 5051                   i;unicode-casemap                October 2007

This collation allows strings in arbitrary (and mixed) character
sets, as long as the character set for each string is identified and
it is possible to convert the string to Unicode. Strings which have
an unidentified character set and/or cannot be converted to Unicode
are not rejected, but are treated as binary.

Each input string is prepared by converting it to a "titlecased
canonicalized UTF-8" string according to the following steps, using
UnicodeData.txt ([UNICODE-DATA]):

(1) A Unicode codepoint is obtained from the input string.

    (a) If the input string is in a known charset that can be
        converted to Unicode, a sequence in the string's charset
        is read and checked for validity according to the rules of
        that charset. If the sequence is valid, it is converted
        to a Unicode codepoint. Note that for input strings in
        UTF-8, the UTF-8 sequence must be valid according to the
        rules of [UTF-8]; e.g., overlong UTF-8 sequences are
        invalid.

    (b) If the input string is in an unknown charset, or an
        invalid sequence occurs in step (1)(a), conversion ceases.
        No further preparation is performed, and any partial
        preparation results are discarded. The original string is
        used unchanged with the i;octet comparator.

(2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
    are performed on the resulting codepoint from step (1)(a).

    (a) If the codepoint has a titlecase property in
        UnicodeData.txt (this is normally the same as the
        uppercase property), the codepoint is converted to the
        codepoints in the titlecase property.

    (b) If the resulting codepoint from (2)(a) has a decomposition
        property of any type in UnicodeData.txt, the codepoint is
        converted to the codepoints in the decomposition property.
        This step is recursively applied to each of the resulting
        codepoints until no more decomposition is possible
        (effectively Normalization Form KD).

    Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
    has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
    WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
    decomposition property of U+0044 (LATIN CAPITAL LETTER D)
    U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
    decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
    (COMBINING CARON). None of U+0044, U+007A, or U+030C has a
    decomposition property. Therefore, U+01C4 is converted
    to U+0044 U+007A U+030C by this step.

(3) The resulting codepoint(s) from step (2) is/are appended, in
    UTF-8 format, to the "titlecased canonicalized UTF-8" string.

(4) Repeat from step (1) until there is no more data in the input
    string.

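Steps (1) through (4) can be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not a reference
implementation: the function names are hypothetical, the Unicode
tables are whatever revision the Python build's unicodedata module
ships, and step (2)(a) is approximated with str.title() restricted to
single-codepoint results, because the standard library does not
expose the simple titlecase field of UnicodeData.txt directly.

```python
import unicodedata


def _titlecase(ch: str) -> str:
    """Step (2)(a), approximated: accept only single-codepoint mappings."""
    mapped = ch.title()
    # Multi-codepoint results come from full case mappings, which are not
    # listed in UnicodeData.txt, so the codepoint is left unchanged then.
    return mapped if len(mapped) == 1 else ch


def _decompose(ch: str) -> str:
    """Step (2)(b): expand decompositions recursively, without reordering."""
    fields = unicodedata.decomposition(ch).split()
    # Drop a type tag such as "<compat>"; decompositions of any type count.
    codepoints = [f for f in fields if not f.startswith("<")]
    if not codepoints:
        return ch
    return "".join(_decompose(chr(int(cp, 16))) for cp in codepoints)


def prepare(raw: bytes, charset: str = "utf-8") -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of raw."""
    try:
        text = raw.decode(charset)  # step (1)(a): strict validity check
    except (UnicodeDecodeError, LookupError):
        # Step (1)(b): invalid sequence or unknown charset, so the
        # original octets are used unchanged.
        return raw
    prepared = "".join(_decompose(_titlecase(ch)) for ch in text)  # step (2)
    return prepared.encode("utf-8")  # step (3)
```

With these assumptions, prepare() applied to the UTF-8 encoding of
U+01C4 yields the UTF-8 encoding of U+0044 U+007A U+030C, matching
the example in step (2).
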
Following the above preparation process on each string, the equality,
ordering, and substring operations are as for i;octet.

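As a minimal sketch, the three i;octet operations on already-prepared
strings reduce to plain byte comparisons. The function names below
are hypothetical, and the inputs are assumed to be the output of the
preparation process above.

```python
def octet_equal(a: bytes, b: bytes) -> bool:
    """Equality: the prepared octet strings are identical."""
    return a == b


def octet_order(a: bytes, b: bytes) -> int:
    """Ordering: -1, 0, or 1 by unsigned octet value; a prefix sorts first."""
    return (a > b) - (a < b)


def octet_substring(needle: bytes, haystack: bytes) -> bool:
    """Substring: the prepared needle occurs somewhere in the haystack."""
    return needle in haystack
```

Because case folding and decomposition have already happened during
preparation, none of these functions needs any Unicode knowledge.
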
It is permitted to use an alternative implementation of the above
preparation process if it produces the same results. For example, it
may be more convenient for an implementation to convert all input
strings to a sequence of UTF-16 or UTF-32 values prior to performing
any of the step (2) actions. Similarly, if all input strings are (or
are convertible to) Unicode, it may be possible to use UTF-32 as an
alternative to UTF-8 in step (3).

Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
because UTF-16 surrogates will cause i;octet to collate codepoints
U+E000 through U+FFFF after non-BMP codepoints.

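The ordering problem in the note can be checked with a short Python
comparison. The two sample codepoints are chosen here for
illustration: one from the top of the BMP, above the surrogate range,
and the first non-BMP codepoint.

```python
bmp = "\ue000"         # a BMP codepoint above the surrogate range
astral = "\U00010000"  # the first non-BMP codepoint

# Octet-wise comparison of the encoded forms, as i;octet would do it.
utf8_in_order = bmp.encode("utf-8") < astral.encode("utf-8")
utf16_in_order = bmp.encode("utf-16-be") < astral.encode("utf-16-be")
utf32_in_order = bmp.encode("utf-32-be") < astral.encode("utf-32-be")

print(utf8_in_order, utf16_in_order, utf32_in_order)
```

UTF-8 and UTF-32 keep the two codepoints in codepoint order. UTF-16
does not, because the non-BMP codepoint is encoded with a lead
surrogate in the range 0xD800 to 0xDBFF, which is numerically below
0xE000.
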
This collation is not locale sensitive. Consequently, care should be
taken when using OS-supplied functions to implement this collation.
Functions such as strcasecmp and toupper are sometimes locale
sensitive and may inconsistently casemap letters.

The i;unicode-casemap collation is well suited to use with many
Internet protocols and computer languages. Use with natural language
is often inappropriate; even though the collation apparently supports
languages such as Swahili and English, in real-world use it tends to
mis-sort a number of types of string:

o people and place names containing scripts that are not collated
  according to "alphabetical order".
o words with characters that have diacriticals. However,
  i;unicode-casemap generally does a better job than i;ascii-casemap
  for most (but not all) languages. For example, German umlaut
  letters will sort correctly, but some Scandinavian letters will
  not.
o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
  in English).
o strings containing other non-letter symbols; e.g., euro and pound
  sterling symbols, quotation marks other than '"', dashes/hyphens,
  etc.

3. Unicode Casemap Collation Registration

<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="5051" scope="global" intendedUse="common">
  <identifier>i;unicode-casemap</identifier>
  <title>Unicode Casemap</title>
  <operations>equality order substring</operations>
  <specification>RFC 5051</specification>
  <owner>IETF</owner>
  <submitter>mrc@cac.washington.edu</submitter>
</collation>

4. Security Considerations

The security considerations for [UTF-8], [STRINGPREP], and
[UNICODE-SECURITY] apply and are normative to this specification.

The results from this comparator will vary depending upon the
implementation for several reasons. Implementations MUST consider
whether these possibilities are a problem for their use case:

1) New characters added in Unicode may have decomposition or
   titlecase properties that will not be known to an implementation
   based upon an older revision of Unicode. This impacts step (2).

2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
   does not require normalization of out-of-order diacriticals.
   However, an implementation MAY use an NFKD library routine that
   does such normalization. This impacts step (2)(b) and possibly
   also step (1)(a), and is an issue only with ill-formed UTF-8
   input.

3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
   (and, by extension, US-ASCII) are the only mandatory-to-implement
   charsets. This impacts step (1)(a).

   Implementations SHOULD, as far as feasible, support all the
   charsets they are likely to encounter in the input data, in order
   to avoid poor collation caused by the fall through to the (1)(b)
   rule.

4) Other charsets may have revisions which add new characters that
   are not known to an implementation based upon an older revision.
   This impacts step (1)(a) and possibly also step (1)(b).

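The difference described in item 2 can be illustrated with a small
Python comparison between a recursive decomposition that does no
reordering and the library NFKD routine. The helper name is
hypothetical, and the sample string is an assumption chosen for
illustration: a base letter followed by two combining marks that are
not in canonical order.

```python
import unicodedata


def decompose_without_reordering(text: str) -> str:
    """Recursive decomposition as in step (2)(b), with no reordering."""
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        codepoints = [f for f in fields if not f.startswith("<")]
        if codepoints:
            expanded = "".join(chr(int(cp, 16)) for cp in codepoints)
            out.append(decompose_without_reordering(expanded))
        else:
            out.append(ch)
    return "".join(out)


# "q", COMBINING DOT ABOVE (class 230), COMBINING DOT BELOW (class 220)
sample = "q\u0307\u0323"

step_2b_result = decompose_without_reordering(sample)
library_result = unicodedata.normalize("NFKD", sample)

print(step_2b_result == sample)          # marks stay in input order
print(library_result == "q\u0323\u0307")  # NFKD reorders the marks
```

Two implementations that differ in this respect will produce
different prepared strings for such input, and so may disagree on
equality and ordering.
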
An attacker may create input that is ill-formed or in an unknown
charset, with the intention of impacting the results of this
comparator or exploiting other parts of the system which process this
input in different ways. Note, however, that even well-formed data
in a known charset can impact the result of this comparator in
unexpected ways. For example, an attacker can substitute U+0041
(LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
non-match of strings which visually appear the same and/or causing
the string to appear elsewhere in a sort.

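A brief Python check illustrates why the three look-alike letters in
this example never compare equal. It relies on the unicodedata module
of the Python build in use, and the variable names are hypothetical.

```python
import unicodedata

lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

for ch in lookalikes:
    # None of the three has a decomposition, and each is already its own
    # titlecase form, so preparation leaves every one of them unchanged.
    assert unicodedata.decomposition(ch) == ""
    assert ch.title() == ch

# The prepared forms are therefore just the UTF-8 encodings.
prepared_forms = [ch.encode("utf-8") for ch in lookalikes]
print(prepared_forms)
```

The three prepared forms are distinct octet strings, so i;octet
treats them as unequal and sorts them far apart even though they may
render identically.
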
5. IANA Considerations

The i;unicode-casemap collation defined in section 2 has been added
to the registry of collations defined in [COMPARATOR].

6. Normative References

[COMPARATOR]        Newman, C., Duerst, M., and A. Gulbrandsen,
                    "Internet Application Protocol Collation
                    Registry", RFC 4790, February 2007.

[STRINGPREP]        Hoffman, P. and M. Blanchet, "Preparation of
                    Internationalized Strings ("stringprep")", RFC
                    3454, December 2002.

[UTF-8]             Yergeau, F., "UTF-8, a transformation format of
                    ISO 10646", STD 63, RFC 3629, November 2003.

[UNICODE-DATA]      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>

                    Although the UnicodeData.txt file referenced
                    here is part of the Unicode standard, it is
                    subject to change as new characters are added
                    to Unicode and errors are corrected in Unicode
                    revisions. As a result, it may be less stable
                    than might otherwise be implied by the
                    standards status of this specification.

[UNICODE-SECURITY]  Davis, M. and M. Suignard, "Unicode Security
                    Considerations", February 2006,
                    <http://www.unicode.org/reports/tr36/>.

7. Informative References

[BASIC]             Newman, C., Duerst, M., and A. Gulbrandsen,
                    "i;basic - the Unicode Collation Algorithm",
                    Work in Progress, March 2007.

[IMAP-SORT]         Crispin, M. and K. Murchison, "Internet Message
                    Access Protocol - SORT and THREAD Extensions",
                    Work in Progress, September 2007.

Author's Address

Mark R. Crispin
Networks and Distributed Computing
University of Washington
4545 15th Avenue NE
Seattle, WA 98105-4527

Phone: +1 (206) 543-5762
EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

UW-IMAP'd extensions by yuuji