imapext-2007

diff docs/rfc/rfc5051.txt @ 0:ada5e610ab86

imap-2007e
author yuuji@gentei.org
date Mon, 14 Sep 2009 15:17:45 +0900
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/docs/rfc/rfc5051.txt	Mon Sep 14 15:17:45 2009 +0900
     1.3 @@ -0,0 +1,395 @@
     1.4 +
     1.5 +
     1.6 +
     1.7 +
     1.8 +
     1.9 +
    1.10 +Network Working Group                                         M. Crispin
    1.11 +Request for Comments: 5051                      University of Washington
    1.12 +Category: Standards Track                                   October 2007
    1.13 +
    1.14 +
    1.15 +         i;unicode-casemap - Simple Unicode Collation Algorithm
    1.16 +
    1.17 +Status of This Memo
    1.18 +
    1.19 +   This document specifies an Internet standards track protocol for the
    1.20 +   Internet community, and requests discussion and suggestions for
    1.21 +   improvements.  Please refer to the current edition of the "Internet
    1.22 +   Official Protocol Standards" (STD 1) for the standardization state
    1.23 +   and status of this protocol.  Distribution of this memo is unlimited.
    1.24 +
    1.25 +Abstract
    1.26 +
    1.27 +   This document describes "i;unicode-casemap", a simple case-
    1.28 +   insensitive collation for Unicode strings.  It provides equality,
    1.29 +   substring, and ordering operations.
    1.30 +
    1.31 +1.  Introduction
    1.32 +
    1.33 +   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
    1.34 +   simple to implement and provides case-independent comparisons for the
    1.35 +   26 Latin alphabetics.  It is specified as the default and/or baseline
    1.36 +   comparator in some application protocols, e.g., [IMAP-SORT].
    1.37 +
    1.38 +   However, the "i;ascii-casemap" collation does not produce
    1.39 +   satisfactory results with non-ASCII characters.  It is possible, with
    1.40 +   a modest extension, to provide a more sophisticated collation with
    1.41 +   greater multilingual applicability than "i;ascii-casemap".  This
    1.42 +   extension provides case-independent comparisons for a much greater
    1.43 +   number of characters.  It also collates characters with diacriticals
    1.44 +   with the non-diacritical character forms.
    1.45 +
    1.46 +   This collation, "i;unicode-casemap", is intended to be an alternative
    1.47 +   to, and preferred over, "i;ascii-casemap".  It does not replace the
    1.48 +   "i;basic" collation described in [BASIC].
    1.49 +
    1.50 +2.  Unicode Casemap Collation Description
    1.51 +
    1.52 +   The "i;unicode-casemap" collation is a simple collation which is
    1.53 +   case-insensitive in its treatment of characters.  It provides
    1.54 +   equality, substring, and ordering operations.  The validity test
    1.55 +   operation returns "valid" for any input.
    1.56 +
    1.57 +
    1.58 +
    1.59 +
    1.60 +
    1.61 +Crispin                     Standards Track                     [Page 1]
    1.62 +
    1.63 +RFC 5051                   i;unicode-casemap                October 2007
    1.64 +
    1.65 +
    1.66 +   This collation allows strings in arbitrary (and mixed) character
    1.67 +   sets, as long as the character set for each string is identified and
    1.68 +   it is possible to convert the string to Unicode.  Strings which have
    1.69 +   an unidentified character set and/or cannot be converted to Unicode
    1.70 +   are not rejected, but are treated as binary.
    1.71 +
    1.72 +   Each input string is prepared by converting it to a "titlecased
    1.73 +   canonicalized UTF-8" string according to the following steps, using
    1.74 +   UnicodeData.txt ([UNICODE-DATA]):
    1.75 +
    1.76 +      (1) A Unicode codepoint is obtained from the input string.
    1.77 +
    1.78 +          (a) If the input string is in a known charset that can be
    1.79 +              converted to Unicode, a sequence in the string's charset
    1.80 +              is read and checked for validity according to the rules of
    1.81 +              that charset.  If the sequence is valid, it is converted
    1.82 +              to a Unicode codepoint.  Note that for input strings in
    1.83 +              UTF-8, the UTF-8 sequence must be valid according to the
    1.84 +              rules of [UTF-8]; e.g., overlong UTF-8 sequences are
    1.85 +              invalid.
    1.86 +
    1.87 +          (b) If the input string is in an unknown charset, or an
    1.88 +              invalid sequence occurs in step (1)(a), conversion ceases.
    1.89 +              No further preparation is performed, and any partial
    1.90 +              preparation results are discarded.  The original string is
    1.91 +              used unchanged with the i;octet comparator.
    1.92 +
    1.93 +      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
    1.94 +          are performed on the resulting codepoint from step (1)(a).
    1.95 +
    1.96 +          (a) If the codepoint has a titlecase property in
    1.97 +              UnicodeData.txt (this is normally the same as the
    1.98 +              uppercase property), the codepoint is converted to the
    1.99 +              codepoints in the titlecase property.
   1.100 +
   1.101 +          (b) If the resulting codepoint from (2)(a) has a decomposition
   1.102 +              property of any type in UnicodeData.txt, the codepoint is
   1.103 +              converted to the codepoints in the decomposition property.
   1.104 +              This step is recursively applied to each of the resulting
   1.105 +              codepoints until no more decomposition is possible
   1.106 +              (effectively Normalization Form KD).
   1.107 +
   1.108 +          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
   1.109 +          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
   1.110 +          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
   1.111 +          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
   1.112 +          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
   1.113 +          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
   1.122 +          (COMBINING CARON).  None of U+0044, U+007A, or U+030C has any
   1.123 +          decomposition property.  Therefore, U+01C4 is converted
   1.124 +          to U+0044 U+007A U+030C by this step.
   1.125 +
   1.126 +      (3) The resulting codepoint(s) from step (2) is/are appended, in
   1.127 +          UTF-8 format, to the "titlecased canonicalized UTF-8" string.
   1.128 +
   1.129 +      (4) Repeat from step (1) until there is no more data in the input
   1.130 +          string.
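
Steps (1) through (4) can be sketched in Python as follows.  This is a
minimal illustration under stated assumptions, not a reference
implementation.  The names prepare, simple_titlecase, and decompose are
hypothetical.  unicodedata.decomposition() returns the decomposition field
of UnicodeData.txt for the Unicode version bundled with the interpreter.
Python does not expose the UnicodeData.txt titlecase field directly, so
str.title() is used as an approximation, and any result longer than one
codepoint is treated as "no titlecase property".

```python
import unicodedata


def simple_titlecase(ch: str) -> str:
    """Approximate the UnicodeData.txt titlecase field for one codepoint."""
    t = ch.title()
    # Assumption: str.title() applies full case mappings, which may expand
    # to several codepoints; such results are ignored here.
    return t if len(t) == 1 else ch


def decompose(ch: str) -> str:
    """Step (2)(b): recursively expand the decomposition field."""
    fields = unicodedata.decomposition(ch).split()
    # Drop a leading type tag such as "<compat>"; any type is accepted.
    if fields and fields[0].startswith("<"):
        fields = fields[1:]
    if not fields:
        return ch
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def prepare(data: bytes, charset=None) -> bytes:
    """Return "titlecased canonicalized UTF-8", or the input unchanged."""
    try:
        if charset is None:
            raise LookupError("unknown charset")
        text = data.decode(charset, errors="strict")    # step (1)(a)
    except (LookupError, UnicodeDecodeError):
        return data                                      # step (1)(b)
    out = []
    for ch in text:                                      # step (4)
        out.append(decompose(simple_titlecase(ch)))      # step (2)
    return "".join(out).encode("utf-8")                  # step (3)
```

With this sketch, prepare("\u01c4".encode("utf-8"), "utf-8") yields the
UTF-8 encoding of U+0044 U+007A U+030C, matching the example in step (2),
while an overlong sequence such as b"\xc0\xaf" is returned unchanged.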
   1.131 +
   1.132 +   Following the above preparation process on each string, the equality,
   1.133 +   ordering, and substring operations are as for i;octet.
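
As a rough sketch, the three i;octet operations on already-prepared
strings reduce to plain byte-string operations.  The function names below
are hypothetical, and the inputs are assumed to be the output of the
preparation process described above.

```python
def equal(a: bytes, b: bytes) -> bool:
    """Equality: the prepared octet strings are identical."""
    return a == b


def order(a: bytes, b: bytes) -> int:
    """Ordering: -1, 0, or 1 by unsigned octet-by-octet comparison."""
    return (a > b) - (a < b)


def substring(needle: bytes, haystack: bytes) -> bool:
    """Substring: the prepared needle occurs in the prepared haystack."""
    return needle in haystack


# U+0044 U+007A U+030C, the prepared form of U+01C4 from the example.
prepared_dz = "Dz\u030c".encode("utf-8")
```

Python compares bytes objects lexicographically by unsigned octet value,
with a proper prefix sorting first, which is the behaviour assumed here
for i;octet.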
   1.134 +
   1.135 +   It is permitted to use an alternative implementation of the above
   1.136 +   preparation process if it produces the same results.  For example, it
   1.137 +   may be more convenient for an implementation to convert all input
   1.138 +   strings to a sequence of UTF-16 or UTF-32 values prior to performing
   1.139 +   any of the step (2) actions.  Similarly, if all input strings are (or
   1.140 +   are convertible to) Unicode, it may be possible to use UTF-32 as an
   1.141 +   alternative to UTF-8 in step (3).
   1.142 +
   1.143 +      Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
   1.144 +      because UTF-16 surrogates will cause i;octet to collate codepoints
   1.145 +      U+E000 through U+FFFF after non-BMP codepoints.
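
The effect described in the note can be checked with a short sketch.  The
helper name octet_order is hypothetical.  It compares two codepoints by
the octets of a given encoding: UTF-8 and UTF-32 (big-endian) preserve
codepoint order, while UTF-16 places BMP codepoints above the surrogate
block (such as U+FFFF) after non-BMP codepoints.

```python
def octet_order(a: str, b: str, encoding: str) -> int:
    """Return -1, 0, or 1 comparing a and b by their encoded octets."""
    ea, eb = a.encode(encoding), b.encode(encoding)
    return (ea > eb) - (ea < eb)


bmp_high = "\uffff"       # last BMP codepoint
non_bmp = "\U00010000"    # first non-BMP codepoint

# Codepoint order is U+FFFF < U+10000.
utf8_result = octet_order(bmp_high, non_bmp, "utf-8")       # order kept
utf32_result = octet_order(bmp_high, non_bmp, "utf-32-be")  # order kept
utf16_result = octet_order(bmp_high, non_bmp, "utf-16-be")  # order reversed
```

In UTF-16 the non-BMP codepoint begins with a surrogate code unit in the
range D800 through DBFF, which is numerically below FFFF, hence the
reversal.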
   1.146 +
   1.147 +   This collation is not locale sensitive.  Consequently, care should be
   1.148 +   taken when using OS-supplied functions to implement this collation.
   1.149 +   Functions such as strcasecmp and toupper are sometimes locale
   1.150 +   sensitive and may inconsistently casemap letters.
   1.151 +
   1.152 +   The i;unicode-casemap collation is well suited to use with many
   1.153 +   Internet protocols and computer languages.  Use with natural language
   1.154 +   is often inappropriate; even though the collation apparently supports
   1.155 +   languages such as Swahili and English, in real-world use it tends to
   1.156 +   mis-sort a number of types of string:
   1.157 +
   1.158 +   o  people and place names containing scripts that are not collated
   1.159 +      according to "alphabetical order".
   1.160 +   o  words with characters that have diacriticals.  However,
   1.161 +      i;unicode-casemap generally does a better job than i;ascii-casemap
   1.162 +      for most (but not all) languages.  For example, German umlaut
   1.163 +      letters will sort correctly, but some Scandinavian letters will
   1.164 +      not.
   1.165 +   o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
   1.166 +      in English).
   1.167 +   o  strings containing other non-letter symbols; e.g., euro and pound
   1.168 +      sterling symbols, quotation marks other than '"', dashes/hyphens,
   1.169 +      etc.
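
The diacritical case can be illustrated with a small sketch.  As an
assumption, approx_key (a hypothetical name) stands in for the preparation
process by uppercasing and applying NFKD; for the sample words below this
gives the same octets as steps (1) through (4), but it is not equivalent
in general.

```python
import unicodedata


def approx_key(s: str) -> bytes:
    """Stand-in sort key: uppercase, NFKD, then UTF-8 octets."""
    return unicodedata.normalize("NFKD", s.upper()).encode("utf-8")


# German: A-umlaut (U+00C4) collates next to plain A, as expected.
german = sorted(["Zebra", "\u00c4rger", "Apfel"], key=approx_key)

# Swedish: A-ring (U+00C5) is a separate letter that sorts after Z,
# but here it also collates next to plain A.
swedish = sorted(["Zorn", "\u00c5sa", "Adam"], key=approx_key)
```

The German list comes out as Apfel, Ärger, Zebra.  The Swedish list comes
out as Adam, Åsa, Zorn, whereas Swedish alphabetical order would place Åsa
last.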
   1.170 +
   1.178 +3.  Unicode Casemap Collation Registration
   1.179 +
   1.180 +   <?xml version='1.0'?>
   1.181 +   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   1.182 +   <collation rfc="5051" scope="global" intendedUse="common">
   1.183 +   <identifier>i;unicode-casemap</identifier>
   1.184 +   <title>Unicode Casemap</title>
   1.185 +   <operations>equality order substring</operations>
   1.186 +   <specification>RFC 5051</specification>
   1.187 +   <owner>IETF</owner>
   1.188 +   <submitter>mrc@cac.washington.edu</submitter>
   1.189 +   </collation>
   1.190 +
   1.191 +4.  Security Considerations
   1.192 +
   1.193 +   The security considerations for [UTF-8], [STRINGPREP], and [UNICODE-
   1.194 +   SECURITY] apply and are normative to this specification.
   1.195 +
   1.196 +   The results from this comparator will vary depending upon the
   1.197 +   implementation for several reasons.  Implementations MUST consider
   1.198 +   whether these possibilities are a problem for their use case:
   1.199 +
   1.200 +   1) New characters added in Unicode may have decomposition or
   1.201 +      titlecase properties that will not be known to an implementation
   1.202 +      based upon an older revision of Unicode.  This impacts step (2).
   1.203 +
   1.204 +   2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
   1.205 +      does not require normalization of out-of-order diacriticals.
   1.206 +      However, an implementation MAY use an NFKD library routine that
   1.207 +      does such normalization.  This impacts step (2)(b) and possibly
   1.208 +      also step (1)(a), and is an issue only with ill-formed UTF-8
   1.209 +      input.
   1.210 +
   1.211 +   3) The set of charsets handled in step (1)(a) is open-ended.  UTF-8
   1.212 +      (and, by extension, US-ASCII) are the only mandatory-to-implement
   1.213 +      charsets.  This impacts step (1)(a).
   1.214 +
   1.215 +      Implementations SHOULD, as far as feasible, support all the
   1.216 +      charsets they are likely to encounter in the input data, in order
   1.217 +      to avoid poor collation caused by the fall through to the (1)(b)
   1.218 +      rule.
   1.219 +
   1.220 +   4) Other charsets may have revisions which add new characters that
   1.221 +      are not known to an implementation based upon an older revision.
   1.222 +      This impacts step (1)(a) and possibly also step (1)(b).
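
Item 2 can be made concrete with a sketch that compares recursive
decomposition alone against a library NFKD routine.  The name
decompose_only is hypothetical.  The sample input has two combining marks
in non-canonical order: U+0307 (combining class 230) followed by U+0323
(combining class 220).

```python
import unicodedata


def decompose_only(text: str) -> str:
    """Recursive decomposition as in step (2)(b), with no reordering."""
    def expand(ch: str) -> str:
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0].startswith("<"):
            fields = fields[1:]          # drop a tag such as "<compat>"
        if not fields:
            return ch
        return "".join(expand(chr(int(f, 16))) for f in fields)
    return "".join(expand(ch) for ch in text)


sample = "q\u0307\u0323"                          # dot above, then dot below
plain = decompose_only(sample)                    # marks stay in input order
library = unicodedata.normalize("NFKD", sample)   # marks are reordered
```

Here plain keeps the input order, while library is "q\u0323\u0307", so two
implementations that differ in this respect produce different octet
strings for the same input.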
   1.223 +
   1.234 +   An attacker may create input that is ill-formed or in an unknown
   1.235 +   charset, with the intention of impacting the results of this
   1.236 +   comparator or exploiting other parts of the system which process this
   1.237 +   input in different ways.  Note, however, that even well-formed data
   1.238 +   in a known charset can impact the result of this comparator in
   1.239 +   unexpected ways.  For example, an attacker can substitute U+0041
   1.240 +   (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
   1.241 +   U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
   1.242 +   non-match of strings which visually appear the same and/or causing
   1.243 +   the string to appear elsewhere in a sort.
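
A short sketch of the substitution example: the three visually similar
capitals have no decomposition property, so nothing in the preparation
process unifies them and their UTF-8 octets differ.  The variable names
are illustrative only.

```python
import unicodedata

lookalikes = ["\u0041", "\u0391", "\u0410"]   # Latin A, Greek Alpha, Cyrillic A

# Character names as recorded in the interpreter's Unicode database.
names = [unicodedata.name(ch) for ch in lookalikes]

# None of the three has a decomposition, so step (2)(b) leaves them as-is.
decompositions = [unicodedata.decomposition(ch) for ch in lookalikes]

# The octet strings compared by i;octet are all distinct.
octets = [ch.encode("utf-8") for ch in lookalikes]
all_distinct = len(set(octets)) == len(octets)
```

Because the octets differ, strings that look identical on screen neither
compare equal nor sort together.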
   1.244 +
   1.245 +5.  IANA Considerations
   1.246 +
   1.247 +   The i;unicode-casemap collation defined in section 2 has been added
   1.248 +   to the registry of collations defined in [COMPARATOR].
   1.249 +
   1.250 +6.  Normative References
   1.251 +
   1.252 +   [COMPARATOR]          Newman, C., Duerst, M., and A. Gulbrandsen,
   1.253 +                         "Internet Application Protocol Collation
   1.254 +                         Registry", RFC 4790, February 2007.
   1.255 +
   1.256 +   [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
   1.257 +                         Internationalized Strings ("stringprep")", RFC
   1.258 +                         3454, December 2002.
   1.259 +
   1.260 +   [UTF-8]               Yergeau, F., "UTF-8, a transformation format of
   1.261 +                         ISO 10646", STD 63, RFC 3629, November 2003.
   1.262 +
   1.263 +   [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
   1.264 +                         UnicodeData.txt>
   1.265 +
   1.266 +                         Although the UnicodeData.txt file referenced
   1.267 +                         here is part of the Unicode standard, it is
   1.268 +                         subject to change as new characters are added
   1.269 +                         to Unicode and errors are corrected in Unicode
   1.270 +                         revisions.  As a result, it may be less stable
   1.271 +                         than might otherwise be implied by the
   1.272 +                         standards status of this specification.
   1.273 +
   1.274 +   [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
   1.275 +                         Considerations", February 2006,
   1.276 +                         <http://www.unicode.org/reports/tr36/>.
   1.277 +
   1.290 +7.  Informative References
   1.291 +
   1.292 +   [BASIC]               Newman, C., Duerst, M., and A. Gulbrandsen,
   1.293 +                         "i;basic - the Unicode Collation Algorithm",
   1.294 +                         Work in Progress, March 2007.
   1.295 +
   1.296 +   [IMAP-SORT]           Crispin, M. and K. Murchison, "Internet Message
   1.297 +                         Access Protocol - SORT and THREAD Extensions",
   1.298 +                         Work in Progress, September 2007.
   1.299 +
   1.300 +Author's Address
   1.301 +
   1.302 +   Mark R. Crispin
   1.303 +   Networks and Distributed Computing
   1.304 +   University of Washington
   1.305 +   4545 15th Avenue NE
   1.306 +   Seattle, WA  98105-4527
   1.307 +
   1.308 +   Phone: +1 (206) 543-5762
   1.309 +   EMail: MRC@CAC.Washington.EDU
   1.310 +
   1.346 +Full Copyright Statement
   1.347 +
   1.348 +   Copyright (C) The IETF Trust (2007).
   1.349 +
   1.350 +   This document is subject to the rights, licenses and restrictions
   1.351 +   contained in BCP 78, and except as set forth therein, the authors
   1.352 +   retain all their rights.
   1.353 +
   1.354 +   This document and the information contained herein are provided on an
   1.355 +   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   1.356 +   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   1.357 +   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   1.358 +   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   1.359 +   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   1.360 +   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
   1.361 +
   1.362 +Intellectual Property
   1.363 +
   1.364 +   The IETF takes no position regarding the validity or scope of any
   1.365 +   Intellectual Property Rights or other rights that might be claimed to
   1.366 +   pertain to the implementation or use of the technology described in
   1.367 +   this document or the extent to which any license under such rights
   1.368 +   might or might not be available; nor does it represent that it has
   1.369 +   made any independent effort to identify any such rights.  Information
   1.370 +   on the procedures with respect to rights in RFC documents can be
   1.371 +   found in BCP 78 and BCP 79.
   1.372 +
   1.373 +   Copies of IPR disclosures made to the IETF Secretariat and any
   1.374 +   assurances of licenses to be made available, or the result of an
   1.375 +   attempt made to obtain a general license or permission for the use of
   1.376 +   such proprietary rights by implementers or users of this
   1.377 +   specification can be obtained from the IETF on-line IPR repository at
   1.378 +   http://www.ietf.org/ipr.
   1.379 +
   1.380 +   The IETF invites any interested party to bring to its attention any
   1.381 +   copyrights, patents or patent applications, or other proprietary
   1.382 +   rights that may cover technology that may be required to implement
   1.383 +   this standard.  Please address the information to the IETF at
   1.384 +   ietf-ipr@ietf.org.

UW-IMAP'd extensions by yuuji