rev |
line source |
yuuji@0
|
1
|
yuuji@0
|
2
|
yuuji@0
|
3
|
yuuji@0
|
4
|
yuuji@0
|
5
|
yuuji@0
|
6
|
yuuji@0
|
7 Network Working Group M. Crispin
|
yuuji@0
|
8 Request for Comments: 5051 University of Washington
|
yuuji@0
|
9 Category: Standards Track October 2007
|
yuuji@0
|
10
|
yuuji@0
|
11
|
yuuji@0
|
12 i;unicode-casemap - Simple Unicode Collation Algorithm
|
yuuji@0
|
13
|
yuuji@0
|
14 Status of This Memo
|
yuuji@0
|
15
|
yuuji@0
|
16 This document specifies an Internet standards track protocol for the
|
yuuji@0
|
17 Internet community, and requests discussion and suggestions for
|
yuuji@0
|
18 improvements. Please refer to the current edition of the "Internet
|
yuuji@0
|
19 Official Protocol Standards" (STD 1) for the standardization state
|
yuuji@0
|
20 and status of this protocol. Distribution of this memo is unlimited.
|
yuuji@0
|
21
|
yuuji@0
|
22 Abstract
|
yuuji@0
|
23
|
yuuji@0
|
24 This document describes "i;unicode-casemap", a simple case-
|
yuuji@0
|
25 insensitive collation for Unicode strings. It provides equality,
|
yuuji@0
|
26 substring, and ordering operations.
|
yuuji@0
|
27
|
yuuji@0
|
28 1. Introduction
|
yuuji@0
|
29
|
yuuji@0
|
30 The "i;ascii-casemap" collation described in [COMPARATOR] is quite
|
yuuji@0
|
31 simple to implement and provides case-independent comparisons for the
|
yuuji@0
|
32 26 Latin alphabetics. It is specified as the default and/or baseline
|
yuuji@0
|
33 comparator in some application protocols, e.g., [IMAP-SORT].
|
yuuji@0
|
34
|
yuuji@0
|
35 However, the "i;ascii-casemap" collation does not produce
|
yuuji@0
|
36 satisfactory results with non-ASCII characters. It is possible, with
|
yuuji@0
|
37 a modest extension, to provide a more sophisticated collation with
|
yuuji@0
|
38 greater multilingual applicability than "i;ascii-casemap". This
|
yuuji@0
|
39 extension provides case-independent comparisons for a much greater
|
yuuji@0
|
40 number of characters. It also collates characters with diacriticals
|
yuuji@0
|
41 with the non-diacritical character forms.
|
yuuji@0
|
42
|
yuuji@0
|
43 This collation, "i;unicode-casemap", is intended to be an alternative
|
yuuji@0
|
44 to, and preferred over, "i;ascii-casemap". It does not replace the
|
yuuji@0
|
45 "i;basic" collation described in [BASIC].
|
yuuji@0
|
46
|
yuuji@0
|
47 2. Unicode Casemap Collation Description
|
yuuji@0
|
48
|
yuuji@0
|
49 The "i;unicode-casemap" collation is a simple collation which is
|
yuuji@0
|
50 case-insensitive in its treatment of characters. It provides
|
yuuji@0
|
51 equality, substring, and ordering operations. The validity test
|
yuuji@0
|
52 operation returns "valid" for any input.
|
yuuji@0
|
53
|
yuuji@0
|
54
|
yuuji@0
|
55
|
yuuji@0
|
56
|
yuuji@0
|
57
|
yuuji@0
|
58 Crispin Standards Track [Page 1]
|
yuuji@0
|
59
|
yuuji@0
|
60 RFC 5051 i;unicode-casemap October 2007
|
yuuji@0
|
61
|
yuuji@0
|
62
|
yuuji@0
|
63 This collation allows strings in arbitrary (and mixed) character
|
yuuji@0
|
64 sets, as long as the character set for each string is identified and
|
yuuji@0
|
65 it is possible to convert the string to Unicode. Strings which have
|
yuuji@0
|
66 an unidentified character set and/or cannot be converted to Unicode
|
yuuji@0
|
67 are not rejected, but are treated as binary.
|
yuuji@0
|
68
|
yuuji@0
|
69 Each input string is prepared by converting it to a "titlecased
|
yuuji@0
|
70 canonicalized UTF-8" string according to the following steps, using
|
yuuji@0
|
71 UnicodeData.txt ([UNICODE-DATA]):
|
yuuji@0
|
72
|
yuuji@0
|
73 (1) A Unicode codepoint is obtained from the input string.
|
yuuji@0
|
74
|
yuuji@0
|
75 (a) If the input string is in a known charset that can be
|
yuuji@0
|
76 converted to Unicode, a sequence in the string's charset
|
yuuji@0
|
77 is read and checked for validity according to the rules of
|
yuuji@0
|
78 that charset. If the sequence is valid, it is converted
|
yuuji@0
|
79 to a Unicode codepoint. Note that for input strings in
|
yuuji@0
|
80 UTF-8, the UTF-8 sequence must be valid according to the
|
yuuji@0
|
81 rules of [UTF-8]; e.g., overlong UTF-8 sequences are
|
yuuji@0
|
82 invalid.
|
yuuji@0
|
83
|
yuuji@0
|
84 (b) If the input string is in an unknown charset, or an
|
yuuji@0
|
85 invalid sequence occurs in step (1)(a), conversion ceases.
|
yuuji@0
|
86 No further preparation is performed, and any partial
|
yuuji@0
|
87 preparation results are discarded. The original string is
|
yuuji@0
|
88 used unchanged with the i;octet comparator.
|
yuuji@0
|
89
|
yuuji@0
|
90 (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
|
yuuji@0
|
91 are performed on the resulting codepoint from step (1)(a).
|
yuuji@0
|
92
|
yuuji@0
|
93 (a) If the codepoint has a titlecase property in
|
yuuji@0
|
94 UnicodeData.txt (this is normally the same as the
|
yuuji@0
|
95 uppercase property), the codepoint is converted to the
|
yuuji@0
|
96 codepoints in the titlecase property.
|
yuuji@0
|
97
|
yuuji@0
|
98 (b) If the resulting codepoint from (2)(a) has a decomposition
|
yuuji@0
|
99 property of any type in UnicodeData.txt, the codepoint is
|
yuuji@0
|
100 converted to the codepoints in the decomposition property.
|
yuuji@0
|
101 This step is recursively applied to each of the resulting
|
yuuji@0
|
102 codepoints until no more decomposition is possible
|
yuuji@0
|
103 (effectively Normalization Form KD).
|
yuuji@0
|
104
|
yuuji@0
|
105 Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
|
yuuji@0
|
106 has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
|
yuuji@0
|
107 WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
|
yuuji@0
|
108 decomposition property of U+0044 (LATIN CAPITAL LETTER D)
|
yuuji@0
|
109 U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
|
yuuji@0
|
110 decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
|
yuuji@0
|
119 (COMBINING CARON). None of U+0044, U+007A, or U+030C has
|
yuuji@0
|
120 any decomposition properties. Therefore, U+01C4 is converted
|
yuuji@0
|
121 to U+0044 U+007A U+030C by this step.
|
yuuji@0
|
122
|
yuuji@0
|
123 (3) The resulting codepoint(s) from step (2) is/are appended, in
|
yuuji@0
|
124 UTF-8 format, to the "titlecased canonicalized UTF-8" string.
|
yuuji@0
|
125
|
yuuji@0
|
126 (4) Repeat from step (1) until there is no more data in the input
|
yuuji@0
|
127 string.
|
yuuji@0
|
128
|
yuuji@0
|
129 Following the above preparation process on each string, the equality,
|
yuuji@0
|
130 ordering, and substring operations are as for i;octet.
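The preparation steps (1) through (4) and the i;octet-based operations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation. The names prepare, equal, order, and contains are hypothetical. The interpreter's bundled unicodedata module stands in for UnicodeData.txt, so results depend on the Unicode version it ships. Python exposes no direct lookup for the simple titlecase property, so str.title() on a single character is used as an approximation, and any multi-character result is ignored.

```python
import unicodedata


def _titlecase(ch):
    # Step (2)(a), approximated: str.title() applies full case mapping, so
    # only single-codepoint results are kept to mimic the simple titlecase
    # field of UnicodeData.txt (an assumption of this sketch).
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def _decompose(ch):
    # Step (2)(b): recursively expand the decomposition property of any
    # type (canonical or compatibility); tags such as "<compat>" are dropped.
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    fields = [f for f in mapping.split() if not f.startswith("<")]
    return "".join(_decompose(chr(int(f, 16))) for f in fields)


def prepare(raw, charset="utf-8"):
    # Step (1): decode strictly. An unknown charset or an invalid sequence
    # (e.g., overlong UTF-8) falls back to the original octets, step (1)(b).
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return raw
    # Steps (2)-(4): map each codepoint and append the result as UTF-8.
    return "".join(_decompose(_titlecase(ch)) for ch in text).encode("utf-8")


def equal(a, b):
    return prepare(a) == prepare(b)


def order(a, b):
    # Returns -1, 0, or +1 by comparing the prepared octets, as i;octet would.
    x, y = prepare(a), prepare(b)
    return (x > y) - (x < y)


def contains(haystack, needle):
    return prepare(needle) in prepare(haystack)


# The U+01C4 example from step (2): expected to yield U+0044 U+007A U+030C.
print(prepare("\u01c4".encode("utf-8")))
```

In this sketch a string that fails decoding is returned unchanged, so it is compared as raw octets against the prepared form of the other string, matching the fall through to i;octet in step (1)(b).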
|
yuuji@0
|
131
|
yuuji@0
|
132 It is permitted to use an alternative implementation of the above
|
yuuji@0
|
133 preparation process if it produces the same results. For example, it
|
yuuji@0
|
134 may be more convenient for an implementation to convert all input
|
yuuji@0
|
135 strings to a sequence of UTF-16 or UTF-32 values prior to performing
|
yuuji@0
|
136 any of the step (2) actions. Similarly, if all input strings are (or
|
yuuji@0
|
137 are convertible to) Unicode, it may be possible to use UTF-32 as an
|
yuuji@0
|
138 alternative to UTF-8 in step (3).
|
yuuji@0
|
139
|
yuuji@0
|
140 Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
|
yuuji@0
|
141 because UTF-16 surrogates will cause i;octet to collate codepoints
|
yuuji@0
|
142 U+E000 through U+FFFF after non-BMP codepoints.
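The ordering problem with UTF-16 can be checked directly. The following Python sketch compares the last BMP codepoint (U+FFFF) with the first non-BMP codepoint (U+10000) as raw octets in three encodings. The function name octet_order_preserved is hypothetical, and big-endian forms without a byte order mark are assumed.

```python
BMP_HIGH = "\uffff"      # a BMP codepoint above the surrogate range
NON_BMP = "\U00010000"   # first codepoint beyond the BMP


def octet_order_preserved(encoding):
    # True when octet-wise comparison agrees with codepoint order.
    return BMP_HIGH.encode(encoding) < NON_BMP.encode(encoding)


for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, octet_order_preserved(encoding))
```

In UTF-16, U+10000 is encoded with a surrogate pair whose first octet is 0xD8, which compares lower than the 0xFF that begins U+FFFF, so the octet order is reversed. UTF-8 and UTF-32 keep the codepoint order.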
|
yuuji@0
|
143
|
yuuji@0
|
144 This collation is not locale sensitive. Consequently, care should be
|
yuuji@0
|
145 taken when using OS-supplied functions to implement this collation.
|
yuuji@0
|
146 Functions such as strcasecmp and toupper are sometimes locale
|
yuuji@0
|
147 sensitive and may inconsistently casemap letters.
|
yuuji@0
|
148
|
yuuji@0
|
149 The i;unicode-casemap collation is well suited to use with many
|
yuuji@0
|
150 Internet protocols and computer languages. Use with natural language
|
yuuji@0
|
151 is often inappropriate; even though the collation apparently supports
|
yuuji@0
|
152 languages such as Swahili and English, in real-world use it tends to
|
yuuji@0
|
153 mis-sort a number of types of string:
|
yuuji@0
|
154
|
yuuji@0
|
155 o people and place names containing scripts that are not collated
|
yuuji@0
|
156 according to "alphabetical order".
|
yuuji@0
|
157 o words with characters that have diacriticals. However,
|
yuuji@0
|
158 i;unicode-casemap generally does a better job than i;ascii-casemap
|
yuuji@0
|
159 for most (but not all) languages. For example, German umlaut
|
yuuji@0
|
160 letters will sort correctly, but some Scandinavian letters will
|
yuuji@0
|
161 not.
|
yuuji@0
|
162 o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
|
yuuji@0
|
163 in English).
|
yuuji@0
|
164 o strings containing other non-letter symbols; e.g., euro and pound
|
yuuji@0
|
165 sterling symbols, quotation marks other than '"', dashes/hyphens,
|
yuuji@0
|
166 etc.
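The diacritical case in the list above can be illustrated with a short Python sketch. It assumes input that is already a Unicode string, uses the interpreter's unicodedata module in place of UnicodeData.txt, and approximates the simple titlecase property with str.title(). The names decompose and sort_key are hypothetical.

```python
import unicodedata


def decompose(ch):
    # Recursive decomposition of any type, as in step (2)(b).
    fields = [f for f in unicodedata.decomposition(ch).split()
              if not f.startswith("<")]
    if not fields:
        return ch
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def sort_key(text):
    # Titlecase each codepoint (approximation), decompose it, and compare
    # the result as UTF-8 octets.
    out = []
    for ch in text:
        mapped = ch.title()
        out.append(decompose(mapped if len(mapped) == 1 else ch))
    return "".join(out).encode("utf-8")


names = ["Zebra", "\u00c5se", "\u00c4pfel", "Apfel"]
print(sorted(names, key=sort_key))
```

Both U+00C4 and U+00C5 decompose to "A" followed by a combining mark, so both names sort among the "A" entries. That placement keeps the German word next to "Apfel", but a Scandinavian alphabet places the letter U+00C5 after "Z".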
|
yuuji@0
|
167
|
yuuji@0
|
175 3. Unicode Casemap Collation Registration
|
yuuji@0
|
176
|
yuuji@0
|
177 <?xml version='1.0'?>
|
yuuji@0
|
178 <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
|
yuuji@0
|
179 <collation rfc="5051" scope="global" intendedUse="common">
|
yuuji@0
|
180 <identifier>i;unicode-casemap</identifier>
|
yuuji@0
|
181 <title>Unicode Casemap</title>
|
yuuji@0
|
182 <operations>equality order substring</operations>
|
yuuji@0
|
183 <specification>RFC 5051</specification>
|
yuuji@0
|
184 <owner>IETF</owner>
|
yuuji@0
|
185 <submitter>mrc@cac.washington.edu</submitter>
|
yuuji@0
|
186 </collation>
|
yuuji@0
|
187
|
yuuji@0
|
188 4. Security Considerations
|
yuuji@0
|
189
|
yuuji@0
|
190 The security considerations for [UTF-8], [STRINGPREP], and [UNICODE-
|
yuuji@0
|
191 SECURITY] apply and are normative to this specification.
|
yuuji@0
|
192
|
yuuji@0
|
193 The results from this comparator will vary depending upon the
|
yuuji@0
|
194 implementation for several reasons. Implementations MUST consider
|
yuuji@0
|
195 whether these possibilities are a problem for their use case:
|
yuuji@0
|
196
|
yuuji@0
|
197 1) New characters added in Unicode may have decomposition or
|
yuuji@0
|
198 titlecase properties that will not be known to an implementation
|
yuuji@0
|
199 based upon an older revision of Unicode. This impacts step (2).
|
yuuji@0
|
200
|
yuuji@0
|
201 2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
|
yuuji@0
|
202 does not require normalization of out-of-order diacriticals.
|
yuuji@0
|
203 However, an implementation MAY use an NFKD library routine that
|
yuuji@0
|
204 does such normalization. This impacts step (2)(b) and possibly
|
yuuji@0
|
205 also step (1)(a), and is an issue only with ill-formed UTF-8
|
yuuji@0
|
206 input.
|
yuuji@0
|
207
|
yuuji@0
|
208 3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
|
yuuji@0
|
209 (and, by extension, US-ASCII) are the only mandatory-to-implement
|
yuuji@0
|
210 charsets. This impacts step (1)(a).
|
yuuji@0
|
211
|
yuuji@0
|
212 Implementations SHOULD, as far as feasible, support all the
|
yuuji@0
|
213 charsets they are likely to encounter in the input data, in order
|
yuuji@0
|
214 to avoid poor collation caused by the fall through to the (1)(b)
|
yuuji@0
|
215 rule.
|
yuuji@0
|
216
|
yuuji@0
|
217 4) Other charsets may have revisions which add new characters that
|
yuuji@0
|
218 are not known to an implementation based upon an older revision.
|
yuuji@0
|
219 This impacts step (1)(a) and possibly also step (1)(b).
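The difference described in item 2 can be observed with a short Python sketch. It applies only the recursive decomposition of step (2)(b) (titlecasing is omitted for brevity) and compares the result with the NFKD routine of the interpreter's unicodedata module. The name decompose is hypothetical.

```python
import unicodedata


def decompose(ch):
    # Recursive decomposition of any type, without canonical reordering.
    fields = [f for f in unicodedata.decomposition(ch).split()
              if not f.startswith("<")]
    if not fields:
        return ch
    return "".join(decompose(chr(int(f, 16))) for f in fields)


# "a" followed by COMBINING ACUTE ACCENT (combining class 230) and then
# COMBINING DOT BELOW (combining class 220): not in canonical order.
text = "a\u0301\u0323"

step_2b = "".join(decompose(ch) for ch in text)
nfkd = unicodedata.normalize("NFKD", text)

print(step_2b == text)   # step (2)(b) leaves the order of the marks as is
print(step_2b == nfkd)   # NFKD reorders the marks by combining class
```

An implementation built on a library NFKD routine and one that follows step (2)(b) literally would therefore produce different prepared strings for this input.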
|
yuuji@0
|
220
|
yuuji@0
|
231 An attacker may create input that is ill-formed or in an unknown
|
yuuji@0
|
232 charset, with the intention of impacting the results of this
|
yuuji@0
|
233 comparator or exploiting other parts of the system which process this
|
yuuji@0
|
234 input in different ways. Note, however, that even well-formed data
|
yuuji@0
|
235 in a known charset can impact the result of this comparator in
|
yuuji@0
|
236 unexpected ways. For example, an attacker can substitute U+0041
|
yuuji@0
|
237 (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
|
yuuji@0
|
238 U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
|
yuuji@0
|
239 non-match of strings which visually appear the same and/or causing
|
yuuji@0
|
240 the string to appear elsewhere in a sort.
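The substitution described above can be made concrete with a small Python sketch. It lists the three look-alike capitals together with their UTF-8 octets and decomposition property, using the interpreter's unicodedata module as a stand-in for UnicodeData.txt. The variable name lookalikes is hypothetical.

```python
import unicodedata

lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        ch.encode("utf-8"),
        repr(unicodedata.decomposition(ch)),
    )
```

None of the three has a decomposition property, and each is already a capital letter, so the preparation in section 2 leaves them as three distinct octet sequences that neither match nor sort together.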
|
yuuji@0
|
241
|
yuuji@0
|
242 5. IANA Considerations
|
yuuji@0
|
243
|
yuuji@0
|
244 The i;unicode-casemap collation defined in section 2 has been added
|
yuuji@0
|
245 to the registry of collations defined in [COMPARATOR].
|
yuuji@0
|
246
|
yuuji@0
|
247 6. Normative References
|
yuuji@0
|
248
|
yuuji@0
|
249 [COMPARATOR] Newman, C., Duerst, M., and A. Gulbrandsen,
|
yuuji@0
|
250 "Internet Application Protocol Collation
|
yuuji@0
|
251 Registry", RFC 4790, February 2007.
|
yuuji@0
|
252
|
yuuji@0
|
253 [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
|
yuuji@0
|
254 Internationalized Strings ("stringprep")", RFC
|
yuuji@0
|
255 3454, December 2002.
|
yuuji@0
|
256
|
yuuji@0
|
257 [UTF-8] Yergeau, F., "UTF-8, a transformation format of
|
yuuji@0
|
258 ISO 10646", STD 63, RFC 3629, November 2003.
|
yuuji@0
|
259
|
yuuji@0
|
260 [UNICODE-DATA] <http://www.unicode.org/Public/UNIDATA/
|
yuuji@0
|
261 UnicodeData.txt>
|
yuuji@0
|
262
|
yuuji@0
|
263 Although the UnicodeData.txt file referenced
|
yuuji@0
|
264 here is part of the Unicode standard, it is
|
yuuji@0
|
265 subject to change as new characters are added
|
yuuji@0
|
266 to Unicode and errors are corrected in Unicode
|
yuuji@0
|
267 revisions. As a result, it may be less stable
|
yuuji@0
|
268 than might otherwise be implied by the
|
yuuji@0
|
269 standards status of this specification.
|
yuuji@0
|
270
|
yuuji@0
|
271 [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
|
yuuji@0
|
272 Considerations", February 2006,
|
yuuji@0
|
273 <http://www.unicode.org/reports/tr36/>.
|
yuuji@0
|
274
|
yuuji@0
|
287 7. Informative References
|
yuuji@0
|
288
|
yuuji@0
|
289 [BASIC] Newman, C., Duerst, M., and A. Gulbrandsen,
|
yuuji@0
|
290 "i;basic - the Unicode Collation Algorithm",
|
yuuji@0
|
291 Work in Progress, March 2007.
|
yuuji@0
|
292
|
yuuji@0
|
293 [IMAP-SORT] Crispin, M. and K. Murchison, "Internet Message
|
yuuji@0
|
294 Access Protocol - SORT and THREAD Extensions",
|
yuuji@0
|
295 Work in Progress, September 2007.
|
yuuji@0
|
296
|
yuuji@0
|
297 Author's Address
|
yuuji@0
|
298
|
yuuji@0
|
299 Mark R. Crispin
|
yuuji@0
|
300 Networks and Distributed Computing
|
yuuji@0
|
301 University of Washington
|
yuuji@0
|
302 4545 15th Avenue NE
|
yuuji@0
|
303 Seattle, WA 98105-4527
|
yuuji@0
|
304
|
yuuji@0
|
305 Phone: +1 (206) 543-5762
|
yuuji@0
|
306 EMail: MRC@CAC.Washington.EDU
|
yuuji@0
|
307
|
yuuji@0
|
343 Full Copyright Statement
|
yuuji@0
|
344
|
yuuji@0
|
345 Copyright (C) The IETF Trust (2007).
|
yuuji@0
|
346
|
yuuji@0
|
347 This document is subject to the rights, licenses and restrictions
|
yuuji@0
|
348 contained in BCP 78, and except as set forth therein, the authors
|
yuuji@0
|
349 retain all their rights.
|
yuuji@0
|
350
|
yuuji@0
|
351 This document and the information contained herein are provided on an
|
yuuji@0
|
352 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
|
yuuji@0
|
353 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
|
yuuji@0
|
354 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
|
yuuji@0
|
355 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
|
yuuji@0
|
356 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
|
yuuji@0
|
357 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
|
yuuji@0
|
358
|
yuuji@0
|
359 Intellectual Property
|
yuuji@0
|
360
|
yuuji@0
|
361 The IETF takes no position regarding the validity or scope of any
|
yuuji@0
|
362 Intellectual Property Rights or other rights that might be claimed to
|
yuuji@0
|
363 pertain to the implementation or use of the technology described in
|
yuuji@0
|
364 this document or the extent to which any license under such rights
|
yuuji@0
|
365 might or might not be available; nor does it represent that it has
|
yuuji@0
|
366 made any independent effort to identify any such rights. Information
|
yuuji@0
|
367 on the procedures with respect to rights in RFC documents can be
|
yuuji@0
|
368 found in BCP 78 and BCP 79.
|
yuuji@0
|
369
|
yuuji@0
|
370 Copies of IPR disclosures made to the IETF Secretariat and any
|
yuuji@0
|
371 assurances of licenses to be made available, or the result of an
|
yuuji@0
|
372 attempt made to obtain a general license or permission for the use of
|
yuuji@0
|
373 such proprietary rights by implementers or users of this
|
yuuji@0
|
374 specification can be obtained from the IETF on-line IPR repository at
|
yuuji@0
|
375 http://www.ietf.org/ipr.
|
yuuji@0
|
376
|
yuuji@0
|
377 The IETF invites any interested party to bring to its attention any
|
yuuji@0
|
378 copyrights, patents or patent applications, or other proprietary
|
yuuji@0
|
379 rights that may cover technology that may be required to implement
|
yuuji@0
|
380 this standard. Please address the information to the IETF at
|
yuuji@0
|
381 ietf-ipr@ietf.org.
|