imapext-2007: ada5e610ab86 docs/rfc/rfc5051.txt

imap-2007e

author	yuuji@gentei.org
date	Mon, 14 Sep 2009 15:17:45 +0900

Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007

i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Abstract

This document describes "i;unicode-casemap", a simple case-insensitive
collation for Unicode strings. It provides equality, substring, and
ordering operations.

1. Introduction

The "i;ascii-casemap" collation described in [COMPARATOR] is quite
simple to implement and provides case-independent comparisons for the
26 Latin alphabetics. It is specified as the default and/or baseline
comparator in some application protocols, e.g., [IMAP-SORT].

However, the "i;ascii-casemap" collation does not produce
satisfactory results with non-ASCII characters. It is possible, with
a modest extension, to provide a more sophisticated collation with
greater multilingual applicability than "i;ascii-casemap". This
extension provides case-independent comparisons for a much greater
number of characters. It also collates characters with diacriticals
with the non-diacritical character forms.

This collation, "i;unicode-casemap", is intended to be an alternative
to, and preferred over, "i;ascii-casemap". It does not replace the
"i;basic" collation described in [BASIC].

2. Unicode Casemap Collation Description

The "i;unicode-casemap" collation is a simple collation which is
case-insensitive in its treatment of characters. It provides
equality, substring, and ordering operations. The validity test
operation returns "valid" for any input.

Crispin                     Standards Track                     [Page 1]

RFC 5051                   i;unicode-casemap                October 2007

This collation allows strings in arbitrary (and mixed) character
sets, as long as the character set for each string is identified and
it is possible to convert the string to Unicode. Strings which have
an unidentified character set and/or cannot be converted to Unicode
are not rejected, but are treated as binary.

Each input string is prepared by converting it to a "titlecased
canonicalized UTF-8" string according to the following steps, using
UnicodeData.txt ([UNICODE-DATA]):

(1) A Unicode codepoint is obtained from the input string.

    (a) If the input string is in a known charset that can be
        converted to Unicode, a sequence in the string's charset
        is read and checked for validity according to the rules of
        that charset. If the sequence is valid, it is converted
        to a Unicode codepoint. Note that for input strings in
        UTF-8, the UTF-8 sequence must be valid according to the
        rules of [UTF-8]; e.g., overlong UTF-8 sequences are
        invalid.

    (b) If the input string is in an unknown charset, or an
        invalid sequence occurs in step (1)(a), conversion ceases.
        No further preparation is performed, and any partial
        preparation results are discarded. The original string is
        used unchanged with the i;octet comparator.

(2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
    are performed on the resulting codepoint from step (1)(a).

    (a) If the codepoint has a titlecase property in
        UnicodeData.txt (this is normally the same as the
        uppercase property), the codepoint is converted to the
        codepoints in the titlecase property.

    (b) If the resulting codepoint from (2)(a) has a decomposition
        property of any type in UnicodeData.txt, the codepoint is
        converted to the codepoints in the decomposition property.
        This step is recursively applied to each of the resulting
        codepoints until no more decomposition is possible
        (effectively Normalization Form KD).

    Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
    has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
    WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
    decomposition property of U+0044 (LATIN CAPITAL LETTER D)
    U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
    decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
    (COMBINING CARON). None of U+0044, U+007A, or U+030C has a
    decomposition property. Therefore, U+01C4 is converted to
    U+0044 U+007A U+030C by this step.

(3) The resulting codepoint(s) from step (2) is/are appended, in
    UTF-8 format, to the "titlecased canonicalized UTF-8" string.

(4) Repeat from step (1) until there is no more data in the input
    string.

Following the above preparation process on each string, the equality,
ordering, and substring operations are as for i;octet.
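
The preparation steps and the three operations described above can be
sketched in Python as follows. This is a minimal sketch under stated
assumptions, not a reference implementation: the names prepare, equal,
order, and substring are hypothetical; the Unicode data is whatever
revision ships with the standard unicodedata module; and the titlecase
step only approximates the UnicodeData.txt titlecase property by
accepting the result of str.title() when it is a single codepoint.

```python
import unicodedata


def _titlecase(ch):
    # Step (2)(a). Assumption: str.title() applies full case mappings, so
    # a result longer than one codepoint is treated as "no titlecase
    # property" and the codepoint is left unchanged.
    t = ch.title()
    return t if len(t) == 1 else ch


def _decompose(ch):
    # Step (2)(b): recursively expand the decomposition property.
    fields = unicodedata.decomposition(ch).split()
    # Skip a leading compatibility tag such as "<compat>".
    points = [f for f in fields if not f.startswith("<")]
    if not points:
        return ch
    return "".join(_decompose(chr(int(p, 16))) for p in points)


def prepare(data, charset="utf-8"):
    """Return the prepared form of a byte string in the named charset."""
    try:
        text = data.decode(charset)      # step (1)(a): strict decoding
    except (LookupError, UnicodeDecodeError):
        return data                      # step (1)(b): use input unchanged
    # Steps (2) to (4): map each codepoint, append the result as UTF-8.
    return "".join(_decompose(_titlecase(c)) for c in text).encode("utf-8")


def equal(a, b):
    # Equality is octet equality of the prepared strings.
    return prepare(a) == prepare(b)


def order(a, b):
    # Ordering is octet ordering; returns -1, 0, or 1.
    pa, pb = prepare(a), prepare(b)
    return (pa > pb) - (pa < pb)


def substring(needle, haystack):
    # Substring is an octet substring match on the prepared strings.
    return prepare(needle) in prepare(haystack)


if __name__ == "__main__":
    print(prepare("\u01c4".encode("utf-8")))          # the U+01C4 example
    print(equal("caf\u00e9".encode("utf-8"), "CAFE\u0301".encode("utf-8")))
    print(prepare(b"\xc0\x80"))                       # overlong UTF-8
```

The overlong sequence in the last line fails strict decoding, so the
sketch returns the original octets unchanged, as step (1)(b) requires.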

It is permitted to use an alternative implementation of the above
preparation process if it produces the same results. For example, it
may be more convenient for an implementation to convert all input
strings to a sequence of UTF-16 or UTF-32 values prior to performing
any of the step (2) actions. Similarly, if all input strings are (or
are convertible to) Unicode, it may be possible to use UTF-32 as an
alternative to UTF-8 in step (3).

Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
because UTF-16 surrogates will cause i;octet to collate codepoints
U+E000 through U+FFFF after non-BMP codepoints.
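
The ordering problem in the note above can be checked directly. The
following sketch compares the octet order of one BMP codepoint above
the surrogate range with one non-BMP codepoint in three encodings; the
function name preserves_order and the choice of sample codepoints are
illustrative assumptions.

```python
BMP = "\ue000"          # a BMP codepoint above the surrogate range
NON_BMP = "\U00010000"  # the first non-BMP codepoint


def preserves_order(encoding):
    # True if octet comparison agrees with codepoint order for the pair.
    return BMP.encode(encoding) < NON_BMP.encode(encoding)


if __name__ == "__main__":
    for enc in ("utf-8", "utf-32-be", "utf-16-be"):
        print(enc, preserves_order(enc))
```

In big-endian UTF-16 the non-BMP codepoint begins with the surrogate
octet 0xD8, which compares lower than the leading octet 0xE0 of U+E000,
so only that encoding reports False.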

This collation is not locale sensitive. Consequently, care should be
taken when using OS-supplied functions to implement this collation.
Functions such as strcasecmp and toupper are sometimes locale
sensitive and may inconsistently casemap letters.

The i;unicode-casemap collation is well suited to use with many
Internet protocols and computer languages. Use with natural language
is often inappropriate; even though the collation apparently supports
languages such as Swahili and English, in real-world use it tends to
mis-sort a number of types of string:

o people and place names containing scripts that are not collated
  according to "alphabetical order".
o words with characters that have diacriticals. However,
  i;unicode-casemap generally does a better job than i;ascii-casemap
  for most (but not all) languages. For example, German umlaut
  letters will sort correctly, but some Scandinavian letters will
  not.
o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
  in English).
o strings containing other non-letter symbols; e.g., euro and pound
  sterling symbols, quotation marks other than '"', dashes/hyphens,
  etc.

3. Unicode Casemap Collation Registration

<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="5051" scope="global" intendedUse="common">
  <identifier>i;unicode-casemap</identifier>
  <title>Unicode Casemap</title>
  <operations>equality order substring</operations>
  <specification>RFC 5051</specification>
  <owner>IETF</owner>
  <submitter>mrc@cac.washington.edu</submitter>
</collation>

4. Security Considerations

The security considerations for [UTF-8], [STRINGPREP], and
[UNICODE-SECURITY] apply and are normative to this specification.

The results from this comparator will vary depending upon the
implementation for several reasons. Implementations MUST consider
whether these possibilities are a problem for their use case:

1) New characters added in Unicode may have decomposition or
   titlecase properties that will not be known to an implementation
   based upon an older revision of Unicode. This impacts step (2).

2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
   does not require normalization of out-of-order diacriticals.
   However, an implementation MAY use an NFKD library routine that
   does such normalization. This impacts step (2)(b) and possibly
   also step (1)(a), and is an issue only with ill-formed UTF-8
   input.

3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
   (and, by extension, US-ASCII) are the only mandatory-to-implement
   charsets. This impacts step (1)(a).

   Implementations SHOULD, as far as feasible, support all the
   charsets they are likely to encounter in the input data, in order
   to avoid poor collation caused by the fall through to the (1)(b)
   rule.

4) Other charsets may have revisions which add new characters that
   are not known to an implementation based upon an older revision.
   This impacts step (1)(a) and possibly also step (1)(b).
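
The difference described in item 2 can be observed with Python's
standard unicodedata module. The sketch below is illustrative only: it
assumes an input whose combining marks are not in canonical order, and
it compares the library's NFKD routine with a per-codepoint lookup of
the decomposition property, which is all that step (2)(b) requires.

```python
import unicodedata

# "a", then COMBINING ACUTE ACCENT, then COMBINING DOT BELOW. The dot
# below has the lower combining class, so canonical ordering would
# place it first.
SAMPLE = "a\u0301\u0323"


def combining_classes(text):
    # Canonical combining class of each codepoint (0 for base letters).
    return [unicodedata.combining(c) for c in text]


def has_any_decomposition(text):
    # True if step (2)(b) would change at least one codepoint.
    return any(unicodedata.decomposition(c) for c in text)


if __name__ == "__main__":
    nfkd = unicodedata.normalize("NFKD", SAMPLE)
    print(combining_classes(SAMPLE))
    print(has_any_decomposition(SAMPLE))
    print(nfkd == SAMPLE)
```

None of the three codepoints has a decomposition property, so the
per-codepoint process leaves the string as it is, while the NFKD
routine swaps the two marks. Two implementations could therefore
prepare this input differently.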

An attacker may create input that is ill-formed or in an unknown
charset, with the intention of impacting the results of this
comparator or exploiting other parts of the system which process this
input in different ways. Note, however, that even well-formed data
in a known charset can impact the result of this comparator in
unexpected ways. For example, an attacker can substitute U+0041
(LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
non-match of strings which visually appear the same and/or causing
the string to appear elsewhere in a sort.
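
The look-alike substitution described above can be inspected with a
short Python sketch. It is illustrative only and the helper name
describe is hypothetical; it lists the three codepoints named in the
text and checks, against the Unicode data built into the unicodedata
module, that none of them has a decomposition property that would map
it onto another.

```python
import unicodedata

LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    # Codepoint, character name, UTF-8 octets, and decomposition
    # property (an empty string means the codepoint has none).
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        ch.encode("utf-8"),
        unicodedata.decomposition(ch),
    )


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(describe(ch))
```

Because the three letters keep distinct UTF-8 encodings after
preparation, strings that differ only in these letters compare as
unequal under i;octet.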

5. IANA Considerations

The i;unicode-casemap collation defined in section 2 has been added
to the registry of collations defined in [COMPARATOR].

6. Normative References

[COMPARATOR]        Newman, C., Duerst, M., and A. Gulbrandsen,
                    "Internet Application Protocol Collation
                    Registry", RFC 4790, February 2007.

[STRINGPREP]        Hoffman, P. and M. Blanchet, "Preparation of
                    Internationalized Strings ("stringprep")", RFC
                    3454, December 2002.

[UTF-8]             Yergeau, F., "UTF-8, a transformation format of
                    ISO 10646", STD 63, RFC 3629, November 2003.

[UNICODE-DATA]      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>

                    Although the UnicodeData.txt file referenced
                    here is part of the Unicode standard, it is
                    subject to change as new characters are added
                    to Unicode and errors are corrected in Unicode
                    revisions. As a result, it may be less stable
                    than might otherwise be implied by the
                    standards status of this specification.

[UNICODE-SECURITY]  Davis, M. and M. Suignard, "Unicode Security
                    Considerations", February 2006,
                    <http://www.unicode.org/reports/tr36/>.

7. Informative References

[BASIC]             Newman, C., Duerst, M., and A. Gulbrandsen,
                    "i;basic - the Unicode Collation Algorithm",
                    Work in Progress, March 2007.

[IMAP-SORT]         Crispin, M. and K. Murchison, "Internet Message
                    Access Protocol - SORT and THREAD Extensions",
                    Work in Progress, September 2007.

Author's Address

Mark R. Crispin
Networks and Distributed Computing
University of Washington
4545 15th Avenue NE
Seattle, WA 98105-4527

Phone: +1 (206) 543-5762
EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
