Network Working Group                                          C. Newman
Request for Comments: 4790                              Sun Microsystems
Category: Standards Track                                      M. Duerst
                                                Aoyama Gakuin University
                                                          A. Gulbrandsen
                                                                    Oryx
                                                              March 2007


            Internet Application Protocol Collation Registry

Status of This Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The IETF Trust (2007).

Abstract

Many Internet application protocols include string-based lookup,
searching, or sorting operations. However, the problem space for
searching and sorting international strings is large, not fully
explored, and is outside the area of expertise for the Internet
Engineering Task Force (IETF). Rather than attempt to solve such a
large problem, this specification creates an abstraction framework so
that application protocols can precisely identify a comparison
function, and the repertoire of comparison functions can be extended
in the future.

Newman, et al.              Standards Track                     [Page 1]

RFC 4790                   Collation Registry                 March 2007

Table of Contents

1. Introduction
   1.1. Conventions Used in This Document
2. Collation Definition and Purpose
   2.1. Definition
   2.2. Purpose
   2.3. Some Other Terms Used in this Document
   2.4. Sort Keys
3. Collation Identifier Syntax
   3.1. Basic Syntax
   3.2. Wildcards
   3.3. Ordering Direction
   3.4. URIs
   3.5. Naming Guidelines
4. Collation Specification Requirements
   4.1. Collation/Server Interface
   4.2. Operations Supported
      4.2.1. Validity
      4.2.2. Equality
      4.2.3. Substring
      4.2.4. Ordering
   4.3. Sort Keys
   4.4. Use of Lookup Tables
5. Application Protocol Requirements
   5.1. Character Encoding
   5.2. Operations
   5.3. Wildcards
   5.4. String Comparison
   5.5. Disconnected Clients
   5.6. Error Codes
   5.7. Octet Collation
6. Use by Existing Protocols
7. Collation Registration
   7.1. Collation Registration Procedure
   7.2. Collation Registration Format
      7.2.1. Registration Template
      7.2.2. The Collation Element
      7.2.3. The Identifier Element
      7.2.4. The Title Element
      7.2.5. The Operations Element
      7.2.6. The Specification Element
      7.2.7. The Submitter Element
      7.2.8. The Owner Element
      7.2.9. The Version Element
      7.2.10. The Variable Element
   7.3. Structure of Collation Registry
   7.4. Example Initial Registry Summary
8. Guidelines for Expert Reviewer
9. Initial Collations
   9.1. ASCII Numeric Collation
      9.1.1. ASCII Numeric Collation Description
      9.1.2. ASCII Numeric Collation Registration
   9.2. ASCII Casemap Collation
      9.2.1. ASCII Casemap Collation Description
      9.2.2. ASCII Casemap Collation Registration
   9.3. Octet Collation
      9.3.1. Octet Collation Description
      9.3.2. Octet Collation Registration
10. IANA Considerations
11. Security Considerations
12. Acknowledgements
13. References
   13.1. Normative References
   13.2. Informative References

1. Introduction

The Application Configuration Access Protocol ACAP [11] specification
introduced the concept of a comparator (which we call collation in
this document), but failed to create an IANA registry. With the
introduction of stringprep [6] and the Unicode Collation Algorithm
[7], it is now time to create that registry and populate it with some
initial values appropriate for an international community. This
specification replaces and generalizes the definition of a comparator
in ACAP, and creates a collation registry.

1.1. Conventions Used in This Document

The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
in this document are to be interpreted as defined in "Key words for
use in RFCs to Indicate Requirement Levels" [1].

The attribute syntax specifications use the Augmented Backus-Naur
Form (ABNF) [2] notation, including the core rules defined in
Appendix A. The ABNF production "Language-tag" is imported from
Language Tags [5] and "reg-name" from URI: Generic Syntax [4].

2. Collation Definition and Purpose

2.1. Definition

A collation is a named function which takes two arbitrary length
strings as input and can be used to perform one or more of three
basic comparison operations: equality test, substring match, and
ordering test.

2.2. Purpose

Collations are an abstraction for comparison functions so that these
comparison functions can be used in multiple protocols. The details
of a particular comparison operation can be specified by someone with
appropriate expertise, independent of the application protocols that
use that collation. This is similar to the way a charset [13]
separates the details of octet to character mapping from a protocol
specification, such as MIME [9], or the way SASL [10] separates the
details of an authentication mechanism from a protocol specification,
such as ACAP [11].

Here is a small diagram to help illustrate the value of this
abstraction:

   +-------------------+                         +-----------------+
   | IMAP i18n SEARCH  |--+                      | Basic           |
   +-------------------+  |                   +--| Collation Spec  |
                          |                   |  +-----------------+
   +-------------------+  |  +-------------+  |  +-----------------+
   | ACAP i18n SEARCH  |--+--| Collation   |--+--| A stringprep    |
   +-------------------+  |  | Registry    |  |  | Collation Spec  |
                          |  +-------------+  |  +-----------------+
   +-------------------+  |                   |  +-----------------+
   | ...other protocol |--+                   |  | locale-specific |
   +-------------------+                      +--| Collation Spec  |
                                                 +-----------------+

Thus IMAP, ACAP, and future application protocols with international
search capability simply specify how to interface to the collation
registry instead of each protocol specification having to specify all
the collations it supports.

2.3. Some Other Terms Used in this Document

The terms client, server, and protocol are used in somewhat unusual
senses.

Client means a user, or a program acting directly on behalf of a
user. This may be a mail reader acting as an IMAP client, or it may
be an interactive shell, where the user can type protocol
commands/requests directly, or it may be a script or program written
by the user.

Server means a program that performs services requested by the
client. This may be a traditional server such as an HTTP server, or
it may be a Sieve [14] interpreter running a Sieve script written by
a user. A server needs to use the operations provided by collations
in order to fulfill the client's requests.

The protocol describes how the client tells the server what it wants
done, and (if applicable) how the server tells the client about the
results. IMAP is a protocol by this definition, and so is the Sieve
language.

2.4. Sort Keys

One component of a collation is a transformation, which turns a
string into a sort key, which is then used while sorting.

The transformation can range from an identity mapping (e.g., the
i;octet collation, Section 9.3) to a mapping that makes the string
unreadable to a human.

This is an implementation detail of collations or servers. A
protocol SHOULD NOT expose it to clients, since some collations leave
the sort key's format up to the implementation, and current
conformant implementations are known to use different formats.

3. Collation Identifier Syntax

3.1. Basic Syntax

The collation identifier itself is a single US-ASCII string. The
identifier MUST NOT be longer than 254 characters, and obeys the
following grammar:

    collation-char      = ALPHA / DIGIT / "-" / ";" / "=" / "."

    collation-id        = collation-prefix ";" collation-core-name
                          *collation-arg

    collation-scope     = Language-tag / "vnd-" reg-name

    collation-prefix    = ("i" / collation-scope)

    collation-core-name = ALPHA *( ALPHA / DIGIT / "-" )

    collation-arg       = ";" ALPHA *( ALPHA / DIGIT ) "="
                          1*( ALPHA / DIGIT / "." )

Note: the ABNF production "Language-tag" is imported from Language
Tags [5] and "reg-name" from URI: Generic Syntax [4].

There is a special identifier called "default". For protocols that
have a default collation, "default" refers to that collation. For
other protocols, the identifier "default" MUST match no collations,
and servers SHOULD treat it in the same way as they treat nonexistent
collations.
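
As an illustration only, the grammar above can be approximated with a
regular expression. The sketch below is not part of this specification.
It assumes simplified stand-ins for the imported "Language-tag" and
"reg-name" productions (LANGUAGE_TAG and REG_NAME are hypothetical
patterns, looser than the real productions), and it treats the prefix
as either "i" or a collation-scope, which the language-tag stand-in
already covers. The special identifier "default" is deliberately not
accepted, since it does not follow the collation-id grammar.

```python
import re

# Assumption: simplified stand-ins for the imported productions.
# The real Language-tag and reg-name grammars are richer than this.
LANGUAGE_TAG = r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*"
REG_NAME = r"[A-Za-z0-9.\-]+"

# collation-scope = Language-tag / "vnd-" reg-name  ("i" fits LANGUAGE_TAG)
SCOPE = rf"(?:vnd-{REG_NAME}|{LANGUAGE_TAG})"
# collation-core-name = ALPHA *( ALPHA / DIGIT / "-" )
CORE_NAME = r"[A-Za-z][A-Za-z0-9\-]*"
# collation-arg = ";" ALPHA *( ALPHA / DIGIT ) "=" 1*( ALPHA / DIGIT / "." )
ARG = r";[A-Za-z][A-Za-z0-9]*=[A-Za-z0-9.]+"

COLLATION_ID = re.compile(rf"(?:{SCOPE});{CORE_NAME}(?:{ARG})*")
MAX_LENGTH = 254  # identifiers MUST NOT be longer than 254 characters


def is_collation_id(text: str) -> bool:
    """Return True if text looks like a collation-id under the assumptions above."""
    if len(text) > MAX_LENGTH or not text.isascii():
        return False
    return COLLATION_ID.fullmatch(text) is not None
```

A caller would typically handle "default" separately before applying
this check, because its meaning depends on the protocol.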

3.2. Wildcards

The string a client uses to select a collation MAY contain one or
more wildcard ("*") characters that match zero or more
collation-chars. Wildcard characters MUST NOT be adjacent. If the
wildcard string matches multiple collations, the server SHOULD
attempt to select a widely useful collation in preference to a
narrowly useful one.

    collation-wild = ("*" / (ALPHA ["*"])) *(collation-char ["*"])
                     ; MUST NOT exceed 254 characters total
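
One possible way to evaluate such a pattern is to translate it into a
regular expression in which each "*" matches zero or more
collation-chars. The following is a sketch under that assumption; the
function names are hypothetical, and the code does not check that the
pattern begins with "*" or ALPHA. It returns every match and leaves the
choice of the most widely useful collation to the server.

```python
import re

# collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "."
COLLATION_CHAR = r"[A-Za-z0-9\-;=.]"


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a collation-wild string into a compiled regular expression."""
    if "**" in pattern:
        raise ValueError("wildcard characters must not be adjacent")
    if len(pattern) > 254:
        raise ValueError("pattern exceeds 254 characters")
    # Literal pieces are escaped; each "*" becomes "zero or more collation-chars".
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile((COLLATION_CHAR + "*").join(pieces))


def matching_collations(pattern: str, available: list[str]) -> list[str]:
    """Return the available collation identifiers that match the pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in available if regex.fullmatch(name)]
```

For example, with the identifiers "i;octet", "i;ascii-casemap", and
"i;ascii-numeric" available, the pattern "i;ascii-*" selects the latter
two.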

3.3. Ordering Direction

When used as a protocol element for ordering, the collation
identifier MAY be prefixed by either "+" or "-" to explicitly specify
an ordering direction. "+" has no effect on the ordering operation,
while "-" inverts the result of the ordering operation. In general,
collation-order is used when a client requests a collation, and
collation-selected is used when the server informs the client of the
selected collation.

    collation-selected = ["+" / "-"] collation-id

    collation-order    = ["+" / "-"] collation-wild
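
A minimal sketch of how a receiver might separate the optional
direction prefix from the rest of the string is shown below. The
function name split_direction is hypothetical and not defined by this
specification.

```python
def split_direction(collation_order: str) -> tuple[bool, str]:
    """Split an optional "+" or "-" prefix from a collation string.

    Returns (reverse, name). reverse is True only for a "-" prefix,
    because "+" has no effect on the ordering operation.
    """
    if collation_order[:1] in ("+", "-"):
        return collation_order[0] == "-", collation_order[1:]
    return False, collation_order
```

The returned name can then be matched as a collation-wild or
collation-id, and the flag applied to the result of the ordering
operation.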

3.4. URIs

Some protocols are designed to use URIs [4] to refer to collations
rather than simple tokens. A special section of the IANA URL space
is reserved for such usage. The "collation-uri" form is used to
refer to a specific named collation (the collation registration may
not actually be present). The "collation-auri" form is an abstract
name for an ordering, a collation pattern or a vendor private
collator.

    collation-uri  = "http://www.iana.org/assignments/collation/"
                     collation-id ".xml"

    collation-auri = ( "http://www.iana.org/assignments/collation/"
                     collation-order ".xml" ) / other-uri

    other-uri      = <absoluteURI>
                     ; excluding the IANA collation namespace.
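
The "collation-uri" form is a plain concatenation, which the following
sketch illustrates. The helper names are hypothetical, and the code
assumes the identifier has already been validated; it performs no
escaping of its own.

```python
from typing import Optional

IANA_COLLATION_BASE = "http://www.iana.org/assignments/collation/"


def collation_uri(collation_id: str) -> str:
    """Build the collation-uri for an already validated collation-id."""
    return f"{IANA_COLLATION_BASE}{collation_id}.xml"


def collation_id_from_uri(uri: str) -> Optional[str]:
    """Recover the identifier from a collation-uri, or None for other URIs."""
    if uri.startswith(IANA_COLLATION_BASE) and uri.endswith(".xml"):
        return uri[len(IANA_COLLATION_BASE):-len(".xml")]
    return None
```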

3.5. Naming Guidelines

While this specification makes no absolute requirements on the
structure of collation identifiers, naming consistency is important,
so the following initial guidelines are provided.

Collation identifiers with an international audience typically begin
with "i;". Collation identifiers intended for a particular language
or locale typically begin with a language tag [5] followed by a ";".
After the first ";" is normally the name of the general collation
algorithm, followed by a series of algorithm modifications separated
by the ";" delimiter. Parameterized modifications will use "=" to
delimit the parameter from the value. The version numbers of any
lookup tables used by the algorithm SHOULD be present as
parameterized modifications.

Collation identifiers of the form *;vnd-hostname;* are reserved for
vendor-specific collations created by the owner of the hostname
following the "vnd-" prefix (e.g., vnd-example.com for the vendor
example.com). Registration of such collations (or the name space as
a whole), with an intended use of "vendor", is encouraged when a
public specification or open-source implementation is available, but
is not required.

4. Collation Specification Requirements

4.1. Collation/Server Interface

The collation itself defines what it operates on. Most collations
are expected to operate on character strings. The i;octet
(Section 9.3) collation operates on octet strings. The
i;ascii-numeric (Section 9.1) collation operates on numbers.

This specification defines the collation interface in terms of octet
strings. However, implementations may choose to use character
strings instead. Such implementations may not be able to implement,
e.g., i;octet. Since i;octet is not currently mandatory to implement
for any protocol, this should not be a problem.

4.2. Operations Supported

A collation specification MUST state which of the three basic
operations are supported (equality, substring, ordering) and how to
perform each of the supported operations on any two input character
strings, including empty strings. Collations must be deterministic,
i.e., given a collation with a specific identifier, and any two fixed
input strings, the result MUST be the same for the same operation.

In general, collation operations should behave as their names
suggest. While a collation may be new, the operations are not, so
the new collation's operations should be similar to those of older
collations. For example, a date/time collation should not provide a
"substring" operation that would morph IMAP substring SEARCH into,
e.g., a date-range search.

A non-obvious consequence of the rules for each collation operation
is that, for any single collation, either none or all of the
operations can return "undefined". For example, it is not possible
to have an equality operation that never returns "undefined", and a
substring operation that occasionally does.

4.2.1. Validity

The validity test takes one string as argument. It returns valid if
its input string is a valid input to the collation's other
operations, and invalid if not. (In other words, a string is valid
if it is equal to itself according to the collation's equality
operation.)

The validity test is provided by all collations. It MUST NOT be
listed separately in the collation registration.

4.2.2. Equality

The equality test always returns "match" or "no-match" when it is
supplied valid input, and MAY return "undefined" if one or both input
strings are not valid.

The equality test MUST be reflexive and symmetric. For valid input,
it MUST be transitive.

If a collation provides either a substring or an ordering test, it
MUST also provide an equality test. The substring and/or ordering
tests MUST be consistent with the equality test.

The return values of the equality test are called "match", "no-match"
and "undefined" in this document.
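
To make the relationship between validity and equality concrete, the
following sketch defines a toy collation. It is an assumption made for
illustration and is not one of the registered collations: input is
valid when it decodes as UTF-8, and two valid strings are equal when
they are identical after folding ASCII letters to lower case. The names
Result, is_valid, and equality are hypothetical.

```python
from enum import Enum
from typing import Optional


class Result(Enum):
    MATCH = "match"
    NO_MATCH = "no-match"
    UNDEFINED = "undefined"


def _decode(octets: bytes) -> Optional[str]:
    """Decode UTF-8 input, returning None when the octets are not valid."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


def _canonical(text: str) -> str:
    # Toy rule: fold ASCII letters only, leave every other character alone.
    return "".join(c.lower() if c.isascii() else c for c in text)


def is_valid(octets: bytes) -> bool:
    """Validity test: the input is usable by the other operations."""
    return _decode(octets) is not None


def equality(first: bytes, second: bytes) -> Result:
    """Equality test returning match, no-match, or undefined."""
    a, b = _decode(first), _decode(second)
    if a is None or b is None:
        return Result.UNDEFINED
    return Result.MATCH if _canonical(a) == _canonical(b) else Result.NO_MATCH
```

In this sketch a string is valid exactly when equality(s, s) returns
"match", which mirrors the parenthetical remark in Section 4.2.1.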

4.2.3. Substring

The substring matching operation determines if the first string is a
substring of the second string, i.e., if one or more substrings of
the second string is equal to the first, as defined by the
collation's equality operation.

A collation that supports substring matching will automatically
support two special cases of substring matching: prefix and suffix
matching, if those special cases are supported by the application
protocol. It returns "match" or "no-match" when it is supplied valid
input and returns "undefined" when supplied invalid input.

Application protocols MAY return position information for substring
matches. If this is done, the position information SHOULD include
both the starting offset and the ending offset for each match. This
is important because more sophisticated collations can match strings
of unequal length (for example, a pre-composed accented character can
match a decomposed accented character). In general, overlapping
matches SHOULD be reported (as when "ana" occurs twice within
"banana"), although there are cases where a collation may decide not
to. For example, in a collation which treats all whitespace
sequences as identical, the substring operation could be defined such
that " 1 " (SP "1" SP) is reported just once within "  1  " (SP SP
"1" SP SP), not four times (SP SP "1" SP, SP "1" SP, SP "1" SP SP and
SP SP "1" SP SP), since the four matches are, in a sense, the same
match.

A string is a substring of itself. The empty string is a substring
of all strings.

Note that the substring operation of some collations can match
strings of unequal length. For example, a pre-composed accented
character can match a decomposed accented character. The Unicode
Collation Algorithm [7] discusses this in more detail.

The return values of the substring operation are called "match",
"no-match", and "undefined" in this document.
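
The reporting of overlapping matches with both offsets can be sketched
as follows. The example assumes a toy collation that only folds ASCII
case, so matched spans always have the same length as the search
string; a more sophisticated collation would need a different search
strategy, which is why the end offset is reported explicitly. The
function name find_matches is hypothetical.

```python
import string

# Toy rule (assumption): fold ASCII upper case to lower case, one-to-one,
# so offsets in the folded text equal offsets in the original text.
_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def find_matches(needle: str, haystack: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every match, including overlaps."""
    n, h = needle.translate(_FOLD), haystack.translate(_FOLD)
    matches = []
    start = h.find(n)
    while start != -1:
        matches.append((start, start + len(n)))
        # Advance by one position only, so overlapping matches are found.
        start = h.find(n, start + 1)
    return matches
```

With this sketch, searching for "ana" in "banana" reports the spans
(1, 4) and (3, 6), and the empty string matches at every position.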

4.2.4. Ordering

The ordering operation determines how two strings are ordered. It
MUST be reflexive. For valid input, it MUST be transitive and
trichotomous.

Ordering returns "less" if the first string is listed before the
second string, according to the collation; "greater", if the second
string is listed before the first string; and "equal", if the two
strings are equal, as defined by the collation's equality operation.
If one or both strings are invalid, the result of ordering is
"undefined".

When the collation is used with a "+" prefix, the behavior is the
same as when used with no prefix. When the collation is used with a
"-" prefix, the result of the ordering operation of the collation
MUST be reversed.

The return values of the ordering operation are called "less",
"equal", "greater", and "undefined" in this document.
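
The four return values and the effect of the "-" prefix can be sketched
as below. The toy collation here is an assumption for illustration: it
treats input as valid when it decodes as UTF-8 and orders valid strings
by code point. The function name ordering and its reverse parameter are
hypothetical.

```python
def ordering(first: bytes, second: bytes, reverse: bool = False) -> str:
    """Return "less", "equal", "greater", or "undefined".

    reverse=True models a collation used with the "-" prefix.
    """
    try:
        a, b = first.decode("utf-8"), second.decode("utf-8")
    except UnicodeDecodeError:
        # One or both inputs are invalid for this toy collation.
        return "undefined"
    if a == b:
        return "equal"
    result = "less" if a < b else "greater"
    if reverse:
        # Reversal swaps "less" and "greater"; "equal" is unaffected.
        result = "greater" if result == "less" else "less"
    return result
```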

4.3. Sort Keys

A collation specification SHOULD describe the internal transformation
algorithm to generate sort keys. This algorithm can be applied to
individual strings, and the result can be stored to potentially
optimize future comparison operations. A collation MAY specify that
the sort key is generated by the identity function. The sort key may
have no meaning to a human. The sort key may not be valid input to
the collation.
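
The optimization mentioned above amounts to computing each key once and
comparing keys instead of strings. The following sketch assumes a toy
transformation (ASCII letters folded to upper case, then encoded as
UTF-8); the names sort_key and sort_with_cached_keys are hypothetical.

```python
import string

# Toy transformation (assumption): fold ASCII lower case to upper case.
_FOLD = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def sort_key(text: str) -> bytes:
    """Turn a string into a sort key; the key is opaque to callers."""
    return text.translate(_FOLD).encode("utf-8")


def sort_with_cached_keys(strings: list[str]) -> list[str]:
    """Sort strings by comparing precomputed keys rather than the strings."""
    keyed = [(sort_key(s), s) for s in strings]  # one key per string
    keyed.sort(key=lambda pair: pair[0])         # compare keys only
    return [s for _, s in keyed]
```

A server could store the keys alongside the data and reuse them across
requests, provided the collation and any tables it uses do not change.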

4.4. Use of Lookup Tables

Some collations use customizable lookup tables, e.g., because the
tables depend on locale, and may be modified after shipping the
software. Collations that use more than one customizable lookup
table in a documented format MUST assign numbers to the tables they
use. This permits an application protocol command to access the
tables used by a server collation, so that clients and servers use
the same tables.

5. Application Protocol Requirements

This section describes the requirements and issues that an
application protocol needs to consider if it offers searching,
substring matching and/or sorting, and permits the use of characters
outside the US-ASCII charset.

5.1. Character Encoding

The protocol specification has to make sure that it is clear on which
characters (rather than just octets) the collations are used. This
can be done by specifying the protocol itself in terms of characters
(e.g., in the case of a query language), by specifying a single
character encoding for the protocol (e.g., UTF-8 [3]), or by
carefully describing the relevant issues of character encoding
labeling and conversion. In the latter case, details to consider
include how to handle unknown charsets, any charsets that are
mandatory-to-implement, any issues with byte-order that might apply,
and any transfer encodings that need to be supported.

5.2. Operations

The protocol must specify which of the operations defined in this
specification (equality matching, substring matching, and ordering)
can be invoked in the protocol, and how they are invoked. There may
be more than one way to invoke an operation.

The protocol MUST provide a mechanism for the client to select the
collation to use with equality matching, substring matching, and
ordering.

If a protocol needs a total ordering and the collation chosen does
not provide it because the ordering operation returns "undefined" at
least once, the recommended fallback is to sort all invalid strings
after the valid ones, and use i;octet to order the invalid strings.
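
The recommended fallback can be sketched as follows. The toy collation
is an assumption for illustration: input is valid when it decodes as
UTF-8, and valid strings are ordered by their lower-cased text.
Python's built-in comparison of bytes objects, which compares octet
values position by position, is used here as a stand-in for i;octet
ordering. The function names are hypothetical.

```python
def is_valid(octets: bytes) -> bool:
    """Toy validity test: the octets decode as UTF-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def toy_key(octets: bytes) -> str:
    """Toy ordering key for valid input only."""
    return octets.decode("utf-8").lower()


def total_order(items: list[bytes]) -> list[bytes]:
    """Sort valid strings with the collation, then invalid ones octet-wise."""
    valid = sorted((x for x in items if is_valid(x)), key=toy_key)
    # Stand-in for i;octet: plain octet-by-octet comparison of bytes.
    invalid = sorted(x for x in items if not is_valid(x))
    return valid + invalid
```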

Although the collation's substring function provides a list of
matches, a protocol need not provide all that to the client. It may
provide only the first matching substring, or even just the
information that the substring search matched. In this way,
collations can be used with protocols that are defined such that "x
is a substring of y" returns true-false.

If the protocol provides positional information for the results of a
substring match, that positional information SHOULD fully specify the
substring(s) in the result that matches, independent of the length of
the search string. For example, returning both the starting and
ending offset of the match would suffice, as would the starting
offset and a length. Returning just the starting offset is not
acceptable. This rule is necessary because advanced collations can
treat strings of different lengths as equal (for example,
pre-composed and decomposed accented characters).

5.3. Wildcards

The protocol MUST specify whether it allows the use of wildcards in
collation identifiers. If the protocol allows wildcards, then:

o The protocol MUST specify how comparisons behave in the absence of
  explicit collation negotiation, or when a collation of "default"
  is requested. The protocol MAY specify that the default collation
  used in such circumstances is sensitive to server configuration.

o The protocol SHOULD provide a way to list available collations
  matching a given wildcard pattern, or patterns.

5.4. String Comparison

If a protocol compares strings in any nontrivial way, using a
collation may be appropriate. As an example, many protocols use
case-independent strings. In many cases, a simple ASCII mapping to
upper/lower case works well. In other cases, it may be better to use
a specifiable collation; for example, so that a server can treat "i"
and "I" as equivalent in Italy, and different in Turkey (Turkish also
has a dotted upper-case "I" and a dotless lower-case "i").

Protocol designers should consider, in each case, whether to use a
specifiable collation. Keywords often have other needs than user
variables, and search arguments may be different again.

5.5. Disconnected Clients

If the protocol supports disconnected clients, and a collation is
used that can use configurable tables (e.g., to support
locale-specific extensions), then the client may not be able to
reproduce the server's collation operations while offline.

A mechanism to download such tables has been discussed. Such a
mechanism is not included in the present specification, since the
problem is not yet well understood.

5.6. Error Codes

The protocol specification should consider assigning protocol error
codes for the following circumstances:

o The client requests the use of a collation by identifier or
  pattern, but no implemented collation matches that pattern.

o The client attempts to use a collation for an operation that is
  not supported by that collation -- for example, attempting to use
  the "i;ascii-numeric" collation for substring matching.

o The client uses an equality or substring matching collation, and
  the result is an error. It may be appropriate to distinguish
  between the two input strings, particularly when one is supplied
  by the client and the other is stored by the server. It might
  also be appropriate to distinguish the specific case of an invalid
  UTF-8 string.

5.7. Octet Collation

The i;octet (Section 9.3) collation is only usable with protocols
based on octet-strings. Clients and servers MUST NOT use i;octet
with other protocols.

If the protocol permits the use of collations with data structures
other than strings, the protocol MUST describe the default behavior
for a collation with those data structures.

6. Use by Existing Protocols

This section is informative.

Both ACAP [11] and Sieve [14] are standards track specifications that
used collations prior to the creation of this specification and
registry. Those standards do not meet all the application protocol
requirements described in Section 5.

These protocols allow the use of the i;octet (Section 9.3) collation
working directly on UTF-8 data, as used in these protocols.

In Sieve, all matches are either true or false. Accordingly, Sieve
servers must treat "undefined" and "no-match" results of the equality
and substring operations as false, and only "match" as true.

In ACAP and Sieve, there are no invalid strings. In this document's
terms, invalid strings sort after valid strings.

IMAP [15] also collates, although that is explicit only when the
COMPARATOR [17] extension is used. The built-in IMAP substring
operation and the ordering provided by the SORT [16] extension may
not meet the requirements made in this document.

Other protocols may be in a similar position.

In IMAP, the default collation is i;ascii-casemap, because its
operations are understood to match IMAP's built-in operations.

7. Collation Registration

7.1. Collation Registration Procedure

The IETF will create a mailing list, collation@ietf.org, which can be
used for public discussion of collation proposals prior to
registration. Use of the mailing list is strongly encouraged. The
IESG will appoint a designated expert who will monitor the
collation@ietf.org mailing list and review registrations.

The registration procedure begins when a completed registration
template is sent to iana@iana.org and collation@ietf.org. The
designated expert is expected to tell IANA and the submitter of the
registration within two weeks whether the registration is approved,
approved with minor changes, or rejected with cause. When a
registration is rejected with cause, it can be re-submitted if the
concerns listed in the cause are addressed. Decisions made by the
designated expert can be appealed to the IESG Applications Area
Director, then to the IESG. They follow the normal appeals procedure
for IESG decisions.

Collation registrations in a standards track, BCP, or IESG-approved
experimental RFC are owned by the IETF, and changes to the
registration follow normal procedures for updating such documents.
Collation registrations in other RFCs are owned by the RFC author(s).
Other collation registrations are owned by the individual(s) listed
in the contact field of the registration, and IANA will preserve this
information.

If the registration is a change of an existing collation, it MUST be
approved by the owner. In the event the owner cannot be contacted
for a period of one month, and the designated expert deems the change
necessary, the IESG MAY re-assign ownership to an appropriate party.

7.2. Collation Registration Format

Registration of a collation is done by sending a well-formed XML
document to collation@ietf.org and iana@iana.org.

7.2.1. Registration Template

Here is a template for the registration:

    <?xml version='1.0'?>
    <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
    <collation rfc="YYYY" scope="global" intendedUse="common">
      <identifier>collation identifier</identifier>
      <title>technical title for collation</title>
      <operations>equality order substring</operations>
      <specification>specification reference</specification>
      <owner>email address of owner or IETF</owner>
      <submitter>email address of submitter</submitter>
      <version>1</version>
    </collation>
815 7.2.2. The Collation Element
817 The root of the registration document MUST be a <collation> element.
818 The collation element contains the other elements in the
819 registration, which are described in the following sub-subsections,
820 in the order given here.
822 The <collation> element MAY include an "rfc=" attribute if the
823 specification is in an RFC. The "rfc=" attribute gives only the
824 number of the RFC, without any prefix, such as "RFC", or suffix, such
825 as ".txt".
827 The <collation> element MUST include a "scope=" attribute, which MUST
828 have one of the values "global", "local", or "other".
830 The <collation> element MUST include an "intendedUse=" attribute,
831 which must have one of the values "common", "limited", "vendor", or
832 "deprecated". Collation specifications intended for "common" use are
833 expected to reference standards from standards bodies with
834 significant experience dealing with the details of international
835 character sets.
837 Be aware that future revisions of this specification may add
838 additional function types, as well as additional XML attributes,
847 values, and elements. Any system that automatically parses these XML
848 documents MUST take this into account to preserve future
849 compatibility.
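One possible reading of this requirement is that a parser should
extract the elements and attributes it knows, ignore those it does not,
and report rather than reject values it does not recognize. The
following Python sketch takes that approach. The function name
parse_registration, the returned dictionary layout, and the choice to
report unrecognized values instead of failing are assumptions made for
illustration.

```python
import xml.etree.ElementTree as ET

KNOWN_SCOPES = {"global", "local", "other"}
KNOWN_USES = {"common", "limited", "vendor", "deprecated"}
KNOWN_OPERATIONS = {"equality", "order", "substring"}


def parse_registration(xml_text):
    """Parse a registration, tolerating unknown attributes and elements."""
    root = ET.fromstring(xml_text)   # raises ParseError if not well-formed
    if root.tag != "collation":
        raise ValueError("root element must be <collation>")

    scope = root.get("scope")
    intended_use = root.get("intendedUse")
    if scope is None or intended_use is None:
        raise ValueError("scope= and intendedUse= attributes are required")

    operations = (root.findtext("operations") or "").split()

    # Values added by a future revision are recorded, not rejected.
    unrecognized = []
    if scope not in KNOWN_SCOPES:
        unrecognized.append("scope=" + scope)
    if intended_use not in KNOWN_USES:
        unrecognized.append("intendedUse=" + intended_use)
    unrecognized += [op for op in operations if op not in KNOWN_OPERATIONS]

    # Only known child elements are looked up; anything else is skipped.
    return {
        "rfc": root.get("rfc"),
        "scope": scope,
        "intendedUse": intended_use,
        "identifier": root.findtext("identifier"),
        "title": root.findtext("title"),
        "operations": operations,
        "specifications": [e.text for e in root.findall("specification")],
        "owners": [e.text for e in root.findall("owner")],
        "submitters": [e.text for e in root.findall("submitter")],
        "version": root.findtext("version"),
        "unrecognized": unrecognized,
    }
```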
851 7.2.3. The Identifier Element
853 The <identifier> element gives the precise identifier of the
854 collation, e.g., i;ascii-casemap. The <identifier> element is
855 mandatory.
857 7.2.4. The Title Element
859 The <title> element gives the title of the collation. The <title>
860 element is mandatory.
862 7.2.5. The Operations Element
864 The <operations> element lists which of the three operations
865 ("equality", "order" or "substring") the collation provides,
866 separated by single spaces. The <operations> element is mandatory.
868 7.2.6. The Specification Element
870 The <specification> element describes where to find the
871 specification. The <specification> element is mandatory. It MAY
872 have a URI attribute. There may be more than one <specification>
873 element, in which case, they together form the specification.
875 If it is discovered that parts of a collation specification conflict,
876 a new revision of the collation is necessary, and the
877 collation@ietf.org mailing list should be notified.
879 7.2.7. The Submitter Element
881 The <submitter> element provides an RFC 2822 [12] email address for
882 the person who submitted the registration. It is optional if the
883 <owner> element contains an email address.
885 There may be more than one <submitter> element.
887 7.2.8. The Owner Element
889 The <owner> element contains either the four letters "IETF" or an
890 email address of the owner of the registration. The <owner> element
891 is mandatory. There may be more than one <owner> element. If so,
892 all owners are equal. Each owner can speak for all.
903 7.2.9. The Version Element
905 The <version> element MUST be included when the registration is
906 likely to be revised, or has been revised in such a way that the
907 results change for one or more input strings. The <version> element
908 is optional.
910 7.2.10. The Variable Element
912 The <variable> element specifies an optional variable to control the
913 collation's behaviour, for example whether it is case sensitive. The
914 <variable> element is optional. When <variable> is used, it must
915 contain <name> and <default> elements, and it may contain one or more
916 <value> elements.
918 7.2.10.1. The Name Element
920 The <name> element specifies the name value of a variable. The
921 <name> element is mandatory.
923 7.2.10.2. The Default Element
925 The <default> element specifies the default value of a variable. The
926 <default> element is mandatory.
928 7.2.10.3. The Value Element
930 The <value> element specifies a legal value of a variable. The
931 <value> element is optional. If one or more <value> elements are
932 present, only those values are legal. If none are, then the
933 variable's legal values do not form an enumerated set, and the rules
934 MUST be specified in an RFC accompanying the registration.
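The rules for <variable>, <name>, <default>, and <value> can be
sketched in Python as follows. The helper names parse_variables and
is_legal are hypothetical. The sketch returns None from is_legal when
no <value> elements are present, since in that case the legal values
are defined by the accompanying RFC and cannot be checked from the
registration alone.

```python
import xml.etree.ElementTree as ET


def parse_variables(collation_root):
    """Collect the <variable> children of a parsed <collation> element."""
    variables = {}
    for var in collation_root.findall("variable"):
        name = var.findtext("name")
        default = var.findtext("default")
        if name is None or default is None:
            # <name> and <default> are mandatory inside <variable>.
            raise ValueError("<variable> requires <name> and <default>")
        values = [v.text or "" for v in var.findall("value")]
        variables[name] = {
            "default": default,
            # None means "not an enumerated set".
            "values": values if values else None,
        }
    return variables


def is_legal(variable, candidate):
    """Return True/False for enumerated variables, None otherwise."""
    if variable["values"] is None:
        return None
    return candidate in variable["values"]
```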
936 7.3. Structure of Collation Registry
938 Once the registration is approved, IANA will store each XML
939 registration document at a URL of the form
940 http://www.iana.org/assignments/collation/collation-id.xml, where
941 collation-id is the content of the identifier element in the
942 registration. Both the submitter and the designated expert are
943 responsible for verifying that the XML is well-formed. The
944 registration document should avoid using new elements. If any are
945 necessary, it is important to be consistent with other registrations.
947 IANA will also maintain a text summary of the registry under the name
948 http://www.iana.org/assignments/collation/collation-index.html. This
949 summary is divided into four sections. The first section is for
950 collations intended for common use. This section is intended for
959 collation registrations published in IESG-approved RFCs, or for
960 locally scoped collations from the primary standards body for that
961 locale. The designated expert is encouraged to reject collation
962 registrations with an intended use of "common" if the expert believes
963 it should be "limited", as it is desirable to keep the number of
964 "common" registrations small and of high quality. The second section
965 is reserved for limited-use collations. The third section is
966 reserved for registered vendor-specific collations. The final
967 section is reserved for deprecated collations.
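The four-section layout of the summary follows directly from the
"intendedUse=" attribute. As a rough sketch, assuming each registration
has already been reduced to a dictionary with hypothetical
"identifier" and "intendedUse" keys, the grouping could be done like
this:

```python
# Section order as described in the text: common, limited, vendor,
# deprecated.
SECTION_ORDER = ["common", "limited", "vendor", "deprecated"]


def group_for_summary(registrations):
    """Return [(intended_use, [identifiers...]), ...] in section order."""
    sections = {use: [] for use in SECTION_ORDER}
    for reg in registrations:
        use = reg["intendedUse"]
        if use not in sections:
            raise ValueError("no summary section for intendedUse=" + use)
        sections[use].append(reg["identifier"])
    return [(use, sections[use]) for use in SECTION_ORDER]
```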
969 7.4. Example Initial Registry Summary
971 The following is an example of how IANA might structure the initial
972 registry summary.html file:
974 Collation                        Functions Scope Reference
975 ---------                        --------- ----- ---------
976 Common Use Collations:
977    i;ascii-casemap               e, o, s   Local [RFC 4790]
979 Limited Use Collations:
980    i;octet                       e, o, s   Other [RFC 4790]
981    i;ascii-numeric               e, o      Other [RFC 4790]
983 Vendor Collations:
985 Deprecated Collations:
988 References
989 ----------
990 [RFC 4790]  Newman, C., Duerst, M., Gulbrandsen, A., "Internet
991             Application Protocol Collation Registry", RFC 4790,
992             Sun Microsystems, March 2007.
994 8. Guidelines for Expert Reviewer
996 The expert reviewer appointed by the IESG has fairly broad latitude
997 for this registry. While a number of collations are expected
998 (particularly customizations of the UCA for localized use), an
999 explosion of collations (particularly common-use collations) is not
1000 desirable for widespread interoperability. However, it is important
1001 for the expert reviewer to provide cause when rejecting a
1002 registration, and, when possible, to describe corrective action to
1015 permit the registration to proceed. The following list includes
1016 some example reasons to reject a registration with cause:
1018 o The registration is not a well-formed XML document.
1020 o The registration has an intended use of "common", but there is no
1021 evidence the collation will be widely deployed, so it should be
1022 listed as "limited".
1024 o The registration has an intended use of "common", but it is
1025 redundant with the functionality of a previously registered
1026 "common" collation.
1028 o The registration has an intended use of "common", but the
1029 specification is not detailed enough to allow interoperable
1030 implementations by others.
1032 o The collation identifier fails to precisely identify the version
1033 numbers of relevant tables to use.
1035 o The registration fails to meet one of the "MUST" requirements in
1036 Section 4.
1038 o The collation identifier fails to meet the syntax in Section 3.
1040 o The collation specification referenced in the registration is
1041 vague or has optional features without a clear behavior specified.
1043 o The referenced specification does not adequately address security
1044 considerations specific to that collation.
1046 o The registration's operations are needlessly different from those
1047 of traditional operations.
1049 o The registration's XML is needlessly different from that of
1050 already registered collations.
1052 9. Initial Collations
1054 This section registers the three collations that were originally
1055 defined in ACAP [11] and are implemented in most Sieve [14] engines.
1056 Some of the behavior of these collations is perhaps not ideal, such
1057 as i;ascii-casemap accepting non-ASCII input, but compatibility with
1058 widely deployed code was judged more important than fixing the
1059 collations.
1071 9.1. ASCII Numeric Collation
1073 9.1.1. ASCII Numeric Collation Description
1075 The "i;ascii-numeric" collation is a simple collation intended for
1076 use with arbitrarily-sized, unsigned decimal integer numbers stored
1077 as octet strings. US-ASCII digits (0x30 to 0x39) represent digits of
1078 the numbers. Before converting from string to integer, the input
1079 string is truncated at the first non-digit character. All input is
1080 valid; strings that do not start with a digit represent positive
1081 infinity.
1083 The collation supports equality and ordering, but does not support
1084 the substring operation.
1086 The equality operation returns "match" if the two strings represent
1087 the same number (i.e., leading zeroes and trailing non-digits are
1088 disregarded), and "no-match" if the two strings represent different
1089 numbers.
1091 The ordering operation returns "less" if the first string represents
1092 a smaller number than the second, "equal" if they represent the same
1093 number, and "greater" if the first string represents a larger number
1094 than the second.
1096 Some examples: "0" is less than "1", and "1" is less than
1097 "4294967298". "4294967298", "04294967298", and "4294967298b" are all
1098 equal. "04294967298" is less than "". "", "x", and "y" are equal.
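The description and examples above can be sketched in Python as
follows. The function names are hypothetical. To stay faithful to
"arbitrarily-sized" numbers, this sketch never converts the digits to
a machine integer; it strips leading zeroes and compares by length and
then by octet value, which gives the same result as a numeric
comparison.

```python
def _digits(s: bytes):
    """Return the leading digit run without leading zeroes.

    Returns None when the string does not start with a digit, which
    represents positive infinity. An all-zero run yields b"" (zero).
    """
    n = 0
    while n < len(s) and 0x30 <= s[n] <= 0x39:
        n += 1                       # truncate at the first non-digit
    if n == 0:
        return None
    return s[:n].lstrip(b"0")


def ascii_numeric_order(a: bytes, b: bytes) -> str:
    """Ordering operation: returns "less", "equal", or "greater"."""
    x, y = _digits(a), _digits(b)
    if x is None or y is None:
        if x is None and y is None:
            return "equal"           # infinity equals infinity
        return "greater" if x is None else "less"
    # With leading zeroes removed, a longer digit run is a larger number;
    # equal-length runs compare digit by digit.
    kx, ky = (len(x), x), (len(y), y)
    if kx < ky:
        return "less"
    if kx > ky:
        return "greater"
    return "equal"


def ascii_numeric_equal(a: bytes, b: bytes) -> str:
    """Equality operation: returns "match" or "no-match"."""
    return "match" if ascii_numeric_order(a, b) == "equal" else "no-match"
```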
1100 9.1.2. ASCII Numeric Collation Registration
1102 <?xml version='1.0'?>
1103 <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
1104 <collation rfc="4790" scope="other" intendedUse="limited">
1105   <identifier>i;ascii-numeric</identifier>
1106   <title>ASCII Numeric</title>
1107   <operations>equality order</operations>
1108   <specification>RFC 4790</specification>
1109   <owner>IETF</owner>
1110   <submitter>chris.newman@sun.com</submitter>
1111 </collation>
1127 9.2. ASCII Casemap Collation
1129 9.2.1. ASCII Casemap Collation Description
1131 The "i;ascii-casemap" collation is a simple collation that operates
1132 on octet strings and treats US-ASCII letters case-insensitively. It
1133 provides equality, substring, and ordering operations. All input is
1134 valid. Note that letters outside ASCII are not treated case-
1135 insensitively.
1137 Its equality, ordering, and substring operations are as for i;octet,
1138 except that at first, the lower-case letters (octet values 97-122) in
1139 each input string are changed to upper case (octet values 65-90).
1141 Care should be taken when using OS-supplied functions to implement
1142 this collation, as it is not locale sensitive. Functions such as
1143 strcasecmp and toupper are sometimes locale sensitive, and may
1144 inappropriately map lower-case letters other than a-z to upper case.
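One way to avoid locale-sensitive library functions is an explicit
256-entry translation table that touches only octets 97-122. The
following Python sketch does this; the function names are
hypothetical, and the comparison relies on Python's built-in bytes
ordering, which compares unsigned octet values and treats a proper
prefix as smaller.

```python
# Map octets 97-122 (a-z) to 65-90 (A-Z); every other octet, including
# all values above 127, maps to itself.
_UPCASE = bytes(o - 32 if 97 <= o <= 122 else o for o in range(256))


def ascii_casemap_fold(s: bytes) -> bytes:
    """Upper-case US-ASCII letters only, independent of any locale."""
    return s.translate(_UPCASE)


def ascii_casemap_order(a: bytes, b: bytes) -> str:
    x, y = ascii_casemap_fold(a), ascii_casemap_fold(b)
    if x < y:
        return "less"
    if x > y:
        return "greater"
    return "equal"


def ascii_casemap_equal(a: bytes, b: bytes) -> str:
    same = ascii_casemap_fold(a) == ascii_casemap_fold(b)
    return "match" if same else "no-match"


def ascii_casemap_substring(needle: bytes, haystack: bytes) -> str:
    # "match" if the folded first string occurs in the folded second.
    found = ascii_casemap_fold(needle) in ascii_casemap_fold(haystack)
    return "match" if found else "no-match"
```

Note that the UTF-8 encoding of a non-ASCII letter consists entirely of
octets above 127, so such letters pass through the table unchanged.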
1146 The i;ascii-casemap collation is well-suited for use with many
1147 Internet protocols and computer languages. Use with natural language
1148 is often inappropriate; even though the collation apparently supports
1149 languages such as Swahili and English, in real-world use, it tends to
1150 mis-sort a number of types of string:
1152 o people and place names containing non-ASCII,
1154 o words such as "naive" (if spelled with an accent, the accented
1155 character could push the word to the wrong spot in a sorted list),
1157 o names such as "Lloyd" (which, in Welsh, sorts after "Lyon", unlike
1158 in English),
1160 o strings containing euro and pound sterling symbols, quotation
1161 marks other than '"', dashes/hyphens, etc.
1183 9.2.2. ASCII Casemap Collation Registration
1185 <?xml version='1.0'?>
1186 <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
1187 <collation rfc="4790" scope="local" intendedUse="common">
1188   <identifier>i;ascii-casemap</identifier>
1189   <title>ASCII Casemap</title>
1190   <operations>equality order substring</operations>
1191   <specification>RFC 4790</specification>
1192   <owner>IETF</owner>
1193   <submitter>chris.newman@sun.com</submitter>
1194 </collation>
1196 9.3. Octet Collation
1198 9.3.1. Octet Collation Description
1200 The "i;octet" collation is a simple and fast collation intended for
1201 use on binary octet strings rather than on character data. Protocols
1202 that want to make this collation available have to do so by
1203 explicitly allowing it. If not explicitly allowed, it MUST NOT be
1204 used. It never returns an "undefined" result. It provides equality,
1205 substring, and ordering operations.
1207 The ordering algorithm is as follows:
1209 1. If both strings are the empty string, return the result "equal".
1211 2. If the first string is empty and the second is not, return the
1212 result "less".
1214 3. If the second string is empty and the first is not, return the
1215 result "greater".
1217 4. If both strings begin with the same octet value, remove the first
1218 octet from both strings and repeat this algorithm from step 1.
1220 5. If the unsigned value (0 to 255) of the first octet of the first
1221 string is less than the unsigned value of the first octet of the
1222 second string, then return "less".
1224 6. If this step is reached, return "greater".
1226 This algorithm is roughly equivalent to the C library function
1227 memcmp, with appropriate length checks added.
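The six steps above can be sketched in Python as a loop rather than a
literal "remove the first octet and repeat" recursion; the two forms
should give the same result. The function name octet_order is
hypothetical.

```python
def octet_order(a: bytes, b: bytes) -> str:
    """Ordering operation: returns "less", "equal", or "greater"."""
    # Step 4: skip over the common prefix, one octet at a time.
    # Steps 5 and 6: at the first differing octet, compare the unsigned
    # values (iterating over bytes yields integers 0 to 255).
    for x, y in zip(a, b):
        if x != y:
            return "less" if x < y else "greater"
    # One or both strings are now exhausted (steps 1 to 3).
    if len(a) == len(b):
        return "equal"
    return "less" if len(a) < len(b) else "greater"
```

The final length comparison plays the role of the "appropriate length
checks" that a memcmp-based implementation would need.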
1239 The equality operation returns "match" if the ordering algorithm
1240 would return "equal". Otherwise, the equality operation returns
1241 "no-match".
1243 The substring operation returns "match" if the first string is the
1244 empty string, or if there exists a substring of the second string of
1245 length equal to the length of the first string, which would result in
1246 a "match" result from the equality function. Otherwise, the
1247 substring operation returns "no-match".
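A direct, unoptimized Python sketch of these two operations follows.
The function names are hypothetical. The substring check tries every
window of the second string whose length equals that of the first
string, mirroring the wording above; a production implementation would
more likely use a library search routine.

```python
def octet_equal(a: bytes, b: bytes) -> str:
    """Equality operation: "match" only for identical octet strings."""
    return "match" if a == b else "no-match"


def octet_substring(needle: bytes, haystack: bytes) -> str:
    """Substring operation: is the first string contained in the second?"""
    if len(needle) == 0:
        return "match"               # the empty string always matches
    n = len(needle)
    # If needle is longer than haystack, the range is empty.
    for start in range(len(haystack) - n + 1):
        if octet_equal(needle, haystack[start:start + n]) == "match":
            return "match"
    return "no-match"
```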
1249 9.3.2. Octet Collation Registration
1251 This collation is defined with intendedUse="limited" because it can
1252 only be used by protocols that explicitly allow it.
1254 <?xml version='1.0'?>
1255 <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
1256 <collation rfc="4790" scope="global" intendedUse="limited">
1257   <identifier>i;octet</identifier>
1258   <title>Octet</title>
1259   <operations>equality order substring</operations>
1260   <specification>RFC 4790</specification>
1261   <owner>IETF</owner>
1262   <submitter>chris.newman@sun.com</submitter>
1263 </collation>
1265 10. IANA Considerations
1267 Section 7 defines how to register collations with IANA. Section 9
1268 defines a list of predefined collations that have been registered
1269 with IANA.
1271 11. Security Considerations
1273 Collations will normally be used with UTF-8 strings. Thus, the
1274 security considerations for UTF-8 [3], stringprep [6], and Unicode
1275 TR-36 [8] also apply, and are normative to this specification.
1277 12. Acknowledgements
1279 The authors want to thank all who have contributed to this document,
1280 including Brian Carpenter, John Cowan, Dave Cridland, Mark Davis,
1281 Spencer Dawkins, Lisa Dusseault, Lars Eggert, Frank Ellermann, Philip
1282 Guenther, Tony Hansen, Ted Hardie, Sam Hartman, Kjetil Torgrim Homme,
1283 Michael Kay, John Klensin, Alexey Melnikov, Jim Melton, and Abhijit
1284 Menon-Sen.
1295 13. References
1297 13.1. Normative References
1299 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
1300 Levels", BCP 14, RFC 2119, March 1997.
1302 [2] Crocker, D. and P. Overell, "Augmented BNF for Syntax
1303 Specifications: ABNF", RFC 4234, October 2005.
1305 [3] Yergeau, F., "UTF-8, a transformation format of ISO 10646",
1306 STD 63, RFC 3629, November 2003.
1308 [4] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1309 Resource Identifier (URI): Generic Syntax", RFC 3986,
1310 January 2005.
1312 [5] Phillips, A. and M. Davis, "Tags for Identifying Languages",
1313 BCP 47, RFC 4646, September 2006.
1315 [6] Hoffman, P. and M. Blanchet, "Preparation of Internationalized
1316 Strings ("stringprep")", RFC 3454, December 2002.
1318 [7] Davis, M. and K. Whistler, "Unicode Collation Algorithm version
1319 14", May 2005,
1320 <http://www.unicode.org/reports/tr10/tr10-14.html>.
1322 [8] Davis, M. and M. Suignard, "Unicode Security Considerations",
1323 February 2006, <http://www.unicode.org/reports/tr36/>.
1325 13.2. Informative References
1327 [9] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
1328 Extensions (MIME) Part One: Format of Internet Message Bodies",
1329 RFC 2045, November 1996.
1331 [10] Melnikov, A., "Simple Authentication and Security Layer
1332 (SASL)", RFC 4422, June 2006.
1334 [11] Newman, C. and J. Myers, "ACAP -- Application Configuration
1335 Access Protocol", RFC 2244, November 1997.
1337 [12] Resnick, P., "Internet Message Format", RFC 2822, April 2001.
1339 [13] Freed, N. and J. Postel, "IANA Charset Registration
1340 Procedures", BCP 19, RFC 2978, October 2000.
1351 [14] Showalter, T., "Sieve: A Mail Filtering Language", RFC 3028,
1352 January 2001.
1354 [15] Crispin, M., "Internet Message Access Protocol - Version
1355 4rev1", RFC 3501, March 2003.
1357 [16] Crispin, M. and K. Murchison, "Internet Message Access Protocol
1358 - Sort and Thread Extensions", Work in Progress, May 2004.
1360 [17] Newman, C. and A. Gulbrandsen, "Internet Message Access
1361 Protocol Internationalization", Work in Progress, January 2006.
1363 Authors' Addresses
1365 Chris Newman
1366 Sun Microsystems
1367 1050 Lakes Drive
1368 West Covina, CA 91790
1369 USA
1371 EMail: chris.newman@sun.com
1374 Martin Duerst
1375 Aoyama Gakuin University
1376 5-10-1 Fuchinobe
1377 Sagamihara, Kanagawa 229-8558
1378 Japan
1380 Phone: +81 42 759 6329
1381 Fax: +81 42 759 6495
1382 EMail: duerst@it.aoyama.ac.jp
1383 URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
1385 Note: Please write "Duerst" with u-umlaut wherever possible, for
1386 example as "D&#252;rst" in XML and HTML.
1389 Arnt Gulbrandsen
1390 Oryx Mail Systems GmbH
1391 Schweppermannstr. 8
1392 81671 Munich
1393 Germany
1395 Fax: +49 89 4502 9758
1396 EMail: arnt@oryx.com
1397 URI: http://www.oryx.com/arnt/
1407 Full Copyright Statement
1409 Copyright (C) The IETF Trust (2007).
1411 This document is subject to the rights, licenses and restrictions
1412 contained in BCP 78, and except as set forth therein, the authors
1413 retain all their rights.
1415 This document and the information contained herein are provided on an
1416 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1417 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
1418 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
1419 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
1420 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1421 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1423 Intellectual Property
1425 The IETF takes no position regarding the validity or scope of any
1426 Intellectual Property Rights or other rights that might be claimed to
1427 pertain to the implementation or use of the technology described in
1428 this document or the extent to which any license under such rights
1429 might or might not be available; nor does it represent that it has
1430 made any independent effort to identify any such rights. Information
1431 on the procedures with respect to rights in RFC documents can be
1432 found in BCP 78 and BCP 79.
1434 Copies of IPR disclosures made to the IETF Secretariat and any
1435 assurances of licenses to be made available, or the result of an
1436 attempt made to obtain a general license or permission for the use of
1437 such proprietary rights by implementers or users of this
1438 specification can be obtained from the IETF on-line IPR repository at
1439 http://www.ietf.org/ipr.
1441 The IETF invites any interested party to bring to its attention any
1442 copyrights, patents or patent applications, or other proprietary
1443 rights that may cover technology that may be required to implement
1444 this standard. Please address the information to the IETF at
1445 ietf-ipr@ietf.org.
1447 Acknowledgement
1449 Funding for the RFC Editor function is currently provided by the
1450 Internet Society.