imapext-2007
diff docs/formats.txt @ 0:ada5e610ab86
imap-2007e
author | yuuji@gentei.org |
---|---|
date | Mon, 14 Sep 2009 15:17:45 +0900 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/formats.txt Mon Sep 14 15:17:45 2009 +0900
@@ -0,0 +1,217 @@
+/* ========================================================================
+ * Copyright 1988-2006 University of Washington
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *
+ * ========================================================================
+ */
+
+                     Mailbox Format Characteristics
+                              Mark Crispin
+                            11 December 2006
+
+
+        When a mailbox storage technology uses local files and
+directories directly, the file(s) and directories are laid out in a
+mailbox format.
+
+I. Flat-File Formats
+
+        In these formats, a mailbox and all the messages inside are a
+single file on the filesystem. The mailbox name is the name of the
+file in the filesystem, relative to the user's "mail home directory."
+
+        A flat-file format mailbox is always a file, never a directory.
+This means that it is impossible to have a flat-file format mailbox
+that has inferior mailbox names under it (so-called "dual-usage"
+mailboxes). For some inexplicable reason, some people want this.
+
+        The mail home directory is usually the same as the user login
+home directory if that concept is meaningful; otherwise, it is some
+other default directory (e.g. "C:\My Documents" on Windows 98). This
+can be redefined by modifying the c-client source code or in an
+application via the SET_HOMEDIR mail_parameters() call.
+
+        For example, a mailbox named "project" is likely to be found in
+the file "project" in the user's home directory.
+Similarly, a mailbox named "test/trial1" (assuming a UNIX system) is
+likely to be found in the file "trial1" in the subdirectory "test" in
+the user's home directory.
+
+        Note that the name "INBOX" has special semantics and rules, as
+described in the file naming.txt.
+
+        The following flat-file formats are supported by c-client as of
+the time of this writing:
+
+. unix   This is the traditional UNIX mailbox format, in use for nearly
+         30 years. It uses a line starting with "From " to indicate
+         start of message, and stores the message status inside the
+         RFC822 message header.
+
+         unix is not particularly efficient; the entire mailbox file
+         must be read when the mailbox is opened, and when reading
+         message texts it is necessary to convert the newline convention
+         to Internet standard CR LF form. unix preserves UIDs, and
+         allows the creation of keywords.
+
+         Only one process may have a unix-format mailbox open
+         read/write at a time.
+
+. mmdf   This is the format used by the MMDF mailer. It uses a line
+         consisting of 4 <CTRL/A> (0x01) characters to indicate start
+         and end of message. Optionally, there may also be a unix
+         format "From " line. It otherwise has the same
+         characteristics as unix format.
+
+. mbx    This is the current preferred mailbox format. It can be
+         handled quite efficiently by c-client, without the problems
+         that exist with unix and mmdf formats. Messages are stored
+         in Internet standard CR LF format.
+
+         mbx permits shared access, including shared expunge. It
+         preserves UIDs, and allows the creation of keywords.
+
+. mtx    This is supported for compatibility with the past. This is
+         the old Tenex/TOPS-20 mail.txt format. It can be handled
+         quite efficiently by c-client, and has most of the
+         characteristics of mbx format.
+
+         mtx is deficient in that it does not support shared expunge;
+         it has no means to store UIDs; and it has no way to define
+         keywords except through an external configuration file.
+
+. tenex  This is supported for compatibility with the past. This is
+         the old Columbia MM format. This is similar to mtx format,
+         only it uses UNIX-style bare-LF newlines instead of CR LF
+         newlines, thus incurring a performance penalty for newline
+         conversion.
+
+. phile  This is not strictly a format. Any file which is not in a
+         recognized format is in phile format, which treats the entire
+         contents of the file as a single message.
+
+
+II. File/Message Formats
+
+        In these formats, a mailbox is a directory, and each of the
+messages inside is a separate file inside the directory. The file
+names of these files are generally the text form of a number, which
+also matches the UID of the message.
+
+        In the case of mx, the mailbox name is the name of the directory
+in the filesystem, relative to the user's "mail home directory." In
+the case of news and mh, the mailbox name is in a separate namespace
+as described in the file naming.txt.
+
+        A file/message format mailbox is always a directory. This means
+that it is possible to have a file/message format mailbox that has
+inferior mailbox names under it (so-called "dual-usage" mailboxes).
+For some inexplicable reason, some people want this.
+
+        Note that the name "INBOX" has special semantics and rules, as
+described in the file naming.txt.
+
+        The following file/message formats are supported by c-client as of
+the time of this writing:
+
+. mx     This is an experimental format, and may be removed in a future
+         release.
+         An mx format mailbox has a .mxindex file which holds the
+         message status and unique identifiers. Messages are stored
+         in Internet standard CR LF form, so the file size of the
+         message file equals the size of the message.
+
+         mx is somewhat inefficient; the entire directory must be read
+         and each file stat()'d. We found it intolerable for a
+         moderate sized mailbox (2000 messages) and have more or less
+         abandoned it.
+
+. mh     This is supported for compatibility with the past. This is
+         the format used by the old mh program.
+
+         mh is very inefficient; the entire directory must be read
+         and each file stat()'d, and in order to determine the size
+         of a message, the entire file must be read and newline
+         conversion performed.
+
+         mh is deficient in that it does not support any permanent
+         flags or keywords; and has no means to store UIDs (because
+         the mh "compress" command renames all the files, that's
+         why).
+
+. news   This is an export of the local filesystem's news spool, e.g.
+         /var/spool/news. Access to mailboxes in news format is read
+         only; however, message "deleted" status is preserved in a
+         .newsrc file in the user's home directory. There is no other
+         status or keywords.
+
+         news is very inefficient; the entire directory must be
+         read and each file stat()'d, and in order to determine the
+         size of a message, the entire file must be read and newline
+         conversion performed.
+
+         news is deficient in that it does not support permanent flags
+         other than deleted; does not support keywords; and has no
+         expunge.
+
+
+Soapbox on File/Message Formats
+
+        If it sounds from the above descriptions that we're not putting
+too much effort into file/message formats, you are correct.
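
As a rough illustration of the costs described for mx, mh and news above, here is a minimal Python sketch. It is not c-client code, and the function names scan_message_files and crlf_size are hypothetical. It assumes only what section II states: that messages are files whose names are the text form of a number. Opening such a mailbox means reading the whole directory and calling stat() on every file, and for a bare-LF format the CR LF size of a message is known only after reading the entire file.

```python
import os


def scan_message_files(mailbox_dir):
    """Return sorted (number, file_size) pairs for numerically named files."""
    messages = []
    with os.scandir(mailbox_dir) as entries:      # read the entire directory
        for entry in entries:
            name = entry.name
            if not (name.isascii() and name.isdigit()):
                continue                          # skip index files and the like
            info = os.stat(entry.path)            # one stat() per message file
            messages.append((int(name), info.st_size))
    messages.sort()
    return messages


def crlf_size(path):
    """Size the message would have after converting bare LF to CR LF."""
    with open(path, "rb") as f:
        data = f.read()                           # the whole file must be read
    bare_lf = data.count(b"\n") - data.count(b"\r\n")
    return len(data) + bare_lf
```

For a format that already stores CR LF (as the mx description says), the st_size from the scan is the message size; for a bare-LF format, crlf_size() has to be called per message, which is the extra read described for mh and news.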
+
+        There's a general reason why file/message formats are a bad idea.
+Just about every filesystem in existence serializes file creation and
+deletions because these manipulate the free space map. This turns out
+to be an enormous problem when you start creating/deleting more than a
+few messages per second; you spend all your time thrashing in the
+filesystem.
+
+        It is also extremely slow to do a text search through a
+file/message format mailbox. All of those open()s and close()s really
+add up to major filesystem thrashing.
+
+
+What about Cyrus and Maildir?
+
+        Both formats are vulnerable to the filesystem thrashing outlined
+above.
+
+        The Cyrus format used by CMU's Cyrus server (and Esys' server)
+has a special associated flat file in each directory that contains
+extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
+about the messages. Put another way, it's a (considerably) more
+featureful form of mx. It also uses certain operating system
+facilities (e.g. file/memory mapping) which are not available on older
+systems, at a cost of much more limited portability than c-client.
+These considerably ameliorate the fundamental problems with
+file/message formats; in fact, Cyrus is halfway to being a database.
+Rather than support Cyrus format in c-client, you should run Cyrus or
+Esys if you want that format.
+
+        The Maildir format used by qmail has all of the performance
+disadvantages of mh noted above, with the additional problem that the
+files are renamed in order to change their status so you end up having
+to rescan the directory frequently to locate the current names
+(particularly in a shared mailbox scenario).
+It doesn't scale, and it represents a support nightmare; it is
+therefore not supported in the official distribution. Maildir support
+code for c-client is available from third parties; but, if you use it,
+it is entirely at your own risk (read: don't complain about bugs or
+about how poorly it performs).
+
+
+So what does this all mean?
+
+        A database (such as used by Exchange) is really a much better
+approach if you want to move away from flat files. mx and especially
+Cyrus take a tentative step in that direction; mx failed mostly because
+it didn't go anywhere near far enough. Cyrus goes much further, and
+scores remarkable benefits from doing so.
+
+        However, a well-designed pure database without the overhead of
+separate files would do even better.
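
To make the Maildir renaming problem from "What about Cyrus and Maildir?" concrete, here is a minimal Python sketch. It assumes the commonly documented Maildir naming convention, in which status flags follow a ":2," marker at the end of the file name; that convention is not described in this document, and the helper names set_flags and locate are hypothetical. The point is only that changing status is a rename, so another process holding the old name has to rescan the directory.

```python
import os

INFO_SEP = ":2,"  # assumed Maildir convention: flags follow this marker


def split_name(name):
    """Split a file name into (unique base, flag string)."""
    base, _, flags = name.partition(INFO_SEP)
    return base, flags


def set_flags(path, flags):
    """Change a message's status by renaming its file; return the new path."""
    directory, name = os.path.split(path)
    base, _ = split_name(name)
    new_name = base + INFO_SEP + "".join(sorted(set(flags)))
    new_path = os.path.join(directory, new_name)
    os.rename(path, new_path)                 # the status change is a rename
    return new_path


def locate(directory, base):
    """Rescan the directory to find the current name of a message."""
    for name in os.listdir(directory):        # full directory read each time
        if split_name(name)[0] == base:
            return os.path.join(directory, name)
    return None
```

After set_flags() runs in one process, any other process that remembered the old path gets a "no such file" error and must call something like locate(), which reads the whole directory again.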