imapext-2007

diff docs/formats.txt @ 0:ada5e610ab86

imap-2007e
author yuuji@gentei.org
date Mon, 14 Sep 2009 15:17:45 +0900
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/docs/formats.txt	Mon Sep 14 15:17:45 2009 +0900
     1.3 @@ -0,0 +1,217 @@
     1.4 +/* ========================================================================
     1.5 + * Copyright 1988-2006 University of Washington
     1.6 + *
     1.7 + * Licensed under the Apache License, Version 2.0 (the "License");
     1.8 + * you may not use this file except in compliance with the License.
     1.9 + * You may obtain a copy of the License at
    1.10 + *
    1.11 + *     http://www.apache.org/licenses/LICENSE-2.0
    1.12 + *
    1.13 + * 
    1.14 + * ========================================================================
    1.15 + */
    1.16 +
    1.17 +		    Mailbox Format Characteristics
    1.18 +			     Mark Crispin
    1.19 +			   11 December 2006
    1.20 +
    1.21 +
    1.22 +     When a mailbox storage technology uses local files and
    1.23 +directories directly, the file(s) and directories are laid out in a
    1.24 +mailbox format.
    1.25 +
    1.26 +I. Flat-File Formats
    1.27 +
    1.28 +     In these formats, a mailbox and all the messages inside are a
    1.29 +single file on the filesystem.  The mailbox name is the name of the
    1.30 +file in the filesystem, relative to the user's "mail home directory."
    1.31 +
    1.32 +     A flat-file format mailbox is always a file, never a directory.
    1.33 +This means that it is impossible to have a flat-file format mailbox
    1.34 +that has inferior mailbox names under it (so-called "dual-usage"
    1.35 +mailboxes).  For some inexplicable reason, some people want this.
    1.36 +
    1.37 +     The mail home directory is usually the same as the user login
    1.38 +home directory if that concept is meaningful; otherwise, it is some
    1.39 +other default directory (e.g. "C:\My Documents" on Windows 98).  This
    1.40 +can be redefined by modifying the c-client source code or in an
    1.41 +application via the SET_HOMEDIR mail_parameters() call.
    1.42 +
    1.43 +     For example, a mailbox named "project" is likely to be found in
    1.44 +the file "project" in the user's home directory.  Similarly, a mailbox
    1.45 +named "test/trial1" (assuming a UNIX system) is likely to be found in
    1.46 +the file "trial1" in the subdirectory "test" in the user's home
    1.47 +directory.
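
The name-to-file mapping described above can be sketched as follows. This is only an illustration, not c-client code: the function name `mailbox_to_path` and the home directory `/home/alice` are hypothetical, UNIX-style paths are assumed, and INBOX is simply rejected because its rules live in naming.txt.

```python
from pathlib import PurePosixPath


def mailbox_to_path(mail_home: str, mailbox: str) -> PurePosixPath:
    """Map a flat-file mailbox name to a file path under the mail home.

    Hypothetical helper: assumes a UNIX-style hierarchy and does not
    attempt to model the special INBOX rules from naming.txt.
    """
    if mailbox.upper() == "INBOX":
        raise ValueError("INBOX has special semantics; see naming.txt")
    return PurePosixPath(mail_home) / mailbox


# "project" is a file directly in the mail home directory.
print(mailbox_to_path("/home/alice", "project"))
# "test/trial1" is the file "trial1" in the subdirectory "test".
print(mailbox_to_path("/home/alice", "test/trial1"))
```

Because the mailbox is the file itself, there is no place to hang inferior names under it, which is the dual-usage limitation noted earlier.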
    1.48 +
    1.49 +     Note that the name "INBOX" has special semantics and rules, as
    1.50 +described in the file naming.txt.
    1.51 +
    1.52 +     The following flat-file formats are supported by c-client as of
    1.53 +the time of this writing:
    1.54 +
    1.55 +. unix	This is the traditional UNIX mailbox format, in use for nearly
    1.56 +	30 years.  It uses a line starting with "From " to indicate
    1.57 +	start of message, and stores the message status inside the
    1.58 +	RFC822 message header.
    1.59 +
    1.60 +	unix is not particularly efficient; the entire mailbox file
    1.61 +	must be read when the mailbox is opened, and when reading message
    1.62 +	texts it is necessary to convert the newline convention to
    1.63 +	Internet standard CR LF form.  unix preserves UIDs, and allows
    1.64 +	the creation of keywords.
    1.65 +
    1.66 +	Only one process may have a unix-format mailbox open
    1.67 +	read/write at a time.
    1.68 +
    1.69 +. mmdf	This is the format used by the MMDF mailer.  It uses a line
    1.70 +	consisting of 4 <CTRL/A> (0x01) characters to indicate start
    1.71 +	and end of message.  Optionally, there may also be a unix
    1.72 +	format "From " line.  It otherwise has the same
    1.73 +	characteristics as unix format.
    1.74 +
    1.75 +. mbx	This is the current preferred mailbox format.  It can be
    1.76 +	handled quite efficiently by c-client, without the problems
    1.77 +	that exist with unix and mmdf formats.  Messages are stored
    1.78 +	in Internet standard CR LF format.
    1.79 +
    1.80 +	mbx permits shared access, including shared expunge.  It
    1.81 +	preserves UIDs, and allows the creation of keywords.
    1.82 +
    1.83 +. mtx	This is supported for compatibility with the past.  This is
    1.84 +	the old Tenex/TOPS-20 mail.txt format.  It can be handled
    1.85 +	quite efficiently by c-client, and has most of the
    1.86 +	characteristics of mbx format.
    1.87 +
    1.88 +	mtx is deficient in that it does not support shared expunge;
    1.89 +	it has no means to store UIDs; and it has no way to define
    1.90 +	keywords except through an external configuration file.
    1.91 +
    1.92 +. tenex	This is supported for compatibility with the past.  This is
    1.93 +	the old Columbia MM format.  This is similar to mtx format,
    1.94 +	only it uses UNIX-style bare-LF newlines instead of CR LF
    1.95 +	newlines, thus incurring a performance penalty for newline
    1.96 +	conversion.
    1.97 +
    1.98 +. phile	This is not strictly a format.  Any file which is not in a
    1.99 +	recognized format is in phile format, which treats the entire
   1.100 +	contents of the file as a single message.
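
A rough sketch of the delimiters described in the list above, under simplifying assumptions. It only recognizes unix (a leading "From " line) and mmdf (a leading line of four CTRL/A characters) and falls back to phile for everything else; mbx, mtx and tenex are left out because this document does not describe their on-disk markers. All function names are hypothetical, and real parsers apply stricter rules than "any line starting with From ".

```python
import io
from typing import List


def sniff_format(data: bytes) -> str:
    """Guess the format from the first line (simplified assumption)."""
    if data.startswith(b"\x01\x01\x01\x01\n"):
        return "mmdf"
    if data.startswith(b"From "):
        return "unix"
    return "phile"  # unrecognized: the whole file is a single message


def split_unix(data: bytes) -> List[bytes]:
    """Split unix-format data at each line starting with 'From '."""
    messages: List[bytes] = []
    current: List[bytes] = []
    for line in io.BytesIO(data):  # iterates on LF-terminated lines
        if line.startswith(b"From ") and current:
            messages.append(b"".join(current))
            current = []
        current.append(line)
    if current:
        messages.append(b"".join(current))
    return messages


def to_crlf(message: bytes) -> bytes:
    """Convert bare-LF newlines to Internet standard CR LF."""
    return message.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


sample = (
    b"From a@example.org Mon Dec 11 00:00:00 2006\nSubject: one\n\nhi\n"
    b"From b@example.org Mon Dec 11 00:00:01 2006\nSubject: two\n\nyo\n"
)
print(sniff_format(sample), len(split_unix(sample)))
```

Note that both the split and the newline conversion touch every byte of the file, which is the inefficiency attributed to unix format above.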
   1.101 +
   1.102 +
   1.103 +II. File/Message Formats
   1.104 +
   1.105 +     In these formats, a mailbox is a directory, and each of the messages
   1.106 +inside is a separate file inside the directory.  The file names of
   1.107 +these files are generally the text form of a number, which also
   1.108 +matches the UID of the message.
   1.109 +
   1.110 +     In the case of mx, the mailbox name is the name of the directory
   1.111 +in the filesystem, relative to the user's "mail home directory."  In
   1.112 +the case of news and mh, the mailbox name is in a separate namespace
   1.113 +as described in the file naming.txt.
   1.114 +
   1.115 +     A file/message format mailbox is always a directory.  This means
   1.116 +that it is possible to have a file/message format mailbox that has
   1.117 +inferior mailbox names under it (so-called "dual-usage" mailboxes).
   1.118 +For some inexplicable reason, some people want this.
   1.119 +
   1.120 +     Note that the name "INBOX" has special semantics and rules, as
   1.121 +described in the file naming.txt.
   1.122 +
   1.123 +     The following file/message formats are supported by c-client as of
   1.124 +the time of this writing:
   1.125 +
   1.126 +. mx	This is an experimental format, and may be removed in a future
   1.127 +	release.  An mx format mailbox has a .mxindex file which holds
   1.128 +	the message status and unique identifiers.  Messages are
   1.129 +	stored in Internet standard CR LF form, so the file size of
   1.130 +	the message file equals the size of the message.
   1.131 +
   1.132 +	mx is somewhat inefficient; the entire directory must be read
   1.133 +	and each file stat()'d.  We found it intolerable for a
   1.134 +	moderate sized mailbox (2000 messages) and have more or less
   1.135 +	abandoned it.	
   1.136 +
   1.137 +. mh	This is supported for compatibility with the past.  This is
   1.138 +	the format used by the old mh program.
   1.139 +
   1.140 +	mh is very inefficient; the entire directory must be read
   1.141 +	and each file stat()'d, and in order to determine the size
   1.142 +	of a message, the entire file must be read and newline
   1.143 +	conversion performed.
   1.144 +
   1.145 +	mh is deficient in that it does not support any permanent
   1.146 +	flags or keywords; and has no means to store UIDs (because
   1.147 +	the mh "compress" command renames all the files, that's
   1.148 +	why).
   1.149 +
   1.150 +. news	This is an export of the local filesystem's news spool, e.g.
   1.151 +	/var/spool/news.  Access to mailboxes in news format is read
   1.152 +	only; however, message "deleted" status is preserved in a
   1.153 +	.newsrc file in the user's home directory.  There is no other
   1.154 +	status or keywords.
   1.155 +
   1.156 +	news is very inefficient; the entire directory must be
   1.157 +	read and each file stat()'d, and in order to determine the
   1.158 +	size of a message, the entire file must be read and newline
   1.159 +	conversion performed.
   1.160 +
   1.161 +	news is deficient in that it does not support permanent flags
   1.162 +	other than deleted; does not support keywords; and has no
   1.163 +	expunge.
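
The per-message costs described in the list above can be illustrated with a small sketch. It assumes a directory of numerically named files stored with bare-LF newlines (as in mh), and the function name `scan_mailbox` is hypothetical. Every file must be stat()'d, and the CR LF size can only be found by reading the whole file.

```python
import os
import tempfile


def scan_mailbox(directory: str):
    """Return (number, on-disk size, CR LF size) for each message file."""
    entries = []
    for name in os.listdir(directory):        # the entire directory is read
        if not (name.isascii() and name.isdigit()):
            continue                          # skip non-message files
        path = os.path.join(directory, name)
        disk_size = os.stat(path).st_size     # one stat() per message
        with open(path, "rb") as f:           # one full read per message
            body = f.read()
        crlf = body.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
        entries.append((int(name), disk_size, len(crlf)))
    return sorted(entries)


with tempfile.TemporaryDirectory() as box:
    samples = [(1, b"Subject: a\n\nhello\n"), (2, b"Subject: b\n\nbye\n")]
    for number, text in samples:
        with open(os.path.join(box, str(number)), "wb") as f:
            f.write(text)
    result = scan_mailbox(box)
    print(result)
```

When messages are already stored in CR LF form, as described for mx, the stat() size is the message size and the read step can be skipped.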
   1.164 +
   1.165 +
   1.166 +Soapbox on File/Message Formats
   1.167 +
   1.168 +     If it sounds from the above descriptions that we're not putting
   1.169 +too much effort into file/message formats, you are correct.
   1.170 +
   1.171 +     There's a general reason why file/message formats are a bad idea.
   1.172 +Just about every filesystem in existence serializes file creation and
   1.173 +deletions because these manipulate the free space map.  This turns out
   1.174 +to be an enormous problem when you start creating/deleting more than a
   1.175 +few messages per second; you spend all your time thrashing in the
   1.176 +filesystem.
   1.177 +
   1.178 +     It is also extremely slow to do a text search through a
   1.179 +file/message format mailbox.  All of those open()s and close()s really
   1.180 +add up to major filesystem thrashing.
   1.181 +
   1.182 +
   1.183 +What about Cyrus and Maildir?
   1.184 +
   1.185 +     Both formats are vulnerable to the filesystem thrashing outlined
   1.186 +above.
   1.187 +
   1.188 +     The Cyrus format used by CMU's Cyrus server (and Esys' server)
   1.189 +has a special associated flat file in each directory that contains
   1.190 +extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
   1.191 +about the messages.  Put another way, it's a (considerably) more
   1.192 +featureful form of mx.  It also uses certain operating system
   1.193 +facilities (e.g. file/memory mapping) which are not available on older
   1.194 +systems, at a cost of much more limited portability than c-client.
   1.195 +These considerably ameliorate the fundamental problems with
   1.196 +file/message formats; in fact, Cyrus is halfway to being a database.
   1.197 +Rather than support Cyrus format in c-client, you should run Cyrus or
   1.198 +Esys if you want that format.
   1.199 +
   1.200 +     The Maildir format used by qmail has all of the performance
   1.201 +disadvantages of mh noted above, with the additional problem that the
   1.202 +files are renamed in order to change their status so you end up having
   1.203 +to rescan the directory frequently to locate the current names
   1.204 +(particularly in a shared mailbox scenario).  It doesn't scale, and it
   1.205 +represents a support nightmare; it is therefore not supported in the
   1.206 +official distribution.  Maildir support code for c-client is available
   1.207 +from third parties; but, if you use it, it is entirely at your own
   1.208 +risk (read: don't complain about bugs or how poorly it performs).
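
A minimal sketch of the rename problem described above, assuming the common Maildir convention of a "cur" subdirectory and file names ending in ":2," followed by flag letters. The helper names and the sample file name are hypothetical, and the colon in the file name means this runs on UNIX-like filesystems only.

```python
import os
import tempfile


def set_seen(cur_dir: str, filename: str) -> str:
    """Mark a message as seen by renaming its file to add the S flag."""
    base, _, flags = filename.partition(":2,")
    new_flags = "".join(sorted(set(flags) | {"S"}))
    new_name = f"{base}:2,{new_flags}"
    os.rename(os.path.join(cur_dir, filename),
              os.path.join(cur_dir, new_name))
    return new_name


def find_current_name(cur_dir: str, base: str):
    """Rescan the directory to find the current name of a message."""
    for name in os.listdir(cur_dir):
        if name.partition(":2,")[0] == base:
            return name
    return None


with tempfile.TemporaryDirectory() as maildir:
    cur = os.path.join(maildir, "cur")
    os.mkdir(cur)
    old = "1158000000.12345.host:2,"
    open(os.path.join(cur, old), "wb").close()

    new = set_seen(cur, old)
    # Any other process still holding the old name now has a stale path
    stale = os.path.exists(os.path.join(cur, old))
    # and must rescan the directory to locate the message again.
    found = find_current_name(cur, "1158000000.12345.host")
    print(new, stale, found)
```

Every status change is a rename, so a second process sharing the mailbox cannot trust a cached file name and has to fall back to the directory scan.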
   1.209 +
   1.210 +
   1.211 +So what does this all mean?
   1.212 +
   1.213 +     A database (such as used by Exchange) is really a much better
   1.214 +approach if you want to move away from flat files.  mx and especially
   1.215 +Cyrus take a tentative step in that direction; mx failed mostly because
   1.216 +it didn't go anywhere near far enough.  Cyrus goes much further, and
   1.217 +scores remarkable benefits from doing so.
   1.218 +
   1.219 +     However, a well-designed pure database without the overhead of
   1.220 +separate files would do even better.

UW-IMAP'd extensions by yuuji