yuuji@0: /* ========================================================================
yuuji@0:  * Copyright 1988-2006 University of Washington
yuuji@0:  *
yuuji@0:  * Licensed under the Apache License, Version 2.0 (the "License");
yuuji@0:  * you may not use this file except in compliance with the License.
yuuji@0:  * You may obtain a copy of the License at
yuuji@0:  *
yuuji@0:  *     http://www.apache.org/licenses/LICENSE-2.0
yuuji@0:  *
yuuji@0:  * 
yuuji@0:  * ========================================================================
yuuji@0:  */
yuuji@0: 
yuuji@0: 	 UNIX Advisory File Locking Implications on c-client
yuuji@0: 		    Mark Crispin, 28 November 1995
yuuji@0: 
yuuji@0: 
yuuji@0: 	THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE FACT THAT
yuuji@0: 	LINUX SUPPORTS BOTH flock() AND fcntl() AND THAT OSF/1
yuuji@0: 	HAS BEEN BROKEN SO THAT IT ONLY SUPPORTS fcntl().
yuuji@0: 	-- JUNE 15, 2004
yuuji@0: 
yuuji@0: 	THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE
yuuji@0: 	IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995.  SOME STATEMENTS
yuuji@0: 	IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE
yuuji@0: 	IMAP TOOLKIT.
yuuji@0: 
yuuji@0: INTRODUCTION
yuuji@0: 
yuuji@0:      Advisory locking is a mechanism by which cooperating processes
yuuji@0: can signal to each other their usage of a resource and whether or not
yuuji@0: that usage is critical.  It is not a mechanism to protect against
yuuji@0: processes which do not cooperate in the locking.
yuuji@0: 
yuuji@0:      The most basic form of locking involves a counter.  This counter
yuuji@0: is -1 when the resource is available.  If a process wants the lock, it
yuuji@0: executes an atomic increment-and-test-if-zero.  If the value is zero,
yuuji@0: the process has the lock and can execute the critical code that needs
yuuji@0: exclusive usage of a resource.  When it is finished, it sets the lock
yuuji@0: back to -1.  In C terms:
yuuji@0: 
yuuji@0:   while (++lock)		/* try to get lock */
yuuji@0:     invoke_other_threads ();	/* failed, try again */
yuuji@0:    .
yuuji@0:    .	/* critical code  here */
yuuji@0:    .
yuuji@0:   lock = -1;			/* release lock */
yuuji@0: 
yuuji@0:      This particular form of locking appears most commonly in
yuuji@0: multi-threaded applications such as operating system kernels.  It
yuuji@0: makes several presumptions:
yuuji@0:  (1) it is alright to keep testing the lock (no overflow)
yuuji@0:  (2) the critical resource is single-access only
yuuji@0:  (3) there is shared writeable memory between the two threads
yuuji@0:  (4) the threads can be trusted to release the lock when finished
yuuji@0: 
yuuji@0:      In applications programming on multi-user systems, most commonly
yuuji@0: the other threads are in an entirely different process, which may even
yuuji@0: be logged in as a different user.  Few operating systems offer shared
yuuji@0: writeable memory between such processes.
yuuji@0: 
yuuji@0:      A means of communicating this is by use of a file with a mutually
yuuji@0: agreed upon name.  A binary semaphore can be passed by means of the
yuuji@0: existance or non-existance of that file, provided that there is an
yuuji@0: atomic means to create a file if and only if that file does not exist.
yuuji@0: In C terms:
yuuji@0: 
yuuji@0: 				/* try to get lock */
yuuji@0:   while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0)
yuuji@0:     sleep (1);			/* failed, try again */
yuuji@0:   close (fd);			/* got the lock */
yuuji@0:    .
yuuji@0:    .	/* critical code  here */
yuuji@0:    .
yuuji@0:   unlink ("lockfile"); 		/* release lock */
yuuji@0: 
yuuji@0:      This form of locking makes fewer presumptions, but it still is
yuuji@0: guilty of presumptions (2) and (4) above.  Presumption (2) limits the
yuuji@0: ability to have processes sharing a resource in a non-conflicting
yuuji@0: fashion (e.g. reading from a file).  Presumption (4) leads to
yuuji@0: deadlocks should the process crash while it has a resource locked.
yuuji@0: 
yuuji@0:      Most modern operating systems provide a resource locking system
yuuji@0: call that has none of these presumptions.  In particular, a mechanism
yuuji@0: is provided for identifying shared locks as opposed to exclusive
yuuji@0: locks.  A shared lock permits other processes to obtain a shared lock,
yuuji@0: but denies exclusive locks.  In other words:
yuuji@0: 
yuuji@0: 	current state		want shared	want exclusive
yuuji@0: 	-------------		-----------	--------------
yuuji@0: 	 unlocked		 YES		 YES
yuuji@0: 	 locked shared		 YES		 NO
yuuji@0: 	 locked exclusive	 NO		 NO
yuuji@0: 
yuuji@0:      Furthermore, the operating system automatically relinquishes all
yuuji@0: locks held by that process when it terminates.
yuuji@0: 
yuuji@0:      A useful operation is the ability to upgrade a shared lock to
yuuji@0: exclusive (provided there are no other shared users of the lock) and
yuuji@0: to downgrade an exclusive lock to shared.  It is important that at no
yuuji@0: time is the lock ever removed; a process upgrading to exclusive must
yuuji@0: not relenquish its shared lock.
yuuji@0: 
yuuji@0:      Most commonly, the resources being locked are files.  Shared
yuuji@0: locks are particularly important with files; multiple simultaneous
yuuji@0: processes can read from a file, but only one can safely write at a
yuuji@0: time.  Some writes may be safer than others; an append to the end of
yuuji@0: the file is safer than changing existing file data.  In turn, changing
yuuji@0: a file record in place is safer than rewriting the file with an
yuuji@0: entirely different structure.
yuuji@0: 
yuuji@0: 
yuuji@0: FILE LOCKING ON UNIX
yuuji@0: 
yuuji@0:      In the oldest versions of UNIX, the use of a semaphore lockfile
yuuji@0: was the only available form of locking.  Advisory locking system calls
yuuji@0: were not added to UNIX until after the BSD vs. System V split.  Both
yuuji@0: of these system calls deal with file resources only.
yuuji@0: 
yuuji@0:      Most systems only have one or the other form of locking.  AIX
yuuji@0: and newer versions of OSF/1 emulate the BSD form of locking as a jacket
yuuji@0: into the System V form.  Ultrix and Linux implement both forms.
yuuji@0: 
yuuji@0: BSD
yuuji@0: 
yuuji@0:      BSD added the flock() system call.  It offers capabilities to
yuuji@0: acquire shared lock, acquire exclusive lock, and unlock.  Optionally,
yuuji@0: the process can request an immediate error return instead of blocking
yuuji@0: when the lock is unavailable.
yuuji@0: 
yuuji@0: 
yuuji@0: FLOCK() BUGS
yuuji@0: 
yuuji@0:      flock() advertises that it permits upgrading of shared locks to
yuuji@0: exclusive and downgrading of exclusive locks to shared, but it does so
yuuji@0: by releasing the former lock and then trying to acquire the new lock.
yuuji@0: This creates a window of vulnerability in which another process can
yuuji@0: grab the exclusive lock.  Therefore, this capability is not useful,
yuuji@0: although many programmers have been deluded by incautious reading of
yuuji@0: the flock() man page to believe otherwise.  This problem can be
yuuji@0: programmed around, once the programmer is aware of it.
yuuji@0: 
yuuji@0:      flock() always returns as if it succeeded on NFS files, when in
yuuji@0: fact it is a no-op.  There is no way around this.
yuuji@0: 
yuuji@0:      Leaving aside these two problems, flock() works remarkably well,
yuuji@0: and has shown itself to be robust and trustworthy.
yuuji@0: 
yuuji@0: SYSTEM V/POSIX
yuuji@0: 
yuuji@0:      System V added new functions to the fnctl() system call, and a
yuuji@0: simple interface through the lockf() subroutine.  This was
yuuji@0: subsequently included in POSIX.  Both offer the facility to apply the
yuuji@0: lock to a particular region of the file instead of to the entire file.
yuuji@0: lockf() only supports exclusive locks, and calls fcntl() internally;
yuuji@0: hence it won't be discussed further.
yuuji@0: 
yuuji@0:      Functionally, fcntl() locking is a superset of flock(); it is
yuuji@0: possible to implement a flock() emulator using fcntl(), with one minor
yuuji@0: exception: it is not possible to acquire an exclusive lock if the file
yuuji@0: is not open for write.
yuuji@0: 
yuuji@0:      The fcntl() locking functions are: query lock station of a file
yuuji@0: region, lock/unlock a region, and lock/unlock a region and block until
yuuji@0: have the lock.  The locks may be shared or exclusive.  By means of the
yuuji@0: statd and lockd daemons, fcntl() locking is available on NFS files.
yuuji@0: 
yuuji@0:      When statd is started at system boot, it reads its /etc/state
yuuji@0: file (which contains the number of times it has been invoked) and
yuuji@0: /etc/sm directory (which contains a list of all remote sites which are
yuuji@0: client or server locking with this site), and notifies the statd on
yuuji@0: each of these systems that it has been restarted.  Each statd then
yuuji@0: notifies the local lockd of the restart of that system.
yuuji@0: 
yuuji@0:      lockd receives fcntl() requests for NFS files.  It communicates
yuuji@0: with the lockd at the server and requests it to apply the lock, and
yuuji@0: with the statd to request it for notification when the server goes
yuuji@0: down.  It blocks until all these requests are completed.
yuuji@0: 
yuuji@0:      There is quite a mythos about fcntl() locking.
yuuji@0: 
yuuji@0:      One religion holds that fcntl() locking is the best thing since
yuuji@0: sliced bread, and that programs which use flock() should be converted
yuuji@0: to fcntl() so that NFS locking will work.  However, as noted above,
yuuji@0: very few systems support both calls, so such an exercise is pointless
yuuji@0: except on Ultrix and Linux.
yuuji@0: 
yuuji@0:      Another religion, which I adhere to, has the opposite viewpoint.
yuuji@0: 
yuuji@0: 
yuuji@0: FCNTL() BUGS
yuuji@0: 
yuuji@0:      For all of the hairy code to do individual section locking of a
yuuji@0: file, it's clear that the designers of fcntl() locking never
yuuji@0: considered some very basic locking operations.  It's as if all they
yuuji@0: knew about locking they got out of some CS textbook with not
yuuji@0: investigation of real-world needs.
yuuji@0: 
yuuji@0:      It is not possible to acquire an exclusive lock unless the file
yuuji@0: is open for write.  You could have append with shared read, and thus
yuuji@0: you could have a case in which a read-only access may need to go
yuuji@0: exclusive.  This problem can be programmed around once the programmer
yuuji@0: is aware of it.
yuuji@0: 
yuuji@0:      If the file is opened on another file designator in the same
yuuji@0: process, the file is unlocked even if no attempt is made to do any
yuuji@0: form of locking on the second designator.  This is a very bad bug.  It
yuuji@0: means that an application must keep track of all the files that it has
yuuji@0: opened and locked.
yuuji@0: 
yuuji@0:      If there is no statd/lockd on the NFS server, fcntl() will hang
yuuji@0: forever waiting for them to appear.  This is a bad bug.  It means that
yuuji@0: any attempt to lock on a server that doesn't run these daemons will
yuuji@0: hang.  There is no way for an application to request flock() style
yuuji@0: ``try to lock, but no-op if the mechanism ain't there''.
yuuji@0: 
yuuji@0:      There is a rumor to the effect that fcntl() will hang forever on
yuuji@0: local files too if there is no local statd/lockd.  These daemons are
yuuji@0: running on mailer.u, although they appear not to have much CPU time.
yuuji@0: A useful experiment would be to kill them and see if imapd is affected
yuuji@0: in any way, but I decline to do so without an OK from UCS!  ;-) If
yuuji@0: killing statd/lockd can be done without breaking fcntl() on local
yuuji@0: files, this would become one of the primary means of dealing with this
yuuji@0: problem.
yuuji@0: 
yuuji@0:      The statd and lockd daemons have quite a reputation for extreme
yuuji@0: fragility.  There have been numerous reports about the locking
yuuji@0: mechanism being wedged on a systemwide or even clusterwide basis,
yuuji@0: requiring a reboot to clear.  It is rumored that this wedge, once it
yuuji@0: happens, also blocks local locking.  Presumably killing and restarting
yuuji@0: statd would suffice to clear the wedge, but I haven't verified this.
yuuji@0: 
yuuji@0:      There appears to be a limit to how many locks may be in use at a
yuuji@0: time on the system, although the documentation only mentions it in
yuuji@0: passing.  On some of their systems, UCS has increased lockd's ``size
yuuji@0: of the socket buffer'', whatever that means.
yuuji@0: 
yuuji@0: C-CLIENT USAGE
yuuji@0: 
yuuji@0:      c-client uses flock().  On System V systems, flock() is simulated
yuuji@0: by an emulator that calls fcntl().
yuuji@0: 
yuuji@0: 
yuuji@0: BEZERK AND MMDF
yuuji@0: 
yuuji@0:      Locking in the traditional UNIX formats was largely dictated by
yuuji@0: the status quo in other applications; however, additional protection
yuuji@0: is added against inadvertantly running multiple instances of a
yuuji@0: c-client application on the same mail file.
yuuji@0: 
yuuji@0:      (1) c-client attempts to create a .lock file (mail file name with
yuuji@0: ``.lock'' appended) whenever it reads from, or writes to, the mail
yuuji@0: file.  This is an exclusive lock, and is held only for short periods
yuuji@0: of time while c-client is actually doing the I/O.  There is a 5-minute
yuuji@0: timeout for this lock, after which it is broken on the presumption
yuuji@0: that it is a stale lock.  If it can not create the .lock file due to
yuuji@0: an EACCES (protection failure) error, it once silently proceeded
yuuji@0: without this lock; this was for systems which protect /usr/spool/mail
yuuji@0: from unprivileged processes creating files.  Today, c-client reports
yuuji@0: an error unless it is built otherwise.  The purpose of this lock is to
yuuji@0: prevent against unfavorable interactions with mail delivery.
yuuji@0: 
yuuji@0:      (2) c-client applies a shared flock() to the mail file whenever
yuuji@0: it reads from the mail file, and an exclusive flock() whenever it
yuuji@0: writes to the mail file.  This lock is freed as soon as it finishes
yuuji@0: reading.  The purpose of this lock is to prevent against unfavorable
yuuji@0: interactions with mail delivery.
yuuji@0: 
yuuji@0:      (3) c-client applies an exclusive flock() to a file on /tmp
yuuji@0: (whose name represents the device and inode number of the file) when
yuuji@0: it opens the mail file.  This lock is maintained throughout the
yuuji@0: session, although c-client has a feature (called ``kiss of death'')
yuuji@0: which permits c-client to forcibly and irreversibly seize the lock
yuuji@0: from a cooperating c-client application that surrenders the lock on
yuuji@0: demand.  The purpose of this lock is to prevent against unfavorable
yuuji@0: interactions with other instances of c-client (rewriting the mail
yuuji@0: file).
yuuji@0: 
yuuji@0:      Mail delivery daemons use lock (1), (2), or both.  Lock (1) works
yuuji@0: over NFS; lock (2) is the only one that works on sites that protect
yuuji@0: /usr/spool/mail against unprivileged file creation.  Prudent mail
yuuji@0: delivery daemons use both forms of locking, and of course so does
yuuji@0: c-client.
yuuji@0: 
yuuji@0:      If only lock (2) is used, then multiple processes can read from
yuuji@0: the mail file simultaneously, although in real life this doesn't
yuuji@0: really change things.  The normal state of locks (1) and (2) is
yuuji@0: unlocked except for very brief periods.
yuuji@0: 
yuuji@0: 
yuuji@0: TENEX AND MTX
yuuji@0: 
yuuji@0:      The design of the locking mechanism of these formats was
yuuji@0: motivated by a design to enable multiple simultaneous read/write
yuuji@0: access.  It is almost the reverse of how locking works with
yuuji@0: bezerk/mmdf.
yuuji@0: 
yuuji@0:      (1) c-client applies a shared flock() to the mail file when it
yuuji@0: opens the mail file.  It upgrades this lock to exclusive whenever it
yuuji@0: tries to expunge the mail file.  Because of the flock() bug that
yuuji@0: upgrading a lock actually releases it, it will not do so until it has
yuuji@0: acquired an exclusive lock (2) first.  The purpose of this lock is to
yuuji@0: prevent against expunge taking place while some other c-client has the
yuuji@0: mail file open (and thus knows where all the messages are).
yuuji@0: 
yuuji@0:      (2) c-client applies a shared flock() to a file on /tmp (whose
yuuji@0: name represents the device and inode number of the file) when it
yuuji@0: parses the mail file.  It applies an exclusive flock() to this file
yuuji@0: when it appends new mail to the mail file, as well as before it
yuuji@0: attempts to upgrade lock (1) to exclusive.  The purpose of this lock
yuuji@0: is to prevent against data being appended while some other c-client is
yuuji@0: parsing mail in the file (to prevent reading of incomplete messages).
yuuji@0: It also protects against the lock-releasing timing race on lock (1).
yuuji@0: 
yuuji@0: OBSERVATIONS
yuuji@0: 
yuuji@0:      In a perfect world, locking works.  You are protected against
yuuji@0: unfavorable interactions with the mailer and against your own mistake
yuuji@0: by running more than one instance of your mail reader.  In tenex/mtx
yuuji@0: formats, you have the additional benefit that multiple simultaneous
yuuji@0: read/write access works, with the sole restriction being that you
yuuji@0: can't expunge if there are any sharers of the mail file.
yuuji@0: 
yuuji@0:      If the mail file is NFS-mounted, then flock() locking is a silent
yuuji@0: no-op.  This is the way BSD implements flock(), and c-client's
yuuji@0: emulation of flock() through fcntl() tests for NFS files and
yuuji@0: duplicates this functionality.  There is no locking protection for
yuuji@0: tenex/mtx mail files at all, and only protection against the mailer
yuuji@0: for bezerk/mmdf mail files.  This has been the accepted state of
yuuji@0: affairs on UNIX for many sad years.
yuuji@0: 
yuuji@0:      If you can not create .lock files, it should not affect locking,
yuuji@0: since the flock() locks suffice for all protection.  This is, however,
yuuji@0: not true if the mailer does not check for flock() locking, or if the
yuuji@0: the mail file is NFS-mounted.
yuuji@0: 
yuuji@0:      What this means is that there is *no* locking protection at all
yuuji@0: in the case of a client using an NFS-mounted /usr/spool/mail that does
yuuji@0: not permit file creation by unprivileged programs.  It is impossible,
yuuji@0: under these circumstances, for an unprivileged program to do anything
yuuji@0: about it.  Worse, if EACCES errors on .lock file creation are no-op'ed
yuuji@0: , the user won't even know about it.  This is arguably a site
yuuji@0: configuration error.
yuuji@0: 
yuuji@0:      The problem with not being able to create .lock files exists on
yuuji@0: System V as well, but the failure modes for flock() -- which is
yuuji@0: implemented via fcntl() -- are different.
yuuji@0: 
yuuji@0:      On System V, if the mail file is NFS-mounted and either the
yuuji@0: client or the server lacks a functioning statd/lockd pair, then the
yuuji@0: lock attempt would have hung forever if it weren't for the fact that
yuuji@0: c-client tests for NFS and no-ops the flock() emulator in this case.
yuuji@0: Systemwide or clusterwide failures of statd/lockd have been known to
yuuji@0: occur which cause all locks in all processes to hang (including
yuuji@0: local?).  Without the special NFS test made by c-client, there would
yuuji@0: be no way to request BSD-style no-op behavior, nor is there any way to
yuuji@0: determine that this is happening other than the system being hung.
yuuji@0: 
yuuji@0:      The additional locking introduced by c-client was shown to cause
yuuji@0: much more stress on the System V locking mechanism than has
yuuji@0: traditionally been placed upon it.  If it was stressed too far, all
yuuji@0: hell broke loose.  Fortunately, this is now past history.
yuuji@0: 
yuuji@0: TRADEOFFS
yuuji@0: 
yuuji@0:      c-client based applications have a reasonable chance of winning
yuuji@0: as long as you don't use NFS for remote access to mail files.  That's
yuuji@0: what IMAP is for, after all.  It is, however, very important to
yuuji@0: realize that you can *not* use the lock-upgrade feature by itself
yuuji@0: because it releases the lock as an interim step -- you need to have
yuuji@0: lock-upgrading guarded by another lock.
yuuji@0: 
yuuji@0:      If you have the misfortune of using System V, you are likely to
yuuji@0: run into problems sooner or later having to do with statd/lockd.  You
yuuji@0: basically end up with one of three unsatisfactory choices:
yuuji@0: 	1) Grit your teeth and live with it.
yuuji@0: 	2) Try to make it work:
yuuji@0: 	   a) avoid NFS access so as not to stress statd/lockd.
yuuji@0: 	   b) try to understand the code in statd/lockd and hack it
yuuji@0: 	      to be more robust.
yuuji@0: 	   c) hunt out the system limit of locks, if there is one,
yuuji@0: 	      and increase it.  Figure on at least two locks per
yuuji@0: 	      simultaneous imapd process and four locks per Pine
yuuji@0: 	      process.  Better yet, make the limit be 10 times the
yuuji@0: 	      maximum number of processes.
yuuji@0: 	   d) increase the socket buffer (-S switch to lockd) if
yuuji@0: 	      it is offered.  I don't know what this actually does,
yuuji@0: 	      but giving lockd more resources to do its work can't
yuuji@0: 	      hurt.  Maybe.
yuuji@0: 	3) Decide that it can't possibly work, and turn off the 
yuuji@0: 	   fcntl() calls in your program.
yuuji@0: 	4) If nuking statd/lockd can be done without breaking local
yuuji@0: 	   locking, then do so.  This would make SVR4 have the same
yuuji@0: 	   limitations as BSD locking, with a couple of additional
yuuji@0: 	   bugs.
yuuji@0: 	5) Check for NFS, and don't do the fcntl() in the NFS case.
yuuji@0: 	   This is what c-client does.
yuuji@0: 
yuuji@0:      Note that if you are going to use NFS to access files on a server
yuuji@0: which does not have statd/lockd running, your only choice is (3), (4),
yuuji@0: or (5).  Here again, IMAP can bail you out.
yuuji@0: 
yuuji@0:      These problems aren't unique to c-client applications; they have
yuuji@0: also been reported with Elm, Mediamail, and other email tools.
yuuji@0: 
yuuji@0:      Of the other two SVR4 locking bugs:
yuuji@0: 
yuuji@0:      Programmer awareness is necessary to deal with the bug that you
yuuji@0: can not get an exclusive lock unless the file is open for write.  I
yuuji@0: believe that c-client has fixed all of these cases.
yuuji@0: 
yuuji@0:      The problem about opening a second designator smashing any
yuuji@0: current locks on the file has not been addressed satisfactorily yet.
yuuji@0: This is not an easy problem to deal with, especially in c-client which
yuuji@0: really doesn't know what other files/streams may be open by Pine.
yuuji@0: 
yuuji@0:      Aren't you so happy that you bought an System V system?