yuuji@0: /* ======================================================================== yuuji@0: * Copyright 1988-2006 University of Washington yuuji@0: * yuuji@0: * Licensed under the Apache License, Version 2.0 (the "License"); yuuji@0: * you may not use this file except in compliance with the License. yuuji@0: * You may obtain a copy of the License at yuuji@0: * yuuji@0: * http://www.apache.org/licenses/LICENSE-2.0 yuuji@0: * yuuji@0: * yuuji@0: * ======================================================================== yuuji@0: */ yuuji@0: yuuji@0: UNIX Advisory File Locking Implications on c-client yuuji@0: Mark Crispin, 28 November 1995 yuuji@0: yuuji@0: yuuji@0: THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE FACT THAT yuuji@0: LINUX SUPPORTS BOTH flock() AND fcntl() AND THAT OSF/1 yuuji@0: HAS BEEN BROKEN SO THAT IT ONLY SUPPORTS fcntl(). yuuji@0: -- JUNE 15, 2004 yuuji@0: yuuji@0: THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE yuuji@0: IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995. SOME STATEMENTS yuuji@0: IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE yuuji@0: IMAP TOOLKIT. yuuji@0: yuuji@0: INTRODUCTION yuuji@0: yuuji@0: Advisory locking is a mechanism by which cooperating processes yuuji@0: can signal to each other their usage of a resource and whether or not yuuji@0: that usage is critical. It is not a mechanism to protect against yuuji@0: processes which do not cooperate in the locking. yuuji@0: yuuji@0: The most basic form of locking involves a counter. This counter yuuji@0: is -1 when the resource is available. If a process wants the lock, it yuuji@0: executes an atomic increment-and-test-if-zero. If the value is zero, yuuji@0: the process has the lock and can execute the critical code that needs yuuji@0: exclusive usage of a resource. When it is finished, it sets the lock yuuji@0: back to -1. In C terms: yuuji@0: yuuji@0: while (++lock) /* try to get lock */ yuuji@0: invoke_other_threads (); /* failed, try again */ yuuji@0: . yuuji@0: . /* critical code here */ yuuji@0: . yuuji@0: lock = -1; /* release lock */ yuuji@0: yuuji@0: This particular form of locking appears most commonly in yuuji@0: multi-threaded applications such as operating system kernels. It yuuji@0: makes several presumptions: yuuji@0: (1) it is alright to keep testing the lock (no overflow) yuuji@0: (2) the critical resource is single-access only yuuji@0: (3) there is shared writeable memory between the two threads yuuji@0: (4) the threads can be trusted to release the lock when finished yuuji@0: yuuji@0: In applications programming on multi-user systems, most commonly yuuji@0: the other threads are in an entirely different process, which may even yuuji@0: be logged in as a different user. Few operating systems offer shared yuuji@0: writeable memory between such processes. yuuji@0: yuuji@0: A means of communicating this is by use of a file with a mutually yuuji@0: agreed upon name. A binary semaphore can be passed by means of the yuuji@0: existance or non-existance of that file, provided that there is an yuuji@0: atomic means to create a file if and only if that file does not exist. yuuji@0: In C terms: yuuji@0: yuuji@0: /* try to get lock */ yuuji@0: while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0) yuuji@0: sleep (1); /* failed, try again */ yuuji@0: close (fd); /* got the lock */ yuuji@0: . yuuji@0: . /* critical code here */ yuuji@0: . yuuji@0: unlink ("lockfile"); /* release lock */ yuuji@0: yuuji@0: This form of locking makes fewer presumptions, but it still is yuuji@0: guilty of presumptions (2) and (4) above. Presumption (2) limits the yuuji@0: ability to have processes sharing a resource in a non-conflicting yuuji@0: fashion (e.g. reading from a file). Presumption (4) leads to yuuji@0: deadlocks should the process crash while it has a resource locked. yuuji@0: yuuji@0: Most modern operating systems provide a resource locking system yuuji@0: call that has none of these presumptions. In particular, a mechanism yuuji@0: is provided for identifying shared locks as opposed to exclusive yuuji@0: locks. A shared lock permits other processes to obtain a shared lock, yuuji@0: but denies exclusive locks. In other words: yuuji@0: yuuji@0: current state want shared want exclusive yuuji@0: ------------- ----------- -------------- yuuji@0: unlocked YES YES yuuji@0: locked shared YES NO yuuji@0: locked exclusive NO NO yuuji@0: yuuji@0: Furthermore, the operating system automatically relinquishes all yuuji@0: locks held by that process when it terminates. yuuji@0: yuuji@0: A useful operation is the ability to upgrade a shared lock to yuuji@0: exclusive (provided there are no other shared users of the lock) and yuuji@0: to downgrade an exclusive lock to shared. It is important that at no yuuji@0: time is the lock ever removed; a process upgrading to exclusive must yuuji@0: not relenquish its shared lock. yuuji@0: yuuji@0: Most commonly, the resources being locked are files. Shared yuuji@0: locks are particularly important with files; multiple simultaneous yuuji@0: processes can read from a file, but only one can safely write at a yuuji@0: time. Some writes may be safer than others; an append to the end of yuuji@0: the file is safer than changing existing file data. In turn, changing yuuji@0: a file record in place is safer than rewriting the file with an yuuji@0: entirely different structure. yuuji@0: yuuji@0: yuuji@0: FILE LOCKING ON UNIX yuuji@0: yuuji@0: In the oldest versions of UNIX, the use of a semaphore lockfile yuuji@0: was the only available form of locking. Advisory locking system calls yuuji@0: were not added to UNIX until after the BSD vs. System V split. Both yuuji@0: of these system calls deal with file resources only. yuuji@0: yuuji@0: Most systems only have one or the other form of locking. AIX yuuji@0: and newer versions of OSF/1 emulate the BSD form of locking as a jacket yuuji@0: into the System V form. Ultrix and Linux implement both forms. yuuji@0: yuuji@0: BSD yuuji@0: yuuji@0: BSD added the flock() system call. It offers capabilities to yuuji@0: acquire shared lock, acquire exclusive lock, and unlock. Optionally, yuuji@0: the process can request an immediate error return instead of blocking yuuji@0: when the lock is unavailable. yuuji@0: yuuji@0: yuuji@0: FLOCK() BUGS yuuji@0: yuuji@0: flock() advertises that it permits upgrading of shared locks to yuuji@0: exclusive and downgrading of exclusive locks to shared, but it does so yuuji@0: by releasing the former lock and then trying to acquire the new lock. yuuji@0: This creates a window of vulnerability in which another process can yuuji@0: grab the exclusive lock. Therefore, this capability is not useful, yuuji@0: although many programmers have been deluded by incautious reading of yuuji@0: the flock() man page to believe otherwise. This problem can be yuuji@0: programmed around, once the programmer is aware of it. yuuji@0: yuuji@0: flock() always returns as if it succeeded on NFS files, when in yuuji@0: fact it is a no-op. There is no way around this. yuuji@0: yuuji@0: Leaving aside these two problems, flock() works remarkably well, yuuji@0: and has shown itself to be robust and trustworthy. yuuji@0: yuuji@0: SYSTEM V/POSIX yuuji@0: yuuji@0: System V added new functions to the fnctl() system call, and a yuuji@0: simple interface through the lockf() subroutine. This was yuuji@0: subsequently included in POSIX. Both offer the facility to apply the yuuji@0: lock to a particular region of the file instead of to the entire file. yuuji@0: lockf() only supports exclusive locks, and calls fcntl() internally; yuuji@0: hence it won't be discussed further. yuuji@0: yuuji@0: Functionally, fcntl() locking is a superset of flock(); it is yuuji@0: possible to implement a flock() emulator using fcntl(), with one minor yuuji@0: exception: it is not possible to acquire an exclusive lock if the file yuuji@0: is not open for write. yuuji@0: yuuji@0: The fcntl() locking functions are: query lock station of a file yuuji@0: region, lock/unlock a region, and lock/unlock a region and block until yuuji@0: have the lock. The locks may be shared or exclusive. By means of the yuuji@0: statd and lockd daemons, fcntl() locking is available on NFS files. yuuji@0: yuuji@0: When statd is started at system boot, it reads its /etc/state yuuji@0: file (which contains the number of times it has been invoked) and yuuji@0: /etc/sm directory (which contains a list of all remote sites which are yuuji@0: client or server locking with this site), and notifies the statd on yuuji@0: each of these systems that it has been restarted. Each statd then yuuji@0: notifies the local lockd of the restart of that system. yuuji@0: yuuji@0: lockd receives fcntl() requests for NFS files. It communicates yuuji@0: with the lockd at the server and requests it to apply the lock, and yuuji@0: with the statd to request it for notification when the server goes yuuji@0: down. It blocks until all these requests are completed. yuuji@0: yuuji@0: There is quite a mythos about fcntl() locking. yuuji@0: yuuji@0: One religion holds that fcntl() locking is the best thing since yuuji@0: sliced bread, and that programs which use flock() should be converted yuuji@0: to fcntl() so that NFS locking will work. However, as noted above, yuuji@0: very few systems support both calls, so such an exercise is pointless yuuji@0: except on Ultrix and Linux. yuuji@0: yuuji@0: Another religion, which I adhere to, has the opposite viewpoint. yuuji@0: yuuji@0: yuuji@0: FCNTL() BUGS yuuji@0: yuuji@0: For all of the hairy code to do individual section locking of a yuuji@0: file, it's clear that the designers of fcntl() locking never yuuji@0: considered some very basic locking operations. It's as if all they yuuji@0: knew about locking they got out of some CS textbook with not yuuji@0: investigation of real-world needs. yuuji@0: yuuji@0: It is not possible to acquire an exclusive lock unless the file yuuji@0: is open for write. You could have append with shared read, and thus yuuji@0: you could have a case in which a read-only access may need to go yuuji@0: exclusive. This problem can be programmed around once the programmer yuuji@0: is aware of it. yuuji@0: yuuji@0: If the file is opened on another file designator in the same yuuji@0: process, the file is unlocked even if no attempt is made to do any yuuji@0: form of locking on the second designator. This is a very bad bug. It yuuji@0: means that an application must keep track of all the files that it has yuuji@0: opened and locked. yuuji@0: yuuji@0: If there is no statd/lockd on the NFS server, fcntl() will hang yuuji@0: forever waiting for them to appear. This is a bad bug. It means that yuuji@0: any attempt to lock on a server that doesn't run these daemons will yuuji@0: hang. There is no way for an application to request flock() style yuuji@0: ``try to lock, but no-op if the mechanism ain't there''. yuuji@0: yuuji@0: There is a rumor to the effect that fcntl() will hang forever on yuuji@0: local files too if there is no local statd/lockd. These daemons are yuuji@0: running on mailer.u, although they appear not to have much CPU time. yuuji@0: A useful experiment would be to kill them and see if imapd is affected yuuji@0: in any way, but I decline to do so without an OK from UCS! ;-) If yuuji@0: killing statd/lockd can be done without breaking fcntl() on local yuuji@0: files, this would become one of the primary means of dealing with this yuuji@0: problem. yuuji@0: yuuji@0: The statd and lockd daemons have quite a reputation for extreme yuuji@0: fragility. There have been numerous reports about the locking yuuji@0: mechanism being wedged on a systemwide or even clusterwide basis, yuuji@0: requiring a reboot to clear. It is rumored that this wedge, once it yuuji@0: happens, also blocks local locking. Presumably killing and restarting yuuji@0: statd would suffice to clear the wedge, but I haven't verified this. yuuji@0: yuuji@0: There appears to be a limit to how many locks may be in use at a yuuji@0: time on the system, although the documentation only mentions it in yuuji@0: passing. On some of their systems, UCS has increased lockd's ``size yuuji@0: of the socket buffer'', whatever that means. yuuji@0: yuuji@0: C-CLIENT USAGE yuuji@0: yuuji@0: c-client uses flock(). On System V systems, flock() is simulated yuuji@0: by an emulator that calls fcntl(). yuuji@0: yuuji@0: yuuji@0: BEZERK AND MMDF yuuji@0: yuuji@0: Locking in the traditional UNIX formats was largely dictated by yuuji@0: the status quo in other applications; however, additional protection yuuji@0: is added against inadvertantly running multiple instances of a yuuji@0: c-client application on the same mail file. yuuji@0: yuuji@0: (1) c-client attempts to create a .lock file (mail file name with yuuji@0: ``.lock'' appended) whenever it reads from, or writes to, the mail yuuji@0: file. This is an exclusive lock, and is held only for short periods yuuji@0: of time while c-client is actually doing the I/O. There is a 5-minute yuuji@0: timeout for this lock, after which it is broken on the presumption yuuji@0: that it is a stale lock. If it can not create the .lock file due to yuuji@0: an EACCES (protection failure) error, it once silently proceeded yuuji@0: without this lock; this was for systems which protect /usr/spool/mail yuuji@0: from unprivileged processes creating files. Today, c-client reports yuuji@0: an error unless it is built otherwise. The purpose of this lock is to yuuji@0: prevent against unfavorable interactions with mail delivery. yuuji@0: yuuji@0: (2) c-client applies a shared flock() to the mail file whenever yuuji@0: it reads from the mail file, and an exclusive flock() whenever it yuuji@0: writes to the mail file. This lock is freed as soon as it finishes yuuji@0: reading. The purpose of this lock is to prevent against unfavorable yuuji@0: interactions with mail delivery. yuuji@0: yuuji@0: (3) c-client applies an exclusive flock() to a file on /tmp yuuji@0: (whose name represents the device and inode number of the file) when yuuji@0: it opens the mail file. This lock is maintained throughout the yuuji@0: session, although c-client has a feature (called ``kiss of death'') yuuji@0: which permits c-client to forcibly and irreversibly seize the lock yuuji@0: from a cooperating c-client application that surrenders the lock on yuuji@0: demand. The purpose of this lock is to prevent against unfavorable yuuji@0: interactions with other instances of c-client (rewriting the mail yuuji@0: file). yuuji@0: yuuji@0: Mail delivery daemons use lock (1), (2), or both. Lock (1) works yuuji@0: over NFS; lock (2) is the only one that works on sites that protect yuuji@0: /usr/spool/mail against unprivileged file creation. Prudent mail yuuji@0: delivery daemons use both forms of locking, and of course so does yuuji@0: c-client. yuuji@0: yuuji@0: If only lock (2) is used, then multiple processes can read from yuuji@0: the mail file simultaneously, although in real life this doesn't yuuji@0: really change things. The normal state of locks (1) and (2) is yuuji@0: unlocked except for very brief periods. yuuji@0: yuuji@0: yuuji@0: TENEX AND MTX yuuji@0: yuuji@0: The design of the locking mechanism of these formats was yuuji@0: motivated by a design to enable multiple simultaneous read/write yuuji@0: access. It is almost the reverse of how locking works with yuuji@0: bezerk/mmdf. yuuji@0: yuuji@0: (1) c-client applies a shared flock() to the mail file when it yuuji@0: opens the mail file. It upgrades this lock to exclusive whenever it yuuji@0: tries to expunge the mail file. Because of the flock() bug that yuuji@0: upgrading a lock actually releases it, it will not do so until it has yuuji@0: acquired an exclusive lock (2) first. The purpose of this lock is to yuuji@0: prevent against expunge taking place while some other c-client has the yuuji@0: mail file open (and thus knows where all the messages are). yuuji@0: yuuji@0: (2) c-client applies a shared flock() to a file on /tmp (whose yuuji@0: name represents the device and inode number of the file) when it yuuji@0: parses the mail file. It applies an exclusive flock() to this file yuuji@0: when it appends new mail to the mail file, as well as before it yuuji@0: attempts to upgrade lock (1) to exclusive. The purpose of this lock yuuji@0: is to prevent against data being appended while some other c-client is yuuji@0: parsing mail in the file (to prevent reading of incomplete messages). yuuji@0: It also protects against the lock-releasing timing race on lock (1). yuuji@0: yuuji@0: OBSERVATIONS yuuji@0: yuuji@0: In a perfect world, locking works. You are protected against yuuji@0: unfavorable interactions with the mailer and against your own mistake yuuji@0: by running more than one instance of your mail reader. In tenex/mtx yuuji@0: formats, you have the additional benefit that multiple simultaneous yuuji@0: read/write access works, with the sole restriction being that you yuuji@0: can't expunge if there are any sharers of the mail file. yuuji@0: yuuji@0: If the mail file is NFS-mounted, then flock() locking is a silent yuuji@0: no-op. This is the way BSD implements flock(), and c-client's yuuji@0: emulation of flock() through fcntl() tests for NFS files and yuuji@0: duplicates this functionality. There is no locking protection for yuuji@0: tenex/mtx mail files at all, and only protection against the mailer yuuji@0: for bezerk/mmdf mail files. This has been the accepted state of yuuji@0: affairs on UNIX for many sad years. yuuji@0: yuuji@0: If you can not create .lock files, it should not affect locking, yuuji@0: since the flock() locks suffice for all protection. This is, however, yuuji@0: not true if the mailer does not check for flock() locking, or if the yuuji@0: the mail file is NFS-mounted. yuuji@0: yuuji@0: What this means is that there is *no* locking protection at all yuuji@0: in the case of a client using an NFS-mounted /usr/spool/mail that does yuuji@0: not permit file creation by unprivileged programs. It is impossible, yuuji@0: under these circumstances, for an unprivileged program to do anything yuuji@0: about it. Worse, if EACCES errors on .lock file creation are no-op'ed yuuji@0: , the user won't even know about it. This is arguably a site yuuji@0: configuration error. yuuji@0: yuuji@0: The problem with not being able to create .lock files exists on yuuji@0: System V as well, but the failure modes for flock() -- which is yuuji@0: implemented via fcntl() -- are different. yuuji@0: yuuji@0: On System V, if the mail file is NFS-mounted and either the yuuji@0: client or the server lacks a functioning statd/lockd pair, then the yuuji@0: lock attempt would have hung forever if it weren't for the fact that yuuji@0: c-client tests for NFS and no-ops the flock() emulator in this case. yuuji@0: Systemwide or clusterwide failures of statd/lockd have been known to yuuji@0: occur which cause all locks in all processes to hang (including yuuji@0: local?). Without the special NFS test made by c-client, there would yuuji@0: be no way to request BSD-style no-op behavior, nor is there any way to yuuji@0: determine that this is happening other than the system being hung. yuuji@0: yuuji@0: The additional locking introduced by c-client was shown to cause yuuji@0: much more stress on the System V locking mechanism than has yuuji@0: traditionally been placed upon it. If it was stressed too far, all yuuji@0: hell broke loose. Fortunately, this is now past history. yuuji@0: yuuji@0: TRADEOFFS yuuji@0: yuuji@0: c-client based applications have a reasonable chance of winning yuuji@0: as long as you don't use NFS for remote access to mail files. That's yuuji@0: what IMAP is for, after all. It is, however, very important to yuuji@0: realize that you can *not* use the lock-upgrade feature by itself yuuji@0: because it releases the lock as an interim step -- you need to have yuuji@0: lock-upgrading guarded by another lock. yuuji@0: yuuji@0: If you have the misfortune of using System V, you are likely to yuuji@0: run into problems sooner or later having to do with statd/lockd. You yuuji@0: basically end up with one of three unsatisfactory choices: yuuji@0: 1) Grit your teeth and live with it. yuuji@0: 2) Try to make it work: yuuji@0: a) avoid NFS access so as not to stress statd/lockd. yuuji@0: b) try to understand the code in statd/lockd and hack it yuuji@0: to be more robust. yuuji@0: c) hunt out the system limit of locks, if there is one, yuuji@0: and increase it. Figure on at least two locks per yuuji@0: simultaneous imapd process and four locks per Pine yuuji@0: process. Better yet, make the limit be 10 times the yuuji@0: maximum number of processes. yuuji@0: d) increase the socket buffer (-S switch to lockd) if yuuji@0: it is offered. I don't know what this actually does, yuuji@0: but giving lockd more resources to do its work can't yuuji@0: hurt. Maybe. yuuji@0: 3) Decide that it can't possibly work, and turn off the yuuji@0: fcntl() calls in your program. yuuji@0: 4) If nuking statd/lockd can be done without breaking local yuuji@0: locking, then do so. This would make SVR4 have the same yuuji@0: limitations as BSD locking, with a couple of additional yuuji@0: bugs. yuuji@0: 5) Check for NFS, and don't do the fcntl() in the NFS case. yuuji@0: This is what c-client does. yuuji@0: yuuji@0: Note that if you are going to use NFS to access files on a server yuuji@0: which does not have statd/lockd running, your only choice is (3), (4), yuuji@0: or (5). Here again, IMAP can bail you out. yuuji@0: yuuji@0: These problems aren't unique to c-client applications; they have yuuji@0: also been reported with Elm, Mediamail, and other email tools. yuuji@0: yuuji@0: Of the other two SVR4 locking bugs: yuuji@0: yuuji@0: Programmer awareness is necessary to deal with the bug that you yuuji@0: can not get an exclusive lock unless the file is open for write. I yuuji@0: believe that c-client has fixed all of these cases. yuuji@0: yuuji@0: The problem about opening a second designator smashing any yuuji@0: current locks on the file has not been addressed satisfactorily yet. yuuji@0: This is not an easy problem to deal with, especially in c-client which yuuji@0: really doesn't know what other files/streams may be open by Pine. yuuji@0: yuuji@0: Aren't you so happy that you bought an System V system?