Chapter 5. Repository Administration

Table of Contents

Repository Basics
Understanding Transactions and Revisions
Unversioned Properties
Repository Data Stores
Berkeley DB
FSFS
Repository Creation and Configuration
Hook Scripts
Berkeley DB Configuration
Repository Maintenance
An Administrator's Toolkit
svnlook
svnadmin
svndumpfilter
Berkeley DB Utilities
Repository Cleanup
Managing Disk Space
Repository Recovery
Migrating a Repository
Repository Backup
Adding Projects
Choosing a Repository Layout
Creating the Layout, and Importing Initial Data
Summary

The Subversion repository is the central storehouse of versioned data for any number of projects. As such, it becomes an obvious candidate for all the love and attention an administrator can offer. While the repository is generally a low-maintenance item, it is important to understand how to configure and care for it properly, so that potential problems are avoided and actual problems are safely resolved.

In this chapter, we'll discuss how to create and configure a Subversion repository. We'll also talk about repository maintenance, including the use of the svnlook and svnadmin tools (which are provided with Subversion). We'll address some common questions and mistakes, and give some suggestions on how to arrange the data in the repository.

If you plan to access a Subversion repository only in the role of a user whose data is under version control (that is, via a Subversion client), you can skip this chapter altogether. However, if you are, or wish to become, a Subversion repository administrator, [20] you should definitely pay attention to this chapter.

Repository Basics

Before jumping into the broader topic of repository administration, let's define more precisely what a repository is. How does it look? How does it feel? Does it take its tea hot or iced, sweetened, and with lemon? As an administrator, you'll be expected to understand the composition of a repository both from a logical perspective (how data is represented inside the repository) and from a purely physical perspective (how a repository looks and behaves with respect to non-Subversion tools). The following section covers some of these basic concepts at a very high level.

Understanding Transactions and Revisions

Conceptually speaking, a Subversion repository is a sequence of directory trees. Each tree is a snapshot of how the files and directories versioned in your repository looked at some point in time. These snapshots are created as a result of client operations, and are called revisions.

Every revision begins life as a transaction tree. When doing a commit, a client builds a Subversion transaction that mirrors its local changes (plus any additional changes that might have been made to the repository since the beginning of the client's commit process), and then instructs the repository to store that tree as the next snapshot in the sequence. If the commit succeeds, the transaction is effectively promoted into a new revision tree, and is assigned a new revision number. If the commit fails for some reason, the transaction is destroyed and the client is informed of the failure.
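The commit life cycle above can be pictured with a small in-memory model. This is only a sketch: ToyRepository and its method names are hypothetical and are not part of any Subversion API, and a real repository stores trees far more compactly than whole-dictionary copies.

```python
# Toy, in-memory model of a repository as a sequence of trees.
# ToyRepository and its methods are hypothetical names, not Subversion APIs.

class ToyRepository:
    def __init__(self):
        # Revision 0 is an empty tree; each list entry is one snapshot.
        self.revisions = [{}]
        self.transactions = {}
        self._next_txn_id = 0

    def begin_txn(self):
        """Start a transaction as a private copy of the youngest tree."""
        txn_id = self._next_txn_id
        self._next_txn_id += 1
        self.transactions[txn_id] = dict(self.revisions[-1])
        return txn_id

    def change(self, txn_id, path, content):
        """Record a change in the transaction tree only."""
        self.transactions[txn_id][path] = content

    def commit_txn(self, txn_id):
        """Promote the transaction to the next revision; return its number."""
        tree = self.transactions.pop(txn_id)
        self.revisions.append(tree)
        return len(self.revisions) - 1

    def abort_txn(self, txn_id):
        """Destroy the transaction; no revision is created."""
        del self.transactions[txn_id]


repo = ToyRepository()

# A successful commit: the transaction becomes revision 1.
txn = repo.begin_txn()
repo.change(txn, "trunk/README", "hello")
new_rev = repo.commit_txn(txn)

# A failed commit: the transaction is destroyed and nothing is stored.
failed = repo.begin_txn()
repo.change(failed, "trunk/oops", "never stored")
repo.abort_txn(failed)
```

After this runs, the model holds exactly two revisions (the empty revision 0 and revision 1) and no leftover transactions.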

Updates work in a similar way. The client builds a temporary transaction tree that mirrors the state of the working copy. The repository then compares that transaction tree with the revision tree at the requested revision (usually the most recent, or “youngest”, tree), and tells the client what changes are needed to transform its working copy into a replica of that revision tree. After the update completes, the temporary transaction is deleted.
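The comparison step of an update can be sketched as a tree diff. The code below is an illustration under simplifying assumptions: trees are flat dictionaries mapping paths to contents, and tree_delta and apply_delta are hypothetical helpers, not Subversion functions.

```python
def tree_delta(working_tree, target_tree):
    """Return the changes that turn working_tree into target_tree."""
    delta = {"add": {}, "modify": {}, "delete": []}
    for path, content in target_tree.items():
        if path not in working_tree:
            delta["add"][path] = content
        elif working_tree[path] != content:
            delta["modify"][path] = content
    for path in working_tree:
        if path not in target_tree:
            delta["delete"].append(path)
    return delta


def apply_delta(working_tree, delta):
    """Apply a delta to a copy of the working tree and return the result."""
    result = dict(working_tree)
    result.update(delta["add"])
    result.update(delta["modify"])
    for path in delta["delete"]:
        del result[path]
    return result


# Made-up sample trees: the working copy mirror and the youngest revision.
working = {"trunk/a.c": "v1", "trunk/old.c": "v1"}
youngest = {"trunk/a.c": "v2", "trunk/new.c": "v1"}

delta = tree_delta(working, youngest)
updated = apply_delta(working, delta)
```

Only the delta needs to travel to the client. Applying it yields a replica of the requested tree, after which the temporary mirror can be thrown away.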

The use of transaction trees is the only way to make permanent changes to a repository's versioned filesystem. However, it's important to understand that the lifetime of a transaction is completely flexible. In the case of updates, transactions are temporary trees that are immediately destroyed. In the case of commits, transactions are transformed into permanent revisions (or removed if the commit fails). In the case of an error or bug, a transaction can be accidentally left lying around in the repository (not really affecting anything, but still taking up space).

In theory, whole workflow applications might someday revolve around more fine-grained control of transaction lifetime. It is feasible to imagine a system in which each transaction slated to become a revision is left in stasis well after the client finishes describing its changes to the repository. This would enable each new commit to be reviewed by someone else, perhaps a manager or an engineering team, who could choose to promote the transaction into a revision or abort it.

Unversioned Properties

Transactions and revisions in the Subversion repository can have properties attached to them. These properties are generic key-to-value mappings, and are generally used to store information about the tree to which they are attached. The names and values of these properties are stored in the repository's filesystem, along with the rest of your tree data.

Revision and transaction properties are useful for associating information with a tree that is not strictly related to the files and directories in that tree—the kind of information that isn't managed by client working copies. For example, when a new commit transaction is created in the repository, Subversion adds a property to that transaction named svn:date—a datestamp representing the time that the transaction was created. By the time the commit process is finished, and the transaction is promoted to a permanent revision, the tree has also been given a property to store the username of the revision's author (svn:author) and a property to store the log message attached to that revision (svn:log).

Revision and transaction properties are unversioned properties—as they are modified, their previous values are permanently discarded. Also, while revision trees themselves are immutable, the properties attached to those trees are not. You can add, remove, and modify revision properties at any time in the future. If you commit a new revision and later realize that you had some misinformation or spelling error in your log message, you can simply replace the value of the svn:log property with a new, corrected log message.
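What "unversioned" means here can be sketched with a plain dictionary. The structure, the set_revprop helper, and the sample values below are made up for illustration; they do not reflect how Subversion stores properties internally.

```python
# Hypothetical in-memory stand-in for the properties attached to revision 1.
# All values are invented sample data.
revprops = {
    1: {
        "svn:author": "sally",
        "svn:date": "2005-01-01T12:00:00.000000Z",
        "svn:log": "Fix tpyo in README",
    }
}


def set_revprop(props, revision, name, value):
    """Overwrite a revision property and return the value that was lost."""
    old = props[revision].get(name)
    props[revision][name] = value  # no history of the old value is kept
    return old


# Correct the spelling error in the log message after the fact.
discarded = set_revprop(revprops, 1, "svn:log", "Fix typo in README")
```

Once the assignment has happened, the only trace of the old message is the return value. Nothing in the store remembers it, which is why corrections to properties such as svn:log deserve some care.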

Repository Data Stores

As of Subversion 1.1, there are two options for storing data in a Subversion repository. One type of repository stores everything in a Berkeley DB database; the other kind stores data in ordinary flat files, using a custom format. Because Subversion developers often refer to a repository as “the (versioned) filesystem”, they have adopted the habit of referring to the latter type of repository as FSFS [21] —a versioned filesystem implementation that uses the native OS filesystem to store data.

When a repository is created, an administrator must decide whether it will use Berkeley DB or FSFS. There are advantages and disadvantages to each, which we'll describe in a bit. Neither back-end is more “official” than the other, and programs which access the repository are insulated from this implementation detail. Programs have no idea how a repository is storing data; they only see revision and transaction trees through the repository API.

Table 5.1, “Repository Data Store Comparison” gives a comparative overview of Berkeley DB and FSFS repositories. The next sections go into detail.

Table 5.1. Repository Data Store Comparison

Feature | Berkeley DB | FSFS
Sensitivity to interruptions | very; crashes and permission problems can leave the database “wedged”, requiring journaled recovery procedures. | quite insensitive.
Usable from a read-only mount | no | yes
Platform-independent storage | no | yes
Usable over network filesystems | no | yes
Repository size | slightly larger | slightly smaller
Scalability: number of revision trees | database; no problems | some older native filesystems don't scale well with thousands of entries in a single directory.
Scalability: directories with many files | slower | faster
Speed: checking out latest code | faster | slower
Speed: large commits | slower, but work is spread throughout commit | faster, but finalization delay may cause client timeouts
Group permissions handling | sensitive to user umask problems; best if accessed by only one user. | works around umask problems
Code maturity | in use since 2001 | in use since 2004

Berkeley DB

When the initial design phase of Subversion was in progress, the developers decided to use Berkeley DB for a variety of reasons, including its open-source license, transaction support, reliability, performance, API simplicity, thread-safety, support for cursors, and so on.

Berkeley DB provides real transaction support—perhaps its most powerful feature. Multiple processes accessing your Subversion repositories don't have to worry about accidentally clobbering each other's data. The isolation provided by the transaction system is such that for any given operation, the Subversion repository code sees a static view of the database—not a database that is constantly changing at the hand of some other process—and can make decisions based on that view. If the decision made happens to conflict with what another process is doing, the entire operation is rolled back as if it never happened, and Subversion gracefully retries the operation against a new, updated (and yet still static) view of the database.

Another great feature of Berkeley DB is hot backups: the ability to back up the database environment without taking it “offline”. We'll discuss how to back up your repository in the section called “Repository Backup”, but the benefits of being able to make fully functional copies of your repositories without any downtime should be obvious.

Berkeley DB is also a very reliable database system. Subversion uses Berkeley DB's logging facilities, which means that the database first writes to on-disk log files a description of any modifications it is about to make, and then makes the modification itself. This is to ensure that if anything goes wrong, the database system can back up to a previous checkpoint—a location in the log files known not to be corrupt—and replay transactions until the data is restored to a usable state. See the section called “Managing Disk Space” for more about Berkeley DB log files.
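The log-first discipline described above is commonly known as write-ahead logging. The sketch below illustrates the general idea only: ToyLoggedStore is a hypothetical class, it keeps everything in memory, and it does not use Berkeley DB's actual API or log format.

```python
# Minimal write-ahead logging sketch. This is NOT Berkeley DB's API or
# on-disk format; ToyLoggedStore is a hypothetical, in-memory illustration.

class ToyLoggedStore:
    def __init__(self):
        self.log = []                # stands in for the on-disk log files
        self.data = {}               # stands in for the database tables
        self.checkpoint = ({}, 0)    # (known-good copy of data, log position)

    def put(self, key, value):
        # 1. Describe the modification in the log first...
        self.log.append((key, value))
        # 2. ...and only then make the modification itself.
        self.data[key] = value

    def take_checkpoint(self):
        """Remember a state, and a log position, known not to be corrupt."""
        self.checkpoint = (dict(self.data), len(self.log))

    def recover(self):
        """Back up to the checkpoint, then replay every later log record."""
        snapshot, position = self.checkpoint
        self.data = dict(snapshot)
        for key, value in self.log[position:]:
            self.data[key] = value


store = ToyLoggedStore()
store.put("a", 1)
store.take_checkpoint()
store.put("b", 2)

store.data = {"garbage": True}  # simulate tables damaged by a crash
store.recover()
```

Because every change reaches the log before it reaches the tables, recovery can rebuild the tables from the checkpoint plus the log records that follow it. In this run the store ends up holding both "a" and "b" again.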

But every rose has its thorn, and so we must note some known limitations of Berkeley DB. First, Berkeley DB environments are not portable. You cannot simply copy a Subversion repository that was created on a Unix system onto a Windows system and expect it to work. While much of the Berkeley DB database format is architecture independent, there are other aspects of the environment that are not. Secondly, Subversion uses Berkeley DB in a way that will not operate on Windows 95/98 systems—if you need to house a repository on a Windows machine, stick with Windows 2000 or Windows XP. Also, you should never keep a Berkeley DB repository on a network share. While Berkeley DB promises to behave correctly on network shares that meet a particular set of specifications, almost no known shares actually meet all those specifications.

Finally, because Berkeley DB is a library linked directly into Subversion, it's more sensitive to interruptions than a typical relational database system. Most SQL systems, for example, have a dedicated server process that mediates all access to tables. If a program accessing the database crashes for some reason, the database daemon notices the lost connection and cleans up any mess left behind. And because the database daemon is the only process accessing the tables, applications don't need to worry about permission conflicts. These things are not the case with Berkeley DB, however. Subversion (and programs using Subversion libraries) access the database tables directly, which means that a program crash can leave the database in a temporarily inconsistent, inaccessible state. When this happens, an administrator needs to ask Berkeley DB to restore to a checkpoint, which is a bit of an annoyance. Other things can cause a repository to “wedge” besides crashed processes, such as programs conflicting over ownership and permissions on the database files. So while a Berkeley DB repository is quite fast and scalable, it's best used by a single server process running as one user, such as Apache's httpd or svnserve (see Chapter 6, Server Configuration), rather than being accessed by many different users via file:/// or svn+ssh:// URLs. If using a Berkeley DB repository directly as multiple users, be sure to read the section called “Supporting Multiple Repository Access Methods”.

FSFS

In mid-2004, a second type of repository storage system came into being: one which doesn't use a database at all. An FSFS repository stores a revision tree in a single file, and so all of a repository's revisions can be found in a single subdirectory full of numbered files. Transactions are created in separate subdirectories. When complete, a single transaction file is created and moved to the revisions directory, thus guaranteeing that commits are atomic. And because a revision file is permanent and unchanging, the repository also can be backed up while “hot”, just like a Berkeley DB repository.
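The "build elsewhere, then move into place" pattern can be sketched as follows. This is an illustration of the general technique rather than FSFS's actual code: the txns and revs directory names and the commit_revision helper are assumptions made for the example, and the atomicity relies on the rename staying within a single filesystem.

```python
import os
import tempfile


def commit_revision(repo_dir, rev_number, payload):
    """Write a revision file under txns/, then move it into revs/."""
    txn_dir = os.path.join(repo_dir, "txns")  # hypothetical layout
    rev_dir = os.path.join(repo_dir, "revs")  # hypothetical layout
    os.makedirs(txn_dir, exist_ok=True)
    os.makedirs(rev_dir, exist_ok=True)

    # Build the complete file in a place readers of revs/ never look.
    tmp_path = os.path.join(txn_dir, f"{rev_number}.tmp")
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # A rename within one filesystem happens in a single step, so a reader
    # sees either no revision file at all or the finished one.
    final_path = os.path.join(rev_dir, str(rev_number))
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as repo:
    path = commit_revision(repo, 1, b"toy revision data")
    committed = sorted(os.listdir(os.path.join(repo, "revs")))
    leftovers = os.listdir(os.path.join(repo, "txns"))
```

If the process dies before the rename, the worst outcome is a stray temporary file in the transaction area. The revisions directory never contains a half-written file.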

The revision-file format represents a revision's directory structure, file contents, and deltas against files in other revision trees. Unlike a Berkeley DB database, this storage format is portable across different operating systems and isn't sensitive to CPU architecture. Because there's no journaling or shared-memory files being used, the repository can be safely accessed over a network filesystem and examined in a read-only environment. The lack of database overhead also means that the overall repository size is a bit smaller.

FSFS has different performance characteristics too. When committing a directory with a huge number of files, FSFS uses an O(N) algorithm to append entries, while Berkeley DB uses an O(N^2) algorithm to rewrite the whole directory. On the other hand, FSFS writes the latest version of a file as a delta against an earlier version, which means that checking out the latest tree is a bit slower than fetching the fulltexts stored in a Berkeley DB HEAD revision. FSFS also has a longer delay when finalizing a commit, which could in extreme cases cause clients to time out when waiting for a response.
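To get a feel for the difference between the two growth rates, the following sketch counts directory entries written while N files are added one at a time. It is a simplified cost model built on the description above, not a measurement of either back-end, and the function names are hypothetical.

```python
def entries_written_by_append(n):
    """Each added file appends one directory entry: n writes in total."""
    return sum(1 for _ in range(n))


def entries_written_by_rewrite(n):
    """Each added file rewrites the whole directory: 1 + 2 + ... + n writes."""
    return sum(size for size in range(1, n + 1))


n = 1000
append_cost = entries_written_by_append(n)
rewrite_cost = entries_written_by_rewrite(n)
```

For 1000 files the model gives 1000 entry writes for the append strategy and 500500 for the rewrite strategy, which is why the gap matters mostly for directories with very many files.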

The most important distinction, however, is FSFS's inability to be “wedged” when something goes wrong. If a process using a Berkeley DB database runs into a permissions problem or suddenly crashes, the database is left unusable until an administrator recovers it. If the same scenarios happen to a process using an FSFS repository, the repository isn't affected at all. At worst, some transaction data is left behind.

The only real argument against FSFS is its relative immaturity compared to Berkeley DB. It hasn't been used or stress-tested nearly as much, and so a lot of these assertions about speed and scalability are just that: assertions, based on good guesses. In theory, it promises a lower barrier to entry for new administrators and is less susceptible to problems. In practice, only time will tell.



[20] This may sound really prestigious and lofty, but we're just talking about anyone who is interested in that mysterious realm beyond the working copy where everyone's data hangs out.

[21] Pronounced “fuzz-fuzz”, if Jack Repenning has anything to say about it.