
The %%% Volume Manager
Internal Design Specification

Bill Ross

This section describes the internal workings of the Volume Manager, for administrators and developers.

The Volume Manager system, otherwise known as volman, services user requests for removable volumes (e.g. tapes). Typical requests are ``mount volume,'' ``unmount volume'' and ``move volume.'' The system overview is in the Volume Manager External Reference Specification.


Introduction

The components and their relations are described immediately below in general functional terms, followed by a section on configuration, a discussion of inter-process communication, and then a detailed description of each component (volcd, vold, volnd, vaultrc, chrc, acsrc).

Components

There are three major categories of components: the Client Library (libvol.a), the Main Servers (vold, volcd, volnd) and the Repository Controllers (RCs: chrc, acsrc, vaultrc).


Volume Manager Components

[block diagram - gif]

Client Library

User requests are made via a client library, libvol.a; the interface is described in libvol(3) and volop(3). All of the volman user-level command line utilities use this interface. The source files for building libvol.a are in volman/lib/.

The libvol communication routines all pass messages through the Client Daemon (volcd) on the user's host, which confirms the client UID and passes messages on to the main server (vold).

Some utilities that require root to run make direct console connections to vold without going through volcd; these are in the volman/srvr/ directory. An example is vshutdown. The console interface is discussed further below.

Servers

All the Volume Manager daemons (including the Repository Controllers) run as root. Originally this was thought to provide access to privileged sockets that would guarantee root identity when communicating with one another; however, it turned out that no such privilege exists, and other means would be necessary to guarantee identity. Additionally, volnd needs root to access device nodes. All processes that connect with vold use the XDR(3M) protocol (over sockets), which guarantees that transmitted data structures are valid on different architectures.

Main Servers

The main servers, volcd, vold and volnd, are described in the vold(1M) man page.

The volcd (Client Daemon) accepts client connections and authenticates identity by performing a handshake involving writing a random key to a file chowned to the client's claimed UID. Upon the first client request after the handshake, volcd informs vold of the new client and begins passing requests to vold and responses to the client. When the client terminates the connection, volcd informs vold. There is a volcd on every host supported by the system. (See ``Volcd in detail'' below.)

The vold is the central decision maker and message router of the system. It authorizes access to volumes based on client UID (as established by volcd) and on volume ownership and permissions contained in its database. It keeps a queue of volume requests, allocates volumes and drives (avoiding deadlock using the Banker's Algorithm), and forwards mount and move requests to the appropriate Repository Controller(s) according to the client request and the location of the volume per the database. When an RC needs to check whether a volume is mounted (vaultrc) or check an internal label and set up a user-accessible node (vaultrc, chrc, acsrc), vold forwards the request to the volnd on the drive's system and forwards the response back to the RC. The vold can run on any host, including hosts not served by the volman system - it does not access the drives. Its administrative configuration file, /usr/mss/etc/rc.volman, contains a list of hosts from which it will accept connections, which would include all hosts served (via volcd and volnd) as well as any other hosts on which RCs run. (See ``Vold in detail'' below.)

The volnd (Node Daemon) handles all volman system access to the drives' data paths, checking internal labels, setting up / deleting client-accessible device nodes (/dev/vol/xxxxxx), and scratching volumes (erasing and writing new internal labels). It accepts requests forwarded by vold from RCs (mount / node creation) and from the vold (node delete). There is a volnd on every host supported by the system. (See ``Volnd in detail'' below.)

Repository Controllers

Each Repository Controller (RC) manages an agent (interface) that controls a given set of volumes and devices. Requests received from the vold are translated into the language of the agent, which could be an operator, a robot, or a program controlling a robot. The RC selects a drive for mounts based on volume location (and optionally channel usage). The RCs can run on any host, including ones not served by the volman system.

The vaultrc manages an operator-run vault. It sends messages to the operator via syslog(3) broadcasts to the console account (defined in the vold /usr/mss/etc/rc.volman file) and receives error status from the operator via rcerr(1M). The vaultrc does not attempt to optimize channel usage in allocating drives. Its optimization is that it caches mounts by leaving `dismounted' volumes on the drives until a dismount is physically required for a new mount. This incurs no cost in time since manual drives can dismount immediately (this is done by volnd as the first stage of a vaultrc mount request), and saves the significant operator time required to relocate and remount a volume that would otherwise have been dismounted. (See ``Vaultrc in detail'' below.)

Each chrc process manages a `changer' robot, currently lower-cost devices with a single volume-moving mechanism that can be reasonably driven in synchronous mode. The chrc process runs on a host connected to the robot via a SCSI bus. (See ``Chrc in detail'' below.)

Each acsrc process manages a StorageTek Automated Cartridge System (ACS) robot farm which is controlled by ACS Library Server (ACSLS) software that runs on a Sun server. A dedicated `ssi' process on the acsrc host passes messages between the acsrc and its Sun server. The database function is handled by the Sun server - to allocate a drive, an acsrc determines which silo a volume is in and searches `outward' from it in the silo numbering scheme for an empty drive, preferring drives on less-used channels. (See ``Acsrc in detail'' below.)

A Mount Scenario

Here we see how the components work together to accomplish a mount.

Volume Manager Mount

[block diagram - gif]
  1. Client connects with local volcd, which establishes client UID.
  2. Client makes mount request, and volcd forwards client info and mount request to vold.
  3. Vold looks up requested volume in its database, establishing that client is authorized to mount, and queues request. When both the volume and a drive in the volume's RC are available, vold sends the mount to the RC, which chooses a drive and has the repository do the mount.
  4. When the mount is done, the RC sends a request via vold to the volnd on the drive's host. The volnd forks a child which performs a label check and sets up the client access node to the device (/dev/vol/xxxxxx) with the appropriate UID and access permissions. Status is forwarded back to the RC by vold; the RC updates its state and informs the vold, which informs the client via the volcd.

Configuration

The administrative configuration file, /usr/mss/etc/rc.volman, includes a list of hosts from which vold will accept connections, and definitions of the generic features of each Repository Controller, including the drives it contains. Another RC attribute worthy of special notice is the ``serial number,'' which must never be changed once assigned, since it is also bound to the database records of all the volumes contained in that RC. The example configuration file includes a format description in its header. All processes that need configuration information load it from vold. The vconfig(1M) utility reports the configuration.

The Volume Manager account name, along with server ports and various directories and file names used for installation and running, is set in the src/volman/Mkconfig.volman file. This file is included by the various Makefiles.

Communication

Since the servers can be on different hosts, XDR translation is used to guarantee that the packet data structures translate correctly. The packet data structures are defined in an rpcgen(1) input file, src/volman/lib/vm_pkt.x. Rpcgen produces vm_pkt.h (which is used both for client-volcd and server-server communication), along with vm_pkt_xdr.c which contains routines used by the servers for sending and receiving packets via XDR(3M) primitives.
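
As an illustration (not the actual vm_pkt.x contents, which are not reproduced here), the sketch below shows a hypothetical packet structure, an rpcgen-style XDR filter for it, and how such a filter can be driven with xdrmem_create(3N) to encode a packet into a buffer before it is written to a socket. All names in the sketch are illustrative.

    /* Hypothetical packet and XDR filter, in the style rpcgen emits;
     * the real definitions live in vm_pkt.x / vm_pkt_xdr.c.          */
    #include <rpc/rpc.h>        /* XDR, xdrmem_create(), xdr_int(), ... */

    struct demo_pkt {
        int    type;            /* message code, e.g. an M_HELLO-style value */
        u_int  uid;             /* client UID                                */
        u_int  len;             /* payload length                            */
    };

    /* One filter both encodes and decodes, depending on xdrs->x_op. */
    bool_t
    xdr_demo_pkt(XDR *xdrs, struct demo_pkt *p)
    {
        if (!xdr_int(xdrs, &p->type))
            return (FALSE);
        if (!xdr_u_int(xdrs, &p->uid))
            return (FALSE);
        return (xdr_u_int(xdrs, &p->len));
    }

    /* Encode a packet into buf; returns the number of XDR bytes, or -1. */
    int
    pack_demo_pkt(struct demo_pkt *p, char *buf, u_int buflen)
    {
        XDR xdrs;

        xdrmem_create(&xdrs, buf, buflen, XDR_ENCODE);
        if (!xdr_demo_pkt(&xdrs, p))
            return (-1);
        return ((int)xdr_getpos(&xdrs));
    }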

Client-volcd communications, on the other hand, use simple socket communications without XDR, writing the packet data structures directly into the socket. This is possible because both the client and volcd are on the same host, so the structures are identical in both processes (e.g. sizeof(int) is the same).
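
A minimal sketch of this same-host shortcut, with an illustrative packet structure and plain read(2)/write(2) calls (the actual client/volcd routines are in vm_sock.c and cl_sock.c and differ in detail):

    /* Same-host packet exchange: both processes are compiled on this host,
     * so the raw struct bytes mean the same thing on each end.             */
    #include <sys/types.h>
    #include <unistd.h>

    struct local_pkt {          /* illustrative packet layout */
        int type;
        int uid;
        int len;
    };

    int
    send_local_pkt(int sock, struct local_pkt *p)
    {
        return (write(sock, (char *)p, sizeof(*p)) == sizeof(*p) ? 0 : -1);
    }

    int
    recv_local_pkt(int sock, struct local_pkt *p)
    {
        return (read(sock, (char *)p, sizeof(*p)) == sizeof(*p) ? 0 : -1);
    }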

Server-server communications are always between vold and the other servers - when the non-vold servers need to communicate, they pass messages through vold. Vold has a list of acceptable server hosts in its administrative configuration file, /usr/mss/etc/rc.volman.
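
Since vold accepts server connections only from hosts named in rc.volman, each new connection's peer address is mapped back to a host name and checked against that list (the mechanics are described under ``Vold in detail'' below). A minimal sketch, with an illustrative allowed-host array:

    /* Check a connecting peer against an allowed-host list; the real list
     * is read from /usr/mss/etc/rc.volman at vold startup.                */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netdb.h>          /* gethostbyaddr(), struct hostent */
    #include <strings.h>        /* strcasecmp() */

    int
    host_is_allowed(struct sockaddr_in *peer, char **allowed, int nallowed)
    {
        struct hostent *hp;
        int i;

        hp = gethostbyaddr((char *)&peer->sin_addr, sizeof(peer->sin_addr),
                           AF_INET);
        if (hp == NULL)
            return (0);         /* no name for this address: refuse */

        for (i = 0; i < nallowed; i++)
            if (strcasecmp(hp->h_name, allowed[i]) == 0)
                return (1);
        return (0);
    }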

All communication is via sockets. Some attempt has been made to allow relatively painless switching to another protocol; however, this mainly extends to client-volcd connections. Both client-volcd and server-server connections use Internet domain sockets for convenience; volcd does not check the host for its connections, so a client could be on any host as long as it could read the key file written on the volcd host.

General client-volcd routines are in src/volman/lib/vm_sock.c, while client-specific ones are in src/volman/lib/cl_sock.c.

The general server-server routines are in src/volman/srvr/rl_sock.c and rl_xdr.c; vold-specific ones are in rp_sock.c and rp_xdr.c, and routines for the other servers are in srv_xdr.c. Some higher-level message-sending routines are in cd_msg.c (volcd) and rp_msg.c (vold).

Connections are established the same way in all cases: a server process at startup creates a socket using socket(2), uses bind(2) to associate it with a well-known (hardcoded) port, then does a listen(2) to register willingness to accept connections. Another process builds a socket using socket(2), then uses connect(2) to establish a connection with the server's port. The server detects new connections on the main port (along with messages on established sockets) using select(2), and uses accept(2) to create a unique socket for each connection.
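
A minimal sketch of that sequence on the server side, with an illustrative port number (the real ports are set in Mkconfig.volman) and abbreviated error handling:

    /* Create the well-known listening socket, then turn each new connection
     * detected by select(2) into its own socket with accept(2).            */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>

    #define DEMO_PORT 7001      /* illustrative well-known port */

    int
    make_listen_socket(void)
    {
        struct sockaddr_in addr;
        int s;

        s = socket(AF_INET, SOCK_STREAM, 0);
        memset((char *)&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(DEMO_PORT);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        listen(s, 5);           /* register willingness to accept */
        return (s);
    }

    /* Called when select(2) reports activity on the listening socket. */
    int
    take_connection(int listen_fd)
    {
        struct sockaddr_in peer;
        socklen_t len = sizeof(peer);

        return (accept(listen_fd, (struct sockaddr *)&peer, &len));
    }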

Volcd in detail

When volcd is started, it connects to the vold and sends an identifying packet (M_HELLO); the vold registers that it has a volcd on the host in question. The volcd then initializes its own connection-receiving socket and goes into a loop on select(2) (in cd_sock.c) waiting for messages from vold, new connections, or client messages.

The volcd keeps an array of per-client information which is indexed by the file descriptor of the connection. When volcd accepts a connection, it notes that there is an unknown client on that socket and waits for an identifying packet (M_HELLO) specifying the client UID. When this is received, volcd writes a random key to a randomly-named file in a special volman spool directory, chowns the file to the claimed UID, and sends the name of the file to the client. The client reads the key from the file and includes it in any future request. On receipt of the first request, volcd removes the key file and sends a copy of the client structure to the vold before forwarding the request.
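
A minimal sketch of the volcd side of the handshake, with a hypothetical spool path and buffer names (the actual code is in the cd_*.c files and differs in detail):

    /* Write a random key to a file, chown it to the claimed UID, and hand
     * the file name back to the client; only a process with that UID can
     * read the key, so echoing it back proves the claim.                  */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SPOOL_DIR "/usr/mss/spool/volcd"   /* illustrative spool directory */

    int
    issue_key(uid_t claimed_uid, char *keybuf, size_t keylen,
              char *namebuf, size_t namelen)
    {
        int fd;

        snprintf(namebuf, namelen, "%s/key.XXXXXX", SPOOL_DIR);
        if ((fd = mkstemp(namebuf)) < 0)       /* created mode 0600, owner root */
            return (-1);

        snprintf(keybuf, keylen, "%08lx%08lx",
                 (unsigned long)random(), (unsigned long)random());
        write(fd, keybuf, strlen(keybuf));
        close(fd);

        /* Hand ownership to the claimed UID; read-only for that user. */
        if (chown(namebuf, claimed_uid, (gid_t)-1) < 0 ||
            chmod(namebuf, S_IRUSR) < 0)
            return (-1);
        return (0);             /* namebuf is then sent to the client */
    }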

Messages from vold are forwarded to the appropriate client, or handled internally if they are for volcd itself (shutdown and starting a new logfile).

When a client connection is closed, volcd marks the client slot as free and informs vold.

Vold in detail

When vold is started, it reads its configuration file, /usr/mss/etc/rc.volman, opens its database, initializes its main (connection-receiving) socket, and forks a `voltimer' process which connects to it and sends timer packets at intervals to prompt checking of queues. The timer packets are also forwarded to the other servers, which may use the occasion to check queues and can use the timestamp in the packet to check communication latency. If the voltimer dies, vold restarts it, up to MAXTIMERFORKS (defined in rp_init.c).
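
A minimal sketch of the voltimer loop, with an illustrative interval and packet layout (the real timer packets go through the normal XDR path):

    /* Send a timestamped packet to vold at a fixed interval; vold forwards
     * it to the other servers as a cue to check their queues.             */
    #include <unistd.h>
    #include <time.h>

    #define TIMER_INTERVAL 30   /* seconds; illustrative */

    struct timer_pkt {
        long sent;              /* sender's clock, for latency checks */
    };

    void
    timer_loop(int vold_sock)
    {
        struct timer_pkt pkt;

        for (;;) {
            pkt.sent = (long)time((time_t *)0);
            if (write(vold_sock, (char *)&pkt, sizeof(pkt)) != sizeof(pkt))
                break;          /* vold is gone; exit so it can refork us */
            sleep(TIMER_INTERVAL);
        }
    }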

Vold also forks a volcd and any RCs that have a PATH defined in /usr/mss/etc/rc.volman. It logs SIGCLD from these processes but does not attempt to restart them.

Vold then goes into a loop on select() (in rp_sock.c), accepting connections and receiving messages from voltimer, volcd, volnd, the RCs, and console commands such as vshutdown(1M). When a connection is made, vold uses gethostbyaddr(3N) to get the host name and checks it against the list that was read in from the /usr/mss/etc/rc.volman file. The first message received on a new connection must identify the type of connecting process. Subsequent messages are handled in sender-type-specific switch routines in rp_1.c, which also contains main() and `housekeeping' routines such as MShutdown(), which handles the M_SHUTDOWN packet sent by vshutdown. The switch routine for client requests calls packet-handling routines in the following categories:

rp_2.c: mounting & moving volumes and reporting request status
rp_3.c: reporting/changing database information
rp_4.c: varying drives and reporting drive status
rp_5.c: reserving drives & volumes

Client mount and move requests are checked and placed in a queue for resources. Routines for manipulating the requests are in rp_vol.c. Each request is also queued in several other ways, including hash-queued on the volume's external label, which speeds up checks for resource availability. Request queue processing is handled in rp_q.c - requests are examined in order of receipt and commands are sent to RCs when resources are available. Vold keeps counts of available drives in each RC and sends only as many mounts as there are drives. A mount request to an RC contains a bitmap of the varied-on drives of the volume's medium in that RC, and the RC uses this bitmap to choose a specific drive, optimizing according to the nature of the repository.
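
A minimal sketch of the external-label hash queue (structure and names are illustrative, not the rp_vol.c originals):

    /* Requests are chained by a hash of the volume's external label, so
     * "is this volume already spoken for?" is a short chain walk rather
     * than a scan of the whole request queue.                            */
    #include <string.h>

    #define VOLHASH 257         /* illustrative table size */

    struct request {
        char            extlabel[16];   /* volume external label */
        struct request *hash_next;      /* next request with a colliding hash */
        /* ... FIFO links, reservation links, etc. omitted ...             */
    };

    static struct request *volhash[VOLHASH];

    static unsigned
    label_hash(const char *label)
    {
        unsigned h = 0;

        while (*label)
            h = h * 31 + (unsigned char)*label++;
        return (h % VOLHASH);
    }

    struct request *
    find_request(const char *label)
    {
        struct request *r;

        for (r = volhash[label_hash(label)]; r != NULL; r = r->hash_next)
            if (strcmp(r->extlabel, label) == 0)
                return (r);
        return ((struct request *)0);
    }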

In addition to checking drive availability, vold uses the Banker's Algorithm to check if requests using reservations for multiple resources could deadlock. Requests that are not part of reservations are put in `implicit' reservations, and all requests are also queued by reservation. Reservation-related routines are in rp_rid.c. The reservation functionality is described in the vreserve(1M) man page.
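
A minimal sketch of the safety test at the heart of the Banker's Algorithm, reduced to a single resource class (say, drives of one medium in one RC); the real vold check covers volumes and drives across reservations, and all names here are illustrative:

    /* Each reservation holds some drives and has declared a maximum it may
     * need.  The state is "safe" if, with the drives currently free, every
     * reservation can still be run to completion in some order.           */
    #define MAXRES 64           /* illustrative limit */

    struct claim {
        int holds;              /* drives currently held   */
        int max_needs;          /* declared maximum needed */
    };

    int
    state_is_safe(struct claim *c, int n, int avail)
    {
        int done[MAXRES];
        int i, progress;

        for (i = 0; i < n; i++)
            done[i] = 0;

        do {
            progress = 0;
            for (i = 0; i < n; i++) {
                if (!done[i] && c[i].max_needs - c[i].holds <= avail) {
                    avail += c[i].holds;    /* it can finish and release */
                    done[i] = 1;
                    progress = 1;
                }
            }
        } while (progress);

        for (i = 0; i < n; i++)
            if (!done[i])
                return (0);     /* someone could never finish: unsafe */
        return (1);
    }

Before granting a request that belongs to a reservation, vold can apply such a test to the state that would result; if that state would be unsafe, the request waits.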

In addition to resource-queued requests for volumes, vold handles various requests to read/write the database that are handled immediately and not queued, such as allocating a volume, updating volume ownership or permissions, or recycling a volume, as described in volalloc(1), vchown(1), vchmod(1) and vrecycle(1). The vold also responds immediately to requests for status of mount/move requests, drives, reservations, and configuration.

The database consists of two tables: volume information and quotas. The tables are in fixed-field ASCII format, with each record terminated by a newline and the fields within a record separated by blanks. The records are padded so that none will cross a disk block boundary - this, in combination with synchronized disk access (open(2) using the O_SYNC flag), ensures that a record will not be partially written to disk in the event of a system crash. The database routines in rp_db.c translate the ASCII format to C structures and maintain B-tree indexes on various fields.
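
A minimal sketch of that record discipline, with illustrative sizes and fields (the real layout is defined by rp_db.c):

    /* Fixed-size, blank-separated ASCII records, padded so no record
     * crosses a disk block, written through an O_SYNC descriptor so a
     * crash cannot leave a record half-written.                        */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define DISK_BLOCK 512
    #define RECSIZE    128      /* must divide DISK_BLOCK evenly */

    static void
    format_record(char *rec, const char *extlabel, const char *owner,
                  const char *status)
    {
        int n;

        memset(rec, ' ', RECSIZE);
        n = snprintf(rec, RECSIZE, "%-8s %-8s %-8s", extlabel, owner, status);
        if (n >= 0 && n < RECSIZE)
            rec[n] = ' ';       /* undo snprintf's terminating NUL */
        rec[RECSIZE - 1] = '\n';        /* newline-terminated record */
    }

    static int
    write_record(const char *path, long recno, const char *rec)
    {
        int fd;

        if ((fd = open(path, O_WRONLY | O_SYNC)) < 0)
            return (-1);
        if (lseek(fd, (off_t)(recno * RECSIZE), SEEK_SET) < 0 ||
            write(fd, rec, RECSIZE) != RECSIZE) {
            close(fd);
            return (-1);
        }
        return (close(fd));
    }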

The quota table is indexed by concatenated UID,RC. The volume table is indexed by volume external label (for most lookups), by owner UID (for vls(1) requests), by internal label concatenated with UID (used for guaranteeing that users do not have duplicate internal labels), and by status (lookup of scratch volumes). If a previous vold process did not close the indexes (required to flush blocks buffered by the B-tree package), they are rebuilt on startup in parallel by forked processes.

Volnd in detail

Volnd forks a child for each request that accesses the device node, in case there is a hang in the device driver; volnd times out the child in an attempt to detect such hangs. If the child completes, it responds directly to vold and exits with a status indicating success or failure to the volnd. If volnd times out the child, it sends an error to vold on the child's behalf.
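
A minimal sketch of the fork-and-timeout pattern; the real volnd keeps serving other requests while children run (via SIGCLD), whereas this simplified version just polls a single child:

    /* Run a device operation in a child so a driver hang cannot wedge the
     * daemon; if the child does not finish in time, report the error to
     * vold on its behalf.  Names and timeout are illustrative.           */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <unistd.h>

    #define CHILD_TIMEOUT 300   /* seconds; illustrative */

    int
    run_device_request(void (*do_request)(void))
    {
        pid_t pid, done;
        int status, waited;

        if ((pid = fork()) == 0) {
            do_request();       /* may hang in the driver; only the child
                                 * is stuck if it does                   */
            _exit(0);
        }

        for (waited = 0; waited < CHILD_TIMEOUT; waited++) {
            done = waitpid(pid, &status, WNOHANG);
            if (done == pid)
                return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
            sleep(1);
        }

        kill(pid, SIGKILL);     /* timed out: assume a hang */
        waitpid(pid, &status, WNOHANG); /* may stay a zombie if truly stuck */
        return (-1);            /* caller sends the error to vold */
    }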

When a volnd starts up, it registers with the vold, then clears the /dev/vol/ directory of any user volume nodes left by a previous volnd process. It then waits on requests from vold, catching SIGCLD and getting status when the child it forks for each request terminates. The parent process and message-sending routines are in nd_1.c, and the child routines are in nd_2.c. This file in turn includes either nd_2a.c (for hosts on which a device will not open until it's ready) or nd_2b.c (for hosts on which a device can be opened then polled to see if it's ready).

The basic commands for which a volnd child is forked include volume checking (done as part of setting up the client access node), `test for mount', and volume scratching.

Volume checking involves testing `write ring' status, and also checking the internal label unless the client requested bypassing this step. `Test for mount' is invoked by the vaultrc and involves putting `MOUNT / XXXXXX' on the drive's billboard if the drive has one.

Vaultrc in detail

As used in production, vaultrc relies on a shelf filing system and so keeps no record of volume location. A vault-management system based on a handheld barcode reader is provided, which keeps a database and enables random-access filing; it has not been used in production owing to operator discomfort with the barcode reader.

Communication with the barcode server is via RPC(3N), i.e. the response is immediately available as the result of a remote procedure call; this is known as a `stateless' system since there are no followup messages after an interval, unlike the protocol between the vold and the vaultrc. This simple protocol is adequate because the asynchronous action waited on by the vault is the mount of a volume, which is detected and reported by volnd. Thus the barcode system has no appreciable impact on the complexity of the vaultrc - vaultrc merely keeps the barcode server informed of requests, and the server provides the same info as the console broadcasts, plus volume location, both on its own console and on the barcode gun's LCD display.

When the vaultrc starts up, it connects to the vold and gets the system configuration, using the RC id (provided as the only required argument in starting the program) to look up its own configuration. If the barcode system is configured (as described at the beginning of vaultrc.c), vaultrc also connects to the barcode server. It checks if anyone is logged into the console account (complaining to the log if not), and notifies vold that it is ready. It does not worry about what volumes may be currently on the drives; they will be automatically dismounted when the drives are assigned new mounts.

(Note: tape media should in fact not be left mounted and idle indefinitely, because tape can lose tension, and, in the case of helical scan technology, tape and tape head can wear out since the head spins constantly in contact with the tape.)

After initialization, vaultrc loops on receipt of packets from vold, which can be client requests for volumes, operator error reports, status from volnds, or commands such as shutdown from vold.

When a drive is assigned a mount, a message is sent via vold to the volnd on the drive's machine. If the volume has been left on a drive (cached) following a dismount, volnd is only requested to recheck the internal label and create the user-accessible device node, /dev/vol/XXXXXX. If it is a fresh mount, volnd is instructed to first loop on testing the device until the mount has occurred.

All the code is in vaultrc.c, except for barcode RPC calls which are in vault.x.

Chrc in detail

When a chrc starts up, it connects to the vold and gets the system configuration, using the RC id (provided as the only required argument in starting chrc) to look up its own RC configuration. The chrc then queries the status of all elements in the robot, setting up indexes of VSNs and empty slots and dismounting any mounted tapes. Finally, it registers with vold and varies off any drives that have tapes stuck in them, then loops on receipt of commands from vold. For now, commands to the robot are executed synchronously.

When a mount has occurred, chrc sends a message (via vold) to the volnd on the machine the drive is attached to. When volnd has checked write ring status and label, a message is forwarded back and chrc informs vold that the mount has completed.

Acsrc in detail

When an acsrc starts up, it connects to the vold and gets the system configuration, using the RC id (provided as the only required argument in starting acsrc) to look up its own RC configuration. It then loads its own ACSCONF configuration file for information on which Sun server / ACSLS it will address and on the ACS hardware configuration. It forks an ssi process to communicate with the Sun server, if one is not already running, then connects through it to the Sun server and gets the status of the robot. Dismounts are sent for any drives that are mounted. The acsrc then tells vold to vary off any drives that had startup dismount problems, and notifies vold that it is ready. Before receiving any vold requests, however, acsrc queries the Sun server for the location of any volumes that were in the process of ejecting when the last instance of this acsrc terminated, and informs vold of any volumes that are no longer in the robot.


