NAStore Internal Design Specification

The NAStore Volume Manager
Internal Design Specification

Bill Ross
Network Archive Systems

This section describes the internal workings of the Volume Manager for administrators and developers.

The Volume Manager system, otherwise known as volman, services user requests for removable volumes (e.g. tapes). Typical requests are ``mount volume,'' ``unmount volume'' and ``move volume.'' The system overview is in the Volman External Reference Specification.


Introduction

The components and their relations are described immediately below in general functional terms, followed by a section on configuration, a discussion of inter-process communication, and then a detailed description of each component: volcd, vold, volnd, vaultrc and acsrc.

Components

There are three major categories of components: the Client Library (libvol.a), the Main Servers (vold, volcd, volnd) and the Repository Controllers (RCs: acsrc, vaultrc).


Volume Manager Components [block diagram]

Client Library

User requests are made via a client library, libvol.a, the interface of which is described in libvol(3) and volop(3). All of the volman user-level command line utilities use this interface. The source files for building libvol.a are in volman/lib/.

The libvol communication routines all operate through the Client Daemon (volcd) on the user's host, which establishes the client UID and passes messages on to the main server (vold).

Some utilities which run as root make direct console connections to vold without going through volcd; an example is vshutdown, which is in the volman/srvr/ directory. The console interface is discussed further below.

Servers

All the Volume Manager daemons (including the Repository Controllers) run as root in order to make use of privileged sockets in communicating with one another; additionally, volcd and volnd need root to verify client UID and to access device nodes respectively. All processes that connect with vold use the XDR(3M) protocol which guarantees that transmitted data structures are valid on different architectures.

Main Servers

The main servers, volcd, vold and volnd, are described in the vold(1M) man page.

The volcd (Client Daemon) accepts client connections and authenticates identity by performing a handshake involving writing a random key to a file chowned to the client's claimed UID. Upon the first client request after the handshake, volcd informs vold of the new client and begins passing requests to vold and responses to the client. When the client terminates the connection, volcd informs vold. There is a volcd on every host supported by the system. (See ``Volcd in detail,'' below.)

The vold is the central decision maker and message router of the system. It authorizes access to volumes based on client UID (as established by volcd) and on volume ownership and permissions contained in its database. It keeps a queue of volume requests, allocates volumes and drives (avoiding deadlock using the Banker's Algorithm), and forwards mount and move requests to the appropriate Repository Controller(s) according to the client request and the location of the volume per the database. When an RC needs to check whether a volume is mounted (vaultrc) or to check an internal label and set up a user-accessible node (vaultrc, acsrc), vold forwards the request to the volnd on the client's system and forwards the response back to the RC. The vold can run on any host, including hosts not served by the volman system - it does not access the drives. Its administrative configuration file, VCONF, contains a list of hosts from which it will accept connections, which includes all hosts served (via volcd and volnd) as well as any other hosts on which RCs run. (See ``Vold in detail,'' below.)

The volnd (Node Daemon) handles all access to the drives' data paths, checking internal labels and setting up and deleting client-accessible device nodes (/dev/vol/xxxxxx). It accepts requests forwarded by vold from the RCs (mount / node creation) and requests from vold itself (node deletion). There is a volnd on every host supported by the system. (See ``Volnd in detail,'' below.)

Repository Controllers

Each Repository Controller (RC) manages an agent that controls a given set of volumes and devices. Requests received from the vold are translated into the language of the agent, which could be an operator, a robot, or a program controlling a robot. The RC selects a drive for mounts based on volume location and optionally channel usage. The RCs can run on any host, including ones not served by the volman system.

The vaultrc manages an operator-run vault. It sends messages to the operator via syslog(3) broadcasts to the console account (defined in the vold VCONF file) and receives error status from the operator via rcerr(1M). The vaultrc does not attempt to optimize channel usage in allocating drives. Its optimization is that it caches mounts by leaving `dismounted' volumes on the drives until a dismount is physically required for a new mount. This costs no time, since manual drives can dismount immediately (volnd does this as the first stage of a vaultrc mount request), and it saves the significant operator time required to relocate and remount a volume that would otherwise have been dismounted. (See ``Vaultrc in detail,'' below.)

Each acsrc process manages a StorageTek Automated Cartridge System (ACS) robot farm, which is controlled by ACS Library Server (ACSLS) software running on a Sun server. A dedicated `ssi' process on the acsrc host passes messages between the acsrc and its Sun server. The database function is handled by the Sun server - to allocate a drive, an acsrc determines which silo a volume is in and searches `outward' from it in the silo numbering scheme for an empty drive, preferring drives on less-used channels. (See ``Acsrc in detail,'' below.)
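The `outward' search can be pictured roughly as follows. This is only an illustration: the per-silo summary structure, the silo count and the function name are invented here, and the real information comes from the ACSLS status queries described under ``Acsrc in detail.''

    #define NSILOS 8                   /* illustrative silo count */

    /* Hypothetical per-silo summary built from ACSLS status queries. */
    struct silo {
        int free_drives;               /* empty drives in this silo            */
        int channel_load;              /* mounts outstanding on its channel(s) */
    };

    /* Starting from the silo holding the volume, examine silos at
     * distance 0, 1, 2, ... in both directions, and among the nearest
     * silos with a free drive prefer the one whose channel is least
     * busy.  Returns a silo number, or -1 if no drive is free.         */
    int
    pick_silo(struct silo silos[NSILOS], int home)
    {
        int dist, best = -1;

        for (dist = 0; dist < NSILOS && best < 0; dist++) {
            int sign;
            for (sign = -1; sign <= 1; sign += 2) {
                int s = home + sign * dist;
                if (s >= 0 && s < NSILOS && silos[s].free_drives > 0
                    && (best < 0 || silos[s].channel_load < silos[best].channel_load))
                    best = s;
                if (dist == 0)
                    break;             /* the home silo is examined only once */
            }
        }
        return best;
    }

Searching nearest-first minimizes inter-silo volume movement, while the channel-load tie-break spreads concurrent mounts across channels.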

A Mount Scenario

Here we see how the components work together to accomplish a mount.

Volume Manager Mount [block diagram]
  1. Client connects with local volcd, which establishes the client UID.
  2. Client makes mount request, and volcd forwards client info and mount request to vold.
  3. Vold looks up the requested volume in its database, establishing that the client is authorized to mount it, and queues the request. When both the volume and a drive in the volume's RC are available, vold sends the mount to the RC, which chooses a drive and has the repository do the mount.
  4. When the mount is done, the RC sends a request via vold to the volnd on the client's host. The volnd forks a child which performs a label check and sets up the client access node to the device (/dev/vol/xxxxxx) with the appropriate UID and access permissions. Status is forwarded back to the RC by vold; the RC updates its state and informs the vold, which informs the client via the volcd.

Configuration

The administrative configuration file, VCONF, includes a list of hosts that vold will accept connections from, and definitions of the generic features of each Repository Controller, including the drives it contains. Another RC attribute worthy of special notice is the ``serial number,'' which must never be changed once assigned, since it is also bound to the database records of all the volumes contained in that RC. The example VCONF file includes a format description in its header. All processes that need configuration information load it from vold. The vconfig(1M) utility reports the configuration.

The Volume Manager account name, along with server ports and various directories and file names used for installation and running, is set in the `config.csh' file in the top volman source directory. This file is sourced by the various Makefiles with a single argument such as `SPOOLDIR'; it returns the appropriate string, which is assigned to a local variable and used in the Makefile's command lines.

Communication

All messages sent between components consist of data structures which are defined in an rpcgen(1) input file, lib/vm_pkt.x. Rpcgen produces vm_pkt.h (which is used both for client-volcd and server-server communication), along with vm_pkt_xdr.c, which contains routines used by the servers for sending and receiving packets via XDR(3M) primitives.
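As a rough sketch of what the generated code provides (the structure, field names and message values here are hypothetical; the real definitions come from vm_pkt.x and the generated vm_pkt_xdr.c), an XDR filter encodes and decodes a packet like this:

    #include <rpc/rpc.h>           /* XDR: xdrmem_create, xdr_int, ...       */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical packet; the real structures are generated from vm_pkt.x. */
    struct demo_pkt {
        int type;                  /* message type, e.g. an M_HELLO code     */
        int uid;                   /* client UID                             */
    };

    /* XDR filter in the style rpcgen emits: the same routine encodes or
     * decodes, depending on how the XDR stream was created.                 */
    bool_t
    xdr_demo_pkt(XDR *xdrs, struct demo_pkt *p)
    {
        return xdr_int(xdrs, &p->type) && xdr_int(xdrs, &p->uid);
    }

    int
    main(void)
    {
        char buf[64];
        XDR xenc, xdec;
        struct demo_pkt out = { 1, 1001 }, in;

        /* Sender: encode into a buffer (the servers encode onto a socket). */
        xdrmem_create(&xenc, buf, sizeof buf, XDR_ENCODE);
        if (!xdr_demo_pkt(&xenc, &out))
            return 1;

        /* Receiver: decode the same bytes, whatever its architecture.      */
        memset(&in, 0, sizeof in);
        xdrmem_create(&xdec, buf, sizeof buf, XDR_DECODE);
        if (!xdr_demo_pkt(&xdec, &in))
            return 1;

        printf("type=%d uid=%d\n", in.type, in.uid);
        return 0;
    }

The servers drive the same kind of filter routines from their socket code rather than from a fixed memory buffer as shown here.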

Client-volcd communications use simple socket communications, writing the packet data structures directly into the socket. This is possible because both client and volcd are on the same host, therefore the structures are identical for both processes (e.g. sizeof(int) is the same).

Server-server communications are always between vold and the other servers - when the other servers need to communicate, they pass messages through vold. Since the servers can be on different hosts, XDR translation is used to guarantee that the packet data structures translate correctly. The connections between vold and the other servers are made via a root-only port (i.e. < 5000) to guarantee identity; the vold has a list of acceptable hosts in its administrative configuration file, VCONF.
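A minimal sketch of the kind of check this implies is below. The helper name and the allow-list representation are assumptions, and note that on most Unix systems only root can bind a port below IPPORT_RESERVED (1024):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netdb.h>
    #include <strings.h>

    /* Hypothetical helper: accept only peers that connected from a
     * reserved (root-only) port on a host named in the VCONF list.
     * 'allowed' is a NULL-terminated array of host names.              */
    int
    peer_is_trusted(int sock, char **allowed)
    {
        struct sockaddr_in peer;
        socklen_t len = sizeof peer;
        struct hostent *hp;
        int i;

        if (getpeername(sock, (struct sockaddr *)&peer, &len) < 0)
            return 0;

        /* Only root can bind a reserved port, so a low source port
         * implies the connecting process runs as root.                 */
        if (ntohs(peer.sin_port) >= IPPORT_RESERVED)
            return 0;

        hp = gethostbyaddr((char *)&peer.sin_addr, sizeof peer.sin_addr, AF_INET);
        if (hp == NULL)
            return 0;

        for (i = 0; allowed[i] != NULL; i++)
            if (strcasecmp(hp->h_name, allowed[i]) == 0)
                return 1;
        return 0;
    }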

All communication is via sockets. Some attempt has been made to allow relatively painless switching to another protocol, though this mainly extends to client-volcd connections. Both client-volcd and server-server connections use Internet domain sockets for convenience; volcd does not check the host for its connections, so a client could be on any host as long as it could read the key file written on the volcd host.

General client-volcd routines are in lib/vm_sock.c, while client-specific ones are in lib/cl_sock.c. Volcd client socket routines are in srvr/cd_sock.c.

The general server-server routines are in srvr/rl_sock.c and rl_xdr.c; vold-specific ones are in rp_sock.c and rp_xdr.c, and routines for the other servers are in srvr_xdr.c. Some higher-level message-sending routines are in cd_msg.c and rp_msg.c.

Connections are established the same way in all cases: a server process at startup creates a socket using socket(2), uses bind(2) to associate it with a well-known (hardcoded) port, then does a listen(2) to register willingness to accept connections. Another process builds a socket using socket(2), then uses connect(2) to establish a connection with the server's port. The server detects the main port connection (along with messages on regular sockets) using select(2), and uses accept(2) to create a unique socket for the connection.
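A skeletal version of that pattern follows, with error handling and the actual packet dispatch omitted and the port number invented for the example:

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>

    #define SRV_PORT 7500              /* invented; the real ports are fixed elsewhere */

    int
    main(void)
    {
        int lsock, nfds, fd;
        struct sockaddr_in addr;
        fd_set active, readable;

        lsock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(SRV_PORT);
        bind(lsock, (struct sockaddr *)&addr, sizeof addr);
        listen(lsock, 5);              /* register willingness to accept */

        FD_ZERO(&active);
        FD_SET(lsock, &active);
        nfds = lsock + 1;

        for (;;) {
            readable = active;
            if (select(nfds, &readable, NULL, NULL, NULL) < 0)
                continue;
            for (fd = 0; fd < nfds; fd++) {
                if (!FD_ISSET(fd, &readable))
                    continue;
                if (fd == lsock) {     /* connection request on the main port */
                    int ns = accept(lsock, NULL, NULL);
                    if (ns >= 0) {
                        FD_SET(ns, &active);
                        if (ns >= nfds)
                            nfds = ns + 1;
                    }
                } else {
                    /* read and dispatch a packet on this connection;
                     * on EOF: close(fd); FD_CLR(fd, &active);           */
                }
            }
        }
    }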

Volcd in detail

When volcd is started, it connects to the vold and sends an identifying packet (M_HELLO); the vold registers that it has a volcd on the host in question. The volcd then initializes its own connection-receiving socket and goes into a loop on select(2) (in cd_sock.c) waiting for messages from vold, new connections, or client messages.

The volcd keeps an array of per-client information which is indexed by the file descriptor of the connection. When volcd accepts a connection, it notes that there is an unknown client on that socket and waits for an identifying packet (M_HELLO) specifying the client UID. When this is received, volcd writes a random key to a randomly-named file in a special volman spool directory, chowns the file to the claimed UID, and sends the name of the file to the client. The client reads the key from the file and includes it in any future request. On receipt of the first request, volcd removes the key file and sends a copy of the client structure to the vold before forwarding the request.
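A rough sketch of the key-file step, under the assumption of a spool directory path and a simple numeric key (the real file naming, key format and error handling differ):

    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical key-file step: write a random key into a file owned
     * by the claimed UID and return the path so it can be sent to the
     * client.  Only a process really running under that UID (or root)
     * can read the key back.                                            */
    int
    make_key_file(uid_t claimed_uid, char *path, size_t pathlen, long *key)
    {
        int fd;
        FILE *fp;

        snprintf(path, pathlen, "/var/spool/volman/key.XXXXXX"); /* assumed spool dir */
        fd = mkstemp(path);            /* random name, created mode 0600 */
        if (fd < 0)
            return -1;

        *key = random();               /* the shared secret */
        fp = fdopen(fd, "w");
        fprintf(fp, "%ld\n", *key);
        fclose(fp);

        /* Hand the file to the claimed UID; if the claim was false the
         * client will be unable to read it.                             */
        if (chown(path, claimed_uid, (gid_t)-1) < 0) {
            unlink(path);
            return -1;
        }
        return 0;
    }

Because the file is readable only by its owner, a client that can send back the key has demonstrated that it really runs under the claimed UID (or is root).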

Messages from vold are forwarded to the appropriate client, or handled internally if they are for volcd itself (shutdown and starting a new logfile).

When a client connection is closed, volcd marks the client slot as free and informs vold.

Vold in detail

When vold is started, it reads its configuration file, VCONF, opens its database, initializes its main (connection-receiving) socket, and forks a `voltimer' process which connects to it and sends timer packets at intervals to prompt checking of queues. The timer packets are also forwarded to the other servers, which may also use the occasion to check queues and can use the timestamp in the packet to check communication latency. If the voltimer dies, vold restarts it, up to MAXTIMERFORKS times (defined in rp_init.c).

For convenience in the current production configuration (Fall, 1996), vold also forks a volcd, volnd and the RCs; it logs SIGCLD from these processes but does not attempt to restart them.

Vold then goes into a loop on select() (in rp_sock.c), accepting connections and receiving messages from voltimer, volcd, volnd, the RCs, and console commands such as vshutdown(1M). When a connection is made, vold uses gethostbyaddr(3N) to get the host name and checks it against the list that was read in from the VCONF file. The first message received on a new connection must identify the type of connecting process. Subsequent messages are handled in sender-type-specific switch routines in rp_1.c, which also contains main() and `housekeeping' routines such as MShutdown(), which handles the M_SHUTDOWN packet sent by vshutdown. The switch routine for client requests calls packet-handling routines in the following categories:

rp_2.c: mounting & moving volumes and reporting request status
rp_3.c: reporting/changing database information
rp_4.c: varying drives and reporting drive status
rp_5.c: reserving drives & volumes

Client mount and move requests are checked and placed in a queue for resources. Routines for manipulating the requests are in rp_vol.c. The structure used to track a request is also hash-queued on the volume's external label, which speeds up checks for resource availability. Request queue processing is handled in rp_q.c - requests are examined in order of receipt and commands are sent to RCs when resources are available. Vold keeps counts of available drives in each RC and sends only as many mounts as there are drives. A mount request to an RC contains a bitmap of the varied-on drives of the volume's medium in that RC, and the RC uses this bitmap to choose a specific drive, optimizing according to the nature of the repository.
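As an illustration of the bitmap idea (the drive limit, macro names and the simple selection policy shown are invented; each RC applies its own policy as described above):

    #include <stdio.h>

    #define MAXDRIVES 32               /* illustrative per-RC drive limit */

    /* Bitmap of drives, one bit per drive number within the RC. */
    #define DRIVE_SET(map, i)   ((map) |= (1UL << (i)))
    #define DRIVE_ISSET(map, i) (((map) & (1UL << (i))) != 0)

    /* Hypothetical RC-side choice: take the first drive that vold marked
     * varied on and that the RC knows is not busy.  A real RC applies its
     * own policy here (location and channel usage for acsrc, mount
     * caching for vaultrc).                                               */
    int
    choose_drive(unsigned long varied_on, unsigned long busy)
    {
        int i;

        for (i = 0; i < MAXDRIVES; i++)
            if (DRIVE_ISSET(varied_on, i) && !DRIVE_ISSET(busy, i))
                return i;
        return -1;                     /* no usable drive */
    }

    int
    main(void)
    {
        unsigned long varied = 0, busy = 0;

        DRIVE_SET(varied, 2);
        DRIVE_SET(varied, 5);
        DRIVE_SET(busy, 2);
        printf("chose drive %d\n", choose_drive(varied, busy)); /* prints 5 */
        return 0;
    }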

In addition to checking drive availability, vold uses the Banker's Algorithm to check if requests using reservations for multiple resources could deadlock. Requests that are not part of reservations are put in `implicit' reservations, and all requests are also queued by reservation. Reservation-related routines are in rp_rid.c. The reservation functionality is described in the vreserve(1M) man page.
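In outline, the safety test at the heart of the Banker's Algorithm looks like the following generic sketch. This is the textbook form, not the actual rp_rid.c code; here the resources would be drives of each medium in each RC, and the `reservations' are both explicit and implicit ones:

    #include <string.h>

    #define MAXRES  8                  /* resource classes (illustrative)         */
    #define MAXRSV 16                  /* outstanding reservations (illustrative) */

    /* Return 1 if every reservation can run to completion in some order
     * given what is free now ('avail'), what each reservation already
     * holds ('alloc') and what it may still claim ('need'); return 0 if
     * granting this state could deadlock.                                */
    int
    state_is_safe(int nrsv, int nres,
                  int avail[MAXRES],
                  int alloc[MAXRSV][MAXRES],
                  int need[MAXRSV][MAXRES])
    {
        int work[MAXRES], done[MAXRSV];
        int i, j, progress;

        memcpy(work, avail, sizeof work);
        memset(done, 0, sizeof done);

        do {
            progress = 0;
            for (i = 0; i < nrsv; i++) {
                if (done[i])
                    continue;
                for (j = 0; j < nres; j++)
                    if (need[i][j] > work[j])
                        break;
                if (j < nres)
                    continue;          /* cannot finish with what is free now */
                for (j = 0; j < nres; j++)
                    work[j] += alloc[i][j]; /* assume it finishes, freeing its drives */
                done[i] = 1;
                progress = 1;
            }
        } while (progress);

        for (i = 0; i < nrsv; i++)
            if (!done[i])
                return 0;              /* some reservation can never finish */
        return 1;
    }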

In addition to the resource-queued requests for volumes, vold handles various requests to read or write the database immediately, without queueing them: allocating a volume, updating volume ownership or permissions, or recycling a volume, as described in volalloc(1), vchown(1), vchmod(1) and vrecycle(1). The vold also responds immediately to requests for the status of mount/move requests, drives, reservations, and configuration.

The database consists of two tables: volume information and quotas. The tables are in fixed-field ASCII format; each record is terminated by a newline, and fields within a record are separated by a blank character. The records are padded so that none crosses a disk block boundary - this, in combination with synchronized disk access (open(2) using the O_SYNC flag), ensures that a record will not be partially written to disk in the event of a system crash. The database routines in rp_db.c translate the ASCII format to C structures and maintain B-tree indexes on various fields.
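To make the padding idea concrete, here is a sketch with an invented block size, record width and helper function; the real tables use their own fixed field layout:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLKSIZE      512           /* illustrative disk block size        */
    #define RECSIZE      100           /* illustrative padded record width    */
    #define RECS_PER_BLK (BLKSIZE / RECSIZE)  /* 5 per block, 12 bytes slack  */

    /* Rewrite record 'recno' in place.  Records are laid out so none
     * crosses a block boundary, and the file is opened with O_SYNC, so a
     * crash leaves the record either wholly old or wholly new on disk.   */
    int
    put_record(const char *path, long recno, const char *fields)
    {
        char   rec[RECSIZE];
        size_t n = strlen(fields);
        off_t  off;
        int    fd, ok;

        memset(rec, ' ', sizeof rec);  /* blank-padded fixed fields */
        rec[RECSIZE - 1] = '\n';       /* newline-terminated record */
        memcpy(rec, fields, n < RECSIZE - 1 ? n : RECSIZE - 1);

        off = (off_t)(recno / RECS_PER_BLK) * BLKSIZE
            + (off_t)(recno % RECS_PER_BLK) * RECSIZE;

        fd = open(path, O_WRONLY | O_SYNC);
        if (fd < 0)
            return -1;
        ok = (lseek(fd, off, SEEK_SET) == off
              && write(fd, rec, RECSIZE) == RECSIZE);
        close(fd);
        return ok ? 0 : -1;
    }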

The quota table is indexed by concatenated UID,RC. The volume table is indexed by volume external label (for most lookups), by owner UID (for vls(1) requests), by internal label concatenated with UID (used to guarantee that users do not have duplicate internal labels), and by status (for lookup of scratch volumes). If a previous vold process did not close the indexes (closing is required to flush blocks buffered by the B-tree package), they are rebuilt on startup by forked processes working in parallel.

Volnd in detail

Volnd forks a child for each request that accesses a device node, in case there is a hang in the device driver; volnd times out the child in an attempt to detect such hangs. If the child completes, it responds directly to vold and exits with a status indicating success or failure to volnd. If volnd times out the child, it sends an error to vold on the child's behalf.
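The fork-and-timeout pattern, reduced to its core (the timeout value, the child's work and the simple polling loop are placeholders; the real volnd parent is select()-driven and sends the error packet to vold rather than printing it):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHILD_TIMEOUT 120          /* seconds; illustrative */

    /* Hypothetical per-request worker: in volnd this would touch the
     * device, check the label and build /dev/vol/XXXXXX, then report
     * directly to vold.                                                */
    static int
    do_device_work(void)
    {
        sleep(1);                      /* stand-in for the real device access      */
        return 0;                      /* exit status tells the parent how it went */
    }

    int
    main(void)
    {
        pid_t pid = fork();
        int status;
        unsigned waited;

        if (pid == 0)
            _exit(do_device_work());   /* child */

        /* Parent: wait for the child, but give up after CHILD_TIMEOUT
         * seconds.                                                      */
        for (waited = 0; waited < CHILD_TIMEOUT; waited++) {
            if (waitpid(pid, &status, WNOHANG) == pid) {
                printf("child exited with status %d\n", WEXITSTATUS(status));
                return 0;
            }
            sleep(1);
        }

        kill(pid, SIGKILL);            /* presumed hang in the device driver */
        printf("child timed out; reporting error on its behalf\n");
        return 1;
    }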

When a volnd starts up, it registers with the vold, then clears the /dev/vol/ directory of any user volume nodes left by a previous volnd process. It then waits on requests from vold, catching SIGCLD and getting status when the child it forks for each request terminates. The parent process and message-sending routines are in nd_1.c, and the child routines are in nd_2.c. The basic operations for which a child is forked are described below.

Volume checking involves testing `write ring' status, and also checking the internal label unless the client requested bypassing this step. `Test for mount' is invoked by the vaultrc and involves putting `MOUNT / XXXXXX' on the drive's billboard if the drive has one.

Vaultrc in detail

As used in production, vaultrc relies on a shelf filing system, and so keeps no record of volume location. A vault-management system based on a handheld barcode reader is provided, which keeps a database and enables random-access filing; it has not been used in production owing to operator discomfort with the barcode reader.

Communication with the barcode server is via RPC(3N), i.e. the response is immediately available as the result of a remote procedure call; this is a `stateless' protocol, with no follow-up messages after an interval, unlike the protocol between the vold and the vaultrc. This simple protocol is adequate because the asynchronous action waited on by the vault is the mount of a volume, which is detected and reported by volnd. Thus the barcode system has no appreciable impact on the complexity of the vaultrc - vaultrc merely keeps the barcode server informed of requests, and the server provides the same information as the console broadcasts, plus volume location, both on its own console and on the barcode gun's LCD display.

When the vaultrc starts up, it connects to the vold and gets the system configuration, using the RC id (provided as the only argument in starting the program) to look up its own configuration. If the barcode system is configured (as described at the beginning of vaultrc.c), vaultrc also connects to the barcode server. It checks if anyone is logged into the console account (complaining to the log if not), changes its process name to the title indicated in its configuration, and notifies vold that it is ready. It does not worry about what volumes may be currently on the drives; they will be automatically dismounted when the drives are assigned new mounts. Vaultrc then loops on receipt of packets from vold, which can be client requests for volumes, operator error reports, status from volnds, or commands like shutdown from vold.

When a drive is assigned a mount, a message is sent via vold to the volnd on the client's machine. If the volume has been left on a drive (cached) following a dismount, volnd is only requested to recheck the internal label and create the user-accessible device node, /dev/vol/XXXXXX. If it is a fresh mount, volnd is instructed to first loop on testing the device until the mount has occurred.
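The `loop on testing the device' step might look roughly like this, assuming a Linux-style sys/mtio.h tape status interface (the drives of the period would use their vendors' equivalents, and the device name, polling interval and timeout are placeholders):

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/mtio.h>              /* Linux-style tape status interface */
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical `wait for mount': poll the drive until a volume is
     * online or the deadline passes.  Returns 0 once mounted, -1 on
     * timeout.                                                          */
    int
    wait_for_mount(const char *drive_dev, int timeout_secs)
    {
        struct mtget mt;
        int fd, waited;

        for (waited = 0; waited < timeout_secs; waited += 5) {
            fd = open(drive_dev, O_RDONLY | O_NONBLOCK);
            if (fd >= 0) {
                if (ioctl(fd, MTIOCGET, &mt) == 0 && GMT_ONLINE(mt.mt_gstat)) {
                    close(fd);
                    return 0;          /* a volume is loaded and ready */
                }
                close(fd);
            }
            sleep(5);
        }
        return -1;
    }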

All the code is in vaultrc.c, except for barcode RPC calls which are in vault.x.

Acsrc in detail

When an acsrc starts up, it connects to the vold and gets the system configuration, using the RC id (provided as the only argument when starting acsrc) to look up its own generic RC configuration. It then loads its own ACSCONF configuration file for information on which Sun server / ACSLS it will address and on the ACS hardware configuration. It forks an ssi process to communicate with the Sun server, if one is not already running, then connects through it to the Sun server and gets the status of the robot. Dismounts are sent for any drives that are mounted. The acsrc then changes its process name to the one in the vold RC definition, tells vold to vary off any drives that had startup dismount problems, and notifies vold that it is ready. Before receiving any vold requests, however, acsrc queries the Sun server for the location of any volumes that were in the process of ejecting when the last instance of this acsrc terminated, and informs vold of any volumes that are no longer in the robot.

Author mail: Bill Ross

