NAStore
Virtual Volume Manager
Internal Design Specification


Bill Ross, Network Archive Systems

The Virtual Volume Manager (VVM) provides a disk cache layer between clients and removable volumes (e.g. tape) in the form of "virtual volumes" (VV's).

Note: removable volumes are referred to as "VSN's" (based on volman terminology: Volume Serial Number).


Inter-Process Communication (IPC)

The VVM components may run on different hosts sharing filesystem access to the disk cache; they communicate via the machine-independent XDR protocol over Internet sockets. Accepted hosts are specified in the VVM configuration file, rc.vvm.

The client library uses a synchronous request/response protocol, while the daemons communicate asynchronously. The main difference between the two in terms of IPC is that the XDR interface is buffered: a protocol that blocks on select() rather than on VVCLRecv() must check the result of VVCLNextRecord() after each VVCLRecv() and, if it is nonzero, issue another VVCLRecv(); otherwise a buffered packet will be ignored until the next fresh one hits the underlying socket.
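
For illustration, a minimal sketch of this receive pattern follows; the packet type, the VVCLRecv()/VVCLNextRecord() signatures, and the dispatch routine are assumed here, not taken from the actual client library:

    /* Sketch of a select()-driven receive loop that drains the XDR buffer.
     * Assumed, simplified declarations -- the real ones live in the VVM
     * client library headers. */
    #include <sys/select.h>

    typedef struct { int opcode; /* ... */ } vvcl_packet_t;   /* hypothetical */
    int VVCLRecv(int fd, vvcl_packet_t *pkt);      /* assumed: 0 on success   */
    int VVCLNextRecord(int fd);                    /* assumed: nonzero if a
                                                      buffered record remains */
    void handle_packet(vvcl_packet_t *pkt);        /* hypothetical dispatcher */

    void vvm_poll_loop(int fd)
    {
        fd_set rfds;
        vvcl_packet_t pkt;

        for (;;) {
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            if (select(fd + 1, &rfds, NULL, NULL, NULL) <= 0)
                continue;

            /* One select() wakeup may cover several XDR records: after each
             * VVCLRecv(), keep receiving while VVCLNextRecord() is nonzero,
             * or a buffered packet would sit unnoticed until the next fresh
             * data hits the underlying socket. */
            do {
                if (VVCLRecv(fd, &pkt) != 0)
                    break;
                handle_packet(&pkt);
            } while (VVCLNextRecord(fd) != 0);
        }
    }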


Virtual Volume Manager Daemon

The Virtual Volume Manager Daemon (VVMD) manages client requests for VV's, maintaining a database of VV's and physical volumes. It mounts physical volumes as necessary via the Volume Manager and copies VV's to and from these physical volumes via the Virtual Volume Manager Mover Daemons described below.

Physical Volume Management

Physical volumes, or VSN's, are mounted for three purposes: writing finished VV's to 'hot' VSN's, restoring nonresident VV's to the disk cache, and recycling VSN's by copying all the nonresident VV's to cache (and queueing them for write to current hot VSN's). Each mount is handled as if the request came from one of these internal 'clients', defined in vvm_clwrite.c, vvm_clread.c, and vvm_clrecycle.c. The Volman interface for the mount requests themselves is handled in vvm_volman.c.

Mounts for writing. vvm_clwrite.c. Each storage class (defined in vv_class.c) consists of a number of 'slots', each of which represents a physical volume on which a copy of the VV is stored. Each class has by default a single 'hot panel' of these slots, which represents an array of potential mounts for writing VV's. The VVMD endeavors to keep the slots filled with 'hot' VSN's that have been allocated to VVM by Volman. In filling a slot, VVMD first checks its database for a previously-allocated hot VSN in the category (class, slot), and if none is available, attempts to allocate one from Volman. If Volman reports none available, VVMD issues a console message and backs off for a while. The administrator can specify additional hot panels in the rc.vvm configuration file, allowing for multiple simultaneous writes-to-VSN in each slot. Note that when multiple panels are defined, consecutively numbered VV's can be written to any of the VSN's defined for a given slot, e.g. VV 1 could go to VSN's A and B, while VV 2 could go to A and C, where B and C occupy the same slot in different panels.
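
The class/panel/slot relationship can be pictured with a small data-structure sketch; the names, sizes, and fields below are illustrative assumptions rather than the actual definitions in vv_class.c:

    /* Illustrative sketch of the class / hot-panel / slot relationship.
     * Field names and sizes are assumptions; the real definitions are in
     * vv_class.c and the VVMD database headers. */
    #define MAX_SLOTS   8       /* copies kept per VV (one VSN per slot)   */
    #define MAX_PANELS  4       /* extra panels allow parallel slot writes */

    struct hot_slot {
        char vsn[16];           /* hot VSN currently filling this slot, or "" */
        int  mounted;           /* nonzero once Volman has it on a drive      */
    };

    struct hot_panel {
        struct hot_slot slot[MAX_SLOTS];
    };

    struct vv_class {
        char name;                          /* single alphanumeric class id  */
        int  nslots;                        /* copies required per VV        */
        int  npanels;                       /* 1 by default; more via rc.vvm */
        struct hot_panel panel[MAX_PANELS];
    };

    /* A VV copy destined for slot s may be written to the hot VSN of slot s
     * in ANY panel, so consecutive VV's can land on different VSN's. */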

The VSN's are not actually mounted until either a sufficient number of VV writes is waiting or a lesser number has waited a sufficient amount of time. Once mount(s) have occurred and all queued writes have completed, VSNs are unmounted after a given idle time in order to preserve the media.
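
The trigger and idle-dismount decisions reduce to a couple of threshold tests; the constants below are purely illustrative stand-ins for the configuration-driven values:

    #include <time.h>

    /* Hypothetical thresholds -- the real values are configuration-driven. */
    #define WRITE_BATCH_MIN     5      /* queued writes that force a mount     */
    #define WRITE_WAIT_SECS     600    /* max age of the oldest queued write   */
    #define IDLE_DISMOUNT_SECS  300    /* idle time before a hot VSN is freed  */

    /* Decide whether a hot VSN should be mounted for the queued writes. */
    int should_mount_for_write(int queued_writes, time_t oldest_queued)
    {
        if (queued_writes == 0)
            return 0;
        if (queued_writes >= WRITE_BATCH_MIN)
            return 1;
        return (time(NULL) - oldest_queued) >= WRITE_WAIT_SECS;
    }

    /* Decide whether an already-mounted hot VSN should be dismounted. */
    int should_dismount(int queued_writes, int queued_reads, time_t last_io)
    {
        if (queued_writes || queued_reads)
            return 0;
        return (time(NULL) - last_io) >= IDLE_DISMOUNT_SECS;
    }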

A 'hot-mounted' VSN may also be temporarily allocated to a VV restore-to-disk request; this is necessary (as well as perhaps optimal in terms of mount time) because the VV may only exist on hot-mounted VSN's. If a VSN needed for restore is in a 'hot panel' but not mounted, it is mounted as if for a write request (i.e. the primary request is attached to the hot panel) but with the read queued on it; whenever a hot mount completes and whenever a write completes on a hot mount, a read-queued request on that VSN is always given precedence over any waiting writes.

Mounts for reading. vvm_clread.c. VSN mounts for reading are executed immediately on request, since a client may be waiting for the VV to be restored to the disk cache. (Other sources of read mounts are VSN compaction and replacement of lost VSN's.) Read mount requests are formulated as a list of all the VSN's that contain the VV. Before this "mount any of these VSN's" request is issued to Volman, the panel(s) for the class are checked to see if any of the VSN's are already mounted - if so, the read takes the VSN if it is idle, or is queued behind the write currently in progress. Failing this, if a VSN containing the VV is a hot VSN in a panel, a write mount is generated and the read is queued on it; the read gets precedence over any queued write requests when the mount completes, and the VSN then remains available for writing until it has been idle long enough to trigger a dismount.

If no hot panel VSN can be used for a read mount, VVM sends the list of VSN's containing the VV to Volman in a "mount any" request: Volman chooses the first idle VSN in the list, thus allowing the administrator to prioritize faster-access media for restores by placing them in the low-order slot in the class definition (vv_class.c).
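
Putting the two preceding paragraphs together, the read-mount path amounts to: prefer a mounted hot VSN, then an unmounted hot VSN in a panel, then a "mount any" request to Volman. A sketch, reusing the illustrative struct vv_class / struct hot_slot from the write-mount section (all helper names are hypothetical):

    extern struct hot_slot *panel_lookup(struct vv_class *cls, const char *vsn);
    extern void queue_read_on(const char *vsn);   /* runs now if idle, else
                                                     after the current write */
    extern void request_hot_mount(struct vv_class *cls, const char *vsn);
    extern void volman_mount_any(char vsns[][16], int nvsn);

    enum read_disposition { READ_ON_HOT_VSN, READ_SENT_TO_VOLMAN };

    enum read_disposition start_read_mount(struct vv_class *cls,
                                           char vsns[][16], int nvsn)
    {
        int i;

        /* 1. Prefer a VSN already hot-mounted in one of the class panels. */
        for (i = 0; i < nvsn; i++) {
            struct hot_slot *s = panel_lookup(cls, vsns[i]);
            if (s != NULL && s->mounted) {
                queue_read_on(vsns[i]);
                return READ_ON_HOT_VSN;
            }
        }

        /* 2. Else, if a VSN holding the VV is a hot VSN in a panel but not
         *    yet mounted, generate a write mount and queue the read on it;
         *    the read runs first when the mount completes. */
        for (i = 0; i < nvsn; i++) {
            struct hot_slot *s = panel_lookup(cls, vsns[i]);
            if (s != NULL) {
                request_hot_mount(cls, vsns[i]);
                queue_read_on(vsns[i]);
                return READ_ON_HOT_VSN;
            }
        }

        /* 3. Otherwise hand Volman the whole list as a "mount any" request;
         *    the list is ordered by class slot, so lower-order (faster
         *    access) media are chosen first. */
        volman_mount_any(vsns, nvsn);
        return READ_SENT_TO_VOLMAN;
    }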

Mounts for recycling. vvm_clrecycle.c. When VSN's are to be recycled, e.g. when enough VV's on them have been deleted by clients, or the VSN medium is viewed as having reached its service life (oxide ready to fall off) or become obsolete (no longer cost effective, drives themselves obsolete), all the live VV's must be pulled off the VSN and rewritten to new media. The 'recycle' internal mount client code implements this 'bulk read', which is analogous to the 'bulk writes' to hot tapes. It is more efficient than placing multiple individual restores in the read queue, since all the nonresident, undeleted VV's are pulled in one mount of one VSN instead of potentially multiple mounts of multiple VSN's.

Mount request queueing. vvm_volman.c. All mount requests are put on a master 'Volman' queue, which is searched by request id when responses come back from Volman. Each Volman request points to its initiating write, read, or recycle request. vvm_volman.c initiates allocate/mount/dismount requests and handles the ups and downs of the Volman system, calling a routine in the requestor's file upon completion or error. Dismount requests are executed asynchronously.
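
A sketch of a master-queue entry and the response dispatch it implies; the structure and names are assumptions, not the actual contents of vvm_volman.c:

    /* Illustrative master Volman queue entry.  Every outstanding allocate,
     * mount, or dismount carries the Volman request id plus a pointer back
     * to the internal client (write/read/recycle) request and its completion
     * routine in vvm_clwrite.c / vvm_clread.c / vvm_clrecycle.c. */
    struct volman_req {
        long   reqid;                          /* id returned by Volman        */
        int    op;                             /* ALLOCATE / MOUNT / DISMOUNT  */
        void  *client_req;                     /* initiating internal request  */
        void (*done)(void *client_req, int volman_status);
        struct volman_req *next;               /* master queue link            */
    };

    static struct volman_req *volman_queue;

    /* When a Volman response arrives, find the request by id and hand the
     * status to the completion routine in the requestor's file. */
    void volman_response(long reqid, int status)
    {
        for (struct volman_req *r = volman_queue; r != NULL; r = r->next) {
            if (r->reqid == reqid) {
                r->done(r->client_req, status);
                /* (removal from the queue omitted in this sketch) */
                return;
            }
        }
        /* unknown id: log it (a console message in the real daemon) */
    }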

Disk Cache Management

vv_cache.c. VVMD also manages the VV disk cache, freeing the longest-unused VV's that have been written to VSN's when space is needed. The available cache space is checked whenever a new VV is about to be created, as well as when a nonresident VV is to be restored.

When a hot VV is to be created or appended, or a nonresident one is to be paged in by a client read or VSN recycle, the available cache space is checked via statfs(), taking into account the space reserved for VV's already in the process of being written or paged in. If space is inadequate, an attempt is made to free old VV's until the required space is found. In the process, if any hot VV's are encountered in the age-sorted list, they are finished, which begins the save of those VV's to VSN's.

VVMD keeps an index of the VV's on disk, keyed by atime+VV, in its database; this index is initialized at startup time and updated when VV's are added, deleted, or written. When the cache is full and space is needed to create a new VV or restore an old one, the index is used to choose the oldest on-VSN VV for removal - the use of atime as the first key element provides the sorting, and the index version of atime is kept in a scratch field in the VV table, allowing reconstruction of the key and thus deletion of the index entry in the event the inode atime changes.
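
A rough sketch of the space check, with a hypothetical free_oldest_on_vsn_vv() standing in for the walk of the atime-keyed index:

    #include <sys/vfs.h>        /* statfs(); header varies between systems */

    /* 'reserved_bytes' is the running total reserved for VV's currently
     * being written or paged in; both names here are illustrative. */
    extern long long reserved_bytes;
    extern int free_oldest_on_vsn_vv(void);   /* frees (or finishes) the
                                                 oldest VV; 0 if none left */

    int ensure_cache_space(const char *cache_dir, long long need)
    {
        struct statfs fs;

        for (;;) {
            if (statfs(cache_dir, &fs) != 0)
                return -1;

            long long avail = (long long)fs.f_bavail * fs.f_bsize
                              - reserved_bytes;
            if (avail >= need)
                return 0;                     /* enough room: caller reserves */

            /* Free the longest-unused on-VSN VV; hot VV's encountered in the
             * age-sorted index are 'finished' instead, starting their save. */
            if (free_oldest_on_vsn_vv() == 0)
                return -1;                    /* nothing left to free */
        }
    }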

When the space is available for the hot or restoring VV, that disk space is reserved, equal to the max VV size for the class plus 2 blocks (for directory and inode entries). When the VV write or restore is complete (or fails), the total reserved is decremented by the same amount.
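
The reservation bookkeeping then amounts to something like the following (illustrative names only; the real accounting is in vv_cache.c):

    /* Illustrative reservation accounting, sharing 'reserved_bytes' with the
     * space-check sketch above. */
    extern long long reserved_bytes;

    long long reserve_for_vv(long long class_max_vv_size, long long block_size)
    {
        /* max VV size for the class plus 2 blocks (directory + inode entries) */
        long long amount = class_max_vv_size + 2 * block_size;
        reserved_bytes += amount;
        return amount;          /* caller releases this same amount when the
                                   write or restore completes (or fails) */
    }

    void release_reservation(long long amount)
    {
        reserved_bytes -= amount;
    }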

Since 'finished' VV's are written to VSN's as quickly as possible, this cache management strategy should suffice as long as the disk cache is big enough to handle the burstiness of the incoming data.

VVMD Initialization

vvm_init.c. Upon initialization, the VVMD immediately changes to the spool (logging) directory (so that any core dump lands there), then obtains a lock to ensure that only one VVMD is running and starts logging. The cache filesystem is checked to make sure that it is mounted, the cache free space is assessed, and communications and class management structures are set up in memory. VVMD then reads its configuration file, rc.vvm, starts its database, the index of on-disk VV's, and the Volman connection, and tries to determine 'hot' physical volumes for the slots of each storage class for writing 'finished' VV's to: first the database is checked for previously-allocated VSN's, and if any slots remain empty, new allocations are attempted from Volman.

VVMD then checks the database to see if there are any finished VV's to write or VSN-recycle VV's to read, and enqueues any such requests. It then begins to accept client connections.

VVMD Database

The VVMD Database consists of two 'tables', each with its own B-tree indexes. The tables are ASCII; each record has single spaces separating the fields and a newline character at the end. The record size is an integer factor of the disk block size, so that records never cross disk block boundaries, i.e. a machine crash cannot leave a partially written record.
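
For illustration, with 512-byte disk blocks any record size that divides 512 evenly (64, 128, 256, ...) satisfies this constraint; the sizes below are hypothetical, the real layouts being in vvm_db.h:

    #include <assert.h>

    /* Hypothetical sizes: fixed-width ASCII records, space-separated fields,
     * blank padding, trailing newline. */
    #define DISK_BLOCK_SIZE  512
    #define VV_RECORD_SIZE   128

    static void check_record_layout(void)
    {
        /* record size an integer factor of the block size => no record can
         * straddle a block, so a crash cannot leave half of one on disk */
        assert(DISK_BLOCK_SIZE % VV_RECORD_SIZE == 0);
    }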

VV Table

The VV Table has a record for each VV. Its contents are specified in vvm_db.h. The indexes on it are:

VSN Table

This table contains information on removable volumes (referred to as "VSN's" based on volman terminology: Volume Serial Number). Its contents are also specified in vvm_db.h. The indexes on it are:


Virtual Volume Manager Mover Daemons

In the NAStore 3 design, tape drives are mounted on dedicated hosts in order to get maximum bandwidth. Each such host has a Virtual Volume Manager Mover Daemon (VVMVD) running on it to copy VV's between tape and the VVM's shared on-disk VV cache. A VVMVD is selected by the VVMD after the Volume Manager has mounted a tape on a drive that is attached to its host. Once it receives a request to copy a VV from/to a given mounted VSN, it forks a child to do the work and checks periodically to make sure that it is not hung; if successful, the child responds directly to the VVMD. (The VVMVD is modeled on the volman volnd.) When writing a VV to tape, the VVMVD writes a label before and after the VV data; these labels are checked later when the VV is read from the tape.
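
A sketch of the fork-and-watch pattern described above; copy_vv(), the timeouts, and the reporting details are assumptions for illustration:

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical polling interval and hang threshold. */
    #define COPY_POLL_SECS   30
    #define COPY_HANG_SECS   3600

    extern int copy_vv(const char *vsn, const char *vv_path, int to_tape);

    pid_t start_copy(const char *vsn, const char *vv_path, int to_tape)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: do the copy; reporting the result directly to the VVMD
             * over its socket is omitted from this sketch */
            _exit(copy_vv(vsn, vv_path, to_tape) == 0 ? 0 : 1);
        }
        return pid;             /* parent: remember pid and start time */
    }

    /* Called every COPY_POLL_SECS for each outstanding child. */
    void check_copy(pid_t pid, time_t started)
    {
        int status;

        if (waitpid(pid, &status, WNOHANG) == pid)
            return;                             /* child has finished */
        if (time(NULL) - started > COPY_HANG_SECS)
            kill(pid, SIGKILL);                 /* presumed hung */
    }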

VV Label Format

The label format is somewhat analogous to the ANSI label format, but carries all the needed information in a single label.
        data            chars           format

        label           3               'HDR' or 'EOF'
        vv class        1               alphanum
        vv serial       10              decimal, left-justified
        id              10              decimal
        file number     10              decimal
        size            16              64-bit hex
        time finished   10              unix time in 40-bit hex, like dbase
        time written    10              ditto
        version         10              (remaining chars)
 
(all alpha chars uppercase)
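
For illustration, the 80-character layout above could be produced with a single formatted write; the struct below and the exact justification of the plain 'decimal' fields are assumptions:

    #include <stdio.h>

    /* Sketch of building the 80-character VV label from the table above.
     * Field widths follow the table; everything else is an assumption. */
    #define VV_LABEL_LEN 80

    struct vv_label {
        const char        *kind;           /* "HDR" or "EOF"                  */
        char               vv_class;       /* single alphanumeric class       */
        long               serial;         /* VV serial, decimal              */
        long               id;
        long               file_number;
        unsigned long long size;           /* 64-bit, written as 16 hex chars */
        unsigned long long time_finished;  /* unix time, 40-bit hex           */
        unsigned long long time_written;
        const char        *version;
    };

    int format_vv_label(char *buf, size_t buflen, const struct vv_label *l)
    {
        if (buflen < VV_LABEL_LEN + 1)
            return -1;
        /* 3+1+10+10+10+16+10+10+10 = 80 characters, uppercase hex. */
        return snprintf(buf, buflen,
                        "%3.3s%c%-10ld%10ld%10ld%016llX%010llX%010llX%-10.10s",
                        l->kind, l->vv_class, l->serial, l->id, l->file_number,
                        l->size, l->time_finished, l->time_written, l->version);
    }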


