NAStore Internal Design Specification

NAStore
Rapid Access Storage Hierarchy
Internal Design Specification


Tom Proett, MRJ Technology Solutions

The Rapid Access Storage Hierarchy system (RASH) moves data between a file system and virtual volumes (VV's). The file system can have a virtual size which is much larger than its true size. No special user commands are required to use a RASH file system.


File Archiving

The process of copying data from the file system to a lower level of storage is called archiving. In the case of RASH, the second storage level is a disk cache which contains virtual volumes (VV's). The VV's are managed by the Virtual Volume Manager, which writes them to removable media so that they can be deleted from the cache and, if necessary, copied back into it later.

Two different programs are provided to do archiving. One is designed to be run periodically to "sweep" data off the file system. The other is run by hand to manually archive a set of files.

Archive

This program should be set to run from cron frequently enough to keep the chance of file system overflow low. It reads a list of files to process into a buffer which can be sized by a command line option. Each time the buffer is filled, a child process is started to handle it, up to a settable limit on the number of children.

After processing any command line arguments, the log file is initialized, setpgid is called, and signal handling is arranged for SIGHUP, SIGINT, and SIGTERM. The handler function depends on whether child processes are to be forked later. If there will be only one process, the handler is intrpt. If there will be more than one process, the parent only reads the list of files and forks children to do the work, and the handler is bomb_archive.

At this point, the input file is opened if it is not stdin, then locked with flock. The input buffer is allocated and reading the input file begins. Each line from the input consists of a file uid, gid, size and path. Any non-printable characters in the path are escaped, so restore_name is called to reverse this. The buffer is used as an array of structures, each of which holds the information for one file.


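typedef struct  {
        char    name[PATH_MAX+1];       /* path, after restore_name */
        uid_t   uid;                    /* owner read from the input line */
        gid_t   gid;                    /* group read from the input line */
        off_t   size;                   /* size read from the input line */
} arc_ent_t;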
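As an illustration of filling one arc_ent_t from an input line, here is a minimal sketch. The text does not give the exact field encoding, so this assumes decimal uid, gid and size separated by blanks, with the path taking the rest of the line. The function name parse_arc_line is hypothetical, and the unescaping done by restore_name is left out because its escape format is not described here.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* fallback for this sketch only */
#endif

typedef struct {
    char    name[PATH_MAX + 1];
    uid_t   uid;
    gid_t   gid;
    off_t   size;
} arc_ent_t;

/* Parse "uid gid size path" (assumed decimal, blank separated) into *ent.
 * The path is the rest of the line, so it may contain blanks.
 * Returns 0 on success, -1 on a malformed or over-long line. */
int parse_arc_line(const char *line, arc_ent_t *ent)
{
    unsigned long uid, gid;
    long long size;
    int used = 0;
    size_t len;

    assert(line != NULL && ent != NULL);
    if (sscanf(line, "%lu %lu %lld %n", &uid, &gid, &size, &used) != 3)
        return -1;
    if (used == 0 || size < 0)
        return -1;

    len = strcspn(line + used, "\n");       /* drop the trailing newline */
    if (len == 0 || len > PATH_MAX)
        return -1;

    memcpy(ent->name, line + used, len);
    ent->name[len] = '\0';
    ent->uid  = (uid_t)uid;
    ent->gid  = (gid_t)gid;
    ent->size = (off_t)size;
    return 0;
}
```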

Once the buffer is full, or the end of the input is reached, the function process_arc is called. This will decide if a child is to be forked to process the buffer or if the current process will do the work. In either case, the function archive_list does the actual processing of the buffer. It loops through the buffer, checking each element of the array of structures. An lstat is done on the file and the uid, gid and size are all compared to make sure they match the information saved in the structure. If not, the entry is skipped. If the information all matches, the function rashvv is called for the file.

The function rashvv is passed the file name and stat structure pointer. The file is opened for read and fcntl is called with DMFS_GETRINFO to get the metadata. If the file has zero length or is being restored, it is skipped. If it is already archived and the space is to be released, the rash database is opened to be sure it is consistent with the information in the file system. If it is, fcntl is called with DMFS_SETNONR to release the data blocks and the function returns.

If the file is not already archived, the metadata is checked to see if a bitfile id is needed. If so, getbfid is called and the new value is pushed into the file system by calling fcntl with DMFS_SETRINFO.

At this point, the rash database is opened and a connection to the virtual volume manager is initialized. The file is written to one or more VV's by calling writefile. If this succeeds, rashdb_add is called for each VV which was written. The database is closed and fstat is called to get the information for the file and check it once again to be sure it hasn't changed while being written. If all is well, fcntl is called with DMFS_FINISHARCH and again with DMFS_CLRCOMPACT. If space is being released, fcntl is called yet again with DMFS_SETNONR. The function then returns.

The function writefile copies an open file to VV's. A loop is entered which creates an element in an array of structures for each VV written.


typedef struct {
        char            vvname[MAXVVNAMELEN+1]; 
        off_t           vvlseek;
        off_t           flseek;
        off_t           vvdata;
        int             lastfno;
        int             new;
} vvlist_t;

The function vv_openwrite is called to gain access to a writable VV. The VV has labels written to it and file data is written until the VV is out of space or the file is done. If more file data remains, the loop continues.
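To make the loop concrete, the sketch below computes how a file would be divided among VV's when each VV has a known amount of free space. It is a simplified model, not the writefile code: the seg_plan_t type and plan_segments function are hypothetical, and the meanings given to flseek (file offset where the segment starts) and vvdata (bytes of the file on that VV) are this sketch's reading of the vvlist_t fields.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Two of the vvlist_t fields, as this sketch interprets them. */
typedef struct {
    off_t flseek;   /* file offset where this VV's segment starts */
    off_t vvdata;   /* bytes of the file stored on this VV */
} seg_plan_t;

/* Split fsize bytes over VVs with the given free space, in order.
 * Fills out[] and returns the number of VVs used, or -1 if the file
 * does not fit in nvv volumes. */
int plan_segments(off_t fsize, const off_t *vvfree, int nvv, seg_plan_t *out)
{
    off_t done = 0;
    int n = 0;

    assert(fsize > 0 && vvfree != NULL && out != NULL);
    while (done < fsize) {
        off_t chunk;

        if (n == nvv)
            return -1;              /* ran out of volumes */
        chunk = fsize - done;
        if (chunk > vvfree[n])
            chunk = vvfree[n];      /* VV fills up; the rest goes to the next */
        out[n].flseek = done;
        out[n].vvdata = chunk;
        done += chunk;
        n++;
    }
    return n;
}
```

For a 250-byte file and three VV's with 100 bytes free each, this yields segments of 100, 100 and 50 bytes starting at file offsets 0, 100 and 200.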

When processing the input file is done, any child processes are waited for and the parent returns.
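The parent's side of this, forking one child per filled buffer and waiting for them at the end, can be sketched as below. The specification does not say what the parent does when the child limit is reached, so this sketch assumes it blocks until one child exits. The names run_batches and sample_work are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for processing one buffer; batch 3 "fails". */
int sample_work(int batch)
{
    return batch == 3 ? -1 : 0;
}

/* Run work(i) for i in [0, nbatches) in child processes, keeping at most
 * max_children alive at once. Returns the number of children that exited
 * with status 0, or -1 if fork or wait fails. */
int run_batches(int nbatches, int max_children, int (*work)(int))
{
    int running = 0, ok = 0, status, i;
    pid_t pid;

    assert(max_children > 0 && work != NULL);
    for (i = 0; i < nbatches; i++) {
        if (running == max_children) {      /* at the limit: wait for one */
            if (wait(&status) < 0)
                return -1;
            running--;
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                ok++;
        }
        pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(work(i) == 0 ? 0 : 1);    /* child: process one buffer */
        running++;
    }
    while (running > 0) {                   /* input done: reap the rest */
        if (wait(&status) < 0)
            return -1;
        running--;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```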

Forcearc

This program differs from archive in that it does a file tree descent rather than reading file names from a list. After processing any command line arguments, the log file is set up and a signal handler function is installed. A loop is entered for each file listed on the command line. If a file tree walk is being done via the -r (recursive) flag, then the function traverse is called and passed a pointer to the function forcefile. If a file tree walk is not being done, the function forcefile is called directly. The function forcefile checks to make sure the calling user owns the file being processed, then calls the function rashvv.

File Restoration

A file can have any amount of data from zero to its total size resident in the file system. The point at which the resident data ends is called the "barrier". If the barrier is any less than the total file size, the file is considered non-resident. Accessing non-resident data requires the intervention of a system process to copy the data back from VV(s) to the file system. This can happen automatically when an attempt is made to access the non-resident data, or manually when a user knows ahead of time that a set of files will be needed.
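The barrier rule reduces to a simple comparison, shown in the sketch below. The function names are hypothetical; the rule itself (barrier less than file size means non-resident) is the one stated above.

```c
#include <assert.h>
#include <sys/types.h>

/* A file is non-resident when the barrier lies before its end. */
int is_nonresident(off_t barrier, off_t size)
{
    assert(barrier >= 0 && barrier <= size);
    return barrier < size;
}

/* Bytes past the barrier, i.e. data held only on VVs. */
off_t nonresident_bytes(off_t barrier, off_t size)
{
    assert(barrier >= 0 && barrier <= size);
    return size - barrier;
}
```

A 100-byte file with its barrier at 40 is non-resident with 60 bytes to restore, while a file whose barrier equals its size is fully resident.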

Restore Daemon

The system will send a message via a 'kernel to daemon' device, kdcom, whenever a file needs to be restored. This message is received by the dmfsd process which will do what is required.

The dmfsd process is usually started by init. It will process command line arguments and set up signal handler functions for SIGTERM, SIGUSR1 and SIGCHLD. The signals SIGHUP and SIGINT are ignored. A loop is entered to wait for input.

Each time a read of kdcom returns, the message is checked to be sure it is the correct size and the message type is determined. The two possible commands are RASH_RESTORE and RASH_OBSOLETE. The message is checked to be sure it has a valid bfid, then the queues are checked to see if there is already a message outstanding for the bfid sent with the current message. If so, the message is ignored. Otherwise, a new queue entry is created. If the number of running child processes for the given type of message is less than the number allowed, a new child is forked to process the message and the queue entry is added to the active queue. If the message must wait, it is added to the waiting queue.
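The admission decision described above can be modeled with a small in-memory table. This is a sketch under several assumptions: the active and waiting queues are folded into one array with an active flag, the bfid is held as a hex string, the numeric values of RASH_RESTORE and RASH_OBSOLETE are invented, and the names queues_t, qent_t and admit are hypothetical. Forking the child is left to the caller.

```c
#include <assert.h>
#include <string.h>

enum { RASH_RESTORE = 0, RASH_OBSOLETE = 1, NTYPES = 2 }; /* values assumed */
enum { MAXQ = 64, BFIDLEN = 32 };
typedef enum { IGNORED, STARTED, QUEUED, REJECTED } admit_t;

typedef struct {
    char bfid[BFIDLEN + 1];
    int  type;
    int  active;            /* 1 = a child is working on it, 0 = waiting */
} qent_t;

typedef struct {
    qent_t ent[MAXQ];
    int    count;
    int    running[NTYPES]; /* children currently running, per type */
    int    limit[NTYPES];   /* children allowed, per type */
} queues_t;

/* Decide what to do with a new message for bfid of the given type. */
admit_t admit(queues_t *q, const char *bfid, int type)
{
    qent_t *e;
    int i;

    assert(q != NULL && bfid != NULL && type >= 0 && type < NTYPES);
    for (i = 0; i < q->count; i++)
        if (strcmp(q->ent[i].bfid, bfid) == 0)
            return IGNORED;             /* already outstanding */
    if (q->count == MAXQ || strlen(bfid) > BFIDLEN)
        return REJECTED;                /* table full or bfid too long */

    e = &q->ent[q->count++];
    strcpy(e->bfid, bfid);
    e->type = type;
    e->active = q->running[type] < q->limit[type];
    if (e->active) {
        q->running[type]++;             /* caller would fork a child here */
        return STARTED;
    }
    return QUEUED;
}
```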

If the message is for a restore operation, the function restore is called. It is running in a child process, so the kdcom device is closed and the log file is reinitialized. The message is checked to be sure it has a valid bfid, then the file handle is opened with fhopen. The file information is retrieved with fstat and fcntl(DMFS_GETRINFO). The bfid sent with the message is compared to what is given in the file information. If they don't match, or the file is not marked DM_NONR (non-resident), then the restore is abandoned.

The database is opened and the records are retrieved for the bfid in question. If this operation is successful, the function restfile is called to restore the file. Following this, the database is closed and the process exits.

The restfile function loops through the array of VV's that contain the file segments for the bfid. Any data that is resident in the file system is passed over while checking that the database shows the correct offset for each VV so that the data could be restored if needed. The VV's containing the file system resident data are not mounted. The first VV which contains non-resident data is mounted and the labels are checked. Any part of the data on the VV which is resident is skipped and a loop is entered which reads from the VV and writes to the file. Any other VV's are mounted and read. Once the restore is complete, fcntl(DMFS_FINISHRESTORE) is called and the file is closed.
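The per-VV decisions in this loop can be sketched as a planning function that, given the barrier, works out which VV's need mounting and how much of each to skip. The types and the plan_restore name are hypothetical, and the offset check shown (each segment must start where the previous one ended) is only this sketch's interpretation of "the database shows the correct offset".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct {
    off_t flseek;   /* file offset where this VV's data starts */
    off_t vvdata;   /* bytes of the file held on this VV */
} seg_t;

typedef struct {
    int   mount;    /* 0: VV holds only resident data, leave it unmounted */
    off_t skip;     /* bytes of this VV's data that are already resident */
    off_t nread;    /* bytes to copy from the VV to the file */
} seg_action_t;

/* Fill act[] for each segment. Returns 0 on success, -1 if the
 * segment offsets are not contiguous from file offset 0. */
int plan_restore(const seg_t *seg, int nseg, off_t barrier, seg_action_t *act)
{
    off_t expect = 0;
    int i;

    assert(seg != NULL && act != NULL && barrier >= 0);
    for (i = 0; i < nseg; i++) {
        off_t end = seg[i].flseek + seg[i].vvdata;

        if (seg[i].flseek != expect)
            return -1;                  /* database offsets inconsistent */
        expect = end;

        if (end <= barrier) {           /* entirely resident: pass over */
            act[i].mount = 0;
            act[i].skip  = seg[i].vvdata;
            act[i].nread = 0;
        } else {                        /* some or all data must be read */
            act[i].mount = 1;
            act[i].skip  = barrier > seg[i].flseek
                         ? barrier - seg[i].flseek : 0;
            act[i].nread = seg[i].vvdata - act[i].skip;
        }
    }
    return 0;
}
```

With segments of 100, 100 and 50 bytes and a barrier at 150, the first VV is not mounted, the second is read after skipping 50 bytes, and the third is read in full.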

The parent dmfsd gets a SIGCHLD signal every time a forked process exits. The function getkid is called in this situation. It looks for the pid of the child in the active queue to make sure it is known and to find out what type of work it was doing. If any requests are waiting for that type of work, it will start another process. Otherwise, it just decrements the number of running tasks for that request type.
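The bookkeeping done when a child exits can be sketched as follows. This models only the counters and the pid lookup; reaping the child and forking the replacement are left to the caller. The kids_t layout and the child_exited name are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

enum { NTYPES = 2, MAXKIDS = 16 };
typedef enum { UNKNOWN_PID, STARTED_NEXT, WENT_IDLE } exit_action_t;

typedef struct {
    pid_t pid[MAXKIDS];     /* active children; 0 marks a free slot */
    int   type[MAXKIDS];    /* request type each child is serving */
    int   running[NTYPES];  /* running children, per type */
    int   waiting[NTYPES];  /* queued requests, per type */
} kids_t;

/* Update the tables after the child with the given pid has exited. */
exit_action_t child_exited(kids_t *k, pid_t pid)
{
    int i, t;

    assert(k != NULL && pid > 0);
    for (i = 0; i < MAXKIDS; i++)
        if (k->pid[i] == pid)
            break;
    if (i == MAXKIDS)
        return UNKNOWN_PID;     /* not one of ours */

    t = k->type[i];
    k->pid[i] = 0;
    if (k->waiting[t] > 0) {
        k->waiting[t]--;        /* caller forks a child for the next request */
        return STARTED_NEXT;    /* running[t] unchanged: one out, one in */
    }
    k->running[t]--;
    return WENT_IDLE;
}
```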

Frestore

The program frestore is run in a situation where a number of files are to be restored at once. The information for each file to be restored is read from the database and the function restfile is called.

Database

The entire RASH database consists of multiple sub-databases, one per user. Each RASH sub-database is a flat file with two b-tree indexes based on the database library from Sleepycat Software. The data file consists of fixed size records. A file segment on a VV is described by one primary record and up to two auxiliary records. The primary record will contain all or part of the file name of the file segment. Any part of the file name which will not fit into the primary record is recorded in the auxiliary records.

The simplest possible situation is a single primary record and no auxiliary records. This would be the case for a file which fits entirely on one VV and has a file name short enough to not need any auxiliary records. The most complex case is where a file spans multiple VVs and has a long file name which requires auxiliary records. In this situation, only the first file segment has the full file name recorded in auxiliary records. The subsequent file segments have only the part of the file name which fits into the primary record and no auxiliary records.
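The way a file name spills from the primary record into auxiliary records can be illustrated with the sketch below. The record capacities are not given in this document, so PRIMARY_NAME and AUX_NAME are invented values, and the name_records_t type and split_name function are hypothetical. Only the one-primary, up-to-two-auxiliary structure comes from the text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical capacities: the real record layout is not given here. */
enum { PRIMARY_NAME = 64, AUX_NAME = 224, MAX_AUX = 2 };

typedef struct {
    char primary[PRIMARY_NAME + 1];
    char aux[MAX_AUX][AUX_NAME + 1];
    int  naux;                      /* auxiliary records used: 0..MAX_AUX */
} name_records_t;

/* Spread name over one primary and up to MAX_AUX auxiliary records.
 * Returns the number of auxiliary records used, or -1 if it cannot fit. */
int split_name(const char *name, name_records_t *out)
{
    size_t len, take;

    assert(name != NULL && out != NULL);
    len = strlen(name);
    if (len > (size_t)PRIMARY_NAME + (size_t)MAX_AUX * (size_t)AUX_NAME)
        return -1;

    memset(out, 0, sizeof *out);    /* also NUL-terminates every piece */
    take = len < (size_t)PRIMARY_NAME ? len : (size_t)PRIMARY_NAME;
    memcpy(out->primary, name, take);
    name += take;
    len  -= take;

    while (len > 0) {               /* remainder goes to auxiliary records */
        take = len < (size_t)AUX_NAME ? len : (size_t)AUX_NAME;
        memcpy(out->aux[out->naux++], name, take);
        name += take;
        len  -= take;
    }
    return out->naux;
}
```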

VV Label Format

The label format is somewhat analogous to the ANSI format. The beginning of a VV has a label which gives information about which user it has been assigned to. All fields are separated with blanks and the label ends with a newline.

        data            chars           format

        hdr             4               'RASH'
        vvname          33              this vv
        version         10              rash version
        dbuid_name      10              owner when bfid assigned
        dbuid           10              dbuid number
        date            16              time label written
 
(all alpha chars uppercase)
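A VV label with these column widths could be produced as in the sketch below. It assumes each field is left-justified and padded with blanks to its full width, which the table does not state explicitly, and it takes every field as a ready-made string because the number and date encodings are not specified for this label. The function name format_vv_label is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Six fields, five separating blanks, one newline: 89 characters. */
enum { VVLABEL_LEN = 4 + 33 + 10 + 10 + 10 + 16 + 5 + 1 };

/* Write the VV label into buf (more than VVLABEL_LEN bytes).
 * Returns 0 on success, -1 if a field is too wide for its column. */
int format_vv_label(char *buf, size_t bufsz, const char *vvname,
                    const char *version, const char *dbuid_name,
                    const char *dbuid, const char *date)
{
    int n, i;

    assert(buf != NULL && bufsz > (size_t)VVLABEL_LEN);
    assert(vvname && version && dbuid_name && dbuid && date);
    n = snprintf(buf, bufsz, "%-4s %-33s %-10s %-10s %-10s %-16s\n",
                 "RASH", vvname, version, dbuid_name, dbuid, date);
    if (n != VVLABEL_LEN)
        return -1;              /* some field overflowed its width */
    for (i = 0; i < n; i++)     /* all alpha chars uppercase */
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return 0;
}
```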

Each file segment on a VV is surrounded by a header and trailer label. If a VV contains more than one file, the trailer will be analogous to an 'EOF' label. If a file spans multiple VV's, every VV except the last will have an 'EOV' label, with the last having an 'EOF' label. All fields are separated with blanks and the label ends with a newline. The filename is written following the 'HDR' label. Each label is followed by an eye catcher that corresponds to a tape mark.


        data            chars           format
        hdr             4               'FILE'
        label           3               'HDR' 'EOV' or 'EOF'
        version         10              rash version
        vv0             33              1st vv for multi VV file
        vvno            5               volume order number
        othervv         33              prev or next vv for multi VV file
        fno             5               which file on this multi file vv
        bfid            32              bfid in ascii hex
        uname           10              user name
        uid             10              file uid in ascii hex
        gname           10              group name
        gid             10              file gid in ascii hex
        mode            4               file mode in ascii hex
        mtime           16              mtime
        ctime           16              ctime
        arctm           16              archive time
        fsize           16              file size in ascii hex
        lseek           16              lseek address in ascii hex
        vvdata          16              amount of file on this vv
        flen            4               length of file name hex
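Reading such a label back can be sketched with a table of column widths. This assumes the fields are fixed width and blank padded, so that an empty field still occupies its column; the table only gives the widths and says fields are separated by blanks. The names label_width and split_file_label are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { NFIELDS = 20, MAXFIELD = 33 };

/* Column widths in table order: hdr, label, version, vv0, vvno, othervv,
 * fno, bfid, uname, uid, gname, gid, mode, mtime, ctime, arctm, fsize,
 * lseek, vvdata, flen. */
const int label_width[NFIELDS] = {
    4, 3, 10, 33, 5, 33, 5, 32, 10, 10, 10, 10, 4, 16, 16, 16, 16, 16, 16, 4
};

/* Copy each field of a file label into out[], trimming trailing blanks.
 * Returns 0 on success, -1 if the line is not laid out as assumed. */
int split_file_label(const char *line, char out[NFIELDS][MAXFIELD + 1])
{
    size_t pos = 0, len;
    int i, w;

    assert(line != NULL && out != NULL);
    len = strlen(line);
    for (i = 0; i < NFIELDS; i++) {
        w = label_width[i];
        if (pos + (size_t)w + 1 > len)
            return -1;                      /* line too short */
        if (line[pos + (size_t)w] != (i == NFIELDS - 1 ? '\n' : ' '))
            return -1;                      /* separator missing */
        memcpy(out[i], line + pos, (size_t)w);
        while (w > 0 && out[i][w - 1] == ' ')
            w--;                            /* trim the padding */
        out[i][w] = '\0';
        pos += (size_t)label_width[i] + 1;
    }
    return 0;
}
```

After a successful call, out[1] holds the label type ('HDR', 'EOV' or 'EOF') and the hex fields such as out[16] (fsize) can be decoded with strtoull using base 16.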

A multi-file VV would have the following format, where RASH, HDR and EOF are labels:
RASH HDR FileName1 EndMark FileSegment1 EOF EndMark HDR FileName2 EndMark FileSegment2 EOF EndMark

A set of VV's with a single file written to them would have this format:
RASH HDR FileName EndMark FileSegment EOV EndMark
RASH HDR FileName EndMark FileSegment EOF EndMark
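The choice between the two trailer labels follows directly from a segment's position in the file, as this small sketch shows. The function name trailer_label is hypothetical.

```c
#include <assert.h>

/* Trailer for segment idx (0-based) of a file written as nseg segments:
 * 'EOV' on every VV except the last, 'EOF' on the last. */
const char *trailer_label(int idx, int nseg)
{
    assert(nseg > 0 && idx >= 0 && idx < nseg);
    return idx == nseg - 1 ? "EOF" : "EOV";
}
```

A file that fits on one VV has a single segment, so its trailer is always 'EOF'.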

Author: Tom Proett

