Parrot User's Manual

March 2011

Parrot is Copyright (C) 2003-2004 Douglas Thain and Copyright (C) 2005- The University of Notre Dame. All rights reserved. This software is distributed under the GNU General Public License. See the file COPYING for details.

The Glite module of Parrot is Copyright (c) Members of the EGEE Collaboration. 2004. See http://eu-egee.org/partners/ for details on the copyright holders. For license conditions see the license file or http://eu-egee.org/license.html

Please use the following citation to refer to Parrot:

  • Douglas Thain and Miron Livny, Parrot: An Application Environment for Data-Intensive Computing, Scalable Computing: Practice and Experience, Volume 6, Number 3, Pages 9--18, 2005.

    Overview

    Parrot is a tool for attaching old programs to new storage systems. Parrot makes a remote storage system appear as a file system to a legacy application. Parrot does not require any special privileges, any recompiling, or any change whatsoever to existing programs. It can be used by normal users doing normal tasks. For example, an anonymous FTP service is made available to vi like so:
    % parrot_run vi /anonftp/ftp.cs.wisc.edu/RoadMap
    

    Parrot is useful to users of distributed systems, because it frees them from rewriting code to work with new systems and relying on remote administrators to trust and install new software. Parrot is also useful to developers of distributed systems, because it allows rapid deployment of new code to real applications and real users that do not have the time, inclination, or permissions to build a kernel-level filesystem.

    Parrot currently supports a variety of remote I/O systems, all detailed below. We welcome contributions of new remote I/O drivers from others. However, if you are working on a protocol driver, please drop us a note so that we can make sure that work is not duplicated.

    Almost any application - whether statically or dynamically linked, standard or commercial, command-line or GUI - should work with Parrot. There are a few exceptions. Because Parrot relies on the Linux ptrace interface, any program that itself relies on ptrace cannot run under Parrot. This means Parrot cannot run a debugger, nor can it run itself recursively. In addition, Parrot cannot run setuid programs, as the operating system considers this a security risk.

    Parrot also provides a new experimental feature called identity boxing. This feature allows you to securely run a visiting application within a protection domain without becoming root or creating a new account. Read below for more information on identity boxing.

    Parrot currently runs on the Linux operating system with either Intel compatible (i386) or AMD compatible (x86_64) processors. It relies on some fairly low level details in order to implement system call trapping. Ports to other platforms and processors may be possible in the future.

    Like any software, Parrot is bound to have some bugs. Please check the known bugs page for the latest scoop.

    Installation

    Parrot is distributed as part of the Cooperative Computing Tools. To install, please read the cctools installation instructions.

    Examples

    To use Parrot, simply run the parrot_run command followed by any other Unix program. For example, to run a Parrot-enabled vi, execute this command:
    % parrot_run vi /anonftp/ftp.cs.wisc.edu/RoadMap
    
    Of course, it can be clumsy to put parrot_run before every command you run, so try starting a shell with Parrot already loaded:
    % parrot_run tcsh
    
    Now, you should be able to run any standard command using Parrot filenames. Here are some examples to get you thinking:
    % cp /http/www.cse.nd.edu/~dthain/papers/parrot-agm2003.pdf .
    % grep Yahoo /http/www.yahoo.com
    % set autolist
    % cat /anonftp/ftp.cs.wisc.edu/[Press TAB here]
    

    Hint: You may find it useful to have some visual indication of when Parrot is active, so we recommend that you modify your shell startup scripts to change the prompt when Parrot is enabled. If you use tcsh, you might add something like this to your .cshrc:
            if ( $?PARROT_ENABLED ) then
                    set prompt = " (Parrot) %n@%m%~%# "
            else
                    set prompt = " %n@%m%~%# "
            endif
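
    If you use a Bourne-style shell instead of tcsh, the same idea can be sketched as follows. This is only an illustration: the function name parrot_prompt is ours, and the prompt escapes (\u, \h, \w, \$) are the bash equivalents of the tcsh sequences above, so the resulting PS1 is only meaningful to bash.

```shell
# Sketch for ~/.bashrc: pick a prompt depending on whether the
# PARROT_ENABLED variable is set, mirroring the tcsh example.
# parrot_prompt is a hypothetical helper name, not part of Parrot.
parrot_prompt() {
    if [ -n "${PARROT_ENABLED+set}" ]; then
        printf '%s' ' (Parrot) \u@\h\w\$ '
    else
        printf '%s' ' \u@\h\w\$ '
    fi
}

PS1=$(parrot_prompt)
```

    The test uses ${PARROT_ENABLED+set} so that it only asks whether the variable exists, just as $?PARROT_ENABLED does in tcsh.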
    

    We have limited the examples so far to HTTP and anonymous FTP, as they are the only services we know that absolutely everyone is familiar with. There are a number of other more powerful and secure remote services that you may be less familiar with. Parrot supports them in the same form: The filename begins with the service type, then the host name, then the file name. Here are all the currently supported services:


    example path                                  remote service                                        more info
    /http/www.somewhere.com/index.html            Hypertext Transfer Protocol                           included
    /grow/www.somewhere.com/index.html            GROW - Global Read-Only Web Filesystem                included
    /ftp/ftp.cs.wisc.edu/RoadMap                  File Transfer Protocol                                included
    /anonftp/ftp.cs.wisc.edu/RoadMap              Anonymous File Transfer Protocol                      included
    /chirp/target.cs.wisc.edu/path                Chirp Storage System                                  included + more info
    /gsiftp/ftp.globus.org/path                   Globus Security + File Transfer Protocol              more info
    /nest/nest.cs.wisc.edu/path                   Network Storage Technology                            more info
    /rfio/host.cern.ch/path                       Castor Remote File I/O                                more info
    /dcap/dcap.cs.wisc.edu/pnfs/cs.wisc.edu/path  DCache Access Protocol                                more info
    /lfn/logical/path                             Logical File Name - Grid File Access Library          more info
    /srm/server/path                              Site File Name - Grid File Access Library             more info
    /guid/abc123                                  Globally Unique File Name - Grid File Access Library  more info
    /gfal/protocol://host//path                   Grid File Access Library                              more info
    /irods/host:port/zone/home/user/path          iRODS                                                 more info
    /hdfs/namenode:port/path                      Hadoop Distributed File System (HDFS)                 more info
    /xrootd/host:port/path                        XRootD/Scalla Distributed Storage System (xrootd)     more info
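
    The naming pattern (service type, then host name, then file name) can be illustrated with a small POSIX sh helper that rewrites an ordinary URL into the corresponding Parrot path. This is a sketch only: url_to_parrot_path is a hypothetical name, not a tool shipped with Parrot, and it does nothing more than rearrange the text of the URL.

```shell
#!/bin/sh
# Hypothetical helper (not part of Parrot): rewrite scheme://host/path
# as /scheme/host/path, following the naming pattern described above.
url_to_parrot_path() {
    case "$1" in
        *://*)
            scheme=${1%%://*}   # text before the first "://", e.g. "http"
            rest=${1#*://}      # host and path after the first "://"
            printf '/%s/%s\n' "$scheme" "$rest"
            ;;
        *)
            # Not a URL: assume it is already a plain path and pass it through.
            printf '%s\n' "$1"
            ;;
    esac
}

url_to_parrot_path http://www.cse.nd.edu/~dthain/papers/parrot-agm2003.pdf
```

    Note that an ftp:// URL becomes a path under /ftp; for anonymous access you would use the /anonftp prefix instead.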

    You will notice quite quickly that not all remote I/O systems provide all of the functionality common to an ordinary file system. For example, HTTP is incapable of listing files. If you attempt to perform a directory listing on an HTTP server, Parrot will attempt to keep ls happy by producing a bogus directory entry:
        % parrot_run ls -la /http/www.yahoo.com/
        -r--r--r--    1 thain    thain           0 Jul 16 11:50 /http/www.yahoo.com
    
    A less-drastic example is found in FTP. If you attempt to perform a directory listing of an FTP server, Parrot fills in the available information -- the file names and their sizes -- but again inserts bogus information to fill the rest out:
        % parrot_run ls -la /anonftp/ftp.cs.wisc.edu
        total 0
        -rwxrwxrwx    1 thain    thain        2629 Jul 16 11:53 RoadMap
        -rwxrwxrwx    1 thain    thain     1622222 Jul 16 11:53 ls-lR
        -rwxrwxrwx    1 thain    thain      367507 Jul 16 11:53 ls-lR.Z
        -rwxrwxrwx    1 thain    thain      212125 Jul 16 11:53 ls-lR.gz
    
    If you would like to get a better idea of the underlying behavior of Parrot, try running it with the -d remote option, which will display all of the remote I/O operations that it performs on a program's behalf:
        % parrot_run -d remote ls -la /anonftp/ftp.cs.wisc.edu
        ...
        ftp.cs.wisc.edu <-- TYPE I
        ftp.cs.wisc.edu --> 200 Type set to I.
        ftp.cs.wisc.edu <-- PASV
        ftp.cs.wisc.edu --> 227 Entering Passive Mode (128,105,2,28,194,103)
        ftp.cs.wisc.edu <-- NLST /
        ftp.cs.wisc.edu --> 150 Opening BINARY mode data connection for file list.
        ...
    
    If your program is upset by the unusual semantics of such storage systems, then consider using the Chirp protocol and server, described next.

    The Chirp Protocol and Server

    Although Parrot works with many different protocols, it is limited by the capabilities provided by each underlying system. (For example, HTTP does not have reliable directory listings.) Thus, we have developed a custom protocol, Chirp, which provides secure remote file access with all of the capabilities needed for running arbitrary Unix programs. Chirp is included with the distribution of Parrot, and requires no extra steps to install.

    To start a Chirp server, simply do the following:

        % chirp_server -d all
    
    The -d all option turns on debugging, which helps you to understand how it works initially. You may remove this option once everything is working.

    Suppose the Chirp server is running on bird.cs.wisc.edu. Using Parrot, you may access all of the Unix features of that host from elsewhere:

        % parrot_run tcsh
        % cd /chirp/bird.cs.wisc.edu
        % ls -la
        % ...
    
    In general, Parrot gives better performance and usability with Chirp than with other protocols. You can read extensively about the Chirp server and protocol in the Chirp manual.

    In addition, Parrot provides several custom command line tools (parrot_getacl, parrot_setacl, parrot_lsalloc, and parrot_mkalloc) that can be used to manage the access control and space allocation features of Chirp from the Unix command line.

    Name Resolution

    In addition to accessing remote storage, Parrot allows you to create a custom namespace for any program. All file name activity passes through the Parrot name resolver, which can transform any given filename according to a series of rules that you specify.

    The simplest name resolver is the mountlist, given by the -m mountfile option. This file corresponds closely to /etc/fstab in Unix. A mountlist is simply a file with two columns. The first column gives a logical directory or file name, while the second gives the physical path that it must be connected to.

    For example, if a database is stored at an FTP server under the path /anonftp/ftp.cs.wisc.edu/db, it may be spliced into the filesystem under /dbase with a mount list like this:

         /dbase       /anonftp/ftp.cs.wisc.edu/db
    
    Instruct Parrot to use the mountlist as follows:
        % parrot_run -m mountfile tcsh
        % cd /dbase
        % ls -la
    
    A single mount entry may be given on the command line with the -M option as follows:
        % parrot_run -M /dbase=/anonftp/ftp.cs.wisc.edu/db tcsh
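
    To make the effect of a mountlist concrete, here is a small POSIX sh sketch that looks up a logical name in a two-column mountlist and prints the physical path it would be connected to. It is only a reading aid and not Parrot's own resolver: the function name resolve_mount, the file index.txt, and the first-match-wins rule are our assumptions.

```shell
#!/bin/sh
# resolve_mount MOUNTFILE NAME
# Illustration only: print the physical path that a two-column mountlist
# gives for NAME, or NAME unchanged when no entry matches.
# Assumption: the first matching entry wins.
resolve_mount() {
    mountfile=$1
    name=$2
    while read -r logical physical; do
        [ -n "$logical" ] || continue          # skip blank lines
        case "$name" in
            "$logical")
                # Exact match on the logical name itself.
                printf '%s\n' "$physical"
                return 0
                ;;
            "$logical"/*)
                # Name lies below the logical directory: keep the remainder.
                printf '%s%s\n' "$physical" "${name#"$logical"}"
                return 0
                ;;
        esac
    done < "$mountfile"
    printf '%s\n' "$name"
}

# Demonstration with the mountlist entry used above.
mountfile=$(mktemp)
printf '%s\n' '/dbase       /anonftp/ftp.cs.wisc.edu/db' > "$mountfile"
resolve_mount "$mountfile" /dbase/index.txt
rm -f "$mountfile"
```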
    
    A more sophisticated way to perform name binding is with an external resolver. This is a program executed whenever Parrot needs to locate a file or directory. The program accepts a logical file name and then returns the physical location where it can be found.

    Suppose that you have a database service that locates the nearest copy of a file for you. If you run the command locate_file, it will print out the nearest copy of a file. For example:

        % locate_file /1523.data
        /chirp/server.nd.edu/mix/1523.data
    
    To connect the program locate_file to Parrot, simply give a mount string that specifies the program as a resolver:
        % parrot_run -M /dbase=resolver:/path/to/locate_file tcsh
    
    Now, if you attempt to access files under /dbase, Parrot will execute locate_file and access the data stored there:
        % cat /dbase/1523.data
        (see contents of /chirp/server.nd.edu/mix/1523.data)
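
    A resolver such as locate_file can be an ordinary shell script. The following POSIX sh sketch assumes, based on the example above, that the logical name arrives as the first argument and that the physical location is printed on standard output. The fixed rule for .data files and the use of a non-zero exit status for unknown names are our own assumptions; a real resolver would consult your database service instead.

```shell
#!/bin/sh
# Sketch of an external resolver in the style of locate_file.
# Assumptions: the logical name is the first argument, and the physical
# location is written to standard output.
locate_file() {
    logical=$1
    case "$logical" in
        /*.data)
            # Hypothetical rule: every .data file lives on one Chirp server.
            printf '/chirp/server.nd.edu/mix%s\n' "$logical"
            ;;
        *)
            # Unknown name: signal failure with a non-zero status (assumed).
            return 1
            ;;
    esac
}

locate_file /1523.data
```

    To use the function as a standalone resolver script, replace the last line with locate_file "$@" so that the script resolves whatever name it is given.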
    

    More Efficient Copies with parrot_cp

    If you are using Parrot to copy lots of files across the network, you may see better performance using the parrot_cp tool. This program looks like an ordinary cp, but it makes use of an optimized Parrot system call that streams entire files over the network, instead of copying them block by block.

    To use parrot_cp, simply use your shell to alias calls to cp with calls to parrot_cp:

    % parrot_run tcsh
    % alias cp parrot_cp
    % cp /tmp/mydata /chirp/server.nd.edu/joe/data
    % cp -rR /chirp/server.nd.edu/joe /tmp/joe
     

    If run outside of Parrot, parrot_cp will operate as an ordinary cp without any performance gain or loss.

    Notes on Protocols

    HTTP Proxy Servers

    Both HTTP and GROW can take advantage of standard HTTP proxy servers. To route requests through a single proxy server, set the HTTP_PROXY environment variable to the server name and port:
        % setenv HTTP_PROXY "http://proxy.nd.edu:8080"
    
    Multiple proxy servers can be given, separated by a semicolon. This will cause Parrot to try each proxy in order until one succeeds. If DIRECT is given as the last name in the list, then Parrot will fall back on a direct connection to the target web server. For example:
        % setenv HTTP_PROXY "http://proxy.nd.edu:8080;http://proxy.wisc.edu:1000;DIRECT"
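
    If you keep your proxies in a script, the list can be assembled rather than typed by hand. The sketch below is written in POSIX sh (so it uses export rather than setenv); proxy_list is a hypothetical helper name, and it always appends DIRECT as the final fallback.

```shell
#!/bin/sh
# proxy_list PROXY...
# Join the given proxy URLs with semicolons and append DIRECT, producing
# a value in the format described above. Hypothetical helper name.
proxy_list() {
    list=
    for proxy in "$@"; do
        list="${list}${proxy};"
    done
    printf '%sDIRECT\n' "$list"
}

HTTP_PROXY=$(proxy_list http://proxy.nd.edu:8080 http://proxy.wisc.edu:1000)
export HTTP_PROXY
echo "$HTTP_PROXY"
```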
    

    GROW - Global Read Only Web Filesystem

    Although the strict HTTP protocol does not allow for correct structured directory listings, it is possible to emulate directory listings with a little help from the underlying filesystem. We call this technique GROW, a global filesystem based on the Web. GROW requires the exporter of data to run a script (make_growfs) that generates a complete directory listing of the data that you wish to export. This directory listing is then used to produce reliable metadata. Of course, if the data changes, the script must be run again, so GROW is only useful for data that changes infrequently.

    To set up a GROW filesystem, you must run make_growfs on the web server machine with the name of the local storage directory as the argument. For example, suppose that the web server my.server.com stores pages for the URL http://my.server.com/~fred in the local directory /home/fred/www. In this case, you should run the following command:

        % make_growfs /home/fred/www
    
    Now, others may perceive the web server as a file server under the /grow hierarchy. For example:
        % parrot_run tcsh
        % cd /grow/my.server.com/~fred
        % ls -la
    
    In addition to providing precise directory metadata, GROW offers two additional advantages over plain HTTP:
  • Aggressive Caching. GROW caches files in an on-disk cache, but unlike plain HTTP, does not need to issue up-to-date checks against the server. Using the cached directory metadata, it can tell if a file is up-to-date without any network communication. The directory is only checked for changes at the beginning of program execution, so changes become visible only to newly executed programs.
  • SHA-1 Integrity. make_growfs generates SHA-1 checksums on the directory and each file so that the integrity of the system can be verified at runtime. If a checksum fails, GROW will attempt to reload the file or directory listing in order to repair the error, trying until the master timeout (set by the -T option) expires. This will also occur if the underlying files have been modified and make_growfs has not yet been re-run. If necessary, checksums can be disabled by giving the -k option to either Parrot or make_growfs.

    Hadoop Distributed File System (HDFS)

    HDFS is the primary distributed filesystem used in the Hadoop project. Parrot supports read and write access to HDFS systems using the parrot_run_hdfs wrapper. This script checks that the appropriate environment variables are defined and then calls Parrot.

    In particular, you must ensure that you define the following environment variables:

    JAVA_HOME     Location of your Java installation.
    HADOOP_HOME   Location of your Hadoop installation.
    
    Based on these environment variables, parrot_run_hdfs will attempt to find the appropriate paths for libjvm.so and libhdfs.so. These paths are stored in the environment variables LIBJVM_PATH and LIBHDFS_PATH, which are used by the HDFS Parrot module to load the necessary shared libraries at run-time. To avoid the startup overhead of searching for these libraries, you may set the paths manually in your environment before calling parrot_run_hdfs, or you may edit the script directly.

    Note that while Parrot supports read access to HDFS, it only provides write-once support on HDFS. This is because the current implementations of HDFS do not provide reliable append operations. Likewise, files can only be opened in either read (O_RDONLY) or write mode (O_WRONLY), and not both (O_RDWR).
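
    Before invoking the wrapper, you may wish to confirm that the two required variables are present. The following POSIX sh sketch does only that; check_hdfs_env is a hypothetical name, and the sketch does not reproduce the library search that parrot_run_hdfs performs.

```shell
#!/bin/sh
# check_hdfs_env
# Succeed only if JAVA_HOME and HADOOP_HOME are both set to non-empty
# values; name each missing variable on standard error.
# Hypothetical helper, not part of the Parrot distribution.
check_hdfs_env() {
    status=0
    for var in JAVA_HOME HADOOP_HOME; do
        eval "value=\${$var:-}"      # read the variable named by $var
        if [ -z "$value" ]; then
            echo "missing environment variable: $var" >&2
            status=1
        fi
    done
    return "$status"
}

if check_hdfs_env; then
    echo "HDFS environment looks complete"
else
    echo "set the variables listed above before running parrot_run_hdfs"
fi
```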

    EGEE Data Access: GFAL, LFN, GUID, SRM, RFIO, DCAP, and LFC

    The EGEE project is a European-wide grid computing project. The EGEE software stack, GLite, can be downloaded here. (Warning: it's big.) Parrot can be connected to several different but related components of EGEE.

    The simplest way to use Parrot with EGEE is through the Grid File Access Library (GFAL). This library is attached to Parrot under the paths /lfn, /guid, /srm, /rfio, /dcap, and /gfal. This manual does not document all of the details of GFAL itself; you should read the GFAL manual for more information.

    For example, suppose that you have an RFIO server running on server.somewhere.edu. To access this server, try the following:

        % parrot_run tcsh
        % cd /rfio/server.somewhere.edu/
        % ls -la
    
    Or, to access a file by logical name or by GUID, note that you can either use a pathname-like syntax, or specify a URL with colons. Both ways are acceptable:
        % parrot_run tcsh
    
        % cat /lfn/baud/testgfal15 
        % cat lfn:baud/testgfal15 
    
        % cat /guid/2cd59291-7ae7-4778-af6d-b1f423719441
        % cat guid:2cd59291-7ae7-4778-af6d-b1f423719441
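
    The two spellings differ only in punctuation, as the following POSIX sh sketch shows. It rewrites the colon form into the pathname-like form; to_path_form is a hypothetical name, and the sketch says nothing about how Parrot itself processes these names.

```shell
#!/bin/sh
# to_path_form NAME
# Rewrite "lfn:x" or "guid:x" as "/lfn/x" or "/guid/x"; print any other
# name unchanged. Hypothetical helper for illustration only.
to_path_form() {
    case "$1" in
        lfn:*|guid:*)
            # ${1%%:*} is the prefix before the first colon,
            # ${1#*:} is everything after it.
            printf '/%s/%s\n' "${1%%:*}" "${1#*:}"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

to_path_form lfn:baud/testgfal15
to_path_form guid:2cd59291-7ae7-4778-af6d-b1f423719441
```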
    
    In these cases, Parrot pre-processes the paths so that they are formatted correctly for use by the GFAL. However, you can also compose paths yourself and pass them directly to GFAL by accessing files underneath /gfal. This syntax can be useful if you are attempting to harness some new GFAL capability that Parrot is not aware of. For example:
        % parrot_run tcsh
        % cd /gfal/rfio://server.somewhere.edu//
        % ls -la
    
    GFAL requires some setup to work properly. If you are having difficulties, make sure that you have set all of the GFAL environment variables appropriately:
        LCG_GFAL_INFOSYS
        LCG_CATALOG_TYPE
        LFC_HOST
        LCG_RFIO_TYPE
        LCG_GFAL_VO
    
    And, if using RFIO, make sure that the appropriate dynamic library libshift.so is in your library search path:
        export LD_LIBRARY_PATH=/opt/lhc/lib
    
    Use of the GFAL involves many complex interacting pieces of software. It is recommended that you run Parrot with the -d remote flag in order to debug problems using GFAL.

    EGEE Data Catalog: LFC

    In addition to using GFAL, Parrot can access an LHC file catalog (LFC) directly under the path /lfc. For example:
        % parrot_run cat /lfc/my/logical/name
    
    In this case, Parrot itself performs the lookup against the LFC, and uses its own access methods to retrieve the file directly. This can be useful when Parrot supports a protocol (such as GSIFTP) that is not directly supported by the GFAL library itself.

    Use of the LFC involves many complex interacting pieces of software. It is recommended that you run Parrot with the -d remote flag in order to debug problems using LFC.

    Identity Boxing

    Parrot provides a unique feature known as identity boxing. This feature allows you to run a (possibly) untrusted program within a protection domain, as if it were run in a completely separate account. Using an identity box, you do not need to become root or even to manage account names: you can create any identity that you like on the fly.

    For example, suppose that you wish to allow a friend to log into your private workstation. Instead of creating a new account, simply use a script supplied with Parrot to create an identity box:

    % whoami
    dthain
    % parrot_identity_box MyFriend
    % whoami
    MyFriend
    % touch ~dthain/private-data
    touch: creating `~dthain/private-data': Permission denied
    

    Note that the shell running within the identity box cannot change or modify any of the supervising user's data. In fact, the contained user can only access items that are world-readable or world-writable.

    You can give the contained user access to other parts of the filesystem by creating access control lists (ACLs). An ACL is a list of users and the resources that they are allowed to access. Each directory has its own ACL in the file .__acl. This file does not appear in a directory listing, but you can read and write it just the same.

    For example, MyFriend above can see his initial ACL as follows:

    % cat .__acl
    MyFriend rwlxa
    
    This means that MyFriend can read, write, list, execute, and administer items in the current directory. Now, suppose that MyFriend wants to allow Freddy read access to the same directory. Simply edit the ACL file to read:
    MyFriend rwlxa
    Freddy   rl
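
    The ACL format is simple enough to inspect with ordinary tools. The POSIX sh sketch below reads an ACL file in the two-column form shown above and reports whether a given user holds a given right. It is only a reading aid: acl_allows is a hypothetical name, enforcement is always done by Parrot itself, and any ACL syntax beyond a plain user name followed by rights letters is not handled here.

```shell
#!/bin/sh
# acl_allows ACLFILE USER RIGHT
# Succeed if USER's line in ACLFILE contains the single-letter RIGHT.
# Hypothetical helper: it only reads the file and enforces nothing.
acl_allows() {
    aclfile=$1
    user=$2
    right=$3
    while read -r subject rights; do
        if [ "$subject" = "$user" ]; then
            case "$rights" in
                *"$right"*) return 0 ;;
                *) return 1 ;;
            esac
        fi
    done < "$aclfile"
    return 1          # user not listed at all
}

# Demonstration with the ACL shown above, written to a temporary file.
aclfile=$(mktemp)
printf '%s\n' 'MyFriend rwlxa' 'Freddy   rl' > "$aclfile"
if acl_allows "$aclfile" Freddy w; then
    echo "Freddy may write"
else
    echo "Freddy may not write"
fi
rm -f "$aclfile"
```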
    
    Identity boxing and ACLs are particularly useful when using distributed storage. You can read more about ACLs and identity boxing in the Chirp manual.

    64-Bit Support

    In all modes, Parrot supports applications that access large (>2GB) files that require 64-bit seek pointers. However, we have found that many tools and filesystems do not manipulate such large files properly. If possible, we advise users to break up files into smaller pieces for processing.

    Parrot supports 64 bit programs and processors in the following combinations:

    Program Type
    32-bit   64-bit   CPU Type
    YES      NO       Parrot for 32-bit X86 CPU (Pentium, Xeon, Athlon, Sempron)
    YES      YES      Parrot for 64-bit X86_64 CPU (Opteron, Athlon64, Turion64, Sempron64)

    Options and Environment

    Parrot has several command line options and corresponding environment variables.

    Option                Purpose                                           Environment Variable
    -a <list>             Use these Chirp authentication methods.           PARROT_CHIRP_AUTH
    -b <bytes>            Set the recommended remote I/O block size.        PARROT_LOCAL_BLOCK_SIZE
    -B <bytes>            Set the recommended local I/O block size.         PARROT_REMOTE_BLOCK_SIZE
    -C <MB>               Set the size of the I/O channel.                  PARROT_CHANNEL_SIZE
    -d <system>           Enable debugging for this sub-system.             PARROT_DEBUG_FLAGS
    -h                    Show this screen.
    -m <file>             Use this file as a mountlist.                     PARROT_MOUNT_FILE
    -M <local>=<remote>   Mount this remote file on this local directory.
    -o <file>             Send debugging messages to this file.             PARROT_DEBUG_FILE
    -p <host:port>        Use this proxy for HTTP requests.                 HTTP_PROXY
    -t <dir>              Where to store temporary files.                   PARROT_TEMP_DIR
    -v                    Display version number.

    This list is probably out of date, so you should run parrot_run -h to see the most up-to-date list.

    The flexible debugging flags can be a great help in both debugging and understanding Parrot. To turn on multiple debugging flags, you may either issue multiple -d options:

        % parrot_run -d ftp -d chirp tcsh
    
    Or, you may give a space-separated list in the corresponding environment variable:
        % setenv PARROT_DEBUG_FLAGS "ftp chirp"
        % parrot_run tcsh
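
    When the set of flags is chosen by a script, the repeated -d options can be generated from a list. A minimal POSIX sh sketch follows; debug_opts is a hypothetical helper name.

```shell
#!/bin/sh
# debug_opts FLAG...
# Print one "-d FLAG" pair per argument, separated by single spaces.
# Hypothetical helper, not part of the Parrot distribution.
debug_opts() {
    opts=
    for flag in "$@"; do
        opts="${opts:+$opts }-d $flag"
    done
    printf '%s\n' "$opts"
}

debug_opts ftp chirp
```

    In a script, the result would be used unquoted so that the shell splits it into separate arguments, for example: parrot_run $(debug_opts ftp chirp) tcsh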
    
    Here is the meaning of each of the debug flags.

    syscall   This shows all of the system calls attempted by each program, even those that Parrot does not trap or modify. (To see arguments and return values, try -d libcall instead.)
    libcall   This shows only the I/O calls that are actually trapped and implemented by Parrot. The arguments and return codes are the logical values seen by the application, not the underlying operations. (To see the underlying operations try -d remote or -d local instead.)
    cache     This shows all of the shared segments that are loaded into the channel cache and shared by multiple programs. For most programs, this means all the shared libraries.
    process   This shows all process creations, deletions, signals, and process state changes.
    resolve   This shows every invocation of the name resolver. A plain file name indicates the name was not modified, while more detailed records show names that were changed or denied access.
    local     This shows all local I/O calls from the perspective of Parrot. Notice that the file descriptors and file names shown are internal to Parrot. (To see fds and names from the perspective of the job, try -d libcall.)
    remote    This shows all non-local file activity.
    http      This shows only HTTP operations.
    ftp       This shows only FTP operations.
    nest      This shows only NeST operations.
    chirp     This shows only Chirp operations.
    rfio      This shows only RFIO operations.
    gfal      This shows only GFAL operations.
    lfn       This shows only LFC LFN operations.
    hdfs      This shows only HDFS operations.
    poll      This shows all activity related to processes that block (explicitly or implicitly) waiting for I/O.
    time      This adds the current time to every debug message.
    pid       This adds the calling process id to every debug message.
    all       This shows all possible debugging messages.