3
# dbfetch style caching proxy for GenBank
6
use HTTP::Request::Common;
10
use vars qw(%GOT $BUFFER %MAPPING $CACHE);
12
use constant CACHE_LOCATION => '/usr/tmp/dbfetch_cache';
13
use constant MAX_SIZE => 100_000_000; # 100 megs, roughly
14
use constant CACHE_DEPTH => 4;
15
use constant EXPIRATION => "1 week";
16
use constant PURGE => "1 hour";
18
%MAPPING = (genbank => {db=>'nucleotide',
20
genpep => {db=>'protein',
22
# we're doing everything in callbacks, so initialize globals.
26
print header('text/plain');
28
param() or print_usage();
31
my $style = param('style');
32
my $format = param('format');
34
my @ids = split /\s+/,$id;
36
$format = 'genbank' if $format eq 'default'; #h'mmmph
38
$MAPPING{$db} or error(1=>"Unknown database [$db]");
39
$style eq 'raw' or error(2=>"Unknown style [$style]");
40
$format eq 'genbank' or error(3=>"Format [$format] not known for database [$db]");
42
$CACHE = Cache::FileCache->new({cache_root => CACHE_LOCATION,
43
default_expires_in => EXPIRATION,
44
cache_DEPTH => CACHE_DEPTH,
45
namespace => 'dbfetch',
46
auto_purge_interval => PURGE});
48
# handle cached entries
50
if (my $obj = $CACHE->get($_)) {
56
# handle the remainder
57
@ids = grep {!$GOT{$_}} @ids;
59
my $request = POST('http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi',
60
[rettype => $MAPPING{$db}{rettype},
61
db => $MAPPING{$db}{db},
69
my $ua = LWP::UserAgent->new;
70
my $response = $ua->request($request,\&callback);
72
if ($response->is_error) {
73
my $status = $response->status_line;
74
error(6 => "HTTP error from GenBank [$status]");
78
my @missing_ids = grep {!$GOT{$_}} @ids;
79
foreach (@missing_ids) {
80
error(4=>"ID [$_] not found in database [$db]",1);
83
# my $response = $response->content;
88
my ($locus) = $record =~ /^LOCUS\s+(\S+)/m;
89
my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
90
my ($version,$gi) = $record =~ /^VERSION\s+(\S+)\s+GI:(\d+)/m;
91
foreach ($locus,$accession,$version,$gi) {
93
$CACHE->set($_,$record);
101
while (($index = index($BUFFER,"//\n\n",$index))>=0) {
102
my $record = substr($BUFFER,0,$index);
103
$index += length("//\n\n");
104
substr($BUFFER,0,$index) = '';
105
process_record($record);
113
This script is intended to be used non-interactively.
115
Brief summary of arguments:
118
This interface does not specify what happens when biofetch is called
119
in interactive context. The implementations can return the entries
120
decorated with HTML tags and hypertext links.
122
A URL for biofetch consists of four sections:
126
2. host www.ebi.ac.uk
127
3. path to program /Tools/dbfetch/dbfetch
128
4. query string ?style=raw;format=embl;db=embl;id=J00231
133
The query string options are separated from the base URL (protocol +
134
host + path) by a question mark (?) and from each other by a semicolon
135
';' (or by ampersand '&'). See CGI GET documents at
136
http://www.w3.org/CGI/). The order of options is not critical. It is
137
recommended to leave the ID to be the last item.
139
Input for options should be case insensitive.
145
Descr : database name
147
Usage : db=genpep | db=genbank
150
Currently this server accepts "genbank" and "genpep"
155
Descr : +/- HTML tags
157
Usage : style=raw | db=html
158
Arg : enum (raw|html)
160
In non-interactive context, always give "style=raw". This uses
161
"Content-Type: text/plain". If other content types are needed (XML),
162
this part of the spesifications can be extended to accommodate them.
164
This server only accepts "raw".
170
Descr : format of the database entries returned
172
Usage : format=genbank
175
Format defaults to the distribution format of the database (embl for
176
EMBL database). If some other supported format is needed this option
177
is needed (E.g. formats for EMBL: fasta, bsml, agave).
179
This server only accepts "genbank" format.
184
Descr : unique database identifier(s)
186
Usage : db=J00231 | id=J00231+BUM
189
The ID option should be able to process all UIDS in a database. It
190
should not be necessary to know if the UID is an ID, accession number
191
or accession.version.
193
The number of entry UIDs allowed is implementation specific. If the
194
limit is exceeded, the the program reports an error. The UIDs should
195
be separated by spaces (use '+' in a GET method string).
200
The following standardized one line messages should be printed out in
203
ERROR 1 Unknown database [$db].
204
ERROR 2 Unknown style [$style].
205
ERROR 3 Format [$format] not known for database [$db].
206
ERROR 4 ID [$id] not found in database [$db].
207
ERROR 5 Too many IDs [$count]. Max [$MAXIDS] allowed.
216
my ($code,$message,$noexit) = @_;
217
print "ERROR $code $message\n";
218
exit 0 unless $noexit;
225
biofetch_genbank_proxy.pl - Caching BioFetch-compatible web proxy for GenBank
229
Install in cgi-bin directory of a Web server. Stand back.
233
This CGI script acts as the server side of the BioFetch protocol as
234
described in http://obda.open-bio.org/Specs/. It provides two
235
database access services, one for data source "genbank" (nucleotide
236
entries) and the other for data source "genpep" (protein entries).
238
This script works by forwarding its requests to NCBI's eutils script,
239
which lives at http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi.
240
It then reformats the output according to the BioFetch format so the
241
sequences can be processed and returned by the Bio::DB::BioFetch
242
module. Returned entries are temporarily cached on the Web server's
243
file system, allowing frequently-accessed entries to be retrieved
244
without another round trip to NCBI.
248
You must have the following installed in order to run this script:
251
2) the perl modules LWP and Cache::FileCache
252
3) a web server (Apache recommended)
254
To install this script, copy it into the web server's cgi-bin
255
directory. You might want to shorten its name; "dbfetch" is
258
There are several constants located at the top of the script that you
259
may want to adjust. These are:
263
This is the location on the filesystem where the cached files will be
264
located. The default is /usr/tmp/dbfetch_cache.
268
This is the maximum size that the cache can grow to. When the cache
269
exceeds this size older entries will be deleted automatically. The
270
default setting is 100,000,000 bytes (100 MB).
274
Entries that haven't been accessed in this length of time will be
275
removed from the cache. The default is 1 week.
279
This constant specifies how often the cache will be purged for older
280
entries. The default is 1 hour.
284
To see if this script is performing as expected, you may test it with
287
use Bio::DB::BioFetch;
288
my $db = Bio::DB::BioFetch->new(-baseaddress=>'http://localhost/cgi-bin/dbfetch',
291
my $seq = $db->get_Seq_by_id('DDU63596');
292
print $seq->seq,"\n";
294
This should print out a DNA sequence.
298
L<Bio::DB::BioFetch>, L<Bio::DB::Registry>
302
Lincoln Stein, E<lt>lstein-at-cshl.orgE<gt>
304
Copyright (c) 2003 Cold Spring Harbor Laboratory
306
This library is free software; you can redistribute it and/or modify
307
it under the same terms as Perl itself. See DISCLAIMER.txt for
308
disclaimers of warranty.