# $Id: EUtilities.pm,v 1.24.4.3 2006/11/23 12:36:14 sendu Exp $
#
# BioPerl module for Bio::DB::EUtilities
#
# Cared for by Chris Fields <cjfields at uiuc dot edu>
#
# Copyright Chris Fields
#
# You may distribute this module under the same terms as perl itself
#
# POD documentation - main docs before the code
#
# Interfaces with new GenericWebDBI interface
Bio::DB::EUtilities - interface for handling web queries and data
retrieval from Entrez Utilities at NCBI.
  use Bio::DB::EUtilities;

  my $esearch = Bio::DB::EUtilities->new(-eutil => 'esearch',

  $esearch->get_response; # parse the response, fetch a cookie

  my $elink = Bio::DB::EUtilities->new(-eutil => 'elink',
                                       -cookie => $esearch->next_cookie,
                                       -cmd => 'neighbor_history');

  $elink->get_response; # parse the response, fetch the next cookie

  my $efetch = Bio::DB::EUtilities->new(-cookie => $elink->next_cookie,

  print $efetch->get_response->content;
WARNING: Please do B<NOT> spam the Entrez web server with multiple requests.
NCBI offers Batch Entrez for this purpose, now accessible here via epost!

This is a test interface to the Entrez Utilities at NCBI. Its main purpose
is to enable access to all of the NCBI databases available through Entrez and
to allow for more complex queries. It is likely that the API for this module,
as well as the documentation, will change dramatically over time. So, novice users
The experimental base class is L<Bio::DB::GenericWebDBI|Bio::DB::GenericWebDBI>,
which, as the name implies, enables access to any web database that accepts
parameters. This was originally born from an idea to replace
WebDBSeqI/NCBIHelper with a more general web database accession tool so one
could access sequence information, taxonomy, SNP, PubMed, and so on.
However, this may ultimately prove to be better used as a replacement for
L<LWP::UserAgent|LWP::UserAgent> when accessing NCBI-related web tools
(Entrez Utilities, or EUtilities). Using the base class GenericWebDBI,
one could also build web interfaces to other databases to access anything
Currently, you can access any database available through the NCBI interface:

  http://eutils.ncbi.nlm.nih.gov/

At this point, Bio::DB::EUtilities uses the EUtilities plugin modules somewhat
like Bio::SeqIO. So, one would call the particular EUtility (epost, efetch,
and so forth) upon instantiating the object using a set of parameters:

  my $esearch = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                         -term => 'dihydroorotase',

The default EUtility (when C<eutil> is left out) is 'efetch'. For specifics on
each EUtility, see its respective POD (**these are incomplete**) or
the NCBI Entrez Utilities page:

  http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
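
The plugin selection described above (pick a class from the C<eutil>
parameter, fall back to 'efetch') can be pictured with a small stand-alone
sketch. This is only an illustration under stated assumptions: the
C<Toy::Util> packages are hypothetical and do not load or call any BioPerl
code; only the C<-eutil> parameter name and the 'efetch' default are taken
from this module.

```perl
package Toy::Util;   # hypothetical factory class, not part of BioPerl
use strict;
use warnings;

sub new {
    my ($class, %param) = @_;
    # Called on a plugin class: just build the object.
    return bless { %param }, $class if $class ne __PACKAGE__;
    # Called on the factory: lowercase the keys, then pick the plugin,
    # defaulting to 'efetch' when -eutil is left out.
    my %lc = map { lc($_) => $param{$_} } keys %param;
    my $util   = $lc{'-eutil'} || 'efetch';
    my $plugin = "Toy::Util::$util";
    die "No such plugin: $util\n" unless $plugin->can('new');
    return $plugin->new(%param);
}

package Toy::Util::efetch;    # hypothetical plugin
our @ISA = ('Toy::Util');

package Toy::Util::esearch;   # hypothetical plugin
our @ISA = ('Toy::Util');

package main;

my $search  = Toy::Util->new(-eutil => 'esearch', -term => 'dihydroorotase');
my $default = Toy::Util->new();
print ref($search),  "\n";    # class chosen from -eutil
print ref($default), "\n";    # class chosen by the default
```

The caller always talks to the factory class and receives an object of the
plugin class, which is the same shape of interface that Bio::SeqIO users
will recognise.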
At this time, retrieving the response is accomplished by using the method
get_response (which also parses for cookies and other information, see below).
This method returns an HTTP::Response object. The raw data is accessed by using
the object method C<content>, like so:

  my $efetch = Bio::DB::EUtilities->new(-cookie => $elink->next_cookie,

  print $efetch->get_response->content;

Based on this, if one wanted to retrieve sequences or other raw data
but was not interested in directly using Bio* objects (such as if
genome sequences were to be retrieved), one could do so by using the
proper EUtility object(s) and query(ies) and get the raw response back
from NCBI through 'efetch'.
A great deal of the documentation here will likely end up in the form
of a HOWTO at some future point, focusing on getting data into Bioperl
Some EUtilities (C<epost>, C<esearch>, or C<elink>) retain information on
the NCBI server under certain settings. This information can be retrieved by
using a B<cookie>. Here, the idea of the 'cookie' is similar to the
'cookie' set on your computer when browsing the Web. XML data returned
by these EUtilities, when applicable, is parsed for the cookie information
(the 'WebEnv' and 'query_key' tags, to be specific). This information, along
with other identifying data (such as the calling eutility, description
of query, etc.), is stored as a
L<Bio::DB::EUtilities::Cookie|Bio::DB::EUtilities::Cookie> object
in an internal queue. These can be retrieved one at a time by using
the next_cookie method or all at once in an array using get_all_cookies.
Each cookie can then be 'fed', one at a time, to another EUtility object,
thus enabling chained queries as demonstrated in the synopsis.

For more information, see the POD documentation for
L<Bio::DB::EUtilities::Cookie|Bio::DB::EUtilities::Cookie>.
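
The queue behaviour described above can be pictured with a small stand-alone
sketch. It does not use Bio::DB::EUtilities: the package C<Toy::CookieJar>
and the hash-based stand-in cookies (with made-up C<webenv> values) are
hypothetical, and only the method names mirror the ones documented in this
module.

```perl
package Toy::CookieJar;   # hypothetical stand-in, not part of BioPerl
use strict;
use warnings;

sub new { return bless { _cookie => [], _cookieindex => 0 }, shift }

# Append a cookie (here just a hash ref) to the end of the queue.
sub add_cookie {
    my ($self, $cookie) = @_;
    push @{ $self->{_cookie} }, $cookie;
    return scalar @{ $self->{_cookie} };
}

# Return the cookie at the current index and advance the index;
# returns undef once the queue is exhausted.
sub next_cookie {
    my $self = shift;
    return $self->{_cookie}[ $self->{_cookieindex}++ ];
}

# Return every cookie without disturbing the queue.
sub get_all_cookies { return @{ $_[0]{_cookie} } }

# Start over from the first cookie.
sub rewind_cookies { $_[0]{_cookieindex} = 0; return }

package main;

my $jar = Toy::CookieJar->new;
$jar->add_cookie({ webenv => 'WEBENV_A', query_key => 1 });   # made-up values
$jar->add_cookie({ webenv => 'WEBENV_B', query_key => 2 });

# Feed cookies one at a time, as a chained query would.
while (my $cookie = $jar->next_cookie) {
    print "$cookie->{webenv}\t$cookie->{query_key}\n";
}

$jar->rewind_cookies;              # the queue itself is still intact
my $first = $jar->next_cookie;
print "after rewind: $first->{webenv}\n";
```

Keeping a read index separate from the stored list is what allows the
cookies to be replayed after a rewind instead of being consumed.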
Resetting internal parameters is planned so one could feasibly reuse
the objects once instantiated, such as if one were to use this as a
replacement for LWP::UserAgent when retrieving responses, i.e., when
using many of the Bio::DB* NCBI-related modules.

File and filehandle support to be added.

Switch over XML parsing in most EUtilities to XML::SAX (currently

Any feedback is welcome.
User feedback is an integral part of the
evolution of this and other Bioperl modules. Send
your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation

  bioperl-l@lists.open-bio.org               - General discussion
  http://www.bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Reporting Bugs

Report bugs to the Bioperl bug tracking system to
help us keep track of the bugs and their resolution.
Bug reports can be submitted via the web.

  http://bugzilla.open-bio.org/

Email cjfields at uiuc dot edu

The rest of the documentation details each of the
object methods. Internal methods are usually
# Let the code begin...

package Bio::DB::EUtilities;

use vars qw($HOSTBASE %CGILOCATION $MAX_ENTRIES %DATABASE @PARAMS
            $DEFAULT_TOOL @COOKIE_PARAMS @METHODS);

use base qw(Bio::DB::GenericWebDBI);

our $DEFAULT_TOOL = 'bioperl';

our $HOSTBASE = 'http://eutils.ncbi.nlm.nih.gov';
# map eutility to location
    'einfo'    => ['get'  => '/entrez/eutils/einfo.fcgi',    'xml'],
    'epost'    => ['post' => '/entrez/eutils/epost.fcgi',    'xml'],
    'efetch'   => ['get'  => '/entrez/eutils/efetch.fcgi',   'dbspec'],
    'esearch'  => ['get'  => '/entrez/eutils/esearch.fcgi',  'xml'],
    'esummary' => ['get'  => '/entrez/eutils/esummary.fcgi', 'xml'],
    'elink'    => ['get'  => '/entrez/eutils/elink.fcgi',    'xml'],
    'egquery'  => ['get'  => '/entrez/eutils/egquery.fcgi',  'xml']
# map database to return mode
our %DATABASE = ('pubmed'            => 'xml',
                 'nucleotide'        => 'text',
                 'structure'         => 'text',
                 'cancerchromosomes' => 'xml',
                 'genomeprj'         => 'xml',
                 'homologene'        => 'xml',
                 'journals'          => 'text',
                 'ncbisearch'        => 'xml',
                 'nlmcatalog'        => 'xml',
                 'pccompound'        => 'xml',
                 'pcsubstance'       => 'xml',
our @PARAMS = qw(rettype usehistory term field tool reldate mindate
    maxdate datetype retstart retmax sort seq_start seq_stop strand
    complexity report dbfrom cmd holding version linkname retmode);
our @COOKIE_PARAMS = qw(db sort seq_start seq_stop strand complexity rettype
    retstart retmax cmd linkname retmode);
our @METHODS = qw(rettype usehistory term field tool reldate mindate
    maxdate datetype retstart retmax sort seq_start seq_stop strand
    complexity report dbfrom cmd holding version linkname);
for my $method (@METHODS) {
        return \$self->{'_$method'} = shift if \@_;
        return \$self->{'_$method'};
    my($class,@args) = @_;
    if( $class =~ /Bio::DB::EUtilities::(\S+)/ ) {
        my ($self) = $class->SUPER::new(@args);
        $self->_initialize(@args);
        @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
        my $eutil = $param{'-eutil'} || 'efetch';
        return unless ($class->_load_eutil_module($eutil));
        return "Bio::DB::EUtilities::$eutil"->new(@args);
    my ($self, @args) = @_;
    my ( $tool, $ids, $retmode, $verbose, $cookie, $keep_cookies) =
      $self->_rearrange([qw(TOOL ID RETMODE VERBOSE COOKIE KEEP_COOKIES)], @args);
    # hard code the base address
    $self->url_base_address($HOSTBASE);
    $tool ||= $DEFAULT_TOOL;
    $ids            && $self->id($ids);
    $verbose        && $self->verbose($verbose);
    $retmode        && $self->retmode($retmode);
    $keep_cookies   && $self->keep_cookies($keep_cookies);
    if ($cookie && ref($cookie) =~ m{cookie}i) {
        $self->db($cookie->database) if !($self->db);
        $self->add_cookie($cookie);
    $self->{'_cookieindex'} = 0;
    $self->{'_cookiecount'} = 0;
    $self->{'_authentication'} = [];
 Usage   : $db->add_cookie($cookie)
 Function: adds an NCBI query cookie to the internal cookie queue
 Args    : a Bio::DB::EUtilities::Cookie object

    $self->throw("Expecting a Bio::DB::EUtilities::Cookie, got $cookie.")
      unless $cookie->isa("Bio::DB::EUtilities::Cookie");
    push @{$self->{'_cookie'}}, $cookie;
    $self->{'_cookiecount'}++;
 Usage   : $cookie = $db->next_cookie
 Function: return a cookie from the internal cookie queue
 Returns : a Bio::DB::EUtilities::Cookie object

    my $index = $self->_next_cookie_index;
    if ($self->{'_cookie'}) {
        return $self->{'_cookie'}->[$index];
        $self->warn("No cookies left in the jar!");
 Title   : reset_cookies
 Usage   : $db->reset_cookies
 Function: resets (empties) the internal cookie queue

    $self->{'_cookie'} = [];
    $self->{'_cookieindex'} = 0;
    $self->{'_cookiecount'} = 0;
=head2 get_all_cookies

 Title   : get_all_cookies
 Usage   : @cookies = $db->get_all_cookies
 Function: retrieves all cookies from the internal cookie queue; this leaves
           the cookies in the queue intact
 Returns : array of cookies (in list context) or the first cookie
           (in scalar context)

sub get_all_cookies {
    return @{ $self->{'_cookie'} } if $self->{'_cookie'} && wantarray;
    return $self->{'_cookie'}->[0] if $self->{'_cookie'};
=head2 get_cookie_count

 Title   : get_cookie_count
 Usage   : $ct = $db->get_cookie_count
 Function: returns the number of cookies in the internal queue

sub get_cookie_count {
    return $self->{'_cookiecount'};
=head2 rewind_cookies

 Title   : rewind_cookies
 Usage   : $elink->rewind_cookies;
 Function: resets cookie index to 0 (starts over)

    $self->{'_cookieindex'} = 0;
 Usage   : $db->keep_cookies(1)
 Function: Flag to retain the internal cookie queue;
           this is normally emptied upon using get_response
 Args    : Boolean - value that evaluates to TRUE or FALSE

    return $self->{'_keep_cookies'} = shift if @_;
    return $self->{'_keep_cookies'};
=head2 parse_response

 Title   : parse_response
 Usage   : $db->parse_response($content)
 Function: parse out response for cookies and other goodies
 Throws  : Not implemented (implemented in plugin classes)

    $self->throw_not_implemented;
 Usage   : $db->get_response($content)
 Function: main method to submit a request and retrieve a response
 Returns : HTTP::Response object

    $self->_sleep; # institute delay policy
    my $request = $self->_submit_request;
    if ($self->authentication) {
        $request->proxy_authorization_basic($self->authentication)
    if (!$request->is_success) {
        $self->throw(ref($self)." Request Error:".$request->as_string);
    $self->reset_cookies if !($self->keep_cookies);
    $self->parse_response($request);  # grab cookies and what not
# not implemented yet
#=head2 reset_parameters
# Title   : reset_parameters
# Usage   : $db->reset_parameters(@args);
# Function: resets the parameters for a EUtility with args (in @args)
# Args    : array of arguments (arg1 => value, arg2 => value)
#sub reset_parameters {
#    $self->reset_cookies; # no baggage allowed
#    if ($self->can('next_linkset')) {
#        $self->reset_linksets;
#    # resetting the EUtility will not occur even if added as a parameter;
#    $self->_initialize(@args);
 Usage   : $count = $elink->get_ids($db); # array ref of specific db ids
           @ids   = $esearch->get_ids();  # array
           $ids   = $esearch->get_ids();  # array ref
 Function: returns an array or array ref of unique IDs.
 Returns : array or array ref of ids
 Args    : Optional : database string if elink used (required arg if searching
           multiple databases for related IDs)
           Currently implemented only for elink object with single linksets
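
The context-dependent return described above follows a common Perl idiom
built on C<wantarray>. A minimal sketch, assuming a hypothetical function
C<toy_get_ids> and made-up IDs; it illustrates the idiom only and is not
this module's get_ids:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list in list context and an
# array reference in scalar context.
sub toy_get_ids {
    my @ids = (1001, 1002, 1003);   # made-up IDs for illustration
    return wantarray ? @ids : \@ids;
}

my @ids = toy_get_ids();            # list context: a list of IDs
my $ids = toy_get_ids();            # scalar context: an array reference

printf "list context: %d ids\n", scalar @ids;
printf "scalar context: %s reference with %d ids\n", ref $ids, scalar @$ids;
```

The caller's assignment target decides which form comes back, so the same
call can serve both C<@ids = ...> and C<$ids = ...> as in the Usage lines.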
    my $user_db = shift;   # optional; avoids the 'my ... if' construct
    if ($self->can('get_all_linksets')) {
        my $querydb = $self->db;
        if (!$user_db && ($querydb eq 'all' || $querydb =~ m{,}) ) {
            $self->throw(q(Multiple databases searched; must use a specific ).
                         q(database as an argument.) );
        my $count = $self->get_linkset_count;
            $self->throw( q(No linksets!) );
        elsif ($count == 1) {
            my ($linkset) = $self->get_all_linksets;
            my ($db) = $user_db ? $user_db : $linkset->get_all_linkdbs;
            $self->_add_db_ids( scalar( $linkset->get_LinkIds_by_db($db) ) );
            $self->throw( q(Multiple linkset objects present; can't use get_ids.).
                qq(\nUse get_all_linksets/get_databases/get_LinkIds_by_db ).
                qq(\n$count total linksets ));
    if ($self->{'_db_ids'}) {
        return @{$self->{'_db_ids'}} if wantarray;
        return $self->{'_db_ids'};
# carried over from NCBIHelper/WebDBSeqI

 Usage   : $secs = $self->delay_policy
 Function: return number of seconds to delay between calls to remote db
 Returns : number of seconds to delay

NOTE: NCBI requests a delay of 3 seconds between requests. This method
implements that policy.
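
One way to honour such a delay is to remember when the last request was sent
and to sleep for the remainder of the interval before the next one. The
following is a sketch under stated assumptions: the package C<Toy::Throttle>
and its C<wait_turn> method are hypothetical and are not the delay code of
this module, and the delay is a constructor argument (defaulting to 3
seconds) so that the example can be run with a short interval.

```perl
package Toy::Throttle;   # hypothetical; not part of BioPerl
use strict;
use warnings;
use Time::HiRes ();      # core module; sub-second time() and sleep()

sub new {
    my ($class, %args) = @_;
    my $delay = defined $args{delay} ? $args{delay} : 3;   # seconds
    return bless { delay => $delay, last => undef }, $class;
}

# Block until at least `delay` seconds have passed since the previous call.
sub wait_turn {
    my $self = shift;
    if (defined $self->{last}) {
        my $remaining = $self->{delay} - (Time::HiRes::time() - $self->{last});
        Time::HiRes::sleep($remaining) if $remaining > 0;
    }
    $self->{last} = Time::HiRes::time();
    return;
}

package main;

my $throttle = Toy::Throttle->new(delay => 0.2);   # short delay for the demo
my $start    = Time::HiRes::time();
for my $n (1 .. 3) {
    $throttle->wait_turn;    # the first call returns immediately
    # ... a request would be submitted here ...
}
my $elapsed = Time::HiRes::time() - $start;
printf "3 calls took %.2f s\n", $elapsed;
```

Sleeping only for the remaining part of the interval means that time already
spent processing the previous response counts towards the delay.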
 Title   : get_entrezdbs
 Usage   : @dbs = $self->get_entrezdbs;
 Function: return list of all Entrez databases; convenience method
 Returns : array or array ref (based on wantarray) of databases

    my $info = Bio::DB::EUtilities->new(-eutil => 'einfo');
    # copy list, not ref of list (so einfo obj doesn't stick around)
    my @databases = $info->einfo_dbs;
=head1 Private methods

# Title   : _add_db_ids
# Usage   : $self->_add_db_ids($ids);
# Function: stores a copy of the given array of IDs internally
# Args    : ref to array of IDs
# used by esearch and elink, hence here

    my ($self, $ids) = @_;
    $self->throw ("IDs must be an ARRAY reference") unless ref($ids) =~ m{ARRAY}i;
    my @ids = @{ $ids}; # copy the list so the caller's array is not shared
    $self->{'_db_ids'} = \@ids;

    return $self->{'_eutil'} = shift if @_;
    return $self->{'_eutil'};
#Title   : _submit_request
#Usage   : my $url = $self->_submit_request
#Function: builds request object based on set parameters
#Returns : HTTP::Request
# as the name implies....

sub _submit_request {
    my %params = $self->_get_params;
    my $eutil = $self->_eutil;

    # this is in case multiple id groups are present
    if ($self->can('multi_id') && $self->multi_id) {
        # multiple id groups if groups are together in an array reference
        # ids and arrays are flattened into individual groups
        for my $id_group (@{ $self->id }) {
            if (ref($id_group) eq 'ARRAY') {
                push @{ $params{'id'} }, (join q(,), @{ $id_group });
            elsif (!ref($id_group)) {
                push @{ $params{'id'} }, $id_group;
                $self->throw("Unknown ID type: $id_group");
        my @ids = @{ $self->id };
        $params{'id'} = join ',', @ids;
    my $url = URI->new($HOSTBASE . $CGILOCATION{$eutil}[1]);
    $url->query_form(%params);
    $self->debug("The web address:\n".$url->as_string."\n");
    if ($CGILOCATION{$eutil}[0] eq 'post') {    # epost request
        return $self->post($url);
    } else {                                    # all other requests
        return $self->get($url);
# Title   : _get_params
# Usage   : my %params = $self->_get_params
# Function: builds parameter list for web request
# Returns : hash of parameter-value pairs
# these get sorted out in a hash originally but end up in an array to
# deal with multiple id parameters (hash values would kill that)

    my $cookie = $self->get_all_cookies ? $self->get_all_cookies : 0;
    my @final;  # final parameter list; this changes dep. on presence of cookie
    my $eutil = $self->_eutil;

    @final = ($cookie && $cookie->isa("Bio::DB::EUtilities::Cookie")) ?
             @COOKIE_PARAMS : @PARAMS;

    # build parameter hash based on final parameter list
    for my $method (@final) {
        if ($self->$method) {
            $params{$method} = $self->$method;
        my ($webenv, $qkey) = @{$cookie->cookie};
        $self->debug("WebEnv:$webenv\tQKey:$qkey\n");
        ($params{'WebEnv'}, $params{'query_key'}) = ($webenv, $qkey);
        $params{'dbfrom'} = $cookie->database if $eutil eq 'elink';
    # elink cannot set the db from a cookie (it is actually dbfrom)
    $params{'db'} = $db ? $db :
                    ($cookie && $eutil ne 'elink') ? $cookie->database :
    # einfo db exception (db is optional)
    if (!$db && ($eutil eq 'einfo' || $eutil eq 'egquery')) {
        delete $params{'db'};
    unless (exists $params{'retmode'}) {        # set by user
        my $format = $CGILOCATION{ $eutil }[2]; # set by eutil
        if ($format eq 'dbspec') {              # database-specific
            $format = $DATABASE{$params{'db'}} ?
                      $DATABASE{$params{'db'}} : 'xml'; # have xml as a fallback
        $params{'retmode'} = $format;
    $self->debug("Param: $_\tValue: $params{$_}\n") for keys %params;
# enable dynamic loading of proper module at run time

sub _load_eutil_module {
    my ($self,$eutil) = @_;
    my $module = "Bio::DB::EUtilities::" . $eutil;
        $ok = $self->_load_module($module);
$self: $eutil cannot be found
For more information about the EUtilities system please see the EUtilities docs.
This includes ways of checking for formats at compile time, not run time

sub _next_cookie_index {
    return $self->{'_cookieindex'}++;