bp_load_gff.pl - Load a Bio::DB::GFF database from GFF files.

  % bp_load_gff.pl -d testdb -u user -p pw
     --dsn 'dbi:mysql:database=dmel_r5_1;host=myhost;port=myport'
     dna1.fa dna2.fa features1.gff features2.gff ...

This script loads a Bio::DB::GFF database with the features contained
in a list of GFF files and/or FASTA sequence files. You must use the
exact variant of GFF described in L<Bio::DB::GFF>. Various
command-line options allow you to control which database to load and
whether to allow an existing database to be overwritten.

This script uses the Bio::DB::GFF interface, and so works with all
database adaptors currently supported by that module (MySQL, Oracle,
PostgreSQL soon). However, it is slow. For faster loading, see the
MySQL-specific L<bp_bulk_load_gff.pl> and L<bp_fast_load_gff.pl> scripts.

If the filename is given as "-" then the input is taken from standard
input. Compressed files (.gz, .Z, .bz2) are automatically
uncompressed.

FASTA format files are distinguished from GFF files by their filename
extensions. Files ending in .fa, .fasta, .fast, .seq, .dna and their
uppercase variants are treated as FASTA files. Everything else is
treated as a GFF file. If you wish to load FASTA files from STDIN,
then use the -f command-line switch with an argument of '-', as in

    gunzip -c my_data.fa.gz | bp_load_gff.pl -d test -f -
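
The extension rule can be sketched in a few lines of Perl. This is an
illustration only: the helper name classify_file is hypothetical and is
not part of this script, and the pattern simply encodes the extensions
listed above.

    use strict;
    use warnings;

    # Hypothetical helper: apply the filename rule described above.
    # Returns 'fasta' for the listed extensions (any case), else 'gff'.
    sub classify_file {
        my ($name) = @_;
        return $name =~ /\.(fa|fasta|fast|seq|dna)$/i ? 'fasta' : 'gff';
    }

    for my $file (qw(dna1.fa CHR2.FASTA features1.gff my_data.fa.gz)) {
        print "$file => ", classify_file($file), "\n";
    }

Under this reading of the rule, a compressed name such as my_data.fa.gz
does not end in a FASTA extension, so it would be handled as GFF unless
it is supplied through the -f switch.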

On the first load of a database, you will see a number of "unknown
table" errors. This is normal.

About maxfeature: the default value is 100,000,000 bases. If you have
features that are close to or greater than 100 Mb in length, then the
value of maxfeature should be increased to 1,000,000,000, or another
power of 10.
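
As a rough guide to choosing a value, the sketch below picks the
smallest power of 10 that is strictly larger than the longest feature,
never going below the documented default of 100,000,000. The helper
name suggest_maxfeature is hypothetical; it is not part of this script
or of Bio::DB::GFF.

    use strict;
    use warnings;

    # Hypothetical helper: given the longest feature length in bases,
    # return a power of 10 that exceeds it (minimum 100,000,000).
    sub suggest_maxfeature {
        my ($longest) = @_;
        my $max = 100_000_000;
        $max *= 10 while $max <= $longest;
        return $max;
    }

    # A made-up feature length of about 247 Mb.
    print suggest_maxfeature(247_000_000), "\n";

The result can then be passed on the command line with --maxfeature.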

=head1 COMMAND-LINE OPTIONS

Command-line options can be abbreviated to any unambiguous prefix,
usually a single letter, e.g. -d instead of --dsn.

   --dsn        <dsn>       Data source (default dbi:mysql:test)
   --adaptor    <adaptor>   Schema adaptor (default dbi::mysqlopt)
   --user       <user>      Username for mysql authentication
   --pass       <password>  Password for mysql authentication
   --fasta      <path>      Fasta file or directory containing fasta files for the DNA
   --create                 Force creation and initialization of database
   --maxfeature             Set the value of the maximum feature size (default 100 Mb; must be a power of 10)
   --group                  A list of one or more tag names (comma or space separated)
                            to be used for grouping in the 9th column.
   --upgrade                Upgrade existing database to current schema
   --gff3_munge             Activate GFF3 name munging (see Bio::DB::GFF)
   --quiet                  No progress reports
   --summary                Generate summary statistics for drawing coverage histograms.
                            This can be run on a previously loaded database or during
                            the load.
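
As a rough illustration of how abbreviated options and the --group value
are interpreted, the sketch below parses a made-up argument list with
Getopt::Long. The option subset and the sample values are assumptions
chosen for the example; it does not reproduce the script's full option
handling.

    use strict;
    use warnings;
    use Getopt::Long qw(GetOptionsFromArray);

    # Sample arguments (made up): -d and -c are unique prefixes here,
    # while --gr is spelled out further to tell it apart from --gff3_munge.
    my @argv = ('-d', 'dbi:mysql:test', '-c', '--gr', 'Name,ID Parent');

    my ($dsn, $create, $group, $munge);
    GetOptionsFromArray(\@argv,
                        'dsn:s'      => \$dsn,
                        'create'     => \$create,
                        'group:s'    => \$group,
                        'gff3_munge' => \$munge)
        or die "could not parse options\n";

    # Split the group list on commas and/or whitespace.
    my @tags = split /[,\s]+/, $group;

    print "dsn=$dsn create=$create tags=@tags\n";

Getopt::Long rejects a prefix that matches more than one option, so
options that share a first letter need a longer prefix.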

L<Bio::DB::GFF>, L<bulk_load_gff.pl>, L<load_gff.pl>

Lincoln Stein, lstein@cshl.org

Copyright (c) 2002 Cold Spring Harbor Laboratory

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself. See DISCLAIMER.txt for
disclaimers of warranty.

my ($DSN,$ADAPTOR,$CREATE,$USER,$PASSWORD,$FASTA,$UPGRADE,$MAX_BIN,$GROUP_TAG,$MUNGE,$QUIET,$SUMMARY_STATS);

GetOptions ('dsn:s'               => \$DSN,
            'adaptor:s'           => \$ADAPTOR,
            'p|password:s'        => \$PASSWORD,
            'upgrade'             => \$UPGRADE,
            'maxbin|maxfeature:s' => \$MAX_BIN,
            'group:s'             => \$GROUP_TAG,
            'gff3_munge'          => \$MUNGE,
            'summary'             => \$SUMMARY_STATS,
            'create'              => \$CREATE) or (system('pod2text',$0), exit -1);

# some local defaults
$DSN     ||= 'dbi:mysql:test';
$ADAPTOR ||= 'dbi::mysqlopt';
$MAX_BIN ||= 1_000_000_000; # to accommodate human-sized chromosomes

push @args,(-user=>$USER) if defined $USER;
push @args,(-pass=>$PASSWORD) if defined $PASSWORD;
# split the group list on commas and/or whitespace
push @args,(-preferred_groups=>[split(/[,\s]+/,$GROUP_TAG)]) if defined $GROUP_TAG;
push @args,(-create=>1) if $CREATE;
push @args,(-write=>1);

my $db = Bio::DB::GFF->new(-adaptor=>$ADAPTOR,-dsn => $DSN,@args)
  or die "Can't open database: ",Bio::DB::GFF->error,"\n";

$db->gff3_name_munging(1) if $MUNGE;

$MAX_BIN ? $db->initialize(-erase=>1,-MAX_BIN=>$MAX_BIN) :

warn qq(expect to see several "table already exists" messages\n);

my $dbi = $db->dbh; # get the raw database handle
my ($count) = $dbi->selectrow_array('SELECT COUNT(*) FROM fnote');
if (defined($count) && $count > 0) {
  warn qq(fnote table detected. Translating into fattribute table. This may take a while.\n);
  $dbi->do("INSERT INTO fattribute VALUES (1,'Note')") or die "failed: ",$dbi->errstr;
  $dbi->do("INSERT INTO fattribute_to_feature (fid,fattribute_id,fattribute_value) SELECT fnote.fid,1,fnote FROM fnote") or die "failed: ",$dbi->errstr;
  warn qq(Schema successfully upgraded. You might want to drop the fnote table when you're sure everything's working.\n);

if (/\.(fa|fasta|dna|seq|fast)$/i) {

for my $file (@gff) {
  warn "$file: loading...\n";
  my $loaded = $db->load_gff($file,!$QUIET);
  warn "$file: $loaded records loaded\n";

unshift @fasta,$FASTA if defined $FASTA;

for my $file (@fasta) {
  warn "Loading fasta ",(-d $file?"directory":"file"), " $file\n";
  my $loaded = $db->load_fasta($file,!$QUIET);
  warn "$file: $loaded records loaded\n";

if ($SUMMARY_STATS) {
  warn "Building summary statistics for coverage histograms...\n";
  $db->build_summary_statistics;