2
if ($ARGV[0]=~/^-?-h/ || @ARGV < 1) {
5
This script massages the RELEASE 3 Flybase/Gadfly GFF files located at
6
http://www.fruitfly.org/sequence/release3download.shtml into the
7
"correct" version of the GFF format.
9
To use this script, download the whole genome FASTA file and save it
10
to disk. (The downloaded file will be called something like
11
"na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the
12
HTML page doesn't give the filename.) Do the same for the whole
13
genome GFF annotation file (the saved file will be called something
14
like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".) If
15
you wish you can download the ZIP compressed versions of these files.
17
Next run this script on the two files, indicating the name of the
18
downloaded FASTA file first, followed by the gff file:
20
% process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff
22
The gadfly.gff file and the fasta file can now be loaded into a Bio::DB::GFF database
23
using the following command:
25
% bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff
27
(Where "fly" is the name of the database. Change it as appropriate.
28
The database must already exist and be writable by you!)
30
The resulting database will have the following feature types
31
(represented as "method:source"):
33
Component:arm A chromosome arm
34
Component:scaffold A chromosome scaffold (accession #)
35
Component:gap A gap in the assembly
36
clone:clonelocator A BAC clone
37
gene:gadfly A gene accession number
38
transcript:gadfly A transcript accession number
39
translation:gadfly A translation
40
codon:gadfly Significance unknown
42
symbol:gadfly A classical gene symbol
43
similarity:blastn A BLASTN hit
44
similarity:blastx A BLASTX hit
45
similarity:sim4 EST->genome using SIM4
46
similarity:groupest EST->genome using GROUPEST
47
similarity:repeatmasker A repeat
49
IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly
50
files and will not work with earlier releases.
60
$_ = "gunzip -c $_ |" if /\.gz$/;
63
if ($ARGV[0] =~ /fasta/i) {
66
die "call as process_gadfly.pl \"release3_dna.FASTA\" \"release3_features.GFF\"";
72
my ($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) = split "\t";
73
next if $start > $stop; # something wrong. Don't bother fixing it.
75
my $fixed_group = fix_group($csource,$cmethod,$cgroup);
76
print join("\t",$ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$fixed_group),"\n";
77
dump_symbol($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) if $cgroup =~ /symbol/i;
81
my ($source,$method,$group) = @_;
83
push @group,"Transcript $1" if $group =~ /transgrp=([^; ]+)/;
84
push @group,"Gene $1" if $method eq 'gene' && $group =~ /genegrp=([^; ]+)/;
86
$gene ||= qq(Note "FlyBase $1") if $group =~ /dbxref=FlyBase:(\w+)/;
87
$gene ||= qq(Note "GadFly $1") if $group =~ /genegrp=([^; ]+)/;
88
push @group,qq(Note "Symbol $1") if $group =~ /symbol=([^; ]+)/ && "Gene $1" ne $group[0];
90
return join ' ; ',@group;
93
# called when we encounter a gene symbol
95
my ($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) = @_;
96
my ($symbol) = $cgroup=~/symbol=([^;]+)/;
97
my ($gene) = $cgroup=~/genegrp=([^;]+)/;
98
return if $symbol eq $gene;
100
print join("\t",$ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,qq(Symbol "$symbol")),"\n";
104
my $file = shift @ARGV;
105
open F,$file or die "Can't open $file: $!";
106
print STDERR "Reading big FASTA file, please be patient...\n";
107
my ($current_id,%lengths);
113
die "this doesn't look like a fasta file to me" unless $current_id;
115
$lengths{$current_id} += length;
117
foreach (sort keys %lengths) {
118
print join("\t",$_,'arm','Component',1,$lengths{$_},'.','+','.',qq(Sequence "$_")),"\n";
126
bp_process_gadfly.pl - Massage Gadfly/FlyBase GFF files into a version suitable for the Generic Genome Browser
130
% bp_process_gadfly.pl ./RELEASE2 > gadfly.gff
134
This script massages the RELEASE 3 Flybase/Gadfly GFF files located at
135
http://www.fruitfly.org/sequence/release3download.shtml into the "correct"
136
version of the GFF format.
138
To use this script, download the whole genome FASTA file and save it
139
to disk. (The downloaded file will be called something like
140
"na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the
141
HTML page doesn't give the filename.) Do the same for the whole
142
genome GFF annotation file (the saved file will be called something
143
like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".) If
144
you wish you can download the ZIP compressed versions of these files.
146
Next run this script on the two files, indicating the name of the
147
downloaded FASTA file first, followed by the gff file:
149
% bp_process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff
151
The gadfly.gff file and the fasta file can now be loaded into a Bio::DB::GFF database
152
using the following command:
154
% bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff
156
(Where "fly" is the name of the database. Change it as appropriate.
157
The database must already exist and be writable by you!)
159
The resulting database will have the following feature types
160
(represented as "method:source"):
162
Component:arm A chromosome arm
163
Component:scaffold A chromosome scaffold (accession #)
164
Component:gap A gap in the assembly
165
clone:clonelocator A BAC clone
166
gene:gadfly A gene accession number
167
transcript:gadfly A transcript accession number
168
translation:gadfly A translation
169
codon:gadfly Significance unknown
171
symbol:gadfly A classical gene symbol
172
similarity:blastn A BLASTN hit
173
similarity:blastx A BLASTX hit
174
similarity:sim4 EST->genome using SIM4
175
similarity:groupest EST->genome using GROUPEST
176
similarity:repeatmasker A repeat
178
IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly
179
files and will not work with earlier releases.
183
L<Bio::DB::GFF>, L<bulk_load_gff.pl>, L<load_gff.pl>
187
Lincoln Stein, lstein@cshl.org
189
Copyright (c) 2002 Cold Spring Harbor Laboratory
191
This library is free software; you can redistribute it and/or modify
192
it under the same terms as Perl itself. See DISCLAIMER.txt for
193
disclaimers of warranty.