2
if ($ARGV[0]=~/^-?-h/ || @ARGV < 1) {
5
This script massages the RELEASE 3 Flybase/Gadfly GFF files located at
6
http://www.fruitfly.org/sequence/release3download.shtml into the
7
"correct" version of the GFF format.
9
To use this script, download the whole genome FASTA file and save it
10
to disk. (The downloaded file will be called something like
11
"na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the
12
HTML page doesn't give the filename.) Do the same for the whole
13
genome GFF annotation file (the saved file will be called something
14
like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".) If
15
you wish you can download the ZIP compressed versions of these files.
17
Next run this script on the two files, indicating the name of the
18
downloaded FASTA file first, followed by the gff file:
20
% process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff
22
The gadfly.gff file and the fasta file can now be loaded into a Bio::DB::GFF database
23
using the following command:
25
% bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff
27
(Where "fly" is the name of the database. Change it as appropriate.
28
The database must already exist and be writable by you!)
30
The resulting database will have the following feature types
31
(represented as "method:source"):
33
Component:arm A chromosome arm
34
Component:scaffold A chromosome scaffold (accession #)
35
Component:gap A gap in the assembly
36
clone:clonelocator A BAC clone
37
gene:gadfly A gene accession number
38
transcript:gadfly A transcript accession number
39
translation:gadfly A translation
40
codon:gadfly Significance unknown
42
symbol:gadfly A classical gene symbol
43
similarity:blastn A BLASTN hit
44
similarity:blastx A BLASTX hit
45
similarity:sim4 EST->genome using SIM4
46
similarity:groupest EST->genome using GROUPEST
47
similarity:repeatmasker A repeat
49
IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly
50
files and will not work with earlier releases.
59
$_ = "gunzip -c $_ |" if /\.gz$/;
62
if ($ARGV[0] =~ /fasta/i) {
65
die "call as process_gadfly.pl \"release3_dna.FASTA\" \"release3_features.GFF\"";
71
my ($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) = split "\t";
72
next if $start > $stop; # something wrong. Don't bother fixing it.
74
my $fixed_group = fix_group($csource,$cmethod,$cgroup);
75
print join("\t",$ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$fixed_group),"\n";
76
dump_symbol($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) if $cgroup =~ /symbol/i;
80
my ($source,$method,$group) = @_;
82
push @group,"Transcript $1" if $group =~ /transgrp=([^; ]+)/;
83
push @group,"Gene $1" if $method eq 'gene' && $group =~ /genegrp=([^; ]+)/;
85
$gene ||= qq(Note "FlyBase $1") if $group =~ /dbxref=FlyBase:(\w+)/;
86
$gene ||= qq(Note "GadFly $1") if $group =~ /genegrp=([^; ]+)/;
87
push @group,qq(Note "Symbol $1") if $group =~ /symbol=([^; ]+)/ && "Gene $1" ne $group[0];
89
return join ' ; ',@group;
92
# called when we encounter a gene symbol
94
my ($ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,$cgroup) = @_;
95
my ($symbol) = $cgroup=~/symbol=([^;]+)/;
96
my ($gene) = $cgroup=~/genegrp=([^;]+)/;
97
return if $symbol eq $gene;
99
print join("\t",$ref,$csource,$cmethod,$start,$stop,$cscore,$strand,$cphase,qq(Symbol "$symbol")),"\n";
103
my $file = shift @ARGV;
104
open F,$file or die "Can't open $file: $!";
105
print STDERR "Reading big FASTA file, please be patient...\n";
106
my ($current_id,%lengths);
112
die "this doesn't look like a fasta file to me" unless $current_id;
114
$lengths{$current_id} += length;
116
foreach (sort keys %lengths) {
117
print join("\t",$_,'arm','Component',1,$lengths{$_},'.','+','.',qq(Sequence "$_")),"\n";
125
process_gadfly.pl - Massage Gadfly/FlyBase GFF files into a version suitable for the Generic Genome Browser
129
% process_gadfly.pl ./RELEASE2 > gadfly.gff
133
This script massages the RELEASE 3 Flybase/Gadfly GFF files located at
134
http://www.fruitfly.org/sequence/release3download.shtml into the "correct"
135
version of the GFF format.
137
To use this script, download the whole genome FASTA file and save it
138
to disk. (The downloaded file will be called something like
139
"na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the
140
HTML page doesn't give the filename.) Do the same for the whole
141
genome GFF annotation file (the saved file will be called something
142
like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".) If
143
you wish you can download the ZIP compressed versions of these files.
145
Next run this script on the two files, indicating the name of the
146
downloaded FASTA file first, followed by the gff file:
148
% process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff
150
The gadfly.gff file and the fasta file can now be loaded into a Bio::DB::GFF database
151
using the following command:
153
% bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff
155
(Where "fly" is the name of the database. Change it as appropriate.
156
The database must already exist and be writable by you!)
158
The resulting database will have the following feature types
159
(represented as "method:source"):
161
Component:arm A chromosome arm
162
Component:scaffold A chromosome scaffold (accession #)
163
Component:gap A gap in the assembly
164
clone:clonelocator A BAC clone
165
gene:gadfly A gene accession number
166
transcript:gadfly A transcript accession number
167
translation:gadfly A translation
168
codon:gadfly Significance unknown
170
symbol:gadfly A classical gene symbol
171
similarity:blastn A BLASTN hit
172
similarity:blastx A BLASTX hit
173
similarity:sim4 EST->genome using SIM4
174
similarity:groupest EST->genome using GROUPEST
175
similarity:repeatmasker A repeat
177
IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly
178
files and will not work with earlier releases.
182
L<Bio::DB::GFF>, L<bulk_load_gff.pl>, L<load_gff.pl>
186
Lincoln Stein, lstein@cshl.org
188
Copyright (c) 2002 Cold Spring Harbor Laboratory
190
This library is free software; you can redistribute it and/or modify
191
it under the same terms as Perl itself. See DISCLAIMER.txt for
192
disclaimers of warranty.