3
# Author Chris Mungall <cjm-at-bioperl.org>
7
unflatten_seq - unflatten a genbank or genbank-style feature file into
8
a nested SeqFeature hierarchy
12
unflatten_seq.PLS -e 3 -gff ~/cvs/bioperl-live/t/data/AE003644_Adh-genomic.gb
14
unflatten_seq.PLS --detail ~/cvs/bioperl-live/t/data/AE003644_Adh-genomic.gb
16
unflatten_seq.PLS -i foo.embl --from embl --to chadoxml -o out.chado.xml
18
unflatten_seq.PLS --notypemap --detail --to asciitree -ethresh 2 AE003644_Adh-genomic.gb
22
This script will B<unflatten> a genbank or genbank-style file of
23
SeqFeatures into a nested hierarchy.
25
See L<Bio::SeqFeature::Tools::Unflattener>
27
In a GenBank/EMBL representation, features are 'flat' - for example,
28
there is no link between an mRNA and a CDS, other than implicit links
29
(eg via tags or via splice site coordinates) which may be hard to code
32
This is most easily illustrated with the default output format,
35
An unflattened genbank feature set may look like this (AB077698)
38
databank_entry 1..2701[+]
41
CDS hCHCR-G 80..1144[+]
43
five_prime_UTR 1..79[+]
44
located_sequence_feature 137..196[+]
45
located_sequence_feature 239..292[+]
46
located_sequence_feature 617..676[+]
47
located_sequence_feature 725..778[+]
48
three_prime_UTR 1145..2659[+]
49
polyA_site 1606..1606[+]
50
polyA_site 2660..2660[+]
52
Or like this (portion of AE003734)
57
CDS CG3320-PA 53126..54971[-]
64
CDS CG3320-PB 53383..54971[-]
70
The unflattening will also 'normalize' the containment hierarchy (in
71
the sense of standardising it - e.g. making sure there is always a
72
transcript record, even if genbank just specifies CDS and gene)
74
By default, the GenBank types will be mapped to SO types
76
See L<Bio::SeqFeature::Tools::TypeMapper>
78
=head1 COMMAND LINE ARGUMENTS
84
input file (can also be specified as last argument)
88
input format (defaults to genbank)
90
probably doesnt make so much sense to use this for non-flat formats;
91
ie other than embl/genbank
95
output format (defaults to asciitree)
97
should really be a format that is nested SeqFeature aware; I think
98
this is only asciitree, chadoxml and gff3
102
with export to GFF3 format (pre-3 GFFs make no sense with unflattened
103
sequences, as they have no set way of representing feature graphs)
107
outfile defaults to STDOUT
111
show extra detail on features (asciitree mode only)
115
sets the error threshold on unflattening
117
by default this script will throw a wobbly if it encounters weird
118
stuff in the genbank file - raise the error threshold to signal these
119
to be ignored (and reported on STDERR)
123
suppress use_magic in unflattening (see
124
L<Bio::SeqFeature::Tools::Unflattener>
128
suppress type mapping (see
129
L<Bio::SeqFeature::Tools::TypeMapper>
136
L<Bio::SeqFeature::Tools::Unflattener> allows fine-grained control
137
over the unflattening process - need to add more options to allow this
138
control at the command line
145
User feedback is an integral part of the evolution of this and other
146
Bioperl modules. Send your comments and suggestions preferably to
147
the Bioperl mailing list. Your participation is much appreciated.
149
bioperl-l@bioperl.org - General discussion
150
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
152
=head2 Reporting Bugs
154
Report bugs to the Bioperl bug tracking system to help us keep track
155
of the bugs and their resolution. Bug reports can be submitted via
158
https://redmine.open-bio.org/projects/bioperl/
162
Chris Mungall E<lt>cjm-at-bioperl.orgE<gt>
167
use Bio::SeqFeature::Tools::Unflattener;
168
use Bio::SeqFeature::Tools::TypeMapper;
169
use Bio::SeqFeature::Tools::IDHandler;
174
my ($input,$from,$to,$output,$verbosity,$ethresh,$nomagic,$group_tag,$detail,
180
my @remove_types = ();
183
'i|input:s' => \$input,
186
'o|output:s'=> \$output,
187
"verbosity|v=s"=>\$verbosity,
188
"ethresh|e=s"=>\$ethresh,
189
"remove_type=s@"=>\@remove_types,
190
"nomagic"=>\$nomagic,
191
"notypemap"=>\$notypemap,
192
"group_tag"=>\$group_tag,
196
system("perldoc $0");
202
if ($to =~ /^gff/i) {
206
$input = $input || shift if @ARGV;
208
my $in = new Bio::SeqIO(-file => $input,
211
my @out_opt = $output ? (-file => ">$output") : ();
213
$out = new Bio::SeqIO(-format=>$to, @out_opt);
214
$out->show_detail($detail) if $out->can("show_detail") && $detail;
217
my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
218
$unflattener->verbose($verbosity);
219
$unflattener->error_threshold($ethresh);
220
my $tm = Bio::SeqFeature::Tools::TypeMapper->new;
221
my $idhandler = Bio::SeqFeature::Tools::IDHandler->new;
223
while( my $seq = $in->next_seq ) {
224
$unflattener->remove_types(-seq=>$seq,
225
-types=>\@remove_types)
228
$unflattener->unflatten_seq(-seq=>$seq,
229
-use_magic=>!$nomagic,
230
-group_tag=>$group_tag,
232
$unflattener->report_problems(\*STDERR);
233
$tm->map_types_to_SO(-seq=>$seq) unless $notypemap;
235
my @seq_args = ($seq);
236
if ($to eq 'chadoxml') {
237
@seq_args = (-seq=>$seq, -nounflatten=>1)
240
my $gffio = Bio::Tools::GFF->new(@out_opt, -noparse=>1, -gff_version => 3);
241
$idhandler->set_ParentIDs_from_hierarchy($seq);
242
foreach my $feature ($seq->get_all_SeqFeatures) {
243
$gffio->write_feature($feature);
248
$out->write_seq(@seq_args);