2
# BioPerl module for Bio::Assembly::Tools::ContigSpectrum
4
# Copyright by Florent Angly
6
# You may distribute this module under the same terms as Perl itself
8
# POD documentation - main docs before the code
12
Bio::Assembly::Tools::ContigSpectrum
16
# Simple contig spectrum creation
17
my $csp1 = Bio::Assembly::Tools::ContigSpectrum->new(
19
-spectrum => { 1 => 10,
23
# ...or another way to create a simple contig spectrum
24
my $csp2 = Bio::Assembly::Tools::ContigSpectrum->new;
26
$csp2->spectrum({ 1 => 20, 2 => 1, 4 => 1 });
28
# Get some information
29
print "This is contig spectrum ".$csp->id."\n";
30
print "It contains ".$csp->nof_seq." sequences\n";
31
print "The largest contig has ".$csp->max_size." sequences\n";
32
print "The spectrum is: ".$csp->to_string($csp->spectrum)."\n";
34
# Let's add the contig spectra
35
my $summed_csp = Bio::Assembly::Tools::ContigSpectrum->new;
36
$summed_csp->add($csp1);
37
$summed_csp->add($csp2);
38
print "The summed contig spectrum is ".$summed_csp->to_string."\n";
41
my $avg_csp = Bio::Assembly::Tools::ContigSpectrum->new;
42
$avg_csp = $avg_csp->average([$csp1, $csp2]);
43
print "The average contig spectrum is ".$avg_csp->to_string."\n";
45
# Get a contig spectrum from an assembly
46
my $from_assembly = Bio::Assembly::Tools::ContigSpectrum->new(
47
-assembly => $assembly_object,
48
-eff_asm_params => 1);
49
print "The contig spectrum from assembly is ".$from_assembly->to_string."\n";
51
# Report advanced information (possible because eff_asm_params = 1)
52
print "Average sequence length: ".$from_assembly->avg_seq_length." bp\n";
53
print "Minimum overlap length: ".$from_assembly->min_overlap." bp\n";
54
print "Average overlap length: ".$from_assembly->avg_overlap." bp\n";
55
print "Minimum overlap match: ".$from_assembly->min_identity." %\n";
56
print "Average overlap match: ".$from_assembly->avg_identity." %\n";
58
# Assuming the assembly object contains sequences from several different
59
# metagenomes, we have a mixed contig spectrum from which a cross contig
60
# spectrum and dissolved contig spectra can be obtained
61
my $mixed_csp = $from_assembly;
63
# Calculate a dissolved contig spectrum
64
my $meta1_dissolved = Bio::Assembly::Tools::ContigSpectrum->new(
65
-dissolve => [$mixed_csp, 'metagenome1'] );
66
my $meta2_dissolved = Bio::Assembly::Tools::ContigSpectrum->new(
67
-dissolve => [$mixed_csp, 'metagenome2'] );
68
print "The dissolved contig spectra are:\n".
69
$meta1_dissolved->to_string."\n".
70
$meta2_dissolved->to_string."\n";
72
# Determine a cross contig spectrum
73
my $cross_csp = Bio::Assembly::Tools::ContigSpectrum->new(
74
-cross => $mixed_csp );
75
print "The cross contig spectrum is ".$cross_csp->to_string."\n";
80
The Bio::Assembly::Tools::ContigSpectrum Perl module enables to
81
manually create contig spectra, import them from assemblies,
82
manipulate them, transform between different types of contig spectra
85
Bio::Assembly::Tools::ContigSpectrum is a module to create, manipulate
86
and output contig spectra, assembly-derived data used in metagenomics
87
(community genomics) for diversity estimation.
91
A contig spectrum is the count of the number of contigs of different
92
size in an assembly. For example, the contig spectrum [100 5 1 0 0
93
...] means that there were 100 singlets (1-contigs), 5 contigs of 2
94
sequences (2-contigs), 1 contig of 3 sequences (3-contig) and no
97
An assembly can be produced from a mixture of sequences from different
98
metagenomes. The contig obtained from this assembly is a mixed contig
99
spectrum. The contribution of each metagenome in this mixed contig
100
spectrum can be obtained by determining a dissolved contig spectrum.
102
Finally, based on a mixed contig spectrum, a cross contig spectrum can
103
be determined. In a cross contig spectrum, only contigs containing
104
sequences from different metagenomes are kept; "pure" contigs are
105
excluded. Additionally, the total number of singletons (1-contigs)
106
from each region that assembles with any fragments from other regions
107
is the number of 1-contigs in the cross contig spectrum.
111
The simplest representation of a contig spectrum is as a hash
112
representation where the key is the contig size (number of sequences
113
making up the contig) and the value the number of contigs of this
116
In fact, it is useful to have more information associated with the
117
contig spectrum, hence the Bio::Assembly::Tools::ContigSpectrum module
118
implements an object containing a contig spectrum hash and additional
119
information. The get/set methods to access them are:
121
id contig spectrum ID
122
nof_seq number of sequences
123
nof_rep number of repetitions (assemblies) used
124
max_size size of (number of sequences in) the largest contig
125
nof_overlaps number of overlaps
126
min_overlap minimum overlap length for building a contig
127
min_identity minimum sequence identity over the overlap length
128
avg_overlap average overlap length
129
avg_identity average overlap identity
130
avg_seq_length average sequence length
131
eff_asm_params effective assembly parameters
132
spectrum hash representation of a contig spectrum
134
Operations on the contig spectra:
136
to_string create a string representation of the spectrum
137
spectrum import a hash contig spectrum
138
assembly determine a contig spectrum from an assembly
139
dissolve calculate a dissolved contig spectrum (based on assembly)
140
cross produce a cross contig spectrum (based on assembly)
141
add add a contig spectrum to an existing one
142
average make an average of several contig spectra
144
When using operations that rely on knowing "where" (from what
145
metagenomes) a sequence came from (i.e. when creating a dissolved or
146
cross contig spectrum), make sure that the sequences used for the
147
assembly have a name header, e.g. E<gt>metagenome1|seq1,
148
E<gt>metagenome2|seq1, ...
154
User feedback is an integral part of the evolution of this and other
155
BioPerl modules. Send your comments and suggestions preferably to the
156
BioPerl mailing lists. Your participation is much appreciated.
158
bioperl-l@bioperl.org - General discussion
159
http://bio.perl.org/MailList.html - About the mailing lists
161
=head2 Reporting Bugs
163
Report bugs to the BioPerl bug tracking system to help us keep track
164
the bugs and their resolution. Bug reports can be submitted via email
167
bioperl-bugs@bio.perl.org
168
http://bugzilla.bioperl.org/
170
=head1 AUTHOR - Florent E Angly
172
Email florent_dot_angly_at_gmail_dot_com
176
The rest of the documentation details each of the object
177
methods. Internal methods are usually preceded with a "_".
181
package Bio::Assembly::Tools::ContigSpectrum;
186
use Bio::Assembly::Scaffold;
187
use Bio::SimpleAlign;
188
use Bio::LocatableSeq;
189
use Bio::Align::PairwiseStatistics;
191
use base 'Bio::Root::Root';
197
Usage : my $csp = Bio::Assembly::Tools::ContigSpectrum->new();
199
my $csp = Bio::Assembly::Tools::ContigSpectrum->new(
201
-spectrum => { 1 => 90 , 2 => 3 , 4 => 1 },
204
my $csp = Bio::Assembly::Tools::ContigSpectrum->new(
205
-assembly => $assembly_obj
207
Function: create a new contig spectrum object
208
Returns : reference to a contig spectrum object
214
my ($class, @args) = @_;
215
my $self = $class->SUPER::new(@args);
216
my ( $id, $nof_seq, $nof_rep, $max_size, $nof_overlaps, $min_overlap,
217
$min_identity, $avg_overlap, $avg_identity, $avg_seq_len, $spectrum,
218
$assembly, $eff_asm_params, $dissolve, $cross) = $self->_rearrange( [qw(ID
219
NOF_SEQ NOF_REP MAX_SIZE NOF_OVERLAPS MIN_OVERLAP MIN_IDENTITY AVG_OVERLAP
220
AVG_IDENTITY AVG_SEQ_LEN SPECTRUM ASSEMBLY EFF_ASM_PARAMS DISSOLVE CROSS)],
223
# First set up some defauts
224
$self->{'_id'} = 'NoName';
225
$self->{'_nof_seq'} = 0;
226
$self->{'_nof_rep'} = 0;
227
$self->{'_max_size'} = 0;
228
$self->{'_nof_overlaps'} = 0;
229
$self->{'_min_overlap'} = undef;
230
$self->{'_min_identity'} = undef;
231
$self->{'_avg_overlap'} = 0;
232
$self->{'_avg_identity'} = 0;
233
$self->{'_avg_seq_len'} = 0;
234
$self->{'_eff_asm_params'} = 0;
235
$self->{'_spectrum'} = {1 => 0}; # contig spectrum hash representation
236
$self->{'_assembly'} = []; # list of assembly objects used
238
# Then, according to user desires, override defaults
239
$self->{'_id'} = $id if (defined $id);
240
$self->{'_nof_seq'} = $nof_seq if (defined $nof_seq);
241
$self->{'_nof_rep'} = $nof_rep if (defined $nof_rep);
242
$self->{'_max_size'} = $max_size if (defined $max_size);
243
$self->{'_nof_overlaps'} = $nof_overlaps if (defined $nof_overlaps);
244
$self->{'_min_overlap'} = $min_overlap if (defined $min_overlap);
245
$self->{'_avg_overlap'} = $avg_overlap if (defined $avg_overlap);
246
$self->{'_min_identity'} = $min_identity if (defined $min_identity);
247
$self->{'_avg_identity'} = $avg_identity if (defined $avg_identity);
248
$self->{'_avg_seq_len'} = $avg_seq_len if (defined $avg_seq_len);
249
$self->{'_eff_asm_params'} = $eff_asm_params if (defined $eff_asm_params);
251
# Finally get stuff that can be gotten in an automated way
252
$self->_import_spectrum($spectrum) if defined($spectrum);
253
$self->_import_assembly($assembly) if defined($assembly);
254
if (defined($dissolve)) {
255
my ($mixed_csp, $header) = (@$dissolve[0], @$dissolve[1]);
256
$self->_import_dissolved_csp($mixed_csp, $header);
258
$self->_import_cross_csp($cross) if defined($cross);
268
Function: get/set contig spectrum id
270
Args : string [optional]
275
my ($self, $id) = @_;
277
$self->{'_id'} = $id;
279
$id = $self->{'_id'};
287
Usage : $csp->nof_seq
288
Function: get/set the number of sequences making up the contig spectrum
290
Args : integer [optional]
295
my ($self, $nof_seq) = @_;
296
if (defined $nof_seq) {
297
$self->throw("The number of sequences must be strictly positive. Got ".
298
"'$nof_seq'") if $nof_seq < 1;
299
$self->{'_nof_seq'} = $nof_seq;
301
$nof_seq = $self->{'_nof_seq'};
309
Usage : $csp->nof_rep
310
Function: Get/Set the number of repetitions (assemblies) used to create the
313
Args : integer [optional]
318
my ($self, $nof_rep) = @_;
319
if (defined $nof_rep) {
320
$self->throw("The number of repetitions must be strictly positive. Got ".
321
"'$nof_rep'") if $nof_rep < 1;
322
$self->{'_nof_rep'} = $nof_rep;
324
$nof_rep = $self->{'_nof_rep'};
332
Usage : $csp->max_size
333
Function: get/set the size of (number of sequences in) the largest contig
335
Args : integer [optional]
340
my ($self, $max_size) = @_;
341
if (defined $max_size) {
342
$self->throw("The contig maximum size must be strictly positive. Got ".
343
"'$max_size'") if $max_size < 1;
344
$self->{'_max_size'} = $max_size;
346
$max_size = $self->{'_max_size'};
354
Usage : $csp->nof_overlaps
355
Function: Get/Set the number of overlaps in the assembly.
357
Args : integer [optional]
362
my ($self, $nof_overlaps) = @_;
363
if (defined $nof_overlaps) {
364
$self->throw("The number of overlaps must be strictly positive. Got ".
365
"'$nof_overlaps'") if $nof_overlaps < 1;
366
$self->{'_nof_overlaps'} = $nof_overlaps;
368
$nof_overlaps = $self->{'_nof_overlaps'};
369
return $nof_overlaps;
376
Usage : $csp->min_overlap
377
Function: get/set the assembly minimum overlap length
379
Args : integer [optional]
384
my ($self, $min_overlap) = @_;
385
if (defined $min_overlap) {
386
$self->throw("The minimum of overlap length must be strictly positive. Got".
387
" '$min_overlap'") if $min_overlap < 1;
388
$self->{'_min_overlap'} = $min_overlap;
390
$min_overlap = $self->{'_min_overlap'};
398
Usage : $csp->avg_overlap
399
Function: get/set the assembly average overlap length
401
Args : decimal [optional]
406
my ($self, $avg_overlap) = @_;
407
if (defined $avg_overlap) {
408
$self->throw("The average overlap length must be strictly positive. Got ".
409
"'$avg_overlap'") if $avg_overlap < 1;
410
$self->{'_avg_overlap'} = $avg_overlap;
412
$avg_overlap = $self->{'_avg_overlap'};
420
Usage : $csp->min_identity
421
Function: get/set the assembly minimum overlap identity percent
422
Returns : 0 < decimal < 100
423
Args : 0 < decimal < 100 [optional]
428
my ($self, $min_identity) = @_;
429
if (defined $min_identity) {
430
$self->throw("The minimum overlap percent identity must be strictly ".
431
"positive. Got '$min_identity'") if $min_identity < 1;
432
$self->{'_min_identity'} = $min_identity;
434
$min_identity = $self->{'_min_identity'};
435
return $min_identity;
442
Usage : $csp->avg_identity
443
Function: get/set the assembly average overlap identity percent
444
Returns : 0 < decimal < 100
445
Args : 0 < decimal < 100 [optional]
450
my ($self, $avg_identity) = @_;
451
if (defined $avg_identity) {
452
$self->throw("The average overlap percent identity must be strictly ".
453
"positive. Got '$avg_identity'") if $avg_identity < 1;
454
$self->{'_avg_identity'} = $avg_identity;
456
$avg_identity = $self->{'_avg_identity'};
457
return $avg_identity;
464
Usage : $csp->avg_seq_len
465
Function: get/set the assembly average sequence length
466
Returns : avg_seq_len
467
Args : real [optional]
472
my ($self, $avg_seq_len) = @_;
473
if (defined $avg_seq_len) {
474
$self->throw("The average sequence length must be strictly positive. Got ".
475
"'$avg_seq_len'") if $avg_seq_len < 1;
476
$self->{'_avg_seq_len'} = $avg_seq_len;
478
$avg_seq_len = $self->{'_avg_seq_len'};
483
=head2 eff_asm_params
485
Title : eff_asm_params
486
Usage : $csp->eff_asm_params(1)
487
Function: Get/set the effective assembly parameters option. It defines if the
488
effective assembly parameters should be determined when a contig
489
spectrum based or derived from an assembly is calulated. The
490
effective assembly parameters include avg_seq_length, nof_overlaps,
491
min_overlap, avg_overlap, min_identity and avg_identity.
492
1 = get them, 0 = don't.
494
Args : integer [optional]
499
my ($self, $eff_asm_params) = @_;
500
if (defined $eff_asm_params) {
501
$self->throw("eff_asm_params can only take values 0 or 1. Input value was ".
502
"'$eff_asm_params'") unless $eff_asm_params == 0 || $eff_asm_params == 1;
503
$self->{'_eff_asm_params'} = $eff_asm_params;
505
$eff_asm_params = $self->{'_eff_asm_params'};
506
return $eff_asm_params;
513
Usage : my $spectrum = $csp->spectrum({1=>10, 2=>2, 3=>1});
514
Function: Get the current contig spectrum represented as a hash / Update a
515
contig spectrum object based on a contig spectrum represented as a
517
The hash representation of a contig spectrum is as following:
518
key -> contig size (in number of sequences)
519
value -> number of contigs of this size
520
Returns : contig spectrum as a hash reference
521
Args : contig spectrum as a hash reference [optional]
526
my ($self, $spectrum) = @_;
527
if (defined $spectrum) {
528
$self->_import_spectrum($spectrum);
530
$spectrum = $self->{'_spectrum'};
538
Usage : my @asm_list = $csp->assembly();
539
Function: Get a reference to the list of assembly object reference used to
540
make the contig spectrum object / Update the contig spectrum object
541
based on an assembly object.
542
Returns : array of Bio::Assembly::Scaffold
543
Args : Bio::Assembly::Scaffold
548
my ($self, $assembly) = @_;
549
if (defined $assembly) {
550
$self->_import_assembly($assembly);
552
my @asm_list = @{$self->{'_assembly'}} if defined $self->{'_assembly'};
558
Title : drop_assembly
559
Usage : $csp->drop_assembly();
560
Function: Remove all assembly objects associated with a contig spectrum.
561
Assembly objects can be big. This method allows to free some memory
562
when assembly information is not needed anymore.
563
Returns : 1 for success, 0 for failure
570
$self->{'_assembly'} = [];
577
Usage : $dissolved_csp->dissolve($mixed_csp, $seq_header);
578
Function: Dissolve a mixed contig spectrum for the set of sequences that
579
contain the specified header, i.e. determine the contribution of
580
these sequences to the mixed contig spectrum based on the assembly.
581
The mixed contig spectrum object must have been created based on one
582
(or several) assembly object(s). Additionally, min_overlap and
583
min_identity must have been set (either manually using min_overlap
584
or automatically by switching on the eff_asm_params option).
585
Returns : 1 for success, 0 for failure
586
Args : Bio::Assembly::Tools::ContigSpectrum reference
587
sequence header string
593
my ($self, $mixed_csp, $seq_header) = @_;
594
$self->_import_dissolved_csp($mixed_csp, $seq_header);
602
Usage : $cross_csp->cross($mixed_csp);
603
Function: Calculate a cross contig_spectrum based on a mixed contig_spectrum.
604
Returns : 1 for success, 0 for failure
605
Args : Bio::Assembly::Tools::ContigSpectrum reference
610
my ($self, $mixed_csp) = @_;
611
$self->_import_cross_csp($mixed_csp);
618
Usage : my $csp_string = $csp->to_string;
619
Function: Convert the contig spectrum into a string (easy to print!!).
621
Args : element separator (integer) [optional]
624
3 -> newline-separated
629
my ($self, $element_separator) = @_;
630
return 0 if $self->{'_max_size'} == 0;
631
$element_separator ||= 1;
632
if ($element_separator == 1) {
633
$element_separator = ' ';
634
} elsif ($element_separator == 2) {
635
$element_separator = "\t";
636
} elsif ($element_separator == 3) {
637
$element_separator = "\n";
639
$self->throw("Unknown separator type '$element_separator'\n");
642
for (my $q = 1 ; $q <= $self->{'_max_size'} ; $q++) {
644
if (exists $self->{'_spectrum'}{$q}) {
645
$val = $self->{'_spectrum'}{$q};
647
$str .= $val.$element_separator;
657
Usage : $csp->add($additional_csp);
658
Function: Add a contig spectrum to an existing one: sums the spectra, update
659
the number of sequences, number of repetitions, ...
660
Returns : 1 for success, 0 for failure
661
Args : Bio::Assembly::Tools::ContigSpectrum object
666
my ($self, $csp) = @_;
668
if( !ref $csp || ! $csp->isa('Bio::Assembly::Tools::ContigSpectrum') ) {
669
$self->throw("Unable to process non Bio::Assembly::Tools::ContigSpectrum ".
670
"object [".ref($csp)."]");
672
# Update overlap statistics
673
if ( $self->{'_eff_asm_params'} > 0 ) {
675
if ( $csp->{'_eff_asm_params'} == 0 ) {
676
$self->warn("The parent contig spectrum needs effective assembly ".
677
"parameters (eff_asm_params = ".$self->{'_eff_asm_params'}.") but the ".
678
"child contig spectrum doesn't have them (eff_asm_params = ".
679
$csp->{'_eff_asm_params'}."). Skipping them...");
680
} elsif ( $csp->{'_eff_asm_params'} != $self->{'_eff_asm_params'} ) {
681
$self->warn("The parent contig spectrum needs a different method for ".
682
"detecting the effective assembly parameters (eff_asm_params = ".
683
$self->{'_eff_asm_params'}.") than the one specified for the child ".
684
"contig spectrum (eff_asm_params = ".$csp->{'_eff_asm_params'}."). ".
685
"Ignoring the differences...");
687
# Update existing stats
688
my $tot_num_overlaps = $csp->{'_nof_overlaps'} + $self->{'_nof_overlaps'};
689
$self->{'_min_overlap'} = $csp->{'_min_overlap'} if
690
defined $csp->{'_min_overlap'} && ( ! defined $self->{'_min_overlap'} ||
691
$csp->{'_min_overlap'} < $self->{'_min_overlap'} );
692
$self->{'_min_identity'} = $csp->{'_min_identity'} if
693
defined $csp->{'_min_identity'} && ( ! defined $self->{'_min_identity'} ||
694
$csp->{'_min_identity'} < $self->{'_min_identity'} );
695
if ($tot_num_overlaps != 0) {
696
$self->{'_avg_overlap'} =
697
($csp->{'_avg_overlap'} * $csp->{'_nof_overlaps'}
698
+ $self->{'_avg_overlap'} * $self->{'_nof_overlaps'})
700
$self->{'_avg_identity'} =
701
($csp->{'_avg_identity'} * $csp->{'_nof_overlaps'}
702
+ $self->{'_avg_identity'} * $self->{'_nof_overlaps'})
705
$self->{'_nof_overlaps'} = $tot_num_overlaps;
707
# Update sequence statistics
708
my $tot_nof_seq = $csp->{'_nof_seq'} + $self->{'_nof_seq'};
709
if (not $tot_nof_seq == 0) {
710
$self->{'_avg_seq_len'} = ($csp->{'_avg_seq_len'} * $csp->{'_nof_seq'} +
711
$self->{'_avg_seq_len'} * $self->{'_nof_seq'}) / $tot_nof_seq;
713
# Update spectrum (and nof_seq, max_size, and increment nof_rep by 1)
714
$self->_import_spectrum($csp->{'_spectrum'});
716
$self->{'_nof_rep'}--;
717
$self->{'_nof_rep'} += $csp->{'_nof_rep'};
718
# Update list of assembly objects used
719
push @{$self->{'_assembly'}}, @{$csp->{'_assembly'}}
720
if defined $csp->{'_assembly'};
728
Usage : my $avg_csp = $csp->average([$csp1, $csp2, $csp3]);
729
Function: Average one contig spectrum or the sum of several contig spectra by
730
the number of repetitions
731
Returns : Bio::Assembly::Tools::ContigSpectrum
732
Args : Bio::Assembly::Tools::ContigSpectrum array reference
738
my ($self, $list) = @_;
740
if ( ! ref $list || ! ref $list eq 'ARRAY') {
741
$self->throw("Average takes an array reference but got [".ref($list)."]");
743
# New average contig spectrum object
744
my $avg = Bio::Assembly::Tools::ContigSpectrum->new;
745
$avg->{'_eff_asm_params'} = 1;
747
# Cycle through contig spectra
749
for my $csp (@$list) {
751
if (not $csp->isa('Bio::Assembly::Tools::ContigSpectrum')) {
752
$csp->throw("Unable to process non Bio::Assembly::Tools::ContigSpectrum ".
753
"object [".ref($csp)."]");
755
# Import contig spectrum
759
# Average sum of contig spectra by number of repetitions
760
for (my $q = 1 ; $q <= $avg->{'_max_size'} ; $q++) {
761
$avg->{'_spectrum'}{$q} /= $avg->{'_nof_rep'}
762
if (defined $avg->{'_spectrum'}{$q});
764
# Average number of sequences
765
$avg->{'_nof_seq'} /= $avg->{'_nof_rep'};
766
# Average number of overlaps
767
$avg->{'_nof_overlaps'} /= $avg->{'_nof_rep'};
773
=head2 _naive_assembler
775
Title : _naive_assembler
777
Function: Determines the contig spectrum (hash representation) of a subset of
778
sequences from a mixed contig spectrum by "reassembling" the
779
specified sequences only based on their position in the contig. This
780
naive assembly only verifies that the minimum overlap length and
781
percentage identity are respected. There is no actual alignment done
782
Returns : contig spectrum hash reference
783
Args : Bio::Assembly::Contig
784
sequence ID array reference
785
minimum overlap length (integer) [optional]
786
minimum percentage identity (integer) [optional]
790
sub _naive_assembler {
791
my ($self, $contig, $seqlist, $min_overlap, $min_identity) = @_;
793
if ( ! ref $seqlist || ! ref($seqlist) eq 'ARRAY') {
794
$self->throw('Expecting an array reference. Got ['.ref($seqlist)."] \n");
796
my $max = scalar @$seqlist;
797
$self->throw("Expecting at least 2 sequences as input for _naive_assembler")
800
my %spectrum = (1 => 0);
803
# Map what sequences overlap with what sequences
804
for (my $i = 0 ; $i < $max-1 ; $i++) {
806
my $qseqid = $$seqlist[$i];
807
my $qseq = $contig->get_seq_by_name($qseqid);
809
for (my $j = $i+1 ; $j < $max ; $j++) {
811
my $tseqid = $$seqlist[$j];
812
my $tseq = $contig->get_seq_by_name($tseqid);
813
# try to align sequences
814
my ($aln, $overlap, $identity)
815
= $self->_overlap_alignment($contig, $qseq, $tseq, $min_overlap,
817
# if there is no valid overlap, go to next sequence
818
next if ! defined $aln;
819
# the overlap is valid
821
push @{$overlap_map{$qseqid}}, $tseqid;
822
$has_overlap{$tseqid} = 1;
823
$has_overlap{$qseqid} = 1;
825
# check if sequence is in previously seen overlap
826
if (exists $has_overlap{$qseqid}) {
829
if ($is_singlet == 1) {
833
# take care of last sequence
834
my $last_is_singlet = 1;
835
if (exists $has_overlap{$$seqlist[$max-1]}) {
836
$last_is_singlet = 0;
838
if ($last_is_singlet == 1) {
842
for my $seqid (@$seqlist) {
843
# list of sequences that should go in the contig
844
next if not exists $overlap_map{$seqid};
845
my @overlist = @{$overlap_map{$seqid}};
846
for (my $j = 0 ; $j < scalar(@overlist) ; $j++) {
847
my $otherseqid = $overlist[$j];
848
if (exists $overlap_map{$otherseqid}) {
849
push @overlist, @{$overlap_map{$otherseqid}};
850
delete $overlap_map{$otherseqid};
853
# remove duplicates from list
854
@overlist = sort @overlist;
855
for (my $j = 0 ; $j < scalar(@overlist)-1 ; $j++) {
856
if ( $overlist[$j] eq $overlist[$j+1] ) {
857
splice @overlist, $j, 1;
861
# update spectrum with size of contig
862
my $qsize = scalar(@overlist) + 1;
863
if (defined $spectrum{$qsize}) {
866
$spectrum{$qsize} = 1;
873
=head2 _new_from_assembly
875
Title : _new_from_assembly
877
Function: Creates a new contig spectrum object based solely on the result of
879
Returns : Bio::Assembly::Tools::ContigSpectrum
880
Args : Bio::Assembly::Scaffold
884
sub _new_from_assembly {
885
# Create new contig spectrum object based purely on what we can get from the
887
my ($self, $assemblyobj) = @_;
888
my $csp = Bio::Assembly::Tools::ContigSpectrum->new();
890
$csp->{'_id'} = $assemblyobj->id;
891
# 2: Set overlap statistics: nof_overlaps, min_overlap, avg_overlap,
892
# min_identity and avg_identity
893
$csp->{'_eff_asm_params'} = $self->{'_eff_asm_params'};
894
if ($csp->{'_eff_asm_params'} > 0) {
895
my ($nover, $minl, $avgl, $minid, $avgid)
896
= $csp->_get_overlap_stats($assemblyobj);
897
$csp->{'_min_overlap'} = $minl;
898
$csp->{'_min_identity'} = $minid;
899
$csp->{'_avg_overlap'} = $avgl;
900
$csp->{'_avg_identity'} = $avgid;
901
$csp->{'_nof_overlaps'} = $nover;
903
# 3: Set sequence statistics: nof_seq and avg_seq_len
904
my ($nseq, $avgseql) = $self->_get_seq_stats($assemblyobj);
905
$csp->{'_avg_seq_len'} = $avgseql;
906
$csp->{'_nof_seq'} = $nseq;
907
# 4: Set the spectrum: spectrum and max_size
908
for my $contigobj ($assemblyobj->all_contigs) {
909
my $size = $contigobj->no_sequences;
910
if (defined $csp->{'_spectrum'}{$size}) {
911
$csp->{'_spectrum'}{$size}++;
913
$csp->{'_spectrum'}{$size} = 1;
915
$csp->{'_max_size'} = $size if $size > $csp->{'_max_size'};
917
my $nof_singlets = $assemblyobj->get_nof_singlets();
918
if (defined $nof_singlets) {
919
$csp->{'_spectrum'}{1} += $nof_singlets;
920
$csp->{'_max_size'} = 1 if $nof_singlets >= 1 && $csp->{'_max_size'} < 1;
922
# 5: Set list of assembly objects used
923
push @{$csp->{'_assembly'}}, $assemblyobj;
924
# 6: Set number of repetitions
925
$csp->{'_nof_rep'} = 1;
931
=head2 _new_dissolved_csp
934
Usage : create a dissolved contig spectrum object
942
sub _new_dissolved_csp {
943
my ($self, $mixed_csp, $seq_header) = @_;
944
# Sanity checks on the mixed contig spectrum
946
# min_overlap and min_identity must be specified if there are some overlaps
947
# in the mixed contig
948
unless ($mixed_csp->{'_nof_overlaps'} == 0) {
949
unless ( defined $self->{'_min_overlap'} ||
950
defined $mixed_csp->{'_min_overlap'} ) {
951
$self->throw("min_overlap must be defined in the dissolved contig spectrum".
952
" or mixed contig spectrum to dissolve a contig");
954
unless ( defined $self->{'_min_identity'} ||
955
defined $mixed_csp->{'_min_identity'} ) {
956
$self->throw("min_identity must be defined in the dissolved contig spectrum".
957
" or mixed contig spectrum");
961
# there must be at least one assembly in mixed contig spectrum
962
if (!defined $mixed_csp->{'_assembly'} ||
963
scalar @{$mixed_csp->{'_assembly'}} < 1) {
964
$self->throw("The mixed contig spectrum must be based on at least one
968
# New dissolved contig spectrum object
969
my $dissolved = Bio::Assembly::Tools::ContigSpectrum->new();
971
# take parent attributes if existent or mixed attributes otherwise
972
if ($self->{'_eff_asm_params'}) {
973
$dissolved->{'_eff_asm_params'} = $self->{'_eff_asm_params'};
975
$dissolved->{'_eff_asm_params'} = $mixed_csp->{'_eff_asm_params'};
977
if ($self->{'_min_overlap'} && $self->{'_min_identity'}) {
978
( $dissolved->{'_min_overlap'}, $dissolved->{'_min_identity'} ) =
979
( $self->{'_min_overlap'}, $self->{'_min_identity'} );
981
( $dissolved->{'_min_overlap'}, $dissolved->{'_min_identity'} ) =
982
( $mixed_csp->{'_min_overlap'}, $mixed_csp->{'_min_identity'} );
985
# Dissolve each assembly
986
for my $assembly (@{$mixed_csp->{'_assembly'}}) {
987
# Dissolve this assembly for the given sequences
988
my %asm_spectrum = (1 => 0);
991
for my $contig ($assembly->all_contigs) {
994
for my $seq ($contig->each_seq) {
995
my $seq_id = $seq->id;
996
# get sequence origin
997
next unless $seq_id =~ m/^$seq_header\|/;
999
push @contig_seqs, $seq_id;
1000
$good_seqs{$seq_id} = 1;
1003
my $size = scalar @contig_seqs;
1006
} elsif ($size == 1) {
1008
} elsif ($size > 1) {
1009
# Reassemble good sequences
1010
my $contig_spectrum = $dissolved->_naive_assembler(
1011
$contig, \@contig_seqs, $dissolved->{'_min_overlap'},
1012
$dissolved->{'_min_identity'});
1014
for my $qsize (keys %$contig_spectrum) {
1015
$asm_spectrum{$qsize} += $$contig_spectrum{$qsize};
1018
$self->throw("The size is not valid... how could that happen?");
1022
for my $singlet ($assembly->all_singlets) {
1023
my $seq_id = $singlet->seqref->id;
1024
# get sequence origin
1025
next unless $seq_id =~ m/^$seq_header\|/;
1027
$good_seqs{$seq_id} = 1;
1032
$dissolved->_import_spectrum(\%asm_spectrum);
1034
$dissolved->{'_nof_rep'}--;
1035
$dissolved->{'_nof_rep'} += $mixed_csp->{'_nof_rep'};
1037
# Get sequence stats
1038
my ($nseq, $avgseql) = $dissolved->_get_seq_stats($assembly, \%good_seqs);
1039
$dissolved->{'_avg_seq_len'} = $avgseql;
1040
$dissolved->{'_nof_seq'} = $nseq;
1042
# Get eff_asm_param for these sequences
1043
if ($dissolved->{'_eff_asm_params'} > 0) {
1044
my ($nover, $minl, $avgl, $minid, $avgid)
1045
= $dissolved->_get_overlap_stats($assembly, \%good_seqs);
1046
$dissolved->{'_min_overlap'} = $minl;
1047
$dissolved->{'_min_identity'} = $minid;
1048
$dissolved->{'_avg_overlap'} = $avgl;
1049
$dissolved->{'_avg_identity'} = $avgid;
1050
$dissolved->{'_nof_overlaps'} = $nover;
1058
=head2 _new_cross_csp
1062
Function: create a cross contig spectrum object
1069
sub _new_cross_csp {
1070
my ($self, $mixed_csp) = @_;
1071
# Sanity check on the mixed contig spectrum
1072
# There must be at least one assembly
1073
if (!defined $mixed_csp->{'_assembly'} ||
1074
scalar @{$mixed_csp->{'_assembly'}} < 1) {
1075
$self->throw("The mixed contig spectrum must be based on at least one ".
1079
# New dissolved contig spectrum object
1080
my $cross = Bio::Assembly::Tools::ContigSpectrum->new();
1081
my %spectrum = (1 => 0);
1083
# Take parent or mixed attributes
1084
if ($self->{'_eff_asm_params'}) {
1085
$cross->{'_eff_asm_params'} = $self->{'_eff_asm_params'};
1087
$cross->{'_eff_asm_params'} = $mixed_csp->{'_eff_asm_params'};
1089
if ($self->{'_min_overlap'} && $self->{'_min_identity'}) {
1090
( $cross->{'_min_overlap'}, $cross->{'_min_identity'} ) =
1091
( $self->{'_min_overlap'}, $self->{'_min_identity'} );
1093
( $cross->{'_min_overlap'}, $cross->{'_min_identity'} ) =
1094
( $mixed_csp->{'_min_overlap'}, $mixed_csp->{'_min_identity'} );
1097
# Get cross contig spectrum for each assembly
1098
for my $assembly (@{$mixed_csp->{'_assembly'}}) {
1099
# Go through contigs and skip the pure ones
1101
for my $contig ($assembly->all_contigs) {
1105
for my $seq ($contig->each_seq) {
1106
# current sequence origin
1107
my $seq_id = $seq->id;
1108
$seq_id =~ m/^(.+)\|/;
1109
my $seq_header = $1;
1110
$self->warn("Sequence $seq_id does not seem to have a header. Skipping".
1111
" it...") if not defined $seq_header;
1113
push @seq_origins, $seq_header;
1114
push @seq_ids, $seq_id;
1116
my $qsize = scalar(@seq_ids);
1117
my @origins = sort { $a cmp $b } @seq_origins;
1118
my $size = scalar(@origins);
1119
for (my $i = 1 ; $i < $size ; $i++) {
1120
if ($origins[$i] eq $origins[$i-1]) {
1121
splice @origins, $i, 1;
1126
# Update cross-contig number in spectrum
1127
if ($size > 1) { # cross-contig detected
1128
# update good sequences
1129
for my $seq_id (@seq_ids) {
1130
$good_seqs{$seq_id} = 1;
1132
# update number of cross q-contigs in spectrum
1133
if (defined $spectrum{$qsize}) {
1134
$spectrum{$qsize} = 1;
1136
$spectrum{$qsize}++;
1139
# Update number of cross 1-contigs
1140
if ($size > 1) { # cross-contig detected
1141
for my $origin (@origins) {
1144
for (my $i = 0 ; $i < $qsize ; $i++) {
1145
my $seq_origin = $seq_origins[$i];
1146
my $seq_id = $seq_ids[$i];
1147
push @ids, $seq_id if $seq_origin eq $origin;
1149
if (scalar @ids == 1) {
1151
} elsif (scalar @ids > 1) {
1152
my $contig_spectrum = $cross->_naive_assembler(
1153
$contig, \@ids, $cross->{'_min_overlap'},
1154
$cross->{'_min_identity'});
1155
$spectrum{1} += $$contig_spectrum{1};
1157
$self->throw("The size is <= 0. How could such a thing happen?");
1162
# Get sequence stats
1163
my ($nseq, $avgseql) = $cross->_get_seq_stats($assembly, \%good_seqs);
1164
$cross->{'_avg_seq_len'} = $avgseql;
1165
$cross->{'_nof_seq'} = $nseq;
1166
# Get eff_asm_param for these sequences
1167
if ($cross->{'_eff_asm_params'} > 0) {
1168
my ($nover, $minl, $avgl, $minid, $avgid)
1169
= $cross->_get_overlap_stats($assembly, \%good_seqs);
1170
$cross->{'_min_overlap'} = $minl;
1171
$cross->{'_min_identity'} = $minid;
1172
$cross->{'_avg_overlap'} = $avgl;
1173
$cross->{'_avg_identity'} = $avgid;
1174
$cross->{'_nof_overlaps'} = $nover;
1178
$cross->_import_spectrum(\%spectrum);
1180
$cross->{'_nof_rep'}--;
1181
$cross->{'_nof_rep'} += $mixed_csp->{'_nof_rep'};
1186
=head2 _import_assembly
1188
Title : _import_assembly
1189
Usage : $csp->_import_assembly($assemblyobj);
1190
Function: Update a contig spectrum object based on an assembly object
1191
Returns : 1 for success, 0 for error
1192
Args : Bio::Assembly::Scaffold assembly object
1196
sub _import_assembly {
1197
my ($self, $assemblyobj) = @_;
1199
if( !ref $assemblyobj || ! $assemblyobj->isa('Bio::Assembly::Scaffold') ) {
1200
$self->throw("Unable to process non Bio::Assembly::Scaffold assembly ".
1201
"object [".ref($assemblyobj)."]");
1203
# Create new object from assembly
1204
my $csp = $self->_new_from_assembly($assemblyobj);
1205
# Update current contig spectrum object with new one
1211
=head2 _import_spectrum
1213
Title : _import_spectrum
1214
Usage : $csp->_import_spectrum({ 1 => 90 , 2 => 3 , 4 => 1 })
1215
Function: update a contig spectrum object based on a contig spectrum
1216
represented as a hash (key: contig size, value: number of contigs of
1218
Returns : 1 for success, 0 for error
1219
Args : contig spectrum as a hash reference
1223
sub _import_spectrum {
1224
my ($self, $spectrum) = @_;
1226
if( ! ref $spectrum || ! ref $spectrum eq 'HASH') {
1227
$self->throw("Spectrum should be a hash reference, but it is [".
1228
ref($spectrum)."]");
1231
# Update the spectrum (+ nof_rep, max_size and nof_seq)
1232
for my $size (keys %$spectrum) {
1233
# Get the number of contigs of different size
1234
if (defined $self->{'_spectrum'}{$size}) {
1235
$self->{'_spectrum'}{$size} += $$spectrum{$size};
1237
$self->{'_spectrum'}{$size} = $$spectrum{$size};
1240
$self->{'_nof_seq'} += $size * $$spectrum{$size};
1242
$self->{'_max_size'} = $size if $size > $self->{'_max_size'};
1245
# If the contig spectrum has only zero 1-contigs, max_size is zero
1246
$self->{'_max_size'} = 0 if scalar keys %{$self->{'_spectrum'}} == 1 &&
1247
defined $self->{'_spectrum'}{'1'} && $self->{'_spectrum'}{'1'} == 0;
1250
$self->{'_nof_rep'}++;
1254
=head2 _import_dissolved_csp
1256
Title : _import_dissolved_csp
1257
Usage : $csp->_import_dissolved_csp($mixed_csp, $seq_header);
1258
Function: Update a contig spectrum object by dissolving a mixed contig
1259
spectrum based on the header of the sequences
1260
Returns : 1 for success, 0 for error
1261
Args : Bio::Assembly::Tools::ContigSpectrum
1262
sequence header string
1266
sub _import_dissolved_csp {
1267
my ($self, $mixed_csp, $seq_header) = @_;
1269
if (not defined $mixed_csp || not defined $seq_header) {
1270
$self->throw("Expecting a contig spectrum reference and sequence header as".
1273
# Create new object from assembly
1274
my $dissolved_csp = $self->_new_dissolved_csp($mixed_csp, $seq_header);
1275
# Update current contig spectrum object with new one
1276
$self->add($dissolved_csp);
1281
=head2 _import_cross_csp
1283
Title : _import_cross_csp
1284
Usage : $csp->_import_cross_csp($mixed_csp);
1285
Function: Update a contig spectrum object by calculating the cross contig
1286
spectrum based on a mixed contig spectrum
1287
Returns : 1 for success, 0 for error
1288
Args : Bio::Assembly::Tools::ContigSpectrum
1292
sub _import_cross_csp {
1293
my ($self, $mixed_csp) = @_;
1295
if (not defined $mixed_csp) {
1296
$self->throw("Expecting a contig spectrum reference as argument");
1299
# Create new object from assembly
1300
my $cross_csp = $self->_new_cross_csp($mixed_csp);
1302
# Update current contig spectrum object with new one
1303
$self->add($cross_csp);
1309
=head2 _get_seq_stats
1311
Title : _get_seq_stats
1312
Usage : my $seqlength = $csp->_get_seq_stats($assemblyobj);
1313
Function: Get sequence statistics from an assembly:
1314
number of sequences, average sequence length
1315
Returns : number of sequences (integer)
1316
average sequence length (decimal)
1317
Args : assembly object reference
1318
hash reference with the IDs of the sequences to consider [optional]
1322
sub _get_seq_stats {
1323
my ($self, $assemblyobj, $seq_hash) = @_;
1326
$self->throw("Must provide a Bio::Assembly::Scaffold object")
1327
if (!defined $assemblyobj || !$assemblyobj->isa("Bio::Assembly::Scaffold"));
1328
$self->throw("Expecting a hash reference. Got [".ref($seq_hash)."]")
1329
if (defined $seq_hash && ! ref($seq_hash) eq 'HASH');
1331
my $avg_seq_len = 0;
1333
for my $contigobj ($assemblyobj->all_contigs) {
1334
for my $seqobj ($contigobj->each_seq) {
1335
my $seq_id = $seqobj->id;
1336
next if defined $seq_hash && !defined $$seq_hash{$seq_id};
1338
my $seq_string = $seqobj->seq;
1339
$seq_string =~ s/-//g;
1340
$avg_seq_len += length($seq_string);
1343
for my $singletobj ($assemblyobj->all_singlets) {
1344
my $seq_id = $singletobj->seqref->id;
1345
next if defined $seq_hash && !defined $$seq_hash{$seq_id};
1347
my $seq_string = $singletobj->seqref->seq;
1348
$seq_string =~ s/-//g;
1349
$avg_seq_len += length($seq_string);
1351
$avg_seq_len /= $nof_seq unless $nof_seq == 0;
1352
return $nof_seq, $avg_seq_len;
1356
=head2 _get_overlap_stats
1358
Title : _get_overlap_stats
1359
Usage : my ($minlength, $min_identity, $avglength, $avgidentity)
1360
= $csp->_get_overlap_stats($assemblyobj);
1361
Function: Get statistics about pairwise overlaps in contigs of an assembly
1362
Returns : number of overlaps
1363
minimum overlap length
1364
average overlap length
1365
minimum identity percent
1366
average identity percent
1367
Args : assembly object reference
1368
hash reference with the IDs of the sequences to consider [optional]
1372
sub _get_overlap_stats {
1373
my ($self, $assembly_obj, $seq_hash) = @_;
1376
$self->throw("Must provide a Bio::Assembly::Scaffold object")
1377
if (!defined $assembly_obj || !$assembly_obj->isa("Bio::Assembly::Scaffold"));
1378
$self->throw("Expecting a hash reference. Got [".ref($seq_hash)."]")
1379
if (defined $seq_hash && ! ref($seq_hash) eq 'HASH');
1381
my $matchdef = $self->{'_eff_asm_params'};
1382
my ($min_length, $avg_length, $min_identity, $avg_identity, $nof_overlaps)
1383
= (undef, 0, undef, 0, 0);
1385
# Look at all the contigs (and I really mean no singlets!)
1386
for my $contig_obj ($assembly_obj->all_contigs) {
1389
# Look at best overlap possible with previous sequences in contig
1390
my @all_seq_objs = $contig_obj->each_seq;
1391
# sequences should be ordered by starting position
1392
for (my $i = 0 ; $i < scalar(@all_seq_objs) ; $i++) {
1393
my $seq_obj = $all_seq_objs[$i];
1394
my $seq_id = $seq_obj->id;
1396
# skip this sequence if not in list of wanted sequences
1397
next if defined $seq_hash && !defined $$seq_hash{$seq_id};
1400
# skip the first sequence (no other sequence to compare against)
1401
next if $nof_seq <= 1;
1403
# what is the best previous sequence to align to?
1404
my $stats = Bio::Align::PairwiseStatistics->new;
1411
for (my $j = $i-1 ; $j >= 0 ; $j--) {
1412
my $tmp_target_obj = $all_seq_objs[$j];
1413
my $tmp_target_id = $tmp_target_obj->id;
1415
# skip this sequence if not in list of wanted sequences
1416
next if defined $seq_hash && !defined $$seq_hash{$tmp_target_id};
1418
# find overlap with that sequence
1419
my ($aln_obj, $tmp_length, $tmp_identity)
1420
= $self->_overlap_alignment($contig_obj, $seq_obj, $tmp_target_obj);
1421
next if ! defined $aln_obj; # there was no sequence overlap
1422
my $tmp_score = $stats->score_nuc($aln_obj);
1424
# update score and best sequence for overlap
1425
if (!defined $best_score || $best_score < $tmp_score) {
1426
$best_score = $tmp_score;
1427
$best_length = $tmp_length;
1428
$best_identity = $tmp_identity;
1429
$target_obj = $tmp_target_obj;
1430
$target_id = $tmp_target_id;
1434
# Update our overlap statistics
1435
if (defined $best_score) {
1436
$avg_length += $best_length;
1437
$avg_identity += $best_identity;
1438
$min_length = $best_length if ! defined $min_length ||
1439
$best_length < $min_length;
1440
$min_identity = $best_identity if ! defined $min_identity ||
1441
$best_identity < $min_identity;
1448
unless ($nof_overlaps == 0) {
1449
$avg_length /= $nof_overlaps;
1450
$avg_identity /= $nof_overlaps;
1453
return $nof_overlaps, $min_length, $avg_length, $min_identity, $avg_identity;
1457
=head2 _overlap_alignment
1459
Title : _overlap_alignment
1461
Function: Produce an alignment of the overlapping section of two sequences of
1462
a contig. Minimum overlap length and percentage identity can be
1463
specified. Return undef if the sequences do not overlap or do not
1464
meet the minimum overlap criteria.
1465
Return : Bio::SimpleAlign object reference
1466
alignment overlap length
1467
alignment overlap identity
1468
Args : Bio::Assembly::Contig object reference
1469
Bio::LocatableSeq contig sequence 1
1470
Bio::LocatableSeq contig sequence 2
1471
minium overlap length [optional]
1472
minimum overlap percentage identity [optional]
1476
sub _overlap_alignment {
1477
my ($self, $contig, $qseq, $tseq, $min_overlap, $min_identity) = @_;
1478
# get query sequence position
1479
my $qpos = $contig->get_seq_coord($qseq);
1480
my $qstart = $qpos->start;
1481
my $qend = $qpos->end;
1482
# get target sequence position
1483
my $tpos = $contig->get_seq_coord($tseq);
1484
my $tstart = $tpos->start;
1485
my $tend = $tpos->end;
1486
# check that there is an overlap
1487
return if $qstart > $tend || $qend < $tstart;
1488
# get overlap boundaries and check overlap length
1490
$left = $tstart if $qstart < $tstart;
1492
$right = $tend if $qend > $tend;
1493
my $overlap = $right - $left + 1;
1494
return if defined $min_overlap && $overlap < $min_overlap;
1495
# slice query and target sequence to overlap boundaries
1496
my $qleft = $contig->change_coord('gapped consensus', "aligned ".$qseq->id,
1498
my $qright = $qleft + $overlap - 1;
1499
my $qstring = $qseq->seq;
1500
$qstring = substr($qstring, $qleft - 1, $overlap);
1501
my $tleft = $contig->change_coord('gapped consensus', "aligned ".$tseq->id,
1503
my $tright = $tleft + $overlap - 1;
1504
my $tstring = $tseq->seq;
1505
$tstring = substr($tstring, $tleft - 1, $overlap);
1506
# remove gaps present in both sequences at the same position
1507
for (my $pos = 0 ; $pos < $overlap ; $pos++) {
1508
my $qnt = substr($qstring, $pos, 1);
1509
my $tnt = substr($tstring, $pos, 1);
1510
if ($qnt eq '-' && $tnt eq '-') {
1511
substr($qstring, $pos, 1, '');
1512
substr($tstring, $pos, 1, '');
1517
return if defined $min_overlap && $overlap < $min_overlap;
1518
# make an aligned object
1519
my $aln = Bio::SimpleAlign->new;
1520
my $qalseq = Bio::LocatableSeq->new(
1526
$aln->add_seq($qalseq);
1527
my $talseq = Bio::LocatableSeq->new(
1533
$aln->add_seq($talseq);
1534
# check overlap percentage identity
1535
my $identity = $aln->overall_percentage_identity;
1536
return if defined $min_identity && $identity < $min_identity;
1537
# all checks passed, return alignment
1538
return $aln, $overlap, $identity;