1
# module Bio::PopGen::HtSNP.pm
2
# cared by Pedro M. Gomez-Fabre <pgf18872-at-gsk-dot-com>
8
Bio::PopGen::HtSNP.pm- Select htSNP from a haplotype set
12
use Bio::PopGen::HtSNP;
14
my $obj = Bio::PopGen::HtSNP->new($hap,$snp,$pop);
18
Select the minimal set of SNP that contains the full information about
19
the haplotype without redundancies.
21
Take as input the followin values:
25
=item - the haplotype block (array of array).
27
=item - the snp id (array).
29
=item - family information and frequency (array of array).
33
The final haplotype is generated in a numerical format and the SNP's
34
sets can be retrieve from the module.
39
- If you force to include a family with indetermination, the SNP's
40
with indetermination will be removed from the analysis, so consider
41
before to place your data set what do you really want to do.
43
- If two families have the same information (identical haplotype), one
44
of them will be removed and the removed files will be stored classify
47
- Only are accepted for calculation A, C, G, T and - (as deletion) and
48
their combinations. Any other value as n or ? will be considered as
49
degenerations due to lack of information.
53
On a haplotype set is expected that some of the SNP and their
54
variations contribute in the same way to the haplotype. Eliminating
55
redundancies will produce a minimal set of SNP's that can be used as
56
input for a taging selection process. On the process SNP's with the
57
same variation are clustered on the same group.
59
The idea is that because the tagging haplotype process is
60
exponential. All redundant information we could eliminate on the
61
tagging process will help to find a quick result.
65
my $obj = Bio::PopGen::HtSNP->new
66
(-haplotype_block => \@haplotype_patterns,
67
-snp_ids => \@snp_ids,
68
-pattern_freq => \@pattern_name_and_freq);
70
where $hap, $snp and $pop are in the format:
76
]; # haplotype patterns' id
78
my $snp = [qw/s1 s2 s3 s4/]; # snps' Id's
84
]; # haplotype_pattern_id Frequency
88
See Below for more detailed summaries.
93
=head2 How the process is working with one example
95
Let's begin with one general example of the code.
104
The first thing to to is to B<split the haplotype> into characters.
111
Now we have to B<convert> the haplotype to B<Upercase>. This
112
will produce the same SNP if we have input a or A.
119
The program admit as values any combination of ACTG and - (deletions).
120
The haplotype is B<converted to number>, considering the first variation
121
as zero and the alternate value as 1 (see expanded description below).
128
Once we have the haplotype converted to numbers we have to generate the
129
snp type information for the haplotype.
132
B<SNP code = SUM ( value * multiplicity ^ position );>
135
SUM is the sum of the values for the SNP
136
value is the SNP number code (0 [generally for the mayor allele],
137
1 [for the minor allele].
138
position is the position on the block.
140
For this example the code is:
146
------------------------------------------------------------------
147
14 10 12 4 2 14 14 14 14
149
14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
150
12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
153
Once we have the families classify. We will B<take> just the SNP's B<not
158
This information will be B<passed to the tag module> is you want to tag
161
Whatever it happens to one SNPs of a class will happen to a SNP of
162
the same class. Therefore you don't need to scan redundancies
164
=head2 Working with fuzzy data.
166
This module is designed to work with fuzzy data. As the source of the
167
haplotype is diverse. The program assume that some haplotypes can be
168
generated using different values. If there is any indetermination (? or n)
169
or any other degenerated value or invalid. The program will take away
170
This SNP and will leave that for a further analysis.
172
On a complex situation:
182
On this haplotype everything is happening. We have a multialelic variance.
183
We have indeterminations. We have deletions and we have even one SNP
184
which is not a real SNP.
186
The buiding process will be the same on this situation.
188
Convert the haplotype to uppercase.
198
All columns that present indeterminations will be removed from the analysis
201
hapotype after remove columns:
211
All changes made on the haplotype matrix, will be also made on the SNP list.
213
snp_id_1 snp_id_2 snp_id_4 snp_id_6 snp_id_8 snp_id_9
215
now the SNP that is not one SNP will be removed from the analysis.
216
SNP with Id snp_id_4 (the one with all T's).
219
because of the removing. Some of the families will become the same and will
220
be clustered. A posteriori analysis will diference these families.
221
but because of the indetermination can not be distinguish.
231
The result of the mergering will go like:
238
Once again the changes made on the families and we merge the frequency (I<to be
241
Before to convert the haplotype into numbers we consider how many variations
242
we have on the set. On this case the variations are 3.
244
The control code will use on this situation base three as mutiplicity
250
-----------------------------------
253
And the minimal set for this combination is
260
B<NOTE:> this second example is a remote example an on normal conditions. This
261
conditions makes no sense, but as the haplotypes, can come from many sources
262
we have to be ready for all kind of combinations.
269
User feedback is an integral part of the evolution of this and other
270
Bioperl modules. Send your comments and suggestions preferably to
271
the Bioperl mailing list. Your participation is much appreciated.
273
bioperl-l@bioperl.org - General discussion
274
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
276
=head2 Reporting Bugs
278
Report bugs to the Bioperl bug tracking system to help us keep track
279
of the bugs and their resolution. Bug reports can be submitted via
282
http://bugzilla.open-bio.org/
284
=head1 AUTHOR - Pedro M. Gomez-Fabre
286
Email pgf18872-at-gsk-dot-com
291
The rest of the documentation details each of the object methods.
292
Internal methods are usually preceded with a _
296
# Let the code begin...
298
package Bio::PopGen::HtSNP;
300
use Storable qw(dclone);
306
use base qw(Bio::Root::Root);
310
Bio::PopGen::HtSNP->new(-haplotype_block -ids -pattern_freq)
317
Function: constructor of the class.
318
Usage : $obj-> Bio::PopGen::HtSNP->new(-haplotype_block
322
Args : input haplotype (array of array)
324
pop_freq (array of array)
330
my($class, @args) = @_;
332
my $self = $class->SUPER::new(@args);
333
my ($haplotype_block,
335
$pattern_freq ) = $self->_rearrange([qw(HAPLOTYPE_BLOCK
337
PATTERN_FREQ)],@args);
339
if ($haplotype_block){
340
$self->haplotype_block($haplotype_block);
343
$self->throw("Haplotype block has not been defined.
347
$self->snp_ids($snp_ids);
350
$self->throw("Array with ids has not been defined.
354
$self->pattern_freq($pattern_freq);
357
$self->throw("Array with pattern id and frequency has not been defined.
361
# if the input values are not well formed complained and exit.
369
=head2 haplotype_block
371
Title : haplotype_block
372
Usage : my $haplotype_block = $HtSNP->haplotype_block();
373
Function: Get the haplotype block for a haplotype tagging selection
374
Returns : reference of array
375
Args : reference of array with haplotype pattern
382
return $self->{'_haplotype_block'} = shift if @_;
383
return $self->{'_haplotype_block'};
389
Usage : my $snp_ids = $HtSNP->$snp_ids();
390
Function: Get the ids for a haplotype tagging selection
391
Returns : reference of array
392
Args : reference of array with SNP ids
399
return $self->{'_snp_ids'} = shift if @_;
400
return $self->{'_snp_ids'};
407
Usage : my $pattern_freq = $HtSNP->pattern_freq();
408
Function: Get the pattern id and frequency for a haplotype
410
Returns : reference of array
411
Args : reference of array with SNP ids
417
return $self->{'_pattern_freq'} = shift if @_;
418
return $self->{'_pattern_freq'};
424
Usage : _check_input($self)
425
Function: check for errors on the input
432
#------------------------
434
#------------------------
438
_haplotype_length_error($self);
439
_population_error($self);
443
=head2 _haplotype_length_error
445
Title : _haplotype_length_error
446
Usage : _haplotype_length_error($self)
447
Function: check if the haplotype length is the same that the one on the
448
SNP id list. If not break and exit
456
#------------------------
457
sub _haplotype_length_error{
458
#------------------------
462
my $input_block = $self->haplotype_block();
463
my $snp_ids = $self->snp_ids();
466
#############################
468
#############################
469
my $different_haplotype_length = 0;
471
##############################
472
# get parameters used to find
474
##############################
476
my $snp_number = scalar @$snp_ids;
477
my $number_of_families = scalar @$input_block;
478
my $h = 0; # haplotype position
481
############################
484
# if the length differs from the number of ids
485
############################
487
for ($h=0; $h<$#$input_block+1 ; $h++){
488
if (length $input_block->[$h] != $snp_number){
489
$different_haplotype_length = 1;
494
# haploytypes does not have the same length
495
if ($different_haplotype_length){
496
$self->throw("The number of snp ids is $snp_number and ".
497
"the length of the family (". ($h+1) .") [".
498
$input_block->[$h]."] is ".
499
length $input_block->[$h], "\n");
503
=head2 _population_error
506
Title : _population_error
507
Usage : _population_error($self)
508
Function: use input_block and pop_freq test if the number of elements
509
match. If doesn't break and quit.
517
#------------------------
518
sub _population_error{
519
#------------------------
523
my $input_block = $self->haplotype_block();
524
my $pop_freq = $self->pattern_freq();
526
#############################
528
#############################
529
my $pop_freq_elements_error = 0; # matrix bad formed
531
##############################
532
# get parameters used to find
534
##############################
535
my $number_of_families = scalar @$input_block;
537
my $pf = 0; # number of elements on population frequency
538
my $frequency = 0; # population frequency
541
# check if the pop_freq array is well formed and if the number
542
# of elements fit with the number of families
544
#############################
545
# check population frequency
547
# - population frequency matrix need to be well formed
548
# - get the frequency
549
# - calculate number of families on pop_freq
550
#############################
552
for ($pf=0; $pf<$#$pop_freq+1; $pf++){
553
$frequency += $pop_freq->[$pf]->[1];
555
if ( scalar @{$pop_freq->[$pf]} !=2){
556
$p_f_length = scalar @{$pop_freq->[$pf]};
557
$pop_freq_elements_error = 1;
562
###########################
564
###########################
567
# The frequency shouldn't be greater than 1
569
$self->warn("The frequency for this set is $frequency (greater than 1)\n");
572
# the haplotype matix is not well formed
573
if ($pop_freq_elements_error){
574
$self->throw("the frequency matrix is not well formed\n".
575
"\nThe number of elements for pattern ".($pf+1)." is ".
577
"It should be 2 for pattern \"@{$pop_freq->[$pf]}\"\n".
578
"\nFormat should be:\n".
579
"haplotype_id\t frequency\n"
583
# the size does not fit on pop_freq array
584
# with the one in haplotype (input_block)
585
if ($pf != $number_of_families) {
586
$self->throw("The number of patterns on frequency array ($pf)\n".
587
"does not fit with the number of haplotype patterns on \n".
588
"haplotype array ($number_of_families)\n");
596
Usage : _do_it($self)
597
Function: Process the input generating the results.
604
#------------------------
606
#------------------------
610
# first we are goinf to define here all variables we are going to use
611
$self -> {'w_hap'} = [];
612
$self -> {'w_pop_freq'} = dclone ( $self ->pattern_freq() );
613
$self -> {'deg_pattern'} = {};
614
$self -> {'snp_type'} = {}; # type of snp on the set. see below
615
$self -> {'alleles_number'} = 0; # number of variations (biallelic,...)
616
$self -> {'snp_type_code'} = [];
617
$self -> {'ht_type'} = []; # store the snp type used on the htSet
618
$self -> {'split_hap'} = [];
619
$self -> {'snp_and_code'} = [];
622
# we classify the SNP under snp_type
623
$self->{snp_type}->{useful_snp} = dclone ( $self ->snp_ids() );
624
$self->{snp_type}->{deg_snp} = []; # deg snp
625
$self->{snp_type}->{silent_snp} = []; # not a real snp
627
# split the haplotype
628
_split_haplo ($self);
630
# first we convert to upper case the haplotype
631
# to make A the same as a for comparison
632
_to_upper_case( $self -> {w_hap} );
634
#######################################################
635
# check if any SNP has indetermination. If any SNP has
636
# indetermination this value will be removed.
637
#######################################################
638
_remove_deg ( $self );
640
#######################################################
641
# depending of the families you use some SNPs can be
642
# silent. This silent SNP's are not used on the
643
# creation of tags and has to be skipped from the
645
#######################################################
646
_rem_silent_snp ( $self );
648
#######################################################
649
# for the remaining SNP's we have to check if two
650
# families have the same value. If this is true, the families
651
# will produce the same result and therefore we will not find
652
# any pattern. So, the redundant families need to be take
653
# away from the analysis. But also considered for a further
656
# When we talk about a normal haplotype blocks this situation
657
# makes no sense but if we remove one of the snp because the
658
# degeneration two families can became the same.
659
# these families may be analised on a second round
660
#######################################################
662
_find_deg_pattern ( $self );
664
#################################################################
665
# if the pattern list length is different to the lenght of the w_hap
666
# we can tell that tow columns have been considered as the same one
667
# and therefore we have to start to remove the values.
668
# remove all columns with degeneration
670
# For this calculation we don't use the pattern frequency.
671
# All patterns are the same, This selection makes
672
# sense when you have different frequency.
674
# Note: on this version we don't classify the haplotype by frequency
675
# but if you need to do it. This is the place to do it!!!!
677
# In reality you don't need to sort the values because you will remove
678
# the values according to their values.
680
# But as comes from a hash, the order could be different and as a
681
# consequence the code generate on every run of the same set could
682
# differ. That is not important. In fact, does not matter but could
684
#################################################################
686
my @tmp =sort { $a <=> $b}
687
keys %{$self -> {deg_pattern}}; # just count the families
689
# if the size of the list is different to the size of the degenerated
690
# family. There is degeneration. And the redundancies will be
692
if($#tmp != $#{$self -> { w_hap } } ){
693
_keep_these_patterns($self->{w_hap}, \@tmp);
694
_keep_these_patterns($self->{w_pop_freq}, \@tmp);
697
#################################################################
698
# the steps made before about removing snp and cluster families
699
# are just needed pre-process the haplotype before.
701
# Now is when the fun starts.
704
# once we have the this minimal matrix, we have to calculate the
705
# max multipliticy for the values. The max number of alleles found
706
# on the set. A normal haplotype is biallelic but we can not
707
# reject multiple variations.
708
##################################################################
710
_alleles_number ( $self );
712
##################################################################
713
# Now we have to convert the haplotype into number
720
# one haplotype like this transformed into number produce this result
727
##################################################################
729
_convert_to_numbers( $self );
731
###################################################################
732
# The next step is to calculate the type of the SNP.
733
# This process is made based on the position of the SNP, the value
734
# and its multiplicity.
735
###################################################################
737
_snp_type_code( $self );
739
###################################################################
740
# now we have all information we need to calculate the haplotype
742
###################################################################
746
###################################################################
749
# all SNP have a code. but if the SNP is not used this code must
750
# be zero in case of silent SNP. This looks not to informative
751
# because all the information is already there. But this method
752
# compile the full set.
753
###################################################################
755
_snp_and_code_summary( $self );
761
Usage : $obj->input_block()
762
Function: returns input block
763
Returns : reference to array of array
769
#------------------------
771
#------------------------
774
return $self -> {input_block};
780
Usage : $obj->hap_length()
781
Function: get numbers of SNP on the haplotype
788
#------------------------
790
#------------------------
793
return scalar @{$self -> {'_snp_ids'}};
800
Usage : $obj->pop_freq()
801
Function: returns population frequency
802
Returns : reference to array
808
#------------------------
810
#------------------------
813
return $self -> {pop_freq}
821
Usage : $obj->deg_snp()
822
Function: returns snp_removes due to indetermination on their values
823
Returns : reference to array
829
#------------------------
831
#------------------------
833
return $self -> {snp_type} ->{deg_snp};
841
Usage : $obj->snp_type()
842
Function: returns hash with SNP type
843
Returns : reference to hash
849
#------------------------
851
#------------------------
853
return $self -> {snp_type};
861
Usage : $obj->silent_snp()
862
Function: some SNP's are silent (not contibuting to the haplotype)
863
and are not considering for this analysis
864
Returns : reference to a array
870
#------------------------
872
#------------------------
874
return $self -> {snp_type} ->{silent_snp};
882
Usage : $obj->useful_snp()
883
Function: returns list of SNP's that are can be used as htSNP. Some
884
of them can produce the same information. But this is
886
Returns : reference to a array
892
#------------------------
894
#------------------------
896
return $self -> {snp_type} ->{useful_snp};
904
Usage : $obj->ht_type()
905
Function: every useful SNP has a numeric code dependending of its
906
value and position. For a better description see
907
description of the module.
908
Returns : reference to a array
914
#------------------------
916
#------------------------
918
return $self -> {ht_type};
924
Usage : $obj->ht_set()
925
Function: returns the minimal haplotype in numerical format. This
926
haplotype contains the maximal information about the
927
haplotype variations but with no redundancies. It's the
928
minimal set that describes the haplotype.
929
Returns : reference to an array of arrays
935
#------------------------
937
#------------------------
939
return $self -> {w_hap};
945
Title : snp_type_code
946
Usage : $obj->snp_type_code()
947
Function: returns the numeric code of the SNPs that need to be
948
tagged that correspond to the SNP's considered in ht_set.
949
Returns : reference to an array
955
#------------------------
957
#------------------------
959
return $self -> {snp_type_code};
966
Usage : $obj->snp_and_code()
967
Function: Returns the full list of SNP's and the code associate to
968
them. If the SNP belongs to the group useful_snp it keep
969
this code. If the SNP is silent the code is 0. And if the
970
SNP is degenerated the code is -1.
971
Returns : reference to an array of array
977
#------------------------
979
#------------------------
981
return $self -> {'snp_and_code'};
988
Usage : $obj->deg_pattern()
989
Function: Returns the a list with the degenerated haplotype.
990
Sometimes due to degeneration some haplotypes looks
991
the same and if we don't remove them it won't find
993
Returns : reference to a hash of array
999
#------------------------
1001
#------------------------
1004
return $self -> {'deg_pattern'};
1011
Usage : $obj->split_hap()
1012
Function: simple representation of the haplotype base by base
1013
Same information that input haplotype but base based.
1014
Returns : reference to an array of array
1020
#------------------------
1022
#------------------------
1024
return $self -> {'split_hap'};
1029
Title : _split_haplo
1030
Usage : _split_haplo($self)
1031
Function: Take a haplotype and split it into bases
1038
#------------------------
1040
#------------------------
1043
my $in = $self ->{'_haplotype_block'};
1044
my $out = $self ->{'w_hap'};
1046
# split every haplotype and store the result into $out
1048
push @$out, [split (//,$_)];
1051
$self -> {'split_hap'} = dclone ($out);
1054
# internal method to convert the haplotype to uppercase
1057
=head2 _to_upper_case
1060
Title : _to_upper_case
1061
Usage : _to_upper_case()
1062
Function: make SNP or in-dels Upper case
1069
#------------------------
1070
sub _to_upper_case {
1071
#------------------------
1074
foreach my $aref (@$arr){
1075
foreach my $value (@{@$aref} ){
1086
Usage : _remove_deg()
1087
Function: when have a indetermination or strange value this SNP
1089
Returns : haplotype family set and degeneration list
1090
Args : ref to an AoA and a ref to an array
1095
#------------------------
1097
#------------------------
1100
my $hap = $self->{w_hap};
1101
my $snp = $self->{snp_type}->{useful_snp};
1102
my $deg_snp = $self->{snp_type}->{deg_snp};
1104
my $rem = []; # take the position of the array to be removed
1106
# first we work on the columns we have void values
1107
$rem = _find_indet($hap,$rem); # find degenerated columns
1111
# remove column on haplotype
1112
_remove_col($hap,$rem); # remove list
1114
# now remove the values from SNP id
1115
_remove_snp_id($snp,$deg_snp,$rem); # remove list
1120
=head2 _rem_silent_snp
1123
Title : _rem_silent_snp
1124
Usage : _rem_silent_snp()
1125
Function: there is the remote possibilty that one SNP won't be a
1126
real SNP on this situation we have to remove this SNP,
1127
otherwise the program won't find any tag
1129
Args : ref to an AoA and a ref to an array
1134
#------------------------
1135
sub _rem_silent_snp {
1136
#------------------------
1139
my $hap = $self->{w_hap};
1140
my $snp = $self->{snp_type}->{useful_snp};
1141
my $silent_snp = $self->{snp_type}->{silent_snp};
1143
my $rem = []; # store the positions to be removed
1145
#find columns with no variation on the SNP, Real snp?
1146
$rem = _find_silent_snps($hap);
1150
# remove column on haplotype
1151
_remove_col($hap,$rem);
1153
# remove the values from SNP id
1154
_remove_snp_id($snp,$silent_snp,$rem);
1159
=head2 _find_silent_snps
1162
Title : _find_silent_snps
1164
Function: list of snps that are not SNPs. All values for that
1165
SNPs on the set is the same one. Look stupid but can
1166
happend and if this happend you will not find any tag
1173
#------------------------
1174
sub _find_silent_snps{
1175
#------------------------
1178
my $list =[]; # no snp list;
1180
# determine the number of snp by the length of the first row.
1181
# we assume that the matrix is squared.
1182
my $colsn= @{$arr->[0]};
1184
for (my $i=0;$i<$colsn;$i++){
1185
my $different =0; # check degeneration
1187
for my $r (1..$#$arr){
1188
if($arr->[0][$i] ne $arr->[$r][$i]){
1208
Function: find column (SNP) with invalid or degenerated values
1209
and store this values into the second parameter suplied.
1211
Args : ref to AoA and ref to an array
1216
#------------------------
1218
#------------------------
1219
my ($arr, $list)=@_;
1221
foreach my $i(0..$#$arr){
1222
foreach my $j(0..$#{$arr->[$i]}){
1223
unless ($arr->[$i][$j] =~ /[ACTG-]/){
1228
my $found =0; # check if already exist the value
1229
foreach my $k(0..$#$list){
1230
$found =1 if ($list->[$k] eq $j);
1241
@$list = sort { $a <=> $b} @$list;
1250
Function: remove columns contained on the second array from
1253
Args : array of array reference and array reference
1258
#------------------------
1260
#------------------------
1263
foreach my $col (reverse @$rem){
1264
splice @$_, $col, 1 for @$arr;
1269
=head2 _remove_snp_id
1271
Title : _remove_snp_id
1273
Function: remove columns contained on the second array from
1276
Args : array of array reference and array reference
1281
#------------------------
1283
#------------------------
1284
my ($arr,$removed,$rem_list)=@_;
1286
push @$removed, splice @$arr, $_, 1 foreach reverse @$rem_list;
1290
=head2 _find_deg_pattern
1292
Title : _find_deg_pattern
1294
Function: create a list with the degenerated patterns
1301
#------------------------
1302
sub _find_deg_pattern{
1303
#------------------------
1306
my $arr = $self ->{w_hap}; # the working haplotype
1307
my $list = $self ->{'deg_pattern'}; # degenerated patterns
1309
# we have to check all elements
1310
foreach my $i(0..$#$arr){
1311
# is the element has not been used create a key
1312
unless ( _is_on_hash ($list,\$i) ) {
1316
foreach my $j($i+1..$#$arr){
1317
my $comp = compare_arrays($arr->[$i],$arr->[$j]);
1320
# as we have no elements we push this into the list
1321
# check for the first element
1322
my $key = _key_for_value($list,\$i);
1324
push (@{$list->{$key}},$j);
1333
#------------------------
1335
#------------------------
1336
my($hash,$value)=@_;
1338
foreach my $key (keys %$hash){
1339
if( _is_there(\@{$hash->{$key}},$value)){
1345
#------------------------
1347
#------------------------
1348
my($hash,$value)=@_;
1350
foreach my $key (keys %$hash){
1351
if( _is_there(\@{$hash->{$key}},$value)){
1357
#------------------------
1359
#------------------------
1363
foreach my $el (@$arr){
1364
if ($el eq $$value){
1371
=head2 _keep_these_patterns
1374
Title : _keep_these_patterns
1376
Function: this is a basic approach, take a LoL and a list,
1377
keep just the columns included on the list
1379
Args : an AoA and an array
1384
#------------------------
1385
sub _keep_these_patterns{
1386
#------------------------
1389
# by now we just take one of the repetitions but you can weight
1390
# the values by frequency
1394
foreach my $k (@$list){
1395
push @outValues, $arr->[$k];
1398
#make arr to hold the new values
1399
@$arr= @{dclone(\@outValues)};
1404
=head2 compare_arrays
1407
Title : compare_arrays
1409
Function: take two arrays and compare their values
1410
Returns : 1 if the two values are the same
1411
0 if the values are different
1412
Args : an AoA and an array
1417
#------------------------
1418
sub compare_arrays {
1419
#------------------------
1420
my ($first, $second) = @_;
1421
return 0 unless @$first == @$second;
1422
for (my $i = 0; $i < @$first; $i++) {
1423
return 0 if $first->[$i] ne $second->[$i];
1429
=head2 _convert_to_numbers
1432
Title : _convert_to_numbers
1433
Usage : _convert_to_numbers()
1434
Function: tranform the haplotype into numbers. before to do that
1435
we have to consider the variation on the set.
1437
Args : ref to an AoA and a ref to an array
1442
#------------------------
1443
sub _convert_to_numbers{
1444
#------------------------
1447
my $hap_ref = $self->{w_hap};
1448
my $mm = $self->{alleles_number};
1450
# the first element is considered as zero. The first modification
1451
# is consider as one and so on.
1453
my $length = @{ @$hap_ref[0]}; #length of the haplotype
1455
for (my $c = 0; $c<$length;$c++){
1459
for my $r (0..$#$hap_ref){
1461
push @al,$hap_ref->[$r][$c]
1462
unless _is_there(\@al,\$hap_ref->[$r][$c]);
1464
$hap_ref->[$r][$c] = get_position(\@al,\$hap_ref->[$r][$c]);
1470
=head2 _snp_type_code
1473
Title : _snp_type_code
1476
we have to create the snp type code for each version.
1477
The way the snp type is created is the following:
1479
we take the number value for every SNP and do the
1480
following calculation
1482
let be a SNP set as follow:
1489
on this case the situation is:
1491
sum (value * multiplicity ^ position) for each SNP
1493
0 * 3 ^ 0 + 1 * 3 ^ 1 + 1 * 3 ^ 2 = 12
1494
0 * 3 ^ 0 + 1 * 3 ^ 1 + 2 * 3 ^ 2 = 21
1501
#------------------------
1503
#------------------------
1506
my $hap = $self->{w_hap};
1507
my $arr = $self->{snp_type_code};
1508
my $al = $self->{alleles_number};
1510
my $length = @{ $hap->[0]}; #length of the haplotype
1512
for (my $c=0; $c<$length; $c++){
1513
for my $r (0..$#$hap){
1514
$arr->[$c] += $hap->[$r][$c] * $al ** $r;
1519
#################################################
1520
# return the position of an element in one array
1521
# The element is always present on the array
1522
#################################################
1524
#------------------------
1526
#------------------------
1528
my($array, $value)=@_;
1530
for my $i(0..$#$array) {
1531
if ($array->[$i] eq $$value){
1539
=head2 _alleles_number
1542
Title : _alleles_number
1544
Function: calculate the max number of alleles for a haplotype and
1545
if the number. For each SNP the number is stored and the
1546
max number of alleles for a SNP on the set is returned
1547
Returns : max number of alleles (a scalar storing a number)
1553
#------------------------
1554
sub _alleles_number{
1555
#------------------------
1559
my $hap_ref = $self ->{w_hap}; # working haplotype
1561
my $length = @{ @$hap_ref[0]}; # length of the haplotype
1563
for (my $c = 0; $c<$length;$c++){
1567
for my $r (0..$#$hap_ref){
1568
$alleles{ $hap_ref->[$r][$c] } =1; # new key for every new snp
1571
# if the number of alleles for this column is
1572
# greater than before set $m value as allele number
1573
if ($self->{alleles_number} < keys %alleles) {
1574
$self->{alleles_number} = keys %alleles;
1585
Function: calculate the minimal set that contains all information of the
1588
Args : ref to an AoA and a ref to an array
1593
#------------------------
1595
#------------------------
1598
my $hap = $self->{'w_hap'};
1599
my $type = $self->{'snp_type_code'};
1600
my $set = $self->{'ht_type'};
1601
my $out = []; # store the minimal set
1603
my $nc=0; # new column for the output values
1605
# pass for every value of the snp_type_code
1606
for my $c (0..$#$type){
1610
# every new value (not present) is pushed into set
1611
if ( ! _is_there( $set,\$type->[$c] ) ){
1612
push @$set, $type->[$c];
1616
for my $r(0..$#$hap){
1617
#save value of the snp for every SNP
1618
$out->[$r][$nc]= $hap->[$r][$c];
1622
if ($exist){ $nc++ };
1625
@$hap = @{dclone $out};
1628
=head2 _snp_and_code_summary
1630
Title : _snp_and_code_summary
1631
Usage : _snp_and_code_summary()
1632
Function: compile on a list all SNP and the code for each. This
1633
information can be also obtained combining snp_type and
1634
snp_type_code but on these results the information about
1635
the rest of SNP's are not compiled as table.
1637
0 will be silent SNPs
1638
-1 are degenerated SNPs
1639
and the rest of positive values are the code for useful SNP
1642
Args : ref to an AoA and a ref to an array
1647
#------------------------
1648
sub _snp_and_code_summary{
1649
#------------------------
1652
my $snp_type_code = $self->{'snp_type_code'};
1653
my $useful_snp = $self->{'snp_type'}->{'useful_snp'};
1654
my $silent_snp = $self->{'snp_type'}->{'silent_snp'};
1655
my $deg_snp = $self->{'snp_type'}->{'deg_snp'};
1656
my $snp_ids = $self->snp_ids();
1657
my $snp_and_code = $self->{'snp_and_code'};
1659
# walk all SNP's and generate code for each
1661
# do a practical thing. Consider all snp silent
1662
foreach my $i (0..$#$snp_ids){
1664
# assign zero to silent
1668
foreach my $j (0..$#$useful_snp){
1669
if ($snp_ids->[$i] eq $useful_snp->[$j]){
1670
$value = $snp_type_code->[$j];
1675
# assign -1 to degenerated
1676
foreach my $j (0..$#$deg_snp){
1677
if ($snp_ids->[$i] eq $deg_snp->[$j]){
1683
push @$snp_and_code, [$snp_ids->[$i], $value];