38
39
5. Trie [Qiime team, unpublished], which collapses identical sequences and sequences which are subsequences of other sequences.
40
6. uclust (Robert Edgar, unpublished, 2009), creates \"seeds\" of sequences which generate clusters based on percent identity.
42
7. usearch (Robert Edgar, unpublished, 2011), creates \"seeds\" of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection.
41
6. uclust (Edgar, RC 2010), creates \"seeds\" of sequences which generate clusters based on percent identity.
43
7. uclust_ref (Edgar, RC 2010), as uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off.
45
8. usearch (Edgar, RC 2010, version v5.2.236), creates \"seeds\" of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection.
47
9. usearch_ref (Edgar, RC 2010, version v5.2.236), as usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off.
49
The quality filtering pipeline for usearch 5.X is called usearch_qf ("usearch quality filter") and is described here: http://qiime.org/tutorials/usearch_quality_filter.html
51
8. usearch61 (Edgar, RC 2010, version v6.1.544), creates \"seeds\" of sequences which generate clusters based on percent identity.
53
9. usearch61_ref (Edgar, RC 2010, version v6.1.544), as usearch61, but takes a reference database to use as seeds. New clusters can be toggled on or off.
55
Chimera checking with usearch 6.X is implemented in identify_chimeric_seqs.py. Chimera checking should be done first, and the resulting filtered fasta file can then be clustered.
44
57
The primary inputs for pick_otus.py are:
52
65
pick_otus.py takes a standard fasta file as input.
55
69
script_info['script_usage'] = []
57
script_info['script_usage'].append(("""Example (uclust method, default):""","""Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory \"picked_otus/\", while using default parameters (0.97 sequence similarity, no reverse strand matching):""","""pick_otus.py -i seqs.fna -o picked_otus/"""))
60
script_info['script_usage'].append(("""""","""To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the script would be the following:""","""pick_otus.py -i seqs.fna -o picked_otus/ -s 0.90 -z"""))
62
script_info['script_usage'].append(("""Uclust Reference-based OTU picking example""","""uclust_ref can be passed via -m to pick OTUs against a reference set where sequences within the similarity threshold to a reference sequence will cluster to an OTU defined by that reference sequence, and sequences outside of the similarity threshold to a reference sequence will form new clusters. OTU identifiers will be set to reference sequence identifiers when sequences cluster to reference sequences, and 'qiime_otu_<integer>' for new OTUs. Creation of new clusters can be suppressed by passing -C, in which case sequences outside of the similarity threshold to any reference sequence will be listed as failures in the log file, and not included in any OTU.""","""pick_otus.py -i seqs.fna -r core_set_unaligned.fasta_11_8_07 -m uclust_ref"""))
64
script_info['script_usage'].append(("""Example (cdhit method):""","""Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory \"picked_otus/\", while using default parameters (0.97 sequence similarity, no prefix filtering):""","""pick_otus.py -i seqs.fna -m cdhit -o picked_otus/"""))
66
script_info['script_usage'].append(("""""","""Currently the cd-hit OTU picker allows for users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly increase the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:""","""pick_otus.py -i seqs.fna -m cdhit -o picked_otus/ -n 100"""))
68
script_info['script_usage'].append(("""""","""Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences prior to OTU picking, they can use the trie prefiltering (\"-t\") option as shown by the following command:""","""pick_otus.py -i seqs.fna -m cdhit -o picked_otus/ -t"""))
70
script_info['script_usage'].append(("""""","""Note: It is highly recommended to use one of the prefiltering methods when analyzing large dataset (>100,000 seqs) to reduce run-time.""",""""""))
71
script_info['script_usage'].append(("""Example (uclust method, default):""","""Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory \"picked_otus_default/\", while using default parameters (0.97 sequence similarity, no reverse strand matching):""","""%prog -i seqs.fna -o picked_otus_default"""))
73
script_info['script_usage'].append(("""""","""To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the command would be the following:""","""%prog -i seqs.fna -o picked_otus_90_percent_rev/ -s 0.90 -z"""))
75
script_info['script_usage'].append(("""Uclust Reference-based OTU picking example""","""uclust_ref can be passed via -m to pick OTUs against a reference set where sequences within the similarity threshold to a reference sequence will cluster to an OTU defined by that reference sequence, and sequences outside of the similarity threshold to a reference sequence will form new clusters. OTU identifiers will be set to reference sequence identifiers when sequences cluster to reference sequences, and 'qiime_otu_<integer>' for new OTUs. Creation of new clusters can be suppressed by passing -C, in which case sequences outside of the similarity threshold to any reference sequence will be listed as failures in the log file, and not included in any OTU.""","""%prog -i seqs.fna -r refseqs.fasta -m uclust_ref --uclust_otu_id_prefix qiime_otu_"""))
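The reference-based behaviour described in the usage text above (cluster to a reference within the similarity threshold, otherwise start a new cluster, or record a failure when new clusters are suppressed) can be sketched as follows. This is a toy illustration only, not how uclust works internally: `percent_identity` is a hypothetical position-by-position comparison, and `pick_otus_by_reference` and its parameters are names invented for this example.

```python
def percent_identity(a, b):
    """Toy identity: matching positions divided by the longer length."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(max(len(a), len(b)))


def pick_otus_by_reference(seqs, refs, similarity=0.97,
                           suppress_new_clusters=False,
                           new_otu_prefix="qiime_otu_"):
    """Assign each read to a reference OTU, a new OTU, or the failures list.

    seqs and refs are lists of (identifier, sequence) pairs.
    Returns (otus, failures), where otus maps OTU id -> list of read ids.
    """
    otus = {}
    failures = []
    new_seeds = []  # (otu_id, sequence) pairs created for unmatched reads
    for seq_id, seq in seqs:
        best_id, best_score = None, 0.0
        # Compare against the references and any new seeds created so far.
        for cand_id, cand_seq in list(refs) + new_seeds:
            score = percent_identity(seq, cand_seq)
            if score >= similarity and score > best_score:
                best_id, best_score = cand_id, score
        if best_id is None:
            if suppress_new_clusters:
                # Mirrors the -C behaviour: the read joins no OTU.
                failures.append(seq_id)
                continue
            best_id = "%s%d" % (new_otu_prefix, len(new_seeds))
            new_seeds.append((best_id, seq))
        otus.setdefault(best_id, []).append(seq_id)
    return otus, failures
```

With `suppress_new_clusters=True`, reads that match no reference end up only in the returned failures list, which corresponds to the entries the usage text says are written to the log file.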
77
script_info['script_usage'].append(("""Example (cdhit method):""","""Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory \"cdhit_picked_otus/\", while using default parameters (0.97 sequence similarity, no prefix filtering):""","""%prog -i seqs.fna -m cdhit -o cdhit_picked_otus/"""))
79
script_info['script_usage'].append(("""""","""Currently the cd-hit OTU picker allows for users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly decrease the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:""","""%prog -i seqs.fna -m cdhit -o cdhit_picked_otus_filter/ -n 100"""))
81
script_info['script_usage'].append(("""""","""Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences prior to OTU picking, they can use the trie prefiltering (\"-t\") option as shown by the following command.
83
Note: It is highly recommended to use one of the prefiltering methods when analyzing large datasets (>100,000 seqs) to reduce run-time.""","""%prog -i seqs.fna -m cdhit -o cdhit_picked_otus_trie_prefilter/ -t"""))
73
85
script_info['script_usage'].append(("""BLAST OTU-Picking Example:""","""OTUs can be picked against a reference database using the BLAST OTU picker. This is useful, for example, when different regions of the SSU RNA have been sequenced and a sequence similarity based approach like cd-hit therefore wouldn't work. When using the BLAST OTU picking method, the user must supply either a reference set of sequences or a reference database to compare against. The OTU identifiers resulting from this step will be the sequence identifiers in the reference database. This allows for use of a pre-existing tree in downstream analyses, which again is useful in cases where different regions of the 16S gene have been sequenced.
75
The following command can be used to blast against a reference sequence set, using the default E-value and sequence similarity (0.97) parameters:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m blast -r ref_seq_set.fna"""))
77
script_info['script_usage'].append(("""""","""If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m blast -b ref_database"""))
79
script_info['script_usage'].append(("""""","""If the user would like to change the sequence similarity (\"-s\") and/or the E-value (\"-e\") for the blast method, they can use the following command:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m blast -s 0.90 -e 1e-30"""))
81
script_info['script_usage'].append(("""Prefix-suffix OTU Picking Example:""","""OTUs can be picked by collapsing sequences which being and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m prefix_suffix -p 50 -u 25"""))
87
The following command can be used to blast against a reference sequence set, using the default E-value and sequence similarity (0.97) parameters:""","""%prog -i seqs.fna -o blast_picked_otus/ -m blast -r refseqs.fasta"""))
89
script_info['script_usage'].append(("""""","""If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command:""","""%prog -i seqs.fna -o blast_picked_otus_prebuilt_db/ -m blast -b refseqs.fasta"""))
91
script_info['script_usage'].append(("""""","""If the user would like to change the sequence similarity (\"-s\") and/or the E-value (\"-e\") for the blast method, they can use the following command:""","""%prog -i seqs.fna -o blast_picked_otus_90_percent/ -m blast -r refseqs.fasta -s 0.90 -e 1e-30"""))
93
script_info['script_usage'].append(("""Prefix-suffix OTU Picking Example:""","""OTUs can be picked by collapsing sequences which begin and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, the user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:""","""%prog -i seqs.fna -o prefix_suffix_picked_otus/ -m prefix_suffix -p 50 -u 25"""))
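The prefix-suffix collapsing described above amounts to grouping sequences by the pair (first p bases, last u bases). A minimal sketch, assuming sequences are available as (identifier, sequence) pairs; the function name `collapse_by_prefix_suffix` is hypothetical and is not the picker's actual implementation:

```python
from collections import defaultdict


def collapse_by_prefix_suffix(seqs, prefix_length=50, suffix_length=25):
    """Group sequence ids whose prefix and suffix are both identical.

    seqs is an iterable of (identifier, sequence) pairs.
    Returns a list of clusters, each a list of sequence identifiers.
    """
    clusters = defaultdict(list)
    for seq_id, seq in seqs:
        prefix = seq[:prefix_length]
        # seq[-0:] would return the whole sequence, so handle 0 explicitly.
        suffix = seq[-suffix_length:] if suffix_length > 0 else ""
        clusters[(prefix, suffix)].append(seq_id)
    return list(clusters.values())
```

Calling it with `prefix_length=50, suffix_length=25` corresponds to the `-p 50 -u 25` command in the usage example.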
83
95
script_info['script_usage'].append(("""Mothur OTU Picking Example:""","""The Mothur program (http://www.mothur.org/) provides three clustering algorithms for OTU formation: furthest-neighbor (complete linkage), average-neighbor (group average), and nearest-neighbor (single linkage). Details on the algorithms may be found on the Mothur website and publications (Schloss et al., 2009). However, the running times of Mothur's clustering algorithms scale with the number of sequences squared, so the program may not be feasible for large data sets.
85
The following command may be used to create OTU's based on a furthest-neighbor algorithm (the default setting):""","""pick_otus.py -i seqs.fna -o picked_otus/ -m mothur"""))
87
script_info['script_usage'].append(("""""","""If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the '-c' flag:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m mothur -c nearest"""))
89
script_info['script_usage'].append(("""""","""The sequence similarity parameter may also be specified. For example, the following command may be used to create OTU's at the level of 95% similarity:""","""pick_otus.py -i seqs.fna -o picked_otus/ -m mothur -s 0.90"""))
91
script_info['script_usage'].append(("""Usearch_qf ('usearch quality filter')""","""Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering.""",""""""))
93
script_info['script_usage'].append(("""Standard usearch (usearch_qf) example:""","""""","""pick_otus.py -i seqs.fna -m usearch --word_length 64 --db_filepath reference_sequence_filepath -o usearch_qf_results/"""))
95
script_info['script_usage'].append(("""Usearch (usearch_qf) example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:""","""""","""pick_otus.py -i seqs.fna -m usearch --word_length 64 --reference_chimera_detection --minsize 2 -o usearch_qf_results/"""))
97
The following command may be used to create OTUs based on a furthest-neighbor algorithm (the default setting) using aligned sequences as input:""","""%prog -i seqs.aligned.fna -o mothur_picked_otus/ -m mothur"""))
99
script_info['script_usage'].append(("""""","""If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the '-c' flag:""","""%prog -i seqs.aligned.fna -o mothur_picked_otus_nn/ -m mothur -c nearest"""))
101
script_info['script_usage'].append(("""""","""The sequence similarity parameter may also be specified. For example, the following command may be used to create OTUs at the level of 90% similarity:""","""%prog -i seqs.aligned.fna -o mothur_picked_otus_90_percent/ -m mothur -s 0.90"""))
103
script_info['script_usage'].append(("""usearch ""","""Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering. The following command specifies a minimum cluster size of 2 to be used during cluster size filtering:""","""%prog -i seqs.fna -m usearch --word_length 64 --db_filepath refseqs.fasta -o usearch_qf_results/ --minsize 2"""))
105
script_info['script_usage'].append(("""usearch example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:""","""""","""%prog -i seqs.fna -m usearch --word_length 64 --suppress_reference_chimera_detection --minsize 2 -o usearch_qf_results_no_ref_chim_detection/"""))
97
107
script_info['output_description'] = """The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an (arbitrary) cluster identifier, and the remaining fields correspond to sequence identifiers assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. Usearch (i.e. usearch quality filter) can additionally have log files for each intermediate call to usearch.
110
120
The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file."""
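Because the .txt output is described as tab-delimited lines with the cluster identifier first and the member sequence identifiers after it, reading it back takes only a few lines. A minimal sketch; `parse_otu_map` is a hypothetical helper written for this example, not a function taken from the script:

```python
def parse_otu_map(lines):
    """Map each cluster identifier to the list of sequence ids assigned to it.

    lines is any iterable of strings, for example an open file handle.
    """
    otu_to_seqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        otu_to_seqs[fields[0]] = fields[1:]
    return otu_to_seqs
```

For example, `parse_otu_map(open("picked_otus/seqs_otus.txt"))` would return one dictionary entry per OTU, assuming the output path used in the usage examples.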
112
122
script_info['required_options'] = [
113
make_option('-i', '--input_seqs_filepath',
123
make_option('-i', '--input_seqs_filepath',type='existing_filepath',
114
124
help='Path to input sequences file'),
117
127
script_info['optional_options'] = [
118
128
make_option('-m', '--otu_picking_method', type='choice',
119
129
choices=otu_picking_method_choices, default = "uclust",
120
help=('Method for picking OTUs. Valid choices are: ' +\
121
', '.join(otu_picking_method_choices) +\
122
'. The mothur method requires an input file ' +\
130
help=('Method for picking OTUs. Valid choices are: ' +
131
', '.join(otu_picking_method_choices) +
132
'. The mothur method requires an input file '
123
133
'of aligned sequences. usearch will enable the usearch quality '
124
134
'filtering pipeline. [default: %default]')),
126
136
make_option('-c', '--clustering_algorithm', type='choice',
127
137
choices=MothurOtuPicker.ClusteringAlgorithms, default='furthest',
128
help=('Clustering algorithm for mothur otu picking method. Valid ' +\
130
', '.join(MothurOtuPicker.ClusteringAlgorithms) +\
138
help=('Clustering algorithm for mothur otu picking method. Valid '
140
', '.join(MothurOtuPicker.ClusteringAlgorithms) +
131
141
'. [default: %default]')),
133
make_option('-M', '--max_cdhit_memory', type=int, default=400,
143
make_option('-M', '--max_cdhit_memory', type='int', default=400,
134
144
help=('Maximum available memory to cd-hit-est (via the program\'s -M '
135
145
'option) for cdhit OTU picking method (units of Mbyte) '
136
146
'[default: %default]')),
138
make_option('-o', '--output_dir',\
148
make_option('-o', '--output_dir',type='new_dirpath',
139
149
help=('Path to store result file '
140
150
'[default: ./<OTU_METHOD>_picked_otus/]')),
142
make_option('-r', '--refseqs_fp',
152
make_option('-r', '--refseqs_fp',type='existing_filepath',
143
153
help=('Path to reference sequences to search against when using -m '
144
'blast, -m uclust_ref, or -m usearch_ref [default: %default]')),
154
'blast, -m uclust_ref, -m usearch_ref, or -m '
155
'usearch61_ref [default: %default]')),
146
make_option('-b', '--blast_db',
157
make_option('-b', '--blast_db',type='blast_db',
147
158
help=('Pre-existing database to blast against when using -m blast '
148
159
'[default: %default]')),
150
161
make_option('--min_aligned_percent',
151
help=('Minimum percent of query sequence that can be aligned to consider a hit '
152
' (BLAST OTU picker only) [default: %default]'),default=0.50,type='float'),
162
help=('Minimum percent of query sequence that can be aligned to '
163
'consider a hit (BLAST OTU picker only) [default: %default]'),
164
default=0.50,type='float'),
154
166
make_option('-s', '--similarity', type='float', default=0.97,
155
help=('Sequence similarity threshold (for cdhit, uclust, uclust_ref, or'
156
' usearch) [default: %default]')),
167
help=('Sequence similarity threshold (for blast, cdhit, uclust, '
168
'uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref'
169
') [default: %default]')),
158
171
make_option('-e', '--max_e_value', type='float', default=1e-10,
159
172
help=('Max E-value when clustering with BLAST [default: %default]')),
177
190
'large sequence collections where OTU picking doesn\'t scale '
178
191
'well [default: %default]')),
180
make_option('-p', '--prefix_length', type=int, default=50,
193
make_option('-p', '--prefix_length', type='int', default=50,
181
194
help=('Prefix length when using the prefix_suffix otu picker; '
182
'WARNING: CURRENTLY DIFFERENT FROM prefix_prefilter_length (-n)! '
183
'[default: %default]')),
195
'WARNING: CURRENTLY DIFFERENT FROM prefix_prefilter_length '
196
'(-n)! [default: %default]')),
185
make_option('-u', '--suffix_length', type=int, default=50,
198
make_option('-u', '--suffix_length', type='int', default=50,
186
199
help=('Suffix length when using the prefix_suffix otu picker '
187
200
'[default: %default]')),
189
202
make_option('-z', '--enable_rev_strand_match', action='store_true',
191
help=('Enable reverse strand matching for uclust otu picking, '
204
help=('Enable reverse strand matching for uclust, uclust_ref, '
205
'usearch, usearch_ref, usearch61, or usearch61_ref otu picking, '
192
206
'will double the amount of memory used. [default: %default]')),
194
make_option('-D','--suppress_presort_by_abundance_uclust', action='store_true',
196
help=('Suppress presorting of sequences by abundance when picking'
208
make_option('-D','--suppress_presort_by_abundance_uclust',
211
help=('Suppress presorting of sequences by abundance when picking'
197
212
' OTUs with uclust or uclust_ref [default: %default]')),
199
214
make_option('-A','--optimal_uclust', action='store_true',
214
229
make_option('-C','--suppress_new_clusters',action='store_true',
216
help="Suppress creation of new clusters using seqs that don't" +
217
" match reference when using -m uclust_ref or "+
231
help="Suppress creation of new clusters using seqs that don't"
232
" match reference when using -m uclust_ref, -m usearch61_ref, or "
218
233
"-m usearch_ref [default: %default]"),
220
make_option('--max_accepts',type='int',default=20,
221
help="max_accepts value to uclust and "
222
"uclust_ref [default: %default]"),
235
make_option('--max_accepts', default='default',
236
help="max_accepts value to uclust, uclust_ref, usearch61, and "
237
"usearch61_ref. By default, will use value suggested by "
238
"method (uclust: 20, usearch61: 1) [default: %default]"),
224
make_option('--max_rejects',type='int',default=500,
225
help="max_rejects value to uclust and "
226
"uclust_ref [default: %default]"),
240
make_option('--max_rejects', default='default',
241
help="max_rejects value for uclust, uclust_ref, usearch61, and "
242
"usearch61_ref. With default settings, will use value "
243
"recommended by clustering method used "
244
"(uclust: 500, usearch61: 8 for usearch_fast_cluster option,"
245
" 32 for reference and smallmem options) "
246
"[default: %default]"),
228
248
make_option('--stepwords',type='int',default=20,
229
249
help="stepwords value to uclust and "
230
250
"uclust_ref [default: %default]"),
232
make_option('--word_length',type='int',default=12,
233
help="w value to usearch, uclust, and "
234
"uclust_ref. Set to 64 for usearch. [default: %default]"),
252
make_option('--word_length',default='default',
253
help="word length value for uclust, uclust_ref, "
254
"usearch, usearch_ref, usearch61, and usearch61_ref. "
255
"With default setting, will use the setting recommended by "
256
"the method (uclust: 12, usearch: 64, usearch61: 8). int "
257
"value can be supplied to override this setting. "
258
"[default: %default]"),
236
make_option('--uclust_otu_id_prefix',default=None,
260
make_option('--uclust_otu_id_prefix',default="denovo",type='string',
237
261
help=("OTU identifier prefix (string) for the de novo uclust"
238
" OTU picker [default: %default, OTU ids are ascending"
262
" OTU picker and for new clusters when uclust_ref is used "
263
"without -C [default: %default, OTU ids are ascending"
241
make_option('--uclust_stable_sort',default=True,action='store_true',
242
help=("Deprecated: stable sort enabled by default, pass "
243
"--uclust_suppress_stable_sort to disable [default: %default]")),
245
266
make_option('--suppress_uclust_stable_sort',default=False,
246
267
action='store_true', help=("Don't pass --stable-sort to "
247
268
"uclust [default: %default]")),
254
275
help=("Enable preservation of intermediate uclust (.uc) files "
255
276
"that are used to generate clusters via uclust. Also enables "
256
277
"preservation of all intermediate files created by usearch "
257
"(usearch_qf). [default: %default]")),
278
" and usearch61. [default: %default]")),
259
make_option('-j', '--percent_id_err', default=0.97, help=("Percent identity"
260
" threshold for cluster error detection with usearch_qf. "
261
"[default: %default]"), type='float'),
280
make_option('-j', '--percent_id_err', default=0.97,
281
help=("Percent identity threshold for cluster error detection "
282
"with usearch. [default: %default]"), type='float'),
263
284
make_option('-g', '--minsize', default=4, help=("Minimum cluster size "
264
"for size filtering with usearch_qf. [default: %default]"),
285
"for size filtering with usearch. [default: %default]"),
267
288
make_option('-a','--abundance_skew', default=2.0, help=("Abundance skew "
268
"setting for de novo chimera detection with usearch_qf. "
289
"setting for de novo chimera detection with usearch. "
269
290
"[default: %default]"), type='float'),
271
make_option('-f', '--db_filepath', default=None, help=("Reference database "
272
"of fasta sequences for reference based chimera detection with "
273
"usearch_qf. [default: %default]")),
292
make_option('-f', '--db_filepath',type='existing_filepath', default=None,
293
help=("Reference database of fasta sequences for reference "
294
"based chimera detection with usearch. [default: %default]")),
275
296
make_option('--perc_id_blast', default=0.97, help=("Percent ID for "
276
"mapping OTUs created by usearch_qf back to original sequence"
297
"mapping OTUs created by usearch back to original sequence"
277
298
" IDs [default: %default]"), type='float'),
279
make_option('-k', '--de_novo_chimera_detection', default=True, help=(
280
"Perform de novo chimera detection in usearch_qf. "
281
"[default: %default]"), action='store_false'),
283
make_option('-x', '--reference_chimera_detection', default=True,
284
help=("Perform reference based chimera detection in usearch_qf. "
285
"[default: %default]"), action='store_false'),
287
make_option('-l', '--cluster_size_filtering', default=True, help=("Perform "
288
"cluster size filtering in usearch_qf. [default: %default]"),
289
action='store_false'),
300
make_option('--de_novo_chimera_detection', help=(
301
"Deprecated: de novo chimera detection performed by default, "
302
"pass --suppress_de_novo_chimera_detection to disable."
303
" [default: %default]")),
305
make_option('-k', '--suppress_de_novo_chimera_detection', default=False,
306
help=("Suppress de novo chimera detection in usearch. "
307
"[default: %default]"), action='store_true'),
309
make_option('--reference_chimera_detection',
310
help=("Deprecated: Reference based chimera detection performed "
311
"by default, pass --suppress_reference_chimera_detection to "
312
"disable [default: %default]")),
314
make_option('-x', '--suppress_reference_chimera_detection', default=False,
315
help=("Suppress reference based chimera detection in usearch. "
316
"[default: %default]"), action='store_true'),
318
make_option('--cluster_size_filtering', help=("Deprecated, "
319
"cluster size filtering enabled by default, pass "
320
"--suppress_cluster_size_filtering to disable."
321
" [default: %default]")),
323
make_option('-l', '--suppress_cluster_size_filtering', default=False,
324
help=("Suppress cluster size filtering in usearch. "
325
"[default: %default]"), action='store_true'),
291
327
make_option('--remove_usearch_logs', default=False, help=("Disable "
292
328
"creation of logs when usearch is called. Up to nine logs are "
301
337
make_option('-F', '--non_chimeras_retention', default='union',
303
339
"subsets of sequences detected as non-chimeras to retain after "
304
"de novo and refernece based chimera detection. Options are "
340
"de novo and reference based chimera detection. Options are "
305
341
"intersection or union. union will retain sequences that are "
306
342
"flagged as non-chimeric from either filter, while intersection "
307
343
"will retain only those sequences that are flagged as non-"
308
344
"chimeras from both detection methods. [default: %default]"),
347
make_option('--minlen', default=64, help=("Minimum length of sequence "
348
"allowed for usearch, usearch_ref, usearch61, and "
349
"usearch61_ref. [default: %default]"), type='int'),
351
make_option('--usearch_fast_cluster', default=False, help=("Use fast "
352
"clustering option for usearch or usearch61_ref with new "
353
"clusters. --enable_rev_strand_match can not be enabled "
354
"with this option, and the only valid option for "
355
"usearch61_sort_method is 'length'. This option uses more "
356
"memory than the default option for de novo clustering."
357
" [default: %default]"), action='store_true'),
359
make_option('--usearch61_sort_method', default='abundance', help=(
360
"Sorting method for usearch61 and usearch61_ref. Valid "
361
"options are abundance, length, or None. If the "
362
"--usearch_fast_cluster option is enabled, the only sorting "
363
"method allowed is length. [default: %default]"), type='str'),
365
make_option('--sizeorder', default=False, help=(
366
"Enable size based preference in clustering with usearch61. "
367
"Requires that --usearch61_sort_method be abundance. "
368
"[default: %default]"), action='store_true')
312
372
script_info['version'] = __version__
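The `'default'` sentinel used by `--word_length` and `--max_rejects` is resolved to a method-specific value after parsing. A standalone sketch of that resolution, using only the values quoted in the help strings above (word length: uclust 12, usearch 64, usearch61 8; max rejects: uclust 500, usearch61 8 with fast clustering, otherwise 32). The function names are hypothetical, and falling through to the usearch61 value for any other method is an assumption based on the branch order visible in this diff:

```python
WORD_LENGTH_DEFAULTS = {
    "uclust": 12, "uclust_ref": 12,
    "usearch": 64, "usearch_ref": 64,
}
USEARCH61_WORD_LENGTH = 8  # used for usearch61, usearch61_ref, and anything else


def resolve_word_length(value, method):
    """Return an int word length, replacing 'default' by the method's value."""
    if value != "default":
        try:
            return int(value)
        except ValueError:
            raise ValueError("--word_length must either be 'default' "
                             "or an int value")
    return WORD_LENGTH_DEFAULTS.get(method, USEARCH61_WORD_LENGTH)


def resolve_max_rejects(value, method, usearch_fast_cluster=False):
    """Return an int max_rejects, replacing 'default' by the method's value."""
    if value != "default":
        return int(value)
    if method in ("uclust", "uclust_ref"):
        return 500
    # usearch61 settings depend on the fast clustering option.
    return 8 if usearch_fast_cluster else 32
```

Keeping the option as a string and converting late is what lets one flag carry different defaults per clustering method; a plain `type='int'` option could only have a single default.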
315
375
# Parse the command line parameters
316
option_parser, opts, args =\
317
parse_command_line_parameters(**script_info)
376
option_parser, opts, args = parse_command_line_parameters(**script_info)
319
378
# Create local copies of the options to avoid repetitive lookups
320
379
prefix_prefilter_length = opts.prefix_prefilter_length
344
403
chimeras_retention = opts.non_chimeras_retention
345
404
verbose = opts.verbose
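The `union` and `intersection` choices for `--non_chimeras_retention` map directly onto set operations over the sequences that each chimera filter flagged as non-chimeric. A minimal sketch, assuming each filter yields a collection of sequence identifiers; `retain_non_chimeras` is a hypothetical name used only for this example:

```python
def retain_non_chimeras(de_novo_non_chimeras, ref_non_chimeras,
                        retention="union"):
    """Combine the non-chimeric ids reported by the two detection methods."""
    de_novo = set(de_novo_non_chimeras)
    ref = set(ref_non_chimeras)
    if retention == "union":
        # Keep ids flagged as non-chimeric by either filter.
        return de_novo | ref
    if retention == "intersection":
        # Keep only ids flagged as non-chimeric by both filters.
        return de_novo & ref
    raise ValueError("retention must be either 'union' or 'intersection'")
```

`intersection` is the stricter setting: a sequence is dropped as soon as either method calls it chimeric.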
347
# usearch_qf specific parameters
407
# usearch specific parameters
348
408
percent_id_err = opts.percent_id_err
349
409
minsize = opts.minsize
350
410
abundance_skew = opts.abundance_skew
351
411
db_filepath = opts.db_filepath
352
412
perc_id_blast = opts.perc_id_blast
353
de_novo_chimera_detection = opts.de_novo_chimera_detection
354
reference_chimera_detection = opts.reference_chimera_detection
355
cluster_size_filtering = opts.cluster_size_filtering
413
de_novo_chimera_detection = not opts.suppress_de_novo_chimera_detection
414
reference_chimera_detection = not opts.suppress_reference_chimera_detection
415
cluster_size_filtering = not opts.suppress_cluster_size_filtering
356
416
remove_usearch_logs = opts.remove_usearch_logs
419
# usearch61 specific parameters
420
# also uses: enable_rev_strand_match, refseqs_fp, suppress_new_clusters,
421
# save_uc_files, remove_usearch_logs, minlen, maxaccepts, maxrejects,
423
usearch_fast_cluster = opts.usearch_fast_cluster
424
usearch61_sort_method = opts.usearch61_sort_method
425
sizeorder = opts.sizeorder
427
# Set default values according to clustering method
428
if word_length != "default":
430
word_length = int(word_length)
432
raise ValueError,("--word_length must either be 'default' "
434
if word_length == "default":
435
if otu_picking_method in ["uclust", "uclust_ref"]:
437
elif otu_picking_method in ["usearch", "usearch_ref"]:
440
# default setting for usearch61
443
if max_accepts != "default":
445
max_accepts = int(max_accepts)
447
option_parser.error("--max_accepts must either be 'default' "
449
if max_accepts == "default":
450
if otu_picking_method in ["uclust", "uclust_ref"]:
453
# default setting for usearch61
456
if max_rejects != "default":
458
max_rejects = int(max_rejects)
460
option_parser.error("--max_rejects must be either 'default' "
462
if max_rejects == "default":
463
if otu_picking_method in ["uclust", "uclust_ref"]:
465
# usearch61 settings, depends upon fast clustering option
467
if usearch_fast_cluster:
# Check for logical/compatible inputs
if user_sort and not suppress_presort_by_abundance_uclust:
option_parser.error("Cannot pass -B/--user_sort without -D/--suppress_presort_by_abundance_uclust, as your input would be resorted by abundance. To presort your own sequences before passing to uclust, pass -DB.")
if abundance_skew <= 1:
raise ValueError,('abundance skew must be > 1')
option_parser.error('abundance skew must be > 1')
# Check for logical inputs
if otu_picking_method == 'usearch' and \
if otu_picking_method in ['usearch', 'usearch_ref'] and \
reference_chimera_detection and not db_filepath:
raise ValueError,('No reference filepath specified with '+\
'--db_filepath option. Disable reference based chimera detection '+\
'with --reference_chimera_detection or specify a reference fasta '+\
'file with --db_filepath.')
option_parser.error('No reference filepath specified with '
'--db_filepath option. Disable reference based chimera detection '
'with --suppress_reference_chimera_detection or specify a reference '
'fasta file with --db_filepath.')
if chimeras_retention not in ['intersection', 'union']:
raise ValueError,('--chimeras_retention must be either union or '+\
option_parser.error('--chimeras_retention must be either union or '
if usearch61_sort_method not in ['length', 'abundance', 'None']:
option_parser.error("--usearch61_sort_method must be one of the "
"following: length, abundance, None")
if otu_picking_method in ['usearch61', 'usearch61_ref']:
if usearch_fast_cluster:
if enable_rev_strand_match:
option_parser.error("--enable_rev_strand_match can not be "
"enabled when using --usearch_fast_cluster.")
if usearch61_sort_method != "length":
option_parser.error("--usearch61_sort_method must be 'length' "
"when --usearch_fast_cluster used.")
if otu_picking_method in ['usearch61']:
if opts.suppress_new_clusters:
option_parser.error("--suppress_new_clusters cannot be enabled when "
"using usearch61 as the OTU picking method as this is strictly "
"de novo. Use --otu_picking_method usearch61_ref and a reference "
"database for closed reference OTU picking.")
if usearch61_sort_method != 'abundance':
option_parser.error("To use --sizeorder, usearch61_sort_method must "
# Test that db_filepath can be opened to avoid wasted time
tmp_db_filepath = open(db_filepath, "U")
tmp_db_filepath.close()
db_filepath = abspath(db_filepath)
raise IOError,('Unable to open %s, please check path/permissions' %\
raise IOError,('Unable to open %s, please check path/permissions' %
# Input validation to throw a useful error message on common mistakes
if (otu_picking_method == 'cdhit' and
similarity < 0.80):
'suppress_new_clusters':opts.suppress_new_clusters,
'derep_fullseq':derep_fullseq,
'chimeras_retention':chimeras_retention,
otu_picker = otu_picker_constructor(params)
otu_picker(input_seqs_filepath, result_path=result_path,
refseqs_fp=refseqs_fp, log_path=log_path, HALT_EXEC=False)
'rev':enable_rev_strand_match}
otu_picker = otu_picker_constructor(params)
otu_picker(input_seqs_filepath, result_path=result_path,
refseqs_fp=refseqs_fp, failure_path=failure_path,
log_path=log_path, HALT_EXEC=False)
# usearch 6.1 (de novo OTU picking only)
elif otu_picking_method == 'usearch61':
otu_prefix = opts.uclust_otu_id_prefix or 'denovo'
'percent_id':opts.similarity,
'wordlength':word_length,
'save_intermediate_files':save_uc_files,
'output_dir':output_dir,
'remove_usearch_logs':remove_usearch_logs,
'rev':enable_rev_strand_match,
'usearch_fast_cluster':usearch_fast_cluster,
'usearch61_sort_method':usearch61_sort_method,
'usearch61_maxrejects':max_rejects,
'usearch61_maxaccepts':max_accepts,
'sizeorder':sizeorder
otu_picker = otu_picker_constructor(params)
otu_picker(input_seqs_filepath, result_path=result_path,
log_path=log_path,otu_prefix=otu_prefix,
# usearch 6.1 reference OTU picking
elif otu_picking_method == 'usearch61_ref':
otu_prefix = opts.uclust_otu_id_prefix or 'denovo'
'percent_id':opts.similarity,
'wordlength':word_length,
'save_intermediate_files':save_uc_files,
'output_dir':output_dir,
'remove_usearch_logs':remove_usearch_logs,
'rev':enable_rev_strand_match,
'usearch_fast_cluster':usearch_fast_cluster,
'usearch61_sort_method':usearch61_sort_method,
'usearch61_maxrejects':max_rejects,
'usearch61_maxaccepts':max_accepts,
'sizeorder':sizeorder,
'suppress_new_clusters':opts.suppress_new_clusters
otu_picker = otu_picker_constructor(params)
otu_picker(input_seqs_filepath, refseqs_fp, result_path=result_path,
log_path=log_path, failure_path=failure_path,
otu_prefix=otu_prefix, HALT_EXEC=False)
## uclust (reference-based)
elif otu_picking_method == 'uclust_ref':