22
22
5. Trie [Qiime team, unpublished], which collapses identical sequences and sequences which are subsequences of other sequences.
24
6. uclust (Robert Edgar, unpublished, 2009), creates "seeds" of sequences which generate clusters based on percent identity.
26
7. usearch (Robert Edgar, unpublished, 2011), creates "seeds" of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection.
24
6. uclust (Edgar, RC 2010), creates "seeds" of sequences which generate clusters based on percent identity.
26
7. uclust_ref (Edgar, RC 2010), as uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off.
28
8. usearch (Edgar, RC 2010, version v5.2.236), creates "seeds" of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection.
30
9. usearch_ref (Edgar, RC 2010, version v5.2.236), as usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off.
32
The quality filtering pipeline with usearch 5.X is referred to as usearch_qf ("usearch quality filter") and is described here: http://qiime.org/tutorials/usearch_quality_filter.html
34
8. usearch61 (Edgar, RC 2010, version v6.1.544), creates "seeds" of sequences which generate clusters based on percent identity.
36
9. usearch61_ref (Edgar, RC 2010, version v6.1.544), as usearch61, but takes a reference database to use as seeds. New clusters can be toggled on or off.
38
Chimera checking with usearch 6.X is implemented in `identify_chimeric_seqs.py <./identify_chimeric_seqs.html>`_. Chimera checking should be done first with usearch 6.X, and the resulting filtered fasta file can then be clustered.
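The seed-based methods listed above (uclust, usearch, and their _ref variants) can be pictured with a minimal Python sketch. This is not the uclust or usearch implementation: the function names, the ``new_clusters`` argument, and the toy ``identity`` measure (position-by-position matches without alignment) are assumptions made purely for illustration.

```python
def identity(a, b):
    """Toy percent identity: matching positions over the longer length.

    Assumption: no alignment is performed, unlike the real tools.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def seed_cluster(seqs, threshold=0.97, ref_seeds=None, new_clusters=True):
    """Greedy seed clustering sketch.

    seqs: dict of sequence id -> sequence, processed in input order.
    ref_seeds: optional dict of reference id -> sequence used as initial seeds.
    new_clusters: when False, sequences matching no seed are reported as failures.
    """
    seeds = dict(ref_seeds or {})
    clusters = {seed_id: [] for seed_id in seeds}
    failures = []
    for seq_id, seq in seqs.items():
        for seed_id, seed_seq in seeds.items():
            if identity(seq, seed_seq) >= threshold:
                # Join the first seed that meets the similarity threshold
                clusters[seed_id].append(seq_id)
                break
        else:
            if new_clusters:
                # No seed matched: this sequence becomes a new seed
                seeds[seq_id] = seq
                clusters[seq_id] = [seq_id]
            else:
                failures.append(seq_id)
    return clusters, failures
```

Passing ``ref_seeds`` with ``new_clusters=False`` loosely mirrors running a reference-based method with new clusters switched off: reads that match no reference sequence are left unclustered.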
28
40
The primary inputs for `pick_otus.py <./pick_otus.html>`_ are:
55
67
-m, `-`-otu_picking_method
56
Method for picking OTUs. Valid choices are: usearch, usearch_ref, prefix_suffix, mothur, trie, blast, uclust_ref, cdhit, uclust. The mothur method requires an input file of aligned sequences. usearch will enable the usearch quality filtering pipeline. [default: uclust]
68
Method for picking OTUs. Valid choices are: mothur, trie, uclust_ref, usearch, usearch_ref, blast, usearch61, usearch61_ref, prefix_suffix, cdhit, uclust. The mothur method requires an input file of aligned sequences. usearch will enable the usearch quality filtering pipeline. [default: uclust]
57
69
-c, `-`-clustering_algorithm
58
70
Clustering algorithm for mothur otu picking method. Valid choices are: furthest, nearest, average. [default: furthest]
59
71
-M, `-`-max_cdhit_memory
62
74
Path to store result file [default: ./<OTU_METHOD>_picked_otus/]
64
Path to reference sequences to search against when using -m blast, -m uclust_ref, or -m usearch_ref [default: None]
76
Path to reference sequences to search against when using -m blast, -m uclust_ref, -m usearch_ref, or -m usearch61_ref [default: None]
66
78
Pre-existing database to blast against when using -m blast [default: None]
67
79
`-`-min_aligned_percent
68
Minimum percent of query sequence that can be aligned to consider a hit (BLAST OTU picker only) [default: 0.5]
80
Minimum percent of query sequence that can be aligned to consider a hit (BLAST OTU picker only) [default: 0.5]
70
Sequence similarity threshold (for cdhit, uclust, uclust_ref, or usearch) [default: 0.97]
82
Sequence similarity threshold (for blast, cdhit, uclust, uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref) [default: 0.97]
71
83
-e, `-`-max_e_value
72
84
Max E-value when clustering with BLAST [default: 1e-10]
73
85
-q, `-`-trie_reverse_seqs
92
104
Pass the --user_sort flag to uclust for uclust otu picking. [default: False]
93
105
-C, `-`-suppress_new_clusters
94
Suppress creation of new clusters using seqs that don't match reference when using -m uclust_ref or -m usearch_ref [default: False]
106
Suppress creation of new clusters using seqs that don't match reference when using -m uclust_ref, -m usearch61_ref, or -m usearch_ref [default: False]
96
Max_accepts value to uclust and uclust_ref [default: 20]
108
Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 20, usearch61: 1) [default: default]
98
Max_rejects value to uclust and uclust_ref [default: 500]
110
Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 500, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default]
100
112
Stepwords value to uclust and uclust_ref [default: 20]
102
W value to usearch, uclust, and uclust_ref. Set to 64 for usearch. [default: 12]
114
Word length value for uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref. With default setting, will use the setting recommended by the method (uclust: 12, usearch: 64, usearch61: 8). int value can be supplied to override this setting. [default: default]
103
115
`-`-uclust_otu_id_prefix
104
OTU identifier prefix (string) for the de novo uclust OTU picker [default: None, OTU ids are ascending integers]
105
`-`-uclust_stable_sort
106
Deprecated: stable sort enabled by default, pass --uclust_suppress_stable_sort to disable [default: True]
116
OTU identifier prefix (string) for the de novo uclust OTU picker and for new clusters when uclust_ref is used without -C [default: denovo, OTU ids are ascending integers]
107
117
`-`-suppress_uclust_stable_sort
108
118
Don't pass --stable-sort to uclust [default: False]
109
119
`-`-suppress_uclust_prefilter_exact_match
110
120
Don't collapse exact matches before calling uclust [default: False]
111
121
-d, `-`-save_uc_files
112
Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch (usearch_qf). [default: True]
122
Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch and usearch61. [default: True]
113
123
-j, `-`-percent_id_err
114
Percent identity threshold for cluster error detection with usearch_qf. [default: 0.97]
124
Percent identity threshold for cluster error detection with usearch. [default: 0.97]
116
Minimum cluster size for size filtering with usearch_qf. [default: 4]
126
Minimum cluster size for size filtering with usearch. [default: 4]
117
127
-a, `-`-abundance_skew
118
Abundance skew setting for de novo chimera detection with usearch_qf. [default: 2.0]
128
Abundance skew setting for de novo chimera detection with usearch. [default: 2.0]
119
129
-f, `-`-db_filepath
120
Reference database of fasta sequences for reference based chimera detection with usearch_qf. [default: None]
130
Reference database of fasta sequences for reference based chimera detection with usearch. [default: None]
121
131
`-`-perc_id_blast
122
Percent ID for mapping OTUs created by usearch_qf back to original sequence IDs [default: 0.97]
123
-k, `-`-de_novo_chimera_detection
124
Perform de novo chimera detection in usearch_qf. [default: True]
125
-x, `-`-reference_chimera_detection
126
Perform reference based chimera detection in usearch_qf. [default: True]
127
-l, `-`-cluster_size_filtering
128
Perform cluster size filtering in usearch_qf. [default: True]
132
Percent ID for mapping OTUs created by usearch back to original sequence IDs [default: 0.97]
133
`-`-de_novo_chimera_detection
134
Deprecated: de novo chimera detection performed by default, pass --suppress_de_novo_chimera_detection to disable. [default: None]
135
-k, `-`-suppress_de_novo_chimera_detection
136
Suppress de novo chimera detection in usearch. [default: False]
137
`-`-reference_chimera_detection
138
Deprecated: Reference based chimera detection performed by default, pass --suppress_reference_chimera_detection to disable [default: None]
139
-x, `-`-suppress_reference_chimera_detection
140
Suppress reference based chimera detection in usearch. [default: False]
141
`-`-cluster_size_filtering
142
Deprecated: cluster size filtering enabled by default, pass --suppress_cluster_size_filtering to disable. [default: None]
143
-l, `-`-suppress_cluster_size_filtering
144
Suppress cluster size filtering in usearch. [default: False]
129
145
`-`-remove_usearch_logs
130
146
Disable creation of logs when usearch is called. Up to nine logs are created, depending on filtering steps enabled. [default: False]
131
147
`-`-derep_fullseq
132
148
Dereplication of full sequences, instead of subsequences. Faster than the default --derep_subseqs in usearch. [default: False]
133
149
-F, `-`-non_chimeras_retention
134
Selects subsets of sequences detected as non-chimeras to retain after de novo and refernece based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union]
150
Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union]
152
Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64]
153
`-`-usearch_fast_cluster
154
Use fast clustering option for usearch or usearch61_ref with new clusters. --enable_rev_strand_match cannot be enabled with this option, and the only valid option for usearch61_sort_method is 'length'. This option uses more memory than the default option for de novo clustering. [default: False]
155
`-`-usearch61_sort_method
156
Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the --usearch_fast_cluster option is enabled, the only sorting method allowed is length. [default: abundance]
158
Enable size based preference in clustering with usearch61. Requires that --usearch61_sort_method be abundance. [default: False]
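Several of the options above take the literal value ``default`` and then fall back to a value recommended for the chosen method. The following Python sketch shows one way such a lookup could work, using only the numbers quoted in the option descriptions. The table and function names are hypothetical, and the assumption that each ``_ref`` method shares the values of its de novo counterpart is an illustration, not a statement about the script's internals.

```python
# Values taken from the option descriptions; structure and names are hypothetical.
METHOD_DEFAULTS = {
    "uclust": {"max_accepts": 20, "max_rejects": 500, "word_length": 12},
    "usearch": {"word_length": 64},
    "usearch61": {"max_accepts": 1, "word_length": 8},
}


def resolve_option(method, option, value="default", fast_cluster=False):
    """Return the effective integer value for an option such as max_accepts."""
    if value != "default":
        # An explicit value supplied by the user overrides the method default
        return int(value)
    # Assumption: uclust_ref, usearch_ref, usearch61_ref reuse the de novo values
    family = method[:-4] if method.endswith("_ref") else method
    if family == "usearch61" and option == "max_rejects":
        # 8 with the fast cluster option, 32 for reference and smallmem options
        return 8 if fast_cluster else 32
    return METHOD_DEFAULTS[family][option]
```

For example, ``resolve_option("usearch61", "max_rejects", fast_cluster=True)`` gives 8, while an explicit ``value="10"`` is returned as the integer 10 for any method.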
155
179
**Example (uclust method, default):**
157
Using the seqs.fna file generated from `split_libraries.py <./split_libraries.html>`_ and outputting the results to the directory "picked_otus/", while using default parameters (0.97 sequence similarity, no reverse strand matching):
161
pick_otus.py -i seqs.fna -o picked_otus/
163
To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the script would be the following:
167
pick_otus.py -i seqs.fna -o picked_otus/ -s 0.90 -z
181
Using the seqs.fna file generated from `split_libraries.py <./split_libraries.html>`_ and outputting the results to the directory "picked_otus_default/", while using default parameters (0.97 sequence similarity, no reverse strand matching):
185
pick_otus.py -i seqs.fna -o picked_otus_default
187
To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the command would be the following:
191
pick_otus.py -i seqs.fna -o picked_otus_90_percent_rev/ -s 0.90 -z
169
193
**Uclust Reference-based OTU picking example:**
175
pick_otus.py -i seqs.fna -r core_set_unaligned.fasta_11_8_07 -m uclust_ref
199
pick_otus.py -i seqs.fna -r refseqs.fasta -m uclust_ref --uclust_otu_id_prefix qiime_otu_
177
201
**Example (cdhit method):**
179
Using the seqs.fna file generated from `split_libraries.py <./split_libraries.html>`_ and outputting the results to the directory "picked_otus/", while using default parameters (0.97 sequence similarity, no prefix filtering):
183
pick_otus.py -i seqs.fna -m cdhit -o picked_otus/
185
Currently the cd-hit OTU picker allows for users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly increase the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:
189
pick_otus.py -i seqs.fna -m cdhit -o picked_otus/ -n 100
191
Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences prior to OTU picking, they can use the trie prefiltering ("-t") option as shown by the following command:
195
pick_otus.py -i seqs.fna -m cdhit -o picked_otus/ -t
197
Note: It is highly recommended to use one of the prefiltering methods when analyzing large dataset (>100,000 seqs) to reduce run-time.
203
Using the seqs.fna file generated from `split_libraries.py <./split_libraries.html>`_ and outputting the results to the directory "cdhit_picked_otus/", while using default parameters (0.97 sequence similarity, no prefix filtering):
207
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus/
209
Currently the cd-hit OTU picker allows for users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly decrease the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:
213
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_filter/ -n 100
215
Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences prior to OTU picking, they can use the trie prefiltering ("-t") option as shown by the following command.
217
Note: It is highly recommended to use one of the prefiltering methods when analyzing large datasets (>100,000 seqs) to reduce run-time.
221
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_trie_prefilter/ -t
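To make the idea behind the trie prefilter concrete, here is a small Python sketch that collapses identical sequences and sequences that are prefixes of longer ones. It is a simplification under stated assumptions: it only handles the prefix case, it uses a plain ``startswith`` scan rather than an actual trie, and the function name is hypothetical.

```python
def collapse_prefixes(seqs):
    """Group each sequence under the longest sequence it is a prefix of.

    seqs: dict of sequence id -> sequence.
    Returns a dict of representative id -> list of member ids.
    """
    clusters = {}
    # Longest first, so every sequence is visited after anything it could be a prefix of
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if seqs[rep_id].startswith(seq):
                # Identical to, or a prefix of, an existing representative
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]
    return clusters
```

Only the representative sequences would then need to be handed to the slower OTU picker, which is why prefiltering tends to reduce run-time on large inputs.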
199
223
**BLAST OTU-Picking Example:**
207
pick_otus.py -i seqs.fna -o picked_otus/ -m blast -r ref_seq_set.fna
231
pick_otus.py -i seqs.fna -o blast_picked_otus/ -m blast -r refseqs.fasta
209
233
If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command:
213
pick_otus.py -i seqs.fna -o picked_otus/ -m blast -b ref_database
237
pick_otus.py -i seqs.fna -o blast_picked_otus_prebuilt_db/ -m blast -b refseqs.fasta
215
239
If the user would like to change the sequence similarity ("-s") and/or the E-value ("-e") for the blast method, they can use the following command:
219
pick_otus.py -i seqs.fna -o picked_otus/ -m blast -s 0.90 -e 1e-30
243
pick_otus.py -i seqs.fna -o blast_picked_otus_90_percent/ -m blast -r refseqs.fasta -s 0.90 -e 1e-30
221
245
**Prefix-suffix OTU Picking Example:**
223
OTUs can be picked by collapsing sequences which being and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:
247
OTUs can be picked by collapsing sequences which begin and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, the user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:
227
pick_otus.py -i seqs.fna -o picked_otus/ -m prefix_suffix -p 50 -u 25
251
pick_otus.py -i seqs.fna -o prefix_suffix_picked_otus/ -m prefix_suffix -p 50 -u 25
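The collapsing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the QIIME implementation; the function name and the choice of integer OTU ids are assumptions.

```python
from collections import defaultdict


def prefix_suffix_otus(seqs, prefix_length=50, suffix_length=25):
    """Group sequences sharing the same leading and trailing bases.

    seqs: dict of sequence id -> sequence.
    Returns a dict of integer OTU id -> list of member sequence ids.
    """
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        prefix = seq[:prefix_length]
        # seq[-0:] would return the whole sequence, so handle a zero suffix explicitly
        suffix = seq[-suffix_length:] if suffix_length else ""
        groups[(prefix, suffix)].append(seq_id)
    return {otu_id: members for otu_id, members in enumerate(groups.values())}
```

With ``prefix_length=50`` and ``suffix_length=25``, two reads land in the same OTU only when both their first 50 and their last 25 bases are identical, regardless of what lies in between.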
229
253
**Mothur OTU Picking Example:**
231
255
The Mothur program (http://www.mothur.org/) provides three clustering algorithms for OTU formation: furthest-neighbor (complete linkage), average-neighbor (group average), and nearest-neighbor (single linkage). Details on the algorithms may be found on the Mothur website and publications (Schloss et al., 2009). However, the running times of Mothur's clustering algorithms scale with the number of sequences squared, so the program may not be feasible for large data sets.
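The difference between the three linkage rules comes down to how the distance between two clusters is derived from the pairwise sequence distances. The Python sketch below illustrates that step only; it is not Mothur code, and the function name and the ``dist`` lookup keyed by ``frozenset`` pairs are assumptions for the example.

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage="furthest"):
    """Distance between two clusters under a given linkage rule.

    dist: dict mapping frozenset({id1, id2}) -> pairwise distance.
    """
    pair_dists = [dist[frozenset((a, b))] for a in cluster_a for b in cluster_b]
    if linkage == "furthest":
        # Complete linkage: the most distant pair decides
        return max(pair_dists)
    if linkage == "nearest":
        # Single linkage: the closest pair decides
        return min(pair_dists)
    if linkage == "average":
        # Group average: mean over all cross-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError("linkage must be furthest, nearest, or average")
```

Because every cross-cluster pair is inspected, the number of comparisons grows with the product of the cluster sizes, which is consistent with the quadratic scaling noted above.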
233
The following command may be used to create OTU's based on a furthest-neighbor algorithm (the default setting):
257
The following command may be used to create OTUs based on a furthest-neighbor algorithm (the default setting) using aligned sequences as input:
237
pick_otus.py -i seqs.fna -o picked_otus/ -m mothur
261
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus/ -m mothur
239
263
If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the '-c' flag:
243
pick_otus.py -i seqs.fna -o picked_otus/ -m mothur -c nearest
245
The sequence similarity parameter may also be specified. For example, the following command may be used to create OTU's at the level of 95% similarity:
249
pick_otus.py -i seqs.fna -o picked_otus/ -m mothur -s 0.90
251
**Usearch_qf ('usearch quality filter'):**
253
Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering.
255
**Standard usearch (usearch_qf) example:**
259
pick_otus.py -i seqs.fna -m usearch --word_length 64 --db_filepath reference_sequence_filepath -o usearch_qf_results/
261
**Usearch (usearch_qf) example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:**
265
pick_otus.py -i seqs.fna -m usearch --word_length 64 --reference_chimera_detection --minsize 2 -o usearch_qf_results/
267
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_nn/ -m mothur -c nearest
269
The sequence similarity parameter may also be specified. For example, the following command may be used to create OTUs at the level of 90% similarity:
273
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_90_percent/ -m mothur -s 0.90
277
Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering. The following command specifies a minimum cluster size of 2 to be used during cluster size filtering:
281
pick_otus.py -i seqs.fna -m usearch --word_length 64 --db_filepath refseqs.fasta -o usearch_qf_results/ --minsize 2
283
**usearch example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:**
287
pick_otus.py -i seqs.fna -m usearch --word_length 64 --suppress_reference_chimera_detection --minsize 2 -o usearch_qf_results_no_ref_chim_detection/
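The interplay between non-chimera retention (union or intersection) and the minimum cluster size can be illustrated with a short Python sketch. It is a simplified model rather than the usearch pipeline: the function names are hypothetical, and treating both filters as operating on whole cluster ids in this order is an assumption.

```python
def retain_non_chimeras(de_novo_ok, reference_ok, retention="union"):
    """Combine the ids flagged as non-chimeric by the two detection methods."""
    if retention == "union":
        # Keep anything that either method considers non-chimeric
        return de_novo_ok | reference_ok
    if retention == "intersection":
        # Keep only what both methods consider non-chimeric
        return de_novo_ok & reference_ok
    raise ValueError("retention must be union or intersection")


def filter_clusters(clusters, keep_ids, minsize=4):
    """Drop clusters that were not retained or that have fewer than minsize members."""
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if cluster_id in keep_ids and len(members) >= minsize
    }
```

Lowering ``minsize`` from 4 to 2, as in the commands above, keeps smaller clusters that would otherwise be discarded by the size filter.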