is a k\-mer counter based on a multi\-threaded hash
table implementation.
.SS COUNTING AND MERGING
To count k\-mers, use a command like:
jellyfish count \-m 22 \-o output \-c 3 \-s 10000000 \-t 32 input.fasta
This will count the 22\-mers in input.fasta with 32 threads. The
counter field in the hash uses only 3 bits and the hash has at least
10 million entries. Let the size of the table be s=2^l and let the maximum
reprobe value be less than 2^r; then the memory usage per entry in
the hash is (in bits, not bytes) 2k\-l+r+1\&.
The output files will be named output_0, output_1, etc. (the prefix
is specified with the \fB\-o\fP
switch). If the hash is large enough
(as specified by the \fB\-s\fP
switch) to fit all the k\-mers, there
will be only one output file named output_0. If the hash fills up
before all the mers are read, the hash is dumped to disk, zeroed out,
and reading of mers resumes. Multiple intermediary files will then be
present on disk, named output_0, output_1, etc.
To obtain correct results from the other sub\-commands (such as histo,
stats, etc.), the multiple output files, if any, need to be merged into one
with the merge command, for example:
jellyfish merge \-o output.jf output_*
Should you get many intermediary output files (say hundreds), the size
of the hash table is too small. Rerunning Jellyfish
with a
larger size (option \fB\-s\fP)
is probably faster than merging all the
intermediary files.
When the orientation of the sequences in the input fasta file is not
known, e.g. in sequencing reads, using \fB--both\-strands\fP
makes the most sense.
For any k\-mer m, its canonical representation is m itself or its
reverse\-complement, whichever comes first lexicographically. With the
\fB--both\-strands\fP
switch, only the canonical representation of each mer is
stored in the hash and the count value is the number of occurrences of
both the mer and its reverse\-complement.
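As an illustration only, the canonical representation described above can be sketched in Python. The helper names (reverse_complement, canonical, count_canonical) are hypothetical and are not part of Jellyfish; the sketch assumes upper\-case A, C, G, T input and plain string ordering.

```python
# Hypothetical helpers illustrating canonical k-mers; this is not Jellyfish code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(mer):
    """Return the reverse-complement of an upper-case DNA string."""
    return mer.translate(_COMPLEMENT)[::-1]


def canonical(mer):
    """Return the mer or its reverse-complement, whichever sorts first."""
    return min(mer, reverse_complement(mer))


def count_canonical(sequence, k):
    """Count k-mers, merging each mer with its reverse-complement."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        key = canonical(sequence[i:i + k])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

With this sketch, the 2\-mers AA and TT of the sequence AATT are tallied under the same key, AA.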
.SS CHOOSING THE HASH SIZE
To achieve the best performance, a minimum number of intermediary
files should be written to disk. So the parameter \fB\-s\fP
should be
chosen to fit as many k\-mers as possible (ideally all of them) while
still fitting in memory.
We consider two examples: counting mers in sequencing reads and in an
assembled sequence.
First, suppose we count k\-mers in short sequencing reads:
there are n reads and there is an average of 1 error per read, where
each error generates k unique mers. If the genome size is G, the
size of the hash (option \fB\-s\fP)
to fit all k\-mers at once is estimated as (G +
k*n)/0.8\&. The division by 0.8 compensates for the maximum usage of
approximately 80% of the hash table.
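The estimate above is simple arithmetic; a minimal Python sketch follows. The function and parameter names are hypothetical, and the 0.8 load factor is the approximate figure quoted in the text, not a value read from Jellyfish itself.

```python
import math


def estimate_hash_size(genome_size, n_reads, k, max_load=0.8):
    """Estimate the -s value as (G + k*n) / 0.8, rounded up to an integer.

    genome_size -- G, number of bases in the genome
    n_reads     -- n, number of reads (about one error per read is assumed)
    k           -- mer length; each error is assumed to add k unique mers
    max_load    -- assumed maximum usable fraction of the hash table
    """
    return math.ceil((genome_size + k * n_reads) / max_load)


# Example: a 1 Mbp genome, 100,000 reads, 22-mers.
print(estimate_hash_size(1_000_000, 100_000, 22))
```

The result is a lower bound on the number of entries to request with \-s; the table itself is rounded up to a power of two.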
On the other hand, when counting k\-mers in an assembled sequence of
length G, setting \fB\-s\fP
As a matter of convenience, Jellyfish understands ISO suffixes for the
size of the hash. Hence \&'\-s 10M\&' stands for 10 million entries while \&'\-s
50G\&' stands for 50 billion entries.
The actual memory usage of the hash table can be computed as
follows. The actual size of the hash will be rounded up to the next
power of 2: s=2^l\&. The parameter r is such that the maximum
reprobe value (\fB\-p\fP)
plus one is less than 2^r\&. Then the memory usage per
entry in the hash is (in bits, not bytes) 2k\-l+r+1\&. The total memory
usage of the hash table in bytes is: 2^l*(2k\-l+r+1)/8\&.
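The memory formula above can be evaluated with a short Python sketch. The function name is hypothetical; the sketch assumes that l is the smallest integer with 2^l >= s and that r is the smallest integer with (max reprobe + 1) < 2^r, which is one reading of the text and not taken from the Jellyfish source.

```python
def hash_memory_bytes(k, size, max_reprobe=62):
    """Evaluate 2^l * (2k - l + r + 1) / 8 for the given parameters.

    k           -- mer length (-m)
    size        -- requested number of entries (-s)
    max_reprobe -- maximum reprobe value (-p)
    """
    # l: smallest exponent such that 2**l >= size (assumed rounding rule).
    l = max(size - 1, 0).bit_length()
    # r: smallest exponent such that max_reprobe + 1 < 2**r (assumed).
    r = (max_reprobe + 1).bit_length()
    bits_per_entry = 2 * k - l + r + 1
    return (1 << l) * bits_per_entry / 8


# 22-mers, 10 million entries requested, 62 reprobes:
# l = 24, r = 6, so 27 bits per entry.
print(hash_memory_bytes(22, 10_000_000))
```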
.SS CHOOSING THE COUNTING FIELD SIZE
To save space, the hash table supports variable\-length counters, i.e. a
k\-mer occurring only a few times will use a small counter, while a k\-mer
occurring many times will use multiple entries in the hash.
Important: the size of the counting field does NOT change the result,
it only impacts the amount of memory used. In particular, there is no
maximum value in the hash. Even if the counting field uses 5 bits, a
k\-mer occurring 2 million times will have a reported value of 2
million (i.e., it is not capped at 2^5).
The \fB\-c\fP
switch
specifies the length (in bits) of the counting field. The
trade\-off is as follows: a low value will save space per entry in the
hash but can potentially increase the number of entries used, hence
maybe requiring a larger hash.
In practice, use a value for \fB\-c\fP
so that most of your k\-mers
require only 1 entry. For example, to count k\-mers in a genome,
where most of the sequence is unique, use \fB\-c\fP\fI1\fP
or
\fB\-c\fP\fI2\fP\&.
For sequencing reads, use a value for \fB\-c\fP
large
enough to count up to twice the coverage. For example, if the
coverage is 10X, choose a counter length of 5 (\fB\-c\fP\fI5\fP).
The following subcommands are used to look at the results: histo, dump, stats.
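The rule of thumb for sequencing reads (a counting field able to count up to twice the coverage) can be sketched as follows. The function name is hypothetical, and the sketch assumes that a field of c bits can hold values up to 2^c \- 1; the exact encoding used by Jellyfish may differ.

```python
def counter_len_for_coverage(coverage):
    """Smallest number of bits c such that 2**c - 1 >= 2 * coverage."""
    return (2 * coverage).bit_length()


# 10X coverage: counts up to 20 are expected, which needs 5 bits.
print(counter_len_for_coverage(10))
```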
.SH SUBCOMMANDS AND OPTIONS
Usage: jellyfish count [options] file:path+
Count k\-mers or qmers in fasta or fastq files
Usage: jellyfish count [OPTIONS]... [file.f[aq]]...
\fB\-h\fP,\fB--help\fP
Print help, including hidden options, and exit
\fB\-V\fP,\fB--version\fP
Print version and exit
Options (default value in (), *required):
\fB--mer\-len\fP\fI=INT\fP
Length of mer (mandatory)
\fB--mer\-len\fP\fI=uint32\fP
\fB--size\fP\fI=LONG\fP
Hash size (mandatory)
\fB--size\fP\fI=uint64\fP
\fB--threads\fP\fI=INT\fP
Number of threads (default=1)
\fB--threads\fP\fI=uint32\fP
Number of threads (1)
\fB--output\fP\fI=STRING\fP
Output prefix (default=mer_counts)
\fB--output\fP\fI=string\fP
Output prefix (mer_counts)
\fB--counter\-len\fP\fI=Length\fP
Length of counting field (default=7)
in bits Length of counting field (7)
\fB--out\-counter\-len\fP\fI=Length\fP
Length of counter field in output
in bytes Length of counter field in output (4)
\fB\-C\fP,\fB--both\-strands\fP
Count both strands, canonical representation
Count both strands, canonical representation (false)
\fB--reprobes\fP\fI=INT\fP
Maximum number of reprobes (default=62)
\fB--reprobes\fP\fI=uint32\fP
Maximum number of reprobes (62)
\fB\-r\fP,\fB--raw\fP
Write raw database (default=off)
Write raw database (false)
\fB\-q\fP,\fB--quake\fP
Quake compatibility mode (default=off)
\fB--quality\-start\fP\fI=INT\fP
Starting ASCII for quality values
\fB--min\-quality\fP\fI=INT\fP
Minimum quality. A base with lesser quality
becomes an N (default=0)
Quake compatibility mode (false)
\fB--quality\-start\fP\fI=uint32\fP
Starting ASCII for quality values (64)
\fB--min\-quality\fP\fI=uint32\fP
Minimum quality. A base with lesser quality becomes an N (0)
\fB--lower\-count\fP\fI=LONG\fP
\fB--lower\-count\fP\fI=uint64\fP
Don\&'t output k\-mer with count < lower\-count
\fB--upper\-count\fP\fI=LONG\fP
\fB--upper\-count\fP\fI=uint64\fP
Don\&'t output k\-mer with count > upper\-count
\fB--matrix\fP\fI=Matrix\fP
\fB--timing\fP\fI=Timing\fP
file Print timing information
\fB--stats\fP\fI=Stats\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish stats [options] db:path
Display some statistics about the k\-mers in the hash:
Unique: Number of k\-mers which occur only once.
Distinct: Number of k\-mers, not counting multiplicity.
Total: Number of k\-mers, including multiplicity.
Max_count: Maximum number of occurrences of a k\-mer.
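To make the four statistics concrete, here is a small Python sketch that computes them for the k\-mers of a single string. The function name is hypothetical and the code works on an in\-memory string, not on a Jellyfish database.

```python
from collections import Counter


def kmer_stats(sequence, k):
    """Compute Unique, Distinct, Total and Max_count for the k-mers of a string."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {
        "Unique": sum(1 for c in counts.values() if c == 1),  # seen exactly once
        "Distinct": len(counts),                              # ignoring multiplicity
        "Total": sum(counts.values()),                        # with multiplicity
        "Max_count": max(counts.values(), default=0),         # highest count
    }


# The 2-mers of AAAAC are AA (3 times) and AC (once).
print(kmer_stats("AAAAC", 2))
```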
Options (default value in (), *required):
\fB--lower\-count\fP\fI=uint64\fP
Don\&'t consider k\-mer with count < lower\-count
\fB--upper\-count\fP\fI=uint64\fP
Don\&'t consider k\-mer with count > upper\-count
\fB\-v\fP,\fB--verbose\fP
\fB--output\fP\fI=string\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish histo [options] db:path
Create a histogram of k\-mer occurrences
Usage: jellyfish histo [OPTIONS]... [database.jf]...
\fB\-V\fP,\fB--version\fP
Print version and exit
\fB--buffer\-size\fP\fI=Buffer\fP
Length in bytes of input buffer
Create a histogram with the number of k\-mers having a given
count. In bucket \&'i\&' are tallied the k\-mers which have a count \&'c\&'
satisfying \&'low+i*inc <= c < low+(i+1)*inc\&'\&. Buckets in the output are
labeled by the low end point (low+i*inc).
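The bucketing rule can be sketched in Python as below. The function name is hypothetical. Two details are assumptions made for the sketch and are not stated in this page: counts below low are ignored, and the last (catch\-all) bucket is taken to be the one with the largest low end point not exceeding high.

```python
def bucket_low_end(count, low=1, high=10000, inc=1):
    """Return the label (low end point) of the bucket that tallies `count`.

    Bucket i covers low + i*inc <= c < low + (i+1)*inc, except the last
    bucket, which also collects every larger count.
    """
    if count < low:
        return None  # assumption: counts below `low` are not tallied
    last = (high - low) // inc            # assumed index of the last bucket
    i = min((count - low) // inc, last)   # clamp large counts to the last bucket
    return low + i * inc


# With low=1, high=10, inc=5 the buckets are labeled 1 and 6.
print(bucket_low_end(3, low=1, high=10, inc=5))
print(bucket_low_end(500, low=1, high=10, inc=5))
```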
The last bucket in the output behaves as a catchall: it tallies all
k\-mers with a count greater than or equal to the low end point of this
bucket.
Options (default value in (), *required):
\fB--low\fP\fI=LONG\fP
Low count value of histogram (default=1)
\fB--low\fP\fI=uint64\fP
Low count value of histogram (1)
\fB--high\fP\fI=LONG\fP
High count value of histogram
\fB--high\fP\fI=uint64\fP
High count value of histogram (10000)
\fB--increment\fP\fI=LONG\fP
Increment value for buckets (default=1)
\fB--increment\fP\fI=uint64\fP
Increment value for buckets (1)
\fB--threads\fP\fI=INT\fP
Number of threads (default=1)
\fB--threads\fP\fI=uint32\fP
Number of threads (1)
\fB\-f\fP,\fB--full\fP
Full histo. Don\&'t skip count 0. (false)
\fB--output\fP\fI=STRING\fP
Output file (default=/dev/fd/1)
\fB--output\fP\fI=string\fP
\fB\-v\fP,\fB--verbose\fP
Output information (false)
\fB\-V\fP,\fB--version\fP
Usage: jellyfish dump [options] db:path
Dump k\-mer counts
Usage: jellyfish stats [OPTIONS]... [database.jf]...
By default, dump in a fasta format where the header is the count and
the sequence is the sequence of the k\-mer. The column format is a 2
column output: k\-mer count.
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Print version and exit
Options (default value in (), *required):
\fB\-c\fP,\fB--column\fP
Column format (default=off)
Column format (false)
\fB\-t\fP,\fB--tab\fP
Tab separator (default=off)
Tab separator (false)
\fB--lower\-count\fP\fI=LONG\fP
\fB--lower\-count\fP\fI=uint64\fP
Don\&'t output k\-mer with count < lower\-count
\fB--upper\-count\fP\fI=LONG\fP
\fB--upper\-count\fP\fI=uint64\fP
Don\&'t output k\-mer with count > upper\-count
\fB--output\fP\fI=STRING\fP
Output file (default=/dev/fd/1)
Usage: jellyfish stats [OPTIONS]... [database.jf]...
\fB--output\fP\fI=string\fP
\fB\-h\fP,\fB--help\fP
Print help, including hidden options, and exit
\fB\-V\fP,\fB--version\fP
Print version and exit
\fB--lower\-count\fP\fI=LONG\fP
Don\&'t output k\-mer with count < lower\-count
\fB--upper\-count\fP\fI=LONG\fP
Don\&'t output k\-mer with count > upper\-count
\fB\-v\fP,\fB--verbose\fP
Verbose (default=off)
\fB--output\fP\fI=STRING\fP
Output file (default=/dev/fd/1)
Usage: jellyfish merge [options] input:string+
Merge jellyfish databases
Usage: jellyfish merge [OPTIONS]... [database.jf]...
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Print version and exit
Options (default value in (), *required):
\fB--buffer\-size\fP\fI=Buffer\fP
Length in bytes of input buffer
\fB--output\fP\fI=STRING\fP
Output file (default=mer_counts_merged.jf)
\fB--out\-counter\-len\fP\fI=INT\fP
Length (in bytes) of counting field in output
\fB--out\-buffer\-size\fP\fI=LONG\fP
Size of output buffer per thread
\fB\-v\fP,\fB--verbose\fP
Be verbose (default=off)
length Length in bytes of input buffer (10000000)
\fB--output\fP\fI=string\fP
Output file (mer_counts_merged.jf)
\fB--out\-counter\-len\fP\fI=uint32\fP
Length (in bytes) of counting field in output (4)
\fB--out\-buffer\-size\fP\fI=uint64\fP
Size of output buffer per thread (10000000)
\fB\-v\fP,\fB--verbose\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish query [options] db:path
Query from a compacted database
Query a hash. It reads k\-mers from the standard input and writes the counts to the standard output.
Options (default value in (), *required):
\fB\-C\fP,\fB--both\-strands\fP
\fB\-c\fP,\fB--cary\-bit\fP
Value field as the cary bit information (false)
\fB--input\fP\fI=file\fP
\fB--output\fP\fI=file\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish qhisto [options] db:string
Create a histogram of k\-mer occurrences
Options (default value in (), *required):
\fB--low\fP\fI=double\fP
Low count value of histogram (0.0)
\fB--high\fP\fI=double\fP
High count value of histogram (10000.0)
\fB--increment\fP\fI=double\fP
Increment value for buckets (1.0)
\fB\-f\fP,\fB--full\fP
Full histo. Don\&'t skip count 0. (false)
\fB\-V\fP,\fB--version\fP
Usage: jellyfish qdump [options] db:path
Dump k\-mers from a qmer database
By default, dump in a fasta format where the header is the count and
the sequence is the sequence of the k\-mer. The column format is a 2
column output: k\-mer count.
Options (default value in (), *required):
\fB\-c\fP,\fB--column\fP
Column format (false)
\fB\-t\fP,\fB--tab\fP
Tab separator (false)
\fB--lower\-count\fP\fI=double\fP
Don\&'t output k\-mer with count < lower\-count
\fB--upper\-count\fP\fI=double\fP
Don\&'t output k\-mer with count > upper\-count
\fB\-v\fP,\fB--verbose\fP
\fB--output\fP\fI=string\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish merge [options] db:string+
Merge quake databases
Options (default value in (), *required):
\fB--size\fP\fI=uint64\fP
*Merged hash table size
\fB--mer\-len\fP\fI=uint32\fP
\fB--output\fP\fI=string\fP
Output file (merged.jf)
\fB--reprobes\fP\fI=uint32\fP
Maximum number of reprobes (62)
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Usage: jellyfish cite [options]
How to cite Jellyfish\&'s paper
Usage: jellyfish cite [OPTIONS]...
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Print version and exit
Options (default value in (), *required):
\fB\-b\fP,\fB--bibtex\fP
Bibtex format (default=off)
Bibtex format (false)
\fB--output\fP\fI=STRING\fP
Output file (default=/dev/fd/1)
\fB--output\fP\fI=string\fP
\fB\-h\fP,\fB--help\fP
\fB\-V\fP,\fB--version\fP
Version: 1.1.4 of 2010/10/1
jellyfish merge has not been parallelized and is
The hash table does not grow in memory automatically and
jellyfish merge
is not called automatically on the
intermediary files (if any).
.SH COPYRIGHT & LICENSE