.TH FA2HTGS 1 2001-12-28 NCBI "NCBI Tools User's Manual"
fa2htgs \- formatter for high throughput genome sequencing project submissions
[\|\fB-6\fP\ \fIstr\fP\|]
[\|\fB-7\fP\ \fIstr\fP\|]
[\|\fB-A\fP\ \fIfilename\fP\|]
[\|\fB-C\fP\ \fIstr\fP\|]
[\|\fB-L\fP\ \fIfilename\fP\|]
[\|\fB-M\fP\ \fIstr\fP\|]
[\|\fB-O\fP\ \fIfilename\fP\|]
[\|\fB-P\fP\ \fIstr\fP\|]
[\|\fB-S\fP\ \fIstr\fP\|]
[\|\fB-T\fP\ \fIfilename\fP\|]
[\|\fB-a\fP\ \fIstr\fP\|]
[\|\fB-b\fP\ \fIN\fP\|]
[\|\fB-c\fP\ \fIstr\fP\|]
[\|\fB-d\fP\ \fIstr\fP\|]
[\|\fB-e\fP\ \fIfilename\fP\|]
[\|\fB-h\fP\ \fIstr\fP\|]
[\|\fB-i\fP\ \fIfilename\fP\|]
[\|\fB-l\fP\ \fIN\fP\|]
[\|\fB-n\fP\ \fIstr\fP\|]
[\|\fB-o\fP\ \fIfilename\fP\|]
[\|\fB-p\fP\ \fIN\fP\|]
[\|\fB-r\fP\ \fIstr\fP\|]
[\|\fB-t\fP\ \fIfilename\fP\|]
[\|\fB-x\fP\ \fIstr\fP\|]
This manual page briefly documents the \fBfa2htgs\fP command.
This manual page was written for the Debian GNU/Linux distribution
because the original program does not have a manual page.
\fBfa2htgs\fP is a program used to generate Seq-submit files (ASN.1
sequence submission files) for high throughput genome sequencing
projects.
\fBfa2htgs\fP reads a FASTA file (or an Ace Contig file with Phrap
sequence quality values), a Sequin submission template file (to get
contact and citation information for the submission), and a series of
command line arguments (see below). The program then combines
this information to make a submission suitable for GenBank. Once you
have generated your submission file, you need to follow the submission
protocol (see the README present on your FTP account or mailed out to
\fBfa2htgs\fP is intended to be driven by scripts for bulk
submission of unannotated genome sequence. It can easily be extended
from its current simple form to allow more complicated processing. A
submission prepared with \fBfa2htgs\fP can also be read into
\fBPsequin\fP(1), and then annotated more extensively.
Questions and concerns about this processing protocol, or how to
use this tool, should be forwarded to <htgs@ncbi.nlm.nih.gov>.
A summary of options is included below.
SP6 clone (e.g., Contig1,left)
T7 clone (e.g., Contig2,right)
\fB-A\fP\ \fIfilename\fP
Filename for accession list input (mutually exclusive with \fB-T\fP
and \fB-i\fP). The input file contains a tab-delimited table with
three to five columns, which are accession number, start position,
stop position, and (optionally) length and strand. If start > stop,
the minus strand on the referenced accession is used. A gap is
indicated by the word "gap" instead of an accession, 0 for the start
and stop positions, and a number for the length.
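.IP
As a rough illustration of the table layout described above, the
following Python sketch parses such a file. The names
parse_accession_list and Segment are hypothetical and not part of
\fBfa2htgs\fP. The sketch ignores the optional strand column, because
its encoding is not described here, and derives the strand only from
the start > stop rule.
.IP
.nf
```python
import csv
from typing import Iterable, List, NamedTuple, Optional


class Segment(NamedTuple):
    accession: Optional[str]  # None marks a gap row
    start: int
    stop: int
    length: Optional[int]
    minus_strand: bool


def parse_accession_list(lines: Iterable[str]) -> List[Segment]:
    """Parse rows of: accession, start, stop[, length[, strand]]."""
    segments = []
    # The "excel-tab" dialect splits each row on tab characters.
    for row in csv.reader(lines, dialect="excel-tab"):
        if not row:
            continue  # skip blank lines
        if not 3 <= len(row) <= 5:
            raise ValueError("expected 3 to 5 columns, got %d" % len(row))
        name, start, stop = row[0], int(row[1]), int(row[2])
        length = int(row[3]) if len(row) > 3 and row[3] else None
        if name == "gap":
            # A gap row carries 0 for start and stop, plus a length.
            if start != 0 or stop != 0 or length is None:
                raise ValueError("gap rows need 0, 0 and a length")
            segments.append(Segment(None, 0, 0, length, False))
        else:
            # start > stop selects the minus strand of the accession.
            segments.append(Segment(name, start, stop, length, start > stop))
    return segments
```
.fi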
Clone library name (will appear as \fB/clone-lib="\fP\fIstr\fP\fB"\fP
on the source feature)
\fB-L\fP\ \fIfilename\fP
Read phrap contig order from \fIfilename\fP. This is a tab-delimited
file that can be used to drive the order of contigs (normally
specified by \fB-P\fP), as well as indicating the SP6 and T7 ends. It
can also be used when contigs are known to be in opposite orientation.
The first column is the contig name, the second is the orientation,
the third is the fragment_group, the fourth indicates the SP6 or T7
end, and the fifth says which side of SP6 or T7 end had vector
Map name (will appear as \fB/map="\fP\fIstr\fP\fB"\fP on the source feature)
Annotate assembly_fragments
\fB-O\fP\ \fIfilename\fP
Read comment from \fIfilename\fP. Lines may be at most 100 characters
long; \fB~\fP is a linebreak and \fB`~\fP is a literal \fB~\fP. You can
check the format with \fBPsequin\fP(1).
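.IP
The marker rules above can be sketched in Python as follows. Both
helper names are hypothetical and not part of \fBfa2htgs\fP. How
the program treats physical line breaks in the file is not stated here,
so the decoder works on one line at a time.
.IP
.nf
```python
from typing import List

MAX_LINE = 100  # per-line limit stated for the comment file


def too_long_lines(text: str) -> List[int]:
    """Return 1-based numbers of lines longer than MAX_LINE characters."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if len(line) > MAX_LINE]


def decode_markers(line: str) -> List[str]:
    """Split one comment line at "~" markers, turning "`~" into "~"."""
    pieces = [""]
    i = 0
    while i < len(line):
        if line.startswith("`~", i):
            pieces[-1] += "~"  # escaped tilde stays in the text
            i += 2
        elif line[i] == "~":
            pieces.append("")  # bare tilde starts a new display line
            i += 1
        else:
            pieces[-1] += line[i]
            i += 1
    return pieces
```
.fi
.IP
Running a draft comment through both helpers before submission gives a
quick preview of how the text should break, although \fBPsequin\fP(1)
remains the authoritative check.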
Contigs to use, separated by commas. If \fB-P\fP is not given
together with the \fB-T\fP option, then the fragments will be added in the order
that they are in the ace file (which is appropriate for a phase 1
record, but not for a phase 2 or 3). If you need to set the order of
the segments of the ace file, you need to set it with the \fB-P\fP
flag, like this: \fB-P "Contig1,Contig4,Contig3,Contig2,Contig5"\fP
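.IP
Since \fBfa2htgs\fP is meant to be driven by scripts, a wrapper might
assemble the contig order programmatically. The Python sketch below
only builds an argument list from options documented on this page. The
function name and any file names passed to it are hypothetical, and a
real submission would need further options that are left out here.
.IP
.nf
```python
from typing import List, Sequence


def phrap_command(ace_file: str, template: str, contigs: Sequence[str],
                  output: str) -> List[str]:
    """Build an argument list that fixes the contig order with -P."""
    if not contigs:
        raise ValueError("give at least one contig name")
    return [
        "fa2htgs",
        "-T", ace_file,           # phrap (ace) input
        "-t", template,           # Seq-submit template
        "-P", ",".join(contigs),  # contigs in the desired order
        "-o", output,             # ASN.1 output file
    ]
```
.fi
.IP
The resulting list can be handed to subprocess.run(), which avoids
shell quoting problems with the comma-separated contig names.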
\fB-T\fP\ \fIfilename\fP
Filename for phrap input (mutually exclusive with \fB-A\fP and \fB-i\fP)
The coordinates in the input file are on the resulting segmented
sequence. (Bases 1 through \fIn\fP of each accession are used.)
Otherwise, the coordinates are on the individual accessions, which
need not start at base 1 of the record.
GenBank accession; use if and only if updating a sequence.
Gap length (default = 100; anything from 0 to 1000000000 is legal)
Clone name (will appear as \fB/clone\fP in the source feature; can be
the same as \fB-s\fP)
Title for sequence (will appear in GenBank \fBDEFINITION\fP line)
\fB-e\fP\ \fIfilename\fP
Log errors to \fIfilename\fP
Genome Center tag (probably the same as your login name on the NCBI FTP server)
Chromosome (will appear as \fB/chromosome\fP in the source feature)
\fB-i\fP\ \fIfilename\fP
Filename for FASTA input (default is stdin; mutually exclusive with
\fB-A\fP and \fB-T\fP)
Length of sequence in bp (default = 0). The length is checked against
the actual number of bases we get. For phase 1 and 2 sequence it is
also used to estimate gap lengths. For phase 1 and 2 records, it is
important to use a number GREATER than the amount of provided
nucleotide, otherwise this will generate false 'gaps'. It is
assumed here that the putative full length of the BAC or cosmid will be
used. There should be at least 20 to 30 'n' characters between the segments
(you can check for these in Sequin), as this will ensure proper
behavior when this sequence is used with BLAST. Otherwise, 'artifactual'
unrelated segment neighbors may be brought into
proximity of each other.
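.IP
A submission script could guard against the false gap problem described
above with a small check before calling \fBfa2htgs\fP. The Python
function below is a hypothetical helper, not part of the program. It
only enforces the rule that the declared length must exceed the bases
provided. How \fBfa2htgs\fP divides the remainder among gaps is not
described here and is not modelled.
.IP
.nf
```python
def unplaced_bases(declared_length: int, sequence_bases: int) -> int:
    """Return the bases left over for gaps in a phase 1 or 2 record."""
    if sequence_bases < 0:
        raise ValueError("sequence_bases must not be negative")
    if declared_length <= sequence_bases:
        # A length at or below the sequence total leaves nothing for gaps.
        raise ValueError(
            "declared length %d must exceed the %d bases provided"
            % (declared_length, sequence_bases))
    return declared_length - sequence_bases
```
.fi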
Take comment from template
Organism name (default = Homo sapiens)
\fB-o\fP\ \fIfilename\fP
Filename for ASN.1 output (default = stdout)
A collection of unordered contigs with gaps of unknown length. A
Phase 1 record must at the very least have two segments with one gap.
A series of ordered contigs, possibly with known gap lengths. This
could be a single sequence without gaps, if the sequence has
ambiguities to resolve.
A single contiguous sequence. This sequence is finished, but not
necessarily annotated.
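.IP
The segment requirements above lend themselves to a simple pre-flight
check. The Python sketch below assumes that the three descriptions
correspond to phases 1, 2 and 3 in that order. The function name is
hypothetical, and the check covers only the segment counts stated
here, not everything GenBank may require.
.IP
.nf
```python
def check_phase(phase: int, segment_count: int) -> None:
    """Raise ValueError if the segment count does not fit the phase."""
    if segment_count < 1:
        raise ValueError("need at least one segment")
    if phase == 1 and segment_count < 2:
        # Two segments imply at least one gap between them.
        raise ValueError("phase 1 needs at least two segments and one gap")
    if phase == 3 and segment_count != 1:
        raise ValueError("phase 3 is a single contiguous sequence")
```
.fi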
htgs_cancelled keyword
Remark for update (brief comment describing the nature of the update,
such as "new sequence", "new citation", or "updated features")
Sequence name. The sequence must have a name that is unique within
the genome center. We use the combination of the genome center name
(\fB-g\fP argument) and the sequence name (\fB-s\fP) to track this
sequence and to talk to you about it. The name can have any form you
like but must be unique within your center.
\fB-t\fP\ \fIfilename\fP
Filename for Seq-submit template (default = template.sub)
Take biosource from template
htgs_activefin keyword
Secondary accession numbers, separated by commas, e.g., U10000,L11000.
In some cases a large segment will supersede one or more other
accession numbers (records). These records, which are no longer wanted
in GenBank, should be made secondary. Using the \fB-x\fP argument you
can list the accession numbers you want to make secondary. This will
instruct us to remove the accession number(s) from GenBank, and they will
no longer be part of the GenBank release. They will nonetheless be
available from Entrez.
\fBGREAT CARE\fP should be taken when using this argument! Improper
use of accession numbers here will result in the inappropriate
withdrawal of GenBank records from GenBank, EMBL and DDBJ. We provide
this parameter as a convenience to submitting centers, but it may
need to be removed if it is not used carefully.
This manual page was written by Aaron M. Ucko <ucko@debian.org>,
for the Debian GNU/Linux system (but may be used by others).
/usr/share/doc/ncbi-tools-bin/README.fa2htgs.gz