=+= README =+============Last update: April 4, 2000 ============

The latest version of this document can be found at:

ftp://ftp.ncbi.nih.gov/fa2htgs/README

After consulting with NCBI staff (see contact information below),
submitters from genome sequencing centers will establish the best
protocol for depositing their sequence submission data to

One of these protocols may require the fa2htgs tool, present in this
directory. fa2htgs is a program used to generate Seq-submits (ASN.1
sequence submission files) for high throughput genome sequencing
projects. Presently we have built fa2htgs for the following platforms:

win32/fa2htgs.exe (win95/NT)

If fa2htgs is required for a platform not present here,
please let us know (address below) and we will be happy to

fa2htgs will read a FASTA file (or an Ace Contig file with Phrap sequence
quality values), a Sequin submission template file (to get contact
and citation information for the submission), and a series of command line
arguments (see below). The program then combines this
information to make a submission suitable for GenBank. Once you have
generated your submission file, you need to follow the submission
protocol (see the README present on your FTP account or mailed out to

fa2htgs is intended for automated, scripted bulk submission of
unannotated genome sequence. It can easily be extended from its current
simple form to allow more complicated processing. A submission
prepared with fa2htgs can also be read into Sequin and then annotated
more extensively. See the Sequin home page at:

http://www.ncbi.nlm.nih.gov/Sequin/

. Contacting NCBI about HTGS submissions and about using fa2htgs:

Questions and concerns about this processing protocol, or about how to
use this tool, should be forwarded to:

htgs@ncbi.nlm.nih.gov.

Typing "fa2htgs -" will cause the program to show its command line
arguments. Below we show these with additional comments (what we show
within { } does not appear on the command line).

fa2htgs 2.0 arguments:

   -i  Filename for fasta input [File In]

   -t  Filename for Seq-submit template [File In]
       default = template.sub
   -o  Filename for asn.1 output [File Out] Optional

   -e  Log errors to file named: [File Out] Optional
   -n  Organism name? [String] Optional
       default = Homo sapiens
   -s  Sequence name? [String]

      { The sequence must have a name that is unique within }
      { the genome center. We use the combination of the genome }
      { center name (-g argument) and the sequence name (-s) to }
      { track this sequence and to talk to you about it. }
      { The name can have any form you like but must be unique }

   -l  length of sequence in bp? [Integer]

      { The length is checked against the actual number of }
      { bases we get. For phase 1 and 2 sequence it is also }
      { used to estimate gap lengths. For phase 1 and 2 }
      { records, it is important to use a number GREATER than }
      { the amount of provided nucleotide, otherwise this will }
      { generate false 'gaps'. Here it is assumed that the }
      { putative full length of the BAC or cosmid will be used. }
      { There should be at least 20 to 30 'n's between the }
      { segments (you can check for these in Sequin), as this }
      { will ensure proper behavior when this sequence }
      { is used with BLAST. Otherwise 'artifactual' unrelated }
      { segment neighbors may be brought into proximity of }

   -g  Genome Center tag? [String]

      { This is probably the same as your login name on the }

   -p  HTGS phase? [Integer]

      { Phase 1 - a collection of unordered contigs with }
      {           gaps of unknown length. A phase 1 record must }
      {           at the very least have two segments with }

      { Phase 2 - a series of ordered contigs, gap lengths may }
      {           be known. This could be a single sequence, }
      {           without gaps, if the sequence has ambiguities }
      {           which will be resolved. }
      { Phase 3 - a single contiguous sequence. This sequence }
      {           is finished, although it may, or may not }

   -a  GenBank accession (if an update) [String] Optional

      { This argument is required if this is an update; do }
      { not use it if you are preparing a new submission. }

   -r  Remark for update? [String] Optional

      { If this is an update, you can add a brief comment }
      { (within "") describing the nature of the update }
      { ("new sequence", "new citation", "updated features"). }

   -c  Clone name? [String] Optional

      { Will appear as /clone in the source feature. }
      { This could be the same as the -s argument (sequence }
      { name) but this one will appear in the /clone qualifier. }

   -h  Chromosome? [String] Optional

      { Will appear as /chromosome in the source feature. }

   -d  Title for sequence? [String] Optional

      { The text that will appear in the DEFINITION line }
      { of the GenBank flatfile. }

   -m  Take comment from template ? [T/F] Optional

   -u  Take biosource from template ? [T/F] Optional

   -x  Secondary accession number, separate by commas if multiple, s.t. U10000,L11000 [String] Optional

      { ACCESSION   AC000000   L00000 }
      {                |       secondary accession number }
      {             primary accession number }

      { In some cases a large segment will supersede another }
      { accession number (record) or a group of them. These }
      { records, which are no longer wanted in GenBank, should be }
      { made secondary. Using the -x argument you can list the }
      { Accession Numbers you want to make secondary. This will }
      { instruct us to remove the accession number(s) from }
      { GenBank, and they will no longer be part of the GenBank }
      { release. They will nonetheless be available from Entrez. }

      { !!! GREAT CARE should be taken when using this argument !!! }
      { Improper use of accession numbers here will result in }
      { the inappropriate withdrawal of GenBank records from }
      { GenBank, EMBL and DDBJ. We provide this parameter as }
      { a convenience to submitting centers, but it may need to be }
      { removed if it is not used carefully. }

   -C  Clone library name? [String] Optional

      { Will appear as /clone-lib="string" on the source feature. }

   -M  Map? [String] Optional

      { Will appear as /map="string" on the source feature. }

   -O  Filename for the comment: [File In] Optional

      { Will read the comment from a given file, with a }
      { maximum of 100 characters per line. }
      { New lines can be incorporated with "~", and if you }
      { actually want to include the "~" in your text, you }
      { need to escape it with "`". Please ensure that the }
      { correct format is obtained by viewing your comment }

   -T  Filename for phrap input [File In] Optional

      { Using this argument implies that you are NOT using the }

   -P  Contigs to use, separate by commas if multiple [String] Optional

      { If -P is not given with the -T option, then the }
      { fragments will go in in the order that they are in the }
      { ace file (which is appropriate for a phase 1 record, }
      { but not for a phase 2 or 3). If you need to set the }
      { order of the segments of the ace file, you need to set }
      { it with the -P flag, like this: }
      { -P "Contig1,Contig4,Contig3,Contig2,Contig5" }

   -A  Filename for accession list input [File In] Optional

      { Using this argument implies that you are NOT using the }
      { -i or -T arguments above. The input file contains a }
      { tab-delimited table with three to five columns, which }
      { are accession number, start position, stop position, }
      { and (optionally) length and strand. If start > stop, }
      { the minus strand on the referenced accession is used. }
      { A gap is indicated by the word "gap" instead of an }
      { accession, 0 for the start and stop positions, and a }
      { number for the length. }

   -X  Coordinates are on the resulting sequence ? [T/F] Optional

      { If -X is TRUE, then the coordinates in the input file }
      { are on the resulting segmented sequence. This implies }
      { that bases 1 through n of each accession are used. }
      { If -X is FALSE, the coordinates are on the individual }
      { accessions, and these need not start at base 1 of the }

   -D  HTGS_DRAFT sequence? [T/F] Optional

   -S  Strain name? [String] Optional

   -b  Gap length [Integer]

       range from 0 to 1000000000

   -N  Annotate assembly_fragments [T/F] Optional

   -6  SP6 clone (e.g., Contig1,left) [String] Optional

   -7  T7 clone (e.g., Contig2,right) [String] Optional

   -L  Filename for phrap contig order [File In] Optional

      { This is a tab-delimited file that can be used to drive }
      { the order of contigs (normally specified by -P), as well }
      { as indicating the SP6 and T7 ends. It can also be used }
      { when contigs are known to be in opposite orientation. }

      { Contig2   +   1   SP6   left  }

      { Contig1   -       T7    right }

      { The first column is the contig name, the second is the }
      { orientation, the third is the fragment_group, the fourth }
      { indicates the SP6 or T7 end, and the fifth says which }
      { side of the SP6 or T7 end had vector removed. }
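
The accession list read by -A (described earlier in this argument list)
is simple enough to check before running fa2htgs. The following is only a
minimal sketch of such a check, not part of fa2htgs itself: the function
and class names are made up for illustration, and the optional strand
column is ignored because this README does not say how it is encoded
(strand is derived from start > stop, as described for -A).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessionRow:
    """One row of a -A accession list (hypothetical helper type)."""
    accession: Optional[str]  # None when the row describes a gap
    start: int
    stop: int
    length: Optional[int]     # optional fourth column
    minus_strand: bool        # True when start > stop


def parse_accession_list(text: str) -> List[AccessionRow]:
    """Parse tab-delimited rows: accession, start, stop, [length], [strand]."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption of this sketch)
        fields = line.split("\t")
        if not 3 <= len(fields) <= 5:
            raise ValueError(f"line {line_no}: expected 3 to 5 columns")
        name = fields[0].strip()
        start, stop = int(fields[1]), int(fields[2])
        has_length = len(fields) > 3 and fields[3].strip() != ""
        length = int(fields[3]) if has_length else None
        if name.lower() == "gap":
            # A gap row uses 0 for start and stop and must give a length.
            if start != 0 or stop != 0 or length is None:
                raise ValueError(f"line {line_no}: malformed gap row")
            rows.append(AccessionRow(None, 0, 0, length, False))
        else:
            rows.append(AccessionRow(name, start, stop, length, start > stop))
    return rows


# Accession numbers borrowed from the -x help text; coordinates are invented.
example = "U10000\t1\t500\ngap\t0\t0\t100\nL11000\t900\t1\n"
rows = parse_accession_list(example)
```

Rejecting malformed rows early is a design choice of this sketch; fa2htgs
may report such problems differently.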

Presented here is an example of a phase 2 submission from an Arabidopsis
sequencing center. It is followed by the command line arguments used in
an example with a Phrap ace file.

BEFORE YOU BEGIN: fa2htgs depends on the presence of some external
files. These are provided with Sequin, so if a networked version of
Sequin is already installed (see URL above for Sequin info) all the
default files that need to be present will be there and allow fa2htgs

Here are the files you need (let's assume we have a 100 kb BAC):

1) FASTA file (example below)
2) Sequin submission file (more on this below)
3) genome center name ("pgec" in this example, use your

4) the sequence/clone name (this will *always* stay with the record)

   phase 1: multiple pieces, not in order (always >= 2 pieces,

   phase 2: multiple pieces, in order, but can be as few as
            one unfinished sequence
   phase 3: 1 piece, where the sequence is "finished"

6) the full sequence length, when the project is finished (e.g. 100000

7) A new submission has no Accession Number, and an update always
does. You will need to keep track of this (i.e. which sequence name has
which accession number)

8) The organism, in this example "Arabidopsis thaliana"

9) The chromosome number, 1 in this example.

10) the output (file name) convention so far has been to call it the
clone name.ss (e.g. P74A8.ss); "ss" stands for seq-submit, or sequence
submission. We then have our scripts/code report with the same file
name convention. Also note that because we are working in Unix space,
the case of letters is important, and try to avoid 'metacharacters'

So the phase 1 or 2 FASTA file will look like this (in this example,
this one has 3 segments, but you could (in phase 1) have many more):

>P74A8 pcr product joining p130c12 and p91c10
gatcagcccaaagcattgattaggggaacttacctgtagagggctgcagcaatggggaac
acctggctgggtcacagagtggtcaatgcactccatgacttttgggtcaggacacagaaa
gaaagagcggggaaccggggggccctacagtgatgaattatactaactgattttagaatg

ttaaacaaacattgcatttccagaataaaccccatttagtaacgcatagtgtgcttgtat
ctcagcctcccaaagtgctgggattatagacatgagccagcgcacctggctttgttagcc

ttttcaaataactttttgaactttgttaattttttaattgcacgttttctccttcattta
ctaattccattcaaaagtagcatcaatgagaataaattacttaggaatacatttaattaa
aaagtgctagacttgtacactgaaaattacaaagtactctggagatatattc

The first line has the sequence id and a title, then each segment

where you put a "?" if you don't know the distance between the pieces,
or a number of bp if you do know the distance (e.g. 200 bp), and the
other line is the FASTA-formatted next segment (foobar). That is it
for phase 1 or 2. Phase 3 will be a single FASTA file. Phase 1 gaps
will probably always be >?.
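
Because the -l value is checked against the actual number of bases, it can
help to count the bases in the FASTA file before building the command
line. The following is a minimal sketch, not part of fa2htgs: the function
names are hypothetical, every line starting with ">" is treated as a
header or gap marker and skipped, and the exact-match rule for phase 3 is
an assumption (this README only states the "GREATER than" rule for phases
1 and 2).

```python
def count_bases(fasta_text: str) -> int:
    """Count sequence characters, skipping blank lines and '>' lines."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # header, gap marker, or blank line
        total += len(line)
    return total


def length_is_plausible(fasta_text: str, declared_length: int, phase: int) -> bool:
    """Compare a proposed -l value with the bases actually provided."""
    provided = count_bases(fasta_text)
    if phase in (1, 2):
        # Phases 1 and 2: -l should be GREATER than the provided nucleotides.
        return declared_length > provided
    # Phase 3: assumed here to match the sequence exactly.
    return declared_length == provided


# Tiny made-up input, only to exercise the functions.
sample = ">P74A8 example\nacgt\nac\n>?\nggg\n"
provided = count_bases(sample)
ok_phase1 = length_is_plausible(sample, 100, 1)
```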

The other thing you need is a submission prepared by Sequin. This
will allow you to put in the references, authors, titles, and submission
information the way you want it. You simply need to make a 1 bp
submission, really. fa2htgs will read that file and copy the
information over to the HTGS information with the "real" data.

Once you have made the submission, you deposit it on the FTP account
under the "SEQSUBMIT" directory. We have software that looks for it there
every day, validates the center and clone (sequence) ids, checks if it's an
update and so on, and writes a report that you can pick up the next

It is good to put the output of fa2htgs in Sequin and validate the
record. This is especially important for phase 3 records where many
annotations may be present (added with the help of Sequin): Sequin has
a very good validation suite (look under Search -> Validate).

This finished record is now ready for deposition to your FTP account
in the SEQSUBMIT directory.

Example of the command line arguments using a quality score/Phrap ace file
(all on the same command line):

./fa2htgs -t nuc1.sqn -o test.cmd32.out -s Phrap_Contig_Test2 -l 111505
-g pgec -p 2 -h 1 -d Phrap_Contig_Test2 -n "Arabidopsis thaliana"
-T g5129z079.ace -P "Contig1,Contig2,Contig4,Contig3,Contig7"
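
Since fa2htgs is meant to be driven by scripts, the command line above can
also be assembled programmatically. The following is a minimal sketch in
Python under a few assumptions: the helper name build_fa2htgs_args is
hypothetical, the program path ./fa2htgs is taken from the example, and
only the flags and values shown above are used. Building an argument list
(rather than one long string) avoids quoting problems with values such as
"Arabidopsis thaliana".

```python
import shlex


def build_fa2htgs_args(options: dict, program: str = "./fa2htgs") -> list:
    """Turn {flag letter: value} pairs into an argument list."""
    args = [program]
    for flag, value in options.items():
        args.extend([f"-{flag}", str(value)])
    return args


# Same flags and values as the command line shown above.
options = {
    "t": "nuc1.sqn",
    "o": "test.cmd32.out",
    "s": "Phrap_Contig_Test2",
    "l": 111505,
    "g": "pgec",
    "p": 2,
    "h": 1,
    "d": "Phrap_Contig_Test2",
    "n": "Arabidopsis thaliana",
    "T": "g5129z079.ace",
    "P": "Contig1,Contig2,Contig4,Contig3,Contig7",
}

argv = build_fa2htgs_args(options)

# Show a shell-quoted form for logging; requires Python 3.8 or later.
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run(argv, check=True) on
a machine where fa2htgs is installed; the sketch stops short of running it.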

Example of a contig file for a yeast chromosome (with coordinates on the
individual accessions):

-- Questions about fa2htgs or how to submit?

Just contact us at NCBI:

e-mail: htgs@ncbi.nlm.nih.gov

==============+= end of the fa2htgs README =+==========================