TBL2ASN AUTOMATED BULK SUBMISSION PROGRAM

tbl2asn is a program that automates the submission of sequence records to
GenBank. It uses many of the same functions as Sequin, but is driven
entirely by data files, and records need no additional manual editing before
submission. Entire genomes, consisting of many chromosomes with feature
annotation, can be processed in seconds using this method.

For a submission, tbl2asn expects a template file containing a text ASN.1
Submit-block object. These can be easily generated in Sequin and saved for
use by tbl2asn. The Submit-block contains contact information (to whom
questions on the submission can be addressed) and a submission citation
(which lists the authors who get scientific credit for the sequencing).

The template file can also contain one or more text ASN.1 Seq-descr objects
(such as Title or BioSource) appended after the Submit-block. These can
also be generated in Sequin and saved to a file, and then appended to the
template with a text editor. They will become descriptors packaged at the
top of each submission file.

tbl2asn reads five other kinds of data files. Nucleotide sequence data is
expected in FASTA format, and these files are identified by having a .fsa
suffix. Feature table files, in the five-column format described later,
have a .tbl suffix. These can be easily generated by most genome centers
that maintain feature locations in a spreadsheet or database. The protein
translations of CDS features can also be supplied as FASTA sequences in
files with a .pep suffix. These will replace the tbl2asn-generated
conceptual translations, and can be used to verify correct CDS intervals.
The nucleotide sequence products of mRNA features can be provided as FASTA
files with a .rna suffix. Sequence quality scores can be supplied in files
with a .qvl suffix.

tbl2asn generates a .sqn file for submission to the database from these
files.


COMMAND-LINE ARGUMENTS

To process a set of chromosomes, sets of .fsa and .tbl files (along with
optional .pep, .rna, and .qvl files) are placed into a source directory.
The path to this directory is specified in the -p command-line argument.
The path for the resulting .sqn submission files is given in the -r
argument. If the -r argument is not given, the .sqn files are saved in the
source directory.

For example, if an organism has fifteen chromosomes, one would expect at
least the following files in the source directory:

The exact names of the files are not important, but when a file with a
suffix of .fsa is found, tbl2asn will look for a file with the same prefix
that has a .tbl suffix, and then generate a .sqn file.
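
The prefix matching described above can be sketched in a few lines of
Python. This is only an illustration of the naming convention, not the
logic tbl2asn itself uses; the function name group_input_files and the
shape of the returned dictionary are assumptions made for this example.

```python
from pathlib import Path

# Optional companion suffixes named in this document.
OPTIONAL_SUFFIXES = (".pep", ".rna", ".qvl")


def group_input_files(source_dir):
    """Map each .fsa prefix to the companion files that share that prefix.

    Hypothetical helper: it mirrors the naming convention only and does
    not inspect file contents.
    """
    groups = {}
    for fsa in sorted(Path(source_dir).glob("*.fsa")):
        tbl = fsa.with_suffix(".tbl")
        entry = {"fsa": fsa, "tbl": tbl if tbl.exists() else None}
        for suffix in OPTIONAL_SUFFIXES:
            companion = fsa.with_suffix(suffix)
            if companion.exists():
                entry[suffix.lstrip(".")] = companion
        groups[fsa.stem] = entry
    return groups
```

A pre-flight check built on this could, for instance, report every prefix
whose "tbl" entry is None before tbl2asn is run.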

The -t command-line argument specifies the template file.

Normally a single FASTA sequence per .fsa file is expected. If there are
multiple sequences, only the first is processed, unless one of two other
flags is given. These are discussed below.

The -s flag tells tbl2asn to package the multiple FASTA components as a set
of unrelated sequences. This accommodates users who create a single file
instead of one file per sequence.

The -d flag tells the program to make a delta sequence out of the multiple
components. This can be used for HTGS submissions where the sequence of the
BAC/PAC clone has not been completely determined. Gaps of 100 base pairs
should be inserted in between the actual sequence segments with lines
containing an angle bracket '>', a question mark '?', and the length of
the gap.

The -g flag causes tbl2asn to generate a genomic product set. Within the
set, the products of each related mRNA and CDS are packaged together in an
internal nuc-prot set. The feature table must provide reciprocal
protein_id and transcript_id qualifiers in order to correctly identify each
mRNA/CDS pair. From the resulting .sqn file, the genomic sequence, all
transcripts, and all proteins will be entered into the database and given
accessions. Note, however, that -g cannot be used for records submitted to
GenBank. It is only suitable for records going into RefSeq.

If a feature table is not given, the -c flag tells tbl2asn to annotate the
longest Open Reading Frame (ORF) on each record. The -m flag allows
alternative start codons to be used when finding the ORF. The protein will
be named 'unknown' unless the name is present in the .fsa file definition
line, e.g., [protein=helicase].

Data records will be validated when the -v flag is indicated. Output is
saved to files with a .val suffix. The validator checks for many things,
including internal stops in CDS features and mismatches between the CDS
translation and the supplied protein sequence. Errors need to be corrected
before submitting files to GenBank.

GenBank format output is generated when the -b flag is used. Resulting
files have a .gbf suffix.
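
As a rough illustration of how these arguments combine, the Python sketch
below assembles an argument list from the flags named in this section.
The helper build_tbl2asn_args and its parameter names are hypothetical.
Passing -s, -d, -v, and -b as bare switches is an assumption based on the
wording above; some builds may expect a value after each flag, so the
program's own usage output should be checked before relying on it.

```python
def build_tbl2asn_args(template, source_dir, results_dir=None,
                       fasta_mode=None, validate=False, genbank=False):
    """Assemble a tbl2asn argument list from the flags in this document.

    Hypothetical helper. Flags are emitted as bare switches, which is an
    assumption; nothing is executed here.
    """
    args = ["tbl2asn", "-t", str(template), "-p", str(source_dir)]
    if results_dir is not None:
        args += ["-r", str(results_dir)]
    if fasta_mode == "set":        # multiple FASTA records, unrelated set
        args.append("-s")
    elif fasta_mode == "delta":    # multiple FASTA records, delta sequence
        args.append("-d")
    elif fasta_mode is not None:
        raise ValueError("fasta_mode must be None, 'set', or 'delta'")
    if validate:                   # write .val validator output
        args.append("-v")
    if genbank:                    # write .gbf GenBank flatfiles
        args.append("-b")
    return args
```

The resulting list could be handed to subprocess.run once the flag syntax
has been confirmed for the installed version.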


NUCLEOTIDE SEQUENCE FORMAT

tbl2asn can read nucleotide sequences of any size in FASTA format. A FASTA
record consists of a single definition line, beginning with a '>' and
followed by optional text, and subsequent lines of sequence. At minimum,
all definition lines must contain an identifier for the sequence, called
the SeqID. The SeqID cannot begin with "contig", as this is reserved for
entry of accession lists in Sequin. Other optional information about the
biological source of the organism can also be encoded in brackets on the
definition line. A sample definition line is

>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]

Other elements include [topology=circular] and [location=mitochondrion].
RNA viruses would be indicated by [molecule=rna] and [moltype=genomic]. The
sequencing technique can be supplied as [tech=fli cDNA]. Many other source
qualifiers, such as map, clone, isolate, cell-line, and cultivar, can be
used. For organisms that are not commonly submitted with tbl2asn, the
nuclear and mitochondrial genetic codes can be indicated by [gcode=1] and
[mgcode=3], respectively. This will ensure proper translation of CDS
features. Primary accessions of TPA (third party annotation) records are
given by [primary=xxx,xxx,...]. Finally, a general note can be added with

Note that the definition line must be a single line, with no return or
newline characters. Some word-processors will word-wrap text, either during
display or when saving to a file, and care must be taken to avoid unwanted
newlines introduced by the editor.

>slpy [organism=Zea mays] [chromosome=9] Sleepy transposon
TGTAAGATCACTGCTGGGTTGTTGATGAGTTGAGCACCGCTCCCGGCACCCGTCTCCTCTCACGAAGATC
TTTAGGGTATGAAAAGTATCTGGAGTTCTTACACGACGGCGAGCCGCCTCTTCTCCGGACGCAGCCGGCC
AGCCTTCTTCTCCAAGTCACCTTTTACCGACTCCAAACCCCACCTCAAATACTCCACTCAATCCAGATCA
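
To make the definition-line layout concrete, the following Python sketch
splits a definition line into its SeqID, its bracketed [name=value]
modifiers, and any remaining title text. It is a minimal reading of the
examples in this section, not the parser tbl2asn uses; the function name
parse_defline and the returned tuple are assumptions, and no check is made
that modifier names are ones tbl2asn recognizes.

```python
import re

# One bracketed modifier such as [organism=Zea mays].
_MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (seqid, modifiers, title)."""
    line = line.rstrip("\r\n")
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    parts = line[1:].split(None, 1)
    if not parts:
        raise ValueError("definition line has no SeqID")
    seqid = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    modifiers = {name.strip(): value.strip()
                 for name, value in _MODIFIER.findall(rest)}
    # Whatever is left after removing the brackets is treated as the title.
    title = " ".join(_MODIFIER.sub(" ", rest).split())
    return seqid, modifiers, title
```

Applied to the slpy example above, this returns the SeqID "slpy", the
organism and chromosome modifiers, and the title "Sleepy transposon".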

Multiple SeqIDs can be indicated in FASTA parsable strings.

>gnl|ZGP|chr1|gb|U28041
>gi|54465|emb|X16935|MMTCRAC

tbl2asn reads features from a simple five-column tab-delimited table. This
is described in more detail at

http://www.ncbi.nlm.nih.gov/Sequin/table.html

The feature table specifies the location and type of each feature, and
tbl2asn processes the feature intervals and translates any CDSs into
proteins. The first line of the table contains the following basic
information:

>Features SeqId table_name

The SeqId must be the same as that used on the sequence. The table name is
optional. Subsequent lines of the table list the features. Columns are
separated by tabs.

The first and second columns are the start and stop locations of the
feature, respectively, the third column is the type of feature (the feature
key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g.,
"product"), and the fifth is a qualifier value (e.g., the name of the protein).

A simple feature table is

product RNA helicase SDE3
product RNA helicase SDE3

If a feature contains multiple intervals, each interval is listed on a
separate line by its start and stop position. Features that are on the
complementary strand are indicated by reversing the interval locations.
Locations of partial (incomplete) features are indicated with a '>' or '<'

Gene features are always a single interval, and their location should cover
the intervals of all the relevant features. If the gene feature spans the
intervals of the CDS or mRNA features for that gene, there is no need to
include gene qualifiers on those features in the table, since they will be
picked up by overlap. Use of the overlapping gene can be suppressed by
adding a gene qualifier with the value "-". This is important when, for
example, a tRNA is encoded within an intron of a housekeeping gene.
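
The column rules above can be illustrated with a small Python reader. It
is a sketch under stated assumptions rather than a reimplementation of
tbl2asn: the names parse_feature_table and _interval are hypothetical, the
'<' and '>' partial markers are assumed to be written as a prefix on the
coordinate, and a line with only start and stop columns is treated as an
additional interval of the preceding feature.

```python
def _interval(start, stop):
    """Convert two coordinate strings into an interval record."""
    partial = start[0] in "<>" or stop[0] in "<>"
    a = int(start.lstrip("<>"))
    b = int(stop.lstrip("<>"))
    # Reversed coordinates mark the complementary strand.
    return {"start": a, "stop": b,
            "strand": "-" if a > b else "+", "partial": partial}


def parse_feature_table(text):
    """Read a five-column feature table into (seqid, list of features)."""
    seqid, features = None, []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith(">Features"):
            parts = raw.split()
            seqid = parts[1] if len(parts) > 1 else None
            continue
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))          # pad short lines
        start, stop, key, qual, value = (c.strip() for c in cols[:5])
        if start and stop:
            if key:                              # new feature
                features.append({"key": key, "qualifiers": [],
                                 "intervals": [_interval(start, stop)]})
            elif features:                       # extra interval
                features[-1]["intervals"].append(_interval(start, stop))
            else:
                raise ValueError("interval line before any feature")
        elif qual:                               # qualifier line
            if not features:
                raise ValueError("qualifier line before any feature")
            features[-1]["qualifiers"].append((qual, value))
    return seqid, features
```

Because qualifiers are attached to the most recent feature, a gene
qualifier with the value "-" simply appears in that feature's qualifier
list; deciding what it suppresses is left to the caller.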

Translation exception qualifiers are parsed from the same style used in the

transl_except (pos:591..593,aa:Sec)

The codon recognized and anticodon position of tRNAs can also be given.

anticodon (pos:7591..7593,aa:Trp)
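
The (pos:start..stop,aa:Xxx) value shared by both examples can be picked
apart with a regular expression. The Python sketch below handles only the
simple form shown here; the helper name parse_pos_aa is hypothetical, and
more elaborate location syntax, if tbl2asn accepts any, is not covered.

```python
import re

# Matches values such as (pos:591..593,aa:Sec).
_POS_AA = re.compile(r"^\(pos:(\d+)\.\.(\d+),aa:([A-Za-z]+)\)$")


def parse_pos_aa(value):
    """Return (start, stop, amino_acid) from a transl_except or
    anticodon qualifier value in the simple form shown above."""
    match = _POS_AA.match(value.strip())
    if match is None:
        raise ValueError("unrecognized value: %r" % value)
    start, stop, amino_acid = match.groups()
    return int(start), int(stop), amino_acid
```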

In addition to the standard qualifiers seen in GenBank format, several other
tokens are used to direct values to specific fields in the ASN.1 data.
These include gene_syn, gene_desc, locus_tag, prot_desc, prot_note,
region_name, bond_type, and site_type.

Genomic product sets require protein_id and transcript_id qualifiers on each
mRNA and CDS feature. These are used to associate the correct pair of
features for packaging.

transcript_id lcl|sde3m

Exceptional biological situations can be annotated by use of the exception
qualifier. For example

exception ribosomal slippage

The following are legal exception qualifier values

reasons given in citation
alternative processing
artificial frameshift
nonconsensus splice site
rearrangement required for product
modified codon recognition
alternative start codon

Since the International Nucleotide Sequence Database collaboration only
allows "RNA editing" and "reasons given in citation" to appear in release
mode, other exceptions are mapped to the /note qualifier in the flatfile.
However, each exception text string turns off specific validator tests that
would otherwise produce warning messages, so they should be entered as
exception qualifiers.

Gene Ontology (GO) terms can be indicated with the following qualifiers

go_component endoplasmic reticulum|0005783
go_process glycolysis and gluconeogenesis|57|89197757|ACT,TEM
go_function excision repair|93||IPD

The value field is separated by vertical bars '|' into a descriptive
string, the GO identifier (leading zeroes are retained), and optionally a
PubMed ID and one or more evidence codes.
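
The field layout of a GO value can be shown with a short Python sketch.
It follows the description above and nothing more: the function name
parse_go_value and the dictionary keys are assumptions, the identifier is
kept as a string so that leading zeroes survive, and evidence codes are
assumed to be comma-separated as in the go_process example.

```python
def parse_go_value(value):
    """Split a GO qualifier value on '|' into its documented fields."""
    fields = value.split("|")
    if len(fields) < 2:
        raise ValueError("expected at least 'description|GO identifier'")
    fields += [""] * (4 - len(fields))      # PubMed ID and evidence optional
    description, go_id, pubmed_id, evidence = fields[:4]
    return {
        "description": description.strip(),
        "go_id": go_id.strip(),              # string, zeroes preserved
        "pubmed_id": pubmed_id.strip() or None,
        "evidence": [code.strip() for code in evidence.split(",")
                     if code.strip()],
    }
```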


PROTEIN SEQUENCE FORMAT

Protein sequences are FASTA files with a .pep extension that can substitute
for the translated product of a CDS feature. Supplying these files acts as
a reality check that the CDS intervals do in fact translate to the expected
protein sequence. The FASTA defline with a '>' and sequence identifier are
required, but the [gene] and [protein] data (which are used by Sequin) are
optional.

>sde3p [gene=SDE3] [protein=RNA helicase SDE3]
MSVSUYKSDDEYSVIADKGEIGFIDYQNDGSSGCYNPFDEGPVVVSVPFPFKKEKPQSVTVGETSFDSFT
VKNTMDEPVDLWTKIYASNPEDSFTLSILKPPSKDSDLKERQCFYETFTLEDRMLEPGDTLTIWVSCKPK

The SeqID must match a protein_id in the .tbl file. In the example
above, the protein_id and transcript_id needed to explicitly use a 'lcl|'
prefix before the SeqID string to indicate a local identifier. A local
sequence identifier is assumed when reading FASTA, but a database accession
is assumed in the feature table.

Sequin's Suggest Interval functionality, which can derive CDS intervals
from nucleotide and protein sequences plus the genetic code, is not used in
tbl2asn. Instead, the CDS is required, and the supplied protein sequence
is just used to confirm proper translation.


MESSENGER RNA SEQUENCE FORMAT

mRNA sequences are FASTA files with a .rna extension that can substitute
for the transcribed product of an mRNA feature. Like the .pep files, they
act as a reality check that the supplied intervals do in fact encode the
expected mRNA sequence.

>sde3m [product=RNA helicase SDE3]
TTTTCATGTTTCTTCTCCTTTGAAGCCTGCCTGCGTTAGTCTGGCTTCATTGCTTCTCCATTTCTTGGTG
TGATCGAATCAAAGAGTGTAACCCATTTTGCTACTGATTCAGTACGTATGATCAATTCTCTCAATTTCAG

The SeqID must match the transcript_id from an mRNA feature.


QUALITY SCORES FORMAT

Phrap/Consed quality scores can be supplied in .qvl files. These generate
Seq-graph data that will be attached to the nucleotide sequence from the
.fsa file. Programs such as Sequin can display these in a graphical view.

51 63 70 82 82 82 90 90 90 90 86 86 86 86 90 90 90 90 90 86
86 86 86 86 86 86 86 90 90 90 90 90 90 86 86 78 78 90 90 86

These values can be extracted from the output files of the Phrap and Consed
programs used to process raw data from automated sequencing machines.


SUBMISSION TEMPLATE FORMAT

The submission template is an ASN.1 Submit-block that can be generated by
Sequin. A simple example is shown below.

affil "Oxbridge University" ,
div "Evolutionary Biology Department" ,
country "United Kingdom" ,
street "1859 Tennis Court Lane" ,
email "darwin@beagle.edu.uk" ,
phone "01 44 171-007-1212" ,
postal-code "OX1 2BH" } } } ,
initials "C.R." } } } ,
affil "Oxbridge University" ,
div "Evolutionary Biology Department" ,
country "United Kingdom" ,
street "1859 Tennis Court Lane" ,
postal-code "OX1 2BH" } } ,

This can be exported from the Desktop view of a template file in Sequin. In
addition, unpublished reference or comments can also be generated in Sequin
and saved from the Desktop. The two files can be catenated to make a .sbt
template with the publication or comment descriptor after the submit block.