GENBANK FLATFILE GENERATOR

A new flatfile generator has been written to replace the old asn2ff code.
It is provided both as a stand-alone application, asn2gb, and as C
functions in the NCBI software toolkit. There are several function
parameters, with equivalent command-line arguments, that control the
behavior of the new flatfile generator and customize its performance.

SeqEntryToGnbk takes a SeqEntryPtr or SeqLocPtr and calls asn2gnbk_setup,
asn2gnbk_format, and asn2gnbk_cleanup, which are available from a private
header. It returns FALSE if there was a problem generating the flatfile.
BioseqToGnbk is simply a convenience function that takes a BioseqPtr, looks
up the parent SeqEntryPtr, and then calls SeqEntryToGnbk. To use these
functions, #include <asn2gnbk.h>.

NLM_EXTERN Boolean SeqEntryToGnbk (

NLM_EXTERN Boolean BioseqToGnbk (

In the asn2gb application, format, mode, style, flags, locks, and custom
parameters are specified by the -f, -m, -s, -g, -h, and -u arguments,
respectively.

FORMATS include GENBANK_FMT, EMBL_FMT, GENPEPT_FMT, and FTABLE_FMT (Sequin's
5-column parsable feature table). If the sep passed to SeqEntryToGnbk points
to a Bioseq-set, the function processes all Bioseqs of the appropriate
molecule type (nucleotide or protein) for the specified format.

MODES are RELEASE_MODE, ENTREZ_MODE (the same strictness as release mode,
except that it allows local IDs and does not require a valid CDS /protein_id
accession), SEQUIN_MODE, and DUMP_MODE. In RefSeq records, otherwise illegal
qualifiers (e.g., /transcript_id) and db_xrefs can show up in release mode.
Entrez mode should be used for web display, and can show new elements that
have not yet finished their 4-month quarantine period.

STYLES are NORMAL_STYLE, SEGMENT_STYLE, MASTER_STYLE, and CONTIG_STYLE.
Segment style is the traditional representation of segmented sequences,
while contig style displays a CONTIG line with a join of accessions instead
of a sequence. Normal style automatically chooses between segment and contig
style, depending upon the kind of data. (Near segmented records are shown
in segment style. Far segmented sequences, or deltas with no literals, are
shown as if contig style had been chosen.) Master style shows features
mapped to the segmented Bioseq's coordinates.

FLAGS are bit flags controlling appearance or behavior, and are ORed together.

One 2-bit flag tells asn2gnbk to create HTML with web links, flatfile in XML
form, or flatfile in ASN.1 form. These settings are mutually exclusive. The
setup for creating HTML links is within SeqEntryToGnbk itself.

#define CREATE_HTML_FLATFILE 1
#define CREATE_XML_GBSEQ_FILE 2
#define CREATE_ASN_GBSEQ_FILE 3

Others control feature display behavior in contig style, whether it was
chosen explicitly or was selected automatically when a far segmented or far
delta record was processed in normal style.

#define SHOW_CONTIG_FEATURES 4
#define SHOW_CONTIG_SOURCES 8

Another set controls translation of CDS features with far products.

#define SHOW_FAR_TRANSLATION 16
#define TRANSLATE_IF_NO_PRODUCT 32
#define ALWAYS_TRANSLATE_CDS 64

Another 2-bit flag controls where to get features when using far segmented
parts or far component delta Bioseqs.

#define ONLY_NEAR_FEATURES 128
#define FAR_FEATURES_SUPPRESS 256
#define NEAR_FEATURES_SUPPRESS 384

Other flags allow customization of reports from genomic product sets.

#define COPY_GPS_CDS_UP 512
#define COPY_GPS_GENE_DOWN 1024

The CONTIG block can be shown along with the sequence block in master or
segment style, when appropriate.

#define SHOW_CONTIG_AND_SEQ 2048

Still others are expected to be rarely used, or are for testing new features.

#define DDBJ_VARIANT_FORMAT 4096
#define USE_OLD_SOURCE_ORG 8192

GBSeq XML has been replaced by INSDSeq XML. The CREATE_XML_GBSEQ_FILE flag
will actually produce INSDSeq. The original GBSeq can be generated during
the transition period by adding the following flag.

#define PRODUCE_OLD_GBSEQ 16384

LOCKS are bits for controlling program performance, and are also ORed together.

One flag set is for locking far segmented or delta components, far feature
location Bioseqs, or far feature product Bioseqs in advance. This prevents
the object manager from uncaching components at an inopportune time, which
would cause unnecessary thrashing. Far component Bioseqs are needed for
displaying the

#define LOCK_FAR_COMPONENTS 2
#define LOCK_FAR_LOCATIONS 4
#define LOCK_FAR_PRODUCTS 8

Another set attempts to do bulk accession-to-gi lookups in advance, which is
possible if PubSeqFetchEnable was called by the application. Remote fetching
in asn2gb uses this new access mechanism. Far component IDs are needed for
the CONTIG line, far location IDs for feature location joins, and far product
IDs for the /protein_id and /transcript_id accessions.

#define LOOKUP_FAR_COMPONENTS 16
#define LOOKUP_FAR_LOCATIONS 32
#define LOOKUP_FAR_PRODUCTS 64
#define LOOKUP_FAR_HISTORY 128

To use PubSeqFetchEnable, the application should #include <pmfapi.h>.

CUSTOM are bit flags suppressing specific features, and are also ORed
together.

One set suppresses all import features, or all that don't have specific
custom bits of their own.

#define HIDE_IMP_FEATS 1
#define HIDE_REM_IMP_FEATS 2

Another set suppresses common individual import feature types.

#define HIDE_SNP_FEATS 4
#define HIDE_EXON_FEATS 8
#define HIDE_INTRON_FEATS 16
#define HIDE_MISC_FEATS 32

Additional bits hide CDD regions, or all features on the CDS product.

#define HIDE_CDD_FEATS 64
#define HIDE_CDS_PROD_FEATS 128

mRNAs and peptide features can show /transcription or /peptide sequence.

#define SHOW_TRANCRIPTION 256
#define SHOW_PEPTIDE 512

GeneRIF references in RefSeq records can also be specifically hidden,
non-GeneRIF records can be hidden, or only the most recent 10 GeneRIFs can
be shown.

#define HIDE_GENE_RIFS 1024
#define ONLY_GENE_RIFS 2048
#define LATEST_GENE_RIFS 3072

Protein feature tables and References in feature tables can also be shown.

#define SHOW_PROT_FTABLE 4096
#define SHOW_FTABLE_REFS 8192

EXTRA is an opaque pointer used for preparing internal NCBI indices. Most
programs will pass NULL for this parameter.

ASN2GB STANDALONE APPLICATION

An asn2gb executable is now available on all platforms, and is packaged
with the Sequin archive. The most commonly used arguments are shown below.

-f Format (b GenBank, e EMBL, p GenPept, t Feature Table, x INSDSet)
-m Mode (r Release, e Entrez, s Sequin, d Dump)
-s Style (n Normal, s Segment, m Master, c Contig)
-g Bit Flags (1 HTML, 2 XML, 4 ContigFeats, 8 ContigSrcs, 16 FarTransl)
-h Lock/Lookup Flags (8 LockProd, 16 LookupComp, 64 LookupProd)
-u Custom Flags (2 HideMostImpFeats, 4 HideSnpFeats)

-a ASN.1 Type (a Any, e Seq-entry, b Bioseq, s Bioseq-set, m Seq-submit)

Batch processing of Bioseq-set ASN.1 release files is also supported.

-t Batch (1 Report, 2 Sequin/Release)
-b Bioseq-set is Binary [T/F]
-c Bioseq-set is Compressed [T/F]
-p Propagate Top Descriptors [T/F]

Remote fetching allows gi-to-accession lookups and fetching of far components.

-r Remote Fetching [T/F]