Last updated on July 21, 2003

This document describes the "BLAST" databases available on the NCBI
FTP site under the "blast/db" subdirectory. The direct URL to this
directory is:

ftp://ftp.ncbi.nih.gov/blast/db

1. General Introduction

NCBI BLAST home pages (http://www.ncbi.nih.gov/BLAST/) use a standard
set of BLAST databases for Nucleotide, Protein, and Translated BLAST
searches. These databases are made available in the db directory
(ftp://ftp.ncbi.nih.gov/blast/db/) as compressed archives in
preformatted format. The FASTA databases now reside under the
blast/db/FASTA subdirectory.

The preformatted databases offer the following advantages:

* The preformatted databases are smaller in size and are
  therefore faster to download;
* Preformatting removes the need to run formatdb;
* Taxonomy information is available for each database entry.

Preformatted databases must be downloaded in binary mode and
decompressed with gzip or another decompression tool. The BLAST
database files can then be extracted from the resulting tar file using
the "tar" program on Unix/Linux, or WinZip and StuffIt Expander on
Windows and Macintosh platforms, respectively.
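
Where a scripted alternative to gzip and tar is preferred, the two
steps can be combined. The following is a minimal sketch using the
Python standard library; the function name extract_archive is
hypothetical and not part of any NCBI tool, and the archive is assumed
to have been downloaded in binary mode already.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Decompress and unpack a .tar.gz archive; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" removes the gzip layer and reads the tar layer in one pass.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return sorted(names)
```

For a multi-volume database, the function would be called once per
volume archive with the same destination directory.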

Large databases are formatted in multiple 1-gigabyte volumes, which
are named using the "database.##.tar.gz" convention. All relevant
volumes are required. An alias file is provided so that the database
can be called using the alias name without the extension (.nal or
.pal). For example, to call the est database, simply use the "-d est"
option in the command line.
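
Because every volume is required, it can help to check a download
directory against the expected archive names. The sketch below only
illustrates the "database.##.tar.gz" naming convention; the function
names are hypothetical, and the number of volumes is an assumption
that the caller must supply (for example, from the directory listing).

```python
def volume_archives(database, n_volumes):
    """Return archive names following the database.##.tar.gz convention."""
    if n_volumes < 1:
        raise ValueError("n_volumes must be at least 1")
    # Volumes are numbered from 00 with two digits.
    return [f"{database}.{i:02d}.tar.gz" for i in range(n_volumes)]


def missing_volumes(database, n_volumes, present_files):
    """Return the expected volume archives that are not in present_files."""
    present = set(present_files)
    return [name for name in volume_archives(database, n_volumes)
            if name not in present]
```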

Certain databases are subsets of a larger parent database. For those
databases, mask files, rather than actual databases, are provided. A
mask file needs the parent database to function properly, and the
parent database should be generated on the same day as the mask file.
For example, to use the swissprot preformatted database,
swissprot.tar.gz, one will need to get the nr.tar.gz with the same
date stamp.
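
A simple local check for matching date stamps might look like the
sketch below. It assumes that the download client preserved the
modification times from the FTP server, which is not guaranteed; the
function name same_date_stamp is hypothetical.

```python
import os
from datetime import date


def same_date_stamp(mask_archive, parent_archive):
    """True when both files were last modified on the same calendar day."""
    # Assumption: local modification times mirror the server date stamps.
    mask_day = date.fromtimestamp(os.path.getmtime(mask_archive))
    parent_day = date.fromtimestamp(os.path.getmtime(parent_archive))
    return mask_day == parent_day
```

If the times were not preserved, comparing the dates shown in the FTP
directory listing by hand is the safer option.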

Additional BLAST databases that are not provided in preformatted
format are available in the FASTA subdirectory. For genomic BLAST
databases, please check the genomes ftp directory at:

ftp://ftp.ncbi.nih.gov/genomes/


2. Contents of the /blast/db/ directory

The formatted databases are archived in this directory. The names of
these databases and their contents are listed below.

+--------------------+-----------------------------------------------+
|File Name           | Content Description                           |
+--------------------+-----------------------------------------------+
/FASTA                 subdirectory for FASTA formatted sequences

README                 README for this subdirectory (this file)

est.00.tar.gz        | three volumes of the formatted est database
est.01.tar.gz        | from the EST division of GenBank, EMBL,
est.02.tar.gz        | and DDBJ

est_human.tar.gz     | mask file for human subset of the est
est_mouse.tar.gz     | mask file for mouse subset of the est
est_others.tar.gz    | mask file for non-human and non-mouse subset
                     | of the est
                     | These three mask files need all volumes of
                     | est to function properly.

gss.00.tar.gz        | two volumes of the formatted gss database
gss.01.tar.gz        | from the GSS division of GenBank, EMBL, and
                     | DDBJ

htgs.00.tar.gz       | three volumes of htgs database with entries
htgs.01.tar.gz       | from HTG division of GenBank, EMBL, and DDBJ

human_genomic.tar.gz   human RefSeq (NC_######) chromosome records
                       with gap adjusted concatenated NT_ contigs

nr.tar.gz              non-redundant protein sequence database with
                       entries from GenPept, Swissprot, PIR, PRF, PDB,

nt.00.tar.gz         | nucleotide sequence database, with entries
nt.01.tar.gz         | from all traditional divisions of GenBank,
nt.02.tar.gz         | EMBL, and DDBJ excluding bulk divisions (gss,
                     | sts, pat, est, and htg divisions). wgs entries
                     | are also excluded. Not non-redundant.

other_genomic.tar.gz   RefSeq chromosome records (NC_######) for
                       organisms other than human

pataa.tar.gz         | patent protein sequence database
patnt.tar.gz         | patent nucleotide sequence database
                     | The above two databases are directly from
                     | USPTO or from EU/Japan Patent Agencies via
                     | EMBL/DDBJ

pdbaa.tar.gz           protein sequences from pdb protein structures
pdbnt.tar.gz           nucleotide sequences from pdb nucleic acid
                       structures. They are NOT the protein coding
                       sequences for the corresponding pdbaa entries.

sts.tar.gz             sequences from the STS division of GenBank,
                       EMBL, and DDBJ

swissprot.tar.gz       swiss-prot sequence databases (last major update)

taxdb.tar.gz           taxonomy information for the formatted database

wgs.00.tar.gz        | whole genome shotgun sequence assemblies for
wgs.01.tar.gz        | different organisms, broken up into 1 GB
wgs.02.tar.gz        | volumes.
+--------------------+-----------------------------------------------+


3. Contents of the /blast/db/FASTA subdirectory

This subdirectory contains FASTA formatted sequence files, formerly
available under the /db directory. The file names and database
contents are listed below. These files are now archived in .gz format
and must be processed through formatdb before they can be used by the
BLAST programs.

+--------------------+-----------------------------------------------+
|File Name           | Content Description                           |
+--------------------+-----------------------------------------------+
alu.a.gz               translation of alu.n repeats
alu.n.gz               alu repeat elements

drosoph.aa.gz          CDS translations from drosoph.nt
drosoph.nt.gz          genomic sequences for drosophila

ecoli.aa.gz            CDS translations from ecoli.nt
ecoli.nt.gz            Escherichia coli K-12 genomic sequences

est_human.gz*        | human subset of the est database (see Note 1)
est_mouse.gz*        | mouse subset of the est database
est_others.gz*       | non-human and non-mouse subset of the est
                     | database

gss.gz*                sequences from the GSS division of GenBank,
                       EMBL, and DDBJ

htg.gz*                htgs database with high throughput genomic
                       entries from the htg division of GenBank,
                       EMBL, and DDBJ

human_genomic.gz*      human RefSeq (NC_######) chromosome records
                       with gap adjusted concatenated NT_ contigs

igSeqNt.gz             human and mouse immunoglobulin nucleotide
                       sequences

igSeqProt.gz           human and mouse immunoglobulin protein
                       sequences

mito.aa.gz             CDS translations of complete mitochondrial
                       genomes

mito.nt.gz             complete mitochondrial genomes

month.aa.gz          | newly released/updated protein sequences
month.est_human.gz   | newly released/updated human est sequences
month.est_mouse.gz   | newly released/updated mouse est sequences
month.est_others.gz  | newly released/updated est other than
                     | human or mouse
month.gss.gz         | newly released/updated gss sequences
month.htgs.gz        | newly released/updated htgs sequences
month.nt.gz          | newly released/updated sequences for the nt
                     | database

nr.gz*                 non-redundant protein sequence database with
                       entries from GenPept, Swissprot, PIR, PRF,

nt.gz*                 nucleotide sequence database, with entries
                       from all traditional divisions of GenBank,
                       EMBL, and DDBJ excluding bulk divisions
                       (gss, sts, pat, est, htg divisions) and wgs
                       entries. Not non-redundant.

other_genomic.gz*      RefSeq chromosome records (NC_######) for
                       organisms other than human

pataa.gz*            | patent protein sequence database
patnt.gz*            | patent nucleotide sequence database
                     | The above two dbs are directly from USPTO
                     | or from EU/Japan Patent Agency via EMBL/DDBJ

pdbaa.gz*              protein sequences from pdb protein structures
pdbnt.gz*              nucleotide sequences from pdb nucleic acid
                       structures. They are NOT the protein coding
                       sequences for the corresponding pdbaa entries.

sts.gz*                database for sequence tagged site entries

swissprot.gz*          swiss-prot database (last major release)

vector.gz              vector sequence database (see Note 3)

wgs.gz*                whole genome shotgun genome assemblies

yeast.aa.gz            protein translations from yeast genome
yeast.nt.gz            yeast genomes
+--------------------+-----------------------------------------------+

(1) We do not provide the complete est database in FASTA format. One
    needs to get all three subsets (est_human, est_mouse, and
    est_others) and concatenate them into the complete est FASTA
    database.
(2) month.### databases are the sequences newly released or updated
    within the last 30 days for that database.
(3) For vector contamination screening, use the UniVec database from:
    ftp://ftp.ncbi.nih.gov/pub/UniVec/
 *  Marked files have preformatted counterparts.
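
The concatenation described in Note 1 can be done with standard tools.
The following is a minimal Python sketch, assuming that the three .gz
files have already been downloaded and that each ends with a newline;
the function name concatenate_fasta and the output file name are
hypothetical.

```python
import gzip
import shutil


def concatenate_fasta(gz_parts, output_path):
    """Decompress each .gz part in order and append it to output_path."""
    with open(output_path, "wb") as out:
        for part in gz_parts:
            with gzip.open(part, "rb") as src:
                # Stream in chunks so large files never sit fully in memory.
                shutil.copyfileobj(src, out)
    return output_path
```

A call such as
concatenate_fasta(["est_human.gz", "est_mouse.gz", "est_others.gz"], "est")
would produce one uncompressed FASTA file, which still has to be
processed through formatdb before use.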


4. Database updates

The BLAST databases are updated daily. Updating existing databases by
merging new records from the month database using fmerge is no longer
supported. We do not have an established incremental update scheme at
this time. We recommend downloading the databases regularly to keep
their content current.


5. Non-redundant defline syntax

The only non-redundant database is the protein nr. In it, identical
sequences are merged into one entry. To be merged, two sequences must
have identical lengths and every residue at every position must be the
same. The FASTA deflines for the different entries that belong to one
nr record are separated by control-A characters, which are invisible
to most programs. In the example below, the entries gi|3023276,
gi|1469284, and gi|1477453 have the same sequence in every respect:

>gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC
^Agi|1469284|gb|AAB05030.1| afuC gene product ^Agi|1477453|gb|AAB17216.1|
afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
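
Splitting such a merged defline back into its individual entries is
straightforward. The sketch below assumes that the defline has been
read as a single line and that "^A" in the example stands for the
control-A byte (0x01); the function name split_nr_defline is
hypothetical.

```python
CTRL_A = "\x01"  # the control-A separator, shown as ^A in the example


def split_nr_defline(defline):
    """Split a merged nr defline into (identifier, description) pairs."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        part = part.strip()
        if not part:
            continue
        # The identifier is everything up to the first space.
        identifier, _, description = part.partition(" ")
        entries.append((identifier, description.strip()))
    return entries
```

Applied to the defline above, this yields three pairs, one for each of
the merged entries.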

The syntax of sequence header lines used by the NCBI BLAST server
depends on the database from which each sequence was obtained. The
table below lists the identifiers for the databases from which the
sequences were derived.

Database Name                     Identifier Syntax
============================      ========================
GenBank                           gb|accession|locus
EMBL Data Library                 emb|accession|locus
DDBJ, DNA Database of Japan       dbj|accession|locus
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|entry name
Brookhaven Protein Data Bank      pdb|entry|chain
Patents                           pat|country|number
GenInfo Backbone Id               bbs|number
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier

"gi" identifiers are assigned by NCBI to all sequences contained
within NCBI's sequence databases. The "gi" identifier provides a
uniform and stable naming convention whereby each specific sequence is
assigned its own unique gi identifier. If a nucleotide or protein
sequence changes, however, a new gi identifier is assigned, even if
the accession number of the record remains unchanged. Thus gi
identifiers provide a mechanism for identifying the exact sequence
that was used or retrieved in a given search.
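
To record the gi number alongside the database-specific fields, an
identifier can be split on the "|" character. This is a minimal sketch
based only on the syntax table above; the function name
parse_identifier and the shape of the returned dictionary are
assumptions made for illustration.

```python
def parse_identifier(identifier):
    """Parse 'gi|number|tag|field|field' into a small dictionary."""
    fields = identifier.split("|")
    result = {"gi": None, "tag": None, "fields": []}
    # A leading "gi|number" pair is optional (e.g. lcl| or gnl| entries).
    if fields and fields[0] == "gi" and len(fields) > 1:
        result["gi"] = int(fields[1])
        fields = fields[2:]
    if fields:
        result["tag"] = fields[0]       # e.g. gb, emb, dbj, sp, ref
        result["fields"] = fields[1:]   # e.g. accession and locus
    return result
```

Empty fields, such as the missing locus in gb|AAB05030.1|, are kept as
empty strings so that field positions stay aligned with the table.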

We recommend activating the "gi display option" in local blast
searches by setting the -I option to T (it is set to F by default):

  -I  Show GI's in deflines [T/F]

For databases whose entries are not from official NCBI sequence
databases, such as the Trace database, the gnl| convention is used.
Custom databases should follow this convention, and the id for each
sequence must be unique, in order to take advantage of database
indexing, which enables retrieval of specific sequences using the
fastacmd program included in the blast executable package. Refer to
the documents distributed in the standalone BLAST package for more
details.
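
Since duplicate ids defeat indexed retrieval, a custom FASTA file can
be checked before formatting. The sketch below is one possible check,
assuming that the id is the first whitespace-delimited token on each
defline; the function name duplicate_ids and the database name "mydb"
in the usage note are hypothetical.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sequence ids that occur on more than one defline."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # The id is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            if tokens:
                counts[tokens[0]] += 1
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

For example, a file in which gnl|mydb|seq1 heads two different records
would be reported as having that one duplicate id.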


6. Formatting the FASTA database

FASTA database files need to be formatted with formatdb before they
can be used in local blast searches. For those from NCBI, the
following formatdb commands are recommended:

  formatdb -i input_db -p F -o T    for nucleotide
  formatdb -i input_db -p T -o T    for protein
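
When formatting several files from a script, the two command lines can
be generated rather than typed. The sketch below builds the argument
list only and uses just the -i, -p, and -o options shown above; the
function name formatdb_command is hypothetical, and running the result
assumes that formatdb is installed and on the search path.

```python
def formatdb_command(input_db, is_protein):
    """Build the argument list for the formatdb command lines above."""
    return [
        "formatdb",
        "-i", input_db,                    # input FASTA file
        "-p", "T" if is_protein else "F",  # T for protein, F for nucleotide
        "-o", "T",                         # option value used in both commands
    ]
```

The returned list can be passed to subprocess.run(..., check=True) from
the Python standard library to perform the formatting.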

The -A option introduced in 2.2.3 is now built into the formatdb
program and has therefore been removed from the list of configurable
options since 2.2.8. This enables formatdb to properly handle large
sequence files (longer than 16 million bases). Please refer to
formatdb.txt under the /blast/documents directory for more
information. Databases prepared using 2.2.8 formatdb will not be
backward compatible with blast programs older than version 2.2.3.


Questions and comments on this document, and questions related to
NCBI BLAST, should be sent to the blast-help group at:

blast-help@ncbi.nlm.nih.gov

For information about other NCBI resources/services, please send email
to NCBI User Service at:

info@ncbi.nlm.nih.gov