<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<title>Flat Databases HOWTO</title>
<firstname>Lincoln</firstname>
<surname>Stein</surname>
<ulink url="http://www.cshl.org">Cold Spring Harbor Laboratory</ulink>
<email>lstein-at-cshl.org</email>
<firstname>Brian</firstname>
<surname>Osborne</surname>
<para>Bioperl contributor</para>
<orgname><ulink url="http://www.cognia.com">Cognia Corporation</ulink></orgname>
<email>brian-at-cognia.com</email>
<firstname>Heikki</firstname>
<surname>Lehväslaiho</surname>
<para>Bioperl contributor</para>
<ulink url="http://www.ebi.ac.uk">European Bioinformatics Institute</ulink>
<email>heikki-at-ebi.co.uk</email>
<pubdate>2003-02-26</pubdate>
<revnumber>1.0</revnumber>
<date>2003-02-26</date>
<authorinitials>LS</authorinitials>
<revremark>First version</revremark>
<revnumber>1.1</revnumber>
<date>2003-10-17</date>
<authorinitials>BS</authorinitials>
<revremark>from text into howto</revremark>
<revnumber>1.2</revnumber>
<date>2003-10-17</date>
<authorinitials>HL</authorinitials>
<revremark>from txt reformatted into SGML</revremark>
<para>This document is copyright Lincoln Stein, 2002. For
reproduction other than personal use please contact lstein at cshl.org.</para>
The Open Biological Database Access (OBDA) standard
specifies a way of generating indexes for entry-based
sequence files (e.g. FASTA, EMBL) so that the entries can be
looked up and retrieved quickly. These indexes are created
and accessed using the <classname>Bio::DB::Flat</classname>
module.
<title>Creating OBDA-Compliant Indexed Sequence Files</title>
<classname>Bio::DB::Flat</classname> has the same functionality as
the various <classname>Bio::Index</classname> modules. The main
reasons to use it are to work with the BioSequence Registry
system (see the OBDA Access HOWTO at <ulink
url="http://bioperl.org/HOWTOs">http://bioperl.org/HOWTOs</ulink>),
or to share the same indexed files with scripts
written in other languages, such as those written with BioJava or
BioPython.
There are four steps to creating a
<classname>Bio::DB::Flat</classname> database:
<para>Select a Root Directory</para>
Select a directory in which the flat file indexes will be stored.
This directory should be writable by you, and readable by everyone
who will be running applications that access the sequence data.
<listitem><para>Move the Flat Files Into a Good Location</para>
The indexer records the path to the source files (e.g. FASTA, or
local copies of GenBank, EMBL or SwissProt). This means that you
must not change the location or name of the source files after
running the indexer. Pick a stable location for the source
files and move them there before indexing.
<listitem><para>Choose a Symbolic Name for the Database</para>
Choose a good symbolic name for the database. For example, if you
are mirroring GenBank, "genbank" might be a good choice. The
indexer will create files in a subdirectory by this name located
underneath the root directory.
<listitem><para>Run the bioflat_index.pl script to load the
sequence files into the database.</para>
The final step is to run the bioflat_index.PLS script. This
script is located in the BioPerl distribution, under scripts/DB.
For convenience, you are offered the option to copy it to
/usr/bin or another system-wide directory during 'make install', in
which case its name will be changed to bioflat_index.pl.
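The first three steps can be sketched in shell as follows. This is only a minimal sketch: the helper name prepare_flat_db_dirs and the example paths are assumptions made for illustration, not part of BioPerl. The symbolic name from step three needs no directory of its own here, because the indexer creates that subdirectory itself.

```shell
# Hypothetical helper: create the index root and a stable home for the
# source files, then move the source files there (steps 1 and 2).
prepare_flat_db_dirs() {
    root_dir=$1      # step 1: root directory for the indexes
    source_dir=$2    # step 2: stable location for the source flat files
    shift 2          # remaining arguments: source files to move

    mkdir -p "$root_dir" "$source_dir" || return 1
    # Writable by the owner, readable and searchable by everyone else.
    chmod 755 "$root_dir" "$source_dir" || return 1

    for f in "$@"; do
        mv "$f" "$source_dir/" || return 1
    done
}

# Example call (paths are illustrative only):
#   prepare_flat_db_dirs /usr/share/biodb /share/data data/*.fa
```

After this, the source files no longer move, so the paths recorded by the indexer in step four stay valid.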
<section id="options">
<title>Choosing Your Options</title>
The first time you run the script, the typical usage is as
follows:
bioflat_index.pl -c -l /usr/share/biodb -d genbank -i bdb -f fasta data/*.fa
The following command line options are required:
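<entry>Option</entry>
<entry>Description</entry>
<entry><parameter>-c</parameter></entry>
<entry>create a new index</entry>
<entry><parameter>-l</parameter></entry>
<entry>path to the root directory</entry>
<entry><parameter>-d</parameter></entry>
<entry>symbolic name for the new database</entry>
<entry><parameter>-i</parameter></entry>
<entry>indexing scheme (discussed below)</entry>
<entry><parameter>-f</parameter></entry>
<entry>source file format</entry>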
The <parameter>-c</parameter> option must be present to create the
database. If the database already exists,
<parameter>-c</parameter> will reinitialize the index, wiping out
its current contents.
The <parameter>-l</parameter> option specifies the root directory
for the database indexes.
The <parameter>-d</parameter> option chooses the symbolic name for
the new database. If the <parameter>-c</parameter> option is
specified, this will cause a new directory to be created
underneath the root directory.
The <parameter>-i</parameter> option selects the indexing scheme.
Currently there are two indexing schemes supported: "bdb" and
"flat". "bdb" selects an index based on the Berkeley DB library.
It is generally the faster of the two, but it requires that the
Berkeley DB library (from Linux RPM or from www.sleepycat.com,
version 2 or higher) and the Perl BerkeleyDB module be installed
on your system. The Perl DB_File module will not work.
"flat" is a sorted text-based index that uses a binary search algorithm to
rapidly search for entries. Although not as fast as bdb, the flat
indexing system has good performance even for large databases, and
it has no requirements beyond Perl itself. Once an indexing
scheme has been selected there is no way to change it other than
recreating the index from scratch using the
<parameter>-c</parameter> option.
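One way to pick between the two schemes is to test whether the Perl BerkeleyDB module loads and fall back to "flat" when it does not. The sketch below assumes a perl executable on the PATH; the function name choose_index_scheme is hypothetical and not part of BioPerl.

```shell
# Hypothetical helper: print "bdb" if the Perl BerkeleyDB module can be
# loaded, otherwise print "flat". A missing perl also yields "flat".
choose_index_scheme() {
    if perl -MBerkeleyDB -e 1 >/dev/null 2>&1; then
        echo bdb
    else
        echo flat
    fi
}

scheme=$(choose_index_scheme)
echo "indexing scheme: $scheme"
```

The printed value can then be passed to the <parameter>-i</parameter> option when the index is first created.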
The <parameter>-f</parameter> option specifies the format of the
source database files. It must be one of the many formats that BioPerl
supports, including "genbank", "swiss", "embl" or "fasta".
Consult the <classname>Bio::SeqIO</classname> documentation for
the complete list. All files placed in the index must share the
same format.
The indexing script will print out a progress message every 1000
entries, and will report the number of entries successfully
indexed at the end of its run.
To update an existing index, run bioflat_index.pl without the
<parameter>-c</parameter> option and list the files to be added or
reindexed. The <parameter>-l</parameter> and
<parameter>-d</parameter> options are required, but the indexing
scheme and source file format do not have to be specified for
updating as they will be read from the existing index.
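The difference between creating and updating can be illustrated with a small dry-run helper that only prints the command it would run. The helper flat_index_command and its "mode" argument are hypothetical, and the file name data/new.fa is an assumed example; the option letters are the ones described above, with "bdb" and "fasta" taken from the earlier usage example.

```shell
# Hypothetical dry-run helper: print the bioflat_index.pl command line
# for creating or updating an index. It does not run the indexer.
flat_index_command() {
    mode=$1; root_dir=$2; db_name=$3
    shift 3    # remaining arguments: files to index
    case $mode in
        # Creation needs -c plus the indexing scheme and file format.
        create) printf 'bioflat_index.pl -c -l %s -d %s -i bdb -f fasta' "$root_dir" "$db_name" ;;
        # Updating omits -c, -i and -f; only -l and -d are given.
        update) printf 'bioflat_index.pl -l %s -d %s' "$root_dir" "$db_name" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    for f in "$@"; do printf ' %s' "$f"; done
    printf '\n'
}

flat_index_command update /usr/share/biodb genbank data/new.fa
```

Leaving out <parameter>-c</parameter> is the important part: including it would reinitialize the existing index.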
For your convenience, bioflat_index.pl will also take default values
from the following environment variables:
<entry>ENV variable</entry>
<entry>Description</entry>
<entry>OBDA_FORMAT</entry>
<entry>format of sequence file (<parameter>-f</parameter>)</entry>
<entry>OBDA_LOCATION</entry>
<entry>path to directory in which index files are stored (<parameter>-l</parameter>)</entry>
<entry>OBDA_DBNAME</entry>
<entry>name of database (<parameter>-d</parameter>)</entry>
<entry>OBDA_INDEX</entry>
<entry>type of index to create (<parameter>-i</parameter>)</entry>
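As a rough sketch, the four variables could be exported once in a shell profile so that later invocations need fewer options. The values below are taken from the earlier usage example, and the wrapper function index_with_env_defaults is hypothetical; it assumes that every option with an exported default may be left off the command line.

```shell
# Defaults read by bioflat_index.pl when the matching option is omitted.
export OBDA_LOCATION=/usr/share/biodb   # same role as -l
export OBDA_DBNAME=genbank              # same role as -d
export OBDA_INDEX=bdb                   # same role as -i
export OBDA_FORMAT=fasta                # same role as -f

# Hypothetical wrapper: create the index using only the exported defaults.
# Assumption: with all four defaults set, only -c and the files remain.
index_with_env_defaults() {
    if command -v bioflat_index.pl >/dev/null 2>&1; then
        bioflat_index.pl -c "$@"
    else
        echo "bioflat_index.pl not found on PATH" >&2
        return 1
    fi
}

# Example call (file pattern is illustrative only):
#   index_with_env_defaults data/*.fa
```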
<title>Moving Database Files</title>
If you must change the location of the source sequence files after
you create the index, there is a way to do so. Inside the root
directory you will find a subdirectory named after the database,
and inside that you will find a text file named "config.dat". An
example config.dat is shown here:
fileid_0 /share/data/alnfile.fasta 294
fileid_1 /share/data/genomic-seq.fasta 171524
fileid_2 /share/data/hs_owlmonkey.fasta 416
fileid_3 /share/data/test.fasta 804
fileid_4 /share/data/testaln.fasta 4620
primary_namespace ACC
secondary_namespaces ID
format URN:LSID:open-bio.org:fasta
For each source file you have moved, find its corresponding
"fileid" line and change the path. Be careful not to change
anything else in the file or to inadvertently replace tab
characters with spaces.
<title>More information</title>
For more information on using your indexed flat files please see the
<ulink url="http://bioperl.org/HOWTOs/html/OBDA_Access.html">
OBDA Access HOWTO</ulink>.